Fault-Tolerant Systems (CS449/549)
Welcome to Fault-Tolerant Systems CS449/549.
This course is offered in the Spring Semester 2007 at the
University of Idaho in Moscow.
The course is taught by
Dr. Axel Krings.
This web page
contains information about the course, e.g. the syllabus, class notes, and pointers
to interesting places.
Material can be downloaded in PDF (or PostScript) format and will be made
available in updated form as the class goes on.
To get an idea of what this class is about, take a look at
last time's page.
However, materials and topics constantly change, and this class will
be no exception.
If you have comments, please let me know.
Course description: this course addresses design, modelling, analysis, and
integration of hardware and software to achieve dependable computing
systems employing on-line fault-tolerance.
It covers the concepts and terminologies of
Fault-Tolerant System Design including: Reliability, Dependability,
Maintainability, Redundancy, Error Detection, Damage Confinement,
Error Recovery, Fault Treatment, Redundancy Management, Voting,
Information Redundancy, Random Variables, cdf, pdf, Expectation,
Bathtub Curve, MTTF, Reliability of Series/Parallel Systems,
Stand-by Redundancy, M-of-N System, Reliability Block Diagrams, Fault Trees,
Markov Process, Petri Nets, General Stochastic Petri Nets,
Recovery Strategies, Roll-back Recovery, Agreement and Consensus,
Byzantine Clock Synchronisation, RAID, Fail-Stop Processes,
Systems Diagnosis, Case studies.
I always change the material slightly to account for interesting changes
in the field.
Note: This class has a prerequisite of
Computer Organisation and Architecture (CS245) or permission of
the instructor.
In a 400/500-level computer science class,
I expect a working knowledge of Unix and MS operating systems.
- Contact information:
- Axel Krings (PhD), JEB 320,
- Phone: 208-885-4078, fax: 208-885-9052.
- Engineering outreach students: dial toll free 800-824-2889 ext 4078
- Mailing address: Engineering Outreach, PO Box 441014,
Moscow, Idaho 83844-1014.
- Email: krings@cs.uidaho.edu (see comments in syllabus on email procedures)
- Office Hours:
(see here)
- Class times: TR 2:00-3:15 room JEB 005.
Class Handouts:
- The handouts are ordered by sequence numbers which used to be the lecture numbers.
The dates indicate the date the handout was posted.
With class changes from MWF to TR etc. there is no longer a 1-to-1 correspondence
and the sequence numbers are simply an ordering mechanism.
- If there are any problems with accessing the handouts,
please let me know (email, phone, smoke signals, drums, ...)!
- Corrections: some slides may contain formatting errors, typos etc.
which have been addressed in class, but have not been reflected
in the notes posted here.
- WARNING LOCAL STUDENTS:
Do not send pdf files (i.e. files in pdf format) to the printer!
Pdf files are binary files and printing them "directly" will
result in a big printer mess!!!
Use acroread to view and print the file!!!
- Syllabus.
- Lecture Notes
- Sequence 1 (01/11/07):
(pdf)
Introduction to fault-tolerance and safety-critical systems.
Top challenges facing the practice of fault tolerance.
Reading Assignment 1).
- Sequence 2 (01/16/07):
(pdf)
Definitions, Dependability...Maintainability, Fault-Error-Failure,
Reading Assignment 2).
- Sequence 3 (01/18/07):
(pdf)
Redundancy, Error Detection, Damage Confinement, Error Recovery, Fault
Treatment, Passive HW Redundancy, Voting.
- Sequence 4 (01/18/07):
(pdf)
Information Redundancy, Parity, Checksum, CRC.
- Sequence 5 (01/22/07):
(pdf)
Random Variables, cdf, pdf, Expectation, Reliability, Bathtub Curve,
MTTF, Reliability Block Diagram, Reliability of Series System, Reliability of Parallel System.
Reading Assignment 3).
- Sequence 6 (01/25/07):
(pdf)
Stand-by Redundancy, M-of-N System, Reliability Block Diagram,
Reliability analysis example Bus-Guardian
- Sequence 7 (01/30/07):
(pdf)
Fault Trees, Example Bus-Guardian, intro to SHARPE.
SHARPE manual: help save paper and try to avoid printing the
manuals. You can use them on-line when you need them.
(sharpe quick starter)
- Sequence 8 (01/31/07):
(pdf)
Markov Process
- Sequence 9 (02/01/07):
(pdf)
Steady State and Transient Solution
- Sequence 10 (02/02/07):
(pdf)
Markov Models of Typical Systems
- Sequence 11 (02/08/07):
(pdf)
Petri Nets, Reading Assignment 4)
- Sequence 12 (02/20/07):
(pdf)
Petri Nets,
General Stochastic Petri Nets (GSPN)
- Sequence 13 (02/26/07):
(pdf)
Distributed Systems, Ordering-Synchronising, 500-level project posted.
- Sequence 14 (03/01/07):
(pdf)
Reliable Broadcast,
Atomic and Causal Broadcast, Reading assignment 5 and 6.
- EXAM I (03/08/07): Bring a calculator (just in case)!
- Sequence 15 (03/06/07):
(pdf)
Intro. Fault-tolerant Agreement, Oral messages.
Reading assignment 7.
- Sequence 16 (03/06/07):
(pdf)
Fault-tolerant Agreement, signed messages.
- Sequence 17 (03/19/07):
(pdf)
Agreement cont. Davis and Wakerly approach.
- Sequence 18 (03/22/07):
(pdf)
Fault models, Reading assignment 8:
Note the comment in class about the algorithm and the paper by John Rushby.
- Sequence 19 (03/27/07):
(pdf)
Clock Synchronization. Reading assignment 9.
- Sequence 20 (04/02/07):
(pdf)
Reading a remote clock.
Reading assignment 10.
- Sequence 21 (04/04/07):
(pdf)
Recovery Strategies, checkpointing,
Reading Assignment 11
- Sequence 22 (04/04/07):
(pdf)
RAID, Reading Assignment 12/13
- Sequence 23 (04/04/07):
(pdf)
Fail-Stop Processes,
Reading Assignment 14
- EXAM II (04/24/07): Study the old exams as part of your preparation.
- Sequence 24 (04/04/07):
(pdf)
Diagnosability, Reading assignment 15 and 16
- Sequence 25 (04/04/07):
(pdf)
Fault-tolerant Architectures
- Sequence 26 (04/04/07):
(pdf)
Space Shuttle, Reading Assignment 17
- Sequence 27 (04/04/07):
(pdf)
Boeing 777, Reading Assignment 18
- Sequence 28 (04/04/07):
(pdf)
Boeing 777 ADIRU, Reading Assignment 19
- Sequence 29 (04/04/07):
(pdf)
SIFT
- Sequence 30 (04/04/07):
(pdf)
Tandem, NonStop System Cyclone, Himalaya
- Sequence 31 (04/04/07):
(pdf)
MAFT
- FINAL exam (Friday, 05/11/07: open book/notes, starting at 12:30 pm)
Reading Assignments (so far):
- You need to locate the paper, unless I specify "(copy)", in which case I will
supply a hardcopy to the UI Copy Center in the Commons.
- 1) William R. Dunn, "Designing Safety-Critical Computer Systems"*
- 2) V. Nelson, "Fault-Tolerant Computing: Fundamental Concepts" *
- 3) SHARPE documentation *
- 4) Petri Nets *
- 5) Broadcast Protocols, *
- 6) Birman and Joseph paper,
*
- 7) L. Lamport, R. Shostak, and M. Pease,
The Byzantine Generals Problem, 1982.
- 8) Thambidurai, P., and You-Keun Park, "Interactive Consistency with Multiple Failure Modes".
There is an interesting followup paper "Verification of Hybrid Byzantine Agreement Under Link
Faults" by P. Lincoln and J. Rushby that addresses a problem in the algorithm of
Thambidurai and Park.
*
- 9) Fred Schneider, "Understanding Protocols for Byzantine Clock Synchronization"
*
- 10) Flaviu Cristian, "A Probabilistic Approach to Distributed Clock Synchronization"
*
- 11) Mootaz Elnozahy et al., "A survey of rollback-recovery protocols in message-passing systems"
*
- 12) A Case for Redundant Arrays of Inexpensive Disks (RAID), by D.A. Patterson, (google on the web).
- 13) RAID: High-Performance, Reliable Secondary Storage, Peter M. Chen, Edward Lee, Garth
Gibson, Randy Katz, and David Patterson, ACM Computing Surveys, 1994. (CiteSeer)
- 14) Byzantine Generals in Action: Implementing Fail-Stop Processors, Fred B. Schneider,
ACM Transactions on Computer Systems, Vol. 2, No. 2, pp. 145-154, May 1984. (google on the web)
- 15) On the Connection Assignment Problem of Diagnosable Systems,
by F. Preparata, G. Metze and R. Chien.
- 16) Implementation of On-Line Distributed System-Level Diagnosis Theory,
by Ronald Bianchini and Richard Buskens, Trans. Computers, Vol. 41, No. 5, May 1992.
*
- 17) (copy) Redundancy Management Technique for Space Shuttle Computers, by J. R. Sklaroff,
IBM Journal of Research and Development, Vol. 20, No. 1, pp. 20-28, January 1976.
*
- 18) Triple-Triple Redundant 777 Primary Flight Computer, Y.C. Yeh, 1996 IEEE Aerospace
Applications Conference, pp. 293-307, 1996.
*
- 19) A Fault-Tolerant Air Data/Inertial Reference Unit, Michael L. Sheffels, IEEE AES Systems
Magazine, March 1993. (google, or find the paper free with IEEE Xplore)
Homeworks/Exams:
- Expectations: Homeworks are expected to look professional!
They do not have to be typed in order to look good,
but be aware that I will not accept scribbles etc.
Use a new page for each problem and staple the final submission.
The submission should have the problem sheet as a cover.
- HW1 (pdf) (handed out 02/02/07, due 02/08/07)
- HW2 (pdf) (handed out 02/23/07, due 03/01/07)
- 500-level project
(pdf)
due 05/08/2007.
- HW3 (pdf) (handed out 04/01/07, due 04/10/07)
- HW4 (pdf) (handed out 04/18/07, due tbd)
Note: also use the homework as an exam preparation mechanism.
Old Exams:
Interesting Links
A special thanks to Dr. Roger Kieckhafer (MTU) for the contributions
to the material used in this class.