Fault-Tolerant Systems (CS449/549)
Welcome to Fault-Tolerant Systems CS449/549,
which is offered in the Fall Semester 2011 at the
University of Idaho in Moscow and is also available through
Engineering Outreach
for off-campus students.
This web page
contains information about the course, e.g., the syllabus, class notes, and pointers
to interesting places.
Material can be downloaded in PDF (or PostScript) format and will be made
available in updated form as the class goes on.
To get an idea of what this class is about, take a look at
last time's page.
However, materials and topics constantly change, and this class will
be no exception.
If you have comments, please let me know.
Engineering Outreach students,
there are several things you should know.
First of all, if you are trying to contact me, you can call
800-824-2889 ext. 4078 (toll free).
Please download the class material from the web page.
This speeds up the distribution process and avoids shipping delays.
There are several assignments that require access to local simulation tools.
Engineering Outreach students need to have web access with ssh capability
in order to use this software. Accounts on local workstations will be made
available. I will talk more about this when the time comes.
Course description: this course addresses design, modelling, analysis, and
integration of hardware and software to achieve dependable computing
systems employing on-line fault-tolerance.
It covers the concepts and terminologies of
Fault-Tolerant System Design including: Reliability, Dependability,
Maintainability, Redundancy, Error Detection, Damage Confinement,
Error Recovery, Fault Treatment, Redundancy Management, Voting,
Information Redundancy, Random Variables, cdf, pdf, Expectation,
Bathtub Curve, MTTF, Reliability of Series/Parallel Systems,
Stand-by Redundancy, M-of-N System, Reliability Block Diagrams, Fault Trees,
Markov Process, Petri Nets, General Stochastic Petri Nets,
Recovery Strategies, Roll-back Recovery, Agreement and Consensus,
Byzantine Clock Synchronisation, RAID, Fail-Stop Processes,
Systems Diagnosis, Case studies.
I always change the material slightly to account for interesting changes
in the field.
Note: This class has a prerequisite of
Computer Operating Systems (CS240) or permission of
the instructor.
In a 400/500-level computer science class,
I expect a working knowledge of Unix and MS operating systems.
- Contact information:
- Axel Krings (PhD), JEB 320,
- Phone: 208-885-4078, fax: 208-885-9052.
- Office Hours:
(see here)
(coming up soon)
- Class times: MWF 9:30-10:20 room JEB 026.
Class Handouts:
- The handouts are ordered by sequence number, and the material covered in each lecture is indicated next to the date.
- If there are any problems with accessing the handouts,
please let me know (email, phone, smoke signals, drums, ...)!
- Corrections: some slides may contain formatting errors, typos, etc.
that have been addressed in class but are not yet reflected
in the notes posted here.
- Syllabus.
- Lecture Notes
- Lecture 1 (08/22/11): [1/01-1/03] Sequence 1 (pdf):
Introduction to the class, syllabus, expectations, literature, etc.
Introduction to fault-tolerance and safety-critical systems.
Top challenges facing the practice of fault-tolerance
[Reading Assignment 1].
- Lecture 2 (08/24/11): [1/04-1/14] Sequence 2 (pdf):
Definitions, Dependability...Maintainability, Fault-Error-Failure
[Reading Assignment 2 and 3].
- Lecture 3 (08/26/11): [1/14-1/32] no new handout
- Lecture 4 (08/29/11): [2/01-2/12] Sequence 3 (pdf):
Redundancy, Error Detection, Damage Confinement, Error Recovery, Fault
Treatment, Passive HW Redundancy, Voting.
- Lecture 5 (08/31/11): [2/13-3/13] Sequence 4 (pdf):
Information Redundancy, Parity, Checksum, CRC.
- Lecture 6 (09/02/11): [3/14-3/25] Sequence 5 (pdf):
Random Variables, cdf, pdf, Expectation, Reliability, Bathtub Curve,
MTTF, Reliability Block Diagram, Reliability of Series System, Reliability of Parallel System.
- Lecture 7 (09/07/11): [4/01-4/12] no new handout
- Lecture 8 (09/09/11): [4/13-5/04] Sequence 6 (pdf):
Stand-by Redundancy, M-of-N System, Reliability Block Diagram,
Reliability analysis example Bus-Guardian.
- Lecture 9 (09/12/11): [5/05-5/21] Sequence 7 (pdf):
Fault Trees, Example Bus-Guardian, example (using SHARPE).
- Lecture 10 (09/14/11): [6/01-6/12] Sequence 8 (pdf):
Markov Process.
- Lecture 11 (09/16/11): [6/13-7/07] Sequence 9 (pdf):
Steady State and Transient Solution.
- Lecture 12 (09/19/11): [7/08-8/08] No new handout. [Reading Assignment 4]
- Lecture 13 (09/21/11): [8/09-9/03] Sequence 10 (pdf):
Markov Models of Typical Systems.
- Lecture 14 (09/23/11): [9/04-10/02] no new handout
- (09/26/11): no class, use this time to get familiar with SHARPE, which you need for HW2 [posted - get an early start]
- Lecture 15 (09/28/11): [10/03-10/08] Sequence 11 (pdf):
Petri Nets.
- Lecture 16 (09/30/11): [10/09-11/06] No new handout
- Lecture 17 (10/03/11): [11/07-11/18] Sequence 12 (pdf):
(Trivedi Slides)
Petri Nets, General Stochastic Petri Nets (GSPN).
- Lecture 18 (10/05/11): [12/01-12/12]
Modeling with Petri Nets, examples, cont.
- Lecture 19 (10/07/11): [12/13-12/28] Sequence 13 (pdf):
Distributed Systems, Ordering, Synchronising.
- Lecture 20 (10/10/11): [13/01-13/08] no new handout
- EXAM 1 (10/12/11)
- Lecture 21 (10/14/11): [13/09-14/08] Sequence 14 (pdf):
Reliable Broadcast,
Atomic and Causal Broadcast. [Reading Assignment 5 and 6]
- Lecture 22 (10/17/11): [14/09-14/09]
549 project discussion (see posting).
[Reading Assignment 7]
- Lecture 23 (10/19/11): [14/09-15/04] Sequence 15 (pdf):
Intro. Fault-tolerant Agreement, Oral messages.
- Lecture 24 (10/21/11): [15/05-15/15] Sequence 16 (pdf):
Fault-tolerant Agreement, signed messages. [Reading Assignment 8]
- Lecture 25 (10/24/11): [15/16-16/04] Agreement cont. (catching up)
- Lecture 26 (10/26/11): [16/05-17/14] Sequence 17 (pdf):
Hardware assisted agreement (e.g., Davies and Wakerly approach).
- Lecture 27 (10/28/11): [18/01-18/10] Sequence 18 (pdf):
Fault models. [Reading Assignment 8]
Note the comments in class about the algorithm and the paper by John Rushby.
- Lecture 28 (10/31/11): [18/11-19/03] Sequence 19 (pdf):
Clock Synchronization. [Reading Assignment 9]
- Lecture 29 (11/02/11): [19/04-19/12] clock synchronization cont.
- Lecture 30 (11/04/11): [19/13-19/21] Sequence 20 (pdf):
Reading a remote clock.
[Reading Assignment 10]
- Lecture 31 (11/07/11): [19/22-19/31]
clock synchronization cont., catching up
- Lecture 32 (11/09/11): [20/01-20/04]
Discussion: Approximate Agreement -- MSR algorithms
- Lecture 33 (11/11/11 -- Helau - Alaaf): [20/05-21/02] Sequence 21 (pdf):
Recovery Strategies, checkpointing.
[Reading Assignment 11a, 11b]
- Lecture 34 (11/14/11): [21/03-21/15] Sequence 22 (pdf):
RAID. [Reading Assignment 12/13]
- Lecture 35 (11/16/11): [21/16-21/26+]
catching up, exam review, etc.
- Lecture 36 (11/18/11): Exam 2 due (no class)
- Lecture 37 (11/28/11): [22/01-22/19] Sequence 23 (pdf):
Fail-Stop Processes. [Reading Assignment 14]
- Lecture 38 (11/30/11): [22/20-23/07] Sequence 24 (pdf):
Diagnosability. [Reading Assignment 15 and 16]
- Lecture 39 (12/02/11): [23/08-24/xx] Sequence 25,26
(pdf): Fault-tolerant Architectures,
(pdf): Space Shuttle. [Reading Assignment 17]
- Lecture 40 (12/05/11): Sequence 27 (pdf):
Boeing 777. [Reading Assignment 18]
- Lecture 41 (12/07/11): Sequence 28 (pdf):
Boeing 777 ADIRU. [Reading Assignment 19]
- Lecture 42 (12/09/11): [27/24-2x/xx] Sequence 29,30,31
Sequence 29 (pdf): SIFT,
Sequence 30 (pdf): Tandem,
Sequence 31 (pdf): Tandem, MAFT
- Final exam is on Wednesday, December 14th at 10am. This is an open book/notes exam.
Since most of you work with electronic copies of our material, you may use your computer.
However, I will address the limits on computer use in class.
- What does the final exam cover? All material starting with (and including) sequence 21.
However, be prepared to draw Petri nets, Markov chains or calculate the reliability of a simple system!
Reading Assignments (so far):
- You need to locate the paper if no specific link is supplied. Note that the UI has subscriptions to most sources, and access is granted automatically if you open them from on campus. If you are off campus, you will have to go through the UI library to get this access.
- 1) William R. Dunn, "Designing Safety-Critical Computer Systems", IEEE Computer, November 2003.
- 2) Victor P. Nelson, "Fault-Tolerant Computing: Fundamental Concepts", IEEE Computer, July 1990.
- 3) Algirdas Avizienis, Jean-Claude Laprie, Brian Randell, and Carl Landwehr, "Basic Concepts and Taxonomy of Dependable and Secure Computing",
IEEE Transactions on Dependable and Secure Computing, Vol. 1, No. 1, January-March 2004.
- 4) SHARPE introduction: go to
Kishor Trivedi's home page and follow the SHARPE link,
which will bring you to the site with documentation, examples and more.
We will not cover SHARPE in much detail, so it will be your responsibility to learn how it works and how to use it.
- 5) P.M. Melliar-Smith, L. E. Moser, and V. Agrawala, "Broadcast Protocols for Distributed Systems", IEEE Trans. on Parallel and Distributed Systems, Vol. 1, No. 1, January 1990.
- 6) K. Birman and T. Joseph, "Reliable Communication in the Presence of Failure", ACM Transactions on Computer Systems, Vol. 5, No. 1, February 1987, Pages 47-76.
- 7) L. Lamport, R. Shostak, and M. Pease, "The Byzantine Generals Problem", 1982.
- 8) Thambidurai, P., and You-Keun Park, "Interactive Consistency with Multiple Failure Modes".
There is an interesting followup paper "Verification of Hybrid Byzantine Agreement Under Link
Faults" by P. Lincoln and J. Rushby that addresses a problem in the algorithm of
Thambidurai and Park.
- 9) Fred B. Schneider, "Understanding Protocols for Byzantine Clock Synchronization",
87-859, Department of Computer Science, Cornell University, Ithaca, New York, August 1987.
- 10) Flaviu Cristian, "A Probabilistic Approach to Distributed Clock Synchronization", 9th Intl. Conference on Distributed Computing Systems, 5-9 June, 1989.
- 11a) Mootaz et al., "A survey of rollback-recovery protocols in message-passing systems"
- 11b) Samir Jafar, Axel Krings and Thierry Gautier,
"Flexible Rollback Recovery in Dynamic Heterogeneous Grid Computing",
IEEE Transactions on Dependable and Secure Computing (TDSC), Vol. 6, No. 1, January-March 2009.
- 12) A Case for Redundant Arrays of Inexpensive Disks (RAID), by D.A. Patterson..
- 13) RAID: High-Performance, Reliable Secondary Storage, Peter M. Chen, Edward Lee, Garth
Gibson, Randy Katz, and David Patterson, ACM Computing Surveys, 1994.
- 14) Byzantine Generals in Action: Implementing Fail-Stop Processors, Fred B. Schneider,
ACM Transactions on Computer Systems, Vol. 2, No. 2, pp. 145-154, May 1984.
- 15) On the Connection Assignment Problem of Diagnosable Systems,
by F. Preparata, G. Metze and R. Chien, IEEE Trans. Electronic Computers, Vol EC-16, Issue 6, Dec. 1967.
- 16) Implementation of On-Line Distributed System-Level Diagnosis Theory,
by Ronald Bianchini and Richard Buskens, IEEE Trans. Computers, Vol. 41, No. 5, May 1992.
- 17) (just take a quick peek) Redundancy Management Technique for Space Shuttle Computers, by J. R. Sklaroff,
IBM Journal of Research and Development, Vol. 20, No. 1, pp. 20-28, January 1976.
- 18) Triple-Triple Redundant 777 Primary Flight Computer, Y.C. Yeh, 1996 IEEE Aerospace
Applications Conference, pp. 293-307, 1996.
- 19) A Fault-Tolerant Air Data/Inertial Reference Unit, Michael L. Sheffels, IEEE AES Systems
Magazine, March 1993. (Search online, or find the paper for free via IEEE Xplore.)
Homeworks/Exams:
- Expectations: Homeworks are expected to look professional!
They do not have to be typed in order to look good,
but be aware that I will not accept scribbles etc.
Use a new page for each problem and staple the final submission.
The submission should have the problem sheet as a cover.
- HW1 (pdf)
- HW2 (pdf)
- HW3 (pdf)
- 549 project
(pdf)
Old Exams:
Interesting Links
A special thanks to Dr. Roger Kieckhafer (MTU) for the contributions
to the material used in this class.