Fault-Tolerant Systems (CS449/549)
Welcome to Fault-Tolerant Systems CS449/549,
which is offered in the Fall Semester 2009 at the
University of Idaho in Moscow.
This web page
contains information about the course, e.g. the syllabus, class notes, and pointers
to interesting places.
Material can be downloaded in PDF (or PostScript) format and will be made
available in updated form as the class goes on.
To get an idea of what this class is about, take a look at
last time's page.
However, materials and topics constantly change, and this class will
be no exception.
If you have comments, please let me know.
Course description: This course addresses the design, modelling, analysis, and
integration of hardware and software to achieve dependable computing
systems employing on-line fault-tolerance.
It covers the concepts and terminology of
Fault-Tolerant System Design including: Reliability, Dependability,
Maintainability, Redundancy, Error Detection, Damage Confinement,
Error Recovery, Fault Treatment, Redundancy Management, Voting,
Information Redundancy, Random Variables, cdf, pdf, Expectation,
Bathtub Curve, MTTF, Reliability of Series/Parallel Systems,
Stand-by Redundancy, M-of-N System, Reliability Block Diagrams, Fault Trees,
Markov Process, Petri Nets, General Stochastic Petri Nets,
Recovery Strategies, Roll-back Recovery, Agreement and Consensus,
Byzantine Clock Synchronisation, RAID, Fail-Stop Processes,
Systems Diagnosis, Case studies.
I always change the material slightly to account for interesting developments
in the field.
Note: This class has a prerequisite of
Computer Operating Systems (CS240) or permission of
the instructor.
In a 400/500-level computer science class,
I expect a working knowledge of Unix and MS operating systems.
- Contact information:
- Axel Krings (PhD), JEB 320,
- Phone: 208-885-4078, fax: 208-885-9052.
- Office Hours: (see here)
- Class times: MWF 1:30-2:20 room JEB 121.
Class Handouts:
- The handouts are ordered by sequence number, and the material covered in each lecture is indicated next to the date.
- If there are any problems with accessing the handouts,
please let me know (email, phone, smoke signals, drums, ...)!
- Corrections: some slides may contain formatting errors, typos, etc.
that have been addressed in class but are not yet reflected
in the notes posted here.
- WARNING TO LOCAL STUDENTS:
Do not send PDF files (i.e. files in pdf format) to the printer!
PDF files are binary files, and printing them "directly" will
result in a big printer mess!!!
Use acroread to view and print the file!!!
- Syllabus.
- Lecture Notes
- Lecture 1 (08/24/09): [1/01-1/06] Sequence 1 (pdf):
Introduction to the class, syllabus, expectations, literature, etc.
Introduction to fault-tolerance and safety-critical systems.
Top challenges facing the practice of fault-tolerance.
Reading assignment 1.
- Lecture 2 (08/26/09): [1/07-1/20] Sequence 2 (pdf):
Definitions, Dependability...Maintainability, Fault-Error-Failure.
Reading assignment 2.
- Lecture 3 (08/28/09): [1/21-2/01] Sequence 3 (pdf):
Redundancy, Error Detection, Damage Confinement, Error Recovery, Fault
Treatment, Passive HW Redundancy, Voting.
- Lecture 4 (08/31/09): [2/01-2/14] no new handouts
- Lecture 5 (09/02/09): [3/01-3/13] Sequence 4 (pdf):
Information Redundancy, Parity, Checksum, CRC.
- Lecture 6 (09/04/09): [3/14-3/25] Sequence 5 (pdf):
Random Variables, cdf, pdf, Expectation, Reliability, Bathtub Curve,
MTTF, Reliability Block Diagram, Reliability of Series System, Reliability of Parallel System.
- Lecture 7 (09/09/09): [4/01-4/12] no new handouts
- Lecture 8 (09/11/09): [4/13-5/04] Sequence 6 (pdf):
Stand-by Redundancy, M-of-N System, Reliability Block Diagram,
Reliability analysis example: Bus-Guardian.
- Lecture 9 (09/14/09): [5/05-5/21] no new handouts
- Lecture 10 (09/16/09): [6/01-6/12] Sequence 7 (pdf):
Fault Trees, Example Bus-Guardian, example (using SHARPE).
Homework 1 is posted.
- Lecture 11 (09/18/09): [6/13-7/03] Sequence 8 (pdf):
Markov Process
- Lecture 12 (09/21/09): [7/04-8/05] Sequence 9 (pdf):
Steady State and Transient Solution
- Lecture 13 (09/23/09): [8/06-9/03] Sequence 10 (pdf):
Markov Models of Typical Systems. Reading assignments 3 and 4.
- Lecture 14 (09/25/09): [9/04-10/05] Sequence 11 (pdf):
Petri Nets
- Lecture 15 (09/28/09): [10/06-11/01] Sequence 12 (pdf):
Petri Nets, General Stochastic Petri Nets (GSPN)
- Lecture 16 (09/30/09): [11/02-11/18] Sequence 13 (pdf):
Distributed Systems, Ordering-Synchronising
- Lecture 17 (10/02/09): [11/02-12/11]
Modeling with Petri Nets, cont.
- Lecture 18 (10/05/09): [12/12-12/28]
Sample nets, catching up
- Lecture 19 (10/07/09): [Trivedi slides, Mobius] Sequence Mob-1 (pdf)
- EXAM 1 (10/09/09)
- Lecture 20 (10/12/09): [CS549 Project] (pdf):
Multi-core Resilience, 500-level project
- Lecture 21 (10/14/09): project discussion, catching up
- Lecture 22 (10/16/09): [13/01-13/08] Sequence 14 (pdf):
Reliable Broadcast,
Atomic and Causal Broadcast. Reading assignments 5 and 6.
- Lecture 23 (10/19/09): [13/09-14/08] Sequence 15 (pdf):
Intro. Fault-tolerant Agreement, Oral messages.
Reading assignment 7.
- Lecture 24 (10/21/09): [14/09-14/15] Sequence 16 (pdf):
Multi-core assignment discussion,
Fault-tolerant Agreement, signed messages.
- Lecture 25 (10/23/09): [15/01-15/07] catching up, project discussion
- Lecture 26 (10/26/09): [15/08-15/20] Sequence 17 (pdf):
Hardware-assisted agreement (Davies and Wakerly approach).
- Lecture 27 (10/28/09): [16/01-16/12] catching up
- Lecture 28 (10/30/09): [16/13-17/06] Sequence 18 (pdf):
Fault models. Reading assignment 8.
Note the comment in class about the algorithm and the paper by John Rushby.
- Lecture 29 (11/02/09): [17/07-18/11] Sequence 19 (pdf):
Clock Synchronization. Reading assignment 9.
- Lecture 30 (11/04/09): [18/12-19/08] Sequence 20 (pdf):
Reading a remote clock.
Reading assignment 10.
- Lecture 31 (11/06/09): [19/09-xx/xx] synchronization cont.
Reading Assignments (so far):
- You need to locate the paper.
- 1) William R. Dunn, "Designing Safety-Critical Computer Systems" *
- 2) V. Nelson, "Fault-Tolerant Computing: Fundamental Concepts" *
- 3) Petri Nets *
- 4) Mobius Manual *
- 5) Broadcast Protocols *
- 6) Birman and Joseph paper *
- 7) L. Lamport, R. Shostak, and M. Pease, "The Byzantine Generals Problem", 1982.
- 8) P. Thambidurai and You-Keun Park, "Interactive Consistency with Multiple Failure Modes". *
There is an interesting follow-up paper, "Verification of Hybrid Byzantine Agreement Under Link
Faults" by P. Lincoln and J. Rushby, that addresses a problem in the algorithm of
Thambidurai and Park.
- 9) Fred Schneider, "Understanding Protocols for Byzantine Clock Synchronization" *
- 10) Flaviu Cristian, "A Probabilistic Approach to Distributed Clock Synchronization" *
Homework/Exams:
- Expectations: Homework is expected to look professional!
It does not have to be typed in order to look good,
but be aware that I will not accept scribbles etc.
Use a new page for each problem and staple the final submission.
The submission should have the problem sheet as a cover.
- HW1 (pdf)
- HW2 (pdf)
Old Exams:
Interesting Links
Special thanks to Dr. Roger Kieckhafer (MTU) for his contributions
to the material used in this class.