Fault-Tolerant Systems (CS449/549)
Welcome to Fault-Tolerant Systems CS449/549.
This course is offered in the Fall Semester 2005 at the
University of Idaho in Moscow and is also available through
Engineering Outreach
for off-campus students.
The course is taught by
Dr. Axel Krings.
This web page
contains information about the course, e.g. the syllabus, class notes, and pointers
to interesting places.
Material can be downloaded in PDF (or PostScript) format and will be
updated as the class goes on.
To get an idea of what this class is about, take a look at
last semester's page.
However, materials and topics constantly change, and this class will
be no exception.
If you have comments, please let me know.
Engineering Outreach students,
there are several things you should know.
First of all, if you are trying to contact me, you can call
800-824-2889 ext. 4078 (toll free).
Please download the class material from the web page.
This speeds up the distribution process and avoids shipping delays.
If you do not have a PDF viewer, you can get one free from
Adobe;
if you need a PostScript viewer, check out the Aladdin viewer.
If for some reason you are not able to download the material, please contact
Engineering Outreach.
There are several assignments that require access to local simulation tools.
Engineering Outreach students need to have web access with telnet capability
in order to use this software. Accounts on local workstations will be made
available.
Course description: this course addresses design, modelling, analysis, and
integration of hardware and software to achieve dependable computing
systems employing on-line fault-tolerance.
It covers the concepts and terminology of
Fault-Tolerant System Design including: Reliability, Dependability,
Maintainability, Redundancy, Error Detection, Damage Confinement,
Error Recovery, Fault Treatment, Redundancy Management, Voting,
Information Redundancy, Random Variables, cdf, pdf, Expectation,
Bathtub Curve, MTTF, Reliability of Series/Parallel Systems,
Stand-by Redundancy, M-of-N System, Reliability Block Diagrams, Fault Trees,
Markov Process, Petri Nets, Generalized Stochastic Petri Nets,
Recovery Strategies, Roll-back Recovery, Agreement and Consensus,
Byzantine Clock Synchronisation, RAID, Fail-Stop Processes,
Systems Diagnosis, Case studies.
I always change the material slightly to account for interesting changes
in the field.
Note: This class has a prerequisite of
Computer Organisation and Architecture (CS245) or permission of
the instructor.
In a 400/500 level computer science class
I expect working knowledge of Unix and Microsoft operating systems.
- Contact information:
- Axel Krings (PhD), JEB 320,
- Phone: 208-885-4078, fax: 208-885-9052.
- Engineering Outreach students: dial toll free 800-824-2889 ext. 4078
- Mailing address: Engineering Outreach, PO Box 441014,
Moscow, Idaho 83844-1014.
- Email: krings@cs.uidaho.edu (see comments in syllabus on email procedures)
- Office Hours:
(see here)
- Live-taped: MWF 10:30-11:20 room JEB 025.
- News Group
- We will be using
WebCT,
a service of the University of Idaho.
If you have confidential or personal issues, please send me an email or call me.
Any other issues, i.e. all course-related questions, should be handled using this distribution mechanism.
WebCT is set up (9/9/2005) and you should have received an email by now on how to use the system.
Check for messages on a regular basis.
- Fall 2005 Term Class Handouts:
- The handout numbers refer to the lecture in which the handout
was made available.
This does not necessarily mean that this material was
covered in this particular lecture. (Most likely there is
some overlap).
- If there are any problems with accessing the handouts,
please let me know (email, phone, smoke signals, drums, ...)!
- Corrections: some slides may contain formatting errors, typos etc.
which have been addressed in class, but have not been reflected
in the notes posted here.
- WARNING LOCAL STUDENTS:
Do not send pdf files (i.e. files in pdf format) to the printer!
Pdf files are binary files and printing them "directly" will
result in a big printer mess!!!
Use acroread to view and print the file!!!
- Syllabus.
- Lecture Notes
- lecture 1 (08/22/05):
(pdf)
Introduction to fault-tolerance and safety-critical systems.
Reading Assignment 1).
- lecture 2 (08/24/05):
(pdf)
Top challenges facing the practice of fault tolerance
- lecture 3 (08/26/05):
(pdf)
Definitions, Dependability...Maintainability, Fault-Error-Failure,
Reading Assignment 2).
- lecture 4 (08/29/05):
(pdf)
Redundancy, Error Detection, Damage Confinement, Error Recovery, Fault
Treatment, Passive HW Redundancy, Voting.
- lecture 5 (08/31/05):
(pdf)
Information Redundancy, Parity, Checksum, CRC.
- lecture 6 (09/02/05):
(pdf)
Random Variables, cdf, pdf, Expectation, Reliability, Bathtub Curve,
MTTF, Reliability Block Diagram, Reliability of Series System, Reliability of Parallel System.
Reading Assignment 3).
- no lecture (09/05/05): National Holiday
- lecture 7 (09/07/05):
(pdf)
Stand-by Redundancy, M-of-N System, Reliability Block Diagram,
Reliability analysis example Bus-Guardian
- lecture 8 (09/09/05):
(pdf)
Fault Trees, Example Bus-Guardian, intro to SHARPE.
SHARPE manual: help save paper and try to avoid printing the
manuals. You can use them on-line when you need them.
(sharpe quick starter)
- lecture 9 (09/12/05):
(pdf)
Markov Process
- lecture 10 (09/14/05):
(pdf)
Steady State and Transient Solution
- lecture 11 (09/16/05):
(pdf)
Markov Models of Typical Systems
- lecture 12 (09/19/05):
(pdf)
Petri Nets, Reading Assignment 4)
(Note: this is an updated set of slides after the old set was accidentally posted for several hours)
- Note that Homework 1 is on line below!
- lecture 13 (09/21/05):
(pdf)
Petri Nets,
Generalized Stochastic Petri Nets (GSPN)
- lecture 14 (09/23/05):
(pdf)
Discussion
- lecture 15 (09/26/05):
(pdf)
Distributed Systems, Ordering-Synchronising
- lecture 16 (09/28/05): catching up. HW2 will be posted - start as soon as possible.
- lecture 17 (09/30/05): discussion on homework disaster, exam prep.
- lecture 18 (10/03/05):
(pdf)
Reliable Broadcast, Reading assignment 5.
- lecture 19 (10/05/05):
(pdf)
Atomic and Causal Broadcast, Reading assignment 6.
Homework 2 is due.
- lecture 20 (10/07/05): EXAM 1
- lecture 21 (10/10/05):
(pdf)
Intro. Fault-tolerant Agreement, Oral messages.
Reading assignment 7.
- lecture 22 (10/12/05):
(pdf)
Fault-tolerant Agreement, signed messages.
- lecture 23 (10/14/05):
(pdf)
Fault-tolerant Agreement, cont.
- lecture 24 (10/17/05):
(pdf)
Agreement cont. Davies and Wakerly approach.
- lecture 25 (10/19/05):
(pdf)
Fault models, Reading assignment 8:
Note the comment in class about the algorithm and the paper by John Rushby.
- lecture 26 (10/21/05):
(pdf)
Clock Synchronization. Reading assignment 9.
- lecture 27 (10/24/05):
(pdf)
Clock Synchronization
- lecture 28 (10/26/05):
(pdf)
Reading a remote clock. HW3 is out!
Reading assignment 10.
- lecture 29 (10/28/05):
(pdf)
(pdf)
(pdf)
Recovery Strategies, checkpointing
(pdf)
recovery strategies, Theft-induced checkpointing
- lecture 32 (11/04/05): Reading Assignment 11
- lecture 33 (11/07/05):
(pdf)
RAID, Reading Assignment 12/13
- lecture 34 (11/09/05):
(pdf)
Fail-Stop Processes,
Reading Assignment 14
- lecture 35 (11/11/05): no new handouts
- lecture 36 (11/14/05):
(pdf)
Diagnosability, Reading assignment 15 and 16
- lecture 37 (11/16/05):
(pdf)
Fault-tolerant Architectures
- lecture 38 (11/18/05):
(pdf)
Space Shuttle, Reading Assignment 17
- Fall Break
- lecture 39 (11/28/05):
(pdf)
Boeing 777, Reading Assignment 18
- (11/30/05): EXAM II.
This is an in-class, open book/notes exam
(which includes open notes and papers - bring a calculator, just to be safe)
- lecture 40 (12/02/05):
(pdf)
Boeing 777 ADIRU, Reading Assignment 19
- lecture 41 (12/05/05):
(pdf)
SIFT
- lecture 42 (12/07/05):
(pdf)
Tandem, NonStop System Cyclone, Himalaya
- lecture 43 (12/09/05):
(pdf)
MAFT
- FINAL exam (12/13/05: in-class starting at 10:00am)
- Reading Assignments (so far):
- You need to locate the paper, unless I specify "(copy)", in which case I will
supply a hardcopy to the UI Copy Center in the Commons.
EO students: the "(copy)" papers will be sent to you.
- 1) William R. Dunn, "Designing Safety-Critical Computer Systems"*
- 2) V. Nelson, "Fault-Tolerant Computing: Fundamental Concepts" *
- 3) SHARPE documentation *
- 4) Petri Nets *
- 5) Broadcast Protocols, *
- 6) Birman and Joseph paper,
*
- 7) L. Lamport, R. Shostak, and M. Pease,
"The Byzantine Generals Problem", 1982.
- 8) Thambidurai, P., and You-Keun Park, "Interactive Consistency with Multiple Failure Modes".
There is an interesting followup paper "Verification of Hybrid Byzantine Agreement Under Link
Faults" by P. Lincoln and J. Rushby that addresses a problem in the algorithm of
Thambidurai and Park.
*
- 9) Fred Schneider, "Understanding Protocols for Byzantine Clock Synchronization"
*
- 10) Flaviu Cristian, "A Probabilistic Approach to Distributed Clock Synchronization"
*
- 11) Mootaz et al., "A survey of rollback-recovery protocols in message-passing systems"
*
- 12) A Case for Redundant Arrays of Inexpensive Disks (RAID), by D.A. Patterson, (google on the web).
- 13) RAID: High-Performance, Reliable Secondary Storage, Peter M. Chen, Edward Lee, Garth
Gibson, Randy Katz, and David Patterson, ACM Computing Surveys, 1994. (CiteSeer)
- 14) Byzantine Generals in Action: Implementing Fail-Stop Processors, Fred B. Schneider,
ACM Transactions on Computer Systems, Vol. 2, No. 2, pp. 145-154, May 1984. (google on the web)
- 15) On the Connection Assignment Problem of Diagnosable Systems,
by F. Preparata, G. Metze and R. Chien.
- 16) Implementation of On-Line Distributed System-Level Diagnosis Theory,
by Ronald Bianchini and Richard Buskens, Trans. Computers, Vol. 41, No. 5, May 1992.
*
- 17) (copy) Redundancy Management Technique for Space Shuttle Computers, by J. R. Sklaroff,
IBM Journal of Research and Development, Vol. 20, No. 1, pp. 20-28, January 1976.
*
- 18) Triple-Triple Redundant 777 Primary Flight Computer, Y.C. Yeh, 1996 IEEE Aerospace
Applications Conference, pp. 293-307, 1996.
*
- 19) A Fault-Tolerant Air Data/Inertial Reference Unit, Michael L. Sheffels, IEEE AES Systems
Magazine, March 1993. (google, or find the paper free with IEEE Xplore)
- Fall 2005 Homeworks/Exams:
- Expectations: Homeworks are expected to look professional!
They do not have to be typed in order to look good,
but be aware that I will not accept scribbles etc.
Use a new page for each problem and staple the final submission.
- HW1
( pdf )
due 09/28/2005 (video 10/07)
- HW2
( pdf )
due 10/05/2005 (video 10/19) NOTE: the due date has changed for EO students!!!
Note, there is not much time. Also, since you need the results of the previous
HW for question 1, start with the other questions.
With respect to Exam I, you will get the most benefit from understanding how to
derive Markov chains and Petri nets for problems.
- HW3
( pdf )
due 11/09/2005 (video 10/23).
- HW4
( pdf )
due at the time of the final exam
- 500-level project
( pdf )
due 12/13/2005 (for all students).
- Old Exams:
- Interesting Links
- A special thanks to Dr. Roger Kieckhafer (MTU) for the contributions
to the material used in this class.