Fault-Tolerant Systems (CS449/549)
This site will be updated shortly for Fall 2003. In the meantime, here
is the answer to the question "What textbook do we use?":
The "optional" text will be "Reliability of computer systems and networks" by Martin L. Shooman, John Wiley & Sons Inc., 2002, ISBN 0-471-29342-3.
Below is the Spring 2002 site.
Welcome to Fault-Tolerant Systems CS449/549.
This course is offered in the Spring Semester 2002 at the
University of Idaho in Moscow and is also available through
Engineering Outreach
for off-campus students.
The course is taught by
Dr. Axel Krings.
This web page
contains information about the course, e.g. the syllabus, class notes, and pointers
to interesting places.
Material can be downloaded in pdf (or postscript) format and will be
updated as the class goes on.
To get an idea of what this class is about, take a look at
last semester's page.
However, materials and topics constantly change, and this class will
be no exception.
If you have comments, please let me know.
Engineering Outreach students,
there are several things you should know.
First of all, if you are trying to contact me, you can call
800-824-2889 ext. 4078 (toll free).
Please download the class material from the web page.
This speeds up the distribution process and avoids shipping delays.
If you do not have a pdf viewer, you can get one free from Adobe;
if you need a postscript viewer, check out the Aladdin viewer.
If for some reason you are not able to download the material, please contact
Engineering Outreach.
There are several assignments that require access to local simulation tools.
Engineering Outreach students need to have web access with telnet capability
in order to use this software. Accounts on local workstations will be made
available.
Course description: this course addresses design, modeling, analysis, and
integration of hardware and software to achieve dependable computing
systems employing on-line fault-tolerance.
It covers the concepts and terminologies of
Fault-Tolerant System Design including: Reliability, Dependability,
Maintainability, Redundancy, Error Detection, Damage Confinement,
Error Recovery, Fault Treatment, Redundancy Management, Voting,
Information Redundancy, Random Variables, cdf, pdf, Expectation,
Bathtub Curve, MTTF, Reliability of Series/Parallel Systems,
Stand-by Redundancy, M-of-N System, Reliability Block Diagrams, Fault Trees,
Markov Process, Petri Nets, General Stochastic Petri Nets,
Recovery Strategies, Roll-back Recovery, Agreement and Consensus,
Byzantine Clock Synchronization, RAID, Fail-Stop Processes,
Systems Diagnosis, Case studies.
I always change the material slightly to account for interesting changes
in the field.
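To give a flavor of the reliability topics listed above, here is a minimal sketch of how the reliability of series, parallel, and M-of-N systems can be computed. It assumes independent components and, for the time-dependent part, a constant failure rate (the flat section of the bathtub curve). The function names are made up for this illustration and are not taken from the class notes or from SHARPE.

```python
import math
from math import comb


def reliability_exponential(failure_rate, t):
    """R(t) = exp(-lambda * t) for a constant failure rate; MTTF = 1 / lambda."""
    return math.exp(-failure_rate * t)


def series(reliabilities):
    """A series system works only if every component works."""
    return math.prod(reliabilities)


def parallel(reliabilities):
    """A parallel system fails only if every component fails."""
    return 1.0 - math.prod(1.0 - r for r in reliabilities)


def m_of_n(m, n, r):
    """At least m of n identical, independent components must work."""
    return sum(comb(n, k) * r**k * (1.0 - r) ** (n - k) for k in range(m, n + 1))


if __name__ == "__main__":
    r = 0.9
    print("series of two:  ", series([r, r]))
    print("parallel of two:", parallel([r, r]))
    print("2-of-3 (TMR):   ", m_of_n(2, 3, r))
```

Note that a 1-of-N system reduces to the parallel case and an N-of-N system to the series case, which is a handy sanity check for any implementation.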
Note: This class has a prerequisite of
Computer Organization and Architecture (CS245) or permission of
the instructor.
In a 400/500 level computer science class
I expect working knowledge of unix and MS operating systems.
- Contact information:
- Axel Krings (PhD), JEB 320,
- Phone: 208-885-4078, fax: 208-885-9052.
- Engineering outreach students: dial toll free 800-824-2889 ext 4078
- Mailing address: Engineering Outreach, PO Box 441014,
Moscow, Idaho 83844-1014.
- Email: krings@cs.uidaho.edu (see comments in syllabus on email procedures)
- Office Hours:
(see here)
- Live-taped: MWF 12:30-13:20 room JEB 025.
- News Group
- CS449/549 has a news group as a forum for questions/answers related
to the material covered.
The news server is: news.uidaho.edu,
the news group is: uidaho.class.cs.449-ak.
Note that this is a standard news group, accessible using any news reader
(e.g. Netscape or tin); it is not a "chat room". If you have problems
accessing the group, make sure that the news server news.uidaho.edu is
in the list of servers. For example, in Netscape, you can do this by
going to "Edit->Preferences->Mail&Newsgroups->Newsgroup Server" and adding
the server.
- Spring 2002 Term Class Handouts:
- The handout numbers refer to the lecture in which the handout
was made available.
This does not necessarily mean that this material was
covered in this particular lecture. (Most likely there is
some overlap).
- If there are any problems with accessing the handouts,
please let me know (email, phone, smoke signals, drums, ...)!
- Corrections: some slides may contain formatting errors, typos etc.
which have been addressed in class, but have not been reflected
in the notes posted here.
- WARNING LOCAL STUDENTS:
Do not send pdf files (i.e. files in pdf format) to the printer!
Pdf files are binary files and printing them "directly" will
result in a big printer mess!!!
There are 2 ways to look at or print the pdf notes:
- Save the file and use acroread (/usr/local/bin/acroread)
to open it. Then from within acroread use the print option.
- Better: configure Netscape to open pdf files with acroread. To do this, go to
"edit - preferences", then expand "Navigator" and
click "Application". Next click "New" and fill in the
following: Description: acroread, MIMEType: application/pdf,
Suffixes: pdf.
Then click on "application" and enter:
/usr/local/bin/acroread %s
- Now "OK" out of it and it should work.
- Syllabus.
- Lecture Notes
- lecture 1 (01/14/02):
(pdf)
Intro, Shipping product on schedule, Reducing Unavailability,
Human Fault-tolerance
- lecture 2 (01/16/02):
(pdf )
Definitions, Dependability...Maintainability, Fault-Error-Failure
- lecture 3 (01/18/02):
(pdf )
Redundancy, Error Detection, Damage Confinement, Error Recovery, Fault
Treatment, Passive HW Redundancy, Voting.
- lecture 4 (01/23/02):
( pdf )
Active/Hybrid HW Redundancy, Information Redundancy, Parity, Checksum,
CRC.
- lecture 5 (01/25/02):
( pdf )
Random Variables, cdf, pdf, Expectation, Reliability, Bathtub Curve,
MTTF, Reliability of Series System, Reliability of Parallel System
- lecture 6 (01/28/02):
( pdf )
Stand-by Redundancy, M-of-N System, Reliability Block Diagram,
Example Bus-Guardian
- lecture 7 (01/30/02):
( pdf )
Example Bus-Guardian, Fault Trees
- lecture 8 (02/01/02):
( pdf )
(SHARPE Quick starter, pdf,
ps )
(SHARPE Language Description (63 pages),
pdf ),
(SHARPE Intro Manual (74 pages long -- might not want to print!),
pdf )
- lecture 9 (02/04/02): no new handouts
- lecture 10 (02/06/02):
( pdf )
Markov Process
- Want to find out what I am working on right now? ECE Research Colloquium
"Survivability Issues in Networked Computer Systems",
EP 216, 02/07/2002, 3:30pm.
- lecture 11 (02/08/02):
( pdf )
Steady State and Transient Solution
- lecture 12 (02/11/02):
( pdf )
Markov Model of Typical Systems
- lecture 13 (02/13/02):
( pdf )
Petri Nets
- lecture 14 (02/15/02):
( pdf )
General Stochastic Petri Nets (GSPN)
- lecture 15 (02/20/02):
catching up
- lecture 16 (02/22/02):
catching up
- lecture 17 (02/25/02):
( 15.pdf )
Distributed Systems, Ordering-Synchronizing
( 16.pdf )
Recovery Strategies
( 17.pdf )
Roll-back Recovery
- lecture 18 (02/27/02):
Reading assignment 5)
- lecture 19(20) (03/01/02):
( pdf )
The Byzantine Generals Problem,
Reading assignment 6)
- lecture 21 (03/06/02):
( pdf )
Reading assignment 6)
( Optimal Early Stopping (postscript) )
Byzantine Agreement: Oral Message Solution,
Signed Message Solution
- lecture 22 (03/08/02):
catching up
- lecture 23 (03/11/02):
EXAM I
- lecture 24 (03/13/02):
catching up
- lecture 25 (03/15/02):
( pdf )
Signed Message cont.
- Spring Break
- lecture 26 (03/25/02):
( pdf )
HW solution -- Davies-Wakerly Approach
- lecture 27 (03/27/02):
( pdf )
Fault Models, Thamb.& Park, Clock Synchronization
- lecture 28 (03/29/02):
Understanding Protocols for Byzantine Clock Synchronization, by Fred Schneider
- lecture 29 (04/01/02):
( pdf )
Clock Synchronization cont.
- lecture 30 (04/03/02):
catching up
- lecture 31 (04/05/02):
( pdf )
Reading Remote Clocks (Cri89a)
- lecture 32 (04/08/02):
catching up
- lecture 33 (04/10/02):
( pdf )
RAID (Reading Assignment 12)
- lecture 34 (04/12/02):
( pdf )
Fail-Stop Processes, Reading Assignment 13)
- lecture 35 (04/15/02):
( pdf )
Systems Diagnosis,
Reading Assignment 14)
- lecture 36 (04/17/02):
Reading Assignment 15)
- lecture 37 (04/19/02):
( pdf )
Fault-Tolerant Architectures, Tandem, Stratus, SIFT
- lecture 38 (04/22/02):
( pdf )
Space Shuttle,
Reading Assignment 16)
- lecture 39 (04/24/02):
( pdf )
Boeing 777
- lecture 40 (04/26/02):
( pdf )
( ppt )
Boeing 777 ADIRU
- lecture 41 (04/29/02):
( pdf )
( ppt )
SIFT
- lecture 42 (05/01/02):
( pdf )
( ppt )
Tandem, NonStop System Cyclone, Himalaya
- lecture 43 (05/03/02):
( pdf )
( ppt )
MAFT
- lecture 44 (05/06/02): coming up
- lecture 45 (05/08/02): coming up
- lecture 46 (05/10/02): coming up
- FINAL EXAM Monday, May 13, 2002, 1:00-3:00pm:
- Reading Assignments (so far):
- You need to locate the paper yourself unless I specify "(copy)", in which case I will
supply a hard copy.
If you were not present when copies were handed out, it is your responsibility
to get a copy, e.g. from a classmate.
- 1) (copy) R. Chillarege, "Top Five Challenges Facing the Practice of Fault-tolerance"
- 2) (copy) V. Nelson, "Fault-Tolerant Computing: Fundamental Concepts"
- 3) (copy) D. Harvey, "Is it Safe to Fly-by-Wire?"
- 4) SHARPE documentation, see lecture 8
- 5) Avi Mendelson and Neeraj Suri, "Cache Based Fault Recovery for Distributed Systems"
- 6) L. Lamport, R. Shostak, and M. Pease, "The Byzantine Generals Problem"
- 7) There are many general papers on agreement,
e.g. thesis "Classes Of Byzantine Fault-Tolerant Algorithms For Dependable Distributed Systems",
by Andre Postma
- 8) (copy) Davies, Daniel, and J.F. Wakerly, "Synchronization and Matching in Redundant Systems"
- 9) (copy) Thambidurai, P., and You-Keun Park, "Interactive Consistency with Multiple Failure Modes".
There is an interesting followup paper "Verification of Hybrid Byzantine Agreement Under Link
Faults" by P. Lincoln and J. Rushby that addresses a problem in the algorithm of
Thambidurai and Park
- 10) Fred Schneider, "Understanding Protocols for Byzantine Clock Synchronization"
( pdf )
- 11) (copy) Flaviu Cristian, "Probabilistic clock synchronization".
Try to search for the paper online, e.g. with Google, and find the "ResearchIndex".
This is a good way to find related papers, e.g.
http://citeseer.nj.nec.com/cristian94probabilistic.html
- 12) A Case for Redundant Arrays of Inexpensive Disks (RAID), by D.A. Patterson et al.
- 13) Byzantine Generals in Action: Implementing Fail-Stop Processors, Fred B. Schneider,
ACM Transactions on Computer Systems, Vol. 2, No. 2, pp. 145-154, May 1984.
- 14) (copy) On the Connection Assignment Problem of Diagnosable Systems,
by F. Preparata, G. Metze and R. Chien.
- 15) Implementation of On-Line Distributed System-Level Diagnosis Theory,
by Ronald Bianchini and Richard Buskens, Trans. Computers, Vol. 41, No. 5, May 1992.
- 16) (copy) Redundancy Management Technique for Space Shuttle Computers, by J. R. Sklaroff,
IBM Journal on Research and Development, Vol. 20, No. 1, pp. 20-28, January 1976.
- Spring 2002 Homeworks/Exams:
- Expectations: Homework is expected to look professional.
It does not have to be typed in order to look good.
Please use a new page for each problem and
staple the final submission.
- HW1
( pdf )
due (2/13 in-class) (2/19 video)
- HW2
( pdf )
due (3/4 in-class) (3/18 video)
You have only a few days to do this!!!
- HW3
( pdf )
due (3/12 in-class) (3/24 video)
Get started early!!!
- 500 level project (due dates reflect on-campus schedule)
( pdf )
- Old Exams:
- Interesting Links
- A special thanks to Dr. Roger Kieckhafer (MTU) for the contributions
to the material used in this class.