Fault-Tolerant Systems (CS449/549)
Welcome to Fault-Tolerant Systems CS449/549,
which is offered in the Spring Semester 2019 at the
University of Idaho in Moscow and is also available through
Engineering Outreach
for off-campus students.
This web-page
contains information about the course, e.g. syllabus, class notes, pointers
to interesting places etc.
Material can be downloaded in PDF (or PostScript) format and will be posted
in updated form as the class goes on.
To get an idea of what this class is about, take a look at
last time's page.
However, materials and topics constantly change, and this class will
be no exception.
If you have comments, please let me know.
Engineering Outreach students,
there are several things you should know.
First of all, if you are trying to contact me, you can call
800-824-2889 ext. 4078 (toll free).
Please download the class material from the web page.
This speeds up the distribution process and avoids shipping delays.
There are several assignments that require access to local simulation tools.
Engineering Outreach students need to have web access with ssh capability
in order to use this software. Accounts on local workstations will be made
available. I will talk more about this when the time comes.
Course description: this course addresses design, modeling, analysis, and
integration of hardware and software to achieve dependable computing
systems employing on-line fault-tolerance.
It covers the concepts and terminologies of
Fault-Tolerant System Design including: Reliability, Dependability,
Maintainability, Redundancy, Error Detection, Damage Confinement,
Error Recovery, Fault Treatment, Redundancy Management, Voting,
Information Redundancy, Random Variables, cdf, pdf, Expectation,
Bathtub Curve, MTTF, Reliability of Series/Parallel Systems,
Stand-by Redundancy, M-of-N System, Reliability Block Diagrams, Fault Trees,
Markov Process, Petri Nets, General Stochastic Petri Nets,
Recovery Strategies, Roll-back Recovery, Agreement and Consensus,
Byzantine Clock Synchronisation, RAID, Fail-Stop Processes,
Systems Diagnosis, Case studies.
I always change the material slightly to account for interesting changes
in the field.
Note: This class has a prerequisite of
Computer Operating Systems (CS240) or permission of
the instructor.
In a 400/500 level computer science class
I expect working knowledge of unix and MS operating systems.
- Contact information:
- Axel Krings (PhD), JEB 320,
- Phone: 208-885-4078, fax: 208-885-9052.
- Office Hours:
(see here)
(coming up soon)
- Class times: MWF 2:30-3:20 room JEB 26 (originally scheduled in EP 203).
- Class Forum
- Any questions that are related to the course can be posted to the Fault Tolerant Systems news group. Please read the welcome message for the posting policy.
- To get started go to CS449 Forum.
You need to log in with your UI login name/passwd.
Please note that the authentication is with the UI username/password and is handled by the UI's main authentication service and *not* a third party.
If you are a first-time user, you need to "register" (next to the "login" option). Now you can read, but if you want to post, you need to "login", using the name/passwd you created during registration.
Class Handouts:
- The handouts are ordered by sequence number, and the material covered in each lecture is indicated next to the date.
- If there are any problems with accessing the handouts,
please let me know (email, phone, smoke signals, drums, ...)!
- Corrections: some slides may contain formatting errors, typos etc.
which have been addressed in class, but have not been reflected
in the notes posted here.
- Syllabus.
- How-to-reference contract
- Lecture Notes
- Lecture 1 (01/09/19): [1/01-1/03] Sequence 1,
(pdf)
:
Introduction to the class, syllabus, expectations, literature, etc.
Introduction to fault-tolerance and safety-critical systems.
Top challenges facing the practice of fault tolerance
[Reading Assignment 1].
- Lecture 2 (01/11/19): [1/04-1/14] Sequence 2,
(pdf)
:
Definitions, Dependability...Maintainability, Fault-Error-Failure,
[Reading Assignment 2 and 3].
- Lecture 3 (01/14/19): [1/15-1/23] no new handout
Definitions (Hazard, Mishap, Risk), Fail-Safe and Fail-Operate Systems,
Intro. Failure Mode and Effects Analysis (FMEA), Fault-Tree Analysis (FTA), Risk Analysis (RA),
Top 5 Challenges in Fault Tolerance.
- Lecture 4 (01/16/19): [1/24-2/12] Sequence 3,
(pdf)
:
Redundancy, Error Detection, Damage Confinement, Error Recovery, Fault
Treatment, Passive HW Redundancy, Voting.
- Lecture 5 (01/18/19): [2/13-3/13] Sequence 4,
(pdf)
:
Information Redundancy, Parity, Checksum, CRC
- Lecture 6 (01/23/19): [3/14-3/25] Sequence 5
(pdf)
:
Random Variables, cdf, pdf, Expectation, Reliability, Bathtub Curve,
MTTF, Reliability Block Diagram, Reliability of Series System, Reliability of Parallel System.
- Lecture 7 (01/25/19): [4/01-4/12] no new handout
- Lecture 8 (01/28/19): [4/13-5/04] Sequence 6
(pdf)
:
Stand-by Redundancy, M-of-N System, Reliability Block Diagram,
Reliability analysis example Bus-Guardian
- Lecture 9 (01/30/19): [5/05-5/21] Sequence 7
(pdf)
:
Fault Trees, Example Bus-Guardian, example (using SHARPE).
- Lecture 10 (02/01/19): [6/01-6/12] Sequence 8
(pdf)
:
Markov Process
- Lecture 11 (02/04/19): [6/13-7/07] Sequence 9
(pdf)
:
Steady State and Transient Solution
- Lecture 12 (02/06/19): [7/08-8/08] No new handout. [Reading Assignment 4]
- Lecture 13 (02/08/19): [8/09-9/03] Sequence 10
(pdf)
:
Markov Models of Typical Systems,
- Lecture 14 (02/11/19): [9/04-10/02] no new handout
- Lecture 15 (02/13/19): [10/03-10/08] Sequence 11
(pdf)
:
Petri Nets
- 549 Semester Project posted below!
- Lecture 16 (02/15/19): [10/09-11/18] Sequence 12
(pdf)
:
Petri Nets, General Stochastic Petri Nets (GSPN)
- Lecture 17 (02/20/19): [12/01-12/12]
(Bobbio & Trivedi Slides)
[Reading assignment 5],
Petri Nets, General Stochastic Petri Nets (GSPN) continued
- Lecture 18 (02/22/19): [12/13-12/28]
Modeling with Petri Nets, examples, cont.
- Lecture 19 (02/25/19): [12/28-13/08] Sequence 13
(pdf)
:
Distributed Systems, Ordering, Synchronizing
- Exam 1 (02/27/19): in-class, closed notes, open mind. All material up to and including Petri Nets is covered.
- Lecture 20 (03/01/19): [13/09-14/08] Sequence 14
(pdf)
:
Reliable Broadcast,
Atomic and Causal Broadcast, Reading assignment 6 and 7.
- Lecture 21 (03/04/19): Broadcast continued
[Reading Assignment 8, "The Byzantine Generals Problem"]
- Lecture 22 (03/06/19): [14/09-15/04] Sequence 15
(pdf)
:
Intro. Fault-tolerant Agreement, Oral messages.
- (03/08/19) Project day. Class does not meet. Use your time for the project and review of materials.
- Lecture 23 (03/18/19): [15/05-15/15] Sequence 16
(pdf)
:
Fault-tolerant Agreement, signed messages. [Reading Assignment 9].
- Makeup Exam 1 (03/20/19):
- Lecture 24 (03/22/19): [15/16-16/04] Agreement cont.,
- Check out Homework 2!
- Lecture 25 (03/25/19): [16/05-17/01] Sequence 17
(pdf)
:
Hardware-assisted agreement (e.g., the Davies and Wakerly approach).
- Lecture 26 (03/27/19): [17/01-18/02] Sequence 18
(pdf)
:
Fault models, from reading assignment 9:
- Lecture 27 (03/29/19): [18/03-18/13]
Fault models continued
Note the comments in class about the algorithm and the paper by John Rushby.
Check out the upcoming reading assignment 10.
- Lecture 28 (04/01/19): [19/01-19/03] Sequence 19
(pdf)
:
Clock Synchronization. [Reading assignment 10].
- Lecture 29 (04/03/19): [19/04-19/12] clock synchronization cont.
- Lecture 30 (04/05/19): [19/13-19/21]
Sequence 20
(pdf)
:
Reading a remote clock.
Reading assignment 11.
- Lecture 31 (04/08/19): [19/22-19/31]
clock synchronization cont.,
- Lecture 32 (04/10/19): [20/01-20/08]
finishing up: Approximate Agreement, MSR algorithms,
local and global agreement in partially connected networks, network topologies,
convergence rates and how to compare algorithms
- Lecture 33 (04/12/19): [21/01-21/18]
Sequence 21
(pdf)
:
Recovery Strategies, checkpointing,
[Reading Assignment 12a, 12b]
- Lecture 34 (04/15/19): [21/19-21/26]
Sequence 21a
(pdf)
:
Theft-induced Checkpointing for reconfigurable dataflow applications
- Lecture 35 (04/17/19): [21a/01-21a/12]
Sequence 22
(pdf)
:
RAID, [Reading Assignment 13a,13b]
- Exam 2 (04/19/19): In-class exam. You can bring one letter-size sheet of paper with any notes you like.
- Lecture 37 (04/22/19): [22/01-22/25]
Sequence 23
(pdf)
:
Fail-Stop Processes, Reading Assignment 14
- Lecture 38 (04/24/19): [22/26-23/11]
Sequence 24
(pdf)
:
Diagnosability, Reading assignment 15 and 16
- Lecture 39 (04/26/19): [24/01-24/08]
Sequence 25
(pdf)
:
Fault-tolerant Architectures from a historical perspective.
- Lecture 40 (04/29/19): [24/09-25/14]
Sequence 27
(pdf)
:
Boeing 777, Reading Assignment 17
- Lecture 41 (05/01/19): [27/01-27/16]
catching up on the 777 PFC
- Lecture 42 (05/03/19): [27/17-28/xx]
Sequence 28
(pdf)
:
Boeing 777 ADIRU, Reading Assignment 18
- Final Exam: Tuesday, May 7, 3:00pm-5:00pm, our regular classroom. The material covered focuses mainly on Sequence 21 and forward.
Reading Assignments (so far):
- You need to locate the paper if no specific link is supplied. Note that the UI has subscriptions to most sources, and access is granted automatically if you reach them from on campus. If you are off-campus, you will have to go through the UI library to get this access.
- 1) William R. Dunn, "Designing Safety-Critical Computer Systems", IEEE Computer, November, 2003
- 2) Victor P. Nelson, "Fault-Tolerant Computing: Fundamental Concepts", IEEE Computer, July 1990.
- 3) Basic Concepts and Taxonomy of Dependable and Secure Computing, Algirdas Avizienis, Jean-Claude Laprie, Brian Randell, and Carl Landwehr,
IEEE Transactions on Dependable and Secure Computing, Vol. 1, No. 1, January-March 2004.
- 4) SHARPE introduction: go to
SHARPE at Kishor Trivedi's home page,
which will bring you to the site with documentation, examples and more.
We will not cover much detail of SHARPE, so it will be your responsibility to learn how it works and how to use it.
(pdf)
- 5) The slides of Andrea Bobbio and Kishor Trivedi, posted (with permission) for Lecture 17, contain a good introduction to Petri Nets. But the slides cover more material than we need.
- 6) P.M. Melliar-Smith, L. E. Moser, and V. Agrawala, "Broadcast Protocols for Distributed Systems", IEEE Trans. on Parallel and Distributed Systems, Vol. 1, No. 1, January 1990.
- 7) K. Birman and T. Joseph, "Reliable Communication in the Presence of Failure", ACM Transactions on Computer Systems, Vol. 5, No. 1, February 1987, Pages 47-76.
- 8) L. Lamport, R. Shostak, and M. Pease, "The Byzantine Generals Problem", 1982.
- 9) Thambidurai, P., and You-Keun Park, "Interactive Consistency with Multiple Failure Modes".
There is an interesting followup paper "A Formally Verified Algorithm for Interactive Consistency Under a Hybrid Fault Model" by P. Lincoln and J. Rushby
that addresses a problem in the algorithm of Thambidurai and Park.
- 10) Fred B. Schneider, "Understanding Protocols for Byzantine Clock Synchronization",
87-859, Department of Computer Science, Cornell University, Ithaca, New York, August 1987.
- 11) Flaviu Cristian, "A Probabilistic Approach to Distributed Clock Synchronization", 9th Intl. Conference on Distributed Computing Systems, 5-9 June, 1989.
- 12a) Mootaz Elnozahy et al., "A survey of rollback-recovery protocols in message-passing systems",
ACM Computing Surveys (CSUR), Volume 34 Issue 3, September 2002, Pages 375-408.
- 12b) Samir Jafar, Axel Krings and Thierry Gautier,
"Flexible Rollback Recovery in Dynamic Heterogeneous Grid Computing",
IEEE Transactions on Dependable and Secure Computing (TDSC), Vol. 6, No. 1, January-March 2009.
- 13a) A Case for Redundant Arrays of Inexpensive Disks (RAID), by Patterson, Gibson and Katz, ACM SIGMOD, Volume 17 Issue 3, June 1988, Pages 109-116.
- 13b) RAID: High-Performance, Reliable Secondary Storage, Peter M. Chen, Edward Lee, Garth
Gibson, Randy Katz, and David Patterson, ACM Computing Surveys, Volume 26 Issue 2, June 1994, Pages 145-185
- 14) Byzantine Generals in Action: Implementing Fail-Stop Processors, Fred B. Schneider,
ACM Transactions on Computer Systems, Vol. 2, No. 2, pp. 145-154, May 1984.
- 15) On the Connection Assignment Problem of Diagnosable Systems,
by F. Preparata, G. Metze and R. Chien, IEEE Trans. Electronic Computers, Vol EC-16, Issue 6, Dec. 1967.
- 16) Implementation of On-Line Distributed System-Level Diagnosis Theory,
by Ronald Bianchini and Richard Buskens, IEEE Trans. Computers, Vol. 41, No. 5, May 1992.
- 17) Triple-Triple Redundant 777 Primary Flight Computer, Y.C. Yeh, 1996 IEEE Aerospace
Applications Conference, pp. 293-307, 1996.
- 18) A Fault-Tolerant Air Data/Inertial Reference Unit, Michael L. Sheffels, IEEE AES Systems
Magazine, March 1993. (Search online, or find the paper free through IEEE Xplore.)
Homeworks/Exams:
- Expectations: Homeworks are expected to look professional!
They do not have to be typed in order to look good,
but be aware that I will not accept scribbles etc.
Use a new page for each problem and staple the final submission.
The submission should have the problem sheet as a cover.
- HW1 (pdf)
(solutions-pdf) Password will be given out in class.
- HW2 (pdf)
(solutions-part-1-pdf)
- HW3 (pdf)
(solutions-pdf)
- 549 Project (pdf)
A special thanks to Dr. Roger Kieckhafer (MTU) for the contributions
to the material used in this class.