CS 448/548: Survivable Systems and Networks
This page is ALWAYS under construction!!!
Welcome to CS448/548 Survivable Systems and Networks.
This course is offered in the Spring Semester 2014 at the
University of Idaho.
The course is taught by
Dr. Axel Krings.
The website from the last time the course was taught can be viewed
here,
but be aware that the format and material change each semester
to reflect the dynamic nature of the research area.
This web page
contains information about the course, e.g., syllabus, class notes, pointers
to interesting places, etc.
Material can be downloaded in PDF and/or PostScript format and will be made
available in updated form as the class goes on.
If you have comments, please let me know.
Imagine what would happen if our critical infrastructures were compromised by malicious acts: failures of communications, power, water, gas, banking and finance, emergency services, etc. With increasing computer security concerns and the recognition that our critical infrastructure is vulnerable to cyber terrorism, achieving survivability of systems under attack is vital in computing and networked systems, whether for the systems themselves or for the critical applications and infrastructures they control.
This course will focus on malicious acts and other faults and their impact on systems, as well as techniques useful in designing systems that can survive such acts. Survivability goes beyond computer and network security or fault tolerance. The range of threats to survivability that must be considered is enormous, including hardware malfunctions, software flaws, environmental hazards, and malicious and accidental human acts.
But can one really design systems that survive attacks and tolerate intrusions? You may be surprised to find that there is an entire research area that deals with exactly that. Don't think of a laptop that becomes invincible (no James Bond scenarios here). Think bigger: think of models that help analyze systems, model reliability, identify essential services, and explore the limits of redundancy. Think of the kinds of faults or attack scenarios those systems may be subjected to. Now tap into the vast array of tools and solutions that exist, including agreement algorithms, N-version and N-variant software, new hybrid fault models, new analysis approaches, etc., and start designing your system!
Course description:
This course discusses issues of Survivability, Attributes of System
Survivability, Trustworthiness, Dependability and Assurance, Threats to
Survivability, Threats to Security, Threats to Reliability, Threats to
Performance, Requirements and Their Interdependence, Systemic Inadequacies,
Approaches for Overcoming Deficiencies, Evaluation Criteria, Attempts
at Standardization, Architectures for Survivability, Implementing and Configuring
for Survivability.
A wealth of literature has surfaced that deals with issues of system
survivability.
This class will be taught in several phases in which material
will be presented by the instructor and literature will be reviewed by
individual students or groups of students.
The results will be individual and group
presentations as well as discussions of contemporary issues.
The exact list of topics and the class format are not final and remain a work in progress.
- Contact information:
- Axel Krings (PhD), JEB 320,
- Phone: 208-885-4078, fax: 208-885-9052.
- Engineering outreach students: dial toll free 800-824-2889 ext 4078
- Mailing address: Engineering Outreach, PO Box 441014,
Moscow, Idaho 83844-1014.
- Office Hours:
(see here)
- Class time: MWF 9:30-10:20 room EP 203.
- Spring 2014 Term Class Handouts:
- The handouts are ordered by sequence number, and the material covered in each lecture is indicated next to its date.
Specifically, the numbers in brackets indicate the slides covered during class, i.e., [a/b-c/d] indicates that the material covered runs from sequence a (slide b) to sequence c (slide d).
- If there are any problems with accessing the handouts,
please let me know (email, phone, smoke signals, drums, ...)!
- Corrections: some slides may contain formatting errors, typos, etc.
that have been addressed in class but are not yet reflected
in the notes posted here.
- Course syllabus: to be discussed in class.
- Lecture Support Material: Note that this represents only a subset of the issues presented in class!
Whereas the information below gives the general schedule of the lectures,
it does not always indicate the specific approaches, methods, mechanisms, basic concepts, and building blocks.
These are derived using the reading assignments as "case studies"; the concepts are introduced as we discuss the papers.
Note that we will stretch out the material of the first few
classes in order to address background issues raised during
the presentation of the papers. This will especially help
students who have not taken computer security and fault-tolerant systems courses.
However, please do not confuse hand-waving with in-depth knowledge!
- Lecture 1 (01/15/14): [1/1-1/03]
Sequence 1 (pdf):
Introduction, fault-tolerance primer. This will be revisited in the discussion based on Reading Assignment 1.
- Lecture 2 (01/17/14): [1/4-1/06]
Sequence 2 (pdf):
Introduction cont.: survivability, intrusion tolerance, resilience, fault tolerance, ...
Fault-tolerance primer: standard definitions, assumptions, and their limitations.
The main discussion focus is on fault, error, and failure, as well as the assumption of fault independence (and common-mode faults, which violate it).
This also includes understanding the limitations of testing and the test-vector generation problem,
which is NP-hard (even for combinational, i.e., non-sequential, circuits).
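To make the cost of test generation concrete, here is a minimal sketch that finds test vectors for a single stuck-at fault by trying all 2^n input vectors of a tiny combinational circuit. The circuit y = (a AND b) OR (NOT c) and the fault (AND-gate output stuck-at-0) are hypothetical examples, not taken from the slides; exhaustive search only works here because n = 3.

```python
from itertools import product

def circuit(a, b, c, and_stuck_at=None):
    """Hypothetical circuit y = (a AND b) OR (NOT c).
    If and_stuck_at is 0 or 1, the AND gate's output line is forced to that
    value, modelling a single stuck-at fault."""
    g = a & b
    if and_stuck_at is not None:
        g = and_stuck_at
    return g | (1 - c)

def find_test_vectors(n_inputs, good, faulty):
    """Try all 2**n_inputs input vectors; keep those whose outputs differ."""
    return [v for v in product((0, 1), repeat=n_inputs) if good(*v) != faulty(*v)]

tests = find_test_vectors(
    3,
    good=lambda a, b, c: circuit(a, b, c),
    faulty=lambda a, b, c: circuit(a, b, c, and_stuck_at=0),
)
print(tests)  # [(1, 1, 1)]
```

Only (1, 1, 1) both activates the fault (a = b = 1) and propagates it to the output (NOT c = 0). With n inputs the same search needs up to 2^n circuit evaluations per fault, which is why practical automatic test-pattern generation algorithms such as PODEM rely on search heuristics instead.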
- Lecture 3 (01/22/14): [1/7-2/08]
Sequence 3 (pdf):
Based on Reading Assignment 2.
Make sure you really read these reading assignments, or you will miss out on developing a feel for the topic.
- Lecture 4 (01/24/14): [2/09-3/xx] Class cancelled. We will make up the material next class.
So be sure to finish the reading assignment:
Internet Security: An Intrusion-Tolerance Approach.
- Lecture 5 (01/27/14): [ -3/11]
Sequence 4 (pdf):
Reading Assignment 3. Fault model classifications and what they really mean in malicious environments.
- Lecture 6 (01/28/14): [3/12-4/04] Byzantine Agreement. [using Reading assignment 3]
- Lecture 7 (01/31/14): [4/05-4/12]
Sequence 5 (pdf):
Introduction to Hybrid Fault Models, [based on Reading assignment 4].
- Lecture 8 (02/03/14): [4/13-4/32]
Agreement and Lamport's OM(m) and SM(m) algorithms cont.
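The recursion of Lamport's OM(m) can be simulated in a few lines. This is a minimal sketch, assuming synchronous, reliable links and a hypothetical traitor strategy (send ATTACK to even-numbered receivers, RETREAT to odd-numbered ones); a real analysis must consider every possible traitor behavior. The Byzantine Generals paper shows that OM(m) reaches interactive consistency when there are more than 3m generals and at most m traitors.

```python
from collections import Counter

ATTACK, RETREAT = "ATTACK", "RETREAT"

def majority(values, default=RETREAT):
    """Value held by a strict majority, else the default value."""
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) // 2 else default

def send(sender, receiver, value, traitors):
    """Loyal senders pass the value on. Traitors follow a hypothetical strategy:
    ATTACK to even-numbered receivers, RETREAT to odd-numbered ones."""
    if sender in traitors:
        return ATTACK if receiver % 2 == 0 else RETREAT
    return value

def om(m, commander, lieutenants, value, traitors):
    """Oral Messages algorithm OM(m); returns {lieutenant: decided value}."""
    # Step 1: the commander sends its value to every lieutenant.
    received = {lt: send(commander, lt, value, traitors) for lt in lieutenants}
    if m == 0:
        return received  # OM(0): each lieutenant uses the value it received
    # Step 2: each lieutenant i relays its value to the others using OM(m - 1).
    relayed = {lt: [] for lt in lieutenants}
    for i in lieutenants:
        others = [j for j in lieutenants if j != i]
        for j, v in om(m - 1, i, others, received[i], traitors).items():
            relayed[j].append(v)
    # Step 3: each lieutenant decides on the majority of all values it holds.
    return {j: majority([received[j]] + relayed[j]) for j in lieutenants}

# Four generals (commander 0, lieutenants 1-3), m = 1.
print(om(1, 0, [1, 2, 3], ATTACK, traitors={3}))  # traitorous lieutenant
print(om(1, 0, [1, 2, 3], ATTACK, traitors={0}))  # traitorous commander
```

In the first run the loyal lieutenants 1 and 2 both follow the loyal commander's order; in the second, all loyal lieutenants decide the same value even though the commander sent conflicting orders.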
- Lecture 9 (02/05/14): [4/33-5/xx]
Agreement and Hybrid Fault Models cont.
- Lecture 10 (02/07/14): [5/xx-6/18]
Fault Models and Data Aggregation
Sequence 6 (pdf)
- Lecture 11 (02/10/14): [6/16-6/29]
Fault models, approximate agreement and conversion.
[Reading Assignment 5]
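Approximate agreement is often illustrated with the mean-subsequence-reduced (MSR) family of voting functions. The sketch below performs one round of the fault-tolerant midpoint, one MSR member: each correct node discards the t lowest and t highest values it received and takes the midpoint of what remains. It assumes t Byzantine nodes, n > 3t, a synchronous round, and hypothetical values; it is not the specific algorithm of Azadmanesh and Kieckhafer, which additionally exploits omissive faults.

```python
def ft_midpoint(values, t):
    """Fault-tolerant midpoint: drop the t lowest and t highest values,
    then return the midpoint of the remaining range."""
    if len(values) <= 2 * t:
        raise ValueError("need more than 2t values")
    reduced = sorted(values)[t:len(values) - t]
    return (reduced[0] + reduced[-1]) / 2

def one_round(correct, faulty_msgs, t):
    """One synchronous round.

    correct:     {node: current value} for the correct nodes
    faulty_msgs: {node: [values the t faulty nodes send to that node]}
    """
    return {
        node: ft_midpoint(list(correct.values()) + faulty_msgs[node], t)
        for node in correct
    }

# Hypothetical run: 3 correct nodes and 1 Byzantine node (n = 4 > 3t, t = 1).
# The Byzantine node sends a different value to each correct node.
values = {"A": 0.0, "B": 1.0, "C": 2.0}
new = one_round(values, {"A": [-100.0], "B": [100.0], "C": [0.5]}, t=1)
print(new)  # {'A': 0.5, 'B': 1.5, 'C': 0.75}
```

The spread of the correct values shrinks from 2.0 to 1.0, and every new value stays within the range of the old correct values despite the extreme values injected by the faulty node.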
- Lecture 12 (02/12/14): [6/30-7/08]
Sequence 7 (pdf):
Based on Reading Assignment 6: What faults should the application tolerate, and what can the infrastructure provide?
- Lecture 13 (02/14/14): [7/09-7/16]
Sequence 8 (pdf):
Discussion of the concept of Design for Analyzability, Reliability Block Diagrams and their dual, i.e., Fault Trees, and how useful they are.
Concepts and Taxonomy of Dependable and Secure Computing, [Reading Assignment 7]
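The duality between reliability block diagrams and fault trees can be checked numerically. In the sketch below (a hypothetical system: a controller in series with two redundant pumps, all components failing independently), series blocks multiply reliabilities and parallel blocks multiply unreliabilities; in the dual fault tree, "controller fails" feeds an OR gate and "both pumps fail" is an AND gate, and 1 minus the top-event probability equals the RBD reliability.

```python
from math import prod

def series(*r):
    """RBD series block: every component must work."""
    return prod(r)

def parallel(*r):
    """RBD parallel block: at least one component must work."""
    return 1 - prod(1 - x for x in r)

def and_gate(*q):
    """Fault-tree AND gate: output event occurs only if all inputs occur."""
    return prod(q)

def or_gate(*q):
    """Fault-tree OR gate: output event occurs if any input occurs."""
    return 1 - prod(1 - x for x in q)

r_ctrl, r_pump = 0.99, 0.9  # hypothetical component reliabilities
R_system = series(r_ctrl, parallel(r_pump, r_pump))
Q_top = or_gate(1 - r_ctrl, and_gate(1 - r_pump, 1 - r_pump))
print(f"RBD reliability: {R_system:.4f}, 1 - fault-tree top event: {1 - Q_top:.4f}")
```

Both numbers come out at 0.9801. The equivalence relies on the independence assumption; common-mode faults would break both calculations in the same way.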
- Lecture 14 (02/19/14): [7/17-7/32]
Unpredictable, Latent, Unobserved and Unobservable Risks in the context of the 3-layer survivability analysis architecture [Ma & Krings 2008].
- Lecture 15 (02/21/14): [8/01-8/xx] Material of reading assignment 7.
- Lecture 16 (02/24/14): [8/xx-8/66]
Sequence 9 (pdf):
Survivable Network (System) Analysis Method, [Reading Assignment 8 & 9].
- Lecture 16 (02/26/14): [9/01-9/07]
Survivable Systems Analysis preliminary discussion. SSA extensions, e.g., including Risk Assessment.
- Lecture 19 (02/28/14): [9/08-9/48]
Sequence 10 (pdf):
SSA Case Study.
- Lecture 20 (03/03/14): [10/01-10/08]
Discussion on SSA Case Studies listed in Sequence 10.
[Reading Assignment 10]
- Lecture 21 (03/05/14): [11/01-11/xx]
Sequence 11 (pdf):
Dealing with patterns, e.g., intrusion detection systems
- TAKE-HOME EXAM: see the email I sent to the class.
- Lecture 22 (03/07/14): [11/xx-12/04]
Sequence 12 (pdf):
Background material on Markov chains (needed for reading assignment 10 and an upcoming reading assignment by Y. Liu and K. Trivedi).
548 Project discussion.
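As a refresher on the Markov-chain background: a discrete-time chain with transition matrix P has a stationary distribution π satisfying πP = π. The sketch below approximates π by power iteration, assuming an irreducible, aperiodic finite chain; the two-state "up/degraded" chain is a hypothetical example.

```python
def stationary(P, tol=1e-12, max_iter=100_000):
    """Approximate the stationary distribution pi (pi P = pi) of a finite
    DTMC by repeatedly applying P, starting from the uniform distribution."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(max_iter):
        nxt = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    raise RuntimeError("no convergence; is the chain irreducible and aperiodic?")

# Hypothetical chain: state 0 = up, state 1 = degraded (each row sums to 1).
P = [[0.9, 0.1],
     [0.5, 0.5]]
pi = stationary(P)
print(pi)  # close to [5/6, 1/6]
```

The result can be checked by hand with the balance equation 0.1 π0 = 0.5 π1, which together with π0 + π1 = 1 gives π = (5/6, 1/6).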
- Lecture 23 (03/10/14): [12/05-13/02]
Sequence 13 (pdf):
Markov Analysis of Software Specifications
- Lecture 24 (03/12/14): [13/03-13/25]
Sequence 14 (pdf):
Decentralizing services, Case Study 1: Real-time attack recognition.
Dealing with Patterns cont.: Case study based on [Reading Assignment 11]
- Lecture 25 (03/14/14): [14/01-14/05]
Exam discussion, redundancy case study: lessons learned, DoS detection and recovery case study [from Reading Assignment 11]
- Spring Break
- Lecture 26 (03/24/14): [14/06-14/32]
Sequence 15 (pdf):
Profiling-based DoS detection and recovery (case study cont.) [Reading Assignment 12]
- Lecture 27 (03/26/14): [15/01-15/14]
Attack recognition continued.
Case study: real-time control application: ITS. (to be continued next Friday)
- Lecture 28 (03/28/14):
Sequence 16 (pdf):
Decentralized Services: case study background: RAID (note: this will be only a brief outline of the material),
[Reading Assignment 13]
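The redundancy idea behind the parity-based RAID levels (e.g., RAID 4 and RAID 5) fits in a few lines: the parity block is the XOR of the data blocks, so any single lost block equals the XOR of all surviving blocks. The sketch below assumes equally sized blocks and exactly one failed disk; it ignores striping, parity rotation, and the handling of writes.

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks (the RAID parity function)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

data = [b"ABCD", b"EFGH", b"IJKL"]  # hypothetical blocks on three data disks
parity = xor_blocks(data)          # block stored on the parity disk

# Disk 1 fails: XOR of the surviving data blocks and the parity rebuilds it.
rebuilt = xor_blocks([data[0], data[2], parity])
print(rebuilt)  # b'EFGH'
```

A second simultaneous failure cannot be repaired this way, since one parity block provides only one equation per byte position.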
- Lecture 29 (03/31/14):
Sequence 17 (pdf):
Decentralized Services: case study Survivable Storage
[Reading Assignment 14]
- Lecture 30 (04/02/14): [xx-18/26]
Sequence 18 (pdf):
How to share a secret, (derivation on board),
[Reading Assignment 15]
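Shamir's (k, n) threshold scheme hides the secret as the constant term of a random polynomial of degree k - 1 over a prime field; any k shares determine the polynomial by Lagrange interpolation, while fewer than k reveal nothing about the secret. A minimal sketch for study, not production use, assuming an integer secret smaller than the chosen prime (here the Mersenne prime 2^127 - 1) and share x-coordinates 1..n.

```python
import secrets

PRIME = 2**127 - 1  # Mersenne prime; the secret must be smaller than this

def split(secret, k, n, p=PRIME):
    """Return n shares (x, y); any k of them reconstruct the secret."""
    if not (0 <= secret < p and 1 <= k <= n < p):
        raise ValueError("invalid parameters")
    # Random polynomial of degree k - 1 whose constant term is the secret.
    coeffs = [secret] + [secrets.randbelow(p) for _ in range(k - 1)]

    def f(x):
        acc = 0
        for c in reversed(coeffs):  # Horner's rule, mod p
            acc = (acc * x + c) % p
        return acc

    return [(x, f(x)) for x in range(1, n + 1)]

def reconstruct(shares, p=PRIME):
    """Lagrange interpolation of the polynomial at x = 0 (mod p)."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = num * (-xj) % p
                den = den * (xi - xj) % p
        secret = (secret + yi * num * pow(den, -1, p)) % p
    return secret

shares = split(123456789, k=3, n=5)
print(reconstruct(shares[:3]), reconstruct(shares[2:]))  # 123456789 123456789
```

`pow(den, -1, p)` computes the modular inverse and needs Python 3.8 or later.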
- Lecture 31 (04/04/14): [xx-15/31]
continuation of the material of Lecture 27 (before my conference trip)
- Lecture 32 (04/07/14): [15/32-15/55]
Execution behavior: Dealing with dependencies.
- Lecture 33 (04/09/14): [18/26-18/33],[19/01-19/05]
Wrapping up... How to share a secret. Derived on board.
- Lecture 34 (04/11/14) : [board derivation-19/xx]
Sequence 19 (pdf):
Case study: Survivability architecture. Concepts:
N-version and N-variant executions,
[based on Reading Assignment 16]
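The simplest decision element of an N-version or N-variant execution is a majority voter over the versions' outputs. The sketch below uses exact-match voting over hashable outputs; the three "square" functions are hypothetical versions, one with a design fault. Real systems often need inexact voting (e.g., for floating-point results), and voting only masks faults when versions fail independently.

```python
from collections import Counter

def vote(outputs):
    """Exact-match majority voter.

    Returns the value produced by a strict majority of the versions, or None
    when no strict majority exists (the disagreement is then a detected error).
    """
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) // 2 else None

# Three hypothetical versions of "square x"; the third has a design fault.
versions = [
    lambda x: x * x,
    lambda x: x ** 2,
    lambda x: x * x if x < 10 else x + x,  # wrong for x >= 10
]

for x in (3, 12):
    outputs = [version(x) for version in versions]
    print(x, outputs, "->", vote(outputs))
```

For x = 12 the faulty version returns 24, but the two agreeing versions outvote it and the voter still returns 144.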
- Lecture 35 (04/14/14) : [19/xx-19/18]
N-variant executions using multi-core environments, different approaches of the literature.
- Lecture 36 (04/16/14) : [19/19-19/xx]
Reliability Modeling using Petri Nets, Stochastic Activity Networks,
probabilistic automata, [Reading Assignment 17]
- Lecture 37 (04/18/14) : [19/xx-19/46]
Sequence 20 (pdf):
Conceptual design: how to assess the feasibility of survivability by evaluating whether reliability specifications can theoretically be achieved, moving from evaluating concepts towards implementation.
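One simple way to check whether a reliability specification is achievable at all is to compute the reliability of a redundant configuration under an independence assumption. The sketch below is a hypothetical example, assuming identical, independently failing components and a perfect voter: it searches for the smallest odd n such that a majority-voted n-version system meets a target. It also shows that when component reliability is below 0.5, majority voting cannot reach the target at any n.

```python
from math import comb

def k_of_n(k, n, r):
    """Reliability of a k-out-of-n system of independent components with
    identical reliability r (at least k of the n must work)."""
    return sum(comb(n, i) * r**i * (1 - r)**(n - i) for i in range(k, n + 1))

def smallest_majority_system(r, target, n_max=101):
    """Smallest odd n whose majority-voted n-version system meets the target,
    assuming independent failures and a perfect voter; None if no n <= n_max
    meets it."""
    for n in range(1, n_max + 1, 2):
        if k_of_n(n // 2 + 1, n, r) >= target:
            return n
    return None

print(smallest_majority_system(0.95, 0.999))  # 7
print(smallest_majority_system(0.40, 0.9))    # None
```

Because common-mode faults are ignored, such numbers are optimistic upper bounds; if even they miss the specification, the design concept is infeasible before any implementation starts.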
- Lecture 37 (04/21/14) : [20/01-20/xx]
Decentralized Services: case study SITAR
[Reading Assignment 17]
- Lecture 38 (04/23/14) : [20/xx-20/29]
Sequence 21 (pdf):
Survivability Quantification, Markov Models
- Lecture 39 (04/25/14) : [21/01-21/06]
Transient and Steady State solutions and the connection to the T1A1.2 definition of survivability.
Survivability quantification, case study telephone system, analysis using common survivability definitions,
Performance model, Availability model, Composite model [Reading Assignment 18]
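For the simplest availability model, a single repairable unit with exponential failure rate λ and repair rate μ, modeled as a two-state CTMC that starts in the up state, the transient and steady-state solutions have closed forms: A(t) = μ/(λ+μ) + λ/(λ+μ)·exp(-(λ+μ)t) and A = μ/(λ+μ). The sketch below evaluates both; the rates are hypothetical.

```python
from math import exp

def availability(t, lam, mu):
    """Transient (point) availability of a two-state up/down CTMC that is up
    at t = 0; lam = failure rate, mu = repair rate (same time unit as t)."""
    s = lam + mu
    return mu / s + (lam / s) * exp(-s * t)

def steady_state_availability(lam, mu):
    """Limit of availability(t) as t grows."""
    return mu / (lam + mu)

lam, mu = 1e-3, 1e-1  # hypothetical: MTTF = 1000 h, MTTR = 10 h
for t in (0, 10, 50, 100):
    print(f"A({t} h) = {availability(t, lam, mu):.6f}")
print(f"A(steady state) = {steady_state_availability(lam, mu):.6f}")  # 100/101
```

The transient term decays with rate λ + μ, so after a few multiples of 1/(λ + μ) the steady-state value is an adequate summary; survivability measures, by contrast, are often concerned with the transient behavior right after a failure or attack.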
- Lecture 40 (04/28/14) : [21/07-22/06]
Sequence 22 (pdf):
How do you know that your results of large computations have not been (massively) corrupted?
A probabilistic approach to Result Certification, [Reading Assignment 19]
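The probabilistic argument behind result certification can be sketched as follows: if an attacker has falsified a fraction q of the task results and the verifier re-executes N tasks chosen uniformly at random (with replacement) on trusted resources, all N checks miss the falsification with probability (1 - q)^N, so N ≥ ln ε / ln(1 - q) checks bound that probability by ε. This is a generic illustration under these assumptions, not necessarily the exact certification algorithm of the reading assignment; the `reexecute` callback is a hypothetical stand-in for running a task on a reliable resource.

```python
import random
from math import ceil, log

def checks_needed(q, eps):
    """Smallest N with (1 - q)**N <= eps, for 0 < q < 1 and 0 < eps < 1."""
    return ceil(log(eps) / log(1 - q))

def certify(results, reexecute, n_checks, rng=random):
    """Re-execute n_checks randomly chosen tasks (with replacement).
    Returns False as soon as a reported result disagrees with re-execution."""
    for _ in range(n_checks):
        i = rng.randrange(len(results))
        if results[i] != reexecute(i):
            return False
    return True

print(checks_needed(0.01, 1e-3))  # 688 checks for q = 1 %, eps = 0.1 %
```

The number of checks depends only on q and ε, not on the total number of tasks, which is what makes this kind of certification attractive for large-scale computations.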
- Lecture 41 (04/30/14) : [22/07-22/xx]
From designing algorithms that can tolerate some faults to detecting when that fault threshold has been surpassed.
- Lecture 42 (05/02/14) : [22/xx-23/04]
Sequence 23 (pdf):
Risk background
- Lecture 43 (05/05/14) : [23/05-24/xx]
Sequence 24 (pdf):
SP800-30 Risk Management Guide, Risk Management or Risk Analysis?
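The original (2002) edition of NIST SP 800-30 rates risk with a simple likelihood × impact matrix: threat likelihood High = 1.0, Medium = 0.5, Low = 0.1; impact High = 100, Medium = 50, Low = 10; scores above 50 are High risk, above 10 up to 50 Medium, and 10 or below Low. The sketch below reproduces that matrix as an illustration; check the scales against the edition you use, since Revision 1 (2012) uses different, five-level scales.

```python
LIKELIHOOD = {"High": 1.0, "Medium": 0.5, "Low": 0.1}
IMPACT = {"High": 100, "Medium": 50, "Low": 10}

def risk_level(likelihood, impact):
    """Return (score, level) for a threat-likelihood / impact pair."""
    score = LIKELIHOOD[likelihood] * IMPACT[impact]
    if score > 50:
        level = "High"
    elif score > 10:
        level = "Medium"
    else:
        level = "Low"
    return score, level

# Print the 3 x 3 risk-level matrix (rows: likelihood, columns: impact).
for lk in LIKELIHOOD:
    row = [f"{risk_level(lk, im)[1]} ({risk_level(lk, im)[0]:g})" for im in IMPACT]
    print(f"{lk:>6}: " + ", ".join(row))
```

Note that only the High/High combination lands in the High band; this coarseness is one reason to ask whether such a matrix supports risk management or merely a first-cut risk analysis.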
- Lecture 44 (05/07/14) : [24/01-25/xx]
Sequence 25 (pdf):
Risk Staging
- Lecture 45 (05/09/14) : [25/xx-26/15]
Sequence 26 (pdf):
Risk case study: Firewall
- Final exam slot: May 14, 10 am-12 pm. We may make it a take-home exam instead.
- Reading Assignments (so far):
- Note: besides the reading assignments below there are references to papers in the slides. These papers should be looked at as well!
- 1) Fault-Tolerant Computing: Fundamental Concepts, by Victor P. Nelson, Computer, Issue 7, Pages 19-25, 1990. *
- 2) Internet Security: An Intrusion-Tolerance Approach, by Yves Deswarte and David Powell, Proceedings of the IEEE, Vol. 94, Issue 2, 2006.
- 3) The Byzantine Generals Problem, by Leslie Lamport, Robert Shostak and Marshall Pease,
ACM Transactions on Programming Languages and Systems, Volume 4, Issue 3, (July 1982).
This paper is mainly for students who have not taken CS449/549
and will bring them up to speed on topics related to fault models.
We will discuss their limitations in hostile environments later.
- 4) Thambidurai, P., and You-Keun Park, "Interactive Consistency with Multiple Failure Modes",
7th Symposium on Reliable Distributed Systems, 1988. Only read up to section 3.
There is an interesting follow-up paper, "Verification of Hybrid Byzantine Agreement Under Link Faults",
by P. Lincoln and J. Rushby that addresses a problem in the algorithm of Thambidurai and Park.
- 5) M.H. Azadmanesh and R.M. Kieckhafer, "Exploiting Omissive Faults in Synchronous Approximate Agreement",
IEEE Transactions on Computers, Volume: 49, Issue: 10, 2000.
- 6) Krings Axel and Zhanshan (Sam) Ma, "Surviving Attacks and Intrusions: What can we Learn from Fault Models",
Proceedings of the 42nd Hawaii International Conference on System Sciences, (HICSS-42) ,
Waikoloa, Big Island, Hawaii, January 5-8, 2009.
- 7) Basic Concepts and Taxonomy of Dependable and Secure Computing, Algirdas Avizienis, Jean-Claude Laprie,
Brian Randell, and Carl Landwehr,
IEEE Transactions on Dependable and Secure Computing, Vol. 1, No. 1, January-March 2004.
- 8) Survivable Network Analysis Method, (CMU-report-00tr013.pdf).
- 9) A Case Study in Survivable Network System Analysis, (CMU-report-98tr014.pdf)
- 10) [Whi93] Whittaker James A., and J.H. Poore, Markov Analysis of Software Specifications,
ACM Transactions on Software Engineering and Methodology, Vol.2, No.1,
January 1993, pp. 93-106.
- 11) Case study 1: A Two-Layer Approach to Survivability of Networked Computing Systems, Krings A.W., et al.
(pdf)
- 12) Case study 2: A. Krings, A. Serageldin and A. Abdel-Rahim, "A Prototype for a Real-Time Weather Responsive System"
(pdf)
- 13) Here are two pointers to papers. The original RAID paper is this one: Patterson, D.A., et al., "A Case for Redundant Arrays of Inexpensive Disks (RAID)",
ACM SIGMOD Record, International Conference on Management of Data, Vol. 17, No. 3, pp. 109-116, June 1988.
Note: this is only a background paper (keep the date (1988) in mind when you read this).
A great overall paper about RAID is this: RAID: High-Performance, Reliable Secondary Storage,
by Peter M. Chen, Edward K. Lee, Garth A. Gibson, Randy H. Katz, David A. Patterson, ACM Computing Surveys, 1994.
- 14) Survivable Storage, CMU Tech. Report CMU-CS-01-120.
Also look at "Decentralized Recovery for Survivable Storage Systems", Theodore Ming-Tao Wong, May 2004, CMU-CS-04-119.
- 15) Adi Shamir, "How to Share a Secret", Communications of the ACM, Vol. 22, No. 11, November 1979.
- 16) An Adaptive N-variant Software Architecture for Multi-Core Platforms: Models and Performance Analysis,
by Li Tan and Axel Krings, Proc. 11th Intl. Conference on Computational Science and its Applications (ICCSA 2011), June 20-23, 2011.
(*)
- 17) SITAR: A Scalable Intrusion-Tolerant Architecture for Distributed Services,
by Feiyi Wang, Fengmin Gong, Chandramouli Sargor, Katerina Goseva-Popstojanova, Kishor Trivedi, Frank Jou,
Proc 2001 IEEE Workshop on Information Assurance and Security, United States Military Academy, West Point, NY, 5-6 June, 2001
- 18) A General Framework for Network Survivability Quantification, by Y. Liu and Kishor Trivedi, Proc. 12th GI/ITG MMB, 2004.
- 19) Krings Axel, Jean-Louis Roch, Samir Jafar and Sebastien Varrette,
"A Probabilistic Approach for Task and Result Certification of Large-scale Distributed Applications in Hostile Environments",
Proc. European Grid Conference (EGC2005), in LNCS 3470, Springer Verlag, February 14-16, 2005.
(pdf)
- Assignments: