CS 448/548: Survivable Systems and Networks
This page is ALWAYS under construction!!!
Welcome to CS448/548 Survivable Systems and Networks.
This course is offered in the Spring Semester 2017 at the
University of Idaho.
The course is taught by
Dr. Axel Krings.
The web site used the last time the course was taught can be viewed
here,
but be aware that each semester the format and material will change
to reflect the dynamic behavior of the research area.
This web page
contains information about the course, e.g., syllabus, class notes, pointers
to interesting places, etc.
Material can be downloaded in PDF and/or PostScript format and will be made
available in updated form as the class goes on.
If you have comments, please let me know.
Imagine what would happen if our critical infrastructures were compromised by malicious acts: failure of communications, power, water, gas, banking & finance, emergency services, etc.
With increasing computer security concerns and the recognition of the vulnerability of our critical infrastructure to cyber terrorism, achieving Survivability of Systems under attack is vital in computing and networked systems, whether it is the systems themselves or the critical applications or infrastructures they control.
This course will focus on malicious acts and other faults and their impacts on systems, as well as techniques useful in the design of systems that can survive such acts.
Survivability goes beyond computer & network security or fault-tolerance.
The range of threats to survivability that must be considered is enormous, including hardware malfunctions, software flaws, environmental hazards, and malicious and accidental human acts.
However, we will also expand our view to include resilient systems and intrusion tolerant systems.
These terms are actually closely related and have common attributes.
But can one really design systems that can survive attacks, tolerate intrusions or be resilient?
You would be surprised to find out that there is an entire research area that deals with exactly that.
Don't think of your laptop becoming invincible (no James Bond scenarios here).
Think bigger: think of models that help analyze systems, model reliability, identify essential services, and explore the limits of redundancy and the assumptions under which it will or will not work.
Think of what kind of faults or attack scenarios those systems may be subjected to.
Now tap into the vast number of tools and solutions that exist, including agreement algorithms, N-version & N-variant software, new Hybrid Fault Models, new analysis approaches, etc., and start designing your system!
Course description:
This course discusses issues of Survivability, Attributes of System
Survivability, Trustworthiness, Dependability and Assurance, Threats to
Survivability, Threats to Security, Threats to Reliability, Threats to
Performance, Requirements and Their Interdependence, Systemic Inadequacies,
Approaches for Overcoming Deficiencies, Evaluation Criteria, Attempts
at Standardization, Architectures for Survivability, Implementing and Configuring
for Survivability.
However, we will not limit ourselves to the term "survivability"; we will also look at contemporary issues of resilient systems, which are closely related in their goals.
A wealth of literature has surfaced that deals with issues of system
survivability.
This class will be taught in several phases in which material
will be presented by the instructor and literature will be reviewed by
individual or groups of students.
The results will be individual and group
presentations as well as discussions of contemporary issues.
The exact list of topics and the class format are not final and remain a work in progress.
- Contact information:
- Axel Krings (PhD), JEB 320,
- Phone: 208-885-4078, fax: 208-885-9052.
- Engineering outreach students: dial toll free 800-824-2889 ext 4078
- Mailing address: Engineering Outreach, PO Box 441014,
Moscow, Idaho 83844-1014.
- Office Hours:
(see here)
- Class time: MWF 2:30-3:20pm room JEB 26.
- Syllabus
- Spring 2017 Term Class Handouts:
- The handouts are ordered by sequence number, and the material covered in each lecture is indicated next to the date.
Specifically, the numbers in brackets indicate the slides covered during class, i.e., [a/b-c/d] indicates that the material covered runs from sequence a (slide b) to sequence c (slide d).
- If there are any problems with accessing the handouts,
please let me know (email, phone, smoke signals, drums, ...)!
- Corrections: some slides may contain formatting errors, typos etc.
which have been addressed in class, but have not been reflected
in the notes posted here.
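A minimal sketch of how the [a/b-c/d] labels next to the lecture dates could be read programmatically. The function names are hypothetical, and treating "xx" as a slide number that has not been filled in yet is an assumption based on the schedule entries, not something stated on this page.

```python
from typing import Optional, Tuple

# A position is (sequence, slide); None marks a part that is unknown ("xx").
Position = Tuple[Optional[int], Optional[int]]


def _to_int(text: str) -> Optional[int]:
    """Return the number, or None for placeholders such as 'xx'."""
    text = text.strip()
    return int(text) if text.isdigit() else None


def _parse_position(text: str) -> Position:
    """Parse 'a/b' into (a, b); a bare 'xx' yields (None, None)."""
    parts = text.split("/")
    sequence = _to_int(parts[0])
    slide = _to_int(parts[1]) if len(parts) > 1 else None
    return sequence, slide


def parse_slide_range(label: str) -> Tuple[Position, Position]:
    """Parse a label like '[1/7-2/08]' into (start, end) positions."""
    inner = label.strip().strip("[]")
    start, end = inner.split("-", 1)
    return _parse_position(start), _parse_position(end)


if __name__ == "__main__":
    for label in ["[1/7-2/08]", "[1/5-1/xx]", "[xx-16/46]"]:
        print(label, parse_slide_range(label))
```

For example, the label [1/7-2/08] reads as "from sequence 1, slide 7, to sequence 2, slide 8".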
- Course syllabus: to be discussed in class.
- Lecture Support Material: Note that this represents only a subset of the issues presented in class!
Whereas the information below gives the general information about the schedule of the lectures,
it does not always indicate the specific approaches, methods, mechanisms, basic concepts and building blocks.
These are derived using the reading assignments as "case studies"; the concepts are introduced as we discuss the papers.
Note that we will stretch out the material of the first few
classes in order to address background issues raised during
the presentation of the papers. This will especially help
students who have not taken computer security and fault-tolerant systems.
However, please do not confuse hand-waving with in-depth knowledge!
- Lecture 1 (01/11/17): [1/1-1/04]
Sequence 1 (pdf):
Introduction to the course. Discussion of a scenario exposing some of the key issues facing systems exposed to faults.
The class period mainly uses the board. [Reading Assignment 1]
- Lecture 2 (01/13/17): [1/5-1/xx]
Sequence 2 (pdf):
Introduction cont.: [mainly using examples on the board], survivability, intrusion tolerance, resilience, fault-tolerance...
Fault-tolerance primer, Standard Definitions, Assumptions and their Limitations.
Main discussion focus is on fault, error, failure, as well as independence-of-fault-assumption (or common-mode faults).
This also includes understanding the limitations of testing and the Test-vector Generation Problem,
which is NP-hard (even for non-sequential circuits). [Reading Assignment 2]
- Lecture 3 (01/18/17): [1/7-2/08]
Sequence 3 (pdf):
Preparation for Reading Assignment 2. Make sure you really read these assignments or you will lose out on developing a feeling for the topic.
More on definitions related to fault-tolerance and background on why many solutions from that field may or may not be suitable to address our malicious aspects.
[Reading Assignment 3]
- Lecture 4 (01/20/17): [2/09-3/07]
Survivability definitions, their specific powers or limitations, Security: An Intrusion-tolerant approach.
- Lecture 5 (01/23/17): [3/08-3/17]
Sequence 4 (pdf):
Very Important: look closely at Reading Assignment 3, as it will be the basis for Fault model classifications and what this really means in malicious environments.
- Lecture 6 (01/25/17): [3/18-3/24] Discussion based on Reading Assignment 2.
- Lecture 7 (01/27/17): [4/01-4/20]
Sequence 5 (pdf):
Introduction to agreement algorithms and Fault Models.
- Lecture 8 (01/30/17): [4/20-4/34]
Byzantine Agreement (Lamport paper) cont., [Reading Assignment 4]
- Lecture 9 (02/01/17): [5/01-5/xx]
Hybrid Fault Models, [based on Reading assignment 4].
- Lecture 10 (02/03/17): [5/xx-6/10]
Fault Models and Data Aggregation
Sequence 6 (pdf):
[Reading Assignment 5]
- Lecture 11 (02/06/17): [6/11-6/29]
Fault models, approximate agreement and conversion.
[Reading Assignment 6]
- Lecture 12 (02/08/17): [6/30-7/08]
Sequence 7 (pdf):
Based on Reading Assignment 6: what faults should the application tolerate, and what can the infrastructure provide?
Looking at partially connected topologies.
Local versus global convergence.
- Lecture 13 (02/10/17): [7/09-7/16]
Sequence 8 (pdf):
Discussion on the concept of Design for Analyzability, Reliability Block Diagrams, their dual, i.e., Fault Trees, and how useful or limited they are in our context.
Concepts and Taxonomy of Dependable and Secure Computing, [Reading Assignment 7]
- Lecture 14 (02/13/17): [7/17-7/32]
Unpredictable, latent, Unobserved and Unobservable Risks, in the context of the 3-layer survivability analysis architecture [Ma & Krings 2008].
- Lecture 15 (02/15/17): [8/01-8/xx] Material from reading assignment 7.
Continuation of Unpredictable, latent, Unobserved and Unobservable Risks.
- Lecture 16 (02/17/17): [8/xx-8/66]
Sequence 9 (pdf):
Survivable Network (System) Analysis Method, [Reading Assignment 8 & 9].
There have been different variants, but it started out here.
- Lecture 17 (02/22/17): [9/01-9/07]
[class canceled due to illness],
Survivable Systems Analysis preliminary discussion. SSA extensions, e.g., including Risk Assessment.
- Lecture 18 (02/24/17): [9/08-9/44]
Sequence 10 (pdf):
SSA Case Study.
- Lecture 19 (02/27/17): [9/45-10/xx]
CS548 semester project is posted below!
Lessons learned, limitations of SSA, SSA derivatives,
Case studies. [Reading Assignment 10]
- Lecture 20 (03/01/17): [10/01-11/07]
Sequence 11 (pdf):
Dealing with patterns, e.g., intrusion detection systems,
finishing up discussion on SSA Case Studies listed in Sequence 10.
- Lecture 21 (03/03/17): [11/08-11/19]
Sequence 12 (pdf):
Dealing with patterns, e.g., intrusion detection systems
- EXAM 1 (03/06/17): 50 minutes, closed notes.
- Lecture 22 (03/08/17): [12/01-12/21]
Sequence 13 (pdf):
Background material on Markov chains (needed for reading assignment by J. Whittaker and J.H. Poore and an upcoming reading assignment by Y. Liu and K. Trivedi).
- Lecture 23 (03/10/17): [13/01-13/14]
Markov Analysis of Software Specifications, based on Reading Assignment 10.
- Lecture 24 (03/20/17): [13/15-14/03]
Sequence 14 (pdf):
Exam-1 review,
Decentralizing services, Case Study 1: Real-time attack recognition.
Dealing with Patterns cont.: Case study based on [Reading Assignment 11]
- Lecture 25 (03/22/17): [14/04-14/xx]
Redundancy case study: lessons learned, DoS detection and recovery case study [from Reading Assignment 11]
[Reading Assignment 12 posted]
- Lecture 26 (03/24/17): [14/xx-14/32]
Sequence 15 (pdf):
Profiling-based DoS detection and recovery (case study cont.) [Reading Assignment 12]
- Lecture 27 (03/27/17): [15/01-15/15]
Attack recognition continued.
Case study: real-time control application: ITS (Intelligent Transportation System)
- Lecture 28 (03/29/17): [15/16-15/34]
Sequence 16 (pdf):
Decentralized Services: case study background: RAID.
Whereas RAID systems do not provide survivability w.r.t. malicious acts,
they do provide fault-tolerance, and the concepts will be expanded in later case studies. (Note: this will be only a brief outline of the material.)
[Reading Assignment 13]
- Lecture 29 (03/31/17): [15/35-16/xx]
Sequence 17 (pdf):
Check out the assignment posted and try to get an early start due to the tight due date.
Decentralized Services: case study Survivable Storage
[Reading Assignment 14]
- Lecture 30 (04/03/17): [xx-16/46]
RAID Systems (in preparation for survivable storage)
- Lecture 31 (04/05/17): [17/01-17/15]
Finishing up RAID, Introduction to Survivable Storage.
[Reading Assignment 15]
- Lecture 32 (04/07/17): [17/16-17/33]
Sequence 18 (pdf):
Survivable Storage cont.
- Lecture 33 (04/10/17): [18/01-18/05]
How to share a secret. Derived on board.
- Lecture 34 (04/12/17): [19/01-19/08]
Sequence 19 (pdf):
Case study: Survivability architecture. Concepts:
N-version and N-variant executions,
[based on Reading Assignment 16]
- Lecture 35 (04/14/17): [19/09-19/xx]
N-variant executions using multi-core environments, different approaches of the literature.
- Lecture 36 (04/17/17): [19/xx-xx]
Background on Petri Nets (see Fault-Tolerance course sequence 11 and Petri Nets sequence 12)
and Probabilistic Automata.
- Exam-2 (04/19/17): Covering material up to, and including, sequence 18.
- Lecture 37 (04/21/17): [19/xx-19/xx]
Conceptual design: how to assess the feasibility of survivability by evaluating whether reliability specifications can theoretically be achieved,
moving from evaluating concepts towards implementation.
For this we will look at Petri Nets as a tool to give us a general idea of what to expect.
[Reading Assignment 17]
- Lecture 38 (04/24/17): [19/xx-19/46]
Sequence 20 (pdf):
How to use Petri Nets and Probabilistic Automata for our cause.
[Reading Assignment 17]
- Lecture 39 (04/26/17): [20/01-20/xx]
Decentralized Services: case study SITAR
- Lecture 40 (04/28/17): [21/01-21/06]
Sequence 21 (pdf):
Survivability Quantification, Markov Models, Transient and Steady State solutions and the connection to the
T1A1.2 definition of survivability.
Survivability quantification, case study telephone system, analysis using common survivability definitions,
Performance model, Availability model, Composite model, [Reading Assignment 18]
- Lecture 41 (05/01/17): [21/07-22/06]
Sequence 22 (pdf):
How do you know that your results of large computations have not been (massively) corrupted?
A probabilistic approach to Result Certification, [Reading Assignment 19]
- Lecture 42 (05/03/17): [22/xx-23/17]
Sequence 23 (pdf), Sequence 24 (pdf):
Risk background,
SP800-30 Risk Management Guide, Risk Management or Risk Analysis?
- Lecture 43 (05/05/17): [24/01-25/19]
Sequence 25 (pdf):
Risk Staging
- Final exam: Friday May 12, 3-5pm.
- Reading Assignments (so far):
- Note: besides the reading assignments below there are references to papers in the slides. These papers should be looked at as well!
- 1) Fault-Tolerant Computing: Fundamental Concepts, by Victor P. Nelson, Computer, Issue 7, Pages 19-25, 1990.
- 2) Internet Security: An Intrusion-Tolerance Approach, by Yves Deswarte and David Powell, Proceedings of the IEEE, Vol. 94, Issue 2, 2006.
- 3) The Byzantine Generals Problem, by Leslie Lamport, Robert Shostak and Marshall Pease,
ACM Transactions on Programming Languages and Systems, Volume 4, Issue 3, (July 1982).
This paper is mainly for students who have not taken CS449/549
and will bring them up to speed on topics related to fault models.
We will discuss their limitations in hostile environments later.
- 4) Thambidurai, P., and You-Keun Park, "Interactive Consistency with Multiple Failure Modes",
7th Symposium on Reliable Distributed Systems, 1988. Only read up to section 3.
There is an interesting followup paper "Verification of Hybrid Byzantine Agreement Under Link Faults",
by P. Lincoln and J. Rushby that addresses a problem in the algorithm of Thambidurai and Park.
- 5) Azadmanesh, M.H. and Kieckhafer, R.M., Exploiting omissive faults in synchronous approximate agreement,
IEEE Transactions on Computers, Volume: 49, Issue: 10, 2000.
- 6) Krings Axel and Zhanshan (Sam) Ma, "Surviving Attacks and Intrusions: What can we Learn from Fault Models",
Proceedings of the 42nd Hawaii International Conference on System Sciences, (HICSS-42) ,
Waikoloa, Big Island, Hawaii, January 5-8, 2009.
- 7) Basic Concepts and Taxonomy of Dependable and Secure Computing, Algirdas Avizienis, Jean-Claude Laprie,
Brian Randell, and Carl Landwehr,
IEEE Transactions on Dependable and Secure Computing, Vol. 1, No. 1, January-March 2004.
- 8) Survivable Network Analysis Method, (CMU-report-00tr013.pdf).
- 9) A Case Study in Survivable Network System Analysis, (CMU-report-98tr014.pdf)
- 10) [Whi93] Whittaker James A., and J.H. Poore, Markov Analysis of Software Specifications,
ACM Transactions on Software Engineering and Methodology, Vol.2, No.1,
January 1993, pp. 93-106.
- 11) Case study 1: A Two-Layer Approach to Survivability of Networked Computing Systems, Krings A.W., et al.
(pdf)
- 12) Case study 2: A. Krings, A. Serageldin and A. Abdel-Rahim, "A Prototype for a Real-Time Weather Responsive System"
(pdf)
- 13) Here are two pointers to papers. The original RAID paper is this one: Patterson, D.A., et al., "A Case for Redundant Arrays of Inexpensive Disks (RAID)",
ACM SIGMOD Record, International Conference on Management of Data, Vol. 17, No. 3, pp. 109-116, June 1988.
Note: this is only a background paper (keep the date (1988) in mind when you read this).
A great overall paper about RAID is this: RAID: High-Performance, Reliable Secondary Storage,
by Peter M. Chen , Edward K. Lee , Garth A. Gibson , Randy H. Katz , David A. Patterson, ACM Computing Surveys, 1994.
- 14) Survivable Storage, CMU Tech. Report CMU-CS-01-120.
Also look at "Decentralized Recovery for Survivable Storage Systems", Theodore Ming-Tao Wong, CMU-CS-04-119, May 2004.
- 15) Adi Shamir, "How to Share a Secret", Communications of the ACM, Vol. 22, No. 11, November 1979.
- 16) An Adaptive N-variant Software Architecture for Multi-Core Platforms: Models and Performance Analysis,
by Li Tan and Axel Krings, Proc. 11th Intl. Conference on Computational Science and its Applications (ICCSA 2011), June 20-23, 2011.
(*)
- 17) SITAR: A Scalable Intrusion-Tolerant Architecture for Distributed Services,
by Feiyi Wang, Fengmin Gong, Chandramouli Sargor, Katerina Goseva-Popstojanova, Kishor Trivedi, Frank Jou,
Proc 2001 IEEE Workshop on Information Assurance and Security, United States Military Academy, West Point, NY, 5-6 June, 2001
- 18) A General Framework for Network Survivability Quantification, by Y. Liu and Kishor Trivedi, Proc. 12th GI/ITG MMB, 2004.
- 19) Krings Axel, Jean-Louis Roch, Samir Jafar and Sebastien Varrette,
"A Probabilistic Approach for Task and Result Certification of Large-scale Distributed Applications in Hostile Environments",
Proc. European Grid Conference (EGC2005), in LNCS 3470, Springer Verlag, February 14-16, 2005.
(pdf)
- Exam 1 Prep: (pdf)
- Exam 2 Prep: (pdf)
- Exam 3 Prep: (pdf)
- CS548 Project: (pdf)
- CS448 Assignment 1: (pdf)
- CS448 Assignment 2: (is the Exam 2 Prep) (pdf)