Overview Motivation ECE 753: FAULT-TOLERANT About the Course and - - PDF document



1/20/2014 1

ECE 753: FAULT-TOLERANT COMPUTING

Kewal K. Saluja

Department of Electrical and Computer Engineering

Motivation and Introduction

Lecture Set 1

Overview

  • Motivation
  • About the Course and the Instructor

– Conduct, Outline, Coursepack

  • Introduction
  • Terminology and definitions

ECE 753 Fault Tolerant Computing 2

– Sources, Overview and Comments
– System defined

  • Dependability/Security and their attributes
  • Threat to dependability and modeling FEF chain
  • Means to attain dependability
  • Fundamental Principles

Motivation

  • Informal Definition
  • Key Attributes
  • Who, What and Why Study

  • Examples

Motivation

  • What is Fault-Tolerance?

A “fault-tolerant system” is one that continues to perform at the desired level of service in spite of failures in some of the components that constitute the system.

Motivation (contd.)

  • Key attributes

Fault - Error - Failure

Performance - Availability - Reliability

More recently, the concept of “survivability”

Including these constraints at the design stage is likely to be more cost-effective.

Motivation (contd.)

  • Who is concerned about fault-tolerance?

– System Users – irrespective of the application but some are a lot more concerned than others

  • Who is concerned at design stages?

– Universities

  • R, d, and a (Research, development, applications)

– Industry

  • r, D, and A (research, Development, Applications)
  • Issues

– Design, Analysis/Validation, Implementation, Testing/Validation, Evaluation

Motivation (contd.)

Examples

  • General Purpose Systems

– PCs: RAMs with parity checks and possibly ECC (consideration of re-execution on failure detection is being investigated)
– Workstations/Servers: error detection (HW), occasional corrective action (SW), even ECC (HW), keeping logs (SW)

Motivation (contd.)

Examples

  • Reliable Systems

– Telephone systems
– Banking systems, e.g. ATM
– Stock market
– CAE - exams/projects
– Football games - display/ticketing

Motivation (contd.)

Examples

  • Critical and Life Critical Systems

– Manned and unmanned space borne systems
– Aircraft control systems
– Nuclear reactor control systems
– Life support systems

Motivation (contd.)

Examples

  • Reliable -> Critical Systems

– 911 telephone switching system
– Traffic light control system
– Automotive control systems (ABS, fuel injection system)

About the Course and the Instructor

  • Conduct

– homeworks, exam, project, grading

  • Outline

  • Coursepack

– references and reading list

Introduction

– Historical perspective and major push
– New initiatives
– Goals of fault-tolerance
– Applications of fault-tolerance

Introduction (contd.)

  • Historical Perspective

– not a new concept – first use by J. von Neumann, 1956

  • “Probabilistic Logics and the Synthesis of Reliable Organisms from Unreliable Components,” Annals of Mathematics Studies, Princeton University Press

  • Major push

– Space program
– HW fault tolerance - then
– SW fault tolerance - later
– Merge the two

Introduction (contd.)

  • New initiatives

– Density of devices -> more failures likely
– Power issue – scheduler, on-chip sensors
– Failures due to soft errors, lifetime degradation

  • hardening, re-execution
  • on-chip ECC
  • reconfiguration
  • microarchitectural solutions
  • architectural solutions

Introduction (contd.)

  • New initiatives (contd.)

– Deep submicron technology and time-to-market pressure -> designs not fully verified
– Implementation of numerous functionalities on chip/board/system -> possibility of system hang-up
– Speculative execution -> results may need to be re-checked
– Low cost of HW and SW -> affordable/economical

  • Hot issues: soft errors, lifetime failures, power and thermal management

Introduction (contd.)

  • Goals - different goals for different applications

The key word is “reliability” – it has a different meaning for different users and applications

  • Intuitive explanations

– Dependability
– Service
– Specification

Introduction (contd.)

  • Intuitive concepts

– Reliability – continues to work
– Availability – works when I need it
– Safety – does not put me in jeopardy
– Performability
– Maintainability
– Testability
– Survivability – will the system survive catastrophic events?
– Security

Introduction (contd.)

  • Applications

– Space borne system

  • long life system

– Airplane control system

  • critical system

– Transaction processing system

  • high availability system

– Switching system

  • high availability over certain level of performance

Terminology and definitions

  • Reliability and concept of probability

– R(t): conditional probability that a system provides continuous proper service in the interval [0,t] given that it provided desired service at time 0.

  • Availability
  • Performability

– An Example

  • Dependability
  • Security
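
As a numeric illustration of R(t) and availability, here is a minimal sketch. It assumes a constant failure rate (the exponential model R(t) = e^(-λt)) and the steady-state availability ratio MTTF / (MTTF + MTTR); neither model is stated on the slide, and the function and parameter names are hypothetical.

```python
import math


def reliability(t_hours: float, failure_rate: float) -> float:
    """R(t) under an ASSUMED constant failure rate: R(t) = exp(-lambda * t).

    R(0) = 1 matches the slide's condition that the system provided
    the desired service at time 0.
    """
    if t_hours < 0 or failure_rate < 0:
        raise ValueError("time and failure rate must be non-negative")
    return math.exp(-failure_rate * t_hours)


def steady_state_availability(mttf_hours: float, mttr_hours: float) -> float:
    """Long-run fraction of time the system is up (ASSUMED model):
    mean time to failure / (mean time to failure + mean time to repair)."""
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTTF must be positive and MTTR non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)
```

Under these assumptions a system can have high availability but modest reliability: frequent failures that are repaired quickly keep the availability ratio high while R(t) drops fast.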

Sources, Overview and Comments (1/4)

Key reference:

  • Algirdas Avizienis, Jean-Claude Laprie, Brian Randell, and Carl Landwehr, Basic Concepts and Taxonomy of Dependable and Secure Computing, IEEE Transactions on Dependable and Secure Computing, Vol. 1, No. 1, Jan-Mar 2004.

Other references:

  • Israel Koren and C. Mani Krishna, Fault Tolerant Systems, Elsevier, 2007.
  • D. K. Pradhan, editor, Fault-Tolerant Computer System Design, Prentice-Hall, 1996.
  • B. W. Johnson, Design and Analysis of Fault Tolerant Digital Systems, Addison-Wesley, First edition, 1989.
  • My course (Fault-Tolerant Computing) URL: http://homepages.cae.wisc.edu/~ece753/INFO.html

Sources, Overview and Comments (2/4)

  • What does the paper cover?

– Very basic definitions of the terminology used in dependable computing
– It categorizes definitions in three groups

  • System, attributes of dependability, threats to dependability

– Covers very briefly methods to attain dependability

Sources, Overview and Comments (3/4)

  • How to read the paper?

– It is easy to read – scan it first and then read it
– I have organized the material differently – you may find it helpful

  • What is not covered?

– One attribute almost missing - survivability
– Basic methods of fault tolerance and their characterization

Sources, Overview and Comments (4/4)

  • Chronology of Developments

– Need for fault-tolerance - inception of the space program (recall “Voyager” launched in 1977 is still sending signals)
– First standard glossary in 1985
– Integration of performance etc. into fault tolerance, and hence the term “Dependability” – book published in 1992
– Recognition of “Security” as a basic attribute of dependability – this paper in 2004

System Defined (1/4)

  • “. . . an entity that interacts with other entities”

– First entity (system) – limited to be “electronic (mostly digital)” or “computer based”
– Second entity

  • Hardware, software, human, other systems, .. (can also be called “environment”)

  • Characterization and fundamental properties

– Functionality
– Performance
– Dependability and security
– Cost (usability, manageability, adaptability: not directly included in the paper)

System Defined (2/4)

  • Function – “what the system is intended to do”

– Functional specifications: describe it in terms of functionality and performance
– Behavior – described as a sequence of states to implement the functionality
– Total states – set of states as system evolves

  • Internal states
  • External states – as viewed by the environment and users
  • Structure – “what enables system behavior (function)”

– Interconnected components – recursively defined to “atomic” level

System Defined (3/4)

  • System Life Cycle

– Development phase
– Use phase

  • Service – what is delivered by the system to its “environment” (user)

– Environment sees only the “external states”
– Development phase – activities from concept to the decision that the system is ready for the “use phase”
– Use phase – more meaningful; includes service delivery, service outage, service shutdown, maintenance

System Defined (4/4)

  • Development phase environment

– Physical world
– Human developers
– Development tools
– Production and test facilities

  • Use phase environment

– Physical world
– Administrators, maintainers
– Users and intruders
– Providers and infrastructure

Dependability/Security Attributes (1/6)

  • Original definition: “ability to deliver service that can justifiably be trusted”
  • Encompassing the following attributes

– Availability
– Reliability
– Safety
– Integrity
– Maintainability

Dependability/Security Attributes (2/6)

  • New definition: “ability to avoid service failures that are more frequent or more severe than is acceptable” - deliver service that can justifiably be trusted

  • Reason for modification

– Security related issues
– This recognizes that a system can fail, and usually does fail, and can still be called dependable
– This definition also enables a connection with “development failures”

Dependability/Security Attributes (3/6)

Dependability

  • availability: readiness for correct service.
  • reliability: continuity of correct service.
  • safety: absence of catastrophic consequences on the user(s) and the environment.
  • integrity: absence of improper system alterations.
  • maintainability: ability to undergo modification and repairs.

When addressing security, an additional attribute:

  • confidentiality: the absence of unauthorized disclosure

Dependability/Security Attributes (4/6)

Security is the concurrent existence of a composite of the attributes
1) availability (for authorized actions only),
2) confidentiality, and
3) integrity (with “improper” meaning “unauthorized”)

Dependability/Security Attributes (5/6)

[Figure]

Dependability/Security Attributes (6/6)

  • Other related concepts – summarized in table (Fig 15) - these are

– Dependability
– High confidence
– Survivability
– Trustworthiness

  • Example: all these have similar goals such as 1) ability to deliver service, 2) predictable service, 3) fulfill mission, 4) assurance of expected service delivery

Threats and modeling threats (1/12)

  • Different phases are open to different types of threats – generally termed “faults”
  • Faults lead to “errors” – a total state of the system different from the “true total state”
  • Errors can lead to “failure” – the service deviates from the desired service
  • This creates a FEF chain – a hierarchical phenomenon (see next and more later)

Threats and modeling threats (2/12)

Fault activation – Error manifestation – Failure

[Diagram: fault -> error -> failure]

Fault – active or dormant

Error – masked or latent

Failure – incorrect response

Threats and modeling threats (3/12)

FEF chain in a hierarchy

Threats and modeling threats (4/12)

Fault classes

  • Groups (not exclusive)

– Development, Physical (that affect hardware - I disagree with this definition), Interaction

  • Viewpoints:

– phase, system boundary, cause, dimension, objective, intent, capability, persistence

Threats and modeling threats (5/12)

Fault Taxonomy and Examples

– Production defect: physical, hardware, natural
– Bug: physical, software, natural
– Omission (absence of an action): human-made, system generated
– Malicious (meant to cause harm): human-made, hardware or software

Notes:

  • 1. Paper has a classification – Fig 4 and 5
  • 2. Examples and definitions of many other faults are given. Some are listed on the next slide

Threats and modeling threats (6/12)

Fault Taxonomy (contd.)

– Permanent faults
– Intermittent faults – repeat at some interval
– Transient faults – no specific interval
– Malicious logic faults – caused by natural faults
– Intrusion attempts – caused by humans
– Interaction faults – may be development phase or use phase
– Configuration faults – incorrect setting of parameters

Threats and modeling threats (7/12)

Error classes

  • Detected
  • Latent

An example

– An adder gives an incorrect sum for certain operands
– The fault is active when those operands appear, otherwise it is dormant
– The incorrect sum is latent unless used or checked for correctness
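
The adder example can be made concrete with a small sketch. The specific fault (bit 0 of the sum stuck at 0) and the mod-3 residue check are assumptions chosen for illustration; the slide does not say which operands trigger the fault or how the sum is checked.

```python
class ErrorDetected(Exception):
    """Raised when a check turns a latent error into a detected one."""


def faulty_add(a: int, b: int) -> int:
    """Adder with a HYPOTHETICAL fault: bit 0 of the result is stuck at 0."""
    return (a + b) & ~1


def fault_is_active(a: int, b: int) -> bool:
    """The fault is active only for operands whose true sum is odd;
    for all other operands it stays dormant."""
    return faulty_add(a, b) != a + b


def checked_add(a: int, b: int) -> int:
    """Without this check an incorrect sum stays latent; the residue
    comparison (mod 3, an assumed checker) detects it."""
    s = faulty_add(a, b)
    if s % 3 != ((a % 3) + (b % 3)) % 3:
        raise ErrorDetected(f"sum {s} fails residue check for {a} + {b}")
    return s
```

With operands 2 and 4 the fault is dormant and the sum is correct; with 2 and 3 the fault is activated, `faulty_add` silently returns 4, and only `checked_add` reports the error.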

Threats and modeling threats (8/12)

Failure classes

  • Development failures
  • Service failures
  • Security failures

Threats and modeling threats (9/12)

  • Development failures – introduced during the development phase

– Human developers
– Tools
– Production facility
– Budgetary reasons
– Scheduling issue (time to market)

(basically the system delivered is a downgraded system)

Threats and modeling threats (10/12)

  • Service failures - delivery of incorrect service

– Four viewpoints

  • 1. Failure domain

– Content failure
– Timing failure – early or late delivery of the service(s)

  • Special case: silent failure, halt failure, crash failure
  • Erratic failure (like Byzantine failure)

Threats and modeling threats (11/12)

  • 2. Failure detectability

– Signal provided by some checking mechanism

  • Signaled failure
  • Unsignaled failure
  • False alarm
  • 3. Consistency

– Consistent failure – all services see the same data
– Inconsistent failure – different services see different data (like Byzantine failure)
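
The three detectability classes listed above can be read as a truth table over two observations: whether the service actually failed and whether the checking mechanism raised a signal. The sketch below encodes that reading; the function name and the "correct service" label for the fourth case are assumptions for illustration.

```python
def classify_detectability(service_failed: bool, signal_raised: bool) -> str:
    """Map (actual failure, checker signal) to a detectability class."""
    if service_failed and signal_raised:
        return "signaled failure"
    if service_failed:
        return "unsignaled failure"  # the checker missed a real failure
    if signal_raised:
        return "false alarm"  # the checker fired although service was correct
    return "correct service"
```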

Threats and modeling threats (12/12)

  • 4. Consequence of failure

– Need to rate the failure and hence develop criteria – examples:

  • Outage duration (availability related)
  • Lives being endangered (safety related)
  • Extent of corrupted service (integrity related)
  • Amount of information disclosed (confidentiality related)

Means to attain dependability (1/6)

  • Fault Prevention or Fault Avoidance
  • Improvement of development process
  • Elimination of causes that can induce faults
  • Fault Tolerance
  • Techniques and implementations (more later)

Means to attain dependability (2/6)

  • Fault Removal
  • Remove faults during development phase

– extensive simulation and validation

  • Testing
  • Deterministic testing
  • Random and statistical testing
  • Back to back testing

Test/validation quality: fault injection, design for test/verification

Means to attain dependability (3/6)

  • Fault Forecasting – evaluate the system behavior and then use one or more methods previously discussed to improve dependability
  • Qualitative evaluation
  • Quantitative evaluation
  • Use benchmarks
  • Use of simulators

Examples: 1) Error and failure logs

2) when and where commissioned

Means to attain dependability (4/6)

  • Fault Tolerance Techniques
  • Error detection - need redundancy
  • Duplicate execution
  • Use of parity
  • Checker programs and/or hardware
  • More later
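
As a minimal sketch of parity-based error detection, the code below appends one even-parity bit to a data word. The word layout (parity in the least significant bit) and the function names are assumptions for illustration, not a description of any particular memory design.

```python
def parity_bit(word: int) -> int:
    """Return 1 if the word has an odd number of 1 bits, else 0."""
    return bin(word).count("1") % 2


def encode_even_parity(word: int) -> int:
    """Append a parity bit so the codeword has an even number of 1 bits."""
    return (word << 1) | parity_bit(word)


def parity_ok(codeword: int) -> bool:
    """A codeword passes the check if its total number of 1 bits is even."""
    return bin(codeword).count("1") % 2 == 0
```

The single redundant bit detects any odd number of flipped bits but cannot locate them, and an even number of flips passes the check unnoticed. Correction requires more redundancy, as in ECC.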

Means to attain dependability (5/6)

  • Recovery - Key is redundancy
  • Error handling
  • Masking and compensation
  • Rollback
  • Rollforward
  • Fault handling
  • Diagnosis
  • Isolation
  • Reconfiguration
  • Initialization
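
Rollback, one of the error-handling options above, can be sketched as checkpoint-and-retry. The class, function, and parameter names below are hypothetical, and the sketch assumes the state can be deep-copied and that a validity check is available to detect the error.

```python
import copy


class CheckpointedState:
    """Holds a working state plus a saved copy to roll back to."""

    def __init__(self, state):
        self.state = state
        self._checkpoint = copy.deepcopy(state)

    def checkpoint(self):
        """Save the current state as the new recovery point."""
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        """Discard the current state and restore the last checkpoint."""
        self.state = copy.deepcopy(self._checkpoint)


def run_with_rollback(store, step, is_valid, max_retries=3):
    """Apply `step`; if the result is invalid, roll back and retry.

    Returns True once a valid state is reached and checkpointed,
    False if every attempt produced an invalid state.
    """
    for _ in range(max_retries):
        step(store.state)
        if is_valid(store.state):
            store.checkpoint()
            return True
        store.rollback()
    return False
```

Retrying after rollback only helps when the error does not recur (for example a transient fault); a permanent fault would exhaust the retries and has to be dealt with by fault handling such as isolation and reconfiguration.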

Means to attain dependability (6/6)

  • Key to fault tolerance
  • Break FEF chain
  • Use “redundancy” to improve “use phase” dependability and security

  • See next “fundamental principles”

Fundamental Principles

  • Hardware redundancy
  • Low level
  • High level

  • Software Redundancy
  • Time Redundancy
  • Information Redundancy

Fundamental Principles (contd.)

  • Hardware Redundancy - Low level

– logic level

  • Example 1 - Self checking circuits
  • Example 2 - Arithmetic code

A modular adder using the mathematical principle (A+B) mod k = ((A mod k) + (B mod k)) mod k

  • Hardware Redundancy - High level

– Triplicate or 5-copies as in space shuttle
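
The arithmetic-code principle (A+B) mod k = ((A mod k) + (B mod k)) mod k can be sketched as a residue check around an adder. The choice k = 3, the function name, and the injectable `adder` parameter are assumptions made so that a faulty adder can be simulated.

```python
def residue_checked_add(a: int, b: int, k: int = 3, adder=lambda x, y: x + y) -> int:
    """Add with `adder`, then verify the result against the residues mod k.

    The residue of the sum is computed independently of the main adder,
    so a disagreement indicates an error in one of the two paths.
    """
    s = adder(a, b)
    expected_residue = ((a % k) + (b % k)) % k
    if s % k != expected_residue:
        raise ArithmeticError(f"residue mismatch for {a} + {b}: got {s}")
    return s
```

The check misses any error whose magnitude is a multiple of k, so the choice of k trades checker cost against coverage.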

Fundamental Principles (contd.)

  • Software Redundancy

– Use two different programs/algorithms

  • Time Redundancy

– Re-compute or redo the task and compare the results
– May or may not use the same hardware/software

  • Information Redundancy

– backup information – Use of ECC

  • Question - What kind of FT is achieved?
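
Time redundancy, as described above, can be sketched as running a task twice and comparing the results. The function name is hypothetical, and the sketch assumes the task is deterministic and free of side effects so that re-execution is safe.

```python
def run_twice_and_compare(task, *args):
    """Execute `task` twice with the same inputs and compare the results.

    A disagreement signals an error; agreement does not prove correctness.
    """
    first = task(*args)
    second = task(*args)
    if first != second:
        raise RuntimeError("results disagree: possible transient fault")
    return first
```

This gives detection only, and only for faults that affect one of the two runs. A permanent fault on the same hardware/software can produce the same wrong answer twice, which bears on the question of what kind of fault tolerance is achieved.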

Fault-Error-Failure

  • Intuitive definitions
  • Origins of faults
  • Methods to break FEF chain

  • Attribute of faults

Fault-Error-Failure concept (contd.)

Intuitive definitions

  • Fault -

– An anomalous physical condition caused by a manufacturing problem, fatigue, external disturbance (intentional or unintentional), design flaw, …
– Causes

  • Error - Effect of activation of a fault
  • Failure - over-all system effect of an error

Fault -> Error -> Failure

Fault-Error-Failure concept (contd.)

Origins of faults

  • Physical device level (HW)
  • Logic level (HW)

  • Chip level (HW)
  • System level (HW/SW)

– interfacing, specifications, …

  • Why systems fail

Fault-Error-Failure concept (contd.)

Methods to break FEF chain

  • Flow FEF
  • Barriers

– Fault avoidance
– Fault masking
– Fault removal
– Fault forecasting

Fault-Error-Failure concept (contd.)

Attribute of faults

  • Cause
  • Nature

  • Duration
  • Extent
  • Value