System Dependability Robert Wierschke Seminar Prozesssteuerung und - PowerPoint PPT Presentation

System Dependability
Robert Wierschke
Seminar “Prozesssteuerung und Robotik” (Process Control and Robotics)
14 January 2009

Outline
■ Motivation
■ Dependability
  □ Definition
  □ Dependability attributes
  □ Attribute relevance
■ Threats
  □ Fault model
  □ Fault-error-failure
■ Attaining dependability
  □ Fault tolerance
  □ Redundancy
■ Software Dependability
■ Summary

Motivation
■ Deliver correct communication and computation services
■ First-generation computers used unreliable components
  □ Hardware concept
■ Consequences of system failure
  □ Economic
    ◊ Credit card authorization: $2.6 million per hour of downtime
    ◊ Airline reservation: $89,500 per hour of downtime
  □ Human life
    ◊ What happens if the on-board computer of an airplane crashes?

Definition: Dependability 1|2
[Merriam-Webster Online]
  dependability: capable of being depended on : reliable
  reliable: suitable or fit to be relied on : dependable
“the collective term used to describe the availability performance and its influencing factors : reliability performance, maintainability performance and maintenance support performance” [7]
  □ (Zuverlässigkeit)
  □ Focus: availability
    ◊ Strongly influenced by the telecommunication industry
■ Evolves over time
■ Depends on problem domain

Definition: Dependability 2|2
“the trustworthiness of a computing system which allows reliance to be justifiably placed on the service it delivers” [1]
  □ Behaves as specified
  □ Avoids hazards
■ Dependability Attributes
  □ Reliability (Funktionsfähigkeit)
  □ Availability (Verfügbarkeit)
  □ Safety (Sicherheit)
  □ Confidentiality (Vertraulichkeit)
  □ Integrity (Integrität)
  □ Maintainability (Wartbarkeit)

Reliability 1|2
■ Continuity of service
■ Probability R(t) that a system/component operates correctly throughout a time period t
■ Example:
  □ A space probe needs to operate correctly for the whole mission time.
■ Calculating reliability
  □ R(0) = 1, R(∞) = 0
  □ Failure probability Q(t) = 1 − R(t)
  □ Failure rate λ(t): number of failures per unit time during Δt
  □ For a constant failure rate λ: R(t) = e^{-λt} [3]
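As a minimal sketch of the constant-failure-rate case, the following Python snippet evaluates R(t) = e^{-λt} and Q(t) = 1 − R(t). The function names and the example failure rate are illustrative assumptions, not values from the slides.

```python
import math

def reliability(failure_rate: float, t: float) -> float:
    """R(t) = exp(-lambda * t) for a constant failure rate lambda."""
    if failure_rate < 0 or t < 0:
        raise ValueError("failure_rate and t must be non-negative")
    return math.exp(-failure_rate * t)

def failure_probability(failure_rate: float, t: float) -> float:
    """Q(t) = 1 - R(t)."""
    return 1.0 - reliability(failure_rate, t)

# Hypothetical component: 1e-4 failures per hour, 1000-hour mission.
r = reliability(1e-4, 1000.0)
q = failure_probability(1e-4, 1000.0)
print(f"R = {r:.4f}, Q = {q:.4f}")  # R = 0.9048, Q = 0.0952
```

The time unit of t must match the unit of the failure rate (here: hours).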

Reliability 2|2
■ Empirical values
■ Assuming independent components
  □ Series: R = R_1 · R_2 · … · R_n
  □ Parallel: R = 1 − (1 − R_1)(1 − R_2) … (1 − R_n) [3]
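For independent components, the standard series and parallel combinations can be sketched as below. The function names and the component reliabilities are assumptions chosen for illustration; a series system works only if every component works, while a parallel system works if at least one does.

```python
from math import prod

def series_reliability(reliabilities):
    """All components must work: R = product of R_i."""
    return prod(reliabilities)

def parallel_reliability(reliabilities):
    """At least one component must work: R = 1 - product of (1 - R_i)."""
    return 1.0 - prod(1.0 - r for r in reliabilities)

components = [0.9, 0.9, 0.9]  # hypothetical component reliabilities
print(series_reliability(components))    # about 0.729
print(parallel_reliability(components))  # about 0.999
```

Chaining components in series lowers the overall reliability below that of the weakest part, whereas parallel (redundant) components raise it.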

Availability 1|2
■ Readiness for usage
■ Probability A that a system/component operates correctly at any given point in time
■ Example:
  □ Telecommunication
■ Availability vs. reliability
  □ A service that crashes often but restarts instantly has high A but low R(t).
■ Number of 9s
  □ Availability and resulting downtime per year

    9s  Availability   Downtime per year
    1   90.0 %         36.5 d
    2   99.0 %         3.65 d
    3   99.9 %         8.76 h
    4   99.99 %        52.6 min
    5   99.999 %       5.26 min
    6   99.9999 %      31.5 s
    7   99.99999 %     3.15 s
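The downtime figures for a given number of nines follow from multiplying the unavailability by the length of a year. A small sketch, assuming a 365-day year (the helper name is hypothetical):

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumption: non-leap year

def downtime_seconds_per_year(nines: int) -> float:
    """Allowed downtime per year for an availability with `nines` nines."""
    unavailability = 10.0 ** (-nines)
    return SECONDS_PER_YEAR * unavailability

for n in range(1, 8):
    availability_pct = 100.0 * (1.0 - 10.0 ** (-n))
    print(f"{n} nines ({availability_pct:.5f} %): "
          f"{downtime_seconds_per_year(n):.2f} s/year")
```

Each additional nine divides the permitted downtime by ten, so six nines allow about 31.5 s and seven nines about 3.15 s per year.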

Availability 2|2
■ A = MTTF / (MTTF + MTTR)
  □ MTTF: mean time to failure
    ◊ MTTF = 1/λ (for a constant failure rate λ)
  □ MTTR: mean time to repair
    ◊ Shorter repair time leads to higher availability
  □ MTBF: mean time between failures [5]
    ◊ MTBF = MTTF + MTTR
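The availability formula can be checked with a few lines of Python. This is a sketch with made-up MTTF and MTTR values; it only illustrates that a shorter repair time raises availability.

```python
def availability(mttf: float, mttr: float) -> float:
    """Steady-state availability A = MTTF / (MTTF + MTTR)."""
    if mttf <= 0 or mttr < 0:
        raise ValueError("mttf must be positive and mttr non-negative")
    return mttf / (mttf + mttr)

# Hypothetical system failing every 1000 h on average.
slow_repair = availability(1000.0, 10.0)  # 10 h to repair
fast_repair = availability(1000.0, 1.0)   # 1 h to repair
print(f"{slow_repair:.6f} {fast_repair:.6f}")  # 0.990099 0.999001
```

Both values use the same MTTF, so the gain from roughly two nines to three nines comes purely from the faster repair.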

Safety
■ Avoidance of catastrophic consequences
■ Property of a system/component that it will not imperil equipment or human life
■ Example: nuclear power plant
  □ If the reactor reaches temperature X, it must shut down within time Y.
■ Conflicts with reliability: a non-working system is often safe
  □ Fail-safe: system reaches a safe state
  □ Fail-operational: system provides a degraded service mode
    ◊ Example: spare tire

Security
■ Combines confidentiality and integrity
■ Property of a system/component that it will prevent unauthorized access to or alteration of data
■ Example: control board in a train
  □ Displays and switches are behind a glass door, so values can be read by everyone but not modified.

Maintainability [2]
■ System can be repaired and modified
■ Repair rate: μ = 1 / MTTR
■ Hard to specify and measure
  □ Low maintainability
    ◊ e.g. satellites
    ◊ Requires high reliability

Attribute relevance
■ Depends on problem domain
  □ Economic: Are the financial consequences acceptable?
  □ Ethical: Are the risks to life or equipment acceptable?
■ Attributes might conflict
  □ Fail-safe state (reliability vs. safety)
■ Classes
  □ Uncritical Embedded Systems (e.g. mobile phone, Lego NXT)
  □ High-Integrity Embedded Systems (e.g. satellites)
  □ Safety-Critical Systems (e.g. aircraft)

Threats
■ Anything that is capable of decreasing the system dependability
■ A meaningful specification must state the threats to the relevant dependability attributes
■ Fault model [5]

Fault-Error-Failure 1|3
■ Fault
  □ A defect within the system that may eventually lead to an error
    ◊ Active fault if it causes an error, otherwise dormant fault [2]
  □ Fault classes
■ Error
  □ Part of the system state that may lead to a failure

Fault-Error-Failure 2|3
■ Failure
  □ Event that occurs when the delivered service deviates from correct service
  □ Service restoration: transition from incorrect to correct service
  □ Partial failure: a failure of a service may leave the system in a degraded mode (e.g. emergency service)
■ Fault/failure chain
  □ Failures are recognized at component boundaries, so a failure of one component can be considered a fault in a dependent component [2]

Fault-Error-Failure 3|3
■ Failure
  □ Typical failure rate
    ◊ Hardware: bathtub curve [5]
    ◊ Software [5]

Attaining dependability
■ Fault prevention
  □ Prevent faults from being introduced into the system
  □ Development techniques
■ Fault tolerance
  □ Mechanisms that allow the system to operate correctly in the presence of certain faults
  □ Possibly degraded service mode
■ Fault removal
  □ Development or usage phase
  □ Maintenance
■ Fault forecasting
  □ Predicting likely faults

Fault tolerance 1|4
■ Definition: means to maintain service while faults are present
■ Improves reliability and availability
■ A fault can only be tolerated if it was anticipated
■ Phases
  □ Error detection
  □ Damage assessment
  □ State restoration
  □ Continue service
    ◊ Degraded mode

Fault tolerance 2|4
■ Error detection techniques
  □ Result comparison
    ◊ Compare results of redundant components
  □ Watchdog timers
    ◊ Assume failure if result is late
  □ Reasonableness
    ◊ Range checks
    ◊ Constraints (e.g. value must not be negative)
  □ Information redundancy
    ◊ Checksums
  □ Functionality test
    ◊ Memory checks
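Two of the detection techniques listed above, a reasonableness (range) check and a checksum, can be sketched in Python as follows. The function names, the message format and the choice of CRC-32 are illustrative assumptions, not something prescribed by the slides.

```python
import zlib

def in_range(value: float, low: float, high: float) -> bool:
    """Reasonableness check: accept only values inside [low, high]."""
    return low <= value <= high

def add_checksum(payload: bytes) -> bytes:
    """Information redundancy: append the CRC-32 of the payload (4 bytes)."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")

def verify_checksum(message: bytes) -> bool:
    """Return True if the trailing CRC-32 matches the payload."""
    if len(message) < 4:
        return False
    payload, received = message[:-4], message[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == received

message = add_checksum(b"temperature=21.5")
corrupted = b"temperature=91.5" + message[-4:]  # one byte altered in transit

print(verify_checksum(message))     # True
print(verify_checksum(corrupted))   # False
print(in_range(-5.0, 0.0, 100.0))   # False: negative reading is implausible
```

A checksum only detects the error; deciding what to do next belongs to the damage assessment and recovery phases.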

Fault tolerance 3|4
■ Error recovery
  □ Forward
    ◊ Discard computation
    ◊ Resume service from an error-free system state
    ◊ Typically used for periodic tasks
  □ Backward
    ◊ Roll back to a known-good state (checkpoints)
    ◊ Re-execution of the failed task is possible
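Backward recovery with checkpoints might look like the following sketch. The class `CheckpointedTask` and the flaky step are hypothetical; the point is only that a failed step is undone by restoring the last checkpoint and can then be re-executed.

```python
import copy

class CheckpointedTask:
    """Keeps a known-good copy of the state to roll back to."""

    def __init__(self, state):
        self.state = state
        self._checkpoint = copy.deepcopy(state)

    def checkpoint(self):
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        self.state = copy.deepcopy(self._checkpoint)

    def run_step(self, step, max_attempts=2):
        """Run step(state); on an exception roll back and re-execute."""
        for _ in range(max_attempts):
            try:
                step(self.state)
                self.checkpoint()  # step succeeded: new known-good state
                return True
            except Exception:
                self.rollback()    # discard the partial update
        return False

calls = {"n": 0}

def flaky_step(state):
    state["total"] += 10           # partial update ...
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("transient fault")  # ... then a failure

task = CheckpointedTask({"total": 0})
ok = task.run_step(flaky_step)
print(ok, task.state)  # True {'total': 10}
```

Without the rollback, the re-executed step would have added 10 twice and left the state corrupted.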

Fault tolerance 4|4
■ Replication
  □ Using multiple identical instances of a component
  □ Parallel task processing
  □ Voting/quorum
■ Diversity
  □ Tolerates systematic failures
  □ Using multiple different implementations of a component
  □ Otherwise used like replicas
■ Redundancy
“The most certain and effectual check upon errors which arise in the process of computation, is to cause the same computations to be made by separate and independent computers; and this check is rendered still more decisive if they make their computations by different methods” [Lardner, 1834; 2]

Redundancy 1|2
■ Using multiple identical instances of a component
  □ If a component fails, switch to another (fail-over)
■ Types
  □ Space
    ◊ Use multiple components of the same type
  □ Time
    ◊ Send messages multiple times
    ◊ Execute computation multiple times
  □ Information
    ◊ Checksums, error-correcting codes

Redundancy 2|2
■ Active redundancy
  □ Parallel components, voting
  □ Voter is a single point of failure
  □ Fail-silent
  □ N-modular redundancy (NMR)
    ◊ N ≥ 3 (N = 3: triple modular redundancy, TMR) [3]
    ◊ Tolerates up to ⌊(N−1)/2⌋ component failures
■ Passive redundancy
  □ Hot standby
    ◊ Operating in the background to stay synchronized [3]
  □ Cold standby
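The voting step of N-modular redundancy can be sketched as a simple majority voter. The function names are hypothetical, and real voters are usually implemented in hardware or in a replicated form precisely because the voter is itself a single point of failure.

```python
from collections import Counter

def majority_vote(results):
    """Return the value produced by a strict majority of replicas."""
    if not results:
        raise ValueError("no results to vote on")
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise RuntimeError("no majority: too many replicas disagree")
    return value

def tolerated_failures(n: int) -> int:
    """An N-modular system masks up to floor((N - 1) / 2) faulty replicas."""
    return (n - 1) // 2

print(majority_vote([42, 42, 17]))                   # 42
print(tolerated_failures(3), tolerated_failures(5))  # 1 2
```

With three replicas one wrong result is outvoted; a second wrong result would leave no strict majority, which the sketch reports as an error instead of guessing.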

Software Dependability 1|2
■ Fault model for software components
  □ Bohrbugs: easily reproducible
  □ Heisenbugs: complex event combination; transient faults; hard to reproduce
  □ “Aging”: fault accumulation (e.g. memory leaks) [8]

Software Dependability 2|2
■ Rejuvenation
  “proactive fault management technique aimed at cleaning up the system internal state to prevent the occurrence of more severe crash failures in the future” [8]
  □ Heisenbugs, Aging
■ N-version programming
  □ Different implementations of a software component (diversity)
  □ Bohrbugs (operational phase)
■ Retrying operations
  □ Heisenbugs, Aging
■ Fault masking/reconfiguring/fail-over
  □ e.g. FT CORBA
  □ NMR
  □ Heisenbugs, Aging
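Retrying an operation is a form of time redundancy that helps against transient faults such as Heisenbugs. A minimal sketch, assuming a hypothetical `retry` helper and a simulated flaky operation:

```python
import time

def retry(operation, attempts=3, delay_s=0.0):
    """Call operation() until it succeeds or the attempts are exhausted."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return operation()
        except Exception as error:
            last_error = error
            time.sleep(delay_s)  # optional pause before the next attempt
    raise last_error

# Simulated operation: fails twice with a transient fault, then succeeds.
outcomes = iter([RuntimeError("transient"), RuntimeError("transient"), "ok"])

def flaky():
    outcome = next(outcomes)
    if isinstance(outcome, Exception):
        raise outcome
    return outcome

print(retry(flaky))  # ok
```

Retrying does not help against Bohrbugs: a deterministic fault fails the same way on every attempt, which is why the sketch gives up after a bounded number of tries.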

Summary 1|2
■ System dependability is the ability of a system/component to operate as specified. Depending on the problem domain, it subsumes a set of dependability attributes (reliability, availability, safety, security, maintainability, ...) of varying importance.
■ Threats
  □ Define possible degradation of system dependability
  □ Need to be specified
  □ Fault -> Error -> Failure -> Fault

Summary 2|2
■ Fault tolerance
  □ Operate regardless of certain faults
  □ Tolerates only expected faults
  □ Strategies
    ◊ Replication
    ◊ Diversity
    ◊ Redundancy
      ● Space, time, information
      ● Active or passive
