SLIDE 1

System Dependability

Robert Wierschke
Seminar “Prozesssteuerung und Robotik”
14 January 2009
SLIDE 2

Outline

■ Motivation
■ Dependability
  □ Definition
  □ Dependability attributes
  □ Attribute relevance
■ Threats
  □ Fault model
  □ Fault-error-failure
■ Attaining dependability
  □ Fault tolerance
  □ Redundancy
■ Software Dependability
■ Summary

SLIDE 3

Motivation

■ Deliver correct communication and computation services
■ First-generation computers used unreliable components
  □ Hardware concept
■ Consequences of system failure
  □ Economic
    ◊ Credit card authorization: $2.6 million per hour of downtime
    ◊ Airline reservation: $89,500 per hour of downtime
  □ Human life
    ◊ What happens if the on-board computer of an airplane crashes?

SLIDE 4

Definition: Dependability 1|2

■ [Merriam-Webster Online]
  □ dependable: capable of being depended on : reliable
  □ reliable: suitable or fit to be relied on : dependable
■ “the collective term used to describe the availability performance and its influencing factors: reliability performance, maintainability performance and maintenance support performance” [7]
  □ (Zuverlässigkeit)
  □ Focus: availability
    ◊ Strongly influenced by the telecommunication industry
■ Evolves over time
■ Depends on problem domain

SLIDE 5

Definition: Dependability 2|2

■ “the trustworthiness of a computing system which allows reliance to be justifiably placed on the service it delivers” [1]
  □ Behaves as specified
  □ Avoids hazards
■ Dependability attributes
  □ Reliability (Funktionsfähigkeit)
  □ Availability (Verfügbarkeit)
  □ Safety (Sicherheit)
  □ Confidentiality (Vertraulichkeit)
  □ Integrity (Integrität)
  □ Maintainability (Wartbarkeit)

SLIDE 6

Reliability 1|2

■ Continuity of service
■ Probability R(t) that a system/component operates correctly throughout a time period t
■ Example:
  □ A space probe needs to operate correctly during the mission time.
■ Calculating reliability
  □ R(0) = 1, R(∞) = 0
  □ Failure probability: Q(t) = 1 − R(t)
  □ Failure rate λ(t): number of failures per time interval Δt
  □ For a constant failure rate: R(t) = e^(−λt) [3]
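The constant-failure-rate model on this slide can be checked numerically. A minimal sketch; the rate λ and the mission time are illustrative values, not from the slides:

```python
import math

def reliability(t, lam):
    # R(t) = e^(-lam * t) for a constant failure rate (exponential model)
    return math.exp(-lam * t)

def failure_probability(t, lam):
    # Q(t) = 1 - R(t)
    return 1.0 - reliability(t, lam)

# Illustrative rate: lam = 1e-4 failures/hour, 1000-hour mission
print(round(reliability(1000, 1e-4), 4))          # 0.9048
print(round(failure_probability(1000, 1e-4), 4))  # 0.0952
```

Note that R(0) = 1 and R(t) → 0 as t grows, matching the boundary conditions above.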

SLIDE 7

Reliability 2|2

■ Empirical values
■ Assuming independent components
  □ Series: all components must work, R = R1 · R2 · … · Rn
  □ Parallel: one working component suffices, R = 1 − (1 − R1) · … · (1 − Rn) [3]
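Under the independence assumption stated on this slide, the series and parallel combinations are simple products; a small sketch:

```python
from math import prod

def series_reliability(rs):
    # Series: every component must work -> R = R1 * R2 * ... * Rn
    return prod(rs)

def parallel_reliability(rs):
    # Parallel: one working component suffices -> R = 1 - (1-R1)*...*(1-Rn)
    return 1.0 - prod(1.0 - r for r in rs)

# Two components with R = 0.9 each
print(round(series_reliability([0.9, 0.9]), 2))    # 0.81
print(round(parallel_reliability([0.9, 0.9]), 2))  # 0.99
```

The example shows why redundancy helps: the same two components yield 0.81 in series but 0.99 in parallel.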

SLIDE 8

Availability 1|2

■ Readiness for usage
■ Probability A that a system/component operates correctly at any point in time
■ Example:
  □ Telecommunication
■ Availability vs. reliability
  □ A service that crashes often but restarts instantly has high A but low R(t).
■ Number of 9s

  9s   Availability   Downtime per year
  1    90.0 %         36.5 d
  2    99.0 %         3.65 d
  3    99.9 %         8.76 h
  4    99.99 %        52.6 min
  5    99.999 %       5.26 min
  6    99.9999 %      31.5 s
  7    99.99999 %     3.15 s

SLIDE 9

Availability 2|2

■ A = MTTF / (MTTF + MTTR)
  □ MTTF: mean time to failure
    ◊ MTTF = 1/λ
  □ MTTR: mean time to repair
    ◊ Shorter repair time leads to higher availability
  □ MTBF: mean time between failures [5]
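The MTTF/MTTR relation maps directly onto the number-of-9s table on the previous slide. A sketch with illustrative repair figures (not from the slides):

```python
def availability(mttf, mttr):
    # A = MTTF / (MTTF + MTTR)
    return mttf / (mttf + mttr)

def downtime_per_year_h(a):
    # Expected downtime in hours per year for availability a
    return (1.0 - a) * 365 * 24

# Illustrative values: one failure every 999 h on average, 1 h to repair
a = availability(mttf=999.0, mttr=1.0)
print(round(a, 3))                       # 0.999 ("three nines")
print(round(downtime_per_year_h(a), 2))  # 8.76 (hours per year)
```

Note how the formula reflects the slide's point: halving MTTR (faster repair) raises A just as effectively as doubling MTTF.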

SLIDE 10

Safety

■ Avoidance of catastrophic consequences
■ Property of a system/component that it will not imperil equipment or human life
■ Example: nuclear power plant
  □ If the reactor reaches temperature X, it must shut down within time Y.
■ Conflicting with reliability: a non-working system is often safe
  □ Fail-safe: system reaches a safe state
  □ Fail-operational: system provides a degraded service mode
    ◊ Example: spare tire

SLIDE 11

Security

■ Combines confidentiality and integrity
■ Property of a system/component that it will prevent unauthorized access to or alteration of data
■ Example: control board in a train
  □ Displays and switches are behind a glass door, so values can be read by everyone but not modified.

SLIDE 12

Maintainability

■ System can be repaired and modified
■ Repair rate: μ = 1 / MTTR
■ Hard to specify and measure
  □ Low maintainability
    ◊ e.g. satellites
    ◊ Requires high reliability [2]

SLIDE 13

Attribute relevance

■ Depends on problem domain
  □ Economic: Are the financial consequences acceptable?
  □ Ethical: Are risks for life or equipment acceptable?
■ Attributes might be conflicting
  □ Fail-safe state (reliability vs. safety)
■ Classes
  □ Uncritical embedded systems (e.g. mobile phone, Lego NXT)
  □ High-integrity embedded systems (e.g. satellites)
  □ Safety-critical systems (e.g. aircraft)

SLIDE 14

Threats

■ Anything that is capable of decreasing the system's dependability
■ A meaningful specification must state the threats to the relevant dependability attributes
■ Fault model [5]

SLIDE 15

Fault-Error-Failure 1|3

■ Fault
  □ A defect within the system that eventually leads to an error
    ◊ Active fault: currently causing an error
    ◊ Dormant fault: otherwise
  □ Fault classes
■ Error
  □ Part of the system state that may lead to a failure [2]

SLIDE 16

Fault-Error-Failure 2|3

■ Failure
  □ Event that occurs when the delivered service deviates from correct service
  □ Service restoration: transition from incorrect to correct service
  □ Partial failure: a failure of a service may leave the system in a degraded mode (e.g. emergency service)
■ Fault/failure chain
  □ Failures are recognized at component boundaries, thus a failure can be considered a fault in a depending component [2]

SLIDE 17

Fault-Error-Failure 3|3

■ Failure
  □ Typical failure rates
    ◊ Hardware: bathtub curve
    ◊ Software [5]

SLIDE 18

Attaining dependability

■ Fault prevention
  □ Avoid faults being introduced into the system
  □ Development techniques
■ Fault tolerance
  □ Mechanisms that allow the system to operate correctly in case of certain failures
  □ Possibly degraded service mode
■ Fault removal
  □ Development or usage phase
  □ Maintenance
■ Fault forecasting
  □ Predicting likely faults

SLIDE 19

Fault tolerance 1|4

■ Definition: means to maintain service while faults are present
■ Improves reliability and availability
■ A fault can only be tolerated if it was expected
■ Phases
  □ Error detection
  □ Damage assessment
  □ State restoration
  □ Continue service
    ◊ Degraded mode

SLIDE 20

Fault tolerance 2|4

■ Error detection techniques
  □ Result comparison
    ◊ Compare results of redundant components
  □ Watchdog timers
    ◊ Assume failure if the result is late
  □ Reasonableness
    ◊ Range checks
    ◊ Constraints (e.g. negative value)
  □ Information redundancy
    ◊ Checksums
  □ Functionality test
    ◊ Memory checks
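One of the techniques above, the watchdog timer, fits in a few lines. A minimal sketch; the class and method names are my own, not from the slides:

```python
import time

class WatchdogTimer:
    # Assume a failure if the monitored task has not reported in time.
    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_kick = time.monotonic()

    def kick(self):
        # Called periodically by the monitored task while it is healthy
        self.last_kick = time.monotonic()

    def expired(self):
        # True once the task is late, i.e. a failure is assumed
        return time.monotonic() - self.last_kick > self.timeout_s
```

A supervising loop would poll expired() and trigger damage assessment and recovery once it returns True.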

SLIDE 21

Fault tolerance 3|4

■ Error recovery
  □ Forward
    ◊ Discard the computation
    ◊ Resume service from an error-free system state
    ◊ Typically used for periodic tasks
  □ Backward
    ◊ Roll back to a known-good state (checkpoints)
    ◊ Re-execution of the failed task is possible
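Backward recovery with checkpoints can be sketched as follows. This is an illustrative toy, not the mechanism from the lecture; names and the dict-based state are assumptions:

```python
import copy

class CheckpointedTask:
    # Backward recovery: roll back to the last known-good state when a
    # step fails, so the step can be re-executed.
    def __init__(self, state):
        self.state = state
        self.checkpoint = copy.deepcopy(state)

    def run_step(self, step):
        try:
            step(self.state)
            # Step succeeded: its result becomes the new known-good state
            self.checkpoint = copy.deepcopy(self.state)
        except Exception:
            # Step failed: restore the checkpoint, discarding partial damage
            self.state = copy.deepcopy(self.checkpoint)
```

The deep copies stand in for the state-saving a real checkpointing system would do (e.g. to stable storage).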

SLIDE 22

Fault tolerance 4|4

■ Replication
  □ Using multiple identical instances of a component
  □ Parallel task processing
  □ Voting/quorum
■ Diversity
  □ Tolerates systematic failures
  □ Using multiple different implementations of a component
  □ Otherwise used as replicas
  □ “The most certain and effectual check upon errors which arise in the process of computation, is to cause the same computations to be made by separate and independent computers; and this check is rendered still more decisive if they make their computations by different methods” [1834; Lardner; 2]
■ Redundancy
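Lardner's idea of checking diverse computations against each other can be sketched as majority voting over independently written implementations. The implementations below are toy examples of my own:

```python
from collections import Counter

def vote(results):
    # Return the value produced by a majority of the replicas, if any
    value, count = Counter(results).most_common(1)[0]
    if count > len(results) // 2:
        return value
    raise RuntimeError("no majority among replica results")

# Diversity: three different implementations of the same specification
def area_v1(w, h): return w * h
def area_v2(w, h): return sum([w] * h)  # deliberately a different method
def area_v3(w, h): return h * w

print(vote([f(3, 4) for f in (area_v1, area_v2, area_v3)]))  # 12
```

A systematic bug in one implementation is outvoted as long as the other two agree, which is exactly what diversity buys over plain replication.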

SLIDE 23

Redundancy 1|2

■ Using multiple identical instances of a component; if a component fails, switch to another (fail-over)
■ Types
  □ Space
    ◊ Use multiple components of the same type
  □ Time
    ◊ Send messages multiple times
    ◊ Execute computations multiple times
  □ Information
    ◊ Checksums, error-correcting codes
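The fail-over idea for space redundancy can be sketched like this; the class name and callable-based components are illustrative assumptions:

```python
class Failover:
    # Space redundancy: identical component instances tried in order;
    # on failure, switch (fail over) to the next one.
    def __init__(self, *components):
        self.components = components

    def call(self, *args):
        for component in self.components:
            try:
                return component(*args)
            except Exception:
                continue  # this instance failed: fail over to the next
        raise RuntimeError("all redundant components failed")

def broken(x):
    raise IOError("component down")  # simulated component failure

def working(x):
    return x + 1

print(Failover(broken, working).call(41))  # 42
```

The caller sees an uninterrupted service as long as at least one instance still works.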

SLIDE 24

Redundancy 2|2

■ Active redundancy
  □ Parallel components, voting
  □ Voter is a single point of failure
  □ Fail-silent
  □ N-modular redundancy
    ◊ N ≥ 3 (TMR for N = 3)
    ◊ Tolerates (N−1)/2 component failures
■ Passive redundancy
  □ Hot standby
    ◊ Operating in the background to keep synchronized
  □ Cold standby [3]

SLIDE 25

Software Dependability 1|2

■ Fault model for software components
  □ Bohrbugs: easily reproducible
  □ Heisenbugs: complex event combinations; transient faults; hard to reproduce
  □ “Aging”: fault accumulation (e.g. memory leaks) [8]

SLIDE 26

Software Dependability 2|2

■ Rejuvenation
  □ “proactive fault management technique aimed at cleaning up the system internal state to prevent the occurrence of more severe crash failures in the future” [8]
  □ Heisenbugs, aging
■ N-version programming
  □ Different implementations of a software component (diversity)
  □ Bohrbugs (operational phase)
■ Retrying operations
  □ Heisenbugs, aging
■ Fault masking / reconfiguring / fail-over
  □ e.g. FT CORBA
  □ NMR
  □ Heisenbugs, aging
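Retrying works against Heisenbugs because transient faults usually vanish on re-execution. A minimal retry helper as a sketch; the signature and defaults are assumptions:

```python
import time

def retry(operation, attempts=3, delay_s=0.1):
    # Re-execute the operation a bounded number of times; transient
    # (Heisenbug-style) faults often disappear on a later attempt.
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # fault persisted through all attempts
            time.sleep(delay_s)  # brief pause before re-executing
```

Note this only helps with transient faults; a Bohrbug fails deterministically on every attempt, which is why the slide pairs Bohrbugs with N-version programming instead.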

SLIDE 27

Summary 1|2

■ System dependability is the ability of a system/component to operate as specified. Depending on the problem domain, it subsumes a set of dependability attributes (reliability, availability, safety, security, maintainability, ...) with varying importance.
■ Threats
  □ Define possible degradations of system dependability
  □ Need to be specified
  □ Fault → Error → Failure → Fault

SLIDE 28

Summary 2|2

■ Fault tolerance
  □ Operate regardless of certain faults
  □ Tolerates only expected faults
  □ Strategies
    ◊ Replication
    ◊ Diversity
    ◊ Redundancy
      • Space, time, information
      • Active or passive
SLIDE 29

References

[1] IFIP WG 10.4 on Dependable Computing and Fault Tolerance, http://www.dependability.org/wg10.4/
[2] Avižienis, Laprie and Randell, Dependability and Its Threats: A Taxonomy, http://rodin.cs.ncl.ac.uk/Publications/avizienis.pdf
[3] Giese, Software Engineering for Embedded Systems: II Foundations (WS 07/08)
[4] Bowen and Hinchey, High-Integrity System Specification and Design, http://www.jpbowen.com/pub/hissd99.pdf
[5] Löwis, Fault Tolerance (SS 08)
[6] Geib, Verlässliche Systeme, http://www.informatik.fh-wiesbaden.de/~geib/LVweb/VS_Kap_1.pdf
[7] IEC, IEV 191-02-03, http://dom2.iec.ch/iev/iev.nsf/display?openform&ievref=191-02-03
[8] http://srejuv.ee.duke.edu/