Fault Tolerant Computing: Coping with errors. Steven Janke, February 2013 (PowerPoint presentation).

SLIDE 1

Fault Tolerant Computing

Coping with errors Steven Janke February 2013

Steven Janke (Seminar) Fault Tolerant Computing February 2013 1 / 30

SLIDE 2

Outline

Problem: How do you design a computer system to avoid errors?

  • Failure data for laptops, disks, and RAM
  • Measuring reliability (probability)
  • Redundancy and voters
  • Error correcting codes
  • Byzantine Generals problem

SLIDE 3

Laptop Reliability

SquareTrade, Inc.: laptop failures over three years (sample size: 30,000)

Failure rates after one, two, and three years (each column's Total = Malfunction + Accident):

                     Year 1   Year 2   Year 3
Total Failure Rate    7.2%    19.7%    31.0%
Malfunction Rate      4.7%    12.7%    20.4%
Accident Rate         2.5%     7.0%    10.6%

SLIDE 4

Types of Errors

  • Hardware versus software (component failure versus operating system failure)
  • Permanent (hard errors due to design defects or material failure)
  • Intermittent (periodic errors that could be software or hardware)
  • Transient (soft errors due to gamma rays or electromagnetic disturbance)

SLIDE 5

Microsoft Data

Data Source

Windows Error Reporting system
Sample: 950,000 PCs (2008)

SLIDE 6

Microsoft Data

Conclusions

  • After 30 days of CPU time, the probability of a crash is 1/190.
  • Problems are often recurrent (probability 1/3.3 of a second crash).
  • CPU speed matters (faster machines crash more often).
  • DRAM errors exhibit spatial locality.
  • Brand-name PCs are more reliable.

SLIDE 7

Google Data

Data Source

Google Data Centers Sample: 100,000 drives over 5 years

SLIDE 8

Google Data

SLIDE 9

Google Data

Disk Drive Conclusions

  • Mean time to failure (MTTF) is significantly smaller than the manufacturer specifications.
  • There is significant "infant mortality" and "aging".
  • Disk replacement rates are correlated over time.
  • SATA and SCSI disks all behaved much the same.

SLIDE 10

Google Data

DRAM Conclusions

  • 25,000 to 75,000 errors per Mbit (per billion device hours).
  • More than 8% of DIMMs affected per year.
  • Faults are dominated by hard errors.
  • 32% of machines had at least one memory error per year.
  • About 22,000 correctable errors per year.
  • Worse error behavior for larger-capacity chips.
  • More errors at higher temperatures.
  • Errors increase with age.
  • Newer DIMMs are just as reliable.

SLIDE 11

Failure Distribution

Definition

The Probability Density Function (pdf) is a function f such that f(t) measures the likelihood that the component fails at time t (f(t)dt is approximately the probability of failure in [t, t + dt]).

The Cumulative Distribution Function (cdf) is the function F such that F(t) is the probability that the component fails at or before time t:

F(t) = ∫₀ᵗ f(s) ds (equivalently, F′(t) = f(t))

Reliability: R(t) = 1 − F(t) is the probability that the component survives to time t.

SLIDE 12

Lifetime density

[Figure: graph of the lifetime density f(t)]

SLIDE 13

Failure Rate

Definition

Failure Rate:

h(t) = f(t) / (1 − F(t)) = f(t) / R(t)

(The probability of failing in the next small time interval, given survival until time t.)

Since f(t) = dF(t)/dt = −dR(t)/dt, a constant failure rate gives the differential equation

−(1/R(t)) dR(t)/dt = λ

whose solution is R(t) = e^(−λt).
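The constant-rate derivation can be checked numerically. This is a minimal sketch; the value of λ below is illustrative, not taken from the slides.

```python
import math

def reliability(t, lam):
    """R(t) = exp(-lam * t): probability of surviving past time t."""
    return math.exp(-lam * t)

def hazard(t, lam, dt=1e-5):
    """Numerical failure rate h(t) = f(t) / R(t), with f(t) = -dR/dt."""
    f = (reliability(t, lam) - reliability(t + dt, lam)) / dt
    return f / reliability(t, lam)

lam = 0.001  # illustrative failure rate (per hour)
# The hazard comes out (numerically) constant and equal to lam,
# matching the derivation above.
assert abs(hazard(100.0, lam) - lam) < 1e-6
assert abs(hazard(5000.0, lam) - lam) < 1e-6
```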

SLIDE 14

Real Failure Rates

SLIDE 15

Weibull Distribution

[Figure: graph of the Weibull density f(t)]

PDF: f(t) = αβ t^(β−1) e^(−α t^β)
Reliability: R(t) = e^(−α t^β)
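A quick numerical check on these formulas, using illustrative α and β values (not from the slide): the failure rate h(t) = f(t)/R(t) = αβ t^(β−1) decreases for β < 1 (infant mortality), increases for β > 1 (aging), and is constant for β = 1.

```python
import math

def weibull_R(t, alpha, beta):
    """Weibull reliability: R(t) = exp(-alpha * t**beta)."""
    return math.exp(-alpha * t ** beta)

def weibull_pdf(t, alpha, beta):
    """Weibull density: f(t) = alpha*beta*t**(beta-1)*exp(-alpha*t**beta)."""
    return alpha * beta * t ** (beta - 1) * math.exp(-alpha * t ** beta)

def weibull_hazard(t, alpha, beta):
    """Failure rate h(t) = f(t) / R(t); algebraically alpha*beta*t**(beta-1)."""
    return weibull_pdf(t, alpha, beta) / weibull_R(t, alpha, beta)

# beta < 1: decreasing hazard (infant mortality)
assert weibull_hazard(2.0, 0.5, 0.5) < weibull_hazard(1.0, 0.5, 0.5)
# beta > 1: increasing hazard (aging)
assert weibull_hazard(2.0, 0.5, 2.0) > weibull_hazard(1.0, 0.5, 2.0)
# beta = 1 reduces to the constant-rate exponential model
assert abs(weibull_hazard(3.0, 0.5, 1.0) - 0.5) < 1e-12
```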

SLIDE 16

Series Circuit

CPU Disk

Probability the CPU lasts to time t: Rc(t)
Probability the disk lasts to time t: Rd(t)
Probability the system lasts to time t: R(t) = Rc(t) × Rd(t)
With Rc(t) = 0.9 and Rd(t) = 0.8, the system reliability is R(t) = 0.9 × 0.8 = 0.72

SLIDE 17

Parallel

DRAM 1 DRAM 2

Probability each DRAM lasts to time t: Rdr(t)
Probability the system lasts to time t: R(t) = 1 − (1 − Rdr(t)) × (1 − Rdr(t))
With Rdr(t) = 0.9, the system reliability is R(t) = 1 − (1 − 0.9)² = 0.99
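The series and parallel formulas from the last two slides can be sketched in a few lines, using the numbers given on the slides:

```python
def series_reliability(*rs):
    """A series system works only if every component works."""
    out = 1.0
    for r in rs:
        out *= r
    return out

def parallel_reliability(*rs):
    """A parallel system fails only if every component fails."""
    fail = 1.0
    for r in rs:
        fail *= (1.0 - r)
    return 1.0 - fail

# CPU 0.9 and disk 0.8 in series; two 0.9-reliable DRAMs in parallel.
assert abs(series_reliability(0.9, 0.8) - 0.72) < 1e-12
assert abs(parallel_reliability(0.9, 0.9) - 0.99) < 1e-12
```

Note how redundancy pushes reliability up (0.99 from two 0.9 parts) while a series dependency pulls it down (0.72 from 0.9 and 0.8 parts).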

SLIDE 18

Triple Modular Redundant Structure

A, B, C are circuits computing the same function. Reliability of each circuit is Rf(t). The voting circuit picks the majority answer. Assuming the voter's four gates work correctly, the reliability is R(t) = 3Rf(t)²(1 − Rf(t)) + Rf(t)³. With Rf(t) = 0.9, we have R(t) = 3(0.81)(0.1) + 0.729 = 0.972.
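A minimal sketch of both the TMR reliability arithmetic and the majority voter itself (here on single-bit outputs):

```python
def majority(a, b, c):
    """2-of-3 majority voter: AB + AC + BC."""
    return (a & b) | (a & c) | (b & c)

def tmr_reliability(r):
    """R = 3 r^2 (1 - r) + r^3: at least two of the three modules work."""
    return 3 * r ** 2 * (1 - r) + r ** 3

# Value from the slide.
assert abs(tmr_reliability(0.9) - 0.972) < 1e-12

# The voter masks any single faulty module (one flipped output).
for correct in (0, 1):
    for faulty in range(3):
        outputs = [correct] * 3
        outputs[faulty] ^= 1          # one module produces the wrong bit
        assert majority(*outputs) == correct
```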

SLIDE 19

Simple Processor Circuit

SLIDE 20

Parity Codes

Bit Position: 7 6 5 4 3 2 1 0
Bits 0 through 6 are the message; bit 7 is the parity bit:
b7 = b6 ⊕ b5 ⊕ b4 ⊕ b3 ⊕ b2 ⊕ b1 ⊕ b0
The symbol ⊕ is the XOR operation. Parity detects single errors and some multiple errors (any odd number of flipped bits).
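A small sketch of even parity over an arbitrary message; the message bits below are illustrative, not the slide's example:

```python
from functools import reduce

def add_parity(bits):
    """Append an even-parity bit so the XOR of all bits is 0."""
    p = reduce(lambda a, b: a ^ b, bits, 0)
    return bits + [p]

def check_parity(word):
    """True if the word has even parity (no error detected)."""
    return reduce(lambda a, b: a ^ b, word, 0) == 0

msg = [1, 0, 1, 1, 0, 0, 1]          # 7 message bits (illustrative)
word = add_parity(msg)               # 8-bit code word
assert check_parity(word)            # clean word passes
word[3] ^= 1                         # flip one bit
assert not check_parity(word)        # single error detected
word[5] ^= 1                         # flip a second bit
assert check_parity(word)            # double error goes undetected
```

The last assertion shows the limitation the slide mentions: an even number of flips cancels out, so simple parity only detects odd-weight errors.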

SLIDE 21

Hamming Code: Distance between words

Code Word A and Code Word B differ in only one bit (bit 3), so an error in bit 3 cannot be corrected: the received word is equally consistent with either code word. Any two code words must differ in more than 2 bits to correct a single error.

SLIDE 22

Hamming Code: Extra Parity Bits

Message: 1 0 0 1 0 1 1 0
Bit Position: 12 11 10 9 8 7 6 5 4 3 2 1
Bit positions 1, 2, 4, and 8 are check bits; the message bits fill the remaining positions.
Position 1 checks 3, 5, 7, 9, 11
Position 2 checks 3, 6, 7, 10, 11
Position 4 checks 5, 6, 7, 12
Position 8 checks 9, 10, 11, 12
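An encoder for this (12, 8) layout can be sketched as follows. The rule that position p belongs to the group of check bit c exactly when the binary representation of p includes c reproduces the four check groups listed above; the order in which the message bits fill the data positions is an assumption of this sketch.

```python
def hamming12_encode(data_bits):
    """Fill check bits 1, 2, 4, 8 so every check group has even parity.
    The 8 data bits occupy positions 3, 5, 6, 7, 9, 10, 11, 12."""
    assert len(data_bits) == 8
    word = [0] * 13                      # index 0 unused; positions 1..12
    for pos, bit in zip([3, 5, 6, 7, 9, 10, 11, 12], data_bits):
        word[pos] = bit
    for c in (1, 2, 4, 8):
        parity = 0
        for pos in range(1, 13):
            if pos != c and (pos & c):   # data positions in this check group
                parity ^= word[pos]
        word[c] = parity
    return word[1:]                      # bits for positions 1 through 12

# Encode the slide's message bits (fill order assumed) and verify that
# every check group XORs to zero.
code = hamming12_encode([1, 0, 0, 1, 0, 1, 1, 0])
for c in (1, 2, 4, 8):
    group = 0
    for pos in range(1, 13):
        if pos & c:
            group ^= code[pos - 1]
    assert group == 0
```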

SLIDE 23

Hamming Code: Error Correcting

Bit Position: 12 11 10 9 8 7 6 5 4 3 2 1
In the received word, the bit in position 5 has been flipped. Check bits 1 and 4 are then incorrect, and 1 + 4 = 5 locates the error.
Position 1 checks 3, 5, 7, 9, 11
Position 2 checks 3, 6, 7, 10, 11
Position 4 checks 5, 6, 7, 12
Position 8 checks 9, 10, 11, 12
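The correction step is a syndrome computation: sum the check-bit positions that fail, and the sum names the flipped bit. This sketch uses the same (12, 8) layout; the all-zero word (a valid code word) is used for the demonstration rather than the slide's example, whose bit pattern did not survive extraction.

```python
def hamming12_correct(word):
    """Return (corrected word, syndrome) for a 12-bit received word.
    Syndrome 0 means no single-bit error was detected."""
    w = [0] + list(word)                 # 1-indexed positions
    syndrome = 0
    for c in (1, 2, 4, 8):
        parity = 0
        for pos in range(1, 13):
            if pos & c:
                parity ^= w[pos]
        if parity:                       # this check group failed
            syndrome += c
    if syndrome:
        w[syndrome] ^= 1                 # flip the bit the syndrome names
    return w[1:], syndrome

received = [0] * 12                      # all-zero word is a valid code word
received[5 - 1] ^= 1                     # flip the bit at position 5
fixed, syndrome = hamming12_correct(received)
assert syndrome == 5                     # check bits 1 and 4 failed: 1 + 4 = 5
assert fixed == [0] * 12                 # error corrected
```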

SLIDE 24

Intel Processor specifications

SLIDE 25

Byzantine Generals Problem

Actual Computer Systems:

A reliable system must cope with the failure of one or more components. A failed component may behave erratically (e.g., stop working, or send conflicting information).

Abstract Problem

Imagine a Byzantine commanding general with several lieutenant generals. After receiving the command, the lieutenants must agree on a plan of action (attack or retreat). Some of the generals may be traitors.

SLIDE 26

Byzantine Generals Problem (continued)

Formal Problem:

A commanding general must send an order to his n − 1 lieutenants such that:

  • IC1. All loyal lieutenants obey the same order.
  • IC2. If the commanding general is loyal, then every loyal lieutenant obeys the order he sends.

Communication: oral messages sent by messenger. A solution is an algorithm allowing the loyal lieutenants to satisfy IC1 and IC2. Unfortunate result: there is no solution unless more than two-thirds of the generals are loyal.

SLIDE 27

Byzantine Generals Problem: Impossibility for N=3 and T=1

There is no solution for 3 generals with one traitor. To satisfy IC2, a loyal lieutenant must obey the order it hears from the Commander. But a loyal lieutenant cannot distinguish a traitorous fellow lieutenant misreporting the order from a traitorous Commander sending conflicting orders; in the second scenario the two loyal lieutenants obey different orders, so IC1 is violated.

SLIDE 28

Byzantine Generals Problem: Impossibility

There is no solution for m traitors when N ≤ 3m (call them A generals). Suppose there were an algorithm for the A generals. Use it to solve the N = 3, T = 1 problem (B generals): assign at most N/3 of the A generals to each of the 3 B generals, and let each B general simulate its A generals' steps of the algorithm. The algorithm stops with the loyal A generals in agreement, so the loyal B generals agree, solving the 3-general, 1-traitor case. This is a contradiction, so no solution for the A generals exists.

SLIDE 29

Byzantine Generals Problem: Solution for N=4 and T=1
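As a sketch of why four generals suffice against one traitor: in the oral-messages algorithm OM(1), each lieutenant takes the majority of the order it heard directly and the orders the other lieutenants relay. The function names and the traitors' strategies below are illustrative assumptions, not from the slides.

```python
from collections import Counter

def om1(orders, liars):
    """orders[i]: the order the commander sends lieutenant i (a traitorous
    commander may send different orders). liars: dict mapping a traitorous
    lieutenant's index to the function it applies when relaying."""
    n = len(orders)
    decisions = []
    for i in range(n):
        votes = [orders[i]]                        # heard directly
        for j in range(n):
            if j != i:
                relay = liars.get(j, lambda v: v)  # loyal lieutenants relay honestly
                votes.append(relay(orders[j]))
        decisions.append(Counter(votes).most_common(1)[0][0])
    return decisions

# Traitorous commander sends conflicting orders: IC1 still holds.
d = om1(["attack", "attack", "retreat"], {})
assert d[0] == d[1] == d[2]

# Loyal commander, one traitorous lieutenant: IC2 still holds.
d = om1(["attack", "attack", "attack"], {2: lambda v: "retreat"})
assert d[0] == "attack" and d[1] == "attack"
```

With only three generals the same majority vote ties or misleads, which is the impossibility shown two slides back.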

SLIDE 30

Byzantine Generals Problem: Other versions

  • An approximate solution is also impossible.
  • There is a solution for N = 3 and T = 1 when messages are signed.
  • Synchronicity is important.
