Fault Tolerant Computing: Coping with errors. Steven Janke, February 2013 (PowerPoint presentation).

SLIDE 1

Fault Tolerant Computing

Coping with errors Steven Janke February 2013

Steven Janke (Seminar) Fault Tolerant Computing February 2013 1 / 30

SLIDE 2

Outline

Problem: How do you design a computer system to avoid errors?

  • Failure data for laptops, disks, and RAM
  • Measuring reliability (probability)
  • Redundancy and voters
  • Error correcting codes
  • Byzantine Generals problem

SLIDE 3

Laptop Reliability

SquareTrade, Inc.: laptop failures over three years (sample size: 30,000)

Failure rates after one, two, and three years (each column's Total = Malfunction + Accident):

                     Year 1   Year 2   Year 3
Total Failure Rate    7.2%    19.7%    31.0%
Malfunction Rate      4.7%    12.7%    20.4%
Accident Rate         2.5%     7.0%    10.6%

SLIDE 4

Types of Errors

  • Hardware versus software (component failure versus operating system failure)
  • Permanent (hard errors due to design defects or material failure)
  • Intermittent (periodic errors that could be software or hardware)
  • Transient (soft errors due to gamma rays or electromagnetic disturbance)

SLIDE 5

Microsoft Data

Data Source

Windows Error Reporting system
Sample: 950,000 PCs (2008)

SLIDE 6

Microsoft Data

Conclusions

  • After 30 days of CPU time, the probability of a crash is 1/190.
  • Problems are often recurrent (probability 1/3.3 of a second crash).
  • CPU speed matters (faster machines crash more often).
  • DRAM errors exhibit spatial locality.
  • Brand-name PCs are more reliable.

SLIDE 7

Google Data

Data Source

Google Data Centers Sample: 100,000 drives over 5 years

SLIDE 8

Google Data

SLIDE 9

Google Data

Disk Drive Conclusions

  • Mean time to failure (MTTF) is significantly smaller than the manufacturer specifications.
  • There is significant "infant mortality" and "aging".
  • Disk replacement rates are correlated over time.
  • SATA and SCSI disks all behaved much the same.

SLIDE 10

Google Data

DRAM Conclusions

  • 25,000 to 75,000 errors per Mbit (per billion device hours).
  • More than 8% of DIMMs affected per year.
  • Faults are dominated by hard errors.
  • 32% of machines had at least one memory error per year.
  • About 22,000 correctable errors per year.
  • Worse error behavior for larger-capacity chips.
  • More errors at higher temperatures.
  • Errors increase with age.
  • Newer DIMMs are just as reliable.

SLIDE 11

Failure Distribution

Definition

The Probability Density Function (pdf) is a function f such that f(t) measures the likelihood that the component fails at time t (f(t)dt is approximately the probability of failure in [t, t + dt]).

The Cumulative Distribution Function (cdf) is the function F such that F(t) is the probability that the component fails at or before time t:

F(t) = ∫₀ᵗ f(s) ds (equivalently, F′(t) = f(t))

Reliability: R(t) = 1 − F(t) is the probability that the component survives to time t.

SLIDE 12

Lifetime density

[Figure: graph of the lifetime density f(t)]

SLIDE 13

Failure Rate

Definition

Failure Rate:

h(t) = f(t) / (1 − F(t)) = f(t) / R(t)

(The probability of failing in the next small time interval, given survival until time t.)

Since f(t) = dF(t)/dt = −dR(t)/dt, a constant failure rate gives the differential equation

−(1/R(t)) dR(t)/dt = λ

whose solution is R(t) = e^(−λt).
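The constant-rate derivation can be checked numerically. This is a minimal sketch; the value of λ below is illustrative, not taken from the slides.

```python
import math

def reliability(t, lam):
    """R(t) = exp(-lam * t): probability of surviving past time t."""
    return math.exp(-lam * t)

def hazard(t, lam, dt=1e-5):
    """Numerical failure rate h(t) = f(t) / R(t), with f(t) = -dR/dt."""
    f = (reliability(t, lam) - reliability(t + dt, lam)) / dt
    return f / reliability(t, lam)

lam = 0.001  # illustrative failure rate (per hour)
# The hazard comes out (numerically) constant and equal to lam,
# matching the derivation above.
assert abs(hazard(100.0, lam) - lam) < 1e-6
assert abs(hazard(5000.0, lam) - lam) < 1e-6
```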

SLIDE 14

Real Failure Rates

SLIDE 15

Weibull Distribution

[Figure: graph of the Weibull density f(t)]

PDF: f(t) = αβ t^(β−1) e^(−α t^β)
Reliability: R(t) = e^(−α t^β)
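A quick numerical check on these formulas, using illustrative α and β values (not from the slide): the failure rate h(t) = f(t)/R(t) = αβ t^(β−1) decreases for β < 1 (infant mortality), increases for β > 1 (aging), and is constant for β = 1.

```python
import math

def weibull_R(t, alpha, beta):
    """Weibull reliability: R(t) = exp(-alpha * t**beta)."""
    return math.exp(-alpha * t ** beta)

def weibull_pdf(t, alpha, beta):
    """Weibull density: f(t) = alpha*beta*t**(beta-1)*exp(-alpha*t**beta)."""
    return alpha * beta * t ** (beta - 1) * math.exp(-alpha * t ** beta)

def weibull_hazard(t, alpha, beta):
    """Failure rate h(t) = f(t) / R(t); algebraically alpha*beta*t**(beta-1)."""
    return weibull_pdf(t, alpha, beta) / weibull_R(t, alpha, beta)

# beta < 1: decreasing hazard (infant mortality)
assert weibull_hazard(2.0, 0.5, 0.5) < weibull_hazard(1.0, 0.5, 0.5)
# beta > 1: increasing hazard (aging)
assert weibull_hazard(2.0, 0.5, 2.0) > weibull_hazard(1.0, 0.5, 2.0)
# beta = 1 reduces to the constant-rate exponential model
assert abs(weibull_hazard(3.0, 0.5, 1.0) - 0.5) < 1e-12
```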

SLIDE 16

Series Circuit

CPU Disk

Probability the CPU lasts to time t: Rc(t)
Probability the disk lasts to time t: Rd(t)
Probability the system lasts to time t: R(t) = Rc(t) × Rd(t)
With Rc(t) = 0.9 and Rd(t) = 0.8, the system reliability is R(t) = 0.9 × 0.8 = 0.72

SLIDE 17

Parallel

DRAM 1 DRAM 2

Probability each DRAM lasts to time t: Rdr(t)
Probability the system lasts to time t: R(t) = 1 − (1 − Rdr(t)) × (1 − Rdr(t))
With Rdr(t) = 0.9, the system reliability is R(t) = 1 − (1 − 0.9)² = 0.99
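The series and parallel formulas from the last two slides can be sketched in a few lines, using the numbers given on the slides:

```python
def series_reliability(*rs):
    """A series system works only if every component works."""
    out = 1.0
    for r in rs:
        out *= r
    return out

def parallel_reliability(*rs):
    """A parallel system fails only if every component fails."""
    fail = 1.0
    for r in rs:
        fail *= (1.0 - r)
    return 1.0 - fail

# CPU 0.9 and disk 0.8 in series; two 0.9-reliable DRAMs in parallel.
assert abs(series_reliability(0.9, 0.8) - 0.72) < 1e-12
assert abs(parallel_reliability(0.9, 0.9) - 0.99) < 1e-12
```

Note how redundancy pushes reliability up (0.99 from two 0.9 parts) while a series dependency pulls it down (0.72 from 0.9 and 0.8 parts).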

SLIDE 18

Triple Modular Redundant Structure

A, B, C are circuits computing the same function. Reliability of each circuit is Rf(t). The voting circuit picks the majority answer. Assuming the voter's four gates work correctly, the reliability is R(t) = 3Rf(t)²(1 − Rf(t)) + Rf(t)³. With Rf(t) = 0.9, we have R(t) = 3(0.81)(0.1) + 0.729 = 0.972.
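A minimal sketch of both the TMR reliability arithmetic and the majority voter itself (here on single-bit outputs):

```python
def majority(a, b, c):
    """2-of-3 majority voter: AB + AC + BC."""
    return (a & b) | (a & c) | (b & c)

def tmr_reliability(r):
    """R = 3 r^2 (1 - r) + r^3: at least two of the three modules work."""
    return 3 * r ** 2 * (1 - r) + r ** 3

# Value from the slide.
assert abs(tmr_reliability(0.9) - 0.972) < 1e-12

# The voter masks any single faulty module (one flipped output).
for correct in (0, 1):
    for faulty in range(3):
        outputs = [correct] * 3
        outputs[faulty] ^= 1          # one module produces the wrong bit
        assert majority(*outputs) == correct
```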

SLIDE 19

Simple Processor Circuit

SLIDE 20

Parity Codes

Bit Position: 7 6 5 4 3 2 1 0
Bits 0 through 6 are the message; bit 7 is the parity bit:
b7 = b6 ⊕ b5 ⊕ b4 ⊕ b3 ⊕ b2 ⊕ b1 ⊕ b0
The symbol ⊕ is the XOR operation. Parity detects single errors and some multiple errors (any odd number of flipped bits).
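A small sketch of even parity over an arbitrary message; the message bits below are illustrative, not the slide's example:

```python
from functools import reduce

def add_parity(bits):
    """Append an even-parity bit so the XOR of all bits is 0."""
    p = reduce(lambda a, b: a ^ b, bits, 0)
    return bits + [p]

def check_parity(word):
    """True if the word has even parity (no error detected)."""
    return reduce(lambda a, b: a ^ b, word, 0) == 0

msg = [1, 0, 1, 1, 0, 0, 1]          # 7 message bits (illustrative)
word = add_parity(msg)               # 8-bit code word
assert check_parity(word)            # clean word passes
word[3] ^= 1                         # flip one bit
assert not check_parity(word)        # single error detected
word[5] ^= 1                         # flip a second bit
assert check_parity(word)            # double error goes undetected
```

The last assertion shows the limitation the slide mentions: an even number of flips cancels out, so simple parity only detects odd-weight errors.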

SLIDE 21

Hamming Code: Distance between words

Code Word A and Code Word B differ in only one bit (bit 3), so an error in bit 3 cannot be corrected: the received word is equally consistent with either code word. Any two code words must differ in more than 2 bits to correct a single error.

SLIDE 22

Hamming Code: Extra Parity Bits

Message: 1 0 0 1 0 1 1 0
Bit Position: 12 11 10 9 8 7 6 5 4 3 2 1
Bit positions 1, 2, 4, and 8 are check bits; the message bits fill the remaining positions.
Position 1 checks 3, 5, 7, 9, 11
Position 2 checks 3, 6, 7, 10, 11
Position 4 checks 5, 6, 7, 12
Position 8 checks 9, 10, 11, 12
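An encoder for this (12, 8) layout can be sketched as follows. The rule that position p belongs to the group of check bit c exactly when the binary representation of p includes c reproduces the four check groups listed above; the order in which the message bits fill the data positions is an assumption of this sketch.

```python
def hamming12_encode(data_bits):
    """Fill check bits 1, 2, 4, 8 so every check group has even parity.
    The 8 data bits occupy positions 3, 5, 6, 7, 9, 10, 11, 12."""
    assert len(data_bits) == 8
    word = [0] * 13                      # index 0 unused; positions 1..12
    for pos, bit in zip([3, 5, 6, 7, 9, 10, 11, 12], data_bits):
        word[pos] = bit
    for c in (1, 2, 4, 8):
        parity = 0
        for pos in range(1, 13):
            if pos != c and (pos & c):   # data positions in this check group
                parity ^= word[pos]
        word[c] = parity
    return word[1:]                      # bits for positions 1 through 12

# Encode the slide's message bits (fill order assumed) and verify that
# every check group XORs to zero.
code = hamming12_encode([1, 0, 0, 1, 0, 1, 1, 0])
for c in (1, 2, 4, 8):
    group = 0
    for pos in range(1, 13):
        if pos & c:
            group ^= code[pos - 1]
    assert group == 0
```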

SLIDE 23

Hamming Code: Error Correcting

Bit Position: 12 11 10 9 8 7 6 5 4 3 2 1
In the received word, the bit in position 5 has been flipped. Check bits 1 and 4 are then incorrect, and 1 + 4 = 5 locates the error.
Position 1 checks 3, 5, 7, 9, 11
Position 2 checks 3, 6, 7, 10, 11
Position 4 checks 5, 6, 7, 12
Position 8 checks 9, 10, 11, 12
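The correction step is a syndrome computation: sum the check-bit positions that fail, and the sum names the flipped bit. This sketch uses the same (12, 8) layout; the all-zero word (a valid code word) is used for the demonstration rather than the slide's example, whose bit pattern did not survive extraction.

```python
def hamming12_correct(word):
    """Return (corrected word, syndrome) for a 12-bit received word.
    Syndrome 0 means no single-bit error was detected."""
    w = [0] + list(word)                 # 1-indexed positions
    syndrome = 0
    for c in (1, 2, 4, 8):
        parity = 0
        for pos in range(1, 13):
            if pos & c:
                parity ^= w[pos]
        if parity:                       # this check group failed
            syndrome += c
    if syndrome:
        w[syndrome] ^= 1                 # flip the bit the syndrome names
    return w[1:], syndrome

received = [0] * 12                      # all-zero word is a valid code word
received[5 - 1] ^= 1                     # flip the bit at position 5
fixed, syndrome = hamming12_correct(received)
assert syndrome == 5                     # check bits 1 and 4 failed: 1 + 4 = 5
assert fixed == [0] * 12                 # error corrected
```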

SLIDE 24

Intel Processor specifications

SLIDE 25

Byzantine Generals Problem

Actual Computer Systems:

A reliable system must cope with the failure of one or more components. A failed component may behave erratically (e.g., stop working, or send conflicting information).

Abstract Problem

Imagine a Byzantine commanding general with several lieutenant generals. After receiving the command, the lieutenants must agree on a plan of action (attack or retreat). Some of the generals may be traitors.

SLIDE 26

Byzantine Generals Problem (continued)

Formal Problem:

A commanding general must send an order to his n − 1 lieutenants such that:

  • IC1. All loyal lieutenants obey the same order.
  • IC2. If the commanding general is loyal, then every loyal lieutenant obeys the order he sends.

Communication: oral messages sent by messenger. A solution is an algorithm allowing the loyal lieutenants to satisfy IC1 and IC2. Unfortunate result: there is no solution unless more than two-thirds of the generals are loyal.

SLIDE 27

Byzantine Generals Problem: Impossibility for N=3 and T=1

There is no solution for 3 generals with one traitor. To satisfy IC2, a loyal lieutenant must obey the order it hears from the Commander. But a loyal lieutenant cannot distinguish a traitorous fellow lieutenant misreporting the order from a traitorous Commander sending conflicting orders; in the second scenario the two loyal lieutenants obey different orders, so IC1 is violated.

SLIDE 28

Byzantine Generals Problem: Impossibility

There is no solution for m traitors when N ≤ 3m (call them A generals). Suppose there were an algorithm for the A generals. Use it to solve the N = 3, T = 1 problem (B generals): assign at most N/3 of the A generals to each of the 3 B generals, and let each B general simulate its A generals' steps of the algorithm. The algorithm stops with the loyal A generals in agreement, so the loyal B generals agree, solving the 3-general, 1-traitor case. This is a contradiction, so no solution for the A generals exists.

SLIDE 29

Byzantine Generals Problem: Solution for N=4 and T=1
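As a sketch of why four generals suffice against one traitor: in the oral-messages algorithm OM(1), each lieutenant takes the majority of the order it heard directly and the orders the other lieutenants relay. The function names and the traitors' strategies below are illustrative assumptions, not from the slides.

```python
from collections import Counter

def om1(orders, liars):
    """orders[i]: the order the commander sends lieutenant i (a traitorous
    commander may send different orders). liars: dict mapping a traitorous
    lieutenant's index to the function it applies when relaying."""
    n = len(orders)
    decisions = []
    for i in range(n):
        votes = [orders[i]]                        # heard directly
        for j in range(n):
            if j != i:
                relay = liars.get(j, lambda v: v)  # loyal lieutenants relay honestly
                votes.append(relay(orders[j]))
        decisions.append(Counter(votes).most_common(1)[0][0])
    return decisions

# Traitorous commander sends conflicting orders: IC1 still holds.
d = om1(["attack", "attack", "retreat"], {})
assert d[0] == d[1] == d[2]

# Loyal commander, one traitorous lieutenant: IC2 still holds.
d = om1(["attack", "attack", "attack"], {2: lambda v: "retreat"})
assert d[0] == "attack" and d[1] == "attack"
```

With only three generals the same majority vote ties or misleads, which is the impossibility shown two slides back.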

SLIDE 30

Byzantine Generals Problem: Other versions

  • An approximate solution is also impossible.
  • There is a solution for N = 3 and T = 1 when messages are signed.
  • Synchronicity is important.
