
Introduction Probabilistic models Buddy algorithm Silent errors Conclusion

Optimal checkpointing periods with fail-stop and silent errors

Anne Benoit ENS Lyon

Anne.Benoit@ens-lyon.fr http://graal.ens-lyon.fr/~abenoit

3rd JLESC Summer School June 30, 2016

Anne.Benoit@ens-lyon.fr 3rd JLESC Summer School Optimal checkpointing periods 1/ 57

Exascale platforms

Hierarchical

  • $10^5$ or $10^6$ nodes
  • Each node equipped with $10^4$ or $10^3$ cores

Failure-prone

  MTBF of one node                   1 year    10 years   120 years
  MTBF of platform of $10^6$ nodes   30 sec    5 min      1 h

More nodes ⇒ Shorter MTBF (Mean Time Between Failures)

Exascale = Petascale ×1000

Even for today’s platforms (courtesy F. Cappello)

A few definitions

  • Many types of faults: software error, hardware malfunction, memory corruption
  • Many possible behaviors: silent, transient, unrecoverable
  • Restrict to faults that lead to application failures
  • This includes all hardware faults, and some software ones
  • Will use terms fault and failure interchangeably
  • Silent errors (SDC) will be addressed later in the course
  • First question: quantify the rate or frequency at which these faults strike!

Exponential failure distributions

[Figure: failure probability versus time (years) for a sequential machine, Exp(1/100)]

Exp(λ): Exponential distribution law of parameter λ:

  • Probability density function (pdf): $f(t) = \lambda e^{-\lambda t}$ for $t \ge 0$
  • Cumulative distribution function (cdf): $F(t) = 1 - e^{-\lambda t}$
  • Mean: $\mu = 1/\lambda$

X random variable for Exp(λ) failure inter-arrival times:

  • $P(X \le t) = 1 - e^{-\lambda t}$ (by definition)
  • Memoryless property: $P(X \ge t + s \mid X \ge s) = P(X \ge t)$ for all $t, s \ge 0$: at any instant, time to next failure does not depend on time elapsed since last failure
  • Mean Time Between Failures (MTBF): $\mu = E(X) = 1/\lambda$
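The exponential law and its memoryless property can be checked numerically. The sketch below is only an illustration: the function names are hypothetical, and the values of t and s are arbitrary; only the rate Exp(1/100) comes from the plot.

```python
import math

def exp_pdf(t, lam):
    """Density f(t) = lam * exp(-lam * t) for t >= 0, and 0 otherwise."""
    return lam * math.exp(-lam * t) if t >= 0 else 0.0

def exp_cdf(t, lam):
    """F(t) = P(X <= t) = 1 - exp(-lam * t) for t >= 0, and 0 otherwise."""
    return 1.0 - math.exp(-lam * t) if t >= 0 else 0.0

def survival(t, lam):
    """P(X >= t) = 1 - F(t)."""
    return 1.0 - exp_cdf(t, lam)

lam = 1.0 / 100.0   # Exp(1/100), time in years
mtbf = 1.0 / lam    # mean of the distribution: 100 years

# Memoryless property: P(X >= t + s | X >= s) equals P(X >= t).
t, s = 30.0, 50.0   # arbitrary illustrative values
conditional = survival(t + s, lam) / survival(s, lam)
print(f"MTBF = {mtbf:.0f} years")
print(f"P(X >= t + s | X >= s) = {conditional:.6f}")
print(f"P(X >= t)              = {survival(t, lam):.6f}")
```

The two printed probabilities agree because the ratio of the two exponentials simplifies to $e^{-\lambda t}$.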

With several processors

  • Rebooting only faulty processor
  • Platform failure distribution ⇒ superposition of p IID processor distributions ⇒ IID only for Exponential
  • Define $\mu_p$ by $\lim_{F \to +\infty} \frac{n(F)}{F} = \frac{1}{\mu_p}$, where $n(F)$ = number of platform failures until time F is exceeded
  • Theorem: $\mu_p = \mu / p$ for arbitrary distributions

Summary for the road

  • MTBF key parameter and $\mu_p = \mu / p$
  • Exponential distribution OK for most purposes
  • Assume failure independence while not (completely) true
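As a quick illustration of $\mu_p = \mu / p$, the sketch below recomputes the platform MTBF for $10^6$ nodes from the three per-node MTBF values quoted on the "Exascale platforms" slide. The function name and the 365-day year are assumptions made for the example.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumption: 365-day year

def platform_mtbf(node_mtbf, p):
    """Platform MTBF mu_p = mu / p for p identical, independent nodes."""
    return node_mtbf / p

p = 10**6
for years in (1, 10, 120):
    mu_p = platform_mtbf(years * SECONDS_PER_YEAR, p)
    print(f"node MTBF {years:>3} years -> platform MTBF {mu_p:,.0f} s ({mu_p / 60:.1f} min)")
```

The results are close to the rounded figures of 30 sec, 5 min and 1 h given on the slide.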

General purpose approach

Periodic checkpointing, rollback and recovery

Outline

1. Probabilistic models
     Young/Daly’s approximation
     Assessing protocols at scale
2. In-memory checkpointing
3. Dealing with silent errors
4. Conclusion

Checkpointing cost

[Figure: timeline of successive chunks, alternating time spent working and time spent checkpointing]

Blocking model: while a checkpoint is taken, no computation can be performed

Framework

  • Periodic checkpointing policy of period T
  • Independent and identically distributed (IID) failures
  • Applies to a single processor with MTBF $\mu = \mu_{ind}$
  • Applies to a platform with p processors with MTBF $\mu = \mu_{ind} / p$
      • coordinated checkpointing
      • tightly-coupled application
      • progress ⇔ all processors available
    ⇒ platform = single (powerful, unreliable) processor
  • Waste: fraction of time not spent for useful computations

Waste in fault-free execution

[Figure: timeline of successive chunks, alternating time spent working and time spent checkpointing]

  • $\mathrm{Time}_{base}$: application base time
  • $\mathrm{Time}_{FF}$: with periodic checkpoints but failure-free

$\mathrm{Time}_{FF} = \mathrm{Time}_{base} + \#checkpoints \times C$

$\#checkpoints = \left\lceil \frac{\mathrm{Time}_{base}}{T - C} \right\rceil \approx \frac{\mathrm{Time}_{base}}{T - C}$ (valid for large jobs)

$\mathrm{Waste}[FF] = \frac{\mathrm{Time}_{FF} - \mathrm{Time}_{base}}{\mathrm{Time}_{FF}} = \frac{C}{T}$

Waste due to failures

  • $\mathrm{Time}_{base}$: application base time
  • $\mathrm{Time}_{FF}$: with periodic checkpoints but failure-free
  • $\mathrm{Time}_{final}$: expectation of time with failures

$\mathrm{Time}_{final} = \mathrm{Time}_{FF} + N_{faults} \times T_{lost}$

  • $N_{faults}$: number of failures during execution
  • $T_{lost}$: average time lost per failure

$N_{faults} = \frac{\mathrm{Time}_{final}}{\mu}$

$T_{lost}$?

Computing Tlost

[Figure: processors P0 to P3 over a period T (work T − C, checkpoint C); after a failure: downtime D, recovery R, and T/2 of lost work]

$T_{lost} = D + R + \frac{T}{2}$

Rationale
  ⇒ Instants when periods begin and failures strike are independent
  ⇒ Approximation used for all distribution laws
  ⇒ Exact for Exponential and uniform distributions

Waste due to failures

$\mathrm{Time}_{final} = \mathrm{Time}_{FF} + N_{faults} \times T_{lost}$

$\mathrm{Waste}[fail] = \frac{\mathrm{Time}_{final} - \mathrm{Time}_{FF}}{\mathrm{Time}_{final}} = \frac{1}{\mu} \left( D + R + \frac{T}{2} \right)$

Total waste

[Figure: $\mathrm{Time}_{final}$ split into $\mathrm{Time}_{FF} = \mathrm{Time}_{final} (1 - \mathrm{Waste}[fail])$ and $\mathrm{Time}_{final} \times \mathrm{Waste}[fail]$; $\mathrm{Time}_{FF}$ is a sequence of periods, each with T − C of work followed by a checkpoint C]

$\mathrm{Waste} = \frac{\mathrm{Time}_{final} - \mathrm{Time}_{base}}{\mathrm{Time}_{final}}$

$1 - \mathrm{Waste} = (1 - \mathrm{Waste}[FF])(1 - \mathrm{Waste}[fail])$

$\mathrm{Waste} = \frac{C}{T} + \left( 1 - \frac{C}{T} \right) \frac{1}{\mu} \left( D + R + \frac{T}{2} \right)$

Waste minimization

$\mathrm{Waste} = \frac{C}{T} + \left( 1 - \frac{C}{T} \right) \frac{1}{\mu} \left( D + R + \frac{T}{2} \right)$

$\mathrm{Waste} = \frac{u}{T} + v + wT$, with

$u = C \left( 1 - \frac{D + R}{\mu} \right)$, $v = \frac{D + R - C/2}{\mu}$, $w = \frac{1}{2\mu}$

Waste minimized for $T = \sqrt{u / w}$:

$T = \sqrt{2 (\mu - (D + R)) C}$
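The waste formula and its minimizer can be evaluated directly. The following is a minimal sketch, assuming the first-order model of these slides; the function names and the numerical values of C, D, R and µ are hypothetical and chosen only for illustration.

```python
import math

def waste(T, C, D, R, mu):
    """Total waste: 1 - Waste = (1 - Waste[FF]) * (1 - Waste[fail])."""
    waste_ff = C / T                       # fault-free waste C / T
    waste_fail = (D + R + T / 2.0) / mu    # failure-induced waste
    return waste_ff + (1.0 - waste_ff) * waste_fail

def optimal_period(C, D, R, mu):
    """Period minimizing the waste: T = sqrt(2 * (mu - (D + R)) * C)."""
    return math.sqrt(2.0 * (mu - (D + R)) * C)

# Hypothetical parameters, all in seconds.
C, D, R, mu = 600.0, 60.0, 600.0, 86400.0
T_opt = optimal_period(C, D, R, mu)
print(f"T_opt = {T_opt:.0f} s, waste = {waste(T_opt, C, D, R, mu):.3f}")

# Shorter or longer periods give a larger waste.
for factor in (0.5, 0.9, 1.1, 2.0):
    T = factor * T_opt
    print(f"T = {T:8.0f} s, waste = {waste(T, C, D, R, mu):.3f}")
```

The comparison loop only illustrates that the waste grows on both sides of the computed period; it is not a proof of optimality.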

Validity of the approach (1/3)

Technicalities

  • $E(N_{faults}) = \frac{\mathrm{Time}_{final}}{\mu}$ and $E(T_{lost}) = D + R + \frac{T}{2}$, but expectation of product is not product of expectations (not independent RVs here)
  • Enforce $C \le T$ to get $\mathrm{Waste}[FF] \le 1$
  • Enforce $D + R \le \mu$ and bound T to get $\mathrm{Waste}[fail] \le 1$, but $\mu = \mu_{ind} / p$ too small for large p, regardless of $\mu_{ind}$

Validity of the approach (2/3)

Several failures within same period?

  • Waste[fail] accurate only when two or more faults do not take place within same period
  • Cap period: $T \le \gamma \mu$, where γ is some tuning parameter
      • Poisson process of parameter $\theta = T / \mu$
      • Probability of having $k \ge 0$ failures: $P(X = k) = \frac{\theta^k}{k!} e^{-\theta}$
      • Probability of having two or more failures: $\pi = P(X \ge 2) = 1 - (P(X = 0) + P(X = 1)) = 1 - (1 + \theta) e^{-\theta}$
      • $\gamma = 0.27$ ⇒ $\pi \approx 0.03$ ⇒ overlapping faults for only 3% of checkpointing segments
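The probability of two or more failures within one period follows from the Poisson formula. The short sketch below evaluates it for θ = γ = 0.27 (the value obtained when T = γµ); the function names are hypothetical.

```python
import math

def poisson_pmf(k, theta):
    """P(X = k) = theta**k / k! * exp(-theta)."""
    return theta**k / math.factorial(k) * math.exp(-theta)

def prob_at_least_two(theta):
    """pi = P(X >= 2) = 1 - (1 + theta) * exp(-theta)."""
    return 1.0 - (1.0 + theta) * math.exp(-theta)

theta = 0.27  # T = gamma * mu with gamma = 0.27
pi = prob_at_least_two(theta)
print(f"P(X >= 2) = {pi:.4f}")
```

The value is about 3%, which is where the "3% of checkpointing segments" figure comes from.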

Validity of the approach (3/3)

  • Enforce $T \le \gamma \mu$, $C \le \gamma \mu$, and $D + R \le \gamma \mu$
  • Optimal period $\sqrt{2 (\mu - (D + R)) C}$ may not belong to admissible interval $[C, \gamma \mu]$
  • Waste is then minimized for one of the bounds of this admissible interval (by convexity)

Wrap up

  • Capping periods, and enforcing a lower bound on MTBF ⇒ mandatory for mathematical rigor
  • Not needed for practical purposes
      • actual job execution uses optimal value
      • account for multiple faults by re-executing work until success
  • Approach surprisingly robust
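Restricting the period to the admissible interval [C, γµ] amounts to clamping the unconstrained optimum, since the waste is convex in T. This is a hedged sketch: the function name, the default γ = 0.27 and the example values are assumptions made for illustration.

```python
import math

def capped_period(C, D, R, mu, gamma=0.27):
    """Clamp sqrt(2 * (mu - (D + R)) * C) to the admissible interval [C, gamma * mu]."""
    upper = gamma * mu
    if C > upper or D + R > upper:
        raise ValueError("model constraints C <= gamma*mu and D + R <= gamma*mu are violated")
    unconstrained = math.sqrt(2.0 * (mu - (D + R)) * C)
    # By convexity, the constrained minimum is the nearest bound when the
    # unconstrained optimum falls outside the interval.
    return min(max(unconstrained, C), upper)

# Hypothetical values in seconds: a reliable platform, then a much less reliable one.
print(capped_period(C=600.0, D=60.0, R=600.0, mu=86400.0))  # optimum lies inside the interval
print(capped_period(C=600.0, D=60.0, R=600.0, mu=3600.0))   # optimum is capped at gamma * mu
```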

Lesson learnt for fail-stop failures

(Not so) Secret data

  • Tsubame 2: 962 failures during last 18 months so µ = 13 hrs
  • Blue Waters: 2-3 node failures per day
  • Titan: a few failures per day
  • Tianhe 2: wouldn’t say

$T_{opt} = \sqrt{2 \mu C}$ ⇒ $\mathrm{Waste}[opt] \approx \sqrt{\frac{2C}{\mu}}$

  Petascale:      C = 20 min   µ = 24 hrs     ⇒ Waste[opt] = 17%
  Scale by 10:    C = 20 min   µ = 2.4 hrs    ⇒ Waste[opt] = 53%
  Scale by 100:   C = 20 min   µ = 0.24 hrs   ⇒ Waste[opt] = 100%

Exascale = Petascale ×1000

  • Need more reliable components
  • Need to checkpoint faster

Silent errors: detection latency ⇒ additional problems
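The three scaling scenarios can be recomputed from $T_{opt} = \sqrt{2 \mu C}$ and $\mathrm{Waste}[opt] \approx \sqrt{2C/\mu}$. In this sketch the function name is hypothetical, and capping the waste at 100% is an assumption made to match the last row, where the approximation exceeds 1.

```python
import math

def young_daly(C, mu):
    """Return (T_opt, Waste[opt]) with T_opt = sqrt(2*mu*C) and Waste[opt] ~ sqrt(2*C/mu)."""
    t_opt = math.sqrt(2.0 * mu * C)
    waste_opt = min(1.0, math.sqrt(2.0 * C / mu))  # a waste cannot exceed 100%
    return t_opt, waste_opt

C = 20.0  # checkpoint time in minutes
scenarios = (("Petascale", 24.0), ("Scale by 10", 2.4), ("Scale by 100", 0.24))
for label, mu_hours in scenarios:
    t_opt, w = young_daly(C, mu_hours * 60.0)  # MTBF converted to minutes
    print(f"{label:<13} T_opt = {t_opt:6.1f} min   Waste[opt] = {w:.0%}")
```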

Exponential failure distribution

  • How to compute the expected time $E(T(W, C, D, R, \lambda))$ to execute a work of duration W followed by a checkpoint of duration C?
  • How to extend this result for sequential and parallel jobs?

Attend the hands-on session at 14:45: "Mathematical exercises on Daly and extensions"!

Which checkpointing protocol to use?

Coordinated checkpointing

  • No risk of cascading rollbacks
  • No need to log messages
  • All processors need to roll back
  • Rumor: May not scale to very large platforms

Hierarchical checkpointing

  • Need to log inter-group messages
      • Slows down failure-free execution
      • Increases checkpoint size/time
  • Only processors from failed group need to roll back
  • Faster re-execution with logged messages
  • Rumor: Should scale to very large platforms

Blocking vs. non-blocking

[Figure: timeline of successive chunks showing time spent working, time spent checkpointing, and time spent working with slowdown]

  • Blocking model: checkpointing blocks all computations
  • Non-blocking model: checkpointing has no impact on computations (e.g., first copy state to RAM, then copy RAM to disk)
  • General model: checkpointing slows computations down: during a checkpoint of duration C, the same amount of computation is done as during a time αC without checkpointing ($0 \le \alpha \le 1$)

Waste in fault-free execution

[Figure: processors P0 to P3 over a period T: T − C of work at full speed, then a checkpoint of duration C during which work proceeds with slowdown]

  • Time elapsed since last checkpoint: T
  • Amount of computations executed: $\mathrm{Work} = (T - C) + \alpha C$
  • $\mathrm{Waste}[FF] = \frac{T - \mathrm{Work}}{T} = \frac{(1 - \alpha) C}{T}$

Waste due to failures

Failure can happen

  1. During computation phase
  2. During checkpointing phase

Coordinated checkpointing protocol: when one processor is victim of a failure, all processors lose their work and must roll back to last checkpoint

Waste due to failures in computation phase

[Figure sequence: processors P0 to P3 after a failure in the computation phase, showing in turn downtime, recovery time, re-execution of slowed-down work, the computation phase, and the checkpoint]

  1. Downtime: D
  2. Recovery: R. Coordinated checkpointing protocol: all processors must recover from last checkpoint
  3. Re-executing slowed-down work: αC. Redo the work destroyed by the failure, that was done in the checkpointing phase before the computation phase. But no checkpoint is taken in parallel, hence this re-execution is faster than the original computation
  4. Computation phase: T − C. Re-execute the computation phase
  5. Checkpoint: C. Finally, the checkpointing phase is executed

Total waste

[Figure: processors P0 to P3 over a period T, with a failure and the resulting $T_{lost}$, downtime D, recovery R, re-execution of slowed-down work αC, computation T − C and checkpoint C]

$\mathrm{Waste}[fail] = \frac{1}{\mu} \left( D + R + \alpha C + \frac{T}{2} \right)$

Optimal period $T_{opt} = \sqrt{2 (1 - \alpha)(\mu - (D + R + \alpha C)) C}$
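The general model with slowdown factor α can be evaluated in the same way as the blocking model. This sketch assumes that the two wastes combine as 1 − Waste = (1 − Waste[FF])(1 − Waste[fail]); the function names and parameter values are hypothetical.

```python
import math

def waste_general(T, C, D, R, mu, alpha):
    """Total waste when an amount alpha*C of work is done during a checkpoint."""
    waste_ff = (1.0 - alpha) * C / T                   # (T - Work) / T
    waste_fail = (D + R + alpha * C + T / 2.0) / mu
    return waste_ff + waste_fail - waste_ff * waste_fail

def optimal_period_general(C, D, R, mu, alpha):
    """T_opt = sqrt(2 * (1 - alpha) * (mu - (D + R + alpha*C)) * C)."""
    return math.sqrt(2.0 * (1.0 - alpha) * (mu - (D + R + alpha * C)) * C)

# Hypothetical parameters, all in seconds.
C, D, R, mu = 600.0, 60.0, 600.0, 86400.0
for alpha in (0.0, 0.5, 0.9):
    T = optimal_period_general(C, D, R, mu, alpha)
    print(f"alpha = {alpha:.1f}: T_opt = {T:7.0f} s, waste = {waste_general(T, C, D, R, mu, alpha):.3f}")
```

With α = 0 the formula reduces to the blocking-model period $\sqrt{2 (\mu - (D + R)) C}$.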

Hierarchical checkpointing

[Figure: groups G1, G2, G4, G5, ..., Gg over a period T, with labels G·C, T − G·C − $T_{lost}$, α(G − g + 1)C, $T_{lost}$, downtime D and recovery R]

  • Processors partitioned into G groups
  • Each group includes q processors
  • Inside each group: coordinated checkpointing in time C(q)
  • Inter-group messages are logged

Total waste

$\mathrm{Waste}[FF] = \frac{T - \mathrm{Work}}{T}$ with $\mathrm{Work} = T - (1 - \alpha) G C(q)$

$\mathrm{Waste}[fail] = \frac{1}{\mu} \left( D(q) + R(q) + \text{Re-Exec} \right)$ with

$\text{Re-Exec} = \frac{T - G C(q)}{T} \, \text{Re-Exec}_{comp} + \frac{G C(q)}{T} \, \text{Re-Exec}_{ckpt}$

$\mathrm{Waste} = \mathrm{Waste}[FF] + \mathrm{Waste}[fail] - \mathrm{Waste}[FF] \, \mathrm{Waste}[fail]$

Minimize Waste subject to:

  • $G C(q) \le T$ (by construction)
  • Gets complicated! Use computer algebra software

Conclusion

  • Hierarchical protocols better for small MTBFs: more suitable for failure-prone platforms
  • Struggle when communication intensity increases, but limited waste in all other cases
  • The faster the checkpointing time, the smaller the waste

Outline

1. Probabilistic models
2. In-memory checkpointing
3. Dealing with silent errors
4. Conclusion

Motivation

  • Checkpoint transfer and storage ⇒ critical issues of rollback/recovery protocols
  • Stable storage: high cost
  • Distributed in-memory storage:
      • Store checkpoints in local memory ⇒ no centralized storage. Much better scalability
      • Replicate checkpoints ⇒ application survives single failure. Still, risk of fatal failure in some (unlikely) scenarios

Double checkpointing algorithm

  • Platform nodes partitioned into pairs
  • Each node in a pair exchanges its checkpoint with its buddy
  • Each node saves two checkpoints:
      • one locally: storing its own data
      • one remotely: receiving and storing its buddy’s data

Two algorithms

  • blocking version by Zheng, Shi and Kalé
  • non-blocking version by Ni, Meneses and Kalé

Non-blocking checkpoint algorithm

[Figure: timelines of buddy nodes p and p' over one period P, split into phases δ, θ and σ, with overhead φ during θ; markers: local checkpoint done, remote checkpoint done, period done]

  • Checkpoints taken periodically, with period $P = \delta + \theta + \sigma$
  • Phase 1, length δ: local checkpoint, blocking mode. No work
  • Phase 2, length θ: remote checkpoint. Overhead φ
  • Phase 3, length σ: application at full speed 1

Work in failure-free period: $W = (\theta - \phi) + \sigma = P - \delta - \phi$

Cost of overlap

  • Overlap computations and checkpoint file exchanges
  • Large θ ⇒ more flexibility to hide cost of file exchange ⇒ smaller overhead φ
  • $\theta = \theta_{min}$: fastest communication, fully blocking ⇒ $\phi = \theta_{min}$
  • $\theta = \theta_{max}$: full overlap with computation ⇒ $\phi = 0$
  • Linear interpolation $\theta(\phi) = \theta_{min} + \alpha (\theta_{min} - \phi)$
      • $\phi = 0$ for $\theta = \theta_{max} = (1 + \alpha) \theta_{min}$
      • α: rate of overhead decrease w.r.t. communication length
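The relations between θ, φ and the failure-free work can be written down in a few lines. This is an illustrative sketch only: the function names and all numerical values are hypothetical, and the code simply restates the linear interpolation and the work formula of the non-blocking algorithm.

```python
def theta_of_phi(phi, theta_min, alpha):
    """Linear interpolation theta(phi) = theta_min + alpha * (theta_min - phi)."""
    if not 0.0 <= phi <= theta_min:
        raise ValueError("overhead phi must lie in [0, theta_min]")
    return theta_min + alpha * (theta_min - phi)

def failure_free_period(delta, phi, sigma, theta_min, alpha):
    """Return (P, W): period P = delta + theta + sigma and work W = (theta - phi) + sigma."""
    theta = theta_of_phi(phi, theta_min, alpha)
    period = delta + theta + sigma
    work = (theta - phi) + sigma
    return period, work

# Hypothetical values in seconds.
theta_min, alpha = 30.0, 10.0
for phi in (theta_min, theta_min / 2.0, 0.0):
    theta = theta_of_phi(phi, theta_min, alpha)
    period, work = failure_free_period(delta=5.0, phi=phi, sigma=1000.0,
                                       theta_min=theta_min, alpha=alpha)
    print(f"phi = {phi:5.1f}: theta = {theta:6.1f}, P = {period:7.1f}, W = {work:7.1f}")
```

At φ = θmin the transfer is fully blocking and θ = θmin; at φ = 0 the transfer is fully overlapped and θ = (1 + α)θmin.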

Assessing the risk

[Figure: timelines of buddy nodes p and p' and of the node replacing p after a failure: time lost $t_{lost}$, downtime D, recovery R, transfer of the checkpoint of p and of the checkpoint of p', and the resulting risk period]

  • After failure: downtime D and recovery from buddy node
  • Two checkpoint files lost, must be re-sent to faulty processor
      1. Checkpoint of faulty node, needed for recovery ⇒ sent as fast as possible, in time $R = \theta_{min}$
      2. Checkpoint of buddy node, needed in case buddy fails later on ⇒ ??
  • Application at risk until complete reception of both messages

Checkpoint of buddy node

Scenario DoubleNBL

  • File sent at same speed as in regular mode, in time θ(φ)
  • Overhead φ
  • Favors performance, at the price of higher risk

Scenario DoubleBoF

  • File sent as fast as possible, in time $\theta_{min} = R$
  • Overhead R
  • Favors risk reduction, at the price of higher overhead

Computing the waste? Hands-on session at 14:45!

Conclusion

Double checkpointing

  • DoubleBoF reduces risk duration, at the cost of increasing failure overhead
  • Parameter α for transfer cost overlap
  • Unified model for performance/risk bi-criteria assessment

Triple checkpointing

  • Save checkpoint on two remote processes instead of one, without much more memory or storage requirements
  • Excellent success probability, almost no failure-free overhead
  • Assessment of performance and risk factors using unified model
  • Realistic scenarios point to the superiority of Triple

slide-61
SLIDE 61

Introduction Probabilistic models Buddy algorithm Silent errors Conclusion

Outline

1

Probabilistic models

2

In-memory checkpointing

3

Dealing with silent errors Revisiting Young/Daly (base pattern) Pattern with several verifications

4

Conclusion Anne.Benoit@ens-lyon.fr 3rd JLESC Summer School Optimal checkpointing periods 42/ 57

slide-62
SLIDE 62

Introduction Probabilistic models Buddy algorithm Silent errors Conclusion

General-purpose approach

Periodic checkpointing, rollback and recovery:

Time Error

C C C

Works fine for fail-stop errors Detection latency in silent errors ⇒ risk of saving corrupted checkpoint(s) Maintaining multiple checkpoints (Lu, Zheng and Chien, 2013) Requires more stable storage Which checkpoint to roll back to? Critical failure when all live checkpoints are invalid

Anne.Benoit@ens-lyon.fr 3rd JLESC Summer School Optimal checkpointing periods 43/ 57

slide-63
SLIDE 63

Introduction Probabilistic models Buddy algorithm Silent errors Conclusion

General-purpose approach

Periodic checkpointing, rollback and recovery:

Time Corrupt Detect Error

C C C

Works fine for fail-stop errors Detection latency in silent errors ⇒ risk of saving corrupted checkpoint(s) Maintaining multiple checkpoints (Lu, Zheng and Chien, 2013) Requires more stable storage Which checkpoint to roll back to? Critical failure when all live checkpoints are invalid

Anne.Benoit@ens-lyon.fr 3rd JLESC Summer School Optimal checkpointing periods 43/ 57

slide-64
SLIDE 64

Introduction Probabilistic models Buddy algorithm Silent errors Conclusion

General-purpose approach

Periodic checkpointing, rollback and recovery:

Time Corrupt Detect Error

C C C

Works fine for fail-stop errors Detection latency in silent errors ⇒ risk of saving corrupted checkpoint(s) Maintaining multiple checkpoints (Lu, Zheng and Chien, 2013) Requires more stable storage Which checkpoint to roll back to? Critical failure when all live checkpoints are invalid

Anne.Benoit@ens-lyon.fr 3rd JLESC Summer School Optimal checkpointing periods 43/ 57

slide-65
SLIDE 65

Introduction Probabilistic models Buddy algorithm Silent errors Conclusion

General-purpose approach

Periodic checkpointing, rollback and recovery:

Time Corrupt Detect Corrupt

C C C

Works fine for fail-stop errors Detection latency in silent errors ⇒ risk of saving corrupted checkpoint(s) Maintaining multiple checkpoints (Lu, Zheng and Chien, 2013) Requires more stable storage Which checkpoint to roll back to? Critical failure when all live checkpoints are invalid

Anne.Benoit@ens-lyon.fr 3rd JLESC Summer School Optimal checkpointing periods 43/ 57

slide-66
SLIDE 66

Introduction Probabilistic models Buddy algorithm Silent errors Conclusion

General-purpose approach

Periodic checkpointing, rollback and recovery:

Time Corrupt Detect Corrupt

C C C

Works fine for fail-stop errors Detection latency in silent errors ⇒ risk of saving corrupted checkpoint(s) Maintaining multiple checkpoints (Lu, Zheng and Chien, 2013) Requires more stable storage Which checkpoint to roll back to? Critical failure when all live checkpoints are invalid Need to know when silent error occurred


Coping with silent errors

Couple checkpointing with verification:

[Figure: timeline with repeated V C (verification, then checkpoint); labels: Error, Detect]

  • Before each checkpoint, run some verification mechanism or error detection test
  • Silent error, if any, is detected by verification ⇒ need to maintain only one checkpoint, which is always valid
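
The verify-then-checkpoint idea can be sketched as a small control loop. This is only an illustration: the function names, the toy state (a running sum) and the fault injector are hypothetical and not part of the talk, and the verification is assumed to catch every corruption.

```python
import copy

def run_verified(initial_state, compute_chunk, verify, num_chunks):
    """Execute num_chunks chunks of work, verifying before every checkpoint."""
    checkpoint = copy.deepcopy(initial_state)  # single checkpoint, always valid
    state = copy.deepcopy(initial_state)
    done = 0
    rollbacks = 0
    while done < num_chunks:
        state = compute_chunk(state)           # work segment (W)
        if verify(state):                      # verification (V)
            checkpoint = copy.deepcopy(state)  # checkpoint (C): state is known good
            done += 1
        else:
            state = copy.deepcopy(checkpoint)  # recovery (R): roll back, redo chunk
            rollbacks += 1
    return state, rollbacks

def make_faulty_chunk(corrupt_on_calls):
    """Toy workload: add 1 + 2 + 3 + ...; corrupt the result on selected calls."""
    calls = {"n": 0}
    def compute_chunk(state):
        calls["n"] += 1
        count = state["count"] + 1
        new_state = {"count": count, "total": state["total"] + count}
        if calls["n"] in corrupt_on_calls:
            new_state["total"] += 1000         # injected silent error
        return new_state
    return compute_chunk

def verify(state):
    """Application-specific check: total must equal 1 + 2 + ... + count."""
    n = state["count"]
    return state["total"] == n * (n + 1) // 2

final_state, rollbacks = run_verified(
    {"count": 0, "total": 0}, make_faulty_chunk({3}), verify, num_chunks=5
)
print(final_state, rollbacks)
```

Because the state is verified before it is saved, the loop never needs more than one checkpoint, at the price of one verification per chunk.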


Models and objective

Resilience parameters

  • C: Cost of checkpointing
  • R: Cost of recovery
  • V: Cost of verification

Objective

Design a periodic computing pattern that minimizes the expected execution time (makespan) of the application

[Figure: timeline of a repeating pattern; each pattern ends with a verification V followed by a checkpoint C]

Last verification of a pattern is always perfect to avoid saving corrupted checkpoints


Outline

1. Probabilistic models
2. In-memory checkpointing
3. Dealing with silent errors
     • Revisiting Young/Daly (base pattern)
     • Pattern with several verifications
4. Conclusion


Revisiting Young/Daly (Base Pattern Pc)

[Figure: timeline of one base pattern: work of length W followed by a verification V and a checkpoint C]

Proposition. The expected time to execute a base pattern $P_c$ of work length $W$ is

$$E(W) = W + V + C + \lambda W (W + V + R) + O(\lambda^2 W^3)$$

  • Proof. First, express the expected execution time recursively:

$$E(W) = W + V + (1 - e^{-\lambda W})(R + E(W)) + e^{-\lambda W} C$$

Then, solve the recursion and take the first-order approximation.

The approximation is accurate if the platform MTBF is large compared with the resilience parameters.
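
The step "solve the recursion" can be spelled out as follows. This is a sketch that assumes $V$ and $R$ are at most of the order of $W$, so that every second-order term can be absorbed into $O(\lambda^2 W^3)$; it uses $e^{\lambda W} = 1 + \lambda W + O(\lambda^2 W^2)$.

```latex
\begin{align*}
E(W) &= W + V + (1 - e^{-\lambda W})\,(R + E(W)) + e^{-\lambda W} C \\
e^{-\lambda W} E(W) &= W + V + (1 - e^{-\lambda W})\, R + e^{-\lambda W} C \\
E(W) &= e^{\lambda W} (W + V) + (e^{\lambda W} - 1)\, R + C \\
     &= (1 + \lambda W)(W + V) + \lambda W R + C + O(\lambda^2 W^3) \\
     &= W + V + C + \lambda W (W + V + R) + O(\lambda^2 W^3)
\end{align*}
```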


Revisiting Young/Daly (Base Pattern Pc)

Proposition. The optimal work length $W^*$ of the base pattern $P_c$ is

$$W^* = \sqrt{\frac{V + C}{\lambda}}$$

and the optimal expected overhead is

$$\text{Overhead}^* = 2\sqrt{\lambda (V + C)} + O(\lambda)$$

  • Proof. Derive the overhead from the expected execution time:

$$\text{Overhead} = \frac{E(W)}{W} - 1 = \frac{V + C}{W} + \lambda W + \lambda (V + R) + O(\lambda^2 W^2)$$

Balance $W$ to minimize Overhead.
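
The balancing step can be made explicit. Only the two terms that depend on $W$ at first order matter; the term $\lambda (V + R)$ is constant in $W$ and, like $O(\lambda^2 W^{*2})$, contributes $O(\lambda)$ at the optimum. The name $f$ is introduced here for the sketch only.

```latex
\begin{align*}
f(W) &= \frac{V + C}{W} + \lambda W, \qquad
f'(W) = -\frac{V + C}{W^2} + \lambda = 0
\;\Longrightarrow\; W^* = \sqrt{\frac{V + C}{\lambda}} \\
f(W^*) &= \sqrt{\lambda (V + C)} + \sqrt{\lambda (V + C)} = 2\sqrt{\lambda (V + C)}
\end{align*}
```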


Revisiting Young/Daly (Base Pattern Pc)

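Recall from the waste analysis:

                            Fail-stop errors                   Silent errors
Pattern                     $T = W + C$                        $T = W + V + C$
$\mathrm{Waste}_{ff}$       $\frac{C}{T}$                      $\frac{V + C}{T}$
$\mathrm{Waste}_{fail}$     $\lambda (D + R + \frac{W}{2})$    $\lambda (R + W + V)$
Optimal period              $\sqrt{\frac{2C}{\lambda}}$        $\sqrt{\frac{V + C}{\lambda}}$
Waste[opt]                  $\sqrt{2 \lambda C}$               $2\sqrt{\lambda (V + C)}$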
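
To get a feel for the two columns, the first-order formulas can be evaluated numerically. The sketch below uses hypothetical parameter values (checkpoint and verification costs in seconds, a platform MTBF of one day); none of these numbers come from the talk.

```python
import math

def failstop_optimum(C, lam):
    """First-order optimal period and waste for fail-stop errors (T = W + C)."""
    period = math.sqrt(2.0 * C / lam)
    waste = math.sqrt(2.0 * lam * C)
    return period, waste

def silent_optimum(V, C, lam):
    """First-order optimal work length and waste for silent errors (T = W + V + C)."""
    work = math.sqrt((V + C) / lam)
    waste = 2.0 * math.sqrt(lam * (V + C))
    return work, waste

# Hypothetical values, chosen only for illustration.
C, V = 60.0, 20.0        # checkpoint and verification costs, in seconds
lam = 1.0 / 86400.0      # error rate for a platform MTBF of one day

fs_period, fs_waste = failstop_optimum(C, lam)
se_work, se_waste = silent_optimum(V, C, lam)
print(f"fail-stop: period = {fs_period:.0f} s, waste = {fs_waste:.2%}")
print(f"silent:    W*     = {se_work:.0f} s, waste = {se_waste:.2%}")
```

With these values the silent-error pattern is shorter and wastes more: a silent error is detected only at the end of the pattern, so the whole segment $W + V$ is lost instead of $W/2$ on average.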



Pattern with several verifications

Perform several verifications before each checkpoint:

[Figure: timeline with several verifications V before each checkpoint C; labels: Error, Detect]

  • Silent error is detected earlier in the pattern
  • Additional overhead in fault-free executions

  • What is the optimal checkpointing period?
  • How many verifications to use?
  • Where are their positions?

Hands-on session at 14:45!
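
One way to explore these questions numerically is to compute the expected time of a pattern under a simplified model. The sketch below is not the method from the talk: it assumes equally spaced verifications that detect every error, errors that strike only during computation with exponential inter-arrival times, and a restart of the whole pattern after each recovery. All function names are hypothetical.

```python
import math

def expected_pattern_time(W, m, V, C, R, lam):
    """Expected time of a pattern with total work W split into m equal segments.

    Each segment is followed by a verification (cost V); a checkpoint (cost C)
    follows the last verification; a detected error costs a recovery (R) and
    restarts the pattern.
    """
    w = W / m                        # work per segment
    q = math.exp(-lam * w)           # probability that one segment is error-free
    p_success = q ** m               # probability that a whole attempt is error-free
    # Segment j (0-based) is executed only if the j previous segments succeeded.
    attempt = (w + V) * sum(q ** j for j in range(m))
    expected_attempts = 1.0 / p_success
    return expected_attempts * attempt + (expected_attempts - 1.0) * R + C

def overhead(W, m, V, C, R, lam):
    """Relative overhead E / W - 1 of the pattern."""
    return expected_pattern_time(W, m, V, C, R, lam) / W - 1.0

# Hypothetical parameters (seconds); scan the number of verifications.
V, C, R, lam = 20.0, 60.0, 60.0, 1.0 / 86400.0
for m in range(1, 5):
    print(m, round(overhead(4000.0, m, V, C, R, lam), 4))
```

For m = 1 the function reduces to the exact solution of the base-pattern recursion, $e^{\lambda W}(W + V) + (e^{\lambda W} - 1) R + C$. Scanning both W and m then shows the trade-off between earlier detection and extra verification cost.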



Leitmotiv

Resilient research on resilience

Models needed to assess techniques at scale without bias


Conclusion

Multiple approaches to Fault Tolerance

Application-Specific Fault Tolerance will always provide more benefits:

  • Checkpoint size reduction (when needed)
  • Portability (can run on different hardware, different deployment, etc.)
  • Diversity of use (can be used to restart the execution and change parameters in the middle)

More about this tomorrow at 10:00: "ABFT techniques" (Frederic Vivien)


Conclusion

Multiple approaches to Fault Tolerance

General Purpose Fault Tolerance is a required feature of the platforms

  • Not every computer scientist needs to learn how to write fault-tolerant applications
  • Not all parallel applications can be ported to a fault-tolerant version

Faults are a feature of the platform. Why should it be the role of the programmers to handle them?


Conclusion

General Purpose Fault Tolerance

  • Software/hardware techniques to reduce checkpoint, recovery, migration times and to improve failure prediction
  • Need to deal with silent errors and design/use verification mechanisms
  • General problem: multi-criteria scheduling problem (execution time/energy/reliability)
  • Add replication
  • Consider best resource usage (performance trade-offs)

Need to combine all these approaches and find optimal checkpointing periods!

Several challenging algorithmic/scheduling problems


Bibliography

Exascale

  • Toward Exascale Resilience, Cappello F. et al., IJHPCA 23, 4 (2009)
  • The International Exascale Software Roadmap, Dongarra J., Beckman P. et al., IJHPCA 25, 1 (2011)

Models

  • Checkpointing strategies for parallel jobs, Bougeret M. et al., SC'2011
  • Unified model for assessing checkpointing protocols at extreme-scale, Bosilca G. et al., INRIA RR-7950, 2012

Buddy

  • Revisiting the double checkpointing algorithm, Dongarra J., Hérault T., Robert Y., INRIA RR-8196, 2012

Silent errors

  • Assessing general-purpose algorithms to cope with fail-stop and silent errors, Benoit A., Cavelan A., Robert Y., Sun H., INRIA RR-8599, 2014
  • Optimal resilience patterns to cope with fail-stop and silent errors, Benoit A., Cavelan A., Robert Y., Sun H., INRIA RR-8786, 2015


Bibliography

New Monograph, Springer Verlag 2015

Thanks to Yves Robert, Thomas Hérault, George Bosilca, Aurélien Bouteiller and Hongyang Sun, from whom I borrowed some slides
