SLIDE 1

Problem statement Theoretical analysis Performance evaluation Conclusion

Two-level checkpointing and partial verifications for linear task graphs

Anne Benoit, Aurélien Cavelan, Yves Robert and Hongyang Sun
ENS Lyon, France

Anne.Benoit@ens-lyon.fr
http://graal.ens-lyon.fr/~abenoit

6th Int. Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS15) @ SC'15

November 15, 2015, Austin, TX

Anne.Benoit@ens-lyon.fr PMBS’15 Two-level checkpointing and partial verifications 1/ 23

SLIDE 2

Computing at Exascale

Exascale platform: $10^5$ or $10^6$ nodes, each equipped with $10^2$ or $10^3$ cores

Shorter Mean Time Between Failures (MTBF) $\mu$

Theorem: $\mu_p = \mu_{ind} / p$ for arbitrary distributions

MTBF (individual node)            1 year    10 years   120 years
MTBF (platform of $10^6$ nodes)   30 sec    5 min      1 h

Need more reliable components!!

Need more resilient techniques!!!
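The platform MTBF figures quoted on this slide follow directly from the theorem. A minimal Python sketch, assuming a 365-day year (the function name `platform_mtbf` is illustrative and not from the slides):

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumption: 365-day year


def platform_mtbf(mu_ind_seconds, p):
    """MTBF of a platform of p nodes, each with individual MTBF mu_ind (mu_p = mu_ind / p)."""
    return mu_ind_seconds / p


if __name__ == "__main__":
    for years in (1, 10, 120):
        mu_p = platform_mtbf(years * SECONDS_PER_YEAR, 10**6)
        print(f"node MTBF {years:>3} years: platform MTBF {mu_p:8.1f} s")
```

The three values come out close to the rounded entries of the slide (about half a minute, about five minutes, about one hour).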

SLIDE 4

Two main sources of errors

Fail-stop errors: instantaneous error detection, e.g., resource crash
Silent errors (aka silent data corruptions), e.g., soft faults in L1 cache, ALU, double bit flip

A silent error is detected only when the corrupted data is activated, which could happen long after its occurrence
Detection latency is problematic
Before each checkpoint, run some verification mechanism (checksum, ECC, coherence tests, TMR, etc.)
Silent error is detected by verification ⇒ checkpoint always valid

Verified checkpoints, rollback and recovery

SLIDE 5

One step further and partial verifications

Perform several verifications before each checkpoint:

Pro: silent error is detected earlier in the pattern
Con: additional overhead in error-free executions

[Figure: pattern over tasks 1, 2, ..., i, i+1, ..., j, with $V^*$ after task 1, $V^* C$ after task i and $V^* C$ after task j]

Guaranteed/perfect verifications ($V^*$) can be very expensive!
Partial verifications ($V$) are available for many HPC applications!

Lower accuracy: recall $r = \frac{\#\text{detected errors}}{\#\text{total errors}} < 1$
Much lower cost, i.e., $V < V^*$

How many intermediate verifications to use and the positions?

SLIDE 7

Two-level checkpointing

Silent errors: use of a lightweight mechanism of in-memory checkpoints $C_M$
Local copies are lost in case of fail-stop errors: use (less frequent) copies on stable storage (classical disk checkpoints) $C_D$

Always $C_M$ before $C_D$: little overhead, enforced in practice
Always $V^*$ before $C_M$: all checkpoints are valid
Verifications, memory copies and I/O transfers are protected from errors

[Figure: pattern over tasks 1, 2, ..., i, i+1, ..., j, with a partial verification $V$ after task 1, $V^* C_M$ after task i and $V^* C_M C_D$ after task j]

SLIDE 8

Outline

1. Problem statement
2. Theoretical analysis
3. Performance evaluation
4. Conclusion

SLIDE 9

Application and errors

Linear chain of tasks $T_1, T_2, \ldots, T_n$
Each task $T_i$ has a weight $w_i$ (computational load)
$W_{i,j} = \sum_{k=i+1}^{j} w_k$: time to execute tasks $T_{i+1}$ to $T_j$
Subject to fail-stop and silent errors, independent and following a Poisson process with arrival rates $\lambda_f$ and $\lambda_s$
$p^f_{i,j} = 1 - e^{-\lambda_f W_{i,j}}$: probability of having at least one fail-stop error while executing $T_{i+1}$ to $T_j$
$p^s_{i,j} = 1 - e^{-\lambda_s W_{i,j}}$: idem for silent errors
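The quantities defined on this slide can be computed with prefix sums. The following is a minimal Python sketch; the helper name `chain_model` is an assumption made for illustration, and the example rates are those listed for the Hera platform in the simulation settings.

```python
import math
from itertools import accumulate


def chain_model(weights, lam_f, lam_s):
    """Return W(i, j), p_f(i, j), p_s(i, j) for a linear chain with the given task weights."""
    prefix = [0.0, *accumulate(weights)]  # prefix[k] = w_1 + ... + w_k

    def W(i, j):
        # Time to execute tasks T_{i+1} .. T_j
        return prefix[j] - prefix[i]

    def p_f(i, j):
        # 1 - exp(-lam_f * W), written with expm1 for accuracy on small arguments
        return -math.expm1(-lam_f * W(i, j))

    def p_s(i, j):
        return -math.expm1(-lam_s * W(i, j))

    return W, p_f, p_s


if __name__ == "__main__":
    W, p_f, p_s = chain_model([2500.0] * 10, lam_f=9.46e-7, lam_s=3.38e-6)
    print(W(0, 10), p_f(0, 10), p_s(0, 10))
```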

SLIDE 10

Resilience parameters and objective

Cost of disk checkpointing $C_D$, cost of disk recovery $R_D$
Cost of memory checkpointing $C_M$, cost of memory recovery $R_M$
For simplicity, $R_M$ is included in $R_D$
Cost $V^*$ for a guaranteed verification
Cost $V$ for a partial verification, with recall $r$; $g = 1 - r$ is the proportion of undetected errors
⇒ Decide where to place disk checkpoints, memory checkpoints, guaranteed verifications and partial verifications, in order to minimize the expected execution time (or makespan) of the application
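To get a feel for the role of $g$, here is a small Python sketch of how quickly the miss probability shrinks when several partial verifications are chained. It assumes that successive partial verifications miss an error independently, which is an assumption made here for illustration; the function name is hypothetical, and $r = 0.8$ is the recall used in the simulation settings.

```python
def miss_probability(recall, k):
    """Probability that an error escapes k successive partial verifications.

    Assumption: each verification misses independently with probability g = 1 - recall.
    """
    g = 1.0 - recall
    return g ** k


if __name__ == "__main__":
    for k in (1, 2, 3):
        print(k, miss_probability(0.8, k))
```

Under this assumption, a few cheap partial verifications can approach the accuracy of one guaranteed verification, which is why their number and positions are part of the optimization.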

SLIDE 12

Dynamic programming algorithm

Several dynamic programming levels:
First decide where to place disk checkpoints
Then memory checkpoints between any two disk checkpoints
And finally, guaranteed or partial verifications between any two memory checkpoints
Compute the expected execution time between any two verifications

[Figure: nested intervals on the task chain. $E_{disk}(d_2)$ spans $d_0$ to $d_2$; $E_{mem}(d_1, m_2)$ spans $d_1$ to $m_2$; $E_{verif}(d_1, m_1, v_2)$ spans $m_1$ to $v_2$; $E(d_1, m_1, v_1, v_2)$ spans $v_1$ to $v_2$]

SLIDE 13

Without partial verifications

Placing disk checkpoints:

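[Figure: $E_{disk}(d_2)$ splits into $E_{disk}(d_1)$, covering $d_0$ to $d_1$, and $E_{mem}(d_1, d_2)$, covering $d_1$ to $d_2$]

$E_{disk}(d_2)$: expected time needed to successfully execute tasks $T_1$ to $T_{d_2}$, where $T_{d_2}$ is followed by $V^* C_M C_D$:

$$E_{disk}(d_2) = \min_{0 \le d_1 < d_2} \{ E_{disk}(d_1) + E_{mem}(d_1, d_2) + C_D \}$$

Objective: $E_{disk}(n)$
Initialization: $E_{disk}(0) = 0$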
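The outer recurrence can be sketched on its own in Python by treating $E_{mem}$ as a black box. In the sketch below, `e_mem` is any callable supplied by the caller and `toy_e_mem` is a purely hypothetical cost used only to exercise the code; neither is the model from the slides.

```python
def place_disk_checkpoints(n, e_mem, C_D):
    """Solve E_disk(d2) = min over 0 <= d1 < d2 of E_disk(d1) + e_mem(d1, d2) + C_D.

    Returns (E_disk(n), sorted list of task indices followed by a disk checkpoint).
    """
    best = [0.0] + [float("inf")] * n   # best[0] = E_disk(0) = 0
    prev = [0] * (n + 1)                # argmin d1 for each d2, used for backtracking
    for d2 in range(1, n + 1):
        for d1 in range(d2):
            candidate = best[d1] + e_mem(d1, d2) + C_D
            if candidate < best[d2]:
                best[d2], prev[d2] = candidate, d1
    positions, d = [], n
    while d > 0:                        # walk back from the last checkpoint
        positions.append(d)
        d = prev[d]
    return best[n], positions[::-1]


def toy_e_mem(d1, d2):
    # Hypothetical superlinear segment cost, for illustration only
    return (d2 - d1) ** 2


if __name__ == "__main__":
    print(place_disk_checkpoints(4, toy_e_mem, C_D=3.0))
```

With a superlinear segment cost, long segments are penalized, so the recurrence trades the checkpoint cost $C_D$ against segment length.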

SLIDE 14

Without partial verifications

Placing memory checkpoints:

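[Figure: within $E_{mem}(d_1, d_2)$, the quantity $E_{mem}(d_1, m_2)$ splits into $E_{mem}(d_1, m_1)$, covering $d_1$ to $m_1$, and $E_{verif}(d_1, m_1, m_2)$, covering $m_1$ to $m_2$]

$E_{mem}(d_1, m_2)$: expected time needed to successfully execute tasks $T_{d_1+1}$ to $T_{m_2}$, where $T_{d_1}$ is followed by $V^* C_M C_D$ and $T_{m_2}$ is followed by $V^* C_M$:

$$E_{mem}(d_1, m_2) = \min_{d_1 \le m_1 < m_2} \{ E_{mem}(d_1, m_1) + E_{verif}(d_1, m_1, m_2) + C_M \}$$

Initialization: $E_{mem}(d_1, d_1) = 0$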

SLIDE 15

Without partial verifications

Placing additional guaranteed verifications:

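[Figure: within $E_{verif}(d_1, m_1, m_2)$, the quantity $E_{verif}(d_1, m_1, v_2)$ splits into $E_{verif}(d_1, m_1, v_1)$, covering $m_1$ to $v_1$, and $E(d_1, m_1, v_1, v_2)$, covering $v_1$ to $v_2$]

$E_{verif}(d_1, m_1, v_2)$: expected time needed to successfully execute tasks $T_{m_1+1}$ to $T_{v_2}$, where $T_{d_1}$ is followed by $V^* C_M C_D$, $T_{m_1}$ is followed by $V^* C_M$, and $T_{v_2}$ is followed by $V^*$:

$$E_{verif}(d_1, m_1, v_2) = \min_{m_1 \le v_1 < v_2} \{ E_{verif}(d_1, m_1, v_1) + E(d_1, m_1, v_1, v_2) \}$$

Initialization: $E_{verif}(d_1, m_1, m_1) = 0$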

SLIDE 16

Without partial verifications

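Expected execution time between two verifications $E(d_1, m_1, v_1, v_2)$, knowing the positions of the last $C_D$ and the last $C_M$:
With probability $p^f_{v_1,v_2}$ (fail-stop error), recover from $C_D$
Otherwise, with probability $p^s_{v_1,v_2}$ (silent error), detect the error at $v_2$ and recover from $C_M$

$$E(d_1, m_1, v_1, v_2) = p^f_{v_1,v_2} \left( T^{lost}_{v_1,v_2} + R_D + E_{mem}(d_1, m_1) + E_{verif}(d_1, m_1, v_1) + E(d_1, m_1, v_1, v_2) \right) + \left( 1 - p^f_{v_1,v_2} \right) \left( W_{v_1,v_2} + V^* + p^s_{v_1,v_2} \left( R_M + E_{verif}(d_1, m_1, v_1) + E(d_1, m_1, v_1, v_2) \right) \right)$$

Compute $T^{lost}_{v_1,v_2} = \frac{1}{\lambda_f} - \frac{W_{v_1,v_2}}{e^{\lambda_f W_{v_1,v_2}} - 1}$ and simplify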
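Since $E(d_1, m_1, v_1, v_2)$ appears on both sides of its equation, one way to simplify is to collect the $E$ terms and use $(1 - p^f)(1 - p^s) = e^{-(\lambda_f + \lambda_s) W_{v_1,v_2}}$, which gives $E = e^{(\lambda_f + \lambda_s) W_{v_1,v_2}}$ times the remaining right-hand side. The Python sketch below combines this closed form with the three recurrences for the case without partial verifications. It is an illustrative reading of the slides under that derivation, not the authors' implementation; the function name and argument order are assumptions, task weights are assumed strictly positive, and the example values are the Hera parameters from the simulation settings with a uniform chain.

```python
import math
from functools import lru_cache


def admv_star_makespan(weights, lam_f, lam_s, C_D, C_M, R_D, R_M, V_star):
    """Expected makespan E_disk(n) with disk/memory checkpoints and guaranteed verifications."""
    n = len(weights)
    prefix = [0.0]
    for w in weights:
        prefix.append(prefix[-1] + w)

    @lru_cache(maxsize=None)
    def E(d1, m1, v1, v2):
        work = prefix[v2] - prefix[v1]                  # W_{v1,v2}, assumed > 0
        pf = -math.expm1(-lam_f * work)                 # fail-stop probability
        ps = -math.expm1(-lam_s * work)                 # silent-error probability
        t_lost = 1.0 / lam_f - work / math.expm1(lam_f * work)
        redo = E_verif(d1, m1, v1)                      # re-execution from the last C_M
        fail_stop = pf * (t_lost + R_D + E_mem(d1, m1) + redo)
        no_fail = (1.0 - pf) * (work + V_star + ps * (R_M + redo))
        # Closed form obtained by solving the fixed-point equation for E
        return math.exp((lam_f + lam_s) * work) * (fail_stop + no_fail)

    @lru_cache(maxsize=None)
    def E_verif(d1, m1, v2):
        if v2 == m1:
            return 0.0
        return min(E_verif(d1, m1, v1) + E(d1, m1, v1, v2) for v1 in range(m1, v2))

    @lru_cache(maxsize=None)
    def E_mem(d1, m2):
        if m2 == d1:
            return 0.0
        return min(E_mem(d1, m1) + E_verif(d1, m1, m2) + C_M for m1 in range(d1, m2))

    @lru_cache(maxsize=None)
    def E_disk(d2):
        if d2 == 0:
            return 0.0
        return min(E_disk(d1) + E_mem(d1, d2) + C_D for d1 in range(d2))

    return E_disk(n)


if __name__ == "__main__":
    hera = dict(lam_f=9.46e-7, lam_s=3.38e-6, C_D=300.0, C_M=15.4,
                R_D=300.0, R_M=15.4, V_star=15.4)
    makespan = admv_star_makespan([2500.0] * 10, **hera)
    print(makespan, makespan / 25000.0)
```

Memoizing each level keeps the sketch short; it returns only the optimal expected makespan, and recovering the actual placements would require storing the argmin at each level.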

SLIDE 17

And with partial verifications?

Probability $g$ that an error remains undetected after a partial verification
Need to account for the time lost executing the following tasks until the error is detected: first compute the values at the right of the current interval
$E_{partial}(d_1, m_1, v_1, p_1, v_2)$: expected time needed to execute all tasks $T_{p_1+1}$ to $T_{v_2}$; tries all positions $p_2$ for the next partial verification
$E_{partial}(d_1, m_1, v_1, p_1, v_2)$ recursively calls $E_{partial}(d_1, m_1, v_1, p_2, v_2)$
To compute $E^-(d_1, m_1, v_1, p_1, p_2, v_2)$, need to know $E_{left}(v_1, p_1)$ and $E_{right}(d_1, m_1, v_1, p_2, v_2)$; $E_{right}$ can be computed, and $E_{left}$ is accounted for separately (independent of the number of partial verifications)

[Figure: nested intervals on the task chain with positions $d_0, d_1, d_2$ (disk checkpoints), $m_1, m_2$ (memory checkpoints), $v_1, v_2$ (guaranteed verifications) and $p_1, p_2$ (partial verifications), showing $E_{disk}(d_2)$, $E_{mem}(d_1, m_2)$, $E_{verif}(d_1, m_1, v_2)$, $E_{partial}(d_1, m_1, v_1, p_1, v_2)$, $E^-(d_1, m_1, v_1, p_1, p_2, v_2)$, $E_{left}(v_1, p_1)$ and $E_{right}(d_1, m_1, v_1, p_2, v_2)$]

SLIDE 19

Simulation settings

Identical recovery and checkpoint costs: $R_D = C_D$ and $R_M = C_M$
$V^* = C_M$ (check all data in memory), $V = V^*/100$ and $r = 0.8$

Work $W = 25000$ seconds, distributed between up to $n = 50$ tasks:
Uniform: all tasks share the same cost $W/n$ (matrix multiplication, iterative stencil kernels)
Decrease: task $T_i$ has cost $\alpha (n + 1 - i)^2$, where $\alpha \approx 3W/n^3$ (dense matrix solvers)
HighLow: set of identical tasks with large costs followed by tasks with small costs

Platforms used to evaluate the Scalable Checkpoint/Restart (SCR) library (Moody et al.):

platform      #nodes   $\lambda_f$   $\lambda_s$   $C_D$    $C_M$
Hera          256      9.46e-7       3.38e-6       300s     15.4s
Atlas         512      5.19e-7       7.78e-6       439s     9.1s
Coastal       1024     4.02e-7       2.01e-6       1051s    4.5s
Coastal SSD   1024     4.02e-7       2.01e-6       2500s    180.0s
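The Uniform and Decrease workloads can be generated as in the Python sketch below. The function names are illustrative. For Decrease, the sketch normalizes $\alpha$ exactly so that the weights sum to $W$, which is an assumption: the slide only gives the approximation $\alpha \approx 3W/n^3$. HighLow is omitted because the slide does not state its parameters.

```python
def uniform_weights(W, n):
    """All n tasks share the same cost W / n."""
    return [W / n] * n


def decrease_weights(W, n):
    """Task T_i has cost alpha * (n + 1 - i)^2, with alpha chosen so the total is W.

    Assumption: exact normalization; the slide states alpha ~ 3W / n^3.
    """
    squares = [(n + 1 - i) ** 2 for i in range(1, n + 1)]
    alpha = W / sum(squares)
    return [alpha * s for s in squares], alpha


if __name__ == "__main__":
    W, n = 25000.0, 50
    weights, alpha = decrease_weights(W, n)
    print(alpha, 3 * W / n**3)      # exact normalization vs. the slide's approximation
    print(weights[0], weights[-1])  # first task is the heaviest, last the lightest
```

For $n = 50$ the exact constant and the approximation differ by a few percent, since $\sum_{k=1}^{n} k^2 = n(n+1)(2n+1)/6$ is only asymptotically $n^3/3$.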

SLIDE 20

[Plots, one row per platform (Hera, Atlas, Coastal, Coastal SSD): normalized makespan of ADV*, ADMV* and ADMV versus number of tasks (10 to 50); for each of ADV*, ADMV* and ADMV, numbers of disk checkpoints, memory checkpoints and verifications (plus partial verifications for ADMV) versus number of tasks]

Figure: Performance of the three algorithms with uniform distribution

SLIDE 21

[Plots, one row per platform (Hera, Atlas, Coastal, Coastal SSD): normalized makespan of ADV*, ADMV* and ADMV versus number of tasks (10 to 50); for ADMV* and ADMV, numbers of disk checkpoints, memory checkpoints and verifications (plus partial verifications for ADMV) versus number of tasks; placement of disk checkpoints, memory checkpoints, verifications and partial verifications chosen by ADMV with N=10]

Figure: Performance of the three algorithms with decrease distribution

SLIDE 22

[Plots, one row per platform (Hera, Atlas, Coastal, Coastal SSD): normalized makespan of ADV*, ADMV* and ADMV versus number of tasks (10 to 50); for ADMV* and ADMV, numbers of disk checkpoints, memory checkpoints and verifications (plus partial verifications for ADMV) versus number of tasks; placement of disk checkpoints, memory checkpoints, verifications and partial verifications chosen by ADMV with N=10]

Figure: Performance of the three algorithms with highlow distribution

SLIDE 23

Summary of simulations

More tasks → better performance
Single-level algorithm: guaranteed verifications everywhere, except with too many tasks ($n = 50$ on Hera) or when the cost of verification is too high (Coastal SSD)
Two-level algorithms: use of memory checkpoints drastically reduces the makespan
With partial verifications: need to use a lot of them (smaller recall): useful only when there are enough tasks; limited impact, except for Coastal SSD with higher checkpointing and verification costs

SLIDE 25

Conclusion

Two-level checkpointing scheme to cope with fail-stop and silent errors
Combines disk/memory checkpoints with guaranteed/partial verifications
Theoretically: multi-level polynomial-time dynamic programming algorithm for linear chains ($O(n^6)$)
Practically: benefit of the combined approach with realistic parameters, fast in practice

Future directions
Usefulness of the approach on general application workflows
Need for efficient polynomial-time heuristics

Research report RR-8794 available at graal.ens-lyon.fr/~abenoit
