SLIDE 1

October 6, 2016

Multi-level checkpointing and silent data corruption

Anne Benoit 2, Franck Cappello 1, Aurélien Cavelan 2, Sheng Di 1, Hongyang Sun 2, Yves Robert 2, Frédéric Vivien 2

1 Argonne National Laboratory 2 INRIA

SLIDE 2

Fail-stop errors

Characteristics

◮ Component failure (node, network, power, ...)
◮ Application fails and data is lost

Fault rate proportional to number of components

◮ 2013: Preprod. Blue Waters requires repairs ≈ 4 hours [2, 1]
◮ 2014: Titan loses a node every ≈ 1.5 days [2, 3, 1]
◮ 2014: Blue Waters loses ≈ 2 nodes per day [1]
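
The proportionality claim can be made concrete with a small calculation. The sketch below assumes, beyond what the slide states, that all components are identical and fail independently with exponentially distributed lifetimes; under that assumption the platform MTBF is the component MTBF divided by the number of components. The function name and the numbers are illustrative only.

```python
def platform_mtbf(component_mtbf: float, n_components: int) -> float:
    """MTBF of a platform made of n identical, independent components.

    Assumption (not from the slides): exponential lifetimes, so failure
    rates add up and the platform MTBF is component_mtbf / n_components.
    """
    if n_components <= 0:
        raise ValueError("n_components must be positive")
    return component_mtbf / n_components


# Illustrative numbers only: a 10-year component MTBF, expressed in hours,
# on a hypothetical platform with 100,000 components.
print(platform_mtbf(10 * 365 * 24, 100_000), "hours between failures")
```

Even with very reliable components, the platform as a whole fails more often than once per hour in this made-up configuration, which is the effect the examples above illustrate.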

  • F. Vivien - Multi-level checkpointing and silent error detection

October 6, 2016 - 2/28

SLIDE 5

Coping with fail-stop errors

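Instantaneous error detection

Standard approach: periodic checkpoint, rollback, and recovery

[Figure: three timelines of checkpoints (C) and work chunks (W). Top: error-free execution, C W C W C W C. Middle: a fail-stop error interrupts a work chunk. Bottom: after the error, a recovery (R) from the last checkpoint, then the lost work is re-executed: C R W C W C.]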
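
The checkpoint, rollback, and recovery cycle can be sketched as a small simulation. This is a minimal sketch, not the authors' simulator: it assumes exponentially distributed fail-stop errors, instantaneous detection, no error during downtime or recovery, and an initial checkpoint that costs nothing. All names and parameter values are hypothetical.

```python
import math
import random


def simulate_periodic_checkpointing(n_chunks, work, ckpt, recovery,
                                    downtime, rate, rng):
    """Wall-clock time to complete n_chunks chunks of `work` seconds,
    each followed by a checkpoint of `ckpt` seconds.

    A fail-stop error destroys the chunk (or checkpoint) in progress; the
    execution then pays `downtime` and `recovery` and restarts the chunk
    from the last checkpoint.
    """
    def next_failure():
        # Exponential inter-arrival times; rate 0 means an error-free run.
        return rng.expovariate(rate) if rate > 0 else math.inf

    clock = 0.0
    done = 0
    time_to_failure = next_failure()
    segment = work + ckpt
    while done < n_chunks:
        if time_to_failure >= segment:
            # The chunk and its checkpoint complete before the next error.
            clock += segment
            time_to_failure -= segment
            done += 1
        else:
            # Error: the partial work is lost, then downtime and recovery.
            clock += time_to_failure + downtime + recovery
            time_to_failure = next_failure()
    return clock


rng = random.Random(42)
print(simulate_periodic_checkpointing(10, 100.0, 10.0, 5.0, 1.0, 0.005, rng))
```

Carrying the remaining time to failure from one chunk to the next is valid here because the exponential distribution is memoryless.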
SLIDE 11

Multi-Level Checkpointing

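◮ Different kinds of checkpoints: local disk storage, partner-copy, Reed-Solomon encoding technique, file system
◮ Different kinds of errors: node failure, router failure, etc.
◮ Each checkpoint has a cost and some resilience capabilities

[Figure: three timelines of checkpoints (C) and work chunks (W1, W2, W3). Top: error-free execution, C W1 C W2 C W3 C. Middle: after a fail-stop error, a recovery (R) from the preceding checkpoint and re-execution of W2: C W1 C R W2 C W3 C. Bottom: after another fail-stop error, the recovery goes back further and W1 is re-executed: C C R W1 C W2 C W3 C.]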

When should we checkpoint? Using which mechanism?

SLIDE 14

Two-level checkpointing: assumptions

Two types of faults

◮ Type-1: follow an exponential distribution of failure rate λ1
◮ Type-2: follow an exponential distribution of failure rate λ2 (more dramatic faults)

Two types of checkpoints

◮ Type-2 checkpoints take time C2 (recovery R2): they enable recovery from type-1 and type-2 faults (more expensive checkpoints)
◮ Type-1 checkpoints take time C1 (recovery R1): they enable recovery from type-1 faults only (cheap checkpoints)

Other assumptions

◮ A fault of type i is followed by a downtime and a type-i recovery
◮ No faults during recoveries
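
The fault model above can be sketched in a few lines. With two independent exponential fault processes, the next fault arrives at rate λ1 + λ2 and is of type 2 with probability λ2 / (λ1 + λ2), a standard property of exponential distributions. The function names below are hypothetical, and the downtime D is passed in as a parameter because the slide does not give its value.

```python
import random


def sample_fault(rate1, rate2, rng):
    """Return (delay, fault_type) for the next fault.

    The two fault types are modeled as independent exponential processes,
    so the delay has rate rate1 + rate2 and the type is drawn with
    probability proportional to each rate.
    """
    total = rate1 + rate2
    delay = rng.expovariate(total)
    fault_type = 2 if rng.random() < rate2 / total else 1
    return delay, fault_type


def time_lost_after_fault(fault_type, downtime, r1, r2):
    """Downtime plus the recovery matching the fault type (R1 or R2)."""
    return downtime + (r1 if fault_type == 1 else r2)


rng = random.Random(1)
delay, kind = sample_fault(1e-4, 1e-5, rng)  # made-up rates, per second
print(delay, kind, time_lost_after_fault(kind, 60.0, 10.0, 120.0))
```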

SLIDE 15

Execution time of a pattern

◮ Pattern: work of some size W divided into K chunks

w1 C1 w2 C1 w3 C1 ... C1 wK C1 C2

◮ Objective: overhead minimization

$\mathrm{Overhead}(\mathrm{Pattern}(K, W, w_1, \ldots, w_K)) = \frac{E(\mathrm{Pattern}(K, W, w_1, \ldots, w_K))}{W} - 1$

◮ First property: execution time is minimized when all chunks have the same size

SLIDE 16

Unknown job length: optimal solution

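◮ Chunks have size $w_{opt}$ where:

$N(w_{opt}) \ln(N(w_{opt})) = \lambda L w_{opt} \left(e^{\lambda(w_{opt}+C_1)} - 1\right)$

◮ There are K chunks in a pattern where:

$\beta \lambda K w_{opt} e^{\lambda(w_{opt}+C_1)} \left(1 + L\left(e^{\lambda(w_{opt}+C_1)} - 1\right)\right)^{K-1} = \alpha + \frac{\beta}{L} \left(1 + L\left(e^{\lambda(w_{opt}+C_1)} - 1\right)\right)^{K}$

◮ Missing notations

$N(w) = 1 + L\left(e^{\lambda(w+C_1)} - 1\right)$, $L = \frac{\lambda_2}{\lambda}$, $\lambda = \lambda_1 + \lambda_2$,

$\alpha = R\left(e^{\lambda C_2} - 1\right) - \frac{\beta}{L}$, $\beta = R\left(1 + L\left(e^{\lambda C_2} - 1\right)\right)$,

$R = \frac{1 + \lambda_1 R_1 + \lambda_2 R_2}{\lambda} + D$

◮ Ugly implicit equations: solve them numerically!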

SLIDE 17

Known job length: optimal solution

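◮ Total size of job: $W_{total}$
◮ Chunks have the same size $w_{opt}$ as previously
◮ There are $p^*$ patterns where:

$p^* = \frac{W_{total} \ln(N(w_{opt}))}{\left(\mathbb{L}\left(\frac{\alpha L}{\beta e}\right) + 1\right) w_{opt}}$

with the same notations as previously and $\mathbb{L}(z) = x$ if $x e^x = z$ (the Lambert W function, not to be confused with the ratio $L = \lambda_2 / \lambda$).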

◮ Ugly implicit equations: solve them numerically!
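
As one example of solving such an equation numerically, the function defined by x e^x = z can be evaluated with Newton's method. The sketch below is a minimal illustration, not the authors' code: `lambert_w0` and `number_of_patterns` are hypothetical names, only the principal branch (x ≥ -1) is computed, and the values of w_opt, N(w_opt), α, β and the ratio L are assumed to have been obtained beforehand.

```python
import math


def lambert_w0(z, tol=1e-12, max_iter=100):
    """Principal branch of the Lambert W function: x >= -1 with x*exp(x) = z.

    Newton's method on f(x) = x*exp(x) - z, started from log(1 + z), a point
    where f >= 0, so the iterates decrease towards the root. Convergence
    slows down when z is very close to -1/e.
    """
    if z < -1.0 / math.e:
        raise ValueError("no real solution for z < -1/e")
    x = math.log1p(z)
    for _ in range(max_iter):
        ex = math.exp(x)
        step = (x * ex - z) / ((x + 1.0) * ex)
        x -= step
        if abs(step) <= tol * (1.0 + abs(x)):
            break
    return x


def number_of_patterns(w_total, w_opt, n_of_w_opt, alpha, beta, ratio_l):
    """Evaluate the expression for p* given precomputed quantities.

    n_of_w_opt is N(w_opt) and ratio_l is the ratio L = lambda2 / lambda.
    The result is a real number; the slide does not say how to round it.
    """
    denominator = (lambert_w0(alpha * ratio_l / (beta * math.e)) + 1.0) * w_opt
    return w_total * math.log(n_of_w_opt) / denominator


x = lambert_w0(-0.2)
print(x, x * math.exp(x))  # the second value should be close to -0.2
```

Where SciPy is available, `scipy.special.lambertw` provides the same function and could replace the hand-written Newton iteration.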

SLIDE 18

Assessment through simulations

[Figure: nine simulation plots, Case 1 to Case 9. Each plot shows the wall-clock time (seconds) as a function of Wopt (seconds), with one curve per value of Wopt2.]

SLIDE 19

Conclusion so far

◮ We know how to use two-level checkpointing efficiently under fail-stop failures

◮ What about silent data corruption?

SLIDE 20

Second kind of errors: silent data corruption

Characteristics

◮ Bit flip (Disk, RAM, Cache, Bus, ...)
◮ Problems: detection latency, potentially wrong results

Cosmic rays do produce errors

◮ 2002: Unprotected address bus: ASCI Q at Los Alamos National Laboratory could not run more than one hour [3]
◮ 2003: No ECC: Virginia Tech's supercomputer of 1,100 Apple Power Mac G5s could not boot [3]
◮ 2010: ECC-protected Jaguar saw 350 bit-flips/min [3]
◮ 2010: ECC-protected Jaguar saw 1 double-bit error/day [3]
◮ 2014: Titan: reported > 1 double-bit error per week [4]

SLIDE 26

Coping with Silent Errors

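Main problem: detection latency

Question: can we follow the same approach?

[Figure: timeline of work chunks (W) and checkpoints (C). A silent error is detected ("Detect!") only after later checkpoints have been taken, so the latest checkpoint is corrupted and an earlier one may be corrupted too.]

Keep multiple checkpoints?
Which checkpoint to recover from?
Need an active method to detect silent errors!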

SLIDE 27

Existing Methods for Detecting Silent Errors

General-purpose approaches

◮ Replication [Fiala et al. 2012] or triple modular redundancy and voting [Lyons and Vanderkulk 1962]

Application-specific approaches

◮ Algorithm-based fault tolerance (ABFT): checksums in dense matrices; limited to one error detection and/or correction in practice [Huang and Abraham 1984]
◮ Partial differential equations (PDE): use lower-order scheme as verification mechanism [Benson, Schmit and Schreiber 2014]
◮ Generalized minimal residual method (GMRES): inner-outer iterations [Hoemmen and Heroux 2011]
◮ Preconditioned conjugate gradients (PCG): orthogonalization check every k iterations, re-orthogonalization if problem detected [Sao and Vuduc 2013, Chen 2013]

Data-analytics approaches

◮ Dynamic monitoring of HPC datasets based on physical laws (e.g., temperature limit, speed limit) and space or temporal proximity [Bautista-Gomez and Cappello 2014]
◮ Time-series prediction, spatial multivariate interpolation [Di et al. 2014]
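
To illustrate the ABFT idea of checksums in dense matrices, here is a minimal sketch of checksum-protected matrix multiplication. It only covers detection, not correction, and it uses integer matrices so that checksums can be compared exactly; with floating-point data the comparisons would need a tolerance. The function names are hypothetical.

```python
def matmul(a, b):
    """Plain dense matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def with_column_checksum(a):
    """Append a row holding the sum of each column of a."""
    return a + [[sum(col) for col in zip(*a)]]


def with_row_checksum(b):
    """Append to each row of b the sum of that row."""
    return [row + [sum(row)] for row in b]


def checksums_hold(c_full):
    """True if the last column holds the row sums and the last row holds
    the column sums of a full-checksum matrix."""
    rows_ok = all(sum(row[:-1]) == row[-1] for row in c_full)
    cols_ok = all(sum(col[:-1]) == col[-1] for col in zip(*c_full))
    return rows_ok and cols_ok


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
# The product of the two encoded matrices carries both checksums of a*b.
c_full = matmul(with_column_checksum(a), with_row_checksum(b))
print(checksums_hold(c_full))  # checksums are consistent
c_full[0][1] += 1              # inject a silent error in one entry
print(checksums_hold(c_full))  # the corrupted entry breaks the checksums
```

The verification costs one pass over the result, which is cheap compared with the multiplication itself; this is why checksum schemes are attractive as application-specific detectors.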

SLIDE 28

Coping with Silent Errors

Solution: coupling checkpointing with verification

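[Figure: timeline of work chunks (W), each followed by a guaranteed verification and a checkpoint: V∗ C W V∗ C W V∗ C. An error striking during a chunk is detected by the next verification.]

◮ Before each checkpoint, run some verification mechanism or error detection test
◮ Silent error, if any, is detected by verification
◮ Last checkpoint is always valid

Problem solved! But we can do better than that!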
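
The coupling of verification and checkpointing can be sketched as a simple control loop. This is an illustrative sketch under stated assumptions: the verification is perfect, the initial state is valid, and the checkpoint is an in-memory deep copy. The toy application (a list of values with a running checksum) and all function names are made up for the example.

```python
import copy


def run_with_verified_checkpoints(state, n_steps, do_work, verify,
                                  max_rollbacks=10):
    """Run n_steps chunks; checkpoint only after a successful verification,
    and roll back to the last checkpoint when verification fails."""
    checkpoint = copy.deepcopy(state)  # assumed valid initial checkpoint
    rollbacks = 0
    step = 0
    while step < n_steps:
        do_work(state, step)
        if verify(state):
            checkpoint = copy.deepcopy(state)  # state is known to be valid
            step += 1
        else:
            state = copy.deepcopy(checkpoint)  # discard the corrupted chunk
            rollbacks += 1
            if rollbacks > max_rollbacks:
                raise RuntimeError("too many rollbacks")
    return state, rollbacks


def make_faulty_work(corrupt_at_call):
    """Toy work function that silently corrupts its output on one call."""
    calls = {"n": 0}

    def do_work(state, step):
        calls["n"] += 1
        state["values"].append(step)
        state["checksum"] += step
        if calls["n"] == corrupt_at_call:
            state["values"][-1] ^= 1  # injected silent error (bit flip)
    return do_work


def verify(state):
    """Toy guaranteed verification: recompute the checksum."""
    return sum(state["values"]) == state["checksum"]


final, rollbacks = run_with_verified_checkpoints(
    {"values": [], "checksum": 0}, 5, make_faulty_work(3), verify)
print(final["values"], rollbacks)
```

Because a checkpoint is only taken after the verification succeeds, a single stored checkpoint is enough: it can never contain a silent error that the verification is able to detect.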

SLIDE 30

One step further

Perform several verifications before each checkpoint:

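[Figure: timeline with several guaranteed verifications (V∗) between consecutive checkpoints (C): V∗ C V∗ V∗ V∗ C V∗ V∗ V∗ C. An error is detected by the next verification rather than at the end of the pattern.]

◮ Pro: silent error detected earlier in pattern
◮ Con: additional overhead in error-free executions
◮ Need to find the best trade-off
◮ Not all verification mechanisms have 100% accuracy!

Should we use partial detectors? How?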

SLIDE 31

Partial verification

Guaranteed/perfect verifications (V∗) can be very expensive!
Partial verifications (V) are available for some HPC applications!

◮ Lower accuracy: recall $r = \frac{\#\text{detected errors}}{\#\text{total errors}} < 1$
◮ Lower cost, i.e., V < V∗

[Figure: timeline V∗ C V1 V2 V∗ C V1 V2 V∗ C. After an error, a partial verification may miss it ("Detect?"), and a later verification catches it ("Detect!").]
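
The effect of the recall r can be illustrated with a short calculation. The sketch below adds one assumption that the slide does not state: successive partial verifications miss a given error independently, each with probability 1 - r. Function names are hypothetical.

```python
def recall(detected_errors: int, total_errors: int) -> float:
    """Recall r of a detector: fraction of errors that it detects."""
    if total_errors <= 0:
        raise ValueError("total_errors must be positive")
    return detected_errors / total_errors


def detection_probability(r: float, k: int) -> float:
    """Probability that at least one of k partial verifications catches
    an error, assuming (our assumption) independent misses."""
    return 1.0 - (1.0 - r) ** k


r = recall(8, 10)  # made-up counts
for k in (1, 2, 3):
    print(k, detection_probability(r, k))
```

Under this assumption the chance of a miss shrinks geometrically with the number of partial verifications, which is why a final guaranteed verification is still needed before a checkpoint.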

SLIDE 32

The optimization problem

Two types of checkpoints

◮ Disk checkpoint: stable storage (slow but resilient)
◮ Memory checkpoint: local copy (fast but lost on fail-stop)

Checkpoint only done after guaranteed verification

Two types of responses to errors

◮ Fail-stop error ⇒ rollback to last disk checkpoint
◮ Silent error ⇒ rollback to last memory checkpoint

Goal:

◮ Combine everything into a single periodic pattern
◮ Minimize the overhead due to faults and to fault-tolerance

SLIDE 33

Resilience patterns (1/2)

Starting with base pattern

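[Figure: base pattern, work W between two V∗ CM CD sequences: V∗ CM CD | W | V∗ CM CD. Pattern à la Young-Daly.]

Adding verified memory checkpoints

[Figure: pattern with n segments w1, w2, ..., wn (total work W). Each segment ends with V∗ CM, and the last one is also followed by CD: V∗ CM CD | w1 V∗ CM | w2 V∗ CM | ... | wn V∗ CM CD.]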
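
For a pattern with n segments, the two rollback rules (fail-stop error: last disk checkpoint; silent error: last memory checkpoint) can be sketched as follows. This is an illustrative reading of the pattern, assuming the disk checkpoint sits at the start of the pattern and a memory checkpoint at the start of every segment; the names are hypothetical.

```python
from enum import Enum


class Error(Enum):
    FAIL_STOP = "fail-stop"
    SILENT = "silent"


def restart_segment(error, current_segment):
    """Index of the segment from which execution resumes.

    A fail-stop error loses the memory checkpoints, so the execution
    restarts from the disk checkpoint at the start of the pattern.
    A silent error only rolls back to the last memory checkpoint, i.e.
    the start of the current segment.
    """
    return 0 if error is Error.FAIL_STOP else current_segment


def work_to_redo(error, current_segment, segment_lengths, progress):
    """Work that must be re-executed when an error is handled after
    `progress` units of work in the current segment."""
    redo = progress
    if error is Error.FAIL_STOP:
        redo += sum(segment_lengths[:current_segment])
    return redo


segments = [10.0, 20.0, 30.0]  # made-up segment lengths
print(work_to_redo(Error.SILENT, 2, segments, 5.0))
print(work_to_redo(Error.FAIL_STOP, 2, segments, 5.0))
```

The gap between the two results is what makes memory checkpoints worthwhile when silent errors are frequent relative to fail-stop errors.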

SLIDE 34

Resilience patterns (2/2)

Adding intermediate verifications between memory checkpoints

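[Figure: the work between two memory checkpoints is split into chunks wi,1, wi,2, ..., wi,mi separated by partial verifications V and closed by a guaranteed verification V∗. Segment wi has mi chunks.]

Putting everything together

[Figure: full pattern of total work W with n segments w1, ..., wn. Segment wi is split into chunks wi,1, ..., wi,mi separated by partial verifications V and ends with V∗ CM; the pattern ends with CD: V∗ CM CD | V V V∗ CM | ... | V∗ CM | V V V∗ CM CD.]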

SLIDE 35

The optimal solution (first order approximation)

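For each pattern: optimal work $W^*$, number of segments $n^*$, number of chunks $m^*$, and Overhead(Pattern). A dash marks a parameter that the pattern does not use.

Pattern PD

◮ $W^* = \sqrt{\frac{V^* + C_M + C_D}{\lambda_s + \frac{\lambda_f}{2}}}$
◮ $n^*$: –
◮ $m^*$: –
◮ Overhead: $2\sqrt{\left(\lambda_s + \frac{\lambda_f}{2}\right)\left(V^* + C_M + C_D\right)}$

Pattern PDV∗

◮ $W^* = \sqrt{\frac{m^* V^* + C_M + C_D}{\frac{1}{2}\left(1 + \frac{1}{m^*}\right)\lambda_s + \frac{\lambda_f}{2}}}$
◮ $n^*$: –
◮ $m^* = \sqrt{\frac{\lambda_s}{\lambda_s + \lambda_f} \cdot \frac{C_M + C_D}{V^*}}$
◮ Overhead: $\sqrt{2(\lambda_s + \lambda_f)(C_M + C_D)} + \sqrt{2\lambda_s V^*}$

Pattern PDV

◮ $W^* = \sqrt{\frac{(m^* - 1)V + V^* + C_M + C_D}{\frac{1}{2}\left(1 + \frac{2-r}{(m^*-2)r+2}\right)\lambda_s + \frac{\lambda_f}{2}}}$
◮ $n^*$: –
◮ $m^* = 2 - \frac{2}{r} + \sqrt{\frac{\lambda_s}{\lambda_s + \lambda_f} \times \frac{2-r}{r}\left(\frac{V^* + C_M + C_D}{V} - \frac{2-r}{r}\right)}$
◮ Overhead: $\sqrt{2(\lambda_s + \lambda_f)\left(V^* - \frac{2-r}{r}V + C_M + C_D\right)} + \sqrt{2\lambda_s \frac{2-r}{r} V}$

Pattern PDM

◮ $W^* = \sqrt{\frac{n^*(V^* + C_M) + C_D}{\frac{\lambda_s}{n^*} + \frac{\lambda_f}{2}}}$
◮ $n^* = \sqrt{\frac{2\lambda_s}{\lambda_f} \cdot \frac{C_D}{V^* + C_M}}$
◮ $m^*$: –
◮ Overhead: $2\sqrt{\lambda_s (V^* + C_M)} + \sqrt{2\lambda_f C_D}$

Pattern PDMV∗

◮ $W^* = \sqrt{\frac{n^* m^* V^* + n^* C_M + C_D}{\frac{1}{2}\left(1 + \frac{1}{m^*}\right)\frac{\lambda_s}{n^*} + \frac{\lambda_f}{2}}}$
◮ $n^* = \sqrt{\frac{\lambda_s}{\lambda_f} \cdot \frac{C_D}{C_M}}$
◮ $m^* = \sqrt{\frac{C_M}{V^*}}$
◮ Overhead: $\sqrt{2\lambda_f C_D} + \sqrt{2\lambda_s C_M} + \sqrt{2\lambda_s V^*}$

Pattern PDMV

◮ $W^* = \sqrt{\frac{n^*(m^* - 1)V + n^*(V^* + C_M) + C_D}{\frac{1}{2}\left(1 + \frac{2-r}{(m^*-2)r+2}\right)\frac{\lambda_s}{n^*} + \frac{\lambda_f}{2}}}$
◮ $n^* = \sqrt{\frac{\lambda_s}{\lambda_f} \cdot \frac{C_D}{V^* - \frac{2-r}{r}V + C_M}}$
◮ $m^* = 2 - \frac{2}{r} + \sqrt{\frac{2-r}{r}\left(\frac{V^* + C_M}{V} - \frac{2-r}{r}\right)}$
◮ Overhead: $\sqrt{2\lambda_f C_D} + \sqrt{2\lambda_s\left(V^* - \frac{2-r}{r}V + C_M\right)} + \sqrt{2\lambda_s \frac{2-r}{r} V}$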
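
The first-order formulas can be evaluated directly. The sketch below covers only the two simplest patterns, PD and PDM. It assumes that λs and λf denote the silent and fail-stop error rates, which the slides do not state explicitly. It keeps n∗ as a real number, since rounding is not discussed here, and the parameter values are made up for illustration.

```python
import math


def pattern_pd(lam_s, lam_f, v_star, c_m, c_d):
    """First-order optimal work W* and overhead for pattern PD."""
    cost = v_star + c_m + c_d
    rate = lam_s + lam_f / 2.0
    w_star = math.sqrt(cost / rate)
    overhead = 2.0 * math.sqrt(rate * cost)
    return w_star, overhead


def pattern_pdm(lam_s, lam_f, v_star, c_m, c_d):
    """First-order optimal W*, n* (real-valued) and overhead for PDM."""
    n_star = math.sqrt(2.0 * lam_s / lam_f * c_d / (v_star + c_m))
    w_star = math.sqrt((n_star * (v_star + c_m) + c_d)
                       / (lam_s / n_star + lam_f / 2.0))
    overhead = (2.0 * math.sqrt(lam_s * (v_star + c_m))
                + math.sqrt(2.0 * lam_f * c_d))
    return w_star, n_star, overhead


# Made-up parameters: rates per second, costs in seconds.
params = dict(lam_s=1e-5, lam_f=1e-6, v_star=10.0, c_m=5.0, c_d=100.0)
print("PD :", pattern_pd(**params))
print("PDM:", pattern_pdm(**params))
```

With a real-valued n∗, the PDM overhead is never larger than the PD overhead, because PD corresponds to the special case n = 1 of the same expression.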

SLIDE 36

Simulations

[Figure: four bar charts, one per platform (Hera, Atlas, Coastal, Coastal SSD). Each chart compares the predicted and simulated expected overhead of the patterns PD, PDV*, PDV, PDM, PDMV*, and PDMV, on an axis graduated from 0.05 to 0.2.]

SLIDE 37

Conclusion so far

◮ We know how to use two-level checkpointing efficiently under fail-stop failures and silent data corruption, with guaranteed verifications and partial verifications
◮ Caveat: we assumed full freedom to place checkpoints and verifications (divisible load)

Question: What about task graphs?

SLIDE 38

The optimization problem

◮ Application modeled as a linear task graph
◮ Checkpoints and verifications are performed in between tasks

[Figure: linear chain of tasks 1, 2, ..., i, i+1, ..., j, with a verification V(1) after task 1, V∗(i) CM(i) after task i, and V∗(j) CM(j) CD(j) after task j.]

◮ Question: when to take which checkpoint and verification in order to minimize the execution time?

[Figure: structure of the dynamic program over the positions d0, d1, d2, m1, m2, v1, v2, p1, p2, with the quantities Edisk(d2), Emem(d1, m2), Everif(d1, m1, v2), Epartial(d1, m1, v1, p1, v2), E−(d1, m1, v1, p1, p2, v2), Eleft(v1, p1), and Eright(d1, m1, v1, p2, v2).]

◮ Optimal solution: $O(n^6)$ dynamic programming algorithm

SLIDE 39

Conclusion and perspectives

Pros

◮ Mix of silent and fail-stop errors
◮ Mix of partial and guaranteed verifications

Cons

◮ Results limited to 2 levels... but upcoming generalization for any number of levels!
◮ Exponential failure distribution

SLIDE 40

All details can be found in

◮ S. Di, Y. Robert, F. Vivien, and F. Cappello. Toward an Optimal Online Checkpoint Solution under a Two-Level HPC Checkpoint Model. IEEE Transactions on Parallel and Distributed Systems, 2016. To appear.
◮ A. Benoit, A. Cavelan, Y. Robert, and H. Sun. Optimal Resilience Patterns to Cope with Fail-Stop and Silent Errors. In IPDPS'2016, May 2016.
◮ A. Benoit, A. Cavelan, Y. Robert, and H. Sun. Two-Level Checkpointing and Verifications for Linear Task Graphs. In The 17th IEEE International Workshop on Parallel and Distributed Scientific and Engineering Computing (PDSEC 2016), May 2016.

SLIDE 41

Bibliography I

[1] P. Balaprakash, L. A. Bautista-Gomez, M. Bouguerra, S. M. Wild, F. Cappello, and P. D. Hovland. Analysis of the tradeoffs between energy and run time for multilevel checkpointing. In 5th International Workshop, PMBS 2014, New Orleans, LA, USA, pages 249–263, 2014.

[2] F. Cappello, A. Geist, W. Gropp, S. Kale, B. Kramer, and M. Snir. Toward exascale resilience: 2014 update. Supercomput. Front. Innov.: Int. J., 1(1):5–28, Apr. 2014.

[3] A. Geist. How to kill a supercomputer: Dirty power, cosmic rays, and bad solder. IEEE Spectrum, Feb. 2016.

SLIDE 42

Bibliography II

[4] D. Tiwari, S. Gupta, J. H. Rogers, D. Maxwell, P. Rech, S. S. Vazhkudai, D. A. G. de Oliveira, D. Londo, N. DeBardeleben, P. O. A. Navaux, L. Carro, and A. S. Bland. Understanding GPU errors on large-scale HPC systems and the implications for system design and operation. In 21st IEEE International Symposium on High Performance Computer Architecture, HPCA 2015, Burlingame, CA, USA, pages 331–342, 2015.

SLIDE 43

ANY QUESTIONS?