Checkpointing strategies for parallel jobs



slide-1
SLIDE 1

Checkpointing strategies for parallel jobs

Marin Bougeret, Henri Casanova, Mikaël Rabie, Yves Robert, and Frédéric Vivien
ENS Lyon & INRIA, France
University of Hawai‘i at Mānoa, USA
University of Montpellier, France

slide-2
SLIDE 2

Motivation

Framework
  • Very very large number of processing elements (e.g., 2^20)
  • Failure-prone platform (like any realistic platform)
  • Large application to be executed on the whole platform
  ⇒ Failure(s) will certainly occur before completion!
  • Resilience provided through coordinated checkpointing

Question
  • When should we checkpoint the application?


slide-6
SLIDE 6

State of the art

One knows that applications should be checkpointed periodically. Is this optimal?

Several proposed values for the period:
  • Young: $\sqrt{2 \times C \times \mathrm{MTBF}}$ (1st order approximation)
  • Daly (1): $\sqrt{2 \times C \times (R + \mathrm{MTBF})}$ (1st order approximation)
  • Daly (2): $\eta \times \mathrm{MTBF} - C$, where $\eta = \xi^2 + 1 + L(-e^{-(2\xi^2+1)})$, $\xi = \sqrt{\frac{C}{2 \times \mathrm{MTBF}}}$, and $L(z)e^{L(z)} = z$ (higher order approximation)

How good are these approximations? Could we find the optimal value? At least for Exponential failures? And for Weibull failures?
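The three period formulas can be compared numerically. The following is a minimal sketch, not the authors' code: the function names are hypothetical, all quantities are assumed to be in seconds, and L is evaluated as the principal branch of the Lambert function with a hand-written Halley iteration so that only the standard library is needed.

```python
import math


def lambert_w0(z: float) -> float:
    """Principal branch of L, defined by L(z) * exp(L(z)) = z, for -1/e <= z < 0."""
    # Series expansion around the branch point z = -1/e gives the starting guess.
    p = math.sqrt(max(0.0, 2.0 * (math.e * z + 1.0)))
    if p == 0.0:
        return -1.0
    w = -1.0 + p - p * p / 3.0 + 11.0 * p ** 3 / 72.0
    # Halley iterations refine the guess.
    for _ in range(50):
        ew = math.exp(w)
        f = w * ew - z
        step = f / (ew * (w + 1.0) - (w + 2.0) * f / (2.0 * w + 2.0))
        w -= step
        if abs(step) < 1e-15:
            break
    return w


def young_period(C: float, mtbf: float) -> float:
    """Young: sqrt(2 * C * MTBF)."""
    return math.sqrt(2.0 * C * mtbf)


def daly_low_period(C: float, R: float, mtbf: float) -> float:
    """Daly (1): sqrt(2 * C * (R + MTBF))."""
    return math.sqrt(2.0 * C * (R + mtbf))


def daly_high_period(C: float, mtbf: float) -> float:
    """Daly (2): eta * MTBF - C, with eta and xi as defined on the slide."""
    xi_sq = C / (2.0 * mtbf)  # xi squared
    eta = xi_sq + 1.0 + lambert_w0(-math.exp(-(2.0 * xi_sq + 1.0)))
    return eta * mtbf - C


if __name__ == "__main__":
    C, R, mtbf = 600.0, 600.0, 86400.0  # 600 s checkpoint/recovery, 1 day MTBF
    print("Young    :", young_period(C, mtbf))
    print("Daly (1) :", daly_low_period(C, R, mtbf))
    print("Daly (2) :", daly_high_period(C, mtbf))
```

When C is small compared with the MTBF, the three values are of the same order; the differences grow as C/MTBF grows.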

slide-7
SLIDE 7

Outline

1. Single-processor jobs
  • Solving Makespan
  • Solving NextFailure

2. Parallel jobs
  • Solving Makespan
  • Solving NextFailure

3. Experiments
  • Simulation framework
  • Sequential jobs under synthetic failures
  • Parallel jobs under synthetic failures
  • Parallel jobs under trace-based failures

4. Conclusion


slide-9
SLIDE 9

Hypotheses

  • Overall size of work: W
  • Checkpoint cost: C (e.g., writing the contents of each processor's memory to disk)
  • Downtime: D (hardware replacement by a spare, or software rejuvenation via rebooting)
  • Recovery cost after failure: R
  • Homogeneous platform (same computation speeds, iid failure distributions)
  • The history of failures has no impact; only the time elapsed since the last failure does
  • A failure can happen during a checkpoint or a recovery, but not during a downtime (otherwise replace D by 0 and R by R + D)


slide-11
SLIDE 11

Problem statement

Makespan
Minimize the job’s expected makespan, that is, the expectation E of the time T needed to process a work of size W, knowing that the (single) processor failed τ units of time ago.

Notation:
  • minimize E(T(W|τ))
  • ω1(W|τ): amount of work we attempt to do before taking the first checkpoint

slide-12
SLIDE 12

Recursive approach

$$E(T(W|\tau)) = P_{suc}(\omega_1 + C|\tau)\,\bigl(\omega_1 + C + E(T(W - \omega_1|\tau + \omega_1 + C))\bigr) + (1 - P_{suc}(\omega_1 + C|\tau))\,\bigl(E(T_{lost}(\omega_1 + C|\tau)) + E(T_{rec}) + E(T(W|R))\bigr)$$

Terms:
  • $P_{suc}(\omega_1 + C|\tau)$: probability of success
  • $\omega_1 + C$: time needed to compute the 1st chunk
  • $E(T(W - \omega_1|\tau + \omega_1 + C))$: time needed to compute the remainder
  • $1 - P_{suc}(\omega_1 + C|\tau)$: probability of failure
  • $E(T_{lost}(\omega_1 + C|\tau))$: time elapsed before the failure occurred
  • $E(T_{rec})$: time needed to perform downtime and recovery
  • $E(T(W|R))$: time needed to compute W from scratch

Problem: finding ω1(W, τ) minimizing E(T(W|τ))
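The success probability in the recursion can be illustrated with a short sketch. It assumes, since the slides do not spell it out, that Psuc(x|τ) is the conditional probability of surviving x more time units given that the last failure happened τ time units ago, i.e. S(τ + x) / S(τ) for a survival function S. The helper names are hypothetical.

```python
import math
from typing import Callable

Survival = Callable[[float], float]  # S(t) = P(time to failure > t)


def exponential_survival(lam: float) -> Survival:
    """Survival function of an Exponential law with rate lam."""
    return lambda t: math.exp(-lam * t)


def weibull_survival(scale: float, shape: float) -> Survival:
    """Survival function of a Weibull law with the given scale and shape k."""
    return lambda t: math.exp(-((t / scale) ** shape))


def p_suc(x: float, tau: float, survival: Survival) -> float:
    """Assumed definition: P(no failure in the next x | last failure tau ago)."""
    return survival(tau + x) / survival(tau)


if __name__ == "__main__":
    exp_s = exponential_survival(1.0 / 86400.0)
    wei_s = weibull_survival(86400.0, 0.7)
    for tau in (0.0, 3600.0, 86400.0):
        print(tau, p_suc(600.0, tau, exp_s), p_suc(600.0, tau, wei_s))
```

Under Exponential failures the value does not depend on τ (memoryless property), whereas with a Weibull shape parameter below 1 it increases with τ, which is why τ appears as a parameter of the recursion.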

slide-22
SLIDE 22

Failures following an exponential distribution

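Theorem
The optimal strategy splits W into $K^*$ same-size chunks, where $K^* = \max(1, \lfloor K_0 \rfloor)$ or $K^* = \lceil K_0 \rceil$ (whichever leads to the smaller value), with

$$K_0 = \frac{\lambda W}{1 + L(-e^{-\lambda C - 1})} \quad \text{and} \quad L(z)e^{L(z)} = z.$$

The optimal expectation of the makespan is

$$K^* \left( e^{\lambda R} \left( \frac{1}{\lambda} + D \right) \right) \left( e^{\lambda \left( \frac{W}{K^*} + C \right)} - 1 \right).$$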
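The theorem can be checked numerically on a small instance. This is a sketch under stated assumptions: function names are hypothetical, times are in seconds, and L is taken as the principal branch of the Lambert function, computed here with a hand-written Halley iteration.

```python
import math


def lambert_w0(z: float) -> float:
    """Principal branch of L, defined by L(z) * exp(L(z)) = z, for -1/e <= z < 0."""
    p = math.sqrt(max(0.0, 2.0 * (math.e * z + 1.0)))
    if p == 0.0:
        return -1.0
    w = -1.0 + p - p * p / 3.0 + 11.0 * p ** 3 / 72.0  # branch-point series guess
    for _ in range(50):  # Halley refinement
        ew = math.exp(w)
        f = w * ew - z
        step = f / (ew * (w + 1.0) - (w + 2.0) * f / (2.0 * w + 2.0))
        w -= step
        if abs(step) < 1e-15:
            break
    return w


def expected_makespan(K: int, W: float, C: float, D: float, R: float, lam: float) -> float:
    """Expected makespan when W is split into K same-size chunks."""
    return K * math.exp(lam * R) * (1.0 / lam + D) * math.expm1(lam * (W / K + C))


def optimal_chunks(W: float, C: float, D: float, R: float, lam: float):
    """Return (K*, expected makespan) by testing the two candidates of the theorem."""
    k0 = lam * W / (1.0 + lambert_w0(-math.exp(-lam * C - 1.0)))
    candidates = {max(1, math.floor(k0)), math.ceil(k0)}
    k_star = min(candidates, key=lambda k: expected_makespan(k, W, C, D, R, lam))
    return k_star, expected_makespan(k_star, W, C, D, R, lam)


if __name__ == "__main__":
    day = 86400.0
    k_star, makespan = optimal_chunks(W=20 * day, C=600.0, D=60.0, R=600.0, lam=1.0 / day)
    print("K* =", k_star, "expected makespan (days) =", makespan / day)
```

Only two values of K need to be evaluated, which is what makes the Exponential case cheap compared with the dynamic programming approach used for arbitrary distributions.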

slide-23
SLIDE 23

Arbitrary failure distributions

$$E(T(W|\tau)) = \min_{0 < \omega_1 \le W} \Bigl[ P_{suc}(\omega_1 + C|\tau)\,\bigl(\omega_1 + C + E(T(W - \omega_1|\tau + \omega_1 + C))\bigr) + (1 - P_{suc}(\omega_1 + C|\tau))\,\bigl(E(T_{lost}(\omega_1 + C|\tau)) + E(T_{rec}) + E(T(W|R))\bigr) \Bigr]$$

Solve via dynamic programming
  • Time quantum u: all chunk sizes ωi are integer multiples of u
  • Trade-off: accuracy versus higher computing time
slide-24
SLIDE 24

Dynamic programming

Algorithm 1: DPMakespan(x, b, y, τ0)

if x = 0 then return 0
if solution[x][b][y] = unknown then
    best ← ∞; τ ← bτ0 + yu
    for i = 1 to x do
        exp_succ ← first(DPMakespan(x − i, b, y + i + C/u, τ0))
        exp_fail ← first(DPMakespan(x, 0, R/u, τ0))
        cur ← Psuc(iu + C|τ) (iu + C + exp_succ)
              + (1 − Psuc(iu + C|τ)) (E(Tlost(iu + C|τ)) + E(Trec) + exp_fail)
        if cur < best then
            best ← cur; chunksize ← i
    solution[x][b][y] ← (best, chunksize)
return solution[x][b][y]


slide-26
SLIDE 26

Problem statement

NextFailure
  • Maximize the expected amount of work completed before the next failure
  • Optimization on a “failure-by-failure” basis
  • Hopefully a good approximation, at least for large job sizes W

slide-27
SLIDE 27

Approach

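$$E(W(\omega|\tau)) = P_{suc}(\omega_1 + C|\tau)\,\bigl(\omega_1 + E(W(\omega - \omega_1|\tau + \omega_1 + C))\bigr)$$

Proposition

$$E(W(W|0)) = \sum_{i=1}^{K} \omega_i \times \prod_{j=1}^{i} P_{suc}(\omega_j + C|t_j)$$

where $t_j = \sum_{\ell=1}^{j-1} (\omega_\ell + C)$ is the total time elapsed (without failure) before execution of chunk $\omega_j$, and K is the (unknown) target number of chunks.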
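The proposition translates directly into a loop over the chunks. The sketch below is illustrative only: `expected_work` is a hypothetical name, and Psuc(x|t) is assumed to be S(t + x) / S(t) for a survival function S supplied by the caller.

```python
import math
from typing import Callable, Sequence


def expected_work(chunks: Sequence[float], C: float,
                  survival: Callable[[float], float]) -> float:
    """Expected work completed before the next failure for a given chunk sequence.

    Implements sum_i omega_i * prod_{j<=i} Psuc(omega_j + C | t_j), starting at tau = 0.
    """
    total = 0.0
    prob_no_failure = 1.0  # running product of the Psuc terms
    elapsed = 0.0          # t_j: time elapsed before chunk j starts
    for omega in chunks:
        prob_no_failure *= survival(elapsed + omega + C) / survival(elapsed)
        total += omega * prob_no_failure
        elapsed += omega + C
    return total


if __name__ == "__main__":
    lam = 1.0 / 86400.0
    s = lambda t: math.exp(-lam * t)
    print(expected_work([10000.0, 10000.0], 600.0, s))
    print(expected_work([20000.0], 600.0, s))
```

A chunk only counts once its checkpoint has completed, which is why each factor covers ω_j + C while only ω_j is added to the total.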

slide-28
SLIDE 28

Solving through dynamic programming

Algorithm 2: DPNextFailure(x, n, τ0)

if x = 0 then return 0
if solution[x][n] = unknown then
    best ← −∞
    τ ← τ0 + (W − xu) + nC
    for i = 1 to x do
        work ← first(DPNextFailure(x − i, n + 1, τ0))
        cur ← Psuc(iu + C|τ) × (iu + work)
        if cur > best then
            best ← cur; chunksize ← i
    solution[x][n] ← (best, chunksize)
return solution[x][n]
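A possible Python transcription of this dynamic program is sketched below. It is not the authors' implementation: the name `dp_next_failure` is hypothetical, Psuc(x|τ) is assumed to be S(τ + x) / S(τ) for a caller-supplied survival function S, W is assumed to be a multiple of the quantum u, and the largest candidate is kept because NextFailure is a maximization problem.

```python
import math
from functools import lru_cache
from typing import Callable, List, Tuple


def dp_next_failure(W: float, C: float, u: float, tau0: float,
                    survival: Callable[[float], float]) -> Tuple[float, List[float]]:
    """Return (expected work before the next failure, chunk sizes), memoized over (x, n)."""
    n_quanta = round(W / u)  # assumes W is a multiple of u

    @lru_cache(maxsize=None)
    def solve(x: int, n: int) -> Tuple[float, int]:
        # x: remaining work in quanta; n: number of checkpoints already taken.
        if x == 0:
            return 0.0, 0
        tau = tau0 + (n_quanta - x) * u + n * C
        best, chunksize = -math.inf, 0
        for i in range(1, x + 1):
            work, _ = solve(x - i, n + 1)
            p = survival(tau + i * u + C) / survival(tau)
            cur = p * (i * u + work)
            if cur > best:
                best, chunksize = cur, i
        return best, chunksize

    # Walk through the memoized decisions to recover the chunk sequence.
    chunks, x, n = [], n_quanta, 0
    while x > 0:
        _, i = solve(x, n)
        chunks.append(i * u)
        x -= i
        n += 1
    return solve(n_quanta, 0)[0], chunks


if __name__ == "__main__":
    s = lambda t: math.exp(-((t / 86400.0) ** 0.7))  # Weibull, shape 0.7 (example values)
    value, chunks = dp_next_failure(20000.0, 600.0, 1000.0, 0.0, s)
    print(value, chunks)
```

The table has O((W/u)^2) entries and each one scans up to W/u candidates, which is the accuracy versus computing time trade-off controlled by the quantum u.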


slide-31
SLIDE 31

Failures following an exponential distribution

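Theorem
The optimal strategy splits W(p) into $K^*(p)$ same-size chunks, where $K^*(p) = \max(1, \lfloor K_0(p) \rfloor)$ or $K^*(p) = \lceil K_0(p) \rceil$ (whichever leads to the smaller value), with

$$K_0(p) = \frac{\lambda W(p)}{1 + L(-e^{-p\lambda C - 1})} \quad \text{and} \quad L(z)e^{L(z)} = z.$$

The optimal expectation of the makespan is

$$K^*(p) \left( \frac{1}{p\lambda} + E(T_{rec}(p)) \right) \left( e^{\lambda \left( \frac{W}{K^*(p)} + pC \right)} - 1 \right).$$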
slide-32
SLIDE 32

Arbitrary failure distributions

  • The recursion cannot be solved analytically
  • The dynamic programming algorithm DPMakespan, designed for the single-processor case, cannot be extended:
      • it would need to memorize all possible failure scenarios for each processor
      • the number of states is exponential in p


slide-34
SLIDE 34

Dynamic programming

All τ variables evolve identically: recursive calls only correspond to cases in which no failure has occurred.

$$E(W(W|\tau_1, \dots, \tau_p)) = P_{suc}(\omega_1 + C|\tau_1, \dots, \tau_p)\,\bigl(\omega_1 + E(W(W - \omega_1|\tau_1 + \omega_1 + C, \dots, \tau_p + \omega_1 + C))\bigr)$$

  ⇒ Same dynamic programming approach as previously
  • Linear dependency in p (computation of Psuc)
  • Reduce complexity by recording only the x most recent τ values and approximating the other values using y rounding values defined by x regularly-spaced quantiles
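The linear dependency in p can be seen in a small sketch of the multi-processor success probability. It assumes independent processors, so that the job survives only if every processor does, and it assumes the single-processor conditional survival S(τ + x) / S(τ); `p_suc_parallel` is a hypothetical name.

```python
import math
from typing import Callable, Sequence


def p_suc_parallel(x: float, taus: Sequence[float],
                   survival: Callable[[float], float]) -> float:
    """P(no processor fails during the next x), given each processor's time since last failure."""
    prob = 1.0
    for tau in taus:  # one factor per processor: cost linear in p
        prob *= survival(tau + x) / survival(tau)
    return prob


if __name__ == "__main__":
    s = lambda t: math.exp(-((t / 86400.0) ** 0.7))  # Weibull, shape 0.7 (example values)
    taus = [0.0, 1800.0, 3600.0, 7200.0]
    print(p_suc_parallel(600.0, taus, s))
```

Grouping processors whose τ values are close, as the last bullet above suggests, would replace many factors by a single power and shorten this loop.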


slide-37
SLIDE 37

Evaluated approaches

Heuristics
  • Young [4]
  • DalyLow [2]
  • DalyHigh [2]
  • Bouguerra [1]
  • Liu [3]
  • OptExp
  • DPMakespan
  • DPNextFailure

Theoretical bounds
  • LowerBound (omniscient algorithm)
  • PeriodLB

slide-38
SLIDE 38

Synthetic failure distributions

            p_total   D      C, R    MTBF            W
1-proc      1         60 s   600 s   1 h, 1 d, 1 w   20 d
Petascale   45,208    60 s   600 s   125 y, 500 y    1,000 y
Exascale    2^20      60 s   600 s   1250 y          10,000 y

Simulation parameters
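Synthetic failure inter-arrival times with a prescribed MTBF could be drawn as sketched below. This is an assumption about how such a simulation might be set up, not a description of the authors' simulator: the helper names are hypothetical, and the Weibull scale is derived from the mean of a Weibull law, MTBF = scale × Γ(1 + 1/k). A shape of 1 gives Exponential failures.

```python
import math
import random

SECONDS_PER_YEAR = 365 * 24 * 3600


def weibull_scale(mtbf: float, shape: float) -> float:
    """Scale parameter such that the Weibull mean equals mtbf."""
    return mtbf / math.gamma(1.0 + 1.0 / shape)


def sample_interarrival(mtbf: float, shape: float, rng: random.Random) -> float:
    """Draw one time-to-failure; shape = 1.0 reduces to the Exponential law."""
    return rng.weibullvariate(weibull_scale(mtbf, shape), shape)


if __name__ == "__main__":
    rng = random.Random(42)
    mtbf = 125 * SECONDS_PER_YEAR  # example value taken from the Petascale row
    draws = [sample_interarrival(mtbf, 0.7, rng) for _ in range(5)]
    print([d / SECONDS_PER_YEAR for d in draws])
```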


slide-40
SLIDE 40

Sequential jobs under Exponential failures

Heuristics      MTBF = 1 hour   MTBF = 1 day   MTBF = 1 week
LowerBound      0.62865         0.90714        0.979151
PeriodLB        1.00705         1.01588        1.02298
Young           1.01635         1.01590        1.02332
DalyLow         1.02711         1.01611        1.02338
DalyHigh        1.00700         1.01592        1.02373
Liu             1.01607         1.01655        1.02333
Bouguerra       1.02562         1.02329        1.02685
OptExp          1.00705         1.01611        1.02298
DPNextFailure   1.00785         1.01699        1.02851
DPMakespan      1.00737         1.01655        1.03467

Degradation from best, single processor, Exponential failures


slide-42
SLIDE 42

Sequential jobs under Weibull failures

Heuristics      MTBF = 1 hour   MTBF = 1 day   MTBF = 1 week
LowerBound      0.66417         0.90714        0.97915
PeriodLB        1.00960         1.01588        1.02298
Young           1.00965         1.01590        1.02332
DalyLow         1.01155         1.01611        1.02338
DalyHigh        1.01785         1.01592        1.02373
Liu             1.00914         1.01655        1.02333
Bouguerra       1.02936         1.02329        1.02685
OptExp          1.01788         1.01611        1.02298
DPNextFailure   1.01408         1.01699        1.02851
DPMakespan      1.00731         1.01655        1.03467

Degradation from best, single processor, Weibull failures


slide-44
SLIDE 44

Parallel jobs under Exponential failures (1/2)

[Figure: average makespan degradation (y-axis, 0.9 to 1.1) versus number of processors (x-axis, 2^10 to 2^15). Curves: LowerBound, PeriodLB, Young, DalyLow, DalyHigh, Liu, Bouguerra, OptExp, DPMakespan, DPNextFailure.]

Petascale, MTBF = 125 years

slide-51
SLIDE 51

Parallel jobs under Exponential failures (2/2)

[Figures: average makespan degradation (y-axis, 0.9 to 1.1) versus number of processors. Curves: LowerBound, PeriodLB, Young, DalyLow, DalyHigh, Liu, Bouguerra, OptExp, DPMakespan, DPNextFailure.]

Panels:
  • Petascale, MTBF = 125 years (2^10 to 2^15 processors)
  • Petascale, MTBF = 500 years (2^10 to 2^15 processors)
  • Exascale, MTBF = 1250 years (2^14 to 2^20 processors)

slide-53
SLIDE 53

Parallel jobs under Weibull failures (1/2)

[Figure: average makespan degradation (y-axis, 0.9 to 1.1) versus number of processors (x-axis, 2^10 to 2^15). Curves: LowerBound, PeriodLB, Young, DalyLow, DalyHigh, Liu, Bouguerra, OptExp, DPNextFailure.]

Petascale, MTBF = 125 years, k = 0.70

slide-58
SLIDE 58

Parallel jobs under Weibull failures (2/2)

[Figures: average makespan degradation for LowerBound, PeriodLB, Young, DalyLow, DalyHigh, Liu, Bouguerra, OptExp, DPNextFailure.]

Panels:
  • Petascale, MTBF = 125 years, k = 0.70 (versus number of processors, 2^10 to 2^15)
  • Petascale, MTBF = 500 years, k = 0.70 (versus number of processors, 2^10 to 2^15)
  • Petascale, MTBF = 125 years, 45,208 processors (versus Weibull shape parameter k, 0.1 to 1; y-axis 0.5 to 2)
  • Exascale, MTBF = 1250 years, k = 0.70 (versus number of processors, 2^14 to 2^20)


slide-62
SLIDE 62

LANL trace set

[Figures: average makespan degradation (y-axis, 1 to 1.05) versus number of processors (x-axis, 2^12 to 2^15). Curves: PeriodLB, Young, DalyLow, DalyHigh, OptExp, DPNextFailure.]

Panels:
  • Petascale / LANL Cluster 18
  • Petascale / LANL Cluster 19


slide-68
SLIDE 68

Conclusion and perspectives

  • Complete analytical solution for Makespan / Exponential
  • Dynamic programming algorithms for NextFailure / Arbitrary distribution
  • Makespan decreased by DPNextFailure (for the hardest cases)

Future work
  • Target non-coordinated checkpointing (e.g., hierarchical checkpointing with message logging)

slide-69
SLIDE 69

Bibliography

[1] M.-S. Bouguerra, T. Gautier, D. Trystram, and J.-M. Vincent. A flexible checkpoint/restart model in distributed systems. In PPAM, volume 6067 of LNCS, pages 206–215, 2010.

[2] J. T. Daly. A higher order estimate of the optimum checkpoint interval for restart dumps. Future Generation Computer Systems, 22(3):303–312, 2004.

[3] Y. Liu, R. Nassar, C. Leangsuksun, N. Naksinehaboon, M. Paun, and S. Scott. An optimal checkpoint/restart model for a large scale high performance computing system. In IPDPS 2008, pages 1–9. IEEE, 2008.

[4] J. W. Young. A first order approximation to the optimum checkpoint interval.