SLIDE 1

Robustness of the Young/Daly formula for stochastic iterative applications

Yishu Du (1,2), Loris Marchal (2), Guillaume Pallez (3), Yves Robert (2,4)

(1) Tongji University, China
(2) CNRS, ENS Lyon and Inria, France
(3) Inria and University of Bordeaux, France
(4) University of Tennessee, USA

August 18, 2020

SLIDE 2

Contents

1. Introduction
2. Model
3. Static strategy
4. Dynamic strategy
5. Experiments
6. Conclusion

Yishu Du (ENS Lyon, LIP) ICPP August 18, 2020 2 / 31

SLIDE 4

The road to Exascale

Two growth rates are observed. What are the barriers on the road to achieving Exascale?

SLIDE 5

The road to Exascale

In February 2014, the Advanced Scientific Computing Advisory Committee published its top ten challenges for developing an Exascale system. We focus here on one of them: resilience.

SLIDE 9

Why resilience?

- Supercomputers enroll huge numbers of processors.
- The Mean Time Between Failures (MTBF) of each individual component is $\mu_{ind}$.
- The MTBF of $P$ processors is $\mu_P = \frac{\mu_{ind}}{P}$.
- The most powerful computers in the Top 500 lists suffer at least one failure a day.

The fault rate is proportional to the number of components. Need for fault-tolerance algorithms!

- One processor: MTBF ≈ 10 years
- Petascale: MTBF ≈ 1 hour
- Exascale: MTBF ≈ 5 minutes
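The scaling rule above (platform MTBF equals the individual MTBF divided by the number of processors) can be sketched in a few lines of Python. The platform sizes below are hypothetical values chosen only to land near the orders of magnitude quoted on the slide; they are not taken from the talk.

```python
# Platform MTBF under the slide's model: mu_P = mu_ind / P.
MINUTES_PER_YEAR = 365 * 24 * 60


def platform_mtbf(mu_ind: float, num_procs: int) -> float:
    """MTBF of a platform of num_procs components, each with MTBF mu_ind."""
    if num_procs < 1:
        raise ValueError("num_procs must be at least 1")
    return mu_ind / num_procs


# One processor: MTBF of about 10 years, expressed in minutes.
mu_ind_minutes = 10 * MINUTES_PER_YEAR

# Hypothetical platform sizes (assumptions, for illustration only).
for procs in (1, 100_000, 1_000_000):
    mtbf = platform_mtbf(mu_ind_minutes, procs)
    print(f"P = {procs:>9,d}: MTBF = {mtbf:,.2f} minutes")
```

With a 10-year individual MTBF, one hundred thousand processors give a platform MTBF of a bit under an hour, and one million processors give roughly five minutes.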

SLIDE 10

Fail-Stop Errors

Fail-stop errors: hardware failures or crashes

Effects:
- quickly detected
- the execution stops
- the entire content of local memory is lost
- computation has to be restarted from the last checkpoint

To handle fail-stop errors → Checkpoint/Restart

SLIDE 12

Expected execution time

The expected execution time to perform a work of size $W$ followed by a checkpoint of size $C$ in the presence of failures (Exponential distribution of parameter $\lambda$), with a restart cost $R$ and a downtime $D$, is:

$$T_\lambda(W, C, D, R) = \left(\frac{1}{\lambda} + D\right) e^{\lambda R} \left(e^{\lambda (W + C)} - 1\right).$$

We assume that failures can strike during checkpoint and recovery, but not during downtime. [Springer Monograph on Resilience 2015]
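As a quick sanity check of the formula, the following Python sketch evaluates the expected time for a few failure rates. The numeric values of W, C, D and R are illustrative assumptions, not values from the talk. When the failure rate tends to zero, the result should approach the failure-free time W + C.

```python
import math


def expected_time(lam: float, work: float, ckpt: float,
                  downtime: float, recovery: float) -> float:
    """T_lambda(W, C, D, R) = (1/lam + D) * exp(lam*R) * (exp(lam*(W+C)) - 1)."""
    # expm1 keeps precision when lam * (work + ckpt) is very small.
    return ((1.0 / lam + downtime)
            * math.exp(lam * recovery)
            * math.expm1(lam * (work + ckpt)))


# Illustrative parameters (assumptions): work, checkpoint, downtime, recovery.
W, C, D, R = 200.0, 5.0, 1.0, 5.0

for lam in (1e-9, 1e-4, 1e-3):
    t = expected_time(lam, W, C, D, R)
    print(f"lambda = {lam:g}: expected time = {t:.3f} (failure-free: {W + C})")
```

The expected time grows with the failure rate, which is why the choice of the amount of work between checkpoints matters.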

SLIDE 13

Objective

Minimizing the expectation of the execution time, or makespan

Divisible Applications

Optimal period: $P_{YD} = \sqrt{2 \mu_f C} = \sqrt{\frac{2C}{\lambda}}$

$\mu_f$: platform MTBF (so that $\lambda = 1/\mu_f$); $C$: checkpoint time [Young 1974, Daly 2006]
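A minimal Python sketch of the Young/Daly period, using the default experimental parameters given later in the talk (mean iteration time 50, checkpoint time C = 0.1 × 50, p_fail = 10^-2). One assumption is not stated on the slides: the failure rate is recovered from p_fail as λ = −ln(1 − p_fail) / (µ_D + C), i.e. p_fail is read as the probability of at least one failure during one iteration plus one checkpoint. Under that assumption the result agrees with the W_FO value of 233.9328 reported in the dynamic results table.

```python
import math


def young_daly_period(ckpt: float, lam: float) -> float:
    """Young/Daly period sqrt(2 * C / lambda)."""
    return math.sqrt(2.0 * ckpt / lam)


mu_D = 50.0       # mean iteration time (from the methodology slide)
eta = 0.1         # checkpoint time as a fraction of mu_D (default value)
p_fail = 1e-2     # failure probability used in the result tables
C = eta * mu_D

# ASSUMPTION (not stated on the slides): p_fail is the probability of at
# least one failure during one iteration plus one checkpoint.
lam = -math.log(1.0 - p_fail) / (mu_D + C)

period = young_daly_period(C, lam)
print(f"lambda = {lam:.6e}, Young/Daly period = {period:.4f}")
```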

SLIDE 14

Applications decomposed into computational iterations

- The duration of an iteration is stochastic, i.e., obeys a probability distribution law $\mathcal{D}$ of mean $\mu_{\mathcal{D}}$.
- One can checkpoint only at the end of an iteration.

Given an iterative application with $n$ consecutive iterations:

- The execution times of the iterations are $X_1, \ldots, X_n$, where the $X_i$ are IID (independent and identically distributed) variables following $\mathcal{D}$.
- A solution with $m$ checkpoints is written $S = (\delta_1, \ldots, \delta_n)$, where $\delta_i = 1$ if and only if we perform a checkpoint after the $i$-th iteration of length $X_i$.
- $1 \le i_1 < i_2 < \cdots < i_m = n$, and $\delta_j = 1 \iff j \in \{i_1, \ldots, i_m\}$.
- $W_j = \sum_{l = i_{j-1}+1}^{i_j} X_l$ (with $i_0 = 0$) denotes the work between two consecutive checkpoints (number $j-1$ and number $j$).
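The notation can be made concrete with a short Python sketch that turns a checkpoint vector δ into the segment works W_j. The function name `segment_work` and the sample durations are hypothetical, chosen only for illustration.

```python
from typing import List, Sequence


def segment_work(durations: Sequence[float], delta: Sequence[int]) -> List[float]:
    """Return W_1..W_m: the work accumulated between consecutive checkpoints."""
    if len(durations) != len(delta):
        raise ValueError("durations and delta must have the same length")
    if not delta or delta[-1] != 1:
        # The model requires i_m = n: the last iteration is checkpointed.
        raise ValueError("the last iteration must be checkpointed")
    segments: List[float] = []
    current = 0.0
    for x, d in zip(durations, delta):
        current += x
        if d == 1:  # checkpoint after this iteration closes the segment
            segments.append(current)
            current = 0.0
    return segments


# Hypothetical iteration durations X_1..X_5 and a solution with m = 2 checkpoints.
X = [48.0, 52.0, 50.0, 47.0, 53.0]
delta = [0, 1, 0, 0, 1]
print(segment_work(X, delta))
```

Here the checkpoints are taken after iterations 2 and 5, so the two segments contain X_1 + X_2 and X_3 + X_4 + X_5.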

SLIDE 16

Static strategy

Consider an iterative application with $n$ iterations, grouped into $m$ segments of work $W_1, \ldots, W_m$. We are interested in minimizing the expected total execution time (makespan) of the application, which is given by:

$$E[MS(S)] = E\left[\sum_{j=1}^{m} T_\lambda(W_j, C, D, R)\right].$$

Static solutions decide in advance which iterations to checkpoint. One can choose a solution to be periodic with period $k$, i.e., checkpoints are taken every $k$ iterations, namely at the end of iterations number $k, 2k, \ldots$ until the last iteration.

SLIDE 17

Theorem

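The periodic solution checkpointing every $k_{static}$ iterations is asymptotically optimal, where

$$x_{static} = \frac{W_0\left(-e^{-\lambda C - 1}\right) + 1}{\log\left(E[e^{\lambda X}]\right)}$$

and $k_{static}$ is either $\max(1, \lfloor x_{static} \rfloor)$ or $\lceil x_{static} \rceil$, whichever achieves the smaller value of

$$C_{ind}(k) = \frac{e^{\lambda C} \left(E[e^{\lambda X}]\right)^{k} - 1}{k}.$$

$W_0$ is the principal branch of the Lambert function.

Proposition

The first-order approximation $k_{FO}$ of $k_{static}$ obeys the equation

$$k_{FO} \cdot \mu_{\mathcal{D}} = \sqrt{\frac{2C}{\lambda}}.$$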
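The theorem can be evaluated numerically. The sketch below does so for the Normal(50, 2.5²) case of the experiments, under two assumptions that are not stated on the slides: the failure rate is recovered as λ = −ln(1 − p_fail) / (µ_D + C), and `lambert_w0` is a hypothetical pure-Python bisection helper rather than a library routine. With these assumptions the computed values agree with the Normal column of the static results table (x_static ≈ 4.6122, k_static = 5).

```python
import math


def lambert_w0(z: float) -> float:
    """Principal branch W0 for -1/e < z < 0, by bisection on w * exp(w) = z."""
    lo, hi = -1.0, 0.0  # w * exp(w) is increasing on [-1, 0]
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mid * math.exp(mid) < z:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


mu_D, sigma, C, p_fail = 50.0, 2.5, 5.0, 1e-2
# ASSUMPTION: p_fail covers one iteration plus one checkpoint.
lam = -math.log(1.0 - p_fail) / (mu_D + C)

# log E[exp(lam * X)] for X ~ Normal(mu_D, sigma^2).
log_mgf = lam * mu_D + 0.5 * (lam * sigma) ** 2


def c_ind(k: int) -> float:
    """C_ind(k) = (exp(lam*C) * E[exp(lam*X)]^k - 1) / k."""
    return math.expm1(lam * C + k * log_mgf) / k


x_static = (lambert_w0(-math.exp(-lam * C - 1.0)) + 1.0) / log_mgf
candidates = {max(1, math.floor(x_static)), math.ceil(x_static)}
k_static = min(candidates, key=c_ind)

# First-order (Young/Daly) value: sqrt(2C/lam) / mu_D.
x_fo = math.sqrt(2.0 * C / lam) / mu_D

print(f"x_static = {x_static:.4f}, k_static = {k_static}, first order = {x_fo:.4f}")
```

If SciPy is available, `scipy.special.lambertw` could replace the bisection helper; the pure-Python version is used here only to keep the sketch dependency-free.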

SLIDE 19

Dynamic strategy

We fix a threshold $W_{th}$ for the amount of work since the last checkpoint. When iteration $X_i$ finishes:

- if the amount of work since the last checkpoint is greater than $W_{th}$, then $\delta_i = 1$ (we checkpoint);
- otherwise $\delta_i = 0$ (we do not checkpoint).

The slowdown $H$ is defined as the ratio

$$H = \frac{\text{actual execution time}}{\text{useful execution time}},$$

so that the slowdown is equal to 1 if there is no cost for fault tolerance (no checkpoints, and no re-execution after failures).
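The threshold rule can be sketched as follows in Python. The function name `dynamic_decisions`, the sample durations and the threshold value are hypothetical; the final iteration is always checkpointed, following the model's convention that the last checkpoint is taken after iteration n.

```python
from typing import List, Sequence


def dynamic_decisions(durations: Sequence[float], w_th: float) -> List[int]:
    """Return delta_1..delta_n for the threshold rule with threshold w_th."""
    delta: List[int] = []
    since_ckpt = 0.0
    last = len(durations) - 1
    for i, x in enumerate(durations):
        since_ckpt += x
        # Checkpoint when the accumulated work exceeds the threshold;
        # the last iteration is always checkpointed (i_m = n in the model).
        take = since_ckpt > w_th or i == last
        delta.append(1 if take else 0)
        if take:
            since_ckpt = 0.0
    return delta


# Hypothetical observed iteration durations and threshold.
X = [60.0, 70.0, 80.0, 40.0, 90.0, 100.0, 30.0]
print(dynamic_decisions(X, w_th=200.0))
```

Unlike the static strategy, the positions of the checkpoints depend on the observed durations: the decision is taken online, at the end of each iteration.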

SLIDE 20

When an iteration is completed, we compute two values, where $w_{dyn}$ is the amount of work done since the last checkpoint:

- The expected slowdown $H_{ckpt}$ if a checkpoint is taken at the end of this iteration:

$$H_{ckpt}(w_{dyn}) = \frac{T(w_{dyn}, 0, D, R) + T(0, C, D, R + w_{dyn})}{w_{dyn}}$$

- The expected slowdown $H_{no}$ if no checkpoint is taken at the end of this iteration:

$$H_{no}(w_{dyn}) = \frac{E\left[T(w_{dyn}, 0, D, R) + T(X, C, D, R + w_{dyn})\right]}{E\left[w_{dyn} + X\right]}$$

SLIDE 21

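By definition, $W_{th}$ is the threshold value where $H_{ckpt}(W_{th}) = H_{no}(W_{th})$. Finally, we derive the threshold value:

$$W_{th} = \frac{1}{\lambda} W_0\left(-\frac{\lambda E[X]}{E[e^{\lambda X}] - 1} \, e^{-\lambda \left(C + \frac{E[X]}{E[e^{\lambda X}] - 1}\right)}\right) + \frac{E[X]}{E[e^{\lambda X}] - 1}.$$

Proposition

The first-order approximation $W_{FO}$ of $W_{th}$ obeys the equation

$$W_{FO} = \sqrt{\frac{2C}{\lambda}}.$$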
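A Python sketch of the threshold computation for the Normal(50, 2.5²) case, which also checks numerically that the two slowdowns coincide at the threshold. It relies on two assumptions not stated on the slides: λ = −ln(1 − p_fail) / (µ_D + C), and a hypothetical bisection helper `lambert_w0` for the principal Lambert branch. Under these assumptions the result agrees with the W_th value reported for the Normal distribution in the dynamic results table (206.8876).

```python
import math


def lambert_w0(z: float) -> float:
    """Principal branch W0 for -1/e < z < 0, by bisection on w * exp(w) = z."""
    lo, hi = -1.0, 0.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mid * math.exp(mid) < z:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


mu_D, sigma, C, D, p_fail = 50.0, 2.5, 5.0, 1.0, 1e-2
R = C  # recovery time equals checkpoint time in the experiments
lam = -math.log(1.0 - p_fail) / (mu_D + C)  # ASSUMPTION, see lead-in

# E[exp(lam * X)] - 1 for X ~ Normal(mu_D, sigma^2).
mgf_minus_1 = math.expm1(lam * mu_D + 0.5 * (lam * sigma) ** 2)
b = mu_D / mgf_minus_1  # the recurring term E[X] / (E[e^{lam X}] - 1)

w_th = lambert_w0(-lam * b * math.exp(-lam * (C + b))) / lam + b
w_fo = math.sqrt(2.0 * C / lam)


def T(w: float, c: float, d: float, r: float) -> float:
    """Expected time T_lambda(w, c, d, r)."""
    return (1.0 / lam + d) * math.exp(lam * r) * math.expm1(lam * (w + c))


def h_ckpt(w: float) -> float:
    return (T(w, 0.0, D, R) + T(0.0, C, D, R + w)) / w


def h_no(w: float) -> float:
    # E[T(X, C, D, R + w)] only needs the moment generating function of X.
    second = ((1.0 / lam + D) * math.exp(lam * (R + w))
              * ((mgf_minus_1 + 1.0) * math.exp(lam * C) - 1.0))
    return (T(w, 0.0, D, R) + second) / (w + mu_D)


print(f"W_th = {w_th:.4f}, W_FO = {w_fo:.4f}")
print(f"H_ckpt(W_th) = {h_ckpt(w_th):.6f}, H_no(W_th) = {h_no(w_th):.6f}")
```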

SLIDE 23

Methodology

- An iterative application composed of $n = 1000$ consecutive iterations.
- The execution time of each iteration follows a probability distribution $\mathcal{D}$ with mean $\mu_{\mathcal{D}} = 50$ and standard deviation $\sigma$:
  - Uniform(20, 80)
  - Gamma(25, 0.5)
  - Normal(50, $2.5^2$)
- Each iteration fails with probability $p_{fail} \in \{10^{-3}, 10^{-2.5}, \ldots, 10^{-0.1}\}$.
- Checkpoint time $C = \eta \mu_{\mathcal{D}}$, where $\eta$ is the ratio of the checkpoint time to the expected iteration time (default $\eta = 0.1$).
- Recovery time $R = C$, and fixed downtime $D = 1$.
- The makespan is evaluated with 10000 random simulations.
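To illustrate what one such random simulation might look like, here is a minimal Monte Carlo sketch in Python for a single segment (work followed by a checkpoint). It is not the authors' simulator: the function names, the seed and the failure rate λ = 10^-2 are assumptions chosen for illustration. The sketch follows the failure model of the talk (exponential failures, which can strike during work, checkpoint and recovery but not during downtime) and compares the empirical mean with the closed-form expected time.

```python
import math
import random


def simulate_segment(rng: random.Random, lam: float, work: float,
                     ckpt: float, downtime: float, recovery: float) -> float:
    """One random execution time for `work` followed by a checkpoint."""
    elapsed = 0.0
    needed = work + ckpt  # first attempt: no recovery to perform
    while True:
        time_to_failure = rng.expovariate(lam)
        if time_to_failure >= needed:
            return elapsed + needed
        # A failure strikes: the time spent is lost, the downtime is paid
        # (no failure during downtime), and the next attempt starts with
        # a recovery before redoing the work and the checkpoint.
        elapsed += time_to_failure + downtime
        needed = recovery + work + ckpt


def expected_time(lam: float, work: float, ckpt: float,
                  downtime: float, recovery: float) -> float:
    """Closed form (1/lam + D) * exp(lam*R) * (exp(lam*(W+C)) - 1)."""
    return ((1.0 / lam + downtime) * math.exp(lam * recovery)
            * math.expm1(lam * (work + ckpt)))


rng = random.Random(2020)  # fixed seed (arbitrary) for reproducibility
lam, W, C, D, R = 1e-2, 50.0, 5.0, 1.0, 5.0  # illustrative values
runs = 20_000

mean_sim = sum(simulate_segment(rng, lam, W, C, D, R) for _ in range(runs)) / runs
theory = expected_time(lam, W, C, D, R)
print(f"simulated mean = {mean_sim:.2f}, closed form = {theory:.2f}")
```

A full makespan simulation would chain such segments, drawing the iteration durations from the chosen distribution and placing checkpoints according to the static or dynamic strategy.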

SLIDE 24

Static strategy results

[Figure: three panels (Gamma, Normal, Uniform). x-axis: $k$, from 1 to 10. y-axis: makespan normalized by $MS^{YD}_{sta}$.]

Figure: Performance (with boxplots) of the static strategy that chooses the value of $k$. Brown-red diamonds plot $E[MS_D](k)$ (theoretical makespan). The blue (resp. red) line represents the makespan obtained by the optimal dynamic strategy $MS^{sim}_{dyn}(W_{th})$ (resp. the YD-dynamic strategy $MS^{YD}_{dyn}$).

SLIDE 25

Static strategy results

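Table: Simulation for static case.

| $p_{fail} = 10^{-2}$ | Gamma | Normal | Uniform |
|---|---|---|---|
| $k_{sim}$ | 5 | 5 | 5 |
| $x_{static}$ | 4.6114 | 4.6122 | 4.6097 |
| $k_{static}$ | 5 | 5 | 5 |
| $\frac{1}{\mu_{\mathcal{D}}} \sqrt{\frac{2C}{\lambda}}$ | 4.6787 | 4.6787 | 4.6787 |
| $k_{FO}$ | 5 | 5 | 5 |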

SLIDE 26

Dynamic strategy results

[Figure: three panels (Gamma, Normal, Uniform). x-axis: $\gamma$, from 0.1 to 2. y-axis: makespan normalized by $MS^{YD}_{dyn}$.]

Figure: Performance (with boxplots) of the dynamic strategy that chooses a threshold of $\gamma \cdot W_{th}$. The orange (resp. purple) line represents the makespan obtained by the optimal static strategy $MS^{sim}_{sta}(k_{static})$ (resp. the YD-static strategy $MS^{YD}_{sta}$).

SLIDE 27

Dynamic strategy results

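Table: Simulation for dynamic case.

| $p_{fail} = 10^{-2}$ | Gamma | Normal | Uniform |
|---|---|---|---|
| $\gamma$ | 1.0 | 1.0 | 1.0 |
| $W_{sim}$ | 206.0492 | 206.8876 | 204.2743 |
| $MS^{OPT}_{sim\_dyn}$ | 52267 | 52264 | 52267 |
| $W_{th}$ | 206.0492 | 206.8876 | 204.2743 |
| $MS^{sim}_{dyn}(W_{th})$ | 52267 | 52264 | 52267 |
| $W_{FO}$ | 233.9328 | 233.9328 | 233.9328 |
| $MS^{YD}_{dyn}$ | 52284 | 52271 | 52288 |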

SLIDE 28

Varying failure probability pfail

[Figure: three panels (Gamma, Normal, Uniform). x-axis: failure probability $p_{fail}$, from $10^{-3}$ to $10^{0}$. y-axis: normalized makespan. Plotted ratios: $MS^{sim}_{sta}(k_{static}) / MS^{YD}_{sta}$ and $MS^{sim}_{dyn}(W_{th}) / MS^{YD}_{sta}$.]

Figure: Simulation with varying failure probability $p_{fail}$.

SLIDE 29

Varying standard deviation σ

[Figure: three panels (Gamma, Normal, Uniform). x-axis: standard deviation $\sigma$. y-axis: normalized makespan. Plotted ratios: $MS^{sim}_{sta}(k_{static}) / MS^{YD}_{sta}$ and $MS^{sim}_{dyn}(W_{th}) / MS^{YD}_{sta}$.]

Figure: Simulation with varying standard deviation $\sigma$.

SLIDE 30

Varying proportion η

[Figure: three panels for the Gamma distribution, each with x-axis: proportion $\eta$, from 0.1 to 1.0. Panel 1: normalized makespan, with ratios $MS^{sim}_{sta}(k_{static}) / MS^{YD}_{sta}$ and $MS^{sim}_{dyn}(W_{th}) / MS^{YD}_{sta}$. Panel 2: $k$, with curves $k_{sim}$, $k_{static}$ and $k_{FO}$. Panel 3: $W$, with curves $W_{sim} = W_{th}$ and $W_{FO}$.]

Figure: Simulation with varying proportion $\eta$ of checkpoint time to expected iteration time.

SLIDE 32

Summary

- For static solutions, we derive a closed-form formula to compute the optimal checkpointing period, and we show that its first-order approximation corresponds to the Young/Daly formula.
- For dynamic solutions, we derive a closed-form formula to compute the threshold at which one decides either to checkpoint or to execute more work, and we show that its first-order approximation also corresponds to the Young/Daly formula.
- We conduct an extensive set of experiments with classic probability distributions (Uniform, Gamma, Normal) and we conclude that the Young/Daly formula remains accurate and useful in a stochastic setting.

SLIDE 33

Future work

- Extension to iterative applications composed of several tasks
- Non-constant checkpoint time

SLIDE 34

Thank You

Email: yishu.du@ens-lyon.fr
