Robustness of the Young/Daly formula for stochastic iterative applications
Yishu Du (1,2), Loris Marchal (2), Guillaume Pallez (3), Yves Robert (2,4)
(1) Tongji University, China; (2) CNRS, ENS Lyon and Inria, France; (3) Inria and University of Bordeaux, France; (4) University of Tennessee, USA
August 18, 2020

Contents
1. Introduction
2. Model
3. Static strategy
4. Dynamic strategy
5. Experiments
6. Conclusion
Yishu Du (ENS Lyon, LIP), ICPP, August 18, 2020

The road to Exascale
Observed two growth rates. What are the barriers on the road to achieving Exascale?
In Feb. 2014, the Advanced Scientific Computing Advisory Committee published the top ten challenges to achieve the development of an Exascale system. We focus here on one of those: resilience.

Why resilience?
- Supercomputers enroll huge numbers of processors;
- The Mean Time Between Failures (MTBF) of each individual component is $\mu_{ind}$;
- The MTBF of $P$ processors is $\mu_P = \frac{\mu_{ind}}{P}$;
- The most powerful computers in the Top 500 lists are victims of at least one failure a day;
- ...
One proc: MTBF ≈ 10 years. Petascale: MTBF ≈ 1 hour. Exascale: MTBF ≈ 5 minutes. Need for fault-tolerance algorithm!
Fault rate is proportional to the number of components.
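The scaling $\mu_P = \mu_{ind}/P$ can be checked with a few lines of arithmetic. This is a minimal sketch: the function name, the 365-day year, and the platform sizes are assumptions chosen so that the results land near the orders of magnitude quoted above; they are not figures from the slides.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: 365-day years


def platform_mtbf(mtbf_individual: float, num_processors: int) -> float:
    """MTBF of a platform of num_processors components: mu_P = mu_ind / P."""
    if num_processors < 1:
        raise ValueError("num_processors must be at least 1")
    return mtbf_individual / num_processors


# One processor with an MTBF of about 10 years, expressed in hours.
mu_ind_hours = 10 * HOURS_PER_YEAR

# Hypothetical platform sizes, picked to give roughly 1 hour and 5 minutes.
for p in (1, 87_600, 1_051_200):
    hours = platform_mtbf(mu_ind_hours, p)
    print(f"P = {p:>9}: MTBF = {hours:.4f} hours ({hours * 60:.1f} minutes)")
```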

Fail-Stop Errors
Fail-stop errors: hardware failures or crashes.
Effects:
- quickly detected
- the execution stops
- the entire content of local memory is lost
- computation has to be restarted from the last checkpoint
To handle fail-stop errors → Checkpoint/Restart

Expected execution time
The expected execution time to perform a work of size $W$ followed by a checkpoint of size $C$ in the presence of failures (Exponential distribution of parameter $\lambda$), with a restart cost $R$ and a downtime $D$, is:
$$T_\lambda(W, C, D, R) = \left(\frac{1}{\lambda} + D\right) e^{\lambda R} \left(e^{\lambda (W + C)} - 1\right).$$
We assume that failures can strike during checkpoint and recovery, but not during downtime. [Springer Monograph on Resilience 2015]
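The formula for $T_\lambda(W, C, D, R)$ is easy to evaluate numerically. The sketch below is one possible transcription; the function name and the sample values are illustrative assumptions, not taken from the slides. It uses `math.expm1` so that the result stays accurate when $\lambda (W + C)$ is small.

```python
import math


def expected_time(work, ckpt, downtime, recovery, lam):
    """T_lambda(W, C, D, R) = (1/lam + D) * exp(lam*R) * (exp(lam*(W+C)) - 1)."""
    if lam <= 0:
        raise ValueError("lam must be positive")
    return (1.0 / lam + downtime) * math.exp(lam * recovery) * math.expm1(lam * (work + ckpt))


# Illustrative values only: W = 100, C = 5, D = 1, R = 5.
failure_free = 100.0 + 5.0
for lam in (1e-6, 1e-4, 1e-2):
    t = expected_time(100.0, 5.0, 1.0, 5.0, lam)
    print(f"lambda = {lam:g}: expected time {t:.2f} (failure-free: {failure_free})")
```

As the failure rate goes to zero, the expected time approaches the failure-free time $W + C$, which is a quick sanity check on the transcription.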

Objective
Minimizing the expectation of the execution time, or makespan.
Divisible applications: the optimal period is
$$P_{YD} = \sqrt{2 \mu_f C} = \sqrt{\frac{2C}{\lambda}},$$
where $\mu_f = 1/\lambda$ is the platform MTBF and $C$ is the checkpoint time. [Young 1974, Daly 2006]
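A short numerical check of the Young/Daly period follows. The overhead function is the classical first-order argument ($C/P$ for checkpoints plus $P/(2\mu_f)$ for expected re-execution), which is not spelled out on the slide and is used here only as an assumption for illustration; the function names and values are hypothetical.

```python
import math


def young_daly_period(mtbf, ckpt):
    """P_YD = sqrt(2 * mu_f * C)."""
    return math.sqrt(2.0 * mtbf * ckpt)


def first_order_overhead(period, mtbf, ckpt):
    """First-order waste per unit of work: C/P (checkpoints) + P/(2*mu_f) (re-execution)."""
    return ckpt / period + period / (2.0 * mtbf)


# Illustrative values: platform MTBF of 10000 time units, checkpoint cost of 50.
mtbf, ckpt = 10_000.0, 50.0
p_yd = young_daly_period(mtbf, ckpt)
for period in (0.5 * p_yd, p_yd, 2.0 * p_yd):
    print(f"period = {period:7.1f}: overhead = {first_order_overhead(period, mtbf, ckpt):.4f}")
```

Under this first-order model, halving or doubling the period around $P_{YD}$ increases the overhead, which is the usual way to see why the square-root period is a minimum.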

Applications decomposed into computational iterations:
- the duration of an iteration is stochastic, i.e., obeys a probability distribution law $\mathcal{D}$ of mean $\mu_{\mathcal{D}}$
- one can checkpoint only at the end of an iteration
Given an iterative application with $n$ consecutive iterations:
- The execution times of the iterations are $X_1, \ldots, X_n$, where the $X_i$ are IID (independent and identically distributed) variables following $\mathcal{D}$.
- A solution with $m$ checkpoints writes as $S = (\delta_1, \ldots, \delta_n)$, where $\delta_i = 1$ if and only if we perform a checkpoint after the $i$-th iteration of length $X_i$.
- $1 \le i_1 < i_2 < \cdots < i_m = n$, and $\delta_j = 1 \iff j \in \{i_1, \ldots, i_m\}$.
- $W_j = \sum_{l = i_{j-1}+1}^{i_j} X_l$ denotes the work between two consecutive checkpoints (number $j-1$ and number $j$), with the convention $i_0 = 0$.
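To make the notation concrete, the sketch below turns a checkpoint vector $(\delta_1, \ldots, \delta_n)$ and iteration lengths $(X_1, \ldots, X_n)$ into the segment works $W_1, \ldots, W_m$. The function name and the sample data are hypothetical.

```python
def segment_work(x, delta):
    """Return [W_1, ..., W_m], the work accumulated between consecutive checkpoints."""
    if len(x) != len(delta):
        raise ValueError("x and delta must have the same length")
    if not delta or delta[-1] != 1:
        raise ValueError("the last iteration must be checkpointed (i_m = n)")
    segments, current = [], 0.0
    for length, d in zip(x, delta):
        current += length       # iteration i contributes X_i to the current segment
        if d == 1:              # checkpoint after this iteration closes the segment
            segments.append(current)
            current = 0.0
    return segments


# Five iterations, checkpoints after iterations 2, 4 and 5.
print(segment_work([50, 60, 40, 70, 30], [0, 1, 0, 1, 1]))
```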

Static strategy
Consider an iterative application with $n$ iterations. We are interested in minimizing the total execution time (makespan) of the application. For a solution $S$ with $m$ checkpoints, the expected makespan is:
$$\mathbb{E}[MS(S)] = \mathbb{E}\left[\sum_{j=1}^{m} T_\lambda(W_j, C, D, R)\right].$$
Static solutions decide which iterations to checkpoint. One can choose a solution to be periodic with period $k$, i.e., checkpoints are taken every $k$ iterations, namely at the end of iterations number $k, 2k, \ldots$ until the last iteration.
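A periodic solution and the sum inside the expectation can be sketched as follows. The code evaluates $\sum_j T_\lambda(W_j, C, D, R)$ for one fixed realization of the iteration lengths; the expected makespan is the average of this quantity over the distribution of the $X_i$. All names and numeric values are illustrative assumptions.

```python
import math


def expected_time(work, ckpt, downtime, recovery, lam):
    """T_lambda(W, C, D, R) = (1/lam + D) * exp(lam*R) * (exp(lam*(W+C)) - 1)."""
    return (1.0 / lam + downtime) * math.exp(lam * recovery) * math.expm1(lam * (work + ckpt))


def periodic_schedule(n, k):
    """delta_i = 1 after iterations k, 2k, ... and after the last iteration n."""
    if n < 1 or k < 1:
        raise ValueError("n and k must be positive")
    return [1 if (i % k == 0 or i == n) else 0 for i in range(1, n + 1)]


def summed_expected_time(x, delta, ckpt, downtime, recovery, lam):
    """Sum of T_lambda(W_j, C, D, R) over the segments induced by delta, for fixed lengths x."""
    total, work = 0.0, 0.0
    for length, d in zip(x, delta):
        work += length
        if d == 1:
            total += expected_time(work, ckpt, downtime, recovery, lam)
            work = 0.0
    return total


# Illustrative run: 10 iterations of constant length 50, C = 5, D = 1, R = 5.
lengths = [50.0] * 10
for k in (1, 2, 5, 10):
    delta = periodic_schedule(len(lengths), k)
    print(k, round(summed_expected_time(lengths, delta, 5.0, 1.0, 5.0, 2e-4), 2))
```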

Theorem
The periodic solution checkpointing every $k_{static}$ iterations is asymptotically optimal, where
$$x_{static} = \frac{W_0\left(-e^{-\lambda C - 1}\right) + 1}{\log\left(\mathbb{E}[e^{\lambda X}]\right)}$$
and $k_{static}$ is either $\max(1, \lfloor x_{static} \rfloor)$ or $\lceil x_{static} \rceil$, whichever achieves the smaller value of
$$C_{ind}(k) = \frac{e^{\lambda C}\, \mathbb{E}[e^{\lambda X}]^{k} - 1}{k}.$$
$W_0$ is the principal Lambert function.
Proposition
The first-order approximation $k_{FO}$ of $k_{static}$ obeys the equation
$$k_{FO} \cdot \mu_{\mathcal{D}} = \sqrt{\frac{2C}{\lambda}}.$$
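The theorem can be evaluated numerically as sketched below. Several parts are assumptions made for illustration: the values $\lambda = 2 \cdot 10^{-4}$ and $C = 5$, the choice of a Uniform(20, 80) iteration law with its closed-form $\mathbb{E}[e^{\lambda X}]$, and the bisection-based `lambert_w0` helper, which is a simple stand-in for a library Lambert function rather than anything prescribed by the slides.

```python
import math


def lambert_w0(z, iterations=200):
    """Principal branch W_0(z) for -1/e <= z <= 0, by bisection on w * exp(w) = z.

    w * exp(w) is increasing on [-1, 0], so bisection on that interval is safe.
    """
    if not (-1.0 / math.e <= z <= 0.0):
        raise ValueError("this helper only handles -1/e <= z <= 0")
    lo, hi = -1.0, 0.0
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if mid * math.exp(mid) < z:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def mgf_uniform(lam, a, b):
    """E[exp(lam * X)] for X ~ Uniform(a, b)."""
    return (math.exp(lam * b) - math.exp(lam * a)) / (lam * (b - a))


def c_ind(k, lam, ckpt, mgf):
    """C_ind(k) = (exp(lam*C) * E[exp(lam*X)]**k - 1) / k."""
    return (math.exp(lam * ckpt) * mgf ** k - 1.0) / k


def static_period(lam, ckpt, mgf):
    """Return (x_static, k_static) as defined in the theorem."""
    x = (lambert_w0(-math.exp(-lam * ckpt - 1.0)) + 1.0) / math.log(mgf)
    candidates = {max(1, math.floor(x)), math.ceil(x)}
    return x, min(candidates, key=lambda k: c_ind(k, lam, ckpt, mgf))


# Illustrative parameters (assumptions, not the experimental settings).
lam, ckpt, mean_x = 2e-4, 5.0, 50.0
mgf = mgf_uniform(lam, 20.0, 80.0)
x_static, k_static = static_period(lam, ckpt, mgf)
k_fo = math.sqrt(2.0 * ckpt / lam) / mean_x  # first-order value from the proposition
print(f"x_static = {x_static:.4f}, k_static = {k_static}, first-order = {k_fo:.4f}")
```

With these values the exact real-valued optimum and the first-order approximation differ by less than a tenth of an iteration, which is the kind of agreement the proposition suggests.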

Dynamic strategy
We fix a threshold $W_{th}$ for the amount of work since the last checkpoint. When iteration $X_i$ finishes, if the amount of work since the last checkpoint is greater than $W_{th}$, then $\delta_i = 1$ (we checkpoint); otherwise $\delta_i = 0$ (we do not checkpoint).
The slowdown $H$ is defined as the ratio
$$H = \frac{\text{actual execution time}}{\text{useful execution time}},$$
so that the slowdown is equal to 1 if there is no cost for fault-tolerance (no checkpoints, nor re-execution after failures).
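The threshold rule can be sketched in a few lines. One assumption is added here and marked in the code: the last iteration is always checkpointed, following the model's convention $i_m = n$. The function names and sample lengths are hypothetical.

```python
def dynamic_checkpoints(x, w_th):
    """Return delta for the threshold rule: checkpoint once work since the last checkpoint exceeds w_th."""
    delta, since_last = [], 0.0
    n = len(x)
    for i, length in enumerate(x, start=1):
        since_last += length
        # Assumption: the final iteration is always checkpointed (i_m = n in the model).
        take = since_last > w_th or i == n
        delta.append(1 if take else 0)
        if take:
            since_last = 0.0
    return delta


def slowdown(actual_time, useful_time):
    """H = actual execution time / useful execution time."""
    return actual_time / useful_time


# With a threshold of 100, checkpoints follow iterations 2 and 4, plus the final one.
print(dynamic_checkpoints([50, 60, 40, 70, 30], 100))
```

Unlike a periodic schedule, the number of iterations between checkpoints here depends on the observed iteration lengths.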

When an iteration is completed, we compute two values:
- The expected slowdown $H_{ckpt}$ if a checkpoint is taken at the end of this iteration:
$$H_{ckpt}(w_{dyn}) = \frac{T(w_{dyn}, 0, D, R) + T(0, C, D, R + w_{dyn})}{w_{dyn}}$$
- The expected slowdown $H_{no}$ if no checkpoint is taken at the end of this iteration:
$$H_{no}(w_{dyn}) = \frac{\mathbb{E}\left[T(w_{dyn}, 0, D, R) + T(X, C, D, R + w_{dyn})\right]}{\mathbb{E}[w_{dyn} + X]}$$

By definition, $W_{th}$ is the threshold value where $H_{ckpt}(W_{th}) = H_{no}(W_{th})$.
Finally, we derive the threshold value:
$$W_{th} = \frac{1}{\lambda}\, W_0\!\left(-\frac{\lambda\, \mathbb{E}[X]}{\mathbb{E}[e^{\lambda X}] - 1}\; e^{-\lambda \left(C + \frac{\mathbb{E}[X]}{\mathbb{E}[e^{\lambda X}] - 1}\right)}\right) + \frac{\mathbb{E}[X]}{\mathbb{E}[e^{\lambda X}] - 1}.$$
Proposition
The first-order approximation $W_{FO}$ of $W_{th}$ obeys the equation
$$W_{FO} = \sqrt{\frac{2C}{\lambda}}.$$
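The closed form for $W_{th}$ can be checked against its definition $H_{ckpt}(W_{th}) = H_{no}(W_{th})$. The sketch below assumes the expected-time formula $T(W, C, D, R) = (1/\lambda + D)\, e^{\lambda R} (e^{\lambda (W + C)} - 1)$, a Uniform(20, 80) iteration law, and illustrative values for $\lambda$, $C$, $D$ and $R$. The bisection-based `lambert_w0` is a stand-in helper, not a library routine.

```python
import math


def lambert_w0(z, iterations=200):
    """Principal branch W_0(z) for -1/e <= z <= 0, by bisection on w * exp(w) = z."""
    if not (-1.0 / math.e <= z <= 0.0):
        raise ValueError("this helper only handles -1/e <= z <= 0")
    lo, hi = -1.0, 0.0
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if mid * math.exp(mid) < z:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def expected_time(work, ckpt, downtime, recovery, lam):
    """T(W, C, D, R) = (1/lam + D) * exp(lam*R) * (exp(lam*(W+C)) - 1)."""
    return (1.0 / lam + downtime) * math.exp(lam * recovery) * math.expm1(lam * (work + ckpt))


def h_ckpt(w, lam, ckpt, downtime, recovery):
    """Expected slowdown if a checkpoint is taken now, after w units of work."""
    total = (expected_time(w, 0.0, downtime, recovery, lam)
             + expected_time(0.0, ckpt, downtime, recovery + w, lam))
    return total / w


def h_no(w, lam, ckpt, downtime, recovery, mean_x, mgf_x):
    """Expected slowdown if one more iteration X is run before checkpointing."""
    # E[T(X, C, D, R + w)] by linearity, with E[exp(lam * X)] = mgf_x.
    t_next = ((1.0 / lam + downtime) * math.exp(lam * (recovery + w))
              * (math.exp(lam * ckpt) * mgf_x - 1.0))
    return (expected_time(w, 0.0, downtime, recovery, lam) + t_next) / (w + mean_x)


def threshold(lam, ckpt, mean_x, mgf_x):
    """Closed-form W_th from the slide."""
    ratio = mean_x / (mgf_x - 1.0)
    arg = -lam * ratio * math.exp(-lam * (ckpt + ratio))
    return lambert_w0(arg) / lam + ratio


# Illustrative parameters (assumptions): Uniform(20, 80) iterations, C = R = 5, D = 1.
lam, ckpt, downtime, recovery, mean_x = 2e-4, 5.0, 1.0, 5.0, 50.0
mgf_x = (math.exp(lam * 80.0) - math.exp(lam * 20.0)) / (lam * 60.0)
w_th = threshold(lam, ckpt, mean_x, mgf_x)
w_fo = math.sqrt(2.0 * ckpt / lam)
print(f"W_th = {w_th:.2f}, W_FO = {w_fo:.2f}")
print(h_ckpt(w_th, lam, ckpt, downtime, recovery),
      h_no(w_th, lam, ckpt, downtime, recovery, mean_x, mgf_x))
```

In this example the exact threshold sits somewhat below $\sqrt{2C/\lambda}$. That is plausible because the rule only triggers once the threshold has been exceeded, so the actual work per segment overshoots $W_{th}$ by part of an iteration.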

Methodology
- An iterative application composed of $n = 1000$ consecutive iterations.
- The execution time of each iteration follows a probability distribution $\mathcal{D}$ with $\mu_{\mathcal{D}} = 50$ and standard deviation $\sigma$:
  - Uniform(20, 80)
  - Gamma(25, 0.5)
  - Normal(50, $2.5^2$)
- Each iteration fails with probability $p_{fail} \in \{10^{-3}, 10^{-2.5}, \ldots, 10^{-0.1}\}$.
- Checkpoint time $C = \eta \mu_{\mathcal{D}}$, where $\eta$ is the proportion of checkpoint time to the expectation of iteration time (default $\eta = 0.1$).
- Recovery time $R = C$, and fixed downtime $D = 1$.
- The makespan is evaluated with 10000 random simulations.
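The three iteration-length distributions can be sampled with Python's standard library, as sketched below. One assumption deserves attention: Gamma(25, 0.5) is read as shape 25 and rate 0.5, since that is the reading that gives a mean of 50. `random.gammavariate` takes a scale parameter, so the rate is inverted. The seed and variable names are arbitrary.

```python
import random
import statistics


def make_samplers(rng):
    """Samplers for the three iteration-length laws, each with mean 50."""
    return {
        "Uniform(20, 80)": lambda: rng.uniform(20.0, 80.0),
        # Assumption: 0.5 is a rate; gammavariate expects (shape, scale), so scale = 1 / 0.5.
        "Gamma(25, 0.5)": lambda: rng.gammavariate(25.0, 1.0 / 0.5),
        "Normal(50, 2.5^2)": lambda: rng.gauss(50.0, 2.5),
    }


rng = random.Random(2020)  # arbitrary seed, for reproducibility only
n = 1000                   # iterations per application
mu_d, eta = 50.0, 0.1
ckpt = eta * mu_d          # C = eta * mu_D
recovery, downtime = ckpt, 1.0

sample_means = {}
for name, draw in make_samplers(rng).items():
    lengths = [draw() for _ in range(n)]
    sample_means[name] = statistics.fmean(lengths)
    print(f"{name}: sample mean = {sample_means[name]:.2f}, "
          f"sample std = {statistics.stdev(lengths):.2f}")
```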

Static strategy results
[Figure: three panels (Gamma, Normal, Uniform); x-axis: $k$ from 1 to 10; y-axis: makespan normalized by MS YD_sta.]
Figure: Performance (with boxplots) of the static strategy that chooses the value of $k$. Brown-red diamonds plot $\mathbb{E}[MS_{\mathcal{D}}](k)$ (theoretical makespan). The blue (resp. red) line represents the makespan obtained by the optimal dynamic strategy $MS^{sim}_{dyn}(W_{th})$ (resp. the YD-dynamic strategy $MS^{YD}_{dyn}$).

Static strategy results
Table: Simulation for static case, $p_{fail} = 10^{-2}$.

                                                          Gamma    Normal   Uniform
$k_{sim}$                                                 5        5        5
$x_{static}$                                              4.6114   4.6122   4.6097
$k_{static}$                                              5        5        5
$\frac{1}{\mu_{\mathcal{D}}} \sqrt{\frac{2C}{\lambda}}$   4.6787   4.6787   4.6787
$k_{FO}$                                                  5        5        5
