A different re-execution speed can help
Anne Benoit, Aurélien Cavelan, Valentin Le Fèvre, Yves Robert, Hongyang Sun
LIP, ENS de Lyon, France
PASA Workshop, in conjunction with ICPP'16, August 16, 2016
Anne.Benoit@ens-lyon.fr

Motivation: Resilience
- Large-scale platforms: increasingly subject to errors
- Major challenge for Exascale: frequent striking of silent errors
- How to deal with these errors? Verification + Checkpoint/Restart
- Verification mechanism: general-purpose (replication, triplication) or application-specific
- Verified checkpoints: a verification is performed just before each checkpoint
[Figure: timeline of repeated patterns, each made of work W, a verification V, and a checkpoint C]

Silent vs Fail-stop errors
- C: time to checkpoint; λ: error rate (platform MTBF µ = 1/λ); V: time to verify; R: time to recover
- Optimal checkpointing period W for fail-stop errors (Young/Daly): $W = \sqrt{2C/\lambda}$ (with V = 0)
[Figure: timeline with a fail-stop error interrupting a pattern, followed by a recovery R and a new pattern W, V, C]
- Silent errors: $W = \sqrt{(V + C)/\lambda}$ (C → V + C; the factor 2 is missing)
[Figure: timeline with a silent error, detected by the verification V at the end of the pattern, followed by a recovery R and a re-execution of W and V]
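The two period formulas can be compared numerically. The sketch below is only an illustration: the function names and the values chosen for C, V, and the MTBF are assumptions, not figures from the talk. The usual intuition for the missing factor 2 is that a fail-stop error is noticed immediately, so about half a period is lost on average, whereas a silent error is only detected by the verification at the end of the pattern, so the whole period is lost.

```python
import math

def young_daly_period(C, lam):
    """First-order optimal period for fail-stop errors (V = 0)."""
    return math.sqrt(2.0 * C / lam)

def silent_error_period(V, C, lam):
    """First-order optimal period for silent errors with verified checkpoints."""
    return math.sqrt((V + C) / lam)

# Hypothetical values, in seconds (assumptions for illustration only).
C = 60.0            # checkpoint time
V = 30.0            # verification time
mtbf = 86400.0      # platform MTBF (one day)
lam = 1.0 / mtbf    # error rate

print("fail-stop period:", young_daly_period(C, lam))
print("silent-error period:", silent_error_period(V, C, lam))
```

With V = 0, the silent-error period is shorter than the Young/Daly period by a factor of exactly sqrt(2).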

Motivation: Energy consumption
- Power requirement of current petascale platforms = small town
- Need to reduce energy consumption of future platforms
- Popular technique: dynamic voltage and frequency scaling (DVFS)
- Lower speed → energy savings: when computing at speed σ, power is proportional to $\sigma^3$ and execution time is proportional to $1/\sigma$ → (dynamic) energy is proportional to $\sigma^2$
- Also account for static energy: trade-offs to be found
- Realistic approach: minimize energy while guaranteeing a performance bound
⇒ At which speed should we execute the workload?

Outline of the talk
- Model and optimization problem
- Optimal pattern size and speeds
- Simulations
- Extensions: both fail-stop and silent errors
- Conclusion

Framework
- Divisible-load applications
- Subject to silent data corruption
- Checkpoint/restart strategy: periodic patterns that repeat over time
- Verified checkpoints
- Is it better to use two different speeds rather than only one?
- What are the optimal checkpointing period and optimal execution speeds?

Model
- Set of speeds $S = \{s_1, \dots, s_K\}$: $\sigma_1 \in S$ is the speed for the first execution, $\sigma_2 \in S$ the speed for re-executions
- Silent errors: exponential distribution of rate λ
- Verification: V units of work; checkpointing: time C; recovery: time R
- $P_{idle}$ and $P_{io}$ constant, and $P_{cpu}(\sigma) = \kappa\sigma^3$
- Energy for W units of work at speed σ: $\frac{W}{\sigma}(P_{idle} + \kappa\sigma^3)$
- Energy of a verification at speed σ: $\frac{V}{\sigma}(P_{idle} + \kappa\sigma^3)$
- Energy of a checkpoint: $C(P_{idle} + P_{io})$
- Energy of a recovery: $R(P_{idle} + P_{io})$
[Figure: timeline with a silent error: W and V are executed at speed σ1, the error is detected by the verification, a recovery R follows, then W and V are re-executed at speed σ2 and checkpointed (C); the following patterns run again at speed σ1]
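The energy terms of the model translate directly into code. The sketch below uses hypothetical function names and parameter values. The last function is not from the slides: it is the speed obtained by setting the derivative of $\frac{W}{\sigma}(P_{idle} + \kappa\sigma^3)$ to zero, which ignores errors, I/O, and the discrete speed set S, and is included only to show the static/dynamic trade-off.

```python
def compute_energy(work, sigma, p_idle, kappa):
    """Energy to process `work` units (computation or verification) at speed sigma."""
    return work / sigma * (p_idle + kappa * sigma ** 3)

def io_energy(duration, p_idle, p_io):
    """Energy of a checkpoint or a recovery lasting `duration` time units."""
    return duration * (p_idle + p_io)

def error_free_energy_speed(p_idle, kappa):
    """Continuous speed minimizing compute_energy (assumption: no errors, no I/O)."""
    return (p_idle / (2.0 * kappa)) ** (1.0 / 3.0)

# Hypothetical parameters (assumptions for illustration only).
p_idle, kappa = 16.0, 1.0
sigma_star = error_free_energy_speed(p_idle, kappa)
for sigma in (0.5 * sigma_star, sigma_star, 2.0 * sigma_star):
    print(sigma, compute_energy(100.0, sigma, p_idle, kappa))
```

Running slower than this speed wastes static energy over a longer execution, while running faster wastes dynamic energy.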

Problem
Optimization problem BiCrit: minimize $\frac{E(W, \sigma_1, \sigma_2)}{W}$ subject to $\frac{T(W, \sigma_1, \sigma_2)}{W} \le \rho$
- $E(W, \sigma_1, \sigma_2)$ is the expected energy consumed to execute W units of work at speed $\sigma_1$, with possible re-executions at speed $\sigma_2$
- $T(W, \sigma_1, \sigma_2)$ is the expected time to execute W units of work at speed $\sigma_1$, with possible re-executions at speed $\sigma_2$
- ρ is a performance bound, or admissible degradation factor

Computing expected execution time
Proposition 1. For the BiCrit problem with a single speed,
$$T(W, \sigma, \sigma) = C + e^{\frac{\lambda W}{\sigma}} \left( \frac{W + V}{\sigma} \right) + \left( e^{\frac{\lambda W}{\sigma}} - 1 \right) R$$
Proposition 2. For the BiCrit problem,
$$T(W, \sigma_1, \sigma_2) = C + \frac{W + V}{\sigma_1} + \left( 1 - e^{-\frac{\lambda W}{\sigma_1}} \right) e^{\frac{\lambda W}{\sigma_2}} \left( R + \frac{W + V}{\sigma_2} \right)$$
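Both propositions are easy to evaluate numerically. The sketch below uses hypothetical function names and parameter values, and checks that the two-speed expression of Proposition 2 reduces to the single-speed expression of Proposition 1 when $\sigma_1 = \sigma_2$.

```python
import math

def expected_time_single(W, sigma, lam, V, C, R):
    """Proposition 1: expected time of a pattern executed at a single speed."""
    growth = math.exp(lam * W / sigma)
    return C + growth * (W + V) / sigma + (growth - 1.0) * R

def expected_time(W, sigma1, sigma2, lam, V, C, R):
    """Proposition 2: first execution at sigma1, re-executions at sigma2."""
    p_error = 1.0 - math.exp(-lam * W / sigma1)
    return (C + (W + V) / sigma1
            + p_error * math.exp(lam * W / sigma2) * (R + (W + V) / sigma2))

# Hypothetical parameters (assumptions for illustration only).
params = dict(lam=1e-3, V=10.0, C=20.0, R=5.0)
print(expected_time_single(500.0, 1.0, **params))
print(expected_time(500.0, 1.0, 1.0, **params))   # same value as the line above
print(expected_time(500.0, 1.0, 2.0, **params))   # faster re-executions
```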

Proof of Proposition 1
Proof. The recursive equation to compute $T(W, \sigma, \sigma)$ reads:
$$T(W, \sigma, \sigma) = \frac{W + V}{\sigma} + p(W/\sigma) \left( R + T(W, \sigma, \sigma) \right) + \left( 1 - p(W/\sigma) \right) C,$$
where $p(W/\sigma) = 1 - e^{-\frac{\lambda W}{\sigma}}$. The reasoning is as follows:
- We always execute W units of work followed by the verification, in time $\frac{W + V}{\sigma}$;
- With probability $p(W/\sigma)$, a silent error occurred and is detected, in which case we recover and start anew;
- Otherwise, with probability $1 - p(W/\sigma)$, we simply checkpoint after a successful execution.
Solving this equation leads to the expected execution time.
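The final step, solving the recursion, can be spelled out as follows. The shorthand $T = T(W, \sigma, \sigma)$ and $p = p(W/\sigma)$ is introduced only for this derivation.

```latex
\begin{align*}
T &= \frac{W+V}{\sigma} + p\,(R + T) + (1-p)\,C \\
(1-p)\,T &= \frac{W+V}{\sigma} + p\,R + (1-p)\,C \\
T &= C + \frac{1}{1-p}\cdot\frac{W+V}{\sigma} + \frac{p}{1-p}\,R \\
  &= C + e^{\frac{\lambda W}{\sigma}}\,\frac{W+V}{\sigma}
       + \left(e^{\frac{\lambda W}{\sigma}} - 1\right) R ,
\end{align*}
% using 1 - p = e^{-\lambda W/\sigma}, hence
% 1/(1-p) = e^{\lambda W/\sigma} and p/(1-p) = e^{\lambda W/\sigma} - 1.
```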

Proof of Proposition 2
Proof. The recursive equation to compute $T(W, \sigma_1, \sigma_2)$ reads:
$$T(W, \sigma_1, \sigma_2) = \frac{W + V}{\sigma_1} + p(W/\sigma_1) \left( R + T(W, \sigma_2, \sigma_2) \right) + \left( 1 - p(W/\sigma_1) \right) C,$$
where $p(W/\sigma_1) = 1 - e^{-\frac{\lambda W}{\sigma_1}}$. The reasoning is as follows:
- We always execute W units of work followed by the verification, in time $\frac{W + V}{\sigma_1}$;
- With probability $p(W/\sigma_1)$, a silent error occurred and is detected, in which case we recover and start anew at speed $\sigma_2$;
- Otherwise, with probability $1 - p(W/\sigma_1)$, we simply checkpoint after a successful execution.
Solving this equation, with $T(W, \sigma_2, \sigma_2)$ given by Proposition 1, leads to the expected execution time.
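The reasoning in this proof can be cross-checked with a small Monte Carlo simulation. This is a sketch under stated assumptions, not the authors' simulator: function names, parameter values, and the seed are hypothetical. As in the recursion, an attempt at speed σ fails with probability $1 - e^{-\lambda W/\sigma}$, the first attempt runs at $\sigma_1$, and every re-execution runs at $\sigma_2$.

```python
import math
import random

def expected_time(W, s1, s2, lam, V, C, R):
    """Closed form of Proposition 2."""
    p1 = 1.0 - math.exp(-lam * W / s1)
    return C + (W + V) / s1 + p1 * math.exp(lam * W / s2) * (R + (W + V) / s2)

def simulate_pattern(W, s1, s2, lam, V, C, R, rng):
    """Time to complete one pattern, drawing silent errors at random."""
    elapsed = (W + V) / s1                        # first execution + verification
    failed = rng.random() < 1.0 - math.exp(-lam * W / s1)
    while failed:                                 # recover, then re-execute at s2
        elapsed += R + (W + V) / s2
        failed = rng.random() < 1.0 - math.exp(-lam * W / s2)
    return elapsed + C                            # checkpoint after a clean run

def estimate_time(n, seed=0, **params):
    """Average pattern time over n simulated patterns."""
    rng = random.Random(seed)
    return sum(simulate_pattern(rng=rng, **params) for _ in range(n)) / n

# Hypothetical parameters (assumptions for illustration only).
params = dict(W=500.0, s1=1.0, s2=2.0, lam=1e-3, V=10.0, C=20.0, R=5.0)
analytic = expected_time(**params)
simulated = estimate_time(100_000, **params)
print("analytic:", analytic, "simulated:", simulated)
```

The two values should agree up to sampling noise, which shrinks as the number of simulated patterns grows.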

Computing expected energy consumption
Proposition 3. For the BiCrit problem,
$$E(W, \sigma_1, \sigma_2) = \left( C + \left( 1 - e^{-\frac{\lambda W}{\sigma_1}} \right) e^{\frac{\lambda W}{\sigma_2}} R \right) (P_{io} + P_{idle}) + \frac{W + V}{\sigma_1} (\kappa\sigma_1^3 + P_{idle}) + \frac{W + V}{\sigma_2} \left( 1 - e^{-\frac{\lambda W}{\sigma_1}} \right) e^{\frac{\lambda W}{\sigma_2}} (\kappa\sigma_2^3 + P_{idle})$$
Power spent during checkpoint or recovery: $P_{io} + P_{idle}$; power spent during computation and verification at speed σ: $P_{cpu}(\sigma) + P_{idle} = \kappa\sigma^3 + P_{idle}$. From Proposition 2, we get the expression of $E(W, \sigma_1, \sigma_2)$.
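Proposition 3 weights each time term of Proposition 2 by the power drawn during that phase. The sketch below (hypothetical function names and values) makes this explicit and includes a sanity check: if every phase draws unit power ($\kappa = 0$, $P_{io} = 0$, $P_{idle} = 1$), the expected energy equals the expected time.

```python
import math

def retry_weight(W, s1, s2, lam):
    """Expected number of re-executions: (1 - e^{-lam W/s1}) e^{lam W/s2}."""
    return (1.0 - math.exp(-lam * W / s1)) * math.exp(lam * W / s2)

def expected_time(W, s1, s2, lam, V, C, R):
    """Proposition 2."""
    g = retry_weight(W, s1, s2, lam)
    return C + (W + V) / s1 + g * (R + (W + V) / s2)

def expected_energy(W, s1, s2, lam, V, C, R, kappa, p_idle, p_io):
    """Proposition 3: each time term multiplied by the power of its phase."""
    g = retry_weight(W, s1, s2, lam)
    io_part = (C + g * R) * (p_io + p_idle)                        # checkpoint, recoveries
    first_run = (W + V) / s1 * (kappa * s1 ** 3 + p_idle)          # first execution
    reruns = g * (W + V) / s2 * (kappa * s2 ** 3 + p_idle)         # re-executions
    return io_part + first_run + reruns

# Hypothetical parameters (assumptions for illustration only).
base = dict(W=500.0, s1=1.0, s2=2.0, lam=1e-3, V=10.0, C=20.0, R=5.0)
print(expected_energy(kappa=1.0, p_idle=2.0, p_io=1.0, **base))
print(expected_energy(kappa=0.0, p_idle=1.0, p_io=0.0, **base), expected_time(**base))
```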

Finding optimal pattern length (1)
To get a closed-form expression for the optimal value of W, we use first-order approximations, based on the Taylor expansion $e^{\lambda W} = 1 + \lambda W + O(\lambda^2 W^2)$:
$$\frac{T(W, \sigma_1, \sigma_2)}{W} = \frac{1}{\sigma_1} + \frac{\lambda W}{\sigma_1 \sigma_2} + \frac{\lambda R}{\sigma_1} + \frac{\lambda V}{\sigma_1 \sigma_2} + \frac{C + V/\sigma_1}{W} + O(\lambda^2 W) \tag{1}$$
$$\frac{E(W, \sigma_1, \sigma_2)}{W} = \frac{\kappa\sigma_1^3 + P_{idle}}{\sigma_1} + \frac{\lambda W (\kappa\sigma_2^3 + P_{idle})}{\sigma_1 \sigma_2} + \frac{\lambda R (P_{io} + P_{idle})}{\sigma_1} + \frac{\lambda V (\kappa\sigma_2^3 + P_{idle})}{\sigma_1 \sigma_2} + \frac{C (P_{io} + P_{idle}) + V (\kappa\sigma_1^3 + P_{idle})/\sigma_1}{W} + O(\lambda^2 W) \tag{2}$$
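In both approximations, W appears only in a term of the form aW and a term of the form b/W, so each overhead is minimized at $W = \sqrt{b/a}$. The sketch below applies this to equation (2) while ignoring the constraint $T/W \le \rho$. It is an illustration under that assumption, not the complete solution of BiCrit, and all function names and parameter values are hypothetical.

```python
import math

def time_overhead(W, s1, s2, lam, V, C, R):
    """First-order T/W, equation (1) without the O(lam^2 W) term."""
    return (1.0 / s1 + lam * W / (s1 * s2) + lam * R / s1
            + lam * V / (s1 * s2) + (C + V / s1) / W)

def energy_overhead(W, s1, s2, lam, V, C, R, kappa, p_idle, p_io):
    """First-order E/W, equation (2) without the O(lam^2 W) term."""
    p1 = kappa * s1 ** 3 + p_idle      # power at speed s1
    p2 = kappa * s2 ** 3 + p_idle      # power at speed s2
    pio = p_io + p_idle                # power during checkpoint or recovery
    return (p1 / s1 + lam * W * p2 / (s1 * s2) + lam * R * pio / s1
            + lam * V * p2 / (s1 * s2) + (C * pio + V * p1 / s1) / W)

def unconstrained_energy_period(s1, s2, lam, V, C, kappa, p_idle, p_io):
    """W minimizing energy_overhead when the time bound is ignored (assumption)."""
    a = lam * (kappa * s2 ** 3 + p_idle) / (s1 * s2)                  # coefficient of W
    b = C * (p_io + p_idle) + V * (kappa * s1 ** 3 + p_idle) / s1     # coefficient of 1/W
    return math.sqrt(b / a)

# Hypothetical parameters (assumptions for illustration only).
cfg = dict(s1=1.0, s2=2.0, lam=1e-5, V=10.0, C=20.0, kappa=1.0, p_idle=2.0, p_io=1.0)
W_e = unconstrained_energy_period(**cfg)
rho = 1.2  # hypothetical performance bound
print("W:", W_e, "E/W:", energy_overhead(W_e, R=5.0, **cfg))
print("bound met:", time_overhead(W_e, 1.0, 2.0, 1e-5, 10.0, 20.0, 5.0) <= rho)
```

When the bound is not met at this W, the period has to be chosen within the range of W values for which (1) stays below ρ.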
