A different re-execution speed can help

Anne Benoit, Aurélien Cavelan, Valentin Le Fèvre, Yves Robert, Hongyang Sun. LIP, ENS de Lyon, France. PASA Workshop, in conjunction with ICPP'16, August 16, 2016. Anne.Benoit@ens-lyon.fr


  1. A different re-execution speed can help. Anne Benoit, Aurélien Cavelan, Valentin Le Fèvre, Yves Robert, Hongyang Sun (LIP, ENS de Lyon, France). PASA Workshop, in conjunction with ICPP'16, August 16, 2016. Anne.Benoit@ens-lyon.fr

  2. Motivation: Resilience. Large-scale platforms are increasingly subject to errors. A major challenge for Exascale is the frequent striking of silent errors. How to deal with these errors? Verification + checkpoint/restart. The verification mechanism is either general-purpose (replication, triplication) or application-specific. Verified checkpoints: a verification is performed just before each checkpoint. [Figure: timeline of repeating patterns, each made of work W followed by a verification V and a checkpoint C.]

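  3. Silent vs fail-stop errors. Notation: C is the time to checkpoint; λ is the error rate (platform MTBF µ = 1/λ); V is the time to verify; R is the time to recover. For fail-stop errors, the optimal checkpointing period W (Young/Daly) is $W = \sqrt{2C/\lambda}$ (with V = 0). [Figure: timeline where a fail-stop error interrupts a pattern, followed by a recovery R and a re-execution of W, V, C.] For silent errors, $W = \sqrt{(V + C)/\lambda}$: C is replaced by V + C, and the factor 2 disappears because a silent error is only detected by the verification at the end of the pattern, so the whole pattern is lost rather than half of it on average. [Figure: timeline where a silent error is detected by the verification V, followed by a recovery R and a re-execution of W, V, C.]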
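
The two period formulas on this slide can be compared numerically. The following is a minimal sketch; the function names and the numeric values (checkpoint, verification and MTBF durations) are illustrative assumptions, not taken from the talk.

```python
import math


def young_daly_period(C: float, lam: float) -> float:
    """Fail-stop errors (Young/Daly): W = sqrt(2C / lambda), with V = 0."""
    return math.sqrt(2.0 * C / lam)


def silent_error_period(V: float, C: float, lam: float) -> float:
    """Silent errors with verified checkpoints: W = sqrt((V + C) / lambda)."""
    return math.sqrt((V + C) / lam)


if __name__ == "__main__":
    # Hypothetical values in seconds: 10-minute checkpoint,
    # 1-minute verification, platform MTBF of one day.
    C, V, mtbf = 600.0, 60.0, 86400.0
    lam = 1.0 / mtbf
    print(f"fail-stop period:    {young_daly_period(C, lam):.0f} s")
    print(f"silent-error period: {silent_error_period(V, C, lam):.0f} s")
```

Both functions express W as a duration, which matches the slide when the execution speed is normalized to 1.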

  5. Motivation: Energy consumption. The power requirement of current petascale platforms is that of a small town, so the energy consumption of future platforms must be reduced. A popular technique is dynamic voltage and frequency scaling (DVFS). Lower speed yields energy savings: when computing at speed σ, power is proportional to $\sigma^3$ and execution time is proportional to $1/\sigma$, so (dynamic) energy is proportional to $\sigma^2$. Static energy must also be accounted for: trade-offs are to be found. Realistic approach: minimize energy while guaranteeing a performance bound. ⇒ At which speed should we execute the workload?
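
The trade-off between dynamic and static energy can be illustrated with a short sketch. It assumes a dynamic power of `kappa * sigma**3` and a constant static power `p_static`; both names and all numeric values are hypothetical. Under this assumption, the energy per unit of work is `kappa * sigma**2 + p_static / sigma`, and setting its derivative to zero gives the speed that minimizes energy when no performance bound applies.

```python
def energy(work: float, sigma: float, kappa: float, p_static: float) -> float:
    """Energy to run `work` units at speed sigma.

    Power is kappa * sigma^3 + p_static and the run lasts work / sigma.
    """
    return (kappa * sigma ** 3 + p_static) * work / sigma


def energy_optimal_speed(kappa: float, p_static: float) -> float:
    """Speed minimizing kappa * sigma^2 + p_static / sigma.

    The derivative 2 * kappa * sigma - p_static / sigma^2 vanishes at
    sigma = (p_static / (2 * kappa)) ** (1/3).
    """
    return (p_static / (2.0 * kappa)) ** (1.0 / 3.0)


if __name__ == "__main__":
    kappa, p_static, work = 1.0, 2.0, 100.0  # hypothetical values
    for sigma in (0.5, 0.75, 1.0, 1.25, 1.5):
        print(f"sigma={sigma:.2f}  energy={energy(work, sigma, kappa, p_static):.1f}")
    print("energy-optimal speed:", energy_optimal_speed(kappa, p_static))
```

Running slower than this speed wastes static energy, running faster wastes dynamic energy; a performance bound may still force a higher speed.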

  6. Outline of the talk: model and optimization problem; optimal pattern size and speeds; simulations; extensions (both fail-stop and silent errors); conclusion.

  7. Framework. Divisible-load applications, subject to silent data corruption. Checkpoint/restart strategy: periodic patterns that repeat over time, with verified checkpoints. Is it better to use two different speeds rather than only one? What are the optimal checkpointing period and the optimal execution speeds?

  12. Model. Set of speeds $S = \{s_1, \ldots, s_K\}$: $\sigma_1 \in S$ is the speed for the first execution, $\sigma_2 \in S$ is the speed for re-executions. Silent errors follow an exponential distribution of rate λ. Verification: V units of work; checkpointing: time C; recovery: time R. $P_{idle}$ and $P_{io}$ are constant, and $P_{cpu}(\sigma) = \kappa\sigma^3$. Energy for W units of work at speed σ: $\frac{W}{\sigma}(P_{idle} + \kappa\sigma^3)$. Energy of a verification at speed σ: $\frac{V}{\sigma}(P_{idle} + \kappa\sigma^3)$. Energy of a checkpoint: $C(P_{idle} + P_{io})$. Energy of a recovery: $R(P_{idle} + P_{io})$. [Figure: timeline with a silent error. The first pattern runs W and V at speed $\sigma_1$; the error is detected by the verification, followed by a recovery R, a re-execution of W and V at speed $\sigma_2$, and a checkpoint C; the next pattern runs again at speed $\sigma_1$.]

  13. Problem. Optimization problem BiCrit: minimize $\frac{E(W, \sigma_1, \sigma_2)}{W}$ subject to $\frac{T(W, \sigma_1, \sigma_2)}{W} \le \rho$. Here $E(W, \sigma_1, \sigma_2)$ is the expected energy consumed to execute W units of work at speed $\sigma_1$, with possible re-executions at speed $\sigma_2$; $T(W, \sigma_1, \sigma_2)$ is the expected time to execute W units of work at speed $\sigma_1$, with possible re-executions at speed $\sigma_2$; ρ is a performance bound, or admissible degradation factor.

  14. Computing expected execution time.
      Proposition 1. For the BiCrit problem with a single speed,
      $$T(W, \sigma, \sigma) = C + e^{\frac{\lambda W}{\sigma}} \, \frac{W + V}{\sigma} + \left(e^{\frac{\lambda W}{\sigma}} - 1\right) R.$$
      Proposition 2. For the BiCrit problem,
      $$T(W, \sigma_1, \sigma_2) = C + \frac{W + V}{\sigma_1} + \left(1 - e^{-\frac{\lambda W}{\sigma_1}}\right) e^{\frac{\lambda W}{\sigma_2}} \left(R + \frac{W + V}{\sigma_2}\right).$$

  15. Proof of Proposition 1. The recursive equation to compute $T(W, \sigma, \sigma)$ reads:
      $$T(W, \sigma, \sigma) = \frac{W + V}{\sigma} + p(W/\sigma) \left(R + T(W, \sigma, \sigma)\right) + \left(1 - p(W/\sigma)\right) C,$$
      where $p(W/\sigma) = 1 - e^{-\frac{\lambda W}{\sigma}}$. The reasoning is as follows. We always execute W units of work followed by the verification, in time $\frac{W + V}{\sigma}$. With probability $p(W/\sigma)$, a silent error occurred and is detected, in which case we recover and start anew. Otherwise, with probability $1 - p(W/\sigma)$, we simply checkpoint after a successful execution. Solving this equation for $T(W, \sigma, \sigma)$, using $1 - p(W/\sigma) = e^{-\frac{\lambda W}{\sigma}}$, leads to the expected execution time.

  16. Proof of Proposition 2. The recursive equation to compute $T(W, \sigma_1, \sigma_2)$ reads:
      $$T(W, \sigma_1, \sigma_2) = \frac{W + V}{\sigma_1} + p(W/\sigma_1) \left(R + T(W, \sigma_2, \sigma_2)\right) + \left(1 - p(W/\sigma_1)\right) C,$$
      where $p(W/\sigma_1) = 1 - e^{-\frac{\lambda W}{\sigma_1}}$. The reasoning is as follows. We always execute W units of work followed by the verification, in time $\frac{W + V}{\sigma_1}$. With probability $p(W/\sigma_1)$, a silent error occurred and is detected, in which case we recover and start anew at speed $\sigma_2$. Otherwise, with probability $1 - p(W/\sigma_1)$, we simply checkpoint after a successful execution. Substituting the expression of $T(W, \sigma_2, \sigma_2)$ given by Proposition 1 leads to the expected execution time.
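
The closed form of Proposition 2 can be checked against a direct simulation of the pattern described in the proof. This is a minimal sketch under the stated model (errors strike only during the W units of work, with exponential inter-arrival times of rate λ); the function names, the seed and the parameter values are illustrative assumptions.

```python
import math
import random


def expected_time(W, s1, s2, lam, V, C, R):
    """Closed form of Proposition 2."""
    p1 = 1.0 - math.exp(-lam * W / s1)
    return C + (W + V) / s1 + p1 * math.exp(lam * W / s2) * (R + (W + V) / s2)


def simulate_pattern(W, s1, s2, lam, V, C, R, rng):
    """Time to complete one pattern, sampling silent errors."""
    t, speed = 0.0, s1
    while True:
        t += (W + V) / speed  # work followed by the verification
        # No error if the first error arrival falls after the work phase.
        if rng.expovariate(lam) >= W / speed:
            return t + C      # success: checkpoint
        t += R                # error detected: recover
        speed = s2            # all re-executions use the second speed


def monte_carlo_time(W, s1, s2, lam, V, C, R, trials=100_000, seed=0):
    """Average pattern time over `trials` simulated patterns."""
    rng = random.Random(seed)
    total = sum(simulate_pattern(W, s1, s2, lam, V, C, R, rng)
                for _ in range(trials))
    return total / trials


if __name__ == "__main__":
    params = dict(W=1000.0, s1=1.0, s2=2.0, lam=1e-4, V=20.0, C=50.0, R=50.0)
    print("closed form:", expected_time(**params))
    print("simulation: ", monte_carlo_time(**params))
```

With `s1 == s2`, `expected_time` reduces to the single-speed expression of Proposition 1.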

  17. Computing expected energy consumption.
      Proposition 3. For the BiCrit problem,
      $$E(W, \sigma_1, \sigma_2) = \left(C + \left(1 - e^{-\frac{\lambda W}{\sigma_1}}\right) e^{\frac{\lambda W}{\sigma_2}} R\right) (P_{io} + P_{idle}) + \frac{W + V}{\sigma_1} \left(\kappa\sigma_1^3 + P_{idle}\right) + \frac{W + V}{\sigma_2} \left(1 - e^{-\frac{\lambda W}{\sigma_1}}\right) e^{\frac{\lambda W}{\sigma_2}} \left(\kappa\sigma_2^3 + P_{idle}\right).$$
      Power spent during checkpoint or recovery: $P_{io} + P_{idle}$; power spent during computation and verification at speed σ: $P_{cpu}(\sigma) + P_{idle} = \kappa\sigma^3 + P_{idle}$. From Proposition 2, we get the expression of $E(W, \sigma_1, \sigma_2)$.
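
With Propositions 2 and 3 available, BiCrit can be explored by exhaustive search over speed pairs and a grid of pattern lengths. This is only a brute-force sketch for illustration, not the method of the talk; the speed set, the bound ρ, the grid and all parameter values are hypothetical.

```python
import math
from itertools import product


def expected_time(W, s1, s2, lam, V, C, R):
    """Proposition 2."""
    q = (1.0 - math.exp(-lam * W / s1)) * math.exp(lam * W / s2)
    return C + (W + V) / s1 + q * (R + (W + V) / s2)


def expected_energy(W, s1, s2, lam, V, C, R, kappa, p_idle, p_io):
    """Proposition 3."""
    q = (1.0 - math.exp(-lam * W / s1)) * math.exp(lam * W / s2)
    return ((C + q * R) * (p_io + p_idle)
            + (W + V) / s1 * (kappa * s1 ** 3 + p_idle)
            + (W + V) / s2 * q * (kappa * s2 ** 3 + p_idle))


def solve_bicrit(speeds, rho, w_grid, time_params, power_params):
    """Return (E/W, W, sigma1, sigma2) minimizing E/W subject to T/W <= rho.

    Returns None when no candidate on the grid satisfies the bound.
    """
    best = None
    for s1, s2 in product(speeds, repeat=2):
        for W in w_grid:
            if expected_time(W, s1, s2, **time_params) / W > rho:
                continue  # violates the performance bound
            e = expected_energy(W, s1, s2, **time_params, **power_params) / W
            if best is None or e < best[0]:
                best = (e, W, s1, s2)
    return best


# Hypothetical instance (arbitrary units).
SPEEDS = [0.4, 0.6, 0.8, 1.0]
RHO = 3.0
W_GRID = range(500, 50001, 500)
TIME = dict(lam=1e-5, V=30.0, C=300.0, R=300.0)
POWER = dict(kappa=1000.0, p_idle=100.0, p_io=50.0)

if __name__ == "__main__":
    print(solve_bicrit(SPEEDS, RHO, W_GRID, TIME, POWER))
```

Because the search includes all pairs with equal speeds, its result is never worse than the best single-speed solution on the same grid.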

  18. Finding optimal pattern length (1). To get a closed-form expression for the optimal value of W, we use first-order approximations, based on the Taylor expansion $e^{\lambda W} = 1 + \lambda W + O(\lambda^2 W^2)$:
      $$\frac{T(W, \sigma_1, \sigma_2)}{W} = \frac{1}{\sigma_1} + \frac{\lambda W}{\sigma_1 \sigma_2} + \frac{\lambda R}{\sigma_1} + \frac{\lambda V}{\sigma_1 \sigma_2} + \frac{C + V/\sigma_1}{W} + O(\lambda^2 W) \tag{1}$$
      $$\frac{E(W, \sigma_1, \sigma_2)}{W} = \frac{\kappa\sigma_1^3 + P_{idle}}{\sigma_1} + \frac{\lambda W (\kappa\sigma_2^3 + P_{idle})}{\sigma_1 \sigma_2} + \frac{\lambda R (P_{io} + P_{idle})}{\sigma_1} + \frac{\lambda V (\kappa\sigma_2^3 + P_{idle})}{\sigma_1 \sigma_2} + \frac{C (P_{io} + P_{idle}) + V (\kappa\sigma_1^3 + P_{idle})/\sigma_1}{W} + O(\lambda^2 W) \tag{2}$$
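
The quality of approximation (1) can be checked numerically against Proposition 2. The sketch below also computes the pattern length that minimizes the first-order time overhead alone: (1) has the form a + bW + c/W, which is smallest at W = sqrt(c/b). That minimizer ignores the energy objective and the bound ρ, so it illustrates the technique rather than solving BiCrit; function names and parameter values are assumptions.

```python
import math


def exact_time_overhead(W, s1, s2, lam, V, C, R):
    """T(W, s1, s2) / W from Proposition 2."""
    q = (1.0 - math.exp(-lam * W / s1)) * math.exp(lam * W / s2)
    return (C + (W + V) / s1 + q * (R + (W + V) / s2)) / W


def approx_time_overhead(W, s1, s2, lam, V, C, R):
    """First-order approximation (1), without the O(lambda^2 W) term."""
    return (1.0 / s1 + lam * W / (s1 * s2) + lam * R / s1
            + lam * V / (s1 * s2) + (C + V / s1) / W)


def time_optimal_length(s1, s2, lam, V, C):
    """Minimizer of (1): b*W + c/W with b = lam/(s1*s2), c = C + V/s1."""
    return math.sqrt((C + V / s1) * s1 * s2 / lam)


if __name__ == "__main__":
    p = dict(s1=1.0, s2=1.0, lam=1e-5, V=30.0, C=300.0, R=300.0)  # hypothetical
    for W in (1000.0, 5000.0, 20000.0):
        print(W, exact_time_overhead(W, **p), approx_time_overhead(W, **p))
    print("time-optimal W:", time_optimal_length(1.0, 1.0, 1e-5, 30.0, 300.0))
```

With both speeds equal to 1, `time_optimal_length` returns sqrt((V + C)/λ), the silent-error period given earlier in the talk.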
