Analysis of the Tradeoffs between Energy and Run Time for Multilevel Checkpointing

Analysis of the Tradeoffs between Energy and Run Time for Multilevel Checkpointing. Prasanna Balaprakash, Leonardo A. Bautista Gomez, Slim Bouguerra, Stefan M. Wild, Franck Cappello, and Paul D. Hovland (ANL). PMBS workshop @ SC14


  1. Analysis of the Tradeoffs between Energy and Run Time for Multilevel Checkpointing. Prasanna Balaprakash, Leonardo A. Bautista Gomez, Slim Bouguerra, Stefan M. Wild, Franck Cappello, and Paul D. Hovland (ANL). PMBS workshop @ SC’14. leobago@anl.gov (ANL)

  2. Context and motivations. Context: The Need For Speed
     [Figure, from http://www.scidacreview.org/]


  6. Context and motivations. Motivation: Failures
     - Sequoia: MTBF ≈ 1 day.
     - Blue Waters: 2 node failures per day.
     - Titan: MTBF < 1 day.
     - ≈ 20% of the computation is wasted in recovery and re-execution (which implies wasted energy).
     Exascale:
     - The number of components for both memory and processors will increase by a factor of 100.
     - Shrinking circuit sizes and running at lower voltages increase the probability of silent data corruption (SDC).
     - At exascale, failures will occur more frequently; an optimistic MTBF is a couple of hours.

  7. Context and motivations. Motivation: Energy
     Power draw under different loads:
     - The power draw of the interconnect on Blue Gene/Q appears to be independent of load.
     - CPU power draw varies by only some 20%.
     - DRAM power draw changes by a factor of 2 or more.
     Exascale (http://www.scidacreview.org/1001/html/hardware.html):
     - Data movement and I/O will consume more than 70% of the total system power (most of the 20 MW will go just to power the 10 PB of total system memory).
     Flops/Watt vs. Communication/Watt:
     - Avoid checkpointing and data movement, and do more recomputation, vs.
     - avoid recomputation by checkpointing more often.

  9. Context and motivations. Related Work
     - ECOFIT, Diouri et al.: blocking checkpointing and message logging. Conclusion: no big tradeoff observed.
     - Meneses et al.: parallel recovery vs. global recovery. Used the RAPL API (communication and I/O are not covered). Parallel recovery is better since it reduces the overall time.
     - Aupy et al.: blocking vs. non-blocking single-level checkpointing. No experiments.
     The missing episode: what about multilevel checkpointing?

  10. Problem formulation and notations (outline)
     1. Context and motivations
     2. Problem formulation and notations: Multilevel checkpointing; Energy model; Multiobjective optimization
     3. Simulation and experimentations: Experimentations; Tradeoff analysis
     4. Conclusion and future work

  11. Problem formulation and notations. Multilevel checkpointing
     - Multiple levels of storage (DRAM, NVM, PFS), coupled with data replication and erasure codes.
     - Low levels offer high performance but only partial reliability.
     - High levels offer high reliability but impose a large overhead.
     - Different checkpoint levels have different frequencies.
     - After a failure, the application restarts from the lowest available level.
     - If it is unable to recover, it tries the next checkpoint level (further in the past).
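
The restart rule on this slide (lowest available level first, then fall back to higher levels) can be sketched as follows. This is only an illustration: `recover` and `restore_from_level` are hypothetical names chosen here, not the API of FTI or of any other checkpointing library, and each level is modeled as a callable that returns `None` when its checkpoint did not survive the failure.

```python
from typing import Callable, Optional, Sequence, Tuple


def recover(
    restore_from_level: Sequence[Callable[[], Optional[dict]]],
) -> Tuple[int, dict]:
    """Try each checkpoint level in order, lowest (cheapest) level first.

    Each callable is a hypothetical stand-in for "read the level-i
    checkpoint"; it returns the restored state, or None if that level's
    checkpoint was lost in the failure.
    """
    for level, restore in enumerate(restore_from_level, start=1):
        state = restore()
        if state is not None:
            # Lowest level that still holds a usable checkpoint.
            return level, state
    raise RuntimeError("no checkpoint level could be recovered")


# Level 1 was lost with the node, so recovery falls back to level 2,
# whose checkpoint is older than the lost level-1 checkpoint would have been.
level, state = recover([lambda: None, lambda: {"step": 40}, lambda: {"step": 10}])
```

Falling back to a higher level costs more rework because higher levels are checkpointed less often; the wasted-time model on the following slides captures this cost.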

  12. Problem formulation and notations. Multilevel checkpointing: Wasted time model
     - $L$ levels of checkpoint (4 with FTI).
     - Checkpoint strategy: intervals $\tau_i$, for $i = 1, \dots, L$.
     - $c_i$: checkpoint cost for level $i$.
     - $r_i$: time for a restart from level $i$.
     - $d_i$: downtime after a failure affecting level $i$.
     - $\mu_i$: rate of failures affecting level $i$.

  13. Problem formulation and notations. Energy model: Wasted energy model
     - $P^c_i$: power for a level-$i$ checkpoint (watts).
     - $P^r_i$: power for a restart from level $i$ (watts).
     - $P^a$: power for a failure-free computation without checkpointing (watts).
     - $\mu_i$: rate of failures affecting level $i$.

  16. Problem formulation and notations. Multiobjective optimization: Problem solving
     Checkpoint time:
     $$W_{ch} = \sum_{i=1}^{L} \left( \frac{c_i}{\tau_i} + \mu_i \tau_i \sum_{j=1}^{i-1} \frac{c_j}{2\tau_j} \right)$$
     Rework time:
     $$W_{rew} = \sum_{i=1}^{L} \frac{\mu_i \tau_i}{2}$$
     Downtime and restart time:
     $$W_{down} = \sum_{i=1}^{L} \mu_i (r_i + d_i)$$
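
A minimal sketch of how the three wasted-time terms could be evaluated numerically. The function name `wasted_time_terms` and the sample values are illustrative assumptions, not taken from the slides. Lists are indexed from 0, so entry `i` corresponds to level $i+1$.

```python
from typing import Sequence, Tuple


def wasted_time_terms(
    tau: Sequence[float],  # checkpoint interval per level
    c: Sequence[float],    # checkpoint cost per level
    mu: Sequence[float],   # failure rate per level
    r: Sequence[float],    # restart time per level
    d: Sequence[float],    # downtime per level
) -> Tuple[float, float, float]:
    """Return (W_ch, W_rew, W_down) as fractions of execution time."""
    w_ch = w_rew = w_down = 0.0
    for i in range(len(tau)):
        # Sum over the lower levels j < i of c_j / tau_j.
        lower = sum(c[j] / tau[j] for j in range(i))
        w_ch += c[i] / tau[i] + mu[i] * tau[i] / 2.0 * lower
        w_rew += mu[i] * tau[i] / 2.0
        w_down += mu[i] * (r[i] + d[i])
    return w_ch, w_rew, w_down


# Made-up two-level configuration (all times in the same unit).
w_ch, w_rew, w_down = wasted_time_terms(
    tau=[10.0, 100.0], c=[1.0, 5.0], mu=[0.01, 0.001],
    r=[1.0, 5.0], d=[0.5, 2.0],
)
```

Each term is dimensionless because $\mu_i$ is a rate and every other quantity is a time, so the sum reads as the fraction of run time lost to fault tolerance.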

  19. Problem formulation and notations. Multiobjective optimization: Problem solving
     Checkpoint wasted energy:
     $$E_{ch} = \sum_{i=1}^{L} \left( \frac{P^c_i c_i}{\tau_i} + \mu_i \tau_i \sum_{j=1}^{i-1} \frac{P^c_j c_j}{2\tau_j} \right)$$
     Rework wasted energy:
     $$E_{rew} = \sum_{i=1}^{L} \frac{P^a \mu_i \tau_i}{2}$$
     Downtime and restart wasted energy:
     $$E_{down} = \sum_{i=1}^{L} P^r_i \mu_i (r_i + d_i)$$
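
The energy terms have the same shape as the time terms, with each contribution weighted by the power drawn in that phase. A hedged sketch follows; the function name `wasted_energy_terms` and the sample values are assumptions made for illustration only.

```python
from typing import Sequence, Tuple


def wasted_energy_terms(
    tau: Sequence[float],  # checkpoint interval per level
    c: Sequence[float],    # checkpoint cost per level
    mu: Sequence[float],   # failure rate per level
    r: Sequence[float],    # restart time per level
    d: Sequence[float],    # downtime per level
    p_c: Sequence[float],  # power during a level-i checkpoint (watts)
    p_r: Sequence[float],  # power during a level-i restart (watts)
    p_a: float,            # power of failure-free computation (watts)
) -> Tuple[float, float, float]:
    """Return (E_ch, E_rew, E_down), in watts (energy per unit of time)."""
    e_ch = e_rew = e_down = 0.0
    for i in range(len(tau)):
        # Power-weighted sum over the lower levels j < i.
        lower = sum(p_c[j] * c[j] / tau[j] for j in range(i))
        e_ch += p_c[i] * c[i] / tau[i] + mu[i] * tau[i] / 2.0 * lower
        e_rew += p_a * mu[i] * tau[i] / 2.0
        e_down += p_r[i] * mu[i] * (r[i] + d[i])
    return e_ch, e_rew, e_down


# Made-up two-level configuration with every power set to 2 W.
e_ch, e_rew, e_down = wasted_energy_terms(
    tau=[10.0, 100.0], c=[1.0, 5.0], mu=[0.01, 0.001],
    r=[1.0, 5.0], d=[0.5, 2.0],
    p_c=[2.0, 2.0], p_r=[2.0, 2.0], p_a=2.0,
)
```

As in the slide's formula, the downtime is charged at the restart power $P^r_i$.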

  21. Problem formulation and notations. Multiobjective optimization
     Total wasted time:
     $$W = \sum_{i=1}^{L} \left[ \frac{c_i}{\tau_i} + \mu_i \tau_i \left( \frac{1}{2} + \sum_{j=1}^{i-1} \frac{c_j}{2\tau_j} \right) + \mu_i (r_i + d_i) \right] \tag{1}$$
     Total wasted energy:
     $$E = \sum_{i=1}^{L} \left[ \frac{P^c_i c_i}{\tau_i} + \mu_i \tau_i \left( \frac{P^a}{2} + \sum_{j=1}^{i-1} \frac{P^c_j c_j}{2\tau_j} \right) \right] + \sum_{i=1}^{L} P^r_i \mu_i (r_i + d_i) \tag{2}$$
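
A quick consistency check relating (1) and (2), under an assumption the slides do not make: all power draws coincide, $P^c_i = P^r_i = P^a = P$. The energy objective then collapses to a multiple of the time objective.

```latex
\begin{aligned}
E &= \sum_{i=1}^{L}\left[\frac{P\,c_i}{\tau_i}
     + \frac{\mu_i\tau_i}{2}\left(P + \sum_{j=1}^{i-1}\frac{P\,c_j}{\tau_j}\right)
     + P\,\mu_i\,(r_i+d_i)\right] \\
  &= P\sum_{i=1}^{L}\left[\frac{c_i}{\tau_i}
     + \frac{\mu_i\tau_i}{2}\left(1 + \sum_{j=1}^{i-1}\frac{c_j}{\tau_j}\right)
     + \mu_i\,(r_i+d_i)\right]
   = P\,W .
\end{aligned}
```

In that special case both objectives share the same minimizers, so any tradeoff between run time and energy has to come from differences among $P^c_i$, $P^r_i$, and $P^a$.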

  23. Problem formulation and notations. Multiobjective optimization
     First derivatives:
     $$\frac{\partial W}{\partial \tau_i} = \frac{\mu_i}{2} \left( 1 + \sum_{j=1}^{i-1} \frac{c_j}{\tau_j} \right) - \frac{c_i}{\tau_i^2} \left( 1 + \sum_{j=i+1}^{L} \frac{\mu_j \tau_j}{2} \right) \tag{3}$$
     $$\frac{\partial E}{\partial \tau_i} = \frac{\mu_i}{2} \left( P^a + \sum_{j=1}^{i-1} \frac{P^c_j c_j}{\tau_j} \right) - \frac{P^c_i c_i}{\tau_i^2} \left( 1 + \sum_{j=i+1}^{L} \frac{\mu_j \tau_j}{2} \right) \tag{4}$$
     Solutions:
     $$\tau^W_i = \sqrt{ \frac{ c_i \left( 2 + \sum_{j=i+1}^{L} \mu_j \tau^W_j \right) }{ \mu_i \left( 1 + \sum_{j=1}^{i-1} c_j / \tau^W_j \right) } }$$
     $$\tau^E_i = \sqrt{ \frac{ \rho_i c_i \left( 2 + \sum_{j=i+1}^{L} \mu_j \tau^E_j \right) }{ \mu_i \left( 1 + \sum_{j=1}^{i-1} \rho_j c_j / \tau^E_j \right) } }, \qquad \rho_i = P^c_i / P^a$$
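
The two solutions are implicit: each $\tau_i$ appears in the formulas for the other levels. One way to evaluate them numerically is a plain fixed-point iteration, sketched below. The slides do not say which solver the authors used, so the iteration scheme, the function name `optimal_intervals`, the starting point, and the tolerances are all assumptions of this sketch. Passing `rho=None` (all $\rho_i = 1$) yields $\tau^W$; passing $\rho_i = P^c_i / P^a$ yields $\tau^E$.

```python
import math
from typing import List, Optional, Sequence


def optimal_intervals(
    c: Sequence[float],
    mu: Sequence[float],
    rho: Optional[Sequence[float]] = None,
    tol: float = 1e-10,
    max_iter: int = 1000,
) -> List[float]:
    """Iterate the implicit solution formula until the intervals stop moving."""
    n = len(c)
    rho = [1.0] * n if rho is None else list(rho)
    # Starting point: the formula with both sums dropped.
    tau = [math.sqrt(2.0 * rho[i] * c[i] / mu[i]) for i in range(n)]
    for _ in range(max_iter):
        new = []
        for i in range(n):
            # Higher levels j > i enter through mu_j * tau_j.
            upper = sum(mu[j] * tau[j] for j in range(i + 1, n))
            # Lower levels j < i enter through rho_j * c_j / tau_j.
            lower = sum(rho[j] * c[j] / tau[j] for j in range(i))
            new.append(math.sqrt(
                rho[i] * c[i] * (2.0 + upper) / (mu[i] * (1.0 + lower))
            ))
        if max(abs(a - b) for a, b in zip(new, tau)) < tol:
            return new
        tau = new
    raise RuntimeError("fixed-point iteration did not converge")


# Made-up two-level example: time-optimal vs. energy-optimal intervals
# when checkpointing draws half the power of computation.
tau_w = optimal_intervals(c=[1.0, 5.0], mu=[0.01, 0.001])
tau_e = optimal_intervals(c=[1.0, 5.0], mu=[0.01, 0.001], rho=[0.5, 0.5])
```

With a single level the formula reduces to $\tau = \sqrt{2 \rho c / \mu}$, so $\tau^E = \sqrt{\rho}\,\tau^W$: when checkpointing draws less power than computation ($\rho < 1$), the energy-optimal interval is shorter than the time-optimal one.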
