Checkpointing strategies for parallel jobs Marin Bougeret , Henri - PowerPoint PPT Presentation

Checkpointing strategies for parallel jobs
Marin Bougeret, Henri Casanova, Mikaël Rabie, Yves Robert, and Frédéric Vivien
ENS Lyon & INRIA, France; University of Hawai‘i at Mānoa, USA; University of Montpellier, France

Motivation
Framework
• Very, very large number of processing elements (e.g., $2^{20}$)
• Failure-prone platform (like any realistic platform)
• Large application to be executed on the whole platform ⇒ failure(s) will certainly occur before completion!
• Resilience provided through coordinated checkpointing
Question
When should we checkpoint the application?

State of the art
One knows that applications should be checkpointed periodically. Is this optimal?
Several proposed values for the period:
• Young: $\sqrt{2 \times C \times \mathrm{MTBF}}$ (1st order approximation)
• Daly (1): $\sqrt{2 \times C \times (R + \mathrm{MTBF})}$ (1st order approximation)
• Daly (2): $\eta \times \mathrm{MTBF} - C$, where $\eta = \xi^2 + 1 + L(-e^{-(2\xi^2 + 1)})$, $\xi = \sqrt{\frac{C}{2 \times \mathrm{MTBF}}}$, and $L(z)\,e^{L(z)} = z$ (higher order approximation)
How good are these approximations? Could we find the optimal value? At least for Exponential failures? And for Weibull failures?
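The three candidate periods can be compared numerically. The following is a minimal Python sketch that evaluates the formulas as transcribed on this slide. The function names, the bisection-based evaluation of L (the Lambert function, principal branch) and the sample values are illustrative assumptions, not part of the original talk.

```python
import math

def lambert_w0(z: float) -> float:
    """Solve L * exp(L) = z for -1/e <= z <= 0 (principal branch).

    On [-1, 0] the map w -> w * exp(w) is increasing, so plain bisection works.
    """
    if z < -1.0 / math.e - 1e-12 or z > 0.0:
        raise ValueError("this sketch only handles -1/e <= z <= 0")
    lo, hi = -1.0, 0.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mid * math.exp(mid) < z:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def young_period(C: float, mtbf: float) -> float:
    """Young: sqrt(2 * C * MTBF)."""
    return math.sqrt(2.0 * C * mtbf)

def daly_first_order(C: float, R: float, mtbf: float) -> float:
    """Daly (1): sqrt(2 * C * (R + MTBF))."""
    return math.sqrt(2.0 * C * (R + mtbf))

def daly_higher_order(C: float, mtbf: float) -> float:
    """Daly (2): eta * MTBF - C, with eta and xi as defined on the slide."""
    xi = math.sqrt(C / (2.0 * mtbf))
    eta = xi ** 2 + 1.0 + lambert_w0(-math.exp(-(2.0 * xi ** 2 + 1.0)))
    return eta * mtbf - C

if __name__ == "__main__":
    # Hypothetical values: 10-minute checkpoint and recovery, one-day MTBF (seconds).
    C, R, mtbf = 600.0, 600.0, 86400.0
    print(young_period(C, mtbf), daly_first_order(C, R, mtbf), daly_higher_order(C, mtbf))
```

All quantities must be expressed in the same time unit. SciPy users could replace the bisection with `scipy.special.lambertw`, which returns a complex value whose real part is the quantity needed here.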

Outline
1. Single-processor jobs
   Solving Makespan
   Solving NextFailure
2. Parallel jobs
   Solving Makespan
   Solving NextFailure
3. Experiments
   Simulation framework
   Sequential jobs under synthetic failures
   Parallel jobs under synthetic failures
   Parallel jobs under trace-based failures
4. Conclusion

Hypotheses
• Overall size of work: W
• Checkpoint cost: C (e.g., write on disk the contents of each processor memory)
• Downtime: D (hardware replacement by spare, or software rejuvenation via rebooting)
• Recovery cost after failure: R
• Homogeneous platform (same computation speeds, iid failure distributions)
• The history of failures has no impact; only the time elapsed since the last failure does
• A failure can happen during a checkpoint or a recovery, but not during a downtime (otherwise replace D by 0 and R by R + D)

Problem statement
Makespan: minimize the job’s expected makespan, that is, the expectation E of the time T needed to process a work of size W, knowing that the (single) processor failed τ units of time ago.
Notation: minimize $E(T(W \mid \tau))$
$\omega_1(W \mid \tau)$: amount of work we attempt to do before taking the first checkpoint

Recursive approach
$E(T(W \mid \tau)) = P_{suc}(\omega_1 + C \mid \tau) \times \big(\omega_1 + C + E(T(W - \omega_1 \mid \tau + \omega_1 + C))\big)$, plus the failure case weighted by $1 - P_{suc}(\omega_1 + C \mid \tau)$
• $P_{suc}(\omega_1 + C \mid \tau)$: probability of success
• $\omega_1 + C$: time needed to compute the 1st chunk
• $E(T(W - \omega_1 \mid \tau + \omega_1 + C))$: time needed to compute the remainder

Failures following an exponential distribution
Theorem
The optimal strategy splits W into K* same-size chunks, where K* = max(1, ⌊K0⌋) or K* = ⌈K0⌉ (whichever leads to the smaller value), with
$$K_0 = \frac{\lambda W}{1 + L(-e^{-\lambda C - 1})} \quad \text{and} \quad L(z)\,e^{L(z)} = z.$$
The optimal expectation of the makespan is
$$K^* \left( e^{\lambda R} \left( \frac{1}{\lambda} + D \right) \right) \left( e^{\lambda \left( \frac{W}{K^*} + C \right)} - 1 \right).$$
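The theorem translates directly into a few lines of code. Below is a minimal Python sketch that computes K0, picks K* among the two candidates, and evaluates the expected makespan. The function names, the bisection used for L, and the sample parameters are assumptions made for illustration. It assumes C > 0 so that the denominator of K0 is positive.

```python
import math

def lambert_w0(z: float) -> float:
    """Solve L * exp(L) = z for -1/e <= z <= 0 by bisection (principal branch)."""
    if z < -1.0 / math.e - 1e-12 or z > 0.0:
        raise ValueError("this sketch only handles -1/e <= z <= 0")
    lo, hi = -1.0, 0.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mid * math.exp(mid) < z:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def expected_makespan(K: int, W: float, C: float, R: float, D: float, lam: float) -> float:
    """Expected makespan with K equal chunks, following the theorem's expression."""
    return K * math.exp(lam * R) * (1.0 / lam + D) * (math.exp(lam * (W / K + C)) - 1.0)

def optimal_chunks(W: float, C: float, R: float, D: float, lam: float) -> int:
    """K* = max(1, floor(K0)) or ceil(K0), whichever gives the smaller expectation."""
    k0 = lam * W / (1.0 + lambert_w0(-math.exp(-lam * C - 1.0)))
    candidates = {max(1, math.floor(k0)), math.ceil(k0)}
    return min(candidates, key=lambda k: expected_makespan(k, W, C, R, D, lam))

if __name__ == "__main__":
    # Hypothetical parameters, all in the same time unit; lam is the failure rate.
    W, C, R, D, lam = 1000.0, 10.0, 10.0, 1.0, 0.001
    k_star = optimal_chunks(W, C, R, D, lam)
    print(k_star, expected_makespan(k_star, W, C, R, D, lam))
```

Only two integer candidates need to be compared because the expected makespan, seen as a function of a real-valued K, is convex with its minimum at K0.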

Arbitrary failure distributions
$$E(T(W \mid \tau)) = \min_{0 < \omega_1 \le W} \Big[ P_{suc}(\omega_1 + C \mid \tau)\,\big(\omega_1 + C + E(T(W - \omega_1 \mid \tau + \omega_1 + C))\big) + \big(1 - P_{suc}(\omega_1 + C \mid \tau)\big) \times \big(E(T_{lost}(\omega_1 + C \mid \tau)) + E(T_{rec}) + E(T(W \mid R))\big) \Big]$$
Solve via dynamic programming
• Time quantum u: all chunk sizes ω_i are integer multiples of u
• Trade-off: accuracy versus higher computing time

Dynamic programming
Algorithm 1: DPMakespan(x, b, y, τ0)
  if x = 0 then return 0
  if solution[x][b][y] = unknown then
    best ← ∞; τ ← b·τ0 + y·u
    for i = 1 to x do
      exp_succ ← first(DPMakespan(x − i, b, y + i + C/u, τ0))
      exp_fail ← first(DPMakespan(x, 0, R/u, τ0))
      cur ← P_suc(iu + C | τ) × (iu + C + exp_succ)
            + (1 − P_suc(iu + C | τ)) × (E(T_lost(iu + C | τ)) + E(T_rec) + exp_fail)
      if cur < best then best ← cur; chunksize ← i
    solution[x][b][y] ← (best, chunksize)
  return solution[x][b][y]
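To make the recursion concrete, here is a minimal Python sketch of the same dynamic program restricted to Exponential failures. This is a simplification, not Algorithm 1 itself: because the Exponential law is memoryless, the elapsed time τ drops out, the state reduces to the remaining work x, and exp_fail equals the value being computed, so each candidate can be solved in closed form. The expressions used for E(T_lost) and E(T_rec) are the ones implied by the hypotheses for this law. Function and variable names are illustrative, and the sketch assumes C > 0 and R > 0.

```python
import math

def dp_makespan_exponential(W, C, R, D, lam, u):
    """Return (expected makespan, chunk sizes) when chunks are multiples of u."""
    n = round(W / u)  # number of time quanta of work

    def p_suc(x):
        # Probability of no failure during x time units (memoryless).
        return math.exp(-lam * x)

    def e_lost(x):
        # Expected time until the failure, given a failure within x time units.
        return 1.0 / lam - x / math.expm1(lam * x)

    # Downtime D, then a recovery R that may itself fail and be retried.
    e_rec = D + R + (1.0 - p_suc(R)) / p_suc(R) * (D + e_lost(R))

    best = [0.0] + [math.inf] * n
    choice = [0] * (n + 1)
    for x in range(1, n + 1):
        for i in range(1, x + 1):
            a = i * u + C
            p = p_suc(a)
            # Solving cur = p*(a + best[x-i]) + (1-p)*(e_lost(a) + e_rec + cur) for cur:
            cur = a + best[x - i] + (1.0 - p) / p * (e_lost(a) + e_rec)
            if cur < best[x]:
                best[x], choice[x] = cur, i

    # Walk back through the recorded choices to list the chunk sizes.
    chunks, x = [], n
    while x > 0:
        chunks.append(choice[x] * u)
        x -= choice[x]
    return best[n], chunks

if __name__ == "__main__":
    # Hypothetical parameters, all in the same time unit.
    value, chunks = dp_makespan_exponential(1000.0, 10.0, 10.0, 1.0, 0.001, 5.0)
    print(value, chunks)
```

A smaller quantum u brings the result closer to the closed-form optimum for Exponential failures, at the price of a quadratic increase in running time, which illustrates the accuracy versus computing time trade-off.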

Problem statement NextFailure Maximize expected amount of work completed before next failure Optimization on a “failure-by-failure” basis Hopefully a good approximation, at least for large job sizes W

Approach
$$E(W(\omega \mid \tau)) = P_{suc}(\omega_1 + C \mid \tau)\,\big(\omega_1 + E(W(\omega - \omega_1 \mid \tau + \omega_1 + C))\big)$$
Proposition
$$E(W(W \mid 0)) = \sum_{i=1}^{K} \omega_i \times \prod_{j=1}^{i} P_{suc}(\omega_j + C \mid t_j)$$
where $t_j = \sum_{\ell=1}^{j-1} (\omega_\ell + C)$ is the total time elapsed (without failure) before the execution of chunk $\omega_j$, and K is the (unknown) target number of chunks.
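The proposition can be evaluated with a single pass over the chunks. The following minimal Python sketch does so for any conditional success probability supplied as a function; the names `expected_work` and `weibull_p_suc` and the sample parameters are illustrative assumptions.

```python
import math
from typing import Callable, Sequence

def expected_work(chunks: Sequence[float], C: float,
                  p_suc: Callable[[float, float], float]) -> float:
    """Expected work completed before the next failure, starting right after a failure.

    p_suc(x, t) is the probability of surviving x more time units given that
    t time units have already elapsed without failure.
    """
    total = 0.0
    survive = 1.0   # running product of the P_suc factors
    elapsed = 0.0   # t_j: time elapsed before chunk j starts
    for w in chunks:
        survive *= p_suc(w + C, elapsed)
        total += w * survive
        elapsed += w + C
    return total

def weibull_p_suc(shape: float, scale: float) -> Callable[[float, float], float]:
    """Conditional survival probability for a Weibull failure law."""
    def p(x: float, t: float) -> float:
        return math.exp((t / scale) ** shape - ((t + x) / scale) ** shape)
    return p

if __name__ == "__main__":
    # Hypothetical values: three chunks, checkpoint cost 10, Weibull shape 0.7.
    print(expected_work([200.0, 300.0, 500.0], 10.0, weibull_p_suc(0.7, 1000.0)))
```

With shape 1 the Weibull law reduces to the Exponential one, in which case the elapsed time has no influence on the result.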

Solving through dynamic programming
Algorithm 2: DPNextFailure(x, n, τ0)
  if x = 0 then return 0
  if solution[x][n] = unknown then
    best ← −∞
    τ ← τ0 + (W − xu) + nC
    for i = 1 to x do
      work ← first(DPNextFailure(x − i, n + 1, τ0))
      cur ← P_suc(iu + C | τ) × (iu + work)
      if cur > best then best ← cur; chunksize ← i
    solution[x][n] ← (best, chunksize)
  return solution[x][n]
Since NextFailure maximizes the expected work, best starts at −∞ and is updated whenever cur exceeds it.
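A memoized version of this recursion fits in a few lines. The sketch below is one possible Python rendering under stated assumptions: the success probability is passed in as a function `p_suc(x, tau)`, W is taken to be a multiple of u, and the problem size stays small enough for Python's default recursion limit. All names are illustrative.

```python
import math
from functools import lru_cache

def dp_next_failure(W, C, u, p_suc, tau0=0.0):
    """Return (max expected work before next failure, chunk sizes in multiples of u)."""
    total = round(W / u)  # number of time quanta of work

    @lru_cache(maxsize=None)
    def solve(x, n):
        # x: remaining quanta of work; n: number of checkpoints already taken.
        if x == 0:
            return 0.0, 0
        tau = tau0 + (total - x) * u + n * C  # time elapsed since the last failure
        best, chunk = -math.inf, 0
        for i in range(1, x + 1):
            work, _ = solve(x - i, n + 1)
            cur = p_suc(i * u + C, tau) * (i * u + work)
            if cur > best:  # maximization
                best, chunk = cur, i
        return best, chunk

    # Follow the recorded choices to list the chunk sizes.
    chunks, x, n = [], total, 0
    while x > 0:
        _, i = solve(x, n)
        chunks.append(i * u)
        x -= i
        n += 1
    return solve(total, 0)[0], chunks

if __name__ == "__main__":
    # Hypothetical Weibull law (shape 0.7, scale 20), in the same time unit as W, C and u.
    def weibull(x, tau):
        return math.exp((tau / 20.0) ** 0.7 - ((tau + x) / 20.0) ** 0.7)
    print(dp_next_failure(30.0, 0.5, 1.0, weibull))
```

The table has on the order of (W/u)² states and each state scans up to W/u chunk sizes, so halving u multiplies the running time by roughly eight.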

Failures following an exponential distribution
Theorem
The optimal strategy splits W(p) into K*(p) same-size chunks, where K*(p) = max(1, ⌊K0(p)⌋) or K*(p) = ⌈K0(p)⌉ (whichever leads to the smaller value), with
$$K_0(p) = \frac{\lambda W(p)}{1 + L(-e^{-p \lambda C - 1})} \quad \text{and} \quad L(z)\,e^{L(z)} = z.$$
The optimal expectation of the makespan is
$$K^*(p) \left( \frac{1}{p \lambda} + E(T_{rec}(p)) \right) \left( e^{\lambda \left( \frac{W}{K^*(p)} + pC \right)} - 1 \right).$$
