
Introduction Model DP Algo Experiments Conclusion
Combining Checkpointing and Replication for Reliable Execution of Linear Workflows
Anne Benoit (1, 2), Aurélien Cavelan (3), Florina M. Ciorba (3), Valentin Le Fèvre (1), Yves Robert (1, 4)
1. LIP, École Normale Supérieure de Lyon, France
2. Georgia Institute of Technology, Atlanta, GA, USA
3. University of Basel, Switzerland
4. University of Tennessee, Knoxville, TN, USA
http://graal.ens-lyon.fr/~abenoit/
APDCM workshop @ IPDPS'18, Vancouver, May 21, 2018
APDCM'18 Anne.Benoit@ens-lyon.fr Combining checkpointing and replication

Linear workflows
• High-performance computing (HPC) application: chain of tasks $T_1 \to T_2 \to \cdots \to T_n$
• Parallel tasks executed on the whole platform
• For instance: tightly-coupled computational kernels, image processing applications, ...
• Goal: efficient execution, i.e., minimize total execution time

Reliable execution
Hierarchical
• $10^5$ or $10^6$ nodes
• Each node equipped with $10^4$ or $10^3$ cores
Failure-prone
MTBF, one node                    1 year    10 years    120 years
MTBF, platform of $10^6$ nodes    30 sec    5 mn        1 h
More nodes ⇒ shorter MTBF (Mean Time Between Failures)
Need to ensure that the execution will be reliable, i.e., without failures
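The platform figures in the MTBF table can be checked with a few lines of arithmetic. The sketch below assumes independent, identically distributed node failures, so that the platform MTBF is the node MTBF divided by the number of nodes; the function name and the 365-day year are illustrative choices, not taken from the slides.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumption: 365-day year


def platform_mtbf_seconds(node_mtbf_years, num_nodes):
    """Platform MTBF in seconds, assuming independent node failures."""
    return node_mtbf_years * SECONDS_PER_YEAR / num_nodes


if __name__ == "__main__":
    for years in (1, 10, 120):
        seconds = platform_mtbf_seconds(years, 10**6)
        print(f"node MTBF {years:>3} years -> platform MTBF {seconds:8.1f} s")
```

With $10^6$ nodes this gives roughly 32 s, 5.3 min, and 1.05 h, in line with the rounded values on the slide.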

Coping with fail-stop errors with checkpoints
Checkpoint, rollback, and recovery:
[Figure: three timelines of the chain. No error: T1 C1 T2 T3 C3 T4 C4. Fail-stop error: the same chain interrupted by an error. Fail-stop error with recovery: T1 C1 T2 T3 R2 T2 T3 C3 ...]
• Coordinated checkpointing (the platform is a giant macro-processor)
• Assume instantaneous interruption and detection
• Rollback to last checkpoint and re-execute
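The rollback rule can be replayed on the example chain of this slide. The following is a minimal sketch, not the authors' code: `execution_trace` and its parameters are hypothetical names, each listed task is assumed to fail exactly once, and downtime is not shown in the trace.

```python
def execution_trace(n, checkpointed, fails_once):
    """Replay a chain T1..Tn with checkpoint/rollback/recovery.

    checkpointed: set of task indices whose output is checkpointed.
    fails_once:   set of task indices whose first attempt is hit by a
                  fail-stop error (hypothetical input for illustration).
    The virtual task T0 counts as checkpointed, so a failure before the
    first checkpoint restarts from T1 with recovery R1.
    """
    pending = set(fails_once)
    trace = []
    i = 1
    while i <= n:
        trace.append(f"T{i}")
        if i in pending:
            pending.discard(i)
            # Roll back to the last checkpointed task before T_i.
            last = max((j for j in checkpointed if j < i), default=0)
            trace.append(f"R{last + 1}")
            i = last + 1
            continue
        if i in checkpointed:
            trace.append(f"C{i}")
        i += 1
    return " ".join(trace)


if __name__ == "__main__":
    # Checkpoints after T1, T3, T4; an error strikes the first run of T3.
    print(execution_trace(4, {1, 3, 4}, {3}))
```

With checkpoints after T1, T3, and T4 and an error during T3, the trace reproduces the third timeline: recovery R2 followed by re-execution of T2 and T3.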

Coping with fail-stop errors with replication
[Figure: three timelines of a five-task chain. T1 and T4 are replicated, each copy running on p/2 processors; T2, T3, and T5 run on p processors; checkpoints C1, C3, and C5 are taken. The second and third timelines each show a fail-stop error.]
• The whole platform is used at all times; some tasks are replicated
• If a failure hits a replicated task, no need to roll back
• Otherwise, roll back to the last checkpoint and re-execute

Contributions
• Both checkpointing and replication have been extensively studied
• Combination of both techniques not yet investigated
• Detailed model
• Optimal dynamic programming algorithm
• Experiments to evaluate impact of using both replication and checkpointing during execution
• Guidelines about when to checkpoint only, replicate only, or combine both techniques

Outline
1. Model and objective
2. Optimal dynamic programming algorithm
3. Experiments
4. Conclusion

Application and platform model
Application:
• Chain $T_1 \to T_2 \to \cdots \to T_n$
• Parallel tasks: the (failure-free) execution time of $T_i$ using $q_i$ processors is $w_i \left( \alpha_i + \frac{1 - \alpha_i}{q_i} \right)$ (Amdahl's law)
Platform:
• Homogeneous platform with $p$ processors $P_i$, $1 \le i \le p$
• Fail-stop errors, exponential distribution, error rate $\lambda_{\mathrm{ind}}$
• $P(X \le T) = 1 - e^{-q \lambda_{\mathrm{ind}} T}$ on $q$ processors
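The two model formulas on this slide translate directly into code. This is a minimal sketch with illustrative function names and parameter values (none of them come from the slides): Amdahl's law for the failure-free time, and the exponential law for the probability of at least one failure on $q$ processors.

```python
import math


def amdahl_time(w, alpha, q):
    """Failure-free time of a task with sequential fraction alpha on q processors."""
    return w * (alpha + (1.0 - alpha) / q)


def failure_probability(q, lam_ind, duration):
    """P(X <= duration) = 1 - exp(-q * lam_ind * duration) on q processors."""
    return -math.expm1(-q * lam_ind * duration)


if __name__ == "__main__":
    # Illustrative values only: w = 100 s, alpha = 0.1, q = 10 processors.
    t = amdahl_time(100.0, 0.1, 10)
    print(f"execution time: {t:.2f} s")
    print(f"failure probability: {failure_probability(10, 1e-4, t):.4f}")
```

Using `math.expm1` keeps the probability accurate when $q \lambda_{\mathrm{ind}} T$ is very small, which is the usual regime for a single task.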

Checkpointing
• Checkpointing time: $C_i(q_i) = a_i + \frac{b_i}{q_i} + c_i q_i$
  $a_i + \frac{b_i}{q_i}$: communication time with latency $a_i$
  $c_i q_i$: message passing overhead
• Downtime $D$
• Recovery cost $R_{j+1}$ (where $T_j$ is the last checkpointed task)
• $R_{i+1}(q_i) = C_i(q_i)$ for $1 \le i \le n-1$: recovering for $T_{i+1}$ ≈ reading $C_i$
• $T_0$ with $w_0 = 0$ checkpointed (input time $R_1(q_1)$)
• $T_n$ always checkpointed (output time $C_n(q_n)$)
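The checkpoint cost model can be sketched as a small function. The sketch assumes the second term of the cost is $b_i / q_i$ (the fraction is flattened in the extracted slide text), and the coefficient values are purely illustrative; recovery reuses the checkpoint cost, following $R_{i+1}(q_i) = C_i(q_i)$.

```python
def checkpoint_cost(a, b, c, q):
    """C(q) = a + b/q + c*q: latency, transfer spread over q processors,
    and message-passing overhead growing with q (assumed form)."""
    return a + b / q + c * q


def recovery_cost(a, b, c, q):
    """Recovery for the next task is modelled as re-reading the checkpoint."""
    return checkpoint_cost(a, b, c, q)


if __name__ == "__main__":
    # Illustrative coefficients only, not taken from the slides.
    for q in (10, 100, 1000):
        print(q, checkpoint_cost(1.0, 100.0, 0.01, q))
```

Under this form the cost is not monotone in $q$: the $b/q$ term shrinks while the $c\,q$ term grows, so very large processor counts can make checkpoints more expensive.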

No replication
• $T_i$ not replicated: costs $C_i^{\mathrm{norep}}$ and $R_i^{\mathrm{norep}}$
• Failure-free execution time: $T_i^{\mathrm{norep}} = w_i \left( \alpha_i + \frac{1 - \alpha_i}{p} \right)$
• Expected execution time $E^{\mathrm{norep}}(i)$:
  $$E^{\mathrm{norep}}(i) = P(X_p \le T_i^{\mathrm{norep}}) \left( T_{\mathrm{lost}}^{\mathrm{norep}}(T_i^{\mathrm{norep}}) + D + R_i^{\mathrm{norep}} + E^{\mathrm{norep}}(i) \right) + \left( 1 - P(X_p \le T_i^{\mathrm{norep}}) \right) T_i^{\mathrm{norep}}$$
• $P(X_p \le t) = 1 - e^{-\lambda_{\mathrm{ind}} p t}$: probability of failure on one of the $p$ processors before time $t$
• $T_{\mathrm{lost}}^{\mathrm{norep}}(T_i^{\mathrm{norep}}) = \frac{1}{\lambda_{\mathrm{ind}} p} - \frac{T_i^{\mathrm{norep}}}{e^{\lambda_{\mathrm{ind}} p T_i^{\mathrm{norep}}} - 1}$
• $E^{\mathrm{norep}}(i) = \left( e^{\lambda_{\mathrm{ind}} p T_i^{\mathrm{norep}}} - 1 \right) \left( \frac{1}{\lambda_{\mathrm{ind}} p} + D + R_i^{\mathrm{norep}} \right)$
• If $T_i$ is checkpointed, add $C_i^{\mathrm{norep}}$
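The closed form for $E^{\mathrm{norep}}(i)$ is the solution of the recursive equation on this slide, and that can be checked numerically. The sketch below uses hypothetical function names and illustrative parameter values: it evaluates the closed form, then plugs the result back into the right-hand side of the recursion.

```python
import math


def time_lost(T, p, lam_ind):
    """Expected time lost, given that a failure strikes before T."""
    lam = lam_ind * p
    return 1.0 / lam - T / math.expm1(lam * T)


def expected_time_norep(T, p, lam_ind, D, R):
    """Closed form: (e^{lam T} - 1) * (1/lam + D + R), with lam = lam_ind * p."""
    lam = lam_ind * p
    return math.expm1(lam * T) * (1.0 / lam + D + R)


def recursion_rhs(E, T, p, lam_ind, D, R):
    """Right-hand side of the recursive definition, evaluated at E."""
    lam = lam_ind * p
    p_fail = -math.expm1(-lam * T)  # P(X_p <= T)
    return p_fail * (time_lost(T, p, lam_ind) + D + R + E) + (1.0 - p_fail) * T


if __name__ == "__main__":
    # Illustrative values only: T = 100 s on p = 1000 processors.
    args = dict(T=100.0, p=1000, lam_ind=1e-6, D=10.0, R=5.0)
    E = expected_time_norep(**args)
    print(f"closed form: {E:.4f} s, recursion: {recursion_rhs(E, **args):.4f} s")
```

The two values agree, and the expected time exceeds the failure-free time $T_i^{\mathrm{norep}}$, as it should once failures, downtime, and recovery are accounted for.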
