practical foundations for resilient applications

Practical foundations for resilient applications, George Bosilca



  1. Practical foundations for resilient applications George Bosilca Algorithms and Scheduling Techniques to Manage Resilience and Power – Dagstuhl 2015

  2. Failures are bad for business …
     • In HPC: “Today, 20% or more of the computing capacity in a large high-performance computing system is wasted due to failures and recoveries” - Dr. M. Elnozahy et al., System Resilience at Extreme Scale, DARPA
     • Outside HPC: Dynamic execution environments (clouds) are not suitable for parallel application execution due to volatility.
     • Tomorrow: The U.S. Department of Energy identified 10 research challenges to Exascale. One of them is:
       • Resilience and correctness: Ensuring correct scientific computation in face of faults, reproducibility, and algorithm verification challenges.

  3. Fault Tolerance: many solutions
     • Rollback Recovery
       • Coordinated checkpoint
       • Legacy approach (with blocking, constant checkpoints)
       • Checkpoint/Restart based
       • Active research on introducing more asynchrony (uncoordinated checkpoint, message logging, correlated sets), increasing the MTBF (hardware) and decreasing the overheads (buddy checkpointing, NVRAM)
     • Forward Recovery
       • Replication (the only system-level Forward Recovery)
       • Master-Worker with simple resubmission
       • Iterative methods, naturally fault-tolerant algorithms
       • Algorithm Based Fault Tolerance (ABFT)
     [Figure: Master-Worker timeline with Master, Worker0, Worker1 and Worker2.]
     [Figure: ABFT factorization: part factorized in previous iterations, trailing matrix and protection blocks; factorize, then update by applying the same operations.]
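To make the ABFT idea on this slide concrete (protection blocks updated "by applying the same operations"), here is a minimal sketch, not the method used in the talk. It assumes a checksum row as the protection block and a right-multiplication as the linear update; the function names `add_checksum_row`, `right_multiply` and `recover_row` are hypothetical.

```python
def add_checksum_row(rows):
    """Return the data rows plus one protection row holding the column-wise sums."""
    checksum = [sum(col) for col in zip(*rows)]
    return rows + [checksum]


def right_multiply(rows, b):
    """Apply the same linear update (row times b) to every row, checksum included."""
    return [
        [sum(x * b[k][j] for k, x in enumerate(row)) for j in range(len(b[0]))]
        for row in rows
    ]


def recover_row(protected, lost):
    """Rebuild data row `lost` as checksum minus the sum of the surviving data rows."""
    *data, checksum = protected
    recovered = list(checksum)
    for i, row in enumerate(data):
        if i != lost:
            recovered = [r - x for r, x in zip(recovered, row)]
    return recovered


# Illustrative data (assumed, not from the slides).
a = [[1, 2], [3, 4], [5, 6]]
b = [[2, 0], [1, 1]]

protected = add_checksum_row(a)
updated = right_multiply(protected, b)   # checksum stays valid after the update

# Pretend the process holding row 1 failed: rebuild its row without any rollback.
print(recover_row(updated, 1))
print(updated[1])
```

Because the update is linear, the checksum of the updated rows equals the updated checksum, so a single lost row can be rebuilt from the survivors with no checkpoint involved.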

  4. Research Status Anatomy
                                 Rollback Recovery                   …   Forward Recovery
     Approach                    Checkpointing & Restart (C/R)       …   Algorithm Based Fault Tolerance (ABFT)
     Overhead                    Large                               …   Small
     Application specificity     None                                …   Significant

  5. Rollback recovery modeling
     • PurePeriodicCkpt (Young/Daly), optimal checkpoint interval: $P^{\mathrm{opt}}_{PC} = \sqrt{2C(\mu - D - R)}$
     • BiPeriodicCkpt, checkpoint intervals:
       • General phase: $P^{\mathrm{opt}}_{BPC,G} = \sqrt{2C(\mu - D - R)}$
       • Library phase: $P^{\mathrm{opt}}_{BPC,L} = \sqrt{2C_L(\mu - D - R)}$
     [Figure: timelines of Process 0, Process 1 and Process 2 alternating Application and Library phases under each protocol.]
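The optimal-interval formulas on this slide can be evaluated with a few lines of code. The sketch below assumes the usual Young/Daly reading of the symbols, which the slide does not spell out: C is the checkpoint cost, µ the platform MTBF, D the downtime, R the recovery time, and C_L the checkpoint cost during the library phase. The function name `optimal_period` and the sample values are illustrative only.

```python
import math


def optimal_period(c, mu, d=0.0, r=0.0):
    """Evaluate P_opt = sqrt(2 * C * (mu - D - R)), all arguments in seconds."""
    slack = mu - d - r
    if c <= 0 or slack <= 0:
        raise ValueError("need C > 0 and mu > D + R for the formula to apply")
    return math.sqrt(2.0 * c * slack)


# Assumed sample values: 1 day MTBF, 60 s checkpoint, 60 s recovery, no downtime.
MU, C, C_L, D, R = 86400.0, 60.0, 15.0, 0.0, 60.0

p_general = optimal_period(C, MU, D, R)     # PurePeriodicCkpt and BiPeriodicCkpt general phase
p_library = optimal_period(C_L, MU, D, R)   # BiPeriodicCkpt library phase, using C_L

print(f"general phase period: {p_general:.0f} s")
print(f"library phase period: {p_library:.0f} s")
```

With a cheaper library-phase checkpoint (C_L smaller than C), the formula yields a shorter interval, since the period grows with the square root of the checkpoint cost.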

  6. Rollback recovery modeling: Evolutionary platforms design
     Assumptions:
     • Memory/component remains constant
     • Problem size increases in O(√n)
     • MTBF of 10k nodes at 1 day, scaled in O(1/n)
     • Checkpoint (and restart) cost at 10k nodes is 1 minute, scaled in O(n)
     • 80% of each iteration is spent in the ABFT algorithm, modifying 80% of the data
     [Figure: number of faults (0 to 40) and waste (0 to 0.4) versus platform size (1k, 10k, 100k, 1M), for PeriodicCkpt and Bi-PeriodicCkpt.]
     Source: Assessing the impact of ABFT & Checkpoint composite strategies, George Bosilca, Aurelien Bouteiller, Thomas Herault, Yves Robert and Jack Dongarra, International Journal of Networking and Computing, ISSN 2185-2847

  7. Rollback recovery modeling: Too many checkpoints!
     [Figure: same plots and assumptions as slide 6.]

  8. Rollback recovery modeling: ReEvolutionary platforms design
     Assumptions:
     • Memory/component remains constant
     • Problem size increases in O(√n)
     • MTBF of 10k nodes at 1 day, scaled in O(1/n)
     • Checkpoint (and restart) cost at 10k nodes is 1 minute, scaled in O(1)
     • O(n^3) vs O(n^2) of each iteration is spent in the ABFT algorithm, modifying 80% of the data
     [Figure: number of faults (0 to 6) and waste (0 to 0.4) versus Nodes (1k, 10k, 100k, 1M; axis also labeled α = 0.55, 0.8, 0.92, 0.975), for PeriodicCkpt and Bi-PeriodicCkpt.]
     Source: Assessing the impact of ABFT & Checkpoint composite strategies, George Bosilca, Aurelien Bouteiller, Thomas Herault, Yves Robert and Jack Dongarra, International Journal of Networking and Computing, ISSN 2185-2847

  9. Rollback recovery modeling: Still too many checkpoints!
     [Figure: same plots and assumptions as slide 8.]
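The scaling assumptions of the preceding slides (MTBF of 1 day at 10k nodes scaled in O(1/n), checkpoint and restart cost of 1 minute at 10k nodes scaled in O(n) or O(1)) can be explored with a rough model. The sketch below is not the model of the cited paper: it assumes the standard first-order waste approximation for periodic checkpointing, waste ≈ C/P + (D + R + P/2)/µ, takes R equal to C and D equal to zero, and caps the result at 1 when the approximation no longer applies. All function names are hypothetical.

```python
import math

DAY = 86400.0          # seconds
REF_NODES = 10_000     # reference platform size from the slides
BASE_COST = 60.0       # checkpoint (and restart) cost at 10k nodes, in seconds


def mtbf(nodes):
    """Platform MTBF: 1 day at 10k nodes, scaled in O(1/n)."""
    return DAY * REF_NODES / nodes


def checkpoint_cost(nodes, scaling):
    """Checkpoint cost scaled in O(1) ("constant") or O(n) ("linear")."""
    if scaling == "constant":
        return BASE_COST
    if scaling == "linear":
        return BASE_COST * nodes / REF_NODES
    raise ValueError("scaling must be 'constant' or 'linear'")


def waste(nodes, scaling="constant", downtime=0.0):
    """First-order waste estimate (assumed formula), capped at 1.0."""
    c = checkpoint_cost(nodes, scaling)
    r = c                       # assumption: restart costs as much as a checkpoint
    mu = mtbf(nodes)
    slack = mu - downtime - r
    if slack <= 0:
        return 1.0              # no useful work possible under this approximation
    period = math.sqrt(2.0 * c * slack)
    return min(1.0, c / period + (downtime + r + period / 2.0) / mu)


for n in (1_000, 10_000, 100_000, 1_000_000):
    print(n, round(waste(n, "linear"), 3), round(waste(n, "constant"), 3))
```

Even with a constant checkpoint cost, the estimated waste keeps growing with the node count in this sketch because the MTBF shrinks in O(1/n), which is in line with the "still too many checkpoints" message of the slides.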

  10. Research Status Anatomy
                                  Rollback Recovery                   …   Forward Recovery
      Approach                    Checkpointing & Restart (C/R)       …   Algorithm Based Fault Tolerance (ABFT)
      Overhead                    Large                               …   Small
      Application specificity     Small                               …   Large
      This situation can be improved by moving investments from the hardware (more I/O bandwidth, future technologies such as NVRAM, and increasing the MTBF of components) into software and developers.

  11. Forward Recovery
      • Any technique that permits the application to continue without rollback
        • Replication (the only system-level Forward Recovery)
        • Master-Worker with simple resubmission
        • Iterative methods, naturally fault-tolerant algorithms
        • Algorithm Based Fault Tolerance
      • No checkpoint I/O overhead
      • No rollback, no loss of completed work
      • May require (sometimes expensive, like replication) protection/recovery operations, but still generally more scalable than checkpointing
      • “Why is not everybody doing this already, then?”
        • Often requires in-depth algorithm rewrite (in contrast to automatic system-based C/R)
        • Supposes that MPI continues to operate across failures

  12. Forward Recovery
      • Any technique that permits the application to continue without rollback
        • Replication (the only system-level Forward Recovery)
        • Master-Worker with simple resubmission
        • Iterative methods, naturally fault-tolerant algorithms
        • Algorithm Based Fault Tolerance
      • No checkpoint I/O overhead
      • Minimal or no rollback, no loss of completed work
      • May require (sometimes expensive, like replication) protection/recovery operations, but still generally more scalable than checkpointing
      • “Why is not everybody doing this already, then?”
        • Often requires in-depth algorithm rewrite (in contrast to automatic system-based C/R)
        • Supposes that MPI continues to operate across failures

  13. Forward Recovery
      • Any technique that permits the application to continue without rollback
        • Master-Worker with simple resubmission
        • Iterative methods, naturally fault-tolerant algorithms
        • Algorithm Based Fault Tolerance
        • Replication (the only system-level Forward Recovery)
      • No checkpoint I/O overhead
      • No rollback, no loss of completed work
      • May require (sometimes expensive, like replication) protection/recovery operations, but still generally more scalable than checkpointing
      • “Why is not everybody doing this already, then?”
        • Often requires in-depth algorithm rewrite (in contrast to automatic system-based C/R)
        • Supposes that MPI continues to operate across failures
      Standardization of programming paradigms' behavior after failures is a key missing infrastructure
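Of the forward recovery techniques listed above, Master-Worker with simple resubmission is the easiest to illustrate. The sketch below is a single-process simulation under stated assumptions, not an MPI program: worker failures are injected through a hypothetical `fails_on` set of (worker, task) pairs, and the "work" is just squaring the task id. A real implementation would additionally need the communication layer to keep operating across failures, as the slide points out.

```python
from collections import deque


def run_master_worker(tasks, workers, fails_on):
    """Simulate a master handing tasks to workers and resubmitting after a failure.

    fails_on is a set of (worker, task) pairs: the worker dies while running that task.
    """
    pending = deque(tasks)
    alive = list(workers)
    results = {}
    turn = 0
    while pending:
        if not alive:
            raise RuntimeError("all workers failed")
        worker = alive[turn % len(alive)]
        task = pending.popleft()
        if (worker, task) in fails_on:
            alive.remove(worker)    # the failed worker is dropped for good
            pending.append(task)    # only the in-flight task is resubmitted
            continue
        results[task] = (worker, task * task)   # completed work is never redone
        turn += 1
    return results


results = run_master_worker(
    tasks=[0, 1, 2, 3],
    workers=["w0", "w1", "w2"],
    fails_on={("w1", 1)},           # injected failure, for illustration only
)
for task in sorted(results):
    print(task, results[task])
```

Only the task that was in flight on the failed worker is repeated; everything already completed is kept, which is the "no rollback, no loss of completed work" property listed on the slide.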

  14. User Level Failure Mitigation (ULFM): Extend the MPI communication infrastructure to integrate faults as a first-class citizen of the message passing concepts
