
APIs, Architecture and Modeling for Extreme Scale Resilience



  1. APIs, Architecture and Modeling for Extreme Scale Resilience. Dagstuhl Seminar: Resilience in Exascale Computing, 9/30/2014. Kento Sato. LLNL-PRES-661421. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC.

  2. Failures on HPC systems
  • System resilience is critical for future extreme-scale computing.
  • 191 failures out of 5 million node-hours for a production application using a laser-plasma interaction code (pF3D) on the Hera, Atlas and Coastal clusters at LLNL => MTBF: 1.2 days
    – C.f. TSUBAME2.0 => MTBF: about a day
  • At extreme scale, the failure rate will increase.
  • HPC systems must now consider failures as usual events.
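  A rough consistency check on these figures (the node count is not stated on the slide; roughly 900 concurrently used nodes is an inferred value that makes the arithmetic line up with the quoted MTBF):

  \[ \frac{5\times10^{6}\ \text{node-hours}}{191\ \text{failures}} \approx 2.6\times10^{4}\ \text{node-hours per failure}, \qquad \frac{2.6\times10^{4}\ \text{node-hours}}{\approx 900\ \text{nodes}} \approx 29\ \text{h} \approx 1.2\ \text{days} \]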

  3. Motivation for resilience APIs
  • Current MPI implementations do not provide fault-recovery capabilities; standard MPI employs a fail-stop model (see the MPI sketch below).
  • When a failure occurs:
    – MPI terminates all processes
    – The user locates the failed nodes and replaces them with spare nodes via a machinefile
    – MPI is re-initialized
    – The last checkpoint is restored
  • Applications will spend more time on recovery, and the manual recovery operations may introduce extra overhead and human errors.
  => APIs to handle failures are critical
  [Figure: recovery flow — start, application run, checkpointing, failure, terminate processes, locate failed node, replace failed node, MPI re-initialization, restore checkpoint, end]
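  The fail-stop behavior above can be illustrated with plain MPI; this sketch is not from the slides. By default the error handler is MPI_ERRORS_ARE_FATAL, so any failure aborts every rank, and even switching to MPI_ERRORS_RETURN only lets a rank observe an error code — standard MPI offers no way to repair the communicator or pull in a spare node, so recovery remains the manual procedure listed above.

      #include <mpi.h>
      #include <stdio.h>

      int main(int argc, char **argv)
      {
          MPI_Init(&argc, &argv);

          /* Default is MPI_ERRORS_ARE_FATAL: a failed peer terminates the whole job.
           * MPI_ERRORS_RETURN merely surfaces the error code to the caller. */
          MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

          int rc = MPI_Barrier(MPI_COMM_WORLD);
          if (rc != MPI_SUCCESS) {
              /* We can detect the failure, but locating and replacing the failed
               * node, re-initializing MPI, and restoring a checkpoint are still
               * manual, out-of-band steps; all we can do here is abort cleanly. */
              fprintf(stderr, "MPI communication failed (error %d), aborting\n", rc);
              MPI_Abort(MPI_COMM_WORLD, rc);
          }

          MPI_Finalize();
          return 0;
      }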

  4. Resilience APIs, Architecture and the model
  • Resilience APIs => Fault tolerant messaging interface (FMI)
  [Figure: system overview — compute nodes and the parallel file system]

  5. FMI: Fault Tolerant Messaging Interface [IPDPS2014]
  • FMI is a survivable messaging interface providing an MPI-like interface
    – Scalable failure detection => overlay network
    – Dynamic node allocation => FMI ranks are virtualized
    – Fast checkpoint/restart => in-memory diskless checkpoint/restart (a parity sketch follows below)
  [Figure: FMI overview — FMI ranks 0–7 are virtual (user's view vs. FMI's view of processes P0–P9 on Nodes 0–4); checkpoint and parity blocks are spread across nodes for fast checkpoint/restart; nodes are allocated dynamically; failure detection uses a scalable overlay]
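  The in-memory diskless checkpoint/restart relies on RAID-5-style parity across the nodes of an encoding group. The sketch below only illustrates that idea (it is not FMI source code; the buffer layout and names are assumed): XOR-ing the checkpoint blocks of a group yields a parity block from which any single lost block can be rebuilt.

      #include <stddef.h>
      #include <string.h>

      /* XOR n equally sized checkpoint blocks into 'parity'. */
      static void encode_parity(unsigned char *parity, unsigned char **blocks,
                                int n, size_t size)
      {
          memset(parity, 0, size);
          for (int i = 0; i < n; i++)
              for (size_t j = 0; j < size; j++)
                  parity[j] ^= blocks[i][j];
      }

      /* Rebuild the single missing block 'lost' from the parity and the
       * surviving blocks: the same XOR, skipping the lost index. */
      static void rebuild_block(unsigned char *out, const unsigned char *parity,
                                unsigned char **blocks, int n, int lost, size_t size)
      {
          memcpy(out, parity, size);
          for (int i = 0; i < n; i++) {
              if (i == lost)
                  continue;
              for (size_t j = 0; j < size; j++)
                  out[j] ^= blocks[i][j];
          }
      }

  Because the parity blocks are rotated across the nodes of each encoding group (as on slide 9), recovering from a single-node failure costs only one XOR pass over the surviving blocks of that group.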

  6. How do FMI applications work?
  FMI example code:

      int main(int argc, char *argv[]) {
          FMI_Init(&argc, &argv);
          FMI_Comm_rank(FMI_COMM_WORLD, &rank);
          /* Application's initialization */
          while ((n = FMI_Loop(...)) < numloop) {
              /* Application's program */
          }
          /* Application's finalization */
          FMI_Finalize();
      }

  • FMI_Loop enables transparent recovery and roll-back on a failure
    – Periodically writes a checkpoint
    – Restores the last checkpoint on a failure
  • Processes are launched via fmirun
    – fmirun spawns fmirun.task on each node listed in the machine_file (node0.fmi.gov … node4.fmi.gov)
    – fmirun.task forks/execs the user program
    – fmirun broadcasts connection information (endpoints) for FMI_Init()
  [Figure: fmirun launches fmirun.task on Nodes 0–3, which start processes P0–P7; Node 4 is kept as a spare node]

  7. User perspective: no failures
  • User perspective when no failure happens, running the example code from slide 6 with numloop = 4
  • Iterations: 4; checkpoint frequency: every 2 iterations
  • FMI_Loop returns the incremented iteration id
  Timeline (ranks 0–7 on Nodes 0–3): FMI_Init, FMI_Comm_rank, checkpoint 0, FMI_Loop returns 0, FMI_Loop returns 1, checkpoint 1, FMI_Loop returns 2, FMI_Loop returns 3, FMI_Loop returns 4, FMI_Finalize.

  8. User perspective: failure
  • FMI ranks 0 and 1 are transparently migrated to a spare node
  • The application restarts from the last checkpoint (checkpoint 1, taken at iteration 2)
  • With FMI, applications still use the same series of ranks even after failures
  Timeline: FMI_Init, FMI_Comm_rank, checkpoint 0, FMI_Loop returns 0, FMI_Loop returns 1, checkpoint 1, FMI_Loop returns 2, FMI_Loop returns 3, failure, restart from checkpoint 1, FMI_Loop returns 2 again, then 3 and 4, FMI_Finalize.

  9. Resilience API: FMI_Loop

      int FMI_Loop(void **ckpt, size_t *sizes, int len)

  • ckpt: array of pointers to the variables containing data that needs to be checkpointed
  • sizes: array of sizes of each checkpointed variable
  • len: length of the arrays ckpt and sizes
  • Returns the iteration id
  • FMI constructs an in-memory RAID-5 across compute nodes
    – Checkpoint group size, e.g. group_size = 4
  A usage sketch follows below.
  [Figure: FMI checkpointing — ranks 0–15 on Nodes 0–7 form two encoding groups of four nodes each; checkpoint blocks and parity blocks are rotated across the nodes within each group]
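  Putting the FMI_Loop signature together with the skeleton from slide 6 gives roughly the following usage. This is a hedged sketch, not code from the slides: the header name fmi.h, the grid buffer, and the iteration count are illustrative, and the slides do not show how the checkpoint frequency (e.g. every 2 iterations on slide 7) is configured.

      #include <stdlib.h>
      #include "fmi.h"                 /* assumed header name */

      int main(int argc, char *argv[])
      {
          int rank, n;
          FMI_Init(&argc, &argv);
          FMI_Comm_rank(FMI_COMM_WORLD, &rank);

          /* Application state to protect: one pointer and one size per variable. */
          size_t  grid_bytes = 1024 * 1024 * sizeof(double);
          double *grid       = malloc(grid_bytes);
          void   *ckpt[]     = { grid };
          size_t  sizes[]    = { grid_bytes };
          int     numloop    = 100;

          /* FMI_Loop checkpoints the registered buffers periodically and, after a
           * failure, restores them and returns the iteration id of the last
           * checkpoint, so the loop simply resumes from there. */
          while ((n = FMI_Loop(ckpt, sizes, 1)) < numloop) {
              /* one solver iteration operating on grid[...], using rank as needed */
          }

          free(grid);
          FMI_Finalize();
          return 0;
      }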

  10. Application runtime with failures
  • Benchmark: Poisson's equation solver using the Jacobi iteration method
    – Stencil application benchmark
    – MPI_Isend, MPI_Irecv, MPI_Wait and MPI_Allreduce within a single iteration
  • For MPI, we use the SCR library for checkpointing
    – Since MPI is not a survivable messaging interface, checkpoints are written to tmpfs (in memory)
  • The checkpoint interval is optimized by Vaidya's model for both FMI and MPI
  • P2P communication performance:
                   1-byte latency    Bandwidth (8 MB)
        MPI        3.555 usec        3.227 GB/s
        FMI        3.573 usec        3.211 GB/s
  • FMI writes checkpoints directly via memcpy and can exploit the memory bandwidth
  • With an injected MTBF of 1 minute, FMI incurs only a 28% overhead even at this high failure rate
  [Figure: performance (GFlops) vs. number of processes (12 processes/node, up to ~1,500 processes) for MPI, FMI, MPI + checkpointing, FMI + checkpointing, and FMI + checkpoint/restart]
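  As a rough illustration of what a checkpoint-interval optimization trades off (this is Young's first-order approximation, not the Vaidya model actually used on the slide), the near-optimal interval grows with the checkpoint cost C and the MTBF M:

  \[ \tau_{\mathrm{opt}} \approx \sqrt{2\,C\,M} \]

  For example, with the injected MTBF of M = 60 s and an assumed in-memory checkpoint cost of C = 1 s, this gives roughly an 11-second interval, which is why a cheap memcpy-based checkpoint can be taken frequently without blowing up the overhead.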

  11. Asynchronous multi-level checkpointing (MLC) [SC12]
  • Asynchronous MLC is a technique for achieving high reliability while reducing checkpointing overhead
  • Asynchronous MLC uses storage levels hierarchically
    – RAID-5 (level-1) checkpoint: frequent; for one-node or few-node failures
    – PFS (level-2) checkpoint: less frequent and asynchronous; for multi-node failures
  • Our previous work models asynchronous MLC
  • Failure analysis on the Coastal cluster:
                        MTBF         Failure rate
        L1 failure      130 hours    2.13 x 10^-6 /s
        L2 failure      650 hours    4.27 x 10^-7 /s
  [Figure: timeline of level-1 (RAID-5) and level-2 (PFS) checkpoints]
  Source: K. Sato, N. Maruyama, K. Mohror, A. Moody, T. Gamblin, B. R. de Supinski, and S. Matsuoka, "Design and Modeling of a Non-Blocking Checkpointing System," in Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC '12), Salt Lake City, Utah: IEEE Computer Society Press, 2012.
  Source: A. Moody, G. Bronevetsky, K. Mohror, and B. R. de Supinski, "Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System," in Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC '10), 2010.
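  The failure-rate column is just the reciprocal of the MTBF expressed per second (the unit is inferred here from the numbers themselves):

  \[ \lambda_{L1} = \frac{1}{130 \times 3600\ \mathrm{s}} \approx 2.13\times10^{-6}\ \mathrm{s}^{-1}, \qquad \lambda_{L2} = \frac{1}{650 \times 3600\ \mathrm{s}} \approx 4.27\times10^{-7}\ \mathrm{s}^{-1} \]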
