
MCEM: Multi-Level Cooperative Exception Model for HPC Workflows


  1. MCEM: Multi-Level Cooperative Exception Model for HPC Workflows
     Stephen Herbein, David Domyancic, Paul Minner, Ignacio Laguna, Rafael Ferreira da Silva, Dong H. Ahn
     June 2019, LLNL-PRES-779103
     This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC

  2. Fault-Tolerance in HPC [1]
     - The mean time between failures (MTBF) of our systems is shrinking
     - The cost of checkpoint/restart is becoming prohibitively expensive
     - The problem will only get worse with the inclusion of GPUs and node-local SSDs
     Fault tolerance is becoming increasingly important
     [1] R. Riesen, K. Ferreira and J. Stearley, "See applications run and throughput jump: The case for redundant computing in HPC," 2010 International Conference on Dependable Systems and Networks Workshops (DSN-W), Chicago, IL, 2010, pp. 29-34

  3. Fault-Tolerance Primitives
     - Detection: the observation of a fault, error, or degradation
     - Isolation/Diagnosis: the identification of the root cause of the detected fault
     - Recovery: the remediation of the fault by affected components

  4. Fault Tolerance: State of the Practice
     - Existing state-of-the-practice fault-tolerance techniques are entirely uncoordinated
     - System components each act independently to detect, diagnose, and recover from faults
     [Diagram: components User's Workflow Manager, Scheduler, Resource Manager, Parallel Job, Node; labels SegFault, Process Exited Abnormally, Rank Unresponsive, Job Failed, Relaunch Process, Restart with N-1 Nodes; legend Detection, Diagnosis, Recovery]
     Lack of coordination results in undetected faults and inefficiency

  5. Fault Tolerance: State of the Art
     - Components coordinate to detect and diagnose faults
     - System components each perform their own uncoordinated recovery actions
     - These actions are usually redundant and sometimes contradictory
     [Diagram: components User's Workflow Manager, Scheduler, Resource Manager, Parallel Job, Node, ML Model, Global Event Database; labels SegFault, Process Exited Abnormally, Rank Unresponsive, Process Failure on Node X, Resubmit Job, Kill Job, Relaunch Process, Restart with N-1 Nodes; legend Detection, Diagnosis, Recovery]
     Lack of coordinated recovery results in suboptimal and redundant work

  6. MCEM: Multi-Level Cooperative Exception Model
     - MCEM extends the idea of C++/Java exceptions to an entire HPC system
     - Exceptions are cooperatively handled in a chain
     - Chained exceptions include fault and recovery metadata
     [Diagram: components User's Workflow Manager, Scheduler, Resource Manager, Parallel Job, Node, ML Model, Global Event Database; labels SegFault, Process Exited Abnormally, Rank Unresponsive, Process Failure on Node X, Relaunch Process, Restart with N Nodes, Extend Job Walltime; legend Detection, Diagnosis, Recovery]
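
The chained handling on this slide can be sketched in Python. This is a minimal illustration, not the authors' implementation: the `MCEMException` class, the handler functions, `handle_chain`, and the level order (Resource Manager, Parallel Job, Workflow Manager, Scheduler) are all assumptions made for the example. The recovery actions are the ones named in the diagram for a process failure, and the rule that the parallel job keeps N ranks once the process has been relaunched is likewise an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class MCEMException:
    """Hypothetical exception record carrying fault and recovery metadata."""
    fault_type: str
    fault_metadata: dict
    recoveries: list = field(default_factory=list)  # (level, action) pairs


# Hypothetical per-level handlers for a process failure. Each returns the
# recovery action it takes, or None if the level takes no action.
def resource_manager(exc):
    return "relaunch process"


def parallel_job(exc):
    # Assumption: if a lower level already relaunched the process,
    # the job can restart with all N ranks instead of N-1.
    if any(action == "relaunch process" for _, action in exc.recoveries):
        return "restart with N ranks"
    return "restart with N-1 ranks"


def workflow_manager(exc):
    return None


def scheduler(exc):
    # Assumption: extra walltime is granted only if some recovery happened.
    return "extend job walltime" if exc.recoveries else None


# Assumed chain order, lowest level first.
CHAIN = [
    ("resource manager", resource_manager),
    ("parallel job", parallel_job),
    ("workflow manager", workflow_manager),
    ("scheduler", scheduler),
]


def handle_chain(exc, chain=CHAIN):
    """Pass one exception through every level, recording each recovery."""
    for level, handler in chain:
        action = handler(exc)
        if action is not None:
            exc.recoveries.append((level, action))
    return exc


exc = handle_chain(MCEMException("process failure", {"node": "X"}))
for level, action in exc.recoveries:
    print(f"{level}: {action}")
```

Because each handler reads the recoveries recorded before it, later levels can avoid redundant or contradictory actions, which is the point of carrying recovery metadata along the chain.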

  7. MCEM: Global Exceptions
     - Propagating up works well for exceptions originating from a single, isolated resource (i.e., a local exception)
     - Reverse the propagation direction for exceptions originating from a shared resource (i.e., a global exception)
     [Diagram: components User's Workflow Manager, Scheduler, Resource Manager, Parallel Job, Node, Parallel Filesystem, ML Model, Global Event Database; labels Metadata IO Timeouts, Node Failed, Parallel FS Down, Hold Jobs Requiring PFS, Transfer jobs to 2nd PFS; legend Detection, Diagnosis, Recovery]
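
The two propagation directions can be sketched as a small Python function. This is an illustration under stated assumptions: the `LEVELS` list, its ordering, and the `propagation_order` function are hypothetical, and "reverse" is read here as starting at the highest level and moving down.

```python
# Assumed component order from the lowest level to the highest.
LEVELS = ["resource manager", "parallel job", "workflow manager", "scheduler"]


def propagation_order(scope, levels=LEVELS):
    """Return the order in which levels handle an exception.

    scope: "local" for a fault on a single, isolated resource,
           "global" for a fault on a shared resource.
    """
    if scope == "local":
        return list(levels)            # propagate up
    if scope == "global":
        return list(reversed(levels))  # reversed direction
    raise ValueError(f"unknown scope: {scope!r}")


print(propagation_order("local"))
print(propagation_order("global"))
```

Under this reading, a parallel filesystem outage reaches the scheduler first, so it can hold queued jobs that require the PFS before lower levels act.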

  8. MCEM: Fault Model
     - Hard faults: segmentation faults, node failures, network link failure, PFS down, user exceeded disk quota
     - Soft faults: network or PFS performance degraded, user approaching disk quota
     - Fault length: effects must last long enough to be reliably detected, isolated, and recovered from, i.e., O(minutes)
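
One way to encode this fault model is a lookup table plus a duration check. The sketch below is illustrative only: the identifiers, the `Severity` enum, the `in_fault_model` function, and the 60-second threshold are assumptions (the slide says only O(minutes)).

```python
from enum import Enum


class Severity(Enum):
    HARD = "hard"
    SOFT = "soft"


# Fault names follow the slide; the identifiers themselves are hypothetical.
FAULTS = {
    "segmentation_fault": Severity.HARD,
    "node_failure": Severity.HARD,
    "network_link_failure": Severity.HARD,
    "pfs_down": Severity.HARD,
    "user_exceeded_disk_quota": Severity.HARD,
    "network_performance_degraded": Severity.SOFT,
    "pfs_performance_degraded": Severity.SOFT,
    "user_approaching_disk_quota": Severity.SOFT,
}

# Assumed threshold standing in for "O(minutes)".
MIN_EFFECT_SECONDS = 60.0


def in_fault_model(name, effect_seconds):
    """True if the fault is a known type and its effects last long enough."""
    return name in FAULTS and effect_seconds >= MIN_EFFECT_SECONDS


print(in_fault_model("pfs_down", 300.0))
print(in_fault_model("pfs_performance_degraded", 5.0))
```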

  9. MCEM Exception Recovery Examples
     Recovery actions by failure type. Table columns: Failure Type, Resource Manager, Parallel Job, Workflow Manager, Scheduler ("--" marks no action).
     - Parallel Launcher Failure: retry job (transient) or log system error (permanent); other components "--"
     - Application Failure (i.e., mesh tangling): launch mesh relaxation job; other components "--"
     - Process Failure: Resource Manager: relaunch process; Parallel Job: restart w/ N ranks; Workflow Manager: "--"; Scheduler: grant job additional time
     - Node Failure: Resource Manager: mark node down; Parallel Job: restart w/ N-1 ranks OR request additional node; Workflow Manager: "--"; Scheduler: grant job additional node
     - User Approaching or Exceeding Disk Quota: hold queued jobs requiring PFS access; migrate some/all workflow jobs to secondary filesystem; other components "--"

  10. Quota Exceeded: State of the Practice
     [Diagram: components User's Workflow Manager, Scheduler, Resource Manager, Parallel Job, Parallel Filesystem; labels User Above Hard Quota, EQUOT, User Exceeded Quota, Migrate to 2nd PFS; legend Detection, Diagnosis, Recovery]

  11. Quota Exceeded: State of the Art
     [Diagram: components User's Workflow Manager, Scheduler, Resource Manager, Parallel Job, Parallel Filesystem, ML Model, Global Event Database; labels User Above Hard Quota, EQUOT, User Exceeded Quota, Migrate to 2nd PFS, Migrate Some Jobs to 2nd PFS, Hold User's Queued Jobs; legend Detection, Diagnosis, Recovery]

  12. Quota Exceeded: MCEM
     [Diagram: components User's Workflow Manager, Scheduler, Resource Manager, Parallel Job, Node, Parallel Filesystem, ML Model, Global Event Database; labels User Above Hard Quota, EQUOT, User Exceeded Quota, Migrate Some Jobs to 2nd PFS, Hold User's Queued Jobs; legend Detection, Diagnosis, Recovery]

  13. Evaluation
     - In the state of the art (SOA), all parallel applications transition to the second filesystem, and the workflow manager (WFM) then re-transitions some or all of the jobs
     - MCEM allows the WFM to move only the minimal subset of jobs, exactly once
     MCEM can reduce IO by up to 90%

  14. Implementation: Resource Manager
     - Why implement within the system resource manager (RM)
       - Communication is already implemented and fault-tolerant (hopefully)
       - Can be a plugin/module, resulting in less code to write and audit
     - Why not implement within the system RM
       - If the RM daemon dies, so does MCEM
       - RM failures then become potentially undetectable and certainly unrecoverable

  15. Implementation: Runtime Interface
     - Flux
       - flux job raise --severity=1 --type="segmentation fault" $ID '{"rank": "262", "pid": 1182, "node": "quartz454"}'
       - flux job eventlog $ID
       - flux_event_subscribe (h, "job-exception")
     - PMIx
       - PMIx_Notify_event
       - PMIx_Register_event_handler
         - Supports registering a handler for multiple events simultaneously
         - "Multi-code" handlers always execute after "single-code" handlers
         - Supports specifying relative handler precedence within a "category"
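
As a hedged sketch of how a detector might assemble the `flux job raise` call shown on this slide, the Python below builds the argument list without executing it. The helper name `flux_raise_argv` and the job ID `"1234"` are hypothetical; the flags and the JSON payload mirror the slide's example.

```python
import json
import shlex


def flux_raise_argv(job_id, fault_type, severity, details):
    """Build the argv for `flux job raise`, with a JSON note as the message."""
    return [
        "flux", "job", "raise",
        f"--severity={severity}",
        f"--type={fault_type}",
        str(job_id),
        json.dumps(details),  # fault metadata serialized as the message
    ]


argv = flux_raise_argv(
    job_id="1234",  # hypothetical job ID standing in for $ID
    fault_type="segmentation fault",
    severity=1,
    details={"rank": "262", "pid": 1182, "node": "quartz454"},
)

# Print a shell-quoted form of the command for inspection.
print(shlex.join(argv))
```

Passing the list to `subprocess.run(argv, check=True)` would issue the command on a system where Flux is installed; building the list rather than a shell string avoids quoting problems with the space in the fault type and the JSON payload.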

  16. Acknowledgements
     - Flux Team: Ned Bass, Al Chu, Jim Garlick, Mark Grondona, Tapasya Patki, Tom Scogland, Becky Springmeyer
     - Co-Authors: David Domyancic, Paul Minner, Ignacio Laguna, Rafael Ferreira da Silva, Dong H. Ahn

  17. Disclaimer This document was prepared as an account of work sponsored by an agency of the United States government. Neither the United States government nor Lawrence Livermore National Security, LLC, nor any of their employees makes any warranty, expressed or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States government or Lawrence Livermore National Security, LLC. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States government or Lawrence Livermore National Security, LLC, and shall not be used for advertising or product endorsement purposes.

  18. Backup Slides

  19. MCEM's Exception Propagation Order
     [Diagram panels: Local Exceptions, Global Exceptions]
