MCEM: Multi-Level Cooperative Exception Model for HPC Workflows

Stephen Herbein, David Domyancic, Paul Minner, Ignacio Laguna, Rafael Ferreira da Silva, Dong H. Ahn
June 2019
LLNL-PRES-779103

This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC

Fault-Tolerance in HPC [1]

§ The MTBF of our systems is shrinking
§ Checkpoint/restart is becoming prohibitively expensive
§ The problem will only get worse with the inclusion of GPUs and node-local SSDs

Fault tolerance is becoming increasingly important.

[1] R. Riesen, K. Ferreira and J. Stearley, "See applications run and throughput jump: The case for redundant computing in HPC," 2010 International Conference on Dependable Systems and Networks Workshops (DSN-W), Chicago, IL, 2010, pp. 29-34.

Fault-Tolerance Primitives

§ Detection: the observation of a fault, error, or degradation
§ Isolation/Diagnosis: the identification of the root cause of the detected fault
§ Recovery: the remediation of the fault by affected components

Fault Tolerance: State of the Practice

§ Existing state-of-the-practice fault tolerance techniques are entirely uncoordinated
§ System components each act independently to detect, diagnose, and recover from faults

[Figure: User's Workflow Manager, Scheduler, Resource Manager, Parallel Job, and Node. Labels: Process Exited Abnormally, Rank Unresponsive, SegFault, Parallel Job Failed, Relaunch Process, Restart with N-1 Nodes. Legend: Detection, Diagnosis, Recovery.]

Lack of coordination results in undetected faults and inefficiency.

Fault Tolerance: State of the Art

§ Components coordinate to detect and diagnose faults
§ System components each perform their own uncoordinated recovery actions
§ These actions are usually redundant and sometimes contradictory

[Figure: User's Workflow Manager, Scheduler, Resource Manager, Parallel Job, Node, ML Model, and Global Event Database. Labels: Process Failure on Node X, Resubmit Job, Kill Job, Relaunch Process, Restart with N-1 Nodes, Process Exited Abnormally, Rank Unresponsive, SegFault. Legend: Detection, Diagnosis, Recovery.]

Lack of coordinated recovery results in suboptimal and redundant work.

MCEM: Multi-Level Cooperative Exception Model

§ MCEM extends the idea of C++/Java exceptions to an entire HPC system
§ Exceptions are cooperatively handled in a chain
§ Chained exceptions include fault and recovery metadata

[Figure: Scheduler, User's Workflow Manager, Parallel Job, Resource Manager, Node, ML Model, and Global Event Database. Labels: Extend Job Walltime, Process Failure on Node X, Restart with N Nodes, Relaunch Process, Rank Unresponsive, Process Exited Abnormally, SegFault. Legend: Detection, Diagnosis, Recovery.]
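The chaining idea on this slide can be illustrated with a small Python sketch. It is not the authors' implementation: the `MCEMException` class, the handler functions, the bottom-up order of `CHAIN`, and the `rank` metadata value are all hypothetical names chosen for illustration. The point is only that each level appends its recovery action to the exception, so a higher level can decide what to do based on what lower levels already did.

```python
from dataclasses import dataclass, field


@dataclass
class MCEMException:
    """Hypothetical fault report that accumulates metadata as it travels."""
    fault: str
    origin: str
    metadata: dict = field(default_factory=dict)
    recoveries: list = field(default_factory=list)  # (level, action) pairs


def resource_manager(exc):
    exc.recoveries.append(("resource_manager", "relaunch process"))


def parallel_job(exc):
    exc.recoveries.append(("parallel_job", "restart with N ranks"))


def workflow_manager(exc):
    # Nothing to add for a process failure; the exception passes through.
    pass


def scheduler(exc):
    # A higher level can inspect what the lower levels already did.
    if any(action.startswith("restart") for _, action in exc.recoveries):
        exc.recoveries.append(("scheduler", "extend job walltime"))


# Assumed bottom-up order for an exception raised on a single node.
CHAIN = [resource_manager, parallel_job, workflow_manager, scheduler]


def propagate(exc, chain=CHAIN):
    """Hand the same exception object to every level in the chain, in order."""
    for handler in chain:
        handler(exc)
    return exc


exc = propagate(MCEMException(
    fault="process failure",
    origin="node X",
    metadata={"rank": 262},  # illustrative value only
))
for level, action in exc.recoveries:
    print(f"{level}: {action}")
```

Because the scheduler sees the earlier "restart" entry, it extends the walltime instead of issuing its own, possibly contradictory, recovery.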

MCEM: Global Exceptions

§ Propagating up works well for exceptions originating from a single, isolated resource (i.e., local exception)
§ Reverse propagation direction for exceptions originating from a shared resource (i.e., global exception)

[Figure: Scheduler, User's Workflow Manager, Parallel Job, Resource Manager, Node, Parallel Filesystem, ML Model, and Global Event Database. Labels: Hold Jobs Requiring PFS, Parallel FS Down, Transfer jobs to 2nd PFS, Metadata IO Timeouts, Node Failed. Legend: Detection, Diagnosis, Recovery.]
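A minimal sketch of the two propagation directions, assuming the four levels are ordered from lowest to highest as resource manager, parallel job, workflow manager, scheduler. That ordering, the `LEVELS` list, and the `propagation_order` function are assumptions made for illustration, not part of the presentation.

```python
# Levels from lowest to highest; this ordering is an assumption.
LEVELS = ["resource_manager", "parallel_job", "workflow_manager", "scheduler"]


def propagation_order(scope):
    """Return the order in which levels see an exception of the given scope."""
    if scope == "local":      # single, isolated resource: propagate up
        return list(LEVELS)
    if scope == "global":     # shared resource: reverse the direction
        return list(reversed(LEVELS))
    raise ValueError(f"unknown scope: {scope!r}")


print(propagation_order("local"))
print(propagation_order("global"))
```

With a global exception such as a parallel filesystem outage, the scheduler is consulted first in this sketch, so it can hold queued jobs before lower levels act.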

MCEM: Fault Model

§ Hard faults: Segmentation Faults, Node Failures, Network Link Failure, PFS Down, User Exceeded Disk Quota
§ Soft faults: Network or PFS performance degraded, User Approaching Disk Quota
§ Fault length: Effects must last long enough to be reliably detected, isolated, and recovered from, i.e., O(minutes)

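MCEM Exception Recovery Examples

Recovery actions by failure type at each of the four levels (Resource Manager, Parallel Job, Workflow Manager, Scheduler). "--" marks a level that takes no action.

§ Parallel Launcher Failure: Retry job (transient) or log system error (permanent), handled at a single level; the other three levels are "--"
§ Application Failure (i.e., mesh tangling): Launch mesh relaxation job, handled at a single level; the other three levels are "--"
§ Process Failure
  - Resource Manager: Relaunch process
  - Parallel Job: Restart with N ranks
  - Workflow Manager: --
  - Scheduler: Grant job additional time
§ Node Failure
  - Resource Manager: Mark node down
  - Parallel Job: Restart with N-1 ranks OR request additional node
  - Workflow Manager: --
  - Scheduler: Grant job additional node
§ User Approaching or Exceeding Disk Quota
  - Resource Manager: --
  - Parallel Job: --
  - Workflow Manager: Migrate some/all workflow jobs to secondary filesystem
  - Scheduler: Hold queued jobs requiring PFS access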
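One way to picture the recovery examples is as a lookup keyed by failure type and level. The sketch below covers only the three failure types whose per-level actions are unambiguous, and it assigns the disk quota actions to the workflow manager and scheduler as the "Quota Exceeded: MCEM" slide suggests. The dictionary layout, the key names, and `recovery_action` are hypothetical.

```python
# Hypothetical lookup table: failure type -> level -> recovery action.
# Levels with no entry take no action.
RECOVERY = {
    "process failure": {
        "resource_manager": "relaunch process",
        "parallel_job": "restart with N ranks",
        "scheduler": "grant job additional time",
    },
    "node failure": {
        "resource_manager": "mark node down",
        "parallel_job": "restart with N-1 ranks or request additional node",
        "scheduler": "grant job additional node",
    },
    "disk quota": {
        "workflow_manager": "migrate some/all workflow jobs to secondary filesystem",
        "scheduler": "hold queued jobs requiring PFS access",
    },
}


def recovery_action(failure, level):
    """Return the action a level takes for a failure, or None for no action."""
    return RECOVERY.get(failure, {}).get(level)


print(recovery_action("node failure", "resource_manager"))
print(recovery_action("process failure", "workflow_manager"))
```

Returning `None` for a missing entry mirrors the "--" cells: a level that has nothing to contribute simply lets the exception continue along the chain.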

Quota Exceeded: State of the Practice

[Figure: User's Workflow Manager, Scheduler, Resource Manager, Parallel Job, and Parallel Filesystem. Labels: User Above Hard Quota, EQUOT, User Exceeded Quota, Migrate to 2nd PFS. Legend: Detection, Diagnosis, Recovery.]

Quota Exceeded: State of the Art

[Figure: User's Workflow Manager, Scheduler, Resource Manager, Parallel Job, Parallel Filesystem, ML Model, and Global Event Database. Labels: User Above Hard Quota, EQUOT, User Exceeded Quota, Hold User's Queued Jobs, Migrate Some Jobs to 2nd PFS, Migrate to 2nd PFS. Legend: Detection, Diagnosis, Recovery.]

Quota Exceeded: MCEM

[Figure: Scheduler, User's Workflow Manager, Parallel Job, Resource Manager, Node, Parallel Filesystem, ML Model, and Global Event Database. Labels: User Above Hard Quota, EQUOT, User Exceeded Quota, Hold User's Queued Jobs, Migrate Some Jobs to 2nd PFS. Legend: Detection, Diagnosis, Recovery.]

Evaluation

§ In the state of the art (SOA), parallel applications all transition to the 2nd filesystem, and the workflow manager (WFM) re-transitions some/all of the jobs
§ MCEM allows the WFM to move only the minimal subset of jobs, exactly once

MCEM can reduce IO by up to 90%.

Implementation: Resource Manager

§ Why to implement within the system RM
  - Communication already implemented and fault-tolerant (hopefully)
  - Can be a plugin/module, resulting in less code to write and audit
§ Why not to implement within the system RM
  - If the RM daemon dies, so does MCEM
  - RM failures then become potentially undetectable and certainly unrecoverable

Implementation: Runtime Interface

§ Flux
  - flux job raise --severity=1 --type="segmentation fault" $ID '{"rank": "262", "pid": 1182, "node": "quartz454"}'
  - flux job eventlog $ID
  - flux_event_subscribe (h, "job-exception")
§ PMIx
  - PMIx_Notify_event
  - PMIx_Register_event_handler
    • Supports registering a handler for multiple events simultaneously
    • "Multi-code" handlers always execute after "single-code" handlers
    • Supports specifying relative handler precedence within a "category"
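The handler ordering described for PMIx can be pictured with a toy model in Python. This does not call PMIx and does not reproduce its API: the handler dictionaries, the string event codes, and the integer `precedence` field are hypothetical simplifications. It only shows the rule from the slide, that handlers registered for one code run before handlers registered for several, with a relative order inside each category.

```python
def invocation_order(handlers, code):
    """Order the handlers that match an event code.

    Single-code handlers come first, multi-code handlers after them; inside
    each category a lower (hypothetical) precedence value runs earlier.
    """
    matching = [h for h in handlers if code in h["codes"]]
    ordered = sorted(
        matching,
        key=lambda h: (len(h["codes"]) > 1, h["precedence"]),
    )
    return [h["name"] for h in ordered]


# Illustrative registrations; names and codes are made up for this sketch.
handlers = [
    {"name": "log_any_failure", "codes": {"proc_aborted", "node_down"}, "precedence": 0},
    {"name": "restart_job", "codes": {"proc_aborted"}, "precedence": 1},
    {"name": "relaunch_proc", "codes": {"proc_aborted"}, "precedence": 0},
    {"name": "drain_node", "codes": {"node_down"}, "precedence": 0},
]

print(invocation_order(handlers, "proc_aborted"))
```

For the `proc_aborted` event, the two single-code handlers run in precedence order and the multi-code logger runs last, which is the kind of deterministic ordering a cooperative exception chain relies on.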

Acknowledgements

§ Flux Team
  - Ned Bass
  - Al Chu
  - Jim Garlick
  - Mark Grondona
  - Tapasya Patki
  - Tom Scogland
  - Becky Springmeyer
§ Co-Authors
  - David Domyancic
  - Paul Minner
  - Ignacio Laguna
  - Rafael Ferreira da Silva
  - Dong H. Ahn

Disclaimer

This document was prepared as an account of work sponsored by an agency of the United States government. Neither the United States government nor Lawrence Livermore National Security, LLC, nor any of their employees makes any warranty, expressed or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States government or Lawrence Livermore National Security, LLC. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States government or Lawrence Livermore National Security, LLC, and shall not be used for advertising or product endorsement purposes.

Backup Slides

MCEM's Exception Propagation Order

[Figure: two panels showing the propagation order for Local Exceptions and for Global Exceptions.]
