Fault Tolerance and the MPI standard meet at the Ultra-Scale

Richard L. Graham
Computer Science and Mathematics Division, National Center for Computational Sciences
Managed by UT-Battelle for the Department of Energy
Graham_OpenMPI_SC08



Outline

  • Problem definition
    – General
    – MPI specific
  • General approach for making MPI fault tolerant
  • Current status
  • Is this all?

Goal: Let MPI survive partial system failure


Problem definition

[Diagram: Node 1 (Disk A) and Node 2 (Disk B) connected by a network]


Problem definition – A bit more realistic

[Diagram: Node 1 and Node 2 connected by a network; only Node 2 has a disk (Disk B)]


Failure example – node failure

[Diagram: Node 1 has failed; Node 2 (Disk B) and the network remain]


  • Problem: A component affecting a running MPI job is compromised (H/W or S/W)
  • Question: Can the MPI application continue to run correctly?
    – Does the job have to abort?
    – If not, can the job continue to communicate?
    – Can there be a change in resources available to the job?


Related Work*

[Taxonomy figure: fault-tolerant MPI efforts classified as automatic vs. non-automatic; checkpoint-based vs. log-based (optimistic, causal, or pessimistic logging); implemented at the framework, API, or communication layer]

  • CoCheck – independent of MPI [Ste96]
  • Starfish – enrichment of MPI [AF99]
  • Clip – semi-transparent checkpoint [Clp97]
  • Optimistic recovery in distributed systems – n faults with coherent checkpoint [FY85]
  • Egida [Rav99]
  • Sender-based message logging – 1 fault, sender-based [JZ87]; 2 faults, sender-based [Pruitt 98]
  • Manetho – n faults [EZ92]
  • MPI/FT – redundancy of tasks [BNC01]
  • MPICH-V – n faults, distributed logging; MPICH-V/CL
  • MPI-FT – n faults, centralized server [LNLE00]
  • LAM/MPI, LA-MPI [RG03], FT-MPI

*Borrowed from Jack Dongarra
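Of the log-based approaches in the taxonomy above, sender-based message logging is the easiest to sketch in miniature. The fragment below (plain C, no MPI; all names are illustrative and belong to none of the systems listed) shows the core idea: each sender keeps a private copy of every outgoing message so that, after a receiver fails and restarts, its messages can be replayed from the sender's log instead of rolling everyone back to a global checkpoint.

```c
#include <assert.h>
#include <string.h>

#define LOG_CAP 64
#define MSG_MAX 32

/* One logged message: destination rank plus a private copy of the payload. */
typedef struct {
    int  dest;
    char data[MSG_MAX];
    int  len;
} logged_msg;

/* Sender-side log: survives a *receiver* failure because it lives on the sender. */
typedef struct {
    logged_msg entries[LOG_CAP];
    int        count;
} sender_log;

/* "Send" a message: deliver it (delivery path elided) and append a copy to the log. */
static int log_send(sender_log *log, int dest, const char *buf, int len) {
    if (log->count >= LOG_CAP || len > MSG_MAX) return -1;
    logged_msg *m = &log->entries[log->count++];
    m->dest = dest;
    memcpy(m->data, buf, (size_t)len);
    m->len = len;
    return 0;
}

/* After `dest` restarts, replay every message sent to it, in original order.
 * `deliver` stands in for the real resend path. Returns the replay count. */
static int log_replay(const sender_log *log, int dest,
                      void (*deliver)(const char *, int)) {
    int replayed = 0;
    for (int i = 0; i < log->count; i++) {
        if (log->entries[i].dest == dest) {
            deliver(log->entries[i].data, log->entries[i].len);
            replayed++;
        }
    }
    return replayed;
}

/* Trivial delivery stub for demonstration: just counts redeliveries. */
static int deliveries = 0;
static void count_delivery(const char *buf, int len) { (void)buf; (void)len; deliveries++; }
```

A real implementation (e.g. the sender-based logging of [JZ87]) must additionally track receive sequence numbers so replayed messages are consumed in the original interleaving, and must garbage-collect log entries once a checkpoint makes them unnecessary.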


Why address this problem now?

  • There have been quite a few predictions over the last decade that we would reach a scale at which hardware and software failure rates would be so high that we could not make effective use of these systems.


Why address this problem now? – cont'd

  • This has not happened (?). Why should we believe it will happen this time?
  • Actually, we have adjusted:
    – Wasting resources
    – Automating simple forms of recovery (restart – John Daly)


Why address this problem now? – cont'd

  • Systems are getting much larger
    – NCCS: ~15,000 cores ('07) -> ~30,000 cores ('08) -> ~150,000 cores (end '08)
  • Impact on the applications is increasing
  • As we go to 1,000,000+ processes being used in a single job, application MTBF will suffer greatly


Why is Coordinated Checkpoint/Restart not Sufficient?*

*Ron Oldfield, et al. – Modeling the Impact of Checkpoints on Next-Generation Systems


The Challenge

  • We need approaches to dealing with fault tolerance at scale that will:
    – Allow applications to harness full system capabilities
    – Work well at scale


Technical Guiding Principles

  • End goal: Increase application MTBF
    – Applications must be willing to use the solution
  • No one-solution-fits-all:
    – Hardware characteristics
    – Software characteristics
    – System complexity
    – System resources available for fault recovery
    – Performance impact on application
    – Fault characteristics of application
  • Standard should not be constrained by current practice


Why MPI?

  • The ubiquitous standard parallel programming model used in scientific computing today
  • Minimize disruption of the running application


What role should MPI play in recovery?

  • MPI does NOT provide fault tolerance
  • MPI should enable the survivability of MPI upon failure


What role should MPI play in recovery? – cont'd

  • MPI provides:
    – Communication primitives
    – Management of groups of processes
    – Access to the file system


What role should MPI play in recovery? – cont'd

  • Therefore, upon failure MPI should (limited by system state):
    – Restore the MPI communication infrastructure to a correct and consistent state
    – Restore process groups to a well-defined state
    – Be able to reconnect to the file system
    – Provide hooks related to MPI communications needed by other protocols building on top of MPI, such as:
      • Flush the message system
      • Quiesce the network
      • Send "piggyback" data
      • ?
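To make the "hooks" bullets concrete: a checkpoint library layered above MPI might drive such hooks roughly as below. This is a hedged sketch only; MPIX_Flush_messages, MPIX_Quiesce_network, and MPIX_Piggyback_send are invented placeholder names for the hooks listed above, not functions in the MPI standard or any implementation, and next_epoch/write_local_checkpoint stand for library-internal steps.

```c
/* Hypothetical layered checkpoint library driving the proposed hooks.
 * None of the MPIX_* calls below exist anywhere; they stand in for
 * the "flush / quiesce / piggyback" hooks named above. */
int layered_checkpoint(MPI_Comm comm)
{
    /* 1. Drain in-flight traffic so no message is lost in the network. */
    MPIX_Flush_messages(comm);

    /* 2. Stop new traffic while the checkpoint is taken. */
    MPIX_Quiesce_network(comm);

    /* 3. Tag subsequent messages with the checkpoint epoch, so a
     *    receiver can tell pre- and post-checkpoint traffic apart. */
    int epoch = next_epoch();                       /* library-internal */
    MPIX_Piggyback_send(comm, &epoch, sizeof(epoch));

    return write_local_checkpoint();                /* library-internal */
}
```

The point of the layering is visible here: MPI supplies only the communication-level primitives, while the checkpoint policy and the application state live entirely in the layer above.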


What role should MPI play in recovery? – cont'd

  • MPI is responsible for making the internal state of MPI consistent and usable by the application
  • The "application" is responsible for restoring application state

Layered Approach


CURRENT WORK


Active Members in the Current MPI Forum

Argonne NL, Bull, Cisco, Cray, Fujitsu, HDF Group, HLRS, HP, IBM, INRIA, Indiana U., Intel, Lawrence Berkeley NL, Livermore NL, Los Alamos NL, Mathworks, Microsoft, NCSA/UIUC, NEC, Oak Ridge NL, Ohio State U., Pacific NW NL, Qlogic, Sandia NL, SiCortex, Sun Microsystems, Tokyo Institute of Technology, U. Alabama Birmingham, U. Houston, U. Tennessee Knoxville, U. Tokyo


Motivating Examples


Collecting specific use case scenarios

  • Process failure – Client/Server, with the client a member of an inter-communicator
    – Client process fails
    – Server is notified of the failure
    – Server disconnects from the Client inter-communicator and continues to run
    – Client processes are terminated


Collecting specific use case scenarios

  • Process failure – Client/Server, with the client a member of an intra-communicator
    – Client process fails
    – Processes communicating with the failed process are notified of the failure
    – Application specifies the response to the failure:
      • Abort
      • Continue with reduced process count, with the missing process being labeled MPI_PROC_NULL in the communicator
      • Replace the failed process (why not allow increasing the size of the communicator?)
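The "continue with reduced process count" option leans on semantics MPI already defines: point-to-point communication with MPI_PROC_NULL succeeds immediately as a no-op. A sketch (requires an MPI implementation; it assumes the proposed behavior in which a failed neighbor's rank is subsequently reported as MPI_PROC_NULL, which is not standardized):

```c
#include <mpi.h>

/* 1-D halo exchange that keeps working if a neighbor has failed and,
 * under the proposal above, its rank now reads as MPI_PROC_NULL.
 * MPI_Sendrecv to/from MPI_PROC_NULL returns immediately as a no-op,
 * so surviving ranks need no special-case code on this path. */
void halo_exchange(MPI_Comm comm, int left, int right,
                   double *sendbuf, double *recvbuf, int n)
{
    MPI_Sendrecv(sendbuf, n, MPI_DOUBLE, right, /*tag=*/0,
                 recvbuf, n, MPI_DOUBLE, left,  /*tag=*/0,
                 comm, MPI_STATUS_IGNORE);
}
```

This illustrates why reusing MPI_PROC_NULL is attractive: the communication layer degrades gracefully, and only the numerics above it must decide what a missing neighbor means.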


Collecting specific use case scenarios

  • Process failure – Tightly coupled simulation, with independent layered libraries
    – Process fails
      • Example application: POP (ocean simulation code) using a conjugate gradient solver
    – Application specifies MPI's response to the failure


Design Details

  • Allow for local recovery when global recovery is not needed (scalability)
    – Collective communications are global in nature; therefore global recovery is required for continued use of collective communications
  • Delay the recovery response as much as possible
  • Allow for rapid and fully coordinated recovery


Design Details – cont'd

  • Key component: Communicator
    – A functioning communicator is required to continue MPI communications after failure
  • Disposition of active communicators:
    – Application specified
    – MPI_COMM_WORLD must be functional to continue
  • Handling of surviving processes:
    – MPI_Comm_rank does not change
    – MPI_Comm_size does not change


Design Details – cont'd

  • Handling of failed processes:
    – Replace
    – Discard (MPI_PROC_NULL)
  • Disposition of active communications:
    – With the failed process: discard
    – With other surviving processes: application defined, on a per-communicator basis


Error Reporting Mechanisms – current status

  • Errors are associated with communicators
  • By default, errors are returned from the affected MPI calls, and are returned synchronously

Example:
  ret1 = MPI_Isend(comm=MPI_COMM_WORLD, dest=3, ..., request=request3)
  (link to rank 3 fails)
  ret2 = MPI_Isend(comm=MPI_COMM_WORLD, dest=4, ..., request=request4)
  ret3 = MPI_Wait(request=request4)   // success
  ret4 = MPI_Wait(request=request3)   // error returned in ret4
One can then ask for more information about the failure.
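The same scenario rendered in actual MPI calls. A sketch: it requires an MPI implementation, assumes the link failure surfaces as a nonzero return code from the matching MPI_Wait, and sets MPI_ERRORS_RETURN so errors come back as return values instead of aborting the job (the standard's default handler, MPI_ERRORS_ARE_FATAL, would abort).

```c
#include <mpi.h>

/* Errors are associated with the communicator and returned synchronously
 * from the affected call: the failed send to rank 3 is reported by *its*
 * MPI_Wait, while the send to rank 4 completes normally. */
int isend_error_demo(int *buf3, int *buf4, int count, int tag)
{
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    MPI_Request req3, req4;
    MPI_Isend(buf3, count, MPI_INT, /*dest=*/3, tag, MPI_COMM_WORLD, &req3);
    /* ... link to rank 3 fails here ... */
    MPI_Isend(buf4, count, MPI_INT, /*dest=*/4, tag, MPI_COMM_WORLD, &req4);

    int ret3 = MPI_Wait(&req4, MPI_STATUS_IGNORE);   /* success */
    int ret4 = MPI_Wait(&req3, MPI_STATUS_IGNORE);   /* error returned in ret4 */
    (void)ret3;

    if (ret4 != MPI_SUCCESS) {
        /* Ask for more information about the failure. */
        char msg[MPI_MAX_ERROR_STRING];
        int len;
        MPI_Error_string(ret4, msg, &len);
    }
    return ret4;
}
```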


  • May request global event notification
    – Several open questions remain here


  • A collective call has been proposed to check on the status of the communicator
  • No extra cost is incurred for consistency checks unless requested (may be relevant for collective operations)
  • Provides the global communicator state just before the call is made


Examples of proposed API modifications – Surviving Processes

  • MPI_COMM_IRECOVER(ranks_to_restore, request, return_code)
    – IN  ranks_to_restore  array of ranks to restore (struct)
    – OUT request           request object (handle)
    – OUT return_code       return error code (integer)
  • MPI_COMM_IRECOVER_COLLECTIVE(ranks_to_restore, request, return_code)
    – IN  ranks_to_restore  array of ranks to restore (struct)
    – OUT request           request object (handle)
    – OUT return_code       return error code (integer)
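A hedged sketch of how a surviving process might drive the proposed MPI_COMM_IRECOVER. The call is a proposal only and exists in no implementation; mpix_rank_t, its fields, and the exact C binding are invented here purely to match the parameter list above.

```c
/* Hypothetical: restore ranks 3 and 7 after their processes failed.
 * MPI_COMM_IRECOVER is the *proposed* call above; mpix_rank_t is an
 * invented stand-in for the "array of ranks to restore (struct)". */
mpix_rank_t ranks[2] = { { .rank = 3 }, { .rank = 7 } };
MPI_Request req;
int rc;

MPI_COMM_IRECOVER(ranks, &req, &rc);   /* nonblocking: recovery proceeds... */
do_useful_application_work();          /* ...overlapped with computation    */
MPI_Wait(&req, MPI_STATUS_IGNORE);     /* completes when the ranks are back */
```

The nonblocking shape mirrors the design principle stated earlier: delay and overlap the recovery response where possible, with MPI_COMM_IRECOVER_COLLECTIVE available when fully coordinated recovery is wanted instead.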


Examples of proposed API modifications – Restored Processes

  • MPI_RESTORED_PROCESS(generation, return_code)
    – OUT generation   process generation (integer)
    – OUT return_code  return error code (integer)
  • MPI_GET_LOST_COMMUNICATORS(comm_names, count, return_code)
    – OUT comm_names   array of communicators that may be restored (strings)
    – OUT count        number of communicators that may be restored (integer)
    – OUT return_code  return error code (integer)

Examples of proposed API modifications – Restored Processes – cont'd

  • MPI_COMM_REJOIN(comm_names, comm, return_code)
    – IN  comm_names   communicator name (string)
    – OUT comm         communicator (handle)
    – OUT return_code  return error code (integer)
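Putting the three proposed restored-process calls together, a restarted rank's startup path might look roughly like this. Again a sketch of the proposal, not an existing API: the MAX_COMMS bound, the meaning of generation 0 as "original launch", and the C bindings are all assumptions layered on the parameter lists above.

```c
/* Hypothetical startup path of a process restored after a failure. */
int generation, rc, count;
char *comm_names[MAX_COMMS];           /* MAX_COMMS: invented bound */

MPI_RESTORED_PROCESS(&generation, &rc);
if (generation > 0) {                  /* assumed: 0 = original launch */
    /* Which communicators did this process lose when it failed? */
    MPI_GET_LOST_COMMUNICATORS(comm_names, &count, &rc);

    /* Rejoin each one by name; ranks and sizes of the surviving
     * processes are unchanged, per the design details above. */
    for (int i = 0; i < count; i++) {
        MPI_Comm comm;
        MPI_COMM_REJOIN(comm_names[i], &comm, &rc);
    }
    /* Application-level state recovery happens above this layer. */
}
```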


Open Questions

  • How heavy-handed do we need to be at the standard-specification level to recover from failure in the middle of a collective operation? Is this more than an implementation issue? (Performance is the fly in the ointment.)
  • What is the impact of "repairing" a communicator on the implementation of collective algorithms (do we have to pay the cost all the time?)


Is this All?

  • Other aspects of fault tolerance:
    – Network
    – Checkpoint/Restart
    – File I/O
  • End-to-end solution


For involvement in the process see:

meetings.mpi-forum.org



Backup slides