MANA for MPI MPI-Agnostic Network-Agnostic Transparent Checkpointing - PowerPoint PPT Presentation

MANA for MPI MPI-Agnostic Network-Agnostic Transparent Checkpointing Rohan Garg, *Gregory Price, and Gene Cooperman Northeastern University

Why checkpoint, and why transparently?
Whether for maintenance, analysis, time-sharing, load balancing, or fault tolerance, HPC developers require the ability to suspend and resume computations.
Two general forms of checkpointing solutions:
1. Transparent: no or low development overhead
2. Application-specific: moderate to high development overhead
HPC applications exist on a spectrum. Developers apply technologies based on where they live in that spectrum.

Puzzle
Can you solve checkpointing on Cray MPI over InfiniBand (8 nodes, 2 cores/ranks per node) and restart on MPICH over TCP/IP (4 nodes, 4 cores/ranks per node)?
[Figure: ranks 1 to 16 laid out across shared-memory nodes on each cluster; the rank-to-node placement differs between the checkpoint side and the restart side.]

Cross-Cluster Migration
It is now possible to checkpoint on Cray MPI over InfiniBand (8 nodes, 2 cores/ranks per node) and restart on MPICH over TCP/IP (4 nodes, 4 cores/ranks per node).
[Figure: ranks 1 to 16 laid out across shared-memory nodes on each cluster; the rank-to-node placement differs between the checkpoint side and the restart side.]

The Problem: How do we best transparently checkpoint an MPI library?
The Answer: Don’t. :]

HPC Checkpointing Spectrum
Low vs. high end: defined by level of effort, funding, and time frame.
[Figure: a spectrum. Low end: short term, low investment, transparent checkpointing, ready-made solution, limit cost/effort. High end: long term, high investment, hand-rolled solution, maximize results.]
Terms of the project dictate the technology employed.

Transparency and Agnosticism
Transparency
1. No re-compilation and no re-linking of application
2. No re-compilation of MPI
3. No special transport stack or drivers
Agnosticism
1. Works with any libc or Linux kernel
2. Works with any MPI implementation (MPICH, Cray MPI, etc.)
3. Works with any network stack (Ethernet, InfiniBand, Omni-Path, etc.)

Alas, poor transparency, I knew him, Horatio...
Transparent checkpointing could die a slow, painful death.
1. Open MPI Checkpoint-Restart service (Network Agnostic; cf. Hursey et al.)
   ○ MPI implementation provides checkpoint service to the application.
2. BLCR
   ○ Utilizes kernel module to checkpoint local MPI ranks.
3. DMTCP (MPI Agnostic)
   ○ External program that wraps MPI for checkpointing.
These, and others, have run up against a wall: MAINTENANCE

The M x N maintenance penalty
MPI:
● MPICH
● Open MPI
● LAM-MPI
● Cray MPI
● HP MPI
● IBM MPI
● SGI MPI
● MPI-BIP
● POWER-MPI
● ….
Interconnect:
● Ethernet
● InfiniBand
● InfiniBand + Mellanox
● Cray GNI
● Intel Omni-Path
● libfabric
● System V Shared Memory
● 115200 baud serial
● Carrier Pigeon
● ….
[Slide builds: the label "Network Agnostic" is added first, then "MPI and Network Agnostic".]

MANA: MPI-Agnostic, Network-Agnostic
The problem stems from checkpointing MPI: both the coordinator and the library.
[Figure: an MPI coordinator and two nodes (Node 1, Node 2), each running two MPI ranks. State called out in the diagram: connections, groups, communicators, link state.]

Achieving Agnosticism
Step 1: Drain the Network
[Figure: an MPI coordinator and two nodes (Node 1, Node 2), each running two MPI ranks, labeled "Chandy-Lamport Algorithm".]
As demonstrated by Hursey et al., abstracting by “MPI Messages” allows for network agnosticism.

Inspired by Chandy-Lamport
Chandy-Lamport: a common mechanism for recording a consistent global state.
Usage is established among MPI checkpointing solutions (e.g. Hursey et al.).
1. Count the number of messages sent
2. Count the number of messages received or drained
3. When the two counts are equal, the network is drained and safe to checkpoint.
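
The counting rule on this slide can be sketched in a few lines of Python. This is a toy model under stated assumptions, not MANA's implementation: the `Rank` class, its in-memory `inbox`, and the helper names `network_is_drained` and `drain_network` are all hypothetical stand-ins for real MPI point-to-point traffic.

```python
from collections import deque


class Rank:
    """Hypothetical stand-in for one MPI rank's send/receive bookkeeping."""

    def __init__(self, rank_id):
        self.rank_id = rank_id
        self.sent = 0          # messages this rank has sent
        self.received = 0      # messages this rank has received or drained
        self.inbox = deque()   # messages still "in the network" for this rank
        self.drained = []      # messages pulled off the network at checkpoint

    def send(self, dest, payload):
        dest.inbox.append((self.rank_id, payload))
        self.sent += 1

    def recv(self):
        self.received += 1
        return self.inbox.popleft()

    def drain(self):
        # Pull every in-flight message into a local buffer so it can be
        # saved with the checkpoint and delivered after restart.
        while self.inbox:
            self.drained.append(self.inbox.popleft())
            self.received += 1


def network_is_drained(ranks):
    """True when the global sent count equals the received-or-drained count."""
    return sum(r.sent for r in ranks) == sum(r.received for r in ranks)


def drain_network(ranks):
    """Drain until the counters match, then return the buffered messages."""
    while not network_is_drained(ranks):
        for r in ranks:
            r.drain()
    return {r.rank_id: list(r.drained) for r in ranks}
```

In a real run the counts would be gathered across processes by a coordinator; here they are simply summed in one process to keep the sketch self-contained.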

Checkpointing Message Operations
● Apply Chandy-Lamport outside the MPI library, checkpointing MPI API calls.
● Can be naively applied to point-to-point communications
  ○ Send, Recv, Isend, Irecv, etc.
● Collectives (Scatter / Gather) could not be naively supported
  ○ Collectives can produce un-recordable MPI library and network events.
  ○ Can cause straggler and starvation issues when applied naively
[Figure: Rank 1 inside collective, Rank 2 straggler, Rank 3 inside collective.]

Checkpointing Collective Operations
Solution: Two-phase collectives
1. Preface all collectives with a trivial barrier
2. When the trivial barrier is completed, call the original collective
[Figure, built up over several slides: Ranks 1 to 3 on a timeline marked "Trivial Barrier", "Collective Begins", and "Collective Complete". While Rank 2 is a straggler, Ranks 1 and 3 wait inside the trivial barrier; all three ranks then run the original collective. The span of the original collective is marked "Checkpoint Disabled".]
This prevents deadlock conditions. (Additional logic to avoid starvation.)
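
The two-phase idea can be illustrated with threads standing in for ranks. This is a minimal sketch, assuming a hypothetical `TwoPhaseComm` class and a toy sum-allreduce as the "original collective"; it is not MANA's wrapper code, and it omits the extra coordinator logic the slides mention for starvation and for races between leaving the barrier and flagging the collective.

```python
import threading


class TwoPhaseComm:
    """Hypothetical wrapper: trivial barrier first, then the real collective."""

    def __init__(self, n_ranks):
        self._trivial = threading.Barrier(n_ranks)  # phase 1: trivial barrier
        self._sync = threading.Barrier(n_ranks)     # used inside the toy collective
        self._lock = threading.Lock()
        self._contrib = [0] * n_ranks
        self.phase = ["idle"] * n_ranks
        self.events = []                            # (rank, phase) in order

    def _set_phase(self, rank, phase):
        with self._lock:
            self.phase[rank] = phase
            self.events.append((rank, phase))

    def safe_to_checkpoint(self):
        # Checkpointing is disabled while any rank is in the real collective.
        with self._lock:
            return "collective" not in self.phase

    def allreduce_sum(self, rank, value):
        # Phase 1: wait here until every rank, stragglers included, arrives.
        self._set_phase(rank, "barrier")
        self._trivial.wait()
        # Phase 2: all ranks are present, so the real collective cannot stall.
        self._set_phase(rank, "collective")
        self._contrib[rank] = value
        self._sync.wait()            # all contributions are written
        total = sum(self._contrib)
        self._sync.wait()            # all ranks have read the result
        self._set_phase(rank, "idle")
        return total
```

Because no rank enters the original collective until every rank has reached the trivial barrier, a checkpoint request that arrives while a straggler is still outside only ever finds ranks waiting in the barrier, which is a state the checkpointer can interrupt.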

Achieving Agnosticism
Step 2: Discard the Network
[Figure: an MPI coordinator and two nodes (Node 1, Node 2), each running two MPI ranks.]

Checkpointing a Rank
Solution: Isolation
Checkpointing the rank is simpler… right?
[Figure: an MPI rank as a stack: MPI Application, MPI Library, LIBC and friends.]
Problems:
● MPI implementation specific
● Grouping information
● Contains MPI network state
● Opaque MPI objects
● Required by MPI and application
● Heap allocations
● Platform dependent

Isolation: The “Split-Process” Approach
Terminology
● Single memory space
● Upper-half program (checkpoint and restore): MPI Application
● Lower-half program (discard and re-initialize): MPI Proxy Library, MPI Library, LIBC and friends
● Standard C calling conventions between the halves; no RPC involved

Re-initializing the Network
Runtime
● Record configuration calls
● Initialize, grouping, etc.
Checkpoint
● Drain network
Restart
● Replay configuration
● Buffer drained messages
[Figure: MPI Application with Config and Drain Info, above the MPI Proxy Library and the MPI Library (grouping information, MPI network state, opaque MPI objects), above LIBC and friends.]
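
A record-and-replay scheme of this kind might look like the sketch below. Everything here is an assumption made for illustration: `LowerHalf` is a fake MPI library whose handles differ between instances, `Proxy` holds the recorded calls and drained messages, and `comm_create` is a made-up configuration call, not a real MPI function signature.

```python
import itertools


class LowerHalf:
    """Hypothetical stand-in for a real MPI library. Its handles are opaque
    and differ between instances, like real communicator handles."""

    def __init__(self, handle_base):
        self._handles = itertools.count(handle_base)
        self.groups = {}

    def comm_create(self, members):
        handle = next(self._handles)
        self.groups[handle] = tuple(members)
        return handle


class Proxy:
    """Upper-half bookkeeping that survives a checkpoint; LowerHalf does not."""

    def __init__(self, lower):
        self.lower = lower
        self.config_log = []        # recorded configuration calls
        self.virtual_to_real = {}   # stable virtual id -> current real handle
        self.drained = []           # messages buffered when the network drained

    def comm_create(self, members):
        # Runtime: record the call and hand the application a virtual id.
        virtual_id = len(self.config_log)
        self.config_log.append(("comm_create", tuple(members)))
        self.virtual_to_real[virtual_id] = self.lower.comm_create(members)
        return virtual_id

    def restart(self, new_lower):
        # Restart: replay every recorded call against the fresh library.
        self.lower = new_lower
        self.virtual_to_real = {}
        for virtual_id, (op, members) in enumerate(self.config_log):
            self.virtual_to_real[virtual_id] = getattr(new_lower, op)(members)

    def recv(self, from_network):
        # Deliver messages buffered during the drain before using the network.
        if self.drained:
            return self.drained.pop(0)
        return from_network()
```

The application only ever holds virtual ids, so the real handles behind them can change freely when the lower half is re-initialized.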

Isolation
Problem: the heap is a shared resource.
MANA interposes on sbrk and malloc to control where allocations occur.
[Figure: Upper half (persistent data): MPI Application, Config and Drain Info, LIBC and friends. Lower half (ephemeral data): MPI Proxy Library, MPI Library, LIBC and friends.]
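
To make the heap problem concrete, the sketch below models two disjoint arenas, one per half, so that only the upper-half range needs to be saved. It is a simulation under loud assumptions: the `Arena` and `SplitHeap` classes and the address ranges are invented for illustration, and real interposition on sbrk and malloc happens in native code, not in Python.

```python
class Arena:
    """A bump allocator over a fixed, hypothetical address range."""

    def __init__(self, base, size):
        self.base = base
        self.limit = base + size
        self.brk = base  # current "program break" of this arena

    def sbrk(self, increment):
        # Modeled on sbrk: return the old break, then advance it.
        if increment < 0 or self.brk + increment > self.limit:
            raise MemoryError("arena exhausted")
        old = self.brk
        self.brk += increment
        return old


class SplitHeap:
    """Routes each allocation to the arena of the half that requested it."""

    def __init__(self):
        # Address ranges are made up for illustration.
        self.arenas = {
            "upper": Arena(base=0x10000000, size=0x1000000),
            "lower": Arena(base=0x20000000, size=0x1000000),
        }

    def alloc(self, half, nbytes):
        return self.arenas[half].sbrk(nbytes)

    def regions_to_checkpoint(self):
        # Only upper-half memory is persistent; the lower half is discarded.
        upper = self.arenas["upper"]
        return [(upper.base, upper.brk)]
```

Keeping the two halves in separate ranges means lower-half allocations never end up interleaved with application data that must survive the checkpoint.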

MPI Agnosticism Achieved
[Figure: Upper half (persistent data): MPI Application, Config and Drain Info, LIBC and friends. Lower half (ephemeral data): MPI Proxy Library, MPI Library, LIBC and friends.]
Lower-half data can be replaced by new and different implementations of MPI and related libraries.
*Special care must be taken when replacing upper-half libraries.

Checkpoint Process
Step 1: Drain the Network
Step 2: Checkpoint Upper-Half
[Figure: an MPI coordinator and two nodes (Node 1, Node 2), each running two MPI ranks; within a rank, the upper half is shown as MPI Application, Config and Drain Info, LIBC and friends.]

Restart Process
Step 1: Restore Lower-Half
[Figure: lower half: MPI Proxy Library, MPI Library, LIBC and friends.]
Lower-half components may be replaced.
