MANA for MPI: MPI-Agnostic Network-Agnostic Transparent Checkpointing

slide-1
SLIDE 1

MANA for MPI

MPI-Agnostic Network-Agnostic Transparent Checkpointing

Rohan Garg, *Gregory Price, and Gene Cooperman

Northeastern University

slide-2
SLIDE 2

Why checkpoint, and why transparently?

Whether for maintenance, analysis, time-sharing, load balancing, or fault tolerance, HPC developers require the ability to suspend and resume computations.

Two general forms of checkpointing solutions:

1. Transparent

  • No or Low development overhead

2. Application-specific

  • Moderate to High development overhead

HPC applications exist on a spectrum. Developers apply technologies based on where they live in that spectrum.

slide-3
SLIDE 3

Puzzle

Can you solve checkpointing on Cray MPI over InfiniBand
and restarting on MPICH over TCP/IP?

[Figure: 16 ranks, numbered 1 to 16, shown on 4 nodes with 4 cores/ranks per node and again on 8 nodes with 2 cores/ranks per node; ranks within a node use shared memory.]
slide-4
SLIDE 4

Cross-Cluster Migration

It is now possible to checkpoint on Cray MPI over InfiniBand
and restart on MPICH over TCP/IP.

[Figure: 16 ranks, numbered 1 to 16, shown on 4 nodes with 4 cores/ranks per node and again on 8 nodes with 2 cores/ranks per node; ranks within a node use shared memory.]
slide-5
SLIDE 5

The Problem: How do we best transparently checkpoint an MPI library?

The Answer: Don’t. :]

slide-6
SLIDE 6

HPC Checkpointing Spectrum

Low vs. High End: Defined by level of effort, funding, and time frame.

  • Short term vs. long term
  • Low investment vs. high investment
  • Ready-made solution vs. hand-rolled solution
  • Limit cost / effort vs. maximize results

Terms of the project dictate the technology employed.

[Figure label: Transparent Checkpointing]

slide-7
SLIDE 7

Transparency and Agnosticism

Transparency

1. No re-compilation and no re-linking of application
2. No re-compilation of MPI
3. No special transport stack or drivers

Agnosticism

1. Works with any libc or Linux kernel
2. Works with any MPI implementation (MPICH, Cray MPI, etc.)
3. Works with any network stack (Ethernet, InfiniBand, Omni-Path, etc.)

slide-8
SLIDE 8

Alas, poor transparency, I knew him Horatio...

Transparent checkpointing could die a slow, painful death.

1. Open MPI Checkpoint-Restart service (Network Agnostic; cf. Hursey et al.)

○ MPI implementation provides checkpoint service to the application.

2. BLCR

○ Utilizes kernel module to checkpoint local MPI ranks

3. DMTCP (MPI Agnostic)

○ External program that wraps MPI for checkpointing.

These, and others, have run up against a wall:

MAINTENANCE

slide-9
SLIDE 9

The M x N maintenance penalty

MPI:

  • MPICH
  • OPEN MPI
  • LAM-MPI
  • CRAY MPI
  • HP MPI
  • IBM MPI
  • SGI MPI
  • MPI-BIP
  • POWER-MPI
  • ….

Interconnect:

  • Ethernet
  • InfiniBand
  • InfiniBand + Mellanox
  • Cray GNI
  • Intel Omni-path
  • libfabric
  • System V Shared Memory
  • 115200 baud serial
  • Carrier Pigeon
  • ….
slide-10
SLIDE 10

The M x N maintenance penalty

MPI:

  • MPICH
  • OPEN-MPI
  • LAM-MPI
  • CRAY MPI
  • HP MPI
  • IBM MPI
  • SGI MPI
  • MPI-BIP
  • POWER-MPI
  • ….

Interconnect:

  • Ethernet
  • InfiniBand
  • InfiniBand + Mellanox
  • Cray GNI
  • Intel Omni-path
  • libfabric
  • System V Shared Memory
  • 115200 baud serial
  • Carrier Pigeon
  • ….
Network Agnostic
slide-11
SLIDE 11

The M x N maintenance penalty

MPI:

  • MPICH
  • OPEN-MPI
  • LAM-MPI
  • CRAY MPI
  • HP MPI
  • IBM MPI
  • SGI MPI
  • MPI-BIP
  • POWER-MPI
  • ….

Interconnect:

  • Ethernet
  • InfiniBand
  • InfiniBand + Mellanox
  • Cray GNI
  • Intel Omni-path
  • libfabric
  • System V Shared Memory
  • 115200 baud serial
  • Carrier Pigeon
  • ….
MPI and Network Agnostic
slide-12
SLIDE 12

The problem stems from checkpointing both the MPI coordinator and the MPI lib.

MANA: MPI-Agnostic, Network-Agnostic

[Figure: an MPI coordinator and four MPI ranks on Node 1 and Node 2.]
slide-13
SLIDE 13

The problem stems from checkpointing MPI: both the coordinator and the library.

[Figure labels: Connections, Groups, Communicators, Link State]

MANA: MPI-Agnostic, Network-Agnostic

[Figure: an MPI coordinator and four MPI ranks on Node 1 and Node 2.]
slide-14
SLIDE 14

Step 1: Drain the Network

Achieving Agnosticism

[Figure: an MPI coordinator and four MPI ranks on Node 1 and Node 2.]

Chandy-Lamport Algorithm

As demonstrated by Hursey et al., abstracting by “MPI Messages” allows for network agnosticism.
slide-15
SLIDE 15

Inspired by Chandy-Lamport

Chandy-Lamport: a common mechanism for recording a consistent global state.

Its use is established among MPI checkpointing solutions (e.g. Hursey et al.).

1. Count the number of messages sent
2. Count the number of messages received or drained
3. When the two counts are equal, the network is drained and it is safe to checkpoint.
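The counting rule can be sketched with a toy simulation. This is only an illustration under stated assumptions, not MANA's code: the `Rank` class, the `send`/`recv` helpers, and the in-memory inboxes are hypothetical stand-ins for MPI point-to-point calls and the network.

```python
from collections import deque


class Rank:
    """Hypothetical stand-in for one MPI rank with send/receive counters."""

    def __init__(self, rank_id):
        self.rank_id = rank_id
        self.sent = 0          # messages this rank has sent
        self.received = 0      # messages this rank has received or drained
        self.inbox = deque()   # messages still "in the network" for this rank
        self.drained = []      # messages pulled off the network at checkpoint time


def send(src, dst, payload):
    """Deliver a message into dst's inbox and count it at the sender."""
    dst.inbox.append((src.rank_id, payload))
    src.sent += 1


def recv(rank):
    """Take the oldest pending message and count it at the receiver."""
    rank.received += 1
    return rank.inbox.popleft()


def network_is_drained(ranks):
    """Drained when the global send count equals the global receive count."""
    return sum(r.sent for r in ranks) == sum(r.received for r in ranks)


def drain(ranks):
    """Receive every pending message into a buffer until the counts match."""
    while not network_is_drained(ranks):
        for r in ranks:
            while r.inbox:
                r.drained.append(recv(r))
```

A usage example: after two sends and one ordinary receive, `drain` buffers the one remaining message, and only then does `network_is_drained` report that it is safe to checkpoint.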

slide-16
SLIDE 16

Checkpointing Message Operations

  • Apply Chandy-Lamport outside the MPI library, checkpointing MPI API calls.
  • Can be naively applied to point-to-point communications
○ Send, Recv, Isend, Irecv, etc.
  • Collectives (Scatter / Gather) could not be naively supported
○ Collectives can produce un-recordable MPI library and network events.
○ Can cause straggler and starvation issues when applied naively

[Figure: Ranks 1, 2, and 3; two ranks are marked "Inside Collective" and one is marked "Straggler".]
slide-17
SLIDE 17

Checkpointing Collective Operations

Solution: Two-phase collectives

1. Preface all collectives with a trivial barrier
2. When the trivial barrier is completed, call the original collective

[Figure: Ranks 1, 2, and 3 across two phases, "Trivial Barrier" and "Collective"; two ranks are marked "Inside Barrier" and one is marked "Straggler".]
slide-18
SLIDE 18

Checkpointing Collective Operations

Solution: Two-phase collectives

1. Preface all collectives with a trivial barrier
2. When the trivial barrier is completed, call the original collective

[Figure: Ranks 1, 2, and 3 across two phases, "Trivial Barrier" and "Collective"; all three ranks are marked "Original Collective".]
slide-19
SLIDE 19

Checkpointing Collective Operations

Solution: Two-phase collectives

1. Preface all collectives with a trivial barrier
2. When the trivial barrier is completed, call the original collective

[Figure: Ranks 1, 2, and 3 across two phases, "Trivial Barrier" and "Collective", ending at "Collective Complete".]
slide-20
SLIDE 20

Checkpointing Collective Operations

Solution: Two-phase collectives

[Figure: Ranks 1, 2, and 3 with the markers "Trivial Barrier", "Collective Begins", and "Collective Complete", and a region labeled "Checkpoint Disabled".]
slide-21
SLIDE 21

Checkpointing Collective Operations

Solution: Two-phase collectives

This prevents deadlock conditions.

[Figure: Ranks 1, 2, and 3 with the markers "Trivial Barrier", "Collective Begins", and "Collective Complete", and a region labeled "Checkpoint Disabled".]
slide-22
SLIDE 22

Checkpointing Collective Operations

Solution: Two-phase collectives

This prevents deadlock conditions. (Additional logic is needed to avoid starvation.)

[Figure: Ranks 1, 2, and 3 with the markers "Trivial Barrier", "Collective Begins", and "Collective Complete", and a region labeled "Checkpoint Disabled".]
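A minimal sketch of the two-phase idea, using threads in place of MPI ranks. Everything here is hypothetical illustration: `TwoPhaseCollective`, `allreduce_sum`, and `run_ranks` are invented names, and the sketch assumes that the "Checkpoint Disabled" region corresponds to the time ranks spend inside the original collective. It does not include the additional starvation-avoidance logic mentioned on the slide.

```python
import threading


class TwoPhaseCollective:
    """Toy wrapper around a sum-reduction; not MANA's actual implementation."""

    def __init__(self, n_ranks):
        self.trivial_barrier = threading.Barrier(n_ranks)   # phase 1
        self.collective_sync = threading.Barrier(n_ranks)   # inside the "real" collective
        self.lock = threading.Lock()
        self.contributions = []
        self.ranks_in_collective = 0

    def checkpoint_allowed(self):
        """In this sketch, checkpointing is disabled while any rank is in phase 2."""
        with self.lock:
            return self.ranks_in_collective == 0

    def allreduce_sum(self, value):
        # Phase 1: trivial barrier. A rank waiting here has not yet entered
        # the original collective, so no collective state exists to record.
        self.trivial_barrier.wait()

        # Phase 2: the original collective. All ranks are known to have
        # arrived, so none of them can be left waiting on a straggler.
        with self.lock:
            self.ranks_in_collective += 1
            self.contributions.append(value)
        self.collective_sync.wait()
        with self.lock:
            total = sum(self.contributions)
            self.ranks_in_collective -= 1
        return total


def run_ranks(values):
    """Run one thread per simulated rank and collect each rank's result."""
    coll = TwoPhaseCollective(len(values))
    results = [None] * len(values)

    def rank_main(rank):
        results[rank] = coll.allreduce_sum(values[rank])

    threads = [threading.Thread(target=rank_main, args=(r,)) for r in range(len(values))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, coll.checkpoint_allowed()
```

The design point the sketch tries to capture is that a checkpoint request arriving while ranks wait in the trivial barrier can proceed, whereas one arriving during phase 2 waits until the collective completes.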
slide-23
SLIDE 23

Step 2: Discard the network

Achieving Agnosticism

[Figure: an MPI coordinator and four MPI ranks on Node 1 and Node 2.]
slide-24
SLIDE 24

Problems:
  • MPI Implementation Specific
  • Contains MPI network state

Solution: Isolation

Checkpointing the rank is simpler… right?

Checkpointing A Rank

[Figure: an MPI rank made up of the MPI Application, the MPI Library, and LIBC and friends, annotated with the callouts below.]
  • Required by MPI and Application
  • Platform dependent
  • Grouping information
  • Opaque MPI Objects
  • Heap Allocations
slide-25
SLIDE 25

Isolation - The “Split-Process” Approach

[Figure labels: MPI Application, MPI Library, MPI Proxy Library, MPI Library, LIBC and friends]

Terminology
  • Upper-Half program: Checkpoint and Restore
  • Lower-Half program: Discard and Re-initialize

Properties
  • Single Memory Space
  • Standard C Calling Conventions
  • No RPC involved
slide-26
SLIDE 26

Re-initializing the network

[Figure labels: MPI Application, Config and Drain Info, MPI Library, MPI Proxy Library, LIBC and friends]

  • Contains MPI network state
  • Grouping information
  • Opaque MPI Objects

Runtime
  • Record Configuration Calls
  • Initialize, Grouping, etc.

Checkpoint
  • Drain Network

Restart
  • Replay Configuration
  • Buffer Drained Messages
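The record-and-replay steps above can be sketched as follows. This is a hypothetical illustration only: `FakeMpiLibrary` stands in for a lower-half MPI library, `Proxy` for the proxy layer, and the method names `init` and `comm_create` are invented for the example rather than taken from MPI or MANA.

```python
class FakeMpiLibrary:
    """Hypothetical lower half: its state is discarded at checkpoint time."""

    def __init__(self, name):
        self.name = name
        self.initialized = False
        self.communicators = {}

    def init(self):
        self.initialized = True

    def comm_create(self, comm_name, members):
        self.communicators[comm_name] = tuple(members)


class Proxy:
    """Hypothetical proxy that records configuration calls in the upper half."""

    def __init__(self, lower):
        self.lower = lower
        self.config_log = []   # upper-half data: saved in the checkpoint

    def call(self, fn_name, *args):
        # Runtime: record the configuration call, then forward it.
        self.config_log.append((fn_name, args))
        return getattr(self.lower, fn_name)(*args)

    def restart(self, new_lower):
        # Restart: attach a fresh lower half and replay the recorded calls
        # so that it rebuilds equivalent state (initialization, grouping).
        self.lower = new_lower
        for fn_name, args in self.config_log:
            getattr(self.lower, fn_name)(*args)
```

For example, a log recorded against one library instance can be replayed against a different one at restart, which is the property that lets the lower half be swapped.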
slide-27
SLIDE 27

Isolation

Problem: Heap is a shared resource

  • Upper Half (Persistent Data): MPI Application, Config and Drain Info, LIBC and friends
  • Lower Half (Ephemeral Data): MPI Library, MPI Proxy Library, LIBC and friends

MANA interposes on sbrk and malloc to control where allocations occur.
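As a rough illustration of why allocations need to be routed by half, the toy model below tags each allocation with the half that was executing when it was made. The `SplitHeap` class and its methods are hypothetical; the real mechanism interposes on sbrk and malloc in a native process, which this sketch does not attempt to reproduce.

```python
from contextlib import contextmanager


class SplitHeap:
    """Toy model of a heap split into upper-half and lower-half arenas."""

    def __init__(self):
        self.arenas = {"upper": [], "lower": []}
        self.active_half = "upper"   # application code runs in the upper half

    @contextmanager
    def in_lower_half(self):
        """Mark that control has passed into the MPI library (lower half)."""
        previous = self.active_half
        self.active_half = "lower"
        try:
            yield
        finally:
            self.active_half = previous

    def malloc(self, label):
        """Place the allocation in the arena of the currently active half."""
        self.arenas[self.active_half].append(label)
        return label

    def checkpoint_image(self):
        """Only upper-half (persistent) allocations go into the checkpoint."""
        return list(self.arenas["upper"])
```

In this model, an allocation made while inside `in_lower_half()` never appears in `checkpoint_image()`, mirroring the slide's split between persistent and ephemeral data.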
slide-28
SLIDE 28

MPI Agnosticism Achieved

  • Upper Half (Persistent Data): MPI Application, Config and Drain Info, LIBC and friends
  • Lower Half (Ephemeral Data): MPI Library, MPI Proxy Library, LIBC and friends
slide-29
SLIDE 29

MPI Agnosticism Achieved

  • Upper Half (Persistent Data): MPI Application, Config and Drain Info, LIBC and friends
  • Lower Half (Ephemeral Data)

Lower-half data can be replaced by new and different implementations of MPI and related libraries.

*Special care must be taken when replacing upper-half libraries.
slide-30
SLIDE 30

Step 1: Drain the Network

Checkpoint Process

[Figure: an MPI coordinator and four MPI ranks on Node 1 and Node 2.]
slide-31
SLIDE 31

Step 1: Drain the Network
Step 2: Checkpoint Upper-Half

Checkpoint Process

[Figure: an MPI rank whose upper half holds the MPI Application, Config and Drain Info, and LIBC and friends.]
slide-32
SLIDE 32

Step 1: Restore Lower-Half

[Figure labels: MPI Library, MPI Proxy Library, LIBC and friends]

Restart Process

Lower-half components may be replaced
slide-33
SLIDE 33

Step 1: Restore Lower-Half
Step 2: Re-initialize MPI

Restart Process

  • MPI_Init
  • Replay Configuration

[Figure labels: Naturally Optimized; MPI Library, MPI Proxy Library, LIBC and friends]

Lower-half components may be replaced
slide-34
SLIDE 34

Step 1: Restore Lower-Half
Step 2: Re-initialize MPI
Step 3: Restore Upper-Half

Restart Process

  • MPI_Init
  • Replay Configuration

[Figure: an MPI rank with a lower half (MPI Library, MPI Proxy Library, LIBC and friends; labeled "Naturally Optimized") and an upper half (MPI Application, Config and Drain Info, LIBC and friends).]

The MPI rank number assigned by MPI_Init is used to select the checkpoint file for restoring the upper half. This avoids the need to virtualize MPI rank numbers.

Lower-half components may be replaced
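The rank-to-file selection can be pictured with a small sketch. The directory name, the file-name pattern, and both function names are hypothetical, and a dictionary stands in for checkpoint images on disk; the only point illustrated is that the rank number handed out at re-initialization picks the image to load.

```python
import os


def checkpoint_file_for(rank, ckpt_dir="ckpt"):
    """Map a rank number to its checkpoint image path (naming is hypothetical)."""
    return os.path.join(ckpt_dir, f"ckpt_rank_{rank}.img")


def restore_upper_half(rank_from_mpi_init, images):
    """Pick the upper-half image matching the rank assigned at restart.

    `images` is a dict standing in for files on disk. Because the new
    process restores whichever image matches its freshly assigned rank,
    the restored application sees a rank number that is already correct.
    """
    return images[checkpoint_file_for(rank_from_mpi_init)]
```

For instance, with images saved for ranks 0 to 3, the process that receives rank 2 after re-initialization restores the image saved by the original rank 2.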
slide-35
SLIDE 35

How to transparently checkpoint MPI App+MPI Lib?

Answer:

Don’t Checkpoint the MPI Library

[Figure labels: MPI Application, Config and Drain Info, LIBC and friends]
slide-36
SLIDE 36

Puzzle

Can you solve checkpointing on Cray MPI over InfiniBand
and restarting on MPICH over TCP/IP?

[Figure: 16 ranks, numbered 1 to 16, shown on 4 nodes with 4 cores/ranks per node and again on 8 nodes with 2 cores/ranks per node.]

YES

slide-37
SLIDE 37

NEW: Cross-Cluster MPI Application Migration

Traditionally, migration across disparate clusters was not feasible.

  • Different MPI packages across clusters
  • Highly optimized configurations tied to local cluster (Caches, Cores/Node)
  • Overhead of checkpointing entire MPI state is prohibitive

Overhead of migrating under MANA:

  • 1.6% runtime overhead after migration.*
* Linux kernel 5.3 patch https://lwn.net/Articles/769355/ reduces overhead to 0.6%
slide-38
SLIDE 38

But what about single-cluster overhead?

Application Benchmarks:

  • miniFE, HPCG
○ nearly 0% runtime overhead
  • GROMACS, CLAMR, LULESH
○ 0.6% runtime overhead*

Memory Overhead

  • Copied upper-half system libraries: static 26MB on all experiments
  • Reduction in overall checkpointed data due to discarding lower-half memory.
* requires Linux kernel patch https://lwn.net/Articles/769355/
slide-39
SLIDE 39

Checkpoint-Restart Overhead

Checkpoint Data Size

  • GROMACS - 64 Ranks over 2 Nodes: 5.9GB
  • HPCG - 2048 ranks over 64 nodes: 4TB
  • Largely dominated by memory used by benchmark program.
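As a back-of-the-envelope check on these sizes, the snippet below divides each figure by the rank count. It assumes the quoted sizes are totals across all ranks and uses decimal units (1 TB = 1000 GB); neither assumption is stated on the slide.

```python
def per_rank_size_gb(total_gb, n_ranks):
    """Average checkpoint size per rank, in GB, assuming total_gb covers all ranks."""
    return total_gb / n_ranks


# GROMACS: 5.9 GB over 64 ranks, roughly 0.09 GB (about 92 MB) per rank
gromacs_per_rank = per_rank_size_gb(5.9, 64)

# HPCG: 4 TB over 2048 ranks, roughly 1.95 GB per rank
hpcg_per_rank = per_rank_size_gb(4000.0, 2048)
```

Under those assumptions the per-rank image differs by more than an order of magnitude between the two benchmarks, which fits the slide's remark that size is driven by the benchmark's own memory use.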

Checkpoint Time

  • Largely dominated by disk-write time
  • “Stragglers” - a single rank takes much longer to checkpoint than others.

Restart Time

  • MPI State reconstruction represented < 10% of total restart time.
slide-40
SLIDE 40

Questions?