SLIDE 1

Adaptive MPI: Overview & Recent Developments

Sam White

UIUC

Charm++ Workshop 2018

SLIDE 2

Motivation

  • Exascale trends:
  • HW: increased node parallelism, decreased memory per thread
  • SW: applications becoming more complex, dynamic
  • How should applications and runtimes respond?
  • Incrementally: MPI+X (X = OpenMP, Kokkos, MPI, etc.)?
  • Rewrite in Legion, Charm++, HPX, etc.?

SLIDE 3

Adaptive MPI

  • AMPI is an MPI implementation on top of Charm++
  • AMPI offers Charm++’s application-independent features to MPI programmers:

  • Overdecomposition
  • Communication/computation overlap
  • Dynamic load balancing
  • Online fault tolerance

SLIDE 4

Overview

  • Introduction
  • Features
  • Shared memory optimizations
  • Conclusions

SLIDE 5

Execution Model

  • AMPI ranks are User-Level Threads (ULTs)
  • Can have multiple per core
  • Fast to context switch
  • Scheduled based on message delivery
  • Migratable across cores and nodes at runtime
  • For load balancing & fault tolerance

SLIDE 6

Execution Model

[Diagram: Node 0 with two cores; each core runs a Charm++ scheduler with one AMPI rank (Rank 0 on Core 0, Rank 1 on Core 1)]

SLIDE 7

Execution Model

[Diagram: Node 0 with overdecomposition; Core 0’s scheduler runs Ranks 0-2 and Core 1’s scheduler runs Ranks 3-4, with an MPI_Send()/MPI_Recv() pair exchanged between ranks]

SLIDE 8

Execution Model

[Diagram: AMPI_Migrate() redistributes ranks across the node; Core 0’s scheduler now runs Ranks 0-3 and Core 1’s scheduler runs Ranks 4-6]

SLIDE 9

Thread Safety

  • AMPI virtualizes ranks as threads: is this safe?

SLIDE 10

Thread Safety

  • AMPI virtualizes ranks as threads: is this safe?

No: global variables are defined per process, so all ranks in the same process would share them

SLIDE 11

Thread Safety

  • AMPI programs are MPI programs without mutable global variables
  • Solutions:
  • 1. Refactor the application to not use globals/statics; instead, pass them around on the stack
  • 2. Swap ELF Global Offset Table entries at ULT context switch
  • 3. Swap the Thread Local Storage pointer at ULT context switch
  • Tag unsafe variables with C/C++ ‘thread_local’ or OpenMP ‘threadprivate’; the runtime manages TLS (see the sketch below)
  • Work in progress: have the compiler privatize them for you, e.g., icc -fmpc-privatize
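To make option 3 concrete, a minimal sketch in C (the variable name is hypothetical): tagging a mutable global with C11 ‘_Thread_local’ (or C++11 ‘thread_local’) gives each AMPI rank its own copy, assuming the program is built with ampicc and AMPI’s TLS privatization enabled (see the AMPI manual for the exact flag).

/* Sketch: privatizing a mutable global for AMPI (hypothetical variable).
 * Without the tag, all ranks in the same OS process would share it. */
#include <mpi.h>
#include <stdio.h>

/* Per-thread copy: AMPI swaps the TLS pointer at each ULT context switch,
 * so every rank sees its own iter_count. */
_Thread_local int iter_count = 0;

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  iter_count++;   /* safe: per-rank state, not per-process */
  printf("rank %d: iter_count = %d\n", rank, iter_count);
  MPI_Finalize();
  return 0;
}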

SLIDE 12

Conversion to AMPI

  • AMPI programs are MPI programs, with 2 caveats:
  • 1. Without mutable global/static variables
  • Or with them properly handled
  • 2. Possibly with calls to AMPI’s extensions
  • AMPI_Migrate()
  • Fortran main & command line args

SLIDE 13

AMPI Fortran Support

  • AMPI implements the F77 and F90 MPI bindings
  • MPI -> AMPI Fortran conversion:
  • Rename ‘program main’ -> ‘subroutine mpi_main’
  • Use the AMPI_-prefixed command line argument parsing routines
  • Automatic arrays: increase ULT stack size

SLIDE 14

Overdecomposition

  • Bulk-synchronous codes, with separate compute and communicate phases, often underutilize the network
  • LULESH v2.0:
SLIDE 15

Overdecomposition

  • With overdecomposition, the communication of one rank overlaps with the computation of other ranks on its core
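For illustration, a minimal bulk-synchronous step in C (ring neighbors and buffer sizes are made up): the code uses only standard blocking MPI, yet when run under AMPI with more virtual processes than cores, the time one rank spends blocked in MPI_Sendrecv is used to run other ranks on the same core.

/* Sketch: an ordinary blocking 1-D ring exchange (hypothetical sizes).
 * No AMPI-specific code: overlap comes from scheduling other ranks
 * whenever this one blocks inside MPI_Sendrecv. */
#include <mpi.h>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  int left  = (rank - 1 + size) % size;
  int right = (rank + 1) % size;
  double halo_out = (double)rank, halo_in = 0.0;

  for (int step = 0; step < 10; step++) {
    /* If the partner is not ready, this ULT suspends and the Charm++
     * scheduler runs another AMPI rank on the same core. */
    MPI_Sendrecv(&halo_out, 1, MPI_DOUBLE, right, 0,
                 &halo_in,  1, MPI_DOUBLE, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    halo_out += halo_in;   /* stand-in for the compute phase */
  }

  MPI_Finalize();
  return 0;
}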
SLIDE 16

Message-driven Execution

  • Overdecomposition spreads network injection over the whole timestep

[Figure: LULESH 2.0 communication over time, comparing 1 rank/core vs. 8 ranks/core]

SLIDE 17

Migratability

  • AMPI ranks are migratable at runtime between address spaces
  • User-level thread stack + heap

[Diagram: virtual address space layout (0x00000000 to 0xFFFFFFFF) of two processes, each containing text/data/bss segments plus per-thread stacks and heaps]

  • Isomalloc memory allocator makes migration automatic
  • No user serialization code
  • Works everywhere but BGQ & Windows

SLIDE 18

Load Balancing

  • To enable load balancing in an AMPI program:
  • 1. Insert a call to AMPI_Migrate(MPI_Info) (see the sketch after these steps)
  • Info object specifies LB, Checkpoint, etc.
  • 2. Link with Isomalloc and a load balancer:

ampicc -memory isomalloc -module CommonLBs

  • 3. Specify the number of virtual processes and a load balancing strategy at runtime:

srun -n 100 ./pgm +vp 1000 +balancer RefineLB
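Putting step 1 into code, a minimal sketch in C: the ‘ampi_load_balance’ info key and ‘sync’ value are assumed from the AMPI manual, so check the manual for the exact keys; the AMPI_Migrate() call itself is collective and typically placed at iteration boundaries.

/* Sketch: calling AMPI's migration hook at iteration boundaries.
 * The info key/value ("ampi_load_balance" = "sync") is assumed from the
 * AMPI manual; consult the manual for the exact keys. Build with ampicc. */
#include <mpi.h>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);

  MPI_Info hints;
  MPI_Info_create(&hints);
  MPI_Info_set(hints, "ampi_load_balance", "sync");

  for (int iter = 0; iter < 1000; iter++) {
    /* ... compute and communicate ... */
    if (iter % 100 == 0) {
      AMPI_Migrate(hints);   /* collective: ranks may move across cores/nodes */
    }
  }

  MPI_Info_free(&hints);
  MPI_Finalize();
  return 0;
}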

SLIDE 19

Recent Work

  • AMPI can optimize for communication locality
  • Many ranks can reside on the same core
  • Same goes for process/socket/node
  • Load balancers can take the communication graph into consideration

SLIDE 20

AMPI Shared Memory


  • Many AMPI ranks can share the same OS process

SLIDE 21

Existing Performance

  • Small message latency on Quartz (LLNL)

[Figure: 1-way latency (us) vs. message size (bytes) on Quartz: MVAPICH P2, IMPI P2, OpenMPI P2, AMPI P2, AMPI P1]

ExaMPI 2017

SLIDE 22

Existing Performance

  • Large message latency on Quartz

[Figure: latency (us) vs. message size for large messages (KB to MB range) on Quartz]

ExaMPI 2017

SLIDE 23

Performance Analysis

  • Breakdown of P1 time (us) per message on Quartz
  • Scheduling: Charm++ scheduler & ULT context switching
  • Memory copy: message payload movement
  • Other: AMPI message creation & matching

SLIDE 24

Scheduling Overhead

  • 1. Even for P1, all AMPI messages traveled through Charm++’s scheduler
  • Use Charm++ [inline] tasks
  • 2. ULT context switching overhead
  • Faster with Boost ULTs
  • 3. Avoid resuming threads without real progress
  • MPI_Waitall: keep track of the number of requests “blocked on” (sketched below)


P1 0-B latency: 1.27 us -> 0.66 us
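The MPI_Waitall change amounts to counting how many of the requests the thread is blocked on are still pending, and only resuming it once that count reaches zero. A schematic sketch of the idea (not AMPI’s actual internals):

/* Schematic sketch (not AMPI's actual code): wake the blocked ULT only
 * when every request it is waiting on has completed, rather than on each
 * message delivery. */
typedef struct {
  int   pending;   /* number of requests this thread is still blocked on */
  void *thread;    /* handle of the suspended user-level thread          */
} waitall_state;

/* Called by the runtime each time one of the tracked requests completes. */
void on_request_complete(waitall_state *w) {
  if (--w->pending == 0) {
    /* resume_thread(w->thread): only now is the context switch useful */
  }
}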

SLIDE 25

Memory Copy Overhead

  • Q: Even with [inline] tasks, AMPI P1 performs poorly for large messages. Why?
  • A: Charm++ messaging semantics do not match MPI’s
  • In Charm++, messages are first-class objects
  • Users pass ownership of messages to the runtime when sending and assume it when receiving
  • Only applications that can reuse message objects in their data structures can perform “zero copy” transfers

SLIDE 26

Memory Copy Overhead

  • To overcome Charm++ messaging semantics in shared memory, use a rendezvous protocol (sketched below):
  • Receiver performs a direct (userspace) memcpy from sendbuf to recvbuf
  • Benefit: avoids the intermediate copy
  • Cost: synchronization; the sender must suspend and be resumed upon copy completion


P1 1-MB latency: 165 us -> 82 us
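Schematically (again, not AMPI’s actual internals), the shared-memory rendezvous works like this: the sender publishes its buffer address and suspends; the matching receiver copies directly from that buffer into the user’s receive buffer, then resumes the sender.

/* Schematic sketch of a shared-memory rendezvous between two ULTs in the
 * same process (not AMPI's actual code): one direct memcpy, no
 * intermediate buffer. */
#include <string.h>
#include <stddef.h>

typedef struct {
  const void *src;      /* sender's user buffer                          */
  size_t      len;      /* payload size in bytes                         */
  void       *sender;   /* suspended sender ULT, resumed after the copy  */
} rndv_token;

/* Receiver side: runs once the matching receive has been posted. */
void rndv_complete(rndv_token *t, void *recvbuf) {
  memcpy(recvbuf, t->src, t->len);   /* direct userspace copy: src -> dst */
  /* resume_thread(t->sender): sender may now reuse or free its buffer    */
}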

SLIDE 27

Other Overheads

  • Sender-side:
  • Create a Charm++ message object & a request
  • Receiver-side:
  • Create a request and a matching queue entry; dequeue from unexpectedMsgs or enqueue in postedReqs
  • Solution: use memory pools for fixed-size, frequently-used objects (sketched below)
  • Optimize for common usage patterns, e.g., MPI_Waitall with a mix of send and recv requests


P1 0-B latency: 0.66 us -> 0.54 us
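The memory-pool idea is to recycle fixed-size request and queue-entry objects through a free list so the common path avoids malloc/free per message. A generic sketch (not AMPI’s actual allocator):

/* Generic fixed-size object pool (not AMPI's actual allocator): pop from a
 * free list on allocation, push back on release. */
#include <stdlib.h>

typedef struct pool_node { struct pool_node *next; } pool_node;

typedef struct {
  pool_node *free_list;   /* singly linked list of recycled objects  */
  size_t     obj_size;    /* fixed object size, >= sizeof(pool_node) */
} pool;

void *pool_alloc(pool *p) {
  if (p->free_list) {              /* fast path: reuse a recycled object */
    pool_node *n = p->free_list;
    p->free_list = n->next;
    return n;
  }
  return malloc(p->obj_size);      /* slow path: grow the pool */
}

void pool_free(pool *p, void *obj) {
  pool_node *n = (pool_node *)obj; /* push back onto the free list */
  n->next = p->free_list;
  p->free_list = n;
}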

SLIDE 28

AMPI-shm Performance

  • Small message latency on Quartz
  • AMPI-shm P2 is faster than the other implementations for messages of 2 KB and larger

SLIDE 29

AMPI-shm Performance

  • Large message latency on Quartz
  • AMPI-shm P2 is fastest for all large messages, up to 2.33x faster than process-based MPIs for 32+ MB

SLIDE 30

AMPI-shm Performance

  • Bidirectional bandwidth on Quartz
  • AMPI-shm can utilize full memory bandwidth
  • 26% higher peak bandwidth, and 2x the bandwidth of the others for 32+ MB messages

[Figure: bidirectional bandwidth (MB/s) vs. message size on Quartz, with STREAM copy bandwidth shown for reference]

SLIDE 31

AMPI-shm Performance

  • Small message latency on Cori-Haswell

[Figure: latency (us) vs. message size (bytes) on Cori-Haswell: Cray MPI P2, AMPI-shm P2, AMPI-shm P1]

SLIDE 32

AMPI-shm Performance

  • Large message latency on Cori-Haswell
  • AMPI-shm P2 is 47% faster than Cray MPI at 32+ MB

SLIDE 33

AMPI-shm Performance

  • Bidirectional bandwidth on Cori-Haswell
  • Cray MPI on XPMEM performs similarly to AMPI-shm up to 16 MB

[Figure: bidirectional bandwidth (MB/s) vs. message size on Cori-Haswell, with STREAM copy bandwidth shown for reference]

SLIDE 35

Summary

  • User-space communication offers portable intranode messaging performance
  • Lower latency: 1.5x-2.3x for large messages
  • Higher bandwidth: 1.3x-2x for large messages
  • Intermediate buffering is unnecessary for medium and large messages

SLIDE 36

Conclusions

  • AMPI provides application-independent runtime support for existing MPI applications:

  • Overdecomposition
  • Latency tolerance
  • Dynamic load balancing
  • Automatic fault detection & recovery
  • See the AMPI manual for more info

SLIDE 37

This material is based in part upon work supported by the Department of Energy, National Nuclear Security Administration, under Award Number DE-NA0002374.

SLIDE 38


Questions?

Thank you
