Adaptive MPI: Overview & Recent Developments
Sam White
UIUC
Charm++ Workshop 2018
Motivation
Exascale trends:
- HW: increased node parallelism, decreased memory per thread
- SW: applications becoming more complex and dynamic
How can MPI applications adapt to these trends (load imbalance, shrinking memory per thread, etc.)?
Adaptive MPI (AMPI): MPI implemented on top of Charm++, bringing the runtime's adaptive features to MPI programmers: overdecomposition, communication/computation overlap, rank migration, and dynamic load balancing.
[Diagram: traditional execution model on Node 0 — Core 0 runs a scheduler with Rank 0, Core 1 runs a scheduler with Rank 1 (one rank per core).]
[Diagram: overdecomposition on Node 0 — Core 0's scheduler runs Ranks 0-2, Core 1's runs Ranks 3-4; MPI_Send()/MPI_Recv() calls between ranks become scheduled messages between user-level threads.]
[Diagram: migration via AMPI_Migrate() on Node 0 — ranks are migratable user-level threads; here Ranks 0-3 end up on Core 0's scheduler and Ranks 4-6 on Core 1's.]
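In code, a rank exposes migration points by calling AMPI_Migrate() periodically. A minimal sketch, assuming AMPI's MPI_Info-based AMPI_Migrate() signature and its predefined AMPI_INFO_LB_SYNC hint; compute() and the loop bounds are placeholders:

    #include <mpi.h>

    extern void compute(void);  /* hypothetical per-rank work */

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        for (int iter = 0; iter < 1000; iter++) {
            compute();
            if (iter % 100 == 99) {
                /* Collective: the runtime measures each rank's load and may
                 * migrate ranks (user-level threads) to other cores/nodes. */
                AMPI_Migrate(AMPI_INFO_LB_SYNC);
            }
        }
        MPI_Finalize();
        return 0;
    }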
Is an MPI program safe to run this way as-is? No: global variables are defined per process, but AMPI ranks are threads sharing their process, so mutable globals must be privatized.
Options for privatizing global/static variables (see the sketch after this list):
1. Refactor them: encapsulate the variables in a per-rank object and pass them around on the stack.
2. Annotate them as 'threadprivate'; the runtime manages TLS.
3. Let the compiler privatize them for you, i.e., icc -fmpc-privatize
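As an illustration of option 1, a mutable global moves into a per-rank struct passed down the call chain, so each rank (thread) touches only its own copy; all names here are hypothetical:

    #include <mpi.h>

    /* Before: one copy per process, shared (and racy) across AMPI ranks:
     *     int timestep; */

    /* After: per-rank state, one instance on each rank's stack. */
    typedef struct {
        int timestep;
    } RankState;

    static void do_step(RankState *s) {
        s->timestep++;  /* each rank mutates only its own copy */
    }

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        RankState state = { 0 };
        do_step(&state);
        MPI_Finalize();
        return 0;
    }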
2. Fortran main & command line args
Overdecomposition enables overlap of computation and communication: while one rank blocks, the scheduler runs other ranks on the same core, pipelining messages onto the network even in codes with distinct compute/communicate phases.
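For example, a conventional blocking halo exchange gains this overlap under AMPI without code changes: while one rank waits in MPI_Sendrecv, the scheduler runs another rank's compute phase on the same core. A sketch with illustrative neighbors and sizes:

    #include <mpi.h>

    #define N 1024

    /* One timestep: blocking exchange, then local work. Under AMPI with
     * several ranks per core, the blocking call below is overlapped with
     * other ranks' compute phases automatically. */
    void timestep(double *halo_out, double *halo_in, int left, int right) {
        MPI_Sendrecv(halo_out, N, MPI_DOUBLE, left, 0,
                     halo_in, N, MPI_DOUBLE, right, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        for (int i = 0; i < N; i++)
            halo_out[i] = 0.5 * (halo_out[i] + halo_in[i]);
    }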
With 8 ranks/core, LULESH's communication is spread over the whole timestep instead of bursting at phase boundaries.
[Figure: LULESH 2.0 Communication over Time — 1 rank/core vs. 8 ranks/core (240 KB vs. 980 KB).]
Isomalloc allocates each rank's stack and heap at virtual addresses that are unique across all address spaces in the job, which makes migration automatic: no user serialization code is needed. (Not supported on Windows.)
[Diagram: two address spaces spanning 0x00000000 to 0xFFFFFFFF, each with its own text/data/bss segments and the stacks and heaps of threads 0-4 placed at globally unique addresses.]
Link with Isomalloc and the common load balancers:
    ampicc -memory isomalloc -module CommonLBs
Select the load balancing strategy at runtime:
    srun -n 100 ./pgm +vp 1000 +balancer RefineLB
Here 100 cores run 1000 virtual ranks (10 per core), rebalanced by RefineLB.
Load balancing strategies take the load and communication measured by the runtime into consideration when remapping ranks.
[Plot: 1-way latency (us) vs. message size (bytes) — MVAPICH P2, Intel MPI P2, OpenMPI P2, AMPI P2, AMPI P1.]
ExaMPI 2017
[Plots: latency (us) vs. message size for medium (KB) and large (MB) messages.]
ExaMPI 2017
All point-to-point messages pass through Charm++'s scheduler; streamlining that path cut P1 0-byte latency from 1.27 us to 0.66 us.
AMPI still lagged behind other MPIs for large messages. Why? MPI's semantics: programs give up ownership of a buffer when sending and assume it when receiving, so contiguous data structures can perform "zero copy" transfers.
For large messages within shared memory, use a rendezvous protocol: once the matching receive is posted, the data is copied directly from sendbuf to recvbuf, and the sender's blocked thread will be resumed upon copy completion.
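A minimal sketch of that rendezvous path, assuming sender and receiver are user-level threads sharing one address space (as AMPI ranks are); post_rts() and yield_to_scheduler() are hypothetical placeholders, not AMPI's internal API:

    #include <stddef.h>
    #include <string.h>

    typedef struct {
        const void *sendbuf;  /* exposed to the receiver via the RTS message */
        size_t len;
        volatile int done;    /* set by the receiver when the copy finishes */
    } RendezvousReq;

    extern void post_rts(RendezvousReq *req);  /* deliver RTS to the receiver */
    extern void yield_to_scheduler(void);      /* suspend this user-level thread */

    /* Sender: publish the buffer address, then block until the copy is done. */
    void rdv_send(RendezvousReq *req, const void *buf, size_t len) {
        req->sendbuf = buf;
        req->len = len;
        req->done = 0;
        post_rts(req);
        while (!req->done)
            yield_to_scheduler();  /* resumed upon copy completion */
    }

    /* Receiver: on matching the RTS, copy once, directly sendbuf -> recvbuf. */
    void rdv_recv(RendezvousReq *req, void *recvbuf) {
        memcpy(recvbuf, req->sendbuf, req->len);  /* single copy, no bounce buffer */
        req->done = 1;
    }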
P1 1-MB latency: 165 us -> 82 us
Message matching: on receive, dequeue a match from unexpectedMsgs or enqueue the request in postedReqs (and vice versa on arrival), replacing a single queue that held a mix of send and recv requests.
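A sketch of the two-queue matching, keeping the queue names from the slide (unexpectedMsgs, postedReqs) but with otherwise illustrative types and helpers:

    #include <stdbool.h>
    #include <stddef.h>

    enum { ANY = -1 };  /* stand-in for MPI_ANY_SOURCE / MPI_ANY_TAG */

    typedef struct Entry {
        int src, tag, comm;
        struct Entry *next;
    } Entry;

    static Entry *unexpectedMsgs;  /* arrived messages, no matching receive yet */
    static Entry *postedReqs;      /* posted receives, no matching message yet */

    static bool matches(const Entry *msg, const Entry *req) {
        return msg->comm == req->comm
            && (req->src == ANY || req->src == msg->src)
            && (req->tag == ANY || req->tag == msg->tag);
    }

    /* Append at the tail: FIFO order preserves MPI's matching semantics. */
    static void append(Entry **q, Entry *e) {
        e->next = NULL;
        while (*q) q = &(*q)->next;
        *q = e;
    }

    /* On message arrival: complete a posted receive, or stash the message. */
    void on_message(Entry *msg) {
        for (Entry **p = &postedReqs; *p; p = &(*p)->next) {
            if (matches(msg, *p)) {
                Entry *req = *p;
                *p = req->next;
                (void)req;  /* ...copy payload into req's buffer, complete it... */
                return;
            }
        }
        append(&unexpectedMsgs, msg);
    }

    /* On MPI_(I)recv: match an unexpected message, or post the receive. */
    void on_receive(Entry *req) {
        for (Entry **p = &unexpectedMsgs; *p; p = &(*p)->next) {
            if (matches(*p, req)) {
                Entry *msg = *p;
                *p = msg->next;
                (void)msg;  /* ...copy payload out, complete req immediately... */
                return;
            }
        }
        append(&postedReqs, req);
    }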
P1 0-B latency: 0.66 us -> 0.54 us
Bandwidth: AMPI is 2.33x faster than process-based MPIs for 32+ MB messages, since ranks in a shared address space need a single copy rather than two copies through a shared-memory bounce buffer.
[Plot: bandwidth (MB/s) vs. message size; STREAM copy bandwidth shown for reference.]
[Plot: latency (us) vs. message size (bytes) — Cray MPI P2, AMPI-shm P2, AMPI-shm P1.]
[Plots: bandwidth (MB/s) vs. message size on additional configurations; STREAM copy bandwidth shown for reference.]
Takeaway: thread-based ranks in shared memory improve AMPI's intranode messaging performance, particularly for large messages.
AMPI provides adaptive runtime support for existing MPI applications: overdecomposition, communication/computation overlap, rank migration, and dynamic load balancing.
This material is based in part upon work supported by the Department of Energy, National Nuclear Security Administration, under Award Number DE-NA0002374.