spcl.inf.ethz.ch @spcl_eth
Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One Sided
ROBERT GERSTENBERGER, MACIEJ BESTA, TORSTEN HOEFLER

MPI-3.0 Remote Memory Access
[1] http://www.mpi-forum.org/docs/mpi-3.0/mpi30-report.pdf
Figure: two-sided vs. one-sided. Two-sided (Proc A: send, Proc B: recv) couples communication and synchronization in one operation. One-sided (Proc A: put, followed later by a separate sync) decouples communication from synchronization.
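The contrast can be sketched in C (a minimal sketch for two processes, assuming a 4-element double window; error handling omitted):

```c
#include <mpi.h>

/* Run with 2 processes: the same transfer written two-sided
 * (send/recv) and one-sided (put + fence). */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    double buf[4] = {0};

    /* Two-sided: communication and synchronization are coupled. */
    if (rank == 0) {
        double data[4] = {1, 2, 3, 4};
        MPI_Send(data, 4, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(buf, 4, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    /* One-sided: rank 0 puts into rank 1's window; the two fences
     * provide the synchronization separately from the transfer. */
    MPI_Win win;
    MPI_Win_create(buf, sizeof buf, sizeof(double), MPI_INFO_NULL,
                   MPI_COMM_WORLD, &win);
    MPI_Win_fence(0, win);
    if (rank == 0) {
        double data[4] = {5, 6, 7, 8};
        MPI_Put(data, 4, MPI_DOUBLE, /*target=*/1, /*disp=*/0,
                4, MPI_DOUBLE, win);
    }
    MPI_Win_fence(0, win);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```

Note that rank 1 takes no per-message action in the one-sided variant; it only participates in the fences.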
MPI-3 RMA concepts
Figure: each process exposes part of its memory as an MPI window. Process A is passive; Processes B, C, D (active) access A's window with non-atomic communication calls (Put, Get) and atomic communication calls (Accumulate, Get & Accumulate, CAS, FAO).
Synchronization modes:
  Active target mode (active process): Fence; Post/Start/Complete/Wait
  Passive target mode (passive process): Lock; Lock All
Scalable window creation, communication, and synchronization
http://spcl.inf.ethz.ch/Research/Parallel_Programming/foMPI
Traditional windows: backwards compatible (MPI-2). Time bound: O(p), memory bound: O(p), with p = total number of processes. Figure: Processes A, B, C expose existing buffers at different local addresses (0x111, 0x123, 0x120).
Allocated windows: allow MPI to allocate the memory, so all processes can receive the same base address (0x123 on Processes A, B, C). Time bound: O(log p) (whp), memory bound: O(1).
Dynamic windows: local attach/detach of memory at any time; most flexible. Time bound: O(p), memory bound: O(p). Figure: Processes A, B, C attach regions at arbitrary addresses (0x129, 0x111, 0x123, 0x120).
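The three flavors can be sketched as follows (a minimal sketch; the buffer sizes are illustrative):

```c
#include <mpi.h>
#include <stdlib.h>

/* The three MPI-3 window flavors discussed above. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    /* Traditional: expose an existing buffer; base addresses differ
     * per process, so O(p) state may be needed to address peers. */
    double *buf = malloc(1024 * sizeof *buf);
    MPI_Win w1;
    MPI_Win_create(buf, 1024 * sizeof *buf, sizeof *buf,
                   MPI_INFO_NULL, MPI_COMM_WORLD, &w1);

    /* Allocated: MPI provides the memory and may place it at the
     * same base address everywhere (O(1) remote-address state). */
    double *abuf;
    MPI_Win w2;
    MPI_Win_allocate(1024 * sizeof *abuf, sizeof *abuf, MPI_INFO_NULL,
                     MPI_COMM_WORLD, &abuf, &w2);

    /* Dynamic: create an empty window, then attach/detach local
     * memory at any time -- most flexible. */
    MPI_Win w3;
    MPI_Win_create_dynamic(MPI_INFO_NULL, MPI_COMM_WORLD, &w3);
    double *dbuf = malloc(1024 * sizeof *dbuf);
    MPI_Win_attach(w3, dbuf, 1024 * sizeof *dbuf);
    MPI_Win_detach(w3, dbuf);

    MPI_Win_free(&w3);
    MPI_Win_free(&w2);
    MPI_Win_free(&w1);
    free(dbuf);
    free(buf);
    MPI_Finalize();
    return 0;
}
```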
Communication implementation: local (blocking) memcpy via XPMEM; intrinsic datatypes (e.g., MPI_DOUBLE) on contiguous memory map directly to DMAPP calls on the remote process: MPI_Put → dmapp_put_nbi, MPI_Compare_and_swap → dmapp_acswap_qw_nbi. More complex datatypes are processed outside MPI [1].
[1] Ross, Latham, Gropp, Lusk, Thakur. Processing MPI datatypes outside MPI. EuroMPI/PVM'09
Inter-node latency (half ping-pong: put + sync between Proc 0 and Proc 1): Put 20% faster, Get 80% faster.
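The half ping-pong pattern can be sketched as a timing loop (a minimal sketch; the 8-byte message, iteration count, and flush-based sync are illustrative choices):

```c
#include <mpi.h>
#include <stdio.h>

/* Half ping-pong: rank 0 repeatedly puts into rank 1's window and
 * syncs; latency is measured on rank 0 only. Run with 2 processes. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char buf[8] = {0};
    MPI_Win win;
    MPI_Win_create(buf, sizeof buf, 1, MPI_INFO_NULL,
                   MPI_COMM_WORLD, &win);
    MPI_Win_lock_all(0, win);

    const int iters = 1000;
    double t0 = MPI_Wtime();
    if (rank == 0) {
        for (int i = 0; i < iters; i++) {
            MPI_Put(buf, sizeof buf, MPI_CHAR, /*target=*/1, 0,
                    sizeof buf, MPI_CHAR, win);
            MPI_Win_flush(1, win);   /* "sync": remote completion */
        }
        printf("put latency: %.3f us\n",
               (MPI_Wtime() - t0) / iters * 1e6);
    }

    MPI_Win_unlock_all(win);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```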
Intra-node latency (half ping-pong: put + sync): Put/Get 3x faster.
Inter-node overlap (in %): Proc 0 puts, computes, then syncs. Useful for, e.g., scientific codes: 3D FFT, MILC, AWM-Olsen (seismic).
Message rate, intra-node and inter-node: many puts followed by one sync.
Atomic operations on 64-bit integers: the hardware-accelerated (proprietary) protocol gives lower latency; the fall-back protocol gives higher bandwidth.
Synchronization protocols
Fence implementation (Procs 0-3 across Node 0 and Node 1 issue puts, then fence):

    int MPI_Win_fence(...) {
      asm("mfence");        /* local completion (XPMEM) */
      dmapp_gsync_wait();   /* local completion (DMAPP) */
      MPI_Barrier(...);     /* global completion */
      return MPI_SUCCESS;
    }
Fence: time bound O(log p), memory bound O(1); 90% faster.
PSCW: Proc 0 (posting process) opens an exposure epoch with post ... wait, which allows access from other processes; Proc 1 (starting process) opens an access epoch with start ... complete, which allows it to access other processes and issue Puts. A matching algorithm pairs posting and starting processes.
Figure: Proc 0 runs an exposure epoch (post/wait) matched by access epochs (start/complete) on Procs 1-3; a second group pairs Procs 4 and 5 (post/wait with start/complete).
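The post/start/complete/wait calls can be sketched as follows (a minimal sketch; the even/odd pairing and the 1-element window are illustrative):

```c
#include <mpi.h>

/* PSCW sketch: even ranks expose their window (post/wait), odd
 * ranks access their left neighbor (start/complete). */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double buf[1] = {0};
    MPI_Win win;
    MPI_Win_create(buf, sizeof buf, sizeof(double), MPI_INFO_NULL,
                   MPI_COMM_WORLD, &win);

    MPI_Group world, grp;
    MPI_Comm_group(MPI_COMM_WORLD, &world);

    if (rank % 2 == 0 && rank + 1 < size) {   /* posting process */
        int peer = rank + 1;
        MPI_Group_incl(world, 1, &peer, &grp);
        MPI_Win_post(grp, 0, win);            /* exposure epoch */
        MPI_Win_wait(win);                    /* returns once matched */
        MPI_Group_free(&grp);
    } else if (rank % 2 == 1) {               /* starting process */
        int peer = rank - 1;
        MPI_Group_incl(world, 1, &peer, &grp);
        MPI_Win_start(grp, 0, win);           /* access epoch */
        double one = 1.0;
        MPI_Put(&one, 1, MPI_DOUBLE, peer, 0, 1, MPI_DOUBLE, win);
        MPI_Win_complete(win);
        MPI_Group_free(&grp);
    }

    MPI_Group_free(&world);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```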
Matching algorithm (example: one posting process i, 4 starting processes j1, ..., j4): the posting process i opens its window and writes its rank into a local list at each starting process j1, ..., j4. Each starting process spins on its own local list until the rank of the posting process i is present in that list. Once it has been matched and signaled by all starting processes, the posting process returns from wait.
PSCW: time bound P_start = P_wait = O(1), P_post = P_complete = O(log p); memory bound O(log p) (for scalable programs). Benchmark: ring topology.
Two-level lock hierarchy:
  Lock/Unlock (shared or exclusive): targets a single passive process.
  Lock All (always shared): targets the whole window.
Each process holds a local lock word (shared counter + exclusive bit, e.g. 00000); a designated master process additionally holds a global lock word (shared counter + exclusive counter, e.g. 000|000).
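The local lock word can be sketched with plain C11 atomics. The 32-bit counter width, the helper names, and the single-word layout here are illustrative, not foMPI's actual layout; in foMPI these updates are remote MPI atomics (fetch-add, compare-and-swap) on the target's window:

```c
#include <stdatomic.h>
#include <stdint.h>
#include <stdbool.h>

/* Illustrative local lock word: low 32 bits = shared counter,
 * bit 32 = exclusive bit. */
#define EXCL_BIT (UINT64_C(1) << 32)

/* Shared lock: fetch-add the shared counter; back off if the
 * exclusive bit was already set (a writer holds the lock). */
static bool acquire_shared(_Atomic uint64_t *w) {
    uint64_t prev = atomic_fetch_add(w, 1);
    if (prev & EXCL_BIT) {
        atomic_fetch_sub(w, 1);   /* undo; caller retries later */
        return false;
    }
    return true;
}

static void release_shared(_Atomic uint64_t *w) {
    atomic_fetch_sub(w, 1);
}

/* Exclusive lock: compare-and-swap from "no readers, no writer"
 * to "exclusive bit set". */
static bool acquire_exclusive(_Atomic uint64_t *w) {
    uint64_t expected = 0;
    return atomic_compare_exchange_strong(w, &expected, EXCL_BIT);
}

static void release_exclusive(_Atomic uint64_t *w) {
    atomic_fetch_sub(w, EXCL_BIT);
}
```

Packing both fields into one word is what lets a single fetch-add or CAS check the invariants below atomically.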
Proc 2 wants to lock Proc 1 exclusively, MPI_Win_lock(EXCL, 1):
1. Fetch-add on the master's global exclusive counter (000|000 -> 000|001). Invariant 1: no global shared lock is held concurrently.
2. Compare-and-swap on Proc 1's local exclusive bit. Invariant 2: no local shared or exclusive lock is held concurrently.
Proc 0 wants to lock Proc 1 (shared), MPI_Win_lock(SHRD, 1): fetch-add on Proc 1's local shared counter (00000 -> 00001). Invariant: no local exclusive lock on this process is held concurrently.
Proc 2 wants to lock the whole window, MPI_Win_lock_all(): fetch-add on the master's global shared counter (000|000 -> 001|000). Invariant: no local exclusive lock is held concurrently.
Lock synchronization adds only a few CPU instructions to the critical path. Time bound O(1), memory bound O(1). Figure: Processes 0 and 1 each issue inc(counter) on a shared counter, followed by a flush to complete the operations.
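The inc(counter)/flush pattern in the figure can be sketched with MPI atomics (a minimal sketch; placing the counter on rank 0 is an assumption):

```c
#include <mpi.h>
#include <stdint.h>
#include <stdio.h>

/* Shared counter: every process atomically increments a counter
 * that lives on rank 0, then flushes for remote completion. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int64_t counter = 0;
    MPI_Win win;
    MPI_Win_create(&counter, sizeof counter, sizeof counter,
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_lock_all(0, win);
    int64_t one = 1, prev;
    MPI_Fetch_and_op(&one, &prev, MPI_INT64_T, /*target=*/0,
                     /*disp=*/0, MPI_SUM, win);   /* inc(counter) */
    MPI_Win_flush(0, win);                        /* flush */
    MPI_Win_unlock_all(win);

    MPI_Barrier(MPI_COMM_WORLD);
    if (rank == 0)
        printf("final counter: %lld\n", (long long)counter);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```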
Results. Figures: key/value store, random inserts per second; dynamic sparse data exchange (DSDE) with 6 neighbors.
Figures: NAS 3D FFT [1] performance and MILC [2] application execution time; annotations represent the performance gain of foMPI over Cray MPI-1. The runs scale to 512k and 65k procs.
[1] Nishtala et al. Scaling communication-intensive applications on BlueGene/P using one-sided communication and overlap. IPDPS'09
[2] Shan et al. Accelerating applications at scale using one-sided communication. PGAS'12
Summary (build-up figure): foMPI provides scalable window creation, communication, and synchronization routines for MPI-3 RMA.
spcl.inf.ethz.ch @spcl_eth
Thanks to: Timo Schneider, Greg Bauer, Bill Kramer, Duncan Roweth, Nick Wright, Paul Hargrove (and the whole UPC team) and the MPI Forum RMA WG … … and the institutions:
77
http://spcl.inf.ethz.ch/Research/Parallel_Programming/foMPI
Backup: dynamic windows. Figure build-up: Processes A, B, C attach regions at arbitrary local addresses (0x111, 0x120, 0x129, 0x123). Address lookup: Process A issues Get(id) to fetch Process B's region list, caches it, and accesses the window; when B's list changes (Update(list)), a subsequent Get(id) detects the change and refetches before accessing the window.
Backup figures: Put and Get inter-node latency; Put/Get intra-node latency (half ping-pong: put + sync between Proc 0 and Proc 1).
Backup figure: Proc 2 holds the exclusive lock on Proc 1; release uses Add(-1) on the master's global exclusive counter (000|001).
Performance functions (p = number of processes, k = group size, s = message size in bytes):

Synchronization protocols:
  Fence:   P_fence = 2.9µs · log2(p)
  PSCW:    P_start = 0.7µs, P_wait = 1.8µs, P_post = P_complete = 350ns · k
  Locks:   P_lock,excl = 5.4µs, P_lock,shrd = P_lock_all = 2.7µs,
           P_unlock = P_unlock_all = 0.4µs, P_flush = 76ns, P_sync = 17ns

Communication protocols:
  Put/Get: P_put = 0.16ns · s + 1µs, P_get = 0.17ns · s + 1.9µs
  Atomics: P_acc,sum = 28ns · s + 2.4µs, P_acc,min = 0.8ns · s + 7.3µs