How I Learned to Stop Worrying About User-Visible Endpoints and Love MPI
Rohit Zambre,* Aparna Chandramowlishwaran,* Pavan Balaji
*University of California, Irvine
Argonne National Laboratory
MPI everywhere
[Figure: a node's cores, one MPI process per core]
▸ Model artifact: high memory requirements that worsen with increasing domain dimensionality and number of ranks.
▸ Hardware usage: resource wastage from a static split of the processor's limited resources.
[Trend: increasing number of cores, decreasing memory per core]
MPI+threads
▸ Model artifact: reduces duplicated data by a factor of the number of threads.
▸ Hardware usage: able to use the many cores while sharing all of the processor's resources.
[Figure: a node's cores, threads of a single MPI process sharing the node]
[Figure: distributed BFS time (seconds), split into Computation, Allgatherv, and Alltoallv, for processor grids (threads x processor rows x processor columns) 1x440x110, 1x220x220, 1x110x440, 6x180x45, 6x90x90, 6x45x180]
The MPI-everywhere configurations run out of memory (OOM); the 6-thread grids are the corresponding MPI+threads runs.
Buluc et al., Distributed BFS (https://arxiv.org/abs/1705.04590)
[Figure: 8 B MPI_Isend rate (million messages/s) vs. number of cores (1-16): MPI everywhere, MPI+threads (MPI_THREAD_FUNNELED), MPI+threads (MPI_THREAD_MULTIPLE)]
Communication performance of MPI+threads is dismal
Network hardware context
Outdated view: the network is a single device. Modern reality: the network features parallelism.
[Figure: MPI everywhere vs. MPI+threads. MPI everywhere: processes P0-P3, each with its own software communication channel to a network hardware context on the Network Interface Card. MPI+threads: a single process P0 whose threads all go through the MPI library.]
MPI everywhere: one communication channel per process, mapped onto the NIC's parallel hardware contexts.
MPI+threads today: no logical parallelism expressed; a global critical section and one communication channel per process.
MPI_Comm_create_endpoints(…, num_ep, …, comm_eps[]);
MPI_Isend/Irecv(…, comm_eps[tid], ep_rank, …);
[Figure: an MPI process with endpoints EP0-EP4 inside the MPI library, each endpoint backed by its own software communication channel and network hardware context on the NIC; legend: network hardware context, MPI endpoint, software communication channel, MPI communicator]
Pros
▸ Explicit control over network contexts
Cons
▸ Intrusive extension of the MPI standard
▸ Onus of managing network contexts falls on the user
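Below is a minimal usage sketch of the proposed endpoints interface shown above. MPI_Comm_create_endpoints is a proposal, not part of MPI-3.1, so this does not compile against a standard MPI; the dense endpoint-rank computation is an assumption for illustration only.

```c
/* Sketch only: MPI_Comm_create_endpoints is a proposed extension shown on the
 * slide, not part of MPI-3.1. The endpoint-rank computation below is an
 * assumption for illustration. */
#include <mpi.h>
#include <omp.h>

void send_with_endpoints(int num_ep, int peer_proc, char *buf, int count)
{
    MPI_Comm comm_eps[num_ep];

    /* Each process asks for num_ep endpoints; every endpoint becomes a rank
     * in the resulting endpoints communicator. */
    MPI_Comm_create_endpoints(MPI_COMM_WORLD, num_ep, MPI_INFO_NULL, comm_eps);

    #pragma omp parallel num_threads(num_ep)
    {
        int tid = omp_get_thread_num();
        int ep_rank = peer_proc * num_ep + tid;   /* assumed dense ranking of endpoints */
        MPI_Request req;

        /* Each thread drives its own endpoint, so the MPI library can map it
         * to its own network hardware context without serializing the threads. */
        MPI_Isend(buf + (size_t)tid * count, count, MPI_CHAR,
                  ep_rank, /*tag=*/0, comm_eps[tid], &req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }
}
```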
[Figure: MPI everywhere (P0-P3) vs. MPI+threads (P0) whose communicators C0-C3 each have their own software communication channel to a network hardware context on the NIC]
MPI+threads with multiple communicators: logical parallelism expressed; fine-grained critical sections and multiple communication channels per process.
Do we need user-visible endpoints?
CONTRIBUTIONS AS DEVIL’S ADVOCATE
▸ In-depth comparison between MPI-3.1 and user-visible endpoints
▸ A fast MPI+threads library that adheres to MPI-3.1's constraints
▸ Optimized parallel communication streams applicable to all MPI libraries
▸ Recommendations for the MPI user to express logical parallelism with MPI-3.1
Evaluation platforms
MPI library
▸ Based on MPICH:CH4
Interconnects
▸ Intel Omni-Path (OPA) with OFI:PSM2
▸ Mellanox InfiniBand (IB) with UCX:Verbs
OUTLINE
▸ Introduction
▸ For MPI users: Parallelism in the MPI standard
▸ For MPI developers: Fast MPI+threads
  ▸ Fine-grained critical sections for thread safety
  ▸ Virtual Communication Interfaces (VCIs) for parallel communication streams
▸ Microbenchmark and application analysis
POINT-TO-POINT COMMUNICATION
▸ <comm, rank, tag> decides matching
▸ Non-overtaking order
▸ Receive wildcards

Can two or more operations on a process be issued on parallel communication streams?

  Comm        Rank                Tag                 Send   Recv
  Different   Different or Same   Different or Same   Yes    Yes
  Same        Different           Different or Same   Yes    No   (wildcards)
  Same        Same                Different or Same   No     No   (wildcards; non-overtaking order)

Examples:
  Rank 0 (sender): <CA,R1,T1> <CB,R1,T1>   Rank 1 (receiver): <CA,R0,T1> <CB,R0,T1>
  Rank 0 (sender): <CA,R1,T1> <CA,R2,T1>   Rank 1 (receiver): <CA,R0,T1> <CA,ANY,T1>
  Rank 0 (sender): <CA,R1,T1> <CA,R1,T2>   Rank 1 (receiver): <CA,R0,T3> <CA,R0,ANY>
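A minimal sketch of the "Different comm" row above: each thread of an MPI_THREAD_MULTIPLE program communicates on its own duplicate of MPI_COMM_WORLD, so the library is free to map the threads onto parallel communication streams. The helper name and buffer layout are illustrative.

```c
/* Sketch: per-thread communicators so each thread's <comm,rank,tag> space is
 * independent (the "Different comm" row of the table above).
 * Assumes MPI was initialized with MPI_THREAD_MULTIPLE. */
#include <mpi.h>
#include <omp.h>

void pairwise_exchange(int peer, char *sendbuf, char *recvbuf, int count, int nthreads)
{
    MPI_Comm comms[nthreads];
    for (int i = 0; i < nthreads; i++)
        MPI_Comm_dup(MPI_COMM_WORLD, &comms[i]);   /* one duplicate per thread */

    #pragma omp parallel num_threads(nthreads)
    {
        int tid = omp_get_thread_num();
        MPI_Request reqs[2];
        /* Matching is confined to comms[tid]; no wildcard receives are posted,
         * so these operations can flow on parallel communication streams. */
        MPI_Irecv(recvbuf + (size_t)tid * count, count, MPI_CHAR, peer, 0, comms[tid], &reqs[0]);
        MPI_Isend(sendbuf + (size_t)tid * count, count, MPI_CHAR, peer, 0, comms[tid], &reqs[1]);
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    }

    for (int i = 0; i < nthreads; i++)
        MPI_Comm_free(&comms[i]);
}
```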
RMA COMMUNICATION

Can two or more operations on a process be issued on parallel communication streams?

  Window      Rank                Put   Get   Accumulate
  Different   Different or Same   Yes   Yes   Yes
  Same        Different           Yes   Yes   Yes
  Same        Same                Yes   Yes   No   (ordering of accumulate operations to the same memory location)

▸ Explicitly expressing parallelism: different windows or different target ranks
▸ Implicit parallelism: no order between multiple Gets and Puts
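A minimal sketch of the "Different window" row above, assuming MPI_THREAD_MULTIPLE and one pre-created window per thread (wins[tid]); the function name and buffer layout are illustrative.

```c
/* Sketch: per-thread windows so RMA operations fall in the "Different window"
 * row of the table above and can be issued on parallel communication streams.
 * Assumes MPI was initialized with MPI_THREAD_MULTIPLE and wins[] was created earlier. */
#include <mpi.h>
#include <omp.h>

void threaded_puts(double **bufs, MPI_Win *wins, int nthreads, int target, int count)
{
    #pragma omp parallel num_threads(nthreads)
    {
        int tid = omp_get_thread_num();
        MPI_Win win = wins[tid];              /* each thread drives its own window */

        MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);
        MPI_Put(bufs[tid], count, MPI_DOUBLE, target,
                /*target_disp=*/0, count, MPI_DOUBLE, win);
        MPI_Win_unlock(target, win);          /* completes the Put at origin and target */
    }
}
```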
OUTLINE
▸ Introduction
▸ For MPI users: Parallelism in the MPI standard
▸ For MPI developers: Fast MPI+threads
  ▸ Fine-grained critical sections for thread safety
  ▸ Virtual Communication Interfaces (VCIs) for parallel communication streams
▸ Microbenchmark and application analysis
DESERIALIZING ACCESS TO THE MPI LIBRARY
▸ State of the art: global critical section
▸ Adopt fine-grained (FG) critical sections (Balaji et al., Amer et al.)
  ▸ Higher parallelism
  ▸ More lock acquisitions
  ▸ Atomics for counters
[Figure: 8 B MPI_Isend message rate (x10^6 messages/s) vs. number of threads (1-16), Global vs. FG critical sections; MPICH/OFI/OPA]
FG shows overhead in the single-thread case; FG outperforms Global at higher thread counts.
PARALLEL COMMUNICATION STREAMS
▸ Virtual Communication Interfaces (VCIs)
  ▸ Independent set of communication resources with FIFO order
  ▸ Each VCI protected by its own lock
  ▸ Maps to a network hardware context
▸ VCI pool
  ▸ Allocate a VCI to a communicator/window
  ▸ Fallback VCI
[Figure: an MPI process whose communicators C0-C3 each map to a VCI inside the MPI library; each VCI has its own TX queue, RX queue, and completion queue and maps to a network hardware context on the NIC]
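An illustrative sketch, not MPICH's actual data structures, of what a per-VCI resource set might look like; all field and type names are assumptions.

```c
/* Illustrative sketch of a per-VCI resource set (assumed names, not MPICH code).
 * Each VCI owns its lock and queues, so threads driving different VCIs do not
 * contend; alignment avoids false sharing between adjacent VCIs. */
#include <pthread.h>
#include <stdalign.h>

#define CACHE_LINE 64

typedef struct tx_queue   tx_queue_t;    /* transmit queue (hypothetical type) */
typedef struct rx_queue   rx_queue_t;    /* receive queue (hypothetical type) */
typedef struct cmpl_queue cmpl_queue_t;  /* completion queue (hypothetical type) */

typedef struct {
    alignas(CACHE_LINE) pthread_mutex_t lock;  /* per-VCI critical section */
    tx_queue_t   *txq;      /* posted sends, FIFO order within the VCI */
    rx_queue_t   *rxq;      /* posted receives */
    cmpl_queue_t *cq;       /* completions from the network hardware context */
    int           hw_ctx;   /* index of the NIC hardware context this VCI maps to */
} vci_t;

/* VCI pool: communicators/windows are assigned a VCI; VCI 0 is the fallback. */
typedef struct {
    int    num_vcis;
    vci_t *vcis;            /* vcis[0] acts as the fallback VCI */
} vci_pool_t;
```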
[Figure: 8 B MPI_Isend message rate (x10^6 messages/s) vs. number of threads (1-16): Original (Global + 1 VCI), FG, and FG + multiple VCIs; MPICH/OFI/OPA]
Fine-grained critical sections + multiple VCIs alone give practically no benefit
[Figure: 8 B MPI_Isend message rate (x10^6 messages/s) vs. number of cores (1-16) on MPICH/OFI/OPA, comparing Original (Global + 1 VCI), All optimizations, and All without each individual optimization (per-VCI progress, per-VCI request management, cache-aware VCIs)]

Per-VCI progress
▸ Global progress: progressing all VCIs causes high contention on the VCIs' locks
▸ Pure per-VCI progress: progressing only the operation's VCI can deadlock when shared progress is required
▸ Hybrid per-VCI progress (sketched below)

Per-VCI request management
▸ Request-class lock: high contention
▸ Per-VCI request cache
▸ Global lightweight request: contended atomics for refcounting
▸ Per-VCI lightweight request

Per-VCI cache-line awareness
▸ False sharing between the locks of consecutive VCIs
▸ Per-VCI cache alignment
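An illustrative sketch of hybrid per-VCI progress (assumed names, not MPICH's implementation): mostly poll the operation's own VCI, and occasionally trylock-and-poll the others so operations that need shared progress are not starved.

```c
/* Illustrative sketch of hybrid per-VCI progress (assumed names, not MPICH code). */
#include <pthread.h>
#include <stdbool.h>

#define NUM_VCIS 16
#define SHARED_POLL_PERIOD 64   /* how often to help other VCIs */

typedef struct {
    pthread_mutex_t lock;
    /* ... TX/RX/completion queues ... */
} vci_t;

extern vci_t vcis[NUM_VCIS];    /* assumed global VCI pool, defined elsewhere */
bool vci_poll(vci_t *vci);      /* assumed helper that drains a VCI's completion queue */

bool hybrid_progress(int my_vci)
{
    static _Thread_local unsigned calls = 0;
    bool made_progress = false;

    /* Common case: progress only the VCI this operation was issued on. */
    pthread_mutex_lock(&vcis[my_vci].lock);
    made_progress |= vci_poll(&vcis[my_vci]);
    pthread_mutex_unlock(&vcis[my_vci].lock);

    /* Occasionally help the other VCIs, using trylock to avoid contending with
     * the threads that own them; this avoids the deadlock that pure per-VCI
     * progress can cause when shared progress is required. */
    if (++calls % SHARED_POLL_PERIOD == 0) {
        for (int v = 0; v < NUM_VCIS; v++) {
            if (v == my_vci) continue;
            if (pthread_mutex_trylock(&vcis[v].lock) == 0) {
                made_progress |= vci_poll(&vcis[v]);
                pthread_mutex_unlock(&vcis[v].lock);
            }
        }
    }
    return made_progress;
}
```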
OUTLINE
▸ Introduction
▸ For MPI users: Parallelism in the MPI standard
▸ For MPI developers: Fast MPI+threads
  ▸ Fine-grained critical sections for thread safety
  ▸ Virtual Communication Interfaces (VCIs) for parallel communication streams
▸ Microbenchmark and application analysis
APPLICATION CATEGORIES
▸ Category 1
  ▸ Direct use of parallel communication streams
  ▸ VCIs as good as user-visible endpoints and MPI everywhere
▸ Category 2
  ▸ Require shared progress
  ▸ Both VCIs and user-visible endpoints perform poorly
▸ Category 3
  ▸ Abstraction through MPI-3.1 prevents the user from expressing parallelism
  ▸ User-visible endpoints perform better than VCIs
CATEGORY 1: POINT-TO-POINT MICROBENCHMARK
[Figure: MPI_Isend message rate (x10^6 messages/s) vs. number of cores (1-16) on MPICH/OFI/OPA and MPICH/UCX/IB: MPI Everywhere, +Threads (ser_comm+orig_mpich), +Threads (par_comm+orig_mpich), +Threads (par_comm+vcis), +Threads (ser_comm+vcis), +Threads (Endpoints)]
No scaling without user-expressed parallelism (ser_comm) or without VCIs (orig_mpich)
[Figure: MPI_Isend message rate vs. message size (1 B to 64 KiB) at 16 cores, MPICH/OFI/OPA and MPICH/UCX/IB, same configurations]
Parallel communication streams are effective only when throughput is bound by the rate at which operations are issued.
[Figure: Isend message rate on MPICH/UCX/IB: MPI Everywhere vs. MPI+threads, MPI+threads without locks, and MPI+threads without atomics]
VCIs and user-visible endpoints fall short of MPI everywhere because of thread-safety costs.
Takeaway: For basic communication, VCIs and endpoints perform similarly and nearly as well as MPI everywhere.
CATEGORY 1: STENCIL APPLICATIONS
[Figure: 2D stencil on processes P0-P3 with four threads (T0-T3) each. With MPI communicators, each thread's halo exchanges use per-direction communicators (NS_A, NS_B, EW_A, EW_B). With MPI endpoints, each thread Ti of process Pj drives its own endpoint EP_i with a distinct rank R0-R15 in the endpoints communicator.]
[Figure: halo communication time (ms) per iteration vs. mesh dimension (48 to 196608); 9 nodes, 16 cores per node, MPICH/OFI/OPA: MPI Everywhere, +Threads (Original), +Threads (VCIs), +Threads (Endpoints), +Threads (FUNNELED)]
Recommendation: Maximize independence between threads for point-to-point communication using communicators.
Warning: Independent communication with ranks or tags alone is not sufficient because of receive wildcards.
Warning: Expressing parallelism with MPI-3.1 can be clumsier than with user-visible endpoints because of matching requirements.
APPLICATION CATEGORIES
▸ Category 1
  ▸ Direct use of parallel communication streams
  ▸ VCIs as good as user-visible endpoints and MPI everywhere
▸ Category 2
  ▸ Require shared progress
  ▸ Both VCIs and user-visible endpoints perform poorly
▸ Category 3
  ▸ Abstraction through MPI-3.1 prevents the user from expressing parallelism
  ▸ User-visible endpoints perform better than VCIs
CATEGORY 2: RMA MICROBENCHMARK
[Figure: MPI_Put message rate (x10^6 messages/s) vs. number of cores (1-16) on MPICH/OFI/OPA and MPICH/UCX/IB: MPI Everywhere and the MPI+threads variants]
MPI everywhere performs best because target ranks progress their own VCIs. Intel OPA emulates RMA in software, so the target's VCI must be involved; Mellanox IB implements Puts entirely in hardware.
Takeaway: When shared progress is required, neither VCIs nor endpoints perform well.
CATEGORY 2: OPENMC
▸ OpenMC: distributed Monte Carlo neutron-transport code
▸ Band data equally distributed between nodes
▸ Particles distributed between nodes for simulation
▸ Each node fetches (MPI_Get) a band of data, processes its particles, and iterates (sketched below)
[Figure: rank i's data split into bands 1-3, exposed through a single MPI window vs. through endpoints]
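A minimal sketch of the fetch step described above, assuming a passive-target window already locked with MPI_Win_lock_all; the helper name and displacement arithmetic are illustrative, not OpenMC's actual code.

```c
/* Sketch of OpenMC's fetch-and-process pattern (illustrative names only). */
#include <mpi.h>

/* Fetch one band of data from 'owner' and wait for it to arrive.
 * Assumes the caller already called MPI_Win_lock_all(0, win). */
void fetch_band(void *local_buf, int band_size, int owner, int band_idx, MPI_Win win)
{
    MPI_Aint disp = (MPI_Aint)band_idx * band_size;

    MPI_Get(local_buf, band_size, MPI_BYTE, owner, disp, band_size, MPI_BYTE, win);
    /* Completion of the Get; with independent per-thread communication streams
     * this flush does not have to progress other threads' operations. */
    MPI_Win_flush(owner, win);

    /* ... process this node's particles against the band, then iterate ... */
}
```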
[Figure: OpenMC fetch time (ms) vs. band size (64 B to 1 MiB) on MPICH/UCX/IB and MPICH/OFI/OPA: MPI Everywhere + shared memory, +Threads (Original), +Threads (VCIs), +Threads (Endpoints); breakdown on MPICH/OFI/OPA of time per Get and time per Flush]
Shared progress: thread A progresses the VCI of thread B; required for correctness.
VCIs are as good as endpoints and MPI everywhere when shared progress is not required. Issuing operations is fast; shared progress hurts the completion of operations.
Recommendation: Maximize independence between threads for RMA communication using MPI windows.
Warning: Independent communication with VCIs and user-visible endpoints fundamentally opposes shared progress.
APPLICATION CATEGORIES
▸ Category 1
  ▸ Direct use of parallel communication streams
  ▸ VCIs as good as user-visible endpoints and MPI everywhere
▸ Category 2
  ▸ Require shared progress
  ▸ Both VCIs and user-visible endpoints perform poorly
▸ Category 3
  ▸ Abstraction through MPI-3.1 prevents the user from expressing parallelism
  ▸ User-visible endpoints perform better than VCIs
CATEGORY 3: LIMITING MPI SEMANTICS
▸ Example: microbenchmark capturing the communication pattern in Legion's runtime
▸ Contention between the receiver thread and the sender threads with communicators; no contention with endpoints
[Figure: Isend message rate (x10^6 messages/s) vs. number of sender threads (1-16) using Endpoints vs. Communicators; MPICH/OFI/OPA]
Takeaway: User-visible endpoints perform better than VCIs when MPI's semantics prevent the user from expressing parallelism, especially in irregular communication patterns.
CATEGORY 3: NWCHEM
▸ NWChem: quantum chemistry application suite
▸ Dominant cost is block-sparse matrix multiplication (BSPMM)
▸ A x B += C get-compute-update pattern (MPI_Get + MPI_Accumulate), sketched below
▸ Each worker on a node (thread or process) participates in BSPMM independently
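A minimal sketch of one get-compute-update step, assuming passive-target windows for the A, B, and C tiles; the naive local multiply stands in for an optimized dgemm, and none of this is NWChem's actual code.

```c
/* Sketch of one get-compute-update step of BSPMM (illustrative, not NWChem code).
 * Assumes win_a, win_b, win_c were created and locked (MPI_Win_lock_all) earlier. */
#include <mpi.h>

void bspmm_step(double *a, double *b, double *c, int n,          /* n x n tiles */
                int owner_a, MPI_Aint disp_a,
                int owner_b, MPI_Aint disp_b,
                int owner_c, MPI_Aint disp_c,
                MPI_Win win_a, MPI_Win win_b, MPI_Win win_c)
{
    int count = n * n;

    /* Get: fetch the A and B tiles. */
    MPI_Get(a, count, MPI_DOUBLE, owner_a, disp_a, count, MPI_DOUBLE, win_a);
    MPI_Get(b, count, MPI_DOUBLE, owner_b, disp_b, count, MPI_DOUBLE, win_b);
    MPI_Win_flush(owner_a, win_a);
    MPI_Win_flush(owner_b, win_b);

    /* Compute: c = a x b (naive local multiply in place of an optimized dgemm). */
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }

    /* Update: accumulate the local result into the remote C tile. */
    MPI_Accumulate(c, count, MPI_DOUBLE, owner_c, disp_c, count, MPI_DOUBLE,
                   MPI_SUM, win_c);
    MPI_Win_flush(owner_c, win_c);
}
```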
[Figure: a rank's tiles 1-3 accessed through a single MPI window vs. through endpoints, annotated Parallel Get and Parallel Accumulate]
[Figure: time (ms) vs. tile dimension (1-128) on MPICH/OFI/OPA for Get, Get-flush, Accum, and Accum-flush: MPI Everywhere, +Threads (Original), +Threads (VCIs), +Threads (Endpoints)]
Issuing Gets is fast; shared progress hurts the completion of Gets. VCIs issue Accumulates more slowly than endpoints because of the single window; endpoints complete Accumulates more slowly than VCIs because of shared progress.
Warning: Atomic operation semantics are not easy to achieve with multiple windows; using multiple VCIs may not help.
Tip: If the application allows it, pass the info hint accumulate_ordering=none (sketched below); the MPI library can then exploit implicit parallelism.
[Figure: the same Get, Get-flush, Accum, and Accum-flush timings with an additional +Threads (VCIs w/ hints) configuration]
VCIs with hints perform as well as endpoints.
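A minimal sketch of passing the hint at window creation; accumulate_ordering is a standard MPI window info key, while the helper name and allocation are illustrative.

```c
/* Sketch: relax accumulate ordering at window creation so the MPI library can
 * exploit implicit parallelism between accumulates (only valid if the
 * application does not rely on ordered accumulates). */
#include <mpi.h>

MPI_Win create_tile_window(MPI_Aint bytes, void **base)
{
    MPI_Info info;
    MPI_Win  win;

    MPI_Info_create(&info);
    MPI_Info_set(info, "accumulate_ordering", "none");  /* standard MPI info key */

    MPI_Win_allocate(bytes, /*disp_unit=*/1, info, MPI_COMM_WORLD, base, &win);
    MPI_Info_free(&info);
    return win;
}
```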
CONCLUDING REMARKS
▸ MPI+threads is critical for modern processors
▸ Users must proactively express logical parallelism
▸ User-visible endpoints are not critical for expressing logical parallelism
  ▸ MPI-3.1 already features plenty of parallelism
▸ VCIs perform as well as user-visible endpoints without burdening the user
▸ New info hints in MPI-4.0 give more options for expressing logical parallelism
  ▸ Enabling exploration of advanced mapping policies in the MPI library
THANK YOU!
Email questions to rzambre@uci.edu