SLIDE 1

How I Learned to Stop Worrying About User-Visible Endpoints and Love MPI

Rohit Zambre,* Aparna Chandramowlishwaran,* Pavan Balaji^

*University of California, Irvine
^Argonne National Laboratory

SLIDE 2

MPI everywhere

[Diagram: node with cores, one MPI process per core; legend: Node, Core, Process]

SLIDE 3

MPI everywhere


▸ Model artifact: high memory requirements that worsen with increasing domain dimensionality and number of ranks.

▸ Hardware usage: resources are wasted by the static split of the processor's limited resources across processes.

(Trend: increasing number of cores, decreasing memory per core.)

SLIDE 4

MPI+threads

▸ Model artifact: duplicated data is reduced by a factor of the number of threads.

▸ Hardware usage: all of the cores can be used while the processor's resources are shared among threads.

(Trend: increasing number of cores, decreasing memory per core.)

[Diagram: node with cores, one process spanning the cores with one thread per core; legend: Node, Core, Thread, Process]
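To make the MPI+threads model concrete, the following is a minimal sketch (not from the slides) of a hybrid MPI + OpenMP program that requests MPI_THREAD_MULTIPLE so every thread may call MPI, assuming every rank runs the same number of threads; the ring exchange, message size, and use of the thread ID as the tag are illustrative choices.

#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv) {
    int provided, rank, nranks;
    /* Ask for full thread support: any thread may call MPI concurrently. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE)
        MPI_Abort(MPI_COMM_WORLD, 1);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    #pragma omp parallel
    {
        int tid  = omp_get_thread_num();
        int next = (rank + 1) % nranks, prev = (rank + nranks - 1) % nranks;
        int sendbuf = rank * 1000 + tid, recvbuf = -1;
        MPI_Request reqs[2];
        /* All threads share MPI_COMM_WORLD, so the MPI library must protect
         * its internal state; the thread ID serves as the matching tag. */
        MPI_Irecv(&recvbuf, 1, MPI_INT, prev, tid, MPI_COMM_WORLD, &reqs[0]);
        MPI_Isend(&sendbuf, 1, MPI_INT, next, tid, MPI_COMM_WORLD, &reqs[1]);
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    }

    MPI_Finalize();
    return 0;
}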

SLIDE 5

[Chart: time (seconds), broken into Computation, Allgatherv, and Alltoallv, for processor grids (threads x processor rows x processor columns) 1x440x110, 1x220x220, 1x110x440, 6x180x45, 6x90x90, and 6x45x180]

MPI everywhere: OOM!

Corresponding MPI+threads runs

Buluc et al., Distributed BFS (https://arxiv.org/abs/1705.04590)

SLIDE 6

[Chart: MPI_Isend (8 B) message rate (million messages/s) vs. number of cores (1–16) for MPI everywhere, MPI+threads (MPI_THREAD_FUNNELED), and MPI+threads (MPI_THREAD_MULTIPLE)]

Communication performance of MPI+threads is dismal


SLIDE 7


Outdated view: the network is a single device. Modern reality: the network features parallelism.

[Diagram: Network Interface Card as a single device vs. Network Interface Card with multiple network hardware contexts]

SLIDE 8

[Diagram: MPI everywhere (processes P0–P3) vs. MPI+threads (a single process P0), each above the Application, MPI library, and Network Interface Card layers; legend: network hardware context, software communication channel]

SLIDE 9

[Diagram repeated from Slide 8. Annotation on MPI+threads: global critical section + one communication channel per process.]

SLIDE 10

[Diagram repeated from Slide 8. Annotations on MPI+threads: no logical parallelism expressed; global critical section + one communication channel per process.]

SLIDE 11

MPI_Comm_create_endpoints(…, num_ep, …, comm_eps[]);
MPI_Isend/Irecv(…, comm_eps[tid], ep_rank, …);

[Diagram: an MPI process whose threads drive multiple MPI endpoints inside the MPI library, each endpoint backed by a network hardware context on the Network Interface Card; legend: network hardware context, MPI endpoint, software communication channel, MPI communicator]
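The following is a minimal sketch of how the proposed user-visible endpoints interface could be driven from an OpenMP region. MPI_Comm_create_endpoints is the proposal discussed in this talk, not part of MPI-3.1, so the exact signature and semantics shown here are assumptions for illustration only.

/* Hypothetical sketch: MPI_Comm_create_endpoints is a PROPOSED extension,
 * not part of MPI-3.1; the signature below paraphrases the calls shown on
 * this slide and may not match any real implementation. */
#include <mpi.h>
#include <omp.h>

void exchange_with_endpoints(MPI_Comm parent, int num_ep, int peer_ep_rank) {
    MPI_Comm comm_eps[num_ep];
    /* Each returned communicator handle represents one endpoint; a thread
     * uses "its" handle, so the MPI library sees one explicit stream per thread. */
    MPI_Comm_create_endpoints(parent, num_ep, MPI_INFO_NULL, comm_eps);

    #pragma omp parallel num_threads(num_ep)
    {
        int tid = omp_get_thread_num();
        int sendbuf = tid, recvbuf = -1;
        MPI_Request reqs[2];
        MPI_Irecv(&recvbuf, 1, MPI_INT, peer_ep_rank, 0, comm_eps[tid], &reqs[0]);
        MPI_Isend(&sendbuf, 1, MPI_INT, peer_ep_rank, 0, comm_eps[tid], &reqs[1]);
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    }

    for (int i = 0; i < num_ep; i++)
        MPI_Comm_free(&comm_eps[i]);
}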

SLIDE 12


Pros

▸ Explicit control over network contexts

Cons

▸ Intrusive extension of the MPI standard
▸ Onus of managing network contexts falls on the user

SLIDE 13

[Diagram: MPI everywhere (P0–P3) vs. MPI+threads (P0 with communicators C0–C3). Annotation on MPI+threads: logical parallelism expressed; fine-grained critical sections + multiple communication channels per process. Legend: network hardware context, software communication channel, MPI communicator]

SLIDE 14


Do we need user-visible endpoints?

SLIDE 15

CONTRIBUTIONS AS DEVIL’S ADVOCATE

▸ In-depth comparison between MPI-3.1 and user-visible endpoints
▸ A fast MPI+threads library that adheres to MPI-3.1's constraints
▸ Optimized parallel communication streams applicable to all MPI libraries
▸ Recommendations for the MPI user to express logical parallelism with MPI-3.1


Evaluation platforms

MPI library

▸ Based on MPICH:CH4

Interconnects

▸ Intel Omni-Path (OPA) with OFI:PSM2
▸ Mellanox InfiniBand (IB) with UCX:Verbs

SLIDE 16

OUTLINE


▸ Introduction
▸ For MPI users: Parallelism in the MPI standard
▸ For MPI developers: Fast MPI+threads
  ▸ Fine-grained critical sections for thread safety
  ▸ Virtual Communication Interfaces (VCIs) for parallel communication streams
▸ Microbenchmark and application analysis

SLIDE 17

POINT-TO-POINT COMMUNICATION

▸ <comm, rank, tag> decides matching
▸ Non-overtaking order
▸ Receive wildcards

Two or more operations on a process with a given <Comm, Rank, Tag>: can they (Send / Recv) be issued on parallel communication streams?

SLIDE 18

POINT-TO-POINT COMMUNICATION

▸ <comm, rank, tag> decides matching
▸ Non-overtaking order
▸ Receive wildcards

Example: Rank 0 (sender) issues <CA,R1,T1> and <CB,R1,T1>; Rank 1 (receiver) posts <CA,R0,T1> and <CB,R0,T1>.

Comm      | Rank              | Tag               | Send | Recv
----------+-------------------+-------------------+------+-----
Different | Different or Same | Different or Same | Yes  | Yes

SLIDE 19

POINT-TO-POINT COMMUNICATION

▸ <comm, rank, tag> decides matching
▸ Non-overtaking order
▸ Receive wildcards

Example: Rank 0 (sender) issues <CA,R1,T1> and <CA,R2,T1>; Rank 1 (receiver) posts <CA,R0,T1> and <CA,ANY,T1>.

Comm      | Rank              | Tag               | Send | Recv
----------+-------------------+-------------------+------+-----
Different | Different or Same | Different or Same | Yes  | Yes
Same      | Different         | Different or Same | Yes  | No  (wildcards)

SLIDE 20

POINT-TO-POINT COMMUNICATION

▸ <comm, rank, tag> decides matching
▸ Non-overtaking order
▸ Receive wildcards

Example: Rank 0 (sender) issues <CA,R1,T1> and <CA,R1,T2>; Rank 1 (receiver) posts <CA,R0,T3> and <CA,R0,ANY>.

Comm      | Rank              | Tag               | Send | Recv
----------+-------------------+-------------------+------+-----
Different | Different or Same | Different or Same | Yes  | Yes
Same      | Different         | Different or Same | Yes  | No  (wildcards)
Same      | Same              | Different or Same | No   | No  (wildcards, non-overtaking order)
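As a concrete illustration of the first table row, the sketch below (my own, not from the slides) gives each thread its own duplicated communicator so that its sends and receives match independently and can, in principle, flow on parallel communication streams; the name comm_per_thread, the peer argument, and the message size are placeholders.

#include <mpi.h>
#include <omp.h>

/* Sketch: one communicator per thread so point-to-point traffic from
 * different threads matches independently (row 1 of the table: different
 * comms => parallel streams are safe for both sends and receives). */
void thread_parallel_pt2pt(int nthreads, int peer) {
    MPI_Comm comm_per_thread[nthreads];
    for (int i = 0; i < nthreads; i++)
        MPI_Comm_dup(MPI_COMM_WORLD, &comm_per_thread[i]);

    #pragma omp parallel num_threads(nthreads)
    {
        int tid = omp_get_thread_num();
        int sendbuf = tid, recvbuf = -1;
        MPI_Request reqs[2];
        /* The same tag on every thread is fine: the distinct communicators
         * already keep the matching spaces separate. */
        MPI_Irecv(&recvbuf, 1, MPI_INT, peer, 0, comm_per_thread[tid], &reqs[0]);
        MPI_Isend(&sendbuf, 1, MPI_INT, peer, 0, comm_per_thread[tid], &reqs[1]);
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    }

    for (int i = 0; i < nthreads; i++)
        MPI_Comm_free(&comm_per_thread[i]);
}

Both peers are assumed to call this with the same nthreads so that communicator i on one rank matches communicator i on the other.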

SLIDE 21

RMA COMMUNICATION

Two or more operations on a process with a given <Window, Rank>: can they (Put / Get / Accumulate) be issued on parallel communication streams?

Window    | Rank              | Put | Get | Accumulate
----------+-------------------+-----+-----+-----------
Different | Different or Same | Yes | Yes | Yes
Same      | Different         | Yes | Yes | Yes
Same      | Same              | Yes | Yes | No

SLIDE 22

RMA COMMUNICATION

Window    | Rank              | Put | Get | Accumulate
----------+-------------------+-----+-----+-----------
Different | Different or Same | Yes | Yes | Yes
Same      | Different         | Yes | Yes | Yes
Same      | Same              | Yes | Yes | No

▸ Explicitly expressing parallelism (different windows or ranks)
▸ Implicit parallelism: no order between multiple Gets and Puts

SLIDE 23

RMA COMMUNICATION

Window    | Rank              | Put | Get | Accumulate
----------+-------------------+-----+-----+-----------
Different | Different or Same | Yes | Yes | Yes
Same      | Different         | Yes | Yes | Yes
Same      | Same              | Yes | Yes | No

▸ Explicitly expressing parallelism (different windows or ranks)
▸ Implicit parallelism: no order between multiple Gets and Puts
▸ Accumulate is "No" because of the ordering of accumulate operations to the same memory location
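To show what "different windows" means in practice, here is a small sketch (mine, with hypothetical names and sizes) in which each thread drives RMA operations on its own window, so its operations fall under the first table row and can map to independent communication streams.

#include <mpi.h>
#include <omp.h>

/* Sketch: one MPI window per thread so each thread's RMA operations target a
 * different window (row 1 of the RMA table). COUNT and the datatype are
 * illustrative choices. */
#define COUNT 1024

void thread_parallel_rma(int nthreads, int target) {
    MPI_Win win[nthreads];
    double *base[nthreads];

    /* Collective: every rank allocates one window per thread. */
    for (int i = 0; i < nthreads; i++)
        MPI_Win_allocate(COUNT * sizeof(double), sizeof(double),
                         MPI_INFO_NULL, MPI_COMM_WORLD, &base[i], &win[i]);

    #pragma omp parallel num_threads(nthreads)
    {
        int tid = omp_get_thread_num();
        double local[COUNT];
        MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win[tid]);
        MPI_Get(local, COUNT, MPI_DOUBLE, target, 0, COUNT, MPI_DOUBLE, win[tid]);
        MPI_Win_flush(target, win[tid]);   /* completes only this thread's Get */
        MPI_Win_unlock(target, win[tid]);
    }

    for (int i = 0; i < nthreads; i++)
        MPI_Win_free(&win[i]);
}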

SLIDE 24

OUTLINE


▸ Introduction
▸ For MPI users: Parallelism in the MPI standard
▸ For MPI developers: Fast MPI+threads
  ▸ Fine-grained critical sections for thread safety
  ▸ Virtual Communication Interfaces (VCIs) for parallel communication streams
▸ Microbenchmark and application analysis

SLIDE 25

DESERIALIZING ACCESS TO THE MPI LIBRARY

▸ State of the art: global critical section
▸ Adopt fine-grained critical sections (Balaji et al., Amer et al.)
  ▸ Higher parallelism
  ▸ More lock acquisitions
  ▸ Atomics for counters


[Charts: 8 B MPI_Isend message rate (x10^6 messages/s) on MPICH/OFI/OPA with Global vs. fine-grained (FG) critical sections, for a single thread and for 1–16 threads]

FG has overhead in the single-thread case; FG outperforms Global at higher thread counts.

SLIDE 26

PARALLEL COMMUNICATION STREAMS

▸ Virtual Communication Interfaces (VCIs)
  ▸ Independent set of communication resources with FIFO order
  ▸ Each VCI protected by its own lock
  ▸ Maps to a network hardware context
▸ VCI pool
  ▸ Allocate a VCI to a communicator/window
  ▸ Fallback VCI

[Diagram: MPI process with communicators C0–C3, each backed by a VCI inside the MPI library; each VCI holds its own queues (TX Q, RX Q, C Q) and maps to a network hardware context on the Network Interface Card]
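The sketch below is a rough, hypothetical rendering (not MPICH's actual code) of what an "independent set of communication resources protected by its own lock" might look like as a data structure, with the per-VCI queues from the diagram and the cache-line alignment mentioned later in the talk.

#include <pthread.h>
#include <stdint.h>

/* Hypothetical sketch of a VCI: every field is per-VCI so that threads
 * operating on different VCIs never touch shared state. Names and layout
 * are illustrative, not MPICH's real internals. */
#define CACHE_LINE 64

struct msg_queue {           /* FIFO of pending operations (illustrative) */
    void *head, *tail;
};

struct vci {
    pthread_mutex_t lock;        /* per-VCI critical section                */
    struct msg_queue tx_q;       /* transmit queue (TX Q)                   */
    struct msg_queue rx_q;       /* receive queue (RX Q)                    */
    struct msg_queue c_q;        /* completion queue (C Q)                  */
    int      hw_context_id;      /* network hardware context it maps to     */
    uint64_t posted, completed;  /* per-VCI counters, no shared atomics     */
} __attribute__((aligned(CACHE_LINE)));  /* avoid false sharing between VCIs */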

SLIDE 27

[Chart: 8 B MPI_Isend message rate (x10^6 messages/s) vs. number of threads (1–16) on MPICH/OFI/OPA for Original (Global + 1 VCI), FG, and FG + multiple VCIs]

Fine-grained critical sections + multiple VCIs alone give practically no benefit

SLIDE 28

[Chart: 8 B MPI_Isend message rate (x10^6 messages/s) vs. number of cores (1–16) on MPICH/OFI/OPA for Original (Global + 1 VCI), All optimizations, and All w/o per-VCI progress]

Per-VCI progress

▸ Global progress: progress all VCIs
  ▸ High contention on the VCIs' locks
▸ Pure per-VCI progress: progress only the VCI of the operation
  ▸ Deadlock when shared progress is required
▸ Hybrid per-VCI progress
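As a rough illustration (my own pseudocode-style C, not the MPICH implementation), hybrid per-VCI progress can be pictured as polling only the operation's own VCI on the fast path while occasionally poking every VCI so that operations needing shared progress cannot be starved; poll_vci, NUM_VCIS, and the polling interval are hypothetical.

/* Illustrative sketch of hybrid per-VCI progress. */
#define NUM_VCIS 16
#define GLOBAL_POLL_INTERVAL 64

extern int poll_vci(int vci);              /* assumed: drains one VCI's queues */

void progress_wait(int my_vci, volatile int *done) {
    unsigned iter = 0;
    while (!*done) {
        poll_vci(my_vci);                  /* fast path: only this op's VCI   */
        if (++iter % GLOBAL_POLL_INTERVAL == 0) {
            /* Slow path: occasionally progress every VCI so requests that
             * depend on other VCIs (shared progress) cannot deadlock. */
            for (int v = 0; v < NUM_VCIS; v++)
                if (v != my_vci)
                    poll_vci(v);
        }
    }
}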

SLIDE 29

[Charts: 8 B MPI_Isend message rate (x10^6 messages/s) vs. number of cores (1–16) on MPICH/OFI/OPA for Original (Global + 1 VCI), All optimizations, All w/o per-VCI progress, and All w/o per-VCI request management]

Per-VCI progress: see Slide 28.

Per-VCI request management
▸ Request class lock: high contention
▸ Per-VCI request cache
▸ Global lightweight request: contended atomics for refcounting
▸ Per-VCI lightweight request

SLIDE 30

[Charts: 8 B MPI_Isend message rate (x10^6 messages/s) vs. number of cores (1–16) on MPICH/OFI/OPA for Original (Global + 1 VCI), All optimizations, All w/o per-VCI progress, All w/o per-VCI request management, and All w/o cache-aware VCIs]

Per-VCI progress: see Slide 28. Per-VCI request management: see Slide 29.

Per-VCI cache-line awareness
▸ False sharing: locks of consecutive VCIs
▸ Per-VCI cache alignment

SLIDE 31

[Charts and bullets repeated from Slide 30.]

SLIDE 32

OUTLINE


▸ Introduction
▸ For MPI users: Parallelism in the MPI standard
▸ For MPI developers: Fast MPI+threads
  ▸ Fine-grained critical sections for thread safety
  ▸ Virtual Communication Interfaces (VCIs) for parallel communication streams
▸ Microbenchmark and application analysis

SLIDE 33

APPLICATION CATEGORIES

▸ Category 1
  ▸ Direct use of parallel communication streams
  ▸ VCIs as good as user-visible endpoints and MPI everywhere
▸ Category 2
  ▸ Require shared progress
  ▸ Both VCIs and user-visible endpoints perform poorly
▸ Category 3
  ▸ Abstraction through MPI-3.1 prevents the user from expressing parallelism
  ▸ User-visible endpoints perform better than VCIs


SLIDE 34

CATEGORY 1: POINT-TO-POINT MICROBENCHMARK


[Charts: Isend message rate (x10^6 messages/s) vs. number of cores (1–16) on MPICH/OFI/OPA and MPICH/UCX/IB for MPI Everywhere, +Threads (ser_comm+orig_mpich), +Threads (par_comm+orig_mpich), +Threads (par_comm+vcis), +Threads (ser_comm+vcis), and +Threads (Endpoints)]

No scaling without user-expressed parallelism (ser_comm) or without VCIs (orig_mpich)

SLIDE 35

CATEGORY 1: POINT-TO-POINT MICROBENCHMARK


[Charts: Isend message rate (x10^6 messages/s) vs. message size (1 B–64 KiB) at 16 cores, and vs. number of cores (1–16), on MPICH/OFI/OPA and MPICH/UCX/IB, for the same six configurations as Slide 34]

No scaling without user-expressed parallelism (ser_comm) or without VCIs (orig_mpich). Parallel communication streams are effective only when bound by the rate of issue of operations.
SLIDE 36

CATEGORY 1: POINT-TO-POINT MICROBENCHMARK

[Chart: Isend message rate (x10^6 messages/s) on MPICH/UCX/IB for MPI+Threads (+no atomics), MPI+Threads (+no locks), MPI+Threads, and MPI Everywhere]

VCIs and user-visible endpoints fall short of MPI everywhere due to thread-safety costs.

Takeaway: For basic communication, VCIs and endpoints perform similarly and nearly as well as MPI everywhere.

SLIDE 37

CATEGORY 1: STENCIL APPLICATIONS

[Diagram: four processes (P0–P3), each with four threads (T0–T3). With MPI-3.1, threads exchange halos using direction-specific communicators (EW_A, EW_B, NS_A, NS_B); with user-visible endpoints, each thread uses its own endpoint (EP_0–EP_3) with a distinct endpoint rank (R0–R15). Legend: MPI endpoint, MPI communicator]

SLIDE 38

CATEGORY 1: STENCIL APPLICATIONS

[Chart: halo communication time per iteration (ms) vs. mesh dimension (48–196608) on 9 nodes, 16 cores per node, MPICH/OFI/OPA, for MPI Everywhere, +Threads (Original), +Threads (VCIs), +Threads (Endpoints), and +Threads (FUNNELED)]

Recommendation: Maximize independence between threads for point-to-point communication with communicators (see the sketch below).
Warning: Independent communication with ranks or tags alone is not sufficient because of receive wildcards.
Warning: Expressing parallelism with MPI-3.1 can be clumsier than with user-visible endpoints because of the matching requirements.
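A minimal sketch (mine, not the authors' code) of the recommendation above, reduced to a 1-D north–south exchange: each of two threads owns one direction and all traffic in a given direction travels on its own duplicated communicator, echoing the NS_A/NS_B communicators in the diagram on Slide 37. Neighbor ranks, buffer sizes, and communicator names are placeholders, and the clumsiness the warning mentions shows up in having to agree on which communicator carries which direction.

#include <mpi.h>
#include <omp.h>

#define HALO 1024   /* illustrative halo size (elements) */

/* comm_up carries all northward messages, comm_down all southward ones, so a
 * thread's receive matches only its partner's send on the same communicator.
 * north/south are neighbor ranks (MPI_PROC_NULL at domain edges). */
void ns_halo_exchange(int north, int south,
                      double to_north[HALO],   double to_south[HALO],
                      double from_north[HALO], double from_south[HALO]) {
    MPI_Comm comm_up, comm_down;   /* in real code, created once, not per call */
    MPI_Comm_dup(MPI_COMM_WORLD, &comm_up);
    MPI_Comm_dup(MPI_COMM_WORLD, &comm_down);

    #pragma omp parallel num_threads(2)
    {
        MPI_Request reqs[2];
        if (omp_get_thread_num() == 0) {      /* thread 0 talks to the north neighbor */
            MPI_Irecv(from_north, HALO, MPI_DOUBLE, north, 0, comm_down, &reqs[0]);
            MPI_Isend(to_north,   HALO, MPI_DOUBLE, north, 0, comm_up,   &reqs[1]);
        } else {                              /* thread 1 talks to the south neighbor */
            MPI_Irecv(from_south, HALO, MPI_DOUBLE, south, 0, comm_up,   &reqs[0]);
            MPI_Isend(to_south,   HALO, MPI_DOUBLE, south, 0, comm_down, &reqs[1]);
        }
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    }

    MPI_Comm_free(&comm_up);
    MPI_Comm_free(&comm_down);
}

Because the two receives in a process land on different communicators, they fall under the first row of the point-to-point table and can proceed on parallel streams.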

SLIDE 39

APPLICATION CATEGORIES

▸ Category 1
  ▸ Direct use of parallel communication streams
  ▸ VCIs as good as user-visible endpoints and MPI everywhere
▸ Category 2
  ▸ Require shared progress
  ▸ Both VCIs and user-visible endpoints perform poorly
▸ Category 3
  ▸ Abstraction through MPI-3.1 prevents the user from expressing parallelism
  ▸ User-visible endpoints perform better than VCIs


SLIDE 40

CATEGORY 2: RMA MICROBENCHMARK

[Charts: Put message rate (x10^6 messages/s) vs. number of cores (1–16) on MPICH/OFI/OPA and MPICH/UCX/IB for MPI Everywhere, +Threads (ser_comm+orig_mpich), +Threads (par_comm+orig_mpich), +Threads (par_comm+vcis), +Threads (ser_comm+vcis), and +Threads (Endpoints)]

▸ MPI everywhere performs best because target ranks progress their VCIs.
▸ Intel OPA emulates RMA in software, requiring target-side VCI involvement.
▸ Mellanox IB implements Puts completely in hardware.

Takeaway: When shared progress is required, neither VCIs nor endpoints perform well.

SLIDE 41

CATEGORY 2: OPENMC

▸ OpenMC: distributed Monte Carlo neutron-transport code
▸ Band data equally distributed between nodes
▸ Particles distributed between nodes for simulation
▸ Each node fetches (MPI_Get) a band of data, processes its particles, and iterates

[Diagram: Rank i exposing bands 1–3, via an MPI window and via endpoints]

SLIDE 42

CATEGORY 2: OPENMC

[Charts: time (ms) vs. band size (64 B–1 MiB) on MPICH/UCX/IB and MPICH/OFI/OPA for MPI Everywhere + shared memory, +Threads (Original), +Threads (VCIs), and +Threads (Endpoints)]

VCIs as good as endpoints and MPI everywhere when shared progress not required

SLIDE 43

CATEGORY 2: OPENMC

[Charts: overall time (ms) vs. band size on MPICH/UCX/IB and MPICH/OFI/OPA (repeated from Slide 42), plus time per Get (ms) and time per Flush (ms) vs. band size on MPICH/OFI/OPA, for MPI Everywhere + shared memory, +Threads (Original), +Threads (VCIs), and +Threads (Endpoints)]

Shared progress: thread A progresses the VCI of thread B; required for correctness.

▸ VCIs are as good as endpoints and MPI everywhere when shared progress is not required.
▸ Issue of operations is fast.
▸ Shared progress hurts completion of operations.
SLIDE 44

CATEGORY 2: OPENMC

[Charts and annotations repeated from Slide 43.]

Recommendation: Maximize independence between threads for RMA communication with MPI windows.
Warning: Independent communication with VCIs and user-visible endpoints fundamentally opposes shared progress.
SLIDE 45

APPLICATION CATEGORIES

▸ Category 1
  ▸ Direct use of parallel communication streams
  ▸ VCIs as good as user-visible endpoints and MPI everywhere
▸ Category 2
  ▸ Require shared progress
  ▸ Both VCIs and user-visible endpoints perform poorly
▸ Category 3
  ▸ Abstraction through MPI-3.1 prevents the user from expressing parallelism
  ▸ User-visible endpoints perform better than VCIs


SLIDE 46

CATEGORY 3: LIMITING MPI SEMANTICS

▸ Example: a microbenchmark capturing the communication pattern in Legion's runtime
▸ Contention between the receiver thread and the sender threads with communicators; no contention with endpoints

[Chart: Isend message rate (x10^6 messages/s) vs. number of sender threads (1–16) on MPICH/OFI/OPA, using Endpoints vs. Communicators]

Takeaway: User-visible endpoints perform better than VCIs when MPI's semantics prevent the user from expressing parallelism, especially in irregular communication patterns.

SLIDE 47

CATEGORY 3: NWCHEM

▸ NWChem: quantum chemistry application suite
▸ The dominant cost is block-sparse matrix multiplication (BSPMM)
▸ C += A x B get-compute-update pattern (MPI_Get + MPI_Accumulate)
▸ Each worker on a node (thread or process) participates in BSPMM independently

SLIDE 48

CATEGORY 3: NWCHEM

[Diagram: Rank i exposes tiles 1–3 through an MPI window or through endpoints; workers issue parallel Gets and parallel Accumulates on the tiles]

SLIDE 49

CATEGORY 3: NWCHEM

[Charts: time (ms) vs. tile dimension (1–128) on MPICH/OFI/OPA for Get, Get-flush, Accum, and Accum-flush, comparing MPI Everywhere, +Threads (Original), +Threads (VCIs), and +Threads (Endpoints)]

▸ Issue of Gets is fast.
▸ Shared progress hurts completion of Get operations.
▸ VCIs are slower than endpoints at issuing Accumulates because of the single window.
▸ Endpoints complete Accumulates slower than VCIs because of shared progress.

Warning: Atomic operation semantics are not easy to achieve with multiple windows; using multiple VCIs may not help.

SLIDE 50

CATEGORY 3: NWCHEM

Warning: Atomic operation semantics are not easy to achieve with multiple windows; using multiple VCIs may not help.

Tip: If the application allows it, pass the info hint accumulate_ordering=none; the MPI library can then exploit implicit parallelism (see the sketch after this slide).

[Charts repeated from Slide 49, with +Threads (VCIs w/ hints) added]

VCIs with hints perform as well as Endpoints
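A minimal sketch (mine) of passing the tip's hint at window creation; accumulate_ordering is a standard MPI window info key, while the helper name, displacement unit, and buffer handling are illustrative placeholders.

#include <mpi.h>

/* Create a window with accumulate ordering relaxed so the MPI library may
 * issue accumulate operations to the same target in parallel. */
MPI_Win create_unordered_window(void *base, MPI_Aint bytes, MPI_Comm comm) {
    MPI_Info info;
    MPI_Win win;
    MPI_Info_create(&info);
    /* Standard window info key; "none" removes all ordering requirements
     * among accumulate operations from the same origin. */
    MPI_Info_set(info, "accumulate_ordering", "none");
    MPI_Win_create(base, bytes, 1 /* disp_unit */, info, comm, &win);
    MPI_Info_free(&info);
    return win;
}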

SLIDE 51

CONCLUDING REMARKS

▸ MPI+threads is critical for modern processors
▸ Users must proactively express logical parallelism
▸ User-visible endpoints not critical to express logical parallelism
  ▸ MPI-3.1 already features lots of parallelism
▸ VCIs perform as well as user-visible endpoints without burdening the user
▸ New info hints in MPI-4.0 give more options to express logical parallelism
  ▸ Enabling exploration of advanced mapping policies in the MPI library


SLIDE 52

THANK YOU!

Email questions to rzambre@uci.edu
