Lightweight MPI Communicators with Applications to Perfectly Balanced Quicksort
Michael Axtmann, Peter Sanders, Armin Wiebigke
IPDPS, May 22, 2018
Institute of Theoretical Informatics, Karlsruhe Institute of Technology


SLIDE 1

Michael Axtmann: Lightweight MPI Communicators with Applications to Perfectly Balanced Quicksort Institute of Theoretical Informatics Karlsruhe Institute of Technology


www.kit.edu

KIT – The Research University in the Helmholtz Association


SLIDE 2

Overview


- Communicators and communication
- Disadvantages of communicator construction
- Solutions for MPI
- RBC communicators
- Case study on sorting

SLIDE 3

Communicators in MPI

[Diagram: MPI_COMM_WORLD with PEs 0–5 and a subcommunicator mapping 0 → 3, 1 → 4, 2 → 1. Examples of nonblocking point-to-point (ISend, IReceive and Test), blocking point-to-point (Send, Receive), nonblocking collective (IScan, Compute and Test), and blocking collective (Scan).]

SLIDE 4

MPI Examples


Usage of communicators:
- Divide tasks into fine-grained subproblems
- Elegant algorithms and comfortable programming

Communicators make life easier at no cost!?

[Diagram: divide and conquer over PE groups; communication over rows and columns of a PE grid.]

SLIDE 5

Current Implementations


Open MPI and MPICH

[Diagram: a subcommunicator mapping ranks 0 → 3, 1 → 4, 2 → 1 of the parent communicator.]

PE group:
- Mapping from PE ID to process ID required
- Explicit representation as a table

Context ID:
- Separates communication between communicators; part of each message
- Unique for all PEs of the PE group
- Blocking Allgather operation on the context-ID mask

[Diagram: two overlapping subcommunicators distinguished by context IDs 0 and 1.]

SLIDE 6

Current Implementations (cont.)

Communicator creation takes time linear in the communicator size.

SLIDE 7

Current Implementations (cont.)

Communicator creation is a blocking collective operation.

SLIDE 8

Blocking Communicator Creation


A collective operation is invoked by all PEs of a communicator. BUT: communicator creation breaks the nonblocking idea.

“... nonblocking collective operations can mitigate possible synchronizing effects ...”
“... enabling communication-computation overlap ...”
“... perform collective operations on overlapping communicators, which would lead to deadlocks with blocking operations.” – MPI Standard

[Diagram: a plain nonblocking collective (IScan, then Compute and Test) versus a nonblocking collective preceded by communicator creation, where the blocking creation step serializes both PEs.]

SLIDE 9

Communicator Construction

[Plot: running time / comm size (µs) vs. comm size (2^10 to 2^15 cores) for MPI_Comm_create_group and MPI_Comm_split, on IBM MPI and Intel MPI.]

SuperMUC. Splitting a communicator into two communicators of half the size. Communicator construction time is linear in the PE group size.

[Diagram: PEs 0–9 split into two halves.]

SLIDE 10

Communicator Construction


SuperMUC – 32 768 cores – IBM MPI

[Plot: running time (ms) vs. message length (B, 2^3 to 2^21) for MPI_Reduce, MPI_Exscan, and MPI_Comm_split.]

Collective operations on 2^14 cores; splitting 2^15 PEs into two communicators of size 2^14. Communicator construction is expensive compared to collectives.

[Diagram: PEs 0–9, splitting followed by a collective operation.]

SLIDE 11

Communicator Construction

[Plot: running time (ms) vs. comm size (2^9 to 2^13 cores) for alternating and cascading splits.]

SuperMUC, Intel MPI. Splitting a communicator into overlapping communicators of size four. Blocking communicator creation causes delays.

[Diagram: each PE group invokes MPI_Comm_create_group; cascading vs. alternating invocation order over PEs 0–10 and time.]

SLIDE 12

Proposals for MPI


PE group:
- Sparse representations, e.g. MPI_Group_range_incl

Context ID:
- User-defined tag
- Calculated by MPI: concatenation of counters

[Diagram: MPI_COMM_WORLD with nested subcommunicators whose context IDs are concatenated counters ({0}, {1}, {2}, {3}; {0,0}, {0,1}, {2,0}, {2,1}, {2,2}). Subcommunicator 1 is described by the range first=1, last=4, stride=2.]

SLIDE 13

Our RBC library


- Range-based communicator in O(1) time
- Local construction
- Select MPI or RBC operations
- Local splitting:

Split_RBC_Comm(Comm&, Comm&,
               int first, int last, int stride)

Only adjusts the range.


[Diagram: rbc::Comm holds a parent MPI communicator plus first, last, and stride (all int). Example: on an 8-PE MPI communicator, first=1, last=7, stride=3 selects PEs {1, 4, 7}.]

Blocking ops: rbc::Bcast, rbc::Reduce, rbc::Allreduce, rbc::Scan, rbc::Gather, rbc::Gatherv, rbc::Barrier, rbc::Send, rbc::Recv, rbc::Probe
Nonblocking ops: rbc::Ibcast, rbc::Ireduce, rbc::Iallreduce, rbc::Iscan, rbc::Igather, rbc::Igatherv, rbc::Ibarrier, rbc::Isend, rbc::Irecv, rbc::Iprobe, rbc::Wait, rbc::Test, rbc::Waitall
Classes: rbc::Request, rbc::Comm
Local ops: rbc::Create_RBC_Comm, rbc::Split_RBC_Comm, rbc::Comm_rank, rbc::Comm_size

SLIDE 14

Our RBC library


Implementation Details

(Non)blocking point-to-point communication:
- Maps rank to rank of the MPI communicator
- Calls the MPI counterpart

(Non)blocking collective operations:
- Call point-to-point operations of RBC
- One globally reserved tag

Nonblocking details:
- Optional user-defined tag
- Round-based schedule

rbc::Ibcast(void *buff, int cnt, MPI_Datatype datatype, int root,
            rbc::Comm comm, rbc::Request *request, int tag = RBC_IBCAST_TAG)

[Diagram: round-based broadcast schedule on PEs 1–3 over time.]

SLIDE 15

RBC vs. MPI

[Plot: running time / comm size (µs) vs. comm size (2^10 to 2^15 cores) for MPI_Comm_create_group and MPI_Comm_split (IBM and Intel MPI) and rbc::Split_RBC_Comm.]

SuperMUC. Splitting a communicator into two communicators of half the size. RBC splitting comes at almost no cost.

[Diagram: PEs 0–9 split into two halves.]

SLIDE 16

RBC vs. MPI

[Plot: running time (ms) vs. message length (B, 2^3 to 2^21) for MPI_Ibcast vs. rbc::Ibcast, IBM MPI.]

SuperMUC, 32 768 cores, IBM MPI. Splitting a communicator with 2^15 PEs into two communicators of size 2^14 and performing a broadcast on both. RBC splitting comes at almost no cost.

[Diagram: PEs 0–9, splitting followed by a collective operation.]

SLIDE 17

Cost of Communicators

[Plot: running time (ms) vs. comm size (2^9 to 2^13 cores) for alternating and cascading splits with MPI and RBC.]

SuperMUC, Intel MPI. Splitting a communicator into overlapping communicators of size four. Cascades have no effect on RBC.

[Diagram: each PE group invokes MPI_Comm_create_group; cascading vs. alternating invocation order over PEs 0–10 and time.]

SLIDE 18

Quicksort on Distributed Systems

- Single-ported message passing
- Sending l machine words costs α + βl
- Analyze the critical path
- Small inputs, minimal latency
- Running time O(α log² p + β(n/p) log p + (n/p) log n)

[Diagram: hypercube quicksort on PEs 1–4: splitter selection, split, redistribution, merge, then recursion on subcube 1 and subcube 2; local sort at the base.]

Hypercube Quicksort:
- Static communication pattern
- Precomputable communicators
- Bad for skewed inputs
- Only works for p = 2^k

SLIDE 19

Janus Sort

[Diagram: Janus Sort on PEs 0–5: local sort, pivot selection, partitioning, communicator creation, base case, recursion; the PEs split into group 1 and group 2.]

Running time O(α log² p + β(n/p) log p + (n/p) log n)

- Arbitrary p
- Calculates communicators on the fly
- Perfectly balanced data

[Image: the two-faced Roman god Janus; Wikipedia, image by Fubar Obfusco.]

SLIDE 20

Janus Sort


[Diagram: execution over PEs i to i+4; a Janus PE sits on the boundary between group j and group j+1.]

SLIDE 21

Janus Sort


[Diagram: execution timeline with the pivot selection phase in both groups.]

SLIDE 22

Janus Sort


[Diagram: execution timeline with pivot selection and partitioning.]

SLIDE 23

Janus Sort


[Diagram: execution timeline with pivot selection, partitioning, and data exchange.]

SLIDE 24

Janus Sort


[Diagram: full execution timeline: pivot selection, partitioning, data exchange, communicator creation, base case; recursion creates new Janus PEs for groups k and k+1.]

SLIDE 25

Janus Sort

Pivot selection: O(α log p)

[Diagram: a regular PE issues IBcast then Wait; a Janus PE issues IBcast for both the left and the right group and a single Waitall.]

SLIDE 26

Janus Sort

Pivot selection: O(α log p)
Partitioning: O(log(n/p))

[Diagram: regular and Janus PEs partition their sorted local data by binary search; a Janus PE searches once for the left group and once for the right group.]

SLIDE 27

Janus Sort

Pivot selection: O(α log p)
Partitioning: O(log(n/p))
Data exchange: O(α + β(n/p))

[Diagram: the data exchange uses IScan, IBcast, ISend, and IRecv; a regular PE finishes with Waitall, while a Janus PE runs the sequence for both groups before its Waitall.]

SLIDE 28

Janus Sort

Pivot selection: O(α log p)
Partitioning: O(log(n/p))
Data exchange: O(α + β(n/p))
Comm creation: O(1)

[Diagram: creation of the even and the odd subcommunicator; after recursion, the boundary PEs become new Janus PEs and the rest remain regular PEs.]

SLIDE 29

Janus Sort: Experimental Results

[Plots: run time (s) with RBC Split vs. MPI Split; left (Juqueen): run time vs. elements per core (2^0 to 2^20); right (SuperMUC): run time vs. number of cores (2^9 to 2^16). IBM MPI.]

Weak scaling on the SuperMUC with 16 384 double values per core; sorting double values on up to 32 768 cores.

SLIDE 30

Conclusion

- Communicator creation is expensive
- Blocking communicator creation breaks the idea of nonblocking communication
- Extension proposals for MPI
- Range-based communicators
- Case study on sorting
- Code published at https://github.com/MichaelAxtmann/RBC
