SLIDE 1 Lecture 9: Distributed Memory∗
- More on MPI
- Communication costs
- Hardware / Network topologies
∗from Rauber and Rünger
SLIDE 2
MPI Review
SLIDE 3 Computational Example
Common approach for grid-based computations on distributed memory uses domain decomposition: split the domain into smaller problems on subdomains and iterate on each. Coordinate the solution between adjacent
subdomains.
- This approach to parallelism works well for problems that
exhibit locality - nearby objects interact more strongly than distant ones. (Same for good cache performance)
- Stencil-based finite difference equations are good
candidates.
SLIDE 4
Jacobi Example with MPI
Suppose an n × n domain and p processors. Left (strip decomposition): each processor sends n gridpoints to its top and bottom neighbors; total = 2pn. Right (block decomposition): each subdomain side = n/√p, four sides per block; total = 4pn/√p = 4n√p, which is less than 2pn for p > 4.
SLIDE 5
MPI Jacobi Elements
compute istart, iend, jstart, jend from rank;
/* update interior grid points */
for (i = istart; i < iend; i++)
  for (j = jstart; j < jend; j++) {
    (x,y) = fn(i,j);
    update u(i,j);
    compute norm of update;
  }
/* update ghost cells */
MPI_Sendrecv(to proc to the left);
MPI_Sendrecv(to proc to the right);
MPI_Sendrecv(to proc on top);
MPI_Sendrecv(to proc on bottom);
MPI_Allreduce to compute global updateNorm;
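For concreteness, a minimal compilable sketch of the same skeleton, simplified to a 1D strip decomposition (top/bottom ghost-row exchange only; a full 2D decomposition would also exchange left/right columns, typically via MPI_Type_vector). The grid size, boundary handling, and convergence threshold are illustrative assumptions, and initialization of the interior is omitted:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define N 256   /* global N x N grid; assume nprocs divides N */

int main(int argc, char **argv) {
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int rows = N / nprocs;                            /* my strip        */
    double (*u)[N]    = calloc(rows + 2, sizeof *u);  /* +2 ghost rows   */
    double (*unew)[N] = calloc(rows + 2, sizeof *unew);
    int up   = (rank > 0)          ? rank - 1 : MPI_PROC_NULL;
    int down = (rank < nprocs - 1) ? rank + 1 : MPI_PROC_NULL;

    double globalNorm = 1.0;
    while (globalNorm > 1e-6) {
        /* update ghost rows (sends/recvs to MPI_PROC_NULL are no-ops) */
        MPI_Sendrecv(u[1],      N, MPI_DOUBLE, up,   0,
                     u[rows+1], N, MPI_DOUBLE, down, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(u[rows],   N, MPI_DOUBLE, down, 1,
                     u[0],      N, MPI_DOUBLE, up,   1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* update interior grid points, accumulating the local residual */
        double localNorm = 0.0;
        for (int i = 1; i <= rows; i++)
            for (int j = 1; j < N - 1; j++) {
                unew[i][j] = 0.25 * (u[i-1][j] + u[i+1][j]
                                   + u[i][j-1] + u[i][j+1]);
                double d = unew[i][j] - u[i][j];
                localNorm += d * d;
            }

        /* combine per-process norms into the global convergence test */
        MPI_Allreduce(&localNorm, &globalNorm, 1, MPI_DOUBLE, MPI_SUM,
                      MPI_COMM_WORLD);
        globalNorm = sqrt(globalNorm);

        double (*tmp)[N] = u; u = unew; unew = tmp;   /* swap grids */
    }
    if (rank == 0) printf("converged: norm = %g\n", globalNorm);
    free(u); free(unew);
    MPI_Finalize();
    return 0;
}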
SLIDE 6
MPI Cartesian Communicator
#define UP 0
#define DOWN 1
#define LEFT 2
#define RIGHT 3
MPI_Comm comm2d;
int dims[2] = {3, 4}, nbrs[4];
int reorder = 0, coords[2];
int periods[2] = {x_periodic, y_periodic}; /* 0 or 1 */
[Figure: 3 × 4 process grid; ranks 0-11 laid out with Cartesian coordinates (0,0) through (2,3)]
MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, reorder, &comm2d);
MPI_Comm_rank(comm2d, &my_rank);
MPI_Cart_coords(comm2d, my_rank, 2, coords);
MPI_Cart_shift(comm2d, 0, 1, &nbrs[UP], &nbrs[DOWN]);
MPI_Cart_shift(comm2d, 1, 1, &nbrs[LEFT], &nbrs[RIGHT]);
Then communicate using nbrs[UP], nbrs[LEFT], etc.
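One useful guarantee (a property of MPI itself, not specific to this example): with a non-periodic grid, MPI_Cart_shift returns MPI_PROC_NULL for off-grid neighbors, and a send or receive to MPI_PROC_NULL is a no-op, so boundary processes need no special-case code in the ghost-cell exchange.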
SLIDE 7 MPE
MPI Multi-Processing Environment = package of MPI tools including
- profiling libraries, event logging, and convenient wrappers
(use mpecc -mpilog for logging)
- Jumpshot viewer for logfiles
- graphics, debugging routines, more
Works with any compliant MPI implementation (e.g., MPICH, Open MPI); distributed with MPICH. The current version, MPE2, comes with MPICH2 or can be downloaded standalone.
SLIDE 8
Sample MPI Bugs
Use of wildcards can lead to a race condition. MPI_Bcast need not be synchronizing. If it is, rank 0 gets the message from rank 1 first. If not, rank 0 could receive the message from either rank 1 or rank 2 first.
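The code for this slide is an image in the original; the following is a hypothetical reconstruction consistent with the description above (the buffer names and the choice of rank 1 as broadcast root are assumptions):

int x = rank, y = 0;
MPI_Status status;
if (rank == 0) {
    MPI_Recv(&x, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &status);
    MPI_Bcast(&y, 1, MPI_INT, 1, MPI_COMM_WORLD);
    MPI_Recv(&x, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &status);
} else if (rank == 1) {
    MPI_Send(&x, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    MPI_Bcast(&y, 1, MPI_INT, 1, MPI_COMM_WORLD);   /* rank 1 is the root */
} else if (rank == 2) {
    MPI_Bcast(&y, 1, MPI_INT, 1, MPI_COMM_WORLD);
    MPI_Send(&x, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
}
/* If MPI_Bcast synchronizes, rank 2's send cannot start before rank 0
   enters the broadcast, i.e. after rank 0's first receive has matched
   rank 1.  If it does not synchronize, ranks 1 and 2 can both finish
   the broadcast early, and either send may match the wildcard first. */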
SLIDE 9
Sample MPI Bugs
Only works for an even number of processes.
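The buggy code itself is an image in the original; a typical pattern with exactly this symptom (a hypothetical reconstruction, pairing each even rank with the next higher odd rank):

int in, out = rank;
MPI_Status status;
int partner = (rank % 2 == 0) ? rank + 1 : rank - 1;
if (rank % 2 == 0) {
    MPI_Send(&out, 1, MPI_INT, partner, 0, MPI_COMM_WORLD);
    MPI_Recv(&in,  1, MPI_INT, partner, 0, MPI_COMM_WORLD, &status);
} else {
    MPI_Recv(&in,  1, MPI_INT, partner, 0, MPI_COMM_WORLD, &status);
    MPI_Send(&out, 1, MPI_INT, partner, 0, MPI_COMM_WORLD);
}
/* With an odd number of processes the last (even) rank computes
   partner == nProcs, which does not exist: MPI aborts or hangs. */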
SLIDE 10
Sample MPI Bugs
Suppose we have a local variable, e.g. energy, and want to sum it over all processes to find the total energy of the system. Recall MPI_Reduce(sendbuf, recvbuf, count, datatype, op, root, comm). Using the same variable for both buffers, as in MPI_Reduce(&energy, &energy, 1, MPI_REAL, MPI_SUM, 0, MPI_COMM_WORLD), will bomb: MPI forbids aliasing the send and receive buffers.
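Two standard remedies (a sketch; the use of MPI_DOUBLE and rank 0 as root are illustrative choices):

/* remedy 1: use a separate receive buffer */
double total;
MPI_Reduce(&energy, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

/* remedy 2: MPI_IN_PLACE lets the root legally reuse its own buffer;
   recvbuf is ignored on non-root ranks */
if (rank == 0)
    MPI_Reduce(MPI_IN_PLACE, &energy, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
else
    MPI_Reduce(&energy, NULL, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);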
SLIDE 11
Sample MPI Bugs
while (stillIterating) {
  if (my_rank == 0) {
    for (i = 1; i < nProcs; i++) {
      MPI_Recv(buf, count, type, MPI_ANY_SOURCE, tag, MPI_COMM_WORLD, &status);
      /* process stuff from other processors */
    }
  } else {
    MPI_Send(buf, count, type, 0, tag, MPI_COMM_WORLD);
  }
}
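The bug (implicit in the wildcard warning on slide 8): with MPI_ANY_SOURCE and a fixed tag, a fast sender can race into its next iteration, and that message may be matched by rank 0's receive loop for the current iteration. One possible fix (an assumption, not the lecture's): disambiguate by using the iteration number as the tag:

/* rank 0 */    MPI_Recv(buf, count, type, MPI_ANY_SOURCE, iter, MPI_COMM_WORLD, &status);
/* others */    MPI_Send(buf, count, type, 0, iter, MPI_COMM_WORLD);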
SLIDE 12
MPI + OpenMP Example
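The program itself is an image in the original slides; a minimal hybrid sketch that would produce output of the shape shown on the next slide (compile with something like mpicc -fopenmp):

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size, nthreads = 1, provided;
    /* MPI_THREAD_FUNNELED: only the main thread makes MPI calls */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello from MPI rank %d of %d\n", rank, size);

    #pragma omp parallel
    {
        #pragma omp master
        nthreads = omp_get_num_threads();   /* count the thread team */
    }
    printf("Number of threads %d on process rank %d\n", nthreads, rank);

    MPI_Finalize();
    return 0;
}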
SLIDE 13
MPI + OpenMP Example
Sample laptop output:
% mpirun -np 2 a.out
Hello from MPI rank 0 of 2
Hello from MPI rank 1 of 2
Number of threads 4 on process rank 0
Number of threads 4 on process rank 1
SLIDE 14 MPI References
- Lawrence Livermore tutorial
https://computing.llnl.gov/tutorials/mpi/
- Using MPI: Portable Parallel Programming with the Message-Passing Interface by Gropp, Lusk, Skjellum
- Using MPI-2: Advanced Features of the Message-Passing Interface by Gropp, Lusk, Thakur
- Lots of other on-line tutorials, books, etc.
SLIDE 15
Parallel Performance
Recall Amdahl's law: if T1 = serial cost + parallel cost, then Tp = serial cost + parallel cost / p. But really Tp = serial cost + parallel cost / p + Tcommunication. How expensive is the communication?
SLIDE 16 Network Characteristics
Interconnection network connects nodes, transfers data. Important qualities:
- Topology - the structure used to connect the nodes
- Routing algorithm - how messages are transmitted
between processors, along which path (= nodes along which message transferred).
- Switching strategy = how message is cut into pieces and
assigned a path
- Flow control (for dealing with congestion) - stall, store data
in buffers, re-route data, tell source to halt, discard, etc.
SLIDE 17 Interconnection Network
Represent as graph G = (V, E), V = set of nodes to be connected, E = direct links between the nodes. Links usually bidirectional - transfer msg in both directions at same time. Characterize network by:
- diameter - maximum over all pairs of nodes of the shortest
path between the nodes (length of path in message transmission)
- degree - number of direct links for a node (number of direct
neighbors)
- bisection bandwidth - minimum number of edges that must
be removed to partition network into two parts of equal size with no connection between them. (measures network capacity for transmitting messages simultaneously)
- node/edge connectivity - numbers of node/edges that must
fail to disconnect the network (measure of reliability)
SLIDE 18 Linear Array
- p vertices, p − 1 links
- Diameter = p − 1
- Degree = 2
- Bisection bandwidth = 1
- Node connectivity = 1, edge connectivity = 1
SLIDE 19 Ring topology
- diameter = ⌊p/2⌋
- degree = 2
- bisection bandwidth = 2
- node connectivity = 2, edge connectivity = 2
SLIDE 20 Mesh topology
- diameter = 2(√p − 1) (3D mesh: 3(∛p − 1))
- degree = 4 (6 in 3D)
- bisection bandwidth = √p
- node connectivity = 2, edge connectivity = 2
Route along each dimension in turn.
SLIDE 21
Torus topology
Diameter halved, bisection bandwidth doubled, edge and node connectivity doubled compared with the mesh.
SLIDE 22 Hypercube topology
[Figure: hypercubes of dimensions 1-4 with nodes labelled by binary strings (0, 1; 00-11; 000-111; 0000-1111)]
- p = 2^k processors labelled with binary numbers of length k
- k-dimensional cube constructed from two (k − 1)-cubes
- Connect corresponding procs if labels differ in 1 bit
(Hamming distance d between 2 k-bit binary words = path of length d between 2 nodes)
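Because neighbors differ in exactly one bit, a node's neighbors can be computed by XOR; a small illustrative fragment (not from the lecture):

/* neighbors of node r in a k-dimensional hypercube: flip each bit */
for (int d = 0; d < k; d++) {
    int neighbor = r ^ (1 << d);    /* XOR toggles bit d */
    printf("node %d <-> node %d across dimension %d\n", r, neighbor, d);
}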
SLIDE 23 Hypercube topology
[Figure: same hypercube diagrams as on slide 22]
- diameter = k (= log p)
- degree = k
- bisection bandwidth = p/2
- node connectivity = k, edge connectivity = k
SLIDE 24 Dynamic Networks
The networks above were direct, or static, interconnection networks: processors connected directly with each other through fixed physical links. Indirect or dynamic networks contain switches which provide an indirect connection between the nodes. Switches are configured dynamically to establish a connection.
- bus
- crossbar
- multistage network - e.g. butterfly, omega, baseline
SLIDE 25 Crossbar
[Figure: crossbar connecting processors P1 … Pn to memories M1 … Mm]
- Connecting n inputs and m outputs takes nm switches.
(Typically only for small numbers of processors)
- At each switch a connection can either go straight or change direction.
- Diameter = 1, bisection bandwidth = p
SLIDE 26
Butterfly
16 × 16 butterfly network:
[Figure: 16 × 16 butterfly network; stages 0-3, rows labelled 000-111]
for p = 2^(k+1) processors: k + 1 stages, 2^k switches per stage, each a 2 × 2 switch
SLIDE 27 Fat tree
- Complete binary tree
- Processors at leaves
- Increase links for higher bandwidth near root
SLIDE 28 Current picture
- Old style: mapped algorithms to topologies
- New style: avoid topology-specific optimizations
- Want code that runs on next year’s machines too.
- Topology awareness in vendor MPI libraries?
- Software topology - ease of programming, but not used for
performance?
SLIDE 29
Top500 Interconnects
SLIDE 30 Networks∗
∗from Top500 2007 recent supercomputers overview
SLIDE 31 Routing and Switching
Routing = determining a path from source to destination through the network. Avoid deadlock. Try to find a minimum-cost path; cost depends on length of path (topology), contention, congestion.
- Deterministic routing - always uses same path, can have
unbalanced network load, network contention (2 or more messages transmitting at the same time over the same link, leading to delay in msg. transmissions).
- Adaptive routing - dynamically select routing path based on
load information. Try to spread network traffic evenly. Also more fault tolerant.
SLIDE 32 Routing and Switching
- Circuit switching: full path reserved for entire message (like the
telephone system); a short probe msg. establishes the path; all message units use the same path.
- Packet switching: message broken into separately-routed
packets
- store-and-forward - entire packet must be received (stored) by each
switch on the path before being sent (forwarded) to the next switch. Needs enough memory at switches.
- pipelining - packets sent so that all links are used by successive
packets in an overlapping way (if all packets are transmitted along the same path)
- Cut-through routing, Wormhole routing
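A rough cost comparison (standard formulas, not from these slides; t_h = per-hop switch delay, t_B = time per byte, m = message size, l = links on the path):
store-and-forward: T(m, l) ≈ l · (t_h + m · t_B), since the whole packet is retransmitted at every switch
cut-through / wormhole: T(m, l) ≈ l · t_h + m · t_B, since the header reserves the path and the body pipelines behind it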
SLIDE 33 Sample Point-to-Point Message Protocol
To send a message:
- 1. Message copied into system buffer
- 2. checksum computed, header added to message.
- 3. timer started, message sent out over network interface
To receive a message:
- 1. message copied from network interface to system buffer
- 2. checksum computed, compared with checksum in header. If checksums agree, an acknowledgment message is sent to the sender. If not, the message is discarded; it will be re-sent after the sender's timer has timed out.
- 3. If checksums agree, copy message from system buffer to user buffer. Application program gets notification and can continue execution.
SLIDE 34 Sample Message Transmission Time
[Figure: timing diagram showing sender overhead, time of flight, transmission time, transport latency, receiver overhead, and total latency]
- sender overhead - time to prepare message (compute checksum,
make header, execute routing alg.)
- time of flight = channel propagation delay = time for first bit to
arrive at receiver
- transmission time - message size in bytes divided by the bandwidth of
the link in bytes per second (bandwidth is the inverse of the per-byte transfer time). Does not include conflicts.
- receiver overhead - time to process incoming message, compare
checksums, generate acknowledgement
SLIDE 35 Sample Message Transmission Time
Total time to send a message of m bytes:
T(m) = O_send + T_timeOfFlight + m/B + O_recv
Combining constants and defining t_B = 1/B gives
T(m) = T_overhead + t_B · m
For circuit switching over multiple links, add a term proportional to l · m_c, where l = length of path and m_c = size of the probe message; this term is typically very small.
SLIDE 36 Latency and Bandwidth Model
- Time to send a message of length m is roughly
latency + m/bandwidth; topology is assumed irrelevant
- Often called the α-β model and written T = α + m · β
- Usually α ≫ β ≫ time per flop
- One long message is cheaper than many short ones
- Can do hundreds or thousands of flops for cost of one
message
- Need large computation to communication ratio to be
efficient
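α and β can be estimated empirically with a ping-pong test; a minimal sketch (message sizes and repetition count are arbitrary choices):

#include <mpi.h>
#include <stdio.h>

#define REPS 100

int main(int argc, char **argv) {
    int rank;
    static char buf[1 << 20];          /* up to 1 MiB messages */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (int m = 1; m <= (1 << 20); m *= 16) {
        double t0 = MPI_Wtime();
        for (int rep = 0; rep < REPS; rep++) {
            if (rank == 0) {           /* time round trips on ranks 0,1 */
                MPI_Send(buf, m, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, m, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, m, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, m, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double oneWay = (MPI_Wtime() - t0) / (2.0 * REPS);
        if (rank == 0)                 /* fit T(m) = alpha + beta * m */
            printf("m = %7d bytes: T = %.3g s\n", m, oneWay);
    }
    MPI_Finalize();
    return 0;
}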
SLIDE 37 Scalar Product Example
Compute the dot product of two vectors: c = Σ aⱼbⱼ (j = 1..n)
On p processors each computes n/p terms. Final result stored on one processor using an accumulate.
Let α = cost of add = cost of multiply
Let β = cost of communicating one value to a neighbor
Linear array: T(p, n) = 2(n/p)α + (p/2)(α + β), minimized at p∗ = 2√(nα/(α + β))
Hypercube: T(p, n) = 2(n/p)α + log p · (α + β); setting dT/dp = 0 gives p∗ = 2nα ln 2/(α + β)
For p > p∗ the execution time T increases!