

SLIDE 1

Lecture 9: Distributed Memory∗

  • More on MPI
  • Communication costs
  • Hardware / Network topologies

∗from Rauber and Runger

SLIDE 2

MPI Review

SLIDE 3

Computational Example

Common approach for grid-based computations on distributed memory uses domain decomposition: split the domain into smaller problems on subdomains and iterate on each. Coordinate the solution between adjacent subdomains.

  • This approach to parallelism works well for problems that exhibit locality - nearby objects interact more strongly than distant ones. (The same locality also gives good cache performance.)
  • Stencil-based finite difference equations are good candidates.

SLIDE 4

Jacobi Example with MPI

Suppose an n by n domain and p processors. 1-D strip decomposition (left figure on the slide): each processor sends n gridpoints to the processors above and below, so the total communication volume is 2pn. 2-D block decomposition (right figure): each subdomain side has n/√p points, for a total of 4pn/√p = 4n√p.

SLIDE 5

MPI Jacobi Elements

compute istart, iend, jstart, jend from rank;

/* update interior grid points */
for (i = istart; i < iend; i++)
  for (j = jstart; j < jend; j++) {
    (x, y) = fn(i, j);
    update u(i, j);
    compute norm of update;
  }

/* update ghost cells */
MPI_Sendrecv (to proc to the left);
MPI_Sendrecv (to proc to the right);
MPI_Sendrecv (to proc on top);
MPI_Sendrecv (to proc on bottom);

/* compute global updateNorm */
MPI_Allreduce (...);
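For concreteness, here is a minimal sketch (not from the slides) of one of those ghost-cell exchanges, assuming a 1-D row-block decomposition: each rank owns rows 1..nrows of a flattened array u (row 0 and row nrows+1 are ghost rows of ncols points each), and nbr_up / nbr_down come from MPI_Cart_shift, with MPI_PROC_NULL at the domain boundaries so the calls become no-ops there.

#include <mpi.h>

/* Exchange one ghost row with the up and down neighbors. */
void exchange_ghost_rows(double *u, int nrows, int ncols,
                         int nbr_up, int nbr_down, MPI_Comm comm)
{
    /* send my first interior row up; receive my bottom ghost row from below */
    MPI_Sendrecv(&u[1 * ncols],           ncols, MPI_DOUBLE, nbr_up,   0,
                 &u[(nrows + 1) * ncols], ncols, MPI_DOUBLE, nbr_down, 0,
                 comm, MPI_STATUS_IGNORE);
    /* send my last interior row down; receive my top ghost row from above */
    MPI_Sendrecv(&u[nrows * ncols],       ncols, MPI_DOUBLE, nbr_down, 1,
                 &u[0],                   ncols, MPI_DOUBLE, nbr_up,   1,
                 comm, MPI_STATUS_IGNORE);
}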

SLIDE 6

MPI Cartesian Communicator

#define UP    0
#define DOWN  1
#define LEFT  2
#define RIGHT 3
MPI_Comm comm2d;
int dims[2] = {3, 4}, nbrs[4];
int reorder = 0, coords[2];
int periods[2] = {x_periodic, y_periodic}; /* 0 or 1 */

[Figure: 3 x 4 process grid, ranks 0-11 with Cartesian coordinates (0,0) through (2,3)]

MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, reorder, &comm2d);
MPI_Comm_rank(comm2d, &my_rank);
MPI_Cart_coords(comm2d, my_rank, 2, coords);
MPI_Cart_shift(comm2d, 0, 1, &nbrs[UP], &nbrs[DOWN]);
MPI_Cart_shift(comm2d, 1, 1, &nbrs[LEFT], &nbrs[RIGHT]);
Then communicate using nbrs[UP], nbrs[LEFT], etc.

SLIDE 7

MPE

MPE = MPI Multi-Processing Environment: a package of MPI tools including

  • profiling libraries, event logging, and convenient wrappers (use mpecc -mpilog for logging)
  • Jumpshot viewer for logfiles
  • graphics, debugging routines, more

Works with any compliant MPI implementation (MPICH and OpenMPI); distributed with MPICH. The current version, MPE2, comes with MPICH2 or can be downloaded standalone.

SLIDE 8

Sample MPI Bugs

Use of wildcards (MPI_ANY_SOURCE) can lead to a race condition. MPI_Bcast need not be synchronizing. If it is, rank 0 gets the message from rank 1 first. If not, rank 0 could receive a message from either rank 1 or rank 2 first.
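The slide's code is not in this transcription; the following is a hypothetical reconstruction consistent with the description above (the broadcast root is chosen as rank 2; run with at least 3 ranks). If MPI_Bcast acts as a barrier, rank 2 cannot send until rank 0 has already matched rank 1's message; if it does not, rank 2 may return from the broadcast early, and its message races with rank 1's at the MPI_ANY_SOURCE receive.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, buf = 0, v1 = -1, v2 = -1;
    MPI_Status st;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        MPI_Recv(&v1, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &st);
        MPI_Bcast(&buf, 1, MPI_INT, 2, MPI_COMM_WORLD);
        MPI_Recv(&v2, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &st);
        printf("first message came from rank %d\n", v1);
    } else if (rank == 1) {
        MPI_Send(&rank, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);   /* send, then broadcast */
        MPI_Bcast(&buf, 1, MPI_INT, 2, MPI_COMM_WORLD);
    } else if (rank == 2) {
        buf = 42;                                             /* value to broadcast */
        MPI_Bcast(&buf, 1, MPI_INT, 2, MPI_COMM_WORLD);       /* broadcast, then send */
        MPI_Send(&rank, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    } else {
        MPI_Bcast(&buf, 1, MPI_INT, 2, MPI_COMM_WORLD);       /* all ranks participate */
    }

    MPI_Finalize();
    return 0;
}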

SLIDE 9

Sample MPI Bugs

Only works for an even number of processors.
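The slide's code is not in the transcription. One common bug of exactly this kind (a guessed illustration, not necessarily the slide's example) is a pairwise exchange that pairs rank 2i with rank 2i+1:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, partner, sendval, recvval;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    partner = rank ^ 1;          /* 0<->1, 2<->3, ... : implicitly assumes size is even */
    sendval = rank;
    MPI_Sendrecv(&sendval, 1, MPI_INT, partner, 0,
                 &recvval, 1, MPI_INT, partner, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    printf("rank %d exchanged with rank %d\n", rank, recvval);

    MPI_Finalize();
    return 0;
}

With an odd number of processes the last rank computes partner == size, an invalid rank, so the run aborts (other pairing schemes hang instead).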

SLIDE 10

Sample MPI Bugs

Suppose we have a local variable, e.g. energy, and want to sum the energy over all processors to find the total energy of the system. Recall MPI_Reduce(sendbuf, recvbuf, count, datatype, op, root, comm). Using the same variable for both buffers, as in MPI_Reduce(energy, energy, 1, MPI_REAL, MPI_SUM, 0, MPI_COMM_WORLD), will bomb.
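A sketch of two standard fixes (not shown on the slide): use a distinct receive buffer, or pass MPI_IN_PLACE as the send buffer at the root. In C (MPI_DOUBLE instead of the slide's MPI_REAL; rank is assumed set by MPI_Comm_rank, and local_energy() is just a hypothetical placeholder):

double energy = local_energy(), total = 0.0;

/* Fix 1: separate send and receive buffers */
MPI_Reduce(&energy, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

/* Fix 2: reuse the same variable at the root via MPI_IN_PLACE
   (in a real code you would use one fix or the other, not both) */
if (rank == 0)
    MPI_Reduce(MPI_IN_PLACE, &energy, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
else
    MPI_Reduce(&energy, NULL, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);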

SLIDE 11

Sample MPI Bugs

while (stillIterating) {
  if (my_rank == 0) {
    for (i = 1; i < nProcs; i++) {
      MPI_Recv(buf, count, type, MPI_ANY_SOURCE, tag, MPI_COMM_WORLD, &status);
      /* process stuff from other processors */
    }
  } else {
    MPI_Send(buf, count, type, 0, tag, MPI_COMM_WORLD);
  }
}

SLIDE 12

MPI + OpenMP Example
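The slide's source code is not in the transcription; a minimal hybrid sketch consistent with the sample output on the next slide might look like the following (compiled with, e.g., mpicc -fopenmp; a production hybrid code would normally call MPI_Init_thread to request a threading level):

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, nthreads = 1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    printf("Hello from MPI rank %d of %d\n", rank, size);

    #pragma omp parallel
    {
        #pragma omp single
        nthreads = omp_get_num_threads();   /* threads in this MPI process */
    }
    printf("Number of threads %d on process rank %d\n", nthreads, rank);

    MPI_Finalize();
    return 0;
}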

SLIDE 13

MPI + OpenMP Example

Sample laptop output:

% mpirun -np 2 a.out
Hello from MPI rank 0 of 2
Hello from MPI rank 1 of 2
Number of threads 4 on process rank 0
Number of threads 4 on process rank 1

SLIDE 14

MPI References

  • Lawrence Livermore tutorial: https://computing.llnl.gov/tutorials/mpi/
  • Using MPI: Portable Parallel Programming with the Message-Passing Interface, by Gropp, Lusk, Skjellum
  • Using MPI-2: Advanced Features of the Message-Passing Interface, by Gropp, Lusk, Thakur
  • Lots of other on-line tutorials, books, etc.
SLIDE 15

Parallel Performance

Recall Amdahl's law: if T1 = serial cost + parallel cost, then Tp = serial cost + parallel cost/p. But really Tp = serial cost + parallel cost/p + Tcommunication. How expensive is the communication?
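For illustration (numbers assumed, not from the slides): with 1 s of serial work and 99 s of parallelizable work, T1 = 100 s, and p = 100 gives Tp ≈ 1 + 0.99 ≈ 2 s plus Tcommunication; even a modest 1 s of communication drops the speedup from about 50x to about 33x.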

SLIDE 16

Network Characteristics

Interconnection network connects nodes, transfers data. Important qualities:

  • Topology - the structure used to connect the nodes
  • Routing algorithm - how messages are transmitted between processors, and along which path (= nodes along which the message is transferred)
  • Switching strategy - how a message is cut into pieces and assigned a path
  • Flow control (for dealing with congestion) - stall, store data in buffers, re-route data, tell source to halt, discard, etc.

SLIDE 17

Interconnection Network

Represent the network as a graph G = (V, E): V = set of nodes to be connected, E = direct links between the nodes. Links are usually bidirectional - they can transfer messages in both directions at the same time. Characterize a network by:

  • diameter - maximum over all pairs of nodes of the shortest path between the nodes (length of path in message transmission)
  • degree - number of direct links for a node (number of direct neighbors)
  • bisection bandwidth - minimum number of edges that must be removed to partition the network into two parts of equal size with no connection between them (measures network capacity for transmitting messages simultaneously)
  • node/edge connectivity - number of nodes/edges that must fail to disconnect the network (measure of reliability)

SLIDE 18

Linear Array

  • p vertices, p − 1 links
  • Diameter = p − 1
  • Degree = 2
  • Bisection bandwidth = 1
  • Node connectivity = 1, edge connectivity = 1
SLIDE 19

Ring topology

  • diameter = p/2
  • degree = 2
  • bisection bandwidth = 2
  • node connectivity = 2, edge connectivity = 2

SLIDE 20

Mesh topology

  • diameter = 2(√p − 1); for a 3d mesh, 3(∛p − 1)
  • degree = 4 (6 in 3d)
  • bisection bandwidth = √p
  • node connectivity = 2, edge connectivity = 2

Route along each dimension in turn.

SLIDE 21

Torus topology

Diameter halved, bisection bandwidth doubled, edge and node connectivity doubled compared with the mesh.

SLIDE 22

Hypercube topology

[Figure: hypercubes of dimension 1 to 4, nodes labelled with binary strings 0/1, 00-11, 000-111, 0000-1111]

  • p = 2^k processors labelled with binary numbers of length k
  • k-dimensional cube constructed from two (k − 1)-cubes
  • Connect corresponding procs if labels differ in 1 bit (Hamming distance d between two k-bit binary words = path of length d between the 2 nodes)
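As a small illustration (not from the slides), the k neighbors of a node can be generated by flipping each bit of its label:

#include <stdio.h>

/* Print the neighbors of node `rank` in a k-dimensional hypercube:
   each neighbor's label differs from rank in exactly one bit. */
int main(void)
{
    int k = 4;        /* dimension, p = 2^k = 16 nodes */
    int rank = 5;     /* example node, binary 0101 */

    for (int d = 0; d < k; d++)
        printf("neighbor across dimension %d: %d\n", d, rank ^ (1 << d));
    return 0;
}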

SLIDE 23

Hypercube topology


  • diameter = k (= log p)
  • degree = k
  • bisection bandwidth = p/2
  • node connectivity = k, edge connectivity = k

SLIDE 24

Dynamic Networks

The networks above were direct, or static, interconnection networks: processors connected directly with each other through fixed physical links. Indirect or dynamic networks contain switches which provide an indirect connection between the nodes; the switches are configured dynamically to establish a connection.

  • bus
  • crossbar
  • multistage network - e.g. butterfly, omega, baseline
SLIDE 25

Crossbar

[Figure: crossbar connecting processors P1 ... Pn to memories M1 ... Mm]

  • Connecting n inputs and m outputs takes nm switches. (Typically used only for small numbers of processors.)
  • At each switch the connection can either go straight or change direction.
  • Diameter = 1, bisection bandwidth = p
SLIDE 26

Butterfly

16 × 16 butterfly network:

[Figure: 16 x 16 butterfly network, switch rows 000-111, stages 0-3]

For p = 2^(k+1) processors: k + 1 stages, 2^k switches per stage, each a 2 × 2 switch.

SLIDE 27

Fat tree

  • Complete binary tree
  • Processors at leaves
  • Increase links for higher bandwidth near root
SLIDE 28

Current picture

  • Old style: mapped algorithms to topologies
  • New style: avoid topology-specific optimizations
  • Want code that runs on next year's machines too.
  • Topology awareness in vendor MPI libraries?
  • Software topology - ease of programming, but not used for performance?

SLIDE 29

Top500 Interconnects


SLIDE 30

Networks∗

∗from Top500 2007 recent supercomputers overview

SLIDE 31

Routing and Switching

Routing = determining a path from source to destination through the network. Avoid deadlock. Try to find a minimum-cost path, which depends on the length of the path (topology), contention, and congestion.

  • Deterministic routing - always uses the same path; can lead to unbalanced network load and network contention (2 or more messages transmitting at the same time over the same link, leading to delays in message transmission).
  • Adaptive routing - dynamically selects the routing path based on load information. Tries to spread network traffic evenly. Also more fault tolerant.

SLIDE 32

Routing and Switching

  • Circuit switching: the full path is reserved for the entire message (like a telephone call); a short probe message establishes the path, and all message units then use the same path.
  • Packet switching: the message is broken into separately-routed packets.
  • store-and-forward - the entire packet must be received by each switch on the path (store) before it is sent to the next switch (forward). Needs enough memory at the switches.
  • pipelining - packets are sent so that all links are used by successive packets in an overlapping way (if all packets are transmitted along the same path)
  • Cut-through routing, wormhole routing
SLIDE 33

Sample Point-to-Point Message Protocol

To send a message:

  • 1. Message copied into a system buffer.
  • 2. Checksum computed, header added to the message.
  • 3. Timer started, message sent out over the network interface.

To receive a message:

  • 1. Message copied from the network interface to a system buffer.
  • 2. Checksum computed and compared with the checksum in the header. If the checksums agree, an acknowledgment message is sent to the sender. If not, the message is discarded; it will be re-sent after the sender's timer has timed out.
  • 3. If the checksums agree, the message is copied from the system buffer to the user buffer. The application program gets a notification and can continue execution.

SLIDE 34

Sample Message Transmission Time

[Figure: message transmission timeline - sender overhead, time of flight, transmission time, transport latency, receiver overhead; the sum is the total latency]

  • sender overhead - time to prepare the message (compute checksum, make header, execute routing algorithm)
  • time of flight = channel propagation delay = time for the first bit to arrive at the receiver
  • transmission time - message size in bytes divided by the bandwidth of the link in bytes per second (inverse of byte transfer time). Does not include conflicts.
  • receiver overhead - time to process the incoming message, compare checksums, generate the acknowledgement

SLIDE 35

Sample Message Transmission Time


Total time to send a message of m bytes: T(m) = Osend + TtimeOfFlight + m/B + Orecv. Combining constants and defining tB = 1/B gives T(m) = Toverhead + tB · m. For circuit switching over multiple links, add a term proportional to l · mc, where l = length of the path and mc = size of the probe message; this term is typically very small.

SLIDE 36

Latency and Bandwidth Model

  • Time to send a message of length m is roughly latency + m/bandwidth. Topology is assumed irrelevant.
  • Often called the α-β model and written T = α + m · β
  • Usually α >> β >> time per flop
  • One long message is cheaper than many short ones
  • Can do hundreds or thousands of flops for the cost of one message
  • Need a large computation-to-communication ratio to be efficient (see the worked numbers below)
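For illustration (ballpark values assumed, not from the slides): with α = 1 µs, β = 1 ns/byte, and 1 ns per flop, sending a single 8-byte double costs T = 1 µs + 8 ns ≈ 1 µs, the time of roughly 1000 flops, while a 1 MB message costs about 1 µs + 1 ms ≈ 1 ms, so the latency term only matters for short messages.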

SLIDE 37

Scalar Product Example

Compute the dot product of two vectors: c = Σ_{j=1}^{n} a_j b_j

On p processors each computes n/p terms; the final result is stored on one processor using an accumulate. Let α = cost of an add = cost of a multiply, and β = cost of communicating one value to a neighbor.

Linear array: T(p, n) = 2(n/p)α + (p/2)(α + β), minimized at p* = 2√(α/(α + β)) · √n

Hypercube: T(p, n) = 2(n/p)α + log p · (α + β), minimized at p* = 2nα ln 2/(α + β)

For p > p* the execution time T increases!
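A minimal MPI sketch of this computation (not from the slides): each rank forms its n/p partial sum and the result is accumulated onto rank 0 with MPI_Reduce. The vectors are filled with dummy values, and n is assumed divisible by p.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, p;
    long n = 1 << 20;                     /* global vector length (assumed) */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    long nloc = n / p;                    /* assumes p divides n */
    double *a = malloc(nloc * sizeof(double));
    double *b = malloc(nloc * sizeof(double));
    for (long i = 0; i < nloc; i++) { a[i] = 1.0; b[i] = 2.0; }

    double local = 0.0, c = 0.0;
    for (long i = 0; i < nloc; i++)       /* n/p multiplies and adds per process */
        local += a[i] * b[i];

    /* accumulate the partial sums onto rank 0 */
    MPI_Reduce(&local, &c, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("dot product = %g (expected %g)\n", c, 2.0 * n);

    free(a);
    free(b);
    MPI_Finalize();
    return 0;
}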