

SLIDE 1

Lecture 9: Distributed Memory∗

  • More on MPI
  • Communication costs
  • Hardware / Network topologies

∗from Rauber and Runger

SLIDE 2

MPI Review

SLIDE 3

Computational Example

Common approach for grid-based computations on distributed memory uses domain decomposition: split the domain into smaller problems on subdomains and iterate on each. Coordinate the solution between adjacent subdomains.

  • This approach to parallelism works well for problems that exhibit locality - nearby objects interact more strongly than distant ones. (The same locality also gives good cache performance.)
  • Stencil-based finite difference equations are good candidates.

SLIDE 4

Jacobi Example with MPI

Suppose an n by n domain and p processors. 1-D strip decomposition (left figure on the slide): each processor sends n gridpoints to the processors above and below, so the total communication volume is 2pn. 2-D block decomposition (right figure): each subdomain side has n/√p points, for a total of 4pn/√p = 4n√p.

SLIDE 5

MPI Jacobi Elements

compute istart, iend, jstart, jend from rank;

/* update interior grid points */
for (i = istart; i < iend; i++)
  for (j = jstart; j < jend; j++) {
    (x, y) = fn(i, j);
    update u(i, j);
    compute norm of update;
  }

/* update ghost cells */
MPI_Sendrecv (to proc to the left);
MPI_Sendrecv (to proc to the right);
MPI_Sendrecv (to proc on top);
MPI_Sendrecv (to proc on bottom);

/* compute global updateNorm */
MPI_Allreduce (...);
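For concreteness, here is a minimal sketch (not from the slides) of one of those ghost-cell exchanges, assuming a 1-D row-block decomposition: each rank owns rows 1..nrows of a flattened array u (row 0 and row nrows+1 are ghost rows of ncols points each), and nbr_up / nbr_down come from MPI_Cart_shift, with MPI_PROC_NULL at the domain boundaries so the calls become no-ops there.

#include <mpi.h>

/* Exchange one ghost row with the up and down neighbors. */
void exchange_ghost_rows(double *u, int nrows, int ncols,
                         int nbr_up, int nbr_down, MPI_Comm comm)
{
    /* send my first interior row up; receive my bottom ghost row from below */
    MPI_Sendrecv(&u[1 * ncols],           ncols, MPI_DOUBLE, nbr_up,   0,
                 &u[(nrows + 1) * ncols], ncols, MPI_DOUBLE, nbr_down, 0,
                 comm, MPI_STATUS_IGNORE);
    /* send my last interior row down; receive my top ghost row from above */
    MPI_Sendrecv(&u[nrows * ncols],       ncols, MPI_DOUBLE, nbr_down, 1,
                 &u[0],                   ncols, MPI_DOUBLE, nbr_up,   1,
                 comm, MPI_STATUS_IGNORE);
}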

SLIDE 6

MPI Cartesian Communicator

#define UP    0
#define DOWN  1
#define LEFT  2
#define RIGHT 3
MPI_Comm comm2d;
int dims[2] = {3, 4}, nbrs[4];
int reorder = 0, coords[2];
int periods[2] = {x_periodic, y_periodic}; /* 0 or 1 */

[Figure: 3 x 4 process grid, ranks 0-11 with Cartesian coordinates (0,0) through (2,3)]

MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, reorder, &comm2d);
MPI_Comm_rank(comm2d, &my_rank);
MPI_Cart_coords(comm2d, my_rank, 2, coords);
MPI_Cart_shift(comm2d, 0, 1, &nbrs[UP], &nbrs[DOWN]);
MPI_Cart_shift(comm2d, 1, 1, &nbrs[LEFT], &nbrs[RIGHT]);
Then communicate using nbrs[UP], nbrs[LEFT], etc.

SLIDE 7

MPE

MPE = MPI Multi-Processing Environment: a package of MPI tools including

  • profiling libraries, event logging, and convenient wrappers (use mpecc -mpilog for logging)
  • Jumpshot viewer for logfiles
  • graphics, debugging routines, more

Works with any compliant MPI implementation (MPICH and OpenMPI); distributed with MPICH. The current version, MPE2, comes with MPICH2 or can be downloaded standalone.

SLIDE 8

Sample MPI Bugs

Use of wildcards (MPI_ANY_SOURCE) can lead to a race condition. MPI_Bcast need not be synchronizing. If it is, rank 0 gets the message from rank 1 first. If not, rank 0 could receive a message from either rank 1 or rank 2 first.
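The slide's code is not in this transcription; the following is a hypothetical reconstruction consistent with the description above (the broadcast root is chosen as rank 2; run with at least 3 ranks). If MPI_Bcast acts as a barrier, rank 2 cannot send until rank 0 has already matched rank 1's message; if it does not, rank 2 may return from the broadcast early, and its message races with rank 1's at the MPI_ANY_SOURCE receive.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, buf = 0, v1 = -1, v2 = -1;
    MPI_Status st;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        MPI_Recv(&v1, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &st);
        MPI_Bcast(&buf, 1, MPI_INT, 2, MPI_COMM_WORLD);
        MPI_Recv(&v2, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &st);
        printf("first message came from rank %d\n", v1);
    } else if (rank == 1) {
        MPI_Send(&rank, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);   /* send, then broadcast */
        MPI_Bcast(&buf, 1, MPI_INT, 2, MPI_COMM_WORLD);
    } else if (rank == 2) {
        buf = 42;                                             /* value to broadcast */
        MPI_Bcast(&buf, 1, MPI_INT, 2, MPI_COMM_WORLD);       /* broadcast, then send */
        MPI_Send(&rank, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    } else {
        MPI_Bcast(&buf, 1, MPI_INT, 2, MPI_COMM_WORLD);       /* all ranks participate */
    }

    MPI_Finalize();
    return 0;
}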

SLIDE 9

Sample MPI Bugs

Only works for an even number of processors.
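The slide's code is not in the transcription. One common bug of exactly this kind (a guessed illustration, not necessarily the slide's example) is a pairwise exchange that pairs rank 2i with rank 2i+1:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, partner, sendval, recvval;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    partner = rank ^ 1;          /* 0<->1, 2<->3, ... : implicitly assumes size is even */
    sendval = rank;
    MPI_Sendrecv(&sendval, 1, MPI_INT, partner, 0,
                 &recvval, 1, MPI_INT, partner, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    printf("rank %d exchanged with rank %d\n", rank, recvval);

    MPI_Finalize();
    return 0;
}

With an odd number of processes the last rank computes partner == size, an invalid rank, so the run aborts (other pairing schemes hang instead).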

SLIDE 10

Sample MPI Bugs

Suppose we have a local variable, e.g. energy, and want to sum the energy over all processors to find the total energy of the system. Recall MPI_Reduce(sendbuf, recvbuf, count, datatype, op, root, comm). Using the same variable for both buffers, as in MPI_Reduce(energy, energy, 1, MPI_REAL, MPI_SUM, 0, MPI_COMM_WORLD), will bomb.
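A sketch of two standard fixes (not shown on the slide): use a distinct receive buffer, or pass MPI_IN_PLACE as the send buffer at the root. In C (MPI_DOUBLE instead of the slide's MPI_REAL; rank is assumed set by MPI_Comm_rank, and local_energy() is just a hypothetical placeholder):

double energy = local_energy(), total = 0.0;

/* Fix 1: separate send and receive buffers */
MPI_Reduce(&energy, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

/* Fix 2: reuse the same variable at the root via MPI_IN_PLACE
   (in a real code you would use one fix or the other, not both) */
if (rank == 0)
    MPI_Reduce(MPI_IN_PLACE, &energy, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
else
    MPI_Reduce(&energy, NULL, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);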

SLIDE 11

Sample MPI Bugs

while (stillIterating) {
  if (my_rank == 0) {
    for (i = 1; i < nProcs; i++) {
      MPI_Recv(buf, count, type, MPI_ANY_SOURCE, tag, MPI_COMM_WORLD, &status);
      /* process stuff from other processors */
    }
  } else {
    MPI_Send(buf, count, type, 0, tag, MPI_COMM_WORLD);
  }
}

SLIDE 12

MPI + OpenMP Example
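The slide's source code is not in the transcription; a minimal hybrid sketch consistent with the sample output on the next slide might look like the following (compiled with, e.g., mpicc -fopenmp; a production hybrid code would normally call MPI_Init_thread to request a threading level):

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, nthreads = 1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    printf("Hello from MPI rank %d of %d\n", rank, size);

    #pragma omp parallel
    {
        #pragma omp single
        nthreads = omp_get_num_threads();   /* threads in this MPI process */
    }
    printf("Number of threads %d on process rank %d\n", nthreads, rank);

    MPI_Finalize();
    return 0;
}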

SLIDE 13

MPI + OpenMP Example

Sample laptop output:

% mpirun -np 2 a.out
Hello from MPI rank 0 of 2
Hello from MPI rank 1 of 2
Number of threads 4 on process rank 0
Number of threads 4 on process rank 1

SLIDE 14

MPI References

  • Lawrence Livermore tutorial: https://computing.llnl.gov/tutorials/mpi/
  • Using MPI: Portable Parallel Programming with the Message-Passing Interface, by Gropp, Lusk, Skjellum
  • Using MPI-2: Advanced Features of the Message-Passing Interface, by Gropp, Lusk, Thakur
  • Lots of other on-line tutorials, books, etc.
SLIDE 15

Parallel Performance

Recall Amdahl's law: if T1 = serial cost + parallel cost, then Tp = serial cost + parallel cost/p. But really Tp = serial cost + parallel cost/p + Tcommunication. How expensive is the communication?
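For illustration (numbers assumed, not from the slides): with 1 s of serial work and 99 s of parallelizable work, T1 = 100 s, and p = 100 gives Tp ≈ 1 + 0.99 ≈ 2 s plus Tcommunication; even a modest 1 s of communication drops the speedup from about 50x to about 33x.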

SLIDE 16

Network Characteristics

Interconnection network connects nodes, transfers data. Important qualities:

  • Topology - the structure used to connect the nodes
  • Routing algorithm - how messages are transmitted between processors, and along which path (= nodes along which the message is transferred)
  • Switching strategy - how a message is cut into pieces and assigned a path
  • Flow control (for dealing with congestion) - stall, store data in buffers, re-route data, tell source to halt, discard, etc.

SLIDE 17

Interconnection Network

Represent the network as a graph G = (V, E): V = set of nodes to be connected, E = direct links between the nodes. Links are usually bidirectional - they can transfer messages in both directions at the same time. Characterize a network by:

  • diameter - maximum over all pairs of nodes of the shortest path between the nodes (length of path in message transmission)
  • degree - number of direct links for a node (number of direct neighbors)
  • bisection bandwidth - minimum number of edges that must be removed to partition the network into two parts of equal size with no connection between them (measures network capacity for transmitting messages simultaneously)
  • node/edge connectivity - number of nodes/edges that must fail to disconnect the network (measure of reliability)

SLIDE 18

Linear Array

  • p vertices, p − 1 links
  • Diameter = p − 1
  • Degree = 2
  • Bisection bandwidth = 1
  • Node connectivity = 1, edge connectivity = 1
SLIDE 19

Ring topology

  • diameter = p/2
  • degree = 2
  • bisection bandwidth = 2
  • node connectivity = 2, edge connectivity = 2

SLIDE 20

Mesh topology

  • diameter = 2(√p − 1); for a 3d mesh, 3(∛p − 1)
  • degree = 4 (6 in 3d)
  • bisection bandwidth = √p
  • node connectivity = 2, edge connectivity = 2

Route along each dimension in turn.

SLIDE 21

Torus topology

Diameter halved, bisection bandwidth doubled, edge and node connectivity doubled compared with the mesh.

SLIDE 22

Hypercube topology

[Figure: hypercubes of dimension 1 to 4, nodes labelled with binary strings 0/1, 00-11, 000-111, 0000-1111]

  • p = 2^k processors labelled with binary numbers of length k
  • k-dimensional cube constructed from two (k − 1)-cubes
  • Connect corresponding procs if labels differ in 1 bit (Hamming distance d between two k-bit binary words = path of length d between the 2 nodes)
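As a small illustration (not from the slides), the k neighbors of a node can be generated by flipping each bit of its label:

#include <stdio.h>

/* Print the neighbors of node `rank` in a k-dimensional hypercube:
   each neighbor's label differs from rank in exactly one bit. */
int main(void)
{
    int k = 4;        /* dimension, p = 2^k = 16 nodes */
    int rank = 5;     /* example node, binary 0101 */

    for (int d = 0; d < k; d++)
        printf("neighbor across dimension %d: %d\n", d, rank ^ (1 << d));
    return 0;
}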

SLIDE 23

Hypercube topology


  • diameter = k (= log p)
  • degree = k
  • bisection bandwidth = p/2
  • node connectivity = k, edge connectivity = k

SLIDE 24

Dynamic Networks

The networks above were direct, or static, interconnection networks: processors connected directly with each other through fixed physical links. Indirect or dynamic networks contain switches which provide an indirect connection between the nodes; the switches are configured dynamically to establish a connection.

  • bus
  • crossbar
  • multistage network - e.g. butterfly, omega, baseline
SLIDE 25

Crossbar

[Figure: crossbar connecting processors P1 ... Pn to memories M1 ... Mm]

  • Connecting n inputs and m outputs takes nm switches. (Typically used only for small numbers of processors.)
  • At each switch the connection can either go straight or change direction.
  • Diameter = 1, bisection bandwidth = p
SLIDE 26

Butterfly

16 × 16 butterfly network:

[Figure: 16 x 16 butterfly network, switch rows 000-111, stages 0-3]

For p = 2^(k+1) processors: k + 1 stages, 2^k switches per stage, each a 2 × 2 switch.

SLIDE 27

Fat tree

  • Complete binary tree
  • Processors at leaves
  • Increase links for higher bandwidth near root
SLIDE 28

Current picture

  • Old style: mapped algorithms to topologies
  • New style: avoid topology-specific optimizations
  • Want code that runs on next year's machines too.
  • Topology awareness in vendor MPI libraries?
  • Software topology - ease of programming, but not used for performance?

SLIDE 29

Top500 Interconnects


SLIDE 30

Networks∗

∗from Top500 2007 recent supercomputers overview

SLIDE 31

Routing and Switching

Routing = determining a path from source to destination through the network. Avoid deadlock. Try to find a minimum-cost path, which depends on the length of the path (topology), contention, and congestion.

  • Deterministic routing - always uses the same path; can lead to unbalanced network load and network contention (2 or more messages transmitting at the same time over the same link, leading to delays in message transmission).
  • Adaptive routing - dynamically selects the routing path based on load information. Tries to spread network traffic evenly. Also more fault tolerant.

SLIDE 32

Routing and Switching

  • Circuit switching: the full path is reserved for the entire message (like a telephone call); a short probe message establishes the path, and all message units then use the same path.
  • Packet switching: the message is broken into separately-routed packets.
  • store-and-forward - the entire packet must be received by each switch on the path (store) before it is sent to the next switch (forward). Needs enough memory at the switches.
  • pipelining - packets are sent so that all links are used by successive packets in an overlapping way (if all packets are transmitted along the same path)
  • Cut-through routing, wormhole routing
SLIDE 33

Sample Point-to-Point Message Protocol

To send a message:

  • 1. Message copied into a system buffer.
  • 2. Checksum computed, header added to the message.
  • 3. Timer started, message sent out over the network interface.

To receive a message:

  • 1. Message copied from the network interface to a system buffer.
  • 2. Checksum computed and compared with the checksum in the header. If the checksums agree, an acknowledgment message is sent to the sender. If not, the message is discarded; it will be re-sent after the sender's timer has timed out.
  • 3. If the checksums agree, the message is copied from the system buffer to the user buffer. The application program gets a notification and can continue execution.

SLIDE 34

Sample Message Transmission Time

[Figure: message transmission timeline - sender overhead, time of flight, transmission time, transport latency, receiver overhead; the sum is the total latency]

  • sender overhead - time to prepare the message (compute checksum, make header, execute routing algorithm)
  • time of flight = channel propagation delay = time for the first bit to arrive at the receiver
  • transmission time - message size in bytes divided by the bandwidth of the link in bytes per second (inverse of byte transfer time). Does not include conflicts.
  • receiver overhead - time to process the incoming message, compare checksums, generate the acknowledgement

SLIDE 35

Sample Message Transmission Time


Total time to send a message of m bytes: T(m) = Osend + TtimeOfFlight + m/B + Orecv. Combining constants and defining tB = 1/B gives T(m) = Toverhead + tB · m. For circuit switching over multiple links, add a term proportional to l · mc, where l = length of the path and mc = size of the probe message; this term is typically very small.

SLIDE 36

Latency and Bandwidth Model

  • Time to send a message of length m is roughly latency + m/bandwidth. Topology is assumed irrelevant.
  • Often called the α-β model and written T = α + m · β
  • Usually α >> β >> time per flop
  • One long message is cheaper than many short ones
  • Can do hundreds or thousands of flops for the cost of one message
  • Need a large computation-to-communication ratio to be efficient (see the worked numbers below)
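For illustration (ballpark values assumed, not from the slides): with α = 1 µs, β = 1 ns/byte, and 1 ns per flop, sending a single 8-byte double costs T = 1 µs + 8 ns ≈ 1 µs, the time of roughly 1000 flops, while a 1 MB message costs about 1 µs + 1 ms ≈ 1 ms, so the latency term only matters for short messages.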

SLIDE 37

Scalar Product Example

Compute the dot product of two vectors: c = Σ_{j=1}^{n} a_j b_j

On p processors each computes n/p terms; the final result is stored on one processor using an accumulate. Let α = cost of an add = cost of a multiply, and β = cost of communicating one value to a neighbor.

Linear array: T(p, n) = 2(n/p)α + (p/2)(α + β), minimized at p* = 2√(α/(α + β)) · √n

Hypercube: T(p, n) = 2(n/p)α + log p · (α + β), minimized at p* = 2nα ln 2/(α + β)

For p > p* the execution time T increases!
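A minimal MPI sketch of this computation (not from the slides): each rank forms its n/p partial sum and the result is accumulated onto rank 0 with MPI_Reduce. The vectors are filled with dummy values, and n is assumed divisible by p.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, p;
    long n = 1 << 20;                     /* global vector length (assumed) */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    long nloc = n / p;                    /* assumes p divides n */
    double *a = malloc(nloc * sizeof(double));
    double *b = malloc(nloc * sizeof(double));
    for (long i = 0; i < nloc; i++) { a[i] = 1.0; b[i] = 2.0; }

    double local = 0.0, c = 0.0;
    for (long i = 0; i < nloc; i++)       /* n/p multiplies and adds per process */
        local += a[i] * b[i];

    /* accumulate the partial sums onto rank 0 */
    MPI_Reduce(&local, &c, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("dot product = %g (expected %g)\n", c, 2.0 * n);

    free(a);
    free(b);
    MPI_Finalize();
    return 0;
}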