

SLIDE 1

Lecture 8: MPI continued

David Bindel 17 Feb 2010

SLIDE 2

Logistics

◮ HW 1 due today!
◮ Cluster maintenance tomorrow morning
◮ Please also submit a feedback form (see web page).

SLIDE 3

Previously on Parallel Programming

Can write a lot of MPI code with 6 operations we’ve seen:

◮ MPI_Init
◮ MPI_Finalize
◮ MPI_Comm_size
◮ MPI_Comm_rank
◮ MPI_Send
◮ MPI_Recv

... but there are sometimes better ways. Decide on communication style using simple performance models.

SLIDE 4

Communication performance

◮ Basic info: latency and bandwidth
◮ Simplest model: t_comm = α + βM
◮ More realistic: distinguish CPU overhead from “gap” (∼ inverse bandwidth)
◮ Different networks have different parameters
◮ Can tell a lot via a simple ping-pong experiment (see the sketch below)
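A minimal ping-pong sketch along these lines (mine, not from the slides; the message size and trial count are arbitrary choices): rank 0 sends M bytes to rank 1, which echoes them back, and half the average round-trip time estimates α + βM.

#include <stdio.h>
#include <mpi.h>

/* Ping-pong sketch: estimate alpha + beta*M from round-trip times. */
int main(int argc, char** argv)
{
    const int M = 1024, NTRIALS = 1000;  /* arbitrary size and trial count */
    char buf[1024];
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    double t0 = MPI_Wtime();
    for (int i = 0; i < NTRIALS; ++i) {
        if (rank == 0) {          /* send, then wait for the echo */
            MPI_Send(buf, M, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, M, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {   /* receive, then echo back */
            MPI_Recv(buf, M, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, M, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();
    if (rank == 0)   /* one message = half a round trip */
        printf("Time/msg: %g us\n", (t1 - t0) / (2 * NTRIALS) * 1e6);
    MPI_Finalize();
    return 0;
}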

SLIDE 5

OpenMPI on crocus

◮ Two quad-core chips per node, five nodes
◮ Heterogeneous network:
  ◮ Crossbar switch between cores (?)
  ◮ Bus between chips
  ◮ Gigabit Ethernet between nodes
◮ Default process layout (16-process example):
  ◮ Processes 0–3 on first chip, first node
  ◮ Processes 4–7 on second chip, first node
  ◮ Processes 8–11 on first chip, second node
  ◮ Processes 12–15 on second chip, second node
◮ Test ping-pong from 0 to 1, 7, and 8.

SLIDE 6

Approximate α-β parameters (on node)

[Plot: time/msg (μs) vs. message size (KB); measured vs. model curves for ping-pong to ranks 1 and 7.]

α₁ ≈ 1.0 × 10⁻⁶, β₁ ≈ 5.7 × 10⁻¹⁰
α₂ ≈ 8.4 × 10⁻⁷, β₂ ≈ 6.8 × 10⁻¹⁰

SLIDE 7

Approximate α-β parameters (cross-node)

[Plot: time/msg (μs) vs. message size (KB); measured vs. model curves for ping-pong to ranks 1, 7, and 8.]

α₃ ≈ 7.1 × 10⁻⁵, β₃ ≈ 9.7 × 10⁻⁹

SLIDE 8

Moral

Not all links are created equal!

◮ Might handle with mixed paradigm
  ◮ OpenMP on node, MPI across
  ◮ Have to worry about thread-safety of MPI calls
◮ Can handle purely within MPI
◮ Can ignore the issue completely?

For today, we’ll take the last approach.

SLIDE 9

Reminder: basic send and recv

MPI_Send(buf, count, datatype, dest, tag, comm);
MPI_Recv(buf, count, datatype, source, tag, comm, status);

MPI_Send and MPI_Recv are blocking
◮ Send does not return until data is in system
◮ Recv does not return until data is ready
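As a minimal concrete use (a sketch, assuming the usual MPI_Init/MPI_Finalize wrapper and at least two ranks):

/* Sketch: rank 0 sends one double to rank 1. */
double x = 3.14;
int rank;
MPI_Status status;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (rank == 0)
    MPI_Send(&x, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
else if (rank == 1)
    MPI_Recv(&x, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status);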

SLIDE 10

Blocking and buffering

[Diagram: data copied from P0 through OS buffers and the network to P1.]

Block until data “in system” — maybe in a buffer?

SLIDE 11

Blocking and buffering

[Diagram: data moves directly from P0 to P1 with no intermediate buffer.]

Alternative: don’t copy, block until done.

SLIDE 12

Problem 1: Potential deadlock

[Diagram: P0 and P1 both blocked in Send, each waiting on the other.]

Both processors wait to finish send before they can receive! May not happen if lots of buffering on both sides.
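In code, the hazardous pattern looks like this (a sketch, assuming an even number of ranks exchanging large messages): every rank sends before it receives, so once the messages exceed the system's buffering, neither MPI_Send returns.

/* Hazard sketch: both partners send first and may deadlock. */
enum { N = 1 << 16 };            /* large enough to exceed buffering */
double sendbuf[N], recvbuf[N];
int rank, partner;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
partner = rank ^ 1;              /* pair up 0<->1, 2<->3, ... */
MPI_Send(sendbuf, N, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD);
MPI_Recv(recvbuf, N, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);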

SLIDE 13

Solution 1: Alternating order

[Diagram: P0 sends while P1 receives, then the roles swap.]

Could alternate who sends and who receives.
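One way to code the alternation (a sketch continuing the previous fragment): order the calls by rank parity.

/* Fix sketch: even ranks send first, odd ranks receive first. */
if (rank % 2 == 0) {
    MPI_Send(sendbuf, N, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD);
    MPI_Recv(recvbuf, N, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
} else {
    MPI_Recv(recvbuf, N, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Send(sendbuf, N, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD);
}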

SLIDE 14

Solution 2: Combined send/recv

[Diagram: P0 and P1 each call Sendrecv at the same time.]

Common operations deserve explicit support!

SLIDE 15

Combined sendrecv

MPI_Sendrecv(sendbuf, sendcount, sendtype, dest, sendtag,
             recvbuf, recvcount, recvtype, source, recvtag,
             comm, status);

Blocking operation, combines send and recv to avoid deadlock.
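For example, a ring-shift sketch (mine, not from the slides): every rank passes its rank number to the right and receives from the left in a single call, with no ordering gymnastics needed.

/* Ring-shift sketch: send my rank to the right, receive from the left. */
int rank, size, right, left, recvval;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
right = (rank + 1) % size;
left  = (rank + size - 1) % size;
MPI_Sendrecv(&rank,    1, MPI_INT, right, 0,
             &recvval, 1, MPI_INT, left,  0,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);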

SLIDE 16

Problem 2: Communication overhead

[Diagram: repeated Sendrecv phases with processors spending much of their time waiting.]

Partial solution: nonblocking communication

SLIDE 17

Blocking vs non-blocking communication

◮ MPI_Send and MPI_Recv are blocking
  ◮ Send does not return until data is in system
  ◮ Recv does not return until data is ready
  ◮ Cons: possible deadlock, time wasted waiting
◮ Why blocking?
  ◮ Overwrite buffer during send =⇒ evil!
  ◮ Read buffer before data ready =⇒ evil!
◮ Alternative: nonblocking communication
  ◮ Split into distinct initiation/completion phases
  ◮ Initiate send/recv and promise not to touch buffer
  ◮ Check later for operation completion

SLIDE 18

Overlap communication and computation

[Diagram: start sends/recvs, compute (but don't touch the buffers!), then complete the sends/recvs.]

SLIDE 19

Nonblocking operations

Initiate message:

MPI_Isend(start, count, datatype, dest, tag, comm, request);
MPI_Irecv(start, count, datatype, source, tag, comm, request);

Wait for message completion:

MPI_Wait(request, status);

Test for message completion:

MPI_Test(request, flag, status);
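Putting these together, an overlap sketch in the spirit of the previous slide (N, nbr, and do_local_work are assumptions standing in for a real problem):

/* Overlap sketch: initiate recv/send, compute on unrelated data,
 * then complete both before touching the message buffers. */
MPI_Request reqs[2];
double sendbuf[N], recvbuf[N];
MPI_Irecv(recvbuf, N, MPI_DOUBLE, nbr, 0, MPI_COMM_WORLD, &reqs[0]);
MPI_Isend(sendbuf, N, MPI_DOUBLE, nbr, 0, MPI_COMM_WORLD, &reqs[1]);
do_local_work();   /* hypothetical work that leaves the buffers alone */
MPI_Wait(&reqs[0], MPI_STATUS_IGNORE);
MPI_Wait(&reqs[1], MPI_STATUS_IGNORE);
/* Only now is it safe to read recvbuf or overwrite sendbuf. */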

SLIDE 20

Multiple outstanding requests

Sometimes useful to have multiple outstanding messages:

MPI_Waitall(count, requests, statuses);
MPI_Waitany(count, requests, index, status);
MPI_Waitsome(count, requests, indices, statuses);

Multiple versions of test as well.
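For instance, MPI_Waitany lets you handle messages in whatever order they complete (a sketch; reqs holds nreq posted receives, and process_message is a hypothetical handler):

/* Sketch: service each outstanding receive as it completes. */
for (int k = 0; k < nreq; ++k) {
    int which;
    MPI_Status status;
    MPI_Waitany(nreq, reqs, &which, &status);
    process_message(which, &status);   /* hypothetical handler */
}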

SLIDE 21

Other send/recv variants

Other variants of MPI_Send
◮ MPI_Ssend (synchronous) – does not complete until receive has begun
◮ MPI_Bsend (buffered) – user provides buffer via MPI_Buffer_attach (sketch below)
◮ MPI_Rsend (ready) – user guarantees receive has already been posted
◮ Can combine modes (e.g. MPI_Issend)

MPI_Recv receives anything.
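A buffered-send sketch (mine, not from the slides; data, N, and dest are assumptions, and malloc needs <stdlib.h>):

/* Bsend sketch: attach a user buffer, send through it, detach. */
int bufsize = N * sizeof(double) + MPI_BSEND_OVERHEAD;
void* buf = malloc(bufsize);
MPI_Buffer_attach(buf, bufsize);
MPI_Bsend(data, N, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
MPI_Buffer_detach(&buf, &bufsize);   /* blocks until buffered sends drain */
free(buf);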

SLIDE 22

Another approach

◮ Send/recv is one-to-one communication
◮ An alternative is one-to-many (and vice-versa):
  ◮ Broadcast to distribute data from one process
  ◮ Reduce to combine data from all processors
  ◮ Operations are called by all processes in communicator

SLIDE 23

Broadcast and reduce

MPI_Bcast(buffer, count, datatype, root, comm);
MPI_Reduce(sendbuf, recvbuf, count, datatype,
           op, root, comm);

◮ buffer is copied from root to others
◮ recvbuf receives result only at root
◮ op ∈ { MPI_MAX, MPI_SUM, ... }

SLIDE 24

Example: basic Monte Carlo

#include <stdio.h>
#include <mpi.h>

int main(int argc, char** argv)
{
    int nproc, myid, ntrials;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    if (myid == 0) {
        printf("Trials per CPU:\n");
        scanf("%d", &ntrials);
    }
    MPI_Bcast(&ntrials, 1, MPI_INT, 0, MPI_COMM_WORLD);
    run_trials(myid, nproc, ntrials);
    MPI_Finalize();
    return 0;
}

SLIDE 25

Example: basic Monte Carlo

Let sum[0] = ∑ᵢ Xᵢ and sum[1] = ∑ᵢ Xᵢ².

void run_mc(int myid, int nproc, int ntrials)
{
    double sums[2] = {0,0};
    double my_sums[2] = {0,0};
    /* ... run ntrials local experiments, accumulating my_sums ... */
    MPI_Reduce(my_sums, sums, 2, MPI_DOUBLE,
               MPI_SUM, 0, MPI_COMM_WORLD);
    if (myid == 0) {
        int N = nproc*ntrials;
        double EX  = sums[0]/N;
        double EX2 = sums[1]/N;
        /* standard error of the mean: sqrt(Var(X)/N); needs <math.h> */
        printf("Mean: %g; err: %g\n", EX, sqrt((EX2-EX*EX)/N));
    }
}

SLIDE 26

Collective operations

◮ Involve all processes in communicator
◮ Basic classes:
  ◮ Synchronization (e.g. barrier)
  ◮ Data movement (e.g. broadcast)
  ◮ Computation (e.g. reduce)

SLIDE 27

Barrier

MPI_Barrier(comm);

Not much more to say. Not needed that often.

SLIDE 28

Broadcast

[Diagram: Broadcast copies A from P0 to all of P0–P3.]

SLIDE 29

Scatter/gather

[Diagram: Scatter splits A, B, C, D from P0 across P0–P3; Gather collects them back to P0.]

SLIDE 30

Allgather

[Diagram: Allgather leaves every process with the full set A, B, C, D.]

SLIDE 31

Alltoall

[Diagram: Alltoall transposes blocks: process Pi sends its jth block to Pj.]

SLIDE 32

Reduce

[Diagram: Reduce combines A, B, C, D from P0–P3 into one result (ABCD) on P0.]

SLIDE 33

Scan

[Diagram: Scan gives each Pi the prefix combination through its own contribution: A, AB, ABC, ABCD.]
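As a concrete use (a sketch, not from the slides), MPI_Scan computes inclusive prefix sums; here each rank gets the sum of all ranks up to and including its own:

/* Scan sketch: prefix == 0 + 1 + ... + myid on rank myid. */
int myid, prefix;
MPI_Comm_rank(MPI_COMM_WORLD, &myid);
MPI_Scan(&myid, &prefix, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);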

SLIDE 34

The kitchen sink

◮ In addition to the above: vector variants (v suffix), more
  “All” variants (Allreduce), Reduce_scatter, ...
◮ MPI-2 adds one-sided communication (put/get)
◮ MPI is not a small library!
◮ But a small number of calls goes a long way:
  ◮ Init/Finalize
  ◮ Comm_rank, Comm_size
  ◮ Send/Recv variants and Wait
  ◮ Allreduce, Allgather, Bcast (see the sketch below)
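For example, a global dot product via MPI_Allreduce (a sketch; the local arrays x, y and length nlocal are assumptions):

/* Allreduce sketch: every rank ends up with the global dot product. */
double local = 0, global;
for (int i = 0; i < nlocal; ++i)
    local += x[i] * y[i];
MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);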

SLIDE 35

Example: n-body interactions

◮ Newtonian mechanics with pairwise force law
◮ Time step n particles (e.g. via Euler)
◮ Questions:
  ◮ Where should the particles live?
  ◮ How should we evaluate forces?
  ◮ How should we balance load?
  ◮ What should our communication strategy be?