
Lecture 7: Message Passing Programming using MPI

2

MPI - Message Passing Interface

  • MPI is the most widely used message-passing standard

– By using MPI you obtain portable programs

  • MPI is based on earlier message passing libraries

– NX, MPL, PVM, P4, TCGMSH, PARMACS, ...

  • MPI is a standard that specifies:

– programming interface to message passing
– semantics for communication routines in the standard

  • MPI 1.1

– 128 routines callable from FORTRAN, C, C++, Ada, ..

  • Latest version is MPI 2.0 (www.mpi-forum.org)
  • There exist many implementations: MPICH, MPICH-G2, ScaMPI, etc.

3

The MPI Architecture

  • SPMD: Single Program Multiple Data

– Given P processors, run the same program on every processor

  • Data types

– Data is described in a standardized way in MPI

  • Communicators

– An abstraction on how to choose participants in a collection of communications

  • Pair-wise communication

– One participant sends and the other receives

  • Collective communication

– Reductions, broadcasts, etc

4

MPI-Programming

  • include mpi.h (C)
  • All MPI programs have to begin with

err = MPI_Init(&argc, &argv)

  • All MPI programs end with

err = MPI_Finalize()
– cleans up, removes internal MPI structures, etc.
– make sure remaining communications have completed before the call (it does not interrupt them)

  • Note:

– routines not belonging to MPI are local
– e.g., printf is run by every process

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    printf("Hello world\n");
    MPI_Finalize();
    return 0;
}

On sarek: sarek$ module add mpich/pgi, then compile with mpicc
On seth: compile with gcc

  • -I/opt/scali/include

and

  • -L/opt/scali/lib -lmpi

See quickstart guides

5

MPI World

  • All processors calling MPI_Init

define MPI_COMM_WORLD

  • All MPI communication demands a communicator

– MPI_COMM_WORLD – predefined world composed of all processors taking part in the program
– You can create your own communicators

  • Only MPI processes with the same communicator can

exchange messages

  • Size of a communicator

err = MPI_Comm_size(MPI_COMM_WORLD, &size)

  • My identity (Rank) in the world (0..size-1)

err = MPI_Comm_rank(MPI_COMM_WORLD, &myrank)

[Figure: a communicator containing numbered processes 1-7]

6

The MPI world, example

#include <mpi.h>
#include <stdio.h>

int main( int argc, char **argv )
{
    int rank, size;
    MPI_Init( &argc, &argv );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    MPI_Comm_size( MPI_COMM_WORLD, &size );
    printf( "Hello world! I'm %d of %d\n", rank, size );
    MPI_Finalize();
    return 0;
}

7

”The minimal set of MPI routines”

➢ In practice, you can do everything with 6 routines:
➢ MPI_Init(int *argc, char ***argv)
➢ MPI_Finalize()
➢ int MPI_Comm_size(MPI_Comm comm, int *size)
➢ int MPI_Comm_rank(MPI_Comm comm, int *rank)
➢ int MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)
➢ int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status)
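As a sketch of how the six routines fit together, a tiny two-process example (the payload 42, the tag 0 and the rank numbers are arbitrary choices for illustration):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, value;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;                                       /* arbitrary payload */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        printf("Process 1 of %d received %d\n", size, value);
    }

    MPI_Finalize();
    return 0;
}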

8

”The minimal set of MPI routines”

  • Things to remember:

– MPI_Init may only be called once during execution
– MPI_Finalize is called last; after that no other MPI routines may be called, not even MPI_Init
  • MPI_Comm_size and MPI_Comm_rank refer to size and

rank with respect to the communicator comm. MPI_COMM_WORLD is the one we use most of the time, but we can create our own communicators with completely different values for size and rank.

  • Standard MPI_Send/MPI_Recv are blocking (see below).

– They return only when *buf can be used again, i.e., when

  • the message has been received, or
  • it has been copied to an internal buffer; MPI allows both types,

– we should assume the first type to be sure!

9

Point to Point Communication

  • Sending a message from one process to another

Process 1 sends to process 6

[Figure: process 1 sending a message to process 6]

You can of course send several messages at the same time

[Figure: several simultaneous point-to-point messages between different process pairs]

10
  • We have to fill in the details
  • Things that have to be defined:

– How can "data" be represented?
– How can processes be identified?
– When is an operation finished?

  • Demands cooperation between sender and receiver

MPI Basic Send/Receive

[Figure: Process 0 calls Send(data), Process 1 calls Receive(data)]

11

Message passing

[Figure: message passing via buffers. The sender calls the send subroutine; data is copied from sendbuf (user mode) to sysbuf (system mode), after which sendbuf is ready for reuse, and the data is sent from sysbuf to the destination. The receiver calls the receive subroutine; data is received from the source into sysbuf and copied from sysbuf to recvbuf, after which recvbuf holds valid data.]

12

Synchronous vs Asynchronous Communication

[Figure: synchronous communication uses a ready?/ready! handshake between A and B before the message is transferred (the sequence of steps 1 and 2 does not matter); asynchronous communication transfers the message via intermediate copies without the handshake.]

13

Synchronization or...?

[Figure: the same synchronous/asynchronous diagram as above, annotated with the corresponding MPI calls: MPI_Ssend, MPI_Issend for the synchronous case and MPI_Send, MPI_Isend for the standard case.]

The send calls are matched against a receive routine, e.g., MPI_Irecv

14

Blocking vs Nonblocking Communication

  • Blocking

– Send: when control is returned, the message has been sent and everything that has to do with the send is finished
– Receive: when control is returned, the whole message has been received

  • Non-Blocking

– Send: control is returned immediately, the actual send is done later by the system
– Receive: control is returned immediately, the message is not received until it arrives

  • You may of course match a blocking send with a non-

blocking receive or the other way around (common)

15

Data types

  • Data in a message is described by a triplet

(address, count, datatype)

  • Predefined MPI data types correspond to data types

from the programming language (e.g.: MPI_INT, MPI_DOUBLE_PRECISION)

  • There are MPI functions to create aggregate data

types like arrays (int, float), pairs, or one row in a matrix stored column-wise

  • Since type information is sent with all data, an MPI

implementation can support communication between processes with different memory representations and sizes of elementary data types (heterogeneous communication).

16

More about data types

  • In order to hide machine-specific differences in how

data is stored, MPI defines a number of data types that make it possible to have communication between heterogeneous processors

  • MPI_CHAR (signed char), MPI_SHORT (signed short int),

MPI_LONG (signed int), MPI_UNSIGNED_CHAR (unsigned char), MPI_UNSIGNED_SHORT (unsigned short int), MPI_UNSIGNED (unsigned int), MPI_UNSIGNED_LONG (unsigned long int), MPI_FLOAT (float), MPI_DOUBLE (double), MPI_LONG_DOUBLE (long double), MPI_BYTE (1 byte), MPI_PACKED (packed non- contiguous data)

  • You can create your own MPI types by calling (possibly

recursively) some routines (e.g. MPI_TYPE_STRUCT)

17

More about data types

struct {
    int nResults;
    double results[RMAX];
} resultPacket;

#define RESULT_PACKET_NBLOCKS 2

//Get necessary information for call to MPI_Type_struct
int blocklengths[RESULT_PACKET_NBLOCKS] = {1, RMAX};
MPI_Aint extent;
MPI_Type_extent(MPI_INT, &extent);
MPI_Aint displacements[RESULT_PACKET_NBLOCKS] = {0, extent};
MPI_Datatype types[RESULT_PACKET_NBLOCKS] = {MPI_INT, MPI_DOUBLE};

//Create the new type
MPI_Datatype resultPacketType;
MPI_Type_struct(2, blocklengths, displacements, types, &resultPacketType);
MPI_Type_commit(&resultPacketType);

//Now the new MPI type can be used to send variables of type resultPacket
MPI_Send(&myResultPacket, count, resultPacketType, dest, tag, comm);

Example: construction of a struct

18

Tags in MPI

  • Messages are sent together with a tag that helps the receiving process to identify the message (multiplexing)

– Messages can be filtered by the receiver with respect to the tag, or not filtered at all by specifying MPI_ANY_TAG

  • Errors in using tags are common for the MPI novice and often cause deadlocks

19

Point-to-Point Send

  • MPI has four different blocking-send primitives

MPI_xSEND(buf, count, datatype, dest, tag, comm)

  • The receiving process is specified by dest, which is the rank of

the receiving process in the communicator comm.

  • When the function returns, the data has been sent to the

system and the buffer can be reused/overwritten. The message may not yet have been received by the target process.
– Synchronous: MPI_SSEND
– Ready: MPI_RSEND
– Buffered: MPI_BSEND
– Standard: MPI_SEND

  • Standard is either synchronous or buffered

– Ready, synchronous and buffered differ mainly in when the send is completed, i.e., when the send buffer may be reused

20

Point-to-Point Send

  • Standard: finished when the message has been sent

– does not necessarily mean that the message has been received

  • Synchronous: finished when an acknowledgment from the

receiver has been received

– implies waiting, if the send-recv is out of sync

  • Buffered: finished when the send buffer has been copied

to a system buffer

– You cannot trust the system to supply a buffer large enough, i.e., use your own allocated buffer: MPI_BUFFER_ATTACH(buffer, size)

  • Ready: finishes directly

– sends the message right away; if there is a waiting receive, OK, otherwise the message disappears out into cyberspace (the result is undefined)

  • (See also http://www.mpi-forum.org/docs/mpi-11-html/node40.html)
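A small sketch of a buffered send with your own attached buffer (sb and dest as in the other examples on these slides; the MPI_BSEND_OVERHEAD bytes are the extra space the standard requires for MPI's bookkeeping):

/* needs <stdlib.h> for malloc/free */
int bufsize = sizeof(double) + MPI_BSEND_OVERHEAD;
void *buffer = malloc(bufsize);

MPI_Buffer_attach(buffer, bufsize);
MPI_Bsend(&sb, 1, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);  /* returns as soon as sb has been copied */
MPI_Buffer_detach(&buffer, &bufsize);                    /* waits until the buffered message has been transmitted */
free(buffer);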
21

MPI Basic (Blocking) Receive

MPI_xRECV(start, count, datatype, source, tag, comm, status)

  • Waits until a matching (source, tag) message is

received from the system, and the buffer can be used.

  • source is a rank within the communicator comm; it can also be

MPI_ANY_SOURCE.

  • status contains possible error information etc.
  • Receiving fewer than count occurrences of

datatype is all right, but more is an error.

22

Information about messages

  • To every message there is an “envelope” with unique

information

  • Can be checked directly by looking at

– status.MPI_SOURCE, returns the rank of the source
– status.MPI_TAG, returns the tag of the received message
– MPI_GET_COUNT(status, datatype, count), returns the number of elements received. The count specified in send/receive calls states the size of the buffer, not necessarily the amount sent.
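For example (a sketch; the buffer size 100 and the variable names are arbitrary):

MPI_Status status;
int buf[100], nreceived;

/* Receive at most 100 ints, from any source and with any tag */
MPI_Recv(buf, 100, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);

/* The envelope tells us who sent the message and with which tag ... */
printf("message from %d with tag %d\n", status.MPI_SOURCE, status.MPI_TAG);

/* ... and how many elements actually arrived */
MPI_Get_count(&status, MPI_INT, &nreceived);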

23

Example: P-to-P Communication

#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main( int argc, char **argv )
{
    char message[20] = "Hello, there";
    int myrank;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    if (myrank == 0) {
        /* code for process zero */
        MPI_Send(message, strlen(message)+1, MPI_CHAR, 1, 99, MPI_COMM_WORLD);
    }
    else {
        /* code for process one */
        MPI_Recv(message, 20, MPI_CHAR, 0, 99, MPI_COMM_WORLD, &status);
        printf("received :%s:\n", message);
    }
    MPI_Finalize();
    return 0;
}

24

Communication with Blocking send/receive

Example: Exchange a double between two processors

err = MPI_Send(&sb, 1, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD)
err = MPI_Recv(&rb, 1, MPI_DOUBLE, source, 0, MPI_COMM_WORLD, &r)

In some implementations (assume: all) the blocking send does not return until the receive has been started! That is: deadlock in the example above. Solution: Red-Black (odd ranks send first, then even ranks)

if (myrank == 0) {
    err = MPI_Send(&sb, 1, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
    err = MPI_Recv(&rb, 1, MPI_DOUBLE, source, 0, MPI_COMM_WORLD, &r);
} else {
    err = MPI_Recv(&rb, 1, MPI_DOUBLE, source, 0, MPI_COMM_WORLD, &r);
    err = MPI_Send(&sb, 1, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
}

25

Non-Blocking Send/Receive

  • MPI has 4 different non-blocking-send primitives

MPI_IxSEND(buf, count, datatype, dest, tag, comm, request)
– Synchronous: MPI_ISSEND
– Ready: MPI_IRSEND
– Buffered: MPI_IBSEND
– Standard: MPI_ISEND

  • Posts a send request and returns immediately
  • Returns a request that is used later to check if the send is

finished!

26

Checking if the communication is finished

  • To wait for an ISEND/IRECV to be finished the following can be

used:
– MPI_WAIT(request, status)

  • blocks until the message request is finished (received)

– MPI_WAITANY(count, array of requests, array of status)

  • blocks until any message is finished (received)

– MPI_WAITALL(count, array of requests, array of status)

  • blocks until all messages are finished (received)

– MPI_TEST(request, flag, status)

  • if flag is true the message request is finished (received)

– MPI_TESTANY, MPI_TESTALL
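A sketch of polling with MPI_Test while doing useful work in between (rb and source as in the earlier examples):

MPI_Request request;
MPI_Status status;
int done = 0;

MPI_Irecv(&rb, 1, MPI_DOUBLE, source, 0, MPI_COMM_WORLD, &request);
while (!done) {
    /* ... do a chunk of computation ... */
    MPI_Test(&request, &done, &status);   /* done becomes true once rb holds the message */
}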

27

Wait for messages

MPI_WAIT(request, status) blocks until the operation associated with request has completed, i.e., after an MPI_WAIT you can reuse the send/receive buffer

CALL MPI_COMM_RANK(comm, rank, ierr)
IF (rank.EQ.0) THEN
    CALL MPI_ISEND(a(1), 10, MPI_REAL, 1, tag, comm, request, ierr)
    ! **** do some computation to mask latency ****
    CALL MPI_WAIT(request, status, ierr)
ELSE
    CALL MPI_IRECV(a(1), 15, MPI_REAL, 0, tag, comm, request, ierr)
    ! **** do some computation to mask latency ****
    CALL MPI_WAIT(request, status, ierr)
END IF

28

MPI_Status

  • When a message is received status can be examined.
  • Status also returns information about the size of the

message:
– Cannot be seen directly

– int MPI_Get_count(MPI_Status *status, MPI_Datatype datatype, int *count)

typedef struct MPI_Status {
    int MPI_SOURCE;
    int MPI_TAG;
    int MPI_ERROR;
}

29

Blocking vs Non-blocking

MPI_Send, MPI_Recv, etc.
MPI_Bsend, MPI_Brecv, etc.
MPI_Ibsend, MPI_Ibrecv, etc.
MPI_Isend, MPI_Irecv, etc.

30

Deadlock Free Communication

  • The simplest way to avoid deadlocks is to first post

a non-blocking receive, then send, then wait for the communication to finish

  • Example: two processes exchanging a double

err = MPI_Irecv(&rb, 1, MPI_DOUBLE, source, 0, MPI_COMM_WORLD, &r)
err = MPI_Send(&sb, 1, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD)
err = MPI_Wait(&r, &status);

31

Checking for a message without receiving

MPI_PROBE(source, tag, comm, status) blocks until a matching message arrives

CALL MPI_COMM_RANK(comm, rank, ierr)
IF (rank.EQ.0) THEN
    CALL MPI_SEND(i, 1, MPI_INTEGER, 2, 0, comm, ierr)
ELSE IF (rank.EQ.1) THEN
    CALL MPI_SEND(x, 1, MPI_REAL, 2, 0, comm, ierr)
ELSE   ! rank.EQ.2
    DO i=1, 2
        CALL MPI_PROBE(MPI_ANY_SOURCE, 0, comm, status, ierr)
        IF (status(MPI_SOURCE) .EQ. 0) THEN
100         CALL MPI_RECV(i, 1, MPI_INTEGER, 0, 0, comm, status, ierr)
        ELSE
200         CALL MPI_RECV(x, 1, MPI_REAL, 1, 0, comm, status, ierr)
        END IF
    END DO
END IF

32

Combined Communication Operations

  • Sometimes you want to exchange data between two

processors:

  • Simpler if you do it like this:

int MPI_Sendrecv(void *sendbuf, int sendcount, MPI_Datatype senddatatype,
                 int dest, int sendtag, void *recvbuf, int recvcount,
                 MPI_Datatype recvdatatype, int source, int recvtag,
                 MPI_Comm comm, MPI_Status *status)

int MPI_Sendrecv_replace(void *buf, int count, MPI_Datatype datatype,
                 int dest, int sendtag, int source, int recvtag,
                 MPI_Comm comm, MPI_Status *status)

if (rank == 0) {
    err = MPI_Send(&sb, 1, MPI_DOUBLE, 1, tag1, MPI_COMM_WORLD);
    err = MPI_Recv(&rb, 1, MPI_DOUBLE, 1, tag2, MPI_COMM_WORLD, &r);
} else if (rank == 1) {
    err = MPI_Recv(&rb, 1, MPI_DOUBLE, 0, tag1, MPI_COMM_WORLD, &r);
    err = MPI_Send(&sb, 1, MPI_DOUBLE, 0, tag2, MPI_COMM_WORLD);
}
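With MPI_Sendrecv the same exchange becomes a single call on each process and cannot deadlock (a sketch; here both directions use the same tag):

int partner = (rank == 0) ? 1 : 0;   /* the other process */

err = MPI_Sendrecv(&sb, 1, MPI_DOUBLE, partner, tag1,
                   &rb, 1, MPI_DOUBLE, partner, tag1,
                   MPI_COMM_WORLD, &r);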

33

Collective Communication in MPI

  • All processes in a communicator take part!!
  • All collective communication is blocking

– i.e., they have a synchronizing side effect

  • Messages are as before an array with a type
  • No tags
  • There are several different primitives for collective

communication, e.g.:
– Broadcast: MPI_BCAST
– Gather: MPI_Gather, MPI_Gatherv
– Scatter: MPI_Scatter, MPI_Scatterv
– All-to-all: MPI_Alltoall, MPI_Alltoallv
– Reduction: MPI_Reduce, MPI_Allreduce
– Barrier: MPI_BARRIER
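As an illustration of the call pattern, a sketch where the root scatters one integer to every process and gathers the results back (rank and size obtained as before; the buffers and the doubling are only for illustration):

/* fragment: assumes <stdlib.h> for malloc, and rank/size from MPI_Comm_rank/MPI_Comm_size */
int *sendbuf = NULL, *result = NULL, myval;

if (rank == 0) {
    sendbuf = malloc(size * sizeof(int));   /* one element per process */
    result  = malloc(size * sizeof(int));
    for (int i = 0; i < size; i++) sendbuf[i] = i;
}

/* Every process, the root included, receives one int from the root's sendbuf */
MPI_Scatter(sendbuf, 1, MPI_INT, &myval, 1, MPI_INT, 0, MPI_COMM_WORLD);

myval *= 2;                                  /* some local work */

/* The root collects one int from every process into result */
MPI_Gather(&myval, 1, MPI_INT, result, 1, MPI_INT, 0, MPI_COMM_WORLD);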

34

Collective Communication

All processes in a communicator take part.

[Figure: broadcast sends the same message m to every process; scatter distributes the pieces m1, m2, m3 of an array m to processes 1, 2, 3 (e.g., scattering an array); gather collects the pieces m1, m2, m3 back onto one process.]

35

Collective Communication

[Figure: the data movement of bcast, scatter, gather, allgather and all_to_all, shown as the data blocks held by each process before and after the operation.]

36

Reduction and Barriers

These operations can be seen as communication with a virtual node.

[Figure: reduce combines the contributions of processes 1, 2, 3 into one result (Σ) at a single process; allreduce does the same but everyone receives the answer; barrier is the same as allreduce but with an empty message.]

37

MPI_REDUCE

MPI_REDUCE(sendbuf, recvbuf, count, datatype, op, root, comm)
– op = MPI_MAX, MPI_MIN, MPI_SUM, MPI_PROD, MPI_LAND, MPI_MAXLOC, MPI_MINLOC, ...
– Compare with the reduction directives in OpenMP

38

Reduce and Allreduce

[Figure: reduce and allreduce applied element-wise to the data A B C, D E F, G H I, J K L held by four processes; the first element of the result is A∘D∘G∘J. With reduce only the root receives the result, with allreduce every process does.]

39

Collective Communication, example

#include <mpi.h>
#include <math.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int done = 0, n, myid, numprocs, i, rc;
    double mypi, pi, h, sum, x, a;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);

    while (!done) {
        if (myid == 0) {
            printf("Enter the number of intervals: (0 quits) ");
            scanf("%d", &n);
        }

40

PI example cont.

        MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

        if (n == 0) break;
        h = 1.0 / n;
        sum = 0.0;
        for (i = myid + 1; i <= n; i += numprocs) {
            x = h * ((double)i - 0.5);
            sum += 4.0 / (1.0 + x*x);
        }
        mypi = h * sum;
        MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (myid == 0)
            printf("pi is %.16f\n", pi);
    }

MPI_Finalize();

}

41

Virtual Topologies

  • If a problem maps well to a specific topology MPI

can map the processes to that topology
– Nothing strange, you might as well do the job yourself!

[Figure: the processes of a communicator mapped onto a 2x4 grid with coordinates (0,0) through (1,3).]

42

Topologies and Mapping of Topologies

  • The (logical) Default-topology in MPI is a 1-dim

array (ring)

  • Many programs require a 2-dimensional topology

(e.g., Cannon's algorithm for matrix multiplication)

  • We have to map our array on another topology

that matches better with our algorithm.

43

Topologies and Mapping of Topologies

  • Example: Row-major, Column-major, Space-filling curve, hypercube
  • MPI works exclusively with k-dim. processor

nets

  • How the mapping network -> logical topology is

done is completely hidden from the user

  • MPI_Cart_Create
44

Cartesian Virtual Topologies

  • MPI_CART_CREATE(comm_old, ndims, dims, periods,

reorder, comm_cart)
– Takes an old communicator and returns a new one with the new topology
– ndims specifies the number of dimensions
– dims specifies the number of processors in each dim
– periods specifies if the edges "wrap around"
– reorder

  • false: data is already on the nodes and the nodes keep

their old rank

  • true: MPI is allowed to re-rank the nodes to improve

communication

  • MPI_CART_RANK: converts grid coordinates to a process rank
  • MPI_CART_COORDS: converts a process rank to grid coordinates
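A sketch of creating a 2x4 periodic grid and looking up a process's own coordinates (the dimensions, periods and variable names are illustrative choices):

MPI_Comm grid_comm;
int dims[2]    = {2, 4};     /* 2 rows, 4 columns */
int periods[2] = {1, 1};     /* wrap around in both dimensions */
int reorder    = 1;          /* allow MPI to re-rank the processes */
int grid_rank, coords[2];

MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, reorder, &grid_comm);
MPI_Comm_rank(grid_comm, &grid_rank);
MPI_Cart_coords(grid_comm, grid_rank, 2, coords);   /* my (row, column) in the grid */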
45

Groups and Communicators

  • Sometimes: some operations are limited to a part of

MPI_COMM_WORLD. E.g.: broadcast within one processor row in a 2-dim processor net

  • Create new communicators from already existing ones with MPI_Comm_split:

– int MPI_Comm_split(MPI_Comm comm, int color, int key, MPI_Comm *newcomm)
– color and key: which new communicator I shall belong to and how I shall be numbered within it
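A sketch of the row-broadcast case above, assuming 8 processes viewed as a 2x4 grid (the row width 4 and the broadcast value are illustrative):

MPI_Comm row_comm;
double x = 3.14;            /* some value to broadcast within the row */
int row = rank / 4;         /* ranks 0-3 form row 0, ranks 4-7 form row 1 */

/* color = row number; using the old rank as key keeps the original ordering within each row */
MPI_Comm_split(MPI_COMM_WORLD, row, rank, &row_comm);

/* This broadcast only involves the processes in my own row */
MPI_Bcast(&x, 1, MPI_DOUBLE, 0, row_comm);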

46

Groups and Communicators

  • Example: split MPI_COMM_WORLD with 8 processes into 3

new communicators:

Key = 1 makes MPI itself number the processes within the new communicators

47

Groups and Communicators

  • A group – a collection of processes sharing some property (e.g., rank

less than some number)

  • The long road to splitting: create new groups from the group that is

MPI_COMM_WORLD, create new communicators for these new groups

– MPI_Comm_group(MPI_Comm comm, MPI_Group *group)
– MPI_Group_incl(MPI_Group group, int n, int *ranks, MPI_Group *group_out)
– MPI_Comm_create(MPI_Comm comm, MPI_Group group, MPI_Comm *comm_out)

48

Timing in MPI

  • How to time your program is always a problem
  • MPI offers a solution:

– double MPI_Wtime()

  • Returns the number of seconds since an

implementation-dependent start time (e.g., 1980-01-01 at 24:00)
– double MPI_Wtick()

  • Returns the resolution of MPI_Wtime
49

Timing in MPI

#include <mpi.h>
#include <stdio.h>

int main( int argc, char **argv )
{
    int rank, size;
    double t1, t2;

    MPI_Init( &argc, &argv );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    MPI_Comm_size( MPI_COMM_WORLD, &size );

    t1 = MPI_Wtime();

    /* ... do some job you want to time ... */

    t2 = MPI_Wtime();

    printf("Job took %f seconds for processor %d out of %d.\n", t2-t1, rank, size);
    MPI_Finalize();
    return 0;
}