MPI types, Scatter and Scatterv


SLIDE 1

MPI types, Scatter and Scatterv

SLIDE 2

MPI types, Scatter and Scatterv

Logical and physical layout of a C/C++ array in memory:

    A = malloc(6*6*sizeof(int));

[Figure: a 6×6 array, elements 0–35, shown both as a 2-D grid (logical layout) and as one contiguous run of 36 ints (physical layout).]

SLIDE 3

MPI_Scatter

    int MPI_Scatter(
        const void   *sendbuf,   // data to send
        int           sendcount, // #elements sent to each process
        MPI_Datatype  sendtype,  // type of data sent
        void         *recvbuf,   // where received
        int           recvcount, // how much to receive
        MPI_Datatype  recvtype,  // type of data received
        int           root,      // sending process
        MPI_Comm      comm);     // communicator

sendbuf, sendcount, and sendtype are valid only at the sending process.

SLIDE 4

An equal number of elements to all processors

    MPI_Scatter(A, 9, MPI_INT, B, 9, MPI_INT, 0, MPI_COMM_WORLD);

[Figure: the 36 elements of A at the root are split into four 9-element chunks: elements 0–8 go to P0, 9–17 to P1, 18–26 to P2, and 27–35 to P3.]
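
Put together as a complete program, the call might look like this (a minimal sketch; the initialization of A and the printing are assumptions added for illustration, not part of the original slide):

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char *argv[]) {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // run with exactly 4 processes

        int *A = NULL;
        if (rank == 0) {                        // A is significant only at the root
            A = malloc(6 * 6 * sizeof(int));
            for (int i = 0; i < 36; i++) A[i] = i;
        }

        int B[9];                               // each process receives 9 elements
        MPI_Scatter(A, 9, MPI_INT, B, 9, MPI_INT, 0, MPI_COMM_WORLD);
        printf("P%d received %d..%d\n", rank, B[0], B[8]);

        if (rank == 0) free(A);
        MPI_Finalize();
        return 0;
    }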

SLIDE 5

MPI_Scatterv

    int MPI_Scatterv(
        const void   *sendbuf,    // data to send
        const int    *sendcounts, // #elements sent to each process
        const int    *displs,     // where in sendbuf each process's data starts
        MPI_Datatype  sendtype,   // type of data sent
        void         *recvbuf,    // where received
        int           recvcount,  // how much to receive
        MPI_Datatype  recvtype,   // type of data received
        int           root,       // sending process
        MPI_Comm      comm);      // communicator

sendbuf, sendcounts, displs, and sendtype are valid only at the sending process.

SLIDE 6

Specify the number of elements sent to each processor

    int counts[4] = {10, 9, 8, 9};
    int displs[4] = {0, 10, 19, 27};
    MPI_Scatterv(A, counts, displs, MPI_INT,
                 rb, counts[rank], MPI_INT,  // each process passes its own count
                 0, MPI_COMM_WORLD);

[Figure: P0 receives elements 0–9 (10 elements), P1 receives 10–18 (9), P2 receives 19–26 (8), and P3 receives 27–35 (9), each into its local buffer rb.]
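
As a complete program (a sketch under the same assumptions as the MPI_Scatter example: the root fills A with 0..35 and each process prints what it received):

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char *argv[]) {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);    // run with exactly 4 processes

        int counts[4] = {10, 9, 8, 9};           // elements per process
        int displs[4] = {0, 10, 19, 27};         // offset of each chunk in A

        int *A = NULL;
        if (rank == 0) {
            A = malloc(36 * sizeof(int));
            for (int i = 0; i < 36; i++) A[i] = i;
        }

        int rb[10];                              // sized for the largest chunk
        MPI_Scatterv(A, counts, displs, MPI_INT,
                     rb, counts[rank], MPI_INT,  // recvcount is this rank's count
                     0, MPI_COMM_WORLD);
        printf("P%d received %d elements starting with %d\n",
               rank, counts[rank], rb[0]);

        if (rank == 0) free(A);
        MPI_Finalize();
        return 0;
    }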

SLIDE 7

MPI_Type_vector

    int MPI_Type_vector(
        int           count,       // number of blocks
        int           blocklength, // #elements in a block
        int           stride,      // #elements between block starts
        MPI_Datatype  oldtype,     // type of block elements
        MPI_Datatype *newtype);    // handle for new type

Allows a type to be created that puts together blocks of elements of a vector into another vector. Note that a 2-D array in contiguous memory can be treated as a 1-D vector.

SLIDE 8

    MPI_Datatype col;
    MPI_Type_vector(6, 1, 6, MPI_INT, &col);
    MPI_Type_commit(&col);
    MPI_Send(A, 1, col, P-1, 0, MPI_COMM_WORLD);  // note: MPI_ANY_TAG is valid
                                                  // only on receives, so a send
                                                  // must use a concrete tag

MPI_Type_vector: defining the type

[Figure: the 6×6 array A in linearized memory, with column 0 (elements 0, 6, 12, 18, 24, 30) highlighted; each highlighted element marks a block start.]

There are 6 blocks, each made of 1 int, and each new block starts 6 positions in the linearized array from the start of the previous block.
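
A point-to-point version with concrete ranks (a sketch continuing from the slide's declarations; the receiver rank and buffer names are assumptions): rank 0 sends column 0 of A, and rank 1 receives it as 6 contiguous ints. The transfer is legal because both type signatures describe 6 ints.

    MPI_Datatype col;
    MPI_Type_vector(6, 1, 6, MPI_INT, &col);   // 6 blocks of 1 int, stride 6
    MPI_Type_commit(&col);

    if (rank == 0) {
        MPI_Send(A, 1, col, 1, 0, MPI_COMM_WORLD);   // one col object
    } else if (rank == 1) {
        int column[6];
        MPI_Recv(column, 6, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    MPI_Type_free(&col);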

SLIDE 9

What if we want to scatter columns (C array layout)?

[Figure: each of 6 processes receives one column of A: P0 gets elements 0, 6, 12, 18, 24, 30; P1 gets 1, 7, 13, 19, 25, 31; and so on through P5, which gets 5, 11, 17, 23, 29, 35.]

SLIDE 10

What if we want to scatter columns?

    MPI_Datatype col;
    MPI_Type_vector(6, 1, 6, MPI_INT, &col);
    MPI_Type_commit(&col);
    MPI_Scatter(A, 6, col, AC, 6, MPI_INT, 0, MPI_COMM_WORLD);

The code above won't work. Why? Where does the first col end? One col object extends through element 30 of the linearized array, so the second col would begin only after that. We want the first column to (effectively) end at element 0, the second at element 1, etc. We need to fool MPI_Scatter.

[Figure: 1 col object spanning the linearized array from element 0 through element 30.]

SLIDE 11

MPI_Type_create_resized to the rescue

    int MPI_Type_create_resized(
        MPI_Datatype  oldtype,   // type being resized
        MPI_Aint      lb,        // new lower bound
        MPI_Aint      extent,    // new extent ("length")
        MPI_Datatype *newtype);  // resized type name

Allows a new size (or extent) to be assigned to an existing type. This lets MPI determine how far from an object O1 the next adjacent object O2 is. As we will see, this is often necessitated because we treat a logically 2-D array as a 1-D vector.
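
One way to see what resizing changes is to query the extent before and after (a sketch continuing from earlier slides; col is the column type from slide 8, and the byte values assume 4-byte ints):

    MPI_Aint lb, extent;
    MPI_Type_get_extent(col, &lb, &extent);
    // For MPI_Type_vector(6, 1, 6, MPI_INT, &col) the extent runs from the
    // first element to the last: 31 ints = 124 bytes.

    MPI_Datatype coltype;
    MPI_Type_create_resized(col, 0, sizeof(int), &coltype);
    MPI_Type_get_extent(coltype, &lb, &extent);
    // Now extent = 4 bytes, so MPI places consecutive coltype objects
    // one int apart.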

SLIDE 12

Using MPI_Type_vector

    MPI_Datatype col, coltype;
    MPI_Type_vector(6, 1, 6, MPI_INT, &col);
    MPI_Type_commit(&col);
    MPI_Type_create_resized(col, 0, 1*sizeof(int), &coltype);
    MPI_Type_commit(&coltype);
    MPI_Scatter(A, 1, coltype, rb, 6, MPI_INT, 0, MPI_COMM_WORLD);

[Figure: A as a 6×6 grid and in linearized form; one coltype object is a whole column, but its extent is a single int.]
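
As a complete program (a sketch; filling A on every process and the printf are assumptions added for illustration):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[]) {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // run with exactly 6 processes

        int A[36], rb[6];
        for (int i = 0; i < 36; i++) A[i] = i;  // only the root's A matters

        MPI_Datatype col, coltype;
        MPI_Type_vector(6, 1, 6, MPI_INT, &col);             // one column
        MPI_Type_commit(&col);
        MPI_Type_create_resized(col, 0, 1 * sizeof(int), &coltype);
        MPI_Type_commit(&coltype);

        // Each process receives one column as 6 contiguous ints.
        MPI_Scatter(A, 1, coltype, rb, 6, MPI_INT, 0, MPI_COMM_WORLD);
        printf("P%d: %d %d %d %d %d %d\n",
               rank, rb[0], rb[1], rb[2], rb[3], rb[4], rb[5]);

        MPI_Type_free(&col);
        MPI_Type_free(&coltype);
        MPI_Finalize();
        return 0;
    }

P0 should print 0 6 12 18 24 30, P1 should print 1 7 13 19 25 31, and so on.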

SLIDE 13

MPI_Type_vector: defining the type

    MPI_Datatype col, coltype;
    MPI_Type_vector(6, 1, 6, MPI_INT, &col);
    MPI_Type_commit(&col);
    MPI_Type_create_resized(col, 0, 1*sizeof(int), &coltype);
    MPI_Type_commit(&coltype);
    MPI_Scatter(A, 1, coltype, rb, 6, MPI_INT, 0, MPI_COMM_WORLD);

[Figure: A as a 6×6 grid and in linearized form, with the 6 block starts of one col object highlighted.]

Again, there are 6 blocks, each made of 1 int, and each new block starts 6 positions in the linearized array from the start of the previous block.

SLIDE 14

Using MPI_Type_create_resized

    MPI_Datatype col, coltype;
    MPI_Type_vector(6, 1, 6, MPI_INT, &col);
    MPI_Type_commit(&col);
    MPI_Type_create_resized(col, 0, 1*sizeof(int), &coltype);
    MPI_Type_commit(&coltype);
    MPI_Scatter(A, 1, coltype, rb, 6, MPI_INT, 0, MPI_COMM_WORLD);

resize creates a new type from a previous type and changes its size. This allows easier computation of the offset from one element of a type to the next element of that type in the original data structure.

[Figure: the linearized array, with consecutive coltype objects starting one word (one int) apart.]

SLIDE 15

    MPI_Datatype col, coltype;
    MPI_Type_vector(6, 1, 6, MPI_INT, &col);
    MPI_Type_commit(&col);
    MPI_Type_create_resized(col, 0, 1*sizeof(int), &coltype);
    MPI_Type_commit(&coltype);
    MPI_Scatter(A, 1, coltype, rb, 6, MPI_INT, 0, MPI_COMM_WORLD);

[Figure: one object of type col starts at element 0 of the linearized array; the next starts at element 1, one sizeof(int) away.]

SLIDE 16

The result of the communication

    MPI_Datatype col, coltype;
    MPI_Type_vector(6, 1, 6, MPI_INT, &col);
    MPI_Type_commit(&col);
    MPI_Type_create_resized(col, 0, 1*sizeof(int), &coltype);
    MPI_Type_commit(&coltype);
    MPI_Scatter(A, 1, coltype, rb, 6, MPI_INT, 0, MPI_COMM_WORLD);

[Figure: P0 receives column 0 (0, 6, 12, 18, 24, 30), P1 receives column 1 (1, 7, 13, 19, 25, 31), . . . , P5 receives column 5 (5, 11, 17, 23, 29, 35), each as 6 contiguous ints in rb.]

SLIDE 17

Scattering diagonal blocks

    MPI_Datatype block, blocktype;
    MPI_Type_vector(2, 2, 6, MPI_INT, &block);
    MPI_Type_commit(&block);
    MPI_Type_create_resized(block, 0, 14*sizeof(int), &blocktype);
    MPI_Type_commit(&blocktype);
    MPI_Scatter(A, 1, blocktype, B, 4, MPI_INT, 0, MPI_COMM_WORLD);

[Figure: the three diagonal 2×2 blocks of the 6×6 array, starting at elements 0, 14, and 28 of the linearized array.]

Note that 2*numrows + width of block = 2*6 + 2 = 14.

SLIDE 18

Scattering the blocks

    MPI_Datatype block, blocktype;
    MPI_Type_vector(2, 2, 6, MPI_INT, &block);
    MPI_Type_commit(&block);
    MPI_Type_create_resized(block, 0, 14*sizeof(int), &blocktype);
    MPI_Type_commit(&blocktype);
    MPI_Scatter(A, 1, blocktype, B, 4, MPI_INT, 0, MPI_COMM_WORLD);

[Figure: P0 receives {0, 1, 6, 7}, P1 receives {14, 15, 20, 21}, and P2 receives {28, 29, 34, 35}, each into its local buffer B.]
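
The same slide code embedded in a runnable sketch (run with exactly 3 processes; the initialization and printing are assumptions added for illustration):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[]) {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int A[36], B[4];
        for (int i = 0; i < 36; i++) A[i] = i;

        MPI_Datatype block, blocktype;
        MPI_Type_vector(2, 2, 6, MPI_INT, &block);  // 2 rows of 2 ints, stride 6
        MPI_Type_commit(&block);
        // The next diagonal block starts 2 rows + 2 columns = 14 elements later.
        MPI_Type_create_resized(block, 0, 14 * sizeof(int), &blocktype);
        MPI_Type_commit(&blocktype);

        MPI_Scatter(A, 1, blocktype, B, 4, MPI_INT, 0, MPI_COMM_WORLD);
        printf("P%d: %d %d %d %d\n", rank, B[0], B[1], B[2], B[3]);
        // P0: 0 1 6 7, P1: 14 15 20 21, P2: 28 29 34 35

        MPI_Type_free(&block);
        MPI_Type_free(&blocktype);
        MPI_Finalize();
        return 0;
    }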

SLIDE 19

The Type_vector statement describing this

    MPI_Datatype block, blocktype;
    MPI_Type_vector(3, 3, 6, MPI_INT, &block);
    MPI_Type_commit(&block);
    MPI_Type_create_resized(block, 0, 3*sizeof(int), &blocktype);
    MPI_Type_commit(&blocktype);

[Figure: one 3×3 block (3 blocks of 3 ints, stride 6) highlighted in the 6×6 array and in its linearized layout.]

SLIDE 20

The create_resized statement for this

    MPI_Datatype block, blocktype;
    MPI_Type_vector(3, 3, 6, MPI_INT, &block);
    MPI_Type_commit(&block);
    MPI_Type_create_resized(block, 0, 3*sizeof(int), &blocktype);
    MPI_Type_commit(&blocktype);

[Figure: in the linearized array, the four 3×3 blocks start at elements 0, 3, 18, and 21; the gaps between starts are 3, 15, and 3 ints.]

The distance between the starts of blocks varies, but the distances are multiples of 3. Use MPI_Scatterv.

SLIDE 21

Sending the data

    MPI_Datatype block, blocktype;
    int scounts[4] = {1, 1, 1, 1};  // one blocktype object per process
    int displs[4]  = {0, 1, 6, 7};  // in units of blocktype's extent
    MPI_Type_vector(3, 3, 6, MPI_INT, &block);
    MPI_Type_commit(&block);
    MPI_Type_create_resized(block, 0, 3*sizeof(int), &blocktype);
    MPI_Type_commit(&blocktype);
    MPI_Scatterv(A, scounts, displs, blocktype,
                 rb, 9, MPI_INT, 0, MPI_COMM_WORLD);

Each displacement is measured in units of the resized block's extent (3 ints): block 1 starts 1 extent from A, block 2 starts 6 extents (18 ints) from A, and so on.

[Figure: the four 3×3 blocks in the linearized array, with starts 3 and 15 ints apart.]
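
A runnable version of the block scatter (a sketch; only the array setup and printing are added):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[]) {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);    // run with exactly 4 processes

        int A[36], rb[9];
        for (int i = 0; i < 36; i++) A[i] = i;

        int scounts[4] = {1, 1, 1, 1};  // one blocktype object per process
        int displs[4]  = {0, 1, 6, 7};  // in units of blocktype's extent (3 ints)

        MPI_Datatype block, blocktype;
        MPI_Type_vector(3, 3, 6, MPI_INT, &block);
        MPI_Type_commit(&block);
        MPI_Type_create_resized(block, 0, 3 * sizeof(int), &blocktype);
        MPI_Type_commit(&blocktype);

        // Each process receives its 3x3 quadrant as 9 contiguous ints.
        MPI_Scatterv(A, scounts, displs, blocktype,
                     rb, 9, MPI_INT, 0, MPI_COMM_WORLD);
        printf("P%d starts with %d %d %d\n", rank, rb[0], rb[1], rb[2]);

        MPI_Type_free(&block);
        MPI_Type_free(&blocktype);
        MPI_Finalize();
        return 0;
    }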

SLIDE 22

Matrix Multiply: Cannon's Algorithm

  • Useful for the small project
  • Algorithm 1 in what follows is the layout we discussed earlier

SLIDE 23

Elements of A and B Needed to Compute a Process’s Portion of C

[Figure: two panels, "Algorithm 1" and "Cannon's Algorithm", showing which blocks of A and B a process needs.]

SLIDE 24

Parallel Algorithm 2 (Cannon’s Algorithm)

  • Associate a primitive task with each matrix element
  • Agglomerate tasks responsible for a square (or nearly square) block of C (the result matrix)
  • Computation-to-communication ratio rises to n/√p (same total computation, more computation per communication)
  • 2n/p < n/√p when p > 4

SLIDE 25

Simplifying assumptions

Assume that:

  • A, B and (consequently) C are n × n square matrices
  • √p is an integer, and
  • n = k·√p, with k an integer (i.e., n is a multiple of √p)

SLIDE 26

Blocks needed to compute a C element

[Figure: matrices A, B and C, with one C block circled and the corresponding A and B blocks highlighted.]

These blocks need to be on the same processor. The processor that owns these blocks fully computes the value of the circled C block (but needs more than the circled A and B blocks).

SLIDE 27

Blocks to compute the C element

[Figure: matrices C, B and A, each with block P2,1 marked.]

Processor P2,1 needs, at some point, to simultaneously hold the green A and B blocks, the red A and B blocks, the blue A and B blocks, and the cayenne A and B blocks. With the current data layout it cannot do useful work, because it does not hold matching A and B blocks (it has a red A block and a blue B block).

SLIDE 28

Blocks needed to compute a C element

[Figure: matrices A, B and C, with blocks P1,1, P2,1 and P2,2 marked.]

We need to rearrange the data so that every block has useful work to do. The initial data configuration does not provide for this.

SLIDE 29

Every processor now has useful work to do

Note -- this only shows the full data layout for one processor.

[Figure: matrices C, B and A after the initial rearrangement.]

SLIDE 30

At each step in the multiplication, shift B elements up within their column, and A elements left within their row

[Figure: C, B and A with block P2,1 marked; P2,1 computes its first partial sum.]

SLIDE 31

And again . . .

[Figure: C, B and A with block P2,1 marked; P2,1 computes its second partial sum.]

SLIDE 32

And again . . .

[Figure: C, B and A with block P2,1 marked; P2,1 computes its third partial sum.]

SLIDE 33

And again

[Figure: C, B and A with block P2,1 marked; P2,1 computes its fourth partial sum.]

SLIDE 34

Another way to view this

[Figure: the block layout before and after the initial alignment.]

SLIDE 35

Another way to view this

[Figure: before and after the alignment; arrows show where blocks go: the B block moves up 1 (j) row, and the A block moves over 2 (i) positions.]

SLIDE 36

Yet another way to view this

[Figure: a 4×4 process grid; each cell holds an A block and a B block (A00/B00 through A33/B33), drawn as two triangles.]

Each triangle represents a matrix block on a processor. Only same-color triangles should be multiplied.

SLIDE 37

Rearrange Blocks

[Figure: the same 4×4 grid of A and B blocks, with arrows showing the rearrangement.]

Block Ai,j shifts left i positions. Block Bi,j shifts up j positions.
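
This initial alignment, plus the repeated shifts on the following slides, can be written compactly with a periodic Cartesian communicator. The following is a sketch, not code from the slides: the 2 x 2 grid size, the 3 x 3 blocks, the all-ones test data, and matmul_add are all assumptions added for illustration.

    #include <mpi.h>
    #include <stdio.h>

    #define Q   2            // process grid is Q x Q (run with Q*Q = 4 processes)
    #define BLK 3            // each local block is BLK x BLK

    // c += a * b on local BLK x BLK blocks
    static void matmul_add(const int *a, const int *b, int *c) {
        for (int i = 0; i < BLK; i++)
            for (int k = 0; k < BLK; k++)
                for (int j = 0; j < BLK; j++)
                    c[i*BLK + j] += a[i*BLK + k] * b[k*BLK + j];
    }

    int main(int argc, char *argv[]) {
        int rank, coords[2], src, dst;
        int dims[2] = {Q, Q}, periods[2] = {1, 1};
        int a[BLK*BLK], b[BLK*BLK], c[BLK*BLK] = {0};

        MPI_Init(&argc, &argv);
        MPI_Comm grid;
        MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &grid);
        MPI_Comm_rank(grid, &rank);
        MPI_Cart_coords(grid, rank, 2, coords);

        for (int i = 0; i < BLK*BLK; i++) { a[i] = 1; b[i] = 1; }  // toy blocks

        // Initial alignment: block A[i][j] shifts left i positions (dim 1),
        // block B[i][j] shifts up j positions (dim 0).
        MPI_Cart_shift(grid, 1, -coords[0], &src, &dst);
        MPI_Sendrecv_replace(a, BLK*BLK, MPI_INT, dst, 0, src, 0, grid,
                             MPI_STATUS_IGNORE);
        MPI_Cart_shift(grid, 0, -coords[1], &src, &dst);
        MPI_Sendrecv_replace(b, BLK*BLK, MPI_INT, dst, 0, src, 0, grid,
                             MPI_STATUS_IGNORE);

        for (int step = 0; step < Q; step++) {
            matmul_add(a, b, c);                      // accumulate a partial sum
            MPI_Cart_shift(grid, 1, -1, &src, &dst);  // shift A left by 1
            MPI_Sendrecv_replace(a, BLK*BLK, MPI_INT, dst, 0, src, 0, grid,
                                 MPI_STATUS_IGNORE);
            MPI_Cart_shift(grid, 0, -1, &src, &dst);  // shift B up by 1
            MPI_Sendrecv_replace(b, BLK*BLK, MPI_INT, dst, 0, src, 0, grid,
                                 MPI_STATUS_IGNORE);
        }

        printf("P(%d,%d): c[0] = %d\n", coords[0], coords[1], c[0]);
        MPI_Finalize();
        return 0;
    }

With all-ones blocks, every C entry should come out equal to n = Q*BLK = 6, which is an easy sanity check.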

SLIDE 38

Consider Process P1,2

Step 1

[Figure: row 1 of A (A10 through A13) and column 2 of B (B02 through B32) as positioned at this step; arrows mark the next communication.]

SLIDE 39

Consider Process P1,2

Step 2

[Figure: the same row of A and column of B after one more shift; arrows mark the next communication.]

SLIDE 40

Consider Process P1,2

Step 3

[Figure: the same row of A and column of B after another shift; arrows mark the next communication.]

SLIDE 41

Consider Process P1,2

Step 4

[Figure: the row of A and column of B at the fourth and final step; arrows mark the next communication.]

SLIDE 42

Complexity Analysis

  • The algorithm has √p iterations
  • During each iteration a process multiplies two (n/√p) × (n/√p) matrices: Θ((n/√p)³) = Θ(n³/p^(3/2))
  • Overall computational complexity: √p · n³/p^(3/2) = Θ(n³/p)
  • During each of the √p iterations a process sends and receives two blocks of size (n/√p) × (n/√p)
  • Overall communication complexity: Θ(n²/√p)
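
Written out, the two bounds follow directly (a LaTeX restatement of the bullets above, added for clarity):

    T_{\mathrm{comp}} = \sqrt{p}\cdot\Theta\!\left(\left(\frac{n}{\sqrt{p}}\right)^{3}\right)
                      = \Theta\!\left(\frac{n^{3}}{p}\right),
    \qquad
    T_{\mathrm{comm}} = \sqrt{p}\cdot\Theta\!\left(\frac{n^{2}}{p}\right)
                      = \Theta\!\left(\frac{n^{2}}{\sqrt{p}}\right)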