1. Distributed Memory Programming

Wolfgang Schreiner
Research Institute for Symbolic Computation (RISC-Linz)
Johannes Kepler University, A-4040 Linz, Austria
Wolfgang.Schreiner@risc.uni-linz.ac.at
http://www.risc.uni-linz.ac.at/people/schreine

2. SIMD Mesh Matrix Multiplication

Single Instruction, Multiple Data (SIMD):
• n² processors,
• 3n time.

Algorithm: see slide.

3. SIMD Mesh Matrix Multiplication

1. Precondition (stagger) the array:
   • Shift row i by i − 1 elements west,
   • Shift column j by j − 1 elements north.
2. Multiply and add on processor ⟨i, j⟩:
   c = Σ_k a_ik ∗ b_kj
• Inverted dimensions:
  – Matrix: ↓ i, → j.
  – Processor array: ↓ iyproc, → ixproc.
• n shift and n arithmetic operations.
• n² processors.

Maspar program: see slide.
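
To make the data movement concrete, here is a minimal sequential C sketch that simulates the n × n mesh with ordinary arrays (0-based indices, so row i shifts by i rather than i − 1). The function names, the wrap-around shifts, and the fixed size N are illustrative assumptions, not from the slides:

    #define N 4

    /* Stagger: shift row i of A west by i, column j of B north by j. */
    static void stagger(int A[N][N], int B[N][N])
    {
        int T[N][N];
        for (int i = 0; i < N; i++)          /* rows of A move west */
            for (int j = 0; j < N; j++)
                T[i][j] = A[i][(j + i) % N];
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                A[i][j] = T[i][j];
        for (int i = 0; i < N; i++)          /* columns of B move north */
            for (int j = 0; j < N; j++)
                T[i][j] = B[(i + j) % N][j];
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                B[i][j] = T[i][j];
    }

    void mesh_multiply(int A[N][N], int B[N][N], int C[N][N])
    {
        stagger(A, B);
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                C[i][j] = 0;
        for (int step = 0; step < N; step++) {
            for (int i = 0; i < N; i++)      /* every "processor" does one */
                for (int j = 0; j < N; j++)  /* multiply-add per step      */
                    C[i][j] += A[i][j] * B[i][j];
            for (int i = 0; i < N; i++) {    /* A shifts one step west */
                int a0 = A[i][0];
                for (int j = 0; j < N - 1; j++) A[i][j] = A[i][j + 1];
                A[i][N - 1] = a0;
            }
            for (int j = 0; j < N; j++) {    /* B shifts one step north */
                int b0 = B[0][j];
                for (int i = 0; i < N - 1; i++) B[i][j] = B[i + 1][j];
                B[N - 1][j] = b0;
            }
        }
    }

After the staggering, processor ⟨i, j⟩ sees a_ik and b_kj for the same k in every step, so the n multiply-adds accumulate exactly C_ij.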

4. SIMD Cube Matrix Multiplication

Cube of n³ processors, indexed by ixproc, iyproc, izproc.
[Figure: processor cube with axes nxproc, nyproc, nzproc, compass directions N, S, W plus U, D, and faces labeled A, B, C.]

Idea:
• Map A(i, j) to all P(j, i, k),
• Map B(i, j) to all P(i, k, j).

5. SIMD Cube Matrix Multiplication

Multiplication and addition:
• Each processor P_ijk computes a single product: c_ijk = a_ik ∗ b_kj.
• Bars along the x-direction are added up in P_0ij: C_ij = Σ_k c_ijk.
[Figure: cube with A(i,k), B(k,j), and the result face C(i,j).]
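
The idea can be written down directly; below is a sequential C sketch in which a three-dimensional array stands in for the processor cube (the names and the size N are illustrative assumptions):

    #define N 4

    void cube_multiply(int A[N][N], int B[N][N], int C[N][N])
    {
        static int c[N][N][N];           /* c[i][j][k] ~ processor P_ijk */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                for (int k = 0; k < N; k++)
                    c[i][j][k] = A[i][k] * B[k][j];   /* one product each */
        for (int i = 0; i < N; i++)      /* add the bar along k */
            for (int j = 0; j < N; j++) {
                C[i][j] = 0;
                for (int k = 0; k < N; k++)
                    C[i][j] += c[i][j][k];
            }
    }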

6. SIMD Cube Matrix Multiplication

Maspar program:

    int A[N,N], B[N,N], C[N,N];
    plural int a, b, c;
    a = A[iyproc, ixproc];
    b = B[ixproc, izproc];
    c = a*b;
    for (i = 0; i < N-1; i++)
      if (ixproc > 0) c = xnetE[1].c;   /* pass partial values one step west */
      else c += xnetE[1].c;             /* accumulate at ixproc == 0 */
    if (ixproc == 0) C[iyproc, izproc] = c;

• O(n³) processors,
• O(n) time.

7. SIMD Cube Matrix Multiplication

Tree-like summation:

    plural int x, d;
    ...
    x = ixproc; d = 1;
    while (d < N) {
      if (x % 2 != 0) break;   /* this processor has sent; it drops out */
      c += xnetE[d].c;         /* fetch the partial sum from distance d east */
      x /= 2; d *= 2;
    }
    if (ixproc == 0) C[iyproc, izproc] = c;

• O(log n) time,
• O(n³) processors.

Long-distance communication required!
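
The same pattern on an ordinary array makes the O(log n) behavior visible; a minimal sequential C sketch, assuming n is a power of two (names are illustrative):

    #include <stdio.h>

    int tree_sum(int c[], int n)            /* n must be a power of two */
    {
        for (int d = 1; d < n; d *= 2)      /* d plays the xnetE[d] distance */
            for (int i = 0; i < n; i += 2 * d)
                c[i] += c[i + d];           /* "processor" i pulls from i+d */
        return c[0];                        /* the sum ends up at index 0 */
    }

    int main(void)
    {
        int c[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        printf("%d\n", tree_sum(c, 8));     /* prints 36 */
        return 0;
    }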

8. SIMD Hypercube Matrix Multiplication

[Figure: hypercubes of dimension d = 0, 1, 2, 3, 4 with binary processor labels, e.g. 000–111 for d = 3.]

• In a d-dimensional hypercube, the processors are indexed with d bits.
• If p1 and p2 differ in i bits, the shortest path between p1 and p2 has length i.
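
Since the shortest path length equals the Hamming distance of the two labels, it can be computed with a few bit operations; a small illustrative C helper:

    /* Shortest-path length between two hypercube nodes = Hamming distance. */
    int hypercube_distance(unsigned p1, unsigned p2)
    {
        unsigned x = p1 ^ p2;   /* bits in which the two labels differ */
        int dist = 0;
        while (x) {             /* count the set bits */
            dist += x & 1;
            x >>= 1;
        }
        return dist;
    }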

9. SIMD Hypercube Matrix Multiplication

Mapping of a cube of dimension n to a hypercube of dimension d:
• Hypercube of n³ = 2^d processors ⇒ d = 3s (for some s).
• 64 processors ⇒ n = 4, d = 6, s = 2.
[Figure: hypercube bits d5 d4 d3 d2 d1 d0 split into the cube coordinates x, y, z.]

• Embedding algorithm:
  – Write the cube indices in binary form (s bits each),
  – Concatenate the indices (3s = d bits).
• Neighbor processors in the cube remain neighbors in the hypercube.
• Any cube algorithm can be executed with the same efficiency on the hypercube.
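
A sketch of the embedding in C, assuming the bit layout of the figure (x in the high bits, z in the low bits; the function name is illustrative):

    /* Concatenate the s-bit cube coordinates into one d = 3s bit index. */
    unsigned cube_to_hypercube(unsigned x, unsigned y, unsigned z, int s)
    {
        return (x << (2 * s)) | (y << s) | z;
    }
    /* Example: s = 2, (x,y,z) = (1,2,3) -> 01 10 11 = 27. */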

10. SIMD Hypercube Matrix Multiplication

Tree summation in the hypercube (r = receive, s = send):

    Processor  000  001  010  011  100  101  110  111
    Step 1      r    s    r    s    r    s    r    s
    Step 2      r         s         r         s
    Step 3      r                   s

• Each processor receives values from neighboring processors only.
• Only short-distance communication is required.

The cube algorithm can be more efficient on the hypercube!
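
The schedule above, expressed as a sequential C sketch over 2^d values: in round r the partner differs in bit r, senders drop out, and node 0 ends up with the total (names are illustrative):

    int hypercube_sum(int v[], int d)           /* v has 2^d elements */
    {
        for (int bit = 1; bit < (1 << d); bit <<= 1)
            for (int p = 0; p < (1 << d); p++)
                if ((p & (2 * bit - 1)) == 0)   /* p still active: receive */
                    v[p] += v[p | bit];         /* from the 1-bit neighbor */
        return v[0];
    }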

11. Row/Column-Oriented Matrix Multiplication

[Figure: matrices A, B, and C.]

1. Load A_i on every processor P_i.
2. For all P_i do:
     for j = 0 to N-1
       Receive B_j from root
       C_ij = A_i * B_j
3. Collect the C_i.

Broadcasting of each B_j ⇒ step 2 takes O(N log N) time.
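
In message-passing terms, step 2 could look roughly like the following MPI sketch for p = N processes; the flat row-major layout, the block width n/p, and the function name are assumptions for illustration (on rank 0, B_j must hold column block j before each broadcast):

    #include <mpi.h>

    void row_col_multiply(const double *A_i,  /* my (n/p) x n row block   */
                          double *B_j,        /* n x (n/p) column buffer  */
                          double *C_i,        /* my (n/p) x n result rows */
                          int n, int p)
    {
        int nb = n / p;                        /* block width */
        for (int j = 0; j < p; j++) {
            /* root owns B and ships column block j to everyone */
            MPI_Bcast(B_j, n * nb, MPI_DOUBLE, 0, MPI_COMM_WORLD);
            for (int r = 0; r < nb; r++)       /* C_ij = A_i * B_j */
                for (int c = 0; c < nb; c++) {
                    double s = 0.0;
                    for (int k = 0; k < n; k++)
                        s += A_i[r * n + k] * B_j[k * nb + c];
                    C_i[r * n + j * nb + c] = s;
                }
        }
    }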

12. Ring Algorithm

See Quinn, Figure 7-15.

Change the order of the multiplications by using a ring of processors.

1. Load A_i and B_i on every processor P_i.
2. For all P_i do:
     p = (i+1) mod N
     j = i
     for k = 0 to N-1 do
       C_ij = A_i * B_j
       j = (j+1) mod N
       Receive B_j from P_p
3. Collect the C_i.

Point-to-point communication ⇒ step 2 takes O(N) time.
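
A hedged MPI sketch of step 2: each process keeps its row block A_i while the column blocks rotate around the ring via point-to-point messages (layout and names as in the previous sketch, again assumptions; B_j initially holds column block number rank):

    #include <mpi.h>

    void ring_multiply(const double *A_i, double *B_j, double *C_i,
                       int n, int p, int rank)
    {
        int nb = n / p;
        int next = (rank + 1) % p;         /* we receive from the successor */
        int prev = (rank + p - 1) % p;     /* and forward to the predecessor */
        int j = rank;                      /* index of the B block we hold */
        for (int k = 0; k < p; k++) {
            for (int r = 0; r < nb; r++)   /* C_ij = A_i * B_j */
                for (int c = 0; c < nb; c++) {
                    double s = 0.0;
                    for (int kk = 0; kk < n; kk++)
                        s += A_i[r * n + kk] * B_j[kk * nb + c];
                    C_i[r * n + j * nb + c] = s;
                }
            j = (j + 1) % p;               /* rotate: receive B_j from next */
            MPI_Sendrecv_replace(B_j, n * nb, MPI_DOUBLE,
                                 prev, 0, next, 0,
                                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
    }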

13. Hypercube Algorithm

Problem: How can the ring be embedded into the hypercube?

• Simple solution H(i) = i:
  – Ring processor i is mapped to hypercube processor H(i).
  – Massive non-neighbor communication!
• How to preserve neighbor-to-neighbor communication? (See Quinn, Figure 5-13.)
• Requirements for H(i):
  – H must be a 1-to-1 mapping.
  – H(i) and H(i+1) must differ in 1 bit.
  – H(0) and H(N−1) must differ in 1 bit.

Can we construct such a function H?

14. Ring Successor

Assume H is given.

• Given: hypercube processor number i.
• Wanted: the "ring successor" S(i):

    S(i) = 0                  if i = H(N−1),
    S(i) = H(H⁻¹(i) + 1)      otherwise.

(The wrap-around case maps the last ring position back to H(0), which is 0 for the Gray codes constructed below.)

The same technique embeds a 2-D mesh into a hypercube (see Quinn, Figure 5-14).

15. Gray Codes

Recursive construction.

• 1-bit Gray code G_1:

    i   G_1(i)
    0   0
    1   1

• n-bit Gray code G_n (reflect G_{n−1} and prefix the two halves with 0 and 1):

    i              G_n(i)
    0              0 G_{n−1}(0)
    1              0 G_{n−1}(1)
    ...            ...
    2^{n−1} − 1    0 G_{n−1}(2^{n−1} − 1)
    2^{n−1}        1 G_{n−1}(2^{n−1} − 1)
    ...            ...
    2^n − 2        1 G_{n−1}(1)
    2^n − 1        1 G_{n−1}(0)

• The required properties are preserved by the construction!

H(i) = G(i) = i xor ⌊i/2⌋.

16. Gray Code Computation

C functions.

• Gray code:

    int G(int i)
    {
      return(i ^ (i/2));
    }

• Inverse Gray code:

    int G_inv(int i)
    {
      int answer, mask;
      answer = i;
      mask = answer/2;
      while (mask > 0) {
        answer = answer ^ mask;
        mask = mask / 2;
      }
      return(answer);
    }
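
Combining these functions with the ring successor of slide 14 gives, for example (the combination is my own illustration; N must be a power of two):

    /* Ring successor of hypercube node i under H = G. */
    int ring_successor(int i, int N)
    {
        if (i == G(N - 1))        /* last ring position wraps to G(0) = 0 */
            return 0;
        return G(G_inv(i) + 1);   /* next ring position, mapped back */
    }
    /* For N = 8 the ring visits 0,1,3,2,6,7,5,4; each hop flips one bit. */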

17. Block-Oriented Algorithm

    A = ( A11  A12 )    B = ( B11  B12 )
        ( A21  A22 )        ( B21  B22 )

    C = ( C11  C12 ) = ( A11 B11 + A12 B21    A11 B12 + A12 B22 )
        ( C21  C22 )   ( A21 B11 + A22 B21    A21 B12 + A22 B22 )

• Use the block-oriented distribution introduced for shared-memory multiprocessors: block-matrix multiplication is analogous to scalar matrix multiplication.
• Use the staggering technique introduced for the 2-D SIMD mesh: rotation along rows and columns.
• Perform the SIMD matrix multiplication algorithm on whole submatrices: submatrices are multiplied and shifted.
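
The step that replaces the scalar multiply-add of the SIMD mesh algorithm is a whole submatrix multiply-accumulate; a minimal C sketch for nb × nb blocks in row-major layout (illustrative):

    /* C += A * B on one nb x nb block pair, accumulated across the shifts. */
    void block_multiply_add(int nb, const double *A, const double *B,
                            double *C)
    {
        for (int i = 0; i < nb; i++)
            for (int j = 0; j < nb; j++) {
                double s = 0.0;
                for (int k = 0; k < nb; k++)
                    s += A[i * nb + k] * B[k * nb + j];
                C[i * nb + j] += s;
            }
    }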

18. Analysis of Algorithm

n × n matrix, p processors.

• Row/column-oriented:
  – Computation per iteration: n²/p ∗ n/p = n³/p².
  – Communication per iteration: 2(λ + βn²/p).
  – p iterations.
• Block-oriented (staggering ignored):
  – Computation per iteration: n²/p ∗ n/√p = n³/p^(3/2).
  – Communication per iteration: 4(λ + βn²/p).
  – √p − 1 iterations.
• Comparison of the total communication costs:

    2p(λ + βn²/p) > 4(√p − 1)(λ + βn²/p)
    2λp + 2βn² > 4λ(√p − 1) + 4β(√p − 1)n²/p

  This holds because
  1. p > 2(√p − 1) (the λ terms), and
  2. 1 > 2(√p − 1)/p (the β terms),

  which are true for all p ≥ 1.

Also including staggering, for larger p the block-oriented algorithm performs better!
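
For concreteness, the two communication totals can be tabulated for sample machine parameters; λ (latency), β (per-element transfer time), and n below are made-up values, only the comparison matters:

    #include <stdio.h>
    #include <math.h>   /* link with -lm */

    int main(void)
    {
        double lambda = 100.0, beta = 1.0, n = 1024.0;
        for (int p = 4; p <= 256; p *= 4) {
            double per_msg = lambda + beta * n * n / p;
            double row_col = 2.0 * p * per_msg;            /* p iterations */
            double block   = 4.0 * (sqrt((double)p) - 1.0) * per_msg;
            printf("p=%3d  row/col=%12.0f  block=%12.0f\n",
                   p, row_col, block);
        }
        return 0;
    }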
