Matrix multiplication


  1. Matrix-vector multiplication

  • Standard serial algorithm:

        procedure MAT_VECT(A, x, y)
        begin
            for i := 0 to n-1 do
            begin
                y[i] := 0
                for j := 0 to n-1 do
                    y[i] := y[i] + A[i, j] * x[j]
            end
        end MAT_VECT

  • Complexity: W = n²
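A minimal C rendering of the same loop nest, for reference (the function name mat_vect and the flat row-major storage of A are illustrative choices, not from the slides):

        /* Serial matrix-vector product y = A*x for an n x n matrix A,
           stored row-major as a flat array. W = n² multiply-adds. */
        void mat_vect(const double *A, const double *x, double *y, int n) {
            for (int i = 0; i < n; i++) {
                y[i] = 0.0;
                for (int j = 0; j < n; j++)
                    y[i] += A[i * n + j] * x[j];
            }
        }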

  2. Matrix-Vector Multiplication (rowwise striping)

  • Sequential run time: W = n²
  • Simple case: p = n
      – all-to-all broadcast of vector elements: Θ(n)
      – single row multiplication: Θ(n)
      – total time: Θ(n)
      – processor-time product: Θ(n²) (cost-optimal)
  • General case: p < n (see the MPI sketch below)
      – all-to-all broadcast of vector elements:
          • t_s log p + t_w n (hypercube)
          • 2 t_s √p + t_w n (mesh)
      – row multiplication: Θ(n²/p)
      – total time:
          • n²/p + t_s log p + t_w n (hypercube)
          • n²/p + 2 t_s √p + t_w n (mesh)
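A minimal MPI sketch of the rowwise-striped general case, assuming p divides n; the all-to-all broadcast of vector elements is realized here with MPI_Allgather, and all names are illustrative:

        #include <mpi.h>

        /* Each of the p processes owns n/p consecutive rows of A (A_local)
           and the matching n/p entries of x (x_local). The all-to-all
           broadcast gives every process the full vector x, after which each
           computes its n/p entries of y locally: Θ(n²/p) work per process. */
        void striped_mat_vect(const double *A_local, const double *x_local,
                              double *y_local, double *x_full, int n, int p) {
            int rows = n / p;                    /* assumes p divides n */
            MPI_Allgather(x_local, rows, MPI_DOUBLE,
                          x_full, rows, MPI_DOUBLE, MPI_COMM_WORLD);
            for (int i = 0; i < rows; i++) {
                y_local[i] = 0.0;
                for (int j = 0; j < n; j++)
                    y_local[i] += A_local[i * n + j] * x_full[j];
            }
        }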

  3. Matrix-Vector Multiplication (checkerboard partitioning)

  • Simple case: p = n²
      – one-to-one communication + one-to-all broadcast + single-node
        accumulation per row + multiplication =
          = Θ(n) + Θ(n) + Θ(n) + Θ(1) = Θ(n) (mesh)
          = Θ(log n) + Θ(log n) + … = Θ(log n) (hypercube)
      – processor-time product:
          • Θ(n³) (mesh)
          • Θ(n² log n) (hypercube)
  • General case: p < n²
      – each processor stores an (n/√p × n/√p) block
      – run time on a mesh with cut-through routing:
        Θ(n/√p) + Θ((n/√p) log p) + Θ((n/√p) log p) + Θ(n²/p)
      – cost-optimal as long as p = O(n²/log² n)

  4. Matrix multiplication

  • Standard serial algorithm:

        procedure MAT_MULT(A, B, C)
        begin
            for i := 0 to n-1 do
                for j := 0 to n-1 do
                begin
                    C[i, j] := 0
                    for k := 0 to n-1 do
                        C[i, j] := C[i, j] + A[i, k] * B[k, j]
                end
        end MAT_MULT

  • Complexity: W = n³
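The same algorithm in C, for reference (row-major flat arrays; the name mat_mult is illustrative):

        /* Serial matrix product C = A*B for n x n row-major matrices.
           Three nested loops of n iterations each: W = n³ multiply-adds. */
        void mat_mult(const double *A, const double *B, double *C, int n) {
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++) {
                    C[i * n + j] = 0.0;
                    for (int k = 0; k < n; k++)
                        C[i * n + j] += A[i * n + k] * B[k * n + j];
                }
        }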

  5. Matrix multiplication: a block matrix implementation

  • A parallelizable algorithm is derived from the serial one by replacing
    scalar operations with block matrix operations:

        procedure BLOCK_MAT_MULT(A, B, C)
        begin
            for i := 0 to q-1 do
                for j := 0 to q-1 do
                begin
                    Initialize all elements of C_{i,j} to zero
                    for k := 0 to q-1 do
                        C_{i,j} := C_{i,j} + A_{i,k} * B_{k,j}
                end
        end BLOCK_MAT_MULT

  • Number of scalar additions/multiplications: n³
      – q³ block matrix multiplications, each involving n/q × n/q blocks and
        requiring (n/q)³ multiplications and additions
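A serial C sketch of the blocked scheme, assuming q divides n (block size b = n/q; the outer three loops mirror BLOCK_MAT_MULT, the inner three perform the b × b block multiply-accumulate):

        /* Blocked matrix product C = A*B: the scalar operations of MAT_MULT
           become multiply-accumulate operations on b x b blocks, b = n/q.
           Total scalar work is unchanged: q³ · (n/q)³ = n³. */
        void block_mat_mult(const double *A, const double *B, double *C,
                            int n, int q) {
            int b = n / q;                        /* assumes q divides n */
            for (int i = 0; i < n * n; i++) C[i] = 0.0;
            for (int bi = 0; bi < q; bi++)
                for (int bj = 0; bj < q; bj++)
                    for (int bk = 0; bk < q; bk++)
                        /* C_{bi,bj} += A_{bi,bk} * B_{bk,bj} */
                        for (int i = bi * b; i < (bi + 1) * b; i++)
                            for (int k = bk * b; k < (bk + 1) * b; k++)
                                for (int j = bj * b; j < (bj + 1) * b; j++)
                                    C[i * n + j] += A[i * n + k] * B[k * n + j];
        }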

  6. Matrix multiplication: simple parallel implementation

  • The block matrix algorithm lends itself to implementation on a √p × √p
    mesh, with one block per processor:
      – processor P_{i,j} initially holds A_{i,j} and B_{i,j}, and computes
        block C_{i,j}
      – P_{i,j} needs all blocks A_{i,k} and B_{k,j} for 0 ≤ k < √p, thus an
        all-to-all broadcast is performed on each row, and one on each column
      – computation requires √p multiplications of n/√p × n/√p matrices
      – assuming a hypercube topology:
          • T_p = n³/p + t_s log p + 2 t_w n²/√p
      – cost: p·T_p = n³ + t_s p log p + 2 t_w n² √p
          • cost-optimal for p = O(n²)
  • A drawback of this algorithm is its excessive memory requirement: every
    processor holds 2√p blocks of size n²/p

  7. Matrix Multiplication: efficient algorithms

  • The simple parallel algorithm we saw last time is not memory efficient
      – every processor holds 2√p blocks of size n²/p, so the total memory
        required is Θ(n²√p), which is √p times that of the sequential algorithm
  • More memory-efficient algorithms exist (a serial simulation of Cannon's
    rotation pattern follows this slide)
      – Cannon's algorithm
          • rowwise (columnwise) rotation of submatrices instead of an
            all-to-all broadcast
          • interleaving of communication and submatrix multiplication
      – Fox's algorithm
          • rowwise broadcasts of each A_{i,j} in turn, upward shifts of B_{i,j}
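A minimal serial simulation of Cannon's rotation pattern, assuming a q × q grid of b × b blocks (b = n/q, q dividing n). It relies on the standard property that after the initial alignment plus s single-step rotations, grid position (i, j) holds A block (i, k) and B block (k, j) with k = (i+j+s) mod q; the code indexes blocks directly instead of moving data, so it is a sketch of the schedule, not a distributed implementation:

        /* Serial simulation of Cannon's algorithm on a q x q block grid.
           At rotation step s, position (bi,bj) multiplies A block (bi,bk)
           by B block (bk,bj) with bk = (bi+bj+s) mod q; over s = 0..q-1
           every bk in 0..q-1 is visited exactly once, so C = A*B. */
        void cannon_sim(const double *A, const double *B, double *C,
                        int n, int q) {
            int b = n / q;                        /* assumes q divides n */
            for (int i = 0; i < n * n; i++) C[i] = 0.0;
            for (int s = 0; s < q; s++)           /* q rotation steps */
                for (int bi = 0; bi < q; bi++)
                    for (int bj = 0; bj < q; bj++) {
                        int bk = (bi + bj + s) % q;   /* block after alignment + s shifts */
                        for (int i = bi * b; i < (bi + 1) * b; i++)
                            for (int k = bk * b; k < (bk + 1) * b; k++)
                                for (int j = bj * b; j < (bj + 1) * b; j++)
                                    C[i * n + j] += A[i * n + k] * B[k * n + j];
                    }
        }

In the distributed version, each of the q steps overlaps the local block multiplication with a single leftward circular shift of the A blocks and an upward circular shift of the B blocks, so no processor ever holds more than one block of each matrix.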

  8. Systems of linear equations

        a_{0,0} x_0   + a_{0,1} x_1   + … + a_{0,n-1} x_{n-1}   = b_0
        a_{1,0} x_0   + a_{1,1} x_1   + … + a_{1,n-1} x_{n-1}   = b_1
        …
        a_{n-1,0} x_0 + a_{n-1,1} x_1 + … + a_{n-1,n-1} x_{n-1} = b_{n-1}

    is transformed by Gaussian elimination into

        x_0 + u_{0,1} x_1 + … + u_{0,n-1} x_{n-1} = y_0
              x_1 + …     +     u_{1,n-1} x_{n-1} = y_1
        …
                                          x_{n-1} = y_{n-1}

  • Gaussian elimination is the classical serial algorithm for solving
    systems of linear equations:
      – First step: Ax = b is transformed into Ux = y, where U is upper
        triangular and U[i,i] = 1
      – Second step: the system is solved for the x variables from x[n-1]
        down to x[0] by back-substitution
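A compact C sketch of both steps, without pivoting (it assumes every pivot A[k][k] encountered is nonzero; U and y overwrite A and b):

        /* Gaussian elimination without pivoting, then back-substitution.
           Step 1 reduces Ax = b to Ux = y with unit diagonal (U and y
           overwrite A and b); step 2 recovers x[n-1] down to x[0].
           Dominant work: roughly 2n³/3 scalar operations. */
        void gauss_solve(double *A, double *b, double *x, int n) {
            for (int k = 0; k < n; k++) {         /* elimination step k */
                double pivot = A[k * n + k];      /* assumed nonzero */
                for (int j = k; j < n; j++)       /* division step: row k /= pivot */
                    A[k * n + j] /= pivot;
                b[k] /= pivot;
                for (int i = k + 1; i < n; i++) { /* zero column k below row k */
                    double f = A[i * n + k];
                    for (int j = k; j < n; j++)
                        A[i * n + j] -= f * A[k * n + j];
                    b[i] -= f * b[k];
                }
            }
            for (int i = n - 1; i >= 0; i--) {    /* back-substitution */
                x[i] = b[i];
                for (int j = i + 1; j < n; j++)
                    x[i] -= A[i * n + j] * x[j];
            }
        }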

  9. Steps of Gaussian elimination

  10. Parallel Gaussian elimination: striped partitioning

  • Simple case: one row per processor
      – k-th computation step: 3(n-k-1)
      – total computation: 3n(n-1)/2
      – k-th communication step: (t_s + t_w(n-k-1)) log n (on hypercube)
      – total communication: (t_s n + t_w n(n-1)/2) log n
      – cost: Θ(n³ log n)
          • not cost-optimal (W = 2n³/3)
  • A possible optimization: pipelining of the n iterations
      – row k+1 can start its computation as soon as it receives data from row k
      – this optimized version of the algorithm is cost-optimal

  11. Parallel Gaussian elimination: striped partitioning

  • General case: p < n
      – block-striped partitioning
          • k-th computation step: 3(n-k-1)n/p (on the processor with
            maximum load)
          • k-th communication step: (t_s + t_w(n-k-1)) log n (on hypercube)
          • => computation time dominates
          • total time: Θ(n³/p)
          • cost: Θ(n³)
          • not very efficient, because some processors are idle
      – cyclic-striped partitioning
          • better load balancing => less idle time

  12. Parallel Gaussian elimination: blocked-checkerboard partitioning

  • Simple case: one element per processor
      – two communication steps required:
          • rowwise broadcast
          • columnwise broadcast
      – not cost-optimal
  • Usual optimization: pipelining of communication and computation
      – two communication steps => two opportunities for pipelining
      – this optimized version of the algorithm is cost-optimal

  13. Parallel Gaussian elimination: cyclic-checkerboard partitioning

  • As with striped partitioning, a further optimization is obtained by using
    cyclic instead of block partitioning, thanks to the improved load
    balancing among processors

  14. Example of discretized physical domain

  • Physical system: temperature distribution on a sheet of metal held at
    temperatures U_0 and U_1 on two sides
      – the system is discretized by superimposing a grid
      – the temperature at each grid point is computed from the values at
        its neighbors
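As an illustration of the neighbor-based update, here is a minimal Jacobi-style sweep over the interior of the grid; the five-point averaging stencil is one common discretization choice, not something specified on the slide:

        /* One Jacobi sweep on an m x m grid (row-major): each interior
           point becomes the average of its four neighbors, while boundary
           rows/columns keep the fixed temperatures (e.g. U_0 and U_1 on
           two sides). Repeating the sweep until 'cur' and 'next' agree
           approximates the steady-state temperature distribution. */
        void jacobi_sweep(const double *cur, double *next, int m) {
            for (int i = 1; i < m - 1; i++)
                for (int j = 1; j < m - 1; j++)
                    next[i * m + j] = 0.25 * (cur[(i - 1) * m + j]
                                            + cur[(i + 1) * m + j]
                                            + cur[i * m + j - 1]
                                            + cur[i * m + j + 1]);
        }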
