Numerical Algorithms

Slide notes from Parallel Programming: Techniques and Applications Using Networked Workstations and Parallel Computers, Barry Wilkinson and Michael Allen, Prentice Hall, 1999.



  1. Numerical Algorithms: Matrices, a Review. An n × m matrix has n rows and m columns, with element a_{i,j} in row i and column j; its elements run from a_{0,0} in the top left corner to a_{n−1,m−1} in the bottom right corner. [Figure 10.1: An n × m matrix.]

  2. Matrix Addition. Matrix addition simply involves adding corresponding elements of each matrix to form the result matrix. Given the elements of A as a_{i,j} and the elements of B as b_{i,j}, each element of C is computed as
c_{i,j} = a_{i,j} + b_{i,j}   (0 ≤ i < n, 0 ≤ j < m)
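A minimal C sketch of element-wise matrix addition (not from the slides; the fixed dimensions and the function name matrix_add are illustrative):

    #define ROWS 3
    #define COLS 4

    /* c = a + b, element by element */
    void matrix_add(double a[ROWS][COLS], double b[ROWS][COLS], double c[ROWS][COLS])
    {
        for (int i = 0; i < ROWS; i++)
            for (int j = 0; j < COLS; j++)
                c[i][j] = a[i][j] + b[i][j];
    }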

  3. Matrix Multiplication. Multiplication of two matrices, A and B, produces the matrix C whose elements c_{i,j} (0 ≤ i < n, 0 ≤ j < m) are computed as
c_{i,j} = Σ_{k=0}^{l−1} a_{i,k} · b_{k,j}
where A is an n × l matrix and B is an l × m matrix. [Figure 10.2: Matrix multiplication, C = A × B; element c_{i,j} is formed by multiplying row i of A by column j of B and summing the results.]
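As a small worked example (not from the slides), for two 2 × 2 matrices:

    | 1 2 |   | 5 6 |   | 1·5 + 2·7   1·6 + 2·8 |   | 19 22 |
    | 3 4 | × | 7 8 | = | 3·5 + 4·7   3·6 + 4·8 | = | 43 50 |

Each result element is the sum of products of one row of A with one column of B.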

  4. Matrix-Vector Multiplication. A vector is a matrix with one column, i.e., an n × 1 matrix. Matrix-vector multiplication follows directly from the definition of matrix-matrix multiplication by making B an n × 1 matrix. The result is an n × 1 matrix (vector). [Figure 10.3: Matrix-vector multiplication, c = A × b; element c_i is the row sum formed from row i of A and the vector b.]
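A minimal C sketch of matrix-vector multiplication (not from the slides; the square size N and the function name matvec are illustrative):

    #define N 4

    /* c = A * b, where b and c are N x 1 vectors */
    void matvec(double a[N][N], double b[N], double c[N])
    {
        for (int i = 0; i < N; i++) {
            c[i] = 0.0;                      /* row sum c_i */
            for (int j = 0; j < N; j++)
                c[i] += a[i][j] * b[j];
        }
    }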

  5. Relationship of Matrices to Linear Equations. A system of linear equations can be written in matrix form:
Ax = b
The matrix A holds the constant coefficients, x is the vector of unknowns, and b is the vector of constants on the right-hand side.
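As a small illustration (not from the slides), the pair of equations

    2x_0 + 3x_1 = 8
     x_0 −  x_1 = 1

is written in matrix form as

    | 2  3 | | x_0 |   | 8 |
    | 1 −1 | | x_1 | = | 1 |

so that solving Ax = b recovers the unknowns x_0 and x_1.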

  6. Implementing Matrix Multiplication: Sequential Code. Assume throughout that the matrices are square (n × n matrices). The sequential code to compute A × B could simply be

    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++) {
            c[i][j] = 0;
            for (k = 0; k < n; k++)
                c[i][j] = c[i][j] + a[i][k] * b[k][j];
        }

This algorithm requires n³ multiplications and n³ additions, leading to a sequential time complexity of O(n³).

  7. Parallel Code. Usually based upon the direct sequential matrix multiplication algorithm; the sequential code has independent loops. With n processors (and n × n matrices), we can expect a parallel time complexity of O(n²), and this is easily obtainable (see the sketch below). It is also quite easy to obtain a time complexity of O(n) with n² processors, where one element of A and one element of B are assigned to each processor. These implementations are cost optimal [since O(n³) = n × O(n²) = n² × O(n)]. It is possible to obtain O(log n) with n³ processors by parallelizing the inner loop, but this is not cost optimal [since O(n³) ≠ n³ × O(log n)]. O(log n) is the lower bound for parallel matrix multiplication.
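A minimal sketch (not from the slides) of the work done by one of the n processors when row i of C is assigned to processor i; each processor performs n² multiply-add steps, which is where the O(n²) parallel time with n processors comes from. The function name and fixed size are illustrative:

    #define N 4

    /* work of processor i: compute row i of C = A x B (N x N matrices) */
    void compute_row(int i, double a[N][N], double b[N][N], double c[N][N])
    {
        for (int j = 0; j < N; j++) {
            c[i][j] = 0.0;
            for (int k = 0; k < N; k++)
                c[i][j] += a[i][k] * b[k][j];
        }
    }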

  8. Partitioning into Submatrices. Usually, we want to use far fewer than n processors with n × n matrices because of the size of n.
Submatrices: Each matrix can be divided into blocks of elements called submatrices. These submatrices can be manipulated as if they were single matrix elements. Suppose each matrix is divided into s² submatrices, each with n/s × n/s elements. Using the notation A_{p,q} for the submatrix in submatrix row p and submatrix column q:

    for (p = 0; p < s; p++)
        for (q = 0; q < s; q++) {
            Cp,q = 0;                      /* clear elements of submatrix */
            for (r = 0; r < s; r++)        /* submatrix multiplication and */
                Cp,q = Cp,q + Ap,r * Br,q; /* add to accumulating submatrix */
        }

The line Cp,q = Cp,q + Ap,r * Br,q; means multiply submatrices A_{p,r} and B_{r,q} using matrix multiplication and add the result to submatrix C_{p,q} using matrix addition. The arrangement is known as block matrix multiplication. [Figure 10.4: Block matrix multiplication.]
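A minimal sequential C sketch of block matrix multiplication (not from the slides; the sizes N and S and the function name are illustrative). Each submatrix C_{p,q} is cleared and then accumulates the s products A_{p,r} × B_{r,q}:

    #define N 8                 /* matrix dimension */
    #define S 4                 /* submatrices per row/column; blocks are (N/S) x (N/S) */

    void block_matmul(double a[N][N], double b[N][N], double c[N][N])
    {
        int blk = N / S;        /* submatrix size n/s */

        for (int p = 0; p < S; p++)
            for (int q = 0; q < S; q++) {
                /* clear elements of submatrix C[p][q] */
                for (int i = 0; i < blk; i++)
                    for (int j = 0; j < blk; j++)
                        c[p*blk + i][q*blk + j] = 0.0;

                /* C[p][q] = C[p][q] + A[p][r] * B[r][q] for each submatrix r */
                for (int r = 0; r < S; r++)
                    for (int i = 0; i < blk; i++)
                        for (int j = 0; j < blk; j++)
                            for (int k = 0; k < blk; k++)
                                c[p*blk + i][q*blk + j] +=
                                    a[p*blk + i][r*blk + k] * b[r*blk + k][q*blk + j];
            }
    }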

  9. Submatrix Multiplication Example. A 4 × 4 matrix multiplication C = A × B can be performed with 2 × 2 submatrices. For the top-left result submatrix:
C_{0,0} = A_{0,0} × B_{0,0} + A_{0,1} × B_{1,0}
so, for example, its element c_{0,0} = a_{0,0}b_{0,0} + a_{0,1}b_{1,0} + a_{0,2}b_{2,0} + a_{0,3}b_{3,0}, exactly the value produced by direct element-by-element multiplication. [Figure 10.5: Submatrix multiplication. (a) The 4 × 4 matrices A and B. (b) Multiplying A_{0,0} × B_{0,0} and adding A_{0,1} × B_{1,0} to obtain C_{0,0}.]

  10. Direct Implementation. Allocate one processor to compute each element of C; then n² processors would be needed. Each processor needs one row of elements of A and one column of elements of B, so some of the same elements must be sent to more than one processor. Using submatrices, one processor would compute one m × m submatrix of C instead. [Figure 10.6: Direct implementation of matrix multiplication; processor P_{i,j} receives row a[i][] and column b[][j] and computes c[i][j].]
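A hedged MPI sketch (not in the slides, which only describe the message pattern) of the broadcast variant: A and B are broadcast to all processes, each of the n² processes computes one element of C, and the elements are gathered by the master. The size N, the process-count check, and the row-major element assignment are assumptions for illustration:

    #include <mpi.h>
    #include <stdio.h>

    #define N 4                           /* matrix dimension; run with N*N processes */

    int main(int argc, char *argv[])
    {
        double a[N][N], b[N][N], c[N][N], elem;
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        if (size != N * N) {
            if (rank == 0) fprintf(stderr, "run with %d processes\n", N * N);
            MPI_Abort(MPI_COMM_WORLD, 1);
        }

        if (rank == 0)                    /* master initializes A and B */
            for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++) {
                    a[i][j] = i + j;
                    b[i][j] = (i == j) ? 1.0 : 0.0;   /* identity, for checking */
                }

        /* broadcast both matrices to every process */
        MPI_Bcast(a, N * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);
        MPI_Bcast(b, N * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        /* process 'rank' computes c[i][j] with i = rank / N, j = rank % N */
        int i = rank / N, j = rank % N;
        elem = 0.0;
        for (int k = 0; k < N; k++)
            elem += a[i][k] * b[k][j];

        /* gather the N*N elements back to the master in row-major order */
        MPI_Gather(&elem, 1, MPI_DOUBLE, c, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("c[0][0] = %f\n", c[0][0]);

        MPI_Finalize();
        return 0;
    }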

  11. Analysis. Assume n × n matrices and not submatrices.
Communication: With separate messages to each of the n² slave processors (each receives one row of A and one column of B, and returns one element of C):
t_comm = n²(t_startup + 2n·t_data) + n²(t_startup + t_data) = n²(2·t_startup + (2n + 1)·t_data)
A broadcast along a single bus would yield
t_comm = (t_startup + n²·t_data) + n²(t_startup + t_data)
The dominant time is now in returning the results, as t_startup is usually significantly larger than t_data. A gather routine should reduce this time.
Computation: Each slave performs n multiplications and n additions in parallel; i.e.,
t_comp = 2n
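As a quick numerical illustration (not from the slides), take n = 4, so there are n² = 16 slaves. Sending separate messages gives

    t_comm = 16(2·t_startup + 9·t_data) = 32·t_startup + 144·t_data

while the single-bus broadcast gives

    t_comm = (t_startup + 16·t_data) + 16(t_startup + t_data) = 17·t_startup + 32·t_data

In both cases most of the startup cost comes from the n² messages returning results, which is why a gather operation helps.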

  12. Performance Improvement. By using a tree construction, n numbers can be added in log n steps using n processors. [Figure 10.7: Accumulation of c_{0,0} using a tree construction; the products a_{0,0}b_{0,0}, a_{0,1}b_{1,0}, a_{0,2}b_{2,0}, and a_{0,3}b_{3,0} are formed in parallel on P_0, P_1, P_2, and P_3 and then summed pairwise.] This gives a computational time complexity of O(log n) using n³ processors instead of O(n) using n² processors.
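A minimal sequential sketch (not from the slides) of the pairwise-reduction pattern the tree uses; each pass of the outer loop corresponds to one parallel step, so the n partial products are combined in about log2(n) steps:

    #include <stdio.h>

    #define N 4                                    /* number of partial products */

    int main(void)
    {
        /* partial products a[0][k] * b[k][0], with made-up values */
        double partial[N] = {1.0, 2.0, 3.0, 4.0};

        /* tree accumulation: in each step, element i absorbs element i + stride */
        for (int stride = 1; stride < N; stride *= 2)       /* ~log2(N) steps */
            for (int i = 0; i + stride < N; i += 2 * stride)
                partial[i] += partial[i + stride];

        printf("c[0][0] = %f\n", partial[0]);      /* sum of all partial products */
        return 0;
    }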

  13. Submatrices. In every method, we can substitute submatrices for matrix elements to reduce the number of processors. Let us select m × m submatrices and s = n/m. Then there are s² submatrices in each matrix and s² processors.
Communication: Each of the s² slave processors must separately receive one row and one column of submatrices (nm elements each) and must separately return a C submatrix (m² elements) to the master processor, giving a communication time of
t_comm = s²{2(t_startup + nm·t_data) + (t_startup + m²·t_data)} = (n/m)²{3·t_startup + (m² + 2nm)·t_data}
Alternatively, complete matrices could be broadcast to every processor. As the size of the submatrices, m, is increased, the data transmission time of each message increases but the number of messages decreases.
Computation: One sequential submatrix multiplication requires m³ multiplications and m³ additions; a submatrix addition requires m² additions. Hence
t_comp = s(2m³ + m²) = O(sm³) = O(nm²)
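As a small numeric check (not from the slides), take n = 8 and m = 2, so s = n/m = 4:

    t_comm = (8/2)²{3·t_startup + (2² + 2·8·2)·t_data} = 16{3·t_startup + 36·t_data} = 48·t_startup + 576·t_data
    t_comp = 4(2·2³ + 2²) = 4(16 + 4) = 80

i.e., 80 arithmetic operations per slave, in line with O(nm²) = O(8·2²).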
