

  1. Parallel Numerical Algorithms
     Chapter 3 – Dense Linear Systems
     Section 3.1 – Vector and Matrix Products
     Michael T. Heath and Edgar Solomonik
     Department of Computer Science, University of Illinois at Urbana-Champaign
     CS 554 / CSE 512

  2. Outline
     1. BLAS
     2. Inner Product
     3. Outer Product
     4. Matrix-Vector Product
     5. Matrix-Matrix Product

  3. Basic Linear Algebra Subprograms
     - Basic Linear Algebra Subprograms (BLAS) are building blocks for many other matrix computations
     - BLAS encapsulate basic operations on vectors and matrices so they can be optimized for a particular computer architecture while the high-level routines that call them remain portable
     - BLAS offer good opportunities for optimizing utilization of the memory hierarchy
     - Generic BLAS are available from netlib, and many computer vendors provide custom versions optimized for their particular systems

  4. Examples of BLAS

     Level   Work      Examples                     Function
     1       O(n)      Scalar × vector + vector     daxpy
                       Inner product                ddot
                       Euclidean vector norm        dnrm2
     2       O(n^2)    Matrix-vector product        dgemv
                       Triangular solve             dtrsv
                       Outer product                dger
     3       O(n^3)    Matrix-matrix product        dgemm
                       Multiple triangular solves   dtrsm
                       Symmetric rank-k update      dsyrk

     Effective time per flop decreases with increasing level: γ_1 > γ_2 ≫ γ_3,
     where γ_k denotes the effective sec/flop achieved by BLAS level k.
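     The following is a minimal sketch (not from the slides) calling one routine from each BLAS level through the CBLAS interface; it assumes a CBLAS implementation (e.g. Netlib CBLAS or OpenBLAS) is linked, and the function name blas_levels_demo and its arguments are purely illustrative.

         #include <cblas.h>

         /* Call one representative routine from each BLAS level.
            x, y are n-vectors; A, B, C are n x n row-major matrices. */
         void blas_levels_demo(int n, double *x, double *y,
                               double *A, double *B, double *C) {
             /* Level 1, O(n) work: y = 2.0*x + y, then an inner product */
             cblas_daxpy(n, 2.0, x, 1, y, 1);
             double dot = cblas_ddot(n, x, 1, y, 1);
             (void)dot;
             /* Level 2, O(n^2) work: y = A*x */
             cblas_dgemv(CblasRowMajor, CblasNoTrans, n, n,
                         1.0, A, n, x, 1, 0.0, y, 1);
             /* Level 3, O(n^3) work: C = A*B */
             cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                         n, n, n, 1.0, A, n, B, n, 0.0, C, n);
         }

     The Level-3 call performs Θ(n^3) flops on Θ(n^2) data, which is why BLAS 3 can exploit the memory hierarchy best and achieves the lowest effective time per flop.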

  5. Inner Product
     - Inner product of two n-vectors x and y is given by
         x^T y = Σ_{i=1}^{n} x_i y_i
     - Computation of inner product requires n multiplications and n − 1 additions, so
         M_1 = Θ(n),  Q_1 = Θ(n),  T_1 = Θ(γ n)
     - Effectively as hard as a scalar reduction; can be done via binary or binomial tree summation

  6. Parallel Algorithm
     Partition
     - For i = 1, ..., n, fine-grain task i stores x_i and y_i, and computes their product x_i y_i
     Communicate
     - Sum reduction over the n fine-grain tasks
     [diagram: row of nine fine-grain tasks holding x_1 y_1 through x_9 y_9]

  7. Fine-Grain Parallel Algorithm

         z_i = x_i y_i                               { local scalar product }
         reduce z_i across all tasks i = 1, ..., n   { sum reduction }
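     The binomial tree summation mentioned on slide 5 can be written out explicitly. Below is a minimal sketch (not from the slides) of a binomial-tree sum reduction in MPI with one value per rank, assuming the number of ranks p is a power of two; the helper name tree_sum is hypothetical. It takes log2(p) communication steps, matching the Θ(α log p) latency cost discussed on slide 10.

         #include <mpi.h>

         /* Binomial-tree sum of one double per rank; rank 0 returns the total.
            Assumes the communicator size is a power of two. */
         double tree_sum(double z, MPI_Comm comm) {
             int rank, p;
             MPI_Comm_rank(comm, &rank);
             MPI_Comm_size(comm, &p);
             for (int d = 1; d < p; d <<= 1) {
                 if (rank & d) {            /* sender: pass partial sum down */
                     MPI_Send(&z, 1, MPI_DOUBLE, rank - d, 0, comm);
                     break;                 /* this rank is done */
                 } else if (rank + d < p) { /* receiver: absorb partner's sum */
                     double partial;
                     MPI_Recv(&partial, 1, MPI_DOUBLE, rank + d, 0, comm,
                              MPI_STATUS_IGNORE);
                     z += partial;
                 }
             }
             return z;                      /* meaningful only on rank 0 */
         }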

  8. Agglomeration and Mapping
     Agglomerate
     - Combine k components of both x and y to form each coarse-grain task, which computes the inner product of these subvectors
     - Communication becomes sum reduction over n/k coarse-grain tasks
     Map
     - Assign (n/k)/p coarse-grain tasks to each of p processors, for a total of n/p components of x and y per processor
     [diagram: three coarse-grain tasks holding x_1y_1 + x_2y_2 + x_3y_3, x_4y_4 + x_5y_5 + x_6y_6, and x_7y_7 + x_8y_8 + x_9y_9]

  9. Coarse-Grain Parallel Algorithm

         z_i = x_[i]^T y_[i]                              { local inner product }
         reduce z_i across all processors i = 1, ..., p   { sum reduction }

     where x_[i] denotes the subvector of x assigned to processor i
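     A minimal MPI sketch (not from the slides) of this coarse-grain algorithm, assuming x and y are already block-distributed with n_loc = n/p components per processor; the names x_loc, y_loc, and parallel_inner_product are illustrative. MPI_Allreduce performs the sum reduction, typically using a tree internally.

         #include <mpi.h>

         /* Each processor holds subvectors x_loc, y_loc of length n_loc. */
         double parallel_inner_product(const double *x_loc, const double *y_loc,
                                       int n_loc, MPI_Comm comm) {
             double z = 0.0, sum;
             for (int i = 0; i < n_loc; i++)  /* local inner product x_[i]^T y_[i] */
                 z += x_loc[i] * y_loc[i];
             /* sum reduction across all p processors */
             MPI_Allreduce(&z, &sum, 1, MPI_DOUBLE, MPI_SUM, comm);
             return sum;                      /* result on every processor */
         }

     In practice the local loop would typically be a call to the Level-1 BLAS routine ddot.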

  10. Performance
      The parallel costs (L_p, W_p, F_p) for the inner product:
      - Computational cost F_p = Θ(n/p), regardless of network
      - Latency and bandwidth costs depend on the network:
          1-D mesh:   L_p, W_p = Θ(p)
          2-D mesh:   L_p, W_p = Θ(√p)
          hypercube:  L_p, W_p = Θ(log p)
      - For a hypercube or fully connected network, the total time is
          T_p = α L_p + β W_p + γ F_p = Θ(α log p + γ n/p)
      - Efficiency and scaling are the same as for binary tree sum

  11. Inner Product on 1-D Mesh
      - For 1-D mesh, total time is
          T_p = Θ(γ n/p + α p)
      - To determine strong scalability, we set efficiency constant and solve for p_s:
          const = E_{p_s}(n) = T_1 / (p_s T_{p_s}) = Θ( γn / (γn + α p_s^2) ) = Θ( 1 / (1 + (α/γ) p_s^2 / n) )
        which yields p_s = Θ( √((γ/α) n) )
      - 1-D mesh is weakly scalable to p_w = Θ((γ/α) n) processors:
          E_{p_w}(p_w n) = Θ( 1 / (1 + (α/γ) p_w^2 / (p_w n)) ) = Θ( 1 / (1 + (α/γ) p_w / n) )

  12. Inner Product on 2-D Mesh
      - For 2-D mesh, total time is
          T_p = Θ(γ n/p + α √p)
      - To determine strong scalability, we set efficiency constant and solve for p_s:
          const = E_{p_s}(n) = T_1 / (p_s T_{p_s}) = Θ( γn / (γn + α p_s^{3/2}) ) = Θ( 1 / (1 + (α/γ) p_s^{3/2} / n) )
        which yields p_s = Θ((γ/α)^{2/3} n^{2/3})
      - 2-D mesh is weakly scalable to p_w = Θ((γ/α)^2 n^2) processors, since
          E_{p_w}(p_w n) = Θ( 1 / (1 + (α/γ) p_w^{3/2} / (p_w n)) ) = Θ( 1 / (1 + (α/γ) √p_w / n) )

  13. Outer Product
      - Outer product of two n-vectors x and y is the n × n matrix Z = x y^T whose (i, j) entry is z_ij = x_i y_j
      - For example,
          [x_1] [y_1]^T   [x_1 y_1  x_1 y_2  x_1 y_3]
          [x_2] [y_2]   = [x_2 y_1  x_2 y_2  x_2 y_3]
          [x_3] [y_3]     [x_3 y_1  x_3 y_2  x_3 y_3]
      - Computation of outer product requires n^2 multiplications, so
          M_1 = Θ(n^2),  Q_1 = Θ(n^2),  T_1 = Θ(γ n^2)
        (in this case, we should treat M_1 as the output size, or define the problem as in the BLAS: Z = Z_input + x y^T)
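      The BLAS-style update Z = Z_input + x y^T mentioned above is exactly the Level-2 routine dger. A minimal sketch via the CBLAS interface (the function name outer_product_update is illustrative):

          #include <cblas.h>

          /* Rank-1 update Z := Z + x y^T for n-vectors x, y and
             an n x n row-major matrix Z. */
          void outer_product_update(int n, const double *x, const double *y,
                                    double *Z) {
              cblas_dger(CblasRowMajor, n, n, 1.0, x, 1, y, 1, Z, n);
          }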

  14. Parallel Algorithm
      Partition
      - For i, j = 1, ..., n, fine-grain task (i, j) computes and stores z_ij = x_i y_j, yielding a 2-D array of n^2 fine-grain tasks
      - Assuming no replication of data, at most 2n fine-grain tasks store components of x and y; say, either
        - for some j, task (i, j) stores x_i and task (j, i) stores y_i, or
        - task (i, i) stores both x_i and y_i, for i = 1, ..., n
      Communicate
      - For i = 1, ..., n, the task that stores x_i broadcasts it to all other tasks in the i-th task row
      - For j = 1, ..., n, the task that stores y_j broadcasts it to all other tasks in the j-th task column

  15. Fine-Grain Tasks and Communication
      [diagram: 6 × 6 grid of fine-grain tasks, task (i, j) holding x_i y_j, with x_i broadcast along task row i and y_j broadcast along task column j]

  16. Fine-Grain Parallel Algorithm

          broadcast x_i to tasks (i, k), k = 1, ..., n   { horizontal broadcast }
          broadcast y_j to tasks (k, j), k = 1, ..., n   { vertical broadcast }
          z_ij = x_i y_j                                 { local scalar product }

  17. Agglomeration
      Agglomerate
      With an n × n array of fine-grain tasks, natural strategies are
      - 2-D: combine a k × k subarray of fine-grain tasks to form each coarse-grain task, yielding (n/k)^2 coarse-grain tasks
      - 1-D column: combine the n fine-grain tasks in each column into a coarse-grain task, yielding n coarse-grain tasks
      - 1-D row: combine the n fine-grain tasks in each row into a coarse-grain task, yielding n coarse-grain tasks
      A sketch of the 2-D scheme mapped onto a processor grid follows.
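      Below is a minimal MPI sketch (not from the slides) of the 2-D agglomeration applied to the outer product: p processors form a q × q grid with q = √p (assumed a perfect square), the blocks of x and y start on the diagonal processors, and each processor computes its local b × b block of Z after the row and column broadcasts. All names are illustrative.

          #include <stdlib.h>
          #include <math.h>
          #include <mpi.h>

          /* Processor (row, col) of a q x q grid computes the b x b block
             Z_blk = x block(row) * y block(col)^T; diagonal processors
             (row == col) initially own the corresponding blocks of x and y. */
          void outer_product_2d(const double *x_blk, const double *y_blk,
                                double *Z_blk, int b, MPI_Comm comm) {
              int rank, p;
              MPI_Comm_rank(comm, &rank);
              MPI_Comm_size(comm, &p);
              int q = (int)lround(sqrt((double)p));  /* grid dimension, p = q*q */
              int row = rank / q, col = rank % q;

              MPI_Comm row_comm, col_comm;           /* my task row and column */
              MPI_Comm_split(comm, row, col, &row_comm);
              MPI_Comm_split(comm, col, row, &col_comm);

              double *xrow = malloc(b * sizeof(double));
              double *ycol = malloc(b * sizeof(double));
              if (row == col) {                      /* diagonal owns the data */
                  for (int i = 0; i < b; i++) {
                      xrow[i] = x_blk[i];
                      ycol[i] = y_blk[i];
                  }
              }
              /* horizontal broadcast of x block; root is the diagonal task */
              MPI_Bcast(xrow, b, MPI_DOUBLE, row, row_comm);
              /* vertical broadcast of y block; root is the diagonal task */
              MPI_Bcast(ycol, b, MPI_DOUBLE, col, col_comm);

              for (int i = 0; i < b; i++)            /* local block of Z = x y^T */
                  for (int j = 0; j < b; j++)
                      Z_blk[i * b + j] = xrow[i] * ycol[j];

              free(xrow); free(ycol);
              MPI_Comm_free(&row_comm);
              MPI_Comm_free(&col_comm);
          }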
