Column-Based Matrix Partitioning for Parallel Matrix Multiplication


  1. Column-Based Matrix Partitioning for Parallel Matrix Multiplication on Heterogeneous Processors Based on Functional Performance Models. David Clarke, Alexey Lastovetsky, Vladimir Rychkov. Heterogeneous Computing Laboratory, School of Computer Science and Informatics, University College Dublin, Belfield, Dublin 4, Ireland. http://hcl.ucd.ie. HeteroPar'2011.

  2. Outline: Motivation; Parallel Matrix Multiplication Routine; Matrix Partitioning Algorithms; Experimental Results; Conclusions.

  3. Why optimise matrix multiplication? Matrix multiplication underlies many applications: molecular simulation, solving systems of linear equations (Cholesky decomposition, LU decomposition), and image processing. Its complexity is of order O(n^3).

  4. A x B = C.

  5.-8. [Figure sequence: the matrices A, B and C in the product A x B = C are each partitioned among processors P1, P2, ..., Pi; each processor owns the corresponding rectangles of A, B and C.]

  9. Optimising Parallel Matrix Multiplication on a Heterogeneous Platform. ◮ Partition in proportion to processor speed. ◮ Minimise volume of communication. ◮ Partition in proportion to interconnect speed.
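
A minimal sketch of the first criterion, splitting the rows of a matrix in proportion to single-value processor speeds; the speed figures and the helper name distribute_rows are illustrative assumptions, not part of the authors' routine.

    #include <stdio.h>

    /* Distribute n rows among p processors in proportion to their speeds
     * (hypothetical helper; speeds here are constant, single-value estimates). */
    static void distribute_rows(int n, int p, const double speed[], int rows[])
    {
        double total = 0.0;
        for (int i = 0; i < p; i++)
            total += speed[i];

        int assigned = 0;
        for (int i = 0; i < p; i++) {
            rows[i] = (int)(n * speed[i] / total);   /* proportional share */
            assigned += rows[i];
        }
        /* Give any remainder from rounding to the fastest processor. */
        int fastest = 0;
        for (int i = 1; i < p; i++)
            if (speed[i] > speed[fastest]) fastest = i;
        rows[fastest] += n - assigned;
    }

    int main(void)
    {
        double speed[3] = {8.0, 4.0, 2.0};   /* illustrative relative speeds */
        int rows[3];
        distribute_rows(1400, 3, speed, rows);
        for (int i = 0; i < 3; i++)
            printf("P%d gets %d rows\n", i + 1, rows[i]);
        return 0;
    }

With speeds 8, 4 and 2, a 1400-row matrix is split 800/400/200, so each processor finishes its share in roughly the same time.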

  11.-13. Parallel matrix multiplication routine (pseudocode):

    allocate and initialise matrices A, B, C;
    allocate workspace WA, WB;
    for k = 0 → N−1 do
        if (is pivot row) then
            point WB to local pivot row of B;
            Broadcast WB to all in column;
        else
            Receive WB;
        end if
        if (is pivot column) then
            point WA to local pivot column of A;
            Send WA horizontally;
        else
            Receive WA;
        end if
        DGEMM(..., WA, WB, C, ...);
    end for
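
A minimal C sketch of one step of this routine, assuming a 2D process grid with row and column communicators already created (e.g. via MPI_Comm_split) and a blocked variant with panel width b. The communicator names, block sizes and the use of MPI_Bcast for the horizontal transfer are assumptions made for this sketch, not the authors' implementation.

    #include <mpi.h>
    #include <cblas.h>

    /* One step of the routine above: the pivot-row panel of B is broadcast
     * down each process column, the pivot-column panel of A is broadcast
     * along each process row, and every process then updates its local
     * block of C with a rank-b DGEMM.  Owners are assumed to have copied
     * their local pivot panels of B (b x n) into WB and of A (m x b) into
     * WA before this call. */
    void multiply_step(int pivot_row, int pivot_col,
                       int m, int n, int b,      /* local C block is m x n */
                       double *WA, double *WB, double *C,
                       MPI_Comm row_comm, MPI_Comm col_comm)
    {
        MPI_Bcast(WB, b * n, MPI_DOUBLE, pivot_row, col_comm);
        MPI_Bcast(WA, m * b, MPI_DOUBLE, pivot_col, row_comm);

        /* C += WA * WB (rank-b update of the local block). */
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    m, n, b, 1.0, WA, b, WB, n, 1.0, C, n);
    }

Broadcasts are used for both panels here for brevity; the pseudocode sends WA "horizontally" point-to-point, which a real implementation would follow.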

  14. Benchmarking on each processor must be independent of other processors: serial code.

    allocate and initialise matrices A, B, C;
    allocate workspace WA, WB;
    start timer;
    MPI_Send(A, ..., MPI_COMM_SELF);
    MPI_Recv(WA, ..., MPI_COMM_SELF);
    memcpy(WB, B, ...);
    DGEMM(..., WA, WB, C, ...);
    stop timer;
    free memory;
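
A minimal, self-contained sketch of such a benchmark kernel, assuming a square n x n block; a non-blocking MPI_Isend is used for the self-send so that a large message cannot deadlock, and MPI_Wtime is used for timing. Names and sizes are illustrative, and MPI_Init is assumed to have been called elsewhere.

    #include <mpi.h>
    #include <cblas.h>
    #include <stdlib.h>
    #include <string.h>

    /* Time one local kernel step (self-communication + DGEMM) for an
     * n x n block, reproducing the communication pattern of the real
     * routine without involving any other process. */
    double benchmark_kernel(int n)
    {
        double *A  = malloc(sizeof(double) * n * n);
        double *B  = malloc(sizeof(double) * n * n);
        double *C  = calloc((size_t)n * n, sizeof(double));
        double *WA = malloc(sizeof(double) * n * n);
        double *WB = malloc(sizeof(double) * n * n);
        for (int i = 0; i < n * n; i++) { A[i] = 1.0; B[i] = 1.0; }

        double t0 = MPI_Wtime();

        MPI_Request req;
        MPI_Isend(A, n * n, MPI_DOUBLE, 0, 0, MPI_COMM_SELF, &req);
        MPI_Recv(WA, n * n, MPI_DOUBLE, 0, 0, MPI_COMM_SELF, MPI_STATUS_IGNORE);
        MPI_Wait(&req, MPI_STATUS_IGNORE);

        memcpy(WB, B, sizeof(double) * n * n);
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, 1.0, WA, n, WB, n, 1.0, C, n);

        double t = MPI_Wtime() - t0;
        free(A); free(B); free(C); free(WA); free(WB);
        return t;
    }

Repeating the measurement over a range of block sizes yields the functional performance model: speed as a function of problem size rather than a single value.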

  15. Matrix Partitioning Algorithms. ◮ Column-Based Partitioning (Kalinov & Lastovetsky 1999) (KL). ◮ Minimising Total Communication Volume (Beaumont, Boudet, Rastello, Robert 2001) (BR). ◮ 1D Functional Performance Model-based Partitioning (Lastovetsky, Reddy 2007) (FPM1D). ◮ 2D Functional Performance Model-based Partitioning (Lastovetsky, Reddy 2010) (FPM-KL). ◮ New Two-Dimensional Matrix Partitioning Algorithm (FPM-BR).

  16.-17. Column-Based Partitioning (KL). ◮ Processors are arranged into columns. ◮ The width of each column is in proportion to the sum of the speeds of the processors in that column. ◮ Within each column the heights are calculated in proportion to speed. ◮ However, communication cost is not taken into account. ◮ Uses an inaccurate, single-value performance model of processor speed. [Figure: a 3 x 3 example with columns (P1, P4, P7), (P2, P5, P8), (P3, P6, P9).] A sketch of the width and height computation follows below.
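
A minimal sketch of the KL geometry for a fixed c x r arrangement of processors: column widths proportional to the column speed sums and, within each column, heights proportional to individual speeds. The fixed 3 x 3 grid and the speed values are illustrative assumptions.

    #include <stdio.h>

    #define COLS 3
    #define ROWS 3

    /* Column-based (KL-style) partition of an N x N matrix over a fixed
     * COLS x ROWS grid of processors with single-value speeds.
     * width[c]     : width of column c, proportional to the column speed sum
     * height[c][r] : height of processor (c, r), proportional to its speed  */
    static void kl_partition(int N, double speed[COLS][ROWS],
                             int width[COLS], int height[COLS][ROWS])
    {
        double col_sum[COLS], total = 0.0;
        for (int c = 0; c < COLS; c++) {
            col_sum[c] = 0.0;
            for (int r = 0; r < ROWS; r++) col_sum[c] += speed[c][r];
            total += col_sum[c];
        }
        int used_w = 0;
        for (int c = 0; c < COLS; c++) {
            width[c] = (c == COLS - 1) ? N - used_w   /* last column absorbs rounding */
                                       : (int)(N * col_sum[c] / total);
            used_w += width[c];
            int used_h = 0;
            for (int r = 0; r < ROWS; r++) {
                height[c][r] = (r == ROWS - 1) ? N - used_h
                                               : (int)(N * speed[c][r] / col_sum[c]);
                used_h += height[c][r];
            }
        }
    }

    int main(void)
    {
        double speed[COLS][ROWS] = {{6, 3, 3}, {4, 4, 2}, {2, 2, 2}};  /* illustrative */
        int width[COLS], height[COLS][ROWS];
        kl_partition(1200, speed, width, height);
        for (int c = 0; c < COLS; c++)
            printf("column %d: width %d, heights %d %d %d\n",
                   c, width[c], height[c][0], height[c][1], height[c][2]);
        return 0;
    }

Every processor's rectangle area is then proportional to its speed, which is exactly the load-balance criterion KL targets; the shapes of the rectangles, and hence the communication volume, are not controlled.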

  18.-19. Minimising Total Communication Volume (BR). ◮ Column-based algorithm. ◮ Computes: the optimum number of columns, and the optimum number of processors in each column. ◮ Such that: the workload is distributed in proportion to speed, and the total volume of communication is minimised. ◮ However, it uses an inaccurate, single-value performance model of processor speed.

  20. Minimising Total Communication Volume (BR). [Figure: processor P_i is assigned a rectangle of height m_i and width n_i in A and B.] Total volume of communication = Σ_{i=1}^{p} (m_i + n_i), "the sum of the half-perimeters", which is minimised when m_i ≈ n_i.
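
A small numeric illustration of the half-perimeter argument, assuming each processor's rectangle area is fixed by its workload share: for a fixed area s_i = m_i * n_i, the half-perimeter m_i + n_i is smallest when the rectangle is as square as possible (m_i ≈ n_i ≈ sqrt(s_i)). The areas used below are illustrative.

    #include <stdio.h>
    #include <math.h>

    /* For a fixed rectangle area s, compare the half-perimeter of a thin
     * 1 x s strip with that of the most square factorisation. */
    int main(void)
    {
        double areas[3] = {10000.0, 40000.0, 90000.0};   /* illustrative workload shares */
        for (int i = 0; i < 3; i++) {
            double s = areas[i];
            double strip  = 1.0 + s;          /* 1 x s rectangle  */
            double square = 2.0 * sqrt(s);    /* m = n = sqrt(s)  */
            printf("area %8.0f: half-perimeter %9.0f (strip) vs %6.0f (square)\n",
                   s, strip, square);
        }
        return 0;
    }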
