CSE 262 Lecture 7: Parallel Matrix Multiplication



Slide 1

CSE 262 Lecture 7

Parallel Matrix Multiplication

Slide 2

Announcements

  • Projects

Scott B. Baden /CSE 262/ Winter 2015 2

Slide 3

Today’s lecture

  • Cannon’s Parallel Matrix Multiplication

Algorithm

  • Working with communicators

Scott B. Baden /CSE 260/ Winter 2014 3

Slide 4

Recall Matrix Multiplication

  • Given two conforming matrices A and B, form the matrix product A × B, where A is m × n and B is n × p

  • Operation count: O(n³) multiply-adds for an n × n square matrix

  • Different variants, e.g. ijk, etc.


Slide 5

ijk variant

for i := 0 to n-1
    for j := 0 to n-1
        for k := 0 to n-1
            C[i,j] += A[i,k] * B[k,j]

[Figure: C[i,j] += A[i,:] * B[:,j], row i of A times column j of B]
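The ijk pseudocode above can be rendered as a small runnable C routine. Row-major storage and the function name are assumptions for illustration; a production code would call an optimized dgemm instead:

```c
#include <assert.h>

/* Naive ijk matrix multiply: C (m x p) += A (m x n) * B (n x p).
   All matrices are stored in row-major order; the caller zeroes C. */
static void matmul_ijk(int m, int n, int p,
                       const double *A, const double *B, double *C)
{
    for (int i = 0; i < m; i++)
        for (int j = 0; j < p; j++)
            for (int k = 0; k < n; k++)
                C[i * p + j] += A[i * n + k] * B[k * p + j];
}
```

Reordering the three loops (ikj, kij, ...) changes the memory access pattern but not the operation count.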

Slide 6

Parallel matrix multiplication

  • Organize processors into rows and columns
    ◦ Process rank is an ordered pair of integers
    ◦ Assume p is a perfect square

  • Each processor gets an n/√p × n/√p chunk of data
  • Assume that we have an efficient serial matrix multiply (dgemm, sgemm)

[Figure: 3 × 3 processor grid, p(0,0) through p(2,2)]
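The ordered-pair ranks amount to simple index arithmetic on the linear rank. The helper names below are illustrative; sqrtP is assumed to be the integer square root of p:

```c
#include <assert.h>

/* Map a linear process rank to (row, col) coordinates on a
   sqrtP x sqrtP processor grid, and back again. */
static int grid_row(int rank, int sqrtP)  { return rank / sqrtP; }
static int grid_col(int rank, int sqrtP)  { return rank % sqrtP; }
static int grid_rank(int row, int col, int sqrtP) { return row * sqrtP + col; }
```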

Slide 7

A simple parallel algorithm

  • Conceptually, like the blocked serial algorithm

[Figure: block decomposition of C = A × B]

Slide 8

Cost

  • Each processor performs n³/p multiply-adds
  • Multiplies a wide and short matrix by a tall and skinny matrix
  • Needs to collect these matrices via collective communication
  • High memory overhead

[Figure: C = A × B, a block row of A times a block column of B]

Slide 9

Observation

  • But we can form the same product by computing √p separate matrix multiplies involving n/√p × n/√p submatrices and accumulating partial results:

for k := 0 to √p - 1
    C[i,j] += A[i,k] * B[k,j]

[Figure: C = A × B as a sum of √p block products]

Slide 10

A more efficient algorithm

  • We can form the same product by computing √p separate matrix multiplies involving n/√p × n/√p submatrices and accumulating partial results:

for k := 0 to √p - 1
    C[i,j] += A[i,k] * B[k,j]

  • Move data incrementally in √p phases within a row or column

  • In effect, a linear time ring broadcast algorithm
  • Modest buffering requirements



Slide 11

Cannon’s algorithm

  • Implements the strategy
  • In effect we are using a ring broadcast algorithm
  • Consider block C[1,2]:

C[1,2] = A[1,0]*B[0,2] + A[1,1]*B[1,2] + A[1,2]*B[2,2]

[Figure: 3 × 3 grids of A and B blocks]

Image: Jim Demmel


[Figure: 3 × 3 grid of C blocks, C(0,0) through C(2,2)]

Slide 12

Skewing the matrices

  • Before we start, we preskew the matrices so everything lines up
  • Shift row i by i columns to the left using sends and receives
    ◦ Do the same for each column (shift column j by j rows up)
    ◦ Communication wraps around
  • Ensures that each partial product is computed on the same processor that owns C[i,j], using only shifts

[Figure: A and B block layouts after skewing]

C[1,2] = A[1,0]*B[0,2] + A[1,1]*B[1,2] + A[1,2]*B[2,2]


Slide 13

Shift and multiply

  • √p steps
  • Circularly shift rows 1 column to the left, and columns 1 row up
  • Each processor forms the partial product of its local A and B blocks and adds it into the accumulated sum in C

C[1,2] = A[1,0]*B[0,2] + A[1,1]*B[1,2] + A[1,2]*B[2,2]

[Figure: A and B block layouts after each circular shift step]
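The skew-then-shift-and-multiply scheme can be checked with a small serial simulation. The grid size and matrix dimension below are illustrative assumptions, each block multiply stands in for one processor's local dgemm, and the block copies stand in for the sends and receives:

```c
#include <assert.h>
#include <string.h>

#define N  4        /* matrix dimension (illustrative) */
#define SQ 2        /* sqrt(p): a 2 x 2 grid of "processors" */
#define BS (N / SQ) /* block size owned by each processor */

/* Copy block (si, sj) of src into block (di, dj) of dst (row-major). */
static void copy_block(const double *src, int si, int sj,
                       double *dst, int di, int dj)
{
    for (int r = 0; r < BS; r++)
        for (int c = 0; c < BS; c++)
            dst[(di * BS + r) * N + (dj * BS + c)] =
                src[(si * BS + r) * N + (sj * BS + c)];
}

/* Circularly shift block-row i of M left by amt[i] block positions. */
static void shift_rows_left(double *M, const int *amt)
{
    double tmp[N * N];
    for (int i = 0; i < SQ; i++)
        for (int j = 0; j < SQ; j++)
            copy_block(M, i, (j + amt[i]) % SQ, tmp, i, j);
    memcpy(M, tmp, sizeof tmp);
}

/* Circularly shift block-column j of M up by amt[j] block positions. */
static void shift_cols_up(double *M, const int *amt)
{
    double tmp[N * N];
    for (int i = 0; i < SQ; i++)
        for (int j = 0; j < SQ; j++)
            copy_block(M, (i + amt[j]) % SQ, j, tmp, i, j);
    memcpy(M, tmp, sizeof tmp);
}

/* Cannon's algorithm, simulated serially: preskew A and B, then SQ
   rounds of local block multiply-accumulate followed by unit shifts.
   A and B are modified in place; the caller zeroes C. */
static void cannon(double *A, double *B, double *C)
{
    int skew[SQ], one[SQ];
    for (int i = 0; i < SQ; i++) { skew[i] = i; one[i] = 1; }
    shift_rows_left(A, skew);   /* shift block-row i of A left by i */
    shift_cols_up(B, skew);     /* shift block-col j of B up by j   */
    for (int k = 0; k < SQ; k++) {
        for (int bi = 0; bi < SQ; bi++)      /* each "processor" (bi,bj) */
            for (int bj = 0; bj < SQ; bj++)  /* multiplies its local blocks */
                for (int r = 0; r < BS; r++)
                    for (int c = 0; c < BS; c++)
                        for (int t = 0; t < BS; t++)
                            C[(bi * BS + r) * N + (bj * BS + c)] +=
                                A[(bi * BS + r) * N + (bj * BS + t)] *
                                B[(bi * BS + t) * N + (bj * BS + c)];
        shift_rows_left(A, one);
        shift_cols_up(B, one);
    }
}
```

In the MPI version each shift is a single message between grid neighbors, so only one extra block of buffering per matrix is needed, which is the "modest buffering requirement" noted on the previous slide.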

Slide 14

Cost of Cannon’s Algorithm

forall i = 0 to √p - 1
    CShift-left A[i, :] by i            // T = α + βn²/p
forall j = 0 to √p - 1
    CShift-up B[:, j] by j              // T = α + βn²/p
for k = 0 to √p - 1
    forall i = 0 to √p - 1 and j = 0 to √p - 1
        C[i,j] += A[i,j] * B[i,j]       // T = 2n³/p^(3/2)
        CShift-left A[i, :] by 1        // T = α + βn²/p
        CShift-up B[:, j] by 1          // T = α + βn²/p
    end forall
end for

TP = 2n³/p + 2(α(1 + √p) + βn²(1 + √p)/p)
EP = T1/(p·TP) = (1 + αp^(3/2)/n³ + β√p/n)⁻¹ ≈ (1 + O(√p/n))⁻¹
EP → 1 as n/√p grows [the square root of the data per processor]


Slide 15

Implementation

Slide 16

Communication domains

  • Cannon’s algorithm shifts data along rows and columns of processors
  • MPI provides communicators for grouping processors, reflecting the communication structure of the algorithm
  • An MPI communicator is a name space, a subset of processes that communicate
  • Messages remain within their communicator
  • A process may be a member of more than one communicator

[Figure: 4 × 4 processor grid P0–P15 with row communicators X0–X3, column communicators Y0–Y3, and grid coordinates (0,0) through (3,3)]


Slide 17

Creating the communicators

  • Create a communicator for each row: key = myRank div √P

MPI_Comm rowComm;
MPI_Comm_split(MPI_COMM_WORLD, myRank / √P, myRank, &rowComm);
MPI_Comm_rank(rowComm, &myRow);

  • What about a communicator for each column?

[Figure: 4 × 4 grid P0–P15; the processes in each row share one key value, 0 through 3]

  • Each process obtains a new communicator according to the key
  • Process rank is relative to the new communicator
  • Rank applies to the respective communicator only
  • Ordered according to myRank

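Since the call above uses rankKey = myRank, the colors and new ranks that MPI_Comm_split hands back for the row and column communicators reduce to the same integer arithmetic as the grid mapping. The function names below are illustrative helpers, not MPI API:

```c
#include <assert.h>

/* What MPI_Comm_split computes for a sqrtP x sqrtP grid when the
   split key is myRank / sqrtP (rows) or myRank % sqrtP (columns)
   and rankKey = myRank, so new ranks follow world-rank order. */
static int row_color(int myRank, int sqrtP) { return myRank / sqrtP; }
static int row_rank (int myRank, int sqrtP) { return myRank % sqrtP; }
static int col_color(int myRank, int sqrtP) { return myRank % sqrtP; }
static int col_rank (int myRank, int sqrtP) { return myRank / sqrtP; }
```

For example, on a 4 × 4 grid, world rank 6 sits in row communicator 1 with rank 2, and in column communicator 2 with rank 1.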

Slide 18

More on Comm_split

MPI_Comm_split(MPI_Comm comm, int splitKey,
               int rankKey, MPI_Comm* newComm)


  • Within each new communicator, ranks are assigned in rankKey order; processes sharing the same rankKey value are ordered by their rank in the old communicator
  • May exclude a process by passing the constant MPI_UNDEFINED as the splitKey; such a process gets back the special communicator MPI_COMM_NULL
  • If a process is a member of several communicators, it will have a rank within each one


Slide 19

Circular shift

  • Communication within columns (and rows):

MPI_Comm_rank(rowComm, &myidRing);
MPI_Comm_size(rowComm, &nodesRing);
int next = (myidRing + 1) % nodesRing;
MPI_Send(&X, 1, MPI_INT, next, 0, rowComm);
MPI_Recv(&XR, 1, MPI_INT, MPI_ANY_SOURCE, 0, rowComm, &status);

[Figure: 3 × 3 processor grid, p(0,0) through p(2,2)]

  • Processes 0, 1, 2 are in one communicator because they share the same key value (0)
  • Processes 3, 4, 5 are in another (key = 1), and so on

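The net effect of the send/receive pair can be sketched serially: after the exchange, each rank in the ring holds the value its left neighbor sent. The function name is illustrative, with the array index playing the role of the MPI rank:

```c
#include <assert.h>

/* Serial stand-in for the ring shift: each of the nodesRing
   "processes" sends its value X[myid] to rank (myid + 1) % nodesRing,
   so XR[r] ends up holding the value from rank r's left neighbor. */
static void ring_shift(const int *X, int *XR, int nodesRing)
{
    for (int myid = 0; myid < nodesRing; myid++) {
        int next = (myid + 1) % nodesRing;
        XR[next] = X[myid];   /* "send" X[myid] to the next neighbor */
    }
}
```

In a real MPI run, pairing a blocking MPI_Send with MPI_Recv this way can deadlock for large messages; MPI_Sendrecv performs the same neighbor exchange safely.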