SLIDE 1

CS 140 : Numerical Examples on Shared Memory with Cilk++

  • Matrix-matrix multiplication
  • Matrix-vector multiplication
  • Hyperobjects

Thanks to Charles E. Leiserson for some of these slides.

SLIDE 2

Work and Span (Recap)

TP = execution time on P processors
T1 = work
T∞ = span*

*Also called critical-path length or computational depth.

Speedup on P processors: T1/TP
Parallelism: T1/T∞

SLIDE 3

Cilk Loops: Divide and Conquer

Vector addition:

cilk_for (int i=0; i<n; ++i) {
  A[i] += B[i];
}

Implementation: a cilk_for loop is executed by divide and conquer; the recursion bottoms out once at most G iterations remain, where G is the grain size. Assume that G = Θ(1).

Work: T1 = Θ(n)
Span: T∞ = Θ(lg n)
Parallelism: T1/T∞ = Θ(n/lg n)
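The divide-and-conquer execution the slide describes can be pictured as the following lowering (a minimal sketch, not the actual Cilk++ runtime code; the function name vadd_recur and the fixed G are assumptions):

// Hypothetical lowering of: cilk_for (int i=0; i<n; ++i) A[i] += B[i];
const int G = 1;  // grain size, assumed Θ(1) as on the slide

void vadd_recur(double *A, double *B, int lo, int hi) {
  if (hi - lo > G) {
    int mid = lo + (hi - lo) / 2;
    cilk_spawn vadd_recur(A, B, lo, mid);  // left half runs in parallel
    vadd_recur(A, B, mid, hi);             // right half in this strand
    cilk_sync;
  } else {
    for (int i = lo; i < hi; ++i)          // serial leaf of ≤ G iterations
      A[i] += B[i];
  }
}

The spawn tree has depth Θ(lg n), which is exactly where the Θ(lg n) span of the loop comes from.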
SLIDE 4

Square-Matrix Multiplication

C = A · B, where C = [cij], A = [aij], and B = [bij] are n × n matrices:

cij = Σ (k = 1 to n) aik bkj

Assume for simplicity that n = 2^k.

SLIDE 5

Parallelizing Matrix Multiply

cilk_for (int i=0; i<n; ++i) {
  cilk_for (int j=0; j<n; ++j) {
    for (int k=0; k<n; ++k) {
      C[i][j] += A[i][k] * B[k][j];
    }
  }
}

Work: T1 = Θ(n³)
Span: T∞ = Θ(n)
Parallelism: T1/T∞ = Θ(n²)

For 1000 × 1000 matrices, parallelism ≈ (10³)² = 10⁶.
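The span can be read off by composing the pieces in series (a worked step; the decomposition is mine, using the Θ(lg n) cilk_for control span from the vector-addition slide):

\[
T_\infty = \underbrace{\Theta(\lg n)}_{\text{outer cilk\_for}}
         + \underbrace{\Theta(\lg n)}_{\text{inner cilk\_for}}
         + \underbrace{\Theta(n)}_{\text{serial } k \text{ loop}}
         = \Theta(n).
\]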

SLIDE 6

Recursive Matrix Multiplication

Divide and conquer: 8 multiplications of n/2 × n/2 matrices, plus 1 addition of n × n matrices.

[C11 C12]   [A11 A12]   [B11 B12]
[C21 C22] = [A21 A22] · [B21 B22]

            [A11B11 A11B12]   [A12B21 A12B22]
          = [A21B11 A21B12] + [A22B21 A22B22]

SLIDE 7

D&C Matrix Multiplication

template <typename T>
void MMult(T *C, T *A, T *B, int n) {
  T *D = new T[n*n];
  // base case & partition matrices
  cilk_spawn MMult(C11, A11, B11, n/2);
  cilk_spawn MMult(C12, A11, B12, n/2);
  cilk_spawn MMult(C22, A21, B12, n/2);
  cilk_spawn MMult(C21, A21, B11, n/2);
  cilk_spawn MMult(D11, A12, B21, n/2);
  cilk_spawn MMult(D12, A12, B22, n/2);
  cilk_spawn MMult(D22, A22, B22, n/2);
  MMult(D21, A22, B21, n/2);
  cilk_sync;
  MAdd(C, D, n); // C += D;
}

n is the row/column length of the matrices. The submatrices are determined by index calculation (see the sketch below). Coarsen the base case for efficiency.
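The slides elide the partitioning itself. A minimal sketch of the index calculation, assuming row-major storage (the helper name quadrant and the leading-dimension parameter ld are mine, not the slides'):

// Hypothetical helper: returns quadrant (r, c) ∈ {0,1}² of an n × n block
// whose rows are ld elements apart (ld = row length of the full matrix).
template <typename T>
T* quadrant(T *M, int ld, int r, int c, int n) {
  return M + r * (n/2) * ld + c * (n/2);
}
// e.g.  A11 = quadrant(A, ld, 0, 0, n);  A12 = quadrant(A, ld, 0, 1, n);
//       A21 = quadrant(A, ld, 1, 0, n);  A22 = quadrant(A, ld, 1, 1, n);

Since a quadrant's rows are no longer contiguous, the recursive calls would also need to carry ld alongside n.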

SLIDE 8

Matrix Addition

The MAdd helper called at the end of MMult:

template <typename T>
void MAdd(T *C, T *D, int n) {
  cilk_for (int i=0; i<n; ++i) {
    cilk_for (int j=0; j<n; ++j) {
      C[n*i+j] += D[n*i+j];
    }
  }
}

SLIDE 9

Analysis of Matrix Addition

Work: A1(n) = Θ(n²)
Span: A∞(n) = Θ(lg n)

Nested cilk_for statements have the same Θ(lg n) span: the two levels of loop control compose in series, and Θ(lg n) + Θ(lg n) = Θ(lg n).
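As a worked decomposition (the bracketing is mine, following the cilk_for analysis from the vector-addition slide):

\[
A_\infty(n) = \underbrace{\Theta(\lg n)}_{\text{outer loop control}}
            + \underbrace{\Theta(\lg n)}_{\text{inner loop control}}
            + \underbrace{\Theta(1)}_{\text{body}}
            = \Theta(\lg n).
\]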

SLIDE 10

Work of Matrix Multiplication

template <typename T>
void MMult(T *C, T *A, T *B, int n) {
  T *D = new T[n*n];
  // base case & partition matrices
  cilk_spawn MMult(C11, A11, B11, n/2);
  cilk_spawn MMult(C12, A11, B12, n/2);
  ⋮
  cilk_spawn MMult(D22, A22, B22, n/2);
  MMult(D21, A22, B21, n/2);
  cilk_sync;
  MAdd(C, D, n); // C += D;
}

Work: M1(n) = 8 M1(n/2) + A1(n) + Θ(1)
            = 8 M1(n/2) + Θ(n²)
            = Θ(n³)

Master method, CASE 1: n^(log_b a) = n^(log_2 8) = n³ dominates f(n) = Θ(n²).
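Spelled out against the master theorem (a worked check; the choice ε = 1 is mine):

\[
M_1(n) = 8\,M_1(n/2) + \Theta(n^2), \qquad
n^{\log_b a} = n^{\log_2 8} = n^3, \qquad
f(n) = \Theta(n^2) = O\!\left(n^{3-\varepsilon}\right) \text{ for } \varepsilon = 1,
\]

so Case 1 of the master theorem gives \(M_1(n) = \Theta(n^3)\).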

SLIDE 11

Span of Matrix Multiplication

template <typename T>
void MMult(T *C, T *A, T *B, int n) {
  T *D = new T[n*n];
  // base case & partition matrices
  cilk_spawn MMult(C11, A11, B11, n/2);
  cilk_spawn MMult(C12, A11, B12, n/2);
  ⋮
  cilk_spawn MMult(D22, A22, B22, n/2);
  MMult(D21, A22, B21, n/2);
  cilk_sync;
  MAdd(C, D, n); // C += D;
}

The eight recursive calls run in parallel, so they contribute the maximum of their spans, M∞(n/2).

Span: M∞(n) = M∞(n/2) + A∞(n) + Θ(1)
            = M∞(n/2) + Θ(lg n)
            = Θ(lg² n)

Master method, CASE 2: n^(log_b a) = n^(log_2 1) = 1 and f(n) = Θ(n^(log_b a) lg¹ n).

SLIDE 12

Parallelism of Matrix Multiply

Work: M1(n) = Θ(n³)
Span: M∞(n) = Θ(lg² n)
Parallelism: M1(n)/M∞(n) = Θ(n³/lg² n)

For 1000 × 1000 matrices, parallelism ≈ (10³)³/10² = 10⁷.

SLIDE 13

Stack Temporaries

IDEA: Since minimizing storage tends to yield higher performance, trade off parallelism for less storage.

The temporary D is the storage in question:

template <typename T>
void MMult(T *C, T *A, T *B, int n) {
  T *D = new T[n*n];   // ← the allocation we want to eliminate
  // base case & partition matrices
  cilk_spawn MMult(C11, A11, B11, n/2);
  cilk_spawn MMult(C12, A11, B12, n/2);
  cilk_spawn MMult(C22, A21, B12, n/2);
  cilk_spawn MMult(C21, A21, B11, n/2);
  cilk_spawn MMult(D11, A12, B21, n/2);
  cilk_spawn MMult(D12, A12, B22, n/2);
  cilk_spawn MMult(D22, A22, B22, n/2);
  MMult(D21, A22, B21, n/2);
  cilk_sync;
  MAdd(C, D, n); // C += D;
}

SLIDE 14

No-Temp Matrix Multiplication

Saves space, but at what expense?

// C += A*B;
template <typename T>
void MMult2(T *C, T *A, T *B, int n) {
  // base case & partition matrices
  cilk_spawn MMult2(C11, A11, B11, n/2);
  cilk_spawn MMult2(C12, A11, B12, n/2);
  cilk_spawn MMult2(C22, A21, B12, n/2);
  MMult2(C21, A21, B11, n/2);
  cilk_sync;
  cilk_spawn MMult2(C11, A12, B21, n/2);
  cilk_spawn MMult2(C12, A12, B22, n/2);
  cilk_spawn MMult2(C22, A22, B22, n/2);
  MMult2(C21, A22, B21, n/2);
  cilk_sync;
}

The cilk_sync between the two phases keeps the second set of updates to each Cij from racing with the first.

SLIDE 15

Work of No-Temp Multiply

Work: M1(n) = 8 M1(n/2) + Θ(1)
            = Θ(n³)

Master method, CASE 1: n^(log_b a) = n^(log_2 8) = n³ dominates f(n) = Θ(1).

SLIDE 16

Span of No-Temp Multiply

Within each phase, the four parallel calls contribute the maximum of their spans, M∞(n/2); the cilk_sync makes the two phases sequential, so the phase spans add.

Span: M∞(n) = 2 M∞(n/2) + Θ(1)
            = Θ(n)

Master method, CASE 1: n^(log_b a) = n^(log_2 2) = n dominates f(n) = Θ(1).
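As a worked decomposition (the bracketing is mine):

\[
M_\infty(n) = \underbrace{\max_{4\ \text{calls}} M_\infty(n/2)}_{\text{phase 1}}
            + \underbrace{\max_{4\ \text{calls}} M_\infty(n/2)}_{\text{phase 2}}
            + \Theta(1)
            = 2\,M_\infty(n/2) + \Theta(1) = \Theta(n).
\]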

SLIDE 17

Parallelism of No-Temp Multiply

Work: M1(n) = Θ(n³)
Span: M∞(n) = Θ(n)
Parallelism: M1(n)/M∞(n) = Θ(n²)

For 1000 × 1000 matrices, parallelism ≈ (10³)² = 10⁶. Faster in practice!

SLIDE 18

How general was that?

∙ Matrices are often rectangular.
∙ Even when they are square, the dimensions are rarely a power of two.

[Figure: C (m × n) = A (m × k) · B (k × n)]

Which dimension should we split?

SLIDE 19

General Matrix Multiplication

Split whichever of m, n, and k is the largest:

template <typename T>
void MMult3(T **A, T **B, T **C,
            int i0, int i1, int j0, int j1, int k0, int k1) {
  int di = i1 - i0;   // extent of the m (row) dimension
  int dj = j1 - j0;   // extent of the n (column) dimension
  int dk = k1 - k0;   // extent of the k (inner) dimension
  if (di >= dj && di >= dk && di >= THRESHOLD) {
    int mi = i0 + di / 2;                        // split m
    MMult3(A, B, C, i0, mi, j0, j1, k0, k1);
    MMult3(A, B, C, mi, i1, j0, j1, k0, k1);
  } else if (dj >= dk && dj >= THRESHOLD) {
    int mj = j0 + dj / 2;                        // split n
    MMult3(A, B, C, i0, i1, j0, mj, k0, k1);
    MMult3(A, B, C, i0, i1, mj, j1, k0, k1);
  } else if (dk >= THRESHOLD) {
    int mk = k0 + dk / 2;                        // split k
    MMult3(A, B, C, i0, i1, j0, j1, k0, mk);
    MMult3(A, B, C, i0, i1, j0, j1, mk, k1);
  } else {
    // Iterative (triple-nested loop) multiply
    for (int i = i0; i < i1; ++i)
      for (int j = j0; j < j1; ++j)
        for (int k = k0; k < k1; ++k)
          C[i][j] += A[i][k] * B[k][j];
  }
}
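A hypothetical driver (the wrapper name and THRESHOLD value are assumptions for illustration): the initial call covers the full iteration space of C += A·B, where C is m × n, A is m × k, and B is k × n, all passed as arrays of row pointers:

#define THRESHOLD 16   // assumed base-case cutoff; tune empirically

template <typename T>
void MatMulGeneral(T **A, T **B, T **C, int m, int n, int k) {
  // C (m × n) += A (m × k) · B (k × n); C assumed initialized.
  MMult3(A, B, C, 0, m, 0, n, 0, k);
}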

SLIDE 20

Parallelizing General MMult

template <typename T>
void MMult3(T **A, T **B, T **C,
            int i0, int i1, int j0, int j1, int k0, int k1) {
  int di = i1 - i0;
  int dj = j1 - j0;
  int dk = k1 - k0;
  if (di >= dj && di >= dk && di >= THRESHOLD) {
    int mi = i0 + di / 2;                                   // split m
    cilk_spawn MMult3(A, B, C, i0, mi, j0, j1, k0, k1);
    MMult3(A, B, C, mi, i1, j0, j1, k0, k1);
  } else if (dj >= dk && dj >= THRESHOLD) {
    int mj = j0 + dj / 2;                                   // split n
    cilk_spawn MMult3(A, B, C, i0, i1, j0, mj, k0, k1);
    MMult3(A, B, C, i0, i1, mj, j1, k0, k1);
  } else if (dk >= THRESHOLD) {
    int mk = k0 + dk / 2;                                   // split k
    MMult3(A, B, C, i0, i1, j0, j1, k0, mk);
    MMult3(A, B, C, i0, i1, j0, j1, mk, k1);
  } else {
    // Iterative (triple-nested loop) multiply
    for (int i = i0; i < i1; ++i)
      for (int j = j0; j < j1; ++j)
        for (int k = k0; k < k1; ++k)
          C[i][j] += A[i][k] * B[k][j];
  }
}

Unsafe to spawn the k split unless we use a temporary (see the sketch below)!
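For completeness, a sketch of the temporary trick the slide alludes to (not the slides' code; NewZeroMatrix and FreeMatrix are hypothetical helpers that allocate and free a zeroed m × n scratch matrix):

// Hypothetical safe k split: give the second half its own zeroed output.
T **D = NewZeroMatrix<T>(m, n);                        // scratch m × n matrix
cilk_spawn MMult3(A, B, C, i0, i1, j0, j1, k0, mk);    // first k half into C
MMult3(A, B, D, i0, i1, j0, j1, mk, k1);               // second k half into D
cilk_sync;
for (int i = i0; i < i1; ++i)                          // merge: C += D
  for (int j = j0; j < j1; ++j)
    C[i][j] += D[i][j];
FreeMatrix(D);

The extra memory and the merge pass are exactly the costs MMult paid for its temporary D, which is why the slides simply serialize the k split instead.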

SLIDE 21

Split m

No races, safe to spawn!

[Figure: C = A · B with the m dimension split; the two subproblems write disjoint row blocks of C.]

SLIDE 22

Split n

No races, safe to spawn!

[Figure: C = A · B with the n dimension split; the two subproblems write disjoint column blocks of C.]

SLIDE 23

Split k

Data races, unsafe to spawn!

[Figure: C = A · B with the k dimension split; both subproblems update every entry of C.]

SLIDE 24

Matrix-Vector Multiplication

y = A · x, where A is m × n, x has length n, and y has length m:

yi = Σ (j = 1 to n) aij xj

Let each worker handle a single row!

SLIDE 25

Ma Matrix trix-Vec Vector tor Mu Mult ltipl iplication ication

template <typename T> void MatVec (T **A, T* x, T* y, int m, int n) { for(int i=0; i<m; i++) { for(int j=0; j<n; j++) y[i] += A[i][j] * x[j]; } } template <typename T> void MatVec (T **A, T* x, T* y, int m, int n) { cilk_for (int i=0; i<m; i++){ for(int j=0; j<n; j++) y[i] += A[i][j] * x[j]; } }

Parallelize
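The slides do not analyze this version, but the same work/span rules apply (the decomposition is mine, not the slides'):

\[
T_1 = \Theta(mn), \qquad
T_\infty = \underbrace{\Theta(\lg m)}_{\texttt{cilk\_for}} + \underbrace{\Theta(n)}_{\text{one row}}, \qquad
\frac{T_1}{T_\infty} = \Theta\!\left(\frac{mn}{n+\lg m}\right) = \Theta(m) \ \text{for}\ n \gg \lg m.
\]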

SLIDE 26

Matrix-Transpose × Vector

y = Aᵀ · x, where A is m × n, x has length m, and y has length n:

yi = Σ (j = 1 to m) aji xj

The data is still A; there is no explicit transposition.

SLIDE 27

Matrix-Transpose × Vector

template <typename T>
void MatTransVec(T **A, T *x, T *y, int m, int n) {
  cilk_for (int i=0; i<n; i++) {
    for (int j=0; j<m; j++)
      y[i] += A[j][i] * x[j];
  }
}

Terrible performance: the column walk A[j][i] strides through memory, costing roughly one cache miss per iteration.

Reorder loops:

template <typename T>
void MatTransVec(T **A, T *x, T *y, int m, int n) {
  cilk_for (int j=0; j<m; j++) {
    for (int i=0; i<n; i++)
      y[i] += A[j][i] * x[j];
  }
}

Data race! Parallel iterations of the outer loop all update the same entries y[i].

SLIDE 28

Hyperobjects

∙ Avoiding the data race on the variable y can be done by splitting y into multiple copies that are never accessed concurrently.
∙ A hyperobject is a Cilk++ object that shows distinct views to different observers.
∙ Before completing a cilk_sync, the parent's view is reduced into the child's view.
∙ For correctness, the reduce() function should be associative (not necessarily commutative).

template <typename T>
struct add_monoid : cilk::monoid_base<T> {
  void reduce(T *left, T *right) const {
    *left += *right;
  }
  …
};

SLIDE 29

Hyperobject solution

∙ Use a built-in hyperobject (there are many; read the REDUCERS chapter of the programmer's guide).

template <typename T>
void MatTransVec(T **A, T *x, T *y, int m, int n) {
  array_reducer_t art(n, y);
  cilk::hyperobject<array_reducer_t> rvec(art);
  cilk_for (int j=0; j<m; j++) {
    T *array = rvec().array;
    for (int i=0; i<n; i++)
      array[i] += A[j][i] * x[j];
  }
}

∙ Use hyperobjects sparingly, on infrequently accessed global variables.
∙ This example is for educational purposes; there are better ways of parallelizing y = Aᵀx.
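For a feel of the built-in reducers the bullet points to, a minimal summation sketch (the header path and accessor names follow the Cilk++/Cilk Plus reducer library; exact spellings vary between releases):

#include <cilk/cilk.h>
#include <cilk/reducer_opadd.h>

double parallel_sum(const double *a, int n) {
  cilk::reducer_opadd<double> sum(0.0);  // each strand sees its own view
  cilk_for (int i = 0; i < n; ++i)
    sum += a[i];                         // no race: update hits a local view
  return sum.get_value();                // views merged with associative +
}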