Shared Memory with Cilk++: Matrix-Matrix Multiplication (PowerPoint PPT Presentation)
  1. CS 140: Numerical Examples on Shared Memory with Cilk++
     • Matrix-matrix multiplication
     • Matrix-vector multiplication
     • Hyperobjects
     Thanks to Charles E. Leiserson for some of these slides

  2. Work and Span (Recap)
     T_P = execution time on P processors
     T_1 = work
     T_∞ = span*
     Speedup on P processors = T_1/T_P
     Parallelism = T_1/T_∞
     *Also called critical-path length or computational depth.
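As a quick illustration of these ratios (with made-up example numbers, not from the slides), speedup and parallelism are simply work divided by running time and by span, respectively:

```cpp
#include <cassert>

// Speedup on P processors: T_1 / T_P.
double speedup(double t1, double tp) { return t1 / tp; }

// Parallelism: T_1 / T_inf, the maximum possible speedup.
double parallelism(double t1, double tinf) { return t1 / tinf; }
```

For instance, a program with T_1 = 100 s that runs in T_4 = 25 s on 4 processors achieves speedup 4 (linear speedup).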

  3. Cilk Loops: Divide and Conquer
     Vector addition:
       cilk_for (int i=0; i<n; ++i) {
         A[i] += B[i];
       }
     Work: T_1 = Θ(n)
     Span: T_∞ = Θ(lg n)
     Parallelism: T_1/T_∞ = Θ(n/lg n)
     Assume that the grain size G = Θ(1).
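The cilk_for loop is compiled into a divide-and-conquer recursion over the index range, which is where the Θ(lg n) span comes from. A serial sketch of that recursion (the function name and grain parameter are illustrative, not the runtime's actual code):

```cpp
#include <cstddef>

// Serial sketch of the divide-and-conquer recursion behind cilk_for:
// halve the index range until it fits in one grain, then run the loop
// body. In Cilk++ the two recursive calls run in parallel (cilk_spawn),
// so the control recursion contributes Theta(lg n) span.
void vadd_recursive(double *A, const double *B,
                    std::size_t lo, std::size_t hi,
                    std::size_t grain) {
    if (hi - lo <= grain) {
        for (std::size_t i = lo; i < hi; ++i)
            A[i] += B[i];                  // loop body: vector addition
        return;
    }
    std::size_t mid = lo + (hi - lo) / 2;
    vadd_recursive(A, B, lo, mid, grain);  // cilk_spawn in parallel code
    vadd_recursive(A, B, mid, hi, grain);
}
```

A larger grain size trades parallelism for less spawning overhead, which is why G = Θ(1) must be assumed to get the Θ(n/lg n) parallelism bound.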

  4. Square-Matrix Multiplication
     C = A · B, where A, B, and C are n × n matrices:
       c_ij = Σ_{k=1..n} a_ik b_kj
     Assume for simplicity that n = 2^k.
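The definition translates directly into a serial triple loop (a plain C++ sketch with row-major storage, used as the baseline the parallel versions must match):

```cpp
#include <vector>

// Plain serial multiply straight from the definition
// c_ij = sum_k a_ik * b_kj, for n x n matrices stored row-major.
std::vector<double> matmul(const std::vector<double> &A,
                           const std::vector<double> &B, int n) {
    std::vector<double> C(n * n, 0.0);
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            for (int k = 0; k < n; ++k)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
    return C;
}
```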

  5. Parallelizing Matrix Multiply
       cilk_for (int i=0; i<n; ++i) {
         cilk_for (int j=0; j<n; ++j) {
           for (int k=0; k<n; ++k) {
             C[i][j] += A[i][k] * B[k][j];
           }
         }
       }
     Work: T_1 = Θ(n^3)
     Span: T_∞ = Θ(n)
     Parallelism: T_1/T_∞ = Θ(n^2)
     For 1000 × 1000 matrices, parallelism ≈ (10^3)^2 = 10^6.

  6. Recursive Matrix Multiplication
     Divide and conquer:
       [ C11 C12 ]   [ A11 A12 ]   [ B11 B12 ]
       [ C21 C22 ] = [ A21 A22 ] · [ B21 B22 ]

                     [ A11·B11  A11·B12 ]   [ A12·B21  A12·B22 ]
                   = [ A21·B11  A21·B12 ] + [ A22·B21  A22·B22 ]
     8 multiplications of n/2 × n/2 matrices.
     1 addition of n × n matrices.
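The block identity can be checked on the smallest case, n = 2, where each block is a single scalar (an illustrative sketch, not the slides' code):

```cpp
#include <cassert>

// For n = 2 each block is a scalar, so the block identity reduces to
// the usual 2x2 product: 8 multiplications of half-size (1x1) blocks,
// followed by 1 addition of the two product "matrices".
struct Mat2 { double c11, c12, c21, c22; };

Mat2 block_multiply(Mat2 A, Mat2 B) {
    // 8 multiplications of half-size blocks ...
    double p1 = A.c11 * B.c11, p2 = A.c12 * B.c21;
    double p3 = A.c11 * B.c12, p4 = A.c12 * B.c22;
    double p5 = A.c21 * B.c11, p6 = A.c22 * B.c21;
    double p7 = A.c21 * B.c12, p8 = A.c22 * B.c22;
    // ... then 1 matrix addition combining the two products.
    return Mat2{p1 + p2, p3 + p4, p5 + p6, p7 + p8};
}
```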

  7. D&C Matrix Multiplication
       template <typename T>
       void MMult(T *C, T *A, T *B, int n) {  // n = row/column length of matrices
         T *D = new T[n*n];
         // base case & partition matrices: determine submatrices by index calculation
         cilk_spawn MMult(C11, A11, B11, n/2);
         cilk_spawn MMult(C12, A11, B12, n/2);
         cilk_spawn MMult(C22, A21, B12, n/2);
         cilk_spawn MMult(C21, A21, B11, n/2);
         cilk_spawn MMult(D11, A12, B21, n/2);
         cilk_spawn MMult(D12, A12, B22, n/2);
         cilk_spawn MMult(D22, A22, B22, n/2);
         MMult(D21, A22, B21, n/2);
         cilk_sync;
         MAdd(C, D, n); // C += D;
       }
     Coarsen the base case for efficiency.

  8. Matrix Addition
     The MAdd used by MMult:
       template <typename T>
       void MAdd(T *C, T *D, int n) {
         cilk_for (int i=0; i<n; ++i) {
           cilk_for (int j=0; j<n; ++j) {
             C[n*i+j] += D[n*i+j];
           }
         }
       }
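The slides leave the submatrix "index calculation" implicit. One common choice, assumed here (not the slides' actual code), is row-major storage with an explicit row stride ld, so each submatrix is just a pointer offset. A serial sketch with the cilk keywords removed (MMultS, MAddS, and ld are hypothetical names; n must be a power of 2 and C must start zeroed):

```cpp
#include <vector>

// Serial sketch of the slides' MAdd with an explicit row stride ld.
template <typename T>
void MAddS(T *C, T *D, int n, int ld) {
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            C[ld * i + j] += D[ld * i + j];
}

// Serial sketch of the slides' MMult. Submatrices by index calculation:
//   X11 = X,  X12 = X + n/2,  X21 = X + (n/2)*ld,  X22 = X21 + n/2.
template <typename T>
void MMultS(T *C, T *A, T *B, int n, int ld) {
    if (n == 1) { C[0] += A[0] * B[0]; return; }  // base case (coarsen in practice)
    std::vector<T> Dv(n * ld, T());               // temporary D, same stride as C
    T *D = Dv.data();
    int h = n / 2;
    T *A11 = A, *A12 = A + h, *A21 = A + h * ld, *A22 = A21 + h;
    T *B11 = B, *B12 = B + h, *B21 = B + h * ld, *B22 = B21 + h;
    T *C11 = C, *C12 = C + h, *C21 = C + h * ld, *C22 = C21 + h;
    T *D11 = D, *D12 = D + h, *D21 = D + h * ld, *D22 = D21 + h;
    // The 8 recursive products (spawned in parallel in the slides):
    MMultS(C11, A11, B11, h, ld); MMultS(C12, A11, B12, h, ld);
    MMultS(C21, A21, B11, h, ld); MMultS(C22, A21, B12, h, ld);
    MMultS(D11, A12, B21, h, ld); MMultS(D12, A12, B22, h, ld);
    MMultS(D21, A22, B21, h, ld); MMultS(D22, A22, B22, h, ld);
    MAddS(C, D, n, ld);  // C += D (after cilk_sync in the slides)
}
```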

  9. Analysis of Matrix Addition
       template <typename T>
       void MAdd(T *C, T *D, int n) {
         cilk_for (int i=0; i<n; ++i) {
           cilk_for (int j=0; j<n; ++j) {
             C[n*i+j] += D[n*i+j];
           }
         }
       }
     Work: A_1(n) = Θ(n^2)
     Span: A_∞(n) = Θ(lg n)
     Nested cilk_for statements have the same Θ(lg n) span.

  10. Work of Matrix Multiplication
      (MMult as on slide 7: 8 recursive calls on n/2 × n/2 matrices, plus MAdd.)
      Work: M_1(n) = 8 M_1(n/2) + A_1(n) + Θ(1)
                   = 8 M_1(n/2) + Θ(n^2)
                   = Θ(n^3)
      Master theorem, CASE 1: n^(log_b a) = n^(log_2 8) = n^3; f(n) = Θ(n^2).

  11. Span of Matrix Multiplication
      (The spawned recursive calls run in parallel, so they contribute the maximum of their spans: a single M_∞(n/2) term.)
      Span: M_∞(n) = M_∞(n/2) + A_∞(n) + Θ(1)
                   = M_∞(n/2) + Θ(lg n)
                   = Θ(lg^2 n)
      Master theorem, CASE 2: n^(log_b a) = n^(log_2 1) = 1; f(n) = Θ(n^(log_b a) lg^1 n).

  12. Parallelism of Matrix Multiply
      Work: M_1(n) = Θ(n^3)
      Span: M_∞(n) = Θ(lg^2 n)
      Parallelism: M_1(n)/M_∞(n) = Θ(n^3/lg^2 n)
      For 1000 × 1000 matrices, parallelism ≈ (10^3)^3/10^2 = 10^7.

  13. Stack Temporaries
      (MMult as on slide 7, with its temporary T *D = new T[n*n].)
      IDEA: Since minimizing storage tends to yield higher performance, trade off parallelism for less storage.

  14. No-Temp Matrix Multiplication
        // C += A*B;
        template <typename T>
        void MMult2(T *C, T *A, T *B, int n) {
          // base case & partition matrices
          cilk_spawn MMult2(C11, A11, B11, n/2);
          cilk_spawn MMult2(C12, A11, B12, n/2);
          cilk_spawn MMult2(C22, A21, B12, n/2);
          MMult2(C21, A21, B11, n/2);
          cilk_sync;
          cilk_spawn MMult2(C11, A12, B21, n/2);
          cilk_spawn MMult2(C12, A12, B22, n/2);
          cilk_spawn MMult2(C22, A22, B22, n/2);
          MMult2(C21, A22, B21, n/2);
          cilk_sync;
        }
      Saves space, but at what expense?
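A serial sketch of the no-temp variant (MMult2S and ld are hypothetical names; it assumes row-major storage with row stride ld, so submatrix X11 = X, X12 = X + n/2, X21 = X + (n/2)*ld, X22 = X21 + n/2, with n a power of 2 and C zeroed on entry):

```cpp
// Serial sketch of the no-temp variant: two rounds of four half-size
// products, both accumulating into C in place. The cilk_sync between
// rounds is what makes in-place accumulation safe in the parallel code.
template <typename T>
void MMult2S(T *C, T *A, T *B, int n, int ld) {
    if (n == 1) { C[0] += A[0] * B[0]; return; }  // base case
    int h = n / 2;
    T *A11 = A, *A12 = A + h, *A21 = A + h * ld, *A22 = A21 + h;
    T *B11 = B, *B12 = B + h, *B21 = B + h * ld, *B22 = B21 + h;
    T *C11 = C, *C12 = C + h, *C21 = C + h * ld, *C22 = C21 + h;
    // Round 1 (spawned in parallel in the slides, then cilk_sync):
    MMult2S(C11, A11, B11, h, ld); MMult2S(C12, A11, B12, h, ld);
    MMult2S(C21, A21, B11, h, ld); MMult2S(C22, A21, B12, h, ld);
    // Round 2 must wait: it accumulates into the same blocks of C.
    MMult2S(C11, A12, B21, h, ld); MMult2S(C12, A12, B22, h, ld);
    MMult2S(C21, A22, B21, h, ld); MMult2S(C22, A22, B22, h, ld);
}
```

Serializing the two rounds is exactly the "expense": the span grows from Θ(lg^2 n) to Θ(n), as the next slides show.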

  15. Work of No-Temp Multiply
      (MMult2 as on slide 14: 8 recursive calls, no matrix addition.)
      Work: M_1(n) = 8 M_1(n/2) + Θ(1)
                   = Θ(n^3)
      Master theorem, CASE 1: n^(log_b a) = n^(log_2 8) = n^3; f(n) = Θ(1).

  16. Span of No-Temp Multiply
      Each round of four calls contributes the maximum of their spans, and the two rounds run in series because of the cilk_sync between them:
      Span: M_∞(n) = 2 M_∞(n/2) + Θ(1)
                   = Θ(n)
      Master theorem, CASE 1: n^(log_b a) = n^(log_2 2) = n; f(n) = Θ(1).

  17. Parallelism of No-Temp Multiply
      Work: M_1(n) = Θ(n^3)
      Span: M_∞(n) = Θ(n)
      Parallelism: M_1(n)/M_∞(n) = Θ(n^2)
      For 1000 × 1000 matrices, parallelism ≈ (10^3)^2 = 10^6.
      Faster in practice!

  18. How general was that?
      ∙ Matrices are often rectangular
      ∙ Even when they are square, the dimensions are hardly ever a power of two
      C (m × n) = A (m × k) · B (k × n)
      Which dimension should we split?

  19. General Matrix Multiplication
        template <typename T>
        void MMult3(T *A, T *B, T *C,
                    int i0, int i1, int j0, int j1, int k0, int k1) {
          int di = i1 - i0;
          int dj = j1 - j0;
          int dk = k1 - k0;
          if (di >= dj && di >= dk && di >= THRESHOLD) {   // split m if it is the largest
            int mi = i0 + di / 2;
            MMult3(A, B, C, i0, mi, j0, j1, k0, k1);
            MMult3(A, B, C, mi, i1, j0, j1, k0, k1);
          } else if (dj >= dk && dj >= THRESHOLD) {        // split n if it is the largest
            int mj = j0 + dj / 2;
            MMult3(A, B, C, i0, i1, j0, mj, k0, k1);
            MMult3(A, B, C, i0, i1, mj, j1, k0, k1);
          } else if (dk >= THRESHOLD) {                    // split k if it is the largest
            int mk = k0 + dk / 2;
            MMult3(A, B, C, i0, i1, j0, j1, k0, mk);
            MMult3(A, B, C, i0, i1, j0, j1, mk, k1);
          } else {
            // Iterative (triple-nested loop) multiply
            for (int i = i0; i < i1; ++i) {
              for (int j = j0; j < j1; ++j) {
                for (int k = k0; k < k1; ++k)
                  C[i][j] += A[i][k] * B[k][j];
              }
            }
          }
        }
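The slide's code leaves the storage of A, B, and C and the value of THRESHOLD unspecified. A compilable serial sketch (MMult3S, THRESHOLD3, and the stride parameters lda/ldb/ldc are assumptions, not the slides' code), with A stored m × k, B stored k × n, and C stored m × n, all row-major:

```cpp
#include <vector>

const int THRESHOLD3 = 4;  // assumed small cutoff, for illustration only

// Compilable serial sketch of MMult3 on row-major matrices with
// explicit strides: recurse on whichever of m, n, k is largest,
// falling back to the triple loop below the threshold.
void MMult3S(const double *A, const double *B, double *C,
             int i0, int i1, int j0, int j1, int k0, int k1,
             int lda, int ldb, int ldc) {
    int di = i1 - i0, dj = j1 - j0, dk = k1 - k0;
    if (di >= dj && di >= dk && di >= THRESHOLD3) {        // split m
        int mi = i0 + di / 2;
        MMult3S(A, B, C, i0, mi, j0, j1, k0, k1, lda, ldb, ldc);
        MMult3S(A, B, C, mi, i1, j0, j1, k0, k1, lda, ldb, ldc);
    } else if (dj >= dk && dj >= THRESHOLD3) {             // split n
        int mj = j0 + dj / 2;
        MMult3S(A, B, C, i0, i1, j0, mj, k0, k1, lda, ldb, ldc);
        MMult3S(A, B, C, i0, i1, mj, j1, k0, k1, lda, ldb, ldc);
    } else if (dk >= THRESHOLD3) {                         // split k
        int mk = k0 + dk / 2;
        MMult3S(A, B, C, i0, i1, j0, j1, k0, mk, lda, ldb, ldc);
        MMult3S(A, B, C, i0, i1, j0, j1, mk, k1, lda, ldb, ldc);
    } else {                                               // base: triple loop
        for (int i = i0; i < i1; ++i)
            for (int j = j0; j < j1; ++j)
                for (int k = k0; k < k1; ++k)
                    C[i * ldc + j] += A[i * lda + k] * B[k * ldb + j];
    }
}
```

For example, multiplying a 5 × 3 all-ones A by a 3 × 4 all-ones B splits m first, then n, and produces a C whose entries are all 3.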

  20. Parallelizing General MMult
        template <typename T>
        void MMult3(T *A, T *B, T *C,
                    int i0, int i1, int j0, int j1, int k0, int k1) {
          int di = i1 - i0;
          int dj = j1 - j0;
          int dk = k1 - k0;
          if (di >= dj && di >= dk && di >= THRESHOLD) {
            int mi = i0 + di / 2;
            cilk_spawn MMult3(A, B, C, i0, mi, j0, j1, k0, k1);
            MMult3(A, B, C, mi, i1, j0, j1, k0, k1);
          } else if (dj >= dk && dj >= THRESHOLD) {
            int mj = j0 + dj / 2;
            cilk_spawn MMult3(A, B, C, i0, i1, j0, mj, k0, k1);
            MMult3(A, B, C, i0, i1, mj, j1, k0, k1);
          } else if (dk >= THRESHOLD) {
            int mk = k0 + dk / 2;
            // Unsafe to spawn here unless we use a temporary!
            MMult3(A, B, C, i0, i1, j0, j1, k0, mk);
            MMult3(A, B, C, i0, i1, j0, j1, mk, k1);
          } else {
            // Iterative (triple-nested loop) multiply
            for (int i = i0; i < i1; ++i) {
              for (int j = j0; j < j1; ++j) {
                for (int k = k0; k < k1; ++k)
                  C[i][j] += A[i][k] * B[k][j];
              }
            }
          }
        }

  21. Split m
      C (m × n) = A (m × k) · B (k × n)
      The two halves write disjoint rows of C: no races, safe to spawn!

  22. Split n
      C (m × n) = A (m × k) · B (k × n)
      The two halves write disjoint columns of C: no races, safe to spawn!

  23. Split k
      C (m × n) = A (m × k) · B (k × n)
      Both halves accumulate into every entry of C: data races, unsafe to spawn!
