dense linear algebra on heterogeneous platforms state of
play

Dense Linear Algebra on Heterogeneous Platforms: State of the Art - PowerPoint PPT Presentation

Dense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Paolo Bientinesi AICES, RWTH Aachen pauldj@aices.rwth-aachen.de ComplexHPC Spring School 2013 Heterogeneous computing - Impact on algorithms June 7th, 2013 Uppsala


  1. Dense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Paolo Bientinesi AICES, RWTH Aachen pauldj@aices.rwth-aachen.de ComplexHPC Spring School 2013 Heterogeneous computing - Impact on algorithms June 7th, 2013 Uppsala University, Sweden Paolo Bientinesi (AICES, RWTH Aachen) 1 / 34

  2. Setting the stage 1 Part 1: blocked algorithms 2 Part 2: multithreading, fork-join 3 Part 3: multihtreading, algorithms-by-blocks 4 5 Part 4: streaming Paolo Bientinesi (AICES, RWTH Aachen) 2 / 34

  3. Dense Linear Algebra ××××××××××××   ×××××××××××× ××××××××××××   ××××××××××××   ××××××××××××   ××××××××××××   M =  ××××××××××××   ××××××××××××    ××××××××××××   ××××××××××××   ×××××××××××× ×××××××××××× Paolo Bientinesi (AICES, RWTH Aachen) 3 / 34

  4. Dense Linear Algebra ××× × ××× ×   ×××××××××××× ××××× ×××××   ××××××××××××   ××× ××××× ××   ×××× ××××× ×   M =  ××× × ××× ×   ××××××××××××    ××××× ×××××   ××××××××××××   ××× ××××× ×× ×××× ××××× × Paolo Bientinesi (AICES, RWTH Aachen) 3 / 34

  5. Dense Linear Algebra ×× × × × ×   × × ××× ×× × ×× × ××   × ×× ×× ×   × × ×× ×   × ××× × ×   M =  × × × × × ×   ×× ××××× ××    × × ××   ××× × ×××   × × × ×× ×× × ×× × ×× × Paolo Bientinesi (AICES, RWTH Aachen) 3 / 34

  6. Dense Linear Algebra ×   ×× ×××   ××××   ××××   ×××× ×   M =  ×××××××   ×× ×××××    ×××××××××   ××××××× ×   ×××××××××× ×××××××××××× Paolo Bientinesi (AICES, RWTH Aachen) 3 / 34

  7. Dense Linear Algebra ××   ××× ×××   ×××   ×××   ×××   M =  ×××   ×××    ×××   ×××   ××× ×× Paolo Bientinesi (AICES, RWTH Aachen) 3 / 34

  8. Dense Linear Algebra Linear systems Eigenproblems Ax = b , AX = B , least squares, . . . Ax = λx , AX = BX Λ , SVD, . . . Paolo Bientinesi (AICES, RWTH Aachen) 4 / 34

  9. Dense Linear Algebra Linear systems Eigenproblems Ax = b , AX = B , least squares, . . . Ax = λx , AX = BX Λ , SVD, . . . Support routines factorizations, reductions, . . . Paolo Bientinesi (AICES, RWTH Aachen) 4 / 34

  10. Dense Linear Algebra Matrix equations AX + XB = C , A = A + A − 1 , . . . 2 Linear systems Eigenproblems Ax = b , AX = B , least squares, . . . Ax = λx , AX = BX Λ , SVD, . . . Support routines factorizations, reductions, . . . Paolo Bientinesi (AICES, RWTH Aachen) 4 / 34

  11. Organization in layers Paolo Bientinesi (AICES, RWTH Aachen) 5 / 34

  12. Organization in layers BLAS Paolo Bientinesi (AICES, RWTH Aachen) 5 / 34

  13. Organization in layers BLAS x, y ∈ R n , α ∈ R BLAS-1: y := y + αx dot := α + x T y Paolo Bientinesi (AICES, RWTH Aachen) 5 / 34

  14. Organization in layers BLAS A, L ∈ R n × n , x, y ∈ R n BLAS-2: y := y + Ax y := L − 1 x x, y ∈ R n , α ∈ R BLAS-1: y := y + αx dot := α + x T y Paolo Bientinesi (AICES, RWTH Aachen) 5 / 34

  15. Organization in layers BLAS A, B, C, L ∈ R n × n BLAS-3 : C := C + AB C := L − 1 B A, L ∈ R n × n , x, y ∈ R n BLAS-2: y := y + Ax y := L − 1 x x, y ∈ R n , α ∈ R BLAS-1: y := y + αx dot := α + x T y Paolo Bientinesi (AICES, RWTH Aachen) 5 / 34

  16. Organization in layers LAPACK LL T = A, Q T TQ = A, . . . LU = A, QR = A, BLAS A, B, C, L ∈ R n × n BLAS-3 : C := C + AB C := L − 1 B A, L ∈ R n × n , x, y ∈ R n BLAS-2: y := y + Ax y := L − 1 x x, y ∈ R n , α ∈ R BLAS-1: y := y + αx dot := α + x T y Paolo Bientinesi (AICES, RWTH Aachen) 5 / 34

  17. Organization in layers other libraries ScaLAPACK, Elemental, PETSc, . . . LAPACK LL T = A, Q T TQ = A, . . . LU = A, QR = A, BLAS A, B, C, L ∈ R n × n BLAS-3 : C := C + AB C := L − 1 B A, L ∈ R n × n , x, y ∈ R n BLAS-2: y := y + Ax y := L − 1 x x, y ∈ R n , α ∈ R BLAS-1: y := y + αx dot := α + x T y Paolo Bientinesi (AICES, RWTH Aachen) 5 / 34

  18. Example: AX = B (full A ) AX = B Linear System LX = B LU = A Triangular LU System Factorization LX = B C = AB + C C = AB + C Triangular Gemm Gemm System C = AB + C Gemm Paolo Bientinesi (AICES, RWTH Aachen) 6 / 34

  19. Why BLAS-3? Why GEMM? Paolo Bientinesi (AICES, RWTH Aachen) 7 / 34

  20. Why BLAS-3? Why GEMM? BLAS Mem. refs. Ratio #FLOPS x, y ∈ R n , α ∈ R BLAS-1: y := y + αx Paolo Bientinesi (AICES, RWTH Aachen) 7 / 34

  21. Why BLAS-3? Why GEMM? BLAS Mem. refs. Ratio #FLOPS Level 1 2 n 3 n 2 / 3 x, y ∈ R n , α ∈ R BLAS-1: y := y + αx Paolo Bientinesi (AICES, RWTH Aachen) 7 / 34

  22. Why BLAS-3? Why GEMM? BLAS Mem. refs. Ratio #FLOPS Level 1 2 n 3 n 2 / 3 A ∈ R n × n , x, y ∈ R n BLAS-2: y := y + Ax x, y ∈ R n , α ∈ R BLAS-1: y := y + αx Paolo Bientinesi (AICES, RWTH Aachen) 7 / 34

  23. Why BLAS-3? Why GEMM? BLAS Mem. refs. Ratio #FLOPS 2 n 2 n 2 2 Level 2 Level 1 2 n 3 n 2 / 3 A ∈ R n × n , x, y ∈ R n BLAS-2: y := y + Ax x, y ∈ R n , α ∈ R BLAS-1: y := y + αx Paolo Bientinesi (AICES, RWTH Aachen) 7 / 34

  24. Why BLAS-3? Why GEMM? BLAS Mem. refs. Ratio #FLOPS 2 n 2 n 2 2 Level 2 Level 1 2 n 3 n 2 / 3 A, B, C, ∈ R n × n BLAS-3 : C := C + AB A ∈ R n × n , x, y ∈ R n BLAS-2: y := y + Ax x, y ∈ R n , α ∈ R BLAS-1: y := y + αx Paolo Bientinesi (AICES, RWTH Aachen) 7 / 34

  25. Why BLAS-3? Why GEMM? BLAS Mem. refs. Ratio #FLOPS 2 n 3 4 n 2 Level 3 n/ 2 2 n 2 n 2 2 Level 2 Level 1 2 n 3 n 2 / 3 A, B, C, ∈ R n × n BLAS-3 : C := C + AB A ∈ R n × n , x, y ∈ R n BLAS-2: y := y + Ax x, y ∈ R n , α ∈ R BLAS-1: y := y + αx Paolo Bientinesi (AICES, RWTH Aachen) 7 / 34

  26. Why BLAS-3? Why GEMM? BLAS Mem. refs. Ratio #FLOPS 2 n 3 4 n 2 Level 3 n/ 2 2 n 2 n 2 2 Level 2 Level 1 2 n 3 n 2 / 3 A, B, C, ∈ R n × n BLAS-3 : C := C + AB A ∈ R n × n , x, y ∈ R n BLAS-2: y := y + Ax x, y ∈ R n , α ∈ R BLAS-1: y := y + αx Morale BLAS-3: The larger the problem the better, as long as it fits in memory. GEMM is the building block for all the other BLAS-3 kernels, and for LAPACK. Paolo Bientinesi (AICES, RWTH Aachen) 7 / 34

  27. Part 1: Blocked algorithms Simple example: Cholesky factorization Input: Matrix A , symmetric and positive definite. Determine L (lower triangular matrix) such that LL T = A Goal: � L T L � L = = ? L BL L BR Paolo Bientinesi (AICES, RWTH Aachen) 8 / 34

  28. Cholesky factorization iteration i DONE DONE Paolo Bientinesi (AICES, RWTH Aachen) 9 / 34

  29. Cholesky factorization iteration i + 1 unblocked algorithm blocked algorithm ❅ ❅ ❘ sqrt CHOL ✲ syr trsv TRSM SYRK Paolo Bientinesi (AICES, RWTH Aachen) 9 / 34

  30. Cholesky factorization iteration i + 1 unblocked algorithm blocked algorithm DONE DONE DONE DONE Paolo Bientinesi (AICES, RWTH Aachen) 9 / 34

  31. Cholesky: unblocked vs. blocked algorithms Paolo Bientinesi (AICES, RWTH Aachen) 10 / 34

  32. Part 2: Parallelism? fork-join Solution #1: Multithreaded BLAS (+ vector instructions) Chol LU TRSM TRSM GEMM TRSM GEMM Paolo Bientinesi (AICES, RWTH Aachen) 11 / 34

  33. Part 2: Parallelism? fork-join Solution #1: Multithreaded BLAS (+ vector instructions) Chol LU TRSM TRSM GEMM TRSM GEMM Advantage: ease of use. Legacy code! Drawback: unnecessary synchronization points OpenBLAS, ATLAS, BLIS, old versions of MKL, . . . Paolo Bientinesi (AICES, RWTH Aachen) 11 / 34

  34. Multithreaded BLAS Xeon, 32 physical cores, MKL Efficiency of GEMM 1 0.8 Efficiency 0.6 1 thread 0.4 2 threads 4 threads 8 threads 0.2 16 threads 32 threads 0 1000 2000 3000 4000 5000 6000 7000 8000 Matrix dimension Paolo Bientinesi (AICES, RWTH Aachen) 12 / 34

  35. Development of LA libraries - New architecture / new architectural features Paolo Bientinesi (AICES, RWTH Aachen) 13 / 34

  36. Development of LA libraries - New architecture / new architectural features - GEMM Paolo Bientinesi (AICES, RWTH Aachen) 13 / 34

  37. Development of LA libraries - New architecture / new architectural features - GEMM → peak performance [FFT, SpMV] Paolo Bientinesi (AICES, RWTH Aachen) 13 / 34

  38. Development of LA libraries - New architecture / new architectural features - GEMM → peak performance [FFT, SpMV] - BLAS3, factorizations, AX=B Paolo Bientinesi (AICES, RWTH Aachen) 13 / 34

  39. Development of LA libraries - New architecture / new architectural features - GEMM → peak performance [FFT, SpMV] → - BLAS3, factorizations, AX=B LINPACK benchmark Paolo Bientinesi (AICES, RWTH Aachen) 13 / 34

Download Presentation
Download Policy: The content available on the website is offered to you 'AS IS' for your personal information and use only. It cannot be commercialized, licensed, or distributed on other websites without prior consent from the author. To download a presentation, simply click this link. If you encounter any difficulties during the download process, it's possible that the publisher has removed the file from their server.

Recommend


More recommend