Parallel Linear Algebra Software for Multi-Core Architectures


  1. Jakub Kurzak
     Parallel Linear Algebra Software for Multi-Core Architectures (PLASMA) for the CELL BE
     Georgia Tech CELL Workshop, June 18, 2007

  2. Outline
     ➢ New goals for dense linear algebra on multi-core processors
     ➢ Hand-crafted code example
        ➢ solving SPD systems of linear equations
     ➢ Automated code generation
        ➢ DAG-based execution with CELL SuperScalar
     ➢ New algorithmic approaches
        ➢ Algorithms by tiles

  3. Dense Linear Algebra Software Evolution
     ➢ LINPACK – vector machines – BLAS 1, BLAS 2
     ➢ LAPACK – cache-based machines – BLAS 3
     ➢ ScaLAPACK – distributed memory machines – PBLAS, BLACS, MPI
     ➢ PLASMA – general framework for multi-core, the CELL BE, and distributed memory machines

  4. The PLASMA Framework
     PLASMA scalability:
     ➢ Mixed-precision algorithms to exploit SIMD – ILP scalability
     ➢ Dynamic DAG-based scheduling
     ➢ Non-blocking communication
     ➢ Algorithms by tiles – maximum locality, minimal synchronization
     ➢ High-performance data representation – Block Data Layout (BDL)
     New BLAS required:
     ➢ Focus on tile kernel performance
     ➢ Implementation of new operations
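     Block Data Layout (BDL) stores each 64 × 64 tile contiguously, so a whole tile can be moved with a single DMA transfer and traversed with unit stride. A minimal sketch of packing a column-major matrix into such a layout, assuming square tiles of size 64 and a matrix dimension that is a multiple of 64 (the details of PLASMA's actual BDL may differ):

        #include <stdlib.h>

        #define NB 64   /* tile size used throughout the talk */

        /* Pack a column-major n-by-n matrix A (leading dimension lda, n a
           multiple of NB) into Block Data Layout: tiles stored one after
           another, each tile contiguous and itself column-major. */
        float *to_block_data_layout(const float *A, int n, int lda)
        {
            int nt = n / NB;                      /* tiles per dimension */
            float *B = malloc((size_t)n * n * sizeof(float));
            if (B == NULL) return NULL;

            for (int jt = 0; jt < nt; jt++)       /* tile column */
                for (int it = 0; it < nt; it++)   /* tile row    */
                {
                    float *tile = B + (size_t)(jt * nt + it) * NB * NB;
                    for (int j = 0; j < NB; j++)
                        for (int i = 0; i < NB; i++)
                            tile[j * NB + i] =
                                A[(size_t)(jt * NB + j) * lda + it * NB + i];
                }
            return B;
        }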

  5. Exploiting SIMD ILP in Single Precision
     Mixed-Precision Iterative Refinement (SPD)
     Solve Ax = b, where A is SPD:

        L·L^T = cholesky(A)            O(n^3)   SINGLE
        x = L^T \ (L \ b)              O(n^2)   SINGLE
        r = b – Ax                     O(n^2)   DOUBLE
        WHILE ||r|| not small enough
           z = L^T \ (L \ r)           O(n^2)   SINGLE
           x = x + z                   O(n^1)   DOUBLE
           r = b – Ax                  O(n^2)   DOUBLE
        END
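     The same mixed-precision scheme is what LAPACK's DSPOSV routine implements for SPD systems (the DSPOSV curves appear in the CELL Blade and PlayStation 3 plots later). A minimal sketch of calling it through the LAPACKE C interface; the 3 × 3 system below is made up for illustration:

        #include <stdio.h>
        #include <lapacke.h>

        int main(void)
        {
            /* Small SPD system A x = b; the values are illustrative only. */
            double A[9] = { 4, 1, 1,
                            1, 3, 0,
                            1, 0, 2 };
            double b[3] = { 6, 4, 3 };
            double x[3];
            lapack_int iter;

            /* DSPOSV: Cholesky factorization and solve in single precision,
               iterative refinement of the result in double precision; it
               falls back to a full double-precision solve if the refinement
               does not converge (iter < 0 in that case). */
            lapack_int info = LAPACKE_dsposv(LAPACK_ROW_MAJOR, 'L', 3, 1,
                                             A, 3, b, 1, x, 1, &iter);
            if (info != 0) { printf("dsposv failed: %d\n", (int)info); return 1; }
            printf("refinement steps: %d, x = %g %g %g\n",
                   (int)iter, x[0], x[1], x[2]);
            return 0;
        }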

  6. Algorithms by Tiles – Cholesky Factorization
     One step of the factorization is expressed with four tile kernels:
     ➢ POTRF:  T = L·L^T       (factorize a diagonal tile)
     ➢ TRSM:   C = C \ T       (triangular solve on a tile below the diagonal)
     ➢ SYRK:   T = T – A·A^T   (symmetric rank-k update of a diagonal tile)
     ➢ GEMM:   C = C – B·A^T   (update of an off-diagonal tile)
     [Figure: the tiles read and written by each kernel]

  7. Building Blocks – SIMD Tile Kernels

     Kernel           .c [LOC]   .o [KB]   Compilation    Execution time [µs]   Execution speed [Gflop/s]
     cblas_sgemm         330       9.0     spu-gcc -Os          22.78                  23.01
     cblas_ssyrk         160       4.7     spuxlc  -O3          13.23                  20.12
     cblas_strsm         310       8.2     spuxlc  -O3          16.47                  15.91
     lapack_spotrf       340       4.1     spu-gcc -O3          15.16                   5.84

     Equivalent BLAS / LAPACK calls (64 × 64 single-precision tiles):
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasTrans, 64, 64, 64, -1.0, A, 64, B, 64, 1.0, C, 64);
        cblas_ssyrk(CblasRowMajor, CblasLower, CblasNoTrans, 64, 64, -1.0, A, 64, 1.0, C, 64);
        cblas_strsm(CblasRowMajor, CblasRight, CblasLower, CblasTrans, CblasNonUnit, 64, 64, 1.0, T, 64, B, 64);
        lapack_spotrf(lapack_lower, 64, trans(A), 64, &info);

     (A hand-written assembly SGEMM kernel reaches 24.00 Gflop/s: "Real men program in assembly".)
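     As a sanity check, the execution speeds in the table follow from the leading-order flop counts of the 64 × 64 tile operations divided by the measured execution times; a small sketch of that arithmetic:

        #include <stdio.h>

        int main(void)
        {
            const double NB = 64.0;
            /* Leading-order flop counts for one 64 x 64 tile operation. */
            double sgemm  = 2.0 * NB * NB * NB;     /* C = C - A * B^T      */
            double ssyrk  = NB * NB * (NB + 1.0);   /* T = T - A * A^T      */
            double strsm  = NB * NB * NB;           /* triangular solve     */
            double spotrf = NB * NB * NB / 3.0;     /* Cholesky of one tile */

            /* Divide by the measured execution times (microseconds) from
               the table; flops per microsecond / 1000 = Gflop/s. */
            printf("sgemm : %5.2f Gflop/s\n", sgemm  / 22.78 / 1e3);
            printf("ssyrk : %5.2f Gflop/s\n", ssyrk  / 13.23 / 1e3);
            printf("strsm : %5.2f Gflop/s\n", strsm  / 16.47 / 1e3);
            printf("spotrf: %5.2f Gflop/s\n", spotrf / 15.16 / 1e3);
            return 0;
        }

     The first three come out at about 23.0, 20.1, and 15.9 Gflop/s, matching the table; the SPOTRF estimate (about 5.8 Gflop/s) is close to the table's 5.84 because the tile Cholesky also has lower-order terms not counted above.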

  8. DAG-based Dependency Tracking
     [Figure: DAG of the tile Cholesky factorization; each node is a tile operation labeled with its tile indices (1:1, 1:2, ..., 4:4)]
     Dependencies expressed by the DAG are enforced on a tile basis:
     ➢ fine-grained parallelization
     ➢ flexible scheduling
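     A generic illustration of tile-level dependency tracking (a toy sketch, not PLASMA's or CellSs's actual scheduler): each task records how many of its inputs are still unfinished, and a task is released as soon as that count drops to zero.

        #include <stdio.h>

        #define MAX_SUCC 4

        /* A toy task node: a tile operation with a count of unfinished
           inputs and a list of successors to notify on completion. */
        struct task {
            const char *name;
            int unfinished_inputs;          /* in-degree in the DAG          */
            struct task *succ[MAX_SUCC];    /* tasks that consume our output */
            int nsucc;
        };

        /* Called when a task finishes: release its successors. */
        static void complete(struct task *t)
        {
            printf("run %s\n", t->name);
            for (int i = 0; i < t->nsucc; i++)
                if (--t->succ[i]->unfinished_inputs == 0)
                    complete(t->succ[i]);   /* successor is now ready */
        }

        int main(void)
        {
            /* Fragment of the Cholesky DAG:
               POTRF(1,1) -> TRSM(2,1) -> SYRK(2,2). */
            struct task syrk  = { "SYRK  A[2][2]", 1, { 0 }, 0 };
            struct task trsm  = { "TRSM  A[2][1]", 1, { &syrk }, 1 };
            struct task potrf = { "POTRF A[1][1]", 0, { &trsm }, 1 };
            complete(&potrf);               /* ready task starts the cascade */
            return 0;
        }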

  9. Pipelining & Double Buffering
     Pipelining:
     ➢ between loop iterations.
     Double buffering:
     ➢ within BLAS,
     ➢ between BLAS,
     ➢ between loop iterations.
     Result:
     ➢ minimum load imbalance,
     ➢ minimum dependency stalls,
     ➢ minimum memory stalls (no waiting for data).
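     A minimal sketch of the double-buffering idea on one SPE, assuming the CELL SDK's spu_mfcio.h DMA interface; the tile addresses and the process_tile kernel are placeholders:

        #include <spu_mfcio.h>

        #define NB 64
        #define TILE_BYTES (NB * NB * sizeof(float))   /* 16 KB: the maximum
                                                           size of one DMA */

        static float buf[2][NB * NB] __attribute__((aligned(128)));

        /* Process ntiles tiles whose main-memory addresses are in ea[];
           the DMA of tile i+1 overlaps with the computation on tile i. */
        void process_stream(const unsigned long long *ea, int ntiles,
                            void (*process_tile)(float *))
        {
            int cur = 0;

            /* Prefetch the first tile on tag 0. */
            mfc_get(buf[cur], ea[0], TILE_BYTES, cur, 0, 0);

            for (int i = 0; i < ntiles; i++) {
                int next = cur ^ 1;

                /* Start fetching the next tile into the other buffer. */
                if (i + 1 < ntiles)
                    mfc_get(buf[next], ea[i + 1], TILE_BYTES, next, 0, 0);

                /* Wait only for the current buffer's DMA, then compute. */
                mfc_write_tag_mask(1 << cur);
                mfc_read_tag_status_all();
                process_tile(buf[cur]);

                cur = next;
            }
        }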

  10. Cholesky Factorization (SPOTRF)
     [Performance plot: SPOTRF Gflop/s vs. matrix size (0 to 4096); SP peak 204 Gflop/s, SGEMM peak 184 Gflop/s (192 with the assembly SGEMM kernel), DP peak 14 Gflop/s; SPOTRF reaches about 174 Gflop/s]
     ➢ 384 × 384     → > 50 Gflop/s
     ➢ 512 × 512     → ~ 90 Gflop/s
     ➢ 640 × 640     → > 110 Gflop/s
     ➢ 1024 × 1024   → > 150 Gflop/s
     ➢ 2304 × 2304   → > 170 Gflop/s
     ➢ 1536 × 1536   → 90 % of SGEMM peak
     ➢ 4096 × 4096   → 95 % of SGEMM peak

  11. Triangular Solve (STRSV)
     [Bandwidth plot: GB/s vs. matrix size (0 to 4096); memory peak 25.6 GB/s; STRSV reaches about 23.7 GB/s, roughly 90 % of peak]
     Compute-bound operations of the factorization (SPOTRF) get close to the peak floating-point performance.
     Memory-bound operations of the refinement (STRSV) get close to the peak memory bandwidth.

  12. Performance – CELL Blade
     [Performance plot: Gflop/s vs. matrix size (0 to 4096); SP peak 204 Gflop/s, SGEMM peak 184, DP peak 14; SPOTRF reaches about 174 Gflop/s, SPOSV about 171, DSPOSV about 155]

  13. Performance – PlayStation 3
     [Performance plot: Gflop/s vs. matrix size (0 to 2048); SP peak 153 Gflop/s, SGEMM peak 138, DP peak 10; SPOTRF reaches about 126 Gflop/s, SPOSV about 122, DSPOSV about 104]

  14. The Need for Automation
     LAPACK FORTRAN 77 Cholesky factorization – roughly 20 lines:

        DO 20 J = 1, N, NB
   *
   *        Update and factorize the current diagonal block and test
   *        for non-positive-definiteness.
   *
           JB = MIN( NB, N-J+1 )
           CALL DSYRK( 'Lower', 'No transpose', JB, J-1, -ONE,
       $               A( J, 1 ), LDA, ONE, A( J, J ), LDA )
           CALL DPOTF2( 'Lower', JB, A( J, J ), LDA, INFO )
           IF( INFO.NE.0 )
       $      GO TO 30
           IF( J+JB.LE.N ) THEN
   *
   *           Compute the current block column.
   *
              CALL DGEMM( 'No transpose', 'Transpose', N-J-JB+1, JB,
       $                  J-1, -ONE, A( J+JB, 1 ), LDA, A( J, 1 ),
       $                  LDA, ONE, A( J+JB, J ), LDA )
              CALL DTRSM( 'Right', 'Lower', 'Transpose', 'Non-unit',
       $                  N-J-JB+1, JB, ONE, A( J, J ), LDA,
       $                  A( J+JB, J ), LDA )
           END IF
     20 CONTINUE

     The CELL Cholesky main factorization routine is roughly 400 lines !!!
     Some automation is needed !!!

  15. Cholesky – Sequential CELL Code

     for (i = 0; i < DIM; i++) {
        for (j = 0; j < i; j++) {
           for (k = 0; k < j; k++) {
              sgemm_tile( A[i][k], A[j][k], A[i][j] );
           }
           strsm_tile( A[j][j], A[i][j] );
        }
        for (j = 0; j < i; j++) {
           ssyrk_tile( A[i][j], A[i][i] );
        }
        spotrf_tile( A[i][i] );
     }

     void sgemm_tile(float *A, float *B, float *C)
     void strsm_tile(float *T, float *B)
     void ssyrk_tile(float *A, float *C)

  16. Cholesky – CELL SuperScalar (BSC)

     for (i = 0; i < DIM; i++) {
        for (j = 0; j < i; j++) {
           for (k = 0; k < j; k++) {
              sgemm_tile( A[i][k], A[j][k], A[i][j] );
           }
           strsm_tile( A[j][j], A[i][j] );
        }
        for (j = 0; j < i; j++) {
           ssyrk_tile( A[i][j], A[i][i] );
        }
        spotrf_tile( A[i][i] );
     }

     #pragma css task input(A[64][64], B[64][64]) inout(C[64][64])
     void sgemm_tile(float *A, float *B, float *C)

     #pragma css task input(T[64][64]) inout(B[64][64])
     void strsm_tile(float *T, float *B)

     #pragma css task input(A[64][64]) inout(C[64][64])
     void ssyrk_tile(float *A, float *C)

  17. Cholesky – CELL SuperScalar – Performance
     [Performance plot comparing DAG construction on the fly with DAG construction before execution]

  18. Block Algorithms – LAPACK QR Factorization
     Decompose a matrix into factors Q and R, where Q is unitary and R is upper triangular, using Householder reflections.
     DGEQR2: factorize the panel – calculate the small R (purple) and a block of Householder reflectors (white).

  19. Block Algorithms – LAPACK QR Factorization
     Decompose a matrix into factors Q and R, where Q is unitary and R is upper triangular, using Householder reflections.
     DLARFB: update the trailing submatrix – apply the Householder reflectors to the submatrix to the right of the panel.
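     In LAPACK these two building blocks (DGEQR2 on the panel, DLARFB on the trailing submatrix) are driven by the blocked routine DGEQRF. A minimal sketch of calling it through the LAPACKE C interface; the 4 × 3 matrix is made up for illustration:

        #include <stdio.h>
        #include <lapacke.h>

        int main(void)
        {
            /* 4 x 3 example matrix, row-major; values are illustrative only. */
            double A[12] = { 1, 2, 3,
                             4, 5, 6,
                             7, 8, 10,
                             2, 0, 1 };
            double tau[3];   /* scalar factors of the Householder reflectors */

            /* DGEQRF = blocked QR: DGEQR2 on each panel, DLARFB on the
               trailing submatrix. On exit, R is in the upper triangle of A
               and the reflectors are stored below the diagonal. */
            lapack_int info = LAPACKE_dgeqrf(LAPACK_ROW_MAJOR, 4, 3, A, 3, tau);
            if (info != 0) { printf("dgeqrf failed: %d\n", (int)info); return 1; }
            printf("R(0,0) = %f\n", A[0]);
            return 0;
        }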

  20. Tile Algorithms – PLASMA QR Factorization
     Decompose a matrix into factors Q and R, where Q is unitary and R is upper triangular, using Householder reflections.
     DGEQR2: factorize the first tile – calculate the small R (purple) and a small block of Householder reflectors (white).

  21. Tile Algorithms – PLASMA QR Factorization
     Decompose a matrix into factors Q and R, where Q is unitary and R is upper triangular, using Householder reflections.
     DLARFB: update the first row of tiles.

  22. Tile Algorithms – PLASMA QR Factorization
     Decompose a matrix into factors Q and R, where Q is unitary and R is upper triangular, using Householder reflections.
     DGEQR2: couple the first R with the first tile in the same column and compute the QR factorization.

  23. Tile Algorithms – PLASMA QR Factorization
     Decompose a matrix into factors Q and R, where Q is unitary and R is upper triangular, using Householder reflections.
     DLARFB: update the two rows of tiles to the right.

  24. Tile Algorithms – PLASMA QR Factorization
     Decompose a matrix into factors Q and R, where Q is unitary and R is upper triangular, using Householder reflections.
     DGEQR2: couple the first R with the second tile in the same column and compute the QR factorization.

  25. Tile Algorithms – PLASMA QR Factorization
     Decompose a matrix into factors Q and R, where Q is unitary and R is upper triangular, using Householder reflections.
     DLARFB: update the two rows of tiles to the right.
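     Putting slides 20 to 25 together, the tile QR factorization has roughly the loop structure sketched below, in the same style as the tile Cholesky code on slide 15. The kernel names are hypothetical placeholders for the four tile operations just described:

        /* DIM x DIM matrix of NB x NB tiles A[m][n], plus per-tile
           triangular factors T[m][n]; kernel names are placeholders. */
        for (k = 0; k < DIM; k++) {
           /* slide 20: factorize the diagonal tile (small R + reflectors) */
           dgeqrt_tile( A[k][k], T[k][k] );

           /* slide 21: update the row of tiles to the right of it */
           for (n = k+1; n < DIM; n++)
              dlarfb_tile( A[k][k], T[k][k], A[k][n] );

           for (m = k+1; m < DIM; m++) {
              /* slides 22 and 24: couple the R on the diagonal with the
                 next tile in the same column and compute its QR */
              dtsqrt_tile( A[k][k], A[m][k], T[m][k] );

              /* slides 23 and 25: update the rows of tiles to the right */
              for (n = k+1; n < DIM; n++)
                 dssrfb_tile( A[m][k], T[m][k], A[k][n], A[m][n] );
           }
        }

        void dgeqrt_tile(double *A, double *T);
        void dlarfb_tile(double *A, double *T, double *C);
        void dtsqrt_tile(double *R, double *A, double *T);
        void dssrfb_tile(double *V, double *T, double *C1, double *C2);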

  26. PLASMA QR Factorization DAG
     [Figure: DAG of the tile QR factorization]
     Each node is a tile operation.

  27. Future
     ➢ CELL implementation of tile algorithms:
        ➢ Cholesky (done), LU, QR
        ➢ Hessenberg, bidiagonal, and tridiagonal reductions
     ➢ Efficient DAG construction and execution
