Parallel Linear Algebra Software for Multi-Core Architectures – PowerPoint PPT Presentation


SLIDE 1

Parallel Linear Algebra Software for Multi-Core Architectures (PLASMA) for the CELL BE

Jakub Kurzak
Georgia Tech CELL Workshop, June 18, 2007

SLIDE 2

Outline

➢ New goals for dense linear algebra on multi-core processors
➢ Hand-crafted code example
  ➢ Solving SPD systems of linear equations
➢ Automated code generation
  ➢ DAG-based execution with CELL SuperScalar
➢ New algorithmic approaches
  ➢ Algorithms by tiles

SLIDE 3

Dense Linear Algebra Software Evolution

LINPACK
➢ Vector machines – BLAS 1, BLAS 2
LAPACK
➢ Cache-based machines – BLAS 3
ScaLAPACK
➢ Distributed memory machines – PBLAS, BLACS, MPI
PLASMA
➢ General framework for multi-core processors, the CELL BE, and distributed memory machines

SLIDE 4

The PLASMA Framework

PLASMA
➢ Mixed-precision algorithms to exploit SIMD ILP
➢ Dynamic DAG-based scheduling
➢ Non-blocking communication
➢ Algorithms by tiles
  ➢ Maximum locality
  ➢ Minimal synchronization
➢ High-performance data representation (BDL – sketched below)
New BLAS required
➢ Block Data Layout
➢ Focus on tile kernel performance
➢ Implementation of new operations
Scalability, scalability, scalability ...
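As an illustration of the Block Data Layout item above, here is a minimal sketch (my own, not the PLASMA API) of tiles stored contiguously so that each one can be moved with a single DMA transfer:

/* Block Data Layout (BDL) sketch -- layout details are an assumption,
 * not the PLASMA API.  The matrix is stored tile by tile: each NB x NB
 * tile occupies one contiguous block of memory, so an SPE can fetch a
 * whole tile with a single DMA instead of NB strided transfers. */
#include <stddef.h>

enum { NB = 64 };   /* tile size used throughout the talk */

/* Pointer to tile (i, j) of an mt-by-nt tile matrix stored in BDL,
 * tiles ordered column-major by tile index. */
static inline float *bdl_tile(float *A, int mt, int i, int j)
{
    return A + ((size_t)j * mt + i) * (NB * NB);
}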

SLIDE 5

Exploiting SIMD ILP in Single Precision

Mixed-precision Iterative Refinement (SPD)

Solve: Ax = b, where A is SPD

L·Lᵀ = cholesky(A)                SINGLE   O(n³)
x = L \ (Lᵀ \ b)                  SINGLE   O(n²)
r = b – A·x                       DOUBLE   O(n²)
WHILE || r || not small enough
    z = L \ (Lᵀ \ r)              SINGLE   O(n²)
    x = x + z                     DOUBLE   O(n¹)
    r = b – A·x                   DOUBLE   O(n²)
END
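A minimal C sketch of this loop, written against the reference LAPACKE/CBLAS interfaces rather than the PLASMA tile kernels; the function name and memory handling are mine, and error checking is omitted:

/* Solve Ax = b (A is n x n SPD, column-major, double precision) with the
 * O(n^3) factorization done in single precision. */
#include <stdlib.h>
#include <cblas.h>
#include <lapacke.h>

int dsposv_sketch(int n, const double *A, const double *b,
                  double *x, double tol, int max_iter)
{
    float  *As = malloc((size_t)n * n * sizeof *As); /* single-precision A   */
    float  *zs = malloc((size_t)n * sizeof *zs);     /* single-precision RHS */
    double *r  = malloc((size_t)n * sizeof *r);      /* double residual      */
    int i, it, info;

    for (i = 0; i < n * n; i++) As[i] = (float)A[i];

    /* L * L^T = cholesky(A)                SINGLE  O(n^3) */
    info = LAPACKE_spotrf(LAPACK_COL_MAJOR, 'L', n, As, n);
    if (info != 0) goto done;

    /* x = L \ (L^T \ b)                    SINGLE  O(n^2) */
    for (i = 0; i < n; i++) zs[i] = (float)b[i];
    LAPACKE_spotrs(LAPACK_COL_MAJOR, 'L', n, 1, As, n, zs, n);
    for (i = 0; i < n; i++) x[i] = (double)zs[i];

    for (it = 0; it < max_iter; it++) {
        /* r = b - A * x                    DOUBLE  O(n^2) */
        for (i = 0; i < n; i++) r[i] = b[i];
        cblas_dgemv(CblasColMajor, CblasNoTrans, n, n,
                    -1.0, A, n, x, 1, 1.0, r, 1);
        if (cblas_dnrm2(n, r, 1) <= tol)   /* || r || small enough? */
            break;

        /* z = L \ (L^T \ r)                SINGLE  O(n^2) */
        for (i = 0; i < n; i++) zs[i] = (float)r[i];
        LAPACKE_spotrs(LAPACK_COL_MAJOR, 'L', n, 1, As, n, zs, n);

        /* x = x + z                        DOUBLE  O(n) */
        for (i = 0; i < n; i++) x[i] += (double)zs[i];
    }
done:
    free(As); free(zs); free(r);
    return info;
}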

SLIDE 6

Algorithms by Tiles – Cholesky Factorization

[Tile diagram: diagonal tile T, panel tiles A, B, and updated tile C]

T = T – A × Aᵀ   →  SYRK
T = L·Lᵀ         →  POTRF
C = C – B × Aᵀ   →  GEMM
C = C \ T        →  TRSM

SLIDE 7

Building Blocks – SIMD Tile Kernels

Equivalent BLAS / LAPACK call                  .c [LOC]  .o [KB]  Compilation  Time [µs]  Speed [Gflop/s]

cblas_sgemm( CblasRowMajor, CblasNoTrans,        330      9.0    spu-gcc -Os    22.78        23.01
    CblasTrans, 64, 64, 64, -1.0, A, 64,
    B, 64, 1.0, C, 64 );

cblas_ssyrk( CblasRowMajor, CblasLower,          160      4.7    spuxlc -O3     13.23        20.12
    CblasNoTrans, 64, 64, -1.0, A, 64,
    1.0, C, 64 );

cblas_strsm( CblasRowMajor, CblasRight,          310      8.2    spuxlc -O3     16.47        15.91
    CblasLower, CblasTrans, CblasNonUnit,
    64, 64, 1.0, T, 64, B, 64 );

lapack_spotrf( lapack_lower, 64,                 340      4.1    spu-gcc -O3    15.16         5.84
    trans(A), 64, &info );

With the hand-written assembly SGEMM kernel: 24.00 Gflop/s.

Real men program in assembly.

SLIDE 8

DAG-based Dependency Tracking

[DAG diagram: nodes are tile tasks labeled by tile indices 1:1, 1:2, 2:2, ...]

Dependencies expressed by the DAG are enforced on a tile basis:
➢ fine-grained parallelization
➢ flexible scheduling
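To make the tile-level dependency tracking concrete, the following self-contained sketch (my own illustration, not the PLASMA or CellSs scheduler) derives the DAG edges of the tile Cholesky loop by recording, for every tile, the last task that wrote it; a task depends on the last writer of each tile it touches:

/* Derive DAG edges for the tile Cholesky loop via last-writer tracking. */
#include <stdio.h>

#define DIM 4

static int last_writer[DIM][DIM];  /* task id that last wrote a tile, 0 = none */
static int ntasks = 0;

static int task(const char *name, int i, int j) {
    int id = ++ntasks;
    printf("task %d: %s(%d,%d)\n", id, name, i, j);
    return id;
}

static void edge(int from, int to) {
    if (from) printf("  edge %d -> %d\n", from, to);
}

int main(void) {
    for (int i = 0; i < DIM; i++) {
        for (int j = 0; j < i; j++) {
            for (int k = 0; k < j; k++) {
                int t = task("sgemm", i, j);        /* reads A[i][k], A[j][k] */
                edge(last_writer[i][k], t);
                edge(last_writer[j][k], t);
                edge(last_writer[i][j], t);         /* updates A[i][j] */
                last_writer[i][j] = t;
            }
            int t = task("strsm", i, j);            /* reads A[j][j] */
            edge(last_writer[j][j], t);
            edge(last_writer[i][j], t);             /* updates A[i][j] */
            last_writer[i][j] = t;
        }
        for (int j = 0; j < i; j++) {
            int t = task("ssyrk", i, i);            /* reads A[i][j] */
            edge(last_writer[i][j], t);
            edge(last_writer[i][i], t);             /* updates A[i][i] */
            last_writer[i][i] = t;
        }
        int t = task("spotrf", i, i);               /* factorizes A[i][i] */
        edge(last_writer[i][i], t);
        last_writer[i][i] = t;
    }
    return 0;
}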

SLIDE 9

Pipelining & Double Buffering

Pipelining:
➢ Between loop iterations.
Double Buffering:
➢ Within BLAS,
➢ Between BLAS,
➢ Between loop iterations.
Result:
➢ Minimum load imbalance,
➢ Minimum dependency stalls,
➢ Minimum memory stalls (no waiting for data).
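A minimal sketch of the intra-loop double-buffering pattern on an SPE, using the standard spu_mfcio.h DMA intrinsics; the tile size, the compute_tile() kernel, and the tiles_ea effective-address list are illustrative assumptions, and the write-back of results is omitted:

/* While the kernel works on buffer[cur], the MFC prefetches the next
 * tile into buffer[cur ^ 1], overlapping computation and communication. */
#include <spu_mfcio.h>

#define TILE_BYTES (64 * 64 * sizeof(float))   /* 16 KB = one DMA transfer */

volatile float buffer[2][64 * 64] __attribute__((aligned(128)));

extern void compute_tile(volatile float *tile); /* hypothetical kernel */

void process_tiles(const unsigned long long *tiles_ea, int ntiles)
{
    int i, cur = 0;

    /* Prefetch the first tile (DMA tag = buffer index). */
    mfc_get(buffer[cur], tiles_ea[0], TILE_BYTES, cur, 0, 0);

    for (i = 0; i < ntiles; i++) {
        int nxt = cur ^ 1;

        /* Start the DMA for the next tile before computing on this one. */
        if (i + 1 < ntiles)
            mfc_get(buffer[nxt], tiles_ea[i + 1], TILE_BYTES, nxt, 0, 0);

        /* Wait only for the current tile's tag, then compute. */
        mfc_write_tag_mask(1 << cur);
        mfc_read_tag_status_all();
        compute_tile(buffer[cur]);

        cur = nxt;
    }
}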

SLIDE 10

Cholesky Factorization (SPOTRF)

[Plot: Gflop/s vs. matrix size, 1000–4000; SP peak 204, SGEMM peak 184 (192 with the assembly SGEMM kernel), SPOTRF rising to 174, DP peak 14]

➢ 384×384 → >50 Gflop/s
➢ 512×512 → ~90 Gflop/s
➢ 640×640 → >110 Gflop/s
➢ 1024×1024 → >150 Gflop/s
➢ 2304×2304 → >170 Gflop/s
➢ 1536×1536 → 90% of SGEMM peak
➢ 4096×4096 → 95% of SGEMM peak

SLIDE 11

Triangular Solve (STRSV)

[Plot: GB/s vs. matrix size, 1000–4000; STRSV reaches 23.7 GB/s – 90% of the 25.6 GB/s peak memory bandwidth]

Compute-bound operations of the factorization (SPOTRF) get close to the peak floating-point performance.
Memory-bound operations of the refinement (STRSV) get close to the peak memory bandwidth.
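A back-of-the-envelope check (my arithmetic, not from the slides) of why the solve is bandwidth-bound at n = 4096: the triangular factor alone holds n²/2 single-precision words, and a triangular solve performs about n² flops, so

\[
t_{\mathrm{mem}} \;\ge\; \frac{(n^2/2)\cdot 4\,\mathrm{B}}{25.6\,\mathrm{GB/s}} \approx \frac{33.6\,\mathrm{MB}}{25.6\,\mathrm{GB/s}} \approx 1.3\,\mathrm{ms},
\qquad
t_{\mathrm{flop}} \;\approx\; \frac{n^2}{204\,\mathrm{Gflop/s}} \approx \frac{16.8\,\mathrm{Mflop}}{204\,\mathrm{Gflop/s}} \approx 0.08\,\mathrm{ms}.
\]

Moving the data takes roughly 16× longer than the arithmetic, so STRSV is capped near the memory peak – consistent with the measured 23.7 GB/s (about 90% of 25.6 GB/s).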

SLIDE 12

Performance – CELL Blade

[Plot: Gflop/s vs. matrix size, 1000–4000]
➢ SP peak: 204 Gflop/s
➢ SGEMM peak: 184 Gflop/s
➢ SPOTRF: 174 Gflop/s
➢ SPOSV: 171 Gflop/s
➢ DSPOSV: 155 Gflop/s
➢ DP peak: 14 Gflop/s

SLIDE 13

Performance – Playstation 3

[Plot: Gflop/s vs. matrix size, 500–2000]
➢ SP peak: 153 Gflop/s
➢ SGEMM peak: 138 Gflop/s
➢ SPOTRF: 126 Gflop/s
➢ SPOSV: 122 Gflop/s
➢ DSPOSV: 104 Gflop/s
➢ DP peak: 10 Gflop/s

SLIDE 14

The Need for Automation

      DO 20 J = 1, N, NB
*
*        Update and factorize the current diagonal block and test
*        for non-positive-definiteness.
*
         JB = MIN( NB, N-J+1 )
         CALL DSYRK( 'Lower', 'No transpose', JB, J-1, -ONE,
     $               A( J, 1 ), LDA, ONE, A( J, J ), LDA )
         CALL DPOTF2( 'Lower', JB, A( J, J ), LDA, INFO )
         IF( INFO.NE.0 )
     $      GO TO 30
         IF( J+JB.LE.N ) THEN
*
*           Compute the current block column.
*
            CALL DGEMM( 'No transpose', 'Transpose', N-J-JB+1, JB,
     $                  J-1, -ONE, A( J+JB, 1 ), LDA, A( J, 1 ),
     $                  LDA, ONE, A( J+JB, J ), LDA )
            CALL DTRSM( 'Right', 'Lower', 'Transpose', 'Non-unit',
     $                  N-J-JB+1, JB, ONE, A( J, J ), LDA,
     $                  A( J+JB, J ), LDA )
         END IF
   20 CONTINUE

LAPACK FORTRAN 77 Cholesky factorization:
➢ roughly 20 lines
CELL Cholesky main factorization routine:
➢ roughly 400 lines
!!! Some automation needed !!!

SLIDE 15

Cholesky – Sequential CELL Code

for (i = 0; i < DIM; i++) {
    for (j = 0; j < i; j++) {
        for (k = 0; k < j; k++) {
            sgemm_tile( A[i][k], A[j][k], A[i][j] );
        }
        strsm_tile( A[j][j], A[i][j] );
    }
    for (j = 0; j < i; j++) {
        ssyrk_tile( A[i][j], A[i][i] );
    }
    spotrf_tile( A[i][i] );
}

void sgemm_tile(float *A, float *B, float *C);
void strsm_tile(float *T, float *B);
void ssyrk_tile(float *A, float *C);
void spotrf_tile(float *T);
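A plausible portable rendering of the four tile kernels, using exactly the BLAS/LAPACK calls from slide 7 (64×64 single-precision tiles, row-major); the real CELL kernels are hand-tuned SIMD code, so this is only the mathematically equivalent CBLAS/LAPACKE reference form:

#include <cblas.h>
#include <lapacke.h>

enum { NB = 64 };                     /* tile size used throughout */

/* C = C - A * B^T   (GEMM) */
void sgemm_tile(float *A, float *B, float *C) {
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasTrans, NB, NB, NB,
                -1.0f, A, NB, B, NB, 1.0f, C, NB);
}

/* C = C - A * A^T, lower triangle   (SYRK) */
void ssyrk_tile(float *A, float *C) {
    cblas_ssyrk(CblasRowMajor, CblasLower, CblasNoTrans, NB, NB,
                -1.0f, A, NB, 1.0f, C, NB);
}

/* B = B * T^-T for lower-triangular T   (TRSM) */
void strsm_tile(float *T, float *B) {
    cblas_strsm(CblasRowMajor, CblasRight, CblasLower, CblasTrans,
                CblasNonUnit, NB, NB, 1.0f, T, NB, B, NB);
}

/* T = L * L^T, Cholesky of the diagonal tile   (POTRF) */
void spotrf_tile(float *T) {
    LAPACKE_spotrf(LAPACK_ROW_MAJOR, 'L', NB, T, NB);
}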

SLIDE 16

Cholesky – CELL SuperScalar (BSC)

for (i = 0; i < DIM; i++) {
    for (j = 0; j < i; j++) {
        for (k = 0; k < j; k++) {
            sgemm_tile( A[i][k], A[j][k], A[i][j] );
        }
        strsm_tile( A[j][j], A[i][j] );
    }
    for (j = 0; j < i; j++) {
        ssyrk_tile( A[i][j], A[i][i] );
    }
    spotrf_tile( A[i][i] );
}

#pragma css task input(A[64][64], B[64][64]) inout(C[64][64])
void sgemm_tile(float *A, float *B, float *C);

#pragma css task input(T[64][64]) inout(B[64][64])
void strsm_tile(float *T, float *B);

#pragma css task input(A[64][64]) inout(C[64][64])
void ssyrk_tile(float *A, float *C);
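The slide leaves the diagonal-tile factorization unannotated; by analogy with the three kernels above, its CellSs annotation would presumably look like this (my assumption, not shown in the talk):

/* Assumed annotation for the diagonal-tile Cholesky (not on the slide). */
#pragma css task inout(T[64][64])
void spotrf_tile(float *T);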

SLIDE 17

Cholesky – CELL SuperScalar – Performance

[Performance plots: DAG construction on the fly vs. DAG construction before execution]

SLIDE 18

Block Algorithms – LAPACK QR Factorization

[Diagram: DGEQR2 panel factorization]

Decompose a matrix into factors Q and R, where Q is unitary and R is upper triangular, using Householder reflections.

Factorize the panel – calculate the small R (purple) and a block of Householder reflectors (white).

SLIDE 19

Block Algorithms – LAPACK QR Factorization

[Diagram: DLARFB trailing-matrix update]

Update the trailing submatrix – apply the Householder reflectors to the submatrix to the right of the panel.

SLIDE 20

Tile Algorithms – PLASMA QR Factorization

[Diagram: DGEQR2 on the first tile]

Decompose a matrix into factors Q and R, where Q is unitary and R is upper triangular, using Householder reflections.

Factorize the first tile – calculate the small R (purple) and a small block of Householder reflectors (white).

SLIDE 21

Tile Algorithms – PLASMA QR Factorization

[Diagram: DLARFB update of the first tile row]

Update the first row of tiles.

SLIDE 22

Tile Algorithms – PLASMA QR Factorization

[Diagram: coupled DGEQR2-like factorization]

Couple the first R with the first tile in the same column and compute the QR factorization.

SLIDE 23

Tile Algorithms – PLASMA QR Factorization

[Diagram: DLARFB-like update]

Update the two rows of tiles to the right.

SLIDE 24

Tile Algorithms – PLASMA QR Factorization

[Diagram: coupled DGEQR2-like factorization]

Couple the first R with the second tile in the same column and compute the QR factorization.

SLIDE 25

Tile Algorithms – PLASMA QR Factorization

[Diagram: DLARFB-like update]

Update the two rows of tiles to the right.
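Putting slides 20–25 together, the implied tile QR loop looks roughly as follows; this is a sketch, with kernel names borrowed from the DGEQRT/DTSQRT/DLARFB/DSSRFB naming of the related tile-QR literature, and the T array of block-reflector factors is an assumption:

/* Sketch of the tile QR algorithm stepped through on slides 20-25.
 * A[i][j] are NB x NB tiles; T[i][j] hold the triangular block-reflector
 * factors.  Kernel names and signatures are illustrative assumptions. */
for (k = 0; k < DIM; k++) {
    /* Slide 20: factorize the diagonal tile (DGEQR2-like). */
    dgeqrt_tile( A[k][k], T[k][k] );

    /* Slide 21: update the first row of tiles (DLARFB-like). */
    for (j = k+1; j < DIM; j++)
        dlarfb_tile( A[k][k], T[k][k], A[k][j] );

    for (i = k+1; i < DIM; i++) {
        /* Slides 22, 24: couple R with the next tile in the column. */
        dtsqrt_tile( A[k][k], A[i][k], T[i][k] );

        /* Slides 23, 25: update the two affected rows of tiles. */
        for (j = k+1; j < DIM; j++)
            dssrfb_tile( A[i][k], T[i][k], A[k][j], A[i][j] );
    }
}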

SLIDE 26

PLASMA QR Factorization DAG

Each node is a tile operation.
SLIDE 27

Future

➢ CELL implementation of tile algorithms
  ➢ Cholesky (done), LU, QR
  ➢ Hessenberg, bidiagonal, and tridiagonal reductions
➢ Efficient DAG construction and execution