Parallel Linear Algebra Software for Multi-Core Architectures – PowerPoint PPT Presentation


SLIDE 1

Parallel Linear Algebra Software for Multi-Core Architectures (PLASMA) for the CELL BE

Jakub Kurzak
Georgia Tech CELL Workshop, June 18, 2007

SLIDE 2

Outline

➢ New goals for dense linear algebra on multi-core processors
➢ Hand-crafted code example
  ➢ Solving SPD systems of linear equations
➢ Automated code generation
  ➢ DAG-based execution with CELL SuperScalar
➢ New algorithmic approaches
  ➢ Algorithms by tiles

SLIDE 3

Dense Linear Algebra Software Evolution

LINPACK
➢ Vector machines – BLAS 1, BLAS 2
LAPACK
➢ Cache-based machines – BLAS 3
ScaLAPACK
➢ Distributed memory machines – PBLAS, BLACS, MPI
PLASMA
➢ General framework for multi-core processors, the CELL BE, and distributed memory machines

SLIDE 4

The PLASMA Framework

PLASMA
➢ Mixed-precision algorithms to exploit SIMD ILP
➢ Dynamic DAG-based scheduling
➢ Non-blocking communication
➢ Algorithms by tiles
  ➢ Maximum locality
  ➢ Minimal synchronization
➢ High-performance data representation (BDL – sketched below)
New BLAS required
➢ Block Data Layout
➢ Focus on tile kernel performance
➢ Implementation of new operations
Scalability, scalability, scalability ...
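As an illustration of the Block Data Layout item above, here is a minimal sketch (my own, not the PLASMA API) of tiles stored contiguously so that each one can be moved with a single DMA transfer:

/* Block Data Layout (BDL) sketch -- layout details are an assumption,
 * not the PLASMA API.  The matrix is stored tile by tile: each NB x NB
 * tile occupies one contiguous block of memory, so an SPE can fetch a
 * whole tile with a single DMA instead of NB strided transfers. */
#include <stddef.h>

enum { NB = 64 };   /* tile size used throughout the talk */

/* Pointer to tile (i, j) of an mt-by-nt tile matrix stored in BDL,
 * tiles ordered column-major by tile index. */
static inline float *bdl_tile(float *A, int mt, int i, int j)
{
    return A + ((size_t)j * mt + i) * (NB * NB);
}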

SLIDE 5

Exploiting SIMD ILP in Single Precision

Mixed-precision Iterative Refinement (SPD)

Solve: Ax = b, where A is SPD

L·Lᵀ = cholesky(A)                SINGLE   O(n³)
x = L \ (Lᵀ \ b)                  SINGLE   O(n²)
r = b – A·x                       DOUBLE   O(n²)
WHILE || r || not small enough
    z = L \ (Lᵀ \ r)              SINGLE   O(n²)
    x = x + z                     DOUBLE   O(n¹)
    r = b – A·x                   DOUBLE   O(n²)
END
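A minimal C sketch of this loop, written against the reference LAPACKE/CBLAS interfaces rather than the PLASMA tile kernels; the function name and memory handling are mine, and error checking is omitted:

/* Solve Ax = b (A is n x n SPD, column-major, double precision) with the
 * O(n^3) factorization done in single precision. */
#include <stdlib.h>
#include <cblas.h>
#include <lapacke.h>

int dsposv_sketch(int n, const double *A, const double *b,
                  double *x, double tol, int max_iter)
{
    float  *As = malloc((size_t)n * n * sizeof *As); /* single-precision A   */
    float  *zs = malloc((size_t)n * sizeof *zs);     /* single-precision RHS */
    double *r  = malloc((size_t)n * sizeof *r);      /* double residual      */
    int i, it, info;

    for (i = 0; i < n * n; i++) As[i] = (float)A[i];

    /* L * L^T = cholesky(A)                SINGLE  O(n^3) */
    info = LAPACKE_spotrf(LAPACK_COL_MAJOR, 'L', n, As, n);
    if (info != 0) goto done;

    /* x = L \ (L^T \ b)                    SINGLE  O(n^2) */
    for (i = 0; i < n; i++) zs[i] = (float)b[i];
    LAPACKE_spotrs(LAPACK_COL_MAJOR, 'L', n, 1, As, n, zs, n);
    for (i = 0; i < n; i++) x[i] = (double)zs[i];

    for (it = 0; it < max_iter; it++) {
        /* r = b - A * x                    DOUBLE  O(n^2) */
        for (i = 0; i < n; i++) r[i] = b[i];
        cblas_dgemv(CblasColMajor, CblasNoTrans, n, n,
                    -1.0, A, n, x, 1, 1.0, r, 1);
        if (cblas_dnrm2(n, r, 1) <= tol)   /* || r || small enough? */
            break;

        /* z = L \ (L^T \ r)                SINGLE  O(n^2) */
        for (i = 0; i < n; i++) zs[i] = (float)r[i];
        LAPACKE_spotrs(LAPACK_COL_MAJOR, 'L', n, 1, As, n, zs, n);

        /* x = x + z                        DOUBLE  O(n) */
        for (i = 0; i < n; i++) x[i] += (double)zs[i];
    }
done:
    free(As); free(zs); free(r);
    return info;
}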

SLIDE 6

Algorithms by Tiles – Cholesky Factorization

[Tile diagram: diagonal tile T, panel tiles A, B, and updated tile C]

T = T – A × Aᵀ   →  SYRK
T = L·Lᵀ         →  POTRF
C = C – B × Aᵀ   →  GEMM
C = C \ T        →  TRSM

SLIDE 7

Building Blocks – SIMD Tile Kernels

Equivalent BLAS / LAPACK call                  .c [LOC]  .o [KB]  Compilation  Time [µs]  Speed [Gflop/s]

cblas_sgemm( CblasRowMajor, CblasNoTrans,        330      9.0    spu-gcc -Os    22.78        23.01
    CblasTrans, 64, 64, 64, -1.0, A, 64,
    B, 64, 1.0, C, 64 );

cblas_ssyrk( CblasRowMajor, CblasLower,          160      4.7    spuxlc -O3     13.23        20.12
    CblasNoTrans, 64, 64, -1.0, A, 64,
    1.0, C, 64 );

cblas_strsm( CblasRowMajor, CblasRight,          310      8.2    spuxlc -O3     16.47        15.91
    CblasLower, CblasTrans, CblasNonUnit,
    64, 64, 1.0, T, 64, B, 64 );

lapack_spotrf( lapack_lower, 64,                 340      4.1    spu-gcc -O3    15.16         5.84
    trans(A), 64, &info );

With the hand-written assembly SGEMM kernel: 24.00 Gflop/s.

Real men program in assembly.

SLIDE 8

DAG-based Dependency Tracking

[DAG diagram: nodes are tile tasks labeled by tile indices 1:1, 1:2, 2:2, ...]

Dependencies expressed by the DAG are enforced on a tile basis:
➢ fine-grained parallelization
➢ flexible scheduling
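To make the tile-level dependency tracking concrete, the following self-contained sketch (my own illustration, not the PLASMA or CellSs scheduler) derives the DAG edges of the tile Cholesky loop by recording, for every tile, the last task that wrote it; a task depends on the last writer of each tile it touches:

/* Derive DAG edges for the tile Cholesky loop via last-writer tracking. */
#include <stdio.h>

#define DIM 4

static int last_writer[DIM][DIM];  /* task id that last wrote a tile, 0 = none */
static int ntasks = 0;

static int task(const char *name, int i, int j) {
    int id = ++ntasks;
    printf("task %d: %s(%d,%d)\n", id, name, i, j);
    return id;
}

static void edge(int from, int to) {
    if (from) printf("  edge %d -> %d\n", from, to);
}

int main(void) {
    for (int i = 0; i < DIM; i++) {
        for (int j = 0; j < i; j++) {
            for (int k = 0; k < j; k++) {
                int t = task("sgemm", i, j);        /* reads A[i][k], A[j][k] */
                edge(last_writer[i][k], t);
                edge(last_writer[j][k], t);
                edge(last_writer[i][j], t);         /* updates A[i][j] */
                last_writer[i][j] = t;
            }
            int t = task("strsm", i, j);            /* reads A[j][j] */
            edge(last_writer[j][j], t);
            edge(last_writer[i][j], t);             /* updates A[i][j] */
            last_writer[i][j] = t;
        }
        for (int j = 0; j < i; j++) {
            int t = task("ssyrk", i, i);            /* reads A[i][j] */
            edge(last_writer[i][j], t);
            edge(last_writer[i][i], t);             /* updates A[i][i] */
            last_writer[i][i] = t;
        }
        int t = task("spotrf", i, i);               /* factorizes A[i][i] */
        edge(last_writer[i][i], t);
        last_writer[i][i] = t;
    }
    return 0;
}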

SLIDE 9

Pipelining & Double Buffering

Pipelining:
➢ Between loop iterations.
Double Buffering:
➢ Within BLAS,
➢ Between BLAS,
➢ Between loop iterations.
Result:
➢ Minimum load imbalance,
➢ Minimum dependency stalls,
➢ Minimum memory stalls (no waiting for data).
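A minimal sketch of the intra-loop double-buffering pattern on an SPE, using the standard spu_mfcio.h DMA intrinsics; the tile size, the compute_tile() kernel, and the tiles_ea effective-address list are illustrative assumptions, and the write-back of results is omitted:

/* While the kernel works on buffer[cur], the MFC prefetches the next
 * tile into buffer[cur ^ 1], overlapping computation and communication. */
#include <spu_mfcio.h>

#define TILE_BYTES (64 * 64 * sizeof(float))   /* 16 KB = one DMA transfer */

volatile float buffer[2][64 * 64] __attribute__((aligned(128)));

extern void compute_tile(volatile float *tile); /* hypothetical kernel */

void process_tiles(const unsigned long long *tiles_ea, int ntiles)
{
    int i, cur = 0;

    /* Prefetch the first tile (DMA tag = buffer index). */
    mfc_get(buffer[cur], tiles_ea[0], TILE_BYTES, cur, 0, 0);

    for (i = 0; i < ntiles; i++) {
        int nxt = cur ^ 1;

        /* Start the DMA for the next tile before computing on this one. */
        if (i + 1 < ntiles)
            mfc_get(buffer[nxt], tiles_ea[i + 1], TILE_BYTES, nxt, 0, 0);

        /* Wait only for the current tile's tag, then compute. */
        mfc_write_tag_mask(1 << cur);
        mfc_read_tag_status_all();
        compute_tile(buffer[cur]);

        cur = nxt;
    }
}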

SLIDE 10

Cholesky Factorization (SPOTRF)

[Plot: Gflop/s vs. matrix size, 1000–4000; SP peak 204, SGEMM peak 184 (192 with the assembly SGEMM kernel), SPOTRF rising to 174, DP peak 14]

➢ 384×384 → >50 Gflop/s
➢ 512×512 → ~90 Gflop/s
➢ 640×640 → >110 Gflop/s
➢ 1024×1024 → >150 Gflop/s
➢ 2304×2304 → >170 Gflop/s
➢ 1536×1536 → 90% of SGEMM peak
➢ 4096×4096 → 95% of SGEMM peak

SLIDE 11

Triangular Solve (STRSV)

[Plot: GB/s vs. matrix size, 1000–4000; STRSV reaches 23.7 GB/s – 90% of the 25.6 GB/s peak memory bandwidth]

Compute-bound operations of the factorization (SPOTRF) get close to the peak floating-point performance.
Memory-bound operations of the refinement (STRSV) get close to the peak memory bandwidth.
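A back-of-the-envelope check (my arithmetic, not from the slides) of why the solve is bandwidth-bound at n = 4096: the triangular factor alone holds n²/2 single-precision words, and a triangular solve performs about n² flops, so

\[
t_{\mathrm{mem}} \;\ge\; \frac{(n^2/2)\cdot 4\,\mathrm{B}}{25.6\,\mathrm{GB/s}} \approx \frac{33.6\,\mathrm{MB}}{25.6\,\mathrm{GB/s}} \approx 1.3\,\mathrm{ms},
\qquad
t_{\mathrm{flop}} \;\approx\; \frac{n^2}{204\,\mathrm{Gflop/s}} \approx \frac{16.8\,\mathrm{Mflop}}{204\,\mathrm{Gflop/s}} \approx 0.08\,\mathrm{ms}.
\]

Moving the data takes roughly 16× longer than the arithmetic, so STRSV is capped near the memory peak – consistent with the measured 23.7 GB/s (about 90% of 25.6 GB/s).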

SLIDE 12

Performance – CELL Blade

[Plot: Gflop/s vs. matrix size, 1000–4000]
➢ SP peak: 204 Gflop/s
➢ SGEMM peak: 184 Gflop/s
➢ SPOTRF: 174 Gflop/s
➢ SPOSV: 171 Gflop/s
➢ DSPOSV: 155 Gflop/s
➢ DP peak: 14 Gflop/s

SLIDE 13

Performance – Playstation 3

[Plot: Gflop/s vs. matrix size, 500–2000]
➢ SP peak: 153 Gflop/s
➢ SGEMM peak: 138 Gflop/s
➢ SPOTRF: 126 Gflop/s
➢ SPOSV: 122 Gflop/s
➢ DSPOSV: 104 Gflop/s
➢ DP peak: 10 Gflop/s

SLIDE 14

The Need for Automation

      DO 20 J = 1, N, NB
*
*        Update and factorize the current diagonal block and test
*        for non-positive-definiteness.
*
         JB = MIN( NB, N-J+1 )
         CALL DSYRK( 'Lower', 'No transpose', JB, J-1, -ONE,
     $               A( J, 1 ), LDA, ONE, A( J, J ), LDA )
         CALL DPOTF2( 'Lower', JB, A( J, J ), LDA, INFO )
         IF( INFO.NE.0 )
     $      GO TO 30
         IF( J+JB.LE.N ) THEN
*
*           Compute the current block column.
*
            CALL DGEMM( 'No transpose', 'Transpose', N-J-JB+1, JB,
     $                  J-1, -ONE, A( J+JB, 1 ), LDA, A( J, 1 ),
     $                  LDA, ONE, A( J+JB, J ), LDA )
            CALL DTRSM( 'Right', 'Lower', 'Transpose', 'Non-unit',
     $                  N-J-JB+1, JB, ONE, A( J, J ), LDA,
     $                  A( J+JB, J ), LDA )
         END IF
   20 CONTINUE

LAPACK FORTRAN 77 Cholesky factorization:
➢ roughly 20 lines
CELL Cholesky main factorization routine:
➢ roughly 400 lines
!!! Some automation needed !!!

SLIDE 15

Cholesky – Sequential CELL Code

for (i = 0; i < DIM; i++) {
    for (j = 0; j < i; j++) {
        for (k = 0; k < j; k++) {
            sgemm_tile( A[i][k], A[j][k], A[i][j] );
        }
        strsm_tile( A[j][j], A[i][j] );
    }
    for (j = 0; j < i; j++) {
        ssyrk_tile( A[i][j], A[i][i] );
    }
    spotrf_tile( A[i][i] );
}

void sgemm_tile(float *A, float *B, float *C);
void strsm_tile(float *T, float *B);
void ssyrk_tile(float *A, float *C);
void spotrf_tile(float *T);
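A plausible portable rendering of the four tile kernels, using exactly the BLAS/LAPACK calls from slide 7 (64×64 single-precision tiles, row-major); the real CELL kernels are hand-tuned SIMD code, so this is only the mathematically equivalent CBLAS/LAPACKE reference form:

#include <cblas.h>
#include <lapacke.h>

enum { NB = 64 };                     /* tile size used throughout */

/* C = C - A * B^T   (GEMM) */
void sgemm_tile(float *A, float *B, float *C) {
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasTrans, NB, NB, NB,
                -1.0f, A, NB, B, NB, 1.0f, C, NB);
}

/* C = C - A * A^T, lower triangle   (SYRK) */
void ssyrk_tile(float *A, float *C) {
    cblas_ssyrk(CblasRowMajor, CblasLower, CblasNoTrans, NB, NB,
                -1.0f, A, NB, 1.0f, C, NB);
}

/* B = B * T^-T for lower-triangular T   (TRSM) */
void strsm_tile(float *T, float *B) {
    cblas_strsm(CblasRowMajor, CblasRight, CblasLower, CblasTrans,
                CblasNonUnit, NB, NB, 1.0f, T, NB, B, NB);
}

/* T = L * L^T, Cholesky of the diagonal tile   (POTRF) */
void spotrf_tile(float *T) {
    LAPACKE_spotrf(LAPACK_ROW_MAJOR, 'L', NB, T, NB);
}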

SLIDE 16

Cholesky – CELL SuperScalar (BSC)

for (i = 0; i < DIM; i++) {
    for (j = 0; j < i; j++) {
        for (k = 0; k < j; k++) {
            sgemm_tile( A[i][k], A[j][k], A[i][j] );
        }
        strsm_tile( A[j][j], A[i][j] );
    }
    for (j = 0; j < i; j++) {
        ssyrk_tile( A[i][j], A[i][i] );
    }
    spotrf_tile( A[i][i] );
}

#pragma css task input(A[64][64], B[64][64]) inout(C[64][64])
void sgemm_tile(float *A, float *B, float *C);

#pragma css task input(T[64][64]) inout(B[64][64])
void strsm_tile(float *T, float *B);

#pragma css task input(A[64][64]) inout(C[64][64])
void ssyrk_tile(float *A, float *C);
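The slide leaves the diagonal-tile factorization unannotated; by analogy with the three kernels above, its CellSs annotation would presumably look like this (my assumption, not shown in the talk):

/* Assumed annotation for the diagonal-tile Cholesky (not on the slide). */
#pragma css task inout(T[64][64])
void spotrf_tile(float *T);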

SLIDE 17

Cholesky – CELL SuperScalar – Performance

[Performance plots: DAG construction on the fly vs. DAG construction before execution]

SLIDE 18

Block Algorithms – LAPACK QR Factorization

[Diagram: DGEQR2 panel factorization]

Decompose a matrix into factors Q and R, where Q is unitary and R is upper triangular, using Householder reflections.

Factorize the panel – calculate the small R (purple) and a block of Householder reflectors (white).

SLIDE 19

Block Algorithms – LAPACK QR Factorization

[Diagram: DLARFB trailing-matrix update]

Update the trailing submatrix – apply the Householder reflectors to the submatrix to the right of the panel.

SLIDE 20

Tile Algorithms – PLASMA QR Factorization

[Diagram: DGEQR2 on the first tile]

Decompose a matrix into factors Q and R, where Q is unitary and R is upper triangular, using Householder reflections.

Factorize the first tile – calculate the small R (purple) and a small block of Householder reflectors (white).

SLIDE 21

Tile Algorithms – PLASMA QR Factorization

[Diagram: DLARFB update of the first tile row]

Update the first row of tiles.

SLIDE 22

Tile Algorithms – PLASMA QR Factorization

[Diagram: coupled DGEQR2-like factorization]

Couple the first R with the first tile in the same column and compute the QR factorization.

SLIDE 23

Tile Algorithms – PLASMA QR Factorization

[Diagram: DLARFB-like update]

Update the two rows of tiles to the right.

SLIDE 24

Tile Algorithms – PLASMA QR Factorization

[Diagram: coupled DGEQR2-like factorization]

Couple the first R with the second tile in the same column and compute the QR factorization.

SLIDE 25

Tile Algorithms – PLASMA QR Factorization

[Diagram: DLARFB-like update]

Update the two rows of tiles to the right.
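Putting slides 20–25 together, the implied tile QR loop looks roughly as follows; this is a sketch, with kernel names borrowed from the DGEQRT/DTSQRT/DLARFB/DSSRFB naming of the related tile-QR literature, and the T array of block-reflector factors is an assumption:

/* Sketch of the tile QR algorithm stepped through on slides 20-25.
 * A[i][j] are NB x NB tiles; T[i][j] hold the triangular block-reflector
 * factors.  Kernel names and signatures are illustrative assumptions. */
for (k = 0; k < DIM; k++) {
    /* Slide 20: factorize the diagonal tile (DGEQR2-like). */
    dgeqrt_tile( A[k][k], T[k][k] );

    /* Slide 21: update the first row of tiles (DLARFB-like). */
    for (j = k+1; j < DIM; j++)
        dlarfb_tile( A[k][k], T[k][k], A[k][j] );

    for (i = k+1; i < DIM; i++) {
        /* Slides 22, 24: couple R with the next tile in the column. */
        dtsqrt_tile( A[k][k], A[i][k], T[i][k] );

        /* Slides 23, 25: update the two affected rows of tiles. */
        for (j = k+1; j < DIM; j++)
            dssrfb_tile( A[i][k], T[i][k], A[k][j], A[i][j] );
    }
}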

SLIDE 26

PLASMA QR Factorization DAG

Each node is a tile operation.
SLIDE 27

Future

➢ CELL implementation of tile algorithms
  ➢ Cholesky (done), LU, QR
  ➢ Hessenberg, bidiagonal, and tridiagonal reductions
➢ Efficient DAG construction and execution