SLIDE 1

Feeding of the Thousands – Leveraging the GPU's Computing Power for Sparse Linear Algebra

SPPEXA Annual Meeting 2016, January 25th, 2016, Garching, Germany
Hartwig Anzt

SLIDE 2

Inherently parallel operations

  • axpy, copy, gemv...
  • usually memory bound
  • Kernel fusion
Sparse matrix vector product

  • often computationally the most expensive part
  • variety of storage formats, kernels…
  • row-thread mapping can result in imbalance
  • malicious memory access
Bottlenecks

  • sequential operations, unstructured, random memory access
  • incomplete factorizations (ILU/IC)
  • sparse triangular solves

Sparse Linear Algebra on GPUs

SLIDE 3

  • Sliced Ellpack (SELL) format as trade-off between CSR and Ellpack

  • Sorting can improve load-balancing

Sparse Matrix Vector Product (SpMV)

Kreutzer et al.: A Unified Sparse Matrix Data Format for Efficient General Sparse Matrix-Vector Multiplication on Modern Processors with Wide SIMD Units, SISC 36(5), 2014.

[Figure: sparse storage formats illustrated on a small example matrix: CSR (values, colind, rowptr, where rowptr points to the first element in each row), ELL (values and colind padded to a uniform row length), and SELL-P (values, colind, rowptr, where rowptr points to the first element in each block and X marks zero padding).]
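To make the layout difference concrete, here is a minimal C sketch of the two containers (field names are illustrative, not MAGMA's actual interface):

    /* Hypothetical sketch of CSR vs. SELL-P storage; names are illustrative. */
    typedef struct {
        int     n;       /* number of rows                                 */
        int     nnz;     /* number of stored nonzeros                      */
        double *values;  /* nonzero values, row by row                     */
        int    *colind;  /* column index of each stored value              */
        int    *rowptr;  /* n+1 entries; rowptr[i] = first value of row i  */
    } csr_matrix;

    typedef struct {
        int     n;          /* number of rows                              */
        int     blocksize;  /* rows per slice, e.g. 8                      */
        double *values;     /* per slice: column-major, zero-padded values */
        int    *colind;     /* matching column indices (padding -> col 0)  */
        int    *rowptr;     /* rowptr[s] = first value of slice s          */
    } sellp_matrix;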

SLIDE 4

  • Assign multiple threads to each row
  • 2-dimensional thread blocks

Sparse Matrix Vector Product (SpMV)

[Figure: 2D thread blocks TB 0 and TB 1; the partial products within each row are combined by a shared-memory reduction.]

Kreutzer et al.: A Unified Sparse Matrix Data Format for Efficient General Sparse Matrix-Vector Multiplication on Modern Processors with Wide SIMD Units, SISC 36(5), 2014.
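A minimal CUDA sketch of the idea (illustrative only, not the MAGMA kernel); it assumes blockDim.x equals the SELL-P blocksize, blockDim.y is a power of two, and padding entries store the value 0 with a valid column index:

    __global__ void sellp_spmv(int n, const double *values, const int *colind,
                               const int *rowptr, const double *x, double *y)
    {
        extern __shared__ double partial[];  /* blockDim.x * blockDim.y doubles */
        int row  = blockIdx.x * blockDim.x + threadIdx.x;  /* global row        */
        int lrow = threadIdx.x;                            /* row within slice  */
        int lane = threadIdx.y;                            /* worker within row */

        double sum = 0.0;
        if (row < n) {
            int start = rowptr[blockIdx.x];
            int len   = (rowptr[blockIdx.x + 1] - start) / blockDim.x;
            /* slice is column-major: entry k of the local row sits at
               start + k * blockDim.x + lrow                            */
            for (int k = lane; k < len; k += blockDim.y) {
                int idx = start + k * blockDim.x + lrow;
                sum += values[idx] * x[colind[idx]];
            }
        }
        partial[lane * blockDim.x + lrow] = sum;
        __syncthreads();

        /* tree reduction over the row workers in shared memory */
        for (int s = blockDim.y / 2; s > 0; s >>= 1) {
            if (lane < s)
                partial[lane * blockDim.x + lrow] +=
                    partial[(lane + s) * blockDim.x + lrow];
            __syncthreads();
        }
        if (lane == 0 && row < n)
            y[row] = partial[lrow];
    }

Launched with one block per slice, dim3 block(blocksize, t), and blocksize * t * sizeof(double) bytes of shared memory.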

SLIDE 5

Sparse Matrix Vector Product with multiple Vectors (SpMM)

  • 3-dimensional thread blocks for processing multiple vectors simultaneously

Anzt et al.: Energy efficiency and performance frontiers for sparse computations on GPU supercomputers, PMAM 2015.

[Figure: the 2D thread-block scheme (TB 0, TB 1, shared-memory reduction) replicated along a third thread-block dimension, one layer per vector, so that multiple vectors are multiplied simultaneously.]
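On the host side, the third dimension can be configured roughly as follows (a hedged sketch inside a host function; sellp_spmm and the chosen block shape are illustrative, not the talk's actual code):

    /* 3D launch: x = rows per slice, y = workers per row, z = vectors. */
    dim3 block(8, 4, 4);                          /* 8 x 4 x 4 = 128 threads */
    dim3 grid((n + block.x - 1) / block.x, 1,
              (num_vecs + block.z - 1) / block.z);
    size_t shmem = block.x * block.y * block.z * sizeof(double);
    /* sellp_spmm<<<grid, block, shmem>>>(n, values, colind, rowptr,
                                          num_vecs, x, y);   (hypothetical) */

Reading each matrix entry once per thread block and reusing it for all vectors in the z-dimension is what lifts SpMM above the SpMV bandwidth bound.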

SLIDE 6

Sparse Matrix Vector Product with multiple Vectors (SpMM)

  • 3-dimensional thread blocks for processing multiple vectors simultaneously
  • Performance on NVIDIA K40, 64 vectors, DP:

Anzt et al.: Energy efficiency and performance frontiers for sparse computations on GPU supercomputers, PMAM 2015.

[Figure: double-precision SpMM performance in GFLOP/s (20 to 120) comparing cuSPARSE CSRSpMM, cuSPARSE CSRSpMM v2, and MAGMA SELL-P SpMM on audikw_1, bmw3_2, bmwcra_1, bone_010, bone_S10, cant, crankseg_2, F1, Fault_639, Hook_1498, inline_1, ldoor, pwtk, Stoc_1465, stomach, and xenon_2.]

SLIDE 7

  • Memory bandwidth in many cases the performance bottleneck.
  • Sequence of consecutive vector updates (BLAS 1) benefits from enhanced data locality.
  • Design of algorithm-specific kernels.

Kernel Fusion in Sparse Iterative Algorithms

[Figure: kernel1 and kernel2 launched separately, each thread block reading and writing (r+w) every vector element, versus one fused kernel that touches the data a single time.]
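To illustrate the principle, a minimal sketch (not code from the talk): two consecutive BLAS-1 updates fused into one kernel, so y is loaded and stored once instead of twice.

    /* Unfused: two launches, y traverses memory twice. */
    __global__ void axpy(int n, double a, const double *x, double *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }
    __global__ void scal(int n, double b, double *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = b * y[i];
    }

    /* Fused: y = b * (a*x + y) in a single pass over the data. */
    __global__ void axpy_scal_fused(int n, double a, double b,
                                    const double *x, double *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = b * (a * x[i] + y[i]);
    }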

SLIDE 8

  • Memory bandwidth in many cases the performance bottleneck.
  • Sequence of consecutive vector updates (BLAS 1) benefits from enhanced data locality.

Kernel Fusion in Sparse Iterative Algorithms

    // Fused version: five custom kernels per iteration, one device-to-host copy.
    while ( ( k < maxiter ) && ( res > epsilon ) ) {
        scalar_fusion_1 <<<Gs, Bs, Ms>>> ( n, rowA, colA, valA, d, z, beta, rho, gamma, vtmp );
        fusion_2 ( Gs, Bs, Ms, n, beta, rho, vtmp );
        fusion_3 <<<Gs, Bs, Ms>>> ( n, rho, d, x, z, r, vtmp );
        fusion_4 ( Gs, Bs, Ms, n, vtmp, vtmp2 );
        fusion_5 <<<Gs, Bs>>> ( n, beta, gamma, alpha, d, r, vtmp );
        cudaMemcpy( &res, beta, sizeof(float), cudaMemcpyDeviceToHost );
        res = sqrt( res );
        k++;
    } // end-while

    // Reference version: one library call per operation (SpMV plus BLAS 1).
    while ( ( k < maxiter ) && ( res > epsilon ) ) {
        Scalar_SpMV <<<Gs, Bs>>> ( n, rowA, colA, valA, d, z );
        tmp   = cublasSdot ( n, d, 1, z, 1 );
        rho   = beta / tmp;
        gamma = beta;
        cublasSaxpy ( n,  rho, d, 1, x, 1 );
        cublasSaxpy ( n, -rho, z, 1, r, 1 );
        tmp   = cublasSdot ( n, r, 1, r, 1 );
        beta  = tmp;
        alpha = beta / gamma;
        cublasSscal ( n, alpha, d, 1 );
        cublasSaxpy ( n, one, r, 1, d, 1 );
        res = sqrt( beta );
        k++;
    } // end-while

Aliaga et al.: Reformulated Conjugate Gradient for the Energy-Aware Solution of Linear Systems on GPUs, Parallel Processing (ICPP), 2013.

SLIDE 9

  • Which operations can be merged into a single kernel?

  • kernels compatible in terms of component-thread mapping
  • example classification for Jacobi-CG:

Kernel Fusion in Sparse Iterative Algorithms

Aliaga et al.: Systematic Fusion of CUDA Kernels for Iterative Sparse Linear System Solvers, Euro-Par 2015, LNCS 9233, 2015.

Operation      Formula       Input vector(s)   Output
AXPY           y = αx + y    mapped            mapped
COPY           y = x         mapped            mapped
DOT            α = ⟨x, y⟩    mapped            unmapped
Jacobi-Prec    y = M⁻¹x      mapped            mapped
SpMV (CSR)     y = Ax        unmapped          mapped
SpMV (ELL)     y = Ax        unmapped          mapped
SpMV (SELL-P)  y = Ax        unmapped          unmapped
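Why the DOT output is "unmapped": each thread owns one component of the inputs, but the scalar result comes out of a cross-thread reduction. A minimal sketch (illustrative, not the paper's code; assumes blockDim.x is a power of two, and a second pass combines the per-block sums):

    __global__ void dot_partial(int n, const double *x, const double *y,
                                double *block_sums)
    {
        extern __shared__ double tmp[];                   /* blockDim.x doubles */
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        tmp[threadIdx.x] = (i < n) ? x[i] * y[i] : 0.0;   /* mapped part   */
        __syncthreads();
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {    /* unmapped part */
            if (threadIdx.x < s) tmp[threadIdx.x] += tmp[threadIdx.x + s];
            __syncthreads();
        }
        if (threadIdx.x == 0) block_sums[blockIdx.x] = tmp[0];
    }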

SLIDE 10

  • Which operations can be merged into a single kernel?

  • Which kernels do we want to merge?

  • performance vs. flexibility…

Kernel Fusion in Sparse Iterative Algorithms

Operation      Formula       Input vector(s)   Output
AXPY           y = αx + y    mapped            mapped
COPY           y = x         mapped            mapped
DOT            α = ⟨x, y⟩    mapped            unmapped
Jacobi-Prec    y = M⁻¹x      mapped            mapped
SpMV (CSR)     y = Ax        unmapped          mapped
SpMV (ELL)     y = Ax        unmapped          mapped
SpMV (SELL-P)  y = Ax        unmapped          unmapped

Code Jacobi-preconditioned J-CG, J-BiCGSTAB, J-IDR, J-GMRES…? How about ILU/IC?

SLIDE 11

  • Which operations can be merged into a single kernel?

  • Which kernels do we want to merge?

  • performance vs. flexibility…

Kernel Fusion in Sparse Iterative Algorithms

Operation      Formula       Input vector(s)   Output
AXPY           y = αx + y    mapped            mapped
COPY           y = x         mapped            mapped
DOT            α = ⟨x, y⟩    mapped            unmapped
Jacobi-Prec    y = M⁻¹x      mapped            mapped
SpMV (CSR)     y = Ax        unmapped          mapped
SpMV (ELL)     y = Ax        unmapped          mapped
SpMV (SELL-P)  y = Ax        unmapped          unmapped

One stand-alone code for each SpMV kernel?

SLIDE 12

  • Which operations can be merged into a single kernel?

  • Which kernels do we want to merge?

  • performance vs. flexibility…

Kernel Fusion in Sparse Iterative Algorithms

[Figure: runtime in seconds (1 to 10) of basic JCG, fusion JCG, and Jacobi-fusion for the matrices AUD, G3, INL, and LDO.]

Matrix   Size        Nonzeros
AUD        943,695   77,651,847
G3       1,585,478    7,660,826
INL        503,712   36,816,342
LDO        952,203   46,522,475

  • Fusing the Jacobi preconditioner into the solver benefits very large and sparse matrices.
  • Smaller benefits for more complex algorithms (BiCGSTAB, CGS, QMR, IDR…)
SLIDE 13

  • How close can kernel fusion bring us to the theoretical performance bound induced by memory bandwidth?

  • Cooperation with Moritz Kreutzer, Eduardo Ponce.
  • NVIDIA K40, theoretical bandwidth: 288 GB/s…

Kernel Fusion in Sparse Iterative Algorithms

[Figure: achieved memory bandwidth b in GB/s (40 to 200) over vector lengths from 10^3 to 10^7 for SCR, TDK, and WEB; the curves level off around b = 193 GB/s.]

Matrix   Size
SCR        170,998
TDK        204,316
WEB      1,000,005
DIE      1,157,456

SLIDE 14

  • Efficiency compared to the roofline performance model P = min(P_peak, I·b) GFLOP/s, with P_peak the theoretical compute peak, I the computational intensity, and b the memory bandwidth.

Kernel Fusion in Sparse Iterative Algorithms

[Figure: efficiency in percent (10 to 100) relative to the roofline bound for IDR(1), IDR(2), IDR(4), and IDR(8) on the matrices below. IDR(s) is a general Krylov solver. Joint work with Moritz Kreutzer, Eduardo Ponce.]

Matrix   Size        Nonzeros
SCR        170,998       958,936
TDK        204,316     2,846,228
WEB      1,000,005     3,105,536
NLP      1,062,400    28,704,672
DIE      1,157,456    48,538,952
THM      1,228,045     8,580,313
AFS      1,508,065    52,672,325
MLG      1,504,002   110,879,972
G3       1,585,478     7,660,826
TRA      1,602,111    23,500,731
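For a concrete feel of the bound, a small worked example (the peak and intensity values are illustrative assumptions; only b = 193 GB/s is taken from the previous slide):

    #include <stdio.h>

    /* Roofline: attainable performance P = min(P_peak, I * b). */
    int main(void) {
        double p_peak = 1430.0; /* nominal K40 DP peak, GFLOP/s (assumed)   */
        double b      = 193.0;  /* measured bandwidth, GB/s (slide 13)      */
        double I      = 0.25;   /* flop/byte, illustrative BLAS-1-like value */
        double P      = I * b < p_peak ? I * b : p_peak;
        printf("attainable performance: %.1f GFLOP/s\n", P);  /* ~48 */
        return 0;
    }

At such low intensities the compute peak is irrelevant; the bandwidth term alone sets the bound, which is why efficiency is reported against min(P_peak, I·b) rather than against P_peak.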

SLIDE 15

  • Concurrent kernel execution to exploit unused GPU compute resources.
  • Rare in sparse linear algebra (most algorithms are composed of memory-bound operations).
  • Small benefits if datasets are too small to saturate the memory bandwidth.

Kernel Overlap in Sparse Iterative Algorithms

[Figure: performance improvement in percent (1 to 10) from concurrent kernel execution for IDR(1), IDR(2), IDR(4), and IDR(8) on the matrices below.]

Matrix   Size
SCR        170,998
TDK        204,316
WEB      1,000,005
DIE      1,157,456
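Concurrent execution is typically expressed with CUDA streams; a minimal host-side sketch (kernel_a and kernel_b are placeholders for independent kernels, not names from the talk):

    /* Independent kernels launched into separate streams can overlap
       when neither one alone saturates the GPU.                       */
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    /* kernel_a<<<grid_a, block_a, 0, s1>>>(...);   placeholder kernels, */
    /* kernel_b<<<grid_b, block_b, 0, s2>>>(...);   independent work only */
    cudaStreamSynchronize(s1);
    cudaStreamSynchronize(s2);
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);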

SLIDE 16

Operations not parallelizable to GPU thread concurrency:
  • sequential operations
  • unstructured, random memory access
  • incomplete factorizations (ILU/IC)
  • sparse triangular solves

Bottlenecks

SLIDE 17

Operations not parallelizable to GPU thread concurrency:
  • sequential operations
  • unstructured, random memory access
  • incomplete factorizations (ILU/IC)
  • sparse triangular solves
Don't even try – rethink the problem!

  • A different algorithm may take you to the same goal.
  • Choose algorithms with fine-grained parallelism, avoid synchronizations.

  • Sparse iterative solvers provide approximations: relax the bit-wise reproducibility criterion. Most popular example: iterative ILU (Chow et al.)

Bottlenecks

Chow et al.: Fine-grained Parallel Incomplete LU Factorization, SIAM Journal on Scientific Computing, 37, pp. C169–C193, 2015.
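The core of the Chow et al. approach is a fixed-point sweep over all entries of the sparsity pattern; a simplified dense-indexing C sketch (the paper works on sparse data structures, and the parallel version updates all entries from the previous sweep's values rather than in place):

    /* One fixed-point sweep of iterative ILU (after Chow et al. 2015).
       L and U hold an initial guess (e.g. taken from A); L has an
       implicit unit diagonal. pattern[i*n+j] marks (i,j) in S.        */
    void ilu_sweep(int n, const double *A, double *L, double *U,
                   const int *pattern)
    {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                if (!pattern[i*n + j]) continue;
                double s = A[i*n + j];
                int m = i < j ? i : j;                 /* min(i, j) */
                for (int k = 0; k < m; k++)            /* subtract l_ik u_kj */
                    s -= L[i*n + k] * U[k*n + j];
                if (i > j) L[i*n + j] = s / U[j*n + j]; /* strict lower: l_ij */
                else       U[i*n + j] = s;              /* upper incl. diag   */
            }
    }

Every pattern entry can be updated independently, which is exactly the fine-grained parallelism a GPU needs.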

SLIDE 18

Example: sparse triangular solves in ILU preconditioning:

  • inherently sequential
  • parallelism from level-scheduling/multi-color ordering
  • unable to exploit fine-grained parallelism of GPUs

Bottlenecks

SLIDE 19

Example: sparse triangular solves in ILU preconditioning:

  • inherently sequential
  • parallelism from level-scheduling/multi-color ordering
  • unable to exploit fine-grained parallelism of GPUs
Take an unconventional approach: approximate sparse triangular solves

  • Replace forward/backward substitutions with an iterative method.
  • Only low solution accuracy is required, as LU ≈ A is typically just a rough approximation.
  • Better scalability of iterative methods.
  • Jacobi converges if the spectral radius of the iteration matrix M is smaller than 1: x_{k+1} = D⁻¹b + M x_k
  • Performance depends on SpMV.

Bottlenecks

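A minimal C sketch of the resulting approximate triangular solve (dense indexing for brevity; on the GPU the i-loop becomes one thread per row and the inner loop an SpMV-like product):

    /* Jacobi sweeps for an approximate lower-triangular solve L x = b:
       x_{k+1} = D^{-1}(b - E x_k), with D = diag(L) and E = L - D.
       Every row update is independent within a sweep. The final
       iterate is left in x.                                           */
    void jacobi_trisolve(int n, const double *L, const double *b,
                         double *x, double *xnew, int sweeps)
    {
        for (int k = 0; k < sweeps; k++) {
            for (int i = 0; i < n; i++) {      /* parallel over rows     */
                double s = b[i];
                for (int j = 0; j < i; j++)    /* strictly lower part E  */
                    s -= L[i*n + j] * x[j];
                xnew[i] = s / L[i*n + i];      /* divide by diagonal D   */
            }
            for (int i = 0; i < n; i++)        /* adopt the new iterate  */
                x[i] = xnew[i];
        }
    }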

SLIDE 20

Bottlenecks

Top-level PCG, exact IC triangular solves vs. 10 Jacobi sweeps:

Matrix             Exact IC                10 Jacobi sweeps
                   Iterations   Runtime    Iterations   Runtime
Laplace 3D 27pt      58           1.83       63           1.14
parabolic_fem       645          37.24      721           7.29
thermal2           1771         305.58     2457          67.41
G3_circuit         1625          45.60     1647          35.43

Example: sparse triangular solves in ILU preconditioning:

  • inherently sequential
  • parallelism from level-scheduling/multi-color ordering
  • unable to exploit fine-grained parallelism of GPUs
Take an unconventional approach: approximate sparse triangular solves

Anzt et al.: Iterative Sparse Triangular Solves for Preconditioning, Euro-Par 2015.

SLIDE 21

Summary

  • SpMV optimization is a central challenge in iterative sparse linear algebra.
  • Algorithms are memory bound and often benefit from kernel fusion.
  • Trade-off between performance and flexibility:
  • SpMV/Jacobi as building block enhances modularity.
  • Kernel fusion can bring performance close to the theoretical bound.
  • Concurrent kernel execution only beneficial for small problems.
  • We need unconventional approaches for bottleneck operations:
  • Iterative ILU generation (Chow et al.)
  • Iterative sparse triangular solves for ILU/IC.

This research is based on cooperation with Enrique Quintana-Ortí from the Universitat Jaume I, Edmond Chow from Georgia Tech, and Moritz Kreutzer from the University of Erlangen.