GPU ACCELERATION OF CHOLMOD: BATCHING, HYBRID AND MULTI-GPU

Steve Rennich, Darko Stosic, Tim Davis
April 6, 2016
April 4-7, 2016 | Silicon Valley

OBJECTIVE
• Direct sparse methods are among the most widely used in science and engineering
• GPU acceleration is challenging due to irregularity in operations and data access
• Investigate methods for GPU acceleration of sparse Cholesky factorization
• Implement within CHOLMOD

AGENDA
• Sparse Cholesky
• Root algorithm
• Subtree algorithm
• Custom batched BLAS/LAPACK
• Multi-GPU
• Performance

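DENSE CHOLESKY FACTORIZATION
Dense block Cholesky:

\[
\begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}
=
\begin{bmatrix} L_{11} & 0 \\ L_{21} & I \end{bmatrix}
\begin{bmatrix} I & 0 \\ 0 & A^{*}_{22} \end{bmatrix}
\begin{bmatrix} L_{11}^{t} & L_{21}^{t} \\ 0 & I \end{bmatrix}
\]

• POTRF (dense Cholesky): $L_{11} L_{11}^{t} = A_{11}$
• TRSM (triangular solve): $L_{11} L_{21}^{t} = A_{21}^{t}$
• GEMM (matrix multiplication, Schur complement): $A^{*}_{22} = A_{22} - L_{21} L_{21}^{t}$
[Figure labels: supernodes, compressed column]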
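The three dense kernels on this slide can be illustrated with a small, unoptimized sketch in plain Python. The function names (`potrf`, `trsm`, `schur`, `block_cholesky`) only borrow the BLAS/LAPACK names for readability; they are hypothetical helpers, not CHOLMOD or LAPACK calls, and the matrices are plain lists of lists.

```python
import math

def potrf(a):
    """Dense Cholesky: return lower-triangular L with L L^t = a."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for j in range(n):
        d = a[j][j] - sum(l[j][k] ** 2 for k in range(j))
        l[j][j] = math.sqrt(d)
        for i in range(j + 1, n):
            s = a[i][j] - sum(l[i][k] * l[j][k] for k in range(j))
            l[i][j] = s / l[j][j]
    return l

def trsm(l11, a21):
    """Triangular solve: find L21 with L21 L11^t = A21 (same as L11 L21^t = A21^t)."""
    n = len(l11)
    l21 = []
    for row in a21:
        x = [0.0] * n
        for j in range(n):
            s = row[j] - sum(x[k] * l11[j][k] for k in range(j))
            x[j] = s / l11[j][j]
        l21.append(x)
    return l21

def schur(a22, l21):
    """Schur complement update: A*22 = A22 - L21 L21^t."""
    m = len(a22)
    return [[a22[i][j] - sum(p * q for p, q in zip(l21[i], l21[j]))
             for j in range(m)] for i in range(m)]

def block_cholesky(a, nb):
    """One blocked step with a leading block of size nb, then factor the rest."""
    n = len(a)
    a11 = [row[:nb] for row in a[:nb]]
    a21 = [row[:nb] for row in a[nb:]]
    a22 = [row[nb:] for row in a[nb:]]
    l11 = potrf(a11)                    # POTRF
    l21 = trsm(l11, a21)                # TRSM
    l22 = potrf(schur(a22, l21))        # GEMM/SYRK update, then factor A*22
    top = [l11[i] + [0.0] * (n - nb) for i in range(nb)]
    bottom = [l21[i] + l22[i] for i in range(n - nb)]
    return top + bottom
```

For a small SPD matrix the blocked result matches the unblocked factorization regardless of the block size chosen.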

SPARSE CHOLESKY ISSUES
• Lots of small math
• Irregular operations/access patterns
• PCIe communication
[Figure: sparse matrix (nodes 1 to 7) with fill entries and its elimination tree; operations labeled POTRF, TRSM, SYRK, GEMM]

ROOT ALGORITHM
SuiteSparse (CHOLMOD) 4.4.3
• Send appropriate BLAS to GPU
• Assemble supernodes on GPU & CPU
• 'Hide' PCIe communication
• Handles large matrices
• row/column threshold: ndrow >= 256, ndcol >= 32
• supernode score
[Chart: GFlop/s (0 to 800), CPU vs. CPU + GPU, across the Florida Sparse Matrix Collection; 2 x Xeon E5-2698 v3 + K40 (max boost, ECC=off); annotation: 1.5x]
[Figure labels: GPU, CPU, descendant supernodes]

[Same ROOT ALGORITHM slide with two annotations added to the chart: "why not higher?" (in place of the 1.5x label) and "why so low?"]
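The row/column threshold on the root-algorithm slide can be read as a simple routing rule for descendant updates. The sketch below assumes that both conditions must hold at once (the slide lists them without saying how they combine) and leaves out the "supernode score", whose definition the slide does not give. `send_to_gpu` and `route_updates` are hypothetical names, not CHOLMOD functions.

```python
NDROW_MIN = 256  # from the slide: ndrow >= 256
NDCOL_MIN = 32   # from the slide: ndcol >= 32

def send_to_gpu(ndrow, ndcol):
    """Assumption: an update goes to the GPU only if both thresholds are met."""
    return ndrow >= NDROW_MIN and ndcol >= NDCOL_MIN

def route_updates(descendants):
    """Split (ndrow, ndcol) descendant updates into GPU and CPU work lists."""
    gpu, cpu = [], []
    for ndrow, ndcol in descendants:
        target = gpu if send_to_gpu(ndrow, ndcol) else cpu
        target.append((ndrow, ndcol))
    return gpu, cpu
```

Small updates stay on the CPU under this rule, which is one plausible reading of why matrices dominated by small supernodes gain little from the root algorithm.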

SUBTREE ALGORITHM
• Send entire subtrees to GPU
• Factorization performed entirely on GPU
• Minimizes PCIe communication
• Requires batched BLAS/LAPACK (with variable m, n, k)
• Previous method is used for root
[Figure: elimination tree with levels 0, 1, 2 and subtrees 1 to 4]

[Same SUBTREE ALGORITHM slide with the figure annotated "ROOT alg." and "SUBTREE alg."]
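Processing a subtree level by level requires grouping its supernodes so that every node in a level depends only on lower levels. A minimal sketch, assuming the tree is given as a parent array with `parent[i] > i` for every non-root node (true of elimination trees) and assuming level 0 means the leaves, which the slide's figure does not confirm. `group_by_level` is a hypothetical helper.

```python
def group_by_level(parent):
    """Group tree nodes by height above the leaves.

    parent[i] is the parent of node i, or -1 for a root.
    Assumes parent[i] > i, so children are visited before their parent.
    """
    height = [0] * len(parent)
    for node, p in enumerate(parent):
        if p >= 0:
            if p <= node:
                raise ValueError("expected parent index > child index")
            height[p] = max(height[p], height[node] + 1)
    levels = [[] for _ in range(max(height, default=-1) + 1)]
    for node, h in enumerate(height):
        levels[h].append(node)
    return levels
```

All nodes in one level are independent of each other, so their dense operations are candidates for a single batched launch.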

RESULTS - SUBTREE
• 1.38x average speedup vs. previous CPU+GPU
• 2x average speedup vs. CPU
• Poorly performing matrices see the greatest speedup
• PCIe well avoided
[Chart: GFlop/s (0 to 900) across the Florida Sparse Matrix Collection; 2 x Xeon E5-2698 v3 + K40 (max boost, ECC=off); legend labels: CHOLMOD 4.4.3 CPU, CHOLMOD 4.4.3 CPU + GPU, Subtrees, GPU Branches]

CURRENT WORK
1. Releasable CUDA versions of batched BLAS/LAPACK
2. Support multi-GPU for both SUBTREE and ROOT algs.
3. General implementation improvements
4. Release as merged with latest SuiteSparse library
SuiteSparse 4.6.0 BETA
http://faculty.cse.tamu.edu/davis/SuiteSparse/SuiteSparse-4.6.0-beta.tar.gz

BATCHED BLAS/LAPACK
For each level:
• For GEMM, SYRK, TRSM, POTRF, batch if:
  - GEMM, SYRK: m<=128 & n<=128 & k<=128
  - POTRF, TRSM: m<=64 & n<=64
• Stream remaining BLAS/LAPACK operations
Require batches with variable sized elements: m, n, k
• Irregular operations don't give large uniform batches
• Cannot afford to copy/pad
Previous work used modified cuBLAS/cuSOLVER code
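The batching criteria on this slide amount to a per-operation size test. The sketch below encodes the stated limits and splits one level's operations into a batched list and a streamed list. The tuple layout `(op, m, n, k)` and the names `should_batch` and `split_level` are assumptions made for illustration.

```python
def should_batch(op, m, n, k=0):
    """Apply the size limits from the slide; k is ignored for POTRF and TRSM."""
    if op in ("GEMM", "SYRK"):
        return m <= 128 and n <= 128 and k <= 128
    if op in ("POTRF", "TRSM"):
        return m <= 64 and n <= 64
    raise ValueError(f"unknown operation: {op}")

def split_level(ops):
    """Split one level's (op, m, n, k) tuples into batched and streamed lists."""
    batched, streamed = [], []
    for op, m, n, k in ops:
        (batched if should_batch(op, m, n, k) else streamed).append((op, m, n, k))
    return batched, streamed
```

Each batched entry keeps its own m, n, k, which reflects the slide's point that the batches are not uniform and are not padded.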

CUSTOM BATCHED BLAS/LAPACK
• Written in CUDA; accepts lists of m, n, k
• Every BLAS/LAPACK operation gets assigned to a threadblock
  - grid size = #batches
  - automatic scheduling
• All threadblocks are 16x16 = 256 threads
  - If result matrix < 16x16: idle threads
  - If result matrix > 16x16: tiled
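The mapping of one operation per threadblock, with 16x16 tiling of larger results, can be emulated on the CPU to make the indexing concrete. This is a Python emulation, not the CUDA kernels from the talk: the outer loop stands in for the grid, the `ty`/`tx` loops stand in for the 256 threads, and a plain product C = A B stands in for whichever GEMM variant the real kernels compute. `batched_gemm` is a hypothetical name.

```python
TILE = 16  # threadblock is 16 x 16 = 256 threads

def batched_gemm(batch):
    """Compute C = A B for each (A, B) pair; sizes may differ per entry.

    A is m x k and B is k x n (lists of lists, k >= 1).
    """
    results = []
    for a, b in batch:                      # one "threadblock" per batch entry
        m, k, n = len(a), len(b), len(b[0])
        c = [[0.0] * n for _ in range(m)]
        for tile_i in range(0, m, TILE):    # results larger than 16x16 are tiled
            for tile_j in range(0, n, TILE):
                for ty in range(TILE):
                    for tx in range(TILE):
                        i, j = tile_i + ty, tile_j + tx
                        if i < m and j < n:  # threads outside the result stay idle
                            c[i][j] = sum(a[i][p] * b[p][j] for p in range(k))
        results.append(c)
    return results
```

Because each entry carries its own dimensions, no copying or padding to a common size is needed, at the cost of idle threads for results smaller than one tile.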

MULTI-GPU
Subtree
• 1 subtree per GPU
• Automatically scaled: subtree size <= GPU memory (as large as possible)
• Static load-balancing based on flops
Root
• At least one supernode in 'Root' subtree
• OMP parallel loop over supernodes: nthreads(#GPUs), ordered
• Spinwait on unfinished descendant supernodes
[Figure: elimination tree for 4x GPU (GPU 1, GPU 2, GPU 3, GPU 4); labels: subtrees, root supernodes, spin wait, synchronize]
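One way to pick subtrees that are "as large as possible" while fitting in GPU memory is a top-down walk of the elimination tree: keep a subtree whole if it fits, otherwise leave its top node in the root part and descend. This is only a sketch of that idea under stated assumptions; the slides do not describe the actual selection procedure, and the flop-based load balancing and the one-subtree-per-GPU assignment are not modeled. `pick_subtrees`, the per-node `size` list, and the `parent[i] > i` ordering are assumptions.

```python
def pick_subtrees(parent, size, gpu_memory):
    """Return (subtree_roots, root_part) for a tree given as a parent array.

    parent[i] is the parent of node i (-1 for a tree root), with parent[i] > i.
    size[i] is the memory needed by node i alone.
    """
    n = len(parent)
    subtree_size = list(size)
    children = [[] for _ in range(n)]
    for node, p in enumerate(parent):
        if p >= 0:
            subtree_size[p] += subtree_size[node]  # children come before parents
            children[p].append(node)
    subtree_roots, root_part = [], []
    stack = [node for node, p in enumerate(parent) if p < 0]
    while stack:
        node = stack.pop()
        # Tree roots always stay in the root part, so it is never empty.
        if parent[node] >= 0 and subtree_size[node] <= gpu_memory:
            subtree_roots.append(node)
        else:
            root_part.append(node)
            stack.extend(children[node])
    return sorted(subtree_roots), sorted(root_part)
```

With a smaller memory budget the walk descends further, producing more and smaller subtrees and a larger root part.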

MULTI-GPU
[Figure: Serena.mtx on two GPUs (GPU 0, GPU 1); a second build of the slide adds the labels "subtree", "root" and "synchronize"]

CURRENT PERFORMANCE
SuiteSparse 4.6.0; 2x E5-2698 v3 @2.3 GHz; 1x - 4x K40 (full boost, ECC=ON)
• 1x K40 = 1.8x
• 2x K40 = 2.3x
• 4x K40 = 2.6x
[Chart: GF/s for numerical factorization (0 to 2000) across the Florida Sparse Matrix Collection; series: CPU, 1x K40, 2x K40, 4x K40]

CURRENT PERFORMANCE – K40 VS K80
2x E5-2698 v3 @2.3 GHz; 1x - 4x K40 (full boost, ECC=ON); 1x - 4x K80 (full boost, pl=175, ECC=ON)
[Chart: numerical factorization GF/s (0 to 2000) across the Florida Sparse Matrix Collection; series: CPU, 1x K40, 2x K40, 4x K40, 1x K80, 2x K80, 4x K80]

CURRENT PERFORMANCE
2x E5-2698 v3 @2.3 GHz + 1x K40 (full boost, ECC=ON) or 1x K80 (board, full boost, pl=175, ECC=ON)
103 SPD matrices from the Florida Sparse Matrix Collection
[Chart: speedup (GPU/CPU, 0 to 6) vs. factor flops / nnz(L) (0 to 12000); series: K40, K80; reference line at 1x]

FURTHER WORK
• further optimization of batched routines
• improved overlap/load-balancing for multi-GPU case
• LU
• pivoting
• accelerating all other aspects of matrix solution
• …

CONCLUSIONS
• Sparse factorization can be well accelerated on GPUs
• Subtree algorithm / batched BLAS / careful implementation
• Plenty yet to be done
SuiteSparse 4.6.0 BETA
http://faculty.cse.tamu.edu/davis/SuiteSparse/SuiteSparse-4.6.0-beta.tar.gz

THANK YOU
Join the NVIDIA Developer Program at developer.nvidia.com/join
