
DAG-Scheduled Linear Algebra Using Template-Based Building Blocks

Jonathan Hogg

STFC Rutherford Appleton Laboratory

19 March 2015, GPU Technology Conference, San Jose, California

* Thanks also to Jeremy Appleyard of NVIDIA




Introduction

What’s in the title?

DAG-Scheduled — Similar approach to MAGMA, but more flexible.
Linear Algebra — Aimed at implementing matrix algorithms-by-blocks.
Template-Based Building Blocks — Template library for BLAS-like functionality (i.e. CUB for LA)

So what’s different to MAGMA?

◮ DAG handled on-device.
◮ Improved performance for small and medium matrices
◮ More flexible ⇒ allows more complex pivoting

Some things worked, some didn’t...



DAG-Scheduling: Overview

Aims

◮ Expose maximum parallelism
◮ Separate parallelism/scheduling from algorithm

Example: Cholesky factorization

◮ Split matrix up into blocks
◮ Divide algorithm into tasks that act on blocks.
◮ Represent dependencies as edges in a DAG
◮ Typically each task is implemented by a block of threads.



DAG-Scheduling: Example

Task nodes of the DAG for a 4 × 4 block partition, grouped by column:

Column 1: L11 = factor(A11); L21 = solve(A21, L11); L31 = solve(A31, L11); L41 = solve(A41, L11); A22 = update(A22, L21, L21); A32 = update(A32, L31, L21); A42 = update(A42, L41, L21); A33 = update(A33, L31, L31); A43 = update(A43, L41, L31); A44 = update(A44, L41, L41)
Column 2: L22 = factor(A22); L32 = solve(A32, L22); L42 = solve(A42, L22); A33 = update(A33, L32, L32); A43 = update(A43, L42, L32); A44 = update(A44, L42, L42)
Column 3: L33 = factor(A33); L43 = solve(A43, L33); A44 = update(A44, L43, L43)
Column 4: L44 = factor(A44)



DAG-Scheduling Progress: Cholesky

[Figure: Speedup vs cuSolver plotted against n (1000–10000); series: New, Magma, Host MKL]

◮ More advanced implicit-DAG scheme, similar to "domino" scheme from trsv.
◮ Big gains on latency-bound sizes
◮ Still need to address flop-bound case by calling cuBLAS.
◮ Surprisingly beat MKL on "small" sizes




Linear Algebra

Algorithms of interest

Cholesky A = LL^T — proof of concept, check performance.
Symmetric Indefinite A = LDL^T — requires complex pivoting (Bunch-Kaufman often insufficient for sparse solvers).

Cholesky

◮ For j = 1, …, n:
  1. Factor diagonal block: L_jj L_jj^T ← A_jj
  2. "Divide" column by diagonal: L_ij ← A_ij L_jj^{-T}, i > j
  3. Update columns to right: A_ik ← A_ik − L_ij L_kj^T, i ≥ k > j



Symmetric Indefinite with Pivoting

Symmetric Indefinite A = LDL^T

◮ Ignoring stability, is essentially Cholesky with extra D's.
◮ To ensure stability, need to ensure no entry of L is too large.
◮ For use in a sparse solver, needs to cope with rectangular matrices ⇒ Bunch-Kaufman is unsuitable.

Traditional pivoting

◮ Finds largest entry in column before making pivoting decision.

– Latency-bound (global communication for each column).
– Entire (block) column may not fit in GPU (shared) memory.



Symmetric Indefinite with Pivoting II

But we’re lucky!

◮ Numerical pre-treatment (scaling, ordering) means < 0.1% of matrices need pivoting
◮ Allows a "try-it-and-see" approach (aka a posteriori pivoting)




Requirements from task system

◮ Follow Cholesky scheme

But also...

◮ Apply permutations to left
◮ Check pivot sizes
◮ Perform speculative execution...
◮ ...backtrack if things go wrong
◮ In case where pivots fail, need to update to left as well as right

Still writing this... but don't foresee major problems.



Unoptimized results

[Figure: Time / Time(cuSolver Cholesky) plotted against n (1000–10000); series: Magma (unpivoted), cuSolver, Host MKL, preliminary code]




DAG-Scheduling: Vs MAGMA

Implementation in MAGMA

+ Performs "straightforward" tasks on GPU (e.g. GEMM)
+ More complicated tasks on CPU (e.g. pivoting kernels)
+ High asymptotic performance (because GEMM)

– Tasks must be a certain minimum size to be efficient.
– CPU↔GPU latency limits performance on small matrices.
– Can't easily handle speculative execution and backtracking.
– Doesn't work well on lots of simultaneous small matrices.
– Can't (easily) dynamically modify task DAG based on pivoting decisions.



Template Library

What is it?

◮ Similar in concept to the CUDA Unbound (CUB) library
◮ Provide efficient BLAS-like functionality as templates: "BLAS Unbound"
◮ Warp-, block- and device-level constructs
◮ Facilitate auto-tuning

Why do we need it?

◮ For our DAG-library, all tasks performed in same kernel
◮ So all get same shared memory, number of threads etc.
◮ Pick best parameters for GEMM operation where most flops are
◮ Everything else has to live within that envelope



The good, the bad and the ugly

Problems

◮ Combinatorial complexity / manpower intensive
◮ Often need to break warp/block separation for performance
◮ Lots of performance optimization needed
◮ Can't even come close to cuBLAS GEMM performance (70% vs 90% of peak)

Wins

◮ Easy to play around with alternatives
◮ Test-driven development allows increased confidence in correctness
◮ Non-traditional features added using template parameters may be reused in other scenarios



Tricks for fast Cholesky

Warp-level

◮ Each thread handles multiple consecutive columns
◮ Hide DFMA in communication latency
◮ Can't hide in RSQRT latency — PTXAS issue?
◮ Use of SHFL requires a lot more unrolling — instruction cache size issues
◮ Explicit hand/template-based unrolling as NVCC tries to be too clever
◮ warpSize not a square number makes things messy
◮ Break block/warp separation by leaving 1/√d_ii on the diagonal, not √d_ii.



Tricks for fast Cholesky II

Block-level

◮ Use different data layout for warps factoring diagonal vs off-diagonal
◮ Trsm (in Potrf): each thread holds one entire vector ⇒ no communication
◮ Work lower triangle of 4 × 4 blocks with only 8 warps — need "warp stealing".
◮ BlockTrsm: stage D_ii into shmem by hand, double buffering
◮ BlockTrsm: good SHFL use is fiddly
◮ Post progress after each diagonal block for device-level algorithm to pick up




Cholesky: 4 blocks



Cholesky: 16 blocks




Tricks for fast Cholesky III

Device-level

◮ Overlap as much as possible
◮ Consolidate work and avoid synchronization
◮ Small calls to cuBLAS infeasible: launch overhead ≫ single block update
◮ Need to identify larger blocks to call cuBLAS on
◮ Will need a more complicated scheme.



Conclusions

◮ On-device DAG-scheduling good for latency-bound kernels
◮ Significant improvements in Cholesky for n ≤ 2000
◮ New a posteriori pivoting techniques for LDL^T
◮ BLAS/LAPACK-like template library is a lot of work
◮ ... so only a limited subset will be brought up to release quality
◮ If you want the rest, email me.
◮ Code will ultimately be used for the sparse solver SPRAL SSIDS v2

jonathan.hogg@stfc.ac.uk



Thanks for listening!

Questions?

jonathan.hogg@stfc.ac.uk http://www.numerical.rl.ac.uk/spral
