
DAG-Scheduled Linear Algebra Using Template-Based Building Blocks

Jonathan Hogg

STFC Rutherford Appleton Laboratory

19 March 2015, GPU Technology Conference, San Jose, California

* Thanks also to Jeremy Appleyard of NVIDIA




Introduction

What’s in the title?

DAG-Scheduled — Similar approach to MAGMA, but more flexible.
Linear Algebra — Aimed at implementing matrix algorithms-by-blocks.
Template-Based Building Blocks — Template library for BLAS-like functionality (i.e. CUB for LA)

So what’s different to MAGMA?

◮ DAG handled on-device.
◮ Improved performance for small and medium matrices
◮ More flexible ⇒ allows more complex pivoting

Some things worked, some didn’t...



DAG-Scheduling: Overview

Aims

◮ Expose maximum parallelism
◮ Separate parallelism/scheduling from algorithm

Example: Cholesky factorization

◮ Split matrix up into blocks
◮ Divide algorithm into tasks that act on blocks.
◮ Represent dependencies as edges in a DAG
◮ Typically each task is implemented by a block of threads.



DAG-Scheduling: Example

Task nodes of the DAG for a 4 × 4 block partition, grouped by column:

Column 1: L11 = factor(A11); L21 = solve(A21, L11); L31 = solve(A31, L11); L41 = solve(A41, L11); A22 = update(A22, L21, L21); A32 = update(A32, L31, L21); A42 = update(A42, L41, L21); A33 = update(A33, L31, L31); A43 = update(A43, L41, L31); A44 = update(A44, L41, L41)
Column 2: L22 = factor(A22); L32 = solve(A32, L22); L42 = solve(A42, L22); A33 = update(A33, L32, L32); A43 = update(A43, L42, L32); A44 = update(A44, L42, L42)
Column 3: L33 = factor(A33); L43 = solve(A43, L33); A44 = update(A44, L43, L43)
Column 4: L44 = factor(A44)



DAG-Scheduling Progress: Cholesky

[Figure: Speedup vs cuSolver plotted against n (1000–10000); series: New, Magma, Host MKL]

◮ More advanced implicit-DAG scheme, similar to "domino" scheme from trsv.
◮ Big gains on latency-bound sizes
◮ Still need to address flop-bound case by calling cuBLAS.
◮ Surprisingly beat MKL on "small" sizes




Linear Algebra

Algorithms of interest

Cholesky A = LL^T — proof of concept, check performance.
Symmetric Indefinite A = LDL^T — requires complex pivoting (Bunch-Kaufman often insufficient for sparse solvers).

Cholesky

◮ For j = 1, …, n:
  1. Factor diagonal block: L_jj L_jj^T ← A_jj
  2. "Divide" column by diagonal: L_ij ← A_ij L_jj^{-T}, i > j
  3. Update columns to right: A_ik ← A_ik − L_ij L_kj^T, i ≥ k > j



Symmetric Indefinite with Pivoting

Symmetric Indefinite A = LDL^T

◮ Ignoring stability, is essentially Cholesky with extra D's.
◮ To ensure stability, need to ensure no entry of L is too large.
◮ For use in a sparse solver, needs to cope with rectangular matrices ⇒ Bunch-Kaufman is unsuitable.

Traditional pivoting

◮ Finds largest entry in column before making pivoting decision.

– Latency-bound (global communication for each column).
– Entire (block) column may not fit in GPU (shared) memory.



Symmetric Indefinite with Pivoting II

But we’re lucky!

◮ Numerical pre-treatment (scaling, ordering) means < 0.1% of matrices need pivoting
◮ Allows a "try-it-and-see" approach (aka a posteriori pivoting)




Requirements from task system

◮ Follow Cholesky scheme

But also...

◮ Apply permutations to left
◮ Check pivot sizes
◮ Perform speculative execution...
◮ ...backtrack if things go wrong
◮ In case where pivots fail, need to update to left as well as right

Still writing this... but don't foresee major problems.



Unoptimized results

[Figure: Time / Time(cuSolver Cholesky) plotted against n (1000–10000); series: Magma (unpivoted), cuSolver, Host MKL, preliminary code]




DAG-Scheduling: Vs MAGMA

Implementation in MAGMA

+ Performs "straightforward" tasks on GPU (e.g. GEMM)
+ More complicated tasks on CPU (e.g. pivoting kernels)
+ High asymptotic performance (because GEMM)

– Tasks must be a certain minimum size to be efficient.
– CPU↔GPU latency limits performance on small matrices.
– Can't easily handle speculative execution and backtracking.
– Doesn't work well on lots of simultaneous small matrices.
– Can't (easily) dynamically modify task DAG based on pivoting decisions.



Template Library

What is it?

◮ Similar in concept to the CUDA Unbound (CUB) library
◮ Provide efficient BLAS-like functionality as templates: "BLAS Unbound"
◮ Warp-, block- and device-level constructs
◮ Facilitate auto-tuning

Why do we need it?

◮ For our DAG-library, all tasks performed in same kernel
◮ So all get same shared memory, number of threads etc.
◮ Pick best parameters for GEMM operation where most flops are
◮ Everything else has to live within that envelope



The good, the bad and the ugly

Problems

◮ Combinatorial complexity / manpower intensive
◮ Often need to break warp/block separation for performance
◮ Lots of performance optimization needed
◮ Can't even come close to cuBLAS GEMM performance (70% vs 90% of peak)

Wins

◮ Easy to play around with alternatives
◮ Test-driven development allows increased confidence in correctness
◮ Non-traditional features added using template parameters may be reused in other scenarios



Tricks for fast Cholesky

Warp-level

◮ Each thread handles multiple consecutive columns
◮ Hide DFMA in communication latency
◮ Can't hide in RSQRT latency — PTXAS issue?
◮ Use of SHFL requires a lot more unrolling — instruction cache size issues
◮ Explicit hand/template-based unrolling as NVCC tries to be too clever
◮ warpSize not a square number makes things messy
◮ Break block/warp separation by leaving 1/√d_ii on the diagonal, not √d_ii.



Tricks for fast Cholesky II

Block-level

◮ Use different data layout for warps factoring diagonal vs off-diagonal
◮ Trsm (in Potrf): each thread holds one entire vector ⇒ no communication
◮ Work lower triangle of 4 × 4 blocks with only 8 warps — need "warp stealing".
◮ BlockTrsm: stage D_ii into shmem by hand, double buffering
◮ BlockTrsm: good SHFL use is fiddly
◮ Post progress after each diagonal block for device-level algorithm to pick up




Cholesky: 4 blocks



Cholesky: 16 blocks




Tricks for fast Cholesky III

Device-level

◮ Overlap as much as possible
◮ Consolidate work and avoid synchronization
◮ Small calls to cuBLAS infeasible: launch overhead ≫ single block update
◮ Need to identify larger blocks to call cuBLAS on
◮ Will need a more complicated scheme.



Conclusions

◮ On-device DAG-scheduling good for latency-bound kernels
◮ Significant improvements in Cholesky for n ≤ 2000
◮ New a posteriori pivoting techniques for LDL^T
◮ BLAS/LAPACK-like template library is a lot of work
◮ ... so only a limited subset will be brought up to release quality
◮ If you want the rest, email me.
◮ Code will ultimately be used for the sparse solver SPRAL SSIDS v2

jonathan.hogg@stfc.ac.uk



Thanks for listening!

Questions?

jonathan.hogg@stfc.ac.uk http://www.numerical.rl.ac.uk/spral
