SLIDE 1

Autotuning Dense Batched QR Factorizations on GPU

Wissam M. Sid-Lakhdar, Tim A. Davis, Xiaoye S. Li

Texas A&M University & Lawrence Berkeley National Laboratory

March 26, 2018

SLIDE 2

Overview

1. Introduction
2. Meta-programming
3. Optimization
4. Experimental results
5. Conclusion

SLIDE 3

Motivation and Goal

Portability or Efficiency?

- Portability (too general): write one code that fits all GPU architectures, but it will not be the fastest, or even fast enough, on any one of them.
- Efficiency (too specific): write the best code for one GPU architecture, but it will be much less efficient on, or will not work at all on, other architectures.
- Effort: writing an efficient code for every architecture is tedious and unsustainable.

SLIDE 4

Motivation and Goal

Portability or Efficiency?

How can we get both portability and efficiency with minimum effort?

SLIDE 5

Our approach

Within NSF SparseKaffe project

Autotuning:

- Write one general template code that relies on a set of parameters.
- The autotuner generates, compiles, runs, and checks a kernel for every combination of parameters.
- The autotuner traverses the parameter search space to find the combination leading to the best (fastest) kernel, for any given GPU architecture.

In this talk: autotuning batched dense QR factorization on GPUs.
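To make the generate-compile-run-check loop concrete, here is a minimal Python sketch of such a driver. The file names, Makefile target, and the benchmark's "<time_ms> <ok|fail>" output format are illustrative assumptions, not the project's actual interfaces; the real generator is PyExpander-based, as described next.

    import itertools
    import subprocess

    def evaluate(params, template):
        """Generate, compile, run, and check one kernel; return its time."""
        # 1. Generate: expand the template for this parameter combination
        #    (the real system uses PyExpander; str.format stands in here).
        with open("kernel.cu", "w") as f:
            f.write(template.format(**params))
        # 2. Compile (in the real system even the Makefile is generated).
        subprocess.run(["make", "kernel"], check=True)
        # 3. Run and check: assume the harness prints "<time_ms> <ok|fail>".
        out = subprocess.run(["./kernel"], capture_output=True, text=True,
                             check=True).stdout.split()
        return float(out[0]) if out[1] == "ok" else float("inf")

    def autotune(domains, template):
        """Traverse the whole search space; return the fastest valid kernel."""
        names = list(domains)
        best, best_time = None, float("inf")
        for combo in itertools.product(*domains.values()):
            params = dict(zip(names, combo))
            t = evaluate(params, template)
            if t < best_time:
                best, best_time = params, t
        return best, best_time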

SLIDE 7

Overview

1. Introduction
2. Meta-programming
3. Optimization
4. Experimental results
5. Conclusion

SLIDE 8

Algorithm

Matlab

    function [A, V1, T] = vthqr_gpu(A)
        [m, n] = size(A);
        T = zeros(min(m, n));
        for k = 1:min(m, n)
            [v, tau, s] = house_higham(A(k:m, k));
            V1(k) = v(1);
            A(k+1:m, k) = v(2:end);
            z = -tau * v' * A(k:m, :);
            A(k:m, k+1:n) = A(k:m, k+1:n) + v * z(k+1:n);
            T(1:k-1, k) = T(1:k-1, 1:k-1) * z(1:k-1)';
            T(k, k) = tau;
            A(k, k) = s;
        end
    end

QR factorization (for GPU), with Householder reflections à la Higham:

- Numerical stability (when the norm of the Householder vector is small)
- Fewer operations (most entries of the Householder vector stay unchanged) ⇒ GPU friendly

Computing and using the z vector reduces branching (and thus warp divergence) and exposes more parallelism.
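The house_higham routine itself is not shown in the slides; the following NumPy sketch reconstructs what such a cancellation-free Householder generator plausibly computes. The key point is that v(2:end) keeps the original column entries untouched, so only v(1), tau, and the new diagonal entry s need to be produced.

    import numpy as np

    def house_higham(x):
        """Return (v, tau, s) with H = I - tau*v*v' and H @ x = s*e1.
        Only v[0] differs from x: the entries x[1:] are left untouched,
        which is what makes the method GPU friendly."""
        sigma = float(x[1:] @ x[1:])
        v = x.copy()
        if sigma == 0.0:
            return v, 0.0, float(x[0])    # x is already a multiple of e1
        mu = np.sqrt(x[0]**2 + sigma)     # = ||x||_2
        if x[0] <= 0.0:
            v[0] = x[0] - mu              # no cancellation in this case
        else:
            v[0] = -sigma / (x[0] + mu)   # Higham: x[0] - mu, computed safely
        tau = 2.0 / (sigma + v[0] ** 2)   # = 2 / (v' @ v)
        return v, tau, mu                 # H @ x = mu * e1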

SLIDE 9

Template

Python/CUDA

PyExpander: replacing and extending the C macro system by leveraging the power of Python.

- Ability to use loops, which is very difficult and painful with macros.
- Ability to have functions calling other functions or using variables, which is very difficult with C macros.
- Nice checking done by the Python interpreter, instead of wrestling with incomprehensible errors from the C/CUDA compiler.
- Even the Makefile is generated, to take the architecture type and optimization options into account.

SLIDE 10

Code example

Template code: PyExpander instructions are evaluated by the Python interpreter.

$for ≈ #pragma unroll
$if ≈ #if ... #endif
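As a toy illustration of the $for construct, this Python stand-in unrolls a register-load loop at generation time, producing straight-line CUDA source. The fragment and the names rA, posX0A, posY0A, DtThXA, and lda are hypothetical, echoing the positioning parameters used later.

    def expand_load_loop(nb_reg: int) -> str:
        """What a PyExpander '$for(i in range(NbReg)) ... $endfor' block
        boils down to: Python runs the loop; the CUDA compiler never sees it."""
        return "\n".join(
            f"rA[{i}] = A[posX0A + {i} * DtThXA + posY0A * lda];"
            for i in range(nb_reg))

    print(expand_load_loop(2))
    # rA[0] = A[posX0A + 0 * DtThXA + posY0A * lda];
    # rA[1] = A[posX0A + 1 * DtThXA + posY0A * lda];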

SLIDE 11

Parameters

Problem: TlSz, NbXTl, NbYTl. Inputs (fixed for every configuration).

Architecture: WpSz, NbTh, NbReg.

Mapping: parameters of the form {Nb, Dt} x {Th, Wp} x {X, Y} x {A, T} (e.g., NbThXA, DtWpXA): counts and strides of threads and warps along the rows and columns of A and T.

Load/Store: NbXChkA, NbXChkT.

Code optimization: X∗, X 1∗, . . . Switch between sub-algorithms; replace the pragma and inline mechanisms of CUDA.

. . . : many more parameters and routines depend on the parameters above.

SLIDE 12

Search space

- Some parameters need to be of the form 2^i, i ∈ [0, n], in order to make the generated code simpler (⇒ faster).
- The search space of the Mapping parameters is bounded by the values of the Problem parameters.
- The search spaces of the Architecture and Load/Store parameters depend on the architectural characteristics of the targeted GPU.
- The Optimization parameters are (most often) Booleans, used to turn some features on/off.
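A sketch of how the raw search space might be assembled, with hypothetical bounds (the real bounds come from the problem sizes and the targeted GPU's specifications):

    import itertools

    def pow2_domain(n):
        """Powers of two 2^i, i in [0, n]: simpler, hence faster, generated code."""
        return [2 ** i for i in range(n + 1)]

    DOMAINS = {
        "NbTh":   pow2_domain(10),   # Architecture: threads per block
        "NbReg":  pow2_domain(7),    # Architecture: registers per thread
        "NbThXA": pow2_domain(5),    # Mapping: bounded by the Problem sizes
        "NbThYA": pow2_domain(5),
        "OptZ":   [False, True],     # Optimization: on/off feature switch
    }

    def raw_space(domains):
        """Cartesian product: every combination, before constraints prune it."""
        for combo in itertools.product(*domains.values()):
            yield dict(zip(domains, combo))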

SLIDE 13

Constraints

- Equalities: enforce a bijection between matrices and threads.
- Inequalities: prohibit out-of-memory accesses.
- Conditional constraints.

Examples:

(0) NbTh ∗ NbReg ≤ NbMaxReg: the total number of registers cannot exceed the architectural limit.
(1) NbThXA ∗ NbThYA ∗ NbTh == TlSz^2 ∗ NbXTl ∗ NbYTl: the sum of the threads' registers for A equals the surface of A.
(2) NbThXA ∗ DtThXA ≤ TlSz ∗ NbXTl: a thread cannot be mapped on rows outside of A.
(3) NbWpXA ∗ NbWpYA == WpSz: the layout of a warp respects its size.
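In Python form, the four example constraints might read as follows (a sketch; NbMaxReg and WpSz come from the target architecture, and the parameter dictionary p uses the names from the slides):

    def satisfies(p, NbMaxReg, WpSz=32):
        """True iff configuration p passes example constraints (0)-(3)."""
        return (
            p["NbTh"] * p["NbReg"] <= NbMaxReg                          # (0)
            and p["NbThXA"] * p["NbThYA"] * p["NbTh"]
                == p["TlSz"] ** 2 * p["NbXTl"] * p["NbYTl"]             # (1)
            and p["NbThXA"] * p["DtThXA"] <= p["TlSz"] * p["NbXTl"]     # (2)
            and p["NbWpXA"] * p["NbWpYA"] == WpSz                       # (3)
        )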

SLIDE 14

Positioning

Position of the first row of the first thread of a warp in matrix A:
    posWpXA = (WpIdXA / cx) ∗ dx + ((WpIdXA & (cx − 1)) / ex) ∗ fx + (WpIdXA & (ex − 1))    (1)

Position of a thread within its warp:
    posThWpXA = (ThWpId / NbWpYA) ∗ DtWpXA    (2)

Position of the first row of a thread:
    posX0A = posWpXA + posThWpXA    (3)

Relative position of the i-th row of a thread:
    posThXA(i) = i ∗ DtThXA    (4)

Position of the i-th row of a thread:
    posX(i) = posX0A + posThXA(i)    (5)

(Divisions are integer divisions; the masks assume power-of-two parameters.)
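Read this way, the decomposition is easy to mock up in Python. This is a sketch assuming integer division and power-of-two masks; cx, dx, ex, fx are constants derived from the mapping parameters.

    def pos_wp_xa(wp_id, cx, dx, ex, fx):
        """(1): decompose a warp index into block / sub-block / offset."""
        return ((wp_id // cx) * dx
                + ((wp_id & (cx - 1)) // ex) * fx
                + (wp_id & (ex - 1)))

    def pos_th_wp_xa(th_wp_id, nb_wp_ya, dt_wp_xa):
        """(2): position of a thread inside its warp."""
        return (th_wp_id // nb_wp_ya) * dt_wp_xa

    def pos_x(i, pos_x0a, dt_th_xa):
        """(5): row i of a thread. pos_x0a is (3), computed once per thread;
        i * dt_th_xa is the cheap relative offset (4)."""
        return pos_x0a + i * dt_th_xa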

SLIDE 15

Positioning

posThXA(i) and posThYA(j) are straightforward to compute. posX0A and posY0A are expensive to compute, so every thread computes them only once and stores them in dedicated registers.

SLIDE 16

Implementation issues

- Template code is harder to read/write/modify than standard code.
- CUDA optimization decisions are not easy to make in template code.
- Over-use of the select statement.

SLIDE 17

Autotuner

SLIDE 18

Overview

1. Introduction
2. Meta-programming
3. Optimization
4. Experimental results
5. Conclusion

SLIDE 19

Optimization problem

- The objective function is the execution time of the kernels.
- No analytical formulation exists; every function evaluation is costly.
- The gradient is unknown. It can be approximated, but at a high cost.
- The optimization constraints are non-linear.

⇒ This is classified as a black-box optimization problem. In the general case, no method can ever exist with a proof of convergence (no-free-lunch theorem).

SLIDE 20

Optimization parallelization

- The evaluation of the objective function for different parameter configurations is embarrassingly parallel: as many evaluations can be launched in parallel as there are CPUs/GPUs available.
- Exploiting this parallelism is the main focus of the BONSAI project at ICL (UTK).
- We use the cudaSetDevice(GPU ID) routine to map an autotuner process to a specific GPU (sketched below).
- Our system (backslash) contains 24 CPUs and 8 K40 GPUs.
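A sketch of this parallel harness in Python. The kernel binary and its output format are assumptions; the slides pin a GPU with cudaSetDevice inside the CUDA harness, while here each worker restricts CUDA_VISIBLE_DEVICES before launching, which has the same effect.

    import os
    import subprocess
    from multiprocessing import Pool

    NB_GPUS = 8   # e.g. the eight K40s of backslash

    def evaluate_on_gpu(job):
        """Run one candidate kernel, pinned to GPU (index mod NB_GPUS)."""
        idx, config_file = job
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(idx % NB_GPUS))
        out = subprocess.run(["./kernel", config_file], env=env,
                             capture_output=True, text=True, check=True)
        return float(out.stdout.split()[0])    # time in milliseconds

    def evaluate_all(config_files):
        """Launch as many evaluations in parallel as there are GPUs."""
        with Pool(NB_GPUS) as pool:
            return pool.map(evaluate_on_gpu, list(enumerate(config_files)))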

SLIDE 21

Dealing with hard constraints

Deterministic case:

Too costly to evaluate the validity of all combinations of parameters.

Backtracking: a classical algorithm for finding the solutions of constrained computational problems. It builds candidates incrementally and drops them as soon as it determines that they cannot lead to valid solutions.

Stochastic case:

Too many rejections of potential candidates before finding valid ones; too time-consuming to explicitly code rules that ensure the validity of genes.

Space reduction: for every parameter, start from its space of possible values and reduce it until only one value remains. Parameters are set one after the other, and all the constraints related to them are checked in order to eliminate the impossible values for the other parameters.
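Both ideas fit in a few lines of Python. The sketch below combines them, under a hypothetical representation: each constraint is a pair of the parameter names it involves and a predicate, and a constraint is checked as soon as all of its parameters are set, pruning entire subtrees early.

    def search(domains, constraints, assigned=None):
        """Backtracking with early constraint checks.
        domains: {name: [candidate values]}
        constraints: [(names, predicate)]"""
        assigned = assigned or {}
        if len(assigned) == len(domains):
            return assigned                    # a fully valid configuration
        name = next(p for p in domains if p not in assigned)
        for value in domains[name]:
            trial = {**assigned, name: value}
            # check every constraint that has become decidable
            if all(pred(trial) for names, pred in constraints
                   if set(names) <= trial.keys()):
                result = search(domains, constraints, trial)
                if result is not None:
                    return result              # stop at the first solution
        return None                            # dead end: backtrack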

SLIDE 22

Optimization solutions

Deterministic and Stochastic

Strategy ∗: Sub-space search. Optimize sets of parameters independently, i.e., fix some parameters in one phase and optimize them in another phase.

Strategy 1: Exhaustive (or grid) search. For a 32x32 matrix:
- Size of the unconstrained search space: 2437438960041984
- Size of the constrained search space: 45536

Strategy 2: Random sampling. Not effective at finding an optimum, but effective at discovering regions of interest.

Strategy 3: Fusion and Mutation (genetic algorithms). Does not necessarily converge to the global optimum, but does converge to local minima (see the sketch after this list).

Strategy 4: OpenTuner (opentuner.org), an autotuning framework implementing several classical black-box optimization techniques.
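For Strategy 3, the two operators are tiny on a dictionary encoding of a configuration (a sketch; the mutation rate is illustrative, and offspring still have to pass the constraint check before being evaluated):

    import random

    def fuse(mother, father):
        """Crossover: each gene (parameter) comes from one parent or the other."""
        return {k: random.choice((mother[k], father[k])) for k in mother}

    def mutate(config, domains, rate=0.1):
        """Mutation: occasionally resample a parameter from its domain."""
        return {k: random.choice(domains[k]) if random.random() < rate else v
                for k, v in config.items()}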

SLIDE 23

Optimization solutions

Hybrid

Strategy 5: Meta-model.

- Generate a Latin hypercube to cover the search space and evaluate the corresponding points.
- Build a surrogate model based on RBF kernels (or Kriging).
- Optimize the Expected Improvement on the model in order to find the global optimum (using Particle Swarm Optimization (PSO) or another genetic algorithm):
  - Exploitation: ask for the evaluation of the current optimum point.
  - Exploration: ask for the evaluation of a point in a region where data is lacking.
- Update the model at every new evaluation.
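A compressed sketch of this loop, using SciPy's Latin hypercube sampler and RBF interpolator. Assumptions: the parameter space is rescaled to the unit cube, and the Expected-Improvement-plus-PSO step of the slides is replaced by a cheap random-candidate acquisition to keep the sketch short.

    import numpy as np
    from scipy.stats import qmc
    from scipy.interpolate import RBFInterpolator

    def metamodel_tune(evaluate, dim, n_init=50, n_iter=10, seed=0):
        """evaluate: maps a point of [0,1)^dim to a kernel run time."""
        rng = np.random.default_rng(seed)
        X = qmc.LatinHypercube(d=dim, seed=seed).random(n_init)  # cover space
        y = np.array([evaluate(x) for x in X])
        for _ in range(n_iter):
            model = RBFInterpolator(X, y)        # surrogate of the run time
            cand = rng.random((4096, dim))       # candidate acquisition set
            nxt = cand[np.argmin(model(cand))]   # exploit the model's minimum
            X = np.vstack([X, nxt])              # update the model's data
            y = np.append(y, evaluate(nxt))      # with every new evaluation
        return X[np.argmin(y)], float(y.min())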

SLIDE 24

Optimization solutions

Hybrid

After an initial evaluation of 50 points in the Latin hypercube, the meta-model converges to the best 1% of points in the parameter search space within only 10 iterations!

SLIDE 25

Overview

1. Introduction
2. Meta-programming
3. Optimization
4. Experimental results
5. Conclusion

SLIDE 26

Experimental environment

Target GPUs:

- K40: NVIDIA Kepler architecture, for HPC servers; contains 15 SMs.
- TX1: NVIDIA Maxwell architecture, for embedded systems; 2 SMs.

Libraries compared against:

- cuBLAS, from NVIDIA
- MAGMA, from ICL at UTK

Batch size: 1500 on the K40 and 100 on the TX1.

SLIDE 27

Performance Results

Arch  #Rows  #Cols  Autotuner          cuBLAS    MAGMA     MAGMA
                    DGEQRF + DLARFT    DGEQRF    DGEQRF    DGEQRF + DLARFT
K40      32     32     3.62              3.77      21.05      22.90
         64     32     5.59              6.18      21.19      23.23
        128     32    11.96             13.55      23.87      26.31
        256     32    24.86             43.47      37.44      40.71
        512     32    60.36            104.53      73.56      78.51
TX1      32     32    22.12             28.06     131.13     144.80
         64     32    43.80             41.49     483.08     506.12
        128     32    74.78             81.84     911.85    1061.88
        256     32   164.88            267.09    1784.02    2073.35
        512     32  1426.48            586.32    3673.41    4240.94

Time in milliseconds of batched QR factorization kernels on the K40 (Kepler) and TX1 (Maxwell) GPUs; lower is better. The optimal parameters differ from one architecture to another and from one matrix size to another.

SLIDE 28

Counter-intuitive results

- The best kernels are not the fastest ones! The fastest kernel in terms of GFlops may have a lower occupancy; the optimal kernel might be slower in isolation, but since more instances of it can run concurrently, the total computation time of the whole batch is lower.
- The optimal kernels induce register spills and use some local memory . . . but not too much. There is a trade-off between using some local memory and achieving a higher occupancy; the optimum lies somewhere at the limit.
- This result is counter-intuitive: when we handcraft a kernel, we always carefully try to avoid register spills and the use of local memory.

SLIDE 29

Overview

1. Introduction
2. Meta-programming
3. Optimization
4. Experimental results
5. Conclusion

SLIDE 30

Summary

Write once, use forever!

- Writing a templatized code is different from, and more challenging than, customizing a code with fixed parameters . . .
- . . . but the hassle of tweaking the last bit of performance out of the code is shifted from the library developer to the computer.
- Several days or weeks are necessary to find the best kernels for a given architecture, but this tedious work has to be done only once. The optimized kernel can then be packaged into a library ready for use by the end users.

SLIDE 31

Perspectives

- Rely on the generated autotuned kernels as building blocks for factorizing matrices of general size.
- Currently working on a black-box optimization technique that relies on multi-task learning and transfer learning to speed up the optimization phase across several concurrent, similar optimization tasks.

SLIDE 32

Thank You!
