SLIDE 1

Autotuning Dense Batched QR Factorizations on GPU

Wissam M. Sid-Lakhdar, Tim A. Davis, Xiaoye S. Li

Texas A&M University & Lawrence Berkeley National Laboratory

March 26, 2018

SLIDE 2

Overview

1. Introduction
2. Meta-programming
3. Optimization
4. Experimental results
5. Conclusion

SLIDE 3

Motivation and Goal

Portability or Efficiency?

- Portability (too general): write one code that fits all GPU architectures, but it will not be the fastest, or even fast enough, on any one of them.
- Efficiency (too specific): write the best code for one GPU architecture, but it will be much less efficient on, or will not work at all on, other architectures.
- Effort: writing an efficient code for every architecture is tedious and unsustainable.

SLIDE 4

Motivation and Goal

Portability or Efficiency?

How can we get both portability and efficiency with minimum effort?

SLIDE 5

Our approach

Within NSF SparseKaffe project

Autotuning:

- Write one general template code that relies on a set of parameters.
- The autotuner generates, compiles, runs, and checks a kernel for every combination of parameters.
- The autotuner traverses the parameter search space to find the combination leading to the best (fastest) kernel, for any given GPU architecture.

In this talk: autotuning batched dense QR factorization on GPUs.
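To make the generate-compile-run-check loop concrete, here is a minimal Python sketch of such a driver. The file names, Makefile target, and the benchmark's "<time_ms> <ok|fail>" output format are illustrative assumptions, not the project's actual interfaces; the real generator is PyExpander-based, as described next.

    import itertools
    import subprocess

    def evaluate(params, template):
        """Generate, compile, run, and check one kernel; return its time."""
        # 1. Generate: expand the template for this parameter combination
        #    (the real system uses PyExpander; str.format stands in here).
        with open("kernel.cu", "w") as f:
            f.write(template.format(**params))
        # 2. Compile (in the real system even the Makefile is generated).
        subprocess.run(["make", "kernel"], check=True)
        # 3. Run and check: assume the harness prints "<time_ms> <ok|fail>".
        out = subprocess.run(["./kernel"], capture_output=True, text=True,
                             check=True).stdout.split()
        return float(out[0]) if out[1] == "ok" else float("inf")

    def autotune(domains, template):
        """Traverse the whole search space; return the fastest valid kernel."""
        names = list(domains)
        best, best_time = None, float("inf")
        for combo in itertools.product(*domains.values()):
            params = dict(zip(names, combo))
            t = evaluate(params, template)
            if t < best_time:
                best, best_time = params, t
        return best, best_time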

SLIDE 7

Overview

1. Introduction
2. Meta-programming
3. Optimization
4. Experimental results
5. Conclusion

SLIDE 8

Algorithm

Matlab

    function [A, V1, T] = vthqr_gpu(A)
        [m, n] = size(A);
        T = zeros(min(m, n));
        for k = 1:min(m, n)
            [v, tau, s] = house_higham(A(k:m, k));
            V1(k) = v(1);
            A(k+1:m, k) = v(2:end);
            z = -tau * v' * A(k:m, :);
            A(k:m, k+1:n) = A(k:m, k+1:n) + v * z(k+1:n);
            T(1:k-1, k) = T(1:k-1, 1:k-1) * z(1:k-1)';
            T(k, k) = tau;
            A(k, k) = s;
        end
    end

QR factorization (for GPU), with Householder reflections à la Higham:

- Numerical stability (when the norm of the Householder vector is small)
- Fewer operations (most entries of the Householder vector stay unchanged) ⇒ GPU friendly

Computing and using the z vector reduces branching (and thus warp divergence) and exposes more parallelism.
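The house_higham routine itself is not shown in the slides; the following NumPy sketch reconstructs what such a cancellation-free Householder generator plausibly computes. The key point is that v(2:end) keeps the original column entries untouched, so only v(1), tau, and the new diagonal entry s need to be produced.

    import numpy as np

    def house_higham(x):
        """Return (v, tau, s) with H = I - tau*v*v' and H @ x = s*e1.
        Only v[0] differs from x: the entries x[1:] are left untouched,
        which is what makes the method GPU friendly."""
        sigma = float(x[1:] @ x[1:])
        v = x.copy()
        if sigma == 0.0:
            return v, 0.0, float(x[0])    # x is already a multiple of e1
        mu = np.sqrt(x[0]**2 + sigma)     # = ||x||_2
        if x[0] <= 0.0:
            v[0] = x[0] - mu              # no cancellation in this case
        else:
            v[0] = -sigma / (x[0] + mu)   # Higham: x[0] - mu, computed safely
        tau = 2.0 / (sigma + v[0] ** 2)   # = 2 / (v' @ v)
        return v, tau, mu                 # H @ x = mu * e1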

SLIDE 9

Template

Python/CUDA

PyExpander: replacing and extending the C macro system by leveraging the power of Python.

- Ability to use loops, which is very difficult and painful with macros.
- Ability to have functions calling other functions or using variables, which is very difficult with C macros.
- Nice checking done by the Python interpreter, instead of wrestling with incomprehensible errors from the C/CUDA compiler.
- Even the Makefile is generated, to take the architecture type and optimization options into account.

SLIDE 10

Code example

Template code: PyExpander instructions are evaluated by the Python interpreter.

$for ≈ #pragma unroll
$if ≈ #if ... #endif
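As a toy illustration of the $for construct, this Python stand-in unrolls a register-load loop at generation time, producing straight-line CUDA source. The fragment and the names rA, posX0A, posY0A, DtThXA, and lda are hypothetical, echoing the positioning parameters used later.

    def expand_load_loop(nb_reg: int) -> str:
        """What a PyExpander '$for(i in range(NbReg)) ... $endfor' block
        boils down to: Python runs the loop; the CUDA compiler never sees it."""
        return "\n".join(
            f"rA[{i}] = A[posX0A + {i} * DtThXA + posY0A * lda];"
            for i in range(nb_reg))

    print(expand_load_loop(2))
    # rA[0] = A[posX0A + 0 * DtThXA + posY0A * lda];
    # rA[1] = A[posX0A + 1 * DtThXA + posY0A * lda];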

SLIDE 11

Parameters

Problem: TlSz, NbXTl, NbYTl. Inputs (fixed for every configuration).

Architecture: WpSz, NbTh, NbReg.

Mapping: parameters of the form {Nb, Dt} x {Th, Wp} x {X, Y} x {A, T} (e.g., NbThXA, DtWpXA): counts and strides of threads and warps along the rows and columns of A and T.

Load/Store: NbXChkA, NbXChkT.

Code optimization: X∗, X 1∗, . . . Switch between sub-algorithms; replace the pragma and inline mechanisms of CUDA.

. . . : many more parameters and routines depend on the parameters above.

SLIDE 12

Search space

- Some parameters need to be of the form 2^i, i ∈ [0, n], in order to make the generated code simpler (⇒ faster).
- The search space of the Mapping parameters is bounded by the values of the Problem parameters.
- The search spaces of the Architecture and Load/Store parameters depend on the architectural characteristics of the targeted GPU.
- The Optimization parameters are (most often) Booleans, used to turn some features on/off.
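A sketch of how the raw search space might be assembled, with hypothetical bounds (the real bounds come from the problem sizes and the targeted GPU's specifications):

    import itertools

    def pow2_domain(n):
        """Powers of two 2^i, i in [0, n]: simpler, hence faster, generated code."""
        return [2 ** i for i in range(n + 1)]

    DOMAINS = {
        "NbTh":   pow2_domain(10),   # Architecture: threads per block
        "NbReg":  pow2_domain(7),    # Architecture: registers per thread
        "NbThXA": pow2_domain(5),    # Mapping: bounded by the Problem sizes
        "NbThYA": pow2_domain(5),
        "OptZ":   [False, True],     # Optimization: on/off feature switch
    }

    def raw_space(domains):
        """Cartesian product: every combination, before constraints prune it."""
        for combo in itertools.product(*domains.values()):
            yield dict(zip(domains, combo))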

SLIDE 13

Constraints

- Equalities: enforce a bijection between matrices and threads.
- Inequalities: prohibit out-of-memory accesses.
- Conditional constraints.

Examples:

(0) NbTh ∗ NbReg ≤ NbMaxReg: the total number of registers cannot exceed the architectural limit.
(1) NbThXA ∗ NbThYA ∗ NbTh == TlSz^2 ∗ NbXTl ∗ NbYTl: the sum of the threads' registers for A equals the surface of A.
(2) NbThXA ∗ DtThXA ≤ TlSz ∗ NbXTl: a thread cannot be mapped on rows outside of A.
(3) NbWpXA ∗ NbWpYA == WpSz: the layout of a warp respects its size.
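In Python form, the four example constraints might read as follows (a sketch; NbMaxReg and WpSz come from the target architecture, and the parameter dictionary p uses the names from the slides):

    def satisfies(p, NbMaxReg, WpSz=32):
        """True iff configuration p passes example constraints (0)-(3)."""
        return (
            p["NbTh"] * p["NbReg"] <= NbMaxReg                          # (0)
            and p["NbThXA"] * p["NbThYA"] * p["NbTh"]
                == p["TlSz"] ** 2 * p["NbXTl"] * p["NbYTl"]             # (1)
            and p["NbThXA"] * p["DtThXA"] <= p["TlSz"] * p["NbXTl"]     # (2)
            and p["NbWpXA"] * p["NbWpYA"] == WpSz                       # (3)
        )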

SLIDE 14

Positioning

Position of the first row of the first thread of a warp in matrix A:
    posWpXA = (WpIdXA / cx) ∗ dx + ((WpIdXA & (cx − 1)) / ex) ∗ fx + (WpIdXA & (ex − 1))    (1)

Position of a thread within its warp:
    posThWpXA = (ThWpId / NbWpYA) ∗ DtWpXA    (2)

Position of the first row of a thread:
    posX0A = posWpXA + posThWpXA    (3)

Relative position of the i-th row of a thread:
    posThXA(i) = i ∗ DtThXA    (4)

Position of the i-th row of a thread:
    posX(i) = posX0A + posThXA(i)    (5)

(Divisions are integer divisions; the masks assume power-of-two parameters.)
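Read this way, the decomposition is easy to mock up in Python. This is a sketch assuming integer division and power-of-two masks; cx, dx, ex, fx are constants derived from the mapping parameters.

    def pos_wp_xa(wp_id, cx, dx, ex, fx):
        """(1): decompose a warp index into block / sub-block / offset."""
        return ((wp_id // cx) * dx
                + ((wp_id & (cx - 1)) // ex) * fx
                + (wp_id & (ex - 1)))

    def pos_th_wp_xa(th_wp_id, nb_wp_ya, dt_wp_xa):
        """(2): position of a thread inside its warp."""
        return (th_wp_id // nb_wp_ya) * dt_wp_xa

    def pos_x(i, pos_x0a, dt_th_xa):
        """(5): row i of a thread. pos_x0a is (3), computed once per thread;
        i * dt_th_xa is the cheap relative offset (4)."""
        return pos_x0a + i * dt_th_xa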

SLIDE 15

Positioning

posThXA(i) and posThYA(j) are straightforward to compute. posX0A and posY0A are expensive to compute, so every thread computes them only once and stores them in dedicated registers.

SLIDE 16

Implementation issues

- Template code is harder to read/write/modify than standard code.
- CUDA optimization decisions are not easy to make in template code.
- Over-use of the select statement.

SLIDE 17

Autotuner

SLIDE 18

Overview

1. Introduction
2. Meta-programming
3. Optimization
4. Experimental results
5. Conclusion

SLIDE 19

Optimization problem

- The objective function is the execution time of the kernels.
- No analytical formulation exists; every function evaluation is costly.
- The gradient is unknown. It can be approximated, but at a high cost.
- The optimization constraints are non-linear.

⇒ This is classified as a black-box optimization problem. In the general case, no method can ever exist with a proof of convergence (no-free-lunch theorem).

SLIDE 20

Optimization parallelization

- The evaluation of the objective function for different parameter configurations is embarrassingly parallel: as many evaluations can be launched in parallel as there are CPUs/GPUs available.
- Exploiting this parallelism is the main focus of the BONSAI project at ICL (UTK).
- We use the cudaSetDevice(GPU ID) routine to map an autotuner process to a specific GPU (sketched below).
- Our system (backslash) contains 24 CPUs and 8 K40 GPUs.
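A sketch of this parallel harness in Python. The kernel binary and its output format are assumptions; the slides pin a GPU with cudaSetDevice inside the CUDA harness, while here each worker restricts CUDA_VISIBLE_DEVICES before launching, which has the same effect.

    import os
    import subprocess
    from multiprocessing import Pool

    NB_GPUS = 8   # e.g. the eight K40s of backslash

    def evaluate_on_gpu(job):
        """Run one candidate kernel, pinned to GPU (index mod NB_GPUS)."""
        idx, config_file = job
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(idx % NB_GPUS))
        out = subprocess.run(["./kernel", config_file], env=env,
                             capture_output=True, text=True, check=True)
        return float(out.stdout.split()[0])    # time in milliseconds

    def evaluate_all(config_files):
        """Launch as many evaluations in parallel as there are GPUs."""
        with Pool(NB_GPUS) as pool:
            return pool.map(evaluate_on_gpu, list(enumerate(config_files)))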

SLIDE 21

Dealing with hard constraints

Deterministic case:

Too costly to evaluate the validity of all combinations of parameters.

Backtracking: a classical algorithm for finding the solutions of constrained computational problems. It builds candidates incrementally and drops them as soon as it determines that they cannot lead to valid solutions.

Stochastic case:

Too many rejections of potential candidates before finding valid ones; too time-consuming to explicitly code rules that ensure the validity of genes.

Space reduction: for every parameter, start from its space of possible values and reduce it until only one value remains. Parameters are set one after the other, and all the constraints related to them are checked in order to eliminate the impossible values for the other parameters.
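Both ideas fit in a few lines of Python. The sketch below combines them, under a hypothetical representation: each constraint is a pair of the parameter names it involves and a predicate, and a constraint is checked as soon as all of its parameters are set, pruning entire subtrees early.

    def search(domains, constraints, assigned=None):
        """Backtracking with early constraint checks.
        domains: {name: [candidate values]}
        constraints: [(names, predicate)]"""
        assigned = assigned or {}
        if len(assigned) == len(domains):
            return assigned                    # a fully valid configuration
        name = next(p for p in domains if p not in assigned)
        for value in domains[name]:
            trial = {**assigned, name: value}
            # check every constraint that has become decidable
            if all(pred(trial) for names, pred in constraints
                   if set(names) <= trial.keys()):
                result = search(domains, constraints, trial)
                if result is not None:
                    return result              # stop at the first solution
        return None                            # dead end: backtrack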

SLIDE 22

Optimization solutions

Deterministic and Stochastic

Strategy ∗: Sub-space search. Optimize sets of parameters independently, i.e., fix some parameters in one phase and optimize them in another phase.

Strategy 1: Exhaustive (or grid) search. For a 32x32 matrix:
- Size of the unconstrained search space: 2437438960041984
- Size of the constrained search space: 45536

Strategy 2: Random sampling. Not effective at finding an optimum, but effective at discovering regions of interest.

Strategy 3: Fusion and Mutation (genetic algorithms). Does not necessarily converge to the global optimum, but does converge to local minima (see the sketch after this list).

Strategy 4: OpenTuner (opentuner.org), an autotuning framework implementing several classical black-box optimization techniques.
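For Strategy 3, the two operators are tiny on a dictionary encoding of a configuration (a sketch; the mutation rate is illustrative, and offspring still have to pass the constraint check before being evaluated):

    import random

    def fuse(mother, father):
        """Crossover: each gene (parameter) comes from one parent or the other."""
        return {k: random.choice((mother[k], father[k])) for k in mother}

    def mutate(config, domains, rate=0.1):
        """Mutation: occasionally resample a parameter from its domain."""
        return {k: random.choice(domains[k]) if random.random() < rate else v
                for k, v in config.items()}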

SLIDE 23

Optimization solutions

Hybrid

Strategy 5: Meta-model.

- Generate a Latin hypercube to cover the search space and evaluate the corresponding points.
- Build a surrogate model based on RBF kernels (or Kriging).
- Optimize the Expected Improvement on the model in order to find the global optimum (using Particle Swarm Optimization (PSO) or another genetic algorithm):
  - Exploitation: ask for the evaluation of the current optimum point.
  - Exploration: ask for the evaluation of a point in a region where data is lacking.
- Update the model at every new evaluation.
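A compressed sketch of this loop, using SciPy's Latin hypercube sampler and RBF interpolator. Assumptions: the parameter space is rescaled to the unit cube, and the Expected-Improvement-plus-PSO step of the slides is replaced by a cheap random-candidate acquisition to keep the sketch short.

    import numpy as np
    from scipy.stats import qmc
    from scipy.interpolate import RBFInterpolator

    def metamodel_tune(evaluate, dim, n_init=50, n_iter=10, seed=0):
        """evaluate: maps a point of [0,1)^dim to a kernel run time."""
        rng = np.random.default_rng(seed)
        X = qmc.LatinHypercube(d=dim, seed=seed).random(n_init)  # cover space
        y = np.array([evaluate(x) for x in X])
        for _ in range(n_iter):
            model = RBFInterpolator(X, y)        # surrogate of the run time
            cand = rng.random((4096, dim))       # candidate acquisition set
            nxt = cand[np.argmin(model(cand))]   # exploit the model's minimum
            X = np.vstack([X, nxt])              # update the model's data
            y = np.append(y, evaluate(nxt))      # with every new evaluation
        return X[np.argmin(y)], float(y.min())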

SLIDE 24

Optimization solutions

Hybrid

After an initial evaluation of 50 points in the Latin hypercube, the meta-model converges to the best 1% of points in the parameter search space within only 10 iterations!

SLIDE 25

Overview

1. Introduction
2. Meta-programming
3. Optimization
4. Experimental results
5. Conclusion

SLIDE 26

Experimental environment

Target GPUs:

- K40: NVIDIA Kepler architecture, for HPC servers; contains 15 SMs.
- TX1: NVIDIA Maxwell architecture, for embedded systems; 2 SMs.

Libraries compared against:

- cuBLAS, from NVIDIA
- MAGMA, from ICL at UTK

Batch size: 1500 on the K40 and 100 on the TX1.

SLIDE 27

Performance Results

Arch  #Rows  #Cols  Autotuner          cuBLAS    MAGMA     MAGMA
                    DGEQRF + DLARFT    DGEQRF    DGEQRF    DGEQRF + DLARFT
K40      32     32     3.62              3.77      21.05      22.90
         64     32     5.59              6.18      21.19      23.23
        128     32    11.96             13.55      23.87      26.31
        256     32    24.86             43.47      37.44      40.71
        512     32    60.36            104.53      73.56      78.51
TX1      32     32    22.12             28.06     131.13     144.80
         64     32    43.80             41.49     483.08     506.12
        128     32    74.78             81.84     911.85    1061.88
        256     32   164.88            267.09    1784.02    2073.35
        512     32  1426.48            586.32    3673.41    4240.94

Time in milliseconds of batched QR factorization kernels on the K40 (Kepler) and TX1 (Maxwell) GPUs; lower is better. The optimal parameters differ from one architecture to another and from one matrix size to another.

SLIDE 28

Counter-intuitive results

- The best kernels are not the fastest ones! The fastest kernel in terms of GFlops may have a lower occupancy; the optimal kernel might be slower in isolation, but since more instances of it can run concurrently, the total computation time of the whole batch is lower.
- The optimal kernels induce register spills and use some local memory . . . but not too much. There is a trade-off between using some local memory and achieving a higher occupancy; the optimum lies somewhere at the limit.
- This result is counter-intuitive: when we handcraft a kernel, we always carefully try to avoid register spills and the use of local memory.

SLIDE 29

Overview

1. Introduction
2. Meta-programming
3. Optimization
4. Experimental results
5. Conclusion

SLIDE 30

Summary

Write once, use forever!

- Writing a templatized code is different from, and more challenging than, customizing a code with fixed parameters . . .
- . . . but the hassle of tweaking the last bit of performance out of the code is shifted from the library developer to the computer.
- Several days or weeks are necessary to find the best kernels for a given architecture, but this tedious work has to be done only once. The optimized kernel can then be packaged into a library ready for use by the end users.

SLIDE 31

Perspectives

- Rely on the generated autotuned kernels as building blocks for factorizing matrices of general size.
- Currently working on a black-box optimization technique that relies on multi-task learning and transfer learning to speed up the optimization phase across several concurrent, similar optimization tasks.

SLIDE 32

Thank You!
