  1. Exploiting Multiple GPUs in Sparse QR: Regular Numerics with Irregular Data Movement. Tim Davis (Texas A&M University), with Sanjay Ranka and Mohamed Gadou (University of Florida) and Nuri Yeralan (Microsoft). NVIDIA GTC 2015, March 2015.

  2. Outline
     Combinatorial scientific computing: math + CS + applications
     Multifrontal methods for factorizing sparse matrices
        fill-in creates cliques in the graph
        cliques connected via the elimination tree
        sparse LU: square cliques assembled via addition
        sparse LU: rectangular, via addition
        sparse QR: rectangular, via concatenation
     GPU kernel design for sparse QR
     Bucket scheduler for single-GPU QR
     Extended for multi-GPU QR
     Performance results

  3. Combinatorial Scientific Computing: math + CS + applications

  4. Cliques in the graph: the multifrontal method
     Cliques + elimination tree = sequence of frontal matrices
     Dense factorization within a front; assemble data into the parent
     Regular + irregular computation
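
     As a small aside (my own illustration, not from the slides), MATLAB exposes the
     elimination tree directly; it is the structure that links the sequence of frontal
     matrices described above.  gallery('poisson',20) is just a convenient sparse test matrix.

        % Illustrative sketch: the elimination tree behind the multifrontal method.
        A = gallery ('poisson', 20) ;        % sparse SPD matrix from a 2D mesh
        [parent, post] = etree (A) ;         % elimination tree and a postordering
        % A multifrontal factorization visits the fronts in postorder: each front
        % is factorized densely, and its pending rows/columns are assembled into
        % the front of its parent, parent(j).
        treeplot (parent) ;                  % draw the elimination tree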

  5. UMFPACK: unsymmetric multifrontal method
     Frontal matrices become rectangular
     Assemble data into ancestors, not just parents

  6. UMFPACK: unsymmetric multifrontal method
     Key results / impact:
        high performance via dense matrix kernels within each front
        symbolic preordering and analysis, followed by a revised local pivot
           search with an approximate unsymmetric degree update
        widely used: sparse LU and x = A\b in MATLAB; Mathematica; an IBM
           circuit simulation application; finite-element solvers (NASTRAN,
           FEniCS, ...); Berkeley Design Automation (circuit simulation);
           CVXOPT; ...
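
     For context, a minimal MATLAB sketch (mine, not from the talk) of the path the
     slide refers to: per the slide, sparse LU and x = A\b in MATLAB reach UMFPACK for
     square unsymmetric sparse systems.

        % Sparse unsymmetric solve and explicit LU factors in MATLAB.
        n = 1000 ;
        A = sprand (n, n, 0.01) + speye (n) ;   % sparse, unsymmetric, nonsingular
        b = rand (n, 1) ;
        x = A \ b ;                             % sparse LU solve (the UMFPACK path, per the slide)
        [L, U, P, Q] = lu (A) ;                 % explicit sparse factors: P*A*Q = L*U
        norm (P*A*Q - L*U, 1)                   % residual of the factorization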

  7. SuiteSparseQR: multifrontal sparse QR factorization
     Key results / impact:
        rectangular fronts like UMFPACK, but simpler frontal matrix assembly
           (concatenation, not summation) (Duff, Puglisi)
        rank approximation (Heath, extended to the multifrontal case)
        multicore parallelism on a multicore CPU (70 Gflops theoretical peak):
           up to 30 Gflops
        sparse qr in MATLAB, and x = A\b
        today's talk: the GPU algorithm
           novel "Bucket QR" scheduler and custom GPU kernels
           up to 150 Gflops on one Kepler K20c, 286 Gflops on 4 Tesla C2070s
           up to 28x speedup vs the CPU algorithm (10x typical for large problems)
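
     A minimal usage sketch (my own example) of the MATLAB interface mentioned above;
     for sparse rectangular A, qr and backslash are backed by SuiteSparseQR.

        % Sparse least squares with the SuiteSparseQR-backed qr in MATLAB.
        A = sprand (5000, 200, 0.01) ;
        b = rand (5000, 1) ;
        x1 = A \ b ;                     % least-squares solution via sparse QR
        [C, R, p] = qr (A, b, 0) ;       % economy QR; C = Q'*b, p = column permutation
        x2 = zeros (size (A,2), 1) ;
        x2 (p) = R \ C ;                 % standard sparse least-squares recipe
        norm (x1 - x2)                   % for full-rank A the two solutions agree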

  8. SuiteSparseQR: a column elimination tree and its supernodes

  9. SuiteSparseQR: frontal matrix assembly

  10. SuiteSparseQR: concatenation, resulting in a staircase matrix

  11. SuiteSparseQR

  12. Multifrontal factorization and assembly
     Prior methods:
        one front at a time on the GPU
        assembly on the CPU
        panel factorization on the CPU, applied on the GPU
     Our multifrontal QR (see the assembly sketch below):
        many fronts on the GPU (an entire subtree)
        assembly on the GPU: data concatenation, not summation
        entire dense QR of each front on the GPU
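
     A toy sketch (mine, with made-up sizes and overlap) of the distinction: LU- and
     Cholesky-style assembly scatter-adds child contributions into the parent front,
     while the QR assembly here only concatenates them, so it is pure data movement.

        C1 = rand (3, 5) ;  C2 = rand (4, 5) ;  % child contribution blocks
        Arows = rand (6, 5) ;                   % rows of A mapped to the parent front
        % summation-style assembly: scatter-add into overlapping positions
        F_sum = zeros (6, 5) ;
        F_sum (1:3,:) = F_sum (1:3,:) + C1 ;
        F_sum (2:5,:) = F_sum (2:5,:) + C2 ;
        % concatenation-style assembly (sparse QR): stack below the rows of A,
        % giving a staircase front; no floating-point additions at all
        F_cat = [Arows ; C1 ; C2] ;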

  13. Consider a subtree of frontal matrices on the GPU

  14. Expanded to show GPU kernel launches

  15–23. Bucket QR factorization (figure sequence)

  24. GPU kernels for Bucket QR
     Bucket QR requires two kernels on the GPU:
        QR factorization of a t-by-1 tile, t = 1, 2, or 3
           creates a block Householder (details on the next slides)
        Apply a block Householder: A = A - V (T' (V' A))
           A is t-by-s, where s can be large
           the thread block iterates over 2 column blocks at a time (details omitted)
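
     A one-line MATLAB rendition (the function name is mine, purely illustrative) of the
     apply kernel's arithmetic; V is unit lower trapezoidal and T is upper triangular,
     produced by the tile QR on the following slides.

        function A = apply_block_householder (V, T, A)
            % Apply Q' = I - V*T'*V' from the left:  A = A - V*(T'*(V'*A)).
            % The parenthesization matters: V'*A is small (n-by-s), so the
            % update costs O(m*n*s) and never forms an m-by-m matrix.
            A = A - V * (T' * (V' * A)) ;
        end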

  25. GPU kernel: Block Householder QR
     Block Householder: Q*R = A, where Q = (I - V*T'*V')' and R is upper triangular.
        [m n] = size (A)
        for k = 1:n
            [tau, v] = house (A (k:m,k))
            A (k:m,k:n) = A (k:m,k:n) - v * (tau * (v' * A (k:m,k:n)))
            V (k:m,k) = v ;
            Tau (k) = tau
        end
        T = zeros (n)
        for k = 1:n
            tau = Tau (k) ;
            v = V (k:m,k)
            z = - tau * v' * V (k:m,1:k-1)
            T (1:k-1,k) = T (1:k-1,1:k-1) * z'
            T (k,k) = tau
        end
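
     The pseudocode above calls house, which is not defined on the slide.  The following
     self-contained sketch (mine) supplies a simple, non-robust version and checks that
     the V and T built by the two loops satisfy Q = I - V*T*V' with Q*R = A.

        function demo_block_householder_qr
            % Check the slide's construction on a random 96-by-32 front
            % (the tile size used by the GPU kernel, per slide 30).
            A0 = rand (96, 32) ;  A = A0 ;
            [m, n] = size (A) ;
            V = zeros (m, n) ;  Tau = zeros (n, 1) ;
            for k = 1:n                          % factorize (first loop on the slide)
                [tau, v] = house (A (k:m,k)) ;
                A (k:m,k:n) = A (k:m,k:n) - v * (tau * (v' * A (k:m,k:n))) ;
                V (k:m,k) = v ;  Tau (k) = tau ;
            end
            T = zeros (n) ;
            for k = 1:n                          % build T (second loop on the slide)
                tau = Tau (k) ;  v = V (k:m,k) ;
                z = - tau * v' * V (k:m,1:k-1) ;
                T (1:k-1,k) = T (1:k-1,1:k-1) * z' ;
                T (k,k) = tau ;
            end
            R = triu (A (1:n,1:n)) ;
            Q = eye (m) - V * T * V' ;           % Q = (I - V*T'*V')'
            disp (norm (Q (:,1:n) * R - A0)) ;   % ~ eps: Q*R reconstructs A
            disp (norm (Q' * Q - eye (m))) ;     % ~ eps: Q is orthogonal
        end

        function [tau, v] = house (x)
            % Simple Householder (assumes x(1) ~= 0; not numerically careful):
            % H = I - tau*v*v' maps x to a multiple of e1, with v(1) = 1.
            v = x ;
            v (1) = x (1) + sign (x (1)) * norm (x) ;
            v = v / v (1) ;
            tau = 2 / (v' * v) ;
        end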

  26. GPU kernel: Block Householder QR
     Householder update:
        A (k:m,k:n) = A (k:m,k:n) - v * (tau * (v' * A (k:m,k:n)))
     Construct T:
        z = - tau * v' * V (k:m,1:k-1)
        T (1:k-1,k) = T (1:k-1,1:k-1) * z'
        T (k,k) = tau

  27. GPU kernel: Block Householder QR
     Towards a GPU kernel: overwrite tril(A,-1) with V, and fold in the construction of T.
        [m n] = size (A)
        T = zeros (n)
        for k = 1:n
            [tau, v] = house (A (k:m,k))
            A (k:m,k:n) = A (k:m,k:n) - ...
                v * (tau * (v' * A (k:m,k:n)))
            V1 (k) = v (1)                  % keep v(1); the rest of v overwrites tril(A,-1)
            A (k+1:m,k) = v (2:end)
            z = - tau * v' * A (k:m,1:k-1)  % columns 1:k-1 of A now hold the earlier v's
            T (1:k-1,k) = T (1:k-1,1:k-1) * z'
            T (k,k) = tau
        end

  28. GPU kernel: Block Householder QR
     The GPU kernel: update A and construct T in parallel.
        A (k:m,k:n) = A (k:m,k:n) - ...
            v * (tau * (v' * A (k:m,k:n)))
        z = - tau * (v' * A (k:m,1:k-1))
        T (1:k-1,k) = T (1:k-1,1:k-1) * z'
        T (k,k) = tau
     becomes
        z = -tau * v' * A (k:m,:)
        A (k:m,k:n) = A (k:m,k:n) + v * z (k:n)
        T (1:k-1,k) = T (1:k-1,1:k-1) * z (1:k-1)'
        T (k,k) = tau
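
     A quick numerical check (my own, with an arbitrary step index and random data) that
     the fused form is equivalent: because columns 1:k-1 of rows k:m already hold V, the
     single matvec z = -tau*v'*A(k:m,:) serves both the trailing update (z(k:n)) and the
     T column (z(1:k-1)).

        m = 96 ;  n = 32 ;  k = 8 ;              % arbitrary step for the check
        Astate = rand (m, n) ;                   % A at step k; cols 1:k-1 hold V below row k
        v = [1 ; rand(m-k, 1)] ;                 % current Householder vector, v(1) = 1
        tau = 2 / (v' * v) ;
        % separate computations (slide 27):
        z_sep   = - tau * (v' * Astate (k:m, 1:k-1)) ;
        upd_sep = Astate (k:m, k:n) - v * (tau * (v' * Astate (k:m, k:n))) ;
        % fused computation (slide 28): one matvec across all n columns
        z = - tau * v' * Astate (k:m, :) ;
        upd_fused = Astate (k:m, k:n) + v * z (k:n) ;
        disp (norm (z (1:k-1) - z_sep)) ;        % ~ 0 (roundoff): same T contribution
        disp (norm (upd_fused - upd_sep)) ;      % ~ 0 (roundoff): same trailing update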

  29. GPU kernel: Block Householder QR
        z = -tau * v' * A (k:m,:)
        A (k:m,k:n) = A (k:m,k:n) + v * z (k:n)
        T (1:k-1,k) = T (1:k-1,1:k-1) * z (1:k-1)'
        T (k,k) = tau

  30. GPU kernel: Block Householder QR
     The GPU kernel: thread-level parallelism
        z = -tau * v' * A (k:m,:)
        A (k:m,k:n) = A (k:m,k:n) + v * z (k:n)
        T (1:k-1,k) = T (1:k-1,1:k-1) * z (1:k-1)'
        T (k,k) = tau
     A is 96-by-32, held in registers during the factorization and written to
        global memory (V and R) at the end
     each thread owns an 8-by-1 "bitty block" of A
     v is 96-by-1, in shared memory
     z is 32-by-1, in shared memory; a 12-by-32 shared space is required for
        the v'*A(k:m,:) reduction
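
     To see where the 12-by-32 reduction space comes from (my own illustration, using the
     k = 1 step so all 96 rows are involved): 96 rows / 8 rows per thread = 12 partial dot
     products per column, summed in a second stage.

        m = 96 ;  n = 32 ;
        A = rand (m, n) ;  v = rand (m, 1) ;  tau = 2 / (v' * v) ;
        % stage 1: each "thread" forms an 8-by-1 dot product for its column
        partial = zeros (12, n) ;                % the 12-by-32 reduction space
        for chunk = 1:12
            rows = (chunk-1)*8 + (1:8) ;
            partial (chunk, :) = v (rows)' * A (rows, :) ;
        end
        % stage 2: "warp zero" sums the partials in each column to form z
        z = - tau * sum (partial, 1) ;
        disp (norm (z - (- tau * (v' * A)))) ;   % ~ 0: matches the full matvec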

  31. GPU kernel: Block Householder QR
     Putting it all together.  At the kth step:
        threads that own column k write A to shared memory
        thread zero computes the Householder coefficients
        z = -tau * v' * A (k:m,:)
           each thread computes an 8-by-1 dot product in parallel and writes
           the scalar result into the 12-by-32 reduction space
           warp zero sums the reduction space to get z
        A (k:m,k:n) = A (k:m,k:n) + v * z (k:n)
           only done by threads that own columns k:n
           threads that own column k+1 compute the norm of that column of A,
           for the next Householder coefficient, saving the result in a
           12-by-1 reduction space
        T (1:k-1,k) = T (1:k-1,1:k-1) * z (1:k-1)'
           only done by threads 1:k-1
        thread zero sums the reduction space for the norm of column k+1

  32. GPU kernel: Block Householder QR
        z = -tau * v' * A (k:m,:)
     each thread computes an 8-by-1 dot product in parallel and writes the
     scalar result into the 12-by-32 reduction space; warp zero sums the
     reduction space to get z

  33. GPU kernel: Block Householder QR
        A (k:m,k:n) = A (k:m,k:n) + v * z (k:n)
     only done by threads that own columns k:n; threads that own column k+1
     compute the norm of that column of A, for the next Householder
     coefficient, saving the result in a 12-by-1 reduction space

  34. GPU kernel: Block Householder QR
        T (1:k-1,k) = T (1:k-1,1:k-1) * z (1:k-1)'
     only done by threads 1:k-1

  35. Single GPU performance results
     Putting it all together ... performance results:

                                      Fermi        K20          K40
     GPU kernels:
        apply block Householder       183 Gflops   260 Gflops   360 Gflops
        factorize 3 tiles             27 Gflops    20 Gflops    -
     dense QR for a large front       107 Gflops   120 Gflops   -
        ('bare metal' flops)          154 Gflops   172 Gflops   -
     sparse QR on the GPU             80 Gflops    150 Gflops   -
     peak speedup over CPU            11x          20x          -
     typical speedup over CPU         5x           10x          -

  36. GPU Kernel pitfalls
     What we'll do differently in our kernel design:
        the Householder block-apply uses too much shared memory
        uberkernel approach: each thread block determines what to do from a
           task list (QR, apply, assemble)
           pros: single large kernel launch, lots of parallelism
           con: all tasks use the same thread geometry
           con: QR of a panel needs higher occupancy to hide the scalar work
        to do: the block-apply kernel needs to stage A = A - WYA by not
           keeping W and Y in shared memory; split the uberkernel so the QR
           panel kernel can have higher occupancy

  37. Single GPU performance on many matrices

  38. Multiple GPUs: decomposing the tree

  39. Multiple GPUs: Performance Results
     Results on NVIDIA Tesla C2070 GPUs (Gflops; speedups are for the 4-GPU run):

     problem        CPU     1 GPU    2 GPUs   4 GPUs   speedup   speedup
                                                       vs CPU    vs 1 GPU
     1500:2D        6.1     16.0     27.1     38.4      6.3       2.4
     2000:2D        6.9     21.0     37.8     56.7      8.2       2.7
     3000:2D        7.8     25.8     44.8     73.7      9.4       2.9
     lp nug20       23.9    74.3     86.4     66.1      2.8       0.9
     ch7-8-b3       25.3    104.0    111.3    173.7     6.9       1.7
     ch7-8-b3:8     10.0    88.0     160.4    286.2    28.6       3.3

  40. Multiple GPUs: for large fronts

  41–59. Multiple GPUs: bucket scheduler on the large scale (figure sequence)
