SLIDE 1

Hierarchical-matrix Linear Solver on GPU clusters

with MAGMA variable-size batched kernel

Ichitaro Yamazaki∗, Ahmad Abdelfattah∗, Akihiro Ida†, Satoshi Ohshima‡, Stanimire Tomov∗, Rio Yokota♯, Jack Dongarra∗

∗The University of Tennessee, Knoxville, USA †The University of Tokyo, Japan ‡Kyushu University, Japan ♯Tokyo Institute of Technology, Japan

GPU Technology Conference San Jose, CA, 03/26/2018

SLIDE 2

Boundary Element Method (BEM): from integral equation to linear equations

◮ many scientific and engineering applications

(e.g., acoustics, electromagnetics, and fracture and fluid mechanics)

◮ numerical solution of an integral equation (schematic discretization sketched below)

    ∫ K(x, y) u(y) dy = f   →   solution of a dense linear system Aφ = b

◮ problem sizes are limited by the cost of solving the linear system
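As an illustrative aside (not on the original slide), a simple collocation discretization shows why a dense system arises: every collocation point x_i couples to every boundary element Γ_j through the kernel K. The piecewise-constant basis on elements Γ_j is a generic placeholder, not the specific discretization used in ppohBEM.

    % schematic collocation discretization (illustrative only)
    \[
       \int_{\Gamma} K(x,y)\, u(y)\, dy = f(x)
       \;\;\longrightarrow\;\;
       A\varphi = b,
       \qquad
       a_{ij} = \int_{\Gamma_j} K(x_i, y)\, dy,
       \quad
       b_i = f(x_i),
    \]
    % with u(y) \approx \sum_j \varphi_j on element \Gamma_j;  A is dense
    % because K(x_i, y) is, in general, nonzero for every pair (i, j).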

SLIDE 3

HACApK: dense linear solver

◮ solves dense linear systems of equations, e.g., for BEM (ppohBEM)

◮ reduces computational and storage costs by compressing the matrix into an H-matrix (block structure sketched below)
  ⊲ reordered/partitioned using the geometry of the problem

◮ uses a Krylov solver such as BiCGStab to compute the solution

◮ is available at http://ppopenhpc.cc.u-tokyo.ac.jp

→ this talk focuses on utilizing GPUs
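As an illustrative aside, a minimal sketch (hypothetical names, not HACApK's actual data layout) of how the leaf blocks of an H-matrix can be kept in a flat list, each leaf being either a dense block B or a low-rank product U·V; this flat list is what the batched HiMV on the later slides iterates over.

    /* Hypothetical flat leaf-block list for an H-matrix (illustration only;
       HACApK's real data structures differ).  Each leaf covers rows
       [row0, row0+m) and columns [col0, col0+n) of the dense matrix A.     */
    typedef struct {
        int     row0, col0;   /* position of the block inside A             */
        int     m, n;         /* block dimensions                           */
        int     rank;         /* 0 => dense block, > 0 => low-rank U*V      */
        double *B;            /* m x n dense entries        (rank == 0)     */
        double *U;            /* m x rank factor            (rank  > 0)     */
        double *V;            /* rank x n factor            (rank  > 0)     */
    } hblock_t;

    typedef struct {
        int       nblocks;    /* number of leaf blocks                      */
        hblock_t *blocks;     /* flat array -> HiMV needs no recursion      */
    } hmatrix_t;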

SLIDE 4

BiCGStab with H-matrix on distributed-memory computer

 1:  t := A·x
 2:  r := b − t;  r0 := r;  γ := (r0, r0)
 3:  for iter = 1, 2, . . . , maxiters do
 4:      p := r + β · (p − ζ · v)
 5:      v := A·p, followed by Allgatherv
 6:      α := (r0, r) / (r0, v)
 7:      v := r − α · v
 8:      t := A·v, followed by Allgatherv
 9:      ζ := (t, v) / (t, t)
10:      x := x + α·p + ζ·v
11:      r := v − ζ·t
12:      β := (α/ζ) · (r0, r) / γ
13:      γ := (r0, r)
14:  end for

◮ HiMV (H-matrix-vector multiply) dominates the iteration time and is parallelized
  ⊲ 1D block-row distribution, but with H-blocks and non-disjoint rows

◮ vector operations are insignificant and are redundantly computed
  ⊲ avoids all-reduces for the five dot-products per iteration

◮ MPI Allgatherv after each HiMV (sketched after this list)

◮ OpenMP threads may be used to parallelize local matrix/vector operations
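A hedged sketch of line 5 above (v := A·p, followed by Allgatherv) under the 1D block-row distribution: each rank multiplies only its own H-blocks, then MPI_Allgatherv assembles the full vector on every rank so the dot products and vector updates can be repeated redundantly instead of paying for extra all-reduces. himv_local, counts, and displs are placeholder names, not HACApK identifiers; hmatrix_t is the illustrative type from the earlier sketch.

    #include <stdlib.h>
    #include <mpi.h>

    /* Placeholder: multiply the locally owned block rows of the H-matrix
       with the full vector p, writing my_count entries into v_local.      */
    void himv_local(const hmatrix_t *A, const double *p, double *v_local);

    /* v := A*p, followed by Allgatherv (line 5 of the BiCGStab loop).
       counts[r] / displs[r] describe the rows owned by rank r.            */
    void himv_allgatherv(const hmatrix_t *A, const double *p, double *v,
                         int my_count, int *counts, int *displs,
                         MPI_Comm comm)
    {
        double *v_local = malloc((size_t)my_count * sizeof *v_local);

        himv_local(A, p, v_local);                     /* local block rows */
        MPI_Allgatherv(v_local, my_count, MPI_DOUBLE,  /* assemble full v  */
                       v, counts, displs, MPI_DOUBLE, comm);
        free(v_local);
    }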

SLIDE 5

GPU testbeds

◮ Reedbush-H: two 18-core Intel Xeon CPUs and two NVIDIA P100 GPUs per node, connected with 2 × 56 Gb/s InfiniBand

◮ Tsubame-3: two 14-core Intel Xeon CPUs and four NVIDIA P100 GPUs per node, connected with 4 × 100 Gb/s Omni-Path

SLIDE 6

BiCGStab with H-matrix on GPU cluster

 1:  t := A·x
 2:  r := b − t;  r0 := r;  γ := (r0, r0)
 3:  for iter = 1, 2, . . . , maxiters do
 4:      p := r + β · (p − ζ · v)
 5:      v := A·p, followed by Allgatherv
 6:      α := (r0, r) / (r0, v)
 7:      v := r − α · v
 8:      t := A·v, followed by Allgatherv
 9:      ζ := (t, v) / (t, t)
10:      x := x + α·p + ζ·v
11:      r := v − ζ·t
12:      β := (α/ζ) · (r0, r) / γ
13:      γ := (r0, r)
14:  end for

◮ all the operations are on GPUs (the CPUs schedule tasks)
  ⊲ CPU-GPU data copy before/after each MPI call (sketched below)
  ⊲ vector operations using cuBLAS
  ⊲ HiMV using a batched kernel !!

→ fine-grained irregular computation + global communication
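A hedged sketch of the "CPU-GPU data copy before/after MPI call" point: with a non-GPU-aware MPI, the locally computed part of v is copied device-to-host, all-gathered, and the assembled vector is copied back host-to-device. Buffer names are illustrative.

    #include <cuda_runtime.h>
    #include <mpi.h>

    /* Exchange step after the batched HiMV has produced d_v_local on the GPU;
       h_v_local / h_v are (ideally pinned) host buffers, d_v the full vector. */
    void exchange_v(const double *d_v_local, double *h_v_local,
                    double *h_v, double *d_v,
                    int my_count, int total_count,
                    int *counts, int *displs, MPI_Comm comm)
    {
        /* device -> host copy of the locally owned rows */
        cudaMemcpy(h_v_local, d_v_local,
                   (size_t)my_count * sizeof(double), cudaMemcpyDeviceToHost);

        /* assemble the full vector on every rank */
        MPI_Allgatherv(h_v_local, my_count, MPI_DOUBLE,
                       h_v, counts, displs, MPI_DOUBLE, comm);

        /* host -> device copy of the assembled vector */
        cudaMemcpy(d_v, h_v,
                   (size_t)total_count * sizeof(double), cudaMemcpyHostToDevice);
    }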

SLIDE 7

batched GPU kernels from MAGMA

◮ many small identical operations performed in parallel

◮ hardware parallelism through data parallelization

◮ motivated by application needs (e.g., deep learning, structural mechanics, high-order FEM, astrophysics, sparse/dense solvers)

◮ MAGMA: http://www.icl.utk.edu/magma
  LU, QR, Cholesky (fixed), all BLAS-3 (fixed or variable), and SYMV and GEMV (fixed or variable)
  http://www.icl.utk.edu/files/print/2017/magma-sc17.pdf (SC'17 handout)

SLIDE 8

interface to variable-size batched DGEMV kernel

    magmablas_dgemv_vbatched(
        magma_trans_t trans, magma_int_t *m, magma_int_t *n,
        double alpha,
        magmaDouble_ptr dA_array[], magma_int_t *ldda,
        magmaDouble_ptr dx_array[], magma_int_t *incx,
        magmaDouble_ptr dy_array[], magma_int_t *incy,
        magma_int_t batchCount, magma_queue_t queue )

◮ matrices/vectors are passed as arrays of size batchCount on the GPU (i.e., dA_array, dx_array, dy_array)
  ⊲ maximum batchCount is 65,536

◮ variable matrix sizes are passed as arrays on the GPU (e.g., m, n, ldda)

◮ same operation for every entry in the batch (i.e., trans and alpha)

◮ layered interface (e.g., magmablas_dgemv_vbatched_nocheck)
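A hedged usage sketch: one launch of the kernel for a batch of y(k) := A(k)·x(k) products. All size and pointer arrays are assumed to have been assembled in GPU memory already, one entry per block (consult the MAGMA documentation for the exact array-length requirements). The interface listed on the slide omits a beta scalar that recent MAGMA releases accept; the call below includes it, so adjust to the MAGMA version at hand.

    /* Hedged sketch: launch one variable-size batch of dgemv's.
       d_m, d_n, d_ldda, d_incx, d_incy and the pointer arrays live on the GPU;
       the beta argument may be absent in the interface of some MAGMA versions. */
    #include "magma_v2.h"

    void himv_launch_batch(magma_int_t batchCount,          /* <= 65,536 */
                           magma_int_t *d_m, magma_int_t *d_n,
                           magmaDouble_ptr d_A[], magma_int_t *d_ldda,
                           magmaDouble_ptr d_x[], magma_int_t *d_incx,
                           magmaDouble_ptr d_y[], magma_int_t *d_incy,
                           magma_queue_t queue)
    {
        /* same operation for every block: no transpose, y := 1.0*A*x + 0.0*y */
        magmablas_dgemv_vbatched(MagmaNoTrans, d_m, d_n,
                                 1.0, d_A, d_ldda,
                                      d_x, d_incx,
                                 0.0, d_y, d_incy,
                                 batchCount, queue);
    }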

SLIDE 9

integration of variable-size batched kernel into HiMV

    for k = 1, 2, . . . , nℓ do
        if dense block then            // multiply with dense B(k)
            y(k) := B(k) x(k)
        else                           // multiply with compressed U(k) V(k)
            t(k) := V(k) x(k)
            y(k) := U(k) t(k)
        end if
    end for

◮ variable-size batched kernel performs a batch of dgemv's in parallel
  ⊲ group the dgemv's into multiple batches (e.g., of fixed batch count)

◮ HiMV consists of many small dgemv's with dense or compressed blocks
  ⊲ flat for-loop without hierarchical recursion
  → effective integration of the batched kernel (grouping sketched below)
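A hedged sketch (hypothetical names, reusing the illustrative hblock_t list from an earlier slide) of how the flat loop is turned into batch metadata: one pass collects the dense B(k) and compressed V(k) products, a second pass collects the dependent U(k) products, and a batch is launched whenever a fixed batch count is reached.

    /* Hypothetical batch builder for HiMV (illustration only). */
    enum { MAX_BATCH = 5000 };            /* e.g., a fixed batch count           */

    typedef struct {
        int     count;                    /* number of dgemv's collected so far  */
        int     m[MAX_BATCH], n[MAX_BATCH], lda[MAX_BATCH];
        double *A[MAX_BATCH], *x[MAX_BATCH], *y[MAX_BATCH];
    } batch_t;

    static void append(batch_t *b, int m, int n, double *A, int lda,
                       double *x, double *y)
    {
        int k = b->count++;
        b->m[k] = m;  b->n[k] = n;  b->lda[k] = lda;
        b->A[k] = A;  b->x[k] = x;  b->y[k] = y;
        /* when b->count reaches MAX_BATCH: launch the batch and reset (omitted) */
    }

    /* Walk the flat leaf-block list once, filling pass-1 and pass-2 batches. */
    void build_batches(const hmatrix_t *H, double *x, double *y, double *t,
                       batch_t *pass1, batch_t *pass2)
    {
        int toff = 0;                     /* running offset into work vector t   */
        for (int k = 0; k < H->nblocks; k++) {
            const hblock_t *blk = &H->blocks[k];
            if (blk->rank == 0) {         /* dense: y(k) := B(k) * x(k)          */
                append(pass1, blk->m, blk->n, blk->B, blk->m,
                       x + blk->col0, y + blk->row0);
            } else {                      /* compressed: t(k) := V(k) * x(k)     */
                append(pass1, blk->rank, blk->n, blk->V, blk->rank,
                       x + blk->col0, t + toff);
                append(pass2, blk->m, blk->rank, blk->U, blk->m,  /* y := U*t(k) */
                       t + toff, y + blk->row0);
                toff += blk->rank;
            }
        }
    }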

SLIDE 10

integration of variable-size batched kernel into HiMV

    for k = 1, 2, . . . , nℓ do
        if dense block then            // multiply with dense B(k)
            y(k) := B(k) x(k)
        else                           // multiply with compressed U(k) V(k)
            t(k) := V(k) x(k)
            y(k) := U(k) t(k)
        end if
    end for

◮ two data conflicts:
  ⊲ outputs y(k) may overlap → NVIDIA's atomic-add on y
  ⊲ the multiply with U(k) depends on t(k) from the multiply with V(k)
    → 1) batches of B(k) and V(k), and then 2) batches of U(k),
      on the same stream, or on multiple streams with events (sketched below)
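A hedged sketch of the second conflict handled with two streams and an event: the U(k) batches wait, via cudaStreamWaitEvent, for the event recorded after the B(k)/V(k) batches, so the dependency is honored while independent batches can still overlap. The launch functions are placeholders for the MAGMA vbatched calls.

    #include <cuda_runtime.h>

    /* Placeholders for the batched launches of pass 1 (B(k) and V(k)) and
       pass 2 (U(k)); in practice these wrap the MAGMA vbatched kernels.     */
    void launch_pass1_batch(int batch_id, cudaStream_t s);
    void launch_pass2_batch(int batch_id, cudaStream_t s);

    void himv_two_pass(int nbatch1, int nbatch2)
    {
        cudaStream_t s1, s2;
        cudaEvent_t  pass1_done;

        cudaStreamCreate(&s1);
        cudaStreamCreate(&s2);
        cudaEventCreateWithFlags(&pass1_done, cudaEventDisableTiming);

        /* pass 1: dense B(k) and compressed V(k) multiplies */
        for (int b = 0; b < nbatch1; b++)
            launch_pass1_batch(b, s1);
        cudaEventRecord(pass1_done, s1);

        /* pass 2 must wait until every t(k) has been produced by pass 1 */
        cudaStreamWaitEvent(s2, pass1_done, 0);
        for (int b = 0; b < nbatch2; b++)
            launch_pass2_batch(b, s2);

        cudaStreamSynchronize(s2);
        cudaEventDestroy(pass1_done);
        cudaStreamDestroy(s1);
        cudaStreamDestroy(s2);
    }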

SLIDE 11

performance of batched kernel for HiMV

◮ a wide range of block sizes
  ⊲ diagonal blocks: dense & square
  ⊲ off-diagonal blocks: dense/compressed & tall-skinny/short-wide

◮ overhead with variable sizes, e.g., to accommodate the largest block, smaller blocks get thread blocks with no work

◮ lower variable-size performance (Gflop/s)

SLIDE 12

performance of batched kernel for HiMV

◮ sort blocks to reduce the overhead associated with variable-size blocks (comparator sketch below)
  ⊲ sort by the number of rows in each block
  ⊲ or group by the number of rows, then sort by the number of columns within each group
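A hedged sketch of the second scheme using the standard qsort: the primary key is the row count (the grouping) and the secondary key is the column count within a group; blk_size_t is an illustrative record of one block's dimensions.

    #include <stdlib.h>

    typedef struct { int m, n, index; } blk_size_t;   /* rows, cols, block id */

    /* group blocks by number of rows, then sort by number of columns within
       each group, so neighboring batch entries have similar shapes           */
    static int cmp_rows_then_cols(const void *pa, const void *pb)
    {
        const blk_size_t *a = pa, *b = pb;
        if (a->m != b->m) return (a->m < b->m) ? -1 : 1;   /* primary: rows   */
        if (a->n != b->n) return (a->n < b->n) ? -1 : 1;   /* secondary: cols */
        return 0;
    }

    void sort_blocks(blk_size_t *blocks, size_t nblocks)
    {
        qsort(blocks, nblocks, sizeof *blocks, cmp_rows_then_cols);
    }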

SLIDE 13

performance of batched kernel for HiMV

◮ appropriate sorting scheme improves performance

⊲ up to 2.5× speedups

SLIDE 14

performance (Gflop/s) of different HiMV implementations

[Bar chart: HiMV performance in Gflop/s (y-axis 20-160) for matrices 100ts, 338ts, human2, human6, comparing OpenMP+MKL, CUBLAS+5 streams, fixed batch(5K) with pad, variable batch(5K), variable batch(20K), and variable batch(variable) + 3 streams]

◮ obtained higher performance using the variable-size GPU kernel compared to the fixed-size one (wasted operations on zero padding, or limited batch count)

◮ last three variants: variable batch counts to reduce overhead
  ⊲ specific range of block sizes in each batch
  ⊲ GPU streams to execute small batches in parallel

SLIDE 15

BiCGStab performance with GPUs (strong scaling)

[Bar chart, Tsubame-3: BiCG solution time (s, 0.2-1.2) vs. number of nodes (1 GPU, 1, 2, 4, 8); time split into other, HiMV(MPI), HiMV(copy), HiMV(comp), with a CPU-only reference; speedups 6.0x, 8.5x, 6.5x, 5.1x, 2.1x]

[Bar chart, Reedbush-H: BiCG solution time (s, 0.1-1.2) vs. number of nodes (1 GPU, 1, 2, 4, 8); time split into other, HiMV(MPI), HiMV(copy), HiMV(comp), with a CPU-only reference; speedups 4.2x, 4.4x, 3.6x, 4.7x, 4.6x]

◮ CPU runs (one process/socket with threads) vs. GPU runs (one process/GPU)

◮ 2.1× speedup on 8 nodes of Tsubame-3

◮ 4.6× speedup on 8 nodes of Reedbush-H

SLIDE 16

BiCGStab performance with GPUs on 8 nodes

[Bar chart, Tsubame-3: BiCG solution time (s, 1-14) for matrices 100ts, 288ts, 388ts, 1ms, hum4, hum6; time split into other, HiMV(MPI), HiMV(copy), HiMV(comp), with a CPU-only reference; speedups 2.1x, 3.0x, 3.3x, 3.2x, 1.8x, 4.2x]

[Bar chart, Reedbush-H: BiCG solution time (s, 2-12) for matrices 100ts, 288ts, 338ts, 1ms, hum1, hum4; time split into other, HiMV(MPI), HiMV(copy), HiMV(comp), with a CPU-only reference; speedups 4.6x, 3.3x, 3.0x, 3.3x, 3.1x, 3.8x]

◮ CPU runs (one process/socket with threads) vs. GPU runs (one process/GPU)

◮ up to 4.2× speedup on 8 nodes of Tsubame-3
  ⊲ 6.0× on one node

◮ up to 4.6× speedup on 8 nodes of Reedbush-H
  ⊲ 4.2× on one node

◮ communication starts to become significant
  ⊲ 46% on Tsubame-3, 43% on Reedbush-H

SLIDE 17

BiCGStab with multiple GPUs per process

[Plot, Tsubame-3: achieved bandwidth in GB/s (2-13) vs. node count (1, 2, 4, 8, 16), measured per node, per socket, and per GPU]

◮ each process drives multiple GPUs to lower inter-node communication by reducing the number of processes

◮ use NVLink for data transfers among node-local GPUs (peer-access sketch below)
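A hedged sketch of the multi-GPU-per-process setup: enable peer access among all node-local GPUs so device-to-device transfers go over NVLink where available instead of bouncing through the host; error handling is omitted.

    #include <cuda_runtime.h>

    /* Enable peer access among all GPUs visible to this process, so that
       cudaMemcpyPeer and direct loads use NVLink where it is available.   */
    void enable_local_peer_access(void)
    {
        int ngpus = 0;
        cudaGetDeviceCount(&ngpus);

        for (int dev = 0; dev < ngpus; dev++) {
            cudaSetDevice(dev);
            for (int peer = 0; peer < ngpus; peer++) {
                int can = 0;
                if (peer == dev) continue;
                cudaDeviceCanAccessPeer(&can, dev, peer);
                if (can)
                    cudaDeviceEnablePeerAccess(peer, 0);   /* flags must be 0 */
            }
        }
    }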

SLIDE 18

BiCGStab performance with multiple GPUs per process

[Plot, matrix 100ts: time per iteration in ms (25-175) vs. number of nodes (4, 8, 16, 32) for No GPU, per-GPU, per-Socket, and per-Node configurations]

[Plot, matrix human6: time per iteration vs. number of nodes (4, 8, 16, 32) for No GPU, per-GPU, per-Socket, and per-Node configurations]

◮ on a large number of nodes, inter-GPU communication may be reduced by a multi-GPU implementation with a careful communication scheme

SLIDE 19

hiding communication on GPU cluster

                 100ts                   338ts                   hum4
  ng        block  pipe1  pipe2     block  pipe1  pipe2     block  pipe1  pipe2
   1         18.4   19.2   19.3       --     --     --        --     --     --
   4         10.2   10.0    9.4      34.6   33.8   31.6      34.3   32.1   30.4
   8          9.1    8.9    7.6      31.5   30.1   25.6      30.5   28.1   24.3
  16          8.8    8.6    6.7      30.2   28.6   22.0      30.1   28.0   21.1
  32         10.3   10.1    6.8      31.7   30.8   21.8      30.9   28.3   19.7

◮ hide the all-gatherv for HiMV behind vector operations (non-blocking sketch after this list)

◮ pipe1 aims to hide the CPU-GPU vector copy

◮ pipe2 aims to hide the MPI communication

⊲ up to 1.57× speedup
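A hedged sketch of the general overlap idea (not the exact pipe1/pipe2 schemes of the paper): issue a non-blocking all-gatherv (MPI_Iallgatherv, an MPI-3 routine) for the HiMV result and overlap it with work that does not need the gathered entries, waiting only when the full vector is required.

    #include <mpi.h>

    /* Overlap the all-gatherv of v := A*p with independent vector work. */
    void himv_allgatherv_overlapped(double *v_local, int my_count,
                                    double *v, int *counts, int *displs,
                                    MPI_Comm comm)
    {
        MPI_Request req;

        MPI_Iallgatherv(v_local, my_count, MPI_DOUBLE,
                        v, counts, displs, MPI_DOUBLE, comm, &req);

        /* ... vector updates / CPU-GPU copies that only need local data ... */

        MPI_Wait(&req, MPI_STATUS_IGNORE);   /* full v is available from here on */
    }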

SLIDE 20

Conclusion

◮ used MAGMA's variable-size batched kernels to utilize the GPUs on each node

◮ considered the underlying hardware to improve inter-GPU communication

◮ more details in the IPDPS'18 paper
  ⊲ some GPU-aware MPI results on Reedbush-H

Current Work

◮ GPU acceleration of other parts
  ⊲ generation/compression of the matrix?

◮ scalability improvement
  ⊲ load balancing?

◮ factorization-based solver

Thank you!!
