

SLIDE 1

HIGH-ORDER DISCRETIZATIONS FOR GEOMETRIC MULTI-GRID METHODS

Nikolay Sakharnykh (NVIDIA), Dusan Stosic (UFPE)

April 4-7, 2016 | Silicon Valley

SLIDE 2

MULTI-GRID METHODS

Introduction

Multi-grid solves elliptic PDEs (Ax = b) using a hierarchical approach:
the solution to a hard problem is expressed as the solution to an easier problem,
which accelerates the iterative method and provides O(N) complexity.

SLIDE 3

MULTI-GRID METHODS

Applications

[Figures: external flow over a body, reservoir simulation, lattice QCD (Brookhaven National Laboratory)]

SLIDE 4

MULTI-GRID METHODS

Geometric vs algebraic

Ax = b

GMG:
- Uses a structured grid
- Operator (A) is a stencil
- Constructs the coarse grid directly
- More efficient approach
- Requires geometric information

AMG:
- Uses an arbitrary graph
- Operator (A) is a sparse matrix
- Constructs the coarse grid using the RAP product
- More general approach
- Performance optimization is challenging

SLIDE 5

NVIDIA AMGX

Black-box library for AMG

Fast, scalable linear solvers, with an emphasis on iterative methods
Flexible toolkit for GPU-accelerated Ax = b solvers
Simple API makes it easy to solve your problems faster

Used in industry-leading CFD packages!

SLIDE 6

NVIDIA AMGX

Example usage

// One header
#include "amgx_c.h"

// Read config file
AMGX_create_config(&cfg, cfgfile);

// Create resources based on config
AMGX_resources_create_simple(&res, cfg);

// Create solver object, A, x, b, set precision
AMGX_solver_create(&solver, res, mode, cfg);
AMGX_matrix_create(&A, res, mode);
AMGX_vector_create(&x, res, mode);
AMGX_vector_create(&b, res, mode);

// Read coefficients from a file
AMGX_read_system(&A, &x, &b, matrixfile);

// Setup and Solve Loop
AMGX_solver_setup(solver, A);
AMGX_solver_solve(solver, b, x);

// Download Result
AMGX_download_vector(&x)

Configuration file:

solver(main)=FGMRES
main:max_iters=100
main:convergence=RELATIVE_MAX
main:tolerance=0.1
main:preconditioner(amg)=AMG
amg:algorithm=AGGREGATION
amg:selector=SIZE_8
amg:cycle=V
amg:max_iters=1
amg:max_levels=10
amg:smoother(smoother)=BLOCK_JACOBI
amg:relaxation_factor=0.75
amg:presweeps=1
amg:postsweeps=2
amg:coarsest_sweeps=4
determinism_flag=1

SLIDE 7

HPGMG

High-Performance Geometric Multi-Grid

Developed at Lawrence Berkeley National Laboratory
FVM and FEM variants; we focus on FVM
Proxy for AMR and Low Mach Combustion codes
Used in Top500 benchmarking

http://crd.lbl.gov/departments/computer-science/PAR/research/hpgmg/

SLIDE 8

HPGMG

Multi-grid cycles

HPGMG implements an F-cycle, which has a better convergence rate than the V-cycle
Poisson or Helmholtz operator using 2nd- or 4th-order discretization

[Diagram: V-cycle vs. F-cycle, few finer grids and many coarser grids; smoother and residual computations at each level, direct solve on the coarsest grid]
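For reference, the recursive structure of a single V-cycle can be sketched as follows (a generic outline with hypothetical helpers such as smooth, residual, restrict_to, interpolate_from, and direct_solve, not HPGMG's actual API). An F-cycle additionally performs a V-cycle at every level as it works its way up from the coarsest grid, which is what improves its convergence rate.

// Generic V-cycle sketch; the helper routines are assumed to exist elsewhere.
void smooth(int level); void residual(int level); void restrict_to(int level);
void interpolate_from(int level); void direct_solve(int level);

void v_cycle(int level, int coarsest) {
    if (level == coarsest) {
        direct_solve(level);            // the coarsest problem is small enough to solve directly
        return;
    }
    smooth(level);                      // pre-smooth (Chebyshev, Jacobi, or GSRB)
    residual(level);                    // r = b - A*x
    restrict_to(level + 1);             // move the residual to the coarser grid
    v_cycle(level + 1, coarsest);       // recursively solve for the coarse-grid correction
    interpolate_from(level + 1);        // prolong the correction back to the finer grid
    smooth(level);                      // post-smooth
}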

SLIDE 9

HPGMG

High-order discretizations

Order describes the relationship between grid spacing and error
Why do we need higher-order discretizations?
1. Improved accuracy without a significant increase in memory usage
2. A higher flop-per-byte ratio enables better utilization of modern architectures
Many solvers are moving towards high-order schemes for production codes
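The textbook relation behind the term "order" (a standard statement, not specific to HPGMG): for a scheme of order p on a grid with spacing h, the discretization error behaves as

\[ \| u_h - u \| = O(h^{p}), \qquad p = 2 \ \text{(FV2)}, \quad p = 4 \ \text{(FV4)} \]

so halving the grid spacing reduces the error by roughly 4x for the 2nd-order scheme and by roughly 16x for the 4th-order scheme, while the number of unknowns grows by the same 8x in 3D either way.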

SLIDE 10

HPGMG

v0.2: second-order

2nd order: 7-point stencil
6 x 2-point stencils

[Stencil diagram: planes K-1, K, K+1]
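For orientation, a variable-coefficient 2nd-order operator applied at one cell has this generic 7-point form (a sketch in the spirit of HPGMG-FV's face coefficients beta_i/beta_j/beta_k and h2inv = 1/h^2; the exact names are illustrative):

// 7-point, 2nd-order stencil: the center value couples to six face neighbors
double Ax = a*alpha[ijk]*x[ijk]
          - b*h2inv*( beta_i[ijk + 1      ]*(x[ijk + 1      ] - x[ijk])
                    + beta_i[ijk          ]*(x[ijk - 1      ] - x[ijk])
                    + beta_j[ijk + jStride]*(x[ijk + jStride] - x[ijk])
                    + beta_j[ijk          ]*(x[ijk - jStride] - x[ijk])
                    + beta_k[ijk + kStride]*(x[ijk + kStride] - x[ijk])
                    + beta_k[ijk          ]*(x[ijk - kStride] - x[ijk]) );

Each parenthesized difference is one of the six 2-point stencils counted above.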

SLIDE 11

HPGMG

v0.3: fourth-order

4th order: 25-point stencil
18 x 4-point stencils, flop:byte ~1.0

[Stencil diagram: planes K-2, K-1, K, K+1, K+2]

SLIDE 12

HPGMG

4th order vs 2nd order

Performs 4x the FP operations
MPI: sends 3x the messages and doubles their size (2-deep halos)
DRAM memory footprint is the same (assuming no overfetch)
Attains a lower relative residual: ~1e-9 for a single F-cycle

SLIDE 13

HYBRID IMPLEMENTATION

SLIDE 14

HYBRID IMPLEMENTATION

Take advantage of both architectures

Fine levels are executed on throughput-optimized processors (GPU)
Coarse levels are executed on latency-optimized processors (CPU)

[Diagram: V-cycle and F-cycle split at a threshold level, with finer levels on the GPU and coarser levels on the CPU]

SLIDE 15

HYBRID IMPLEMENTATION

What is the optimal threshold?

[Chart: HPGMG v0.3 hybrid performance, DOF/s vs. number of levels on the GPU (1 to 7), spanning the range from all levels on the CPU to all levels on the GPU; the best configuration executes a level on the GPU if it has more than ~10K grid points]
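In code, this crossover can be captured as a per-level flag that the solver checks before launching work; a minimal hypothetical sketch (the field and helper names are illustrative, and the 10K threshold is the one read off the chart above):

// Mark a level for GPU execution only if it is large enough to amortize
// kernel launch and synchronization overheads.
for (int l = 0; l < num_levels; l++)
    levels[l]->use_cuda = (total_grid_points(levels[l]) > 10000);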

SLIDE 16

MEMORY MANAGEMENT

HPGMG-FV entities naturally map to GPU hierarchy

[Diagram: HPGMG-FV entities (domain, box, block) mapped onto the execution hierarchy (multi-process / MPI rank, multi-GPU / GPU, OMP thread / thread block), shown for grids 0-3]

SLIDE 17

MEMORY MANAGEMENT

volume

[Diagram: ~10 vectors, each spanning box 1, box 2, box 3, under two layouts]

Contiguous layout: vector data within a level is contiguous; requires one copy per vector
Disjoint layout: vector data within a level is disjoint; requires one copy per box

SLIDE 18

MEMORY MANAGEMENT

Using Unified Memory

No changes to data structures
No explicit data movements
Single pointer for CPU and GPU data
Use cudaMallocManaged for allocations

[Diagram: developer view with Unified Memory, a single unified memory space visible to both CPU and GPU]
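A minimal standalone illustration of the idea (not HPGMG code): a single managed allocation is touched by plain CPU code and by a GPU kernel through the same pointer.

#include <cuda_runtime.h>
#include <stdio.h>

__global__ void scale(double *v, double s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] *= s;                       // GPU writes the managed array
}

int main(void) {
    const int n = 1 << 20;
    double *v = NULL;
    cudaMallocManaged(&v, n * sizeof(double));  // one pointer, valid on CPU and GPU

    for (int i = 0; i < n; i++) v[i] = 1.0;     // CPU initializes it in place

    scale<<<(n + 255) / 256, 256>>>(v, 2.0, n);
    cudaDeviceSynchronize();                    // synchronize before the CPU reads again

    printf("v[0] = %f\n", v[0]);                // prints 2.000000
    cudaFree(v);
    return 0;
}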

SLIDE 19

UNIFIED MEMORY

Simplified GPU programming

Minimal modifications to the original code:
(1) malloc replaced with cudaMallocManaged for levels accessed by the GPU
(2) Invoke a CUDA kernel if the level size is greater than the threshold

void smooth(level_type *level, ...) {
    ...
    if (level->use_cuda) {
        // run on GPU
        cuda_cheby_smooth(level, ...);
    } else {
        // run on CPU
        #pragma omp parallel for
        for (block = 0; block < num_blocks; block++)
            ...
    }
}

SLIDE 20

UNIFIED MEMORY

What about performance?

Problem: excessive faults and migrations at CPU-GPU crossover points

[NVVP timeline for HPGMG showing levels 0, 1, and 2]

SLIDE 21

UNIFIED MEMORY

Eliminating page migrations and faults

At the crossover point, level N (large) and level N+1 (small) are both shared between the CPU and the GPU
Solution: allocate the first CPU level with cudaMallocHost (zero-copy memory)

[Diagram: level N is processed by GPU kernels (smoother, residual, restriction); level N+1 is processed by CPU functions (smoother, residual)]
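One way to express the fix is a per-level allocator switch; a hypothetical sketch (level_alloc and the shared_with_cpu flag are illustrative, not HPGMG's actual allocation routine):

#include <cuda_runtime.h>

// Levels that stay on the GPU use managed memory; the first CPU level uses
// pinned (zero-copy) host memory, so GPU accesses go over the bus directly
// instead of triggering page faults and migrations.
static void *level_alloc(size_t bytes, int shared_with_cpu) {
    void *p = NULL;
    if (shared_with_cpu)
        cudaMallocHost(&p, bytes);      // pinned host memory, accessible from GPU kernels
    else
        cudaMallocManaged(&p, bytes);   // migrates on demand between CPU and GPU
    return p;
}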

SLIDE 22

UNIFIED MEMORY

[Chart: CPU levels allocated with cudaMallocManaged vs. cudaMallocHost; the cudaMallocHost version shows no page faults and a +20% speed-up]

SLIDE 23

KERNEL OPTIMIZATIONS

SLIDE 24

MULTI-GRID BOTTLENECK

Cost of operations

[Charts: kernel time as a fraction of total time and as a fraction of per-level time, for levels 1-6, broken down into smoother, interpolation, copy_blocks, residual, restriction, and apply_bc; most time is spent on stencils (volume vs. surface kernels)]

SLIDE 25

STENCILS ON GPU

Parallelization strategies

3D thread blocks: provide more than enough parallelism
2D thread blocks: parallelize over the XY plane and march along Z

[Diagram: 3D vs. 2D thread-block decompositions of the i, j, k index space]

SLIDE 26

STENCILS ON GPU

Baseline 2D kernel

__global__ void smooth_kernel(level_type level, ...) {
    blockCopy_type block = level.my_blocks[blockIdx.x];
    // prologue
    const double * __restrict__ rhs = ...;
    ...
    for (int k = 0; k < dimz; k++) {
        // apply operator
        const double Ax = apply_op_ijk();
        // smoother
        #ifdef CHEBY
        xo[ijk] = x[ijk] + ... + c2*lambda*(rhs[ijk] - Ax);
        #elif JACOBI
        xo[ijk] = x[ijk] + (2.0/3.0)*lambda*(rhs[ijk] - Ax);
        #elif GSRB
        xo[ijk] = x[ijk] + RB[ij]*lambda*(rhs[ijk] - Ax);
        #endif
    }
}

Annotations: apply_op_ijk applies the stencil operation (Poisson or Helmholtz, FV2 or FV4); the thread block marches from 0 to dimz; the smoother variants are Chebyshev polynomials, Jacobi, and Gauss-Seidel red-black

SLIDE 27

STENCILS ON GPU

Register caching

// load k and k-1 planes into registers
double xc0 = x[ijk - kStride];
double xc1 = x[ijk];
...
for (k = 0; k < dimz; k++) {
    // load k+1 plane into registers
    xc2 = x[ijk + kStride];
    ...
    // apply operator
    const double Ax = apply_op_ijk();
    // smoother
    xo[ijk] = xc1 + ...;
    // update k and k-1 planes in registers
    xc0 = xc1; xc1 = xc2;
    ...
}

4th-order stencil, 90 regs (38 regs in the kernel without the stencil):

const double Ax =
    - b*h2inv*(
        STENCIL_TWELFTH*(
            + bic1 * ( 15.0*(xl1-xc1) - (xll-xr1) )
            + bir1 * ( 15.0*(xr1-xc1) - (xrr-xl1) )
            + bjc1 * ( 15.0*(xu1-xc1) - (xuu-xd1) )
            + bjd1 * ( 15.0*(xd1-xc1) - (xdd-xu1) )
            + bkc1 * ( 15.0*(xc0-xc1) - (xbb-xc2) )
            + bkc2 * ( 15.0*(xc2-xc1) - (xff-xc0) ) )
        + 0.25*STENCIL_TWELFTH*(
            + (bid  - biu ) * (xld - xd1 - xlu + xu1)
            + (bic2 - bic0) * (xl2 - xc2 - xl0 + xc0)
            + (bjr  - bjl ) * (xru - xr1 - xlu + xl1)
            + (bjc2 - bjc0) * (xu2 - xc2 - xu0 + xc0)
            + (bkr1 - bkl1) * (xr0 - xr1 - xl0 + xl1)
            + (bkd1 - bku1) * (xd0 - xd1 - xu0 + xu1)
            + (bird - biru) * (xrd - xd1 - xru + xu1)
            + (bir2 - bir0) * (xr2 - xc2 - xr0 + xc0)
            + (bjrd - bjld) * (xrd - xr1 - xld + xl1)
            + (bjd2 - bjd0) * (xd2 - xc2 - xd0 + xc0)
            + (bkr2 - bkl2) * (xr2 - xr1 - xl2 + xl1)
            + (bkd2 - bku2) * (xd2 - xd1 - xu2 + xu1) ));

7-point stencil, 18 regs (total reg usage: 56 for FV2 and 128 for FV4):

const double Ax =
    - b*h2inv*(
        STENCIL_TWELFTH*(
            + bir1 * (xr1 - xc1)
            + bic1 * (xl1 - xc1)
            + bju1 * (xu1 - xc1)
            + bjc1 * (xd1 - xc1)
            + bkc2 * (xc2 - xc1)
            + bkc1 * (xc0 - xc1) ));

up to 1.5x speed-up!

SLIDE 28

STENCILS ON GPU

Using caches on SM

By default, global loads are not cached in L1 on Kepler
Use -Xptxas="-dlcm=ca" to enable L1: 128B load granularity, better for coalesced access
Use the __ldg intrinsic for read-only accesses: 32B load granularity, better for scattered access

up to 1.4x speed-up! up to 1.3x speed-up!
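For illustration, a toy kernel showing the __ldg path (the 1D 3-point stencil and all names here are made up; HPGMG's kernels apply the same idea to the 3D stencils above):

// Read-only loads of x are routed through the 32B-granularity read-only cache
// with __ldg; the output write to xo takes the normal path.
__global__ void ldg_example(const double* __restrict__ x, double *xo,
                            int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= stride && i < n - stride) {
        double left  = __ldg(&x[i - stride]);
        double mid   = __ldg(&x[i]);
        double right = __ldg(&x[i + stride]);
        xo[i] = left - 2.0 * mid + right;   // simple 3-point second difference
    }
}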

SLIDE 29

STENCILS ON GPU

Shared memory approach

Copy to registers by staging in shared memory
Reduce redundant accesses along the XY plane
Double buffering to avoid synchronization
Store only 'x' and '𝛾𝑙' in shared memory

[Diagram: an (M+2R) x (N+2R) shared-memory tile with halo radius R]

int tx = threadIdx.x + R;
int ty = threadIdx.y + R;
sx[ty][tx], sx[ty][tx+1]
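A minimal sketch of this staging pattern (not HPGMG's kernel): the thread block covers an XY tile plus its halo, every thread stages one value of x into shared memory per Z plane, and only interior threads apply the XY part of the stencil; the Z neighbors still come from global memory (or from the register planes of the previous slide). The block is assumed to be launched with dim3(TX + 2*R, TY + 2*R) threads.

#define R  1            // halo radius
#define TX 30           // interior tile width  (block is TX + 2R threads wide)
#define TY 6            // interior tile height (block is TY + 2R threads tall)

__global__ void smem_stencil(const double *x, double *xo,
                             int dimx, int dimy, int dimz) {
    __shared__ double sx[TY + 2*R][TX + 2*R];

    // shift by R so halo threads fall just outside the interior tile
    int i = blockIdx.x*TX + threadIdx.x - R;
    int j = blockIdx.y*TY + threadIdx.y - R;
    int jStride = dimx, kStride = dimx*dimy;

    bool in_grid  = (i >= 0 && i < dimx && j >= 0 && j < dimy);
    bool interior = (threadIdx.x >= R && threadIdx.x < TX + R &&
                     threadIdx.y >= R && threadIdx.y < TY + R &&
                     i >= R && i < dimx - R && j >= R && j < dimy - R);

    for (int k = R; k < dimz - R; k++) {
        int ijk = i + j*jStride + k*kStride;
        if (in_grid) sx[threadIdx.y][threadIdx.x] = x[ijk];   // stage one XY plane
        __syncthreads();

        if (interior) {
            // toy 7-point Laplacian: XY neighbors from shared memory, Z from global
            xo[ijk] = sx[threadIdx.y][threadIdx.x - 1] + sx[threadIdx.y][threadIdx.x + 1]
                    + sx[threadIdx.y - 1][threadIdx.x] + sx[threadIdx.y + 1][threadIdx.x]
                    + x[ijk - kStride] + x[ijk + kStride]
                    - 6.0 * sx[threadIdx.y][threadIdx.x];
        }
        __syncthreads();   // double buffering the tile would remove this barrier
    }
}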

SLIDE 30

STENCILS ON GPU

Optimizations summary on Kepler

[Chart: 4th-order GSRB smoother performance, kernel speed-up per optimization (baseline, tex, l1, reg, reg+smem, reg+tex, reg+l1), ranging from 1x up to about 2x]

SLIDE 31

STENCILS ON GPU

Bottleneck analysis on Tesla K40

METRIC            BASELINE FV2   BASELINE FV4   OPTIMIZED FV2   OPTIMIZED FV4
OCCUPANCY         0.55           0.25           0.55            0.25
DRAM FOOTPRINT    1.10 GB        1.09 GB        0.97 GB         1.09 GB
L2 FOOTPRINT      2.22 GB        8.22 GB        1.48 GB         3.70 GB
DRAM BANDWIDTH    0.76           0.20           0.75            0.41
L2 BANDWIDTH      0.69           0.67           0.53            0.60
TIME              5.63 ms        20.65 ms       5.06 ms         10.48 ms

DECREASE IN L2 FOOTPRINT, INCREASE IN DRAM BW
FV2 LIMITED BY DRAM, FV4 LIMITED BY L2

SLIDE 32

MAXWELL?

Poor double precision performance 

The 2nd-order scheme works really well: it is limited by DRAM bandwidth and improves with register/caching optimizations
The 4th-order scheme starts to bottleneck in the DP unit on Maxwell: performance stays flat across all optimizations
We have more flops to execute, but not enough DP throughput

SLIDE 33

ALTERNATIVE COMPILERS?

NVCC vs CLANG

The experiment with clang went smoothly overall
Compilation time is 20% shorter for clang 3.9 compared to nvcc 7.5
Minor issues with CUB due to PTX intrinsics
GSRB smoother kernel runtime:

                Baseline (ms)   Optimized (ms)
nvcc 7.5        25.5            17.0
clang 3.9       27.2            19.4
clang vs nvcc   +7%             +14%

http://llvm.org/docs/CompileCudaWithLLVM.html
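For reference, the compile command from the linked LLVM documentation takes roughly this form (the source file name and GPU architecture are placeholders):

clang++ smoother.cu -o smoother -O3 --cuda-gpu-arch=sm_35 \
    -L<CUDA install path>/lib64 -lcudart_static -ldl -lrt -pthread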

SLIDE 34

MULTI-GPU

SLIDE 35

MULTI-GPU

MPI communications

Communication scheme:
1 - send data (boundary -> MPI buffer)
2 - local exchange (internal -> internal)
3 - receive data (MPI buffer -> boundary)
Intra-level: boundary ghost region exchange
Inter-level: restriction and interpolation

[Diagram: boundary exchange between boxes 0-3 across the grids]

SLIDE 36

MULTI-GPU

MPI vs SHMEM

S6378 - Simplifying Multi-GPU Communication with NVSHMEM Wednesday @ 16:30, Room 211B

Boundary exchange code

MPI:
CopyKernel(BOUNDARY-TO-BUFFER)
cudaDeviceSync
MPI_Irecv + MPI_Isend
CopyKernel(INTERNAL-TO-INTERNAL)
MPI_Waitall
CopyKernel(BUFFER-TO-BOUNDARY)

SHMEM:
CopyKernel(ALL-TO-ALL)
shmem_barrier_all
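A self-contained sketch (not HPGMG's code) of the MPI sequence above: pack the boundary into an MPI buffer on the GPU, overlap the local internal copies with MPI_Irecv/MPI_Isend, then unpack the received buffer. The pack/unpack kernels here are deliberately trivial; the real code gathers and scatters ghost-zone cells.

#include <mpi.h>
#include <cuda_runtime.h>

__global__ void pack_boundary(const double *x, double *buf, int n) {
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) buf[i] = x[i];                  // gather halo cells into the send buffer
}

__global__ void unpack_boundary(double *x, const double *buf, int n) {
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) x[i] = buf[i];                  // scatter received halo cells back
}

void exchange_boundary(double *d_x, double *d_send, double *d_recv,
                       int n, int neighbor, MPI_Comm comm) {
    MPI_Request reqs[2];
    int threads = 256, blocks = (n + threads - 1) / threads;

    pack_boundary<<<blocks, threads>>>(d_x, d_send, n);    // boundary -> MPI buffer
    cudaDeviceSynchronize();                               // buffer must be complete before MPI uses it

    // with CUDA-aware MPI these can be device pointers; otherwise stage through host buffers
    MPI_Irecv(d_recv, n, MPI_DOUBLE, neighbor, 0, comm, &reqs[0]);
    MPI_Isend(d_send, n, MPI_DOUBLE, neighbor, 0, comm, &reqs[1]);

    // ... local internal -> internal copies would run here, overlapping the transfers ...

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    unpack_boundary<<<blocks, threads>>>(d_x, d_recv, n);  // MPI buffer -> boundary
}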

SLIDE 37

MULTI-GPU

Unified Memory and MPI

Remember that we replaced malloc with cudaMallocManaged
Unified Memory with a regular MPI implementation:
Requires an unmanaged staging buffer
Does not play well with RDMA protocols

[Profiler timeline: cudaMallocManaged + regular MPI, data migrations (50%)]

SLIDE 38

MULTI-GPU

CUDA-aware MPI

Pinned allocations and CUDA-aware MPI
Unified Memory and CUDA-aware MPI:
Supported in OpenMPI >= 1.8.5 and MVAPICH2-GDR >= 2.2b
Avoids CUDA IPC, which is not supported for Unified Memory

SINGLE NODE WITH P2P BETWEEN TWO K40s

S6142 – Multi GPU Programming with MPI Today @ 13:00, Room 212A
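A hedged sketch of the CUDA-aware path (illustrative code, not HPGMG's): with a CUDA-aware MPI build, a managed (or device) pointer can be handed directly to MPI calls, so no unmanaged staging buffer is needed.

#include <mpi.h>
#include <cuda_runtime.h>

// Send a halo buffer that lives in managed memory straight through MPI.
void send_halo(int count, int neighbor, MPI_Comm comm) {
    double *halo = NULL;
    cudaMallocManaged(&halo, count * sizeof(double));  // usable by CPU, GPU, and (CUDA-aware) MPI

    // ... a GPU kernel would fill the halo buffer here ...
    cudaDeviceSynchronize();

    MPI_Send(halo, count, MPI_DOUBLE, neighbor, /*tag=*/0, comm);
    cudaFree(halo);
}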

SLIDE 39

MULTI-GPU

Zero-copy MPI buffers

Use zero-copy to allocate MPI buffers and write them directly from GPU kernels
Can outperform RDMA for large sizes at scale

[Chart: DOF/s on 2x Tesla K40 with P2P support, comparing managed, managed + CUDA-aware, zero-copy, and device + CUDA-aware MPI buffer strategies]

SLIDE 40

MULTI-GPU RESULTS

Weak scaling within a single node

CPU: Intel E5-2698 v3 @ 2.3GHz, 3.6GHz Turbo (Haswell-EP), HT off
GPU: NVIDIA Tesla K80

[Chart: DOF/s for 2xCPU (1x), 2xCPU + 2xGPU (3.7x), 2xCPU + 4xGPU (5.5x), 2xCPU + 8xGPU (9.8x)]

SLIDE 41

MULTI-GPU RESULTS

Weak scaling across thousands of nodes

[Chart: ORNL Titan performance of HPGMG v0.3, DOF/s vs. number of nodes (2,000 to 18,000), CPU only vs. CPU + GPU; the GPU-accelerated version is about 3x faster]

*Results obtained by Sam Williams (Lawrence Berkeley National Laboratory)

SLIDE 42

TAKEAWAYS

Both CPU and GPU architectures are utilized for the best HPGMG performance
Unified Memory greatly simplifies porting the code to the GPU architecture
Optimizations are not intrusive and should apply to future architectures
The code scales well up to 16K GPUs on large supercomputing clusters

SLIDE 43

LEARN MORE

CUDA sources: https://bitbucket.org/nsakharnykh/hpgmg-cuda
Parallel Forall blog post: http://nvda.ly/YFAcI
S6216 - The Future of Unified Memory, Tuesday @ 16:00, Room 212A
CUDA Libraries Hangout (AmgX, cuBLAS, cuSolver, cuSparse), Thursday @ 10:00, Pod B

SLIDE 44

April 4-7, 2016 | Silicon Valley

THANK YOU

JOIN THE NVIDIA DEVELOPER PROGRAM AT developer.nvidia.com/join