April 4-7, 2016 | Silicon Valley

HIGH-ORDER DISCRETIZATIONS FOR GEOMETRIC MULTI-GRID METHODS
Nikolay Sakharnykh (NVIDIA), Dusan Stosic (UFPE), 4/4/2016
MULTI-GRID METHODS: INTRODUCTION
Multi-grid solves elliptic PDEs (Ax=b) using a hierarchical approach:
- the solution to a hard problem is expressed as the solution to an easier problem
- it accelerates iterative methods and provides O(N) complexity
Application examples: external flow over a body, reservoir simulation, lattice QCD (Brookhaven National Laboratory).
GMG
- Uses a structured grid
- Operator (A) is a stencil
- Constructs the coarse grid directly
- More efficient approach
- Requires geometric information

AMG
- Uses an arbitrary graph
- Operator (A) is a sparse matrix
- Constructs the coarse grid using the RAP product
- More general approach
- Performance optimization is challenging
AmgX
- Fast, scalable linear solvers with an emphasis on iterative methods
- Flexible toolkit for GPU-accelerated Ax = b solvers
- Simple API makes it easy to solve your problems faster
Used in industry-leading CFD packages!
// One header
#include "amgx_c.h"

// Read config file
AMGX_create_config(&cfg, cfgfile);

// Create resources based on config
AMGX_resources_create_simple(&res, cfg);

// Create solver object, A, x, b, set precision
AMGX_solver_create(&solver, res, mode, cfg);
AMGX_matrix_create(&A, res, mode);
AMGX_vector_create(&x, res, mode);
AMGX_vector_create(&b, res, mode);

// Read coefficients from a file
AMGX_read_system(&A, &x, &b, matrixfile);

// Setup and solve
AMGX_solver_setup(solver, A);
AMGX_solver_solve(solver, b, x);

// Download result
AMGX_download_vector(&x);

Configuration file:

solver(main)=FGMRES
main:max_iters=100
main:convergence=RELATIVE_MAX
main:tolerance=0.1
main:preconditioner(amg)=AMG
amg:algorithm=AGGREGATION
amg:selector=SIZE_8
amg:cycle=V
amg:max_iters=1
amg:max_levels=10
amg:smoother(smoother)=BLOCK_JACOBI
amg:relaxation_factor=0.75
amg:presweeps=1
amg:postsweeps=2
amg:coarsest_sweeps=4
determinism_flag=1
HPGMG (Lawrence Berkeley National Laboratory)
- FVM and FEM variants; we focus on FVM
- Proxy for AMR and Low Mach Combustion codes
- Used in Top500 benchmarking
http://crd.lbl.gov/departments/computer-science/PAR/research/hpgmg/
HPGMG implements the F-cycle, which has a better convergence rate than the V-cycle.
Poisson or Helmholtz operator using 2nd- or 4th-order discretization.
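As a rough illustration of the two cycles (not the HPGMG implementation), the recursion can be sketched as below; smooth, residual, restrict_to_coarse, interpolate, and direct_solve are placeholder names for the per-level operations:

// Placeholder per-level operations (hypothetical API, not HPGMG's).
void smooth(int level);
void residual(int level);
void restrict_to_coarse(int fine, int coarse);
void interpolate(int coarse, int fine);
void direct_solve(int level);

// V-cycle: descend from fine to coarse, solve, then ascend.
void v_cycle(int level, int coarsest) {
    if (level == coarsest) { direct_solve(level); return; }
    smooth(level);                        // pre-smooth
    residual(level);                      // compute residual
    restrict_to_coarse(level, level + 1); // restrict residual to the coarser grid
    v_cycle(level + 1, coarsest);         // coarse-grid correction
    interpolate(level + 1, level);        // prolong correction to the finer grid
    smooth(level);                        // post-smooth
}

// F-cycle (full multigrid): start at the coarsest grid and revisit every level
// with a V-cycle, using each coarse solution as the initial guess for the next
// finer level; this is what gives the better convergence rate mentioned above.
void f_cycle(int finest, int coarsest) {
    direct_solve(coarsest);
    for (int level = coarsest - 1; level >= finest; level--) {
        interpolate(level + 1, level);
        v_cycle(level, coarsest);
    }
}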
[Diagram: V-cycle vs F-cycle. Few finer grids, many coarser grids; smoother and residual at each level, direct solve at the coarsest level.]
Order describes the relationship between grid spacing and error.
Why do we need higher-order discretizations?
1. Improving accuracy without a significant increase in memory usage
2. A higher flop-per-byte ratio enables better utilization of modern architectures
Many solvers are moving towards high-order schemes for production codes.
2nd order: 7-point stencil, 6 x 2-point stencils
[Stencil diagram spanning planes K-1, K, K+1]
4th order: 25-point stencil, 18 x 4-point stencils, Flop:Byte ~1.0
[Stencil diagram spanning planes K-2 through K+2]
Compared to the 2nd-order scheme, the 4th-order scheme:
- performs 4x the FP operations
- MPI: sends 3x the messages and doubles their size (2-deep halos)
- has the same DRAM memory footprint (assuming no overfetch)
- attains a lower relative residual: ~10^-9 for a single F-cycle
Fine levels are executed on throughput-optimized processors (GPU).
Coarse levels are executed on latency-optimized processors (CPU).
[Diagram: hybrid V-cycle and F-cycle. Levels above a threshold run on the GPU, levels below it on the CPU.]
[Chart: HPGMG v0.3 hybrid performance. DOF/s (1.0E+07 to 6.5E+07) vs number of levels on GPU (1 to 7), comparing: all levels on GPU, all levels on CPU, and execute on GPU if >10K grid points.]
HPGMG-FV entities naturally map to GPU hierarchy
HPGMG-FV entity -> GPU mapping / CPU mapping:
- BLOCK  -> THREAD BLOCK / OMP THREAD
- BOX    -> GPU / MPI RANK
- DOMAIN -> MULTI-GPU / MULTI-PROCESS
[Diagram shows the decomposition across grids 0-3.]
Two possible layouts for a level's data (~10 vectors spread over boxes 1, 2, 3):
- Vector data within a level is contiguous: requires one copy per vector
- Vector data within a level is disjoint: requires one copy per box
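A minimal sketch of the contiguous option, assuming a hypothetical level_data struct (names and layout are illustrative, not HPGMG's actual structures): because all boxes of one vector sit back to back, a single cudaMemcpyAsync moves that vector between CPU and GPU.

#include <cuda_runtime.h>

// Hypothetical contiguous layout: data is ordered [vector][box][cell],
// so vector v occupies one range of num_boxes * box_volume doubles.
typedef struct {
    int     num_boxes;
    size_t  box_volume;    // cells per box, including ghost zones
    int     num_vectors;   // ~10 vectors per level in HPGMG
    double *host_data;     // contiguous host copy
    double *device_data;   // contiguous device copy
} level_data;

// One copy per vector (instead of one per box) thanks to the contiguous layout.
static void copy_vector_to_gpu(const level_data *l, int v, cudaStream_t stream) {
    size_t offset = (size_t)v * l->num_boxes * l->box_volume;
    size_t bytes  = (size_t)l->num_boxes * l->box_volume * sizeof(double);
    cudaMemcpyAsync(l->device_data + offset, l->host_data + offset,
                    bytes, cudaMemcpyHostToDevice, stream);
}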
- No changes to data structures
- No explicit data movements
- A single pointer for CPU and GPU data
- Use cudaMallocManaged for allocations
[Diagram: developer view with Unified Memory, a single unified memory space.]
Minimal modifications to the original code:
(1) malloc replaced with cudaMallocManaged for levels accessed by the GPU
(2) invoke a CUDA kernel if the level size is greater than a threshold
void smooth(level_type *level, ...) {
  ...
  if (level->use_cuda) {
    // run on GPU
    cuda_cheby_smooth(level, ...);
  } else {
    // run on CPU
    #pragma omp parallel for
    for (block = 0; block < num_blocks; block++)
      ...
  }
}
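Modification (1) is similarly small; a minimal sketch of what the allocation change might look like (the guard macro and surrounding code are illustrative, not HPGMG's exact allocation routine):

// Allocate level data so the same pointer is valid on both CPU and GPU.
double *data;
#ifdef USE_CUDA
    cudaMallocManaged((void **)&data, num_bytes);  // migrated on demand between CPU and GPU
#else
    data = (double *)malloc(num_bytes);            // original CPU-only path
#endif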
Problem: excessive faults and migrations at CPU-GPU crossover points
[NVVP timeline for HPGMG showing levels 0, 1, and 2.]
At the crossover, both level N (large) and level N+1 (small) are shared between the CPU and GPU.
Solution: allocate the first CPU level with cudaMallocHost (zero-copy memory).
[Diagram: level N runs GPU kernels (smoother, residual, restriction) and level N+1 runs CPU functions (smoother, residual); the restricted data crosses between them.]
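A minimal sketch of the per-level allocation choice, assuming a hypothetical helper and a per-level flag (HPGMG's actual allocation code differs):

#include <cuda_runtime.h>
#include <stdlib.h>

// Pick the allocator depending on where the level runs (sketch only).
static double *allocate_level_buffer(size_t bytes, int level_runs_on_gpu) {
    double *ptr = NULL;
    if (level_runs_on_gpu) {
        // GPU levels: managed memory, migrated on demand.
        cudaMallocManaged((void **)&ptr, bytes);
    } else {
        // CPU levels that GPU kernels still touch at the crossover:
        // pinned zero-copy memory avoids page faults and migrations.
        cudaMallocHost((void **)&ptr, bytes);
    }
    return ptr;
}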
[NVVP timelines comparing CPU levels allocated with cudaMallocManaged vs cudaMallocHost: with cudaMallocHost there are no page faults, a +20% speed-up.]
[Charts: kernel time / total time and kernel time / level time for levels 1-6, broken down by smoother, interpolation, copy_blocks, residual, restriction, apply_bc (volume vs surface kernels). Most time is spent on stencils.]
3D thread blocks: provide more than enough parallelism.
2D thread blocks: parallelize over the XY plane and march along Z.
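A minimal sketch of a 2D launch, assuming a hypothetical 32x4 block size and the smooth_kernel shown below (the exact HPGMG launch configuration may differ):

// Launch the smoother with 2D thread blocks that tile the XY plane;
// the kernel itself marches along Z (k = 0 .. dimz-1).
void launch_smoother(level_type level, int num_blocks) {
    dim3 threads(32, 4, 1);        // 2D thread block over the XY plane (assumed size)
    dim3 blocks(num_blocks, 1, 1); // one CUDA block per HPGMG block
    smooth_kernel<<<blocks, threads>>>(level);
}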
__global__ void smooth_kernel(level_type level, ...) {
  blockCopy_type block = level.my_blocks[blockIdx.x];
  // prologue
  const double * __restrict__ rhs = ...;
  ...
  for (int k = 0; k < dimz; k++) {
    // apply operator
    const double Ax = apply_op_ijk();
    // smoother
    #ifdef CHEBY
      xo[ijk] = x[ijk] + ... + c2*lambda*(rhs[ijk] - Ax);
    #elif JACOBI
      xo[ijk] = x[ijk] + (2.0/3.0)*lambda*(rhs[ijk] - Ax);
    #elif GSRB
      xo[ijk] = x[ijk] + RB[ij]*lambda*(rhs[ijk] - Ax);
    #endif
  }
}
Annotations: apply_op_ijk applies the stencil operation (Poisson or Helmholtz, FV2/FV4); the thread block marches from 0 to DIMZ; the smoother is selected at compile time: Chebyshev polynomials, Jacobi, or Gauss-Seidel red-black.
// load k and k-1 planes into registers
double xc0 = x[ijk - kStride];
double xc1 = x[ijk];
...
for (k = 0; k < dimz; k++) {
  // load k+1 plane into registers
  xc2 = x[ijk + kStride];
  ...
  // apply operator
  const double Ax = apply_op_ijk();
  // smoother
  xo[ijk] = xc1 + ...;
  // update k and k-1 planes in registers
  xc0 = xc1;
  xc1 = xc2;
  ...
}
const double Ax =
  STENCIL_TWELFTH*(
      + bic1 * ( 15.0*(xl1-xc1) - (xll-xr1) )
      + bir1 * ( 15.0*(xr1-xc1) - (xrr-xl1) )
      + bjc1 * ( 15.0*(xu1-xc1) - (xuu-xd1) )
      + bjd1 * ( 15.0*(xd1-xc1) - (xdd-xu1) )
      + bkc1 * ( 15.0*(xc0-xc1) - (xbb-xc2) )
      + bkc2 * ( 15.0*(xc2-xc1) - (xff-xc0) )
  )
  + 0.25*STENCIL_TWELFTH*(
      + (bid  - biu ) * (xld - xd1 - xlu + xu1)
      + (bic2 - bic0) * (xl2 - xc2 - xl0 + xc0)
      + (bjr  - bjl ) * (xru - xr1 - xlu + xl1)
      + (bjc2 - bjc0) * (xu2 - xc2 - xu0 + xc0)
      + (bkr1 - bkl1) * (xr0 - xr1 - xl0 + xl1)
      + (bkd1 - bku1) * (xd0 - xd1 - xu0 + xu1)
      + (bird - biru) * (xrd - xd1 - xru + xu1)
      + (bir2 - bir0) * (xr2 - xc2 - xr0 + xc0)
      + (bjrd - bjld) * (xrd - xr1 - xld + xl1)
      + (bjd2 - bjd0) * (xd2 - xc2 - xd0 + xc0)
      + (bkr2 - bkl2) * (xr2 - xr1 - xl2 + xl1)
      + (bkd2 - bku2) * (xd2 - xd1 - xu2 + xu1)
  );
4th-order stencil: 90 registers (38 registers in the kernel without the stencil).
const double Ax =
  STENCIL_TWELFTH*(
      + bir1 * (xr1 - xc1)
      + bic1 * (xl1 - xc1)
      + bju1 * (xu1 - xc1)
      + bjc1 * (xd1 - xc1)
      + bkc2 * (xc2 - xc1)
      + bkc1 * (xc0 - xc1)
  );
7-point stencil: 18 registers. Total register usage: 56 for FV2 and 128 for FV4.
up to 1.5x speed-up!
Two ways to cache global loads on Kepler:
1. Enable L1 caching: by default, global loads are not cached in L1; use -Xptxas="-dlcm=ca" to enable it. 128B load granularity, better for coalesced access.
2. Use the read-only data cache: use the __ldg intrinsic for read-only accesses. 32B load granularity, better for scattered access.
up to 1.4x and up to 1.3x speed-ups from the two approaches!
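A minimal sketch of the __ldg path with hypothetical array names and a placeholder computation (not the HPGMG operator); on sm_35+ the intrinsic routes loads through the 32B-granularity read-only cache, and the compiler can often do the same automatically for const __restrict__ pointers:

// Read-only loads through the read-only data cache (Kepler sm_35 and later).
__global__ void residual_ldg(const double * __restrict__ x,
                             const double * __restrict__ rhs,
                             double *res, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        double xi = __ldg(&x[i]);     // 32B granularity, good for scattered stencil reads
        double ri = __ldg(&rhs[i]);
        res[i] = ri - xi;             // placeholder arithmetic
    }
}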
- Copy to registers by staging in shared memory
- Reduce redundant accesses along the XY plane
- Double buffering to avoid extra synchronization
- Store only 'x' and '𝛾𝑙' in shared memory
[Diagram: shared-memory tile of (M+2R) x (N+2R) points, where R is the stencil radius]

int tx = threadIdx.x + R;
int ty = threadIdx.y + R;
// XY neighbors are then read as sx[ty][tx], sx[ty][tx+1], ...
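A sketch of the staging idea under simplifying assumptions: radius R = 1, a fixed 32x4 tile, box dimensions divisible by the tile size, a ghost region of width R around each XY plane, and a placeholder 2D operator instead of the real stencil. The real HPGMG kernel also keeps the Z planes in registers as shown above.

#include <cuda_runtime.h>

#define R  1      // stencil radius (assumed)
#define TX 32     // tile size in X (assumed)
#define TY 4      // tile size in Y (assumed)

// 2D tiles staged in shared memory with double buffering: while one plane is
// being read, the next one is loaded, so only one __syncthreads() per k is needed.
__global__ void smooth_tiled(const double * __restrict__ x, double * __restrict__ xo,
                             int dimz, int jStride, int kStride)
{
    __shared__ double sx[2][TY + 2*R][TX + 2*R];

    const int tx = threadIdx.x + R;
    const int ty = threadIdx.y + R;
    const int i  = blockIdx.x * TX + threadIdx.x;   // interior coordinates
    const int j  = blockIdx.y * TY + threadIdx.y;

    // global offset of interior cell (ii,jj) in plane kk (ghost width R assumed)
    #define IJK(ii, jj, kk) (((kk) * kStride) + ((jj) + R) * jStride + ((ii) + R))

    // stage plane k = 0 (center + edge halos) into buffer 0
    int buf = 0;
    sx[buf][ty][tx] = x[IJK(i, j, 0)];
    if (threadIdx.x < R) {
        sx[buf][ty][tx - R]  = x[IJK(i - R,  j, 0)];
        sx[buf][ty][tx + TX] = x[IJK(i + TX, j, 0)];
    }
    if (threadIdx.y < R) {
        sx[buf][ty - R][tx]  = x[IJK(i, j - R,  0)];
        sx[buf][ty + TY][tx] = x[IJK(i, j + TY, 0)];
    }
    __syncthreads();

    for (int k = 0; k < dimz; k++) {
        // stage plane k+1 into the other buffer while plane k is being used
        if (k + 1 < dimz) {
            sx[buf ^ 1][ty][tx] = x[IJK(i, j, k + 1)];
            if (threadIdx.x < R) {
                sx[buf ^ 1][ty][tx - R]  = x[IJK(i - R,  j, k + 1)];
                sx[buf ^ 1][ty][tx + TX] = x[IJK(i + TX, j, k + 1)];
            }
            if (threadIdx.y < R) {
                sx[buf ^ 1][ty - R][tx]  = x[IJK(i, j - R,  k + 1)];
                sx[buf ^ 1][ty + TY][tx] = x[IJK(i, j + TY, k + 1)];
            }
        }

        // XY neighbors come from shared memory, so adjacent threads no longer
        // issue redundant global loads for the same cells
        double Ax = sx[buf][ty][tx - 1] + sx[buf][ty][tx + 1]
                  + sx[buf][ty - 1][tx] + sx[buf][ty + 1][tx]
                  - 4.0 * sx[buf][ty][tx];           // placeholder 2D operator
        xo[IJK(i, j, k)] = sx[buf][ty][tx] + 0.1 * Ax;  // placeholder update

        __syncthreads();   // make the freshly staged plane visible
        buf ^= 1;
    }
    #undef IJK
}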
[Chart: 4th-order GSRB smoother performance. Kernel speed-up over the baseline (roughly 1x to 2x) for the tex, l1, reg, reg+smem, reg+tex, and reg+l1 variants.]
Metric             Baseline FV2   Baseline FV4   Optimized FV2   Optimized FV4
Occupancy          0.55           0.25           0.55            0.25
DRAM footprint     1.10 GB        1.09 GB        0.97 GB         1.09 GB
L2 footprint       2.22 GB        8.22 GB        1.48 GB         3.70 GB
DRAM bandwidth     0.76           0.20           0.75            0.41
L2 bandwidth       0.69           0.67           0.53            0.60
Time               5.63 ms        20.65 ms       5.06 ms         10.48 ms
The optimizations decrease the L2 footprint and increase DRAM bandwidth. FV2 is limited by DRAM, FV4 by L2.
- The 2nd-order scheme works really well and is limited by DRAM bandwidth; it improves with register/caching optimizations.
- The 4th-order scheme starts to bottleneck in the DP unit on Maxwell: performance stays flat across all optimizations; we have more flops to execute, but not enough DP throughput.
- The experiment with clang went smoothly overall
- Compilation time is 20% shorter for clang 3.9 compared to nvcc 7.5
- Minor issues with CUB due to PTX intrinsics
- GSRB smoother kernel runtime:
                 Baseline (ms)   Optimized (ms)
nvcc 7.5         25.5            17.0
clang 3.9        27.2            19.4
clang vs nvcc    +7%             +14%

http://llvm.org/docs/CompileCudaWithLLVM.html
Communication scheme (a concrete sketch of the MPI path follows the boundary exchange code below):
1. send data (boundary -> MPI buffer)
2. local exchange (internal -> internal)
3. receive data (MPI buffer -> boundary)
Intra-level: boundary ghost region exchange. Inter-level: restriction and interpolation.
[Diagram: boundary exchange between boxes 0-3 across neighboring grids.]
S6378 - Simplifying Multi-GPU Communication with NVSHMEM Wednesday @ 16:30, Room 211B
Boundary exchange code:

MPI:
  CopyKernel(BOUNDARY-TO-BUFFER)
  cudaDeviceSynchronize
  MPI_Irecv + MPI_Isend
  CopyKernel(INTERNAL-TO-INTERNAL)
  MPI_Waitall
  CopyKernel(BUFFER-TO-BOUNDARY)

SHMEM:
  CopyKernel(ALL-TO-ALL)
  shmem_barrier_all
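A minimal sketch of the MPI path above in plain CUDA + MPI, assuming hypothetical pack/unpack and local-exchange kernels and a single neighbor for brevity (the real exchange handles many neighbors and streams):

#include <mpi.h>
#include <cuda_runtime.h>

// Hypothetical kernels: boundary <-> MPI buffer copies and internal exchange.
__global__ void pack_boundary(const double *level, double *send_buf, int n);
__global__ void unpack_boundary(const double *recv_buf, double *level, int n);
__global__ void local_exchange(double *level, int n);

void exchange_boundary_mpi(double *level, double *send_buf, double *recv_buf,
                           int n, int neighbor, MPI_Comm comm)
{
    MPI_Request reqs[2];

    // 1 - send data: boundary -> MPI buffer, then hand it to MPI
    pack_boundary<<<(n + 255) / 256, 256>>>(level, send_buf, n);
    cudaDeviceSynchronize();   // buffer must be complete before MPI touches it
    MPI_Irecv(recv_buf, n, MPI_DOUBLE, neighbor, 0, comm, &reqs[0]);
    MPI_Isend(send_buf, n, MPI_DOUBLE, neighbor, 0, comm, &reqs[1]);

    // 2 - local exchange overlaps with the communication
    local_exchange<<<(n + 255) / 256, 256>>>(level, n);

    // 3 - receive data: MPI buffer -> boundary
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    unpack_boundary<<<(n + 255) / 256, 256>>>(recv_buf, level, n);
}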
Remember that we replaced malloc with cudaMallocManaged.
Unified Memory with a regular MPI implementation:
- requires an unmanaged staging buffer
- does not play well with RDMA protocols
[NVVP timeline: cudaMallocManaged + regular MPI; data migrations account for ~50% of the time.]
- Pinned allocations and CUDA-aware MPI
- Unified Memory and CUDA-aware MPI: supported in OpenMPI >= 1.8.5 and MVAPICH2-GDR >= 2.2b
- Avoids CUDA IPC, which is not supported for Unified Memory
[Measured on a single node with P2P between two K40s.]
S6142 – Multi GPU Programming with MPI Today @ 13:00, Room 212A
Use zero-copy to allocate MPI buffers and write them directly from GPU kernels.
This can outperform RDMA for large sizes at scale.
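A minimal sketch of that idea with a hypothetical pack kernel: the buffer lives in mapped pinned host memory, a GPU kernel writes it through the device alias, and the plain host pointer is handed to MPI without an intermediate device-to-host copy.

#include <mpi.h>
#include <cuda_runtime.h>

__global__ void pack_boundary(const double *level, double *send_buf, int n);  // hypothetical

void send_boundary_zero_copy(const double *level, int n, int neighbor, MPI_Comm comm)
{
    double *host_buf, *dev_alias;

    // Mapped pinned host memory: visible to the GPU without an explicit copy.
    cudaHostAlloc((void **)&host_buf, n * sizeof(double), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&dev_alias, host_buf, 0);

    // The GPU kernel writes the MPI buffer directly over the interconnect (zero-copy).
    pack_boundary<<<(n + 255) / 256, 256>>>(level, dev_alias, n);
    cudaDeviceSynchronize();

    // The plain host pointer goes straight to MPI.
    MPI_Send(host_buf, n, MPI_DOUBLE, neighbor, 0, comm);

    cudaFreeHost(host_buf);
}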
[Chart: throughput (DOF/s, up to ~1.0E+08) on 2x Tesla K40 with P2P support for four configurations: managed, managed + CUDA-aware, zero-copy, device + CUDA-aware.]
[Chart: single-node performance in DOF/s. 2xCPU (1x), 2xCPU + 2xGPU (3.7x), 2xCPU + 4xGPU (5.5x), 2xCPU + 8xGPU (9.8x). CPU: Intel E5-2698 v3 @ 2.3GHz, 3.6GHz Turbo (Haswell-EP), HT off. GPU: NVIDIA Tesla K80.]
[Chart: ORNL Titan performance of HPGMG v0.3. DOF/s vs number of nodes (2,000 to 18,000), CPU only vs CPU + GPU; the CPU + GPU configuration is about 3x faster. Results obtained by Sam Williams (Lawrence Berkeley National Laboratory).]
- Both CPU and GPU architectures are utilized for the best performance of HPGMG
- Unified Memory greatly simplifies porting the code to the GPU architecture
- The optimizations are not intrusive and should apply to future architectures
- The code scales well up to 16K GPUs on large supercomputing clusters
CUDA sources: https://bitbucket.org/nsakharnykh/hpgmg-cuda
Parallel Forall blog post: http://nvda.ly/YFAcI
S6216 - The Future of Unified Memory: Tuesday @ 16:00, Room 212A
CUDA Libraries Hangout (AmgX, cuBLAS, cuSolver, cuSparse): Thursday @ 10:00, Pod B
April 4-7, 2016 | Silicon Valley
JOIN THE NVIDIA DEVELOPER PROGRAM AT developer.nvidia.com/join