

1. HIGH-ORDER DISCRETIZATIONS FOR GEOMETRIC MULTI-GRID METHODS. Nikolay Sakharnykh (NVIDIA), Dusan Stosic (UFPE). April 4-7, 2016 | Silicon Valley

2. MULTI-GRID METHODS: Introduction. Multi-grid solves elliptic PDEs (Ax=b) using a hierarchical approach: the solution to a hard problem is expressed as the solution to an easier, coarser problem. This accelerates the iterative method and provides O(N) complexity. (A sketch of one cycle follows below.)
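
To make the hierarchical idea concrete, here is a minimal sketch of one V-cycle. The helpers (smooth, residual, restrict_to, prolong_add, direct_solve) are hypothetical stand-ins, not HPGMG's or AMGX's API:

```c
/* Minimal V-cycle sketch; all helper names are illustrative. */
void smooth(double *x, const double *b, int n);      /* damp high-frequency error */
void residual(double *r, const double *x, const double *b, int n); /* r = b - A x */
void restrict_to(double *coarse, const double *fine, int n_fine);
void prolong_add(double *fine, const double *coarse, int n_fine);
void direct_solve(double *x, const double *b, int n);

void v_cycle(double **x, double **b, double **r, const int *n,
             int level, int num_levels) {
  if (level == num_levels - 1) {                     /* coarsest grid: solve exactly */
    direct_solve(x[level], b[level], n[level]);
    return;
  }
  smooth(x[level], b[level], n[level]);              /* pre-smoothing */
  residual(r[level], x[level], b[level], n[level]);  /* fine-grid residual */
  restrict_to(b[level + 1], r[level], n[level]);     /* move it to the coarser grid */
  for (int i = 0; i < n[level + 1]; i++) x[level + 1][i] = 0.0;
  v_cycle(x, b, r, n, level + 1, num_levels);        /* solve the coarse error equation */
  prolong_add(x[level], x[level + 1], n[level]);     /* coarse-grid correction */
  smooth(x[level], b[level], n[level]);              /* post-smoothing */
}
```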

3. MULTI-GRID METHODS: Applications. External flow over a body; lattice QCD (Brookhaven National Laboratory); reservoir simulation.

4. MULTI-GRID METHODS: Geometric vs algebraic (Ax=b).
GMG: uses a structured grid; the operator A is a stencil; constructs the coarse grid directly; more efficient, but requires geometric information.
AMG: uses an arbitrary graph; the operator A is a sparse matrix; constructs the coarse grid using the RAP product; more general, but performance optimization is challenging.
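
For reference, the RAP (Galerkin) product mentioned above, in my notation rather than the slides': with restriction R, prolongation P (commonly chosen as P = R^T), and fine-grid operator A_h, AMG forms the coarse operator as

```latex
% Galerkin coarse-grid construction (the "RAP product"):
% R = restriction, P = prolongation (commonly P = R^T), A_h = fine operator.
\[
  A_{2h} = R \, A_h \, P
\]
```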

5. NVIDIA AMGX: Black-box library for AMG. Fast, scalable linear solvers with an emphasis on iterative methods. A flexible toolkit for GPU-accelerated Ax=b solvers. A simple API makes it easy to solve your problems faster. Used in industry-leading CFD packages!

6. NVIDIA AMGX: Example usage.

```c
// One header
#include "amgx_c.h"

// Read config file
AMGX_create_config(&cfg, cfgfile);
// Create resources based on config
AMGX_resources_create_simple(&res, cfg);
// Create solver object, A, x, b, set precision
AMGX_solver_create(&solver, res, mode, cfg);
AMGX_matrix_create(&A, res, mode);
AMGX_vector_create(&x, res, mode);
AMGX_vector_create(&b, res, mode);
// Read coefficients from a file
AMGX_read_system(&A, &x, &b, matrixfile);
// Setup and solve loop
AMGX_solver_setup(solver, A);
AMGX_solver_solve(solver, b, x);
// Download result
AMGX_download_vector(&x);
```

Configuration file:

```
solver(main)=FGMRES
main:max_iters=100
main:convergence=RELATIVE_MAX
main:tolerance=0.1
main:preconditioner(amg)=AMG
amg:algorithm=AGGREGATION
amg:selector=SIZE_8
amg:cycle=V
amg:max_iters=1
amg:max_levels=10
amg:smoother(smoother)=BLOCK_JACOBI
amg:relaxation_factor=0.75
amg:presweeps=1
amg:postsweeps=2
amg:coarsest_sweeps=4
determinism_flag=1
```

7. HPGMG: High-Performance Geometric Multi-Grid (Lawrence Berkeley National Laboratory). FVM and FEM variants; we focus on FVM. A proxy for AMR and low-Mach combustion codes. Used in Top500 benchmarking. http://crd.lbl.gov/departments/computer-science/PAR/research/hpgmg/

8. HPGMG: Multi-grid cycles. [Figure: V-cycle vs F-cycle; the few finer grids run the smoother and residual computations, the many coarser grids end in a direct solve.] HPGMG implements the F-cycle, which has a better convergence rate than the V-cycle (a sketch follows below). The operator is Poisson or Helmholtz, using a 2nd- or 4th-order discretization.
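
A sketch of the F-cycle, reusing the hypothetical v_cycle() and direct_solve() helpers from the V-cycle sketch above: start on the coarsest grid and work upward, seeding each finer level from the coarser solution and refining with one V-cycle.

```c
/* F-cycle sketch; assumes b has already been restricted to every level.
   prolong() is another hypothetical helper (interpolation without add). */
void prolong(double *fine, const double *coarse, int n_fine);

void f_cycle(double **x, double **b, double **r, const int *n, int num_levels) {
  direct_solve(x[num_levels - 1], b[num_levels - 1], n[num_levels - 1]);
  for (int level = num_levels - 2; level >= 0; level--) {
    prolong(x[level], x[level + 1], n[level]);  /* seed from the coarser solution */
    v_cycle(x, b, r, n, level, num_levels);     /* refine with one V-cycle */
  }
}
```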

9. HPGMG: High-order discretizations. The order describes the relationship between grid spacing and error (see the illustration below). Why do we need higher-order discretizations? 1. Improved accuracy without a significant increase in memory usage. 2. A higher flop-per-byte ratio enables better utilization of modern architectures. Many solvers are moving towards high-order schemes for production codes.
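
A quick illustration of what "order" buys, in my notation rather than the slides': for a discretization of order p with grid spacing h,

```latex
% The error of an order-p scheme scales as e(h) ~ C h^p, so halving h
% cuts the error by 2^p: 4x at 2nd order, 16x at 4th order.
\[
  e(h) \approx C\,h^{p}
  \qquad\Longrightarrow\qquad
  \frac{e(h)}{e(h/2)} \approx 2^{p}
\]
```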

10. HPGMG v0.2: second-order. The 2nd-order operator is a 7-point stencil spanning planes K-1, K, and K+1; it decomposes into 6 x 2-point stencils.
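
For intuition, a sketch of a constant-coefficient 7-point stencil at cell (i,j,k). The names (jStride, kStride, h2inv) follow HPGMG's conventions, but this is an illustrative Poisson variant; HPGMG's actual FV2 operator applies variable face coefficients.

```c
/* Sketch: constant-coefficient 7-point FV2 stencil; h2inv = 1/h^2. */
static inline double apply_op_fv2(const double *x, int ijk,
                                  int jStride, int kStride, double h2inv) {
  return h2inv * ( x[ijk - kStride] + x[ijk - jStride] + x[ijk - 1]
                 - 6.0 * x[ijk]
                 + x[ijk + 1] + x[ijk + jStride] + x[ijk + kStride] );
}
```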

11. HPGMG v0.3: fourth-order. The 4th-order operator is a 25-point stencil spanning planes K-2 through K+2; it decomposes into 18 x 4-point stencils, with a flop:byte ratio of ~1.0.

12. HPGMG: 4th order vs 2nd order. Performs 4x the FP operations. MPI: sends 3x the messages and doubles their size (2-deep halos). The DRAM memory footprint is the same (assuming no overfetch). Attains a lower relative residual: ~10^-9 for a single F-cycle.

13. HYBRID IMPLEMENTATION

14. HYBRID IMPLEMENTATION: Take advantage of both architectures. [Figure: V-cycle and F-cycle split at a GPU threshold.] Fine levels are executed on the throughput-optimized processor (GPU); coarse levels are executed on the latency-optimized processor (CPU).

15. HYBRID IMPLEMENTATION: What is the optimal threshold? [Chart: HPGMG v0.3 hybrid performance in DOF/s for two problem sizes (h and 2h) vs. the number of levels on the GPU, ranging from all levels on the CPU to all levels on the GPU.] Rule of thumb: execute a level on the GPU if it has more than ~10K grid points (see the sketch below).
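
A sketch of that crossover heuristic. The level_type struct and its use_cuda flag appear later in the talk; the num_grid_points field and the 10K cutoff are illustrative, not HPGMG's exact code:

```c
/* Decide where a level runs, per the chart above (hypothetical field name). */
void choose_processor(level_type *level) {
  /* fine (large) levels go to the GPU, coarse (small) levels stay on the CPU */
  level->use_cuda = (level->num_grid_points > 10000);
}
```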

16. MEMORY MANAGEMENT: HPGMG-FV entities naturally map to the GPU hierarchy. [Diagram: grids 0-3 decompose into boxes and blocks; blocks map to GPU thread blocks on the GPU and OMP threads on the CPU, while the domain maps to an MPI rank (multi-process, multi-GPU).]

17. MEMORY MANAGEMENT: data layout. [Diagram: ~10 vectors of box volume over boxes 1-3, in two layouts.] If vector data within a level is contiguous, transfers require one copy per vector; if it is disjoint (per-box allocations), they require one copy per box (compare the two paths below).
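
A sketch contrasting the two layouts; all names here are illustrative, not HPGMG's:

```c
#include <cuda_runtime.h>

/* Contiguous level-wide slab: one copy moves a whole vector. */
void upload_contiguous(double *d_vec, const double *h_vec,
                       int num_boxes, int box_volume) {
  cudaMemcpy(d_vec, h_vec,
             (size_t)num_boxes * box_volume * sizeof(double),
             cudaMemcpyHostToDevice);            /* 1 copy per vector */
}

/* Disjoint per-box allocations: one copy per box, for every vector. */
void upload_disjoint(double **d_box, double **h_box,
                     int num_boxes, int box_volume) {
  for (int b = 0; b < num_boxes; b++)
    cudaMemcpy(d_box[b], h_box[b],
               (size_t)box_volume * sizeof(double),
               cudaMemcpyHostToDevice);          /* num_boxes copies per vector */
}
```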

18. MEMORY MANAGEMENT: Using Unified Memory. Developer view with Unified Memory: no changes to data structures, no explicit data movements, and a single pointer for CPU and GPU data. Use cudaMallocManaged for allocations.
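
A sketch of the only allocation change this requires: a drop-in replacement for malloc. um_malloc is a hypothetical wrapper; error handling is omitted.

```c
#include <cuda_runtime.h>
#include <stddef.h>

void *um_malloc(size_t size) {
  void *ptr = NULL;
  cudaMallocManaged(&ptr, size, cudaMemAttachGlobal); /* one pointer, valid on CPU and GPU */
  return ptr;
}
```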

19. UNIFIED MEMORY: Simplified GPU programming. Minimal modifications to the original code: (1) malloc is replaced with cudaMallocManaged for levels accessed by the GPU; (2) a CUDA kernel is invoked if the level size is greater than the threshold.

```c
void smooth(level_type *level, ...) {
  ...
  if (level->use_cuda) {
    // run on GPU
    cuda_cheby_smooth(level, ...);
  } else {
    // run on CPU
    #pragma omp parallel for
    for (block = 0; block < num_blocks; block++)
      ...
  }
}
```

20. UNIFIED MEMORY: What about performance? [NVVP timeline for HPGMG, showing levels 0, 1, and 2.] Problem: excessive page faults and migrations at the CPU-GPU crossover points.

21. UNIFIED MEMORY: Eliminating page migrations and faults. [Diagram: smoother, residual, and restriction steps across level N (GPU kernels) and level N+1 (CPU functions); the data of both the small level N+1 and the large level N is shared between CPU and GPU at the crossover.] Solution: allocate the first CPU level with cudaMallocHost (zero-copy memory), as sketched below.
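
A sketch of the resulting allocation policy. level_malloc is a hypothetical helper; GPU levels use managed memory, the first CPU level (which the GPU still touches at the crossover) uses pinned zero-copy memory so no faults or migrations occur, and deeper CPU-only levels use plain malloc.

```c
#include <cuda_runtime.h>
#include <stdlib.h>

void *level_malloc(size_t size, int runs_on_gpu, int shared_with_gpu) {
  void *ptr = NULL;
  if (runs_on_gpu)
    cudaMallocManaged(&ptr, size, cudaMemAttachGlobal); /* migrates on demand */
  else if (shared_with_gpu)
    cudaMallocHost(&ptr, size);  /* pinned; GPU accesses over the bus, no faults */
  else
    ptr = malloc(size);          /* CPU-only coarse levels */
  return ptr;
}
```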

22. UNIFIED MEMORY: results. [NVVP timelines comparing CPU levels allocated with cudaMallocManaged vs. cudaMallocHost.] With cudaMallocHost there are no page faults: a +20% speed-up!

23. KERNEL OPTIMIZATIONS

24. MULTI-GRID BOTTLENECK: Cost of operations. [Charts: kernel time as a fraction of level time and of total time, per level, for the smoother, residual, interpolation, restriction, copy_blocks, and apply_bc kernels; volume-dominated work shrinks and surface-dominated work grows on coarser levels.] Most time is spent on the stencil kernels (smoother and residual).

25. STENCILS ON GPU: Parallelization strategies. 3D thread blocks provide more than enough parallelism; 2D thread blocks parallelize over the XY plane and march along Z. [Figure: 2D blocks vs 3D blocks over an (i, j, k) box.]
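
A sketch of launching the 2D-block strategy, matching the baseline kernel on the next slide: one CUDA block per HPGMG block, with a 2D thread tile over the XY plane, and each thread marching along Z inside the kernel. The 32x4 block shape and the simplified smooth_kernel signature are illustrative.

```c
void launch_smooth(level_type level, int num_blocks) {
  dim3 threads(32, 4);                           /* 2D tile over the XY plane */
  smooth_kernel<<<num_blocks, threads>>>(level); /* blockIdx.x selects the block */
}
```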

26. STENCILS ON GPU: Baseline 2D kernel. Each thread block marches from 0 to DIMZ, applying the stencil operation (Poisson or Helmholtz; FV2 or FV4) and one of the smoothers: Chebyshev polynomials, Jacobi, or Gauss-Seidel red-black.

```c
__global__ void smooth_kernel(level_type level, ...) {
  blockCopy_type block = level.my_blocks[blockIdx.x];
  // prologue
  const double * __restrict__ rhs = ...;
  ...
  // thread block marches from 0 to dimz
  for (int k = 0; k < dimz; k++) {
    // apply operator
    const double Ax = apply_op_ijk();
    // smoother
#if defined(CHEBY)        // Chebyshev polynomials
    xo[ijk] = x[ijk] + ... + c2*lambda*(rhs[ijk]-Ax);
#elif defined(JACOBI)     // Jacobi
    xo[ijk] = x[ijk] + (2.0/3.0)*lambda*(rhs[ijk]-Ax);
#elif defined(GSRB)       // Gauss-Seidel red-black
    xo[ijk] = x[ijk] + RB[ij]*lambda*(rhs[ijk]-Ax);
#endif
  }
}
```

27. STENCILS ON GPU: Register caching. Cache the K-1, K, and K+1 planes of x in registers and slide the window along Z, so each plane is loaded from memory only once. The kernel uses 38 registers without the stencil; the 7-point FV2 stencil adds 18 registers (56 total) and the 4th-order FV4 stencil adds 90 (128 total). Up to 1.5x speed-up!

```c
// load k and k-1 planes into registers
double xc0 = x[ijk - kStride];
double xc1 = x[ijk];
...
for (k = 0; k < dimz; k++) {
  // load k+1 plane into registers
  xc2 = x[ijk + kStride];
  ...
  // apply operator (FV2: 7-point b*h2inv terms; FV4: 25-point
  // STENCIL_TWELFTH-weighted terms; full expressions elided here)
  const double Ax = apply_op_ijk();
  // smoother
  xo[ijk] = xc1 + ...;
  // update k and k-1 planes in registers
  xc0 = xc1; xc1 = xc2; ...
}
```

28. STENCILS ON GPU: Using caches on the SM. By default, global loads are not cached in L1 on Kepler. Use -Xptxas="-dlcm=ca" to enable L1 caching: up to 1.4x speed-up, 128B load granularity, better for coalesced access. Use the __ldg intrinsic for read-only accesses: up to 1.3x speed-up, 32B load granularity, better for scattered access. A small example of the __ldg path follows.
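
A sketch of the read-only load path: __ldg routes the load of x through the read-only (texture) cache with 32B granularity. The kernel itself is a toy AXPY, not HPGMG code.

```c
#include <cuda_runtime.h>

__global__ void axpy_ldg(double a, const double * __restrict__ x,
                         double *y, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    y[i] += a * __ldg(&x[i]);  /* read-only load via the LDG path */
}
```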
