

SLIDE 1

HIGH-ORDER DISCRETIZATIONS FOR GEOMETRIC MULTI-GRID METHODS

Nikolay Sakharnykh (NVIDIA), Dusan Stosic (UFPE)

April 4-7, 2016 | Silicon Valley

SLIDE 2

MULTI-GRID METHODS

Introduction

Multi-grid solves elliptic PDEs (Ax = b) using a hierarchical approach:
the solution to a hard problem is expressed as the solution to an easier problem,
which accelerates the iterative method and provides O(N) complexity.

SLIDE 3

MULTI-GRID METHODS

Applications

[Figures: external flow over a body, reservoir simulation, lattice QCD (Brookhaven National Laboratory)]

SLIDE 4

MULTI-GRID METHODS

Geometric vs algebraic

Ax = b

GMG:
- Uses a structured grid
- Operator (A) is a stencil
- Constructs the coarse grid directly
- More efficient approach
- Requires geometric information

AMG:
- Uses an arbitrary graph
- Operator (A) is a sparse matrix
- Constructs the coarse grid using the RAP product
- More general approach
- Performance optimization is challenging

SLIDE 5

NVIDIA AMGX

Black-box library for AMG

Fast, scalable linear solvers, with an emphasis on iterative methods
Flexible toolkit for GPU-accelerated Ax = b solvers
Simple API makes it easy to solve your problems faster

Used in industry-leading CFD packages!

SLIDE 6

NVIDIA AMGX

Example usage

// One header
#include "amgx_c.h"

// Read config file
AMGX_create_config(&cfg, cfgfile);

// Create resources based on config
AMGX_resources_create_simple(&res, cfg);

// Create solver object, A, x, b, set precision
AMGX_solver_create(&solver, res, mode, cfg);
AMGX_matrix_create(&A, res, mode);
AMGX_vector_create(&x, res, mode);
AMGX_vector_create(&b, res, mode);

// Read coefficients from a file
AMGX_read_system(&A, &x, &b, matrixfile);

// Setup and Solve Loop
AMGX_solver_setup(solver, A);
AMGX_solver_solve(solver, b, x);

// Download Result
AMGX_download_vector(&x)

Configuration file:

solver(main)=FGMRES
main:max_iters=100
main:convergence=RELATIVE_MAX
main:tolerance=0.1
main:preconditioner(amg)=AMG
amg:algorithm=AGGREGATION
amg:selector=SIZE_8
amg:cycle=V
amg:max_iters=1
amg:max_levels=10
amg:smoother(smoother)=BLOCK_JACOBI
amg:relaxation_factor=0.75
amg:presweeps=1
amg:postsweeps=2
amg:coarsest_sweeps=4
determinism_flag=1

SLIDE 7

HPGMG

High-Performance Geometric Multi-Grid

Developed at Lawrence Berkeley National Laboratory
FVM and FEM variants; we focus on FVM
Proxy for AMR and Low Mach Combustion codes
Used in Top500 benchmarking

http://crd.lbl.gov/departments/computer-science/PAR/research/hpgmg/

SLIDE 8

HPGMG

Multi-grid cycles

HPGMG implements an F-cycle, which has a better convergence rate than the V-cycle
Poisson or Helmholtz operator using 2nd- or 4th-order discretization

[Diagram: V-cycle vs. F-cycle, few finer grids and many coarser grids; smoother and residual computations at each level, direct solve on the coarsest grid]
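For reference, the recursive structure of a single V-cycle can be sketched as follows (a generic outline with hypothetical helpers such as smooth, residual, restrict_to, interpolate_from, and direct_solve, not HPGMG's actual API). An F-cycle additionally performs a V-cycle at every level as it works its way up from the coarsest grid, which is what improves its convergence rate.

// Generic V-cycle sketch; the helper routines are assumed to exist elsewhere.
void smooth(int level); void residual(int level); void restrict_to(int level);
void interpolate_from(int level); void direct_solve(int level);

void v_cycle(int level, int coarsest) {
    if (level == coarsest) {
        direct_solve(level);            // the coarsest problem is small enough to solve directly
        return;
    }
    smooth(level);                      // pre-smooth (Chebyshev, Jacobi, or GSRB)
    residual(level);                    // r = b - A*x
    restrict_to(level + 1);             // move the residual to the coarser grid
    v_cycle(level + 1, coarsest);       // recursively solve for the coarse-grid correction
    interpolate_from(level + 1);        // prolong the correction back to the finer grid
    smooth(level);                      // post-smooth
}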

SLIDE 9

HPGMG

High-order discretizations

Order describes the relationship between grid spacing and error
Why do we need higher-order discretizations?
1. Improved accuracy without a significant increase in memory usage
2. A higher flop-per-byte ratio enables better utilization of modern architectures
Many solvers are moving towards high-order schemes for production codes
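The textbook relation behind the term "order" (a standard statement, not specific to HPGMG): for a scheme of order p on a grid with spacing h, the discretization error behaves as

\[ \| u_h - u \| = O(h^{p}), \qquad p = 2 \ \text{(FV2)}, \quad p = 4 \ \text{(FV4)} \]

so halving the grid spacing reduces the error by roughly 4x for the 2nd-order scheme and by roughly 16x for the 4th-order scheme, while the number of unknowns grows by the same 8x in 3D either way.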

SLIDE 10

HPGMG

v0.2: second-order

2nd order: 7-point stencil
6 x 2-point stencils

[Stencil diagram: planes K-1, K, K+1]
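For orientation, a variable-coefficient 2nd-order operator applied at one cell has this generic 7-point form (a sketch in the spirit of HPGMG-FV's face coefficients beta_i/beta_j/beta_k and h2inv = 1/h^2; the exact names are illustrative):

// 7-point, 2nd-order stencil: the center value couples to six face neighbors
double Ax = a*alpha[ijk]*x[ijk]
          - b*h2inv*( beta_i[ijk + 1      ]*(x[ijk + 1      ] - x[ijk])
                    + beta_i[ijk          ]*(x[ijk - 1      ] - x[ijk])
                    + beta_j[ijk + jStride]*(x[ijk + jStride] - x[ijk])
                    + beta_j[ijk          ]*(x[ijk - jStride] - x[ijk])
                    + beta_k[ijk + kStride]*(x[ijk + kStride] - x[ijk])
                    + beta_k[ijk          ]*(x[ijk - kStride] - x[ijk]) );

Each parenthesized difference is one of the six 2-point stencils counted above.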

SLIDE 11

HPGMG

v0.3: fourth-order

4th order: 25-point stencil
18 x 4-point stencils, flop:byte ~1.0

[Stencil diagram: planes K-2, K-1, K, K+1, K+2]

SLIDE 12

HPGMG

4th order vs 2nd order

Performs 4x the FP operations
MPI: sends 3x the messages and doubles their size (2-deep halos)
DRAM memory footprint is the same (assuming no overfetch)
Attains a lower relative residual: ~1e-9 for a single F-cycle

SLIDE 13

HYBRID IMPLEMENTATION

SLIDE 14

HYBRID IMPLEMENTATION

Take advantage of both architectures

Fine levels are executed on throughput-optimized processors (GPU)
Coarse levels are executed on latency-optimized processors (CPU)

[Diagram: V-cycle and F-cycle split at a threshold level, with finer levels on the GPU and coarser levels on the CPU]

SLIDE 15

HYBRID IMPLEMENTATION

What is the optimal threshold?

[Chart: HPGMG v0.3 hybrid performance, DOF/s vs. number of levels on the GPU (1 to 7), spanning the range from all levels on the CPU to all levels on the GPU; the best configuration executes a level on the GPU if it has more than ~10K grid points]
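In code, this crossover can be captured as a per-level flag that the solver checks before launching work; a minimal hypothetical sketch (the field and helper names are illustrative, and the 10K threshold is the one read off the chart above):

// Mark a level for GPU execution only if it is large enough to amortize
// kernel launch and synchronization overheads.
for (int l = 0; l < num_levels; l++)
    levels[l]->use_cuda = (total_grid_points(levels[l]) > 10000);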

SLIDE 16

MEMORY MANAGEMENT

HPGMG-FV entities naturally map to GPU hierarchy

[Diagram: HPGMG-FV entities (domain, box, block) mapped onto the execution hierarchy (multi-process / MPI rank, multi-GPU / GPU, OMP thread / thread block), shown for grids 0-3]

SLIDE 17

MEMORY MANAGEMENT

volume

[Diagram: ~10 vectors, each spanning box 1, box 2, box 3, under two layouts]

Contiguous layout: vector data within a level is contiguous; requires one copy per vector
Disjoint layout: vector data within a level is disjoint; requires one copy per box

SLIDE 18

MEMORY MANAGEMENT

Using Unified Memory

No changes to data structures
No explicit data movements
Single pointer for CPU and GPU data
Use cudaMallocManaged for allocations

[Diagram: developer view with Unified Memory, a single unified memory space visible to both CPU and GPU]
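A minimal standalone illustration of the idea (not HPGMG code): a single managed allocation is touched by plain CPU code and by a GPU kernel through the same pointer.

#include <cuda_runtime.h>
#include <stdio.h>

__global__ void scale(double *v, double s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] *= s;                       // GPU writes the managed array
}

int main(void) {
    const int n = 1 << 20;
    double *v = NULL;
    cudaMallocManaged(&v, n * sizeof(double));  // one pointer, valid on CPU and GPU

    for (int i = 0; i < n; i++) v[i] = 1.0;     // CPU initializes it in place

    scale<<<(n + 255) / 256, 256>>>(v, 2.0, n);
    cudaDeviceSynchronize();                    // synchronize before the CPU reads again

    printf("v[0] = %f\n", v[0]);                // prints 2.000000
    cudaFree(v);
    return 0;
}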

SLIDE 19

UNIFIED MEMORY

Simplified GPU programming

Minimal modifications to the original code:
(1) malloc replaced with cudaMallocManaged for levels accessed by the GPU
(2) Invoke a CUDA kernel if the level size is greater than the threshold

void smooth(level_type *level, ...) {
    ...
    if (level->use_cuda) {
        // run on GPU
        cuda_cheby_smooth(level, ...);
    } else {
        // run on CPU
        #pragma omp parallel for
        for (block = 0; block < num_blocks; block++)
            ...
    }
}

SLIDE 20

UNIFIED MEMORY

What about performance?

Problem: excessive faults and migrations at CPU-GPU crossover points

[NVVP timeline for HPGMG showing levels 0, 1, and 2]

SLIDE 21

UNIFIED MEMORY

Eliminating page migrations and faults

At the crossover point, level N (large) and level N+1 (small) are both shared between the CPU and the GPU
Solution: allocate the first CPU level with cudaMallocHost (zero-copy memory)

[Diagram: level N is processed by GPU kernels (smoother, residual, restriction); level N+1 is processed by CPU functions (smoother, residual)]
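One way to express the fix is a per-level allocator switch; a hypothetical sketch (level_alloc and the shared_with_cpu flag are illustrative, not HPGMG's actual allocation routine):

#include <cuda_runtime.h>

// Levels that stay on the GPU use managed memory; the first CPU level uses
// pinned (zero-copy) host memory, so GPU accesses go over the bus directly
// instead of triggering page faults and migrations.
static void *level_alloc(size_t bytes, int shared_with_cpu) {
    void *p = NULL;
    if (shared_with_cpu)
        cudaMallocHost(&p, bytes);      // pinned host memory, accessible from GPU kernels
    else
        cudaMallocManaged(&p, bytes);   // migrates on demand between CPU and GPU
    return p;
}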

SLIDE 22

UNIFIED MEMORY

[Chart: CPU levels allocated with cudaMallocManaged vs. cudaMallocHost; the cudaMallocHost version shows no page faults and a +20% speed-up]

SLIDE 23

KERNEL OPTIMIZATIONS

SLIDE 24

MULTI-GRID BOTTLENECK

Cost of operations

[Charts: kernel time as a fraction of total time and as a fraction of per-level time, for levels 1-6, broken down into smoother, interpolation, copy_blocks, residual, restriction, and apply_bc; most time is spent on stencils (volume vs. surface kernels)]

SLIDE 25

STENCILS ON GPU

Parallelization strategies

3D thread blocks: provide more than enough parallelism
2D thread blocks: parallelize over the XY plane and march along Z

[Diagram: 3D vs. 2D thread-block decompositions of the i, j, k index space]

SLIDE 26

STENCILS ON GPU

Baseline 2D kernel

__global__ void smooth_kernel(level_type level, ...) {
    blockCopy_type block = level.my_blocks[blockIdx.x];
    // prologue
    const double * __restrict__ rhs = ...;
    ...
    for (int k = 0; k < dimz; k++) {
        // apply operator
        const double Ax = apply_op_ijk();
        // smoother
        #ifdef CHEBY
        xo[ijk] = x[ijk] + ... + c2*lambda*(rhs[ijk] - Ax);
        #elif JACOBI
        xo[ijk] = x[ijk] + (2.0/3.0)*lambda*(rhs[ijk] - Ax);
        #elif GSRB
        xo[ijk] = x[ijk] + RB[ij]*lambda*(rhs[ijk] - Ax);
        #endif
    }
}

Annotations: apply_op_ijk applies the stencil operation (Poisson or Helmholtz, FV2 or FV4); the thread block marches from 0 to dimz; the smoother variants are Chebyshev polynomials, Jacobi, and Gauss-Seidel red-black

SLIDE 27

STENCILS ON GPU

Register caching

// load k and k-1 planes into registers
double xc0 = x[ijk - kStride];
double xc1 = x[ijk];
...
for (k = 0; k < dimz; k++) {
    // load k+1 plane into registers
    xc2 = x[ijk + kStride];
    ...
    // apply operator
    const double Ax = apply_op_ijk();
    // smoother
    xo[ijk] = xc1 + ...;
    // update k and k-1 planes in registers
    xc0 = xc1; xc1 = xc2;
    ...
}

4th-order stencil, 90 regs (38 regs in the kernel without the stencil):

const double Ax =
    - b*h2inv*(
        STENCIL_TWELFTH*(
            + bic1 * ( 15.0*(xl1-xc1) - (xll-xr1) )
            + bir1 * ( 15.0*(xr1-xc1) - (xrr-xl1) )
            + bjc1 * ( 15.0*(xu1-xc1) - (xuu-xd1) )
            + bjd1 * ( 15.0*(xd1-xc1) - (xdd-xu1) )
            + bkc1 * ( 15.0*(xc0-xc1) - (xbb-xc2) )
            + bkc2 * ( 15.0*(xc2-xc1) - (xff-xc0) ) )
        + 0.25*STENCIL_TWELFTH*(
            + (bid  - biu ) * (xld - xd1 - xlu + xu1)
            + (bic2 - bic0) * (xl2 - xc2 - xl0 + xc0)
            + (bjr  - bjl ) * (xru - xr1 - xlu + xl1)
            + (bjc2 - bjc0) * (xu2 - xc2 - xu0 + xc0)
            + (bkr1 - bkl1) * (xr0 - xr1 - xl0 + xl1)
            + (bkd1 - bku1) * (xd0 - xd1 - xu0 + xu1)
            + (bird - biru) * (xrd - xd1 - xru + xu1)
            + (bir2 - bir0) * (xr2 - xc2 - xr0 + xc0)
            + (bjrd - bjld) * (xrd - xr1 - xld + xl1)
            + (bjd2 - bjd0) * (xd2 - xc2 - xd0 + xc0)
            + (bkr2 - bkl2) * (xr2 - xr1 - xl2 + xl1)
            + (bkd2 - bku2) * (xd2 - xd1 - xu2 + xu1) ));

7-point stencil, 18 regs (total reg usage: 56 for FV2 and 128 for FV4):

const double Ax =
    - b*h2inv*(
        STENCIL_TWELFTH*(
            + bir1 * (xr1 - xc1)
            + bic1 * (xl1 - xc1)
            + bju1 * (xu1 - xc1)
            + bjc1 * (xd1 - xc1)
            + bkc2 * (xc2 - xc1)
            + bkc1 * (xc0 - xc1) ));

up to 1.5x speed-up!

SLIDE 28

STENCILS ON GPU

Using caches on SM

By default, global loads are not cached in L1 on Kepler
Use -Xptxas="-dlcm=ca" to enable L1: 128B load granularity, better for coalesced access
Use the __ldg intrinsic for read-only accesses: 32B load granularity, better for scattered access

up to 1.4x speed-up! up to 1.3x speed-up!
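For illustration, a toy kernel showing the __ldg path (the 1D 3-point stencil and all names here are made up; HPGMG's kernels apply the same idea to the 3D stencils above):

// Read-only loads of x are routed through the 32B-granularity read-only cache
// with __ldg; the output write to xo takes the normal path.
__global__ void ldg_example(const double* __restrict__ x, double *xo,
                            int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= stride && i < n - stride) {
        double left  = __ldg(&x[i - stride]);
        double mid   = __ldg(&x[i]);
        double right = __ldg(&x[i + stride]);
        xo[i] = left - 2.0 * mid + right;   // simple 3-point second difference
    }
}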

SLIDE 29

STENCILS ON GPU

Shared memory approach

Copy to registers by staging in shared memory
Reduce redundant accesses along the XY plane
Double buffering to avoid synchronization
Store only 'x' and '𝛾𝑙' in shared memory

[Diagram: an (M+2R) x (N+2R) shared-memory tile with halo radius R]

int tx = threadIdx.x + R;
int ty = threadIdx.y + R;
sx[ty][tx], sx[ty][tx+1]
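A minimal sketch of this staging pattern (not HPGMG's kernel): the thread block covers an XY tile plus its halo, every thread stages one value of x into shared memory per Z plane, and only interior threads apply the XY part of the stencil; the Z neighbors still come from global memory (or from the register planes of the previous slide). The block is assumed to be launched with dim3(TX + 2*R, TY + 2*R) threads.

#define R  1            // halo radius
#define TX 30           // interior tile width  (block is TX + 2R threads wide)
#define TY 6            // interior tile height (block is TY + 2R threads tall)

__global__ void smem_stencil(const double *x, double *xo,
                             int dimx, int dimy, int dimz) {
    __shared__ double sx[TY + 2*R][TX + 2*R];

    // shift by R so halo threads fall just outside the interior tile
    int i = blockIdx.x*TX + threadIdx.x - R;
    int j = blockIdx.y*TY + threadIdx.y - R;
    int jStride = dimx, kStride = dimx*dimy;

    bool in_grid  = (i >= 0 && i < dimx && j >= 0 && j < dimy);
    bool interior = (threadIdx.x >= R && threadIdx.x < TX + R &&
                     threadIdx.y >= R && threadIdx.y < TY + R &&
                     i >= R && i < dimx - R && j >= R && j < dimy - R);

    for (int k = R; k < dimz - R; k++) {
        int ijk = i + j*jStride + k*kStride;
        if (in_grid) sx[threadIdx.y][threadIdx.x] = x[ijk];   // stage one XY plane
        __syncthreads();

        if (interior) {
            // toy 7-point Laplacian: XY neighbors from shared memory, Z from global
            xo[ijk] = sx[threadIdx.y][threadIdx.x - 1] + sx[threadIdx.y][threadIdx.x + 1]
                    + sx[threadIdx.y - 1][threadIdx.x] + sx[threadIdx.y + 1][threadIdx.x]
                    + x[ijk - kStride] + x[ijk + kStride]
                    - 6.0 * sx[threadIdx.y][threadIdx.x];
        }
        __syncthreads();   // double buffering the tile would remove this barrier
    }
}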

SLIDE 30

STENCILS ON GPU

Optimizations summary on Kepler

[Chart: 4th-order GSRB smoother performance, kernel speed-up per optimization (baseline, tex, l1, reg, reg+smem, reg+tex, reg+l1), ranging from 1x up to about 2x]

SLIDE 31

STENCILS ON GPU

Bottleneck analysis on Tesla K40

METRIC            BASELINE FV2   BASELINE FV4   OPTIMIZED FV2   OPTIMIZED FV4
OCCUPANCY         0.55           0.25           0.55            0.25
DRAM FOOTPRINT    1.10 GB        1.09 GB        0.97 GB         1.09 GB
L2 FOOTPRINT      2.22 GB        8.22 GB        1.48 GB         3.70 GB
DRAM BANDWIDTH    0.76           0.20           0.75            0.41
L2 BANDWIDTH      0.69           0.67           0.53            0.60
TIME              5.63 ms        20.65 ms       5.06 ms         10.48 ms

DECREASE IN L2 FOOTPRINT, INCREASE IN DRAM BW
FV2 LIMITED BY DRAM, FV4 LIMITED BY L2

SLIDE 32

MAXWELL?

Poor double precision performance 

The 2nd-order scheme works really well: it is limited by DRAM bandwidth and improves with register/caching optimizations
The 4th-order scheme starts to bottleneck in the DP unit on Maxwell: performance stays flat across all optimizations
We have more flops to execute, but not enough DP throughput

SLIDE 33

ALTERNATIVE COMPILERS?

NVCC vs CLANG

The experiment with clang went smoothly overall
Compilation time is 20% shorter for clang 3.9 compared to nvcc 7.5
Minor issues with CUB due to PTX intrinsics
GSRB smoother kernel runtime:

                Baseline (ms)   Optimized (ms)
nvcc 7.5        25.5            17.0
clang 3.9       27.2            19.4
clang vs nvcc   +7%             +14%

http://llvm.org/docs/CompileCudaWithLLVM.html
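For reference, the compile command from the linked LLVM documentation takes roughly this form (the source file name and GPU architecture are placeholders):

clang++ smoother.cu -o smoother -O3 --cuda-gpu-arch=sm_35 \
    -L<CUDA install path>/lib64 -lcudart_static -ldl -lrt -pthread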

SLIDE 34

MULTI-GPU

SLIDE 35

MULTI-GPU

MPI communications

Communication scheme:
1 - send data (boundary -> MPI buffer)
2 - local exchange (internal -> internal)
3 - receive data (MPI buffer -> boundary)
Intra-level: boundary ghost region exchange
Inter-level: restriction and interpolation

[Diagram: boundary exchange between boxes 0-3 across the grids]

SLIDE 36

MULTI-GPU

MPI vs SHMEM

S6378 - Simplifying Multi-GPU Communication with NVSHMEM Wednesday @ 16:30, Room 211B

Boundary exchange code

MPI:
CopyKernel(BOUNDARY-TO-BUFFER)
cudaDeviceSync
MPI_Irecv + MPI_Isend
CopyKernel(INTERNAL-TO-INTERNAL)
MPI_Waitall
CopyKernel(BUFFER-TO-BOUNDARY)

SHMEM:
CopyKernel(ALL-TO-ALL)
shmem_barrier_all
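A self-contained sketch (not HPGMG's code) of the MPI sequence above: pack the boundary into an MPI buffer on the GPU, overlap the local internal copies with MPI_Irecv/MPI_Isend, then unpack the received buffer. The pack/unpack kernels here are deliberately trivial; the real code gathers and scatters ghost-zone cells.

#include <mpi.h>
#include <cuda_runtime.h>

__global__ void pack_boundary(const double *x, double *buf, int n) {
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) buf[i] = x[i];                  // gather halo cells into the send buffer
}

__global__ void unpack_boundary(double *x, const double *buf, int n) {
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) x[i] = buf[i];                  // scatter received halo cells back
}

void exchange_boundary(double *d_x, double *d_send, double *d_recv,
                       int n, int neighbor, MPI_Comm comm) {
    MPI_Request reqs[2];
    int threads = 256, blocks = (n + threads - 1) / threads;

    pack_boundary<<<blocks, threads>>>(d_x, d_send, n);    // boundary -> MPI buffer
    cudaDeviceSynchronize();                               // buffer must be complete before MPI uses it

    // with CUDA-aware MPI these can be device pointers; otherwise stage through host buffers
    MPI_Irecv(d_recv, n, MPI_DOUBLE, neighbor, 0, comm, &reqs[0]);
    MPI_Isend(d_send, n, MPI_DOUBLE, neighbor, 0, comm, &reqs[1]);

    // ... local internal -> internal copies would run here, overlapping the transfers ...

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    unpack_boundary<<<blocks, threads>>>(d_x, d_recv, n);  // MPI buffer -> boundary
}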

SLIDE 37

MULTI-GPU

Unified Memory and MPI

Remember that we replaced malloc with cudaMallocManaged
Unified Memory with a regular MPI implementation:
Requires an unmanaged staging buffer
Does not play well with RDMA protocols

[Profiler timeline: cudaMallocManaged + regular MPI, data migrations (50%)]

SLIDE 38

MULTI-GPU

CUDA-aware MPI

Pinned allocations and CUDA-aware MPI
Unified Memory and CUDA-aware MPI:
Supported in OpenMPI >= 1.8.5 and MVAPICH2-GDR >= 2.2b
Avoids CUDA IPC, which is not supported for Unified Memory

SINGLE NODE WITH P2P BETWEEN TWO K40s

S6142 – Multi GPU Programming with MPI Today @ 13:00, Room 212A
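A hedged sketch of the CUDA-aware path (illustrative code, not HPGMG's): with a CUDA-aware MPI build, a managed (or device) pointer can be handed directly to MPI calls, so no unmanaged staging buffer is needed.

#include <mpi.h>
#include <cuda_runtime.h>

// Send a halo buffer that lives in managed memory straight through MPI.
void send_halo(int count, int neighbor, MPI_Comm comm) {
    double *halo = NULL;
    cudaMallocManaged(&halo, count * sizeof(double));  // usable by CPU, GPU, and (CUDA-aware) MPI

    // ... a GPU kernel would fill the halo buffer here ...
    cudaDeviceSynchronize();

    MPI_Send(halo, count, MPI_DOUBLE, neighbor, /*tag=*/0, comm);
    cudaFree(halo);
}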

SLIDE 39

MULTI-GPU

Zero-copy MPI buffers

Use zero-copy to allocate MPI buffers and write them directly from GPU kernels
Can outperform RDMA for large sizes at scale

[Chart: DOF/s on 2x Tesla K40 with P2P support, comparing managed, managed + CUDA-aware, zero-copy, and device + CUDA-aware MPI buffer strategies]

SLIDE 40

MULTI-GPU RESULTS

Weak scaling within a single node

CPU: Intel E5-2698 v3 @ 2.3GHz, 3.6GHz Turbo (Haswell-EP), HT off
GPU: NVIDIA Tesla K80

[Chart: DOF/s for 2xCPU (1x), 2xCPU + 2xGPU (3.7x), 2xCPU + 4xGPU (5.5x), 2xCPU + 8xGPU (9.8x)]

SLIDE 41

MULTI-GPU RESULTS

Weak scaling across thousands of nodes

[Chart: ORNL Titan performance of HPGMG v0.3, DOF/s vs. number of nodes (2,000 to 18,000), CPU only vs. CPU + GPU; the GPU-accelerated version is about 3x faster]

*Results obtained by Sam Williams (Lawrence Berkeley National Laboratory)

SLIDE 42

TAKEAWAYS

Both CPU and GPU architectures are utilized for the best HPGMG performance
Unified Memory greatly simplifies porting the code to the GPU architecture
Optimizations are not intrusive and should apply to future architectures
The code scales well up to 16K GPUs on large supercomputing clusters

SLIDE 43

LEARN MORE

CUDA sources: https://bitbucket.org/nsakharnykh/hpgmg-cuda
Parallel Forall blog post: http://nvda.ly/YFAcI
S6216 - The Future of Unified Memory, Tuesday @ 16:00, Room 212A
CUDA Libraries Hangout (AmgX, cuBLAS, cuSolver, cuSparse), Thursday @ 10:00, Pod B

SLIDE 44

April 4-7, 2016 | Silicon Valley

THANK YOU

JOIN THE NVIDIA DEVELOPER PROGRAM AT developer.nvidia.com/join