 
Daino: A High-level Framework for Parallel and Efficient AMR on GPUs
Mohamed Wahib (1), Naoya Maruyama (1,2), Takayuki Aoki (2)
(1) RIKEN Advanced Institute for Computational Science, Kobe, Japan
(2) Tokyo Institute of Technology, GSIC, Tokyo, Japan
11th May 2017, GTC17
Summary
- Motivation & Problem:
  - "AMR is one of the paths to multi-scale exascale applications"
  - Producing efficient AMR code is hard (especially for GPUs)
- Solution:
  - A framework for producing efficient AMR code (for GPUs)
  - Architecture-independent interface provided to the user
  - A speedup model for quantifying the efficiency of AMR code
- Key results: we evaluate three AMR applications
  - Speedups and scalability comparable to hand-written code (~3,642 K20x GPUs)
Adaptive Mesh Refinement (AMR)
- For meshes in some simulations using PDEs:
  - We only require high resolution for areas of interest
  - Resolution changes dynamically during simulation
- Achieving efficient AMR is challenging
  - Managing an adaptive mesh can be complicated
  - Balancing compute load and communication costs
Structured Tree-based AMR
- Many ways to represent the mesh
  - We focus on the octree representation (quadtree in 2D)
- Mesh divided into blocks; refine/coarsen if required
[Figure: Octree-based meshes: (a) adaptive mesh, (b) tree representation, with blocks assigned to PE 1, PE 2 and PE 3. Operations applied on the tree are distributed.]
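The refine/coarsen operations on a tree of blocks can be sketched in a few lines of C. This is a minimal 2D (quadtree) illustration, not Daino's data structure: the `QuadNode` type and all `quad_*` function names are hypothetical, and the block payload (the cell data each leaf would own) is omitted.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical quadtree node: each leaf stands for one block of the mesh. */
typedef struct QuadNode {
    int level;                  /* refinement level, 0 = root */
    struct QuadNode *child[4];  /* all NULL for a leaf */
} QuadNode;

QuadNode *quad_new(int level) {
    QuadNode *n = malloc(sizeof *n);
    assert(n != NULL);
    n->level = level;
    for (int c = 0; c < 4; c++) n->child[c] = NULL;
    return n;
}

int quad_is_leaf(const QuadNode *n) { return n->child[0] == NULL; }

/* Refine: split a leaf block into four children one level finer. */
void quad_refine(QuadNode *n) {
    assert(quad_is_leaf(n));
    for (int c = 0; c < 4; c++) n->child[c] = quad_new(n->level + 1);
}

/* Coarsen (consolidate): merge four leaf children back into their parent. */
void quad_coarsen(QuadNode *n) {
    for (int c = 0; c < 4; c++) {
        assert(n->child[c] != NULL && quad_is_leaf(n->child[c]));
        free(n->child[c]);
        n->child[c] = NULL;
    }
}

/* Number of leaf blocks, i.e. the blocks that actually hold mesh data. */
int quad_count_leaves(const QuadNode *n) {
    if (quad_is_leaf(n)) return 1;
    int total = 0;
    for (int c = 0; c < 4; c++) total += quad_count_leaves(n->child[c]);
    return total;
}

void quad_free(QuadNode *n) {
    if (n == NULL) return;
    for (int c = 0; c < 4; c++) quad_free(n->child[c]);
    free(n);
}
```

Refining a leaf replaces one block with four, so the leaf count grows by three per refinement; an octree would use eight children instead of four.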
How AMR Works
    Initialize the mesh
    FOR simulation time DO
        Execute stencil operations for all blocks
        Exchange ghost layers with neighbor nodes
        IF time to remesh
            Calculate remeshing criterion for all blocks
            Refine or consolidate blocks
            Balance the mesh
        ENDIF
        IF time to load balance
            Apply load balancing algorithm
        ENDIF
    ENDFOR
- Computation (stencil and ghost-layer exchange): reduced by AMR, since there is less data in the mesh
- Overhead: remeshing and load balancing
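The control flow of the loop above can be sketched in C as a driver that calls one function per phase. This is only an illustration of the schedule, under assumptions: the `AmrOps` callback table, `amr_run`, and the fixed remesh and load-balance intervals are hypothetical names and choices, and the counting callbacks stand in for real stencil, ghost-exchange, remeshing and load-balancing code.

```c
#include <assert.h>

/* Hypothetical table of the phases in the AMR loop. */
typedef struct {
    void (*stencil)(void *mesh);          /* stencil on all blocks */
    void (*exchange_ghosts)(void *mesh);  /* ghost layers with neighbors */
    void (*remesh)(void *mesh);           /* criterion, refine/consolidate, balance mesh */
    void (*load_balance)(void *mesh);     /* redistribute blocks */
} AmrOps;

/* Run `steps` time steps; remesh and load balance at fixed intervals (an assumption). */
void amr_run(void *mesh, const AmrOps *ops, int steps,
             int remesh_every, int balance_every) {
    assert(remesh_every > 0 && balance_every > 0);
    for (int t = 1; t <= steps; t++) {
        ops->stencil(mesh);          /* computation */
        ops->exchange_ghosts(mesh);
        if (t % remesh_every == 0)   /* overhead: remeshing */
            ops->remesh(mesh);
        if (t % balance_every == 0)  /* overhead: load balancing */
            ops->load_balance(mesh);
    }
}

/* Stand-in "mesh" that only counts how often each phase runs. */
typedef struct { int stencil, ghosts, remesh, balance; } PhaseCounts;

static void count_stencil(void *m) { ((PhaseCounts *)m)->stencil++; }
static void count_ghosts(void *m)  { ((PhaseCounts *)m)->ghosts++; }
static void count_remesh(void *m)  { ((PhaseCounts *)m)->remesh++; }
static void count_balance(void *m) { ((PhaseCounts *)m)->balance++; }

static const AmrOps counting_ops = {
    count_stencil, count_ghosts, count_remesh, count_balance
};
```

With 100 steps, a remesh interval of 10 and a load-balance interval of 25, the stencil runs 100 times while remeshing runs 10 times and load balancing 4 times, which is why the overhead phases can be amortized over many compute steps.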
AMR on GPUs
- Hard to achieve efficient AMR with GPUs
- Few existing AMR frameworks support GPUs:
  - (1) User must provide code optimized for GPU
  - (2) Scalability problems due to CPU-GPU data movement
  - (3) No speedup-bound model
- Points (1) to (3) are the contributions of our framework
Framework for Efficient AMR
- A compiler and runtime
- Input:
  - Serial code applying a stencil on a uniform grid
  - User adds directives to identify relevant data arrays
  - Architecture-neutral
- Output:
  - Executable binary for the target architecture
  - Code is parallel and optimized for GPU (MPI+CUDA)
Contribution 1: Architecture-neutral Interface (1 of 2)
[Figure: Existing AMR frameworks vs. our framework. Existing frameworks take CUDA code (`__global__ 3D_alloy(..) { … CUDA kernel code ... }`) to produce a GPU AMR executable, or OpenMP code (`void 3D_alloy(..) { #pragma omp for … kernel code ... }`) to produce a CPU AMR executable. Our framework takes uniform-mesh serial C code annotated with directives and produces both GPU AMR and CPU AMR executables.]
Uniform-mesh serial C code with directives, as shown on the slide:
    #pragma daino kernel
    void 3D_alloy(..) {
    #pragma daino data (Nx,Ny,Nz) {p, u, dpt, no, o;}
        … kernel code ...
    }
Two benefits:
- Productivity
- Ability to apply low-level GPU optimizations
Contribution 1: Architecture-neutral Interface (2 of 2)
Minimal example of using directives in our framework:
    #pragma dno kernel                     // a target kernel
    void func(float ***a, float ***b, ..) {
    #pragma dno data domName(i, j, k)      // data arrays + iterators
        a, b;
    #pragma dno timeloop                   // target loop
        for (int t = 0; t < TIME_MAX; t++) {
            // i and j skip the boundary so that i-1, i+1, j-1, j+1 stay in range
            for (int i = 1; i < NX - 1; i++)
                for (int j = 1; j < NY - 1; j++) {
                    ... // computation not related to a and b
                    for (int k = 0; k < NZ; k++) {
                        a[i][j][k] = c * (b[i-1][j][k] + b[i+1][j][k] + b[i][j][k]
                                        + b[i][j+1][k] + b[i][j-1][k]);
                    }
                }
        }
    }
Contribution 2: Scalable AMR, Data-centric Model (1 of 2)
- A data-centric approach
  - Each computing element specializes in its data
  - Blocks on GPU, octree data structure on CPU
- Migrate all operations touching block data to GPU
- CPU only processes the octree data structure
Contribution 2: Scalable AMR, Data-centric Model (2 of 2)
[Figure: Conceptual overview of data-centric GPU AMR. The CPU memory holds the octree (AMR metadata); each GPU memory (GPU 0, GPU 1, GPU 2) holds octants (data arrays), and blocks can be moved between GPU memories. CPU-side steps and the GPU work they trigger: Initialize (copy initial arrays); then, in a loop, Exchange Ghost Layers (copy ghost layers), Compute Stencil (invoke stencil kernel), Post-Stencil Correction (invoke correction kernel), Evaluate Error (invoke error estimation kernel, copy δ), Refine or Consolidate depending on the comparison with δ (invoke refine or consolidate kernel), Update & Balance Octree; finally Finalize (copy final arrays).]
- All kernels are data parallel (i.e. well-suited to GPU)
[1] Mohamed Wahib, Naoya Maruyama, Data-centric GPU-based Adaptive Mesh Refinement, IA^3'15, 5th Workshop on Irregular Applications: Architectures and Algorithms, co-located with SC'15
Contribution 3: Speedup Model
- AMR promises reduced computation
  - Problem: overhead in managing the hierarchical mesh
- Project a speedup bound
  - Informs the framework designer of the efficiency of the AMR code
  - Compare achieved speedup vs. projected upper-bound speedup
  - Takes AMR overhead into account
- If the projected speedup is far from the achieved speedup:
  - Some AMR overhead(s) are not properly accounted for
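The slides do not give the formula behind the speedup model, so the following C sketch is only one simple way to picture the comparison, under a loudly stated assumption: the AMR runtime is bounded below by the uniform-mesh runtime scaled by the fraction of cells the adaptive mesh keeps, plus the measured AMR overhead. The function names `projected_speedup` and `speedup_efficiency` and their parameters are hypothetical.

```c
#include <assert.h>

/*
 * Assumed model (not taken from the slides):
 *   t_amr >= t_uniform * cell_fraction + t_overhead
 * so the projected upper bound on speedup is
 *   t_uniform / (t_uniform * cell_fraction + t_overhead).
 *
 * t_uniform     : runtime of the uniform-mesh code (seconds)
 * cell_fraction : cells in the adaptive mesh / cells in the uniform mesh
 * t_overhead    : time spent in remeshing and load balancing (seconds)
 */
double projected_speedup(double t_uniform, double cell_fraction, double t_overhead) {
    assert(t_uniform > 0.0);
    assert(cell_fraction > 0.0 && cell_fraction <= 1.0);
    assert(t_overhead >= 0.0);
    return t_uniform / (t_uniform * cell_fraction + t_overhead);
}

/* Ratio of achieved to projected speedup; values well below 1 suggest
 * that some overhead is missing from the model. */
double speedup_efficiency(double achieved, double projected) {
    assert(projected > 0.0);
    return achieved / projected;
}
```

Under this assumed model, a uniform run of 100 s with a quarter of the cells and no overhead projects a 4x bound; adding 25 s of overhead halves it to 2x.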
Framework Implementation (1 of 2)
[Figure 1: Overview of framework implementation. Annotated fixed-mesh code goes through the compiler front end to LLVM IR; passes produce optimized LLVM IR with calls into the Daino runtime (AMR library, communication library); LLVM emits object files and the linker produces the adapted-mesh executable.]
[Figure: LLVM pipeline. C/C++ enters the front end (Clang); the IR is transformed by a sequence of passes (LLVM proper); the back end emits machine code.]
- Apply translations and optimizations as passes
Framework Implementation (2 of 2)
[Figure: The Daino framework overview. Application C code is transformed to an optimized executable; Daino components are enclosed in a red dotted line. Elements shown: application C code, stencil code, translator, AST, application LLVM IR, generated refine, coarsen and error kernels (NVVM IR), stencil GPU kernel (CUDA, NVVM IR), IR pass, NVPTX, PTX, CUDA Driver API calls, AMR driver, Daino runtime API calls (AMR library, communication library), compile to object files, link, executable.]
Runtime Libraries
- AMR management
  - Maintain the octree
  - Orchestration of work
- Memory manager
  - Especially important with GPU
- Communication
  - MPI processes
  - Halo data exchange
  - Transparent access to blocks
  - Moving blocks (load balancing)
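The halo (ghost layer) exchange handled by the communication library can be pictured with two neighboring blocks at the same refinement level. The sketch below is a shared-memory, 1D illustration only; in a distributed run the copies would be messages between MPI processes. `Block1D`, `BLOCK_N`, `GHOST` and `exchange_ghosts` are hypothetical names, not part of Daino.

```c
#include <assert.h>

#define BLOCK_N 8  /* interior cells per block (hypothetical size) */
#define GHOST   1  /* ghost cells on each side */

/* Layout: [GHOST low ghosts | BLOCK_N interior cells | GHOST high ghosts] */
typedef struct {
    double cell[BLOCK_N + 2 * GHOST];
} Block1D;

/* Fill the facing ghost cells of two adjacent blocks from each other's
 * boundary interior cells. `left` sits immediately below `right`. */
void exchange_ghosts(Block1D *left, Block1D *right) {
    assert(left != right);
    for (int g = 0; g < GHOST; g++) {
        /* last interior cells of left -> low ghosts of right */
        right->cell[g] = left->cell[BLOCK_N + g];
        /* first interior cells of right -> high ghosts of left */
        left->cell[GHOST + BLOCK_N + g] = right->cell[GHOST + g];
    }
}
```

After the exchange, a stencil can be applied to every interior cell of a block without reading its neighbor directly, which is what makes per-block kernels independent.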
Evaluation
Application: Description
- Hydrodynamics Solver: 2nd-order directionally split hyperbolic schemes to solve the Euler equations [RTVD scheme modified from GAMER [1]]
- Shallow-water Solver: we model shallow-water simulations by depth-averaging the Navier-Stokes equations [2nd-order Runge-Kutta method]
- Phase-field Simulation: 3D dendritic growth during binary alloy solidification [2] [time integration by the Allen-Cahn equation]
[1] H.-Y. Schive, U.-H. Zhang, and T. Chiueh, Directionally Unsplit Hydrodynamic Schemes with Hybrid MPI/OpenMP/GPU Parallelization in AMR, Int. J. High Perform. Comput. Appl., 26(4):367-377, Nov. 2012
[2] T. Shimokawabe et al., Peta-scale Phase-Field Simulation for Dendritic Solidification on the TSUBAME 2.0 Supercomputer, SC'11
Results (1 of 4)
- We use the TSUBAME2.5 supercomputer (TokyoTech)
  - Up to 3,642 K20x GPUs
  - TSUBAME Grand Challenge Category A (full machine)
[Figure: Weak scaling of uniform mesh, hand-written and automated AMR (GAMER-generated AMR included in hydrodynamics). Panels: Hydrodynamics, Phase-field, Shallow-waters. Series: uniform mesh, hand-written AMR, auto AMR (Daino), and auto AMR (GAMER) in the hydrodynamics panel. x-axis: number of GPUs (16 to 3600), with mesh sizes given as 4,096^3 per GPU, 8,192^3 per GPU, and 4,096x512x512 per 16 GPUs; y-axis: runtime (seconds). Speedup labels on the plots: 1.66x, 1.78x, 2.9x, 3.8x, 8.5x, 9.4x.]
Results (2 of 4)
- Notes:
  - Phase-field achieves 1.7x speedup
    - Original implementation is the Gordon Bell 2011 winner
  - Daino is faster than the GAMER AMR version
    - GAMER is a leading framework for AMR over GPUs
[Figure: Strong scaling of uniform mesh, hand-written and automated AMR (GAMER-generated AMR included in hydrodynamics). Panels: Phase-field, Hydrodynamics, Shallow-waters. Series: uniform mesh, hand-written AMR, auto AMR (Daino), and auto AMR (GAMER) in the hydrodynamics panel. x-axis: number of GPUs (16 to 3600), with mesh sizes given as 4,096^3 per GPU, 4,096^3, and 8,192^3 per GPU; y-axis: runtime (seconds). Speedup labels on the plots: 1.7x, 9.6x, 4.1x, 7.4x, 3.2x, 1.3x.]
Results (3 of 4)
- Overhead of the AMR framework (weak scaling):
  - AMR overhead grows from 12% on 16 GPUs to 16% on 3600 GPUs
  - Remeshing kernels are well-suited to GPU