SLIDE 1

Daino: A High-level Framework for Parallel and Efficient AMR on GPUs

Mohamed Wahib [1], Naoya Maruyama [1,2], Takayuki Aoki [2]

[1] RIKEN Advanced Institute for Computational Science, Kobe, Japan
[2] Tokyo Institute of Technology, GSIC, Tokyo, Japan

11th May 2017, GTC17

SLIDE 2
  • Motivation & Problem:
  • “AMR is one of the paths to multi-scale exascale applications”
  • Producing efficient AMR code is hard (especially for GPU)
  • Solution:
  • A framework for producing efficient AMR code (for GPUs)
  • Architecture-independent interface provided to the user
  • A speedup model for quantifying the efficiency of AMR code
  • Key results: We evaluate three AMR applications
  • Speedups & scalability comparable to hand-written code

(~3,642 K20x GPUs)

Summary

2

SLIDE 3
  • For the meshes used in some PDE-based simulations:
  • We only require high resolution for areas of interest
  • Resolution changes dynamically during simulation
  • Achieving efficient AMR is challenging
  • Managing an adaptive mesh can be complicated
  • Balancing compute load and communication costs

3

Adaptive Mesh Refinement (AMR)

SLIDE 4

Octree-based meshes: (a) Adaptive mesh (b) Tree representation

Structured Tree-based AMR

4

  • Many ways to represent the mesh
  • We focus on octree representation (quadtree in 2D)
  • Mesh divided into blocks, refine/coarsen if required

[Figure: in (b), the leaves of the tree are partitioned across processing elements PE1, PE2, and PE3; operations applied on the tree are distributed.]
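A minimal sketch (assumed names and block size, not Daino's actual data structures) of how a block-structured octree mesh is commonly represented: each leaf octant owns a fixed-size block of cells, and refining replaces a leaf with eight children (four for a quadtree in 2D).

    /* Illustrative sketch only, not Daino's real data structures. */
    #define BLOCK_N 16                 /* cells per block edge (assumed) */

    typedef struct Octant {
        int level;                     /* refinement level in the tree      */
        int owner_rank;                /* processing element owning the leaf */
        struct Octant *children[8];    /* all NULL for a leaf octant         */
        double *cells;                 /* BLOCK_N^3 cell values, leaves only */
    } Octant;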

SLIDE 5

How AMR Works

5

  • Initialize the mesh
  • FOR simulation time DO
    • Execute stencil operations for all blocks
    • Exchange ghost layers with neighbor nodes
    • IF time to remesh
      • Calculate the remeshing criterion for all blocks
      • Refine or consolidate blocks
      • Balance the mesh
    • ENDIF
    • IF time to load balance
      • Apply the load-balancing algorithm
    • ENDIF
  • ENDFOR

Computation benefits from the reduced mesh (less data to compute on); remeshing and load balancing are overhead.
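A minimal C-style sketch of the loop above; every type and function here is a hypothetical placeholder, not a Daino API.

    /* Sketch of the AMR time loop shown on this slide (placeholders only). */
    typedef struct Mesh Mesh;                     /* opaque adaptive mesh handle */
    void initialize_mesh(Mesh*);                  void apply_stencil_to_all_blocks(Mesh*);
    void exchange_ghost_layers(Mesh*);            void compute_remeshing_criterion(Mesh*);
    void refine_or_consolidate_blocks(Mesh*);     void balance_mesh(Mesh*);
    void load_balance(Mesh*);

    void amr_driver(Mesh *mesh, int t_max, int remesh_every, int balance_every)
    {
        initialize_mesh(mesh);
        for (int t = 0; t < t_max; t++) {
            apply_stencil_to_all_blocks(mesh);
            exchange_ghost_layers(mesh);          /* with neighboring blocks/nodes  */
            if (t % remesh_every == 0) {
                compute_remeshing_criterion(mesh);/* per block                      */
                refine_or_consolidate_blocks(mesh);
                balance_mesh(mesh);               /* e.g. enforce a 2:1 level balance */
            }
            if (t % balance_every == 0)
                load_balance(mesh);               /* redistribute blocks across PEs */
        }
    }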

SLIDE 6

AMR on GPUs

6

  • Hard to achieve efficient AMR with GPUs
  • Few existing AMR frameworks support GPU:
  • User must provide code optimized for GPU
  • Scalability problems due to CPU-GPU data movement
  • No speedup-bound model

Contributions of our framework (marked 1, 2, 3 on the following slides): (1) an architecture-neutral interface, (2) a scalable data-centric AMR model, and (3) a speedup model.

SLIDE 7

Framework for Efficient AMR

7

  • A compiler and runtime
  • Input:
  • Serial code applying stencil on a uniform grid
  • User adds directives to identify relevant data arrays
  • Architecture-neutral
  • Output:
  • Executable binary for target architecture
  • Code is parallel and optimized for GPU (MPI+CUDA)
SLIDE 8

Architecture-neutral Interface

(1 of 2)

8

Existing AMR frameworks: the user must supply architecture-specific kernels, e.g.

    // CUDA code
    __global__ void 3D_alloy(..) {
        ... CUDA kernel code ...
    }

    // OpenMP code
    void 3D_alloy(..) {
        #pragma omp for
        ... kernel code ...
    }

    CUDA / OpenMP code → Framework → GPU AMR executable / CPU AMR executable

Our framework: the user supplies uniform-mesh serial C code with directives, e.g.

    #pragma daino kernel
    void 3D_alloy(..) {
        #pragma daino data (Nx,Ny,Nz) {p, u, dpt, no, o}
        ... kernel code ...
    }

    Uniform-mesh serial C code → Framework → GPU AMR executable / CPU AMR executable

Contribution 1

Two benefits:

  • Productivity
  • Ability to apply low-level GPU optimizations
SLIDE 9

    #pragma dno kernel
    void func(float ***a, float ***b, ..) {
        #pragma dno data domName(i, j, k) a, b;
        #pragma dno timeloop
        for (int t = 0; t < TIME_MAX; t++) {
            for (int i = 0; i < NX; i++)
                for (int j = 0; j < NY; j++) {
                    ...  // computation not related to a and b
                    for (int k = 0; k < NZ; k++) {
                        a[i][j][k] = c * (b[i-1][j][k] + b[i+1][j][k] + b[i][j][k]
                                        + b[i][j+1][k] + b[i][j-1][k]);
                    }
                }
        }
    }

Minimal example of using directives in our framework

Architecture-neutral Interface

(2 of 2)

9

Contribution 1

The directives mark: a target kernel (#pragma dno kernel), the data arrays + iterators (#pragma dno data), and the target loop (#pragma dno timeloop).

SLIDE 10

Scalable AMR: Data-centric Model (1 of 2)

10

  • A data-centric approach
  • Each computing element specializes in its own data
  • Blocks live on the GPU; the octree data structure lives on the CPU
  • Migrate all operations touching block data to the GPU
  • The CPU only processes the octree data structure

Contribution 2
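A minimal sketch of this split, with assumed names and types (not Daino's actual data structures): the CPU keeps per-leaf metadata, while each leaf's cell array is allocated directly in GPU memory and stays resident there.

    /* Sketch only; illustrates the data placement, not Daino's real types. */
    #include <stdlib.h>
    #include <cuda_runtime.h>

    typedef struct {            /* lives in CPU memory (octree metadata)   */
        int level;              /* refinement level of the leaf            */
        long morton_key;        /* position of the leaf in the octree      */
        double *d_cells;        /* device pointer: block data on the GPU   */
    } LeafMeta;

    LeafMeta *create_leaf(int level, long key, size_t ncells)
    {
        LeafMeta *m = (LeafMeta *)malloc(sizeof *m);
        m->level = level;
        m->morton_key = key;
        cudaMalloc((void **)&m->d_cells, ncells * sizeof(double)); /* block stays on the GPU */
        return m;
    }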

SLIDE 11
  • All kernels are data parallel (i.e. well-suited to GPU)

11

Scalable AMR: Data-centric Model (2 of 2)

[Figure (Contribution 2): conceptual overview of data-centric GPU AMR. CPU memory holds only the octree (AMR metadata); each GPU's memory holds its octants' data arrays. The CPU copies the initial arrays to the GPUs once; in each loop iteration it invokes GPU kernels to compute the stencil, apply the post-stencil correction, exchange/copy ghost layers, and evaluate the error estimate; depending on the estimated error relative to the threshold δ, refine or consolidate kernels are invoked, the octree is updated and balanced, and blocks are moved between GPUs. Final arrays are copied back at the end.]

[1] Mohamed Wahib, Naoya Maruyama. Data-centric GPU-based Adaptive Mesh Refinement. IA^3 '15, 5th Workshop on Irregular Applications: Architectures and Algorithms, co-located with SC'15.
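The bullet above notes that all remeshing kernels are data parallel. A hedged CUDA sketch of what a per-block error-estimation kernel could look like; the gradient-based criterion, names, and launch configuration are assumptions, not Daino's generated kernel.

    /* Assumptions: one CUDA thread block per mesh block, blockDim.x is a
     * power of two, and a crude gradient magnitude serves as the criterion.
     * Launch with dynamic shared memory of blockDim.x * sizeof(double). */
    __global__ void estimate_error(const double *cells, double *block_error,
                                   int cells_per_block)
    {
        extern __shared__ double smax[];
        const double *c = cells + (size_t)blockIdx.x * cells_per_block;

        double local = 0.0;
        for (int i = threadIdx.x; i + 1 < cells_per_block; i += blockDim.x) {
            double g = fabs(c[i + 1] - c[i]);        /* 1D forward difference */
            if (g > local) local = g;
        }
        smax[threadIdx.x] = local;
        __syncthreads();

        for (int s = blockDim.x / 2; s > 0; s >>= 1) /* max-reduction */
        {
            if (threadIdx.x < s && smax[threadIdx.x + s] > smax[threadIdx.x])
                smax[threadIdx.x] = smax[threadIdx.x + s];
            __syncthreads();
        }
        if (threadIdx.x == 0)
            block_error[blockIdx.x] = smax[0];       /* compared against δ later */
    }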

SLIDE 12
  • AMR promises reduced computation
  • Problem: the overhead of managing the hierarchical mesh
  • Projected speedup bound
  • Informs the framework designer of the efficiency of the AMR code
  • Compares the achieved speedup vs. the projected upper-bound speedup
  • Takes the AMR overhead into account
  • If the projected speedup is far from the achieved speedup
  • Some AMR overhead(s) are not properly accounted for

Speedup Model

12

Contribution 3
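The slide does not spell out the model's formula. As an illustration only (not necessarily Daino's exact model), a projected AMR speedup bound can be expressed as the uniform-mesh runtime divided by the adaptive-mesh compute time plus the AMR overheads:

    /* Illustrative sketch only, not necessarily Daino's exact speedup model. */
    double projected_speedup(double t_uniform,      /* runtime on the uniform fine mesh    */
                             double t_amr_compute,  /* stencil time on the adaptive mesh   */
                             double t_remesh,       /* remeshing overhead                  */
                             double t_balance)      /* load-balancing / data-movement cost */
    {
        return t_uniform / (t_amr_compute + t_remesh + t_balance);
    }

If the measured speedup falls far below this bound, some overhead term is being underestimated.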

SLIDE 13

Framework Implementation (1 of 2)

13

[Figure 1: overview of the framework implementation. Annotated fixed-mesh code enters the compiler front end (Clang); translations and optimizations are applied as passes over the LLVM-IR (front end → passes → back end, C/C++ → IR → machine code); the resulting object files are linked with the Daino runtime (AMR library and communication library) to produce the adapted-mesh executable.]

SLIDE 14

The Daino framework overview: application C code is transformed into an optimized executable. Daino components are enclosed in the red dotted line.

Framework Implementation (2 of 2)

14

[Figure: the Daino compilation flow. Application C code is parsed into an AST; the stencil code is emitted as a stencil GPU kernel in NVVM IR, which NVPTX compiles to PTX; the application itself is emitted as LLVM IR, and an IR pass generates the AMR driver IR and stencil IR; the translator also produces refine, coarsen, and error-estimation kernels. The object files are linked with the Daino runtime (AMR library and communication library), which invokes the kernels through CUDA Driver API calls, yielding the final executable.]
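The figure shows the runtime invoking generated kernels through CUDA Driver API calls after NVPTX emits PTX. A minimal, generic sketch of that mechanism; the kernel name, launch configuration, and omission of error checking are illustrative assumptions, not Daino's actual runtime code.

    #include <cuda.h>

    /* ptx: NUL-terminated PTX text produced by the compiler back end. */
    void launch_generated_kernel(const char *ptx, void *args[], int nblocks)
    {
        CUdevice dev;  CUcontext ctx;  CUmodule mod;  CUfunction fn;
        cuInit(0);
        cuDeviceGet(&dev, 0);
        cuCtxCreate(&ctx, 0, dev);
        cuModuleLoadData(&mod, ptx);                      /* JIT-compile the PTX     */
        cuModuleGetFunction(&fn, mod, "stencil_kernel");  /* placeholder kernel name */
        cuLaunchKernel(fn, nblocks, 1, 1,                 /* grid dimensions         */
                       256, 1, 1,                         /* block dimensions        */
                       0, 0,                              /* shared memory, stream   */
                       args, 0);                          /* kernel arguments        */
        cuCtxSynchronize();
    }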

SLIDE 15

Runtime Libraries

15

  • AMR Management
  • Maintain the octree
  • Orchestration of work
  • Memory manager
  • Especially important with GPU
  • Communication
  • MPI processes
  • Halo data exchange
  • Transparent access to blocks
  • Moving blocks (load balancing)
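A minimal sketch of the halo (ghost-layer) exchange between two MPI processes; the function name is a placeholder, and a CUDA-aware MPI is assumed so device pointers can be passed directly (otherwise the layers would be staged through host buffers first).

    #include <mpi.h>

    /* Sketch only: exchange one block's ghost layer with a neighbor rank. */
    void exchange_ghost_layer(double *d_send, double *d_recv, int n,
                              int neighbor, MPI_Comm comm)
    {
        MPI_Sendrecv(d_send, n, MPI_DOUBLE, neighbor, 0,   /* send our boundary layer    */
                     d_recv, n, MPI_DOUBLE, neighbor, 0,   /* receive theirs in exchange */
                     comm, MPI_STATUS_IGNORE);
    }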
SLIDE 16

Evaluation

16

Application              Description
Hydrodynamics solver     A 2nd-order directionally split hyperbolic scheme to solve the Euler equations [RTVD scheme modified from GAMER [1]]
Shallow-water solver     Shallow-water simulations modeled by depth-averaging the Navier–Stokes equations [2nd-order Runge-Kutta method]
Phase-field simulation   3D dendritic growth during binary alloy solidification [2] [time integration by the Allen-Cahn equation]

[1] H.-Y. Schive, U.-H. Zhang, and T. Chiueh. Directionally Unsplit Hydrodynamic Schemes with Hybrid MPI/OpenMP/GPU Parallelization in AMR. Int. J. High Perform. Comput. Appl., 26(4):367-377, Nov. 2012.
[2] T. Shimokawabe et al. Peta-scale Phase-Field Simulation for Dendritic Solidification on the TSUBAME 2.0 Supercomputer. SC'11.

SLIDE 17

Results (1 of 4)

17

Weak scaling of uniform mesh, hand-written AMR, and automated AMR (GAMER-generated AMR included for the hydrodynamics application)

  • We use TSUBAME2.5 supercomputer (TokyoTech)
  • Up to 3,642 K20x GPUs
  • TSUBAME Grand Challenge Category A (full machine)

[Charts: weak-scaling runtime (seconds) vs. number of GPUs (16 to 3,600); series are Uniform Mesh, Auto AMR (Daino), Hand-written AMR, and, for hydrodynamics, Auto AMR (GAMER).
 Hydrodynamics (mesh size per GPU: 4,096³): speedups of 9.4x and 8.5x over the uniform mesh.
 Phase-field (mesh size per 16 GPUs: 4,096x512x512): speedups of 1.78x and 1.66x.
 Shallow-water (mesh size per GPU: 8,192³): speedups of 3.8x and 2.9x.]

SLIDE 18

Results (2 of 4)

18

Strong scaling of uniform mesh, hand-written AMR, and automated AMR (GAMER-generated AMR included for the hydrodynamics application)

  • Notes:
  • Phase-field achieves a 1.7x speedup
  • The original implementation is a 2011 Gordon Bell winner
  • Daino is faster than the GAMER AMR version
  • GAMER is a leading framework for AMR on GPUs

[Charts: strong-scaling runtime (seconds) vs. number of GPUs (16 to 3,600); series as in the weak-scaling plots.
 Phase-field (mesh size: 4,096³): speedups of 1.7x and 1.3x over the uniform mesh.
 Hydrodynamics (mesh size per GPU: 4,096³): speedups of 9.6x and 7.4x.
 Shallow-water (mesh size per GPU: 8,192³): speedups of 4.1x and 3.2x.]

SLIDE 19

Results (3 of 4)

19

  • Overhead of the AMR framework (weak scaling):

AMR overhead grows from 12% on 16 GPUs to 16% on 3,600 GPUs; the remeshing kernels are well-suited to the GPU.

SLIDE 20

Results (4 of 4)

Speedup: measured vs. projected. M is measured, P is the practical AMR speedup projection, and T is the theoretical AMR speedup projection.

20

  • Efficiency of transformation:
  • Achieved speedup > 86% of practical limit

[Charts: measured vs. projected speedup as a function of the number of GPUs for hydrodynamics, phase-field, and shallow-water.]

SLIDE 21
  • Problem:
  • AMR is one of the paths to multi-scale exascale applications
  • Producing efficient AMR code is hard (especially for GPU)
  • Solution:
  • A framework for producing efficient AMR code (for GPUs)
  • Architecture-independent interface provided to the user
  • A speedup model for quantifying the efficiency of AMR code
  • Key results: We evaluate three AMR applications
  • Speedups & scalability comparable to hand-written code

(3,642 K20x GPUs)

Summary

21

SLIDE 22

Future Work

  • Expand Daino
  • Incorporate Daino's GPU backend into other AMR frameworks
  • Work in progress on porting new applications (CFD)
  • Support user-specified boundary conditions, equations of state, and flux corrections
  • Extend support to Intel Xeon Phi (KNL)
  • We already introduced experimental OpenMP support (not fully optimized)
  • Leverage the speedup-model analysis
  • Auto-tuning

22

Daino will be publicly released at: http://github.com/wahibium/Daino

SLIDE 23

Thank you for listening. Questions?

23