

SLIDE 1

First Steps of YALES2 Code Towards GPU Acceleration on Standard and Prototype Cluster

YALES2: Semi-industrial code for turbulent combustion and flows
Jean-Matthieu Etancelin, ROMEO, NVIDIA GPU Application Lab, University of Reims
GTC Europe, München, October 11th, 2017

SLIDE 2


  • 1. Introduction: Context
      ROMEO HPC Center – GPU Application Lab
      YALES2
  • 2. Existing code profiling
  • 3. Code porting
      Porting strategies
      Internal kernel performance
  • 4. Benchmarks
  • 5. Conclusions
      Limitations and future work


SLIDE 3


ROMEO HPC Center

University of Reims

  • About 25,000 students
  • Multidisciplinary university (undergraduate, graduate, PhD, research labs)

ROMEO HPC Center

  • HPC resources for both academic and industrial research
  • Expertise and teaching in HPC and GPU technologies
  • Integrated in the European HPC ecosystem (French Tier 1.5, Equip@Meso, ETP4HPC)
  • Fully hybrid cluster (nodes with 2 × Intel Ivy Bridge + 2 × K20 + IB QDR)


SLIDE 4


GPU Application Lab

Objectives: intensive exploitation of ROMEO

  • Expertise in hybrid HPC, in particular in GPU technologies
  • GPU code porting
  • Optimization and scaling up to large numbers of GPUs
  • Training and teaching for ROMEO users

Activities

  • Optimization of GPU, hybrid and parallel codes
  • Algorithm improvements for the targeted architectures
  • Adaptation of numerical methods to hybrid and parallel architectures

Various collaborations

  • Local URCA laboratories, plus external collaborations (ONERA, University of Normandy)
  • Several application domains (fluid mechanics, chemistry, computer science, applied maths, ...)


SLIDE 5


YALES2

Massively parallel solver for multi-physics problems in fluid dynamics, from primary atomisation to pollutant dispersion in complex geometries

  • Code developed at CORIA (University of Normandy) since 2007
  • V. Moureau, G. Lartigue, P. Bénard (project leaders)
  • ∼10 developers (engineers, researchers, PhD students, ...) plus contributors

Code

  • Two-phase and reactive flow simulations at low Mach number in complex geometries
  • LES and DNS solvers on unstructured meshes
  • 3D flow simulations on massively parallel architectures
  • Used by more than 160 academic and industrial researchers
  • 60+ scientific publications


SLIDE 6


YALES2, a complete library

Main features

  • 350,000 lines of Fortran 90/2003 code
  • Portable
  • Python interface
  • Main solvers:
      Scalar solver (SCS)
      Level set solver (LSS)
      Lagrangian solver (LGS)
      Incompressible solver (ICS)
      Variable density solver (VDS)
      Spray solver (SPS)
      Magneto-hydrodynamic solver (MHD)
      Heat transfer solver (HTS)
      Chemical reactor solver (CRS)
      Darcy solver (DCY)
      Mesh movement solver (MMS)
      ALE solver (ALE)
      Linear acoustics solver (ACS)
  • 5+ solvers in progress


SLIDE 7


HPC with YALES2 in combustion

Multi-scale and multi-physics applications

  • More than 85% of the energy used comes from combustion
  • Relevant to many fields (transportation, industry, energy, ...)

Examples in aeronautics:


SLIDE 8


HPC with YALES2

HPC

  • Runs on up to 10,000 cores on French national clusters (IDRIS, CINES, ...), regional (CRIANN) and local machines
  • Uses advanced parallel programming techniques (hybrid computing, automatic mesh adaptation, ...)
  • Collaborations with the Exascale Lab (Intel/CEA/GENCI/UVSQ)
  • Code used as a benchmark on prototypes (IDRIS, Ouessant: Power8 + P100), Cellule de Veille Technologique (GENCI)
  • Collaboration on GPU porting with the GPU Application Lab, ROMEO


SLIDE 9


Existing code profiling



SLIDE 10


Profiling the existing code

Specific tools (MAQAO + TAU + PAPI)

  • In-depth profiling (hardware counters of the kind sketched below):
      Computational time (per function, per internal and external loop)
      Number of floating-point operations
      Number of cache misses
      ...
  • Hot-spot: the matrix-vector product in the Preconditioned Conjugate Gradient (PCG)

[Figures: per-function profile and external-loop profile]
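As an illustration of the kind of hardware counters such tools collect, here is a minimal PAPI sketch in C (the loop, event choice and program structure are hypothetical; the actual profiling relied on MAQAO/TAU/PAPI instrumentation of the Fortran code). Build with -lpapi.

    #include <stdio.h>
    #include <papi.h>

    /* Minimal PAPI sketch: count floating-point operations and L2 cache
       misses around a hypothetical loop.  Illustration only of the kind
       of counters gathered during the profiling campaign.               */
    int main(void)
    {
        double a[1024], s = 0.0;
        int event_set = PAPI_NULL;
        long long counters[2];

        if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) return 1;
        if (PAPI_create_eventset(&event_set) != PAPI_OK) return 1;
        PAPI_add_event(event_set, PAPI_FP_OPS);   /* floating-point operations */
        PAPI_add_event(event_set, PAPI_L2_TCM);   /* L2 total cache misses     */

        PAPI_start(event_set);
        for (int i = 0; i < 1024; ++i) {          /* hypothetical hot loop */
            a[i] = 0.5 * i;
            s += a[i] * a[i];
        }
        PAPI_stop(event_set, counters);

        printf("FP ops: %lld  L2 misses: %lld  (s = %g)\n",
               counters[0], counters[1], s);
        return 0;
    }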


SLIDE 11


Profiling the existing code

Identifying the hot-spot

  • Preconditioned Conjugate Gradient: 250 lines of code account for 55% of the total time
  • Matrix-vector product: 30 lines of code account for 30% of the total time


SLIDE 12


Code porting



SLIDE 13


How to port the hot-spot to the GPU?

Main feature of the code: a data-centered structure

  • Hierarchical, well-defined data structures based on a block decomposition of the mesh
  • Every computing loop follows the same skeleton (two levels of nested loops: over mesh blocks, then over vertices, edges or elements)

Code porting

Three major possibilities (a directive-style sketch in C follows this list):

  • OpenACC with PGI compilers
      Non-intrusive for the code (macros)
      Complementary with the in-progress OpenMP version
      Strong potential with unified memory
      No deep copy for complex data structures
      No support for Fortran pointers
  • CUDA/C with Intel compilers
      Fine-grained management of the GPU (code + data)
      Passes through intermediate C interfaces
      No deep copy for complex data structures
      Code rewriting (only for the computational loops)
  • CUDA/Fortran with PGI (not tested)
      Similar to CUDA/C, without the C interface
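For reference, the directive-based style of the first option looks like this in OpenACC, shown here on a plain C loop (the YALES2 code itself is Fortran, where the equivalent would be !$acc directives; the function, its arguments and the loop body are illustrative only, not YALES2 code).

    #include <stdlib.h>

    /* Directive-based offload: the pragma moves the loop to the GPU and the
       data clauses manage the transfers.  Nested derived types are where the
       "no deep copy" limitation mentioned above shows up.                    */
    void axpy(int n, double alpha, const double *restrict x, double *restrict y)
    {
        #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
        for (int i = 0; i < n; ++i)
            y[i] += alpha * x[i];
    }

    int main(void)
    {
        int n = 1 << 20;
        double *x = malloc(n * sizeof *x), *y = malloc(n * sizeof *y);
        for (int i = 0; i < n; ++i) { x[i] = 1.0; y[i] = 2.0; }
        axpy(n, 0.5, x, y);          /* compiled with e.g. pgcc -acc */
        free(x); free(y);
        return 0;
    }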

SLIDE 14


Code porting with CUDA

Key points

Data management

  • Exploiting Fortran/C interoperability for the data structures
  • Fortran derived types are translated into C typedefs (by an automatic, YALES2-specific translation tool); a hand-written sketch of such a pair is shown below
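As an illustration of what such a translation produces, here is a hand-written C mirror of a hypothetical interoperable Fortran derived type (all field names are invented; the real YALES2 types are generated by the translation tool).

    /* C mirror of a hypothetical Fortran derived type declared with BIND(C):
     *
     *   type, bind(c) :: grid_data_t
     *     integer(c_int) :: nnodes, nedges
     *     type(c_ptr)    :: coords      ! c_loc of a real(c_double) array
     *     type(c_ptr)    :: edge2node   ! c_loc of an integer(c_int) array
     *   end type grid_data_t
     *
     * Component order and kinds must match exactly on both sides. */
    typedef struct {
        int     nnodes;
        int     nedges;
        double *coords;      /* 3 * nnodes coordinates, allocated on the Fortran side  */
        int    *edge2node;   /* 2 * nedges connectivity, allocated on the Fortran side */
    } grid_data_t;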

GPU memory management

  • Allocation and management of GPU-specific data and utility arrays
  • CPU-GPU transfers optimized through a staging buffer array in pinned memory (a minimal sketch follows)
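A minimal CUDA sketch of the pinned staging-buffer pattern (buffer policy and names are hypothetical; this shows the general technique, not the YALES2 code). The caller must synchronize the stream before the buffer is packed again.

    #include <cuda_runtime.h>

    /* Pinned (page-locked) host buffer used to stage CPU->GPU transfers,
       so that cudaMemcpyAsync can overlap with kernels queued on a stream. */
    int stage_to_device(const double *host_data, double *dev_data,
                        size_t n, cudaStream_t stream)
    {
        static double *pinned_buf = NULL;      /* reused staging buffer      */
        static size_t  buf_len    = 0;

        if (n > buf_len) {                     /* (re)allocate pinned memory */
            if (pinned_buf) cudaFreeHost(pinned_buf);
            if (cudaMallocHost((void **)&pinned_buf, n * sizeof(double)) != cudaSuccess)
                return -1;
            buf_len = n;
        }

        for (size_t i = 0; i < n; ++i)         /* pack into the pinned buffer */
            pinned_buf[i] = host_data[i];

        /* Asynchronous copy: returns immediately and can overlap with
           computations (partial overlap of transfers by computations). */
        return cudaMemcpyAsync(dev_data, pinned_buf, n * sizeof(double),
                               cudaMemcpyHostToDevice, stream) == cudaSuccess ? 0 : -1;
    }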

Execution model

  • The mesh decomposition and the hierarchical data structure are mapped onto CUDA blocks/threads

Algorithm adaptation: inverse connectivity for mesh exploration

  • Loop first over vertices instead of edges (the Finite Volume method works on edges by construction)


SLIDE 15


CUDA code porting

Inverse connectivity for mesh exploration

Computation of the matrix-vector product (op_product)

  • Initial algorithm (not well suited to the GPU):

    Foreach bloc b of mesh            // CUDA blocks
      Foreach edge e of b             // CUDA threads
        vs, ve = vertex(e)
        result(vs) += f(value(e), data(vs), data(ve))
        result(ve) -= f(value(e), data(vs), data(ve))

  • Algorithm with inverse connectivity:

    Foreach bloc b of mesh            // CUDA blocks
      Foreach vertex v of b           // CUDA threads
        r = 0                         // register
        Foreach edge e from vertex v
          ve = end(e)
          r += f(value(e), data(v), data(ve))
        Foreach edge e to vertex v
          vs = start(e)
          r -= f(value(e), data(vs), data(v))
        result(v) = r
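A minimal CUDA kernel sketch of the inverse-connectivity version, assuming CSR-like offset/index arrays for the incoming and outgoing edges of each vertex (all names and the edge function are hypothetical, not the actual YALES2 kernel):

    #include <cuda_runtime.h>

    __device__ double edge_f(double w, double a, double b) { return w * (b - a); }  /* placeholder */

    /* One CUDA block per mesh block, one thread per vertex of that block.
       out_off/out_edge list the edges leaving a vertex, in_off/in_edge the
       edges arriving at it (the "inverse connectivity").                    */
    __global__ void op_product(const int *vtx_off,            /* vertex ranges per mesh block */
                               const int *out_off, const int *out_edge, const int *edge_end,
                               const int *in_off,  const int *in_edge,  const int *edge_start,
                               const double *value, const double *data, double *result)
    {
        int first = vtx_off[blockIdx.x];
        int last  = vtx_off[blockIdx.x + 1];

        for (int v = first + threadIdx.x; v < last; v += blockDim.x) {
            double r = 0.0;                                    /* kept in a register */

            for (int k = out_off[v]; k < out_off[v + 1]; ++k) {   /* edges from v */
                int e  = out_edge[k];
                int ve = edge_end[e];
                r += edge_f(value[e], data[v], data[ve]);
            }
            for (int k = in_off[v]; k < in_off[v + 1]; ++k) {     /* edges to v */
                int e  = in_edge[k];
                int vs = edge_start[e];
                r -= edge_f(value[e], data[vs], data[v]);
            }
            result[v] = r;
        }
    }

Launched with one CUDA block per mesh block, each thread writes only its own result[v], so no atomic updates are needed; that is the point of inverting the connectivity.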


SLIDE 16


Kernel performance

Performance comparison for Fortran loops and CUDA kernels

[Figure: speedup of the Conjugate Gradient internal loops on GPU (16 MPI ranks vs. 2 GPUs). The kernels op_product, compute_p, compute_gamma, update_scal_res, exact_residual, residual_norm and compute_final_rho show per-kernel speedups ranging from ×16.3 to ×38.9 (reported values: ×28.1, ×16.3, ×18.4, ×19.3, ×24.5, ×28.3, ×38.9).]


SLIDE 17


Benchmarks



SLIDE 18


Overall algorithm

Overall process

The main Conjugate Gradient loop contains:

  • The computing functions from the previous figure (CUDA kernels)
  • Synchronization and data-management host-side functions (Fortran + MPI); a sketch of this loop structure is given below
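A schematic of that loop structure in CUDA/C, with hypothetical kernel names, a stubbed halo exchange and placeholder scalars, only to show how GPU kernels alternate with host-side transfers and MPI calls (this is not the YALES2 implementation):

    #include <mpi.h>
    #include <cuda_runtime.h>

    /* Hypothetical stand-ins for the real kernels and the halo exchange. */
    __global__ void op_product(const double *p, double *Ap, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) Ap[i] = 2.0 * p[i];                 /* placeholder operator */
    }

    static void exchange_halo_on_host(double *p, int n, MPI_Comm comm)
    {
        (void)p; (void)n; (void)comm;                  /* Fortran + MPI in the real code */
    }

    /* One PCG iteration: host-side MPI synchronization around GPU kernels,
       with explicit host<->device transfers in between (the main cost).    */
    void pcg_iteration(double *d_p, double *d_Ap, double *h_p, int n, MPI_Comm comm)
    {
        cudaMemcpy(h_p, d_p, n * sizeof(double), cudaMemcpyDeviceToHost);
        exchange_halo_on_host(h_p, n, comm);           /* parallel halo update  */
        cudaMemcpy(d_p, h_p, n * sizeof(double), cudaMemcpyHostToDevice);

        op_product<<<(n + 127) / 128, 128>>>(d_p, d_Ap, n);   /* GPU hot-spot   */

        double local_dot = 0.0, global_dot = 0.0;      /* reduced on the GPU in practice */
        MPI_Allreduce(&local_dot, &global_dot, 1, MPI_DOUBLE, MPI_SUM, comm);
        (void)global_dot;                              /* scalar used by the CG update   */
    }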

GPU management

  • Host-device data transfers between the computation kernels and the host-side data-management functions
  • Partial overlap of the transfers with computations

Code versions

  • CPU: initial algorithm, Fortran code
  • GPU: algorithm with inverted connectivity (better suited to the GPU), only the hot-spot internal function on the GPU
  • GPU-PCG: GPU version with all Conjugate Gradient computations on the GPU


SLIDE 19


Various cluster architectures

Cluster configurations for single-node comparison

Comparison of local GPU acceleration versus the CPU code

  • Basic runtime: N MPI processes (reference) vs. N MPI processes using N GPUs (1-to-1 association)
  • MPS runtime: 16 MPI processes (reference) vs. 16 MPI processes sharing N GPUs

Machines

  • ROMEO: 2 × Intel E5-2650 v2 (8 cores), 2 × K20X (PCIe)
  • Myria: 2 × Intel E5-2680 v4 (14 cores), 2 × P100 (PCIe)
  • Ouessant: 2 × IBM POWER8 (S822LC node, 10 cores each), 4 × P100 (NVLink)


SLIDE 20


Benchmark results

Application speedup on the different architectures

[Figure: PCG speedup for the gpu, gpu-pcg, MPS+gpu and MPS+gpu-pcg configurations on Mesh-1 (3·10⁶ elements), Mesh-2 (4·10⁶ elements) and Mesh-3 (30·10⁶ elements), for Intel+K20, Intel+P100 and IBM+P100 nodes. Reported speedups range from ×2.8 to ×4.5 depending on mesh size, code version and architecture.]


SLIDE 21


Limitations and future work

Discussion

Overall, a successful study: a performance improvement for the whole application with the GPU-accelerated code

  • Recent hardware helps performance (the more recent, the higher the speedup)
  • MPS is of limited interest so far (waiting for the Volta version: internal support and a higher client count)
  • Data transfers still strongly limit performance:
      Intrusive overlap of transfers with non-GPU computations
      Porting more functions to the GPU (synchronization and data management for MPI)

Future work and code development perspectives

  • GPU porting of the data-management functions (using GPU-aware MPI)
  • Introduce a 3rd level of parallelism with OpenMP (to accelerate the CPU part of data management)


SLIDE 22


Thank you for your attention
