First Steps of YALES2 Code Towards GPU Acceleration on Standard and Prototype Cluster
YALES2: semi-industrial code for turbulent combustion and flows
Jean-Matthieu Etancelin, ROMEO, NVIDIA GPU Application Lab, University of Reims
GTC Europe
Outline
- 1. Introduction: Context
  - ROMEO HPC Center – GPU Application Lab
  - YALES2
- 2. Existing code profiling
- 3. Code porting
  - Porting strategies
  - Internal kernel performance
- 4. Benchmarks
- 5. Conclusions
  - Limitations and future work
ROMEO HPC Center
University of Reims
- About 25,000 students
- Multidisciplinary university (undergraduate, graduate, PhD, research labs)
ROMEO HPC Center
- HPC resources for both academic and industrial research
- Expertise and teaching in HPC and GPU technologies
- Integrated in the European HPC ecosystem (French Tier 1.5, equip@meso, ETP4HPC)
- Fully hybrid cluster (2 × Intel Ivy Bridge + 2 × K20 + IB QDR)
GPU Application Lab
Objectives: intensive exploitation of ROMEO
- Expertise in hybrid HPC, in particular in GPU technologies
- GPU code porting
- Optimization and scaling up towards a large number of GPUs
- Training and teaching for ROMEO users
Activities
- Optimization of GPU, hybrid, and parallel codes
- Algorithm improvements for the targeted architectures
- Adaptation of numerical methods to hybrid and parallel architectures
Various collaborations
- Local URCA laboratories, plus some external collaborations (ONERA, Univ. of Normandy)
- Several application domains (fluid mechanics, chemistry, computer science, applied maths, ...)
YALES2
Massively parallel solver for multi-physics problems in fluid dynamics, from primary atomization to pollutant dispersion in complex geometries
- Code developed at CORIA (University of Normandy) since 2007
- V. Moureau, G. Lartigue, P. Bénard (project leaders)
- ~10 developers (engineers, researchers, PhD students, ...) + contributors
Code
- Simulations of two-phase (diphasic) and reactive flows at low Mach number in complex geometries
- LES and DNS solvers on unstructured meshes
- 3D flow simulations on massively parallel architectures
- Used by more than 160 academic and industrial researchers
- 60+ scientific publications
YALES2, a complete library
Main features
- 350,000 lines of Fortran 90/2003 code
- Portable
- Python interface
- Main solvers:
  - Scalar solver (SCS)
  - Level set solver (LSS)
  - Lagrangian solver (LGS)
  - Incompressible solver (ICS)
  - Variable density solver (VDS)
  - Spray solver (SPS)
  - Magneto-hydrodynamic solver (MHD)
  - Heat transfer solver (HTS)
  - Chemical reactor solver (CRS)
  - Darcy solver (DCY)
  - Mesh movement solver (MMS)
  - ALE solver (ALE)
  - Linear acoustics solver (ACS)
- 5+ solvers in progress
HPC with YALES2 in combustion
Multi-scale and multi-physics applications
- More than 85% of the energy used comes from combustion
- Related to many fields (transportation, industry, energy, ...)
Examples in aeronautics:
HPC with YALES2
HPC
- Runs on up to 10,000 cores on French national clusters (IDRIS, CINES, ...), regional clusters (CRIANN), and local machines
- Uses advanced parallel programming techniques (hybrid computing, automatic mesh adaptation, ...)
- Collaborations with the Exascale Lab (Intel/CEA/GENCI/UVSQ)
- Code used as a benchmark on prototypes (IDRIS, Ouessant: Power8 + P100), GENCI Cellule de Veille Technologique
- Collaboration on GPU porting with the GPU Application Lab, ROMEO
Existing code profiling
Profiling the existing code
Specific tools (MAQAO + TAU + PAPI)
- In-depth profiling:
  - Computational time (per function, per internal and external loop)
  - Number of floating-point operations
  - Number of cache misses
  - ...
- Hot spot: the matrix-vector product in the Preconditioned Conjugate Gradient (PCG)
[Figures: per-function profile and external-loop profile]
Profiling existing code
Identifying the hot spot
- Preconditioned conjugate gradient: 250 lines of code for 55% of total time
- Matrix-vector product: 30 lines of code for 30% of total time
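To situate the hot spot, here is a minimal C sketch of a Jacobi-preconditioned conjugate gradient (not the YALES2 implementation; the CSR storage and all names are illustrative assumptions): the matrix-vector product spmv is the routine that dominates the runtime.

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

/* Sparse matrix in CSR format (illustrative; YALES2 works edge-wise on the
 * unstructured mesh rather than with an assembled matrix). */
typedef struct { int n; const int *row, *col; const double *val; } csr_t;

/* Matrix-vector product: the profiling hot spot. */
static void spmv(const csr_t *A, const double *x, double *y)
{
  for (int i = 0; i < A->n; ++i) {
    double s = 0.0;
    for (int k = A->row[i]; k < A->row[i + 1]; ++k)
      s += A->val[k] * x[A->col[k]];
    y[i] = s;
  }
}

static double dot(int n, const double *a, const double *b)
{
  double s = 0.0;
  for (int i = 0; i < n; ++i) s += a[i] * b[i];
  return s;
}

/* Jacobi-preconditioned conjugate gradient solving A x = b. */
static void pcg(const csr_t *A, const double *diag_inv, const double *b,
                double *x, int max_it, double tol)
{
  int n = A->n;
  double *r = (double *)calloc(n, sizeof *r), *z = (double *)calloc(n, sizeof *z);
  double *p = (double *)calloc(n, sizeof *p), *q = (double *)calloc(n, sizeof *q);

  spmv(A, x, q);                                    /* r = b - A x */
  for (int i = 0; i < n; ++i) r[i] = b[i] - q[i];
  for (int i = 0; i < n; ++i) p[i] = z[i] = diag_inv[i] * r[i];
  double rho = dot(n, r, z);

  for (int it = 0; it < max_it && sqrt(dot(n, r, r)) > tol; ++it) {
    spmv(A, p, q);                                  /* dominant cost of each iteration */
    double alpha = rho / dot(n, p, q);
    for (int i = 0; i < n; ++i) { x[i] += alpha * p[i]; r[i] -= alpha * q[i]; }
    for (int i = 0; i < n; ++i) z[i] = diag_inv[i] * r[i];
    double rho_new = dot(n, r, z);
    for (int i = 0; i < n; ++i) p[i] = z[i] + (rho_new / rho) * p[i];
    rho = rho_new;
  }
  free(r); free(z); free(p); free(q);
}

int main(void)
{
  /* Tiny symmetric positive definite 3x3 example just to exercise the solver. */
  int    row[] = {0, 2, 5, 7};
  int    col[] = {0, 1, 0, 1, 2, 1, 2};
  double val[] = {4, -1, -1, 4, -1, -1, 4};
  csr_t  A = {3, row, col, val};
  double diag_inv[] = {0.25, 0.25, 0.25}, b[] = {1, 2, 3}, x[] = {0, 0, 0};

  pcg(&A, diag_inv, b, x, 100, 1e-10);
  printf("x = %g %g %g\n", x[0], x[1], x[2]);
  return 0;
}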
Code porting
How to port the hot spot to GPU?
Main code feature: data-centered structure
- Hierarchical, well-defined data structures based on a block decomposition of the mesh
- Every computing loop follows the same skeleton (two levels of nested loops: over mesh blocks, then over vertices, edges, or elements), as sketched below
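A minimal C sketch of that loop skeleton, with hypothetical type and field names (YALES2's actual data structures are richer): an outer loop over the mesh blocks, an inner loop over the entities owned by each block.

/* Hypothetical, simplified view of the block-decomposed mesh:
 * each block owns its own arrays of per-vertex data. */
typedef struct {
  int     n_vertices;   /* number of vertices in this mesh block */
  double *data;         /* per-vertex input values               */
  double *result;       /* per-vertex output values              */
} mesh_block_t;

typedef struct {
  int           n_blocks;
  mesh_block_t *blocks;
} mesh_t;

/* Every computing loop follows this two-level skeleton:
 * over mesh blocks, then over the vertices (or edges, or elements) of each block. */
void scale_vertices(mesh_t *mesh, double factor)
{
  for (int b = 0; b < mesh->n_blocks; ++b) {       /* level 1: mesh blocks */
    mesh_block_t *blk = &mesh->blocks[b];
    for (int v = 0; v < blk->n_vertices; ++v)      /* level 2: vertices    */
      blk->result[v] = factor * blk->data[v];
  }
}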
Code porting
Three major possibilities:
- OpenACC with PGI compilers
  - Non-intrusive for the code (macros)
  - Complementary with the in-progress OpenMP version
  - Strong potential with unified memory
  - No deep copy for complex data structures
  - No support for Fortran pointers
- CUDA/C with Intel compilers
  - Fine-grained management of the GPU (code + data)
  - Goes through intermediate C interfaces
  - No deep copy for complex data structures
  - Code rewriting (only for the computational loops)
- CUDA Fortran with PGI (not tested)
  - Similar to CUDA/C, without the C interface
Code porting with CUDA
Key points
Data management
- Exploiting Fortran/C interoperability for the data structures
- Fortran derived types translated to C typedefs (automatic, YALES2-specific translation tool); a minimal sketch of the mapping is shown below
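A minimal illustration of the kind of mapping such a translation produces, with hypothetical type and field names (not the actual YALES2 types): a bind(C) Fortran derived type with c_ptr members corresponds to a C struct typedef with the same layout.

/* Corresponding Fortran derived type (hypothetical example):
 *
 *   type, bind(C) :: grid_data_t
 *     integer(c_int) :: n_vertices
 *     integer(c_int) :: n_edges
 *     type(c_ptr)    :: coords   ! points to a double precision array
 *     type(c_ptr)    :: values   ! points to a double precision array
 *   end type grid_data_t
 *
 * bind(C) types may not contain Fortran pointer components, so pointers
 * are passed as type(c_ptr); on the C side they arrive as plain pointers. */
typedef struct {
  int     n_vertices;
  int     n_edges;
  double *coords;   /* received as type(c_ptr) from Fortran */
  double *values;
} grid_data_t;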
GPU memory management
- Allocation and management of GPU-specific data and utility arrays
- CPU-GPU transfers optimized with a staging buffer in pinned memory (sketched below)
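A minimal CUDA sketch of the staging-buffer idea (sizes and names are illustrative): host data is packed into one page-locked (pinned) buffer so that the host-to-device copy can run asynchronously and at full bandwidth.

#include <cuda_runtime.h>
#include <stdio.h>

#define N 1024

int main(void)
{
  double *h_buf, *d_buf;
  cudaStream_t stream;

  cudaStreamCreate(&stream);

  /* Pinned (page-locked) host buffer: needed for truly asynchronous
   * cudaMemcpyAsync and gives higher bandwidth than pageable memory. */
  cudaMallocHost((void **)&h_buf, N * sizeof(double));
  cudaMalloc((void **)&d_buf, N * sizeof(double));

  /* Pack the scattered host-side data into the contiguous staging buffer
   * (dummy values here). */
  for (int i = 0; i < N; ++i) h_buf[i] = (double)i;

  /* Asynchronous host-to-device transfer through the pinned buffer. */
  cudaMemcpyAsync(d_buf, h_buf, N * sizeof(double),
                  cudaMemcpyHostToDevice, stream);
  cudaStreamSynchronize(stream);
  printf("transfer done\n");

  cudaFree(d_buf);
  cudaFreeHost(h_buf);
  cudaStreamDestroy(stream);
  return 0;
}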
Execution model
- Mapping the mesh decomposition and the hierarchical data structure onto CUDA blocks/threads (sketched below)
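A minimal CUDA sketch of that mapping (hypothetical kernel and array names): one CUDA block per mesh block of the decomposition, one thread per vertex of that mesh block.

#include <cuda_runtime.h>

/* One thread handles one vertex; blockIdx.x indexes the mesh blocks.
 * block_offset[b]..block_offset[b+1] delimit the vertices of mesh block b. */
__global__ void per_vertex_kernel(double *data, const int *block_offset)
{
  int first = block_offset[blockIdx.x];
  int last  = block_offset[blockIdx.x + 1];
  int v     = first + threadIdx.x;
  if (v < last) data[v] *= 2.0;              /* dummy per-vertex work */
}

/* Launch: one CUDA block per mesh block, enough threads to cover the
 * largest mesh block (illustrative; assumes it fits the 1024-thread limit). */
void launch(double *d_data, const int *d_block_offset,
            int n_mesh_blocks, int max_vertices_per_block)
{
  per_vertex_kernel<<<n_mesh_blocks, max_vertices_per_block>>>(d_data, d_block_offset);
}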
Algorithm adaptation: inverse connectivity for mesh exploration
- Loop first over vertices instead of edges (the finite-volume method works on edges by construction)
CUDA code porting
Inverse connectivity for mesh exploration
Matrix-vector product computation (op product)
- Initial algorithm (not well suited to GPU):

    Foreach bloc b of mesh            // blocks
      Foreach edge e of b             // threads
        vs, ve = vertex(e)
        result(vs) += f(value(e), data(vs), data(ve))
        result(ve) -= f(value(e), data(vs), data(ve))

- Algorithm with inverse connectivity:

    Foreach bloc b of mesh            // blocks
      Foreach vertex v of b           // threads
        r = 0                         // register
        Foreach edge e starting from vertex v
          ve = end(e)
          r += f(value(e), data(v), data(ve))
        Foreach edge e ending at vertex v
          vs = start(e)
          r -= f(value(e), data(vs), data(v))
        result(v) = r
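A CUDA sketch of the gather algorithm above (the flat data layout, the names, and the concrete choice of f as a simple difference flux are illustrative assumptions; a flat one-thread-per-vertex launch is used for brevity): each thread owns one vertex, accumulates the contributions of its incident edges through the inverse-connectivity tables, and performs a single write, so no atomics are needed.

#include <cuda_runtime.h>

/* Illustrative flat layout for one mesh block:
 *   edge_start[e], edge_end[e] : the two vertices of edge e
 *   out_off[v]..out_off[v+1]   : edges starting at vertex v (inverse connectivity)
 *   in_off[v]..in_off[v+1]     : edges ending at vertex v
 *   out_edge[], in_edge[]      : the edge indices themselves */
__global__ void op_product_gather(const int *out_off, const int *out_edge,
                                  const int *in_off,  const int *in_edge,
                                  const int *edge_start, const int *edge_end,
                                  const double *value, const double *data,
                                  double *result, int n_vertices)
{
  int v = blockIdx.x * blockDim.x + threadIdx.x;       /* one thread per vertex */
  if (v >= n_vertices) return;

  double r = 0.0;                                      /* accumulate in a register */

  for (int k = out_off[v]; k < out_off[v + 1]; ++k) {  /* edges starting at v */
    int e  = out_edge[k];
    int ve = edge_end[e];
    r += value[e] * (data[ve] - data[v]);              /* illustrative choice of f */
  }
  for (int k = in_off[v]; k < in_off[v + 1]; ++k) {    /* edges ending at v */
    int e  = in_edge[k];
    int vs = edge_start[e];
    r -= value[e] * (data[v] - data[vs]);              /* same f, opposite sign */
  }
  result[v] = r;                                       /* one write per thread: no atomics */
}

Compared with the edge-based scatter, the gather reads each edge value twice but removes all write conflicts, which is what makes it GPU-friendly.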
Kernel performance
Performance comparison for Fortran loops and CUDA kernels
[Figure: speedups of the Conjugate Gradient internal loops on GPU (16 MPI vs. 2 GPU) for the kernels op product, compute p, compute gamma, update scal res, exact residual, residual norm, and compute final rho; measured speedups range from x16.3 to x38.9]
Benchmarks
Overall algorithm
Overall process
Main loop of the conjugate gradient containing:
- The computing functions from the previous figure (CUDA kernels)
- Synchronization and data-management host-side functions (Fortran + MPI)
GPU management
- Host-device data transfers between computation kernels and host-side data-management functions
- Partial overlap of transfers with computations (a minimal sketch is shown below)
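A minimal CUDA sketch of such an overlap (dummy kernel, illustrative sizes): work is split into chunks and each chunk's transfers and kernel launch are issued on their own stream, so the copy of one chunk can run while another is being computed; pinned host memory is required for the copies to be truly asynchronous.

#include <cuda_runtime.h>

#define N       (1 << 20)
#define CHUNKS  4

__global__ void work(double *x, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) x[i] = 2.0 * x[i];            /* dummy per-element computation */
}

int main(void)
{
  double *h, *d;
  cudaStream_t s[CHUNKS];
  int chunk = N / CHUNKS;

  cudaMallocHost((void **)&h, N * sizeof(double));   /* pinned host memory */
  cudaMalloc((void **)&d, N * sizeof(double));
  for (int c = 0; c < CHUNKS; ++c) cudaStreamCreate(&s[c]);
  for (int i = 0; i < N; ++i) h[i] = 1.0;

  /* Each chunk is copied in, processed, and copied back on its own stream,
   * so transfers of one chunk overlap with computation on another. */
  for (int c = 0; c < CHUNKS; ++c) {
    size_t off = (size_t)c * chunk;
    cudaMemcpyAsync(d + off, h + off, chunk * sizeof(double),
                    cudaMemcpyHostToDevice, s[c]);
    work<<<(chunk + 255) / 256, 256, 0, s[c]>>>(d + off, chunk);
    cudaMemcpyAsync(h + off, d + off, chunk * sizeof(double),
                    cudaMemcpyDeviceToHost, s[c]);
  }
  cudaDeviceSynchronize();

  for (int c = 0; c < CHUNKS; ++c) cudaStreamDestroy(s[c]);
  cudaFree(d);
  cudaFreeHost(h);
  return 0;
}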
Code versions
- CPU: the initial algorithm and Fortran code
- GPU: the algorithm with inverted connectivity (better suited to GPU), with only the hot-spot internal function on GPU
- GPU-PCG: the GPU version with all Conjugate Gradient computations on GPU
Various cluster architectures
Cluster configurations for single-node comparison
Comparison of local GPU acceleration versus the CPU code:
- Basic runtime: N MPI processes (reference) vs. N MPI processes using N GPUs (1-to-1 association)
- MPS runtime: 16 MPI processes (reference) vs. 16 MPI processes with N GPUs (shared through MPS)
Machines
- ROMEO: 2 × Intel E5-2650 v2 (8 cores), 2 × K20X (PCIe)
- Myria: 2 × Intel E5-2680 v4 (14 cores), 2 × P100 (PCIe)
- Ouessant: 2 × IBM Power8 (S822LC, 10 cores), 4 × P100 (NVLink)
Benchmark results
Application speedup on the different architectures
[Figure: PCG speedup for different mesh sizes (Mesh-1: 3×10^6 elements, Mesh-2: 4×10^6 elements, Mesh-3: 30×10^6 elements), code versions (gpu, gpu-pcg, MPS+gpu, MPS+gpu-pcg) and platforms (Intel-K20, Intel-P100, IBM-P100); highlighted speedups on the chart: x3.8, x3.6, x2.8, x4.4, x4.5, x4.0]
Limitations and future work
Discussion
Overall successful study: performance improvement for the entire application with the GPU-accelerated code
- Recent technologies help performance (the more recent the hardware, the higher the speedup)
- MPS is of limited interest (waiting for the Volta version: internal support and number of clients)
- Data transfers still strongly limit performance; possible remedies:
  - Intrusive overlapping of transfers with non-GPU computations
  - Porting more functions to the GPU (synchronization and data management for MPI)
Future work and code development perspectives
- GPU porting of the data-management functions (using GPU-aware MPI)
- Introduce a third level of parallelism with OpenMP (to accelerate the CPU part of data management)
Thank you for your attention