Operated by Los Alamos National Security, LLC for the U.S. Department of Energy's NNSA
GPU Acceleration of Large Scale Fluid Dynamics Scientific Codes
Los Alamos National Laboratory
Jenniffer Estrada (LANL, Research Scientist) jme@lanl.gov
Joseph Schoonover (CIRES, Research Scientist) jschoonover@lanl.gov
GTC 2017 | May 10, 2017
LA-UR-17-23350
Los Alamos National Laboratory 5/10/17 | 3
Galerkin (Polynomial-Based) Nodal Spectral Element Method
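As a generic reminder of what this discretization looks like (a textbook sketch, not the exact formulation used in the code): a nodal spectral element method expands the solution on each element in Lagrange interpolating polynomials through the quadrature nodes.

```latex
% Nodal expansion on element e (polynomial degree N):
u^{e}(\xi, t) \approx \sum_{i=0}^{N} u_i^{e}(t)\,\ell_i(\xi),
\qquad
\ell_i(\xi) = \prod_{\substack{j=0 \\ j \neq i}}^{N}
              \frac{\xi - \xi_j}{\xi_i - \xi_j}.
% A Galerkin projection of u_t + f_x = 0 against each \ell_i yields a
% small, dense ODE system per element, coupled to its neighbors only
% through fluxes at element boundaries -- the source of the method's
% data locality and its suitability for GPUs.
```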
Modelling scale interactions: large-scale flow (~100 km) catalyzes the formation of small features (~1 km) due to interactions with topography (vorticity on the Gulf Stream shelf)
MappedTimeDerivative
GPU: Tesla K20X
Within 1.5x of ideal
[Figure: Speedup (RK3) vs. number of threads (2-16) for MPR, SPR, and SPR-LA, against ideal]
[Figure: Scaling efficiency (RK3) vs. number of threads for MPR, SPR, and SPR-LA, against ideal]
[Figure: Runtime (RK3, sec) vs. number of threads for MPR, SPR, and SPR-LA, against ideal]

** Yuliana Zamora and Robert Robey, Effective OpenMP Implementations, https://www.lanl.gov/projects/national-security-education-center/information-science-technology/summer-schools/parallelcomputing/_assets/images/2016projects/Zamora.pdf; https://anl.app.box.com/v/IXPUG2016-presentation-23
MPR: Standard OpenMP
SPR: High-level OpenMP without loop bounds
SPR-LA: High-level OpenMP with loop assignments
Higher is better!
Neutral stratification
20x20x20 Elements, Polynomial Degree 7
UNM Xena (CPU: Intel Xeon, GPU: Tesla K40m)
Wall Time (CPU)
Wall Time (GPU)
Benchmarks for ForwardStepRK3 (Euler 3-D)
Tests run with CUDA Fortran; polynomial degree = 7; Laplacian diffusion; 15x15x15 elements; 5 time steps (footprint: ~1.9 GB memory)

GPU Model              CPU Model             Serial Time     GPU Runtime    Speedup
Tesla K40m             Intel Xeon E5-2683    45.969 (sec)    1.282 (sec)    35.854x
GeForce GTX TitanX     Intel Xeon E3-1285L   35.672 (sec)    1.159 (sec)    30.775x
Tesla P100-SXM2-16GB   Power8NVL             49.588 (sec)    0.439 (sec)    112.913x
Weak scaling
Improve the implementation's memory access patterns on the GPU
The same set of equations (compressible Navier-Stokes) using a finite volume discretization (FIRETEC and wildland fire modelling)
OpenCL, OpenMP, OpenACC, Open MPI, CUDA Fortran
Compute-intensive kernels currently run on the GPU
and 10x faster
If we need to perform 10,000,000 time steps (not unrealistic for scale-interaction problems), then the projected runtimes would be:

Tserial = cserial*(100,000)*(10,000,000) ≈ 4 years
Tideal  = cideal*(100,000)*(10,000,000)  ≈ 3 months
Tgpu    = cgpu*(100,000)*(10,000,000)    ≈ 1.8 months
Significant potential savings for larger problems!