SLIDE 1

Operated by Los Alamos National Security, LLC for the U.S. Department of Energy's NNSA

SLIDE 2

Los Alamos National Laboratory

GPU Acceleration of Large Scale Fluid Dynamics Scientific Codes

Jenniffer Estrada, LANL, Research Scientist, jme@lanl.gov
Joseph Schoonover, CIRES, Research Scientist, jschoonover@lanl.gov

GTC 2017 | May 10, 2017

LA-UR-17-23350

SLIDE 3

Motivation

  • Scale Interactions
  • Resolution vs Resources
  • Why are they bragging?
SLIDE 4

The Scientific Codes - SELF

  • Continuous and Discontinuous Galerkin (polynomial-based) Nodal Spectral Element Method (see the illustrative kernel after this list)
  • Oceanographic and Geophysical Modelling
  • Target problem: model a large-scale flow (~100 km) that catalyzes the formation of small features (~1 km) due to interactions with topography (vorticity on the Gulf Stream shelf)
  • 10 million degrees of freedom
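At the heart of the nodal spectral element method, derivatives within each element reduce to a small dense matrix-vector product with the derivative matrix evaluated at the quadrature nodes. A minimal sketch of that core operation in Fortran; the interface and names below are assumed for illustration and are not SELF source:

subroutine element_derivative(u, dudx, D, N, nel)
  implicit none
  integer, intent(in)  :: N, nel           ! polynomial degree, number of elements
  real(8), intent(in)  :: u(0:N, nel)      ! nodal values within each element
  real(8), intent(in)  :: D(0:N, 0:N)      ! spectral derivative matrix at the nodes
  real(8), intent(out) :: dudx(0:N, nel)   ! derivative at the same nodes
  integer :: iel, i, j
  do iel = 1, nel                          ! loop over elements
     do i = 0, N                           ! dense matrix-vector product per element
        dudx(i, iel) = 0.0d0
        do j = 0, N
           dudx(i, iel) = dudx(i, iel) + D(i, j)*u(j, iel)
        end do
     end do
  end do
end subroutine element_derivative

In 3-D, this 1-D building block is applied along each coordinate direction, with surface flux contributions added at the element faces.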
SLIDE 5

SELF-DGSEM Algorithm

SLIDE 6

Progression

  • Hot Spot Identification: MappedTimeDerivative
  • Software Changes
  • Message Passing
SLIDE 7

Progression (Continued)

  • CPU: AMD Opteron (16 core), GPU: Tesla K20X
  • Serial (Single Core):
    • Original: 110.8 sec
    • After changes: 127.2 sec
  • OpenACC: 5.3 sec (a directive sketch follows this list)
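A minimal sketch of the OpenACC treatment of a hot-spot loop in this spirit; the routine and array names below are placeholders, not SELF's actual MappedTimeDerivative:

subroutine apply_tendency(u, tend, ndof, nel)
  implicit none
  integer, intent(in)    :: ndof, nel
  real(8), intent(in)    :: u(ndof, nel)
  real(8), intent(inout) :: tend(ndof, nel)
  integer :: iel, i
  ! Offload the element loop; the arrays are assumed to already live on the
  ! device via the data region shown in the caller below.
  !$acc parallel loop collapse(2) present(u, tend)
  do iel = 1, nel
     do i = 1, ndof
        tend(i, iel) = -u(i, iel)          ! stand-in for the real flux-divergence work
     end do
  end do
end subroutine apply_tendency

! Caller keeps the arrays resident on the GPU across the whole time loop:
!   !$acc data copyin(u) copy(tend)
!   do step = 1, nsteps
!      call apply_tendency(u, tend, ndof, nel)
!   end do
!   !$acc end data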
SLIDE 8

1.5x to Ideal

SLIDE 9

[Figure: OpenMP scaling of the RK3 time step for 2-16 threads - speedup, scaling efficiency, and runtime (sec) - comparing MPR, SPR, and SPR-LA against the ideal curve.]

** Yuliana Zamora and Robert Robey, Effective OpenMP Implementations, https://www.lanl.gov/projects/national-security-education-center/information-science-technology/summer-schools/parallelcomputing/_assets/images/2016projects/Zamora.pdf; https://anl.app.box.com/v/IXPUG2016-presentation-23

  • Multiple parallel regions (MPR) - standard OpenMP
  • Single parallel region (SPR) - high-level OpenMP without loop bounds
  • Single parallel region with loop-bound assignments (SPR-LA) - high-level OpenMP (the three strategies are contrasted in the sketch after this list)
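The three strategies can be contrasted on a pair of back-to-back array updates. A minimal sketch, not taken from the codes above; the array names and sizes are arbitrary:

program openmp_region_styles
  use omp_lib
  implicit none
  integer, parameter :: n = 1000000
  real(8) :: a(n), b(n)
  integer :: i, tid, nthreads, chunk, lo, hi

  a = 1.0d0; b = 2.0d0

  ! MPR: one parallel region per loop (standard OpenMP); threads are forked,
  ! joined, and synchronized at every loop.
  !$omp parallel do
  do i = 1, n
     a(i) = a(i) + b(i)
  end do
  !$omp end parallel do
  !$omp parallel do
  do i = 1, n
     b(i) = 0.5d0*a(i)
  end do
  !$omp end parallel do

  ! SPR: a single parallel region enclosing both loops; the runtime still
  ! computes loop bounds and inserts a barrier at each end do.
  !$omp parallel private(i)
  !$omp do
  do i = 1, n
     a(i) = a(i) + b(i)
  end do
  !$omp end do
  !$omp do
  do i = 1, n
     b(i) = 0.5d0*a(i)
  end do
  !$omp end do
  !$omp end parallel

  ! SPR-LA: a single parallel region with loop bounds assigned per thread once;
  ! no implicit barrier between the loops (safe here because each thread only
  ! reads entries it wrote).
  !$omp parallel private(i, tid, nthreads, chunk, lo, hi)
  tid      = omp_get_thread_num()
  nthreads = omp_get_num_threads()
  chunk    = (n + nthreads - 1)/nthreads
  lo       = tid*chunk + 1
  hi       = min(n, (tid + 1)*chunk)
  do i = lo, hi
     a(i) = a(i) + b(i)
  end do
  do i = lo, hi
     b(i) = 0.5d0*a(i)
  end do
  !$omp end parallel

  print *, a(1), b(1)
end program openmp_region_styles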

SLIDE 10

Higher is better!

SLIDE 11

Thermal Bubble

  • Initial conditions consist of an anomalously warm blob in an otherwise neutral stratification (an illustrative initialization follows this list)
  • Domain Size: 10,000 m (cube)
  • Discretization: Discontinuous Galerkin Spectral Element Method, 20x20x20 elements, polynomial degree 7
  • Laplacian Diffusion: 0.8 m^2/s
  • Simulation Time: 37 minutes
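The slide does not give the exact functional form of the anomaly; a minimal sketch, assuming a Gaussian warm blob centered in the 10,000 m cube (the amplitude, width, and center below are illustrative assumptions):

subroutine warm_bubble(theta, x, y, z, n)
  implicit none
  integer, intent(in)  :: n
  real(8), intent(in)  :: x(n), y(n), z(n)      ! node coordinates [m]
  real(8), intent(out) :: theta(n)              ! potential temperature anomaly [K]
  real(8), parameter   :: amp   = 1.0d0         ! assumed anomaly amplitude [K]
  real(8), parameter   :: width = 1000.0d0      ! assumed bubble half-width [m]
  real(8), parameter   :: xc = 5000.0d0, yc = 5000.0d0, zc = 2500.0d0  ! assumed center [m]
  real(8) :: r2
  integer :: i
  do i = 1, n
     r2 = (x(i) - xc)**2 + (y(i) - yc)**2 + (z(i) - zc)**2
     theta(i) = amp*exp(-r2/width**2)           ! warm anomaly decays smoothly to zero
  end do
end subroutine warm_bubble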
SLIDE 12

SLIDE 13

Thermal Bubble

UNM Xena - CPU: Intel Xeon, GPU: Tesla K40m

  • 1.2 million time steps
  • Wall time (CPU): 37 days
  • Wall time (GPU): 24 hours, 13 min
SLIDE 14

Across Architectures

Benchmarks for ForwardStepRK3 (Euler 3-D). Tests run with CUDA Fortran; polynomial degree = 7; Laplacian diffusion; 15x15x15 elements; 5 time steps (footprint: ~1.9 GB memory).

  GPU Model              CPU Model             Serial Time    GPU Runtime    Speedup
  Tesla K40m             Intel Xeon E5-2683    45.969 sec     1.282 sec      35.854x
  GeForce GTX TitanX     Intel Xeon E3-1285L   35.672 sec     1.159 sec      30.775x
  Tesla P100-SXM2-16GB   Power8NVL             49.588 sec     0.439 sec      112.913x

SLIDE 15

Going Forward

  • Initial development of a hybrid GPU-MPI code is underway (to improve weak scaling)
  • Use GPU-Direct technology to overcome the CPU-GPU copy (see the sketch after this list)
  • Continue to update the data structure layout and CUDA kernel implementation to improve memory access patterns on the GPU
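A minimal sketch of the GPU-Direct idea in CUDA Fortran, assuming a CUDA-aware MPI build; the buffer names and exchange pattern are illustrative, not SELF's actual halo exchange:

program gpudirect_exchange
  use cudafor
  use mpi
  implicit none
  integer, parameter :: n = 4096
  real(8), device, allocatable :: send_d(:), recv_d(:)   ! face data resident on the GPU
  integer :: rank, nranks, left, right, ierr
  integer :: status(MPI_STATUS_SIZE)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nranks, ierr)

  allocate(send_d(n), recv_d(n))
  send_d = real(rank, 8)
  right  = mod(rank + 1, nranks)
  left   = mod(rank - 1 + nranks, nranks)

  ! With GPU-Direct (CUDA-aware MPI) the device buffers are handed straight to
  ! MPI; without it, host copies would have to be staged around this call.
  call MPI_Sendrecv(send_d, n, MPI_DOUBLE_PRECISION, right, 0, &
                    recv_d, n, MPI_DOUBLE_PRECISION, left,  0, &
                    MPI_COMM_WORLD, status, ierr)

  call MPI_Finalize(ierr)
end program gpudirect_exchange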

SLIDE 16

The Scientific Codes - Higrad

  • The fluid dynamics core of Higrad solves the same set of equations (compressible Navier-Stokes) using a Finite Volume discretization (a generic sketch follows this list)
  • Atmospheric Modelling
  • Couples with other modules (FIRETEC and wildland fire modelling)
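A generic finite-volume update in one dimension, to illustrate the discretization style; this is not Higrad code, and the upwind flux for linear advection stands in for the compressible Navier-Stokes fluxes:

subroutine fv_advect(q, nc, a, dt, dx)
  implicit none
  integer, intent(in)    :: nc         ! number of cells
  real(8), intent(inout) :: q(0:nc+1)  ! cell averages, one ghost cell per side
  real(8), intent(in)    :: a, dt, dx  ! advection speed (a > 0), time step, cell width
  real(8) :: flux(1:nc+1)
  integer :: i
  do i = 1, nc + 1
     flux(i) = a*q(i-1)                ! upwind flux at the face between cells i-1 and i
  end do
  do i = 1, nc
     q(i) = q(i) - dt/dx*(flux(i+1) - flux(i))   ! cell average advanced by the flux difference
  end do
end subroutine fv_advect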

SLIDE 17

OpenCL, OpenMP, OpenACC, OpenMPI, CUDA Fortran

SLIDE 18

Progression

  • Bottom Up Approach
SLIDE 19

Progression (Continued)

SLIDE 20

Progression

  • GPU enabled with OpenACC
  • Memory handling with CUDA Fortran (see the sketch after this list)
  • Compute-intensive kernels currently on the GPU
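A minimal CUDA Fortran sketch of the memory-handling pattern implied here (assumed names; not Higrad source): device copies of the state are allocated once, the compute-intensive work stays on the GPU, and data returns to the host only for output.

module state_mod
  use cudafor
  implicit none
  real(8), allocatable         :: rho(:)     ! host copy, used only for I/O
  real(8), device, allocatable :: rho_d(:)   ! device-resident copy used by kernels
contains
  attributes(global) subroutine scale_kernel(a, n, factor)
    integer, value :: n
    real(8)        :: a(n)
    real(8), value :: factor
    integer :: i
    i = (blockIdx%x - 1)*blockDim%x + threadIdx%x
    if (i <= n) a(i) = factor*a(i)             ! stand-in for a compute-intensive kernel
  end subroutine scale_kernel
end module state_mod

program memory_handling
  use cudafor
  use state_mod
  implicit none
  integer, parameter :: n = 1000000
  integer :: step

  allocate(rho(n), rho_d(n))
  rho   = 1.0d0
  rho_d = rho                                  ! one host-to-device copy up front

  do step = 1, 100                             ! the time loop never touches host memory
     call scale_kernel<<<(n + 255)/256, 256>>>(rho_d, n, 0.99d0)
  end do

  rho = rho_d                                  ! device-to-host copy only when output is needed
  print *, 'rho(1) =', rho(1)
end program memory_handling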

SLIDE 21

Going Forward (continued)

  • Higrad
    • Finish memory handling with CUDA Fortran
    • Scaling with GPU-aware MPI
    • CUDA implementation
  • Higrad/Firetec Gatlinburg Fire Simulation on Titan
    • Mission needs a 10x larger problem size (prior limit: 1.6 billion cells) and 10x faster runs

SLIDE 22

Where Are We Going With This?

  • Suppose we are working on a problem with 100,000 elements and need to perform 10,000,000 time steps (not unrealistic for scale-interaction problems). The runtimes would then be approximately:
    • T_serial = c_serial * (100,000) * (10,000,000) ≈ 4 years
    • T_ideal = c_ideal * (100,000) * (10,000,000) ≈ 3 months
    • T_gpu = c_gpu * (100,000) * (10,000,000) ≈ 1.8 months
  • The reduction in wall time for small problems translates to huge potential savings for larger problems! (A back-of-the-envelope check follows this list.)
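A back-of-the-envelope check of these numbers, with the per-element, per-time-step cost constants back-calculated from the slide's own estimates (the constants below are assumptions, not measurements):

program runtime_estimate
  implicit none
  real(8), parameter :: n_elements = 1.0d5, n_steps = 1.0d7
  real(8), parameter :: c_serial = 1.3d-4   ! sec per element per step, implied by ~4 years
  real(8), parameter :: c_ideal  = 8.0d-6   ! implied by ~3 months
  real(8), parameter :: c_gpu    = 4.8d-6   ! implied by ~1.8 months
  real(8), parameter :: day = 86400.0d0
  print *, 'serial:', c_serial*n_elements*n_steps/day/365.0d0, ' years'
  print *, 'ideal :', c_ideal*n_elements*n_steps/day/30.0d0,  ' months'
  print *, 'gpu   :', c_gpu*n_elements*n_steps/day/30.0d0,    ' months'
end program runtime_estimate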

SLIDE 23

Acknowledgements

  • This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.
  • Special thanks to Fernanda Foertter (ORNL), Jeff Larkin (NVIDIA), David Norton (NVIDIA/PGI), Frank Winkler (ORNL), and Matt Otten (LLNL)

SLIDE 24

Questions?