

SLIDE 1

MPAS on GPUs Using OpenACC: Portability, Scalability & Performance

  • Dr. Raghu Raj Kumar

Project Scientist I & Group Head, Special Technical Projects (STP) Group, National Center for Atmospheric Research

March 2018

SLIDE 2

Outline

  • Motivation & Goals
  • System & Software Specs
  • Approach & Challenges
  • Results
  • Future Plans
  • Questions


SLIDE 3

Motivation & Goals

  • Motivation
      • Prior work: a shallow-water equation solver and a discontinuous Galerkin kernel
      • Showed promising results on GPUs
      • A unified code base was possible, at least for small-scale applications
  • Goals
      • Port MPAS onto GPUs
      • Optimize performance on GPUs
      • No compromise on portability: stay within 10% of the original CPU performance
      • Scale MPAS on GPUs


SLIDE 4

System Specs

  • NVIDIA’s internal PSG cluster
      • Dual-socket Haswell (32 cores/node) with 4 P100s/node, 12 nodes, intra-node PCIe, inter-node FDR InfiniBand
      • Dual-socket Haswell (32 cores/node) with 4 V100s/node, 2 nodes, intra-node PCIe, inter-node FDR InfiniBand
  • NCAR’s Cheyenne supercomputer
      • Dual-socket Broadwell (36 cores/node), 4,032 nodes
  • IBM’s R92 cluster (internal)
      • Minsky: dual-socket Power8 (20 cores/node) with 4 P100s/node, 90+ nodes, intra-node NVLink, inter-node InfiniBand


SLIDE 5

Software Spec: MPAS Dry Dynamical Core

  • Software
      • MPAS Release 5.2
      • Intel Compiler 17.0, PGI Compiler 17.10
  • Dry baroclinic instability test: no physics, no scalar transport
      • The dry-dynamics test case produces baroclinic storms from analytic initial conditions
      • Split dynamics: 2 sub-steps, 3 split steps (see the namelist sketch below)
      • Resolutions: 120 km (40k grid points, dt=720 s), 60 km (163k grid points, dt=360 s), 30 km (655k grid points, dt=180 s), 15 km (2.6M grid points, dt=60 s)
      • Number of vertical levels = 56
      • Double precision (DP) and single precision (SP)
      • Simulation executed for 16 days; performance shown for 1 timestep
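
For concreteness, the time step and splitting above map onto a handful of entries in MPAS-Atmosphere's namelist.atmosphere. A minimal sketch for the 120 km case, assuming the MPAS 5.x option names:

    &nhyd_model
        config_dt = 720.0                  ! 120 km case; 360/180/60 s at 60/30/15 km
        config_dynamics_split_steps = 3    ! split dynamics steps per model timestep
        config_number_of_sub_steps = 2     ! acoustic sub-steps within each split step
    /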


SLIDE 6

Why Did We Choose the Dycore?


  • Execution time: Physics 45-50%, DyCore 50-55%
  • Lines of code: Physics 110,000, DyCore 10,000

[Flow diagram: MPAS = dynamic core + physics schemes, with each scheme's share of execution time: Radiation Short Wave RRTMG_SW (18.83%), Radiation Long Wave RRTMG_LW (16.43%), MicroPhysics WSM6 (9.62%), Convection New Tiedtke (4.19%), Boundary Layer YSU (1.55%), Gravity Wave Drag GWDO (0.71%)]

Flow Diagram by KISTI

SLIDE 7

Dycore: Zoom in

Dycore execution-time breakdown: dyn_tend 32%, diagnostics 20%, large_step 16%, acoustic_step 13%, MPI 8%, substep 5%, small_step 4%, imp_coef 1%, integration setup 1%, moist coefficients 0%.

MPAS-5, 120 km case with the Intel compiler, running 36 MPI ranks on an Intel Xeon E5-2697 v4 “Broadwell” node.


SLIDE 8

Approach

[Workflow diagram:]
  • Baseline: software & architecture; configuration & accuracy
  • Porting: KGen; OpenACC directives
  • Optimize: KGen; profile & analyze; integrate
  • Verification: benchmark; portability check; testing; code refactoring

SLIDE 9

Challenges Faced: Using the Right Directives


The trade-off between directive styles:

  • Quick approach: lower time for porting, reasonable performance
  • Tuned approach: much higher time for porting, improved performance depending on the loop count (up to 50%)

The contrast is illustrated in the sketch below.
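
A minimal Fortran sketch of the two styles, assuming the trade-off is between a compiler-scheduled kernels region and a hand-scheduled parallel loop (the loop body and variable names are illustrative, not the actual MPAS code):

    ! Quick port: a kernels region lets the compiler choose the schedule
    !$acc kernels
    do iCell = 1, nCells
       do k = 1, nVertLevels
          theta(k,iCell) = theta(k,iCell) + dt*tend(k,iCell)
       end do
    end do
    !$acc end kernels

    ! Tuned port: explicitly map the loops onto gangs and vector lanes
    !$acc parallel loop gang vector collapse(2)
    do iCell = 1, nCells
       do k = 1, nVertLevels
          theta(k,iCell) = theta(k,iCell) + dt*tend(k,iCell)
       end do
    end do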

SLIDE 10

Challenges Faced: Using the Right Approach for Data


Approach 1: per-call data movement

    subroutine dyn_tend
      !$acc enter data copyin(…)
      … <code> …
      !$acc exit data copyout(…)
    end subroutine dyn_tend

Approach 2: data kept resident on the device

    subroutine dyn_tend
      !$acc data present(…)
      … <code> …
      !$acc end data
    end subroutine dyn_tend

Approach 1 means lower time for porting, but repeated, unnecessary data transfers and poor performance. Approach 2 creates a copy on host and device that exist simultaneously: harder to design, but no unnecessary copies of data between host and device. A fuller sketch of the second pattern follows.
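
A self-contained sketch of the persistent-data pattern, with the enter data/exit data done once at init/finalize rather than per call (field names theta and tend_theta are illustrative, not the actual MPAS data structures):

    module dyn_data
       implicit none
       real, allocatable :: theta(:,:), tend_theta(:,:)
    contains
       subroutine init_device_data()
          ! once, after allocation: create persistent device copies
          !$acc enter data copyin(theta) create(tend_theta)
       end subroutine init_device_data

       subroutine dyn_tend(dt)
          real, intent(in) :: dt
          integer :: iCell, k
          ! every timestep: data already resident, no host-device traffic
          !$acc parallel loop collapse(2) present(theta, tend_theta)
          do iCell = 1, size(theta, 2)
             do k = 1, size(theta, 1)
                theta(k,iCell) = theta(k,iCell) + dt*tend_theta(k,iCell)
             end do
          end do
       end subroutine dyn_tend

       subroutine finalize_device_data()
          ! once, at the end: copy results back and free device memory
          !$acc exit data copyout(theta) delete(tend_theta)
       end subroutine finalize_device_data
    end module dyn_data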

SLIDE 11

Challenges Faced: Using the Right Techniques

  • Understand how to optimize
      • Learn the basics of optimization
      • Understand how GPUs work
      • Identify poorly performing snippets
      • Know when to use global, register, and shared memory
      • Nvprof is your best friend!
  • Understand how GPUs and CPUs work
      • Experiment with SIMD-friendly loops (code layout)
      • Experiment with the GPU’s SIMT code (data layout)
      • Learn how to combine the two! (see the sketch below)
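
A sketch of what combining the two can look like: with MPAS's (k, iCell) array layout, one loop nest can serve both models. Array names are illustrative:

    ! CPU/SIMD: the inner k loop is unit-stride and vectorizes well.
    ! GPU/SIMT: collapse(2) exposes nCells*nVertLevels independent
    ! iterations, and the contiguous k index keeps accesses coalesced.
    !$acc parallel loop gang vector collapse(2)
    do iCell = 1, nCells
       do k = 1, nVertLevels
          wTend(k,iCell) = wTend(k,iCell) + rho(k,iCell)*w(k,iCell)
       end do
    end do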


SLIDE 12

Single Node Performance: MPAS Dry Dycore


  • Timers
      • MPAS GPTL timers reported in log files
      • GPU timing: includes no updates from device to host
      • Host updates may be needed for printing values on screen
      • Host updates may be needed for netCDF file output (see the sketch below)
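
A sketch of how such a timed region stays free of device-to-host traffic; mpas_timer_start/stop are the GPTL-backed MPAS timers, the other names are illustrative:

    call mpas_timer_start('dyn_tend')
    !$acc parallel loop gang vector collapse(2) present(theta, tend)
    do iCell = 1, nCells
       do k = 1, nVertLevels
          theta(k,iCell) = theta(k,iCell) + dt*tend(k,iCell)
       end do
    end do
    call mpas_timer_stop('dyn_tend')   ! the timer sees GPU compute only

    if (output_this_step) then
       !$acc update host(theta)        ! pay the transfer cost only when a
    end if                             ! print or netCDF write needs the data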

Time per timestep (seconds):

Dataset        Prec.   Broadwell       P100 w/ Power8   V100 w/ Haswell   P100 w/ Haswell   Speedup      Speedup
                       (fully          (1 GPU, PGI,     (1 GPU, PGI,      (1 GPU, PGI,      Broadwell    Broadwell
                       subscribed,     OpenACC code)    OpenACC code)     OpenACC code)     vs P100      vs V100
                       OpenMP, Intel,
                       base code)
120 km (40k)   SP      0.40            0.28             0.19              0.26              1.54         2.16
120 km (40k)   DP      0.88            0.40             0.29              0.35              2.51         2.99
60 km (163k)   SP      1.90            1.02             0.69              1.01              1.88         2.74
60 km (163k)   DP      3.80            1.54             1.12              1.41              2.70         3.40

(Speedups are the Broadwell time divided by the P100-with-Haswell and V100 times, respectively.)

Taking 40k grid points per node for SP, a 32.8M-point mesh (15 km and 3 km locally refined grid) needs ~800 Volta GPUs.

SLIDE 13

Weak Scaling for MPAS Dry Dycore (SP & DP) on P100 GPU


Time per timestep; 4 GPUs per node, 1 MPI rank per GPU (max 4 MPI ranks per node), intra-node affinity for MPI ranks (see the sketch below), OpenMPI, PCIe (no NVLink), PGI 17.10.

[Chart: time per timestep (0.2-2.2 s) vs. number of GPUs (2-16), four curves: 40k per node SP, 40k per node DP, 163k per node SP, 163k per node DP]
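
One common way to implement the 1-rank-per-GPU affinity is to derive a node-local rank and pin it to a device. A minimal sketch using MPI-3 and the OpenACC runtime API (not necessarily how the MPAS driver does it):

    program bind_gpu
       use mpi
       use openacc
       implicit none
       integer :: ierr, local_comm, local_rank, ngpus
       call MPI_Init(ierr)
       ! MPI-3: split by shared memory to find this rank's position on its node
       call MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, &
                                MPI_INFO_NULL, local_comm, ierr)
       call MPI_Comm_rank(local_comm, local_rank, ierr)
       ngpus = acc_get_num_devices(acc_device_nvidia)   ! 4 per node on PSG
       call acc_set_device_num(mod(local_rank, ngpus), acc_device_nvidia)
       call MPI_Finalize(ierr)
    end program bind_gpu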

SLIDE 14

Strong Scaling for MPAS Dry Dycore (SP & DP) at 15 km (2.6M Grid Points) on P100 GPUs


Time per timestep; 4 GPUs per node, 1 MPI rank per GPU (max 4 MPI ranks per node), intra-node affinity for MPI ranks, OpenMPI, PCIe (no NVLink), PGI 17.10.

[Chart: inverse time per timestep vs. number of GPUs (8-32), SP and DP]

SLIDE 15

Portability: Performance Comparison of the Base Code and the OpenACC Code on a Fully Subscribed Broadwell Node


Run back on the CPU, the OpenACC code is nearly identical to the base code: the variation is <1% for the 40k dataset and <4% for 163k.

[Chart: time (secs) for base code vs. OpenACC code on the 40k SP, 40k DP, 163k SP, and 163k DP datasets]

SLIDE 16

Future Work

  • Improving MPAS scalability
      • MVAPICH instead of OpenMPI
      • NVLink systems
      • Moving halo-exchange bookkeeping onto GPUs (see the sketch after this list)
      • MPS: preliminary results showed no improvement
  • Scalar transport
      • Currently being integrated
  • Physics
      • Port: 35% remaining
      • Optimize: 65% remaining
      • Radiation and land surface remain on the CPU
  • Development
      • Lagged radiation
      • Adopting SIONlib for faster I/O
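
For the halo-exchange item above, one direction this work can take is to keep halo buffers resident on the device and hand device pointers straight to a CUDA-aware MPI, avoiding a host round trip. A hedged sketch of that pattern, not the current MPAS implementation (buffer names and neighbor logic are illustrative):

    ! sendbuf/recvbuf assumed device-resident via an earlier enter data.
    ! Requires a CUDA-aware MPI (e.g., MVAPICH2-GDR, or OpenMPI with CUDA support).
    !$acc host_data use_device(sendbuf, recvbuf)
    call MPI_Isend(sendbuf, n, MPI_REAL, neighbor, tag, comm, reqs(1), ierr)
    call MPI_Irecv(recvbuf, n, MPI_REAL, neighbor, tag, comm, reqs(2), ierr)
    !$acc end host_data
    call MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE, ierr)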


SLIDE 17

Thank you! Questions?
