SLIDE 1

OPENACC & OPENMP4.5 OFFLOADING: SPEEDING UP SIMULATIONS OF STELLAR EXPLOSIONS

Tom Papatheodore, May 9, 2017

SLIDE 2

OAK RIDGE LEADERSHIP COMPUTING FACILITY

Center for Accelerated Application Readiness (CAAR)

  • Preparing codes to run on the upcoming (CORAL) Summit supercomputer at ORNL
  • Summit – IBM POWER9 + NVIDIA Volta
  • EA System – IBM POWER8 + NVIDIA Pascal

FLASH – adaptive-mesh, multi-physics simulation code widely used in astrophysics

  • http://flash.uchicago.edu/site/

This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.

SLIDE 3

SUPERNOVAE

What are supernovae?

Supernovae – exploding stars

  • Among the most energetic events in the universe
  • Contribute to galactic dynamics
  • Create heavy elements (e.g. iron, calcium)

Image credits: NASA/CXC/Rutgers/J. Warren & J. Hughes et al. (http://chandra.harvard.edu/photo/2005/tycho); NASA, ESA, J. Hester and A. Loll (Arizona State University), HubbleSite gallery release.

SLIDE 4

SUPERNOVAE

Simulating Supernovae

Requires a multi-physics code:

  • Hydrodynamics
  • Nuclear Burning
  • Gravity
  • Equation of State
    – Relationship between the thermodynamic variables in a system (e.g. P = P(ρ, T, X))
    – Called many times during a simulation
    – Can be calculated independently in each grid zone

Jordan et al. 2008, The Astrophysical Journal, 681, 1448–1457

SLIDE 5

EQUATION OF STATE

Helmholtz EOS (Timmes & Swesty, 2000)

  • Based on a Helmholtz free energy formulation
  • High-order interpolation from a table of free energy values (quintic Hermite polynomials; a small interpolation sketch follows below)

Collaborators at Stony Brook University (Mike Zingale, Max Katz, Adam Jacobs)

  • Developed an OpenACC version of the Helmholtz EOS as part of a shared repository of microphysics (starkiller) that can run in FLASH as well as in BoxLib-based codes such as CASTRO and MAESTRO

FLASH-CAAR

  • Install this accelerated version of the EOS in FLASH
  • Created a version using OpenMP4.5 with offloading (as part of a hackathon with IBM's CoE)
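
To give a flavor of the interpolation the EOS performs: the table stores the free energy together with its derivatives at each grid point, and quintic Hermite basis functions blend them between points. The 1-D sketch below is illustrative only; the real Helmholtz EOS interpolates biquintically in density and temperature, and the names used here (f, df, d2f, x0, x1) are placeholders, not the actual table layout.

! Illustrative 1-D quintic Hermite interpolation between two table points.
! Placeholder names only; not the actual Helmholtz table routines.
module hermite_sketch
  implicit none
contains
  pure function quintic_interp(x, x0, x1, f, df, d2f) result(fx)
    real(8), intent(in) :: x, x0, x1           ! evaluation point and bracketing table points
    real(8), intent(in) :: f(2), df(2), d2f(2) ! value, 1st and 2nd derivative at x0 and x1
    real(8) :: fx, h, z, h0a, h0b, h1a, h1b, h2a, h2b
    h = x1 - x0
    z = (x - x0) / h
    ! Quintic Hermite basis functions (C2-continuous across table cells)
    h0a = 1.0d0 - 10.0d0*z**3 + 15.0d0*z**4 - 6.0d0*z**5
    h0b = 10.0d0*z**3 - 15.0d0*z**4 + 6.0d0*z**5
    h1a = z - 6.0d0*z**3 + 8.0d0*z**4 - 3.0d0*z**5
    h1b = -4.0d0*z**3 + 7.0d0*z**4 - 3.0d0*z**5
    h2a = 0.5d0*z**2 - 1.5d0*z**3 + 1.5d0*z**4 - 0.5d0*z**5
    h2b = 0.5d0*z**3 - 1.0d0*z**4 + 0.5d0*z**5
    fx = f(1)*h0a + f(2)*h0b                 &
       + h*(df(1)*h1a + df(2)*h1b)           &
       + h*h*(d2f(1)*h2a + d2f(2)*h2b)
  end function quintic_interp
end module hermite_sketch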

SLIDE 6

HELMHOLTZ EOS

Driver Program

To determine the best use of the accelerated EOS in FLASH, we created a driver program (see the skeleton below) that:

  • Mimics the AMR block structure and time stepping in FLASH
  • Loops through several time steps; in each step it
    – Changes the total number of grid zones
    – Fills these zones with new data
    – Calculates the interpolation in all grid zones

Question: how many AMR blocks should we calculate (i.e., call the EOS on) at once per MPI rank?
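
A minimal skeleton of what such a driver might look like. The routine names (fill_zones, eos_all_zones) are placeholder stubs, not the actual FLASH-CAAR driver source; the block counts match the tests described later.

! Hedged skeleton of the driver program described above.
program eos_driver_sketch
  implicit none
  integer, parameter :: zones_per_block = 256
  integer, parameter :: n_steps = 5
  integer, parameter :: blocks_per_step(n_steps) = (/ 1, 10, 100, 1000, 10000 /)
  integer :: step, n_zones

  do step = 1, n_steps
     ! Mimic AMR regridding: the total number of zones changes every "time step"
     n_zones = blocks_per_step(step) * zones_per_block
     call fill_zones(n_zones)      ! fill the active zones with new (rho, T, X) data
     call eos_all_zones(n_zones)   ! offloaded interpolation over all active zones
  end do

contains

  subroutine fill_zones(n)
    integer, intent(in) :: n
    ! placeholder: the real driver refills the derived-type zone arrays here
  end subroutine fill_zones

  subroutine eos_all_zones(n)
    integer, intent(in) :: n
    ! placeholder: update device, launch the EOS kernel, update host (later slides)
  end subroutine eos_all_zones

end program eos_driver_sketch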

SLIDE 7

HELMHOLTZ EOS

Basic Flow of Driver Program

1) Allocate the main data arrays on the host and the device (a sketch follows below)

  • Arrays of Fortran derived types; each element holds the grid data for a single zone
  • The arrays persist for the duration of the program
  • They are used to pass zone data back and forth between host and device
    – Reduced (input) set sent host-to-device
    – Full (output) set sent device-to-host
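
A minimal sketch of that persistent allocation, assuming simple derived types with only scalar components; the type and component names (eos_input_t, eos_t, rho, T, ...) are illustrative, not the actual starkiller/FLASH definitions.

! Hedged sketch: persistent host and device copies of the per-zone arrays.
module zone_data_sketch
  implicit none

  type :: eos_input_t                 ! reduced set: EOS inputs only (host-to-device)
     real(8) :: rho, T, abar, zbar
  end type eos_input_t

  type :: eos_t                       ! full set: inputs plus all EOS outputs (device-to-host)
     real(8) :: rho, T, abar, zbar
     real(8) :: p, e, s, cv, cp
  end type eos_t

  type(eos_input_t), allocatable :: reduced_state(:)
  type(eos_t),       allocatable :: state(:)

contains

  subroutine allocate_zone_data(max_zones)
    integer, intent(in) :: max_zones
    allocate(reduced_state(max_zones), state(max_zones))
    ! Create matching device copies once; they live for the whole run.
    !$acc enter data create(reduced_state, state)
    ! OpenMP4.5 equivalent (shown as a comment):
    ! !$omp target enter data map(alloc: reduced_state, state)
  end subroutine allocate_zone_data

end module zone_data_sketch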

SLIDE 8

HELMHOLTZ EOS

Basic Flow of Driver Program

2) Read in the tabulated Helmholtz free energy data and make a copy on the device (see the sketch below)

  • The table persists for the duration of the program
  • All thermodynamic quantities are interpolated from this table
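
In directive form, making the table resident on the device for the life of the run looks roughly like the sketch below; the array name f_table, the file name, and the table dimensions are placeholders for the actual Helmholtz table data.

! Hedged sketch: read the tabulated free energy once and mirror it on the device.
subroutine init_helm_table_sketch()
  implicit none
  real(8), allocatable, save :: f_table(:,:)   ! free energy (and derivatives) vs (rho, T)
  integer :: nrho, ntemp, iounit

  nrho = 541; ntemp = 201                      ! illustrative table dimensions
  allocate(f_table(nrho, ntemp))

  open(newunit=iounit, file='helm_table.dat', status='old', action='read')
  read(iounit, *) f_table
  close(iounit)

  ! Copy to the device once; the device copy persists for the duration of the program.
  !$acc enter data copyin(f_table)
  ! OpenMP4.5 equivalent (shown as a comment):
  ! !$omp target enter data map(to: f_table)
end subroutine init_helm_table_sketch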

SLIDE 9

HELMHOLTZ EOS

Basic Flow of Driver Program

3) For each time step:

  • Change the number of AMR blocks
  • Update the device with the new grid data
  • Launch the EOS kernel: calculate the interpolation for all grid zones
  • Update the host with the newly calculated quantities

These steps run inside a traditional OpenMP (CPU) parallel region in which each host thread gets a portion of the total zones (a sketch of that region follows below; the offload directives themselves are on the next slide).
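
A sketch of that host-side region, assuming each thread takes one contiguous chunk of zones; thread_id, start_element, and stop_element then feed the offload directives on the next slide. The chunking logic and routine name are illustrative, not the actual driver code.

! Hedged sketch of the traditional (CPU) OpenMP region around the offloaded EOS calls.
subroutine eos_all_zones_sketch(n_zones)
  use omp_lib, only: omp_get_thread_num, omp_get_num_threads
  implicit none
  integer, intent(in) :: n_zones
  integer :: thread_id, n_threads, chunk, start_element, stop_element

  !$omp parallel private(thread_id, n_threads, chunk, start_element, stop_element)
  thread_id = omp_get_thread_num()
  n_threads = omp_get_num_threads()
  chunk = (n_zones + n_threads - 1) / n_threads       ! zones per host thread, rounded up
  start_element = thread_id * chunk + 1
  stop_element  = min((thread_id + 1) * chunk, n_zones)

  if (start_element <= stop_element) then
     ! ... per-thread offload: the OpenACC or OpenMP4.5 directives on the next slide ...
  end if
  !$omp end parallel
end subroutine eos_all_zones_sketch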

SLIDE 10

HELMHOLTZ EOS

Basic Flow of Driver Program

OpenACC:

!$acc update device(reduced_state(start_element:stop_element)) async(thread_id + 1)
!$acc kernels async(thread_id + 1)
do zone = start_element, stop_element
   call eos(state(zone), reduced_state(zone))
end do
!$acc end kernels
!$acc update self(state(start_element:stop_element)) async(thread_id + 1)
!$acc wait

OpenMP4.5:

!$omp target update to(reduced_state(start_element:stop_element))
!$omp target
!$omp teams distribute parallel do thread_limit(128) num_threads(128)
do zone = start_element, stop_element
   call eos(state(zone), reduced_state(zone))
end do
!$omp end teams distribute parallel do
!$omp end target
!$omp target update from(state(start_element:stop_element))

SLIDE 11

HELMHOLTZ EOS

Driver Program Tests

  • Number of “AMR” blocks: 1, 10, 100, 1000, 10000 (each with 256 zones)
  • Ran with 1, 4, and 10 (CPU) OpenMP threads for each block count
  • OpenACC (PGI 16.10)
  • OpenMP4.5 (XL 15.1.5)

SLIDE 12

OPENACC VS OPENMP4.5

Current functionality for offloading to GPUs

  • PGI’s OpenACC implementation (version 16.10) has the more mature API
    – It has simply been around longer
  • IBM’s XL Fortran implementation of OpenMP4.5 (version 15.1.5) does not currently allow pinned memory or asynchronous data transfers / kernel execution
    – These features are coming soon (a sketch of what the asynchronous form could look like follows below)
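
For reference, OpenMP4.5 itself does define asynchronous offloading through nowait and depend clauses on the target constructs. Once the compiler supports them, an analogue of the OpenACC async/wait pattern on slide 10 could look roughly like this sketch (illustrative only, not tested with XL 15.1.5).

! Hedged sketch: asynchronous form of the update-compute-update sequence using
! OpenMP4.5 nowait/depend clauses (usable once the compiler supports them).
!$omp target update to(reduced_state(start_element:stop_element)) nowait &
!$omp   depend(out: reduced_state)
!$omp target teams distribute parallel do nowait &
!$omp   depend(in: reduced_state) depend(out: state)
do zone = start_element, stop_element
   call eos(state(zone), reduced_state(zone))
end do
!$omp end target teams distribute parallel do
!$omp target update from(state(start_element:stop_element)) nowait depend(in: state)
!$omp taskwait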

SLIDE 13

HELMHOLTZ EOS

Preliminary Performance Results

  • At low AMR block counts, kernel launch overhead is large and kernel execution time does not increase much with block count

SLIDE 14

HELMHOLTZ EOS

Preliminary Performance Results

  • The same is true for the multi-threaded case, but the overhead increases for each “send, compute, receive” sequence

SLIDE 15

HELMHOLTZ EOS

Preliminary Performance Results

  • At higher block counts, kernel overhead is negligible; the time is now dominated by device-to-host transfers

SLIDE 16

HELMHOLTZ EOS

Preliminary Performance Results

  • The same is true for the multi-threaded case, even with overlap of data transfers and kernel execution

SLIDE 17

HELMHOLTZ EOS

Preliminary Performance Results

  • At low AMR block counts, kernel execution times are roughly the same, and these dominate the overall time of each “send, compute, receive” sequence
  • Why are the kernel times so large? Temporary variables? CUDA thread scheduling?

SLIDE 18

HELMHOLTZ EOS

Preliminary Performance Results

  • Similar behavior for the multi-threaded case, but with serialized kernel-launch overhead

SLIDE 19

HELMHOLTZ EOS

Preliminary Performance Results

  • At larger AMR block counts, kernel times still dominate, but now we are saturating the GPU

SLIDE 20

HELMHOLTZ EOS

Preliminary Performance Results

  • Similar behavior for the multi-threaded case, but with no benefit from asynchronous execution

SLIDE 21

HELMHOLTZ EOS

Preliminary Performance Results

  • Compare with CPU-only OpenMP (dashed): 1, 4, and 10 CPU threads
  • No current advantage to using GPUs

SLIDE 22

HELMHOLTZ EOS

Preliminary Performance Results

  • GPUs give an advantage when there are >100 AMR blocks
  • The GPU can calculate 100 blocks in roughly the same time as 1 block
  • So in FLASH, we should compute 100s of blocks per MPI rank
  • Restructure to calculate many blocks at once (roughly as sketched below)
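
In FLASH terms, that restructuring amounts to gathering the zones of many blocks into one flat buffer so that a single offloaded EOS call (and one host-device transfer pair) covers hundreds of blocks. The sketch below is illustrative only; block_input and the reuse of the reduced_state/state arrays from the earlier sketches are assumptions, not the actual FLASH data structures.

! Hedged sketch: batch the zones of many AMR blocks into one EOS call.
subroutine eos_many_blocks_sketch(n_blocks, zones_per_block, block_input)
  use zone_data_sketch, only: eos_input_t, reduced_state, state
  implicit none
  integer, intent(in) :: n_blocks, zones_per_block
  type(eos_input_t), intent(in) :: block_input(zones_per_block, n_blocks)
  integer :: blk, zone, n_zones

  n_zones = n_blocks * zones_per_block

  do blk = 1, n_blocks                        ! gather: block-structured -> flat buffer
     do zone = 1, zones_per_block
        reduced_state((blk-1)*zones_per_block + zone) = block_input(zone, blk)
     end do
  end do

  !$acc update device(reduced_state(1:n_zones))
  !$acc kernels
  do zone = 1, n_zones                        ! one kernel launch covers all gathered blocks
     call eos(state(zone), reduced_state(zone))
  end do
  !$acc end kernels
  !$acc update self(state(1:n_zones))
end subroutine eos_many_blocks_sketch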

SLIDE 23

CONCLUSIONS

Current Snapshot

  • OpenACC (PGI 16.10)
    – Mature API with more features implemented
    – Currently better kernel-execution performance
  • XL Fortran (15.1.5) OpenMP4.5 implementation is still in its early stages
    – Some features are currently missing, but they are under active development
    – Some bugs are still being worked out
  • Looking into the long kernel execution times (related to CUDA thread scheduling?)