OPENACC & OPENMP4.5 OFFLOADING: SPEEDING UP SIMULATIONS OF STELLAR EXPLOSIONS
Tom Papatheodore, May 9, 2017
OAK RIDGE LEADERSHIP COMPUTING FACILITY
Center for Accelerated Application Readiness (CAAR)
Preparing codes to run on the upcoming (CORAL) Summit supercomputer at ORNL
- Summit – IBM POWER9 + NVIDIA Volta
- EA System – IBM POWER8 + NVIDIA Pascal
FLASH – adaptive-mesh, multi-physics simulation code widely used in astrophysics
http://flash.uchicago.edu/site/
This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.
SUPERNOVAE
What are supernovae?
Supernovae – exploding stars
- Among the most energetic events in the universe
- Contribute to galactic dynamics
- Create heavy elements (e.g. iron, calcium)
Image credits: NASA/CXC/Rutgers/J. Warren & J. Hughes et al. (http://chandra.harvard.edu/photo/2005/tycho); NASA, ESA, J. Hester and A. Loll (Arizona State University), HubbleSite gallery release.
SUPERNOVAE
Simulating Supernovae
Requires a multi-physics code:
- Hydrodynamics
- Nuclear burning
- Gravity
- Equation of state
  - Relationship between thermodynamic variables in a system (e.g. P = P(ρ, T, X))
  - Called many times during a simulation
  - Can be calculated independently for each zone (see the sketch below)
Jordan et al., The Astrophysical Journal, 681, 1448-1457 (2008)
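Because each zone's EOS evaluation depends only on that zone's own state, the calls are independent and map naturally onto parallel hardware. A minimal sketch of such a per-zone interface, using an ideal-gas relation as a stand-in for the real tabulated EOS (the names and the physics here are illustrative, not FLASH's actual API):

  ! Illustrative per-zone EOS kernel (hypothetical names).
  ! Each call reads only one zone's density, temperature, and
  ! composition, so zones can be evaluated in any order or in parallel.
  pure subroutine eos_zone(rho, temp, abar, pres)
    implicit none
    real(8), intent(in)  :: rho   ! density [g/cm^3]
    real(8), intent(in)  :: temp  ! temperature [K]
    real(8), intent(in)  :: abar  ! mean atomic mass from composition X
    real(8), intent(out) :: pres  ! pressure
    real(8), parameter :: k_B = 1.380649d-16   ! Boltzmann constant [erg/K]
    real(8), parameter :: m_u = 1.66053907d-24 ! atomic mass unit [g]
    ! Ideal-gas stand-in for P = P(rho, T, X); the composition X
    ! enters through the mean atomic mass abar.
    pres = rho * k_B * temp / (abar * m_u)
  end subroutine eos_zone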
EQUATION OF STATE
Helmholtz EOS (Timmes & Swesty, 2000)
Based on a Helmholtz free energy formulation
- High-order interpolation from a table of free energy (quintic Hermite polynomials; see the sketch below)
Collaborators at Stony Brook University (Mike Zingale, Max Katz, Adam Jacobs)
- Developed an OpenACC version of the Helmholtz EOS as part of a shared repository of microphysics (starkiller) that can run in FLASH as well as in BoxLib-based codes such as CASTRO and MAESTRO
FLASH-CAAR
- Install this accelerated version of the EOS into FLASH
- Created a version using OpenMP4.5 with offloading (as part of a hackathon with IBM's CoE)
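Quintic Hermite interpolation matches the tabulated free energy and its first two derivatives at the table points, which keeps the interpolated thermodynamic quantities smooth. For illustration, the standard quintic Hermite basis functions on z in [0,1] look like this (a sketch of the textbook basis, not code copied from the Helmholtz source):

  ! Quintic Hermite basis functions on z in [0,1] (illustrative).
  ! psi0 interpolates function values: psi0(0)=1, psi0(1)=0, and its
  ! first and second derivatives vanish at both endpoints.
  pure function psi0(z) result(p)
    real(8), intent(in) :: z
    real(8) :: p
    p = z**3 * (z*(-6.0d0*z + 15.0d0) - 10.0d0) + 1.0d0
  end function psi0

  ! psi1 carries first-derivative data: psi1'(0)=1, all other
  ! endpoint values and derivatives are zero.
  pure function psi1(z) result(p)
    real(8), intent(in) :: z
    real(8) :: p
    p = z**3 * (z*(-3.0d0*z + 8.0d0) - 6.0d0) + z
  end function psi1

  ! psi2 carries second-derivative data: psi2''(0)=1, all other
  ! endpoint values and derivatives are zero.
  pure function psi2(z) result(p)
    real(8), intent(in) :: z
    real(8) :: p
    p = 0.5d0 * z*z * (z*(z*(-z + 3.0d0) - 3.0d0) + 1.0d0)
  end function psi2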
HELMHOLTZ EOS
Driver Program
To determine the best use of the accelerated EOS in FLASH, we created a driver program (sketched below) that:
- Mimics the AMR block structure and time stepping in FLASH
- Loops through several time steps:
  - Changes the total number of grid zones
  - Fills these zones with new data
  - Calculates the interpolation in all grid zones
- Answers the question: how many AMR blocks should we calculate (call the EOS on) at once per MPI rank?
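A minimal sketch of the driver's overall shape, assuming flat per-zone arrays and hypothetical names throughout:

  ! Illustrative driver skeleton (names hypothetical, not the actual code).
  program eos_driver
    implicit none
    integer, parameter :: zones_per_block = 256
    integer :: step, n_blocks, n_zones, zone

    do step = 1, 5
       ! Vary the workload each "time step": 1, 10, 100, 1000, 10000 blocks
       n_blocks = 10**(step - 1)
       n_zones  = n_blocks * zones_per_block
       ! (1) fill the zones with new thermodynamic data
       ! (2) update device, launch the EOS kernel over all zones,
       !     then update the host (directives shown on a later slide)
       do zone = 1, n_zones
          ! call eos(state(zone), reduced_state(zone))
       end do
    end do
  end program eos_driver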
HELMHOLTZ EOS
Basic Flow of Driver Program
1) Allocate main data arrays on host and device
- Arrays of Fortran derived types; each element holds the grid data for a single zone
- Persist for the duration of the program (one way to set this up is sketched below)
- Used to pass zone data back and forth between host and device:
  - Reduced set sent host-to-device
  - Full set sent device-to-host
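With OpenACC, one way to get device arrays that persist for the whole run is an unstructured data region created once at startup (a sketch, using the state/reduced_state arrays that appear in the kernel shown later; max_zones is an assumed size):

  ! Create persistent device copies at program start (illustrative).
  ! "reduced_state" carries only the inputs the EOS needs (H-to-D),
  ! while "state" carries the full set of outputs (D-to-H).
  !$acc enter data create(state(1:max_zones), reduced_state(1:max_zones))

  ! ... time-step loop runs here, updating slices as needed ...

  ! Release the device copies at program end.
  !$acc exit data delete(state, reduced_state)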
HELMHOLTZ EOS
Basic Flow of Driver Program
2) Read in the tabulated Helmholtz free energy data and make a copy on the device (sketched below)
- This copy persists for the duration of the program
- Thermodynamic quantities are interpolated from this table
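The one-time device copy of the table might look like this with OpenACC (a sketch; the reader routine and table variable names are hypothetical, not those of the Helmholtz source):

  ! Read the free-energy table on the host, then mirror it on the
  ! device once; kernels interpolate from this resident copy.
  call read_helm_table(f_table)   ! hypothetical reader routine
  !$acc enter data copyin(f_table)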
HELMHOLTZ EOS
Basic Flow of Driver Program
3) For each time step:
- Change the number of AMR blocks
- Update the device with the new grid data
- Launch the EOS kernel: calculate the interpolation for all grid zones
- Update the host with the newly calculated quantities
All of this runs inside a traditional OpenMP (CPU) parallel region, where each thread gets a portion of the total zones.
HELMHOLTZ EOS
Basic Flow of Driver Program

OpenACC:

  !$acc update device(reduced_state(start_element:stop_element)) async(thread_id + 1)
  !$acc kernels async(thread_id + 1)
  do zone = start_element, stop_element
     call eos(state(zone), reduced_state(zone))
  end do
  !$acc end kernels
  !$acc update self(state(start_element:stop_element)) async(thread_id + 1)
  !$acc wait

OpenMP4.5:

  !$omp target update to(reduced_state(start_element:stop_element))
  !$omp target
  !$omp teams distribute parallel do thread_limit(128) num_threads(128)
  do zone = start_element, stop_element
     call eos(state(zone), reduced_state(zone))
  end do
  !$omp end teams distribute parallel do
  !$omp end target
  !$omp target update from(state(start_element:stop_element))
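For context, here is a sketch of the host-side OpenMP (CPU) parallel region that would wrap the OpenACC sequence above, giving each CPU thread its own slice of zones and its own async queue; the partitioning code is illustrative, as only the directives above appear on the slide:

  ! Illustrative host-side threading around the OpenACC sequence above.
  use omp_lib
  integer :: thread_id, chunk, start_element, stop_element

  !$omp parallel private(thread_id, chunk, start_element, stop_element)
  thread_id = omp_get_thread_num()
  ! Divide n_zones as evenly as possible across the CPU threads.
  chunk = (n_zones + omp_get_num_threads() - 1) / omp_get_num_threads()
  start_element = thread_id * chunk + 1
  stop_element  = min((thread_id + 1) * chunk, n_zones)
  ! The update/kernels/update sequence above runs here on queue
  ! async(thread_id + 1), letting transfers and kernels from
  ! different threads overlap on the device.
  !$omp end parallel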
HELMHOLTZ EOS
Driver Program Tests
Number of "AMR" blocks: 1, 10, 100, 1000, 10000 (each with 256 zones)
- Ran with 1, 4, and 10 (CPU) OpenMP threads for each block count
- OpenACC (PGI 16.10)
- OpenMP4.5 (XL 15.1.5)
OPENACC VS OPENMP4.5
Current Functionality for Offloading to GPUs
- PGI's OpenACC implementation (version 16.10) has the more mature API
  - It has simply been around longer
- IBM's XL Fortran implementation of OpenMP4.5 (version 15.1.5)
  - Does not currently allow pinned memory or asynchronous data transfers / kernel execution (coming soon)
HELMHOLTZ EOS
Preliminary Performance Results
- At low AMR block counts, kernel overhead is large and kernel execution time does not increase much
HELMHOLTZ EOS
Preliminary Performance Results
- The same is true for the multi-threaded case, but overhead is increased for each "send, compute, receive" sequence
HELMHOLTZ EOS
Preliminary Performance Results
- At higher block counts, kernel overhead is negligible; the time is now dominated by D2H transfers
HELMHOLTZ EOS
Preliminary Performance Results
- The same is true for the multi-threaded case, even with overlap of data transfers and kernel execution
HELMHOLTZ EOS
Preliminary Performance Results
- At low AMR block counts, kernel execution times are roughly the same, and these dominate the overall time of each "send, compute, receive"
- Why are the kernel times so large? Temporary variables? CUDA thread scheduling?
HELMHOLTZ EOS
Preliminary Performance Results
- Similar behavior for the multi-threaded case, but with serialized launch overhead
HELMHOLTZ EOS
Preliminary Performance Results
- At larger AMR block counts, kernel times still dominate, but now we are saturating the GPUs
HELMHOLTZ EOS
Preliminary Performance Results
- Similar behavior for the multi-threaded case, but no benefit from asynchronous execution
HELMHOLTZ EOS
Preliminary Performance Results
- Compare with CPU-only OpenMP (dashed lines): 1, 4, and 10 CPU threads
- No current advantage to using GPUs
HELMHOLTZ EOS
Preliminary Performance Results
- Advantage from GPUs when computing >100 AMR blocks
- Can calculate 100 blocks in roughly the same time as 1 block
- So in FLASH, we should compute 100s of blocks per MPI rank
  - Restructure to calculate many blocks at once (see the sketch below)
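A minimal sketch of what that restructuring might look like: instead of calling the EOS block by block, gather zones from many blocks into one flat array and make a single offloaded call, so launch and transfer overhead is amortized across hundreds of blocks (all names here are illustrative, not FLASH's actual interfaces):

  ! Batch many AMR blocks into one EOS invocation (illustrative).
  n_zones = n_blocks * zones_per_block
  do b = 1, n_blocks
     do z = 1, zones_per_block
        flat = (b - 1) * zones_per_block + z
        reduced_state(flat) = pack_zone(block(b), z)  ! hypothetical packer
     end do
  end do

  ! One transfer and one kernel cover every gathered zone.
  !$acc update device(reduced_state(1:n_zones))
  !$acc kernels
  do zone = 1, n_zones
     call eos(state(zone), reduced_state(zone))
  end do
  !$acc end kernels
  !$acc update self(state(1:n_zones))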
CONCLUSIONS
Current Snapshot
- OpenACC (PGI 16.10)
  - Mature API with more features implemented
  - Currently better kernel execution performance
- XL Fortran (15.1.5) OpenMP4.5 implementation is still in its early stages
  - Currently some missing features, but these are being developed as we speak
  - Some bugs are still being worked out
- Looking into long kernel execution times (related to CUDA thread scheduling?)