SLIDE 1

OPENACC & OPENMP4.5 OFFLOADING: SPEEDING UP SIMULATIONS OF STELLAR EXPLOSIONS

Tom Papatheodore, May 9, 2017

SLIDE 2

OAK RIDGE LEADERSHIP COMPUTING FACILITY

Center for Accelerated Application Readiness (CAAR)

  • Preparing codes to run on the upcoming (CORAL) Summit supercomputer at ORNL
  • Summit – IBM POWER9 + NVIDIA Volta
  • EA System – IBM POWER8 + NVIDIA Pascal

FLASH – adaptive-mesh, multi-physics simulation code widely used in astrophysics

  • http://flash.uchicago.edu/site/

This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.

SLIDE 3

SUPERNOVAE

What are supernovae?

Supernovae – exploding stars

  • Among the most energetic events in the universe
  • Contribute to galactic dynamics
  • Create heavy elements (e.g. iron, calcium)

Image credits: NASA/CXC/Rutgers/J. Warren & J. Hughes et al. (http://chandra.harvard.edu/photo/2005/tycho); NASA, ESA, J. Hester and A. Loll (Arizona State University), HubbleSite gallery release.

SLIDE 4

SUPERNOVAE

Simulating Supernovae

Requires a multi-physics code:

  • Hydrodynamics
  • Nuclear Burning
  • Gravity
  • Equation of State
    – Relationship between the thermodynamic variables in a system (e.g. P = P(ρ, T, X))
    – Called many times during a simulation
    – Can be calculated independently in each grid zone

Jordan et al. 2008, The Astrophysical Journal, 681, 1448–1457

SLIDE 5

EQUATION OF STATE

Helmholtz EOS (Timmes & Swesty, 2000)

  • Based on a Helmholtz free energy formulation
  • High-order interpolation from a table of free energy values (quintic Hermite polynomials; a small interpolation sketch follows below)

Collaborators at Stony Brook University (Mike Zingale, Max Katz, Adam Jacobs)

  • Developed an OpenACC version of the Helmholtz EOS as part of a shared repository of microphysics (starkiller) that can run in FLASH as well as in BoxLib-based codes such as CASTRO and MAESTRO

FLASH-CAAR

  • Install this accelerated version of the EOS in FLASH
  • Created a version using OpenMP4.5 with offloading (as part of a hackathon with IBM's CoE)
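
To give a flavor of the interpolation the EOS performs: the table stores the free energy together with its derivatives at each grid point, and quintic Hermite basis functions blend them between points. The 1-D sketch below is illustrative only; the real Helmholtz EOS interpolates biquintically in density and temperature, and the names used here (f, df, d2f, x0, x1) are placeholders, not the actual table layout.

! Illustrative 1-D quintic Hermite interpolation between two table points.
! Placeholder names only; not the actual Helmholtz table routines.
module hermite_sketch
  implicit none
contains
  pure function quintic_interp(x, x0, x1, f, df, d2f) result(fx)
    real(8), intent(in) :: x, x0, x1           ! evaluation point and bracketing table points
    real(8), intent(in) :: f(2), df(2), d2f(2) ! value, 1st and 2nd derivative at x0 and x1
    real(8) :: fx, h, z, h0a, h0b, h1a, h1b, h2a, h2b
    h = x1 - x0
    z = (x - x0) / h
    ! Quintic Hermite basis functions (C2-continuous across table cells)
    h0a = 1.0d0 - 10.0d0*z**3 + 15.0d0*z**4 - 6.0d0*z**5
    h0b = 10.0d0*z**3 - 15.0d0*z**4 + 6.0d0*z**5
    h1a = z - 6.0d0*z**3 + 8.0d0*z**4 - 3.0d0*z**5
    h1b = -4.0d0*z**3 + 7.0d0*z**4 - 3.0d0*z**5
    h2a = 0.5d0*z**2 - 1.5d0*z**3 + 1.5d0*z**4 - 0.5d0*z**5
    h2b = 0.5d0*z**3 - 1.0d0*z**4 + 0.5d0*z**5
    fx = f(1)*h0a + f(2)*h0b                 &
       + h*(df(1)*h1a + df(2)*h1b)           &
       + h*h*(d2f(1)*h2a + d2f(2)*h2b)
  end function quintic_interp
end module hermite_sketch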

SLIDE 6

HELMHOLTZ EOS

Driver Program

To determine the best use of the accelerated EOS in FLASH, we created a driver program (see the skeleton below) that:

  • Mimics the AMR block structure and time stepping in FLASH
  • Loops through several time steps; in each step it
    – Changes the total number of grid zones
    – Fills these zones with new data
    – Calculates the interpolation in all grid zones

Question: how many AMR blocks should we calculate (i.e., call the EOS on) at once per MPI rank?
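
A minimal skeleton of what such a driver might look like. The routine names (fill_zones, eos_all_zones) are placeholder stubs, not the actual FLASH-CAAR driver source; the block counts match the tests described later.

! Hedged skeleton of the driver program described above.
program eos_driver_sketch
  implicit none
  integer, parameter :: zones_per_block = 256
  integer, parameter :: n_steps = 5
  integer, parameter :: blocks_per_step(n_steps) = (/ 1, 10, 100, 1000, 10000 /)
  integer :: step, n_zones

  do step = 1, n_steps
     ! Mimic AMR regridding: the total number of zones changes every "time step"
     n_zones = blocks_per_step(step) * zones_per_block
     call fill_zones(n_zones)      ! fill the active zones with new (rho, T, X) data
     call eos_all_zones(n_zones)   ! offloaded interpolation over all active zones
  end do

contains

  subroutine fill_zones(n)
    integer, intent(in) :: n
    ! placeholder: the real driver refills the derived-type zone arrays here
  end subroutine fill_zones

  subroutine eos_all_zones(n)
    integer, intent(in) :: n
    ! placeholder: update device, launch the EOS kernel, update host (later slides)
  end subroutine eos_all_zones

end program eos_driver_sketch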

SLIDE 7

HELMHOLTZ EOS

Basic Flow of Driver Program

1) Allocate the main data arrays on the host and the device (a sketch follows below)

  • Arrays of Fortran derived types; each element holds the grid data for a single zone
  • The arrays persist for the duration of the program
  • They are used to pass zone data back and forth between host and device
    – Reduced (input) set sent host-to-device
    – Full (output) set sent device-to-host
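
A minimal sketch of that persistent allocation, assuming simple derived types with only scalar components; the type and component names (eos_input_t, eos_t, rho, T, ...) are illustrative, not the actual starkiller/FLASH definitions.

! Hedged sketch: persistent host and device copies of the per-zone arrays.
module zone_data_sketch
  implicit none

  type :: eos_input_t                 ! reduced set: EOS inputs only (host-to-device)
     real(8) :: rho, T, abar, zbar
  end type eos_input_t

  type :: eos_t                       ! full set: inputs plus all EOS outputs (device-to-host)
     real(8) :: rho, T, abar, zbar
     real(8) :: p, e, s, cv, cp
  end type eos_t

  type(eos_input_t), allocatable :: reduced_state(:)
  type(eos_t),       allocatable :: state(:)

contains

  subroutine allocate_zone_data(max_zones)
    integer, intent(in) :: max_zones
    allocate(reduced_state(max_zones), state(max_zones))
    ! Create matching device copies once; they live for the whole run.
    !$acc enter data create(reduced_state, state)
    ! OpenMP4.5 equivalent (shown as a comment):
    ! !$omp target enter data map(alloc: reduced_state, state)
  end subroutine allocate_zone_data

end module zone_data_sketch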

SLIDE 8

HELMHOLTZ EOS

Basic Flow of Driver Program

2) Read in the tabulated Helmholtz free energy data and make a copy on the device (see the sketch below)

  • The table persists for the duration of the program
  • All thermodynamic quantities are interpolated from this table
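
In directive form, making the table resident on the device for the life of the run looks roughly like the sketch below; the array name f_table, the file name, and the table dimensions are placeholders for the actual Helmholtz table data.

! Hedged sketch: read the tabulated free energy once and mirror it on the device.
subroutine init_helm_table_sketch()
  implicit none
  real(8), allocatable, save :: f_table(:,:)   ! free energy (and derivatives) vs (rho, T)
  integer :: nrho, ntemp, iounit

  nrho = 541; ntemp = 201                      ! illustrative table dimensions
  allocate(f_table(nrho, ntemp))

  open(newunit=iounit, file='helm_table.dat', status='old', action='read')
  read(iounit, *) f_table
  close(iounit)

  ! Copy to the device once; the device copy persists for the duration of the program.
  !$acc enter data copyin(f_table)
  ! OpenMP4.5 equivalent (shown as a comment):
  ! !$omp target enter data map(to: f_table)
end subroutine init_helm_table_sketch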

SLIDE 9

HELMHOLTZ EOS

Basic Flow of Driver Program

3) For each time step:

  • Change the number of AMR blocks
  • Update the device with the new grid data
  • Launch the EOS kernel: calculate the interpolation for all grid zones
  • Update the host with the newly calculated quantities

These steps run inside a traditional OpenMP (CPU) parallel region in which each host thread gets a portion of the total zones (a sketch of that region follows below; the offload directives themselves are on the next slide).
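
A sketch of that host-side region, assuming each thread takes one contiguous chunk of zones; thread_id, start_element, and stop_element then feed the offload directives on the next slide. The chunking logic and routine name are illustrative, not the actual driver code.

! Hedged sketch of the traditional (CPU) OpenMP region around the offloaded EOS calls.
subroutine eos_all_zones_sketch(n_zones)
  use omp_lib, only: omp_get_thread_num, omp_get_num_threads
  implicit none
  integer, intent(in) :: n_zones
  integer :: thread_id, n_threads, chunk, start_element, stop_element

  !$omp parallel private(thread_id, n_threads, chunk, start_element, stop_element)
  thread_id = omp_get_thread_num()
  n_threads = omp_get_num_threads()
  chunk = (n_zones + n_threads - 1) / n_threads       ! zones per host thread, rounded up
  start_element = thread_id * chunk + 1
  stop_element  = min((thread_id + 1) * chunk, n_zones)

  if (start_element <= stop_element) then
     ! ... per-thread offload: the OpenACC or OpenMP4.5 directives on the next slide ...
  end if
  !$omp end parallel
end subroutine eos_all_zones_sketch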

SLIDE 10

HELMHOLTZ EOS

Basic Flow of Driver Program

OpenACC:

!$acc update device(reduced_state(start_element:stop_element)) async(thread_id + 1)
!$acc kernels async(thread_id + 1)
do zone = start_element, stop_element
   call eos(state(zone), reduced_state(zone))
end do
!$acc end kernels
!$acc update self(state(start_element:stop_element)) async(thread_id + 1)
!$acc wait

OpenMP4.5:

!$omp target update to(reduced_state(start_element:stop_element))
!$omp target
!$omp teams distribute parallel do thread_limit(128) num_threads(128)
do zone = start_element, stop_element
   call eos(state(zone), reduced_state(zone))
end do
!$omp end teams distribute parallel do
!$omp end target
!$omp target update from(state(start_element:stop_element))

SLIDE 11

HELMHOLTZ EOS

Driver Program Tests

  • Number of “AMR” blocks: 1, 10, 100, 1000, 10000 (each with 256 zones)
  • Ran with 1, 4, and 10 (CPU) OpenMP threads for each block count
  • OpenACC (PGI 16.10)
  • OpenMP4.5 (XL 15.1.5)

SLIDE 12

OPENACC VS OPENMP4.5

Current functionality for offloading to GPUs

  • PGI’s OpenACC implementation (version 16.10) has the more mature API
    – It has simply been around longer
  • IBM’s XL Fortran implementation of OpenMP4.5 (version 15.1.5) does not currently allow pinned memory or asynchronous data transfers / kernel execution
    – These features are coming soon (a sketch of what the asynchronous form could look like follows below)
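
For reference, OpenMP4.5 itself does define asynchronous offloading through nowait and depend clauses on the target constructs. Once the compiler supports them, an analogue of the OpenACC async/wait pattern on slide 10 could look roughly like this sketch (illustrative only, not tested with XL 15.1.5).

! Hedged sketch: asynchronous form of the update-compute-update sequence using
! OpenMP4.5 nowait/depend clauses (usable once the compiler supports them).
!$omp target update to(reduced_state(start_element:stop_element)) nowait &
!$omp   depend(out: reduced_state)
!$omp target teams distribute parallel do nowait &
!$omp   depend(in: reduced_state) depend(out: state)
do zone = start_element, stop_element
   call eos(state(zone), reduced_state(zone))
end do
!$omp end target teams distribute parallel do
!$omp target update from(state(start_element:stop_element)) nowait depend(in: state)
!$omp taskwait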

SLIDE 13

HELMHOLTZ EOS

Preliminary Performance Results

  • At low AMR block counts, kernel launch overhead is large and kernel execution time does not increase much with block count

SLIDE 14

HELMHOLTZ EOS

Preliminary Performance Results

  • The same is true for the multi-threaded case, but the overhead increases for each “send, compute, receive” sequence

SLIDE 15

HELMHOLTZ EOS

Preliminary Performance Results

  • At higher block counts, kernel overhead is negligible; the time is now dominated by device-to-host transfers

SLIDE 16

HELMHOLTZ EOS

Preliminary Performance Results

  • The same is true for the multi-threaded case, even with overlap of data transfers and kernel execution

SLIDE 17

HELMHOLTZ EOS

Preliminary Performance Results

  • At low AMR block counts, kernel execution times are roughly the same, and these dominate the overall time of each “send, compute, receive” sequence
  • Why are the kernel times so large? Temporary variables? CUDA thread scheduling?

SLIDE 18

HELMHOLTZ EOS

Preliminary Performance Results

  • Similar behavior for the multi-threaded case, but with serialized kernel-launch overhead

SLIDE 19

HELMHOLTZ EOS

Preliminary Performance Results

  • At larger AMR block counts, kernel times still dominate, but now we are saturating the GPU

SLIDE 20

HELMHOLTZ EOS

Preliminary Performance Results

  • Similar behavior for the multi-threaded case, but with no benefit from asynchronous execution

SLIDE 21

HELMHOLTZ EOS

Preliminary Performance Results

  • Compare with CPU-only OpenMP (dashed): 1, 4, and 10 CPU threads
  • No current advantage to using GPUs

SLIDE 22

HELMHOLTZ EOS

Preliminary Performance Results

  • GPUs give an advantage when there are >100 AMR blocks
  • The GPU can calculate 100 blocks in roughly the same time as 1 block
  • So in FLASH, we should compute 100s of blocks per MPI rank
  • Restructure to calculate many blocks at once (roughly as sketched below)
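
In FLASH terms, that restructuring amounts to gathering the zones of many blocks into one flat buffer so that a single offloaded EOS call (and one host-device transfer pair) covers hundreds of blocks. The sketch below is illustrative only; block_input and the reuse of the reduced_state/state arrays from the earlier sketches are assumptions, not the actual FLASH data structures.

! Hedged sketch: batch the zones of many AMR blocks into one EOS call.
subroutine eos_many_blocks_sketch(n_blocks, zones_per_block, block_input)
  use zone_data_sketch, only: eos_input_t, reduced_state, state
  implicit none
  integer, intent(in) :: n_blocks, zones_per_block
  type(eos_input_t), intent(in) :: block_input(zones_per_block, n_blocks)
  integer :: blk, zone, n_zones

  n_zones = n_blocks * zones_per_block

  do blk = 1, n_blocks                        ! gather: block-structured -> flat buffer
     do zone = 1, zones_per_block
        reduced_state((blk-1)*zones_per_block + zone) = block_input(zone, blk)
     end do
  end do

  !$acc update device(reduced_state(1:n_zones))
  !$acc kernels
  do zone = 1, n_zones                        ! one kernel launch covers all gathered blocks
     call eos(state(zone), reduced_state(zone))
  end do
  !$acc end kernels
  !$acc update self(state(1:n_zones))
end subroutine eos_many_blocks_sketch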

SLIDE 23

CONCLUSIONS

Current Snapshot

  • OpenACC (PGI 16.10)
    – Mature API with more features implemented
    – Currently better kernel-execution performance
  • XL Fortran (15.1.5) OpenMP4.5 implementation is still in its early stages
    – Some features are currently missing, but they are under active development
    – Some bugs are still being worked out
  • Looking into the long kernel execution times (related to CUDA thread scheduling?)