A first strike at an OpenACC C++ Monte Carlo code

A first strike at an OpenACC C++ Monte Carlo code
Seth R Johnson, Ph.D.
R&D Staff, Monte Carlo Methods
Radiation Transport Group
Exnihilo team: Greg Davidson, Tom Evans, Stephen Hamilton, Seth Johnson, Tara Pandya
Hackathon Mentors: Wayne Joubert, Jeff Larkin
ORNL is managed by UT–Battelle for the U.S. Department of Energy.

The codes
• Exnihilo: radiation transport framework
  – Multi-application (fusion, fission, detectors, homeland security)
  – Export controlled
• Profugus mini-app:
  – Written for algorithmic and HPC development
  – Limited capability
  – Reduced complexity

Introduction to the code environment
• C++11: unordered maps, auto, lambdas, etc.
• TPLs: Trilinos, HDF5
• Data structures are not POD, have irregular shape
  – Many distinct objects, dynamically sized vectors, shared pointers, etc. trade convenience for poorer data locality
  – Examples: particle, geometry cell, material attributes
  [Diagram labels: Geometry (Assy 1, Assy 2, Assy 3), Materials, Material, Cross section data]
• Production environment: Chester (OLCF cluster)
  – PGI 14.7.0 (a few months old)
  – CUDA 5.5 (more than 2 years old)

Introduction to Monte Carlo for neutronics
[Flowchart of one particle history]
• Particle born
• Calculate distance to collision
• Calculate distance to boundary
• Distance = collision?
  – YES: process collision. Particle killed? YES: history ends. NO: calculate the distances again.
  – NO: process surface. Particle escapes? YES: history ends. NO: calculate the distances again.
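The flowchart can be read as a single loop per particle. Below is a minimal sketch of that loop, assuming a one-cell 1-D slab with isotropic scattering. The names `Fate`, `transport_one`, `width`, `sigma_t`, and `p_absorb` are hypothetical and are not taken from Profugus; the standard-library `std::mt19937_64` stands in for whatever generator a production code would use.

```cpp
#include <cassert>
#include <cmath>
#include <limits>
#include <random>

// Possible ends of a particle history (hypothetical names).
enum class Fate { absorbed, escaped };

// Follow one particle through a 1-D slab occupying [0, width].
// sigma_t is the total cross section; p_absorb is the chance that a
// collision kills the particle instead of scattering it.
Fate transport_one(double width, double sigma_t, double p_absorb,
                   std::mt19937_64& rng)
{
    assert(width > 0.0 && sigma_t > 0.0);
    assert(p_absorb > 0.0 && p_absorb <= 1.0);
    std::uniform_real_distribution<double> uniform(0.0, 1.0);

    // Particle born at the slab center with a random direction cosine.
    double x = 0.5 * width;
    double mu = 2.0 * uniform(rng) - 1.0;

    while (true)
    {
        // Distance to collision: exponential with mean 1 / sigma_t.
        const double d_coll = -std::log(1.0 - uniform(rng)) / sigma_t;

        // Distance to the slab surface along the flight direction.
        double d_bound = std::numeric_limits<double>::infinity();
        if (mu > 0.0)
            d_bound = (width - x) / mu;
        else if (mu < 0.0)
            d_bound = -x / mu;

        if (d_coll < d_bound)
        {
            // Process collision: move, then absorb or scatter.
            x += mu * d_coll;
            if (uniform(rng) < p_absorb)
                return Fate::absorbed;  // particle killed
            mu = 2.0 * uniform(rng) - 1.0;
        }
        else
        {
            // Process surface: with a single cell, every surface is the
            // outer boundary, so the particle escapes.
            return Fate::escaped;
        }
    }
}
```

The number of trips through the `while` loop differs from particle to particle, which is the divergence the later slides refer to. A multi-cell geometry would replace the `else` branch with a move into the neighboring cell.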

Algorithmic challenges
• Inherently stochastic process
  – Fast, long-period random number sampling required
  – Highly divergent code paths between loops
  – There is no fixed-length nested “for loop” to parallelize
• Complex data structures built to mirror physical processes
  – Indirection, dynamic allocation, irregular data shapes
  – There is no homogeneous multi-dimensional array of data

Initial timing profile
• Ran a semi-realistic reactor assembly problem
• No compute-intensive bottlenecks to offload
[Figure: profiler call graph of the particle transport loop, showing mc::Manager::solve, profugus::KCode_Solver::solve, profugus::Source_Transporter::solve, and profugus::Domain_Transporter::transport above the profugus::Geometry, profugus::RTK_Array, profugus::RTK_Cell, profugus::Physics, profugus::Tallier, and profugus::Keff_Tally routines. Annotation: eight routines each take 5–20% of time.]

The initial plan
• Rewrite classes for on-device execution
  – Geometry, Physics, Particle, Transporter
• Put CPU-intensive routines on the GPU
  – Particle geometry tracking
  – Cross section sampling and collisions
  – Tallying
• Run a simplified reactor assembly problem
• Get new timing profile using GPUs

The immediate derailing of the initial plan
• Adding the -acc flag broke our code
  – No OpenACC (or other) pragmas were even being used
  – Unintelligible errors emitted from a standard library include inside Trilinos
  – Workaround: split OpenACC-dependent code into a subpackage that uses that flag, preventing its propagation elsewhere
• At least a day of team effort with NVIDIA/PGI to get a C++ class with multiple vectors compiling

The final plan
• Attempt to write an adapter class to flatten CPU classes into data structures suitable for OpenACC
• Write a simple random number generator
• Write a simple brick mesh ray tracer that can be parallelized with OpenACC
• Write simple OpenACC-enabled multigroup physics with data access and collisions
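One way to picture the adapter in the first bullet is shown below. It is a sketch only, assuming a host-side class that owns one dynamically sized vector per material. `Material`, `Flat_Materials`, and `flatten` are hypothetical names, and the offset-array layout is one common flattening choice, not necessarily the one the team used.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical host-side class: convenient, but each material's data
// lives in its own heap allocation.
struct Material
{
    std::vector<double> sigma_t;  // one entry per energy group
};

// Flattened form: one contiguous array plus an offset table, so a
// device kernel needs only two raw pointers.
struct Flat_Materials
{
    std::vector<double> data;          // all cross sections, back to back
    std::vector<std::size_t> offsets;  // material m spans [offsets[m], offsets[m+1])

    std::size_t num_groups(std::size_t m) const
    {
        assert(m + 1 < offsets.size());
        return offsets[m + 1] - offsets[m];
    }

    const double* sigma_t(std::size_t m) const
    {
        assert(m + 1 < offsets.size());
        return data.data() + offsets[m];
    }
};

// Copy every material's vector into the contiguous layout.
Flat_Materials flatten(const std::vector<std::shared_ptr<Material>>& mats)
{
    Flat_Materials flat;
    flat.offsets.reserve(mats.size() + 1);
    flat.offsets.push_back(0);
    for (const auto& mat : mats)
    {
        assert(mat);
        flat.data.insert(flat.data.end(),
                         mat->sigma_t.begin(), mat->sigma_t.end());
        flat.offsets.push_back(flat.data.size());
    }
    return flat;
}
```

The host code keeps its shared pointers and vectors. Only the flattened copy, reached through `data.data()` and `offsets.data()`, would be handed to an OpenACC data clause.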

What actually was accomplished
• 23 PGI compiler bug reports
  – PGI is the only compiler to support both OpenACC and C++11
  – We were probably the first group to use both in a production environment
• Primitive multigroup physics on the GPU
  – Driven through unit tests, reproduced CPU results
• Successfully ray-traced particles on brick mesh on the GPU
  – 20X faster if all particles do the same thing
  – 15X faster with divergence
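For readers unfamiliar with brick mesh ray tracing, here is a minimal sketch of the per-particle work, assuming a cube of `n`³ uniform cells of width `h` with one corner at the origin. `Ray` and `trace` are hypothetical names, and this is a generic cell-by-cell traversal, not the hackathon code.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

struct Ray
{
    double pos[3];  // current position
    double dir[3];  // flight direction (need not be normalized here)
};

// Move a ray from cell to cell until it leaves the mesh and return the
// number of cell surfaces it crossed, including the outer boundary.
int trace(Ray ray, int n, double h)
{
    assert(n > 0 && h > 0.0);
    int cell[3];
    for (int a = 0; a < 3; ++a)
    {
        cell[a] = static_cast<int>(std::floor(ray.pos[a] / h));
        assert(cell[a] >= 0 && cell[a] < n);
    }

    int crossings = 0;
    while (true)
    {
        // Distance to boundary: nearest cell face over the three axes.
        double dist = std::numeric_limits<double>::infinity();
        int axis = -1;
        for (int a = 0; a < 3; ++a)
        {
            if (ray.dir[a] == 0.0)
                continue;
            const int face = (ray.dir[a] > 0.0) ? cell[a] + 1 : cell[a];
            const double d = (face * h - ray.pos[a]) / ray.dir[a];
            if (d < dist)
            {
                dist = d;
                axis = a;
            }
        }
        assert(axis >= 0);  // the direction must not be the zero vector

        // Move to the surface and step into the neighboring cell.
        for (int a = 0; a < 3; ++a)
            ray.pos[a] += dist * ray.dir[a];
        cell[axis] += (ray.dir[axis] > 0.0) ? 1 : -1;
        ++crossings;

        if (cell[axis] < 0 || cell[axis] >= n)
            return crossings;  // the ray escaped the mesh
    }
}
```

Rays that start in the same cell with the same direction execute identical iterations, which corresponds to the "all particles do the same thing" case. Rays with different starting points or directions leave the loop after different numbers of crossings, which is where divergence comes from.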

C++ suggestions for OpenACC
• Separate compilation units for ACC code
  – Inline keyword gives the compilers trouble; always write in .cc files
  – Include as few headers as possible (no Trilinos) to avoid compiler errors from non-ACC code and to reduce compile time
• CPU data management with std::vector, then copy address to raw pointer for OpenACC
• Complexity hidden by ACC means more mysteries:
  – Do not rely on thread-private data
      84, Accelerator restriction: scalar variable live-out from loop: seed
      98, Loop carried scalar dependence for 'seed' at line 104
  – Issues with reduction operations on scalars
  – Do not use “const” class member data
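The std::vector and `seed` advice can be illustrated together. This is a sketch under stated assumptions: `next_uniform` and `sample_means` are hypothetical names, and the generator is a textbook 64-bit linear congruential generator (Knuth's MMIX constants), standing in for the generator the team wrote. Keeping one seed per particle in an array, and copying it into a variable declared inside the loop body, avoids the single shared `seed` scalar that the compiler messages above complain about. A compiler without OpenACC support ignores the pragmas and runs the loop serially.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Advance a 64-bit LCG state and return a double in [0, 1).
// Stand-in generator for illustration only.
#pragma acc routine seq
double next_uniform(std::uint64_t* state)
{
    *state = *state * 6364136223846793005ULL + 1442695040888963407ULL;
    // Keep the top 53 bits and scale by 2^-53.
    return static_cast<double>(*state >> 11) * (1.0 / 9007199254740992.0);
}

// For each particle, average `draws` uniform samples and store the
// advanced seed back so the next call continues the stream.
void sample_means(std::vector<std::uint64_t>& seeds,
                  std::vector<double>& means, int draws)
{
    assert(seeds.size() == means.size() && draws > 0);

    // std::vector manages the host memory; OpenACC sees raw pointers.
    std::uint64_t* seed_ptr = seeds.data();
    double* mean_ptr = means.data();
    const std::size_t n = seeds.size();

    #pragma acc parallel loop copy(seed_ptr[0:n]) copyout(mean_ptr[0:n])
    for (std::size_t i = 0; i < n; ++i)
    {
        // Declared inside the loop body, so each iteration has its own copy.
        std::uint64_t state = seed_ptr[i];
        double sum = 0.0;
        for (int d = 0; d < draws; ++d)
            sum += next_uniform(&state);
        mean_ptr[i] = sum / draws;
        seed_ptr[i] = state;
    }
}
```

A caller would fill `seeds` with distinct starting values, size `means` to match, and call `sample_means(seeds, means, draws)`. Whether a given compiler version accepts every construct here is something to verify locally.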

Positive takeaways
• Learned basics of OpenACC and how it can be used in a C++ environment
• Better understanding of the heterogeneous architecture and how it relates to OpenACC directives (prior knowledge of CUDA is helpful)
• For very simplified and specific MC problems, we may be able to achieve speedup and the ability to run full problems on the GPU using Profugus (with a lot of rewriting)

Negative takeaways
• Existing MC algorithms are fundamentally incompatible with OpenACC-type usage
  – Monte Carlo does not have nested, fixed-length loops
  – Memory-managed objects cannot be accelerated
• C++, PGI, and OpenACC do not currently get along
  – Two weeks of preparation to compile with PGI on Titan
  – C++11 incompatible with installed Cray compiler wrapper
  – Profiling tool issues with the code
  – Mystery compiler errors when turning on -acc on PGI
• No OpenACC libraries yet
  – We had to write a simple pseudorandom number generator
  – No microkernels or algorithms for sorting, binary search
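Without a device library, even a binary search has to be written by hand. The sketch below uses one to sample an outgoing energy group from a discrete cumulative distribution. `build_cdf` and `sample_group` are hypothetical names, not the Profugus routines, and the cumulative-distribution approach is an assumption about how such sampling might be done. The search uses only raw pointers and integer arithmetic so that it can be marked as a sequential device routine.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host side: turn non-negative weights into a normalized cumulative
// distribution whose last entry is exactly 1.
std::vector<double> build_cdf(const std::vector<double>& weights)
{
    assert(!weights.empty());
    double total = 0.0;
    for (double w : weights)
    {
        assert(w >= 0.0);
        total += w;
    }
    assert(total > 0.0);

    std::vector<double> cdf(weights.size());
    double running = 0.0;
    for (std::size_t i = 0; i < weights.size(); ++i)
    {
        running += weights[i];
        cdf[i] = running / total;
    }
    cdf.back() = 1.0;
    return cdf;
}

// Device-capable: return the first index whose cdf value exceeds xi,
// clamped to the last group. Requires n > 0 and a nondecreasing cdf.
#pragma acc routine seq
std::size_t sample_group(const double* cdf, std::size_t n, double xi)
{
    std::size_t lo = 0;
    std::size_t hi = n - 1;
    while (lo < hi)
    {
        const std::size_t mid = lo + (hi - lo) / 2;
        if (cdf[mid] > xi)
            hi = mid;
        else
            lo = mid + 1;
    }
    return lo;
}
```

With weights {2, 3, 5}, a random number of 0.1 selects group 0, 0.3 selects group 1, and 0.75 selects group 2.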

Concluding comments
• Hackathon was critical to kick-starting our investigation into Monte Carlo on the GPU
  – Resources: the compiler experts are there to help you
  – Time: you have a solid week to work in a focused environment with one task at hand
  – Perspective: you are not the only team struggling!
• OpenACC feasibility for C++
  – #pragma is not very pragmatic (inherently incompatible with C++ features): appropriate for Fortran
  – Compiler and environment are very difficult to get working
• Our next step: Kokkos as template-based abstraction layer

Acknowledgements
• This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725
• Thanks to our mentors Jeff and Wayne (and to Matt) for their help!
• And thanks to Fernanda and OLCF for making the Hackathon happen!
Profugus: http://ornl-cees.github.io/Profugus/
