Breaking Through the Barriers to GPU Accelerated Monte Carlo - PowerPoint PPT Presentation

Breaking Through the Barriers to GPU Accelerated Monte Carlo Particle Transport GTC 2018 Jeremy Sweezy Scientist Monte Carlo Methods, Codes and Applications Group 3/28/2018 Operated by Los Alamos National Security, LLC for the U.S. Department of Energy's NNSA LA-UR-18-XXXX

What is Monte Carlo Particle Transport?
– Follows the path of individual particles through a system
– Uses pseudo-random numbers to sample processes
– Randomly samples physical and non-physical processes
– Attributed to Stanislaw Ulam and Enrico Fermi
– Named because Ulam had an uncle who would borrow money from relatives because he “just had to go to Monte Carlo”
[Figure: FERMIAC]
Los Alamos National Laboratory 3/23/18 | 2

Porting to Specialized Hardware is Prohibitively Expensive
– The world’s production Monte Carlo codes have decades of development
– LANL’s MCNP code has been in development since 1977
– Equally extensive amount of V&V effort
– Codes have to run on desktop machines and supercomputers
– DOE HPC platforms have been in a state of flux for the last 10 years
  • Cell Broadband Engine
  • Intel Xeon Phi (MIC)
  • GPUs
  • ARM???
Barrier #1: Limited Resources (Money, People, Time)

Monte Carlo Random Walk on GPU Hardware has Reached a Performance Wall
• At least 6 different research groups have ported the Monte Carlo random walk to GPU hardware for neutron transport
• All report results against different numbers of CPUs
• All get the same results!
• Almost all are extremely simplified
• Production codes will likely have worse performance.
• What are the limitations?
  – Conditional branching
  – Random data access
  – No small computationally intensive kernel to accelerate
[Chart labels: 4.5x, 3.0x]
Barrier #2: Performance of random walk on GPUs

How do You Define Performance?
• A computer scientist might measure performance as an increase in speed:
  $P = \frac{T_{CPU}}{T_{GPU}}$
• A Monte Carlo specialist would measure performance as a balance between speed and statistical variance using a Figure-of-Merit:
  $FOM = \frac{\sigma_{CPU}^2 \, T_{CPU}}{\sigma_{GPU}^2 \, T_{GPU}}$
  Example: $FOM = \frac{0.1^2 \times 1\,\text{min}}{0.05^2 \times 2\,\text{min}} = 2$
To date, almost all GPU implementations of Monte Carlo particle transport have focused on increasing speed.
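The two performance measures can be compared numerically. The following is a minimal sketch using the numbers from the slide's example (σ = 0.1 after 1 minute on the CPU, σ = 0.05 after 2 minutes on the GPU). The function names `speedup` and `fom_ratio` are hypothetical and chosen only for illustration.

```python
def speedup(t_cpu, t_gpu):
    """Pure speed ratio: P = T_CPU / T_GPU."""
    return t_cpu / t_gpu


def fom_ratio(sigma_cpu, t_cpu, sigma_gpu, t_gpu):
    """Figure-of-Merit ratio: (sigma_CPU^2 * T_CPU) / (sigma_GPU^2 * T_GPU)."""
    return (sigma_cpu ** 2 * t_cpu) / (sigma_gpu ** 2 * t_gpu)


# Values from the slide's example (times in minutes).
p = speedup(t_cpu=1.0, t_gpu=2.0)
f = fom_ratio(sigma_cpu=0.1, t_cpu=1.0, sigma_gpu=0.05, t_gpu=2.0)

print(f"speed ratio P = {p:.2f}")
print(f"FOM ratio     = {f:.2f}")
```

In this example the GPU run takes twice as long (P = 0.5), yet it halves σ, so the Figure-of-Merit still favors it by a factor of about 2.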

Next Event Estimator
• Next-event estimator calculates the probability that a particle from a source or collision event reaches a point without interaction
• Typically used for image tallies
  $S(R, E) = \frac{w}{2\pi R^2} \times \sum_{i=1}^{N} \frac{\sigma_i(R, E)}{\sigma_T} \, p_i(\mu, E \to E') \, \exp\left(-\int_0^R \Sigma_T(s, E') \, ds\right)$
[Figure: Ray-cast; labels: Cell 1, Cell 2, Image Plane, A, B, μ]
One to two orders of magnitude faster on GPU hardware
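The ray-cast part of a next-event score can be sketched as follows. This is an illustrative simplification, not MonteRay's implementation. It assumes a single reaction channel, so the sum over reactions collapses to one term. It also assumes the ray has already been split into per-cell segments, each given as `(length, sigma_t)`. The names `optical_depth` and `next_event_score` are hypothetical.

```python
import math


def optical_depth(segments):
    """Integral of Sigma_T along the ray, for piecewise-constant cells.

    segments: list of (length_cm, sigma_t_per_cm) pairs, one per cell crossed.
    """
    return sum(length * sigma_t for length, sigma_t in segments)


def next_event_score(weight, p_mu, segments):
    """Score w * p(mu) * exp(-tau) / (2 * pi * R^2) for one ray.

    R is taken as the total path length from the event to the detector point.
    """
    distance = sum(length for length, _ in segments)
    attenuation = math.exp(-optical_depth(segments))
    return weight * p_mu * attenuation / (2.0 * math.pi * distance ** 2)


# One ray crossing a void gap and two material cells (made-up values).
ray = [(2.0, 0.0), (1.0, 0.5), (2.0, 0.2)]
score = next_event_score(weight=1.0, p_mu=0.5, segments=ray)
print(f"optical depth = {optical_depth(ray):.3f}")
print(f"score         = {score:.6e}")
```

Each ray is independent of every other ray and does the same arithmetic, which is what makes this kind of tally map well onto GPU threads.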

Traditional Track-Length Estimator
• The standard Monte Carlo fluence estimator
• Uses the sampled distance in each cell as the fluence estimator
• Only contributes to cells through which the particle passes
• Easy to compute
• Nothing to accelerate on GPU
[Figure labels: B, Cell 1, Cell 2, Cell 3]
Computing has changed; we need to change our algorithms too!
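For comparison, a track-length tally can be sketched in a few lines. The example below is a hypothetical 1-D slab version: cells are intervals on the x-axis, the particle moves in the +x direction, and the cell "volume" is its width. The names `score_track` and `sample_distance` are assumptions made for illustration.

```python
import math
import random


def sample_distance(sigma_t, rng):
    """Sample a distance to collision from an exponential distribution."""
    return -math.log(1.0 - rng.random()) / sigma_t


def score_track(x_start, distance, boundaries, weight, tally):
    """Add weight * (track length in cell) / (cell width) to each crossed cell."""
    x_end = x_start + distance
    for i in range(len(boundaries) - 1):
        lo, hi = boundaries[i], boundaries[i + 1]
        overlap = min(x_end, hi) - max(x_start, lo)
        if overlap > 0.0:
            tally[i] += weight * overlap / (hi - lo)


boundaries = [0.0, 1.0, 2.0, 3.0]   # three unit-width cells
tally = [0.0, 0.0, 0.0]

# Deterministic track: starts at x = 0.5 and travels 1.7 cm.
score_track(0.5, 1.7, boundaries, weight=1.0, tally=tally)
print([round(t, 3) for t in tally])

# A sampled track, shown only to illustrate the sampling step.
rng = random.Random(12345)
d = sample_distance(sigma_t=0.8, rng=rng)
```

The per-track work is a handful of comparisons and additions, which illustrates why there is little here for a GPU to accelerate.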

Volumetric-Ray-Casting Estimator
• For use in place of the traditional track-length estimator on GPU
• Multiple pseudo-rays are generated at each source and collision event
• Computationally intensive estimator with lower variance
  $F(i, E') = \frac{w \left(1 - e^{-\Sigma_{T,i}(E') \, l_i}\right)}{N \, \Sigma_{T,i}(E')} \exp\left(-\int_0^{|r' - r|} \Sigma_T(r + \Omega' s', E') \, ds'\right)$
[Figure: Ray-cast; labels: B, Cell 1, Cell 2, Cell 3]
A neutron dance for a neutron fan. – P.M. Dawn
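One way to read the estimator is as an expected track length: each pseudo-ray deposits, in every cell it crosses, the path length a particle would be expected to travel there, weighted by the probability of reaching that cell. The sketch below follows that reading for a single pseudo-ray. It is an assumption-laden illustration, not MonteRay's code. The function name `vrc_contributions` and the `(length, sigma_t)` segment format are hypothetical.

```python
import math


def vrc_contributions(weight, n_rays, segments):
    """Expected track length scored in each cell crossed by one pseudo-ray.

    segments: list of (length_cm, sigma_t_per_cm) pairs in the order crossed.
    Each cell gets (w / N) * exp(-tau_upstream) * (1 - exp(-sigma * l)) / sigma.
    """
    contributions = []
    survival = 1.0  # probability of reaching the current cell uncollided
    for length, sigma_t in segments:
        if sigma_t > 0.0:
            expected = (1.0 - math.exp(-sigma_t * length)) / sigma_t
        else:
            expected = length  # limit of the expression as sigma_t -> 0
        contributions.append(weight / n_rays * survival * expected)
        survival *= math.exp(-sigma_t * length)
    return contributions


# One of N = 4 pseudo-rays crossing three cells (made-up values).
segs = [(1.0, 0.5), (2.0, 0.0), (1.5, 1.2)]
c = vrc_contributions(weight=1.0, n_rays=4, segments=segs)
print([round(x, 5) for x in c])
```

Unlike the track-length tally, every cell along the ray receives a score, including cells the sampled particle itself never reaches. That is the extra arithmetic the GPU is being asked to do.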

MonteRay - Accelerating Monte Carlo Transport with GPU Ray Tracing
• MonteRay – A library for accelerating Monte Carlo tallies with GPUs
• Random walk is maintained on CPU
• Ray casting based tallies are calculated on the GPU
  – Next-Event estimator
  – Volumetric-Ray-Casting estimator, a new estimator designed for GPUs
  – Supports neutron and photon tallies
• Can be incorporated into new and legacy Monte Carlo codes
• Uses continuous energy cross-section data
• Single precision ray casting
• Single precision attenuation cross-sections
• Double precision tallies
Reduces cost of accelerating an existing Monte Carlo code with GPUs
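The CPU/GPU split described above can be pictured as a buffer. The random walk on the CPU appends rays, and a full buffer is handed off in one batch to the tally routine. The sketch below is only a schematic of that pattern in plain Python. `RayBuffer` and `fake_gpu_tally` are hypothetical names and do not reflect MonteRay's actual API. The "GPU" here is an ordinary function standing in for a kernel launch.

```python
class RayBuffer:
    """Collects rays on the host and hands them off in fixed-size batches."""

    def __init__(self, capacity, process_batch):
        self.capacity = capacity
        self.process_batch = process_batch  # stand-in for a GPU kernel launch
        self.rays = []

    def add(self, ray):
        self.rays.append(ray)
        if len(self.rays) >= self.capacity:
            self.flush()

    def flush(self):
        """Send any buffered rays for processing and start a new batch."""
        if self.rays:
            self.process_batch(self.rays)
            self.rays = []


batch_sizes = []
tally = [0.0]  # Python floats are double precision


def fake_gpu_tally(batch):
    """Placeholder for the device-side tally: sums the ray weights."""
    batch_sizes.append(len(batch))
    tally[0] += sum(weight for weight, _ in batch)


buffer = RayBuffer(capacity=3, process_batch=fake_gpu_tally)
for ray_id in range(7):          # rays produced by the CPU random walk
    buffer.add((1.0, ray_id))    # (weight, id) is a made-up ray record
buffer.flush()                   # send the final partial batch

print(batch_sizes, tally[0])
```

Keeping the random walk untouched and exchanging only ray records is what would let an existing code adopt this pattern without rewriting its physics.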

MonteRay - Testing
• Tests use:
  – GeForce GTX TitanX GPU with NVIDIA Maxwell architecture
  – 2 CPUs (Intel Haswell E5-2660 v3 at 2.60 GHz), with 10 cores each
• MonteRay linked with LANL’s C++ Monte Carlo code MCATK
• MCATK uses MPI parallelism, building shared ray buffers using MPI-3 shared memory
• 3-D Cartesian Structured Mesh Geometry
• 2 tests measured performance of the Next-event estimator
• 4 tests measured the performance of the Volumetric-ray-casting estimator
• Volumetric-ray-casting estimator performance on GPU compared to the Track-length estimator performance on the CPU
• Base performance measured as compared to 8 CPU cores

Testing the Next-Event Estimator on GPU Hardware: Two Radiography Tests

MonteRay – Medical X-Ray Imaging Simulation
• 50-keV X-ray beam
• 0.12 mm spot size
• Radiograph used Next-Event Estimator
• Simulation useful for designing collimator to minimize scattered contribution

MonteRay – Medical X-Ray Imaging Simulation
• Source and Collided contribution calculated separately
• Source contribution relatively easy to calculate
• Collided contribution important for collimator design
• Collided performance 15-18x
[Chart labels: 14.5x, 15.3x]

MonteRay – Industrial Radiography
• Simulated a physical test object used at Los Alamos’ Dual Axis Radiographic Hydrodynamic Test Facility
• Used 4-MeV mono-energetic X-ray beam
• 100 x 100 image grid (10,000 estimators) to simulate image detector
• Calculation of scatter component needed to design collimators and experiment, but too computationally expensive
I'm a peeping-tom techie with x-ray eyes – Patrick Lee MacDonald

MonteRay – Industrial Radiography
[Plot: GPU Performance vs Number of CPU Cores; series: Source, Collided; x-axis: Number of CPU Cores / GPU; y-axis: Relative Performance; labeled values 28.5x, 24.2x]
Collided calculation performance 15-32x!

Volumetric-Ray-Casting Estimator on GPU Hardware vs Track-Length Estimator on CPU Hardware

Cancer Treatment Simulation
• 2-MeV Photon beam (peak of 6 MV medical accelerator photon spectrum)
• 1-cm beam radius
What is the dose to healthy tissue?
[Figure: GPU Performance vs 8 CPU Cores; labels: Tumor, 2-MeV Photon Beam]
14x performance improvement in healthy tissue

Cancer Treatment Simulation
[Plot: GPU Performance vs Number of CPU Cores in Healthy Tissue; labeled values 14.3x, 10.2x]
Performance is 14x vs 8 CPU cores or 10x vs 12 CPU cores

Pressurized Water Reactor Assembly Simulation
• 16x16 Fuel Assembly
• Performance 7.5x in the Control Rods, 5x in the fuel, and 4.5x in the coolant
[Figure: GPU Performance vs 8 CPU Cores; labels: Fuel Pin, Control Rod]

Pressurized Water Reactor Assembly Simulation
[Plot: GPU Performance vs Number of CPU Cores; labeled values 7.2x, 5.4x, 6.0x, 4.4x]
Compared to 8 CPU cores, performance is 7.2x in the control rod and 6.0x in the fuel

Criticality Accident Simulation
• Critical Uranium sphere in the corner of a concrete room
• Concrete floor, walls, ceiling, and 4 concrete pillars
[Figure: GPU Performance vs 8 CPU Cores; label: Uranium Sphere]
Performance increase of 14-16x in the center of the room

Criticality Accident Simulation – Smoother Fluence Estimate
[Figure panels: Track-Length Estimator, Volumetric-Ray-Casting Estimator]

Criticality Accident Simulation
[Plot: GPU Performance vs Number of CPU Cores; labeled values 15x, 10.5x]
Things are going great, and they’re only getting better – Patrick Lee MacDonald

Reflected Godiva Criticality Experiment Simulation
• U-235 sphere reflected by water
• Performance Improvement
  – 2.5x in the core
  – 1.0x in the water
[Figure: GPU Performance vs 8 CPU Cores]

Reflected Godiva Criticality Experiment Simulation
• Variance of the Volumetric-Ray-Casting estimator approaches that of the Track-Length estimator in strongly scattering material.
[Plot: Variance Ratio vs Num. Collisions; x-axis: Number of Samples per Collision (N); y-axis: Variance Ratio ($\sigma_{TL}^2 / \sigma_{VRC}^2$)]
[Plot: GPU Performance vs. Num. CPU Cores; labeled values 2.2x, 2.2x]
Performance is limited by the estimator variance, not the GPU speed
