PORTABLE PERFORMANCE FOR MONTE CARLO SIMULATION OF PHOTON MIGRATION IN 3D TURBID MEDIA FOR SINGLE AND MULTIPLE GPUS
Fanny Nina-Paravecino, Leming Yu, Qianqian Fang*, David Kaeli
Department of Electrical and Computer Engineering
Department of Bioengineering*

SIMULATION OF PHOTON TRANSPORT INSIDE HUMAN BRAIN
• Photon migration in 3D turbid media
• Prediction of experimental outcomes
• Simulation is a time-consuming task
GTC April 4-7, 2016 | Silicon Valley

MCX.SPACE

MCX AROUND THE WORLD
• Over 30,000 unique visits from 148 countries
• Cumulative downloads exceed 12,000 worldwide
• Over 900 registered users, from more than 350 institutions/companies around the world

MCX STATISTICS

OUTLINE
• Portable Performance
  • Monte Carlo Extreme (MCX)
  • MCX in CUDA
  • Persistent Threads in CUDA (MCX)
  • Portable Performance MCX
  • Other enhancements
  • Results
• MCX on multiple GPUs
  • Performance Model
  • Partitioning Schemes
  • Performance Results

PORTABLE PERFORMANCE
[Figure labels: Photons initialization; MCX; 3D voxelated media]

MONTE CARLO EXTREME (MCX)
• Estimates the 3D light (fluence) distribution by simulating a large number of independent photons
• Most accurate algorithm for a wide range of optical properties, including low-scattering/voids, high absorption and short source-detector separation
• Computationally intensive, so a great target for GPU acceleration
• Widely adopted for bio-optical imaging applications:
  • Optical brain functional imaging
  • Fluorescence imaging of small animals for drug development
  • Gold standard for validating new optical imaging instrumentation designs and algorithms

MCX APPLICATIONS
• Simulation of photons inside human brain
• Imaging of bone marrow in the tibia
• Imaging of a complex mouse model using Monte Carlo simulations

MCX IN CUDA [1]
[Flowchart: MCX simulation flow, split between CPU and GPU]
• CPU side: Start; seed GPU RNG with CPU RNG; retrieve solution; normalize & save solution; end of simulation
• GPU side (Thread i, Thread i+1, …): launch a new photon; compute a new scattering length; propagate photon until cross voxel boundary; compute attenuation based on absorption; accumulate photon energy loss to the volume; compute a new scattering direction vector; terminate photon; end of thread
• Decisions (yes/no): End of scattering path? Exceeding time gate? Total photon # reached? Repetition complete?
• Annotations: Loop of repetitions; (optional); Global Memory
[1] Q. Fang and D. A. Boas. "Monte Carlo simulation of photon migration in 3D turbid media accelerated by graphics processing units." Optics Express 17.22 (2009): 20178-20190.
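The per-photon loop in the flowchart can be illustrated with a small CPU-only sketch. This is not the MCX kernel: it assumes a homogeneous cube of 1 mm voxels, isotropic scattering, and energy deposited once per scattering step rather than at every voxel crossing. The function name `simulate` and all parameter values are hypothetical.

```python
import math
import random

def simulate(n_photons, size=20, mu_a=0.05, mu_s=1.0, t_gate=5e-9, seed=1):
    """Toy photon random walk in a size^3 cube of 1 mm voxels (illustrative only)."""
    rng = random.Random(seed)
    c_mm_per_s = 2.998e11 / 1.37      # light speed in the medium, n = 1.37 assumed
    fluence = {}                      # (ix, iy, iz) -> absorbed weight
    unabsorbed = 0.0                  # weight that left the cube or hit the time gate
    for _ in range(n_photons):
        # Launch a new photon at the cube centre, heading along +z.
        pos = [size / 2.0] * 3
        direction = (0.0, 0.0, 1.0)
        weight, t = 1.0, 0.0
        while True:
            # Compute a new scattering length (exponentially distributed).
            step = -math.log(1.0 - rng.random()) / mu_s
            pos = [p + step * d for p, d in zip(pos, direction)]
            t += step / c_mm_per_s
            # Terminate the photon if it exceeds the time gate or exits the volume.
            if t > t_gate or not all(0.0 <= p < size for p in pos):
                unabsorbed += weight
                break
            # Compute attenuation from absorption and accumulate the loss.
            loss = weight * (1.0 - math.exp(-mu_a * step))
            voxel = tuple(int(p) for p in pos)
            fluence[voxel] = fluence.get(voxel, 0.0) + loss
            weight -= loss
            # Compute a new scattering direction (isotropic, an assumption).
            cos_t = 2.0 * rng.random() - 1.0
            sin_t = math.sqrt(1.0 - cos_t * cos_t)
            phi = 2.0 * math.pi * rng.random()
            direction = (sin_t * math.cos(phi), sin_t * math.sin(phi), cos_t)
    return fluence, unabsorbed
```

Because every photon starts with weight 1, the absorbed weight plus the unabsorbed weight should add up to the number of photons launched, which gives a quick sanity check on any variant of this loop.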

PERSISTENT THREADS (PT) IN MCX
• PT kernels alter the notion of a virtual thread lifetime, treating those threads as physical hardware threads
• PT kernels provide a view that threads are active for the entire duration of the kernel
• We schedule only as many threads as the GPU SMs can concurrently run
• The threads remain active until end of kernel execution
[Diagram: worker thread lifecycle. Thread initializes and enters thread loop; thread loops continuously; thread exits loop, cleans up, and shuts down]
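The persistent-thread pattern can be mimicked on the CPU as a rough analogy: start a fixed number of workers once, and let each one loop until the photon budget is exhausted. The shared counter, the lock, and the name `run_persistent_workers` are assumptions made for illustration; the slides do not say how MCX hands photons to its threads.

```python
import threading

def run_persistent_workers(total_photons, n_workers):
    """Fixed pool of long-lived workers that loop until no photons remain."""
    remaining = total_photons
    lock = threading.Lock()
    done = [0] * n_workers            # photons processed by each worker

    def worker(wid):
        nonlocal remaining
        while True:                   # the worker stays alive for the whole run
            with lock:
                if remaining == 0:
                    return            # exit the loop and shut down
                remaining -= 1
            done[wid] += 1            # stand-in for simulating one photon

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done
```

The worker count is chosen once, up front, which mirrors the idea of launching only as many threads as the hardware can run concurrently.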

PORTABLE PERFORMANCE MCX

Feature             Fermi  Kepler  Maxwell
MaxThreadBlocks/MP  8      16      32
MaxThreads/MP       1536   2048    2048
MP                  16     14      22
CUDA cores/MP       32     192     128

autoBlock = MaxThreadsPerMP / MaxBlocksPerMP
autoThread = autoBlock * MaxBlocksPerMP * MP
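As a worked example, the two formulas can be evaluated for the three architectures in the table. The dictionary layout and the name `auto_launch` are hypothetical, and the Maxwell entry assumes 2048 threads per multiprocessor.

```python
# Per-architecture limits taken from the table (Maxwell threads/MP assumed 2048).
ARCH = {
    "Fermi":   {"max_blocks_per_mp": 8,  "max_threads_per_mp": 1536, "mp": 16},
    "Kepler":  {"max_blocks_per_mp": 16, "max_threads_per_mp": 2048, "mp": 14},
    "Maxwell": {"max_blocks_per_mp": 32, "max_threads_per_mp": 2048, "mp": 22},
}

def auto_launch(spec):
    """Return (autoBlock, autoThread) following the two formulas on the slide."""
    auto_block = spec["max_threads_per_mp"] // spec["max_blocks_per_mp"]
    auto_thread = auto_block * spec["max_blocks_per_mp"] * spec["mp"]
    return auto_block, auto_thread

for name, spec in ARCH.items():
    print(name, auto_launch(spec))
# Fermi (192, 24576)
# Kepler (128, 28672)
# Maxwell (64, 45056)
```

In each case autoThread equals the maximum number of resident threads on the whole device, so the launch fills every multiprocessor without oversubscribing it.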

OTHER ENHANCEMENTS
• Autopilot improvement
• Developed customized operations such as mcx_nextafter
• Reduced the use of shared memory
  • Enables more threads to be launched
• Avoided branch divergence by using indexes

IMPROVEMENT PER ENHANCEMENT
[Stacked bar chart: share of the overall performance improvement (0% to 100%) per enhancement, for 980Ti (overall 1.4x) and GK110 (overall 2.4x). Legend: Autopilot; Reducing Shared Memory; Increasing Local Memory/Hide Latency; Avoid branch divergence/Customized function]

PERFORMANCE MCX - RESULTS
Baseline: MCX version Sep 12, 2015

Arch     GPU      Photons/ms (baseline)  Photons/ms (improved)  Speedup
Fermi    GTX 590  2044.99                2901.92                1.4x
Kepler   GT 730   529.89                 1263.74                2.4x
Kepler   GK110    2383.22                5238.34                2.2x
Maxwell  980Ti    12268.98               19157.09               1.4x

[Bar chart: Performance (photons/ms) for GTX 590, GT 730, GK110, and 980Ti]

MCX AS A BENCHMARK
Performance is changing dramatically
• Same input: -10x
• Same code sequence: +10x
[Bar chart, CUDA 7.5, Maxwell Compute 5.2 (980Ti): size (KB) of MCX_core.sass and MCX_core.ptx for Baseline, After Improvement, and After Improvement with Hack. Data labels: 1799.76, 1550.63, 1368.96; 0.14, 0.14, 0.14]

MCX ON MULTIPLE GPUS

MOTIVATION
• Monte Carlo eXtreme (MCX) simulation in OpenCL
• Distribute workloads among different devices
  • NVIDIA GPUs / AMD GPUs / CPUs
[Diagram: MCXCL platform with a partitioning scheme dispatching threads to GPU 1, GPU 2, and GPU 3]

METHODOLOGY
• Predict the kernel execution time
  • Evaluate the kernel runtime
  • Develop the performance model
• Partitioning Schemes
  • Core-based: the number of parallel compute units
  • Throughput: application throughput (photons/ms)
  • Iterative: throughput-based iterative partitioning
  • Fminimax: nonlinear linear programming solution for minimax problem
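The minimax idea behind the Fminimax scheme can be sketched under one simplifying assumption: if each device's kernel time is linear in the photons it receives (time = a * photons + c), the split that minimizes the slowest device's time makes all devices finish together, and it has a closed form. This sketch does not call any solver; `minimax_partition` and its coefficients are hypothetical, and it assumes every device ends up with a non-negative share.

```python
def minimax_partition(total, models):
    """Split `total` photons so all devices finish at the same time.

    models: list of (a, c) pairs, with predicted time = a * photons + c and a > 0.
    Returns (shares, finish_time). Assumes every resulting share is >= 0.
    """
    inv_sum = sum(1.0 / a for a, _ in models)
    # Solve sum_i (T - c_i) / a_i = total for the common finish time T.
    finish = (total + sum(c / a for a, c in models)) / inv_sum
    shares = [(finish - c) / a for a, c in models]
    return shares, finish

shares, finish = minimax_partition(290.0, [(1.0, 10.0), (2.0, 0.0)])
print(shares, finish)   # [190.0, 100.0] 200.0
```

The faster device (smaller a) receives more photons, and any launch overhead c reduces a device's share accordingly.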

PERFORMANCE MODEL
• Measure the kernel execution time on various devices
• Simulate 1M to 25M photon migrations

PERFORMANCE MODEL
• Given n devices: $D_1, D_2, \ldots, D_n$
• Given linear performance for each device
• Given the performance for 1M and 2M for each device
We can obtain the linear equation for each device as follows:
Device 1: $y_1 = a_1 x_1 + c_1$
Device 2: $y_2 = a_2 x_2 + c_2$
...
Device n: $y_n = a_n x_n + c_n$
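With two measurements per device, the slope and intercept of each line follow directly. The sketch below assumes x is the photon count and y the kernel time in ms; the timing values are made up for illustration, and `fit_line` and `predict` are hypothetical helper names.

```python
def fit_line(x1, y1, x2, y2):
    """Return (a, c) of the line y = a * x + c through two measured points."""
    a = (y2 - y1) / (x2 - x1)
    c = y1 - a * x1
    return a, c

def predict(a, c, x):
    """Predicted kernel time for x photons under the linear model."""
    return a * x + c

# Hypothetical timings: 120 ms for 1M photons, 220 ms for 2M photons.
a, c = fit_line(1e6, 120.0, 2e6, 220.0)
print(a, c, predict(a, c, 25e6))
```

Here the slope is about 1e-4 ms per photon and the intercept about 20 ms, so the model extrapolates to roughly 2520 ms for 25M photons.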

PARTITIONING SCHEME ELABORATION
• Core-based: $\mathrm{ComputeUnits}_i / \sum_j \mathrm{ComputeUnits}_j$
• Throughput: $\mathrm{Throughput}_i / \sum_j \mathrm{Throughput}_j$
• Iterative Approximation:
  • Core-based initialization
  • Iteratively evaluate throughput-based partitioning
  • Stop when achieving the max throughput
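The three schemes above can be sketched as follows. This is one reading of the slide rather than the MCXCL code: `measure_ms` is a hypothetical callback that returns a device's kernel time in ms (assumed positive) for a given photon count, and the stopping rule (stop when overall throughput no longer improves) is an assumption.

```python
def proportional_split(total, weights):
    """Split `total` photons in proportion to `weights` (core counts or throughputs)."""
    s = sum(weights)
    shares = [int(total * w / s) for w in weights]
    # Hand the rounding remainder to the device with the largest weight.
    shares[weights.index(max(weights))] += total - sum(shares)
    return shares

def iterative_partition(total, compute_units, measure_ms, max_iters=20):
    """Core-based start, then repeated throughput-based re-partitioning."""
    shares = proportional_split(total, compute_units)      # core-based initialization
    best_shares, best_tp = shares, 0.0
    for _ in range(max_iters):
        times = [measure_ms(i, n) for i, n in enumerate(shares)]
        overall = total / max(times)                       # photons/ms of the whole run
        if overall <= best_tp:
            break                                          # no further improvement
        best_shares, best_tp = shares, overall
        rates = [n / t for n, t in zip(shares, times)]     # per-device photons/ms
        shares = proportional_split(total, rates)          # throughput-based step
    return best_shares, best_tp

# Hypothetical devices: device 0 is twice as fast per photon as device 1.
def fake_measure(i, n):
    return (0.001, 0.002)[i] * n + 1.0

print(iterative_partition(1_000_000, [1, 1], fake_measure))
```

Starting from an even core-based split, the loop shifts photons toward the faster device until both finish at nearly the same time.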

PERFORMANCE RESULTS
[Bar chart: Core-based, Throughput, Iterative, and Fminimax schemes at 10M and 100M photons, for GTX 980 Ti + GTX 590 + GT 730 and for K40c + K20c]

Throughput utilization, GTX 980 Ti + GTX 590 + GT 730 (max throughput 9688 photons/ms)
Scheme      10M     100M
Core-based  35.01%  41.65%
Throughput  59.31%  93.42%
Iterative   68.85%  93.77%
Fminimax    68.85%  93.77%

Throughput utilization, K40c + K20c (max throughput 30323 photons/ms)
Scheme      10M     100M
Core-based  85.31%  97.56%
Throughput  80.39%  87.89%
Iterative   80.39%  87.89%
Fminimax    80.39%  87.89%

PERFORMANCE RESULTS
[Bar chart: Core-based, Throughput, Iterative, and Fminimax schemes at 10M and 100M photons, for AMD 7970M + Intel i7-3740QM and for AMD 7970 + Fiji + Intel i7-4770]

Throughput utilization, AMD 7970M + Intel i7-3740QM (max throughput 4529 photons/ms)
Scheme      10M     100M
Core-based  19.32%  18.69%
Throughput  18.81%  27.14%
Iterative   18.78%  27.91%
Fminimax    18.78%  27.91%

Throughput utilization, AMD 7970 + Fiji + Intel i7-4770 (max throughput 19176 photons/ms)
Scheme      10M     100M
Core-based  15.10%  19.06%
Throughput  16.38%  21.10%
Iterative   16.38%  21.10%
Fminimax    16.38%  21.10%

SUMMARY
• We have improved the performance of MCX across a range of NVIDIA GPU architectures
• We have shown how to exploit Persistent Thread kernels to automatically tune the MCX kernel
• We developed an iterative scheme to search for the best partition to run MCX on multiple accelerators
• We obtained a 24% and 44% throughput utilization improvement (Iterative vs Core-based) for 10M and 100M photon simulations, respectively

FUTURE WORK
• Instrumentation of MCX
  • Leverage SASSI to instrument MCX and better characterize the behavior of a kernel to guide auto-tuning
• MCX on Multiple GPUs
  • Evaluate our partitioning optimization for multiple devices

MCX CHALLENGE
• Interested in improving the performance of MCX by over 40% compared to the current version?
• Monetary reward will be announced soon.
• Stay tuned to mcx.space

ACKNOWLEDGEMENT
• This project is funded by the NIH/NIGMS under the grant R01-GM114365
• We would like to acknowledge NVIDIA for their support for this work through the NVIDIA Research Center program

THANK YOU! QUESTIONS?
fninaparavecino@ece.neu.edu
ylm@ece.neu.edu
