PORTABLE PERFORMANCE FOR MONTE CARLO SIMULATION OF PHOTON


  1. PORTABLE PERFORMANCE FOR MONTE CARLO SIMULATION OF PHOTON MIGRATION IN 3D TURBID MEDIA FOR SINGLE AND MULTIPLE GPUS
     Fanny Nina-Paravecino, Leming Yu, Qianqian Fang*, David Kaeli
     Department of Electrical and Computer Engineering
     *Department of Bioengineering

  2. SIMULATION OF PHOTON TRANSPORT INSIDE HUMAN BRAIN
     • Photon migration in 3D turbid media
     • Prediction of experimental outcomes
     • Simulation is a time-consuming task
     GTC April 4-7, 2016 | Silicon Valley

  3. MCX.SPACE

  4. MCX AROUND THE WORLD
     • Over 30,000 unique visits made from 148 countries
     • Cumulative downloads exceed 12,000 worldwide
     • Over 900 registered users, from more than 350 institutions/companies around the world

  5. MCX STATISTICS

  6. OUTLINE
     • Portable Performance Monte Carlo Extreme (MCX)
       - MCX in CUDA
       - Persistent Threads in CUDA (MCX)
       - Portable Performance MCX
       - Other enhancements
       - Results
     • MCX on multiple GPUs
       - Performance Model
       - Partitioning Schemes
       - Performance Results

  7. PORTABLE PERFORMANCE MCX
     [Figure labels: Photons initialization; 3D voxelated media]

  8. MONTE CARLO EXTREME (MCX)
     • Estimates the 3D light (fluence) distribution by simulating a large number of independent photons
     • Most accurate algorithm for a wide range of optical properties, including low-scattering/voids, high absorption and short source-detector separation
     • Computationally intensive, so a great target for GPU acceleration
     • Widely adopted for bio-optical imaging applications:
       - Optical brain functional imaging
       - Fluorescence imaging of small animals for drug development
       - Gold standard for validating new optical imaging instrumentation designs and algorithms

  9. MCX APPLICATIONS
     • Simulation of photons inside human brain
     • Imaging of bone marrow in the tibia
     • Imaging of a complex mouse model using Monte Carlo simulations

  10. MCX IN CUDA [1]
      [Flowchart: a CPU loop of repetitions launches GPU threads (Thread i, Thread i+1, …), and each thread simulates photons]
      • CPU: Start; Seed GPU RNG with CPU RNG; Loop of repetitions; Retrieve solution; Repetition complete? (y/n); Normalize & save solution; End of simulation
      • GPU (per thread): Launch a new photon; Compute a new scattering length; Propagate photon until cross voxel boundary; Compute attenuation based on absorption; Accumulate photon energy loss to the volume (Global Memory); End of scattering path? (y/n); Compute a new scattering direction vector; Exceeding time gate? (y/n); Terminate photon; Total photon # reached? (y/n); End of thread
      • Additional label in the figure: (optional)
      [1] Q. Fang and D. A. Boas. "Monte Carlo simulation of photon migration in 3D turbid media accelerated by graphics processing units." Optics Express 17.22 (2009): 20178-20190.
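The per-photon steps in the flowchart can be illustrated with a small CPU-only sketch. This is not the MCX kernel: it assumes a homogeneous cube, isotropic scattering, a path-length limit standing in for the time gate, and one deposit per scattering step rather than per voxel crossing. The names `simulate_photons`, `isotropic_direction`, `mu_a`, `mu_s`, and `max_path` are hypothetical and chosen only for this example.

```python
import math
import random


def isotropic_direction(rng):
    """Draw a unit vector uniformly on the sphere (assumes isotropic scattering)."""
    cos_t = 2.0 * rng.random() - 1.0
    sin_t = math.sqrt(1.0 - cos_t * cos_t)
    phi = 2.0 * math.pi * rng.random()
    return (sin_t * math.cos(phi), sin_t * math.sin(phi), cos_t)


def simulate_photons(n_photons, mu_a=0.05, mu_s=1.0, size=20, max_path=100.0, seed=1):
    """Toy photon migration in a size^3 cube of unit voxels.

    Returns (deposited, escaped): energy absorbed per voxel, and the total
    weight carried by photons that left the cube or exceeded max_path.
    """
    rng = random.Random(seed)
    deposited = {}   # stands in for the fluence volume kept in GPU global memory
    escaped = 0.0
    for _ in range(n_photons):
        # Launch a new photon at the cube centre with unit weight.
        pos = [size / 2.0] * 3
        direction = isotropic_direction(rng)
        weight, path = 1.0, 0.0
        while True:
            # Compute a new scattering length (exponential free path).
            step = -math.log(1.0 - rng.random()) / mu_s
            pos = [p + step * d for p, d in zip(pos, direction)]
            path += step
            # Terminate when the photon leaves the volume or exceeds the gate.
            if not all(0.0 <= p < size for p in pos) or path > max_path:
                escaped += weight
                break
            # Compute attenuation based on absorption along the step.
            attenuated = weight * math.exp(-mu_a * step)
            # Accumulate the energy loss into the voxel where the step ends.
            voxel = tuple(int(p) for p in pos)
            deposited[voxel] = deposited.get(voxel, 0.0) + (weight - attenuated)
            weight = attenuated
            # Compute a new scattering direction vector.
            direction = isotropic_direction(rng)
    return deposited, escaped
```

Because every photon starts with unit weight, the deposited energy plus the escaped weight should add up to the number of photons launched, which gives a quick sanity check on the bookkeeping.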

  11. PERSISTENT THREADS (PT) IN MCX
      • PT kernels alter the notion of a virtual thread lifetime, treating those threads as physical hardware threads
      • PT kernels provide a view that threads are active for the entire duration of the kernel
        - We schedule only as many threads as the GPU SMs can concurrently run
        - The threads remain active until end of kernel execution
      [Figure: worker thread lifecycle. Thread initializes and enters the thread loop; thread loops continuously; thread exits the loop, cleans up, and shuts down]
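The worker-thread lifecycle can be sketched on the CPU as a fixed pool of long-lived workers that keep looping until the work runs out. This is a generic illustration of the persistent-thread pattern, not MCX code: the shared counter used to hand out photons is an assumption (the slides do not say how MCX distributes photons among threads), and `run_persistent_workers` is a hypothetical name.

```python
import threading


def run_persistent_workers(total_photons, n_workers):
    """Start n_workers long-lived workers that share total_photons work items.

    Returns the number of photons each worker processed.
    """
    next_photon = 0
    lock = threading.Lock()
    per_worker = [0] * n_workers

    def worker(wid):
        nonlocal next_photon
        # The worker initializes once, then loops until no work is left.
        while True:
            with lock:
                if next_photon >= total_photons:
                    return          # exit the loop, clean up, shut down
                next_photon += 1
            per_worker[wid] += 1    # stand-in for simulating one photon

    # Launch only as many workers as the "hardware" can run concurrently.
    threads = [threading.Thread(target=worker, args=(w,)) for w in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return per_worker
```

The number of workers is decoupled from the number of photons: `run_persistent_workers(10_000, 8)` processes all 10,000 items with 8 workers, whatever the split between them turns out to be.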

  12. PORTABLE PERFORMANCE MCX

      | Feature            | Fermi | Kepler | Maxwell |
      |--------------------|-------|--------|---------|
      | MaxThreadBlocks/MP | 8     | 16     | 32      |
      | MaxThreads/MP      | 1536  | 2048   | 2048    |
      | MP                 | 16    | 14     | 22      |
      | CUDA cores/MP      | 32    | 192    | 128     |

      autoBlock = MaxThreadsPerMP / MaxBlocksPerMP
      autoThread = autoBlock * MaxBlocksPerMP * MP
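As a worked example, the two formulas can be evaluated for the three architectures in the table. The helper name `auto_launch_config` is hypothetical, and the Maxwell thread limit is taken as 2048 per multiprocessor, the documented limit for compute capability 5.x.

```python
def auto_launch_config(max_threads_per_mp, max_blocks_per_mp, mp_count):
    """Apply the slide's formulas and return (autoBlock, autoThread)."""
    auto_block = max_threads_per_mp // max_blocks_per_mp
    auto_thread = auto_block * max_blocks_per_mp * mp_count
    return auto_block, auto_thread


# (MaxThreads/MP, MaxThreadBlocks/MP, MP) per architecture, from the table.
ARCHITECTURES = {
    "Fermi": (1536, 8, 16),
    "Kepler": (2048, 16, 14),
    "Maxwell": (2048, 32, 22),  # assumption: 2048 threads/MP for Maxwell
}

for name, limits in ARCHITECTURES.items():
    block, threads = auto_launch_config(*limits)
    print(f"{name}: autoBlock={block}, autoThread={threads}")
# Fermi: autoBlock=192, autoThread=24576
# Kepler: autoBlock=128, autoThread=28672
# Maxwell: autoBlock=64, autoThread=45056
```

With these inputs autoThread equals MaxThreads/MP times MP, so the launch fills every multiprocessor to its resident-thread limit, which is the persistent-thread sizing described on the previous slide.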

  13. OTHER ENHANCEMENTS
      • Autopilot improvement
      • Developed customized operations such as:
        - mcx_nextafter
      • Reduced the use of shared memory
        - Enables more threads to be launched
      • Avoided branch divergence by using indexes

  14. IMPROVEMENT PER ENHANCEMENT
      [Figure: stacked bar chart of each enhancement's share (0% to 100%) of the overall performance gain. Bars: 980Ti (1.4x overall), GK110 (2.4x overall). Legend: Autopilot; Reducing Shared Memory; Increasing Local Memory / Hide Latency; Avoid branch divergence / Customized function]

  15. PERFORMANCE MCX - RESULTS
      • Baseline: MCX version Sep 12, 2015

      | Arch    | GPU     | Photons/ms (Baseline) | Photons/ms (Improved) | Speedup |
      |---------|---------|-----------------------|-----------------------|---------|
      | Fermi   | GTX 590 | 2044.99               | 2901.92               | 1.4x    |
      | Kepler  | GT 730  | 529.89                | 1263.74               | 2.4x    |
      | Kepler  | GK110   | 2383.22               | 5238.34               | 2.2x    |
      | Maxwell | 980Ti   | 12268.98              | 19157.09              | 1.4x    |

      [Figure: bar chart of performance (photons/ms) per GPU: GTX 590, GT 730, GK110, 980Ti]

  16. MCX AS A BENCHMARK
      Performance is changing dramatically
      • Same input: -10x
      • Same code sequence: +10x
      [Figure: bar chart of file size (KB) for MCX_core.sass and MCX_core.ptx across three builds (Baseline, After Improvement, After Improvement with Hack). Data labels: 1799.76, 1550.63, 1368.96 and 0.14, 0.14, 0.14. CUDA 7.5, Maxwell Compute 5.2 (980Ti)]

  17. MCX ON MULTIPLE GPUS

  18. MOTIVATION
      • Monte Carlo eXtreme (MCX) simulation in OpenCL
      • Distribute workloads among different devices
        - NVIDIA GPUs / AMD GPUs / CPUs
      [Figure: MCXCL platform with a partitioning scheme; one thread per device (GPU 1, GPU 2, GPU 3)]

  19. METHODOLOGY
      • Predict the kernel execution time
        - Evaluate the kernel runtime
        - Develop the performance model
      • Partitioning Schemes

      | Scheme     | Based on                                                   |
      |------------|------------------------------------------------------------|
      | Core-based | The number of parallel compute units                       |
      | Throughput | Application throughput (photons/ms)                        |
      | Iterative  | Throughput-based iterative partitioning                    |
      | Fminimax   | Nonlinear linear programming solution for minimax problem  |
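The minimax formulation behind the last scheme can be illustrated with a small solver. This is a sketch under assumptions, not the solver used in the talk: it assumes each device's kernel time grows linearly with its photon count, time = a * x + c with a > 0, and finds the split that minimizes the slowest device's time by bisection on the finish time. The name `minimax_partition` and the example coefficients are hypothetical.

```python
def minimax_partition(total, models, iterations=200):
    """Split `total` photons so that the slowest device finishes as early as possible.

    models: list of (a, c) pairs with predicted time = a * x + c and a > 0.
    Returns (shares, finish_time); shares are floats that sum to about `total`.
    """
    def capacity(t):
        # Photons all devices together can finish within time t.
        return sum(max(0.0, (t - c) / a) for a, c in models)

    lo = min(c for _, c in models)                 # nothing is finished this early
    hi = min(a * total + c for a, c in models)     # the best single device alone
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if capacity(mid) >= total:
            hi = mid
        else:
            lo = mid
    shares = [max(0.0, (hi - c) / a) for a, c in models]
    return shares, hi


# Hypothetical devices: 0.001 and 0.003 ms per photon, 5 ms fixed overhead each.
shares, finish = minimax_partition(1_000_000, [(0.001, 5.0), (0.003, 5.0)])
print([round(s) for s in shares], round(finish, 3))
```

For the two hypothetical devices above, the faster one ends up with about three quarters of the photons and both finish at about the same time, which is the defining property of the minimax split.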

  20. PERFORMANCE MODEL
      • Measure the kernel execution time on various devices
      • Simulate 1M to 25M photon migrations

  21. PERFORMANCE MODEL
      • Given n devices: $D_1, D_2, \ldots, D_n$
      • Given linear performance for each device
      • Given the performance for 1M and 2M for each device
      • We can obtain the linear equation for each device as follows:

      Device 1: $y_1 = a_1 x_1 + c_1$
      Device 2: $y_2 = a_2 x_2 + c_2$
      $\vdots$
      Device n: $y_n = a_n x_n + c_n$
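Obtaining the line for one device from its two measurements is a two-point fit. The sketch below assumes x is the photon count in millions and y the kernel time in milliseconds; the function names and the timing values are hypothetical and only illustrate the arithmetic.

```python
def fit_linear(x1, y1, x2, y2):
    """Return (a, c) of the line y = a * x + c through two measurements."""
    a = (y2 - y1) / (x2 - x1)
    c = y1 - a * x1
    return a, c


def predict(model, x):
    """Predicted kernel time for x million photons."""
    a, c = model
    return a * x + c


# Hypothetical timings: 120 ms for 1M photons, 220 ms for 2M photons.
model = fit_linear(1, 120.0, 2, 220.0)
print(model)               # (100.0, 20.0)
print(predict(model, 25))  # 2520.0
```

Here the slope a is the marginal cost per million photons and the intercept c is the fixed launch overhead, so a device with a large c is penalized most on small workloads.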

  22. PARTITIONING SCHEME ELABORATION
      • Core-based share of device i: $\dfrac{\mathrm{ComputeUnits}_i}{\sum_i \mathrm{ComputeUnits}_i}$
      • Throughput share of device i: $\dfrac{\mathrm{Throughput}_i}{\sum_i \mathrm{Throughput}_i}$
      • Iterative Approximation: Core-based Initialization; Iteratively evaluate throughput-based partitioning; Stop when achieving the max throughput
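The proportional shares and the iterative approximation can be sketched as follows. This is an illustration under assumptions rather than the authors' implementation: per-device times come from a linear model (time = a * x + c) instead of real measurements, the overall throughput is the total photon count divided by the slowest device's time, and all function names and example numbers are hypothetical.

```python
def proportional_partition(total, weights):
    """Split `total` photons in proportion to `weights` (compute units or throughputs)."""
    scale = sum(weights)
    shares = [round(total * w / scale) for w in weights]
    # Give any rounding remainder to the largest share so the sum stays exact.
    shares[shares.index(max(shares))] += total - sum(shares)
    return shares


def overall_throughput(total, shares, models):
    """Photons per ms when all devices run concurrently; the slowest one sets the pace."""
    slowest = max(a * x + c for x, (a, c) in zip(shares, models) if x > 0)
    return total / slowest


def iterative_partition(total, compute_units, models, max_iter=20):
    """Start core-based, then repartition by per-device throughput until no gain."""
    best = proportional_partition(total, compute_units)   # core-based initialization
    best_tp = overall_throughput(total, best, models)
    for _ in range(max_iter):
        # Throughput each device achieved (here: is predicted to achieve) on its share.
        rates = [x / (a * x + c) if x > 0 else 0.0 for x, (a, c) in zip(best, models)]
        candidate = proportional_partition(total, rates)
        tp = overall_throughput(total, candidate, models)
        if tp <= best_tp:      # stop when the throughput no longer improves
            break
        best, best_tp = candidate, tp
    return best, best_tp


# Two hypothetical devices with equal compute units but different speeds.
models = [(0.001, 5.0), (0.003, 5.0)]
shares, tp = iterative_partition(1_000_000, [10, 10], models)
```

With equal compute units the core-based split is half and half, which leaves the faster device idle while the slower one finishes; the throughput-based iterations shift work toward the faster device.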

  23. PERFORMANCE RESULTS
      [Figure: bar chart of throughput (photons/ms) for the Core-based, Throughput, Iterative, and Fminimax schemes at 10M and 100M photons on each system]

      Throughput utilization, GTX 980 Ti + GTX 590 + GT 730 (max throughput 9688 photons/ms):

      | Scheme     | 10M    | 100M   |
      |------------|--------|--------|
      | Core-based | 35.01% | 41.65% |
      | Throughput | 59.31% | 93.42% |
      | Iterative  | 68.85% | 93.77% |
      | Fminimax   | 68.85% | 93.77% |

      Throughput utilization, K40c + K20c (max throughput 30323 photons/ms):

      | Scheme     | 10M    | 100M   |
      |------------|--------|--------|
      | Core-based | 85.31% | 97.56% |
      | Throughput | 80.39% | 87.89% |
      | Iterative  | 80.39% | 87.89% |
      | Fminimax   | 80.39% | 87.89% |

  24. PERFORMANCE RESULTS
      [Figure: bar chart of throughput (photons/ms) for the Core-based, Throughput, Iterative, and Fminimax schemes at 10M and 100M photons on each system]

      Throughput utilization, AMD 7970M + Intel i7-3740QM (max throughput 4529 photons/ms):

      | Scheme     | 10M    | 100M   |
      |------------|--------|--------|
      | Core-based | 19.32% | 18.69% |
      | Throughput | 18.81% | 27.14% |
      | Iterative  | 18.78% | 27.91% |
      | Fminimax   | 18.78% | 27.91% |

      Throughput utilization, AMD 7970 + Fiji + Intel i7-4770 (max throughput 19176 photons/ms):

      | Scheme     | 10M    | 100M   |
      |------------|--------|--------|
      | Core-based | 15.10% | 19.06% |
      | Throughput | 16.38% | 21.10% |
      | Iterative  | 16.38% | 21.10% |
      | Fminimax   | 16.38% | 21.10% |

  25. SUMMARY
      • We have improved the performance of MCX across a range of NVIDIA GPU architectures
      • We have shown how to exploit Persistent Thread kernels to automatically tune the MCX kernel
      • We developed an iterative scheme to search for the best partition to run MCX on multiple accelerators
      • We obtained a 24% and 44% throughput utilization improvement (Iterative vs Core-based) for 10M and 100M photon simulations, respectively

  26. FUTURE WORK
      • Instrumentation of MCX
        - Leverage SASSI to instrument MCX and better characterize the behavior of a kernel to guide auto-tuning
      • MCX on Multiple GPUs
        - Evaluate our partitioning optimization for multiple devices

  27. MCX CHALLENGE
      • Interested in improving the performance of MCX by over 40% compared to the current version?
        - Monetary reward will be announced soon. Stay tuned to mcx.space

  28. ACKNOWLEDGEMENT
      • This project is funded by the NIH/NIGMS under the grant R01-GM114365
      • We would like to acknowledge NVIDIA for their support for this work through the NVIDIA Research Center program

  29. THANK YOU! QUESTIONS?
      fninaparavecino@ece.neu.edu
      ylm@ece.neu.edu
