Fast GPU Monte Carlo Simulation for Radiotherapy, DNA Ionization and Beyond
2017 GPU Technology Conference Shogo Okada <shogo@port.kobe-u.ac.jp> Koichi Murakami <koichi.murakami@kek.jp> Nick Henderson <nick.henderson@gmail.com>
Fast GPU Monte Carlo Simulation for Radiotherapy, DNA Ionization and - - PowerPoint PPT Presentation
Fast GPU Monte Carlo Simulation for Radiotherapy, DNA Ionization and Beyond 2017 GPU Technology Conference Shogo Okada <shogo@port.kobe-u.ac.jp> Koichi Murakami <koichi.murakami@kek.jp> Nick Henderson
2017 GPU Technology Conference Shogo Okada <shogo@port.kobe-u.ac.jp> Koichi Murakami <koichi.murakami@kek.jp> Nick Henderson <nick.henderson@gmail.com>
Geant4 GPU experimentation MPEXS Algorithm research Application development Geant4 multi-threading
(~ x, ~ p, k) k ∈ {γ, e−, e+, . . . }
Goal: record effect of particle interaction in material
through and interacting with matter
geometries, and materials
application areas
ATLAS LISA gMocren
Challenges:
If you want to consider full capability of Geant4:
GPU
positron annihilation
Verification for Dose Distribution
z y density water 1.0 g/cm3 lung 0.26 g/cm3 bone 1.85 g/cm3 air 0.0012 g/cm3
(1) water (2) lung (3) bone air source Beam particle and its initial kinetic energy:
Dose Distribution of slab phantoms
Comparison of depth dose for γ 6MV
− G4 v9.6.3 − G4CU
(1) water
(2) lung (3) bone
5 10 15 20 25 30dose (Gy)
0.05 0.1 0.15 0.2 0.25 0.3G4 G4CU
depth dose distribution
depth (cm)
5 10 15 20 25 30residual
5 10 15 20 25 30
dose (Gy)
0.1 0.15 0.2 0.25 0.3G4 G4CU
depth dose distribution
depth (cm)
5 10 15 20 25 30residual
5 10 15 20 25 30
dose (Gy)
0.01 0.02 0.03 0.04 0.05 0.06 0.07G4 G4CU
depth dose distribution
depth (cm)
5 10 15 20 25 30residual
lung bone
MPEXS
MPEXS
MPEXS MPEXS MPEXS
Comparison of depth dose for γ 18MV
− G4 v9.6.3 − G4CU
(1) water
(2) lung (3) bone
5 10 15 20 25 30
dose (Gy)
0.02 0.04 0.06 0.08 0.1 0.12G4 G4CU
depth dose distribution
depth (cm)
5 10 15 20 25 30residual
5 10 15 20 25 30
dose (Gy)
0.02 0.04 0.06 0.08 0.1 0.12G4 G4CU
depth dose distribution
depth (cm)
5 10 15 20 25 30residual
5 10 15 20 25 30
dose (Gy)
0.02 0.04 0.06 0.08 0.1 0.12G4 G4CU
depth dose distribution
depth (cm)
5 10 15 20 25 30residual
lung bone
MPEXS
MPEXS
MPEXS MPEXS MPEXS
5 10 15 20 25 30
dose (Gy)
0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.16 0.18G4 G4CU
depth dose distribution
depth (cm)
5 10 15 20 25 30residual
5 10 15 20 25 30
dose (Gy)
0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.16 0.18G4 G4CU
depth dose distribution
depth (cm)
5 10 15 20 25 30residual
5 10 15 20 25 30
dose (Gy)
0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.16 0.18G4 G4CU
depth dose distribution
depth (cm)
5 10 15 20 25 30residual
Comparison of depth dose for e- 20MeV
− G4 v9.6.3 − G4CU
(1) water
(2) lung (3) bone
depth (cm) 5 10 15 20 25 30 dose (Gy)log scale
depth (cm) 5 10 15 20 25 30 dose (Gy)log scale log scale lung bone
MPEXS
MPEXS
Computation Time Performance
γ beam with 6MV γ beam with 18MV (1) water (2) lung (3) bone (1) water (2) lung (3) bone G4 [msec/particle] 0.780 0.822 0.819 0.803 0.857 0.924 G4CU [msec/particle] 0.00336 0.00331 0.00341 0.00433 0.00425 0.00443 × speedup factor ( = G4 / G4CU ) 232 248 240 185 201 208
GPU:
CPU:
e- beam with 20MeV (1) water (2) lung (3) bone G4 [msec/particle] 1.84 1.87 1.65 G4CU [msec/particle] 0.00881 0.00958 0.00885 × speedup factor ( = G4 / G4CU ) 208 195 193
185~250 times speedup against single-core G4 simulation!!
MPEXS / MPEXS) MPEXS / MPEXS)
different particle kinds, then thread divergence occurs in physics process code
run-time. Some applications call for the generation of many secondary particles. This restriction meant that we could only run with a small number of active threads.
e+
e-
e+
e-
process e- process e+ process
e- e- e+ particles in memory 1 2 3 4 5 6 7 index particles in memory
This leads to a non-physical simulation, but eliminates thread
process and perform a run length encode against the time for a single trip through event loop. Calculations indicate we should expect a factor 2x speedup if implemented in full simulation.
stacks
input buffer ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ
input buffer ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ
ˠ ˠ ˠ ˠ ˠ ˠ
e- e-
input buffer ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ pop ˠ ˠ ˠ ˠ ˠ
ˠ ˠ ˠ ˠ ˠ ˠ
e- e-
input buffer ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ pop ˠ ˠ ˠ ˠ ˠ process selection ˠ ˠ ˠ ˠ ˠ Compton scattering Photoelectric effect
ˠ ˠ ˠ ˠ ˠ ˠ
e- e-
input buffer ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ pop ˠ ˠ ˠ ˠ ˠ process selection ˠ ˠ ˠ ˠ ˠ
ˠ ˠ ˠ ˠ ˠ ˠ
e- e-
sort by selected process Compton scattering Photoelectric effect
input buffer ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ pop ˠ ˠ ˠ ˠ ˠ process selection ˠ ˠ ˠ ˠ ˠ secondary generation secondary particles ˠ ˠ ˠ
e- e- e- e- e-
ˠ ˠ ˠ ˠ ˠ ˠ
e- e-
sort by selected process Compton scattering Photoelectric effect
input buffer ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ pop ˠ ˠ ˠ ˠ ˠ process selection ˠ ˠ ˠ ˠ ˠ secondary generation secondary particles ˠ ˠ ˠ
e- e- e- e- e-
ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ
e- e- e- e- e- e- e-
secondary storage sort by selected process Compton scattering Photoelectric effect
particles in one operation.
they are all the same kind, thus we can apply the same (non-divergent) operations.
input photon buffers and write to output electron and photon buffers that are pushed
after-step processes are applied only to particles that call for it.
divergence may occur in the application of a physics process, because many of them rely on sample-reject algorithms to sample from various distributions.
physics process. However, all writes of particle data is coalesced. We have to pay for the randomness somewhere.
have not yet ported the physics processes over. We've done performance experiments with fake/model physics processes (which mimic computation and memory access patterns of the real
data moved. The numbers shown are the speed up of the new- architecture against the old for a variety of configurations.
Number of processes 1 2 4 8 16 32 64 128 Data transfer (float #) 1 0.5 0.6 0.8 1.0 1.3 1.8 2.7 4.1 2 0.5 0.7 0.8 1.0 1.5 2.1 2.9 4.2 4 0.6 0.7 0.9 1.2 1.8 2.6 3.4 4.6 8 0.6 0.8 1.1 1.6 2.4 3.3 4.2 5.2 16 0.6 1.0 1.5 2.1 3.0 4.1 4.9 5.9 32 0.7 1.2 1.8 2.6 3.6 4.6 5.4 6.3 64 0.8 1.4 2.0 2.8 3.9 4.9 5.7 6.6 128 0.9 1.7 2.4 3.1 4.0 5.1 5.9 6.7 speedup
Geant4 GPU experimentation MPEXS Algorithm research Application development Geant4 multi-threading
“Geant4-DNA”, an extension of Geant4 to DNA physics
energy scale (down to meV)
molecules like excited and ionized H2O.
Chromatine fiber (constituent of chromosomes) EM shower in DNA ∅ 10 nm
species like H2O*, H2O-/+, e-aq Diffusion and reactions for chemical species
(Future work)
http://www.windows2universe.org/earth/Life/cell_radiation_damage.html
Physics Processes for X-rays Compton scattering 100 eV - 1 GeV, Livermore Photoelectric effect 100 eV - 1 GeV, Livermore Gamma conversion 100 eV - 1 GeV, Livermore Rayleigh scattering 100 eV - 1 GeV, Livermore
Particles Electrons Protons Hydrogen atoms Helium atoms (He++, He+, He0) Elastic scattering 9 eV - 10 keV Uehara 10 keV - 1 MeV Champion 100 eV - 1 MeV Hoang 100 eV - 10 MeV Hoang Excitation 10 eV - 10 keV Emfietzoglou 10 keV - 1 MeV Born 10 eV - 500 keV Miller Green 500 keV - 100 MeV Born 10 eV - 500 keV Miller Green 1 keV - 400 MeV Miller Green Charge change — 100 eV - 10 MeV Dingfelder 100 eV - 10 MeV Dingfelder 1 keV - 400 MeV Dingfelder Ionization 10 eV - 10 keV Emfietzoglou 10 keV - 1 MeV Born 100 eV - 500 keV Rudd 500 keV - 100 MeV Born 100 eV - 100 MeV Rudd 1 keV - 400 MeV Rudd Vibrational excitation 2 - 100 eV Michaud et al. — — — Disociative attachment 4 - 13 eV Melton — — —
E1 E2
p e- H atom -> p AB + e- -> AB- -> A + B-
((( (((
ΔE
e- e- p
Physics Processes
Atomic deexcitation occurs during ionization process, and emits auger electrons and X-rays
Standard EM Physics
step with the Bethe-Bloch formula.
above the threshold. DNA physics
energy thresholds to calculate complex energy spread within cells for DNA damage
(~ 20k / primary).
Bethe-Bloch formula: ΔE1 ΔE3 ΔE2 i
i z a t i
e x c i t a t i
ΔE4 ΔE5 ΔE6 ΔE1 ΔE2 ΔE3
Δx1 Δx2 Δx3 Δx4
ΔE4 “continues process” “discrete process”
consumption for storing secondaries generated into the stack.
NVIDIA, Tesla K40c, Global Memory: 11,439 MB (GDDR5)
The difference of # of secondaries and active thread number (DNA vs EM) Incident particle Initial energy Typical # of secondaries generated Stack size per CUDA thread Total active CUDA thread numbers (Nblk x Nthr/blk) Total memory usage for stacks DNA Physics He++ 1 MeV > 20,000 25,000 (1,074 kB) 10,240 (80 x 128) 10,740 MB EM Physics e- 20 MeV < 40 100 (4.3 kB) 1,048,576 (4,096 x 256) 4,405 MB
share a secondary stack.
(~10k threads -> more than 1 M threads)
DNA Physics Standard EM Physics
1 2 3 4 5 6 … e- e- γ γ e- e+ γ …
…
CUDA Threads Secondary stacks (capacity: 100) Thread#
…
CUDA Threads Secondary stacks (tot. capacity: 25k)
32 threads
…
Warp #0 1 2 3 4 5 6 … 30 H 31 p e- H e- e- e- H … H e- Thread# Event #0 Event #1 Warp #1 32 33 34 35 36 37 38 … 62 H 63 p H e- e- H p …
Depth dose curves (CPU vs GPU)
z-direction (um) 5 10 15 20 25 Dose (Gy) 100 200 300 400 500 600 700
3
10 ×
depth dose distribution depth dose distribution
z-direction (um) 10 20 30 40 50 60 70 80 90 100 Dose (Gy) 1 10
2
10
depth dose distribution
z-direction (um) 1 2 3 4 5 6 7 8 9 10 Dose (Gy) 500 1000 1500 2000 2500 3000 3500 4000
3
10 ×
depth dose distribution
— Geant4-DNA (CPU) — MEPXS-DNA (GPU)
p 1 MeV He++ 1 MeV e- 100 keV
Good agreement with Geant4-DNA
excited H2O molecules (H2O+/H2O-, H2O*)
Electronic state Process Dissociation channel Fraction (%) Ionization state Dissociative decay H3O+ + •OH 100 Excitation state: A1B1 Dissociative decay
65 Relaxation H2O + ΔE 35 Excitation state: B1A1 Auto-ionization H3O+ + •OH + e-aq 55 Dissociative decay
15 Relaxation H2O + ΔE 30 Excitation state: Rydberg, diffusion bands Auto-ionization H3O+ + •OH + e-aq 50 Relaxation H2O + ΔE 50 Dissociative attachment: H2O- Dissociative decay
100
Ref.) Radiat Environ Biophys (2009) 48: 11- 20
(1)Calculates intermolecular distance (d) for all pairs.
Then, makes reactions for pairs with d < R (2)Finds minimum distance in remains, and calculates time step (Δt). (3)Diffuses molecules using Δt.
(4)Loops (1) ~ (3)
Species Diffusion coefficient [m2/s] H3O+ 9.0E-09 H• 7.0E-09 OH- 5.0E-09 e-aq 4.9E-09 H2 4.8E-09
2.8E-09 H2O2 2.3E-09 Reactions Reaction rate [M-1s-1] 2e-aq + 2H2O -> H2+ 2OH- 5.00E+09 e-aq + •OH -> OH- 2.95E+10 e-aq + H• + H2O -> OH- + H2 2.65E+10 e-aq + H3O+ -> H• + H2O 2.11E+10 e-aq + H2O2 -> OH- + •OH 1.44E+10
4.40E+09
1.44E+10 H• + H• -> H2 1.20E+10 H3O+ + OH- -> 2H2O 1.43E+10
Ref.) Radiat Environ Biophys (2009) 48: 11- 20
d d < R ? No Yes Make reaction Diffusion
R = k 4πNAD
Reaction radius (R) (by Smoluchowski Model) :
Time(ps) 1 10
210
310
410
510
610 G-value (# of molecules / 100 eV) 1 2 3 4 5 6
Comparison of G-value profile (CPU vs GPU) ✓ Line: Geant4-DNA ✓ Filled circle: MPEXS-DNA
p 20 MeV OH・ OH- H3O+ eaq- H2 ・H H2O2
Agrees with Geant4-DNA within ~ 3 %
G-value = # of Molecules Energy loss
Time(ps) 1 10
210
310
410
510
610 G-value (# of molecules / 100 eV) 1 2 3 4 5 6 7
・OH (! MPEXS-DNA) H2O2 (! MPEXS-DNA) ・ ・ ・OH (Partrac) H2O2 (Partrac)
e- 750 keV
Verifying with other simulation data
Ref.) J. Radiat. Res., 46, 333–341 (2005)
Diffusions and chemical reactions after irradiated water phantom with a 10 keV electron
13.48 2.57 3279.47 932.82 1.0E+00 1.0E+01 1.0E+02 1.0E+03 1.0E+04 e- 750 keV p 20 MeV Event Number / 1 min. Geant4-DNA (CPU) MPEXS-DNA (GPU)
363x 243x
Up to 360 times speedup against single-core Xeon CPU
2,880 cores, 745 MHz
3.50 GHz
Comparison of event number processed per 1 min.
3279.47 932.82 10053.09 3028.60 0.0E+00 2.0E+03 4.0E+03 6.0E+03 8.0E+03 1.0E+04 1.2E+04 e- 750 keV p 20 MeV Event Number / 1 min. MPEXS-DNA (K40c) MPEXS-DNA (P100)
3.06x 3.24x
Comparison of event number processed per 1 min.
Preliminary result
time.
simulation using GPU power drastically.
(e.g. Gold nanoparticle; GNP)
Technology
based on Geant4. <https://dx.doi.org/10.1051/snamc/201404204>
dx.doi.org/10.1109/NSSMIC.2013.6829452>
<https://dx.doi.org/10.1016/S0168-9002(03)01368-8>
very low energy extension of the Geant4 Monte Carlo simulation toolkit. <https:// dx.doi.org/10.1016/j.ejmp.2015.10.087>