SLIDE 1

Fast GPU Monte Carlo Simulation for Radiotherapy, DNA Ionization and Beyond

2017 GPU Technology Conference
Shogo Okada <shogo@port.kobe-u.ac.jp>
Koichi Murakami <koichi.murakami@kek.jp>
Nick Henderson <nick.henderson@gmail.com>

SLIDE 2

Outline

Geant4 GPU  experimentation MPEXS Algorithm  research Application  development Geant4 multi-threading

SLIDE 3

Big Picture

SLIDE 4

$(\vec{x}, \vec{p}, k)$, $k \in \{\gamma, e^-, e^+, \ldots\}$

Goal: record effect of particle interaction in material

SLIDE 5

Geant4

Toolkit for simulation of particles traveling through and interacting with matter
Supports wide variety of physics models, geometries, and materials
Extendable - users can add new models
Used in numerous and diverse application areas

high energy physics
medical physics
spacecraft
semiconductor devices
biology research

ATLAS LISA gMocren

SLIDE 6

Parallelism

Simulations require many events for statistical significance
Events are IID
Each computation thread processes an event

Challenges:

Random nature of simulation leads to thread divergence
Storage of secondary particles
Recording of energy deposition

If you want to consider full capability of Geant4:

Very complicated geometry -- non uniform data structures
Many material types
Large data tables to support physics processes

SLIDE 7

MPEXS

MPEXS is an adaptation of the core simulation algorithm from Geant4 for GPU
Target application: X-ray radiotherapy
Geometry: uniformly discretized box
Material: Water with variable density
Physics: Low energy electromagnetics
Gamma: Compton scattering, photoelectric effect, pair production
Electron/Positron: ionization, multiple scattering, Bremsstrahlung, positron annihilation

Each GPU thread tracks an active particle
Secondary particles are stored on thread-local secondary stacks
Threads deposit energy to a shared global domain (via atomicAdd)
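
The energy-deposition step can be pictured as a scatter-add into a shared voxel grid. The sketch below is a serial Python analogue, not MPEXS code: the names `voxel_index` and `deposit`, the grid shape, the voxel size, and the sample hits are all hypothetical. On the GPU, each `+=` on the shared array would be an `atomicAdd`, so that concurrent threads hitting the same voxel do not lose updates.

```python
# Illustrative only: serial stand-in for threads depositing energy via atomicAdd.
# Grid shape, voxel size, and the hit list are made-up values for the example.

def voxel_index(pos, voxel_size, shape):
    """Map a position (x, y, z) to a flat voxel index, or None if outside the grid."""
    idx = []
    for p, s, n in zip(pos, voxel_size, shape):
        i = int(p // s)
        if i < 0 or i >= n:
            return None
        idx.append(i)
    ix, iy, iz = idx
    nx, ny, nz = shape
    return (ix * ny + iy) * nz + iz

def deposit(dose, hits, voxel_size, shape):
    """Accumulate energy deposits; each += corresponds to one atomicAdd on a GPU."""
    for pos, energy in hits:
        k = voxel_index(pos, voxel_size, shape)
        if k is not None:
            dose[k] += energy
    return dose

shape = (4, 4, 4)
voxel_size = (0.5, 0.5, 0.2)  # cm, hypothetical
dose = [0.0] * (shape[0] * shape[1] * shape[2])
hits = [((0.1, 0.1, 0.1), 1.0), ((0.2, 0.3, 0.15), 0.5), ((1.9, 1.9, 0.7), 2.0)]
deposit(dose, hits, voxel_size, shape)
```

Two of the three sample hits fall into the same voxel, which is exactly the case where an atomic update matters on a GPU.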

SLIDE 8

MPEXS - Performance & Validation

SLIDE 9

Verification for Dose Distribution

Dose distribution of slab phantoms

density:
  water  1.0 g/cm3
  lung   0.26 g/cm3
  bone   1.85 g/cm3
  air    0.0012 g/cm3

phantom size : 30.5 x 30.5 x 30 cm
voxel size : 5 x 5 x 2 mm
field size : 10 cm2
SSD : 100 cm
slab materials : (1) water (2) lung (3) bone

Beam particle and its initial kinetic energy:
electron with 20MeV
photon with 6MV Linac
photon with 18MV Linac

[Figure: slab phantom geometry (y and z axes) with the source in air]

SLIDE 10

Comparison of depth dose for γ 6MV

[Figure: depth dose distributions, G4 v9.6.3 vs G4CU (MPEXS), for three slab phantoms: (1) water, (2) lung, (3) bone. Each panel plots dose (Gy) against depth in the z-direction (cm, 5 to 30), with a residual panel below.]

x-axis: z-direction (cm)
y-axis: dose (Gy)
residual = (G4CU−G4) / G4
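
The residual on these slides is defined as (G4CU − G4) / G4, evaluated per depth bin. A minimal sketch of that calculation follows; the function name `residuals` and the dose values are hypothetical and are not read off the plots.

```python
def residuals(g4, g4cu):
    """Relative residual (G4CU - G4) / G4 per depth bin; None where G4 is zero."""
    return [(c - r) / r if r != 0 else None for r, c in zip(g4, g4cu)]

# Made-up dose values, for illustration only.
g4 = [0.30, 0.25, 0.20]
g4cu = [0.303, 0.25, 0.198]
res = residuals(g4, g4cu)
```

A residual of 0.01 means the GPU result is 1% above the Geant4 reference in that bin.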

SLIDE 11

Comparison of depth dose for γ 18MV

[Figure: depth dose distributions, G4 v9.6.3 vs G4CU (MPEXS), for three slab phantoms: (1) water, (2) lung, (3) bone. Each panel plots dose (Gy) against depth in the z-direction (cm, 5 to 30), with a residual panel below.]

x-axis: z-direction (cm)
y-axis: dose (Gy)
residual = (G4CU−G4) / G4

SLIDE 12

Comparison of depth dose for e- 20MeV

[Figure: depth dose distributions, G4 v9.6.3 vs G4CU (MPEXS), for three slab phantoms: (1) water, (2) lung, (3) bone. Each panel plots dose (Gy) against depth in the z-direction (cm, 5 to 30), with a residual panel below and an additional log-scale version of each dose curve.]

x-axis: z-direction (cm)
y-axis: dose (Gy)
residual = (G4CU−G4) / G4

SLIDE 13

Computation Time Performance

                          γ beam with 6MV                 γ beam with 18MV
                          (1) water (2) lung  (3) bone    (1) water (2) lung  (3) bone
G4 [msec/particle]        0.780     0.822     0.819       0.803     0.857     0.924
G4CU [msec/particle]      0.00336   0.00331   0.00341     0.00433   0.00425   0.00443
speedup (= G4 / G4CU)     232       248       240         185       201       208

GPU:

Tesla K20c (Kepler architecture)
2496 cores, 706 MHz
4096 x 128 threads
# of primaries
50M particles -> e- 20MeV
500M particles -> γ 6MV, 18MV

CPU: 

Xeon E5-2643 v2 3.50 GHz

                          e- beam with 20MeV
                          (1) water (2) lung  (3) bone
G4 [msec/particle]        1.84      1.87      1.65
G4CU [msec/particle]      0.00881   0.00958   0.00885
speedup (= G4 / G4CU)     208       195       193

185~250 times speedup against single-core G4 simulation!!

SLIDE 14

Algorithm Research

SLIDE 15

MPEXS does not attempt to sort particles
Thread divergence: if threads in the same warp are tracking different particle kinds, then thread divergence occurs in physics process code
Size of particle stack is the same for each thread and is fixed at run-time. Some applications call for the generation of many secondary particles. This restriction meant that we could only run with a small number of active threads.

SLIDE 16

[Figure: particles in memory (index 1 to 7; e- and e+ mixed) mapped to computation steps: process e-, process e+, process e-]

SLIDE 17

MPEXS Experiments

Initialize each thread with the same random number generator state. This leads to a non-physical simulation, but eliminates thread divergence. We saw a factor 3x speedup in these runs.
Measure the time it takes to sort particle index by selected process and perform a run length encode against the time for a single trip through the event loop. Calculations indicate we should expect a factor 2x speedup if implemented in full simulation.
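
The "sort particle index by selected process, then run-length encode" step can be sketched on the CPU as follows. This is a minimal illustration, not the MPEXS implementation: the function name `sort_and_rle`, the process ids, and the sample selection are hypothetical. On a GPU the same idea would typically use a parallel sort-by-key and a parallel run-length encode.

```python
from itertools import groupby

def sort_and_rle(selected):
    """Return (order, runs): particle indices sorted by process id, and
    (process_id, start, length) runs over that sorted order."""
    order = sorted(range(len(selected)), key=lambda i: selected[i])  # stable sort
    runs, start = [], 0
    for pid, grp in groupby(order, key=lambda i: selected[i]):
        n = len(list(grp))
        runs.append((pid, start, n))
        start += n
    return order, runs

COMPTON, PHOTOELECTRIC = 0, 1  # hypothetical process ids
selected = [COMPTON, PHOTOELECTRIC, COMPTON, COMPTON, PHOTOELECTRIC]
order, runs = sort_and_rle(selected)
```

Each run identifies a contiguous slice of `order` in which every particle selected the same process, so one non-divergent kernel launch (or one warp-aligned chunk) can handle it.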

SLIDE 18

New Architecture

Goal 1: minimize/eliminate thread divergence
Goal 2: eliminate need for fixed-size and thread-local secondary stacks
Goal 3: maintain extensibility

SLIDE 19

How it works

SLIDE 20

input buffer: γ γ γ γ γ γ γ γ γ γ γ γ γ γ γ γ

SLIDE 21

input buffer: γ γ γ γ γ γ γ γ γ γ γ γ γ γ γ γ
output buffers:
  γ γ γ γ γ γ
  e- e-

SLIDE 22

input buffer: γ γ γ γ γ γ γ γ γ γ γ γ γ γ γ γ
pop: γ γ γ γ γ
output buffers:
  γ γ γ γ γ γ
  e- e-

SLIDE 23

input buffer: γ γ γ γ γ γ γ γ γ γ γ γ γ γ γ γ
pop: γ γ γ γ γ
process selection: γ γ γ γ γ (Compton scattering, Photoelectric effect)
output buffers:
  γ γ γ γ γ γ
  e- e-

SLIDE 24

input buffer: γ γ γ γ γ γ γ γ γ γ γ γ γ γ γ γ
pop: γ γ γ γ γ
process selection: γ γ γ γ γ
sort by selected process: Compton scattering, Photoelectric effect
output buffers:
  γ γ γ γ γ γ
  e- e-

SLIDE 25

input buffer: γ γ γ γ γ γ γ γ γ γ γ γ γ γ γ γ
pop: γ γ γ γ γ
process selection: γ γ γ γ γ
sort by selected process: Compton scattering, Photoelectric effect
secondary generation, secondary particles:
  γ γ γ
  e- e- e- e- e-
output buffers:
  γ γ γ γ γ γ
  e- e-

SLIDE 26

input buffer: γ γ γ γ γ γ γ γ γ γ γ γ γ γ γ γ
pop: γ γ γ γ γ
process selection: γ γ γ γ γ
sort by selected process: Compton scattering, Photoelectric effect
secondary generation, secondary particles:
  γ γ γ
  e- e- e- e- e-
secondary storage, output buffers:
  γ γ γ γ γ γ γ γ γ
  e- e- e- e- e- e- e-

SLIDE 27

Features

Store particles on a generalized stack that allows pushing and popping a block of particles in one operation.
Group particles by kind (gamma, e-, e+). When we pop a block of particles, we know they are all the same kind, thus we can apply the same (non-divergent) operations.
Maintain separate input and output buffers. Physics processes know the input and output particles. For example, in Compton scattering the input is a photon and the output is a scattered photon and a recoil electron. Thus, we can read from the active input photon buffers and write to output electron and photon buffers that are pushed onto appropriate stacks.
The sort and run-length encode operations are applied after process selection so that after-step processes are applied only to particles that call for it.
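
The block-stack idea can be sketched in a few lines. Everything here is a hypothetical illustration of the data flow, not the authors' code: the class `BlockStack`, the function `compton_step`, the particle dictionaries, and the 50/50 energy split are placeholders with no physical meaning.

```python
class BlockStack:
    """Stack that pushes and pops whole blocks of particles of one kind."""

    def __init__(self):
        self.items = []

    def push_block(self, particles):
        self.items.extend(particles)

    def pop_block(self, max_size):
        """Remove and return up to max_size particles from the top."""
        n = min(max_size, len(self.items))
        cut = len(self.items) - n
        block = self.items[cut:]
        del self.items[cut:]
        return block

# One stack per particle kind, so a popped block is homogeneous.
stacks = {"gamma": BlockStack(), "e-": BlockStack(), "e+": BlockStack()}
stacks["gamma"].push_block([{"kind": "gamma", "E": 6.0} for _ in range(5)])

def compton_step(photons):
    """Toy Compton step: each photon yields a scattered photon and a recoil
    electron. The equal energy split is a placeholder, not real kinematics."""
    out = {"gamma": [], "e-": []}
    for p in photons:
        out["gamma"].append({"kind": "gamma", "E": 0.5 * p["E"]})
        out["e-"].append({"kind": "e-", "E": 0.5 * p["E"]})
    return out

block = stacks["gamma"].pop_block(4)             # input buffer: photons only
for kind, particles in compton_step(block).items():
    stacks[kind].push_block(particles)           # output buffers go to their stacks
```

Because the process declares its output kinds up front, the writes go to separate per-kind buffers and no per-thread secondary stack is needed.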

SLIDE 28

Properties

No thread divergence due to process selection. Thread divergence may occur in the application of a physics process, because many of them rely on sample-reject algorithms to sample from various distributions.
Have non-coalesced reads of particle data in the after-step physics process. However, all writes of particle data are coalesced. We have to pay for the randomness somewhere.
Thread-local stacks are not required.

SLIDE 29

Experiments

The new architecture is substantially different from MPEXS. We have not yet ported the physics processes over. We've done performance experiments with fake/model physics processes (which mimic computation and memory access patterns of the real ones).
We can vary the number of physics processes and the amount of data moved. The numbers shown are the speedup of the new architecture against the old for a variety of configurations.

SLIDE 30

Speedup via new architecture

Speedup due to sorting by process id for fake/model processes
Vary number of process and amount of data required by each process
Results collected from K40

Speedup, by number of processes (columns) and data transfer (rows):

Data transfer    Number of processes
(float #)        1     2     4     8     16    32    64    128
1                0.5   0.6   0.8   1.0   1.3   1.8   2.7   4.1
2                0.5   0.7   0.8   1.0   1.5   2.1   2.9   4.2
4                0.6   0.7   0.9   1.2   1.8   2.6   3.4   4.6
8                0.6   0.8   1.1   1.6   2.4   3.3   4.2   5.2
16               0.6   1.0   1.5   2.1   3.0   4.1   4.9   5.9
32               0.7   1.2   1.8   2.6   3.6   4.6   5.4   6.3
64               0.8   1.4   2.0   2.8   3.9   4.9   5.7   6.6
128              0.9   1.7   2.4   3.1   4.0   5.1   5.9   6.7

SLIDE 31

Summary

MPEXS is a GPU-based Monte Carlo simulator for X-ray radiotherapy
MPEXS attains around 200x speedup when compared to Geant4 running on a single CPU core
Algorithm experimentation indicates a further 2x speedup with a sort operation after process selection
New architecture also opens opportunities for other applications
better performance with more physics processes
no thread-local secondary stacks

SLIDE 32

Outline

Geant4 GPU  experimentation MPEXS Algorithm  research Application  development Geant4 multi-threading

SLIDE 33

MPEXS-DNA

SLIDE 34

The Geant4-DNA Project

“Geant4-DNA” is an extension of Geant4 to DNA physics
Estimates biological effects (e.g. DNA strand breaks) of radiation at an ultra-low energy scale (down to meV)
The main objective of the project:
Evaluate effects on human health in chronic radiation exposure
e.g. medical diagnostics, astronauts in space missions, airline crews, …
Its computing performance should be improved using GPU power.
Energy spread in cells is an important factor for DNA damage.
Geant4-DNA calculates complex track geometry within cells.
Needs to handle a large number of secondary particles.
e.g. more than 20k secondaries are generated per primary
Days to weeks of simulation on a CPU cluster

SLIDE 35

MPEXS-DNA, microdosimetry simulation on GPU

Based on Geant4-DNA 10.02 p03
EM Physics for lower energy range (down to meV)
  Calculates energy loss and generates primary molecules like excited and ionized H2O.
Radiolysis of water
  Diffusion and production of chemical species
Estimates DNA damage (-> future work).

1. Physical Phase
  Calculates dose distributions
  Generates primary chemical species like H2O*, H2O-/+, e-aq
2. Chemical Phase
  Diffusion and reactions for chemical species
3. Biological Phase
  (Future work)

[Figure: EM shower in DNA; chromatin fiber (constituent of chromosomes), ∅ 10 nm]
http://www.windows2universe.org/earth/Life/cell_radiation_damage.html

SLIDE 36

MPEXS-DNA Physics Processes

Physics Processes for X-rays
  Compton scattering     100 eV - 1 GeV, Livermore
  Photoelectric effect   100 eV - 1 GeV, Livermore
  Gamma conversion       100 eV - 1 GeV, Livermore
  Rayleigh scattering    100 eV - 1 GeV, Livermore

Physics Processes for particles
Particle columns, in order: Electrons; Protons; Hydrogen atoms; Helium atoms (He++, He+, He0)
  Elastic scattering: 9 eV - 10 keV Uehara; 10 keV - 1 MeV Champion; 100 eV - 1 MeV Hoang; 100 eV - 10 MeV Hoang
  Excitation: 10 eV - 10 keV Emfietzoglou; 10 keV - 1 MeV Born; 10 eV - 500 keV Miller Green; 500 keV - 100 MeV Born; 10 eV - 500 keV Miller Green; 1 keV - 400 MeV Miller Green
  Charge change: (none); 100 eV - 10 MeV Dingfelder; 100 eV - 10 MeV Dingfelder; 1 keV - 400 MeV Dingfelder
  Ionization: 10 eV - 10 keV Emfietzoglou; 10 keV - 1 MeV Born; 100 eV - 500 keV Rudd; 500 keV - 100 MeV Born; 100 eV - 100 MeV Rudd; 1 keV - 400 MeV Rudd
  Vibrational excitation: 2 - 100 eV Michaud et al.; (none); (none); (none)
  Dissociative attachment: 4 - 13 eV Melton; (none); (none); (none)

[Diagrams illustrating the processes, including H atom -> p and AB + e- -> AB- -> A + B-]

Atomic deexcitation occurs during the ionization process and emits Auger electrons and X-rays

SLIDE 37

The difference of energy loss process (EM Physics vs DNA Physics)

Standard EM Physics
  Continuous process
    Energy loss is below a given threshold.
    Calculates average energy loss at each step with the Bethe-Bloch formula.
    No secondaries are generated.
  Discrete process
    Generates a secondary if energy loss is above the threshold.

DNA physics
  Handled as a discrete process without energy thresholds, to calculate the complex energy spread within cells for DNA damage
  A large number of secondaries are generated (~ 20k / primary).

[Equation image: Bethe-Bloch formula]
[Figure: track schematics labeled "continuous process" (energy losses ΔE over steps Δx1 to Δx4) and "discrete process" (individual ionization and excitation energy losses ΔE)]

SLIDE 38

An issue on lower thread occupancy in physics simulation

DNA Physics simulation had an issue of low thread occupancy.
The number of active threads was limited due to large memory consumption for storing generated secondaries on the stack.

NVIDIA Tesla K40c, Global Memory: 11,439 MB (GDDR5)

The difference of # of secondaries and active thread number (DNA vs EM)

                                              DNA Physics          EM Physics
Incident particle                             He++                 e-
Initial energy                                1 MeV                20 MeV
Typical # of secondaries generated            > 20,000             < 40
Stack size per CUDA thread                    25,000 (1,074 kB)    100 (4.3 kB)
Total active CUDA threads (Nblk x Nthr/blk)   10,240 (80 x 128)    1,048,576 (4,096 x 256)
Total memory usage for stacks                 10,740 MB            4,405 MB
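
The memory figures on this slide follow from simple arithmetic: stack entries per thread, times bytes per entry, times the number of active threads. The sketch below reproduces them approximately. The 44 bytes per stacked particle is an assumption inferred from "100 entries = 4.3 kB"; the function name `stack_memory_mb` is hypothetical.

```python
BYTES_PER_PARTICLE = 44  # assumption: inferred from 100 entries ~ 4.3 kB

def stack_memory_mb(stack_size, n_blocks, threads_per_block):
    """Total memory in MB (1 MB = 1024 * 1024 bytes) for one fixed-size
    secondary stack per CUDA thread."""
    n_threads = n_blocks * threads_per_block
    return stack_size * BYTES_PER_PARTICLE * n_threads / (1024 * 1024)

dna_mb = stack_memory_mb(25_000, 80, 128)   # DNA physics configuration
em_mb = stack_memory_mb(100, 4_096, 256)    # standard EM physics configuration
```

Under this assumption the DNA configuration needs roughly 10.7 GB for stacks alone, close to the 11,439 MB of global memory on the K40c, even though it runs only 10,240 threads.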

SLIDE 39

CUDA Thread Assignment for MPEXS-DNA Physics Simulation

A group of 32 CUDA threads is assigned per event, and the threads in a group share a secondary stack.
cf.) In the MPEXS case (Standard EM Physics), each thread has its own stack.
Host memory is also available as a stack (using virtual memory addressing)
Reduces memory consumption for the stacks and increases active thread number (~10k threads -> more than 1 M threads)
> Keeps high thread occupancy during the simulation

[Figure: Standard EM Physics: each CUDA thread (Thread# 1, 2, 3, ...) has its own secondary stack (capacity: 100) holding e-, e+ and γ. DNA Physics: each warp of 32 threads (Warp #0 = threads 0 to 31 = Event #0, Warp #1 = threads 32 to 63 = Event #1) shares one secondary stack (tot. capacity: 25k) holding p, e- and H, located on device mem. and on host mem.]
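
The thread-to-event mapping described on this slide can be written down directly. This is an illustrative sketch; the function names `assign` and `n_shared_stacks` are hypothetical.

```python
WARP_SIZE = 32

def assign(thread_id):
    """Map a global thread id to (event id, lane within the 32-thread group)."""
    return thread_id // WARP_SIZE, thread_id % WARP_SIZE

def n_shared_stacks(n_threads):
    """Number of shared stacks needed when each group of 32 threads shares one."""
    return (n_threads + WARP_SIZE - 1) // WARP_SIZE
```

With this mapping, threads 0 to 31 work on event 0 and threads 32 to 63 on event 1, and about one million threads need only 32,768 shared stacks instead of one stack per thread.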

SLIDE 40

MPEXS-DNA Physics Performance

Depth dose curves (CPU vs GPU)

[Figure: three depth dose distributions, Dose (Gy) vs z-direction (um), for p 1 MeV, He++ 1 MeV, and e- 100 keV. Legend: Geant4-DNA (CPU), MPEXS-DNA (GPU).]

Good agreement with Geant4-DNA

SLIDE 41

Physico-Chemical Phase for MPEXS-DNA

Physical interactions (Ionization / Excitation / Attachment) produce ionised and excited H2O molecules (H2O+/H2O-, H2O*)
These molecules then dissociate or release energy into the water
Electrons (Ekin < 8.22 eV) become hydrated electrons (e-aq)
These processes occur within 1 ps after irradiation

Electronic state                             Process              Dissociation channel   Fraction (%)
Ionization state                             Dissociative decay   H3O+ + •OH             100
Excitation state: A1B1                       Dissociative decay   •OH + H•               65
                                             Relaxation           H2O + ΔE               35
Excitation state: B1A1                       Auto-ionization      H3O+ + •OH + e-aq      55
                                             Dissociative decay   •OH + •OH + H2         15
                                             Relaxation           H2O + ΔE               30
Excitation state: Rydberg, diffusion bands   Auto-ionization      H3O+ + •OH + e-aq      50
                                             Relaxation           H2O + ΔE               50
Dissociative attachment: H2O-                Dissociative decay   •OH + OH- + H2         100

Ref.) Radiat Environ Biophys (2009) 48: 11-20
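
Choosing a dissociation channel amounts to sampling from the branching fractions in the table on this slide. A minimal sketch for the B1A1 excitation state is shown below; the function name `sample_channel`, the channel labels, and the fixed seed are illustrative choices, and a GPU implementation would draw from its own per-thread random number generator instead.

```python
import random

# Branching fractions (%) for the B1A1 excitation state, taken from the table.
B1A1_CHANNELS = [
    ("auto-ionization: H3O+ + OH + e-aq", 55),
    ("dissociative decay: OH + OH + H2", 15),
    ("relaxation: H2O + dE", 30),
]

def sample_channel(channels, rng):
    """Pick one channel with probability proportional to its fraction."""
    names = [name for name, _ in channels]
    weights = [w for _, w in channels]
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(12345)  # fixed seed so the example is reproducible
counts = {name: 0 for name, _ in B1A1_CHANNELS}
for _ in range(10_000):
    counts[sample_channel(B1A1_CHANNELS, rng)] += 1
```

Over many samples the counts approach the 55 / 15 / 30 split given in the table.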

SLIDE 42

Chemical Phase for MPEXS-DNA

(1) Calculates intermolecular distance (d) for all pairs.
    Computation time increases by O(N^2/2).
      kd-tree algorithm (Geant4-DNA)
      Spreading CUDA threads (MPEXS-DNA)
    Then, makes reactions for pairs with d < R.
(2) Finds the minimum distance among the remaining molecules, and calculates time step (Δt).
(3) Diffuses molecules using Δt.
    A CUDA thread transports a molecule.
(4) Loops (1) ~ (3)

[Flowchart: d < R ? Yes: make reaction. No: diffusion.]

Reaction radius (R) (by Smoluchowski Model):
$R = \frac{k}{4 \pi N_A D}$

Species   Diffusion coefficient [m2/s]
H3O+      9.0E-09
H•        7.0E-09
OH-       5.0E-09
e-aq      4.9E-09
H2        4.8E-09
•OH       2.8E-09
H2O2      2.3E-09

Reactions                          Reaction rate [M-1s-1]
2e-aq + 2H2O -> H2 + 2OH-          5.00E+09
e-aq + •OH -> OH-                  2.95E+10
e-aq + H• + H2O -> OH- + H2        2.65E+10
e-aq + H3O+ -> H• + H2O            2.11E+10
e-aq + H2O2 -> OH- + •OH           1.44E+10
•OH + •OH -> H2O2                  4.40E+09
•OH + H• -> H2O                    1.44E+10
H• + H• -> H2                      1.20E+10
H3O+ + OH- -> 2H2O                 1.43E+10

Ref.) Radiat Environ Biophys (2009) 48: 11-20
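
As a worked example of the Smoluchowski reaction radius, R = k / (4π N_A D), the sketch below evaluates it for e-aq + •OH using the rate and diffusion coefficients listed on this slide. Two points are assumptions rather than statements from the slide: D is taken as the sum of the two species' diffusion coefficients, and the rate constant is converted from M^-1 s^-1 to m^3 mol^-1 s^-1. The function name `reaction_radius` is hypothetical.

```python
import math

N_A = 6.02214076e23  # Avogadro constant, 1/mol

def reaction_radius(k_per_M_per_s, d_sum_m2_per_s):
    """Smoluchowski reaction radius R = k / (4 pi N_A D), in metres.
    k is given in M^-1 s^-1 (L mol^-1 s^-1) and converted to m^3 mol^-1 s^-1."""
    k_si = k_per_M_per_s * 1.0e-3
    return k_si / (4.0 * math.pi * N_A * d_sum_m2_per_s)

# e-aq + OH: k = 2.95E+10 M^-1 s^-1; D assumed to be the sum of the two
# diffusion coefficients (4.9E-09 + 2.8E-09 m^2/s).
R = reaction_radius(2.95e10, 4.9e-9 + 2.8e-9)
```

Under these assumptions the radius comes out at roughly half a nanometre, which is the length scale against which the pair distance d is compared in step (1).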

SLIDE 43

MPEXS-DNA Physics and Chemical Performance

Diffusion and chemical reactions after irradiating a water phantom with a 10 keV electron

G-value = # of molecules / energy loss (# of molecules / 100 eV)

Comparison of G-value profile (CPU vs GPU)
[Figure: G-value (# of molecules / 100 eV) vs Time (ps), 1 to 10^6 ps, for p 20 MeV. Species: •OH, OH-, H3O+, e-aq, H2, H•, H2O2. Line: Geant4-DNA; filled circle: MPEXS-DNA.]
Agrees with Geant4-DNA within ~ 3 %

Verifying with other simulation data
[Figure: G-value (# of molecules / 100 eV) vs Time (ps), 1 to 10^6 ps, for e- 750 keV. •OH and H2O2 from MPEXS-DNA compared with Partrac.]
Ref.) J. Radiat. Res., 46, 333-341 (2005)
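
The G-value used on this slide is the number of molecules of a species per 100 eV of energy loss. A minimal sketch of the calculation follows; the function name `g_value` and the molecule count are hypothetical and do not come from the plots.

```python
def g_value(n_molecules, energy_loss_eV):
    """Number of molecules per 100 eV of energy loss."""
    return 100.0 * n_molecules / energy_loss_eV

# Hypothetical count for a 10 keV energy loss, for illustration only.
g_oh = g_value(n_molecules=450, energy_loss_eV=10_000.0)
```

Evaluating this at successive times in the chemical phase gives the G-value profile plotted against time.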

SLIDE 44

Code optimization for Tesla K40c GPU

Fast math option (nvcc --use_fast_math)
  ~ 1.2x speedup
L1 cache (nvcc -Xptxas -dlcm=ca)
  ~ 1.8x speedup
CUDA Stream
  For kernels without dependency in Physics Phase
    Calculating cross-section value for each physical interaction
  To use GPU resource fully in Chemical Phase

SLIDE 45

GPU Performance for MPEXS-DNA Simulation Including Physics and Chemical Phases

Comparison of event number processed per 1 min.

                      e- 750 keV    p 20 MeV
Geant4-DNA (CPU)      13.48         2.57
MPEXS-DNA (GPU)       3279.47       932.82
Speedup               243x          363x

Up to 360 times speedup against single-core Xeon CPU

Process time for p 20 MeV (total ~15k events)
~ 4 days (single-core Xeon CPU) -> ~ 16 min. (Tesla K40c GPU)

GPU: NVIDIA Tesla K40c, 2,880 cores, 745 MHz
CPU: Intel Xeon E5-2643 v2, 3.50 GHz
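
The speedup factors and the quoted process times on this slide can be checked from the events-per-minute values. The sketch below assumes the chart values are listed in the order e- 750 keV, then p 20 MeV, for each of CPU and GPU; the helper names are hypothetical.

```python
# Events processed per minute, from the chart values on this slide
# (assignment of values to beams is an assumption based on listing order).
rates = {
    "e- 750 keV": {"cpu": 13.48, "gpu": 3279.47},
    "p 20 MeV": {"cpu": 2.57, "gpu": 932.82},
}

def speedup(r):
    """GPU over CPU throughput ratio."""
    return r["gpu"] / r["cpu"]

def minutes_for(n_events, rate_per_min):
    """Wall time in minutes to process n_events at a given rate."""
    return n_events / rate_per_min

n_events = 15_000  # "~15k events" quoted for p 20 MeV
cpu_days = minutes_for(n_events, rates["p 20 MeV"]["cpu"]) / (60 * 24)
gpu_min = minutes_for(n_events, rates["p 20 MeV"]["gpu"])
```

With these inputs the ratios round to 243x and 363x, and the p 20 MeV run takes about 4 days on the CPU against about 16 minutes on the GPU, consistent with the figures quoted on the slide.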

SLIDE 46

Performance Gain for Tesla P100 against Tesla K40c

Comparison of event number processed per 1 min. (preliminary result)

                      e- 750 keV    p 20 MeV
MPEXS-DNA (K40c)      3279.47       932.82
MPEXS-DNA (P100)      10053.09      3028.60
Gain                  3.06x         3.24x

Adopted the same thread configuration as K40c in the simulation with P100
More than 3 times performance gain against K40c

SLIDE 47

Summary

MPEXS-DNA is an extension of MPEXS to DNA Physics.
Geant4-DNA has an issue with long simulation times that should be improved.
We have succeeded in drastically boosting computing performance for microdosimetry simulation using GPU power.
Up to 360 times speedup against single-core Xeon CPU for K40c
A Tesla P100 is equivalent to ~ 1000 cores of Xeon CPU.
~ 3 times performance gain against K40c without any optimization
Could achieve further performance improvement by appropriate optimization.

SLIDE 48

In the near future

Developing “killer applications” based on MPEXS-DNA to quantitatively estimate the biological effects of radiation
DNA single- and double-strand breaks
Cellular survival rate
Radiosensitization of tumors in radiation therapy (e.g. Gold nanoparticle; GNP)
…
Extending MPEXS to “nuclear physics” and “thermal neutron physics”
Proton and carbon therapy
Boron Neutron Capture Therapy
Radiation shielding calculations
…

SLIDE 49

Acknowledgements

Makoto Asai, SLAC
Joseph Perl, SLAC
Andrea Dotti, SLAC
Takashi Sasaki, KEK
Akinori Kimura, Ashikaga Institute of Technology

Margot Gerritsen, ICME, Stanford

SLIDE 50

References

N. Henderson, et al. A CUDA Monte Carlo simulator for radiation therapy dosimetry based on Geant4. <https://dx.doi.org/10.1051/snamc/201404204>
K. Murakami, et al. Geant4 Based simulation of radiation dosimetry in CUDA. <https://dx.doi.org/10.1109/NSSMIC.2013.6829452>
S. Okada, et al. GPU Acceleration of Monte Carlo Simulation at the cellular and DNA levels. <https://dx.doi.org/10.1007/978-3-319-23024-5_29>
S. Agostinelli, et al. Geant4 - a simulation toolkit. <https://dx.doi.org/10.1016/S0168-9002(03)01368-8>
M.A. Bernal, et al. Track structure modeling in liquid water: A review of the Geant4-DNA very low energy extension of the Geant4 Monte Carlo simulation toolkit. <https://dx.doi.org/10.1016/j.ejmp.2015.10.087>
