Fast GPU Monte Carlo Simulation for Radiotherapy, DNA Ionization and Beyond




slide-1
SLIDE 1

Fast GPU Monte Carlo Simulation for Radiotherapy, DNA Ionization and Beyond

2017 GPU Technology Conference Shogo Okada <shogo@port.kobe-u.ac.jp> Koichi Murakami <koichi.murakami@kek.jp> Nick Henderson <nick.henderson@gmail.com>

slide-2
SLIDE 2

Outline

  • Geant4
  • GPU experimentation
  • MPEXS
  • Algorithm research
  • Application development
  • Geant4 multi-threading

slide-3
SLIDE 3

Big Picture

slide-4
SLIDE 4

Particle state: $(\vec{x}, \vec{p}, k)$, with $k \in \{\gamma, e^-, e^+, \ldots\}$

Goal: record effect of particle interaction in material

slide-5
SLIDE 5

Geant4

  • Toolkit for simulation of particles traveling through and interacting with matter
  • Supports wide variety of physics models, geometries, and materials
  • Extendable - users can add new models
  • Used in numerous and diverse application areas
      • high energy physics
      • medical physics
      • spacecraft
      • semiconductor devices
      • biology research

[Figures: ATLAS, LISA, gMocren]

slide-6
SLIDE 6

Parallelism

  • Simulations require many events for statistical significance
  • Events are IID
  • Each computation thread processes an event

Challenges:

  • Random nature of simulation leads to thread divergence
  • Storage of secondary particles
  • Recording of energy deposition

If you want to consider full capability of Geant4:

  • Very complicated geometry -- non uniform data structures
  • Many material types
  • Large data tables to support physics processes
slide-7
SLIDE 7

MPEXS

  • MPEXS is an adaptation of the core simulation algorithm from Geant4 for GPU
  • Target application: X-ray radiotherapy
  • Geometry: uniformly discretized box
  • Material: Water with variable density
  • Physics: Low energy electromagnetics
      • Gamma: Compton scattering, photoelectric effect, pair-production
      • Electron/Positron: ionization, multiple scattering, Bremsstrahlung, positron annihilation
  • Each GPU thread tracks an active particle
  • Secondary particles are stored on thread-local secondary stacks
  • Threads deposit energy to a shared global domain (via atomicAdd)
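
The energy-scoring step in the last bullet can be illustrated off the GPU. The following is a minimal sketch in Python/NumPy, not the MPEXS code: the helper name `deposit_energy`, the grid shape and the sample deposits are all hypothetical. `np.add.at` stands in for `atomicAdd`, because it accumulates correctly when several particles deposit into the same voxel.

```python
import numpy as np

def deposit_energy(dose_grid, positions_mm, energies_mev, voxel_size_mm):
    """Accumulate energy deposits into a voxel grid.

    np.add.at is an unbuffered add: repeated voxel indices are summed,
    which mirrors what atomicAdd guarantees for concurrent GPU threads.
    """
    idx = np.floor(positions_mm / voxel_size_mm).astype(int)
    # Discard deposits that fall outside the grid.
    shape = np.array(dose_grid.shape)
    inside = np.all((idx >= 0) & (idx < shape), axis=1)
    idx = idx[inside]
    np.add.at(dose_grid, (idx[:, 0], idx[:, 1], idx[:, 2]), energies_mev[inside])
    return dose_grid

# Hypothetical 4 x 4 x 4 grid with 5 x 5 x 2 mm voxels.
grid = np.zeros((4, 4, 4))
voxel = np.array([5.0, 5.0, 2.0])
pos = np.array([[1.0, 1.0, 1.0],    # voxel (0, 0, 0)
                [2.0, 3.0, 0.5],    # voxel (0, 0, 0) again
                [6.0, 1.0, 3.0],    # voxel (1, 0, 1)
                [99.0, 1.0, 1.0]])  # outside the grid
edep = np.array([0.2, 0.3, 1.0, 5.0])
deposit_energy(grid, pos, edep, voxel)
```

A plain `dose_grid[i, j, k] += e` with repeated indices would keep only one of the colliding deposits, which is the same race that `atomicAdd` avoids on the device.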
slide-8
SLIDE 8

MPEXS - Performance & Validation

slide-9
SLIDE 9

Verification for Dose Distribution

Dose distribution of slab phantoms

  • phantom size : 30.5 x 30.5 x 30 cm
  • voxel size : 5 x 5 x 2 mm
  • field size : 10 cm2
  • SSD : 100 cm
  • slab materials : (1) water, (2) lung, (3) bone
  • density : water 1.0 g/cm3, lung 0.26 g/cm3, bone 1.85 g/cm3, air 0.0012 g/cm3

Beam particle and its initial kinetic energy:

  • electron with 20MeV
  • photon with 6MV Linac
  • photon with 18MV Linac

[Figure: slab phantom geometry in the y-z plane, with the source in air in front of the phantom]

slide-10
SLIDE 10

Comparison of depth dose for γ 6MV

  • Legend: G4 v9.6.3 and G4CU (MPEXS)
  • x-axis: z-direction (cm)
  • y-axis: dose (Gy)
  • residual = (G4CU−G4) / G4

[Figure: depth dose distributions for the (1) water, (2) lung and (3) bone slab phantoms over a depth of 0 to 30 cm, each with a residual panel spanning -0.2 to 0.2]

slide-11
SLIDE 11

Comparison of depth dose for γ 18MV

  • Legend: G4 v9.6.3 and G4CU (MPEXS)
  • x-axis: z-direction (cm)
  • y-axis: dose (Gy)
  • residual = (G4CU−G4) / G4

[Figure: depth dose distributions for the (1) water, (2) lung and (3) bone slab phantoms over a depth of 0 to 30 cm, each with a residual panel spanning -0.2 to 0.2]

slide-12
SLIDE 12

Comparison of depth dose for e- 20MeV

  • Legend: G4 v9.6.3 and G4CU (MPEXS)
  • x-axis: z-direction (cm)
  • y-axis: dose (Gy)
  • residual = (G4CU−G4) / G4

[Figure: depth dose distributions for the (1) water, (2) lung and (3) bone slab phantoms over a depth of 0 to 30 cm, each with a residual panel spanning -0.2 to 0.2, plus the same three distributions on a log scale]

slide-13
SLIDE 13

Computation Time Performance

185~250 times speedup against single-core G4 simulation!!

γ beam with 6MV
                             (1) water   (2) lung   (3) bone
  G4   [msec/particle]       0.780       0.822      0.819
  G4CU [msec/particle]       0.00336     0.00331    0.00341
  speedup ( = G4 / G4CU )    232         248        240

γ beam with 18MV
                             (1) water   (2) lung   (3) bone
  G4   [msec/particle]       0.803       0.857      0.924
  G4CU [msec/particle]       0.00433     0.00425    0.00443
  speedup ( = G4 / G4CU )    185         201        208

e- beam with 20MeV
                             (1) water   (2) lung   (3) bone
  G4   [msec/particle]       1.84        1.87       1.65
  G4CU [msec/particle]       0.00881     0.00958    0.00885
  speedup ( = G4 / G4CU )    208         195        193

GPU:

  • Tesla K20c (Kepler architecture)
  • 2496 cores, 706 MHz
  • 4096 x 128 threads

CPU:

  • Xeon E5-2643 v2 3.50 GHz

# of primaries:

  • 50M particles -> e- 20MeV
  • 500M particles -> γ 6MV, 18MV
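
The speedup factor on this slide is simply the ratio of the two per-particle times. As a small worked check in Python, using the 6 MV photon figures quoted above (the dictionary and function names are only illustrative):

```python
# Times per particle in msec, copied from the 6 MV photon figures above.
g4_msec = {"water": 0.780, "lung": 0.822, "bone": 0.819}
g4cu_msec = {"water": 0.00336, "lung": 0.00331, "bone": 0.00341}

def speedup(cpu_time, gpu_time):
    """Speedup factor = G4 time / G4CU time."""
    return cpu_time / gpu_time

speedups = {m: round(speedup(g4_msec[m], g4cu_msec[m])) for m in g4_msec}
print(speedups)
```

Rounded to integers, the ratios reproduce the 232, 248 and 240 quoted for water, lung and bone.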

slide-14
SLIDE 14

Algorithm Research

slide-15
SLIDE 15
  • MPEXS does not attempt to sort particles
  • Thread divergence: if threads in the same warp are tracking different particle kinds, then thread divergence occurs in physics process code
  • Size of particle stack is the same for each thread and is fixed at run-time. Some applications call for the generation of many secondary particles. This restriction meant that we could only run with a small number of active threads.

slide-16
SLIDE 16
[Figure: thread divergence. Particles in memory (index 1 to 7, a mix of e- and e+) and the resulting computation, with separate "process e-" and "process e+" stages]

slide-17
SLIDE 17

MPEXS Experiments

  • Initialize each thread with the same random number generator state. This leads to a non-physical simulation, but eliminates thread divergence. We saw a factor 3x speedup in these runs.
  • Measure the time it takes to sort the particle index by selected process and perform a run-length encode, and compare it against the time for a single trip through the event loop. Calculations indicate we should expect a factor 2x speedup if implemented in the full simulation.
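
The sort and run-length encode step from the second experiment can be sketched as follows. This is a CPU illustration in Python/NumPy under assumed names (`sort_and_run_length_encode`, the process ids 0 and 1), not the GPU implementation that was timed; on a GPU the same pattern would normally be built from library sort and reduce-by-key primitives.

```python
import numpy as np

def sort_and_run_length_encode(process_ids):
    """Return (order, values, starts, counts).

    order  : particle indices sorted by selected process id
    values : distinct process ids in sorted order
    starts : start offset of each run inside `order`
    counts : length of each run
    """
    process_ids = np.asarray(process_ids)
    order = np.argsort(process_ids, kind="stable")
    sorted_ids = process_ids[order]
    values, starts, counts = np.unique(
        sorted_ids, return_index=True, return_counts=True)
    return order, values, starts, counts

# Hypothetical process ids: 0 = Compton scattering, 1 = photoelectric effect.
selected = [1, 0, 0, 1, 0]
order, values, starts, counts = sort_and_run_length_encode(selected)
for v, s, c in zip(values, starts, counts):
    print(f"process {v}: particles {order[s:s + c].tolist()}")
```

Each run `(start, count)` identifies a contiguous block of particles that all selected the same process, so one block can be handed to one physics kernel without divergence on the process choice.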

slide-18
SLIDE 18

New Architecture

  • Goal 1: minimize/eliminate thread divergence
  • Goal 2: eliminate need for fixed-size and thread-local secondary stacks

  • Goal 3: maintain extensibility
slide-19
SLIDE 19

How it works

slide-20
SLIDE 20

[Figure: the input buffer holds 16 γ]

slide-21
SLIDE 21

[Figure: as before, plus output buffers holding 6 γ and 2 e-]

slide-22
SLIDE 22

[Figure: pop. A block of 5 γ is popped from the input buffer]

slide-23
SLIDE 23

[Figure: process selection. Each of the 5 popped γ selects Compton scattering or Photoelectric effect]

slide-24
SLIDE 24

[Figure: sort by selected process. The 5 γ are grouped into Compton scattering and Photoelectric effect]

slide-25
SLIDE 25

[Figure: secondary generation. The processes produce secondary particles: 3 γ and 5 e-]

slide-26
SLIDE 26

[Figure: secondary storage. The secondaries are appended to the output buffers, which now hold 9 γ and 7 e-]

slide-27
SLIDE 27

Features

  • Store particles on a generalized stack that allows pushing and popping a block of particles in one operation.
  • Group particles by kind (gamma, e-, e+). When we pop a block of particles, we know they are all the same kind, thus we can apply the same (non-divergent) operations.
  • Maintain separate input and output buffers. Physics processes know the input and output particles. For example, in Compton scattering the input is a photon and the output is a scattered photon and a recoil electron. Thus, we can read from the active input photon buffers and write to output electron and photon buffers that are pushed onto appropriate stacks.
  • The sort and run-length encode operations are applied after process selection so that after-step processes are applied only to particles that call for it.
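
The first three features can be sketched together. This is a minimal Python/NumPy illustration under stated assumptions: the `BlockStack` class, its capacity, the single-field particle record (energy only) and the 70/30 energy split are hypothetical stand-ins, not the data structures or physics of the new architecture.

```python
import numpy as np

class BlockStack:
    """Stack of particle records that pushes and pops whole blocks."""

    def __init__(self, capacity, n_fields):
        self.data = np.empty((capacity, n_fields))
        self.size = 0

    def push_block(self, block):
        block = np.atleast_2d(block)
        n = len(block)
        if self.size + n > len(self.data):
            raise OverflowError("stack capacity exceeded")
        self.data[self.size:self.size + n] = block
        self.size += n

    def pop_block(self, max_count):
        n = min(max_count, self.size)
        block = self.data[self.size - n:self.size].copy()
        self.size -= n
        return block

# One stack per particle kind, so a popped block is always homogeneous.
stacks = {kind: BlockStack(capacity=1024, n_fields=1)
          for kind in ("gamma", "e-", "e+")}
stacks["gamma"].push_block(np.full((16, 1), 6.0))  # 16 photons, hypothetical 6 MeV

# Toy stand-in for Compton scattering: each input photon yields a
# scattered photon and a recoil electron (energy split is made up).
photons = stacks["gamma"].pop_block(5)
scattered = photons * 0.7
recoil = photons - scattered
stacks["gamma"].push_block(scattered)
stacks["e-"].push_block(recoil)
```

Because input and output blocks are separate arrays, the write side is a contiguous copy into the destination stack, which is the access pattern that the separate-buffer design is meant to encourage.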

slide-28
SLIDE 28

Properties

  • No thread divergence due to process selection. Thread divergence may still occur in the application of a physics process, because many of them rely on sample-reject algorithms to sample from various distributions.
  • Reads of particle data in the after-step physics process are non-coalesced. However, all writes of particle data are coalesced. We have to pay for the randomness somewhere.
  • Thread-local stacks are not required.
slide-29
SLIDE 29

Experiments

  • The new architecture is substantially different from MPEXS. We have not yet ported the physics processes over. We've done performance experiments with fake/model physics processes (which mimic computation and memory access patterns of the real ones).
  • We can vary the number of physics processes and the amount of data moved. The numbers shown are the speedup of the new architecture against the old one for a variety of configurations.

slide-30
SLIDE 30

Speedup via new architecture

  • Speedup due to sorting by process id for fake/model processes
  • Vary number of processes and amount of data required by each process
  • Results collected from K40

Speedup (rows: data transfer in number of floats; columns: number of processes)

  Data transfer       Number of processes
  (float #)       1     2     4     8    16    32    64   128
    1           0.5   0.6   0.8   1.0   1.3   1.8   2.7   4.1
    2           0.5   0.7   0.8   1.0   1.5   2.1   2.9   4.2
    4           0.6   0.7   0.9   1.2   1.8   2.6   3.4   4.6
    8           0.6   0.8   1.1   1.6   2.4   3.3   4.2   5.2
   16           0.6   1.0   1.5   2.1   3.0   4.1   4.9   5.9
   32           0.7   1.2   1.8   2.6   3.6   4.6   5.4   6.3
   64           0.8   1.4   2.0   2.8   3.9   4.9   5.7   6.6
  128           0.9   1.7   2.4   3.1   4.0   5.1   5.9   6.7

slide-31
SLIDE 31

Summary

  • MPEXS is a GPU-based Monte Carlo simulator for X-ray radiotherapy
  • MPEXS attains around 200x speedup when compared to Geant4 running on a single CPU core
  • Algorithm experimentation indicates a further 2x speedup with a sort operation after process selection
  • New architecture also opens opportunities for other applications
      • better performance with more physics processes
      • no thread-local secondary stacks
slide-32
SLIDE 32

Outline

  • Geant4
  • GPU experimentation
  • MPEXS
  • Algorithm research
  • Application development
  • Geant4 multi-threading

slide-33
SLIDE 33

MPEXS-DNA

slide-34
SLIDE 34

The Geant4-DNA Project

“Geant4-DNA”, an extension of Geant4 to DNA physics

  • Estimates biological effects (e.g. DNA strand breaks) of radiation at an ultra-low energy scale (down to meV)
  • The main objective of the project:
      • Evaluate effects on human health of chronic radiation exposure
      • e.g. medical diagnostics, astronauts in space missions, airline crews, …
  • Its computing performance should be improved using GPU power.
      • Energy spread in cells is an important factor for DNA damage.
      • Geant4-DNA calculates complex track geometry within cells.
      • It needs to handle a large number of secondary particles.
          • e.g. more than 20k secondaries are generated per primary
      • Days to weeks of simulation on a CPU cluster
slide-35
SLIDE 35
MPEXS-DNA, microdosimetry simulation on GPU

  • Based on Geant4-DNA 10.02 p03
  • 1. Physical Phase
      • EM Physics for lower energy range (down to meV)
      • Calculates dose distributions
      • Calculates energy loss and generates primary chemical species such as excited and ionized H2O (H2O*, H2O-/+) and e-aq
  • 2. Chemical Phase
      • Radiolysis of water
      • Diffusion, production and reactions of chemical species
  • 3. Biological Phase (future work)
      • Estimates DNA damage

[Figure: EM shower in DNA; chromatin fiber (constituent of chromosomes), ∅ 10 nm]

http://www.windows2universe.org/earth/Life/cell_radiation_damage.html

slide-36
SLIDE 36

MPEXS-DNA Physics Processes

Physics processes for X-rays (all Livermore models, 100 eV - 1 GeV):

  • Compton scattering
  • Photoelectric effect
  • Gamma conversion
  • Rayleigh scattering

Physics processes by particle (energy range, model). Helium atoms cover He++, He+ and He0; "-" means the process is not modelled for that particle.

Elastic scattering
  • Electrons: 9 eV - 10 keV, Uehara; 10 keV - 1 MeV, Champion
  • Protons, hydrogen atoms and helium atoms: 100 eV - 1 MeV, Hoang; 100 eV - 10 MeV, Hoang

Excitation
  • Electrons: 10 eV - 10 keV, Emfietzoglou; 10 keV - 1 MeV, Born
  • Protons: 10 eV - 500 keV, Miller Green; 500 keV - 100 MeV, Born
  • Hydrogen atoms: 10 eV - 500 keV, Miller Green
  • Helium atoms: 1 keV - 400 MeV, Miller Green

Charge change
  • Electrons: -
  • Protons: 100 eV - 10 MeV, Dingfelder
  • Hydrogen atoms: 100 eV - 10 MeV, Dingfelder
  • Helium atoms: 1 keV - 400 MeV, Dingfelder

Ionization
  • Electrons: 10 eV - 10 keV, Emfietzoglou; 10 keV - 1 MeV, Born
  • Protons: 100 eV - 500 keV, Rudd; 500 keV - 100 MeV, Born
  • Hydrogen atoms: 100 eV - 100 MeV, Rudd
  • Helium atoms: 1 keV - 400 MeV, Rudd

Vibrational excitation
  • Electrons: 2 - 100 eV, Michaud et al.
  • Protons, hydrogen atoms, helium atoms: -

Dissociative attachment
  • Electrons: 4 - 13 eV, Melton
  • Protons, hydrogen atoms, helium atoms: -

Atomic deexcitation occurs during the ionization process, and emits Auger electrons and X-rays.

[Figure: sketches of the interaction processes, e.g. dissociative attachment AB + e- -> AB- -> A + B-]

slide-37
SLIDE 37

The difference of energy loss process (EM Physics vs DNA Physics)

Standard EM Physics

  • Continuous process
      • Energy loss is below a given threshold.
      • Calculates average energy loss at each step with the Bethe-Bloch formula.
      • No secondaries are generated.
  • Discrete process
      • Generates a secondary if energy loss is above the threshold.

DNA physics

  • Handled as a discrete process without energy thresholds, to calculate the complex energy spread within cells for DNA damage
  • A large number of secondaries are generated (~ 20k / primary).

[Figure: a particle track with energy losses ΔE1 ... ΔE6 over steps Δx1 ... Δx4, contrasting the "continuous process" with the "discrete process" (ionization, excitation); the Bethe-Bloch formula is shown as an image]

slide-38
SLIDE 38
An issue on lower thread occupancy in physics simulation

  • DNA Physics simulation had an issue of low thread occupancy.
  • The number of active threads was limited due to the large memory consumption for storing generated secondaries on the stack.

NVIDIA, Tesla K40c, Global Memory: 11,439 MB (GDDR5)

The difference of # of secondaries and active thread number (DNA vs EM)

                                                DNA Physics           EM Physics
  Incident particle                             He++                  e-
  Initial energy                                1 MeV                 20 MeV
  Typical # of secondaries generated            > 20,000              < 40
  Stack size per CUDA thread                    25,000 (1,074 kB)     100 (4.3 kB)
  Total active CUDA threads (Nblk x Nthr/blk)   10,240 (80 x 128)     1,048,576 (4,096 x 256)
  Total memory usage for stacks                 10,740 MB             4,405 MB
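
The memory figures on this slide follow from multiplying the per-thread stack size by the number of active threads. A short Python check, using only the numbers quoted above (the function name is illustrative, and 1 MB is taken as 1024 kB):

```python
def total_stack_memory_mb(stack_kb_per_thread, n_blocks, threads_per_block):
    """Total stack memory in MB with one thread-local stack per CUDA thread."""
    n_threads = n_blocks * threads_per_block
    return stack_kb_per_thread * n_threads / 1024.0

dna_mb = total_stack_memory_mb(1074, 80, 128)    # He++ 1 MeV, 25,000 entries per stack
em_mb = total_stack_memory_mb(4.3, 4096, 256)    # e- 20 MeV, 100 entries per stack
print(f"DNA physics: {dna_mb:.0f} MB, EM physics: {em_mb:.0f} MB")
```

The DNA case gives 10,740 MB, which nearly fills the 11,439 MB of the K40c with only 10,240 threads. The EM case gives about 4,400 MB, close to the quoted 4,405 MB; the small gap presumably comes from rounding of the 4.3 kB stack size.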

slide-39
SLIDE 39

CUDA Thread Assignment for MPEXS-DNA Physics Simulation

  • A group of 32 CUDA threads is assigned per event, and the threads in a group share a secondary stack.
      • cf.) In the MPEXS case (Standard EM Physics), each thread has its own stack.
  • Host memory is also available as a stack (using virtual memory addressing).
  • Reduces memory consumption for the stacks and increases the number of active threads (~10k threads -> more than 1 M threads)
  • -> Keeps high thread occupancy during the simulation

[Figure: Standard EM Physics: each CUDA thread (thread# 1, 2, 3, ...) has its own secondary stack (capacity: 100). DNA Physics: each warp of 32 threads (warp #0: threads 0 to 31, event #0; warp #1: threads 32 to 63, event #1) shares one secondary stack (total capacity: 25k), held on device memory and on host memory]
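
The thread-to-event assignment described on this slide amounts to integer division by the warp size. A minimal Python sketch, assuming a flat global thread index and hypothetical helper names (the actual indexing in MPEXS-DNA is not shown in the slides):

```python
WARP_SIZE = 32  # threads that share one event and one secondary stack

def event_and_lane(thread_id):
    """Map a global CUDA thread index to (event index, lane within the warp)."""
    return thread_id // WARP_SIZE, thread_id % WARP_SIZE

def stacks_needed(n_threads, shared):
    """Number of secondary stacks: one per warp if shared, else one per thread."""
    return n_threads // WARP_SIZE if shared else n_threads

print(event_and_lane(0), event_and_lane(31), event_and_lane(32), event_and_lane(63))
print(stacks_needed(1_048_576, shared=True))
```

With a shared stack, the number of stacks drops by a factor of 32 for the same thread count, which is what frees enough memory to raise the number of active threads.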
slide-40
SLIDE 40

MPEXS-DNA Physics Performance

Depth dose curves (CPU vs GPU)

  • Legend: Geant4-DNA (CPU), MPEXS-DNA (GPU)
  • Panels: p 1 MeV, He++ 1 MeV, e- 100 keV
  • x-axis: z-direction (um)
  • y-axis: Dose (Gy)

[Figure: depth dose distribution for each of the three beams]

Good agreement with Geant4-DNA

slide-41
SLIDE 41

Physico-Chemical Phase for MPEXS-DNA

  • Physical interactions (Ionization / Excitation / Attachment) produce ionised and excited H2O molecules (H2O+/H2O-, H2O*)
  • These molecules then dissociate or release their energy into the water
  • Electrons (Ekin < 8.22 eV) become hydrated electrons (e-aq)
  • These processes occur within 1 ps after irradiation

  Electronic state                             Process              Dissociation channel   Fraction (%)
  Ionization state                             Dissociative decay   H3O+ + •OH             100
  Excitation state: A1B1                       Dissociative decay   •OH + H•               65
                                               Relaxation           H2O + ΔE               35
  Excitation state: B1A1                       Auto-ionization      H3O+ + •OH + e-aq      55
                                               Dissociative decay   •OH + •OH + H2         15
                                               Relaxation           H2O + ΔE               30
  Excitation state: Rydberg, diffusion bands   Auto-ionization      H3O+ + •OH + e-aq      50
                                               Relaxation           H2O + ΔE               50
  Dissociative attachment: H2O-                Dissociative decay   •OH + OH- + H2         100

Ref.) Radiat Environ Biophys (2009) 48: 11-20

slide-42
SLIDE 42

Chemical Phase for MPEXS-DNA

(1) Calculates the intermolecular distance (d) for all pairs, then makes reactions for pairs with d < R.
  • Computation time increases as O(N^2/2).
  • kd-tree algorithm (Geant4-DNA)
  • Spreading over CUDA threads (MPEXS-DNA)
(2) Finds the minimum distance among the remaining pairs, and calculates the time step (Δt).
(3) Diffuses molecules using Δt.
  • A CUDA thread transports a molecule.
(4) Loops over (1) ~ (3).

Reaction radius (R) (by Smoluchowski Model):

$R = \dfrac{k}{4 \pi N_A D}$

  Species   Diffusion coefficient [m2/s]
  H3O+      9.0E-09
  H•        7.0E-09
  OH-       5.0E-09
  e-aq      4.9E-09
  H2        4.8E-09
  •OH       2.8E-09
  H2O2      2.3E-09

  Reactions                          Reaction rate [M-1s-1]
  2e-aq + 2H2O -> H2 + 2OH-          5.00E+09
  e-aq + •OH -> OH-                  2.95E+10
  e-aq + H• + H2O -> OH- + H2        2.65E+10
  e-aq + H3O+ -> H• + H2O            2.11E+10
  e-aq + H2O2 -> OH- + •OH           1.44E+10
  •OH + •OH -> H2O2                  4.40E+09
  •OH + H• -> H2O                    1.44E+10
  H• + H• -> H2                      1.20E+10
  H3O+ + OH- -> 2H2O                 1.43E+10

Ref.) Radiat Environ Biophys (2009) 48: 11-20
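
The reaction radius formula and the d < R test can be worked through for one reaction. This is a minimal Python sketch under two assumptions that the slide does not spell out: D is taken as the sum of the diffusion coefficients of the two reactants, and the rate constant is converted from M^-1 s^-1 to m^3 mol^-1 s^-1. The function names are hypothetical.

```python
import math

AVOGADRO = 6.02214076e23  # 1/mol

def reaction_radius_nm(rate_per_molar_s, d_a_m2_s, d_b_m2_s):
    """Smoluchowski reaction radius R = k / (4 pi N_A D), returned in nm.

    Assumes D is the sum of the two diffusion coefficients.
    """
    k_m3 = rate_per_molar_s * 1.0e-3   # M^-1 s^-1 (L/mol/s) -> m^3/mol/s
    d_sum = d_a_m2_s + d_b_m2_s
    return k_m3 / (4.0 * math.pi * AVOGADRO * d_sum) * 1.0e9

def react_or_diffuse(distance_nm, radius_nm):
    """Decision from the flow chart: react if d < R, otherwise diffuse."""
    return "react" if distance_nm < radius_nm else "diffuse"

# e-aq + .OH -> OH-, using the rate and diffusion coefficients listed above.
r = reaction_radius_nm(2.95e10, 4.9e-9, 2.8e-9)
print(f"R = {r:.2f} nm")
```

Under these assumptions the radius for e-aq + •OH comes out at roughly half a nanometre, so only pairs that are already very close react within a step; all other molecules take the diffusion branch.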

slide-43
SLIDE 43

MPEXS-DNA Physics and Chemical Performance

Diffusion and chemical reactions after irradiating a water phantom with a 10 keV electron

Comparison of G-value profile (CPU vs GPU)

  • Line: Geant4-DNA
  • Filled circle: MPEXS-DNA
  • p 20 MeV
  • Agrees with Geant4-DNA within ~ 3 %

G-value = (# of molecules) / (energy loss), expressed as # of molecules / 100 eV

[Figure: G-value versus time (1 ps to 10^6 ps) for •OH, OH-, H3O+, e-aq, H2, H• and H2O2]

Verifying with other simulation data

  • e- 750 keV
  • •OH and H2O2: MPEXS-DNA compared with Partrac
  • Ref.) J. Radiat. Res., 46, 333–341 (2005)

[Figure: G-value versus time (1 ps to 10^6 ps) for •OH and H2O2, MPEXS-DNA and Partrac]
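
The G-value definition on this slide is a simple ratio, normalized to 100 eV of energy loss. A short Python sketch; the molecule counts and deposited energy below are made up for illustration and are not results from the talk:

```python
def g_value(n_molecules, energy_loss_ev):
    """G-value: number of molecules per 100 eV of energy loss."""
    return n_molecules / (energy_loss_ev / 100.0)

def relative_difference(gpu, cpu):
    """Relative difference (GPU - CPU) / CPU between two G-values."""
    return (gpu - cpu) / cpu

# Hypothetical counts, for illustration only: molecules present after
# 10 keV of energy loss in the GPU run and in the CPU run.
g_gpu = g_value(480, 10_000.0)
g_cpu = g_value(470, 10_000.0)
print(g_gpu, relative_difference(g_gpu, g_cpu))
```

An agreement "within ~ 3 %" corresponds to the absolute value of this relative difference staying below 0.03 at each time point.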

slide-44
SLIDE 44
Code optimization for Tesla K40c GPU

  • Fast math option (nvcc --use_fast_math)
      • ~ 1.2x speedup
  • L1 cache (nvcc -Xptxas -dlcm=ca)
      • ~ 1.8x speedup
  • CUDA Stream
      • For kernels without dependency in Physics Phase
          • Calculating cross-section value for each physical interaction
      • To use GPU resource fully in Chemical Phase

slide-45
SLIDE 45

GPU Performance for MPEXS-DNA Simulation Including Physics and Chemical Phases

Up to 360 times speedup against single-core Xeon CPU

Comparison of event number processed per 1 min.

                       Geant4-DNA (CPU)   MPEXS-DNA (GPU)   Speedup
  e- 750 keV           13.48              3279.47           243x
  p 20 MeV             2.57               932.82            363x

  • Process time for p 20 MeV (total ~15k events)
      • ~ 4 days (single-core Xeon CPU) -> ~ 16 min. (Tesla K40c GPU)
  • GPU:
      • NVIDIA, Tesla K40c, 2,880 cores, 745 MHz
  • CPU:
      • Intel, Xeon E5-2643 v2, 3.50 GHz
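
The speedup factors and the quoted process times can be cross-checked from the event rates on this slide. A short Python calculation using only those numbers (variable names are illustrative):

```python
# Events per minute, read from the bar chart on this slide.
cpu_rate = {"e- 750 keV": 13.48, "p 20 MeV": 2.57}       # Geant4-DNA (CPU)
gpu_rate = {"e- 750 keV": 3279.47, "p 20 MeV": 932.82}   # MPEXS-DNA (K40c)

speedup = {k: gpu_rate[k] / cpu_rate[k] for k in cpu_rate}

n_events = 15_000  # "total ~15k events" for p 20 MeV
cpu_days = n_events / cpu_rate["p 20 MeV"] / 60 / 24
gpu_minutes = n_events / gpu_rate["p 20 MeV"]

print({k: round(v) for k, v in speedup.items()})
print(f"CPU: {cpu_days:.1f} days, GPU: {gpu_minutes:.0f} min")
```

The ratios round to 243x for e- 750 keV and 363x for p 20 MeV, and 15k proton events take about 4 days on one CPU core versus about 16 minutes on the K40c, consistent with the figures on the slide.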

slide-46
SLIDE 46

Performance Gain for Tesla P100 against Tesla K40c

Preliminary result

Comparison of event number processed per 1 min.

                       MPEXS-DNA (K40c)   MPEXS-DNA (P100)   Gain
  e- 750 keV           3279.47            10053.09           3.06x
  p 20 MeV             932.82             3028.60            3.24x

  • Adopted the same thread configuration as K40c in the simulation with P100
  • More than 3 times performance gain against K40c

slide-47
SLIDE 47

Summary

  • MPEXS-DNA is an extension of MPEXS to DNA Physics.
  • Geant4-DNA suffers from long simulation times, which need to be reduced.
  • We have succeeded in drastically boosting the computing performance of microdosimetry simulation using GPU power.
      • Up to 360 times speedup against single-core Xeon CPU for K40c
  • A Tesla P100 is equivalent to ~ 1000 cores of Xeon CPU.
      • ~ 3 times performance gain against K40c without any optimization
      • Could achieve further performance improvement by appropriate optimization.
slide-48
SLIDE 48

In near future

  • Developing “killer applications” based on MPEXS-DNA to estimate biological effects of radiation quantitatively
      • DNA single- and double-strand breaks
      • Cellular survival rate
      • Radiosensitization to tumor in radiation therapy (e.g. Gold nanoparticle; GNP)
  • Extending MPEXS to “nuclear physics” and “thermal neutron physics”
      • Proton and carbon therapy
      • Boron Neutron Capture Therapy
      • Radiation shielding calculations
slide-49
SLIDE 49

Acknowledgements

  • Makoto Asai, SLAC
  • Joseph Perl, SLAC
  • Andrea Dotti, SLAC
  • Takashi Sasaki, KEK
  • Akinori Kimura, Ashikaga Institute of Technology

  • Margot Gerritsen, ICME, Stanford
slide-50
SLIDE 50

References

  • N. Henderson, et al. A CUDA Monte Carlo simulator for radiation therapy dosimetry based on Geant4. <https://dx.doi.org/10.1051/snamc/201404204>
  • K. Murakami, et al. Geant4 Based simulation of radiation dosimetry in CUDA. <https://dx.doi.org/10.1109/NSSMIC.2013.6829452>
  • S. Okada, et al. GPU Acceleration of Monte Carlo Simulation at the cellular and DNA levels. <https://dx.doi.org/10.1007/978-3-319-23024-5_29>
  • S. Agostinelli, et al. Geant4 - a simulation toolkit. <https://dx.doi.org/10.1016/S0168-9002(03)01368-8>
  • M.A. Bernal, et al. Track structure modeling in liquid water: A review of the Geant4-DNA very low energy extension of the Geant4 Monte Carlo simulation toolkit. <https://dx.doi.org/10.1016/j.ejmp.2015.10.087>