SLIDE 1

Accelerating Proton Computed Tomography with GPUs

Thomas D. Uram, Argonne Leadership Computing Facility
Michael E. Papka, Argonne Leadership Computing Facility, Northern Illinois University
Nicholas T. Karonis, Northern Illinois University, Argonne National Laboratory

SLIDE 2

Argonne Leadership Computing Facility - Thomas D. Uram (turam@anl.gov)

Overview

Proton computed tomography (pCT) is an alternative to X-ray based CAT scans, which promises several medical benefits at the cost of being significantly more computationally expensive

We designed a 60-node GPU cluster to meet the computational challenge

Computed tomography
Benefits of proton computed tomography
Computational problem description
CPU/GPU performance comparison

SLIDE 3

What is Computed Tomography?

CAT (or CT) scans are well-known
CAT = "computerized axial tomography"
CAT scans are used to reconstruct the density distribution within a volume, typically used in medical imaging
CAT scans are conducted with photons (X-rays)

What is Proton Computed Tomography?
A reconstruction technique similar to X-ray computed tomography, conducted with protons instead of photons

SLIDE 4

Why Proton Computed Tomography?

13 million people are diagnosed with cancer each year worldwide
2.6 million of them are candidates for proton therapy treatment
Proton therapy involves depositing protons at precise locations within a tumor site where they irradiate the target tissue
The protons emit lower radiation as they travel through the body until they reach the target, where they emit a burst of radiation (the Bragg peak)
Healthy tissue beyond the tumor site receives nominally no radiation
It is crucially important to precisely identify the tumor site
  To ensure that cancerous tissue is destroyed
  To avoid damaging healthy tissue surrounding the tumor, especially in sensitive areas
Proton therapy treatment planning is currently performed using X-ray imaging
  Photons and protons interact with intermediate material differently
  Conversion between photon/proton modalities involves a systematic range error of 3-5%

Image source: Wikipedia

SLIDE 5

Proton computed tomography

Our goal is to reconstruct the volume of an adult human head in under 10 minutes
Protons are directed through two frontal planes, the target volume, two backing planes, and finally a calorimeter
The detector measures the position and angle of incidence of protons at entry and exit, and the energy loss

Figure: Final System (in black): 4 tracking planes with XY Si detectors; calorimeter with 64 end-on CsI crystals. Planned Scaled Prototype (in red): 4 planes of XY Si detectors (2 X-SSDs and 2 Y-SSDs per plane); 8 CsI crystal bars. Calorimeter: each bar corresponds to a 5 cm x 5 cm CsI crystal, read out by a photodiode. Tracking Plane: each large square corresponds to one double-sided or two single-sided 9 cm x 9 cm SSDs.

SLIDE 6

Problem Description

Proton source, detector planes, and calorimeter mounted on rotating gantry, as in familiar X-ray CT configurations
Data collected over a full rotation of the gantry, 180 samples (every 2 degrees)
Initial detector designed to image a human head (nominally 25 cm cube)
From the physics domain, and so that each voxel is sufficiently represented in the resulting system matrix, we estimate that we need a volume consisting of 256x256x36 (2,359,296, about 2.4M) voxels and 2 billion protons total
For each proton, we track 11 values:
  [x,y,z] at entry
  [x,y,z] at exit
  angle at entry and exit
  input and output energy
  gantry rotation angle
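To make the problem size concrete, here is a rough sketch of the per-proton record and the resulting raw input volume. The slide gives only the count of 11 values, so the field names, their order, and the float32 width are assumptions made for illustration.

```python
import struct

# Hypothetical packed layout for one proton: 11 little-endian float32 values.
# Field names, order, and the float32 width are assumptions, not the authors' format.
PROTON_FIELDS = (
    "entry_x", "entry_y", "entry_z",   # [x, y, z] at entry
    "exit_x", "exit_y", "exit_z",      # [x, y, z] at exit
    "entry_angle", "exit_angle",       # angle at entry and exit
    "energy_in", "energy_out",         # input and output energy
    "gantry_angle",                    # gantry rotation angle
)
PROTON_FORMAT = "<%df" % len(PROTON_FIELDS)
BYTES_PER_PROTON = struct.calcsize(PROTON_FORMAT)

N_VOXELS = 256 * 256 * 36
N_PROTONS = 2_000_000_000
GANTRY_ANGLES = list(range(0, 360, 2))  # one sample every 2 degrees

raw_input_gb = N_PROTONS * BYTES_PER_PROTON / 1e9
protons_per_voxel = N_PROTONS / N_VOXELS

print(len(PROTON_FIELDS), BYTES_PER_PROTON, N_VOXELS, len(GANTRY_ANGLES))
print(raw_input_gb, round(protons_per_voxel))
```

Under these assumptions the raw proton records alone come to roughly 88 GB, before any per-path voxel lists are generated.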

SLIDE 7

Baseline execution times

Began with serial code that took more than 7 hours to process 131M protons
Parallelized with MPI to use multiple CPUs
Established baseline execution times

Phase                     Execution time (seconds)
Setup                     128.2
Most Likely Path (MLP)    1278.5
Linear solver (CARP)      664.9
Overall execution time    2072.0

1 billion protons, 60 nodes, CPU only

SLIDE 8

MLP (Most Likely Path)

In contrast with X-ray computed tomography, in which the particles traverse the volume in straight lines, in pCT the protons are scattered by the material as they travel through the volume
MLP computes the path integral of the protons through the material based on their known entry and exit locations and angles and the energy loss
The proton paths are discretized as the voxels touched while traversing the volume
Path integral calculations are independent and parallelize at the level of protons (but are inherently sequential within each path)
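The discretization step can be illustrated with a small sketch. This is not the MLP formalism itself: as a stand-in, it bends a cubic Hermite curve through the measured entry and exit points using the measured directions as tangents, and ignores the energy loss and scattering model. The function name, the direction-vector inputs, and the fixed-step sampling are all assumptions of the sketch.

```python
import math

def hermite_path_voxels(entry, exit_, entry_dir, exit_dir, voxel_size, n_steps=512):
    """Return the ordered voxel indices touched by one proton path.

    Stand-in for MLP: a cubic Hermite curve through the entry and exit points
    with the given directions as tangents. Consecutive repeats are collapsed.
    Fixed-step sampling can miss voxels that are only clipped at a corner.
    """
    chord = math.dist(entry, exit_)
    voxels = []
    for step in range(n_steps + 1):      # sequential walk along a single path
        t = step / n_steps
        h00 = 2 * t**3 - 3 * t**2 + 1    # Hermite basis functions
        h10 = t**3 - 2 * t**2 + t
        h01 = -2 * t**3 + 3 * t**2
        h11 = t**3 - t**2
        point = [
            h00 * p0 + h10 * chord * d0 + h01 * p1 + h11 * chord * d1
            for p0, d0, p1, d1 in zip(entry, entry_dir, exit_, exit_dir)
        ]
        index = tuple(int(math.floor(c / voxel_size)) for c in point)
        if not voxels or voxels[-1] != index:
            voxels.append(index)
    return voxels

# Each call depends only on its own proton, so calls can run in parallel.
straight = hermite_path_voxels(
    (0.0, 0.5, 0.5), (3.5, 0.5, 0.5), (1.0, 0.0, 0.0), (1.0, 0.0, 0.0), voxel_size=1.0
)
print(straight)
```

The loop over `step` is the sequential part within a path; the independence between calls is what allows one proton per thread.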

SLIDE 9

Linear solver (CARP)

The result of MLP is a system of equations relating each proton's touched voxels to the relative stopping power (roughly, the energy loss)
We began the project with a CPU implementation of the row-action based sparse iterative solver CARP (component averaged row projections)
CARP decomposes the matrix into row blocks, one block per processor, and iterates to satisfactory convergence:
  Performs a Jacobi-like iteration sequentially through the rows to produce a per-block solution vector
  Averages the per-block solution vectors (in component-wise fashion)
  Redistributes the solution vector x to all processors
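The three steps above can be sketched on a tiny dense system. This is a toy illustration only: it uses plain row projections (Kaczmarz-style updates) as the row action, runs the blocks one after another rather than on separate processors, and the relaxation parameter, block split, and fixed iteration count are assumptions rather than the authors' settings.

```python
def carp(A, b, n_blocks=2, relax=1.0, n_iters=300):
    """Toy dense CARP: per-block row-projection sweeps, then component averaging."""
    m, n = len(A), len(A[0])
    size = -(-m // n_blocks)  # ceiling division: rows per block
    blocks = [range(s, min(s + size, m)) for s in range(0, m, size)]
    # For each block, the set of components (columns) its rows actually touch.
    touched = [{j for i in rows for j in range(n) if A[i][j] != 0.0} for rows in blocks]
    x = [0.0] * n
    for _ in range(n_iters):
        sums, counts = [0.0] * n, [0] * n
        for rows, cols in zip(blocks, touched):
            y = list(x)                    # every block starts from the shared x
            for i in rows:                 # sequential sweep through the block's rows
                norm2 = sum(a * a for a in A[i])
                if norm2 == 0.0:
                    continue
                resid = b[i] - sum(a * v for a, v in zip(A[i], y))
                scale = relax * resid / norm2
                y = [v + scale * a for v, a in zip(y, A[i])]
            for j in cols:                 # accumulate for component-wise averaging
                sums[j] += y[j]
                counts[j] += 1
        # Average each component over the blocks that touch it; this plays the
        # role of redistributing the shared solution vector x.
        x = [sums[j] / counts[j] if counts[j] else x[j] for j in range(n)]
    return x

A = [[4.0, 1.0, 0.0, 0.0],
     [1.0, 4.0, 1.0, 0.0],
     [0.0, 1.0, 4.0, 1.0],
     [0.0, 0.0, 1.0, 4.0]]
b = [6.0, 12.0, 18.0, 19.0]   # built from the solution [1, 2, 3, 4]
solution = carp(A, b)
print([round(v, 6) for v in solution])
```

In this example components 1 and 2 are shared by both blocks and get averaged, while components 0 and 3 are each owned by a single block.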

SLIDE 10

Hardware: Gaea GPU cluster at Northern Illinois University

60 compute nodes
Node configuration
  2x Intel X5650 CPUs (12 cores per node)
  2x NVIDIA M2070 GPUs
  72 GB RAM
  QDR InfiniBand

SLIDE 11

Data decomposition

2.1B protons / 60 nodes = ~35M protons per node
2 GPUs -> ~17M protons per GPU
The maximum number of voxels per proton is ~364
17M protons x 364 voxels x 4 bytes/voxel = 25 GB data per GPU
Larger than the available M2070 GPU memory of 6 GB
High watermark memory requirement on cluster is 3 TB (aggregate)
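The arithmetic above can be reproduced directly. The sketch below uses the slide's figures; the only added assumption is treating the full nominal 6 GB of the M2070 as usable when estimating a lower bound on the number of batches. The slide rounds 17.5M protons per GPU down to 17M, which accounts for the small difference in the result.

```python
import math

TOTAL_PROTONS = 2_100_000_000
NODES = 60
GPUS_PER_NODE = 2
MAX_VOXELS_PER_PROTON = 364
BYTES_PER_VOXEL = 4
GPU_MEMORY_BYTES = 6 * 10**9   # nominal M2070 capacity; usable memory is lower

protons_per_node = TOTAL_PROTONS // NODES
protons_per_gpu = protons_per_node // GPUS_PER_NODE
bytes_per_gpu = protons_per_gpu * MAX_VOXELS_PER_PROTON * BYTES_PER_VOXEL
aggregate_bytes = bytes_per_gpu * NODES * GPUS_PER_NODE
min_batches = math.ceil(bytes_per_gpu / GPU_MEMORY_BYTES)  # lower bound only

print(protons_per_node, protons_per_gpu)
print(bytes_per_gpu / 1e9, aggregate_bytes / 1e12, min_batches)
```

With these numbers each GPU needs about 25.5 GB for padded voxel lists, the cluster about 3.06 TB in aggregate, and each GPU's share has to be split into at least 5 batches.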

SLIDE 12

MLP (Most Likely Path) CUDA implementation

MLP involves calculating the path integral of the protons
Initial implementation assigns a thread per proton
Per-GPU proton data is larger than GPU memory on M2070
  Stage batches of protons to GPU
MLP was ported to the GPU, with multiple variants
  gpu_struct: Direct port of CPU-based code using structured proton/voxel data
  gpu_flat_memory: Flat memory space with per-proton padded voxel arrays
  gpu_flat_memory + overlap: Streaming computation to overlap compute and host-device transfers
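As a host-side illustration of the flat-memory layout and batch staging, the sketch below packs variable-length voxel lists into one buffer with a fixed stride per proton and splits the protons into batches that fit a memory budget. The sentinel value, the use of linear voxel indices, and both function names are assumptions; the actual GPU code and its transfer logic are not shown in the slides.

```python
from array import array

PAD = -1  # sentinel for unused slots; an assumption of this sketch

def flatten_padded(voxel_lists, stride):
    """Pack variable-length per-proton voxel lists into one flat integer buffer."""
    flat = array("i", [PAD]) * (len(voxel_lists) * stride)
    for p, voxels in enumerate(voxel_lists):
        if len(voxels) > stride:
            raise ValueError("path longer than stride")
        # Proton p always starts at offset p * stride, so a thread can find
        # its data without following pointers.
        flat[p * stride : p * stride + len(voxels)] = array("i", voxels)
    return flat

def batches(n_protons, stride, bytes_per_voxel, budget_bytes):
    """Yield (start, stop) proton ranges whose padded data fits the budget."""
    per_batch = max(1, budget_bytes // (stride * bytes_per_voxel))
    for start in range(0, n_protons, per_batch):
        yield start, min(start + per_batch, n_protons)

flat = flatten_padded([[5, 6, 7], [9], []], stride=4)
print(list(flat))
print(list(batches(10, stride=4, bytes_per_voxel=4, budget_bytes=64)))
```

The padding wastes space for short paths but gives every proton a predictable offset, which is the trade-off the flat layout makes against the structured one.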

SLIDE 13

MLP (Most Likely Path) CUDA implementation (26M protons, 2 GPUs)

Implementation               Execution time (seconds)   Speedup
cpu                          598.7
gpu_struct                   77.6                       7.7x
gpu_flat_memory              55.5                       10.8x
gpu_flat_memory + overlap    53.0                       11.3x

SLIDE 14

Linear solver (CARP) CUDA implementation (26M protons, 2 GPUs)

CARP ported directly from CPU code
Per-node row-block data larger than GPU memory; batch process
Further subdivide per-node row-block into row-blocks per streaming multiprocessor
Limited speedup in GPU implementation, because:
  row-action based solver constrains parallel granularity
  scattered memory accesses constrain performance, as is typical of sparse matrix operations

Implementation    Execution time (seconds)   Speedup
cpu               161.0
gpu               139.3                      1.16x

SLIDE 15

Performance at scale

Phase                     Execution time (seconds)
Setup                     22.3
Most Likely Path (MLP)    151.0
Linear solver (CARP)      265.5
Overall execution time    438.8

Initial goal was to complete in <600s (10 mins)
2 billion protons, 60 nodes, 12 CPU cores/node, 2 GPUs/node

SLIDE 16

Further work: CARP Hybrid CPU/GPU

Assign row blocks to CPU and GPU simultaneously
Weighted work distribution based on initial performance measurements

Implementation    Execution time (seconds)   Speedup
cpu               161.0
gpu               139.3                      1.16x
hybrid            102.3                      1.57x

2 billion protons, 60 nodes, 12 cores/node, 2 GPUs/node
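One simple way to weight the work distribution is to split row blocks in proportion to each device's measured throughput. The sketch below applies that rule to the cpu and gpu times in the table; the function name, the count of 1000 row blocks, and the assumption that work divides perfectly with no coordination cost are all hypothetical, and the slides do not state which weighting rule was actually used.

```python
def split_row_blocks(n_blocks, cpu_seconds, gpu_seconds):
    """Split row blocks in proportion to measured throughput (1 / time)."""
    cpu_rate, gpu_rate = 1.0 / cpu_seconds, 1.0 / gpu_seconds
    gpu_blocks = round(n_blocks * gpu_rate / (cpu_rate + gpu_rate))
    return n_blocks - gpu_blocks, gpu_blocks

# Times taken from the table; 1000 row blocks is an illustrative number.
cpu_blocks, gpu_blocks = split_row_blocks(1000, cpu_seconds=161.0, gpu_seconds=139.3)

# Idealized time if both devices worked concurrently with no overhead.
ideal_seconds = 1.0 / (1.0 / 161.0 + 1.0 / 139.3)

print(cpu_blocks, gpu_blocks, round(ideal_seconds, 1))
```

Under these idealized assumptions the GPU would take a little over half the blocks. The measured hybrid time of 102.3 s is higher than the idealized figure, which is expected since the sketch ignores averaging, communication, and transfer costs.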

SLIDE 17

Future work

Integrate alternative linear solvers to improve performance (AmgX, cuSPARSE, PETSc)
Consider alternate data decompositions to improve cache locality
  volume slab per streaming multiprocessor
  volume wedge per streaming multiprocessor
Measure performance on next-generation GPUs
  K80 for greater performance
  Jetson/TK1 for greater performance/watt
Experiment with GPU cloud platforms (Amazon cloud)

SLIDE 18

Acknowledgements

Nicholas T. Karonis, Northern Illinois University (NIU) and Argonne National Laboratory (ANL)
Michael E. Papka, NIU and ANL
Caesar Ordoñez, NIU
Eric Olson, ANL
Kirk Duffin, NIU
Venkat Vishwanath, ANL

US Department of Defense contract number W81XWH-10-1-0170 sponsored this work.
