

slide-1
SLIDE 1

MASSACHUSETTS GENERAL HOSPITAL RADIATION ONCOLOGY

James Shackleford(1), Nagarajan Kandasamy(2), Gregory C. Sharp(1)

(1) Massachusetts General Hospital, Radiation Oncology (2) Drexel University, Electrical and Computer Engineering

ACCELERATING MI-BASED B-SPLINE REGISTRATION USING CUDA ENABLED GPUS

slide-2
SLIDE 2

SLIDE 2 OF 33

INTRODUCTION

WHAT IS DEFORMABLE REGISTRATION?

[Figure: FIXED IMAGE and MOVING IMAGE]

slide-4
SLIDE 4

INTRODUCTION

WHAT IS DEFORMABLE REGISTRATION?

[Figure: FIXED IMAGE, MOVING IMAGE, and DEFORMATION VECTOR FIELD]

SLIDE 4 OF 33

slide-5
SLIDE 5

B-SPLINE GRID

PARAMETERIZATION METHOD

REGIONAL INFLUENCE

[Figure: B-spline control grid. Legend: P_X, P_Y = PARAMETER COEFF; β_X, β_Y = PARAMETER WEIGHT]

SLIDE 5 OF 33

slide-6
SLIDE 6

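REGIONAL INFLUENCE

[Figure: B-spline control grid with vector components v_X and v_Y. Legend: P_X, P_Y = PARAMETER COEFF; β_X, β_Y = PARAMETER WEIGHT]

$$ v_X = (\beta_X \beta_Y)\, P_X $$

$$ v_Y = (\beta_X \beta_Y)\, P_Y $$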

SLIDE 6 OF 33

slide-7
SLIDE 7

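REGIONAL INFLUENCE

[Figure: B-spline control grid. Legend: P_X, P_Y = PARAMETER COEFF; β_X, β_Y = PARAMETER WEIGHT]

$$ v_X = \sum_{i=1}^{4} \sum_{j=1}^{4} (\beta_{X,i}\, \beta_{Y,j})\, P_{X,i,j} $$

$$ v_Y = \sum_{i=1}^{4} \sum_{j=1}^{4} (\beta_{X,i}\, \beta_{Y,j})\, P_{Y,i,j} $$

16 CONTRIBUTIONS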

SLIDE 7 OF 33
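The 16-term sum for v_X and v_Y can be illustrated with a short sketch. It assumes uniform cubic B-spline basis functions, which the slides do not state explicitly (four weights per axis suggest them); the function names and the normalized offsets `tx`, `ty` are hypothetical and chosen only for this illustration.

```python
def cubic_bspline_weights(t):
    """Four uniform cubic B-spline basis values at a normalized offset t in [0, 1).

    Assumption: the slides' beta weights are the standard cubic basis functions.
    """
    return [
        (1 - t) ** 3 / 6.0,
        (3 * t**3 - 6 * t**2 + 4) / 6.0,
        (-3 * t**3 + 3 * t**2 + 3 * t + 1) / 6.0,
        t**3 / 6.0,
    ]


def displacement(coeff_x, coeff_y, tx, ty):
    """Evaluate (v_X, v_Y) at one voxel from a 4 x 4 neighborhood of coefficients.

    coeff_x[i][j] and coeff_y[i][j] play the role of P_X,i,j and P_Y,i,j.
    """
    beta_x = cubic_bspline_weights(tx)
    beta_y = cubic_bspline_weights(ty)
    # 4 x 4 = 16 contributions per vector component, as on the slide.
    v_x = sum(beta_x[i] * beta_y[j] * coeff_x[i][j]
              for i in range(4) for j in range(4))
    v_y = sum(beta_x[i] * beta_y[j] * coeff_y[i][j]
              for i in range(4) for j in range(4))
    return v_x, v_y
```

Because the four weights on each axis sum to one, a neighborhood of identical coefficients reproduces that same value as the displacement, which is a convenient sanity check.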

slide-8
SLIDE 8

[Figure: registration loop flowchart over fixed image F and moving image M. Edge labels: v, C, ∂C/∂v, ∂C/∂P, New P]

  • DECOMPRESS VECTOR FIELD
  • CORRESPONDENCE AND COST
  • CHANGE IN COST w.r.t. VECTORS
  • CHANGE IN COST w.r.t. COEFFICIENTS
  • QUASI-NEWTON OPTIMIZER

SLIDE 8 OF 33

slide-9
SLIDE 9

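[Figure: entropy diagram relating H(F), H(M), H(F,M), H(F | M), and H(M | F) for images F and M; joint histogram with axes FIXED IMAGE VALUE and MOVING IMAGE VALUE]

$$ C = H(F) + H(M) - H(F,M) $$

$$ C = \frac{1}{N} \sum_{i=1}^{B_M} \sum_{j=1}^{B_F} h_j(i,j)\, \ln \frac{N \times h_j(i,j)}{h_F(i) \times h_M(j)} $$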

SLIDE 9 OF 33
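As a minimal sketch of the histogram form of the cost, the function below evaluates C from a joint histogram. It assumes that the first index of `joint` selects the fixed-image bin and the second the moving-image bin (matching the arguments of h_F and h_M), and that the marginal histograms can be taken as row and column sums of the joint histogram; the function name is hypothetical.

```python
import math


def mutual_information(joint):
    """Mutual information C from a joint histogram.

    joint[i][j] is the (possibly fractional) count of voxel pairs whose fixed
    image intensity falls in bin i and moving image intensity in bin j.
    Assumption: marginals h_F and h_M are the row and column sums of joint.
    """
    n = sum(sum(row) for row in joint)            # N: total count
    h_f = [sum(row) for row in joint]             # h_F(i)
    h_m = [sum(col) for col in zip(*joint)]       # h_M(j)
    cost = 0.0
    for i, row in enumerate(joint):
        for j, h in enumerate(row):
            if h > 0:                             # empty bins contribute nothing
                cost += h * math.log(n * h / (h_f[i] * h_m[j]))
    return cost / n
```

Two quick checks: a joint histogram with counts only on the diagonal of a 2 x 2 table gives ln 2, and a uniform table gives 0, as expected for fully dependent and independent intensities.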

slide-10
SLIDE 10

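[Figure: partial volume interpolation between the Static Image F and the Moving Image M. Labels A-D and 1-4 mark the Nearest Neighbors and Partial Volumes; histogram axes: intensity vs. # of voxels]

$$ \frac{\partial C}{\partial h} = \ln \frac{N \times h_j(i_n, j_n)}{h_F(i_n) \times h_M(j_n)} - C $$

$$ \frac{\partial C}{\partial v} = \sum_{n=1}^{4} \left( \left. \frac{\partial C}{\partial h} \right|_{x_n} \times \frac{\partial w_n}{\partial v} \right) $$

$$ \frac{\partial C}{\partial P} = \frac{\partial C}{\partial h} \times \frac{\partial h}{\partial v} \times \frac{\partial v}{\partial P} $$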

SLIDE 10 OF 33
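The partial volume idea can be sketched in 2D as follows. This assumes the weights w_n are the usual bilinear area fractions of the four nearest neighbors; the function names, the `moving_bins` lookup table, and the absence of bounds checking are simplifications made for this illustration only.

```python
import math


def partial_volume_weights(x, y):
    """Four nearest neighbors of a continuous position (x, y) and their weights.

    Assumption: weights are bilinear area fractions, so they sum to 1.
    """
    x0, y0 = math.floor(x), math.floor(y)
    fx, fy = x - x0, y - y0
    return [
        ((x0, y0), (1 - fx) * (1 - fy)),
        ((x0 + 1, y0), fx * (1 - fy)),
        ((x0, y0 + 1), (1 - fx) * fy),
        ((x0 + 1, y0 + 1), fx * fy),
    ]


def add_to_joint_histogram(joint, fixed_bin, moving_bins, x, y):
    """Spread one fixed-image voxel over the joint histogram.

    moving_bins[iy][ix] is the histogram bin of moving-image voxel (ix, iy);
    (x, y) is where the deformation vector lands in the moving image.
    """
    for (ix, iy), weight in partial_volume_weights(x, y):
        joint[fixed_bin][moving_bins[iy][ix]] += weight
```

Each fixed-image voxel therefore adds a total weight of 1 to the joint histogram, split fractionally across up to four moving-image bins, which keeps the histogram a smooth function of the deformation vector.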

slide-11
SLIDE 11

SERIAL IMPLEMENTATION

FOLLOWING A SINGLE THREAD

SLIDE 11 OF 33

slide-12
SLIDE 12

[Figure: fixed image F and moving image M; joint histogram with axes FIXED IMAGE INTENSITY and MOVING IMAGE INTENSITY; Nearest Neighbors and Partial Volumes labeled A-D and 1-4]

Generate Histograms

  • compute vector for each voxel
  • get corresponding voxels in moving image
  • compute partial volumes
  • use partial volumes for moving & joint

SLIDE 12 OF 33

slide-13
SLIDE 13

[Figure: fixed image F and moving image M; joint histogram with axes FIXED IMAGE INTENSITY and MOVING IMAGE INTENSITY; Nearest Neighbors and Partial Volumes labeled A-D and 1-4]

Generate Histograms

  • compute vector for each voxel
  • get corresponding voxels in moving image
  • compute partial volumes
  • use partial volumes for moving & joint

Compute Score

  • simply cycle through histograms
  • Traditional Serial CPU is very fast (time required is negligible)

$$ C = \frac{1}{N} \sum_{i=1}^{B_M} \sum_{j=1}^{B_F} h_j(i,j)\, \ln \frac{N \times h_j(i,j)}{h_F(i) \times h_M(j)} $$

SLIDE 13 OF 33

slide-14
SLIDE 14

[Figure: fixed image F and moving image M; joint histogram with axes FIXED IMAGE INTENSITY and MOVING IMAGE INTENSITY; Nearest Neighbors and Partial Volumes labeled A-D and 1-4]

Generate Histograms

  • compute vector for each voxel
  • get corresponding voxels in moving image
  • compute partial volumes
  • use partial volumes for moving & joint

Compute Score

  • simply cycle through histograms
  • Traditional Serial CPU is very fast (time required is negligible)

$$ C = \frac{1}{N} \sum_{i=1}^{B_M} \sum_{j=1}^{B_F} h_j(i,j)\, \ln \frac{N \times h_j(i,j)}{h_F(i) \times h_M(j)} $$

Compute Gradient

$$ \frac{\partial C}{\partial P} = \frac{\partial C}{\partial v} \times \frac{\partial v}{\partial P} $$

  • get vector for each voxel
  • get corresponding voxels in moving image
  • compute partial volume derivatives
  • change in cost as vector changes

$$ \frac{\partial C}{\partial h} = \ln \frac{N \times h_j(i_n, j_n)}{h_F(i_n) \times h_M(j_n)} - C $$

$$ \frac{\partial C}{\partial v} = \sum_{n=1}^{4} \left( \left. \frac{\partial C}{\partial h} \right|_{x_n} \times \frac{\partial w_n}{\partial v} \right) $$

NEXT

SLIDE 14 OF 33

slide-16
SLIDE 16

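CHANGE IN COST w.r.t. COEFFICIENTS

[Figure: fixed image F with 16 regions, numbered 1-16, weighted by β_X and β_Y]

$$ v_X = \sum_{i=1}^{4} \sum_{j=1}^{4} (\beta_{X,i}\, \beta_{Y,j})\, P_{X,i,j} $$

$$ \frac{\partial v}{\partial P} = \beta_{X,i}\, \beta_{Y,j} $$

$$ \frac{\partial C}{\partial P} = \sum \frac{\partial C}{\partial v}\, \frac{\partial v}{\partial P} = \sum_{i=1}^{4} \sum_{j=1}^{4} \frac{\partial C}{\partial v}\, \beta_{X,i}\, \beta_{Y,j} $$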

SLIDE 16 OF 33
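One way to read the chain rule above is as an accumulation: every voxel adds its ∂C/∂v, scaled by β_X,i β_Y,j, to each of the 16 control points that influence it. The sketch below shows that accumulation for one vector component; the function name and the input layout are hypothetical, and it is not taken from the authors' implementation.

```python
def coefficient_gradient(voxels):
    """Accumulate dC/dP for the 4 x 4 control points influencing one region.

    voxels is an iterable of (dc_dv, beta_x, beta_y) tuples, one per voxel:
    dc_dv is the change in cost w.r.t. that voxel's vector component, and
    beta_x, beta_y are its four B-spline weights along each axis.
    """
    grad = [[0.0] * 4 for _ in range(4)]
    for dc_dv, beta_x, beta_y in voxels:
        for i in range(4):
            for j in range(4):
                # dv/dP for control point (i, j) is beta_x[i] * beta_y[j].
                grad[i][j] += dc_dv * beta_x[i] * beta_y[j]
    return grad
```

Since many voxels write to the same 16 accumulators, a parallel version has to either give each worker its own accumulators or synchronize the updates, which is the design question the parallelization slides address.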

slide-18
SLIDE 18

PARALLELIZATION

LEVERAGING GPUS, OPENMP, ETC

SLIDE 18 OF 33

slide-19
SLIDE 19

[Figure: registration loop flowchart over fixed image F and moving image M. Edge labels: v, C, ∂C/∂v, ∂C/∂P, New P]

  • DECOMPRESS VECTOR FIELD
  • CORRESPONDENCE AND COST
  • CHANGE IN COST w.r.t. VECTORS
  • CHANGE IN COST w.r.t. COEFFICIENTS
  • QUASI-NEWTON OPTIMIZER

What do we parallelize?

[Figure annotations: four items marked ✓, two items marked ✗]

SLIDE 19 OF 33

slide-20
SLIDE 20

[Figure: processing pipeline over fixed image F and moving image M]

  • COMPUTE VECTOR FROM COEFF
  • COMPUTE HISTOGRAMS
  • COMPUTE CHANGE IN COST w.r.t. VECTOR: ∂C/∂v
  • CYCLE HIST: COST (MI)

SLIDE 20 OF 33

slide-22
SLIDE 22

SLIDE 22 OF 33

CHANGE IN COST w.r.t. COEFFICIENTS

[Figure: fixed image F with regions numbered 1-16, weighted by β_X and β_Y; per-region contributions 1, 2, 3, 4, 5, ..., 16; worker label: CPU 1]

$$ v_X = \sum_{i=1}^{4} \sum_{j=1}^{4} (\beta_{X,i}\, \beta_{Y,j})\, P_{X,i,j} $$

$$ \frac{\partial v}{\partial P} = \beta_{X,i}\, \beta_{Y,j} $$

$$ \frac{\partial C}{\partial P} = \sum_{i=1}^{4} \sum_{j=1}^{4} \frac{\partial C}{\partial v}\, \beta_{X,i}\, \beta_{Y,j} $$

slide-23
SLIDE 23

SLIDE 23 OF 33

CHANGE IN COST w.r.t. COEFFICIENTS

[Figure: fixed image F with regions numbered 1-16, weighted by β_X and β_Y; per-region contributions 1, 2, 3, 4, 5, ..., 16; worker labels: CPU 1, CPU 2]

$$ v_X = \sum_{i=1}^{4} \sum_{j=1}^{4} (\beta_{X,i}\, \beta_{Y,j})\, P_{X,i,j} $$

$$ \frac{\partial v}{\partial P} = \beta_{X,i}\, \beta_{Y,j} $$

$$ \frac{\partial C}{\partial P} = \sum_{i=1}^{4} \sum_{j=1}^{4} \frac{\partial C}{\partial v}\, \beta_{X,i}\, \beta_{Y,j} $$

slide-24
SLIDE 24

CONSTANT CONTROL POINT SPACING: 15 x 15 x 15

16x speedup (30 min → 1.8 min)

  • J. Shackleford, N. Kandasamy, and G. Sharp, “On developing B-spline registration algorithms for multi-core processors,” Physics in Medicine and Biology, vol. 55, p. 6329, 2010.

  • J. Shackleford, N. Kandasamy, and G. Sharp, “Deformable Volumetric Registration using B-splines,” in GPU Computing Gems: Emerald Edition, Morgan Kaufmann, 2011.

SLIDE 24 OF 33

slide-25
SLIDE 25

CONSTANT VOLUME SIZE: 256 x 256 x 256

  • J. Shackleford, N. Kandasamy, and G. Sharp, “On developing B-spline registration algorithms for multi-core processors,” Physics in Medicine and Biology, vol. 55, p. 6329, 2010.

  • J. Shackleford, N. Kandasamy, and G. Sharp, “Deformable Volumetric Registration using B-splines,” in GPU Computing Gems: Emerald Edition, Morgan Kaufmann, 2011.

SLIDE 25 OF 33

slide-26
SLIDE 26

HISTOGRAM COMPUTATION

LEVERAGING GPUS, OPENMP, ETC

OpenMP + CUDA

  • thread-level histograms (shared memory)
  • block-level histograms (global memory)
  • complete histograms (global memory)

SLIDE 26 OF 33

slide-27
SLIDE 27

HISTOGRAM COMPUTATION

LEVERAGING GPUS, OPENMP, ETC

OpenMP + CUDA

  • thread-level histograms (shared memory)
  • block-level histograms (global memory), one per block
  • complete histogram (global memory)

SLIDE 27 OF 33
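The three histogram levels can be emulated sequentially to show the data flow. The sketch below is plain Python rather than CUDA or OpenMP code, so it models only the merge structure (thread-level to block-level to complete), not the memory spaces; the function names, the strided work split, and the parameters `threads_per_block` and `num_blocks` are assumptions for illustration.

```python
def histogram(bin_indices, num_bins):
    """Private histogram built by a single worker."""
    counts = [0] * num_bins
    for b in bin_indices:
        counts[b] += 1
    return counts


def merge(histograms):
    """Element-wise sum of several histograms of equal length."""
    return [sum(column) for column in zip(*histograms)]


def hierarchical_histogram(bin_indices, num_bins, threads_per_block, num_blocks):
    """Thread-level -> block-level -> complete histogram, emulated sequentially."""
    chunk = -(-len(bin_indices) // num_blocks)  # ceiling division
    block_histograms = []
    for block in range(num_blocks):
        block_data = bin_indices[block * chunk:(block + 1) * chunk]
        # Each emulated thread takes a strided share of the block's samples.
        thread_histograms = [
            histogram(block_data[t::threads_per_block], num_bins)
            for t in range(threads_per_block)
        ]
        block_histograms.append(merge(thread_histograms))
    return merge(block_histograms)
```

Giving every worker a private histogram avoids contention on shared bins during accumulation; the cost is the extra merge passes, which touch only `num_bins` entries per worker.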

slide-28
SLIDE 28

[Figure: bar chart of Wall Clock Time (s), axis 0.5 to 3.5, versus Control Grid Spacing (mm)]

Legend: Quadro 2000 (CC 2.1); Tesla C2075 (CC 2.0); AMD Phenom(tm) 9750 @ 1.2GHz [4-Core]; Intel Xeon E5420 @ 2.5GHz [4-Core]; Intel Xeon E5620 @ 2.4GHz [4-Core SMT]; Intel Xeon X5675 @ 3.0GHz [6-Core SMT]

32 Fixed Histogram Bins, 32 Moving Histogram Bins
Fixed Image: 512 x 512 x 115 (0.79 x 0.79 x 2.5)
Moving Image: 512 x 384 x 115 (0.74 x 0.74 x 6.0)
Compute Resolution: 256 x 256 x 115

Control Grid Spacing (mm)   Control Grid    Regions         Vox/Region
100 x 100 x 100             8 x 8 x 6       5 x 5 x 3       63 x 63 x 40
50 x 50 x 50                11 x 11 x 9     8 x 8 x 6       32 x 32 x 20
25 x 25 x 25                19 x 19 x 15    16 x 16 x 12    16 x 16 x 10
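The Control Grid, Regions, and Vox/Region figures on this slide are consistent with a simple calculation, sketched below. The rule itself is inferred from the numbers rather than stated on the slide: voxels per region is the spacing divided by the voxel size, rounded; regions is the compute resolution divided by that, rounded up; and the control grid adds 3 points per axis. The in-plane voxel size of 1.58 mm is also an assumption, taken as the 0.79 mm fixed-image voxel doubled for the 256 x 256 compute resolution.

```python
import math


def control_grid(compute_dims, voxel_mm, spacing_mm):
    """Return (vox_per_region, regions, control_grid) for each axis.

    Assumed rule, inferred from the slide's numbers:
      vox_per_region = round(spacing / voxel size)
      regions        = ceil(compute dimension / vox_per_region)
      control_grid   = regions + 3
    """
    vox_per_region = [round(s / v) for s, v in zip(spacing_mm, voxel_mm)]
    regions = [math.ceil(d / n) for d, n in zip(compute_dims, vox_per_region)]
    grid = [r + 3 for r in regions]
    return vox_per_region, regions, grid


# Compute resolution from the slide; voxel size is an assumption (see above).
dims = (256, 256, 115)
voxel = (1.58, 1.58, 2.5)
for spacing in (100, 50, 25):
    print(spacing, control_grid(dims, voxel, (spacing,) * 3))
```

Halving the spacing roughly doubles the number of regions per axis, so the number of coefficients to optimize grows about eightfold from one configuration to the next.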

slide-29
SLIDE 29

[Figure: stacked bar chart of Wall Clock Time (s), axis 0.5 to 3.5, versus Control Grid Spacing (mm); bar segments: Histograms, Gradient]

Legend: Quadro 2000 (CC 2.1); Tesla C2075 (CC 2.0); AMD Phenom(tm) 9750 @ 1.2GHz [4-Core]; Intel Xeon E5420 @ 2.5GHz [4-Core]; Intel Xeon E5620 @ 2.4GHz [4-Core SMT]; Intel Xeon X5675 @ 3.0GHz [6-Core SMT]

Bar annotations (s):
1.4 1.2 1.4 0.8 1.4 0.8 0.3 0.4 0.3 0.2 0.3 0.2 0.6 1.0 0.8 1.7 0.8 1.7
0.9 1.1 0.9 1.7 0.9 1.6 0.5 0.8 0.5 0.8 0.5 0.8 0.7 1.0 0.7 1.0 0.7 1.0

32 Fixed Histogram Bins, 32 Moving Histogram Bins
Fixed Image: 512 x 512 x 115 (0.79 x 0.79 x 2.5)
Moving Image: 512 x 384 x 115 (0.74 x 0.74 x 6.0)
Compute Resolution: 256 x 256 x 115

Control Grid Spacing (mm)   Control Grid    Regions         Vox/Region
100 x 100 x 100             8 x 8 x 6       5 x 5 x 3       63 x 63 x 40
50 x 50 x 50                11 x 11 x 9     8 x 8 x 6       32 x 32 x 20
25 x 25 x 25                19 x 19 x 15    16 x 16 x 12    16 x 16 x 10

slide-30
SLIDE 30

[Figure: bar chart of Wall Clock Time (s), axis 0.5 to 3.5, versus Control Grid Spacing (mm)]

Legend: Quadro 2000 (CC 2.1); Tesla C2075 (CC 2.0); AMD Phenom(tm) 9750 @ 1.2GHz [4-Core]; Intel Xeon E5420 @ 2.5GHz [4-Core]; Intel Xeon E5620 @ 2.4GHz [4-Core SMT]; Intel Xeon X5675 @ 3.0GHz [6-Core SMT]

64 Fixed Histogram Bins, 64 Moving Histogram Bins
Fixed Image: 512 x 512 x 115 (0.79 x 0.79 x 2.5)
Moving Image: 512 x 384 x 115 (0.74 x 0.74 x 6.0)
Compute Resolution: 256 x 256 x 115

Control Grid Spacing (mm)   Control Grid    Regions         Vox/Region
100 x 100 x 100             8 x 8 x 6       5 x 5 x 3       63 x 63 x 40
50 x 50 x 50                11 x 11 x 9     8 x 8 x 6       32 x 32 x 20
25 x 25 x 25                19 x 19 x 15    16 x 16 x 12    16 x 16 x 10

slide-31
SLIDE 31

[Figure: stacked bar chart of Wall Clock Time (s), axis 0.5 to 3.5, versus Control Grid Spacing (mm); bar segments: Histograms, Gradient]

Legend: Quadro 2000 (CC 2.1); Tesla C2075 (CC 2.0); AMD Phenom(tm) 9750 @ 1.2GHz [4-Core]; Intel Xeon E5420 @ 2.5GHz [4-Core]; Intel Xeon E5620 @ 2.4GHz [4-Core SMT]; Intel Xeon X5675 @ 3.0GHz [6-Core SMT]

Bar annotations (s):
0.5 1.2 0.5 0.8 0.5 0.8 0.7 0.4 0.7 0.2 0.6 1.0 0.8 1.7 0.8 1.7 0.9 1.2 0.9 1.7 0.9 1.6 0.5 0.8 0.5 0.8 0.5 0.8 0.7 1.0 0.7 1.0 0.7 1.0 0.7 0.2

64 Fixed Histogram Bins, 64 Moving Histogram Bins
Fixed Image: 512 x 512 x 115 (0.79 x 0.79 x 2.5)
Moving Image: 512 x 384 x 115 (0.74 x 0.74 x 6.0)
Compute Resolution: 256 x 256 x 115

Control Grid Spacing (mm)   Control Grid    Regions         Vox/Region
100 x 100 x 100             8 x 8 x 6       5 x 5 x 3       63 x 63 x 40
50 x 50 x 50                11 x 11 x 9     8 x 8 x 6       32 x 32 x 20
25 x 25 x 25                19 x 19 x 15    16 x 16 x 12    16 x 16 x 10

* via OpenMP E5620

† via OpenMP X5675

slide-32
SLIDE 32

[Figure: bar chart of Wall Clock Time (s), axis 0.5 to 3.5, versus Control Point Spacing (mm)]

Legend: Quadro 2000 (CC 2.1); Tesla C2075 (CC 2.0); AMD Phenom(tm) 9750 @ 1.2GHz [4-Core]; Intel Xeon E5420 @ 2.5GHz [4-Core]; Intel Xeon E5620 @ 2.4GHz [4-Core SMT]; Intel Xeon X5675 @ 3.0GHz [6-Core SMT]

512 Fixed Histogram Bins, 512 Moving Histogram Bins
Fixed Image: 512 x 512 x 115 (0.79 x 0.79 x 2.5)
Moving Image: 512 x 384 x 115 (0.74 x 0.74 x 6.0)
Compute Resolution: 256 x 256 x 115

Control Point Spacing (mm)  Control Grid    Regions         Vox/Region
100 x 100 x 100             8 x 8 x 6       5 x 5 x 3       63 x 63 x 40
50 x 50 x 50                11 x 11 x 9     8 x 8 x 6       32 x 32 x 20
25 x 25 x 25                19 x 19 x 15    16 x 16 x 12    16 x 16 x 10

slide-33
SLIDE 33

[Figure: stacked bar chart of Wall Clock Time (s), axis 0.5 to 3.5, versus Control Point Spacing (mm); bar segments: Histograms, Gradient]

Legend: Quadro 2000 (CC 2.1); Tesla C2075 (CC 2.0); AMD Phenom(tm) 9750 @ 1.2GHz [4-Core]; Intel Xeon E5420 @ 2.5GHz [4-Core]; Intel Xeon E5620 @ 2.4GHz [4-Core SMT]; Intel Xeon X5675 @ 3.0GHz [6-Core SMT]

Bar annotations (s):
0.5 1.2 0.5 0.8 0.5 0.8 0.7 0.4 0.7 0.2 0.7 1.0 0.9 1.8 0.9 1.8 0.9 1.3 0.9 1.7 0.9 1.6 0.5 0.8 0.5 0.8 0.5 0.8 0.7 1.0 0.7 1.0 0.7 1.0 0.7 0.2

512 Fixed Histogram Bins, 512 Moving Histogram Bins
Fixed Image: 512 x 512 x 115 (0.79 x 0.79 x 2.5)
Moving Image: 512 x 384 x 115 (0.74 x 0.74 x 6.0)
Compute Resolution: 256 x 256 x 115

Control Point Spacing (mm)  Control Grid    Regions         Vox/Region
100 x 100 x 100             8 x 8 x 6       5 x 5 x 3       63 x 63 x 40
50 x 50 x 50                11 x 11 x 9     8 x 8 x 6       32 x 32 x 20
25 x 25 x 25                19 x 19 x 15    16 x 16 x 12    16 x 16 x 10

* via OpenMP E5620

† via OpenMP X5675