Computer Aided Detection (CAD) for 3D Breast Imaging and GPU Technology (PowerPoint PPT Presentation)



SLIDE 1

Computer Aided Detection (CAD) for 3D Breast Imaging and GPU Technology

Xiangwei Zhang, Chui Haili

Imaging and CAD science, Hologic Inc., Santa Clara, CA 03/19/2015

SLIDE 2

3/25/2015

Summary


  • Computer aided detection (CAD) of breast cancer in 3D digital breast tomo-synthesis: an introduction;
  • GPU kernel optimization: a case study of convolution filtering;
  • GPU optimization in DBT CAD: GPU/CPU data copy, GPU memory management;
  • Conclusions/Questions;
SLIDE 3

Cancer Incidence/Mortality Rates (USA)

Disease Type         Incidence   Mortality
Lung cancer          169,400     154,900
Colorectal cancer    148,300     56,600
Breast cancer        205,000     40,000
Prostate cancer      189,000     30,200

Source: American Cancer Society, 2001.

SLIDE 4

Computer Aided Detection (CAD)

  • Early detection of cancer is the key to reducing the mortality rate;
  • Medical imaging can help the early detection of cancer;
  • Breast X-ray mammography, chest X-ray, lung CT, colonoscopy, brain MRI, etc.;
  • Interpreting the images to find signs of cancer is very challenging for radiologists;
  • Automated processing using computer software helps radiologists in clinical decisions;
  • Various image analysis software, including computer aided detection (CAD) and computer aided diagnosis (CADx);

SLIDE 5

Medical Imaging Applications

[Images: breast mammography, lung CT, chest X-ray, colonoscopy.]

SLIDE 6

Micro-calcifications in digital mammography

SLIDE 7

CAD for 2D Mammography

  • Each patient/examination has 4 views (left/right breast, CC/MLO views);
  • There are four 2D images to be processed;
  • CAD generates marker overlays (triangle: micro-calcification clusters; star: mass density or spiculation/architectural distortion);

SLIDE 8

2D Mammography CAD processing flow

  • Pre-processing
  • Pixel value transformation (log, invert, scaling);
  • Segmentation
  • Breast, pectoral muscle (MLO view), roll-off region;
  • Suspicious candidate generation
  • Filtering (general and dedicated), region growing;
  • Region analysis/classification
  • Feature extraction/selection, classification;

It takes ~10 seconds/view to complete (pure CPU implementation).
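The four stages above can be sketched as a toy pipeline. This is a minimal sketch with made-up thresholds and stage logic (global-mean segmentation, local-maximum candidates, intensity-ranked classification), not Hologic's actual algorithms:

```python
import math

def preprocess(image):
    # Pixel value transformation: log scaling (add 1 to avoid log(0)).
    return [[math.log(1 + p) for p in row] for row in image]

def segment(image):
    # Toy segmentation: keep pixels above the global mean intensity.
    flat = [p for row in image for p in row]
    thresh = sum(flat) / len(flat)
    return [[p > thresh for p in row] for row in image]

def generate_candidates(image, mask):
    # Toy candidate generation: local maxima inside the segmented region.
    h, w = len(image), len(image[0])
    cands = []
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            if mask[y][x] and all(
                image[y][x] >= image[y + dy][x + dx]
                for dy in (-1, 0, 1) for dx in (-1, 0, 1)
            ):
                cands.append((y, x))
    return cands

def classify(image, candidates):
    # Toy classifier: rank candidates by intensity, keep the brightest few.
    ranked = sorted(candidates, key=lambda c: image[c[0]][c[1]], reverse=True)
    return ranked[:3]

def cad_pipeline(image):
    img = preprocess(image)
    mask = segment(img)
    cands = generate_candidates(img, mask)
    return classify(img, cands)
```

A single bright spot in an otherwise empty image survives all four stages and comes out as one marker.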

SLIDE 9

Digital Breast Tomo-synthesis (DBT)

  • Acquisition
  • Multiple 2D projection views (PVs) are acquired at different angles (11 to 15 views);
  • The angular span is limited to obtain high in-plane resolution (15 to 30 degrees);
  • Each projection uses a much lower dose than 2D mammography;
  • Reconstruction
  • Back projection is used to reconstruct a 3D volume with a 1 mm slice interval;
  • Usually a volume consists of 40 to 80 slices (1560x2457 pixels/slice);
  • Advantage (vs. 2D mammogram)
  • Reduced tissue overlap reveals 3D anatomical structures hidden in 2D;
  • Disadvantage
  • Much more data to interpret and store;
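As a rough illustration of the back-projection step, here is a toy "shift-and-add" reconstruction for a single slice height z over a 1D detector. This is a simplified parallel-geometry sketch with nearest-neighbor shifts; real DBT reconstruction uses the actual cone-beam geometry plus filtering:

```python
import math

def shift_and_add(projections, angles_deg, z, pixel_pitch=1.0):
    """Toy back projection for one slice height z: shift each projection
    view in proportion to z * tan(angle), then average the shifted views.
    1D detector, nearest-neighbor shifts, zero padding at the borders."""
    n = len(projections[0])
    out = [0.0] * n
    for proj, ang in zip(projections, angles_deg):
        shift = round(z * math.tan(math.radians(ang)) / pixel_pitch)
        for i in range(n):
            j = i + shift
            if 0 <= j < n:
                out[i] += proj[j]
    return [v / len(projections) for v in out]
```

At z = 0 (the plane of the center of rotation in this toy geometry) no shift is applied and the views simply average; at other heights each view is shifted so structures at that height line up while structures at other heights blur out.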
SLIDE 10

DBT acquisition and reconstruction

[Diagram: the X-ray tube sweeps about the center of rotation above the compression paddle, compressed breast, and digital detector, acquiring projection views PV1 … PVn; the reconstruction slices are computed from these PVs.]

SLIDE 11

DBT CAD processing flow

  • Slice-by-slice processing (similar to the first three steps in 2D CAD);
  • 3D region growing;
  • 3D region analysis/classification;
  • Prototype (2007)
  • Pure CPU implementation;
  • It takes ~10 minutes/view to complete;
  • Clinically unacceptable;
  • What can we do to speed this up?
  • CUDA computation on GPGPU;
SLIDE 12

GPU kernel performance optimization

  • Key requirements for good GPU kernel performance
  • Sufficient parallelism;
  • Efficient memory access;
  • Efficient instruction execution;
  • Efficient memory access: a case study of 1D convolution on a 2D image with different implementations
  • CPU;
  • GPU
  • Global memory;
  • Texture memory;
  • Shared memory;
SLIDE 13

GPU CUDA memory space

[Diagram: CUDA memory spaces. The host exchanges data with the device's global, constant, and texture memory. Within a grid, each block has its own shared memory, and each thread has its own registers and local memory.]

SLIDE 14

GPU global memory access optimization

  • GPU global memory
  • DRAM -- high latency;
  • Not necessarily cached;
  • Many algorithms are memory-limited
  • Or at least somewhat sensitive to memory bandwidth;
  • The ratio of arithmetic operations to memory accesses is low;
  • Optimization goal: maximize bandwidth utilization
  • Memory accesses are issued per warp (a warp is 32 consecutive threads in a single block);
  • Memory accesses are served in discrete chunks (lines: 128 bytes; segments: 32 bytes);
  • The key is to have sufficient concurrent memory access per warp;
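The coalescing rule above can be made concrete with a small model that counts how many 128-byte lines one warp's 4-byte loads touch. This is an illustrative model, not an exact replica of the hardware's line/segment logic:

```python
def lines_touched(byte_addresses, line_size=128):
    """Count the distinct memory lines touched by one warp's accesses;
    each distinct line costs (at least) one memory transaction."""
    return len({addr // line_size for addr in byte_addresses})

# Coalesced: 32 consecutive 4-byte loads fit in a single 128-byte line.
coalesced = [t * 4 for t in range(32)]

# Strided: each thread reads one element per image row (hypothetical
# 2048-byte row pitch), so every access lands in a different line.
strided = [t * 2048 for t in range(32)]
```

The coalesced pattern is served by one transaction; the strided pattern costs one transaction per thread, i.e. roughly 32x the bandwidth for the same useful data.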
SLIDE 15

Efficient GPU memory access

[Diagram: all 32 addresses from a warp fall within one aligned 128-byte range and are served by one 4-segment transaction.]

SLIDE 16

Inefficient GPU memory access

[Diagram: the addresses from a warp are scattered across many 128-byte lines, so serving the warp requires multiple transactions.]

SLIDE 17

A case study: 1D vertical convolution on a 2D image

Convolution: for each pixel, the new pixel value is the weighted sum of the pixel values in a defined neighborhood; here the neighborhood is vertical, within the same column.

[Diagram: pixel of interest and its local vertical neighborhood in the 2D image.]
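The definition above can be written down directly. This is a plain CPU reference version; zero padding at the image borders is an assumption, since the slides do not specify the border handling:

```python
def convolve_vertical(image, weights):
    """1D vertical convolution: each output pixel is the weighted sum of
    the pixels in a vertical neighborhood of the same column, with zero
    padding outside the image. `weights` has odd length 2*r + 1."""
    h, w = len(image), len(image[0])
    r = len(weights) // 2  # kernel radius
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for k, wgt in enumerate(weights):
                yy = y + k - r  # row offset within the neighborhood
                if 0 <= yy < h:
                    acc += wgt * image[yy][x]
            out[y][x] = acc
    return out
```

For a single column [1, 2, 3] and a box kernel [1, 1, 1], the interior pixel sums all three values and the border pixels sum only the two that exist.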

SLIDE 18

Running time comparison: CPU and GPU

  • Platform
  • Host: Dell Precision 7500
  • CPU: Intel Xeon dual core @ 3.07 GHz / 3.06 GHz;
  • RAM: 16.0 GB;
  • Device
  • GeForce GTX 690 (dual card);
  • 3072 CUDA cores (1536 x 2), 16 SMs;
  • 4 GB 512-bit GDDR5;
  • PCI Express 3.0 x16;
  • CUDA 5.5;
  • OS
  • Windows 7 Professional, Service Pack 1 (2009);
SLIDE 19

CPU based – Two ways of serial processing

Assumption: the 2D image is stored in a contiguous linear memory space; the two serial orders are row-wise traversal and column-wise traversal.
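The two traversal orders look like this in code. This is a sketch: in a compiled language the row-wise loop is typically faster because consecutive iterations hit adjacent, cached memory, whereas Python's interpreter overhead mostly hides that effect, so no timing is asserted here:

```python
def sum_row_wise(flat, h, w):
    # Row-wise: the inner loop walks adjacent elements of the linear array.
    total = 0.0
    for y in range(h):
        base = y * w
        for x in range(w):
            total += flat[base + x]
    return total

def sum_col_wise(flat, h, w):
    # Column-wise: the inner loop jumps w elements between accesses,
    # touching a different cache line almost every iteration.
    total = 0.0
    for x in range(w):
        for y in range(h):
            total += flat[y * w + x]
    return total
```

Both orders visit every element exactly once and compute the same result; only the memory access pattern, and hence the cache behavior, differs.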

SLIDE 20

CPU based – Two ways of serial processing

  • The speeds are different;
  • Due to the linear memory layout and data caching;
  • Vertical = 146.59 ms/run; Horizontal = 104.19 ms/run;

[Chart: running times of the two CPU versions (milliseconds/run).]

SLIDE 21

Global memory based – thread-block design 1

CUDA implementation: the whole image is divided into multiple vertical-bar-shaped thread blocks (1x128).

[Diagram: pixel of interest, its local neighborhood, and the 1x128 CUDA thread block within the 2D memory chunk.]

SLIDE 22

Global memory based – thread-block design 2

CUDA implementation: the whole image is divided into multiple horizontal-bar-shaped thread blocks (128x1).

[Diagram: pixel of interest, its local neighborhood, and the 128x1 CUDA thread block within the 2D memory chunk.]
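Why the two block shapes behave so differently can be seen by listing the bytes one warp reads in the same step. This is an illustrative model assuming a row-major image, 4-byte pixels, a hypothetical row pitch of 1560 pixels, and threadIdx.x varying fastest within the warp:

```python
def warp_byte_addresses(block_w, row_pitch_bytes=1560 * 4, pixel_bytes=4):
    """Byte addresses read by threads 0..31 of the first warp of a block
    whose x-dimension is block_w, for a row-major image."""
    addrs = []
    for t in range(32):
        tx, ty = t % block_w, t // block_w  # thread position in the block
        addrs.append(ty * row_pitch_bytes + tx * pixel_bytes)
    return addrs

def distinct_lines(addrs, line_size=128):
    # Each distinct 128-byte line costs (at least) one transaction.
    return len({a // line_size for a in addrs})
```

With a 1x128 block (design 1) every thread of the warp sits in a different image row, so the warp touches 32 separate lines; with a 128x1 block (design 2) the warp reads 32 consecutive pixels of one row, which fit in a single line.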

SLIDE 23

Global memory based – Vertical vs Horizontal

  • The speeds are quite different;
  • Due to the linear memory layout and concurrent aligned reads within a warp;
  • Vertical = 13.325 ms/run; Horizontal = 1.652 ms/run;
  • ~60x speedup over the CPU version;

[Chart: running times of the CPU version and the two global-memory GPU versions (milliseconds/run).]

SLIDE 24

Texture memory based version

  • Texture memory
  • Read-only cache;
  • Good for scattered reads;
  • Cache granularity is 32 bytes (one segment);
  • Two different thread blocks
  • Vertical (1x128);
  • Horizontal (128x1);
SLIDE 25

Texture memory based – Vertical vs Horizontal

  • The speeds are different;
  • Vertical = 2.507 ms/run; Horizontal = 1.707 ms/run;
  • Horizontal: comparable to the global memory version;
  • Vertical: much better than the global memory version (the texture cache is better at scattered reads);

[Chart: running times of the global memory and texture memory GPU versions (milliseconds/run).]

SLIDE 26

Shared memory based version

  • Shared memory
  • Read/write memory within each SM;
  • Lower latency than global memory, or even texture memory;
  • When used as a read cache, the original data still needs to be loaded from global memory;
  • Two different thread blocks
  • Vertical (1x768);
  • Horizontal (32x24);
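The benefit of staging data in shared memory can be counted. For a vertical convolution of radius r, a naive kernel reads each input pixel from global memory once per output pixel that needs it, while a tiled kernel loads the tile plus its halo exactly once. A back-of-the-envelope model, with hypothetical tile sizes:

```python
def global_loads_naive(tile_h, tile_w, radius):
    # Every output pixel reads its own (2*radius + 1)-pixel vertical
    # neighborhood straight from global memory.
    return tile_h * tile_w * (2 * radius + 1)

def global_loads_tiled(tile_h, tile_w, radius):
    # The block first stages the tile plus a halo of `radius` rows above
    # and below into shared memory; each input pixel is read from global
    # memory exactly once, and all reuse happens in shared memory.
    return (tile_h + 2 * radius) * tile_w
```

For a 32x24 tile and radius 8, the naive scheme issues 13,056 global loads versus 1,280 for the tiled scheme, roughly a 10x reduction in global traffic for this configuration.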
SLIDE 27

Shared memory based – thread-block design 1

CUDA implementation: the whole image is divided into multiple vertical-bar-shaped thread blocks (1x768).

[Diagram: pixel of interest, its local neighborhood, and the 1x768 CUDA thread block within the 2D memory chunk.]

SLIDE 28

Shared memory based – thread-block design 2

CUDA implementation: the whole image is divided into multiple rectangular thread blocks (32x24 = 768 threads).

[Diagram: pixel of interest, its local neighborhood, and the 32x24 CUDA thread block within the 2D memory chunk.]

SLIDE 29

Shared memory based – Vertical vs Horizontal

  • The speeds are quite different;
  • Vertical = 3.725 ms/run; Horizontal = 1.084 ms/run;
  • Horizontal: better than both the global and texture memory versions;
  • Vertical: better than global memory, but worse than texture memory;

[Chart: running times of the global, texture, and shared memory GPU versions (milliseconds/run).]

SLIDE 30

GPU performance optimization

  • Key requirements for good GPU kernel performance
  • Sufficient parallelism;
  • Efficient memory access;
  • Efficient instruction execution;
  • Other overheads (convolution example)
  • GPU memory allocation;
  • CPU to GPU image data copy;
  • Kernel execution;
  • GPU to CPU image result copy;
  • GPU memory de-allocation;
SLIDE 31

Running time of kernel vs other overheads

[Chart: per-run times of GPU/CPU data copy, GPU memory allocation/de-allocation, and the GPU kernel (ms/run).]

SLIDE 32

GPU performance optimization in CAD

  • CPU/GPU image data copy;
  • Cudarize as many processing components as possible, so data can stay on the GPU without being passed back to the CPU;

[Diagram: CAD processing flow annotated with CPU computation, GPU computation, GPU/CPU data copies, and GPU memory allocation/de-allocation.]
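The effect of cudarizing consecutive stages can be modeled by counting host/device transfers across a linear pipeline. This is a toy model; the transfer rule (one copy per host/device boundary crossing, plus the final download) is an illustrative simplification:

```python
def transfers_needed(stages_on_gpu):
    """Count CPU<->GPU copies for a linear pipeline. The input starts on
    the CPU; a copy happens whenever the data crosses between host and
    device, and the final result must end up back on the CPU."""
    copies = 0
    on_device = False  # data starts on the host
    for gpu_stage in stages_on_gpu:
        if gpu_stage != on_device:
            copies += 1
            on_device = gpu_stage
    if on_device:
        copies += 1  # download the final result to the host
    return copies
```

Alternating CPU/GPU stages pay a copy at every boundary, while grouping all GPU stages together reduces the cost to one upload and one download, which is exactly why keeping intermediate data resident on the GPU matters.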

SLIDE 33

GPU performance optimization in CAD

  • GPU memory allocation/de-allocation;
  • Allocate memory once and re-use it as much as possible;
  • Very suitable for slice processing in DBT CAD (2D images with fixed dimensions but changing contents);
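Allocate-once/re-use can be sketched as a small buffer pool keyed by size. In this toy stand-in a `bytearray` plays the role of a cudaMalloc'd device buffer, and the slice dimensions come from the DBT figures earlier in the deck:

```python
class BufferPool:
    """Reuse fixed-size buffers across slices instead of re-allocating."""
    def __init__(self):
        self._free = {}       # size -> list of idle buffers
        self.allocations = 0  # number of real allocations performed

    def acquire(self, size):
        idle = self._free.setdefault(size, [])
        if idle:
            return idle.pop()   # reuse an existing buffer
        self.allocations += 1
        return bytearray(size)  # stands in for a device allocation

    def release(self, size, buf):
        self._free[size].append(buf)

# 60 DBT slices with fixed dimensions: one allocation serves them all.
pool = BufferPool()
slice_bytes = 1560 * 2457 * 2   # assuming 16-bit pixels
for _ in range(60):
    buf = pool.acquire(slice_bytes)
    # ... run the per-slice processing in place ...
    pool.release(slice_bytes, buf)
```

Because every slice has the same dimensions, the pool performs exactly one real allocation no matter how many slices are processed, which is the point of the optimization above.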

SLIDE 34

GPU performance optimization in DBT CAD

  • GPU kernel optimization;
  • GPU memory allocation/de-allocation;
  • CPU/GPU image data copy;

~10 minutes/view (CPU version) → ~15 seconds/view

  • Satisfies clinical requirements;
SLIDE 35

Conclusions

  • GPU CUDA technology can speed up CAD substantially;
  • The CUDA-based DBT micro-calcification CAD runs in under 15 seconds/view, satisfying clinical requirements!

SLIDE 36

Acknowledgement

  • Andrew F. Vandergrift, Business Development Manager – Medical Imaging, Nvidia (Santa Clara, CA);
  • Bob Keating, Senior Solution Architect, Nvidia (Chalfont, PA);
  • Sarah Tariq, Nvidia (Santa Clara, CA);
  • Ashiwini Kshirsagar, Principal Scientist, Hologic (Santa Clara, CA);
  • Jun Ge, Principal Scientist, Hologic (Santa Clara, CA);
  • Jin-long Chen, Principal Scientist, Hologic (Santa Clara, CA);
  • Liyang, Senior Scientist, Hologic (Santa Clara, CA);
  • Xiaomin Liu, Scientist, Hologic (Santa Clara, CA);