Computer Aided Detection (CAD) for 3D Breast Imaging and GPU Technology
Xiangwei Zhang, Chui Haili
Imaging and CAD Science, Hologic Inc., Santa Clara, CA
03/19/2015
3/25/2015
Summary
- Computer aided detection (CAD) of breast cancer in 3D digital breast tomosynthesis
  - An introduction;
- GPU kernel optimization
  - A case study of convolution filtering;
- GPU optimization in DBT CAD
  - GPU/CPU data copy; GPU memory management;
- Conclusions/Questions;
Cancer Incidence/Mortality Rates (USA)
Disease Type          Incidence    Mortality
Lung Cancer           169,400      154,900
Colorectal Cancer     148,300      56,600
Breast Cancer         205,000      40,000
Prostate Cancer       189,000      30,200

Source: American Cancer Society, 2001
Computer Aided Detection (CAD)
- Early detection of cancer is the key to reducing the mortality rate;
- Medical imaging can help with the early detection of cancer;
  - Breast X-ray mammography, chest X-ray, lung CT, colonoscopy, brain MRI, etc.;
- Interpreting the images to find signs of cancer is very challenging for radiologists;
- Automated processing using computer software helps radiologists in clinical decision making;
  - Various image analysis software, including computer aided detection (CAD) and diagnosis (CADx);
Medical Imaging Applications
[Figure: example images - breast mammography, lung CT, chest X-ray, colonoscopy]
Micro-calcifications in digital mammography
CAD for 2D Mammography
- Each patient/examination has 4 views (left/right breast, CC/MLO views);
- There are four 2D images to be processed;
- CAD generates marker overlays (triangle - micro-calcification clusters; star - mass density or spiculation/architectural distortion);
2D Mammography CAD processing flow
- Pre-processing
  - Pixel value transformation (log, invert, scaling);
- Segmentation
  - Breast, pectoral muscle (MLO view), roll-off region;
- Suspicious candidate generation
  - Filtering (general and dedicated), region growing;
- Region analysis/classification
  - Feature extraction/selection, classification;
- It takes ~10 seconds/view to complete (pure CPU implementation);
Digital Breast Tomosynthesis (DBT)
- Acquisition
  - Multiple 2D projection views (PVs) are acquired at different angles (11 to 15 views);
  - The angular span is limited to 15 to 30 degrees to obtain high in-plane resolution;
  - Each projection uses a much lower dose than 2D mammography;
- Reconstruction
  - Back projection is used to reconstruct a 3D volume with a 1mm slice interval;
  - Usually a volume consists of 40 to 80 slices (1560x2457 pixels/slice);
- Advantage (vs 2D mammogram)
  - Reduced tissue overlap reveals 3D anatomical structures hidden in 2D;
- Disadvantage
  - Much more data to interpret and store;
DBT acquisition and reconstruction
[Figure: the X-ray tube rotates about the center of rotation above the compressed breast (between the compression paddle and the digital detector), acquiring projection views PV1 ... PVn; reconstruction slices are computed from the PVs]
DBT CAD processing flow
- Slice by slice processing (similar to the first three steps in 2D CAD);
- 3D region growing;
- 3D region analysis/classification;
- Prototype (2007)
  - Pure CPU implementation;
  - It takes ~10 minutes/view to complete;
  - Clinically unacceptable;
- What can we do to speed up?
  - CUDA computation on GPGPU;
GPU kernel performance optimization
- Key requirements for good GPU kernel performance
  - Sufficient parallelism;
  - Efficient memory access;
  - Efficient instruction execution;
- Efficient memory access: a case study of 1D convolution on a 2D image with different implementations
  - CPU;
  - GPU
    - Global memory;
    - Texture memory;
    - Shared memory;
GPU CUDA memory space
[Figure: CUDA memory hierarchy - the host accesses the device's global, constant, and texture memory; within the grid, each thread block has its own shared memory, and each thread has its own registers and local memory]
GPU global memory access optimization
- GPU global memory
  - DRAM -- high latency;
  - Not necessarily cached;
- Many algorithms are memory-limited
  - Or at least somewhat sensitive to memory bandwidth;
  - The ratio of arithmetic operations to memory accesses is low;
- Optimization goal: maximize bandwidth utilization
  - Memory accesses are per warp (warp - 32 consecutive threads in a single block);
  - Memory accesses are in discrete chunks (line - 128 bytes; segment - 32 bytes);
  - The key is to have sufficient coalesced (concurrent, aligned) memory access per warp;
Efficient GPU memory access
[Figure: all addresses from a warp fall within one aligned 128-byte region of memory - served by one 4-segment transaction]
Inefficient GPU memory access
[Figure: the addresses from a warp are scattered across many memory segments, requiring multiple transactions]
A case study: 1D vertical convolution on a 2D image
Convolution: for each pixel, the new pixel value is the weighted sum of the pixel values in the defined neighborhood;
[Figure: a pixel of interest and its vertical local neighborhood in a 2D image]
Running time comparison: CPU and GPU
- Platform
  - Host - Dell Precision 7500
    - CPU: Intel Xeon dual core @3.07GHz, @3.06GHz;
    - RAM: 16.0GB;
  - Device
    - GeForce GTX 690 (dual card);
    - 3072 CUDA cores (1536x2), 16 SMs;
    - 4GB 512-bit GDDR5;
    - PCI Express 3.0 x16;
    - CUDA 5.5;
  - OS
    - Windows 7 Professional Service Pack 1, 2009;
CPU based – Two ways of serial processing
Assumption: the 2D image is stored in a contiguous linear memory space; left: row-wise traversal; right: column-wise traversal;
[Figure: the two traversal orders over the 2D image]
CPU based – Two ways of serial processing
- The speeds are different;
  - Due to the linear memory structure and data caching;
- Vertical = 146.59 ms/run; Horizontal = 104.19 ms/run;
[Chart: running time for the two CPU versions (milliseconds/run)]
Global memory based – thread-block design 1
CUDA implementation: the whole image is divided into multiple vertical bar shaped thread-blocks (1x128);
[Figure: a 1x128 vertical thread-block over the 2D image, showing the pixel of interest and its local neighborhood]
Global memory based – thread-block design 2
CUDA implementation: the whole image is divided into multiple horizontal bar shaped thread-blocks (128x1);
[Figure: a 128x1 horizontal thread-block over the 2D image]
Global memory based – Vertical vs Horizontal
- The speeds are quite different;
  - Due to the linear memory structure and coalesced (concurrent, aligned) reading within a warp;
- Vertical = 13.325 ms/run; Horizontal = 1.652 ms/run;
  - ~60x speedup compared to the CPU version;
[Chart: running time for the CPU and the two GPU global memory versions (milliseconds/run)]
Texture memory based version
- Texture memory
  - Read-only cache;
  - Good for scattered reads;
  - Cache granularity is 32 bytes (one segment);
- Two different thread-blocks
  - Vertical (1x128);
  - Horizontal (128x1);
Texture memory based – Vertical vs Horizontal
- The speeds are different;
- Vertical = 2.507 ms/run; Horizontal = 1.707 ms/run;
  - Horizontal: comparable to the global memory version;
  - Vertical: much better than the global memory version (the texture cache is better at scattered reads);
[Chart: running time for the global and texture memory versions (milliseconds/run)]
Shared memory based version
- Shared memory
  - Read/write cache within each SM;
  - Low latency compared to global memory, or even texture memory;
  - When used as a read cache, the original data still needs to be loaded from global memory;
- Two different thread-blocks
  - Vertical (1x768);
  - Horizontal (32x24);
Shared memory based – thread-block design 1
CUDA implementation: the whole image is divided into multiple vertical bar shaped thread-blocks (1x768);
[Figure: a 1x768 vertical thread-block over the 2D image]
Shared memory based – thread-block design 2
CUDA implementation: the whole image is divided into multiple rectangular thread-blocks (32x24 = 768 threads);
[Figure: a 32x24 thread-block over the 2D image]
Shared memory based – Vertical vs Horizontal
- The speeds are quite different;
- Vertical = 3.725 ms/run; Horizontal = 1.084 ms/run;
  - Horizontal: better than both the global and texture memory versions;
  - Vertical: better than global memory, worse than texture memory;
[Chart: running time for the global, texture, and shared memory versions (milliseconds/run)]
GPU performance optimization
- Key requirements for good GPU kernel performance
- Sufficient parallelism;
- Efficient memory access;
- Efficient instruction execution;
- Other overheads (convolution example)
- GPU memory allocation;
- CPU to GPU image data copy;
- Kernel execution;
- GPU to CPU image result copy;
- GPU memory de-allocation;
Running time of kernel vs other overheads
[Chart: GPU kernel time vs GPU/CPU data copy and GPU memory allocation/de-allocation overheads (ms/run)]
GPU performance optimization in CAD
- CPU/GPU image data copy
  - CUDA-ize as many processing components as possible, so that data can stay on the GPU without being passed back to the CPU;
[Figure: CAD processing flow annotated with CPU computation, GPU computation, GPU/CPU data copies, and GPU memory allocation/de-allocation]
GPU performance optimization in CAD
- GPU memory allocation/de-allocation
  - Allocate memory once, re-use it as much as possible;
  - Very suitable for slice processing in DBT CAD (2D images with fixed dimensions, changing contents);
GPU performance optimization in DBT CAD
- GPU kernel optimization;
- GPU memory allocation/de-allocation;
- CPU/GPU image data copy;
- Result: from ~10 minutes/view (CPU version) down to ~15 seconds/view;
- Satisfies clinical requirements;
Conclusions
- GPU CUDA technology can speed up CAD dramatically;
- The CUDA based DBT micro-calcification CAD runs in under 15 seconds/view, satisfying clinical requirements!
Acknowledgement
- Andrew F. Vandergrift, Business Development Manager - Medical Imaging, Nvidia (Santa Clara, CA);
- Bob Keating, Senior Solution Architect, Nvidia (Chalfont, PA);
- Sarah Tariq, Nvidia (Santa Clara, CA);
- Ashiwini Kshirsagar, Principal Scientist, Hologic (Santa Clara, CA);
- Jun Ge, Principal Scientist, Hologic (Santa Clara, CA);
- Jin-long Chen, Principal Scientist, Hologic (Santa Clara, CA);
- Liyang, Senior Scientist, Hologic (Santa Clara, CA);
- Xiaomin Liu, Scientist, Hologic (Santa Clara, CA);