

slide-1
SLIDE 1

Multi-GPU Accelerated Refraction-Corrected Reflection Image Reconstruction for 3D Ultrasound Breast Imaging

Qun (Maxine) Liu, Martin Cwikla

slide-2
SLIDE 2

Presentation Overview

  • Background
  • Motivation & Problem Statement
  • Technical Design
  • GPU Implementation
  • Results
  • Contact Information
slide-3
SLIDE 3

Scanner Introduction

  • For breast tissue evaluation;
  • Quantitative transmission image;
  • Qualitative reflection image;
  • No radiation;
  • Patient comfort improvement.

Figure 1 QTultrasound scanner

slide-4
SLIDE 4

Scanner Geometry

Figure 2 Five scanning arrays mounted on the tri-channel

Figure 3 Geometry of three reflection arrays and a pair of transmission arrays

Background

slide-5
SLIDE 5

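Data Acquisition

[Figure: pulse-echo signal between transducer and receiver, plotted as voltage versus time. Labeled events: transmitted pulse, echo from skin surface, echo from lesion front face, echo from lesion back face. An echo from a reflector at depth d arrives at time t = 2d / c, where c is the speed of sound.]

Background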
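The timing relation on this slide, t = 2d / c, can be illustrated with a small sketch. The function names and the 1500 m/s speed of sound used in the usage note are illustrative assumptions, not values from the presentation.

```cpp
#include <cassert>

// Round-trip time (seconds) of an echo from a reflector at depth d (metres),
// for a speed of sound c (metres per second): t = 2d / c.
double echoTime(double depthM, double speedMps) {
    assert(speedMps > 0.0);
    return 2.0 * depthM / speedMps;
}

// Inverse relation: reflector depth from echo arrival time, d = c * t / 2.
double echoDepth(double timeS, double speedMps) {
    assert(speedMps > 0.0);
    return 0.5 * speedMps * timeS;
}
```

For example, with an assumed c of 1500 m/s, a reflector 3 cm deep returns its echo after about 40 microseconds.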

slide-6
SLIDE 6

B-mode Scan and Acoustic Wave Behavior

Figure 4 Sample of B-mode (brightness mode). Image driven.

Figure 5 Acoustic wave behavior at the inhomogeneous border between two different media.

Background

http://www.sonoguide.com/physics.html

slide-7
SLIDE 7

Image Reconstruction Algorithm

Background

  • Transmission Image Preprocessing: mapping transmission results into reflection image space.
  • Support Function Generation: signal gain control for attenuation.
  • Refraction-corrected Ray Tracing: ray (data) driven B-mode back-projection tomography.
  • Image Postprocessing: readability improvement.

slide-8
SLIDE 8

Refraction-Corrected Ray Tracing

Background

Eikonal equation:

$$\frac{d}{ds}\left(n\,\frac{d\mathbf{r}}{ds}\right) = \nabla n, \qquad n(\mathbf{r}) = \frac{c_0}{c(\mathbf{r})}$$

s: arc length along the ray; r: ray position vector in 3D; n: refractive index.

Euler step method:

$$\mathbf{r}_{i+1} = \mathbf{r}_i + h \cdot u_{\mathbf{r}_{i+1/2}}, \qquad i = 1, \dots$$

h: step length; u: unit tangent vector to the ray path.
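One possible reading of the step rule on this slide is an explicit midpoint scheme: the ray advances by h along the tangent estimated half a step ahead. The sketch below follows that reading under stated assumptions. `Vec3`, `Medium` and `traceRay` are hypothetical names, the medium is supplied as two callbacks, and the auxiliary state T = n·u (so that dT/ds = ∇n) is one common way to integrate the ray equation, not necessarily the authors' implementation.

```cpp
#include <cassert>
#include <cmath>
#include <functional>

struct Vec3 { double x, y, z; };

static Vec3 add(const Vec3& a, const Vec3& b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
static Vec3 scale(const Vec3& a, double s) { return {a.x * s, a.y * s, a.z * s}; }
static double norm(const Vec3& a) { return std::sqrt(a.x * a.x + a.y * a.y + a.z * a.z); }
static Vec3 unit(const Vec3& a) { return scale(a, 1.0 / norm(a)); }

// Hypothetical medium description: refractive index n(r) and its gradient.
struct Medium {
    std::function<double(const Vec3&)> n;
    std::function<Vec3(const Vec3&)> gradN;
};

// Traces one ray for `steps` steps of length h, starting at r with direction u0.
// State: position r and T = n * u, so that dr/ds = T / |T| and dT/ds = grad n.
Vec3 traceRay(const Medium& m, Vec3 r, const Vec3& u0, double h, int steps) {
    assert(h > 0.0 && steps >= 0);
    Vec3 T = scale(unit(u0), m.n(r));
    for (int i = 0; i < steps; ++i) {
        // Half step: estimate position and tangent state at i + 1/2.
        Vec3 rMid = add(r, scale(unit(T), 0.5 * h));
        Vec3 TMid = add(T, scale(m.gradN(r), 0.5 * h));
        // Full step with the midpoint tangent: r_{i+1} = r_i + h * u_{i+1/2}.
        r = add(r, scale(unit(TMid), h));
        T = add(T, scale(m.gradN(rMid), h));
    }
    return r;
}
```

In a homogeneous medium the gradient vanishes and the ray stays straight; with a gradient, the ray bends toward the region of higher refractive index. Each step needs the result of the previous one, which is the sequential dependency that makes a single ray hard to parallelize.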

slide-9
SLIDE 9

Compounding to Tomography

Background

slide-10
SLIDE 10

Challenges for Parallel Computation

Sequential operation of refraction-corrected ray tracing

  • Each step of each ray depends on the previous step's position and refractive index and the current step's refractive index;
  • Each pixel's signal weighting is contributed by multiple rays;
  • Each ray's behavior is unpredictable in terms of position ranges.

File access speed limitation

  • Data writing and reading between pipeline stages allows all the operations in each stage to be computed independently;
  • However, the data throughput on an SSD or a hard drive becomes a limiting factor.

Large amount of memory management

  • Unknowns: 32390540 pixels;
  • Acquired data: around 1.88 GB;
  • Computation data: around 9.4 GB.

slide-11
SLIDE 11

Parallelism with multi-core CPU and GPU streaming processors

Design

[Diagram: CPU multiple worker threads, organized into work groups, each handle a set of views (View 1 … View 60). Within a view, the work is divided by probe (Probe 1 to Probe 3), level (Level 1 … Level 140), ray (Ray 1 … Ray 192) and sub-ray (SubRay 1 to SubRay 3) for refraction-corrected ray tracing. The worker threads are then joined with join().]
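The hierarchy on this slide (view, probe, level, ray) suggests that each ray within a view can be treated as an independent work item. A minimal sketch of mapping a flat work-item index, such as a GPU thread index, to one ray is given below. The extents 3, 140 and 192 are read off the slide's labels and are assumed to be counts; `Extents`, `RayId`, `decode` and `encode` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical extents per view, taken from the slide's labels
// (Probe 1-3, Level 1-140, Ray 1-192) and assumed to be counts.
struct Extents { std::uint32_t probes, levels, rays; };

// Zero-based identifiers of one ray inside a view.
struct RayId { std::uint32_t probe, level, ray; };

// Number of independent rays in one view.
std::uint64_t raysPerView(const Extents& e) {
    return static_cast<std::uint64_t>(e.probes) * e.levels * e.rays;
}

// Maps a flat, zero-based work-item index to (probe, level, ray),
// with the ray index varying fastest.
RayId decode(std::uint64_t flat, const Extents& e) {
    assert(flat < raysPerView(e));
    RayId id;
    id.ray = static_cast<std::uint32_t>(flat % e.rays);
    flat /= e.rays;
    id.level = static_cast<std::uint32_t>(flat % e.levels);
    id.probe = static_cast<std::uint32_t>(flat / e.levels);
    return id;
}

// Inverse mapping from (probe, level, ray) back to the flat index.
std::uint64_t encode(const RayId& id, const Extents& e) {
    return (static_cast<std::uint64_t>(id.probe) * e.levels + id.level) * e.rays + id.ray;
}
```

Under these assumed extents a view contains 3 × 140 × 192 = 80640 rays, each of which still has to be traced step by step.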

slide-12
SLIDE 12

Ray Tracing in Parallel

3D sparse matrix: Dictionary of Keys (DOK)

  • Pixel position -> key
  • Weight -> value
  • Time sample -> value

Design
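A dictionary-of-keys layout like the one named on this slide can be sketched with a hash map keyed by the linearized pixel position. The class and field names below are hypothetical, and the choice of a linear index as the key is an assumption; the slide only states that the pixel position is the key and that weight and time sample are the values.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Hypothetical value stored per touched pixel: weight and time sample index.
struct Contribution { float weight; std::int32_t timeSample; };

class DokVolume {
public:
    DokVolume(std::uint32_t nx, std::uint32_t ny, std::uint32_t nz)
        : nx_(nx), ny_(ny), nz_(nz) {}

    // Pixel position -> key: linear index of (x, y, z) in the 3D grid.
    std::uint64_t key(std::uint32_t x, std::uint32_t y, std::uint32_t z) const {
        assert(x < nx_ && y < ny_ && z < nz_);
        return (static_cast<std::uint64_t>(z) * ny_ + y) * nx_ + x;
    }

    // Stores (or overwrites) the values for one pixel.
    void set(std::uint32_t x, std::uint32_t y, std::uint32_t z,
             float weight, std::int32_t timeSample) {
        cells_[key(x, y, z)] = Contribution{weight, timeSample};
    }

    // Returns the stored values, or nullptr if the pixel was never touched.
    const Contribution* find(std::uint32_t x, std::uint32_t y, std::uint32_t z) const {
        auto it = cells_.find(key(x, y, z));
        return it == cells_.end() ? nullptr : &it->second;
    }

    // Number of pixels that hold a value.
    std::size_t size() const { return cells_.size(); }

private:
    std::uint32_t nx_, ny_, nz_;
    std::unordered_map<std::uint64_t, Contribution> cells_;
};
```

Only pixels that a ray actually crosses consume memory, which fits the observation that a ray's position range is not known in advance.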

slide-13
SLIDE 13

Concurrent Operations of CPU and GPU

    cudaMalloc(&dev, size);
    …
    // Create one stream per view computation (nStreams == nViews).
    cudaStream_t stream[nStreams];
    for (int iStream = 0; iStream < nStreams; ++iStream)
        cudaStreamCreate(&stream[iStream]);

    // Only one event needs to be created; it is re-recorded for each view.
    cudaEvent_t event;
    cudaEventCreate(&event);

    for (int iView = 0; iView < nViews; ++iView) {
        … read the data for iView
        // Asynchronous copy and kernel launch in this view's stream.
        cudaMemcpyAsync(dev, host, size, cudaMemcpyHostToDevice, stream[iView]);
        kernel<<<grid, block, 0, stream[iView]>>>(…, dev, …);
        if (iView + 1 < nViews) {
            // The next stream waits until the previous view's event is done.
            cudaEventRecord(event, stream[iView]);
            cudaStreamWaitEvent(stream[iView + 1], event, 0);
        }
    }
    // Blocking copy of the result back to the host.
    cudaMemcpy(host, dev, size, cudaMemcpyDeviceToHost);
    … destroy streams and event

Design

slide-14
SLIDE 14

Memory Contention Solution

[Diagram: views (View 1, View 2, … View 9, … View 20, … View 45, …) are assigned to thread groups. A worker thread proceeds when sufficient memory is available; otherwise it enters a memory wait, calls Sleep (5) and tries again. A view ends as Finished, or as Aborted when the memory wait raises an exception.]

Design
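The wait-and-retry behavior shown on this slide can be sketched as a shared memory budget that worker threads try to reserve from, sleeping between attempts and giving up with an exception. `MemoryBudget`, `reserveOrThrow`, the attempt limit and the pause length are all hypothetical; the slide only shows the states "Sufficient memory available", "Memory wait", "Sleep (5)", "Exception", "Finished" and "Aborted".

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <mutex>
#include <stdexcept>
#include <thread>

// Hypothetical shared budget of memory (in bytes) guarded by a mutex.
class MemoryBudget {
public:
    explicit MemoryBudget(std::size_t capacity) : free_(capacity) {}

    // Reserves `bytes` if they are available; returns false otherwise.
    bool tryReserve(std::size_t bytes) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (bytes > free_) return false;
        free_ -= bytes;
        return true;
    }

    // Returns a previous reservation to the budget.
    void release(std::size_t bytes) {
        std::lock_guard<std::mutex> lock(mutex_);
        free_ += bytes;
    }

    std::size_t available() {
        std::lock_guard<std::mutex> lock(mutex_);
        return free_;
    }

private:
    std::mutex mutex_;
    std::size_t free_;
};

// Memory wait: sleep and retry until the reservation succeeds.
// Throws after maxAttempts failed attempts, so the caller can abort the view.
void reserveOrThrow(MemoryBudget& budget, std::size_t bytes,
                    int maxAttempts, std::chrono::milliseconds pause) {
    for (int attempt = 0; attempt < maxAttempts; ++attempt) {
        if (budget.tryReserve(bytes)) return;
        std::this_thread::sleep_for(pause);
    }
    throw std::runtime_error("memory wait timed out");
}
```

A worker thread would call `reserveOrThrow` before processing a view and `release` when the view is finished; catching the exception corresponds to the "Aborted" outcome.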

slide-15
SLIDE 15

Hardware Selection

  • Stability and reliability: long-term product;
  • High single precision floating-point performance: 4.20 TFlops;
  • Large memory to support multiple CPU worker threads operation: 12 GB.

Tesla K40 GPU

slide-16
SLIDE 16

Software Architecture Design

Layers: Host Layer, GPU Layer, Interface Layer, Pipeline Layer

  • Interface to rest of production software (GUI, standalone);
  • GPU kernels implementation;
  • Manage GPU memory; CPU and GPU data transfer; call GPU kernels and check kernel errors; schedule asynchronous operations; support multi-GPUs;
  • Object-oriented API for each stage of reconstruction algorithm;
  • Read parameters, order of algorithm operations, etc., from a configuration file.

Implementation

slide-17
SLIDE 17

Performance Test (individual functions)

Function                              Single CPU time   Single GPU time   Speedup
Bilinear interpolation                        1930.00             33.28    57.99X
Blurring filter (FFT included)               15660.00            129.17   121.24X
L1 norm fit third-order polynomial             100.00              7.24    13.81X
Nearest points mapping                        6850.00             39.98   171.34X
Compounding images                          366800.00           4403.09    83.31X
Dynamic gain for images                        910.01             42.88    21.22X

Note: The calculation of GPU time includes data transfer from host to device and back from device to host. All times are given in milliseconds.

slide-18
SLIDE 18

Performance Test (overall)

Reconstruction Stage                  Single CPU time   Single GPU time   Single GPU speedup   Two GPU time   Two GPU speedup
Preprocessing and Support Function              34.09              9.36                3.64X           7.72             4.42X
Refraction-Corrected Ray Tracing              1899.98             63.29               30.02X          45.53            41.73X
Compounding Views                               39.33              0.84               46.71X           0.84            46.71X
Entire Reflection Reconstruction              2108.40             79.16               26.63X          54.57            38.64X

Note: All times are presented in seconds.

Result
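The speedup columns are the ratio of CPU time to GPU time. The sketch below reproduces that arithmetic and adds a two-GPU scaling efficiency, which is a derived quantity not reported on the slide; both function names are hypothetical.

```cpp
#include <cassert>

// Speedup of an accelerated run relative to a baseline run (same time unit).
double speedup(double baselineTime, double acceleratedTime) {
    assert(acceleratedTime > 0.0);
    return baselineTime / acceleratedTime;
}

// Parallel efficiency when going from one GPU to nGpus:
// (single-GPU time / multi-GPU time) / nGpus, where 1.0 means perfect scaling.
double scalingEfficiency(double singleGpuTime, double multiGpuTime, int nGpus) {
    assert(nGpus > 0);
    return speedup(singleGpuTime, multiGpuTime) / nGpus;
}
```

With the figures for the entire reflection reconstruction, 2108.40 s / 79.16 s gives the reported 26.63X, and the two-GPU run is about 1.45 times faster than the single-GPU run, a scaling efficiency of roughly 73%.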

slide-19
SLIDE 19

Case Images 1: Multiple Cysts

Figure 9 Coronal, Axial and Sagittal images present multiple cysts.

Figure 10 Comparison with mammography, hand-held ultrasound

Result

slide-20
SLIDE 20

Case Images 2: Invasive Ductal Carcinoma

Figure 11 Coronal, Axial and Sagittal images present invasive ductal carcinoma

Figure 12 Comparison with mammography, hand-held ultrasound

Result

slide-21
SLIDE 21

Contact Information

Qun (Maxine) Liu: Scientist
QTultrasound, LLC
3216 S Highland Drive, Suite 100, Salt Lake City, UT 84121
Email: maxine.liu@qtultrasound.com
Personal Email: maxineliuqun@gmail.com
Cell: 979-703-9475
Website: http://qunmaxineliu.weebly.com/

Martin Cwikla: Senior Software Engineer
QTultrasound, LLC
3216 S Highland Drive, Suite 100, Salt Lake City, UT 84121
Email: martin.cwikla@qtultrasound.com
Personal Email: mcwikla@ieee.org
Cell: 801-512-1027

slide-22
SLIDE 22

Thank you