SLIDE 1

Real-time image segmentation for Homeland Security exploiting Hyper-Q concurrency

Fanny Nina-Paravecino

David Kaeli

NU-MGH CUDA Research Center

Dept. of Electrical and Computer Engineering

Northeastern University, Boston, MA

SLIDE 2

Homeland Security

2 March 17-20, GTC 2015 San Jose, California

Image Credit: Thinkstock.com

SLIDE 3

Homeland Security

Alert!

Image Credit: snallabolaget.com

SLIDE 4

Homeland Security

Constraints of the input data:

  • Noise
  • Hundreds of frames per object

SLIDE 5

Homeland Security

  • One key application for Homeland Security is high-quality luggage inspection at airports
  • This task is challenging because it must satisfy two constraints:
      • Near real-time response
      • Very high accuracy
  • We explore using CUDA 6.5 and newer GPU hardware features (Dynamic Parallelism and Hyper-Q) to address these needs in this application

SLIDE 6

Outline for this presentation

  • Background on the imaging analysis problem
  • Connected Component Analysis
  • Performance optimization
  • NVIDIA’s Hyper-Q
  • Performance results
  • Conclusion and future work

SLIDE 7

Homeland Security

Image dimensions: 512 x 512

~700 images …

[Figure labels: One Frame, DICOM Image, Multiple Frames]

SLIDE 8

Object Detection Pipeline

[Figure: object detection pipeline: Input → Preprocessing → Image Segmentation → Feature Extraction → Object Detection]

SLIDE 9

Homeland Security

  • Image segmentation plays a key role in the compute pipeline when performing object detection.
  • Multiple algorithms:
      • Graph-based image segmentation [Felzenszwalb04]
      • Level Set [Shi05]
      • Spectral Clustering [Zelnik-Manor04]
      • Connected Component Labeling [Zhao10]

SLIDE 10

Outline for this presentation

  • Background on the imaging analysis problem
  • Connected Component Analysis
  • Performance optimization
  • NVIDIA’s Hyper-Q
  • Performance results
  • Conclusion and future work

SLIDE 11

Connected Component Labeling

  • Connected Component Labeling is a good fit given the constraints of the environment
  • Connected Component Labeling groups neighboring pixels with similar intensities into segments
  • Potential for efficient segmentation
  • Provides high-quality results

SLIDE 12

Connected Component Labeling

Input image (5 x 5 intensities; the slide shows the same grid again as the labeled output):

1 1 2 2 2
1 2 3 3 3
2 2 3 3 4
2 3 4 4 4
2 3 2 2 2

  • 7 segments, even though there are only four different intensities
  • Groups pixels by location and intensity

A lot of dependencies among neighbors!
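The segment count on this slide can be checked with a small CPU reference. The sketch below is not the authors' GPU implementation: it is a plain breadth-first flood fill in Python, and it assumes 4-connectivity (up, down, left, right), which is the neighborhood that reproduces the 7 segments for the 5 x 5 grid. The names `GRID` and `count_segments` are illustrative only.

```python
from collections import deque

# The 5 x 5 intensity grid from the slide, read row by row.
GRID = [
    [1, 1, 2, 2, 2],
    [1, 2, 3, 3, 3],
    [2, 2, 3, 3, 4],
    [2, 3, 4, 4, 4],
    [2, 3, 2, 2, 2],
]


def count_segments(image):
    """Count connected components of equal intensity (assumed 4-connectivity)."""
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    segments = 0
    for r in range(rows):
        for c in range(cols):
            if seen[r][c]:
                continue
            # Unvisited pixel: start a new segment and flood-fill it.
            segments += 1
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:
                cr, cc = queue.popleft()
                for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    nr, nc = cr + dr, cc + dc
                    if (0 <= nr < rows and 0 <= nc < cols
                            and not seen[nr][nc]
                            and image[nr][nc] == image[cr][cc]):
                        seen[nr][nc] = True
                        queue.append((nr, nc))
    return segments


print(count_segments(GRID))
```

Every pixel's label depends on its neighbors' labels, which is the dependency chain the slide points out and the reason a naive per-pixel parallelization does not work directly.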

SLIDE 13

Outline for this presentation

  • Background on the imaging analysis problem
  • Connected Component Analysis
  • Performance optimization
  • NVIDIA’s Hyper-Q
  • Performance results
  • Conclusion and future work

SLIDE 14

How can we improve the performance of CCL?

  • Exploit inherent parallelism!
  • Dependencies among neighbors?
      • Stripe-based Connected Component Labeling [Zhao10]
      • Restructure the storage of labels
  • Merge phase of the stripe-based approach?
      • Exploit CUDA’s Dynamic Parallelism
  • Further optimizations
      • Explore the potential of using Hyper-Q

SLIDE 15

Accelerated Connected Component Labeling

  • Two phases:
      • Phase 0: Find Spans
      • Phase 1: Merge Spans

[Figure: Phase 0 reads the binary input image (N x M matrix, N x M threads) and writes a spans matrix (N x K, each pair = one span) and a label index matrix (N x K/2). Phase 1 uses child threads to update the label index.]

Spans matrix (N x K):

0 0 2 2
1 2 - -
0 0 - -

Label index matrix (N x K/2), after Phase 0 → after Phase 1:

1 2      1 2
3 -      2 -
5 -      5 -

SLIDE 16

Phase 0: Find Spans

  • Each span has two elements: $(y_{start}, y_{end})$
  • A unique label is assigned immediately

$span_x = \{(y_{start}, y_{end}) \mid I(x, y_{start}) = I(x, y_{start}+1) = \dots = I(x, y_{end})\}$

[Figure: binary image (N x M), span matrix, and label matrix]

Label matrix:

  • Reduced intermediate matrix of labels
  • Half the size of the span matrix
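Phase 0 can be pictured with a sequential Python sketch. This is a CPU illustration under stated assumptions, not the authors' CUDA kernel: the 3 x 3 binary image is inferred from the span values shown in the deck, a span is taken to be a run of foreground (non-zero) pixels in one row, and the labeling rule `row * (K/2) + slot + 1` is inferred from the label values 1, 2, 3, 5 shown in the deck. The names `find_spans` and `IMAGE` are hypothetical.

```python
# Assumed 3 x 3 binary image, inferred from the deck's span values.
IMAGE = [
    [1, 0, 1],
    [0, 1, 1],
    [1, 0, 0],
]


def find_spans(image):
    """Return (spans, labels) for a binary image, one list of slots per row.

    spans[x][k] is a (y_start, y_end) pair or None for an unused slot.
    labels[x][k] is the unique initial label of that span, or None.
    """
    n_cols = len(image[0])
    # A row of M pixels holds at most ceil(M / 2) separate runs, so the
    # label matrix is half the width of the span matrix (K/2 vs. K).
    max_spans = (n_cols + 1) // 2
    spans = [[None] * max_spans for _ in image]
    labels = [[None] * max_spans for _ in image]
    for x, row in enumerate(image):
        k = 0  # next free span slot in this row
        y = 0
        while y < n_cols:
            if row[y]:
                y_start = y
                # Extend the run while the next pixel is also foreground.
                while y + 1 < n_cols and row[y + 1]:
                    y += 1
                spans[x][k] = (y_start, y)
                # Unique label assigned immediately (assumed rule).
                labels[x][k] = x * max_spans + k + 1
                k += 1
            y += 1
    return spans, labels


spans, labels = find_spans(IMAGE)
print(spans)
print(labels)
```

Because every row is scanned independently, this phase has no cross-row dependencies, which is what makes it a natural fit for one GPU thread (or thread group) per row.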

SLIDE 17

Dynamic Parallelism

  • Kepler GK110 [Whitepaper: NVIDIA’s Next Generation CUDA™ Compute Architecture: Kepler™ GK110]
      • Nested parallelism

Courtesy: NVIDIA

SLIDE 18

Phase 1: Merge Spans

[Figure: flowchart with the labels Parent Kernel, Merge Span, Yes / No, Next span, One single update]

Spans matrix:

0 0 2 2
1 2 - -
0 0 - -

Label matrix, before → after the merge:

1 2      1 2
3 -      2 -
5 -      5 -
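The effect of Phase 1 on the example matrices can be reproduced with a sequential sketch. It is only a CPU illustration: the deck's version runs as a parent kernel that launches child threads through Dynamic Parallelism, whereas this sketch swaps in a small union-find structure, and it assumes two spans in adjacent rows are merged when their column ranges overlap (4-connectivity). The function name `merge_spans` is hypothetical; the input values are the span and label matrices shown in the deck.

```python
# Span and label matrices as shown in the deck (None marks an empty slot).
SPANS = [[(0, 0), (2, 2)], [(1, 2), None], [(0, 0), None]]
LABELS = [[1, 2], [3, None], [5, None]]


def merge_spans(spans, labels):
    """Return a new label matrix where touching spans share the smallest label."""
    parent = {lab: lab for row in labels for lab in row if lab is not None}

    def find(a):
        # Follow parent links to the representative label, halving the path.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            # Keep the smaller label as the representative.
            parent[max(ra, rb)] = min(ra, rb)

    for x in range(1, len(spans)):
        for i, cur in enumerate(spans[x]):
            if cur is None:
                continue
            for j, prev in enumerate(spans[x - 1]):
                if prev is None:
                    continue
                # Column ranges overlap: the two spans belong together.
                if cur[0] <= prev[1] and prev[0] <= cur[1]:
                    union(labels[x][i], labels[x - 1][j])

    return [[find(lab) if lab is not None else None for lab in row]
            for row in labels]


print(merge_spans(SPANS, LABELS))
```

Only the span in the middle row changes (label 3 becomes 2), matching the single update shown in the figure; the span in the last row does not touch the row above and keeps label 5.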

SLIDE 19

Outline for this presentation

  • Background on the imaging analysis problem
  • Connected Component Analysis
  • Performance optimization
  • NVIDIA’s Hyper-Q
  • Performance results
  • Conclusion and future work

SLIDE 20

Hyper-Q

  • Kepler: Hyper-Q working with CUDA streams [Whitepaper: NVIDIA’s Next Generation CUDA™ Compute Architecture: Kepler™ GK110]

Courtesy: NVIDIA

SLIDE 21

When should we use Hyper-Q?

  • Identify kernels that have low utilization of the device
  • Identify applications that can allow for concurrent kernel execution
  • Two tasks:
      • Analyze the applications
      • Analyze the kernels

SLIDE 22

Outline for this presentation

  • Background on the imaging analysis problem
  • Connected Component Analysis
  • Performance optimization
  • NVIDIA’s Hyper-Q
  • Performance results
  • Conclusion and future work

SLIDE 23

Accelerated Connected Component Labeling

  • Resource utilization per kernel
      • Find Spans:
          • SMX Activity: 27%
          • Occupancy: 0.11
      • Merge Spans:
          • SMX Activity: 31%
          • Occupancy: 0.09

SLIDE 24

Accelerated Connected Component Labeling

  • Exploiting Hyper-Q

[Figure: Stream 1, Stream 2, …, Stream N feed Hyper-Q. Each stream processes 2 frames; each frame has 512 x 512 pixels.]
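The frame-to-stream mapping in this figure can be sketched on the host side. The Python snippet below only illustrates the partitioning (2 frames per stream, as stated on the slide); it does not launch any GPU work, and the function name `assign_frames_to_streams` is hypothetical. In an actual CUDA program each batch would be issued on its own stream so that Hyper-Q can run the low-occupancy kernels of different batches concurrently.

```python
def assign_frames_to_streams(n_frames, frames_per_stream=2):
    """Split frame indices 0..n_frames-1 into consecutive per-stream batches."""
    return [
        list(range(start, min(start + frames_per_stream, n_frames)))
        for start in range(0, n_frames, frames_per_stream)
    ]


# 8 frames at 2 frames per stream need 4 streams.
batches = assign_frames_to_streams(8)
for stream_id, frames in enumerate(batches, start=1):
    print(f"Stream {stream_id}: frames {frames}")
```

With this mapping the stream count is always half the frame count, which is the relationship between the "# Streams" and "# Frames" columns in the performance results.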

SLIDE 25

Concurrent kernel execution

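[Figure: timeline of Stream 1, Stream 2, Stream 3, …, each running the kernel sequence Find Spans, Merge Spans, Re-label, with kernels from different streams executing concurrently]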

SLIDE 26

Performance Results

  • Speedup of a stream-based ACCL run on CUDA 6.5 vs. OpenMP with 8 threads on an Intel Core i7-3770K

# Streams   # Frames   OpenMP CCL (s)   ACCL (s)   Speedup
4           8          2.72             1.35       2.01x
8           16         10.79            2.73       3.94x
16          32         42.92            5.43       7.91x
32          64         171.18           10.79      15.32x
64          128        1020.00          21.56      47.32x
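The speedup column is the ratio of the OpenMP time to the ACCL time. The short Python check below recomputes that ratio from the timings as printed on the slide. The recomputed values do not match the reported column exactly (for example, about 15.86x against the reported 15.32x at 32 streams); one possible explanation, which is an assumption here, is that the reported speedups were derived from unrounded measurements. The names `ROWS` and `recompute_speedups` are illustrative.

```python
# (streams, frames, OpenMP CCL seconds, ACCL seconds, reported speedup)
ROWS = [
    (4, 8, 2.72, 1.35, 2.01),
    (8, 16, 10.79, 2.73, 3.94),
    (16, 32, 42.92, 5.43, 7.91),
    (32, 64, 171.18, 10.79, 15.32),
    (64, 128, 1020.00, 21.56, 47.32),
]


def recompute_speedups(rows):
    """Return (streams, OpenMP time / ACCL time, reported speedup) per row."""
    return [
        (streams, openmp_s / accl_s, reported)
        for streams, _frames, openmp_s, accl_s, reported in rows
    ]


for streams, ratio, reported in recompute_speedups(ROWS):
    print(f"{streams:>3} streams: recomputed {ratio:6.2f}x, reported {reported:.2f}x")
```

The ACCL time roughly doubles each time the frame count doubles, while the OpenMP time grows much faster, which is why the speedup widens with the number of streams.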

SLIDE 27

Outline for this presentation

  • Background on the imaging analysis problem
  • Connected Component Analysis
  • Performance optimization
  • NVIDIA’s Hyper-Q
  • Performance results
  • Conclusion and future work

SLIDE 28

Conclusion

  • Improved the performance of the image segmentation task for the baggage scanning problem
  • Exploited NVIDIA’s Hyper-Q feature to accelerate Connected Component Labeling
  • Compared an OpenMP CCL implementation with our ACCL implementation
      • Our algorithm scales well as we increase the number of streams
  • Kernels with low occupancy are the best fit for Hyper-Q

SLIDE 29

Future work

  • Combine Hyper-Q with MPI to exploit multiple grains of parallelism using multiple GPU nodes
  • Evaluate additional image segmentation algorithms that address the constraints of baggage scanning

SLIDE 30

THANK YOU

  • Questions?
  • fninaparavecino@ece.neu.edu
