 
Real-time image segmentation for Homeland Security exploiting Hyper-Q concurrency
Fanny Nina-Paravecino, David Kaeli
NU-MGH CUDA Research Center
Dept. of Electrical and Computer Engineering
Northeastern University, Boston, MA

Homeland Security
[Image. Credit: Thinkstock.com]
March 17-20, GTC 2015, San Jose, California

Homeland Security
Alert!
[Image. Credit: snallabolaget.com]

Homeland Security
Constraints of the input data:
- Noise
- Hundreds of frames per object

Homeland Security
- One key application for Homeland Security is high-quality luggage inspection at airports
- The task is challenging because of two constraints:
  - Near real-time response needed
  - Very high accuracy needed
- We will explore using CUDA 6.5 and new hardware features to address these needs

Outline for this presentation
- Background on the imaging analysis problem
- Connected Component Analysis
- Performance optimization
- NVIDIA’s Hyper-Q
- Performance results
- Conclusion and future work

Homeland Security
Image dimensions:
- One frame: a 512 x 512 DICOM image
- Multiple frames: ~700 images

Object Detection Pipeline
Input -> Preprocessing -> Image Segmentation -> Feature Extraction -> Object Detection

Homeland Security
- Image segmentation plays a key role in the compute pipeline when performing object detection
- Multiple algorithms:
  - Graph-based image segmentation [Felzenszwalb04]
  - Level Set [Shi05]
  - Spectral Clustering [Zelnik-Manor04]
  - Connected Component Labeling [Zhao10]

Connected Component Labeling
- Connected Component Labeling (CCL) is a good fit given the constraints of the environment
- CCL identifies neighboring segments that have similar intensities
- Potential for efficient segmentation
- Provides high-quality results

Connected Component Labeling
A lot of dependencies among neighbors!

  1 1 2 2 2
  1 2 3 3 3
  2 2 3 3 4
  2 3 4 4 4
  2 3 2 2 2

[Figure: the same 5 x 5 grid of intensities shown before and after Connected Component Labeling.]
- 7 segments, even though there are only four distinct intensities
- Groups pixels by location and intensity

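The segment count on this slide can be checked with a small sequential reference. The sketch below is not the authors' GPU algorithm; it is a plain breadth-first flood fill, and it assumes 4-connectivity (up, down, left, right), which is the neighborhood that yields 7 segments on the slide's grid. The names `GRID` and `label_components` are made up for this illustration.

```python
from collections import deque

# The 5 x 5 intensity grid from the slide.
GRID = [
    [1, 1, 2, 2, 2],
    [1, 2, 3, 3, 3],
    [2, 2, 3, 3, 4],
    [2, 3, 4, 4, 4],
    [2, 3, 2, 2, 2],
]


def label_components(image):
    """Label 4-connected regions of equal intensity; return (labels, count)."""
    rows, cols = len(image), len(image[0])
    labels = [[0] * cols for _ in range(rows)]  # 0 means "not labeled yet"
    count = 0
    for r in range(rows):
        for c in range(cols):
            if labels[r][c]:
                continue
            # Start a new segment and flood-fill it.
            count += 1
            labels[r][c] = count
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not labels[ny][nx]
                            and image[ny][nx] == image[y][x]):
                        labels[ny][nx] = count
                        queue.append((ny, nx))
    return labels, count


labels, count = label_components(GRID)
print(count)  # number of segments found under 4-connectivity
```

Every pixel's label depends on its neighbors' labels, which is the dependency chain the slide points at and the reason a naive one-thread-per-pixel mapping does not work directly.
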
How can we improve the performance of CCL?
- Exploit inherent parallelism!
- Dependencies among neighbors?
  - Stripe-based Connected Component Labeling [Zhao10]
  - Restructure the storage of labels
- Merge stripe-based approach?
  - Exploit CUDA’s Dynamic Parallelism
- Further optimizations
  - Explore the potential of using Hyper-Q

Accelerated Connected Component Labeling
- Two phases:
  - Phase 0: Find Spans
  - Phase 1: Merge Spans
[Figure: data flow across the two phases. Labels recoverable from the slide: input image (N x M), binary image, spans matrix (N x K, each pair of entries = one span), label index matrix (N x K/2), threads and child threads.]

Phase 0: Find Spans
- Each span has two elements: (y_start, y_end)
  $\text{span}_x = \{ (y_{start}, y_{end}) \mid I(x, y_{start}) = I(x, y_{start}+1) = \dots = I(x, y_{end}) \}$
- A unique label is assigned immediately
- Label matrix: a reduced intermediate matrix of labels, half the size of the span matrix

Example from the figure (binary image, N x M, not reproduced):

  Span matrix    Label matrix
  0 0 2 2        1 2
  1 2 - -        3 -
  0 0 - -        5 -

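Phase 0 can be sketched sequentially as a run-length scan of each row. This is a CPU illustration under stated assumptions, not the authors' kernel: the 3 x 3 `IMAGE` is reconstructed from the figure's span matrix, only foreground (nonzero) runs are recorded, and the initial label formula `row * slots + index + 1` is a guess chosen because it reproduces the 1, 2, 3, 5 labels in the figure. `find_spans` is a hypothetical name.

```python
# Binary image reconstructed from the figure's span matrix (an assumption).
IMAGE = [
    [1, 0, 1],
    [0, 1, 1],
    [1, 0, 0],
]


def find_spans(binary_image):
    """Return (spans, labels): per-row (y_start, y_end) runs and initial labels."""
    width = len(binary_image[0])
    # A row of width M holds at most ceil(M / 2) foreground runs, so the
    # label matrix needs that many slots per row (half the span matrix).
    slots = (width + 1) // 2
    spans, labels = [], []
    for x, row in enumerate(binary_image):
        row_spans = []
        y = 0
        while y < width:
            if row[y]:
                y_start = y
                # Extend the run while the next pixel is also foreground.
                while y + 1 < width and row[y + 1]:
                    y += 1
                row_spans.append((y_start, y))
            y += 1
        spans.append(row_spans)
        # Assumed labeling scheme: unique per (row, span index), starting at 1.
        labels.append([x * slots + i + 1 for i in range(len(row_spans))])
    return spans, labels


SPANS, LABELS = find_spans(IMAGE)
print(SPANS)
print(LABELS)
```

Each row is scanned independently of the others, so on a GPU this phase maps naturally to one thread per row, which matches the "threads" annotation in the overview figure.
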
Dynamic Parallelism
- Kepler GK110 [Whitepaper: NVIDIA’s Next Generation CUDA™ Compute Architecture: Kepler™ GK110]
- Nested parallelism
[Figure. Courtesy: NVIDIA]

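Phase 1: Merge Spans
[Figure: flowchart of the Merge Span parent kernel. Each span in the spans matrix goes through a Yes/No test; on "Yes" the label matrix receives one single update, on "No" the kernel moves on to the next span.]

Example from the figure:

  Spans matrix    Label matrix (before)    Label matrix (after)
  0 0 2 2         1 2                      1 2
  1 2 - -         3 -                      2 -
  0 0 - -         5 -                      5 -
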
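The merge step can be sketched on the CPU as follows. This is not the Dynamic Parallelism kernel from the talk: it swaps in a sequential union-find, assumes that two spans in adjacent rows belong together when their column ranges overlap (4-connectivity), and keeps the smaller label when merging. `SPANS_IN`, `LABELS_IN` and `merge_spans` are names invented for this example; the input values are the figure's span and label matrices.

```python
# Span and label matrices from the figure, with the "-" padding dropped.
SPANS_IN = [[(0, 0), (2, 2)], [(1, 2)], [(0, 0)]]
LABELS_IN = [[1, 2], [3], [5]]


def merge_spans(spans, labels):
    """Merge labels of overlapping spans in adjacent rows; return new labels."""
    parent = {lab: lab for row in labels for lab in row}

    def find(a):
        # Follow parent links to the representative, compressing the path.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            # Assumed convention: the smaller label wins.
            parent[max(ra, rb)] = min(ra, rb)

    for x in range(1, len(spans)):
        for (start, end), lab in zip(spans[x], labels[x]):
            for (p_start, p_end), p_lab in zip(spans[x - 1], labels[x - 1]):
                # Column ranges overlap, so the two spans touch vertically.
                if start <= p_end and p_start <= end:
                    union(lab, p_lab)
    return [[find(lab) for lab in row] for row in labels]


MERGED = merge_spans(SPANS_IN, LABELS_IN)
print(MERGED)
```

On the figure's data, only the span in the middle row overlaps a span above it, so exactly one label changes, consistent with the "one single update" note in the flowchart.
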
Hyper-Q
- Kepler: Hyper-Q working with CUDA streams [Whitepaper: NVIDIA’s Next Generation CUDA™ Compute Architecture: Kepler™ GK110]
[Figure. Courtesy: NVIDIA]

When should we use Hyper-Q?
- Identify kernels that have low utilization of the device
- Identify applications that can allow for concurrent kernel execution
- Two tasks:
  - Analyze the applications
  - Analyze the kernels

Accelerated Connected Component Labeling
- Resource utilization per kernel
  - Find Spans:
    - SMX activity: 27%
    - Occupancy: 0.11
  - Merge Spans:
    - SMX activity: 31%
    - Occupancy: 0.09

Accelerated Connected Component Labeling
- Exploiting Hyper-Q
[Figure: Stream 1, Stream 2, ..., Stream N feeding Hyper-Q. Each stream processes 2 frames; each frame has 512 x 512 pixels.]

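The frame-to-stream assignment in the figure amounts to simple batching. The sketch below covers only that host-side bookkeeping, not the GPU work: `batch_frames` is a hypothetical helper, and in a CUDA implementation each returned batch would be enqueued on its own stream so that Hyper-Q can schedule the low-occupancy kernels concurrently.

```python
def batch_frames(num_frames, frames_per_stream=2):
    """Split frame indices 0..num_frames-1 into one batch per stream."""
    return [
        list(range(start, min(start + frames_per_stream, num_frames)))
        for start in range(0, num_frames, frames_per_stream)
    ]


# 8 frames at 2 frames per stream gives 4 batches, one per stream.
BATCHES = batch_frames(8)
for stream_id, frames in enumerate(BATCHES, start=1):
    print(f"stream {stream_id}: frames {frames}")
```
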
Concurrent kernel execution
[Figure: timeline of Stream 1, Stream 2 and Stream 3 executing concurrently. Each stream runs Find Spans, then Merge Spans, ..., then Re-label.]

Performance Results
- Speedup of stream-based ACCL on CUDA 6.5 vs. OpenMP CCL with 8 threads on an Intel Core i7-3770K

  # Streams   # Frames   OpenMP CCL (s)   ACCL (s)   Speedup
  4           8          2.72             1.35       2.01x
  8           16         10.79            2.73       3.94x
  16          32         42.92            5.43       7.91x
  32          64         171.18           10.79      15.32x
  64          128        1020.00          21.56      47.32x

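The table can be read more closely by recomputing two derived quantities from its rounded timings: the OpenMP/ACCL ratio and the ACCL time per frame. This is only arithmetic on the published numbers; `RESULTS` and `derive` are names made up for the example. Ratios computed from the rounded times come out within a few percent of the reported speedups rather than matching them digit for digit.

```python
# (streams, frames, OpenMP CCL seconds, ACCL seconds, reported speedup)
RESULTS = [
    (4, 8, 2.72, 1.35, 2.01),
    (8, 16, 10.79, 2.73, 3.94),
    (16, 32, 42.92, 5.43, 7.91),
    (32, 64, 171.18, 10.79, 15.32),
    (64, 128, 1020.00, 21.56, 47.32),
]


def derive(results):
    """Recompute speedup and per-frame ACCL time from the rounded timings."""
    rows = []
    for streams, frames, openmp_s, accl_s, reported in results:
        rows.append({
            "streams": streams,
            "frames": frames,
            "ratio": openmp_s / accl_s,            # recomputed speedup
            "reported": reported,                  # speedup as printed
            "accl_s_per_frame": accl_s / frames,   # ACCL cost per frame
        })
    return rows


for row in derive(RESULTS):
    print(f"{row['streams']:>3} streams: ratio {row['ratio']:.2f} "
          f"(reported {row['reported']:.2f}), "
          f"{row['accl_s_per_frame']:.3f} s/frame")
```

The per-frame ACCL time stays roughly constant as streams are added, while the OpenMP time grows faster than the frame count, which is where the rising speedup comes from.
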
Conclusion
- Improved the performance of the image segmentation task for the baggage scanning problem
- Exploited NVIDIA’s Hyper-Q feature to accelerate Connected Component Labeling
- Compared an OpenMP CCL implementation with our ACCL implementation
- Our algorithm scales well as the number of streams increases
- Kernels with low occupancy are the best fit for Hyper-Q

Future work
- Combine Hyper-Q with MPI to exploit multiple grains of parallelism using multiple GPU nodes
- Evaluate additional image segmentation algorithms that address the constraints of baggage scanning

THANK YOU
- Questions?
- fninaparavecino@ece.neu.edu