  1. Adaptable Two-Dimension Sliding Windows on NVIDIA GPUs with Runtime Compilation
     Nicholas Moore and Miriam Leeser, Dept. of Electrical and Computer Engineering, Northeastern University, Boston, MA
     Laurie Smith King, Dept. of Mathematics and Computer Science, College of the Holy Cross, Worcester, MA

  2. Motivation
     ● GPUs offer significant performance potential
     ● GPU development is difficult
     ● Complicated target with changes over time
     ● Leads to problem-specific non-reusable code
     ● Affects library developers and users
     ● Goal: more adaptable kernel implementations
     ● Case study: template matching application
     ● Technique: problem-specific kernel compilation

  3. Template Matching (1)
     ● Real-world tumor tracking application
     ● Ying Cui, Jennifer Dy, Gregory Sharp, Brian Alexander, and Steve Jiang
     ● Visual tracking of tumor
     ● Focused radiotherapy
     ● Tumor moves during breathing
     Y. Cui, J. G. Dy, G. C. Sharp, B. Alexander, and S. B. Jiang, "Multiple Template Based Fluoroscopic Tracking of Lung Tumor Mass without Implanted Fiducial Markers," Physics in Medicine and Biology, Vol. 52, pp. 6229-6242, 2007.

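  4. Template Matching (2)
     [Figure: the incoming frame is matched against Template 1 through Template N; each match yields a score and location pair (S1, L1), (S2, L2), ..., (SN, LN), and a voting stage combines them into the output location.]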

  5. corr2()
     $$\mathrm{corr2}(A,B) = \frac{\sum_M \sum_N (A_{MN} - \bar{A})(B_{MN} - \bar{B})}{\sqrt{\left(\sum_M \sum_N (A_{MN} - \bar{A})^2\right)\left(\sum_M \sum_N (B_{MN} - \bar{B})^2\right)}}$$
     ● Sliding window template matching
     ● Pearson's correlation for similarity score
     ● Floating-point data
     ● Templates and frames pre-processed

  6. Computation Reduction
     $$\mathrm{corr2}(A,B) = \frac{\sum_M \sum_N (A_{MN} - \bar{A})(B_{MN} - \bar{B})}{\sqrt{\left(\sum_M \sum_N (A_{MN} - \bar{A})^2\right)\left(\sum_M \sum_N (B_{MN} - \bar{B})^2\right)}}$$
     ● Template data (A)
     ● Not expected to be separable
     ● Fixed for given template

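  7. Computation Reduction
     $$\mathrm{corr2}(A,B) = \frac{\sum_M \sum_N A^{C}_{MN}\,(B_{MN} - \bar{B})}{\sqrt{A^{D} \sum_M \sum_N (B_{MN} - \bar{B})^2}}$$
     Here $A^{C}_{MN} = A_{MN} - \bar{A}$ and $A^{D} = \sum_M \sum_N (A_{MN} - \bar{A})^2$ depend only on the template, so they can be computed once per template.
     ● Template data (A)
     ● Not expected to be separable
     ● Fixed for given template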

  8. Computation Reduction
     $$\mathrm{corr2}(A,B) = \frac{\sum_M \sum_N A^{C}_{MN}\,(B_{MN} - \bar{B})}{\sqrt{A^{D} \sum_M \sum_N (B_{MN} - \bar{B})^2}}$$
     ● ROI data (B)
     ● Dependent on window location and frame
     ● Subtraction complicates frequency domain
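The reduction on the preceding slides can be illustrated with a small sketch. It is plain Python rather than the authors' CUDA code, and the names `precompute_template`, `a_c`, and `a_d` are hypothetical: they stand for the per-template terms labelled C and D on the slides, which are assumed here to be the mean-centered template and its sum of squares.

```python
import math

def corr2_direct(a, b):
    """Full corr2: Pearson correlation of two equally sized 2-D lists."""
    n = len(a) * len(a[0])
    mean_a = sum(map(sum, a)) / n
    mean_b = sum(map(sum, b)) / n
    num = sum((x - mean_a) * (y - mean_b)
              for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    den_a = sum((x - mean_a) ** 2 for row in a for x in row)
    den_b = sum((y - mean_b) ** 2 for row in b for y in row)
    return num / math.sqrt(den_a * den_b)

def precompute_template(a):
    """One-time work per template (assumed meaning of the C and D terms)."""
    n = len(a) * len(a[0])
    mean_a = sum(map(sum, a)) / n
    a_c = [[x - mean_a for x in row] for row in a]   # centered template
    a_d = sum(x * x for row in a_c for x in row)     # template sum of squares
    return a_c, a_d

def corr2_reduced(a_c, a_d, b):
    """Per-window work: only the terms that depend on the ROI window b."""
    n = len(b) * len(b[0])
    mean_b = sum(map(sum, b)) / n                    # window average
    num = sum(x * (y - mean_b)
              for ra, rb in zip(a_c, b) for x, y in zip(ra, rb))
    den_b = sum((y - mean_b) ** 2 for row in b for y in row)
    return num / math.sqrt(a_d * den_b)

template = [[1.0, 2.0], [3.0, 5.0]]
window = [[2.0, 4.0], [6.0, 11.0]]
a_c, a_d = precompute_template(template)
direct = corr2_direct(template, window)
reduced = corr2_reduced(a_c, a_d, window)
```

Both functions return the same score for a given window; the reduced form simply moves the template-only arithmetic out of the per-window loop.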

  9. Reference Data Sets
     Patient   Templates   Template Size (pixels)   Shift ±V/±H (pixels)
     1         12          53×54                    18/9
     2         13          23×21                    11/5
     3         10          76×45                    9/4
     4         11          156×116                  9/3
     5         12          86×78                    11/6
     6         14          141×107                  9/2
     ● Large templates
     ● Significant variation in dimensions
     ● Small search with single ROI per frame
     ● Different part of the problem space

  10. Convolution Implementations
      ● Kong et al. (GPGPU 2010)
        ● Template stored in shared memory
        ● Only 7×7 kernels presented
      ● NVIDIA Performance Primitives
        ● Only supports uint8
      ● AccelerEyes Jacket
        ● Last documented version supports arbitrary kernels up to 5×5, square kernels to 10×10
      ● OpenCV
        ● Supports single precision floating point
        ● Non-separable templates stored in constant memory

  11. CUDA Mapping Complications
      ● Common correlation case:
        ● Small template
        ● Large image with many window locations
      ● Template matching application:
        ● Templates too large to use shared or constant memory
        ● Few sources of parallelism
          – Few templates (10 to 14)
          – Relatively small ROI (95 to 703 positions)
          – Single ROI per frame
        ● Problem parameters vary between patients
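The "95 to 703 positions" range is consistent with the shift values on the Reference Data Sets slide, assuming a shift of ±V/±H means that every integer offset from -V to +V and from -H to +H is a candidate window location. A short sketch of that arithmetic (the function name `window_positions` is illustrative):

```python
def window_positions(shift_v, shift_h):
    """Number of window locations for shifts of +/-shift_v and +/-shift_h pixels."""
    return (2 * shift_v + 1) * (2 * shift_h + 1)

# Shift values (V, H) per patient, taken from the Reference Data Sets slide.
shifts = {1: (18, 9), 2: (11, 5), 3: (9, 4), 4: (9, 3), 5: (11, 6), 6: (9, 2)}
counts = {patient: window_positions(v, h) for patient, (v, h) in shifts.items()}
```

Under this assumption the smallest search is Patient 6 with 95 positions and the largest is Patient 1 with 703, which matches the range quoted on the slide.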

  12. CUDA Mapping Solution
      ● Tiling of the template
        ● Reduces local working set size
        ● More independent parallelism
      ● Problem-specific kernel compilation
        ● Adaptability without performance impact

  13. CUDA Implementation
      $$\mathrm{corr2}(A,B) = \frac{\sum_M \sum_N A^{C}_{MN}\,(B_{MN} - \bar{B})}{\sqrt{A^{D} \sum_M \sum_N (B_{MN} - \bar{B})^2}}$$
      ● Multiple pass implementation
      ● Average, denominator, and numerator similar
      ● Outer loops are all addition

  14. Tiled Template (1)
      ● Tile and process sub-templates separately
      ● More parallelism
      ● Reduces working set size to fit in shared memory
      ● Tiles mapped across CUDA grid
      ● Scales to arbitrary template sizes
      [Figure: template divided into main tiles, right tiles, bottom tiles, and a corner tile.]

  15. Tiled Template (2)
      ● Efficient tile size may not match problem
      ● corr2() complicates padding
      ● Varying template size per block
      [Figure: template divided into main tiles, right tiles, bottom tiles, and a corner tile.]
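The four tile regions named in the figure (main, right, bottom, corner) follow from integer division of the template dimensions by the tile dimensions. The sketch below is an illustration rather than the authors' code; it assumes the first number of a size such as 156×116 is divided by the first number of the tile size, and the function name `tile_regions` is hypothetical.

```python
def tile_regions(width, height, tile_w, tile_h):
    """Map region name -> ((tile width, tile height), number of tiles)."""
    full_cols, rem_w = divmod(width, tile_w)
    full_rows, rem_h = divmod(height, tile_h)
    regions = {"main": ((tile_w, tile_h), full_cols * full_rows)}
    if rem_w:
        # Partial-width tiles along the right edge, one per full row.
        regions["right"] = ((rem_w, tile_h), full_rows)
    if rem_h:
        # Partial-height tiles along the bottom edge, one per full column.
        regions["bottom"] = ((tile_w, rem_h), full_cols)
    if rem_w and rem_h:
        # Single tile where the right and bottom edges meet.
        regions["corner"] = ((rem_w, rem_h), 1)
    return regions

patient4_large = tile_regions(156, 116, 16, 10)
patient4_small = tile_regions(156, 116, 4, 4)
```

For the 156×116 template, 4×4 tiles divide both dimensions evenly and only the main region remains, while 16×10 tiles produce all four regions. Because the edge regions have different dimensions from the main tiles, each region needs its own loop bounds.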

  16. Experimental Setup
      ● Benchmarked tile sizes from 4×4 to 16×16
      ● Compared against
        ● MATLAB and pthreads-based C application
        ● Both used constant template optimization
      ● Benchmarking
        ● Intel Xeon W3580 (4 Nehalem cores @ 3.33 GHz, 6MB L2)
        ● NVIDIA GeForce GTX 480 (Fermi) with CUDA 3.2
        ● 64-bit Linux (GCC 4.4.3) and MATLAB R2010a

  17. Performance
      ● Good performance across patients
      ● Steady-state streaming
      ● Includes data transfer
      GPU vs CPU:
      Patient          1       2       3       4         5       6
      Template Size    53×54   23×21   76×45   156×116   86×78   141×107
      Best Tile Size   16×2    4×4     8×8     16×10     8×8     16×10
      Total Speedup    7.80    1.57    8.48    12.67     12.50   14.78

  18. Tile Size Selection (1)
      ● Trade-off between efficiency and parallelism
      ● Limited execution hardware
      ● Patient 2 (23×21 template, best tile size 4×4, total speedup 1.57)
        ● Small tiles for more parallelism

  19. Tile Size Selection (2)
      ● Trade-off between efficiency and parallelism
      ● Limited execution hardware
      ● Patient 4 (156×116 template, best tile size 16×10, total speedup 12.67)
        ● 4×4 tiles result in no edge cases
        ● Larger 16×10 tiles generate enough parallelism
          – 16×6, 12×10, and 12×6 edge tiles

  20. CUDA Adaptability
      ● Adaptability may affect performance
        ● Compile-time optimizations not possible
          – Loop unrolling
          – Strength reduction (esp. % or /)
        ● Increased resource usage
      ● Mitigate issues with problem-specific kernel compilation

  21. Problem-Specific Kernel Compilation (PSKC)
      ● No C-level source compilation in CUDA API
      ● Productivity and portability vs. PTX
      ● Framework for runtime compilation
      ● Part of larger set of GPU host-code abstractions
      ● Automates compilation and loading of modules
      ● nvcc called at runtime
      ● Kernels written in terms of unspecified compile-time constants
      ● -D flag used to set parameters
      ● Overhead acceptable: one time setup, then streaming
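The slides do not show the framework itself, so the following is only a rough sketch of how a host program might assemble an nvcc invocation that fixes kernel parameters with -D. The file names and macro names (`TILE_W`, `TILE_H`, `TEMPLATE_W`, `TEMPLATE_H`) are hypothetical, and compiling to PTX with `-ptx` is an assumption; the authors' framework may produce a different module format.

```python
def nvcc_command(source, output, params):
    """Build an nvcc command line that sets compile-time constants with -D."""
    cmd = ["nvcc", "-ptx", source, "-o", output]
    # One -DNAME=VALUE flag per problem or implementation parameter.
    cmd += [f"-D{name}={value}" for name, value in sorted(params.items())]
    return cmd

# Hypothetical parameter set for Patient 4 with 16x10 tiles.
cmd = nvcc_command(
    "corr2_kernel.cu",
    "corr2_kernel.ptx",
    {"TILE_W": 16, "TILE_H": 10, "TEMPLATE_W": 156, "TEMPLATE_H": 116},
)
# The list could be passed to subprocess.run(cmd, check=True) on a machine
# with the CUDA toolkit installed; it is not executed here.
```

Because the parameters are fixed before the kernel is compiled, the kernel source can use them as ordinary constants, and the compilation cost is paid once during setup rather than per frame.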

  22. PSKC: Current Benefits
      ● Loop unrolling for all tile regions
      ● Instantiation of separate computation loops with C++ templates
      ● Strength reduction
      ● Bit-wise offset calculations
      ● Instance & implementation parameter values inlined
      ● Register usage reduction
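As an illustration of the strength reduction mentioned above, a division and modulo by a tile width can be replaced by a shift and a mask once the width is known to be a power of two. The slides do not show the actual offset calculations, so the sketch below (with hypothetical function names) only demonstrates the equivalence a compiler can exploit when the value is a compile-time constant.

```python
def split_index_generic(i, tile_w):
    """Row and column of linear index i, for a tile width known only at run time."""
    return i // tile_w, i % tile_w

def split_index_pow2(i, log2_tile_w):
    """Same result using a shift and a mask when the tile width is 2**log2_tile_w."""
    return i >> log2_tile_w, i & ((1 << log2_tile_w) - 1)

# Check the equivalence for a 16-wide tile (log2 of 16 is 4).
matches = all(split_index_generic(i, 16) == split_index_pow2(i, 4)
              for i in range(1024))
```

With a run-time tile width the kernel must execute the general division and modulo; with the width supplied at compile time, the cheaper form can be selected automatically.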

  23. Conclusions
      ● Tiled implementation allows for processing of large templates
        ● Better usage of fast memories
        ● Better performance through better parallelism
      ● Problem-specific kernel compilation supports adaptability at runtime
        ● Loop unrolling, strength reduction, efficient register usage
      ● Future work: ability to adapt to both problem and hardware
        ● Problem and implementation parameterization
          – Applications: particle image velocimetry
          – Different GPUs
        ● PSKC: quantify benefits and explore limits

  24. Thank You
      Nicholas Moore: nmoore@coe.neu.edu
      Miriam Leeser: mel@coe.neu.edu

  25. Performance Breakdown
