Adaptable Two-Dimension Sliding Windows on NVIDIA GPUs with Runtime Compilation


  1. Adaptable Two-Dimension Sliding Windows on NVIDIA GPUs with Runtime Compilation. Nicholas Moore and Miriam Leeser, Dept. of Electrical and Computer Engineering, Northeastern University, Boston, MA; Laurie Smith King, Dept. of Mathematics and Computer Science, College of the Holy Cross, Worcester, MA. Supported by [sponsor logos]

  2. Motivation ● GPUs offer significant performance potential ● GPU development is difficult ● Complicated target with changes over time ● Leads to problem-specific non-reusable code ● Affects library developers and users ● Goal: more adaptable kernel implementations ● Case study: template matching application ● Technique: problem-specific kernel compilation

  3. Template Matching (1) ● Real-world tumor tracking application ● Ying Cui, Jennifer Dy, Gregory Sharp, Brian Alexander, and Steve Jiang ● Visual tracking of the tumor ● Focused radiotherapy ● Tumor moves during breathing ● Y. Cui, J. G. Dy, G. C. Sharp, B. Alexander, and S. B. Jiang, "Multiple Template Based Fluoroscopic Tracking of Lung Tumor Mass without Implanted Fiducial Markers," Physics in Medicine and Biology, Vol. 52, pp. 6229–6242, 2007.

  4. Template Matching (2) ● [Figure: each incoming frame is matched against templates 1 through N; each match produces a score and location pair (S1, L1) … (SN, LN), and a voting step combines them into a single reported location]

  5. corr2()
  corr2(A, B) = \frac{\sum_M \sum_N (A_{MN} - \bar{A})(B_{MN} - \bar{B})}{\sqrt{\left(\sum_M \sum_N (A_{MN} - \bar{A})^2\right)\left(\sum_M \sum_N (B_{MN} - \bar{B})^2\right)}}
  ● Sliding window template matching ● Pearson's correlation for similarity score ● Floating-point data ● Templates and frames pre-processed

  6. Computation Reduction
  corr2(A, B) = \frac{\sum_M \sum_N (A_{MN} - \bar{A})(B_{MN} - \bar{B})}{\sqrt{\left(\sum_M \sum_N (A_{MN} - \bar{A})^2\right)\left(\sum_M \sum_N (B_{MN} - \bar{B})^2\right)}}
  ● Template data (A) ● Not expected to be separable ● Fixed for a given template

  7. Computation Reduction
  corr2(A, B) = \frac{\sum_M \sum_N C^A_{MN} (B_{MN} - \bar{B})}{\sqrt{D_A \sum_M \sum_N (B_{MN} - \bar{B})^2}}, \quad C^A_{MN} = A_{MN} - \bar{A}, \quad D_A = \sum_M \sum_N (A_{MN} - \bar{A})^2
  ● Template data (A) ● Not expected to be separable ● Fixed for a given template, so C^A and D_A are precomputed once per template

  8. Computation Reduction
  corr2(A, B) = \frac{\sum_M \sum_N C^A_{MN} (B_{MN} - \bar{B})}{\sqrt{D_A \sum_M \sum_N (B_{MN} - \bar{B})^2}}
  ● ROI data (B) ● Dependent on window location and frame ● Per-window subtraction of \bar{B} complicates a frequency-domain approach
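  As a concrete reference, below is a minimal untiled CUDA sketch of the reduced formula: one thread scores one window position. The names (corr2_reduced, C_A, roiMeans) are illustrative, not the paper's tiled multi-pass implementation (slides 13–15), and it assumes the per-window means of B were precomputed.

  // Hypothetical reference kernel: one thread per window position.
  // C_A holds the precomputed mean-subtracted template (A - mean(A));
  // dA is the precomputed template denominator sum_{M,N} (A - mean(A))^2.
  __global__ void corr2_reduced(const float *frame, int frameW,
                                const float *C_A, int tmplH, int tmplW,
                                float dA, const float *roiMeans,
                                float *scores, int roiH, int roiW)
  {
      int col = blockIdx.x * blockDim.x + threadIdx.x;
      int row = blockIdx.y * blockDim.y + threadIdx.y;
      if (row >= roiH || col >= roiW) return;

      float bMean = roiMeans[row * roiW + col];   // mean of B for this window
      float num = 0.0f, dB = 0.0f;
      for (int m = 0; m < tmplH; ++m) {
          for (int n = 0; n < tmplW; ++n) {
              float b = frame[(row + m) * frameW + (col + n)] - bMean;
              num += C_A[m * tmplW + n] * b;      // numerator term
              dB  += b * b;                       // ROI denominator term
          }
      }
      scores[row * roiW + col] = num / sqrtf(dA * dB);
  }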

  9. Reference Data Sets
  Patient   Templates   Template Size (pixels)   Shift ±V/±H (pixels)
  1         12          53×54                    18/9
  2         13          23×21                    11/5
  3         10          76×45                    9/4
  4         11          156×116                  9/3
  5         12          86×78                    11/6
  6         14          141×107                  9/2
  ● Large templates ● Significant variation in dimensions ● Small search with a single ROI per frame ● A different part of the problem space

  10. Convolution Implementations ● Kong et al. (GPGPU 2010) ● Template stored in shared memory ● Only 7×7 kernels presented ● NVIDIA Performance Primitives ● Only supports uint8 ● AccelerEyes Jacket ● Last documented version supports arbitrary kernels up to 5×5, square kernels up to 10×10 ● OpenCV ● Supports single-precision floating point ● Non-separable templates stored in constant memory

  11. CUDA Mapping Complications ● Common correlation case: ● Small template ● Large image with many window locations ● Template matching application: ● Templates too large to use shared or constant memory ● Few sources of parallelism – Few templates (10 to 14) – Relatively small ROI (95 to 703 positions) – Single ROI per frame ● Problem parameters vary between patients

  12. CUDA Mapping Solution ● Tiling of the template ● Reduces local working set size ● More independent parallelism ● Problem-specific kernel compilation ● Adaptability without performance impact

  13. CUDA Implementation
  corr2(A, B) = \frac{\sum_M \sum_N C^A_{MN} (B_{MN} - \bar{B})}{\sqrt{D_A \sum_M \sum_N (B_{MN} - \bar{B})^2}}
  ● Multiple-pass implementation ● Average, denominator, and numerator passes are similar ● Outer loops are all addition
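  Concretely, the three similar per-window quantities, each an outer double sum (all additions, so each is amenable to the same parallel structure):

  \bar{B} = \frac{1}{MN} \sum_M \sum_N B_{MN}, \qquad
  \sum_M \sum_N (B_{MN} - \bar{B})^2, \qquad
  \sum_M \sum_N C^A_{MN} (B_{MN} - \bar{B})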

  14. Tiled Template (1) ● Tile and process sub-templates separately ● More parallelism ● Reduces working set size to fit in shared memory ● Tiles mapped across the CUDA grid, with per-tile partial results combined by addition (see the sketch below) ● Scales to arbitrary template sizes ● [Figure: template decomposed into main tiles plus right, bottom, and corner edge tiles]
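  A hypothetical combining pass (names illustrative, not the paper's exact code): because the outer loops are all addition, each tile's contribution can be computed independently and then summed per window position.

  // Hypothetical combining pass: per-tile partial sums from the tiled
  // kernels are added together per window position.
  __global__ void sumTilePartials(const float *partials, int numTiles,
                                  int numWindows, float *totals)
  {
      int w = blockIdx.x * blockDim.x + threadIdx.x;
      if (w >= numWindows) return;

      float total = 0.0f;
      for (int t = 0; t < numTiles; ++t)
          total += partials[t * numWindows + w];  // this tile's contribution
      totals[w] = total;
  }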

  15. Tiled Template (2) ● The most efficient tile size may not evenly divide the template ● corr2() complicates padding ● Varying template (tile) size per block handles the edges (see the helper sketched below) ● [Figure: same tiling, with the right, bottom, and corner edge tiles highlighted]
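  One way to derive the edge-tile dimensions, as a host-side sketch; the TileGrid struct and makeTileGrid function are hypothetical names, not from the paper:

  // Hypothetical host-side helper: decompose a template into full-size main
  // tiles plus right, bottom, and corner edge tiles of leftover size.
  struct TileGrid {
      int mainRows, mainCols;  // counts of full tileH x tileW tiles
      int rightW, bottomH;     // leftover edge-tile width/height (0 if none)
  };

  TileGrid makeTileGrid(int tmplH, int tmplW, int tileH, int tileW)
  {
      TileGrid g;
      g.mainRows = tmplH / tileH;
      g.mainCols = tmplW / tileW;
      g.rightW   = tmplW % tileW;  // right tiles are tileH x rightW
      g.bottomH  = tmplH % tileH;  // bottom tiles are bottomH x tileW
      return g;                    // corner tile is bottomH x rightW
  }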

  16. Experimental Setup ● Benchmarked tile sizes from 4×4 to 16×16 ● Compared against ● MATLAB and pthreads-based C application ● Both used constant template optimization ● Benchmarking ● Intel Xeon W3580 (4 Nehalem cores @ 3.33 GHz, 6MB L2) ● NVIDIA GeForce GTX 480 (Fermi) with CUDA 3.2 ● 64-bit Linux (GCC 4.4.3) and MATLAB R2010a

  17. Performance ● Good performance across patients ● Steady-state streaming ● Includes data transfer
  GPU vs CPU:
  Patient          1       2       3       4        5       6
  Template Size    53×54   23×21   76×45   156×116  86×78   141×107
  Best Tile Size   16×2    4×4     8×8     16×10    8×8     16×10
  Total Speedup    7.80    1.57    8.48    12.67    12.50   14.78

  18. Tile Size Selection (1) ● Trade-off between efficiency and parallelism ● Limited execution hardware ● Patient 2: small tiles for more parallelism
  Patient          1       2       3       4        5       6
  Template Size    53×54   23×21   76×45   156×116  86×78   141×107
  Best Tile Size   16×2    4×4     8×8     16×10    8×8     16×10
  Total Speedup    7.80    1.57    8.48    12.67    12.50   14.78

  19. Tile Size Selection (2) ● Trade-off between efficiency and parallelism ● Limited execution hardware ● Patient 4: 4×4 tiles result in no edge cases, but the larger 16×10 tiles still generate enough parallelism – with 16×6, 12×16, and 12×6 edge tiles
  Patient          1       2       3       4        5       6
  Template Size    53×54   23×21   76×45   156×116  86×78   141×107
  Best Tile Size   16×2    4×4     8×8     16×10    8×8     16×10
  Total Speedup    7.80    1.57    8.48    12.67    12.50   14.78

  20. CUDA Adaptability ● Adaptability may affect performance ● Compile-time optimizations not possible – Loop unrolling – Strength reduction (esp. % or /) ● Increased resource usage ● Mitigate these issues with problem-specific kernel compilation

  21. Problem-Specific Kernel Compilation (PSKC) ● No C-level source compilation in the CUDA API ● Productivity and portability vs. PTX ● Framework for runtime compilation ● Part of a larger set of GPU host-code abstractions ● Automates compilation and loading of modules ● nvcc called at runtime (sketched below) ● Kernels written in terms of unspecified compile-time constants ● -D flag used to set parameters ● Overhead acceptable: one-time setup, then streaming
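  A minimal host-side sketch of this flow using the CUDA driver API; the file names, macro names, and compileCorr2 helper are hypothetical, and cuInit/context setup plus error checking are omitted for brevity:

  #include <cuda.h>
  #include <cstdio>
  #include <cstdlib>

  // Hypothetical PSKC helper: invoke nvcc at runtime with problem parameters
  // baked in as -D compile-time constants, then load the resulting module.
  CUfunction compileCorr2(int tmplH, int tmplW, int tileH, int tileW)
  {
      char cmd[512];
      snprintf(cmd, sizeof(cmd),
               "nvcc -cubin -o corr2.cubin "
               "-DTMPL_H=%d -DTMPL_W=%d -DTILE_H=%d -DTILE_W=%d corr2.cu",
               tmplH, tmplW, tileH, tileW);
      if (system(cmd) != 0)
          return nullptr;                        // compilation failed

      CUmodule mod;
      CUfunction fn;
      cuModuleLoad(&mod, "corr2.cubin");         // load the freshly built cubin
      cuModuleGetFunction(&fn, mod, "corr2Kernel");
      return fn;                                 // one-time cost, then stream
  }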

  22. PSKC: Current Benefits ● Loop unrolling for all tile regions (see the fragment below) ● Instantiation of separate computation loops with C++ templates ● Strength reduction ● Bit-wise offset calculations ● Instance and implementation parameter values inlined ● Register usage reduction
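  A hypothetical kernel fragment (tileSum, launched with one thread per block for clarity) illustrating the effect: because TILE_H and TILE_W arrive via -D, nvcc can fully unroll both loops and strength-reduce the per-element index arithmetic.

  // TILE_H and TILE_W are compile-time constants supplied via nvcc -D.
  __global__ void tileSum(const float *frame, int frameW, float *out)
  {
      int x = blockIdx.x * TILE_W;   // tile origin; one tile per block
      int y = blockIdx.y * TILE_H;
      float acc = 0.0f;

      #pragma unroll
      for (int m = 0; m < TILE_H; ++m)
          #pragma unroll
          for (int n = 0; n < TILE_W; ++n)
              // after full unrolling, the compiler can reduce the address
              // math (y+m)*frameW + (x+n) to simple constant-stride adds
              acc += frame[(y + m) * frameW + (x + n)];

      out[blockIdx.y * gridDim.x + blockIdx.x] = acc;  // one partial per tile
  }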

  23. Conclusions ● Tiled implementation allows for processing of large templates ● Better usage of fast memories ● Better performance through better parallelism ● Problem-specific kernel compilation supports adaptability at runtime ● Loop unrolling, strength reduction, efficient register usage ● Future work: ability to adapt to both problem and hardware ● Problem and implementation parameterization – Applications: particle image velocimetry – Different GPUs ● PSKC: quantify benefits and explore limits

  24. Thank You Nicholas Moore: nmoore@coe.neu.edu Miriam Leeser: mel@coe.neu.edu Supported by

  25. Performance Breakdown
