High Quality Real Time Image Processing Framework on Mobile Platforms

SLIDE 1

High Quality Real Time Image Processing Framework on Mobile Platforms using Tegra K1

Eyal Hirsch

SLIDE 2

SagivTech Ltd. proprietary information - for internal use only

SagivTech Snapshot

  • Established in 2009 and headquartered in Israel
  • Core domain expertise: GPU Computing and Computer Vision
  • What we do:
    – Technology
    – Solutions
    – Projects
    – EU Research
    – Training
  • GPU expertise:
    – Hard-core optimizations
    – Efficient streaming for single or multiple GPU systems
    – Mobile GPUs

SLIDE 3

Mobile is everywhere

  • The new era of mobile

SLIDE 4

As mobile devices get smarter

  • In the beginning: I can talk from anywhere!
  • A bit later: My phone can take pictures!
  • Now:
    – Advanced camera
    – More compute power
    – Fast device – cloud communication
  • What can be done with those advancements?

SLIDE 5

Project Tango

  • Mission: Running a depth sensing technology on a mobile platform
  • Challenge: First time on NVIDIA’s Tegra K1
  • Extreme optimizations on a CPU-GPU platform to allow the device to handle other tasks in parallel
  • Expertise:
    – Mantis Vision: the algorithms
    – NVIDIA: the Tegra K1 platform
    – SagivTech: the GPU computing expertise
  • Bottom line: Depth sensing running in real time, in parallel to other compute-intensive applications!

SLIDE 6

Project Tango

Credits: http://techaeris.com

SLIDE 7

Mobile Crowdsourcing Video Scene Reconstruction

  • If you’ve been to a concert recently, you’ve probably seen how many people take videos of the event with mobile phone cameras
  • Each user has only one video, taken from one angle and location and of only moderate quality

SLIDE 8

The Idea behind SceneNet

  • Leverage the power of multiple mobile phone cameras
  • Create a high-quality 3D video experience that is sharable via social networks

SLIDE 9

Creation of the 3D Video Sequence

The scene is photographed by several people using their cell phone cameras. The video data is transmitted via the cellular network to a High Performance Computing server. Following time synchronization, resolution normalization and spatial registration, the videos are merged into a 3-D video cube.

[Figure: 3-D video cube; label: TIME]

SLIDE 10

Algorithms implemented on the TK1

  • Enabling the 3D reconstruction for SceneNet required various algorithms to run on the TK1 GPU:
    – FREAK: Fast Retina Keypoint
    – BRISK: Binary Robust Invariant Scalable Keypoints
    – DoG: Difference of Gaussians
  • The algorithms had to run in real time
  • The algorithms are building blocks for various image processing tasks

SLIDE 11

FREAK & DoG performance on the TK1

  • DoG:
    – Input: 480 x 640 RGB image
    – Output: ~32K key points
  • FREAK:
    – Input: ~32K key points, image
    – Output: descriptor per key point
  • Majority of the code runs on the GPU
  • Offloading to the GPU allows real-time processing, which is not possible on the CPU

SLIDE 12

DoG performance on the TK1

  • DoG flow:
    – Gaussian
    – DiffImage
    – Find key points
  • Total: 10.83 ms

  Kernel                          Avg. time (ms)
  Misc                            0.3
  Gaussian: Conv2D                4.8
  Gaussian: DownSampleBilinear    0.6
  DiffImage                       1.7
  FindKeyPoints                   3.43
  Total DoG                       10.83

SLIDE 13

FREAK performance on the TK1

  • FREAK flow:
    – IntegralImage
    – Extract descriptors
  • Total: 2.4 ms
  • Total DoG + FREAK: 13.23 ms

  Kernel            Avg. time (ms)
  IntegralImage     1.5
  GetDescriptors    0.9
  Total FREAK       2.4
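The IntegralImage step can be checked against a naive, single-threaded C++ reference of the kind this deck recommends. The sketch below is not SagivTech's kernel: the names `integralImage` and `boxSum`, the 8-bit grayscale input and the zero-padded table layout are assumptions made for illustration. It shows why an integral image is a natural first step before descriptor extraction: once the table is built, the sum over any rectangle costs four lookups.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Summed-area table with one extra row and column of zeros, so that
// sat[(y + 1) * (w + 1) + (x + 1)] holds the sum of all pixels in [0..x] x [0..y].
// Layout and element types are assumptions, not taken from the slides.
std::vector<std::uint32_t> integralImage(const std::vector<std::uint8_t>& img, int w, int h) {
    assert(w > 0 && h > 0 && img.size() == static_cast<std::size_t>(w) * h);
    const int sw = w + 1;
    std::vector<std::uint32_t> sat(static_cast<std::size_t>(sw) * (h + 1), 0);
    for (int y = 0; y < h; ++y) {
        std::uint32_t rowSum = 0;  // running sum of the current row
        for (int x = 0; x < w; ++x) {
            rowSum += img[y * w + x];
            sat[(y + 1) * sw + (x + 1)] = sat[y * sw + (x + 1)] + rowSum;
        }
    }
    return sat;
}

// Sum of the pixels in the half-open box [x0, x1) x [y0, y1), using four lookups.
std::uint32_t boxSum(const std::vector<std::uint32_t>& sat, int w,
                     int x0, int y0, int x1, int y1) {
    const int sw = w + 1;
    return sat[y1 * sw + x1] - sat[y0 * sw + x1] - sat[y1 * sw + x0] + sat[y0 * sw + x0];
}
```

A GPU version would typically split this into row and column prefix sums; the serial loop here is only meant as a correctness reference.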

SLIDE 14

FREAK & DoG performance on the TK1

  • 13 ms means real-time processing on the Ardbeg development board!
  • Room for more tasks to run in the background
  • Opens up possibilities for many mobile applications
  • Having real-time performance is not enough
  • Need to evaluate power consumption as well

SLIDE 15

Performance is also GFlops/WATT

SLIDE 16

Programming the TK1 GPU

  • CUDA – NVIDIA
  • OpenCL – Khronos
  • RenderScript – Developed by Google

SLIDE 17

Programming the TK1 - CUDA

  • Most rules and methods that apply to discrete cards also apply to the TK1 GPU
  • Code and libraries (such as cuFFT, cuBLAS, cuSPARSE, CUB, Thrust, etc.) should work out of the box on the TK1
  • Develop on Windows/Linux with a discrete card and then migrate to the TK1

  • Use the profiler

SLIDE 18

Programming the TK1 - OpenCL

  • Most of the tips for CUDA apply to OpenCL as well
  • Runs nicely and shows nice performance
  • Migrated the in-house bilateral filter from CUDA to OpenCL in less than a day
  • 2D separable convolution yields nice performance gains (compared to an optimized NEON implementation)

SLIDE 19

2D separable convolution on the TK1

  • Used 4 test configurations to evaluate performance:
    – Highly optimized reference library utilizing NEON (CPU)
    – SagivTech’s in-house NEON implementation (CPU), single core and 4 cores
    – SagivTech’s in-house OpenCL implementation (GPU)

  Test configuration     2K x 2K    1K x 1K
  Reference library      97         22
  ST single core NEON    99         23.5
  ST 4 cores NEON        48         10.8
  ST OpenCL              9          4
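For readers unfamiliar with the technique being benchmarked, here is a minimal single-threaded C++ sketch of a 2D separable convolution. It is not the NEON or OpenCL code measured above; the function names, the float image layout and the clamped border handling are assumptions. The point it illustrates is that a (2r+1) x (2r+1) filter that factors into two 1D kernels can be applied as a horizontal pass followed by a vertical pass, which cuts the work per pixel from (2r+1)^2 to 2(2r+1) multiply-adds.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// One 1D pass along x (horizontal == true) or along y (horizontal == false).
// Borders are clamped to the nearest valid pixel (an assumption).
// The kernel is not flipped, which makes no difference for symmetric kernels.
static std::vector<float> convolve1D(const std::vector<float>& src, int w, int h,
                                     const std::vector<float>& k, bool horizontal) {
    const int r = static_cast<int>(k.size()) / 2;
    std::vector<float> dst(src.size());
    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x) {
            float acc = 0.0f;
            for (int i = -r; i <= r; ++i) {
                const int xx = horizontal ? std::min(std::max(x + i, 0), w - 1) : x;
                const int yy = horizontal ? y : std::min(std::max(y + i, 0), h - 1);
                acc += k[i + r] * src[yy * w + xx];
            }
            dst[y * w + x] = acc;
        }
    }
    return dst;
}

// Applies the same odd-length 1D kernel horizontally, then vertically.
std::vector<float> separableConv2D(const std::vector<float>& img, int w, int h,
                                   const std::vector<float>& k) {
    assert(k.size() % 2 == 1);
    assert(img.size() == static_cast<std::size_t>(w) * h);
    return convolve1D(convolve1D(img, w, h, k, true), w, h, k, false);
}
```

Each output pixel of a pass is independent of the others, which is what makes the operation map well onto both multi-core NEON code and a GPU kernel.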

SLIDE 20

Programming the TK1 – RenderScript - 1

  • Google’s way of doing compute on a mobile platform
  • Quick CUDA to RenderScript terminology translation:
    – User manages allocations (a.k.a. buffers)
    – User manages data transfers/copies to/from allocations
    – User sets runtime parameters (a.k.a. kernel params)
    – User launches kernels, much like OpenCL/CUDA
  • Code ran on the GPU and yielded an impressive performance boost (still lags behind CUDA)
  • CUDA to RS migration is fairly easy

SLIDE 21

Programming the TK1 – RenderScript - 2

  • Google does NOT mandate which SoC component will run the RS code
  • The developer has no control over where RS code will run
  • It depends on the specific hardware, vendor, code, etc.
  • To test RS on the TK1, we locked the GPU clocks in different configurations and ran an RS sparse matrix-vector multiplication benchmark
  • The performance of the RS code under different clocks would reveal which component ran it

SLIDE 22

Programming the TK1 – RenderScript - 3

  • Sparse matrix-vector multiplication using RenderScript
  • Used 3 test configurations:
    – Naive C++ CPU code
    – SagivTech RS
    – NVIDIA’s cuSPARSE

  • RS running on GPU
  • RS shows nice performance

[Chart: Naive C++, SagivTech RS and NVIDIA cuSPARSE at GPU full, half and quarter clocks; value axis from 5 to 45, units not given]
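The "Naive C++" baseline in this benchmark is not shown in the slides. As an illustration of what such a baseline usually looks like, here is a minimal sketch of sparse matrix-vector multiplication. The compressed sparse row (CSR) storage, the `CsrMatrix` type and the `spmv` name are assumptions; the deck does not say which format was used.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Compressed sparse row (CSR) matrix. The storage format is an assumption.
struct CsrMatrix {
    int rows;
    std::vector<int> rowPtr;    // rows + 1 entries; row r spans [rowPtr[r], rowPtr[r + 1])
    std::vector<int> colIdx;    // column index of each stored value
    std::vector<float> values;  // stored non-zero values
};

// Computes y = A * x with one serial loop over the rows.
std::vector<float> spmv(const CsrMatrix& a, const std::vector<float>& x) {
    assert(a.rowPtr.size() == static_cast<std::size_t>(a.rows) + 1);
    assert(a.colIdx.size() == a.values.size());
    std::vector<float> y(a.rows, 0.0f);
    for (int r = 0; r < a.rows; ++r) {
        for (int i = a.rowPtr[r]; i < a.rowPtr[r + 1]; ++i) {
            y[r] += a.values[i] * x[a.colIdx[i]];
        }
    }
    return y;
}
```

Rows are independent, so a parallel version can assign one row (or a group of rows) to each work item; the irregular reads of `x` are what make the benchmark sensitive to the memory system.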

SLIDE 23

Programming the TK1 – Optimization tips

  • Only one SMX
  • We’ve seen cases where optimizations behave differently on the TK1 than on an equivalent discrete card (such as __ldg)
  • Try various optimizations; in some cases we got better performance when using atomics rather than shared memory
  • Always optimize on the TK1 itself, not on the discrete card used for the development phase

SLIDE 24

The future

  • Real-time image processing of even complex algorithms is achievable on the TK1
  • Easy migration from mature discrete GPU code to the new and exciting field of mobile compute
  • Maxwell is already planned for the next mobile generation, bringing more power efficiency and performance

  • It works!!

SLIDE 25

Thank You

For more information please contact Eyal Hirsch: eyal@sagivtech.com

SLIDE 26

Programming the TK1 – General tips

  • TK1 hardware compute capability (CC) is 3.2
  • The tools and compilation chain are quite different; it takes some time to get started
  • Strive to develop the CUDA code and the managing app/code on Windows/Linux using a discrete card, and then migrate to Android
  • Always keep reference code in naive, single-threaded C++ to compare against the results of the parallel algorithm
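The last tip can be made concrete with a small comparison helper. This is a sketch under assumptions: the function names and the choice of an absolute tolerance are illustrative, not taken from the slides. A tolerance is used rather than exact equality because a parallel implementation often sums floating-point values in a different order than the serial reference, so small differences are expected even when both are correct.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Largest absolute difference between the reference output and the tested output.
float maxAbsDiff(const std::vector<float>& reference, const std::vector<float>& tested) {
    assert(reference.size() == tested.size());
    float worst = 0.0f;
    for (std::size_t i = 0; i < reference.size(); ++i) {
        worst = std::max(worst, std::fabs(reference[i] - tested[i]));
    }
    return worst;
}

// True when every element of the tested output is within `tolerance` of the reference.
bool matchesReference(const std::vector<float>& reference,
                      const std::vector<float>& tested, float tolerance) {
    return maxAbsDiff(reference, tested) <= tolerance;
}
```

In practice the tested vector would be the GPU result copied back to the host, and the tolerance would be chosen per algorithm.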

SLIDE 27

Computational Photography: examples …

  • Background substitution

SLIDE 28

FREAK – Fast Retina Keypoint

  • Binary feature descriptor
  • Hamming distance matcher
  • Sampling pattern
  • Overlapping receptive fields
  • Exponential change in size
  • Rotation invariant
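Binary descriptors are compared with the Hamming distance, that is, the number of bit positions in which two descriptors differ. The following single-threaded C++ sketch shows a brute-force matcher of that kind. The `Descriptor` type (a vector of 64-bit words), the function names and the nearest-neighbour policy are assumptions for illustration; the slides do not describe the matcher that was actually used.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// A binary descriptor packed into 64-bit words. The word count is left open;
// both descriptors being compared must use the same length.
using Descriptor = std::vector<std::uint64_t>;

// Number of differing bits between two descriptors of equal length.
int hammingDistance(const Descriptor& a, const Descriptor& b) {
    assert(a.size() == b.size());
    int dist = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        std::uint64_t diff = a[i] ^ b[i];  // set bits mark the positions that differ
        while (diff != 0) {
            diff &= diff - 1;              // clear the lowest set bit
            ++dist;
        }
    }
    return dist;
}

// Index of the candidate closest to `query`, or -1 if there are no candidates.
int bestMatch(const Descriptor& query, const std::vector<Descriptor>& candidates) {
    int best = -1;
    int bestDist = std::numeric_limits<int>::max();
    for (std::size_t i = 0; i < candidates.size(); ++i) {
        const int d = hammingDistance(query, candidates[i]);
        if (d < bestDist) {
            bestDist = d;
            best = static_cast<int>(i);
        }
    }
    return best;
}
```

Because the distance reduces to XOR plus a bit count, it is cheap on both CPUs and GPUs, which is one reason binary descriptors suit mobile hardware.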

SLIDE 29

BRISK – Binary Robust Invariant Scalable Keypoints

  • Binary feature descriptor
  • Hamming distance matcher
  • Sampling pattern
  • Equally spaced in circles
  • Gaussian kernel size relative to distance from feature

SLIDE 30

DoG – Difference of Gaussians

  • Feature detector
  • Local minima/maxima of the image convolved with a difference of Gaussians
  • Acts as a 2D band-pass filter over the image
  • Enhances corners
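The detector described here can be sketched in a few dozen lines of single-threaded C++. This is a simplified, single-scale illustration and not the TK1 implementation: the two sigmas, the threshold, the 3x3 neighbourhood, the clamped borders and all function names are assumptions, and the downsampling step that appears in the timing table is left out. The image is blurred with two Gaussians of different widths, the results are subtracted, and pixels that are strict local extrema of the difference are reported as key points.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct KeyPoint { int x; int y; };

// Normalized 1D Gaussian kernel with radius ceil(3 * sigma).
static std::vector<float> gaussianKernel(float sigma) {
    const int r = static_cast<int>(std::ceil(3.0f * sigma));
    std::vector<float> k(2 * r + 1);
    float sum = 0.0f;
    for (int i = -r; i <= r; ++i) {
        k[i + r] = std::exp(-(i * i) / (2.0f * sigma * sigma));
        sum += k[i + r];
    }
    for (std::size_t i = 0; i < k.size(); ++i) k[i] /= sum;
    return k;
}

// Separable Gaussian blur (horizontal pass, then vertical pass) with clamped borders.
static std::vector<float> blur(const std::vector<float>& img, int w, int h, float sigma) {
    const std::vector<float> k = gaussianKernel(sigma);
    const int r = static_cast<int>(k.size()) / 2;
    std::vector<float> tmp(img.size()), out(img.size());
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            float acc = 0.0f;
            for (int i = -r; i <= r; ++i)
                acc += k[i + r] * img[y * w + std::min(std::max(x + i, 0), w - 1)];
            tmp[y * w + x] = acc;
        }
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            float acc = 0.0f;
            for (int i = -r; i <= r; ++i)
                acc += k[i + r] * tmp[std::min(std::max(y + i, 0), h - 1) * w + x];
            out[y * w + x] = acc;
        }
    return out;
}

// Key points = strict 3x3 local extrema of (blur(sigma1) - blur(sigma2))
// whose magnitude is at least `threshold`. Border pixels are skipped.
std::vector<KeyPoint> dogKeyPoints(const std::vector<float>& img, int w, int h,
                                   float sigma1, float sigma2, float threshold) {
    assert(img.size() == static_cast<std::size_t>(w) * h && sigma1 < sigma2);
    const std::vector<float> a = blur(img, w, h, sigma1);
    const std::vector<float> b = blur(img, w, h, sigma2);
    std::vector<float> dog(img.size());
    for (std::size_t i = 0; i < dog.size(); ++i) dog[i] = a[i] - b[i];

    std::vector<KeyPoint> pts;
    for (int y = 1; y < h - 1; ++y)
        for (int x = 1; x < w - 1; ++x) {
            const float v = dog[y * w + x];
            if (std::fabs(v) < threshold) continue;
            bool isMax = true, isMin = true;
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx) {
                    if (dx == 0 && dy == 0) continue;
                    const float n = dog[(y + dy) * w + (x + dx)];
                    if (n >= v) isMax = false;
                    if (n <= v) isMin = false;
                }
            if (isMax || isMin) pts.push_back({x, y});
        }
    return pts;
}
```

Subtracting the wider blur from the narrower one removes both the flat regions (low frequencies) and the finest noise (high frequencies), which is the band-pass behaviour the slide refers to.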