

SLIDE 1

April 4-7, 2016 | Silicon Valley

Yogesh Kini, GTC 2016

CUDA ON MOBILE

SLIDE 2

ABSTRACT

  • Typical pipeline
  • CUDA interop APIs
  • Unified Memory on Tegra

SLIDE 3

TYPICAL USE CASES

  • Automobiles: autonomous cars
  • Embedded: drones, robots, smart surveillance
  • Mobile devices: consoles, tablets

SLIDE 4

TYPICAL PIPELINE

CAPTURE → PROCESS → DISPLAY

[Diagram: Camera sensor → ISP/DSP → CUDA → Graphics/Display → Actuators]

SLIDE 5

CUDA OPENGL(ES) INTEROP

SLIDE 6

CUDA–OPENGL(ES)

  • Provides access to OpenGL ES resources in CUDA
  • Support for EGL
  • Supported on Android, L4T, Vibrante Linux, QNX
  • Implicit synchronization support
  • Useful for graphics applications and games
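The interop flow above can be sketched with the CUDA runtime graphics-interop API. This is a minimal illustration, not the presentation's own code: it assumes a GL context is current, that `vbo` was created with glGenBuffers()/glBufferData(), and the kernel name is made up for the example.

```cuda
// Sketch: mapping an OpenGL ES buffer object into CUDA (runtime API).
#include <cuda_gl_interop.h>

__global__ void scale_kernel(float *data, int n, float s)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= s;
}

void process_vbo(GLuint vbo, int n)
{
    cudaGraphicsResource_t res;
    // Register once per buffer lifetime.
    cudaGraphicsGLRegisterBuffer(&res, vbo, cudaGraphicsRegisterFlagsNone);

    // Map before CUDA touches the buffer.
    cudaGraphicsMapResources(1, &res, 0);
    float *dptr = NULL;
    size_t nbytes = 0;
    cudaGraphicsResourceGetMappedPointer((void **)&dptr, &nbytes, res);

    scale_kernel<<<(n + 255) / 256, 256>>>(dptr, n, 2.0f);

    // Unmapping hands the buffer back to OpenGL ES; this boundary is
    // where the implicit synchronization mentioned above takes effect.
    cudaGraphicsUnmapResources(1, &res, 0);
    cudaGraphicsUnregisterResource(res);
}
```

Register/unregister once, map/unmap per frame: mapping is the cheap per-frame operation, registration is not.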

SLIDE 7

EGL IMAGE INTEROP

SLIDE 8

EGL IMAGE

Sources for an EGL image:

  • GStreamer
  • OpenGL ES
  • OpenMAX
  • Android GraphicBuffer

Khronos EGL_KHR_image_base spec:

https://www.khronos.org/registry/egl/extensions/KHR/EGL_KHR_image_base.txt

SLIDE 9

EGL IMAGE

An EGLImage is registered with cuGraphicsEGLRegisterImage(); the registered resource can then be mapped as a CUDA device pointer with cuGraphicsResourceGetMappedPointer() or as a cudaArray with cuGraphicsResourceGetMappedArray().

Usage pattern: begin and end resource usage in the other API, synchronize, then begin resource usage in CUDA, run the CUDA code, and end resource usage in CUDA.
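The registration step can be sketched with the CUDA driver API. This is an illustrative fragment under assumptions: `eglImage` was already created with eglCreateImageKHR() from one of the sources listed earlier, and all use of the image in the producing API has finished.

```cuda
// Sketch (driver API): registering an existing EGLImage for CUDA access.
#include <cudaEGL.h>

CUgraphicsResource res;
CUeglFrame frame;

// Register the EGLImage as a CUDA graphics resource.
cuGraphicsEGLRegisterImage(&res, eglImage, CU_GRAPHICS_MAP_RESOURCE_FLAGS_NONE);

// Retrieve the frame description: pitch-linear images come back as
// device pointers, block-linear images as CUDA arrays.
cuGraphicsResourceGetMappedEglFrame(&frame, res, 0, 0);

// ... launch kernels on frame.frame.pPitch[0] (pitch-linear) or use
//     frame.frame.pArray[0] (block-linear) ...

cuCtxSynchronize();                 // finish CUDA work before handing back
cuGraphicsUnregisterResource(res);  // release the registration
```

YUV planar images expose one entry per plane in the CUeglFrame, which is what makes this path usable for camera/ISP output.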

SLIDE 10

EGL STREAMS INTEROP

SLIDE 11

EGL STREAMS

  • Producer-consumer architecture
  • EGL streams spec: https://www.khronos.org/registry/egl/extensions/KHR/EGL_KHR_stream.txt
  • Implicit synchronization
  • Cross-process support
  • Supports YUV formats

[Diagram: ISP producer → EGL stream → CUDA consumer (CUDA, cuDNN, cuBLAS, VisionWorks); CUDA producer → EGL stream → OpenGL consumer]

SLIDE 12

EGL STREAMS

A frame travels between a CUDA producer and a CUDA consumer over an EGL stream:

  1. Both ends connect: cuEGLStreamProducerConnect() on the producer side, cuEGLStreamConsumerConnect() on the consumer side.
  2. The producer presents a frame with cuEGLStreamProducerPresentFrame(frame).
  3. The consumer acquires it with cuEGLStreamConsumerAcquireFrame(frame), uses it in CUDA, then releases it with cuEGLStreamConsumerReleaseFrame(frame).
  4. The producer reclaims the frame with cuEGLStreamProducerReturnFrame(frame).
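The consumer side of the steps above can be sketched as a driver-API loop. This is an assumed-context illustration: the EGLStreamKHR, its producer end, and the CUstream are created elsewhere, and the timeout value is arbitrary.

```cuda
// Sketch (driver API): CUDA as an EGLStream consumer, e.g. receiving
// camera/ISP frames.
#include <cudaEGL.h>

void consume(EGLStreamKHR eglStream, CUstream cudaStream)
{
    CUeglStreamConnection conn;
    cuEGLStreamConsumerConnect(&conn, eglStream);              // step 1

    for (;;) {
        CUgraphicsResource res;
        // Waits until the producer has presented a frame (steps 2-3).
        if (cuEGLStreamConsumerAcquireFrame(&conn, &res, &cudaStream,
                                            /*timeout us*/ 16000) != CUDA_SUCCESS)
            break;

        CUeglFrame frame;
        cuGraphicsResourceGetMappedEglFrame(&frame, res, 0, 0);
        // ... process frame.frame.pPitch[0] with a CUDA kernel on cudaStream ...

        // Hand the frame back so the producer can reuse it (step 4).
        cuEGLStreamConsumerReleaseFrame(&conn, res, &cudaStream);
    }
    cuEGLStreamConsumerDisconnect(&conn);
}
```

Because synchronization is implicit in acquire/release, no explicit fences are needed between the producer's writes and the consumer's kernels.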

SLIDE 13

INTEROP SUMMARY

CUDA-OpenGL:
  • OpenGL-ES support
  • EGL support
  • Portable across Tegra and discrete GPUs

EGL Image:
  • Easy setup
  • Works with several EGL client APIs
  • YUV planar image support

EGL Streams:
  • Producer-consumer model
  • Implicit synchronization
  • Cross-process support
  • YUV planar image support

SLIDE 14

CUDA UNIFIED MEMORY ON TEGRA

  • Helps take advantage of the unified DRAM on Tegra
  • Easier to program; one unified allocator: cudaMallocManaged
  • Programming model enforced through memory access protection
  • No memcpy needed; migration is managed by the CUDA driver
  • Saves memory consumption and power
  • The attach API helps achieve optimal performance

[Diagram: on Tegra, CPU and GPU share a single DRAM.]
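The points above can be sketched end to end. This is a minimal illustration, not code from the talk; the size and kernel are made up, and cudaStreamAttachMemAsync is used as one concrete form of the "attach API" mentioned above.

```cuda
// Sketch: unified memory with an optional stream attach.
#include <cuda_runtime.h>

__global__ void add_one(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

int main(void)
{
    const int n = 1 << 20;
    float *data;
    cudaMallocManaged(&data, n * sizeof(float));     // one allocator, no memcpy

    for (int i = 0; i < n; ++i) data[i] = (float)i;  // CPU writes directly

    cudaStream_t s;
    cudaStreamCreate(&s);
    // Optional attach: scope the allocation to this stream so work in
    // other streams is not stalled by its coherence management.
    cudaStreamAttachMemAsync(s, data, 0, cudaMemAttachSingle);

    add_one<<<(n + 255) / 256, 256, 0, s>>>(data, n);
    cudaStreamSynchronize(s);        // required before the CPU touches data again

    float first = data[0];           // CPU reads the result, no cudaMemcpyDtoH
    cudaFree(data);
    cudaStreamDestroy(s);
    return (first == 1.0f) ? 0 : 1;
}
```

Note the synchronize before CPU access: with unified memory the driver enforces this ordering through memory access protection, as the slide says.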

SLIDE 15

CUDA MEMORY TYPES

Each allocation type follows the sequence allocate → CPU use → migrate → CUDA kernel → migrate → CPU use:

  • Traditional: malloc()/cudaMalloc(), CPU use, cudaMemcpyHtoD(), kernel_launch<<<>>>(), cudaMemcpyDtoH(), CPU use
  • Unified Memory: cudaMallocManaged(), CPU use, cudaMemAttach [optional], kernel_launch<<<>>>(), cudaMemAttach [optional], CPU use
  • Zero Copy: cudaMallocHost(), CPU use, no migration, kernel_launch<<<>>>(), no migration, CPU use
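The zero-copy row can be sketched as follows. Illustrative only: the size and kernel are invented, and passing the cudaMallocHost pointer straight to a kernel assumes unified virtual addressing, which holds on Tegra.

```cuda
// Sketch: zero copy via cudaMallocHost; the GPU dereferences the host
// allocation directly, so there is no migrate step.
#include <cuda_runtime.h>

__global__ void scale(float *d, int n, float s)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= s;
}

int main(void)
{
    const int n = 4096;
    float *buf;
    cudaMallocHost(&buf, n * sizeof(float));        // pinned, GPU-visible

    for (int i = 0; i < n; ++i) buf[i] = 1.0f;      // CPU use

    scale<<<(n + 255) / 256, 256>>>(buf, n, 3.0f);  // kernel reads host memory
    cudaDeviceSynchronize();

    // CPU use again: no cudaMemcpy in either direction.
    int ok = (buf[0] == 3.0f);
    cudaFreeHost(buf);
    return ok ? 0 : 1;
}
```

These accesses bypass the caches on both sides (next slide), which is why zero copy suits data that is touched once, not reused.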

SLIDE 16

CUDA MEMORY TYPES

Time taken (ms) by the matrix-multiply CUDA kernel with different allocation types:

  Size    Traditional   Zero Copy   Managed
  16KB    0.617         0.544       0.644
  1MB     9.723         11.119      7.093
  4MB     59.37618      62.232      46.42551
  16MB    377.9244      403.2382    344.926

Traditional:
  • Easy portability from existing desktop programs
  • Faster for some small allocations
  • Suitable for GPU intermediate buffers, tables, etc.

Managed:
  • Memory access by CPU and GPU goes through the cache
  • Faster for larger allocations
  • Suitable when memory is used on both host and GPU

Zero Copy:
  • The cache is bypassed by both GPU and CPU when accessing these allocations
  • Suitable when memory access is not affected by caching

SLIDE 17

THANK YOU

JOIN THE NVIDIA DEVELOPER PROGRAM AT developer.nvidia.com/join