April 4-7, 2016 | Silicon Valley
CUDA ON MOBILE
Yogesh Kini, GTC 2016
ABSTRACT
Typical pipeline
CUDA Interop APIs
Unified Memory on Tegra
TYPICAL USE CASES
Automobiles: Autonomous Cars
Mobile Devices: Consoles, Tablets
Embedded: Drones
TYPICAL PIPELINE
Capture: Camera Sensor → ISP/DSP → CUDA → Graphics/Display → Actuators
CUDA runs on Linux and QNX; example workloads include games.
Source for EGLImage:
https://www.khronos.org/registry/egl/extensions/KHR/EGL_KHR_image_base.txt
EGLImage → cuGraphicsEGLRegisterImage() → CUDA graphics resource
  cuGraphicsResourceGetMappedArray()   → cudaArray
  cuGraphicsResourceGetMappedPointer() → cudaDevicePointer
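The mapping above can be sketched with the CUDA driver API. This is a minimal, hedged sketch, assuming the EGLImage was created earlier with eglCreateImageKHR(); mapEglImage is a hypothetical helper name, and note that the driver API spells the array accessor cuGraphicsSubResourceGetMappedArray(). Error paths are abbreviated.

```cuda
#include <cuda.h>
#include <cudaEGL.h>

// Hypothetical helper: map an existing EGLImage into CUDA without copying.
CUresult mapEglImage(EGLImageKHR eglImage)
{
    CUgraphicsResource resource;
    CUresult status = cuGraphicsEGLRegisterImage(
        &resource, eglImage, CU_GRAPHICS_REGISTER_FLAGS_NONE);
    if (status != CUDA_SUCCESS)
        return status;

    // Array-backed images: obtain a CUarray usable with texture/surface APIs.
    CUarray array;
    status = cuGraphicsSubResourceGetMappedArray(&array, resource, 0, 0);

    // Pitch-linear images: obtain a raw device pointer instead.
    // CUdeviceptr devPtr; size_t size;
    // status = cuGraphicsResourceGetMappedPointer(&devPtr, &size, resource);

    // ... launch CUDA kernels that read/write the mapped image ...

    cuGraphicsUnregisterResource(resource);
    return status;
}
```

Registration does not copy pixel data; the CUDA view aliases the same memory, which is why explicit synchronization with the other API is required around access.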
Interop synchronization pattern:
  Begin resource usage in Other API
    Other API code
  End resource usage in Other API
    (synchronize)
  Begin resource usage in CUDA
    CUDA code
  End resource usage in CUDA
EGLStream connects a producer and a consumer, e.g.:
  ISP Producer → EGL stream → CUDA Consumer
  CUDA Producer → EGL stream → OpenGL Consumer
CUDA libraries: cuDNN, cuBLAS, VisionWorks
CUDA Producer ↔ EGL Stream ↔ CUDA Consumer
  Producer connects with cuEGLStreamProducerConnect()
  Consumer connects with cuEGLStreamConsumerConnect()

Frame life cycle:
  1. cuEGLStreamProducerPresentFrame(frame): producer inserts a frame into the stream
  2. cuEGLStreamConsumerAcquireFrame(frame): consumer acquires the frame
  3. Frame is used in CUDA
  4. cuEGLStreamConsumerReleaseFrame(frame): consumer releases it; the producer then reclaims it with cuEGLStreamProducerReturnFrame(frame)
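The four steps above map onto matching producer and consumer loops. A hedged sketch: producerLoop and consumerLoop are hypothetical names, the EGLStreamKHR is assumed to have been created with eglCreateStreamKHR(), and CUeglFrame plane setup plus error checking are omitted.

```cuda
#include <cuda.h>
#include <cudaEGL.h>

// Hypothetical producer side: present frames, then take them back for reuse.
void producerLoop(EGLStreamKHR stream, EGLint width, EGLint height)
{
    CUeglStreamConnection conn;
    cuEGLStreamProducerConnect(&conn, stream, width, height);

    CUeglFrame frame;  // assumed filled in with CUDA-allocated planes
    // Step 1: hand a finished frame to the stream.
    cuEGLStreamProducerPresentFrame(&conn, frame, NULL);
    // Step 4 (producer side): reclaim the frame once the consumer releases it.
    cuEGLStreamProducerReturnFrame(&conn, &frame, NULL);

    cuEGLStreamProducerDisconnect(&conn);
}

// Hypothetical consumer side: acquire, process, release.
void consumerLoop(EGLStreamKHR stream)
{
    CUeglStreamConnection conn;
    cuEGLStreamConsumerConnect(&conn, stream);

    CUgraphicsResource resource;
    // Step 2: wait for the producer's frame (last argument is the timeout).
    cuEGLStreamConsumerAcquireFrame(&conn, &resource, NULL, 16000);
    // Step 3: ... map 'resource' and process the frame with CUDA kernels ...
    // Step 4: give the frame back so the producer can reuse it.
    cuEGLStreamConsumerReleaseFrame(&conn, resource, NULL);

    cuEGLStreamConsumerDisconnect(&conn);
}
```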
EGLStream features:
- Synchronization support
- Discrete GPU support
- Client API support
TEGRA MEMORY ARCHITECTURE
CPU and GPU on Tegra share the same physical memory (DRAM); accesses are subject to memory protection.
Allocation and access patterns:

Traditional:    malloc() / cudaMalloc() → CPU use → cudaMemcpyHtoD() → Kernel_launch<<<>>>() → cudaMemcpyDtoH() → CPU use
Unified Memory: cudaMallocManaged() → CPU use → cudaMemAttach [optional] → Kernel_launch<<<>>>() → cudaMemAttach [optional] → CPU use
Zero Copy:      cudaMallocHost() → CPU use → Kernel_launch<<<>>>() → CPU use (no explicit migration)
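The three patterns above can be contrasted in code. A minimal sketch assuming a hypothetical scale kernel; on Tegra (a unified-addressing platform) the cudaMallocHost() pointer can be passed directly to the kernel. Error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

__global__ void scale(float *data, int n)   // hypothetical example kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

void traditional(int n)                     // explicit copies in both directions
{
    float *h = (float *)malloc(n * sizeof(float));
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    /* CPU fills h ... */
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);  // migrate to GPU
    scale<<<(n + 255) / 256, 256>>>(d, n);
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);  // migrate back
    cudaFree(d);
    free(h);
}

void managed(int n)                         // one pointer, no explicit copies
{
    float *p;
    cudaMallocManaged(&p, n * sizeof(float));
    /* CPU fills p directly ... */
    scale<<<(n + 255) / 256, 256>>>(p, n);
    cudaDeviceSynchronize();                // required before the CPU touches p again
    cudaFree(p);
}

void zeroCopy(int n)                        // pinned host memory, GPU uses it in place
{
    float *p;
    cudaMallocHost(&p, n * sizeof(float));
    /* CPU fills p ... */
    scale<<<(n + 255) / 256, 256>>>(p, n);  // host pointer is GPU-visible under UVA
    cudaDeviceSynchronize();
    cudaFreeHost(p);
}
```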
Time taken (ms) by the Matrix Multiply CUDA kernel with different allocation types:

Size    Traditional   Zero Copy   Managed Memory
16KB    0.617         0.544       0.644
1MB     9.723         11.119      7.093
4MB     59.37618      62.232      46.42551
16MB    377.9244      403.2382    344.926

When to use each:
- Traditional: suits existing desktop programs; GPU-only allocations such as intermediate buffers, tables, etc.
- Managed: access from both CPU and GPU is through cache; allocations used on both host and GPU.
- Zero Copy: the same memory is accessed by both GPU and CPU; access is not affected by caching.
JOIN THE NVIDIA DEVELOPER PROGRAM AT developer.nvidia.com/join