Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time Tasks in Autonomous Systems
Ming Yang, Nathan Otterness, Tanya Amert, Joshua Bakita, James H. Anderson, F. Donelson Smith
All image sources and references are provided at the end.
Computer Vision & AI Expertise
GPU Behavior Expertise
Real-time Expertise
(i) Allocate GPU memory
cudaMalloc(&devicePtr, bufferSize);
(ii) Copy data from CPU to GPU
cudaMemcpy(devicePtr, hostPtr, bufferSize, cudaMemcpyHostToDevice);
(iii) Launch the kernel (kernel = code that runs on GPU)
computeResult<<<numBlocks, threadsPerBlock>>>(devicePtr);
(iv) Copy results from GPU to CPU
cudaMemcpy(hostPtr, devicePtr, bufferSize, cudaMemcpyDeviceToHost);
(v) Free GPU memory
cudaFree(devicePtr);
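Putting the five steps together, here is a minimal self-contained sketch, assuming a trivial kernel named computeResult that doubles each element and an illustrative buffer size (neither is from the presentation):

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

#define ELEMENT_COUNT (1 << 20)   // illustrative buffer size: 1M floats

// Illustrative kernel: doubles each element of the buffer.
__global__ void computeResult(float *data) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < ELEMENT_COUNT) data[i] *= 2.0f;
}

int main(void) {
  size_t bufferSize = ELEMENT_COUNT * sizeof(float);
  float *hostPtr = (float *) malloc(bufferSize);
  for (int i = 0; i < ELEMENT_COUNT; i++) hostPtr[i] = (float) i;

  // (i) Allocate GPU memory
  float *devicePtr = NULL;
  cudaMalloc(&devicePtr, bufferSize);

  // (ii) Copy data from CPU to GPU
  cudaMemcpy(devicePtr, hostPtr, bufferSize, cudaMemcpyHostToDevice);

  // (iii) Launch the kernel: enough 256-thread blocks to cover the buffer
  int threadsPerBlock = 256;
  int numBlocks = (ELEMENT_COUNT + threadsPerBlock - 1) / threadsPerBlock;
  computeResult<<<numBlocks, threadsPerBlock>>>(devicePtr);

  // (iv) Copy results from GPU to CPU (this call also waits for the kernel)
  cudaMemcpy(hostPtr, devicePtr, bufferSize, cudaMemcpyDeviceToHost);

  // (v) Free GPU memory
  cudaFree(devicePtr);

  printf("hostPtr[1] = %f\n", hostPtr[1]);   // expect 2.0
  free(hostPtr);
  return 0;
}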
[Figures: timelines of GPU kernel executions issued by CPU threads ("tasks"); annotations mark where K1 starts and K1 completes; the kernels use 1024 threads and 256 threads; panel (a) shows thread 3 blocked at explicit synchronization.]
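To make the timeline concrete, here is a hedged sketch, assuming a single task that launches one kernel into its own CUDA stream and then blocks on explicit synchronization; the kernel, data size, and function names are illustrative, not the benchmark code used in the experiments:

#include <cuda_runtime.h>

// Illustrative kernel standing in for K1, K2, ... in the timelines.
__global__ void busyKernel(float *data, int count) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < count) data[i] = data[i] * data[i] + 1.0f;
}

// One iteration of a task: launch a kernel in the task's own stream,
// then block until it completes (explicit synchronization).
void RunOneIteration(cudaStream_t stream, float *deviceData, int count) {
  int threadsPerBlock = 256;   // the figures show kernels of 256 and 1024 threads
  int numBlocks = (count + threadsPerBlock - 1) / threadsPerBlock;
  busyKernel<<<numBlocks, threadsPerBlock, 0, stream>>>(deviceData, count);
  cudaStreamSynchronize(stream);   // the CPU thread waits here until the kernel completes
}

int main(void) {
  const int count = 1 << 20;
  float *deviceData = NULL;
  cudaMalloc(&deviceData, count * sizeof(float));
  cudaMemset(deviceData, 0, count * sizeof(float));

  cudaStream_t stream;
  cudaStreamCreate(&stream);   // each task would create its own stream

  RunOneIteration(stream, deviceData, count);

  cudaStreamDestroy(stream);
  cudaFree(deviceData);
  return 0;
}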
CUDA toolkit 9.2.88 Programming Guide, Section 3.2.5.5.4, "Implicit Synchronization":
"Two commands from different streams cannot run concurrently [if separated by]:"
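The 9.2 guide's list of such operations includes, among others, device memory allocations, device memory sets, and any CUDA command issued to the NULL stream. Below is a hedged sketch of the effect, with an illustrative spin kernel: a synchronous cudaMemset issued between launches into two user-created streams prevents the second kernel from running concurrently with the first.

#include <cuda_runtime.h>

// Illustrative kernel that keeps the GPU busy for a while.
__global__ void spin(float *data, int count) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < count) {
    float x = data[i];
    for (int j = 0; j < 100000; j++) x = x * 1.000001f + 0.5f;
    data[i] = x;
  }
}

int main(void) {
  const int count = 1 << 20;
  float *a = NULL, *b = NULL;
  unsigned char *scratch = NULL;
  cudaMalloc(&a, count * sizeof(float));
  cudaMalloc(&b, count * sizeof(float));
  cudaMalloc(&scratch, count);

  cudaStream_t s1, s2;
  cudaStreamCreate(&s1);
  cudaStreamCreate(&s2);

  // Without the memset in between, these two kernels could run concurrently
  // because they are issued to different user-created streams.
  spin<<<count / 256, 256, 0, s1>>>(a, count);

  // A device memory set issued by the host is on the implicit-synchronization
  // list, so the kernel launched into s2 below cannot run concurrently with
  // the kernel already executing in s1.
  cudaMemset(scratch, 0, count);

  spin<<<count / 256, 256, 0, s2>>>(b, count);

  cudaDeviceSynchronize();
  cudaStreamDestroy(s1);
  cudaStreamDestroy(s2);
  cudaFree(a);
  cudaFree(b);
  cudaFree(scratch);
  return 0;
}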
[Figure: (b) trying to launch kernel 4; (c) thread 3 sleeps for 0.2 seconds.]
            | Multiple Process-based Tasks | Multiple Thread-based Tasks
Without MPS | MP                           | MT
With MPS    | MP (MPS)                     | MT (MPS)
70% of the time, a single Hough transform iteration completed in 12 ms or less.
This occurred when four concurrent instances were running in separate CPU threads.
The observed worst-case execution time (WCET) under MT was over 4x the WCET under MP.
if (!CheckCUDAError( cudaMemsetAsync( state->device_block_smids, 0, data_size))) { return 0; }

Why would that memset cause implicit synchronization? cudaMemsetAsync takes an optional stream argument; when it is omitted, as above, the memset is issued to the default (NULL) stream, and commands issued to the NULL stream implicitly synchronize with commands in every other stream.
The fix is to pass the task's stream explicitly so the memset no longer goes to the NULL stream:

if (!CheckCUDAError( cudaMemsetAsync( state->device_block_smids, 0, data_size, state->stream))) { return 0; }
CUDA Programming Guide, section 3.2.5.1:
"The following device operations are asynchronous with respect to the host: [...] Memory copies performed by functions that are suffixed with Async."
CUDA Runtime API Documentation, section 2:
"For transfers from device memory to pageable host memory, [cudaMemcpyAsync] will return only once the copy has completed."
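So whether cudaMemcpyAsync actually overlaps with host execution depends on how the host buffer was allocated. A minimal sketch of the difference, assuming illustrative buffer names and sizes:

#include <cuda_runtime.h>
#include <stdlib.h>

int main(void) {
  const size_t bufferSize = 64 * 1024 * 1024;
  float *devicePtr = NULL;
  cudaMalloc(&devicePtr, bufferSize);

  cudaStream_t stream;
  cudaStreamCreate(&stream);

  // Pageable host memory: despite the "Async" suffix, a device-to-host
  // copy into this buffer returns only once the copy has completed.
  float *pageableHost = (float *) malloc(bufferSize);
  cudaMemcpyAsync(pageableHost, devicePtr, bufferSize,
                  cudaMemcpyDeviceToHost, stream);

  // Pinned (page-locked) host memory: the same call can return immediately
  // and overlap with other host work; synchronize explicitly before
  // reading the data.
  float *pinnedHost = NULL;
  cudaMallocHost(&pinnedHost, bufferSize);
  cudaMemcpyAsync(pinnedHost, devicePtr, bufferSize,
                  cudaMemcpyDeviceToHost, stream);
  cudaStreamSynchronize(stream);

  cudaFreeHost(pinnedHost);
  free(pageableHost);
  cudaStreamDestroy(stream);
  cudaFree(devicePtr);
  return 0;
}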
Figure sources:
https://electrek.co/guides/tesla-vision/
https://www.quora.com/What-are-the-different-types-of-artificial-neural-network
https://www.researchgate.net/figure/Compute-unified-device-architecture-CUDA-threads-and-blocks-multidimensional_fig1_320806445?_sg=ziaY-gBKKiKX4pljRq4vJSWZvDvdOidZ2aCRYnD1QVFBJDxIx3MEO1I03cI31e1It6pUr53qaS1L1w4Bt5fd8w