Zürcher Fachhochschule
Armin Weiss
- Dr. Amin Mazloumian
- Dr. Matthias Rosenthal
Institute of Embedded Systems High Performance Multimedia Research Group
Zurich University of Applied Sciences
Realtime Signal Processing on Nvidia TX2 using CUDA Armin Weiss - - PowerPoint PPT Presentation
Realtime Signal Processing on Nvidia TX2 using CUDA Armin Weiss Institute of Embedded Systems High Performance Multimedia Research Group Dr. Amin Mazloumian Zurich University of Applied Sciences Dr. Matthias Rosenthal Zrcher Fachhochschule
Zürcher Fachhochschule
Armin Weiss
Institute of Embedded Systems High Performance Multimedia Research Group
Zurich University of Applied Sciences
Zürcher Fachhochschule
Digital Audio Mixing Console
2
Audio I / O Card Control Unit Audio Source
Audio Sink
Zürcher Fachhochschule
3
Zürcher Fachhochschule
4
Zürcher Fachhochschule
Variability in GPU Kernel Launch
5
__global__ void identity(float *input, float *output, int numElem) { for (int index = 0; index < numElem; index++) {
} } numElem = 25
Zürcher Fachhochschule
Variability in GPU Kernel Launch
6
Zürcher Fachhochschule
Problems
7
Zürcher Fachhochschule
Persistent Kernel
8
… while (running) { if (new_audio_samples() == true) { send_sync_to_GPU(); wait_for_GPU_sync(); } } … __global__ void audioKernel(…) { … // Infinite loop while (true) { wait_for_CPU_sync(); // Audio channel processing … wait_for_all_threads_to_finish(); send_sync_to_CPU(); } }
Host Application
CUDA Kernel
Zürcher Fachhochschule
Persistent Kernel: Synchronization
9
__global__ void audioKernel(…, volatile int* gpuToken) { … // Infinite loop while (true) { wait_for_CPU_sync(); // Audio channel processing … wait_for_all_threads_to_finish(); send_sync_to_CPU(); } }
CUDA Kernel
wait_for_CPU_sync() { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i == 0) while (*gpuToken == NO_AUDIO); synchronize_threads(); }
Memory Accessible by CPU and GPU
Zürcher Fachhochschule
CUDA Memory Comparison Managed <-> Zero Copy
10
TX2 CPU GPU DRAM 8GB Memory Controller
Cache Cache
GPU Buffer TX2 CPU GPU DRAM 8GB Memory Controller
Cache Cache
GPU Buffer
Zürcher Fachhochschule
Problems
11
Zürcher Fachhochschule
Infinite Loop: Parameter Communication
12
Application Writes at Arbitrary Time Audio Processing Thread Local Parameter Copy Parameter Memory (Zero Copy)
Periodic Update
Zürcher Fachhochschule
Conclusion
13
Zürcher Fachhochschule
14
Zürcher Fachhochschule
Separate Buffers
15
TX2 CPU GPU DRAM 8GB Memory Controller
Cache Cache
I/O Buffer GPU Buffer PCIe Audio I / O Card
Zürcher Fachhochschule
memcpy() Measurements
16
1.E+00 1.E+03 1.E+06 1.E+09
4096 bytes @ 48 kHz for 12h on 3 CPUs (A57)
TX1 ( Kernel v3.10.96) TX2 (Kernel v4.4.15)
Zürcher Fachhochschule
Shared Buffer
17
TX2 CPU GPU DRAM 8GB Memory Controller
Cache Cache
I/O Buffer GPU Buffer PCIe Audio I / O Card
Zürcher Fachhochschule
Shared Buffer
18
TX2 CPU GPU DRAM 8GB Memory Controller
Cache Cache
Shared Buffer PCIe Audio I / O Card
Zürcher Fachhochschule
Shared Buffer
19
Zürcher Fachhochschule
Discrete GPU
Integrated GPU
GPUDirect RDMA
20
CPUs System Memory 3rd Party Device GPU Bridge
PCIe
CPUs System Memory 3rd Party Device PCIe Controller
PCIe
GPU
Memory Interconnect
Memory
GPUDirect RDMA
Zürcher Fachhochschule
Shared Buffer
21
Zürcher Fachhochschule
CudaHostRegister()
22
TX2 CPU GPU DRAM 8GB Memory Controller
Cache Cache
I/O Buffer PCIe Audio I / O Card CudaHostRegister()
Zürcher Fachhochschule
Shared Buffer
23
Zürcher Fachhochschule
Video4Linux - Userptr
24
TX2 CPU GPU DRAM 8GB Memory Controller
Cache Cache
GPU Buffer Video Input Embedded Camera Userptr Mode Mapped Access
Zürcher Fachhochschule
Video4Linux - Userptr
25
TX2 CPU GPU DRAM 8GB Memory Controller
Cache Cache
GPU Buffer PCIe Audio I / O Card Userptr Mode Mapped Access
Zürcher Fachhochschule
Shared Buffer
26
Zürcher Fachhochschule
27
Zürcher Fachhochschule
28
Website: http://www.zhaw.ch/ines/ Blog: https://blog.zhaw.ch/high-performance/ Github: https://github.com/ines-hpmm E-Mail: armin.weiss@zhaw.ch amin.mazloumian@zhaw.ch matthias.rosenthal@zhaw.ch