

Slide 1
Realtime Signal Processing on Nvidia TX2 using CUDA

Armin Weiss
  • Dr. Amin Mazloumian
  • Dr. Matthias Rosenthal

Institute of Embedded Systems, High Performance Multimedia Research Group
Zurich University of Applied Sciences (Zürcher Fachhochschule)

Slide 2: System Overview
Digital Audio Mixing Console

Diagram: Audio Source, Audio I/O Card, Control Unit, and the TX2 (Audio Processing), feeding an Audio Sink.

Slide 3: Motivation
Nvidia TX Series (CPU + GPU)

  • In Comparison to FPGA / DSP Solutions:
  • Performance Gain: 100x (e.g. vs. Analog Devices SHARC)
  • Fast Development Time
  • System on a Single Chip
  • Cost Effective

Slide 4: Challenges

  • Short and Deterministic Latency
  • 32 Samples (0.33 ms @ 96 kHz)
  • Video (60 Hz): 16.7 ms / Frame
  • High Input and Output Data Rate
  • 256 Channels ∗ 7 Inputs ∗ 32 Bit / Input ∗ 96 kHz = 5.5 Gb/s
  • 1080p@60 (24-bit RGB): 3.0 Gb/s

Slide 5: Short and Deterministic Latency
Variability in GPU Kernel Launch

__global__ void identity(float *input, float *output, int numElem) {
    for (int index = 0; index < numElem; index++) {
        output[index] = input[index];
    }
}

Measured with numElem = 25:
  • ≈ 99.8% of launches behave as expected
  • 0.1% - 0.2% are outliers

Slide 6: Short and Deterministic Latency
Variability in GPU Kernel Launch

  • Latency ~ Buffer Size
  • Outliers ≈ 50 ms

Slide 7: Short and Deterministic Latency
Problems

  • How to Improve Deterministic Behavior?
  • Solution: Persistent CUDA Kernel (Eliminates the Launch Time)

Slide 8: Short and Deterministic Latency
Persistent Kernel

CPU (Host Application):

…
while (running) {
    if (new_audio_samples() == true) {
        send_sync_to_GPU();
        wait_for_GPU_sync();
    }
}
…

GPU (CUDA Kernel):

__global__ void audioKernel(…) {
    …
    // Infinite loop
    while (true) {
        wait_for_CPU_sync();
        // Audio channel processing
        …
        wait_for_all_threads_to_finish();
        send_sync_to_CPU();
    }
}

Slide 9: Short and Deterministic Latency
Persistent Kernel: Synchronization

GPU (CUDA Kernel):

__global__ void audioKernel(…, volatile int* gpuToken) {
    …
    // Infinite loop
    while (true) {
        wait_for_CPU_sync();
        // Audio channel processing
        …
        wait_for_all_threads_to_finish();
        send_sync_to_CPU();
    }
}

wait_for_CPU_sync() {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i == 0)
        while (*gpuToken == NO_AUDIO);
    synchronize_threads();
}

gpuToken lives in memory accessible by both CPU and GPU.

Slide 10: Short and Deterministic Latency
CUDA Memory Comparison: Managed <-> Zero Copy

Diagram: two views of the TX2 (CPU and GPU, each with a cache, behind a memory controller incl. SMMU in front of 8 GB DRAM), showing where the GPU buffer sits with Zero Copy vs. Managed Memory.

Slide 11: Short and Deterministic Latency
Problems

  • Memory Accessible by CPU and GPU?
  • Use Zero Copy Memory
  • What about Parameters?

Slide 12: Short and Deterministic Latency
Infinite Loop: Parameter Communication

Diagram: the application (CPU) writes to the parameter memory (zero copy) at arbitrary times. Because zero-copy access is slow, the audio processing thread keeps a local parameter copy and refreshes it with a periodic update.

Slide 13: Short and Deterministic Latency
Conclusion

  • Use Memory Accessible by CPU and GPU
  • Use Zero Copy Memory
  • What about Parameters? Speed-up due to Local Copy
  • How to Make CPUs Deterministic?
  • CPU Core Isolation
  • Initramfs built with Yocto: No Flash Access Anymore During Runtime
  • Minimize Influence from other Applications

Slide 14: Challenges

  • Short and Deterministic Latency
  • High Input and Output Data Rate

Slide 15: High Input / Output Data Rate
Separate Buffers

Diagram: the Audio I/O Card transfers over PCIe into an I/O Buffer in DRAM; a memcpy moves the data into a separate GPU Buffer (TX2: CPU and GPU with caches, memory controller incl. SMMU, 8 GB DRAM).

Slide 16: High Input / Output Data Rate
memcpy() Measurements

Histogram (# occurrences vs. time in µs): memcpy() of 4096 bytes @ 48 kHz for 12 h on 3 CPUs (A57), on the TX1 (kernel v3.10.96) and TX2 (kernel v4.4.15). Result: 75 % CPU usage!

Slide 17: High Input / Output Data Rate
Shared Buffer

Diagram: same setup as slide 15; the goal is to remove the memcpy between the I/O Buffer and the GPU Buffer.

Slide 18: High Input / Output Data Rate
Shared Buffer

Diagram: the Audio I/O Card transfers over PCIe directly into a single Shared Buffer in DRAM that both CPU and GPU access.

Slide 19: High Input / Output Data Rate
Shared Buffer

  • Existing Solutions for Buffer Sharing:
  • GPUDirect RDMA
Slide 20: High Input / Output Data Rate
GPUDirect RDMA

Diagram (Desktop, discrete GPU): a 3rd-party device and the GPU sit on PCIe behind a bridge; GPUDirect RDMA lets the device reach GPU memory directly, bypassing system memory.
Diagram (TX2, integrated GPU): the 3rd-party device connects through the PCIe controller and the memory interconnect to system memory, which the GPU shares.

Slide 21: High Input / Output Data Rate
Shared Buffer

  • Existing Solutions for Buffer Sharing:
  • GPUDirect RDMA → Not Available
  • cudaHostRegister()
Slide 22: High Input / Output Data Rate
cudaHostRegister()

Diagram: the existing I/O Buffer (filled over PCIe by the Audio I/O Card) is registered with cudaHostRegister() so that the GPU could access it in place.

Slide 23: High Input / Output Data Rate
Shared Buffer

  • Existing Solutions for Buffer Sharing:
  • GPUDirect RDMA → Not Available
  • cudaHostRegister() → Not Implemented on TX2
  • Video4Linux2 → Userptr Mode
Slide 24: High Input / Output Data Rate
Video4Linux2: Userptr

Diagram: an embedded camera delivers video input straight into the GPU Buffer via V4L2 userptr mode (mapped access).

Slide 25: High Input / Output Data Rate
Video4Linux2: Userptr

Diagram: the same userptr mechanism applied to the PCIe Audio I/O Card, which writes into the GPU Buffer directly (mapped access).

Slide 26: High Input / Output Data Rate
Shared Buffer

  • Existing Solutions for Buffer Sharing:
  • GPUDirect RDMA → Not Available
  • cudaHostRegister() → Not Implemented on TX2
  • Video4Linux2 → Userptr Mode

Slide 27: Conclusion

  • Feasibility of Low-Latency Signal Processing on GPU: Professional Audio Mixer with 200 Channels
  • Short and Deterministic Latency → Persistent CUDA Kernel
  • High Input / Output Data Rate → Shared Buffer I/O <-> GPU (Video4Linux)
Slide 28: Get Started with Signal Processing on GPU!

Website: http://www.zhaw.ch/ines/
Blog: https://blog.zhaw.ch/high-performance/
Github: https://github.com/ines-hpmm
E-Mail: armin.weiss@zhaw.ch, amin.mazloumian@zhaw.ch, matthias.rosenthal@zhaw.ch

Questions?