S7105 ADAS/AD CHALLENGES: GPU SCHEDULING & SYNCHRONIZATION
Venugopala Madumbu, NVIDIA GTC 2017 – 210D
ADVANCED DRIVING ASSIST SYSTEMS (ADAS) & AUTONOMOUS DRIVING (AD)
High Compute Workloads Mapped to GPU
ADAS/AD Requirements & Challenges

Real-Time Behavior
- Determinism
- Freedom from Interference
- Priority of Functionalities

Performance
- Maximum Throughput
- Minimal Latency

Compute platforms: Multi-Core CPU, GPU/DSP/HWA
ADAS/AD WORKLOADS
Challenges Illustrated
Scenario#1 – Standalone Execution: GL Workload completes in X msec
Scenario#2 – Standalone Execution: CUDA Workload completes in Y msec
Scenario#3 – Concurrent Execution: GL + CUDA Workloads on a time-shared GPU take > (X+Y) msec

If so, how to
- Achieve determinism
- Achieve freedom from interference
- Prioritize one workload over the other
while also having
- maximum throughput
- minimum latency
GPU IN TEGRA
High-Level Tegra SoC Block Diagram

(Diagram: the CPU and other clients (ISP, Display, etc.) share DRAM through a memory controller; the GPU attaches via its own memory interface and contains a Host unit and engines.)

- CPU submits jobs/work to the GPU
- GPU runs asynchronously to the CPU
- GPU has its own hardware scheduler (Host)
- The Host switches between workloads without CPU involvement
GPU SCHEDULING
Concepts

- Channel – independent stream of work on the GPU
- Command Push Buffer – command buffer written by software and read by hardware
- Channel Switching – save/restore of GPU state on a channel switch
- Semaphores/Syncpoints – synchronization mechanism for events within the GPU
- Time Slice – how long the GPU executes commands of a channel before a channel switch
- Run-list – an ordered list of channels that software wants the GPU to execute
GPU SCHEDULING
Timesharing by Channel Switching

Channel switching occurs when any ONE of the following happens:
- Time slice expires
- Engine runs out of work (no more commands)
- Blocked on a semaphore

Channel Switch time = Drain Time + Save/Restore time
Preemption can reduce channel switch times drastically

(Diagram: GPU occupancy over time under timesliced round-robin scheduling of App1, App2, App3, App4.)
GPU SCHEDULING
Preemption: Channel Switching with Time Slice Scenarios

1. Channel finishes before time slice expires
- Context switch to the next channel

2. Channel preemption (on channel switch timeout)
- Stop all commands in the pipeline
- Wait for engines to idle
- Higher context switch time

3. Channel reset (on channel switch timeout)
- Engine could not idle and context could not be saved before the channel switch timeout
- Callback notifies the kernel of the channel reset event

(Diagrams: Channel 1 running against its time slice and the channel switch timeout in each scenario.)
CHALLENGE REVISITED
How can we achieve both?

Real-Time behavior:
- Determinism
- Freedom from Interference
- Priority of Functionalities
Performance:
- Maximum Throughput
- Minimal Latency
GPU SYNCHRONIZATION & SCHEDULING
Software Control
1. User Driver Level (GPU Synchronization Approach)
- Syncpoints/Semaphores for Synchronization
- Through EGLStreams, EGLSync, etc.
2. Kernel Driver Level (GPU Priority Scheduling Approach)
- Run-List Engineering
- How long channel runs
- Order of Channel execution
GPU SYNCHRONIZATION APPROACH
No Synchronization Case
(Timeline, 0–35 msec: CPU tasks launch GPU kernels without synchronization; the priority GPU task and other GPU tasks execute concurrently, causing latency due to concurrent execution.)
GPU SYNCHRONIZATION APPROACH
Synchronization on CPU: Not good for GPU
(Timeline, 0–35 msec: GPU work is synchronized from the CPU side; kernel launches are serialized through the CPU, leaving the GPU waiting between tasks.)
GPU SYNCHRONIZATION APPROACH
Synchronization on GPU: No Context Switches
(Timeline, 0–35 msec: GPU semaphores order the work on the device; the priority GPU task runs without context switches, and a dependent CPU task has a delayed start.)

Achieves: Determinism, Freedom from Interference, Priority of Functionalities
GPU PRIORITY SCHEDULING APPROACH
Hypothetical Example
TASK  PRIORITY         FPS  WORST CASE EXECUTION TIME (WCET)
H1    High             60   9 ms
M1    Medium           30   4 ms
M2    Medium           30   4 ms
L1    Low/Best Effort  30   10 ms
GPU PRIORITY SCHEDULING APPROACH
Engineered Run-list and Time Slice Ensuring FPS and Latency
Run-List (repeating): H1, M1, M2, H1, M1, L1, M2

H1 (Max Exec Time = 9 ms): Time slice = 9 ms
M1 (Max Exec Time = 4 ms): Time slice = 3 ms
M2 (Max Exec Time = 4 ms): Time slice = 3 ms
L1 (Max Exec Time = 10 ms): Time slice = 1 ms

(Timeline: work on the GPU cycles through the run-list.)
Ensures the gap between H1 slots is not >16 ms, for 60 fps operation
GPU PRIORITY SCHEDULING APPROACH
Reduce Latency for GPU Work Completion

- Ensure the time slice is long enough to complete work
- Ensure work is submitted continually and well ahead of time, to avoid:
  - GPU idle time
  - Unnecessary context switches
GPU SCHEDULING
Best Practices to Keep the GPU Busy

- Submit work in advance, so the GPU has work to execute at any point in time
- Try to reduce/eliminate work dependencies
- Have a contingency plan for work overload: if feedback shows the budget is exceeded, submit work a few frames ahead and spread it out
- Plan for the worst-case scenario: deal with the GPU reset case, especially for low-priority work (GL robustness extensions)
CONCLUSION
GPU Synchronization & Scheduling Approaches
Real-Time behavior:
- Determinism
- Freedom from Interference
- Priority of Functionalities
Performance:
- Maximum Throughput
- Minimal Latency
ACKNOWLEDGEMENTS
- Scott Whitman, NVIDIA
- Vladislav Buzov, NVIDIA
- Amit Rao, NVIDIA
- Yogesh Kini, NVIDIA
GTC Instructor-led Lab: L7105 – EGLSTREAMS: INTEROPERABILITY OF CAMERA, CUDA AND OPENGL
11th May 2017, 9:30–11:30 AM, LL21D