

SLIDE 1

Venugopala Madumbu, NVIDIA GTC 2017 – 210D

S7105 – ADAS/AD CHALLENGES: GPU SCHEDULING & SYNCHRONIZATION

SLIDE 2

ADVANCED DRIVING ASSIST SYSTEMS (ADAS) & AUTONOMOUS DRIVING (AD)

High Compute Workloads Mapped to GPU

SLIDE 3

ADAS/AD

Requirements & Challenges

Real-Time Behavior

  • Determinism
  • Freedom from Interference
  • Priority of Functionalities

Performance

  • Maximum Throughput
  • Minimal Latency

Compute hardware: Multi-Core CPU, GPU/DSP/HWA

SLIDE 4

ADAS/AD WORKLOADS

Challenges Illustrated

Scenario#1 – Standalone Exec: GL Workload alone takes X msec

Scenario#2 – Standalone Exec: CUDA Workload alone takes Y msec

Scenario#3 – Concurrent Exec: GL Workload + CUDA Workload, time-shared GPU execution, takes > (X+Y) msec

If so, how to

  • Achieve determinism
  • Achieve freedom from interference
  • Prioritize one workload over the other

While also having

  • Maximum throughput
  • Minimum latency
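The gap between standalone and concurrent totals can be sketched with a toy Python model. This is only an illustration of why time-sharing costs more than X+Y: the slice length, per-switch overhead, and workload times below are invented numbers, not measurements from the talk.

```python
def concurrent_runtime(x_ms, y_ms, slice_ms, switch_ms):
    """Wall time when two workloads time-share one GPU.

    Every channel switch adds `switch_ms` of overhead on top of the
    useful work, which is why the shared total exceeds x_ms + y_ms.
    """
    remaining = [x_ms, y_ms]
    total = 0.0
    current = 0
    while any(r > 0 for r in remaining):
        run = min(slice_ms, remaining[current])
        remaining[current] -= run
        total += run
        other = 1 - current
        if remaining[other] > 0:
            total += switch_ms  # drain + save/restore cost per switch
            current = other
    return total

standalone = 10.0 + 8.0                           # X + Y msec
shared = concurrent_runtime(10.0, 8.0, 2.0, 0.5)  # time-shared GPU
print(standalone, shared)  # 18.0 22.0 -> concurrent > X + Y
```

With these numbers the two workloads trigger eight channel switches, adding 4 ms on top of the 18 ms of useful work.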

SLIDE 5

GPU IN TEGRA

High Level Tegra SoC Block Diagram

[Diagram: the CPU and other clients (ISP, Display, etc.) connect through the Memory Controller to DRAM; the GPU, with its own Host scheduler and Engines, attaches via the GPU Memory Interface.]

  • CPU submits job/work to the GPU
  • The GPU runs asynchronously to the CPU
  • The GPU has its own hardware scheduler (Host)
  • It switches between workloads without CPU involvement

SLIDE 6

GPU SCHEDULING

Concepts

  • Channel – independent stream of work on the GPU
  • Command Push Buffer – command buffer written by software and read by hardware
  • Channel Switching – save/restore of GPU state on a channel switch
  • Semaphores/Syncpoints – synchronization mechanism for events within the GPU
  • Time Slice – how long the GPU executes commands of a channel before a channel switch
  • Run-list – an ordered list of channels that software wants the GPU to execute
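The concepts above can be sketched as plain data structures. Python is used here purely as illustration; `Channel`, `submit`, and the field names are invented for the sketch, not the driver's actual API.

```python
from dataclasses import dataclass, field
from collections import deque

@dataclass
class Channel:
    """Independent stream of work on the GPU (illustrative model)."""
    name: str
    time_slice_ms: float  # how long it runs before a channel switch
    # Command push buffer: software appends commands, hardware consumes them.
    push_buffer: deque = field(default_factory=deque)

    def submit(self, command):
        self.push_buffer.append(command)  # software writes the push buffer

# A run-list is simply the ordered list of channels software wants
# the GPU's Host scheduler to execute.
run_list = [Channel("render", 3.0), Channel("cuda", 9.0)]
run_list[0].submit("DRAW")
print([c.name for c in run_list])  # ['render', 'cuda']
```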

SLIDE 7

GPU SCHEDULING

Timesharing by Channel Switching

Channel switching occurs when any ONE of the following happens:

  • Time slice expires
  • Engine runs out of work (no more commands)
  • Blocked on a semaphore

Channel Switch time = Drain Time + Save/Restore time

Preemption can reduce channel switch times drastically

[Diagram: GPU occupancy over time under timesliced round-robin scheduling across App1, App2, App3, App4.]
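The switch-cost relation and the three switch conditions stated above can be written down directly. This is a sketch; the function names and the millisecond values are made up for illustration.

```python
def channel_switch_time(drain_ms, save_restore_ms):
    # From the slide: Channel Switch time = Drain Time + Save/Restore time
    return drain_ms + save_restore_ms

def should_switch(slice_used_ms, time_slice_ms, out_of_work, blocked_on_semaphore):
    """A channel switch happens when ANY ONE of the three conditions holds."""
    return (slice_used_ms >= time_slice_ms   # time slice expired
            or out_of_work                   # engine ran out of commands
            or blocked_on_semaphore)         # waiting on a semaphore

print(channel_switch_time(1.5, 0.5))          # 2.0
print(should_switch(3.0, 3.0, False, False))  # True: slice expired
print(should_switch(1.0, 3.0, False, True))   # True: blocked on semaphore
print(should_switch(1.0, 3.0, False, False))  # False: keep running
```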

SLIDE 8

GPU SCHEDULING

Preemption

SLIDE 9

GPU SCHEDULING

Channel Switching with Time Slice Scenarios

1. Channel finishes before time slice expires

  • Context switch to next channel

2. Channel preemption (channel switch timeout)

  • Stop all commands in pipeline
  • Wait for engines to idle
  • Higher context switch time

3. Channel Reset

  • Engine could not idle and context could not be saved before the channel switch timeout
  • Callback to notify kernel of channel reset event
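The three outcomes above can be expressed as a small decision function. This is a sketch in Python; the function name and the returned strings are invented for illustration.

```python
def channel_switch_scenario(finished_in_slice, engines_idled_in_time):
    """Classify the three time-slice outcomes described above."""
    if finished_in_slice:
        # 1. Channel finishes before its time slice expires.
        return "switch: context switch to next channel"
    if engines_idled_in_time:
        # 2. Channel preemption: stop pipeline, wait for idle, save state.
        return "preempt: higher context switch time"
    # 3. Channel reset: engine could not idle before the timeout;
    # the kernel is notified via a callback.
    return "reset: notify kernel via callback"

print(channel_switch_scenario(True, True))    # normal switch
print(channel_switch_scenario(False, True))   # preemption
print(channel_switch_scenario(False, False))  # reset
```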

SLIDE 10

CHALLENGE REVISITED

How can we achieve both?

Real-Time Behavior:

  • Determinism
  • Freedom from Interference
  • Priority of Functionalities

Performance:

  • Maximum Throughput
  • Minimal Latency
SLIDE 11

GPU SYNCHRONIZATION & SCHEDULING

Software Control

1. User Driver Level (GPU Synchronization Approach)

  • Syncpoints/Semaphores for Synchronization
  • Through EGLStreams, EGLSync, etc.

2. Kernel Driver Level (GPU Priority Scheduling Approach)

  • Run-List Engineering
  • How long channel runs
  • Order of Channel execution
SLIDE 12

GPU SYNCHRONIZATION APPROACH

No Synchronization Case

[Timeline: without synchronization, the priority GPU task and other GPU tasks run concurrently on the GPU; kernel launches from CPU tasks interleave, and the priority task suffers latency due to concurrent execution.]

SLIDE 13

GPU SYNCHRONIZATION APPROACH

Synchronization on CPU: Not good for GPU

[Timeline: CPU-side synchronization serializes the GPU tasks, but the CPU sits between every kernel launch and GPU semaphore, so the GPU loses time between tasks.]

SLIDE 14

GPU SYNCHRONIZATION APPROACH

Synchronization on GPU: No Context Switches

[Timeline: the GPU tasks are pre-submitted and gated by GPU semaphores; the lower-priority work's start is simply delayed until the semaphore signals, with no CPU round trips and no extra context switches.]

✓ Determinism
✓ Freedom from Interference
✓ Priority of Functionalities
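The contrast between CPU-side and GPU-side synchronization in the last three slides can be reduced to a latency formula. A sketch only: the per-kernel time and the CPU round-trip cost below are invented numbers.

```python
def total_latency(kernel_ms, n_kernels, cpu_roundtrip_ms, gpu_sync):
    """Latency of a chain of dependent kernels.

    With CPU-side sync, the CPU pays a round trip between each pair of
    kernels; with GPU-side semaphores the next kernel is already queued,
    so no round trip is needed.
    """
    sync_cost = 0.0 if gpu_sync else cpu_roundtrip_ms
    return n_kernels * kernel_ms + (n_kernels - 1) * sync_cost

print(total_latency(5.0, 3, 2.0, gpu_sync=False))  # 19.0: CPU in the loop
print(total_latency(5.0, 3, 2.0, gpu_sync=True))   # 15.0: GPU semaphores
```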

SLIDE 15

GPU PRIORITY SCHEDULING APPROACH

Hypothetical Example

TASK  PRIORITY          FPS  WORST CASE EXECUTION TIME (WCET)
H1    High              60   9 ms
M1    Medium            30   4 ms
M2    Medium            30   4 ms
L1    Low/Best Effort   30   10 ms

SLIDE 16

GPU PRIORITY SCHEDULING APPROACH

Engineered Run-list and Time Slice Ensuring FPS and Latency

Run-List: H1 M1 M2 H1 M1 L1 M2 . . .

  • H1 (Max Exec Time = 9 ms): Time slice = 9 ms
  • M1 (Max Exec Time = 4 ms): Time slice = 3 ms
  • M2 (Max Exec Time = 4 ms): Time slice = 3 ms
  • L1 (Max Exec Time = 10 ms): Time slice = 1 ms

[Diagram: work on the GPU over time, following the run-list order.]

Ensured not more than 16 ms between H1 slots for 60 fps operation
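With the slide's time slices, the interval between consecutive H1 slots can be checked mechanically. The run-list order and slice values come from the slide; the helper function itself is invented for illustration.

```python
TIME_SLICE_MS = {"H1": 9, "M1": 3, "M2": 3, "L1": 1}   # from the slide
RUN_LIST = ["H1", "M1", "M2", "H1", "M1", "L1", "M2"]  # engineered order

def max_h1_period(run_list, slices, cycles=2):
    """Worst-case interval between consecutive H1 slots across repeats."""
    starts, t = [], 0.0
    for name in run_list * cycles:
        if name == "H1":
            starts.append(t)
        t += slices[name]
    return max(b - a for a, b in zip(starts, starts[1:]))

period = max_h1_period(RUN_LIST, TIME_SLICE_MS)
print(period)  # 16.0 -> H1 gets a slot at least every 16 ms (60 fps)
```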
SLIDE 17

GPU PRIORITY SCHEDULING APPROACH

Reduce Latency for GPU Work Completion

  • Ensure the time slice is long enough to complete the work
  • Ensure work is continually submitted, and well ahead in time, to avoid:
    • GPU idle time
    • Unnecessary context switches

SLIDE 18

GPU SCHEDULING

Best Practices to Keep the GPU Busy

  • Submit work in advance
    • So the GPU has some work to execute at any point in time
  • Try to reduce/eliminate work dependencies
  • Have a contingency plan for work overload
    • If feedback shows the budget is exceeded, submit work a few frames ahead and spread it out
  • Plan for the worst case scenario
    • Deal with the GPU reset case, especially for the low-priority work
    • GL robustness extensions
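The first best practice, submitting work in advance, can be modeled as overlapping CPU preparation of the next frame with GPU execution of the current one. The function and the millisecond values are illustrative, not from the talk.

```python
def gpu_idle_total(cpu_prep_ms, gpu_exec_ms, frames, submit_ahead):
    """Total GPU idle time across a sequence of frames.

    Submitting ahead overlaps CPU prep for frame n+1 with GPU
    execution of frame n; in lockstep mode the GPU waits out
    every prep phase.
    """
    if submit_ahead:
        # The GPU only stalls if prep takes longer than execution.
        idle_per_frame = max(0.0, cpu_prep_ms - gpu_exec_ms)
    else:
        # Lockstep: the GPU sits idle during every CPU prep phase.
        idle_per_frame = cpu_prep_ms
    return idle_per_frame * (frames - 1)

print(gpu_idle_total(2.0, 8.0, 10, submit_ahead=False))  # 18.0 ms idle
print(gpu_idle_total(2.0, 8.0, 10, submit_ahead=True))   # 0.0 ms idle
```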

SLIDE 19

CONCLUSION

GPU Synchronization & Scheduling Approaches

Real-Time behavior:

  • Determinism
  • Freedom from Interference
  • Priority of Functionalities

Performance:

  • Maximum Throughput
  • Minimal Latency
SLIDE 20

ACKNOWLEDGEMENTS

  • Scott Whitman, NVIDIA
  • Vladislav Buzov, NVIDIA
  • Amit Rao, NVIDIA
  • Yogesh Kini, NVIDIA

GTC Instructor-Led Lab: L7105 – EGLSTREAMS: INTEROPERABILITY OF CAMERA, CUDA AND OPENGL, 11th May 2017, 9:30-11:30 AM, Room LL21D

SLIDE 21

Q&A

SLIDE 22

THANK YOU