SLIDE 1

Real-Time GPU Management

Heechul Yun

SLIDE 2

This Week

  • Topic: General Purpose Graphics Processing Unit (GPGPU) management

  • Today

– GPU architecture
– GPU programming model
– Challenges
– Real-Time GPU management

SLIDE 3

History

  • GPU

– Graphics rendering is embarrassingly parallel by nature
– GeForce 6800 (2003): 53 GFLOPS (MUL)
– Some PhDs tried to use GPUs for general-purpose computing, but they were difficult to program

  • GPGPU

– Ian Buck (Stanford PhD, 2004) joined Nvidia and created the CUDA language and runtime
– General purpose: (relatively) easy to program; many scientific applications

SLIDE 4

Discrete GPU

  • Add-on PCIe cards on PC

– GPU and CPU memories are separate
– GPU memory (GDDR) is much faster than CPU memory (DDR)

[Figure: Nvidia Tesla K80 (4992 GPU cores, graphics DRAM) and Intel Core i7 (4 CPU cores, host DRAM), connected via PCIe 3.0]

SLIDE 5

Integrated CPU-GPU SoC

  • Tighter integration of CPU and GPU

– Memory is shared by both CPU and GPU
– Good for embedded systems (e.g., smartphones)

[Figure: Nvidia Tegra K1, with four CPU cores and GPU cores sharing a memory controller and shared DRAM]

SLIDE 6

NVIDIA Titan Xp

  • 3840 CUDA cores, 12GB GDDR5X
  • Peak performance: 12 TFLOPS

SLIDE 7

NVIDIA Jetson TX2

  • 256 CUDA GPU cores + 4 CPU cores

Image credit: T. Amert et al., “GPU Scheduling on the NVIDIA TX2: Hidden Details Revealed,” RTSS'17

SLIDE 8

NVIDIA Jetson Platforms

SLIDE 9

CPU vs. GPGPU

  • CPU

– Designed to run sequential programs faster
– High ILP: pipeline, superscalar, out-of-order, multi-level cache hierarchy
– Powerful, but complex and big

  • GPGPU

– Designed to compute math faster over embarrassingly parallel data (e.g., pixels)
– No need for complex logic (no superscalar, out-of-order execution, or caches)
– Simple and less powerful, but small, so many can fit on one chip

SLIDES 10-15

[Image slides] “From Shader Code to a Teraflop: How GPU Shader Cores Work”, Kayvon Fatahalian, Stanford University

SLIDE 16

GPU Programming Model

  • Host = CPU
  • Device = GPU
  • Kernel

– Function that executes on the device
– Multiple threads execute each kernel
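The kernel/thread relationship above can be sketched in plain Python (an illustration of CUDA's SIMT indexing, not GPU code): a kernel is a function run once per thread, and each thread derives a global index from its block and thread indices.

```python
# Plain-Python sketch of the CUDA programming model (illustration only):
# a "kernel" runs once per thread; each thread derives a global index
# from its (block_idx, thread_idx) pair and handles one element.

def vec_add_kernel(block_idx, thread_idx, block_dim, a, b, c):
    i = block_idx * block_dim + thread_idx   # global thread index
    if i < len(a):                           # guard for the partial last block
        c[i] = a[i] + b[i]

def launch(kernel, grid_dim, block_dim, *args):
    # The GPU would run these "threads" in parallel; we emulate sequentially.
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, thread_idx, block_dim, *args)

n = 10
a = list(range(n)); b = [10] * n; c = [0] * n
grid_dim = (n + 3) // 4                      # ceil(n / block_dim), block_dim = 4
launch(vec_add_kernel, grid_dim, 4, a, b, c)
print(c)  # [10, 11, 12, ..., 19]
```

The index guard mirrors real CUDA kernels, where the grid is rounded up to whole blocks and out-of-range threads must do nothing.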

SLIDES 17-20

[Image slides] Source: http://www.sdsc.edu/us/training/assets/docs/NVIDIA-02-BasicsOfCUDA.pdf

SLIDE 21

Challenges for Discrete GPU

  • Data movement problem

– Host memory <-> GPU memory
– Copy overhead can be high

  • Scheduling problem

– Limited ability to prioritize important GPU kernels
– Most (old) GPUs don’t support preemption
– New GPUs support preemption within a process

[Figure: copies between a user buffer and a kernel buffer on the host, then across PCIe 3.0 to graphics DRAM on the GPU]

SLIDE 22

Data Movement Challenge

[Figure: Nvidia Tesla K80 (4992 GPU cores, graphics DRAM at 480 GB/s) and Intel Core i7 (4 CPU cores, host DRAM at 25 GB/s), connected via PCIe 3.0 at 16 GB/s]

Data transfer is the bottleneck
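A back-of-the-envelope check of these figures (a sketch that ignores latency and setup costs) shows why the interconnect dominates:

```python
# Rough model of moving 1 GiB of data, using the slide's bandwidth figures.
# Ignores latency/setup costs; a sketch, not a measurement.

GIB = 2**30
gpu_mem_bw = 480e9   # K80 graphics DRAM, bytes/s
cpu_mem_bw = 25e9    # host DDR, bytes/s
pcie_bw    = 16e9    # PCIe 3.0, bytes/s

data = 1 * GIB
t_pcie = data / pcie_bw      # time to cross the interconnect
t_gddr = data / gpu_mem_bw   # time to stream the same data from GDDR

print(f"PCIe transfer: {t_pcie * 1e3:.1f} ms")   # ~67 ms
print(f"GDDR read:     {t_gddr * 1e3:.1f} ms")   # ~2.2 ms
print(f"PCIe is ~{t_pcie / t_gddr:.0f}x slower than GPU memory")
```

Moving the data costs roughly 30x more time than the GPU would need to stream it from its own memory, which is why copy overhead dominates small or data-heavy kernels.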

SLIDE 23

An Example


PTask: Operating System Abstractions To Manage GPUs as Compute Devices, SOSP'11

SLIDE 24

Inefficient Data Migration

[Figure: pipeline stages capture, xform, filter, and detect crossing the user/kernel boundary, with repeated read()/write() calls, user/kernel buffer copies, and PCIe transfers to and from the GPU at each stage]

#> capture | xform | filter | detect &

A lot of copies

Acknowledgement: This slide is from the paper author’s slides
SLIDE 25

Scheduling Challenge

  • CPU priorities do not apply to the GPU
  • A long-running GPU task (xform) is not preemptible, delaying a short GPU task (a mouse update)

PTask: Operating System Abstractions To Manage GPUs as Compute Devices, SOSP'11

SLIDE 26

Challenges for Integrated CPU-GPU

  • Memory is shared by both CPU and GPU
  • Data movement may be easier but…

[Figure: Nvidia Tegra X2, with four CPU cores (each with a performance monitoring counter, PMC) and GPU cores behind a shared memory controller and shared DRAM, at 16 GB/s]

SLIDE 27

Memory Bandwidth Contention

[Figure: GPU kernel slowdown with increasing numbers of memory-intensive CPU co-runners on an integrated CPU-GPU SoC]

Co-scheduling memory-intensive CPU tasks affects GPU performance

Waqar Ali, Heechul Yun. Protecting Real-Time GPU Kernels on Integrated CPU-GPU SoC Platforms. Euromicro Conference on Real-Time Systems (ECRTS), 2018

SLIDE 28

Summary

  • GPU Architecture

– Many simple in-order cores

  • GPU Programming Model

– SIMD

  • Challenges

– Data movement cost
– Scheduling
– Bandwidth bottleneck
– NOT time predictable!

SLIDE 29

Real-Time GPU Management

  • Goal

– Time-predictable and efficient GPU sharing in a multi-tasking environment

  • Challenges

– High data copy overhead
– Real-time scheduling support (preemption)
– Shared resource (bandwidth) contention

SLIDE 30

References

  • TimeGraph: GPU scheduling for real-time multi-tasking environments. In ATC, 2011.
  • Gdev: First-class GPU resource management in the operating system. In ATC, 2012.
  • GPES: A preemptive execution system for GPGPU computing. In RTAS, 2015.
  • GPUSync: A framework for real-time GPU management. In RTSS, 2013.
  • A server based approach for predictable GPU access control. In RTCSA, 2017.

SLIDE 31

Real-Time GPU Scheduling

  • Early real-time GPU schedulers

– TimeGraph
– Gdev

  • GPU kernel slicing

– GPES

  • Synchronization (Lock) based approach

– GPUSync

  • Server based approach

– GPU Server

SLIDE 32

GPU Software Stack


Acknowledgement: This slide is from the paper author’s slide Gdev: First-class gpu resource management in the operating system. In ATC, 2012.

SLIDE 33

TimeGraph

  • First work to support “soft” real-time GPU scheduling.
  • Implemented at the device driver level

  • S. Kato, K. Lakshmanan, R. R. Rajkumar, and Y. Ishikawa, “TimeGraph: GPU scheduling for real-time multi-tasking environments,” in USENIX ATC, 2011

SLIDE 34

TimeGraph Scheduling

  • GPU commands are not immediately sent to the GPU if it is busy
  • Schedule high-priority GPU commands when GPU becomes idle
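The two rules above can be sketched as a small priority queue (a toy model, not TimeGraph's actual driver code): commands arriving while the GPU is busy are queued, and the highest-priority queued command is dispatched when the GPU goes idle.

```python
import heapq

# Toy sketch of TimeGraph-style GPU command scheduling (not the real driver):
# commands submitted while the GPU is busy are queued; on idle, the
# highest-priority queued command is dispatched next.

class GpuScheduler:
    def __init__(self):
        self.queue = []    # min-heap of (priority, seq, cmd); 0 = highest
        self.busy = False
        self._seq = 0      # FIFO tie-break among equal priorities

    def submit(self, cmd, priority):
        if not self.busy:
            self.busy = True
            return cmd     # GPU idle: send to hardware immediately
        heapq.heappush(self.queue, (priority, self._seq, cmd))
        self._seq += 1
        return None        # queued for later

    def on_gpu_idle(self):
        if not self.queue:
            self.busy = False
            return None
        _, _, cmd = heapq.heappop(self.queue)
        return cmd         # dispatch highest-priority queued command

sched = GpuScheduler()
print(sched.submit("render_lowprio", 5))  # GPU idle -> dispatched immediately
sched.submit("xform", 5)                  # busy -> queued
sched.submit("mouse_update", 0)           # busy -> queued (high priority)
print(sched.on_gpu_idle())                # -> "mouse_update" runs before
print(sched.on_gpu_idle())                # -> "xform"
```

Note that once a command reaches the hardware it runs to completion; the scheduler only reorders what has not yet been dispatched, which is exactly the limitation the next slides discuss.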

SLIDE 35

GDev

  • Implemented at the kernel level on top of the stock GPU driver


Acknowledgement: This slide is from the paper author’s slide Gdev: First-class gpu resource management in the operating system. In ATC, 2012.

SLIDE 36

TimeGraph Scheduling

  • High-priority tasks can still suffer long delays
  • Due to the lack of hardware preemption


Acknowledgement: This slide is from the paper author’s slide Gdev: First-class gpu resource management in the operating system. In ATC, 2012.

SLIDE 37

GDev’s BAND Scheduler

  • Monitors consumed bandwidth and adds delay to wait for high-priority requests
  • Non-work-conserving, so no real-time guarantee


Acknowledgement: This slide is from the paper author’s slide Gdev: First-class gpu resource management in the operating system. In ATC, 2012.

SLIDE 38

GPES

  • Based on Gdev
  • Implements kernel slicing to reduce latency
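Why slicing reduces latency can be seen with a rough calculation (the numbers below are hypothetical, not from the paper): on a non-preemptive GPU, a newly arrived high-priority kernel can be blocked for the remaining length of the running kernel, but slicing bounds that blocking to roughly one slice.

```python
# Why slicing helps: worst-case blocking drops from the whole kernel's
# length to one slice (plus per-slice relaunch overhead).
# All numbers below are hypothetical, for illustration only.

kernel_len_ms = 40.0     # long low-priority kernel
n_slices      = 20       # split into 20 equal slices
launch_ovh_ms = 0.1      # assumed per-slice relaunch overhead

blocking_unsliced = kernel_len_ms                # must wait out the whole kernel
slice_len = kernel_len_ms / n_slices
blocking_sliced = slice_len + launch_ovh_ms      # wait out at most one slice
total_overhead = n_slices * launch_ovh_ms        # throughput cost of slicing

print(f"worst-case blocking, unsliced: {blocking_unsliced:.1f} ms")
print(f"worst-case blocking, sliced:   {blocking_sliced:.1f} ms")
print(f"added launch overhead:         {total_overhead:.1f} ms")
```

The trade-off is visible in the last line: finer slicing shrinks blocking but pays more relaunch overhead, so the slice size is a tuning knob.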

(*) GPES: A Preemptive Execution System for GPGPU Computing, RTAS'15

SLIDES 39-43

GPUSync

[Image slides] http://www.ece.ucr.edu/~hyoseung/pdf/rtcsa17-gpu-server-slides.pdf
SLIDE 44

Hardware Preemption

  • Recent GPUs (NVIDIA Pascal) support a hardware preemption capability

– Problem solved?

  • Issues

– Works only between GPU streams within a single address space (process)
– High context switching overhead

  • ~100us per context switch (*)


(*) AnandTech, “Preemption Improved: Fine-Grained Preemption for Time-Critical Tasks”

SLIDE 45

Hardware Preemption

SLIDE 46

Discussion

  • Long-running low-priority GPU kernel?
  • Memory interference from the CPU?

SLIDE 47

Challenges for Integrated CPU-GPU

  • Memory is shared by both CPU and GPU
  • Data movement may be easier but…

[Figure: Nvidia Tegra X2, with four CPU cores (each with a performance monitoring counter, PMC) and GPU cores behind a shared memory controller and shared DRAM, at 16 GB/s]

SLIDE 48

References

  • SiGAMMA: server based integrated GPU arbitration mechanism for memory accesses, RTNS, 2017
  • GPUguard: towards supporting a predictable execution model for heterogeneous SoC, DATE, 2017
  • Protecting Real-Time GPU Applications on Integrated CPU-GPU SoC Platforms, ECRTS, 2018

SLIDE 49

SiGAMMA

  • Protects PREM-compliant real-time CPU tasks
  • Throttles the GPU when the CPU is in a memory phase

SLIDE 50

SiGAMMA

  • The GPU is throttled by launching a high-priority spinning kernel

SLIDE 51

GPUGuard

  • PREM-compliant GPU and CPU tasks
  • The CPU is throttled using MemGuard
  • Needs GPU source code modification

SLIDE 52

BWLOCK++

  • Protects real-time GPU tasks by throttling the CPU
  • Runtime binary instrumentation overriding the CUDA API

– No code modification is needed

Waqar Ali, Heechul Yun. Protecting Real-Time GPU Kernels on Integrated CPU-GPU SoC Platforms. Euromicro Conference on Real-Time Systems (ECRTS), 2018

SLIDE 53

Dynamic Instrumentation

  • Begin/stop throttling by instrumenting CUDA

[Figure: CPU-side CUDA calls (cudaMalloc, cudaMemcpy, kernel<<<...>>>, cudaFree); the binary is instrumented via LD_PRELOAD so that cudaLaunch() acquires the memory bandwidth lock and cudaSynchronize() releases it]

No source code modification is needed
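The interposition idea can be sketched in Python (an analogue only: BWLOCK++ itself wraps the CUDA runtime's entry points with an LD_PRELOAD C shim, and the function names below are stand-ins, not real CUDA symbols):

```python
# Python analogue of LD_PRELOAD interposition: replace the launch and
# synchronize entry points with wrappers that hold a "bandwidth lock"
# around the kernel's execution window. Application code is untouched.
# All names here are illustrative stand-ins, not the real CUDA API.

events = []

def bwlock_acquire():  events.append("acquire")   # stand-in for the real lock
def bwlock_release():  events.append("release")

def cuda_launch(kernel_name):                     # stand-in for cudaLaunch()
    events.append(f"launch {kernel_name}")

def cuda_synchronize():                           # stand-in for cudaSynchronize()
    events.append("sync")

# Interpose: save the originals, then rebind the names to locking wrappers,
# the way LD_PRELOAD substitutes symbols at load time.
_real_launch, _real_sync = cuda_launch, cuda_synchronize

def cuda_launch(kernel_name):
    bwlock_acquire()          # throttle CPU cores before the kernel starts
    _real_launch(kernel_name)

def cuda_synchronize():
    _real_sync()
    bwlock_release()          # lift throttling once the kernel finishes

cuda_launch("detect")         # application code calls the same names as before
cuda_synchronize()
print(events)  # ['acquire', 'launch detect', 'sync', 'release']
```

Because only the symbol bindings change, the protected application needs no recompilation, which is the point of the slide.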

SLIDE 54

BWLOCK++

  • Real-time GPU kernels are protected.

SLIDE 55

Summary

  • Integrated CPU-GPU SoC

– Shared main memory is a source of interference

  • SiGAMMA

– PREM compliant CPU tasks
– Throttle GPU to protect real-time CPU tasks

  • GPUGuard

– PREM compliant GPU (and CPU) tasks
– Need GPU source code modification

  • BWLOCK++

– Throttle CPU (MemGuard) to protect real-time GPU tasks
– Runtime instrumentation (no code modification)

SLIDE 56

Discussion Papers

  • Deadline-based Scheduling for GPU with Preemption Support, RTSS, 2018
  • Pipelined Data-Parallel CPU/GPU Scheduling for Multi-DNN Real-Time Inference, RTSS, 2019

SLIDE 57

Deadline-based Scheduling for GPU with Preemption Support

  • N. Capodieci, R. Cavicchioli, M. Bertogna, A. Paramakuru. RTSS, 2018

SLIDE 58

Pipelined Data-Parallel CPU/GPU Scheduling for Multi-DNN Real-Time Inference

Yecheng Xiang and Hyoseung Kim RTSS’19

SLIDE 59

Cameras in Self-Driving Car

SLIDE 60

NVIDIA Jetson TX2

SLIDE 61

DNN Layer Processing Time

SLIDE 62

Problem

  • Processing DNNs on a heterogeneous platform (e.g., TX2) is not efficient

– Using one type of resource (e.g., GPU) may not always be the best
– CPUs don’t do much (if anything) while GPU utilization is low for many layers
– Also, sometimes CPUs are faster than the GPU

  • How to process multiple DNNs efficiently on a heterogeneous platform?

SLIDE 63

Key Ideas

  • Group layers into multiple stages
  • A stage can be processed on different computing nodes
  • Dynamically match stages to nodes to maximize performance while respecting real-time requirements
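A toy version of the matching idea (hypothetical per-stage costs; the actual scheduler also pipelines stages across frames and enforces deadlines) greedily places each stage on the node that would finish it earliest:

```python
# Toy sketch of matching DNN stages to heterogeneous compute nodes.
# The per-stage costs are hypothetical, for illustration only; the real
# scheduler also pipelines stages and checks real-time requirements.

stage_cost = {                 # ms per stage on each node type
    "conv_layers": {"GPU": 8.0, "CPU": 30.0},
    "pool_layers": {"GPU": 2.0, "CPU": 3.0},
    "fc_layers":   {"GPU": 4.0, "CPU": 5.0},
    "postprocess": {"GPU": 6.0, "CPU": 2.0},   # CPU faster here
}

node_busy_until = {"GPU": 0.0, "CPU": 0.0}     # accumulated work per node
assignment = {}

for stage, costs in stage_cost.items():
    # earliest-finish-time choice given work already placed on each node
    node = min(costs, key=lambda n: node_busy_until[n] + costs[n])
    node_busy_until[node] += costs[node]
    assignment[stage] = node

print(assignment)       # conv goes to the GPU; later stages spill to the CPU
print(node_busy_until)
```

Even this greedy sketch shows the slide's point: once the GPU is loaded with the convolution-heavy stage, otherwise "GPU-friendly" stages can finish earlier on the idle CPU, so using only one resource type leaves performance on the table.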

SLIDES 64-65 (image slides)

SLIDE 66

Summary & Discussion

  • Efficiently schedule multiple DNN models on a heterogeneous computing platform
  • Exploit DNN layer-level differences in the best computing resource configurations
  • Improve throughput and support real-time scheduling
  • Downsides?
