 
Real-Time GPU Management
Heechul Yun
This Week
• Topic: General-Purpose Graphics Processing Unit (GPGPU) management
• Today
  – GPU architecture
  – GPU programming model
  – Challenges
  – Real-Time GPU management
History
• GPU
  – Graphics is embarrassingly parallel by nature
  – GeForce 6800 (2003): 53 GFLOPS (MUL)
  – Some PhD researchers tried to use GPUs for general-purpose computing, but they were difficult to program
• GPGPU
  – Ian Buck (Stanford PhD, 2004) joined Nvidia and created the CUDA language and runtime
  – General purpose: (relatively) easy to program, many scientific applications
Discrete GPU
[Figure: Intel Core i7 (4 CPU cores, host DRAM) connected over PCIe 3.0 to an Nvidia Tesla K80 (4992 GPU cores, graphics DRAM)]
• Add-on PCIe cards on PC
  – GPU and CPU memories are separate
  – GPU memory (GDDR) has much higher bandwidth than CPU memory (DDR)
Integrated CPU-GPU SoC
[Figure: Nvidia Tegra K1 with four CPU cores (Core1-Core4) and GPU cores behind a shared memory controller and shared DRAM. Text on figure: GPUSync: A Framework for Real-Time GPU Management]
• Tighter integration of CPU and GPU
  – Memory is shared by both CPU and GPU
  – Good for embedded systems (e.g., smartphones)
NVIDIA Titan Xp
• 3840 CUDA cores, 12 GB GDDR5X
• Peak performance: 12 TFLOPS
NVIDIA Jetson TX2
• 256 CUDA GPU cores + 4 CPU cores
Image credit: T. Amert et al., “GPU Scheduling on the NVIDIA TX2: Hidden Details Revealed,” RTSS'17
NVIDIA Jetson Platforms
CPU vs. GPGPU
• CPU
  – Designed to run sequential programs faster
  – High ILP: pipeline, superscalar, out-of-order, multi-level cache hierarchy
  – Powerful, but complex and big
• GPGPU
  – Designed to compute math faster for embarrassingly parallel data (e.g., pixels)
  – No need for complex logic (no superscalar, out-of-order, cache)
  – Simple and less powerful, but small, so many of them fit on a chip
Source: “From Shader Code to a Teraflop: How GPU Shader Cores Work”, Kayvon Fatahalian, Stanford University
GPU Programming Model
• Host = CPU
• Device = GPU
• Kernel
  – Function that executes on the device
  – Multiple threads execute each kernel
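The host/device/kernel terms above can be illustrated with a small CPU-only emulation. This is a sketch, not CUDA: the names `vector_add_kernel` and `launch` are hypothetical, and the "threads" run one after another in a loop, whereas a real device runs them in parallel. It only shows how each thread derives its own index and handles one element.

```python
def vector_add_kernel(block_idx, thread_idx, block_dim, a, b, out):
    """Body executed by ONE logical thread, as a GPU kernel would be."""
    i = block_idx * block_dim + thread_idx  # global thread index
    if i < len(out):                        # guard: the grid may overshoot n
        out[i] = a[i] + b[i]


def launch(kernel, grid_dim, block_dim, *args):
    """Host-side 'launch': call the kernel once per (block, thread) pair.

    Assumption: this sequential loop stands in for parallel execution.
    """
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, thread_idx, block_dim, *args)


n = 10
a = list(range(n))
b = [10 * x for x in a]
out = [0] * n
block_dim = 4                                # threads per block (hypothetical)
grid_dim = (n + block_dim - 1) // block_dim  # enough blocks to cover n elements
launch(vector_add_kernel, grid_dim, block_dim, a, b, out)
```

The bounds check matters because the launch covers 12 logical threads for 10 elements; the last two threads do nothing.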
Source: http://www.sdsc.edu/us/training/assets/docs/NVIDIA-02-BasicsOfCUDA.pdf
Challenges for Discrete GPU
• Data movement problem
  – Host memory <-> GPU memory
  – Copy overhead can be high
• Scheduling problem
  – Limited ability to prioritize important GPU kernels
  – Most (old) GPUs don’t support preemption
  – New GPUs support preemption within a process
[Figure: user buffer (user space) and kernel buffer (kernel space); CPU with 4 CPU cores and host DRAM, GPU with 4992 GPU cores and graphics DRAM, connected by PCIe 3.0]
Data Movement Challenge
[Figure: Intel Core i7 (4 CPU cores) with host DRAM at 25 GB/s; Nvidia Tesla K80 (4992 GPU cores) with graphics DRAM at 480 GB/s; PCIe 3.0 link between them at 16 GB/s]
Data transfer is the bottleneck
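A rough back-of-the-envelope calculation shows why the link dominates. The sketch below assumes the figure's labels mean 480 GB/s for graphics DRAM, 25 GB/s for host DRAM, and 16 GB/s for PCIe 3.0, treats them as sustained rates, and uses a hypothetical 1 GB payload; real transfers add latency and protocol overhead.

```python
GB = 1e9  # bytes per gigabyte (decimal), matching GB/s bandwidth figures


def transfer_seconds(num_bytes, bandwidth_gb_per_s):
    """Ideal time to move num_bytes at the given bandwidth (no overheads)."""
    return num_bytes / (bandwidth_gb_per_s * GB)


payload = 1 * GB                          # hypothetical buffer size
t_pcie = transfer_seconds(payload, 16)    # host <-> GPU copy over PCIe 3.0
t_host = transfer_seconds(payload, 25)    # one pass over host DRAM
t_gddr = transfer_seconds(payload, 480)   # one pass over graphics DRAM

slowdown = t_pcie / t_gddr  # how much slower the copy is than a GPU-side pass

print(f"PCIe copy: {t_pcie * 1e3:.1f} ms")
print(f"Host DRAM: {t_host * 1e3:.1f} ms")
print(f"GPU DRAM:  {t_gddr * 1e3:.2f} ms")
```

Under these assumptions, copying the buffer across PCIe takes about 30 times as long as the GPU needs to stream the same data from its own memory.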
An Example
[Figure: pipeline stages labeled CPU, GPU, GPU, CPU]
PTask: Operating System Abstractions To Manage GPUs as Compute Devices, SOSP'11
Inefficient Data Migration
#> capture | xform | filter | detect &
[Figure: the capture, xform, filter, and detect processes exchange data through write() and read() calls; below the OS executive, camdrv, the GPU driver, and HIDdrv issue copy-to-GPU and copy-from-GPU operations, each a PCI transfer, around the GPU run]
• A lot of copies
Acknowledgement: This slide is from the paper author’s slides
Scheduling Challenge
• CPU priorities do not apply to the GPU
• A long-running GPU task (xform) is not preemptible, so it delays a short GPU task (mouse update)
PTask: Operating System Abstractions To Manage GPUs as Compute Devices, SOSP'11
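The effect can be reproduced with a tiny model of a GPU that runs submitted work in order and never preempts. The function name and the timings (a 50 ms xform, a 0.5 ms mouse update arriving 1 ms later) are made up for illustration; they are not measurements from the PTask paper.

```python
def fifo_nonpreemptive(jobs):
    """Run jobs in submission order without preemption.

    jobs: list of (name, arrival_ms, exec_ms), sorted by arrival.
    Returns {name: response_time_ms}, measured from arrival to completion.
    """
    now = 0.0
    response = {}
    for name, arrival_ms, exec_ms in jobs:
        start = max(now, arrival_ms)   # wait until the GPU is free
        now = start + exec_ms          # job runs to completion
        response[name] = now - arrival_ms
    return response


jobs = [
    ("xform", 0.0, 50.0),         # long-running GPU task (hypothetical length)
    ("mouse_update", 1.0, 0.5),   # short, latency-sensitive task
]
response = fifo_nonpreemptive(jobs)
print(response)
```

The 0.5 ms task ends up with a response time of 49.5 ms, almost all of it spent waiting behind xform, regardless of its CPU priority.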
Challenges for Integrated CPU-GPU
• Memory is shared by both CPU and GPU
• Data movement may be easier but…
[Figure: Nvidia Tegra X2 with four CPU cores (Core1-Core4, each with a PMC) and GPU cores behind a shared memory controller; shared DRAM at 16 GB/s. Text on figure: GPUSync: A Framework for Real-Time GPU Management]
Memory Bandwidth Contention
• Co-scheduling memory-intensive CPU tasks affects GPU performance on an integrated CPU-GPU SoC
[Figure: GPU app on the GPU, co-runners on the CPU (labels 3, 2, 1, 0)]
Waqar Ali, Heechul Yun. Protecting Real-Time GPU Kernels on Integrated CPU-GPU SoC Platforms. Euromicro Conference on Real-Time Systems (ECRTS), 2018
Summary
• GPU Architecture
  – Many simple in-order cores
• GPU Programming Model
  – SIMD
• Challenges
  – Data movement cost
  – Scheduling
  – Bandwidth bottleneck
  – NOT time predictable!
Real-Time GPU Management
• Goal
  – Time-predictable and efficient GPU sharing in a multi-tasking environment
• Challenges
  – High data copy overhead
  – Real-time scheduling support (preemption)
  – Shared resource (bandwidth) contention
References
• TimeGraph: GPU scheduling for real-time multi-tasking environments. In ATC, 2011.
• Gdev: First-class GPU resource management in the operating system. In ATC, 2012.
• GPES: A preemptive execution system for GPGPU computing. In RTAS, 2015.
• GPUSync: A framework for real-time GPU management. In RTSS, 2013.
• A server based approach for predictable GPU access control. In RTCSA, 2017.
Real-Time GPU Scheduling
• Early real-time GPU schedulers
  – TimeGraph
  – Gdev
• GPU kernel slicing
  – GPES
• Synchronization (lock) based approach
  – GPUSync
• Server based approach
  – GPU Server
GPU Software Stack
Acknowledgement: This slide is from the paper author’s slides
Gdev: First-class GPU resource management in the operating system. In ATC, 2012.
TimeGraph
• First work to support “soft” real-time GPU scheduling
• Implemented at the device driver level
S. Kato, K. Lakshmanan, R. R. Rajkumar, and Y. Ishikawa, “TimeGraph: GPU scheduling for real-time multi-tasking environments,” in USENIX ATC, 2011
TimeGraph Scheduling
• If the GPU is busy, GPU commands are queued instead of being sent immediately
• When the GPU becomes idle, the highest-priority queued GPU commands are scheduled
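The queue-then-dispatch policy in these two bullets can be sketched as a small simulation. This is an illustrative model only, not TimeGraph's driver code: `priority_dispatch`, the job tuples, and the timings are all hypothetical, and a lower number means a higher priority.

```python
import heapq


def priority_dispatch(jobs):
    """Non-preemptive dispatch: whenever the GPU goes idle, send the
    highest-priority pending command group.

    jobs: list of (name, arrival, exec_time, priority); lower = more urgent.
    Returns (dispatch_order, finish_times).
    """
    pending = sorted(jobs, key=lambda j: j[1])  # by arrival time
    ready = []                                  # heap of (priority, arrival, name, exec)
    now, i = 0.0, 0
    order, finish = [], {}
    while i < len(pending) or ready:
        # Queue every command group that has arrived by now.
        while i < len(pending) and pending[i][1] <= now:
            name, arrival, exec_time, prio = pending[i]
            heapq.heappush(ready, (prio, arrival, name, exec_time))
            i += 1
        if not ready:                 # GPU idle and nothing queued: jump ahead
            now = pending[i][1]
            continue
        _, _, name, exec_time = heapq.heappop(ready)
        order.append(name)
        now += exec_time              # runs to completion (no preemption)
        finish[name] = now
    return order, finish


jobs = [
    ("low_a", 0.0, 10.0, 2),
    ("low_b", 1.0, 10.0, 2),
    ("high", 2.0, 1.0, 0),
]
order, finish = priority_dispatch(jobs)
print(order, finish)
```

In this run the high-priority job overtakes `low_b` in the queue, but it still has to wait for `low_a`, which was already on the GPU when it arrived.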
Gdev
• Implemented at the kernel level on top of the stock GPU driver
Acknowledgement: This slide is from the paper author’s slides
Gdev: First-class GPU resource management in the operating system. In ATC, 2012.
TimeGraph Scheduling
• High-priority tasks can still suffer long delays
• This is due to the lack of hardware preemption
Acknowledgement: This slide is from the paper author’s slides
Gdev: First-class GPU resource management in the operating system. In ATC, 2012.
Gdev’s BAND Scheduler
• Monitors consumed bandwidth and adds some delay to wait for high-priority requests
• Non-work-conserving, no real-time guarantee
Acknowledgement: This slide is from the paper author’s slides
Gdev: First-class GPU resource management in the operating system. In ATC, 2012.
GPES
• Based on Gdev
• Implements kernel slicing to reduce latency (*)
(*) GPES: A Preemptive Execution System for GPGPU Computing, RTAS'15
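Why slicing reduces latency can be seen from a simple blocking-time calculation. The sketch assumes a low-priority kernel that starts at time 0, equal-length slices with scheduling points only at slice boundaries, and no per-slice launch overhead; `blocking_delay` and the numbers are hypothetical and not taken from the GPES paper.

```python
import math


def blocking_delay(arrival_ms, kernel_ms, slice_ms=None):
    """Time a high-priority request waits for a running low-priority kernel.

    Without slicing, the request waits for the whole kernel to finish.
    With slicing, it waits only until the next slice boundary.
    Assumption: slices have zero launch overhead.
    """
    if arrival_ms >= kernel_ms:
        return 0.0                         # kernel already finished
    if slice_ms is None:
        return kernel_ms - arrival_ms      # unsliced: wait for completion
    next_boundary = math.ceil(arrival_ms / slice_ms) * slice_ms
    return min(next_boundary, kernel_ms) - arrival_ms


unsliced = blocking_delay(arrival_ms=1.0, kernel_ms=50.0)
sliced = blocking_delay(arrival_ms=1.0, kernel_ms=50.0, slice_ms=5.0)
print(unsliced, sliced)
```

In this model the worst-case wait drops from the full kernel length to one slice length. Smaller slices shorten the wait further, but in practice each slice adds launch overhead, which the sketch ignores.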
GPUSync
http://www.ece.ucr.edu/~hyoseung/pdf/rtcsa17-gpu-server-slides.pdf
Hardware Preemption
• Recent GPUs (NVIDIA Pascal) support hardware preemption
  – Problem solved?
• Issues
  – Works only between GPU streams within a single address space (process)
  – High context switching overhead
    • ~100 us per context switch (*)
(*) AnandTech, “Preemption Improved: Fine-Grained Preemption for Time-Critical Tasks”
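To get a feel for the cost, the ~100 us figure can be turned into a fraction of GPU time lost. The switch rate used here (1000 context switches per second) is a made-up example, and the model assumes every switch costs the same and does no useful work.

```python
def preemption_overhead_fraction(switches_per_second, switch_cost_us=100.0):
    """Fraction of each second spent in context switches.

    switch_cost_us defaults to the ~100 us per switch quoted above;
    switches_per_second is a workload-dependent assumption.
    """
    return switches_per_second * switch_cost_us * 1e-6


fraction = preemption_overhead_fraction(1000)  # hypothetical rate
print(f"{fraction:.0%} of GPU time spent switching")
```

At that rate roughly a tenth of the GPU's time goes to switching, which is why frequent preemption is expensive even when the hardware supports it.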
Hardware Preemption
Discussion
• Long-running low-priority GPU kernel?
• Memory interference from the CPU?
Challenges for Integrated CPU-GPU
• Memory is shared by both CPU and GPU
• Data movement may be easier but…
[Figure: Nvidia Tegra X2 with four CPU cores (Core1-Core4, each with a PMC) and GPU cores behind a shared memory controller; shared DRAM at 16 GB/s. Text on figure: GPUSync: A Framework for Real-Time GPU Management]
References
• SiGAMMA: Server based integrated GPU arbitration mechanism for memory accesses. In RTNS, 2017.
• GPUguard: Towards supporting a predictable execution model for heterogeneous SoC. In DATE, 2017.
• Protecting Real-Time GPU Kernels on Integrated CPU-GPU SoC Platforms. In ECRTS, 2018.
SiGAMMA
• Protects PREM (Predictable Execution Model) compliant real-time CPU tasks
• Throttles the GPU when the CPU is in a memory phase
SiGAMMA
• The GPU is throttled by launching a high-priority spinning kernel
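The throttling rule can be sketched as a per-time-slot decision. This is a simplified model under stated assumptions, not SiGAMMA's implementation: time is divided into fixed slots, each CPU core is either in a "mem" or a "compute" phase per slot, and the GPU is throttled whenever any core is in a memory phase. The function `gpu_slots` is hypothetical, and "spin" stands for the high-priority spinning kernel occupying the GPU.

```python
def gpu_slots(cpu_core_phases):
    """Decide, per time slot, whether the GPU application may run.

    cpu_core_phases: one list per CPU core; each entry is "mem" or
    "compute" for that slot. Returns "spin" (GPU throttled by the
    spinning kernel) or "run" (GPU application executes) per slot.
    """
    num_slots = len(cpu_core_phases[0])
    decisions = []
    for slot in range(num_slots):
        any_mem = any(core[slot] == "mem" for core in cpu_core_phases)
        decisions.append("spin" if any_mem else "run")
    return decisions


core0 = ["mem", "compute", "compute", "mem"]
core1 = ["compute", "compute", "mem", "compute"]
decisions = gpu_slots([core0, core1])
print(decisions)
```

In this example the GPU application only runs in the second slot, the one slot where neither core is accessing memory, so CPU memory phases are shielded from GPU traffic at the cost of GPU throughput.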