SLIDE 1

Real-Time GPU Management

Heechul Yun

SLIDE 2

This Week

  • Topic: General Purpose Graphics Processing Unit (GPGPU) management

  • Today

– GPU architecture
– GPU programming model
– Challenges
– Real-Time GPU management

SLIDE 3

History

  • GPU

– Graphics rendering is embarrassingly parallel by nature
– GeForce 6800 (2003): 53 GFLOPS (MUL)
– Some PhDs tried to use GPUs for general-purpose computing, but they were difficult to program

  • GPGPU

– Ian Buck (Stanford PhD, 2004) joined Nvidia and created the CUDA language and runtime
– General purpose: (relatively) easy to program; many scientific applications

SLIDE 4

Discrete GPU

  • Add-on PCIe cards on PC

– GPU and CPU memories are separate
– GPU memory (GDDR) is much faster than CPU memory (DDR)

[Figure: Nvidia Tesla K80 (4992 GPU cores, graphics DRAM) and Intel Core i7 (4 CPU cores, host DRAM), connected via PCIe 3.0]

SLIDE 5

Integrated CPU-GPU SoC

  • Tighter integration of CPU and GPU

– Memory is shared by both CPU and GPU
– Good for embedded systems (e.g., smartphones)

[Figure: Nvidia Tegra K1, with four CPU cores and GPU cores sharing a memory controller and shared DRAM]

SLIDE 6

NVIDIA Titan Xp

  • 3840 CUDA cores, 12GB GDDR5X
  • Peak performance: 12 TFLOPS

SLIDE 7

NVIDIA Jetson TX2

  • 256 CUDA GPU cores + 4 CPU cores

Image credit: T. Amert et al., “GPU Scheduling on the NVIDIA TX2: Hidden Details Revealed,” RTSS'17

SLIDE 8

NVIDIA Jetson Platforms

SLIDE 9

CPU vs. GPGPU

  • CPU

– Designed to run sequential programs faster
– High ILP: pipeline, superscalar, out-of-order, multi-level cache hierarchy
– Powerful, but complex and big

  • GPGPU

– Designed to compute math faster over embarrassingly parallel data (e.g., pixels)
– No need for complex logic (no superscalar, out-of-order execution, or caches)
– Simple and less powerful, but small, so many can fit on one chip

SLIDES 10-15

[Image slides] “From Shader Code to a Teraflop: How GPU Shader Cores Work”, Kayvon Fatahalian, Stanford University

SLIDE 16

GPU Programming Model

  • Host = CPU
  • Device = GPU
  • Kernel

– Function that executes on the device
– Multiple threads execute each kernel
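The kernel/thread relationship above can be sketched in plain Python (an illustration of CUDA's SIMT indexing, not GPU code): a kernel is a function run once per thread, and each thread derives a global index from its block and thread indices.

```python
# Plain-Python sketch of the CUDA programming model (illustration only):
# a "kernel" runs once per thread; each thread derives a global index
# from its (block_idx, thread_idx) pair and handles one element.

def vec_add_kernel(block_idx, thread_idx, block_dim, a, b, c):
    i = block_idx * block_dim + thread_idx   # global thread index
    if i < len(a):                           # guard for the partial last block
        c[i] = a[i] + b[i]

def launch(kernel, grid_dim, block_dim, *args):
    # The GPU would run these "threads" in parallel; we emulate sequentially.
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, thread_idx, block_dim, *args)

n = 10
a = list(range(n)); b = [10] * n; c = [0] * n
grid_dim = (n + 3) // 4                      # ceil(n / block_dim), block_dim = 4
launch(vec_add_kernel, grid_dim, 4, a, b, c)
print(c)  # [10, 11, 12, ..., 19]
```

The index guard mirrors real CUDA kernels, where the grid is rounded up to whole blocks and out-of-range threads must do nothing.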

SLIDES 17-20

[Image slides] Source: http://www.sdsc.edu/us/training/assets/docs/NVIDIA-02-BasicsOfCUDA.pdf

SLIDE 21

Challenges for Discrete GPU

  • Data movement problem

– Host memory <-> GPU memory
– Copy overhead can be high

  • Scheduling problem

– Limited ability to prioritize important GPU kernels
– Most (old) GPUs don’t support preemption
– New GPUs support preemption within a process

[Figure: copies between a user buffer and a kernel buffer on the host, then across PCIe 3.0 to graphics DRAM on the GPU]

SLIDE 22

Data Movement Challenge

[Figure: Nvidia Tesla K80 (4992 GPU cores, graphics DRAM at 480 GB/s) and Intel Core i7 (4 CPU cores, host DRAM at 25 GB/s), connected via PCIe 3.0 at 16 GB/s]

Data transfer is the bottleneck
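A back-of-the-envelope check of these figures (a sketch that ignores latency and setup costs) shows why the interconnect dominates:

```python
# Rough model of moving 1 GiB of data, using the slide's bandwidth figures.
# Ignores latency/setup costs; a sketch, not a measurement.

GIB = 2**30
gpu_mem_bw = 480e9   # K80 graphics DRAM, bytes/s
cpu_mem_bw = 25e9    # host DDR, bytes/s
pcie_bw    = 16e9    # PCIe 3.0, bytes/s

data = 1 * GIB
t_pcie = data / pcie_bw      # time to cross the interconnect
t_gddr = data / gpu_mem_bw   # time to stream the same data from GDDR

print(f"PCIe transfer: {t_pcie * 1e3:.1f} ms")   # ~67 ms
print(f"GDDR read:     {t_gddr * 1e3:.1f} ms")   # ~2.2 ms
print(f"PCIe is ~{t_pcie / t_gddr:.0f}x slower than GPU memory")
```

Moving the data costs roughly 30x more time than the GPU would need to stream it from its own memory, which is why copy overhead dominates small or data-heavy kernels.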

SLIDE 23

An Example


PTask: Operating System Abstractions To Manage GPUs as Compute Devices, SOSP'11

SLIDE 24

Inefficient Data Migration

[Figure: pipeline stages capture, xform, filter, and detect crossing the user/kernel boundary, with repeated read()/write() calls, user/kernel buffer copies, and PCIe transfers to and from the GPU at each stage]

#> capture | xform | filter | detect &

A lot of copies

Acknowledgement: This slide is from the paper author’s slides
SLIDE 25

Scheduling Challenge

  • CPU priorities do not apply to the GPU
  • A long-running GPU task (xform) is not preemptible, delaying a short GPU task (a mouse update)

PTask: Operating System Abstractions To Manage GPUs as Compute Devices, SOSP'11

SLIDE 26

Challenges for Integrated CPU-GPU

  • Memory is shared by both CPU and GPU
  • Data movement may be easier but…

[Figure: Nvidia Tegra X2, with four CPU cores (each with a performance monitoring counter, PMC) and GPU cores behind a shared memory controller and shared DRAM, at 16 GB/s]

SLIDE 27

Memory Bandwidth Contention

[Figure: GPU kernel slowdown with increasing numbers of memory-intensive CPU co-runners on an integrated CPU-GPU SoC]

Co-scheduling memory-intensive CPU tasks affects GPU performance

Waqar Ali, Heechul Yun. Protecting Real-Time GPU Kernels on Integrated CPU-GPU SoC Platforms. Euromicro Conference on Real-Time Systems (ECRTS), 2018

SLIDE 28

Summary

  • GPU Architecture

– Many simple in-order cores

  • GPU Programming Model

– SIMD

  • Challenges

– Data movement cost
– Scheduling
– Bandwidth bottleneck
– NOT time predictable!

SLIDE 29

Real-Time GPU Management

  • Goal

– Time-predictable and efficient GPU sharing in a multi-tasking environment

  • Challenges

– High data copy overhead
– Real-time scheduling support (preemption)
– Shared resource (bandwidth) contention

SLIDE 30

References

  • TimeGraph: GPU scheduling for real-time multi-tasking environments. In ATC, 2011.
  • Gdev: First-class GPU resource management in the operating system. In ATC, 2012.
  • GPES: A preemptive execution system for GPGPU computing. In RTAS, 2015.
  • GPUSync: A framework for real-time GPU management. In RTSS, 2013.
  • A server based approach for predictable GPU access control. In RTCSA, 2017.

SLIDE 31

Real-Time GPU Scheduling

  • Early real-time GPU schedulers

– TimeGraph
– Gdev

  • GPU kernel slicing

– GPES

  • Synchronization (Lock) based approach

– GPUSync

  • Server based approach

– GPU Server

SLIDE 32

GPU Software Stack


Acknowledgement: This slide is from the paper author’s slide Gdev: First-class gpu resource management in the operating system. In ATC, 2012.

SLIDE 33

TimeGraph

  • First work to support “soft” real-time GPU scheduling.
  • Implemented at the device driver level

  • S. Kato, K. Lakshmanan, R. R. Rajkumar, and Y. Ishikawa, “TimeGraph: GPU scheduling for real-time multi-tasking environments,” in USENIX ATC, 2011

SLIDE 34

TimeGraph Scheduling

  • GPU commands are not immediately sent to the GPU if it is busy
  • Schedule high-priority GPU commands when GPU becomes idle
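The two rules above can be sketched as a small priority queue (a toy model, not TimeGraph's actual driver code): commands arriving while the GPU is busy are queued, and the highest-priority queued command is dispatched when the GPU goes idle.

```python
import heapq

# Toy sketch of TimeGraph-style GPU command scheduling (not the real driver):
# commands submitted while the GPU is busy are queued; on idle, the
# highest-priority queued command is dispatched next.

class GpuScheduler:
    def __init__(self):
        self.queue = []    # min-heap of (priority, seq, cmd); 0 = highest
        self.busy = False
        self._seq = 0      # FIFO tie-break among equal priorities

    def submit(self, cmd, priority):
        if not self.busy:
            self.busy = True
            return cmd     # GPU idle: send to hardware immediately
        heapq.heappush(self.queue, (priority, self._seq, cmd))
        self._seq += 1
        return None        # queued for later

    def on_gpu_idle(self):
        if not self.queue:
            self.busy = False
            return None
        _, _, cmd = heapq.heappop(self.queue)
        return cmd         # dispatch highest-priority queued command

sched = GpuScheduler()
print(sched.submit("render_lowprio", 5))  # GPU idle -> dispatched immediately
sched.submit("xform", 5)                  # busy -> queued
sched.submit("mouse_update", 0)           # busy -> queued (high priority)
print(sched.on_gpu_idle())                # -> "mouse_update" runs before
print(sched.on_gpu_idle())                # -> "xform"
```

Note that once a command reaches the hardware it runs to completion; the scheduler only reorders what has not yet been dispatched, which is exactly the limitation the next slides discuss.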

SLIDE 35

GDev

  • Implemented at the kernel level on top of the stock GPU driver


Acknowledgement: This slide is from the paper author’s slide Gdev: First-class gpu resource management in the operating system. In ATC, 2012.

SLIDE 36

TimeGraph Scheduling

  • High-priority tasks can still suffer long delays
  • Due to the lack of hardware preemption


Acknowledgement: This slide is from the paper author’s slide Gdev: First-class gpu resource management in the operating system. In ATC, 2012.

SLIDE 37

GDev’s BAND Scheduler

  • Monitors consumed bandwidth and adds delay to wait for high-priority requests
  • Non-work-conserving, so no real-time guarantee


Acknowledgement: This slide is from the paper author’s slide Gdev: First-class gpu resource management in the operating system. In ATC, 2012.

SLIDE 38

GPES

  • Based on Gdev
  • Implements kernel slicing to reduce latency
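Why slicing reduces latency can be seen with a rough calculation (the numbers below are hypothetical, not from the paper): on a non-preemptive GPU, a newly arrived high-priority kernel can be blocked for the remaining length of the running kernel, but slicing bounds that blocking to roughly one slice.

```python
# Why slicing helps: worst-case blocking drops from the whole kernel's
# length to one slice (plus per-slice relaunch overhead).
# All numbers below are hypothetical, for illustration only.

kernel_len_ms = 40.0     # long low-priority kernel
n_slices      = 20       # split into 20 equal slices
launch_ovh_ms = 0.1      # assumed per-slice relaunch overhead

blocking_unsliced = kernel_len_ms                # must wait out the whole kernel
slice_len = kernel_len_ms / n_slices
blocking_sliced = slice_len + launch_ovh_ms      # wait out at most one slice
total_overhead = n_slices * launch_ovh_ms        # throughput cost of slicing

print(f"worst-case blocking, unsliced: {blocking_unsliced:.1f} ms")
print(f"worst-case blocking, sliced:   {blocking_sliced:.1f} ms")
print(f"added launch overhead:         {total_overhead:.1f} ms")
```

The trade-off is visible in the last line: finer slicing shrinks blocking but pays more relaunch overhead, so the slice size is a tuning knob.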

(*) GPES: A Preemptive Execution System for GPGPU Computing, RTAS'15

SLIDES 39-43

GPUSync

[Image slides] http://www.ece.ucr.edu/~hyoseung/pdf/rtcsa17-gpu-server-slides.pdf
SLIDE 44

Hardware Preemption

  • Recent GPUs (NVIDIA Pascal) support a hardware preemption capability

– Problem solved?

  • Issues

– Works only between GPU streams within a single address space (process)
– High context switching overhead

  • ~100us per context switch (*)


(*) AnandTech, “Preemption Improved: Fine-Grained Preemption for Time-Critical Tasks”

SLIDE 45

Hardware Preemption

SLIDE 46

Discussion

  • Long-running low-priority GPU kernel?
  • Memory interference from the CPU?

SLIDE 47

Challenges for Integrated CPU-GPU

  • Memory is shared by both CPU and GPU
  • Data movement may be easier but…

[Figure: Nvidia Tegra X2, with four CPU cores (each with a performance monitoring counter, PMC) and GPU cores behind a shared memory controller and shared DRAM, at 16 GB/s]

SLIDE 48

References

  • SiGAMMA: server based integrated GPU arbitration mechanism for memory accesses, RTNS, 2017
  • GPUguard: towards supporting a predictable execution model for heterogeneous SoC, DATE, 2017
  • Protecting Real-Time GPU Applications on Integrated CPU-GPU SoC Platforms, ECRTS, 2018

SLIDE 49

SiGAMMA

  • Protects PREM-compliant real-time CPU tasks
  • Throttles the GPU when the CPU is in a memory phase

SLIDE 50

SiGAMMA

  • The GPU is throttled by launching a high-priority spinning kernel

SLIDE 51

GPUGuard

  • PREM-compliant GPU and CPU tasks
  • The CPU is throttled using MemGuard
  • Needs GPU source code modification

SLIDE 52

BWLOCK++

  • Protects real-time GPU tasks by throttling the CPU
  • Runtime binary instrumentation overriding the CUDA API

– No code modification is needed

Waqar Ali, Heechul Yun. Protecting Real-Time GPU Kernels on Integrated CPU-GPU SoC Platforms. Euromicro Conference on Real-Time Systems (ECRTS), 2018

SLIDE 53

Dynamic Instrumentation

  • Begin/stop throttling by instrumenting CUDA

[Figure: CPU-side CUDA calls (cudaMalloc, cudaMemcpy, kernel<<<...>>>, cudaFree); the binary is instrumented via LD_PRELOAD so that cudaLaunch() acquires the memory bandwidth lock and cudaSynchronize() releases it]

No source code modification is needed
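The interposition idea can be sketched in Python (an analogue only: BWLOCK++ itself wraps the CUDA runtime's entry points with an LD_PRELOAD C shim, and the function names below are stand-ins, not real CUDA symbols):

```python
# Python analogue of LD_PRELOAD interposition: replace the launch and
# synchronize entry points with wrappers that hold a "bandwidth lock"
# around the kernel's execution window. Application code is untouched.
# All names here are illustrative stand-ins, not the real CUDA API.

events = []

def bwlock_acquire():  events.append("acquire")   # stand-in for the real lock
def bwlock_release():  events.append("release")

def cuda_launch(kernel_name):                     # stand-in for cudaLaunch()
    events.append(f"launch {kernel_name}")

def cuda_synchronize():                           # stand-in for cudaSynchronize()
    events.append("sync")

# Interpose: save the originals, then rebind the names to locking wrappers,
# the way LD_PRELOAD substitutes symbols at load time.
_real_launch, _real_sync = cuda_launch, cuda_synchronize

def cuda_launch(kernel_name):
    bwlock_acquire()          # throttle CPU cores before the kernel starts
    _real_launch(kernel_name)

def cuda_synchronize():
    _real_sync()
    bwlock_release()          # lift throttling once the kernel finishes

cuda_launch("detect")         # application code calls the same names as before
cuda_synchronize()
print(events)  # ['acquire', 'launch detect', 'sync', 'release']
```

Because only the symbol bindings change, the protected application needs no recompilation, which is the point of the slide.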

SLIDE 54

BWLOCK++

  • Real-time GPU kernels are protected.

SLIDE 55

Summary

  • Integrated CPU-GPU SoC

– Shared main memory is a source of interference

  • SiGAMMA

– PREM compliant CPU tasks
– Throttle GPU to protect real-time CPU tasks

  • GPUGuard

– PREM compliant GPU (and CPU) tasks
– Need GPU source code modification

  • BWLOCK++

– Throttle CPU (MemGuard) to protect real-time GPU tasks
– Runtime instrumentation (no code modification)

SLIDE 56

Discussion Papers

  • Deadline-based Scheduling for GPU with Preemption Support, RTSS, 2018
  • Pipelined Data-Parallel CPU/GPU Scheduling for Multi-DNN Real-Time Inference, RTSS, 2019

SLIDE 57

Deadline-based Scheduling for GPU with Preemption Support

  • N. Capodieci, R. Cavicchioli, M. Bertogna, A. Paramakuru. RTSS, 2018

SLIDE 58

Pipelined Data-Parallel CPU/GPU Scheduling for Multi-DNN Real-Time Inference

Yecheng Xiang and Hyoseung Kim RTSS’19

SLIDE 59

Cameras in Self-Driving Car

SLIDE 60

NVIDIA Jetson TX2

SLIDE 61

DNN Layer Processing Time

SLIDE 62

Problem

  • Processing DNNs on a heterogeneous platform (e.g., TX2) is not efficient

– Using one type of resource (e.g., GPU) may not always be the best
– CPUs don’t do much (if anything) while GPU utilization is low for many layers
– Also, sometimes CPUs are faster than the GPU

  • How to process multiple DNNs efficiently on a heterogeneous platform?

SLIDE 63

Key Ideas

  • Group layers into multiple stages
  • A stage can be processed on different computing nodes
  • Dynamically match stages to nodes to maximize performance while respecting real-time requirements
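A toy version of the matching idea (hypothetical per-stage costs; the actual scheduler also pipelines stages across frames and enforces deadlines) greedily places each stage on the node that would finish it earliest:

```python
# Toy sketch of matching DNN stages to heterogeneous compute nodes.
# The per-stage costs are hypothetical, for illustration only; the real
# scheduler also pipelines stages and checks real-time requirements.

stage_cost = {                 # ms per stage on each node type
    "conv_layers": {"GPU": 8.0, "CPU": 30.0},
    "pool_layers": {"GPU": 2.0, "CPU": 3.0},
    "fc_layers":   {"GPU": 4.0, "CPU": 5.0},
    "postprocess": {"GPU": 6.0, "CPU": 2.0},   # CPU faster here
}

node_busy_until = {"GPU": 0.0, "CPU": 0.0}     # accumulated work per node
assignment = {}

for stage, costs in stage_cost.items():
    # earliest-finish-time choice given work already placed on each node
    node = min(costs, key=lambda n: node_busy_until[n] + costs[n])
    node_busy_until[node] += costs[node]
    assignment[stage] = node

print(assignment)       # conv goes to the GPU; later stages spill to the CPU
print(node_busy_until)
```

Even this greedy sketch shows the slide's point: once the GPU is loaded with the convolution-heavy stage, otherwise "GPU-friendly" stages can finish earlier on the idle CPU, so using only one resource type leaves performance on the table.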

SLIDES 64-65 (image slides)

SLIDE 66

Summary & Discussion

  • Efficiently schedule multiple DNN models on a heterogeneous computing platform
  • Exploit DNN layer-level differences in the best computing resource configurations
  • Improve throughput and support real-time scheduling
  • Downsides?
