Pegasus: Coordinated Scheduling for Virtualized Accelerator-based Systems
Vishakha Gupta, Karsten Schwan @ Georgia Tech; Niraj Tolia @ Maginatics; Vanish Talwar, Parthasarathy Ranganathan @ HP Labs
USENIX ATC 2011 – Portland, OR, USA
[Timeline: growing adoption of accelerators]
2007: Cell-based PlayStation
2008: Cell-based RoadRunner; programmable GPUs for developers
2009: Popularity of NVIDIA GPU-powered desktops and laptops
2010: Amazon adopts GPUs; Tianhe-1A and Nebulae supercomputers in the Top500
2011: GPUs in cellphones
[Diagram: C-like CUDA-based applications (host portion) launch CUDA kernels through the proprietary NVIDIA driver and CUDA runtime, which reach the GPU over PCIe]
Design flaw: the bulk of the logic sits in drivers, which were meant for simple operations like read, write, and interrupt handling.
Shortcoming: the driver is inaccessible, and one scheduling policy has to fit all workloads.
[Timeline continues: 2010 – AMD, NVIDIA; 2011 – Keeneland]
− With the exception of extensively tuned (e.g. …)
Need for accelerator sharing: resource sharing is now supported in NVIDIA's Fermi architecture.
Concern: can driver scheduling do a good job?
[Box plots: Max, Min, Median, 50%]
The driver can efficiently implement computation and data interactions between host and accelerator.
Limitations: call ordering suffers when sharing; any scheme used is static and cannot adapt to different system expectations.
− Why treat such powerful processing resources as devices?
− How can such heterogeneous resources be managed?
− Are there efficient methods to utilize a heterogeneous pool of resources?
− Can applications share accelerators without a big hit in performance?
− How do you deal with multiple scheduling domains?
− Does coordination obtain any performance gains?
[Diagram: virtualized accelerator platform. The physical platform comprises general-purpose multicores, compute accelerators (NVIDIA GPUs), and traditional devices. The management domain (Dom0) hosts the traditional device drivers, the CUDA runtime + GPU driver, a GPU backend, and a management extension. Each guest VM runs Linux with a GPU frontend, through which GPU applications use the CUDA API.]
NVIDIA's CUDA – Compute Unified Device Architecture – is used for managing GPUs.
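To make the frontend/backend split concrete, here is a minimal sketch of how the guest side can interpose on the CUDA API: a library exports the CUDA entry points and forwards each marshaled call to the frontend driver instead of a local GPU. The transport stub (forward_call), the call IDs, and the argument layout are illustrative assumptions, not Pegasus code.

/* Sketch of CUDA API interposition in the guest. A library exports the
 * CUDA symbols, so applications relink against it unchanged, and every
 * call is marshaled and forwarded toward the frontend driver. */
#include <stdio.h>
#include <stddef.h>

typedef int cudaError_t;            /* stand-in for the real CUDA type */
enum { CALL_CUDA_MALLOC = 1 };      /* hypothetical call identifier */

struct malloc_args { void **devPtr; size_t size; };

/* Stub standing in for the frontend driver's shared-ring transport. */
static cudaError_t forward_call(int call_id, void *args, size_t len)
{
    printf("forwarding call %d (%zu arg bytes) to the backend\n", call_id, len);
    (void)args;
    return 0;                       /* pretend the backend succeeded */
}

/* Same signature as the real cudaMalloc, so guest binaries need no edits. */
cudaError_t cudaMalloc(void **devPtr, size_t size)
{
    struct malloc_args a = { devPtr, size };
    return forward_call(CALL_CUDA_MALLOC, &a, sizeof a);
}

int main(void)
{
    void *dev = NULL;
    cudaError_t rc = cudaMalloc(&dev, 1 << 20);
    printf("cudaMalloc returned %d\n", rc);
    return 0;
}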
[Diagram: Pegasus frontend/backend call path. In the guest, the interposer library hands application data and CUDA calls to the frontend driver; calls and responses travel over a per-VM Xen shared ring (the call buffer), with shared pages for data. In Dom0, a per-VM polling thread in the Pegasus backend picks up the calls and issues them to the CUDA runtime + driver.]
The polling thread is the VM's representative for call execution. Because it can be queued or scheduled to pick calls and issue them for any amount of time, the accelerator portion of the VM can itself be scheduled. Hence, we define an "accelerator" virtual CPU, or aVCPU.
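A minimal sketch of this call-buffer-plus-polling-thread structure follows, assuming a hypothetical ring layout: the frontend writes marshaled calls into a shared ring, and the backend polling thread (the CPU component of the aVCPU) drains it and issues the real CUDA calls. Scheduling the aVCPU then amounts to deciding when, and for how long, the polling loop runs.

/* Minimal sketch of a per-VM call buffer. Names, layout, and the single
 * producer/consumer discipline are assumptions, not Pegasus source. */
#include <stdint.h>
#include <string.h>
#include <stdio.h>

#define RING_SLOTS 64
#define ARG_BYTES  128

typedef struct {              /* one marshaled CUDA call */
    uint32_t call_id;         /* e.g. memcpy, kernel launch, ... */
    uint8_t  args[ARG_BYTES];
} call_slot_t;

typedef struct {              /* shared between frontend and backend */
    volatile uint32_t prod;   /* frontend advances after writing a slot */
    volatile uint32_t cons;   /* backend advances after issuing the call */
    call_slot_t slots[RING_SLOTS];
} call_ring_t;

/* Frontend side: enqueue a call; fail if the ring is full. */
static int ring_put(call_ring_t *r, uint32_t id, const void *args, size_t n)
{
    if (r->prod - r->cons == RING_SLOTS || n > ARG_BYTES)
        return -1;
    call_slot_t *s = &r->slots[r->prod % RING_SLOTS];
    s->call_id = id;
    memcpy(s->args, args, n);
    __sync_synchronize();     /* publish the slot before moving prod */
    r->prod++;
    return 0;
}

/* Backend side: one iteration of the polling thread. */
static int ring_poll(call_ring_t *r)
{
    if (r->cons == r->prod)
        return 0;             /* nothing pending for this VM */
    call_slot_t *s = &r->slots[r->cons % RING_SLOTS];
    printf("issue CUDA call %u for this VM\n", s->call_id);
    /* ...a real backend would call into the CUDA runtime here... */
    r->cons++;
    return 1;
}

int main(void)
{
    static call_ring_t ring;
    uint32_t fake_args = 42;
    ring_put(&ring, 7, &fake_args, sizeof fake_args);
    while (ring_poll(&ring))
        ;
    return 0;
}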
[Diagram: an aVCPU comprises the polling thread, the CUDA calls + data, and the runtime and driver context; a VCPU comprises data and an execution context.]
VCPU: first-class schedulable entity on a physical CPU.
aVCPU: first-class schedulable entity on a GPU (with a CPU component, due to the execution model).
Together they form a manageable pool of heterogeneous resources.
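As a rough illustration, the state an aVCPU ties together might look like the struct below, mirroring how a VCPU bundles CPU state. The field names are assumptions for exposition, not Pegasus's actual definitions.

/* Hypothetical sketch of the state an aVCPU bundles together. */
#include <pthread.h>
#include <stdio.h>

struct avcpu {
    int           vm_id;     /* owning guest VM */
    int           gpu_id;    /* accelerator it is bound to */
    int           credits;   /* accelerator credits (cf. Xen's credit scheduler) */
    pthread_t     poller;    /* CPU component: the backend polling thread */
    void         *call_ring; /* per-VM shared ring of pending CUDA calls */
    void         *cuda_ctx;  /* CUDA runtime + driver context for this VM */
    struct avcpu *next;      /* link in an accelerator ready queue */
};

int main(void)
{
    struct avcpu v = { .vm_id = 2, .gpu_id = 0, .credits = 512 };
    printf("aVCPU for VM %d bound to GPU %d with %d credits\n",
           v.vm_id, v.gpu_id, v.credits);
    return 0;
}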
Experimental setup: Linux kernel in all domains; CUDA SDK 1.1; Dom2 = 512 credits, Dom3 = 1024 credits.
[Box plots: Max, Min, Median, 50%]
Co-scheduling (CoSched):
− Hypervisor scheduling determines which domain should run on the GPU
− Latency reduction, by occasional unfairness
− Possible waste of resources, e.g. if the domain picked for the GPU has no GPU work
Augmented credits (AugC):
− Scan the hypervisor CPU schedule to temporarily boost credits
− Pick domain(s) for GPU(s) based on GPU credits + remaining CPU credits
− Throughput improvement by the temporary credit boost
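A toy sketch of an AugC-style pick follows, under the assumption that a domain's effective GPU priority is its GPU credits plus its remaining CPU credits whenever the hypervisor schedule has its VCPU co-running. The struct fields and scoring rule are illustrative only.

/* Illustrative AugC-style selection, not the paper's actual code. */
#include <stdio.h>
#include <stddef.h>

struct dom {
    int gpu_credits;
    int cpu_credits_left;
    int corunning;   /* 1 if the hypervisor is running this domain's VCPU */
};

static size_t augc_pick(const struct dom *d, size_t n)
{
    size_t best = 0;
    int best_score = -1;
    for (size_t i = 0; i < n; i++) {
        int score = d[i].gpu_credits;
        if (d[i].corunning)              /* temporary credit boost */
            score += d[i].cpu_credits_left;
        if (score > best_score) { best_score = score; best = i; }
    }
    return best;
}

int main(void)
{
    struct dom doms[] = {
        { .gpu_credits = 512,  .cpu_credits_left = 100, .corunning = 0 },
        { .gpu_credits = 1024, .cpu_credits_left = 50,  .corunning = 1 },
    };
    printf("run domain %zu on the GPU next\n", augc_pick(doms, 2));
    return 0;
}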
Scheduling policies compared:
− Default – GPU-driver based – base case (None)
− Round Robin (RR)
− AccCredit (AccC) – credits based on static profiling
− XenCredit (XC) – use Xen CPU credits
− SLA feedback based (SLAF)
− Augmented Credit (AugC) – temporarily augment GPU credits with CPU credits
− Hypervisor-controlled, i.e. co-scheduled (CoSched)
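For instance, an SLAF-style feedback step might adjust a domain's credits as sketched below. The gains, thresholds, and field names are made-up values for illustration, not the paper's actual controller.

/* Illustrative SLA-feedback credit adjustment. */
#include <stdio.h>

struct dom_sla {
    double expected_ms;  /* per-call latency the domain's SLA expects */
    double observed_ms;  /* latest measured per-call latency */
    int    credits;      /* accelerator credits to adjust */
};

static void slaf_adjust(struct dom_sla *d)
{
    if (d->observed_ms > d->expected_ms)
        d->credits += 32;               /* behind expectation: boost */
    else if (d->credits > 16)
        d->credits -= 16;               /* ahead of expectation: decay */
}

int main(void)
{
    struct dom_sla d = { .expected_ms = 5.0, .observed_ms = 8.5, .credits = 512 };
    slaf_adjust(&d);
    printf("credits after feedback: %d\n", d.credits);
    return 0;
}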
[Diagram: Pegasus scheduling architecture. The physical platform has CPUs C1–C4 and compute accelerators Acc1 and Acc2. In the hypervisor, the CPU scheduler picks VCPUs from CPU ready queues of domains to schedule, ordered by domain credits. In the management domain, the DomA scheduler and an accelerator selection module pick aVCPUs from per-accelerator ready queues, again driven by domain credits, with a monitoring/feedback loop. Guest VMs run accelerator applications split into a host part (VCPU) and an accelerator part (aVCPU); more VCPUs and aVCPUs appear as domains are added.]
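One DomA scheduler tick over the per-accelerator ready queues could look like the sketch below, picking by highest credits as a stand-in for whichever policy is active. All structures are illustrative assumptions, not Pegasus source.

/* Toy DomA scheduler tick for the diagram above. */
#include <stdio.h>

#define N_ACCEL 2
#define N_AVCPU 4

struct avcpu { int vm_id; int credits; int runnable; };

/* One ready queue per accelerator, as in the diagram. */
static struct avcpu ready[N_ACCEL][N_AVCPU] = {
    { { 2, 512, 1 }, { 3, 1024, 1 } },   /* Acc1's queue */
    { { 4, 256, 1 } },                   /* Acc2's queue */
};

/* Pick the runnable aVCPU with the most credits for accelerator acc. */
static struct avcpu *doma_pick(int acc)
{
    struct avcpu *best = NULL;
    for (int i = 0; i < N_AVCPU; i++) {
        struct avcpu *v = &ready[acc][i];
        if (v->runnable && (!best || v->credits > best->credits))
            best = v;
    }
    return best;
}

int main(void)
{
    for (int acc = 0; acc < N_ACCEL; acc++) {
        struct avcpu *v = doma_pick(acc);
        if (v)
            printf("Acc%d: schedule aVCPU of VM %d for one slice\n",
                   acc + 1, v->vm_id);
        /* monitoring/feedback would adjust credits here (cf. SLAF) */
    }
    return 0;
}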
[Plots: speed improvement for most benchmarks; increased number of calls]
CUDA Time: time within the application to execute CUDA calls.
Total Time: total execution time of the benchmark from the command line.
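A minimal sketch of how these two times relate is below. In the slides, Total Time is measured from the command line (e.g. via time); for self-containment the sketch approximates it inside main, and run_cuda_calls/other_work are hypothetical stubs.

/* Illustrative timing of a CUDA-call region vs. the whole run. */
#include <stdio.h>
#include <time.h>

static double now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e3 + ts.tv_nsec / 1e6;
}

static void run_cuda_calls(void) { /* CUDA-call region (stub) */ }
static void other_work(void)     { /* host-only work (stub) */ }

int main(void)
{
    double t0 = now_ms();
    double c0 = now_ms();
    run_cuda_calls();                /* contributes to CUDA Time */
    double cuda_ms = now_ms() - c0;
    other_work();                    /* contributes only to Total Time */
    double total_ms = now_ms() - t0;
    printf("CUDA Time: %.3f ms, Total Time: %.3f ms\n", cuda_ms, total_ms);
    return 0;
}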
[Plots: results with the RR scheduler]
Without resource management, calls can be variably delayed due to interference from other application(s)/domain(s), even in the absence of virtualization.
[Metric: options / time, over all runs]
− Even basic accelerator request scheduling can improve sharing
− While co-scheduling is really useful [CoSched], other methods …
− Devise scheduling methods that coordinate accelerator use
− Performance evaluated on an x86–GPU, Xen-based prototype
− Need for coordination when sharing accelerator resources
− Need for diverse policies when coordinating resource management
− Past experience with the IBM Cell accelerator [Cellule]: its open architecture allows finer-grained control of resources
− Instrumentation support from Ocelot [GTOcelot]: improve admission control, load balancing, and scheduling
− More generic problem, where even processing resources on the …
Related work:
− Accelerator frontend or multi-core programming models: [CUDA], [Georgia Tech Harmony], [Georgia Tech Cellule], [OpenCL]
− Some examples: [Intel Tolapai], [AMD Fusion], [LANL Roadrunner]
− Application domains: [NSF Keeneland], [Amazon Cloud]
− Interaction with higher levels: [PerformancePointsOSR]
− Cluster level: [rCUDA], [Shadowfax]