SLIDE 1

Pegasus: Coordinated Scheduling for Virtualized Accelerator-based Systems

Vishakha Gupta, Karsten Schwan @ Georgia Tech; Niraj Tolia @ Maginatics; Vanish Talwar, Parthasarathy Ranganathan @ HP Labs

USENIX ATC 2011 – Portland, OR, USA

SLIDE 2

Increasing Popularity of Accelerators

  • 2007: IBM Cell-based PlayStation
  • 2008: IBM Cell-based RoadRunner; CUDA-programmable GPUs for developers
  • 2009: Increasing popularity of NVIDIA GPU-powered desktops and laptops
  • 2010: Amazon EC2 adopts GPUs; Tianhe-1A and Nebulae supercomputers in the Top500
  • 2011: Tegras in cellphones; Keeneland
SLIDES 3–7

Example x86-GPU System

  • The host connects to the GPU over PCIe
  • The proprietary NVIDIA driver and CUDA runtime handle:
    − Memory management
    − Communication with the device
    − Scheduling logic
    − Binary translation
  • C-like CUDA-based applications (host portion) launch CUDA kernels on the GPU (a minimal host-portion example follows below)

Design flaw: the bulk of the logic sits in drivers, which were meant for simple operations like read, write, and interrupt handling
Shortcoming: inaccessibility, and one scheduling policy fits all
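To make this concrete, here is a minimal CUDA host portion of the kind the slide describes; every call below passes through the CUDA runtime and the closed NVIDIA driver. The scale kernel and all names are a toy illustration, not one of the paper's benchmarks.

```c
/* Minimal CUDA host portion: memory management, device communication,
 * and a kernel launch, all funneled through the runtime and driver.
 * Build with: nvcc example.cu */
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

__global__ void scale(float *v, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] *= s;
}

int main(void)
{
    const int n = 1 << 20;
    float *host = (float *)malloc(n * sizeof(float));
    float *dev;
    for (int i = 0; i < n; i++) host[i] = 1.0f;

    cudaMalloc(&dev, n * sizeof(float));                /* memory mgmt  */
    cudaMemcpy(dev, host, n * sizeof(float),
               cudaMemcpyHostToDevice);                 /* device comm  */
    scale<<<(n + 255) / 256, 256>>>(dev, 2.0f, n);      /* CUDA kernel  */
    cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);

    printf("host[0] = %f\n", host[0]);                  /* prints 2.0   */
    cudaFree(dev);
    free(host);
    return 0;
}
```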

SLIDES 8–11

Sharing Accelerators

  • 2010: Amazon EC2 adopts GPUs; other cloud offerings by AMD, NVIDIA
  • 2011: Tegras in cellphones; HPC GPU cluster (Keeneland)
  • Most applications fail to occupy GPUs completely
    − With the exception of extensively tuned (e.g. supercomputing) applications
  • Expected utilization of GPUs across applications in some domains "may" follow patterns that allow sharing

Need for accelerator sharing: resource sharing is now supported in NVIDIA's Fermi architecture
Concern: can driver scheduling do a good job?

SLIDES 12–13

NVIDIA GPU Sharing – Driver Default

  • Quad-core Xeon with 2 NVIDIA 8800GTX GPUs, driver 169.09, CUDA SDK 1.1
  • Coulomb Potential [CP] benchmark from the Parboil benchmark suite
  • Result of sharing two GPUs among four instances of the application

[Box plot: max, min, median, and 50% spread]

Driver can: efficiently implement computation and data interactions between host and accelerator
Limitations: call ordering suffers when sharing; any scheme used is static and cannot adapt to different system expectations

SLIDES 14–17

Re-thinking Accelerator-based Systems

  • Accelerators as first class citizens
    − Why treat such powerful processing resources as devices?
    − How can such heterogeneous resources be managed, especially with evolving programming models, evolving hardware, and proprietary software?
  • Sharing of accelerators
    − Are there efficient methods to utilize a heterogeneous pool of resources?
    − Can applications share accelerators without a big hit in efficiency?
  • Coordination across different processor types
    − How do you deal with multiple scheduling domains?
    − Does coordination obtain any performance gains?

SLIDES 18–21

Pegasus addresses the urgent need for systems support to smartly manage accelerators (demonstrated through x86–NVIDIA GPU-based systems). It leverages new opportunities presented by the increased adoption of virtualization technology in commercial, cloud computing, and even high performance infrastructures (virtualization is provided by the Xen hypervisor and the Dom0 management domain).

SLIDE 22

ACCELERATORS AS FIRST CLASS CITIZENS


slide-23
SLIDE 23

Manageability

Extending Xen for Closed NVIDIA GPUs

Management Domain (Dom0) Management Domain (Dom0)

Hypervisor (Xen) Hypervisor (Xen)

Traditional Device Drivers

General purpose multicores General purpose multicores Traditional Devices Traditional Devices VM

Linux

23

slide-24
SLIDE 24

Manageability

Extending Xen for Closed NVIDIA GPUs

Management Domain (Dom0) Management Domain (Dom0)

Hypervisor (Xen) Hypervisor (Xen)

Traditional Device Drivers

General purpose multicores General purpose multicores Compute Accelerators (NVIDIA GPUs) Compute Accelerators (NVIDIA GPUs) Traditional Devices Traditional Devices VM

Linux

24

slide-25
SLIDE 25

Manageability

Extending Xen for Closed NVIDIA GPUs

Management Domain (Dom0) Management Domain (Dom0)

Hypervisor (Xen) Hypervisor (Xen)

Traditional Device Drivers

General purpose multicores General purpose multicores Compute Accelerators (NVIDIA GPUs) Compute Accelerators (NVIDIA GPUs) Traditional Devices Traditional Devices VM

Linux Runtime + GPU Driver

25

slide-26
SLIDE 26

Manageability

Extending Xen for Closed NVIDIA GPUs

Management Domain (Dom0) Management Domain (Dom0)

Hypervisor (Xen) Hypervisor (Xen)

Traditional Device Drivers

General purpose multicores General purpose multicores Compute Accelerators (NVIDIA GPUs) Compute Accelerators (NVIDIA GPUs) Traditional Devices Traditional Devices VM

Linux

NVIDIA’s CUDA – Compute Unified Device Architecture for managing GPUs

Runtime + GPU Driver CUDA API GPU Application

26

slide-27
SLIDE 27

Manageability

Extending Xen for Closed NVIDIA GPUs

Management Domain (Dom0) Management Domain (Dom0)

Hypervisor (Xen) Hypervisor (Xen)

GPU Backend Traditional Device Drivers

General purpose multicores General purpose multicores Compute Accelerators (NVIDIA GPUs) Compute Accelerators (NVIDIA GPUs) Traditional Devices Traditional Devices VM

GPU Frontend Linux

NVIDIA’s CUDA – Compute Unified Device Architecture for managing GPUs

Runtime + GPU Driver CUDA API GPU Application

27

slide-28
SLIDE 28

Manageability

Extending Xen for Closed NVIDIA GPUs

Management Domain (Dom0) Management Domain (Dom0)

Hypervisor (Xen) Hypervisor (Xen)

GPU Backend Traditional Device Drivers

General purpose multicores General purpose multicores Compute Accelerators (NVIDIA GPUs) Compute Accelerators (NVIDIA GPUs) Traditional Devices Traditional Devices VM

GPU Frontend Linux

NVIDIA’s CUDA – Compute Unified Device Architecture for managing GPUs

Runtime + GPU Driver CUDA API GPU Application

VM

GPU Frontend Linux CUDA API GPU Application

28

slide-29
SLIDE 29

Manageability

Extending Xen for Closed NVIDIA GPUs

Management Domain (Dom0) Management Domain (Dom0)

Hypervisor (Xen) Hypervisor (Xen)

Mgmt Extension GPU Backend Traditional Device Drivers

General purpose multicores General purpose multicores Compute Accelerators (NVIDIA GPUs) Compute Accelerators (NVIDIA GPUs) Traditional Devices Traditional Devices VM

GPU Frontend Linux

NVIDIA’s CUDA – Compute Unified Device Architecture for managing GPUs

Runtime + GPU Driver CUDA API GPU Application

VM

GPU Frontend Linux CUDA API GPU Application

29
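The GPU frontend exists because the closed NVIDIA stack leaves only one place to intervene: guest CUDA calls must be captured at the API boundary and shipped to Dom0. Below is a minimal sketch of such interposition, assuming a hypothetical fe_forward() frontend hook (stubbed out here); the actual Pegasus interposer and its marshaling format are not shown on the slides.

```c
/* Sketch of a guest-side CUDA interposer: the library provides the
 * cudaMalloc entry point itself, so an application linked against it
 * (or via LD_PRELOAD) never reaches a local NVIDIA runtime.
 * fe_forward() is a hypothetical frontend hook, not a Pegasus symbol. */
#include <stdio.h>
#include <stddef.h>

typedef int cudaError_t;            /* stand-in for the real enum */
enum { CALL_CUDA_MALLOC = 1 };

struct malloc_args { void **dev_ptr; size_t size; };

/* Would marshal the call onto the per-VM Xen shared ring and wait for
 * the response written back by the Dom0 backend; stubbed here. */
static cudaError_t fe_forward(int call_id, void *args)
{
    (void)args;
    printf("forwarding call %d to the GPU frontend driver\n", call_id);
    return 0;                        /* cudaSuccess */
}

/* The application's cudaMalloc resolves to this interposed version. */
cudaError_t cudaMalloc(void **dev_ptr, size_t size)
{
    struct malloc_args a = { dev_ptr, size };
    return fe_forward(CALL_CUDA_MALLOC, &a);
}

int main(void)                       /* demo caller */
{
    void *p = NULL;
    cudaError_t rc = cudaMalloc(&p, 1 << 20);
    printf("cudaMalloc returned %d\n", rc);
    return 0;
}
```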

SLIDES 30–33

Accelerator Virtual CPU (aVCPU) Abstraction

  • Pegasus frontend (in the VM): the interposer library captures application CUDA calls; the frontend driver exchanges calls and responses over a per-VM Xen shared ring (the call buffer), with shared pages for application data
  • Pegasus backend (in Dom0): a polling thread picks up the calls and issues them to the CUDA runtime + driver (a sketch of this polling loop follows below)
  • The polling thread is the VM's representative for call execution
  • It can be queued or scheduled to pick calls and issue them for any amount of time, so the accelerator portion of the VM can be scheduled
  • Hence, we define an "accelerator" virtual CPU, or aVCPU
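A minimal sketch of what makes the polling thread schedulable, assuming an illustrative ring layout and call record (the slides do not specify the Pegasus wire format): each invocation of poll_vm_calls() below corresponds to one quantum in which the aVCPU drains and issues its VM's pending calls.

```c
/* Hypothetical sketch of the per-VM call buffer and the Dom0 polling
 * thread; struct layouts and names are illustrative assumptions.
 * Build with: nvcc poll.cu */
#include <stddef.h>
#include <stdio.h>
#include <cuda_runtime.h>

typedef enum { CALL_MALLOC, CALL_MEMCPY_H2D } call_kind_t;

typedef struct {                 /* one marshaled CUDA call */
    call_kind_t kind;
    size_t      bytes;
    void       *src;             /* guest data via shared pages */
    void       *dev_ptr;         /* filled in by the backend    */
} cuda_call_t;

#define RING_SLOTS 64
typedef struct {                 /* stand-in for the Xen shared ring */
    cuda_call_t slots[RING_SLOTS];
    unsigned    prod, cons;      /* producer = guest, consumer = Dom0 */
} ring_t;

static int ring_pop(ring_t *r, cuda_call_t *c)
{
    if (r->cons == r->prod) return 0;          /* ring empty */
    *c = r->slots[r->cons++ % RING_SLOTS];
    return 1;
}

/* One scheduling quantum of the aVCPU: drain and issue guest calls. */
static void poll_vm_calls(ring_t *r)
{
    cuda_call_t call;
    while (ring_pop(r, &call)) {
        switch (call.kind) {
        case CALL_MALLOC:
            cudaMalloc(&call.dev_ptr, call.bytes);
            break;
        case CALL_MEMCPY_H2D:
            cudaMemcpy(call.dev_ptr, call.src, call.bytes,
                       cudaMemcpyHostToDevice);
            break;
        }
        /* the response (status, dev_ptr) would be pushed back here */
    }
}

int main(void)
{
    static ring_t ring;          /* zero-initialized demo ring */
    cuda_call_t c = { CALL_MALLOC, 4096, NULL, NULL };
    ring.slots[ring.prod++] = c;
    poll_vm_calls(&ring);        /* what the scheduled thread does */
    puts("drained one quantum of guest CUDA calls");
    return 0;
}
```

Because the closed driver itself cannot be preempted, control is exercised here, at the point where calls are released to the runtime; this is what lets Pegasus apply the time slot based policies of the following slides.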

SLIDES 34–35

First Class Citizens

  • The aVCPU has execution context on both the CPU (polling thread, runtime, and driver context) and the GPU (CUDA kernel)
  • It holds the data used by these calls

VCPU: first class schedulable entity on a physical CPU
aVCPU: first class schedulable entity on a GPU (with a CPU component due to the execution model)
Together, VCPUs and aVCPUs form a manageable pool of heterogeneous resources

SLIDE 36

SHARING OF ACCELERATORS


SLIDES 37–45

Scheduling aVCPUs

Per call granularity is too fine; per application granularity is too coarse. Pegasus therefore uses time slot based methods in between (a sketch of the credit-to-ticks idea follows this list):

  • RR: fair share – aVCPUs are given equal time slices and scheduled in a circular fashion
  • XC: proportional fair share – adopts Xen credit scheduling for aVCPU scheduling; e.g. if VMs 1, 2, and 3 have 256, 512, and 1024 credits, they get 1, 2, and 4 time ticks respectively, every scheduling cycle
  • AccC: proportional fair share – instead of using the assigned VCPU credits for scheduling aVCPUs, defines new accelerator credits, which could be some fraction of the CPU credits
  • SLAF: feedback-based proportional fair share – periodic scanning adjusts the timer ticks assigned to aVCPUs if they fall short of, or exceed, their assigned/expected time quota
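A minimal sketch of the credit-to-ticks computation behind XC, using the slide's own example values; the 256-credits-per-tick base is inferred from that example, and all names are illustrative rather than Pegasus internals.

```c
/* Credit-proportional time-slot assignment (the XC policy as described
 * on the slide): ticks per scheduling cycle scale with a domain's
 * credits. The 256-credit base reproduces the slide's 1/2/4 example. */
#include <stdio.h>

#define BASE_CREDITS 256   /* 256 credits -> 1 tick per cycle */

struct domain { const char *name; int credits; int ticks; };

static void assign_ticks(struct domain *doms, int n)
{
    for (int i = 0; i < n; i++)
        doms[i].ticks = doms[i].credits / BASE_CREDITS;
}

int main(void)
{
    struct domain doms[] = {
        { "VM1",  256, 0 }, { "VM2",  512, 0 }, { "VM3", 1024, 0 },
    };
    assign_ticks(doms, 3);
    /* The scheduler would then cycle over domains, letting each run
     * for its tick budget before moving on. */
    for (int i = 0; i < 3; i++)
        printf("%s: %d tick(s) per scheduling cycle\n",
               doms[i].name, doms[i].ticks);
    return 0;
}
```

Running the sketch prints 1, 2, and 4 ticks for VM1, VM2, and VM3, matching the slide's example.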

SLIDES 46–47

Performance Improves but Still High Variation

  • BlackScholes <2mi,128>
  • Xen 3.2.1 with the 2.6.18 Linux kernel in all domains
  • NVIDIA driver 169.09 + SDK 1.1
  • Dom1, Dom4 = 256, Dom2 = 512, Dom3 = 1024 credits

[Box plot: max, min, median, and 50% spread per policy]

Still high variation, due to the hidden driver and runtime
Coordination: can we do better?

SLIDE 48

COORDINATION ACROSS SCHEDULING DOMAINS


SLIDES 49–50

Coordinating CPU-GPU Scheduling

  • Hypervisor co-schedule [CoSched]
    − Hypervisor scheduling determines which domain should run on a GPU, depending on the CPU schedule
    − Latency reduction at the cost of occasional unfairness
    − Possible waste of resources, e.g. if the domain picked for the GPU has no work to do
  • Augmented credit [AugC] (see the sketch after this list)
    − Scans the hypervisor CPU schedule to temporarily boost the credits of domains selected for CPUs
    − Picks domain(s) for GPU(s) based on GPU credits + remaining CPU credits from the hypervisor (augmenting)
    − Throughput improvement via a temporary credit boost
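A minimal sketch of the AugC selection rule described above: GPU credits are augmented with the remaining CPU credits of domains the hypervisor just scheduled, so the GPU tends to pick a domain whose CPU portion is already running. All names and fields are illustrative.

```c
/* Hypothetical sketch of AugC: a domain's GPU priority is its GPU
 * credits, plus its remaining CPU credits if the hypervisor currently
 * has it scheduled on a CPU. Field names are illustrative. */
#include <stdio.h>

struct domain {
    const char *name;
    int gpu_credits;
    int cpu_credits_left;   /* remaining credits in the CPU scheduler */
    int on_cpu;             /* 1 if currently in the CPU schedule */
};

/* Augment, then choose the highest-priority domain for the GPU. */
static int pick_for_gpu(struct domain *d, int n)
{
    int best = 0, best_prio = -1;
    for (int i = 0; i < n; i++) {
        int prio = d[i].gpu_credits
                 + (d[i].on_cpu ? d[i].cpu_credits_left : 0);
        if (prio > best_prio) { best_prio = prio; best = i; }
    }
    return best;
}

int main(void)
{
    struct domain doms[] = {
        { "Dom1", 256, 100, 1 },   /* on a CPU now: gets the boost */
        { "Dom2", 300,  50, 0 },
    };
    /* Dom1 wins (256 + 100 > 300): its CPU and GPU portions align. */
    printf("GPU goes to %s\n", doms[pick_for_gpu(doms, 2)].name);
    return 0;
}
```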

SLIDES 51–52

Coordination Further Improves Performance

  • BlackScholes <2mi,128>
  • Xen 3.2.1 with the 2.6.18 Linux kernel in all domains
  • NVIDIA driver 169.09 + SDK 1.1
  • Dom1, Dom4 = 256, Dom2 = 512, Dom3 = 1024 credits

Coordination: aligning the CPU and GPU portions of an application to run almost simultaneously reduces variation and improves performance

SLIDES 53–55

Pegasus Scheduling Policies

  • No coordination:
    − Default – GPU driver based – the base case (None)
    − Round Robin (RR)
    − AccCredit (AccC) – credits based on static profiling
  • Coordination based:
    − XenCredit (XC) – uses Xen CPU credits
    − SLA feedback based (SLAF)
    − Augmented Credit based (AugC) – temporarily augments credits for co-scheduling
  • Controlled:
    − Hypervisor controlled, or co-scheduled (CoSched)

SLIDES 56–64

Logical View of the Pegasus Resource Management Framework

[Architecture diagram, built up across these slides: the physical platform has CPUs C1–C4 and compute accelerators Acc1 and Acc2 (compute); guest VMs run applications and accelerator applications (a host part plus an accelerator part) over an accelerator frontend in the guest OS; in the hypervisor, the CPU scheduler picks VCPUs from CPU ready queues of domains ordered by credits; in the management domain, an accelerator selection module routes work to per-accelerator DomA schedulers, which pick aVCPUs from accelerator ready queues of domains ordered by accelerator credits; monitoring/feedback flows back into the DomA schedulers, and accelerator data accompanies the schedule; more VCPUs and aVCPUs are added as more guests and accelerator applications appear]

SLIDE 65

Testbed Details

  • Xeon 4-core @ 3GHz, 3GB RAM, 2 NVIDIA GPUs (G92-450)
  • Xen 3.2.1 (stable), Fedora 8 Dom0 and DomU running Linux kernel 2.6.18, NVIDIA driver 169.09, SDK 1.1
  • Guest domains mostly given 512MB of memory and 1 core
  • Pinned to different physical cores
  • Launched almost simultaneously: a worst case measurement due to maximum load
  • Data sampled over 50 runs for statistical significance, despite driver/runtime variation
  • Scheduling plots report the h-spread with min-max over 85% of readings, or the total work done over all runs in an experiment

SLIDES 66–69

Benchmarks

Category          Source       Benchmarks
Financial         SDK          Binomial (BOp), BlackScholes (BS), MonteCarlo (MC)
Media processing  SDK/Parboil  ProcessImage (PI) = matrix multiply + DXTC, MRIQ, FastWalshTransform (FWT)
Scientific        Parboil      CP, TPACF, RPES

  • Diverse benchmarks from different application domains show (a) different throughput and latency constraints, (b) varying data and CUDA kernel sizes, and (c) different numbers of CUDA calls
  • BlackScholes is the worst case in the set: throughput and latency sensitive, due to a large number of CUDA calls (depending on the iteration count)
  • FastWalshTransform is latency sensitive: multiple computation kernel launches and large data transfers

SLIDE 70

Ability to Achieve Low Virtualization Overhead

  • Speed improvement for most benchmarks, despite an increased number of CUDA calls
  • CUDA Time: time spent within the application executing CUDA calls
  • Total Time: total execution time of the benchmark from the command line

SLIDES 71–73

Appropriate Scheduling is Important

[Plot: per-call delays under the RR scheduler]

Without resource management, calls can be variably delayed due to interference from other application(s)/domain(s), even in the absence of virtualization

SLIDE 74

Pegasus Scheduling

BlackScholes – latency and throughput sensitive

  • Equal credits for all domains
  • Work done $= \sum_{\text{all runs}} \frac{\text{options}}{\text{time}}$

SLIDE 75

Pegasus Scheduling

FWT – latency sensitive

  • Dom1, Dom4 = 256, Dom2 = 1024, Dom3 = 2048 credits

SLIDES 76–79

Insights

  • The Pegasus approach efficiently virtualizes GPUs
  • Coordinated scheduling is effective
    − Even basic accelerator request scheduling can improve sharing performance
    − While co-scheduling is really useful [CoSched], other methods can come close [AugC], keep up utilization, and give desirable properties
  • Scheduling lowers the degree of variability caused by uncoordinated use of the NVIDIA driver

There is no single 'best' scheduling policy: there is a clear need for diverse policies geared to match different system goals and to account for different application characteristics

SLIDES 80–82

Conclusion

  • We successfully virtualize GPUs to convert them into first class citizens
  • The Pegasus approach abstracts accelerator interfaces through CUDA-level virtualization
    − It devises scheduling methods that coordinate accelerator use with that of general purpose host cores
    − Performance is evaluated on an x86-GPU Xen-based prototype
  • Evaluation with a variety of benchmarks shows
    − The need for coordination when sharing accelerator resources, especially for applications with high CPU-GPU coupling
    − The need for diverse policies when coordinating resource management decisions made for general purpose vs. accelerator cores

SLIDES 83–86

Future Work: Generalizing Pegasus

  • Applicability: the concepts apply to open as well as closed accelerators, due to the lack of integration with runtimes
    − Past experience with the IBM Cell accelerator [Cellule]
    − An open architecture allows finer grained control of resources
  • Toolchains: sophistication through integration
    − Instrumentation support from Ocelot [GTOcelot]
    − Improved admission control, load balancing, and scheduling
  • Heterogeneous platforms: scheduling different personalities for a virtual machine [Poster session]
    − A more generic problem, where even processing resources on the same chip can be asymmetric
  • Scale: extensions to cluster-based systems with Shadowfax [VTDC'11]

SLIDE 87

Related Work

  • Heterogeneous and larger-scale systems: [Helios], [MultiKernel]
  • Scheduling extensions: [Cypress], [Xen Credit Scheduling], [QoS Adaptive Communication], [Intel Shared ISA Heterogeneity], [Cellular Disco]
  • GPU virtualization: [OpenGL], [VMWare DirectX], [VMGL], [vCUDA], [gVirtuS]
  • Other related work
    − Accelerator frontend and multi-core programming models: [CUDA], [Georgia Tech Harmony], [Georgia Tech Cellule], [OpenCL]
    − Some examples: [Intel Tolapai], [AMD Fusion], [LANL Roadrunner]
    − Application domains: [NSF Keeneland], [Amazon Cloud]
    − Interaction with higher levels: [PerformancePointsOSR]
    − Cluster level: [rCUDA], [Shadowfax]

SLIDE 88

Thank you!