SLIDE 1

Full Virtualization for GPUs Reconsidered

Hangchen Yu¹, Christopher J. Rossbach¹ ²

¹The University of Texas at Austin  ²VMware Research Group

Revisit: Suzuki, Yusuke, et al. "GPUvm: Why not virtualizing GPUs at the hypervisor?" USENIX ATC'14.

SLIDE 2

Overview

  • Demands, approaches, and challenges of GPU virtualization
  • Distinctive features of GPUvm
  • Re-evaluation of GPUvm with additional benchmarks

– Hard to set up the testbed
– Some functionality does not work
– Over 200x overhead on average
– Unfairness issues
– Over 40% throughput loss

SLIDE 3

Do we still need GPU virtualization?

  • Sharing GPUs in the datacenter
  • Different end-user demands
  • Hidden scenarios
SLIDE 6

GPU Virtualization Challenges

  • Diverse hardware
  • Undocumented APIs
  • Closed-source GPUs and drivers
  • Deep graphics stack
  • Tightly coupled layers
  • Significant overheads
  • Limited flexibility
SLIDE 8

GPU Virtualization Comparisons

  • API remoting (front-end): forwards graphics API calls to an external graphics stack
  • Mediated passthrough (back-end): dedicates a set of hardware contexts to each VM
  • Passthrough (back-end): provides one VM exclusive access to the device
  • Device emulation (back-end): synthesizes host graphics operations in software

The approaches trade off performance, fidelity, multiplexing, interposition, and complexity.


SLIDE 14

GPU Virtualization Examples

  • API remoting: J. Duato et al., rCUDA, HiPC'10,'11; G. Giunta et al., gVirtuS, Euro-Par'10
  • Mediated passthrough: AMD MxGPU (FirePro), VMworld'15; NVIDIA GRID vGPU '15; KVMGT (Intel GVT-g) '14
  • Passthrough: Amazon Elastic Compute Cloud (AWS EC2)
  • Device emulation: M. Dowty, VMware SVGA, SIGOPS OSR'09
SLIDE 15

GPUvm Features

GPUvm combines device emulation, mediated passthrough, and API-remoting-style forwarding:

  • Exposes a native device model to VMs
  • Passes some operations (I/O requests) through to hardware
  • Forwards commands to a GPU access aggregator

These approaches look similar when virtualizing at the hypervisor level.

SLIDE 16

Full Virtualization vs. Para-virtualization

  • Para-virtualization: split device model — a paravirtual vGPU driver (front end) in the guest forwards requests to a back end in the host
  • Full virtualization: trap-and-emulate — the unmodified guest GPU driver runs against a device model in the hypervisor

[Figure: guest graphics stacks under the two models, from apps and GPU driver API down to the hypervisor and physical GPU]
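The trap-and-emulate path can be sketched in a few lines. This is an illustrative toy, not GPUvm's code: every name in it (`DeviceModel`, `handle_trap`, the MMIO window constants) is hypothetical, and a real hypervisor would dispatch from a page-fault handler rather than a Python function call.

```python
# Toy sketch of trap-and-emulate full virtualization (not GPUvm code):
# every guest access to the virtual GPU's MMIO range traps into the
# hypervisor, which emulates it against a software device model.
# All names and addresses below are hypothetical.

MMIO_BASE, MMIO_SIZE = 0xF000_0000, 0x1000  # hypothetical vGPU MMIO window

class DeviceModel:
    """Software model of the virtual GPU's register file."""
    def __init__(self):
        self.regs = {}

    def read(self, offset):
        return self.regs.get(offset, 0)

    def write(self, offset, value):
        # A real device model would also trigger side effects here
        # (e.g., kicking off a DMA or command submission).
        self.regs[offset] = value

def handle_trap(model, guest_addr, value=None):
    """Hypervisor entry point for a trapped guest MMIO access."""
    if not (MMIO_BASE <= guest_addr < MMIO_BASE + MMIO_SIZE):
        raise ValueError("fault outside the emulated MMIO window")
    offset = guest_addr - MMIO_BASE
    if value is None:           # load: emulate a register read
        return model.read(offset)
    model.write(offset, value)  # store: emulate a register write

model = DeviceModel()
handle_trap(model, MMIO_BASE + 0x10, value=0xBEEF)  # guest store traps
print(hex(handle_trap(model, MMIO_BASE + 0x10)))    # guest load traps, prints 0xbeef
```

The guest never touches real hardware: fidelity comes from the device model faithfully mimicking register semantics, and the per-access trap is exactly where the performance cost of this approach lives.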


SLIDE 19

Full Virtualization: A Reasonable Goal?

  • Pros: full-featured vGPU (3D acceleration); strong isolation
  • Cons: slow performance; the device model is hard to map onto different GPUs

[Figure: trap-and-emulate stack — apps, GPU driver API, GPU driver, device model, hypervisor, GPU]


SLIDE 24

GPUvm Overview

  • Access aggregator
  • Shadow channels — each mapped from a virtual channel
  • Shadow page tables

[Figure: shadowing mechanism — the VM driver's virtual context holds virtual channels, each backed by a shadow channel with its own shadow page table]


SLIDE 28

GPUvm Overview (cont.)

  • Access aggregator
  • Shadow channels — each mapped from a virtual channel
  • Shadow page tables
  • Virtual scheduler

– FIFO
– CREDIT
– BAND (bandwidth-aware non-preemptive device scheduling)
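The CREDIT policy listed above can be illustrated with a toy credit scheduler. This is a sketch under assumptions (per-slice charging, uniform replenishment when all budgets are spent), not GPUvm's actual scheduler; all names and numbers are hypothetical.

```python
# Toy credit-based vGPU scheduler in the spirit of a CREDIT policy
# (illustration only, not GPUvm's implementation): each vGPU gets a
# credit budget per period, and the scheduler dispatches the runnable
# vGPU with the most remaining credit. Dispatch is non-preemptive,
# matching the fact that GPU kernels run to completion.

class VGpu:
    def __init__(self, name, credit):
        self.name, self.credit = name, credit

def pick_next(vgpus):
    """Choose the runnable vGPU with the most remaining credit."""
    runnable = [v for v in vgpus if v.credit > 0]
    return max(runnable, key=lambda v: v.credit) if runnable else None

def run(vgpus, period_credit, slices):
    """Dispatch `slices` non-preemptive time slices; return the order."""
    schedule = []
    for _ in range(slices):
        v = pick_next(vgpus)
        if v is None:  # all budgets spent: start a new accounting period
            for u in vgpus:
                u.credit = period_credit
            v = pick_next(vgpus)
        v.credit -= 1          # charge one slice of GPU time
        schedule.append(v.name)
    return schedule

vms = [VGpu("VM0", 2), VGpu("VM1", 2)]
print(run(vms, period_credit=2, slices=6))  # alternates VM0/VM1
```

With equal budgets the two vGPUs alternate; skewing the per-period credits skews the GPU-time share, which is the knob such a policy exposes.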

SLIDE 29

Why GPUvm?

  • Open-source
  • Reported overheads: full virtualization 36x, para-virtualization 1.9x
  • Good: an open architecture

– Decoupled components: easier to upgrade, swap, or optimize, and easier to analyze performance and mechanisms
– Native device model, virtual MMIO, shadow channels, shadow page tables, virtual schedulers

  • Not-so-good aspects

– Interposes guest access to memory-mapped resources
– Shadows expensive resources
– Significant performance impact

  • Illustrates the trade-offs of hypervisor-level full virtualization

SLIDE 30

GPUvm Optimizations

  • Syncing virtual & shadow channels

– Intercepts data accesses (MMIO through the PCIe base address registers)
– BAR3 remapping: BAR3 accesses are passed through to hardware

  • Syncing guest & shadow page tables

– Driven by GPU-side page faults
– Lazy shadowing: updates shadow page tables only when referenced
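The lazy-shadowing idea can be sketched as follows. This is an assumption-laden illustration, not GPUvm's code: the class, the fixed guest-to-host offset, and the dictionary-based page tables are all hypothetical stand-ins for real MMU state.

```python
# Sketch of lazy shadowing (illustration only, not GPUvm's code):
# instead of copying the whole guest page table into the shadow page
# table on every guest update, leave shadow entries empty and fill each
# one on the GPU-side page fault that shows it is actually referenced.

class LazyShadow:
    def __init__(self, guest_pt):
        self.guest_pt = guest_pt   # guest virtual page -> guest physical page
        self.shadow_pt = {}        # shadow entries, filled on demand
        self.faults = 0

    def g2h(self, gpa):
        """Hypothetical guest-physical to host-physical translation."""
        return gpa + 0x1000_0000

    def translate(self, vpage):
        if vpage not in self.shadow_pt:            # GPU-side page fault
            self.faults += 1
            gpa = self.guest_pt[vpage]             # walk the guest page table
            self.shadow_pt[vpage] = self.g2h(gpa)  # shadow only this entry
        return self.shadow_pt[vpage]               # fast path: no fault

mmu = LazyShadow({0: 0x1000, 1: 0x2000, 2: 0x3000})
mmu.translate(1)   # first touch faults and shadows one entry
mmu.translate(1)   # second touch hits the shadow, no fault
print(mmu.faults, len(mmu.shadow_pt))  # prints: 1 1  (1 fault, 1 of 3 entries shadowed)
```

The payoff is that pages the GPU never touches are never shadowed, which is why the optimization helps most in allocation-heavy phases like Init and MemAlloc.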


SLIDE 32

Testbed

  • Specific hardware required

– NVIDIA Quadro 6000 (NVC0, GF100GL)
– GF100GL vs. GF100 (GTX 480): different register region addresses

  • Specific software required

– Fedora 16 (kernel 3.6.5)
– Xen HVM 4.2.0
– Gdev (commit 605e69e7)
– GCC 4.6.3, NVCC 4.2, Boost 1.47

SLIDE 33

Performance

  • BAR3 remapping: 1.6x speed-up, but fails for some benchmarks
  • Lazy shadowing: 1.2x speed-up, but fails for some benchmarks
  • Overhead: up to 737x, 232x on average
  • 7.4x boot slowdown

MMIO WRITE traffic per benchmark:

                        hotspot   lud       srad      mmul
  GPUvm WRITE bytes     659,664   662,544   666,784   660,832
  Original WRITE bytes    6,736     7,240     6,352     6,672

[Figure: relative execution time across benchmarks]


SLIDE 35

Runtime Breakdown

  • The Init phase is the major overhead and cost
  • The optimizations reduce the overheads of Init, MemAlloc, and Close:

              Naive     BAR-remap   Shadow   Optimized
  Init          850x       150x       750x       60x
  MemAlloc    3,878x     3,135x       287x       21x
  Close       1,260x     1,075x       200x      165x

[Figure: percentage of runtime per phase for the needle benchmark]


SLIDE 37

Runtime Breakdown (cont.)

  • Some GPU optimization features do not help much under GPUvm

– E.g., pipelined MemCpy

  • Other phases

– Launch: almost unaffected by the optimizations
– DtoH: trivial overheads

[Figure: needle execution time (ms), naive vs. optimized vs. native, 100–500 ms scale]

SLIDE 38

Fairness

  • Is BAND more fair? Not always: 6% worse than CREDIT in the 4-VM case
  • Unfairness metric: G = (Max − Min) / Avg

Per-phase times (ms) for two VMs:

                Init      MemAlloc  MemCpy  Launch     Close   Total
  CREDIT  VM0   1,466.12      2.71  599.98  67,781.37  142.56  69,992.74
          VM1   2,615.47      2.11  465.26  69,269.52  283.45  72,635.81
  BAND    VM0   3,424.71      1.99  498.15  67,544.55  339.45  71,808.84
          VM1   2,871.53     11.78  569.74  71,338.09  100.12  74,891.25
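The unfairness metric (reconstructed here from garbled symbols in the source as G = (Max − Min) / Avg) can be applied to the per-VM totals in the table; a quick sketch, with the function name and the interpretation of the two-VM totals as the metric's inputs being this writeup's assumptions:

```python
# Unfairness metric G = (Max - Min) / Avg, applied to the per-VM total
# runtimes (ms) from the table above. G = 0 means perfectly fair; larger
# values mean a bigger gap between the best- and worst-served VM.

def unfairness(times):
    """G = (max - min) / mean over per-VM runtimes."""
    return (max(times) - min(times)) / (sum(times) / len(times))

credit = [69992.74, 72635.81]  # VM0, VM1 totals under CREDIT
band = [71808.84, 74891.25]    # VM0, VM1 totals under BAND

print(f"CREDIT G = {unfairness(credit):.4f}")  # ~0.037
print(f"BAND   G = {unfairness(band):.4f}")    # ~0.042: BAND slightly less fair
```

On this two-VM data BAND comes out slightly less fair than CREDIT, consistent with the slide's observation that BAND is not always the fairer policy.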

SLIDE 39

Throughput

  • BAND: 42.5% throughput loss in the 8-VM case
  • Compared with CREDIT: 8% loss in the 8-VM case

– Caused by Init & Close overheads and inserted idle phases

[Figure: execution time (ms), 1VM x8 configuration; 273.76 ms marked]

SLIDE 40

Conclusion

  • Full-virtualization benefits

– Compatibility, interposition, isolation

  • Performance

– MMIO interception and resource shadowing dominate the cost; two optimizations help

  • Fairness and throughput

– Handled by a decoupled scheduler module

  • If not impossible...

– Further improve the components? vMMIO, schedulers
– Leverage new hardware functionality? NVIDIA Pascal page faults, SR-IOV

(Baumann, "Hardware is the new software," HotOS'17)

                 Performance (avg)   Throughput loss (8VM)
  Our testbed    > 200x slowdown     ≈ 40%
  GPUvm paper    > 33x slowdown      ≈ 27%