Full Virtualization for GPUs Reconsidered
Hangchen Yu (1), Christopher J. Rossbach (1,2)
(1) The University of Texas at Austin  (2) VMware Research Group
Revisiting: Suzuki, Yusuke, et al. "GPUvm: Why not virtualizing GPUs at the hypervisor?" USENIX ATC '14.
Front-end / Back-end
– Forwards graphics API calls to an external graphics stack
– Dedicates a set of contexts
– Provides exclusive access
– Synthesizes host graphics
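A front-end/back-end split of this kind can be sketched as follows. This is an illustrative sketch only: the names, the JSON wire format, and the per-VM context bookkeeping are all assumptions, not GPUvm's or any product's actual mechanism. The guest-side front end serializes each graphics API call; the host-side back end replays it in a context dedicated to that VM.

```python
import json

def frontend_forward(call_name, args):
    """Guest side: package an API call for transport to the back end."""
    return json.dumps({"call": call_name, "args": args})

class Backend:
    """Host side: dedicates one context per VM and replays forwarded calls."""
    def __init__(self):
        self.contexts = {}   # vm_id -> per-VM context state
        self.log = []        # calls replayed against the host graphics stack

    def dispatch(self, vm_id, message):
        msg = json.loads(message)
        ctx = self.contexts.setdefault(vm_id, {"calls": 0})
        ctx["calls"] += 1
        # A real back end would invoke the host graphics stack here.
        self.log.append((vm_id, msg["call"], tuple(msg["args"])))
        return "ok"

backend = Backend()
wire = frontend_forward("glClear", [0x4000])   # hypothetical call
backend.dispatch("vm0", wire)
print(backend.log[0])   # ('vm0', 'glClear', (16384,))
```

Because only API calls cross the boundary, the back end can multiplex many guests over one host driver, at the cost of fidelity to the native driver interface.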
Evaluation criteria: Performance, Fidelity, Multiplexing, Interposition, Complexity
Front-end / Back-end
– AMD MxGPU (FirePro), VMworld '15
– NVIDIA GRID vGPU, '15
– KVMGT (Intel GVT-g), '14
– Amazon Elastic Compute Cloud (AWS EC2)
Front-end / Back-end
– Exposes a native device model to VMs
– Passes through some operations (I/O requests) to hardware
– Forwards commands to a GPU virtual aggregator
– Similar approaches are used when virtualizing at the hypervisor level
Split device model: trap-and-emulate
[Diagram: guest apps call the GPU driver API through a vGPU driver front end; the back end and device model in the hypervisor run the real GPU driver against the physical GPU]
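A minimal trap-and-emulate sketch, with invented register offsets and names (not GPUvm's actual register map): every guest MMIO access to the virtual GPU traps into the hypervisor, which emulates the register's effect in a software device model instead of touching hardware directly.

```python
class GpuDeviceModel:
    """Software model of a GPU's MMIO register file (all offsets hypothetical)."""
    REG_STATUS, REG_DOORBELL = 0x0, 0x4

    def __init__(self):
        self.regs = {self.REG_STATUS: 0x1}   # pretend the device is "ready"
        self.commands_submitted = 0

    def mmio_read(self, offset):
        # Trap-handler path for guest loads from the virtual BAR region.
        return self.regs.get(offset, 0)

    def mmio_write(self, offset, value):
        # Trap-handler path for guest stores: emulate the side effect.
        if offset == self.REG_DOORBELL:
            self.commands_submitted += 1   # stand-in for kicking the real GPU
        else:
            self.regs[offset] = value

dm = GpuDeviceModel()
status = dm.mmio_read(GpuDeviceModel.REG_STATUS)   # guest polls status
dm.mmio_write(GpuDeviceModel.REG_DOORBELL, 1)      # guest submits work
print(status, dm.commands_submitted)   # 1 1
```

Every load and store pays a VM exit in this scheme, which is the root of the slowdowns reported later in the talk.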
Trap-and-emulate
[Diagram: guest apps use the GPU driver API and GPU driver against a device model in the hypervisor, which drives the physical GPU]
Pros: full-featured vGPU (3D acceleration); strong isolation
Cons: slow performance; the device model is hard to map to different GPUs
Shadowing mechanism
– Each virtual context is mapped by a virtual channel
[Diagram: the VM driver's virtual context holds virtual channels; each virtual channel is backed by a shadow channel with a shadow page table]
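Page-table shadowing can be sketched as follows, under simplified assumptions (flat tables, invented names, no permissions): writes to the guest's GPU page table are trapped, the guest-physical frame is translated to a host-physical frame, and the result is mirrored into the shadow table, which is the only table the hardware ever walks.

```python
# Hypothetical guest-physical -> host-physical memory map for one VM.
GPA_TO_HPA = {0x1000: 0x9000, 0x2000: 0xA000}

class ShadowPageTable:
    def __init__(self, gpa_to_hpa):
        self.gpa_to_hpa = gpa_to_hpa
        self.shadow = {}   # gpu_va -> host-physical frame (hardware-visible)

    def guest_map(self, gpu_va, gpa):
        """Called when a write to the guest's GPU page table is trapped."""
        hpa = self.gpa_to_hpa.get(gpa)
        if hpa is None:
            # Isolation: the guest may only map memory it owns.
            raise ValueError("guest mapped memory it does not own")
        self.shadow[gpu_va] = hpa

spt = ShadowPageTable(GPA_TO_HPA)
spt.guest_map(0x0, 0x1000)
print(hex(spt.shadow[0x0]))   # 0x9000
```

The validation step in `guest_map` is what gives shadowing its isolation property: a guest cannot install a mapping to another VM's memory.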
Scheduling policies
– FIFO
– CREDIT
– BAND (bandwidth-aware non-preemptive device)
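A credit-style scheduler of the kind named above can be sketched as follows; the shares, quantum, and charging rule are illustrative assumptions rather than GPUvm's actual CREDIT algorithm. Each VM's channel accrues credit in proportion to its share, the highest-credit runnable channel is dispatched next, and it is then charged for its actual (non-preemptible) runtime.

```python
class CreditScheduler:
    def __init__(self, shares):
        self.credit = {vm: 0.0 for vm in shares}
        self.shares = shares

    def accrue(self, quantum=1.0):
        # Distribute one quantum of credit proportionally to shares.
        total = sum(self.shares.values())
        for vm, share in self.shares.items():
            self.credit[vm] += quantum * share / total

    def pick(self):
        # Dispatch the channel with the most accumulated credit.
        return max(self.credit, key=self.credit.get)

    def charge(self, vm, runtime):
        # GPU command groups are non-preemptive: charge actual runtime.
        self.credit[vm] -= runtime

sched = CreditScheduler({"vm0": 1, "vm1": 1})
order = []
for _ in range(4):
    sched.accrue()
    vm = sched.pick()
    order.append(vm)
    sched.charge(vm, 1.0)
print(order)   # ['vm0', 'vm1', 'vm0', 'vm1']
```

With equal shares the two VMs alternate; charging actual runtime is what lets a credit scheme stay fair even though a running command group cannot be preempted.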
– Easier to upgrade/swap/optimize components
– Significant performance impact
– Easier to analyze performance/mechanism
– MMIO through the PCIe base address register (BAR), intercepted when referenced
– 1.6x speed-up; fails for some benchmarks
– 1.2x speed-up; fails for some benchmarks
– Up to 737x, 232x on average

                       hotspot    lud        srad       mmul
WRITE bytes            659,664    662,544    666,784    660,832
Original WRITE bytes     6,736      7,240      6,352      6,672
[Figure: relative execution time]
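As a quick check on the WRITE-bytes table, dividing each benchmark's virtualized WRITE bytes by its original WRITE bytes gives the per-benchmark MMIO write inflation, roughly 90x to 105x on these four workloads (computed here for illustration; this is a separate figure from the 737x/232x numbers above):

```python
# Values copied from the WRITE-bytes table above.
write_bytes    = {"hotspot": 659_664, "lud": 662_544, "srad": 666_784, "mmul": 660_832}
original_bytes = {"hotspot":   6_736, "lud":   7_240, "srad":   6_352, "mmul":   6_672}

# Write amplification = virtualized bytes / native bytes, per benchmark.
amplification = {b: write_bytes[b] / original_bytes[b] for b in write_bytes}
for bench, ratio in sorted(amplification.items()):
    print(f"{bench}: {ratio:.1f}x")
```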
           Naive     BAR-remap   Shadow    Optimized
Init       850x      150x        750x      60x
MemAlloc   3,878x    3,135x      287x      21x
Close      1,260x    1,075x      200x      165x
[Figure: Needle — percentage of runtime]
[Figure: Needle — time (milliseconds) for Naive, Optimized, and Native; y-axis 100–500 ms]
– Not always; 6% worse in the 4VM case
Fairness metric: (Max − Min) / Avg
               Init       MemAlloc   MemCpy    Launch      Close     Total
CREDIT  VM0    1,466.12   2.71       599.98    67,781.37   142.56    69,992.74
        VM1    2,615.47   2.11       465.26    69,269.52   283.45    72,635.81
BAND    VM0    3,424.71   1.99       498.15    67,544.55   339.45    71,808.84
        VM1    2,871.53   11.78      569.74    71,338.09   100.12    74,891.25
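Assuming a fairness metric of (Max − Min) / Avg over per-VM totals (an assumption; the formula on this slide is garbled in this transcript), the Total column of the table yields:

```python
def fairness(totals):
    """Spread of per-VM totals, normalized by their average (lower = fairer)."""
    avg = sum(totals) / len(totals)
    return (max(totals) - min(totals)) / avg

credit = [69_992.74, 72_635.81]   # CREDIT: VM0, VM1 totals from the table
band   = [71_808.84, 74_891.25]   # BAND:   VM0, VM1 totals from the table

print(f"CREDIT: {fairness(credit):.3f}")   # CREDIT: 0.037
print(f"BAND:   {fairness(band):.3f}")     # BAND:   0.042
```

On this two-VM sample the two schedulers are within a few percent of each other, consistent with the slide's point that BAND does not always win.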
[Figure: time (milliseconds); 273.76 ms in the 1VM x8 case]
Summary
– Compatibility, interposition, isolation
– MMIO interception, resource shadowing, two optimizations
– Decoupled scheduler module
Future work
– Further improve the components? vMMIO, schedulers
– Leverage new hardware functionality? NVIDIA Pascal page faults, SR-IOV
Baumann, Hardware is the new software, HotOS'17
                Performance (avg)    Throughput loss (8VM)
Our testbed     > 200x slowdown      ≈ 40%
GPUvm paper     > 33x slowdown       ≈ 27%