 
Full Virtualization for GPUs Reconsidered
Revisit: Suzuki, Yusuke, et al. "GPUvm: Why not virtualizing GPUs at the hypervisor?" USENIX ATC '14.
Hangchen Yu (1), Christopher J. Rossbach (1,2)
(1) The University of Texas at Austin
(2) VMware Research Group

Overview
• Demands, introductions, challenges of virtual GPUs
• Distinctive features of GPUvm
• Re-evaluate GPUvm with additional benchmarks
  – Hard to set up the testbed
  – Some functionalities do not work
  – Over 200x overheads on average
  – Unfairness issue
  – Over 40% throughput loss

Do we still need GPU virtualization?
• Share GPUs in the datacenter
• Different end-user demands
• Hidden scenarios

GPU Virtualization Challenges
• Diverse hardware
• Undocumented APIs
• Closed-source GPUs and drivers
• Deep graphics stack
• Coupled layers
• Significant overheads
• Limited flexibility

GPU Virtualization Comparisons
Criteria: performance, fidelity, multiplexing, interposition, complexity
Techniques, from front-end to back-end:
• Device emulation: synthesizes host graphics operations
• API remoting: forwards graphics API calls to an external graphics stack
• Mediated-passthrough: dedicates a set of contexts
• Passthrough: provides exclusive access

GPU Virtualization Examples
Techniques, from front-end to back-end:
• Device emulation
  – M. Dowty, VMware SVGA, SIGOPS-OSR '09
• API remoting
  – J. Duato, rCUDA, HiPC '10, '11
  – G. Giunta, gVirtuS, European Conference on Parallel Processing '10
• Mediated-passthrough
  – AMD MxGPU (FirePro), VMworld '15
  – NVIDIA GRID vGPU, '15
  – KVMGT (Intel GVT-g), '14
• Passthrough
  – Amazon Elastic Compute Cloud (AWS EC2)

GPUvm Features
Similar approaches when virtualizing at hypervisor-level, from front-end to back-end:
• Device emulation: exposes a native device model to VMs
• API remoting: forwards commands to the GPU virtual aggregator
• Mediated-passthrough: passes through some operations (I/O requests) to hardware

Full-virtualization vs. Para-virtualization
Criteria: performance, interposition, fidelity, multiplexing
• Para-virtualization: split device model
• Full-virtualization: trap-and-emulate
[Figure: software stacks of the two approaches. Para-virtualization: apps, GPU driver API, vGPU driver (front end), vGPU driver API, back end, back end API, GPU driver, GPU. Full-virtualization: apps, GPU driver API, GPU driver, device model, hypervisor, hypervisor API, GPU driver, GPU.]

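The trap-and-emulate idea on this slide can be illustrated with a minimal sketch. It assumes a hypothetical per-VM device model that keeps virtual MMIO register state; the class and method names are made up for illustration and are not taken from GPUvm or Xen.

```python
class DeviceModel:
    """Hypothetical per-VM device model holding virtual MMIO register state."""

    def __init__(self, vm_id):
        self.vm_id = vm_id
        self.regs = {}        # virtual registers, keyed by MMIO offset
        self.trap_count = 0   # every guest access costs one trap

    def mmio_write(self, offset, value):
        self.trap_count += 1
        self.regs[offset] = value

    def mmio_read(self, offset):
        self.trap_count += 1
        return self.regs.get(offset, 0)


class Hypervisor:
    """Routes trapped guest MMIO accesses to the owning VM's device model."""

    def __init__(self):
        self.models = {}

    def attach(self, vm_id):
        self.models[vm_id] = DeviceModel(vm_id)

    def trap(self, vm_id, op, offset, value=None):
        model = self.models[vm_id]
        if op == "write":
            model.mmio_write(offset, value)
            return None
        return model.mmio_read(offset)


hv = Hypervisor()
hv.attach("vm0")
hv.attach("vm1")
hv.trap("vm0", "write", 0x100, 0xABCD)   # unmodified guest driver writes a register
hv.trap("vm1", "write", 0x100, 0x1234)   # same offset, different VM
vm0_value = hv.trap("vm0", "read", 0x100)
vm1_value = hv.trap("vm1", "read", 0x100)
```

Each guest sees its own register state, which is where the isolation comes from, and each access pays for a trap, which is where the overhead comes from.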
Full Virtualization: A Reasonable Goal?
• Full-featured vGPU (3D acceleration)
• Strong isolation
• Slow performance
• Hard to map different GPUs
[Figure: full-virtualization (trap-and-emulate) stack: apps, GPU driver API, GPU driver, device model, hypervisor, GPU driver, GPU.]

GPUvm Overview
• Access aggregator
• Shadow channel
  – Mapped by a virtual channel
• Shadow page table
• Virtual scheduler
  – FIFO
  – CREDIT
  – BAND (bandwidth-aware non-preemptive device)
[Figure: shadowing mechanism. In the VM, the driver's virtual context owns virtual channels; each virtual channel maps to a shadow channel and a shadow page table.]

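To make the difference between the scheduler policies concrete, here is a minimal sketch that contrasts FIFO dispatch with a simplified credit-style policy. It is an illustration under stated assumptions, not GPUvm's scheduler: the function names, the per-period budget, and the "highest balance first" rule are all hypothetical.

```python
from collections import deque


def fifo_usage(requests):
    """Dispatch (vm, cost) requests in arrival order; return GPU time per VM."""
    used = {}
    for vm, cost in requests:
        used[vm] = used.get(vm, 0) + cost
    return used


def credit_usage(queues, budget, periods):
    """Simplified credit policy (assumption): each period every VM earns
    `budget` credits, and a VM may dispatch a command only while its
    balance is positive. Commands are never preempted once dispatched."""
    balance = {vm: 0 for vm in queues}
    used = {vm: 0 for vm in queues}
    for _ in range(periods):
        for vm in balance:
            balance[vm] += budget
        while True:
            ready = [vm for vm in queues if queues[vm] and balance[vm] > 0]
            if not ready:
                break
            vm = max(ready, key=lambda v: balance[v])
            cost = queues[vm].popleft()
            balance[vm] -= cost
            used[vm] += cost
    return used


# VM "a" issues expensive commands (cost 10), VM "b" cheap ones (cost 1).
fifo_result = fifo_usage([("a", 10), ("b", 1)] * 5)
credit_result = credit_usage(
    {"a": deque([10] * 100), "b": deque([1] * 1000)}, budget=10, periods=5
)
```

Under FIFO the VM with expensive commands takes most of the GPU time, while the credit-style policy equalizes GPU time across the two VMs in this toy workload.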
Why GPUvm?
• Open-source (easier to analyze performance/mechanism)
• Overheads
  – FV (36x), PV (1.9x)
• Good open architecture (easier to upgrade/swap/optimize components)
  – Decoupled components
  – Native device model, virtual MMIO, shadow channels, shadow page tables, virtual schedulers
• Not-so-good aspects (significant performance impact)
  – Interposes guest access to memory-mapped resources
  – Shadows expensive resources
• Trade-off of hypervisor-level full-virtualization

GPUvm Optimizations
• Sync virtual & shadow channels
  – Intercept data accesses
  – BAR3 remapping (BAR: MMIO through a PCIe base address register)
    • BAR3 accesses are passed through
• Sync guest & shadow page tables
  – GPU-side page faults
  – Lazy shadowing
    • Updates shadow page tables only when referenced

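Lazy shadowing can be sketched as a dirty flag on the shadow page table: guest updates only mark the shadow copy stale, and the copy is rebuilt when it is next referenced. The sketch below assumes a hypothetical guest-physical to host-physical map and a full rebuild on each sync; it illustrates the idea rather than GPUvm's actual data structures.

```python
class ShadowPageTable:
    """Hypothetical shadow page table with eager or lazy synchronization."""

    def __init__(self, guest_to_host, lazy=True):
        self.guest_to_host = guest_to_host  # assumed guest-phys -> host-phys map
        self.guest_pt = {}                  # GPU virtual addr -> guest-phys
        self.shadow_pt = {}                 # GPU virtual addr -> host-phys
        self.lazy = lazy
        self.dirty = False
        self.sync_count = 0

    def _sync(self):
        # Rebuild the shadow table from the guest page table.
        self.shadow_pt = {
            va: self.guest_to_host[pa] for va, pa in self.guest_pt.items()
        }
        self.dirty = False
        self.sync_count += 1

    def guest_update(self, gpu_va, guest_pa):
        self.guest_pt[gpu_va] = guest_pa
        if self.lazy:
            self.dirty = True   # defer the expensive rebuild
        else:
            self._sync()        # eager: rebuild on every guest update

    def translate(self, gpu_va):
        if self.dirty:
            self._sync()        # lazy: rebuild only when referenced
        return self.shadow_pt[gpu_va]


mapping = {0x1000: 0x9000, 0x2000: 0xA000, 0x3000: 0xB000}
eager = ShadowPageTable(mapping, lazy=False)
lazy = ShadowPageTable(mapping, lazy=True)
for table in (eager, lazy):
    table.guest_update(0x10, 0x1000)
    table.guest_update(0x20, 0x2000)
    table.guest_update(0x30, 0x3000)
eager_result = eager.translate(0x20)
lazy_result = lazy.translate(0x20)
```

Both variants return the same translation, but the lazy table synchronizes once for three guest updates where the eager one synchronizes three times.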
Testbed
• Specific hardware
  – NVIDIA Quadro 6000 NVC0
  – GF100GL vs. GF100 (GTX 480) (different region addresses)
• Specific software
  – Fedora 16 (Kernel 3.6.5)
  – Xen HVM (4.2.0)
  – Gdev (commit 605e69e7)
  – GCC 4.6.3
  – NVCC 4.2
  – Boost 1.4.7

Performance
• BAR3 remapping
  – 1.6x speed-up
  – Fails for some benchmarks
• Lazy shadowing
  – 1.2x speed-up
  – Fails for some benchmarks
• Overhead
  – Up to 737x, 232x on average
• 7.4x boot slowdown
[Figure: relative execution time per benchmark]

                        hotspot    lud        srad       mmul
WRITE bytes             659,664    662,544    666,784    660,832
Original WRITE bytes    6,736      7,240      6,352      6,672

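As a quick worked check on the WRITE-bytes figures, the ratio between the two rows can be computed per benchmark. The variable names are illustrative, and treating the "Original" row as the baseline is an assumption about how the table is meant to be read.

```python
benchmarks = ["hotspot", "lud", "srad", "mmul"]
write_bytes = [659_664, 662_544, 666_784, 660_832]
original_write_bytes = [6_736, 7_240, 6_352, 6_672]  # assumed baseline row

# Ratio of the first row to the "Original" row, per benchmark.
ratios = {
    name: written / original
    for name, written, original in zip(
        benchmarks, write_bytes, original_write_bytes
    )
}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x")
```

With these numbers the ratio falls between roughly 92x and 105x for all four benchmarks.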