VEE’16 04/02/2016
Shoot4U: Using VMM Assists to Optimize TLB Operations on Preempted vCPUs
Jiannan Ouyang, John Lange (University of Pittsburgh)
Haoqiang Zheng (VMware Inc.)
CPU Consolidation in the Cloud
CPU Consolidation: multiple virtual CPUs (vCPUs) share the same physical CPU (pCPU). Motivation: Improve datacenter utilization.
Figure 1. Average activity distribution of typical shared Google clusters, including online services, each containing over 20,000 servers, over a period of 3 months [Barroso 13].
Problems with Preempted vCPUs
[Figure: vCPUs of VM-A (A) and VM-B (B) time-sharing four pCPUs (P); each vCPU is either running or preempted]
Performance problems: busy-waiting based kernel synchronization operations
Lock Holder Preemption problem
Lock Waiter Preemption problem
TLB Shootdown Preemption problem
Lock Holder Preemption
Lock holder preemption [Uhlig 04, Friebel 08]
A preempted vCPU is holding a spinlock
Causes dramatically longer lock waiting time
Extra waiting time: context switch latency + CPU shares allocated to other vCPUs
Scheduling Techniques
Co-scheduling, relaxed co-scheduling [VMware 10]
Adaptive co-scheduling [Weng HPDC’11]
Balanced scheduling [Sukwong EuroSys’11]
Demand-based coordinated scheduling [Kim ASPLOS’13]
Hardware Assisted Techniques
Intel Pause-Loop Exiting (PLE) [Riel 11]
Lock Waiter Preemption [Ouyang 13]
Linux uses a FIFO order fair spinlock, named ticket spinlock
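[Figure: ticket queue holding consecutive tickets i, i+1, i+2, i+3]
Lock waiter preemption
A lock waiter is preempted and blocks the queue
P(waiter preemption) > P(holder preemption)
Preemptable Ticket Spinlock
Key idea: proportional timeout
[Figure: successive waiters in the queue are given timeouts T, 2T, 3T]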
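The proportional timeout can be illustrated with a small sketch. It assumes the timeout grows linearly with a waiter's distance from the ticket currently being served, as the T, 2T, 3T labels suggest. The function name, the 16-bit ticket width, and the parameter names are hypothetical and are not taken from the authors' implementation.

```c
#include <stdint.h>

/*
 * Hypothetical helper, not the authors' code.
 * Returns the timeout for a waiter holding `my_ticket` while the lock is
 * serving `now_serving`. `base_timeout` plays the role of T on the slide.
 */
static inline uint64_t proportional_timeout(uint16_t my_ticket,
                                            uint16_t now_serving,
                                            uint64_t base_timeout)
{
    /* Truncating to 16 bits gives the queue distance even after the
     * ticket counter has wrapped around. */
    uint16_t distance = (uint16_t)(my_ticket - now_serving);

    return (uint64_t)distance * base_timeout;
}
```

A waiter three positions behind the ticket being served would wait up to 3T before giving up on strict FIFO order, while the next waiter in line would wait only T.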
TLB Shootdown Preemption
KVM Paravirt Remote Flush TLB [kvmtlb 12]
The VMM maintains vCPU preemption state and shares it with the guest.
Use the conventional approach if the remote vCPU is running.
Defer the TLB flush if the remote vCPU is preempted.
Cons: the preemption state may change after it is checked.
TLB shootdown IPIs as scheduling heuristics [Kim ASPLOS’13]
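The check performed by the paravirtual remote flush approach can be sketched as a bitmap split. This is a minimal illustration, assuming one bit per vCPU and a snapshot of the preemption state shared by the VMM. The `flush_plan` type and `plan_remote_flush` function are hypothetical names, not part of KVM.

```c
#include <stdint.h>

/* Hypothetical result type: how each shootdown target will be handled. */
struct flush_plan {
    uint64_t ipi_mask;      /* running vCPUs: flushed via the conventional IPI */
    uint64_t deferred_mask; /* preempted vCPUs: flush is deferred */
};

/*
 * `targets` and `preempted` are bitmaps with one bit per vCPU.
 * `preempted` is only a snapshot of the state shared by the VMM, so it may
 * already be stale when the IPIs are actually sent.
 */
static inline struct flush_plan plan_remote_flush(uint64_t targets,
                                                  uint64_t preempted)
{
    struct flush_plan plan;

    plan.ipi_mask = targets & ~preempted;
    plan.deferred_mask = targets & preempted;
    return plan;
}
```

The weakness noted above is visible here: a vCPU counted in `ipi_mask` can be preempted right after the snapshot is taken, and the initiator then still waits for it.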
Shoot4U
Goal: eliminate the problem
Key idea: invalidate guest TLB entries from the VMM
Contributions
An analysis of the impact that various low-level synchronization operations have on system benchmark performance.
Shoot4U: a novel virtualized TLB architecture that ensures consistently low latencies for synchronized TLB operations.
An evaluation of the performance benefits achieved by Shoot4U over current state-of-the-art software and hardware-assisted approaches.
Performance Analysis
Overhead of CPU Consolidation
[Figure: slowdown per PARSEC benchmark (blackscholes, bodytrack, canneal, dedup, ferret, freqmine, raytrace, streamcluster, swaptions, vips, x264); y-axis: Slowdown, ticks 2 to 20; labels: 70.6, max, ideal slowdown]
PARSEC runtime with a co-located VM, relative to running alone
(12-core VMs, measured on Linux/KVM, with PLE disabled)
CPU Usage Profiling (perf)
[Figure: CPU usage breakdown per PARSEC benchmark for 1VM and 2VM; y-axis: Percentage (%); categories: k:lock, k:tlb, k:other, u:*]
CDF of TLB Shootdown Latency (ktap)
[Figure: CDF of TLB shootdown latency; x-axis: Latency (us), log scale from 10^0 to 10^6; y-axis: Cumulative Percent; series: 1VM, 2VM]
How TLB Shootdown Works in VMs
TLB (Translation Lookaside Buffer)
a per-core hardware cache for page table translation results
TLB coherence is managed by the OS
TLB shootdown operations: IPI + invlpg
Linux TLB shootdown is busy-waiting based
[Figure: the VMM multiplexing vCPUs of VM-A (A) and VM-B (B) on four pCPUs (P)]
1. vIPIs
2. trap
3. pIPIs
4. inject virtual interrupts
5. vCPU is scheduled (TLB Shootdown Preemption)
6. invalidation & ACK
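Because the initiator busy-waits for every ACK, the steps above imply that the slowest target sets the shootdown latency. The toy model below illustrates that reasoning only. The function, its parameters, and the idea of a single fixed IPI cost are simplifying assumptions, not measurements or code from the talk.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Toy model (assumption, not measured behavior).
 * sched_delay_us[i]: how long target vCPU i waits before it is scheduled,
 *                    0 if it is already running.
 * ipi_cost_us:       assumed fixed cost of delivering and handling one IPI.
 * Returns the time until the last ACK arrives, which is when the
 * busy-waiting initiator can continue.
 */
static inline uint64_t shootdown_latency_us(const uint64_t *sched_delay_us,
                                            size_t n_targets,
                                            uint64_t ipi_cost_us)
{
    uint64_t worst = 0;

    for (size_t i = 0; i < n_targets; i++) {
        uint64_t t = ipi_cost_us + sched_delay_us[i];
        if (t > worst)
            worst = t;
    }
    return worst;
}
```

With all targets running, the result is just the IPI cost. A single preempted target adds its full scheduling delay to the latency seen by the initiator.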
Shoot4U
Shoot4U
Observation: modern hardware allows the VMM to invalidate guest TLB entries (e.g. Intel invvpid)
Key idea: invalidate guest TLB entries from the VMM
Tell the VMM which TLB entries and vCPUs to invalidate (hypercall)
The VMM invalidates and returns, with no interrupt injection or waiting
1. hypercall <vcpu set, addr range>
2. pIPIs
3. invalidation and ACK
[Figure: the VMM handling the hypercall on four pCPUs (P) shared by vCPUs of VM-A (A) and VM-B (B)]
Implementation
KVM/Linux 3.16, ~200 LOC (~50 LOC guest side)
https://github.com/ouyangjn/shoot4u
Guest
use hypercall for TLB shootdowns
VMM
hypercall handler: vCPU set => pCPU set, and send IPIs
IPI handler: invalidate guest TLB entries with invvpid
Shoot4U API
kvm_hypercall3(unsigned long KVM_HC_SHOOT4U, unsigned long vcpu_bitmap,
               unsigned long start_addr, unsigned long end_addr);
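The "vCPU set => pCPU set" step of the hypercall handler can be sketched as follows. This covers only the mapping. Sending the IPIs and the TLB invalidation itself are not shown. The `vcpu_to_pcpu` table, `MAX_VCPUS`, and the function name are hypothetical bookkeeping introduced for illustration, not the data structures used in the Shoot4U patch.

```c
#include <stdint.h>

#define MAX_VCPUS 64 /* hypothetical limit: one bit per vCPU in the bitmap */

/*
 * Hypothetical sketch of the mapping step only.
 * vcpu_bitmap:  bit v set means vCPU v must have its TLB entries invalidated.
 * vcpu_to_pcpu: assumed VMM table giving the pCPU associated with each vCPU,
 *               or -1 if there is none.
 * Returns a bitmap of the pCPUs that should receive an IPI.
 */
static inline uint64_t vcpu_set_to_pcpu_set(uint64_t vcpu_bitmap,
                                            const int vcpu_to_pcpu[MAX_VCPUS])
{
    uint64_t pcpu_mask = 0;

    for (int v = 0; v < MAX_VCPUS; v++) {
        if (!(vcpu_bitmap & (UINT64_C(1) << v)))
            continue;

        int p = vcpu_to_pcpu[v];
        if (p >= 0 && p < 64)
            pcpu_mask |= UINT64_C(1) << p;
    }
    return pcpu_mask;
}
```

Several vCPUs that map to the same pCPU collapse into a single bit, so that pCPU would receive one IPI rather than one per vCPU.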
Evaluation
Dual-socket Dell R450 server
6-core Intel “Ivy-Bridge” Xeon processors with hyperthreading
24 GB RAM split across two NUMA domains
CentOS 7 (Linux 3.16)
Virtual Machines
12 vCPUs, 4 GB RAM on the same socket
Fedora 19 (Linux 4.0)
VM1: PARSEC Benchmark Suite, VM2: sysbench CPU test
Schemes
baseline: unmodified Linux kernel
kvmtlb [kvmtlb 12]
Shoot4U
Pause-Loop Exiting (PLE) [Riel 11]
Preemptable Ticket Spinlock (PMT) [Ouyang 13]
TLB Shootdown Latency (Cycles)
Order of magnitude lower latency
TLB Shootdown Latency (CDF)
[Figure: CDF of TLB shootdown latency; x-axis: Latency (us), log scale from 10^0 to 10^6; y-axis: Cumulative Percent; series: shoot4u-1VM, shoot4u-2VM, kvmtlb-1VM, kvmtlb-2VM, baseline-1VM, baseline-2VM]
PARSEC Performance (2 VMs)
[Figure: normalized execution time per PARSEC benchmark (blackscholes, bodytrack, canneal, dedup, ferret, freqmine, raytrace, streamcluster, swaptions, vips, x264); y-axis: Normalized Execution Time, ticks 0.1 to 1.1; series: baseline, ple, pmt, pmt+kvmtlb, pmt+shoot4u, ple+pmt+shoot4u]
Revisiting Performance Slowdown
[Figure: slowdown per PARSEC benchmark (blackscholes, bodytrack, canneal, dedup, ferret, freqmine, raytrace, streamcluster, swaptions, vips, x264); y-axis: Slowdown, ticks 2 to 20; label: 70.6; series: baseline, ple+pmt+shoot4u]
Revisiting CPU Usage Profiling
[Figure: CPU usage breakdown per PARSEC benchmark for baseline 2VM and ple+pmt+shoot4u 2VM; y-axis: Percentage (%); categories: k:lock, k:tlb, k:other, u:*]
Conclusions
We conducted a set of experiments to provide a breakdown of the overheads caused by preempted virtual CPU cores, showing that TLB operations can have a significant impact on performance with certain workloads.
We presented Shoot4U, an optimization for TLB shootdown operations that internalizes TLB shootdowns in the VMM and so no longer requires the involvement of a guest’s vCPUs.
Our evaluation demonstrates the effectiveness of our approach and illustrates how, under certain workloads, it is dramatically better than state-of-the-art techniques.
https://github.com/ouyangjn/shoot4u
Q & A
Jiannan Ouyang
Ph.D. Candidate University of Pittsburgh
ouyang@cs.pitt.edu
http://www.cs.pitt.edu/~ouyang/
The Prognostic Lab
University of Pittsburgh
http://www.prognosticlab.org
Pisces Co-Kernel
Kitten Lightweight Kernel
Palacios VMM
References
[Ouyang 13] Jiannan Ouyang and John R. Lange. Preemptable Ticket Spinlocks: Improving Consolidated Performance in the Cloud. In Proc. 9th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE), 2013.
[Uhlig 04] Volkmar Uhlig, Joshua LeVasseur, Espen Skoglund, and Uwe Dannowski. Towards scalable multiprocessor virtual machines. In Proceedings of the 3rd conference on Virtual Machine Research And Technology Symposium - Volume 3, VM’04, 2004.
[Friebel 08] Thomas Friebel. How to deal with lock-holder preemption. Presented at the Xen Summit North America, 2008.
[Kim ASPLOS’13] H. Kim, S. Kim, J. Jeong, J. Lee, and S. Maeng. Demand-based Coordinated Scheduling for SMP VMs. In Proc. International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2013.
[VMware 10] VMware vSphere: The CPU Scheduler in VMware ESX 4.1. Technical report, VMware, Inc., 2010.
[Barroso 13] L. A. Barroso, J. Clidaras, and U. Hölzle. The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines. Synthesis Lectures on Computer Architecture, 2013.
[Weng HPDC’11] C. Weng, Q. Liu, L. Yu, and M. Li. Dynamic Adaptive Scheduling for Virtual Machines. In Proc. 20th International Symposium on High Performance Parallel and Distributed Computing (HPDC), 2011.
[Sukwong EuroSys’11] O. Sukwong and H. S. Kim. Is Co-scheduling Too Expensive for SMP VMs? In Proc. 6th European Conference on Computer Systems (EuroSys), 2011.
[Riel 11] R. v. Riel. Directed yield for pause loop exiting, 2011. URL http://lwn.net/Articles/424960/.
[kvmtlb 12] KVM Paravirt Remote Flush TLB. https://lwn.net/