
SLIDE 1

VEE’16 04/02/2016

Shoot4U: Using VMM Assists to Optimize TLB Operations on Preempted vCPUs

Jiannan Ouyang, John Lange

University of Pittsburgh

Haoqiang Zheng

VMware Inc.

SLIDE 2

CPU Consolidation in the Cloud

CPU Consolidation: multiple virtual CPUs (vCPUs) share the same physical CPU (pCPU). Motivation: Improve datacenter utilization.

Figure 1. Average activity distribution of typical shared Google clusters, including online services, each containing over 20,000 servers, over a period of 3 months [Barroso 13].

SLIDE 3

Problems with Preempted vCPUs

[Diagram: timeline of four pCPUs time-sliced between vCPUs of VM-A (A) and VM-B (B); P marks a preempted vCPU, the rest are running.]

Performance problems: busy-waiting-based kernel synchronization operations

— Lock Holder Preemption problem
— Lock Waiter Preemption problem
— TLB Shootdown Preemption problem

SLIDE 4

Lock Holder Preemption

Lock holder preemption [Uhlig 04, Friebel 08]

— A preempted vCPU is holding a spinlock
— Causes dramatically longer lock waiting time: context switch latency + CPU shares allocated to other vCPUs

Scheduling Techniques

— co-scheduling, relaxed co-scheduling [VMware 10]
— adaptive co-scheduling [Weng HPDC11]
— balanced scheduling [Sukwong EuroSys11]
— demand-based coordinated scheduling [Kim ASPLOS13]

Hardware Assisted Techniques

— Intel Pause-Loop Exiting (PLE) [Riel 11]

SLIDE 5

Lock Waiter Preemption [Ouyang VEE13]

Linux uses a FIFO-ordered fair spinlock, the ticket spinlock

[Diagram: ticket queue with waiters holding tickets i, i+1, i+2, i+3.]

Lock waiter preemption

— A lock waiter is preempted, and blocks the queue
— P(waiter preemption) > P(holder preemption)

[Diagram: proportional timeouts T, 2T, 3T for successive queue positions.]

Preemptable Ticket Spinlock

— Key idea: proportional timeout

SLIDE 6

TLB Shootdown Preemption

KVM Paravirt Remote Flush TLB [kvmtlb 12]

— VMM maintains vCPU preemption states and shares them with the guest
— Use conventional approach if the remote vCPU is running
— Defer TLB flush if the remote vCPU is preempted
— Cons: preemption state may change after checking

TLB shootdown IPIs as scheduling heuristics [Kim ASPLOS13]

Shoot4U

— Goal: eliminate the problem
— Key idea: invalidate guest TLB entries from the VMM

SLIDE 7

Contributions

— An analysis of the impact that various low-level synchronization operations have on system benchmark performance.

— Shoot4U: a novel virtualized TLB architecture that ensures consistently low latencies for synchronized TLB operations.

— An evaluation of the performance benefits achieved by Shoot4U over current state-of-the-art software and hardware-assisted approaches.

SLIDE 8

Performance Analysis

SLIDE 9

Overhead of CPU Consolidation

[Figure: slowdown of each PARSEC benchmark (blackscholes, bodytrack, canneal, dedup, ferret, freqmine, raytrace, streamcluster, swaptions, vips, x264) with a co-located VM relative to running alone; the worst case reaches 70.6x, far beyond the ideal slowdown line.]

PARSEC runtime with a co-located VM over running alone

(12-core VMs, measured on Linux/KVM, with PLE disabled)

SLIDE 10

CPU Usage Profiling (perf)

[Figure: per-benchmark CPU time breakdown for the 1-VM and 2-VM cases, split into kernel lock time (k:lock), kernel TLB time (k:tlb), other kernel time (k:other), and user time (u:*).]

SLIDE 11

CDF of TLB Shootdown Latency (ktap)

[Figure: CDF of TLB shootdown latency in µs (log scale, 10^0 to 10^6) for the 1-VM and 2-VM cases.]

SLIDE 12

How TLB Shootdown Works in VMs

— TLB (Translation Lookaside Buffer)

— a per-core hardware cache for page table translation results

— TLB coherence is managed by the OS

— TLB shootdown operations: IPI + invlpg

— Linux TLB shootdown is busy-waiting based

[Diagram: TLB shootdown path crossing the VMM]

  • 1. vIPIs sent by the initiating vCPU
  • 2. trap to the VMM
  • 3. pIPIs to the target pCPUs
  • 4. inject virtual interrupts
  • 5. vCPU is scheduled (TLB Shootdown Preemption: the target vCPU may be preempted here)
  • 6. invalidation & ACK
SLIDE 13

Shoot4U

SLIDE 14

Shoot4U

Observation: modern hardware allows the VMM to invalidate guest TLB entries (e.g. Intel invvpid). Key idea: invalidate guest TLB entries from the VMM.

— Tell the VMM which TLB entries and vCPUs to invalidate (hypercall)
— The VMM invalidates and returns; no interrupt injection and no waiting

[Diagram: Shoot4U shootdown path through the VMM]

  • 1. hypercall <vcpu set, addr range>
  • 2. pIPIs to the target pCPUs
  • 3. invalidation and ACK inside the VMM

SLIDE 15

Implementation

KVM/Linux 3.16, ~200 LOC (~50 LOC guest side)

— https://github.com/ouyangjn/shoot4u

Guest

— use hypercall for TLB shootdowns

VMM

— hypercall handler: map the vCPU set to a pCPU set, and send IPIs
— IPI handler: invalidate guest TLB entries with invvpid

kvm_hypercall3(KVM_HC_SHOOT4U,   /* hypercall number */
               vcpu_bitmap,      /* target vCPU set */
               start_addr,       /* virtual address range to flush */
               end_addr);

Shoot4U API

SLIDE 16

Evaluation

Dual-socket Dell R450 server

— 6-core Intel “Ivy Bridge” Xeon processors with hyperthreading
— 24 GB RAM split across two NUMA domains
— CentOS 7 (Linux 3.16)

Virtual Machines

— 12 vCPUs, 4 GB RAM on the same socket
— Fedora 19 (Linux 4.0)
— VM1: PARSEC benchmark suite; VM2: sysbench CPU test

Schemes

— baseline: unmodified Linux kernel
— kvmtlb [kvmtlb 12]
— Shoot4U
— Pause-Loop Exiting (PLE) [Riel 11]
— Preemptable Ticket Spinlock (PMT) [Ouyang VEE13]

SLIDE 17

TLB Shootdown Latency (Cycles)

Order of magnitude lower latency

SLIDE 18

TLB Shootdown Latency (CDF)

[Figure: CDF of TLB shootdown latency in µs (log scale, 10^0 to 10^6) for shoot4u, kvmtlb, and baseline in the 1-VM and 2-VM cases.]

SLIDE 19

PARSEC Performance (2 VMs)

[Figure: normalized execution time (0 to 1.1) of each PARSEC benchmark (blackscholes, bodytrack, canneal, dedup, ferret, freqmine, raytrace, streamcluster, swaptions, vips, x264) under baseline, ple, pmt, pmt+kvmtlb, pmt+shoot4u, and ple+pmt+shoot4u.]

SLIDE 20

Revisiting Performance Slowdown

[Figure: per-benchmark PARSEC slowdown (2x to 20x scale, baseline maximum 70.6x) for baseline vs. ple+pmt+shoot4u.]

SLIDE 21

Revisiting CPU Usage Profiling

[Figure: per-benchmark CPU time breakdown for baseline 2VM vs. ple+pmt+shoot4u 2VM, split into k:lock, k:tlb, k:other, and u:*.]

SLIDE 22

Conclusions

— We conducted a set of experiments to provide a breakdown of the overheads caused by preempted vCPUs, showing that TLB operations can have a significant impact on performance for certain workloads.

— We present Shoot4U, an optimization for TLB shootdown operations that internalizes TLB shootdowns in the VMM, so they no longer require the involvement of a guest's vCPUs.

— Our evaluation demonstrates the effectiveness of our approach, and shows that under certain workloads it is dramatically better than state-of-the-art techniques.

SLIDE 23

https://github.com/ouyangjn/shoot4u

SLIDE 24

Q & A

Jiannan Ouyang

Ph.D. Candidate University of Pittsburgh

ouyang@cs.pitt.edu

http://www.cs.pitt.edu/~ouyang/

The Prognostic Lab

University of Pittsburgh

http://www.prognosticlab.org

Pisces Co-Kernel, Kitten Lightweight Kernel, Palacios VMM

SLIDE 25

References

— [Ouyang VEE13] Jiannan Ouyang and John R. Lange. Preemptable Ticket Spinlocks: Improving Consolidated Performance in the Cloud. In Proc. 9th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE), 2013.

— [Uhlig 04] Volkmar Uhlig, Joshua LeVasseur, Espen Skoglund, and Uwe Dannowski. Towards Scalable Multiprocessor Virtual Machines. In Proc. 3rd Virtual Machine Research and Technology Symposium (VM), 2004.

— [Friebel 08] Thomas Friebel. How to Deal with Lock-Holder Preemption. Presented at the Xen Summit North America, 2008.

— [Kim ASPLOS13] H. Kim, S. Kim, J. Jeong, J. Lee, and S. Maeng. Demand-based Coordinated Scheduling for SMP VMs. In Proc. International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2013.

SLIDE 26

— [VMware 10] VMware vSphere: The CPU Scheduler in VMware ESX 4.1. Technical report, VMware, Inc., 2010.

— [Barroso 13] L. A. Barroso, J. Clidaras, and U. Hölzle. The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines. Synthesis Lectures on Computer Architecture, 2013.

— [Weng HPDC11] C. Weng, Q. Liu, L. Yu, and M. Li. Dynamic Adaptive Scheduling for Virtual Machines. In Proc. 20th International Symposium on High-Performance Parallel and Distributed Computing (HPDC), 2011.

— [Sukwong EuroSys11] O. Sukwong and H. S. Kim. Is Co-scheduling Too Expensive for SMP VMs? In Proc. 6th European Conference on Computer Systems (EuroSys), 2011.

SLIDE 27

— [Riel 11] R. van Riel. Directed Yield for Pause Loop Exiting, 2011. URL http://lwn.net/Articles/424960/.

— [kvmtlb 12] KVM Paravirt Remote Flush TLB. https://lwn.net/Articles/500188/.