Shoot4U: Using VMM Assists to Optimize TLB Operations on Preempted vCPUs


  1. Shoot4U: Using VMM Assists to Optimize TLB Operations on Preempted vCPUs
     Jiannan Ouyang, John Lange (University of Pittsburgh)
     Haoqiang Zheng (VMware Inc.)
     VEE’16, 04/02/2016

  2. CPU Consolidation in the Cloud
     CPU consolidation: multiple virtual CPUs (vCPUs) share the same physical CPU (pCPU).
     Motivation: improve datacenter utilization.
     Figure 1. Average activity distribution of typical shared Google clusters, including Online Services, each containing over 20,000 servers, over a period of 3 months [Barroso 13].

  3. Problems with Preempted vCPUs
     [Figure: vCPUs of VM-A and VM-B share four pCPUs; one vCPU per pCPU is running while the other is preempted.]
     Performance problems: busy-waiting based kernel synchronization operations
     - Lock Holder Preemption problem
     - Lock Waiter Preemption problem
     - TLB Shootdown Preemption problem

  4. Lock Holder Preemption
     Lock holder preemption [Uhlig 04, Friebel 08]
     - A preempted vCPU is holding a spinlock
     - Causes dramatically longer lock waiting time
       - context switch latency + CPU shares allocated to other vCPUs
     Scheduling Techniques
     - Co-scheduling, relaxed co-scheduling [VMware 10]
     - Adaptive co-scheduling [Weng HPDC’11]
     - Balanced scheduling [Sukwong EuroSys’11]
     - Demand-based coordinated scheduling [Kim ASPLOS’13]
     Hardware Assisted Techniques
     - Intel Pause-Loop Exiting (PLE) [Riel 11]

  5. Lock Waiter Preemption [Ouyang 13]
     Linux uses a FIFO-ordered fair spinlock, the ticket spinlock.
     [Figure: ticket queue i, i+1, i+2, i+3 with timeouts 0, T, 2T, 3T.]
     Lock waiter preemption
     - A lock waiter is preempted and blocks the queue
     - P(waiter preemption) > P(holder preemption)
     Preemptable Ticket Spinlock
     - Key idea: proportional timeout
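
The proportional-timeout idea can be sketched in a few lines of C. This is only an illustration, not the Preemptable Ticket Spinlock implementation: `waiter_timeout` and `t_unit` are hypothetical names, and the sketch assumes a waiter's timeout is its distance from the head of the ticket queue multiplied by a base unit T.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not taken from the Linux or Shoot4U sources).
 * my_ticket:   ticket drawn by this waiter
 * now_serving: ticket that currently holds the lock
 * t_unit:      base timeout T, in whatever time unit the caller uses
 */
static uint64_t waiter_timeout(uint32_t my_ticket, uint32_t now_serving,
                               uint64_t t_unit)
{
    /* A zero unit would collapse every timeout to zero. */
    assert(t_unit > 0);

    /* Unsigned subtraction stays correct when the ticket counter wraps. */
    uint32_t position = my_ticket - now_serving;

    return (uint64_t)position * t_unit;
}
```

With this rule the lock holder has timeout 0 and the waiter three places behind it has timeout 3T, so waiters further back tolerate longer delays before treating the queue as blocked.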

  6. TLB Shootdown Preemption
     KVM Paravirt Remote Flush TLB [kvmtlb 12]
     - The VMM maintains vCPU preemption states and shares them with the guest.
     - Use the conventional approach if the remote vCPU is running.
     - Defer the TLB flush if the remote vCPU is preempted.
     - Cons: the preemption state may change after it is checked.
     TLB shootdown IPIs as scheduling heuristics [Kim ASPLOS’13]
     Shoot4U
     - Goal: eliminate the problem
     - Key idea: invalidate guest TLB entries from the VMM
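
The kvmtlb decision described above can be sketched as a split of the target vCPU set. This is a simplified illustration under stated assumptions, not the KVM patch itself: `struct flush_plan`, `plan_remote_flush`, and the bitmask representation of the shared preemption state are all hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type: bit v set means vCPU v belongs to that set. */
struct flush_plan {
    uint64_t ipi_mask;      /* running vCPUs: flush with a conventional IPI */
    uint64_t deferred_mask; /* preempted vCPUs: flush is deferred */
};

/*
 * target_mask:        vCPUs whose TLB entries must be invalidated
 * preempted_snapshot: preemption state as read from the VMM-shared data
 */
static struct flush_plan plan_remote_flush(uint64_t target_mask,
                                           uint64_t preempted_snapshot)
{
    struct flush_plan plan;

    plan.ipi_mask = target_mask & ~preempted_snapshot;
    plan.deferred_mask = target_mask & preempted_snapshot;

    /* Every target lands in exactly one of the two sets. */
    assert((plan.ipi_mask | plan.deferred_mask) == target_mask);
    assert((plan.ipi_mask & plan.deferred_mask) == 0);
    return plan;
}
```

The snapshot is the weak point named in the slide: a vCPU may be preempted or rescheduled after the guest reads its state, so the plan can be stale by the time it is acted on.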

  7. Contributions
     - An analysis of the impact that various low-level synchronization operations have on system benchmark performance.
     - Shoot4U: a novel virtualized TLB architecture that ensures consistently low latencies for synchronized TLB operations.
     - An evaluation of the performance benefits achieved by Shoot4U over current state-of-the-art software and hardware assisted approaches.

  8. Performance Analysis

  9. Overhead of CPU Consolidation
     [Figure: bar chart of slowdown per PARSEC benchmark (blackscholes, bodytrack, canneal, dedup, ferret, freqmine, raytrace, streamcluster, swaptions, vips, x264). Legend: max, ideal, slowdown. Y-axis: Slowdown, 0 to 20, with one off-scale value labeled 70.6.]
     PARSEC runtime with a co-located VM relative to running alone (12-core VMs, measured on Linux/KVM, with PLE disabled).

  10. CPU Usage Profiling (perf)
      [Figure: stacked bars of CPU time per PARSEC benchmark in two panels, 1VM and 2VM. Legend: k:lock, k:tlb, k:other, u:*. Y-axis: Percentage (%), 0 to 100.]

  11. CDF of TLB Shootdown Latency (ktap)
      [Figure: cumulative distribution of TLB shootdown latency for 1VM and 2VM. X-axis: Latency (us), log scale from 10^0 to 10^6. Y-axis: Cumulative Percent, 0 to 1.]

  12. How TLB Shootdown Works in VMs
      - TLB (Translation Lookaside Buffer): a per-core hardware cache for page table translation results
      - TLB coherence is managed by the OS
      - TLB shootdown operations: IPI + invlpg
      - Linux TLB shootdown is busy-waiting based
      [Figure: shootdown path between the vCPUs of VM-A and VM-B, the VMM, and four pCPUs]
      1. vIPIs
      2. trap
      3. pIPIs
      4. inject virtual interrupts
      5. vCPU is scheduled (TLB Shootdown Preemption)
      6. invalidation & ACK
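
Because the initiator busy-waits until every target has acknowledged, the shootdown latency is set by the slowest target. The following is a minimal model of that property, not kernel code; `shootdown_latency_us` and its per-target delay array are hypothetical, and any delay values fed to it are illustrative rather than measured.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * ack_delay_us[i]: time until target vCPU i invalidates and acknowledges.
 * For a preempted vCPU this includes the wait until it is scheduled again.
 * Returns how long the initiating vCPU spins.
 */
static uint64_t shootdown_latency_us(const uint64_t *ack_delay_us, size_t n)
{
    uint64_t worst = 0;

    assert(ack_delay_us != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (ack_delay_us[i] > worst)
            worst = ack_delay_us[i];
    }
    return worst;
}
```

A single preempted target is enough to stretch the whole operation: if two targets answer within a few microseconds and a third waits for its next time slice, the initiator spins for the full scheduling delay.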

  13. Shoot4U

  14. Shoot4U
      Observation: modern hardware allows the VMM to invalidate guest TLB entries (e.g. Intel invpid)
      Key idea: invalidate guest TLB entries from the VMM
      - Tell the VMM which TLB entries and vCPUs to invalidate (hypercall)
      - The VMM invalidates and returns, with no interrupt injection and no waiting
      [Figure: Shoot4U path between the vCPUs of VM-A and VM-B, the VMM, and four pCPUs]
      1. hypercall <vcpu set, addr range>
      2. pIPIs
      3. invalidation and ACK
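
The <vcpu set, addr range> tuple passed in step 1 can be sketched as a small request structure built on the guest side. This is an assumption-laden illustration rather than the Shoot4U guest patch: `struct shoot4u_req` and `make_shoot4u_req` are hypothetical names, and the vCPU set is assumed to be encoded as one bit per vCPU in an unsigned long.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical container for the hypercall arguments. */
struct shoot4u_req {
    unsigned long vcpu_bitmap; /* bit v set: invalidate on vCPU v */
    unsigned long start_addr;  /* first address of the range */
    unsigned long end_addr;    /* end of the range */
};

static struct shoot4u_req make_shoot4u_req(const int *vcpus, size_t n,
                                           unsigned long start_addr,
                                           unsigned long end_addr)
{
    struct shoot4u_req req = { 0, start_addr, end_addr };
    const int bits = (int)(sizeof(unsigned long) * CHAR_BIT);

    assert(start_addr <= end_addr);
    for (size_t i = 0; i < n; i++) {
        /* Each vCPU id must fit in the bitmap. */
        assert(vcpus[i] >= 0 && vcpus[i] < bits);
        req.vcpu_bitmap |= 1UL << vcpus[i];
    }
    return req;
}
```

A single unsigned long limits the set to as many vCPUs as the word has bits, which comfortably covers the 12-vCPU guests used in the evaluation.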

  15. Implementation
      Shoot4U API
          kvm_hypercall3(unsigned long KVM_HC_SHOOT4U,
                         unsigned long vcpu_bitmap,
                         unsigned long start_addr,
                         unsigned long end_addr);
      KVM/Linux 3.16, ~200 LOC (~50 LOC guest side)
      - https://github.com/ouyangjn/shoot4u
      Guest
      - use hypercall for TLB shootdowns
      VMM
      - hypercall handler: vCPU set => pCPU set, and send IPIs
      - IPI handler: invalidate guest TLB entries with invpid
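
The "vCPU set => pCPU set" step of the hypercall handler can be sketched as a bitmap translation. This is a simplified stand-in, not the code in the Shoot4U repository: `vcpu_set_to_pcpu_set` and the `vcpu_to_pcpu` placement table are hypothetical, and the sketch assumes the VMM knows which pCPU holds each vCPU's TLB entries.

```c
#include <assert.h>
#include <limits.h>

#define BITS_PER_ULONG ((int)(sizeof(unsigned long) * CHAR_BIT))

/*
 * vcpu_bitmap:  bit v set means vCPU v must be invalidated
 * vcpu_to_pcpu: hypothetical placement table, vcpu_to_pcpu[v] is the pCPU
 *               associated with vCPU v
 * Returns the set of pCPUs that should receive a physical IPI.
 */
static unsigned long vcpu_set_to_pcpu_set(unsigned long vcpu_bitmap,
                                          const int *vcpu_to_pcpu,
                                          int nr_vcpus)
{
    unsigned long pcpu_set = 0;

    assert(nr_vcpus >= 0 && nr_vcpus <= BITS_PER_ULONG);
    for (int v = 0; v < nr_vcpus; v++) {
        if (!(vcpu_bitmap & (1UL << v)))
            continue; /* vCPU v is not a target */
        assert(vcpu_to_pcpu[v] >= 0 && vcpu_to_pcpu[v] < BITS_PER_ULONG);
        pcpu_set |= 1UL << vcpu_to_pcpu[v];
    }
    return pcpu_set;
}
```

Two target vCPUs placed on the same pCPU collapse into one bit, so that pCPU receives a single IPI and its handler can invalidate the entries of both.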

  16. Evaluation
      Dual-socket Dell R450 server
      - 6-core Intel “Ivy-Bridge” Xeon processors with hyperthreading
      - 24 GB RAM split across two NUMA domains
      - CentOS 7 (Linux 3.16)
      Virtual Machines
      - 12 vCPUs, 4 GB RAM on the same socket
      - Fedora 19 (Linux 4.0)
      - VM1: PARSEC Benchmark Suite; VM2: sysbench CPU test
      Schemes
      - baseline: unmodified Linux kernel
      - kvmtlb [kvmtlb 12]
      - Shoot4U
      - Pause-Loop Exiting (PLE) [Riel 11]
      - Preemptable Ticket Spinlock (PMT) [Ouyang 13]

  17. TLB Shootdown Latency (Cycles)
      Order of magnitude lower latency

  18. TLB Shootdown Latency (CDF)
      [Figure: cumulative distribution of TLB shootdown latency for shoot4u-1VM, shoot4u-2VM, kvmtlb-1VM, kvmtlb-2VM, baseline-1VM, and baseline-2VM. X-axis: Latency (us), log scale from 10^0 to 10^6. Y-axis: Cumulative Percent, 0 to 1.]

  19. PARSEC Performance (2 VMs)
      [Figure: normalized execution time per PARSEC benchmark (blackscholes, bodytrack, canneal, dedup, ferret, freqmine, raytrace, streamcluster, swaptions, vips, x264). Legend: baseline, pmt, ple, pmt+kvmtlb, pmt+shoot4u, ple+pmt+shoot4u. Y-axis: Normalized Execution Time, 0 to 1.1.]

  20. Revisiting Performance Slowdown
      [Figure: bar chart of slowdown per PARSEC benchmark for baseline and ple+pmt+shoot4u. Y-axis: Slowdown, 0 to 20, with one off-scale value labeled 70.6.]

  21. Revisiting CPU Usage Profiling
      [Figure: stacked bars of CPU time per PARSEC benchmark in two panels, baseline 2VM and ple+pmt+shoot4u 2VM. Legend: k:lock, k:tlb, k:other, u:*. Y-axis: Percentage (%), 0 to 100.]

  22. Conclusions
      - We conducted a set of experiments to provide a breakdown of the overheads caused by preempted virtual CPU cores, showing that TLB operations can have a significant impact on performance with certain workloads.
      - We introduced Shoot4U, an optimization for TLB shootdown operations that internalizes TLB shootdowns in the VMM and so no longer requires the involvement of a guest’s vCPUs.
      - Our evaluation demonstrates the effectiveness of our approach and illustrates how, under certain workloads, it is dramatically better than state-of-the-art techniques.

  23. https://github.com/ouyangjn/shoot4u

  24. Q & A
      Jiannan Ouyang, Ph.D. Candidate, University of Pittsburgh
      ouyang@cs.pitt.edu
      http://www.cs.pitt.edu/~ouyang/
      The Prognostic Lab, University of Pittsburgh
      http://www.prognosticlab.org
      Kitten Lightweight Kernel, Pisces Co-Kernel, Palacios VMM

  25. References
      - [Ouyang 13] Jiannan Ouyang and John R. Lange. Preemptable Ticket Spinlocks: Improving Consolidated Performance in the Cloud. In Proc. 9th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE), 2013.
      - [Uhlig 04] Volkmar Uhlig, Joshua LeVasseur, Espen Skoglund, and Uwe Dannowski. Towards Scalable Multiprocessor Virtual Machines. In Proceedings of the 3rd Conference on Virtual Machine Research and Technology Symposium - Volume 3, VM’04, 2004.
      - [Friebel 08] Thomas Friebel. How to Deal with Lock-Holder Preemption. Presented at the Xen Summit North America, 2008.
      - [Kim ASPLOS’13] H. Kim, S. Kim, J. Jeong, J. Lee, and S. Maeng. Demand-based Coordinated Scheduling for SMP VMs. In Proc. International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2013.

  26. - [VMware 10] VMware vSphere: The CPU Scheduler in VMware ESX 4.1. Technical report, VMware, Inc., 2010.
      - [Barroso 13] L. A. Barroso, J. Clidaras, and U. Hölzle. The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines. Synthesis Lectures on Computer Architecture, 2013.
      - [Weng HPDC’11] C. Weng, Q. Liu, L. Yu, and M. Li. Dynamic Adaptive Scheduling for Virtual Machines. In Proc. 20th International Symposium on High Performance Parallel and Distributed Computing (HPDC), 2011.
      - [Sukwong EuroSys’11] O. Sukwong and H. S. Kim. Is Co-scheduling Too Expensive for SMP VMs? In Proc. 6th European Conference on Computer Systems (EuroSys), 2011.

  27. - [Riel 11] R. v. Riel. Directed yield for pause loop exiting, 2011. URL http://lwn.net/Articles/424960/.
      - [kvmtlb 12] KVM Paravirt Remote Flush TLB. https://lwn.net/Articles/500188/.
