
SLIDE 1

VEE’16 04/02/2016

Shoot4U: Using VMM Assists to Optimize TLB Operations on Preempted vCPUs

Jiannan Ouyang, John Lange

University of Pittsburgh

Haoqiang Zheng

VMware Inc.

SLIDE 2

CPU Consolidation in the Cloud

CPU Consolidation: multiple virtual CPUs (vCPUs) share the same physical CPU (pCPU). Motivation: Improve datacenter utilization.

Figure 1. Average activity distribution of typical shared Google clusters, including online services, each containing over 20,000 servers, over a period of 3 months [Barroso 13].

SLIDE 3

Problems with Preempted vCPUs

[Diagram: timeline of four pCPUs time-sliced between vCPUs of VM-A (A) and VM-B (B); P marks a preempted vCPU, the rest are running.]

Performance problems: busy-waiting-based kernel synchronization operations

— Lock Holder Preemption problem
— Lock Waiter Preemption problem
— TLB Shootdown Preemption problem

SLIDE 4

Lock Holder Preemption

Lock holder preemption [Uhlig 04, Friebel 08]

— A preempted vCPU is holding a spinlock
— Causes dramatically longer lock waiting time: context switch latency + CPU shares allocated to other vCPUs

Scheduling Techniques

— co-scheduling, relaxed co-scheduling [VMware 10]
— adaptive co-scheduling [Weng HPDC11]
— balanced scheduling [Sukwong EuroSys11]
— demand-based coordinated scheduling [Kim ASPLOS13]

Hardware Assisted Techniques

— Intel Pause-Loop Exiting (PLE) [Riel 11]

SLIDE 5

Lock Waiter Preemption [Ouyang VEE13]

Linux uses a FIFO-ordered fair spinlock, the ticket spinlock

[Diagram: ticket queue with waiters holding tickets i, i+1, i+2, i+3.]

Lock waiter preemption

— A lock waiter is preempted, and blocks the queue
— P(waiter preemption) > P(holder preemption)

[Diagram: proportional timeouts T, 2T, 3T for successive queue positions.]

Preemptable Ticket Spinlock

— Key idea: proportional timeout

SLIDE 6

TLB Shootdown Preemption

KVM Paravirt Remote Flush TLB [kvmtlb 12]

— VMM maintains vCPU preemption states and shares them with the guest
— Use conventional approach if the remote vCPU is running
— Defer TLB flush if the remote vCPU is preempted
— Cons: preemption state may change after checking

TLB shootdown IPIs as scheduling heuristics [Kim ASPLOS13]

Shoot4U

— Goal: eliminate the problem
— Key idea: invalidate guest TLB entries from the VMM

SLIDE 7

Contributions

— An analysis of the impact that various low-level synchronization operations have on system benchmark performance.

— Shoot4U: a novel virtualized TLB architecture that ensures consistently low latencies for synchronized TLB operations.

— An evaluation of the performance benefits achieved by Shoot4U over current state-of-the-art software and hardware-assisted approaches.

SLIDE 8

Performance Analysis

SLIDE 9

Overhead of CPU Consolidation

[Figure: slowdown of each PARSEC benchmark (blackscholes, bodytrack, canneal, dedup, ferret, freqmine, raytrace, streamcluster, swaptions, vips, x264) with a co-located VM relative to running alone; the worst case reaches 70.6x, far beyond the ideal slowdown line.]

PARSEC runtime with a co-located VM over running alone

(12-core VMs, measured on Linux/KVM, with PLE disabled)

SLIDE 10

CPU Usage Profiling (perf)

[Figure: per-benchmark CPU time breakdown for the 1-VM and 2-VM cases, split into kernel lock time (k:lock), kernel TLB time (k:tlb), other kernel time (k:other), and user time (u:*).]

SLIDE 11

CDF of TLB Shootdown Latency (ktap)

[Figure: CDF of TLB shootdown latency in µs (log scale, 10^0 to 10^6) for the 1-VM and 2-VM cases.]

SLIDE 12

How TLB Shootdown Works in VMs

— TLB (Translation Lookaside Buffer)

— a per-core hardware cache for page table translation results

— TLB coherence is managed by the OS

— TLB shootdown operations: IPI + invlpg

— Linux TLB shootdown is busy-waiting based

[Diagram: TLB shootdown path crossing the VMM]

  • 1. vIPIs sent by the initiating vCPU
  • 2. trap to the VMM
  • 3. pIPIs to the target pCPUs
  • 4. inject virtual interrupts
  • 5. vCPU is scheduled (TLB Shootdown Preemption: the target vCPU may be preempted here)
  • 6. invalidation & ACK
SLIDE 13

Shoot4U

SLIDE 14

Shoot4U

Observation: modern hardware allows the VMM to invalidate guest TLB entries (e.g. Intel invvpid). Key idea: invalidate guest TLB entries from the VMM.

— Tell the VMM which TLB entries and vCPUs to invalidate (hypercall)
— The VMM invalidates and returns; no interrupt injection and no waiting

[Diagram: Shoot4U shootdown path through the VMM]

  • 1. hypercall <vcpu set, addr range>
  • 2. pIPIs to the target pCPUs
  • 3. invalidation and ACK inside the VMM

SLIDE 15

Implementation

KVM/Linux 3.16, ~200 LOC (~50 LOC guest side)

— https://github.com/ouyangjn/shoot4u

Guest

— use hypercall for TLB shootdowns

VMM

— hypercall handler: map the vCPU set to a pCPU set, and send IPIs
— IPI handler: invalidate guest TLB entries with invvpid

kvm_hypercall3(KVM_HC_SHOOT4U,   /* hypercall number */
               vcpu_bitmap,      /* target vCPU set */
               start_addr,       /* virtual address range to flush */
               end_addr);

Shoot4U API

SLIDE 16

Evaluation

Dual-socket Dell R450 server

— 6-core Intel “Ivy Bridge” Xeon processors with hyperthreading
— 24 GB RAM split across two NUMA domains
— CentOS 7 (Linux 3.16)

Virtual Machines

— 12 vCPUs, 4 GB RAM on the same socket
— Fedora 19 (Linux 4.0)
— VM1: PARSEC benchmark suite; VM2: sysbench CPU test

Schemes

— baseline: unmodified Linux kernel
— kvmtlb [kvmtlb 12]
— Shoot4U
— Pause-Loop Exiting (PLE) [Riel 11]
— Preemptable Ticket Spinlock (PMT) [Ouyang VEE13]

SLIDE 17

TLB Shootdown Latency (Cycles)

Order of magnitude lower latency

SLIDE 18

TLB Shootdown Latency (CDF)

[Figure: CDF of TLB shootdown latency in µs (log scale, 10^0 to 10^6) for shoot4u, kvmtlb, and baseline in the 1-VM and 2-VM cases.]

SLIDE 19

PARSEC Performance (2 VMs)

[Figure: normalized execution time (0 to 1.1) of each PARSEC benchmark (blackscholes, bodytrack, canneal, dedup, ferret, freqmine, raytrace, streamcluster, swaptions, vips, x264) under baseline, ple, pmt, pmt+kvmtlb, pmt+shoot4u, and ple+pmt+shoot4u.]

SLIDE 20

Revisiting Performance Slowdown

[Figure: per-benchmark PARSEC slowdown (2x to 20x scale, baseline maximum 70.6x) for baseline vs. ple+pmt+shoot4u.]

SLIDE 21

Revisiting CPU Usage Profiling

[Figure: per-benchmark CPU time breakdown for baseline 2VM vs. ple+pmt+shoot4u 2VM, split into k:lock, k:tlb, k:other, and u:*.]

SLIDE 22

Conclusions

— We conducted a set of experiments to provide a breakdown of the overheads caused by preempted vCPUs, showing that TLB operations can have a significant impact on performance for certain workloads.

— We present Shoot4U, an optimization for TLB shootdown operations that internalizes TLB shootdowns in the VMM, so they no longer require the involvement of a guest's vCPUs.

— Our evaluation demonstrates the effectiveness of our approach, and shows that under certain workloads it is dramatically better than state-of-the-art techniques.

SLIDE 23

https://github.com/ouyangjn/shoot4u

SLIDE 24

Q & A

Jiannan Ouyang

Ph.D. Candidate University of Pittsburgh

ouyang@cs.pitt.edu

http://www.cs.pitt.edu/~ouyang/

The Prognostic Lab

University of Pittsburgh

http://www.prognosticlab.org

Pisces Co-Kernel, Kitten Lightweight Kernel, Palacios VMM

SLIDE 25

References

— [Ouyang VEE13] Jiannan Ouyang and John R. Lange. Preemptable Ticket Spinlocks: Improving Consolidated Performance in the Cloud. In Proc. 9th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE), 2013.

— [Uhlig 04] Volkmar Uhlig, Joshua LeVasseur, Espen Skoglund, and Uwe Dannowski. Towards Scalable Multiprocessor Virtual Machines. In Proc. 3rd Virtual Machine Research and Technology Symposium (VM), 2004.

— [Friebel 08] Thomas Friebel. How to Deal with Lock-Holder Preemption. Presented at the Xen Summit North America, 2008.

— [Kim ASPLOS13] H. Kim, S. Kim, J. Jeong, J. Lee, and S. Maeng. Demand-based Coordinated Scheduling for SMP VMs. In Proc. International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2013.

SLIDE 26

— [VMware 10] VMware vSphere: The CPU Scheduler in VMware ESX 4.1. Technical report, VMware, Inc., 2010.

— [Barroso 13] L. A. Barroso, J. Clidaras, and U. Hölzle. The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines. Synthesis Lectures on Computer Architecture, 2013.

— [Weng HPDC11] C. Weng, Q. Liu, L. Yu, and M. Li. Dynamic Adaptive Scheduling for Virtual Machines. In Proc. 20th International Symposium on High-Performance Parallel and Distributed Computing (HPDC), 2011.

— [Sukwong EuroSys11] O. Sukwong and H. S. Kim. Is Co-scheduling Too Expensive for SMP VMs? In Proc. 6th European Conference on Computer Systems (EuroSys), 2011.

SLIDE 27

— [Riel 11] R. van Riel. Directed Yield for Pause Loop Exiting, 2011. URL http://lwn.net/Articles/424960/.

— [kvmtlb 12] KVM Paravirt Remote Flush TLB. https://lwn.net/Articles/500188/.