
1 / 29

Vhost: Sharing is better

Eyal Moscovici, Bandan Das
Partly sponsored by:


2 / 29

What's it about?

  • Paravirtualization: Shared Responsibilities
  • Vhost: How much can we stretch?
  • Design Ideas: Parallelization
  • Design Ideas: Consolidation
  • Vhost: ELVIS
  • Upstreaming
  • Results
  • Wrap up and Questions

3 / 29

Shared Responsibilities

  • From Virtualization to Paravirtualization
  • Virtio – Host/Guest co-ordination
    – Standardized backend/frontend drivers
  • Advantages
    – Host still has ultimate control (compared to hardware device assignment)
    – Security, fault tolerance, SDN, file-based images, replication, snapshots, VM migration
  • Disadvantages
    – Scalability limitations


4 / 29

Shared Responsibilities

  • Vhost kernel
    – Let's move things into the kernel (almost!)
    – Better userspace/kernel API
    – Avoids system calls, improves performance
    – And comes with all the advantages of virtio

[Diagram: the guest vCPU kicks the vhost worker thread through KVM via ioeventfd; the worker reads/writes the virtio buffers and passes traffic to the host network stack, then signals completion back to the guest via irqfd.]
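The diagram's point is that the data path lives in the kernel: the guest kicks the worker through an eventfd instead of exiting to userspace. As a rough illustration of the userspace side only (not the presenters' patches), the sketch below shows how a vhost-net device is opened and wired to eventfds using the standard ioctls from <linux/vhost.h>; memory-table and vring setup, error handling, and registering the eventfds with KVM as ioeventfd/irqfd are all omitted, and tap_fd is assumed to be an already-open tap device.

    /* Abbreviated sketch: wiring one virtqueue of /dev/vhost-net to eventfds.
     * Real setup also needs VHOST_SET_MEM_TABLE and the VHOST_SET_VRING_*
     * calls, plus registering kick_fd/call_fd with KVM as ioeventfd/irqfd. */
    #include <fcntl.h>
    #include <sys/eventfd.h>
    #include <sys/ioctl.h>
    #include <linux/vhost.h>

    static int setup_vhost_queue(int tap_fd)
    {
        int vhost_fd = open("/dev/vhost-net", O_RDWR);
        int kick_fd  = eventfd(0, EFD_NONBLOCK);      /* guest -> worker kick  */
        int call_fd  = eventfd(0, EFD_NONBLOCK);      /* worker -> guest irq   */

        ioctl(vhost_fd, VHOST_SET_OWNER);             /* bind worker to caller */

        struct vhost_vring_file kick = { .index = 0, .fd = kick_fd };
        struct vhost_vring_file call = { .index = 0, .fd = call_fd };
        struct vhost_vring_file back = { .index = 0, .fd = tap_fd  };

        ioctl(vhost_fd, VHOST_SET_VRING_KICK, &kick); /* ioeventfd target      */
        ioctl(vhost_fd, VHOST_SET_VRING_CALL, &call); /* irqfd source          */
        ioctl(vhost_fd, VHOST_NET_SET_BACKEND, &back);/* attach the tap device */

        return vhost_fd;
    }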


5 / 29

How much can we stretch?

  • One worker thread per virtqueue pair
  • More guests = more worker threads
    – But is it necessary?
    – Can a worker share responsibilities? (see the sketch after this list)
  • Performance will improve (or at least stay the same)
    – Main objective: scalable performance
  • No userspace modifications should be necessary
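To make "sharing a worker" concrete, here is a small userspace model (not the actual kernel RFC, which keeps the analogous per-worker list of vhost_work items inside the kernel): several devices enqueue work onto one list, and a single thread services all of them. The names vq_work, queue_vq_work and shared_worker are invented for illustration.

    /* Toy model of a shared vhost worker: many devices, one service thread. */
    #include <pthread.h>
    #include <stddef.h>

    struct vq_work {
        void (*fn)(void *data);              /* e.g. handle_tx/handle_rx      */
        void *data;                          /* the virtqueue to service      */
        struct vq_work *next;
    };

    static struct vq_work *pending;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;

    /* Called by any device when its queue is kicked. */
    void queue_vq_work(struct vq_work *w)
    {
        pthread_mutex_lock(&lock);
        w->next = pending;
        pending = w;
        pthread_cond_signal(&cond);
        pthread_mutex_unlock(&lock);
    }

    /* One thread shared by many devices: dequeue and run work items. */
    void *shared_worker(void *arg)
    {
        (void)arg;
        for (;;) {
            pthread_mutex_lock(&lock);
            while (!pending)
                pthread_cond_wait(&cond, &lock);
            struct vq_work *w = pending;
            pending = w->next;
            pthread_mutex_unlock(&lock);
            w->fn(w->data);                  /* service that device's queue   */
        }
        return NULL;
    }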

6 / 29

Parallelization (Pronunciation Challenge)

  • A worker thread running on every CPU core
  • Guest/Thread mapping is decoupled
  • Guest serviced by a free worker thread with NUMA locality
  • Presented by Shirley Ma at LPC 2012

[Diagram: per-core vhost workers (Vhost-1 on CPU0 … Vhost-4 on CPU3) handling Tx/Rx for several guests, with NUMA-aware scheduling.]
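A sketch of the NUMA-locality part of this design, under the assumption of one worker pinned per core: map the CPU the guest runs on to its node (libnuma's numa_node_of_cpu()), then pick the least-loaded worker on that node. The workers_on_node table, worker_load() and pick_worker_for_guest() are hypothetical names, not the code presented at LPC.

    /* Illustrative only: choose a per-core vhost worker on the guest's NUMA
     * node.  Build with -lnuma; the worker table and load metric are assumed. */
    #include <numa.h>

    #define MAX_NODES        8
    #define WORKERS_PER_NODE 4

    struct vhost_worker;                                    /* opaque here    */
    extern struct vhost_worker *workers_on_node[MAX_NODES][WORKERS_PER_NODE];
    extern int worker_load(struct vhost_worker *w);         /* pending items  */

    struct vhost_worker *pick_worker_for_guest(int guest_cpu)
    {
        int node = numa_node_of_cpu(guest_cpu);             /* CPU -> node    */
        struct vhost_worker *best = workers_on_node[node][0];

        for (int i = 1; i < WORKERS_PER_NODE; i++) {        /* least loaded   */
            struct vhost_worker *w = workers_on_node[node][i];
            if (worker_load(w) < worker_load(best))
                best = w;
        }
        return best;
    }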


7 / 29

Parallelization

  • But...
  • Do we really need “always-on” threads?
    – Is it enough to create threads on demand?
    – Does scheduling get more complicated as the number of guests increases?
  • Why not share a thread among multiple devices?


8 / 29

Consolidation - ELVIS (Not the singer)

Presented by Abel Gordon at KVM Forum 2013

  • Divide the cores in the system into two groups: VM cores and I/O cores
  • A vhost thread servicing multiple I/O devices from different guests has a dedicated CPU core
  • A user-configurable parameter determines how many
  • A dedicated I/O scheduler on the vhost thread
  • Posted interrupts and polling included!

[Diagram: with thread-based scheduling, each core interleaves VM vCPU threads and per-VM I/O threads over execution time; with ELVIS, dedicated I/O cores run fine-grained I/O scheduling for the I/O of VM1…VMi while the remaining cores run the vCPUs.]


9 / 29

ELVIS Polling Thread

  • Single thread in a dedicated core monitors the activity of each queue (VMs' I/O)
  • Balance between queues based on the I/O activity
  • Decide which queue should be processed and for how long (a toy model follows below)
  • Balance between throughput and latency
  • No process/thread context switches for I/O
  • Exitless communication (in the next slides)
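The toy model below captures the idea (it is not the ELVIS scheduler itself): one thread on its own core scans every registered queue, skips idle ones, and gives busier queues a larger processing budget, which is where the throughput/latency trade-off is made. The io_queue structure and the budget heuristic are invented for illustration.

    /* Toy model of an ELVIS-style polling thread: no ioeventfd wakeups and no
     * context switches on the I/O path; all names are illustrative only. */
    #include <stddef.h>

    struct io_queue {
        int  (*has_work)(struct io_queue *q);        /* ring not empty?       */
        void (*process)(struct io_queue *q, int budget);
        unsigned long recent_work;                   /* activity on last pass */
    };

    #define BASE_BUDGET 64

    void polling_thread(struct io_queue **queues, size_t nr)
    {
        for (;;) {                                   /* runs on an I/O core   */
            for (size_t i = 0; i < nr; i++) {
                struct io_queue *q = queues[i];
                if (!q->has_work(q))
                    continue;
                /* Busier queues get a bigger slice: a little extra latency on
                 * quiet queues is traded for throughput on loaded ones. */
                int budget = BASE_BUDGET + (int)(q->recent_work / 2);
                q->process(q, budget);               /* updates recent_work   */
            }
        }
    }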

10 / 29

ELVIS Polling Thread

[Diagram: timelines of a vCPU thread (core X) and an I/O thread (core Y). Traditional paravirtual I/O: the guest-to-host notification exits to the hypervisor, the I/O thread processes and completes the request, and the host-to-guest notification goes back through the hypervisor. ELVIS: the I/O thread picks up requests by polling and injects the completion as an exitless virtual interrupt (via ELI), keeping the vCPU in guest mode.]


11 / 29

ELVIS Exitless communication

  • Implemented software posted interrupts based on ELI (ExitLess Interrupts)
  • ELI will be very hard to upstream
  • Possible replacements
    – KVM PV EOI, introduced by Michael S. Tsirkin
    – Intel VT-d posted interrupts (PI), which may be leveraged


12 / 29

Upstreaming..

  • A lot of new ideas!
  • First step
    – Stabilize a next-generation vhost design
  • The plan
    – Introduce a shared vhost design and run benchmarks with different configurations
    – RFC posted upstream
    – Initial test results favorable
  • Later enhancements can be introduced gradually...


13 / 29

Cgroups (Buzzwords, JK ;))

  • Initial approach
    – Add a function to search all cgroups in all hierarchies for the new process (illustrative sketch below)
    – Even a single mismatch => create a new vhost worker
  • But..
    – What happens when a VM process is migrated to a different cgroup?
    – Can we optimize the cgroup search?
    – What happens if we use polling?
    – Rethink cgroups integration?

[Diagram: guests G1–G3 in cgroups CG1–CG3, contrasting per-device vhost workers (WG1–WG3, one per guest device) with a single shared vhost worker serving all of them.]
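As a rough illustration of the "search and compare" step, the sketch below decides whether two processes share all cgroups by comparing their /proc/<pid>/cgroup listings; the function name and the text comparison are assumptions for illustration, while the actual RFC does the equivalent check in the kernel on the tasks' cgroup membership.

    /* Illustrative only: userspace approximation of "do these two processes
     * share all cgroups?"  A mismatch would mean: create a new vhost worker. */
    #include <stdio.h>
    #include <string.h>

    int same_cgroups(int pid_a, int pid_b)
    {
        char path_a[64], path_b[64], buf_a[4096], buf_b[4096];
        size_t len_a, len_b;
        FILE *fa, *fb;

        snprintf(path_a, sizeof(path_a), "/proc/%d/cgroup", pid_a);
        snprintf(path_b, sizeof(path_b), "/proc/%d/cgroup", pid_b);

        fa = fopen(path_a, "r");
        fb = fopen(path_b, "r");
        if (!fa || !fb) {
            if (fa) fclose(fa);
            if (fb) fclose(fb);
            return 0;
        }

        len_a = fread(buf_a, 1, sizeof(buf_a), fa);
        len_b = fread(buf_b, 1, sizeof(buf_b), fb);
        fclose(fa);
        fclose(fb);

        /* Any differing hierarchy line counts as a mismatch. */
        return len_a == len_b && memcmp(buf_a, buf_b, len_a) == 0;
    }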


14 / 29

Cgroups and polling

  • Can a vhost polling thread poll guests with mismatching cgroups?
    – Yes, but it will require the polling thread to take into account the cgroup state of the guest
  • Probably requires a deeper integration of vhost and cgroups


15 / 29

Workqueues (cmwq) (Even more sharing!)

  • Can we use concurrency managed workqueues? (sketch below)
  • NUMA awareness comes free!
  • But wait, what about cgroups?
    – No cgroups support (at least yet, WIP)
  • Less code to manage, fewer bugs
  • Cons
    – Minimal control once work enters the workqueue
    – Again, no cgroups support :(
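For reference, queuing per-virtqueue work onto a cmwq workqueue could look roughly like the sketch below. Only alloc_workqueue(), INIT_WORK() and queue_work() are the real cmwq API; the vhost_vq_work wrapper, its handler, and vhost_vq_kick() are hypothetical names for illustration.

    /* Sketch: push virtqueue work onto an unbound workqueue instead of a
     * dedicated vhost thread.  WQ_UNBOUND gives NUMA-aware worker placement. */
    #include <linux/kernel.h>
    #include <linux/errno.h>
    #include <linux/slab.h>
    #include <linux/workqueue.h>

    static struct workqueue_struct *vhost_wq;

    struct vhost_vq_work {
        struct work_struct work;
        void *vq;                               /* the virtqueue to service  */
    };

    static void vhost_vq_work_fn(struct work_struct *work)
    {
        struct vhost_vq_work *w = container_of(work, struct vhost_vq_work, work);
        /* handle_tx()/handle_rx() for w->vq would run here */
        kfree(w);
    }

    static int vhost_wq_init(void)
    {
        vhost_wq = alloc_workqueue("vhost", WQ_UNBOUND, 0);
        return vhost_wq ? 0 : -ENOMEM;
    }

    static void vhost_vq_kick(void *vq)
    {
        struct vhost_vq_work *w = kzalloc(sizeof(*w), GFP_ATOMIC);
        if (!w)
            return;
        w->vq = vq;
        INIT_WORK(&w->work, vhost_vq_work_fn);
        queue_work(vhost_wq, &w->work);         /* no control past this point */
    }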


16 / 29

Results

  • ELVIS results
    – A little old but significant
    – Includes testing for ExitLess Interrupts, polling
    – Valuable data for future work
  • Setup
    – Linux kernel 3.1
    – IBM System x3550 M4, two 8-core sockets of Intel Xeon E5-2660, 2.2 GHz, 56GB RAM, with an Intel x520 dual-port 10Gbps adapter
    – QEMU 0.14
  • Results showing the performance impact of the different components of ELVIS
    – Throughput: Netperf TCP stream with 64-byte messages
    – Latency: Netperf UDP RR


17 / 29

Results – Performance (Netperf)

[Charts: netperf TCP stream throughput (Gbps) and netperf UDP RR latency (msec) versus number of VMs (1–7), for baseline, baseline-affinity, elvis, elvis-poll, and elvis-poll-pi.]


18 / 29

Results – Components of ELVIS

[Charts: relative netperf TCP stream throughput and relative netperf UDP RR latency versus number of VMs (1–7), for elvis, elvis-poll, and elvis-poll-pi.]


19 / 29

Even more Results

  • New results with RFC patches
    – Two systems with Xeon E5-2640 v3
    – Point-to-point network connection
    – Netperf TCP throughput (STREAM & MAERTS)
    – Netperf TCP Request-Response


20 / 29

Results


21 / 29

Results


22 / 29

So, ship it?!

  • Not yet :)
  • Slowly making progress towards an acceptable solution
  • Scope for a lot of interesting work

Questions/Comments/Suggestions?


23 / 29

Backup


24 / 29

ELVIS missing piece

  • Polling on the physical NIC
  • It may be possible to use the low-latency Ethernet device polling introduced in kernel 3.11 (example below)
  • (*) I have an ELVIS version polling the physical NIC that is not using this patch
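For context, the kernel 3.11 feature referred to here is busy polling on the receive path, which a process can opt into per socket; a minimal example is below. The 50 µs budget is an arbitrary choice, and system-wide defaults can also be set with the net.core.busy_poll / net.core.busy_read sysctls. Whether this mechanism fits the vhost/ELVIS case is exactly the open question on this slide.

    /* Minimal example of per-socket busy polling (kernel >= 3.11). */
    #include <sys/socket.h>
    #include <netinet/in.h>

    int make_busy_poll_socket(void)
    {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        int usec = 50;                        /* busy-poll for up to 50 µs */

    #ifdef SO_BUSY_POLL
        setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL, &usec, sizeof(usec));
    #endif
        return fd;
    }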


25 / 29

Results – Performance (Netperf)

[Charts (repeated from the main results): netperf TCP stream throughput (Gbps) and netperf UDP RR latency (msec) versus number of VMs (1–7), for baseline, baseline-affinity, elvis, elvis-poll, and elvis-poll-pi.]


26 / 29

Results – Performance (Netperf)

  • Different message sizes require different numbers of I/O cores
  • Using sidecores is beneficial over a wide range of message sizes
  • The number of VMs “doesn't matter” for throughput

27 / 29

Results – Performance (Netperf TCP RR)

  • One I/O sidecore is not enough; two are needed
  • The sidecore performs up to 1.5x better than the baseline

28 / 29

Results – Performance (memcached)

  • One I/O sidecore is not enough; two are needed
  • The sidecore performs more than 2x better than the baseline

29 / 29

Results – Performance (apachebench)

  • One I/O sidecore is not enough; two are needed
  • The sidecore performs up to 2x better than the baseline