Performance Improvements of Virtual Machine Networking, Jason Wang

slide-1
SLIDE 1

Performance Improvements of Virtual Machine Networking

Jason Wang jasowang@redhat.com

slide-2
SLIDE 2

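Typical setup

[Diagram: two host/guest configurations. In both, the guest runs the virtio-net driver with TX and RX queues, served by vhost_net in the host. First setup: vhost_net -> TAP -> bridge -> NIC. Second setup: vhost_net -> macvtap -> macvlan -> NIC.]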

slide-3
SLIDE 3

How slow were we?

slide-4
SLIDE 4

Agenda

  • Vhost threading model
  • Busy polling
  • TAP improvements
  • Batching virtio processing
  • XDP
  • Performance Evaluation
  • TODO
slide-5
SLIDE 5

Threading model

  • one kthread worker for both RX and TX
  • half duplex
  • degradation on heavy bi-directional traffic

− more devices since we are virt
− complexity for both management and application

  • Scale?

[Diagram: a single vhost_net kthread handling RX, TX, TX, RX, ... in turn]

slide-6
SLIDE 6

New models

  • ELVIS by Abel Gordon

− Dedicated cores for vhost
− Several devices share a single vhost worker thread
− Polling and optimization on interrupt
− Dedicated I/O scheduler
− Lack of cgroup support

  • CMWQ (concurrency-managed workqueue) by Bandan Das

− All benefits from CMWQ, e.g. NUMA, dynamic workers
− can be cgroup aware but expensive

slide-7
SLIDE 7

Busy Polling

slide-8
SLIDE 8

Event Driven Vhost

[Diagram: timeline across the VCPU thread (guest, kvm), the vhost_net thread (handle_tx, handle_rx) and a CPU running hardirq/softirq. Each I/O notify from the guest wakes the vhost_net thread to run handle_tx.]

  • vhost_net is driven by events:

− virtqueue kicks: tx and rx
− socket events: new packets arrived and sndbuf available

  • overheads

− caused by virtualization: vmentry and vmexit, decoding/emulating
− caused by wakeup: spinlocks, scheduler latency

slide-9
SLIDE 9

Limited busy polling (since 4.6)

[Diagram: same timeline as the event-driven case, but the vhost_net thread keeps polling after handle_tx, so the next transmission from the guest needs no notify and no wakeup.]

  • still driven by events, but busy polls for a while if there is nothing to do

− maximum time (in us) spent on busy polling is limited by userspace
− disable events and poll the sources

  • overheads of virtualization and wakeups are eliminated in the best case

slide-10
SLIDE 10

Limited busy polling (since 4.6)

  • Also exit the busy polling loop when

− a signal is pending
− TIF_NEED_RESCHED is set

  • 1 byte TCP_RR shows 5%-20% improvement
  • Issues

− Not a 100% busy polling implementation

  • This could be done by specifying a very large poll-us
  • still some limitations caused by the shared kthread model
  • Sometimes users want a balance between latency and CPU consumption
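
The limited busy-polling loop from the last two slides can be sketched in user space as follows. This is a minimal model under stated assumptions, not the vhost_net code: `poll_ctx`, `busy_poll()`, the `fake_*` helpers and `run_fake()` are hypothetical names, and the clock is injected as a callback so the loop can be driven deterministically.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical context for this sketch; none of these names come from vhost. */
struct poll_ctx {
    uint64_t (*now_us)(void *opaque);   /* monotonic clock in microseconds */
    bool (*has_work)(void *opaque);     /* e.g. new avail entries or socket data */
    bool (*must_yield)(void *opaque);   /* models a pending signal or need_resched */
    void *opaque;
};

enum poll_result { POLL_WORK, POLL_TIMEOUT, POLL_YIELD };

/*
 * Spin for at most budget_us microseconds. Returns POLL_WORK as soon as
 * work shows up (no notification or wakeup needed), POLL_YIELD if the
 * thread has to give up the CPU, and POLL_TIMEOUT once the budget is
 * spent; the caller would then re-enable events and sleep.
 */
static enum poll_result busy_poll(const struct poll_ctx *c, uint64_t budget_us)
{
    uint64_t deadline;

    assert(c->now_us && c->has_work && c->must_yield);
    deadline = c->now_us(c->opaque) + budget_us;

    for (;;) {
        if (c->has_work(c->opaque))
            return POLL_WORK;
        if (c->must_yield(c->opaque))
            return POLL_YIELD;
        if (c->now_us(c->opaque) >= deadline)
            return POLL_TIMEOUT;
    }
}

/* Deterministic fake environment: the clock advances by 1 us per read. */
struct fake_env {
    uint64_t clock_us;
    uint64_t work_at_us;    /* work becomes visible at this time */
    uint64_t yield_at_us;   /* yield condition becomes true at this time */
};

static uint64_t fake_now(void *p)
{
    struct fake_env *e = p;
    return e->clock_us++;
}

static bool fake_has_work(void *p)
{
    struct fake_env *e = p;
    return e->clock_us >= e->work_at_us;
}

static bool fake_must_yield(void *p)
{
    struct fake_env *e = p;
    return e->clock_us >= e->yield_at_us;
}

/* Runs busy_poll() against the fake environment. */
static enum poll_result run_fake(uint64_t work_at, uint64_t yield_at,
                                 uint64_t budget_us)
{
    struct fake_env env = { 0, work_at, yield_at };
    struct poll_ctx ctx = { fake_now, fake_has_work, fake_must_yield, &env };

    return busy_poll(&ctx, budget_us);
}
```

With a 50 us budget, `run_fake(5, UINT64_MAX, 50)` finds work while polling, whereas `run_fake(UINT64_MAX, UINT64_MAX, 50)` runs out of budget and falls back to the event-driven path. A very large budget approximates the "100% busy polling" case mentioned above.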

slide-11
SLIDE 11

TAP improvements

slide-12
SLIDE 12

socket receive queue

  • TAP used a doubly linked list (sk_receive_queue) before 4.8

− cache thrashing

  • Every user has to write to lots of places
  • Every change has to be made in multiple places

− Spinlock is used for synchronization between producer and consumer

static inline void __skb_insert(struct sk_buff *newsk,
                                struct sk_buff *prev, struct sk_buff *next,
                                struct sk_buff_head *list)
{
        /* one insert writes to the new node, both neighbours and the list head */
        newsk->next = next;
        newsk->prev = prev;
        next->prev = prev->next = newsk;
        list->qlen++;
}

slide-13
SLIDE 13

ptr_ring (since 4.8)

  • cache friendly ring for pointers (Michael S. Tsirkin)

− an array of pointers

  • NULL means the slot is free (valid for the producer), non-NULL means it holds an entry (valid for the consumer)
  • consumer and producer verify against NULL, no need to read the index of each other, no barrier needed
  • no lock contention between producer and consumer

struct ptr_ring {
        /* producer only */
        int producer ____cacheline_aligned_in_smp;
        spinlock_t producer_lock;
        /* consumer only */
        int consumer ____cacheline_aligned_in_smp;
        spinlock_t consumer_lock;
        /* Shared consumer/producer data */
        /* Read-only by both the producer and the consumer */
        int size ____cacheline_aligned_in_smp; /* max entries in queue */
        void **queue;
};
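
The NULL-slot protocol can be illustrated with a small user-space ring. This is a sketch only: `sketch_ring` and its functions are hypothetical names, and it leaves out the locks, cache-line alignment and memory-ordering care a concurrent version would need. What it shows is that each side touches only its own index and decides "full" or "empty" from the slot contents.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Simplified single-producer/single-consumer ring; names are illustrative. */
struct sketch_ring {
    int producer;   /* touched by the producer only */
    int consumer;   /* touched by the consumer only */
    int size;       /* read-only after init */
    void **queue;   /* NULL slot = free, non-NULL slot = holds an entry */
};

/* Allocates a ring with all slots free. Returns 0 on success, -1 on failure. */
static int sketch_ring_init(struct sketch_ring *r, int size)
{
    assert(size > 0);
    r->queue = calloc((size_t)size, sizeof(*r->queue));
    if (!r->queue)
        return -1;
    r->producer = 0;
    r->consumer = 0;
    r->size = size;
    return 0;
}

/* Returns 0 on success, -1 if ptr is NULL or the ring is full. */
static int sketch_ring_produce(struct sketch_ring *r, void *ptr)
{
    /* A non-NULL slot at the producer index means the consumer is behind. */
    if (!ptr || r->queue[r->producer])
        return -1;
    r->queue[r->producer] = ptr;
    if (++r->producer >= r->size)
        r->producer = 0;
    return 0;
}

/* Returns the oldest entry, or NULL if the ring is empty. */
static void *sketch_ring_consume(struct sketch_ring *r)
{
    void *ptr = r->queue[r->consumer];

    if (!ptr)
        return NULL;
    r->queue[r->consumer] = NULL;   /* hand the slot back to the producer */
    if (++r->consumer >= r->size)
        r->consumer = 0;
    return ptr;
}
```

Neither function reads the other side's index, which is the property the slide highlights: the only shared data is the slot array itself.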

slide-14
SLIDE 14

skb_array (since 4.8)

  • wrapper for storing pointers to skb
  • sk_receive_queue was replaced by skb_array
  • a 15.3% RX pps improvement was measured in the guest during unit testing

slide-15
SLIDE 15

issue of slow consumer

[Diagram: ring of pointers (PTR 1, PTR 2, ..., PTR 9, ...) in which the producer index and the consumer index fall into the same cache line.]

  • if consumer index advances one by one

− producer and consumer are in the same cache line
− cache line bouncing almost for each pointer

  • Solution

− batch zeroing (consuming)

slide-16
SLIDE 16

Batch zeroing (since 4.12)

[Diagram: ring spanning several cache lines. The producer index is ahead; consumer_head marks the next valid entry and consumer_tail the next entry to invalidate. An arrow labelled "zeroing order" spans the consumed entries.]

struct ptr_ring {
        ...
        int consumer_head ____cacheline_aligned_in_smp; /* next valid entry */
        int consumer_tail; /* next entry to invalidate */
        ...
        int batch; /* number of entries to consume in a batch */
        void **queue;
};

slide-17
SLIDE 17

Batch zeroing (since 4.12)

[Diagram: the same ring after a batch has been invalidated. The entries between consumer_tail and consumer_head are now NULL; an arrow shows the zeroing order.]

  • Start to invalidate consumed pointers only when the consumer is 2x the cache line size away from the producer
  • Zeroing in the reverse order

− Make sure the producer won’t make progress until the whole batch has been zeroed

  • Make sure producing several new pointers does not lead to cache line bouncing
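
A user-space sketch of batch zeroing follows. It assumes a fixed ring size and a batch of 4 entries chosen for readability (the slides tie the real batch to twice the cache line size); `batch_ring` and its functions are hypothetical names, and locking and memory ordering are left out.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_RING_SIZE 16
#define SKETCH_BATCH 4   /* assumed value, picked only for illustration */

struct batch_ring {
    int producer;
    int consumer_head;   /* next valid entry */
    int consumer_tail;   /* next entry to invalidate */
    void *queue[SKETCH_RING_SIZE];
};

/* Returns 0 on success, -1 if the slot has not been invalidated yet. */
static int batch_ring_produce(struct batch_ring *r, void *ptr)
{
    assert(ptr != NULL);   /* NULL is reserved for "free slot" */
    if (r->queue[r->producer])
        return -1;
    r->queue[r->producer] = ptr;
    r->producer = (r->producer + 1) % SKETCH_RING_SIZE;
    return 0;
}

/* Returns the oldest entry, or NULL if the ring is empty. */
static void *batch_ring_consume(struct batch_ring *r)
{
    void *ptr = r->queue[r->consumer_head];
    int head = r->consumer_head;
    int next = head + 1;

    if (!ptr)
        return NULL;

    /* Invalidate only after a full batch, or when reaching the ring end. */
    if (next - r->consumer_tail >= SKETCH_BATCH || next >= SKETCH_RING_SIZE) {
        /* Reverse order: the slot nearest the producer is cleared last. */
        while (head >= r->consumer_tail)
            r->queue[head--] = NULL;
        r->consumer_tail = next;
    }
    if (next >= SKETCH_RING_SIZE) {
        next = 0;
        r->consumer_tail = 0;
    }
    r->consumer_head = next;
    return ptr;
}
```

Consumed entries stay non-NULL until the batch is complete, so a producer that has caught up simply sees a full ring instead of bouncing the cache line on every pointer.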
slide-18
SLIDE 18

Batch dequeuing (since 4.13)

[Diagram: the consumer copies up to VHOST_RX_BATCH pointers (PTR ... PTR 63) out of the ring in one go; the ring slots between consumer_tail and consumer_head are then set to NULL over several zeroing rounds (round 1 ... round N).]

  • consume the pointers in a batch; pointer access is lock free afterwards
  • reduce the cache misses and keep the consumer even further away
  • co-operates with batch zeroing
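
Batch dequeuing can be sketched as a consumer-private cache that is refilled from the ring in one pass. The names `rx_ring`, `rx_cache` and `rx_next()` are hypothetical, the batch size of 64 is taken from the PTR ... PTR 63 range in the diagram, and for brevity the ring here zeroes each slot as it is read instead of combining with batch zeroing.

```c
#include <assert.h>
#include <stddef.h>

#define RING_SIZE 128
#define RX_BATCH 64   /* assumed to match VHOST_RX_BATCH on the slide */

struct rx_ring {
    int consumer;
    void *queue[RING_SIZE];   /* NULL slot = free */
};

/* Consumer-private cache; only the consumer thread ever touches it. */
struct rx_cache {
    void *batch[RX_BATCH];
    int head;   /* next cached pointer to hand out */
    int tail;   /* number of valid cached pointers */
};

/* Models the locked ring access: pull up to n pointers in one pass. */
static int rx_ring_consume_batched(struct rx_ring *r, void **out, int n)
{
    int i;

    assert(n >= 0);
    for (i = 0; i < n; i++) {
        void *ptr = r->queue[r->consumer];

        if (!ptr)
            break;
        out[i] = ptr;
        r->queue[r->consumer] = NULL;
        r->consumer = (r->consumer + 1) % RING_SIZE;
    }
    return i;
}

/* Returns the next pointer; goes back to the ring only when the cache is empty. */
static void *rx_next(struct rx_cache *c, struct rx_ring *r)
{
    if (c->head == c->tail) {
        c->head = 0;
        c->tail = rx_ring_consume_batched(r, c->batch, RX_BATCH);
        if (c->tail == 0)
            return NULL;
    }
    return c->batch[c->head++];
}
```

After one refill, the following 63 calls to `rx_next()` are served from the private array without touching the shared ring.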

slide-19
SLIDE 19

Batching for Virtio

slide-20
SLIDE 20

Virtqueue and cache misses

[Diagram: descriptor table (address, len, flag, next), avail ring (flag, avail_idx, ring entries) and used ring (flag, used_idx, id/len entries), annotated with the cache misses below.]

5 misses for each packet:

  • 1st miss: read avail_idx
  • 2nd miss: read idx from avail ring
  • 3rd miss: read descriptor
  • 4th miss: write idx and len at used ring
  • 5th miss: update used_idx

slide-21
SLIDE 21

How batching helps

[Diagram: the same three rings with four packets processed together, annotated with the cache misses below.]

5 misses for 4 packets, 1.25 misses per packet in the ideal case:

  • 1st miss: read avail_idx
  • 2nd miss: read indexes from avail ring
  • 3rd miss: read descriptors
  • 4th miss: write indexes and lens at used ring
  • 5th miss: update used_idx

slide-22
SLIDE 22

Batching (WIP)

  • Reduce cache misses
  • Reduce cache thrashing

− When the ring is almost empty or full
− Device or driver won’t make progress when avail idx or used idx changes

  • Cache line contention on avail, used and descriptor ring is mitigated
  • Fast string copy function

− Benefit from modern CPU

slide-23
SLIDE 23

Batching in vhost_net (WIP)

  • Prototype:

− Batch reading avail indexes
− Batch update them in used ring
− Update used idx once for a batch

  • TX gets ~22% improvement
  • RX gets ~60% improvement
  • TODO:

− Batch descriptor table reading
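
The three prototype steps can be sketched on a simplified split virtqueue. This is a user-space model with hypothetical names (`sketch_vq`, `vq_fetch_batch()`, `vq_complete_batch()`); it has no descriptor table, no guest memory translation and no memory barriers, and only shows that avail_idx is read once and used_idx is written once per batch.

```c
#include <assert.h>
#include <stdint.h>

#define VQ_SIZE 256   /* ring size, assumed to be a power of two */

struct used_elem {
    uint32_t id;    /* head index of the completed descriptor chain */
    uint32_t len;   /* bytes written by the device */
};

struct sketch_vq {
    /* written by the driver */
    volatile uint16_t avail_idx;
    uint16_t avail_ring[VQ_SIZE];
    /* written by the device */
    volatile uint16_t used_idx;
    struct used_elem used_ring[VQ_SIZE];
    /* device-private state */
    uint16_t last_avail;
};

/* Reads up to max avail heads with a single read of avail_idx. */
static int vq_fetch_batch(struct sketch_vq *vq, uint16_t *heads, int max)
{
    uint16_t avail = vq->avail_idx;   /* one shared read per batch */
    int n = 0;

    while (vq->last_avail != avail && n < max)
        heads[n++] = vq->avail_ring[vq->last_avail++ % VQ_SIZE];
    return n;
}

/* Fills n used entries, then publishes them with one used_idx update. */
static void vq_complete_batch(struct sketch_vq *vq, const uint16_t *heads,
                              const uint32_t *lens, int n)
{
    uint16_t used = vq->used_idx;

    assert(n >= 0 && n <= VQ_SIZE);
    for (int i = 0; i < n; i++) {
        struct used_elem *e = &vq->used_ring[(uint16_t)(used + i) % VQ_SIZE];

        e->id = heads[i];
        e->len = lens[i];
    }
    /* A real device needs a write barrier here before publishing. */
    vq->used_idx = (uint16_t)(used + n);   /* one shared write per batch */
}
```

With a batch of four, the two index accesses are amortized over four packets, which is where the 1.25 misses per packet figure in the ideal case comes from.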

slide-24
SLIDE 24

XDP

slide-25
SLIDE 25

Introduction to XDP

  • short for eXpress Data Path
  • works at an early stage of driver RX

− before the skb is created

  • Fast

− page level
− driver specific optimizations (page recycling ...)

  • Programmable

− eBPF

  • Actions

− DROP, TX, PASS, REDIRECT
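
To make the action model concrete, here is a user-space stand-in for an XDP program. It is not eBPF and does not use kernel headers: `sketch_verdict`, `sketch_prog()` and the filtering policy (pass ARP, redirect IPv4, drop the rest) are made up for illustration. It mirrors two properties of real programs: they see raw packet bytes before any skb exists, and they must bounds-check before reading.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for the four actions on the slide; not the kernel's enum. */
enum sketch_verdict { SK_DROP, SK_TX, SK_PASS, SK_REDIRECT };

#define ETH_HDR_LEN    14
#define ETHERTYPE_IPV4 0x0800
#define ETHERTYPE_ARP  0x0806

/* Inspects the Ethernet header of a raw frame and returns a verdict. */
static enum sketch_verdict sketch_prog(const uint8_t *data,
                                       const uint8_t *data_end)
{
    uint16_t ethertype;

    assert(data <= data_end);

    /* Bounds check first: never read past the end of the packet. */
    if (data + ETH_HDR_LEN > data_end)
        return SK_DROP;

    /* EtherType sits in bytes 12-13, in network byte order. */
    ethertype = (uint16_t)((data[12] << 8) | data[13]);

    if (ethertype == ETHERTYPE_ARP)
        return SK_PASS;       /* let the normal stack handle it */
    if (ethertype == ETHERTYPE_IPV4)
        return SK_REDIRECT;   /* e.g. forward to another device */
    return SK_DROP;
}
```

A TX verdict, not used by this policy, would send the frame back out of the interface it arrived on.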

slide-26
SLIDE 26

Typical XDP implementation

  • Typical Ethernet XDP support

− Dedicated TX queue for lockless XDP_TX

  • per CPU or paired with RX queue
  • Multiqueue support is needed

− Adding/removing queues when XDP is set/unset

− Run under NAPI poll routine

  • after DMA is done

− Don’t support large packets

  • JUMBO/LRO/RSC need to be disabled when XDP is set
  • But TAP is a little bit different
slide-27
SLIDE 27

XDP for TAP (since 4.13)

  • Challenge for TAP

− Multiqueue is controlled by userspace:

  • solution: No dedicated TX queue, sharing TX queue
  • work even for single queue TAP

− Changing LRO/RSC/Jumbo configuration:

  • solution: Hybrid mode XDP implementation

− Data copy was done with skb allocation:

  • solution: Decouple data copy from skb allocation, build_skb()

− No NAPI by default:

  • run inside tun_sendmsg()

− Zerocopy:

  • done through Generic XDP, adjust_head
slide-28
SLIDE 28

Hybrid XDP in TAP (since 4.13)

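[Diagram: packets written by the guest side enter through tun_sendmsg(). Small packets run native XDP before build_skb(); zerocopy or big packets go through the generic XDP helpers. Possible verdicts are XDP_DROP, XDP_PASS, XDP_TX and XDP_REDIRECT (to ethX via ndo_xdp_xmit()). In the other direction, tun_net_xmit() (ndo_start_xmit()) queues into the TX skb array, which tun_recvmsg() drains.]

  • Merged in 4.13

− mixes native XDP and skb XDP
− simplifies the VM configuration (no notice from guest)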
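
The split between the two paths can be sketched as a small decision function. Everything here is an assumption for illustration: `tap_pick_xdp_path()` is a hypothetical name, and the page size and overhead constants are placeholders, not values taken from the TAP driver. Only the rule itself comes from the diagram: small packets take the native path, zerocopy or big packets take the generic one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum xdp_path { PATH_NATIVE_XDP, PATH_GENERIC_XDP };

#define SKETCH_PAGE_SIZE 4096u
#define SKETCH_OVERHEAD  512u   /* assumed headroom plus skb bookkeeping */

/*
 * Native XDP runs on a freshly copied buffer that must fit in one page,
 * so the skb can later be created around it with build_skb(). Zerocopy
 * and oversized packets already need an skb and use the generic helpers.
 */
static enum xdp_path tap_pick_xdp_path(size_t len, bool zerocopy)
{
    assert(len > 0);
    if (zerocopy)
        return PATH_GENERIC_XDP;
    if (len + SKETCH_OVERHEAD > SKETCH_PAGE_SIZE)
        return PATH_GENERIC_XDP;
    return PATH_NATIVE_XDP;
}
```

Because the choice is made per packet, XDP can be attached without asking the guest to turn off large packets or offloads, which is the configuration simplification noted above.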

slide-29
SLIDE 29

XDP transmission for TAP (WIP)

  • For accelerating guest RX

− An XDP queue (ptr_ring) is introduced for each tap socket
− Storing XDP metadata in the headroom
− Batch dequeuing support

[Diagram: ethX poll() runs native XDP; on XDP_REDIRECT, tun_xdp_xmit() places buffers (XDP meta followed by XDP data) into the ptr ring. tun_net_xmit() still feeds the TX skb array. tun_recvmsg(), called from vhost_net, drains both.]

slide-30
SLIDE 30

XDP for virtio-net (since 4.10)

  • Multiqueue based

− Per CPU TX XDP queue
− Need to reserve enough queue pairs when launching the VM

  • Offloads are disabled on demand when XDP is set
  • No reset

− Copy the packet if headroom is not enough

  • A little bit slow but should be rare
  • Support XDP redirecting/transmission

− Since 4.13

  • No page recycling yet
slide-31
SLIDE 31

Performance Evaluation

slide-32
SLIDE 32

Test setup bridge

[Diagram: a remote host running testpmd (txonly/rxonly) is connected back to back over ixgbe to the host kernel (ixgbe, bridge, TAP, vhost_net); the guest runs testpmd (txonly/rxonly) on its TX and RX queues.]

  • Two Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
  • Back to back ixgbes
  • Testpmd is used:

− traffic generator and receiver

  • 30% faster than pktgen

− No interrupt
− Busy polling

  • TX and RX were measured separately

slide-33
SLIDE 33

RX performance

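[Chart: RX performance after each successive optimization: busy polling, RPS hash on demand, skb_array, build_skb() for ixgbe, batch zeroing, batch consuming, XDP + RX batching (WIP), XDP transmission (WIP). Values are not recoverable from the extraction.]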

slide-34
SLIDE 34

TX performance

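[Chart: TX performance after each successive optimization: busy polling, no backlog, MSG_MORE, batch virtio TX (WIP), no flow caches (WIP), XDP. Values are not recoverable from the extraction.]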

slide-35
SLIDE 35

XDP vs testpmd

[Diagram: two setups compared. Kernel path: remote host testpmd -> ixgbe -> XDP_REDIRECT -> TAP -> vhost_net -> guest testpmd. Userspace path: remote host testpmd -> ixgbe pmd -> testpmd (io) -> vhost pmd -> guest testpmd.]

slide-36
SLIDE 36

Here we are

slide-37
SLIDE 37

perf – ksoftirqd RX

  • 26.49% [kernel] [k] _raw_spin_lock
  • 16.00% [ixgbe] [k] ixgbe_clean_rx_irq
  • 15.99% [kernel] [k] sock_def_readable
  • 5.63% [kernel] [k] dev_get_by_index_rcu
  • 5.48% [kernel] [k] __bpf_tx_xdp
  • 4.42% [tun] [k] tun_xdp_xmit
  • 4.29% [kernel] [k] xdp_do_redirect
  • 3.70% [ixgbe] [k] ixgbe_alloc_rx_buffers
  • 2.53% [kernel] [k] swiotlb_sync_single
  • 2.08% [kernel] [k] percpu_array_map_lookup_elem

slide-38
SLIDE 38

perf – vhost_net RX

  • 43.38% [vhost_net] [k] handle_rx
  • 9.86% [kernel] [k] copy_page_to_iter
  • 8.87% [kernel] [k] _copy_to_iter
  • 7.41% [vhost_net] [k] vhost_net_buf_peek
  • 6.38% [vhost] [k] __vhost_get_vq_desc
  • 6.22% [kernel] [k] iov_iter_advance
  • 6.16% [kernel] [k] copy_user_generic_unrolled

  • 3.80% [vhost] [k] vhost_get_vq_desc
  • 3.64% [vhost] [k] translate_desc
  • 2.40% [kernel] [k] copyout
slide-39
SLIDE 39

perf – vhost_net TX

  • 21.49% [vhost] [k] translate_desc
  • 13.41% [tun] [k] tun_get_user
  • 10.12% [vhost] [k] __vhost_get_vq_desc
  • 6.54% [kernel] [k] iov_iter_advance
  • 4.32% [kernel] [k] copy_page_from_iter
  • 4.15% [kernel] [k] copy_user_enhanced_fast_string
  • 3.92% [ixgbe] [k] ixgbe_xmit_xdp_ring.isra.88
  • 3.56% [vhost_net] [k] handle_tx
  • 3.46% [tun] [k] tun_sendmsg
  • 3.23% [kernel] [k] page_frag_free
slide-40
SLIDE 40

TODO/Raw ideas

  • Raw ideas

− better integration with NAPI busy polling in vhost_net?
− pure busy polling vhost_net?
− Better XDP co-operation on page recycling for hardware NIC drivers?
− Build and receive skb/XDP in vhost_net?
− Rx zerocopy

  • ndo_post_rx_buffer()?
  • Please comment on virtio 1.1
slide-41
SLIDE 41

Thanks