SLIDE 1

On ParaVirtualizing TCP: Congestion Control in Xen Virtual Machines

Luwei Cheng, Cho-Li Wang, Francis C.M. Lau
Department of Computer Science, The University of Hong Kong

Xen Project Developer Summit 2013
Edinburgh, UK, October 24-25, 2013

SLIDE 2

Outline

  • Motivation
    – Physical datacenter vs. Virtualized datacenter
    – Incast congestion
  • Understand the Problem
    – Pseudo-congestion
    – Sender-side vs. Receiver-side
  • PVTCP – A ParaVirtualized TCP
    – Design, Implementation, Evaluation
  • Questions & Comments
SLIDE 4

Physical datacenter

  • A set of physical machines
  • Network delays: propagation delays of the physical network/switches

[Figure: a core switch connects ToR switches, each serving a rack of servers; in the virtualized case, each server hosts multiple VMs]

Virtualized datacenter

  • A set of virtual machines
  • Network delays: additional delays due to virtualization overhead

SLIDE 5

Virtualization brings “delays”

  • 1. I/O virtualization overhead (PV or HVM)
    – Guest VMs are unable to directly access the hardware.
    – Additional data movement between dom0 and domUs.
    – HVM: passthrough I/O can avoid this overhead.
  • 2. VM scheduling delays
    – Multiple VMs share one physical core.

[Figure: multiple VMs time-share each pCPU under the hypervisor; a VM waiting for the pCPU experiences delay]

SLIDE 6

Virtualization brings “delays”

[Figure: measured RTT distributions. PM → PM: avg 0.147ms; 1VM → 1VM: avg 0.374ms; 1VM → 2VMs: peaks of 30ms; 1VM → 3VMs: peaks of 60ms]

  • Delays of I/O virtualization (PV guests): < 1ms
  • VM scheduling delays: tens of milliseconds
    – Queuing delays largely reflect VM scheduling delays
  • VM scheduling delay is the dominant factor in network RTT
SLIDE 7

Network delays in public clouds

[Figures: network delay measurements in public clouds, from [HPDC'10] and [INFOCOM'10]]

SLIDE 8

Incast network congestion

  • A special form of network congestion, typically seen in distributed processing applications (scatter-gather):
    – Barrier-synchronized request workloads
    – The limited buffer space of the switch output port can easily be overfilled by simultaneous transmissions.
  • Application-level throughput (goodput) can be orders of magnitude lower than the link capacity.

[SIGCOMM’09]

SLIDE 9

Solutions for physical clusters

  • The dominant factor: once packet loss happens, how soon the sender can learn about it.
    – In case of "tail loss", the sender can only count on the retransmit timer's firing.
  • Two representative papers:
    – Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems [FAST'08]
    – Understanding TCP Incast Throughput Collapse in Datacenter Networks [WREN'09]
  • Prior works: none of them can fully eliminate the throughput collapse.
    – Increase switch buffer size
    – Limited transmit
    – Reduce duplicate ACK threshold
    – Disable slow-start
    – Randomize timeout value
    – Reno, NewReno, SACK

SLIDE 10

Solutions for physical clusters (cont’d)

  • Significantly reducing RTOmin has been shown to be a safe and effective approach [SIGCOMM'09].
  • Even with ECN support in the hardware switch, a small RTOmin still shows apparent advantages [DCTCP, SIGCOMM'10].

RTOmin in a virtual cluster? Not well studied.

SLIDE 11

Outline

  • Motivation
    – Physical datacenter vs. Virtualized datacenter
    – Incast congestion
  • Understand the Problem
    – Pseudo-congestion
    – Sender-side vs. Receiver-side
  • PVTCP – A ParaVirtualized TCP
    – Design, Implementation, Evaluation
  • Questions & Comments
SLIDE 12

Pseudo-congestion

A small RTOmin → frequent spurious RTOs

[Figure: measured RTTs (red points) and calculated RTO values (blue points) under RTOmin = 200ms, 100ms, 10ms, and 1ms; 3 VMs per core with 30ms timeslices. There is NO network congestion, yet RTTs still spike.]

Retransmit timeout (RTO), from TCP's low-pass filter:
    RTO = SRTT + 4 × RTTVAR, lower-bounded by RTOmin

SLIDE 13

Pseudo-congestion (cont’d)

  • A small RTOmin: serious spurious RTOs when RTTs vary widely.
  • A big RTOmin: throughput collapse under heavy network congestion.

"Adjusting RTOmin is a tradeoff between timely response and premature timeouts, and there is NO optimal balance between the two."
    – Allman and Paxson [SIGCOMM'99]

Virtualized datacenters → a new instantiation of this tradeoff.

SLIDE 14

Sender-side vs. Receiver-side

Workload: transmitting 4000 1MB data blocks.

    Scenario                            1× RTOs   2× RTOs   3× RTOs   4× RTOs
    3VMs → 1VM (sender VM delayed)        1086        –         –         –
    1VM → 3VMs (receiver VM delayed)       677       673       196        30

  • Scheduling delays at the sender VM: an RTO happens only once at a time.
  • Scheduling delays at the receiver VM: successive RTOs are normal.

SLIDE 15

A micro-view with tcpdump

snd.una: the first sent but unacknowledged byte.
snd.nxt: the next byte that will be sent.

[Figure: tcpdump traces of snd.nxt and snd.una: time (ms) vs. sequence number from the sender VM, and time (ms) vs. ACK number from the receiver VM]

When the receiver VM is preempted:
  • The generation and return of ACKs are delayed while the receiver VM is stopped.
  • RTOs therefore must happen on the sender's side.
  • In the trace, RTO happens twice before the receiver VM wakes up.

When the sender VM is preempted:
  • The ACK's arrival time is not delayed (an ACK arrives before the sender VM wakes up), but it is received too late.
  • From TCP's perspective, the RTO should not be triggered.
  • In the trace, RTO happens just after the sender VM wakes up.

SLIDE 16

The sender-side problem: OS reasons

[Figure: timeline of the TCP sender (VM1), the driver domain, and the TCP receiver. While VM1 waits in the scheduling queue behind VM2 and VM3, the returning ACK is buffered in the driver domain. When VM1 runs again, the timer IRQ is serviced first ("RTO happens!"), and only then does the network IRQ deliver the ACK: a spurious RTO.]

  • After the VM wakes up, both TIMER and NET interrupts are pending.
  • The RTO fires just before the ACK enters the VM.
  • The reasons lie in common OS design (see the softirq ordering below):
    – The timer interrupt is executed before other interrupts.
    – Network processing happens a little later (in the bottom half).
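In Linux, this ordering is built into the softirq priorities: softirqs are serviced in ascending order of their index, so TIMER_SOFTIRQ (which runs TCP's retransmit timer) is handled before NET_RX_SOFTIRQ (which delivers the buffered ACK to TCP). The enum below is as it appears in Linux 3.x-era include/linux/interrupt.h:

    /* Softirqs run in ascending order: the timer softirq that fires
     * TCP's retransmit timer precedes the network-receive softirq
     * that would deliver the pending ACK. */
    enum {
            HI_SOFTIRQ = 0,
            TIMER_SOFTIRQ,
            NET_TX_SOFTIRQ,
            NET_RX_SOFTIRQ,
            BLOCK_SOFTIRQ,
            BLOCK_IOPOLL_SOFTIRQ,
            TASKLET_SOFTIRQ,
            SCHED_SOFTIRQ,
            HRTIMER_SOFTIRQ,
            RCU_SOFTIRQ,
            NR_SOFTIRQS
    };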
SLIDE 17

To detect spurious RTOs

  • Two well-known detection algorithms: F-RTO and Eifel
    – Eifel performs much worse than F-RTO in some situations, e.g. with bursty packet loss [CCR'03]
    – F-RTO is the one implemented in Linux
  • F-RTO interacts badly with delayed ACK (ACK coalescing)
    – Reducing the delayed ACK timeout value does NOT help.
    – Disabling delayed ACK seems to be helpful.

[Figure: F-RTO detection rates for 3VMs → 1VM and 1VM → 3VMs: low detection rate in both cases]

SLIDE 18

Delayed ACK vs. CPU overhead

Disabling delayed ACK → significant CPU overhead

[Figure: CPU utilization of the sender VM and the receiver VM, with and without delayed ACK]

SLIDE 19

Delayed ACK vs. CPU overhead

Disabling delayed ACK → significant CPU overhead

    Total ACKs      delack-200ms   delack-1ms   w/o delack
    3VMs → 1VM         229,650       244,757     2,832,260
    1VM → 3VMs         252,278       262,274     2,832,179

Disabling delayed ACK: 11~13× more ACKs are sent.

SLIDE 20

Outline

  • Motivation
    – Physical datacenter vs. Virtualized datacenter
    – Incast congestion
  • Understand the Problem
    – Pseudo-congestion
    – Sender-side vs. Receiver-side
  • PVTCP – A ParaVirtualized TCP
    – Design, Implementation, Evaluation
  • Questions & Comments
SLIDE 21

PVTCP – A ParaVirtualized TCP

  • Main Idea
    – If we can detect such moments, and let the guest OS be aware of them, there is a chance to handle the problem.
  • Observation
    – Spurious RTOs only happen when the sender/receiver VM has just experienced a scheduling delay.

"The more information about current network conditions available to a transport protocol, the more efficiently it can use the network to transfer its data."
    – Allman and Paxson [SIGCOMM'99]
SLIDE 22

Detect the VM’s wakeup moment

[Figure: 3 VMs per core, 30ms timeslices. The hypervisor delivers a virtual timer IRQ to the guest every 1ms (HZ=1000), and each IRQ executes jiffies++. While the VM is NOT running, no timer IRQs arrive; when the VM runs again, the clock catches up at once, e.g. jiffies += 60.]

  • An acute increase of the system clock (jiffies) → the VM has just woken up (a minimal sketch follows below).
  • One-shot timer
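A minimal sketch of this heuristic (hypothetical code, not the authors' implementation; vm_just_woke_up() and the threshold are illustrative): compare the jiffies values seen by consecutive timer interrupts and flag a wakeup when the clock jumps.

    #include <stdbool.h>

    /* Hypothetical sketch of the wakeup heuristic, called from the
     * guest's timer-interrupt path. With HZ=1000, consecutive ticks
     * normally advance jiffies by 1; a much larger jump means the
     * vCPU was descheduled and has just woken up. */
    #define WAKEUP_JUMP_THRESHOLD 10    /* ticks; an assumed value */

    static unsigned long last_jiffies;

    static bool vm_just_woke_up(unsigned long now_jiffies)
    {
            bool woke = (now_jiffies - last_jiffies) > WAKEUP_JUMP_THRESHOLD;
            last_jiffies = now_jiffies;
            return woke;
    }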
SLIDE 23

PVTCP – the sender VM is preempted

  • Spurious RTOs can be avoided: no need to detect them at all!

[Figure: the same timeline as before. The ACK waits in the driver domain's buffer during the VM scheduling latency; on wakeup, the timer IRQ ("RTO happens!") is serviced before the network IRQ delivers the ACK, producing a spurious RTO.]

SLIDE 24

PVTCP – the sender VM is preempted

  • Spurious RTOs can be avoided: no need to detect them at all!
  • Solution: after the VM wakes up, extend the TCP retransmit timer's expiry time by 1ms (a sketch follows below).

[Figure: with PVTCP, the retransmit timer's expiry is pushed 1ms beyond the wakeup moment, so the network IRQ wins the race: the ACK enters first and resets the timer, and no spurious RTO fires.]
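A minimal kernel-style sketch of this step (hypothetical; pvtcp_extend_rto_timer() is an illustrative name, not the authors' code): when the wakeup heuristic fires, a pending retransmit timer is postponed by roughly 1ms.

    #include <linux/jiffies.h>
    #include <linux/timer.h>

    /* Hypothetical sketch of the sender-side fix: if the VM has just
     * woken up while a TCP retransmit timer is pending, push its
     * expiry ~1ms into the future. The pending network IRQ then
     * delivers the buffered ACK first, which resets the timer, so
     * the spurious RTO never fires. */
    static void pvtcp_extend_rto_timer(struct timer_list *retransmit_timer)
    {
            if (timer_pending(retransmit_timer))
                    mod_timer(retransmit_timer,
                              jiffies + msecs_to_jiffies(1));
    }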

SLIDE 25

PVTCP – the sender VM is preempted

[Figure: the same timeline, with PVTCP extending the timer so the network IRQ delivers the ACK first and resets the timer]

  • Remaining issue: the ACK processed right after wakeup yields a polluted RTT sample, since Measured RTT (MRTT) = TrueRTT + VMSchedDelay; feeding it into the estimator would inflate subsequent RTO values.
  • TCP's low-pass filter to estimate RTT/RTO:

        SRTT_i    = 7/8 × SRTT_{i-1} + 1/8 × MRTT_i
        RTTVAR_i  = 3/4 × RTTVAR_{i-1} + 1/4 × |SRTT_i − MRTT_i|
        RTO_{i+1} = SRTT_i + 4 × RTTVAR_i

  • Solution: substitute such samples with the previous smoothed value, MRTT_i ← SRTT_{i-1} (see the sketch below).
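A self-contained sketch of this filter with the substitution (plain C, following the slide's formulas; the sample values in main() are illustrative):

    #include <math.h>
    #include <stdio.h>

    /* TCP's RTT/RTO low-pass filter as written on the slide, plus
     * PVTCP's fix: a sample measured right after a VM wakeup is
     * polluted by the scheduling delay, so it is replaced by the
     * previous smoothed RTT before updating the estimator. */
    struct rto_est {
            double srtt;    /* smoothed RTT (ms) */
            double rttvar;  /* RTT variance (ms) */
    };

    static double update_rto(struct rto_est *e, double mrtt_ms,
                             int vm_just_woke_up, double rto_min_ms)
    {
            double rto;

            if (vm_just_woke_up)
                    mrtt_ms = e->srtt;        /* MRTT_i <- SRTT_{i-1} */

            e->srtt = 7.0 / 8.0 * e->srtt + 1.0 / 8.0 * mrtt_ms;
            e->rttvar = 3.0 / 4.0 * e->rttvar
                      + 1.0 / 4.0 * fabs(e->srtt - mrtt_ms);

            rto = e->srtt + 4.0 * e->rttvar;
            return rto > rto_min_ms ? rto : rto_min_ms; /* RTOmin bound */
    }

    int main(void)
    {
            struct rto_est e = { .srtt = 0.4, .rttvar = 0.1 };

            /* A normal ~0.5ms sample, then a 60ms wakeup spike that
             * gets substituted instead of poisoning the filter. */
            printf("RTO after normal sample: %.2f ms\n",
                   update_rto(&e, 0.5, 0, 1.0));
            printf("RTO after wakeup spike:  %.2f ms\n",
                   update_rto(&e, 60.0, 1, 1.0));
            return 0;
    }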

SLIDE 26

PVTCP – the receiver VM is preempted

Spurious RTOs cannot be avoided, so we have to let the sender detect them.

  • Detection algorithms require a deterministic return of future ACKs from the receiver:
    – Eifel: checks the timestamp of the first ACK
    – F-RTO: checks the ACK numbers of the first two ACKs
    – Enabling delayed ACK → retransmission ambiguity
    – Disabling delayed ACK → significant CPU overhead
  • Solution: temporarily disable delayed ACK when the receiver VM has just woken up (a user-space analogue is sketched below).
    – Just-in-time: do not delay the ACKs for the first three segments.
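PVTCP does this inside the guest's TCP stack; as a user-space analogue of the same mechanism (not PVTCP's code), Linux exposes a temporary, per-socket version of quick ACKing via the TCP_QUICKACK socket option:

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    /* Temporarily switch a connected TCP socket out of delayed-ACK
     * mode. TCP_QUICKACK is deliberately not permanent: the kernel
     * may fall back to delayed ACKs later, so it is re-armed only
     * around the moments that need immediate ACKs, mirroring the
     * "temporarily disable" idea above. */
    static int quickack_once(int sockfd)
    {
            int on = 1;

            return setsockopt(sockfd, IPPROTO_TCP, TCP_QUICKACK,
                              &on, sizeof(on));
    }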

SLIDE 27

PVTCP evaluation: throughput

Experimental setup: 20 sender VMs → 1 receiver VM.

[Figure: goodput vs. RTOmin for PVTCP-1ms, TCP-1ms, and TCP-200ms. TCP faces a dilemma between pseudo-congestion (small RTOmin) and real congestion (large RTOmin); PVTCP avoids throughput collapse over the whole range.]

SLIDE 28

PVTCP evaluation: CPU overhead

[Figure: CPU utilization of the sender VM and the receiver VM]

With delayed ACK enabled: PVTCP (RTOmin=1ms) ≈ TCP (RTOmin=200ms) in CPU overhead.

SLIDE 29

PVTCP evaluation: CPU overhead

    Total ACKs      TCP-200ms   TCP-1ms   PVTCP-1ms
    3VMs → 1VM        192,587    244,757     192,863   (+0%)
    1VM → 3VMs        194,384    262,274     208,688   (+7.4%)

  • 3VMs → 1VM (sender-side delays): spurious RTOs are avoided outright, so PVTCP sends no extra ACKs.
  • 1VM → 3VMs (receiver-side delays): delayed ACK is temporarily disabled to help the sender detect spurious RTOs, costing only 7.4% more ACKs.

SLIDE 30

One concern

SLIDE 31

The buffer of the netback

  • The vif's buffer temporarily stores incoming packets while the VM is preempted.
    – ifconfig vifX.Y txqueuelen [value]
  • The default value is too small → intensive packet loss.
    – #define XENVIF_QUEUE_LENGTH 32
  • This parameter should be set much larger (> 10,000, perhaps).

[Figure: timelines of scheduling delays at the receiver VM and at the sender VM. In both cases, data packets waiting for ACKs pile up in the driver domain's buffer while the VMs rotate through the scheduling queue, and RTOs fire. The buffer size matters!]

SLIDE 32

Summary

Problem: VM scheduling delays cause spurious RTOs.

  • Proposed solution: a ParaVirtualized TCP (PVTCP)
    – Provides a method to detect a VM's wakeup moment.
  • Sender-side problem: rooted in OS design.
    – Spurious RTOs can be avoided: slightly extend the retransmit timer's expiry time after the sender VM wakes up.
  • Receiver-side problem: a networking problem.
    – Spurious RTOs can be detected: temporarily disable delayed ACK (just-in-time) after the receiver VM wakes up.
  • Future work: your inputs ..
SLIDE 33

Thank you for listening

Comments & Questions

Email: lwcheng@cs.hku.hk URL: http://www.cs.hku.hk/~lwcheng