On ParaVirtualizing TCP: Congestion Control in Xen Virtual Machines
Luwei Cheng, Cho-Li Wang, Francis C.M. Lau Department of Computer Science The University of Hong Kong
Xen Project Developer Summit 2013 Edinburgh, UK, October 24-25, 2013
[Figure: data-center network topology — a core switch connecting ToR switches, each serving a rack of servers, with each server hosting multiple VMs]
– Guest VMs are unable to directly access the hardware.
– Additional data movement between dom0 and domUs.
– HVM: passthrough I/O can avoid this.
– Multiple VMs share one physical core
[Figure: multiple VMs time-sharing one pCPU under the hypervisor, introducing scheduling delay]
[Figures: RTT comparisons — physical machines vs. VMs (PM–PM, 1VM–1VM) and consolidation levels (1VM vs. 2VMs, 1VM vs. 3VMs); HPDC’10, INFOCOM’10]
– Barrier-synchronized request workloads.
– The limited buffer space of the switch output port can easily be overwhelmed, causing bursty packet loss [SIGCOMM’09].
– In case of “tail loss”, the sender can only count on the retransmit timer’s firing.
Two representative papers:
– Increase switch buffer size
– Limited transmit
– Reduce duplicate-ACK threshold
– Disable slow-start
– Randomize timeout value
– Reno, NewReno, SACK
[SIGCOMM’09] [DCTCP, SIGCOMM’10]
[Figure: performance with RTOmin = 200ms, 100ms, 10ms, and 1ms]
[Figure: 3 VMs sharing one pCPU, each receiving 30ms scheduling slices; red points: measured RTTs, blue points: calculated RTO values]
Retransmission TimeOut (RTO), from TCP’s low-pass filter:
RTO = SRTT + 4 × RTTVAR, lower-bounded by RTOmin
A small RTOmin: serious spurious RTOs when RTTs vary widely. A large RTOmin: throughput collapse under heavy network congestion.
The scheduling delays to the sender VM
The scheduling delays to the receiver VM
[Table: frequency of consecutive RTOs for the 3VMs→1VM and 1VM→3VMs cases — 1× RTOs, 2× RTOs, 3× RTOs, 4× RTOs; reported frequencies: 1086, 677, 673, 196, 30]
To transmit 4000 1MB data blocks
[Figure: sequence-number traces over time, sequence numbers ×10^6]
snd.una: the first sent but unacknowledged byte. snd.nxt: the next byte that will be sent.
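As a minimal sketch of how these two pointers move — and how a retransmission timeout rewinds snd.nxt back to snd.una — the following illustrative Python model may help (the SendWindow class is hypothetical, not the Linux implementation):

```python
# Minimal model of a TCP sender's send window, illustrating
# snd.una and snd.nxt as defined above. A hypothetical sketch,
# not kernel code.

class SendWindow:
    def __init__(self):
        self.snd_una = 0   # first sent-but-unacknowledged byte
        self.snd_nxt = 0   # next byte that will be sent

    def send(self, nbytes):
        """Transmit nbytes of new data: advances snd_nxt only."""
        self.snd_nxt += nbytes

    def ack(self, ack_no):
        """Cumulative ACK: all bytes before ack_no are acknowledged."""
        if self.snd_una < ack_no <= self.snd_nxt:
            self.snd_una = ack_no

    def on_rto(self):
        """On a retransmission timeout, resend from the unacknowledged
        edge: snd_nxt is pulled back to snd_una."""
        self.snd_nxt = self.snd_una

    def in_flight(self):
        return self.snd_nxt - self.snd_una

w = SendWindow()
w.send(3000)              # three 1000-byte segments sent
w.ack(1000)               # first segment acknowledged
print(w.in_flight())      # 2000 bytes still unacknowledged
```

In the sequence-number traces, a flat snd.una with an advancing snd.nxt means data is in flight but unacknowledged; an RTO shows up as snd.nxt dropping back to snd.una.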
[Figures: time (ms) vs. sequence number (from the sender VM), and time (ms) vs. ACK number (from the receiver VM)]
When the receiver VM is preempted: the receiver VM has been stopped, and RTO happens twice before it wakes up.
When the sender VM is preempted: the sender VM has been stopped; an ACK arrives before it wakes up, and RTO happens just after it wakes up.
[Figure: TCP sender / driver domain / TCP receiver timeline — while VM1 (the sender) waits in the scheduling queue behind VM2 and VM3, the returning ACK sits buffered in the driver domain; when VM1 finally runs, the timer IRQ is processed first (RTO happens!) and only then does the network IRQ deliver the ACK: a spurious RTO]
– Eifel performs much worse than F-RTO in some situations, e.g. with bursty packet loss [CCR’03] – F-RTO is implemented in Linux
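The F-RTO idea — judge a timeout spurious from the first two ACKs that arrive after it — can be condensed into a sketch (illustrative Python; this collapses RFC 5682’s basic algorithm into one predicate, and the function name and byte values are made up):

```python
# Condensed F-RTO-style spurious-timeout detection. After an RTO the
# sender retransmits the first unacked segment, then watches the next
# two ACKs: if both advance the window and the second acknowledges
# data BEYOND the retransmitted segment, the original transmissions
# must have been delivered, so the timeout was spurious.

def frto_detect(snd_una_at_rto, retrans_end, acks):
    """acks: ACK numbers of the first two ACKs after the timeout.
    Returns True if the timeout is judged spurious (illustrative)."""
    if len(acks) < 2:
        return False                       # not enough evidence yet
    first, second = acks[0], acks[1]
    return first > snd_una_at_rto and second > retrans_end

# Delayed original ACKs arrive once the VM wakes up:
print(frto_detect(1000, 2000, [3000, 4000]))   # True  -> spurious
print(frto_detect(1000, 2000, [2000]))         # False -> inconclusive
```

Under VM scheduling delay the detection rate is low because the buffered ACKs are often drained before the guest’s TCP stack runs, so the post-timeout ACK pattern F-RTO relies on never materializes in the expected order.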
[Figures: F-RTO detection under 3VMs→1VM and under 1VM→3VMs — low detection rate in both cases]
– Reducing the delayed-ACK timeout value does NOT help.
[Figures: sender-VM-delay and receiver-VM-delay cases]
Total ACKs (two cases):
– delack-200ms: 229,650 and 252,278
– delack-1ms: 244,757 and 262,274
– w/o delack: 2,832,260 and 2,832,179
(HZ=1000: the guest expects a virtual timer IRQ every 1ms, each doing jiffies++.)
[Figure: with 3 VMs per core, no timer IRQs are delivered while the VM is not running; when the VM runs again the hypervisor delivers the backlog at once and the guest’s clock jumps, e.g. jiffies += 60 after a 60ms gap]
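The clock catch-up can be mimicked in a few lines (illustrative Python; the run/deschedule helpers are hypothetical stand-ins, not Xen code):

```python
# Illustration of guest-clock catch-up after descheduling.
# With HZ=1000 the guest gets one virtual timer IRQ per millisecond
# while on-CPU; off-CPU, the ticks are delivered in a burst when
# the VM runs again (jiffies += gap), as in the slide.

HZ = 1000          # guest timer frequency: 1ms ticks

jiffies = 0

def run_for(ms):
    """VM on CPU: one timer IRQ (jiffies++) per millisecond."""
    global jiffies
    for _ in range(ms):
        jiffies += 1

def descheduled_for(ms):
    """VM off CPU for `ms` ms: backlog applied in one burst on wake-up."""
    global jiffies
    jiffies += ms          # e.g. jiffies += 60 after a 60ms gap

run_for(30)                # VM runs its 30ms slice
descheduled_for(60)        # two other VMs run (2 x 30ms)
run_for(30)
print(jiffies)             # 120 - wall time is tracked, but in bursts
```

The consequence for TCP: a retransmit timer armed in jiffies can appear to expire the instant the VM wakes up, before the buffered ACKs have been delivered.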
[Figure: with unmodified TCP, the RTO timer set at start time reaches its expiry time while the sender VM is descheduled; on wake-up the timer IRQ (RTO happens!) is processed before the network IRQ that delivers the ACK buffered in the driver domain, producing a spurious RTO]
[Figure: TCP vs. PVTCP — with PVTCP, on wake-up the network IRQ is handled first: the buffered ACK enters and the timer is reset, its expiry pushed back by 1ms, so the spurious RTO is avoided]
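PVTCP’s wake-up handling — deliver the buffered ACK’s network IRQ before the timer check, and push the RTO timer’s expiry back — can be sketched as follows (illustrative Python; the event model and function are my own simplification, not the actual PVTCP patch):

```python
# Sketch of the PVTCP idea on sender-VM wake-up: process any ACK
# buffered in the driver domain (net IRQ first), and push the RTO
# timer's expiry back by a small amount (1ms in the slides), so a
# timer that "expired" during the scheduling gap does not fire.

def on_vm_wakeup(now, timer_expiry, buffered_acks, push_back_ms=1):
    """Returns (new_timer_expiry, rto_fired). Illustrative only."""
    if buffered_acks:
        # Net IRQ first: the ACK enters and accounts for the
        # outstanding data, so reset (push back) the timer instead
        # of letting the stale expiry fire.
        timer_expiry = now + push_back_ms
        return timer_expiry, False
    # No ACK was waiting: if the timer has truly expired, this is
    # a genuine RTO, not an artifact of scheduling delay.
    return timer_expiry, now >= timer_expiry

# Sender VM was off-CPU for 60ms; an ACK arrived meanwhile and the
# timer expired during the gap:
print(on_vm_wakeup(now=90, timer_expiry=50, buffered_acks=[1000]))
print(on_vm_wakeup(now=90, timer_expiry=50, buffered_acks=[]))
```

The design point is ordering, not a new estimator: the same RTO machinery runs, but it only sees the world after the delayed ACKs have been accounted for.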
Timer
PVTCP
Timer
VM1 is running
Buffer
Timer
TCP sender Driver domain TCP receiver
ACK ACK ACK data data data Physical network Within hypervisor
VM2 is running VM3 is running VM1 is running
clear timer clear timer
wait .. deliver ACK
Expire time Timer
1ms
Net IRQ first: ACK enters.
Reset the timer.
VM scheduling latency
StartTime ExpiryTime
TCP’s low-pass filter to estimate RTT/RTO:
Smoothed RTT: SRTT_i = 7/8 × SRTT_{i−1} + 1/8 × MRTT_i
RTT variance: RTTVAR_i = 3/4 × RTTVAR_{i−1} + 1/4 × |SRTT_i − MRTT_i|
Expected RTO value: RTO_{i+1} = SRTT_i + 4 × RTTVAR_i
Measured RTT: MRTT = TrueRTT + VMSchedDelay
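These update rules, together with the RTOmin floor, can be written out directly (a minimal sketch in the RFC 6298 style the slide follows; the 1ms RTOmin and the millisecond values are illustrative). Feeding in one sample inflated by MRTT = TrueRTT + VMSchedDelay shows how a single scheduling delay perturbs the RTO:

```python
# Standard TCP RTT/RTO estimator matching the slide's equations
# (RFC 6298 form). Values in milliseconds; illustrative only.

def make_estimator(rto_min=200.0):
    srtt, rttvar = None, None
    def update(mrtt):
        nonlocal srtt, rttvar
        if srtt is None:                          # first measurement
            srtt, rttvar = mrtt, mrtt / 2
        else:
            rttvar = 0.75 * rttvar + 0.25 * abs(srtt - mrtt)
            srtt = 0.875 * srtt + 0.125 * mrtt
        return max(rto_min, srtt + 4 * rttvar)    # RTOmin floor
    return update

update = make_estimator(rto_min=1.0)    # aggressive microsecond-era floor
for _ in range(20):
    rto = update(0.2)                   # true datacenter RTT ~0.2ms
print(round(rto, 3))                    # 1.0 - the RTOmin floor dominates

# One sample inflated by a 60ms VM scheduling delay:
rto = update(0.2 + 60.0)
print(rto > 30)                         # True - RTO jumps sharply
```

This is why neither RTOmin setting wins: a 1ms floor makes the scheduling-delay spikes look like losses, while a 200ms floor makes every genuine loss cost hundreds of RTTs.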
– Eifel: check the timestamp of the first ACK
– F-RTO: check the ACK numbers of the first two ACKs
– Just-in-time: do not delay the ACKs for the first three segments
– Enabling delayed ACKs → retransmission ambiguity
– Disabling delayed ACKs → significant CPU overhead
Experimental setup: 20 sender VMs, 1 receiver VM
[Figures: goodput vs. RTOmin, for the sender-VM-delay and receiver-VM-delay cases]
Total ACKs (two cases):
– TCP-200ms: 192,587 and 194,384
– TCP-1ms: 244,757 and 262,274
– PVTCP-1ms: 192,863 and 208,688
ACK overhead of PVTCP-1ms over TCP-200ms: +0% and +7.4%
The scheduling delays to the receiver VM
[Figure: while the receiver VM waits in the scheduling queue (RUN/WAIT alternating across VM1–VM3), data packets accumulate in the driver domain’s buffer waiting to be ACKed, and the sender’s RTO fires during the receiver’s scheduling delay]
The scheduling delays to the sender VM
[Figure: while the sender VM waits in the scheduling queue, its data packets remain outstanding and the returning ACKs accumulate in the driver domain’s buffer; the RTO fires during the sender’s own scheduling delay]
The buffer size matters!