SLIDE 1

On ParaVirtualizing TCP: Congestion Control in Xen Virtual Machines

Luwei Cheng, Cho-Li Wang, Francis C.M. Lau
Department of Computer Science, The University of Hong Kong

Xen Project Developer Summit 2013
Edinburgh, UK, October 24-25, 2013

SLIDE 2

Outline

  • Motivation
    – Physical datacenter vs. Virtualized datacenter
    – Incast congestion
  • Understand the Problem
    – Pseudo-congestion
    – Sender-side vs. Receiver-side
  • PVTCP – A ParaVirtualized TCP
    – Design, Implementation, Evaluation
  • Questions & Comments
SLIDE 4

Physical datacenter

  • A set of physical machines
  • Network delays: propagation delays of the physical network/switches

[Figure: a core switch connects ToR switches, each serving a rack of servers; in the virtualized case, each server hosts multiple VMs]

Virtualized datacenter

  • A set of virtual machines
  • Network delays: additional delays due to virtualization overhead

SLIDE 5

Virtualization brings “delays”

  • 1. I/O virtualization overhead (PV or HVM)
    – Guest VMs are unable to directly access the hardware.
    – Additional data movement between dom0 and domUs.
    – HVM: passthrough I/O can avoid this overhead.
  • 2. VM scheduling delays
    – Multiple VMs share one physical core.

[Figure: multiple VMs time-share each pCPU under the hypervisor; a VM waiting for the pCPU experiences delay]

SLIDE 6

Virtualization brings “delays”

[Figure: measured RTT distributions. PM → PM: avg 0.147ms; 1VM → 1VM: avg 0.374ms; 1VM → 2VMs: peaks of 30ms; 1VM → 3VMs: peaks of 60ms]

  • Delays of I/O virtualization (PV guests): < 1ms
  • VM scheduling delays: tens of milliseconds
    – Queuing delays largely reflect VM scheduling delays
  • VM scheduling delay is the dominant factor in network RTT
SLIDE 7

Network delays in public clouds

[Figures: network delay measurements in public clouds, from [HPDC'10] and [INFOCOM'10]]

SLIDE 8

Incast network congestion

  • A special form of network congestion, typically seen in distributed processing applications (scatter-gather):
    – Barrier-synchronized request workloads
    – The limited buffer space of the switch output port can easily be overfilled by simultaneous transmissions.
  • Application-level throughput (goodput) can be orders of magnitude lower than the link capacity.

[SIGCOMM’09]

SLIDE 9

Solutions for physical clusters

  • The dominant factor: once packet loss happens, how soon the sender can learn about it.
    – In case of "tail loss", the sender can only count on the retransmit timer's firing.
  • Two representative papers:
    – Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems [FAST'08]
    – Understanding TCP Incast Throughput Collapse in Datacenter Networks [WREN'09]
  • Prior works: none of them can fully eliminate the throughput collapse.
    – Increase switch buffer size
    – Limited transmit
    – Reduce duplicate ACK threshold
    – Disable slow-start
    – Randomize timeout value
    – Reno, NewReno, SACK

SLIDE 10

Solutions for physical clusters (cont’d)

  • Significantly reducing RTOmin has been shown to be a safe and effective approach [SIGCOMM'09].
  • Even with ECN support in the hardware switch, a small RTOmin still shows apparent advantages [DCTCP, SIGCOMM'10].

RTOmin in a virtual cluster? Not well studied.

SLIDE 11

Outline

  • Motivation
    – Physical datacenter vs. Virtualized datacenter
    – Incast congestion
  • Understand the Problem
    – Pseudo-congestion
    – Sender-side vs. Receiver-side
  • PVTCP – A ParaVirtualized TCP
    – Design, Implementation, Evaluation
  • Questions & Comments
SLIDE 12

Pseudo-congestion

A small RTOmin → frequent spurious RTOs

[Figure: measured RTTs (red points) and calculated RTO values (blue points) under RTOmin = 200ms, 100ms, 10ms, and 1ms; 3 VMs per core with 30ms timeslices. There is NO network congestion, yet RTTs still spike.]

Retransmit timeout (RTO), from TCP's low-pass filter:
    RTO = SRTT + 4 × RTTVAR, lower-bounded by RTOmin

SLIDE 13

Pseudo-congestion (cont’d)

  • A small RTOmin: serious spurious RTOs when RTTs vary widely.
  • A big RTOmin: throughput collapse under heavy network congestion.

"Adjusting RTOmin is a tradeoff between timely response and premature timeouts, and there is NO optimal balance between the two."
    – Allman and Paxson [SIGCOMM'99]

Virtualized datacenters → a new instantiation of this tradeoff.

SLIDE 14

Sender-side vs. Receiver-side

Workload: transmitting 4000 1MB data blocks.

    Scenario                            1× RTOs   2× RTOs   3× RTOs   4× RTOs
    3VMs → 1VM (sender VM delayed)        1086        –         –         –
    1VM → 3VMs (receiver VM delayed)       677       673       196        30

  • Scheduling delays at the sender VM: an RTO happens only once at a time.
  • Scheduling delays at the receiver VM: successive RTOs are normal.

SLIDE 15

A micro-view with tcpdump

snd.una: the first sent but unacknowledged byte.
snd.nxt: the next byte that will be sent.

[Figure: tcpdump traces of snd.nxt and snd.una: time (ms) vs. sequence number from the sender VM, and time (ms) vs. ACK number from the receiver VM]

When the receiver VM is preempted:
  • The generation and return of ACKs are delayed while the receiver VM is stopped.
  • RTOs therefore must happen on the sender's side.
  • In the trace, RTO happens twice before the receiver VM wakes up.

When the sender VM is preempted:
  • The ACK's arrival time is not delayed (an ACK arrives before the sender VM wakes up), but it is received too late.
  • From TCP's perspective, the RTO should not be triggered.
  • In the trace, RTO happens just after the sender VM wakes up.

SLIDE 16

The sender-side problem: OS reasons

[Figure: timeline of the TCP sender (VM1), the driver domain, and the TCP receiver. While VM1 waits in the scheduling queue behind VM2 and VM3, the returning ACK is buffered in the driver domain. When VM1 runs again, the timer IRQ is serviced first ("RTO happens!"), and only then does the network IRQ deliver the ACK: a spurious RTO.]

  • After the VM wakes up, both TIMER and NET interrupts are pending.
  • The RTO fires just before the ACK enters the VM.
  • The reasons lie in common OS design (see the softirq ordering below):
    – The timer interrupt is executed before other interrupts.
    – Network processing happens a little later (in the bottom half).
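In Linux, this ordering is built into the softirq priorities: softirqs are serviced in ascending order of their index, so TIMER_SOFTIRQ (which runs TCP's retransmit timer) is handled before NET_RX_SOFTIRQ (which delivers the buffered ACK to TCP). The enum below is as it appears in Linux 3.x-era include/linux/interrupt.h:

    /* Softirqs run in ascending order: the timer softirq that fires
     * TCP's retransmit timer precedes the network-receive softirq
     * that would deliver the pending ACK. */
    enum {
            HI_SOFTIRQ = 0,
            TIMER_SOFTIRQ,
            NET_TX_SOFTIRQ,
            NET_RX_SOFTIRQ,
            BLOCK_SOFTIRQ,
            BLOCK_IOPOLL_SOFTIRQ,
            TASKLET_SOFTIRQ,
            SCHED_SOFTIRQ,
            HRTIMER_SOFTIRQ,
            RCU_SOFTIRQ,
            NR_SOFTIRQS
    };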
SLIDE 17

To detect spurious RTOs

  • Two well-known detection algorithms: F-RTO and Eifel
    – Eifel performs much worse than F-RTO in some situations, e.g. with bursty packet loss [CCR'03]
    – F-RTO is the one implemented in Linux
  • F-RTO interacts badly with delayed ACK (ACK coalescing)
    – Reducing the delayed ACK timeout value does NOT help.
    – Disabling delayed ACK seems to be helpful.

[Figure: F-RTO detection rates for 3VMs → 1VM and 1VM → 3VMs: low detection rate in both cases]

SLIDE 18

Delayed ACK vs. CPU overhead

Disabling delayed ACK → significant CPU overhead

[Figure: CPU utilization of the sender VM and the receiver VM, with and without delayed ACK]

SLIDE 19

Delayed ACK vs. CPU overhead

Disabling delayed ACK → significant CPU overhead

    Total ACKs      delack-200ms   delack-1ms   w/o delack
    3VMs → 1VM         229,650       244,757     2,832,260
    1VM → 3VMs         252,278       262,274     2,832,179

Disabling delayed ACK: 11~13× more ACKs are sent.

SLIDE 20

Outline

  • Motivation
    – Physical datacenter vs. Virtualized datacenter
    – Incast congestion
  • Understand the Problem
    – Pseudo-congestion
    – Sender-side vs. Receiver-side
  • PVTCP – A ParaVirtualized TCP
    – Design, Implementation, Evaluation
  • Questions & Comments
SLIDE 21

PVTCP – A ParaVirtualized TCP

  • Main Idea
    – If we can detect such moments, and let the guest OS be aware of them, there is a chance to handle the problem.
  • Observation
    – Spurious RTOs only happen when the sender/receiver VM has just experienced a scheduling delay.

"The more information about current network conditions available to a transport protocol, the more efficiently it can use the network to transfer its data."
    – Allman and Paxson [SIGCOMM'99]
SLIDE 22

Detect the VM’s wakeup moment

[Figure: 3 VMs per core, 30ms timeslices. The hypervisor delivers a virtual timer IRQ to the guest every 1ms (HZ=1000), and each IRQ executes jiffies++. While the VM is NOT running, no timer IRQs arrive; when the VM runs again, the clock catches up at once, e.g. jiffies += 60.]

  • An acute increase of the system clock (jiffies) → the VM has just woken up (a minimal sketch follows below).
  • One-shot timer
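A minimal sketch of this heuristic (hypothetical code, not the authors' implementation; vm_just_woke_up() and the threshold are illustrative): compare the jiffies values seen by consecutive timer interrupts and flag a wakeup when the clock jumps.

    #include <stdbool.h>

    /* Hypothetical sketch of the wakeup heuristic, called from the
     * guest's timer-interrupt path. With HZ=1000, consecutive ticks
     * normally advance jiffies by 1; a much larger jump means the
     * vCPU was descheduled and has just woken up. */
    #define WAKEUP_JUMP_THRESHOLD 10    /* ticks; an assumed value */

    static unsigned long last_jiffies;

    static bool vm_just_woke_up(unsigned long now_jiffies)
    {
            bool woke = (now_jiffies - last_jiffies) > WAKEUP_JUMP_THRESHOLD;
            last_jiffies = now_jiffies;
            return woke;
    }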
SLIDE 23

PVTCP – the sender VM is preempted

  • Spurious RTOs can be avoided: no need to detect them at all!

[Figure: the same timeline as before. The ACK waits in the driver domain's buffer during the VM scheduling latency; on wakeup, the timer IRQ ("RTO happens!") is serviced before the network IRQ delivers the ACK, producing a spurious RTO.]

SLIDE 24

PVTCP – the sender VM is preempted

  • Spurious RTOs can be avoided: no need to detect them at all!
  • Solution: after the VM wakes up, extend the TCP retransmit timer's expiry time by 1ms (a sketch follows below).

[Figure: with PVTCP, the retransmit timer's expiry is pushed 1ms beyond the wakeup moment, so the network IRQ wins the race: the ACK enters first and resets the timer, and no spurious RTO fires.]
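A minimal kernel-style sketch of this step (hypothetical; pvtcp_extend_rto_timer() is an illustrative name, not the authors' code): when the wakeup heuristic fires, a pending retransmit timer is postponed by roughly 1ms.

    #include <linux/jiffies.h>
    #include <linux/timer.h>

    /* Hypothetical sketch of the sender-side fix: if the VM has just
     * woken up while a TCP retransmit timer is pending, push its
     * expiry ~1ms into the future. The pending network IRQ then
     * delivers the buffered ACK first, which resets the timer, so
     * the spurious RTO never fires. */
    static void pvtcp_extend_rto_timer(struct timer_list *retransmit_timer)
    {
            if (timer_pending(retransmit_timer))
                    mod_timer(retransmit_timer,
                              jiffies + msecs_to_jiffies(1));
    }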

SLIDE 25

PVTCP – the sender VM is preempted

[Figure: the same timeline, with PVTCP extending the timer so the network IRQ delivers the ACK first and resets the timer]

  • Remaining issue: the ACK processed right after wakeup yields a polluted RTT sample, since Measured RTT (MRTT) = TrueRTT + VMSchedDelay; feeding it into the estimator would inflate subsequent RTO values.
  • TCP's low-pass filter to estimate RTT/RTO:

        SRTT_i    = 7/8 × SRTT_{i-1} + 1/8 × MRTT_i
        RTTVAR_i  = 3/4 × RTTVAR_{i-1} + 1/4 × |SRTT_i − MRTT_i|
        RTO_{i+1} = SRTT_i + 4 × RTTVAR_i

  • Solution: substitute such samples with the previous smoothed value, MRTT_i ← SRTT_{i-1} (see the sketch below).
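A self-contained sketch of this filter with the substitution (plain C, following the slide's formulas; the sample values in main() are illustrative):

    #include <math.h>
    #include <stdio.h>

    /* TCP's RTT/RTO low-pass filter as written on the slide, plus
     * PVTCP's fix: a sample measured right after a VM wakeup is
     * polluted by the scheduling delay, so it is replaced by the
     * previous smoothed RTT before updating the estimator. */
    struct rto_est {
            double srtt;    /* smoothed RTT (ms) */
            double rttvar;  /* RTT variance (ms) */
    };

    static double update_rto(struct rto_est *e, double mrtt_ms,
                             int vm_just_woke_up, double rto_min_ms)
    {
            double rto;

            if (vm_just_woke_up)
                    mrtt_ms = e->srtt;        /* MRTT_i <- SRTT_{i-1} */

            e->srtt = 7.0 / 8.0 * e->srtt + 1.0 / 8.0 * mrtt_ms;
            e->rttvar = 3.0 / 4.0 * e->rttvar
                      + 1.0 / 4.0 * fabs(e->srtt - mrtt_ms);

            rto = e->srtt + 4.0 * e->rttvar;
            return rto > rto_min_ms ? rto : rto_min_ms; /* RTOmin bound */
    }

    int main(void)
    {
            struct rto_est e = { .srtt = 0.4, .rttvar = 0.1 };

            /* A normal ~0.5ms sample, then a 60ms wakeup spike that
             * gets substituted instead of poisoning the filter. */
            printf("RTO after normal sample: %.2f ms\n",
                   update_rto(&e, 0.5, 0, 1.0));
            printf("RTO after wakeup spike:  %.2f ms\n",
                   update_rto(&e, 60.0, 1, 1.0));
            return 0;
    }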

SLIDE 26

PVTCP – the receiver VM is preempted

Spurious RTOs cannot be avoided, so we have to let the sender detect them.

  • Detection algorithms require a deterministic return of future ACKs from the receiver:
    – Eifel: checks the timestamp of the first ACK
    – F-RTO: checks the ACK numbers of the first two ACKs
    – Enabling delayed ACK → retransmission ambiguity
    – Disabling delayed ACK → significant CPU overhead
  • Solution: temporarily disable delayed ACK when the receiver VM has just woken up (a user-space analogue is sketched below).
    – Just-in-time: do not delay the ACKs for the first three segments.
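PVTCP does this inside the guest's TCP stack; as a user-space analogue of the same mechanism (not PVTCP's code), Linux exposes a temporary, per-socket version of quick ACKing via the TCP_QUICKACK socket option:

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    /* Temporarily switch a connected TCP socket out of delayed-ACK
     * mode. TCP_QUICKACK is deliberately not permanent: the kernel
     * may fall back to delayed ACKs later, so it is re-armed only
     * around the moments that need immediate ACKs, mirroring the
     * "temporarily disable" idea above. */
    static int quickack_once(int sockfd)
    {
            int on = 1;

            return setsockopt(sockfd, IPPROTO_TCP, TCP_QUICKACK,
                              &on, sizeof(on));
    }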

SLIDE 27

PVTCP evaluation: throughput

Experimental setup: 20 sender VMs → 1 receiver VM.

[Figure: goodput vs. RTOmin for PVTCP-1ms, TCP-1ms, and TCP-200ms. TCP faces a dilemma between pseudo-congestion (small RTOmin) and real congestion (large RTOmin); PVTCP avoids throughput collapse over the whole range.]

SLIDE 28

PVTCP evaluation: CPU overhead

[Figure: CPU utilization of the sender VM and the receiver VM]

With delayed ACK enabled: PVTCP (RTOmin=1ms) ≈ TCP (RTOmin=200ms) in CPU overhead.

SLIDE 29

PVTCP evaluation: CPU overhead

    Total ACKs      TCP-200ms   TCP-1ms   PVTCP-1ms
    3VMs → 1VM        192,587    244,757     192,863   (+0%)
    1VM → 3VMs        194,384    262,274     208,688   (+7.4%)

  • 3VMs → 1VM (sender-side delays): spurious RTOs are avoided outright, so PVTCP sends no extra ACKs.
  • 1VM → 3VMs (receiver-side delays): delayed ACK is temporarily disabled to help the sender detect spurious RTOs, costing only 7.4% more ACKs.

SLIDE 30

One concern

SLIDE 31

The buffer of the netback

  • The vif's buffer temporarily stores incoming packets while the VM is preempted.
    – ifconfig vifX.Y txqueuelen [value]
  • The default value is too small → intensive packet loss.
    – #define XENVIF_QUEUE_LENGTH 32
  • This parameter should be set much larger (> 10,000, perhaps).

[Figure: timelines of scheduling delays at the receiver VM and at the sender VM. In both cases, data packets waiting for ACKs pile up in the driver domain's buffer while the VMs rotate through the scheduling queue, and RTOs fire. The buffer size matters!]

SLIDE 32

Summary

Problem: VM scheduling delays cause spurious RTOs.

  • Proposed solution: a ParaVirtualized TCP (PVTCP)
    – Provides a method to detect a VM's wakeup moment.
  • Sender-side problem: rooted in OS design.
    – Spurious RTOs can be avoided: slightly extend the retransmit timer's expiry time after the sender VM wakes up.
  • Receiver-side problem: a networking problem.
    – Spurious RTOs can be detected: temporarily disable delayed ACK (just-in-time) after the receiver VM wakes up.
  • Future work: your inputs ..
SLIDE 33

Thank you for listening

Comments & Questions

Email: lwcheng@cs.hku.hk URL: http://www.cs.hku.hk/~lwcheng