Accurate Latency-based Congestion Feedback for Datacenters

SLIDE 1

Accurate Latency-based Congestion Feedback for Datacenters

Changhyun Lee

with Chunjong Park, Keon Jang*, Sue Moon, and Dongsu Han
KAIST, *Intel Labs
USENIX Annual Technical Conference (ATC), July 10, 2015

SLIDE 2

Congestion control? Again???

  • Numerous congestion control algorithms have been proposed since Jacobson’s TCP
  • The performance of congestion control fundamentally depends on its congestion feedback
  • New forms of congestion feedback have enabled innovative congestion control behavior
  • Packet loss, latency, bandwidth, ECN, in-network feedback (RCP, XCP), etc.

2

[Diagram: the network emits congestion feedback to the control algorithm, whose reaction acts back on the network]

SLIDE 3

Congestion control challenges in DCN

  • Datacenters’ unique environment requires congestion control to be finer-grained than ever
  • Prevalence of latency-sensitive flows (partition/aggregate workloads)
  • Every 100 ms slowdown at Amazon = 1% drop in sales*
  • Dominance of queueing delay in end-to-end latency
  • Accurate and fine-grained congestion feedback is a must!

3

*Cracking latency in cloud, http://www.datacenterdynamics.com/

SLIDE 4

The most popular choice so far: ECN

  • ECN (Explicit Congestion Notification) detects congestion earlier than packet loss, but…
  • It still provides very coarse-grained feedback (binary)
  • DCTCP puts in more effort to improve granularity
  • Other ECN-based work also employs the same technique
  • The pursuit of better congestion feedback leads to customized in-network feedback → hard to deploy

4

[Diagram: out of three packets, 1 packet marked → congestion probability 33%; 2 packets marked → congestion probability 66%]
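The marking fractions above hint at how DCTCP recovers multi-bit congestion information from single-bit ECN marks. A minimal sketch, not the authors' code (the helper name is ours; g = 1/16 is DCTCP's default smoothing gain):

```python
# Sketch of DCTCP-style feedback granularity: instead of treating one
# ECN mark as binary congestion, track the fraction of marked packets
# per window and smooth it with an EWMA.

def update_alpha(alpha, marked, total, g=1.0 / 16):
    """EWMA of the marked-packet fraction; g is the smoothing gain."""
    frac = marked / total if total else 0.0
    return (1 - g) * alpha + g * frac

# 1 of 3 packets marked -> a ~33% marking fraction feeds the estimate;
# DCTCP then cuts its window in proportion: cwnd *= (1 - alpha / 2)
alpha = update_alpha(0.0, marked=1, total=3)
```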

SLIDE 5

Our proposal: latency feedback

  • Network latency is a good indicator of congestion
  • Latency-based congestion feedback has a long history, from CARD, DUAL, and TCP Vegas in wide-area networks
  • Feedback used: RTT measured in the TCP stack
  • We revisit latency feedback for use in datacenter networks

5

Can we reuse the same latency feedback as TCP Vegas?

SLIDE 6

Challenges in latency feedback in DC

  • Network latency changes on a µs time scale in datacenters
  • Differentiating a network latency change from other noise becomes a challenging task

6

Measuring network latency accurately at microsecond scale is crucial

                          Datacenter   Wide-area
Link speed                10 Gbps      100 Mbps
Transmission delay        1.2 μs       120 μs
Queueing delay (10 pkts)  12 μs        1.2 ms
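The table's numbers follow directly from serialization delay; a quick sanity check, assuming a 1500 B MTU frame:

```python
# Check the table: serialization delay of one 1500-byte frame at each
# link speed, and the queueing delay added by 10 such packets.

def transmission_delay_us(frame_bytes, link_bps):
    return frame_bytes * 8 / link_bps * 1e6  # seconds -> microseconds

dc = transmission_delay_us(1500, 10e9)    # datacenter, 10 Gbps: ~1.2 us
wan = transmission_delay_us(1500, 100e6)  # wide-area, 100 Mbps: ~120 us
print(round(dc, 3), round(wan, 3), round(10 * dc, 3), round(10 * wan, 3))
# 1.2 120.0 12.0 1200.0  (1.2 ms for 10 queued packets on the WAN link)
```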

SLIDE 7

Evaluation of TCP stack measurement

  • We test whether the RTT measured in the TCP stack can indicate the network congestion level in datacenters
  • We first evaluate the case of no congestion
  • Ideally, all RTT measurements should have the same value

7

[Diagram: Sender → Receiver over a 10 Gbps (TCP) link]

SLIDE 8

Inaccuracy of TCP stack measurement

8

Latency feedback from the stack cannot indicate the network congestion level

710 μs = 592 MTU packets at 10 Gbps

SLIDE 9

Why is TCP stack measurement unreliable?

  • Sources of errors in RTT measurement
  • End-host stack delay
  • I/O batching
  • Reverse path delay
  • Clock drift

9

Refer to our paper

SLIDE 10

Identifying sources of errors (1)

  • End-host stack delay
  • Packet I/O, stack processing, interrupt handling, CPU scheduling, etc.

10

[Diagram: Sender and Receiver stacks (NIC → Driver → Network stack → Application), with timestamping in the sender’s network stack]

Measured RTT = ACK RCVD TS – Data SENT TS

RTT measured from the kernel gets affected by host delay jitter

SLIDE 11

Removing stack delay (sender-side)

  • Solution #1: Driver-level timestamping (software)
  • We use SoftNIC*, an Intel DPDK-based packet processing platform

11

[Diagram: NIC → SoftNIC → Network stack → Application on both hosts; timestamping moved down into SoftNIC at the sender]

Measured RTT = ACK RCVD TS – Data SENT TS

* SoftNIC: A Software NIC to Augment Hardware, Sangjin Han, Keon Jang, Shoumik Palkar, Dongsu Han, and Sylvia Ratnasamy (Technical Report, UCB)

SLIDE 12

Removing stack delay (sender-side)

  • Solution #2: NIC-level timestamping (hardware)
  • We use Mellanox ConnectX-3, a timestamp-capable NIC

12

[Diagram: timestamping moved down to the NIC itself at the sender]

Measured RTT = ACK RCVD TS – Data SENT TS

SLIDE 13

Removing stack delay (receiver side)

  • Solution #3: Timestamping also at the receiver host
  • We subtract receiver node’s stack delay from RTT

13

[Diagram: timestamping below the stack at both Sender and Receiver]

Measured RTT = (ACK RCVD TS – Data SENT TS) – (ACK SENT TS – Data RCVD TS)
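The four-timestamp formula above is plain arithmetic; a minimal sketch (timestamps in μs, values illustrative, function name ours):

```python
# Sketch of the slide's RTT formula: subtract the receiver's stack
# delay (ACK sent - Data received) from the raw round trip so that
# only network time remains.

def net_rtt(data_sent, data_rcvd, ack_sent, ack_rcvd):
    raw_rtt = ack_rcvd - data_sent         # includes receiver stack delay
    receiver_delay = ack_sent - data_rcvd  # time spent inside the receiver
    return raw_rtt - receiver_delay

# 100 us raw RTT, of which 30 us was spent inside the receiver host
print(net_rtt(0.0, 40.0, 70.0, 100.0))  # 70.0
```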

SLIDE 14

Identifying sources of errors (2)

  • Bursty timestamps from I/O batching
  • Multiple packets acquire the same timestamp in network stack

14

[Diagram: a burst D1 D2 D3 crossing NIC → Driver → Network stack → Application at Sender and Receiver; timestamping at the sender]

Timestamps do not reflect the actual sending/receiving time

SLIDE 15

Removing bursty timestamps (driver)

15

  • SoftNIC stores bursty packets from the upper layer in a queue and paces them before timestamping

[Diagram: packets D1–D6 queued in SoftNIC between the network stack and the NIC; timestamping at dequeue]
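A minimal sketch of the pacing step (names and the 10 Gbps MTU spacing are our assumptions): a batch handed down together is released one link transmission time apart, and each packet is timestamped at its release.

```python
# Sketch: without pacing, a burst would share one timestamp. Releasing
# packets one serialization delay apart makes timestamps match wire times.

TX_DELAY_US = 1.2  # one 1500 B frame at 10 Gbps

def pace_and_timestamp(batch_arrival_us, n_packets):
    """Send timestamps for a burst, spaced at wire speed."""
    return [batch_arrival_us + i * TX_DELAY_US for i in range(n_packets)]

# A burst of 3 MTU packets arriving together at t = 0
print(pace_and_timestamp(0.0, 3))  # [0.0, 1.2, 2.4]
```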

SLIDE 16

Removing bursty timestamps (NIC)

  • Even NIC-level timestamping generates bursty timestamps
  • The NIC timestamps packets after DMA completion, not when packets are sent/received on the wire
  • We calibrate timestamps based on the link transmission delay
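The calibration can be sketched as follows (a receive-side flavor; the function and constant names are ours, not the paper's): when a batch DMA-completes together, every packet gets nearly the same NIC timestamp, but on the wire each packet must have arrived one link transmission delay before the next.

```python
# Sketch: spread a run of identical NIC timestamps backwards by the
# link transmission delay, anchoring on the last packet of the batch.

TX_DELAY_US = 1.2  # one 1500 B frame at 10 Gbps

def calibrate(nic_ts_us):
    """Replace a burst's shared timestamps with wire-spaced ones."""
    n = len(nic_ts_us)
    last = nic_ts_us[-1]
    return [last - (n - 1 - i) * TX_DELAY_US for i in range(n)]

print([round(t, 1) for t in calibrate([10.0, 10.0, 10.0])])
# [7.6, 8.8, 10.0]
```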

16

SLIDE 17

Improved accuracy by our techniques

17

The accuracy of HW timestamping is at sub-microsecond scale

[Chart: measurement error for the best SW and best HW timestamping configurations]

SLIDE 18

Can we measure accurate queuing delay?

  • Using our accurate RTT measurement, we infer the queueing delay (queue length) at the switch
  • Queueing delay is calculated as (Current RTT – Base RTT)
  • Current RTT: RTT sample from the current Data/ACK pair
  • Base RTT: RTT measured without congestion (the minimum value)

18

[Diagram: switch queue; one 1500-byte packet in a 1 Gbps switch queue = 12 μs increase in RTT]
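The inference rule above can be sketched directly (function name ours; RTT values illustrative, in μs):

```python
# Sketch of the slide's rule: queueing delay = Current RTT - Base RTT,
# with Base RTT taken as the minimum RTT observed (no congestion).

def queueing_delays_us(rtt_samples_us):
    base = min(rtt_samples_us)  # Base RTT
    return [rtt - base for rtt in rtt_samples_us]

# Each extra 1500 B packet queued at a 1 Gbps switch adds ~12 us
print(queueing_delays_us([100.0, 112.0, 124.0]))  # [0.0, 12.0, 24.0]
```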

SLIDE 19

Evaluation of queuing delay measurement

  • Traffic
  • Sender 1 generates 1Gbps full rate TCP traffic
  • Sender 2 generates an MTU (1500B) Ping packet every 25ms
  • Measurement
  • Sender 1 measures queueing delay
  • Switch measures ground-truth queue length

19

[Diagram: Sender 1 → Receiver at 1 Gbps (TCP); Sender 2 sends a 1500 B packet periodically]

SLIDE 20

Accuracy of queuing delay measurement

20

  • We can measure queueing delay at single-packet granularity
  • The ground truth from the switch matches our delay measurements

SLIDE 21

DX: latency-based congestion control

  • We propose DX, a new congestion control algorithm based on this accurate latency feedback
  • Goal: minimize queueing delay while fully utilizing network links
  • DX behavior is straightforward
  • When queueing delay is zero, DX increases the window size
  • When queueing delay is positive, DX decreases the window size

21

How much should we increase or decrease?

SLIDE 22

DX window calculation rule

  • Additive Increase: one packet per RTT
  • Multiplicative Decrease: proportional to the queueing delay
  • Challenge: how can we keep 100% utilization after a decrement?

22

[Equation: window decrease rule; Q: queueing delay, V: normalizer]
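A minimal sketch of a DX-style update following the slide's AI/MD description. The decrease factor Q / (Q + V) is an assumed form for illustration only; the paper derives the exact rule and the normalizer V.

```python
# Sketch: additive increase of one packet per RTT at zero queueing
# delay, multiplicative decrease proportional to queueing delay
# otherwise. Q / (Q + V) is an assumed factor, not the paper's formula.

def dx_update(cwnd, q, v):
    if q == 0:
        return cwnd + 1                  # AI: one packet per RTT
    return cwnd * (1 - q / (q + v))      # MD: shed our share of the queue

print(dx_update(20, 0, 100))    # 21
print(dx_update(20.0, 25, 75))  # 15.0 (a quarter of the window shed)
```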

SLIDE 23

DX example scenario

23

Q > 0 → Decrease window

SLIDE 24

Challenge: sender #1’s view

24

CWND = 20+1 at each of the three senders

How much should I decrease? How much congestion am “I” responsible for?

Simple assumption: the other senders have the same window size

The new window size can be calculated from the link capacity, RTT, and current window size*

*Refer to our paper for the detailed derivation

SLIDE 25

Implementation

  • We implement the timestamping module in SoftNIC
  • Timestamp collection
  • Data and ACK packet matching
  • RTT and queueing delay calculation
  • Bursty timestamp calibration
  • We implement the DX control algorithm in the Linux 3.13 kernel
  • 200+ lines of code added (mainly in tcp_ack())
  • Use of the TCP option header to store timestamps

25

SLIDE 26

Evaluation methodology

  • Testbed experiment (small-scale)
  • Bottleneck queue length in a 2-to-1 topology
  • ns-2 simulation (large-scale)
  • Flow completion time of a datacenter workload in a toy datacenter
  • More in our paper
  • Queueing delay and utilization with 10/20/30 senders
  • Flow throughput convergence
  • Impact of measurement noise on headroom
  • Fairness and throughput stability

26

SLIDE 27

Testbed experiment setup

  • Two senders share a bottleneck link (1Gbps/10Gbps)
  • Senders generate DX/DCTCP traffic to fully utilize the link
  • We measure and compare the queue length of DX/DCTCP

27

[Diagram: Sender 1 and Sender 2 share a 1G/10G bottleneck link to the Receiver]

SLIDE 28

Testbed experiment result at 1Gbps

28

DX reduces median queueing delay by 5.33× compared to DCTCP

SLIDE 29

Testbed experiment result at 10Gbps

29

Hardware timestamping achieves further queueing delay reduction

slide-30
SLIDE 30

Simulation with datacenter workload

  • Topology
  • A 3-tier fat tree with 192 nodes and 56 switches
  • Workload
  • Empirical web search workload from production datacenter

30

[Diagram: 3-tier fat-tree topology with core (C), aggregation (A), and top-of-rack (T) switches]

SLIDE 31

FCT of search workload simulation

31

[Chart: flow completion times; 0KB–10KB flows: 2.6× and 6.0× faster; 10MB– flows: 1.2× and 1.1× slower]

DX effectively reduces the completion time of small flows

SLIDE 32

Conclusion

  • The quality of congestion feedback fundamentally governs the performance of congestion control
  • We propose using latency feedback in datacenters, with support from our SW/HW timestamping techniques
  • We develop DX, a new latency-based congestion control, which achieves 5.3× (1 Gbps) and 1.6× (10 Gbps) queueing delay reductions compared to ECN-based DCTCP

32