CompSci 514: Computer Networks Lecture 14 Datacenter Transport protocols II


SLIDE 1

CompSci 514: Computer Networks Lecture 14 Datacenter Transport protocols II

Xiaowei Yang

SLIDE 2

Roadmap

  • Clos topology
  • Datacenter TCP
  • Re-architecting datacenter networks and stacks for low latency and high performance

– Best Paper award, SIGCOMM'17

SLIDE 3

Motivation for Clos topology

  • Clos topology aims to achieve the performance of a crossbar switch
  • When the number of ports n is large, it is hard to build such an n × n switch

SLIDE 4

Clos topology

  • A multi-stage switching network
  • A path from any input port to any output port
  • Each switch has a small number of ports
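To make the port-count point concrete, here is a minimal sketch of the standard 3-tier fat-tree, one common instance of a Clos network, built from identical k-port switches. The k³/4-host and 5k²/4-switch counts come from that construction, not from these slides.

```python
# Size of a 3-tier fat-tree (a common Clos instance) built from k-port switches.
# Assumption: the standard fat-tree construction (k pods, k/2 edge and k/2
# aggregation switches per pod, (k/2)^2 core switches); not taken from the slides.

def fat_tree_size(k):
    assert k % 2 == 0, "port count k must be even"
    hosts = (k ** 3) // 4        # k pods x (k/2 edge switches) x (k/2 hosts each)
    edge = agg = k * (k // 2)    # per tier: k pods x k/2 switches
    core = (k // 2) ** 2
    return hosts, edge + agg + core

for k in (4, 48):
    hosts, switches = fat_tree_size(k)
    print(f"k={k}: {hosts} hosts from {switches} switches with {k} ports each")
# k=48: 27648 hosts from 2880 48-port switches -- no 27648 x 27648 switch needed.
```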
SLIDE 5

Roadmap

  • Clos topology
  • Datacenter TCP
  • Re-architecting datacenter networks and stacks for low latency and high performance

– Best Paper award, SIGCOMM'17

SLIDE 6

Datacenter Impairments

  • Incast
  • Queue Buildup
  • Buffer Pressure


SLIDE 7

Queue Buildup


(Diagram: Sender 1, Sender 2, Receiver sharing a switch queue.)

  • Big flows build up queues.

– Increased latency for short flows.

  • Measurements in a Bing cluster

– For 90% of packets: RTT < 1 ms
– For 10% of packets: 1 ms < RTT < 15 ms

SLIDE 8

Data Center Transport Requirements


  • 1. High Burst Tolerance

– Incast due to Partition/Aggregate is common.

  • 2. Low Latency

– Short flows, queries

  • 3. High Throughput

– Continuous data updates, large file transfers

The challenge is to achieve these three together.

SLIDE 9

Tension Between Requirements

(Diagram: the tension among High Burst Tolerance, High Throughput, and Low Latency; DCTCP sits in the middle.)

Existing approaches fall short:

  • Deep Buffers: queuing delays increase latency.
  • Shallow Buffers: bad for bursts and throughput.
  • Reduced RTOmin (SIGCOMM '09): doesn't help latency.
  • AQM (RED): the average queue is not fast enough for incast.

Objective: Low Queue Occupancy & High Throughput

SLIDE 10

The DCTCP Algorithm


SLIDE 11

Review: The TCP/ECN Control Loop


(Diagram: Sender 1, Sender 2, Receiver.)

ECN Mark (1 bit)

ECN = Explicit Congestion Notification
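As a reminder of how the loop the next slides modify behaves, here is a minimal sketch of the classic TCP/ECN reaction, simplified to one cut per window; the function and field names are illustrative, not from the slides.

```python
# Simplified classic TCP/ECN loop: the switch sets CE instead of dropping, the
# receiver echoes ECE, and the sender halves cwnd at most once per window.
# Illustrative sketch only; real TCP also handles CWR, slow start, loss, etc.

def switch_forward(pkt, queue_len, threshold):
    if queue_len > threshold:
        pkt["CE"] = True                  # mark instead of dropping
    return pkt

def receiver_ack(pkt):
    return {"ECE": pkt.get("CE", False)}  # echo congestion back to the sender

def sender_on_ack(cwnd, ack, reacted_this_window):
    if ack["ECE"] and not reacted_this_window:
        return cwnd / 2, True             # cut by 50%, regardless of severity
    return cwnd, reacted_this_window
```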

SLIDE 12

Small Queues & TCP Throughput: The Buffer Sizing Story

  • Bandwidth-delay product rule of thumb:

– A single flow needs C × RTT of buffering for 100% Throughput.

(Plot: cwnd saw-tooth, buffer size B, throughput 100%.)

SLIDE 13

Small Queues & TCP Throughput: The Buffer Sizing Story

  • Bandwidth-delay product rule of thumb:

– A single flow needs C × RTT of buffering for 100% Throughput.

  • Appenzeller rule of thumb (SIGCOMM '04):

– Large # of flows: C × RTT / sqrt(N) is enough.

(Plot: cwnd saw-tooth, buffer size B, throughput 100%.)

SLIDE 14

Small Queues & TCP Throughput: The Buffer Sizing Story

  • Bandwidth-delay product rule of thumb:

– A single flow needs C × RTT of buffering for 100% Throughput.

  • Appenzeller rule of thumb (SIGCOMM '04):

– Large # of flows: C × RTT / sqrt(N) is enough.

  • Can't rely on stat-mux benefit in the DC.

– Measurements show typically 1-2 big flows at each server, at most 4.

SLIDE 15

Small Queues & TCP Throughput: The Buffer Sizing Story

  • Bandwidth-delay product rule of thumb:

– A single flow needs C × RTT of buffering for 100% Throughput.

  • Appenzeller rule of thumb (SIGCOMM '04):

– Large # of flows: C × RTT / sqrt(N) is enough.

  • Can't rely on stat-mux benefit in the DC.

– Measurements show typically 1-2 big flows at each server, at most 4.

Real Rule of Thumb: Low Variance in Sending Rate → Small Buffers Suffice
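As a numeric illustration of the two rules of thumb above; the link speed, RTT, and flow counts below are made-up example values, not measurements from the paper.

```python
import math

C = 10e9 / 8    # link capacity: 10 Gbps in bytes/sec (example value)
RTT = 100e-6    # 100 microsecond datacenter RTT (example value)

bdp = C * RTT                         # bandwidth-delay product rule: B = C x RTT
for N in (1, 4, 10000):
    appenzeller = bdp / math.sqrt(N)  # SIGCOMM '04 rule: B = C x RTT / sqrt(N)
    print(f"N={N:>5}: BDP buffer = {bdp/1e3:.0f} KB, "
          f"Appenzeller buffer = {appenzeller/1e3:.1f} KB")
# With only 1-4 big flows per server (the DC case), sqrt(N) gives little relief.
```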

SLIDE 16

Two Key Ideas

  • 1. React in proportion to the extent of congestion, not its presence.

✓ Reduces variance in sending rates, lowering queuing requirements.

  • 2. Mark based on instantaneous queue length.

✓ Fast feedback to better deal with bursts.


ECN marks seen over a window of 10 packets, and the resulting window cut:

– 1 0 1 1 1 1 0 1 1 1: TCP cuts the window by 50%; DCTCP cuts it by 40%
– 0 0 0 0 0 0 0 0 0 1: TCP cuts the window by 50%; DCTCP cuts it by 5%
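A small sketch of the contrast above. For simplicity it uses the instantaneous marked fraction as α; the actual algorithm (next slide) smooths α with a running average.

```python
def tcp_cut(marks):
    # Classic TCP/ECN: any mark in the window triggers a 50% cut.
    return 0.5 if any(marks) else 0.0

def dctcp_cut(marks):
    # DCTCP: cut in proportion to the fraction of marked packets (alpha / 2).
    alpha = sum(marks) / len(marks)
    return alpha / 2

for marks in ([1, 0, 1, 1, 1, 1, 0, 1, 1, 1], [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]):
    print(marks, f"TCP cuts {tcp_cut(marks):.0%}, DCTCP cuts {dctcp_cut(marks):.0%}")
# -> DCTCP cuts 40% and 5%, matching the slide; TCP cuts 50% in both cases.
```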

SLIDE 17

Data Center TCP Algorithm

Switch side:

– Mark packets when Queue Length > K.


Sender side:

– Maintain a running average of the fraction of packets marked (α). In each RTT:
  α ← (1 − g)·α + g·F, where F is the fraction of packets marked in the last RTT
– Adaptive window decrease: W ← W·(1 − α/2)
– Note: the decrease factor is between 1 and 2.

(Diagram: switch queue with buffer size B and marking threshold K; arrivals above K are marked, below K are not.)

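Putting the switch and sender sides together, a minimal per-RTT sketch. K and the traffic handling are illustrative (the paper uses g = 1/16); the α and window updates are the ones on this slide.

```python
# Minimal per-RTT DCTCP control loop (sketch). K is an example value; the
# alpha and window updates follow the slide above.

K = 30          # switch marking threshold, in packets (example value)
g = 1.0 / 16    # EWMA gain for alpha (value used in the DCTCP paper)

def switch_mark(queue_len):
    # Mark based on the *instantaneous* queue length, not an average.
    return queue_len > K

def sender_rtt_update(alpha, cwnd, marked, sent):
    F = marked / sent if sent else 0.0      # fraction marked in the last RTT
    alpha = (1 - g) * alpha + g * F         # running estimate of congestion extent
    if marked:
        cwnd = cwnd * (1 - alpha / 2)       # cut in proportion to alpha
    else:
        cwnd = cwnd + 1                     # additive increase, as in TCP
    return alpha, cwnd
```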
SLIDE 18

Working with delayed ACKs

  • Figure 10 (from the paper): the two-state ACK generation state machine.
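A sketch of that two-state receiver as the paper describes it: while the CE codepoint of arriving packets matches the current state, ACKs are delayed (one per m packets); when it changes, the pending packets are acknowledged immediately with the old state's ECN-Echo value, so the sender learns the exact marked fraction. The class name, callback, and m = 2 are illustrative.

```python
class DctcpReceiver:
    # Two-state delayed-ACK generation (sketch of Figure 10).

    def __init__(self, ack, m=2):
        self.ack = ack          # callback: ack(num_packets, ece=...)
        self.m = m              # send one delayed ACK per m packets
        self.ce_state = 0       # CE value of the packets currently being delayed
        self.pending = 0        # packets received but not yet acknowledged

    def on_packet(self, ce):
        if ce != self.ce_state:
            if self.pending:
                self.ack(self.pending, ece=self.ce_state)  # flush old state now
            self.pending = 0
            self.ce_state = ce
        self.pending += 1
        if self.pending >= self.m:
            self.ack(self.pending, ece=self.ce_state)      # normal delayed ACK
            self.pending = 0

rx = DctcpReceiver(lambda n, ece: print(f"ACK covering {n} packets, ECE={ece}"))
for ce in (0, 0, 1, 1, 1, 0):
    rx.on_packet(ce)
```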
SLIDE 19

DCTCP in Action


Setup: Windows 7 hosts, Broadcom 1 Gbps switch.
Scenario: 2 long-lived flows, K = 30 KB.

(Plot: queue length in KBytes over time.)

SLIDE 20

Why it Works

  • 1. High Burst Tolerance

✓ Large buffer headroom → bursts fit.
✓ Aggressive marking → sources react before packets are dropped.

  • 2. Low Latency

✓ Small buffer occupancies → low queuing delay.

  • 3. High Throughput

✓ ECN averaging → smooth rate adjustments, low variance.


SLIDE 21

Analysis

  • How low can DCTCP maintain queues without loss of throughput?
  • How do we set the DCTCP parameters?


– Need to quantify queue size oscillations (Stability).

(Figure: window size saw-tooth over time, oscillating between (W* + 1)(1 − α/2) and W* + 1.)

SLIDE 22


Analysis

  • How low can DCTCP maintain queues without loss of throughput?
  • How do we set the DCTCP parameters?


– Need to quantify queue size oscillations (Stability).

(Figure: window size saw-tooth over time, oscillating between (W* + 1)(1 − α/2) and W* + 1; packets sent in the one RTT when the window exceeds W* are marked.)

SLIDE 23

Analysis

  • Queue length: Q(t) = N·W(t) − C × RTT
  • The key observation is that with synchronized senders, the queue size exceeds the marking threshold K for exactly one RTT in each period of the saw-tooth, before the sources receive ECN marks and reduce their window sizes accordingly.
  • S(W1, W2) = (W2² − W1²)/2 is the number of packets sent while the window grows from W1 to W2.
  • Critical window size at which ECN marking starts: W* = (C × RTT + K)/N

SLIDE 24
  • Fraction of marked packets: α = S(W*, W* + 1) / S((W* + 1)(1 − α/2), W* + 1)
  • α²(1 − α/4) = (2W* + 1)/(W* + 1)² ≈ 2/W*
  • α ≈ sqrt(2/W*)
  • Single-flow oscillation

– D = (W* + 1) − (W* + 1)(1 − α/2) = (W* + 1)·α/2

  • Queue oscillation amplitude: A = N·D = N(W* + 1)·α/2 ≈ (N/2)·sqrt(2W*) = (1/2)·sqrt(2N(C × RTT + K))   (8)
  • Oscillation period (in RTTs): T_C = D = (1/2)·sqrt(2(C × RTT + K)/N)   (9)
  • Finally, using Q(t) = N·W(t) − C × RTT: Qmax = N(W* + 1) − C × RTT = K + N   (10)
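A small worked example of these formulas; the link speed, RTT, K, and N below are made-up illustrative values, not the paper's settings.

```python
import math

C = 1e9 / 8 / 1500     # 1 Gbps link in packets/sec, assuming 1500-byte packets
RTT = 300e-6           # 300 microseconds
K = 20                 # marking threshold, in packets
N = 2                  # number of synchronized long-lived flows

W_star = (C * RTT + K) / N                    # window at which marking starts
alpha = math.sqrt(2 / W_star)                 # steady-state marked fraction
A = 0.5 * math.sqrt(2 * N * (C * RTT + K))    # queue oscillation amplitude  (8)
T_C = 0.5 * math.sqrt(2 * (C * RTT + K) / N)  # oscillation period in RTTs   (9)
Q_max = K + N                                 # peak queue length            (10)

print(f"W* = {W_star:.1f} pkts, alpha = {alpha:.2f}, A = {A:.1f} pkts, "
      f"T_C = {T_C:.1f} RTTs, Qmax = {Q_max} pkts")
```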

SLIDE 25

Analysis

  • How low can DCTCP maintain queues without loss of throughput?
  • How do we set the DCTCP parameters?


– Need to quantify queue size oscillations (Stability).

85% Less Buffer than TCP

Qmin = Qmax − A   (11)
     = K + N − (1/2)·sqrt(2N(C × RTT + K))   (12)

Minimizing Qmin over N and requiring it to stay positive gives the marking threshold guideline K > (C × RTT)/7.
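Continuing the same illustrative numbers, the resulting Qmin and the K > (C × RTT)/7 guideline that keeps it positive (the guideline is the paper's; the numbers are examples).

```python
import math

C = 1e9 / 8 / 1500     # 1 Gbps in packets/sec (same illustrative values as above)
RTT = 300e-6
K = 20
N = 2

A = 0.5 * math.sqrt(2 * N * (C * RTT + K))     # amplitude, equation (8)
Q_min = K + N - A                              # equation (12)
K_floor = C * RTT / 7                          # paper's guideline so Qmin stays > 0

print(f"Qmin = {Q_min:.1f} pkts; need K > C*RTT/7 = {K_floor:.1f} pkts")
```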

SLIDE 26

Evaluation

  • Implemented in Windows stack.
  • Real hardware, 1Gbps and 10Gbps experiments

– 90-server testbed
– Broadcom Triumph: 48 1G ports, 4 MB shared memory
– Cisco Cat4948: 48 1G ports, 16 MB shared memory
– Broadcom Scorpion: 24 10G ports, 4 MB shared memory

  • Numerous micro-benchmarks

– Throughput and Queue Length
– Multi-hop
– Queue Buildup
– Buffer Pressure
– Fairness and Convergence
– Incast
– Static vs Dynamic Buffer Mgmt

  • Cluster traffic benchmark



SLIDE 27

Cluster Traffic Benchmark

  • Emulate traffic within one rack of a Bing cluster

– 45 1G servers, one 10G server for external traffic

  • Generate query and background traffic

– Flow sizes and arrival times follow distributions seen in Bing

  • Metric:

– Flow completion time for queries and background flows.


We use RTOmin = 10ms for both TCP & DCTCP.

SLIDE 28

Baseline


(Plots: flow completion times for Background Flows and Query Flows.)

SLIDE 29

Baseline


(Plots: flow completion times for Background Flows and Query Flows.)

✓ Low latency for short flows.

SLIDE 30

Baseline


(Plots: flow completion times for Background Flows and Query Flows.)

✓ Low latency for short flows.
✓ High throughput for long flows.

SLIDE 31

Baseline


(Plots: flow completion times for Background Flows and Query Flows.)

✓ Low latency for short flows.
✓ High throughput for long flows.
✓ High burst tolerance for query flows.

SLIDE 32

Scaled Background & Query

10x Background, 10x Query


(Plots: completion times for Query and Short messages.)

SLIDE 33

Conclusions

  • DCTCP satisfies all our requirements for data center packet transport.

✓ Handles bursts well
✓ Keeps queuing delays low
✓ Achieves high throughput

  • Features:

✓ Very simple change to TCP and a single switch parameter.
✓ Based on mechanisms already available in silicon.


SLIDE 34

Comments

  • Real world data
  • A novel idea
  • Comprehensive evaluation
  • Didn't compare with the scheme of eliminating RTOmin and using microsecond-granularity RTT measurements

  • Deadline-based scheduling research
SLIDE 35

Discussion

  • How does DCTCP differ from TCP?
  • Will DCTCP work well on the Internet? Why?
  • Is there a tradeoff between generality and performance?

SLIDE 36

Re-architecting datacenter networks and stacks for low latency and high performance

Mark Handley, Costin Raiciu, Alexandru Agache, Andrei Voinescu, Andrew W. Moore, Gianni Antichi, and Marcin Wójcik

SLIDE 37

Motivation

  • Low latency
  • High throughput
SLIDE 38

Design assumptions

  • Clos Topology
  • Designer can change end system protocol stacks as well as switches

SLIDE 39
  • https://www.youtube.com/watch?v=OI3mh1Vx8xI

SLIDE 40

Discussion

  • Will NDP work well on the Internet? Why?
  • Is there a tradeoff between generality and performance?

  • Will it work well on a non-Clos topology?
SLIDE 41

Summary

  • How to overcome the transport challenges in DC networks

  • DCTCP

– Use the fraction of CE-marked packets to estimate the extent of congestion
– Smooth sending rates

  • NDP

– Start at full rate, spray packets across all paths, and trim packets to headers when queues fill
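A minimal sketch of the "trim" step at a switch output queue; the queue limits and field names are illustrative, and NDP's other pieces (starting at line rate, per-packet spraying, receiver-driven pulls) are not shown.

```python
# NDP-style packet trimming (sketch). When the data queue is full, the payload
# is cut off and only the header is forwarded on a small high-priority queue,
# so the receiver learns what was dropped and can pull a retransmission.

DATA_QUEUE_LIMIT = 8      # full packets the port will buffer (illustrative)
HEADER_QUEUE_LIMIT = 64   # trimmed headers are tiny, so many fit (illustrative)

def enqueue(pkt, data_q, header_q):
    if len(data_q) < DATA_QUEUE_LIMIT:
        data_q.append(pkt)                                         # normal forwarding
    elif len(header_q) < HEADER_QUEUE_LIMIT:
        header_q.append({**pkt, "payload": b"", "trimmed": True})  # trim to header
    # else: drop entirely, only as a last resort
```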