CompSci 514: Computer Networks
Lecture 14: Datacenter Transport Protocols II
Xiaowei Yang
Roadmap
- Clos topology
- Datacenter TCP
- Re-architecting datacenter networks and
stacks for low latency and high performance
– Best Paper award, SIGCOMM’17
Motivation for Clos topology
- Clos topology aims to achieve the
performance of a cross-bar switch
- When the number of ports n is large, it is hard to build a single n×n crossbar switch
Clos topology
- A multi-stage switching network
- A path from any input port to any output port
- Each switch has a small number of ports
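A common datacenter instance of this idea is the k-ary fat-tree, a folded Clos built entirely from identical k-port switches. The sketch below (ours, not from the slides; the function name is made up) computes how far small switches scale:

```python
def fat_tree_capacity(k):
    """For a k-ary fat-tree (a folded Clos of identical k-port switches),
    return (hosts, switches) at full bisection bandwidth.
    hosts = k^3/4; switches = k^2 edge+aggregation plus (k/2)^2 core."""
    assert k % 2 == 0, "k must be even"
    hosts = k ** 3 // 4
    switches = 5 * k ** 2 // 4
    return hosts, switches

# With commodity 48-port switches:
print(fat_tree_capacity(48))  # (27648, 2880)
```

So roughly 27k hosts get full-bandwidth paths out of 48-port parts, which is the slide's point: no single crossbar of that size is needed.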
Roadmap
- Clos topology
- Datacenter TCP
- Re-architecting datacenter networks and
stacks for low latency and high performance
– Best Paper award, SIGCOMM’17
Datacenter Impairments
- Incast
- Queue Buildup
- Buffer Pressure
Queue Buildup
Sender 1 Sender 2 Receiver
- Big flows build up queues.
Ø Increased latency for short flows.
- Measurements in a Bing cluster
Ø For 90% of packets: RTT < 1 ms
Ø For 10% of packets: 1 ms < RTT < 15 ms
Data Center Transport Requirements
- 1. High Burst Tolerance
– Incast due to Partition/Aggregate is common.
- 2. Low Latency
– Short flows, queries
- 3. High Throughput
– Continuous data updates, large file transfers
The challenge is to achieve these three together.
Tension Between Requirements
High Burst Tolerance High Throughput Low Latency
DCTCP
Deep Buffers:
Ø Queuing delays increase latency
Shallow Buffers:
Ø Bad for bursts & throughput
Reduced RTOmin (SIGCOMM ‘09):
Ø Doesn’t help latency
AQM – RED:
Ø Average queue not fast enough for incast
Objective: Low Queue Occupancy & High Throughput
The DCTCP Algorithm
Review: The TCP/ECN Control Loop
Sender 1 Sender 2 Receiver
ECN Mark (1 bit)
ECN = Explicit Congestion Notification
Small Queues & TCP Throughput:
The Buffer Sizing Story
- Bandwidth-delay product rule of thumb:
– A single flow needs C × RTT of buffering for 100% throughput.
- Appenzeller rule of thumb (SIGCOMM ‘04):
– Large # of flows: C × RTT / √N is enough.
- Can’t rely on stat-mux benefit in the DC.
– Measurements show typically 1-2 big flows at each server, at most 4.
Real Rule of Thumb: Low Variance in Sending Rate → Small Buffers Suffice
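The rules of thumb above compare directly in numbers. A quick sketch (ours; the link parameters are illustrative):

```python
from math import sqrt

def buffer_bdp(capacity_bps, rtt_s):
    """Classic rule of thumb: one bandwidth-delay product of buffering."""
    return capacity_bps * rtt_s / 8  # bytes

def buffer_appenzeller(capacity_bps, rtt_s, n_flows):
    """Appenzeller et al. (SIGCOMM '04): C*RTT/sqrt(N) suffices
    when N desynchronized flows share the link."""
    return buffer_bdp(capacity_bps, rtt_s) / sqrt(n_flows)

# 10 Gbps link, 100 us datacenter RTT:
C, RTT = 10e9, 100e-6
print(buffer_bdp(C, RTT))                 # ~125 KB
print(buffer_appenzeller(C, RTT, 10000))  # ~1.25 KB with many flows
print(buffer_appenzeller(C, RTT, 2))      # ~88 KB: with only 1-2 big flows,
                                          # the sqrt(N) savings vanish
```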
Two Key Ideas
- 1. React in proportion to the extent of congestion, not its presence.
ü Reduces variance in sending rates, lowering queuing requirements.
- 2. Mark based on instantaneous queue length.
ü Fast feedback to better deal with bursts.
ECN Marks              TCP                DCTCP
1 0 1 1 1 1 0 1 1 1    Cut window by 50%  Cut window by 40%
0 0 0 0 0 0 0 0 0 1    Cut window by 50%  Cut window by 5%
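The table's numbers fall out of the two cut rules. A minimal sketch (ours; it uses the raw marked fraction as α and ignores DCTCP's EWMA for clarity):

```python
def tcp_cut(marks):
    """Standard TCP/ECN: halve the window if any packet was marked."""
    return 0.5 if any(marks) else 0.0

def dctcp_cut(marks):
    """DCTCP: cut in proportion to the fraction of marked packets (alpha/2)."""
    alpha = sum(marks) / len(marks)
    return alpha / 2

heavy = [1, 0, 1, 1, 1, 1, 0, 1, 1, 1]   # 8 of 10 marked
mild  = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]   # 1 of 10 marked
print(tcp_cut(heavy), dctcp_cut(heavy))  # 0.5 0.4  -> cut 50% vs 40%
print(tcp_cut(mild), dctcp_cut(mild))    # 0.5 0.05 -> cut 50% vs 5%
```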
Data Center TCP Algorithm
Switch side:
– Mark packets when Queue Length > K.
Sender side:
– Maintain a running average of the fraction of packets marked (α). In each RTT:
α ← (1 − g)·α + g·F, where F is the fraction of packets marked in the last window.
– Adaptive window decrease: W ← W × (1 − α/2).
Ø Note: the window is divided by a factor between 1 (α = 0) and 2 (α = 1).
Working with delayed ACKs
- Figure 10: Two-state ACK generation state machine.
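The sender side above can be sketched in a few lines (ours; the paper's experiments use an EWMA gain g = 1/16):

```python
G = 1 / 16  # EWMA gain g

def dctcp_rtt_update(alpha, cwnd, marked, total):
    """One DCTCP sender update at the end of an RTT:
    F = fraction of this RTT's packets that carried ECN marks,
    alpha <- (1 - g)*alpha + g*F, and on any mark cwnd <- cwnd*(1 - alpha/2)."""
    f = marked / total
    alpha = (1 - G) * alpha + G * f
    if marked:  # the window is only reduced when marks actually arrive
        cwnd *= 1 - alpha / 2
    return alpha, cwnd

# A fully marked RTT moves alpha up by g and trims the window slightly:
alpha, cwnd = dctcp_rtt_update(0.0, 100.0, marked=10, total=10)
print(alpha, cwnd)  # 0.0625 96.875
```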
DCTCP in Action
Setup: Windows 7 hosts, Broadcom 1 Gbps switch
Scenario: 2 long-lived flows, K = 30 KB
Why it Works
- 1. High Burst Tolerance
ü Large buffer headroom → bursts fit.
ü Aggressive marking → sources react before packets are dropped.
- 2. Low Latency
ü Small buffer occupancies → low queuing delay.
- 3. High Throughput
ü ECN averaging → smooth rate adjustments, low variance.
Analysis
- How low can DCTCP maintain queues without loss of throughput?
- How do we set the DCTCP parameters?
Ø Need to quantify queue size oscillations (Stability).
- Window saw-tooth: W(t) oscillates between (W* + 1)(1 − α/2) and W* + 1;
the packets sent in the one RTT after the window reaches W* + 1 are marked.
Analysis
- Queue size: Q(t) = N·W(t) − C × RTT
- Key observation: with synchronized senders, the queue size exceeds the marking threshold K for exactly one RTT in each period of the saw-tooth, before the sources receive ECN marks and reduce their window sizes accordingly.
- Packets sent by one sender as its window grows from W1 to W2:
S(W1, W2) = (W2² − W1²)/2
- Critical window size when ECN marking occurs:
W* = (C × RTT + K)/N
- Fraction of marked packets:
α = S(W*, W* + 1) / S((W* + 1)(1 − α/2), W* + 1)
which simplifies to α²(1 − α/4) = (2W* + 1)/(W* + 1)² ≈ 2/W*, so α ≈ √(2/W*)
- Single-flow oscillation: D = (W* + 1) − (W* + 1)(1 − α/2) = (W* + 1)·α/2
- Queue amplitude: A = N·D = N(W* + 1)·α/2 ≈ (N/2)·√(2W*) = ½·√(2N(C × RTT + K))   (8)
- Period: T_C = D = ½·√(2(C × RTT + K)/N) (in RTTs)   (9)
- Maximum queue: Q_max = N(W* + 1) − C × RTT = K + N   (10)
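Plugging illustrative numbers (ours, not the paper's) into the formulas above shows they are mutually consistent:

```python
from math import sqrt

# N synchronized flows on a link of C packets/s, RTT in seconds,
# marking threshold K in packets (all values illustrative):
N, C, RTT, K = 2, 1e6, 100e-6, 30
bdp = C * RTT                        # ~100 packets fill the pipe

W_star = (bdp + K) / N               # critical window: ~65 packets
alpha = sqrt(2 / W_star)             # steady-state marked fraction ~0.175
A = 0.5 * sqrt(2 * N * (bdp + K))    # queue amplitude, eq. (8): ~11.4
Q_max = K + N                        # eq. (10): 32 packets
Q_min = Q_max - A                    # eq. (12): ~20.6 packets
print(W_star, round(alpha, 3), round(A, 1), Q_max, round(Q_min, 1))
```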
85% Less Buffer than TCP
Q_min = Q_max − A   (11)
      = K + N − ½·√(2N(C × RTT + K))   (12)
Minimizing Qmin
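Minimizing equation (12) over N (the worst case is N = (C × RTT + K)/8, which gives Q_min = (7K − C × RTT)/8) yields the paper's guideline that K > (C × RTT)/7 keeps Q_min above zero. A quick numeric check (parameters ours):

```python
from math import sqrt

def q_min(K, N, bdp):
    """Equation (12): Q_min = K + N - (1/2)*sqrt(2N(C*RTT + K))."""
    return K + N - 0.5 * sqrt(2 * N * (bdp + K))

bdp = 100.0                          # C * RTT in packets (illustrative)
worst_N = lambda K: (bdp + K) / 8    # the N that minimizes Q_min for a given K

K_ok = bdp / 7 + 1                   # just above the K > bdp/7 guideline
K_bad = bdp / 7 - 1                  # just below it
print(q_min(K_ok, worst_N(K_ok), bdp))    # positive: queue never empties
print(q_min(K_bad, worst_N(K_bad), bdp))  # negative: throughput is lost
```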
Evaluation
- Implemented in Windows stack.
- Real hardware, 1Gbps and 10Gbps experiments
– 90-server testbed
– Broadcom Triumph: 48 1G ports, 4 MB shared memory
– Cisco Cat4948: 48 1G ports, 16 MB shared memory
– Broadcom Scorpion: 24 10G ports, 4 MB shared memory
- Numerous micro-benchmarks
– Throughput and Queue Length
– Multi-hop
– Queue Buildup
– Buffer Pressure
– Fairness and Convergence
– Incast
– Static vs Dynamic Buffer Mgmt
- Cluster traffic benchmark
Cluster Traffic Benchmark
- Emulate traffic within 1 Rack of Bing cluster
– 45 1G servers, one 10G server for external traffic
- Generate query, and background traffic
– Flow sizes and arrival times follow distributions seen in Bing
- Metric:
– Flow completion time for queries and background flows.
We use RTOmin = 10ms for both TCP & DCTCP.
Baseline
25
Background Flows Query Flows
ü Low latency for short flows.
ü High throughput for long flows.
ü High burst tolerance for query flows.
Scaled Background & Query
10x Background, 10x Query
(Charts: flow completion times for query and short-message traffic.)
Conclusions
- DCTCP satisfies all our requirements for Data Center
packet transport.
ü Handles bursts well
ü Keeps queuing delays low
ü Achieves high throughput
- Features:
ü Very simple change to TCP and a single switch parameter.
ü Based on mechanisms already available in silicon.
Comments
- Real world data
- A novel idea
- Comprehensive evaluation
- Didn’t compare with the alternative scheme of eliminating RTOmin and using microsecond-granularity RTT measurements
- Deadline-based scheduling research
Discussion
- How does DCTCP differ from TCP?
- Will DCTCP work well on the Internet? Why?
- Is there a tradeoff between generality and
performance?
Re-architecting datacenter networks and stacks for low latency and high performance
Mark Handley, Costin Raiciu, Alexandru Agache, Andrei Voinescu, Andrew W. Moore, Gianni Antichi, and Marcin Wójcik
Motivation
- Low latency
- High throughput
Design assumptions
- Clos Topology
- Designer can change end system protocol
stacks as well as switches
- https://www.youtube.com/watch?v=OI3mh1Vx8xI
Discussion
- Will NDP work well on the Internet? Why?
- Is there a tradeoff between generality and
performance?
- Will it work well on non-Clos topologies?
Summary
- How to overcome the transport challenges in
DC networks
- DCTCP
– Use the fraction of CE-marked packets to estimate congestion
– Smooth sending rates
- NDP
– Re-architects datacenter networks and stacks for low latency and high performance