CompSci 514: Computer Networks Lecture 14 Datacenter Transport protocols II


SLIDE 1

CompSci 514: Computer Networks Lecture 14 Datacenter Transport protocols II

Xiaowei Yang

SLIDE 2

Roadmap

  • Clos topology
  • Datacenter TCP
  • Re-architecting datacenter networks and stacks for low latency and high performance

– Best Paper award, SIGCOMM'17

SLIDE 3

Motivation for Clos topology

  • Clos topology aims to achieve the performance of a crossbar switch
  • When the number of ports n is large, it is hard to build such an n × n switch

SLIDE 4

Clos topology

  • A multi-stage switching network
  • A path from any input port to any output port
  • Each switch has a small number of ports
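To make the port-count point concrete, here is a minimal sketch of the standard 3-tier fat-tree, one common instance of a Clos network, built from identical k-port switches. The k³/4-host and 5k²/4-switch counts come from that construction, not from these slides.

```python
# Size of a 3-tier fat-tree (a common Clos instance) built from k-port switches.
# Assumption: the standard fat-tree construction (k pods, k/2 edge and k/2
# aggregation switches per pod, (k/2)^2 core switches); not taken from the slides.

def fat_tree_size(k):
    assert k % 2 == 0, "port count k must be even"
    hosts = (k ** 3) // 4        # k pods x (k/2 edge switches) x (k/2 hosts each)
    edge = agg = k * (k // 2)    # per tier: k pods x k/2 switches
    core = (k // 2) ** 2
    return hosts, edge + agg + core

for k in (4, 48):
    hosts, switches = fat_tree_size(k)
    print(f"k={k}: {hosts} hosts from {switches} switches with {k} ports each")
# k=48: 27648 hosts from 2880 48-port switches -- no 27648 x 27648 switch needed.
```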
SLIDE 5

Roadmap

  • Clos topology
  • Datacenter TCP
  • Re-architecting datacenter networks and stacks for low latency and high performance

– Best Paper award, SIGCOMM'17

SLIDE 6

Datacenter Impairments

  • Incast
  • Queue Buildup
  • Buffer Pressure


SLIDE 7

Queue Buildup


(Diagram: Sender 1, Sender 2, Receiver sharing a switch queue.)

  • Big flows build up queues.

– Increased latency for short flows.

  • Measurements in a Bing cluster

– For 90% of packets: RTT < 1 ms
– For 10% of packets: 1 ms < RTT < 15 ms

SLIDE 8

Data Center Transport Requirements


  • 1. High Burst Tolerance

– Incast due to Partition/Aggregate is common.

  • 2. Low Latency

– Short flows, queries

  • 3. High Throughput

– Continuous data updates, large file transfers

The challenge is to achieve these three together.

SLIDE 9

Tension Between Requirements

(Diagram: the tension among High Burst Tolerance, High Throughput, and Low Latency; DCTCP sits in the middle.)

Existing approaches fall short:

  • Deep Buffers: queuing delays increase latency.
  • Shallow Buffers: bad for bursts and throughput.
  • Reduced RTOmin (SIGCOMM '09): doesn't help latency.
  • AQM (RED): the average queue is not fast enough for incast.

Objective: Low Queue Occupancy & High Throughput

SLIDE 10

The DCTCP Algorithm


SLIDE 11

Review: The TCP/ECN Control Loop


(Diagram: Sender 1, Sender 2, Receiver.)

ECN Mark (1 bit)

ECN = Explicit Congestion Notification
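As a reminder of how the loop the next slides modify behaves, here is a minimal sketch of the classic TCP/ECN reaction, simplified to one cut per window; the function and field names are illustrative, not from the slides.

```python
# Simplified classic TCP/ECN loop: the switch sets CE instead of dropping, the
# receiver echoes ECE, and the sender halves cwnd at most once per window.
# Illustrative sketch only; real TCP also handles CWR, slow start, loss, etc.

def switch_forward(pkt, queue_len, threshold):
    if queue_len > threshold:
        pkt["CE"] = True                  # mark instead of dropping
    return pkt

def receiver_ack(pkt):
    return {"ECE": pkt.get("CE", False)}  # echo congestion back to the sender

def sender_on_ack(cwnd, ack, reacted_this_window):
    if ack["ECE"] and not reacted_this_window:
        return cwnd / 2, True             # cut by 50%, regardless of severity
    return cwnd, reacted_this_window
```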

SLIDE 12

Small Queues & TCP Throughput: The Buffer Sizing Story

  • Bandwidth-delay product rule of thumb:

– A single flow needs C × RTT of buffering for 100% Throughput.

(Plot: cwnd saw-tooth, buffer size B, throughput 100%.)

SLIDE 13

Small Queues & TCP Throughput: The Buffer Sizing Story

  • Bandwidth-delay product rule of thumb:

– A single flow needs C × RTT of buffering for 100% Throughput.

  • Appenzeller rule of thumb (SIGCOMM '04):

– Large # of flows: C × RTT / sqrt(N) is enough.

(Plot: cwnd saw-tooth, buffer size B, throughput 100%.)

SLIDE 14

Small Queues & TCP Throughput: The Buffer Sizing Story

  • Bandwidth-delay product rule of thumb:

– A single flow needs C × RTT of buffering for 100% Throughput.

  • Appenzeller rule of thumb (SIGCOMM '04):

– Large # of flows: C × RTT / sqrt(N) is enough.

  • Can't rely on stat-mux benefit in the DC.

– Measurements show typically 1-2 big flows at each server, at most 4.

SLIDE 15

Small Queues & TCP Throughput: The Buffer Sizing Story

  • Bandwidth-delay product rule of thumb:

– A single flow needs C × RTT of buffering for 100% Throughput.

  • Appenzeller rule of thumb (SIGCOMM '04):

– Large # of flows: C × RTT / sqrt(N) is enough.

  • Can't rely on stat-mux benefit in the DC.

– Measurements show typically 1-2 big flows at each server, at most 4.

Real Rule of Thumb: Low Variance in Sending Rate → Small Buffers Suffice
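As a numeric illustration of the two rules of thumb above; the link speed, RTT, and flow counts below are made-up example values, not measurements from the paper.

```python
import math

C = 10e9 / 8    # link capacity: 10 Gbps in bytes/sec (example value)
RTT = 100e-6    # 100 microsecond datacenter RTT (example value)

bdp = C * RTT                         # bandwidth-delay product rule: B = C x RTT
for N in (1, 4, 10000):
    appenzeller = bdp / math.sqrt(N)  # SIGCOMM '04 rule: B = C x RTT / sqrt(N)
    print(f"N={N:>5}: BDP buffer = {bdp/1e3:.0f} KB, "
          f"Appenzeller buffer = {appenzeller/1e3:.1f} KB")
# With only 1-4 big flows per server (the DC case), sqrt(N) gives little relief.
```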

SLIDE 16

Two Key Ideas

  • 1. React in proportion to the extent of congestion, not its presence.

✓ Reduces variance in sending rates, lowering queuing requirements.

  • 2. Mark based on instantaneous queue length.

✓ Fast feedback to better deal with bursts.


ECN marks seen over a window of 10 packets, and the resulting window cut:

– 1 0 1 1 1 1 0 1 1 1: TCP cuts the window by 50%; DCTCP cuts it by 40%
– 0 0 0 0 0 0 0 0 0 1: TCP cuts the window by 50%; DCTCP cuts it by 5%
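A small sketch of the contrast above. For simplicity it uses the instantaneous marked fraction as α; the actual algorithm (next slide) smooths α with a running average.

```python
def tcp_cut(marks):
    # Classic TCP/ECN: any mark in the window triggers a 50% cut.
    return 0.5 if any(marks) else 0.0

def dctcp_cut(marks):
    # DCTCP: cut in proportion to the fraction of marked packets (alpha / 2).
    alpha = sum(marks) / len(marks)
    return alpha / 2

for marks in ([1, 0, 1, 1, 1, 1, 0, 1, 1, 1], [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]):
    print(marks, f"TCP cuts {tcp_cut(marks):.0%}, DCTCP cuts {dctcp_cut(marks):.0%}")
# -> DCTCP cuts 40% and 5%, matching the slide; TCP cuts 50% in both cases.
```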

SLIDE 17

Data Center TCP Algorithm

Switch side:

– Mark packets when Queue Length > K.


Sender side:

– Maintain a running average of the fraction of packets marked (α). In each RTT:
  α ← (1 − g)·α + g·F, where F is the fraction of packets marked in the last RTT
– Adaptive window decrease: W ← W·(1 − α/2)
– Note: the decrease factor is between 1 and 2.

(Diagram: switch queue with buffer size B and marking threshold K; arrivals above K are marked, below K are not.)

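Putting the switch and sender sides together, a minimal per-RTT sketch. K and the traffic handling are illustrative (the paper uses g = 1/16); the α and window updates are the ones on this slide.

```python
# Minimal per-RTT DCTCP control loop (sketch). K is an example value; the
# alpha and window updates follow the slide above.

K = 30          # switch marking threshold, in packets (example value)
g = 1.0 / 16    # EWMA gain for alpha (value used in the DCTCP paper)

def switch_mark(queue_len):
    # Mark based on the *instantaneous* queue length, not an average.
    return queue_len > K

def sender_rtt_update(alpha, cwnd, marked, sent):
    F = marked / sent if sent else 0.0      # fraction marked in the last RTT
    alpha = (1 - g) * alpha + g * F         # running estimate of congestion extent
    if marked:
        cwnd = cwnd * (1 - alpha / 2)       # cut in proportion to alpha
    else:
        cwnd = cwnd + 1                     # additive increase, as in TCP
    return alpha, cwnd
```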
SLIDE 18

Working with delayed ACKs

  • Figure 10 (from the paper): the two-state ACK generation state machine.
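A sketch of that two-state receiver as the paper describes it: while the CE codepoint of arriving packets matches the current state, ACKs are delayed (one per m packets); when it changes, the pending packets are acknowledged immediately with the old state's ECN-Echo value, so the sender learns the exact marked fraction. The class name, callback, and m = 2 are illustrative.

```python
class DctcpReceiver:
    # Two-state delayed-ACK generation (sketch of Figure 10).

    def __init__(self, ack, m=2):
        self.ack = ack          # callback: ack(num_packets, ece=...)
        self.m = m              # send one delayed ACK per m packets
        self.ce_state = 0       # CE value of the packets currently being delayed
        self.pending = 0        # packets received but not yet acknowledged

    def on_packet(self, ce):
        if ce != self.ce_state:
            if self.pending:
                self.ack(self.pending, ece=self.ce_state)  # flush old state now
            self.pending = 0
            self.ce_state = ce
        self.pending += 1
        if self.pending >= self.m:
            self.ack(self.pending, ece=self.ce_state)      # normal delayed ACK
            self.pending = 0

rx = DctcpReceiver(lambda n, ece: print(f"ACK covering {n} packets, ECE={ece}"))
for ce in (0, 0, 1, 1, 1, 0):
    rx.on_packet(ce)
```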
SLIDE 19

DCTCP in Action


Setup: Windows 7 hosts, Broadcom 1 Gbps switch.
Scenario: 2 long-lived flows, K = 30 KB.

(Plot: queue length in KBytes over time.)

SLIDE 20

Why it Works

  • 1. High Burst Tolerance

✓ Large buffer headroom → bursts fit.
✓ Aggressive marking → sources react before packets are dropped.

  • 2. Low Latency

✓ Small buffer occupancies → low queuing delay.

  • 3. High Throughput

✓ ECN averaging → smooth rate adjustments, low variance.


SLIDE 21

Analysis

  • How low can DCTCP maintain queues without loss of throughput?
  • How do we set the DCTCP parameters?


– Need to quantify queue size oscillations (Stability).

(Figure: window size saw-tooth over time, oscillating between (W* + 1)(1 − α/2) and W* + 1.)

SLIDE 22


Analysis

  • How low can DCTCP maintain queues without loss of throughput?
  • How do we set the DCTCP parameters?


– Need to quantify queue size oscillations (Stability).

(Figure: window size saw-tooth over time, oscillating between (W* + 1)(1 − α/2) and W* + 1; packets sent in the one RTT when the window exceeds W* are marked.)

SLIDE 23

Analysis

  • Queue length: Q(t) = N·W(t) − C × RTT
  • The key observation is that with synchronized senders, the queue size exceeds the marking threshold K for exactly one RTT in each period of the saw-tooth, before the sources receive ECN marks and reduce their window sizes accordingly.
  • S(W1, W2) = (W2² − W1²)/2 is the number of packets sent while the window grows from W1 to W2.
  • Critical window size at which ECN marking starts: W* = (C × RTT + K)/N

SLIDE 24
  • Fraction of marked packets: α = S(W*, W* + 1) / S((W* + 1)(1 − α/2), W* + 1)
  • α²(1 − α/4) = (2W* + 1)/(W* + 1)² ≈ 2/W*
  • α ≈ sqrt(2/W*)
  • Single-flow oscillation

– D = (W* + 1) − (W* + 1)(1 − α/2) = (W* + 1)·α/2

  • Queue oscillation amplitude: A = N·D = N(W* + 1)·α/2 ≈ (N/2)·sqrt(2W*) = (1/2)·sqrt(2N(C × RTT + K))   (8)
  • Oscillation period (in RTTs): T_C = D = (1/2)·sqrt(2(C × RTT + K)/N)   (9)
  • Finally, using Q(t) = N·W(t) − C × RTT: Qmax = N(W* + 1) − C × RTT = K + N   (10)
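A small worked example of these formulas; the link speed, RTT, K, and N below are made-up illustrative values, not the paper's settings.

```python
import math

C = 1e9 / 8 / 1500     # 1 Gbps link in packets/sec, assuming 1500-byte packets
RTT = 300e-6           # 300 microseconds
K = 20                 # marking threshold, in packets
N = 2                  # number of synchronized long-lived flows

W_star = (C * RTT + K) / N                    # window at which marking starts
alpha = math.sqrt(2 / W_star)                 # steady-state marked fraction
A = 0.5 * math.sqrt(2 * N * (C * RTT + K))    # queue oscillation amplitude  (8)
T_C = 0.5 * math.sqrt(2 * (C * RTT + K) / N)  # oscillation period in RTTs   (9)
Q_max = K + N                                 # peak queue length            (10)

print(f"W* = {W_star:.1f} pkts, alpha = {alpha:.2f}, A = {A:.1f} pkts, "
      f"T_C = {T_C:.1f} RTTs, Qmax = {Q_max} pkts")
```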

SLIDE 25

Analysis

  • How low can DCTCP maintain queues without loss of throughput?
  • How do we set the DCTCP parameters?


– Need to quantify queue size oscillations (Stability).

85% Less Buffer than TCP

Qmin = Qmax − A   (11)
     = K + N − (1/2)·sqrt(2N(C × RTT + K))   (12)

Minimizing Qmin over N and requiring it to stay positive gives the marking threshold guideline K > (C × RTT)/7.
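Continuing the same illustrative numbers, the resulting Qmin and the K > (C × RTT)/7 guideline that keeps it positive (the guideline is the paper's; the numbers are examples).

```python
import math

C = 1e9 / 8 / 1500     # 1 Gbps in packets/sec (same illustrative values as above)
RTT = 300e-6
K = 20
N = 2

A = 0.5 * math.sqrt(2 * N * (C * RTT + K))     # amplitude, equation (8)
Q_min = K + N - A                              # equation (12)
K_floor = C * RTT / 7                          # paper's guideline so Qmin stays > 0

print(f"Qmin = {Q_min:.1f} pkts; need K > C*RTT/7 = {K_floor:.1f} pkts")
```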

SLIDE 26

Evaluation

  • Implemented in Windows stack.
  • Real hardware, 1Gbps and 10Gbps experiments

– 90-server testbed
– Broadcom Triumph: 48 1G ports, 4 MB shared memory
– Cisco Cat4948: 48 1G ports, 16 MB shared memory
– Broadcom Scorpion: 24 10G ports, 4 MB shared memory

  • Numerous micro-benchmarks

– Throughput and Queue Length
– Multi-hop
– Queue Buildup
– Buffer Pressure
– Fairness and Convergence
– Incast
– Static vs Dynamic Buffer Mgmt

  • Cluster traffic benchmark



SLIDE 27

Cluster Traffic Benchmark

  • Emulate traffic within one rack of a Bing cluster

– 45 1G servers, one 10G server for external traffic

  • Generate query and background traffic

– Flow sizes and arrival times follow distributions seen in Bing

  • Metric:

– Flow completion time for queries and background flows.


We use RTOmin = 10ms for both TCP & DCTCP.

SLIDE 28

Baseline


(Plots: flow completion times for Background Flows and Query Flows.)

SLIDE 29

Baseline


(Plots: flow completion times for Background Flows and Query Flows.)

✓ Low latency for short flows.

SLIDE 30

Baseline


(Plots: flow completion times for Background Flows and Query Flows.)

✓ Low latency for short flows.
✓ High throughput for long flows.

SLIDE 31

Baseline


(Plots: flow completion times for Background Flows and Query Flows.)

✓ Low latency for short flows.
✓ High throughput for long flows.
✓ High burst tolerance for query flows.

SLIDE 32

Scaled Background & Query

10x Background, 10x Query


(Plots: completion times for Query and Short messages.)

SLIDE 33

Conclusions

  • DCTCP satisfies all our requirements for data center packet transport.

✓ Handles bursts well
✓ Keeps queuing delays low
✓ Achieves high throughput

  • Features:

✓ Very simple change to TCP and a single switch parameter.
✓ Based on mechanisms already available in silicon.


SLIDE 34

Comments

  • Real world data
  • A novel idea
  • Comprehensive evaluation
  • Didn't compare with the scheme of eliminating RTOmin and using microsecond-granularity RTT measurements

  • Deadline-based scheduling research
SLIDE 35

Discussion

  • How does DCTCP differ from TCP?
  • Will DCTCP work well on the Internet? Why?
  • Is there a tradeoff between generality and performance?

SLIDE 36

Re-architecting datacenter networks and stacks for low latency and high performance

Mark Handley, Costin Raiciu, Alexandru Agache, Andrei Voinescu, Andrew W. Moore, Gianni Antichi, and Marcin Wójcik

SLIDE 37

Motivation

  • Low latency
  • High throughput
SLIDE 38

Design assumptions

  • Clos Topology
  • Designer can change end system protocol stacks as well as switches

SLIDE 39
  • https://www.youtube.com/watch?v=OI3mh1Vx8xI

SLIDE 40

Discussion

  • Will NDP work well on the Internet? Why?
  • Is there a tradeoff between generality and performance?

  • Will it work well on a non-Clos topology?
SLIDE 41

Summary

  • How to overcome the transport challenges in DC networks

  • DCTCP

– Use the fraction of CE-marked packets to estimate the extent of congestion
– Smooth sending rates

  • NDP

– Start at full rate, spray packets across all paths, and trim packets to headers when queues fill
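A minimal sketch of the "trim" step at a switch output queue; the queue limits and field names are illustrative, and NDP's other pieces (starting at line rate, per-packet spraying, receiver-driven pulls) are not shown.

```python
# NDP-style packet trimming (sketch). When the data queue is full, the payload
# is cut off and only the header is forwarded on a small high-priority queue,
# so the receiver learns what was dropped and can pull a retransmission.

DATA_QUEUE_LIMIT = 8      # full packets the port will buffer (illustrative)
HEADER_QUEUE_LIMIT = 64   # trimmed headers are tiny, so many fit (illustrative)

def enqueue(pkt, data_q, header_q):
    if len(data_q) < DATA_QUEUE_LIMIT:
        data_q.append(pkt)                                         # normal forwarding
    elif len(header_q) < HEADER_QUEUE_LIMIT:
        header_q.append({**pkt, "payload": b"", "trimmed": True})  # trim to header
    # else: drop entirely, only as a last resort
```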