6.888: Lecture 3: Data Center Congestion Control (Mohammad Alizadeh, Spring 2016)


SLIDE 1

6.888: Lecture 3: Data Center Congestion Control

Mohammad Alizadeh

Spring 2016

SLIDE 2

Figure: the INTERNET (100 Kbps–100 Mbps links, ~100 ms latency) connects through the data center fabric (10–40 Gbps links, ~10–100 μs latency) to the servers.

Transport inside the DC

SLIDE 3

Figure: the fabric interconnects servers running web apps, databases, map-reduce, HPC, monitoring, and caches.

Interconnect for distributed compute workloads

Transport inside the DC

SLIDE 4

What’s Different About DC Transport?

Network characteristics

– Very high link speeds (Gb/s); very low latency (microseconds)

Application characteristics

– Large-scale distributed computation

Challenging traffic patterns

– Diverse mix of mice & elephants
– Incast

Cheap switches

– Single-chip shared-memory devices; shallow buffers

SLIDE 5

Short messages (e.g., query, coordination) → need Low Latency

Large flows (e.g., data update, backup) → need High Throughput

Data Center Workloads: Mice & Elephants

SLIDE 6

Figure: workers 1–4 send synchronized responses to an aggregator, causing a TCP timeout (RTOmin = 300 ms).

  • Synchronized fan-in congestion

Incast

[Vasudevan et al. (SIGCOMM ’09)]

SLIDE 7

Figure: MLA query completion time (ms) over the day. Requests are jittered over a 10 ms window; jittering is switched off around 8:30 am.

Incast in Bing

Jittering trades off the median for the high percentiles.

SLIDE 8

DC Transport Requirements

  • 1. Low Latency

– Short messages, queries

  • 2. High Throughput

– Continuous data updates, backups

  • 3. High Burst Tolerance

– Incast

The challenge is to achieve these together

SLIDE 9

High Throughput, Low Latency

Baseline fabric latency (propagation + switching): 10 microseconds

SLIDE 10

High Throughput, Low Latency

High throughput requires buffering for rate mismatches… but this adds significant queuing latency.

Baseline fabric latency (propagation + switching): 10 microseconds

SLIDE 11

Data Center TCP

SLIDE 12

TCP in the Data Center

TCP [Jacobson et al. ’88] is widely used in the data center

– More than 99% of the traffic

Operators work around TCP problems

– Ad-hoc, inefficient, often expensive solutions
– TCP is deeply ingrained in applications

Practical deployment is hard → keep it simple!

SLIDE 13

Review: The TCP Algorithm

Figure: senders 1 and 2 share a bottleneck link to a receiver; window size (rate) follows a sawtooth over time, driven by a 1-bit ECN mark.

ECN = Explicit Congestion Notification

Additive Increase: W → W + 1 per round-trip time
Multiplicative Decrease: W → W/2 per drop or ECN mark
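To make the update rule concrete, here is a minimal Python sketch of AIMD (my illustration, not lecture code); the every-8th-RTT congestion signal is an arbitrary stand-in for real drops or ECN marks.

```python
# Minimal sketch of TCP's AIMD window rule from this slide.

def aimd_update(window: float, congested: bool) -> float:
    """One RTT of AIMD: W -> W + 1 normally, W -> W / 2 on drop/ECN mark."""
    if congested:                     # a drop or ECN mark was seen this RTT
        return max(window / 2, 1.0)   # multiplicative decrease (floor at 1)
    return window + 1.0               # additive increase

# Toy run: the window traces the classic sawtooth when congestion
# (here, artificially every 8th RTT) triggers a halving.
w = 10.0
history = []
for rtt in range(32):
    w = aimd_update(w, congested=(rtt % 8 == 7))
    history.append(round(w, 1))
print(history)
```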

SLIDE 14

TCP Buffer Requirement

Bandwidth-delay product rule of thumb:

– A single flow needs C×RTT of buffering for 100% throughput.

Figure: throughput vs. buffer size B. With B ≥ C×RTT, throughput reaches 100%; with B < C×RTT, it falls short.
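To put a number on the rule of thumb, a quick back-of-envelope in Python; the 10 Gbps link and 100 μs RTT are illustrative data center values, not numbers from the slide.

```python
# Bandwidth-delay product: buffer a single flow needs for full throughput.
C_bits_per_sec = 10e9    # assumed 10 Gbps link
RTT_sec = 100e-6         # assumed 100 microsecond round-trip time

bdp_bytes = C_bits_per_sec * RTT_sec / 8
print(f"B = C x RTT = {bdp_bytes / 1024:.0f} KB")   # ~122 KB
```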

SLIDE 15

Figure: with many flows, per-flow windows desynchronize; throughput stays at 100% with a much smaller buffer.

Appenzeller et al. (SIGCOMM ‘04):

– Large # of flows: B = C×RTT/√n is enough.

Reducing Buffer Requirements
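Extending the same back-of-envelope with the Appenzeller √n rule; n = 10,000 flows is an illustrative number, not one from the lecture.

```python
# Stat-mux buffer sizing: B = C x RTT / sqrt(n) for n desynchronized flows.
import math

bdp_bytes = 10e9 * 100e-6 / 8          # same assumed C and RTT as above
n_flows = 10_000
b = bdp_bytes / math.sqrt(n_flows)
print(f"B = {b / 1024:.2f} KB")        # ~1.2 KB, versus ~122 KB for one flow
```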

SLIDE 16

Appenzeller et al. (SIGCOMM ‘04):

– Large # of flows: B = C×RTT/√n is enough.

Can’t rely on stat-mux benefit in the DC.

– Measurements show typically only 1-2 large flows at each server

Key Observation: low variance in sending rates → small buffers suffice

Reducing Buffer Requirements

SLIDE 17

• Extract multi-bit feedback from the single-bit stream of ECN marks

– Reduce window size based on the fraction of marked packets.

ECN marks              TCP                 DCTCP
1 0 1 1 1 1 0 1 1 1    Cut window by 50%   Cut window by 40%
0 0 0 0 0 0 0 0 0 1    Cut window by 50%   Cut window by 5%

Figure: window size (bytes) vs. time (sec) for TCP and DCTCP.

DCTCP: Main Idea
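The table's two rows can be reproduced in a few lines; a minimal sketch, with the mark sequences taken from the slide.

```python
# TCP halves its window if any packet is marked; DCTCP cuts in
# proportion to the fraction of marked packets (by fraction/2).

def tcp_cut(window: float, marks: list[int]) -> float:
    return window / 2 if any(marks) else window

def dctcp_cut(window: float, marks: list[int]) -> float:
    frac = sum(marks) / len(marks)      # fraction of marked packets
    return window * (1 - frac / 2)

heavy = [1, 0, 1, 1, 1, 1, 0, 1, 1, 1]  # 80% marked -> DCTCP cuts 40%
light = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]  # 10% marked -> DCTCP cuts  5%
for marks in (heavy, light):
    print(tcp_cut(100, marks), round(dctcp_cut(100, marks), 1))
```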

SLIDE 18

DCTCP: Algorithm

Switch side:

– Mark packets when Queue Length > K.

Sender side:

– Maintain a running average of the fraction of packets marked (α):

  each RTT: F = (# of marked ACKs) / (total # of ACKs)
  α ← (1 − g)·α + g·F

• Adaptive window decrease:

  W ← (1 − α/2)·W

– Note: the decrease factor is between 1 and 2.

Figure: switch buffer of size B with marking threshold K; mark packets above K, don't mark below.
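Putting the sender-side equations together, a minimal Python sketch of the per-RTT update (the class and the transport loop driving it are hypothetical; the paper suggests a gain around g = 1/16).

```python
# Sketch of DCTCP's sender-side state, updated once per round-trip time.

class DctcpSender:
    def __init__(self, g: float = 1.0 / 16, init_window: float = 10.0):
        self.g = g            # EWMA gain for the marking estimate
        self.alpha = 0.0      # running estimate of fraction marked
        self.window = init_window

    def on_rtt(self, acks_total: int, acks_marked: int) -> float:
        """Update alpha and the window from one RTT's worth of ACKs."""
        F = acks_marked / acks_total            # fraction marked this RTT
        self.alpha = (1 - self.g) * self.alpha + self.g * F
        if acks_marked > 0:
            # Adaptive decrease: cut between 0% (alpha=0) and 50% (alpha=1).
            self.window *= (1 - self.alpha / 2)
        else:
            self.window += 1                    # standard additive increase
        return self.window
```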

SLIDE 19

Figure: queue length (KBytes) vs. time (seconds) for 2 TCP flows and 2 DCTCP flows; the TCP queue oscillates across hundreds of packets while the DCTCP queue stays near the threshold.

Experiment: 2 flows (Win 7 stack), Broadcom 1 Gbps switch; ECN marking threshold = 30 KB.

The buffer is mostly empty: DCTCP mitigates incast by creating large buffer headroom.

DCTCP vs TCP

SLIDE 20
  • 1. Low Latency

✓ Small buffer occupancies → low queuing delay

  • 2. High Throughput

✓ ECN averaging → smooth rate adjustments, low variance

  • 3. High Burst Tolerance

✓ Large buffer headroom → bursts fit
✓ Aggressive marking → sources react before packets are dropped

Why it Works

SLIDE 21

DCTCP Deployments

SLIDE 22

Discussion

SLIDE 23

What You Said

Austin: “The paper's performance comparison to RED seems arbitrary, perhaps RED had traction at the time? Or just convenient as the switches were capable of implementing it?”

SLIDE 24

Implemented in Windows stack. Real hardware, 1Gbps and 10Gbps experiments

– 90-server testbed
– Broadcom Triumph: 48 1G ports, 4 MB shared memory
– Cisco Cat4948: 48 1G ports, 16 MB shared memory
– Broadcom Scorpion: 24 10G ports, 4 MB shared memory

Numerous micro-benchmarks

– Throughput and Queue Length
– Multi-hop
– Queue Buildup
– Buffer Pressure

Bing cluster benchmark

– Fairness and Convergence
– Incast
– Static vs. Dynamic Buffer Mgmt

Evaluation

SLIDE 25

Figure: completion times for background flows and query flows.

Bing Benchmark (baseline)

SLIDE 26

Bing Benchmark (scaled 10x)

Figure: completion time (ms) for query traffic (incast bursts) and short messages (delay-sensitive).

Deep buffers fix incast but increase latency; DCTCP is good for both incast and latency.

SLIDE 27

What You Said

Amy: “I find it unsatisfying that the details of many congestion control protocols (such as these) are so complicated! ... can we create a parameter-less congestion control protocol that is similar in behavior to DCTCP or TIMELY?”

Hongzi: “Is there a general guideline to tune the parameters, like alpha, beta, delta, N, T_low, T_high, in the system?”

SLIDE 28

How much buffering does DCTCP need for 100% throughput?

• Need to quantify queue size oscillations (stability).

Figure: the window follows a sawtooth between (W* + 1)(1 − α/2) and W* + 1; the packets sent in the last RTT of each period are marked (switch buffer B, marking threshold K).

α = (# of pkts in last RTT of period) / (# of pkts in period)

A bit of Analysis
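To fill in the counting this slide gestures at, here is a single-flow sketch in LaTeX (my reconstruction of the paper's steady-state analysis, not text from the lecture):

```latex
% The window climbs by 1 per RTT from (W^*+1)(1-\alpha/2) to W^*+1,
% so one period lasts about (W^*+1)\alpha/2 round trips.
\[
\alpha \;=\; \frac{\#\text{pkts in last RTT}}{\#\text{pkts in period}}
  \;\approx\; \frac{W^*+1}{\frac{(W^*+1)^2\alpha}{2}\bigl(1-\frac{\alpha}{4}\bigr)}
\;\Longrightarrow\;
\alpha^2\Bigl(1-\frac{\alpha}{4}\Bigr) \;=\; \frac{2}{W^*+1}
\;\Longrightarrow\;
\alpha \;\approx\; \sqrt{2/W^*}.
\]
\[
\text{sawtooth amplitude} \;=\; (W^*+1)\,\frac{\alpha}{2} \;\approx\; \sqrt{W^*/2},
\]
% so queue oscillations grow only like the square root of the
% bandwidth-delay product, which is why K can sit far below C x RTT.
```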

SLIDE 29

How small can queues be without loss of throughput?

• Need to quantify queue size oscillations (stability).

Result: K > (1/7)·C×RTT suffices (for TCP: K > C×RTT).

What assumptions does the model make?

A bit of Analysis
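For scale, a quick numeric comparison of the two thresholds; the 10 Gbps link and 100 μs RTT are the same illustrative values used earlier, not numbers from the slide.

```python
# Marking threshold needed for full throughput: DCTCP vs. TCP.
C_bps, rtt_sec = 10e9, 100e-6
bdp_kb = C_bps * rtt_sec / 8 / 1024          # ~122 KB bandwidth-delay product
print(f"TCP:   K > {bdp_kb:6.1f} KB")
print(f"DCTCP: K > {bdp_kb / 7:6.1f} KB")    # roughly a 7x smaller buffer
```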

SLIDE 30

What You Said

Anurag: “In both the papers, one of the differences I saw from TCP was that these protocols don't have the ‘slow start’ phase, where the rate grows exponentially starting from 1 packet/RTT.”

SLIDE 31

DCTCP takes at most ~40% more RTTs than TCP

– “Analysis of DCTCP: Stability, Convergence, and Fairness,” SIGMETRICS 2011

Intuition: DCTCP makes smaller adjustments than TCP, but makes them much more frequently.

Figure: TCP vs. DCTCP rate convergence over time.

Convergence Time

SLIDE 32

TIMELY

[Slides by Radhika Mittal (Berkeley)]

SLIDE 33

Qualities of RTT

  • Fine-grained and informative
  • Quick response time
  • No switch support needed
  • End-to-end metric
  • Works seamlessly with QoS
SLIDE 34

RTT correlates with queuing delay
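A quick back-of-envelope of why this correlation is usable as a congestion signal; the 10 Gbps bottleneck rate is an assumed value for illustration.

```python
# Queuing delay added to the RTT by a standing queue at a 10 Gbps
# bottleneck (assumed rate).
LINK_BPS = 10e9
for queued_kb in (0, 30, 125, 500):
    delay_us = queued_kb * 1024 * 8 / LINK_BPS * 1e6
    print(f"{queued_kb:4d} KB queued -> +{delay_us:6.1f} us of RTT")
```

Against the ~10 μs baseline fabric latency quoted earlier, even a few tens of KB of standing queue is clearly visible in the RTT.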

SLIDE 35

What You Said

Ravi: “The first thing that struck me while reading these papers was how different their approaches were. DCTCP even states that delay-based protocols are ‘susceptible to noise in the very low latency environment of data centers’ and that ‘the accurate measurement of such small increases in queuing delay is a daunting task’. Then, I noticed that there is a 5 year gap between these two papers…”

Arman: “They had to resort to extraordinary measures to ensure that the timestamps accurately reflect the time at which a packet was put on wire…”

SLIDE 36

Accurate RTT Measurement

SLIDE 37

Hardware Timestamps – mitigate noise in measurements
Hardware Acknowledgements – avoid processing overhead

Hardware-Assisted RTT Measurement

SLIDE 38

Hardware vs Software Timestamps

Kernel Timestamps introduce significant noise in RTT measurements compared to HW Timestamps.

SLIDE 39

Impact of RTT Noise

Throughput degrades with increasing noise in RTT. Precise RTT measurement is crucial.

SLIDE 40

TIMELY Framework

SLIDE 41

Overview

Figure: timestamps feed the RTT Measurement Engine; its RTT samples drive the Rate Computation Engine; the computed rate and outgoing data go to the Pacing Engine, which emits paced data.

SLIDE 42

Figure: the sender transmits at t_send; the receiver's NIC returns a HW ACK at t_completion; the measured interval spans serialization delay plus propagation & queuing delay.

RTT = t_completion − t_send − Serialization Delay

RTT Measurement Engine
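A minimal sketch of the computation this engine performs, assuming NIC-provided timestamps; the function name, the 10 Gbps line rate, and the 64 KB segment size are illustrative, not from the paper.

```python
# TIMELY-style RTT extraction from NIC timestamps.
LINE_RATE_BPS = 10e9   # assumed line rate of the sender's NIC

def timely_rtt(t_send: float, t_completion: float, segment_bytes: int) -> float:
    """RTT = t_completion - t_send - serialization delay (all in seconds)."""
    serialization = segment_bytes * 8 / LINE_RATE_BPS
    return t_completion - t_send - serialization

# A 64 KB segment takes ~52 us to serialize at 10 Gbps, so subtracting it
# matters when the total RTT is itself only tens of microseconds.
rtt = timely_rtt(t_send=0.0, t_completion=120e-6, segment_bytes=64 * 1024)
print(f"RTT = {rtt * 1e6:.1f} us")
```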

SLIDE 43

Algorithm Overview

Gradient-based Increase / Decrease

SLIDE 44

Algorithm Overview

Gradient-based Increase / Decrease

Figure: RTT flat over time (gradient = 0).

SLIDE 45

Algorithm Overview

Gradient-based Increase / Decrease

Figure: RTT rising over time (gradient > 0).

SLIDE 46

Algorithm Overview

Gradient-based Increase / Decrease

Figure: RTT falling over time (gradient < 0).

SLIDE 47

Algorithm Overview

Gradient-based Increase / Decrease

Figure: RTT varying over time.

SLIDE 48

Algorithm Overview

To navigate the throughput-latency tradeoff and ensure stability.

Gradient-based Increase / Decrease

SLIDE 49

Why Does Gradient Help Stability?

Figure: a control loop where the source reacts to the error e(t) = RTT(t) − RTT_0 alone, versus one that reacts to e(t) + k·e′(t).

Feed back higher-order derivatives: observe not only the error but the change in error, to “anticipate” the future state.
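To see the damping effect numerically, here is a toy fluid-model experiment (entirely my construction, not from the lecture): a single bottleneck where the sender adjusts its rate against e(t) alone versus e(t) + k·e′(t).

```python
# Toy one-bottleneck fluid model: rate reacts to the RTT error, with or
# without a derivative term. All constants are arbitrary toy values.

def late_queue_swing(k_deriv: float, steps: int = 400) -> float:
    capacity = 1.0                 # pkts per tick the link can drain
    rate, queue = 1.0, 0.0
    prev_err = 0.0
    trace = []
    for _ in range(steps):
        queue = max(0.0, queue + rate - capacity)   # fluid queue update
        err = queue                                  # error ~ queuing delay
        feedback = err + k_deriv * (err - prev_err)  # e(t) + k*e'(t)
        prev_err = err
        rate = max(0.05, rate + 0.01 - 0.002 * feedback)
        trace.append(queue)
    tail = trace[steps // 2:]
    return max(tail) - min(tail)                     # residual oscillation

print("queue swing, error only     :", round(late_queue_swing(0.0), 2))
print("queue swing, with derivative:", round(late_queue_swing(10.0), 2))
```

With the derivative term the loop is damped and the residual queue swing is far smaller, matching the slide's intuition about anticipating future state.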

SLIDE 50

What You Said

Arman: “I also think that deducing the queue length from the gradient model could lead to miscalculations. For example, consider an Incast scenario, where many senders transmit simultaneously through the same path. Noting that every packet will see a long, yet steady, RTT, they will compute a near-zero gradient and hence the congestion will continue.”

SLIDE 51

Algorithm Overview

Additive Increase / Multiplicative Decrease

– RTT below T_low → additive increase, for better burst tolerance.
– RTT above T_high → multiplicative decrease, to keep tail latency within acceptable limits.

Gradient-based Increase / Decrease (in between)

– To navigate the throughput-latency tradeoff and ensure stability.
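Combining the three regimes on this slide, a minimal Python sketch of a TIMELY-style rate update; the parameter values and the normalization by a minimum RTT are illustrative assumptions, not the paper's tuned settings.

```python
# TIMELY-style rate computation: additive increase below Tlow,
# multiplicative decrease above Thigh, gradient-based in between.
# All constants here are illustrative, not the paper's settings.

class TimelyRateControl:
    def __init__(self, t_low=50e-6, t_high=500e-6, delta=10e6,
                 beta=0.8, ewma_gain=0.3, min_rtt=20e-6):
        self.t_low, self.t_high = t_low, t_high   # RTT thresholds (sec)
        self.delta = delta                        # additive step (bits/sec)
        self.beta = beta                          # decrease aggressiveness
        self.ewma_gain = ewma_gain                # smoothing of RTT changes
        self.min_rtt = min_rtt                    # normalizes the gradient
        self.prev_rtt = None
        self.rtt_diff = 0.0                       # smoothed d(RTT)
        self.rate = 1e9                           # start at 1 Gbps

    def on_rtt_sample(self, rtt: float) -> float:
        if self.prev_rtt is not None:
            d = rtt - self.prev_rtt
            self.rtt_diff = ((1 - self.ewma_gain) * self.rtt_diff
                             + self.ewma_gain * d)
        self.prev_rtt = rtt
        gradient = self.rtt_diff / self.min_rtt   # dimensionless RTT slope

        if rtt < self.t_low:                      # burst tolerance
            self.rate += self.delta
        elif rtt > self.t_high:                   # bound tail latency
            self.rate *= 1 - self.beta * (1 - self.t_high / rtt)
        elif gradient <= 0:                       # RTT flat or falling
            self.rate += self.delta
        else:                                     # RTT rising: back off
            self.rate *= 1 - self.beta * min(gradient, 1.0)
        return self.rate
```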

SLIDE 52

Discussion

SLIDE 53

TIMELY is implemented in the context of RDMA.

– RDMA write and read primitives are used to invoke NIC services.

Priority Flow Control (PFC) is enabled in the network fabric.

– RDMA transport in the NIC is sensitive to packet drops.
– PFC sends out pause frames to ensure a lossless network.

Implementation Set-up

SLIDE 54

“Congestion Spreading” in Lossless Networks

Figure: PAUSE frames propagate hop by hop from the congested port, spreading congestion upstream.

SLIDE 55

TIMELY vs PFC

SLIDE 56

TIMELY vs PFC

SLIDE 57

What You Said

Amy: “I was surprised to see that TIMELY performed so much better than DCTCP. Did the lack of an OS-bypass for DCTCP impact performance? I wish that the authors had offered an explanation for this result.”

SLIDE 58

Next time: Load Balancing
