CompSci 514: Computer Networks
Lecture 13: TCP Incast and Solutions
Xiaowei Yang
Roadmap
- Midterm I summary
- Midterm II date change: 12/15
– Make-up lecture for the hurricane
– Many can’t make it on that day
- Today
– Other data center network topologies
– What is TCP incast
– Solutions
Other Datacenter network topologies
Charles E. Leiserson's 60th Birthday Symposium
http://www.infotechlead.com/2013/03/28/gartner-data-center-spending-to-grow-3-7-to-146-bi
What to build?
- “Fat-tree” [SIGCOMM 2008]
- VL2 [SIGCOMM 2009, CoNEXT 2013]
- DCell [SIGCOMM 2008]
- BCube [SIGCOMM 2009]
- Jellyfish [NSDI 2012]
This question has spawned a cottage industry in the computer networking research community.
“Fat-tree” SIGCOMM 2008
- Isomorphic to a butterfly network except at the top level
- Bisection width n/2; oversubscription ratio 1
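As a quick check on these numbers, here is a small sketch (mine, not from the paper) that computes the standard component counts of a k-ary fat-tree built from k-port commodity switches:

```python
# Component counts for a k-ary fat-tree (Al-Fares et al., SIGCOMM 2008).
# k must be even; every switch is a k-port commodity switch.
def fat_tree_sizes(k: int) -> dict:
    assert k % 2 == 0, "fat-tree requires an even port count"
    return {
        "pods": k,
        "edge_switches": k * (k // 2),         # k/2 per pod
        "aggregation_switches": k * (k // 2),  # k/2 per pod
        "core_switches": (k // 2) ** 2,
        "hosts": (k ** 3) // 4,                # (k/2)^2 hosts/pod x k pods
    }

# Example: 48-port switches support 27,648 hosts at full bisection bandwidth.
print(fat_tree_sizes(48))
```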
VL2 (SIGCOMM 2009, CoNEXT 2013)
- Also called a Clos network
- Oversubscription ratio 1 (but 1 Gbps links at leaves, 10 Gbps elsewhere)
DCell (SIGCOMM 2008)
- A “clique of cliques”; servers forward packets
- n servers in DCell0; (n+1)n servers in DCell1; (((n+1)n)+1)(n+1)n in DCell2
- Oversubscription ratio 1

[Figure: a DCell1 built with n = 4]
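The server counts on this slide follow a simple recurrence: a DCell_k is assembled from t_{k-1} + 1 copies of DCell_{k-1}, so t_k = t_{k-1}(t_{k-1} + 1). A tiny sketch (illustrative only) reproducing the slide's numbers:

```python
# Server count of a DCell_k built from n-port switches:
# t_0 = n, and t_k = t_{k-1} * (t_{k-1} + 1), matching the slide:
# n, (n+1)n, (((n+1)n)+1)(n+1)n, ...
def dcell_servers(n: int, k: int) -> int:
    t = n
    for _ in range(k):
        t = t * (t + 1)
    return t

# With n = 4 (as in the figure): 4, 20, 420, 176820 servers.
for level in range(4):
    print(level, dcell_servers(4, level))
```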
BCube (SIGCOMM 2009)
- Mesh of stars (analogous to a mesh of trees)
- Bisection width between 0.25n and 0.35n
Jellyfish (NSDI 2012)
- Random connections
- Bisection width Θ(n)
[Figure: Jellyfish vs. “fat-tree”/butterfly topologies]
Datacenter Transport Protocols
Data Center Packet Transport
- Large purpose-built DCs
– Huge investment: R&D, business
- Transport inside the DC
– TCP rules (99.9% of traffic)
- How’s TCP doing?
TCP does not meet demands of DC applications
- Goal:
– Low latency
– High throughput
- TCP does not meet these demands.
– Incast
Ø Suffers from bursty packet drops
– Large queues:
Ø Add significant latency.
Ø Waste precious buffers, especially bad with shallow-buffered switches.
Case Study: Microsoft Bing
- Measurements from a 6,000-server production cluster
- Instrumentation passively collects logs
‒ Application-level
‒ Socket-level
‒ Selected packet-level
- More than 150TB of compressed data over a month
Partition/Aggregate Application Structure

[Figure: a request tree: the top-level aggregator (TLA) fans out to mid-level aggregators (MLAs), which fan out to worker nodes]
Picasso

“Everything you can imagine is real.”
“Bad artists copy. Good artists steal.”
“It is your work in life that is the ultimate seduction.”
“The chief enemy of creativity is good sense.”
“Inspiration does exist, but it must find you working.”
“I'd like to live as a poor man with lots of money.”
“Art is a lie that makes us realize the truth.”
“Computers are useless. They can only give you answers.”

[Animation: the query “Picasso” is partitioned across workers, each holding a subset of the quotes; every worker returns its own ranked list (“1. Art is a lie…”, “2. The chief…”, …), and the aggregator merges them into the final ranking.]
- Time is money
Ø Strict deadlines (SLAs)
- Missed deadline
Ø Lower quality result
[Figure: deadlines tighten down the tree: 250 ms at the TLA, 50 ms at the MLAs, 10 ms at the workers]
Generality of Partition/Aggregate
- The foundation for many large-scale web applications
– Web search, social network composition, ad selection, etc.
- Example: Facebook
Partition/Aggregate ~ Multiget
– Aggregators: web servers
– Workers: memcached servers
[Figure: web servers receive requests from the Internet and fan out to memcached servers via the memcached protocol]
Workloads
- Partition/Aggregate (query): delay-sensitive
- Short messages, 50 KB-1 MB (coordination, control state): delay-sensitive
- Large flows, 1 MB-50 MB (data update): throughput-sensitive
Workload characterization

[Figure 3: Time between arrival of new work for the aggregator (queries) and between background flows between servers (update and short message); CDFs of interarrival time for query and background traffic.]
Workload characterization
[Figure 4: PDF of flow size distribution for background traffic (flow sizes from 10^3 to 10^8 bytes). The PDF of total bytes shows the probability that a randomly selected byte comes from a flow of a given size.]
Workload characterization
[Figure 5: Distribution of the number of concurrent connections, for all flows and for flows > 1 MB.]
Impairments
- Incast
- Queue Buildup
- Buffer Pressure
Incast
[Figure: four workers answer the aggregator at once; the synchronized responses overflow the switch buffer and trigger a TCP timeout (RTOmin = 300 ms)]
- Synchronized mice collide.
Ø Caused by Partition/Aggregate.
Incast Really Happens
- Requests are jittered over 10ms window.
- Jittering switched off around 8:30 am.
Jittering trades off the median against the high percentiles; the 99.9th percentile is being tracked.

[Figure: MLA query completion time (ms) over the day; the high percentiles jump when jittering is switched off around 8:30 am]
Queue Buildup
[Figure: two senders share a switch port to one receiver; a large flow from one sender builds a queue that delays the other sender's short flow]
- Big flows buildup queues.
Ø Increased latency for short flows.
- Measurements in Bing cluster
Ø For 90% of packets: RTT < 1 ms
Ø For 10% of packets: 1 ms < RTT < 15 ms
Safe and Effective Fine-grained TCP Retransmissions for Datacenter Communication
Vijay Vasudevan, Amar Phanishayee, Hiral Shah, Elie Krevat David Andersen, Greg Ganger, Garth Gibson, Brian Mueller* Carnegie Mellon University, *Panasas Inc.
Datacenter TCP Request-Response
[Figure: a client sends a request through the switch to a server; one of the response packets is dropped, the flow stalls, and the response is resent only after a 200 ms delay]
Applications Sensitive to 200ms TCP Timeouts
- “Drive-bys” affecting single-flow request/response
- Barrier-Sync workloads
- Parallel cluster filesystems (Incast workloads)
- Massive multi-server queries (e.g., previous talk)
- Latency-sensitive, customer-facing
Main Takeaways
- Problem: 200ms TCP timeouts can cripple datacenter apps
- Solution: Enable microsecond retransmissions
- Can improve datacenter app throughput/latency
- Safe in the wide area
The Datacenter Environment
[Figure: a client and multiple servers attached to a single commodity Ethernet switch]
- 1-10 Gbps links, 10-100 µs latency
- Commodity Ethernet switches
- Under heavy load, packet losses are frequent
TCP: Loss Recovery Comparison
[Figure: two loss-recovery timelines. Data-driven recovery: the receiver's duplicate ACKs for segment 1 let the sender retransmit segment 2 immediately (fast retransmit). Timeout-driven recovery: the sender must wait out a full retransmission timeout (RTO) before resending.]
- Timeout-driven recovery is painfully slow (milliseconds)
- Data-driven recovery is super fast (microseconds)
- minRTO = 200.0 ms vs. datacenter latency = 0.1 ms: one TCP timeout lasts 1000 RTTs!
RTO Estimation and Minimum Bound
- Jacobson’s TCP RTO estimator
– RTO = SRTT + 4 × RTTVAR
- Minimum RTO bound = 200 ms
- Actual RTO timer = max(200 ms, RTO)
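A minimal Python sketch of this estimator, using the classic gains of 1/8 for SRTT and 1/4 for RTTVAR (the standard RFC 6298 values, assumed here rather than stated on the slide):

```python
# Jacobson-style RTO estimator with the 200 ms minimum bound.
ALPHA = 1 / 8    # gain for the smoothed RTT
BETA = 1 / 4     # gain for the RTT variance
MIN_RTO = 0.200  # 200 ms floor

class RtoEstimator:
    def __init__(self):
        self.srtt = None    # smoothed RTT (seconds)
        self.rttvar = None  # smoothed RTT variance (seconds)

    def update(self, rtt_sample: float) -> float:
        """Feed one RTT sample; return the clamped RTO."""
        if self.srtt is None:
            self.srtt = rtt_sample
            self.rttvar = rtt_sample / 2
        else:
            self.rttvar = (1 - BETA) * self.rttvar + BETA * abs(self.srtt - rtt_sample)
            self.srtt = (1 - ALPHA) * self.srtt + ALPHA * rtt_sample
        rto = self.srtt + 4 * self.rttvar
        return max(MIN_RTO, rto)  # actual timer = max(200 ms, RTO)

# With ~100 us datacenter RTTs the computed RTO is a few hundred us,
# but the 200 ms floor always dominates: every call prints 0.2.
est = RtoEstimator()
for rtt in (100e-6, 110e-6, 95e-6, 105e-6):
    print(est.update(rtt))
```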
The Incast Workload
[Figure: synchronized reads. The client requests a data block striped across four storage servers as Server Request Units (SRUs) 1-4; all servers respond through the switch at once, and the client sends the next batch of requests only after every SRU of the current block has arrived.]
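To build intuition for why the barrier stalls, here is a deliberately oversimplified toy model (my own construction, not the authors' ns-2 simulator); the buffer, window, and packet-size parameters are illustrative assumptions:

```python
# Toy incast model: a block is striped across n servers as 256 KB SRUs;
# each server bursts a window of packets into one switch port at once.
# If the synchronized burst overflows the shallow buffer, a tail packet
# is lost and the whole barrier waits out a full 200 ms RTO.

def block_transfer_time(n_servers, sru_bytes=256 * 1024, link_bps=1e9,
                        buffer_pkts=64, window_pkts=8, pkt_bytes=1500,
                        rtt=200e-6, rto=0.200):
    pkts_per_server = sru_bytes // pkt_bytes + 1
    # Time to serialize the whole block onto the bottleneck link.
    drain = n_servers * pkts_per_server * pkt_bytes * 8 / link_bps
    # Synchronized windows exceeding the buffer cause a drop, and the
    # dropped SRU stalls the entire block for one RTO.
    stalled = n_servers * window_pkts > buffer_pkts
    return drain + rtt + (rto if stalled else 0.0)

# Goodput is near line rate until the burst overflows the buffer,
# then collapses (here at 16 servers in this toy parameterization).
for n in (1, 4, 8, 16):
    t = block_transfer_time(n)
    goodput_mbps = n * 256 * 1024 * 8 / t / 1e6
    print(f"{n:2d} servers: {t * 1e3:6.1f} ms, goodput {goodput_mbps:6.1f} Mbps")
```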
Incast Workload Overfills Buffers
[Figure: requests sent and received; responses 1-3 complete, response 4 is dropped and resent. The client's link sits idle until the retransmission.]
[Figure: client link utilization; the link is idle for 200 ms while the dropped response waits out the timeout]
Cluster environment: 1 Gbps Ethernet, 100 µs delay, 200 ms RTO, S50 switch, 1 MB block size
- [Nagle04] called this Incast; provided app-level workaround
- Cause of throughput collapse: 200ms TCP Timeouts
- Prior work: other TCP variants did not prevent TCP timeouts [Phanishayee, FAST 2008]
200ms Timeouts Cause Throughput Collapse
!" !#"" !$"" !%"" !&"" !'"" !("" !)"" !*"" !+"" !#"""
!" !' !#" !#' !$" !$' !%" !%' !&" !&'
,-./0123-4!56/370!8396:
?:/=B4:7B6! !!!CDE72F
More servers
Latency-sensitive Apps
- Request for 4 MB of data sharded across 16 servers (256 KB each)
- How long does it take for all of the 4 MB of data to return?
Timeouts Increase Latency
[Figure: distribution of block transfer times (ms) for 4 MB blocks (256 KB from each of 16 servers) with the default RTO. Most transfers finish near the ideal response time, but a long tail is delayed by one or more 200 ms TCP timeouts.]
Outline
- Problem Description, Examples
- Solution: Microsecond TCP Retransmissions
- Is it safe?
First attempt: reducing RTOmin
[Figure 2: In simulation, reducing RTOmin from the current default of 200 ms down to microseconds improves goodput (block size = 1 MB, buffer = 32 KB; 4-128 servers).]

[Figure 3: Experiments on a real cluster validate the simulation result that reducing RTOmin to microseconds improves goodput (4-16 servers).]
What if link speed increases?
[Figure 5: The distribution of RTTs from an active storage node at Los Alamos National Lab shows an appreciable number of RTTs in the tens of microseconds.]

[Figure 6: In simulation, flows experience reduced goodput when retransmissions do not fire at the same granularity as RTTs; fine-grained timers can observe suboptimal goodput for a large number of servers if retransmissions are tightly synchronized (block size = 80 MB, buffer = 32 KB, RTT = 20 µs; no RTOmin vs. 200 µs vs. 1 ms RTOmin).]
Eliminate minRTO
✓ Simple one-line change in Linux
- Does not change RTT measurement granularity
- Still uses low-resolution, 1 ms kernel timers
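In terms of the estimator formula above, the change amounts to dropping the clamp. A sketch of the idea (illustrative only; the actual patch is a one-line edit in the Linux TCP stack, not this toy code):

```python
# Sketch of what eliminating minRTO means.

def rto_with_floor(srtt, rttvar, min_rto=0.200):
    return max(min_rto, srtt + 4 * rttvar)  # stock behavior

def rto_no_floor(srtt, rttvar):
    return srtt + 4 * rttvar  # let the estimate track the measured RTT

# With a 100 us datacenter RTT and 20 us of variance:
print(rto_with_floor(100e-6, 20e-6))  # 0.2     (the floor dominates)
print(rto_no_floor(100e-6, 20e-6))    # 0.00018 (180 us)
```

Even with the floor gone, the slide's last point still applies: low-resolution kernel timers bound how soon the retransmission can actually fire.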
Eliminating the RTO bound helps
!" !#"" !$"" !%"" !&"" !'"" !("" !)"" !*"" !+"" !#"""
!" !' !#" !#' !$" !$' !%" !%' !&" !&'
,-./0123-4!56/370!8396:
?:/=F4:7F6! !!!GHE72I
!" !#"" !$"" !%"" !&"" !'"" !("" !)"" !*"" !+"" !#"""
!" !' !#" !#' !$" !$' !%" !%' !&" !&'
,-./0123-4!56/370!8396:
?:/=B4:7B6! !!!CDE72F
Unmodified TCP No RTO Bound
More servers
Millisecond retransmissions not enough
Requirements for Microsecond RTO
- TCP must track RTT in microseconds
– Modify internal data structures
– Timestamp option (backwards compatible)
- Efficient high-resolution kernel timers
– Use HPET for efficient interrupt signaling
Implementing fine-grained TCP timers
- The old implementation uses a kernel timer that ticks HZ = 100/250/1000 times per second
– Linux jiffies at HZ = 250: rtt + 4 × var → a minimum RTO of about 5 ms
- hrtimers
– The Generic Time of Day (GTOD) framework provides the kernel and other applications with nanosecond-resolution timekeeping, using the CPU cycle counter available on all modern processors
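A simplified model of why jiffy-granularity timers impose a multi-millisecond floor (an assumed quantization model for intuition, not kernel code; hrtimers avoid this by scheduling at nanosecond resolution):

```python
# With HZ = 250, one jiffy is 4 ms, and a timer cannot fire before the
# next whole tick, so even a microsecond computed RTO is deferred.
import math

HZ = 250
JIFFY = 1.0 / HZ  # 4 ms per tick

def earliest_fire(rto_seconds: float) -> float:
    """Earliest time a jiffy-granularity timer can fire for a given RTO."""
    return math.ceil(rto_seconds / JIFFY) * JIFFY

print(earliest_fire(180e-6))  # 0.004: a 180 us RTO becomes 4 ms
print(earliest_fire(4.1e-3))  # 0.008: anything past one tick waits for two
```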
Microsecond timeouts are necessary
!" !#"" !$"" !%"" !&"" !'"" !("" !)"" !*"" !+"" !#"""
!" !' !#" !#' !$" !$' !%" !%' !&" !&'
,-./0123-4!56/370!8396:
?:/=F4:7F6! !!!GHE72I
Microsecond TCP + no RTO Bound Unmodified TCP No RTO Bound
!" !#"" !$"" !%"" !&"" !'"" !("" !)"" !*"" !+"" !#"""
!" !' !#" !#' !$" !$' !%" !%' !&" !&'
,-./0123-4!56/370!8396:
=:/<F4:7F6! !!!H;E72I
High throughput sustained for up to 47 servers
More servers
Improvement to Latency
All responses returned within 40ms
4 MB blocks (256 KB from each of 16 servers)

[Figure: block transfer time distributions (ms), 200 ms RTO vs. microsecond RTO]
Outline
- Problem Description, Examples
- Solution: Microsecond TCP Retransmissions
- Is it safe?
- Interaction with Delayed ACK
- Performance in the wide area
μs RTO and Delayed ACK
[Figure: two timelines. With RTO > 40 ms, the sender's timer outlasts the receiver's delayed-ACK timer and the ACK for packet 1 arrives in time. With RTO < 40 ms, the RTO on the sender triggers before the delayed ACK on the receiver, so packet 1 is retransmitted prematurely.]
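The race can be stated as a one-line predicate (a sketch; the 40 ms value is the Linux delayed-ACK timer from the slide):

```python
# The receiver may hold the ACK for a lone segment for up to the
# delayed-ACK timer, so any RTO shorter than that can fire spuriously.
DELAYED_ACK_TIMER = 0.040  # receiver-side delay for an unpaired segment

def premature_timeout(rto: float, ack_delay: float = DELAYED_ACK_TIMER) -> bool:
    """True if the sender's RTO expires before the delayed ACK is sent."""
    return rto < ack_delay

print(premature_timeout(0.200))   # False: a 200 ms RTO outlasts the delay
print(premature_timeout(200e-6))  # True: a microsecond RTO retransmits early
```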
Impact of Delayed ACKs
!" !#"" !$"" !%"" !&"" !'"" !("" !)"" !*"" !+"" !#""" !" !$ !& !( !* !#" !#$ !#& !#( ,-./01!23!4015016 !789:0;!<=2>?!@!#A<B!/-3301!@!%$C<!706DEFB!4G9D>H!@!I12>-150F J0=KL0;!MNC!J96K/=0; !!!!!!!!!!!!!!!!!!!J0=KL0;!MNC!&".6
10-15% loss in throughput with delayed ACK clients
Throughput (Mbps)
Is it safe for the wide area?
- Potential Concerns:
- Stability: Could we cause congestion collapse?
- Performance: Do we often time out unnecessarily?
- Stability preserved
- Timeouts retain exponential backoff
- Spurious timeouts slow rate of transfer
- Performance: spurious timeouts vs. timely response
- No optimal RTO estimator [Allman99]
Spurious timeouts less harmful
- 1. Detect spurious timeouts
- Using TCP timestamp option
- 2. Recover from spurious timeouts
- Forward RTO (F-RTO)
Today’s TCP has mechanisms to:
Both implemented widely!
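For the detection half, the timestamp check works roughly as follows (an Eifel-style sketch; the function and field names are mine, not from any implementation):

```python
# Spurious-timeout detection via the TCP timestamp option: segments
# carry the sender's clock (TSval) and ACKs echo it back (TSecr).
# If the first ACK after a retransmission echoes a TSval taken *before*
# the retransmission was sent, it acknowledges the original packet,
# so the timeout was spurious.

def timeout_was_spurious(ack_ts_echo: int, retransmit_tsval: int) -> bool:
    """ack_ts_echo: TSecr on the first ACK after the RTO fired.
    retransmit_tsval: TSval stamped on the retransmitted segment."""
    return ack_ts_echo < retransmit_tsval

# Example: the retransmission went out at tick 1050, but the ACK echoes
# tick 1000 (the original send time), so the RTO fired too early.
print(timeout_was_spurious(ack_ts_echo=1000, retransmit_tsval=1050))  # True
```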
Wide-area Performance Without minRTO
- Do microsecond timeouts harm wide-area throughput?
[Figure: experiment setup: two seeds, one running standard TCP and one running TCP with no RTO bound, serve content to real BitTorrent clients over the wide area]
Wide-area Results
[Figure: CDF of per-sample throughput (Kbps, log scale), 200 ms RTOmin (default) vs. 200 µs RTOmin]
- No noticeable difference in throughput with no RTOmin
- Few total timeouts (spurious or legitimate)
Conclusion
- Microsecond RTOs can help datacenter application response time and throughput
- Safe for wide-area communication as well
- Linux patch available:
- http://www.cs.cmu.edu/~vrv/incast/
Summary
- Datacenter Network Topologies
- TCP problems in datacenter networks
– Incast
– Queue buildup
- Using fine-grained timers to improve TCP