CompSci 514: Computer Networks
Lecture 13: TCP Incast and Solutions


SLIDE 1

CompSci 514: Computer Networks Lecture 13 TCP incast and Solutions

Xiaowei Yang

SLIDE 2

Roadmap

  • Midterm I summary
  • Midterm II date change: 12/15

– Make-up lecture for the hurricane
– Many can’t make it on that day

  • Today

– Other data center network topologies
– What TCP incast is
– Solutions

SLIDE 3

Other Datacenter network topologies

SLIDE 4

Charles E. Leiserson's 60th Birthday Symposium

http://www.infotechlead.com/2013/03/28/gartner-data-center-spending-to-grow-3-7-to-146-bi

SLIDE 5

What to build?

  • “Fat-tree” [SIGCOMM 2008]
  • VL2 [SIGCOMM 2009, CoNEXT 2013]
  • DCell [SIGCOMM 2008]
  • BCube [SIGCOMM 2009]
  • Jellyfish [NSDI 2012]


This question has spawned a cottage industry in the computer networking research community.

SLIDE 6

“Fat-tree” (SIGCOMM 2008)

  • Isomorphic to a butterfly network, except at the top level
  • Bisection width n/2, oversubscription ratio 1
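These sizing properties can be sanity-checked numerically. A minimal sketch, assuming the k-ary fat-tree construction from k-port switches described in the SIGCOMM 2008 paper (the helper name is ours):

```python
def fat_tree_stats(k):
    """Sizing for a k-ary fat-tree built from k-port switches (k even)."""
    hosts = k ** 3 // 4          # k pods x (k/2) edge switches x (k/2) hosts each
    core = (k // 2) ** 2         # core switches at the top level
    bisection = hosts // 2       # n/2 links cross any balanced cut: full bisection
    return hosts, core, bisection

# 48-port switches: 27,648 hosts
print(fat_tree_stats(48))  # -> (27648, 576, 13824)
```

Because every level carries the full aggregate bandwidth, the oversubscription ratio stays 1 at any k.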

SLIDE 7

VL2 (SIGCOMM 2009, CoNEXT 2013)

  • Also called a Clos network
  • Oversubscription ratio 1
    (but 1 Gbps links at leaves, 10 Gbps elsewhere)

SLIDE 8

DCell (SIGCOMM 2008)

  • A “clique of cliques”; servers forward packets
  • n servers in DCell0, (n+1)n servers in DCell1, (((n+1)n)+1)(n+1)n in DCell2
  • Oversubscription ratio 1

[Figure: DCell1 with n = 4]
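The server counts above follow a simple recurrence; a sketch (the function name is ours):

```python
def dcell_servers(n, k):
    """Servers in a level-k DCell built from n-port switches.

    t_0 = n servers in a DCell0; a DCell_k combines (t_{k-1} + 1)
    copies of DCell_{k-1}, so t_k = t_{k-1} * (t_{k-1} + 1).
    """
    t = n
    for _ in range(k):
        t = t * (t + 1)
    return t

# n = 4 as in the figure
print([dcell_servers(4, k) for k in range(3)])  # -> [4, 20, 420]
```

The doubly-exponential growth is the point of the design: a few levels reach millions of servers with small switches.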

SLIDE 9

BCube (SIGCOMM 2009)

  • Mesh of stars (analogous to a mesh of trees)
  • Bisection width between 0.25n and 0.35n

SLIDE 10

Jellyfish (NSDI 2012)

  • Random connections
  • Bisection width Θ(n), vs. the “fat-tree” / butterfly

SLIDE 11


http://www.itdisasters.com/category/networking-rack/

SLIDE 12

Datacenter Transport Protocols

SLIDE 13

Data Center Packet Transport

  • Large purpose-built DCs

– Huge investment: R&D, business

  • Transport inside the DC

– TCP rules (99.9% of traffic)

  • How’s TCP doing?


SLIDE 14

TCP does not meet demands of DC applications

  • Goal:

– Low latency
– High throughput

  • TCP does not meet these demands.

– Incast

  • Suffers from bursty packet drops

– Large queues:

Ø Add significant latency
Ø Waste precious buffers, especially bad with shallow-buffered switches

SLIDE 15

Case Study: Microsoft Bing

  • Measurements from a 6,000-server production cluster
  • Instrumentation passively collects logs

‒ Application-level
‒ Socket-level
‒ Selected packet-level

  • More than 150TB of compressed data over a month


SLIDE 16

Partition/Aggregate Application Structure

[Diagram: a Top-Level Aggregator (TLA) fans a request out to Mid-Level Aggregators (MLAs), which fan it out to Worker Nodes; partial results are merged back up the tree.]

Example: a query for “Picasso” is partitioned across workers, each holding quotes (“Everything you can imagine is real.” “Bad artists copy. Good artists steal.” “It is your work in life that is the ultimate seduction.” “The chief enemy of creativity is good sense.” “Inspiration does exist, but it must find you working.” “I'd like to live as a poor man with lots of money.” “Art is a lie that makes us realize the truth.” “Computers are useless. They can only give you answers.”). Each level aggregates and ranks partial answers (“1. Art is a lie… 2. The chief…”) into the final result.

  • Time is money
    Ø Strict deadlines (SLAs)
  • Missed deadline
    Ø Lower quality result

Deadline = 250 ms / 50 ms / 10 ms (tighter at each lower level of the tree)

SLIDE 17

Generality of Partition/Aggregate

  • The foundation for many large-scale web applications.

– Web search, Social network composition, Ad selection, etc.

  • Example: Facebook

Partition/Aggregate ~ Multiget

– Aggregators: Web Servers
– Workers: Memcached Servers


[Diagram: requests from the Internet reach Web Servers (aggregators), which fan out over the Memcached Protocol to Memcached Servers (workers).]

SLIDE 18

Workloads

  • Partition/Aggregate (Query): delay-sensitive
  • Short messages [50KB-1MB] (Coordination, Control state): delay-sensitive
  • Large flows [1MB-50MB] (Data update): throughput-sensitive

SLIDE 19

Workload characterization


Figure 3: Time between arrival of new work for the Aggrega- tor (queries) and between background flows between servers (update and short message).

SLIDE 20

Workload characterization


Figure 4: PDF of flow size distribution for background traffic. PDF of Total Bytes shows probability a randomly selected byte would come from a flow of given size.

SLIDE 21

Workload characterization


Figure 5: Distribution of number of concurrent connections.

SLIDE 22

Impairments

  • Incast
  • Queue Buildup
  • Buffer Pressure


SLIDE 23

Incast

[Diagram: Workers 1-4 respond to the Aggregator simultaneously; a dropped response triggers a TCP timeout with RTOmin = 300 ms.]

  • Synchronized mice collide.
    Ø Caused by Partition/Aggregate.

SLIDE 24

Incast Really Happens

  • Requests are jittered over 10ms window.
  • Jittering switched off around 8:30 am.


Jittering trades off the median against high percentiles. The 99.9th percentile is being tracked.
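The trade-off can be seen in a toy single-queue simulation. This is a sketch with illustrative numbers only (the 8-packet buffer and 250 µs per-packet drain time are assumptions, not Bing measurements):

```python
def dropped(arrival_us, capacity, drain_us):
    """Tail drops at one output port: a queue holding `capacity` packets,
    draining one packet every `drain_us` microseconds."""
    q, drops, next_drain = 0, 0, 0
    for t in sorted(arrival_us):
        while q > 0 and next_drain <= t:   # remove packets that drained by t
            q -= 1
            next_drain += drain_us
        if q == 0:
            q, next_drain = 1, t + drain_us  # queue empty: restart drain clock
        elif q < capacity:
            q += 1
        else:
            drops += 1
    return drops

burst = [0] * 40                          # 40 synchronized responses
jittered = [i * 250 for i in range(40)]   # same 40, spread over a 10 ms window
print(dropped(burst, 8, 250), dropped(jittered, 8, 250))  # -> 32 0
```

The synchronized burst overflows the shallow buffer and drops most packets; the jittered arrivals never see a full queue. The cost is that every request now waits up to the jitter window, which is exactly the median-versus-tail trade-off on the slide.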

[Figure: MLA query completion time (ms) over the day; the 99.9th percentile spikes once jittering is switched off.]

SLIDE 25

Queue Buildup

[Diagram: Senders 1 and 2 share one output queue to the Receiver.]

  • Big flows build up queues.
    Ø Increased latency for short flows.
  • Measurements in the Bing cluster
    Ø For 90% of packets: RTT < 1 ms
    Ø For 10% of packets: 1 ms < RTT < 15 ms

SLIDE 26

Safe and Effective Fine-grained TCP Retransmissions for Datacenter Communication

Vijay Vasudevan, Amar Phanishayee, Hiral Shah, Elie Krevat David Andersen, Greg Ganger, Garth Gibson, Brian Mueller* Carnegie Mellon University, *Panasas Inc.

SLIDE 27

Datacenter TCP Request-Response

[Diagram: the client sends a request through the switch; the server's response is dropped, the flow stalls, and the response is resent only after a 200 ms response delay.]

SLIDE 28

Applications Sensitive to 200ms TCP Timeouts

  • “Drive-bys” affecting single-flow request/response
  • Barrier-Sync workloads
  • Parallel cluster filesystems (Incast workloads)
  • Massive multi-server queries (e.g., previous talk)
  • Latency-sensitive, customer-facing
SLIDE 29

Main Takeaways

  • Problem: 200ms TCP timeouts can cripple datacenter apps
  • Solution: Enable microsecond retransmissions
  • Can improve datacenter app throughput/latency
  • Safe in the wide-area
SLIDE 30

The Datacenter Environment

[Diagram: client and servers attached to a commodity Ethernet switch.]

  • 1-10 Gbps links, 10-100 µs latency
  • Commodity Ethernet switches
  • Under heavy load, packet losses are frequent.

SLIDE 31

TCP: Loss Recovery Comparison

[Diagram: data-driven recovery (three duplicate ACKs for segment 1 trigger an immediate retransmit of segment 2) vs. timeout-driven recovery (the sender sits idle until the retransmission timeout (RTO) fires).]

  • Data-driven recovery is super fast (µs)
  • Timeout-driven recovery is painfully slow (ms)
  • minRTO = 200.0 ms vs. DC latency = 0.1 ms: one TCP timeout lasts 1000 RTTs!

SLIDE 32

RTO Estimation and Minimum Bound

  • Jacobson’s TCP RTO Estimator
  • RTO = SRTT + 4*RTTVAR
  • Minimum RTO bound = 200ms
  • Actual RTO Timer = max(200ms, RTO)
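A sketch of this estimator with the standard gains (units in seconds; the class name is ours):

```python
class RtoEstimator:
    """Jacobson-style RTO: RTO = SRTT + 4 * RTTVAR, floored at min_rto."""
    def __init__(self, min_rto=0.200):
        self.srtt = self.rttvar = None
        self.min_rto = min_rto

    def update(self, r):                     # r: a new RTT sample in seconds
        if self.srtt is None:                # first sample initializes state
            self.srtt, self.rttvar = r, r / 2
        else:
            self.rttvar = 0.75 * self.rttvar + 0.25 * abs(self.srtt - r)
            self.srtt = 0.875 * self.srtt + 0.125 * r
        return max(self.min_rto, self.srtt + 4 * self.rttvar)

# A 100 us datacenter RTT yields a raw RTO of ~300 us,
# but the 200 ms minimum bound dominates it.
print(RtoEstimator().update(100e-6))           # -> 0.2
print(RtoEstimator(min_rto=0).update(100e-6))  # ~0.0003
```

This makes the paper's point concrete: the estimator itself adapts to microsecond RTTs; it is the fixed 200 ms floor that inflates datacenter timeouts by three orders of magnitude.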
SLIDE 33

The Incast Workload

[Diagram: the client issues a synchronized read for a data block striped across 4 storage servers; each server returns one Server Request Unit (SRU). Only after all 4 SRUs arrive does the client send the next batch of requests.]

SLIDE 34

Incast Workload Overfills Buffers

[Diagram: the four synchronized SRU responses overfill the switch buffer; responses 1-3 complete, response 4 is dropped and resent only after a timeout, and the link sits idle in the meantime.]

Monday, September 14, 2009

SLIDE 35

Client Link Utilization

[Figure: client link utilization; the link sits IDLE for 200 ms during each timeout.]

SLIDE 36

Cluster Environment

  • 1 Gbps Ethernet, 100 µs delay
  • 200 ms RTO, S50 switch, 1 MB block size

200ms Timeouts Cause Throughput Collapse

  • [Nagle04] called this Incast; provided an app-level workaround
  • Cause of throughput collapse: 200 ms TCP timeouts
  • Prior work: other TCP variants did not prevent TCP timeouts [Phanishayee:FAST2008]
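Back-of-the-envelope arithmetic shows why the collapse sets in as servers are added. The 1 MB block and 32 KB buffer match the experiment above; the 3-packet initial window and 1500-byte MTU are typical assumptions, not measurements:

```python
MTU = 1500           # bytes per packet (assumed)
BUFFER = 32 * 1024   # shared output-port buffer, as in the experiment
BLOCK = 1 << 20      # 1 MB data block, striped across the servers

def first_rtt_bytes(n_servers, init_window_pkts=3):
    """Bytes hitting the port in the first RTT, before any loss feedback."""
    sru = BLOCK // n_servers
    per_server = min(sru, init_window_pkts * MTU)
    return n_servers * per_server

for n in (4, 16, 64):
    print(n, first_rtt_bytes(n), first_rtt_bytes(n) > BUFFER)
# -> 4 18000 False
#    16 72000 True
#    64 288000 True
```

Even before congestion windows grow, a few dozen synchronized senders exceed a shallow shared buffer in their very first round trip, so drops (and hence 200 ms timeouts) become unavoidable.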

[Figure: goodput (Mbps) vs. number of servers. Goodput collapses as more servers are added.]

SLIDE 37

Latency-sensitive Apps

  • Request for 4 MB of data sharded across 16 servers (256 KB each)
  • How long does it take for all of the 4 MB of data to return?

SLIDE 38

Timeouts Increase Latency

4 MB (256 KB from 16 servers)

[Figure: histogram of block transfer times. Most blocks finish near the ideal response time, but a tail of responses is delayed by one or more 200 ms TCP timeouts.]
SLIDE 39

Outline

  • Problem Description, Examples
  • Solution: Microsecond TCP Retransmissions
  • Is it safe?


SLIDE 40

First attempt: reducing RTOmin

[Figure 2: Goodput vs. RTOmin (200 µs to 200 ms) for 4-128 servers; block size 1 MB, buffer 32 KB. Reducing the RTOmin in simulation to microseconds from the current default value of 200 ms improves goodput.]

[Figure 3: Experiments on a real cluster (4-16 servers) validate the simulation result that reducing the RTOmin to microseconds improves goodput.]

SLIDE 41

What if link speed increases?

[Figure 5: The distribution of RTTs from an active storage node at Los Alamos National Lab shows an appreciable number of RTTs in the 10s of microseconds.]
SLIDE 42

[Figure 6: Average goodput vs. number of servers (32-2048); block size 80 MB, buffer 32 KB, RTT 20 µs, with no RTOmin, 200 µs RTOmin, and 1 ms RTOmin. In simulation, flows experience reduced goodput when retransmissions do not fire at the same granularity as RTTs. Fine-grained timers can observe suboptimal goodput for a large number of servers if retransmissions are tightly synchronized.]

SLIDE 43

Eliminate minRTO

✓ Simple one-line change in Linux

  • Does not change RTT measurement granularity
  • Still uses low-resolution, 1 ms kernel timers
SLIDE 44

Eliminating the RTO bound helps

[Figure: goodput vs. number of servers for unmodified TCP and TCP with no RTO bound. Removing the bound helps, but goodput still collapses as servers are added: millisecond retransmissions are not enough.]

SLIDE 45

Requirements for Microsecond RTO

  • TCP must track RTT in microseconds
    – Modify internal data structures
    – Timestamp option (backwards compatible)
  • Efficient high-resolution kernel timers
    – Use HPET for efficient interrupt signaling
SLIDE 46

Implementing fine-grained TCP timers

  • Old implementation uses the kernel tick (HZ), which updates 100/250/1000 times per second
    – Linux jiffy rate: 250 Hz
    – RTT + 4·VAR at that granularity → minimum 5 ms RTO
  • hrtimers
    – The Generic Time of Day (GTOD) framework provides the kernel and other applications with nanosecond-resolution timekeeping, using the CPU cycle counter on all modern processors
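The kernel work relies on hrtimers/HPET; the same idea can be illustrated from user space, where modern monotonic clocks already expose nanosecond-granularity timestamps (this sketch only demonstrates the measurement side, not an in-kernel timer):

```python
import time

def timed_us(fn):
    """Run fn and return its elapsed time in microseconds, measured
    with the high-resolution monotonic clock."""
    t0 = time.perf_counter_ns()
    fn()
    return (time.perf_counter_ns() - t0) / 1e3

elapsed = timed_us(lambda: time.sleep(0.001))   # a ~1 ms "RTT"
print(round(elapsed))  # roughly 1000-2000 us on an idle machine
```

With timestamps at this resolution, RTT samples of a few hundred microseconds can be measured faithfully instead of being rounded to the nearest jiffy.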
SLIDE 47

Microsecond timeouts are necessary

[Figure: goodput vs. number of servers for unmodified TCP, TCP with no RTO bound, and microsecond TCP with no RTO bound. Only microsecond-granularity retransmissions sustain high throughput, for up to 47 servers.]

SLIDE 48

Improvement to Latency

4 MB (256 KB from 16 servers)

[Figure: block transfer time histograms with the 200 ms RTO vs. the microsecond RTO. With microsecond retransmissions the 200 ms timeout tail disappears: all responses return within 40 ms.]

SLIDE 49

Outline

  • Problem Description, Examples
  • Solution: Microsecond TCP Retransmissions
  • Is it safe?
    – Interaction with Delayed ACK
    – Performance in the wide-area
SLIDE 50

μs RTO and Delayed ACK

[Diagram: with RTO > 40 ms, the sender's timer outlasts the receiver's 40 ms delayed-ACK timer and the ACK arrives in time. With RTO < 40 ms, the RTO fires before the delayed ACK: a premature timeout and a spurious retransmission of packet 1.]

SLIDE 51

Impact of Delayed ACKs

[Figure: throughput (Mbps) vs. number of servers, with and without delayed ACKs on clients: 10-15% loss in throughput with delayed-ACK clients.]

SLIDE 52

Is it safe for the wide area?

  • Potential concerns:
    – Stability: could we cause congestion collapse?
    – Performance: do we often time out unnecessarily?
  • Stability preserved
    – Timeouts retain exponential backoff
    – Spurious timeouts slow the rate of transfer
  • Performance: spurious timeouts vs. timely response
    – No optimal RTO estimator [Allman99]
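The stability argument rests on the backoff being independent of the initial timeout value; a minimal sketch (the 60 s cap is illustrative):

```python
def backoff_schedule(base_rto, n, max_rto=60.0):
    """First n timeout intervals under binary exponential backoff,
    capped at max_rto seconds."""
    return [min(max_rto, base_rto * 2 ** i) for i in range(n)]

# Even a 200 us initial RTO reaches second-scale waits within ~13 backoffs,
# so senders facing persistent congestion are still throttled hard.
print(backoff_schedule(200e-6, 5))  # -> [0.0002, 0.0004, 0.0008, 0.0016, 0.0032]
```

Shrinking the first timeout only shifts the schedule's starting point; the doubling still guarantees the load on a congested path falls off geometrically, which is why congestion collapse is not a concern.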
SLIDE 53

Spurious timeouts less harmful

Today’s TCP has mechanisms to:

  • 1. Detect spurious timeouts
    – Using the TCP timestamp option
  • 2. Recover from spurious timeouts
    – Forward RTO recovery (F-RTO)

Both implemented widely!

SLIDE 54

Wide-area Performance Without minRTO

  • Do microsecond timeouts harm wide-area throughput?

[Setup: seed content to real BitTorrent clients with standard TCP vs. TCP with no RTO bound.]

SLIDE 55

Wide-area Results

[Figure: CDF of per-sample throughput (Kbps, log scale) for the default 200 ms RTOmin vs. 200 µs RTOmin: no noticeable difference in throughput.]

  • With no RTOmin: few total timeouts (spurious or legitimate)

SLIDE 56

Conclusion

  • Microsecond RTOs can help datacenter application response time and throughput
  • Safe for wide-area communication as well
  • Linux patch available: http://www.cs.cmu.edu/~vrv/incast/
SLIDE 57

Summary

  • Datacenter network topologies
  • TCP problems in datacenter networks
    – Incast
    – Queue buildup
  • Using fine-grained timers to improve TCP performance