CompSci 514: Computer Networks
Lecture 13: TCP Incast and Solutions
Xiaowei Yang
Roadmap
- Midterm I summary
- Midterm II date change: 12/15
– Make-up lecture for the hurricane
– Many can’t make it on that day
- Today
– Other data center network topologies
– What is TCP incast
– Solutions
Other Datacenter network topologies
Charles E. Leiserson's 60th Birthday Symposium
http://www.infotechlead.com/2013/03/28/gartner-data-center-spending-to-grow-3-7-to-146-bi
What to build?
- “Fat-tree” [SIGCOMM 2008]
- VL2 [SIGCOMM 2009, CoNEXT 2013]
- DCell [SIGCOMM 2008]
- BCube [SIGCOMM 2009]
- Jellyfish [NSDI 2012]
This question has spawned a cottage industry in the computer networking research community.
“Fat-tree” SIGCOMM 2008
- Isomorphic to a butterfly network except at the top level
- Bisection width n/2; oversubscription ratio 1
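As a quick check on these numbers, here is a small sketch (mine, not from the paper) that computes the standard component counts of a k-ary fat-tree built from k-port commodity switches:

```python
# Component counts for a k-ary fat-tree (Al-Fares et al., SIGCOMM 2008).
# k must be even; every switch is a k-port commodity switch.
def fat_tree_sizes(k: int) -> dict:
    assert k % 2 == 0, "fat-tree requires an even port count"
    return {
        "pods": k,
        "edge_switches": k * (k // 2),         # k/2 per pod
        "aggregation_switches": k * (k // 2),  # k/2 per pod
        "core_switches": (k // 2) ** 2,
        "hosts": (k ** 3) // 4,                # (k/2)^2 hosts/pod x k pods
    }

# Example: 48-port switches support 27,648 hosts at full bisection bandwidth.
print(fat_tree_sizes(48))
```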
VL2 (SIGCOMM 2009, CoNEXT 2013)
- Also called a Clos network
- Oversubscription ratio 1 (but 1 Gbps links at leaves, 10 Gbps elsewhere)
DCell (SIGCOMM 2008)
- A “clique of cliques”; servers forward packets
- n servers in DCell0; (n+1)n servers in DCell1; (((n+1)n)+1)(n+1)n in DCell2
- Oversubscription ratio 1

[Figure: a DCell1 built with n = 4]
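The server counts on this slide follow a simple recurrence: a DCell_k is assembled from t_{k-1} + 1 copies of DCell_{k-1}, so t_k = t_{k-1}(t_{k-1} + 1). A tiny sketch (illustrative only) reproducing the slide's numbers:

```python
# Server count of a DCell_k built from n-port switches:
# t_0 = n, and t_k = t_{k-1} * (t_{k-1} + 1), matching the slide:
# n, (n+1)n, (((n+1)n)+1)(n+1)n, ...
def dcell_servers(n: int, k: int) -> int:
    t = n
    for _ in range(k):
        t = t * (t + 1)
    return t

# With n = 4 (as in the figure): 4, 20, 420, 176820 servers.
for level in range(4):
    print(level, dcell_servers(4, level))
```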
BCube (SIGCOMM 2009)
- Mesh of stars (analogous to a mesh of trees)
- Bisection width between 0.25n and 0.35n
Jellyfish (NSDI 2012)
- Random connections
- Bisection width Θ(n)
[Figure: Jellyfish vs. “fat-tree”/butterfly topologies]
Datacenter Transport Protocols
Data Center Packet Transport
- Large purpose-built DCs
– Huge investment: R&D, business
- Transport inside the DC
– TCP rules (99.9% of traffic)
- How’s TCP doing?
TCP does not meet demands of DC applications
- Goal:
– Low latency
– High throughput
- TCP does not meet these demands.
– Incast
Ø Suffers from bursty packet drops
– Large queues:
Ø Add significant latency.
Ø Waste precious buffers, especially bad with shallow-buffered switches.
Case Study: Microsoft Bing
- Measurements from a 6,000-server production cluster
- Instrumentation passively collects logs
‒ Application-level
‒ Socket-level
‒ Selected packet-level
- More than 150TB of compressed data over a month
Partition/Aggregate Application Structure

[Figure: a request tree: the top-level aggregator (TLA) fans out to mid-level aggregators (MLAs), which fan out to worker nodes]
Picasso

“Everything you can imagine is real.”
“Bad artists copy. Good artists steal.”
“It is your work in life that is the ultimate seduction.”
“The chief enemy of creativity is good sense.”
“Inspiration does exist, but it must find you working.”
“I'd like to live as a poor man with lots of money.”
“Art is a lie that makes us realize the truth.”
“Computers are useless. They can only give you answers.”

[Animation: the query “Picasso” is partitioned across workers, each holding a subset of the quotes; every worker returns its own ranked list (“1. Art is a lie…”, “2. The chief…”, …), and the aggregator merges them into the final ranking.]
- Time is money
Ø Strict deadlines (SLAs)
- Missed deadline
Ø Lower quality result
[Figure: deadlines tighten down the tree: 250 ms at the TLA, 50 ms at the MLAs, 10 ms at the workers]
Generality of Partition/Aggregate
- The foundation for many large-scale web applications
– Web search, social network composition, ad selection, etc.
- Example: Facebook
Partition/Aggregate ~ Multiget
– Aggregators: web servers
– Workers: memcached servers
[Figure: web servers receive requests from the Internet and fan out to memcached servers via the memcached protocol]
Workloads
- Partition/Aggregate (query): delay-sensitive
- Short messages, 50 KB-1 MB (coordination, control state): delay-sensitive
- Large flows, 1 MB-50 MB (data update): throughput-sensitive
Workload characterization

[Figure 3: Time between arrival of new work for the aggregator (queries) and between background flows between servers (update and short message); CDFs of interarrival time for query and background traffic.]
Workload characterization
[Figure 4: PDF of flow size distribution for background traffic (flow sizes from 10^3 to 10^8 bytes). The PDF of total bytes shows the probability that a randomly selected byte comes from a flow of a given size.]
Workload characterization
[Figure 5: Distribution of the number of concurrent connections, for all flows and for flows > 1 MB.]
Impairments
- Incast
- Queue Buildup
- Buffer Pressure
Incast
[Figure: four workers answer the aggregator at once; the synchronized responses overflow the switch buffer and trigger a TCP timeout (RTOmin = 300 ms)]
- Synchronized mice collide.
Ø Caused by Partition/Aggregate.
Incast Really Happens
- Requests are jittered over 10ms window.
- Jittering switched off around 8:30 am.
Jittering trades off the median against the high percentiles; the 99.9th percentile is being tracked.

[Figure: MLA query completion time (ms) over the day; the high percentiles jump when jittering is switched off around 8:30 am]
Queue Buildup
[Figure: two senders share a switch port to one receiver; a large flow from one sender builds a queue that delays the other sender's short flow]
- Big flows buildup queues.
Ø Increased latency for short flows.
- Measurements in Bing cluster
Ø For 90% of packets: RTT < 1 ms
Ø For 10% of packets: 1 ms < RTT < 15 ms
Safe and Effective Fine-grained TCP Retransmissions for Datacenter Communication
Vijay Vasudevan, Amar Phanishayee, Hiral Shah, Elie Krevat David Andersen, Greg Ganger, Garth Gibson, Brian Mueller* Carnegie Mellon University, *Panasas Inc.
Datacenter TCP Request-Response
[Figure: a client sends a request through the switch to a server; one of the response packets is dropped, the flow stalls, and the response is resent only after a 200 ms delay]
Applications Sensitive to 200ms TCP Timeouts
- “Drive-bys” affecting single-flow request/response
- Barrier-Sync workloads
- Parallel cluster filesystems (Incast workloads)
- Massive multi-server queries (e.g., previous talk)
- Latency-sensitive, customer-facing
Main Takeaways
- Problem: 200ms TCP timeouts can cripple datacenter apps
- Solution: Enable microsecond retransmissions
- Can improve datacenter app throughput/latency
- Safe in the wide area
The Datacenter Environment
[Figure: a client and multiple servers attached to a single commodity Ethernet switch]
- 1-10 Gbps links, 10-100 µs latency
- Commodity Ethernet switches
- Under heavy load, packet losses are frequent
TCP: Loss Recovery Comparison
[Figure: two loss-recovery timelines. Data-driven recovery: the receiver's duplicate ACKs for segment 1 let the sender retransmit segment 2 immediately (fast retransmit). Timeout-driven recovery: the sender must wait out a full retransmission timeout (RTO) before resending.]
- Timeout-driven recovery is painfully slow (milliseconds)
- Data-driven recovery is super fast (microseconds)
- minRTO = 200.0 ms vs. datacenter latency = 0.1 ms: one TCP timeout lasts 1000 RTTs!
RTO Estimation and Minimum Bound
- Jacobson’s TCP RTO estimator
– RTO = SRTT + 4 × RTTVAR
- Minimum RTO bound = 200 ms
- Actual RTO timer = max(200 ms, RTO)
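A minimal Python sketch of this estimator, using the classic gains of 1/8 for SRTT and 1/4 for RTTVAR (the standard RFC 6298 values, assumed here rather than stated on the slide):

```python
# Jacobson-style RTO estimator with the 200 ms minimum bound.
ALPHA = 1 / 8    # gain for the smoothed RTT
BETA = 1 / 4     # gain for the RTT variance
MIN_RTO = 0.200  # 200 ms floor

class RtoEstimator:
    def __init__(self):
        self.srtt = None    # smoothed RTT (seconds)
        self.rttvar = None  # smoothed RTT variance (seconds)

    def update(self, rtt_sample: float) -> float:
        """Feed one RTT sample; return the clamped RTO."""
        if self.srtt is None:
            self.srtt = rtt_sample
            self.rttvar = rtt_sample / 2
        else:
            self.rttvar = (1 - BETA) * self.rttvar + BETA * abs(self.srtt - rtt_sample)
            self.srtt = (1 - ALPHA) * self.srtt + ALPHA * rtt_sample
        rto = self.srtt + 4 * self.rttvar
        return max(MIN_RTO, rto)  # actual timer = max(200 ms, RTO)

# With ~100 us datacenter RTTs the computed RTO is a few hundred us,
# but the 200 ms floor always dominates: every call prints 0.2.
est = RtoEstimator()
for rtt in (100e-6, 110e-6, 95e-6, 105e-6):
    print(est.update(rtt))
```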
The Incast Workload
[Figure: synchronized reads. The client requests a data block striped across four storage servers as Server Request Units (SRUs) 1-4; all servers respond through the switch at once, and the client sends the next batch of requests only after every SRU of the current block has arrived.]
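To build intuition for why the barrier stalls, here is a deliberately oversimplified toy model (my own construction, not the authors' ns-2 simulator); the buffer, window, and packet-size parameters are illustrative assumptions:

```python
# Toy incast model: a block is striped across n servers as 256 KB SRUs;
# each server bursts a window of packets into one switch port at once.
# If the synchronized burst overflows the shallow buffer, a tail packet
# is lost and the whole barrier waits out a full 200 ms RTO.

def block_transfer_time(n_servers, sru_bytes=256 * 1024, link_bps=1e9,
                        buffer_pkts=64, window_pkts=8, pkt_bytes=1500,
                        rtt=200e-6, rto=0.200):
    pkts_per_server = sru_bytes // pkt_bytes + 1
    # Time to serialize the whole block onto the bottleneck link.
    drain = n_servers * pkts_per_server * pkt_bytes * 8 / link_bps
    # Synchronized windows exceeding the buffer cause a drop, and the
    # dropped SRU stalls the entire block for one RTO.
    stalled = n_servers * window_pkts > buffer_pkts
    return drain + rtt + (rto if stalled else 0.0)

# Goodput is near line rate until the burst overflows the buffer,
# then collapses (here at 16 servers in this toy parameterization).
for n in (1, 4, 8, 16):
    t = block_transfer_time(n)
    goodput_mbps = n * 256 * 1024 * 8 / t / 1e6
    print(f"{n:2d} servers: {t * 1e3:6.1f} ms, goodput {goodput_mbps:6.1f} Mbps")
```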
Incast Workload Overfills Buffers
[Figure: requests sent and received; responses 1-3 complete, response 4 is dropped and resent. The client's link sits idle until the retransmission.]
[Figure: client link utilization; the link is idle for 200 ms while the dropped response waits out the timeout]
Cluster environment: 1 Gbps Ethernet, 100 µs delay, 200 ms RTO, S50 switch, 1 MB block size
- [Nagle04] called this Incast; provided app-level workaround
- Cause of throughput collapse: 200ms TCP Timeouts
- Prior work: other TCP variants did not prevent TCP timeouts [Phanishayee, FAST 2008]
200ms Timeouts Cause Throughput Collapse
!" !#"" !$"" !%"" !&"" !'"" !("" !)"" !*"" !+"" !#"""
!" !' !#" !#' !$" !$' !%" !%' !&" !&'
,-./0123-4!56/370!8396:
?:/=B4:7B6! !!!CDE72F
More servers
Latency-sensitive Apps
- Request for 4 MB of data sharded across 16 servers (256 KB each)
- How long does it take for all of the 4 MB of data to return?
Timeouts Increase Latency
[Figure: distribution of block transfer times (ms) for 4 MB blocks (256 KB from each of 16 servers) with the default RTO. Most transfers finish near the ideal response time, but a long tail is delayed by one or more 200 ms TCP timeouts.]
Outline
- Problem Description, Examples
- Solution: Microsecond TCP Retransmissions
- Is it safe?
First attempt: reducing RTOmin
[Figure 2: In simulation, reducing RTOmin from the current default of 200 ms down to microseconds improves goodput (block size = 1 MB, buffer = 32 KB; 4-128 servers).]

[Figure 3: Experiments on a real cluster validate the simulation result that reducing RTOmin to microseconds improves goodput (4-16 servers).]
What if link speed increases?
[Figure 5: The distribution of RTTs from an active storage node at Los Alamos National Lab shows an appreciable number of RTTs in the tens of microseconds.]

[Figure 6: In simulation, flows experience reduced goodput when retransmissions do not fire at the same granularity as RTTs; fine-grained timers can observe suboptimal goodput for a large number of servers if retransmissions are tightly synchronized (block size = 80 MB, buffer = 32 KB, RTT = 20 µs; no RTOmin vs. 200 µs vs. 1 ms RTOmin).]
Eliminate minRTO
✓ Simple one-line change in Linux
- Does not change RTT measurement granularity
- Still uses low-resolution, 1 ms kernel timers
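In terms of the estimator formula above, the change amounts to dropping the clamp. A sketch of the idea (illustrative only; the actual patch is a one-line edit in the Linux TCP stack, not this toy code):

```python
# Sketch of what eliminating minRTO means.

def rto_with_floor(srtt, rttvar, min_rto=0.200):
    return max(min_rto, srtt + 4 * rttvar)  # stock behavior

def rto_no_floor(srtt, rttvar):
    return srtt + 4 * rttvar  # let the estimate track the measured RTT

# With a 100 us datacenter RTT and 20 us of variance:
print(rto_with_floor(100e-6, 20e-6))  # 0.2     (the floor dominates)
print(rto_no_floor(100e-6, 20e-6))    # 0.00018 (180 us)
```

Even with the floor gone, the slide's last point still applies: low-resolution kernel timers bound how soon the retransmission can actually fire.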
Eliminating the RTO bound helps
!" !#"" !$"" !%"" !&"" !'"" !("" !)"" !*"" !+"" !#"""
!" !' !#" !#' !$" !$' !%" !%' !&" !&'
,-./0123-4!56/370!8396:
?:/=F4:7F6! !!!GHE72I
!" !#"" !$"" !%"" !&"" !'"" !("" !)"" !*"" !+"" !#"""
!" !' !#" !#' !$" !$' !%" !%' !&" !&'
,-./0123-4!56/370!8396:
?:/=B4:7B6! !!!CDE72F
Unmodified TCP No RTO Bound
More servers
Millisecond retransmissions not enough
Requirements for Microsecond RTO
- TCP must track RTT in microseconds
– Modify internal data structures
– Timestamp option (backwards compatible)
- Efficient high-resolution kernel timers
– Use HPET for efficient interrupt signaling
Implementing fine-grained TCP timers
- The old implementation uses a kernel timer that ticks HZ = 100/250/1000 times per second
– Linux jiffies at HZ = 250: rtt + 4 × var → a minimum RTO of about 5 ms
- hrtimers
– The Generic Time of Day (GTOD) framework provides the kernel and other applications with nanosecond-resolution timekeeping, using the CPU cycle counter available on all modern processors
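A simplified model of why jiffy-granularity timers impose a multi-millisecond floor (an assumed quantization model for intuition, not kernel code; hrtimers avoid this by scheduling at nanosecond resolution):

```python
# With HZ = 250, one jiffy is 4 ms, and a timer cannot fire before the
# next whole tick, so even a microsecond computed RTO is deferred.
import math

HZ = 250
JIFFY = 1.0 / HZ  # 4 ms per tick

def earliest_fire(rto_seconds: float) -> float:
    """Earliest time a jiffy-granularity timer can fire for a given RTO."""
    return math.ceil(rto_seconds / JIFFY) * JIFFY

print(earliest_fire(180e-6))  # 0.004: a 180 us RTO becomes 4 ms
print(earliest_fire(4.1e-3))  # 0.008: anything past one tick waits for two
```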
Microsecond timeouts are necessary
!" !#"" !$"" !%"" !&"" !'"" !("" !)"" !*"" !+"" !#"""
!" !' !#" !#' !$" !$' !%" !%' !&" !&'
,-./0123-4!56/370!8396:
?:/=F4:7F6! !!!GHE72I
Microsecond TCP + no RTO Bound Unmodified TCP No RTO Bound
!" !#"" !$"" !%"" !&"" !'"" !("" !)"" !*"" !+"" !#"""
!" !' !#" !#' !$" !$' !%" !%' !&" !&'
,-./0123-4!56/370!8396:
=:/<F4:7F6! !!!H;E72I
High throughput sustained for up to 47 servers
More servers
Improvement to Latency
All responses returned within 40ms
4 MB blocks (256 KB from each of 16 servers)

[Figure: block transfer time distributions (ms), 200 ms RTO vs. microsecond RTO]
Outline
- Problem Description, Examples
- Solution: Microsecond TCP Retransmissions
- Is it safe?
- Interaction with Delayed ACK
- Performance in the wide area
μs RTO and Delayed ACK
[Figure: two timelines. With RTO > 40 ms, the sender's timer outlasts the receiver's delayed-ACK timer and the ACK for packet 1 arrives in time. With RTO < 40 ms, the RTO on the sender triggers before the delayed ACK on the receiver, so packet 1 is retransmitted prematurely.]
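The race can be stated as a one-line predicate (a sketch; the 40 ms value is the Linux delayed-ACK timer from the slide):

```python
# The receiver may hold the ACK for a lone segment for up to the
# delayed-ACK timer, so any RTO shorter than that can fire spuriously.
DELAYED_ACK_TIMER = 0.040  # receiver-side delay for an unpaired segment

def premature_timeout(rto: float, ack_delay: float = DELAYED_ACK_TIMER) -> bool:
    """True if the sender's RTO expires before the delayed ACK is sent."""
    return rto < ack_delay

print(premature_timeout(0.200))   # False: a 200 ms RTO outlasts the delay
print(premature_timeout(200e-6))  # True: a microsecond RTO retransmits early
```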
Impact of Delayed ACKs
!" !#"" !$"" !%"" !&"" !'"" !("" !)"" !*"" !+"" !#""" !" !$ !& !( !* !#" !#$ !#& !#( ,-./01!23!4015016 !789:0;!<=2>?!@!#A<B!/-3301!@!%$C<!706DEFB!4G9D>H!@!I12>-150F J0=KL0;!MNC!J96K/=0; !!!!!!!!!!!!!!!!!!!J0=KL0;!MNC!&".6
10-15% loss in throughput with delayed ACK clients
Throughput (Mbps)
Is it safe for the wide area?
- Potential Concerns:
- Stability: Could we cause congestion collapse?
- Performance: Do we often time out unnecessarily?
- Stability preserved
- Timeouts retain exponential backoff
- Spurious timeouts slow rate of transfer
- Performance: spurious timeouts vs. timely response
- No optimal RTO estimator [Allman99]
Spurious timeouts less harmful
- 1. Detect spurious timeouts
- Using TCP timestamp option
- 2. Recover from spurious timeouts
- Forward RTO (F-RTO)
Today’s TCP has mechanisms to:
Both implemented widely!
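For the detection half, the timestamp check works roughly as follows (an Eifel-style sketch; the function and field names are mine, not from any implementation):

```python
# Spurious-timeout detection via the TCP timestamp option: segments
# carry the sender's clock (TSval) and ACKs echo it back (TSecr).
# If the first ACK after a retransmission echoes a TSval taken *before*
# the retransmission was sent, it acknowledges the original packet,
# so the timeout was spurious.

def timeout_was_spurious(ack_ts_echo: int, retransmit_tsval: int) -> bool:
    """ack_ts_echo: TSecr on the first ACK after the RTO fired.
    retransmit_tsval: TSval stamped on the retransmitted segment."""
    return ack_ts_echo < retransmit_tsval

# Example: the retransmission went out at tick 1050, but the ACK echoes
# tick 1000 (the original send time), so the RTO fired too early.
print(timeout_was_spurious(ack_ts_echo=1000, retransmit_tsval=1050))  # True
```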
Wide-area Performance Without minRTO
- Do microsecond timeouts harm wide-area throughput?
[Figure: experiment setup: two seeds, one running standard TCP and one running TCP with no RTO bound, serve content to real BitTorrent clients over the wide area]
Wide-area Results
[Figure: CDF of per-sample throughput (Kbps, log scale), 200 ms RTOmin (default) vs. 200 µs RTOmin]
- No noticeable difference in throughput with no RTOmin
- Few total timeouts (spurious or legitimate)
Conclusion
- Microsecond RTOs can help datacenter application response time and throughput
- Safe for wide-area communication as well
- Linux patch available:
- http://www.cs.cmu.edu/~vrv/incast/
Summary
- Datacenter Network Topologies
- TCP problems in datacenter networks
– Incast
– Queue buildup
- Using fine-grained timers to improve TCP