Identifying Performance Bottlenecks in CDNs through TCP-Level Monitoring




slide-1
SLIDE 1

Identifying Performance Bottlenecks in CDNs through TCP-Level Monitoring

Peng Sun

Minlan Yu, Michael J. Freedman, Jennifer Rexford Princeton University August 19, 2011

slide-2
SLIDE 2

Performance Bottlenecks

[Figure: CDN Servers (Server APP, Server OS), Internet, Clients]

  • Server APP: writes too slowly
  • Server OS: insufficient send buffer or small initial congestion window
  • Internet: network congestion
  • Client: insufficient receive buffer

slide-3
SLIDE 3

Reaction to Each Bottleneck

  • APP is bottleneck: Debug application
  • Server OS is bottleneck: Tune buffer size, or upgrade server
  • Internet is bottleneck: Circumvent the congested part of network
  • Client is bottleneck: Notify client to change

slide-4
SLIDE 4

Previous Techniques Not Enough

[Figure: Server APP, Packet Sniffer, Server OS]

  • Application logs: No details of network activities
  • Packet sniffing: Expensive to capture
  • Active probing: Extra load on network
  • Transport-layer stats: Directly reveal perf. bottlenecks

slide-5
SLIDE 5

How TCP Stats Reveal Bottlenecks

[Figure: CDN Server Applications, Server Network Stack, Network Path, Clients]

  • CDN Server Applications: insufficient data in send buffer
  • Server Network Stack: send buffer full, or initial congestion window too small
  • Network Path: packet loss
  • Clients: receive window too small

slide-6
SLIDE 6

Measurement Framework

  • Collect TCP statistics
      • Web100 kernel patch
      • Extract useful TCP stats for analyzing perf.
  • Analysis tool
      • Bottleneck classifier for individual connections
      • Cross-connection correlation at AS level
          • Map conn. to AS based on RouteViews
          • Correlate bottlenecks to drive CDN decisions
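
The "map conn. to AS" step can be sketched as a longest-prefix match of the client IP against a prefix-to-AS table. This is only an illustration: the table below is hypothetical (documentation prefixes and private-use AS numbers); a real table would be derived from a RouteViews BGP dump, and the function names are assumptions, not part of the tool.

```python
import ipaddress

# Hypothetical prefix-to-AS table (assumption, for illustration only).
# In practice this would be built from a RouteViews BGP table dump.
PREFIX_TO_AS = {
    "192.0.2.0/24": 64500,
    "198.51.100.0/24": 64501,
    "198.51.0.0/16": 64502,
}


def build_table(prefix_to_as):
    """Parse prefixes and sort them so the most specific prefix comes first."""
    table = [(ipaddress.ip_network(p), asn) for p, asn in prefix_to_as.items()]
    table.sort(key=lambda entry: entry[0].prefixlen, reverse=True)
    return table


def lookup_as(table, client_ip):
    """Return the AS of the longest matching prefix, or None if none matches."""
    addr = ipaddress.ip_address(client_ip)
    for network, asn in table:
        if addr in network:
            return asn
    return None


table = build_table(PREFIX_TO_AS)
print(lookup_as(table, "198.51.100.7"))  # matched by both the /24 and the /16
```

A linear scan is enough for a sketch; a full routing table would call for a trie or similar structure.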


slide-7
SLIDE 7

How Bottleneck Classifier Works

[Figure: Rwin, Cwin, and BytesInSndBuf (KB) plotted against time (seconds)]

  • BytesInSndBuf = Rwin: Rwin limits sending. Client is bottleneck
  • Cwin drops greatly, with packet loss: Network path is bottleneck
  • Small initial Cwin: Slow start limits perf. Network stack is bottleneck
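
A minimal sketch of how such rules could be applied to two consecutive polls of one connection. The field names, the rule order, and the 50% "drops greatly" threshold are assumptions made for illustration; they are not the Web100 variable names or the classifier's actual thresholds.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """One poll of a connection (hypothetical field names)."""
    rwin: int              # receiver-advertised window, bytes
    cwin: int              # congestion window, bytes
    bytes_in_snd_buf: int  # bytes waiting in the send buffer
    new_losses: int        # packets lost since the previous poll


def classify(prev, cur, drop_ratio=0.5):
    """Label the interval ending at `cur` with the limiting component.

    drop_ratio is an assumed threshold: Cwin "drops greatly" when it falls
    to at most this fraction of its previous value.
    """
    # Packet loss together with a large Cwin drop: the network path limits.
    if cur.new_losses > 0 and cur.cwin <= prev.cwin * drop_ratio:
        return "network_path"
    window = min(cur.rwin, cur.cwin)
    # The application has not written enough data to fill the usable window.
    if cur.bytes_in_snd_buf < window:
        return "server_application"
    # Enough data is buffered and the receive window is the tighter limit.
    if cur.rwin <= cur.cwin:
        return "client"
    # Enough data is buffered but Cwin is still small (e.g., slow start).
    return "server_network_stack"


prev = Snapshot(rwin=65535, cwin=40000, bytes_in_snd_buf=50000, new_losses=0)
cur = Snapshot(rwin=65535, cwin=15000, bytes_in_snd_buf=50000, new_losses=3)
print(classify(prev, cur))
```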

slide-8
SLIDE 8

CoralCDN Experiment

  • CoralCDN serves 1 million clients per day
  • Experiment Environment
      • Deployment: A Clemson PlanetLab node
      • Polling interval: 50 ms
      • Traces to Show: Feb 19th – 25th 2011
      • Total # of Conn.: 209K
      • After removing Cache-Miss Conn.: 137K (2,008 ASes in total)
  • Log Space overhead
      • < 200MB per Coral server per day

slide-9
SLIDE 9

What Are the Major Bottlenecks for Individual Clients?

  • We calculate the fraction of its lifetime that each connection spends under each bottleneck

Bottleneck              % of Conn. with Bottleneck for >40% of Lifetime
Server Application      10.75%
Server Network Stack    18.72%
Network Path            3.94%
Clients                 1.27%

Reasons:
  • Server Application: Slow CPU or scarce disk resources of the PlanetLab node
  • Server Network Stack: Congestion window rises too slowly for short conn. (>80% of the connections last <1 second)
  • Network Path: Spotty network (discussed in next slide)
  • Clients: Receive buffer too small (most of them are <30KB)

Our suggestions:
  • Server Application: Use more powerful PlanetLab machines
  • Server Network Stack: Use larger initial congestion window
  • Clients: Filter them out of decision making
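
The per-connection metric can be sketched as follows, assuming every polling interval of a connection has already been given one bottleneck label and all intervals are equally long. The function names and label strings are hypothetical. A connection is counted for every bottleneck that covers more than 40% of its lifetime, so one connection may count twice.

```python
from collections import Counter


def lifetime_fractions(labels):
    """Fraction of a connection's polling intervals under each bottleneck."""
    total = len(labels)
    if total == 0:
        return {}
    return {b: n / total for b, n in Counter(labels).items()}


def share_of_connections(conn_labels, threshold=0.4):
    """Share of connections whose lifetime fraction under a bottleneck
    exceeds `threshold` (0.4 matches the >40% cut-off used here)."""
    over = Counter()
    for labels in conn_labels.values():
        for bottleneck, frac in lifetime_fractions(labels).items():
            if frac > threshold:
                over[bottleneck] += 1
    n = len(conn_labels)
    return {b: c / n for b, c in over.items()} if n else {}


# Toy input: connection id -> per-interval labels (made-up data).
conns = {
    "c1": ["client"] * 5 + ["network_path"] * 5,
    "c2": ["server_application"] * 10,
}
print(share_of_connections(conns))
```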

slide-10
SLIDE 10

AS-Level Correlation

  • CDNs make decisions at the AS level
      • e.g., change server selection for 1.1.1.0/24
  • Explore at the AS level:
      • Filter out non-network bottlenecks
      • Whether network problems exist
      • Whether the problem is consistent

slide-11
SLIDE 11

Filtering Out Non-Network Bottlenecks

  • CDNs change server selection if clients have low throughput
  • Non-network factors can limit throughput
  • 236 out of 505 low-throughput ASes limited by non-network bottlenecks
  • Filtering is helpful:
      • Don’t worry about things CDNs cannot control
      • Produce more accurate estimates of perf.

slide-12
SLIDE 12

Network Problem at AS Level

  • CDNs make decisions at the AS level
  • Do conn. in the same AS share a common network problem?
  • For 7.1% of the ASes, half of conn. have >10% packet loss rate
  • Network problems are significant at the AS level
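
A minimal sketch of the AS-level check behind the 7.1% figure, assuming each connection has already been reduced to an (AS number, packet loss rate) pair. The function name and the choice of "at least half" as the cut-off are assumptions.

```python
from collections import defaultdict


def lossy_ases(conns, loss_threshold=0.10, conn_fraction=0.5):
    """Return ASes where at least `conn_fraction` of the connections have a
    packet loss rate above `loss_threshold`.

    conns: iterable of (asn, loss_rate) pairs, loss_rate in [0, 1].
    """
    per_as = defaultdict(list)
    for asn, loss_rate in conns:
        per_as[asn].append(loss_rate > loss_threshold)
    return sorted(
        asn for asn, flags in per_as.items()
        if sum(flags) / len(flags) >= conn_fraction
    )


# Made-up sample: private-use AS numbers and loss rates.
sample = [(64500, 0.20), (64500, 0.15), (64500, 0.01), (64501, 0.02)]
print(lossy_ases(sample))
```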


slide-13
SLIDE 13

Consistent Packet Loss of AS

  • CDNs care about predictive value of measurement
  • Analyze the variance of average packet loss rates
      • Each epoch (1 min) has nonzero average loss rate
      • Loss rate is consistent across epochs (standard deviation < mean)

Analysis Length                     # of ASes with Consistent Packet Loss
One Week                            377 / 2008
One Day (Feb 21st)                  122 / 739
One Hour (Feb 21st 18:00~19:00)     19 / 121
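
The consistency criterion can be sketched as below for one AS, given its average loss rate per 1-minute epoch. The slides do not say whether the standard deviation is the population or the sample form; the population form is assumed here, and the function name is hypothetical.

```python
from statistics import mean, pstdev


def has_consistent_loss(epoch_loss_rates):
    """True if every epoch has a nonzero average loss rate and the
    standard deviation across epochs is smaller than the mean."""
    if not epoch_loss_rates:
        return False
    if any(rate <= 0 for rate in epoch_loss_rates):
        return False
    # Assumption: population standard deviation.
    return pstdev(epoch_loss_rates) < mean(epoch_loss_rates)


print(has_consistent_loss([0.05, 0.06, 0.04]))  # steady loss in every epoch
print(has_consistent_loss([0.05, 0.0, 0.04]))   # one loss-free epoch
```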

slide-14
SLIDE 14

Conclusion & Future Work

  • Use TCP-level stats to detect performance bottlenecks
  • Identify major bottlenecks for a production CDN
  • Discuss how to improve CDN’s operation with our tool
  • Future Work
      • Automatic and real-time analysis combined into CDN operation
      • Detect the problematic AS on the path
      • Combine TCP-level stats with application logs to debug online services

slide-15
SLIDE 15

Thanks! Questions?
