1

Tracking the Evolution of Web Traffic: 1995-2003

Félix Hernández-Campos, Kevin Jeffay, F. Donelson Smith

The University of North Carolina at Chapel Hill
Department of Computer Science
http://www.cs.unc.edu/Research/dirt

11th ACM/IEEE International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS)
Orlando, October 13th, 2003

2

Web Traffic Measurement and Analysis at UNC-Chapel Hill

  • In 1997, populating web traffic generators for experimental networking research motivated a large-scale study of web traffic at UNC with three goals:

  • Develop a light-weight methodology

– Based on passive measurement
– Easy to keep models up to date

  • Replace smaller-scale, quickly aging models

– Mah, 1995 data set
– Crovella et al., 1995 data set (revised with 1998 data)

  • Characterize the use of the HTTP protocol

– E.g., use of persistent connections

3

Web Traffic Measurement and Analysis at UNC-Chapel Hill

  • Our methodology and first results were published in SIGMETRICS/Performance’01

– What TCP/IP Protocol Headers Can Tell Us About the Web

  • Modeling aspect explored in a series of papers

– E.g., Variable Heavy Tails in Internet Traffic (with J.S. Marron)

» (Part I: Understanding Heavy Tails published in MASCOTS’02)

  • In this talk, I will describe our approach and our observations on the evolution of web traffic:

– Three data sets: 1999, 2001 and 2003
– Comparisons to Mah and Crovella et al.

4

Methodology
Study of Web Content Consumers

  • We studied a large collection of users (~35,000) as web content consumers

  • The only source of data for our study was packet header traces

– Anonymized IP addresses
– No HTTP headers

[Figure: web clients at the University of North Carolina at Chapel Hill send HTTP requests across the Internet to web servers, which return HTTP responses]


5

Methodology
One-Way Packet Header Traces

  • Only inbound TCP/IP headers are captured

– Eliminate synchronization and buffering issues on the NIC
– Reduce trace size

[Figure: a traffic monitor (tcpdump) on the Gigabit Ethernet link between web clients at the University of North Carolina at Chapel Hill and web servers on the Internet]

  • Trace collection: 2.7 TB of packet headers

– ~40 billion packets, ~16 TB of data transfers

6

Methodology
Processing Sequence Overview

[Figure: processing pipeline with the stages tcpdump, raw TCP/IP headers trace, Filter & Sort, TCP connections (port 80), connection-level analysis, HTTP req/rsp exchanges, client-level analysis, HTTP client behavior, and statistical analysis]
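
The "Filter & Sort" stage can be pictured with a minimal sketch. The slides do not show the tool that was actually used, so the `Header` record, its field names, and the function `group_http_connections` below are hypothetical. The sketch assumes each captured header has already been parsed into a record, keeps only segments whose source port is 80 (the server-to-client direction), and groups them by connection in timestamp order.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class Header(NamedTuple):
    """One parsed TCP/IP header (hypothetical record layout)."""
    ts: float       # capture timestamp in seconds
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    seqno: int
    ackno: int


ConnKey = Tuple[str, int, str, int]


def group_http_connections(headers: List[Header]) -> Dict[ConnKey, List[Header]]:
    """Keep segments sent from port 80 and group them per TCP connection."""
    conns: Dict[ConnKey, List[Header]] = defaultdict(list)
    for h in headers:
        if h.src_port != 80:
            continue  # not a server-to-client HTTP segment
        conns[(h.src_ip, h.src_port, h.dst_ip, h.dst_port)].append(h)
    for segments in conns.values():
        segments.sort(key=lambda h: h.ts)  # restore time order within a connection
    return dict(conns)
```

Each resulting list is the input that a connection-level analysis would walk through segment by segment.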

7

TCP/IP Headers and HTTP
Request/response Exchange

[Figure: packet timeline between a web client (UNC) and a web server (Internet). Handshake: SYN, SYN-ACK, ACK. The client sends the 304-byte HTTP request as DATA (seqno 305, ackno 1); the server replies with ACK (seqno 1, ackno 305). The server sends the 2875-byte HTTP response as DATA (seqno 1461, ackno 305) and DATA (seqno 2876, ackno 305); the client replies with ACK (seqno 305, ackno 2876). Teardown: FIN, FIN-ACK, FIN, FIN-ACK.]

8

TCP/IP Headers and HTTP
Server-to-client Segments Only

[Figure: the same exchange between a web client (UNC) and a web server (Internet), showing only the segments sent by the server: SYN-ACK (seqno 1, ackno 1); ACK (seqno 1, ackno 305); DATA (seqno 1461, ackno 305); DATA (seqno 2876, ackno 305); FIN; FIN-ACK. Annotations: "Ackno increased" for the 304-byte HTTP request, "Seqno increased" for the 2875-byte HTTP response.]


9

Methodology
Request/Response Traces

[Figure: web client and web server exchanging an HTTP request (304 bytes) and an HTTP response (2875 bytes); labels: "Computed", "Directly Observed"]

  • Unidirectional TCP/IP header traces are sufficient for capturing application-level behavior

10

Persistent Connections in HTTP
Example – TCP/IP Headers

[Figure: server-to-client segments of one persistent connection between a web client (UNC) and a web server (Internet):
SYN-ACK (seqno 1, ackno 1)
ACK (seqno 1, ackno 305): "Ackno increased", Request 1, 304 bytes
DATA (seqno 1461, ackno 305), DATA (seqno 2876, ackno 305): "Seqno increased", Response 1, 2875 bytes
ACK (seqno 2876, ackno 567): "Ackno increased", Request 2, 262 bytes
DATA (seqno 4336, ackno 567), DATA (seqno 5796, ackno 567), DATA (seqno 6341, ackno 567): Response 2, 3465 bytes
FIN, FIN-ACK]
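
The persistent-connection example lends itself to a short sketch of how request and response sizes might be recovered from server-to-client headers alone. This is an illustration under stated assumptions, not the authors' tool: the function name `infer_exchanges` is hypothetical, the (seqno, ackno) pairs are taken as labeled on the slide (relative numbers, with seqno giving the position after the segment's payload), and retransmissions, reordering, pipelined requests, and sequence-number wraparound are ignored. An increase in ackno is read as request bytes, an increase in seqno as response bytes, and an ackno increase after response data marks the start of a new exchange.

```python
from typing import List, Tuple


def infer_exchanges(segments: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Return (request_bytes, response_bytes) per exchange.

    `segments` holds (seqno, ackno) for each server-to-client segment in
    arrival order, starting with the SYN-ACK that sets the baseline.
    """
    exchanges: List[Tuple[int, int]] = []
    if not segments:
        return exchanges
    last_seq, last_ack = segments[0]
    req = rsp = 0
    for seq, ack in segments[1:]:
        if ack > last_ack:
            if rsp > 0:
                # New request data after a response: close the previous exchange.
                exchanges.append((req, rsp))
                req = rsp = 0
            req += ack - last_ack  # bytes the server has newly acknowledged
            last_ack = ack
        if seq > last_seq:
            rsp += seq - last_seq  # bytes the server has newly sent
            last_seq = seq
    if req > 0 or rsp > 0:
        exchanges.append((req, rsp))
    return exchanges


# Segment labels from the persistent-connection slide.
slide_segments = [
    (1, 1), (1, 305), (1461, 305), (2876, 305),
    (2876, 567), (4336, 567), (5796, 567), (6341, 567),
]
print(infer_exchanges(slide_segments))  # [(304, 2875), (262, 3465)]
```

The output matches the sizes annotated on the slide: 304 and 262 bytes for the two requests, 2875 and 3465 bytes for the two responses.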

11

Sizes of HTTP Requests
Empirical CDFs

[Figure: cumulative probability vs. size in bytes for the 1999 and 2003 data sets; annotated values: 0.64 and 0.82]

12

Sizes of HTTP Requests
Empirical CCDFs

[Figure: complementary cumulative probability vs. no. of bytes; annotations: 0.3e-5, 1 MB, 0.0003%, 296 requests, 2.7% of bytes]
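
The CDF and CCDF plots, and tail annotations such as the share of requests and bytes beyond 1 MB, can be computed from a list of sizes as in the minimal sketch below. The function names and the sample sizes are hypothetical; the slides do not say which tools produced the plots.

```python
from bisect import bisect_right
from typing import Sequence


def ecdf(samples: Sequence[float], x: float) -> float:
    """Empirical CDF: fraction of samples that are <= x."""
    ordered = sorted(samples)
    return bisect_right(ordered, x) / len(ordered)


def eccdf(samples: Sequence[float], x: float) -> float:
    """Empirical CCDF: fraction of samples that are > x."""
    return 1.0 - ecdf(samples, x)


def byte_share_above(samples: Sequence[float], x: float) -> float:
    """Fraction of all bytes carried by samples larger than x."""
    return sum(s for s in samples if s > x) / sum(samples)


# Hypothetical request sizes in bytes: four small ones and one very large one.
sizes = [100, 200, 300, 400, 1_000_000]
tail_fraction = eccdf(sizes, 400)            # share of requests above 400 bytes
tail_bytes = byte_share_above(sizes, 400)    # share of bytes those requests carry
```

With a heavy-tailed sample like this one, a small fraction of the requests carries most of the bytes, which is why the tail is plotted separately on a CCDF.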


13

Response Sizes
Comparison with Earlier Studies

[Figure: cumulative probability vs. size in bytes; annotation: LogNormal fits]

14

Response Sizes
Comparison with Earlier Studies

[Figure: complementary cumulative probability vs. no. of bytes; annotations: Pareto fits (two), systematic wobbles]
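
The slides mention lognormal fits for the body of the response-size distribution and Pareto fits for its tail, but not how those fits were obtained. As one possible illustration, the sketch below uses standard maximum-likelihood estimators; the function names and the tail threshold `x_min` are assumptions, not taken from the talk.

```python
import math
from typing import Sequence, Tuple


def fit_lognormal(sizes: Sequence[float]) -> Tuple[float, float]:
    """MLE of (mu, sigma) for a lognormal: mean and std. dev. of log(size)."""
    logs = [math.log(s) for s in sizes]
    mu = sum(logs) / len(logs)
    variance = sum((v - mu) ** 2 for v in logs) / len(logs)
    return mu, math.sqrt(variance)


def fit_pareto_tail(sizes: Sequence[float], x_min: float) -> float:
    """MLE of the Pareto shape alpha using only the samples >= x_min."""
    tail = [s for s in sizes if s >= x_min]
    log_sum = sum(math.log(s / x_min) for s in tail)
    if log_sum <= 0:
        raise ValueError("need at least one sample strictly above x_min")
    return len(tail) / log_sum
```

A smaller alpha means a heavier tail; for alpha below 2 the fitted distribution has infinite variance. A single Pareto shape cannot follow deviations such as the "systematic wobbles" noted on the slide.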

15

Page Identification Heuristic
Two TCP Connections Example

[Figure: client/server timelines for two TCP connections; a top-level object followed by embedded objects forms Page 1, and a quiet time of more than 1 second separates Page 1 from Page 2]
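
The page identification heuristic can be sketched as follows. The slide gives only the 1-second quiet-time threshold; the details here are assumptions: each exchange of a single client is reduced to a hypothetical (request start, response end) pair, exchanges from all of that client's connections are pooled, and quiet time is measured from the end of the latest activity to the start of the next request.

```python
from typing import List, Optional, Tuple

Exchange = Tuple[float, float]  # (request_start, response_end), in seconds


def split_into_pages(exchanges: List[Exchange],
                     quiet_time: float = 1.0) -> List[List[Exchange]]:
    """Group one client's exchanges into pages separated by quiet periods."""
    pages: List[List[Exchange]] = []
    busy_until: Optional[float] = None
    for start, end in sorted(exchanges):
        if busy_until is None or start - busy_until > quiet_time:
            pages.append([(start, end)])  # first exchange: top-level object
            busy_until = end
        else:
            pages[-1].append((start, end))  # embedded object of the current page
            busy_until = max(busy_until, end)
    return pages


# Three overlapping exchanges, then a request after more than 1 s of silence.
pages = split_into_pages([(0.0, 0.3), (0.4, 0.9), (0.5, 0.7), (3.0, 3.2)])
print([len(p) for p in pages])  # [3, 1]
```

The length of each page list is the "objects per page" quantity, and the number of pages per client address is the "page requests per IP address" quantity.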

16

Objects Per Page
Comparison with Earlier Studies

[Figure: cumulative probability vs. no. of objects (HTTP exchanges)]

17

Objects Per Page
Comparison with Earlier Studies

[Figure: complementary cumulative probability vs. no. of objects (HTTP exchanges)]

18

Page Requests Per IP Address

[Figure: cumulative probability vs. no. of page requests]

19

Sampling Issues

  • Questions:

– Can we obtain a sufficiently large sample with a small number of short traces?
– How does the length of the tracing interval affect the overall empirical distribution shapes?
– Should we include in the empirical distributions the data from incomplete TCP connections?

  • Approach:

– Examine a wide range of trace lengths

» 4 h., 2 h., 1 h., 30 min., 15 min., 5 min. and 90 sec.

– Construct datasets by sub-sampling the 21 4-hour-long traces collected in 2001
– E.g., remove first and last hour of each trace to produce 21 2-hour-long traces
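
The sub-sampling step can be sketched as below. The slide gives only the 2-hour example (drop the first and last hour); trimming equal amounts from both ends for every shorter length is an assumption, as are the record layout and the function name `subsample_center`.

```python
from typing import List, Tuple

Record = Tuple[float, int]  # (seconds since trace start, response size in bytes)


def subsample_center(records: List[Record],
                     full_length: float,
                     target_length: float) -> List[Record]:
    """Keep the records that fall in the central `target_length` seconds."""
    if not 0 < target_length <= full_length:
        raise ValueError("target_length must be in (0, full_length]")
    trim = (full_length - target_length) / 2.0  # removed from each end
    lo, hi = trim, full_length - trim
    return [r for r in records if lo <= r[0] < hi]


FOUR_HOURS = 4 * 3600
trace = [(0, 10), (3599, 20), (3600, 30), (10799, 40), (10800, 50), (14399, 60)]
two_hour = subsample_center(trace, FOUR_HOURS, 2 * 3600)
print(two_hour)  # [(3600, 30), (10799, 40)]
```

Applying the same function with target lengths from 1 hour down to 90 seconds to each of the 21 traces would yield the family of data sets whose distributions the following slides compare.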

20

Sampling Issues
Impact of Tracing Interval Length

[Figure: cumulative probability vs. response size in bytes]


21

Sampling Issues
Impact of Tracing Interval Length

[Figure: complementary cumulative probability vs. response size in bytes]

22

Sampling Issues
Impact of Tracing Interval Length

[Figure: cumulative probability vs. no. of pages per client IP address]

23

Sampling Issues
Impact of Partially-Captured Objects

[Figure: cumulative probability vs. response size in bytes]

24

Sampling Issues
Impact of Partially-Captured Objects

[Figure: complementary cumulative probability vs. no. of bytes]

25

Summary and Conclusions
Web Traffic Characterization

  • New data to populate traffic generators

– Request sizes
– Response sizes
– Use of persistent connections
– ...

  • 1-hour-long traces are sufficient to capture application-level behavior

– Short traces cut off large objects, which skews the tails of the distributions

  • Persistent Connections:

– ~15% of all the HTTP connections
– 40-50% of all the transferred HTTP bytes
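
Statistics like the two persistent-connection figures could be derived from per-connection exchange lists as in the sketch below. The slides do not define how a connection was classified, so treating a connection as persistent when it carries more than one request/response exchange is an assumption, as are the function name and the sample data. Bytes are counted as request plus response bytes.

```python
from typing import List, Tuple

Connection = List[Tuple[int, int]]  # (request_bytes, response_bytes) per exchange


def persistent_share(connections: List[Connection]) -> Tuple[float, float]:
    """Return (fraction of connections, fraction of bytes) that are persistent."""
    total_bytes = sum(req + rsp for conn in connections for req, rsp in conn)
    if not connections or total_bytes == 0:
        raise ValueError("need at least one connection that carried data")
    persistent = [conn for conn in connections if len(conn) > 1]
    persistent_bytes = sum(req + rsp for conn in persistent for req, rsp in conn)
    return len(persistent) / len(connections), persistent_bytes / total_bytes


# Hypothetical sample: one persistent connection and three single-exchange ones.
sample = [
    [(304, 2875), (262, 3465)],
    [(300, 700)],
    [(100, 900)],
    [(200, 800)],
]
conn_share, byte_share = persistent_share(sample)
```

In this made-up sample a quarter of the connections carry most of the bytes. A minority of connections carrying a disproportionate share of bytes is the pattern the summary reports.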