Why is Internet traffic self-similar? Allen B. Downey, Wellesley College



slide-1
SLIDE 1

No Micro$oft products were used in the preparation of this talk.

Why is Internet traffic self-similar?

Allen B. Downey Wellesley College

slide-2
SLIDE 2

What is self-similarity?

Real-world: visually similar over a range of spatial scales.
Fractals: geometrically similar over all spatial scales.
Time-series: statistically similar over a range of time scales.

slide-3
SLIDE 3

Network traffic

[Figure: packets per unit time plotted against time, in a series of panels at time scales labeled 10s, 100s, 17m, 2.8h, and 28h.]

Ethernet and WAN traffic appear self-similar. [WillingerEtAl95]
x = time in varying units
y = packets / unit time
Visual self-similarity over 5 orders of magnitude!

slide-4
SLIDE 4

Explanatory models

[Diagram: boxes for the system, its behavior, the model, and the model's behavior, connected by arrows labeled abstraction, derivation, verification, and explanation.]

Abstraction: is it realistic?
Derivation: is it correct?
Verification: is the behavior the same?
Explanation: does this really explain?

slide-5
SLIDE 5

Ideal gas law explained

Abstraction: no interaction, elastic collision, etc.
Derivation: you do the math (or simulation).
Verification: most gas, most of the time.

slide-6
SLIDE 6

Explanations of self-similarity

[Diagram: nodes labeled Internet, ON/OFF model, M/G/infinity model, fractional Gaussian noise, asymptotic self-similarity, and empirical self-similarity.]

Verification

FGN (fractional Gaussian noise) is self-similar. ASY (asymptotic self-similarity) isn’t, but it can pass.

Abstraction

Two aggregation models.
Long-tailed distribution of file sizes.

slide-7
SLIDE 7

Distribution of file sizes

Is it long-tailed? If so, why?

slide-8
SLIDE 8

Cumulative distributions

[Figure: Normal cdf. x axis: File size (bytes); y axis: Prob {file size < x}. Curve: normal.]

x = range of values
y = Prob {value < x}
cdf maps values to percentiles
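A cdf in this sense can be computed directly from a list of observed values. The sketch below is a minimal illustration, not code from the talk; the function name `empirical_cdf` and the sample sizes are made up for the example.

```python
from bisect import bisect_left


def empirical_cdf(values):
    """Return a function F where F(x) is the fraction of values strictly below x."""
    ordered = sorted(values)
    n = len(ordered)

    def cdf(x):
        # bisect_left counts the entries that are strictly less than x.
        return bisect_left(ordered, x) / n

    return cdf


# Hypothetical file sizes in bytes, for illustration only.
sizes = [120, 512, 900, 2048, 4096, 4096, 10000, 65536]
F = empirical_cdf(sizes)
print(F(4096))    # fraction of files smaller than 4096 bytes
print(F(100000))  # every file is smaller than this
```

Reading the result as a percentile: F(x) = 0.5 means x sits at the 50th percentile of the observed sizes.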

slide-9
SLIDE 9

Skewed distributions

[Figure: Skewed cdfs. x axis: File size (bytes); y axis: Prob {file size < x}. Curves: normal, skewed, lognormal, pareto.]

Normal distribution is symmetric.
Skewed has many small values and some large.
Lognormal even more skewed.
Pareto even more skewed.

slide-10
SLIDE 10

Logarithmic x axis

[Figure: Skewed cdfs. x axis: File size (bytes), linear scale; y axis: Prob {file size < x}. Curves: normal, skewed, lognormal, pareto.]

[Figure: Skewed cdfs, log x axis. x axis: File size (bytes), log scale from 1 byte to 1MB; y axis: Prob {file size < x}. Curves: normal, skewed, lognormal, pareto.]

slide-11
SLIDE 11

Log-log axes

[Figure: Skewed cdfs, log x axis. x axis: File size (bytes), log scale from 1 byte to 1MB; y axis: Prob {file size < x}. Curves: normal, skewed, lognormal, pareto.]

Complementary cdf: Prob {value > x}
Log y axis amplifies tail behavior.
Pareto distribution is a straight line.

[Figure: Skewed cdfs, log-log axes. x axis: File size (bytes), log scale from 1 byte to 1MB; y axis: Prob {file size > x}, log scale from 1 down to 1/1024. Curves: normal, skewed, lognormal, pareto.]
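
Why the Pareto distribution is a straight line on log-log axes can be seen from its complementary cdf. The notation here is the standard textbook one and does not appear on the slides: x_m is the minimum value and α is the tail index.

```latex
\Pr\{X > x\} = \left(\frac{x}{x_m}\right)^{-\alpha}, \qquad x \ge x_m
\quad\Longrightarrow\quad
\log \Pr\{X > x\} = -\alpha \log x + \alpha \log x_m
```

The right-hand side is linear in log x with slope -α, so the ccdf is a straight line when both axes are logarithmic.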

slide-12
SLIDE 12

Evidence of long tails

[Figure: Process lifetimes, log-log axes. x axis: Duration (seconds), 0.001 to 1000; y axis: Prob {lifetime > x}, 1 down to 0.00001. Curves: Pareto model, actual cdf.]

Is long-tailedness an empirical property?
Long-tailed dist converges to Pareto.
How do we know it keeps going?

slide-13
SLIDE 13

File sizes in the WWW

[Figure: File Sizes from Crovella dataset, log-log axes. x axis: File size (bytes), 1 byte to 32MB; y axis: Prob {file size > x}, 1 down to 1/16384. Curves: Pareto model, actual cdf.]

[Figure: File Sizes from NASA dataset, log-log axes. x axis: File size (bytes), 1 byte to 32MB; y axis: Prob {file size > x}, 1 down to 1/16384. Curves: Pareto model, actual cdf.]

slide-14
SLIDE 14

Where we are

Some empirical evidence of long-tailed distributions.
Explanatory model for WWW files. [CarlsonDoyle99]
No explanation for other file systems.

slide-15
SLIDE 15

Explanatory model

Goal: Model of user behavior that produces long-tailed distributions.

Hypothesis:
Most new files are copies of old files.
Many new files are translations of old files.
New size is a small multiple of the old size.

slide-16
SLIDE 16

User Model

Model:
Choose an existing file at random.
Choose a small multiplier at random.
new file size = old file size * multiplier
Repeat.

Two parameters:
Initial file size.
Variability of multipliers.
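
The user model can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the simulation used for the talk: the slides do not say how multipliers are drawn, so the sketch assumes lognormal multipliers with median 1, and the names `simulate_file_sizes`, `initial_size`, and `sigma` are hypothetical stand-ins for the two parameters.

```python
import random


def simulate_file_sizes(n_files, initial_size=1024.0, sigma=1.0, seed=0):
    """Grow a file system by repeatedly copying a random file and rescaling it.

    initial_size: size of the single starting file (first model parameter).
    sigma: spread of the log multipliers (second model parameter).
    ASSUMPTION: multipliers are lognormal with median 1; the talk only
    says "a small multiplier at random".
    """
    rng = random.Random(seed)
    sizes = [initial_size]
    while len(sizes) < n_files:
        old_size = rng.choice(sizes)                 # choose an existing file at random
        multiplier = rng.lognormvariate(0.0, sigma)  # choose a multiplier at random
        sizes.append(old_size * multiplier)          # new size = old size * multiplier
    return sizes


sizes = simulate_file_sizes(10000)
ordered = sorted(sizes)
print("median size:", ordered[len(ordered) // 2])
print("largest size:", ordered[-1])
```

Sorting the result and plotting rank against size on a log x axis gives a cdf that can be compared with a measured one.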

slide-17
SLIDE 17

Simulation of user model

[Figure: Distribution of File Sizes. x axis: File size (bytes), log scale from 1 byte to 32MB; y axis: Prob {file size < x}. Curves: cdf from simulation, actual cdf.]

89,000 files on rocky.wellesley.edu
Choose parameters to fit the distribution.
Fits pretty good!
Analytic form?

slide-18
SLIDE 18

Continuous model

Replace discrete file sizes with continuous.
Simulation computes numerical solution of diffusion equation.
Solution of PDE yields analytic model of the distribution.
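
A simplified way to see why a diffusion-like model arises, offered as a sketch rather than the derivation from the talk: taking logs turns the multiplicative update into an additive one. The symbols s_0 (initial size), m_i (the i-th multiplier along a chain of copies), μ and σ² (mean and variance of log m_i) are introduced here for illustration.

```latex
\log s_{\text{new}} = \log s_{\text{old}} + \log m
\quad\Longrightarrow\quad
\log s_k = \log s_0 + \sum_{i=1}^{k} \log m_i
\;\approx\; \mathcal{N}\!\left(\log s_0 + k\mu,\; k\sigma^2\right)
```

For a file that is k copies removed from the original, the log size is a sum of k independent steps, so it is approximately normal for large k (exactly normal if the log multipliers are normal). A variance that grows linearly with the number of steps is the behavior a diffusion equation describes.
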
slide-19
SLIDE 19

Solve that PDE!

[Figure: Simulation evolution. x axis: File size (bytes), log scale from 1 byte to 32MB; y axis: Prob {file size < x}. Curves: 10 files, 1000 files, 100000 files.]

Distribution of file sizes is normal on a log-x axis: LOGNORMAL.

slide-20
SLIDE 20

Estimate those parameters!

[Figure: File Sizes, Irlam dataset. x axis: File size (bytes), log scale from 1 byte to 32MB; y axis: Prob {file size < x}. Curves: lognormal model, actual cdf.]

Irlam collected file sizes from 500+ systems.
Using the analytic model we can estimate parameters.
Goodness of fit: Kolmogorov-Smirnov statistic.
Range: 1.4 to 40
Median: 8.0

slide-21
SLIDE 21

Oh, no!

[Figure: Skewed cdfs, log-log axes. x axis: File size (bytes), log scale from 1 byte to 1MB; y axis: Prob {file size > x}, log scale from 1 down to 1/1024. Curves: normal, skewed, lognormal, pareto.]

The lognormal distribution is not long-tailed.
Under either aggregation model, lognormal file sizes yield self-similarity over a range of time scales, but not true self-similarity.
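
The difference between the two tails can be checked numerically by measuring the slope of the ccdf on log-log axes. The sketch below uses made-up parameter values (`MU`, `SIGMA`, `X_MIN`, `ALPHA`) chosen only for illustration: the Pareto slope stays constant, while the lognormal slope keeps getting steeper as x grows.

```python
import math


def lognormal_ccdf(x, mu, sigma):
    """Prob {value > x} for a lognormal with log-mean mu and log-std sigma."""
    z = (math.log(x) - mu) / (sigma * math.sqrt(2.0))
    return 0.5 * math.erfc(z)


def pareto_ccdf(x, x_min, alpha):
    """Prob {value > x} for a Pareto with minimum x_min and tail index alpha."""
    return (x / x_min) ** (-alpha) if x >= x_min else 1.0


def loglog_slope(ccdf, x, factor=2.0):
    """Slope of log ccdf against log x, measured between x and factor * x."""
    return (math.log(ccdf(x * factor)) - math.log(ccdf(x))) / math.log(factor)


# Hypothetical parameters, not fitted to any dataset from the talk.
MU, SIGMA = 9.0, 2.0
X_MIN, ALPHA = 1000.0, 1.0

points = [1e4, 1e5, 1e6, 1e7]
lognormal_slopes = [
    loglog_slope(lambda v: lognormal_ccdf(v, MU, SIGMA), x) for x in points
]
pareto_slopes = [
    loglog_slope(lambda v: pareto_ccdf(v, X_MIN, ALPHA), x) for x in points
]

for x, ln_s, pa_s in zip(points, lognormal_slopes, pareto_slopes):
    print(f"x = {x:.0e}  lognormal slope = {ln_s:.2f}  pareto slope = {pa_s:.2f}")
```

Over a limited range of x the lognormal slope changes slowly, which is why it can resemble a straight line there.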

slide-22
SLIDE 22

Tail behavior?

[Figure: File Sizes from Crovella dataset, log-log axes. x axis: File size (bytes), 1 byte to 32MB; y axis: Prob {file size > x}, 1 down to 1/16384. Curves: Pareto model, lognormal model, actual cdf.]

[Figure: File Sizes from NASA dataset, log-log axes. x axis: File size (bytes), 1 byte to 32MB; y axis: Prob {file size > x}, 1 down to 1/16384. Curves: Pareto model, lognormal model, actual cdf.]

To explain self-similarity, we only need a Pareto tail.
Log-log ccdf amplifies tail.
Which model is better?

slide-23
SLIDE 23

Theory choice

Kuhn’s criteria:
Accuracy
Scope
Consistency
Simplicity
Fruitfulness

One more criterion:
Explanatory model

slide-24
SLIDE 24

Lognormal vs. Pareto

Accuracy and Scope

Diffusion model fits the bulk of the distribution. Pareto model sometimes fits the tail better.

Consistency

Diffusion model undermines self-sim explanation.

Simplicity

Pick ’em.

Fruitfulness

Long-tailed distributions are a nightmare for modelers.

Explanatory model

Carlson and Doyle only explain Web files. I think the diffusion model is more realistic.

slide-25
SLIDE 25

Trade simplicity for accuracy

[Figure: File Sizes from Crovella98, log-log axes. x axis: File size (bytes), 1 byte to 32MB; y axis: Prob {file size > x}, 1 down to 1/16384. Curves: lognormal model, actual cdf.]

What if the primordial soup contained two files?
Multimodal (5-parameter) lognormal model.
Accuracy and complexity comparable to Crovella’s hybrid model.

slide-26
SLIDE 26

Is Internet traffic really self-similar?

What seems to be an empirical question depends on theory choice.
Theory choice is not determined (entirely) by evidence.

[Diagram: nodes labeled Pareto tail, lognormal, other, ON/OFF model, M/G/infinity model, asymptotic self similarity, pseudo self similarity, and fractional gaussian noise (twice).]

slide-27
SLIDE 27

Where does that leave us?

Realist:

There is a real world and we are capable of knowing about it.
Rational theory choice is capable of selecting the right theory.
The Internet either is or is not really self-similar.

Instrumentalist:

Agnostic about the real world.
Our theories are tools that either work or not.
If it’s useful to model the Internet as self-similar, go ahead.

Other flavors of anti-realist.

slide-28
SLIDE 28

Long-tailed marmot?