Why is Internet traffic self-similar?
Allen B. Downey, Wellesley College
No Micro$oft products were used in the preparation of this talk.
What is self-similarity?
Real-world: visually similar over range of spatial scales. Fractals: geometrically similar over all spatial scales. Time-series: statistically similar over range of time scales.
Network traffic
[Figure: packet counts per unit time plotted at five time scales; extracted labels: 17m, 100s, 2.8h, 28h, 10s.]
Ethernet and WAN traffic appear self-similar. [WillingerEtAl95] x = time in varying units; y = packets / unit time. Visual self-similarity over 5 orders of magnitude!
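The visual comparison across time scales can be made numerical. The sketch below uses a variance-time check, which is a standard technique and not necessarily the one used in the talk: average the packet counts over blocks of m samples and see how fast the variance of the block means falls as m grows. The function names and the synthetic input are illustrative assumptions.

```python
import math
import random

def aggregate(counts, m):
    """Average consecutive blocks of m samples (the ragged tail is dropped)."""
    n_blocks = len(counts) // m
    return [sum(counts[i * m:(i + 1) * m]) / m for i in range(n_blocks)]

def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def variance_time_slope(counts, scales):
    """Least-squares slope of log(variance of block means) against log(m)."""
    xs = [math.log(m) for m in scales]
    ys = [math.log(variance(aggregate(counts, m))) for m in scales]
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den

# Synthetic input (an assumption): independent counts, no long-range dependence.
rng = random.Random(0)
independent = [rng.random() for _ in range(2 ** 16)]
slope = variance_time_slope(independent, [1, 2, 4, 8, 16, 32, 64, 128, 256])
print(round(slope, 2))
```

For independent counts the slope comes out near -1. A self-similar series with Hurst parameter H between 0.5 and 1 decays more slowly, with slope 2H - 2, which is why it still looks bursty after heavy aggregation.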
Explanatory models
[Diagram: boxes labeled System, Behavior, System Model, and Behavior Model, linked by arrows labeled abstraction, derivation, verification, and explanation.]
Abstraction: is it realistic? Derivation: is it correct? Verification: is the behavior the same? Explanation: does this really explain?
Ideal gas law explained
Abstraction: no interaction, elastic collision, etc. Derivation: you do the math (or simulation). Verification: most gas, most of the time.
Verification
Fractional Gaussian noise (FGN) is self-similar. An asymptotically self-similar process (ASY) isn't, but it can pass verification.
Explanations of self-similarity
[Diagram labels: Internet, empirical self-similarity, ON/OFF model, M/G/infinity model, fractional Gaussian noise, asymptotic self-similarity.]
Abstraction
Two aggregation models Long-tailed distribution of file sizes
Distribution of file sizes
Is it long-tailed? If so, why?
Cumulative distributions
[Figure: Normal cdf. x axis: File size (bytes); y axis: Prob {file size < x}; curve: normal.]
x = range of values; y = Prob {value < x}. The cdf maps values to percentiles.
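A minimal sketch of how such a cdf is built from data, following the definition on this slide (fraction of values strictly less than x). The helper name `make_cdf` and the sample sizes are made up for illustration.

```python
from bisect import bisect_left

def make_cdf(values):
    """Return a function x -> fraction of values strictly less than x."""
    ordered = sorted(values)
    n = len(ordered)

    def cdf(x):
        # bisect_left counts the elements that are strictly below x.
        return bisect_left(ordered, x) / n

    return cdf

file_sizes = [120, 300, 300, 4096, 10000]  # made-up sizes in bytes
cdf = make_cdf(file_sizes)
print(cdf(300))    # 0.2: one of five values is below 300
print(cdf(5000))   # 0.8: four of five values are below 5000
```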
Skewed distributions
[Figure: Skewed cdfs. x axis: File size (bytes); y axis: Prob {file size < x}; curves: normal, skewed, lognormal, pareto.]
The normal distribution is symmetric. A skewed distribution has many small values and some large ones. The lognormal is even more skewed. The Pareto is more skewed still.
Logarithmic x axis
[Figure: Skewed cdfs, log x axis. x axis: File size (bytes), log scale from 1 to 1MB; y axis: Prob {file size < x}; curves: normal, skewed, lognormal, pareto.]
Log-log axes
[Figure: Skewed cdfs, log-log axes. x axis: File size (bytes), log scale from 1 to 1MB; y axis: Prob {file size > x}, log scale from 1 to 1/1024; curves: normal, skewed, lognormal, pareto.]
Complementary cdf: Prob {value > x}. Log y axis amplifies tail behavior. Pareto distribution is a straight line.
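Why the Pareto distribution plots as a straight line on log-log axes can be seen from its complementary cdf. The symbols below are assumptions introduced here, not taken from the slides: x_m is the minimum value and alpha is the shape parameter.

```latex
\begin{align*}
P(X > x) &= \left(\frac{x}{x_m}\right)^{-\alpha}, \qquad x \ge x_m, \\
\log P(X > x) &= -\alpha \log x + \alpha \log x_m .
\end{align*}
```

The second line is linear in log x with slope -alpha, so a smaller alpha means a shallower line and a heavier tail.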
Evidence of long tails
[Figure: Process lifetimes, log-log axes. x axis: Duration (seconds), 0.001 to 1000; y axis: Prob {lifetime > x}, 1 down to 0.00001; curves: Pareto model, actual cdf.]
Is long-tailedness an empirical property? A long-tailed distribution converges to Pareto. How do we know it keeps going?
File sizes in the WWW
[Figure: File Sizes from Crovella dataset, log-log axes. x axis: File size (bytes), 1 to 32MB; y axis: Prob {file size > x}, 1 down to 1/16384; curves: Pareto model, actual cdf.]
[Figure: File Sizes from NASA dataset, log-log axes. x axis: File size (bytes), 1 to 32MB; y axis: Prob {file size > x}, 1 down to 1/16384; curves: Pareto model, actual cdf.]
Where we are
Some empirical evidence of long-tailed distributions.
Explanatory model for WWW files. [CarlsonDoyle99] No explanation for other file systems.
Explanatory model
Goal: Model of user behavior that produces long-tailed distributions. Hypothesis: Most new files are copies of old files. Many new files are translations of old files. New size is a small multiple of the old size.
User Model
Model: Choose an existing file at random. Choose a small multiplier at random. New file size = old file size * multiplier. Repeat. Two parameters: initial file size and variability of multipliers.
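The steps of the user model can be sketched as a short simulation. This is an illustration under stated assumptions, not the code behind the talk: the slides do not say how multipliers are drawn, so the sketch assumes lognormal multipliers centered on 1, with `sigma` standing in for the "variability of multipliers" parameter. The function name and default values are hypothetical.

```python
import random

def simulate_file_sizes(n_files, initial_size=1024.0, sigma=1.0, seed=0):
    """Grow a file system by repeatedly copying and rescaling existing files."""
    rng = random.Random(seed)
    sizes = [initial_size]                 # start from a single file
    while len(sizes) < n_files:
        old_size = rng.choice(sizes)       # choose an existing file at random
        # Assumption: multiplier is lognormal with median 1 and spread sigma.
        multiplier = rng.lognormvariate(0.0, sigma)
        sizes.append(old_size * multiplier)
    return sizes

sizes = simulate_file_sizes(1000)
print(len(sizes), min(sizes), max(sizes))
```

Fitting the two parameters to a real file system then amounts to adjusting `initial_size` and `sigma` until the simulated cdf matches the observed one.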
Simulation of user model
[Figure: Distribution of File Sizes, log x axis. x axis: File size (bytes), 1 to 32MB; y axis: Prob {file size < x}; curves: cdf from simulation, actual cdf.]
89,000 files on rocky.wellesley.edu. Choose parameters to fit the distribution. Fits pretty good! Analytic form?
Continuous model
Replace discrete file sizes with continuous. Simulation computes numerical solution of diffusion equation. Solution of PDE yields analytic model of the distribution.
Solve that PDE!
[Figure: Simulation evolution, log x axis. x axis: File size (bytes), 1 to 32MB; y axis: Prob {file size < x}; curves: 10 files, 1000 files, 100000 files.]
Distribution of file sizes is normal on a log-x axis: LOGNORMAL.
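A heuristic for why a multiplicative model leads to a lognormal, offered as a sketch that does not replace the diffusion-equation argument. The notation is an assumption introduced here: s_0 is the initial file size, m_i are the multipliers along one chain of copies, and s_k is the size of a file that is k copies removed from the original.

```latex
\begin{align*}
s_k &= s_0 \, m_1 m_2 \cdots m_k, \\
\log s_k &= \log s_0 + \sum_{i=1}^{k} \log m_i .
\end{align*}
```

If the terms log m_i are independent with finite variance, their sum is approximately normal for large k by the central limit theorem, so log s_k is approximately normal and s_k is approximately lognormal. Different files sit at different depths k, which is one reason a fuller treatment is needed.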
Estimate those parameters!
[Figure: File Sizes, Irlam dataset, log x axis. x axis: File size (bytes), 1 to 32MB; y axis: Prob {file size < x}; curves: lognormal model, actual cdf.]
Irlam collected file sizes from 500+ systems. Using the analytic model we can estimate parameters. Goodness of fit: Kolmogorov-Smirnov statistic. Range: 1.4 to 40. Median: 8.0.
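A minimal sketch of estimating lognormal parameters and scoring the fit with the Kolmogorov-Smirnov statistic. The estimator is an assumption: it uses the mean and standard deviation of the log sizes, which may differ from the procedure used in the talk. The statistic is returned as a fraction between 0 and 1, and the synthetic sample stands in for real file sizes.

```python
import math
import random

def fit_lognormal(sizes):
    """Estimate (mu, sigma) as the mean and std of log sizes; zero sizes are skipped."""
    logs = [math.log(s) for s in sizes if s > 0]
    mu = sum(logs) / len(logs)
    var = sum((v - mu) ** 2 for v in logs) / len(logs)
    return mu, math.sqrt(var)

def lognormal_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((math.log(x) - mu) / (sigma * math.sqrt(2))))

def ks_statistic(sizes, mu, sigma):
    """Largest vertical gap between the empirical cdf and the model cdf."""
    xs = sorted(s for s in sizes if s > 0)
    n = len(xs)
    gap = 0.0
    for i, x in enumerate(xs):
        model = lognormal_cdf(x, mu, sigma)
        # The empirical cdf jumps from i/n to (i+1)/n at x; check both sides.
        gap = max(gap, abs(model - i / n), abs((i + 1) / n - model))
    return gap

# Synthetic stand-in for a real file system (parameters are made up).
rng = random.Random(1)
sample = [rng.lognormvariate(8.0, 2.0) for _ in range(2000)]
mu, sigma = fit_lognormal(sample)
d = ks_statistic(sample, mu, sigma)
print(mu, sigma, d)
```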
Oh, no!
[Figure: Skewed cdfs, log-log axes. x axis: File size (bytes), 1 to 1MB; y axis: Prob {file size > x}, 1 down to 1/1024; curves: normal, skewed, lognormal, pareto.]
The lognormal distribution is not long-tailed. Under either aggregation model, lognormal file sizes yield self-similarity over a range of time scales, but not true self-similarity.
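The difference in tail behavior can be checked numerically by measuring the local slope of each complementary cdf on log-log axes. In this sketch the parameter values are arbitrary illustrations and the function names are hypothetical.

```python
import math

def pareto_ccdf(x, x_min, alpha):
    """Prob {value > x} for a Pareto distribution."""
    return 1.0 if x < x_min else (x / x_min) ** (-alpha)

def lognormal_ccdf(x, mu, sigma):
    """Prob {value > x} for a lognormal distribution."""
    return 0.5 * math.erfc((math.log(x) - mu) / (sigma * math.sqrt(2)))

def loglog_slope(ccdf, x, factor=2.0):
    """Slope of log ccdf against log x between x and factor * x."""
    return (math.log(ccdf(x * factor)) - math.log(ccdf(x))) / math.log(factor)

def pareto(x):
    return pareto_ccdf(x, x_min=1.0, alpha=1.0)

def lognormal(x):
    return lognormal_ccdf(x, mu=8.0, sigma=2.0)

for x in (1e3, 1e4, 1e5, 1e6):
    print(x, loglog_slope(pareto, x), loglog_slope(lognormal, x))
```

The Pareto slope stays at -alpha at every scale, while the lognormal slope keeps getting steeper as x grows. Over a limited range the two can look alike, which matches the point that lognormal sizes give self-similarity only over a range of time scales.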
Tail behavior?
[Figure: File Sizes from Crovella dataset, log-log axes. x axis: File size (bytes), 1 to 32MB; y axis: Prob {file size > x}, 1 down to 1/16384; curves: Pareto model, lognormal model, actual cdf.]
[Figure: File Sizes from NASA dataset, log-log axes. x axis: File size (bytes), 1 to 32MB; y axis: Prob {file size > x}, 1 down to 1/16384; curves: Pareto model, lognormal model, actual cdf.]
To explain self-similarity, we only need a Pareto tail. Log-log ccdf amplifies tail. Which model is better?
Kuhn’s criteria
Theory choice: Accuracy, Scope, Consistency, Simplicity, Fruitfulness.
One more criterion: Explanatory model.
Lognormal vs. Pareto
Accuracy and Scope
Diffusion model fits the bulk of the distribution. Pareto model sometimes fits the tail better.
Consistency
Diffusion model undermines self-sim explanation.
Simplicity
Pick ’em.
Fruitfulness
Long-tailed distributions are a nightmare for modelers.
Explanatory model
Carlson and Doyle only explain Web files. I think the diffusion model is more realistic.
Trade simplicity for accuracy
[Figure: File Sizes from Crovella98, log-log axes. x axis: File size (bytes), 1 to 32MB; y axis: Prob {file size > x}, 1 down to 1/16384; curves: lognormal model, actual cdf.]
What if the primordial soup contained two files? Multimodal (5-parameter) lognormal model. Accuracy and complexity comparable to Crovella’s hybrid model.
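One way to read the five-parameter model is as a mixture of two lognormals: a location and spread for each mode plus a mixing weight. That reading is an assumption, since the slides do not spell out the parameterization, and the names and values below are hypothetical.

```python
import math

def lognormal_ccdf(x, mu, sigma):
    """Prob {value > x} for a single lognormal mode."""
    return 0.5 * math.erfc((math.log(x) - mu) / (sigma * math.sqrt(2)))

def two_mode_ccdf(x, p, mu1, sigma1, mu2, sigma2):
    """Mixture ccdf: weight p on the first mode, 1 - p on the second."""
    return (p * lognormal_ccdf(x, mu1, sigma1)
            + (1.0 - p) * lognormal_ccdf(x, mu2, sigma2))

# Arbitrary illustrative parameters: a bulk of small files plus a mode of large ones.
params = dict(p=0.9, mu1=8.0, sigma1=1.5, mu2=13.0, sigma2=1.0)
for x in (1e3, 1e5, 1e7):
    print(x, two_mode_ccdf(x, **params))
```

The second mode lifts the tail of the ccdf above what a single lognormal gives, at the cost of three extra parameters.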
Is Internet traffic really self-similar?
What seems to be an empirical question depends on theory choice. Theory choice is not determined (entirely) by evidence.
[Diagram labels: M/G/infinity model, ON/OFF model, Pareto tail, other Pareto, lognormal, asymptotic, pseudo self similarity.]