Statistical Clustering of Internet Communication Patterns

Talk at Interface 2003
March 13, 2003

Félix Hernández-Campos (UNC-Chapel Hill Computer Science)
The University of North Carolina at Chapel Hill

Joint work with: Andrew Nobel (UNC-CH Statistics), Don Smith and Kevin Jeffay (UNC-CH Computer Science)

Motivation
Modeling Internet Traffic

[Figure: the Internet carrying traffic between a Web Browser and a Web Server, and between an Email Client and an Email Server]

Motivation
Experimental Networking Research

  • Evaluating network technologies requires realistic experiments in a controlled laboratory environment
  • A key component of these experiments is the traffic workload
    – Traffic is created by distributed applications running at the end hosts
  • A natural approach for traffic generation is to simulate these applications using models of their behavior
    – This is known as source-level modeling

Internet Traffic Mixes
Internet2 Applications (Nov 4 2002)

[Figure: share of packets by application. Individual applications (groups of applications): NNTP (Newsgroups), HTTP (Web), FTP (File Transfer), File Sharing, Audio/Video, Misc, Encrypted (Encryption), Games, Unidentified]

  • Dozens of different applications are commonly used
  • There is a large percentage of unidentified traffic

Difficulties in Source-Level Modeling

  • Real Internet traffic is the result of aggregating many individual applications into a traffic mix
  • Requires protocol specifications
    – Closed applications have to be reverse engineered
  • Applications change quickly
  • Privacy considerations complicate data acquisition
  • It is simply infeasible to develop models for each application and keep them up to date

Goals

  • Develop source-level models of traffic mixes
    – Easy to populate and update
    – Derived from very large data sets
  • Construct flexible traffic generators
    – Reproduce a wide range of traffic mixes

Our Approach

  • Find the fundamental patterns of communication
    – Cluster-based traffic generation
  • Model communication patterns in an abstract manner
    – Application-independent source-level modeling

Data Acquisition
Inference from TCP Packet Headers

[Figure: time-sequence diagram of a TCP connection between Caller and Callee, with TIME on the vertical axis. Setup: SYN, SYN-ACK, ACK. The caller sends DATA (seqno 305, ackno 1). The callee replies with ACK (seqno 1, ackno 305) and two DATA segments (seqno 1461, ackno 305 and seqno 2876, ackno 305). The caller sends ACK (seqno 305, ackno 2876). Teardown: FIN and FIN-ACK in each direction.]

Communication Patterns
Inference from TCP Packet Headers

[Figure: the same diagram annotated with the inferred data units: 305 bytes from caller to callee, 2876 bytes from callee to caller]
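The inference step shown in the diagram can be sketched in a few lines of Python. This is a minimal illustration and not the authors' tool. It assumes the diagram's simplified numbering (seqno is the cumulative number of payload bytes a sender has transmitted) and that pure SYN, ACK, and FIN segments have already been filtered out. The function name `infer_data_units` and the tuple layout are hypothetical.

```python
from typing import List, Optional, Tuple

Segment = Tuple[str, int]  # (sender, seqno); sender is "caller" or "callee"


def infer_data_units(segments: List[Segment]) -> List[Tuple[str, int]]:
    """Group data segments into per-direction data units.

    Assumption: seqno is the cumulative payload byte count of the sender,
    as in the simplified diagram, and only data segments are passed in.
    """
    units: List[Tuple[str, int]] = []
    highest = {"caller": 0, "callee": 0}   # highest seqno seen per sender
    consumed = {"caller": 0, "callee": 0}  # bytes already assigned to a unit
    current: Optional[str] = None
    for sender, seqno in segments:
        if current is not None and sender != current:
            # Direction changed: close the data unit of the previous sender.
            units.append((current, highest[current] - consumed[current]))
            consumed[current] = highest[current]
        current = sender
        # Taking the max makes retransmitted or reordered segments harmless.
        highest[sender] = max(highest[sender], seqno)
    if current is not None:
        units.append((current, highest[current] - consumed[current]))
    return units


# Data segments from the diagram: one from the caller, two from the callee.
trace = [("caller", 305), ("callee", 1461), ("callee", 2876)]
print(infer_data_units(trace))
```

For the trace in the diagram this yields one 305-byte unit from the caller followed by one 2876-byte unit from the callee, without ever looking at packet payloads.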

Communication Patterns
HTTP Connection (Web Traffic)

[Figure: the TCP diagram between Web Browser and Web Server, with TIME on the vertical axis. The browser sends an HTTP Request (305 bytes) and the server returns an HTTP Response (2,876 bytes).]

[Figure: the same exchange between Web Client and Web Server reduced to its two data units: HTTP Request (305 bytes) and HTTP Response (2,876 bytes)]

  • Communication pattern was (a1, b1)
    – E.g., (305 bytes, 2,876 bytes)

Abstract Communication Model
The a-b-t Model

[Figure: Caller and Callee exchange data in three epochs. Epoch 1: a1 bytes, then b1 bytes. After t1 seconds, Epoch 2: a2 bytes, then b2 bytes. After t2 seconds, Epoch 3: a3 bytes, then b3 bytes.]

  • General model (a-b-t vector): ((a1, b1, t1), (a2, b2, t2), …, (ae, be, ⊥)), where e is the number of epochs
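One way to hold an a-b-t vector in code is a list of small records, one per epoch. The sketch below is only an illustration: the names `Epoch` and `make_abt_vector` are hypothetical, and `None` stands in for the ⊥ that marks the last epoch.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Epoch:
    a: int              # bytes sent by the caller in this epoch
    b: int              # bytes sent by the callee in this epoch
    t: Optional[float]  # seconds until the next epoch; None plays the role of ⊥


def make_abt_vector(sizes: Sequence[Tuple[int, int]],
                    gaps: Sequence[float]) -> List[Epoch]:
    """Build an a-b-t vector from e (a, b) pairs and e - 1 inter-epoch times."""
    if len(gaps) != len(sizes) - 1:
        raise ValueError("need exactly one gap between consecutive epochs")
    padded = list(gaps) + [None]  # the last epoch has no following gap
    return [Epoch(a, b, t) for (a, b), t in zip(sizes, padded)]


# The single-epoch HTTP connection from the earlier slides.
http = make_abt_vector([(305, 2876)], [])
print(len(http), http)

# A three-epoch connection with made-up sizes and times (illustrative only).
multi = make_abt_vector([(200, 1500), (180, 40000), (220, 900)], [0.8, 12.5])
print(len(multi), multi[-1].t)
```

The number of epochs e is simply the length of the list.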

The a-b-t Model
Typical Communication Patterns

  • SMTP (send email)
  • Telnet (remote terminal)
  • FTP-DATA (file download)

Clustering Communication Patterns

  • Find statistically homogeneous communication patterns
    – Study this mixture of populations
  • Address scalability using statistical clustering

[Figure: connections along a TIME axis grouped into clusters C1 and C2]

Statistical Features of an a-b-t Vector

UNIVARIATE
  No. of Epochs              e
  Total bytes/time           a_tot, b_tot, t_tot
  Max bytes/time             a_max, b_max, t_max
  Min bytes/time             a_min, b_min, t_min
  Mean bytes/time            a_mean, b_mean, t_mean
  1st, 2nd, 3rd Quartiles    a_xq, b_xq, t_xq
  Standard Deviation         a_stdev, b_stdev, t_stdev
  Autocorrelations           a_cor.x, b_cor.x, t_cor.x
  Homogeneity                a_hx, b_hx, t_hx
  Total Variation            a_vs, b_vs, t_vs
  Max First Diff.            a_vm, b_vm, t_vm

MULTIVARIATE
  Correlations               cor.a.b, cor.a.t, cor.b.t
  Lagged Correlations        cor.a.b.x, cor.a.t.x, cor.b.t.x
  Cross-correlations         crc.a.b, crc.a.t, crc.b.t
  Directionality             dir1.a.b, dir2.a.b
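A few of the univariate features can be computed directly from the sequence of a's (or b's, or t's) of one connection. The sketch below is illustrative only. The slides do not give formulas, so total variation is assumed to be the sum of absolute first differences, max first difference the largest of those, and the standard deviation the population form. The function name `univariate_features` is hypothetical, and the keys follow the dotted names used elsewhere in the slides (a.tot, a.max, ...).

```python
import statistics
from typing import Dict, List


def univariate_features(x: List[float], name: str) -> Dict[str, float]:
    """Summarize one per-epoch sequence (all a's, all b's, or all t's)."""
    # Absolute first differences between consecutive epochs.
    diffs = [abs(later - earlier) for earlier, later in zip(x, x[1:])]
    return {
        f"{name}.tot": sum(x),
        f"{name}.max": max(x),
        f"{name}.min": min(x),
        f"{name}.mean": statistics.mean(x),
        f"{name}.stdev": statistics.pstdev(x),   # assumed: population form
        f"{name}.vs": sum(diffs),                # assumed: total variation
        f"{name}.vm": max(diffs, default=0.0),   # assumed: max first diff.
    }


# Made-up caller sizes for a three-epoch connection (illustrative only).
a_sizes = [100, 400, 250]
features = {"e": len(a_sizes)}
features.update(univariate_features(a_sizes, "a"))
for key, value in features.items():
    print(key, value)
```

Each connection becomes one row of numbers this way, regardless of which application produced it.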

Clustering Connections
Statistical Structure in Internet Data Exchanges

[Figure: the only surviving label is "m ="]

Clustering Connections
Example of Two Clusters

[Figure: two clusters, one labeled "Mainly SMTP, HTTPS and POP" and the other "HTTP"]

Clustering Communication Patterns
Data Set

Observations (rows) and Features (columns):

                  e      a.max   a.min   …   dir2.a.b
  Connection 1    0.66   0.23    0.12    …   0.61
  Connection 2    0.24   1.03    0.45    …   0.23
  …               …      …       …       …   …
  Connection m    0.11   …       …       …   1

  • Each feature is approximately normalized to [0,1]
    – Many features have heavy-tailed distributions
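The slide does not say how the normalization was done. One plausible sketch, offered purely as an assumption, is to compress the heavy tail with a logarithm and then rescale to [0,1] with the sample minimum and maximum. The function name `normalize_feature` is hypothetical.

```python
import math
from typing import List


def normalize_feature(values: List[float]) -> List[float]:
    """Log-compress a non-negative, heavy-tailed feature, then min-max scale.

    Assumption: this is one common recipe, not necessarily the one used
    in the talk.
    """
    logged = [math.log1p(v) for v in values]  # log(1 + v) handles zeros
    lo, hi = min(logged), max(logged)
    if hi == lo:
        return [0.0 for _ in logged]  # constant feature carries no signal
    return [(v - lo) / (hi - lo) for v in logged]


# Made-up byte totals spanning several orders of magnitude.
totals = [0, 9, 99, 99_999]
print(normalize_feature(totals))
```

Without the logarithm, the largest connection would push nearly every other value toward 0. Note that this recipe maps the sample exactly into [0,1], whereas the table shows a value of 1.03, so the scaling actually used was evidently only approximate.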

Example 1
Divisive Hierarchical Clustering

  • Packet header trace collected from UNC main Internet access link
    – April 2002
  • Random sample of 5,000 connections
    – e ≥ 2
  • Analysis performed using R’s implementation
    – Using the diana algorithm
  • Euclidean distance

26 Features
  No. of Epochs              e
  Total bytes/time           a_tot, b_tot
  Max bytes/time             a_max, b_max, t_max
  Min bytes/time             a_min, b_min
  1st, 2nd Moments           a_µ, a_σ, b_µ, b_σ
  1st, 2nd, 3rd Quartiles    a_xq, b_xq
  Max/Min Ratio              a_h, b_h
  Total Variation            a_vs, b_vs
  Lag-1 Autocorr.            r_a, r_b
  Spearman’s Correl.         ρ1(a’s, b’s)
  Lag-1 Cross Corr.          ρ2(b’s, a’s)
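The analysis itself used diana from R. To show what a divisive step does, here is a small pure-Python sketch of a single DIANA-style split with Euclidean distance. It is an illustration under simplifying assumptions: it recomputes distances on the fly and splits only one cluster, whereas the full algorithm repeatedly splits the cluster with the largest diameter. The function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Sequence[float]


def euclidean(p: Point, q: Point) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


def avg_dist(i: int, group: List[int], points: List[Point]) -> float:
    """Average distance from point i to the members of group other than i."""
    others = [j for j in group if j != i]
    if not others:
        return 0.0
    return sum(euclidean(points[i], points[j]) for j in others) / len(others)


def diana_split(points: List[Point],
                cluster: List[int]) -> Tuple[List[int], List[int]]:
    """Split one cluster (indices into points) into (splinter, rest)."""
    rest = list(cluster)
    # Seed the splinter group with the most dissimilar object.
    seed = max(rest, key=lambda i: avg_dist(i, rest, points))
    splinter = [seed]
    rest.remove(seed)
    while len(rest) > 1:
        # Gain: how much closer i is to the splinter group than to the rest.
        gains = {i: avg_dist(i, rest, points) - avg_dist(i, splinter, points)
                 for i in rest}
        best = max(gains, key=gains.get)
        if gains[best] <= 0:
            break  # nobody prefers the splinter group any more
        splinter.append(best)
        rest.remove(best)
    return splinter, rest


# Made-up normalized feature vectors: three near (0, 0), two near (1, 1).
pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (1.0, 1.0), (0.9, 1.0)]
print(diana_split(pts, [0, 1, 2, 3, 4]))
```

Applying the split recursively to each resulting group produces the binary tree that a dendrogram displays.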

Example 1
Dendrogram

[Figure: dendrogram of the 5,000 sampled connections, pruned at depth 4; vertical axis: Dissimilarity. Cluster annotations: HTTP, HTTPS, AOL, SMTP, POP; bi = 0; HTTP, HTTPS, MS-DS, RTSP, FTP-DATA; e = 2; a.tot = 50k; b.tot = 0]
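Pruning a dendrogram at a fixed depth amounts to walking the binary tree and stopping a given number of levels below the root. The sketch below assumes the root sits at depth 0. The `Node` class, the function name, and the toy tree are all hypothetical and are not taken from the figure.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    members: List[int]             # indices of the connections in this subtree
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def prune_at_depth(root: Node, depth: int) -> List[List[int]]:
    """Return the clusters obtained by cutting `depth` levels below the root."""
    if depth == 0 or root.left is None or root.right is None:
        return [root.members]      # reached the cut level or a leaf
    return (prune_at_depth(root.left, depth - 1)
            + prune_at_depth(root.right, depth - 1))


# Toy tree over six connections (illustrative only).
tree = Node(
    [0, 1, 2, 3, 4, 5],
    left=Node([0, 1, 2], left=Node([0]), right=Node([1, 2])),
    right=Node([3, 4, 5]),
)
print(prune_at_depth(tree, 1))
print(prune_at_depth(tree, 2))
```

A fixed depth is the simplest cut rule. It ignores the dissimilarity at which each split happens, which is one reason a more systematic pruning criterion is desirable.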

Example 2
Agglomerative Hierarchical Clustering

  • Packet header trace collected from an Internet2 backbone link (Abilene-I data set)
    – August 2002
  • Sample of 717 connections
    – e ≥ 2
  • Analysis performed using Eisen’s software
    – Developed for microarrays
  • Pearson’s correlation as distance metric

14 Features
  No. of Epochs              e
  Total bytes/time           a_tot, b_tot, t_tot
  2nd Quartiles              a_2q, b_2q, t_2q
  Max First Diff.            a_fd, b_fd
  Max/Min Ratio              a_hx, b_hx
  log(a_tot / b_tot)         dir
  Spearman’s Correl.         ρ1(a’s, b’s)
  Lag-1 Sp. Corr.            ρ2(a’s, b’s)
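Using Pearson's correlation as a distance means that two connections are close when their feature vectors rise and fall together, regardless of scale. The slide does not state how the correlation is turned into a distance. The sketch below assumes the common choice d = 1 - r, and the function names are hypothetical.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson's correlation coefficient of two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / (norm_x * norm_y)


def pearson_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """Assumed conversion: 0 for perfectly correlated, 2 for anti-correlated."""
    return 1.0 - pearson(x, y)


# Made-up feature vectors (illustrative only).
u = [0.1, 0.4, 0.9]
print(pearson_distance(u, [0.2, 0.8, 1.8]))   # same shape, different scale
print(pearson_distance(u, [0.9, 0.4, 0.1]))   # different shape
```

Unlike Euclidean distance, this measure treats a vector and a scaled copy of it as identical, so it groups connections by the shape of their feature profile.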

[Figure: clustering result for Example 2, titled "Correlation", shown against the 14 features e, a_tot, a_2q, a_fd, a_h, b_tot, b_2q, b_fd, b_h, t_tot, t_2q, dir, ρ1, ρ2. Labeled clusters: File-Sharing Applications (Kazaa, Edonkey, Gnutella); Gnutella, Telnet, HTTPS; Newsgroups; Web Traffic; Email; FTP-Data]

Summary and Current Work

  • Developed an application-independent model of Internet communication patterns
    – Suitable for large-scale data acquisition
  • Applied statistical clustering to uncover fundamental subpopulations
    – Working on a systematic approach for feature selection and cluster identification (i.e., dendrogram pruning)
    – O(n^2) is too slow, so we are also looking into data mining algorithms for clustering
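The quadratic cost comes from the dissimilarity matrix: hierarchical clustering needs a value for every pair of connections. A short calculation shows how this grows. The sample sizes 717 and 5,000 are those of the two examples; the one-million case is a hypothetical trace size added for illustration.

```python
def pairwise_count(n: int) -> int:
    """Number of distinct pairs among n connections: n(n - 1) / 2."""
    return n * (n - 1) // 2


for n in (717, 5_000, 1_000_000):
    pairs = pairwise_count(n)
    print(f"n = {n:>9,}: {pairs:,} pairwise dissimilarities")
```

Going from the samples used here to a trace with millions of connections multiplies the work by many orders of magnitude, which is why sampling or cheaper clustering algorithms are needed.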