
1

Understanding Patterns of TCP Connection Usage with Statistical Clustering

The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL

http://www.cs.unc.edu/Research/dirt

Félix Hernández-Campos, Kevin Jeffay, Don Smith
Department of Computer Science

Andrew Nobel
Department of Statistics

2

Motivation

Modeling Internet Traffic

[Figure: INTERNET]

3

Motivation

Modeling Internet Traffic

[Figure labels: Internet Traffic; Web Browser, Web Server, Email Server, Email Client]

4

Motivation

Experimental Networking Research

  • Evaluating network technologies requires realistic experiments in a controlled laboratory environment
  • A key component of these experiments is the traffic workload
    – Traffic is created by distributed applications running at the end hosts
  • A natural approach for traffic generation is to simulate these applications using models of their behavior
    – This is known as source-level modeling

5

Internet Traffic Mixes

Internet2 Applications (Nov 4 2002)

[Figure: packets by individual application (NNTP, HTTP, FTP, File Sharing, Audio/Video, Misc, Encrypted, Games, Unidentified) and by group of applications (Newsgroups, Web, File Transfer, File Sharing, Audio/Video, Misc, Encryption, Games, Unidentified)]

  • Dozens of different applications are commonly used
  • There is a large percentage of unidentified traffic

6

Difficulties in Source-Level Modeling

  • Real Internet traffic is the result of aggregating many individual applications into a traffic mix
  • Requires protocol specifications
    – Closed applications have to be reverse engineered
  • Applications change quickly
  • Privacy considerations complicate data acquisition
  • It is simply infeasible to develop models for each application and keep them up to date

7

Modeling of Internet Traffic Mixes

Goals

  • Develop source-level models of traffic mixes
    – Easy to populate and update
    – Derived from very large data sets
  • Find the fundamental patterns of communication
    – Cluster-based traffic generation
  • Model communication patterns in an abstract manner
    – Application-independent source-level modeling
  • Construct flexible traffic generators
    – Reproduce a wide range of traffic mixes

8

Our Approach

Finding Patterns in TCP Connections

  • Modeling of data exchange patterns in TCP connections
    – Application-independent, network-independent
  • Statistical clustering of TCP connection patterns
    – Find the fundamental subpopulations
    – Construct empirical or parametric models of subpopulations
  • Development of new, flexible traffic generators
    – Cluster-based synthetic traffic
  • Validation
    – Compare synthetic traffic with some gold standard

9

Modeling of Data Exchange Patterns

ADU Inference from TCP Packet Headers

[Figure: TCP segment exchange between Caller and Callee along a TIME axis. Segment labels: SYN, SYN-ACK, ACK; DATA, ACK, DATA, DATA, ACK; FIN, FIN-ACK, FIN, FIN-ACK. Sequence/acknowledgment numbers shown: (seqno 305, ackno 1), (seqno 1, ackno 305), (seqno 1461, ackno 305), (seqno 2876, ackno 305), (seqno 305, ackno 2876)]

10

Modeling of Data Exchange Patterns

ADU Inference from TCP Packet Headers

[Figure: TCP segment exchange between Caller and Callee along a TIME axis, annotated with the inferred data unit sizes: 305 bytes and 2876 bytes. Segment labels: SYN, SYN-ACK, ACK; DATA, ACK, DATA, DATA, ACK; FIN, FIN-ACK, FIN, FIN-ACK. Sequence/acknowledgment numbers shown: (seqno 305, ackno 1), (seqno 1, ackno 305), (seqno 1461, ackno 305), (seqno 2876, ackno 305), (seqno 305, ackno 2876)]

11

Modeling of Data Exchange Patterns

ADU Inference from TCP Packet Headers

[Figure: TCP segment exchange between Web Browser and Web Server along a TIME axis. The HTTP Request is 305 bytes and the HTTP Response is 2,876 bytes. Segment labels: SYN, SYN-ACK, ACK; ACK, ACK; FIN, FIN-ACK, FIN, FIN-ACK. Sequence/acknowledgment numbers shown: (seqno 1, ackno 305), (seqno 305, ackno 2876)]
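
The inference pictured on these slides can be sketched in a few lines of Python. This is only an illustration, not the authors' tool: the record format `(direction, seq_end)` is hypothetical, sequence numbers are treated as byte counts starting at 0 (the simplified numbering the diagram appears to use), and the trace is assumed to be in order with no loss or retransmission.

```python
from typing import List, Tuple

# Hypothetical record: (direction, seq_end), where direction is "caller" or
# "callee" and seq_end is the highest relative sequence number of payload
# sent so far in that direction (simplified numbering, an assumption).
Segment = Tuple[str, int]

def infer_adus(segments: List[Segment]) -> List[Tuple[str, int]]:
    """Group consecutive data segments in one direction into one data unit.

    Segments that do not advance the sequence number (pure ACKs) are ignored.
    """
    highest = {"caller": 0, "callee": 0}   # payload bytes seen per direction
    adus: List[Tuple[str, int]] = []
    for direction, seq_end in segments:
        new_bytes = seq_end - highest[direction]
        if new_bytes <= 0:
            continue                       # pure ACK or duplicate: no new data
        highest[direction] = seq_end
        if adus and adus[-1][0] == direction:
            # Same sender as the current data unit: extend it.
            adus[-1] = (direction, adus[-1][1] + new_bytes)
        else:
            # The sender changed: a new data unit starts.
            adus.append((direction, new_bytes))
    return adus

# The exchange from the slide, in the simplified numbering.
trace = [
    ("caller", 305),    # DATA carrying the request
    ("callee", 0),      # ACK, no payload
    ("callee", 1461),   # DATA, first part of the response
    ("callee", 2876),   # DATA, rest of the response
    ("caller", 305),    # ACK, no payload
]
adus = infer_adus(trace)
print(adus)
```

Only header fields are consulted, which is why the approach needs no payload and no knowledge of the application protocol.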

12

Modeling of Data Exchange Patterns

HTTP Connection (Web Traffic)

[Figure: Web Client and Web Server along a TIME axis: HTTP Request, 305 bytes; HTTP Response, 2,876 bytes]

  • Communication pattern was (a_1, b_1)
    – E.g., (305 bytes, 2,876 bytes)

13

Abstract Communication Model

The a-b-t connection vector model

[Figure: Caller and Callee exchanging data over three epochs. Epoch 1: a_1 bytes, b_1 bytes. Epoch 2: a_2 bytes, b_2 bytes. Epoch 3: a_3 bytes, b_3 bytes. The epochs are separated by t_1 seconds and t_2 seconds]

  • General model (a-b-t vector): ((a_1, b_1, t_1), (a_2, b_2, t_2), …, (a_e, b_e, ·)), where e is the number of epochs and the last epoch has no t value
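
One way to hold such a vector in code is sketched below. The class and field names are hypothetical, and reading `t` as the time separating an epoch from the next one is an assumption based on the diagram; the three-epoch values are made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Epoch:
    a: int                      # bytes sent by the caller in this epoch
    b: int                      # bytes sent by the callee in this epoch
    t: Optional[float] = None   # seconds before the next epoch; None for the last one (assumed reading)

@dataclass
class ConnectionVector:
    epochs: List[Epoch]

    @property
    def e(self) -> int:
        """Number of epochs in the connection."""
        return len(self.epochs)

    def a_sizes(self) -> List[int]:
        return [ep.a for ep in self.epochs]

    def b_sizes(self) -> List[int]:
        return [ep.b for ep in self.epochs]

    def t_values(self) -> List[float]:
        # The last epoch carries no t, so it is skipped here.
        return [ep.t for ep in self.epochs if ep.t is not None]

# The single-epoch HTTP example from the slides: (a_1, b_1) = (305, 2876).
http = ConnectionVector([Epoch(a=305, b=2876)])

# A made-up three-epoch connection, for illustration only.
multi = ConnectionVector([Epoch(120, 4000, 1.5), Epoch(90, 250, 0.2), Epoch(60, 30)])
print(http.e, multi.e, multi.a_sizes(), multi.t_values())
```

Keeping the a, b and t sequences separately accessible is convenient because the statistical features used for clustering are computed per sequence.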

14

a-b-t Connection Vectors

Typical Communication Patterns

  • SMTP (send email)
  • Telnet (remote terminal)
  • FTP-DATA (file download)

15

a-b-t Connection Vectors

Clustering communication patterns

[Figure: TIME axis; clusters C1 and C2]

  • Find statistically homogeneous communication patterns
    – Study this mixture of populations
  • Address scalability using statistical clustering

16

Clustering Communication Patterns

Clustering 101

  • Procedure that divides a given set of feature vectors into disjoint groups, or clusters, C_1, C_2, …, C_m
  • The goals of clustering schemes:
    – Clusters are small and mutually far apart
    – Clustering is done automatically
      » Clustering is a form of unsupervised learning
  • Statistical clustering is a well-founded technique
    – Successfully applied to gene microarray classification, data mining, …

17

Example

Clusters in a 2D Data Set

[Figure: scatter plot with clusters C_1, C_2 and C_3; axes: Feature X, Feature Y]

18

Example

Divisive Hierarchical Clustering

[Figure: scatter plot; axes: Feature X, Feature Y]

19

Divisive Hierarchical Clustering

Dendrogram

[Figure: dendrogram with clusters C_3, C_1 and C_2 below a Classification Threshold; vertical axis: Height]
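
The role of the classification threshold can be illustrated with a small sketch: subtrees that join at or below the threshold height become clusters. The tree representation (`Leaf`, `Node`) and the heights are hypothetical, chosen only to show the cut.

```python
from dataclasses import dataclass
from typing import List, Union

@dataclass
class Leaf:
    label: str

@dataclass
class Node:
    height: float   # dissimilarity at which the two subtrees are joined
    left: "Tree"
    right: "Tree"

Tree = Union[Leaf, Node]

def leaves(tree: Tree) -> List[str]:
    """All observation labels under a subtree, left to right."""
    if isinstance(tree, Leaf):
        return [tree.label]
    return leaves(tree.left) + leaves(tree.right)

def cut(tree: Tree, threshold: float) -> List[List[str]]:
    """Return the clusters obtained by cutting the dendrogram at `threshold`."""
    if isinstance(tree, Leaf) or tree.height <= threshold:
        return [leaves(tree)]            # the whole subtree is one cluster
    return cut(tree.left, threshold) + cut(tree.right, threshold)

# Made-up dendrogram over five observations.
tree = Node(9.0,
            Node(2.0, Leaf("p1"), Leaf("p2")),
            Node(6.0, Node(1.5, Leaf("p3"), Leaf("p4")), Leaf("p5")))
clusters = cut(tree, threshold=4.0)
print(clusters)
```

Raising the threshold merges clusters and lowering it splits them, so the number of clusters follows from where the dendrogram is cut rather than being fixed in advance.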

20

Statistical Features of a-b-t Connection Vectors

UNIVARIATE
Feature                    a          b          t
No. of Epochs              e
Total bytes/time           a_tot      b_tot      t_tot
Max bytes/time             a_max      b_max      t_max
Min bytes/time             a_min      b_min      t_min
Mean bytes/time            a_mean     b_mean     t_mean
1st, 2nd, 3rd Quartiles    a_xq       b_xq       t_xq
Standard Deviation         a_stdev    b_stdev    t_stdev
Autocorrelations           a_cor.x    b_cor.x    t_cor.x
Homogeneity                a_hx       b_hx       t_hx
Total Variation            a_vs       b_vs       t_vs
Max First Diff.            a_vm       b_vm       t_vm

MULTIVARIATE
Feature                    Symbols
Correlations               cor.a.b, cor.a.t, cor.b.t
Lagged Correlations        cor.a.b.x, cor.a.t.x, cor.b.t.x
Cross-correlations         crc.a.b, crc.a.t, crc.b.t
Directionality             dir1.a.b, dir2.a.b
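
A few of the univariate features listed above could be computed as in the sketch below. The slide gives only feature names, so the formulas are assumptions: total variation is taken as the sum of absolute successive differences, max first difference as the largest absolute successive difference, and the standard deviation as the population form. The function name is hypothetical.

```python
import statistics
from typing import Dict, List

def univariate_features(xs: List[float], prefix: str) -> Dict[str, float]:
    """Summary statistics of one sequence (the a's, b's or t's of a connection)."""
    # Absolute differences between successive epochs.
    diffs = [abs(y - x) for x, y in zip(xs, xs[1:])]
    return {
        f"{prefix}_tot": sum(xs),
        f"{prefix}_max": max(xs),
        f"{prefix}_min": min(xs),
        f"{prefix}_mean": statistics.fmean(xs),
        f"{prefix}_stdev": statistics.pstdev(xs),   # population form (assumed)
        f"{prefix}_vs": sum(diffs),                 # total variation (assumed definition)
        f"{prefix}_vm": max(diffs, default=0.0),    # max first difference (assumed definition)
    }

# Made-up a-sizes of a three-epoch connection.
a = [100, 300, 200]
feats = univariate_features(a, "a")
feats["e"] = len(a)   # number of epochs
print(feats)
```

A single-epoch connection has no successive differences, which is why the last two features fall back to 0 in that case.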


21

Clustering Connections

Statistical structure in data exchanges

m =

22

Clustering Connections

Example of two clusters

[Figure: two clusters, labeled "Mainly SMTP, HTTPS and POP" and "HTTP"]

23

Clustering Communication Patterns

Data Set

                 Features
Observations     e       a.max   a.min   …   dir2.a.b
Connection 1     0.66    0.23    0.12    …   0.61
Connection 2     0.24    1.03    0.45    …   0.23
…                …       …       …       …   …
Connection m     0.11    …       …       …   1

  • Each feature is approximately normalized to [0,1]
    – Many features have heavy-tailed distributions
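
The slide does not say how the normalization was done. One plausible scheme, shown purely as an assumption, is to compress the heavy tail with a logarithm and then divide by a high quantile, so that most values land in [0,1] while a few extremes exceed 1 slightly. The function name and the default quantile are hypothetical.

```python
import math
from typing import List

def normalize(values: List[float], q: float = 0.95) -> List[float]:
    """Log-compress a non-negative heavy-tailed feature, then scale by its q-quantile."""
    logs = [math.log1p(v) for v in values]
    ranked = sorted(logs)
    # Nearest-rank quantile. Scaling by a high quantile instead of the maximum
    # keeps a few extreme values from squeezing everything else toward 0.
    idx = min(len(ranked) - 1, max(0, math.ceil(q * len(ranked)) - 1))
    scale = ranked[idx] or 1.0   # avoid dividing by zero for an all-zero feature
    return [x / scale for x in logs]

# Made-up byte totals spanning several orders of magnitude.
raw = [0, 10, 100, 1000, 5000]
normed = normalize(raw, q=0.7)
print([round(x, 3) for x in normed])
```

Values above the chosen quantile end up a little over 1, which would be consistent with "approximately" normalized features.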

24

Example 1

Divisive Hierarchical Clustering

  • Packet header trace collected from UNC main Internet access link
    – April 2002
  • Random sample of 5,000 connections
    – e [?] 2
  • Analysis performed using R's implementation
    – Using the diana algorithm
  • Euclidean distance

26 Features
Feature                    Symbols
No. of Epochs              e
Total bytes/time           a_tot, b_tot
Max bytes/time             a_max, b_max, t_max
Min bytes/time             a_min, b_min
1st, 2nd Moments           a_µ, [?]; b_µ, [?]
Max/Min Ratio              a_h, b_h
1st, 2nd, 3rd Quartiles    a_xq, b_xq
Total Variation            a_vs, b_vs
Lag-1 Autocorr.            r_a, r_b
Spearman's Correl.         [?]_1(a's, b's)
Lag-1 Cross Corr.          [?]_2(b's, a's)
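
To make the divisive step concrete, here is a sketch of a single split using the splinter-group heuristic usually associated with the diana algorithm and Euclidean distance. It is written from the textbook description, not taken from R's code, and the function names and sample points are made up.

```python
import math
from typing import List, Sequence, Tuple

Point = Sequence[float]

def avg_dist(i: int, group: List[int], pts: List[Point]) -> float:
    """Average Euclidean distance from point i to the other members of `group`."""
    others = [j for j in group if j != i]
    if not others:
        return 0.0
    return sum(math.dist(pts[i], pts[j]) for j in others) / len(others)

def diana_split(cluster: List[int], pts: List[Point]) -> Tuple[List[int], List[int]]:
    """Split one cluster into a splinter group and the remainder."""
    rest = list(cluster)
    # Seed the splinter group with the point farthest, on average, from the others.
    seed = max(rest, key=lambda i: avg_dist(i, rest, pts))
    rest.remove(seed)
    splinter = [seed]
    while len(rest) > 1:
        # Gain of moving i: how much closer it is to the splinter group
        # than to the points that would stay behind.
        gains = {i: avg_dist(i, rest, pts) - avg_dist(i, splinter, pts) for i in rest}
        best = max(gains, key=gains.get)
        if gains[best] <= 0:
            break                      # nobody prefers the splinter group
        rest.remove(best)
        splinter.append(best)
    return splinter, rest

# Made-up normalized feature vectors forming two obvious groups.
pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (1.0, 1.0), (0.9, 1.0)]
splinter, rest = diana_split(list(range(len(pts))), pts)
print(sorted(splinter), sorted(rest))
```

A full divisive clustering would apply this split repeatedly, each time to the cluster with the largest diameter, recording the heights to build the dendrogram. Every split compares pairs of points, which is where the quadratic cost comes from.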

25

Example 1

Dendrogram

[Figure: dendrogram pruned at depth 4; vertical axis: Dissimilarity. Labels, in extraction order: HTTP, HTTPS, AOL, SMTP, POP; b_i = 0; HTTP, HTTPS, MS-DS, RTSP, FTP-DATA; e = 2; a.tot = 50k; b.tot = 0]

26

Example 2

Agglomerative Hierarchical Clustering

  • Packet header trace collected from an Internet2 backbone link (Abilene-I data set)
    – August 2002
  • Sample of 717 connections
    – e [?] 2
  • Analysis performed using Eisen's software
    – Developed for microarrays
  • Pearson's correlation as distance metric

14 Features
Feature                    Symbols
No. of Epochs              e
Total bytes/time           a_tot, b_tot, t_tot
2nd Quartiles              a_2q, b_2q, t_2q
Max First Diff.            a_fd, b_fd
Max/Min Ratio              a_hx, b_hx
log(a_tot / b_tot)         dir
Spearman's Correl.         [?]_1(a's, b's)
Lag-1 Sp. Corr.            [?]_2(a's, b's)
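
A correlation-based distance between two feature vectors might look like the sketch below. The slide does not state how the correlation is turned into a distance; taking 1 - r is a common convention and is an assumption here, as are the function names and sample vectors.

```python
import math
from typing import Sequence

def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson's correlation coefficient of two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0   # correlation is undefined for a constant vector; treated as 0 here
    return cov / (sx * sy)

def pearson_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """Distance in [0, 2]: 0 for perfectly correlated, 2 for perfectly anti-correlated."""
    return 1.0 - pearson(x, y)

# Made-up normalized feature vectors for three connections.
c1 = [0.1, 0.5, 0.9]
c2 = [0.2, 0.6, 1.0]   # same profile as c1, shifted upward
c3 = [0.9, 0.5, 0.1]   # reversed profile
print(pearson_distance(c1, c2), pearson_distance(c1, c3))
```

Unlike the Euclidean distance used in Example 1, this measure ignores offset and scale: two connections whose feature profiles have the same shape are close even when their magnitudes differ.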

27

Hierarchical Clustering

UNC01 1,000 Connection Sample

[Figure: dendrogram pruned at depth 4, rooted at the 1,000-connection sample, with numeric annotations on the nodes. Labels: HTTP, HTTPS (t_i = 0); HTTP, HTTPS (t > 0); HTTP, SMTP, AOL; No HTTP; p.b = 80; a.tot = 2k; b.tot = 0]

28

Correlation

[Figure: clustered connections annotated by application: File-Sharing Applications (Kazaa, Edonkey, Gnutella); Gnutella, Telnet, HTTPS; Newsgroups; Web Traffic; Email; FTP-Data. Feature labels: e, a_tot, a_2q, a_fd, a_h, b_tot, b_2q, b_fd, b_h, t_tot, t_2q, dir, [?]_1, [?]_2]

29

Summary and Current Work

  • Developed an application-independent model of TCP communication patterns: the a-b-t connection vector model
    – Suitable for large scale data acquisition
  • Applied statistical clustering to uncover fundamental subpopulations
    – Working on a systematic approach for feature selection and cluster identification (i.e., dendrogram pruning)
    – O(n^2) is too slow, so we are also looking into data mining algorithms for clustering
  • A synthetic traffic generator ("tmix") for reproducing TCP application workloads
    – Network specific workloads easily modeled given a packet header trace
