SLIDE 1

Mining Network Traffic Data

Ljiljana Trajković
ljilja@cs.sfu.ca
Communication Networks Laboratory
http://www.ensc.sfu.ca/cnl
School of Engineering Science
Simon Fraser University, Vancouver, British Columbia, Canada

SLIDE 2

July 19-20, 2007 IWCSN 2007, Guilin, China

Roadmap

  • Introduction
  • Traffic data and analysis tools:
      • data collection, statistical analysis, clustering tools, prediction analysis
  • Case studies:
      • satellite network: ChinaSat
      • packet data networks: Internet
      • public safety wireless network: E-Comm
  • Conclusions and references

SLIDE 3

Graduate students

M.Sc. and M.Eng. students at SFU:

  • ChinaSat data analysis: Qing (Kenny) Shao, Savio Lau
  • E-Comm data analysis: Duncan Sharp, Hao (Leo) Chen, Bozidar Vujičić, Nikola Cackov, Svetlana Vujičić, Nenad Lasković
  • Internet data analysis: Hao (Johnson) Chen

SLIDE 4

Roadmap

  • Introduction
  • Traffic data and analysis tools:
      • data collection, statistical analysis, clustering tools, prediction analysis
  • Case studies:
      • satellite network: ChinaSat
      • packet data networks: Internet
      • public safety wireless network: E-Comm
  • Conclusions and references

SLIDE 5

Network traffic measurements

  • Traffic measurements in operational networks help:
      • understand traffic characteristics in deployed networks
      • develop traffic models
      • evaluate performance of protocols and applications
  • Analysis of traffic:
      • provides information about user behavior patterns
      • enables network operators to understand the behavior of network users
  • Traffic prediction: important to assess future network capacity requirements and to plan future network developments

SLIDE 6

Self-similarity

  • Self-similarity implies a "fractal-like" behavior: data on various time scales have similar patterns
  • A wide-sense stationary process X(n) is called (exactly second-order) self-similar if its autocorrelation function satisfies:

      $r^{(m)}(k) = r(k), \quad k \ge 0, \quad m = 1, 2, \ldots, n,$

    where m is the level of aggregation and $r^{(m)}$ is the autocorrelation function of the process aggregated at level m
  • Implications:
      • no natural length of bursts
      • bursts exist across many time scales
      • traffic does not become "smoother" when aggregated (unlike Poisson traffic)

SLIDE 7

Self-similar processes

  • Properties:
      • slowly decaying variance
      • long-range dependence
      • Hurst parameter (H)
  • Processes with only short-range dependence (Poisson): H = 0.5
  • Self-similar processes: 0.5 < H < 1.0
  • As the traffic volume increases, the traffic becomes more bursty, more self-similar, and the Hurst parameter increases

SLIDE 8

Long-range dependence: properties

  • High variability:
      • when the sample size increases, the variance of the sample mean decays more slowly than expected
  • Burstiness over a range of timescales:
      • long runs of large values followed by long runs of small values, repeated in aperiodic patterns

[Figure: fractional Gaussian noise (fGn) trace]

SLIDE 9

Estimation of H

Various estimators:

  • variance-time plots
  • R/S plots
  • periodograms
  • wavelets

Their performance often depends on the characteristics of the data trace under analysis

H is computed from the slope of the line fitted to the plot; for a variance-time plot, $H = 1 + \mathrm{slope}/2$
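The variance-time estimator listed above can be sketched as follows. This is a minimal illustration rather than the tool used in the study: it assumes the standard relation H = 1 + slope/2 for the log-log plot of the variance of the aggregated series against the aggregation level m. The function names, the aggregation levels, and the synthetic trace are hypothetical.

```python
import math
import random


def aggregate(series, m):
    """Average the series over non-overlapping blocks of length m."""
    blocks = len(series) // m
    return [sum(series[i * m:(i + 1) * m]) / m for i in range(blocks)]


def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def hurst_variance_time(series, levels=(1, 2, 4, 8, 16, 32, 64)):
    """Estimate H from the slope of log(variance) versus log(m)."""
    xs = [math.log10(m) for m in levels]
    ys = [math.log10(variance(aggregate(series, m))) for m in levels]
    # Ordinary least-squares slope of ys against xs.
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
             / sum((x - x_mean) ** 2 for x in xs))
    return 1 + slope / 2


rng = random.Random(1)  # fixed seed so the sketch is repeatable
trace = [rng.gauss(0.0, 1.0) for _ in range(8192)]  # synthetic, uncorrelated samples
h_estimate = hurst_variance_time(trace)
```

For an uncorrelated trace such as this one the estimate should fall near 0.5; a long-range dependent trace would give a value between 0.5 and 1.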

SLIDE 10

Clustering analysis

  • Clustering analysis groups or segments a collection of objects into subsets or clusters
  • Objects within a cluster are more similar to each other than objects in distinct clusters
  • An object can be described by a set of measurements or by its relations to other objects
  • Network users are classified into clusters according to the similarity of their behavior patterns

SLIDE 11

Clustering analysis

  • Groups a collection of objects into subsets (clusters):
      • resulting intra-cluster similarity is high while inter-cluster similarity is low
  • The inter-cluster distance reflects dissimilarity between clusters:
      • Euclidean distance between two cluster centroids (mean value of objects in a cluster, viewed as the cluster's center of gravity)
  • The intra-cluster distance expresses coherent similarity of data in the same cluster:
      • average distance of objects from their cluster centroids
  • Better clustering:
      • large inter-cluster and small intra-cluster distances

SLIDE 12

Clustering quality

  • Overall clustering quality: defined as the difference between the minimum inter-cluster and the maximum intra-cluster distances
      • a larger indicator implies better overall clustering quality
  • Silhouette coefficient of a data point x:

      $SC(x) = \frac{b(x) - a(x)}{\max\{a(x), b(x)\}}$

      • a(x) and b(x) are the average distances between data point x and the other data points in clusters A and B, respectively, where A is the cluster containing x and B is the nearest other cluster
      • independent of the number of clusters K
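The silhouette coefficient can be computed directly from its definition. The sketch below assumes the usual convention that a(x) is measured within the cluster containing x and b(x) against the nearest other cluster; the function name, the zero score for degenerate cases, and the sample points are hypothetical.

```python
def silhouette(points, labels, distance):
    """Return the silhouette coefficient of every point."""
    scores = []
    for i, (p, own) in enumerate(zip(points, labels)):
        # Group the distances from p to every other point by cluster label.
        by_cluster = {}
        for j, (q, lab) in enumerate(zip(points, labels)):
            if i != j:
                by_cluster.setdefault(lab, []).append(distance(p, q))
        if own not in by_cluster or len(by_cluster) < 2:
            scores.append(0.0)  # singleton cluster or a single cluster: score 0 by convention
            continue
        a = sum(by_cluster[own]) / len(by_cluster[own])
        b = min(sum(d) / len(d) for lab, d in by_cluster.items() if lab != own)
        denominator = max(a, b)
        scores.append((b - a) / denominator if denominator else 0.0)
    return scores


points = [0.0, 1.0, 10.0, 11.0]          # two well-separated groups
labels = [0, 0, 1, 1]
scores = silhouette(points, labels, lambda p, q: abs(p - q))
average_sc = sum(scores) / len(scores)
```

Values close to 1 indicate that a point sits well inside its own cluster; averaging the scores over all points gives a single quality figure for the clustering.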

SLIDE 13

Clustering algorithms

  • Two approaches:
      • partitioning clustering (k-means)
      • hierarchical clustering
  • Clustering tools:
      • AutoClass tool
      • k-means algorithm

References:

  • P. Cheeseman and J. Stutz, "Bayesian classification (AutoClass): theory and results," in Advances in Knowledge Discovery and Data Mining, U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, Eds., AAAI Press/MIT Press, 1996.
  • L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis. New York: John Wiley & Sons, 1990.

SLIDE 14

Clustering algorithms: k-means

  • The k-means algorithm is commonly used for data clustering
  • The algorithm is well-known for its simplicity and efficiency
  • Based on the input parameter k, it partitions a set of n objects into k clusters so that the resulting intra-cluster similarity is high and the inter-cluster similarity is low
  • Similarity of clusters is measured with respect to the mean value of the objects in a cluster (viewed as the cluster's center of gravity)

SLIDE 15

k-means: partitioning clustering

  • Constructs k partitions of the data from n objects, where k ≤ n
  • Two constraints:
      • each cluster must contain at least one object
      • each object must belong to exactly one cluster
  • Finding the globally optimal partition would require exhaustive enumeration of all possible combinations, which is why k-means relies on an iterative heuristic instead

SLIDE 16

k-means clustering

  • Generates k clusters from n objects
  • Requires two inputs:
      • k: number of desired partitions
      • n objects
  • Uses random placement of initial clusters
  • Determines clustering results through an iteration technique to relocate objects to the most similar cluster:
      • similarity is defined as the distance between objects
      • objects that are closer to each other are more similar
  • Computational complexity of O(nkt), where t is the maximum number of iterations

SLIDE 17

k-means clustering: algorithm

1. Randomly select k objects to be the centers of k clusters
2. Assign each remaining object to the cluster to which it is the most similar
3. Re-calculate the cluster mean after all objects are (re)assigned
4. Re-evaluate all objects and place them in the cluster to which they are the most similar
5. Repeat Steps 3 and 4 until no changes have been made (full convergence) or the maximum number of iterations is reached (partial convergence)
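The five steps above can be sketched in a few lines. This is an illustrative implementation under stated assumptions, not the code used in the study: objects are tuples of numbers, similarity is Euclidean distance, and an empty cluster simply keeps its previous center. The function name, the `seed` parameter, and the sample data are hypothetical.

```python
import math
import random


def kmeans(objects, k, max_iter=100, seed=0):
    """Partition objects (tuples of numbers) into k clusters."""
    rng = random.Random(seed)
    centers = rng.sample(objects, k)           # step 1: random initial centers
    labels = None
    for _ in range(max_iter):
        # Steps 2 and 4: assign every object to its closest center.
        new_labels = [min(range(k), key=lambda c: math.dist(obj, centers[c]))
                      for obj in objects]
        if new_labels == labels:               # step 5: no changes, full convergence
            break
        labels = new_labels
        # Step 3: recompute each center as the mean of its members.
        for c in range(k):
            members = [obj for obj, lab in zip(objects, labels) if lab == c]
            if members:                        # an empty cluster keeps its old center
                centers[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centers


data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = kmeans(data, k=2)
```

Because the initial centers are random, the cluster numbering can differ between seeds even when the grouping itself is the same.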

SLIDE 18

Finding number of clusters

  • The number of clusters k is not known a priori
  • The k-means algorithm is repeated for different k values
  • The number of clusters is found by comparing the average SC value for various values of k:
      • average SC is calculated for all objects
      • the natural number of clusters k is found at the local maxima

SC: silhouette coefficient

SLIDE 19

Hierarchical clustering

  • Objects are grouped into a tree of clusters (dendrogram)
  • Two approaches are employed: agglomerative and divisive
  • Agglomerative approach (bottom-up):
      • each object begins in its own cluster
      • successive steps merge objects close to each other until all objects belong to one cluster or a termination condition is reached
  • Clusters are merged (or split) based on a distance measure
  • Four distance measures are commonly employed: minimum, maximum, mean, and average

SLIDE 20

Hierarchical clustering: algorithm

1. For n objects, an n x n similarity matrix is generated. Each value records the distance between two objects (or the number of identical values, if a series of values is used)
2. Objects are assigned to clusters from 1 to n
3. Each iteration merges the two clusters that are closest to each other (minimum distance value)
4. Repeat Step 3 until all objects are merged into a single cluster or until a termination condition is reached
5. Groups can be found by selecting k or by selecting a maximum merge distance
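The agglomerative procedure can be illustrated with a short sketch. It assumes the minimum (single-linkage) distance measure and stops when k clusters remain; a maximum merge distance could serve as the termination condition instead. The function name and the sample points are hypothetical.

```python
def agglomerative(points, k, distance):
    """Merge the two closest clusters until only k clusters remain."""
    clusters = [[p] for p in points]       # every object starts in its own cluster

    def linkage(a, b):
        # Minimum (single-linkage) distance between two clusters.
        return min(distance(p, q) for p in a for q in b)

    while len(clusters) > k:
        # Find the pair of clusters with the smallest linkage distance.
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


groups = agglomerative([0.0, 1.0, 10.0, 11.0, 50.0], k=3,
                       distance=lambda p, q: abs(p - q))
```

Recording the linkage distance at each merge would give the link heights needed to draw a dendrogram.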

SLIDE 21

Hierarchical clustering

  • Visualized by dendrograms
  • Determined by two choices:
      • desired number of clusters k
      • selected cutoff based on inconsistency coefficients:
          • the inconsistency coefficient is the difference between the height of a dendrogram link and the average height of links at the same level
          • links connecting two distinct clusters have a higher inconsistency coefficient

SLIDE 22

Dendrogram example

SLIDE 23

Dendrogram example

SLIDE 24

Traffic prediction: ARIMA model

  • Auto-Regressive Integrated Moving Average (ARIMA) model:
      • general model for forecasting time series
      • past values: AutoRegressive (AR) structure
      • effect of past random fluctuations: Moving Average (MA) process
  • The ARIMA model explicitly includes differencing
  • ARIMA (p, d, q):
      • autoregressive parameter: p
      • number of differencing passes: d
      • moving average parameter: q
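The differencing passes counted by d can be shown on a small example. This sketch covers only the "integrated" part of the model, not the fitting of the AR and MA terms; the `lag` argument is an assumption added so that the same helper can also illustrate differencing at a seasonal period.

```python
def difference(series, d=1, lag=1):
    """Apply d passes of differencing at the given lag."""
    for _ in range(d):
        series = [series[t] - series[t - lag] for t in range(lag, len(series))]
    return series


trend = [1, 3, 5, 7, 9]
once = difference(trend, d=1)     # linear trend becomes a constant: [2, 2, 2, 2]
twice = difference(trend, d=2)    # second pass removes the constant: [0, 0, 0]

# Hypothetical series with a period of 3 samples; lag-3 differencing removes the cycle.
cyclic = [5, 1, 2, 5, 1, 2, 5, 1, 2]
deseasonalized = difference(cyclic, d=1, lag=3)
```

Each pass shortens the series by `lag` samples, which is one reason d is kept small in practice.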

SLIDE 25

Traffic prediction: SARIMA model

  • Seasonal ARIMA is a variation of the ARIMA model
  • Seasonal ARIMA (SARIMA) model:
      • captures seasonal pattern
  • SARIMA additional model parameters:
      • seasonal period parameter: S
      • seasonal autoregressive parameter: P
      • number of seasonal differencing passes: D
      • seasonal moving average parameter: Q
  • Notation: $(p, d, q) \times (P, D, Q)_S$

SLIDE 26

SARIMA models: selection criteria

  • Order (p, d, q) selected based on:
      • time series plot of traffic data
      • autocorrelation and partial autocorrelation functions
  • Validity of parameter selection:
      • Akaike's information criterion (AIC)
      • corrected AIC (AICc)
      • Bayesian information criterion (BIC)
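The three criteria can be computed from a fitted model's maximized log-likelihood. The sketch below uses the standard textbook definitions (AIC = 2k − 2 ln L, AICc = AIC + 2k(k+1)/(n − k − 1), BIC = k ln n − 2 ln L); the function name and the numbers in the example are hypothetical and do not come from the traffic data.

```python
import math


def information_criteria(log_likelihood, k, n):
    """Return (AIC, AICc, BIC) for a model with k parameters fitted to n samples."""
    aic = 2 * k - 2 * log_likelihood
    aicc = aic + (2 * k * (k + 1)) / (n - k - 1)   # small-sample correction
    bic = k * math.log(n) - 2 * log_likelihood
    return aic, aicc, bic


# Hypothetical fit: log-likelihood of -100 with 3 parameters on 50 samples.
aic, aicc, bic = information_criteria(-100.0, k=3, n=50)
```

When comparing candidate orders, the model with the lowest value of the chosen criterion is preferred.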

SLIDE 27

Roadmap

  • Introduction
  • Traffic data and analysis tools:
      • data collection, statistical analysis, clustering tools, prediction analysis
  • Case studies:
      • satellite network: ChinaSat
      • packet data networks: Internet
      • public safety wireless network: E-Comm
  • Conclusions and references

SLIDE 28

ChinaSat data: analysis

  • Analysis of network traffic:
      • characteristics of TCP connections
      • network traffic patterns
      • statistical and cluster analysis of traffic
      • anomaly detection:
          • statistical methods
          • wavelets
          • principal component analysis

TCP: Transmission Control Protocol

SLIDE 29

Network and traffic data

  • ChinaSat: network architecture and TCP
  • Analysis of billing records:
      • aggregated traffic
      • user behavior
  • Analysis of tcpdump traces:
      • general characteristics
      • TCP options and operating system (OS) fingerprinting
      • network anomalies

SLIDE 30

DirecPC system diagram

SLIDE 31

Characteristics of satellite links

  • Large coverage area
  • High bandwidth
  • Long propagation delay
  • Large bandwidth-delay product
  • High bit error rates:
      • 10^-6 without error correction
      • 10^-3 or 10^-2 due to extreme weather and interference
  • Path asymmetry

SLIDE 32

Characteristics of satellite links

  • ChinaSat hybrid satellite network:
      • employs geosynchronous satellites deployed by Hughes Network Systems Inc.
      • provides data and television services:
          • DirecPC (Classic): unidirectional satellite data service
          • DirecTV: satellite television service
          • DirecWay (HughesNet): new bi-directional satellite data service that replaces DirecPC
      • DirecPC transmission rates:
          • 400 kb/s from satellite to user
          • 33.6 kb/s from user to network operations center (NOC) using dial-up
      • improves performance using TCP splitting with spoofing

SLIDE 33

ChinaSat data: analysis

  • ChinaSat traffic is self-similar and non-stationary
  • Hurst parameter differs depending on traffic load
  • Modeling of TCP connections:
      • inter-arrival time is best modeled by the Weibull distribution
      • number of downloaded bytes is best modeled by the lognormal distribution
  • The distribution of visited websites is best modeled by the discrete Gaussian exponential (DGX) distribution

SLIDE 34

ChinaSat data: analysis

  • Traffic prediction:
      • the autoregressive integrated moving average (ARIMA) model was successfully used to predict uploaded traffic (but not downloaded traffic)
      • the wavelet + autoregressive model outperforms the ARIMA model

Reference:

  • Q. Shao and Lj. Trajković, "Measurement and analysis of traffic in a hybrid satellite-terrestrial network," Proc. SPECTS 2004, San Jose, CA, July 2004, pp. 329–336.
SLIDE 35

Analysis of collected data

  • Analysis of patterns and statistical properties of two sets of data from the ChinaSat DirecPC network:
      • billing records
      • tcpdump traces
  • Billing records:
      • daily and weekly traffic patterns
      • user classification:
          • single and multi-variable k-means clustering
          • time series clustering using hierarchical clustering and an empirical approach

SLIDE 36

Analysis of collected data

  • Analysis of tcpdump traces:
      • protocols and applications
      • TCP options
      • operating system fingerprinting
      • network anomalies
  • The C program pcapread processes tcpdump files without using the packet capture library libpcap

SLIDE 37

Network anomalies

  • Scans and worms:
      • packets are sent to probe network hosts
      • used to discover and exploit resources
  • Denial of service:
      • a large number of packets is directed to a single destination
      • makes a host incapable of handling incoming connections or exhausts available bandwidth along paths to the destination

SLIDE 38

Network anomalies

  • Flash crowd:
      • a high volume of traffic is destined to a single destination
      • caused by breaking news or availability of new software
  • Traffic shift:
      • redirection of traffic from one set of paths to another
      • caused by route changes, link unavailability, or network congestion

SLIDE 39

Network anomalies

  • Alpha traffic:
      • unusually high volume of traffic between two endpoints
      • caused by file transfers or bandwidth measurements
  • Traffic volume anomalies:
      • significant deviation of traffic volume from usual daily or weekly patterns
      • classified as:
          • outages: caused by unavailable links, crashed servers, or routing problems
          • short-term increases in demand: caused by short-term events such as holiday traffic
      • involve multiple sources and destinations

SLIDE 40

Billing records

  • Records were collected during the continuous period from 23:00 on Oct. 31, 2002 to 11:00 on Jan. 10, 2003
  • Each file contains the hourly traffic summary for each user
  • Fields of interest:
      • SiteID (user identification)
      • Start (record start time)
      • CTxByt (number of bytes downloaded by a user)
      • CRxByt (number of bytes uploaded by a user)
      • CTxPkt (number of packets downloaded by a user)
      • CRxPkt (number of packets uploaded by a user)

SLIDE 41

Billing records: characteristics

  • 186 unique SiteIDs
  • Daily and weekly cycles:
      • lower traffic volume on weekends
      • the daily cycle starts at 7 AM, rises to three daily maxima at 11 AM, 3 PM, and 7 PM, then decreases monotonically until 7 AM
  • Highest daily traffic recorded on Dec. 24, 2002
  • Outage occurred on Jan. 3, 2003

SLIDE 42

Aggregated hourly traffic

SLIDE 43

Aggregated daily traffic

SLIDE 44

Daily diurnal traffic: average downloaded bytes

SLIDE 45

Weekly traffic: average downloaded bytes

SLIDE 46

Ranking of user traffic

  • Users are ranked according to the traffic volume
  • The top user downloaded 78.8 GB, uploaded 11.9 GB, and downloaded/uploaded ~205 million packets
  • Most users downloaded/uploaded little traffic
  • Cumulative distribution functions (CDFs) are constructed from the ranks:
      • the top user accounts for 11% of downloaded bytes
      • the top 25 users contributed 93.3% of downloaded bytes
      • the top 37 users contributed 99% of total traffic (packets and bytes)
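The construction of a CDF from ranked users can be sketched as follows. The function name and the byte counts are hypothetical placeholders, not values from the ChinaSat billing records.

```python
def cumulative_share(volumes):
    """Rank users by volume and return the cumulative fraction of total traffic."""
    ranked = sorted(volumes, reverse=True)
    total = sum(ranked)
    shares, running = [], 0
    for v in ranked:
        running += v
        shares.append(running / total)
    return shares


# Hypothetical per-user byte counts.
example = [50, 5, 30, 10, 5]
cdf = cumulative_share(example)   # top user: 0.5, top two users: 0.8, all users: 1.0
```

Reading the list at index r − 1 gives the fraction of traffic contributed by the top r users, which is how statements such as "top 25 users contributed 93.3%" are obtained.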

SLIDE 47

Cumulative distribution functions

SLIDE 48

k-means: clustering results

  • The natural number of clusters is k = 3 for downloaded and uploaded bytes
  • Most users belong to the group with small traffic volume
  • For k = 3:
      • 159 users in group 1 (average 0.0–16.8 MB downloaded per hour)
      • 24 users in group 2 (average 16.8–70.6 MB downloaded per hour)
      • 3 users in group 3 (average 70.6–110.7 MB downloaded per hour)

SLIDE 49

Three most common traffic patterns

  • Idle users:
      • rarely download/upload traffic
      • represented by zero traffic
  • Active users:
      • download/upload traffic for more than 18 hours a day
      • represented by traffic over 24 hours each day
  • Semi-active users:
      • download/upload traffic for 8–12 hours a day
      • represented by a daily cycle of 10 hours ACTIVE and 14 hours IDLE

SLIDE 50

Clustering results using three most common traffic patterns

Traffic pattern          Number of users
Idle                     162
Active                   16
Semi-active              8
Total number of users    186

SLIDE 51

tcpdump traces

  • Traces were continuously collected from 11:30 on Dec. 14, 2002 to 11:00 on Jan. 10, 2003 at the NOC
  • The first 68 bytes of each TCP/IP packet were captured
  • ~63 GB of data contained in 127 files
  • User IP address is not constant due to the use of the private IP address range and dynamic IP
  • Majority of traffic is TCP:
      • 94% of total bytes and 84% of total packets
      • WWW (port 80) accounts for 90% of TCP connections and 76% of TCP bytes
      • FTP (port 21) accounts for 0.2% of TCP connections and 11% of TCP bytes

SLIDE 52

OS fingerprinting results

  • Analyzed 9 hours of tcpdump trace on Dec. 14, 2002 using the open-source tool p0f.v2
  • Assumed constant IP addresses
  • Detected 171 users:
      • 137 users did not initiate any connections and cannot be identified (no SYN packets)
      • 14 users employ Microsoft Windows
      • 2 users employ Linux
      • 1 user employs an unknown OS (identified as an MSS-modifying proxy)

OS: operating system

SLIDE 53

Network anomalies

  • Tools: Ethereal/Wireshark, tcptrace, and pcapread
  • Four types of network anomalies were detected:
      • invalid TCP flag combinations
      • large number of TCP resets
      • UDP and TCP port scans
      • traffic volume anomalies

SLIDE 54

Analysis of TCP flags

TCP flag                                                        Packet count    % of Total
SYN only                                                        19,050,849      48.500
RST only                                                        7,440,418       18.900
FIN only                                                        12,679,619      32.300
*SYN+FIN                                                        408             0.001
*RST+FIN (no PSH)                                               85,571          0.200
*RST+PSH (no FIN)                                               18,111          0.050
*RST+FIN+PSH                                                    8,329           0.020
*Total number of packets with invalid TCP flag combinations     112,419         0.300
Total packet count                                              39,283,305      100.000
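The combinations marked with an asterisk can be detected with a simple bit test on the TCP flags field. The sketch below uses the standard flag bit values (FIN = 0x01, SYN = 0x02, RST = 0x04, PSH = 0x08); the function names are hypothetical and this is not the pcapread implementation.

```python
FIN, SYN, RST, PSH = 0x01, 0x02, 0x04, 0x08   # bit values in the TCP flags field


def is_invalid(flags):
    """Return True for the flag combinations marked with an asterisk."""
    if flags & SYN and flags & FIN:
        return True                      # SYN+FIN
    if flags & RST and flags & (FIN | PSH):
        return True                      # RST+FIN, RST+PSH, RST+FIN+PSH
    return False


def count_invalid(flag_values):
    """Count packets whose flags form an invalid combination."""
    return sum(1 for f in flag_values if is_invalid(f))


# Hypothetical flag bytes from four packets.
invalid_packets = count_invalid([SYN, SYN | FIN, RST | PSH, FIN])
```

Other bits, such as ACK, are ignored here, so a FIN+ACK packet is treated as valid.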

SLIDE 55

Large number of TCP resets

  • Connections are terminated by either TCP FIN or TCP RST:
      • 12,679,619 connections were terminated by FIN (63%)
      • 7,440,418 connections were terminated by RST (37%)
  • A large number of TCP RST packets indicates that connections are terminated in error conditions
  • TCP RST is employed by Microsoft Internet Explorer to terminate connections instead of TCP FIN

TCP: Transmission Control Protocol

SLIDE 56

UDP and TCP port scans

  • UDP port scans are found on UDP port 137 (NETBEUI)
  • TCP port scans are found on these TCP ports:

      Port     Service
      80       Hypertext transfer protocol (HTTP)
      139      NETBIOS extended user interface (NETBEUI)
      443      HTTP over secure socket layer (HTTPS)
      1433     Microsoft structured query language (MS SQL)
      27374    Subseven trojan

  • No HTTP(S) servers were active in the ChinaSat network
  • An MS SQL vulnerability was discovered in Oct. 2002, which may be the cause of scans on TCP port 1433
  • The Subseven trojan is a backdoor program used with malicious intent

TCP: Transmission Control Protocol
UDP: User Datagram Protocol

SLIDE 57

UDP port scans originating from the ChinaSat network

192.168.2.30:137 - 195.x.x.98:1025
192.168.2.30:137 - 202.x.x.153:1027
192.168.2.30:137 - 210.x.x.23:1035
192.168.2.30:137 - 195.x.x.42:1026
192.168.2.30:137 - 202.y.y.226:1026
192.168.2.30:137 - 218.x.x.238:1025
192.168.2.30:137 - 202.y.y.226:1025
192.168.2.30:137 - 202.y.y.226:1027
192.168.2.30:137 - 202.y.y.226:1028
192.168.2.30:137 - 202.y.y.226:1029
192.168.2.30:137 - 202.y.y.242:1026
192.168.2.30:137 - 61.x.x.5:1028
192.168.2.30:137 - 219.x.x.226:1025
192.168.2.30:137 - 213.x.x.189:1028
192.168.2.30:137 - 61.x.x.193:1025
192.168.2.30:137 - 202.y.y.207:1028
192.168.2.30:137 - 202.y.y.207:1025
192.168.2.30:137 - 202.y.y.207:1026
192.168.2.30:137 - 202.y.y.207:1027
192.168.2.30:137 - 64.x.x.148:1027

  • Client (192.168.2.30) source port (137) scans external network addresses at destination ports (1025-1040):
      • > 100 are recorded within a three-hour period
      • targeted IP addresses are variable
      • multiple ports are scanned per IP
      • may correspond to Bugbear, OpaSoft, or other worms
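A log in the "source - destination" form shown on this slide can be summarized per source to expose scan-like behavior. The sketch below is only an illustration: the function names and the threshold are hypothetical, and counting distinct destination endpoints is one possible heuristic, not necessarily the method used in the study.

```python
from collections import defaultdict


def scan_targets(log_lines):
    """Map each source endpoint to the set of distinct destination endpoints."""
    targets = defaultdict(set)
    for line in log_lines:
        source, destination = (part.strip() for part in line.split(" - "))
        targets[source].add(destination)
    return targets


def suspected_scanners(log_lines, threshold):
    """Return sources that contacted at least `threshold` distinct endpoints."""
    return {src for src, dsts in scan_targets(log_lines).items()
            if len(dsts) >= threshold}


sample = [
    "192.168.2.30:137 - 195.x.x.98:1025",
    "192.168.2.30:137 - 202.x.x.153:1027",
    "192.168.2.30:137 - 210.x.x.23:1035",
    "192.168.2.30:137 - 202.y.y.226:1026",
    "192.168.2.30:137 - 202.y.y.226:1025",
]
scanners = suspected_scanners(sample, threshold=5)
```

A real detector would also bound the observation window, for example the three-hour period mentioned on the slide.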

SLIDE 58

UDP port scans directed to the ChinaSat network

210.x.x.23:1035 - 192.168.1.121:137
210.x.x.23:1035 - 192.168.1.63:137
210.x.x.23:1035 - 192.168.2.11:137
210.x.x.23:1035 - 192.168.1.250:137
210.x.x.23:1035 - 192.168.1.25:137
210.x.x.23:1035 - 192.168.2.79:137
210.x.x.23:1035 - 192.168.1.52:137
210.x.x.23:1035 - 192.168.6.191:137
210.x.x.23:1035 - 192.168.1.241:137
210.x.x.23:1035 - 192.168.2.91:137
210.x.x.23:1035 - 192.168.1.5:137
210.x.x.23:1035 - 192.168.1.210:137
210.x.x.23:1035 - 192.168.6.127:137
210.x.x.23:1035 - 192.168.1.201:137
210.x.x.23:1035 - 192.168.6.179:137
210.x.x.23:1035 - 192.168.2.82:137
210.x.x.23:1035 - 192.168.1.239:137
210.x.x.23:1035 - 192.168.1.87:137
210.x.x.23:1035 - 192.168.1.90:137
210.x.x.23:1035 - 192.168.1.177:137
210.x.x.23:1035 - 192.168.1.39:137

  • External address (210.x.x.23) scans for port (137) (NETBEUI) response within the ChinaSat network from source port (1035):
      • > 200 are recorded within a three-hour period
      • targeted IP addresses are not sequential
      • may correspond to Bugbear, OpaSoft, or other worms

SLIDE 59

Detection of traffic volume anomalies using wavelets

Traffic is decomposed into various frequencies using

the wavelet transform

Traffic volume anomalies are identified by the large

variation in wavelet coefficient values

The coarsest scale level where the anomalies are found

indicates the time scale of an anomaly

SLIDE 60

Detection of traffic volume anomalies using wavelets

  • tcpdump traces are binned in terms of packets or bytes (each second)
  • A wavelet transform of 12 levels is employed to decompose the traffic
  • The coarsest level approximately represents the hourly traffic
  • Anomalies are:
      • detected with a moving window of size 20 and by calculating the mean and standard deviation (σ) of the wavelet coefficients in each window
      • identified when wavelet coefficients lie outside ± 3σ of the mean value
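The moving-window test on the wavelet coefficients can be sketched as follows. The slide does not say whether the tested coefficient belongs to its own window, so this sketch assumes each coefficient is compared against the 20 coefficients that precede it. The function name and the synthetic coefficients are hypothetical, and the wavelet decomposition itself is omitted.

```python
import statistics


def window_anomalies(coefficients, window=20, n_sigma=3.0):
    """Return indices whose value lies outside mean ± n_sigma·σ of the preceding window."""
    flagged = []
    for i in range(window, len(coefficients)):
        recent = coefficients[i - window:i]
        mean = statistics.mean(recent)
        sigma = statistics.stdev(recent)
        if abs(coefficients[i] - mean) > n_sigma * sigma:
            flagged.append(i)
    return flagged


# Synthetic coefficients (not ChinaSat data): small oscillation with one spike at index 30.
coefficients = [1.0, -1.0] * 15 + [50.0] + [1.0, -1.0] * 5
anomalies = window_anomalies(coefficients)
```

Once a spike enters the window it inflates σ, so the coefficients that follow it are not flagged; this is a known side effect of windowed thresholds.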

SLIDE 61

Wavelet approximate coefficients

SLIDE 62

Wavelet detail coefficients: d9

SLIDE 63

Wavelet detail coefficients: d8

SLIDE 64

Roadmap

  • Introduction
  • Traffic data and analysis tools:
      • data collection, statistical analysis, clustering tools, prediction analysis
  • Case studies:
      • satellite network: ChinaSat
      • packet data network: Internet
      • public safety wireless network: E-Comm
  • Conclusions and references

SLIDE 65

Autonomous System (AS)

  • The Internet is a network of Autonomous Systems:
      • groups of networks sharing the same routing policy
      • identified with Autonomous System Numbers (ASN)
  • Autonomous System Numbers: http://www.iana.org/assignments/as-numbers
  • Internet topology on AS-level:
      • the arrangement of ASs and their interconnections
  • Border Gateway Protocol (BGP):
      • inter-AS protocol used to exchange network reachability information among BGP systems
      • reachability information is stored in routing tables

SLIDE 66

Internet AS-level data

  • Sources of data are routing tables:
      • Route Views: http://www.routeviews.org
          • most participating ASs reside in North America
      • RIPE (Réseaux IP européens): http://www.ripe.net/ris
          • most participating ASs reside in Europe

SLIDE 67

Internet AS-level data

  • Data used in prior research (partial list):

      Study                Route Views    RIPE
      Faloutsos, 1999      Yes            No
      Chang, 2001          Yes            Yes
      Vukadinovic, 2001    Yes            No
      Mihail, 2003         Yes            Yes

  • Research results have been used in developing Internet simulation tools:
      • power-laws are employed to model and generate Internet topologies: BA model, BRITE, Inet2

SLIDE 68

Data sets

  • Emerging concerns about the use of the two datasets:
      • different observations about AS degrees:
          • power-law distribution: Route Views [Faloutsos, 1999]
          • Weibull distribution: Route Views + RIPE [Chang, 2001]
      • data completeness:
          • the RIPE dataset contains ~40% more AS connections and 2% more ASs than Route Views [Chang, 2001]

SLIDE 69

Route Views and RIPE: statistics

Number of      Route Views    RIPE
AS paths       6,398,912      6,375,028
Probed ASs     15,418         15,433
AS pairs       34,878         35,225

  • AS pair: a pair of connected ASs
  • 15,369 probed ASs (99.7%) in both datasets are identical
  • 29,477 AS pairs in Route Views (85%) and in RIPE (84%) are identical
  • Route Views and RIPE samples were collected on May 30, 2003

SLIDE 70

Core ASs

  • Core ASs: ASs with largest degrees
  • 16 of the core ASs in Route Views and RIPE are identical
  • Core ASs in Route Views have larger degrees than core ASs in RIPE

          Route Views        RIPE
Rank      AS       Degree    AS       Degree
1         701      2595      701      2448
2         1239     2569      1239     1784
3         7018     1999      7018     1638
4         3561     1036      209      861
5         1        999       3561     705
6         209      863       3356     673
7         3356     662       3549     612
8         3549     617       702      580
9         702      562       2914     561
10        2914     556       1        489
11        6461     498       4589     482
12        4513     468       6461     476
13        4323     315       8220     450
14        16631    294       3303     429
15        6347     291       13237    412
16        8220     289       6730     313
17        3257     277       4323     305
18        4766     263       3257     305
19        3786     263       16631    296
20        7132     258       6347     281

SLIDE 71

Spectral analysis of graphs

  • Normalized Laplacian matrix N(G) [Chung, 1997]:

      $N(i, j) = \begin{cases} 1 & \text{if } i = j \text{ and } d_i \neq 0 \\ -\dfrac{1}{\sqrt{d_i d_j}} & \text{if } i \text{ and } j \text{ are adjacent} \\ 0 & \text{otherwise} \end{cases}$

    where d_i and d_j are the degrees of nodes i and j, respectively
  • The second smallest eigenvalue [Fiedler, 1973]
  • The largest eigenvalue [Chung, 1997]
  • Characteristic valuation [Fiedler, 1975]
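The normalized Laplacian can be assembled directly from an adjacency list using its standard definition: 1 on the diagonal for nodes with non-zero degree, −1/√(d_i d_j) for adjacent nodes i and j, and 0 otherwise. The function name and the three-node graph below are hypothetical; computing the eigenvalues and eigenvectors would require a numerical library and is omitted here.

```python
import math


def normalized_laplacian(adjacency):
    """Build N(G) from a dict mapping each node to the set of its neighbours."""
    nodes = sorted(adjacency)
    degree = {v: len(adjacency[v]) for v in nodes}
    matrix = []
    for i in nodes:
        row = []
        for j in nodes:
            if i == j and degree[i] != 0:
                row.append(1.0)
            elif j in adjacency[i]:
                row.append(-1.0 / math.sqrt(degree[i] * degree[j]))
            else:
                row.append(0.0)
        matrix.append(row)
    return nodes, matrix


# Hypothetical three-AS path: AS1 - AS2 - AS3.
graph = {"AS1": {"AS2"}, "AS2": {"AS1", "AS3"}, "AS3": {"AS2"}}
nodes, N = normalized_laplacian(graph)
```

The matrix is symmetric for an undirected graph, so its eigenvalues are real and can be sorted to pick out the second smallest and the largest.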

SLIDE 72

Characteristic valuation: example

  • Elements of the eigenvector associated with the second smallest eigenvalue: 0.1, 0.3, -0.2, 0
  • Assigned to ASs: AS1 (0.1), AS2 (0.3), AS3 (-0.2), AS4 (0)
  • ASs sorted by element value: AS3, AS4, AS1, AS2
  • AS3 and AS1 are connected

[Figure: connectivity status versus index of elements]

SLIDE 73

Spectral analysis of topology data

  • Consider only ASs with the first 30,000 assigned AS numbers
  • AS degree distribution in Route Views and RIPE datasets:
SLIDE 74

[Figure panels. Before the sort: (a) RouteViews_original, (b) RIPE_original. After the sort: (c) RouteViews_min, (d) RIPE_min]

SLIDE 75

[Figure panels. Before the sort: (a) RouteViews_original, (b) RIPE_original. After the sort: (c) RouteViews_max, (d) RIPE_max]

SLIDE 76

Data analysis results

  • The eigenvector associated with the second smallest eigenvalue:
      • separates connected ASs from disconnected ASs
      • Route Views and RIPE datasets are similar on a coarser scale
  • The eigenvector associated with the largest eigenvalue:
      • reveals highly connected clusters
      • Route Views and RIPE datasets differ on a finer scale

SLIDE 77

Observations

  • The two datasets are similar on coarse scales:
      • number of ASs, number of AS connections, core ASs
  • They exhibit different clustering characteristics:
      • Route Views data contain larger AS clusters
      • core ASs in Route Views have larger degrees than core ASs in RIPE
      • core ASs in Route Views connect a larger number of smaller ASs

SLIDE 78

Roadmap

  • Introduction
  • Traffic data and analysis tools:
      • data collection, statistical analysis, clustering tools, prediction analysis
  • Case studies:
      • satellite network: ChinaSat
      • packet data network: Internet
      • public safety wireless network: E-Comm
  • Conclusions and references

SLIDE 79

Case study: E-Comm network

  • E-Comm network: an operational trunked radio system serving as a regional emergency communication system
  • The E-Comm network is capable of both voice and data transmissions
  • Voice traffic accounts for over 99% of network traffic
  • A group call is a standard call made in a trunked radio system
  • More than 85% of calls are group calls
  • A distributed event log database records every event occurring in the network: call establishment, channel assignment, call drop, and emergency call

slide-80
SLIDE 80

July 19-20, 2007 IWCSN 2007, Guilin, China 80

E-Comm network: coverage and user agencies

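[Figure: E-Comm coverage by user agency (RCMP and Police, Ambulance, Fire, Other), and an example user hierarchy in which Agency 1 (Police) and Agency 2 (Fire Dept.) contain talk groups TG 1, TG 2, TG 3, TG 4, ..., TG n, which in turn contain radio devices R1 to R8, ...]

TG: talk group; R: radio device (user)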

slide-81
SLIDE 81

July 19-20, 2007 IWCSN 2007, Guilin, China 81

E-Comm network architecture

[Figure: E-Comm network architecture. Labels: Burnaby, Vancouver, other EDACS systems, PSTN, PBX, dispatch console, users, database server, data gateway, management console, transmitters/repeaters, network switch.]
slide-82
SLIDE 82

July 19-20, 2007 IWCSN 2007, Guilin, China 82

Traffic data

2001 data set:

  • 2 days of traffic data
  • 2001-11-01 to 2001-11-02 (110,348 calls)

2002 data set:

  • 28 days of continuous traffic data
  • 2002-02-10 to 2002-03-09 (1,916,943 calls)

2003 data set:

  • 92 days of continuous traffic data
  • 2003-03-01 to 2003-05-31 (8,756,930 calls)

slide-83
SLIDE 83

July 19-20, 2007 IWCSN 2007, Guilin, China 83

Observations

Presence of daily cycles:

  • minimum utilization: ~2 PM
  • maximum utilization: 9 PM to 3 AM

2002 sample data:

  • cell 5 is the busiest
  • others seldom reach their capacities

2003 sample data:

  • several cells (2, 4, 7, and 9) have all channels occupied during busy hours
slide-84
SLIDE 84

July 19-20, 2007 IWCSN 2007, Guilin, China 84

Network utilization

  • OPNET-based simulation of two weeks of network activity
  • Network utilization exhibits daily cycles
  • Between February 2002 and March 2003:
      • number of calls increased by ~60%
      • average utilization increased non-uniformly across the network
  • Several cells may become congested in the future

  • N. Cackov, B. Vujičić, S. Vujičić, and Lj. Trajković, “Using network activity data to model the utilization of a trunked radio system,” in Proc. SPECTS 2004, San Jose, CA, July 2004, pp. 517–524.
  • N. Cackov, J. Song, B. Vujičić, S. Vujičić, and Lj. Trajković, “Simulation of a public safety wireless networks: a case study,” Simulation, vol. 81, no. 8, pp. 571–585, Aug. 2005.

slide-85
SLIDE 85

July 19-20, 2007 IWCSN 2007, Guilin, China 85

Performance analysis

  • Modeling and Performance Analysis of Public Safety Wireless Networks
  • WarnSim: a simulator for public safety wireless networks (PSWN)
      • Traffic data analysis
      • Traffic modeling
      • Simulation and prediction

  • J. Song and Lj. Trajković, “Modeling and performance analysis of public safety wireless networks,” in Proc. IEEE IPCCC, Phoenix, AZ, Apr. 2005, pp. 567–572.
slide-86
SLIDE 86

July 19-20, 2007 IWCSN 2007, Guilin, China 86

WarnSim overview

  • Simulators such as OPNET, ns-2, and JSim are designed for packet-switched networks
  • WarnSim is a simulator developed for circuit-switched networks, such as PSWN
  • WarnSim:
      • publicly available simulator: http://www.vannet.ca/warnsim
      • effective, flexible, and easy to use
      • developed using Microsoft Visual C# .NET
      • operates on Windows platforms
slide-87
SLIDE 87

July 19-20, 2007 IWCSN 2007, Guilin, China 87

Call arrival rate in 2002 and 2003: cyclic patterns

  • the busiest hour is around midnight
  • the busiest day is Thursday
  • useful for scheduling periodical maintenance tasks

[Figures: number of calls versus time of day (hours 1 to 24) and number of calls versus day of the week (Sat. to Fri.), 2002 data and 2003 data.]

slide-88
SLIDE 88

July 19-20, 2007 IWCSN 2007, Guilin, China 88

Modeling and characterization of traffic

  • We analyzed voice traffic from a public safety wireless network in Vancouver, BC:
      • call inter-arrival and call holding times during five busy hours from each year (2001, 2002, 2003)
  • Statistical distribution and the autocorrelation function of the traffic traces:
      • Kolmogorov-Smirnov goodness-of-fit test
      • autocorrelation functions
      • wavelet-based estimation of the Hurst parameter

  • B. Vujičić, N. Cackov, S. Vujičić, and Lj. Trajković, “Modeling and characterization of traffic in public safety wireless networks,” in Proc. SPECTS 2005, Philadelphia, PA, July 2005, pp. 214–223.
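As an illustration of the autocorrelation step, the sketch below computes the sample autocorrelation function of a series such as call holding times. It is a minimal example assuming the standard biased estimator; the function name and the toy series are hypothetical and are not taken from the E-Comm traces.

```python
def autocorrelation(series, max_lag):
    """Sample autocorrelation r(k) for k = 0..max_lag (biased estimator)."""
    n = len(series)
    mean = sum(series) / n
    centered = [value - mean for value in series]
    variance_sum = sum(c * c for c in centered)
    return [
        sum(centered[t] * centered[t + k] for t in range(n - k)) / variance_sum
        for k in range(max_lag + 1)
    ]


# Hypothetical toy series that alternates between two values.
toy_series = [1.0, 2.0, 1.0, 2.0, 1.0, 2.0, 1.0, 2.0]
acf = autocorrelation(toy_series, max_lag=3)
print(acf)
```

Values that stay close to zero for all lags greater than zero suggest an uncorrelated trace, while a slowly decaying function is one of the indications of long-range dependence.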

slide-89
SLIDE 89

July 19-20, 2007 IWCSN 2007, Guilin, China 89

Erlang traffic models

  • P_B: probability of rejecting a call
  • P_C: probability of delaying a call
  • N: number of channels/lines
  • A: total traffic volume

Erlang B:

$$P_B = \frac{\dfrac{A^N}{N!}}{\sum_{x=0}^{N} \dfrac{A^x}{x!}}$$

Erlang C:

$$P_C = \frac{\dfrac{A^N}{N!}\,\dfrac{N}{N-A}}{\sum_{x=0}^{N-1} \dfrac{A^x}{x!} + \dfrac{A^N}{N!}\,\dfrac{N}{N-A}}$$

slide-90
SLIDE 90

July 19-20, 2007 IWCSN 2007, Guilin, China 90

Erlang models

Erlang B model assumes:

  • call holding time follows exponential distribution
  • blocked call will be rejected immediately

Erlang C model assumes:

  • call holding time follows exponential distribution
  • blocked call will be put into a FIFO queue with infinite size
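The Erlang B and Erlang C probabilities can be evaluated numerically. The sketch below is a minimal illustration that uses the standard recursion for Erlang B, which avoids large factorials, and derives Erlang C from it; the function names are hypothetical, and A is assumed to be expressed in Erlangs.

```python
def erlang_b(traffic, channels):
    """Blocking probability P_B for offered traffic A (Erlangs) and N channels."""
    blocking = 1.0  # P_B with zero channels
    for k in range(1, channels + 1):
        blocking = traffic * blocking / (k + traffic * blocking)
    return blocking


def erlang_c(traffic, channels):
    """Delay probability P_C; requires A < N so that the queue is stable."""
    if traffic >= channels:
        raise ValueError("Erlang C requires A < N")
    blocking = erlang_b(traffic, channels)
    return channels * blocking / (channels - traffic * (1.0 - blocking))


print(erlang_b(2.0, 2))  # 0.4 for A = 2, N = 2
print(erlang_c(1.0, 2))  # 1/3 for A = 1, N = 2
```

For the same A and N, P_C is larger than P_B because calls that would be rejected under Erlang B remain in the queue under Erlang C.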

slide-91
SLIDE 91

July 19-20, 2007 IWCSN 2007, Guilin, China 91

Kolmogorov-Smirnov test

  • Goodness-of-fit test: quantitative decision whether the empirical cumulative distribution function (ECDF) of a set of observations is consistent with a random sample from an assumed theoretical distribution
  • ECDF is a step function (step size 1/N) of N ordered data points $Y_1, Y_2, \ldots, Y_N$:

$$E_N = \frac{n(i)}{N}$$

  • $n(i)$: the number of data samples with values smaller than $Y_i$
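The comparison between the ECDF and a candidate distribution can be sketched as follows. This is a minimal illustration of the one-sample K-S statistic, the largest vertical distance between the ECDF steps and the theoretical CDF; it does not compute p-values, and the function names are hypothetical.

```python
import math


def ks_statistic(samples, cdf):
    """Largest distance between the ECDF of the samples and a theoretical CDF."""
    ordered = sorted(samples)
    n = len(ordered)
    distance = 0.0
    for i, y in enumerate(ordered, start=1):
        f = cdf(y)
        # The ECDF jumps from (i - 1)/n to i/n at the i-th ordered point.
        distance = max(distance, i / n - f, f - (i - 1) / n)
    return distance


def exponential_cdf(rate):
    """CDF of an exponential distribution with the given rate (1/mean)."""
    return lambda y: 1.0 - math.exp(-rate * y)


# Hypothetical inter-arrival times in seconds, compared with an exponential model.
times = [0.2, 0.5, 0.7, 1.1, 1.9]
rate = len(times) / sum(times)
print(ks_statistic(times, exponential_cdf(rate)))
```

A small statistic means the candidate distribution tracks the ECDF closely; the test decision then depends on the sample size and the chosen significance level.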

slide-92
SLIDE 92

July 19-20, 2007 IWCSN 2007, Guilin, China 92

Traffic data

Records of network events:

  • established, queued, and dropped calls in the Vancouver cell

Traffic data span periods during 2001, 2002, and 2003:

Trace (dataset)   Time span             No. of established calls
2001              November 1–2, 2001    110,348
2002              March 1–7, 2002       370,510
2003              March 24–30, 2003     387,340

slide-93
SLIDE 93

July 19-20, 2007 IWCSN 2007, Guilin, China 93

Hourly traces

Call holding and call inter-arrival times from the five busiest hours in each dataset (2001, 2002, and 2003)

2001                              2002                              2003
Day/hour                  No.     Day/hour                  No.     Day/hour                  No.
02.11.2001 15:00–16:00    3,718   01.03.2002 04:00–05:00    4,436   26.03.2003 22:00–23:00    4,919
01.11.2001 00:00–01:00    3,707   01.03.2002 22:00–23:00    4,314   25.03.2003 23:00–24:00    4,249
02.11.2001 16:00–17:00    3,492   01.03.2002 23:00–24:00    4,179   26.03.2003 23:00–24:00    4,222
01.11.2001 19:00–20:00    3,312   01.03.2002 00:00–01:00    3,971   29.03.2003 02:00–03:00    4,150
02.11.2001 20:00–21:00    3,227   02.03.2002 00:00–01:00    3,939   29.03.2003 01:00–02:00    4,097

slide-94
SLIDE 94

July 19-20, 2007 IWCSN 2007, Guilin, China 94

Example: March 26, 2003

[Figure: call holding times (s) versus time (hh:mm:ss) from 22:18:00 to 22:19:00, with the call inter-arrival time marked.]

slide-95
SLIDE 95

July 19-20, 2007 IWCSN 2007, Guilin, China 95

Statistical distributions

  • Fourteen candidate distributions:
      • exponential, Weibull, gamma, normal, lognormal, logistic, log-logistic, Nakagami, Rayleigh, Rician, t-location scale, Birnbaum-Saunders, extreme value, inverse Gaussian
  • Parameters of the distributions: calculated by performing maximum likelihood estimation
  • Best fitting distributions are determined by:
      • visual inspection of the distribution of the trace and the candidate distributions
      • K-S test on potential candidates
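For two of the candidates, the maximum likelihood estimates have closed forms, which the sketch below uses. It is a minimal illustration with hypothetical function names and toy data; the other candidate distributions generally require numerical optimization, which is not shown.

```python
import math


def fit_exponential(samples):
    """MLE of the exponential rate: the reciprocal of the sample mean."""
    return len(samples) / sum(samples)


def fit_lognormal(samples):
    """MLE of lognormal (mu, sigma): mean and std. dev. of the log-samples."""
    logs = [math.log(value) for value in samples]
    mu = sum(logs) / len(logs)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in logs) / len(logs))
    return mu, sigma


def loglik_exponential(samples, rate):
    """Log-likelihood of the samples under an exponential model."""
    return len(samples) * math.log(rate) - rate * sum(samples)


def loglik_lognormal(samples, mu, sigma):
    """Log-likelihood of the samples under a lognormal model."""
    return sum(
        -math.log(x * sigma * math.sqrt(2.0 * math.pi))
        - (math.log(x) - mu) ** 2 / (2.0 * sigma ** 2)
        for x in samples
    )


# Hypothetical call holding times in seconds.
holding = [2.1, 3.4, 3.9, 4.4, 5.0, 7.8]
rate = fit_exponential(holding)
mu, sigma = fit_lognormal(holding)
print(loglik_exponential(holding, rate), loglik_lognormal(holding, mu, sigma))
```

Comparing log-likelihoods is only a rough screening step; the goodness-of-fit decision in this section relies on visual inspection and the K-S test.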

slide-96
SLIDE 96

July 19-20, 2007 IWCSN 2007, Guilin, China 96

Call inter-arrival times: pdf candidates

[Figure: probability density versus call inter-arrival time (s). Curves: traffic data, exponential model, lognormal model, Weibull model, gamma model, Rayleigh model, normal model.]

slide-97
SLIDE 97

July 19-20, 2007 IWCSN 2007, Guilin, China 97

Call inter-arrival times: K-S test results (2003 data)

h: test decision (1 = null hypothesis rejected); p: p-value; k: K-S test statistic

Distribution  Parameter  26.03.2003   25.03.2003   26.03.2003   29.03.2003   29.03.2003
                         22:00–23:00  23:00–24:00  23:00–24:00  02:00–03:00  01:00–02:00
Exponential   h          1 for four of the five hours
              p          0.0027       0.0469       0.4049       0.0316       0.1101
              k          0.0283       0.0214       0.0137       0.0205       0.0185
Weibull       h          0            0            0            0            0
              p          0.4885       0.4662       0.2065       0.286        0.2337
              k          0.0130       0.0133       0.0164       0.014        0.0159
Gamma         h          0            0            0            0            0
              p          0.3956       0.3458       0.127        0.145        0.1672
              k          0.0139       0.0146       0.0181       0.0163       0.0171
Lognormal     h          1            1            1            1            1
              p          1.015E-20    4.717E-15    2.97E-16     3.267E-23    4.851E-21
              k          0.0689       0.0629       0.0657       0.0795       0.0761

slide-98
SLIDE 98

July 19-20, 2007 IWCSN 2007, Guilin, China 98

Call inter-arrival times: best-fitting distributions (cdf)

[Figure: cumulative distribution versus call inter-arrival time (s). Curves: traffic data, exponential model, Weibull model, gamma model.]

slide-99
SLIDE 99

July 19-20, 2007 IWCSN 2007, Guilin, China 99

Call inter-arrival times: estimates of H

Traces pass the test for time constancy of α:

  • estimates of H are reliable

2001                              2002                              2003
Day/hour                  H       Day/hour                  H       Day/hour                  H
02.11.2001 15:00–16:00    0.907   01.03.2002 04:00–05:00    0.679   26.03.2003 22:00–23:00    0.788
01.11.2001 00:00–01:00    0.802   01.03.2002 22:00–23:00    0.757   25.03.2003 23:00–24:00    0.832
02.11.2001 16:00–17:00    0.770   01.03.2002 23:00–24:00    0.780   26.03.2003 23:00–24:00    0.699
01.11.2001 19:00–20:00    0.774   01.03.2002 00:00–01:00    0.741   29.03.2003 02:00–03:00    0.696
02.11.2001 20:00–21:00    0.663   02.03.2002 00:00–01:00    0.747   29.03.2003 01:00–02:00    0.705

slide-100
SLIDE 100

July 19-20, 2007 IWCSN 2007, Guilin, China 100

Call holding times: pdf candidates

[Figure: probability density versus call holding time (s). Curves: traffic data, lognormal model, gamma model, Weibull model, exponential model, normal model, Rayleigh model.]

slide-101
SLIDE 101

July 19-20, 2007 IWCSN 2007, Guilin, China 101

Call holding times: best-fitting distributions (cdf)

[Figure: cumulative distribution versus call holding time (s). Curves: traffic data, lognormal model, exponential model, gamma model, Weibull model.]

slide-102
SLIDE 102

July 19-20, 2007 IWCSN 2007, Guilin, China 102

Call holding times: K-S test results (2003 data)

  • No distribution passes the test when the entire trace is tested (significance levels = 0.1 and 0.01)
  • Lognormal distribution passes the test (significance level = 0.01) for:
      • 5-6 of 15 randomly chosen 1,000-sample sub-traces
      • almost all 500-sample sub-traces
  • Test rejects the null hypothesis when the sub-traces are compared with candidate distributions:
      • exponential
      • Weibull
      • gamma

slide-103
SLIDE 103

July 19-20, 2007 IWCSN 2007, Guilin, China 103

Call holding times: estimates of H

2001                              2002                              2003
Day/hour                  H       Day/hour                  H       Day/hour                  H
02.11.2001 15:00–16:00    0.493   01.03.2002 04:00–05:00    0.490   26.03.2003 22:00–23:00    0.483
01.11.2001 00:00–01:00    0.471   01.03.2002 22:00–23:00    0.460   25.03.2003 23:00–24:00    0.483
02.11.2001 16:00–17:00    0.462   01.03.2002 23:00–24:00    0.489   26.03.2003 23:00–24:00    0.463 *
01.11.2001 19:00–20:00    0.467   01.03.2002 00:00–01:00    0.508   29.03.2003 02:00–03:00    0.526
02.11.2001 20:00–21:00    0.479   02.03.2002 00:00–01:00    0.503   29.03.2003 01:00–02:00    0.466

  • All traces except one pass the test for constancy of α
  • only one unreliable estimate (*): consistent value
slide-104
SLIDE 104

July 19-20, 2007 IWCSN 2007, Guilin, China 104

Call inter-arrival and call holding times

Day/hour                  Avg. inter-arrival (s)   Avg. holding (s)
2001
02.11.2001 15:00–16:00    0.97                     3.78
01.11.2001 00:00–01:00    0.97                     3.95
02.11.2001 16:00–17:00    1.03                     3.99
01.11.2001 19:00–20:00    1.09                     3.97
02.11.2001 20:00–21:00    1.12                     3.84
2002
01.03.2002 04:00–05:00    0.81                     4.07
01.03.2002 22:00–23:00    0.83                     3.84
01.03.2002 23:00–24:00    0.86                     3.88
01.03.2002 00:00–01:00    0.91                     3.95
02.03.2002 00:00–01:00    0.91                     4.06
2003
26.03.2003 22:00–23:00    0.73                     4.08
25.03.2003 23:00–24:00    0.85                     4.12
26.03.2003 23:00–24:00    0.85                     4.04
29.03.2003 02:00–03:00    0.87                     4.14
29.03.2003 01:00–02:00    0.88                     4.25

  • Avg. call inter-arrival times: 1.08 s (2001), 0.86 s (2002), 0.84 s (2003)
  • Avg. call holding times: 3.91 s (2001), 3.96 s (2002), 4.13 s (2003)
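The averages above connect directly to the call counts and to the traffic volume A used in the Erlang models. The sketch below is a worked example under stated assumptions: the average inter-arrival time in a busy hour is taken as 3,600 s divided by the number of calls, and the offered load is taken as the average holding time divided by the average inter-arrival time. The function names are hypothetical, and the result is only a rough load figure because it ignores how calls are distributed over channels.

```python
def mean_interarrival_seconds(calls_in_hour):
    """Average time between call arrivals within a one-hour trace."""
    return 3600.0 / calls_in_hour


def offered_load_erlangs(mean_holding_s, mean_interarrival_s):
    """Offered load A = arrival rate x mean holding time (in Erlangs)."""
    return mean_holding_s / mean_interarrival_s


# Busiest 2003 hour (26.03.2003 22:00-23:00): 4,919 calls, 4.08 s average holding time.
interarrival = mean_interarrival_seconds(4919)
print(round(interarrival, 2))                        # 0.73
print(round(offered_load_erlangs(4.08, 0.73), 2))    # 5.59
```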
slide-105
SLIDE 105

July 19-20, 2007 IWCSN 2007, Guilin, China 105

Busy hour: best fitting distributions

                          Call inter-arrival times                Call holding times
                          Weibull            Gamma                Lognormal
Busy hour                 a        b         a        b           µ        σ
02.11.2001 15:00–16:00    0.9785   1.1075    1.0326   0.9407      1.0913   0.6910
01.11.2001 00:00–01:00    0.9907   1.0517    1.0818   0.8977      1.0801   0.7535
02.11.2001 16:00–17:00    1.0651   1.0826    1.1189   0.9238      1.1432   0.6803
01.03.2002 04:00–05:00    0.8313   1.0603    1.1096   0.7319      1.1746   0.6671
01.03.2002 22:00–23:00    0.8532   1.0542    1.0931   0.7643      1.1157   0.6565
01.03.2002 23:00–24:00    0.8877   1.0790    1.1308   0.7623      1.1096   0.6803
26.03.2003 22:00–23:00    0.7475   1.0475    1.0910   0.6724      1.1838   0.6553
25.03.2003 23:00–24:00    0.8622   1.0376    1.0762   0.7891      1.1737   0.6715
26.03.2003 23:00–24:00    0.8579   1.0092    1.0299   0.8292      1.1704   0.6696
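The fitted parameters can be sanity-checked against the measured averages by computing the mean of each fitted distribution. The sketch below assumes a parameter convention that is not stated in the table: Weibull with scale a and shape b, gamma with shape a and scale b, and lognormal with log-mean µ and log-standard-deviation σ. The function names are hypothetical.

```python
import math


def weibull_mean(scale, shape):
    """Mean of a Weibull distribution: scale * Gamma(1 + 1/shape)."""
    return scale * math.gamma(1.0 + 1.0 / shape)


def gamma_mean(shape, scale):
    """Mean of a gamma distribution: shape * scale."""
    return shape * scale


def lognormal_mean(mu, sigma):
    """Mean of a lognormal distribution: exp(mu + sigma^2 / 2)."""
    return math.exp(mu + sigma ** 2 / 2.0)


# Parameters reported for the busy hour 26.03.2003 22:00-23:00.
print(weibull_mean(0.7475, 1.0475))    # inter-arrival time, Weibull fit
print(gamma_mean(1.0910, 0.6724))      # inter-arrival time, gamma fit
print(lognormal_mean(1.1838, 0.6553))  # holding time, lognormal fit
```

Under these assumed conventions, the inter-arrival means land near the 0.73 s average and the holding-time mean lands near the 4.08 s average reported for that hour.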

slide-106
SLIDE 106

July 19-20, 2007 IWCSN 2007, Guilin, China 106

Traffic prediction

  • E-Comm network and traffic data:
      • data preprocessing and extraction
  • Data clustering
  • Traffic prediction:
      • based on aggregate traffic
      • cluster based

  • H. Chen and Lj. Trajković, “Trunked radio systems: traffic prediction based on user clusters,” in Proc. IEEE ISWCS 2004, Mauritius, Sept. 2004, pp. 76–80.
  • B. Vujičić, L. Chen, and Lj. Trajković, “Prediction of traffic in a public safety network,” in Proc. ISCAS 2006, Kos, Greece, May 2006, pp. 2637–2640.

slide-107
SLIDE 107

July 19-20, 2007 IWCSN 2007, Guilin, China 107

Traffic data: preprocessing

  • Collected data contain continuous data records from 92 days: March 1st 2003 – May 31st 2003
  • Original database: ~6 GBytes, with 44,786,489 record rows:
      • contains event log tables recording network activities
      • aggregated from distributed database of individual network management systems
      • sorted in 92 event log tables, each containing one day’s events
  • 9 (out of original 26) fields are of interest for our analysis

slide-108
SLIDE 108

July 19-20, 2007 IWCSN 2007, Guilin, China 108

Traffic data: preprocessing

  • Data pre-processing:
      • cleaning the database
      • filtering the outliers
      • removing redundant records
      • extracting accurate user calling activity
  • After the data cleaning and extraction, the number of records was reduced to only 19% of the original records

slide-109
SLIDE 109

July 19-20, 2007 IWCSN 2007, Guilin, China 109

Traffic data: sample

Date        Time      Ms   Duration  Sys_id  Chl_id  Caller  Callee  C_type  C_state  Multi
2003-03-20  00:00:01  450  3730      8       4       6155    1801
2003-03-20  00:00:01  469  3730      6       7       6155    1801
2003-03-20  00:00:01  560  3730      3       7       6155    1801
2003-03-20  00:00:01  570  3730      2       7       6155    1801
2003-03-20  00:00:01  640  3730      1       7       6155    1801
2003-03-20  00:00:01  880  5260      9       6       13314   251
2003-03-20  00:00:01  910  5260      7       6       13314   251
2003-03-20  00:00:01  970  5260      6       8       13314   251
2003-03-20  00:00:01  980  2520      7       7       13911   418
2003-03-20  00:00:02  29   5270      4       2       13314   251
2003-03-20  00:00:02  109  5260      2       8       13314   251
2003-03-20  00:00:02  139  5270      1       8       13314   251
2003-03-20  00:00:02  9    2510      6       1       13911   418
2003-03-20  00:00:02  149  2510      2       9       13911   418
2003-03-20  00:00:05  289  3560      8       5       6011    2035
2003-03-20  00:00:05  309  3550      6       3       6011    2035
2003-03-20  00:00:05  389  3560      3       2       6011    2035
2003-03-20  00:00:05  449  3550      2       2       6011    2035
2003-03-20  00:00:05  480  3550      1       9       6011    2035
2003-03-20  00:00:05  550  3440      1       12      7614    945
2003-03-20  00:00:05  550  3440      2       3       7614    945
2003-03-20  00:00:05  949  9780      6       4       15840   418
2003-03-20  00:00:05  959  9780      7       2       15840   418
2003-03-20  00:00:06  679  3040      2       6       13931   471
2003-03-20  00:00:06  709  3040      1       2       13931   471
2003-03-20  00:00:06  130  9780      2       4       15840   418
2003-03-20  00:00:08  109  6640      9       2       13420   251
2003-03-20  00:00:08  179  6630      7       3       13420   251
2003-03-20  00:00:08  200  6640      6       5       13420   251
2003-03-20  00:00:08  270  6630      4       5       13420   251
2003-03-20  00:00:08  329  6640      1       4       13420   251
2003-03-20  00:00:08  340  6640      2       7       13420   251

slide-110
SLIDE 110

July 19-20, 2007 IWCSN 2007, Guilin, China 110

Traffic data: sample

Traffic data after cleaning and extraction:

Date        Time      Ms   Duration  Caller  Callee  C_type  C_state  Multi  # Sys  System List
2003-03-20  00:00:01  450  3730      6155    1801                            5      8,6,3,2,1
2003-03-20  00:00:01  980  2520      13911   418                             3      7,6,2
2003-03-20  00:00:01  880  5260      13314   251                             6      9,7,6,4,2,1
2003-03-20  00:00:05  550  3440      7614    945                             2      1,2
2003-03-20  00:00:05  289  3560      6011    2035                            5      8,6,3,2,1
2003-03-20  00:00:05  949  9780      15840   418                             3      6,7,2
2003-03-20  00:00:06  810  2350      8022    817                             1      1
2003-03-20  00:00:06  819  1590      13902   497                             4      10,9,8,4
2003-03-20  00:00:06  440  3030      13931   471                             5      10,9,4,2,1
2003-03-20  00:00:08  109  6640      13420   251                             6      9,7,6,4,1,2
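The step from per-system event records to one record per call can be sketched as a grouping operation. The code below is only an illustration of the idea, not the cleaning procedure used for the E-Comm data: it assumes that records with the same caller and callee whose start times fall within a hypothetical one-second window belong to the same group call, and the record type and function name are invented for the example.

```python
from collections import namedtuple

# Hypothetical record layout: start time in ms since midnight, duration in ms.
Record = namedtuple("Record", "t_ms duration_ms sys_id caller callee")


def combine_records(records, window_ms=1000):
    """Merge per-system records of the same call into one entry with a system list."""
    calls = []
    latest = {}  # (caller, callee) -> index of the most recent call in `calls`
    for rec in sorted(records, key=lambda r: r.t_ms):
        key = (rec.caller, rec.callee)
        idx = latest.get(key)
        if idx is not None and rec.t_ms - calls[idx]["t_ms"] <= window_ms:
            calls[idx]["systems"].append(rec.sys_id)  # same call seen on another system
        else:
            latest[key] = len(calls)
            calls.append({"t_ms": rec.t_ms, "duration_ms": rec.duration_ms,
                          "caller": rec.caller, "callee": rec.callee,
                          "systems": [rec.sys_id]})
    return calls


# A few rows from the raw sample (times converted to ms since midnight).
raw = [
    Record(1450, 3730, 8, 6155, 1801), Record(1469, 3730, 6, 6155, 1801),
    Record(1560, 3730, 3, 6155, 1801), Record(1570, 3730, 2, 6155, 1801),
    Record(1640, 3730, 1, 6155, 1801), Record(1980, 2520, 7, 13911, 418),
    Record(2009, 2510, 6, 13911, 418), Record(2149, 2510, 2, 13911, 418),
]
for call in combine_records(raw):
    print(call["caller"], call["callee"], len(call["systems"]), call["systems"])
```

On these rows the grouping yields two calls covering 5 and 3 systems, which matches the first two rows of the cleaned sample.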

slide-111
SLIDE 111

July 19-20, 2007 IWCSN 2007, Guilin, China 111

Data preparation

Date            Original      Cleaned       Combined
2003/03/01      466,862       204,357       91,143
2003/03/02      415,715       184,973       88,014
2003/03/03      406,072       182,311       76,310
2003/03/04      464,534       207,016       84,350
2003/03/05      585,561       264,226       97,714
2003/03/06      605,987       271,514       104,715
2003/03/07      546,230       247,902       94,511
2003/03/08      513,459       233,982       90,310
2003/03/09      442,662       201,146       79,815
2003/03/10      419,570       186,201       76,197
2003/03/11      504,981       225,604       88,857
2003/03/12      516,306       233,140       94,779
2003/03/13      561,253       255,840       95,662
2003/03/14      550,732       248,828       99,458
Total 92 days   44,786,489    20,130,718    8,663,586
                              (44.95%)      (19.34%)

slide-112
SLIDE 112

July 19-20, 2007 IWCSN 2007, Guilin, China 112

User clustering with K-means: k = 3

First cluster (heavy network users):

  • 17 talk groups, contributing to 59% of the calls, with an average number of calls ranging from 94 to 208 per hour
  • They are dispatch groups that assign and schedule other talk groups for specific tasks

Second cluster (average network users):

  • 31 talk groups, contributing to 26% of the calls

Third cluster (least frequent network users):

  • 569 talk groups, contributing to only 15% of the calls
  • They represent over 90% of all talk groups

slide-113
SLIDE 113

July 19-20, 2007 IWCSN 2007, Guilin, China 113

User clusters with K-means: k = 3

User clusters with K-means: k = 6

slide-114
SLIDE 114

July 19-20, 2007 IWCSN 2007, Guilin, China 114

Clustering results

  • Larger values of the silhouette coefficient indicate better clustering:
      • values between 0.7 and 1.0 imply clustering with excellent separation between clusters
  • Cluster sizes:
      • 17, 31, and 569 for K = 3
      • 17, 33, 4, and 563 for K = 4
      • 13, 17, 22, 3, 34, and 528 for K = 6
  • K = 3 produces the best clustering results (based on overall clustering quality and silhouette coefficient)
  • Interpretations of three clusters have been confirmed by the E-Comm domain experts
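The clustering step and its quality measure can be sketched with a small self-contained example. The code below is an illustrative K-means with a mean silhouette coefficient, not the tool used for the talk-group analysis: the deterministic initialization, the function names, and the toy hourly call counts are all assumptions made for the example.

```python
import math


def dist(p, q):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def kmeans(points, k, iters=100):
    """Plain K-means; initial centroids are spread over points sorted by total calls."""
    ordered = sorted(points, key=sum)
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def silhouette(points, labels):
    """Mean silhouette coefficient; assumes at least two non-empty clusters."""
    scores = []
    for i, p in enumerate(points):
        own = [dist(p, q) for j, q in enumerate(points) if labels[j] == labels[i] and j != i]
        if not own:
            scores.append(0.0)  # singleton cluster
            continue
        a = sum(own) / len(own)
        b = min(
            sum(dist(p, q) for q, lab in zip(points, labels) if lab == c) / labels.count(c)
            for c in set(labels) if c != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Hypothetical footprints: calls per hour for seven talk groups over two hours.
footprints = [[100, 120], [110, 130], [20, 25], [22, 30], [1, 0], [0, 2], [1, 1]]
labels, _ = kmeans(footprints, k=3)
print(labels, round(silhouette(footprints, labels), 2))
```

Running the same two functions for several values of k and comparing the silhouette coefficients mirrors the model-selection step described above.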

slide-115
SLIDE 115

July 19-20, 2007 IWCSN 2007, Guilin, China 115

K-means clusters of talk groups: k = 3

Cluster size   Minimum number   Maximum number   Average number   Total number   Total number
               of calls         of calls         of calls         of calls       of calls (%)
17             0-6              352-700          94-208           5,091,695      59
31             0-3              135-641          17-66            2,261,055      26
569            (not shown)      1-1613           0-16             1,310,836      15

slide-116
SLIDE 116

July 19-20, 2007 IWCSN 2007, Guilin, China 116

Traffic prediction

  • Traffic prediction: important to assess future network capacity requirements and to plan future network developments
  • A network traffic trace consists of a series of observations in a dynamical system environment
  • Traditional prediction: considers aggregate traffic and assumes a constant number of network users
  • Approach that focuses on individual users has high computational cost for networks with thousands of users
  • Employing clustering techniques for predicting aggregate network traffic bridges the gap between the two approaches

slide-117
SLIDE 117

July 19-20, 2007 IWCSN 2007, Guilin, China 117

Prediction based on aggregate traffic

  • The aggregate network traffic consists of all network users' traffic
  • The R system was used to identify, estimate, and verify the SARIMA model for the aggregate users' traffic
  • Both 24-hour (one day) and 168-hour (one week) intervals were selected as seasonal period parameters
  • Based on m past traffic data samples, we forecast the future n traffic data samples
  • The prediction quality was measured using the normalized mean square error nmse(a, b), computed over the forecast samples i = m+1, ..., m+n from the squared differences $(a_i - b_i)^2$
  • where $a_i$ is the observed and $b_i$ is the predicted data
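A small numerical example may help with interpreting nmse values. The sketch below assumes the common variance-normalized definition: the sum of squared prediction errors over the forecast window divided by the sum of squared deviations of the observations from their mean. The exact normalization used in this analysis is not confirmed here, and the function name is hypothetical.

```python
def nmse(observed, predicted):
    """Normalized mean square error under the assumed variance-normalized definition."""
    mean_observed = sum(observed) / len(observed)
    squared_error = sum((a - b) ** 2 for a, b in zip(observed, predicted))
    squared_deviation = sum((a - mean_observed) ** 2 for a in observed)
    return squared_error / squared_deviation


observed = [1.0, 2.0, 3.0]
print(nmse(observed, [1.0, 2.0, 3.0]))  # 0.0: perfect prediction
print(nmse(observed, [2.0, 2.0, 2.0]))  # 1.0: predicting the mean value
```

Under this definition, a value of 1.0 is the score of a constant prediction equal to the mean of the observations, so values above 1.0 indicate a forecast that is worse than that baseline.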

slide-118
SLIDE 118

July 19-20, 2007 IWCSN 2007, Guilin, China 118

SARIMA models: selection criteria

Order (0,1,1) is used for the seasonal part (P,D,Q):

  • cyclical seasonal pattern is usually random-walk
  • may be modeled as an MA process after one-time differencing

Model’s goodness-of-fit is validated using a null hypothesis test:

  • time plot analysis and autocorrelation of model residual

slide-119
SLIDE 119

July 19-20, 2007 IWCSN 2007, Guilin, China 119

Prediction quality

  • Models (2,0,9)×(0,1,1)24 and (2,0,1)×(0,1,1)168 have the smallest criterion values based on 1,680 training data samples
  • Normalized mean square error (nmse) is used to measure prediction quality by comparing the deviation between predicted and observed data
  • The nmse of forecast is equal to ratio of normalized sum of variance of forecast to squared bias of forecast
  • Smaller values of nmse indicate a better prediction model

slide-120
SLIDE 120

July 19-20, 2007 IWCSN 2007, Guilin, China 120

SARIMA models: summary of selection criteria

(p,d,q) x (P,D,Q)s       m      nmse    AIC       AICc      BIC
(2,0,9) x (0,1,1)24      1680   0.379   22744.6   22744.9   22826.8
(2,0,1) x (0,1,1)168     1680   0.174   23129.8   23129.8   23161.9
(1,0,1) x (0,1,1)168     1680   0.175   23145.1   23145.1   23170.8
(2,0,9) x (1,1,1)24      1680   0.525   25292.1   25292.4   25382.1
(2,0,1) x (0,1,1)24      1680   0.408   25360.5   25360.6   25392.6
(3,0,1) x (0,1,1)24      1680   0.404   25361.2   25361.2   25399.7
(1,0,2) x (1,1,1)24      1680   0.411   25332.6   25332.6   25371.2
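The three selection criteria in the table follow standard textbook definitions, sketched below for reference. The functions take a maximized log-likelihood, the number of estimated parameters k, and the number of observations n; how k and n were counted for the models in the table is not stated here, so the example values are hypothetical.

```python
import math


def aic(log_likelihood, k):
    """Akaike information criterion."""
    return 2.0 * k - 2.0 * log_likelihood


def aicc(log_likelihood, k, n):
    """AIC with the small-sample correction term."""
    return aic(log_likelihood, k) + 2.0 * k * (k + 1) / (n - k - 1)


def bic(log_likelihood, k, n):
    """Bayesian (Schwarz) information criterion."""
    return k * math.log(n) - 2.0 * log_likelihood


# Hypothetical fit: log-likelihood -10 with 2 parameters and 100 observations.
print(aic(-10.0, 2), aicc(-10.0, 2, 100), bic(-10.0, 2, 100))
```

All three penalize model size, with BIC penalizing additional parameters most strongly for large n, which is why the BIC column is the largest in each row of the table.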

slide-121
SLIDE 121

July 19-20, 2007 IWCSN 2007, Guilin, China 121

Prediction: based on the aggregate traffic

Models forecast future n traffic data based on m past traffic data samples

No.   p   d   q   P   D   Q   S     m      n     nmse
A1    2   0   9   0   1   1   24    1512   672   0.3790
A2    2   0   1   0   1   1   24    1512   672   0.3803
A3    2   0   9   0   1   1   168   1512   672   0.1742
A4    2   0   1   0   1   1   168   1512   672   0.1732
B1    2   0   9   0   1   1   24    1680   168   0.3790
B2    2   0   1   0   1   1   24    1680   168   0.4079
B3    2   0   9   0   1   1   168   1680   168   0.1736
B4    2   0   1   0   1   1   168   1680   168   0.1745
C1    2   0   9   0   1   1   24    2016   168   0.3384
C2    2   0   1   0   1   1   24    2016   168   0.3433
C3    2   0   9   0   1   1   168   2016   168   0.1282
C4    2   0   1   0   1   1   168   2016   168   0.1178

slide-122
SLIDE 122

July 19-20, 2007 IWCSN 2007, Guilin, China 122

Prediction: based on the aggregate traffic

  • Two groups of models, with 24-hour and 168-hour seasonal periods:
      • SARIMA (2, 0, 9) × (0, 1, 1)24 and 168
      • SARIMA (2, 0, 1) × (0, 1, 1)24 and 168
  • Comparisons:
      • rows A1 with A2, B1 with B2, and C1 with C2
      • SARIMA (2, 0, 9) × (0, 1, 1)24 gives better prediction results than SARIMA (2, 0, 1) × (0, 1, 1)24
  • Models with a 168-hour seasonal period provided better prediction than the four 24-hour period based models, particularly when predicting long term traffic data

slide-123
SLIDE 123

July 19-20, 2007 IWCSN 2007, Guilin, China 123

Prediction of 168 hours of traffic based on 1,680 past hours: sample

Comparison of the 24-hour and the 168-hour models

  • Solid line: observation
  • o: prediction of 168-hour seasonal model
  • *: prediction of 24-hour seasonal model
slide-124
SLIDE 124

July 19-20, 2007 IWCSN 2007, Guilin, China 124

Prediction with user clustering

  • Raw network log data collected over 92 days: March 1st 2003 – May 31st 2003
  • Footprint of network usage for talk groups: the hourly number of calls
  • AutoClass and the K-means algorithm were used to classify network talk groups into clusters
  • The behavior of each user cluster was predicted using Seasonal Autoregressive Integrated Moving Average (SARIMA)
  • We used aggregation to predict the overall network behavior

slide-125
SLIDE 125

July 19-20, 2007 IWCSN 2007, Guilin, China 125

Traffic prediction based on user clusters

  • The developed aggregate-user-based prediction assumes that the adopted model is static: the number of network users and their behavior pattern are constant in time
  • This assumption does not hold when planning further network expansions, so the aggregate model cannot be used to forecast traffic for an expanded network
  • We employed a user-cluster-based prediction approach to predict the network traffic by accumulating the prediction results from user clusters
  • In large networks with many users, it is impractical to predict individual users' traffic and then aggregate the predicted data
  • With user clusters, traffic prediction is reduced to predicting and aggregating users' traffic from a few clusters
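The aggregation step can be sketched independently of the forecasting model. The code below is only an illustration of the structure: it forecasts each cluster separately and sums the forecasts. As a stand-in for SARIMA it uses a seasonal-naive forecaster that repeats the last observed season, which is a much simpler technique; the function names and the toy data are hypothetical.

```python
def seasonal_naive_forecast(history, season, horizon):
    """Stand-in forecaster (not SARIMA): repeat the last observed season."""
    last_season = history[-season:]
    return [last_season[i % season] for i in range(horizon)]


def cluster_based_forecast(cluster_histories, season, horizon):
    """Forecast each user cluster separately, then sum the per-cluster forecasts."""
    per_cluster = [seasonal_naive_forecast(h, season, horizon) for h in cluster_histories]
    return [sum(values) for values in zip(*per_cluster)]


# Hypothetical hourly call counts for two clusters, with a season of 2 samples.
histories = [[1, 2, 1, 2], [10, 20, 10, 20]]
print(cluster_based_forecast(histories, season=2, horizon=3))  # [11, 22, 11]
```

Because each cluster is forecast on its own, a change in the number of talk groups in one cluster can be reflected by scaling that cluster's forecast before summation, which the aggregate approach cannot express.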

slide-126
SLIDE 126

July 19-20, 2007 IWCSN 2007, Guilin, China 126

Prediction of 168 hours of traffic based on 1,680 past hours

Comparisons: model (1,0,1)x(0,1,1)168 * observation * prediction without clustering

  • prediction with clustering
slide-127
SLIDE 127

July 19-20, 2007 IWCSN 2007, Guilin, China 127

Traffic prediction with user clusters: examples (2,0,1) x (0,1,1)

Cluster   S     m       n     nmse
1         168   1,920   24    0.2241
2         168   1,920   24    0.3818
3         168   1,920   24    0.1163
*         168   1,920   24    0.0969
A         168   1,920   24    0.1175
1         24    1,920   24    0.2508
2         24    1,920   24    0.2697
3         24    1,920   24    0.3020
*         24    1,920   24    0.1941
A         24    1,920   24    0.2052
1         24    1,680   168   0.5477
2         24    1,680   168   0.6883
3         24    1,680   168   0.2852
*         24    1,680   168   0.4079
A         24    1,680   168   0.4093

slide-128
SLIDE 128

July 19-20, 2007 IWCSN 2007, Guilin, China 128

Prediction results with user clusters

  • For each group, rows 1, 2, and 3: traffic prediction results for user clusters 1, 2, and 3
  • Row *: the aggregate user traffic prediction obtained without clustering the users
  • Row A: the aggregate prediction of network traffic based on the three user clusters
  • The performance of the cluster-based prediction (nmse: 0.1175) is comparable to the best prediction based on aggregate traffic (nmse: 0.0969)
  • Prediction of traffic in networks with a variable number of users is possible, as long as the new user groups can be classified into the existing user clusters

slide-129
SLIDE 129

July 19-20, 2007 IWCSN 2007, Guilin, China 129

Prediction based on user clusters model (2, 0, 1)×(0, 1, 1)

Test   S     m      n     nmse        nmse        nmse        nmse        nmse      nmse
no.                       cluster 1   cluster 2   cluster 3   aggregate   cluster   optimized
1      24    240    24    0.323       0.548       0.308       0.254       0.241     n/a
2      24    240    48    0.394       0.712       0.445       0.343       0.332     n/a
3      24    1200   72    1.774       1.976       0.270       0.884       0.886     0.846
4      24    1200   96    1.319       0.866       0.260       0.611       0.613     0.610
5      24    1200   120   0.840       0.703       0.245       0.463       0.467     n/a
6      24    1200   144   0.665       0.647       0.236       0.396       0.399     n/a
7      168   1008   336   0.616       0.466       0.190       0.285       0.260     n/a
8      168   1008   504   0.439       0.446       0.190       0.237       0.224     n/a
9      168   1176   24    3.401       0.747       0.168       0.365       0.507     0.436
10     168   1512   504   0.348       0.375       0.155       0.180       0.178     n/a
11     168   1680   24    0.367       0.444       0.115       0.132       0.129     n/a
12     168   1680   48    0.380       0.467       0.095       0.114       0.116     n/a

slide-130
SLIDE 130

July 19-20, 2007 IWCSN 2007, Guilin, China 130

Traffic prediction with user clusters

  • nmse > 1.0 for cluster 1 (tests 3, 4, and 9) and for cluster 2 (test 3) implies that prediction is worse than prediction based on the mean value of past data
  • Mean value prediction leads to better prediction results shown in column “nmse optimized” (optimized cluster-based prediction) for:
      • Test 3: clusters 1 and 2
      • Test 4: cluster 1
  • Prediction based on clusters performs better than the prediction based on aggregate traffic:
      • Tests 1, 2, 7, 8, 10, and 11

slide-131
SLIDE 131

July 19-20, 2007 IWCSN 2007, Guilin, China 131

Traffic prediction with user clusters

  • 57% of cluster-based predictions perform better than aggregate-traffic-based prediction with SARIMA model (2,0,1)×(0,1,1)168
  • Prediction of traffic in networks with a variable number of users is possible, as long as the new user groups can be classified into the existing user clusters

slide-132
SLIDE 132

July 19-20, 2007 IWCSN 2007, Guilin, China 132

Roadmap

  • Introduction
  • Traffic data and analysis tools: data collection, statistical analysis, clustering tools, prediction analysis
  • Case studies:
      • satellite network: ChinaSat
      • public safety wireless network: E-Comm
      • packet data network: Internet
  • Conclusions and references

slide-133
SLIDE 133

July 19-20, 2007 IWCSN 2007, Guilin, China 133

Conclusions

  • We used simulation tools and analytical methods to analyze traffic data from three deployed networks: ChinaSat, the Internet, and E-Comm
  • Network: network performance was evaluated
  • Traffic characterization and modeling: models of inter-arrival and call holding times were developed
  • Users: clustering algorithms were employed to classify network users into user clusters
  • Traffic prediction: SARIMA models were used to predict network traffic based on aggregate user traffic and based on three user clusters
  • Network anomalies: wavelet analysis was employed to detect traffic anomalies

slide-134
SLIDE 134

July 19-20, 2007 IWCSN 2007, Guilin, China 134

References: downloads

http://www.ensc.sfu.ca/~ljilja/publications_date.html

  • S. Lau and Lj. Trajković, “Analysis of traffic data from a hybrid satellite-terrestrial network,” in Proc. QShine 2007, Vancouver, BC, Canada, Aug. 2007, to appear.
  • B. Vujičić, L. Chen, and Lj. Trajković, “Prediction of traffic in a public safety network,” in Proc. ISCAS 2006, Kos, Greece, May 2006, pp. 2637–2640.
  • N. Cackov, J. Song, B. Vujičić, S. Vujičić, and Lj. Trajković, “Simulation of a public safety wireless networks: a case study,” Simulation, vol. 81, no. 8, pp. 571–585, Aug. 2005.
  • B. Vujičić, N. Cackov, S. Vujičić, and Lj. Trajković, “Modeling and characterization of traffic in public safety wireless networks,” in Proc. SPECTS 2005, Philadelphia, PA, July 2005, pp. 214–223.
  • J. Song and Lj. Trajković, “Modeling and performance analysis of public safety wireless networks,” in Proc. IEEE IPCCC, Phoenix, AZ, Apr. 2005, pp. 567–572.
  • H. Chen and Lj. Trajković, “Trunked radio systems: traffic prediction based on user clusters,” in Proc. IEEE ISWCS 2004, Mauritius, Sept. 2004, pp. 76–80.
  • D. Sharp, N. Cackov, N. Lasković, Q. Shao, and Lj. Trajković, “Analysis of public safety traffic on trunked land mobile radio systems,” IEEE J. Select. Areas Commun., vol. 22, no. 7, pp. 1197–1205, Sept. 2004.
  • Q. Shao and Lj. Trajković, “Measurement and analysis of traffic in a hybrid satellite-terrestrial network,” in Proc. SPECTS 2004, San Jose, CA, July 2004, pp. 329–336.
  • N. Cackov, B. Vujičić, S. Vujičić, and Lj. Trajković, “Using network activity data to model the utilization of a trunked radio system,” in Proc. SPECTS 2004, San Jose, CA, July 2004, pp. 517–524.
  • J. Chen and Lj. Trajković, “Analysis of Internet topology data,” in Proc. IEEE Int. Symp. Circuits and Systems, Vancouver, British Columbia, Canada, May 2004, vol. IV, pp. 629–632.

slide-135
SLIDE 135

July 19-20, 2007 IWCSN 2007, Guilin, China 135

References: self-similarity

  • A. Feldmann, “Characteristics of TCP connection arrivals,” in Self-similar Network Traffic and Performance Evaluation, K. Park and W. Willinger, Eds., New York: Wiley, 2000, pp. 367–399.
  • T. Karagiannis, M. Faloutsos, and R. H. Riedi, “Long-range dependence: now you see it, now you don't!,” in Proc. GLOBECOM '02, Taipei, Taiwan, Nov. 2002, pp. 2165–2169.
  • W. Leland, M. Taqqu, W. Willinger, and D. Wilson, “On the self-similar nature of ethernet traffic (extended version),” IEEE/ACM Transactions on Networking, vol. 2, no. 1, pp. 1–15, Feb. 1994.
  • M. S. Taqqu and V. Teverovsky, “On estimating the intensity of long-range dependence in finite and infinite variance time series,” in A Practical Guide to Heavy Tails: Statistical Techniques and Applications. Boston, MA: Birkhauser, 1998, pp. 177–217.

slide-136
SLIDE 136

July 19-20, 2007 IWCSN 2007, Guilin, China 136

References: self-similarity

  • P. Abry and D. Veitch, “Wavelet analysis of long-range dependence traffic,” IEEE Transactions on Information Theory, vol. 44, no. 1, pp. 2–15, Jan. 1998.
  • P. Abry, P. Flandrin, M. S. Taqqu, and D. Veitch, “Wavelets for the analysis, estimation, and synthesis of scaling data,” in Self-similar Network Traffic and Performance Evaluation, K. Park and W. Willinger, Eds. New York: Wiley, 2000, pp. 39–88.
  • P. Barford, A. Bestavros, A. Bradley, and M. Crovella, “Changes in Web client access patterns: characteristics and caching implications in world wide web,” World Wide Web, Special Issue on Characterization and Performance Evaluation, vol. 2, pp. 15–28, 1999.
  • Z. Bi, C. Faloutsos, and F. Korn, “The ‘DGX’ distribution for mining massive, skewed data,” in Proc. of ACM SIGCOMM Internet Measurement Workshop, San Francisco, CA, Aug. 2001, pp. 17–26.
  • M. E. Crovella and A. Bestavros, “Self-similarity in world wide web traffic: evidence and possible causes,” IEEE/ACM Transactions on Networking, vol. 5, no. 6, pp. 835–846, Dec. 1997.

slide-137
SLIDE 137

References: time series

  • G. Box and G. Jenkins, Time Series Analysis: Forecasting and Control, 2nd ed. San Francisco, CA: Holden-Day, 1976, pp. 208–329.
  • P. J. Brockwell and R. A. Davis, Introduction to Time Series and Forecasting, 2nd ed. New York: Springer-Verlag, 2002.
  • N. H. Chan, Time Series: Applications to Finance. New York: Wiley-Interscience, 2002.
  • K. Burnham and D. Anderson, Model Selection and Multimodel Inference, 2nd ed. New York, NY: Springer-Verlag, 2002.
  • G. Schwarz, “Estimating the dimension of a model,” Annals of Statistics, vol. 6, no. 2, pp. 461–464, Mar. 1978.
slide-138
SLIDE 138

References: clustering analysis

  • P. Cheeseman and J. Stutz, “Bayesian classification (AutoClass): theory and results,” in Advances in Knowledge Discovery and Data Mining, U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, Eds. AAAI Press/MIT Press, 1996.
  • J. W. Han and M. Kamber, Data Mining: Concepts and Techniques. San Francisco: Morgan Kaufmann, 2001.
  • T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer, 2001.
  • L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis. New York: John Wiley & Sons, 1990.

slide-139
SLIDE 139

References: data mining

  • J. Han and M. Kamber, Data Mining: Concepts and Techniques. San Diego, CA: Academic Press, 2001.
  • W. Wu, H. Xiong, and S. Shekhar, Clustering and Information Retrieval. Norwell, MA: Kluwer Academic Publishers, 2004.
  • Z. Chen, Data Mining and Uncertain Reasoning: An Integrated Approach. New York, NY: John Wiley & Sons, 2001.
  • T. Kanungo, D. M. Mount, N. Netanyahu, C. Piatko, R. Silverman, and A. Y. Wu, “An efficient k-means clustering algorithm: analysis and implementation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 7, pp. 881–892, July 2002.
  • P.-N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining. Reading, MA: Addison-Wesley, 2006, pp. 487–568.
  • L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis. New York, NY: John Wiley & Sons, 1990.
  • M. Last, A. Kandel, and H. Bunke, Eds., Data Mining in Time Series Databases. Singapore: World Scientific Publishing Co. Pte. Ltd., 2004.
  • W.-K. Ching and M. K.-P. Ng, Eds., Advances in Data Mining and Modeling. Singapore: World Scientific Publishing Co. Pte. Ltd., 2003.

slide-140
SLIDE 140

References: protocols

  • D. E. Comer, Internetworking with TCP/IP, Vol. 1: Principles, Protocols, and Architecture, 4th ed. Upper Saddle River, NJ: Prentice-Hall, 2000.
  • W. R. Stevens, TCP/IP Illustrated (vol. 1): The Protocols. Reading, MA: Addison-Wesley, 1994.
  • J. Postel, Ed., “Transmission Control Protocol,” RFC 793, Sep. 1981.
  • J. Postel, “TCP and IP bake off,” RFC 1025, Sep. 1987.
  • J. Mogul and S. Deering, “Path MTU discovery,” RFC 1191, Nov. 1990.
  • V. Jacobson, R. Braden, and D. Borman, “TCP extensions for high performance,” RFC 1323, May 1992.
  • M. Allman, S. Floyd, and C. Partridge, “Increasing TCP’s initial window,” RFC 2414, Sep. 1998.
  • M. Mathis, J. Mahdavi, S. Floyd, and A. Romanow, “TCP selective acknowledgment options,” RFC 2018, Oct. 1996.
  • M. Allman, D. Glover, and L. Sanchez, “Enhancing TCP over satellite channels using standard mechanisms,” RFC 2488, Jan. 1999.
  • M. Allman, S. Dawkins, D. Glover, J. Griner, D. Tran, T. Henderson, J. Heidemann, J. Touch, H. Kruse, S. Ostermann, K. Scott, and J. Semke, “Ongoing TCP research related to satellites,” RFC 2760, Feb. 2000.
  • J. Border, M. Kojo, J. Griner, G. Montenegro, and Z. Shelby, “Performance enhancing proxies intended to mitigate link-related degradations,” RFC 3135, Jun. 2001.
  • S. Floyd, “Inappropriate TCP resets considered harmful,” RFC 3360, Aug. 2002.
slide-141
SLIDE 141

References: fingerprinting

  • R. Beverly, “A Robust Classifier for Passive TCP/IP Fingerprinting,” in Proc. Passive and Active Meas. Workshop 2004, Antibes Juan-les-Pins, France, Apr. 2004, pp. 158–167.
  • C. Smith and P. Grundl, “Know your enemy: passive fingerprinting,” The Honeynet Project, Mar. 2002. [Online]. Available: http://www.honeynet.org/papers/finger/
  • Passive OS fingerprinting tool ver. 2 (p0f v2). [Online]. Available: http://lcamtuf.coredump.cx/p0f.shtml/
  • B. Petersen, “Intrusion detection FAQ: What is p0f and what does it do?” The SysAdmin, Audit, Network, Security (SANS) Institute. [Online]. Available: http://www.sans.org/resources/idfaq/p0f.php
  • T. Miller, “Passive OS fingerprinting: details and techniques,” The SysAdmin, Audit, Network, Security (SANS) Institute. [Online]. Available: http://www.sans.org/reading room/special.php/

slide-142
SLIDE 142

References: anomalies

  • P. Barford and D. Plonka, “Characteristics of network traffic flow anomalies,” in Proc. ACM SIGCOMM Internet Meas. Workshop 2001, Nov. 2001, pp. 69–73.
  • P. Barford, J. Kline, D. Plonka, and A. Ron, “A signal analysis of network traffic anomalies,” in Proc. ACM SIGCOMM Internet Meas. Workshop 2002, Marseille, France, Nov. 2002, pp. 71–82.
  • Y. Zhang, Z. Ge, A. Greenberg, and M. Roughan, “Network anomography,” in Proc. ACM SIGCOMM Internet Meas. Conf. 2005, Berkeley, CA, Oct. 2005, pp. 317–330.
  • A. Soule, K. Salamatian, and N. Taft, “Combining filtering and statistical methods for anomaly detection,” in Proc. ACM SIGCOMM Internet Meas. Conf. 2005, Berkeley, CA, Oct. 2005, pp. 331–344.
  • P. Huang, A. Feldmann, and W. Willinger, “A non-intrusive, wavelet-based approach to detecting network performance problems,” in Proc. ACM SIGCOMM Internet Meas. Workshop 2001, San Francisco, CA, Nov. 2001, pp. 213–227.
  • A. Lakhina, M. Crovella, and C. Diot, “Characterization of network-wide anomalies in traffic flows,” in Proc. ACM SIGCOMM Internet Meas. Conf. 2004, Taormina, Italy, Oct. 2004, pp. 201–206.
  • A. Lakhina, M. Crovella, and C. Diot, “Diagnosing network-wide traffic anomalies,” ACM SIGCOMM Comput. Commun. Rev., vol. 34, no. 4, pp. 219–230, Oct. 2004.
  • M. Arlitt and C. Williamson, “An analysis of TCP reset behaviour on the Internet,” ACM SIGCOMM Comput. Commun. Rev., vol. 35, no. 1, pp. 37–44, Jan. 2005.

slide-143
SLIDE 143

References: spectral analysis

  • M. Faloutsos, P. Faloutsos, and C. Faloutsos, “On power-law relationships of the Internet topology,” in Proc. of ACM SIGCOMM ’99, Cambridge, MA, Aug. 1999, pp. 251–262.
  • H. Chang, R. Govindan, S. Jamin, S. Shenker, and W. Willinger, “Towards capturing representative AS-level Internet topologies,” in Proc. of ACM SIGMETRICS 2002, New York, NY, June 2002, pp. 280–281.
  • D. Vukadinovic, P. Huang, and T. Erlebach, “On the Spectrum and Structure of Internet Topology Graphs,” in Innovative Internet Computing Systems, H. Unger et al., Eds., LNCS 2346. Berlin, Germany: Springer, 2002, pp. 83–96.
  • M. Mihail, C. Gkantsidis, and E. Zegura, “Spectral analysis of Internet topologies,” in Proc. of Infocom 2003, San Francisco, CA, Mar. 2003, vol. 1, pp. 364–374.
  • G. Huston, “Interconnection, peering and settlements-Part II,” Internet Protocol Journal, June 1999. [Online]. Available: http://www.cisco.com/warp/public/759/ipj_2-2/ipj_2-2_ps1.html
  • F. R. K. Chung, Spectral Graph Theory. Providence, Rhode Island: Conference Board of the Mathematical Sciences, 1997, pp. 2–6.
  • M. Fiedler, “Algebraic connectivity of graphs,” Czech. Math. J., vol. 23, no. 2, pp. 298–305, 1973.
slide-144
SLIDE 144

References: traffic analysis

  • Y. W. Chen, “Traffic behavior analysis and modeling sub-networks,” International Journal of Network Management, John Wiley & Sons, vol. 12, pp. 323–330, 2002.
  • Y. Fang and I. Chlamtac, “Teletraffic analysis and mobility modeling of PCS networks,” IEEE Trans. on Communications, vol. 47, no. 7, pp. 1062–1072, July 1999.
  • N. K. Groschwitz and G. C. Polyzos, “A time series model of long-term NSFNET backbone traffic,” in Proc. IEEE International Conference on Communications (ICC'94), New Orleans, LA, May 1994, vol. 3, pp. 1400–1404.
  • D. Papagiannaki, N. Taft, Z.-L. Zhang, and C. Diot, “Long-term forecasting of Internet backbone traffic: observations and initial models,” in Proc. IEEE INFOCOM 2003, San Francisco, CA, April 2003, pp. 1178–1188.
  • D. Tang and M. Baker, “Analysis of a metropolitan-area wireless network,” Wireless Networks, vol. 8, no. 2/3, pp. 107–120, Mar.-May 2002.

slide-145
SLIDE 145

References: traffic analysis

  • R. B. D'Agostino and M. A. Stephens, Eds., Goodness-of-Fit Techniques. New York: Marcel Dekker, 1986, pp. 63–93, 97–145, 421–457.
  • F. Barceló and J. I. Sánchez, “Probability distribution of the inter-arrival time to cellular telephony channels,” in Proc. of the 49th Vehicular Technology Conference, May 1999, vol. 1, pp. 762–766.
  • F. Barceló and J. Jordan, “Channel holding time distribution in public telephony systems (PAMR and PCS),” IEEE Trans. Vehicular Technology, vol. 49, no. 5, pp. 1615–1625, Sept. 2000.