Mining Network Traffic Data
Ljiljana Trajković, ljilja@cs.sfu.ca

Dendrogram example July 19-20, 2007 IWCSN 2007, Guilin, China 22

Dendrogram example (continued)

Traffic prediction: ARIMA model
• Auto-Regressive Integrated Moving Average (ARIMA) model:
  - general model for forecasting time series
  - past values: autoregressive (AR) structure
  - effect of past random fluctuations: moving average (MA) process
• ARIMA model explicitly includes differencing
• ARIMA (p, d, q):
  - autoregressive parameter: p
  - number of differencing passes: d
  - moving average parameter: q

Traffic prediction: SARIMA model
• Seasonal ARIMA (SARIMA) is a variation of the ARIMA model
• SARIMA model: (p, d, q) × (P, D, Q)_S
  - captures seasonal pattern
• Additional SARIMA model parameters:
  - seasonal period parameter: S
  - seasonal autoregressive parameter: P
  - number of seasonal differencing passes: D
  - seasonal moving average parameter: Q
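For reference, the (p, d, q) × (P, D, Q)_S notation is commonly written with the backshift operator B, where B X_t = X_{t-1}. The form below is the usual textbook statement rather than something taken from the slides. The symbols X_t (traffic series), ε_t (white noise), and the polynomials φ, Φ, θ, Θ are assumptions introduced here for illustration.

```latex
\phi_p(B)\,\Phi_P(B^S)\,(1-B)^d\,(1-B^S)^D\,X_t
  = \theta_q(B)\,\Theta_Q(B^S)\,\varepsilon_t ,
\qquad
\begin{aligned}
\phi_p(B)   &= 1 - \phi_1 B - \dots - \phi_p B^p, &
\Phi_P(B^S) &= 1 - \Phi_1 B^S - \dots - \Phi_P B^{PS},\\
\theta_q(B) &= 1 + \theta_1 B + \dots + \theta_q B^q, &
\Theta_Q(B^S) &= 1 + \Theta_1 B^S + \dots + \Theta_Q B^{QS}.
\end{aligned}
```

Setting P = D = Q = 0 removes the seasonal factors and leaves the ARIMA (p, d, q) model.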

SARIMA models: selection criteria
• Order (p, d, q) selected based on:
  - time series plot of traffic data
  - autocorrelation and partial autocorrelation functions
• Validity of parameter selection:
  - Akaike's information criterion: AIC
  - corrected AIC: AICc
  - Bayesian information criterion: BIC
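The three criteria can be computed from a fitted model's maximized log-likelihood. The sketch below uses the standard definitions; the function name `information_criteria` and its arguments are hypothetical, and the slides do not say which software was used to evaluate them.

```python
import math


def information_criteria(log_likelihood, k, n):
    """Return (AIC, AICc, BIC) for a model with k fitted parameters
    and n observations, given its maximized log-likelihood."""
    if n - k - 1 <= 0:
        raise ValueError("AICc requires n > k + 1")
    aic = 2 * k - 2 * log_likelihood
    # Small-sample correction added on top of AIC.
    aicc = aic + (2 * k * (k + 1)) / (n - k - 1)
    bic = k * math.log(n) - 2 * log_likelihood
    return aic, aicc, bic
```

Lower values indicate a better trade-off between fit and model complexity, so candidate (p, d, q) × (P, D, Q)_S orders can be compared by evaluating each criterion on the same data.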

Roadmap
• Introduction
• Traffic data and analysis tools:
  - data collection, statistical analysis, clustering tools, prediction analysis
• Case studies:
  - satellite network: ChinaSat
  - packet data networks: Internet
  - public safety wireless network: E-Comm
• Conclusions and references

ChinaSat data: analysis
• Analysis of network traffic:
  - characteristics of TCP connections
  - network traffic patterns
  - statistical and cluster analysis of traffic
  - anomaly detection:
    - statistical methods
    - wavelets
    - principal component analysis
TCP: Transmission Control Protocol

Network and traffic data
• ChinaSat: network architecture and TCP
• Analysis of billing records:
  - aggregated traffic
  - user behavior
• Analysis of tcpdump traces:
  - general characteristics
  - TCP options and operating system (OS) fingerprinting
  - network anomalies

DirecPC system diagram (figure)

Characteristics of satellite links
• Large coverage area
• High bandwidth
• Long propagation delay
• Large bandwidth-delay product
• High bit error rates:
  - 10^-6 without error correction
  - 10^-3 or 10^-2 due to extreme weather and interference
• Path asymmetry

Characteristics of satellite links
• ChinaSat hybrid satellite network:
  - employs geosynchronous satellites deployed by Hughes Network Systems Inc.
• Provides data and television services:
  - DirecPC (Classic): unidirectional satellite data service
  - DirecTV: satellite television service
  - DirecWay (HughesNet): new bi-directional satellite data service that replaces DirecPC
• DirecPC transmission rates:
  - 400 kb/s from satellite to user
  - 33.6 kb/s from user to network operations center (NOC) using dial-up
• Improves performance using TCP splitting with spoofing

ChinaSat data: analysis
• ChinaSat traffic is self-similar and non-stationary
• Hurst parameter differs depending on traffic load
• Modeling of TCP connections:
  - inter-arrival time is best modeled by the Weibull distribution
  - number of downloaded bytes is best modeled by the lognormal distribution
• The distribution of visited websites is best modeled by the discrete Gaussian exponential (DGX) distribution

ChinaSat data: analysis
• Traffic prediction:
  - autoregressive integrated moving average (ARIMA) was successfully used to predict uploaded traffic (but not downloaded traffic)
  - wavelet + autoregressive model outperforms the ARIMA model
Q. Shao and Lj. Trajković, “Measurement and analysis of traffic in a hybrid satellite-terrestrial network,” Proc. SPECTS 2004, San Jose, CA, July 2004, pp. 329–336.

Analysis of collected data
• Analysis of patterns and statistical properties of two sets of data from the ChinaSat DirecPC network:
  - billing records
  - tcpdump traces
• Billing records:
  - daily and weekly traffic patterns
  - user classification:
    - single and multi-variable k-means clustering
    - time series clustering using hierarchical clustering and empirical approach

Analysis of collected data
• Analysis of tcpdump traces:
  - protocols and applications
  - TCP options
  - operating system fingerprinting
  - network anomalies
• C program pcapread that processes tcpdump files without using the packet capture library libpcap

Network anomalies
• Scans and worms:
  - packets are sent to probe network hosts
  - used to discover and exploit resources
• Denial of service:
  - large number of packets is directed to a single destination
  - makes a host incapable of handling incoming connections or exhausts available bandwidth along paths to the destination

Network anomalies
• Flash crowd:
  - high volume of traffic is destined to a single destination
  - caused by breaking news, availability of new software
• Traffic shift:
  - redirection of traffic from one set of paths to another
  - caused by route changes, link unavailability, or network congestion

Network anomalies
• Alpha traffic:
  - unusually high volume of traffic between two endpoints
  - caused by file transfers or bandwidth measurements
• Traffic volume anomalies:
  - significant deviation of traffic volume from usual daily or weekly patterns
  - classified as:
    - outages: caused by unavailable links, crashed servers, or routing problems
    - short-term increases in demand: caused by short-term events such as holiday traffic
  - involve multiple sources and destinations

Billing records
• Records were collected during the continuous period from 23:00 on Oct. 31, 2002 to 11:00 on Jan. 10, 2003
• Each file contains the hourly traffic summary for each user
• Fields of interest:
  - SiteID (user identification)
  - Start (record start time)
  - CTxByt (number of bytes downloaded by a user)
  - CRxByt (number of bytes uploaded by a user)
  - CTxPkt (number of packets downloaded by a user)
  - CRxPkt (number of packets uploaded by a user)

Billing records: characteristics
• 186 unique SiteIDs
• Daily and weekly cycles:
  - lower traffic volume on weekends
  - daily cycle starts at 7 AM, rises to three daily maxima at 11 AM, 3 PM, and 7 PM, then decreases monotonically until 7 AM
• Highest daily traffic recorded on Dec. 24, 2002
• Outage occurred on Jan. 3, 2003

Aggregated hourly traffic (figure)

Aggregated daily traffic (figure)

Daily diurnal traffic: average downloaded bytes (figure)

Weekly traffic: average downloaded bytes (figure)

Ranking of user traffic
• Users are ranked according to the traffic volume
• The top user downloaded 78.8 GB, uploaded 11.9 GB, and downloaded/uploaded ~205 million packets
• Most users downloaded/uploaded little traffic
• Cumulative distribution functions (CDFs) are constructed from the ranks:
  - top user accounts for 11% of downloaded bytes
  - top 25 users contributed 93.3% of downloaded bytes
  - top 37 users contributed 99% of total traffic (packets and bytes)
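A cumulative share curve of this kind can be built by sorting per-user totals in decreasing order and accumulating them. The following is a minimal sketch; the function name `cumulative_share` is hypothetical, and the input is assumed to be one traffic total per user (for example, total downloaded bytes).

```python
def cumulative_share(volumes):
    """Rank users by traffic volume (largest first) and return the
    cumulative fraction of total traffic contributed by the top 1, 2, ... users."""
    ranked = sorted(volumes, reverse=True)
    total = sum(ranked)
    if total == 0:
        raise ValueError("total traffic volume is zero")
    shares = []
    running = 0
    for volume in ranked:
        running += volume
        shares.append(running / total)
    return shares
```

With this output, `shares[0]` is the top user's share and `shares[24]` is the share of the top 25 users.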

Cumulative distribution functions (figure)

k-means: clustering results
• Natural number of clusters is k=3 for downloaded and uploaded bytes
• Most users belong to the group with small traffic volume
• For k=3:
  - 159 users in group 1 (average 0.0–16.8 MB downloaded per hour)
  - 24 users in group 2 (average 16.8–70.6 MB downloaded per hour)
  - 3 users in group 3 (average 70.6–110.7 MB downloaded per hour)
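Single-variable k-means on per-user averages can be sketched with Lloyd's algorithm, as below. This is an illustration under stated assumptions, not the tool used in the study: the function name `kmeans_1d` and the deterministic initialization from evenly spaced order statistics are choices made here so that the example is reproducible.

```python
def kmeans_1d(values, k, max_iter=100):
    """Lloyd's algorithm on one-dimensional data.
    Returns (centroids, clusters), with clusters ordered by increasing centroid."""
    data = sorted(values)
    n = len(data)
    if not 1 <= k <= n or max_iter < 1:
        raise ValueError("need 1 <= k <= len(values) and max_iter >= 1")
    # Assumed initialization: evenly spaced order statistics of the sorted data.
    centroids = [data[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    for _ in range(max_iter):
        # Assignment step: each value joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for x in data:
            nearest = min(range(k), key=lambda j: abs(x - centroids[j]))
            clusters[nearest].append(x)
        # Update step: each centroid moves to the mean of its cluster.
        updated = [sum(c) / len(c) if c else centroids[j]
                   for j, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return centroids, clusters
```

For example, `kmeans_1d([1, 2, 3, 20, 22, 24, 100, 110], 3)` separates the made-up values into a small, a medium, and a large group, mirroring the three user groups reported above.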

Three most common traffic patterns
• Idle users:
  - rarely download/upload traffic
  - represented by zero traffic
• Active users:
  - download/upload traffic for more than 18 hours a day
  - represented by traffic over 24 hours each day
• Semi-active users:
  - download/upload traffic for 8–12 hours a day
  - represented by a cycle of 10 hours active/14 hours idle for each day

Clustering results using three most common traffic patterns
Traffic pattern          Number of users
Idle                     162
Active                   16
Semi-active              8
Total number of users    186

tcpdump traces
• Traces were continuously collected from 11:30 on Dec. 14, 2002 to 11:00 on Jan. 10, 2003 at the NOC
• The first 68 bytes of each TCP/IP packet were captured
• ~63 GB of data contained in 127 files
• User IP address is not constant due to the use of the private IP address range and dynamic IP
• Majority of traffic is TCP:
  - 94% of total bytes and 84% of total packets
  - WWW (port 80) accounts for 90% of TCP connections and 76% of TCP bytes
  - FTP (port 21) accounts for 0.2% of TCP connections and 11% of TCP bytes

OS fingerprinting results
• Analyzed 9 hours of tcpdump trace on Dec. 14, 2002 using the open-source tool p0f.v2
• Assumed constant IP addresses
• Detected 171 users:
  - 137 users did not initiate any connections and cannot be identified (no SYN packets)
  - 14 users employ Microsoft Windows
  - 2 users employ Linux
  - 1 user employs an unknown OS (identified as an MSS-modifying proxy)
OS: operating system

Network anomalies
• Tools used: Ethereal/Wireshark, tcptrace, and pcapread
• Four types of network anomalies were detected:
  - invalid TCP flag combinations
  - large number of TCP resets
  - UDP and TCP port scans
  - traffic volume anomalies

Analysis of TCP flags
TCP flag                                    Packet count    % of total
SYN only                                    19,050,849      48.500
RST only                                    7,440,418       18.900
FIN only                                    12,679,619      32.300
*SYN+FIN                                    408             0.001
*RST+FIN (no PSH)                           85,571          0.200
*RST+PSH (no FIN)                           18,111          0.050
*RST+FIN+PSH                                8,329           0.020
*Total number of packets with
 invalid TCP flag combinations              112,419         0.300
Total packet count                          39,283,305      100.000
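The invalid combinations in the table can be recognized directly from the flags byte of the TCP header. The sketch below uses the standard TCP flag bit positions; the function name `classify_flags`, the returned labels, and the precedence given to SYN+FIN when other bits are also set are assumptions made for illustration, not a description of pcapread.

```python
# Bit masks of the TCP header flags used here.
FIN, SYN, RST, PSH = 0x01, 0x02, 0x04, 0x08


def classify_flags(flags):
    """Return the table's label for an invalid flag combination, or None
    if the flags byte does not match any of the four listed combinations."""
    fin, syn, rst, psh = (bool(flags & mask) for mask in (FIN, SYN, RST, PSH))
    if syn and fin:
        return "SYN+FIN"
    if rst and fin and psh:
        return "RST+FIN+PSH"
    if rst and fin:
        return "RST+FIN (no PSH)"
    if rst and psh:
        return "RST+PSH (no FIN)"
    return None
```

Counting the labels returned for every packet in a trace would yield per-combination totals of the kind shown in the table.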

Large number of TCP resets
• Connections are terminated by either TCP FIN or TCP RST:
  - 12,679,619 connections were terminated by FIN (63%)
  - 7,440,418 connections were terminated by RST (37%)
• Large number of TCP RSTs indicates that connections are terminated in error conditions
• TCP RST is employed by Microsoft Internet Explorer to terminate connections instead of TCP FIN
TCP: Transmission Control Protocol

UDP and TCP port scans
• UDP port scans are found on UDP port 137 (NETBEUI)
• TCP port scans are found on these TCP ports:
  - 80: Hypertext transfer protocol (HTTP)
  - 139: NETBIOS extended user interface (NETBEUI)
  - 443: HTTP over secure socket layer (HTTPS)
  - 1433: Microsoft structured query language (MS SQL)
  - 27374: Subseven trojan
• No HTTP(S) servers were active in the ChinaSat network
• An MS SQL vulnerability was discovered in Oct. 2002, which may be the cause of scans on TCP port 1433
• The Subseven trojan is a backdoor program used with malicious intent
TCP: Transmission Control Protocol
UDP: User Datagram Protocol

UDP port scans originating from the ChinaSat network
• Client (192.168.2.30) source port (137) scans external network addresses at destination ports (1025-1040):
  - > 100 are recorded within a three-hour period
  - targeted IP addresses are variable
  - multiple ports are scanned per IP
  - may correspond to Bugbear, OpaSoft, or other worms
Recorded scans (source - destination):
192.168.2.30:137 - 195.x.x.98:1025
192.168.2.30:137 - 202.x.x.153:1027
192.168.2.30:137 - 210.x.x.23:1035
192.168.2.30:137 - 195.x.x.42:1026
192.168.2.30:137 - 202.y.y.226:1026
192.168.2.30:137 - 218.x.x.238:1025
192.168.2.30:137 - 202.y.y.226:1025
192.168.2.30:137 - 202.y.y.226:1027
192.168.2.30:137 - 202.y.y.226:1028
192.168.2.30:137 - 202.y.y.226:1029
192.168.2.30:137 - 202.y.y.242:1026
192.168.2.30:137 - 61.x.x.5:1028
192.168.2.30:137 - 219.x.x.226:1025
192.168.2.30:137 - 213.x.x.189:1028
192.168.2.30:137 - 61.x.x.193:1025
192.168.2.30:137 - 202.y.y.207:1028
192.168.2.30:137 - 202.y.y.207:1025
192.168.2.30:137 - 202.y.y.207:1026
192.168.2.30:137 - 202.y.y.207:1027
192.168.2.30:137 - 64.x.x.148:1027

UDP port scans directed to the ChinaSat network
• External address (210.x.x.23) scans for port (137) (NETBEUI) response within the ChinaSat network from source port (1035):
  - > 200 are recorded within a three-hour period
  - target IP addresses are not sequential
  - may correspond to Bugbear, OpaSoft, or other worms
Recorded scans (source - destination):
210.x.x.23:1035 - 192.168.1.121:137
210.x.x.23:1035 - 192.168.1.63:137
210.x.x.23:1035 - 192.168.2.11:137
210.x.x.23:1035 - 192.168.1.250:137
210.x.x.23:1035 - 192.168.1.25:137
210.x.x.23:1035 - 192.168.2.79:137
210.x.x.23:1035 - 192.168.1.52:137
210.x.x.23:1035 - 192.168.6.191:137
210.x.x.23:1035 - 192.168.1.241:137
210.x.x.23:1035 - 192.168.2.91:137
210.x.x.23:1035 - 192.168.1.5:137
210.x.x.23:1035 - 192.168.1.210:137
210.x.x.23:1035 - 192.168.6.127:137
210.x.x.23:1035 - 192.168.1.201:137
210.x.x.23:1035 - 192.168.6.179:137
210.x.x.23:1035 - 192.168.2.82:137
210.x.x.23:1035 - 192.168.1.239:137
210.x.x.23:1035 - 192.168.1.87:137
210.x.x.23:1035 - 192.168.1.90:137
210.x.x.23:1035 - 192.168.1.177:137
210.x.x.23:1035 - 192.168.1.39:137

Detection of traffic volume anomalies using wavelets
• Traffic is decomposed into various frequencies using the wavelet transform
• Traffic volume anomalies are identified by the large variation in wavelet coefficient values
• The coarsest scale level where the anomalies are found indicates the time scale of an anomaly

Detection of traffic volume anomalies using wavelets
• tcpdump traces are binned in terms of packets or bytes (each second)
• Wavelet transform of 12 levels is employed to decompose the traffic
• The coarsest level approximately represents the hourly traffic
• Anomalies are:
  - detected with a moving window of size 20 and by calculating the mean and standard deviation (σ) of the wavelet coefficients in each window
  - identified when wavelet coefficients lie outside ±3σ of the mean value
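The windowed ±3σ rule can be sketched as follows. Several details are assumptions, because the slides do not specify them: the Haar wavelet is used only because it is the simplest to write out, each coefficient is compared against the 20 coefficients that precede it, and the population standard deviation is used. The names `haar_step` and `flag_anomalies` are hypothetical.

```python
import math
import statistics


def haar_step(signal):
    """One level of the Haar wavelet transform of an even-length sequence.
    Returns (approximation, detail) coefficient lists."""
    approx, detail = [], []
    for a, b in zip(signal[0::2], signal[1::2]):
        approx.append((a + b) / math.sqrt(2))
        detail.append((a - b) / math.sqrt(2))
    return approx, detail


def flag_anomalies(coeffs, window=20, n_sigma=3.0):
    """Return indices whose coefficient lies outside mean ± n_sigma * sigma,
    where mean and sigma are computed over the preceding `window` coefficients."""
    flagged = []
    for i in range(window, len(coeffs)):
        reference = coeffs[i - window:i]
        mean = statistics.fmean(reference)
        sigma = statistics.pstdev(reference)
        if abs(coeffs[i] - mean) > n_sigma * sigma:
            flagged.append(i)
    return flagged
```

Applying `haar_step` repeatedly to the approximation output gives coarser levels; running `flag_anomalies` on the detail coefficients of each level then indicates the time scale at which a deviation appears.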

Wavelet approximate coefficients (figure)

Wavelet detail coefficients: d9 (figure)

Wavelet detail coefficients: d8 (figure)

Roadmap
• Introduction
• Traffic data and analysis tools:
  - data collection
  - statistical analysis, clustering tools, prediction analysis
• Case studies:
  - satellite network: ChinaSat
  - packet data network: Internet
  - public safety wireless network: E-Comm
• Conclusions and references

Autonomous System (AS)
• The Internet is a network of Autonomous Systems:
  - groups of networks sharing the same routing policy
  - identified with Autonomous System Numbers (ASN)
• Autonomous System Numbers: http://www.iana.org/assignments/as-numbers
• Internet topology on AS-level:
  - the arrangement of ASs and their interconnections
• Border Gateway Protocol (BGP):
  - inter-AS protocol
  - used to exchange network reachability information among BGP systems
  - reachability information is stored in routing tables

Internet AS-level data
• Sources of data are routing tables:
  - Route Views: http://www.routeviews.org
    - most participating ASs reside in North America
  - RIPE (Réseaux IP européens): http://www.ripe.net/ris
    - most participating ASs reside in Europe

Internet AS-level data
• Data used in prior research (partial list):
                       Route Views    RIPE
  Faloutsos, 1999      Yes            No
  Chang, 2001          Yes            Yes
  Vukadinovic, 2001    Yes            No
  Mihail, 2003         Yes            Yes
• Research results have been used in developing Internet simulation tools:
  - power laws are employed to model and generate Internet topologies: BA model, BRITE, Inet2

Data sets
• Emerging concerns about the use of the two datasets:
  - different observations about AS degrees:
    - power-law distribution: Route Views [Faloutsos, 1999]
    - Weibull distribution: Route Views + RIPE [Chang, 2001]
  - data completeness:
    - RIPE dataset contains ~40% more AS connections and 2% more ASs than Route Views [Chang, 2001]

Route Views and RIPE: statistics
• Route Views and RIPE samples collected on May 30, 2003
  Number of      Route Views    RIPE
  AS paths       6,398,912      6,375,028
  Probed ASs     15,418         15,433
  AS pairs       34,878         35,225
• AS pair: a pair of connected ASs
• 15,369 probed ASs (99.7%) in both datasets are identical
• 29,477 AS pairs in Route Views (85%) and in RIPE (84%) are identical

Core ASs
• ASs with largest degrees
• 16 of the core ASs in Route Views and RIPE are identical
• Core ASs in Route Views have larger degrees than core ASs in RIPE
        Route Views         RIPE
Rank    AS       Degree     AS       Degree
1       701      2595       701      2448
2       1239     2569       1239     1784
3       7018     1999       7018     1638
4       3561     1036       209      861
5       1        999        3561     705
6       209      863        3356     673
7       3356     662        3549     612
8       3549     617        702      580
9       702      562        2914     561
10      2914     556        1        489
11      6461     498        4589     482
12      4513     468        6461     476
13      4323     315        8220     450
14      16631    294        3303     429
15      6347     291        13237    412
16      8220     289        6730     313
17      3257     277        4323     305
18      4766     263        3257     305
19      3786     263        16631    296
20      7132     258        6347     281

Spectral analysis of graphs
• Normalized Laplacian matrix N(G) [Chung, 1997]:

$$
N(i,j) =
\begin{cases}
1 & \text{if } i = j \text{ and } d_i \neq 0 \\
-\dfrac{1}{\sqrt{d_i d_j}} & \text{if } i \text{ and } j \text{ are adjacent} \\
0 & \text{otherwise}
\end{cases}
$$

  where d_i and d_j are the degrees of nodes i and j, respectively
• The second smallest eigenvalue [Fiedler, 1973]
• The largest eigenvalue [Chung, 1997]
• Characteristic valuation [Fiedler, 1975]
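The definition of N(G) translates directly into code. The sketch below builds the dense matrix from an edge list for a small undirected graph; the function name `normalized_laplacian` and the 0-based node numbering are assumptions for illustration, and a sparse representation would be needed for a graph the size of the AS-level topology.

```python
import math


def normalized_laplacian(n, edges):
    """Dense normalized Laplacian N(G) of an undirected simple graph
    with nodes 0..n-1, given as a list of (u, v) edges."""
    neighbors = [set() for _ in range(n)]
    for u, v in edges:
        if u != v:  # ignore self-loops
            neighbors[u].add(v)
            neighbors[v].add(u)
    degree = [len(s) for s in neighbors]
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        if degree[i] > 0:
            matrix[i][i] = 1.0  # diagonal entry for non-isolated nodes
        for j in neighbors[i]:
            matrix[i][j] = -1.0 / math.sqrt(degree[i] * degree[j])
    return matrix
```

The eigenvalues and eigenvectors of this matrix, computed with any linear algebra package, are the quantities used in the spectral analysis.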

Characteristic valuation: example
• Eigenvector of the second smallest eigenvalue: 0.1, 0.3, -0.2, 0
• AS1(0.1), AS2(0.3), AS3(-0.2), AS4(0)
• Sort ASs by element value: AS3, AS4, AS1, AS2
• AS3 and AS1 are connected
(Figure: connectivity status, 1 or 0, versus index of elements in the sorted order AS3, AS4, AS1, AS2.)
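The sorting step of this example can be reproduced in a few lines. The function name `sort_by_valuation` is hypothetical; the input values are the ones given in the example.

```python
def sort_by_valuation(labels, eigenvector):
    """Order node labels by the value of their eigenvector element, ascending."""
    order = sorted(range(len(labels)), key=lambda i: eigenvector[i])
    return [labels[i] for i in order]


sorted_ases = sort_by_valuation(["AS1", "AS2", "AS3", "AS4"], [0.1, 0.3, -0.2, 0.0])
# sorted_ases is ["AS3", "AS4", "AS1", "AS2"], matching the order in the example.
```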

Spectral analysis of topology data
• Consider only ASs with the first 30,000 assigned AS numbers
• AS degree distribution in Route Views and RIPE datasets (figure)

Before the sort: (a) RouteViews_original, (b) RIPE_original
After the sort: (c) RouteViews_min, (d) RIPE_min

Before the sort: (a) RouteViews_original, (b) RIPE_original
After the sort: (c) RouteViews_max, (d) RIPE_max

Data analysis results
• Eigenvector of the second smallest eigenvalue:
  - separates connected ASs from disconnected ASs
  - Route Views and RIPE datasets are similar on a coarser scale
• Eigenvector of the largest eigenvalue:
  - reveals highly connected clusters
  - Route Views and RIPE datasets differ on a finer scale

Observations
• The two datasets are similar on coarse scales:
  - number of ASs, number of AS connections, core ASs
• They exhibit different clustering characteristics:
  - Route Views data contain larger AS clusters
  - core ASs in Route Views have larger degrees than core ASs in RIPE
  - core ASs in Route Views connect a larger number of smaller ASs

Roadmap
• Introduction
• Traffic data and analysis tools:
  - data collection, statistical analysis, clustering tools, prediction analysis
• Case studies:
  - satellite network: ChinaSat
  - packet data network: Internet
  - public safety wireless network: E-Comm
• Conclusions and references

Case study: E-Comm network
• E-Comm network: an operational trunked radio system serving as a regional emergency communication system
• The E-Comm network is capable of both voice and data transmissions
• Voice traffic accounts for over 99% of network traffic
• A group call is a standard call made in a trunked radio system
• More than 85% of calls are group calls
• A distributed event log database records every event occurring in the network: call establishment, channel assignment, call drop, and emergency call

E-Comm network: coverage and user agencies
(Figure labels: user agencies RCMP and Police, Fire, Ambulance, Other; Agency 1 (Police), Agency 2 (Fire Dept.); talk groups TG 1, TG 2, TG 3, TG 4, ..., TG n; radio devices R1 to R8.)
TG: talk group
R: radio device (user)

E-Comm network architecture
(Figure labels: transmitters/repeaters, users, PSTN, PBX, dispatch console, Vancouver, Burnaby, other EDACS systems, network switch, database server, data gateway, management console.)

Traffic data
• 2001 data set:
  - 2 days of traffic data
  - 2001-11-01 to 2001-11-02 (110,348 calls)
• 2002 data set:
  - 28 days of continuous traffic data
  - 2002-02-10 to 2002-03-09 (1,916,943 calls)
• 2003 data set:
  - 92 days of continuous traffic data
  - 2003-03-01 to 2003-05-31 (8,756,930 calls)

Observations
• Presence of daily cycles:
  - minimum utilization: ~2 PM
  - maximum utilization: 9 PM to 3 AM
• 2002 sample data:
  - cell 5 is the busiest
  - others seldom reach their capacities
• 2003 sample data:
  - several cells (2, 4, 7, and 9) have all channels occupied during busy hours

Network utilization
• OPNET-based simulation of two weeks of network activity
• Network utilization exhibits daily cycles
• Between February 2002 and March 2003:
  - number of calls increased by ~60%
  - average utilization increased non-uniformly across the network
• Several cells may become congested in the future
N. Cackov, B. Vujičić, S. Vujičić, and Lj. Trajković, “Using network activity data to model the utilization of a trunked radio system,” in Proc. SPECTS 2004, San Jose, CA, July 2004, pp. 517–524.
N. Cackov, J. Song, B. Vujičić, S. Vujičić, and Lj. Trajković, “Simulation of a public safety wireless network: a case study,” Simulation, vol. 81, no. 8, pp. 571–585, Aug. 2005.

Performance analysis
• Modeling and performance analysis of public safety wireless networks
• WarnSim: a simulator for public safety wireless networks (PSWN)
• Traffic data analysis
• Traffic modeling
• Simulation and prediction
J. Song and Lj. Trajković, “Modeling and performance analysis of public safety wireless networks,” in Proc. IEEE IPCCC, Phoenix, AZ, Apr. 2005, pp. 567–572.

WarnSim overview
• Simulators such as OPNET, ns-2, and JSim are designed for packet-switched networks
• WarnSim is a simulator developed for circuit-switched networks, such as PSWN
• WarnSim:
  - publicly available simulator
  - http://www.vannet.ca/warnsim
  - effective, flexible, and easy to use
  - developed using Microsoft Visual C# .NET
  - operates on Windows platforms

Call arrival rate in 2002 and 2003: cyclic patterns
(Figure: number of calls versus time in days, Sat. to Fri., and versus time in hours, 1 to 24, for the 2002 and 2003 data.)
• the busiest hour is around midnight
• the busiest day is Thursday
• useful for scheduling periodic maintenance tasks

Modeling and characterization of traffic
• We analyzed voice traffic from a public safety wireless network in Vancouver, BC:
  - call inter-arrival and call holding times during five busy hours from each year (2001, 2002, 2003)
• Statistical distribution and the autocorrelation function of the traffic traces:
  - Kolmogorov-Smirnov goodness-of-fit test
  - autocorrelation functions
  - wavelet-based estimation of the Hurst parameter
B. Vujičić, N. Cackov, S. Vujičić, and Lj. Trajković, “Modeling and characterization of traffic in public safety wireless networks,” in Proc. SPECTS 2005, Philadelphia, PA, July 2005, pp. 214–223.

Erlang traffic models
Erlang B:

$$
P_B = \frac{\dfrac{A^N}{N!}}{\displaystyle\sum_{x=0}^{N} \dfrac{A^x}{x!}}
$$

Erlang C:

$$
P_C = \frac{\dfrac{A^N}{N!}\,\dfrac{N}{N-A}}{\displaystyle\sum_{x=0}^{N-1} \dfrac{A^x}{x!} + \dfrac{A^N}{N!}\,\dfrac{N}{N-A}}
$$

• P_B: probability of rejecting a call
• P_C: probability of delaying a call
• N: number of channels/lines
• A: total traffic volume

Erlang models
• Erlang B model assumes:
  - call holding time follows exponential distribution
  - blocked call will be rejected immediately
• Erlang C model assumes:
  - call holding time follows exponential distribution
  - blocked call will be put into a FIFO queue with infinite size
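Both probabilities can be evaluated numerically without computing large factorials. The sketch below uses the well-known recursion for Erlang B and the standard relation that expresses Erlang C in terms of Erlang B. The function names are hypothetical, and returning 1.0 when the offered traffic is not smaller than the number of channels is a convention assumed here, since the queue is then unstable.

```python
def erlang_b(traffic, channels):
    """Probability of rejecting a call (Erlang B) for offered traffic A
    and N channels, via B(0) = 1, B(n) = A*B(n-1) / (n + A*B(n-1))."""
    blocking = 1.0
    for n in range(1, channels + 1):
        blocking = traffic * blocking / (n + traffic * blocking)
    return blocking


def erlang_c(traffic, channels):
    """Probability of delaying a call (Erlang C), computed from Erlang B as
    C = N*B / (N - A*(1 - B)). Assumed convention: 1.0 when A >= N."""
    if traffic >= channels:
        return 1.0
    blocking = erlang_b(traffic, channels)
    return channels * blocking / (channels - traffic * (1.0 - blocking))
```

For A = 1 and N = 2, the functions give P_B = 0.2 and P_C = 1/3, which agrees with evaluating the closed-form expressions by hand.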

Kolmogorov-Smirnov test
• Goodness-of-fit test: quantitative decision whether the empirical cumulative distribution function (ECDF) of a set of observations is consistent with a random sample from an assumed theoretical distribution
• ECDF is a step function (step size 1/N) of N ordered data points Y_1, Y_2, ..., Y_N:

$$
E_N = \frac{n(i)}{N}
$$

  where n(i) is the number of data samples with values smaller than Y_i
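The K-S statistic is the largest vertical distance between the ECDF and the candidate CDF. The sketch below computes it for an arbitrary CDF and includes an exponential CDF as an example candidate. The function names are hypothetical, and the significance decision (comparison against a critical value or p-value) is left out.

```python
import math


def ks_statistic(samples, cdf):
    """Maximum distance between the ECDF of the samples and a candidate CDF.
    The ECDF jumps by 1/N at each ordered sample, so both sides of each
    step are checked."""
    data = sorted(samples)
    n = len(data)
    distance = 0.0
    for i, y in enumerate(data, start=1):
        f = cdf(y)
        distance = max(distance, i / n - f, f - (i - 1) / n)
    return distance


def exponential_cdf(rate):
    """CDF of the exponential distribution with the given rate parameter."""
    return lambda y: (1.0 - math.exp(-rate * y)) if y > 0 else 0.0
```

A usage example would be `ks_statistic(inter_arrival_times, exponential_cdf(1.0 / mean_time))`, where both argument names stand for values estimated from a trace.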

Traffic data
• Records of network events:
  - established, queued, and dropped calls in the Vancouver cell
• Traffic data span periods during 2001, 2002, and 2003:
  Trace (dataset)    Time span              No. of established calls
  2001               November 1–2, 2001     110,348
  2002               March 1–7, 2002        370,510
  2003               March 24–30, 2003      387,340

Hourly traces
• Call holding and call inter-arrival times from the five busiest hours in each dataset (2001, 2002, and 2003)
2001                              2002                              2003
Day/hour                  No.     Day/hour                  No.     Day/hour                  No.
02.11.2001 15:00–16:00    3,718   01.03.2002 04:00–05:00    4,436   26.03.2003 22:00–23:00    4,919
01.11.2001 00:00–01:00    3,707   01.03.2002 22:00–23:00    4,314   25.03.2003 23:00–24:00    4,249
02.11.2001 16:00–17:00    3,492   01.03.2002 23:00–24:00    4,179   26.03.2003 23:00–24:00    4,222
01.11.2001 19:00–20:00    3,312   01.03.2002 00:00–01:00    3,971   29.03.2003 02:00–03:00    4,150
02.11.2001 20:00–21:00    3,227   02.03.2002 00:00–01:00    3,939   29.03.2003 01:00–02:00    4,097

Example: March 26, 2003
(Figure: call inter-arrival times and call holding times (s) versus time (hh:mm:ss), from 22:18:00 to 22:19:00.)

Statistical distributions
• Fourteen candidate distributions:
  - exponential, Weibull, gamma, normal, lognormal, logistic, log-logistic, Nakagami, Rayleigh, Rician, t-location scale, Birnbaum-Saunders, extreme value, inverse Gaussian
• Parameters of the distributions: calculated by performing maximum likelihood estimation
• Best fitting distributions are determined by:
  - visual inspection of the distribution of the trace and the candidate distributions
  - K-S test on potential candidates

Call inter-arrival times: pdf candidates
(Figure: probability density versus call inter-arrival time (s) for the traffic data and the exponential, lognormal, Weibull, gamma, Rayleigh, and normal models.)

Call inter-arrival times: K-S test results (2003 data)
Distribution   Parameter   26.03.2003    25.03.2003    26.03.2003    29.03.2003    29.03.2003
                           22:00–23:00   23:00–24:00   23:00–24:00   02:00–03:00   01:00–02:00
Exponential    h           1             1             0             1             1
               p           0.0027        0.0469        0.4049        0.0316        0.1101
               k           0.0283        0.0214        0.0137        0.0205        0.0185
Weibull        h           0             0             0             0             0
               p           0.4885        0.4662        0.2065        0.286         0.2337
               k           0.0130        0.0133        0.0164        0.014         0.0159
Gamma          h           0             0             0             0             0
               p           0.3956        0.3458        0.127         0.145         0.1672
               k           0.0139        0.0146        0.0181        0.0163        0.0171
Lognormal      h           1             1             1             1             1
               p           1.015E-20     4.717E-15     2.97E-16      3.267E-23     4.851E-21
               k           0.0689        0.0629        0.0657        0.0795        0.0761

Call inter-arrival times: best-fitting distributions (cdf)
(Figure: cumulative distribution versus call inter-arrival time (s) for the traffic data and the exponential, Weibull, and gamma models.)

Call inter-arrival times: estimates of H
• Traces pass the test for time constancy of α: estimates of H are reliable
2001                              2002                              2003
Day/hour                  H       Day/hour                  H       Day/hour                  H
02.11.2001 15:00–16:00    0.907   01.03.2002 04:00–05:00    0.679   26.03.2003 22:00–23:00    0.788
01.11.2001 00:00–01:00    0.802   01.03.2002 22:00–23:00    0.757   25.03.2003 23:00–24:00    0.832
02.11.2001 16:00–17:00    0.770   01.03.2002 23:00–24:00    0.780   26.03.2003 23:00–24:00    0.699
01.11.2001 19:00–20:00    0.774   01.03.2002 00:00–01:00    0.741   29.03.2003 02:00–03:00    0.696
02.11.2001 20:00–21:00    0.663   02.03.2002 00:00–01:00    0.747   29.03.2003 01:00–02:00    0.705

Call holding times: pdf candidates
(Figure: probability density versus call holding time (s) for the traffic data and the lognormal, gamma, Weibull, exponential, normal, and Rayleigh models.)

Ljiljana Trajković, ljilja@cs.sfu.ca, Communication Networks Laboratory (http://www.ensc.sfu.ca/cnl), School of Engineering Science, Simon Fraser University, Vancouver, British Columbia, Canada