Controlling False Alarm/Discovery Rates in Online Internet Traffic Flow Classification
SLIDE 1

Controlling False Alarm/Discovery Rates in Online Internet Traffic Flow Classification

Daniel Nechay, Yvan Pointurier and Mark Coates

McGill University, Department of Electrical and Computer Engineering, Montreal, Quebec, Canada

April 22, 2009

SLIDE 2

Outline

1. Introduction
2. Methodology (Background, Traffic Classification)
3. Data & Processing
4. Simulation Experiments
5. Conclusion

SLIDE 3

Introduction

What is Internet traffic classification?
- Associating a user-defined class with a traffic flow
- A class can be broad (P2P) or application-specific (BitTorrent, Kazaa, etc.)

Why do we need Internet traffic classification? It is required in a variety of applications:
- Providing QoS guarantees or enforcing Service Level Agreements (SLAs)
- Prioritizing, limiting, or blocking traffic
- Network provisioning
- Network security

SLIDE 4

Current Traffic Classification Methods

Port-Based
- Simplest method
- Not reliable

Deep-Packet Inspection
- Examines packet payloads for application-specific signatures
- Raises privacy and legal concerns

Shallow-Packet Inspection
- Derives statistics from packet headers and uses them to classify the flow
- Non-invasive, and still works on encrypted packets

SLIDE 5

Our Contribution

1. A performance guarantee on the false alarm or false discovery rates
2. A novel methodology: a binary classifier converted into a multi-class classifier
3. Online classification

SLIDE 6

Problem Formulation

Definitions
- X: the d-dimensional random variable corresponding to the features of a flow
- Each flow is associated with an output Y ∈ {1, . . . , c + 1}, the class of the flow

SLIDE 7

Problem Statement 1

Goal of Neyman-Pearson classification: minimize the overall misclassification rate while adhering to certain false alarm rate (FAR) constraints.

False Alarm Rate for class i: the expected fraction of flows that do not belong to traffic class i but are incorrectly classified as belonging to i.
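As a concrete illustration, the per-class FAR can be computed from true and predicted labels. The helper below is our own sketch, not code from the paper:

```python
def false_alarm_rate(y_true, y_pred, cls):
    """Fraction of flows NOT belonging to `cls` that are classified as `cls`."""
    negatives = [(t, p) for t, p in zip(y_true, y_pred) if t != cls]
    if not negatives:
        return 0.0
    return sum(p == cls for _, p in negatives) / len(negatives)

# Toy labels: one of the four non-HTTP flows is misclassified as HTTP
y_true = ["HTTP", "HTTP", "HTTPS", "MSN", "POP3", "MSN"]
y_pred = ["HTTP", "HTTP", "HTTP", "MSN", "POP3", "MSN"]
print(false_alarm_rate(y_true, y_pred, "HTTP"))  # 0.25
```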

SLIDE 8

Problem Statement 2

Goal of the Learning to Satisfy (LSAT) framework: provide false discovery rate (FDR) guarantees while minimizing the overall misclassification rate.

False Discovery Rate for class i: the expected fraction of incorrectly classified flows among all traffic flows classified as class i.
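Unlike the FAR, the FDR conditions on the predicted class rather than the true class; again a sketch of our own, not code from the paper:

```python
def false_discovery_rate(y_true, y_pred, cls):
    """Fraction of flows classified as `cls` whose true class differs."""
    flagged = [(t, p) for t, p in zip(y_true, y_pred) if p == cls]
    if not flagged:
        return 0.0
    return sum(t != cls for t, _ in flagged) / len(flagged)

# Toy labels: of the three flows flagged as HTTPS, one is really HTTP
y_true = ["HTTPS", "HTTPS", "HTTP", "MSN"]
y_pred = ["HTTPS", "HTTPS", "HTTPS", "MSN"]
print(false_discovery_rate(y_true, y_pred, "HTTPS"))  # 1/3
```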

SLIDE 9

Background

Support Vector Machines (SVM)
SVMs consist of two steps:
1. Transform the input features x_i via a mapping Φ : R^d → H, where H is a high-dimensional Hilbert space
2. Construct a hyperplane (the decision boundary) in H according to the max-margin principle

Cost-Sensitive Classification
- A regular SVM treats all misclassifications equally
- Cost-sensitive classification (in our case, the 2ν-SVM) treats the misclassification of each class differently
- Two parameters, ν− and ν+, control the misclassification cost for the two classes
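The 2ν-SVM itself is not available in mainstream libraries, but the effect of asymmetric misclassification costs can be sketched with scikit-learn's `SVC` and its `class_weight` parameter, an analogous (not identical) mechanism; the data below is synthetic:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Synthetic 2-D "flow features": two overlapping Gaussian classes
X = np.vstack([rng.normal(0.0, 1.0, (100, 2)),
               rng.normal(2.0, 1.0, (100, 2))])
y = np.array([1] * 100 + [-1] * 100)

# A large weight on class -1 makes its misclassification costly,
# pushing the decision boundary toward class +1; this mirrors the
# role of the nu+/nu- trade-off in the 2nu-SVM.
clf = SVC(kernel="rbf", gamma=0.5, class_weight={1: 1.0, -1: 10.0})
clf.fit(X, y)
print(clf.predict([[2.0, 2.0]])[0])  # the heavily weighted class wins: -1
```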

SLIDE 10

What is LSAT?

Goal: learn a set in the input (feature) space that simultaneously satisfies multiple output constraints.

The LSAT framework is distinguished by two properties:
1. Multiple performance criteria must be satisfied
2. Output behaviour is assessed only on the solution set

SLIDE 11

LSAT example

[Figure: comparison of LSAT to a Weighted SVM (WSVM)]

Reference:
F. Thouin, M. J. Coates, B. Eriksson, R. Nowak, and C. Scott, "Learning to Satisfy," in Proc. Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP), Las Vegas, NV, USA, Apr. 2008.

SLIDE 12

Traffic Classification

How to classify c classes?
- Use a chain of c binary classifiers
- Each binary classifier is responsible for a particular class
- Ordering is important
- A flow is classified as unknown if no classifier maps it to a class

How to determine the best classifier?
- Find the best parameters ν+, ν−, and σ for the 2ν-SVM
- Introduce cost functions to rank the classifiers
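The chaining idea can be sketched as follows; this is our own toy illustration, with simple port rules standing in for the trained 2ν-SVMs:

```python
def classify_chain(flow, chain):
    """Pass a flow through an ordered chain of binary classifiers.
    `chain` is a list of (class_name, is_member) pairs; the first
    classifier that accepts the flow assigns its class."""
    for cls, is_member in chain:
        if is_member(flow):
            return cls
    # No classifier claimed the flow
    return "UNKNOWN"

# Ordering matters: earlier classifiers get first claim on a flow
chain = [
    ("HTTP",  lambda f: f["dst_port"] == 80),
    ("HTTPS", lambda f: f["dst_port"] == 443),
]

print(classify_chain({"dst_port": 443}, chain))  # HTTPS
print(classify_chain({"dst_port": 25}, chain))   # UNKNOWN
```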

SLIDE 13

Cost Functions

Traffic classification with FAR constraints
For every classifier, the following risk function is used:

R(f) = (1 / α_{s(i)}) · max(P_F(s(i)) − α_{s(i)}, 0) + P_M(s(i))

where
- s(i): class i
- α_{s(i)}: FAR constraint for class i
- P_F(s(i)): FAR for class i
- P_M(s(i)): misclassification rate for class i

Traffic classification with FDR constraints
Ensure that a classifier satisfies the constraint set, then choose the classifier that minimizes the misclassification rate.
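Numerically, this risk function rewards staying within the FAR budget: below the constraint only the miss rate contributes, while violations are penalised in proportion to their relative size. A direct transcription, with illustrative values:

```python
def far_risk(p_f, p_m, alpha):
    """Risk of one classifier under a FAR constraint:
    R = (1/alpha) * max(P_F - alpha, 0) + P_M."""
    return max(p_f - alpha, 0.0) / alpha + p_m

# Within the FAR budget (P_F < alpha): only the miss rate counts
print(far_risk(p_f=0.003, p_m=0.05, alpha=0.004))  # 0.05
# Twice the budget: a penalty of 1.0 is added to the miss rate (~1.05)
print(far_risk(p_f=0.008, p_m=0.05, alpha=0.004))
```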

SLIDE 14

Input Data

Data
- Collected a 24-hour trace using tcpdump in April and split the trace by hour
- Only TCP flows were considered as inputs
- tcptrace collected 142 statistics for every flow
- Feature selection reduced the feature space to 5 features
- Flows are classified after their first six packets
- Bro was used to provide ground truth

SLIDE 15

Application Breakdown

Table: Application breakdown for flows > 6 packets

Application   Flows    Flows %   Size (GB)   Size %
HTTP          315375   78.3%     4.1         74.6%
HTTPS         20736    5.2%      0.29        5.4%
MSN           3364     0.8%      0.04        0.7%
POP3          1311     0.3%      0.01        0.2%
OTHER         61870    15.4%     1.05        19.1%

SLIDE 16

Simulation environment

Statistics Used
- Total number of bytes sent (C→S)
- Number of packets with the FIN flag set (C→S)
- Window scaling factor used (C→S)
- Total number of bytes truncated in the packet capture (C→S)
- Total number of packets truncated in the packet capture (S→C)

SLIDE 17

FAR-constrained classifier

Classifiers (three compared):
- Baseline classifier: multi-class SVM
- FAR-constrained classifier with α{HTTP} = 0.4%
- FAR-constrained classifier with α{HTTPS, HTTP} = 0.05%

Hour 1 Results
- Trained on 1000 randomly chosen points in hour 1 and validated on the rest of the hour
- The baseline classifier achieves α{HTTP} = 3.7% and α{HTTPS, HTTP} = 0.07%
- The class-wise FAR-constrained classifier achieves α{HTTP} = 0.3%, while the pairwise FAR-constrained classifier achieves α{HTTPS, HTTP} = 0.02%

SLIDE 18

FAR-constrained classifier

Overall Accuracy for Hours 2–24

[Figure: accuracy (%) by hour for the baseline classifier, the FAR(HTTP) = 0.4% classifier, and the FAR(HTTPS, HTTP) = 0.02% classifier]

SLIDE 19

FAR-constrained classifier

FAR(HTTP) for Hours 2–24

[Figure: FAR(HTTP) (%) by hour for the baseline classifier and the FAR(HTTP) = 0.4% classifier]

SLIDE 20

FAR-constrained classifier

FAR(HTTPS, HTTP) for Hours 2–24

[Figure: FAR(HTTPS, HTTP) (%) by hour for the baseline classifier and the FAR(HTTPS, HTTP) = 0.02% classifier]

SLIDE 21

FDR-constrained classifier

Classifiers (three compared):
- Baseline classifier: multi-class SVM
- Unconstrained binary-chained classifier
- FDR-constrained classifier with β{HTTPS} = 5%

Hour 1 Results
- Trained on 1000 randomly chosen points in hour 1
- The unconstrained binary-chained classifier has β{HTTPS} = 7.0%, while the FDR-constrained classifier has β{HTTPS} = 4.2%

SLIDE 22

FDR-constrained classifier

Overall Accuracy for Hours 2–24

[Figure: accuracy (%) by hour for the multi-class SVM baseline, the unconstrained binary chain, and the FDR(HTTPS) = 5% classifier]

SLIDE 23

FDR-constrained classifier

FDR(HTTPS) for Hours 2–24

[Figure: FDR(HTTPS) (%) by hour for the multi-class SVM baseline, the unconstrained binary chain, and the FDR(HTTPS) = 5% classifier]

SLIDE 24

Conclusion

Summary
- Proposed two novel algorithms for Internet traffic classification
- Both provide performance guarantees
- Validated the approach on data provided by an ISP

Ongoing Research
- Experimenting on a more diverse data set
- Creating a hybrid classifier