Network traffic classification: From theory to practice



slide-1
SLIDE 1

Network traffic classification: From theory to practice

Pere Barlet-Ros

Associate Professor at UPC BarcelonaTech Co-founder and Chairman at Polygraph.io

Joint work with: Valentín Carela-Español, Tomasz Bujlow and Josep Solé-Pareta

slide-2
SLIDE 2

Background

  • What do we refer to as traffic classification?

– Identifying the application that generated each flow

  • What is traffic classification used for?

– Network planning and dimensioning
– Per-application performance evaluation
– Traffic steering / QoS / SLA validation
– Charging and billing

slide-3
SLIDE 3

State of the Art: Ports

  • Port-based

– Computationally lightweight
– Payloads not needed
– Easy to understand and program
– Low accuracy and completeness
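
As a minimal sketch of the port-based approach, the lookup below maps a transport protocol and port to an application name. The port table and the function name are illustrative assumptions, not taken from the slides; a real deployment would use a fuller registry.

```python
# Illustrative port table (an assumption for this sketch, not from the slides).
# Keys are (IP protocol number, port): 6 = TCP, 17 = UDP.
WELL_KNOWN_PORTS = {
    (6, 80): "http",
    (6, 443): "https",
    (6, 25): "smtp",
    (6, 22): "ssh",
    (17, 53): "dns",
}

def classify_by_port(protocol, src_port, dst_port):
    """Return an application label from the transport ports, or 'unknown'."""
    # Check the destination port first, then the source port (reply direction).
    for port in (dst_port, src_port):
        label = WELL_KNOWN_PORTS.get((protocol, port))
        if label is not None:
            return label
    return "unknown"

print(classify_by_port(6, 51514, 443))   # https
print(classify_by_port(6, 51514, 6881))  # unknown
```

The second call hints at why accuracy and completeness are low: any application on a non-registered port stays unlabelled, and any application that reuses port 80 or 443 is labelled as web traffic.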

slide-4
SLIDE 4

State of the Art: DPI

  • Deep packet inspection (DPI)

– High accuracy and completeness
– Computationally expensive
– Needs payload access
– Privacy concerns
– Cannot work with encrypted traffic

slide-5
SLIDE 5

State of the Art: ML

  • Machine Learning

– High accuracy and completeness
– Computationally viable
– Payloads not needed
– Can work with encrypted traffic
– Needs retraining

slide-6
SLIDE 6

Main limitations of ML-TC

  • Introduction into real products and operational environments is limited and slow

– Current proposals suffer from practical problems
– Actual products rely on simpler methods or DPI

  • We identified 3 main real-world problems

1) The deployment problem
2) The maintenance problem
3) The validation problem

slide-7
SLIDE 7

1) Deployment problem

  • Current solutions are difficult to deploy

– Need dedicated hardware appliances / probes
– Need packet-level access (e.g. to compute features, …)

  • How to address this problem?

– Work with flow-level data (e.g. NetFlow)
– Support packet sampling (e.g. Sampled NetFlow)

slide-8
SLIDE 8

NetFlow w/o sampling

  • Challenge: NetFlow v5 features are very limited

– IPs, ports, protocol, TCP flags, duration, #pkts, …

  • State-of-the-art ML technique: C4.5 decision tree
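
To illustrate how a C4.5 tree can pick splits on NetFlow-style features, the sketch below computes the gain ratio (information gain divided by split information) for a threshold on one numeric feature. This is only the split criterion, not a full tree and not the authors' implementation; the sample flows and the choice of destination port as the feature are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def gain_ratio(values, labels, threshold):
    """Gain ratio of the binary split `value <= threshold`."""
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0  # a split that separates nothing carries no information
    n = len(labels)
    remainder = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    info_gain = entropy(labels) - remainder
    # Split information: entropy of the partition sizes themselves.
    split_info = entropy(["L"] * len(left) + ["R"] * len(right))
    return info_gain / split_info

# Hypothetical flows: destination port as the single feature.
dst_ports = [80, 80, 443, 53, 53, 53]
apps = ["web", "web", "web", "dns", "dns", "dns"]
print(gain_ratio(dst_ports, apps, 53))  # 1.0
```

A tree builder would evaluate candidate thresholds for every available feature (ports, protocol, TCP flags, duration, packet count) and split on the one with the highest gain ratio.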
slide-9
SLIDE 9

Results (NetFlow w/o sampling)

  • UPC dataset (publicly available)

– 7 x 15 min traces from the UPC access link
– Collected on different days and at different hours
– Labelled with L7-filter (strict version with lower FPR)

slide-10
SLIDE 10

Results (Sampled NetFlow)

  • Impact of packet sampling
slide-11
SLIDE 11

Sources of inaccuracy

1) Error in the estimation of the traffic features
2) Changes in flow size distribution
3) Changes in flow splitting probability
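
The first two sources of inaccuracy can be made concrete with a short sketch, assuming each packet is sampled independently with probability p (an assumption of this example; the slides do not state the sampling scheme). Scaling a sampled count by 1/p gives only a noisy estimate of the original feature, and short flows are likely to be missed entirely, which shifts the observed flow size distribution.

```python
def detection_probability(num_packets, sampling_rate):
    """Probability that at least one packet of a flow is sampled."""
    return 1.0 - (1.0 - sampling_rate) ** num_packets

def estimate_packets(sampled_packets, sampling_rate):
    """Scale a sampled packet count back to an estimate of the original count."""
    return sampled_packets / sampling_rate

p = 0.01  # assumed sampling rate of 1/100

# A single-packet flow is rarely seen; a 1000-packet flow almost always is.
print(detection_probability(1, p))
print(detection_probability(1000, p))

# Three sampled packets suggest about 300 original packets, with large variance.
print(estimate_packets(3, p))
```

The third source, flow splitting, is not modelled here: when few packets of a flow are sampled, the gaps between them can exceed the exporter's inactivity timeout, so one flow is reported as several records.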

slide-12
SLIDE 12

Solution (Sampled NetFlow)

slide-13
SLIDE 13

Deployment problem: Summary

  • Current proposals are difficult to deploy
  • Proposed a simple but effective technique

– Supports standard NetFlow data
– Supports packet sampling

  • Main limitation: Needs to be frequently retrained

  • V. Carela-Español, P. Barlet-Ros, A. Cabellos-Aparicio, J. Solé-Pareta. Analysis of the impact of sampling on NetFlow traffic classification. Computer Networks, 55(5), 2011.

slide-14
SLIDE 14

2) Maintenance problem

  • Difficult to keep classification model updated

– Traffic changes, application updates, new applications
– Involves significant human intervention
– ML models need to be frequently retrained

  • Possible solution to the problem

– Make retraining automatic
– Computationally viable
– Without human intervention
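
One possible way to automate the retraining decision, sketched under assumptions: compare the classifier's labels with trusted reference labels on a small set of flows and retrain when agreement drops below a threshold. The 0.95 threshold, the function names and the sample labels are hypothetical and not taken from the slides.

```python
def agreement(predicted, reference):
    """Fraction of flows where the ML label matches the reference label."""
    if not reference:
        return 1.0  # nothing to compare against yet
    matches = sum(1 for p, r in zip(predicted, reference) if p == r)
    return matches / len(reference)

def needs_retraining(predicted, reference, threshold=0.95):
    """True when agreement with the reference labels drops below `threshold`."""
    return agreement(predicted, reference) < threshold

# Hypothetical labels for four flows.
predicted = ["http", "dns", "http", "ssh"]
reference = ["http", "dns", "bittorrent", "ssh"]
print(agreement(predicted, reference))         # 0.75
print(needs_retraining(predicted, reference))  # True
```

Because the check is a single pass over a small sample, it stays computationally cheap and needs no operator in the loop.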

slide-15
SLIDE 15

Autonomic Traffic Classification

  • Lightweight DPI for retraining

– Small traffic sample (e.g. 1/10000 flow sampling)
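
A minimal sketch of flow sampling at a rate of about 1/10000, assuming a hash of the flow's 5-tuple decides whether the flow is passed to the DPI engine. Hash-based selection is an assumption of this example, chosen so that every packet of a flow gets the same decision; the slides do not say how the sample is drawn. The addresses and ports are made up.

```python
import zlib

def sample_flow(src_ip, dst_ip, src_port, dst_port, protocol, one_in=10000):
    """Select roughly one flow in `one_in`; every packet of a flow gets the same answer."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{protocol}".encode()
    return zlib.crc32(key) % one_in == 0

# Synthetic flows (made-up addresses and ports) to show that the selection is sparse.
flows = [("10.0.0.1", "192.0.2.7", port, 443, 6) for port in range(1024, 65536)]
selected = [flow for flow in flows if sample_flow(*flow)]
print(len(selected), "of", len(flows), "flows selected for DPI labelling")
```

Only the selected flows need payload inspection, which keeps the DPI cost low while still producing fresh labelled examples for retraining.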

slide-16
SLIDE 16

Evaluation

  • 14-day trace collected at CESCA
slide-17
SLIDE 17

Temporal/Spatial obsolescence

  • Comparison without autonomic retraining
slide-18
SLIDE 18

Maintenance problem: Summary

  • Existing classifiers need periodic retraining

– Temporal obsolescence: Changes in application traffic
– Spatial obsolescence: Different networks

  • Autonomic traffic classification system

– Easy to deploy: Works with Sampled NetFlow
– Easy to maintain: Lightweight DPI for self-training

  • V. Carela-Español, P. Barlet-Ros, O. Mula-Valls, J. Solé-Pareta. An autonomic traffic classification system for network operation and management. Journal of Network and Systems Management, 23(3):401-419, 2015.

slide-19
SLIDE 19

3) Validation problem

  • Current proposals are difficult to validate, compare and reproduce

– Private datasets
– Different ground-truth generators

  • Our contribution

– Publication of labeled datasets (with payloads)
– Common benchmark to validate/compare/reproduce
– Validation of common ground-truth generators
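
With a labeled dataset, validating a ground-truth generator reduces to comparing its output with the known labels. The sketch below computes per-application accuracy; the function name and the sample labels are hypothetical and only illustrate the comparison, not the metrics used in the cited papers.

```python
from collections import defaultdict

def per_application_accuracy(ground_truth, tool_labels):
    """Map each true application to the fraction of its flows the tool labeled correctly."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for truth, guess in zip(ground_truth, tool_labels):
        totals[truth] += 1
        if guess == truth:
            correct[truth] += 1
    return {app: correct[app] / totals[app] for app in totals}

# Hypothetical labels for six flows: known application vs. DPI tool output.
truth = ["skype", "skype", "dns", "dns", "dns", "http"]
tool = ["skype", "unknown", "dns", "dns", "dns", "ssl"]
print(per_application_accuracy(truth, tool))
# {'skype': 0.5, 'dns': 1.0, 'http': 0.0}
```

Running the same function over the output of each tool on the same dataset gives directly comparable numbers.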

slide-20
SLIDE 20

Proposal

  • Reliable labeled dataset with full payloads

– Accurate: VBS (label from the application socket)
– Avoid privacy issues: Realistic artificial traffic

slide-21
SLIDE 21

Methodology

  • Manually generate representative traffic

– Create fake accounts (e.g. Gmail, Facebook, Twitter)
– Interact with the service simulating human behavior (e.g. posting, chatting, gaming, watching videos, …)

slide-22
SLIDE 22

Dataset

  • > 750K flows, ~55 GB of data
slide-23
SLIDE 23

DPI tools compared

slide-24
SLIDE 24

Application protocols

slide-25
SLIDE 25

Applications

slide-26
SLIDE 26

Web services (summary)

  • PACE: 16/34 (6 over 80%)
  • nDPI: 10/34 (6 over 80%)
  • OpenDPI: 2/34
  • Libprotoident: 0/34
  • L7-filter: 0/44 (high FPR)
  • NBAR: 0/34
slide-27
SLIDE 27

Validation problem: Summary

  • Comparison of most popular ground-truth generators

– PACE: Best results at all classification levels
– Libprotoident: Very good results at application/protocol levels
– nDPI: Good results at web services level, open source
– NBAR and L7-filter: Very poor results

  • Dataset including payloads is publicly available

– http://www.cba.upc.edu/monitoring/traffic-classification (also includes all other datasets presented in these slides)
– Common benchmark to validate, compare and reproduce

  • T. Bujlow, V. Carela-Español, P. Barlet-Ros. Independent comparison of popular DPI tools for traffic classification. Computer Networks, 76:75-89, 2015.

  • V. Carela-Español, T. Bujlow, P. Barlet-Ros. Is our ground-truth for traffic classification reliable? In Proc. of Passive and Active Measurement Conf. (PAM), 2014.

slide-28
SLIDE 28

Network Polygraph

  • Addressed 3 practical problems

– The deployment problem (Sampled NetFlow)
– The maintenance problem (Autonomic retraining)
– The validation problem (Labeled payload traces)

  • We identified interest in the market

– We created a UPC spin-off: https://polygraph.io
– Several customers worldwide

  • P. Barlet-Ros, J. Sanjuàs, V. Carela-Español. Network Polygraph: A cloud-based network visibility service. In ACM SIGCOMM Conf., Industrial Demo, 2015.

slide-29
SLIDE 29

Why Network Polygraph?

  • Other products are expensive and difficult to deploy

– Can only be afforded by large operators, ISPs, …
– A large portion of the market is SMEs (>90% in EU)

  • Our technology, based on Sampled NetFlow, only needs a small volume of traffic data

– <0.5% of extra bandwidth usage
– Can be provided as a service from the cloud (SaaS)
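
The order of magnitude of the export overhead can be checked with simple arithmetic. The sketch below assumes NetFlow v5 export (24-byte header, 48-byte records, at most 30 records per datagram) and ignores UDP/IP headers; the flow count and traffic volume are hypothetical and are not measurements from the slides.

```python
import math

HEADER_BYTES = 24        # NetFlow v5 export header
RECORD_BYTES = 48        # NetFlow v5 flow record
RECORDS_PER_PACKET = 30  # maximum records per v5 export datagram

def export_overhead(num_flows, traffic_bytes):
    """Ratio of NetFlow v5 export payload bytes to monitored traffic bytes."""
    packets = math.ceil(num_flows / RECORDS_PER_PACKET)
    export_bytes = packets * HEADER_BYTES + num_flows * RECORD_BYTES
    return export_bytes / traffic_bytes

# Hypothetical interval: 30,000 flow records describing 1 GB of traffic.
print(f"{export_overhead(30_000, 1_000_000_000):.4%}")
```

With these made-up numbers the export stream is well under 0.5% of the monitored traffic; the actual ratio depends on the average flow size and the sampling rate.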

slide-30
SLIDE 30

Visibility-to-cost ratio

[Figure: visibility versus cost]

slide-31
SLIDE 31

Website + On-Line Demo

https://polygraph.io

slide-32
SLIDE 32

traffic volume, breakdown by application

slide-33
SLIDE 33

HTTP services

slide-34
SLIDE 34

top talkers (addresses, ports, autonomous systems)

slide-35
SLIDE 35

subnetwork-level bandwidth hogs

slide-36
SLIDE 36

traffic geolocation (origins & destinations)

slide-37
SLIDE 37

anomaly and attack detection with automatic baselining

slide-38
SLIDE 38

indexed traffic database for forensic analysis

slide-39
SLIDE 39

Network Polygraph

Talaia Networks, S.L.
K2M – Parc UPC Campus Nord
Jordi Girona, 1-3
Barcelona (08034), Spain
Telephone: +34 93 405 45 87
contact@polygraph.io
https://polygraph.io