BLINC: Multilevel Traffic Classification in the Dark (PowerPoint PPT presentation)




SLIDE 1

BLINC: Multilevel Traffic Classification in the Dark

Thomas Karagiannis, UC Riverside Konstantina Papagiannaki, Intel Research Cambridge Michalis Faloutsos, UC Riverside

SLIDE 2

The problem of workload characterization

  • The goal: Classify Internet traffic flows according to the applications that generate them, “in the dark”

– No port numbers
– No payload

(Diagram: flows labeled by application, e.g. web, streaming, p2p.)

SLIDE 3

The problem of workload characterization – Why in the dark?

  • Traffic profiling based on TCP/UDP ports

– Misleading

  • Payload-based classification

– Practically infeasible

  • Applications are “hiding” their traffic

– P2P applications, Skype, etc.

  • Recent research approaches

– Statistical/machine-learning-based classification (Roughan et al. IMC’04, Moore et al. SIGMETRICS’05)

– Sensitive to network dynamics such as congestion

SLIDE 4

Our contributions

  • We present BLINC (BLINd Classification), a fundamentally different “in the dark” approach

– We shift the focus to the Internet host
– We analyze host behavior at three levels

  • Social
  • Functional
  • Application
  • We identify “signature” communication patterns
  • Highly accurate classification
SLIDE 5

Outline

  • Developing a classification benchmark

– Payload-based classification

  • BLINC design

– Multilevel classification
– Signature communication patterns

  • BLINC evaluation
SLIDE 6

Classification benchmark

  • Packet traces with machine-readable headers

– Residential (2 traces)

  • 25 hours & 34 hours, 110 Mbps
  • web (35%), p2p (32%)

– Genome campus

  • 44 hours, 25 Mbps, ftp (67%)
  • Classification based on payload signatures

– Caveats: nonpayload (1%-2%), unknown (6%-16%)

SLIDE 7

BLINC overview

  • In the dark classification

– No examination of port numbers
– No examination of user payload

  • Characterize the host

– Insensitive to congestion and path changes

  • Deployable with existing equipment

– Operates on flow records

SLIDE 8

BLINC: Classification process

  • Characterize the host

– Social: Popularity/Communities
– Functional: Consumer/provider of services
– Application: Transport layer interactions

  • Identify signature communication patterns
  • Match observed behavior to signatures
SLIDE 9

  • 1. Social level
  • Characterization of the popularity of hosts
  • Two types of behavior:

– Based on number of destination IPs
– Communities: groups of communicating hosts
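The social-level popularity metric above can be sketched directly from flow records. This is a minimal illustration, not the talk's implementation; the flow tuples and IPs are hypothetical.

```python
from collections import defaultdict

# Hypothetical flow records: (src IP, dst IP, src port, dst port, proto)
flows = [
    ("10.0.0.1", "192.168.0.5", 4321, 80, "tcp"),
    ("10.0.0.1", "192.168.0.6", 4322, 80, "tcp"),
    ("10.0.0.2", "192.168.0.5", 5000, 53, "udp"),
]

def popularity(flow_records):
    """Social-level metric: number of distinct destination IPs
    each source host talks to."""
    dests = defaultdict(set)
    for src, dst, *_ in flow_records:
        dests[src].add(dst)
    return {src: len(ips) for src, ips in dests.items()}

print(popularity(flows))  # {'10.0.0.1': 2, '10.0.0.2': 1}
```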

SLIDE 10

  • 1. Social level: Popularity
  • Reveals only basic application traffic properties

(Plot: CCDF of destination IPs per host; heavier tail for p2p and malware.)

SLIDE 11

  • 1. Social level: Communities
  • Communication cliques
  • Perfect cliques

– Attacks

  • Partial cliques

– Collaborative applications (p2p, games)

  • Partial cliques with same domain IPs

– Server farms (e.g., web, dns, mail)
SLIDE 12

  • 2. Functional level
  • We characterize hosts based on the tuple (IP, port)
  • We identify three types of behavior

– Client: Consumer of services
– Server: Provider of services
– Collaborative

SLIDE 13

  • 2. Functional level: Client vs. Server

(Diagram: flows with src port 1000, 1001, 1002.)

Observation: the host uses a different ephemeral src port for every flow.
Rule: hosts that use a large number of source ports are clients.

SLIDE 14

  • 2. Functional level: Client vs. Server

(Diagram: flows with src ports 80, 80, 443.)

Observation: the host uses only two src ports for all flows.
Rule: hosts that use a small number of source ports are offering services on these ports.
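The two functional-level rules above (many ephemeral source ports → client; few reused source ports → server) can be sketched as a one-line classifier. The threshold of 10 ports is a hypothetical cutoff for illustration, not a value from the talk.

```python
def host_role(src_ports_used, port_threshold=10):
    """Guess the role of an endpoint from source-port diversity:
    many distinct ephemeral ports suggest a client, a few reused
    ports suggest a server. port_threshold is a hypothetical knob."""
    distinct = len(set(src_ports_used))
    return "client" if distinct >= port_threshold else "server"

print(host_role(range(1000, 1020)))  # client (20 distinct ephemeral ports)
print(host_role([80, 80, 443]))      # server (only 2 distinct ports)
```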
SLIDE 15

  • 2. Functional level: Characterizing the host

(Plot: flows vs. source ports per application, with distinct client and server regions.)

Collaborative applications: no distinction between servers and clients.
Obscure behavior due to multiple mail protocols and passive ftp.

SLIDE 16

  • 3. Application level
  • Interactions between network hosts display diverse patterns across application types
  • We capture patterns using “graphlets”

– Target the most typical behavior
– Relationship between fields of the 4-tuple

SLIDE 17

  • 3. Application level: Graphlets
  • Graphlets have four columns corresponding to the 4-tuple: src IP, dst IP, src port and dst port

(Diagram columns: sourceIP, destinationIP, sourcePort, destinationPort.)

  • Lines connect nodes when flows contain the specific field values
  • Each node is a distinct entry for each column

(Diagram: example graphlet for flows between 192.168.1.1 and 10.0.0.0 involving ports 135, 445, and 1026.)

SLIDE 18

  • 3. Graphlet Generation (FTP)

sourceIP  destinationIP  sourcePort  destinationPort

(Diagram: the FTP graphlet for host X grows as flows arrive; final flow set below.)

X  Y  21    10001
X  Y  20    10002
X  Z  21    3000
X  Z  1026  3001
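Building a graphlet from flow records can be sketched as follows, using the four FTP flows from this slide. The column ordering follows the slide (srcIP → dstIP → srcPort → dstPort); this is an illustrative data structure, not the talk's implementation.

```python
from collections import defaultdict

def build_graphlet(flows):
    """Build a BLINC-style graphlet from (srcIP, dstIP, srcPort, dstPort)
    flow records: one column per field, a node per distinct value in a
    column, and an edge between nodes of adjacent columns whenever some
    flow carries both values."""
    columns = ("srcIP", "dstIP", "srcPort", "dstPort")
    nodes = defaultdict(set)  # column name -> distinct values seen
    edges = set()             # ((col, value), (next_col, value)) pairs
    for flow in flows:
        labeled = list(zip(columns, flow))  # [('srcIP', 'X'), ...]
        for col, val in labeled:
            nodes[col].add(val)
        for a, b in zip(labeled, labeled[1:]):
            edges.add((a, b))
    return nodes, edges

# The four FTP flows from the slide: host X serving peers Y and Z
ftp_flows = [
    ("X", "Y", 21, 10001),
    ("X", "Y", 20, 10002),
    ("X", "Z", 21, 3000),
    ("X", "Z", 1026, 3001),
]
nodes, edges = build_graphlet(ftp_flows)
print(sorted(nodes["srcPort"]))  # [20, 21, 1026]
```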

SLIDE 19

  • 3. Graphlet Library
SLIDE 20

Heuristics: Further improving performance

  • Using the transport layer protocol.
SLIDE 21

Heuristics: Further improving performance

Cardinality of set of dst IPs versus set of dst ports varies with the application

  • Using the relative cardinality of sets.
SLIDE 22

Heuristics: Further improving performance

WEB: #dst ports >> #dst IPs
P2P: #dst ports <= #dst IPs

  • Using the relative cardinality of sets.
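The relative-cardinality rule above can be sketched as a simple comparison of distinct destination ports vs. distinct destination IPs. The `ratio` threshold is a hypothetical tuning knob, not a value from the talk.

```python
def cardinality_hint(dst_ips, dst_ports, ratio=2.0):
    """Apply the slide's rule of thumb: web traffic shows many more
    distinct destination ports than destination IPs (parallel
    connections per client), while p2p shows at most as many ports
    as IPs. ratio is a hypothetical threshold."""
    n_ips = len(set(dst_ips))
    n_ports = len(set(dst_ports))
    if n_ports > ratio * n_ips:
        return "web-like"
    if n_ports <= n_ips:
        return "p2p-like"
    return "undecided"

print(cardinality_hint(["1.2.3.4"], [3001, 3002, 3003]))       # web-like
print(cardinality_hint(["1.2.3.4", "5.6.7.8"], [6881, 6881]))  # p2p-like
```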
SLIDE 23

Heuristics: Further improving performance

  • Using the communities

(Diagram: 10.0.0.0 is known to be WEB; 10.0.0.1, in the same community, is probably WEB too.)
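The community heuristic amounts to label propagation within a group of related hosts. A minimal sketch, with illustrative host names and a simple "exactly one known label" rule that is my assumption, not the talk's algorithm:

```python
def propagate_labels(communities, known):
    """Spread an application label across a community: if exactly one
    label is known inside a group of hosts (e.g., a server farm), tag
    the unlabeled members with it."""
    labels = dict(known)
    for group in communities:
        seen = {labels[h] for h in group if h in labels}
        if len(seen) == 1:
            tag = seen.pop()
            for h in group:
                labels.setdefault(h, tag)
    return labels

farm = [["10.0.0.0", "10.0.0.1"]]  # hosts observed in the same community
print(propagate_labels(farm, {"10.0.0.0": "WEB"}))
# {'10.0.0.0': 'WEB', '10.0.0.1': 'WEB'}
```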

SLIDE 24

Heuristics: Further improving performance

  • Other heuristics:

– Using the per-flow average packet size
– Recursive (mail/dns servers talk to mail/dns servers, etc.)
– Failed flows (malware, p2p)
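The failed-flows heuristic can be sketched on flow records. The record fields (`packets`, cumulative TCP `flags`) are hypothetical, modeled loosely on what a NetFlow-like collector exports; treating a lone unanswered SYN as a failed flow is my assumption.

```python
def failed_flow_fraction(flows):
    """Fraction of a host's flows that 'failed': a single SYN packet
    with no reply, a pattern the slide associates with malware
    scanning and dead p2p peers. Flow dicts are hypothetical."""
    failed = sum(1 for f in flows if f["packets"] == 1 and f["flags"] == "S")
    return failed / len(flows)

print(failed_flow_fraction([
    {"packets": 1, "flags": "S"},    # unanswered SYN (scan or dead peer)
    {"packets": 12, "flags": "SA"},  # normal established flow
]))  # 0.5
```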

SLIDE 25

Classification Results

  • We evaluate BLINC using two metrics:

– Completeness

  • Percentage classified by BLINC

– Accuracy

  • Percentage of BLINC-classified traffic that is classified correctly
  • We compare against payload classification

– Exclude unknown and nonpayload flows
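The two evaluation metrics can be computed as follows. This is an illustrative sketch: `None` marks a flow BLINC leaves unclassified, and the label strings are hypothetical.

```python
def evaluate(blinc_labels, payload_labels):
    """Completeness: share of benchmark flows that BLINC labels at all.
    Accuracy: share of the labeled flows whose label agrees with the
    payload-based benchmark. Flows the payload classifier marks
    unknown or non-payload are excluded, as on the slide."""
    pairs = [(b, p) for b, p in zip(blinc_labels, payload_labels)
             if p not in ("unknown", "nonpayload")]
    labeled = [(b, p) for b, p in pairs if b is not None]
    completeness = len(labeled) / len(pairs)
    accuracy = sum(b == p for b, p in labeled) / len(labeled)
    return completeness, accuracy

# Toy example with hypothetical labels:
c, a = evaluate(["web", "p2p", None, "web"],
                ["web", "web", "web", "unknown"])
print(round(c, 2), a)  # 0.67 0.5
```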

SLIDE 26

BLINC achieves highly accurate classification

80%-90% completeness! >90% accuracy!

SLIDE 27

Characterizing the unknown: Non-payload flows

BLINC is not limited by non-payload flows or unknown signatures.
Flows classified as attacks reveal known exploits.

SLIDE 28

BLINC issues and limitations

  • Extensibility

– Creating and incorporating new graphlets

  • Application sub-types

– e.g., BitTorrent vs. Kazaa

  • Transport-layer encryption

– then what?

  • NATs

– Should handle most cases

  • Access vs. Backbone networks?

– Should handle these, but no data to test

SLIDE 29

Conclusions

  • A new way of thinking of the classification problem

– Classify nodes instead of flows
– Multi-level analysis:

  • social, functional, transport-layer characteristics
  • each level provides corroborative evidence or insight
  • BLINC works well in practice

– classifies 80-90% of the traffic
– with >90% accuracy

  • Going beyond payload-based classification

– Nonpayload/unknown flows

  • Building block for security applications