A Labeled Data Set For Flow-based Intrusion Detection


slide-1
SLIDE 1

A Labeled Data Set For Flow-based Intrusion Detection

Anna Sperotto, Ramin Sadre, Frank van Vliet, Aiko Pras
Design and Analysis of Communication Systems
University of Twente, The Netherlands
NMRG Workshop on Netflow/IPFIX Usage in Network Management
Maastricht - July 30, 2010

slide-2
SLIDE 2

Contents

  • Operational experience in trace collections
  • Experimental Setup
  • Data processing and labeling
  • The labeled data set
slide-3
SLIDE 3

Introduction

  • Systems are evaluated on proprietary traces
  • No shared ground truth
  • Results cannot be directly compared!

[Figure: system 1 evaluated on trace 1 reports "30% attacks!"; system 2 evaluated on trace 2 reports "85% attacks!"]

slide-4
SLIDE 4

Data set requirements

We want the data set to be:

  • realistic
  • complete and correct in its labeling
  • labeled within an acceptable time
  • of sufficient trace size

The requirements will determine the collection setup

slide-5
SLIDE 5

Measurement scale

NETWORK
  • realistic
  • not complete
  • it does not scale

slide-6
SLIDE 6

Measurement scale

NETWORK
  • realistic
  • not complete
  • it does not scale

SUBNETWORK
  • realistic
  • not complete

slide-7
SLIDE 7

Measurement scale

NETWORK
  • realistic
  • not complete
  • it does not scale

SUBNETWORK
  • realistic
  • not complete

SINGLE HOST (honeypot)
  • realistic
  • enhanced logging

slide-8
SLIDE 8

Setup

  • daily used services with enhanced logging
  • direct connection to the Internet
  • attack exposure
  • complete tcpdump of the traffic (offline flow creation)

[Figure: honeypot on a Xen server offering ssh, http and ftp; tcpdump captures the traffic and ssh session transcripts are recorded]

slide-9
SLIDE 9

Data set creation

[Figure: data set creation pipeline with the stages traffic dump, flows, logs, log events, typescripts, alert generation/correlation, clustering & causality, labelled data set]

Preprocessing

  • packets → flows

F = (Isrc, Idst, Psrc, Pdst, Pckts, Octs, Tstart, Tend, Flags, Prot)

  • logs → log events

L = (T, Isrc, Psrc, Idst, Pdst, Descr, Auto, Succ, Corr)
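
The two tuples can be read as plain records. The following is a minimal sketch in Python, not the authors' implementation: the field names, the Python types, the time unit (seconds) and the `duration` helper are all assumptions made for illustration, and the addresses are documentation examples.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    """F = (Isrc, Idst, Psrc, Pdst, Pckts, Octs, Tstart, Tend, Flags, Prot)."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    packets: int
    octets: int
    t_start: float  # assumed unit: seconds
    t_end: float
    flags: int      # assumed: OR of the TCP flags seen in the flow
    proto: int      # assumed: IP protocol number (6 = TCP)

    @property
    def duration(self) -> float:
        # Hypothetical helper, not part of the tuple on the slide.
        return self.t_end - self.t_start


@dataclass
class LogEvent:
    """L = (T, Isrc, Psrc, Idst, Pdst, Descr, Auto, Succ, Corr)."""
    t: float
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    descr: str
    auto: bool          # attack judged to be automated
    succ: bool          # attack judged to be successful
    corr: bool = False  # set once the event has been correlated to a flow


flow = Flow("198.51.100.7", "192.0.2.1", 40000, 22, 12, 1500, 10.0, 12.5, 0x1B, 6)
event = LogEvent(10.5, "198.51.100.7", 40000, "192.0.2.1", 22,
                 "ssh login failed", auto=True, succ=False)
```

Keeping `Corr` mutable on the log event mirrors its role in the tuple: it records whether the event has already been used.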

slide-10
SLIDE 10

Data set creation

[Figure: data set creation pipeline with the stages traffic dump, flows, logs, log events, typescripts, alert generation/correlation, clustering & causality, labelled data set]

  • The correlation process results in alerts

A = (T, Descr, Auto, Succ, Serv, Type)

slide-11
SLIDE 11

Correlation procedure

[Figure: at the honeypot (HP), flows F1 and F2 are correlated with an alert A derived from the logs, giving the pairs (F1, A) and (F2, A)]

slide-12
SLIDE 12

Cluster and Causality

  • Hierarchic view of the alerts to enrich the data set with extra information on the traffic
  • Group simple alerts into cluster alerts
  • high level view of malicious activities

[Figure: flows attached to alerts; alerts grouped under a cluster alert by the cluster relation and linked to each other by the causality relation]
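
One way to picture the two relations is as parent-to-child maps over alert identifiers. The sketch below is illustrative only: the identifiers are made up, and reading the causality relation as "alert, then the alerts it caused" is an assumption.

```python
# Hypothetical identifiers; only the relation names come from the slide.
flows_of_alert = {  # alert id -> ids of the flows correlated to it
    "a1": ["f1", "f2"],
    "a2": ["f3"],
    "a3": ["f4", "f5"],
}
cluster = {"c1": ["a1", "a2"]}  # cluster relation: cluster alert -> member alerts
causality = {"a2": ["a3"]}      # causality relation: alert -> alerts it caused


def flows_under(alert_id, seen=None):
    """Collect the flows reachable from an alert through both relations."""
    seen = set() if seen is None else seen
    if alert_id in seen:  # guard against cycles in the relations
        return []
    seen.add(alert_id)
    found = list(flows_of_alert.get(alert_id, []))
    for child in cluster.get(alert_id, []) + causality.get(alert_id, []):
        found += flows_under(child, seen)
    return found


print(flows_under("c1"))
```

Walking from the cluster alert downwards gives the high level view: every flow that belongs to the same malicious activity, directly or through a caused alert.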

slide-13
SLIDE 13

Implementation

Packets to flows (AUTOMATIC)

  • softflowd

Logs to log events (SEMI-AUTOMATIC, MANUAL)

  • shell scripts
  • discriminate between manual/automated attacks

Alert correlation (SEMI-AUTOMATIC)

  • correlation procedure
  • extensible for other attacks

Cluster and causality (MANUAL)

  • analysis of typescripts
slide-14
SLIDE 14

The Dataset

  • Flow breakdown

[Figure: number of flows per service (SSH, FTP, HTTP, AUTH/IDENT, IRC, OTHERS), log scale from 1.0E+00 to 1.0E+08. Bar values: 18970, 7383, 191339, 9798, 13, 13942629]

[Figure: number of flows per protocol (ICMP, TCP, UDP), log scale from 1.0E+00 to 1.0E+08. Bar values: 583, 14151511, 18038]

Dump file: 24 GB. Flows: 14M. Alerts: 7.6M.
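
The two flow breakdowns can be cross-checked against each other and against the "14M" summary. The sketch below only uses totals, because the chart values are listed here without an explicit pairing to the labels; the variable names are illustrative.

```python
# Bar values as they appear on the slide, in slide order.
per_service = [18970, 7383, 191339, 9798, 13, 13942629]
per_protocol = [583, 14151511, 18038]

total_by_service = sum(per_service)
total_by_protocol = sum(per_protocol)

# Share of the single largest service category among all flows.
largest_share = 100 * max(per_service) / total_by_service

print(total_by_service, total_by_protocol)  # 14170132 14170132
print(f"largest service category: {largest_share:.2f}% of all flows")
```

Both breakdowns add up to the same 14,170,132 flows, which is consistent with the rounded figure of 14M.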

slide-15
SLIDE 15

The Dataset

  • Alert breakdown

Dump file: 24 GB. Flows: 14M. Alerts: 7.6M.

[Figure: number of alerts per category (SSH IN, SSH OUT, FTP, HTTP, AUTH/IDENT IN, AUTH/IDENT OUT, IRC OUT, ICMP IN), log scale from 1.0E+00 to 1.0E+08. Bar values: 16382, 3692, 6, 95664, 5317, 6, 7591869, 8756]

[Figure: number of alerts for SSH IN, SSH OUT and HTTP, linear scale from 10 to 40. Bar values: 4, 35, 10]

slide-16
SLIDE 16

The Dataset

  • We labeled 98.5% of the flows and 99.99% of the alerts
  • Mainly malicious traffic:
      • ssh brute force attacks
      • automated http connections
  • Small percentage of side-effect traffic:
      • auth/ident on port 113
      • IRC traffic
slide-17
SLIDE 17

Conclusions

  • We presented the first labeled data set for flow-based intrusion detection
      • http://traces.simpleweb.org/
  • Semi-automated correlation process
      • manual intervention is still needed
  • The data set consists mainly of malicious traffic
      • need to extend to benign traffic
slide-18
SLIDE 18

Conclusions

  • Reactions:
      • Since publication (October 2009): about 7 requests
      • We do not monitor the downloads at the webpage
      • In contact with Philipp Winter (Hagenberg University, Austria): MSc project “Inductive Intrusion Detection in Flow-Based Network Data using One-Class Support Vector Machines”

slide-19
SLIDE 19
slide-20
SLIDE 20

Implementation

ALERT_TYPES: id, description
ALERTS: id, automated, succeeded, description, timestamp, type, service
NETFLOWS: id, src_ip, dst_ip, packets, octets, start_time, start_msec, end_time, end_msec, src_port, dst_port, tcp_flag, prot
NETFLOW_ALERTS: flowid, alertid
ALERTS_CLUSTERING: parent, child
ALERTS_CAUSALITY: parent, child
SERVICES: id, description
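
To show how these tables fit together, here is a small sketch that recreates them in an in-memory SQLite database and labels flows with a join. Only the table and column names come from the slide. The database engine, the column types, the foreign keys and the sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE SERVICES    (id INTEGER PRIMARY KEY, description TEXT);
CREATE TABLE ALERT_TYPES (id INTEGER PRIMARY KEY, description TEXT);
CREATE TABLE ALERTS (
    id INTEGER PRIMARY KEY, automated INTEGER, succeeded INTEGER,
    description TEXT, timestamp INTEGER,
    type INTEGER REFERENCES ALERT_TYPES(id),
    service INTEGER REFERENCES SERVICES(id));
CREATE TABLE NETFLOWS (
    id INTEGER PRIMARY KEY, src_ip TEXT, dst_ip TEXT,
    packets INTEGER, octets INTEGER,
    start_time INTEGER, start_msec INTEGER,
    end_time INTEGER, end_msec INTEGER,
    src_port INTEGER, dst_port INTEGER, tcp_flag INTEGER, prot INTEGER);
CREATE TABLE NETFLOW_ALERTS    (flowid INTEGER, alertid INTEGER);
CREATE TABLE ALERTS_CLUSTERING (parent INTEGER, child INTEGER);
CREATE TABLE ALERTS_CAUSALITY  (parent INTEGER, child INTEGER);
""")

# Made-up sample rows: one labeled ssh flow and one unlabeled flow.
conn.execute("INSERT INTO SERVICES VALUES (1, 'ssh')")
conn.execute("INSERT INTO ALERT_TYPES VALUES (1, 'CONN')")
conn.execute("INSERT INTO ALERTS VALUES (1, 1, 0, 'ssh login attempt', 0, 1, 1)")
conn.execute("INSERT INTO NETFLOWS VALUES "
             "(1, '198.51.100.7', '192.0.2.1', 12, 1500, 0, 0, 2, 0, 40000, 22, 27, 6)")
conn.execute("INSERT INTO NETFLOWS VALUES "
             "(2, '198.51.100.9', '192.0.2.1', 1, 60, 5, 0, 5, 0, 40001, 80, 2, 6)")
conn.execute("INSERT INTO NETFLOW_ALERTS VALUES (1, 1)")

# A flow is labeled when NETFLOW_ALERTS links it to an alert.
rows = conn.execute("""
    SELECT f.id, a.description
    FROM NETFLOWS f
    LEFT JOIN NETFLOW_ALERTS fa ON fa.flowid = f.id
    LEFT JOIN ALERTS a ON a.id = fa.alertid
    ORDER BY f.id
""").fetchall()
print(rows)
```

The `LEFT JOIN` keeps unlabeled flows in the result with a missing description, which makes it easy to count how much of the trace carries a label.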

slide-21
SLIDE 21

Correlation procedure

Algorithm 1 Correlation procedure

procedure ProcessFlowsForService(s : service)
    for all incoming flows F1 for the service s do
        Retrieve the matching response flow F2 such that
            F2.Isrc = F1.Idst ∧ F2.Idst = F1.Isrc ∧
            F2.Psrc = F1.Pdst ∧ F2.Pdst = F1.Psrc ∧
            F1.Tstart ≤ F2.Tstart ≤ F1.Tstart + δ
            with smallest F2.Tstart − F1.Tstart;
        Retrieve a matching log event L such that
            L.Isrc = F1.Isrc ∧ L.Idst = F1.Idst ∧
            L.Psrc = F1.Psrc ∧ L.Pdst = F1.Pdst ∧
            F1.Tstart ≤ L.T ≤ F1.Tend ∧ not L.Corr
            with smallest L.T − F1.Tstart;
        if L exists then
            Create alert A = (L.T, L.Descr, L.Auto, L.Succ, s, CONN);
            Correlate F1 to A;
            if F2 exists then
                Correlate F2 to A; L.Corr ← true;
            end if
        end if
    end for
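
The procedure can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: the record types are reduced to the fields the procedure uses, the default `delta` of one second is a placeholder for δ, the log event's ports are matched in the same direction as its addresses, and the event is marked as correlated as soon as an alert is created so that one event cannot label several flows.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Flow:
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    t_start: float
    t_end: float


@dataclass
class LogEvent:
    t: float
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    descr: str
    auto: bool
    succ: bool
    corr: bool = False


def find_response(f1: Flow, flows: List[Flow], delta: float) -> Optional[Flow]:
    """Earliest reverse-direction flow that starts within delta of f1."""
    candidates = [
        f2 for f2 in flows
        if f2.src_ip == f1.dst_ip and f2.dst_ip == f1.src_ip
        and f2.src_port == f1.dst_port and f2.dst_port == f1.src_port
        and f1.t_start <= f2.t_start <= f1.t_start + delta
    ]
    return min(candidates, key=lambda f2: f2.t_start, default=None)


def find_event(f1: Flow, events: List[LogEvent]) -> Optional[LogEvent]:
    """Earliest uncorrelated log event that falls inside f1's lifetime."""
    candidates = [
        e for e in events
        if e.src_ip == f1.src_ip and e.dst_ip == f1.dst_ip
        and e.src_port == f1.src_port and e.dst_port == f1.dst_port
        and f1.t_start <= e.t <= f1.t_end and not e.corr
    ]
    return min(candidates, key=lambda e: e.t, default=None)


def process_flows_for_service(service, incoming, all_flows, events, delta=1.0):
    """Return the (flow, alert) pairs produced for one service."""
    labeled = []
    for f1 in incoming:
        f2 = find_response(f1, all_flows, delta)
        ev = find_event(f1, events)
        if ev is None:
            continue
        # Alert A = (T, Descr, Auto, Succ, Serv, Type)
        alert = (ev.t, ev.descr, ev.auto, ev.succ, service, "CONN")
        ev.corr = True  # assumption: mark the event once its alert exists
        labeled.append((f1, alert))
        if f2 is not None:
            labeled.append((f2, alert))
    return labeled


f1 = Flow("198.51.100.7", "192.0.2.1", 40000, 22, 10.0, 12.0)
f2 = Flow("192.0.2.1", "198.51.100.7", 22, 40000, 10.1, 12.0)
ev = LogEvent(10.5, "198.51.100.7", 40000, "192.0.2.1", 22,
              "ssh login failed", auto=True, succ=False)
pairs = process_flows_for_service("ssh", [f1], [f1, f2], [ev])
```

In this example both the incoming flow and its response end up attached to the same alert, which corresponds to the pairs (F1, A) and (F2, A) of the procedure.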