

SLIDE 1

Semantic Flow Augmentation for the Automated Discovery of Organizational Relationships

Chris Strasburg*, Harris T Lin, Nikolas Kinkel The Ames Laboratory {cstras,htlin,nskinkel}@ameslab.gov * - Presenting

SLIDE 2

Relationship Discovery – Why does it matter?

  • What is the impact of disrupting communication associated with flow set ‘F’?
SLIDE 3

Relationship Discovery – Why does it matter?

  • What is the impact of disrupting communication associated with flow set ‘F’?
SLIDE 4

Relationship Discovery – Why does it matter?

  • Which alarms are most critical to manually investigate?
SLIDE 5

What is Semantic Flow Augmentation?

SLIDE 6

What is Semantic Flow Augmentation?

SLIDE 7

What is Semantic Flow Augmentation?

  • Semantic – Of or relating to meaning…
SLIDE 8

Why Semantic Augmentation

SLIDE 9

Why Semantic Augmentation

Is it mission related?

SLIDE 10

Statistical Features

  • Flow Statistics
    – # of Flows
    – # of Bytes
    – Peer count
  • Timeseries Analysis
    – First seen
    – Last seen
    – Fourier Transform Coefficient
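As a rough sketch, the per-IP statistical features above can be computed from a list of flow records. The record field names (`src`, `dst`, `bytes`, `ts`) and the hourly-binned DFT are illustrative assumptions, not the deck's actual SiLK pipeline.

```python
import cmath

def flow_features(flows, ip):
    """Aggregate simple per-IP NetFlow statistics: # of flows, # of bytes,
    peer count, first/last seen, and a daily-periodicity Fourier term."""
    mine = [f for f in flows if ip in (f["src"], f["dst"])]
    peers = {f["dst"] if f["src"] == ip else f["src"] for f in mine}
    times = [f["ts"] for f in mine]
    return {
        "n_flows": len(mine),
        "n_bytes": sum(f["bytes"] for f in mine),
        "peer_count": len(peers),
        "first_seen": min(times),
        "last_seen": max(times),
        "fourier_daily": daily_fourier(times),
    }

def daily_fourier(timestamps, k=1):
    """Magnitude of the k-th DFT coefficient over a 24-hour activity
    histogram; large values indicate strong daily periodicity."""
    bins = [0] * 24
    for ts in timestamps:                 # ts: seconds since epoch
        bins[int(ts // 3600) % 24] += 1
    coeff = sum(b * cmath.exp(-2j * cmath.pi * k * h / 24)
                for h, b in enumerate(bins))
    return abs(coeff)
```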

SLIDE 11

Semantic Features

  • Lexical Analysis (Mallet)
    – Cluster according to web page contents from:
      • Reverse DNS Lookups
      • WHOIS Org Searches
      • Session Metadata – Requested URLs
  • Service Distribution
    – Interactive / Authenticated (SSH, IMAP, POP)
    – Interactive / Non-Authenticated (SMTP, HTTP/S)
    – Non-Interactive (NTP, DNS)
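The service-distribution feature amounts to bucketing observed destination ports into the three interactivity classes above. A minimal sketch, using standard IANA port numbers and hypothetical class names:

```python
# Port-to-class table for the three interactivity classes named on the
# slide; ports are standard IANA assignments.
SERVICE_CLASS = {
    22:  "interactive_auth",    # SSH
    143: "interactive_auth",    # IMAP
    110: "interactive_auth",    # POP
    25:  "interactive_noauth",  # SMTP
    80:  "interactive_noauth",  # HTTP
    443: "interactive_noauth",  # HTTPS
    123: "non_interactive",     # NTP
    53:  "non_interactive",     # DNS
}

def service_distribution(ports):
    """Fraction of a host's flows falling in each interactivity class."""
    counts = {"interactive_auth": 0, "interactive_noauth": 0,
              "non_interactive": 0, "other": 0}
    for p in ports:
        counts[SERVICE_CLASS.get(p, "other")] += 1
    total = max(len(ports), 1)
    return {k: v / total for k, v in counts.items()}
```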

SLIDE 12

Semantic Features (2)

  • Bi-clique Grouping
    – Red = Internal
    – Green = External
    – Edges pruned
    – LP & BRIM Algorithm**

*Gephi http://gephi.org/
**Liu, Xin, and Tsuyoshi Murata. "Community detection in large-scale bipartite networks." Web Intelligence and Intelligent Agent Technologies, 2009. WI-IAT'09. IEEE/WIC/ACM International Joint Conferences on. Vol. 1. IET, 2009.
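The deck cites the LP&BRIM algorithm of Liu & Murata for grouping the internal/external bipartite graph. As a rough illustration of only the label-propagation half, simplified, a plain LP pass over the edge list might look like:

```python
# Simplified label propagation on a bipartite graph: each node repeatedly
# adopts the majority label among its neighbors until no label changes.
# This is an illustrative sketch, not the full LP&BRIM method.
from collections import Counter

def label_propagation(edges, rounds=10):
    """edges: list of (internal_ip, external_ip) pairs."""
    nodes = {n for e in edges for n in e}
    label = {n: n for n in nodes}          # start with unique labels
    neigh = {n: [] for n in nodes}
    for a, b in edges:
        neigh[a].append(b)
        neigh[b].append(a)
    for _ in range(rounds):
        changed = False
        for n in sorted(nodes):            # deterministic update order
            best = Counter(label[m] for m in neigh[n]).most_common(1)[0][0]
            if best != label[n]:
                label[n] = best
                changed = True
        if not changed:
            break
    return label
```

Nodes sharing a label after convergence form one candidate community; BRIM would then refine these groups by maximizing bipartite modularity.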
SLIDE 13

Architecture Overview

SLIDE 14

How to Label / Train

Anecdotal human process – time consuming!

SLIDE 15

Kick Start Labeling

[Diagram: iterative kick-start labeling loop]
  • Feature labels → initial rank over all IPs → assign labels
  • Train classifier → new rank → assign labels (Iteration 1)
  • Train classifier → new rank (Iteration 2)
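The kick-start loop above (rank, label the top IPs, train, re-rank) can be sketched as follows; `initial_score`, `train`, and `ask_analyst` are hypothetical stand-ins for the deck's actual components.

```python
# Illustrative sketch of the kick-start labeling loop: the analyst labels
# the top of the current ranking, a classifier is retrained on all labels
# gathered so far, and the remaining IPs are re-ranked for the next pass.
def kickstart(ips, initial_score, train, ask_analyst,
              iterations=2, batch=10):
    labels = {}
    rank = sorted(ips, key=initial_score, reverse=True)
    for _ in range(iterations):
        for ip in rank[:batch]:               # analyst labels top IPs
            if ip not in labels:
                labels[ip] = ask_analyst(ip)
        clf = train(labels)                   # retrain on labels so far
        rank = sorted((ip for ip in ips if ip not in labels),
                      key=clf, reverse=True)  # new rank from classifier
    return labels, rank
```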

SLIDE 16

Anecdotal Validation – Ames Data

  • Gathering Data
    – One month of NetFlow data at Ames Lab
  • Preprocessing
    – 4 sets of features: simple NetFlow statistics, time series features, lexical analysis features (document topic distributions), biclique community features
  • Labeling
    – 4242 IPs (801 white / 3441 black)
  • Testing / verifying classifier
    – Weka (Logistic Regression, SVM, Bayesian Network, Decision Tree)
    – 10-fold cross-validation
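For reference, the 10-fold cross-validation protocol amounts to the following split scheme (Weka's own stratified splitter was presumably used; this is a plain, unstratified sketch):

```python
# Minimal k-fold cross-validation: each item appears in exactly one test
# fold, and the model trains on the remaining k-1 folds.
def kfold(items, k=10):
    """Yield (train, test) splits covering each item exactly once as test."""
    for i in range(k):
        test = items[i::k]                    # every k-th item, offset i
        train = [x for j, x in enumerate(items) if j % k != i]
        yield train, test
```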

SLIDE 17

Performance Results

[Bar charts, 10–100 scale: Precision, Recall, and AUC for Logistic Regression vs. Decision Tree (C4.5), and by feature set (Lexical; CC, Service, Biclique; Netflow)]

SLIDE 18

Info Gain by Features

[Bar chart, 0.05–0.2 scale: information gain per feature, in chart order – Fourier Daily, Fourier Weekly, Community Size, Peer Count, Earliest Starttime, Access Days, Service, Workhour Ratio, Access Hours, Latest Endtime, Community Ext/Int Size, Community Focus, Total Source Port, Total Dest Port, Total Records, Total Bytes, Lexical Topic Conf, Country Code, Lexical Topic]
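The ranking behind this chart is the standard information-gain score: entropy of the white/black label minus the label entropy conditioned on a (discretized) feature. A minimal sketch:

```python
# Information gain of a discrete feature with respect to a label:
# IG = H(label) - H(label | feature).
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(feature_values, labels):
    pairs = list(zip(feature_values, labels))
    n = len(pairs)
    cond = 0.0
    for v in set(feature_values):          # weight entropy per value
        subset = [lab for fv, lab in pairs if fv == v]
        cond += len(subset) / n * entropy(subset)
    return entropy(labels) - cond
```

A perfectly predictive feature scores the full label entropy; an uninformative one scores zero.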

SLIDE 19

[Decision tree: root split on Lexical = Science?; further splits on Lexical Conf, Country = US?, Service = ssh?, Total Bytes, Service = pop/imap?, and Lexical = Reference?]

SLIDE 20

Implementation at Ames Laboratory

SLIDE 21

Challenges / Future Work

  • Majority of IPs don’t have a web page
    – Automated query for WHOIS Organization
    – Use of AMP data; actual HTTP resources

  • Speed / Streaming

– Slow to gather features; currently batched daily.

  • Searching

– Search engines w/ free API (Faroo?)

  • Production ‘burn-in’

– Feedback from analysts into a growing set of labels

  • Integration with other systems

– BroIDS Module?

  • Mining of graphical data

– Second derivative clusters (clusters of clusters)

– Internal resource categorization

SLIDE 22

Summary

  • Flow provides ‘how much’; a bit of semantics is required for mission relevance.

  • Public tools:
    – SiLK – Flow Statistics
    – Crawler4J + Mallet – Lexical Analysis
    – Weka – Machine Learning SAK
    – Apache Commons Math – (Timeseries transforms)
    – A sprinkle of Java and a dash of Python