
Semantic Flow Augmentation for the Automated Discovery of Organizational Relationships
Chris Strasburg*, Harris T Lin, Nikolas Kinkel
The Ames Laboratory
{cstras,htlin,nskinkel}@ameslab.gov
* Presenting


  1. Semantic Flow Augmentation for the Automated Discovery of Organizational Relationships
      Chris Strasburg*, Harris T Lin, Nikolas Kinkel
      The Ames Laboratory
      {cstras,htlin,nskinkel}@ameslab.gov
      * Presenting

  2. Relationship Discovery – Why does it matter?
      • What is the impact of disrupting communication associated with flow set ‘F’?

  3. Relationship Discovery – Why does it matter?
      • What is the impact of disrupting communication associated with flow set ‘F’?

  4. Relationship Discovery – Why does it matter?
      • Which alarms are most critical to manually investigate?

  5. What is Semantic Flow Augmentation

  6. What is Semantic Flow Augmentation

  7. What is Semantic Flow Augmentation • Semantic – Of or relating to meaning…

  8. Why Semantic Augmentation

  9. Why Semantic Augmentation Is it mission related?

  10. Statistical Features
      • Flow Statistics
        – # of Flows
        – # of Bytes
        – Peer count
      • Timeseries Analysis
        – First seen
        – Last seen
        – Fourier Transform Coefficient
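The statistical features listed on this slide can be sketched in a few lines. This is only an illustration, not the authors' pipeline (the deck names SiLK and Apache Commons Math for that work). The flow tuple layout `(src, dst, start, end, n_bytes)` and both function names are assumptions made for the example.

```python
import cmath
from collections import defaultdict


def flow_statistics(flows):
    """Aggregate per-source-IP statistics.

    `flows` is an iterable of (src, dst, start, end, n_bytes) tuples.
    This field layout is hypothetical, chosen only for the sketch.
    """
    stats = defaultdict(lambda: {"flows": 0, "bytes": 0, "peers": set(),
                                 "first_seen": None, "last_seen": None})
    for src, dst, start, end, n_bytes in flows:
        s = stats[src]
        s["flows"] += 1
        s["bytes"] += n_bytes
        s["peers"].add(dst)
        s["first_seen"] = start if s["first_seen"] is None else min(s["first_seen"], start)
        s["last_seen"] = end if s["last_seen"] is None else max(s["last_seen"], end)
    return {ip: {**s, "peer_count": len(s["peers"])} for ip, s in stats.items()}


def fourier_magnitude(counts, period):
    """Normalized magnitude of the DFT coefficient for a cycle of `period` samples.

    `counts` is an evenly sampled activity series (for example flows per hour).
    """
    n = len(counts)
    k = n / period  # number of full cycles of this period in the window
    coeff = sum(c * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, c in enumerate(counts))
    return abs(coeff) / n
```

With hourly counts, `fourier_magnitude(counts, 24)` gives a rough measure of how strongly a host's activity follows a daily rhythm, and `fourier_magnitude(counts, 168)` does the same for a weekly one.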

  11. Semantic Features
      • Lexical Analysis (Mallet)
        – Cluster according to web page contents from:
          • Reverse DNS Lookups
          • WHOIS Org Searches
      • Service Distribution
        – Interactive / Authenticated (SSH, IMAP, POP)
        – Interactive / Non-Authenticated (SMTP, HTTP/S)
        – Non-Interactive (NTP, DNS)
      • Session Metadata
        – Requested URLs
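As a rough illustration of the service distribution feature, the sketch below buckets a host's flows into the three categories named on the slide. Keying on the destination port, the standard port numbers used for each protocol, the "other" bucket, and the function name are all assumptions; the deck does not say how services were identified.

```python
from collections import Counter

# Standard ports for the protocols named on the slide.
# Mapping by port number is an assumption made for this sketch.
SERVICE_CATEGORY = {
    22: "interactive/authenticated",       # SSH
    143: "interactive/authenticated",      # IMAP
    110: "interactive/authenticated",      # POP3
    25: "interactive/non-authenticated",   # SMTP
    80: "interactive/non-authenticated",   # HTTP
    443: "interactive/non-authenticated",  # HTTPS
    123: "non-interactive",                # NTP
    53: "non-interactive",                 # DNS
}


def service_distribution(dest_ports):
    """Return the fraction of flows that fall into each service category."""
    counts = Counter(SERVICE_CATEGORY.get(port, "other") for port in dest_ports)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()} if total else {}
```

For example, a host whose flows went to ports 22, 22, 80 and 53 would be half interactive/authenticated, a quarter interactive/non-authenticated, and a quarter non-interactive.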

  12. Semantic Features (2)
      • Bi-clique Grouping
        – Red = Internal
        – Green = External
        – Edges pruned
        – LP & BRIM Algorithm**
      [Figure: graph panels labeled 1, 2, 3]
      ** Liu, Xin, and Tsuyoshi Murata. "Community detection in large-scale bipartite networks." Web Intelligence and Intelligent Agent Technologies, 2009. WI-IAT'09. IEEE/WIC/ACM International Joint Conferences on. Vol. 1. IET, 2009.
      * Gephi http://gephi.org/
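To give a feel for community grouping on an internal/external bipartite graph, here is a deliberately simplified label-propagation pass. It is not the LP & BRIM algorithm cited on the slide: the BRIM modularity refinement is omitted, and the tie-breaking rule, the function name, and the edge format `(internal_host, external_host)` are assumptions made for the sketch.

```python
from collections import Counter, defaultdict


def bipartite_label_propagation(edges, max_iter=20):
    """Group hosts by alternating label votes between the two sides.

    `edges` is an iterable of (internal_host, external_host) string pairs.
    Returns a dict mapping every host to a community label.
    """
    internal_nbrs, external_nbrs = defaultdict(set), defaultdict(set)
    for internal, external in edges:
        internal_nbrs[internal].add(external)
        external_nbrs[external].add(internal)

    # Every internal host starts in its own community.
    labels = {node: node for node in internal_nbrs}

    def adopt(nbrs_of):
        """Each node on one side takes the most common label among its neighbors."""
        changed = False
        for node in sorted(nbrs_of):
            votes = Counter(labels[n] for n in nbrs_of[node] if n in labels)
            # Most votes wins; ties go to the smallest label for determinism.
            best = min(votes, key=lambda lab: (-votes[lab], lab))
            if labels.get(node) != best:
                labels[node] = best
                changed = True
        return changed

    for _ in range(max_iter):
        changed_external = adopt(external_nbrs)
        changed_internal = adopt(internal_nbrs)
        if not (changed_external or changed_internal):
            break
    return labels
```

Hosts that end up with the same label form one community; features such as community size can then be derived by counting members per label.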

  13. Architecture Overview

  14. How to Label / Train Anecdotal Human Process Time consuming!

  15. Kick Start Labeling
      [Diagram: iterative labeling loop over two iterations (Iteration 1, Iteration 2). Elements shown: All IPs, Feature, Initial rank, Assign labels, Labels, Train, Classifier, New rank.]

  16. Anecdotal Validation – Ames Data
      • Gathering Data
        – One month of NetFlow data in Ames Lab
      • Preprocessing
        – 4 sets of features: simple NetFlow statistics, time series features, lexical analysis features (document topic distributions), biclique community features
      • Labeling
        – 4242 IPs (801 white / 3441 black)
      • Testing / verifying classifier
        – Weka (Logistic Regression, SVM, Bayesian Network, Decision Tree)
        – 10-fold cross-validation
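The evaluation on this slide was run in Weka. Purely to illustrate what 10-fold cross-validation and the precision/recall metrics involve, here is a small standard-library sketch. The function names, the stratified split, and the fixed random seed are assumptions, not details taken from the deck.

```python
import random


def stratified_folds(labels, k=10, seed=0):
    """Split example indices into k folds with roughly equal class proportions."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        indices = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin across the folds.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


def precision_recall(y_true, y_pred, positive):
    """Precision and recall for the class named by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Each fold serves once as the test set while the other nine are used for training, and the per-fold metrics are then averaged. With 801 white and 3441 black IPs, every fold would hold 80 or 81 white examples.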

  17. Performance Results
      [Bar charts: Precision, Recall, and AUC (scale 0 to 100) for Decision Tree (C4.5) and Logistic Regression. Series: Lexical; CC, Service, Biclique; Netflow.]

  18. Info Gain by Features
      [Bar chart: information gain per feature, axis 0 to 0.2. Features, in the order shown: Lexical Topic, Country Code, Lexical Topic Conf, Total Bytes, Total Records, Total Dest Port, Total Source Port, Community Focus, Community Ext/Int Size, Latest Endtime, Access Hours, Workhour Ratio, Service, Access Days, Earliest Starttime, Peer Count, Community Size, Fourier Weekly, Fourier Daily.]
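For readers unfamiliar with the metric on this slide, information gain is the drop in label entropy after splitting the data on a feature. The sketch below computes it for a categorical feature; numeric features such as Total Bytes would need to be discretized first. The function names are illustrative, and the deck does not state how its values were produced.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on the feature."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(group) / n * entropy(group) for group in groups.values())
    return entropy(labels) - remainder
```

A feature that separates white from black IPs perfectly in a balanced sample scores 1 bit, while a feature unrelated to the label scores 0.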

  19. [Decision tree diagram with Y/N and </> branches. Split nodes shown: Lexical = Science?, Country = US?, Lexical Conf, Service = ssh?, Total Bytes, Service = pop/imap?, Lexical = Reference?]

  20. Implementation at Ames Laboratory

  21. Challenges / Future Work
      • Majority of IPs don’t have a web page
        – Automated query for WHOIS Organization
        – Use of AMP data; actual HTTP resources
      • Speed / Streaming
        – Slow to gather features; currently batched daily.
      • Searching
        – Search engines w/ free API (Faroo?)
      • Production ‘burn-in’
        – Feedback from analysts into a growing set of labels
      • Integration with other systems
        – BroIDS Module?
      • Mining of graphical data
        – Second derivative clusters (clusters of clusters)
        – Internal resource categorization

  22. Summary
      • Flow provides ‘how much’; a bit of semantics is required for mission relevance.
      • Public tools:
        – SiLK: Flow Statistics
        – Crawler4J + Mallet: Lexical Analysis
        – Weka: Machine Learning SAK
        – Apache Commons Math: (Timeseries transforms)
        – A sprinkle of Java and a dash of Python
