SLIDE 1

FloCon 2013 | January 7–10 | Albuquerque, New Mexico

John Munro / jmunro@endgame.com Jason Trost / jtrost@endgame.com

SLIDE 2
  • John Munro (jmunro@endgame.com)

– Network Security Researcher and Data Scientist

  • Jason Trost (jtrost@endgame.com)

– Senior Software Engineer
– Specializes in Hadoop/Storm/Big Data

Introductions

SLIDE 3
  • The Problem
  • Our Approach
  • DGA Domain Classifier
  • String Statistics as Features
  • Malicious Domain Classifier
  • Demo
  • Real-time Streaming Platform

Agenda

SLIDE 4

The Problem

yahoo.com docs.joomla.org youtube.com bibz01.apple.com

p4.httzd5e2ufizo.3bawhfuec45dca65.401724.s1.v4.ipv6-exp.l.google.com

ns3.ohio.gov abulqe.com za6.limfoklubs.com

ct0u2xj5dbe4.wvw-game465.com

txmxbo.info

wmk41035u3751s0bgv4n91b0b7h74v.ipcheker.com

SLIDE 6
  • Massive Volumes

– Some of our partners deal with TBs per day of DNS PCAPs

  • Incredible Rates

– One partner sees 13k requests/sec
– Another closer to 100k requests/sec

The Problem

SLIDE 7
  • Real-time streaming classification

– In parallel across multiple servers

  • Markov Models

– Random Domain Generation Traffic
– Normal Benign Traffic

  • Random Forests

– Benign vs Malicious

  • Periodically retrained

– In order to maintain accuracy

Our Approach: Machine Learning!

SLIDE 8
  • Benign Domains

– Millions of popular, real domains

  • Correlated with the Alexa top 10k domains
  • Malicious Domains

– 800k domains gathered from an internal malware sandbox
– Public blacklist domains from the Conficker and Murofet Botnets

Data Sources

SLIDE 9

Markov Models

SLIDE 10
  • Domain Generation Algorithm (DGA)
  • Popular Domain Model

– Trained: 258,039 domains from Day 1 of our Benign set
– Tested: 331,359 domains from Day 2 of our Benign set
– Accuracy: 99.40% with 1,458 Unknown

  • Randomly Generated Domain Model

– Trained: 90,884 domains from the Conficker Botnet
– Tested: 295,306 domains from the Murofet Botnet
– Accuracy: 99.34% with 1,923 Unknown

Markovian DGA Classifier
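The two-model scheme above can be sketched in a few lines: train one character-level Markov model on popular domains and one on DGA output, then compare length-normalized log-likelihoods, abstaining when the scores are too close ("Unknown"). The bigram order, add-one smoothing, and margin threshold here are illustrative assumptions, not the talk's exact models.

```python
import math
from collections import defaultdict

class CharMarkovModel:
    """Character-bigram Markov model with add-one smoothing."""

    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))  # counts[a][b]
        self.totals = defaultdict(int)                       # transitions out of a
        self.alphabet = set()

    def train(self, domains):
        for d in domains:
            s = "^" + d.lower()  # '^' marks start-of-string
            for a, b in zip(s, s[1:]):
                self.counts[a][b] += 1
                self.totals[a] += 1
                self.alphabet.add(b)

    def log_prob(self, domain):
        """Average per-transition log-probability (length-normalized)."""
        s = "^" + domain.lower()
        v = len(self.alphabet) + 1  # smoothing denominator
        lp = 0.0
        for a, b in zip(s, s[1:]):
            lp += math.log((self.counts[a][b] + 1) / (self.totals[a] + v))
        return lp / (len(s) - 1)

def classify(domain, benign_model, dga_model, margin=0.0):
    """Return 'benign', 'dga', or 'unknown' when the scores are too close."""
    b = benign_model.log_prob(domain)
    g = dga_model.log_prob(domain)
    if abs(b - g) < margin:
        return "unknown"
    return "benign" if b > g else "dga"
```

Length normalization matters because DGA labels are often much longer than popular domains; without it, raw log-likelihood mostly measures string length.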

SLIDE 11

String Statistics as Features

SLIDE 12

Feature Usefulness
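Features of this kind are cheap to compute per domain. The sketch below shows a plausible feature set (length, character entropy, digit/vowel ratios, character diversity); it is a guess at the sort of statistics meant here, not necessarily the exact features evaluated in the talk.

```python
import math
from collections import Counter

def shannon_entropy(s):
    """Bits of entropy per character of the string."""
    counts = Counter(s)
    n = len(s)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def domain_features(domain):
    """Simple string statistics over the domain with dots removed."""
    s = domain.lower().replace(".", "")
    return {
        "length": len(s),
        "entropy": shannon_entropy(s),
        "digit_ratio": sum(ch.isdigit() for ch in s) / len(s),
        "vowel_ratio": sum(ch in "aeiou" for ch in s) / len(s),
        "unique_ratio": len(set(s)) / len(s),
    }
```

Randomly generated labels tend to show higher entropy, more digits, and fewer vowels than human-registered names, which is why such features separate the classes at all.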

SLIDE 13

Random Forests Algorithm


SLIDE 14
  • Pros:

– Very high accuracy
– Scalable across many nodes
– Built-in protection from overfitting
– Can handle very large data sets with many features
– Robust with respect to goodness of features
– Practical for real-world use
– Does not assume a distribution
– Only two parameters to tune
– Memory efficient

  • Cons:

– Not the quickest classifier, but plenty fast in practice

Random Forests
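To make the bagging-and-voting mechanism concrete, here is a toy, stdlib-only forest of depth-1 trees (stumps). A real Random Forest, like the one behind these results, grows full trees and splits on impurity rather than raw error, so treat this purely as a sketch of the two ingredients: bootstrap resampling per tree and a random feature subset per split.

```python
import random
from collections import Counter

def train_stump(X, y, n_candidates=3):
    """Fit a depth-1 tree: best (feature, threshold) split by error,
    searched over a random subset of features (random-subspace trick)."""
    best = None
    feats = random.sample(range(len(X[0])), min(n_candidates, len(X[0])))
    for f in feats:
        for t in sorted({row[f] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            if not left or not right:
                continue
            l_lab = Counter(left).most_common(1)[0][0]
            r_lab = Counter(right).most_common(1)[0][0]
            err = sum(lab != l_lab for lab in left) + sum(lab != r_lab for lab in right)
            if best is None or err < best[0]:
                best = (err, f, t, l_lab, r_lab)
    if best is None:  # degenerate bootstrap sample: always predict majority
        lab = Counter(y).most_common(1)[0][0]
        return (0, float("-inf"), lab, lab)
    return best[1:]

def train_forest(X, y, n_trees=25):
    """Bagging: each stump is trained on a bootstrap resample."""
    n = len(X)
    forest = []
    for _ in range(n_trees):
        idx = [random.randrange(n) for _ in range(n)]
        forest.append(train_stump([X[i] for i in idx], [y[i] for i in idx]))
    return forest

def predict(forest, row):
    """Majority vote across the trees."""
    votes = Counter(l if row[f] <= t else r for f, t, l, r in forest)
    return votes.most_common(1)[0][0]
```

The "only two parameters" in the pros list are visible even in this sketch: the number of trees and the number of candidate features per split.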

SLIDE 15
  • Performance measured by 10-fold Cross Validation

  • Training Set

– 200k Benign
– 200k Malicious

Malicious Domain Classifier

[Chart: Cross Validation: Out-of-Bag Error (0.01–0.06) vs. Number of Trees (10–100), curves for K = 3, 5, 10]
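The k-fold scheme used to measure performance can be sketched as a hypothetical stdlib-only helper: split the n examples into k disjoint test folds and, for each fold, train on everything else.

```python
def k_fold_splits(n, k):
    """Yield (train_indices, test_indices) for k-fold cross validation;
    fold sizes differ by at most one when k does not divide n."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    indices = list(range(n))
    start = 0
    for size in sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size
```

With the 400k-domain training set, 10-fold CV means each model is fit on 360k domains and scored on the held-out 40k, and the reported curves average those ten scores.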

SLIDE 16

Results

[Chart: Bad Precision (0.975–0.984) vs. Number of Trees (10–100)]

[Chart: Good Precision (0.9745–0.9795) vs. Number of Trees (10–100)]

[Chart: Bad Accuracy (0.976–0.9805) vs. Number of Trees (10–100)]

[Chart: Good Accuracy (0.976–0.9805) vs. Number of Trees (10–100)]

[All charts: curves for K = 3, 5, 10]

SLIDE 17

Results

[Chart: Classification Throughput: 10,000–60,000 classifications/sec vs. Number of Trees (10–100), curves for K = 3, 5, 10]

[Chart: Model Size: 50–400 MB vs. Number of Trees (10–100), curves for K = 3, 5, 10]

SLIDE 18

Results

SLIDE 19

Demo

SLIDE 20
  • Velocity is a platform for processing, analyzing, and visualizing large-scale event data in real time
  • It was designed to be horizontally scalable and is built using Twitter’s Storm
  • It was built primarily for internal use with DNS events, IDS alerts, and netflow data, but it is in the process of being commercialized

Real-time Streaming Platform
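The partitioning idea behind a Storm-style pipeline can be sketched in Python rather than Storm's Java API: a stable hash (a "fields grouping" in Storm terms) routes every event for the same domain to the same worker, so per-domain state never needs to cross servers. The function names are hypothetical, and `classify` stands in for the Markov/Random Forest models.

```python
import hashlib
from collections import defaultdict

def route(key, n_workers):
    """Stable 'fields grouping': events with the same key always go to
    the same worker, regardless of process or machine."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % n_workers

def run_pipeline(events, n_workers, classify):
    """Partition a stream of DNS events across workers, then classify.
    In the real platform the workers run in parallel on many servers;
    here they run sequentially for illustration."""
    partitions = defaultdict(list)
    for event in events:
        partitions[route(event["domain"], n_workers)].append(event)
    verdicts = {}
    for worker_id, batch in sorted(partitions.items()):
        for event in batch:
            verdicts[event["domain"]] = classify(event["domain"])
    return verdicts
```

A cryptographic digest is used instead of Python's built-in `hash()` because the built-in is salted per process, which would break routing across restarts and machines.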

SLIDE 21

Velocity Pipeline

SLIDE 22
  • Malicious domain classification
  • DGA domain identification using Markov Models
  • Summary statistics based on the domain string work well
  • Random Forests are very successful at classifying domains as Benign or Malicious
  • Real-time, distributed implementation

Conclusion

SLIDE 23
  • Include more features: TTL, frequency seen, etc.
  • Correlation of bad domains based on ASN, Country, Organization, etc.
  • Identify subnets that are infected based on high traffic to bad domains
  • Identify Content Delivery Networks
  • Self-Organizing Maps and other visualizations

Future Work

SLIDE 24

Questions

SLIDE 25
  • John Munro
  • Email: jmunro@endgame.com
  • Jason Trost
  • Email: jtrost@endgame.com
  • Twitter: @jason_trost
  • Blog: www.covert.io

Contact Information