SLIDE 1

FloCon 2013 | January 7–10 | Albuquerque, New Mexico

John Munro / jmunro@endgame.com Jason Trost / jtrost@endgame.com

SLIDE 2
  • John Munro (jmunro@endgame.com)

– Network Security Researcher and Data Scientist

  • Jason Trost (jtrost@endgame.com)

– Senior Software Engineer
– Specializes in Hadoop/Storm/Big Data

Introductions

SLIDE 3
  • The Problem
  • Our Approach
  • DGA Domain Classifier
  • String Statistics as Features
  • Malicious Domain Classifier
  • Demo
  • Real-time Streaming Platform

Agenda

SLIDE 4

The Problem

yahoo.com docs.joomla.org youtube.com bibz01.apple.com

p4.httzd5e2ufizo.3bawhfuec45dca65.401724.s1.v4.ipv6-exp.l.google.com

ns3.ohio.gov abulqe.com za6.limfoklubs.com

ct0u2xj5dbe4.wvw-game465.com

txmxbo.info

wmk41035u3751s0bgv4n91b0b7h74v.ipcheker.com

SLIDE 6
  • Massive Volumes

– Some of our partners deal with TBs per day of DNS PCAPs

  • Incredible Rates

– One partner sees 13k requests/sec
– Another closer to 100k requests/sec

The Problem

SLIDE 7
  • Real-time streaming classification

– In parallel across multiple servers

  • Markov Models

– Random Domain Generation Traffic
– Normal Benign Traffic

  • Random Forests

– Benign vs Malicious

  • Periodically retrained

– In order to maintain accuracy

Our Approach: Machine Learning!

SLIDE 8
  • Benign Domains

– Millions of popular, real domains

  • Correlated with the Alexa top 10k domains
  • Malicious Domains

– 800k domains gathered from an internal malware sandbox
– Public blacklist domains from the Conficker and Murofet Botnets

Data Sources

SLIDE 9

Markov Models

SLIDE 10
  • Domain Generation Algorithm (DGA)
  • Popular Domain Model

– Trained: 258,039 domains from Day 1 of our Benign set
– Tested: 331,359 domains from Day 2 of our Benign set
– Accuracy: 99.40% with 1,458 Unknown

  • Randomly Generated Domain Model

– Trained: 90,884 domains from the Conficker Botnet
– Tested: 295,306 domains from the Murofet Botnet
– Accuracy: 99.34% with 1,923 Unknown

Markovian DGA Classifier
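The two-model scheme above can be sketched in a few lines: train one character-level Markov model on popular domains and one on DGA output, then compare length-normalized log-likelihoods, abstaining when the scores are too close ("Unknown"). The bigram order, add-one smoothing, and margin threshold here are illustrative assumptions, not the talk's exact models.

```python
import math
from collections import defaultdict

class CharMarkovModel:
    """Character-bigram Markov model with add-one smoothing."""

    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))  # counts[a][b]
        self.totals = defaultdict(int)                       # transitions out of a
        self.alphabet = set()

    def train(self, domains):
        for d in domains:
            s = "^" + d.lower()  # '^' marks start-of-string
            for a, b in zip(s, s[1:]):
                self.counts[a][b] += 1
                self.totals[a] += 1
                self.alphabet.add(b)

    def log_prob(self, domain):
        """Average per-transition log-probability (length-normalized)."""
        s = "^" + domain.lower()
        v = len(self.alphabet) + 1  # smoothing denominator
        lp = 0.0
        for a, b in zip(s, s[1:]):
            lp += math.log((self.counts[a][b] + 1) / (self.totals[a] + v))
        return lp / (len(s) - 1)

def classify(domain, benign_model, dga_model, margin=0.0):
    """Return 'benign', 'dga', or 'unknown' when the scores are too close."""
    b = benign_model.log_prob(domain)
    g = dga_model.log_prob(domain)
    if abs(b - g) < margin:
        return "unknown"
    return "benign" if b > g else "dga"
```

Length normalization matters because DGA labels are often much longer than popular domains; without it, raw log-likelihood mostly measures string length.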

SLIDE 11

String Statistics as Features

SLIDE 12

Feature Usefulness
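Features of this kind are cheap to compute per domain. The sketch below shows a plausible feature set (length, character entropy, digit/vowel ratios, character diversity); it is a guess at the sort of statistics meant here, not necessarily the exact features evaluated in the talk.

```python
import math
from collections import Counter

def shannon_entropy(s):
    """Bits of entropy per character of the string."""
    counts = Counter(s)
    n = len(s)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def domain_features(domain):
    """Simple string statistics over the domain with dots removed."""
    s = domain.lower().replace(".", "")
    return {
        "length": len(s),
        "entropy": shannon_entropy(s),
        "digit_ratio": sum(ch.isdigit() for ch in s) / len(s),
        "vowel_ratio": sum(ch in "aeiou" for ch in s) / len(s),
        "unique_ratio": len(set(s)) / len(s),
    }
```

Randomly generated labels tend to show higher entropy, more digits, and fewer vowels than human-registered names, which is why such features separate the classes at all.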

SLIDE 13

Random Forests Algorithm


SLIDE 14
  • Pros:

– Very high accuracy
– Scalable across many nodes
– Built-in protection from overfitting
– Can handle very large data sets with many features
– Robust with respect to goodness of features
– Practical for real-world use
– Does not assume a distribution
– Only two parameters to tune
– Memory efficient

  • Cons:

– Not the quickest classifier, but plenty fast in practice

Random Forests
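To make the bagging-and-voting mechanism concrete, here is a toy, stdlib-only forest of depth-1 trees (stumps). A real Random Forest, like the one behind these results, grows full trees and splits on impurity rather than raw error, so treat this purely as a sketch of the two ingredients: bootstrap resampling per tree and a random feature subset per split.

```python
import random
from collections import Counter

def train_stump(X, y, n_candidates=3):
    """Fit a depth-1 tree: best (feature, threshold) split by error,
    searched over a random subset of features (random-subspace trick)."""
    best = None
    feats = random.sample(range(len(X[0])), min(n_candidates, len(X[0])))
    for f in feats:
        for t in sorted({row[f] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            if not left or not right:
                continue
            l_lab = Counter(left).most_common(1)[0][0]
            r_lab = Counter(right).most_common(1)[0][0]
            err = sum(lab != l_lab for lab in left) + sum(lab != r_lab for lab in right)
            if best is None or err < best[0]:
                best = (err, f, t, l_lab, r_lab)
    if best is None:  # degenerate bootstrap sample: always predict majority
        lab = Counter(y).most_common(1)[0][0]
        return (0, float("-inf"), lab, lab)
    return best[1:]

def train_forest(X, y, n_trees=25):
    """Bagging: each stump is trained on a bootstrap resample."""
    n = len(X)
    forest = []
    for _ in range(n_trees):
        idx = [random.randrange(n) for _ in range(n)]
        forest.append(train_stump([X[i] for i in idx], [y[i] for i in idx]))
    return forest

def predict(forest, row):
    """Majority vote across the trees."""
    votes = Counter(l if row[f] <= t else r for f, t, l, r in forest)
    return votes.most_common(1)[0][0]
```

The "only two parameters" in the pros list are visible even in this sketch: the number of trees and the number of candidate features per split.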

SLIDE 15
  • Performance measured by 10-fold Cross Validation

  • Training Set

– 200k Benign
– 200k Malicious

Malicious Domain Classifier

[Chart: Cross Validation: Out-of-Bag Error (0.01–0.06) vs. Number of Trees (10–100), curves for K = 3, 5, 10]
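The k-fold scheme used to measure performance can be sketched as a hypothetical stdlib-only helper: split the n examples into k disjoint test folds and, for each fold, train on everything else.

```python
def k_fold_splits(n, k):
    """Yield (train_indices, test_indices) for k-fold cross validation;
    fold sizes differ by at most one when k does not divide n."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    indices = list(range(n))
    start = 0
    for size in sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size
```

With the 400k-domain training set, 10-fold CV means each model is fit on 360k domains and scored on the held-out 40k, and the reported curves average those ten scores.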

SLIDE 16

Results

[Chart: Bad Precision (0.975–0.984) vs. Number of Trees (10–100)]

[Chart: Good Precision (0.9745–0.9795) vs. Number of Trees (10–100)]

[Chart: Bad Accuracy (0.976–0.9805) vs. Number of Trees (10–100)]

[Chart: Good Accuracy (0.976–0.9805) vs. Number of Trees (10–100)]

[All charts: curves for K = 3, 5, 10]

SLIDE 17

Results

[Chart: Classification Throughput: 10,000–60,000 classifications/sec vs. Number of Trees (10–100), curves for K = 3, 5, 10]

[Chart: Model Size: 50–400 MB vs. Number of Trees (10–100), curves for K = 3, 5, 10]

SLIDE 18

Results

SLIDE 19

Demo

SLIDE 20
  • Velocity is a platform for processing, analyzing, and visualizing large-scale event data in real time
  • It was designed to be horizontally scalable and is built using Twitter’s Storm
  • It was built primarily for internal use with DNS events, IDS alerts, and netflow data, but it is in the process of being commercialized

Real-time Streaming Platform
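The partitioning idea behind a Storm-style pipeline can be sketched in Python rather than Storm's Java API: a stable hash (a "fields grouping" in Storm terms) routes every event for the same domain to the same worker, so per-domain state never needs to cross servers. The function names are hypothetical, and `classify` stands in for the Markov/Random Forest models.

```python
import hashlib
from collections import defaultdict

def route(key, n_workers):
    """Stable 'fields grouping': events with the same key always go to
    the same worker, regardless of process or machine."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % n_workers

def run_pipeline(events, n_workers, classify):
    """Partition a stream of DNS events across workers, then classify.
    In the real platform the workers run in parallel on many servers;
    here they run sequentially for illustration."""
    partitions = defaultdict(list)
    for event in events:
        partitions[route(event["domain"], n_workers)].append(event)
    verdicts = {}
    for worker_id, batch in sorted(partitions.items()):
        for event in batch:
            verdicts[event["domain"]] = classify(event["domain"])
    return verdicts
```

A cryptographic digest is used instead of Python's built-in `hash()` because the built-in is salted per process, which would break routing across restarts and machines.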

SLIDE 21

Velocity Pipeline

SLIDE 22
  • Malicious domain classification
  • DGA domain identification using Markov Models
  • Summary statistics based on the domain string work well
  • Random Forests are very successful at classifying domains as Benign or Malicious
  • Real-time, distributed implementation

Conclusion

SLIDE 23
  • Include more features: TTL, frequency seen, etc.
  • Correlation of bad domains based on ASN, Country, Organization, etc.
  • Identify subnets that are infected based on high traffic to bad domains
  • Identify Content Delivery Networks
  • Self-Organizing Maps and other visualizations

Future Work

SLIDE 24

Questions

SLIDE 25
  • John Munro
  • Email: jmunro@endgame.com
  • Jason Trost
  • Email: jtrost@endgame.com
  • Twitter: @jason_trost
  • Blog: www.covert.io

Contact Information