Presentation: Scalable Detection of Botnets Based on DGA



slide-1
SLIDE 1

See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/333652427

Presentation: Scalable Detection of Botnets Based on DGA

Presentation · June 2019

DOI: 10.13140/RG.2.2.24134.32322

3 authors:

  • Mattia Zago, University of Murcia
  • Manuel Gil Pérez, University of Murcia
  • Gregorio Martínez Pérez, University of Murcia

All content following this page was uploaded by Mattia Zago on 07 June 2019.

slide-2
SLIDE 2

SCALABLE DETECTION OF BOTNETS BASED ON DGA

EFFICIENT FEATURE DISCOVERY PROCESS IN MACHINE LEARNING TECHNIQUES

Speaker: Mattia Zago

Authors:

  • M. Zago, M. Gil Pérez, G. Martínez Pérez

Available Online – Soft Computing – Q2 IF: 2.367

Zago, M., Gil Pérez, M. & Martínez Pérez, G. Soft Comput (2019). 10.1007/s00500-018-03703-8

slide-3
SLIDE 3

March 2019 Mattia Zago – Scalable Detection of Botnets based on DGA: Efficient Feature Discovery Process in ML Techniques 2

OUR AGENDA FOR TODAY

Background & Motivation

  • Subject localisation
  • Relevance
  • Objective

State of the Art

  • Machine learning algorithms
  • Feature sets and families

Analysis

  • Exploratory feature analysis
  • Classification results
      • Binary problem
      • Multiclass problem

Challenges

  • Data
  • Best practices
slide-4
SLIDE 4


WHAT IS A BOTNET?

slide-5
SLIDE 5


DGA: DOMAIN GENERATION ALGORITHM

slide-6
SLIDE 6


DGA: DOMAIN GENERATION ALGORITHM

Objective

Analyse DNS queries to detect connections to malicious AGDs (algorithmically generated domains)

slide-7
SLIDE 7


APPROACHES TO THE DETECTION – ML

slide-8
SLIDE 8

STATE OF THE ART REGARDING ALGORITHMS

More than 100 articles identified since 2010; more than 30 studies selected.

We have identified six comparison metrics:

  • Machine Learning approach (either supervised or unsupervised)
  • Type of application (e.g., binary or multiclass classifier, correlation, anomaly detection, etc.)
  • Comparisons with other works, approaches or algorithms
  • Real-time analysis (i.e., online detection, performance scalability, etc.)
  • Family of features used (i.e., either Context-Free or Context-Aware)
  • Achieved results (either poor, average, good or excellent)

slide-9
SLIDE 9


APPROACHES TO DGA DETECTION – FEATURES

LANGUAGE ANALYSIS

Use of Natural Language Processing techniques to estimate whether the domain is legitimate (i.e., test its randomness).

Examples

  • Length of the string
  • Frequency analysis
  • Entropy
  • Vowels ratio

Context-Free

A feature that is related only to an FQDN and is thus independent of contextual information, including, but not limited to, timing, origin or any other environment configuration.

Context-Aware

A feature that is dependent on the specific malware sample execution, which is realised in a precise environment, with a specific configuration and in a particular time frame.

DNS QUERY ANALYSIS

Decode sniffed queries and responses and look for “troublesome” indicators that may suggest a regular pattern.

Examples

  • Num. of connections
  • Num. of NXDomains
  • Num. of IP addresses
  • Longevity of domain
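As an illustration of the Context-Free examples above, the following is a minimal Python sketch that computes a few of them from the FQDN string alone. The function names, the vowel set and the decision to drop the last label as the TLD are assumptions made for this example, not the authors' implementation.

```python
import math
from collections import Counter

VOWELS = set("aeiou")  # assumption: plain Latin vowels only


def shannon_entropy(text: str) -> float:
    """Shannon entropy (bits per character) of the character distribution."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def longest_run(text: str, predicate) -> int:
    """Length of the longest run of consecutive characters satisfying predicate."""
    best = current = 0
    for ch in text:
        current = current + 1 if predicate(ch) else 0
        best = max(best, current)
    return best


def context_free_features(fqdn: str) -> dict:
    """Compute a few context-free features from the FQDN string alone."""
    # Assumption: the last label is the TLD and is excluded from the analysis.
    labels = fqdn.lower().rstrip(".").split(".")
    name = "".join(labels[:-1]) if len(labels) > 1 else labels[0]
    length = len(name)
    return {
        "length": length,
        "levels": len(labels),
        "entropy": shannon_entropy(name),
        "vowel_ratio": sum(c in VOWELS for c in name) / length if length else 0.0,
        "digit_ratio": sum(c.isdigit() for c in name) / length if length else 0.0,
        "longest_consonant_run": longest_run(
            name, lambda c: c.isalpha() and c not in VOWELS
        ),
    }
```

Calling `context_free_features("google.com")` and `context_free_features("xkqzv7w2.net")` side by side shows the kind of contrast these features aim for: the second, made-up name has no vowels, a longer consonant run and a higher entropy.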

slide-10
SLIDE 10

Code            Description                             Tot.
NLP-L-x         String length                           16
NLP-LDN         Number of domain levels                 3
NLP-R-NUM-x     Ratio of numerical characters           8
NLP-R-VOW-x     Ratio of vowel characters               4
NLP-R-CON-x     Ratio of consonant characters           4
NLP-LANG        Language hypothesis                     2
NLP-LC-C        Longest consecutive consonant sequence  5
NLP-LC-V        Longest consecutive vowel sequence      1
NLP-LC-D        Longest consecutive number sequence     3
NLP-COV         Covariance matrix                       1
NLP-R-MC        Ratio of meaningful characters          3
NLP-LMS         Length of longest meaningful string
NLP-WLU         Number of “word-like” units             1
NLP-SQS         Domain squatting score                  1
NLP-LED         Levenshtein Edit Distance               2
NLP-nG-FR       Frequency distribution (histogram)      4
NLP-nG-E        Entropy                                 11
NLP-nG-COV      Covariance                              1
NLP-nG-MEAN     Mean of frequencies                     1
NLP-nG-MED      Median of frequencies                   1
NLP-nG-VAR      Variance of frequencies                 1
NLP-nG-STD      Standard deviation of frequencies       1
NLP-nG-PRO      Pronounceability score                  3
NLP-nG-NORM     Normality score                         3
NLP-nG-PRT      Transition probability                  2
NLP-nG-PRA      Probability of appearance               2
NLP-nG-PRI      Index probability                       2
NLP-nG-DST-KL   Kullback-Leibler divergence             2
NLP-nG-DST-JI   Jaccard Index measure                   4
NLP-nG-DST-TH   Distance - Threshold                    1
NLP-nG-DST-AF   Distance - Avg. frequency               1
NLP-nG-DST-AC   Distance - Avg. count                   2

Number of features used by each cited work:

Used by   3  5  9 11 14 16 24 28 33 34 38 45 46 48 49 51 52 54 56 57 58 59 65 66
Total     1  4  7  3  1  4  7  1  9  1  1  2  7  5  8  3  8  6  1  2  3  4  5  3


STATE OF THE ART REGARDING FEATURES
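Many of the features listed for this slide are computed over character n-grams (the NLP-nG-* codes). As a rough sketch of what that involves, the Python below extracts n-grams, builds their relative-frequency histogram and computes a Jaccard index between the n-gram sets of two strings. The helper names and the choice of comparing two raw strings are assumptions for illustration; the surveyed works may define these measures differently, for example against a reference dictionary.

```python
from collections import Counter


def ngrams(text: str, n: int) -> list:
    """All contiguous character n-grams of text, in order of appearance."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_frequencies(text: str, n: int) -> dict:
    """Relative frequency of each n-gram (a simple frequency histogram)."""
    grams = ngrams(text, n)
    total = len(grams)
    return {g: c / total for g, c in Counter(grams).items()} if total else {}


def jaccard_index(a: str, b: str, n: int = 2) -> float:
    """Jaccard index between the n-gram sets of two strings."""
    set_a, set_b = set(ngrams(a, n)), set(ngrams(b, n))
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0
```

Statistics such as the mean, median, variance or entropy of the n-gram frequencies can then be derived from the dictionary returned by `ngram_frequencies`.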

slide-11
SLIDE 11


STATE OF THE ART REGARDING FEATURES

slide-12
SLIDE 12


STATE OF THE ART REGARDING FEATURES - EXPLORE

Scatter plot of 10,000 FQDNs

Axes:

  • Horizontal: Length
  • Vertical: Entropy

Dots:

  • Green: Legitimate
  • Other colours: Malware

slide-13
SLIDE 13


STATE OF THE ART REGARDING FEATURES - EXPLORE

Scatter plot of 10,000 FQDNs

Axes:

  • Horizontal: Length
  • Vertical: Entropy

Dots:

  • Light Blue: Legitimate
  • Red: Malware

slide-14
SLIDE 14


EXAMPLE OF FEATURE ANALYSIS

Plotted features:

  • Longest Consecutive Consonant Sequence
  • Length of domain name (excluding TLD)

Plot captions:

  • Features that are interesting for their values
  • Features that are interesting for their shapes

slide-15
SLIDE 15


EXPERIMENTS – CONFIGURATION

Data Sources

  • 16 malware families
      • Alureon
      • Conficker
      • CLocker
      • Goz
      • Kraken
      • Matsnu
      • Murofet
      • Nymaim
      • Pushdo
      • QakBot
      • Ramdo
      • Rovnix
      • Shiotob
      • Simda
      • Tinba
      • Zeus
  • 160,000 AGDs
  • 10,000 Legitimate FQDNs

Algorithms

  • 6 classifiers
      • AdaBoost
      • Neural Network
      • Random Forest
      • SVM
      • Decision Tree
      • K-NN
  • 3 processing tools
      • Principal Component Analysis
      • Feature Selection
      • Correlation Analysis

Experiments

  • Binary Question
      • Separate legitimate FQDNs from malicious AGDs, considering all malware families as a single category.
  • Multiclass Question
      • Classify not only the legitimate FQDNs but also sort malware samples according to their families.
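The difference between the two experiments comes down to how the same samples are labelled. The Python sketch below shows one way to derive both label sets from family-tagged samples; the sample domains, the `LEGITIMATE` constant and the helper name are hypothetical and only serve the illustration.

```python
LEGITIMATE = "legitimate"  # assumed label for benign FQDNs


def to_binary_label(family: str) -> str:
    """Collapse every malware family into a single 'malicious' class."""
    return LEGITIMATE if family == LEGITIMATE else "malicious"


# Hypothetical toy samples: (FQDN, family label). The domains are made up.
samples = [
    ("example.com", LEGITIMATE),
    ("qwhzkxpvbt.net", "Conficker"),
    ("lkjhgfdsaq.org", "Kraken"),
    ("mnbvcxzpoi.info", "Kraken"),
]

# Multiclass question: keep the family of each sample as its label.
multiclass_labels = [family for _, family in samples]

# Binary question: every family becomes the same 'malicious' label.
binary_labels = [to_binary_label(family) for family in multiclass_labels]
```

With the data sources listed above, the binary setting has 160,000 AGDs against 10,000 legitimate FQDNs, i.e. 16 malicious samples for every legitimate one.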
slide-16
SLIDE 16


EXPERIMENTS – BINARY RESULTS

Machine Learning Problem

Separate legitimate FQDNs from malicious AGDs, considering all malware families as a single category.

Classifiers Result

Reported values are the average of the class-specific values.

Considering all the malware as one big family makes it possible to ignore the small differences between the variants, resulting in excellent performance.

Receiver Operating Characteristic (ROC) curve for the Legitimate class

The results are reflected in the ROC curve, i.e., in how close the corresponding lines are to the top-left corner:

  • Classifiers such as the Random Forest and the Neural Network generally behave better than the others.
  • Classifiers such as the kNN and the SVM behave poorly.
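To make the "closeness to the top-left corner" reading concrete, here is a minimal Python sketch that builds ROC points from classifier scores and measures the area under the curve. It is a generic textbook construction, not the tooling used for the slides; it assumes distinct scores and at least one sample of each class, and the function names are chosen for this example.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs, one per threshold.

    scores: classifier confidence that a sample is positive (higher = more sure).
    labels: True for the positive class (here: Legitimate), False otherwise.
    Assumes distinct scores and at least one positive and one negative sample.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    tp = fp = 0
    # Lower the threshold one sample at a time, from the most confident down.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    for _, is_positive in ranked:
        if is_positive:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )
```

A curve that passes through the top-left corner, i.e. the point (0, 1), gives an area of 1.0; a classifier that ranks samples at random stays near the diagonal, with an area around 0.5.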

slide-17
SLIDE 17


EXPERIMENTS – MULTICLASS RESULTS

Machine Learning Problem

Classify not only the legitimate FQDNs but also sort malware samples according to their families.

Classifiers Result

Reported values are the average of the class-specific values.

As expected, the multiclass experiment performs worse than the binary one. Malware families are not distinguishable within the data and feature set.

Receiver Operating Characteristic (ROC) curve for the Legitimate class

The results are reflected in the ROC curve, i.e., in how close the corresponding lines are to the top-left corner:

  • Classifiers such as the Random Forest and the Neural Network generally behave better than the others.
  • Classifiers such as the kNN and the SVM behave poorly.
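Averaging class-specific values, as described for the reported results, can be sketched as a macro average: compute each metric per class, then take the unweighted mean. The Python below is a minimal illustration under that reading; the choice of precision, recall and F1 and the helper names are assumptions, since the slides do not state which metrics were averaged.

```python
def per_class_scores(y_true, y_pred):
    """Precision, recall and F1 for every class that appears in y_true or y_pred."""
    scores = {}
    pairs = list(zip(y_true, y_pred))
    for cls in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == cls and p == cls for t, p in pairs)
        fp = sum(t != cls and p == cls for t, p in pairs)
        fn = sum(t == cls and p != cls for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[cls] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


def macro_average(scores, metric):
    """Unweighted mean of one metric over all classes."""
    return sum(s[metric] for s in scores.values()) / len(scores)
```

Because every class counts equally in a macro average, families that are confused with each other pull the reported value down even when the legitimate class is recognised well.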

slide-18
SLIDE 18


CHALLENGES AND RESEARCH DIRECTIONS 1/2

BINARY ML PROBLEM

Distinguish AGDs from legitimate FQDNs

1. Do legitimate FQDNs have a signature?
2. Will Context-Aware features improve performance?
3. Find the minimum subset of features.
4. How do deep learning solutions perform in comparison with classic solutions?
5. How does clustering perform with previously unseen variants?

MULTICLASS ML PROBLEM

Recognize multiple malware variants

1. Can ML individually distinguish every family?
2. Will Context-Aware features improve performance?
3. Find the minimum subset of features.
4. How does non-linear analysis perform in comparison with classic solutions?
5. What about ensemble classifiers?
6. And deep learning algorithms?

slide-19
SLIDE 19


CHALLENGES AND RESEARCH DIRECTIONS 2/2

DATASET

There is a desperate need for publicly available data

1. Botnet datasets are rare, mostly outdated and generally partial.
2. To the best of our knowledge, there are no ML-ready publicly available datasets for botnet analysis.
3. Network traces are privacy-invasive; the available datasets are crafted and heavily unbalanced.

BEST PRACTICES

Literature solutions are not replicable

1. Future works must:
   a) Publish the data
   b) Identify the environment
   c) Clearly state the experiments
   d) Specify the configurations
   e) Define the control flows
   f) Define the work flows

slide-20
SLIDE 20

SCALABLE DETECTION OF BOTNETS BASED ON DGA

GET IN TOUCH

Mattia Zago

https://webs.um.es/mattia.zago mattia.zago@um.es

Manuel Gil Pérez

https://webs.um.es/mgilperez mgilperez@um.es

Gregorio Martínez Pérez

https://webs.um.es/gregorio gregorio@um.es

ANY QUESTIONS?

Thank you for attending
