
Improving Malware Classification: Bridging the Static/Dynamic Gap



  1. Improving Malware Classification: Bridging the Static/Dynamic Gap
     Authors: Blake Anderson, Curtis Storlie, Terran Lane
     Vinit Singh, 18th April 2017
     CISC850 Cyber Analytics

  2. INTRODUCTION
     • Why is machine learning needed in malware detection?
     • Why different types of data sources are needed, and how to combine them.
     • A unified framework: a support vector machine trained with multiple kernel learning.

  3. DATA SOURCES
     • STATIC SOURCES: Binary, Disassembled Binary, Control Flow Graph (CFG)
     • DYNAMIC SOURCES: Dynamic Instruction Traces (DIT), Dynamic System Call Traces (DST)
     • MISCELLANEOUS FILE INFORMATION: Entropy, Packers, Instructions in the file, Vertices and edges in the CFG
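
The slide lists entropy as one of the miscellaneous file features but does not say how it is computed. A minimal sketch, assuming the usual Shannon entropy over byte values (the function name `byte_entropy` is hypothetical):

```python
import math
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte (range 0 to 8)."""
    if not data:
        return 0.0
    counts = Counter(data)  # occurrences of each byte value
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


constant = byte_entropy(b"\x00" * 64)      # a single repeated byte value
uniform = byte_entropy(bytes(range(256)))  # every byte value exactly once
```

Values close to 8 bits per byte are typical of compressed or encrypted content, which is why entropy is often used as a hint that a file has been packed.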

  4. METHOD STEP 1: DATA REPRESENTATION
     • Markov chain representation for the raw binary, disassembled binary, DIT and DST
     • Standard representation for the Control Flow Graph
     • The miscellaneous file information is represented as a simple feature vector of length seven
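
The Markov chain representation can be sketched as follows: each distinct symbol in a trace becomes a state, and the transition probabilities are estimated from counts of consecutive pairs. This is only an illustration; the function name `transition_matrix`, the toy trace, and the three instruction names are assumptions, not the categories used in the paper.

```python
def transition_matrix(sequence, states):
    """Estimate Markov chain transition probabilities from one trace.

    Row i, column j holds the fraction of times state i was followed by state j.
    Rows for states that are never followed by anything stay all zero.
    """
    index = {s: i for i, s in enumerate(states)}
    n = len(states)
    counts = [[0] * n for _ in range(n)]
    # Count every consecutive pair (a, b) in the trace.
    for a, b in zip(sequence, sequence[1:]):
        counts[index[a]][index[b]] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


# Hypothetical toy trace over three instruction categories.
trace = ["mov", "add", "mov", "jmp", "mov", "add"]
states = ["mov", "add", "jmp"]
P = transition_matrix(trace, states)
```

Flattening `P` row by row gives a fixed-length vector per file, which is the kind of input a kernel over transition probabilities can compare.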

  5. STEP 2: KERNELS
     • The Kernel Trick
     • Exponential Kernel: x_i is the feature vector of the file information, or the transition probabilities of the Markov chain
     • Graphlet Kernel: G is the graph; k is the number of nodes in each subgraph (graphlet); f_G is the feature vector counting how many times each unique subgraph of size k occurs in G; D_G = f_G / (number of all graphlets of size k) is the normalized probability vector
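
The kernel formulas themselves are not preserved on this slide. The sketch below assumes the exponential kernel has the common squared-exponential (Gaussian) form, and that the graphlet kernel is the inner product of the two normalized vectors D_G, as it is commonly defined. For brevity it uses k = 3 on an undirected graph, where the graphlet type is determined by the number of edges among the three nodes; the function names and the parameters `sigma` and `lam` are hypothetical.

```python
import math
from itertools import combinations


def gaussian_kernel(x, y, sigma=1.0, lam=1.0):
    """Assumed form: sigma^2 * exp(-||x - y||^2 / (2 * lam^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return sigma ** 2 * math.exp(-sq_dist / (2 * lam ** 2))


def graphlet_distribution(nodes, edges):
    """D_G for k = 3: fraction of 3-node subsets having 0, 1, 2 or 3 edges."""
    edge_set = {frozenset(e) for e in edges}
    counts = [0, 0, 0, 0]  # f_G, indexed by number of edges in the subgraph
    for trio in combinations(nodes, 3):
        m = sum(frozenset(pair) in edge_set for pair in combinations(trio, 2))
        counts[m] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * 4


def graphlet_kernel(d_g, d_h):
    """Inner product of two normalized graphlet vectors."""
    return sum(a * b for a, b in zip(d_g, d_h))


triangle = graphlet_distribution([0, 1, 2], [(0, 1), (1, 2), (0, 2)])
path = graphlet_distribution([0, 1, 2], [(0, 1), (1, 2)])
same = graphlet_kernel(triangle, triangle)
different = graphlet_kernel(triangle, path)
```

Enumerating all node subsets is only practical for small graphs; for larger control flow graphs the graphlet counts are usually estimated by sampling.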

  6. Heatmaps for Individual Kernels

  7. STEP 3: MULTIPLE KERNEL LEARNING
     • Classical (single-kernel) learning: an optimization problem, subject to a constraint, yields the decision function.
     • For multiple kernel learning we additionally need to estimate the kernel weights β_k.
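
The formulas on this slide did not survive extraction. As a hedged reference, the textbook soft-margin SVM dual, its decision function, and the usual multiple kernel learning combination are written out here. The symbols α_i, y_i, C and b, and the constraints on β_k, are the standard ones and are assumed rather than taken from the slide.

```latex
\max_{\alpha}\; \sum_{i=1}^{n} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j)
\quad \text{subject to} \quad
\sum_{i=1}^{n} \alpha_i y_i = 0, \qquad 0 \le \alpha_i \le C

f(x) = \operatorname{sgn}\!\left( \sum_{i=1}^{n} \alpha_i y_i K(x_i, x) + b \right)

K(x_i, x_j) = \sum_{k} \beta_k K_k(x_i, x_j),
\qquad \beta_k \ge 0, \qquad \sum_{k} \beta_k = 1
```

With a single kernel only the α_i are learned. With several kernels K_k, one per data source, the weights β_k are learned jointly with the α_i.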

  8. Heatmap of Combined Kernel

  9. RESULTS
     • Criterion 1: Accuracy, calculated using 10-fold cross-validation.
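
A minimal sketch of how 10-fold cross-validated accuracy can be computed. The slide does not say how the folds were formed, so this version assumes contiguous, unshuffled folds and pools the correct predictions over all folds. The names `k_fold_indices`, `cross_validated_accuracy` and `train_and_predict` are hypothetical.

```python
def k_fold_indices(n_samples, k=10):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validated_accuracy(labels, train_and_predict, k=10):
    """Hold out each fold once; pool correct predictions over all folds.

    train_and_predict(train_idx, test_idx) must return one predicted label
    per index in test_idx.
    """
    correct = 0
    for test_idx in k_fold_indices(len(labels), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(labels)) if i not in held_out]
        predictions = train_and_predict(train_idx, test_idx)
        correct += sum(p == labels[i] for p, i in zip(predictions, test_idx))
    return correct / len(labels)


# Toy check with a stand-in "classifier" that always predicts class 0.
labels = [0, 1] * 10
baseline = cross_validated_accuracy(labels, lambda train, test: [0] * len(test))
fold_sizes = [len(f) for f in k_fold_indices(1556, 10)]
```

In practice the samples are usually shuffled or stratified by class before splitting, so that each fold has a similar class balance.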

  10. Criterion 2: ROC Curves / AUC Values
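
For reference, the AUC value summarizing a ROC curve equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. A minimal sketch of that definition, with made-up scores and a hypothetical function name `auc`:

```python
def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up decision scores; label 1 = malware, 0 = benign.
example = auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```

The pairwise loop is quadratic in the number of samples; a rank-based computation gives the same value faster on large sets.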

  11. Criterion 3: Speed to classify new instances

  12. Criterion 4: Testing on a Large Malware Sample
     • Accuracy on a validation set consisting of 20k samples

  13. OBSERVATIONS
     • A total of 19 misclassifications (false positives plus false negatives) were found among the 1556 instances of the original dataset.
     • Static analysis alone does not work well when the training instances have been packed.
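
As a quick check of what these counts imply, assuming the 19 errors are all of the misclassifications on the 1556 instances:

```latex
\text{error rate} = \frac{19}{1556} \approx 0.0122 = 1.22\%,
\qquad
\text{accuracy} = 1 - \frac{19}{1556} = \frac{1537}{1556} \approx 98.78\%
```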

  14. LIMITATIONS AND DRAWBACKS
     • Selecting an appropriate value of n for n-gram analysis
     • Collecting dynamic traces is too time- and resource-intensive on a normal system
     • Choosing optimal instruction call categories
     • Intel Pin is not transparent to the program it traces while collecting instructions

  15. RELATED WORK
     • Use of single data sources
     • Use of static data sources combined with ensemble learning
     • Result Fusion Model
     • Identifying packed and hidden code

  16. CONCLUSION
     • Not restricting malware classification to a single data source improves classification accuracy.
     • In a resource-constrained environment, combined static analysis can achieve high accuracy and a low number of false positives.
     • Static analysis is not an optimal solution when instances have been packed or have high entropy.
