
PE-Miner: Mining Structural Information to Detect Malicious Executables in Realtime. M. Zubair Shafiq, S. Momina Tabish, Fauzan Mirza, Muddassar Farooq. RAID, 2009.


SLIDE 1

PE-Miner: Mining Structural Information to Detect Malicious Executables in Realtime

RAID, 2009

  • M. Zubair Shafiq, S. Momina Tabish, Fauzan Mirza, Muddassar Farooq

Next Generation Intelligent Networks Research Center (nexGIN RC), Pakistan


SLIDE 2

Outline

Agenda

  • Introduction to Domain
  • Problem Definition
  • Proposed Solution
  • Evaluation
  • Literature Survey
  • Results and Discussion
  • Conclusion


SLIDE 3

Domain Introduction

SLIDE 4

Introduction

  • Computer malware is a widespread problem: backdoors, viruses, worms, Trojans, etc.
  • A number of commercial anti-virus software products

SLIDE 5

Financial losses…

[Chart: number of new threats and total threats per half-year, Jan-Jun 2002 to Jan-Jun 2007; y-axis in thousands, 100 to 600]

Estimated damage (in billions of US dollars):

  Year    Damage
  1999    12.1
  2000    17.1
  2001    13.2
  2002    25
  2003    55


SLIDE 6

Need for non-signature based AV?

  • Problems with signature matching…
      • Size of signature database cannot scale
      • Evaded by simple code obfuscation techniques

[Table: Norton AV, Command AV, and McAfee AV against Chernobyl-1.4, F0sf0r0, Hare, and Z0mbie-6.b]
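
The evasion problem with signature matching can be illustrated with a toy scanner. This is only a sketch: the signature bytes, the sample "programs", and the helper names `matches_signature` and `insert_padding` are made up for illustration and do not come from any real malware or AV product.

```python
# Toy exact-match signature scanner. All byte values below are arbitrary
# placeholders chosen for the illustration.
SIGNATURE = bytes.fromhex("b80100cd21")


def matches_signature(program: bytes, signature: bytes = SIGNATURE) -> bool:
    """Return True if the exact byte signature occurs anywhere in the program."""
    return signature in program


def insert_padding(program: bytes, offset: int, padding: bytes = b"\x90") -> bytes:
    """Simulate a trivial obfuscation: splice padding bytes in at `offset`."""
    return program[:offset] + padding + program[offset:]


original = bytes.fromhex("5589") + SIGNATURE + bytes.fromhex("c3")
# Splitting the signature with a single padding byte defeats the exact match.
obfuscated = insert_padding(original, offset=5)

print(matches_signature(original))    # True
print(matches_signature(obfuscated))  # False
```

A single inserted byte is enough to break an exact byte match, which is why each obfuscated variant would need its own signature.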


SLIDE 7

How good are non-signature based solutions?

  • Problems with existing non-signature solutions…
      • High false alarm rate
      • Large scanning overheads
  • Usually leverage
      • Statistical analysis of machine level byte content
      • Disassembled code
      • Run-time API calls

SLIDE 8

Problem Definition

SLIDE 9

Problem Definition

  • Non-signature based detector
  • Keep run-time complexity low
  • “Content Independent” features
  • Low false alarm rate

SLIDE 10

Proposed Solution

SLIDE 11

Proposed Solution: “PE-Miner”

  • Leverage the structural information of an executable
  • Extract structural features from all portions of an executable
  • Apply standard pre-processing to remove redundancy
  • Use supervised classification algorithms for detection
  • Trained models provide comprehensible insights for forensic experts

SLIDE 12

PE-Miner Framework

  • Uses novel structural features to efficiently detect malicious PE files
  • Strict requirements of the system:
      • Must be a pure non-signature based framework with an ability to detect zero-day malicious PE files
      • Must be realtime deployable, i.e., more than 99% tp rate and less than 1% fp rate
      • Design must be modular, allowing for a plug-n-play design philosophy
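
The accuracy requirement can be checked mechanically once a detector's predictions are available. The following is a minimal sketch, assuming labels and predictions are encoded as 1 (malicious) and 0 (benign); the function names and the toy data are hypothetical and are not results from the paper.

```python
def detection_rates(labels, predictions):
    """Return (tp_rate, fp_rate) for 1 = malicious, 0 = benign."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    return tp / (tp + fn), fp / (fp + tn)


def meets_requirement(tp_rate, fp_rate):
    """Thresholds from the slide: more than 99% tp rate, less than 1% fp rate."""
    return tp_rate > 0.99 and fp_rate < 0.01


# Toy data: 200 malicious and 200 benign files, one mistake in each class.
labels = [1] * 200 + [0] * 200
predictions = [1] * 199 + [0] + [0] * 199 + [1]

tp_rate, fp_rate = detection_rates(labels, predictions)
print(tp_rate, fp_rate, meets_requirement(tp_rate, fp_rate))
```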

SLIDE 13

PE-Miner Framework

  • A threefold research methodology in our static analysis:

1. Identify a set of structural features for PE files which is computable in realtime,
2. use an efficient preprocessor for removing redundancy in the feature set, and
3. select an efficient data mining algorithm for final classification

SLIDE 14

Proposed Architecture

SLIDE 15

Which PE format features to select?

  • Structural features from Windows PE file format
  • 189 features selected
  • For example, malicious executables usually have
      • bigger import tables,
      • smaller resource tables,
      • no exception tables
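
To make "structural features" concrete, the sketch below reads a handful of header fields from a PE file using only the Python standard library. It assumes that a "table size" corresponds to the Size field of the matching data directory entry, which may differ from how the paper computes its features; it covers only a few fields, not the 189 features mentioned above. The names `structural_features` and `build_toy_pe` are hypothetical, and the toy file contains headers only.

```python
import struct

# Data directory indices defined by the PE format (first four entries only).
DIRECTORIES = {"export": 0, "import": 1, "resource": 2, "exception": 3}


def structural_features(pe: bytes) -> dict:
    """Read the section count and a few data directory sizes from PE headers."""
    if pe[:2] != b"MZ":
        raise ValueError("not an MZ executable")
    (e_lfanew,) = struct.unpack_from("<I", pe, 0x3C)   # offset of PE signature
    if pe[e_lfanew:e_lfanew + 4] != b"PE\0\0":
        raise ValueError("missing PE signature")
    coff = e_lfanew + 4                                # 20-byte COFF header
    (number_of_sections,) = struct.unpack_from("<H", pe, coff + 2)
    optional = coff + 20
    (magic,) = struct.unpack_from("<H", pe, optional)
    if magic == 0x10B:                                 # PE32
        directories = optional + 96
    elif magic == 0x20B:                               # PE32+
        directories = optional + 112
    else:
        raise ValueError("unknown optional header magic")
    (count,) = struct.unpack_from("<I", pe, directories - 4)  # NumberOfRvaAndSizes
    features = {"number_of_sections": number_of_sections}
    for name, index in DIRECTORIES.items():
        size = 0
        if index < count:
            _rva, size = struct.unpack_from("<II", pe, directories + 8 * index)
        features[name + "_table_size"] = size
    return features


def build_toy_pe(import_size: int, resource_size: int, exception_size: int) -> bytes:
    """Build a headers-only PE32 image for the demo (not a runnable program)."""
    buf = bytearray(0x40 + 4 + 20 + 96 + 16 * 8)
    buf[0:2] = b"MZ"
    struct.pack_into("<I", buf, 0x3C, 0x40)
    buf[0x40:0x44] = b"PE\0\0"
    coff = 0x44
    struct.pack_into("<H", buf, coff + 2, 3)             # NumberOfSections
    struct.pack_into("<H", buf, coff + 16, 96 + 16 * 8)  # SizeOfOptionalHeader
    optional = coff + 20
    struct.pack_into("<H", buf, optional, 0x10B)         # PE32 magic
    struct.pack_into("<I", buf, optional + 92, 16)       # NumberOfRvaAndSizes
    sizes = ((1, import_size), (2, resource_size), (3, exception_size))
    for index, size in sizes:
        struct.pack_into("<II", buf, optional + 96 + 8 * index, 0x1000 * index, size)
    return bytes(buf)


toy = build_toy_pe(import_size=200, resource_size=64, exception_size=0)
print(structural_features(toy))
```

Because only fixed header offsets are read, extraction cost does not depend on the size of the file body, which fits the realtime goal stated earlier.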

SLIDE 16

Which PE format features to select?

  Category             Import Table Size   Rsrc Table Size   Exception Table
  Benign               5.8                 32.6              12.0
  Backdoor+Sniffer     19.2                5.5               0.0
  Cons+Virtool         6.1                 1.5               0.0
  DoS+Nuker            7.9                 1.4               0.0
  Flooder              20.8                6.2               0.0
  Exploit+Hacktool     7.1                 1.0               0.0
  Worm                 23.4                2.6               0.0
  Trojan               10.3                2.2               0.0
  Virus                6.2                 0.5               0.0
  Malfease             4.7                 5.9               3.5

  • Full table in the paper

SLIDE 17

Which pre-processor should be used?

  • Why preprocessing?
      • Out of 189 features, some might not convey useful information!
      • Either remove or combine such features
      • To reduce the dimensionality of the input feature space
      • Reduces training / testing times of classifiers
  • Three pre-processing algorithms used

SLIDE 18

Feature Pre-processing Algorithms

  • Redundant Feature Removal (RFR) -- repeated values
  • Principal Component Analysis (PCA) -- data variance
  • Haar Wavelet Transform (HWT) -- approximation of function
  • RFR is selected because of the high detection accuracy obtained after applying it, and because it is realtime deployable
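
One plausible reading of "repeated values" is that RFR drops features whose value is the same for every sample. The sketch below implements that reading only; the exact criterion used in the paper may differ, and the function name, the `min_distinct` parameter, and the toy rows are assumptions made for illustration.

```python
def remove_redundant_features(rows, names, min_distinct=2):
    """Drop feature columns that take fewer than `min_distinct` distinct values."""
    keep = [
        j for j in range(len(names))
        if len({row[j] for row in rows}) >= min_distinct
    ]
    kept_names = [names[j] for j in keep]
    kept_rows = [[row[j] for j in keep] for row in rows]
    return kept_rows, kept_names


# Toy feature matrix: the second column is constant and carries no information.
names = ["import_table_size", "reserved_field", "resource_table_size"]
rows = [
    [19, 0, 5],
    [6, 0, 32],
    [23, 0, 2],
]

reduced_rows, reduced_names = remove_redundant_features(rows, names)
print(reduced_names)
print(reduced_rows)
```

A filter of this kind needs only one pass over the training data and a column lookup at scan time, which is consistent with the realtime argument on the slide.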

SLIDE 19

Which classification algorithm?

  • IBk – nearest neighbor algorithm
  • J48 – decision tree
  • NB – Bayesian classifier
  • SMO – optimized support vector machine
  • RIPPER – inductive rule learning algorithm
  • J48 is selected for its highest detection accuracy and low computational complexity; the timing analysis also shows that it is realtime deployable
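
J48 is Weka's implementation of the C4.5 decision tree learner, which picks splits using an entropy-based criterion. The sketch below shows the simpler information gain for a threshold split on one numeric feature (C4.5 itself uses the gain ratio, a normalized variant). It is not the J48 implementation; the function names and the toy values are assumptions for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Entropy reduction from splitting on `value <= threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    if not left or not right:
        return 0.0
    n = len(labels)
    remainder = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(labels) - remainder


def best_threshold(values, labels):
    """Try each distinct value (except the largest) as a split point."""
    candidates = sorted(set(values))[:-1]
    return max(
        candidates,
        key=lambda t: information_gain(values, labels, t),
        default=None,
    )


# Toy data: a made-up "exception table size" feature for six files.
values = [0, 0, 0, 12, 10, 14]
labels = ["malicious"] * 3 + ["benign"] * 3

split = best_threshold(values, labels)
print(split, information_gain(values, labels, split))
```

A learned tree is a sequence of such threshold tests, which is what makes the trained model readable by a forensic expert and cheap to evaluate at scan time.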

SLIDE 20

Evaluation

SLIDE 21

Evaluation

  • Evaluation of the proposed framework is done on 2 well known malware collections.
  • Evaluation datasets
      • VX Heavens virus collection: 10 thousand labeled malware
      • Malfease malware collection: 5 thousand malware

SLIDE 22

Literature Survey

SLIDE 23

Learning to Detect and Classify Malicious Executables in the Wild

  • J. Zico Kolter, Marcus A. Maloof

Journal of Machine Learning Research, MIT Press, 2006. (ISI Impact Factor: 2.682) @ Stanford University, Georgetown University, USA

SLIDE 24

Learning to Detect and Classify Malicious Executables in the Wild

[Figure: Overview of KM. Executable File -> Feature Extraction (N-gram Analysis) -> Classification Algorithm (Benign N-gram, Malicious N-gram) -> Result?]
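
The feature extraction step in this kind of pipeline can be sketched as counting overlapping byte n-grams in the raw file content. This is an illustrative sketch only, not the authors' code: the function name, the choice of n in the demo, and the sample bytes are assumptions.

```python
from collections import Counter


def byte_ngrams(data: bytes, n: int = 4) -> Counter:
    """Count every overlapping window of n consecutive bytes."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


# Toy input: six bytes, counted as 2-grams to keep the output small.
sample = bytes([0x00, 0x01, 0x02, 0x00, 0x01, 0x02])
counts = byte_ngrams(sample, n=2)

for gram, count in counts.most_common():
    print(gram.hex(), count)
```

Every byte of the file has to be read and the number of distinct n-grams grows quickly with file size, which helps explain the training cost of n-gram approaches.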

SLIDE 25

Critiques

+ First “real” application of n-gram analysis for malware detection
+ Forensic insights from trained models
+ High accuracy
+ Classification of malicious executables as a function of their payload function (i.e., backdoor, worm, virus, etc.)
- Huge computational complexity in training (several days)
- Not robust to malware packing
- False alarms for packed benign files

SLIDE 26

McBoost: Boosting Scalability in Malware Collection and Analysis Using Statistical Classification of Executables

  • R. Perdisci, A. Lanzi, W. Lee

Annual Computer Security Applications Conference (ACSAC), USA, 2008. (acceptance rate 24.3%) @ Georgia Institute of Technology, Damballa Inc., USA

SLIDE 27

McBoost: Boosting Scalability in Malware Collection and Analysis Using Statistical Classification of Executables

[Figure: Overview of McBoost. An executable file passes through a heuristic packer detector; files flagged as packed go to a dynamic unpacker that extracts hidden code; the malcode classifier (C1, C2) handles packed and non-packed files and produces the result. Diagram labels: A1, A2, A3, Σ.]


SLIDE 28

Critiques

+ First ever technique that leverages packer identification
+ Uses unpacker to extract hidden malicious code
+ Separate n-gram training models for packed and unpacked executable files
- High run-time computational overhead; not feasible for realtime deployment
- Inherits problems with the use of dynamic unpacker: halt, crash, evasion

SLIDE 29

Results and Discussion

SLIDE 30

Results

SLIDE 31

Discussion

  • Highly accurate
  • Low scanning overheads
  • Structural features are robust to evasion attempts?

SLIDE 32

Discussion

  • Scenario 1: Training on random PE files and detection of packed PE files (AUC = 0.994)
  • Scenario 2: Training on non-packed PE files and detection of packed PE files (AUC = 0.964)
  • Scenario 3: Training on packed PE files and detection of non-packed PE files (AUC = 0.901)
  • Scenario 4: Training on packed/non-packed PE files and detection of packed benign and non-packed malicious PE files (AUC = 0.995)
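
For readers unfamiliar with the AUC metric used in these scenarios: it is the probability that a randomly chosen malicious file receives a higher classifier score than a randomly chosen benign file, with ties counted as one half. The sketch below computes it directly from that definition. The scores and labels are toy values, not the paper's data, and the function name is an assumption.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (1 = malicious, 0 = benign)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one sample of each class")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Toy scores: one malicious file (0.4) is ranked below one benign file (0.6).
scores = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]

print(auc(scores, labels))
```

An AUC of 1.0 means every malicious file outranks every benign file, and 0.5 corresponds to random ranking.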

SLIDE 33

Future Work

  • Completely remove bias w.r.t. executable packing
  • Use a non-signature based packer detector
  • PE-probe: leveraging packer detection and structural information to detect malicious portable executables, Virus Bulletin (VB) Conference, September 2009.

[Figure: Executable File -> PD (non-signature based packer detector) -> packed: P1, non-packed: P2 (specialized PE-Miner models) -> Result?]
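
The routing idea in this diagram can be sketched in a few lines: a packer detector decides which specialized model sees the file. Everything here is hypothetical scaffolding; the function `pe_probe`, its parameters, and the stub detector and models are placeholders for illustration, not the PE-probe implementation.

```python
from typing import Callable

Detector = Callable[[bytes], bool]   # returns True if the file looks packed
Model = Callable[[bytes], bool]      # returns True if the file is judged malicious


def pe_probe(executable: bytes, is_packed: Detector,
             packed_model: Model, non_packed_model: Model) -> bool:
    """Route the file to the model trained for its packing status (PD -> P1 or P2)."""
    model = packed_model if is_packed(executable) else non_packed_model
    return model(executable)


# Stub components for the demo; real ones would be trained classifiers.
def looks_packed(data: bytes) -> bool:
    return data.startswith(b"PACKED")


def model_p1(data: bytes) -> bool:
    return True


def model_p2(data: bytes) -> bool:
    return False


print(pe_probe(b"PACKED....", looks_packed, model_p1, model_p2))
print(pe_probe(b"plain.....", looks_packed, model_p1, model_p2))
```

Keeping the detector and the two models as separate callables mirrors the plug-n-play design goal: any component can be swapped without touching the others.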

SLIDE 34

Thanks

SLIDE 35

References (1/2)

  • R. Perdisci, A. Lanzi, W. Lee, “McBoost: Boosting Scalability in Malware Collection and Analysis Using Statistical Classification of Executables”, Annual Computer Security Applications Conference (ACSAC), USA, 2008. (In Press)
  • M.G. Schultz, E. Eskin, E. Zadok, S.J. Stolfo, “Data mining methods for detection of new malicious executables”, IEEE Symposium on Security and Privacy (S&P), pp. 38-49, USA, 2001.
  • J.Z. Kolter, M.A. Maloof, “Learning to detect malicious executables in the wild”, ACM International Conference on Knowledge Discovery and Data Mining (KDD), pp. 470-478, USA, 2004.
  • Symantec Internet Security Threat Reports I-XI (Jan 2002-Jan 2008).

School of Electrical Engineering and Computer Science (SEECS)


SLIDE 36

References (2/2)

  • F-Secure Corporation, “F-Secure Reports Amount of Malware Grew by 100% during 2007”, Press release, 2007.
  • VX Heavens Virus Collection, VX Heavens website, available at http://vx.netlux.org.
  • Project Malfease, available at http://malfease.oarci.net/.
  • F. Bellard, “QEMU, a fast and portable dynamic translator”, USENIX Annual Technical Conference, FREENIX Track, pp. 41-46, 2005.
  • J.R. Quinlan, “C4.5: Programs for machine learning”, Morgan Kaufmann, USA, 1993.