Analysis of Code Heterogeneity for High-precision Classification of Repackaged Malware




SLIDE 1

Analysis of Code Heterogeneity for High-precision Classification of Repackaged Malware

Ke Tian, Daphne Yao, Barbara Ryder
Department of Computer Science, Virginia Tech

Gang Tan
Department of Computer Science and Engineering, Penn State University

MOBILE SECURITY TECHNOLOGIES 2016

SLIDE 2

Background

Repackaged Malware | Machine Learning

  ✓ Motivation: Repackaged malware skews machine learning results.
  ✓ Solution: Partition + machine learning classification.
  ✓ Experiment: 30-fold improvement in false negative rate over the non-partition ML approach!

SLIDE 3

Background

Repackaged Malware

Android malware writers are repackaging legitimate (popular) apps with malicious payloads [1].

Repackaging process: original apps -> disassembling -> injecting malicious payload and re-assembling.

Example: FakeAngryBird.apk
  • Front: game activity
  • Back: stealing SMS info

[1] http://www.zdnet.com/article/android-malwares-dirty-secret-repackaging-of-legit-apps/

SLIDE 4

Background

Conventional Machine Learning for Malware Classification

Pipeline: a huge dataset -> extract feature vectors -> train and classify as benign or malicious.

Feature choices in prior work:
  • DroidAPIMiner: sensitive APIs
  • Drebin: APIs, constant strings, URLs
  • Peng et al.: permissions
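
The "extract feature vectors" step can be sketched as follows. This is a minimal illustration that assumes a binary encoding over a fixed vocabulary; the vocabulary entries and the `to_feature_vector` helper are hypothetical and are not taken from DroidAPIMiner, Drebin, or Peng et al.

```python
from typing import Iterable, List

# Hypothetical vocabulary mixing the feature kinds named on the slide
# (sensitive APIs, permissions, URLs). Real systems use far larger sets.
VOCABULARY: List[str] = [
    "android.telephony.SmsManager.sendTextMessage",
    "android.permission.READ_SMS",
    "android.permission.INTERNET",
    "http://example.com/c2",
]


def to_feature_vector(observed: Iterable[str]) -> List[int]:
    """Encode the features observed in one app as a 0/1 vector."""
    seen = set(observed)
    return [1 if name in seen else 0 for name in VOCABULARY]


# Features observed in one (made-up) app.
app_features = ["android.permission.INTERNET", "android.permission.READ_SMS"]
vector = to_feature_vector(app_features)
```

Each app becomes one such vector, and a classifier is then trained on the labeled vectors.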

SLIDE 5

Background

No. What are the specific challenges and solutions?

SLIDE 6

Motivation

Code Heterogeneity | Challenges

Existing machine learning techniques extract features from the entire app, so repackaged malware skews classification results (i.e., it introduces false negatives).

Heterogeneous code: code with different security behaviors in different code portions.

Research question: How to recognize heterogeneity in code?

Example: FakeAngryBird.apk contains both benign behaviors and malicious behaviors.
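
A small worked example may help show why whole-app features can hide a payload. The counts below are invented for illustration only; they assume a single feature (the share of API calls that are sensitive) and are not measurements from the talk.

```python
# Hypothetical (sensitive API calls, total API calls) per code portion.
portions = {
    "original game code": (0, 400),
    "injected payload": (40, 50),
}

sensitive_total = sum(s for s, _ in portions.values())
calls_total = sum(t for _, t in portions.values())

# Feature computed over the entire app: the payload is diluted.
whole_app_ratio = sensitive_total / calls_total

# Same feature computed per portion: the payload stands out.
per_portion_ratio = {name: s / t for name, (s, t) in portions.items()}
```

With these made-up numbers the whole-app ratio is below 0.1, while the injected portion alone has a ratio of 0.8, which is the kind of contrast that partitioning is meant to expose.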

SLIDE 7

Motivation

Tasks:
  • How to partition the code?
  • How to extract efficient features?
  • How to calculate the malware score?

[Figure: feature vectors for the whole app versus per-partition feature vectors. The non-partition approach outputs a label in {0,1}; ours outputs a malware score r.]

SLIDE 8

Motivation

First attempt: partition based on direct method call relations

[Figure: Class A, Class B, and Class C with their methods (OnCreate, OnPause, b1, OnCreate, OnStart, c1, c2, c3) connected by direct call edges.]

SLIDE 9

Motivation

First attempt: does not work well

Missed implicit dependence relations!

[Figure: the same Class A, Class B, and Class C diagram (OnCreate, OnPause, b1, OnCreate, OnStart, c1, c2, c3). Besides direct calls, it marks the relations the first attempt misses: implicit life-cycle calls, data field use, and ICC calls.]

SLIDE 10

Solution

Graph Generation | Partition & Mapping | Feature Extraction

2-level graph
  • Class-level Dependence Graph (CDG): captures event (activity) relations. Each node is a class.
  • Method-level Call Graph (MCG): used for subsequent feature extraction.

SLIDE 11

Solution

Class-level Dependence Graph (CDG)
  ✓ Class-level call dependence
  ✓ Class-level data dependence
  ✓ Class-level ICC dependence

Inferred from static analysis:
  • call: invoke
  • data: iget
  • ICC: invoke startActivity (explicit ICC)

[Figure: Class A through Class F connected by call, data, and ICC edges.]
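
One way to picture partitioning on a class-level dependence graph is to group classes that are connected by any call, data, or ICC edge. The sketch below assumes that a dependence region is a weakly connected component of the CDG; the slides do not spell out the exact partition rule, and the classes, edges, and the `dependence_regions` helper are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# Each edge is (source class, target class, dependence kind).
# The class names and edges are made up for illustration.
Edge = Tuple[str, str, str]

EDGES: List[Edge] = [
    ("A", "B", "call"),  # e.g. an invoke from A into B
    ("B", "C", "data"),  # e.g. an iget on a field of C
    ("D", "F", "icc"),   # e.g. startActivity with an explicit target
]
CLASSES = ["A", "B", "C", "D", "E", "F"]


def dependence_regions(classes: List[str], edges: List[Edge]) -> List[Set[str]]:
    """Group classes into weakly connected components of the CDG."""
    neighbors: Dict[str, Set[str]] = defaultdict(set)
    for src, dst, _kind in edges:
        neighbors[src].add(dst)
        neighbors[dst].add(src)  # ignore edge direction when grouping

    regions: List[Set[str]] = []
    visited: Set[str] = set()
    for start in classes:
        if start in visited:
            continue
        region: Set[str] = set()
        stack = [start]
        while stack:  # iterative depth-first traversal
            node = stack.pop()
            if node in region:
                continue
            region.add(node)
            stack.extend(neighbors[node] - region)
        visited |= region
        regions.append(region)
    return regions


regions = dependence_regions(CLASSES, EDGES)
```

In this toy input, classes A, B, and C fall into one region, D and F into another, and E forms a region of its own because nothing depends on it.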

SLIDE 12

So far, the code has been partitioned using the class-level dependence graph.

Can feature extraction be done on the class-level call graph?
  • No. Why? The class-level call graph is too coarse-grained and lacks useful method information. Method-level details are needed.

SLIDE 13

Solution

Mapping Through Projection (to prepare for feature extraction)

[Figure: Dependence Region 1, containing Class D and Class F, is projected onto its methods a, b, and c. The method-level features (f1 f2, f2 f3, f2 f4) aggregate to the region vector f1, 3*f2, f3, f4, …]

Features are aggregated within each region.
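
The aggregation in the figure (f1, 3*f2, f3, f4) can be reproduced by counting feature occurrences over the methods of a region. This sketch assumes that methods a, b, and c carry the feature pairs shown on the slide in that order; the `aggregate_region` helper is hypothetical.

```python
from collections import Counter
from typing import Dict, List

# Assumed per-method features for the methods in one dependence region.
METHOD_FEATURES: Dict[str, List[str]] = {
    "a": ["f1", "f2"],
    "b": ["f2", "f3"],
    "c": ["f2", "f4"],
}


def aggregate_region(method_features: Dict[str, List[str]]) -> Counter:
    """Sum feature occurrences over every method projected into the region."""
    totals: Counter = Counter()
    for features in method_features.values():
        totals.update(features)
    return totals


region_vector = aggregate_region(METHOD_FEATURES)
```

Here `region_vector` counts f2 three times and every other feature once, matching the slide's "3*f2".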

SLIDE 14

Solution

Feature Extraction for Regions
  ✓ Type I: User interaction features (user-related functions and the graph-related impact features)
  ✓ Type II: Sensitive API features (sensitive Java and Android APIs)
  ✓ Type III: Permission request features (permissions used in each region)

1. Features are used to profile each region's behaviors.
2. Traditional features are combined with user interaction and graph properties.

SLIDE 15

Solution

Classification of Apps
  • Binary classification for each dependence region.
  • Computing the malware score for an app based on results from all regions.

$$ \text{Malware score } r = \frac{\text{malicious regions}}{\text{total regions in the app}}, \qquad r \in [0, 1] $$

The score is a continuous value in [0,1].
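
The score on this slide, malicious regions divided by total regions in the app, can be written as a short function. This is a minimal sketch that assumes the per-region classifier has already produced one binary verdict per region; the `malware_score` name is hypothetical.

```python
from typing import Sequence


def malware_score(region_is_malicious: Sequence[bool]) -> float:
    """Fraction of an app's regions that the per-region classifier flags."""
    if not region_is_malicious:
        raise ValueError("an app must have at least one region")
    # True counts as 1 and False as 0 when summed.
    return sum(region_is_malicious) / len(region_is_malicious)


# A made-up app with four regions, one of them flagged as malicious.
score = malware_score([True, False, False, False])
```

For a single-region app the score collapses to 0 or 1, which matches the {0,1} output of a non-partition classifier.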

SLIDE 16

Solution

Solution summary:
  • (Partition) Partition the app into different regions -> Class-level Dependence Graph (CDG)
  • (Feature) Independently classify each region -> Method-level Call Graph (MCG)
  • (Classification) Map the features through projection and calculate the malware score

Limitations:
  • Graph accuracy: needs more accurate program analysis
  • Dynamic code: native libraries
  • Integrated malware: hard to partition

SLIDE 17

Experiment

Classification of non-repackaged apps
  • Each app contains just a single dependence region.
  • The region is labeled as benign or malicious according to the dataset.
  • These apps are used to evaluate the features and to obtain trained classifiers.

Workflow: single-region apps dataset -> train classifiers -> classify multiple-region apps

SLIDE 18

Experiment

Non-repackaged app | Repackaged app | Ads Library

Classification of non-repackaged apps
  • Random Forest performs best.
  • All three types of features are effective.

Random Forest is used as the standard classifier to test repackaged malware.

SLIDE 19

Experiment

Classification of Repackaged Malware

Use the same trained Random Forest to test.

Test three repackaged malware families:
  1. Geinimi
  2. Kungfu
  3. AnserverBot

Comparison:
  1. Entire-app classification (basic)
  2. Our partition classification

  • Our FNR is a 30-fold improvement over the non-partition approach!
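
For reference, the false negative rate (FNR) is the share of malicious samples that a classifier labels as benign. The sketch below reads "30-fold improvement" as the ratio of the two FNR values, which is an interpretation rather than something the slide states. The counts are made up so that the ratio comes out near 30; they are not the experiment's data.

```python
def false_negative_rate(false_negatives: int, true_positives: int) -> float:
    """FNR = FN / (FN + TP), computed over the malicious test samples."""
    total_malicious = false_negatives + true_positives
    if total_malicious == 0:
        raise ValueError("no malicious samples to evaluate")
    return false_negatives / total_malicious


# Made-up counts for 100 malicious test apps under each approach.
fnr_entire_app = false_negative_rate(false_negatives=30, true_positives=70)
fnr_partition = false_negative_rate(false_negatives=1, true_positives=99)

# Ratio of the two rates; about 30 for these invented counts.
improvement = fnr_entire_app / fnr_partition
```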

SLIDE 20

Experiment

Case Study of Heterogeneous Properties
  • Malicious region: sensitive permissions and APIs
  • Benign region: user-interaction functions

Need to look into the code structure!

SLIDE 21

Experiment

Region analysis in popular apps
  • Analyzed 1,617 free popular apps from Google Play.
  • 158/1,617 ≈ 9.8% of apps contain multiple regions.
  • Ad libraries introduce multiple regions in apps.
  • Some aggressive ads libraries introduce alerts in the detection.

Table: alerts made by the Group 2 ads library (Group 1: admob | Group 2: adlantis)

SLIDE 22

Experiment

False negatives:
  1) Integrated benign and malicious behaviors.
  2) Not enough malicious behaviors in malicious components.

False positives:
  Some aggressive packages and libraries, e.g., Adlantis, result in false alarms in our detection.

SLIDE 23

Discussion

Limitations | Future Work | Usage

Conclusions:
  • Our approach achieves a 30-fold improvement in false negative rate over the non-partition-based approach.
  • Our approach is able to identify malicious code in repackaged malware.
  • Partition can be used to label malicious code or isolate inserted code (ads packages or dead code).

Future work:
  More effort on partition/detection for code provenance!
