Analysis of Code Heterogeneity for High-precision Classification of - PowerPoint PPT Presentation

MOBILE SECURITY TECHNOLOGIES 2016
Analysis of Code Heterogeneity for High-precision Classification of Repackaged Malware
Ke Tian, Daphne Yao, Barbara Ryder (Department of Computer Science, Virginia Tech)
Gang Tan (Department of Computer Science and Engineering, Penn State University)

Background | Repackaged Malware | Machine Learning
• Motivation: Repackaged malware skews machine learning results.
• Solution: Partition + machine learning classification.
• Experiment: 30-fold improvement in false negative rate over the non-partition ML approach.

Repackaged Malware
Android malware writers are repackaging legitimate (popular) apps with malicious payloads [1].
[Figure: original apps are disassembled, injected with a malicious payload, and re-assembled. Example FakeAngryBird.apk: game activity in the front, stealing SMS info in the back.]
[1] http://www.zdnet.com/article/android-malwares-dirty-secret-repackaging-of-legit-apps/

Conventional Machine Learning for Malware Classification
Pipeline: a huge dataset of benign and malicious apps -> extract feature vectors -> train and classify.
Features used in prior work:
• DroidAPIMiner: sensitive APIs
• Drebin: APIs, constant strings, URLs
• Peng et al.: permissions

No. What are the specific challenges and solutions?

Motivation | Challenges | Code Heterogeneity
Heterogeneous code: code with different security behaviors in different code portions (e.g., FakeAngryBird.apk contains both malicious and benign behaviors).
Existing machine learning techniques extract features from the entire app, so repackaged malware skews classification results (i.e., introduces false negatives).
Research question: How to recognize heterogeneity in code?

Code Heterogeneity
[Figure: the non-partition approach maps one whole-app feature vector (1 4 5 2 0 6 4 2) to a label in {0,1}; our approach computes a malware score r from per-region feature vectors (0 1 1 0 0 1 2 0) and (1 3 4 2 0 5 2 2).]
Tasks:
• How to partition the code?
• How to extract efficient features?
• How to calculate the malware score?

First attempt: partition based on direct method call relations
[Figure: Classes A, B, and C, with methods such as onCreate, onStart, onPause, b1, c1, c2, and c3 connected by direct call edges.]

First attempt: does not work well
[Figure: the same Classes A, B, and C; ICC calls, implicit life-cycle calls, and data field uses connect them in addition to the direct call edges.]
Missed implicit dependence relations!

Solution | Graph Generation | Partition & Mapping | Feature Extraction
2-level graph:
• Class-level Dependence Graph (CDG) to capture event (activity) relations.
• Method-level Call Graph (MCG) for subsequent feature extraction.

Class-level Dependence Graph (CDG)
Inferred from static analysis:
• Class-level call dependence (invoke).
• Class-level data dependence (iget).
• Class-level ICC dependence (e.g., startActivity, explicit ICC).
[Figure: Classes A to F connected by call, data, and ICC edges.]
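A minimal sketch of how dependence regions could be derived from a CDG. It assumes a region is a connected component of the graph once call, data, and ICC edges are treated as undirected; the slides do not state the exact partition rule, so that rule, the function name `partition_regions`, and the example edges are illustrative assumptions.

```python
from collections import defaultdict

def partition_regions(classes, edges):
    """Group classes into regions, taken here as connected components of the CDG.

    `edges` holds (source_class, target_class, kind) triples, where kind is
    "call", "data", or "icc". All kinds are treated alike (an assumption).
    """
    adjacency = defaultdict(set)
    for src, dst, _kind in edges:
        adjacency[src].add(dst)
        adjacency[dst].add(src)

    seen = set()
    regions = []
    for cls in classes:
        if cls in seen:
            continue
        # Depth-first search collects every class reachable from `cls`.
        seen.add(cls)
        stack = [cls]
        component = set()
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        regions.append(component)
    return regions

# Hypothetical CDG: the edges below are made up for illustration.
classes = ["A", "B", "C", "D", "E", "F"]
edges = [
    ("A", "C", "call"),
    ("A", "B", "data"),
    ("C", "E", "call"),
    ("D", "F", "icc"),
]
regions = partition_regions(classes, edges)
print(regions)
```

Treating the three dependence kinds uniformly keeps the sketch short; an analysis that weights or filters edge kinds would change only how `adjacency` is built.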

So far, the code has been partitioned using the class-level dependence graph.
Can feature extraction be done on the class-level graph? No. Why? The class-level graph is too coarse-grained and lacks useful method information. Method-level details are needed.

Mapping Through Projection (to prepare for feature extraction)
[Figure: methods a, b, and c in Classes D and F of Dependence Region 1 carry features (f2 f3), (f1 f2), and (f2 f4); the features are aggregated in each region, giving f1, 3*f2, f3, f4, …]
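The projection step can be sketched as counting method-level features per region. This is a hedged illustration only: the function name `project_features`, the dictionary layout, and the assignment of methods a, b, c to Classes D and F are assumptions, since the slide figure does not make that assignment explicit.

```python
from collections import Counter

def project_features(method_features, method_to_class, class_to_region):
    """Aggregate method-level features into one feature count per region."""
    region_features = {}
    for method, features in method_features.items():
        region = class_to_region[method_to_class[method]]
        # Counter.update adds one count per listed feature occurrence.
        region_features.setdefault(region, Counter()).update(features)
    return region_features

# Feature labels follow the slide; the method-to-class mapping is assumed.
method_features = {"a": ["f2", "f3"], "b": ["f1", "f2"], "c": ["f2", "f4"]}
method_to_class = {"a": "D", "b": "D", "c": "F"}
class_to_region = {"D": 1, "F": 1}

region_features = project_features(method_features, method_to_class, class_to_region)
print(dict(region_features[1]))
```

With these inputs, Region 1 ends up with one f1, three f2, one f3, and one f4, matching the aggregation shown on the slide.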

Feature Extraction for Regions
• Type I: User interaction features (user-related functions and graph-related impact features).
• Type II: Sensitive API features (sensitive Java and Android APIs).
• Type III: Permission request features (permissions used in each region).
1. Features are used to profile each region's behaviors.
2. Traditional features are combined with user interaction and graph properties.

Classification of Apps
• Binary classification for each dependence region.
• Computing the malware score for an app based on results from all regions:
Malware score $r = \frac{\text{malicious regions}}{\text{total regions in the app}}$, a continuous value in [0,1].
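As a small worked example of the score (malicious regions divided by total regions in the app), the sketch below takes per-region binary labels and returns r. The function name `malware_score` and the 0/1 label encoding are assumptions made for illustration.

```python
def malware_score(region_labels):
    """Return the fraction of regions labeled malicious (1) among all regions."""
    if not region_labels:
        raise ValueError("an app must have at least one region")
    return sum(region_labels) / len(region_labels)

# One malicious and one benign region, as in a simple repackaged app.
score_repackaged = malware_score([1, 0])
# A single benign region, as in a non-repackaged benign app.
score_benign = malware_score([0])
print(score_repackaged, score_benign)
```

A single-region app always scores exactly 0 or 1, so the continuous values in between arise only for apps with multiple regions.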

Solution summary:
• (Partition) Partition the app into different regions -> Class-level Dependence Graph (CDG)
• (Feature) Independently classify each region -> Method-level Call Graph (MCG)
• (Classification) Map the features through projection and calculate the malware score
Limitations:
• Graph accuracy: needs more accurate program analysis.
• Dynamic code: native libraries.
• Integrated malware: hard to partition.

Experiment | Non-repackaged app | Ads Library | Repackaged app
Classification of non-repackaged apps
• Each app contains just a single region (dependence region).
• The region is labeled as benign or malicious from the dataset.
• Used to evaluate the features and get trained classifiers.
[Figure: single-region apps dataset -> train classifiers -> classify multiple-region apps.]

Classification of non-repackaged apps
• Random Forest performs best.
• All three types of features are effective.
• Random Forest is used as the standard classifier to test repackaged malware.

Classification of Repackaged Malware
Three repackaged malware families are tested:
1. Geinimi
2. Kungfu
3. AnserverBot
Comparison:
1. Entire-app classification (basic)
2. Our partition classification
The same trained Random Forest is used for both. Our FNR is a 30-fold improvement over the non-partition approach.

Case Study of Heterogeneous Properties
• Malicious region: sensitive permissions and APIs.
• Benign region: user-interaction functions.
Need to look into the code structure!

Region analysis in popular apps
• Analyzed 1,617 free popular apps from Google Play.
• 158/1,617 = 9.7% of apps contain multiple regions.
• Ad libraries introduce multiple regions in apps.
• Some aggressive ad libraries introduce alerts in the detection.
[Table: alerts made by the Group 2 ad library (Group 1: admob | Group 2: adlantis).]

False negatives:
1) Integrated benign and malicious behaviors.
2) Not enough malicious behaviors in malicious components.
False positives: Some aggressive packages and libraries, e.g., Adlantis, result in false alarms in our detection.

Discussion | Usage | Limitations | Future Work
Conclusions:
• Our approach achieves a 30-fold improvement over the non-partition-based approach.
• Our approach is able to identify malicious code in repackaged malware.
• Partitioning can be used to label malicious code or isolate inserted code (ad packages or dead code).
Future work: more effort on partition/detection for code provenance!
