
SLIDE 1

RESource: A Framework for Online Matching of Assembly with Open Source Code

Ashkan Rahimian*, Philippe Charland**, Stere Preda*, and Mourad Debbabi*

*Computer Security Laboratory, CIISE Concordia University, Montreal, Quebec, Canada
**Mission Critical Cyber Security Section, Defence R&D Canada - Valcartier, Quebec, Canada

ETS – Montréal, Oct. 26th 2012

SLIDE 2

Outline

  • Background
  • Motivation
  • Methodology
  • Case study
  • Conclusion
SLIDE 3

Background

  • Software Reverse Engineering
  • Problem: Binary (Assembly) to Source Matching
  • Domain: Malware Analysis
  • Facts: Code Reuse
  • Code Search Engines
  • Shared Library Imports and Utilization
  • E.g., cryptographic libraries
  • Free and Open Source Software (FOSS)
  • Assumptions: No obfuscation, De-obfuscated code

SLIDE 4

Background

Malware might be built on top of standard components.

– e.g., VCL, MFC, …

Malware developers use specific development environments.

– MS Visual Studio, Borland (Embarcadero), Eclipse, …

Some code may contain fingerprints of the programmer.

– Executable File

Malware authors may utilize free and open-source software.

– Encryption algorithms

Malware often calls low-level kernel APIs.

– User level vs. Kernel level, Bypass common signature templates

SLIDE 5

Outline

  • Background
  • Motivation
  • Methodology
  • Case study
  • Conclusion
SLIDE 6

Motivation

  • 26 million new malware samples identified in 2011 [1]
  • Software reverse engineering is a manually intensive and time-consuming process

  • Malware authors share source code
  • Code sharing websites, Forums, etc.
  • E.g. Flame and Stuxnet are linked
  • Open source libraries widespread
  • Koders, Ohloh, Antepidia, Krugle, Google Code, etc.
  • Software reverse engineers need Automated Tools
  • Mapping ASM to Source Code
  • First attempt: RE-Google

[1] Panda Security, “PandaLabs Annual Report: 2011 Summary,” Jan. 2011; http://press.pandasecurity.com/wp-content/uploads/2012/01/Annual-Report-PandaLabs-2011.pdf.

SLIDE 7

Outline

  • Background
  • Motivation
  • Methodology
  • Case study
  • Conclusion
SLIDE 8

Methodology (1/4)

  • Static Code Analysis
  • Input: ASM file obtained with IDA Pro
  • RESource

[Figure: RESource architecture diagram. Labelled components: ASM, Feature Extraction, Processing Engine (Query Generator, S.E. Driver, Request, Response, Parser Engine, Data Extraction), Code Search Engines, Offline Analysis]

SLIDE 9

Methodology (2/4)

  • Feature Extraction
  • Something exploitable at both ASM and Source Code levels
  • E.g., function names
  • Types of Features
  • Immediate Values (Constants)
  • Strings
  • Function Imports
  • Exports (By name, Ordinal)
  • Function Prototypes (Signatures)
  • Stack Frame Information (Offline Analysis)
  • Variables, Return Values, Parameters, Arguments
  • Size, Number, Sequence
  • Register utilization

int sum(int a, int b){ return a + b; }

sum:
    push %ebp
    mov  %esp,%ebp
    mov  0xc(%ebp),%eax
    add  0x8(%ebp),%eax
    pop  %ebp
    ret
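
The feature types listed on this slide (strings, constants, imported function names) can be pulled out of a disassembly listing with a few regular expressions. The following is a minimal sketch, not the plug-in's actual extractor: the patterns assume an IDA-style Intel-syntax listing, the helper name `extract_features` and the `min_constant` threshold are hypothetical, and Python 3 is used here although the slides cite Python 2.7.3.

```python
import re

# Hypothetical patterns for an IDA-style (Intel syntax) listing; real listings vary.
STRING_RE = re.compile(r'"([^"]*)"')                      # string shown in a comment
IMMEDIATE_RE = re.compile(r'\b([0-9][0-9A-Fa-f]*)h\b')    # hex immediate such as 0EDB88320h
CALL_RE = re.compile(r'\bcall\s+(?:ds:)?([A-Za-z_@?$][\w@?$]*)')


def extract_features(asm_text, min_constant=0x100):
    """Collect strings, large constants, and named call targets from a listing."""
    features = {"strings": [], "constants": [], "calls": []}
    for line in asm_text.splitlines():
        code, _, comment = line.partition(";")
        features["strings"] += STRING_RE.findall(comment)
        for hex_digits in IMMEDIATE_RE.findall(code):
            value = int(hex_digits, 16)
            # Small immediates (stack adjustments, loop counters) are too
            # common to be distinctive, so they are skipped.
            if value >= min_constant:
                features["constants"].append(value)
        for name in CALL_RE.findall(code):
            # Auto-generated names such as sub_401000 carry no source-level meaning.
            if not name.startswith("sub_"):
                features["calls"].append(name)
    return features


SAMPLE = """\
push    offset aConfig   ; "config.ini"
call    ds:CreateFileA
mov     eax, 0EDB88320h
add     esp, 8
call    sub_401000
"""

if __name__ == "__main__":
    print(extract_features(SAMPLE))
```

Each of these features survives compilation more or less intact, which is what makes it usable as a search term at the source-code level.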

SLIDE 10

Methodology (3/4)

  • Processing Engine
  • Query Building for Code Search Engines
  • Encoding HTTP Requests
  • Query Filtering (Removing Special Chars)
  • Parsing and Information Extraction
  • Filenames and URLs
  • Pre-defined Regex Template
  • Online Analysis
  • Search Code Repositories for a close match
  • Specify programming languages as part of Request
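
The processing steps above (filter special characters, encode the request, parse filenames and URLs with a regex template) can be sketched as follows. Everything specific here is an assumption made for illustration: the endpoint `codesearch.example`, the `q` and `lang` parameter names, and the result markup matched by `RESULT_RE` are hypothetical, since each real code search engine defines its own.

```python
import re
from urllib.parse import urlencode

# Hypothetical endpoint; a real engine has its own URL and parameters.
SEARCH_URL = "https://codesearch.example/search"

# Hypothetical result markup, standing in for a pre-defined regex template.
RESULT_RE = re.compile(
    r'<a class="result" href="(?P<url>[^"]+)">(?P<filename>[^<]+)</a>'
)


def clean_term(term):
    """Replace runs of special characters with a single space."""
    return re.sub(r"[^A-Za-z0-9_.]+", " ", term).strip()


def build_query_url(features, language="c"):
    """Build an encoded search URL from extracted features."""
    terms = [clean_term(f) for f in features]
    terms = [t for t in terms if t]  # drop terms that were only special characters
    return SEARCH_URL + "?" + urlencode({"q": " ".join(terms), "lang": language})


def parse_results(html):
    """Extract (filename, url) pairs from a response page."""
    return [(m.group("filename"), m.group("url")) for m in RESULT_RE.finditer(html)]


if __name__ == "__main__":
    print(build_query_url(["CryptAcquireContextA", "%s\\%s.dll", "***"]))
```

Passing the language as a request parameter mirrors the last bullet above: it narrows the search to repositories in the language the binary was likely compiled from.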
SLIDE 11

Methodology (4/4)

  • Offline Analysis
  • Information about function prototypes:
  • Complement Online Analysis Results
  • Lower level analysis for each function
  • Function stack frame analysis
  • Dictionary of low-level system calls (Windows API)
  • A statement for describing the overall functionality
  • Return values, Number and size of arguments
  • Number and size of parameters and type information
  • Rank the results based on type information
  • Output: ASM file with Comments, Analysis Report
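
Two of the offline steps above lend themselves to a short illustration: looking up called APIs in a dictionary to produce a one-line description, and ranking candidate prototypes against the argument sizes observed in the stack frame. This is a sketch under stated assumptions, not the tool's implementation: the dictionary entries, the category labels, and the scoring rule in `rank_candidates` are all hypothetical.

```python
# Tiny illustrative dictionary; the category labels are our own, not the tool's.
API_CATEGORIES = {
    "CreateFileA": "file I/O",
    "ReadFile": "file I/O",
    "WriteFile": "file I/O",
    "LoadLibraryA": "library loading",
    "connect": "network connectivity",
}


def describe_function(called_apis):
    """Summarize a function from the known APIs it calls."""
    categories = sorted({API_CATEGORIES[a] for a in called_apis if a in API_CATEGORIES})
    if not categories:
        return "unknown functionality"
    return "performs " + ", ".join(categories)


def rank_candidates(observed_arg_sizes, candidates):
    """Order (name, arg_sizes) candidates by agreement with the observed frame.

    Assumed scoring: 0 if the argument count differs, otherwise 1 plus the
    number of positions where the argument size matches.
    """
    def score(candidate):
        _, sizes = candidate
        if len(sizes) != len(observed_arg_sizes):
            return 0
        return 1 + sum(1 for a, b in zip(sizes, observed_arg_sizes) if a == b)

    return sorted(candidates, key=score, reverse=True)


if __name__ == "__main__":
    print(describe_function(["CreateFileA", "WriteFile", "memcpy"]))
    print(rank_candidates([4, 4], [("f1", [4]), ("f2", [4, 4]), ("f3", [4, 8])]))
```

For the `sum` example earlier in the deck, the observed frame would hold two 4-byte arguments, so a candidate prototype with two `int` parameters would rank first.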
SLIDE 12

Implementation (1/2)

  • Plug-in for IDA Pro
  • Execution Flow
  • Python 2.7.3, IDAPython 1.5.2, IDA Pro 6.1+
SLIDE 13

Implementation (2/2)

  • Example for query building:
  • Multiple search engine support
  • Interleaving algorithm (Optimizing Time)
  • The results are added as comments in the ASM file
  • for both Online and Offline analysis
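
The slides do not spell out the interleaving algorithm. One plausible reading, shown here purely as an assumption, is a round-robin schedule: consecutive queries go to different search engines, so a per-engine delay between requests does not stall the whole run. The helper name `interleave` is hypothetical; the engine names come from the Motivation slide.

```python
from itertools import cycle


def interleave(queries, engines):
    """Pair each query with an engine in round-robin order (assumed scheme)."""
    return list(zip(queries, cycle(engines)))


if __name__ == "__main__":
    schedule = interleave(["md5_init", "sha1_update", "aes_encrypt"], ["Koders", "Krugle"])
    for query, engine in schedule:
        print(engine, "<-", query)
```

With two engines, each engine sees only every second request, which halves the wait imposed by any single engine's rate limit.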
SLIDE 14

Outline

  • Background
  • Motivation
  • Methodology
  • Case study
  • Conclusion
SLIDE 15

Case Study (1/2)

  • PreciseCalc Project
  • Open source project
  • Hosted on Sourceforge
  • Using the Koders search engine
  • Several full matches found
  • Matches for mathematical functions
SLIDE 16

Case Study (2/2)

  • Malware Analysis
  • Low level APIs matching
  • Offline Analysis proves more useful
  • Gives insight into the potential code output
  • Screenshots
  • Example1: File I/O
  • Example2: Screen Capture
  • Example3: Network Connectivity
  • Example4: Loading Libraries
  • Example5: Services
  • Example6: Low-level Network
SLIDE 17

Example 1. File I/O

SLIDE 18

Example 2. Screen Capture

SLIDE 19

Example 3. Network Connectivity

SLIDE 20

Example 4. Loading Libraries

SLIDE 21

Example 5. Services

SLIDE 22

Example 6. Low-level Network Con.

SLIDE 23

Outline

  • Background
  • Motivation
  • Methodology
  • Case study
  • Conclusion
SLIDE 24

Conclusion

  • Improved on the idea of RE-Google
  • Offline Analysis, Multiple Search Engines
  • Better results handling
  • Automated tool for reverse engineers
  • Malware Analysis
  • Limitation
  • Quality of output depends on the repositories
  • Currently optimized for C/C++
  • Some features may not always be available
  • For validation, we need all source files (CFG)
SLIDE 25

Q&A

  • Thank you.
  • Q&A?