Decompilation, type inference and finding the code to decompile

Alan Mycroft
Computer Laboratory, University of Cambridge
http://www.cl.cam.ac.uk/users/am/
30 January 2012

Structure
• Part 1: What is decompilation and why is it hard?
• Part 2: Type reconstruction in decompilation

Problem: given a binary .EXE, what does it do?
• Run it: and get a virus.
• Run it in a sandbox: better.
• Run it in a program instrumenter (‘dynamic analysis’): even better.
But any form of dynamic analysis under-approximates program behaviour: consider a trojan which only attacks one username and only on a Sunday evening. Running = testing = exploring only some paths.
• Decompile it: rewrite the binary in a high-level language, with the high-level program having exactly the same execution paths as the low-level one. Harder than it sounds (simple cases are easy).

Decompilation: legality
• Isn’t this one of those things which is illegal? Or at best ‘shady’?
• It depends. Recovering lost source code is one legitimate use, and the US and EU permit decompilation for interoperability. Still, always a ‘vaguely suspect’ activity.
• New reason: Stuxnet, Duqu. Sophisticated malware written in high-level code.

Decompilation: techniques
• Not always possible. A program may read in some code and branch to it, or use various other assembler-level tricks such as updating a return address. This is not a problem for ‘dynamic binary translation’ (DBT) tools, but these effectively use dynamic analysis.
• Always trivially possible: just prepend an x86 interpreter, written in your favourite high-level language, to the .EXE file. A cheating solution.
• In practice we need to make some assumptions . . .

Decompilation: functionality vs beauty
Functionality: “if we decompile foo.exe to foo.c then recompiling to foo2.exe has the same I/O behaviour as foo.exe”. This is safety, which requires any analysis to over-estimate behaviour.
Beauty: “the code is readable to humans” (most of the rest of this talk).
While there’s not obviously a conflict, functionality means we must include all possible executions, which include some a human might wish to ignore . . .

Decompilation: functionality vs beauty (2)
    int f(int *p)
    {
        p[read()] += 1;   // might increment the return address
        return 0;
    }
    int main()
    {
        int r, v[10];
        putvaluesin(v);
        r = f(v);         // f always returns zero.
        r++;              // perhaps "inc eax" [one byte]
        print(r);
    }
Might this program print 0? What if we only had the assembler code version?
We can’t decompile back to the above code, because the compiler (or its options) might differ on recompilation, changing the stack offset between the array v and f’s return address.
For safety we might have to assume that almost every indirect write might overwrite a return address (adding many un-beautiful lines).

Decompilation: functionality vs beauty (3)
    f:    pushl %ebp
          movl  %esp, %ebp
          pushl %ebx
          subl  $4, %esp
          movl  8(%ebp), %ebx
          call  read
          ;;;;; here eax=-7 hits f’s return address
          incl  (%ebx,%eax,4)
          addl  $4, %esp
          xorl  %eax, %eax
          popl  %ebx
          popl  %ebp
          ret
    main: pushl %ebp
          movl  %esp, %ebp
          andl  $-16, %esp
          pushl %ebx
          subl  $76, %esp
          leal  24(%esp), %ebx
          movl  %ebx, (%esp)
          call  putvaluesin
          movl  %ebx, (%esp)
          call  f
          incl  %eax
          movl  %eax, (%esp)
          call  print

Decompiling .EXE
Needs a pipeline:
• obtain machine code: not always easy if a packer is used, e.g. a self-extracting archive
• obtain assembler code: often a choice between readable assembly and missing some execution path
• obtain high-level code (reconstruct loops, high-level expressions, types, even classes): again a choice between readable source and missing some behaviours.
The first part of the sub-pipeline here is partitioning the code into procedures, e.g. is a branch between two sections of assembler just a branch, or actually an optimised tailcall?

Economic argument
Decompilation can easily give a false impression of safety, as it can miss malware-style attacks such as buffer overflow.
However, even richly funded malware (e.g. Stuxnet) suffers from the “it’s not cost-effective to write everything in machine code” argument, with the result that much of it admits simple decompilation techniques.
So, while malware will often contain “zero-day attacks” written in carefully crafted C or assembler, much of the high-level logic (both in malware and non-malware) will be written in “C which means C”.

Analogy to testing and verification
• Running in a sandbox, or DBT, is like testing.
• Can use ‘coverage’ metrics to help identify non-executed paths.
• Safe decompilation is like verification: we consider all paths.
• When disassembling/decompiling for human readability we may ignore some paths (e.g. by making assumptions about the possible destinations of indirect branches). This is verification subject to assumptions of various run-time invariants.
• Determining whether some paths are feasible is a least-fixed-point problem. E.g. virtual calls can only be determined as targeting a particular destination if we can resolve an alias, which is only resolvable if we know the virtual calls only target expected destinations . . .
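The least-fixed-point remark can be illustrated with a much simpler problem than the virtual-call/alias dependency on the slide: plain reachability over a control-flow graph. The sketch below starts from the smallest set (only the entry block) and adds successors until nothing changes. The adjacency-matrix representation, `MAX_BLOCKS` and the name `reachable_blocks` are assumptions made for this illustration, not something from the talk.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_BLOCKS 8  /* hypothetical bound on basic blocks, for the sketch only */

/* succ[i][j] is true when block i may branch to block j (assumed to be given).
   On return, reached[j] is true exactly for blocks reachable from entry. */
void reachable_blocks(bool succ[MAX_BLOCKS][MAX_BLOCKS],
                      int n, int entry, bool reached[MAX_BLOCKS])
{
    assert(n > 0 && n <= MAX_BLOCKS && entry >= 0 && entry < n);
    memset(reached, 0, MAX_BLOCKS * sizeof reached[0]);
    reached[entry] = true;          /* start from the smallest candidate set */

    bool changed = true;
    while (changed) {               /* iterate until a fixed point is hit */
        changed = false;
        for (int i = 0; i < n; i++) {
            if (!reached[i])
                continue;
            for (int j = 0; j < n; j++) {
                if (succ[i][j] && !reached[j]) {
                    reached[j] = true;
                    changed = true;
                }
            }
        }
    }
}
```

Because the set only ever grows from the entry block, a block that is the target of no feasible branch is never added; that is the "least" in least fixed point. In a real decompiler the `succ` relation would itself depend on the analysis result (for indirect branches), which is where the circularity on the slide comes from.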

Decompilation: which high-level language?
• Since assembler code is type-unsafe, we probably need a type-unsafe language to express things.
• However, if we’ve already given up on some things (e.g. we’re assuming no wild writes change return addresses) then perhaps we are willing to consider only programs with type-sensible data flow?
• If we’re decompiling type-safe assembler code (e.g. JVM) we can safely decompile to a type-safe high-level language.
• However, we may still need to recreate abstract data types whose interface has been compiled away (e.g. generics in Java or ADTs).

Functionality and Beauty (partly) reconciled
Could in principle decompile assembler to C which is then compiled with safe-C style checks.
• Whenever there is a potential missed behaviour in the generated C (e.g. index out of bounds) then detect this at run-time and refine the decompilation.
• Doesn’t work for spotting trojan malware which attempts to stay hidden unless some carefully crafted condition holds.
E.g. Akritidis’ PhD work on cheap run-time checks for C misbehaviour.
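As a rough illustration of what a safe-C style check on the earlier `p[read()] += 1` might look like, the sketch below carries the array length alongside the pointer and refuses out-of-bounds writes. The `checked_array` type and `checked_increment` function are hypothetical names invented here; this is not the mechanism from Akritidis’ work, only a minimal bounds check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical "fat pointer": base address plus element count. */
typedef struct {
    int   *base;
    size_t len;
} checked_array;

/* Checked version of p[i] += 1. Returns false, and writes nothing, when the
   index is out of bounds, i.e. when the generated C would have missed a
   behaviour of the original binary. */
bool checked_increment(checked_array a, long i)
{
    assert(a.base != NULL);
    if (i < 0 || (size_t)i >= a.len)
        return false;   /* signal: the decompilation needs refining */
    a.base[i] += 1;
    return true;
}
```

A `false` result is the run-time signal that the readable decompilation was too optimistic at this point. As the slide notes, a trojan that never triggers the check during observation stays undetected.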

The interpreter problem
What if one carefully decompiles a program and finds out that the .EXE consists of an interpreter (e.g. for some bytecode) which does decompile nicely, followed by another layer of code in some mysterious language?
• Start again at the next level.
• Issues if encryption is added.
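To make the situation concrete, here is a toy stack-machine interpreter of the kind a decompiler might recover cleanly. The opcode set and the function `run` are invented for this sketch. The point is that the recovered C says nothing about what the program does; that logic lives in the `code` array, which to the decompiler is just data.

```c
#include <assert.h>
#include <stddef.h>

enum { OP_PUSH, OP_ADD, OP_MUL, OP_HALT };  /* hypothetical opcodes */

#define STACK_MAX 16

/* Runs the bytecode and returns the top of the stack at OP_HALT.
   Returns -1 on malformed code (ambiguous with a real -1 result; acceptable
   for a sketch). */
int run(const int *code, size_t n)
{
    int stack[STACK_MAX];
    size_t sp = 0;   /* number of values on the stack */
    size_t pc = 0;   /* index of the next bytecode word */

    assert(code != NULL);
    while (pc < n) {
        switch (code[pc++]) {
        case OP_PUSH:
            if (pc >= n || sp >= STACK_MAX) return -1;
            stack[sp++] = code[pc++];
            break;
        case OP_ADD:
            if (sp < 2) return -1;
            sp--;
            stack[sp - 1] += stack[sp];
            break;
        case OP_MUL:
            if (sp < 2) return -1;
            sp--;
            stack[sp - 1] *= stack[sp];
            break;
        case OP_HALT:
            return sp > 0 ? stack[sp - 1] : -1;
        default:
            return -1;
        }
    }
    return -1;  /* ran off the end without OP_HALT */
}
```

For example, the bytecode `{OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_PUSH, 4, OP_MUL, OP_HALT}` computes (2 + 3) * 4. Understanding it requires a second decompilation pass, this time over the bytecode.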

Obfuscation to counter-attack decompilers
There are various ways to make code hard to decompile. One (Lokhmotov’s masters thesis) is:
• flatten a general CFG into a loop containing a dispatch to all the basic blocks in the CFG, which then branch back to the main loop.
• the dispatcher uses a new variable representing the PC within the original CFG.
• can be strengthened by using a one-way hash function on the state.
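The flattening idea can be sketched on a small loop. `sum_to` is the original function and `sum_to_flat` is a hand-flattened version in which every basic block becomes a `switch` case and a new variable `pc` records the position in the original CFG. Both functions and the block numbering are illustrative assumptions, not taken from the thesis, and the one-way-hash strengthening is not shown.

```c
#include <assert.h>

/* Original: an ordinary structured loop. */
int sum_to(int n)
{
    int s = 0;
    for (int i = 1; i <= n; i++)
        s += i;
    return s;
}

/* Flattened: one dispatch loop, each basic block sets the next pc. */
int sum_to_flat(int n)
{
    int s = 0, i = 1;
    int pc = 0;                 /* new variable: PC within the original CFG */
    for (;;) {
        switch (pc) {
        case 0:                 /* loop test */
            pc = (i <= n) ? 1 : 2;
            break;
        case 1:                 /* loop body, then back to the test */
            s += i;
            i++;
            pc = 0;
            break;
        case 2:                 /* exit block */
            return s;
        default:                /* pc only ever takes the values 0, 1, 2 */
            assert(0 && "unreachable");
            return -1;
        }
    }
}
```

In the flattened form every block appears to be a possible successor of every other, so control-flow reconstruction has to recover the values `pc` can take before it can rebuild the original loop.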

The decompilation pipeline
Input: assembler code
Output: high-level code (e.g. C)
• Partition code into procedures (may need code duplication). Need estimates of targets of indirect branches/calls.
• Reconstruct control-flow (e.g. Cifuentes’ work). Irreducible CFG (perhaps produced by compiler optimisation) may need fixing up.
• Transform to SSA form. Undoes register allocation etc.
• Use dataflow analysis to reconstruct high-level expressions. Note C order-of-evaluation issues with f() + g() versus let x = f() in x + g() versus let y = g() in f() + y.
• Generate high-level types, add casts if needed.
These tasks are largely independent, apart from the first.
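The order-of-evaluation point in the expression-reconstruction step can be shown with two functions that share state. The definitions of `f`, `g` and `counter` below are made up for the illustration. In C the operands of `+` may be evaluated in either order, so a decompiler that folds two sequenced calls back into `f() + g()` may change behaviour.

```c
#include <assert.h>

static int counter = 0;  /* shared state makes the call order observable */

int f(void) { counter += 1;  return counter; }
int g(void) { counter *= 10; return counter; }

/* let x = f() in x + g(): f is forced to run first. */
int f_then_g(void)
{
    counter = 1;
    int x = f();
    return x + g();
}

/* let y = g() in f() + y: g is forced to run first. */
int g_then_f(void)
{
    counter = 1;
    int y = g();
    return f() + y;
}

/* f() + g(): C leaves the order unspecified, so either result is allowed. */
int unordered(void)
{
    counter = 1;
    int r = f() + g();
    assert(r == f_then_g() || r == g_then_f());
    return r;
}
```

If the assembler code calls `f` before `g`, only the `f_then_g` form is a faithful reconstruction; the prettier `f() + g()` is correct only when the two calls cannot interfere.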
