SLIDE 1

UNIVERSITY OF

CAMBRIDGE

Decompilation, type inference and finding the code to decompile

Alan Mycroft Computer Laboratory, University of Cambridge

http://www.cl.cam.ac.uk/users/am/ 30 January 2012

Decompilation, type inference and finding code 1 30 January 2012

SLIDE 2

Structure

  • Part 1: What is decompilation and why is it hard?
  • Part 2: Type reconstruction in decompilation

SLIDE 3

Problem: given a binary .EXE what does it do?

  • Run it: and get a virus
  • Run it in a sandbox: better
  • Run it in a program instrumenter (‘dynamic analysis’): even better

But any form of dynamic analysis under-approximates program behaviour—consider a trojan which only attacks one username and only on a Sunday evening. Running = testing = only explore some paths.

  • Decompile it: re-write the binary in a high-level language, with the high-level program having exactly the same execution paths as the low-level one. Harder than it sounds (simple cases easy).
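
To make the under-approximation point concrete, here is a minimal sketch of the kind of trigger described above. The function name `should_attack`, the username "alice" and the 18:00 threshold are hypothetical choices for illustration only; none of them come from the talk.

```c
#include <stdbool.h>
#include <string.h>
#include <time.h>

/* Hypothetical trigger: true only for one user on a Sunday evening.
   The username and the hour threshold are illustrative assumptions. */
bool should_attack(const char *username, const struct tm *now)
{
    if (strcmp(username, "alice") != 0)
        return false;              /* wrong user: stay dormant */
    if (now->tm_wday != 0)         /* tm_wday counts days since Sunday */
        return false;
    return now->tm_hour >= 18;     /* evening only */
}
```

A sandbox run by any other user, or on any other day, never reaches the payload branch, so a dynamic analysis sees nothing unusual. A decompilation that preserves every path would show the comparison itself.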

SLIDE 4

Decompilation—legality

  • Isn’t this one of those things which is illegal? Or at best ‘shady’?
  • Depends. Recovering lost source code is one legitimate use, and the US and EU permit decompilation for interoperability. Always a ‘vaguely suspect’ activity.
  • New reason: Stuxnet, Duqu. Sophisticated malware written in high-level code.

SLIDE 5

Decompilation—techniques

  • Not always possible. A program may read in some code and branch to it, or use various other assembler-level tricks such as updating a return address. Not a problem for ‘dynamic binary translation’ (DBT) tools, but these effectively use dynamic analysis.
  • Always trivially possible: just prepend an x86 interpreter in your favourite high-level language to the .EXE file. A cheating solution.
  • In practice we need to make some assumptions . . .

SLIDE 6

Decompilation—functionality vs beauty

Functionality: “if we decompile foo.exe to foo.c then recompiling to foo2.exe has the same I/O behaviour as foo.exe”. This is a safety property: it requires any analysis to over-estimate behaviour.

Beauty: “the code is readable to humans” (most of the rest of this talk).

While there’s not obviously a conflict, functionality means we must include all possible executions, which include some a human might wish to ignore . . .

SLIDE 7

Decompilation—functionality vs beauty (2)

    int f(int *p)
    {
        p[read()] += 1;   // might increment the return address
        return 0;
    }

    int main()
    {
        int r, v[10];
        putvaluesin(v);
        r = f(v);         // f always returns zero.
        r++;              // perhaps "inc eax" [one byte]
        print(r);
    }

Might this program print 0? (Yes: if the write in f adds one to f’s return address, the one-byte increment of r is skipped.) What if we only had the assembler code version? We can’t decompile back to the above code, because the compiler (or options) might differ (stack offset between v and the return address). For safety we might have to assume that almost every indirect write might overwrite a return address (adding many un-beautiful lines).

SLIDE 8

Decompilation—functionality vs beauty (3)

    f:      pushl   %ebp
            movl    %esp, %ebp
            pushl   %ebx
            subl    $4, %esp
            movl    8(%ebp), %ebx
            call    read
            ;;;;; here eax=-7 hits f’s return address
            incl    (%ebx,%eax,4)
            addl    $4, %esp
            xorl    %eax, %eax
            popl    %ebx
            popl    %ebp
            ret

    main:   pushl   %ebp
            movl    %esp, %ebp
            andl    $-16, %esp
            pushl   %ebx
            subl    $76, %esp
            leal    24(%esp), %ebx
            movl    %ebx, (%esp)
            call    putvaluesin
            movl    %ebx, (%esp)
            call    f
            incl    %eax
            movl    %eax, (%esp)
            call    print

SLIDE 9

Decompiling .EXE

Needs pipeline:

  • obtain machine code: not always easy if a packer is used, e.g. a self-extracting archive
  • obtain assembler code: often a choice between readable assembly and missing some execution path
  • obtain high-level code (reconstruct loops, high-level expressions, types, even classes): again a choice between readable source and missing some behaviours. The first part of the sub-pipeline here is partitioning the code into procedures, e.g. is a branch between two sections of assembler just a branch, or actually an optimised tailcall?

SLIDE 10

Economic argument

Decompilation can easily give a false impression of safety as it can miss malware-style attacks such as buffer overflow.

However, even richly funded malware (e.g. Stuxnet) suffers from the “it’s not cost-effective to write everything in machine code” argument, with the result that much of it admits simple decompilation techniques.

So, while malware will often contain “zero-day attacks” written in carefully crafted C or assembler, much of the high-level logic (both in malware and non-malware) will be written in “C which means C”.

SLIDE 11

Analogy to testing and verification

  • Running in a sandbox, or DBT, is like testing.
  • Can use ‘coverage’ metrics to help identify non-executed paths.
  • Safe decompilation is like verification: we consider all paths.
  • When disassembling/decompiling for human readability we may ignore some paths (e.g. by making assumptions about the possible destinations of indirect branches). This is verification subject to assumptions of various run-time invariants.
  • Determining whether some paths are feasible is a least-fixed-point problem. E.g. virtual calls can only be determined as targeting a particular destination if we can resolve an alias, which is only resolvable if we know the virtual calls only target expected destinations . . .
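
The least-fixed-point flavour can be sketched as a worklist computation: start from the entry block and keep adding blocks until nothing changes. This is only an illustration. The tiny CFG representation (`struct cfg`, `MAX_BLOCKS`, at most two successors per block) is an assumption made for the sketch, and it leaves out the alias-analysis feedback that makes the real problem circular.

```c
#include <stdbool.h>

#define MAX_BLOCKS 16   /* hypothetical limit for the sketch */

/* Hypothetical CFG: each block has up to two successors, -1 meaning none. */
struct cfg {
    int nblocks;
    int succ[MAX_BLOCKS][2];
};

/* Compute the least set of blocks that contains `entry` and is closed
   under successors. Fills reach[] and returns the number of blocks found. */
int reachable_blocks(const struct cfg *g, int entry, bool reach[MAX_BLOCKS])
{
    int worklist[MAX_BLOCKS];
    int top = 0, count = 0;

    for (int i = 0; i < MAX_BLOCKS; i++)
        reach[i] = false;

    reach[entry] = true;
    worklist[top++] = entry;

    while (top > 0) {
        int b = worklist[--top];
        count++;
        for (int k = 0; k < 2; k++) {
            int s = g->succ[b][k];
            if (s >= 0 && s < g->nblocks && !reach[s]) {
                reach[s] = true;      /* the set only ever grows */
                worklist[top++] = s;  /* each block is pushed at most once */
            }
        }
    }
    return count;
}
```

In a decompiler the successor edges of indirect branches are themselves results of the analysis, so the edge set and the reachable set would have to be iterated together until both stabilise.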

SLIDE 12

Decompilation—which high-level language?

  • since assembler code is type-unsafe, we probably need a type-unsafe language to express things.
  • however if we’ve already given up on some things (e.g. we’re assuming no wild writes change return addresses) then perhaps we are willing to only consider programs with type-sensible data flow?
  • if we’re decompiling type-safe assembler code (e.g. JVM) we can safely decompile to a type-safe high-level language.
  • however, may still need to recreate abstract data types whose interface has been compiled away (e.g. generics in Java or ADTs).

SLIDE 13

Functionality and Beauty (partly) reconciled

Could in principle decompile assembler to C which is then compiled with safe-C style checks.

  • Whenever there is a potential missed behaviour in the generated C (e.g. index out of bounds), detect this at run-time and refine the decompilation.
  • Doesn’t work for spotting trojan malware which attempts to stay hidden unless some carefully crafted condition holds.

E.g. Akritidis’ PhD work on cheap run-time checks for C mis-behaviour.

SLIDE 14

The interpreter problem

What if one carefully decompiles a program and finds out that the .EXE consists of an interpreter (e.g. for some bytecode) which does decompile nicely, followed by another layer of code in some mysterious language?

  • Start again at the next level
  • Issues if encryption is added.

SLIDE 15

Obfuscation to counter-attack decompilers

There are various ways to make code hard to decompile. One (Lokhmotov’s masters thesis) is:

  • flatten a general CFG into a loop containing a dispatch to all the basic blocks in the CFG, which then branch back to the main loop.
  • dispatcher uses a new variable representing the PC within the original CFG.
  • can be strengthened by using a one-way hash function on the state.
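
A minimal sketch of the flattening idea, applied by hand to a small summation loop. Both functions and the block numbering are illustrative assumptions rather than anything taken from the thesis; the variable `pc` plays the role of the new program-counter variable.

```c
/* Original structured code: sum of 1..n. */
int sum_plain(int n)
{
    int s = 0;
    for (int i = 1; i <= n; i++)
        s += i;
    return s;
}

/* The same function with a flattened CFG: every basic block becomes a
   case of one dispatch loop, and `pc` records which block runs next. */
int sum_flat(int n)
{
    int s = 0, i = 0;
    int pc = 0;                                  /* PC within the original CFG */
    for (;;) {
        switch (pc) {
        case 0: s = 0; i = 1;  pc = 1; break;    /* entry block */
        case 1: pc = (i <= n) ? 2 : 3; break;    /* loop test */
        case 2: s += i; i++;   pc = 1; break;    /* loop body */
        case 3: return s;                        /* exit block */
        }
    }
}
```

After flattening, the loop structure is no longer visible in the control-flow graph; it survives only in the values assigned to `pc`, which is what a decompiler would have to track to recover it.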

SLIDE 16

The decompilation pipeline

Input: assembler code
Output: high-level code (e.g. C)

  • Partition code into procedures (may need code duplication). Need estimates of targets of indirect branches/calls.
  • Reconstruct control-flow (e.g. Cifuentes’ work). Irreducible CFG (perhaps produced by compiler optimisation) may need fixing up.
  • Transform to SSA form. Undoes register allocation etc.
  • Use dataflow analysis to reconstruct high-level expressions. Note C order-of-evaluation issues with f() + g() versus let x = f() in x + g() versus let y = g() in f() + y.
  • Generate high-level types, add casts if needed.

These tasks are largely independent, apart from the first.
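
The order-of-evaluation point can be sketched in C itself. The operands of `+` are evaluated in an unspecified order, so a decompiler that folds two calls into one expression may change observable behaviour when the calls have side effects. The functions and the `trace` buffer below are hypothetical, written only to make the call order visible.

```c
#include <stddef.h>

char trace[8];        /* records the order in which f and g ran */
size_t trace_len;

int f(void) { trace[trace_len++] = 'f'; return 1; }
int g(void) { trace[trace_len++] = 'g'; return 2; }

void reset_trace(void) { trace_len = 0; }

/* Order unspecified by the C standard: either call may happen first. */
int sum_unordered(void) { reset_trace(); return f() + g(); }

/* "let x = f() in x + g()": f is forced to run before g. */
int sum_f_first(void) { reset_trace(); int x = f(); return x + g(); }

/* "let y = g() in f() + y": g is forced to run before f. */
int sum_g_first(void) { reset_trace(); int y = g(); return f() + y; }
```

All three return the same value here, but only the two variants with an explicit temporary pin down the contents of `trace`. A decompiler that wants to preserve the assembler's call order therefore has to keep the temporary.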

SLIDE 17

Compilation and decompilation are the same

  • Compilation: producing good (efficient) assembler from HL code.
  • Decompilation: producing good (readable) HL code from assembler.

Both should be semantically correct. Many techniques are common:

  • dataflow analysis, code-transformation (replacing code with equivalent code, e.g. common-subexpression elimination, common-subexpression creation).
  • SSA: separates two uses of a user variable into two separate variables for optimisation. Also separates two uses of a register into two separate registers for optimisation/type inference. (SSA and register allocation are adjoints/inverses)

SLIDE 18

Part 2: Type reconstruction in decompilation

  • Translate various assembler languages into RTL (essentially 3-address code)
  • Transform to Static Single Assignment (SSA) form
  • Give each instruction operand a C-level type.
  • Unify all types given to a single SSA register
  • If circular clash (occurs check) invent a recursive struct
  • If types clash otherwise invent a cast or a union
  • Structs of Arrays and Arrays of structs
  • Miscellany: Gandhe’s minimal set of fixes to unification failure, GUI-human driver, how does it work in practice?

SLIDE 19

Intuitive example—straight line code

    f:      ld.w 4[r0],r0
            mul  r0,r0,r0
            xor  r0,r1,r0
            ret

ri are argument and result registers. Naive decompilation:

    int f(int r0, int r1)
    {
        r0 = *(int *)(r0+4);
        r0 = r0 * r0;
        r0 = r1 ^ r0;
        return r0;
    }

Separate live ranges:

    int f(int r0, int r1)
    {
        int r0a = *(int *)(r0+4);
        int r0b = r0a * r0a;
        int r0c = r1 ^ r0b;
        return r0c;
    }

SLIDE 20

Intuitive example—straight line code (2)

    int f(int r0, int r1)
    {
        int r0a = *(int *)(r0+4);
        int r0b = r0a * r0a;
        int r0c = r1 ^ r0b;
        return r0c;
    }

Infer type for r0:

    int f(int *r0, int r1)
    {
        int r0a = *(r0+1);
        int r0b = r0a * r0a;
        int r0c = r1 ^ r0b;
        return r0c;
    }

or even

    int f(int *r0, int r1)
    {
        int r0a = r0[1];
        return r1 ^ (r0a*r0a);
    }

SLIDE 21

Inventing structs: iterative sum in C

    struct A { int hd; struct A *tl; };

    int f(struct A *x)
    {
        int r = 0;
        for (; x!=0; x = x->tl)
            r += x->hd;
        return r;
    }

SLIDE 22

Inventing structs: compiled iterative sum

    f:      mov  #0,r1
            cmp  #0,r0
            beq  L4F2
    L3F2:   ld.w 0[r0],r2
            add  r2,r1,r1
            ld.w 4[r0],r0
            cmp  #0,r0
            bne  L3F2
    L4F2:   mov  r1,r0
            ret

SLIDE 23

Iterative sum in SSA form with generated type constraints

    f:                              tf = t0 → t99
            mov  r0,r0a             t0 = t0a
            mov  #0,r1a             t1a = int ∨ t1a = ptr(α1)
            cmp  #0,r0a             t0a = int ∨ t0a = ptr(α2)
            beq  L4F2

SLIDE 24

    L3F2:   mov  φ(r0a,r0c),r0b     t0b = t0a, t0b = t0c
            mov  φ(r1a,r1c),r1b     t1b = t1a, t1b = t1c
            ld.w 0[r0b],r2a         t0b = ptr(0 : t2a)
            add  r2a,r1b,r1c        t2a = ptr(α3), t1b = int, t1c = ptr(α3) ∨
                                    t2a = int, t1b = ptr(α4), t1c = ptr(α4) ∨
                                    t2a = int, t1b = int, t1c = int
            ld.w 4[r0b],r0c         t0b = ptr(4 : t0c)
            cmp  #0,r0c             t0c = int ∨ t0c = ptr(α5)
            bne  L3F2
    L4F2:   mov  φ(r1a,r1c),r1d     t1d = t1a, t1d = t1c
            mov  r1d,r0d            t0d = t1d
            ret                     t99 = t0d

SLIDE 25

Constraint solution

Occurs-check constraint failure:

    t0c = t0b = ptr(mem(4 : t0c)) = ptr(mem(0 : t2a))

I.e.

    t0c = ptr(mem(0 : t2a, 4 : t0c))

Break cycle with struct (more polymorphic than C!)

    struct G { t2a m0; t0c m4; ...}

i.e.

    t0c = ptr(mem(0 : t2a, 4 : t0c)) = ptr(struct G)
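
The occurs-check failure can be reproduced with a very small unifier. This is a sketch under simplifying assumptions, not the algorithm from the talk: type terms are only variables, `int` and `ptr(t)`, with no `mem(offset : type)` records and no disjunctive constraints, and all names (`struct type`, `resolve`, `occurs`, `unify`) are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical type terms: a type variable, int, or ptr(t). */
enum kind { T_VAR, T_INT, T_PTR };

struct type {
    enum kind kind;
    struct type *link;  /* T_VAR: binding (NULL if unbound); T_PTR: pointee */
};

/* Follow variable bindings to the term a variable currently stands for. */
struct type *resolve(struct type *t)
{
    while (t->kind == T_VAR && t->link != NULL)
        t = t->link;
    return t;
}

/* Occurs check: does the unbound variable v appear inside term t? */
bool occurs(struct type *v, struct type *t)
{
    t = resolve(t);
    if (t == v)
        return true;
    return t->kind == T_PTR && occurs(v, t->link);
}

/* Unify a and b; returns false on a clash or an occurs-check failure. */
bool unify(struct type *a, struct type *b)
{
    a = resolve(a);
    b = resolve(b);
    if (a == b)
        return true;
    if (a->kind == T_VAR) {
        if (occurs(a, b))
            return false;   /* cyclic type: where a struct gets invented */
        a->link = b;
        return true;
    }
    if (b->kind == T_VAR)
        return unify(b, a);
    if (a->kind != b->kind)
        return false;       /* int against ptr: where a cast or union goes */
    if (a->kind == T_PTR)
        return unify(a->link, b->link);
    return true;            /* int against int */
}
```

Unifying a variable `t` with `ptr(t)` fails the occurs check, which is the simplified analogue of t0c = ptr(mem(4 : t0c)). A type-reconstruction tool would respond to that failure by naming the cycle as a struct rather than rejecting the program.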

SLIDE 26

Solving gives two solutions:

    t0 = t0a = t0b = t0c = ptr(struct G)
    t99 = t1a = t1b = t1c = t1d = t2a = t0d = int
    tf = ptr(struct G) → int

and

    t0 = t0a = t0b = t0c = ptr(struct G)
    t2a = int
    t99 = t1a = t1b = t1c = t1d = t0d = ptr(α4)
    tf = ptr(struct G) → ptr(α4)

SLIDE 27

Parasitic solution

    char *f(struct A *x)
    {
        char *r = 0;
        for (; x!=0; x = x->tl)
            r += x->hd;
        return r;
    }

Compiles to same code. Not strictly ANSI conformant (undefined behaviour).

SLIDE 28

When structs cannot resolve type conflicts (1)

    h:      ld.w 4[r0],r1
            xor  r1,r0,r0
            ret

Type clash for r0. Union-based solution:

    int h(union {int i; int *p;} x)
    {
        return x.p[1] ^ x.i;
    }

SLIDE 29

When structs cannot resolve type conflicts (2)

    h:      ld.w 4[r0],r1
            xor  r1,r0,r0
            ret

Cast-based alternatives:

    int h1(int x) { return *(int *)(x+4) ^ x; }

    int h2(int *x) { return x[1] ^ (int)x; }

    struct h3arg { Tpad1 m0; int m; Tpad2 m8; };
    int h3(struct h3arg *x) { return x->m ^ (int)x; }

We prefer option h3 by default, leaving array creation to be triggered by non-constant indexing (or user GUI interaction). Type h3arg can be reconciled with callers.

SLIDE 30

Arrays versus structs

Trigger array synthesis when indexing instructions occur, whether they be manifest:

    ld.w (r5)[r0],r3

or more indirectly coded:

    add  r5,r0,r1
    ld.w 0[r1],r3
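
In C terms, the distinction is between an access at a constant offset, which reads naturally as a struct field, and an access at a computed offset, which reads naturally as an array element. The sketch below is illustrative only: the layout of `struct rec` and all function names are assumptions chosen for the example.

```c
#include <stddef.h>

/* Hypothetical record with a field at constant byte offset 48. */
struct rec {
    char m0;
    char pad1[47];
    char m48;
};

/* Constant offset: the kind of load that suggests a struct field. */
char field_access(const struct rec *r) { return r->m48; }

/* Computed offset: the kind of load that suggests an array element. */
int indexed_access(const int *base, int i) { return base[i]; }

/* Byte offset of m48, to compare against the offset seen in a load. */
size_t m48_offset(void) { return offsetof(struct rec, m48); }
```

A compiler would typically turn `field_access` into a load with a fixed displacement and `indexed_access` into a load whose address depends on a register, which is the signal used above to trigger array synthesis.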

SLIDE 31

Arrays versus structs—example

    ld.b 0[r0],r1
    ld.b 48[r0],r2
    ld.w (r5)[r0],r3

Can yield

    union G {
        struct { char m0; char pad1[47]; char m48; } u1;
        int u2[];
    } *r0;

Inferring limits for arrays requires, in general, techniques beyond those presented here. (Another fixed-point problem, can’t infer bounds until we know that writing beyond a bound (buffer-overflow attack) doesn’t overwrite things like the bound.) Interactive decompilation driven with a GUI?

SLIDE 32

More recent work

  • Reps’ Codesurfer
  • Guilfanov’s IDA Pro (Interactive DisAssembler), supported by Hex-rays.com, now with some decompilation power.

  • Khoo’s Renaissance project

SLIDE 33

Khoo’s Renaissance project

  • More ambitious type analysis, based on flow-dependent types.
  • Produces many more user-appropriate types when decompiling by propagating types seeded by those determined by interactions with the Windows API.
  • Claim of up to 80% (of open-source and malware) variables being typed, almost double the amount achieved by the Hex-Rays default algorithm.

Implemented as a 20kLoc plug-in to IDA Pro.

SLIDE 34

Structure

  • Part 1: What is decompilation and why is it hard?
  • Part 2: Type reconstruction in decompilation

The end
