

SLIDE 1

Decompilation is an information-flow problem
(Or, information flow meets program transformation)

Boris Feigin, Computer Laboratory, University of Cambridge

PLID 2008; joint work with Alan Mycroft

1 / 22

SLIDE 2

Motivation

“Given suitable tools we can present the [cryptographic] key as a constant in the computation which is carried out using that key and then we can optimise the code given that constant. This will cause the key to be intimately intertwined with the code which uses it.”

Playing ‘Hide and Seek’ with Stored Keys, Shamir and van Someren (1999)

2 / 22

SLIDE 3

Typical source and target languages

v ∈ Value = Z        r ∈ Register = {r0, r1, . . . , r31}

while-language (source):

    e ::= v | x | op e1, . . . , en
    c ::= x := e | skip | c0; c1 | if e then c0 else c1 | while e do c

RISC assembly (target):

    ι ::= movi rd, v | mov rd, rs | ld rd, [rs] | st [rd], rs
        | op rd, r1, . . . , rn | jz r, l | jnz r, l | nop | ι0; ι1

3 / 22
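As a concrete (hypothetical) rendering of the source grammar, the expression fragment can be encoded and evaluated in a few lines of Python. This is a sketch only; the names `Val`, `Var`, `Op` and the two primitive operators are mine, not the paper's:

```python
from dataclasses import dataclass

# Hypothetical encoding of the while-language expression grammar:
#   e ::= v | x | op e1, ..., en

@dataclass(frozen=True)
class Val:
    v: int          # v ∈ Value = Z

@dataclass(frozen=True)
class Var:
    x: str          # program variable

@dataclass(frozen=True)
class Op:
    name: str       # primitive operator, e.g. 'add' or 'mul'
    args: tuple     # operand expressions e1, ..., en

PRIMS = {'add': lambda a, b: a + b, 'mul': lambda a, b: a * b}

def eval_expr(e, env):
    """Evaluate an expression under an environment mapping names to values."""
    if isinstance(e, Val):
        return e.v
    if isinstance(e, Var):
        return env[e.x]
    return PRIMS[e.name](*(eval_expr(a, env) for a in e.args))
```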


SLIDE 5

Definitions

◮ C(−) is a compiler from source language S to target language T.

◮ The observational equivalence relations of S and T are (respectively) ∼S and ∼T.

◮ Decompilation recovers a source program semantically equivalent to the original: D(−) is a decompiler iff D(C(e)) ∈ [e]∼S. This is the weakest possible definition of decompilation.

◮ In certain cases there is a trivial solution for D(−): emit an interpreter for T, written in S, incorporating the text of the program (in T) to be decompiled.

◮ How well can a decompiler do in principle?

◮ In other words, how much information about the source program can be inferred from the output of the compiler?

4 / 22

SLIDE 6

Example

C(“x := 42”) = C(“y := 42; x := y”) = C(“z := 6; y := 7; x := z × y”) = “movi r0, 42”

C(−) does constant folding, constant propagation, etc.

5 / 22
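A sketch of how a constant-folding, constant-propagating compiler collapses all three sources to the same target instruction (the immediate load, written `movi` in the target grammar). The function `compile_stmts` and its statement encoding are invented for illustration, not taken from the paper:

```python
def compile_stmts(stmts):
    """Toy compiler: constant propagation + constant folding over
    straight-line assignments, then emit code for the final value of 'x'.
    Each statement is (dst, expr) where expr is an int literal, a variable
    name (a copy), or ('*', a, b) for multiplication."""
    env = {}
    for dst, expr in stmts:
        if isinstance(expr, tuple):            # fold: operands are constants by now
            _, a, b = expr
            env[dst] = env.get(a, a) * env.get(b, b)
        elif isinstance(expr, str):            # propagate a copied variable
            env[dst] = env[expr]
        else:                                  # integer literal
            env[dst] = expr
    return f"movi r0, {env['x']}"
```

All three source programs from the slide land in the same output, so a decompiler seeing only `movi r0, 42` cannot tell them apart.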

SLIDE 7

Program equivalence

◮ ≡ (“bit-for-bit” equality of programs):

e ≡ e′ ⇐⇒ strcmp(e, e′) == 0

◮ ∼α (α-equivalence)

6 / 22


SLIDE 9

Program equivalence

◮ Recall: two expressions are contextually equivalent (e ∼ e′) whenever

e ∼ e′ ⇐⇒ ∀Ctx[−]. Ctx[e] ≅ Ctx[e′]

where Ctx[−] ranges over contexts of the language and ≅ is some observation (say, convergence).

◮ Restriction to programs (d ranges over inputs):

e ∼ e′ ⇐⇒ ∀d ∈ D. [[e]](d) = [[e′]](d)

7 / 22
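For programs over a finite input domain, the restricted equivalence is directly checkable by enumeration. A minimal sketch, with programs modelled as plain Python functions:

```python
def equivalent_on(D, e1, e2):
    """e ~ e'  iff  [[e]](d) = [[e']](d) for every input d in the
    (finite) domain D: the restriction-to-programs reading of equivalence."""
    return all(e1(d) == e2(d) for d in D)

# Two syntactically different programs denoting the same function ...
assert equivalent_on(range(100), lambda d: d * 2, lambda d: d + d)
# ... while these are distinguished by some input, hence inequivalent.
assert not equivalent_on(range(100), lambda d: d * 2, lambda d: d * 3)
```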

SLIDE 10

Example: size_t strlen(const char *str)

Two implementations:

    const char *s = str;
    while (*s)
        s++;
    return (s - str);

    size_t len = 0;
    for (; str[len]; len++)
        ;
    return len;

8 / 22



SLIDE 13

Intuition

Define the relation f⁻¹(Q), the kernel of f w.r.t. Q (Clark et al., 2005):

x f⁻¹(Q) x′ ⇐⇒ (f x) Q (f x′)

E.g. “x := 42” C⁻¹(≡) “y := 42; x := y”

Programs compiled by “less normalizing” compilers are more susceptible to decompilation. We tend to have:

∼α ⊂ C1⁻¹(≡) ⊂ C2⁻¹(≡) ⊂ · · · ⊂ Cn⁻¹(≡) ⊂ ∼S

where C1(−) to Cn(−) are progressively more optimizing compilers.

9 / 22
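The kernel construction is a one-liner when relations and functions are ordinary Python values. The "compiler" in the demo below is deliberately extreme, full evaluation, so that its kernel coincides with semantic equivalence on arithmetic programs; it is a sketch of the construction, not any particular C(−) from the paper:

```python
def kernel(f, Q):
    """f^-1(Q): x related to x' iff (f x) Q (f x')  (Clark et al., 2005)."""
    return lambda x, x2: Q(f(x), f(x2))

# An extreme "compiler" that normalizes an arithmetic program all the way
# down to its value; its kernel then relates any two programs whose results
# are equal, i.e. the whole semantic equivalence class.
normalize = lambda src: eval(src, {"__builtins__": {}})
related = kernel(normalize, lambda a, b: a == b)

assert related("6 * 7", "40 + 2")       # same value: same equivalence class
assert not related("6 * 7", "6 + 7")    # 42 vs 13: distinguishable outputs
```

A less normalizing `f` (say, the identity on program text) yields a finer kernel, illustrating the chain of inclusions above.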


SLIDE 15

Compiler correctness

C(−) is fully abstract (Abadi, 1998) iff

e ∼S e′ ⇐⇒ C(e) ∼T C(e′) (1)

Abadi observes that the forward implication “means that the translation does not introduce information leaks”.

10 / 22

SLIDE 16

Non-interference

e ∼S e′ ⇒ C(e) ∼T C(e′) (2)

Zero information flow (from high-security inputs to low-security outputs) for a program M:

σ ∼low σ′ ⇒ [[M]](σ) ≈ [[M]](σ′) (3)

where two states are equivalent up to ∼low when their low-security parts are equal.

11 / 22
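Equation (3) can be checked exhaustively for small state spaces. In this sketch a state is modelled as a (high, low) pair and the observation is simply the program's result; all names are mine:

```python
def non_interfering(M, highs, lows):
    """Check equation (3) by enumeration: whenever two states agree on
    their low-security parts, the observable result must agree too."""
    return all(M(h1, l) == M(h2, l)
               for l in lows for h1 in highs for h2 in highs)

# The low output depends only on the low input: no flow from high to low.
assert non_interfering(lambda h, l: l + 1, range(4), range(4))
# Returning the high input leaks it completely, so (3) fails.
assert not non_interfering(lambda h, l: h, range(4), range(4))
```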


SLIDE 18

Relating non-interference and software protection

Let P and Q be binary relations over domains D and E respectively. Then, given f : D → E, say that f : P ⇒ Q whenever

∀x, x′ ∈ D. x P x′ ⇒ (f x) Q (f x′)

The correspondence is explicit:

[[M]](−) : ∼low ⇒ ≈
C(−) : ∼S ⇒ ∼T

The substitution {C / [[M]], ∼S / ∼low, ∼T / ≈} unifies the equations nicely.

12 / 22
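The relation-mapping judgement f : P ⇒ Q is likewise testable by enumeration over a finite domain. A sketch with invented names, using a toy pair of relations rather than ∼S and ∼T:

```python
def maps_relation(f, P, Q, D):
    """f : P => Q  iff  for all x, x' in D:  x P x'  implies  (f x) Q (f x')."""
    return all(Q(f(x), f(y)) for x in D for y in D if P(x, y))

same_magnitude = lambda x, y: abs(x) == abs(y)
equal = lambda x, y: x == y

# abs maps "same magnitude" to plain equality ...
assert maps_relation(abs, same_magnitude, equal, range(-3, 4))
# ... but the identity does not (e.g. -1 vs 1 have equal magnitude).
assert not maps_relation(lambda x: x, same_magnitude, equal, range(-3, 4))
```

Instantiating P with ∼low and Q with ≈ gives non-interference; instantiating with ∼S and ∼T gives the compiler property, which is exactly the correspondence on the slide.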


SLIDE 20

Parallels

◮ Programs are secret (high-security) inputs. Compiled binaries are the public (low-security) outputs (≡).

◮ Attackers attempt to infer (as much as possible about) the inputs from the outputs. (Decompilation.)

Caveat: in practice, the goal of decompilation is to recover any readable source program.

13 / 22


SLIDE 22

Secure information flow for compilers?

We would like to have zero information flow compilers: C(−) : ∼S ⇒ ≡

◮ Relational reading: C(−) may leak only the equivalence class of its input programs.

◮ C(−) must be perfectly optimizing (undecidable for Turing-complete languages).

◮ Though, cf. superoptimization (Massalin, 1987).

14 / 22
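Superoptimization makes the "perfectly optimizing" idea concrete on tiny instruction sets: exhaustively search for the shortest program with the desired behaviour. A minimal brute-force sketch in the spirit of Massalin (1987); the two-instruction set, the accumulator model, and all names are invented:

```python
from itertools import product

def superoptimize(target, ops, max_len, tests):
    """Enumerate instruction sequences by increasing length and return the
    first one agreeing with `target` on all test inputs.
    ops maps an instruction name to a unary function on the accumulator."""
    for n in range(1, max_len + 1):
        for seq in product(ops, repeat=n):
            def run(x, seq=seq):
                for name in seq:
                    x = ops[name](x)
                return x
            if all(run(t) == target(t) for t in tests):
                return seq
    return None

ops = {'inc': lambda x: x + 1, 'dbl': lambda x: x * 2}
# x -> 2x + 2 needs only two instructions: increment, then double.
assert superoptimize(lambda x: 2 * x + 2, ops, 3, range(8)) == ('inc', 'dbl')
```

The exhaustive search is exactly why this does not scale: it is feasible for short straight-line sequences, not for Turing-complete languages.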

SLIDE 23

Implications

In general, a compiler must leak more than just the equivalence class of its input programs. We are interested in applying techniques from quantitative information flow to deriving concrete bounds on the leakage. E.g.: the identity “compiler” (λx.x) leaks its input completely.

15 / 22
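One standard quantitative measure bounds what a function reveals by the log of the number of distinguishable outputs. Applied to a "compiler" viewed as a function of its input program, it recovers the slide's extreme case of the identity. A sketch assuming a uniform distribution over a finite set of input programs; the function name is mine:

```python
from math import log2

def leakage_bits(f, inputs):
    """Upper bound (in bits) on what f reveals about its input: log2 of
    the number of distinct outputs over the finite input set."""
    return log2(len({f(x) for x in inputs}))

programs = range(16)   # 16 equally likely "programs", i.e. 4 bits of input
# The identity "compiler" leaks its input completely ...
assert leakage_bits(lambda x: x, programs) == 4.0
# ... while a constant function leaks nothing.
assert leakage_bits(lambda x: 0, programs) == 0.0
```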

SLIDE 24

Possible applications

◮ Randomized compilation and information-flow security for non-deterministic languages (cf. non-deterministic encryption schemes)

◮ Obfuscation (more generally: software protection)

16 / 22



SLIDE 27

Virtualization

Essentially, fast whole-system emulation. Examples: KVM, VMware, Xen, . . . (virtual machine) transparency n. making virtual and native hardware indistinguishable under close scrutiny by a dedicated adversary (Garfinkel et al., 2007) e ∼x86 e′ ⇐ ⇒ [ [vm] ](e) ≈ [ [vm] ](e′)

17 / 22


SLIDE 29

From compilers to interpreters and back again

◮ Partial evaluation:

[[e]](d) = [[sint]](e, d) = [[ [[mix]](sint, e) ]](d)

◮ Non-interference?

e ∼S e′ ⇐⇒ ∀d. [[int]](e, d) ≈ [[int]](e′, d)
e ∼S e′ ⇐⇒ [[mix]](int, e) ≈ [[mix]](int, e′)

18 / 22
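The first Futamura projection can be demonstrated with a deliberately trivial specializer: a `mix` that just closes over its program argument, so the "compiled" result is really interpreter-plus-program-text (the same shape as the trivial decompiler mentioned earlier). Both `sint` and the list-of-stages program encoding are invented for this sketch:

```python
def sint(e, d):
    """A tiny 'source interpreter': the program e is a list of unary stage
    functions, applied to the input d in order."""
    for f in e:
        d = f(d)
    return d

def mix(p, s):
    """A trivial specializer: fix p's first argument, yielding a residual
    one-argument program.  The projection
        [[e]](d) = [[sint]](e, d) = [[ [[mix]](sint, e) ]](d)
    then holds by construction."""
    return lambda d: p(s, d)

prog = [lambda x: x + 1, lambda x: x * 2]   # the "source program" e
compiled = mix(sint, prog)                  # [[mix]](sint, e)
assert compiled(20) == sint(prog, 20) == 42
```

A real `mix` would unfold the interpreter over the program text and emit residual code; this closure-based version shows only the equation, not the speedup.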

SLIDE 30

Overview

◮ Optimizing compilers obey a “non-interference”-like property

◮ Perfect optimization is impossible, so information leaks are inevitable

◮ An information-flow approach to program transformation?

19 / 22

SLIDE 31

Challenges

◮ Probability distributions over programs

◮ Shannon information theory / Kolmogorov complexity / Scott’s information systems

◮ “Real” compilers don’t come with formalized equational theories

20 / 22

SLIDE 32

Related work

◮ Decompilation: Mycroft (1999), Katsumata and Ohori (2001), Ager et al. (2002).

◮ Full abstraction: Mitchell (1993), Abadi (1998), Kennedy (2006).

◮ Reverse engineering by power analysis etc.: Vermoen (2007).

◮ Randomized compilation: Cohen (1993), Forrest et al. (1997).

◮ Nullspace of compilers: Veldhuizen and Lumsdaine (2002).

◮ Obfuscation: Barak et al. (2001), Dalla Preda and Giacobazzi (2005).

◮ Virtual machines and partial evaluation: Feigin and Mycroft (2008).

21 / 22

SLIDE 33

Questions?

22 / 22