Statistical Similarity of Binaries

Statistical Similarity of Binaries
Yaniv David, Nimrod Partush, Eran Yahav
Technion
* The research leading to these results has received funding from the European Union's Seventh Framework Programme (FP7) under grant agreement n° 615688, ERC-COG-PRIME.

Outline • Motivation • Introduction • Inspiration • Our Approach • Evaluation • Summary • Questions? • Demo

Motivation
• Network time protocol (ntpd), found in:
  • RedHat's Linux distribution
  • Apple's OSX
  • [...]'s 5900x switches
(table source: https://queue.acm.org/detail.cfm?id=2717320)

Challenge: Finding Similar Procedures
  Heartbleed, gcc v.4.9 -O3:
    shr eax, 8
    lea r14d, [r12+13h]
    mov r13, rbx
    lea rcx, [r13+3]
    mov [r13+1], al
    mov [r13+2], r12b
    mov rdi, rcx
  ≈ ?
  Heartbleed, clang v.3.5 -O3:
    mov r9, 13h
    mov r12, rbx
    add rbp, 3
    mov rsi, rbp
    lea rdi, [r12+3]
    mov [r12+2], bl
    lea r13d, [rcx+r9]
    shr eax, 8

Semantic Similarity Wish List
• Precise: avoid false positives
• Flexible: find similarities across
  • Different compiler versions
  • Different compiler vendors
  • Different versions of the same code
• Limitation: use only the stripped binary form

[Figure: a target image t and two query images q1 and q2. Images courtesy of Irani et al.]

Similarity by Composition
• Irani et al. [2006]
• image1 is similar to image2 if you can compose image1 from the segments of image2
• Segments can be transformed
  • rotated, scaled, translated
• Segments of (statistical) significance give more evidence
  • a black background should count for much less

Statistical Similarity of Binaries
• Apply the same idea to binaries: procedures are similar if they share functionally equivalent, non-trivial segments
  • equivalent = allow for semantics-preserving transformations: register allocation, instruction selection, etc.
  • non-trivial = account for the statistical significance of each segment
  t: Heartbleed, gcc v.4.9 -O3
    shr eax, 8
    lea r14d, [r12+13h]
    mov r13, rbx
    lea rcx, [r13+3]
    mov [r13+1], al
    mov [r13+2], r12b
    mov rdi, rcx
  q1: Coreutils, gcc v.4.9 -O3 (less similar to t)
    mov rsi, 14h
    mov rdi, rcx
    shr eax, 8
    mov ecx, r13
    add esi, 1h
    xor ebx, ebx
    test eax, eax
    jl short loc_22F4
  q2: Heartbleed, clang v.3.5 -O3 (similar to t)
    mov r9, 13h
    mov r12, rbx
    add rbp, 3
    mov rsi, rbp
    lea rdi, [r12+3]
    mov [r12+2], bl
    lea r13d, [rcx+r9]
    shr eax, 8

Similarity of Binaries: 3 Step Recipe
1. Decomposition: split each procedure (e.g. Heartbleed, gcc v.4.9 -O3) into small comparable segments
2. Pairwise Semantic Similarity: compare segments across binaries, e.g. "mov r13, rbx; lea rcx, [r13+3]" (gcc) ≈ ? "mov r12, rbx; lea rdi, [r12+3]" (clang)
3. Statistical Similarity Evidence: weigh each match against how common the segment is in the CORPUS (e.g. "shr eax, 8" appears in many procedures)

Step 1 - Procedure Decomposition
• Decompose the procedure into comparable units
• Strand: the set of instructions required to compute a certain variable's value
• Get all strands by applying program slicing at the basic-block level until all variables are covered

Step 1 - Procedure Decomposition
  Heartbleed, gcc v.4.9 -O3:
    shr eax, 8
    lea r14d, [r12+13h]
    mov r13, rbx
    lea rcx, [r13+3]
    mov [r13+1], al
    mov [r13+2], r12b
    mov rdi, rcx
  BAP + Smack + Slice
  Strand 3:
    v1 = rbx
    r13 = v1
    v2 = r13 + 3
    v3 = int_to_ptr(v2)
    rcx = v3
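The decomposition step can be sketched as a backward slice over one basic block. This is a minimal illustration, not the authors' BAP + Smack pipeline: the `(defined, used, text)` instruction encoding and the function name `extract_strands` are assumptions made for the example, and the two IR lines for `shr eax, 8` are a hypothetical lifting.

```python
from typing import List, Set, Tuple

# Hypothetical encoding: (defined variable, variables it reads, printable text).
Inst = Tuple[str, Set[str], str]


def extract_strands(block: List[Inst]) -> List[List[str]]:
    """Slice a basic block backward until every instruction is in some strand."""
    unused = set(range(len(block)))
    strands = []
    while unused:
        last = max(unused)              # start from the last uncovered instruction
        unused.discard(last)
        _, uses, text = block[last]
        needed = set(uses)              # variables whose definitions we still need
        strand = [text]
        for i in range(last - 1, -1, -1):
            defined, used, inst_text = block[i]
            if defined in needed:
                strand.append(inst_text)
                needed.discard(defined)  # the nearest definition is the one that reaches
                needed |= used
                unused.discard(i)
        strand.reverse()                # restore program order
        strands.append(strand)
    return strands


block = [
    ("v4", {"eax"}, "v4 = eax >> 8"),   # hypothetical lifting of: shr eax, 8
    ("eax", {"v4"}, "eax = v4"),
    ("v1", {"rbx"}, "v1 = rbx"),        # Strand 3 from the slide
    ("r13", {"v1"}, "r13 = v1"),
    ("v2", {"r13"}, "v2 = r13 + 3"),
    ("v3", {"v2"}, "v3 = int_to_ptr(v2)"),
    ("rcx", {"v3"}, "rcx = v3"),
]
strands = extract_strands(block)
for strand in strands:
    print(strand)
```

Each pass starts from the last instruction not yet covered, so the loop ends once every instruction belongs to at least one strand.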

Step 2 – Pairwise Semantic Similarity
  Heartbleed, gcc v.4.9 -O3, Strand 3:     Heartbleed, clang v.3.5 -O3, Strand 3:
    v1 = rbx                                 v1 = rbx
    r13 = v1                                 r12 = v1
    v2 = r13 + 3                             v2 = r12 + 3
    v3 = int_to_ptr(v2)                      v3 = int_to_ptr(v2)
    rcx = v3                                 rdi = v3
  ≈ ? Sure!

Step 2 – Pairwise Semantic Similarity
  Heartbleed, gcc v.4.9 -O3, Strand 6:     Heartbleed, clang v.3.5 -O3, Strand 11:
    v1 = 13h                                 v1 = r12
    r9 = v1                                  v2 = 13h + v1
    v2 = rbx                                 v3 = int_to_ptr(v2)
    v3 = v2 + v1                             r14 = v3
    v4 = int_to_ptr(v3)                      v4 = 18h
    r13 = v4                                 rsi = v4
    v5 = v1 + 5                              v5 = v4 + v3
    rsi = v5                                 rax = v5
    v6 = v5 + v4
    rax = v6
  ≈ ? :(

Step 2 – Pairwise Semantic Similarity
  assume r12_q == rbx_t
  Target strand (t):                       Query strand (q):
    v1_t = 13h                               v1_q = r12_q
    r9_t = v1_t                              v2_q = 13h + v1_q
    v2_t = rbx_t                             v3_q = int_to_ptr(v2_q)
    v3_t = v2_t + v1_t                       r14_q = v3_q
    v4_t = int_to_ptr(v3_t)                  v4_q = 18h
    r13_t = v4_t                             rsi_q = v4_q
    v5_t = v1_t + 5                          v5_q = v4_q + v3_q
    rsi_t = v5_t                             rax_q = v5_q
    v6_t = v5_t + v4_t
    rax_t = v6_t
  assert v1_q == v2_t, v2_q == v3_t, v3_q == v4_t, r14_q == r13_t, v4_q == v5_t, rsi_q == rsi_t, v5_q == v6_t, rax_q == rax_t
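The slide phrases the comparison as an assume/assert query for a program verifier. As a rough stand-in, the sketch below swaps in randomized testing: it runs the two Strand 3 variants on the same random 64-bit inputs (the "assume" part) and records which variables always agree (the "assert" part). Testing can only suggest equivalence, not prove it. The function names are hypothetical, and `int_to_ptr` is modeled as the identity, which is an assumption.

```python
import random

MASK = (1 << 64) - 1  # model 64-bit registers


def strand_gcc(rbx):
    v1 = rbx
    r13 = v1
    v2 = (r13 + 3) & MASK
    v3 = v2            # int_to_ptr modeled as identity (assumption)
    rcx = v3
    return {"v1": v1, "r13": r13, "v2": v2, "v3": v3, "rcx": rcx}


def strand_clang(rbx):
    v1 = rbx
    r12 = v1
    v2 = (r12 + 3) & MASK
    v3 = v2            # int_to_ptr modeled as identity (assumption)
    rdi = v3
    return {"v1": v1, "r12": r12, "v2": v2, "v3": v3, "rdi": rdi}


def matched_variables(query, target, trials=200, seed=0):
    """Map each query variable to the target variables equal to it on every trial."""
    rng = random.Random(seed)
    inputs = [rng.getrandbits(64) for _ in range(trials)]
    q_runs = [query(x) for x in inputs]    # same input fed to both strands
    t_runs = [target(x) for x in inputs]
    return {
        q_var: [t_var for t_var in t_runs[0]
                if all(q[q_var] == t[t_var] for q, t in zip(q_runs, t_runs))]
        for q_var in q_runs[0]
    }


matches = matched_variables(strand_gcc, strand_clang)
print(matches)
```

Every variable of the gcc strand finds a counterpart in the clang strand here, which corresponds to the "Sure!" case; a strand pair like Strand 6 and Strand 11 needs the right input assumption before any variables line up.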

Step 2 - Quantify Semantic Similarity
● Use the percentage of variables from $s_q$ (a strand of the query $q$) that have an equivalent counterpart in $s_t$ (a strand of the target $t$) to define probabilities
  ○ $\mathrm{VCP}(s_q, s_t)$ = Variable Containment Proportion
  ○ under the best matching
  ○ an asymmetric measure
  ○ $\Pr(s_q \mid s_t)$ = probability that $s_q$ is input-output equivalent to $s_t$
  ○ sigmoid function over VCP
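A minimal sketch of how VCP and the sigmoid could fit together, assuming the variable matching has already been computed for one fixed input correspondence. The function names, the steepness `k`, and the midpoint are illustrative assumptions; the slide only says "sigmoid function over VCP".

```python
import math


def vcp(matches):
    """Fraction of query-strand variables with at least one equivalent target variable."""
    if not matches:
        return 0.0
    return sum(1 for counterparts in matches.values() if counterparts) / len(matches)


def pr_from_vcp(v, k=10.0, midpoint=0.5):
    """Squash a VCP value into (0, 1); k and midpoint are assumed, not from the slides."""
    return 1.0 / (1.0 + math.exp(-k * (v - midpoint)))


# Hypothetical matching: three of the four query variables have a counterpart.
example_matches = {"v1": ["v1"], "v2": ["v2"], "v3": [], "v4": ["v5"]}
score = vcp(example_matches)
print(score, pr_from_vcp(score))
```

The measure is asymmetric because the denominator counts only the query strand's variables: a short strand can be fully contained in a long one without the reverse holding.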

Step 3 – Statistical Evidence
● Define a Local Evidence Score (LES) to quantify the statistical significance of matching each strand
● $\Pr(s_q \mid H_0)$: the probability of randomly finding a match for $s_q$ "in the wild"
  ○ Average value of $\Pr(s_q \mid s_t)$ over the entire corpus, $s_t \in t \in \mathrm{Corpus}$
$$\mathrm{LES}(s_q \mid t) = \log \frac{\max_{s_t \in t} \Pr(s_q \mid s_t)}{\Pr(s_q \mid H_0)}$$
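A short worked example may help show why the denominator matters. The probabilities below are made up for illustration, and the natural logarithm is assumed. Two strands match the target equally well, but only the first is rare in the corpus:

```latex
\begin{align*}
\text{distinctive strand:}\quad
  & \max_{s_t \in t} \Pr(s_q \mid s_t) = 0.9,\ \Pr(s_q \mid H_0) = 0.09
  \;\Rightarrow\; \mathrm{LES}(s_q \mid t) = \log\frac{0.9}{0.09} = \log 10 \approx 2.30 \\
\text{common strand:}\quad
  & \max_{s_t \in t} \Pr(s_q \mid s_t) = 0.9,\ \Pr(s_q \mid H_0) = 0.9
  \;\Rightarrow\; \mathrm{LES}(s_q \mid t) = \log\frac{0.9}{0.9} = \log 1 = 0
\end{align*}
```

A strand that matches almost everything in the corpus contributes no evidence, much like the black background in the image analogy.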

Step 3 - Global Similarity
• Procedures are similar if one can be composed using non-trivial, significantly similar parts of the other
$$\mathrm{GES}(q \mid t) = \sum_{s_q \in q} \mathrm{LES}(s_q \mid t)$$
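The local and global scores can be sketched together as follows. This is an illustration under stated assumptions: strands are plain strings, `toy_pr` is a hypothetical stand-in for the sigmoid-over-VCP probability (0.9 for identical strands, 0.1 otherwise), and the two-procedure corpus is invented for the example.

```python
import math
from typing import Callable, List

Procedure = List[str]  # a procedure is modeled as a list of strands


def les(s_q: str, target: Procedure, corpus: List[Procedure],
        pr: Callable[[str, str], float]) -> float:
    """Local Evidence Score: best match in the target vs. average match in the corpus."""
    best = max(pr(s_q, s_t) for s_t in target)
    all_strands = [s_t for proc in corpus for s_t in proc]
    pr_h0 = sum(pr(s_q, s_t) for s_t in all_strands) / len(all_strands)
    return math.log(best / pr_h0)


def ges(query: Procedure, target: Procedure, corpus: List[Procedure],
        pr: Callable[[str, str], float]) -> float:
    """Global Evidence Score: sum of the local scores of all query strands."""
    return sum(les(s_q, target, corpus, pr) for s_q in query)


def toy_pr(s_q: str, s_t: str) -> float:
    # Hypothetical probability model; never returns 0, so the log is defined.
    return 0.9 if s_q == s_t else 0.1


target_a = ["A", "B", "common"]
target_b = ["C", "D", "common"]
corpus = [target_a, target_b]
query = ["A", "B", "common"]

print(ges(query, target_a, corpus, toy_pr))
print(ges(query, target_b, corpus, toy_pr))
```

The strand `"common"` appears in both corpus procedures, so its local score is lower than that of `"A"` even when both match the target exactly; the query therefore scores well against `target_a` and poorly against `target_b`.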

Evaluation
● Corpus
  ○ Real-world code packages
    ○ open-ssl, bash, qemu, wget, ws-snmp, ffmpeg, coreutils
  ○ 1500 procedures picked at random
  ○ Compiled with clang 3.{4,5}, gcc 4.{6,8,9} and icc {14,15}
  ○ Spanning across product versions
    ○ e.g. openssl-1.0.1{e,f,g}
● Queries
  ○ Focused on vulnerabilities (for motivation's sake)
  ○ Verified with randomly picked procedures

Results
• Low FP rate
  • Crucial to the vulnerability search scenario
  #  Vulnerability     False Positives
  1  Heartbleed        0
  2  Shellshock        3
  3  Venom             0
  4  Clobberin' Time   19
  5  Shellshock #2     0
  6  ws-snmp           1
  7  wget              0
  8  ffmpeg            0
• Previous methods fail in the cross-{version,compiler} scenario or produce too many FPs
  • Due to syntactic or sampling-based methods

Results: Finding Heartbleed [figure]

Results: All v. All comparison [figure]

Summary
• Challenging scenario
  • Finding similarity cross-{compiler, version} in stripped binaries
• Clear motivation
  • Finding vulnerable code, detecting clones, etc.
• A fully semantic approach that is still feasible
  • Accuracy achieved with a statistical framework
• Applied to real-world code
• Publicly available prototype

More Similarity Work in Our Group
• Code Similarity via Natural Language Descriptions (in progress)
• Code Similarity by Composition (PLDI '16)
• Estimating Types in Stripped Binaries (POPL '16)
• Data-Driven Disjunctive Abstraction (VMCAI '16)
• Abstract Differencing via Speculative Correlation (OOPSLA '14)
• Abstract Semantic Differencing (SAS '13)
• Semantic Code Search in Binaries (PLDI '14)
• Abstract domain of symbolic automata (SAS '13)
• Synthesis from partial programs (OOPSLA '12)
Projects: PRIME, TRACY, DIZY, Like2Drops
Links: https://github.com/tech-srl/ and https://www.codota.com
People: Meital Ben Sinai, Yaniv David, Omer Katz, Hila Peleg, Sharon Shoham, Shir Yadid, Eran Yahav

Questions?

Demo www.BinSim.com
