Statistical Similarity of Binaries


Statistical Similarity of Binaries. Yaniv David, Nimrod Partush, Eran Yahav (Technion). * The research leading to these results has received funding from the European Union's Seventh Framework Programme (FP7) under grant agreement n° 615688.


slide-1
SLIDE 1

Statistical Similarity of Binaries

Yaniv David, Nimrod Partush, Eran Yahav Technion

*The research leading to these results has received funding from the European Union's Seventh Framework Programme (FP7) under grant agreement n° 615688, ERC-COG-PRIME.

slide-2
SLIDE 2

Outline

  • Motivation
  • Introduction
  • Inspiration
  • Our Approach
  • Evaluation
  • Summary
  • Questions?
  • Demo
slide-3
SLIDE 3

Motivation

Network time protocol (ntpd)

  • RedHat’s Linux distribution
  • Apple’s OSX
  • […]’s 5900x switches

(table source: https://queue.acm.org/detail.cfm?id=2717320)

slide-4
SLIDE 4

Challenge: Finding Similar Procedures

Heartbleed, gcc v.4.9 -O3:

    shr eax, 8
    lea r14d, [r12+13h]
    mov r13, rbx
    lea rcx, [r13+3]
    mov [r13+1], al
    mov [r13+2], r12b
    mov rdi, rcx

?

Heartbleed, clang v.3.5 -O3:

    mov r9, 13h
    mov r12, rbx
    add rbp, 3
    mov rsi, rbp
    lea rdi, [r12+3]
    mov [r12+2], bl
    lea r13d, [rcx+r9]
    shr eax, 8

slide-5
SLIDE 5

Semantic Similarity Wish List

  • Precise - avoid false positives
  • Flexible - find similarities across
      ○ Different compiler versions
      ○ Different compiler vendors
      ○ Different versions of the same code
  • Limitation: Use only stripped binary form

slide-6
SLIDE 6

[Figure: three images labeled t, q1, and q2. Images courtesy of Irani et al.]

slide-7
SLIDE 7

Similarity by Composition

  • Irani et al. [2006]
  • image1 is similar to image2 if you can compose image1 from the segments of image2
  • Segments can be transformed
      ○ rotated, scaled, translated
  • Segments of (statistical) significance give more evidence
      ○ a black background should count for much less

[Figure labels: similar, less similar]

slide-8
SLIDE 8

Statistical Similarity of Binaries

  • Apply the same idea to binaries: procedures are similar if they share functionally equivalent, non-trivial segments
      ○ equivalent = allow for semantics-preserving transformations: register allocation, instruction selection, etc.
      ○ non-trivial = account for the statistical significance of each segment

t: Heartbleed, gcc v.4.9 -O3

    shr eax, 8
    lea r14d, [r12+13h]
    mov r13, rbx
    lea rcx, [r13+3]
    mov [r13+1], al
    mov [r13+2], r12b
    mov rdi, rcx

q1: Coreutils, gcc v.4.9 -O3

    mov rsi, 14h
    mov rdi, rcx
    shr eax, 8
    mov ecx, r13
    add esi, 1h
    xor ebx, ebx
    test eax, eax
    jl short loc_22F4

q2: Heartbleed, clang v.3.5 -O3

    mov r9, 13h
    mov r12, rbx
    add rbp, 3
    mov rsi, rbp
    lea rdi, [r12+3]
    mov [r12+2], bl
    lea r13d, [rcx+r9]
    shr eax, 8

[Figure labels: similar, less similar]

slide-9
SLIDE 9

Similarity of Binaries: 3 Step Recipe

  • 1. Decomposition
  • 2. Pairwise Semantic Similarity
  • 3. Statistical Similarity Evidence

[Figure: the Heartbleed, gcc v.4.9 -O3 snippet (shr eax, 8; lea r14d, [r12+13h]; mov r13, rbx; lea rcx, [r13+3]; mov [r13+1], al; mov [r13+2], r12b; mov rdi, rcx) is decomposed into pieces such as "mov r13, rbx; lea rcx, [r13+3]" and "shr eax, 8". The first piece is paired (≈?) with "mov r12, rbx; lea rdi, [r12+3]" from Heartbleed, clang v.3.5 -O3; "shr eax, 8" is shown next to the CORPUS.]

slide-10
SLIDE 10

Step 1 - Procedure Decomposition

  • Decompose procedure into comparable units
  • Strand - the set of instructions required to compute a certain variable’s value
  • Get all strands by applying program slicing on the basic-block level until all variables are covered
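The decomposition described on this slide can be sketched as a backward slice inside one basic block. This is a minimal illustration, not the authors' implementation: the `(defined, referenced, text)` instruction tuples, the function name `extract_strands`, and the lifting of `shr eax, 8` to `v4 = eax >> 8` are all assumptions made for the example, and each variable is assumed to be assigned at most once in the block.

```python
from typing import List, Set, Tuple

# Hypothetical IR: (defined variable, referenced variables, printable text).
Inst = Tuple[str, Set[str], str]


def extract_strands(block: List[Inst]) -> List[List[str]]:
    """Slice backwards from the last uncovered instruction until every
    instruction of the basic block belongs to at least one strand."""
    unused = set(range(len(block)))
    strands = []
    while unused:
        last = max(unused)              # last instruction not yet covered
        unused.discard(last)
        _, refs, text = block[last]
        strand = [text]
        needed = set(refs)              # variables the strand still depends on
        for i in range(last - 1, -1, -1):
            dst, srcs, line = block[i]
            if dst in needed:           # this instruction feeds the strand
                strand.append(line)
                needed |= srcs
                unused.discard(i)
        strands.append(list(reversed(strand)))
    return strands


# Assumed lifting of part of the gcc snippet shown in the slides.
block = [
    ("v4", {"eax"}, "v4 = eax >> 8"),
    ("v1", {"rbx"}, "v1 = rbx"),
    ("r13", {"v1"}, "r13 = v1"),
    ("v2", {"r13"}, "v2 = r13 + 3"),
    ("v3", {"v2"}, "v3 = int_to_ptr(v2)"),
    ("rcx", {"v3"}, "rcx = v3"),
]
strands = extract_strands(block)
for s in strands:
    print(s)
```

On this input the slice that ends in `rcx = v3` collects the five instructions that compute `rcx`, and the unrelated shift is left to form a strand of its own.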

slide-11
SLIDE 11

Step 1 - Procedure Decomposition

Heartbleed, gcc v.4.9 -O3:

    shr eax, 8
    lea r14d, [r12+13h]
    mov r13, rbx
    lea rcx, [r13+3]
    mov [r13+1], al
    mov [r13+2], r12b
    mov rdi, rcx

BAP + Smack + Slice

Strand 3:

    v1 = rbx
    r13 = v1
    v2 = r13 + 3
    v3 = int_to_ptr(v2)
    rcx = v3

slide-12
SLIDE 12

Step 2 – Pairwise Semantic Similarity

Heartbleed, gcc v.4.9 -O3, Strand 3:

    v1 = rbx
    r13 = v1
    v2 = r13 + 3
    v3 = int_to_ptr(v2)
    rcx = v3

?

Heartbleed, clang v.3.5 -O3, Strand 3:

    v1 = rbx
    r12 = v1
    v2 = r12 + 3
    v3 = int_to_ptr(v2)
    rdi = v3

Sure!

slide-13
SLIDE 13

Step 2 – Pairwise Semantic Similarity

Heartbleed, gcc v.4.9 -O3, Strand 6:

    v1 = r12
    v2 = 13h + v1
    v3 = int_to_ptr(v2)
    r14 = v3
    v4 = 18h
    rsi = v4
    v5 = v4 + v3
    rax = v5

?

Heartbleed, clang v.3.5 -O3, Strand 11:

    v1 = 13h
    r9 = v1
    v2 = rbx
    v3 = v2 + v1
    v4 = int_to_ptr(v3)
    r13 = v4
    v5 = v1 + 5
    rsi = v5
    v6 = v5 + v4
    rax = v6

:(

slide-14
SLIDE 14

Step 2 – Pairwise Semantic Similarity

Strand q:

    v1q = r12q
    v2q = 13h + v1q
    v3q = int_to_ptr(v2q)
    r14q = v3q
    v4q = 18h
    rsiq = v4q
    v5q = v4q + v3q
    raxq = v5q

Strand t:

    v1t = 13h
    r9t = v1t
    v2t = rbxt
    v3t = v2t + v1t
    v4t = int_to_ptr(v3t)
    r13t = v4t
    v5t = v1t + 5
    rsit = v5t
    v6t = v5t + v4t
    raxt = v6t

assume r12q == rbxt

assert v1q == v2t, v2q == v3t, v3q == v4t, r14q == r13t, v4q == v5t, rsiq == rsit, v5q == v6t, raxq == raxt
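The assume/assert query on this slide is phrased for a program verifier. As a rough illustration only, the sketch below swaps in randomized testing: it runs both strands on random 64-bit inputs under the assumption `r12q == rbxt` and checks the asserted variable pairs. Random testing can refute equivalence but never prove it. Assumptions made here: `int_to_ptr` is modelled as the identity, arithmetic wraps at 64 bits, and the target's third assignment is read as `v3 = v2 + v1`, which is what the asserted pair `v2q == v3t` requires. The function names are hypothetical.

```python
import random

MASK = (1 << 64) - 1  # assumed 64-bit wrap-around arithmetic


def strand_q(r12):
    """The strand with suffix q, with int_to_ptr modelled as identity."""
    v1 = r12
    v2 = (0x13 + v1) & MASK
    v3 = v2
    r14 = v3
    v4 = 0x18
    rsi = v4
    v5 = (v4 + v3) & MASK
    rax = v5
    return {"v1": v1, "v2": v2, "v3": v3, "r14": r14,
            "v4": v4, "rsi": rsi, "v5": v5, "rax": rax}


def strand_t(rbx):
    """The strand with suffix t; v3 is read as v2 + v1 (an assumption)."""
    v1 = 0x13
    r9 = v1
    v2 = rbx
    v3 = (v2 + v1) & MASK
    v4 = v3
    r13 = v4
    v5 = (v1 + 5) & MASK
    rsi = v5
    v6 = (v5 + v4) & MASK
    rax = v6
    return {"v1": v1, "r9": r9, "v2": v2, "v3": v3, "v4": v4,
            "r13": r13, "v5": v5, "rsi": rsi, "v6": v6, "rax": rax}


# Variable pairs taken from the assert on the slide: (name in q, name in t).
PAIRS = [("v1", "v2"), ("v2", "v3"), ("v3", "v4"), ("r14", "r13"),
         ("v4", "v5"), ("rsi", "rsi"), ("v5", "v6"), ("rax", "rax")]


def asserts_hold_on_samples(trials=1000, seed=0):
    rng = random.Random(seed)
    for _ in range(trials):
        x = rng.getrandbits(64)            # assume r12q == rbxt == x
        q, t = strand_q(x), strand_t(x)
        if any(q[a] != t[b] for a, b in PAIRS):
            return False
    return True


print(asserts_hold_on_samples())
```

Every variable of strand q finds a counterpart in strand t under this matching, while `v1t` and `r9t` have no counterpart in q.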

slide-15
SLIDE 15

Step 2 - Quantify Semantic Similarity

  • Use the percentage of variables from $s_q$ that have an equivalent counterpart in $s_t$ to define probabilities
      ○ $\mathrm{VCP}(s_q, s_t)$ = Variable Containment Proportion
          ○ under the best matching
          ○ an asymmetric measure
      ○ $\Pr(s_q \mid s_t)$ = probability that $s_q$ is input-output equivalent to $s_t$
          ○ sigmoid function over VCP
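A small sketch of how the two quantities on this slide could be computed once the best variable matching is known. The slide only says that the probability is a sigmoid over VCP, so the steepness `k` and the `midpoint` below are illustrative assumptions, as are the function names. The counts come from the strand pair in the earlier slides: the 8-variable strand has all 8 variables matched, while the 10-variable strand has only 8 of its variables matched.

```python
import math


def vcp(matched_vars: int, total_vars: int) -> float:
    """Share of one strand's variables that have an equivalent
    counterpart in the other strand (under the best matching)."""
    if total_vars == 0:
        return 0.0
    return matched_vars / total_vars


def pr_equivalent(vcp_value: float, k: float = 10.0, midpoint: float = 0.5) -> float:
    """Sigmoid over VCP; k and midpoint are assumed, not from the slides."""
    return 1.0 / (1.0 + math.exp(-k * (vcp_value - midpoint)))


# Asymmetry: the result depends on which strand's variables are counted.
vcp_small_in_large = vcp(8, 8)    # every variable of the 8-variable strand matched
vcp_large_in_small = vcp(8, 10)   # two variables of the 10-variable strand unmatched

print(vcp_small_in_large, pr_equivalent(vcp_small_in_large))
print(vcp_large_in_small, pr_equivalent(vcp_large_in_small))
```

The sigmoid pushes high containment towards 1 and low containment towards 0, so partial matches contribute little to the later scores.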

slide-16
SLIDE 16

Step 3 – Statistical Evidence

  • Define a Local Evidence Score (LES) to quantify the statistical significance of matching each strand
  • $\Pr(s_q \mid H_0)$ - the probability of randomly finding a match for $s_q$ “in the wild”
      ○ Average value of $\Pr(s_q \mid s_t)$ over the entire corpus, $s_t \in t \in \mathrm{Corpus}$

$$\mathrm{LES}(s_q \mid t) = \log \frac{\max_{s_t \in t} \Pr(s_q \mid s_t)}{\Pr(s_q \mid H_0)}$$
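The local evidence score on this slide (written LES here) divides the best match probability inside the target procedure by the average match probability across the whole corpus, then takes the logarithm. The sketch below assumes the natural logarithm, since the slide does not state a base, and uses made-up probabilities purely for illustration. The function names are hypothetical.

```python
import math
from typing import Sequence


def pr_h0(pr_against_corpus: Sequence[float]) -> float:
    """Average probability of matching the query strand against
    every strand of every procedure in the corpus."""
    return sum(pr_against_corpus) / len(pr_against_corpus)


def les(pr_against_target: Sequence[float], h0: float) -> float:
    """log of (best match inside the target / chance of a match in the wild)."""
    return math.log(max(pr_against_target) / h0)


# Hypothetical probabilities, not measured data.
h0_common = pr_h0([0.9, 0.8, 0.9, 0.8])     # strand matches almost everywhere
h0_rare = pr_h0([0.9, 0.02, 0.03, 0.05])    # strand matches almost nowhere

les_common = les([0.9, 0.3], h0_common)
les_rare = les([0.9, 0.3], h0_rare)
print(les_common, les_rare)
```

With the same best match inside the target, the rare strand scores much higher than the common one, which is how a frequent strand such as a lone shift ends up carrying little evidence.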

slide-17
SLIDE 17

Step 3 - Global Similarity

  • Procedures are similar if one can be composed using non-trivial, significantly similar parts of the other

$$\mathrm{GES}(q \mid t) = \sum_{s_q \in q} \mathrm{LES}(s_q \mid t)$$
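The global score on this slide (written GES here) is the sum, over the strands of the query procedure, of each strand's local evidence score against the target. A minimal sketch for ranking candidate targets follows. It assumes that every target has at least one strand and that the match probabilities and corpus averages were computed beforehand; all numbers and names are hypothetical.

```python
import math
from typing import List


def ges(pr_per_query_strand: List[List[float]], h0_per_query_strand: List[float]) -> float:
    """Sum over query strands of log(best match in target / corpus average)."""
    return sum(
        math.log(max(probs) / h0)
        for probs, h0 in zip(pr_per_query_strand, h0_per_query_strand)
    )


# Two query strands: one rare in the corpus, one common (assumed averages).
h0 = [0.05, 0.9]

# Match probabilities of each query strand against the strands of two targets.
target_a = [[0.95, 0.10], [0.90]]   # contains the rare strand
target_b = [[0.05], [0.95]]         # contains only the common strand

scores = {"target_a": ges(target_a, h0), "target_b": ges(target_b, h0)}
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores, ranking)
```

Matching the rare strand dominates the sum, so the target that shares it ranks first even though both targets match the common strand well.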

slide-18
SLIDE 18

Evaluation

  • Corpus
      ○ Real-world code packages
          ○ open-ssl, bash, qemu, wget, ws-snmp, ffmpeg, coreutils
      ○ 1500 procedures picked at random
      ○ Compiled with clang 3.{4,5}, gcc 4.{6,8,9} and icc {14,15}
      ○ Spanning across product versions
          ○ e.g. openssl-1.0.1{e,f,g}
  • Queries
      ○ Focused on vulnerabilities (for motivation’s sake)
      ○ Verified with randomly picked procedures
slide-19
SLIDE 19

Results

  • Low FP rate
      ○ Crucial to the vulnerability search scenario
  • Previous methods fail at the cross-{version,compiler} scenario or produce too many FPs
      ○ Due to syntactic or sampling-based methods

False Positives   Vulnerability      #
                  Heartbleed         1
3                 Shellshock         2
                  Venom              3
19                Clobberin' Time    4
                  Shellshock #2      5
1                 ws-snmp            6
                  wget               7
                  ffmpeg             8

slide-20
SLIDE 20

Results

Finding Heartbleed

slide-21
SLIDE 21

Results

All v. All comparison

slide-22
SLIDE 22

Summary

  • Challenging scenario
      ○ Finding similarity cross-{compiler, version} in stripped binaries
  • Clear motivation
      ○ Finding vulnerable code, detecting clones, etc.
  • A fully semantic approach that is still feasible
  • Accuracy achieved with statistical framework
  • Applied to real-world code
  • Publicly available prototype
slide-23
SLIDE 23

More Similarity Work in Our Group

  • Code Similarity via Natural Language Descriptions (in progress)
  • Code Similarity by Composition (PLDI’16)
  • Estimating Types in Stripped Binaries (POPL’16)
  • Data-Driven Disjunctive Abstraction (VMCAI’16)
  • Abstract Differencing via Speculative Correlation (OOPSLA’14)
  • Abstract Semantic Differencing (SAS’13)
  • Semantic Code Search in Binaries (PLDI’14)
  • Abstract domain of symbolic automata (SAS’13)
  • Synthesis from partial programs (OOPSLA’12)

https://github.com/tech-srl/
PRIME, TRACY, DIZY, Like2Drops
https://www.codota.com

Sharon Shoham Hila Peleg Shir Yadid Meital Ben Sinai Eran Yahav Omer Katz Yaniv David

slide-24
SLIDE 24

Questions?

slide-25
SLIDE 25

Demo

www.BinSim.com