


  1. Statistical Similarity of Binaries Yaniv David, Nimrod Partush, Eran Yahav Technion * The research leading to these results has received funding from the European Union's Seventh Framework Programme (FP7) under grant agreement n° 615688 – ERC-COG-PRIME.

  2. Outline • Motivation • Introduction • Inspiration • Our Approach • Evaluation • Summary • Questions? • Demo

  3. Motivation [table: the same vulnerable code shipped in many products - the Network Time Protocol daemon (ntpd), RedHat's Linux distribution, Apple's OSX, 5900x switches] (table source: https://queue.acm.org/detail.cfm?id=2717320)

  4. Challenge: Finding Similar Procedures

  Heartbleed, gcc v.4.9 -O3:
    shr eax, 8
    lea r14d, [r12+13h]
    mov r13, rbx
    lea rcx, [r13+3]
    mov [r13+1], al
    mov [r13+2], r12b
    mov rdi, rcx

  ≈ ?

  Heartbleed, clang v.3.5 -O3:
    mov r9, 13h
    mov r12, rbx
    add rbp, 3
    mov rsi, rbp
    lea rdi, [r12+3]
    mov [r12+2], bl
    lea r13d, [rcx+r9]
    shr eax, 8

  5. Semantic Similarity Wish List • Precise - avoid false positives • Flexible - find similarities across • different compiler versions • different compiler vendors • different versions of the same code • Limitation: use only the stripped binary form

  6. [figure: a target image t and two query images q1, q2 - images courtesy of Irani et al.]

  7. Similarity by Composition • Irani et al. [2006] • image 1 is similar to image 2 if you can compose image 1 from the segments of image 2 • segments can be transformed • rotated, scaled, translated • segments of (statistical) significance give more evidence • a black background should count for much less [figure: "less similar" vs. "similar" composition examples]

  8. Statistical Similarity of Binaries • Apply the same idea to binaries: procedures are similar if they share functionally equivalent, non-trivial segments • equivalent = allow for semantics-preserving transformations: register allocation, instruction selection, etc. • non-trivial = account for the statistical significance of each segment [figure: three assembly fragments - t: Heartbleed, gcc v.4.9 -O3; q1: Coreutils, gcc v.4.9 -O3 (less similar); q2: Heartbleed, clang v.3.5 -O3 (similar)]

  9. Similarity of Binaries: 3-Step Recipe • 1. Decomposition - break each procedure (e.g. Heartbleed, gcc v.4.9 -O3) into strands • 2. Pairwise Semantic Similarity - check each query strand against each target strand (e.g. mov r13, rbx / lea rcx, [r13+3] ≈? mov r12, rbx / lea rdi, [r12+3] from Heartbleed, clang v.3.5 -O3) • 3. Statistical Evidence - weigh each matched strand (e.g. shr eax, 8) against a CORPUS

  10. Step 1 - Procedure Decomposition • Decompose procedure into comparable units • Strand - the set of instructions required to compute a certain variable's value • Get all strands by applying program slicing on the basic-block level until all variables are covered
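The decomposition step above can be sketched in a few lines of Python. This is a toy model, not the authors' BAP-based implementation: each instruction is reduced to a (defined-variable, used-variables) pair, and a strand is collected by slicing backwards inside the block from the last not-yet-covered instruction.

```python
def extract_strands(block):
    """Split a basic block into strands: for each not-yet-covered
    instruction (taken last-to-first), collect the backward slice
    inside the block that feeds its computed value."""
    unused = set(range(len(block)))          # instructions not yet in any strand
    strands = []
    while unused:
        start = max(unused)                  # last uncovered instruction
        needed = {block[start][0]}           # variables the slice must define
        strand = []
        for i in range(start, -1, -1):       # walk the block backwards
            dst, srcs = block[i]
            if dst in needed:                # instruction feeds the sliced value
                strand.append(block[i])
                needed.discard(dst)
                needed.update(srcs)          # now we need its inputs too
                unused.discard(i)
        strands.append(list(reversed(strand)))
    return strands

# The gcc-compiled Heartbleed block from the talk, as (def, uses) pairs;
# m1/m2 are invented names standing in for the two memory stores.
block = [
    ("r13", ["rbx"]),            # mov r13, rbx
    ("rcx", ["r13"]),            # lea rcx, [r13+3]
    ("m1",  ["r13", "al"]),      # mov [r13+1], al
    ("m2",  ["r13", "r12b"]),    # mov [r13+2], r12b
    ("rdi", ["rcx"]),            # mov rdi, rcx
]
for s in extract_strands(block):
    print(s)
```

Note that instructions may appear in more than one strand (here, mov r13, rbx feeds all three), which matches the slide's "until all variables are covered" formulation.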

  11. Step 1 - Procedure Decomposition

  Heartbleed, gcc v.4.9 -O3:
    shr eax, 8
    lea r14d, [r12+13h]
    mov r13, rbx
    lea rcx, [r13+3]
    mov [r13+1], al
    mov [r13+2], r12b
    mov rdi, rcx

  BAP + Smack + Slice ->

  Strand 3:
    v1 = rbx
    r13 = v1
    v2 = r13 + 3
    v3 = int_to_ptr(v2)
    rcx = v3

  12. Step 2 - Pairwise Semantic Similarity

  Strand 3 (Heartbleed, gcc v.4.9 -O3):
    v1 = rbx
    r13 = v1
    v2 = r13 + 3
    v3 = int_to_ptr(v2)
    rcx = v3

  ≈ ?

  Strand 3 (Heartbleed, clang v.3.5 -O3):
    v1 = rbx
    r12 = v1
    v2 = r12 + 3
    v3 = int_to_ptr(v2)
    rdi = v3

  Sure!

  13. Step 2 - Pairwise Semantic Similarity

  Strand 6 (Heartbleed, gcc v.4.9 -O3):
    v1 = r12
    v2 = 13h + v1
    v3 = int_to_ptr(v2)
    r14 = v3
    v4 = 18h
    rsi = v4
    v5 = v4 + v3
    rax = v5

  ≈ ?

  Strand 11 (Heartbleed, clang v.3.5 -O3):
    v1 = 13h
    r9 = v1
    v2 = rbx
    v3 = v2 + v1
    v4 = int_to_ptr(v3)
    r13 = v4
    v5 = v1 + 5
    rsi = v5
    v6 = v5 + v4
    rax = v6

  :(

  14. Step 2 - Pairwise Semantic Similarity

  assume r12_q == rbx_t

  Query strand (gcc):
    v1_q = r12_q
    v2_q = 13h + v1_q
    v3_q = int_to_ptr(v2_q)
    r14_q = v3_q
    v4_q = 18h
    rsi_q = v4_q
    v5_q = v4_q + v3_q
    rax_q = v5_q

  Target strand (clang):
    v1_t = 13h
    r9_t = v1_t
    v2_t = rbx_t
    v3_t = v2_t + v1_t
    v4_t = int_to_ptr(v3_t)
    r13_t = v4_t
    v5_t = v1_t + 5
    rsi_t = v5_t
    v6_t = v5_t + v4_t
    rax_t = v6_t

  assert v1_q == v2_t, v2_q == v3_t, v3_q == v4_t, r14_q == r13_t,
         v4_q == v5_t, rsi_q == rsi_t, v5_q == v6_t, rax_q == rax_t
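The talk discharges this assume/assert query with a program verifier. Purely as an illustration, the sketch below re-checks the asserted variable correspondences between the two strands on sampled 64-bit inputs; int_to_ptr is modeled as the identity (an assumption of this sketch) and arithmetic wraps at 64 bits.

```python
import random

M = (1 << 64) - 1  # model 64-bit wraparound

def query_strand(r12):            # Strand 6, gcc v.4.9 -O3
    v1 = r12
    v2 = (0x13 + v1) & M
    v3 = v2                       # int_to_ptr(v2), modeled as identity
    v4 = 0x18
    v5 = (v4 + v3) & M
    return {"v1": v1, "v2": v2, "v3": v3, "v4": v4, "v5": v5,
            "rsi": v4, "rax": v5}

def target_strand(rbx):           # Strand 11, clang v.3.5 -O3
    v1 = 0x13
    v2 = rbx
    v3 = (v2 + v1) & M
    v4 = v3                       # int_to_ptr(v3), modeled as identity
    v5 = (v1 + 5) & M
    v6 = (v5 + v4) & M
    return {"v1": v1, "v2": v2, "v3": v3, "v4": v4, "v5": v5, "v6": v6,
            "rsi": v5, "rax": v6}

def correspondences_hold(x):
    """Check the slide's asserts under the assumption r12_q == rbx_t == x."""
    q, t = query_strand(x), target_strand(x)
    return (q["v1"] == t["v2"] and q["v2"] == t["v3"] and
            q["v3"] == t["v4"] and q["v4"] == t["v5"] and
            q["rsi"] == t["rsi"] and q["v5"] == t["v6"] and
            q["rax"] == t["rax"])

assert all(correspondences_hold(random.getrandbits(64)) for _ in range(1000))
print("sampled correspondences hold")
```

Sampling only gives evidence, not proof; the point of using a verifier in the actual pipeline is that the equivalence holds for all inputs.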

  15. Step 2 - Quantify Semantic Similarity ● Use the percentage of variables from s_q that have an equivalent counterpart in s_t to define probabilities ○ VCP(s_q, s_t) = Variable Containment Proportion ○ under the best matching ○ an asymmetric measure ○ Pr(s_q | s_t) = probability that s_q is input-output equivalent to s_t ○ a sigmoid function over VCP
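The VCP-to-probability step can be sketched in a few lines; the sigmoid's steepness and midpoint below are invented constants for illustration, not the paper's fitted parameters.

```python
import math

def vcp(query_vars, matched_vars):
    """Variable Containment Proportion: the fraction of the query
    strand's variables with an equivalent counterpart in the target
    strand under the best matching. Asymmetric: only the query side's
    variable count is in the denominator."""
    return len(matched_vars) / len(query_vars)

def pr_strand(v, k=10.0, x0=0.5):
    """Turn a VCP value into a probability with a sigmoid; k (steepness)
    and x0 (midpoint) are illustrative guesses."""
    return 1.0 / (1.0 + math.exp(-k * (v - x0)))

# For the slide-14 pair, all 8 of the query strand's variables matched:
q_vars = ["v1", "v2", "v3", "v4", "v5", "r14", "rsi", "rax"]
print(vcp(q_vars, q_vars), pr_strand(vcp(q_vars, q_vars)))
```

A full match (VCP = 1.0) maps to a probability near 1, while a strand with few matched variables maps to a probability near 0, which is what makes the measure usable as evidence.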

  16. Step 3 - Statistical Evidence ● Define a Local Evidence Score (LES) to quantify the statistical significance of matching each strand ● Pr(s_q | H0) - the probability of randomly finding a match for s_q "in the wild" ○ the average value of Pr(s_q | s_t) over the entire corpus

    LES(s_q | t) = log( max_{s_t ∈ t} Pr(s_q | s_t) / Pr(s_q | H0) )
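The Local Evidence Score is a log-likelihood ratio: the best match for the query strand inside the target, divided by the chance of matching that strand at random in the wild. A toy computation (all probabilities invented for illustration):

```python
import math

# Pr(s_q | s_t) for one query strand s_q against each strand of a
# target t, and against the strands of the corpus procedures.
pr_vs_target = [0.05, 0.92, 0.10]
corpus = [[0.04, 0.07], [0.90, 0.03], [0.05]]   # one list per corpus procedure

# Pr(s_q | H0): average Pr(s_q | s_t) over every strand in the corpus,
# i.e. the chance of matching s_q "in the wild".
all_prs = [p for proc in corpus for p in proc]
pr_h0 = sum(all_prs) / len(all_prs)

# LES(s_q | t): positive when the best match in t beats a random match.
les = math.log(max(pr_vs_target) / pr_h0)
print(pr_h0, les)
```

A common strand (one that matches well across the corpus) drives Pr(s_q | H0) up and its LES toward zero, so trivial strands contribute little evidence, as intended.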

  17. Step 3 - Global Similarity • Procedures are similar if one can be composed from the non-trivial, significantly similar parts of the other

    GES(q | t) = Σ_{s_q ∈ q} LES(s_q | t)
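Summing the per-strand evidence scores over the query's strands gives the global score. A minimal sketch, again with invented probabilities:

```python
import math

def ges(query_prs, corpus_prs):
    """Global Evidence Score: the sum of LES(s_q | t) over the query's
    strands. query_prs maps each strand of q to its Pr(s_q | s_t)
    against every strand of the target t; corpus_prs maps it to its
    Pr values over the corpus, averaged to estimate Pr(s_q | H0)."""
    total = 0.0
    for s_q, prs_vs_target in query_prs.items():
        pr_h0 = sum(corpus_prs[s_q]) / len(corpus_prs[s_q])
        total += math.log(max(prs_vs_target) / pr_h0)   # LES(s_q | t)
    return total

# Illustrative numbers: a same-source/other-compiler target matches the
# query's strands far better than an unrelated one.
corpus_prs = {"s1": [0.1, 0.2, 0.1], "s2": [0.05, 0.1, 0.15]}
similar    = {"s1": [0.9, 0.1],  "s2": [0.8, 0.2]}
unrelated  = {"s1": [0.15, 0.1], "s2": [0.1, 0.05]}
print(ges(similar, corpus_prs), ges(unrelated, corpus_prs))
```

Because each term is a log ratio against H0, a procedure composed of significant matches accumulates a high score while incidental matches on common strands contribute almost nothing.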

  18. Evaluation ● Corpus ● Real-world code packages ● open-ssl, bash, qemu, wget, ws-snmp, ffmpeg, coreutils ● 1500 procedures picked at random ● Compiled with clang 3.{4,5}, gcc 4.{6,8,9} and icc {14,15} ● Spanning product versions ● e.g. openssl-1.0.1{e,f,g} ● Queries ● Focused on vulnerabilities (for motivation's sake) ● Verified with randomly picked procedures.

  19. Results • Low FP rate • Crucial to the vulnerability search scenario

    #  Vulnerability    False Positives
    1  Heartbleed       0
    2  Shellshock       3
    3  Venom            0
    4  Clobberin' Time  19
    5  Shellshock #2    0
    6  ws-snmp          1
    7  wget             0
    8  ffmpeg           0

  • Previous methods fail in the cross-{version,compiler} scenario or produce too many FPs • due to syntactic or sampling-based methods.

  20. Results [figure: Finding Heartbleed]

  21. Results [figure: All vs. All comparison]

  22. Summary • Challenging scenario • Finding similarity cross-{compiler, version} in stripped binaries • Clear motivation • Finding vulnerable code, detecting clones, etc. • A fully semantic approach that is still feasible • Accuracy achieved with a statistical framework • Applied to real-world code • Publicly available prototype

  23. More Similarity Work in Our Group
  • Code Similarity via Natural Language Descriptions - in progress
  • Code Similarity by Composition - PLDI '16
  • Estimating Types in Stripped Binaries - POPL '16
  • Data-Driven Disjunctive Abstraction - VMCAI '16
  • Abstract Differencing via Speculative Correlation - OOPSLA '14
  • Abstract Semantic Differencing - SAS '13
  • Semantic Code Search in Binaries - PLDI '14
  • Abstract domain of symbolic automata - SAS '13
  • Synthesis from partial programs - OOPSLA '12
  Tools: PRIME, TRACY, DIZY, Like2Drops (https://github.com/tech-srl/ , https://www.codota.com)
  People: Meital Ben Sinai, Yaniv David, Omer Katz, Hila Peleg, Sharon Shoham, Shir Yadid, Eran Yahav

  24. Questions?

  25. Demo www.BinSim.com
