Tracelet-Based Code Search in Executables, Yaniv David & Eran Yahav (PowerPoint presentation)



slide-1
SLIDE 1

Tracelet-Based Code Search in Executables

Yaniv David & Eran Yahav
Technion, Israel

slide-2
SLIDE 2

Finding vulnerable apps

Where else does this vulnerable function exist?

We can find identical or patched code

int foo() { … // buffer overflow … printf(…) … }
int patchedFoo() { … // buffer overflow … if (…) {} printf(…) … }
int alsoFoo() { … // buffer overflow … printf(…) … }

slide-3
SLIDE 3

What if we don’t have the source code?

slide-4
SLIDE 4

Search in Binaries

binary functions

...
mov [esp+18h+var_18], offset aD1
mov ecx, 1
mov [esp+18h+var_14], ecx
call _printf
...

Function 1: wc, Coreutils 6.12
Function 2: diff, Coreutils 7.15

slide-5
SLIDE 5

Search engine core

Measure Similarity

  • Fast & Scalable
  • Accurate (low false positives)

Similarity score

slide-6
SLIDE 6

Challenge 1: similarity at the binary level

int patchedFoo() { … // buffer overflow … if (…) {} printf(…) … }
int foo() { … // buffer overflow … printf(…) … }

printf(…)@foo():
printf(…)@patchedFoo():

slide-7
SLIDE 7

Challenge 1: similarity at the binary level

foo:
loc_401358:
mov [esp+18h+var_18], offset aD1
mov ecx, 1
mov [esp+18h+var_14], ecx
call _printf

patchedFoo:
loc_401370:
mov [esp+28h+var_28], offset aD1
mov ebx, 1
mov esi, 4
mov [esp+28h+var_24], ebx
call _printf

  • Offsets in memory
  • Register allocation
  • New instruction
slide-8
SLIDE 8

Challenge 2: similarity between different structures

foo's CFG:
loc_401358:
mov [esp+18h+var_18], offset aD1
mov ecx, 1
mov [esp+18h+var_14], ecx
call _printf

patchedFoo's CFG:
loc_401370:
mov [esp+28h+var_28], offset aD1
mov ebx, 1
mov esi, 4
mov [esp+28h+var_24], ebx
call _printf

slide-9
SLIDE 9

In this talk

  • A system for searching code in executables
    – Based on tracelet decomposition of each function
    – Works by solving a set of alignment and dataflow constraints with minimal violations on tracelets
  • An evaluation methodology based on tools from Information Retrieval
    – How do we know that our search engine is good?

slide-10
SLIDE 10

Our Approach

  • Extract tracelets (deal with structural changes)
  • Pair tracelets using alignment and rewrite (deal with the code changes)
  • Similarity score

slide-11
SLIDE 11

Using tracelets to deal with CFG structural changes

A tracelet is a fixed-length sub-trace. For length = 3, in this example we get: (A1,A2,A5), (A1,A3,A5), (A3,A4,A5)

[Figure: foo's CFG with basic blocks A1 to A5, shown once per tracelet; one block contains the code below]

mov [esp+18h+var_18], offset aD1
mov ecx, 1
mov [esp+18h+var_14], ecx
call _printf
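Tracelet extraction can be sketched as enumerating every path of exactly k basic blocks in a CFG. The sketch below is illustrative only: the function name `extract_tracelets` and the edge list in `FOO_CFG` are assumptions (the slides show the block names but not the edges), chosen so that the tracelets named in the talk appear among the results.

```python
from typing import Dict, List, Tuple


def extract_tracelets(cfg: Dict[str, List[str]], k: int) -> List[Tuple[str, ...]]:
    """Return every path of exactly k basic blocks in the CFG."""
    tracelets: List[Tuple[str, ...]] = []

    def walk(path: List[str]) -> None:
        if len(path) == k:
            tracelets.append(tuple(path))
            return
        for successor in cfg.get(path[-1], []):
            walk(path + [successor])

    # A tracelet may start at any block, not only at the function entry.
    for block in cfg:
        walk([block])
    return tracelets


# Hypothetical edges for foo's CFG (assumed, not taken from the slides).
FOO_CFG = {
    "A1": ["A2", "A3"],
    "A2": ["A5"],
    "A3": ["A4", "A5"],
    "A4": ["A5"],
    "A5": [],
}

for tracelet in extract_tracelets(FOO_CFG, 3):
    print(tracelet)
```

Because the path length is fixed, the enumeration terminates even when the CFG contains loops.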

slide-12
SLIDE 12

Using tracelets to calculate similarity between different structures

We need to find the corresponding tracelet

foo's CFG: blocks A1 to A5, including
loc_401358:
mov [esp+18h+var_18], offset aD1
mov ecx, 1
mov [esp+18h+var_14], ecx
call _printf

patchedFoo's CFG: blocks B1 to B7, including
mov [esp+28h+var_28], offset aD1
mov ebx, 1
mov esi, 4
mov [esp+28h+var_24], ebx
call _printf

slide-13
SLIDE 13

Comparing tracelets

foo's tracelet: (A1, A2, A5)
patchedFoo's tracelet: (B1, B2, B7)

  • Graph -> linear code
  • Align & rewrite (RW)
  • Edit distance

A1 (foo), aligned:
(1) mov [esp+18h+var_18], offset aD1
(2) mov ecx, 1
(3) mov [esp+18h+var_14], ecx
(4) call _printf

B1 (patchedFoo), aligned and rewritten:
(1) mov [esp+28h+var_18], offset aD1
(2) mov ecx, 1
(X) mov esi, 4
(3) mov [esp+28h+var_14], ecx
(4) call _printf

slide-14
SLIDE 14

Dealing with code changes: Align

Align tracelets using specialized edit-distance

A1 (foo):
(1) mov [esp+18h+var_18], offset aD1
(2) mov ecx, 1
(3) mov [esp+18h+var_14], ecx
(4) call _printf

B1 (patchedFoo):
(1) mov [esp+28h+var_28], offset aD1
(2) mov ebx, 1
(X) mov esi, 4
(3) mov [esp+28h+var_24], ebx
(4) call _printf
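One way to picture the alignment step is a standard edit-distance table with a traceback over instructions. The sketch below is not the talk's cost model, which the slides do not spell out: the costs (0.25 for operands of the same kind that differ, 1.0 for operands of different kinds, 1.0 per gap) and the helper names are assumptions picked so that the example reproduces the numbering shown on the slide.

```python
from typing import List, Optional, Tuple

GAP = 1.0  # assumed cost of leaving an instruction unaligned


def parse(ins: str) -> Tuple[str, List[str]]:
    mnemonic, _, rest = ins.partition(" ")
    return mnemonic, [op.strip() for op in rest.split(",")] if rest else []


def kind(op: str) -> str:
    if op.startswith("["):
        return "mem"
    if op[0].isdigit() or op.startswith("offset "):
        return "imm"
    return "name"  # registers and symbols


def sub_cost(a: str, b: str) -> float:
    (ma, oa), (mb, ob) = parse(a), parse(b)
    if ma != mb or len(oa) != len(ob):
        return 3.0  # dearer than delete + insert, so never chosen
    return sum(0.0 if x == y else 0.25 if kind(x) == kind(y) else 1.0
               for x, y in zip(oa, ob))


def align(ref: List[str], tgt: List[str]):
    n, m = len(ref), len(tgt)
    dist = [[(i + j) * GAP if i == 0 or j == 0 else 0.0
             for j in range(m + 1)] for i in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dist[i][j] = min(dist[i - 1][j - 1] + sub_cost(ref[i - 1], tgt[j - 1]),
                             dist[i - 1][j] + GAP,
                             dist[i][j - 1] + GAP)
    # Walk back from the corner to recover which instructions were paired.
    pairs: List[Tuple[Optional[int], Optional[int]]] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + sub_cost(ref[i - 1], tgt[j - 1]):
            pairs.append((i - 1, j - 1)); i -= 1; j -= 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + GAP:
            pairs.append((i - 1, None)); i -= 1
        else:
            pairs.append((None, j - 1)); j -= 1
    return dist[n][m], pairs[::-1]


FOO = ["mov [esp+18h+var_18], offset aD1", "mov ecx, 1",
       "mov [esp+18h+var_14], ecx", "call _printf"]
PATCHED = ["mov [esp+28h+var_28], offset aD1", "mov ebx, 1", "mov esi, 4",
           "mov [esp+28h+var_24], ebx", "call _printf"]

distance, pairs = align(FOO, PATCHED)
print(distance, pairs)
```

Under these assumed costs `mov esi, 4` is the instruction left unpaired, matching the (X) marker on the slide.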

slide-15
SLIDE 15

Dealing with code changes: DFA

  • Analyze data flow
  • Record live registers

A1 (foo):
(1) mov [esp+18h+var_18], offset aD1
(2) mov ecx, 1
(3) mov [esp+18h+var_14], ecx
(4) call _printf

B1 (patchedFoo):
(1) mov [esp+28h+var_28], offset aD1
(2) mov ebx, 1
(X) mov esi, 4
(3) mov [esp+28h+var_24], ebx
(4) call _printf
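As a rough illustration of this step, the sketch below records which instruction defines a register that a later instruction reads, and which registers are read before any definition inside the tracelet. It is a deliberately simplified model and not the talk's analysis: it only understands `mov dst, src`, treats every other instruction as reading its operands, and the names `analyze`, `REGISTERS`, and `regs_in` are invented for the example.

```python
import re
from typing import List, Set, Tuple

REGISTERS = {"eax", "ebx", "ecx", "edx", "esi", "edi", "esp", "ebp"}


def regs_in(text: str) -> List[str]:
    """Registers mentioned anywhere in an operand string."""
    return [tok for tok in re.findall(r"[a-z]+", text) if tok in REGISTERS]


def analyze(tracelet: List[str]) -> Tuple[List[Tuple[int, int, str]], Set[str]]:
    last_def = {}                 # register -> index of its latest definition
    flows = []                    # (defining index, using index, register)
    live_in: Set[str] = set()     # registers read before being defined here
    for idx, ins in enumerate(tracelet):
        mnemonic, _, rest = ins.partition(" ")
        ops = [op.strip() for op in rest.split(",")] if rest else []
        if mnemonic == "mov" and len(ops) == 2:
            dst, src = ops
            # A memory destination reads the registers in its address.
            used = regs_in(src) + (regs_in(dst) if dst.startswith("[") else [])
            defined = [dst] if dst in REGISTERS else []
        else:
            used, defined = regs_in(rest), []
        for reg in used:
            if reg in last_def:
                flows.append((last_def[reg], idx, reg))
            else:
                live_in.add(reg)
        for reg in defined:
            last_def[reg] = idx
    return flows, live_in


PATCHED_B1 = [
    "mov [esp+28h+var_28], offset aD1",
    "mov ebx, 1",
    "mov esi, 4",
    "mov [esp+28h+var_24], ebx",
    "call _printf",
]

print(analyze(PATCHED_B1))
```

For the patchedFoo block this reports one flow, from `mov ebx, 1` to the store that reads `ebx`, and `esp` as the only live-in register.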

slide-16
SLIDE 16

Dealing with code changes: Symbolize

Move to symbolic names

A1 (foo):
(1) mov [esp+18h+var_18], offset aD1
(2) mov ecx, 1
(3) mov [esp+18h+var_14], ecx
(4) call _printf

B1 (patchedFoo):
(1) mov [esp+28h+var_28], offset aD1
(2) mov ebx, 1
(X) mov esi, 4
(3) mov [esp+28h+var_24], ebx
(4) call _printf

B1 (patchedFoo), symbolized:
(1) mov [r11+28h+m12], OF13
(2) mov r21, 1
(X) mov esi, 4
(3) mov [r31+28h+m32], r33
(4) call FC41

slide-17
SLIDE 17

Dealing with code changes: Solve & Rewrite

  • Use alignment & DFA to create constraints
  • Solve them using a constraint solver with minimal conflicts

A1 (foo):
(1) mov [esp+18h+var_18], offset aD1
(2) mov ecx, 1
(3) mov [esp+18h+var_14], ecx
(4) call _printf

B1 (patchedFoo), symbolized:
(1) mov [r11+28h+m12], OF13
(2) mov r21, 1
(X) mov esi, 4
(3) mov [r31+28h+m32], r33
(4) call FC41

Data flow constraints: r21=r33; r11=r31
Alignment constraints: r11=esp; OF13=…; m12=var_18; r21=ecx; r31=esp; m32=var_14; r33=ecx; FC41=_printf
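The constraints on this slide can be explored with a small stand-in solver. The sketch below is an assumption, not the solver used in the talk: it treats data flow constraints as hard equalities (merged with union-find), lets each alignment constraint vote for a concrete value, and keeps the majority value per group, so the alignment constraints it cannot satisfy are the reported violations. The OF13 constraint is left out because its value is elided on the slide; `solve` and `rewrite` are hypothetical names.

```python
import re
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Pair = Tuple[str, str]


def solve(dataflow: List[Pair], alignment: List[Pair]):
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Symbols linked by data flow must receive the same concrete value.
    for a, b in dataflow:
        parent[find(a)] = find(b)

    # Each alignment constraint votes for a value of its symbol's group.
    votes = defaultdict(Counter)
    for symbol, value in alignment:
        votes[find(symbol)][value] += 1
    chosen = {root: counts.most_common(1)[0][0] for root, counts in votes.items()}

    assignment = {symbol: chosen[find(symbol)] for symbol, _ in alignment}
    violated = [(s, v) for s, v in alignment if assignment[s] != v]
    return assignment, violated


def rewrite(template: str, assignment: Dict[str, str]) -> str:
    """Replace symbolic names in an instruction with their assigned values."""
    return re.sub(r"\b\w+\b", lambda m: assignment.get(m.group(0), m.group(0)), template)


DATAFLOW = [("r21", "r33"), ("r11", "r31")]
ALIGNMENT = [("r11", "esp"), ("m12", "var_18"), ("r21", "ecx"), ("r31", "esp"),
             ("m32", "var_14"), ("r33", "ecx"), ("FC41", "_printf")]

assignment, violated = solve(DATAFLOW, ALIGNMENT)
print(violated)
print(rewrite("mov [r31+28h+m32], r33", assignment))
```

With the slide's constraints nothing is violated, and instruction (3) of the symbolized block becomes `mov [esp+28h+var_14], ecx`.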

slide-18
SLIDE 18

Dealing with code changes: Solve & Rewrite

A1 (foo):
(1) mov [esp+18h+var_18], offset aD1
(2) mov ecx, 1
(3) mov [esp+18h+var_14], ecx
(4) call _printf

B1 (patchedFoo), symbolized:
(1) mov [r11+28h+m12], OF13
(2) mov r21, 1
(X) mov esi, 4
(3) mov [r31+28h+m32], r33
(4) call FC41

B1 (patchedFoo), rewritten:
(1) mov [esp+28h+var_18], offset aD1
(2) mov ecx, 1
(X) mov esi, 4
(3) mov [esp+28h+var_14], ecx
(4) call _printf

Distance after rewrite = 1 instruction delete + 2 value changes

slide-19
SLIDE 19

Our Approach

  • Extract tracelets (deal with structural changes)
  • Pair tracelets using alignment and rewrite (deal with the code changes)
  • Similarity score

slide-20
SLIDE 20

From paired tracelets to function similarity score

Ratio

Containment

slide-21
SLIDE 21

Using tracelets to calculate similarity between different structures

  • (A1,A2,A5) ~ (B1,B2,B7)
  • (A1,A3,A4) ~ (B1,B3,B4)
  • (A3,A4,A5) ~ (B3,B4,B7)
  • (A1,A3,A5) -> “lost”

[Figure: foo's CFG (blocks A1 to A5) next to patchedFoo's CFG (blocks B1 to B7)]

slide-22
SLIDE 22

Using tracelets to calculate similarity between different structures

Similarity (ratio) = $\frac{2 \cdot 3}{4 + 7} = \frac{6}{11} \approx 54\%$
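The arithmetic on this slide can be reproduced in a few lines. The reading of the numbers is an assumption: 3 is taken as the count of paired tracelets, 4 as the number of tracelets in foo, and 7 as the number in patchedFoo; `ratio_similarity` is a hypothetical helper name.

```python
def ratio_similarity(matched: int, ref_total: int, tgt_total: int) -> float:
    """Ratio score: matched tracelets counted on both sides, over all tracelets."""
    if ref_total + tgt_total == 0:
        raise ValueError("at least one function must contain a tracelet")
    return 2 * matched / (ref_total + tgt_total)


score = ratio_similarity(matched=3, ref_total=4, tgt_total=7)
print(f"{score:.1%}")  # 6/11, which the slide shows as 54%
```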

slide-23
SLIDE 23

Our system

  • Web
  • Crawling server: repository crawler, Google crawler
  • Functions DB (MongoDB)
  • Similarity search engine
  • Web front

Similarity search results

Score   Function info
98%     0x041…@tar_1_22.rpm
92%     0x043…@tar_1_21.rpm
89%     0x042…@cpio_2_10.rpm
70%     ….
Other functions ….

  • Over 1 million functions (1 TB indexed data)
  • Search engine core & CLI interface @ github

slide-24
SLIDE 24


slide-25
SLIDE 25

One experiment – find my Heartbleed (CVE-2014-0160)

The TLS implementation does not properly handle Heartbeat Extension packets, which causes information disclosure.

  • Query: tls1_heartbeat @ openssl 1.0.1f
  • Tracelet-based search engine
  • Mixed & stripped* executables (1 million functions)

Score   Function info
98%     tls1_heartbeat @openssl_1_0_1f.rpm
96%     dtls1_process_heartbeat @openssl_1_0_1f.rpm
89%     …@openssl_1_0_1e.rpm
….      more vulnerable functions
….

slide-26
SLIDE 26

Using a single threshold

Score   Function info
98%     tls1_heartbeat @openssl_1_0_1f.rpm
96%     dtls1_process_heartbeat @openssl_1_0_1f.rpm
89%     …@openssl_1_0_1e.rpm
….
Other functions ….

Score   Function info
88%     0x041…@tar_1_22.rpm
83%     0x043…@tar_1_21.rpm
89%     0x042…@cpio_2_10.rpm
70%     ….
Other functions ….

Score   Function info
94%     0x042…@wget_1_12.rpm
91%     0x045…@wget_1_14.rpm
60%     ….
Other functions ….

A 90% similarity score is… good? Can we really choose one threshold?

slide-27
SLIDE 27


Threshold

There should be a more accurate way

slide-28
SLIDE 28
ROC – trying all thresholds

  • Receiver operating characteristic
  • Try every threshold (=> binary classifier)
  • Get a number representing the method’s accuracy
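The bullets above can be sketched as follows: every distinct score is tried as a threshold, each threshold yields one (false positive rate, true positive rate) point, and the area under the resulting curve is the single number. The scores and labels below are invented toy data, and `roc_points` and `auc` are hypothetical helper names.

```python
from typing import List, Tuple


def roc_points(scores: List[float], labels: List[bool]) -> List[Tuple[float, float]]:
    """One (FPR, TPR) point per candidate threshold, from strictest to loosest."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        # Everything scoring at or above the threshold is reported as a match.
        tp = sum(1 for s, is_match in zip(scores, labels) if s >= threshold and is_match)
        fp = sum(1 for s, is_match in zip(scores, labels) if s >= threshold and not is_match)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points: List[Tuple[float, float]]) -> float:
    """Area under the curve by the trapezoid rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Toy data (assumed): similarity scores and whether each result is a true match.
scores = [0.98, 0.96, 0.89, 0.88, 0.83, 0.70]
labels = [True, False, True, True, False, False]

print(round(auc(roc_points(scores, labels)), 3))
```

A search engine that ranks every true match above every non-match would reach an area of 1.0; the toy ranking above, with one non-match at 0.96, scores 7/9.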

slide-29
SLIDE 29

Experiment example

  • The function we are searching for
  • Tracelet-based search engine
  • Remove any functions below threshold (Threshold: XX%)
  • Check results (manually)
  • Calculate accuracy

slide-30
SLIDE 30
ROC – trying all thresholds

  • The method’s accuracy is the Area Under Curve (AUC), which determines precision

slide-31
SLIDE 31
CROC is better than ROC

  • The matches we expect are very sparse
  • We need to “punish” false positives: they have a high cost
  • CROC does exactly that
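To make the contrast concrete, the sketch below computes both areas for one sparse toy ranking. It assumes the usual exponential magnifier for CROC, f(x) = (1 - e^(-αx)) / (1 - e^(-α)), applied to the false positive rate; the value α = 7, the toy labels, and the helper names are assumptions, and tied scores are not handled.

```python
import math
from typing import Callable, List, Tuple


def croc_transform(fpr: float, alpha: float = 7.0) -> float:
    """Stretch the low-FPR region so early false positives cost more."""
    return (1 - math.exp(-alpha * fpr)) / (1 - math.exp(-alpha))


def curve(ranked_labels: List[bool]) -> List[Tuple[float, float]]:
    """(FPR, TPR) after each result, for labels sorted by descending score."""
    positives = sum(ranked_labels)
    negatives = len(ranked_labels) - positives
    tp = fp = 0
    points = [(0.0, 0.0)]
    for is_match in ranked_labels:
        if is_match:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points


def area(points: List[Tuple[float, float]],
         transform: Callable[[float], float] = lambda x: x) -> float:
    return sum((transform(x1) - transform(x0)) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Toy ranking (assumed): 2 true matches among 10 non-matches,
# with one non-match ranked above the second true match.
ranked = [True, False, True] + [False] * 9

points = curve(ranked)
print("AUC[ROC]  =", round(area(points), 3))
print("AUC[CROC] =", round(area(points, croc_transform), 3))
```

On this ranking the single early false positive barely moves the ROC area but lowers the CROC area noticeably, which is the "punishment" the slide refers to.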

slide-32
SLIDE 32

Experiment Structure

Sources:
  • Linux Repositories (RpmFind.com crawler)
  • Random (Google crawler)
  • Manually Compiled (GNU ftp sources)

Groups:
  • Context Group
  • Code Change Group
  • Vulnerable Code

slide-33
SLIDE 33

Experiment goal

  • Query: context group representative
  • Tracelet-based search engine over mixed & stripped executables (1 million functions)
  • Similar functions =? Context group

slide-34
SLIDE 34

Experiment Setup & Results

Mixed & stripped executables (1 million functions)
Tracelet-based search engine

             Tracelets   Graphlets   N-grams
             K=3         K=5         Size 5, Delta 1
AUC[ROC]     99%         60%         72%
AUC[CROC]    99%         12%         25%

slide-35
SLIDE 35

Conclusions

  • Tracelet-based code search system
    – Effective in finding exact and near matches
    – Provides a quantitative similarity score
  • Evaluated using Information Retrieval tools
    – Achieves good precision and recall
    – Tested against other leading methods