SAFE: Self Attentive Function Embedding for Binary Similarity




slide-1
SLIDE 1

SAFE: Self Attentive Function Embedding for Binary Similarity

Luca Massarelli

slide-2
SLIDE 2

Who am I?

PhD Student @ Sapienza University of Rome

Exploring how to leverage Artificial Intelligence to improve security!

slide-3
SLIDE 3

Reverse Engineering is painful…

Image Credit: G. A. Di Luna

slide-4
SLIDE 4

Binary Similarity Problem

slide-5
SLIDE 5

Applications

  • Vulnerability Detection
  • Library Function Identification
  • Malware Hunting
slide-6
SLIDE 6

Existing Commercial Solutions

IDA F.L.I.R.T. DIAPHORA

slide-7
SLIDE 7

Main Limitations

  • Not scalable (BinDiff, Diaphora)
  • Require an exact copy of the function (IDA F.L.I.R.T., YARA)
  • Analysts have to write rules (YARA)

slide-8
SLIDE 8

A few words about recompilation

Easy to do! Effective

slide-9
SLIDE 9

How to create new efficient and effective solutions?

slide-10
SLIDE 10
slide-11
SLIDE 11

EMBEDDINGS!!

Representation of words, sentences or documents using vectors!

IDEA BORROWED FROM Natural Language Processing

binary = $v_1 = [0.17, 0.19, \dots, 0.21]$
binaries = $v_2 = [0.16, 0.23, \dots, 0.20]$
$\mathrm{sim}(\text{binary}, \text{binaries}) = \langle v_1, v_2 \rangle = 0.9$
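
A minimal sketch of how such a similarity score can be computed, assuming the vectors are scaled to unit length so that the inner product equals cosine similarity. The three-dimensional vectors are invented for illustration; the slide's own vectors are elided and are not reproduced here.

```python
import math

def normalize(v):
    """Scale v to unit length so the inner product becomes cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def similarity(v1, v2):
    """Inner product <v1, v2> of the unit-length versions of v1 and v2."""
    return sum(a * b for a, b in zip(normalize(v1), normalize(v2)))

# Hypothetical toy embeddings, not taken from any trained model.
v1 = [1.0, 2.0, 2.0]
v2 = [2.0, 1.0, 2.0]
score = similarity(v1, v2)  # close to 1 means "similar", close to 0 means "unrelated"
```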

slide-12
SLIDE 12

Word2Vec Model

  • The embedding of each word is computed with an unsupervised algorithm that considers the context of the word.

slide-13
SLIDE 13

Word2Vec Model

  • Word relationships can be retrieved from the embeddings:

man : woman = king : ???
$\mathrm{w2v}(\text{king}) - \mathrm{w2v}(\text{man}) + \mathrm{w2v}(\text{woman}) \approx \mathrm{w2v}(\text{queen})$

slide-14
SLIDE 14

Word2Vec Model For ASM

We can do the same with assembly code!

push rbp : pop rbp = push rax : ???

pop rax

slide-15
SLIDE 15

How do we aggregate instruction embeddings into function embeddings?

slide-16
SLIDE 16

Structured Self Attentive Model
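
A rough sketch of the aggregation step, assuming the model follows the structured self-attentive sentence embedding it is named after: an attention matrix A = softmax(W_s2 · tanh(W_s1 · Hᵀ)) weights the per-instruction states H, and M = A · H is flattened into one vector. All sizes and weights below are hypothetical and random; in a trained model H would come from a sequence encoder over the instruction embeddings, and W_s1, W_s2 would be learned.

```python
import math
import random

def matmul(a, b):
    """Multiply matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax(row):
    peak = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attentive_pool(H, W_s1, W_s2):
    """H: n x d instruction states. Returns (A, M) with A: r x n and M: r x d."""
    hidden = [[math.tanh(x) for x in row] for row in matmul(W_s1, transpose(H))]  # d_a x n
    A = [softmax(row) for row in matmul(W_s2, hidden)]  # each row sums to 1 over instructions
    M = matmul(A, H)  # r attention-weighted averages of the instruction states
    return A, M

# Hypothetical sizes: n instructions, d state size, d_a attention size, r attention hops.
rng = random.Random(0)
n, d, d_a, r = 5, 4, 3, 2
H = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(n)]
W_s1 = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(d_a)]
W_s2 = [[rng.uniform(-1, 1) for _ in range(d_a)] for _ in range(r)]

A, M = self_attentive_pool(H, W_s1, W_s2)
function_embedding = [x for row in M for x in row]  # flatten r x d into one vector
```

The attention rows indicate which instructions contribute most to the function embedding, which is why the result does not depend on a fixed function length.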

slide-17
SLIDE 17

The Full Pipeline

slide-18
SLIDE 18

Creating the dataset

  • This is easy!!!
  • We compile 11 different projects with different compilers and optimization levels!
  • … and we disassemble everything!

slide-19
SLIDE 19

It works!!

  • AUC:
  • SAFE: 0.99
  • I2v_attention: 0.96
  • Gemini (MFE): 0.95
  • We tested SAFE on different tasks!
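
For readers unfamiliar with the metric, here is a small sketch of how an AUC can be computed from similarity scores: it is the fraction of (similar, dissimilar) pairs in which the similar pair receives the higher score. The scores and labels are made up and unrelated to the results above.

```python
def auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Made-up similarity scores for six function pairs; label 1 = same source function.
scores = [0.95, 0.90, 0.40, 0.85, 0.30, 0.10]
labels = [1, 1, 1, 0, 0, 0]
value = auc(scores, labels)
```

An AUC of 1.0 would mean every similar pair scores above every dissimilar pair; 0.5 corresponds to random guessing.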

slide-20
SLIDE 20

Function Search Engine!

  • We tested SAFE as a function search engine!
  • We try to retrieve from a knowledge base the functions similar to the query!
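
A minimal sketch of such a lookup, assuming functions are ranked by cosine similarity between embeddings. The knowledge base, the function names and the three-dimensional embeddings are all invented; a real system would store model-produced embeddings and likely use an approximate nearest-neighbour index.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def search(query, knowledge_base, k=2):
    """Return the k (name, score) entries most similar to the query embedding."""
    ranked = sorted(
        ((name, cosine(query, emb)) for name, emb in knowledge_base.items()),
        key=lambda item: item[1],
        reverse=True,
    )
    return ranked[:k]

# Hypothetical knowledge base: names and embeddings are placeholders.
knowledge_base = {
    "memcpy_gcc_O2": [0.9, 0.1, 0.0],
    "memcpy_clang_O0": [0.8, 0.2, 0.1],
    "qsort_gcc_O2": [0.0, 0.9, 0.4],
}
hits = search([1.0, 0.1, 0.0], knowledge_base)
```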

slide-21
SLIDE 21

Semantic Classification

  • We try to classify functions into 4 different semantic classes using embeddings!

  • Math
  • String
  • Encryption
  • Sorting
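
The slides do not say which classifier is used, so the following is only an illustration with a deliberately simple stand-in, a nearest-centroid classifier. The two-dimensional training embeddings are invented; real ones would be produced by the embedding model.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(col) for col in zip(*vectors)]

def classify(embedding, centroids):
    """Pick the class whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(embedding, centroids[label]))

# Invented 2-D training embeddings for the four classes named on the slide.
training = {
    "Math": [[0.9, 0.1], [0.8, 0.2]],
    "String": [[0.1, 0.9], [0.2, 0.8]],
    "Encryption": [[-0.9, 0.1], [-0.8, -0.1]],
    "Sorting": [[0.0, -0.9], [0.1, -0.8]],
}
centroids = {label: centroid(vs) for label, vs in training.items()}
predicted = classify([-0.7, 0.0], centroids)
```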
slide-22
SLIDE 22

Semantic Classification Visualization

Embeddings are clustered in the space according to their semantics!

(S) Sorting (E) Encryption (SM) String Manipulation (M) Math


slide-23
SLIDE 23

Applications

  • IDENTIFICATION OF AN ENCRYPTION FUNCTION INSIDE A MALWARE!
  • IDENTIFICATION OF VULNERABLE FUNCTIONS INSIDE A FIRMWARE!
  • YARASAFE – USING SAFE INSIDE YARA

slide-24
SLIDE 24

TeslaCrypt Ransomware

  • We disassembled the sample with IDA and used our semantic classifier to analyze every function!
  • The classifier found seven functions that have encryption semantics!
  • 6 of them were actually performing encryption!!

Sample: 3372c1edab46837f1e973164fa2d726c5c5e17bcb888828ccd7c4dfcc234a370
Detected functions: 0x41e900, 0x420ec0, 0x4210a0, 0x4212c0, 0x421665, 0x421900, 0x4219c0

slide-25
SLIDE 25

Function Detected At 0x41E900

SHA1 Constant

slide-26
SLIDE 26

Possible improvement: Detecting suspicious functionality inside a firmware

slide-27
SLIDE 27

Spotting Vulnerability in COTS software

  • We developed a tool, YARASAFE, to simplify this process!

slide-28
SLIDE 28

YARA-SAFE

slide-29
SLIDE 29

YARA-SAFE Rule

import "safe"

rule Heartbleed
{
    condition:
        safe.similarity("[0.094, …. , 0.0597]") > 0.97
}
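
One plausible reading of the condition, sketched in Python: the rule fires when at least one function in the scanned file has an embedding whose similarity to the rule's embedding exceeds the threshold. This is an assumption about the semantics, not the YARASAFE implementation; `rule_matches` and all vectors are hypothetical.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def rule_matches(target, function_embeddings, threshold=0.97):
    """True if any function in the scanned file is similar enough to the target."""
    return any(cosine(target, emb) > threshold for emb in function_embeddings)

# Invented embeddings: the target stands in for the vulnerable function's embedding.
target = [0.6, 0.8, 0.0]
firmware_functions = [[0.0, 0.1, 1.0], [0.59, 0.81, 0.02]]  # contains a near-copy
clean_functions = [[0.0, 0.1, 1.0], [1.0, 0.0, 0.0]]        # nothing close enough

matched = rule_matches(target, firmware_functions)
not_matched = rule_matches(target, clean_functions)
```

A high threshold such as 0.97 trades recall for fewer false positives; lowering it would catch more distant recompilations at the cost of more spurious hits.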

slide-30
SLIDE 30

Rule - Creation

slide-31
SLIDE 31

DEMO!!

slide-32
SLIDE 32

GitHub

Paper