Recovering Clear, Natural Identifiers from Obfuscated (JavaScript) Names



slide-1
SLIDE 1

Bogdan Vasilescu (CMU, ISR) Prem Devanbu (UCDavis) Casey Casalnuovo (UCDavis)

Recovering Clear, Natural Identifiers from Obfuscated (JavaScript) Names

@b_vasilescu @devanbu

slide-2
SLIDE 2

@b_vasilescu

var geom2d = function() {
  var t = numeric.sum;
  function r(n, r) {
    this.x = n;
    this.y = r;
  }
  u(r, {
    P: function e(n) {
      return t([ this.x * n.x, this.y * n.y ]);
    }
  });
  function u(n, r) {
    for (var t in r) n[t] = r[t];
    return n;
  }
  return { V: r };
}();

Today

slide-4
SLIDE 4

@b_vasilescu

var geom2d = function() {
  var t = numeric.sum;
  function r(n, r) {
    this.x = n;
    this.y = r;
  }
  u(r, {
    P: function e(n) {
      return t([ this.x * n.x, this.y * n.y ]);
    }
  });
  function u(n, r) {
    for (var t in r) n[t] = r[t];
    return n;
  }
  return { V: r };
}();

var geom2d = function() {
  var sum = numeric.sum;
  function Vector2d(x, y) {
    this.x = x;
    this.y = y;
  }
  mix(Vector2d, {
    P: function dotProduct(vector) {
      return sum([ this.x * vector.x, this.y * vector.y ]);
    }
  });
  function mix(dest, src) {
    for (var k in src) dest[k] = src[k];
    return dest;
  }
  return { V: Vector2d };
}();

Today

Data-driven method + tool

slide-5
SLIDE 5

Why?

  • Programs are (also) written to be read

“Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do.” [Don Knuth]

@b_vasilescu

slide-6
SLIDE 6

Why?

  • Programs are (also) written to be read
  • Well-chosen variable names are critical to source code readability, reusability, maintainability
  • Example tasks:
      • reverse engineering binaries
      • reverse engineering obfuscated JavaScript
      • consistent styling in large, distributed teams

@b_vasilescu

slide-8
SLIDE 8

Why?

  • Programs are (also) written to be read
  • Well-chosen variable names are critical to source code readability, reusability, maintainability [many]
  • Example tasks:
      • reverse engineering binaries
      • reverse engineering obfuscated JavaScript
      • consistent styling in large, distributed teams
  • Martin Vechev, “Probabilistic Learning From Big Code”. Keynote at ISSTA 2016

@b_vasilescu

slide-9
SLIDE 9

Key ingredient

  • The “naturalness” of software [Hindle et al, 2011]
slide-10
SLIDE 10

Hmmmm….

Natural languages are complex

slide-11
SLIDE 11

Tiger, Tiger burning bright
In the forests of the night
What immortal hand or eye,
Could frame thy fearful symmetry?

Natural languages are complex

slide-12
SLIDE 12

TIGER!! 
 RUN!!!

..but most utterances are simple & repetitive

slide-13
SLIDE 13

English, த, German

Can be Rich, Powerful, Expressive

slide-14
SLIDE 14

English, த, German

Can be Rich, Powerful, Expressive
..but “in nature” is mostly Simple, Repetitive, Boring

slide-15
SLIDE 15

English, த, German

Can be Rich, Powerful, Expressive
..but “in nature” is mostly Simple, Repetitive, Boring

Statistical Models

slide-17
SLIDE 17

The “naturalness of software” thesis

Programming Languages are complex...
...but Natural Programs are simple & repetitive.
And this, too, CAN BE EXPLOITED!! [Hindle et al, 2011]

slide-18
SLIDE 18
slide-19
SLIDE 19

Variable Name Guesser (AUTONYM)

.org

Autonym

slide-20
SLIDE 20

Variable Name Guesser (AUTONYM)

Minified Source Code

function u(n, r) {
  for (var t in r) n[t] = r[t];
  return n;
}

.org

Autonym

slide-21
SLIDE 21

Variable Name Guesser (AUTONYM)

Minified Source Code

function u(n, r) {
  for (var t in r) n[t] = r[t];
  return n;
}

Un-Minified Source Code

function mix(dest, src) {
  for (var k in src) dest[k] = src[k];
  return dest;
}

.org

Autonym

slide-22
SLIDE 22

.org

Minified Source Code → Pre-processing → Moses SMT → Post-processing → Un-Minified Source Code

Autonym

slide-23
SLIDE 23

.org

Pre-processing → Moses SMT → Post-processing

Autonym

What’s the relevance of Machine Translation?

slide-24
SLIDE 24

Noisy channel translation model

slide-26
SLIDE 26

Noisy channel translation model

distorted message

slide-27
SLIDE 27

Noisy channel translation model

channel model distorted message

slide-28
SLIDE 28

Noisy channel translation model

channel model language model distorted message

slide-29
SLIDE 29

Noisy channel translation model

Goal: recover

p(e)

channel model language model distorted message

slide-31
SLIDE 31

Noisy channel translation model

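Goal: recover $e$ (for a given $f$)

$$\hat{e} = \arg\max_{e} p(e \mid f) = \arg\max_{e} p(f \mid e)\, p(e) \qquad \text{(Bayes' theorem)}$$

language model: $p(e)$, channel model: $p(f \mid e)$, distorted message: $f$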

slide-32
SLIDE 32

Noisy channel translation model

Goal: recover $e$

  • $p(e)$: language model
  • $p(f \mid e)$: translation (channel distortion) model
  • $f$: distorted message
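
The decoding rule of the noisy channel model can be sketched in a few lines. This is a minimal illustration, assuming toy probability tables; the names `languageModel`, `channelModel`, and `decode`, and every number below, are invented for the example and are not taken from the talk or from Moses.

```javascript
// Toy noisy-channel decoder: choose the clear name e that maximizes
// p(f | e) * p(e) for an observed minified name f.
// All probabilities are made up for illustration.
const languageModel = { dest: 0.05, data: 0.04, node: 0.03 }; // p(e)
const channelModel = {                                        // p(f | e)
  dest: { n: 0.5, t: 0.2 },
  data: { n: 0.2, t: 0.4 },
  node: { n: 0.6, e: 0.3 },
};

function decode(f) {
  let best = null;
  let bestScore = -Infinity;
  for (const e of Object.keys(languageModel)) {
    const pFGivenE = (channelModel[e] && channelModel[e][f]) || 0;
    // Sum of logs instead of product of probabilities, to avoid underflow.
    const score = Math.log(pFGivenE) + Math.log(languageModel[e]);
    if (score > bestScore) {
      bestScore = score;
      best = e;
    }
  }
  return best; // null when no candidate can produce f
}

console.log(decode("n"), decode("t"));
```

Note that `node` has the highest channel probability for `n`, but `dest` wins once the language model prior is multiplied in; that interplay is the point of the two-model decomposition.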

slide-33
SLIDE 33

Language model

Translation model

Translating French (f) to English (e)

slide-34
SLIDE 34

Language model

Translation model

Aligned French-English Corpus

Translating French (f) to English (e)

slide-35
SLIDE 35

  • Language model: trained on an English corpus
  • Translation model: trained on an aligned French-English corpus

Translating French (f) to English (e)

slide-37
SLIDE 37

  • Language model: trained on a clear code corpus
  • Translation model: trained on an aligned clear-minified code corpus

Translating minified (f) to clear JS (e)

slide-38
SLIDE 38

  • Language model: trained on a clear code corpus
  • Translation model: trained on an aligned clear-minified code corpus

Translating minified (f) to clear JS (e)

GitHub + minifier
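
To make the translation model concrete, here is a minimal sketch assuming simple relative-frequency estimation over aligned name pairs. The pairs and the helper `estimateChannelModel` are hypothetical; Moses builds its own phrase table and this does not reproduce it.

```javascript
// Hypothetical aligned (clear name, minified name) pairs, of the kind one
// could collect by running a minifier over clear code. Invented data.
const alignedPairs = [
  ["dest", "n"],
  ["src", "r"],
  ["dest", "t"],
  ["dest", "n"],
  ["key", "t"],
];

// Relative-frequency estimate of p(minified | clear).
function estimateChannelModel(pairs) {
  const pairCounts = new Map();
  const clearCounts = new Map();
  for (const [clear, minified] of pairs) {
    const key = clear + " " + minified;
    pairCounts.set(key, (pairCounts.get(key) || 0) + 1);
    clearCounts.set(clear, (clearCounts.get(clear) || 0) + 1);
  }
  return function probability(minified, clear) {
    const total = clearCounts.get(clear) || 0;
    if (total === 0) return 0; // clear name never seen in training
    return (pairCounts.get(clear + " " + minified) || 0) / total;
  };
}

const p = estimateChannelModel(alignedPairs);
console.log(p("n", "dest")); // share of "dest" occurrences minified to "n"
```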

slide-39
SLIDE 39

Alignment

EN: I know what you named your identifiers!
NL: Ik weet wat je je ID's genoemd!

Natural language: non-trivial alignment

  • Reordering
  • Different length
  • Dropped words
slide-41
SLIDE 41

Alignment

EN: I know what you named your identifiers!
NL: Ik weet wat je je ID's genoemd!

function u(n, r) {
function mix(dest, src){

Natural language: non-trivial alignment

  • Reordering
  • Different length
  • Dropped words
slide-42
SLIDE 42

Alignment

EN: I know what you named your identifiers!
NL: Ik weet wat je je ID's genoemd!

function u(n, r) {
function mix(dest, src){

Natural language: non-trivial alignment

  • Reordering
  • Different length
  • Dropped words

Minification: straightforward alignment
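
Because minification only renames identifiers, the two versions of a line have the same tokens in the same order, and alignment reduces to pairing tokens by position. A minimal sketch, assuming a simplified regex tokenizer (the helpers `tokenize` and `alignNames` are hypothetical and ignore strings, comments, and multi-character operators):

```javascript
// Split a line into identifiers/keywords, numbers, and single punctuation
// characters. Simplified on purpose; not a full JavaScript lexer.
function tokenize(line) {
  return line.match(/[A-Za-z_$][\w$]*|\d+|[^\sA-Za-z_$\d]/g) || [];
}

// Pair tokens by position and keep the pairs that differ,
// i.e. the renamed identifiers.
function alignNames(minifiedLine, clearLine) {
  const minified = tokenize(minifiedLine);
  const clear = tokenize(clearLine);
  if (minified.length !== clear.length) {
    throw new Error("lines do not align token by token");
  }
  const pairs = [];
  for (let i = 0; i < minified.length; i++) {
    if (minified[i] !== clear[i]) pairs.push([minified[i], clear[i]]);
  }
  return pairs;
}

const pairs = alignNames("function u(n, r) {", "function mix(dest, src){");
console.log(pairs); // pairs of [minified name, clear name]
```

No reordering, length change, or dropped-word handling is needed, which is what distinguishes this setting from the EN/NL example.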

slide-43
SLIDE 43

Complications

function r(n, r) { for (var t in r) n[t] = r[t]; return n; }

?

slide-44
SLIDE 44

function r(n, r) { for (var t in r) n[t] = r[t]; return n; }

Complications

slide-46
SLIDE 46

function r(n, r) { for (var t in r) n[t] = r[t]; return n; }

Complications

Autonym

slide-47
SLIDE 47

function r(n, r) { for (var t in r) n[t] = r[t]; return n; }

(1) Overloading

function mix(dest, src) { }

Complications

Autonym

slide-48
SLIDE 48

function r(n, r) { for (var t in r) n[t] = r[t]; return n; }

(1) Overloading

function mix(dest, src) { }

Scope analysis

Complications

Autonym
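
One way to picture how scope analysis resolves the overloading: tag each name with the scope that declares it, so the function `r` and the parameter `r` become distinct tokens before translation. The sketch below is a toy under strong assumptions: it handles only `function` declarations, parameters, and `var`, ignores hoisting, and the `name#scope` tag format and the helper `tagByScope` are invented here, not Autonym's actual pre-processing.

```javascript
const KEYWORDS = new Set(["function", "var", "for", "in", "return"]);

function tokenize(src) {
  return src.match(/[A-Za-z_$][\w$]*|\d+|[^\sA-Za-z_$\d]/g) || [];
}

// Rewrite every declared name as name#scopeId so that identical minified
// names bound in different scopes are no longer confused.
function tagByScope(src) {
  const tokens = tokenize(src);
  const scopes = [{ id: 0, names: new Map() }]; // innermost scope last
  const braces = []; // per open "{": did it open a function body?
  let nextId = 1;
  let pending = null; // scope of a function whose body has not started
  let state = "code"; // "code" | "head" | "params" | "body"
  const out = [];

  const declare = (scope, name) => {
    const tag = name + "#" + scope.id;
    scope.names.set(name, tag);
    return tag;
  };
  const resolve = (name) => {
    for (let i = scopes.length - 1; i >= 0; i--) {
      if (scopes[i].names.has(name)) return scopes[i].names.get(name);
    }
    return name; // undeclared here: treat as global, leave unchanged
  };

  for (let i = 0; i < tokens.length; i++) {
    const tok = tokens[i];
    const isIdent = /^[A-Za-z_$]/.test(tok) && !KEYWORDS.has(tok);
    const top = scopes[scopes.length - 1];
    if (tok === "function") {
      pending = { id: nextId++, names: new Map() };
      state = "head";
      out.push(tok);
    } else if (state === "head" && isIdent) {
      out.push(declare(top, tok)); // function name lives in the outer scope
    } else if (state === "head" && tok === "(") {
      state = "params";
      out.push(tok);
    } else if (state === "params" && isIdent) {
      out.push(declare(pending, tok)); // parameters live in the new scope
    } else if (state === "params" && tok === ")") {
      state = "body";
      out.push(tok);
    } else if (tok === "{") {
      const opensFunction = state === "body";
      if (opensFunction) {
        scopes.push(pending);
        pending = null;
        state = "code";
      }
      braces.push(opensFunction);
      out.push(tok);
    } else if (tok === "}") {
      if (braces.pop()) scopes.pop();
      out.push(tok);
    } else if (isIdent && tokens[i - 1] === "var") {
      out.push(declare(top, tok));
    } else if (isIdent && tokens[i - 1] === ".") {
      out.push(tok); // property access, not a variable
    } else if (isIdent) {
      out.push(resolve(tok));
    } else {
      out.push(tok);
    }
  }
  return out.join(" ");
}

const tagged = tagByScope(
  "function r(n, r) { for (var t in r) n[t] = r[t]; return n; }"
);
console.log(tagged);
```

In the tagged output the function name carries scope 0 and the parameter carries scope 1, so a translation model can propose `mix` for one and `src` for the other.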

slide-49
SLIDE 49

function r(n, r) { for (var t in r) n[t] = r[t]; return n; }
function mix(dest, src) { for (var k in list) dest[k] = list[k]; return dest; }

(2) Consistency

(Sentence-by-sentence translation)

Complications

Autonym

slide-50
SLIDE 50

function r(n, r) { for (var t in r) n[t] = r[t]; return n; }
function mix(dest, src) { for (var k in list) dest[k] = list[k]; return dest; }

(2) Consistency

(Sentence-by-sentence translation)

Language model scoring

Idea: try all, let language model decide which is more natural, on average, across ALL lines

Language model Translation model

Complications

Autonym
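
The "try all, let the language model decide" idea can be sketched as follows. This is a toy under stated assumptions: the language model is an add-one-smoothed bigram model over a four-line invented corpus, `<R>` is a made-up placeholder for the name being chosen, and the helpers `trainBigramModel` and `pickConsistentName` are hypothetical. The real system scores with the language model trained for Moses.

```javascript
// Train a tiny bigram model and return a function that scores a line by
// its average add-one-smoothed log-probability per bigram.
function trainBigramModel(lines) {
  const pairCounts = new Map();
  const contextCounts = new Map();
  const vocab = new Set();
  for (const line of lines) {
    const toks = line.split(" ");
    toks.forEach((t) => vocab.add(t));
    for (let i = 0; i + 1 < toks.length; i++) {
      const key = toks[i] + " " + toks[i + 1];
      pairCounts.set(key, (pairCounts.get(key) || 0) + 1);
      contextCounts.set(toks[i], (contextCounts.get(toks[i]) || 0) + 1);
    }
  }
  return function score(line) {
    const toks = line.split(" ");
    let total = 0;
    for (let i = 0; i + 1 < toks.length; i++) {
      const pair = pairCounts.get(toks[i] + " " + toks[i + 1]) || 0;
      const ctx = contextCounts.get(toks[i]) || 0;
      total += Math.log((pair + 1) / (ctx + vocab.size));
    }
    return total / Math.max(1, toks.length - 1);
  };
}

// Substitute each candidate into ALL lines that use the name and keep the
// candidate with the best average score.
function pickConsistentName(templates, candidates, score) {
  let best = null;
  let bestAvg = -Infinity;
  for (const name of candidates) {
    const sum = templates
      .map((t) => score(t.split("<R>").join(name)))
      .reduce((a, b) => a + b, 0);
    const avg = sum / templates.length;
    if (avg > bestAvg) {
      bestAvg = avg;
      best = name;
    }
  }
  return best;
}

// Invented "clear code" corpus, pre-tokenized with spaces.
const score = trainBigramModel([
  "for ( var k in src ) dest [ k ] = src [ k ] ;",
  "for ( var key in src ) target [ key ] = src [ key ] ;",
  "return dest ;",
  "function extend ( dest , src ) {",
]);

// The two lines in which the same minified name occurs.
const templates = [
  "function mix ( dest , <R> ) {",
  "for ( var k in <R> ) dest [ k ] = <R> [ k ] ;",
];
const chosen = pickConsistentName(templates, ["list", "src"], score);
console.log(chosen);
```

Scoring across all lines at once is what prevents the `src` / `list` split shown above, where each line was translated independently.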

slide-51
SLIDE 51

Evaluation

  • Held-out test set: 2,149 files
  • Comparison to JSNice [Raychev et al, 2015]
  • Metric: % names recovered
slide-52
SLIDE 52

Evaluation

  • Held-out test set: 2,149 files
  • Comparison to JSNice [Raychev et al, 2015]
  • Metric: % names recovered
  • Global vs. local names (globals don’t change)

var geom2d = function() {
  var t = numeric.sum;
  function r(n, r) {
    this.x = n;
    this.y = r;
  }
  ...

var geom2d = function() {
  var sum = numeric.sum;
  function Vector2d(x, y) {
    this.x = x;
    this.y = y;
  }
  ...

slide-53
SLIDE 53

[Figure: % names recovered (2,149 test files); series: Autonym and JSNice, each for local and all names; axis from 0.00 to 1.00]

slide-54
SLIDE 54

[Figure: Autonym file accuracy vs. JSNice file accuracy (both axes 0.00 to 1.00); frequency scale 20 to 60]

Joining forces

slide-55
SLIDE 55

Pre-processing → Moses SMT → Post-processing

Autonym

Becoming JSNaughty

slide-56
SLIDE 56

Pre-processing → Moses SMT → Post-processing

Autonym

JSNice

Becoming JSNaughty

slide-57
SLIDE 57

[Figure: % names recovered (2,149 test files); series: Autonym, JSNice, and JSNaughty, each for local and all names; axis from 0.00 to 1.00]

slide-58
SLIDE 58

Examples

module.exports = http.createServer(function(e, r) {
  var t;
  var i = new stream.Stream();
  ...
  var n = "";
  csv().fromStream(e).on("data", function(e, r) {
    if (!t) { ... }
    var a = {};
    _(_.zip(t, e)).each(function(e) { ... });
    i.emit("data", n + JSON.stringify(a));
    n = ",";
  }).on("end", function(e) {
    i.emit("data", "]}");
    i.emit("end");
  }).on("error", function(e) {
    i.emit("error", e);
    console.log("csv error", e.message);
  });
});

Original      AUTONYM    JSNICE     JSNAUGHTY
error         err        err        err
tuple         tuple      key        tuple
headers       headers    headers    headers
jsonStream    i          s          s
req           req        q          req
res           res        r          res
separator     data       sep        sep
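
The "% names recovered" metric can be illustrated on the seven names listed above. This sketch assumes that "recovered" means an exact string match with the original name, which is an assumption made here for illustration; the helper `fractionRecovered` is hypothetical.

```javascript
// Rows copied from the example: [original, Autonym, JSNice, JSNaughty].
const rows = [
  ["error", "err", "err", "err"],
  ["tuple", "tuple", "key", "tuple"],
  ["headers", "headers", "headers", "headers"],
  ["jsonStream", "i", "s", "s"],
  ["req", "req", "q", "req"],
  ["res", "res", "r", "res"],
  ["separator", "data", "sep", "sep"],
];

// Fraction of names for which the tool's suggestion equals the original.
function fractionRecovered(table, toolColumn) {
  const hits = table.filter((row) => row[toolColumn] === row[0]).length;
  return hits / table.length;
}

const autonym = fractionRecovered(rows, 1);   // 4 of 7 exact matches
const jsnice = fractionRecovered(rows, 2);    // 1 of 7 exact matches
const jsnaughty = fractionRecovered(rows, 3); // 4 of 7 exact matches
console.log(autonym, jsnice, jsnaughty);
```

Exact match is strict: `err` for `error` and `sep` for `separator` count as misses even though a reader would likely accept them.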

slide-59
SLIDE 59

[Diagram labels: Input program (minified); Optional: Pre-processor; Moses SMT; Post-processor; Output program (un-minified)]

Autonym

JSNice

[Diagram labels, model training: Clear-text corpus; Language model; Aligned clear-text/minified corpus; Translation model]

This material is based upon work supported by the National Science Foundation under Grant No. 1414172

.org

https://github.com/bvasiles/jsNaughty

  • Identifier renaming using SMT, e.g., minified JS, decompiled C
  • Generic, mature off-the-shelf technology (Moses)
  • Language dependence restricted to tokenization and scope analysis
      • dependency parse in JSNice
  • Promising results: ~50% better than JSNice on local names, on average

slide-60
SLIDE 60

Machine translation for code

  • Oda et al. (ASE ’15): code to pseudocode

# Python
if n % 3 == 0:
Pseudo-code: if n is divisible by 3

  • Karaivanov et al. (Onward! ’14): porting C# to Java

// C#
Console.WriteLine("Hello World!");
// Java
System.out.println("Hello World!");

slide-61
SLIDE 61

Machine translation for code

  • Oda et al. (ASE ’15): code to pseudocode

# Python
if n % 3 == 0:
Pseudo-code: if n is divisible by 3

  • Karaivanov et al. (Onward! ’14): porting C# to Java

// C#
Console.WriteLine("Hello World!");
// Java
System.out.println("Hello World!");

  • Nguyen et al. (FSE ’13, ASE ’15): porting Java to C#

// Java
public void findResultEdges() {
  for (Iterator it = dirEdgeList.iterator(); it.hasNext();) {
    DirectedEdge de = (DirectedEdge) it.next();…
  }
}
// C#
public void FindResultEdges() {
  foreach (DirectedEdge de in _dirEdgeList){…}
}