Meaningful Variable Names for Decompiled Code: A Machine - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Meaningful Variable Names for Decompiled Code: A Machine Translation Approach

Alan Jaffe, Jeremy Lacomis, Edward J. Schwartz*, Claire Le Goues, and Bogdan Vasilescu *

slide-2
SLIDE 2

Problem: Obfuscated Variable Names in Code

2

Minified JavaScript:

function callback(error, response, body) {
  if (!error && response.statusCode == 200) {
    var info = JSON.parse(body);
    …

function callback(o, s, a) {
  if (!o && s.statusCode == 200) {
    var c = JSON.parse(a);
    …

slide-4
SLIDE 4

Problem: Obfuscated Variable Names in Code

4

Minified JavaScript:

function callback(error, response, body) {
  if (!error && response.statusCode == 200) {
    var info = JSON.parse(body);
    …

function callback(o, s, a) {
  if (!o && s.statusCode == 200) {
    var c = JSON.parse(a);
    …

Decompiled C Code:

cp = buf;
(void)asxTab(level + 1);
for (n = asnContents(asn, buf, 512); n > 0; n--) {
  printf(" %02X ", *(cp++));
}

v14 = &v15;
asxTab(a2 + 1);
for (v13 = asnContents(a1, &v15, 512LL); v13 > 0; --v13) {
  v9 = (unsigned char*)(v14++);
  printf(" %02X ", *v9);
}

slide-6
SLIDE 6

Problem: Obfuscated Variable Names in Code

6

Minified JavaScript:

function callback(error, response, body) {
  if (!error && response.statusCode == 200) {
    var info = JSON.parse(body);
    …

function callback(o, s, a) {
  if (!o && s.statusCode == 200) {
    var c = JSON.parse(a);
    …

  • Software is “natural” [Hindle et al., 2011].
slide-7
SLIDE 7

Problem: Obfuscated Variable Names in Code

7

Minified JavaScript:

function callback(error, response, body) {
  if (!error && response.statusCode == 200) {
    var info = JSON.parse(body);
    …

function callback(o, s, a) {
  if (!o && s.statusCode == 200) {
    var c = JSON.parse(a);
    …

  • Software is “natural” [Hindle et al., 2011].
  • Use large corpora + machine learning to predict better identifier names.
  • Corpora are easy to generate!
slide-8
SLIDE 8

Problem: Obfuscated Variable Names in Code

8

Minified JavaScript:

function callback(error, response, body) {
  if (!error && response.statusCode == 200) {
    var info = JSON.parse(body);
    …

function callback(o, s, a) {
  if (!o && s.statusCode == 200) {
    var c = JSON.parse(a);
    …

  • Software is “natural” [Hindle et al., 2011].
  • Use large corpora + machine learning to predict better identifier names.
  • Corpora are easy to generate!
  • Bavishi et al., Context2Name, 2017
  • Vasilescu et al., JSNaughty, 2017
  • Raychev et al., JSNice, 2015
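
The claim that corpora are easy to generate can be illustrated with a small sketch: take readable source, rename its identifiers mechanically, and keep both versions as an aligned pair. The helper `minifyNames` and its one-letter naming scheme are assumptions made for this example; they are not taken from JSNice, JSNaughty, or Context2Name, and a real minifier parses the code rather than using a regular expression. The snippet from the slide is closed with braces here so that it is complete.

```javascript
// Hypothetical helper (not from the slides): rename the listed identifiers
// to short, meaningless names, roughly the way a minifier would.
function minifyNames(source, identifiers) {
  const mapping = {};
  identifiers.forEach((name, i) => {
    // Assumption: at most 26 identifiers, so one letter each is enough.
    mapping[name] = String.fromCharCode(97 + i); // 97 is "a"
  });
  // A single pass over one alternation, so a freshly inserted short name
  // is never renamed a second time.
  const pattern = new RegExp("\\b(" + identifiers.join("|") + ")\\b", "g");
  const minified = source.replace(pattern, (match) => mapping[match]);
  return { original: source, minified, mapping };
}

const source =
  "function callback(error, response, body) { " +
  "if (!error && response.statusCode == 200) { " +
  "var info = JSON.parse(body); } }";

// One aligned training pair: readable source and its renamed counterpart.
const pair = minifyNames(source, ["error", "response", "body", "info"]);
console.log(pair.minified);
```

Because the renaming is purely textual, this sketch would also rewrite matching words inside strings or property names; it is only meant to show why original/minified pairs come almost for free.
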
slide-9
SLIDE 9

Problem: Obfuscated Variable Names in Code

9

Decompiled C Code:

cp = buf;
(void)asxTab(level + 1);
for (n = asnContents(asn, buf, 512); n > 0; n--) {
  printf(" %02X ", *(cp++));
}

v14 = &v15;
asxTab(a2 + 1);
for (v13 = asnContents(a1, &v15, 512LL); v13 > 0; --v13) {
  v9 = (unsigned char*)(v14++);
  printf(" %02X ", *v9);
}

Can we use similar strategies for decompiled code?

slide-10
SLIDE 10

Statistical Machine Translation (SMT)

10

  • Noisy channel model
slide-11
SLIDE 11

Statistical Machine Translation (SMT)

11

  • Noisy channel model
  • English → French:
slide-12
SLIDE 12

Statistical Machine Translation (SMT)

12

  • Noisy channel model
  • English → French:

Va faire de la recherche!
Go do some research!

slide-13
SLIDE 13

Statistical Machine Translation (SMT)

13

  • Noisy channel model
  • English → French:

Va faire de la recherche!
Go do some research!

\[ \arg\max_e P(e \mid f) \]

slide-14
SLIDE 14

Statistical Machine Translation (SMT)

14

  • Noisy channel model
  • English → French:

Va faire de la recherche!
Go do some research!

\[ \arg\max_e P(e \mid f) = \arg\max_e \frac{P(f \mid e)\, P(e)}{P(f)} = \arg\max_e P(f \mid e)\, P(e) \]

slide-15
SLIDE 15

Statistical Machine Translation (SMT)

15

  • Noisy channel model
  • English → French:

Va faire de la recherche!
Go do some research!

\[ \arg\max_e P(e \mid f) = \arg\max_e \frac{P(f \mid e)\, P(e)}{P(f)} = \arg\max_e P(f \mid e)\, P(e) \]

Translation Model: Probability that f is a translation of e

slide-16
SLIDE 16

Statistical Machine Translation (SMT)

16

  • Noisy channel model
  • English → French:

Va faire de la recherche!
Go do some research!

\[ \arg\max_e P(e \mid f) = \arg\max_e \frac{P(f \mid e)\, P(e)}{P(f)} = \arg\max_e P(f \mid e)\, P(e) \]

Language Model: “Fluency” of e

slide-17
SLIDE 17

Statistical Machine Translation (SMT)

17

  • Noisy channel model
  • English → French:

Va faire de la recherche!
Go do some research!

\[ \arg\max_e P(e \mid f) = \arg\max_e \frac{P(f \mid e)\, P(e)}{P(f)} = \arg\max_e P(f \mid e)\, P(e) \]

P(f \mid e): Translation Model
P(e): Language Model

MOSES SMT:
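
The noisy channel decision rule, picking the e that maximizes the product of the translation model P(f | e) and the language model P(e), can be sketched at the level of single words. Everything below is an assumption made for illustration: the function name `decode`, the candidate list, and all probability values are invented, and a real system such as Moses scores whole phrases with many more features.

```javascript
// Toy tables; every number here is invented for illustration only.
const pFGivenE = {
  // translation model: P(f | e), indexed as pFGivenE[e][f]
  research: { recherche: 0.6 },
  search: { recherche: 0.7 },
};
const pE = {
  // language model: P(e), the "fluency" of e
  research: 0.02,
  search: 0.01,
};

// Return the candidate e that maximizes log P(f | e) + log P(e).
function decode(f, candidates, translationModel, languageModel) {
  let best = null;
  let bestScore = -Infinity;
  for (const e of candidates) {
    const tm = (translationModel[e] && translationModel[e][f]) || 0;
    const lm = languageModel[e] || 0;
    // Math.log(0) is -Infinity, so impossible candidates never win.
    const score = Math.log(tm) + Math.log(lm);
    if (score > bestScore) {
      bestScore = score;
      best = e;
    }
  }
  return best;
}

console.log(decode("recherche", ["research", "search"], pFGivenE, pE));
```

With these made-up numbers the translation model alone slightly prefers "search", but the language model tips the product toward "research", which is the point of combining the two terms.
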

slide-18
SLIDE 18

SMT Model for Natural Language

18

Aligned French/English corpus
English corpus

slide-19
SLIDE 19

SMT Model for Minified JavaScript

19

Aligned original/minified source corpus
Original source corpus

slide-20
SLIDE 20

Problem: Obfuscated Identifiers in Code

21

Decompiled C Code:

cp = buf;
(void)asxTab(level + 1);
for (n = asnContents(asn, buf, 512); n > 0; n--) {
  printf(" %02X ", *(cp++));
}

v14 = &v15;
asxTab(a2 + 1);
for (v13 = asnContents(a1, &v15, 512LL); v13 > 0; --v13) {
  v9 = (unsigned char*)(v14++);
  printf(" %02X ", *v9);
}

Can we use SMT for decompiled code?

slide-21
SLIDE 21

SMT Model for Decompiled Code?

22

Aligned original/decompiled source corpus
Original source corpus

slide-22
SLIDE 22

SMT Model for Decompiled Code?

23

Aligned original/decompiled source corpus
Original source corpus

Nontrivial

slide-23
SLIDE 23

24

Difficulty: Decompilation Changes Structure

Original Source:

#include <stdio.h>
int main() {
  int cur = 0;
  while (cur <= 9) {
    printf("%d\n", cur);
    ++cur;
  }
  return 0;
}

Decompiled Code:

#include <stdio.h>
int main() {
  int v1 = 0;
  int v2;
  for (v2 = 0; v2 < 10; ++v2)
    printf("%d\n", v2);
  return v1;
}

slide-24
SLIDE 24

25

Difficulty: Decompilation Changes Structure

  • Different line count.

Original Source (9 lines):

#include <stdio.h>
int main() {
  int cur = 0;
  while (cur <= 9) {
    printf("%d\n", cur);
    ++cur;
  }
  return 0;
}

Decompiled Code (8 lines):

#include <stdio.h>
int main() {
  int v1 = 0;
  int v2;
  for (v2 = 0; v2 < 10; ++v2)
    printf("%d\n", v2);
  return v1;
}

slide-25
SLIDE 25

26

Difficulty: Decompilation Changes Structure

  • Different line count.
  • Different numbers of variables.

Original Source:

#include <stdio.h>
int main() {
  int cur = 0;
  while (cur <= 9) {
    printf("%d\n", cur);
    ++cur;
  }
  return 0;
}

Decompiled Code:

#include <stdio.h>
int main() {
  int v1 = 0;
  int v2;
  for (v2 = 0; v2 < 10; ++v2)
    printf("%d\n", v2);
  return v1;
}

slide-26
SLIDE 26

27

Difficulty: Decompilation Changes Structure

  • Different line count.
  • Different numbers of variables.
  • Different types of loops.

Original Source:

#include <stdio.h>
int main() {
  int cur = 0;
  while (cur <= 9) {
    printf("%d\n", cur);
    ++cur;
  }
  return 0;
}

Decompiled Code:

#include <stdio.h>
int main() {
  int v1 = 0;
  int v2;
  for (v2 = 0; v2 < 10; ++v2)
    printf("%d\n", v2);
  return v1;
}
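
The three structural differences can be checked mechanically. The sketch below is only an illustration: the helper `describe` and its regular expressions are assumptions made for this example, tuned to these two snippets, and are no substitute for a real C parser.

```javascript
const originalSource = [
  '#include <stdio.h>',
  'int main() {',
  '  int cur = 0;',
  '  while (cur <= 9) {',
  '    printf("%d\\n", cur);',
  '    ++cur;',
  '  }',
  '  return 0;',
  '}',
].join("\n");

const decompiledSource = [
  '#include <stdio.h>',
  'int main() {',
  '  int v1 = 0;',
  '  int v2;',
  '  for (v2 = 0; v2 < 10; ++v2)',
  '    printf("%d\\n", v2);',
  '  return v1;',
  '}',
].join("\n");

// Hypothetical helper: crude, regex-based summary of a C snippet.
function describe(code) {
  return {
    lines: code.split("\n").length,
    // "int name" followed by "=" or ";" (skips "int main(")
    variables: [...code.matchAll(/\bint\s+(\w+)\s*[=;]/g)].map((m) => m[1]),
    loops: code.match(/\b(?:for|while)\b/g) || [],
  };
}

console.log(describe(originalSource));
console.log(describe(decompiledSource));
```

The two summaries differ in every field (9 versus 8 lines, one versus two variables, `while` versus `for`), which is why a line-by-line alignment of original and decompiled code does not work directly.
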

slide-27
SLIDE 27

Decompiled Code Corpus Generation

28

Decompiled Code:

#include <stdio.h>
int main() {
  int v1 = 0;
  int v2;
  for (v2 = 0; v2 < 10; ++v2)
    printf("%d\n", v2);
  return v1;
}

slide-28
SLIDE 28

Decompiled Code Corpus Generation

29 ❌

Decompiled Code:

#include <stdio.h>
int main() {
  int v1 = 0;
  int v2;
  for (v2 = 0; v2 < 10; ++v2)
    printf("%d\n", v2);
  return v1;
}

Original Code:

#include <stdio.h>
int main() {
  int cur = 0;
  while (cur <= 9) {
    printf("%d\n", cur);
    ++cur;
  }
  return 0;
}

slide-29
SLIDE 29

Decompiled Code Corpus Generation

30

#include <stdio.h>
int main() {
  int v1 = 0;
  int v2;
  for (v2 = 0; v2 < 10; ++v2)
    printf("%d\n", v2);
  return v1;
}

Decompiled Code:

#include <stdio.h>
int main() {
  int v1 = 0;
  int v2;
  for (v2 = 0; v2 < 10; ++v2)
    printf("%d\n", v2);
  return v1;
}

Original Code:

#include <stdio.h>
int main() {
  int cur = 0;
  while (cur <= 9) {
    printf("%d\n", cur);
    ++cur;
  }
  return 0;
}

slide-31
SLIDE 31

Decompiled Code Corpus Generation

32

#include <stdio.h>
int main() {
  int v1 = 0;
  int __;
  for (__ = 0; __ < 10; ++__)
    printf("%d\n", __);
  return v1;
}

Decompiled Code:

#include <stdio.h>
int main() {
  int v1 = 0;
  int v2;
  for (v2 = 0; v2 < 10; ++v2)
    printf("%d\n", v2);
  return v1;
}

Original Code:

#include <stdio.h>
int main() {
  int cur = 0;
  while (cur <= 9) {
    printf("%d\n", cur);
    ++cur;
  }
  return 0;
}

slide-32
SLIDE 32

Decompiled Code Corpus Generation

33

Renamed Decompiled Code:

#include <stdio.h>
int main() {
  int v1 = 0;
  int cur;
  for (cur = 0; cur < 10; ++cur)
    printf("%d\n", cur);
  return v1;
}

Decompiled Code:

#include <stdio.h>
int main() {
  int v1 = 0;
  int v2;
  for (v2 = 0; v2 < 10; ++v2)
    printf("%d\n", v2);
  return v1;
}

Original Code:

#include <stdio.h>
int main() {
  int cur = 0;
  while (cur <= 9) {
    printf("%d\n", cur);
    ++cur;
  }
  return 0;
}
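
The corpus generation step shown here, copying the decompiled code and substituting original names into the copy, can be sketched as follows. The helper `renameIdentifiers` and the hard-coded mapping `{ v2: "cur" }` are assumptions made for this example; the slides do not show how the actual tool applies its renamings.

```javascript
// Hypothetical helper: replace whole identifiers according to a mapping.
function renameIdentifiers(code, mapping) {
  return code.replace(/\b[A-Za-z_]\w*\b/g, (name) =>
    Object.prototype.hasOwnProperty.call(mapping, name) ? mapping[name] : name
  );
}

const decompiled = [
  '#include <stdio.h>',
  'int main() {',
  '  int v1 = 0;',
  '  int v2;',
  '  for (v2 = 0; v2 < 10; ++v2)',
  '    printf("%d\\n", v2);',
  '  return v1;',
  '}',
].join("\n");

// Assumed mapping: v2 plays the role of "cur"; v1 has no original counterpart.
const renamed = renameIdentifiers(decompiled, { v2: "cur" });

// The two versions have identical structure, so they align line by line.
const renamedLines = renamed.split("\n");
const alignedPairs = decompiled
  .split("\n")
  .map((line, i) => [line, renamedLines[i]]);

console.log(renamed);
```

Because only names change, the decompiled and renamed versions always have the same number of lines and tokens, which is what makes the resulting corpus usable as an aligned parallel corpus.
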

slide-33
SLIDE 33

Better SMT Model for Decompiled Code

36

Aligned renamed/decompiled source corpus
Renamed source corpus

slide-34
SLIDE 34

37

Choosing Renamings

Original Code:

#include <stdio.h>
int main() {
  int cur = 0;
  while (cur <= 9) {
    printf("%d\n", cur);
    ++cur;
  }
  return 0;
}

Decompiled Code:

#include <stdio.h>
int main() {
  int v1 = 0;
  int v2;
  for (v2 = 0; v2 < 10; ++v2)
    printf("%d\n", v2);
  return v1;
}

slide-36
SLIDE 36

39

Choosing Renamings

  • Not used as the return value.

Original Code:

#include <stdio.h>
int main() {
  int cur = 0;
  while (cur <= 9) {
    printf("%d\n", cur);
    ++cur;
  }
  return 0;
}

Decompiled Code:

#include <stdio.h>
int main() {
  int v1 = 0;
  int v2;
  for (v2 = 0; v2 < 10; ++v2)
    printf("%d\n", v2);
  return v1;
}

slide-37
SLIDE 37

40

Choosing Renamings

  • Not used as the return value.
  • Used inside of a loop.

Original Code:

#include <stdio.h>
int main() {
  int cur = 0;
  while (cur <= 9) {
    printf("%d\n", cur);
    ++cur;
  }
  return 0;
}

Decompiled Code:

#include <stdio.h>
int main() {
  int v1 = 0;
  int v2;
  for (v2 = 0; v2 < 10; ++v2)
    printf("%d\n", v2);
  return v1;
}

slide-38
SLIDE 38

41

Choosing Renamings

  • Not used as the return value.
  • Used inside of a loop.
  • Used in a function call.

Original Code:

#include <stdio.h>
int main() {
  int cur = 0;
  while (cur <= 9) {
    printf("%d\n", cur);
    ++cur;
  }
  return 0;
}

Decompiled Code:

#include <stdio.h>
int main() {
  int v1 = 0;
  int v2;
  for (v2 = 0; v2 < 10; ++v2)
    printf("%d\n", v2);
  return v1;
}

slide-39
SLIDE 39

42

Choosing Renamings

  • Not used as the return value.
  • Used inside of a loop.
  • Used in a function call.
  • Same operations.

Original Code:

#include <stdio.h>
int main() {
  int cur = 0;
  while (cur <= 9) {
    printf("%d\n", cur);
    ++cur;
  }
  return 0;
}

Decompiled Code:

#include <stdio.h>
int main() {
  int v1 = 0;
  int v2;
  for (v2 = 0; v2 < 10; ++v2)
    printf("%d\n", v2);
  return v1;
}

slide-40
SLIDE 40

43

Choosing Renamings

  • Not used as the return value.
  • Used inside of a loop.
  • Used in a function call.
  • Same operations.

Original Code:

#include <stdio.h>
int main() {
  int cur = 0;
  while (cur <= 9) {
    printf("%d\n", cur);
    ++cur;
  }
  return 0;
}

Decompiled Code:

#include <stdio.h>
int main() {
  int v1 = 0;
  int __;
  for (__ = 0; __ < 10; ++__)
    printf("%d\n", __);
  return v1;
}

slide-41
SLIDE 41

44

Choosing Renamings

  • Not used as the return value.
  • Used inside of a loop.
  • Used in a function call.
  • Same operations.

Original Code:

#include <stdio.h>
int main() {
  int cur = 0;
  while (cur <= 9) {
    printf("%d\n", cur);
    ++cur;
  }
  return 0;
}

Decompiled Code:

#include <stdio.h>
int main() {
  int v1 = 0;
  int cur;
  for (cur = 0; cur < 10; ++cur)
    printf("%d\n", cur);
  return v1;
}
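
The four criteria listed on this slide can be read as a usage signature: a decompiled variable is given an original name when both variables have the same signature. The sketch below follows that reading. The signatures are written by hand, and the field names, the coarse operation categories, and the rule of accepting only unambiguous matches are all assumptions made for this example rather than the authors' implementation.

```javascript
// Hand-written usage signatures (assumed for illustration).
const originalVars = {
  cur: { returned: false, inLoop: true, inCall: true,
         ops: ["assign", "compare", "increment"] },
};
const decompiledVars = {
  v1: { returned: true, inLoop: false, inCall: false, ops: ["assign"] },
  v2: { returned: false, inLoop: true, inCall: true,
        ops: ["assign", "compare", "increment"] },
};

// Collapse a signature into a comparable string key.
function signatureKey(sig) {
  const ops = [...sig.ops].sort().join("+");
  return [sig.returned, sig.inLoop, sig.inCall, ops].join("|");
}

// Map each decompiled variable to an original name with the same signature.
function chooseRenamings(decompiled, original) {
  const byKey = new Map();
  for (const [name, sig] of Object.entries(original)) {
    const key = signatureKey(sig);
    if (!byKey.has(key)) byKey.set(key, []);
    byKey.get(key).push(name);
  }
  const renamings = {};
  for (const [name, sig] of Object.entries(decompiled)) {
    const matches = byKey.get(signatureKey(sig)) || [];
    // Assumption: only rename when exactly one original variable matches.
    if (matches.length === 1) renamings[name] = matches[0];
  }
  return renamings;
}

console.log(chooseRenamings(decompiledVars, originalVars));
```

Here `v2` is mapped to `cur`, while `v1` keeps its decompiler name because no original variable is used as a return value.
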

slide-42
SLIDE 42

System Architecture

45

slide-43
SLIDE 43

Results and Evaluation

46

my_rc base2_string(base2_handle base2_h, char* buffer, size_t buffer_size)

Original

slide-44
SLIDE 44

Results and Evaluation

47

Original:

my_rc base2_string(base2_handle base2_h, char* buffer, size_t buffer_size)

Decompiled:

my_rc base2_string(base2_handle a1, char* a2, size_t a3)

slide-45
SLIDE 45

Results and Evaluation

48

Original:

my_rc base2_string(base2_handle base2_h, char* buffer, size_t buffer_size)

Decompiled:

my_rc base2_string(base2_handle a1, char* a2, size_t a3)

my_rc base2_string(base2_handle base2_h, char* buf, size_t len)

Renamed Decompiled

slide-46
SLIDE 46

Results and Evaluation

49

my_rc base2_string(base2_handle base2_h, char* buffer, size_t buffer_size)

Original

my_rc base2_string(base2_handle base2_h, char* buf, size_t len)

Renamed Decompiled

slide-47
SLIDE 47

Results and Evaluation

50

my_rc base2_string(base2_handle base2_h, char* buffer, size_t buffer_size)

Original

my_rc base2_string(base2_handle base2_h, char* buf, size_t len)

Renamed Decompiled

Exact

slide-48
SLIDE 48

Results and Evaluation

51

my_rc base2_string(base2_handle base2_h, char* buffer, size_t buffer_size)

Original

my_rc base2_string(base2_handle base2_h, char* buf, size_t len)

Renamed Decompiled

Approx

slide-49
SLIDE 49

Results and Evaluation

52

my_rc base2_string(base2_handle base2_h, char* buffer, size_t buffer_size)

Original

my_rc base2_string(base2_handle base2_h, char* buf, size_t len)

Renamed Decompiled

Not a match

slide-50
SLIDE 50

Results and Evaluation

53

my_rc base2_string(base2_handle base2_h, char* buffer, size_t buffer_size)

Original

my_rc base2_string(base2_handle base2_h, char* buf, size_t len)

Renamed Decompiled

  • 12.7% Exact
  • 16.2% Exact + Approx
slide-51
SLIDE 51

Results and Evaluation

54

my_rc base2_string(base2_handle base2_h, char* buffer, size_t buffer_size)

Original

my_rc base2_string(base2_handle base2_h, char* buf, size_t len)

Renamed Decompiled

Not a match

  • 12.7% Exact
  • 16.2% Exact + Approx
slide-52
SLIDE 52

Results and Evaluation

55

Original:

my_rc base2_string(base2_handle base2_h, char* buffer, size_t buffer_size)

Decompiled:

my_rc base2_string(base2_handle a1, char* a2, size_t a3)

my_rc base2_string(base2_handle base2_h, char* buf, size_t len)

Renamed Decompiled

  • 12.7% Exact
  • 16.2% Exact + Approx
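
The three labels used in the evaluation (Exact, Approx, Not a match) can be illustrated on the `base2_string` parameters. The slides do not define what counts as an approximate match, so the prefix rule below is purely an assumption chosen because it reproduces the three labels shown for this example; it should not be read as the authors' metric. The names `classifyName` and `tally` are likewise hypothetical.

```javascript
// Classify one predicted name against the original name.
function classifyName(original, predicted) {
  if (predicted === original) return "exact";
  // ASSUMPTION: "approx" means one name is a prefix of the other.
  const isPrefix =
    predicted.length > 0 &&
    original.length > 0 &&
    (original.startsWith(predicted) || predicted.startsWith(original));
  return isPrefix ? "approx" : "none";
}

// Count how many [original, predicted] pairs fall into each class.
function tally(pairs) {
  const counts = { exact: 0, approx: 0, none: 0 };
  for (const [original, predicted] of pairs) {
    counts[classifyName(original, predicted)] += 1;
  }
  return counts;
}

// Parameter names from the base2_string example on the slides.
const counts = tally([
  ["base2_h", "base2_h"],
  ["buffer", "buf"],
  ["buffer_size", "len"],
]);
console.log(counts);
```

Under this assumed rule, `base2_h` counts as exact, `buf` as approximate, and `len` as not a match, even though `len` is arguably a reasonable name; that gap is one reason exact-match percentages understate how useful the renamings are.
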
slide-53
SLIDE 53

Preliminary Investigation: Human Study

  • Presented users with short snippets (<50 lines) of decompiled code and asked them to perform various maintenance tasks, graded and timed:

56

slide-54
SLIDE 54

Preliminary Investigation: Human Study

  • Presented users with short snippets (<50 lines) of decompiled code and asked them to perform various maintenance tasks, graded and timed:

57

1 int x = 1;
2 int y = 0;
3 while (x <= 5) {
4   y += 2;
5   x += 1;
6 }
7 printf("%d", y);

  • What is the value of the variable y on line 7?
slide-55
SLIDE 55

Preliminary Investigation: Human Study

  • Presented users with short snippets (<50 lines) of decompiled code and asked them to perform various maintenance tasks, graded and timed:

58

1 int x = 1;
2 int y = 0;
3 while (x <= 5) {
4   y += 2;
5   x += 1;
6 }
7 printf("%d", y);

  • What is the value of the variable y on line 7?
  • For correct answers, the time to answer using our renamings was statistically significantly lower than when using the decompiler names.

slide-56
SLIDE 56

System Architecture

45

Conclusion

  • Questions?
  • Suggestions?

59