
Meaningful Variable Names for Decompiled Code: A Machine Translation Approach


  1. Meaningful Variable Names for Decompiled Code: A Machine Translation Approach
     Alan Jaffe, Jeremy Lacomis, Edward J. Schwartz*, Claire Le Goues, and Bogdan Vasilescu

  2. Problem: Obfuscated Variable Names in Code
     Minified JavaScript:

     Original:
         function callback(error, response, body) {
           if (!error && response.statusCode == 200) {
             var info = JSON.parse(body);
             …

     Minified:
         function callback(o, s, a) {
           if (!o && s.statusCode == 200) {
             var c = JSON.parse(a);
             …

  4. Problem: Obfuscated Variable Names in Code
     Decompiled C Code:

     Original:
         cp = buf;
         (void)asxTab(level + 1);
         for (n = asnContents(asn, buf, 512); n > 0; n--) {
           printf(" %02X ", *(cp++));
         }

     Decompiled:
         v14 = &v15;
         asxTab(a2 + 1);
         for (v13 = asnContents(a1, &v15, 512LL); v13 > 0; --v13) {
           v9 = (unsigned char *)(v14++);
           printf(" %02X ", *v9);
         }

  8. Problem: Obfuscated Variable Names in Code
     • Software is “natural” [Hindle et al., 2011].
     • Use large corpora + machine learning to predict better identifier names.
     • Corpora are easy to generate!
     • Bavishi et al., Context2Name, 2017
     • Vasilescu et al., JSNaughty, 2017
     • Raychev et al., JSNice, 2015

  9. Problem: Obfuscated Variable Names in Code
     Can we use similar strategies for decompiled code?

  10. Statistical Machine Translation (SMT)
      • Noisy channel model
      • English → French:
          Go do some research!
          Va faire de la recherche!

      $$\arg\max_e P(e \mid f) = \arg\max_e \frac{P(f \mid e)\,P(e)}{P(f)} = \arg\max_e P(f \mid e)\,P(e)$$

      • $P(f \mid e)$: Translation Model. Probability that f is a translation of e.
      • $P(e)$: Language Model. “Fluency” of e.
      • MOSES SMT: translation model $P(f \mid e)$, language model $P(e)$.
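The noisy channel decision rule on the SMT slides can be sketched in a few lines. This is a minimal illustration, not how MOSES is implemented: it assumes f is the observed French sentence and e ranges over a fixed list of English candidates, and every probability below is invented for the example. The second candidate sentence and the names `translation_model`, `language_model`, and `decode` are hypothetical.

```python
import math

# Hypothetical toy probabilities, invented for illustration only.
# translation_model[(f, e)] stands in for P(f | e).
translation_model = {
    ("Va faire de la recherche!", "Go do some research!"): 0.30,
    ("Va faire de la recherche!", "Go make of the research!"): 0.40,
}
# language_model[e] stands in for P(e), the "fluency" of e.
language_model = {
    "Go do some research!": 0.020,
    "Go make of the research!": 0.001,
}


def decode(f, candidates):
    """Return the candidate e that maximizes P(f | e) * P(e), scored in log space."""
    def score(e):
        p_f_given_e = translation_model.get((f, e), 0.0)
        p_e = language_model.get(e, 0.0)
        if p_f_given_e == 0.0 or p_e == 0.0:
            return float("-inf")  # unseen pairs can never win
        return math.log(p_f_given_e) + math.log(p_e)

    return max(candidates, key=score)


best = decode("Va faire de la recherche!", list(language_model))
print(best)
```

With these made-up numbers the literal candidate has the higher translation probability, but the language model term outweighs it, which is the reason the rule multiplies the two factors instead of using the translation model alone.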

  18. SMT Model for Natural Language
      • Aligned French/English corpus
      • English corpus

  19. SMT Model for Minified JavaScript
      • Aligned original/minified source corpus
      • Original source corpus
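An aligned original/minified corpus can be produced mechanically, since a renaming pass leaves the shape of the code untouched. The sketch below is a toy stand-in for a real minifier: it only renames whole-word identifiers and pairs the lines up. The function `minify_names` and the `rename_map` dictionary are hypothetical names, and the input is the first three lines of the callback example from the slides, handled as plain text.

```python
import re


def minify_names(source, rename_map):
    """Rename whole-word identifiers per rename_map (toy stand-in for a minifier)."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, rename_map)) + r")\b")
    # A single substitution pass, so a freshly inserted short name is never renamed again.
    return pattern.sub(lambda m: rename_map[m.group(1)], source)


original = "\n".join([
    "function callback(error, response, body) {",
    "  if (!error && response.statusCode == 200) {",
    "    var info = JSON.parse(body);",
])
rename_map = {"error": "o", "response": "s", "body": "a", "info": "c"}

minified = minify_names(original, rename_map)

# Renaming keeps the line structure, so line i of one side aligns with line i of the other.
aligned_corpus = list(zip(original.splitlines(), minified.splitlines()))
for source_line, minified_line in aligned_corpus:
    print(f"{minified_line!r} -> {source_line!r}")
```

Each pair in `aligned_corpus` plays the role of one aligned sentence pair, while the untouched original lines alone play the role of the monolingual corpus.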

  20. Problem: Obfuscated Identifiers in Code
      Can we use SMT for decompiled code?

  22. SMT Model for Decompiled Code?
      • Aligned original/decompiled source corpus (nontrivial)
      • Original source corpus

  23. Difficulty: Decompilation Changes Structure

      Original Source (9 lines):
          #include <stdio.h>
          int main() {
            int cur = 0;
            while (cur <= 9) {
              printf("%d\n", cur);
              ++cur;
            }
            return 0;
          }

      Decompiled Code (8 lines):
          #include <stdio.h>
          int main() {
            int v1 = 0;
            int v2;
            for (v2 = 0; v2 < 10; ++v2)
              printf("%d\n", v2);
            return v1;
          }

      • Different line count.
      • Different numbers of variables.
      • Different types of loops.
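The three differences listed on this slide can be checked directly on the example pair. This is a rough sketch using regular expressions, not a C parser: the `summarize` helper is a hypothetical name, and its patterns are only meant to handle this one small example.

```python
import re

ORIGINAL = r"""#include <stdio.h>
int main() {
  int cur = 0;
  while (cur <= 9) {
    printf("%d\n", cur);
    ++cur;
  }
  return 0;
}"""

DECOMPILED = r"""#include <stdio.h>
int main() {
  int v1 = 0;
  int v2;
  for (v2 = 0; v2 < 10; ++v2)
    printf("%d\n", v2);
  return v1;
}"""


def summarize(source):
    """Return (line count, declared int variables, loop keywords) for a small C snippet."""
    line_count = len(source.splitlines())
    # Matches "int name;" and "int name = ..." but not "int main(".
    variables = re.findall(r"\bint\s+(\w+)\s*[=;]", source)
    loops = re.findall(r"\b(for|while)\b", source)
    return line_count, variables, loops


print(summarize(ORIGINAL))
print(summarize(DECOMPILED))
```

Because the two sides disagree on line count, variable count, and loop form, pairing line i of the original with line i of the decompiled output does not yield a usable alignment the way it does for minified JavaScript.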

  27. Decompiled Code Corpus Generation

      Decompiled Code:
          #include <stdio.h>
          int main() {
            int v1 = 0;
            int v2;
            for (v2 = 0; v2 < 10; ++v2)
              printf("%d\n", v2);
            return v1;
          }

      ❌

      Original Code:
          #include <stdio.h>
          int main() {
            int cur = 0;
            while (cur <= 9) {
              printf("%d\n", cur);
              ++cur;
            }
            return 0;
          }
