Recognizing and Imitating Programmer Style: Adversaries in Program Authorship Attribution

SLIDE 1

Recognizing and Imitating Programmer Style: Adversaries in Program Authorship Attribution

Lucy Simko, Luke Zettlemoyer, Tadayoshi Kohno

simkol@cs.washington.edu
homes.cs.washington.edu/~simkol

SLIDE 2

Source Code Attribution

int main() {
    int i, j, k, l, m, n, st;
    char in[10000];
    int fg[5000], chk[128];
    int size, count = 0, res;
    scanf ("%d%d%d", &len, &n, &size);
    rep (i, n) scanf ("%s", dic[i]);
    while (size--) {
        scanf ("%s", in);
        st = 0;
        rep (k, n) fg[k] = 1;
        ...

[Figure: candidate authors A, B, C, D, E, F and a question mark: who wrote this code?]

SLIDE 3

State of the Art: Source Code Attribution

Caliskan-Islam et al. “De-anonymizing programmers via code stylometry.” 24th USENIX Security Symposium (USENIX Security), Washington, DC. 2015.

  • 98% accuracy over 250 programmers
  • Extract syntactic, lexical, and layout features from C/C++ code
  • Random Forest classifier
  • Data set: Google Code Jam
      ○ Programming competition
      ○ Lots of examples of people solving the same problem in different ways
  • Open source
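To make the idea of layout and lexical features concrete, here is a minimal sketch in C. The `StyleFeatures` struct, its fields, and the function names are hypothetical and chosen only for illustration; this is not the feature set or the code of the cited classifier, which also uses syntactic features that this sketch does not attempt.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical feature vector: a tiny illustrative subset, not the
   feature set used by the cited classifier. */
typedef struct {
    size_t newlines;       /* layout: newline characters */
    size_t tabs;           /* layout: tab characters */
    size_t spaces;         /* layout: space characters */
    size_t for_keywords;   /* lexical: whole-word occurrences of "for" */
    size_t while_keywords; /* lexical: whole-word occurrences of "while" */
} StyleFeatures;

static int is_ident_char(int c) { return isalnum(c) || c == '_'; }

/* Count occurrences of `word` in `src` that are not part of a longer
   identifier (so "fork" does not count as "for"). */
static size_t count_word(const char *src, const char *word) {
    size_t n = 0, len = strlen(word);
    if (len == 0) return 0;
    for (const char *p = src; (p = strstr(p, word)) != NULL; p += len) {
        int left_ok = (p == src) || !is_ident_char((unsigned char)p[-1]);
        int right_ok = !is_ident_char((unsigned char)p[len]);
        if (left_ok && right_ok) n++;
    }
    return n;
}

/* Scan the source text once for layout counts, then count two keywords. */
StyleFeatures extract_features(const char *src) {
    StyleFeatures f = {0, 0, 0, 0, 0};
    for (const char *p = src; *p; p++) {
        if (*p == '\n') f.newlines++;
        else if (*p == '\t') f.tabs++;
        else if (*p == ' ') f.spaces++;
    }
    f.for_keywords = count_word(src, "for");
    f.while_keywords = count_word(src, "while");
    return f;
}
```

A real stylometry pipeline would feed many such counts, normalized by file length, into the classifier; the sketch only shows how style leaves measurable traces in plain source text.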

SLIDE 4

Source Code Attribution

int main() {
    int i, j, k, l, m, n, st;
    char in[10000];
    int fg[5000], chk[128];
    int size, count = 0, res;
    scanf ("%d%d%d", &len, &n, &size);
    rep (i, n) scanf ("%s", dic[i]);
    while (size--) {
        scanf ("%s", in);
        st = 0;
        rep (k, n) fg[k] = 1;
        ...

[Figure: candidate authors A, B, C, D, E, F and a question mark]

98% accuracy!

SLIDE 6

Source Code Attribution

int main() {
    int i, j, k, l, m, n, st;
    char in[10000];
    int fg[5000], chk[128];
    int size, count = 0, res;
    scanf ("%d%d%d", &len, &n, &size);
    rep (i, n) scanf ("%s", dic[i]);
    while (size--) {
        scanf ("%s", in);
        st = 0;
        rep (k, n) fg[k] = 1;
        ...

[Figure: candidate authors A, B, C, D, E, F and a question mark; two CENSORED labels]

98% accuracy!

SLIDE 7

Research Question

Can we fool source code attribution classifiers?

Yes!

Methodology: Lab study* with C programmers

*Approved by University of Washington’s Human Subjects Division (IRB)

SLIDE 8

Outline

  • Motivation and Research Question
  • Source Code Attribution: Overview and Background
  • Evading Source Code Attribution: Definitions and Goals
  • Methodology
  • Results: Conservative Estimate of Adversarial Success
  • Results: How to Create Forgeries

SLIDE 11

Source Code Attribution

int main() {
    int i, j, k, l, m, n, st;
    char in[10000];
    int fg[5000], chk[128];
    int size, count = 0, res;
    scanf ("%d%d%d", &len, &n, &size);
    rep (i, n) scanf ("%s", dic[i]);
    while (size--) {
        scanf ("%s", in);
        st = 0;
        rep (k, n) fg[k] = 1;
        ...

[Figure: authors A, B, C, D, E, F → Classifier; program Pc]

SLIDE 13

Source Code Attribution

[Figure: Pc → Classifier trained on {A, B, C, D, E} → output: C ✓]

C: who the classifier thinks wrote this code.

SLIDE 15

Evading Source Code Attribution

  1. Train: Given code from original and target authors, learn styles
  2. Modify original code to imitate target author (forgery)
      ○ Or just hide the original author’s style (masking)

[Figure: Pc → adversarial manipulation → Pc’]

Pc’: code originally by C, but modified by an adversary.

SLIDE 16

Evading Source Code Attribution

  1. Train: Given code from original and target authors, learn styles
  2. Modify original code to imitate target author (forgery)
      ○ Or just hide the original author’s style (masking)

[Figure: Pc → adversarial manipulation → Pc’ → Classifier trained on {A, B, C, D, E} → output: A (forgery)]

SLIDE 18

Lab Study: Dataset

  • C code
  • We used a linter [1] to eliminate many typographic style differences
  • ~4000 authors: avg 2.2 files each
  • 5 authors with the most files: avg ~42.8 files
      ○ Authors: A, B, C, D, E

[1] http://astyle.sourceforge.net/

SLIDE 19

Lab Study: Create Forgeries

Classifier C5, trained on {A, B, C, D, E}

Precision: 100%, Recall: 100% (10-fold cross-validation)

SLIDE 20

Lab Study: Create Forgeries

Classifier C20, trained on {A, B, C, D, E, ... + 15}

Precision: 87.6%, Recall: 88.2% (10-fold cross-validation)

SLIDE 21

Lab Study: Create Forgeries

Classifier C50, trained on {A, B, C, D, E, ... + 45}

Precision: 82.3%, Recall: 84.5% (10-fold cross-validation)
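As a reminder of what the precision and recall figures measure, the following sketch computes both for one author from a confusion matrix. The matrix layout, the function names, and `NUM_AUTHORS` are assumptions made for illustration; the slides do not say how the study averaged per-author scores or folds, so this shows only the per-author definitions.

```c
#define NUM_AUTHORS 5  /* A..E, matching the size of the C5 author set */

/* confusion[t][p] = number of files by true author t that the
   classifier attributed to author p (hypothetical layout). */

/* Precision: of all files attributed to `author`, the fraction truly theirs. */
double precision_for(int confusion[NUM_AUTHORS][NUM_AUTHORS], int author) {
    int attributed = 0;
    for (int t = 0; t < NUM_AUTHORS; t++) attributed += confusion[t][author];
    return attributed ? (double)confusion[author][author] / attributed : 0.0;
}

/* Recall: of all files truly by `author`, the fraction attributed to them. */
double recall_for(int confusion[NUM_AUTHORS][NUM_AUTHORS], int author) {
    int actual = 0;
    for (int p = 0; p < NUM_AUTHORS; p++) actual += confusion[author][p];
    return actual ? (double)confusion[author][author] / actual : 0.0;
}
```

With 100% precision and recall, as reported for C5, the matrix would be purely diagonal: every file is attributed to its true author.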

SLIDE 22

Lab Study: Create Forgeries

28 C programmers (participants):

  1. Train: Given code from original and target author, learn styles
  2. Modify original code to imitate target author’s style (forgery)

[Figure: participant modifies Px to produce Px’ → Classifier → output: Y (forgery). X, Y ∈ {A, B, C, D, E}]

SLIDE 23

Lab Study: Create Forgeries

28 C programmers (participants):

  1. Train: Given code from original and target author, learn styles
  2. Modify original code to imitate target author’s style (forgery)
  3. Check forgery success against oracle classifiers

[Figure: participant modifies Px to produce Px’, which is checked against classifiers C5, C20, and C50; outputs shown: X, Y, Y. X, Y ∈ {A, B, C, D, E}]

SLIDE 25

Results: Estimate of Adversarial Success

Percent of final forgery attempts that were successful attacks

            C5      C20     C50
Forgery     66.6%   70.0%   73.0%
Masking     76.6%   76.6%   86.6%

C5, C20, C50: versions of the state-of-the-art machine classifier. The subscript indicates the number of authors in the training set.

SLIDE 26

Results: Estimate of Adversarial Success

Percent of final forgery attempts that were successful attacks

            C5      C20     C50
Forgery     66.6%   70.0%   73.0%
Masking     76.6%   76.6%   86.6%

Forgery: adversary is pretending to be a specific target author.
Masking: adversary is obscuring the original author.

SLIDE 27

Results: Estimate of Adversarial Success

Percent of final forgery attempts that were successful attacks

            C5      C20     C50
Forgery     66.6%   70.0%   73.0%
Masking     76.6%   76.6%   86.6%

A successful forgery attack means the classifier output the target author instead of the original author of the code. 66.6% of forgery attacks against the C5 classifier were successful.
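The two success criteria can be written down as simple predicates. This is a sketch of one reading of the slide definitions, not the study's evaluation code; the `Author` enum and all function names are hypothetical. Under this reading, masking counts any misclassification away from the original author, so every successful forgery of a different author is also a successful masking.

```c
#include <stdbool.h>

/* Hypothetical labels for the five authors in the study. */
typedef enum { AUTHOR_A, AUTHOR_B, AUTHOR_C, AUTHOR_D, AUTHOR_E } Author;

/* Forgery succeeds when the classifier names the target author. */
bool forgery_succeeded(Author predicted, Author target) {
    return predicted == target;
}

/* Masking succeeds when the classifier names anyone but the original author. */
bool masking_succeeded(Author predicted, Author original) {
    return predicted != original;
}

/* Percentage of successful attempts; returns 0 for an empty list. */
double percent_successful(const bool *outcomes, int n) {
    int wins = 0;
    if (n <= 0) return 0.0;
    for (int i = 0; i < n; i++) {
        if (outcomes[i]) wins++;
    }
    return 100.0 * wins / n;
}
```

A table row such as "Forgery" would then be `percent_successful` applied to the `forgery_succeeded` outcomes of all final attempts against one classifier.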

SLIDE 28

Results: Estimate of Adversarial Success

Percent of final forgery attempts that produced a misclassification

            C5      C20     C50
Forgery     66.6%   70.0%   73.0%
Masking     76.6%   76.6%   86.6%

C50 attributed forgeries correctly only 13.4% of the time.

SLIDE 29

Results: Estimate of Adversarial Success

Percent of final forgery attempts that produced a misclassification

            C5      C20     C50
Forgery     66.6%   70.0%   73.0%
Masking     76.6%   76.6%   86.6%

Lesson: Non-experts can successfully attack this state-of-the-art classifier, suggesting other authorship classifiers may be vulnerable to the same type of attacks.

SLIDE 31

Results: Methods of Forgery Creation

Lesson: Forgers did not know the features the classifier was using for attribution. This suggests that forgeries in the wild might contain the same types of modifications.

SLIDE 32

Example: Two Programs by Author C

Program 1:

// libraries imported
#define REP(i,a,b) for(i=a;i<b;i++)
#define rep(i,n) REP(i,0,n)
// variables defined
int main() {
    int i, j, k, l, m, n, st;
    char in[10000];
    int fg[5000], chk[128];
    int size, count = 0, res;
    scanf ("%d%d%d", &len, &n, &size);
    rep (i, n) scanf ("%s", dic[i]);
    while (size--) {
        scanf ("%s", in);
        st = 0;
        rep (k, n) fg[k] = 1;

Program 2:

// libraries imported
#define REP(i,a,b) for(i=a;i<b;i++)
#define rep(i,n) REP(i,0,n)
// variables defined
int main() {
    int i, j, k, l, m, n, t, ok;
    int a, b, c;
    int size, count = 0;
    scanf ("%d", &size);
    while (size--) {
        scanf ("%d%d", &n, &m);
        rep (i, m) {
            scanf ("%d", s + i);

SLIDE 34

Example: Forgery of Author C

Information Structure

  • Variable name
  • Syntax
  • Macros
  • API calls

Control Flow

  • Loop type

SLIDE 35

Example: Creating a Forgery of Author C

int main() {
    int i,j,k;
    int cc,ca;
    cin >> ca;
    for(cc=1;cc<=ca;cc++) {
        cin >> D >> I >> M >> N;
        for(i=0; i<N; i++)
            cin >> original[i];
        ...

Classifier output: A

SLIDE 36

ORIGINAL (Classifier output: A)

int main() {
    int i,j,k;
    int cc,ca;
    cin >> ca;
    for(cc=1;cc<=ca;cc++) {
        cin >> D >> I >> M >> N;
        for(i=0; i<N; i++)
            cin >> original[i];
        ...

FORGERY (Classifier output: ??)

int main() {
    int i,j,k;
    int cc,ca;
    cin >> ca;
    for(cc=1;cc<=ca;cc++) {
        cin >> D >> I >> M >> N;
        for(i=0; i<N; i++)
            cin >> original[i];
        ...

SLIDE 37

ORIGINAL (Classifier output: A)

int main() {
    int i,j,k;
    int cc,ca;
    cin >> ca;
    for(cc=1;cc<=ca;cc++) {
        cin >> D >> I >> M >> N;
        for(i=0; i<N; i++)
            cin >> original[i];
        ...

FORGERY (Classifier output: ??)

#define REP(i,a,b) for(i=a;i<b;i++)
#define rep(i,n) REP(i,0,n)
int main() {
    int i,j,k;
    int cc,ca;
    cin >> ca;
    for(cc=1;cc<=ca;cc++) {
        cin >> D >> I >> M >> N;
        for(i=0; i<N; i++)
            cin >> original[i];
        ...

SLIDE 38

ORIGINAL (Classifier output: A)

int main() {
    int i,j,k;
    int cc,ca;
    cin >> ca;
    for(cc=1;cc<=ca;cc++) {
        cin >> D >> I >> M >> N;
        for(i=0; i<N; i++)
            cin >> original[i];
        ...

FORGERY (Classifier output: ??)

#define REP(i,a,b) for(i=a;i<b;i++)
#define rep(i,n) REP(i,0,n)
int main() {
    int i,j,k;
    int size, count = 0;
    cin >> size;
    for(count=1;count<=size;count++) {
        cin >> D >> I >> M >> N;
        for(i=0; i<N; i++)
            cin >> original[i];
        ...

SLIDE 40

ORIGINAL (Classifier output: A)

int main() {
    int i,j,k;
    int cc,ca;
    cin >> ca;
    for(cc=1;cc<=ca;cc++) {
        cin >> D >> I >> M >> N;
        for(i=0; i<N; i++)
            cin >> original[i];
        ...

FORGERY (Classifier output: ??)

#define REP(i,a,b) for(i=a;i<b;i++)
#define rep(i,n) REP(i,0,n)
int main() {
    int i,j,k;
    int size, count = 0;
    scanf("%d", &size);
    for(count=1;count<=size;count++) {
        scanf("%d%d%d%d", &D, &I, &M, &N);
        for(i=0; i<N; i++)
            scanf("%d", original+i);
        ...

SLIDE 41

ORIGINAL (Classifier output: A)

int main() {
    int i,j,k;
    int cc,ca;
    cin >> ca;
    for(cc=1;cc<=ca;cc++) {
        cin >> D >> I >> M >> N;
        for(i=0; i<N; i++)
            cin >> original[i];
        ...

FORGERY (Classifier output: ??)

#define REP(i,a,b) for(i=a;i<b;i++)
#define rep(i,n) REP(i,0,n)
int main() {
    int i,j,k;
    int size, count = 0;
    scanf("%d", &size);
    while (size--) {
        scanf("%d%d%d%d", &D, &I, &M, &N);
        rep (i,N) scanf("%d", original+i);
        ...

SLIDE 43

ORIGINAL (Classifier output: A)

int main() {
    int i,j,k;
    int cc,ca;
    cin >> ca;
    for(cc=1;cc<=ca;cc++) {
        cin >> D >> I >> M >> N;
        for(i=0; i<N; i++)
            cin >> original[i];
        ...

FORGERY (Classifier output: C)

#define REP(i,a,b) for(i=a;i<b;i++)
#define rep(i,n) REP(i,0,n)
int main() {
    int i,j,k;
    int size, count = 0;
    scanf("%d", &size);
    while (size--) {
        scanf("%d%d%d%d", &D, &I, &M, &N);
        rep (i,N)scanf("%d", original+i);
        ...
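The loop rewrites in this example change style without changing behavior. The sketch below isolates that point in compilable C: one function uses the original's counted `for` loops, the other uses the `while (size--)` loop and `rep` macro seen in author C's code. The function names and the summing task are hypothetical stand-ins, since the slide code elides its loop bodies and reads its values from input.

```c
/* Macros in the style of author C, as shown on the slides. */
#define REP(i,a,b) for(i=a;i<b;i++)
#define rep(i,n) REP(i,0,n)

/* Original style: counted for-loops over test cases and elements. */
long sum_original_style(const int *values, int cases, int len) {
    long total = 0;
    int cc, i;
    for (cc = 1; cc <= cases; cc++) {
        for (i = 0; i < len; i++) total += values[i];
    }
    return total;
}

/* Forged style: count the cases down with while, iterate with rep.
   Equivalent to the version above only when size >= 0; a negative
   size would not terminate the way the for-loop version does. */
long sum_forged_style(const int *values, int size, int len) {
    long total = 0;
    int i;
    while (size--) {
        rep (i, len) total += values[i];
    }
    return total;
}
```

Each rewrite touches only a line or two, which fits the talk's later point that local modifications do not require understanding the whole program.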

SLIDE 50

Results: Methods of Forgery Creation

Local modifications: only need to understand a line or two of code
Algorithmic modifications: need a more comprehensive understanding of the code

Information Structure

  • Variable name
  • Syntax
  • Macros
  • API calls
  • Libraries imported
  • Variable decl location
  • Variable type
  • Data structures
  • Static and dynamic memory usage

Control Flow

  • Loop type
  • If-statements
  • Assignments per line
  • Control flow keywords
  • Loop logic
  • Functions refactored
  • Inlined API calls
  • Major addition or removal of control structures

SLIDE 51

Results: Methods of Forgery Creation

Lessons from methods of forgery creation:

  • Local modifications are common.
  • Some forgers copied code directly from the target author’s training set.

SLIDE 52

Summary

  • Programmers desiring privacy or with malicious intent may seek to evade source code attribution classifiers
  • Lab study with C programmers producing forgeries, showing unsophisticated adversaries can fool a state-of-the-art classifier
  • Forgeries were successful with local changes that do not require a high-level understanding of the programming style.
  • More recommendations in paper!

My coauthors: Luke Zettlemoyer, Tadayoshi Kohno
Contact me: Lucy Simko, simkol@cs.washington.edu, https://homes.cs.washington.edu/~simkol/