
Graph-based, Self-Supervised Program Repair from Diagnostic Feedback

Graph-based, Self-Supervised Program Repair from Diagnostic Feedback
ICML 2020
Michihiro Yasunaga, Percy Liang
Stanford University
Why program repair?
● Programmers spend 75% of time fixing source code errors
● Automatic program repair can


  1. Our contributions
     2. Self-supervised learning
     ○ Collect unlabeled programs
     ○ Corrupt and get diagnostic feedback (e.g. run compiler)
     ⇒ Extra training data: <broken code, feedback, fixed code>
     Working code                 Corrupted
     5  int i, n;                 5  int i, n;
     6  string A;                 6  char A;
     7  cin >> n;                 7  cin >> n .
     8  A.resize(n);              8  A.resize(n);
     9  for (i=0;i<n;i++){        9  for (i=0;i<n;i++){
     10 cin >> A[i];              10 cin >> A[j];
     11 cout << i; }              11 cout << i; }
     Compiler output on the corrupted code: Error! line 7: expected ‘;’

  2. Our results
     Improved performance on two applications
     ● DeepFix: correct intro programming assignments in C
     ● SPoC: correct output of C++ program synthesis
     [Charts: DeepFix Test, SPoC TestP]

  3. Outline
     ● Innovations
       1. Reasoning via program-feedback graph
       2. Self-supervised learning
     ● Evaluations
       1. DeepFix
       2. SPoC
     ● Analysis & Examples
     ● Takeaways

  4. 1. Reasoning via program-feedback graph

  5. 1. Reasoning via program-feedback graph
     Challenges
     ● How to connect two modalities: program and feedback?
     ● How to model the reasoning of repair (e.g. tracking symbols)?
     Source code
       4  int main() {
       5  char tmp, a, b;
       6  map<string,int> mp;
       7  cin >> a >> b;
       8  int i, j;
       9  for (i = 0; i < a.size() ...
       10 tmp.push_back(a[i]);
       11 string tmp1 = tmp;
     Compiler message
       request for member ‘size’ in ‘a’, which is of non-class type ‘char’

  6. 1. Reasoning via program-feedback graph
     Our solution: program-feedback graph
     ● Join program & feedback through symbols relevant to program repair → shared/abstracted semantic space
     ● Reason over this space using graph attention
     [Figure: the same source code and compiler message, with graph nodes placed on the tokens ‘char’, ‘a’, ‘b’ and ‘size’ in both the program and the feedback]

  9. 1. Reasoning via program-feedback graph
     How to construct graph?
     ● Recognize token types: <data type>, <identifier>, <operator>, <diagnostic argument>
     ● Nodes: diagnostic arguments, their occurrences, and all identifiers
     ● Edges: connect identical tokens to capture semantic correspondence
     Source code
       4  int main() {
       5  char tmp, a, b;
       6  map<string,int> mp;
       7  cin >> a >> b;
       8  int i, j;
       9  for (i = 0; i < a.size() ...
       10 tmp.push_back(a[i]);
       11 string tmp1 = tmp;
     Compiler message
       request for member ‘size’ in ‘a’, which is of non-class type ‘char’
     [Figure: the diagnostic arguments ‘size’, ‘a’ and ‘char’ in the compiler message are linked to their occurrences in the source code (‘char’ and ‘a’ on line 5, ‘a’ on lines 7, 9 and 10, ‘size’ on line 9); the identifier ‘b’ (lines 5 and 7) is also a node]
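     The construction rules above can be sketched in C++. This is a minimal illustration, not the authors' implementation: it assumes the program and the compiler message are already tokenized with token types recognized, and the names `Token`, `Node`, `Graph` and `build_graph` are hypothetical.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical token record; token types are assumed to be recognized beforehand.
enum class TokenType { DataType, Identifier, Operator, DiagnosticArg, Other };

struct Token {
    std::string text;
    TokenType type;
};

// A graph node remembers which sequence its token came from.
struct Node {
    std::string text;
    bool from_feedback;  // true: compiler message, false: source code
    std::size_t index;   // token position within its own sequence
};

struct Graph {
    std::vector<Node> nodes;
    std::vector<std::pair<std::size_t, std::size_t>> edges;  // undirected, first < second
};

Graph build_graph(const std::vector<Token>& code, const std::vector<Token>& feedback) {
    Graph g;
    // Diagnostic arguments in the feedback become nodes.
    std::set<std::string> args;
    for (std::size_t i = 0; i < feedback.size(); ++i) {
        if (feedback[i].type == TokenType::DiagnosticArg) {
            args.insert(feedback[i].text);
            g.nodes.push_back({feedback[i].text, true, i});
        }
    }
    // In the code, keep occurrences of diagnostic arguments and all identifiers.
    for (std::size_t i = 0; i < code.size(); ++i) {
        if (code[i].type == TokenType::Identifier || args.count(code[i].text) > 0) {
            g.nodes.push_back({code[i].text, false, i});
        }
    }
    // Connect every pair of nodes that carry the identical token text.
    for (std::size_t i = 0; i < g.nodes.size(); ++i) {
        for (std::size_t j = i + 1; j < g.nodes.size(); ++j) {
            if (g.nodes[i].text == g.nodes[j].text) g.edges.emplace_back(i, j);
        }
    }
    return g;
}
```

     Linking every pair of identical tokens is the simplest reading of "connect identical tokens"; the slides do not say whether the original graph is this dense.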

  24. 1. Reasoning via program-feedback graph
      Model
      ● Initial encoding
      ● Graph attention
      ● Recontextualization
      ● Decoding

  25. 1. Reasoning via program-feedback graph
      Model (Initial encoding)
      Source code
        1 int main() {
        2 char tmp, a, b;
        3 map<string,int> mp;
        ...
      Compiler message
        9: request for member ‘size’ …
      [Figure: the source code is split into Line 1, Line 2 and Line 3, each shown as a token sequence (x 11, x 12, x 13) under its own LSTM code (1) block; the feedback (compiler message) is shown as a line index (i err) and the message content (m 1, m 2, m 3) under an LSTM msg (1) block; the figure also carries the label "Position embedding"]

  31. 1. Reasoning via program-feedback graph
      Model (Graph attention)
      ● Message passing across tokens with long-range dependencies
      [Figure: a Graph Attention layer is placed above the initial encoders; on the program-feedback graph, token states of Line 1 (hx 11, hx 12, hx 13, ...), Line 2 (hx 21, hx 22, hx 23, ...), Line 3 (hx 31, hx 32, hx 33, ...) and the compiler message (hm 1, hm 2, hm 3, ...) are aggregated with multi-head attention]
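      To make the "aggregate" step concrete, here is a deliberately simplified C++ sketch of one attention update for a single node. It is an assumption-laden illustration: it uses one head and plain scaled dot-product scores with no learned projections, whereas the slides name multi-head attention; `Vec`, `dot` and `attend` are hypothetical names.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;

double dot(const Vec& a, const Vec& b) {
    double s = 0.0;
    for (std::size_t k = 0; k < a.size(); ++k) s += a[k] * b[k];
    return s;
}

// One simplified attention step for node `i`: score each neighbor with a scaled
// dot product, normalize the scores with softmax, and return the weighted sum of
// the neighbor states. All states are assumed to share one dimension.
// A node without neighbors keeps its own state.
Vec attend(const std::vector<Vec>& h, std::size_t i,
           const std::vector<std::size_t>& neighbors) {
    if (neighbors.empty()) return h[i];
    const double scale = 1.0 / std::sqrt(static_cast<double>(h[i].size()));
    std::vector<double> score(neighbors.size());
    for (std::size_t n = 0; n < neighbors.size(); ++n) {
        score[n] = dot(h[i], h[neighbors[n]]) * scale;
    }
    // Subtract the maximum before exponentiating for numerical stability.
    const double max_score = *std::max_element(score.begin(), score.end());
    double total = 0.0;
    for (double& s : score) {
        s = std::exp(s - max_score);
        total += s;
    }
    Vec out(h[i].size(), 0.0);
    for (std::size_t n = 0; n < neighbors.size(); ++n) {
        for (std::size_t k = 0; k < out.size(); ++k) {
            out[k] += (score[n] / total) * h[neighbors[n]][k];
        }
    }
    return out;
}
```

      In this sketch the neighbor list would come from the edges of the program-feedback graph, so a token only attends to identical tokens elsewhere in the program or the message.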

  34. 1. Reasoning via program-feedback graph
      Model (Recontextualization)
      Source code
        1 int main() {
        2 char tmp, a, b;
        3 map<string,int> mp;
        ...
      Compiler message
        9: request for member ‘size’ …
      [Figure: above the Graph Attention layer, Line 1, Line 2 and Line 3 each pass through an LSTM code (2) block and the message content (with the line index) through an LSTM msg (2) block; an LSTM code (3) block is stacked above them]

  37. 1. Reasoning via program-feedback graph
      Model (Decoding)
      ● Localize = 2 (MLP + softmax)
      ● Repair = "string tmp,a,b;" (Pointer-Generator Decoder)
      Source code
        1 int main() {
        2 char tmp, a, b;
        3 map<string,int> mp;
        ...
      Compiler message
        9: request for member ‘size’ …
      [Figure: both outputs are drawn above LSTM code (3), which sits on LSTM code (2), LSTM msg (2) and the Graph Attention layer]
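      The localization output ("Localize = 2") can be read as picking the most probable line after a softmax. The C++ sketch below shows only that last step under stated assumptions: the per-line scores stand in for the MLP output, which is not modeled, and `softmax` and `localize` are hypothetical names.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Softmax over per-line scores. The scores are a stand-in for the MLP output.
std::vector<double> softmax(const std::vector<double>& scores) {
    std::vector<double> p(scores.size());
    if (scores.empty()) return p;
    const double m = *std::max_element(scores.begin(), scores.end());
    double total = 0.0;
    for (std::size_t i = 0; i < scores.size(); ++i) {
        p[i] = std::exp(scores[i] - m);  // shift by the maximum for stability
        total += p[i];
    }
    for (double& v : p) v /= total;
    return p;
}

// Returns the 1-based number of the most probable line, or 0 if there are no lines.
std::size_t localize(const std::vector<double>& scores) {
    const std::vector<double> p = softmax(scores);
    if (p.empty()) return 0;
    return static_cast<std::size_t>(std::max_element(p.begin(), p.end()) - p.begin()) + 1;
}
```

      With three lines scored so that the second is highest, `localize` returns 2, matching the slide's example where line 2 (`char tmp, a, b;`) is the line to rewrite.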

  39. 1. Reasoning via program-feedback graph
      Model overview

  40. 2. Self-supervised learning

  41. 2. Self-supervised learning
      Why?
      ● Labeled datasets of program repair are small (10-100K examples)
      ● Vast amount of unlabeled programs available online (>> 1M submissions, > 30M repos)
      ● Can we leverage them to improve learning?

  42. 2. Self-supervised learning
      Our idea (outline)
      Step 1. Collect unlabeled, working programs y
      Step 2. Design (randomized) program corruption procedure P
      Step 3. Corrupt and get diagnostic feedback (e.g. run compiler)
              ⇒ Extra training data: <broken code x, feedback f, fixed code y>
      Step 4. Use them for pre-training

  43. 2. Self-supervised learning
      1. Collect unlabeled programs
      ● Our target tasks (DeepFix & SPoC) are in C/C++
      ● Collect 300K working C++ programs from codeforces.com

  44. 2. Self-supervised learning
      2. How to design corruption procedure P?
      ● Look at common errors (know your enemy)
      Statistics [Gupta et al. 17] [Mesbah et al. 19] [Kulal et al. 19]
      Error type               Avg.  Experienced  Beginner  SPoC
      Expected ...             30%                48%       35%
        ● operator/punctuator  23%   9%           37%       29%
        ● primary expression   7%                 11%       6%
      Identifier type          11%   9%           5%        18%
      Identifier undeclared    42%   62%          33%       31%
      Others                   17%   20%          14%       16%
      Common compiler messages
      Expected ...:          expected @@@ (e.g. expected ‘;’ before...); missing @@@ (e.g. missing ")
      Identifier type:       redeclaration/conflicting declaration; invalid conversion from <type> to <type>
      Identifier undeclared: @@@ was not declared
      Others:                ‘else’ without a previous ‘if’; no matching function for call to...

  50. 2. Self-supervised learning
      2. How to design corruption procedure P?
      ● Look at common errors
      ● Design perturbation modules M to cause those errors
      Perturbation module                                              Example
      Syntax (delete/insert/replace operators .;(){}'"+, etc.)         return 0; }     → return 0; } }
      ID-type (delete/insert/replace type)                             string tmp;     → char tmp;
      ID-typo (delete/insert/replace IDentifier)                       int a, b=0, m;  → int a, m;
      Keyword (delete/insert/replace keyword/call)                     if (n >= 0)     → while (n >= 0)

  55. 2. Self-supervised learning
      2. Our corruption procedure P
      ● Look at common errors
      ● Design perturbation modules M to cause those errors
      ● P: Sample 1-5 modules from M, and apply to program sequentially
        e.g. ID-type, ID-typo, Syntax.
      Working code
        5  int i, n;
        6  string A;
        7  cin >> n;
        8  A.resize(n);
        9  for (i = 0; i < n; i++){
        10 cin >> A[i];
        11 cout << i; }
      Perturbed 1 (line 6 changed)
        6  char A;
      Perturbed 2 (line 10 changed as well)
        10 cin >> A[j];
      Perturbed 3 (line 7 changed as well)
        7  cin >> n .
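      The procedure P can be sketched in C++ as follows. This is a toy illustration under stated assumptions: the two modules are hypothetical, string-level stand-ins for the Syntax and ID-type modules, and sampling modules with replacement is an assumption the slides do not confirm.

```cpp
#include <cstddef>
#include <functional>
#include <random>
#include <string>
#include <vector>

// A perturbation module maps a program to a perturbed program.
using Module = std::function<std::string(const std::string&)>;

// Toy Syntax module (hypothetical): delete the first ';' if there is one.
std::string drop_semicolon(const std::string& program) {
    std::string out = program;
    const std::size_t pos = out.find(';');
    if (pos != std::string::npos) out.erase(pos, 1);
    return out;
}

// Toy ID-type module (hypothetical): replace the first "string" with "char".
std::string string_to_char(const std::string& program) {
    std::string out = program;
    const std::size_t pos = out.find("string");
    if (pos != std::string::npos) out.replace(pos, 6, "char");
    return out;
}

// P: sample between 1 and 5 modules from M and apply them one after another.
// Assumption: modules are drawn uniformly and with replacement.
std::string corrupt(const std::string& program, const std::vector<Module>& M,
                    std::mt19937& rng) {
    if (M.empty()) return program;
    std::uniform_int_distribution<int> count(1, 5);
    std::uniform_int_distribution<std::size_t> pick(0, M.size() - 1);
    std::string out = program;
    const int steps = count(rng);
    for (int s = 0; s < steps; ++s) out = M[pick(rng)](out);
    return out;
}
```

      A call such as `corrupt(y, {drop_semicolon, string_to_char}, rng)` with a seeded `std::mt19937` yields one corrupted version of the working program `y`; calling it repeatedly gives several corrupted versions per program.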

  61. 2. Self-supervised learning
      3. Prepare pre-training data
      ● 300K working programs from codeforces.com
      ● For each program, create corrupted versions by applying P
      ⇒ New program repair examples: <broken code, feedback, fixed code>
      Working code                 Corrupted (after applying P)
      5  int i, n;                 5  int i, n;
      6  string A;                 6  char A;
      7  cin >> n;                 7  cin >> n .
      8  A.resize(n);              8  A.resize(n);
      9  for (i=0;i<n;i++){        9  for (i=0;i<n;i++){
      10 cin >> A[i];              10 cin >> A[j];
      11 cout << i; }              11 cout << i; }
      Compiler output on the corrupted code: Error! line 7: expected ‘;’
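      Assembling the <broken code, feedback, fixed code> triples might look like the C++ sketch below. The corruption procedure and the compiler are passed in as callbacks so the example stays self-contained; `Example`, `Corrupt`, `Compiler` and `make_examples` are hypothetical names, and skipping corrupted programs that still compile is an assumption.

```cpp
#include <functional>
#include <string>
#include <vector>

// One pre-training example: <broken code, feedback, fixed code>.
struct Example {
    std::string broken;
    std::string feedback;
    std::string fixed;
};

// Hypothetical interfaces: `Corrupt` applies P once; `Compiler` returns the
// diagnostic message, or an empty string when the program compiles.
using Corrupt = std::function<std::string(const std::string&)>;
using Compiler = std::function<std::string(const std::string&)>;

std::vector<Example> make_examples(const std::vector<std::string>& working,
                                   const Corrupt& corrupt, const Compiler& compile,
                                   int versions_per_program) {
    std::vector<Example> data;
    for (const std::string& y : working) {
        for (int v = 0; v < versions_per_program; ++v) {
            const std::string x = corrupt(y);
            const std::string f = compile(x);
            // Assumption: a corrupted program that still compiles carries no
            // feedback, so it is skipped.
            if (!f.empty()) data.push_back({x, f, y});
        }
    }
    return data;
}
```

      The original working program serves as the "fixed code" label, which is why no manual annotation is needed.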

  68. 2. Self-supervised learning
      What’s interesting?
      ● Typically, pre-training task ≠ target task (e.g. masked LM vs. QA)
      ● Here, targeted pre-training (pre-training task = target task = program repair)
        ○ More direct pre-training structure
        ○ Data distributions can be different between pre-training & target

  69. Evaluation 1: DeepFix

  70. Evaluation 1: DeepFix
      Task [Gupta et al., 17]
      ● Repair C programs
      ● May have multiple error lines
      ● Apply repair model iteratively (up to 5 times)
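      The iterative application can be sketched as a loop that alternates compiling and repairing. This C++ sketch assumes two hypothetical callbacks, `Compiler` and `RepairModel`, standing in for the real compiler and the trained model; only the budget of 5 attempts comes from the slide.

```cpp
#include <functional>
#include <string>

// Hypothetical interfaces: `Compiler` returns the error message, or an empty
// string when the program compiles; `RepairModel` maps (program, message) to
// an edited program.
using Compiler = std::function<std::string(const std::string&)>;
using RepairModel = std::function<std::string(const std::string&, const std::string&)>;

struct RepairResult {
    std::string program;
    bool compiled;
    int attempts;  // number of times the repair model was applied
};

RepairResult repair_iteratively(std::string program, const Compiler& compile,
                                const RepairModel& model, int max_attempts = 5) {
    int attempts = 0;
    while (true) {
        const std::string error = compile(program);
        if (error.empty()) return {program, true, attempts};
        if (attempts == max_attempts) return {program, false, attempts};
        // Each attempt sees only the current program and its latest message.
        program = model(program, error);
        ++attempts;
    }
}
```

      Under this loop, a program with two independent errors can be fixed in two attempts, one error per attempt.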

  71. Evaluation 1: DeepFix
      Our model outputs
      Input code
        4  int main() {
        5  int n;
        6  int * m[2];
        7  m[0] = malloc(n*sizeof(int));
        8  m[1] = malloc(n*sizeof(int));
        9  for (i = 0; i < n; i++) {
        10 m[0][i] = -1;
        11 m[1][i] = -1; }
        12 return 0 }
      Error message: line 9: ‘i’ undeclared
      DrRepair, Attempt 1 (line 5 rewritten)
        5  int n, i;
      Error message: line 12: expected ‘;’ before ‘}’
      DrRepair, Attempt 2 (line 12 rewritten)
        12 return 0 ; }
      Compiled!!

  77. Evaluation 1: DeepFix
      Results
      [Chart: Test (full repair accuracy)]
      ● Prior works do not use compiler messages
      ● Use of compiler messages is important
