SLIDE 1 UnnaturalNets
Work by Joshua Campbell, Eddie Antonio Santos, J. Nelson Amaral, and Abram Hindle
Joshua is on the postdoc market! Sept 2018
SLIDE 2
SLIDE 3 Introduction
- Exploring bimodal program analysis via a naturalness lens
  – Syntax Error Detection
  – Syntax Error Correction
  – Crash Report Clustering
- Information Retrieval + Naturalness
SLIDE 4 Maybe don’t answer that
SLIDE 5 If you can hear a record needle scratch, that’s the language model in your head expressing skepticism around this answer
SLIDE 6 Southwest Airlines? Seems a little weird. I could think of something more appropriate
SLIDE 7 P(teapot) = a little. P(tired) = a little. P(southwest) = a little. Well, that’s perplexing
SLIDE 8
SLIDE 9 probable probable
SLIDE 10 probable probable unknown
SLIDE 11 I know good code pretty well
SLIDE 12 But bad code? I don’t see much of it. In fact, it is surprising
SLIDE 13 N-Gram Language Model on Good Source Code
SURPRISE? Cross-Entropy, Perplexity
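The surprise measures on this slide can be made concrete. Below is a toy sketch of cross-entropy and perplexity under a bigram language model; the corpus, the add-one smoothing, and the token sequences are illustrative inventions, not UnnaturalCode's actual model:

```python
import math
from collections import Counter

def bigram_model(corpus_tokens):
    """Bigram probabilities with add-one smoothing over a token corpus."""
    unigrams = Counter(corpus_tokens)
    bigrams = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    vocab = len(unigrams)
    def prob(prev, tok):
        return (bigrams[(prev, tok)] + 1) / (unigrams[prev] + vocab)
    return prob

def cross_entropy(prob, tokens):
    """Average surprise, in bits per token."""
    bits = [-math.log2(prob(p, t)) for p, t in zip(tokens, tokens[1:])]
    return sum(bits) / len(bits)

# "Good source code" corpus (invented for illustration):
corpus = "for ( i = 0 ; i < n ; i ++ )".split() * 50
prob = bigram_model(corpus)

h_nat = cross_entropy(prob, "for ( i = 0 ;".split())
h_weird = cross_entropy(prob, "for ( ; = 0 i".split())
perplexity = 2 ** h_nat
# The scrambled sequence is more surprising: h_weird > h_nat.
```

A sequence the model has seen before gets low cross-entropy; a scrambled one triggers the "needle scratch": high cross-entropy, high perplexity.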
SLIDE 14 Optimistic programs: working because we knew they would. Check by running. Statically Typed World: types known, compile-time checked. Dynamically Typed World
SLIDE 15 Statically Typed World: types known, compile-time checked. Available oracle: “This is a valid program”
SLIDE 16 Java can answer “This is a valid program”. Test: javac X.java
SLIDE 17 Optimistic programs: duck typing, types change at runtime, message passing. Runtime checking. NO ORACLE
SLIDE 18 No oracle. Will run programs with syntax errors. Functions won’t throw exceptions unless run. Resort to testing or running. Misspellings need to be run to be caught
SLIDE 19 Example
for (int i = 0; i < scorers.length; i++) {
  if (scorers[i].nextDoc() == NO_MORE_DOCS)
    // If even one of the sub-scorers does not have
    // scorer should not attempt to do any more work
    lastDoc = NO_MORE_DOCS;
    return;
  }
}

Does it work?
SLIDE 20 Example Output: Java
Check near == NO_MORE_DOCS) lastDoc = NO_MORE_DOCS; return With entropy 4.552985
Check near () == NO_MORE_DOCS ) lastDoc = NO_MORE_DOCS With entropy 4.498802
Check near NO_MORE_DOCS) lastDoc = NO_MORE_DOCS; return; With entropy 4.244520
Check near ) lastDoc = NO_MORE_DOCS; return; } With entropy 4.183379
Check near ) == NO_MORE_DOCS) lastDoc = NO_MORE_DOCS; With entropy 3.858807
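The "Check near ... With entropy ..." lines above come from scoring windows of the token stream and reporting the most surprising ones. A minimal sketch of that windowed ranking follows; the window width, the toy entropy function, and the KNOWN bigram set are placeholders for a real trained n-gram model:

```python
def rank_windows(tokens, entropy_fn, width=5):
    """Score every width-token window; most surprising first."""
    scored = [(entropy_fn(tokens[i:i + width]), i, tokens[i:i + width])
              for i in range(len(tokens) - width + 1)]
    return sorted(scored, reverse=True)

# Toy "model": the set of bigrams seen in good code (a real system
# would use n-gram probabilities learned from a corpus instead).
KNOWN = {("for", "("), ("(", "i"), ("i", "="), ("=", "0"), ("0", ";"),
         (";", "i"), ("i", "<"), ("<", "n"), ("n", ";"),
         ("i", "++"), ("++", ")")}

def entropy_fn(window):
    # Surprise = number of bigrams the model has never seen.
    return sum((a, b) not in KNOWN for a, b in zip(window, window[1:]))

tokens = "for ( i = 0 ; ; i < n ; i ++ )".split()  # note the doubled ";"
top = rank_windows(tokens, entropy_fn)[0]
# The highest-entropy window covers the stray ";".
```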
SLIDE 21 Example Output: Java
- Detects Random Token Replacement
Check near ; StorableField[] fields = d.getFields(fieldName); for With entropy 2.175943
Check near XXXXXXXX== 0) { continue; } float idf = similarity. With entropy 2.119329
...
SLIDE 22 Example Output: Java
- Detects Random Token Insertion
Check near ; StorableField[] fields = d.getFields(fieldName); for With entropy 2.175943
Check near BytesRef XXXXXXXXtext; while((text = termsEnum.next()) With entropy 2.134373
...
SLIDE 23 Example Output: Python
- Let’s get Pythonic!
- Easy to set up / Ready to use
SLIDE 24 Scoring our Performance
Java Performance Example
SLIDE 25 Validation: Java Self
Trained on Lucene 4.0.0
SLIDE 26 Validation: Next
Trained on Lucene 4.0.0
SLIDE 27 Validation: Only New
Trained on Lucene 4.0.0
SLIDE 28 Validation: Only Changed
Trained on Lucene 4.0.0
SLIDE 29 Validation: Other Project
Trained on Lucene 4.0.0
SLIDE 30 Scoring our Performance
- Python only returns at most 1 syntax error!
- For fair comparison, so does UnnaturalCode
Metric: % located
SLIDE 31 Validation: Python Self
          UC Python   UC Java
Delete    .74 Acc     .87 MRR
Insert    .83 Acc     .99 MRR
Replace   .77 Acc     .98 MRR
UC Python within 5 lines: >93%
SLIDE 32 Comparison: Python VS UC
          Python   UC    Python+UC
Delete    64%      65%   79%
Insert    64%      77%   86%
Replace   63%      74%   86%
Similar performance by themselves... but together, 9-23% more win!
SLIDE 33 Syntax Errors
We conclude that when you train a language model of working source code, “syntax errors just aren’t natural.”
SLIDE 34 Syntax and Sensibility
Using LSTMs to detect and fix syntax errors
Eddie Antonio Santos
SLIDE 36
Given
A file with one syntax error
Can we
- Find its exact location
- Suggest the fix
SLIDE 37 What is one syntax error?
Edit distance = 1 from a syntactically-valid file
- Addition
- Deletion
- Substitution
- Transposition?
SLIDE 38
Can we…?
- Suggest a fix
- Without handwritten rules?
SLIDE 40 Overview
10 panic 20 goto 10
Collect Train Predict Input Suggest
SLIDE 43 How?
next(for(i = 0; i < length;) = ???
SLIDE 45 What is an LSTM?
“It’s a recurrent neural network with a unit that remembers a value for an indefinite amount of time”
SLIDE 46 What the heck is an LSTM?
Neural Network
SLIDE 47 What the heck is an LSTM?
Neural Network
Neural Network x y
SLIDE 48 What the heck is an LSTM?
Recurrent Neural Network
Recurrent Neural Network x y
SLIDE 49 What the heck is an LSTM?
Long Short-term Memory
Recurrent Neural Network x y
Memory
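The boxes on slides 46-49 can be written out as one step of a single-unit LSTM. This is a from-scratch toy (the weights below are hand-picked, not trained) whose only purpose is to show the gated memory cell the slide calls "Memory":

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, W):
    """One step of a single-unit LSTM.
    W maps gate name -> (input weight, recurrent weight, bias)."""
    def gate(name, squash):
        wx, wh, b = W[name]
        return squash(wx * x + wh * h_prev + b)
    f = gate("forget", sigmoid)   # how much old memory to keep
    i = gate("input", sigmoid)    # how much new candidate to write
    g = gate("cand", math.tanh)   # candidate value to write
    o = gate("output", sigmoid)   # how much memory to reveal
    c = f * c_prev + i * g        # the "Memory" box on the slide
    h = o * math.tanh(c)          # output, fed back in (the recurrence)
    return h, c

# Hand-picked toy weights: forget gate ~1 and input gate ~0, so the cell
# "remembers a value for an indefinite amount of time", as the quote says.
W = {"forget": (0.0, 0.0, 10.0),
     "input": (0.0, 0.0, -10.0),
     "cand": (1.0, 0.0, 0.0),
     "output": (0.0, 0.0, 10.0)}

h, c = 0.0, 1.0
for x in [0.3, -0.8, 0.5]:
    h, c = lstm_step(x, h, c, W)
# c stays close to 1.0 regardless of the inputs.
```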
SLIDE 50 Approach
Collect: ~10K top repos
Parse: ~500K parsed sources, 1.5B tokens
Train: used 2%
SLIDE 51 Javascript Open/Closed Tokens
SLIDE 52 Java Open/Closed Tokens
SLIDE 53 Preparing Inputs for LSTM
SLIDE 54 LSTM Configuration
SLIDE 55 the algorithm
SLIDE 56 LSTM: Backwards and Forwards
SLIDE 57 Suggestion
A tale of two models
forward backward
SLIDE 58 Suggestion
For each token in the file:
- Measure the disagreement between the two models
- Collect the top-k examples of highest disagreement
SLIDE 59 Suggestion
if (name)) { this.addClass(‘highlight’); return; }
Disagreement:
“I think it should be a {!”
“You fool! Can’t you see that it should be a )?”
SLIDE 60 Suggestion
For the top-k disagreements:
- Create a series of fixes
- Print the fix if it makes the file syntactically valid
- Assume a token addition: delete the token at the point of disagreement
- Assume a token deletion: add a token as suggested by each model
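The procedure on slides 57-60 can be sketched end to end. Everything below is a toy stand-in: constant predictions replace the two trained LSTMs, "disagreement" is a boolean rather than a comparison of probability distributions, and a bracket-balance check replaces the real parser:

```python
def is_valid(tokens):
    """Toy validity oracle: brackets must balance (stand-in for a parser)."""
    stack, match = [], {")": "(", "}": "{"}
    for t in tokens:
        if t in ("(", "{"):
            stack.append(t)
        elif t in (")", "}"):
            if not stack or stack.pop() != match[t]:
                return False
    return not stack

def suggest_fixes(tokens, forward, backward, valid, k=3):
    """Rank positions by model disagreement, then try one-token edits there."""
    scores = []
    for i in range(len(tokens)):
        f, b = forward(tokens[:i]), backward(tokens[i + 1:])
        # Toy disagreement: the models predict different tokens and neither
        # matches what is actually there.
        scores.append((f != b and tokens[i] not in (f, b), i, f, b))
    fixes = []
    for disagree, i, f, b in sorted(scores, reverse=True)[:k]:
        if not disagree:
            continue
        # Assume an extra token was added: try deleting it.
        candidates = [tokens[:i] + tokens[i + 1:]]
        # Assume a token was deleted: try inserting each model's suggestion.
        candidates += [tokens[:i] + [t] + tokens[i:] for t in (f, b)]
        fixes += [c for c in candidates if valid(c)]
    return fixes

broken = "if ( name ) ) { return ; }".split()   # the slide's extra ")"
# Constant toy predictions stand in for the forward/backward LSTMs:
fixes = suggest_fixes(broken, lambda pre: "{", lambda suf: "}", is_valid)
```

Only the deletion of the stray ")" survives the validity check, which is the shape of the real pipeline: disagreement proposes, the parser disposes.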
SLIDE 61 Train Validate
10-fold cross-validation
SLIDE 62 Evaluation
Mutation testing
- 1. Take valid token stream from validation set
- 2. Apply one random edit operation
- 3. Ensure result is syntactically incorrect
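The three steps above can be sketched as follows; the bracket-balance check is my stand-in for a real parser, and the vocabulary is just the tokens of the seed file:

```python
import random

def syntactically_valid(tokens):
    """Toy stand-in for a real parser: do brackets balance?"""
    stack, match = [], {")": "(", "}": "{"}
    for t in tokens:
        if t in ("(", "{"):
            stack.append(t)
        elif t in (")", "}"):
            if not stack or stack.pop() != match[t]:
                return False
    return not stack

def mutate(tokens, vocab, rng):
    """Steps 2-3: apply one random edit, retrying until the result is invalid."""
    while True:
        out, i = list(tokens), rng.randrange(len(tokens))
        op = rng.choice(["addition", "deletion", "substitution"])
        if op == "addition":
            out.insert(i, rng.choice(vocab))
        elif op == "deletion":
            del out[i]
        else:
            out[i] = rng.choice(vocab)
        if not syntactically_valid(out):  # step 3: keep only broken results
            return out, op

valid = "if ( id ) { return ; }".split()  # step 1: a valid token stream
broken, op = mutate(valid, sorted(set(valid)), random.Random(0))
```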
SLIDE 63 Evaluation
Mutation testing example: if ( id ) return ;
SLIDE 64 Evaluation
Addition: )
SLIDE 65 Evaluation
Deletion
SLIDE 66 Evaluation
Substitution: id ++ (shown: if ( ) return ;)
SLIDE 68 LSTM Syntax Error Location MRR
SLIDE 69 Javascript Fixes
SLIDE 70 MRRs of N-Gram & LSTM on Java
SLIDE 71 Sensibility Conclusions
- LSTMs do work for modelling good code and finding syntax errors
- Not being able to represent the range of identifiers is limiting.
  – Large-vocabulary networks are infeasible in real development scenarios.
- LSTMs are not as flexible as NL models such as n-gram models
- Can use the same models for location as you can for fixes.
SLIDE 72 The Unreasonable Effectiveness of Traditional Information Retrieval in Crash Report Deduplication
Joshua Charles Campbell Eddie Antonio Santos Abram “Dragon Man” Hindle
Department of Computing Science University of Alberta
March 16, 2016
SLIDE 73 The Story of Ada
Popular Software Product
SLIDE 74 The Story of Ada
Automatically Collected Crash Reports
SLIDE 75 The Story of Ada
PartyCrasher
Crash database: 100,000s of crash reports per day
SLIDE 76 Example: Mozilla
- more than 2 million crash reports per week!
- Manual bucketing @ 1 crash/minute:
- 913 Full-time employees!
SLIDE 77 What We Want
[Figure: crashes labeled whoops, bug, annoyance, random crash]
Goal: Group the crashes together in buckets by what caused them
SLIDE 78 Realism!
[Figure: crash timeline: whoops, bug, annoyance, random crash, whoops, regression]
SLIDE 79 How good is a solution?
- How do we measure correctness?
- BCubed precision and recall!
- Why not just normal precision and recall?
  – The solutions just put crashes together in buckets
  – They don’t say what bugs exist (or even how many bugs exist)
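BCubed scores each crash report by comparing its bucket (the proposed cluster) against its true bug, then averages over reports. A minimal sketch with invented toy data:

```python
def bcubed(buckets, truth):
    """buckets/truth: dicts mapping crash id -> bucket label / true bug id.
    Returns (BCubed precision, BCubed recall), averaged over crashes."""
    precision, recall = [], []
    for c in buckets:
        same_bucket = [d for d in buckets if buckets[d] == buckets[c]]
        same_bug = [d for d in buckets if truth[d] == truth[c]]
        correct = [d for d in same_bucket if truth[d] == truth[c]]
        precision.append(len(correct) / len(same_bucket))
        recall.append(len(correct) / len(same_bug))
    return sum(precision) / len(precision), sum(recall) / len(recall)

truth = {"c1": "bugA", "c2": "bugA", "c3": "bugB", "c4": "bugB"}
one_big_bucket = {c: "b1" for c in truth}  # everything together: recall 1.0
singletons = {c: c for c in truth}         # everything apart: precision 1.0
```

Unlike plain precision/recall over class labels, this needs no mapping from buckets to bug names, which is exactly why it fits the "we only have buckets" setting above.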
SLIDE 80 High BCubed Precision
[Figure: the crash timeline bucketed with high BCubed precision]
SLIDE 81 High BCubed Recall
[Figure: the crash timeline bucketed with high BCubed recall]
SLIDE 82 Balanced BCubed P/R
[Figure: the crash timeline bucketed with balanced BCubed precision/recall]
SLIDE 83 But does it scale?
- We want it now! (n log n total time, or log n time per crash)
- Classical clustering algorithms are O(n²) total time
- 2 million/week
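The shape of an online bucketer is a single pass that compares each new crash only against the past. The sketch below uses plain token-set overlap (Jaccard) as a stand-in similarity and a linear scan where a real system would query an inverted index to hit the time bounds above; the threshold is arbitrary:

```python
def jaccard(a, b):
    """Stand-in similarity: token-set overlap (the real systems use TF-IDF)."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb)

def bucket_online(crashes, similarity, threshold=0.3):
    """One pass: each crash joins the bucket of its most similar past crash,
    or opens a new bucket. Online: never revisits earlier decisions."""
    assignment, seen = {}, []
    for cid, text in crashes:
        best, best_sim = None, threshold
        for prev_id, prev_text in seen:  # a real system queries an index here
            sim = similarity(text, prev_text)
            if sim > best_sim:
                best, best_sim = prev_id, sim
        assignment[cid] = assignment[best] if best else cid
        seen.append((cid, text))
    return assignment

crashes = [
    ("c1", "SIGSEGV in cairo_transform from libcairo"),
    ("c2", "SIGSEGV in cairo_transform from libcairo evince"),
    ("c3", "assertion failure in gtk_widget_show"),
]
buckets = bucket_online(crashes, jaccard)
# c2 lands in c1's bucket; c3 opens a new one.
```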
SLIDE 84 Online
[Figure: the crash timeline split into Past and Future]
SLIDE 85 Don’t want to hire devs
- Doesn’t require developers to categorize crashes
SLIDE 86 Non-stationary
[Figure: a new crash arrives in the future: new bucket? increased crash rate?]
SLIDE 87 In Practice: Mozilla
- “Signature Generation”
- Fast!
- Accurate?
SLIDE 88 In Practice: Others
- Mozilla, Microsoft (WER), Apple, Google...
- Typically involve LOTS of hand-written rules
SLIDE 89 In Literature
- A bunch of methods that are O(n²) time complexity (or worse)
  – take at least time proportional to n to sort one crash
SLIDE 90 In Literature
- Lerch, et al.
- Not designed for crash report deduplication!
- Uses the Lucene search engine to find similar documents (bugs)
SLIDE 91 Lucene search
Based on a standard textbook IR technique called TF-IDF, plus some adjustments:
↑↑↑ words in this document (crash) ↑↑↑
↓↓↓ words in every document (crash) ↓↓↓
- the, be, to, of, and, a, in ...
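One textbook TF-IDF variant, matching the arrows above: term frequency pushes a word's score up, document frequency pushes it down, so stop words like "in" score zero. The crash corpus here is invented for illustration:

```python
import math
from collections import Counter

def tfidf(term, doc, corpus):
    """Up-weight terms frequent in this document,
    down-weight terms that appear in every document."""
    tf = Counter(doc)[term]
    df = sum(term in d for d in corpus)
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf

corpus = [
    "sigsegv in cairo_transform in libcairo".split(),
    "sigsegv in gtk_widget_show in libgtk".split(),
    "abort in malloc in libc".split(),
]
doc = corpus[0]
# "in" occurs in every crash, so idf = log(1) = 0 and it scores nothing;
# "cairo_transform" is specific to this crash and scores positive.
```

Lucene's actual scoring adds length normalization and other adjustments on top of this basic idea.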
SLIDE 92 In Literature
- Lerch, et al.
- Let’s try that, but instead of trying to group bugs together, let’s group crashes!
SLIDE 93 Let’s Add Context
evince crashed with SIGSEGV in cairo_transform()
This happens immediately when trying to mark text with the mouse.
ProblemType: Crash
Architecture: amd64
DistroRelease: Ubuntu 7.10
ExecutablePath: /usr/bin/evince
Package: evince 0.9.0-1ubuntu4
PackageArchitecture: amd64
ProcCmdline: evince ./expenses-uds-sevilla.pdf
Signal: 11
SourcePackage: evince
Uname: Linux donald 2.6.20-15-generic #2 SMP
SLIDE 94 In Literature
- Lerch, et al.
- Requires breaking up things (bugs, crashes) into “words”
SLIDE 95 Tokenization: Lerch
evince crashed with SIGSEGV in cairo_transform()
#0 0x00002b34461e4dd1 in cairo_transform () from /usr/lib/libcairo.so.2
#1 0x00002b344498a150 in CairoOutputDev::setDefaultCTM () from /usr/lib/libpoppler-glib.so.1
#2 0x00002b344ae2cefc in TextSelectionPainter::TextSelectionPainter () from /usr/lib/libpoppler.so.1
SLIDE 96 Tokenization: Space
[Same backtrace as above, with whitespace-separated tokens highlighted]
SLIDE 97 Tokenization: CamelCase
[Same backtrace as above, with camelCase-split tokens highlighted]
SLIDE 98 Tokenization
Lerch: 0x00002b344498a150 cairooutputdev setdefaultctm from libpoppler glib
Space: #1 0x00002b344498a150 in CairoOutputDev::setDefaultCTM () from /usr/lib/libpoppler-glib.so.1
Camel: 1 x 00002 b 344498 a 150 in Cairo Output Dev set Default CTM from usr lib libpoppler so 1 glib
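A sketch of the camelCase tokenization compared above; the paper's exact splitting rules may differ (this one splits on non-alphanumerics, then on case humps and letter/digit boundaries):

```python
import re

def camel_tokenize(text):
    """Split on non-alphanumerics, then split each word on camelCase
    humps and letter/digit boundaries, as in the Camel row."""
    out = []
    for word in re.split(r"[^A-Za-z0-9]+", text):
        out += re.findall(r"[A-Z]+(?![a-z])|[A-Z][a-z]*|[a-z]+|[0-9]+", word)
    return out

parts = camel_tokenize("CairoOutputDev::setDefaultCTM")
```

The first alternative keeps acronym runs like CTM whole; the second peels off Camel humps; the last two catch lowercase runs and digit runs such as the pieces of a hex address.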
SLIDE 99 Results
SLIDE 100 1Frame 2Frame 3Frame Best Recall SpaceC Lerch Best Precision 1Addr 1File 1Mod Best F1 CamelC
37 of 44
SLIDE 101 C am e l C C a m e l C C a m e l C C a m e l C
1Frame 2Frame 3Frame SpaceC Lerch 1Addr 1File 1Mod CamelC
P R
39 of 44
SLIDE 102 What’s the big deal?
- Okay, so our IR, tf-idf-based technique did the best. What’s the big deal?
SLIDE 103 What’s the big deal?
- It totally disregards what’s on the top of the stack and what’s on the bottom of the stack!
SLIDE 104 What’s the big deal?
- Including contextual information (OS, CPU, version, etc.) can improve precision and recall.
- There is an easily adjustable tradeoff between precision and recall.
SLIDE 105 What’s the big deal?
- Most other papers don’t try tf-idf/IR-based techniques, but they turned out to be the best in this paper.
- tf-idf/IR-based techniques meet Ada’s requirements!
  – Accurate, fast, online, unsupervised, & non-stationary
SLIDE 106 What’s the big deal?
- Information-Retrieval-based technique
- Disregards stack order
- Context matters
- Accurate, fast, online, unsupervised, & non-stationary
- First paper using the Ubuntu dataset: https://archive.org/details/bugkets-2016-01-30
SLIDE 108 Conclusions
- Language Models to Find Errors
- Language Models to Fix Errors
- Posing Source Code Artifacts as NL to Cluster Errors for IR