UnnaturalNets: work by Joshua Campbell, Eddie Antonio Santos, Nelson J. Amaral, and Abram Hindle (PowerPoint presentation)




SLIDE 1

UnnaturalNets

Work by Joshua Campbell, Eddie Antonio Santos, Nelson J Amaral, Abram Hindle

Joshua is on the postdoc market! (Sept 2018)
SLIDE 2
SLIDE 3

Introduction

  • Exploring bimodal program analysis via a naturalness lens
    – Syntax Error Detection: Unnaturalness
    – Syntax Error Correction: Sensibility
    – Crash Report Clustering: Information Retrieval + Naturalness
SLIDE 4 Maybe don't answer that
SLIDE 5 If you can hear a record needle scratch, that's the language model in your head expressing skepticism around this answer
SLIDE 6 Southwest Airlines? Seems a little weird. I could think of something more appropriate
SLIDE 7 P(teapot) a little! P(tired) a little! P(southwest) a little! Well, that's perplexing
SLIDE 8
SLIDE 9 probable probable
SLIDE 10 probable probable unknown
SLIDE 11 I know good code pretty well
SLIDE 12 But bad code? I don't see much of it. In fact, it is surprising
SLIDE 13 N-Gram Language Model on Good Source Code. SURPRISE? Cross-Entropy, Perplexity
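The idea behind this slide: train a language model on known-good code, then score new code by cross-entropy (perplexity is 2 to that power). A minimal sketch with add-one-smoothed bigrams on an invented toy corpus; the actual UnnaturalCode work uses a full n-gram toolkit, so treat everything here as illustrative:

```python
import math
from collections import Counter

def train_bigram(tokens, vocab_size):
    """Add-one-smoothed bigram model: P(t2 | t1)."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    unigrams = Counter(tokens[:-1])
    def prob(t1, t2):
        return (bigrams[(t1, t2)] + 1) / (unigrams[t1] + vocab_size)
    return prob

def cross_entropy(prob, tokens):
    """Mean negative log2-probability per bigram (bits/token);
    perplexity is 2 ** cross_entropy."""
    costs = [-math.log2(prob(a, b)) for a, b in zip(tokens, tokens[1:])]
    return sum(costs) / len(costs)

# Toy "good code" corpus: the same tokenized loop header, many times.
loop = "for ( i = 0 ; i < n ; i ++ )".split()
corpus = loop * 50
prob = train_bigram(corpus, len(set(corpus)))

natural = loop
unnatural = "for ( i = 0 ; i < n ; ; ++ )".split()  # 'i' corrupted to ';'

h_nat = cross_entropy(prob, natural)
h_unnat = cross_entropy(prob, unnatural)
assert h_unnat > h_nat  # broken code is measurably more surprising
```

The surprise gap between valid and corrupted token streams is exactly what the later detection slides exploit.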
SLIDE 14 Optimistic programs: working because we knew they would. Check by running. Statically Typed World: types known, compile-time checked. Dynamically Typed World.
SLIDE 15 Statically Typed World: types known, compile-time checked. Available Oracle: "This is a valid program."
SLIDE 16 Java can answer: "This is a valid program." Test? javac X.java
SLIDE 17 Optimistic programs: duck typing, types change at runtime, message passing. Runtime checking. NO ORACLE
SLIDE 18 No oracle. Will run programs with syntax errors. Functions won't throw exceptions unless run. Resort to testing or running. Misspellings need to be run to be caught
SLIDE 19 Example

for (int i = 0; i < scorers.length; i++) {
  if (scorers[i].nextDoc() == NO_MORE_DOCS) {
    // If even one of the sub-scorers does not have any more documents,
    // this scorer should not attempt to do any more work.
    lastDoc = NO_MORE_DOCS;
    return;
  }
}

Does it work? 30 of 69
SLIDE 20 Example Output: Java
Check near == NO_MORE_DOCS) lastDoc = NO_MORE_DOCS; return With entropy 4.552985
Check near () == NO_MORE_DOCS ) lastDoc = NO_MORE_DOCS With entropy 4.498802
Check near NO_MORE_DOCS) lastDoc = NO_MORE_DOCS; return; With entropy 4.244520
Check near ) lastDoc = NO_MORE_DOCS; return; } With entropy 4.183379
Check near ) == NO_MORE_DOCS) lastDoc = NO_MORE_DOCS; With entropy 3.858807
31 of 69
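The "Check near ... With entropy ..." output above can be approximated by sliding a fixed-width token window over the file and ranking windows by mean bigram surprise, most surprising first. A toy sketch (the bigram model, corpus, and corrupted query are all invented for illustration):

```python
import math
from collections import Counter

# Train an add-one-smoothed bigram model on "known good" tokens.
good = "if ( x == y ) { return ; }".split() * 40
bigrams = Counter(zip(good, good[1:]))
unigrams = Counter(good[:-1])
V = len(set(good))

def bits(t1, t2):
    """Surprise (in bits) of seeing t2 after t1."""
    return -math.log2((bigrams[(t1, t2)] + 1) / (unigrams[t1] + V))

def rank_windows(tokens, width=4):
    """Score every window of `width` tokens by mean bigram surprise,
    most surprising first: one 'Check near' line per window."""
    scores = []
    for start in range(len(tokens) - width + 1):
        w = tokens[start:start + width]
        h = sum(bits(a, b) for a, b in zip(w, w[1:])) / (width - 1)
        scores.append((h, start, " ".join(w)))
    return sorted(scores, reverse=True)

# A query file whose 6th token was corrupted ('XX' instead of ')').
query = "if ( x == y XX { return ; }".split()
top_entropy, top_start, top_window = rank_windows(query)[0]
assert "XX" in top_window  # the most surprising window flags the corruption
```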
SLIDE 21 Example Output: Java
  • Detects Random Token Replacement
Check near ; StorableField[] fields = d.getFields(fieldName); for With entropy 2.175943
Check near XXXXXXXX== 0) { continue; } float idf = similarity. With entropy 2.119329
...
32 of 69
SLIDE 22 Example Output: Java
  • Detects Random Token Insertion
Check near ; StorableField[] fields = d.getFields(fieldName); for With entropy 2.175943
Check near BytesRef XXXXXXXXtext; while((text = termsEnum.next()) With entropy 2.134373
...
33 of 69
SLIDE 23 Example Output: Python
  • Let’s get Pythonic!
  • Easy to set up / Ready to use
34 of 69
SLIDE 24 Scoring our Performance Java Performance Example 35 of 69
SLIDE 25 Validation: Java Self Trained on Lucene 4.0.0 37 of 69
SLIDE 26 Validation: Next Trained on Lucene 4.0.0 40 of 69
SLIDE 27 Validation: Only New Trained on Lucene 4.0.0 41 of 69
SLIDE 28 Validation: Only Changed Trained on Lucene 4.0.0 42 of 69
SLIDE 29 Validation: Other Project Trained on Lucene 4.0.0 43 of 69
SLIDE 30 Scoring our Performance
  • Python only returns at most 1 syntax error!
  • For fair comparison, so does UnnaturalCode
% Located 36 of 69
SLIDE 31 Validation: Python Self
             UC Python (Acc)   UC Java (MRR)
Delete       .74               .87
Insert       .83               .99
Replace      .77               .98
UC Python within 5 lines: >93%
38 of 69
SLIDE 32 Comparison: Python vs UC
             Python   UC    Python+UC
Delete       64%      65%   79%
Insert       64%      77%   86%
Replace      63%      74%   86%
Similar performance by themselves... but together, 9-23% more win! 44 of 69
SLIDE 33 Syntax Errors We conclude that when you train a language model of working source code, “syntax errors just aren’t natural.” 52 of 69
SLIDE 34

Syntax and Sensibility

Using LSTMs to detect and fix syntax errors

Eddie Antonio Santos
SLIDE 35

the issue

2
SLIDE 36 3

Given

A file with one syntax error

Can we

  • Find its exact location
  • Suggest the fix
SLIDE 37

What is one syntax error?

4

Edit distance = 1 from a syntactically-valid file

  • Addition
  • Deletion
  • Substitution
  • Transposition?
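Treating "one syntax error" as edit distance 1 over tokens can be made concrete by enumerating the single-edit neighbourhood of a token stream (transpositions omitted, since the slide leaves them as a question). A small sketch with an invented token vocabulary:

```python
def edits1(tokens, vocab):
    """All token streams at edit distance 1: additions, deletions,
    and substitutions of a single token."""
    results = set()
    for i in range(len(tokens) + 1):          # addition before position i
        for t in vocab:
            results.add(tuple(tokens[:i]) + (t,) + tuple(tokens[i:]))
    for i in range(len(tokens)):              # deletion of position i
        results.add(tuple(tokens[:i]) + tuple(tokens[i + 1:]))
    for i in range(len(tokens)):              # substitution at position i
        for t in vocab:
            if t != tokens[i]:
                results.add(tuple(tokens[:i]) + (t,) + tuple(tokens[i + 1:]))
    return results

stream = ("if", "(", "x", ")")
vocab = {"if", "(", "x", ")", "{", "}"}
neighbours = edits1(stream, vocab)
assert ("if", "(", "x") in neighbours            # a deletion
assert ("if", "(", "x", ")", "{") in neighbours  # an addition
assert ("if", "(", "{", ")") in neighbours       # a substitution
```

A file "with one syntax error" is then any stream in this neighbourhood of some syntactically valid file.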
SLIDE 38 11

Can we…?

  • Suggest a fix
  • Without handwritten rules?
SLIDE 39

the solution

12
SLIDE 40

Overview

13 (10 panic; 20 goto 10) Pipeline: Collect → Train → Predict → Input → Suggest
SLIDE 41

How?

14
SLIDE 42

How?

15
SLIDE 43

How?

16

next(for(i = 0; i < length;) = ???

SLIDE 44

How?

17
SLIDE 45

What is an LSTM?

18

“It’s a recurrent neural network with a unit that remembers a value for an indefinite amount of time”

SLIDE 46

What the heck is an LSTM?

19 Neural Network
SLIDE 47

What the heck is an LSTM?

20 Neural Network

Neural Network x y

SLIDE 48

What the heck is an LSTM?

21 Recurrent Neural Network

Recurrent Neural Network x y

SLIDE 49

What the heck is an LSTM?

22 Long Short-term Memory

Recurrent Neural Network x y

Memory
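The "Memory" box above can be seen in miniature with a single-unit LSTM step. The weights below are hand-picked for illustration, not trained: the forget gate saturates open and the input gate saturates shut, so the cell "remembers a value for an indefinite amount of time" no matter what inputs arrive:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h, c, w):
    """One step of a 1-unit LSTM: gates decide what the memory cell
    keeps, adds, and exposes."""
    f = sigmoid(w["f"] * x + w["uf"] * h + w["bf"])    # forget gate
    i = sigmoid(w["i"] * x + w["ui"] * h + w["bi"])    # input gate
    g = math.tanh(w["g"] * x + w["ug"] * h + w["bg"])  # candidate value
    o = sigmoid(w["o"] * x + w["uo"] * h + w["bo"])    # output gate
    c = f * c + i * g     # keep old cell state, mix in new information
    h = o * math.tanh(c)  # hidden output
    return h, c

# Hand-picked (untrained) weights: forget gate ~1, input gate ~0.
w = dict(f=0, uf=0, bf=10,    # forget gate saturated open: keep everything
         i=0, ui=0, bi=-10,   # input gate saturated shut: write nothing new
         g=1, ug=0, bg=0,
         o=0, uo=0, bo=10)    # output gate open

h, c = 0.0, 0.7                    # the cell starts out holding 0.7
for x in [1.0, -3.0, 5.0, 0.0]:    # noisy inputs arrive...
    h, c = lstm_step(x, h, c, w)

assert abs(c - 0.7) < 1e-3         # ...but the cell state barely moves
```

Training moves these gate weights so the network itself learns what to remember and what to forget.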
SLIDE 50

Approach

23 Collect: ~10K top repos → Parse: ~500K parsed sources, 1.5B tokens → Train: used 2%
SLIDE 51

Javascript Open/Closed Tokens

SLIDE 52

Java Open/Closed Tokens

SLIDE 53

Preparing Inputs for LSTM

SLIDE 54

LSTM Configuration

SLIDE 55

the algorithm

26
SLIDE 56

LSTM: Backwards and Forwards

SLIDE 57

Suggestion

27 A tale of two models

forward backward

SLIDE 58

Suggestion

28 For each token in the file: measure the disagreement between the two models; collect the top-k examples of highest disagreement
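One way to make "disagreement between the two models" concrete is total variation distance between the forward and backward next-token distributions at each position. The exact score used in Sensibility may differ, and the distributions below are invented, so this is only a sketch of the shape of the computation:

```python
def disagreement(p, q):
    """Total variation distance between two next-token distributions:
    0 when the models agree exactly, up to 1 when they predict
    completely different tokens."""
    tokens = set(p) | set(q)
    return 0.5 * sum(abs(p.get(t, 0.0) - q.get(t, 0.0)) for t in tokens)

def top_k_disagreements(forward, backward, k=1):
    """Rank token positions by how much the two models disagree."""
    scores = [(disagreement(p, q), i)
              for i, (p, q) in enumerate(zip(forward, backward))]
    return sorted(scores, reverse=True)[:k]

# Hypothetical per-position predictions over a 3-token file.
forward  = [{";": 0.9, ")": 0.1},
            {"{": 0.8, ")": 0.2},   # forward model expects '{' here
            {";": 0.9, ")": 0.1}]
backward = [{";": 0.9, ")": 0.1},
            {")": 0.9, "{": 0.1},   # backward model expects ')'
            {";": 0.9, ")": 0.1}]

score, position = top_k_disagreements(forward, backward, k=1)[0]
assert position == 1  # the models clash exactly at the error site
```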
SLIDE 59

if (name)) { this.addClass('highlight'); return; }

Suggestion

29 Disagreement. One model: "I think it should be a {!" The other: "You fool! Can't you see that it should be a )?" 💦

SLIDE 60

Suggestion

30 For the top k disagreements:
  • Create a series of fixes
  • Assume a token addition: delete the token at the point of disagreement
  • Assume a token deletion: add a token as suggested by each model
  • Print the fix if it makes the file syntactically valid
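A sketch of that fix loop, using Python's own compile() as a stand-in syntax oracle and a hypothetical broken token stream (the real tool uses the target language's parser and the two LSTMs' actual suggestions):

```python
def is_valid(tokens):
    """Stand-in parser: use Python's own compiler as the syntax oracle."""
    try:
        compile(" ".join(tokens), "<candidate>", "exec")
        return True
    except SyntaxError:
        return False

def suggest_fixes(tokens, position, suggestions):
    """At the point of highest disagreement, try (a) deleting the token
    there (assume an addition broke the file) and (b) inserting each
    model's suggested token (assume a deletion broke it). Keep only
    candidates that now parse."""
    candidates = [tokens[:position] + tokens[position + 1:]]  # deletion
    for t in suggestions:                                     # insertions
        candidates.append(tokens[:position] + [t] + tokens[position:])
    return [c for c in candidates if is_valid(c)]

broken = "print ( 1 ) )".split()           # one token too many
assert not is_valid(broken)
fixes = suggest_fixes(broken, position=4, suggestions=["(", ","])
assert ["print", "(", "1", ")"] in fixes   # deleting the extra ')' parses
```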
SLIDE 61 Train Validate

10-fold cross-validation

37
SLIDE 62

Evaluation

38 Mutation testing
  • 1. Take valid token stream from validation set
  • 2. Apply one random edit operation
  • 3. Ensure result is syntactically incorrect
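The three mutation steps above can be sketched directly. Step 3 needs a retry loop, since a random edit can, by luck, still parse; Python's compile() stands in for the real parser, and the valid token stream and vocabulary below are invented:

```python
import random

def is_valid(src):
    """Stand-in oracle: does the source still compile?"""
    try:
        compile(src, "<mutant>", "exec")
        return True
    except SyntaxError:
        return False

def mutate(tokens, vocab, rng):
    """Step 2: apply one random edit (addition, deletion, substitution)."""
    tokens = list(tokens)
    op = rng.choice(["add", "delete", "substitute"])
    if op == "add":
        tokens.insert(rng.randrange(len(tokens) + 1), rng.choice(vocab))
    elif op == "delete":
        del tokens[rng.randrange(len(tokens))]
    else:
        tokens[rng.randrange(len(tokens))] = rng.choice(vocab)
    return tokens

def make_mutant(tokens, vocab, seed=0):
    """Steps 1-3: keep mutating the valid stream until the result is
    actually syntactically incorrect."""
    rng = random.Random(seed)
    while True:
        mutant = mutate(tokens, vocab, rng)
        if not is_valid(" ".join(mutant)):
            return mutant

valid = "if ( x ) : pass".split()
vocab = ["if", "(", ")", ":", "pass", "x", "+"]
assert is_valid(" ".join(valid))
mutant = make_mutant(valid, vocab, seed=1)
assert not is_valid(" ".join(mutant))
```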
SLIDE 63

Evaluation

39 Mutation testing if ( id ) return ; !
  • 1. Take valid token stream from validation set
  • 2. Apply one random edit operation
  • 3. Ensure result is syntactically incorrect
SLIDE 64

Evaluation

40 Mutation testing if ( id ) return ; !
  • 1. Take valid token stream from validation set
  • 2. Apply one random edit operation
  • 3. Ensure result is syntactically incorrect
Addition )
SLIDE 65

Evaluation

41 Mutation testing if ( id ) return ; !
  • 1. Take valid token stream from validation set
  • 2. Apply one random edit operation
  • 3. Ensure result is syntactically incorrect
Deletion
SLIDE 66

Evaluation

42 Mutation testing if ( ) return ; !
  • 1. Take valid token stream from validation set
  • 2. Apply one random edit operation
  • 3. Ensure result is syntactically incorrect
Substitution id ++
SLIDE 67

JS Tool Output

SLIDE 68

LSTM Syntax Error Location MRR
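MRR (Mean Reciprocal Rank), the metric in this and the following result slides, averages the reciprocal rank of the true answer across queries. A minimal sketch; the rankings and true locations below are invented:

```python
def mrr(ranked_lists, truths):
    """Mean Reciprocal Rank: average of 1/rank of the true answer,
    counting 0 for queries where the true answer never appears."""
    total = 0.0
    for ranked, truth in zip(ranked_lists, truths):
        if truth in ranked:
            total += 1.0 / (ranked.index(truth) + 1)
    return total / len(ranked_lists)

# Hypothetical error-location queries: each inner list is the model's
# ranking of token positions, each truth the real error location.
ranked = [[7, 3, 9], [4, 7, 1], [2, 5, 8]]
truth  = [7, 7, 9]
# ranks: 1, 2, and not found -> (1 + 1/2 + 0) / 3 = 0.5
assert mrr(ranked, truth) == 0.5
```

An MRR of 1.0 means the true error location is always ranked first.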

SLIDE 69

Javascript Fixes

SLIDE 70

MRRs of N-Gram & LSTM on Java

SLIDE 71

Sensibility Conclusions

  • LSTMs do work for modelling good code and finding syntax errors
  • Not being able to represent the range of identifiers is limiting.
    – Large vocabulary networks are infeasible in real development scenarios.
  • LSTMs are not as flexible as NL models such as n-gram models
  • Can use the same models for location as you can for fixes.
SLIDE 72 The Unreasonable Effectiveness of Traditional Information Retrieval in Crash Report Deduplication Joshua Charles Campbell Eddie Antonio Santos Abram “Dragon Man” Hindle Department of Computing Science University of Alberta March 16, 2016 1 of 44
SLIDE 73 The Story of Ada Popular Software Product 2 of 44
SLIDE 74 The Story of Ada Automatically Collected Crash Reports 3 of 44
SLIDE 75 The Story of Ada PartyCrasher Crash database 100,000s of crash reports per day 4 of 44
SLIDE 76 Example: Mozilla
  • more than 2 million crash reports per week!
  • Manual bucketing @ 1 crash/minute:
  • 913 Full-time employees!
5 of 44
SLIDE 77 What We Want
[Figure: crash reports labelled whoops, opsie, bug, annoyance, random crash]
Goal: Group the crashes together in buckets by what caused them 6 of 44
SLIDE 78 Realism!
[Figure: crash timeline; labels whoops, opsie, bug, annoyance, random crash, whoops regression]
10 of 44
SLIDE 79 How good is a solution?
  • How do we measure correctness?
  • BCubed precision and recall!
  • Why not just normal precision and recall?
  • The solutions just put crashes together in buckets; they don't say what bugs exist (or even how many bugs exist) 11 of 44
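BCubed precision and recall score each crash individually: precision asks how pure that crash's bucket is, recall asks how much of its true bug's crashes it was grouped with. A small sketch with three crashes and two true bugs (labels invented for illustration):

```python
def bcubed(buckets, labels):
    """BCubed precision/recall. buckets[i] is the bucket assigned to
    crash i; labels[i] is the (in practice unknown) true bug."""
    n = len(buckets)
    precision = recall = 0.0
    for i in range(n):
        same_bucket = {j for j in range(n) if buckets[j] == buckets[i]}
        same_bug = {j for j in range(n) if labels[j] == labels[i]}
        correct = len(same_bucket & same_bug)  # bucket-mates with the same bug
        precision += correct / len(same_bucket)
        recall += correct / len(same_bug)
    return precision / n, recall / n

# Three crashes from two real bugs, all lumped into one bucket:
buckets = ["B1", "B1", "B1"]
labels = ["bugA", "bugA", "bugB"]
p, r = bcubed(buckets, labels)
assert r == 1.0                # every crash finds all of its bug-mates
assert abs(p - 5 / 9) < 1e-12  # but the single bucket is polluted
```

One giant bucket maximizes BCubed recall but hurts precision; one bucket per crash does the reverse, which is why the slides plot both.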
SLIDE 80 High BCubed Precision
[Figure: crash timeline; labels whoops, opsie, bug, annoyance, random crash, whoops regression]
12 of 44
SLIDE 81 High BCubed Recall
[Figure: crash timeline; labels whoops, opsie, bug, annoyance, random crash, whoops regression]
13 of 44
SLIDE 82 Balanced BCubed P/R
[Figure: crash timeline; labels whoops, opsie, bug, annoyance, random crash, whoops regression]
14 of 44
SLIDE 83 But does it scale?
  • We want it now!
  • (n log n total time or log n time per crash)
  • Classical clustering algorithms are n² total time
  • 2 million/week
15 of 44
SLIDE 84 Online
[Figure: crash timeline split into Past and Future; labels whoops, opsie, bug, annoyance, random crash, whoops regression]
16 of 44
SLIDE 85 Don’t want to hire devs
  • Doesn’t require developers to categorize
crashes
  • unsupervised
17 of 44
SLIDE 86 Non-stationary
[Figure: crash timeline, Past vs Future; labels whoops, opsie, bug, random crash; annotations: new bucket? increase crash rate?]
18 of 44
SLIDE 87 In Practice: Mozilla
  • “Signature Generation”
  • Fast!
  • Accurate?
19 of 44
SLIDE 88 In Practice: Others
  • Mozilla, Microsoft (WER), Apple, Google...
  • Typically involve LOTS of hand-written rules
20 of 44
SLIDE 89 In Literature
  • A bunch of methods that are n² time complexity (or worse)
  • take at least time proportional to n to sort one
crash 21 of 44
SLIDE 90 In Literature
  • Lerch, et al.
  • Not designed for crash report deduplication!
  • Uses the Lucene search engine to find similar documents (bugs) 27 of 44
SLIDE 91 Lucene search
Based on a standard textbook IR technique called TF-IDF, plus some adjustments:
↑↑↑ words in this document (crash) ↑↑↑
↓↓↓ words in every document (crash) ↓↓↓
  • the, be, to, of, and, a, in ...
28 of 44
SLIDE 92 In Literature
  • Lerch, et al.
  • Let’s try that, but instead of trying to group bugs
together, let’s group crashes! 29 of 44
SLIDE 93 Let’s Add Context evince crashed with SIGSEGV in cairo_transform() This happens immediately when trying to mark text with the mouse. ProblemType: Crash Architecture: amd64 DistroRelease: Ubuntu 7.10 ExecutablePath: /usr/bin/evince Package: evince 0.9.0-1ubuntu4 PackageArchitecture: amd64 ProcCmdline: evince ./expenses-uds-sevilla.pdf Signal: 11 SourcePackage: evince Uname: Linux donald 2.6.20-15-generic #2 SMP 30 of 44
SLIDE 94 In Literature
  • Lerch, et al.
  • Requires breaking up things (bugs, crashes) into
“words” 31 of 44
SLIDE 95 Tokenization: Lerch evince crashed with SIGSEGV in cairo_transform() #0 0x00002b34461e4dd1 in cairo_transform () from /usr/lib/libcairo.so.2 #1 0x00002b344498a150 in CairoOutputDev::setDefaultCTM () from /usr/lib/libpoppler-glib.so.1 #2 0x00002b344ae2cefc in TextSelectionPainter::TextSelectionPainter () from /usr/lib/libpoppler.so.1 32 of 44
SLIDE 96 Tokenization: Space evince crashed with SIGSEGV in cairo_transform() #0 0x00002b34461e4dd1 in cairo_transform () from /usr/lib/libcairo.so.2 #1 0x00002b344498a150 in CairoOutputDev::setDefaultCTM () from /usr/lib/libpoppler-glib.so.1 #2 0x00002b344ae2cefc in TextSelectionPainter::TextSelectionPainter () from /usr/lib/libpoppler.so.1 33 of 44
SLIDE 97 Tokenization: CamelCase evince crashed with SIGSEGV in cairo_transform() #0 0x00002b34461e4dd1 in cairo_transform () from /usr/lib/libcairo.so.2 #1 0x00002b344498a150 in CairoOutputDev::setDefaultCTM () from /usr/lib/libpoppler-glib.so.1 #2 0x00002b344ae2cefc in TextSelectionPainter::TextSelectionPainter () from /usr/lib/libpoppler.so.1 34 of 44
SLIDE 98 Tokenization Lerch 0x00002b344498a150 cairooutputdev setdefaultctm from libpoppler glib Space #1 0x00002b344498a150 in CairoOutputDev::setDefaultCTM () from /usr/lib/libpoppler-glib.so.1 Camel 1 x 00002 b 344498 a 150 in Cairo Output Dev set Default CTM from usr lib libpoppler so 1 glib 35 of 44
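The three tokenization schemes compared above can be sketched as follows. The Lerch-style rules here are only an approximation of the original (lowercase, split on non-alphanumerics, drop one-character fragments); CamelCase additionally splits identifiers at case and letter/digit boundaries:

```python
import re

frame = ("#1 0x00002b344498a150 in CairoOutputDev::setDefaultCTM () "
         "from /usr/lib/libpoppler-glib.so.1")

def tokenize_space(text):
    """Space: just split on whitespace."""
    return text.split()

def tokenize_lerch(text):
    """Lerch-style (approximate): lowercase, split on anything
    non-alphanumeric, keep multi-character words."""
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if len(t) > 1]

def tokenize_camel(text):
    """CamelCase: also split identifiers at case and letter/digit
    boundaries, so CairoOutputDev -> Cairo Output Dev."""
    return re.findall(r"[A-Z]+(?![a-z])|[A-Z][a-z]*|[a-z]+|[0-9]+", text)

assert "CairoOutputDev::setDefaultCTM" in tokenize_space(frame)
assert "cairooutputdev" in tokenize_lerch(frame)
assert tokenize_camel("CairoOutputDev") == ["Cairo", "Output", "Dev"]
```

Finer splitting lets two crashes match on shared identifier pieces even when the full symbols differ, which is what the results slides then measure.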
SLIDE 99 Results
  • Ok so who won?
36 of 44
SLIDE 100 [Figure: methods 1Frame, 2Frame, 3Frame, SpaceC, Lerch, 1Addr, 1File, 1Mod, CamelC annotated with best recall, best precision, and best F1 (CamelC)] 37 of 44
SLIDE 101 [Figure: BCubed precision (P) and recall (R) per method: 1Frame, 2Frame, 3Frame, SpaceC, Lerch, 1Addr, 1File, 1Mod, CamelC] 39 of 44
SLIDE 102 What’s the big deal?
  • Okay, so our IR, tf-idf-based technique did the best; what's the big deal? 40 of 44
SLIDE 103 What’s the big deal?
  • It totally disregards what's on the top of the stack and what's on the bottom of the stack! 41 of 44
SLIDE 104 What’s the big deal?
  • Including contextual information (OS, CPU,
version, etc.) can improve precision and recall.
  • There is an easily adjustable tradeoff
between precision and recall. 42 of 44
SLIDE 105 What’s the big deal?
  • Most other papers don’t try tf-idf/IR based
techniques, but they turned out to be the best in this paper.
  • tf-idf/IR based techniques meet Ada’s
requirements!
  • Accurate, fast, online, unsupervised, &
non-stationary 43 of 44
SLIDE 106 What’s the big deal?
  • Information-Retrieval-based technique
  • Disregards stack order
  • Context matters
  • Accurate, fast, online, unsupervised, &
non-stationary
  • First paper using the Ubuntu dataset: https://archive.org/details/bugkets-2016-01-30 44 of 44
SLIDE 107

Conclusions

SLIDE 108

Conclusions

  • Language Models to Find Errors
  • Language Models to Fix Errors
  • Posing Source Code Artifacts as NL to Cluster Errors for IR