Learning to Find Bugs (Work in progress), Michael Pradel, TU Darmstadt



slide-1
SLIDE 1

1

Michael Pradel TU Darmstadt

Joint work with Koushik Sen and Rohan Bavishi

Learning to Find Bugs

(Work in progress)

slide-3
SLIDE 3

2

Automated Bug Detection

Hundreds of bug detectors

One analysis for each bug pattern

E.g., Google’s Error Prone framework: 150+ different analyses

Thousands of bug patterns

Existing bug detectors miss most bugs

Manually creating and tuning bug detectors doesn’t scale

slide-6
SLIDE 6

3

Learning to Find Bugs

Train a model to identify instances of bug patterns:

Buggy code + Correct code → Train machine learning model → Classifier

New code → Classifier → Buggy/Okay

Problem of writing program analysis → Problem of finding training examples

slide-8
SLIDE 8

4

Here: Name-based Bug Detection

function setPoint(x, y) { ... }
var x_dim = 23;
var y_dim = 5;
setPoint(y_dim, x_dim);

What’s wrong with this code?

Incorrect order of arguments

slide-9
SLIDE 9

5

Prior Work

Name-based bug detection

Find unusual and likely incorrect arguments
Exploit similarities of identifier names

First name-based bug detector [ISSTA’11]

Finds incorrectly ordered, equally typed arguments
Compares call sites of same method

2011

slide-10
SLIDE 10

5

Prior Work

Name-based bug detection

Find unusual and likely incorrect arguments
Exploit similarities of identifier names

Improved analysis [TSE’13]

Improved precision
Effective for multiple languages (Java, C, C++)

2013

slide-11
SLIDE 11

5

Prior Work

Name-based bug detection

Find unusual and likely incorrect arguments
Exploit similarities of identifier names

Generalized analysis [ICSE’16]

Apply to arbitrary arguments
Heuristic pruning of false positives

2016

slide-12
SLIDE 12

5

Prior Work

Name-based bug detection

Find unusual and likely incorrect arguments
Exploit similarities of identifier names

Adopted by Google [OOPSLA’17]

Default check in Error Prone framework
Found 2000+ new bugs

2017

slide-14
SLIDE 14

6

Problem Solved?

Various hand-tuned heuristics

Detect more bugs
Special check for assertEquals calls

Reduce false positives
Hard-coded method names that suggest that swapping is intended, e.g., transpose

Goal: Replace hand-tuned analysis with trained machine learning model

slide-15
SLIDE 15

7

This Work: Overview

Code corpus → Create training data → Learn representation of identifiers → Train model that identifies bugs → Bug detector

slide-16
SLIDE 16

8

Creating Training Data

Program transformation that seeds bugs

For swapped arguments:

Visit every function call with ≥ 2 arguments
Positive example: Original order of arguments
Negative example: Swap first two arguments

setPoint(x, y);
setPoint(y, x);
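
The seeding step for swapped arguments could look roughly like the sketch below. It is not the authors' implementation: the `callSites` records, the `label` field, and the function names are hypothetical, and a real pipeline would obtain call sites by parsing source files into ASTs.

```javascript
// Hypothetical, simplified call-site records (assumption for illustration).
const callSites = [
  { callee: "setPoint", args: ["x", "y"] },
  { callee: "setTimeout", args: ["callback", "1000"] },
  { callee: "close", args: [] },
];

// Seed a swapped-argument bug by exchanging the first two arguments.
function swapFirstTwo(call) {
  const [first, second, ...rest] = call.args;
  return { callee: call.callee, args: [second, first, ...rest] };
}

// Build labeled examples: label 1 = original order, label 0 = seeded bug.
function createTrainingData(calls) {
  const examples = [];
  for (const call of calls) {
    if (call.args.length < 2) continue; // only calls with >= 2 arguments
    examples.push({ ...call, label: 1 });
    examples.push({ ...swapFirstTwo(call), label: 0 });
  }
  return examples;
}

const trainingData = createTrainingData(callSites);
console.log(trainingData);
```

The sketch treats every existing call as correct, so any real bugs already in the corpus end up labeled as positive examples.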

slide-17
SLIDE 17

9

Representing Identifiers

How to reason about identifier names?

Prior work: Lexical similarity

x similar to x_dim

Want: Semantic similarity

x similar to width
list similar to seq
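
A minimal sketch of the difference between the two notions of similarity. The edit-distance measure stands in for lexical similarity in general (the prior work may use a different measure), and the three-dimensional vectors are made-up values, not learned embeddings.

```javascript
// Levenshtein edit distance between two strings (dynamic programming).
function editDistance(a, b) {
  const dp = Array.from({ length: a.length + 1 }, (_, i) => [i]);
  for (let j = 1; j <= b.length; j++) dp[0][j] = j;
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      dp[i][j] = Math.min(dp[i - 1][j] + 1, dp[i][j - 1] + 1, dp[i - 1][j - 1] + cost);
    }
  }
  return dp[a.length][b.length];
}

// Lexical similarity in [0, 1]: 1 means identical strings.
function lexicalSimilarity(a, b) {
  const longest = Math.max(a.length, b.length);
  return longest === 0 ? 1 : 1 - editDistance(a, b) / longest;
}

// Cosine similarity between two vectors of equal length.
function cosineSimilarity(u, v) {
  const dot = u.reduce((sum, ui, i) => sum + ui * v[i], 0);
  const norm = (w) => Math.sqrt(w.reduce((sum, wi) => sum + wi * wi, 0));
  return dot / (norm(u) * norm(v));
}

// Made-up embedding vectors (assumption for illustration only).
const embedding = {
  x: [0.9, 0.1, 0.0],
  width: [0.8, 0.2, 0.1],
  list: [0.0, 0.9, 0.4],
};

console.log(lexicalSimilarity("x", "x_dim"), lexicalSimilarity("x", "width"));
console.log(cosineSimilarity(embedding.x, embedding.width));
console.log(cosineSimilarity(embedding.x, embedding.list));
```

Lexically, `x` shares nothing with `width`, while vectors of related names can still point in nearly the same direction.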

slide-18
SLIDE 18

10

Background: Word Embeddings

Word embeddings in NLP

Continuous vector representation for each word
Similar words have similar vectors

Word2Vec: Learn from corpus of text

"You shall know a word by the company it keeps"
Context: Surrounding words in sentences

slide-19
SLIDE 19

11

AST Context

What’s the context of an identifier?

Our approach: AST-based context

Surrounding nodes: Parent, grandparent, siblings, etc.

Extract node types, node contents, and relative positioning

slide-20
SLIDE 20

12

AST Context: Example

window.setTimeout(callback, 1000);

CallExpr
  MemberExpr
    Identifier window
    Identifier setTimeout
  Arguments
    Identifier callback
    Literal 1000
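
Extracting such a context could be sketched as follows. The node type names come from the slide; the object shape (`type`, `value`, `children`) and the `findContext` helper are assumptions for illustration, not the format of any particular parser.

```javascript
// Hand-built tree for: window.setTimeout(callback, 1000);
const ast = {
  type: "CallExpr",
  children: [
    { type: "MemberExpr", children: [
      { type: "Identifier", value: "window", children: [] },
      { type: "Identifier", value: "setTimeout", children: [] },
    ] },
    { type: "Arguments", children: [
      { type: "Identifier", value: "callback", children: [] },
      { type: "Literal", value: "1000", children: [] },
    ] },
  ],
};

// Describe a node by its type and, if present, its content.
const describe = (n) => (n.value !== undefined ? `${n.type} ${n.value}` : n.type);

// Depth-first search for an identifier; returns its parent type,
// grandparent type, position among its siblings, and the siblings.
function findContext(node, name, ancestors = []) {
  for (let i = 0; i < node.children.length; i++) {
    const child = node.children[i];
    if (child.type === "Identifier" && child.value === name) {
      const grandparent = ancestors[ancestors.length - 1];
      return {
        parent: node.type,
        grandparent: grandparent ? grandparent.type : null,
        position: i,
        siblings: node.children.filter((_, j) => j !== i).map(describe),
      };
    }
    const found = findContext(child, name, [...ancestors, node]);
    if (found) return found;
  }
  return null;
}

const context = findContext(ast, "callback");
console.log(context);
```

For `callback`, the context consists of the parent `Arguments`, the grandparent `CallExpr`, position 0, and the sibling `Literal 1000`.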

slide-23
SLIDE 23

13

Learning Embeddings

Train neural network to predict context from identifier

Use hidden layer as representation for identifier

Input layer: Identifier (one-hot vector)
Hidden layer: Embedding vector
Output layer: Context (one-hot vectors)
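
Why the hidden layer can serve as the representation: multiplying a one-hot input by the input-to-hidden weight matrix selects exactly one row of that matrix. The sketch below shows only this lookup, not the training; the vocabulary and weight values are invented for illustration.

```javascript
// Hypothetical vocabulary of identifiers.
const vocabulary = ["x", "y", "width", "callback"];

// Input-to-hidden weights, one row per identifier. Training would learn
// these values; here they are placeholders.
const inputToHidden = [
  [0.9, 0.1, 0.0],
  [0.1, 0.9, 0.0],
  [0.8, 0.2, 0.1],
  [0.0, 0.1, 0.9],
];

// One-hot encoding: all zeros except a 1 at the identifier's index.
function oneHot(name) {
  const index = vocabulary.indexOf(name);
  if (index === -1) throw new Error(`unknown identifier: ${name}`);
  return vocabulary.map((_, i) => (i === index ? 1 : 0));
}

// Hidden-layer activation (no bias, no nonlinearity): input vector times matrix.
function hiddenLayer(input, weights) {
  const size = weights[0].length;
  const hidden = new Array(size).fill(0);
  for (let i = 0; i < input.length; i++) {
    for (let j = 0; j < size; j++) hidden[j] += input[i] * weights[i][j];
  }
  return hidden;
}

// The embedding of an identifier is its hidden-layer activation.
const embed = (name) => hiddenLayer(oneHot(name), inputToHidden);

console.log(embed("width"));
```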

slide-24
SLIDE 24

14

Training the Bug Detector

Given: Embeddings of callee and two arguments

Train neural network: Predict whether correct or wrong

Callee + Arg. 1 + Arg. 2 → Two hidden layers → Probability that correct
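
A forward pass through such a network might be sketched as below. Only the overall shape (concatenated embeddings, two hidden layers, one probability) comes from the slide. The layer sizes, the ReLU and sigmoid activations, and the placeholder weights are assumptions; a trained model would replace the weights.

```javascript
// Fully connected layer: one output per weight row.
function dense(input, weights, biases, activation) {
  return weights.map((row, i) =>
    activation(row.reduce((sum, w, j) => sum + w * input[j], biases[i])));
}

const relu = (z) => Math.max(0, z);
const sigmoid = (z) => 1 / (1 + Math.exp(-z));

// Deterministic placeholder weights (assumption; stands in for training).
function makeLayer(outputs, inputs, seed) {
  const weights = Array.from({ length: outputs }, (_, i) =>
    Array.from({ length: inputs }, (_, j) => 0.5 * Math.sin(seed + i * inputs + j)));
  return { weights, biases: new Array(outputs).fill(0) };
}

// Concatenate the three embeddings and run them through the network.
function predictCorrect(callee, arg1, arg2, layers) {
  const input = [...callee, ...arg1, ...arg2];
  const h1 = dense(input, layers[0].weights, layers[0].biases, relu);
  const h2 = dense(h1, layers[1].weights, layers[1].biases, relu);
  const [probability] = dense(h2, layers[2].weights, layers[2].biases, sigmoid);
  return probability;
}

// Three embeddings of size 3 give 9 inputs; hidden layers of size 4.
const layers = [makeLayer(4, 9, 1), makeLayer(4, 4, 2), makeLayer(1, 4, 3)];
const probability = predictCorrect(
  [0.9, 0.1, 0.0], // callee embedding (made-up values)
  [0.8, 0.2, 0.1], // first argument
  [0.1, 0.9, 0.0], // second argument
  layers
);
console.log(probability);
```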

slide-31
SLIDE 31

15

Beyond Swapped Arguments

Same idea works for other bug patterns

Assignments of incorrect values
  var callback = function () { .. }  →  var callback = "abc"

Incorrect binary operators
  if (x == undefined) ...  →  if (x > undefined) ...

Swapped operands of binary operations
  bytes[i + 1] >> 4  →  4 >> bytes[i + 1]
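
Seeding the two binary-operation patterns can be sketched with the same kind of transformation as for swapped arguments. The record shape (`left`, `operator`, `right`) and the function names are hypothetical; expressions are kept as strings for brevity.

```javascript
// Seed a swapped-operands bug.
function swapOperands(expr) {
  return { left: expr.right, operator: expr.operator, right: expr.left };
}

// Seed a wrong-operator bug by substituting a different operator.
function replaceOperator(expr, newOperator) {
  if (newOperator === expr.operator) {
    throw new Error("replacement operator must differ from the original");
  }
  return { ...expr, operator: newOperator };
}

// Turn a record back into source text.
const render = (expr) => `${expr.left} ${expr.operator} ${expr.right}`;

const shift = { left: "bytes[i + 1]", operator: ">>", right: "4" };
const comparison = { left: "x", operator: "==", right: "undefined" };

console.log(render(swapOperands(shift)));
console.log(render(replaceOperator(comparison, ">")));
```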

slide-32
SLIDE 32

16

Evaluation: Setup

100,000 JavaScript files from various projects

80,000 for training
20,000 for validation
68 million lines of code
37.3 million occurrences of identifiers
10.1 million occurrences of literals

slide-33
SLIDE 33

17

Examples of Bugs

// Callback must come before the
// number of milliseconds to wait
setTimeout(50, dojo.lang.hitch(this, function () { ... }));

// First argument must be smaller than
// the second argument
array.slice(3, 0);

slide-35
SLIDE 35

18

Precision and Recall

[Plot: Swapped arguments. Axes: Precision, Recall. Curves: AST embedding, Random embedding]

slide-36
SLIDE 36

18

Precision and Recall

[Plot: Wrong operator in binary operations. Axes: Precision, Recall. Curves: AST embedding, Random embedding]
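
Each point on a precision-recall curve corresponds to one reporting threshold. A minimal sketch of that computation, with invented scores rather than results from the evaluation; the `bugScore` and `isBug` field names are hypothetical.

```javascript
// Invented classifier outputs: higher bugScore = more likely a bug.
const predictions = [
  { bugScore: 0.95, isBug: true },
  { bugScore: 0.80, isBug: true },
  { bugScore: 0.70, isBug: false },
  { bugScore: 0.40, isBug: true },
  { bugScore: 0.10, isBug: false },
];

// Precision: fraction of reported warnings that are bugs.
// Recall: fraction of bugs that are reported.
function precisionRecall(preds, threshold) {
  let truePositives = 0;
  let falsePositives = 0;
  let falseNegatives = 0;
  for (const p of preds) {
    const reported = p.bugScore >= threshold;
    if (reported && p.isBug) truePositives++;
    else if (reported && !p.isBug) falsePositives++;
    else if (!reported && p.isBug) falseNegatives++;
  }
  const reportedTotal = truePositives + falsePositives;
  const bugTotal = truePositives + falseNegatives;
  return {
    // Convention chosen here: precision is 1 when nothing is reported.
    precision: reportedTotal === 0 ? 1 : truePositives / reportedTotal,
    recall: bugTotal === 0 ? 0 : truePositives / bugTotal,
  };
}

for (const threshold of [0.75, 0.3]) {
  console.log(threshold, precisionRecall(predictions, threshold));
}
```

Lowering the threshold reports more code, which raises recall and typically lowers precision.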

slide-37
SLIDE 37

19

Open Challenges

Better representation of identifiers

Same name Same meaning

Ensure that seeded bugs are realistic

Learn bug patterns from version histories?

Generalize to more bug patterns

Train a model per bug pattern

slide-38
SLIDE 38

20

Conclusion

Replace manually written program analyses with trained machine learning models

Buggy code + Correct code → Train machine learning model → Classifier

Precision and recall match or exceed manually written analyses