Understanding Source Code through Machine Learning to Create Smart Software Engineering Tools (PowerPoint Presentation)



SLIDE 1

Understanding Source Code through Machine Learning to Create Smart Software Engineering Tools

Miltos Allamanis, University of Edinburgh

March 13th, 2016

My PhD is supported by

Joint work with: Charles Sutton (UoE), Earl T. Barr (UCL), Chris Bird (MSR), Daniel Tarlow (MSRC), Yi Wei (MSRC), Andrew D. Gordon (MSRC)

SLIDE 2

internal & external codebases

Mine the hidden knowledge to create smart software engineering tools. Developers implicitly embed knowledge in code that may be useful for the same or other projects.
SLIDE 3

Machine Learning Models of Source Code
Software Engineer

SLIDE 4

A Spectrum of Problems for Machine Learning

(Figure: a spectrum from Supervised to Unsupervised learning; Clustering sits at the unsupervised end.)

SLIDE 5

A Spectrum of Problems for Machine Learning

(Figure: the same spectrum, from Joint Classification to Learning Features.)

SLIDE 6

Natural Language Processing with Machine Learning

❯ Resolve language ambiguities with principled probabilistic models of language.
❯ Learn model parameters from annotated corpora.

SLIDE 7

Natural Language Processing (NLP)

Use Machine Learning to model aspects of a natural language.

Models of Aspects of Natural Language

Data: Corpora of Text, Speech, etc.
Tasks: Parsing, Named Entity Recognition, Machine Translation, ...

Some Knowledge of Linguistics
SLIDE 8

Machine Learning Models of Source Code

“All models are wrong, some are useful” - George Box

Machine Learning Models of Aspects of Source Code

Codebases

Software Engineers

Software Engineering Tools

SLIDE 9

Language Models for Source Code

Assign a non-zero probability to every piece of valid code. Probabilities are learned from a training corpus.

SLIDE 10

Language Models of Source Code – Design Choices

for (int i = 0; i < 10; i++){ Console.WriteLine(i); }

ForStatement
  Initialization: SingleVariableDeclaration
    Type: int
    Name: i
    Initializer: NumericLiteral 0
  Expression: InfixExpression
    LeftOperand: i
    Operator: <
    RightOperand: NumericLiteral 10
  Body

Design choices: Token-level Models vs. Syntactic Models

SLIDE 11

N-gram Language Models

Parameters of ML Model

e.g. P(0 | “for (int i =”)
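Such a token-level model can be sketched in a few lines. The toy trigram model below, with add-one smoothing, is illustrative only (the models in the talk are trained on large corpora and are more sophisticated; the tokenization here is a simple whitespace split):

```python
from collections import defaultdict

class TrigramModel:
    """Toy n-gram (n=3) language model over code tokens (illustrative sketch)."""
    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))
        self.vocab = set()

    def train(self, token_stream):
        tokens = ["<s>", "<s>"] + token_stream
        self.vocab.update(tokens)
        for i in range(2, len(tokens)):
            context = (tokens[i - 2], tokens[i - 1])
            self.counts[context][tokens[i]] += 1

    def prob(self, token, context):
        # Add-one (Laplace) smoothing: every token in the vocabulary
        # gets non-zero probability, even in unseen contexts.
        c = self.counts[tuple(context)]
        total = sum(c.values())
        return (c[token] + 1) / (total + len(self.vocab))

model = TrigramModel()
model.train("for ( int i = 0 ; i < 10 ; i ++ )".split())
# Probability of seeing "i" after the context "( int":
p = model.prob("i", ["(", "int"])
```

Seen continuations score higher than unseen ones, which is exactly what P(0 | "for (int i =") exploits.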

SLIDE 12

How do n-gram models see code?

package org.cfeclipse.cfml.snippets; import org.rioproject.examples.logicdesigner.model.getState ( ) { cdl.Choreography; import org.apache.thrift.protocol.TProtocolUtil.skip(iprot); event.newLineCount == 3 ) { case '|' : if ( rule.FireAllRulesCommand; import org.apache.hadoop.conf.get(0, 0, newByteBuffer, 0, count); } switch ( classifierID ) { pd.getName() { cBondNeighborsB.get(MODULE).declaringType = (DEREnumerated) { jobEntryName.getText("//td[2]/a", RuntimeVariables.replace("//div[@class='lfr-component lfr-menu-list']/ul/li[1]/a" )); } }

SLIDE 13

Machine Learning

Machine Learning Model + Model Parameters

Learn the parameters of the model from data. Handle uncertainty and noise.

The model is designed by humans; its parameters are learned from data.

SLIDE 14

Learning Model Parameters

Image from marple.eeb.uconn.edu

❯ Optimize an objective function on the training set
❯ Use computational methods of optimization
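As a concrete (toy) illustration of these two bullets: assume a one-parameter model y ≈ w*x and a squared-error objective; gradient descent then optimizes the objective on the (invented) training set below. This is a generic sketch, not any model from the talk:

```python
# Fit y ≈ w * x by minimizing mean squared error with gradient descent.
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]  # toy (x, y) pairs, roughly y = 2x

w = 0.0    # model parameter, learned from data
lr = 0.01  # learning rate
for step in range(2000):
    # Gradient of the mean of (w*x - y)^2 with respect to w.
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    w -= lr * grad

# w converges near 2.0, the slope that best fits the data
```

The same recipe, scaled up, is what trains the language models above: pick an objective, then iteratively follow its gradient.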

SLIDE 15

Finding a good model

image from http://antianti.org/?p=175

(Figure: underfitting vs. overfitting.)

SLIDE 16

Automatic Evaluation in Machine Learning

Imperfect measures of performance, such as:
❯ Prediction Accuracy
❯ Model Fit
❯ Quantify performance in a reproducible manner
❯ Drive improvement of systems in a measurable way

SLIDE 17

Source Code and Machine Learning

Coding Patterns

Mine & exploit common patterns

[Hindle et al. 2012, Allamanis & Sutton 2014, Allamanis et al. 2014, 2015]

Probabilistic Static Analyses

Probability Distribution of (Formal) Properties

[Raychev et al. 2015, Mangal et al. 2015]

Formal Methods

Probabilities over Search Space (e.g. Synthesis)

[Ellis et al. 2015]

Runtime Traces

Infer Program Properties from Traces

[Brockschmidt et al. 2014 Yujia Li et al. 2015]

Code & Text

Code search, NL to Code

[Yusuke, et al. 2015, Movshovitz-Attias & Cohen, 2013, Allamanis et al. 2014]

SLIDE 18

Outline

Learning Naming Conventions

❯ Lexical Patterns

Learning to Map Natural Language to Source Code

❯ Syntactic Patterns

SLIDE 19

Learning Naming Conventions

“Programs must be written for people to read, and only incidentally for machines to execute.”

  • Abelson & Sussman, SICP, preface to the first edition
SLIDE 20

A coding convention is a syntactic constraint beyond those imposed by the language grammar.

Allamanis et al., FSE 2014, FSE 2015 (ACM Distinguished Paper Award)

SLIDE 21

The Importance of Coding Conventions

Based on 169 code reviews with 1,093 discussion threads at Microsoft.

Code Review Discussions

Conventions 38% Naming 24% Formatting 9%

[Allamanis et al. FSE 2014]


SLIDE 23

Arnaoudova, V., Eshkevari, L., Di Penta, M., Oliveto, R., Antoniol, G., and Gueheneuc, Y.-G. "REPENT: Analyzing the nature of identifier renamings." (2014)

94 developers

Is recommending identifier renamings useful?

SLIDE 24

A Machine Learning Perspective

A name reflects important aspects of code functionality. Learning to name source code elements is a first step in understanding code through machine learning.

SLIDE 27

Suggestions for junit/src/test/java/junit/tests/runner/TextRunnerTest.java

public class TextRunnerTest extends TestCase {
    void execTest(String testClass, boolean success) throws Exception {
        ...
        InputStream i = p.getInputStream();
        while ((i.read()) != -1);
        ...
    }
    ...
}

Source Code Language Model: score by naturalness & threshold, automatically suggest renamings

1.'i' (18.07%) -> {input(81.93%), }

SLIDE 28

Suggesting Names to Developers: The Naturalize Framework

[Allamanis et al. FSE 2014, FSE 2015]

ML model of code

SLIDE 29

Naturalize Tools - devstyle

devstyle suggests identifier renamings

SLIDE 30

18 patches for 5 well-known open source projects: 14 accepted, 4 ignored

SLIDE 31

libgdx

Java Game Development Framework

Method Naming Problem

SLIDE 32

Method Naming Problem

Names describe what it does not what it is Models need to be “non-local”

SLIDE 33
Method Naming Problem

Suggestions:
  • create
  • create?UNK?
  • init
  • createShader


SLIDE 35

A Machine Learning Model of Names

[Mnih, Hinton, 2007; Mnih, Teh, 2012; Mnih, Kavukcuoglu, 2013; Maddison, Tarlow, 2014]

SLIDE 36

Embedding Identifiers

Identifier vectors ("embeddings") are model parameters.

[Mnih, Hinton, 2007; Mnih, Teh, 2012; Mnih, Kavukcuoglu, 2013; Maddison, Tarlow, 2014]
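As a toy illustration of what an embedding table gives you: nearby vectors correspond to related identifiers. The identifiers and 3-d vectors below are hand-set, hypothetical values; real embeddings are learned parameters with hundreds of dimensions:

```python
import math

# Hypothetical, hand-set embeddings for a few identifiers (illustration only).
embeddings = {
    "fileName": [0.90, 0.10, 0.00],
    "filePath": [0.85, 0.20, 0.05],
    "counter":  [0.00, 0.10, 0.95],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def nearest(name):
    """Most similar identifier to `name` under cosine similarity."""
    others = [n for n in embeddings if n != name]
    return max(others, key=lambda n: cosine(embeddings[name], embeddings[n]))
```

Here `nearest("fileName")` returns "filePath": semantically related names end up close in the embedding space.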


SLIDE 38-47

Neural Context Models of Source Code

(Figure-only build slides: the network combines Local Information from the surrounding context with Global Information about the identifier.)

SLIDE 48

Embedding Identifiers

[Mnih, Hinton, 2007; Mnih, Teh, 2012; Mnih, Kavukcuoglu, 2013; Maddison, Tarlow, 2014]

SLIDE 49

Neologisms

SLIDE 50

Subtoken Context Models of Code

getInputStream -> get Input Stream

Sequentially predict each subtoken given the context and the previous subtokens.
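The first step of such a model is splitting identifiers into subtokens. A minimal camelCase splitter might look like this (an illustrative sketch, not the talk's exact tokenizer):

```python
import re

def subtokens(identifier):
    """Split a camelCase identifier into lowercase subtokens.
    Handles runs of capitals (acronyms) and digits as separate subtokens."""
    parts = re.findall(r'[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+', identifier)
    return [p.lower() for p in parts]
```

For example, `subtokens("getInputStream")` yields `["get", "input", "stream"]`; a subtoken model then predicts "get", "input", "stream" one at a time, which lets it compose neologisms never seen as whole tokens.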

SLIDE 51

Training Data (project) -> Train Neural Network (Embeddings) -> Suggest Names on Test Data

SLIDE 52

Evaluation Methodology

Suggestions

1. job (30%)
2. task (20%)
3. tsk (15%)

ForkJoinTask<?> job;
if (task instanceof ForkJoinTask<?>) // avoid re-wrap
    job = (ForkJoinTask<?>) task;
else
    job = new ForkJoinTask.AdaptedRunnableAction(task);
externalPush(job);

Test File

Evaluation on top 10 Java GitHub projects. Perturb existing code and retrieve ground truth.

SLIDE 53

Evaluation Methodology

Compare with ground truth. Suggestions:

1. job (30%)
2. task (20%)
3. tsk (15%)

ForkJoinTask<?> job;
if (task instanceof ForkJoinTask<?>) // avoid re-wrap
    job = (ForkJoinTask<?>) task;
else
    job = new ForkJoinTask.AdaptedRunnableAction(task);
externalPush(job);

Test File

Evaluation on top 10 Java GitHub projects. Perturb existing code and retrieve ground truth.

SLIDE 54

Suggesting Variable Names


SLIDE 56

Suggesting Method Names


SLIDE 58

Embedding Visualization http://groups.inf.ed.ac.uk/cup/naturalize


SLIDE 61

Learning to map natural language to source code

Work done in Microsoft Research - Cambridge Joint work with Danny Tarlow, Yi Wei, Andy Gordon

SLIDE 62

Applications of Joint Models of Code & NL

❯ Code Retrieval
❯ NL Retrieval for Source Code

and eventually code synthesis...

SLIDE 63

A Conditional Generative Model

NL Query: "get the first letter of each word in string and uppercase"
Conditional Generative Model of Source Code: synthesize/score a code snippet

string s;
string[] words = s.ToUpper().Split(' ');
string[] firstLetters = new string[words.Length];
for (int i = 0; i < words.Length; i++) {
    firstLetters[i] = words[i].Substring(0, 1);
}

SLIDE 64

ForStatement
  Initialization: SingleVariableDeclaration
    Type: int
    Name: i
    Initializer: NumericLiteral 0
  Expression: InfixExpression
    LeftOperand: i
    Operator: <
    RightOperand: NumericLiteral 10
  Body

Syntactic model of source code, i.e. model how AST is generated

SLIDE 65

Tree Generation Model: Context Free Grammars (CFG)

Production rules expand a nonterminal n into a sequence of children c.

SLIDE 66

Tree Generation Model: Probabilistic Context Free Grammars (PCFG)

Each production n -> c additionally carries a probability P(c | n).
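To make the generative story concrete, here is a tiny PCFG sampler. The grammar, symbols, and probabilities below are invented for illustration (a real syntactic model learns production probabilities from a corpus of parsed code):

```python
import random

# A hand-written toy PCFG: each nonterminal maps to a list of
# (children, probability) productions; symbols not in the grammar are terminals.
grammar = {
    "ForStatement": [(["for", "(", "Init", ";", "Cond", ";", "i++", ")", "Body"], 1.0)],
    "Init": [(["int", "i", "=", "0"], 1.0)],
    "Cond": [(["i", "<", "NumericLiteral"], 1.0)],
    "NumericLiteral": [(["10"], 0.5), (["100"], 0.5)],
    "Body": [(["{", "}"], 1.0)],
}

def sample(symbol, rng=random):
    """Recursively expand `symbol`, choosing productions by their probability."""
    if symbol not in grammar:  # terminal: emit as-is
        return [symbol]
    r, acc = rng.random(), 0.0
    for children, p in grammar[symbol]:
        acc += p
        if r <= acc:
            return [t for child in children for t in sample(child, rng)]
    return []

tokens = sample("ForStatement")
```

Each top-down expansion step here mirrors one build step on the following slides; the only stochastic choice in this toy grammar is the numeric literal.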

SLIDE 67

Generating from a PCFG

ForStatement

SLIDE 68

Generating from a PCFG

ForStatement Initialization Expression Expression Body

SLIDE 69

Generating from a PCFG

ForStatement Initialization Expression Expression Body Single Variable Declaration

SLIDE 70

Generating from a PCFG

ForStatement Initialization Expression Expression Body Single Variable Declaration Type Name Initializer

SLIDE 71

Generating from a PCFG

ForStatement Initialization Expression Expression Body Single Variable Declaration Type int Name Initializer

SLIDE 72

Generating from a PCFG

ForStatement
  Initialization: SingleVariableDeclaration
    Type: int
    Name: i
    Initializer: NumericLiteral 0
  Expression: InfixExpression
    LeftOperand: i
    Operator: <
    RightOperand: NumericLiteral 10
  Body

SLIDE 73

Conditional Generative Model of Source Code

Given natural language, get a model that can generate (probabilistically) source code, i.e. P(code | natural language)
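One immediate use of such a conditional model is retrieval: score every candidate snippet by P(code | NL) and return the best. The sketch below uses a naive term-overlap score as a stand-in for the learned log-probability, and the candidate snippets are hypothetical; only the ranking interface is the point:

```python
def score(query, snippet):
    """Stand-in for log P(code | NL): naive term overlap, NOT a learned model."""
    q = set(query.lower().split())
    s = snippet.lower()
    for ch in "().,;":          # crude code tokenization
        s = s.replace(ch, " ")
    return len(q & set(s.split())) / max(len(q), 1)

snippets = [
    "File.Exists(path)",        # hypothetical candidate snippets
    "DateTime.Now.DayOfWeek",
]

# Rank candidates by score; the top-scoring snippet is the retrieval result.
best = max(snippets, key=lambda sn: score("check if file exists", sn))
```

Swapping the overlap score for the neural model's conditional probability turns the same loop into the code-retrieval system evaluated later in the talk.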

SLIDE 74

A Neural Log-Bilinear Bimodal Model of Code

Kiros, Ryan, Ruslan Salakhutdinov, and Rich Zemel. "Multimodal neural language models."
Maddison, Chris and Daniel Tarlow. "Structured generative models of natural source code."


SLIDE 77

StackOverflow Data & Augmenting Data with Queries

SLIDE 78

Natural Language “Query” Code Snippets

SLIDE 79

30K C# Questions 40K C# Snippets

SLIDE 80

C# enum
C# foreach enum
C# enumerate enum
How to order enum values in C#
foreach enum C#
C# enumerate enumerations
C# enumeration
C# enumerator
C# foreach enum values
C# enumerate

http://stackoverflow.com/questions/105372/how-do-i-enumerate-an-enum

slide-81
SLIDE 81

40,092 C# Snippets
6,355,393 Natural Language Queries

SLIDE 82

Performance Metric: Mean Reciprocal Rank

Measures how well we rank the correct answer
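The metric can be computed in a few lines; this is a generic sketch of MRR, not the talk's exact evaluation script:

```python
def mean_reciprocal_rank(ranked_lists, correct_answers):
    """MRR: average of 1/rank of the correct answer in each ranked list
    (contributing 0 when the answer is absent)."""
    total = 0.0
    for ranking, answer in zip(ranked_lists, correct_answers):
        if answer in ranking:
            total += 1.0 / (ranking.index(answer) + 1)  # ranks are 1-based
    return total / len(ranked_lists)

# Correct answer ranked 1st, 2nd, and 3rd -> MRR = (1 + 1/2 + 1/3) / 3
mrr = mean_reciprocal_rank(
    [["a", "b"], ["x", "a"], ["p", "q", "a"]],
    ["a", "a", "a"],
)
```

An MRR of 0.43, as in the query-retrieval results below, roughly means the correct answer tends to appear in the top few positions.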

SLIDE 83

Retrieval Evaluation - MRR Performance

Code Retrieval:
  Model      StackOverflow Test 1   StackOverflow Test 2
  NL+Code    0.18                   0.17
  NL only    0.12                   0.13

Query Retrieval:
  Model           StackOverflow Test 1   StackOverflow Test 2
  Multiplicative  0.43                   0.41
  NL only         0.25                   0.26

Test 1: Code snippets from training set with new natural language queries. Test 2: New code snippets and new natural language queries.

SLIDE 84

Synthesis Samples

> timespan day the week
DateTime DateTime=DateTime.Now(0);
> file exists on directory
var path = new File(directory)


SLIDE 86

Retrieval Sample

SLIDE 87

Challenges

❯ Code Representations in Machine Learning
❯ Define Representative Evaluation Metrics for Software Engineering Tasks
❯ Create Useful and Efficient Software Engineering Tools

SLIDE 88

Understanding Source Code through Machine Learning to Create Smart Software Engineering Tools

SLIDE 89

(end)

SLIDE 90

❯ Learning Naming Conventions (Allamanis et al., 2014; 2015)
❯ Mining Source Code Idioms (Allamanis and Sutton, 2014)
❯ Learning to name source code
❯ n-gram LMs for Code (Allamanis et al., 2013)

SLIDE 91

The Sympathetic Uniqueness Principle

  • Prune rare words
  • Repurpose special UNK token
  • Allows Naturalize to decide when it should not suggest

public void execute(Runnable task) {
    if (task == null)
        throw new NullPointerException();
    ForkJoinTask<?> job;
    if (task instanceof ForkJoinTask<?>) // avoid re-wrap
        job = (ForkJoinTask<?>) task;
    else
        job = new ForkJoinTask.AdaptedRunnableAction(task);
    externalPush(job);
}

Rare names often usefully signify unusual functionality, and need to be preserved.

SLIDE 92

Idioms vs. the Rest

Code Clones: copy-paste code fragments
  • C. K. Roy et al. Comparison and evaluation of code clone detection techniques and tools: A qualitative approach. Science of Computer Programming, 2009.
  • L. Jiang et al. Deckard: Scalable and accurate tree-based detection of code clones. ICSE 2007.
  • H. A. Basit and S. Jarzabek. A data mining approach for detecting higher-level clones in software. IEEE Transactions on Software Engineering, 2009.

API Patterns: usage patterns of methods
  • T. T. Nguyen et al. Graph-based mining of multiple object usage patterns. ESEC/FSE 2009.
  • J. Wang et al. Mining succinct and high-coverage API usage patterns from source code. MSR 2013.
  • H. Zhong et al. MAPO: Mining and recommending API usage patterns. ECOOP, 2009.

Idioms: syntactic code fragments


SLIDE 94

The Distributional Hypothesis

“You shall know a word by the company it keeps”. John Rupert Firth, 1957

The ????????? is walking

SLIDE 95

For IWESEP, in a way I'm inviting you as a ML/NLP person that can teach software engineering people the cool things you can do with ML/NLP techniques. I'm not sure if you see yourself this way, but I think a quick "intro to ML/NLP" for the first 1/3 or so, then "look at all the cool things you can do" for the second 2/3 would be one potential way to give the presentation.

SLIDE 96

Sanity Check:

String Manipulation Synthetic Data

var result = input_string.Split(' ').Select((string x) => Double.Parse(x)).Average();

each element parse double separated by a space and get mean
each element parse double separated by a space and get average
each element convert to double separated by a space and get mean
each element convert to double separated by a space and get average
each element parse to double separated by a space and get mean
