Statistical Analysis of Computer Program Text
Charles Sutton
University of Edinburgh
Source code is a means of human communication
Development “out in the open”
[Figure: count (x1000) per year, 2011 to 2014, of pull requests (Github), posts (Stack Overflow), and repositories (Sourceforge)]
Probabilistic modelling
- Problem
- Model (family of distributions)
- Learning (objective function)
  - Data: source files
  - Supervised: (x1,y1)…(xn,yn)
  - Unsupervised: x1…xn
- Distribution
- Do stuff
  - Predict y from x: p(y|xtest)
  - “Explore” x: p(z|xtest)
  - Inspect distribution: p(z|x1…xn)
Learning Natural Coding Conventions
[Allamanis, Barr, Bird, Sutton; FSE 2014]
junit/src/test/java/junit/tests/runner/TextRunnerTest.java
public class TextRunnerTest extends TestCase {
    void execTest(String testClass, boolean success) throws Exception {
        ...
        InputStream i = p.getInputStream();
        while ((i.read()) != -1);
        ...
    }
    ...
}
Suggest alternate names
input, inputStream, is, stream
Score and threshold
input (81.93%)
Language Models for Source Code
Probability distribution over token sequences
Consider naive estimator
In Naturalize: choose the name other programmers use in similar contexts
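The idea of choosing the name that other programmers use in similar contexts can be sketched with a small n-gram model. The block below is only an illustration under stated assumptions, not the Naturalize implementation: it uses a bigram model with add-one smoothing, and the class name `BigramNameScorer`, the `?` placeholder convention, and the normalization over candidates are all hypothetical choices.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: a bigram language model with add-one smoothing.
class BigramNameScorer {
    private final Map<String, Integer> contextCount = new HashMap<>();
    private final Map<String, Integer> bigramCount = new HashMap<>();
    private final Set<String> vocab = new HashSet<>();

    // Count bigrams (and how often each token appears as a left context).
    void train(List<String> tokens) {
        for (int i = 0; i < tokens.size(); i++) {
            vocab.add(tokens.get(i));
            if (i + 1 < tokens.size()) {
                contextCount.merge(tokens.get(i), 1, Integer::sum);
                bigramCount.merge(tokens.get(i) + " " + tokens.get(i + 1), 1, Integer::sum);
            }
        }
    }

    // P(next | prev) with add-one (Laplace) smoothing over the seen vocabulary.
    double prob(String prev, String next) {
        int pair = bigramCount.getOrDefault(prev + " " + next, 0);
        int ctx = contextCount.getOrDefault(prev, 0);
        return (pair + 1.0) / (ctx + vocab.size());
    }

    // Log-probability of a token sequence under the bigram model.
    double logProb(List<String> tokens) {
        double lp = 0.0;
        for (int i = 1; i < tokens.size(); i++) {
            lp += Math.log(prob(tokens.get(i - 1), tokens.get(i)));
        }
        return lp;
    }

    // Substitute each candidate for the "?" placeholder, score the resulting
    // sequence, and normalize the scores so they sum to one over the candidates.
    Map<String, Double> scoreCandidates(List<String> template, List<String> candidates) {
        Map<String, Double> scores = new HashMap<>();
        double total = 0.0;
        for (String name : candidates) {
            List<String> filled = new ArrayList<>();
            for (String tok : template) {
                filled.add(tok.equals("?") ? name : tok);
            }
            double p = Math.exp(logProb(filled));
            scores.put(name, p);
            total += p;
        }
        for (String name : candidates) {
            scores.put(name, scores.get(name) / total);
        }
        return scores;
    }
}
```

A caller would train on tokenized snippets from a codebase, score a handful of candidate names for one declaration, and suggest a rename only when the best candidate's normalized score passes a chosen threshold.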
Naming Methods and Classes
[Allamanis, Barr, Bird, Sutton; FSE 2015]
Name that Tune Java Method
private void createDefaultShader() {
    String vertexShader = "literal_1";
    String fragmentShader = "literal_2";
    shader = new ShaderProgram(vertexShader, fragmentShader);
    if (shader.isCompiled() == false)
        throw new IllegalArgumentException("literal_3" + shader.getLog());
}
Figure 1: This method is from libgdx’s CameraGroupStrategy
libgdx (http://libgdx.badlogicgames.com): “Desktop/Android/Blackberry/iOS/HTML5 Java game development framework”
Embedding Identifiers
Log Bilinear Context Model
c = (private, void, (, ), {, String, vertexShader, =, “literal_1”, ;, String, …)
t = createDefaultShader
$$P(t \mid c_{1:m}) = \frac{\exp\{s_\theta(t, c_{1:m})\}}{\sum_{t'} \exp\{s_\theta(t', c_{1:m})\}}$$
$$s_\theta(t, c_{1:m}) = q_t^\top \hat{r}_c + b_t$$
$q_v \in \mathbb{R}^D$ are “embeddings” (for example $q_{\mathrm{createDefaultShader}}$, $q_{\mathrm{hashCode}}$): model parameters
What about $\hat{r}_c$? More complex, we need to summarize many tokens
[Mnih, Hinton, 2007; Mnih, Teh, 2012; Mnih, Kavukcuoglu, 2013; Maddison, Tarlow, 2014]
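The scoring rule of the log-bilinear context model can be sketched in a few lines. This is an illustration under assumptions, not the model from the talk: the slide leaves the form of the context summary open, so the sketch simply averages the context token vectors, and the class name `LogBilinearSketch` and all embedding values a caller passes in are hypothetical.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch of log-bilinear scoring with toy embeddings.
class LogBilinearSketch {
    // Assumption: summarize the context as the mean of its token vectors.
    static double[] contextVector(List<String> context, Map<String, double[]> r, int dim) {
        double[] rc = new double[dim];
        int used = 0;
        for (String tok : context) {
            double[] v = r.get(tok);
            if (v == null) {
                continue; // skip tokens that have no embedding
            }
            for (int d = 0; d < dim; d++) {
                rc[d] += v[d];
            }
            used++;
        }
        if (used > 0) {
            for (int d = 0; d < dim; d++) {
                rc[d] /= used;
            }
        }
        return rc;
    }

    // s(t, c) = q_t . r_c + b_t
    static double score(double[] qt, double[] rc, double bt) {
        double s = bt;
        for (int d = 0; d < qt.length; d++) {
            s += qt[d] * rc[d];
        }
        return s;
    }

    // P(t | c) = exp(s(t, c)) / sum over t' of exp(s(t', c)), computed stably
    // by subtracting the maximum score before exponentiating.
    static double[] softmax(double[] scores) {
        double max = Double.NEGATIVE_INFINITY;
        for (double s : scores) {
            max = Math.max(max, s);
        }
        double z = 0.0;
        double[] p = new double[scores.length];
        for (int i = 0; i < scores.length; i++) {
            p[i] = Math.exp(scores[i] - max);
            z += p[i];
        }
        for (int i = 0; i < p.length; i++) {
            p[i] /= z;
        }
        return p;
    }
}
```

Each candidate method name gets one score against the same context vector, and the softmax turns those scores into a distribution over candidate names.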
Mining Idioms from Code
[Allamanis and Sutton; FSE 2014]
Mined Idioms (General Java)
- Defining a String constant
- Creating a logger for a class
- Looping through lines from a BufferedReader
- Iterating through the elements of an Iterator
Mined Idioms (Library-Specific)
- Get an HTML Document in jsoup
- Show a small popup in Android
- Get the distance between two points in Android
- Database transaction in neo4j
Model: Tree substitution grammars
Mining API Patterns
[Fowkes and Sutton; NIPS WS 2014]
http://arxiv.org/abs/1510.04130
TwitterFactory.getInstance TwitterFactory.<init> Twitter.setOAuthConsumer Twitter.setOAuthAccessToken
API patterns from code
TwitterFactory.getInstance TwitterFactory.<init> Status.getUser Status.getText ConfigurationBuilder.<init> ConfigurationBuilder.build ConfigurationBuilder.<init> TwitterFactory.<init> ConfigurationBuilder.<init> ConfigurationBuilder.setOAuthConsumerKey ConfigurationBuilder.<init> ConfigurationBuilder.setOAuthConsumerKey ConfigurationBuilder.build ConfigurationBuilder.setOAuthConsumerKey ConfigurationBuilder.build User.getId User.getId User.getId User.getScreenName ConfigurationBuilder.<init> ConfigurationBuilder.setOAuthConsumerKey ConfigurationBuilder.setOAuthConsumerSecret TwitterFactory.getInstance TwitterFactory.<init> TwitterFactory.getInstance Twitter.setOAuthConsumer TwitterFactory.<init> Twitter.setOAuthConsumer TwitterFactory.getInstance TwitterFactory.<init> Twitter.setOAuthConsumer Status.getUser Status.getText Twitter.setOAuthConsumer Twitter.setOAuthAccessToken ConfigurationBuilder.<init> ConfigurationBuilder.build TwitterFactory.<init> Twitter.setOAuthAccessToken TwitterFactory.getInstance Twitter.setOAuthAccessToken ConfigurationBuilder.<init> TwitterFactory.<init> TwitterFactory.getInstance TwitterFactory.<init> Status.getUser Status.getText TwitterFactory.getInstance TwitterFactory.<init> Twitter.setOAuthConsumer Twitter.setOAuthAccessToken auth.AccessToken.getToken auth.AccessToken.getTokenSecret ConfigurationBuilder.<init> ConfigurationBuilder.setDebugEnabled ConfigurationBuilder.build TwitterFactory.getInstance TwitterFactory.<init> Twitter.setOAuthConsumer Twitter.setOAuthAccessToken Twitter.updateStatus ConfigurationBuilder.<init> ConfigurationBuilder.setOAuthConsumerKey ConfigurationBuilder.setOAuthConsumerSecret ConfigurationBuilder.build http.AccessToken.getToken http.AccessToken.getTokenSecret TwitterFactory.<init> TwitterFactory.getInstance Status.getId Status.getId
MAPO [Zhong et al, 2009]; UP-Miner [Wang et al, 2013]; IIM (actually a slight extension)
Model
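[Graphical model with nodes $\pi_S$, $z_S^{(j)}$ and $X^{(j)}$, with plates over $S \in \mathcal{I}$ and $j \in 1, \ldots, m$]
To sample a transaction,
- 1. For each itemset $S \in \mathcal{I}$, sample $z_S \sim \mathrm{Bernoulli}(\pi_S)$.
- 2. Deterministically set $X = \bigcup_{S : z_S = 1} S$.
Parameters:
- $\mathcal{I}$: collection of “interesting” itemsets
- $\pi_S \in [0, 1]$: probability of occurrence, for each $S \in \mathcal{I}$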
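The two-step generative process for a transaction can be sketched directly. The block below is a minimal illustration, not the authors' code: the class name `ItemsetSampler`, the use of strings for items, and the seeded `java.util.Random` are assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Random;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of the generative process for one transaction.
class ItemsetSampler {
    // Maps each interesting itemset S to its occurrence probability pi_S.
    private final Map<Set<String>, Double> pi = new LinkedHashMap<>();
    private final Random rng;

    ItemsetSampler(long seed) {
        this.rng = new Random(seed);
    }

    void addItemset(Set<String> itemset, double probability) {
        if (probability < 0.0 || probability > 1.0) {
            throw new IllegalArgumentException("pi_S must lie in [0, 1]");
        }
        // Copy defensively so later changes to the caller's set do not affect the key.
        pi.put(new TreeSet<>(itemset), probability);
    }

    // Step 1: draw z_S ~ Bernoulli(pi_S) for every itemset S.
    // Step 2: return X, the union of all itemsets with z_S = 1.
    Set<String> sampleTransaction() {
        Set<String> x = new TreeSet<>();
        for (Map.Entry<Set<String>, Double> e : pi.entrySet()) {
            boolean z = rng.nextDouble() < e.getValue();
            if (z) {
                x.addAll(e.getKey());
            }
        }
        return x;
    }
}
```

Because `nextDouble()` returns a value in [0, 1), an itemset with probability 1 is always included and one with probability 0 never is. Learning the parameters from observed transactions is the inverse problem and is not shown here.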
Stepping Back
- Local conventions (naming, formatting): ngram models
- Method naming: word embeddings
- Mining idioms: probabilistic grammars
- Itemset mining: latent-variable modelling
- Miltiadis Allamanis
- Jaroslav Fowkes
- Hao Peng
- Chris Bird, MSR
- Earl Barr, UCL
Thanks!
Key concepts in probabilistic modelling
- Sufficiency
  - what statistics of the data am I memorizing?
- Latent variables, e.g.,
  - what tree macros were used to generate the AST?
  - what item sets were used in a transaction?
Why patterns in software?
Surface-semantic correspondence
void addOne(int[] arr) {
    for (int i = 0; i < arr.length; i++) {
        arr[i] += 1;
    }
}
void foo(int[] bar) {
    int baz = 0;
    while (true) {
        if (baz >= bar.length) break;
        bar[baz] = bar[baz] + 1;
        baz = baz + 1;
    }
}
Semantics available from glancing rather than reading
Orthogonal interfaces
Tools that “do one thing well” need to be combined well
Natural code: Code with good correspondence?
“Semantic retreat”
- NLP -> statistical NLP
- PL analysis -> statistical PL analysis
A new type of program analysis
Static analysis
- Construct program abstraction (loses information)
- Why abstract: exact decision is undecidable for Turing-complete languages
- Then logical inference
Statistical analysis
- Construct program abstraction (loses information)
- Why: data sparsity, inductive bias
- Then statistical inference