Statistical Analysis of Computer Program Text Charles Sutton - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Statistical Analysis of Computer Program Text

Charles Sutton University of Edinburgh

slide-2
SLIDE 2

Source code is a means of human communication

slide-3
SLIDE 3

Development “out in the open”

[Figure: count (x1000, axis from 1000 to 6000) by year, 2011 to 2014, for three series: pull requests (GitHub), posts (Stack Overflow), and repositories (Sourceforge)]

slide-4
SLIDE 4

Probabilistic modelling

Problem → Model (family of distributions) → Learning (objective function) → Distribution → Do stuff

Data: source files

  • Supervised: (x1,y1)…(xn,yn)
  • Unsupervised: x1…xn

Do stuff:

  • Predict y from x: p(y|xtest)
  • “Explore” x: p(z|xtest)
  • Inspect distribution: p(z|x1…xn)

slide-5
SLIDE 5

Learning Natural Coding Conventions

[Allamanis, Barr, Bird, Sutton; FSE 2014]

slide-6
SLIDE 6

junit/src/test/java/junit/tests/runner/TextRunnerTest.java
 public class TextRunnerTest extends TestCase {
 void execTest(String testClass, boolean success) throws Exception {
 ...
 InputStream i = p.getInputStream();
 while ((i.read()) != -1);
 ...
 }
 ...
 }


slide-7
SLIDE 7

junit/src/test/java/junit/tests/runner/TextRunnerTest.java
 public class TextRunnerTest extends TestCase {
 void execTest(String testClass, boolean success) throws Exception {
 ...
 InputStream i = p.getInputStream();
 while ((i.read()) != -1);
 ...
 }
 ...
 }


Suggest alternate names: input, inputStream, is, stream

Score and threshold: input (81.93%)

slide-8
SLIDE 8

Language Models for Source Code

Probability distribution over token sequences:

Consider naive estimator:

In Naturalize: choose the name other programmers use in similar contexts
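The idea on this slide can be illustrated with a small sketch. It assumes a bigram model with add-one smoothing; the slide does not state the model order or the smoothing, so both are assumptions, as are the class name `NgramNameScorer`, the `<s>` start marker, the `NAME` placeholder, and the toy training snippets. Each candidate name is substituted into the token sequence, and the name that makes the sequence most probable wins.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only: a bigram language model over code tokens,
// used to rank candidate identifier names. Not the Naturalize implementation.
public class NgramNameScorer {
    private final Map<String, Integer> contextCounts = new HashMap<>();
    private final Map<String, Integer> bigramCounts = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();

    /** Count bigrams in one tokenized snippet, starting from the marker "<s>". */
    public void train(List<String> tokens) {
        String previous = "<s>";
        vocabulary.add(previous);
        for (String token : tokens) {
            vocabulary.add(token);
            contextCounts.merge(previous, 1, Integer::sum);
            bigramCounts.merge(previous + " " + token, 1, Integer::sum);
            previous = token;
        }
    }

    /** Add-one smoothed estimate of P(token | previous). */
    public double conditional(String previous, String token) {
        int pair = bigramCounts.getOrDefault(previous + " " + token, 0);
        int context = contextCounts.getOrDefault(previous, 0);
        return (pair + 1.0) / (context + vocabulary.size());
    }

    /** Log-probability of a whole token sequence under the bigram model. */
    public double logProbability(List<String> tokens) {
        double total = 0.0;
        String previous = "<s>";
        for (String token : tokens) {
            total += Math.log(conditional(previous, token));
            previous = token;
        }
        return total;
    }

    /** Substitute each candidate for the placeholder and keep the most probable one. */
    public String bestName(List<String> template, String placeholder, List<String> candidates) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String candidate : candidates) {
            List<String> renamed = new ArrayList<>();
            for (String token : template) {
                renamed.add(token.equals(placeholder) ? candidate : token);
            }
            double score = logProbability(renamed);
            if (score > bestScore) {
                bestScore = score;
                best = candidate;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        NgramNameScorer scorer = new NgramNameScorer();
        // Toy corpus (hypothetical): what "other programmers" wrote in similar contexts.
        scorer.train(Arrays.asList("InputStream", "input", "=", "p", ".", "getInputStream", "(", ")", ";"));
        scorer.train(Arrays.asList("InputStream", "input", "=", "socket", ".", "getInputStream", "(", ")", ";"));
        scorer.train(Arrays.asList("InputStream", "stream", "=", "file", ".", "open", "(", ")", ";"));
        List<String> template =
            Arrays.asList("InputStream", "NAME", "=", "p", ".", "getInputStream", "(", ")", ";");
        System.out.println(scorer.bestName(template, "NAME", Arrays.asList("i", "input", "stream")));
    }
}
```

On this toy corpus the sketch prefers `input` over `i` and `stream`, because `input` follows `InputStream` most often in the training snippets.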

slide-9
SLIDE 9

Naming Methods and Classes

[Allamanis, Barr, Bird, Sutton; FSE 2015]

slide-10
SLIDE 10

Name that Tune Java Method

 private void createDefaultShader() {
   String vertexShader = "literal_1";
   String fragmentShader = "literal_2";
   shader = new ShaderProgram(vertexShader,
       fragmentShader);
   if (shader.isCompiled() == false)
     throw new IllegalArgumentException(
         "literal_3" + shader.getLog());
 }

Figure 1: This method is from libgdx’s CameraGroupStrategy

http://libgdx.badlogicgames.com

from libgdx

“Desktop/Android/Blackberry/iOS/HTML5 Java game development framework”

slide-11
SLIDE 11

Embedding Identifiers

Log Bilinear Context Model

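[Figure: vectors $q_{\text{createDefaultShader}}$, $q_{\text{hashCode}}$, and $\hat{r}_c$]

Context: $c =$ (private, void, (, ), {, String, vertexShader, =, “literal_1”, ;, String, …)

Target: $t = \text{createDefaultShader}$

$$P(t \mid c_{1:m}) = \frac{\exp\{s_\theta(t, c_{1:m})\}}{\sum_{t'} \exp\{s_\theta(t', c_{1:m})\}}$$

$$s_\theta(t, c_{1:m}) = q_t^\top \hat{r}_c + b_t$$

  • $q_v \in \mathbb{R}^D$ are “embeddings”: model parameters
  • What about $\hat{r}_c$? More complex, we need to summarize many tokens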

[Mnih, Hinton, 2007; Mnih, Teh, 2012; Mnih, Kavukcuoglu, 2013; Maddison, Tarlow, 2014]
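To make the scoring rule on this slide concrete, the following sketch computes the score q_t · r̂_c + b_t for each candidate name and normalizes with a softmax. The slide leaves open how r̂_c summarizes the context; purely for illustration, the sketch assumes a plain average of per-token context vectors. The class name `LogBilinearScorer`, the two-dimensional embeddings, and all numeric values are hypothetical.

```java
// Illustrative sketch only: log-bilinear scoring of candidate method names.
public class LogBilinearScorer {

    /** Dot product of two vectors of equal length. */
    public static double dot(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    /** ASSUMPTION: summarize the context as the average of its token vectors. */
    public static double[] averageContext(double[][] contextVectors) {
        double[] mean = new double[contextVectors[0].length];
        for (double[] vector : contextVectors) {
            for (int d = 0; d < mean.length; d++) {
                mean[d] += vector[d] / contextVectors.length;
            }
        }
        return mean;
    }

    /** P(t | c) as a softmax over the scores q_t . rHat + b_t of all candidates t. */
    public static double[] distribution(double[][] q, double[] b, double[] rHat) {
        double[] p = new double[q.length];
        double max = Double.NEGATIVE_INFINITY;
        for (int t = 0; t < q.length; t++) {
            p[t] = dot(q[t], rHat) + b[t];
            max = Math.max(max, p[t]);
        }
        double normalizer = 0.0;
        for (int t = 0; t < p.length; t++) {
            p[t] = Math.exp(p[t] - max); // subtract the max for numerical stability
            normalizer += p[t];
        }
        for (int t = 0; t < p.length; t++) {
            p[t] /= normalizer;
        }
        return p;
    }

    public static void main(String[] args) {
        String[] names = {"createDefaultShader", "hashCode"};
        double[][] q = {{1.0, 0.5}, {-0.5, 1.0}};        // hypothetical name embeddings
        double[] b = {0.0, 0.0};                          // hypothetical biases
        double[][] context = {{0.8, 0.1}, {1.2, 0.3}};    // hypothetical context token vectors
        double[] p = distribution(q, b, averageContext(context));
        for (int t = 0; t < names.length; t++) {
            System.out.println(names[t] + " " + p[t]);
        }
    }
}
```

With these made-up values the context vector points in roughly the same direction as the embedding of `createDefaultShader`, so that name receives the larger probability.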

slide-12
SLIDE 12

Mining Idioms from Code

[Allamanis and Sutton; FSE 2014]

slide-13
SLIDE 13

Mined Idioms (General Java)

  • Defining a String constant
  • Creating a logger for a class
  • Looping through lines from a BufferedReader
  • Iterating through the elements of an Iterator
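For readers unfamiliar with the idioms named on this slide, the sketch below shows one conventional way to write each of them in Java. These are textbook forms written for illustration, not the mined output shown in the talk; the class name `GeneralJavaIdioms`, the choice of `java.util.logging`, and the sample input are assumptions.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.logging.Logger;

// Illustrative sketch only: hand-written examples of four common Java idioms.
public class GeneralJavaIdioms {

    // Idiom: defining a String constant.
    public static final String GREETING = "hello";

    // Idiom: creating a logger for a class.
    private static final Logger LOGGER =
        Logger.getLogger(GeneralJavaIdioms.class.getName());

    // Idiom: looping through lines from a BufferedReader.
    public static List<String> readLines(BufferedReader reader) throws IOException {
        List<String> lines = new ArrayList<>();
        String line;
        while ((line = reader.readLine()) != null) {
            lines.add(line);
        }
        return lines;
    }

    // Idiom: iterating through the elements of an Iterator.
    public static int totalLength(Iterator<String> iterator) {
        int total = 0;
        while (iterator.hasNext()) {
            total += iterator.next().length();
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        List<String> lines = readLines(new BufferedReader(new StringReader("a\nbb\nccc")));
        LOGGER.info(GREETING + ": read " + lines.size() + " lines");
        System.out.println(totalLength(lines.iterator())); // prints 6
    }
}
```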

slide-14
SLIDE 14

Mined Idioms (Library-Specific)

  • Get an HTML Document in jsoup
  • Show a small popup in Android
  • Get the distance between two points in Android
  • Database transaction in Neo4j

slide-15
SLIDE 15

Model: Tree substitution grammars

slide-16
SLIDE 16

Mining API Patterns

[Fowkes and Sutton; NIPS WS 2014]

http://arxiv.org/abs/1510.04130

TwitterFactory.getInstance TwitterFactory.<init> Twitter.setOAuthConsumer Twitter.setOAuthAccessToken

slide-17
SLIDE 17

API patterns from code

TwitterFactory.getInstance TwitterFactory.<init> Status.getUser Status.getText ConfigurationBuilder.<init> ConfigurationBuilder.build ConfigurationBuilder.<init> TwitterFactory.<init> ConfigurationBuilder.<init> ConfigurationBuilder.setOAuthConsumerKey ConfigurationBuilder.<init> ConfigurationBuilder.setOAuthConsumerKey ConfigurationBuilder.build ConfigurationBuilder.setOAuthConsumerKey ConfigurationBuilder.build User.getId User.getId User.getId User.getScreenName ConfigurationBuilder.<init> ConfigurationBuilder.setOAuthConsumerKey ConfigurationBuilder.setOAuthConsumerSecret TwitterFactory.getInstance TwitterFactory.<init> TwitterFactory.getInstance Twitter.setOAuthConsumer TwitterFactory.<init> Twitter.setOAuthConsumer TwitterFactory.getInstance TwitterFactory.<init> Twitter.setOAuthConsumer Status.getUser Status.getText Twitter.setOAuthConsumer Twitter.setOAuthAccessToken ConfigurationBuilder.<init> ConfigurationBuilder.build TwitterFactory.<init> Twitter.setOAuthAccessToken TwitterFactory.getInstance Twitter.setOAuthAccessToken ConfigurationBuilder.<init> TwitterFactory.<init> TwitterFactory.getInstance TwitterFactory.<init> Status.getUser Status.getText TwitterFactory.getInstance TwitterFactory.<init> Twitter.setOAuthConsumer Twitter.setOAuthAccessToken auth.AccessToken.getToken auth.AccessToken.getTokenSecret ConfigurationBuilder.<init> ConfigurationBuilder.setDebugEnabled ConfigurationBuilder.build TwitterFactory.getInstance TwitterFactory.<init> Twitter.setOAuthConsumer Twitter.setOAuthAccessToken Twitter.updateStatus ConfigurationBuilder.<init> ConfigurationBuilder.setOAuthConsumerKey ConfigurationBuilder.setOAuthConsumerSecret ConfigurationBuilder.build http.AccessToken.getToken http.AccessToken.getTokenSecret TwitterFactory.<init> TwitterFactory.getInstance Status.getId Status.getId

MAPO [Zhong et al, 2009] UP-Miner [Wang et al, 2013] IIM (actually a slight extension)

slide-18
SLIDE 18

Model

[Graphical model: nodes $\pi_S$, $z_S^{(j)}$, $X^{(j)}$; plates over $S \in I$ and $j \in 1, ..., m$]

To sample a transaction,

  1. For each itemset $S \in I$, sample $z_S \sim \text{Bernoulli}(\pi_S)$.
  2. Deterministically set $X = \bigcup_{S : z_S = 1} S$.

Parameters:

  • $I$: collection of “interesting” itemsets
  • $\pi_S \in [0, 1]$ for each $S \in I$: probability of occurrence
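The two sampling steps on this slide can be sketched directly: flip one Bernoulli coin per interesting itemset, then take the union of the itemsets whose coin came up 1. The class name `ItemsetModelSampler`, the example itemsets, the probabilities, and the random seed are hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;
import java.util.Set;
import java.util.TreeSet;

// Illustrative sketch only: sampling a transaction from the itemset model.
public class ItemsetModelSampler {

    /**
     * Step 1: for each itemset S, sample z_S ~ Bernoulli(pi_S).
     * Step 2: return X, the union of all itemsets with z_S = 1.
     */
    public static Set<String> sampleTransaction(
            List<Set<String>> itemsets, double[] pi, Random rng) {
        Set<String> transaction = new TreeSet<String>();
        for (int s = 0; s < itemsets.size(); s++) {
            boolean included = rng.nextDouble() < pi[s]; // true with probability pi[s]
            if (included) {
                transaction.addAll(itemsets.get(s));
            }
        }
        return transaction;
    }

    public static void main(String[] args) {
        List<Set<String>> itemsets = new ArrayList<Set<String>>();
        itemsets.add(new TreeSet<String>(
            Arrays.asList("TwitterFactory.<init>", "TwitterFactory.getInstance")));
        itemsets.add(new TreeSet<String>(
            Arrays.asList("Status.getUser", "Status.getText")));
        double[] pi = {0.7, 0.3}; // hypothetical occurrence probabilities
        Random rng = new Random(0);
        for (int j = 0; j < 3; j++) {
            System.out.println(sampleTransaction(itemsets, pi, rng));
        }
    }
}
```

Learning runs in the opposite direction: given observed transactions, infer which itemsets belong in the collection and how probable each one is.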

slide-19
SLIDE 19

Stepping Back

slide-20
SLIDE 20

  • Local conventions (naming, formatting): n-gram models
  • Mining idioms: probabilistic grammars
  • Itemset mining: latent-variable modelling
  • Method naming: word embeddings

  • Miltiadis Allamanis
  • Jaroslav Fowkes
  • Hao Peng
  • Chris Bird, MSR
  • Earl Barr, UCL

Thanks!

TwitterFactory.getInstance TwitterFactory.<init> Twitter.setOAuthConsumer Twitter.setOAuthAccessToken

slide-21
SLIDE 21

Key concepts in probabilistic modelling

  • Sufficiency
      • what statistics of the data am I memorizing?
  • Latent variables, e.g.,
      • what tree macros were used to generate AST?
      • what item sets were used in a transaction?

slide-22
SLIDE 22

Why patterns in software?

Surface-semantic correspondence

 void addOne(int[] arr) {
   for (int i = 0; i < arr.length; i++) {
     arr[i] += 1;
   }
 }

 void foo(int[] bar) {
   int baz = 0;
   while (true) {
     bar[baz] = bar[baz] + 1;
     baz = baz + 1;
     if (baz >= bar.length) break;
   }
 }

Semantics available from glancing rather than reading

Orthogonal interfaces

Tools that “do one thing well” need to be combined well

Natural code: Code with good correspondence?

slide-23
SLIDE 23

“Semantic retreat”

  • NLP → statistical NLP
  • PL analysis → statistical PL analysis

A new type of program analysis

Static analysis

  • Construct program abstraction (loses information)
  • Why abstract: exact decision is undecidable for Turing-complete languages
  • Then logical inference

Statistical analysis

  • Construct program abstraction (loses information)
  • Why: data sparsity, inductive bias
  • Then statistical inference