Mining Source Code Repositories at Massive Scale using Language Modeling



slide-1
SLIDE 1

Mining Source Code Repositories at Massive Scale using Language Modeling

Miltos Allamanis, Charles Sutton

m.allamanis@ed.ac.uk csutton@inf.ed.ac.uk

University of Edinburgh

Supported by:

slide-2
SLIDE 2

  • Polyglot programmers
  • Multitude of APIs & libraries
  • Transfer knowledge from available code

slide-3
SLIDE 3

Why Language Models?

  • Statistical models
  • Learn from data
  • Abundance of code available online
  • Non-language specific method

[Hindle et al., ICSE 2012]

slide-4
SLIDE 4

n-gram Language Models

public void execute(Runnable task) {
    if (task == null)
        throw new NullPointerException();
    ForkJoinTask<?> job;
    if (task instanceof ForkJoinTask<?>) // avoid re-wrap
        job = (ForkJoinTask<?>) task;
    else
        job = new ForkJoinTask.AdaptedRunnableAction(task);
    externalPush(job);
}
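
The slide shows the method only as input to an n-gram model. As a rough illustration of what such a model does with it, here is a minimal sketch of a trigram model estimated by relative frequency. The class name `TrigramSketch`, the naive tokeniser, and the absence of smoothing are all assumptions made for this example; the talk does not specify its tokenisation or smoothing.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: a maximum-likelihood trigram model over code tokens. */
final class TrigramSketch {
    private final Map<String, Integer> contextCounts = new HashMap<>();
    private final Map<String, Integer> trigramCounts = new HashMap<>();

    /**
     * Naive tokeniser (an assumption): runs of identifier characters stay whole,
     * every other non-whitespace character becomes its own token, so "==" is
     * split into two "=" tokens.
     */
    static List<String> tokenize(String code) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < code.length()) {
            char c = code.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
                continue;
            }
            int start = i;
            if (Character.isJavaIdentifierPart(c)) {
                while (i < code.length() && Character.isJavaIdentifierPart(code.charAt(i))) {
                    i++;
                }
            } else {
                i++;
            }
            tokens.add(code.substring(start, i));
        }
        return tokens;
    }

    /** Counts every trigram and its two-token context in the sequence. */
    void train(List<String> tokens) {
        for (int i = 2; i < tokens.size(); i++) {
            String context = tokens.get(i - 2) + " " + tokens.get(i - 1);
            contextCounts.merge(context, 1, Integer::sum);
            trigramCounts.merge(context + " " + tokens.get(i), 1, Integer::sum);
        }
    }

    /** P(next | prev2 prev1) by relative frequency; 0 for unseen contexts (no smoothing). */
    double probability(String prev2, String prev1, String next) {
        String context = prev2 + " " + prev1;
        int contextCount = contextCounts.getOrDefault(context, 0);
        if (contextCount == 0) {
            return 0.0;
        }
        return trigramCounts.getOrDefault(context + " " + next, 0) / (double) contextCount;
    }
}
```

Trained on the tokens of the method above, `probability("throw", "new", "NullPointerException")` is 1.0 because that context occurs once and is always followed by the same token. A model trained on a large corpus would need smoothing to handle unseen n-grams.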


slide-8
SLIDE 8

n-gram Language Models

Predictability Measures

  • n-gram Log Probability (NGLP)
  • Cross-Entropy (H)
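
The slide names the two measures without defining them. Under the usual definitions they can be written as follows, where the notation is an assumption of this note: t_1 ... t_N is the token sequence, n is the model order, P is the n-gram model, and base-2 logarithms give values in bits.

```latex
\mathrm{NGLP}(t_{i-n+1} \dots t_i) = \log_2 P(t_i \mid t_{i-n+1} \dots t_{i-1})

H = -\frac{1}{N} \sum_{i=1}^{N} \log_2 P(t_i \mid t_{i-n+1} \dots t_{i-1})
```

Read this way, H is the negated average NGLP over the sequence: lower cross-entropy means the model finds the code more predictable.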

slide-9
SLIDE 9

The Java GitHub Corpus

  • Java projects with >1 fork
  • Deduplication through git commit SHAs
  • URL: http://groups.inf.ed.ac.uk/cup/javaGithub/
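
The slide says only that duplicates are detected through git commit SHAs. One way this could work is sketched below, under the assumption that two projects count as duplicates when they share at least one commit SHA and that the first project encountered is the one kept. The class `ShaDedupSketch`, its input shape, and that tie-breaking rule are hypothetical and not taken from the talk.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper: drops projects that share commit history with an earlier one. */
final class ShaDedupSketch {
    /**
     * Walks the projects in the map's iteration order and keeps a project only
     * if none of its commit SHAs belongs to a project that was already kept.
     */
    static List<String> deduplicate(Map<String, Set<String>> commitShasByProject) {
        Set<String> seenShas = new HashSet<>();
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, Set<String>> entry : commitShasByProject.entrySet()) {
            boolean sharesHistory = false;
            for (String sha : entry.getValue()) {
                if (seenShas.contains(sha)) {
                    sharesHistory = true;
                    break;
                }
            }
            if (!sharesHistory) {
                kept.add(entry.getKey());
                seenShas.addAll(entry.getValue());
            }
        }
        return kept;
    }
}
```

Passing a `LinkedHashMap` makes the iteration order, and therefore the choice of which copy survives, deterministic.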

slide-10
SLIDE 10

Language Models of Code

slide-11
SLIDE 11

Learning about identifiers

slide-12
SLIDE 12

Learning about identifiers

API calls are predictable

slide-13
SLIDE 13

n-gram log probability (NGLP) as a complexity metric

NGLP is Data-Driven

An n-gram is more complex if it is more rare
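
A small worked example shows the direction of the metric. It takes NGLP to be the base-2 logarithm of the probability the model assigns to the n-gram (the base is an assumption; the slide does not fix it), and the two probabilities are made up for illustration.

```latex
P_{\text{common}} = \tfrac{1}{2} \;\Rightarrow\; \mathrm{NGLP} = \log_2 \tfrac{1}{2} = -1

P_{\text{rare}} = \tfrac{1}{1024} \;\Rightarrow\; \mathrm{NGLP} = \log_2 \tfrac{1}{1024} = -10
```

The rarer n-gram has the lower (more negative) NGLP, so it counts as the more complex of the two.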

slide-14
SLIDE 14

Complexity trade-offs

from elasticsearch

slide-15
SLIDE 15

vs

from elasticsearch

slide-16
SLIDE 16

Identifier Information Metric (IIM)

IIM = Hfull - Hcollapsed

  • Evaluate domain specificity of code
  • Larger IIM, more domain-specific identifiers
  • Use to evaluate code reusability

ContinuationPending.java 5.2
FastDtoa.java 5.0
PrivateAccessClass.java 4.7

JSSetter.java 1.0
GeneratedClassLoader.java 1.1
UintMap.java 1.2
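
The metric can be sketched in code under one reading of the slide: Hfull is the cross-entropy of a file's original tokens, and Hcollapsed is the cross-entropy after every identifier is replaced by a single placeholder token. The class `IimSketch`, the `<ID>` placeholder, the partial keyword list, and the identifier test are assumptions for illustration; the cross-entropy values themselves would come from language models that are not shown here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper for the Identifier Information Metric. */
final class IimSketch {
    /** Partial Java keyword list, for illustration only. */
    private static final Set<String> KEYWORDS = new HashSet<>(Arrays.asList(
        "if", "else", "new", "return", "int", "void", "public", "class", "null", "throw"));

    /** Placeholder that stands in for every identifier (an assumed spelling). */
    static final String ID = "<ID>";

    /** Replaces each token that looks like a non-keyword identifier with the placeholder. */
    static List<String> collapseIdentifiers(List<String> tokens) {
        List<String> collapsed = new ArrayList<>(tokens.size());
        for (String token : tokens) {
            boolean isIdentifier = !token.isEmpty()
                && Character.isJavaIdentifierStart(token.charAt(0))
                && !KEYWORDS.contains(token);
            collapsed.add(isIdentifier ? ID : token);
        }
        return collapsed;
    }

    /** IIM = Hfull - Hcollapsed, both cross-entropies measured in the same units. */
    static double iim(double hFull, double hCollapsed) {
        return hFull - hCollapsed;
    }
}
```

For example, `collapseIdentifiers` turns the tokens `int count = other ;` into `int <ID> = <ID> ;`. A file whose identifiers carry little information loses little predictability when they are collapsed, which gives a small IIM.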

slide-17
SLIDE 17

Contributions

  • GitHub Java Corpus
  • New gigatoken language models
  • API calls are predictable
  • Data-driven code complexity metrics
  • Metric of domain-specificity
slide-19
SLIDE 19

n-gram Language Models

slide-20
SLIDE 20

Language Models - Metrics

  • Log Probability (NGLP)
  • Cross Entropy (H)

slide-21
SLIDE 21

Learning about identifiers

slide-22
SLIDE 22

Learning about identifiers

Method and type identifiers are equally hard to predict, irrespective of the amount of data.