Understanding Source Code through Machine Learning to Create Smart Software Engineering Tools

  1. Understanding Source Code through Machine Learning to Create Smart Software Engineering Tools
     Miltos Allamanis, University of Edinburgh
     March 13th, 2016
     My PhD is supported by [logo]
     Joint work with: Charles Sutton (UoE), Earl T. Barr (UCL), Chris Bird (MSR), Daniel Tarlow (MSRC), Yi Wei (MSRC), Andrew D. Gordon (MSRC)

  2. Developers implicitly embed knowledge in code that may be useful for the same or other projects.
     Mine the hidden knowledge to create smart software engineering tools.
     internal & external codebases

  3. Software Engineer; Machine Learning Models of Source Code

  4. A Spectrum of Problems for Machine Learning Clustering Unsupervised Supervised

  5. A Spectrum of Problems for Machine Learning Joint Classification Learning Features

  6. Natural Language Processing with Machine Learning ❯ Resolve language ambiguities with principled probabilistic models of language. ❯ Learn model parameters from annotated corpora.

  7. Natural Language Processing (NLP)
     Some Knowledge of Linguistics + Data: Corpora of Text, Speech etc. → Models of Aspects of Natural Language → Parsing, Named Entity Recognition, Machine Translation, ...
     Use Machine Learning to model aspects of a natural language.

  8. Machine Learning Models of Source Code
     Software Engineers + Codebases → Machine Learning Models of Aspects of Source Code → Software Engineering Tools
     “All models are wrong, some are useful” - George Box

  9. Language Models for Source Code: Assign a non-zero probability to every piece of valid code. Probabilities are learned from a training corpus.

  10. Language Models of Source Code – Design Choices
      Token-level Models: for (int i = 0; i < 10; i++){ Console.WriteLine(i); }
      Syntactic Models:
      ForStatement
        Initialization: Single Variable Declaration (Type: int, Name: i, Initializer: Numeric Literal 0)
        Expression: Infix Expression (Left Operand: i, Operator: <, Right Operand: Numeric Literal 10)
        Expression
        Body

  11. N-gram Language Models: Parameters of the ML model, e.g. P(0 | “for (int i =”)
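
The conditional probability on the n-gram slide can be sketched in a few lines of Python. This is a minimal illustration, not the model used in the talk: the three-line training corpus, the trigram order, and the add-one smoothing (used here so that unseen continuations still get a non-zero probability) are all assumptions made for the example.

```python
from collections import Counter

def train_ngram(token_sequences, n=3):
    """Count n-grams and their (n-1)-token contexts over tokenized code."""
    ngram_counts, context_counts, vocab = Counter(), Counter(), set()
    for tokens in token_sequences:
        # Pad so the first real token also has a full context.
        padded = ["<s>"] * (n - 1) + list(tokens) + ["</s>"]
        vocab.update(padded)
        for i in range(n - 1, len(padded)):
            context = tuple(padded[i - n + 1:i])
            ngram_counts[context + (padded[i],)] += 1
            context_counts[context] += 1
    return ngram_counts, context_counts, vocab

def prob(token, context, model, n=3):
    """P(token | last n-1 tokens of context), with add-one smoothing."""
    ngram_counts, context_counts, vocab = model
    context = tuple(context[-(n - 1):])
    return (ngram_counts[context + (token,)] + 1) / (context_counts[context] + len(vocab))

# Hypothetical toy corpus of pre-tokenized loop headers.
corpus = [
    "for ( int i = 0 ; i < 10 ; i ++ )".split(),
    "for ( int j = 0 ; j < n ; j ++ )".split(),
    "for ( int i = 0 ; i < n ; i ++ )".split(),
]
model = train_ngram(corpus, n=3)

prefix = ["for", "(", "int", "i", "="]
p_zero = prob("0", prefix, model)  # continuation seen in training
p_n = prob("n", prefix, model)     # never seen after "i =", still non-zero
```

Only the last two tokens of the prefix (`i =`) influence the trigram estimate, which is why n-gram models capture local regularities but miss longer-range structure.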

  12. How do n-gram models see code? package org.cfeclipse.cfml.snippets; import org.rioproject.examples.logicdesigner.model.getState ( ) { cdl.Choreography; import org.apache.thrift.protocol.TProtocolUtil.skip(iprot); event.newLineCount == 3 ) { case '|' : if ( rule.FireAllRulesCommand; import org.apache.hadoop.conf.get(0, 0, newByteBuffer, 0, count); } switch ( classifierID ) { pd.getName() { cBondNeighborsB.get(MODULE).declaringType = (DEREnumerated) { jobEntryName.getText("//td[2]/a", RuntimeVariables.replace("//div[@class='lfr-component lfr-menu-list']/ul/li[1]/a" )); } }

  13. Machine Learning: Learn the parameters of the model from data. Handle uncertainty and noise.
      Machine Learning Model = Model (designed by humans) + Model Parameters (learned from data)

  14. Learning Model Parameters (image from marple.eeb.uconn.edu) ❯ Optimize an objective function on the training set ❯ Use computational methods of optimization
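
As a generic illustration of optimizing an objective on a training set, here is a minimal gradient-descent sketch. The objective (mean squared error of a line), the toy data, the learning rate, and the step count are assumptions chosen for the example; the talk does not specify which optimizer or objective its models use.

```python
def fit_line(xs, ys, lr=0.05, steps=2000):
    """Fit y = w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradients of (1/n) * sum((w*x + b - y)^2) with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy training set generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w, b = fit_line(xs, ys)
```

The model family (a line) is the part designed by humans; the values of `w` and `b` are the parameters learned from data.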

  15. Finding a good model: Underfitting vs. Overfitting (image from http://antianti.org/?p=175)

  16. Automatic Evaluation in Machine Learning: imperfect measures of performance, such as Prediction Accuracy and Model Fit ❯ Quantify performance in a reproducible manner ❯ Drive improvement of systems in a measurable way

  17. Source Code and Machine Learning
      ❯ Coding Patterns: Mine & exploit common patterns [Hindle et al. 2012, Allamanis & Sutton 2014, Allamanis et al. 2014, 2015]
      ❯ Formal Methods: Probabilities over Search Space (e.g. Synthesis) [Ellis et al. 2015]
      ❯ Code & Text: Code search, NL to Code [Yusuke et al. 2015, Movshovitz-Attias & Cohen 2013, Allamanis et al. 2014]
      ❯ Probabilistic Static Analyses: Probability Distribution of (Formal) Properties [Raychev et al. 2015, Mangal et al. 2015]
      ❯ Runtime Traces: Infer Program Properties from Traces [Brockschmidt et al. 2014, Yujia Li et al. 2015]

  18. Outline: Learning Naming Conventions ❯ Lexical Patterns; Learning to Map Natural Language to Source Code ❯ Syntactic Patterns

  19. Learning Naming Conventions: “Programs must be written for people to read, and only incidentally for machines to execute.” - Abelson & Sussman, SICP, preface to the first edition

  20. A coding convention is a syntactic constraint beyond those imposed by the language grammar. Allamanis et al., FSE 2014, FSE 2015 ACM Distinguished Paper Award

  21. The Importance of Coding Conventions. Code review discussions: Conventions 38%, Naming 24%, Formatting 9% [Allamanis et al. FSE 2014]. Based on 169 code reviews with 1,093 discussion threads at Microsoft.

  23. Is recommending identifier renamings useful? 94 developers Arnaoudova, Venera, L. Eshkevari, Massimiliano Di Penta, Rocco Oliveto, Giuliano Antoniol, and Y. Gueheneuc. "REPENT: Analyzing the nature of identifier renamings." (2014)

  24. A Machine Learning Perspective: A name reflects important aspects of code functionality. Learning to name source code elements is a first step in understanding code through machine learning.

  27. Suggestions for junit/src/test/java/junit/tests/runner/TextRunnerTest.java
      public class TextRunnerTest extends TestCase {
        void execTest(String testClass, boolean success) throws Exception {
          ...
          InputStream i = p.getInputStream();
          while ((i.read()) != -1);
          ...
        }
        ...
      }
      Source Code Language Model: automatically suggest renamings. Score by naturalness & Threshold.
      1. 'i' (18.07%) -> {input (81.93%)}
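
The "score by naturalness and threshold" step can be sketched as follows. This is a simplified illustration under stated assumptions, not the Naturalize implementation: it uses an add-one-smoothed bigram model, a hypothetical four-line training corpus, and a hypothetical `suggest_renamings` helper that normalizes the scores of the candidate names into shares, loosely mirroring the percentages shown on the slide.

```python
import math
from collections import Counter

def train_bigrams(token_sequences):
    """Count bigrams and how often each token occurs as a left context."""
    contexts, bigrams = Counter(), Counter()
    for tokens in token_sequences:
        for prev, cur in zip(tokens, tokens[1:]):
            contexts[prev] += 1
            bigrams[(prev, cur)] += 1
    vocab = {t for seq in token_sequences for t in seq}
    return contexts, bigrams, vocab

def log_prob(tokens, model):
    """Log-probability of a token sequence under the smoothed bigram model."""
    contexts, bigrams, vocab = model
    v = len(vocab) + 1  # one extra slot reserves mass for unseen tokens
    return sum(math.log((bigrams[(p, c)] + 1) / (contexts[p] + v))
               for p, c in zip(tokens, tokens[1:]))

def suggest_renamings(tokens, current, candidates, model, threshold=0.5):
    """Score the snippet under each candidate name; suggest only above threshold."""
    names = [current] + [c for c in candidates if c != current]
    scores = {name: log_prob([name if t == current else t for t in tokens], model)
              for name in names}
    top = max(scores.values())
    weights = {name: math.exp(s - top) for name, s in scores.items()}
    total = sum(weights.values())
    shares = {name: w / total for name, w in weights.items()}
    best = max(shares, key=shares.get)
    suggestion = best if best != current and shares[best] >= threshold else None
    return suggestion, shares

# Hypothetical training corpus standing in for the rest of the project.
corpus = [
    "InputStream input = socket . getInputStream ( ) ;".split(),
    "InputStream input = file . getInputStream ( ) ;".split(),
    "while ( input . read ( ) != -1 ) ;".split(),
    "for ( int i = 0 ; i < n ; i ++ )".split(),
]
model = train_bigrams(corpus)
snippet = "InputStream i = p . getInputStream ( ) ; while ( i . read ( ) != -1 ) ;".split()
best, shares = suggest_renamings(snippet, "i", ["input"], model)
```

The threshold keeps the tool quiet unless an alternative name is clearly more natural than the current one, which matters for a tool whose suggestions developers must review by hand.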

  28. Suggesting Names to Developers: The Naturalize Framework ML model of code [Allamanis et al. FSE 2014, FSE 2015]

  29. Naturalize Tools - devstyle: devstyle suggests identifier renamings

  30. 18 patches for 5 well-known open-source projects: 14 accepted, 4 ignored

  31. Method Naming Problem: libgdx, a Java Game Development Framework

  32. Method Naming Problem: Names describe what a method does, not what it is. Models need to be “non-local”.

  33. Method Naming Problem Suggestions: • create • create?UNK? • init • createShader

  35. A Machine Learning Model of Names [Mnih, Hinton, 2007; Mnih, Teh, 2012; Mnih, Kavukcuoglu, 2013; Maddison, Tarlow, 2014]

  36. Embedding Identifiers: identifier “embeddings” are model parameters [Mnih, Hinton, 2007; Mnih, Teh, 2012; Mnih, Kavukcuoglu, 2013; Maddison, Tarlow, 2014]

  38. Neural Context Models of Source Code

  45. Neural Context Models of Source Code Global Information

  46. Neural Context Models of Source Code Local Information
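
A context model over embeddings can be sketched with a log-bilinear scoring rule in the style of the Mnih and Hinton work cited above: average the embeddings of the context tokens, take the dot product with each candidate name's embedding, and normalize with a softmax. Everything concrete here is an assumption for illustration. The two-dimensional vectors are hand-picked rather than learned, and the transcript does not say how the talk's model combines local and global information.

```python
import math

def predict_name(context_tokens, context_emb, name_emb, bias):
    """Return P(name | context) from dot products of embeddings plus a bias."""
    dim = len(next(iter(name_emb.values())))
    known = [context_emb[t] for t in context_tokens if t in context_emb]
    # Average the known context embeddings into a single context vector.
    if known:
        ctx = [sum(vec[d] for vec in known) / len(known) for d in range(dim)]
    else:
        ctx = [0.0] * dim
    scores = {name: sum(q * c for q, c in zip(vec, ctx)) + bias.get(name, 0.0)
              for name, vec in name_emb.items()}
    # Softmax, shifted by the maximum score for numerical stability.
    top = max(scores.values())
    exps = {name: math.exp(s - top) for name, s in scores.items()}
    z = sum(exps.values())
    return {name: e / z for name, e in exps.items()}

# Hand-picked toy embeddings; in a real model these are learned parameters.
context_emb = {"InputStream": [1.0, 0.0], "read": [0.8, 0.1],
               "int": [0.0, 1.0], "<": [0.1, 0.9]}
name_emb = {"input": [2.0, -1.0], "i": [-1.0, 2.0], "count": [0.0, 0.5]}
bias = {"i": 0.2}

stream_dist = predict_name(["InputStream", "read"], context_emb, name_emb, bias)
loop_dist = predict_name(["int", "<"], context_emb, name_emb, bias)
```

Because names and contexts share a vector space, names that occur in similar contexts end up with similar embeddings once the parameters are trained.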

  49. Neologisms

  50. Subtoken Context Models of Code: getInputStream → get, Input, Stream. Sequentially predict each subtoken given the context and the previous subtokens.
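
The subtoken decomposition and the sequence of prediction steps can be sketched as below. The splitting regex, the `prediction_steps` helper, and the `<END>` marker are assumptions made for this example; the transcript does not show how the talk's model tokenizes identifiers or signals the end of a name.

```python
import re

# Matches an acronym run, a capitalized or lowercase word, or a digit run.
_SUBTOKEN = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+")

def split_subtokens(identifier):
    """Split a camelCase or snake_case identifier into subtokens."""
    return _SUBTOKEN.findall(identifier)

def prediction_steps(identifier):
    """List (previous subtokens, next subtoken) pairs, ending with <END>."""
    subtokens = split_subtokens(identifier) + ["<END>"]
    return [(tuple(subtokens[:i]), subtokens[i]) for i in range(len(subtokens))]

parts = split_subtokens("getInputStream")
steps = prediction_steps("getInputStream")
```

Predicting one subtoken at a time lets a model compose names it has never seen whole, such as a new combination of `create` and `Shader`.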

  51. Training Data (project) → Train Neural Network → Embeddings → Suggest Names on Test Data

  53. Evaluation Methodology
      Test File:
      ForkJoinTask<?> job;
      if (task instanceof ForkJoinTask<?>) // avoid re-wrap
        job = (ForkJoinTask<?>) task;
      else
        job = new ForkJoinTask.AdaptedRunnableAction(task);
      externalPush(job);
      Suggestions: 1. job (30%) 2. task (20%) 3. tsk (15%)
      Compare the suggestions with the ground truth (the original, unperturbed code).
      Evaluation on top 10 Java GitHub projects. Perturb existing code and retrieve ground truth.
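
Comparing ranked suggestions against the ground truth can be sketched with two common ranking metrics. The choice of top-k accuracy and mean reciprocal rank is an assumption, since the transcript does not name the metrics used, and the suggestion lists below are made-up examples apart from the `job`, `task`, `tsk` ranking shown on the slide.

```python
def top_k_accuracy(ranked_suggestions, ground_truth, k):
    """Fraction of cases where the true name is among the first k suggestions."""
    hits = sum(1 for ranked, truth in zip(ranked_suggestions, ground_truth)
               if truth in ranked[:k])
    return hits / len(ground_truth)

def mean_reciprocal_rank(ranked_suggestions, ground_truth):
    """Average of 1/rank of the true name, counting 0 when it is not suggested."""
    total = 0.0
    for ranked, truth in zip(ranked_suggestions, ground_truth):
        if truth in ranked:
            total += 1.0 / (ranked.index(truth) + 1)
    return total / len(ground_truth)

# One ranked suggestion list per perturbed identifier (hypothetical data).
ranked = [
    ["job", "task", "tsk"],
    ["i", "index", "j"],
    ["result", "res", "out"],
]
truth = ["job", "index", "value"]  # original names before perturbation

acc_at_1 = top_k_accuracy(ranked, truth, k=1)
acc_at_3 = top_k_accuracy(ranked, truth, k=3)
mrr = mean_reciprocal_rank(ranked, truth)
```

Because the ground truth comes from existing code, this kind of evaluation is automatic and reproducible, at the cost of assuming the original names were good ones.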

  54. Suggesting Variable Names

  56. Suggesting Method Names
