Mining Source Code Repositories at Massive Scale using Language Modeling

Apr 13, 2023

Mining Source Code Repositories at Massive Scale using Language Modeling
Miltos Allamanis, Charles Sutton
m.allamanis@ed.ac.uk, csutton@inf.ed.ac.uk
University of Edinburgh

● Polyglot programmers
● Multitude of APIs & libraries
● Transfer knowledge from available code

Why Language Models?
● Statistical models
● Learn from data
● Abundance of code available online
● Non-language-specific method [Hindle et al., ICSE 2012]
n-gram Language Models

    public void execute(Runnable task) {
        if (task == null)
            throw new NullPointerException();
        ForkJoinTask<?> job;
        if (task instanceof ForkJoinTask<?>) // avoid re-wrap
            job = (ForkJoinTask<?>) task;
        else
            job = new ForkJoinTask.AdaptedRunnableAction(task);
        externalPush(job);
    }
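As a rough illustration of what an n-gram model sees in the method above, the following sketch splits a fragment of it into tokens and counts trigrams. The regex tokenizer and the helper names `tokenize` and `ngrams` are assumptions made for this example; the slides do not say which lexer the talk used.

```python
import re
from collections import Counter

# Fragment of the Java method shown on the slide.
SNIPPET = """
if (task == null)
    throw new NullPointerException();
ForkJoinTask<?> job;
"""

def tokenize(code):
    # Simplistic lexer (an assumption): identifiers/keywords, the "=="
    # operator, or any other single non-space character.
    return re.findall(r"[A-Za-z_]\w*|==|\S", code)

def ngrams(tokens, n):
    # Slide a window of length n over the token sequence.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = tokenize(SNIPPET)
trigram_counts = Counter(ngrams(tokens, 3))
```

A language model would estimate the probability of each token from counts like these, gathered over a whole corpus rather than a single fragment.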
n-gram Language Models: Predictability Measures
● n-gram Log Probability (NGLP)
● Cross-Entropy (H)
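The formulas for these two measures did not survive in this transcript. The sketch below assumes the standard definitions: NGLP as the sum of log2 conditional probabilities over a token sequence, and cross-entropy as the negative average of those log probabilities per token. It uses a bigram model with add-one smoothing purely for brevity; this is not the model from the talk, and all function names are hypothetical.

```python
import math
from collections import Counter

def train_bigram(tokens):
    # Count contexts and bigrams from a training token sequence.
    contexts = Counter(tokens[:-1])
    bigrams = Counter(zip(tokens, tokens[1:]))
    vocab = set(tokens)
    return contexts, bigrams, vocab

def log2_prob(model, prev, tok):
    # Add-one smoothed P(tok | prev); the extra +1 in the denominator
    # treats all unseen tokens as one additional class.
    contexts, bigrams, vocab = model
    return math.log2((bigrams[(prev, tok)] + 1) / (contexts[prev] + len(vocab) + 1))

def nglp(model, tokens):
    # Log probability of the sequence: sum over bigram transitions.
    return sum(log2_prob(model, p, t) for p, t in zip(tokens, tokens[1:]))

def cross_entropy(model, tokens):
    # Negative average log2 probability per predicted token (bits/token).
    return -nglp(model, tokens) / (len(tokens) - 1)

train = "if ( task == null ) throw new NullPointerException ( ) ;".split()
model = train_bigram(train)
seen = "if ( task == null )".split()      # appears in the training data
unseen = "if ( job == null )".split()     # contains an unseen identifier
```

Under this toy model, `cross_entropy(model, seen)` is lower than `cross_entropy(model, unseen)`: the more predictable the code, the fewer bits per token the model needs.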
The Java GitHub Corpus
● Java projects with >1 fork
● Deduplication through git commit SHAs
● URL: http://groups.inf.ed.ac.uk/cup/javaGithub/
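The slides do not spell out the deduplication rule. One plausible reading, sketched below, is that two projects sharing any commit SHA are treated as copies of each other and only the first is kept. The project names, the SHA values, and the function `deduplicate` are all hypothetical.

```python
def deduplicate(projects):
    # projects: dict mapping project name -> set of commit SHAs.
    seen_shas = set()
    kept = []
    for name, shas in projects.items():
        # Assumed rule: a project that shares any commit with an earlier
        # project is a duplicate (for example a fork) and is skipped.
        if seen_shas.isdisjoint(shas):
            kept.append(name)
        # Record the commits either way so that forks of a skipped fork
        # are caught as well.
        seen_shas |= shas
    return kept

# Made-up example data; real SHAs are 40-character hex strings.
projects = {
    "alice/lib": {"a1", "b2", "c3"},
    "bob/lib-fork": {"a1", "b2", "d4"},
    "carol/tool": {"e5", "f6"},
}
kept = deduplicate(projects)
```

Here `bob/lib-fork` shares commits `a1` and `b2` with `alice/lib`, so only `alice/lib` and `carol/tool` are kept.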
Language Models of Code
Learning about identifiers
● API calls are predictable
n-gram log probability (NGLP) as a complexity metric
● NGLP is data-driven
● An n-gram is more complex the rarer it is
Complexity trade-offs: two code snippets from elasticsearch compared (snippets not preserved in this transcript)
Identifier Information Metric (IIM)
● Evaluates the domain specificity of code
● Larger IIM means more domain-specific identifiers
● Can be used to evaluate code reusability
● IIM = H_full - H_collapsed

File                         H_full - H_collapsed
ContinuationPending.java     5.2
FastDtoa.java                5.0
PrivateAccessClass.java      4.7
JSSetter.java                1.0
GeneratedClassLoader.java    1.1
UintMap.java                 1.2
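The difference H_full - H_collapsed can be sketched as follows, assuming that "collapsed" means every identifier is replaced by a single placeholder token before the cross-entropy is computed. The unigram model with add-one smoothing, the small keyword set, and the names `collapse`, `cross_entropy`, and `iim` are simplifications chosen for this example, not details taken from the talk.

```python
import math
import re
from collections import Counter

# Small subset of Java keywords (plus the literal null), for illustration only.
JAVA_KEYWORDS = {"if", "else", "new", "null", "return", "throw", "void", "public"}

def collapse(tokens):
    # Replace every identifier (a word token that is not a keyword)
    # by one placeholder token.
    return ["<ID>" if re.fullmatch(r"[A-Za-z_]\w*", t) and t not in JAVA_KEYWORDS else t
            for t in tokens]

def cross_entropy(train, test):
    # Unigram model with add-one smoothing; result in bits per token.
    counts = Counter(train)
    total = len(train) + len(counts) + 1  # +1: one class for unseen tokens
    return -sum(math.log2((counts[t] + 1) / total) for t in test) / len(test)

def iim(train, test):
    h_full = cross_entropy(train, test)
    h_collapsed = cross_entropy(collapse(train), collapse(test))
    return h_full - h_collapsed

train = "if ( task == null ) return job ; if ( job == null ) return task ;".split()
generic = "if ( task == null ) return task ;".split()
domain = "if ( dtoa == null ) return bignum ;".split()
```

Both test snippets collapse to the same token sequence, so they differ only in H_full. The snippet with identifiers unseen in training (`domain`) therefore gets the larger IIM, which matches the slide's reading that a larger IIM indicates more domain-specific identifiers.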
Contributions
● GitHub Java Corpus
● New gigatoken language models
● API calls are predictable
● Data-driven code complexity metrics
● Metric of domain specificity
Learning about identifiers
● Method and type identifiers are equally hard to predict, irrespective of the amount of data.
