LEARNING REPRESENTATIONS OF SOURCE CODE FROM STRUCTURE & CONTEXT
by Dylan Bourgeois
M.Sc. Thesis Defense 08.04.2019
Supervised by
- Pr. Pierre Vandergheynst
- Michaël Defferrard
- Pr. Jure Leskovec
- Dr. Michele Catasta
2
3
Programming languages offer a unified interface, which is leveraged by programmers. The regularities in coding patterns can be used as a proxy for semantics. Example applications
4
Programming is a human process, often repetitive, time-consuming and error-prone.
5
The idiosyncrasies of source code are not trivial to deal with. Software is also inherently composable, reusable, and hierarchical, and it has side effects.
6
Software is multilingual. It exists through several representations... and multiple abstractions.
Most work has focused on solving specific tasks, less so on capturing rich representations of source code.
7
Heuristic-based
Leveraging the strong logic encoded by programming languages to create formal verification tools, memory safety checkers, ...
Contextual regularities
Capturing common patterns in the input representation, typically used in code editors.
8
We propose a hybrid approach, which leverages both heuristics and regularities. Specifically, we hypothesise that structure is an informative heuristic.
HEURISTICS (STRUCTURE): We provide evidence for the importance of leveraging structure in the representation of source code.
REGULARITIES (CONTEXT): We show that patterns in the input provide a decent signal.
HYBRID (OURS): We propose a model which learns to recognize both structural and lexical patterns.
9
A Language Model (LM) defines a probability distribution over sequences of words:
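Formally, via the chain rule:
$$P(w_1, \dots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \dots, w_{t-1})$$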
10
This probability is estimated from a corpus, and can be parameterized through different forms [Shannon, 1950; Harris, 1954; Deerwester et al., 1990; Bengio et al., 2003; Collobert and Weston, 2008].
Source code starts out as text: as such it can present the same kind of regularities as natural language. Its restricted vocabulary, strong grammatical rules and composability properties encourage regularity and hence predictability.
11
[Hindle et al., 2012]
Each representation has inherent properties and abstraction levels associated to it.
12
The Abstract Syntax Tree (AST) provides a universally-available, deterministic and rich structural representation of source code.
13
Similar to what was found by [Hindle et al., 2012] on free-form text, we see both common patterns (e.g. motif #7) and project-specific patterns (e.g. motif #3).
(Figure: motif z-scores.)
14
15
16
The n-gram model can be represented as a Markov chain, simplifying the joint probability by assuming that the likelihood of each word depends only on the n−1 words preceding it.
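Concretely, for an n-gram model:
$$P(w_1, \dots, w_T) \approx \prod_{t=1}^{T} P(w_t \mid w_{t-n+1}, \dots, w_{t-1})$$
where each conditional probability is estimated from counts in the training corpus.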
17
18
However, integrating more complex models of language requires richer models of context. To model polysemy, this context should also modulate the representation of a given word. [Mikolov et al., 2013, Peters et al., 2018]
Many of these insights are captured in the Transformer architecture
[Vaswani et al., 2017].
It is a deep, feed-forward, attentive architecture showing strong results compared to recurrent architectures. It is now the building block for most state-of-the-art architectures in NLP.
[Radford et al., 2018, Devlin et al. 2018]
19
20
[Vaswani et al., 2017]
The encoder embeds the input tokens with self-attention and position-wise feed-forward layers; encoder blocks are then stacked to create deeper representations.
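The core operation is scaled dot-product attention [Vaswani et al., 2017]:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V$$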
21
Recent work has built on powerful Graph Neural Networks (GNNs), running them on semantically augmented representations of programs.
22
[Allamanis et al., 2018]
Unfortunately, we found the purely structural approach to have limited results.
23
INSIGHTS
Frequent node types are averaged across too many usages to be semantically meaningful.
Rare tokens have the inverse problem: not enough co-occurrences.
Standard message-passing GNNs also struggle to distinguish common motifs in code [Xu et al, 2019].
24
No assumptions are made on the underlying structure: the attention module can attend to all the elements in the sequence.
25
INSIGHT
26
This can be seen as a message-passing GNN on a fully connected input graph.
The message-passing edges can be restricted to a priori edges, e.g. syntactic relationships. This enables the treatment of arbitrary graph structures as input.
27
OUR APPROACH
28
The aggregation scheme can be replaced by any message-passing aggregation architecture!
29
Possible aggregation schemes: GCN-based aggregation, GAT-based aggregation, masked dot-product attention, or even semantic aggregation.
OUR APPROACH
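As a sketch, the standard forms of these aggregators (the notation here is assumed, not taken from the slides) are:
GCN-based: $h_i' = \sigma\big(\sum_{j \in \mathcal{N}(i) \cup \{i\}} \tfrac{1}{\sqrt{d_i d_j}} W h_j\big)$
GAT-based: $h_i' = \sigma\big(\sum_{j \in \mathcal{N}(i)} \alpha_{ij} W h_j\big)$, with learned attention coefficients $\alpha_{ij}$
Masked dot-product attention: $\alpha_{ij} = \mathrm{softmax}_j\big(\tfrac{(W_Q h_i)^\top (W_K h_j)}{\sqrt{d}} + M_{ij}\big)$, where $M_{ij} = 0$ if $(i, j)$ is an edge and $-\infty$ otherwise.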
30
For example, with the masked attention formulation, we can modify a Transformer encoder block to run on arbitrarily structured inputs. OUR APPROACH
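A minimal PyTorch-style sketch of such a masked attention layer (module and variable names are assumptions here, not the thesis implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedSelfAttention(nn.Module):
    """Single-head dot-product attention restricted to a priori graph edges."""

    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, x, adj):
        # x: (num_nodes, dim) node features; adj: (num_nodes, num_nodes) 0/1 adjacency
        q, k, v = self.q(x), self.k(x), self.v(x)
        scores = (q @ k.transpose(-2, -1)) / (x.size(-1) ** 0.5)
        # Nodes may only attend along syntactic (or other a priori) edges;
        # self-loops are assumed present so every row keeps at least one valid entry.
        scores = scores.masked_fill(adj == 0, float("-inf"))
        attn = F.softmax(scores, dim=-1)
        return attn @ v
```

Setting adj to the all-ones matrix recovers standard, fully connected Transformer attention.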
31
With this formulation, we can jointly learn to compose local and global context, obtaining a deep contextualized node representation. This helps to learn structural and contextual regularities. OUR APPROACH
32
Pre-training by first modelling the input data has seen great success in NLP applications. The approach is similar to auto-encoders, but only the masked input is reconstructed.
33
Structure is readily available and deterministic, unlike parse trees of natural language. The masked language model is thus similar to a node classification task.
34
Once the model is pre-trained, it can be fine-tuned to produce labels through a pooling token [CLS] or used as a rich feature extractor.
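For instance, a hypothetical fine-tuning head over the pooled [CLS] node could look like this (the encoder interface is an assumption):

```python
import torch.nn as nn

class PooledClassifier(nn.Module):
    """Predict a label for a whole input graph from its pooled [CLS] node."""

    def __init__(self, encoder, dim, num_classes):
        super().__init__()
        self.encoder = encoder              # pre-trained masked-attention encoder
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x, adj, cls_index=0):
        h = self.encoder(x, adj)            # contextualized node representations
        return self.head(h[cls_index])      # logits from the [CLS] position
```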
35
36
37
38
Graph-based tasks: the structure is similar to the pre-training task.
MODEL TRAINED FROM SCRATCH
39
Graph-based tasks: in this case, we use the pooled representation of the input graph to make a prediction.
PRE-TRAINED MODEL
40
Our approach is competitive with state-of-the-art results on classic graph classification datasets.
ENZYMES: predicting one of 6 classes of chemical properties.
MSRC-21: predicting one of 21 semantic labels (e.g. building, grass, …) on image super-pixel graphs.
MUTAG: predicting the mutagenicity of chemical compounds (binary).
Pre-training the model seems to enable faster training. For better accuracy, the model can be trained
41
42
Dataset of MRFs connecting super-pixels of an image, where the goal is to predict one of 21/9 labels (e.g. building, grass, …).
[Winn et al. 2005]
43
We collect code from online repositories into three datasets at different scales. A fourth very large (3TB!) dataset is currently being curated.
44
45
We generate a set of code snippets, defined as valid code subgraphs, and perturb the dataset for reconstruction in the Masked Language Model task.
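As an illustration (the function names and masking rate below are assumptions, not the thesis pipeline), one could extract an AST graph with Python's ast module and perturb its node labels for reconstruction:

```python
import ast
import random

def ast_nodes_and_edges(source):
    """Parse Python source into AST node labels and parent->child edges (sketch)."""
    tree = ast.parse(source)
    nodes, index = [], {}
    for node in ast.walk(tree):
        index[id(node)] = len(nodes)
        nodes.append(type(node).__name__)
    edges = [(index[id(parent)], index[id(child)])
             for parent in ast.walk(tree)
             for child in ast.iter_child_nodes(parent)]
    return nodes, edges

def mask_nodes(nodes, mask_token="[MASK]", p=0.15):
    """Randomly mask node labels for masked-language-model style reconstruction."""
    masked, targets = list(nodes), []
    for i, label in enumerate(nodes):
        if random.random() < p:
            masked[i] = mask_token
            targets.append((i, label))
    return masked, targets
```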
46
Our syntax-aware model significantly outperforms BERT [Devlin et al., 2018], providing some evidence that the addition of structure helps the model capture regularities.
47
48
49
We fine-tune the model on two standard tasks in the field of machine learning on source code: Method Naming and Variable Naming.
The addition of structural information seems to help, compared to purely sequential architectures.
50
Scoring: exact match, with points for partial matches at the token level.
OURS
We outperform state-of-the-art results, showing a 20% relative improvement over [Alon et al., 2019].
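A sketch of how such token-level partial credit can be computed (the subtoken-splitting convention is an assumption):

```python
import re

def subtokens(name):
    """Split a method name such as 'getFileName' or 'get_file_name' into subtokens."""
    tokens = []
    for part in re.split(r"[_\W]+", name):
        tokens += re.findall(r"[A-Z]?[a-z0-9]+|[A-Z]+(?![a-z])", part)
    return [t.lower() for t in tokens if t]

def partial_match_f1(predicted, reference):
    """Precision/recall/F1 over shared subtokens: credit for partially correct names."""
    pred, ref = set(subtokens(predicted)), set(subtokens(reference))
    if not pred or not ref:
        return 0.0
    true_positives = len(pred & ref)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)
```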
51
52
Inspecting failure modes reveals that interesting semantic information is being captured.
53
The model can leverage both co-occurrence based semantics as well as structural similarities.
54
We fine-tune the model on two standard tasks in the field of machine learning on source code:
55
Method Naming and Variable Naming
We show clear improvements with the addition of structure, as well as state-of-the art results.
56
OURS
57
58
59
We shuffle the token input sequence order but preserve edges, ensuring that the model actually learns on the message-passing edges and not local co-occurrences in the flattened representation.
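A sketch of this control (assuming tokens are stored as a list and edges as index pairs):

```python
import random

def shuffle_preserving_edges(tokens, edges, seed=0):
    """Permute the token sequence while remapping edges to the new positions."""
    rng = random.Random(seed)
    perm = list(range(len(tokens)))
    rng.shuffle(perm)                                   # perm[new_pos] = old_pos
    old_to_new = {old: new for new, old in enumerate(perm)}
    shuffled = [tokens[old] for old in perm]
    remapped = [(old_to_new[a], old_to_new[b]) for a, b in edges]
    return shuffled, remapped
```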
To test the model’s properties, we evaluate the syntactic correctness of its predictions, as defined by the language’s grammar.
60
Token Type (2 classes); Token Class (14 classes)
OURS
61
(Figure: attention patterns by layer and head.)
62
In early layers, the model has a receptive field that extends only to its immediate neighbours. More complex attentive chains form in later layers.
63
We measure the entropy of attention weights to see if the model is able to weigh different neighbours differently based on their importance, comparing it to uniform weights (all neighbours are equally important).
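Concretely, for a node $i$ with attention weights $\alpha_{ij}$ over its neighbours $\mathcal{N}(i)$ (notation assumed):
$$H_i = -\sum_{j \in \mathcal{N}(i)} \alpha_{ij} \log \alpha_{ij}$$
Uniform attention over the neighbourhood attains the maximum value $\log |\mathcal{N}(i)|$; lower entropy indicates that the head concentrates on a few neighbours.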
64
We propose a model which jointly leverages structural and contextual information to embed graph-structured input.
We show that structure and context provide complementary signals for the representations of source code.
65
Reproducibility: the field of ML4Code could benefit from explicitly designed datasets, serving as diagnostics or evaluations on a standardized benchmark.
Architecture: design more complex aggregation schemes, possibly incorporating more domain-specific information, global feature information, or recursively aggregating at larger scales.
Similarity: proxy tasks validate the approach, but the final goal is to measure similarity in software. This requires designing a better evaluation of similarity, and extending to other languages and applications.
66
67
Inspired by the influential reproducibility checklist by Joëlle Pineau (adopted for NeurIPS this year!), we propose a specific version for ML4Code.
68
69
70
We would also like to propose a standardized benchmark dataset, whose development is in progress, complete with an online leaderboard and diagnostic tasks, inspired by the GLUE benchmark.
Inference tasks: predicting a label or property of a set of tokens from the input, similar to node classification.
71
Semantics of Code and Understanding BenchmArk
Snippet-level evaluation: predicting a label or property for an entire chunk of the input, similar to graph classification.
Similarity measures: predicting labels for sets of inputs, from similarity to link prediction.
[Wang et al. 2018]
Zitnik, J. Leskovec KDD’19 (submitted) arxiv:1903.03894
72
WebConf’19
73
74
[Allamanis, 2018] Allamanis, M. (2018). The adverse effects of code duplication in machine learning models of code.
[Allamanis et al., 2015] Allamanis, M., Barr, E. T., Bird, C., and Sutton, C. (2015). Suggesting accurate method and class names. ESEC/FSE 2015, pages 38–49.
[Alon et al., 2019] Alon, U., Zilberstein, M., Levy, O., and Yahav, E. (2019). Code2vec: Learning distributed representations of code. POPL.
[Allamanis et al., 2018a] Allamanis, M., Barr, E. T., Devanbu, P. T., and Sutton, C. A. (2018a). A survey of machine learning for big code and naturalness. ACM Comput. Surv., 51:81:1–81:37.
[Allamanis et al., 2018b] Allamanis, M., Brockschmidt, M., and Khademi, M. (2018b). Learning to represent programs with graphs. ICLR.
[Bengio et al., 2003] Bengio, Y., Ducharme, R., Vincent, P., and Janvin, C. (2003). A neural probabilistic language model. J. Mach. Learn. Res., 3:1137–1155.
75
[Collobert and Weston, 2008] Collobert, R. and Weston, J. (2008). A unified architecture for natural language processing: Deep neural networks with multitask learning. ICML '08.
[Deerwester et al., 1990] Deerwester, S. C., Dumais, S. T., Landauer, T. K., Furnas, G. W., and Harshman, R. A. (1990). Indexing by latent semantic analysis. JASIS, 41:391–407.
[Devlin et al., 2018] Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805.
[Firth, 1957] Firth, J. R. (1957). A synopsis of linguistic theory 1930-55. Studies in Linguistic Analysis (special volume of the Philological Society), 1952-59:1–32.
[Hindle et al., 2012] Hindle, A., Barr, E. T., Su, Z., Gabel, M., and Devanbu, P. (2012). On the naturalness of software. ICSE 2012.
[Mikolov et al., 2013] Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013). Efficient estimation of word representations in vector space. ICLR '13.
76
[Peters et al., 2018] Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., and Zettlemoyer, L. S. (2018). Deep contextualized word representations. NAACL-HLT.
[Radford et al., 2018] Radford, A., Narasimhan, K., Salimans, T., and Sutskever, I. (2018). Improving language understanding by generative pre-training. OpenAI.
[Shannon, 1950] Shannon, C. (1950). Prediction and entropy of printed English. Bell Systems Technical Journal.
[Vaswani et al., 2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. (2017). Attention is all you need. NeurIPS.
[Wang et al., 2018] Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. R. (2018). GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv:1804.07461.
[Xu et al., 2019] Xu, K., Hu, W., Leskovec, J., and Jegelka, S. (2019). How powerful are graph neural networks? ICLR '19.
The results are consistent across corpora.
77
78