A General Path-Based Representation for Predicting Program Properties



SLIDE 1

A General Path-Based Representation for Predicting Program Properties

Uri Alon, Meital Zilberstein, Omer Levy, Eran Yahav

Technion

University of Washington

SLIDE 2

Motivating Example #1

Prediction of Variable Names in Python

Before (obfuscated names):

    def sh3(c):
        p = Popen(c, stdout=PIPE, stderr=PIPE, shell=True)
        o, e = p.communicate()
        r = p.returncode
        if r:
            raise CalledProcessError(r, c)
        else:
            return o.rstrip(), e.rstrip()

After (predicted names):

    def sh3(cmd):
        process = Popen(cmd, stdout=PIPE, stderr=PIPE, shell=True)
        out, err = process.communicate()
        retcode = process.returncode
        if retcode:
            raise CalledProcessError(retcode, cmd)
        else:
            return out.rstrip(), err.rstrip()

SLIDE 3

Motivating Example #2

Prediction of Method Names in JavaScript

With the method name hidden:

    function _______(object) {
        if (!object) return object;
        var clone = {};
        for (var key in object) {
            clone[key] = object[key];
        }
        return clone;
    }

With the predicted method name:

    function cloneObject(object) {
        if (!object) return object;
        var clone = {};
        for (var key in object) {
            clone[key] = object[key];
        }
        return clone;
    }

SLIDE 4

Motivating Example #3

Prediction of full types in Java

StackOverflow answer:

    Configuration conf = HBaseConfiguration.create();
    try {
        Connection connection = ConnectionFactory.createConnection(conf);
    }

Candidate full types for Connection:

    com.mysql.jdbc.Connection ?
    org.apache.http.Connection ?

    import org.apache.hadoop.hbase.client.Connection;

SLIDE 5

Previously – separate techniques for each problem / language

Languages: Java, JavaScript, Python, C#, …

▪ Variable name prediction: Bichsel et al. CCS’2016 (CRFs); Raychev et al. POPL’2015 (CRFs); Raychev et al. OOPSLA’2016 (Decision Trees); ..
▪ Method name prediction: Allamanis et al. ICML’2016 (NNs); Raychev et al. OOPSLA’2016 (Decision Trees); ..
▪ Full types prediction: ..
▪ …: Raychev et al. PLDI’2014 (n-grams+RNNs); Bielik et al. ICML’2016 (PHOG); Raychev et al. OOPSLA’2016 (Decision Trees); Allamanis et al. ICML’2015 (Generative); ..

Completely automatically!

SLIDE 6

How to represent a program element?

▪ Should work for many programming languages
▪ Should work for different tasks
▪ Should be useful in multiple learning algorithms

    while (!done) {
        if (someCondition()) {
            done = true;
        }
    }

    while (!count) {
        if (someCondition()) {
            count = true;
        }
    }

▪ What are the properties that make “done” a “done”?

SLIDE 7

How to represent a program element?

    while (!done) {
        if (someCondition()) {
            done = true;
        }
    }

Key idea:

▪ The semantic role of a program element is the set of all structured contexts in which it appears
▪ “done” is “done” because it appears in particular structured contexts

SLIDE 8

AST-paths

A general and simple method to represent code in machine learning models

    while (!done) {
        if (someCondition()) {
            done = true;
        }
    }

For example:

    (SymbolRef ↑ UnaryPrefix! ↑ While ↓ If ↓ Assign= ↓ SymbolRef, self)

“done” is represented as the set of all its paths
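The path idea can be tried out with a parser that ships with Python. The following is a minimal sketch, not the authors' extractor: it uses Python's standard `ast` module on a Python version of the loop, so node names (`Name`, `UnaryOp`, `Assign`) differ from the JavaScript labels on the slide. The helper names `name_chains`, `path_between` and `paths_of` are hypothetical, and restricting path endpoints to `Name` nodes is a simplifying assumption.

```python
import ast
from itertools import permutations

# Python counterpart of the slide's loop (only parsed, never executed).
SOURCE = """
while not done:
    if some_condition():
        done = True
"""


def name_chains(tree):
    """Collect (identifier, root-to-node chain) for every Name node."""
    found = []

    def walk(node, chain):
        chain = chain + [node]
        if isinstance(node, ast.Name):
            found.append((node.id, chain))
            return  # treat Name as a terminal; skip its Load/Store child
        for child in ast.iter_child_nodes(node):
            walk(child, chain)

    walk(tree, [])
    return found


def path_between(chain_a, chain_b):
    """Go up from the end of chain_a to the lowest common ancestor,
    then down to the end of chain_b."""
    shared = 0
    while (shared < min(len(chain_a), len(chain_b))
           and chain_a[shared] is chain_b[shared]):
        shared += 1
    top = chain_a[shared - 1]
    up = [type(n).__name__ for n in reversed(chain_a[shared:])]
    down = [type(n).__name__ for n in chain_b[shared:]]
    return " ↑ ".join(up + [type(top).__name__]) + " ↓ " + " ↓ ".join(down)


def paths_of(source, target):
    """Set of (path, other end) pairs starting at each occurrence of target."""
    chains = name_chains(ast.parse(source))
    result = set()
    for (id_a, chain_a), (id_b, chain_b) in permutations(chains, 2):
        if id_a == target:
            other = "self" if id_b == target else id_b
            result.add((path_between(chain_a, chain_b), other))
    return result


for path, other in sorted(paths_of(SOURCE, "done")):
    print(f"({path}, {other})")
```

One of the printed pairs is the Python analogue of the slide's example: the path from the `done` in the loop condition, up to `While`, and down to the assigned `done`.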

SLIDE 9

Example training & testing pipeline

Training:

    while (!done) {
        if (someCondition()) {
            done = true;
        }
    }

Testing:

    while (!x) {
        foo();
        if (bar() < 3) {
            log.info(zoo);
            x = true;
        }
    }

[Figure: a numbered list of paths extracted from each program, with one path highlighted in each list; the training side is labeled “done”.]

SLIDE 10

Advantages of AST-Paths representation

✓ Expressive enough to capture any property that is expressed syntactically
✓ Independent of the programming language
✓ Automatically extractable – only requires a parser
✓ Not bound to the learning algorithm
✓ Works for different tasks

SLIDE 11

Predicting program properties with AST paths

▪ Off-the-shelf algorithms
▪ Plug in our representation

Algorithms: Conditional Random Fields (CRFs), word2vec-based

SLIDE 12

Predicting properties with CRFs

▪ Nodes: program elements
▪ Factors: learned scoring functions:
  ▪ $\mathit{Values} \times \mathit{Values} \times \mathit{Paths} \to \mathbb{R}$
▪ The same as in (JSNice, Raychev et al., POPL’2015), but with our paths as factors

Example paths:

    SymbolRef↑UnaryPrefix!↑While↓If↓Assign=↓SymbolRef
    SymbolRef↑Call↑If↓Assign=↓SymbolRef
    SymbolRef↑Assign=↓True
    SymbolRef↑Assign=↓True
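To make the factor idea concrete, here is a minimal sketch of scoring a candidate name with pairwise factors keyed by (value, value, path). It is an illustration under stated assumptions, not the JSNice or authors' implementation: the weights are hand-picked toy numbers, the node id `"v"` and the candidate list are hypothetical, and inference is a brute-force search over candidates rather than the inference procedure a real CRF toolkit would use.

```python
from itertools import product

SELF_PATH = "SymbolRef↑UnaryPrefix!↑While↓If↓Assign=↓SymbolRef"
TRUE_PATH = "SymbolRef↑Assign=↓True"

# Toy factor weights (value, value, path) -> score. Hand-picked for
# illustration only; a real model learns these from training programs.
WEIGHTS = {
    ("done", "done", SELF_PATH): 2.0,
    ("done", "true", TRUE_PATH): 1.5,
    ("count", "count", SELF_PATH): 0.1,
    ("count", "true", TRUE_PATH): 0.2,
}


def score(assignment, edges, weights):
    """Sum the factor weights over all edges; unseen triples contribute 0."""
    total = 0.0
    for node_a, node_b, path in edges:
        # Unknown nodes take their assigned name; known values stand for themselves.
        value_a = assignment.get(node_a, node_a)
        value_b = assignment.get(node_b, node_b)
        total += weights.get((value_a, value_b, path), 0.0)
    return total


def best_assignment(unknowns, candidates, edges, weights):
    """Exhaustively try every naming of the unknown nodes (toy-sized inputs only)."""
    best, best_score = None, float("-inf")
    for names in product(candidates, repeat=len(unknowns)):
        assignment = dict(zip(unknowns, names))
        current = score(assignment, edges, weights)
        if current > best_score:
            best, best_score = assignment, current
    return best


# One unknown variable "v": a path back to itself and a path to the constant true.
EDGES = [("v", "v", SELF_PATH), ("v", "true", TRUE_PATH)]
prediction = best_assignment(["v"], ["done", "count"], EDGES, WEIGHTS)
print(prediction)
```

With these toy weights the name `done` wins because both of its factors score higher than the corresponding factors for `count`.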

SLIDE 13

Predicting properties with word2vec

▪ Input: pairs of (word, context)
▪ Model:
  ▪ word vectors: $W_{vocab}$
  ▪ context vectors: $C_{vocab}$
▪ Prediction:
  ▪ $\mathrm{predict}(c_1, \ldots, c_n) = \operatorname{argmax}_{w_i \in W_{vocab}} \; w_i \cdot \sum_{j} c_j$

[Figure: the matrices $W_{vocab}$ and $C_{vocab}$, each with vectors of dimension $d$.]
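The prediction rule on this slide picks the vocabulary word whose vector has the largest dot product with the sum of the given context vectors. A minimal sketch in plain Python follows; the two-dimensional vectors and the context names `path_a`, `path_b`, `path_c` are made-up toy values, not trained embeddings.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def predict(contexts, word_vectors, context_vectors):
    """Return the word w maximizing w . (sum of the given context vectors)."""
    dim = len(next(iter(word_vectors.values())))
    summed = [0.0] * dim
    for context in contexts:
        summed = [s + c for s, c in zip(summed, context_vectors[context])]
    return max(word_vectors, key=lambda word: dot(word_vectors[word], summed))


# Toy embeddings with d = 2 (hypothetical values for illustration).
WORD_VECTORS = {"done": [1.0, 0.0], "count": [0.0, 1.0]}
CONTEXT_VECTORS = {
    "path_a": [0.9, 0.1],
    "path_b": [0.8, 0.3],
    "path_c": [0.1, 0.7],
}

print(predict(["path_a", "path_b"], WORD_VECTORS, CONTEXT_VECTORS))
print(predict(["path_c"], WORD_VECTORS, CONTEXT_VECTORS))
```

In a real setting each context would be one extracted context of the unknown element (for example an AST path together with its other end), and the vectors would come from word2vec training.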

SLIDE 14

Word2vec and different contexts

▪ Input: pairs of (word, context)
▪ Train word2vec with 3 types of contexts:
  ▪ Neighbor tokens
  ▪ Surrounding AST-nodes
  ▪ AST paths

SLIDE 17

Word2vec and different contexts

▪ Input: pairs of (word, context)
▪ Train word2vec with 3 types of contexts:
  ▪ Neighbor tokens
  ▪ Surrounding AST-nodes
  ▪ AST paths

    while ˽ ( ! done ) ˽ {
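The first context type, neighbor tokens, is the easiest to reproduce without a parser. The sketch below produces (word, context) pairs from a fixed token window. It rests on several assumptions: a crude regex tokenizer, whitespace ignored rather than kept as a token, and a window size of 2 chosen arbitrarily. The function name `token_contexts` is hypothetical.

```python
import re

CODE = "while (!done) { if (someCondition()) { done = true; } }"


def tokenize(code):
    """Very rough tokenizer: identifiers, or any single non-space character."""
    return re.findall(r"[A-Za-z_]\w*|\S", code)


def token_contexts(code, target, window=2):
    """Pair each occurrence of target with its neighbors as (offset, token)."""
    tokens = tokenize(code)
    pairs = []
    for index, token in enumerate(tokens):
        if token != target:
            continue
        for offset in range(-window, window + 1):
            neighbor = index + offset
            if offset != 0 and 0 <= neighbor < len(tokens):
                pairs.append((target, (offset, tokens[neighbor])))
    return pairs


for word, context in token_contexts(CODE, "done"):
    print(word, context)
```

The surrounding-AST-nodes and AST-paths context types need a parse tree instead of a token list, which is why they can capture the link between the two occurrences of `done` even when many tokens separate them.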

SLIDE 18

Evaluation

▪ 4 programming languages
  ▪ Java, JavaScript, Python, C#
▪ 3 tasks
  ▪ Predicting method names, variable names, full types (“...hbase.client.Connection”)
▪ 2 learning algorithms
  ▪ CRFs, word2vec-based

SLIDE 19

Predicting variable names with CRFs

[Bar chart: accuracy (%) of AST Paths vs. a baseline for Java, JavaScript, Python, and C#. Baselines: CRFs + n-grams, UnuglifyJS, CRFs + No-relation. Improvement labels, in slide order: +8.1 (16.2%), +7.3 (12.2%), +21.5 (61%).]

Format: Absolute (Relative%)
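The "Absolute (Relative%)" format relates two numbers through the baseline accuracy: the relative gain is the absolute gain divided by the baseline. The small sketch below shows that arithmetic; the function names are hypothetical, and any baseline recovered this way is only approximate because the slide's figures are rounded.

```python
def relative_gain(absolute_gain, baseline):
    """Relative improvement in percent, given an absolute gain in accuracy points."""
    return 100.0 * absolute_gain / baseline


def implied_baseline(absolute_gain, relative_percent):
    """Baseline accuracy implied by an 'Absolute (Relative%)' label."""
    return absolute_gain / (relative_percent / 100.0)


# Improvement labels as they appear on the slide: (absolute, relative %).
for absolute, relative in [(8.1, 16.2), (7.3, 12.2), (21.5, 61.0)]:
    baseline = implied_baseline(absolute, relative)
    print(f"+{absolute} ({relative}%): baseline ~{baseline:.1f}, "
          f"with AST paths ~{baseline + absolute:.1f}")
```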

SLIDE 20

Word2vec with different context types

Task: Variable names, word2vec, JavaScript

[Bar chart: accuracy (%) for three context types: AST paths, surrounding nodes, neighbor tokens. Improvement labels, in slide order: +17.2 (74.1%), +19.8 (96.1%).]

Format: Absolute (Relative%)

SLIDE 21

Reducing the number of paths

▪ Limiting path-length and path-width
  ▪ Path vocabulary size (JavaScript):
      length: 7 → 6: 13M → 11M
      width: 3 → 2: 13M → 12M
▪ Path abstraction
      SymbolRef ↑ UnaryPrefix! ↑ While ↓ If ↓ Assign= ↓ SymbolRef
      …↑ While ↓…
  ▪ Path vocabulary size (Java): $\sim 10^7 \to \sim 10^2$
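Limiting length and width amounts to a filter over extracted paths. The sketch below assumes definitions the slide does not spell out: length is taken as the number of edges in the path, and width as the index distance between the two children of the top node through which the path passes. The `AstPath` class, its `top_child_gap` field, and the gap value used in the example are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AstPath:
    nodes: tuple        # node labels from one end of the path to the other
    top_child_gap: int  # assumed width: child-index distance under the top node

    @property
    def length(self):
        """Number of edges, i.e. one less than the number of nodes (assumption)."""
        return len(self.nodes) - 1


def within_limits(path, max_length, max_width):
    """Keep a path only if it respects both limits."""
    return path.length <= max_length and path.top_child_gap <= max_width


example = AstPath(
    nodes=("SymbolRef", "UnaryPrefix!", "While", "If", "Assign=", "SymbolRef"),
    top_child_gap=1,  # hypothetical: condition and body taken as adjacent children
)

print(example.length)
print(within_limits(example, max_length=7, max_width=3))
print(within_limits(example, max_length=4, max_width=3))
```

Tighter limits drop long or wide paths, which shrinks the path vocabulary at the cost of losing some long-range contexts.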

SLIDE 22

Effect of limiting path length and width

Task: Variable names, CRFs, JavaScript

[Line chart: accuracy (%) vs. max path-length (3 to 7) for AST Paths with max_width=1, max_width=2, and max_width=3, compared with UnuglifyJS. Accuracy axis: 50 to 68.]

SLIDE 23

AST Path Abstractions

Task: Variable names, CRFs, Java

    SymbolRef ↑ UnaryPrefix! ↑ While ↓ If ↓ Assign= ↓ SymbolRef
    SymbolRef ↑ While ↓ SymbolRef

only values, without considering the relation between them
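The two paths on this slide suggest an abstraction that keeps only the first node, the top node, and the last node of a path. A minimal sketch of that mapping on the textual path form follows; the function name `first_top_last` is hypothetical, and the code assumes every upward step precedes every downward step, as in the slide's example.

```python
import re

FULL_PATH = "SymbolRef ↑ UnaryPrefix! ↑ While ↓ If ↓ Assign= ↓ SymbolRef"


def first_top_last(path):
    """Abstract a path to 'first ↑ top ↓ last'.

    Assumes the path goes up and then down; a path with only one
    direction is returned unchanged.
    """
    parts = re.split(r"\s*([↑↓])\s*", path.strip())
    nodes, arrows = parts[0::2], parts[1::2]
    if "↑" not in arrows or "↓" not in arrows:
        return path.strip()
    # The top node sits right before the first downward arrow.
    top = nodes[arrows.index("↓")]
    return f"{nodes[0]} ↑ {top} ↓ {nodes[-1]}"


print(first_top_last(FULL_PATH))
```

Many distinct full paths collapse to the same abstracted path, which is how abstraction reduces the path vocabulary; dropping the path entirely leaves only the values at its two ends.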

SLIDE 24

Example (JavaScript)

    function countSomething(x, t) {
        var c = 0;
        for (var i = 0, l = x.length; i < l; i++) {
            if (x[i] === t) {
                c++;
            }
        }
        return c;
    }

SLIDE 25

Example (JavaScript)

    function countSomething(array, target) {
        var count = 0;
        for (var i = 0, l = array.length; i < l; i++) {
            if (array[i] === target) {
                count++
            }
        }
        return count;
    }

SLIDE 26

Example (Java)

    public String sendGetRequest(String l) {
        HttpClient c = HttpClientBuilder.create().build();
        HttpGet r = new HttpGet(l);
        String u = USER_AGENT;
        r.addHeader("User-Agent", u);
        HttpResponse s = c.execute(r);
        HttpEntity t = s.getEntity();
        String g = EntityUtils.toString(t, "UTF-8");
        return g;
    }

SLIDE 27

Example (Java)

    public String sendGetRequest(String url) {
        HttpClient client = HttpClientBuilder.create().build();
        HttpGet request = new HttpGet(url);
        String user = USER_AGENT;
        request.addHeader("User-Agent", user);
        HttpResponse response = client.execute(request);
        HttpEntity entity = response.getEntity();
        String result = EntityUtils.toString(entity, "UTF-8");
        return result;
    }

SLIDE 28

Semantic Similarity Between Names

Candidates (CRFs):

1. done
2. ended
3. complete
4. found
5. finished
6. stop
7. end
8. success

SLIDE 29

More Semantic Similarities

Similarities:

req ~ request
count ~ counter ~ total
element ~ elem ~ el
array ~ arr ~ ary ~ list
res ~ result ~ ret
i ~ j ~ index

SLIDE 30

Summary: a trade-off between learning effort and generalizability

▪ Surface text – too noisy
▪ Complex analyses are great, but specific to language and task
▪ AST paths – sweet spot of simplicity, expressivity and generalizability
▪ “Structural n-grams”
▪ A strong baseline for any machine learning for code task

[Figure: trade-off diagram with the labels “Implicitly re-learn syntactic & semantic regularities”, “Sweet-spot”, and “Language-specific, task-specific, require expertise”.]

Questions?