PHOG: Probabilistic Model for Code (PowerPoint PPT Presentation)
SLIDE 1

PHOG: Probabilistic Model for Code

Pavol Bielik, Veselin Raychev, Martin Vechev

Software Reliability Lab Department of Computer Science ETH Zurich

SLIDE 2

Vision

  • 15 million repositories
  • Billions of lines of code
  • High quality, tested, maintained programs
  • [chart: growth in the number of repositories over the last 8 years]

Learn a Probabilistic Model from this code and use it to build Statistical Programming Tools.

SLIDE 3

Statistical Programming Tools

  • Understand code/security [POPL’15]: JavaScript Deobfuscation, Type Prediction (www.jsnice.org)
  • Debug code: Statistical Bug Detection

    ... for x in range(a): print a[x]   ← likely error

  • Write new code [PLDI’14]: Code Completion

    Camera camera = Camera.open();
    camera.setDisplayOrientation(90);
    camera.?

  • Port code [ONWARD’14]: Programming Language Translation

All of these benefit from a probabilistic model of code.

SLIDE 4

Programming Languages + Machine Learning

SLIDE 5

Model Requirements

A probabilistic model of code should offer:

  • High Precision
  • Efficient Learning
  • Widely Applicable
  • Explainable Predictions

Existing Programs → Learning → Model

SLIDE 6

PHOG: Probabilistic Higher Order Grammar meets these requirements.
SLIDE 7

Example Query

awaitReset = function(){
  ...
  return defer.promise;
}
awaitRemoved = function(){
  fail(function(error){
    if (error.status === 401){ ... }
    defer.reject(error);
  });
  ...
  return defer.?
}

PHOG predicts (probability P):

  promise  0.67   ✓ correct prediction
  notify   0.12
  resolve  0.11
  reject   0.03
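To make the prediction concrete, here is a minimal sketch (not the authors' implementation) of a conditioned distribution: completions are counted per conditioning context and normalized at query time. The context tuple and the toy counts (including the filler label "then") are assumptions chosen so the output mirrors the distribution on the slide.

```python
# Sketch: completions counted per conditioning context, normalized on query.
from collections import Counter, defaultdict

class ConditionedModel:
    def __init__(self):
        self.counts = defaultdict(Counter)

    def observe(self, context, label):
        self.counts[tuple(context)][label] += 1

    def predict(self, context):
        bucket = self.counts[tuple(context)]
        total = sum(bucket.values())
        return sorted(((lbl, n / total) for lbl, n in bucket.items()),
                      key=lambda pair: -pair[1])

model = ConditionedModel()
# Invented counts chosen to mirror the slide's distribution (total = 100).
for label, n in [("promise", 67), ("notify", 12), ("resolve", 11),
                 ("reject", 3), ("then", 7)]:
    for _ in range(n):
        model.observe(["reject", "promise"], label)

top_label, top_prob = model.predict(["reject", "promise"])[0]
print(top_label, top_prob)  # promise 0.67
```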

SLIDE 8

Challenges (illustrated by the query on Slide 7):

  • Long distance dependencies
  • Program semantics
  • Explainable predictions
SLIDE 11

Existing Approaches for Code

arg max over x of P(x | conditioning context), where x is the label being predicted and the conditioning context is a set of features.

  • Syntactic context [Hindle et al., 2012] [Allamanis et al., 2015]: bad fit for programs
  • Semantic context, e.g. conditioning on defer and reject to predict promise [Nguyen et al., 2013] [Allamanis et al., 2014] [Raychev et al., 2014]: hard-coded heuristics, task- and language-specific

SLIDE 13

PHOG: Concepts

  • Generalizes PCFGs to allow conditioning on a richer context.
  • Uses program synthesis to learn a function that explains the data; the function returns the conditioning context for a given query.
  • Uses the learned function to build the probabilistic model.

SLIDE 14

Generalizing PCFG

Context-Free Grammar: s → s1 … sn with probability P

  Property → x        0.05
  Property → y        0.03
  Property → promise  0.001

SLIDE 15

PHOG: Generalizes PCFG

Higher Order Grammar: s[γ] → s1 … sn with probability P, where γ is the conditioning context

  Property[reject, promise] → promise  0.67
  Property[reject, promise] → notify   0.12
  Property[reject, promise] → resolve  0.11
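The relationship between the two grammars shows up in how rule probabilities are estimated: a PCFG conditions only on the left-hand-side nonterminal, while a higher order grammar also keys on the context. A small sketch under invented data (the corpus of (nonterminal, context, production) triples below is made up for illustration):

```python
# Sketch: maximum-likelihood rule probabilities with and without context.
from collections import Counter, defaultdict

data = [
    ("Property", ("reject", "promise"), "promise"),
    ("Property", ("reject", "promise"), "promise"),
    ("Property", ("reject", "promise"), "notify"),
    ("Property", ("open",), "x"),
    ("Property", ("open",), "y"),
]

def estimate(data, use_context):
    counts = defaultdict(Counter)
    for nonterminal, context, production in data:
        # A PCFG is the special case where the context is ignored.
        key = (nonterminal, context) if use_context else nonterminal
        counts[key][production] += 1
    return {key: {p: n / sum(c.values()) for p, n in c.items()}
            for key, c in counts.items()}

pcfg = estimate(data, use_context=False)
phog = estimate(data, use_context=True)
print(pcfg["Property"]["promise"])  # 0.4: diluted across all contexts
print(phog[("Property", ("reject", "promise"))]["promise"])  # 2/3 in context
```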

SLIDE 16

Conditioning on Richer Context

s[γ] → s1 … sn

What is the best conditioning context γ?

SLIDE 17

Candidate elements of the conditioning context:

  • Control Structures
  • Identifiers
  • Constants
  • APIs
  • Fields
SLIDE 18

Source Code → ? → Conditioning Context

How do we map source code to its conditioning context?
SLIDE 19

Higher Order Grammar

Production rules R:  s[γ] → s1 … sn
Function f: AST → γ

Parametrize the grammar by a function f used to dynamically obtain the context.

SLIDE 21

Higher Order Grammar

Source Code → Abstract Syntax Tree → f(AST) → Conditioning Context

Production rules R:  s[γ] → s1 … sn
Function f: AST → γ

SLIDE 22

Function Representation

In general, f could be an unrestricted (Turing-complete) program. In our work, f is expressed in TCond, a language for navigating over trees and accumulating context:

TCond   ::= ε | WriteOp TCond | MoveOp TCond
MoveOp  ::= Up | Left | Right | DownFirst | DownLast | NextDFS | PrevDFS
          | NextLeaf | PrevLeaf | PrevNodeType | PrevNodeValue | PrevNodeContext
WriteOp ::= WriteValue | WriteType | WritePos

SLIDE 23

Expressing functions: TCond Language

A TCond program is a sequence of operations, e.g. Up Left WriteValue: move to the parent node, move to its left sibling, and append that node's value to the accumulated context.
SLIDE 24

Example

Query:  elem.notify( ... , ... , { position: 'top', hide: false, ? } );

Executing a TCond program step by step, starting at the query node with an empty context {}:

  Left WriteValue                    → {hide}
  Up WritePos                        → {hide, 3}
  Up DownFirst DownLast WriteValue   → {hide, 3, notify}

The accumulated context {hide, 3, notify} captures the Previous Property, the Parameter Position, and the API name.
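The trace above can be reproduced with a toy interpreter for the TCond operations it uses. This is a sketch, not the authors' code: the hand-built AST below is one plausible way a parser might represent the notify call, and the node kinds are assumptions.

```python
# Toy TCond interpreter run on a hand-built AST for
# elem.notify(..., ..., { position: 'top', hide: false, ? }).
class Node:
    def __init__(self, kind, value=None, children=None):
        self.kind, self.value = kind, value
        self.children = children or []
        self.parent = None
        for child in self.children:
            child.parent = self

def run_tcond(program, node):
    ctx = []
    for op in program:
        if op == "Up":
            node = node.parent
        elif op == "Left":  # previous sibling
            siblings = node.parent.children
            node = siblings[siblings.index(node) - 1]
        elif op == "DownFirst":
            node = node.children[0]
        elif op == "DownLast":
            node = node.children[-1]
        elif op == "WriteValue":
            ctx.append(node.value)
        elif op == "WritePos":  # position among the parent's children
            ctx.append(node.parent.children.index(node))
    return ctx

query = Node("Property")  # the unknown property '?'
ast = Node("Call", children=[
    Node("MemberExpr", children=[Node("Identifier", value="elem"),
                                 Node("Property", value="notify")]),
    Node("Arg"), Node("Arg"),
    Node("Object", children=[Node("Property", value="position"),
                             Node("Property", value="hide"),
                             query]),
])

program = ["Left", "WriteValue", "Up", "WritePos",
           "Up", "DownFirst", "DownLast", "WriteValue"]
print(run_tcond(program, query))  # ['hide', 3, 'notify']
```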

SLIDE 29

Learning PHOG

Given the TCond language and an existing dataset D, find

  fbest = arg min over f ∈ TCond of cost(D, f)

using program synthesis: enumerative search or genetic programming. To make this tractable, use representative sampling: choose d ⊂ D with |d| << |D| such that |cost(d, f) - cost(D, f)| < ε.

Learning Programs from Noisy Data. POPL ’16, ACM.
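The enumerative-search strategy the slide mentions can be sketched as follows. This is a skeleton, not the paper's implementation: the cost function here is an invented stand-in (in the paper, cost reflects how well the model induced by f predicts the data), and the operation set is truncated.

```python
# Sketch: enumerate TCond programs up to a length bound, keep the cheapest.
import itertools

MOVE_OPS = ["Up", "Left", "Right", "DownFirst", "DownLast"]
WRITE_OPS = ["WriteValue", "WriteType", "WritePos"]

def enumerate_programs(max_len):
    ops = MOVE_OPS + WRITE_OPS
    for n in range(1, max_len + 1):
        for prog in itertools.product(ops, repeat=n):
            if prog[-1] in WRITE_OPS:  # a useful program ends with a write
                yield list(prog)

def synthesize(sample, cost, max_len=2):
    return min(enumerate_programs(max_len), key=lambda p: cost(sample, p))

# Invented toy cost: pretend the data is best explained by writing the
# previous sibling's value; everything else pays a length penalty.
def toy_cost(sample, prog):
    return 0 if prog == ["Left", "WriteValue"] else len(prog) + 1

best = synthesize(sample=None, cost=toy_cost)
print(best)  # ['Left', 'WriteValue']
```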

SLIDE 30

Evaluation

Probabilistic model of the JavaScript language, evaluated on:

  • 100k files: PHOG training
  • 20k files: TCond learning
  • 50k files: blind (held-out) set

SLIDE 31

Evaluation: Code Completion Error Rate

  PCFG         49.9%
  n-gram       28.7%
  Naive Bayes  45.8%
  SVM          29.5%
  PHOG         18.5%

SLIDE 32

Evaluation: Code Completion Error Rate by Label Type

  Identifier    38%   contains = jQuery …
  Property      35%   start = list.length;
  String        48%   ‘[‘ + attrs + ‘]’
  Number        36%   canvas(xy[0], xy[1], …)
  RegExp        34%   line.replace(/(&nbsp;| )+/, …)
  UnaryExpr      3%   if (!events || !…)
  BinaryExpr    26%   while (++index < …)
  LogicalExpr    8%   frame = frame || …

SLIDE 33

Evaluation: Training Time and Query Throughput

  Model        Training Time   Queries per Second
  PCFG         1 min           71 000
  n-gram       4 min           15 000
  Naive Bayes  3 min           10 000
  SVM          36 hours        12 500
  PHOG         162 + 3 min     50 000

SLIDE 34

Key Ideas

  • Learn a function that explains the data. The function dynamically obtains the best conditioning context for a given query:

      fbest = arg min over f ∈ TCond of cost(D, f)

    for a dataset D and the TCond language:

      TCond   ::= ε | WriteOp TCond | MoveOp TCond
      MoveOp  ::= Up | Left | Right | ...
      WriteOp ::= WriteValue | WriteType | ...

  • Define a new generative model parametrized by the learned function: PHOG(fbest), the Probabilistic Higher Order Grammar.

PHOG(fbest) delivers: High Precision, Efficient Learning, Widely Applicable, Explainable Predictions.