Learning Programs from Noisy Data
Veselin Raychev, Pavol Bielik, Martin Vechev, Andreas Krause (ETH Zurich)
Why learn programs from examples?

Input/output examples are often easier to provide than a full specification (e.g., in FlashFill). Given the examples, the task is to learn a function p such that

    p(input_1) = output_1, ..., p(input_n) = output_n

But the user may make a mistake in the examples. The actual goal is to produce the program p that the user really wanted and tried to specify. This exposes the key problem of synthesis: it overfits and is not robust to noise.
How the approaches compare

                              Handling errors    Number of    Learned program
                              in the dataset     examples     complexity
    Program synthesis (PL)    no                 tens         interesting programs
    Deep learning (ML)        yes                millions     simple, but unexplainable functions
    This paper                yes                millions     interesting programs

This paper bridges a gap between ML and PL and advances both areas: it expands the capabilities of existing synthesizers and reaches new state-of-the-art precision for programming tasks.
The approach has three ingredients, which generalize existing works:

  ○ Handling noise: errors in the training dataset
  ○ New probabilistic models: learning statistical models on data
  ○ Handling large datasets: synthesis with millions of examples

Architecturally, the input/output examples (some possibly incorrect) flow through a representative dataset sampler into a program generator; the synthesized program p parametrizes a probabilistic model.
Handling noise

A synthesizer takes input/output examples and a domain-specific language (DSL) and finds a program p such that p(input_i) = output_i for all examples. If one example is incorrect (e.g., a typo), the intended program has p(x) ≠ y on that example. A noise-tolerant synthesizer can give a new kind of feedback: it marks each example as satisfied (✔) or suspicious (❌) and can remove the suspicious examples.

Alternatively, a program can satisfy every example, including the wrong one, by hardcoding the dataset:

    if input = input_1 return output_1
    if input = input_2 return output_2
    ...
    return output_n

Issue: such a program makes no errors on the dataset, and it may be the only solution to the plain minimization problem.
Let D be the dataset of input/output examples (some incorrect) and P the space of possible programs in the DSL. The naive formulation

    pbest = arg min_{p∈P} errors(D, p)

admits overly long programs that hardcode the inputs/outputs, so synthesis must penalize such answers.

Our problem formulation:

    pbest = arg min_{p∈P} errors(D, p) + λ·r(p)

where errors(D, p) is the error rate of p on D, r(p) is a regularizer that penalizes long programs (here, the number of instructions), and λ is a regularization constant. The sum is the total solution cost.
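To make the objective concrete, here is a minimal brute-force sketch in Python (illustrative only; the paper's synthesizer uses an SMT solver, not enumeration). The candidate set, the toy dataset with one noisy pair, and the instruction counts are assumptions for the example.

    def errors(dataset, program):
        """Count the examples the program gets wrong."""
        return sum(1 for x, y in dataset if program(x) != y)

    def synthesize(dataset, candidates, lam=0.6):
        """Return the candidate minimizing errors(D, p) + lam * r(p).

        Each candidate is (name, function, number_of_instructions)."""
        return min(candidates,
                   key=lambda c: errors(dataset, c[1]) + lam * c[2])

    # Hypothetical dataset: "x + 1" with one noisy pair (5, 99).
    D = [(1, 2), (2, 3), (3, 4), (5, 99)]
    candidates = [
        ("x + 1", lambda x: x + 1, 1),                          # 1 error, cost 1.6
        ("lookup", lambda x: {1: 2, 2: 3, 3: 4, 5: 99}[x], 4),  # 0 errors, cost 2.4
    ]
    print(synthesize(D, candidates)[0])  # regularization prefers "x + 1"

Without the λ·r(p) term, the hardcoded "lookup" (zero errors) would win; with it, the short program that tolerates the noisy example is preferred.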
Encoding for an SMT solver

For a candidate program p ∈ P_r (programs with r instructions), the formula given to the SMT solver encodes one error indicator per example:

    err_1 = if p(input_1) = output_1 then 0 else 1
    err_2 = if p(input_2) = output_2 then 0 else 1
    err_3 = if p(input_3) = output_3 then 0 else 1
    errors = err_1 + err_2 + err_3

The synthesizer then asks a number of SMT queries in increasing order of total cost errors + λ·r. For example, for λ = 0.6 the costs are:

                  number of errors
     r            0      1      2      3
     1            0.6    1.6    2.6    3.6
     2            1.2    2.2    3.2    4.2
     3            1.8    2.8    3.8    4.8

Querying in increasing cost order gives UNSAT at costs 0.6, 1.2, 1.6, and 1.8, then SAT at cost 2.2. The first SAT query yields the best program: here it has two instructions and makes one error.
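A sketch of this query schedule, under assumptions: smt_check(r, e) is a placeholder for the real Z3 encoding and should return a program with r instructions making at most e errors, or None if the query is UNSAT.

    def synthesize_increasing_cost(smt_check, max_r, max_errors, lam=0.6):
        """Ask queries in increasing order of cost = errors + lam * r.

        Because every cheaper (r, e) combination was already UNSAT,
        the first SAT query is optimal."""
        grid = [(e + lam * r, r, e)
                for r in range(1, max_r + 1)
                for e in range(0, max_errors + 1)]
        for cost, r, e in sorted(grid):
            program = smt_check(r, e)
            if program is not None:
                return program, cost
        return None, float("inf")

    # Stub oracle reproducing the table above: SAT only for r = 2, e >= 1.
    demo = lambda r, e: "p*" if r == 2 and e >= 1 else None
    print(synthesize_increasing_cost(demo, 3, 3))  # ('p*', 2.2)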
Instantiation: take an actual synthesizer and show that we can make it handle noise.
We built a synthesizer for BitStream programs using Z3, similar to Jha et al. [ICSE'10] and Gulwani et al. [PLDI'11]; it synthesizes short loop-free programs. Example program:

    function check_if_power_of_2(int32 x) {
      var o = add(x, 1)
      return bitwise_and(x, o)
    }

Question: how well does our synthesizer discover noise in programs from prior work? [Plot: noise detection quality as a function of λ; λ is picked empirically in the best-performing region.] Handling noise enables us to solve new classes of problems beyond normal synthesis.
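For reference, a direct Python transcription of the example above (my port, not from the paper). Note that x & (x + 1) == 0 holds exactly when x is all-ones in binary, i.e., when x + 1 is a power of two.

    def check_if_power_of_2(x: int) -> int:
        """Port of the synthesized BitStream program.

        Returns 0 exactly when x == 2**n - 1 for some n,
        i.e., when x + 1 is a power of two."""
        o = (x + 1) & 0xFFFFFFFF   # add(x, 1) on int32
        return x & o               # bitwise_and(x, o)

    assert check_if_power_of_2(0b0111) == 0   # 7 + 1 == 8 is a power of two
    assert check_if_power_of_2(0b0110) != 0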
Handling large datasets

With a large number of examples the formulation is

    pbest = arg min_{p∈P} cost(D, p)

where D contains millions of input/output examples. Computing cost(D, p) once is O(|D|), which makes synthesis practically intractable. Key idea: iterative synthesis on a fraction of the examples.
Two components cooperate:

  ○ Program generator: a synthesizer for a small number of examples; given a dataset d, it finds the best program pbest = arg min_{p∈P} cost(d, p).
  ○ Dataset sampler: picks a dataset d ⊆ D. We introduce a representative dataset sampler, generalizing a user who provides input/output examples.

Start with a small random sample d ⊆ D, then iteratively generate programs and samples: the generator produces p_1 on d, the sampler picks a new d, the generator produces p_2, and so on, until the loop converges to pbest. This algorithm generalizes synthesis-by-example techniques (see the sketch below).
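A sketch of the alternating loop, under assumptions: synthesize(d) is the small-dataset synthesizer from earlier, representative_sample(D, programs, k) is the sampler described next, and stopping when the sample reproduces an already-seen program is my simplification of the paper's convergence test.

    import random

    def iterative_synthesis(D, synthesize, representative_sample,
                            k=100, max_iters=10, seed=0):
        """Alternate between synthesizing on a small d and re-sampling d."""
        rng = random.Random(seed)
        d = rng.sample(D, k)                  # start from a random sample
        programs = []
        for _ in range(max_iters):
            p = synthesize(d)                 # best program on the small set
            if p in programs:                 # the sample already agrees with p
                return p
            programs.append(p)
            d = representative_sample(D, programs, k)
        return programs[-1]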
Representative dataset sampler. Idea: pick a small dataset d on which the already generated programs p_1, ..., p_n behave as they do on the full dataset:

    d = arg min_{d ⊆ D} max_{i∈1..n} | cost(d, p_i) - cost(D, p_i) |

[Bar charts: the costs of p_1 and p_2 on the small dataset d match their costs on the full dataset D.]

Theorem: this sampler shrinks the candidate program search space. In the evaluation it yields a significant speedup of synthesis. The objective connects to empirical risk minimization. A greedy sketch of the sampler follows below.
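A sketch of the sampler, under assumptions: instead of solving the arg min over all subsets exactly, it does random search over candidate subsets and keeps the one with the smallest worst-case cost gap; the paper's actual sampler is more sophisticated, and cost here is simply the error rate.

    import random

    def cost(dataset, program):
        """Error rate of the program on the dataset."""
        return sum(1 for x, y in dataset if program(x) != y) / len(dataset)

    def representative_sample(D, programs, k=100, tries=50, seed=0):
        """Approximate d = argmin_{d ⊆ D} max_i |cost(d, p_i) - cost(D, p_i)|."""
        rng = random.Random(seed)
        full = [cost(D, p) for p in programs]   # reference costs on all of D
        best_d, best_gap = None, float("inf")
        for _ in range(tries):
            d = rng.sample(D, k)
            gap = max(abs(cost(d, p) - f) for p, f in zip(programs, full))
            if gap < best_gap:
                best_d, best_gap = d, gap
        return best_d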
New probabilistic models

Recall the architecture: input/output examples (some incorrect) flow through the representative dataset sampler into the program generator, and the synthesized program p parametrizes a probabilistic model.
A new breed of tools learns from large existing codebases (e.g., "Big Code") to make predictions about programs: first learn a model from the codebase, then predict with that model. Prior tools hard-code the model and achieve low precision. Essentially, they remember a mapping from a context in the training data to a prediction (with probabilities):

  ○ Hindle et al. [ICSE'12] learn a mapping from preceding tokens: the model predicts slice when it sees it after "+ name .". This model comes from NLP and has very low precision.
  ○ Raychev et al. [PLDI'14] learn a mapping from preceding API calls per object: the model predicts slice when it sees it after charAt. This relies on static analysis and has low precision for JavaScript.

Core problem: existing machine learning models are limited and not expressive enough.
Our approach: learn a program that parametrizes a probabilistic model that makes predictions, i.e., learn the mapping itself, then predict with this model. Prior models are described by simple hard-coded programs; we learn a better program.

Training example: compute the context of a completion point with a program p, and learn a mapping from that context to the completion (e.g., from toUpperCase to slice). Evaluation example: compute the context with the same program p, and predict the completion (slice).

Synthesis of the probabilistic model can be done with the same optimization problem as before:

    pbest = arg min_{p∈P} errors(D, p) + λ·r(p)

except that D is now evaluation data (input/output examples) and errors(D, p) is the cost cost(D, p) of the model parametrized by p. A sketch of such a model appears below.
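A sketch of how a conditioning program p parametrizes a predictive model, under assumptions: p maps an input (a code context) to whatever features it extracts, training counts (context, completion) pairs, prediction returns the most frequent completion, and errors(D, p) is the fraction of held-out examples predicted wrongly.

    from collections import Counter, defaultdict

    def train_model(examples, p):
        """Map each context computed by program p to completion counts."""
        table = defaultdict(Counter)
        for inp, completion in examples:
            table[p(inp)][completion] += 1
        return table

    def predict(table, p, inp):
        counts = table.get(p(inp))
        return counts.most_common(1)[0][0] if counts else None

    def errors(eval_examples, p, table):
        """Error rate of the model parametrized by p; this is the
        cost(D, p) minimized when synthesizing p itself."""
        wrong = sum(1 for inp, y in eval_examples
                    if predict(table, p, inp) != y)
        return wrong / len(eval_examples)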
Putting it together: DeepSyn

The techniques (handling noise, synthesizing a model, the representative dataset sampler) are generally applicable to program synthesis. We apply them in a "Big Code" tool called DeepSyn, trained on 100,000 JavaScript files from GitHub: the program synthesizer (representative dataset sampler + program generator) finds the best program p, and the model is then trained on the full data D with p. Evaluation: 50,000 files (not used in training or synthesis) on an API completion task.
    Conditioning program p                                        Accuracy
    Last two tokens, Hindle et al. [ICSE'12]                      22.2%
    Last two APIs per object, Raychev et al. [PLDI'14]            30.4%
    Program synthesis with noise (this work)                      46.3%
    Program synthesis with noise + dataset sampler (this work)    50.4%

We can explain the best program: it looks at the APIs preceding the completion position and at the tokens prior to those APIs.
Summary

  ○ Handling noise: extending synthesizers to handle incorrect examples
  ○ Handling large datasets: scalability via the representative dataset sampler
  ○ Second-order learning: synthesis of probabilistic models parametrized by a program p

This bridges the gap between ML and PL and advances both areas.

The best conditioning program found in the evaluation (its instructions navigate the program tree and write the observed actions and values):

    Left PrevActor WriteAction WriteValue PrevActor WriteAction PrevLeaf WriteValue PrevLeaf WriteValue