Memory Augmented Policy Optimization (MAPO) for Program Synthesis and Semantic Parsing - PowerPoint PPT Presentation
Memory Augmented Policy Optimization (MAPO) for Program Synthesis and Semantic Parsing
Chen Liang, Mohammad Norouzi, Jonathan Berant, Quoc Le, Ni Lao
Program Synthesis / Semantic Parsing
Question: how many more passengers flew to los angeles than to saskatoon?
Program: (filterin rows ['saskatoon'] r.city) (filterin rows ['los angeles'] r.city) (diff v1 v0 r.passengers)
Answer: 12,467
Program Synthesis / Semantic Parsing
Latent: the program is never observed; only the question and its answer are given.
Sparse: the reward is binary (correct or not), so most sampled programs earn zero reward.
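To make the example concrete, here is a toy executor for the slide's program. This is a sketch only: the `filterin`/`diff` semantics and the table contents are assumptions chosen to reproduce the slide's answer, not the authors' actual DSL or data.

```python
# Hypothetical mini-executor for the slide's example program (not the
# authors' actual DSL): filterin keeps rows whose column matches one of
# the given values; diff subtracts a column value between two results.
rows = [
    {"city": "saskatoon", "passengers": 10_456},    # made-up numbers
    {"city": "los angeles", "passengers": 22_923},  # chosen so diff = 12,467
]

def filterin(rows, values, column):
    return [r for r in rows if r[column] in values]

def diff(v1, v0, column):
    return v1[0][column] - v0[0][column]

v0 = filterin(rows, ["saskatoon"], "city")
v1 = filterin(rows, ["los angeles"], "city")
print(diff(v1, v0, "passengers"))  # 12467
```

The point of the task is that only the question and the answer (12,467) are observed; the program above is latent and must be discovered by the policy.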
Policy Gradient
Actor -> on-policy samples -> Learner -> updated policy
High variance => slow training
Unbiased => optimal solution
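A minimal sketch of the on-policy REINFORCE estimator on a toy discrete "program space" (the space size, rewards, and sample count are illustrative assumptions). It is unbiased, but with a sparse binary reward most samples contribute nothing, so the estimate is noisy.

```python
import numpy as np

rng = np.random.default_rng(0)
logits = np.zeros(5)                      # toy policy over 5 "programs"
rewards = np.array([0., 0., 1., 0., 0.])  # sparse binary reward

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def reinforce_grad(logits, n_samples=32):
    """On-policy REINFORCE estimate of d E[R] / d logits."""
    p = softmax(logits)
    grad = np.zeros_like(logits)
    for _ in range(n_samples):
        a = rng.choice(len(p), p=p)
        g = -p.copy()
        g[a] += 1.0                       # grad log pi(a) wrt logits
        grad += rewards[a] * g
    return grad / n_samples

# a noisy estimate of the true gradient [-0.04, -0.04, 0.16, -0.04, -0.04]
print(reinforce_grad(logits))
```

With only 32 samples the estimate fluctuates heavily around the true gradient; that variance is exactly the "slow training" problem the slide names.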
Imitation Learning
Actor + demonstrations -> Learner -> updated policy
Low variance => fast training
Biased => suboptimal solution
Requires human supervision
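Imitation learning here amounts to maximum likelihood on demonstrated programs. A toy sketch (the demonstration indices, learning rate, and space size are illustrative assumptions, not the authors' setup):

```python
import numpy as np

# Toy imitation learning: gradient ascent on the log-likelihood of
# demonstrated "programs" over a 5-element discrete space.
demos = [2, 2, 3]            # hypothetical demonstrated program indices
logits = np.zeros(5)
lr = 0.5

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for _ in range(100):
    p = softmax(logits)
    grad = -p * len(demos)   # grad of sum_a log pi(a) = counts - N * p
    for a in demos:
        grad[a] += 1.0
    logits += lr * grad / len(demos)

# probability mass concentrates on the demonstrated programs 2 and 3
print(softmax(logits).round(2))
```

The estimate is deterministic (low variance), but the policy can only imitate what was demonstrated: if the demonstrations are incomplete or spurious, the solution is biased.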
MAPO
Actor -> high-reward samples -> memory buffer
Samples inside memory + samples outside memory -> Learner -> updated policy
Unbiased => optimal solution
Low variance => fast training
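The actor/memory interaction can be sketched as follows. This is a toy simplification, not the paper's distributed implementation: the reward positions, update rule, and step counts are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy MAPO-style loop: sampled programs that earn reward 1 are saved to
# a memory buffer, so rare high-reward discoveries are never forgotten.
n_programs = 100
rewards = np.zeros(n_programs)
rewards[[7, 42]] = 1.0        # two "correct" programs, unknown to the actor

memory = set()
logits = np.zeros(n_programs)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for step in range(2000):
    a = rng.choice(n_programs, p=softmax(logits))
    if rewards[a] == 1.0:
        memory.add(int(a))    # actor writes high-reward samples to memory
    # learner (simplified): push probability toward everything in memory
    for m in memory:
        p = softmax(logits)
        logits += 0.01 * (np.eye(n_programs)[m] - p)

print(sorted(memory))
```

Once a correct program lands in the buffer it keeps contributing to every update, which is what makes training with a sparse reward feasible.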
Gradient Estimate
The gradient is an expectation over the full program space; estimating it by sampling is unbiased but high variance.
Sampling from a smaller space => variance reduction.
MAPO: stratified sampling: enumeration over programs inside memory + sampling over programs outside memory.
Stratified sampling => variance reduction, unbiased.
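A sketch of the stratified estimator on a toy discrete space (names and sizes are illustrative assumptions): the memory stratum is enumerated exactly and weighted by its probability mass pi(B), while the remainder is sampled from the renormalized policy outside memory and weighted by 1 - pi(B).

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def mapo_grad(logits, rewards, memory, n_samples=8):
    """Stratified estimate of d E[R] / d logits (toy sketch):
    exact enumeration inside memory + sampling outside memory."""
    p = softmax(logits)
    inside = sorted(memory)
    outside = np.array([a for a in range(len(p)) if a not in memory])
    p_mem = p[inside].sum()                        # pi(B)
    grad = np.zeros_like(logits)
    # stratum 1: enumerate the memory buffer exactly
    for a in inside:
        g = -p.copy()
        g[a] += 1.0                                # grad log pi(a)
        grad += p[a] * rewards[a] * g
    # stratum 2: sample from the renormalized policy outside memory
    q = p[outside] / p[outside].sum()
    for _ in range(n_samples):
        a = outside[rng.choice(len(outside), p=q)]
        g = -p.copy()
        g[a] += 1.0
        grad += (1.0 - p_mem) * rewards[a] * g / n_samples
    return grad

print(mapo_grad(np.zeros(5), np.array([0., 0., 1., 0., 0.]), memory={2}))
```

In the extreme case where all reward mass sits inside memory (as above), the sampled stratum contributes nothing and the estimate is exact, i.e. zero variance; in general the enumerated stratum shrinks the variance while the two weights keep the estimator unbiased.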
MAPO
O(\theta) = \mathbb{E}_{a \sim \pi_\theta}[R(a)] = \sum_{a \in \mathcal{B}} \pi_\theta(a)\, R(a) + \big(1 - \pi_\theta(\mathcal{B})\big)\, \mathbb{E}_{a \sim \pi_\theta,\, a \notin \mathcal{B}}[R(a)]
(a = a program, R(a) = correct or not, \mathcal{B} = memory buffer of high-reward programs, \pi_\theta(\mathcal{B}) = \sum_{a \in \mathcal{B}} \pi_\theta(a))
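Differentiating the MAPO objective (with a a sampled program, R(a) its binary reward, B the memory buffer of high-reward programs, and pi_theta(B) the total probability mass on B) splits the policy gradient into an exact sum over the buffer plus a sampled expectation outside it, consistent with the paper's decomposition:

```latex
\nabla_\theta O(\theta)
  = \sum_{a \in \mathcal{B}} \pi_\theta(a)\, R(a)\, \nabla_\theta \log \pi_\theta(a)
  + \big(1 - \pi_\theta(\mathcal{B})\big)\,
    \mathbb{E}_{a \sim \pi_\theta,\, a \notin \mathcal{B}}\!\big[ R(a)\, \nabla_\theta \log \pi_\theta(a) \big]
```

The first term is computed by enumeration (low variance); the second by sampling; together the two strata keep the estimator unbiased.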
WikiTableQuestions: first SOTA using RL
WikiSQL: strong vs. weak supervision!
Strong supervision
- MAPO converges slower than iterative maximum likelihood, but reaches a better solution.
- REINFORCE doesn't make much progress (<10% accuracy).
Code: https://github.com/crazydonkey200/neural-symbolic-machines
Paper: https://arxiv.org/abs/1807.02322
http://crazydonkey200.github.io/
An efficient policy optimization method for program synthesis and semantic parsing.