Improving a Neural Semantic Parser by Counterfactual Learning from Human Bandit Feedback
Carolin Lawrence, Stefan Riezler
Heidelberg University, Institute for Computational Linguistics
Situation Overview
◮ Situation: a deployed system (e.g. QA, MT, ...)
◮ Goal: improve the system using human feedback
◮ Plan: create a log Dlog of user-system interactions & improve the system offline (for safety)

Here: improve a neural semantic parser
Contrast to Previous Approaches
[Diagram: in previous approaches, for each question x (for 1...n), the parser predicts parses y1, ..., ys; a database look-up yields answers a1, ..., as, which are compared against a gold answer to obtain rewards r1, ..., rs for training. Required data: gold answers.]
Our Approach
[Diagram: in our approach, for each question x (for 1...n), the deployed parser predicts a single parse y; the database returns answer a, the user gives feedback r, and the triple (x, y, r) is logged. The log is later used to train an improved parser offline. Required data: user feedback only.]
◮ No supervision: given an input, the gold output is unknown
◮ Bandit: feedback is given for only one system output
◮ Bias: the log Dlog is biased towards the decisions of the deployed system

Solution: counterfactual / off-policy reinforcement learning
Task
A natural language interface to OpenStreetMap
◮ OpenStreetMap (OSM): a geographical database
◮ NLmaps v2: an extension of the previous corpus, now totalling 28,609 question-parse pairs
◮ Example question: “How many hotels are there in Paris?” Answer: 951
◮ The correctness of answers is difficult for humans to judge
→ instead, judge parses by making them human-understandable
◮ Feedback collection setup:
  1. automatically convert a parse to a set of statements
  2. humans judge the statements
Example: Feedback Formula
    query(around(center(area(keyval('name','Paris')),nwr(keyval('name','Place de la République'))),search(nwr(keyval('amenity','parking'))),maxdist(WALKING_DIST)),qtype(findkey('name')))
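To make the conversion step concrete, here is a minimal Python sketch of how such a parse could be mapped to human-judgeable statements. The helper and the statement templates are our own illustration, not the authors' conversion code:

    # Illustrative sketch only: templates and helper are assumptions for
    # exposition, not the authors' parse-to-statement implementation.

    def parse_to_statements(parse: str) -> list[str]:
        """Map fragments of an NLmaps parse to human-judgeable statements."""
        fragment_templates = [
            ("area(keyval('name','Paris'))", "Town: Paris"),
            ("nwr(keyval('name','Place de la République'))",
             "Reference point: Place de la République"),
            ("keyval('amenity','parking')", "Search term: parking"),
            ("maxdist(WALKING_DIST)", "Distance: at most walking distance"),
            ("qtype(findkey('name'))", "Question type: name of the place(s)"),
        ]
        return [statement for fragment, statement in fragment_templates
                if fragment in parse]

    parse = ("query(around(center(area(keyval('name','Paris')),"
             "nwr(keyval('name','Place de la République'))),"
             "search(nwr(keyval('amenity','parking'))),"
             "maxdist(WALKING_DIST)),qtype(findkey('name')))")
    for statement in parse_to_statements(parse):
        print(statement)  # each statement is judged correct/incorrect by a human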
Objectives
Counterfactual Learning
Resources
◮ collected log D_log = {(x_t, y_t, δ_t)}_{t=1}^n with
  ◮ x_t: input
  ◮ y_t: most likely output of the deployed system π_0
  ◮ δ_t ∈ [−1, 0]: loss (i.e. negative reward) received from the user

Deterministic Propensity Matching (DPM)
◮ minimize the expected risk for a target policy π_w:

    \hat{R}_{DPM}(\pi_w) = \frac{1}{n} \sum_{t=1}^{n} \delta_t \, \pi_w(y_t | x_t)

◮ improve π_w using (stochastic) gradient descent, as in the sketch below
◮ high variance → use a multiplicative control variate
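A minimal PyTorch sketch of this objective on toy tensors (PyTorch and all variable names are our illustration; the paper's system is built on Nematus):

    import torch

    def dpm_risk(probs: torch.Tensor, delta: torch.Tensor) -> torch.Tensor:
        # (1/n) * sum_t delta_t * pi_w(y_t | x_t)
        return (delta * probs).mean()

    # Toy stand-ins: probs[t] plays pi_w(y_t | x_t) for the logged parse y_t,
    # delta[t] in [-1, 0] is the loss the user assigned to it.
    probs = torch.tensor([0.30, 0.50, 0.10, 0.70], requires_grad=True)
    delta = torch.tensor([-1.0, 0.0, -0.5, 0.0])

    risk = dpm_risk(probs, delta)
    risk.backward()  # gradients for one (stochastic) gradient-descent step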
Multiplicative Control Variate
◮ for random variables X and Y, with Ȳ the expectation of Y:

    \mathbb{E}[X] = \mathbb{E}\left[\frac{X}{Y}\right] \cdot \bar{Y}

→ the RHS has lower variance if Y positively correlates with X

DPM with Reweighting (DPM+R)

    \hat{R}_{DPM+R}(\pi_w) = \frac{\frac{1}{n} \sum_{t=1}^{n} \delta_t \, \pi_w(y_t | x_t)}{\underbrace{\frac{1}{n} \sum_{t=1}^{n} \pi_w(y_t | x_t)}_{\text{reweight sum } R}}

◮ reduces variance but introduces a bias of order O(1/n) that decreases as n increases → n should be as large as possible
◮ Problem: in stochastic minibatch learning, n is too small (see the sketch below)
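The reweighted objective is the same numerator divided by the reweight sum R; continuing the toy sketch above (names ours):

    import torch

    def dpm_r_risk(probs: torch.Tensor, delta: torch.Tensor) -> torch.Tensor:
        numerator = (delta * probs).mean()  # (1/n) sum_t delta_t * pi_w(y_t | x_t)
        reweight_sum = probs.mean()         # R = (1/n) sum_t pi_w(y_t | x_t)
        return numerator / reweight_sum

    probs = torch.tensor([0.30, 0.50, 0.10, 0.70], requires_grad=True)
    delta = torch.tensor([-1.0, 0.0, -0.5, 0.0])
    dpm_r_risk(probs, delta).backward()

Note that gradients flow through both the numerator and R here, since both are computed with the current parameters w.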
One-Step Late (OSL) Reweighting
Perform gradient descent updates & reweighting asynchronously
◮ evaluate the reweight sum R on the entire log of size n using parameters w′
◮ update using minibatches of size m, m ≪ n
◮ periodically update R
→ retains all desirable properties

DPM+OSL

    \hat{R}_{DPM+OSL}(\pi_w) = \frac{\frac{1}{m} \sum_{t=1}^{m} \delta_t \, \pi_w(y_t | x_t)}{\frac{1}{n} \sum_{t=1}^{n} \pi_{w'}(y_t | x_t)}
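A sketch of the OSL schedule (the loop structure and helper names such as score_whole_log are hypothetical; the actual implementation lives in Nematus):

    import torch

    @torch.no_grad()
    def osl_reweight_sum(full_log_probs: torch.Tensor) -> torch.Tensor:
        # R = (1/n) sum_t pi_{w'}(y_t | x_t) over the entire log, computed
        # with stale parameters w'; no gradient flows through it.
        return full_log_probs.mean()

    def dpm_osl_risk(mb_probs: torch.Tensor, mb_delta: torch.Tensor,
                     R: torch.Tensor) -> torch.Tensor:
        # (1/m) sum over the minibatch, divided by the fixed R
        return (mb_delta * mb_probs).mean() / R

    # Hypothetical training schedule:
    # for epoch in range(num_epochs):
    #     R = osl_reweight_sum(score_whole_log(parser))  # size n, parameters w'
    #     for x, y, delta in minibatches(log, size=m):   # m << n
    #         dpm_osl_risk(parser.prob(y, x), delta, R).backward()
    #         optimizer.step(); optimizer.zero_grad()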
Token-Level Feedback
DPM+T

    \hat{R}_{DPM+T}(\pi_w) = \frac{1}{n} \sum_{t=1}^{n} \sum_{j=1}^{|y|} \delta_j \, \pi_w(y_j | x_t)

DPM+T+OSL

    \hat{R}_{DPM+T+OSL}(\pi_w) = \frac{\frac{1}{m} \sum_{t=1}^{m} \sum_{j=1}^{|y|} \delta_j \, \pi_w(y_j | x_t)}{\frac{1}{n} \sum_{t=1}^{n} \pi_{w'}(y_t | x_t)}
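A toy sketch of the token-level numerator (ragged sequences as lists of tensors; names ours). For DPM+T+OSL the same numerator is divided by the OSL reweight sum from the previous slide:

    import torch

    def dpm_t_risk(token_probs: list, token_deltas: list) -> torch.Tensor:
        # (1/n) sum_t sum_j delta_j * pi_w(y_j | x_t), for parses of varying length
        per_sequence = [(d * p).sum() for p, d in zip(token_probs, token_deltas)]
        return torch.stack(per_sequence).mean()

    # Two logged parses; in the first, only the second token was judged wrong.
    token_probs = [torch.tensor([0.9, 0.4, 0.7], requires_grad=True),
                   torch.tensor([0.8, 0.2], requires_grad=True)]
    token_deltas = [torch.tensor([0.0, -1.0, 0.0]),
                    torch.tensor([0.0, 0.0])]
    dpm_t_risk(token_probs, token_deltas).backward()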
Experiments
Experimental Setup
◮ sequence-to-sequence neural network: Nematus
◮ deployed system: pre-trained on 2k question-parse pairs
◮ feedback collection:
  1. humans judged 1k system outputs
     ◮ average time to judge a parse: 16.4s
     ◮ most parses (>70%) judged in <10s
  2. simulated feedback for 23k system outputs
     ◮ token-wise comparison to the gold parse
◮ bandit-to-supervised conversion (B2S) baseline: all instances in the log with reward 1 are used as supervised training data (a sketch follows below)
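Since δ is the negative reward, reward 1 corresponds to δ = −1, so B2S reduces to a filter over the log. A minimal sketch (field layout and example entries are illustrative):

    # Each log entry is (question x, logged parse y, loss delta in [-1, 0]).
    log = [
        ("How many hotels are there in Paris?", "query(...)", -1.0),  # reward 1
        ("Where is the closest pharmacy?", "query(...)", 0.0),        # reward 0
    ]

    # Keep only positively-judged outputs and treat them as gold pairs
    # for standard supervised (cross-entropy) training of the parser.
    supervised_pairs = [(x, y) for (x, y, delta) in log if delta == -1.0]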
Experimental Results
[Bar chart of F1 scores (y-axis from 57.45 to 65.45): with human feedback (1k), B2S improves by +0.34 F1 and DPM+T+OSL by +5.77; with large-scale simulated feedback (23k), B2S improves by +0.99 F1 and DPM+T+OSL by +6.96.]