SLIDE 1

Improving a Neural Semantic Parser by Counterfactual Learning from Human Bandit Feedback

Carolin Lawrence, Stefan Riezler

Heidelberg University Institute for Computational Linguistics

July 17, 2018

SLIDE 2

Situation Overview

◮ Situation: a deployed system (e.g. QA, MT, ...)
◮ Goal: improve the system using human feedback
◮ Plan: create a log Dlog of user-system interactions & improve the system offline (safety)

Here: improve a neural semantic parser

SLIDE 3

Contrast to Previous Approaches

[Diagram: the conventional setup. For each question x, the parser predicts parses y1, ..., ys; the database executes them to produce answers a1, ..., as; comparison against a gold answer yields rewards r1, ..., rs, which are used to train the parser. The loop runs for 1...n inputs; gold answers are the required data.]

SLIDE 4

Our Approach

[Diagram: our setup. For a question x, the deployed parser predicts a single parse y; the database executes it to produce answer a; the user gives feedback r; the triple (x, y, r) is logged for 1...n inputs and later used to train the parser offline. User feedback is the only required data.]

SLIDE 5

Our Approach

[Diagram: the two setups from Slides 3 and 4 shown side by side for comparison.]

◮ No supervision: given an input, the gold output is unknown
◮ Bandit: feedback is given for only one system output
◮ Bias: the log Dlog is biased towards the decisions of the deployed system

Solution: Counterfactual / Off-policy Reinforcement Learning

SLIDE 6

Task

SLIDE 7

A natural language interface to OpenStreetMap

◮ OpenStreetMap (OSM): a geographical database
◮ NLmaps v2: an extension of the previous corpus, now totalling 28,609 question-parse pairs

SLIDE 8

A natural language interface to OpenStreetMap

◮ example question: “How many hotels are there in Paris?”

Answer: 951

◮ the correctness of answers is difficult to judge

→ judge parses by making them human-understandable

◮ feedback collection setup:

  • 1. automatically convert a parse to a set of statements
  • 2. humans judge the statements

SLIDE 9

Example: Feedback Formula

query(around(center(area(keyval('name','Paris')),nwr(keyval('name','Place de la République'))),search(nwr(keyval('amenity','parking'))),maxdist(WALKING_DIST)),qtype(findkey('name')))
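
The feedback form renders such a parse as a set of human-readable statements. As a rough illustration only (not the authors' actual conversion step), the key-value pairs those statements are built from can be pulled out of the query string:

```python
import re

# Illustrative sketch: extract the keyval('key','value') pairs from the
# query above; the real feedback form presents richer statements.
QUERY = ("query(around(center(area(keyval('name','Paris')),"
         "nwr(keyval('name','Place de la République'))),"
         "search(nwr(keyval('amenity','parking'))),"
         "maxdist(WALKING_DIST)),qtype(findkey('name')))")

for key, val in re.findall(r"keyval\('([^']+)','([^']+)'\)", QUERY):
    print(f"{key}: {val}")
# name: Paris
# name: Place de la République
# amenity: parking
```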

SLIDE 10

Objectives

SLIDE 11

Counterfactual Learning

Resources

collected log Dlog = {(x_t, y_t, δ_t)}_{t=1}^{n} with
◮ x_t: input
◮ y_t: most likely output of the deployed system π_0
◮ δ_t ∈ [−1, 0]: loss (i.e. negative reward) received from the user
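
For concreteness, one entry of this log might be represented as follows (a minimal sketch; the class and field names are illustrative, not from the paper):

```python
from dataclasses import dataclass

@dataclass
class LogEntry:
    x: str        # input question
    y: str        # most likely parse under the deployed system pi_0
    delta: float  # loss in [-1, 0], i.e. negative user feedback
```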

Deterministic Propensity Matching (DPM)

◮ minimize the expected risk for a target policy πw

$$\hat{R}_{\text{DPM}}(\pi_w) = \frac{1}{n}\sum_{t=1}^{n} \delta_t\, \pi_w(y_t \mid x_t)$$

◮ improve π_w using (stochastic) gradient descent
◮ high variance → use a multiplicative control variate
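
A minimal sketch of this objective in PyTorch, assuming a hypothetical `model.log_prob(x, y)` that returns the sequence log-probability of parse y given question x (the paper itself uses Nematus; this is not the authors' code):

```python
import torch

def dpm_loss(model, log):
    # log: iterable of LogEntry triples (x, y, delta) from Dlog
    losses = []
    for e in log:
        pi = torch.exp(model.log_prob(e.x, e.y))  # pi_w(y_t | x_t)
        losses.append(e.delta * pi)
    # hat{R}_DPM = (1/n) sum_t delta_t * pi_w(y_t | x_t); minimized by SGD
    return torch.stack(losses).mean()
```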

SLIDE 12

Multiplicative Control Variate

◮ for random variables X and Y, with Ȳ the expectation of Y:

$$\mathbb{E}[X] = \mathbb{E}\!\left[\frac{X}{Y}\right] \cdot \bar{Y}$$

→ the RHS has lower variance if Y positively correlates with X

DPM with Reweighting (DPM+R)

$$\hat{R}_{\text{DPM+R}}(\pi_w) = \frac{\frac{1}{n}\sum_{t=1}^{n} \delta_t\, \pi_w(y_t \mid x_t)}{\underbrace{\frac{1}{n}\sum_{t=1}^{n} \pi_w(y_t \mid x_t)}_{\text{reweight sum } R}}$$

◮ reduces variance but introduces a bias of order O(1/n) that decreases as n increases → n should be as large as possible
◮ Problem: in stochastic minibatch learning, n is too small
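
A sketch of the reweighted estimator, continuing the assumptions above:

```python
def dpm_r_loss(model, log):
    pis = torch.stack([torch.exp(model.log_prob(e.x, e.y)) for e in log])
    deltas = torch.tensor([e.delta for e in log])
    # divide the risk estimate by the reweight sum R = (1/n) sum_t pi_w(y_t|x_t)
    return (deltas * pis).mean() / pis.mean()
```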

SLIDE 13

One-Step Late (OSL) Reweighting

Perform gradient descent updates & reweighting asynchronously

◮ evaluate the reweight sum R on the entire log of size n using parameters w′
◮ update using minibatches of size m, m ≪ n
◮ periodically update R

→ retains all desirable properties

DPM+OSL

$$\hat{R}_{\text{DPM+OSL}}(\pi_w) = \frac{\frac{1}{m}\sum_{t=1}^{m} \delta_t\, \pi_w(y_t \mid x_t)}{\frac{1}{n}\sum_{t=1}^{n} \pi_{w'}(y_t \mid x_t)}$$
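
Continuing the sketch: the reweight sum is computed over the full log with frozen ("one-step-late") parameters w′ and refreshed only periodically, while gradient updates use minibatches:

```python
def osl_reweight_sum(model, full_log):
    with torch.no_grad():                   # w' is held fixed (no gradient)
        pis = [torch.exp(model.log_prob(e.x, e.y)) for e in full_log]
    return torch.stack(pis).mean()          # (1/n) sum_t pi_w'(y_t | x_t)

def dpm_osl_loss(model, minibatch, R):
    losses = [e.delta * torch.exp(model.log_prob(e.x, e.y)) for e in minibatch]
    return torch.stack(losses).mean() / R   # minibatch numerator / stale R
```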

SLIDE 14

Token-Level Feedback

DPM+T

$$\hat{R}_{\text{DPM+T}}(\pi_w) = \frac{1}{n}\sum_{t=1}^{n}\left[\,\sum_{j=1}^{|y|} \delta_j\, \pi_w(y_j \mid x_t)\right]$$

DPM+T+OSL

ˆ RDPM+T+OSL(πw) =

1 m

m

t=1

|y|

j=1 δjπw(yj|xt)

  • 1

n

n

t=1 πw′(yt|xt)
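
A token-level sketch under the same assumptions, now with a hypothetical `model.token_log_probs(x, y)` returning one log-probability per token of the logged parse, and one loss per token in a hypothetical `e.deltas` field:

```python
def dpm_t_loss(model, log):
    losses = []
    for e in log:
        token_pis = torch.exp(model.token_log_probs(e.x, e.y))  # pi_w(y_j|x_t)
        deltas = torch.tensor(e.deltas)     # one loss delta_j per token
        losses.append((deltas * token_pis).sum())
    return torch.stack(losses).mean()       # divide by the OSL sum for +OSL
```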

SLIDE 15

Experiments

SLIDE 16

Experimental Setup

◮ sequence-to-sequence neural network (Nematus)
◮ deployed system: pre-trained on 2k question-parse pairs
◮ feedback collection:

  • 1. humans judged 1k system outputs
    ◮ average time to judge a parse: 16.4s
    ◮ most parses (>70%) judged in <10s
  • 2. simulated feedback for 23k system outputs
    ◮ token-wise comparison to the gold parse (see the sketch after this list)

◮ bandit-to-supervised conversion (B2S): all instances in the log with reward 1 are used as supervised training data
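
The token-wise comparison is only named here, so the following is a guess at one plausible scheme rather than the authors' exact one: a token incurs loss 0 if it matches the gold parse at the same position, and −1 otherwise.

```python
def simulated_token_feedback(logged_tokens, gold_tokens):
    # 0.0 for a positionally matching token, -1.0 otherwise (illustrative)
    return [0.0 if j < len(gold_tokens) and tok == gold_tokens[j] else -1.0
            for j, tok in enumerate(logged_tokens)]
```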

SLIDE 17

Experimental Results

[Bar chart: F1 gains over the deployed system (F1 57.45). Human feedback (1k): B2S +0.34, DPM+T+OSL +0.99. Large-scale simulated feedback (23k): B2S +5.77, DPM+T+OSL +6.96.]

SLIDE 18

Take Away

Counterfactual Learning

◮ safely improve a system by collecting interaction logs
◮ applicable to any task if the underlying model is differentiable
◮ DPM+OSL: a new objective for stochastic minibatch learning

Improving a Semantic Parser

◮ collect feedback by making parses human-understandable
◮ judging a parse is often easier & faster than formulating a parse or answer

NLmaps v2

◮ large question-parse corpus for QA in the geographical domain

Future Work

◮ integrate the feedback form into the online NL interface to OSM