Improving a Neural Semantic Parser by Counterfactual Learning from Human Bandit Feedback



  1. Improving a Neural Semantic Parser by Counterfactual Learning from Human Bandit Feedback
Carolin Lawrence, Stefan Riezler
Heidelberg University, Institute for Computational Linguistics
July 17, 2018

  2. Situation Overview
◮ Situation: a deployed system (e.g. QA, MT, ...)
◮ Goal: improve the system using human feedback
◮ Plan: create a log D_log of user-system interactions and improve the system offline (safety)
Here: improve a neural semantic parser

  3. Contrast to Previous Approaches
[Diagram: for each question x, the parser predicts parses y_1, ..., y_s; a database executes them to obtain answers a_1, ..., a_s; comparing these against the gold answer yields rewards r_1, ..., r_s, which are used to train the parser. Required data: a question x and its gold answer, for t = 1, ..., n.]

  4. Our Approach
[Diagram: for each question x, the deployed parser predicts a single parse y; the database returns an answer a; the user provides feedback r; the triple (x, y, r) is logged and later used to train the parser. Required data: only the question x, for t = 1, ..., n.]

  5. Our Approach
[Diagram: the two setups side by side; the previous approach needs gold answers, ours needs only questions and user feedback.]
◮ No supervision: given an input, the gold output is unknown
◮ Bandit: feedback is given for only one system output
◮ Bias: the log D_log is biased towards the decisions of the deployed system
Solution: counterfactual / off-policy reinforcement learning
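To make the logging setup concrete, here is a minimal sketch of what one entry of such an interaction log could look like as a triple (x, y, r). The data structure and the example parse are hypothetical illustrations, not the authors' storage format; only the example question is taken from a later slide.

```python
from dataclasses import dataclass

@dataclass
class LogEntry:
    x: str    # natural-language question typed by the user
    y: str    # the single parse produced by the deployed parser pi_0
    r: float  # user feedback for that parse (later slides use the loss delta = -r)

# Hypothetical log entry; the parse shown here is an illustrative guess at the query language.
d_log = [
    LogEntry(
        x="How many hotels are there in Paris?",
        y="query(area(keyval('name','Paris')),nwr(keyval('tourism','hotel')),qtype(count))",
        r=1.0,
    ),
]
```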

  6. Task

  7. A natural language interface to OpenStreetMap
◮ OpenStreetMap (OSM): a geographical database
◮ NLmaps v2: an extension of the previous corpus, now totalling 28,609 question-parse pairs

  8. A natural language interface to OpenStreetMap
◮ Example question: "How many hotels are there in Paris?" Answer: 951
◮ The correctness of answers is difficult to judge → judge parses instead, by making them human-understandable
◮ Feedback collection setup:
  1. automatically convert a parse into a set of statements
  2. humans judge the statements

  9. Example: Feedback Formula
query(around(center(area(keyval('name','Paris')),nwr(keyval('name','Place de la République'))),search(nwr(keyval('amenity','parking'))),maxdist(WALKING_DIST)),qtype(findkey('name')))
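Step 1 of the feedback collection on the previous slide converts a query such as the one above into short statements that non-experts can verify. The function below is a loose sketch of that idea; the extraction rules and statement wording are assumptions for illustration, not the authors' feedback-form implementation.

```python
import re

def parse_to_statements(parse: str) -> list[str]:
    """Render selected slots of an NLmaps query as human-readable statements (illustrative)."""
    statements = []
    # list every keyval('key','value') pair in order of appearance
    for key, value in re.findall(r"keyval\('([^']+)','([^']+)'\)", parse):
        statements.append(f"{key}: {value}")
    if "maxdist(WALKING_DIST)" in parse:
        statements.append("distance: within walking distance")
    if "qtype(findkey('name'))" in parse:
        statements.append("question asks for: the name(s) of the found objects")
    return statements

query = ("query(around(center(area(keyval('name','Paris')),"
         "nwr(keyval('name','Place de la République'))),"
         "search(nwr(keyval('amenity','parking'))),"
         "maxdist(WALKING_DIST)),qtype(findkey('name')))")
print(parse_to_statements(query))
# ['name: Paris', 'name: Place de la République', 'amenity: parking',
#  'distance: within walking distance', 'question asks for: the name(s) of the found objects']
```

A human can then mark each statement as correct or incorrect with respect to the question, which yields the feedback used in the objectives that follow.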

  10. Objectives

  11. Counterfactual Learning
Resources
◮ collected log D_log = {(x_t, y_t, δ_t)}_{t=1}^{n} with
  ◮ x_t: input
  ◮ y_t: most likely output of the deployed system π_0
  ◮ δ_t ∈ [−1, 0]: loss (i.e. negative reward) received from the user
Deterministic Propensity Matching (DPM)
◮ minimize the expected risk for a target policy π_w:
  $\hat{R}_{\text{DPM}}(\pi_w) = \frac{1}{n}\sum_{t=1}^{n} \delta_t\, \pi_w(y_t \mid x_t)$
◮ improve π_w using (stochastic) gradient descent
◮ high variance → use a multiplicative control variate
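As a concrete illustration, the DPM risk can be written directly as a differentiable loss. The sketch below uses PyTorch purely for illustration; the paper's parser is a Nematus (Theano) model, so this is not the authors' code, and the log-probabilities are assumed to be supplied by the parser.

```python
import torch

def dpm_loss(log_probs: torch.Tensor, deltas: torch.Tensor) -> torch.Tensor:
    """Empirical DPM risk: (1/n) * sum_t delta_t * pi_w(y_t | x_t).

    log_probs: log pi_w(y_t | x_t) for each logged pair, computed by the parser
               with gradients enabled (assumed given here).
    deltas:    logged losses delta_t in [-1, 0].
    """
    return (deltas * torch.exp(log_probs)).mean()

# toy usage with made-up numbers
log_probs = torch.tensor([-2.3, -0.7, -1.1], requires_grad=True)
deltas = torch.tensor([-1.0, 0.0, -0.5])
dpm_loss(log_probs, deltas).backward()  # gradients flow back into the parser parameters
```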

  12. Multiplicative Control Variate
◮ for random variables X and Y, with Ȳ the expectation of Y:
  $E[X] = E\!\left[\frac{X}{Y}\right] \cdot \bar{Y}$
  → the RHS has lower variance if Y correlates positively with X
DPM with Reweighting (DPM+R)
  $\hat{R}_{\text{DPM+R}}(\pi_w) = \frac{\frac{1}{n}\sum_{t=1}^{n} \delta_t\, \pi_w(y_t \mid x_t)}{\frac{1}{n}\sum_{t=1}^{n} \pi_w(y_t \mid x_t)}$   (the denominator is the reweight sum R)
◮ reduces variance but introduces a bias of order O(1/n) that decreases as n increases → n should be as large as possible
◮ Problem: in stochastic minibatch learning, n is too small
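Continuing the hedged PyTorch-style sketch from the previous slide, the reweighted objective only changes the normalizer:

```python
import torch

def dpm_r_loss(log_probs: torch.Tensor, deltas: torch.Tensor) -> torch.Tensor:
    """DPM+R: divide by the reweight sum R = (1/n) * sum_t pi_w(y_t | x_t).

    The multiplicative control variate lowers variance at the price of an O(1/n)
    bias, which is why n (the number of logged examples used) should be large.
    """
    probs = torch.exp(log_probs)
    return (deltas * probs).mean() / probs.mean()
```

In minibatch training the sums would only run over a small m ≪ n, which is exactly the problem the OSL reweighting on the next slide addresses.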

  13. One-Step-Late (OSL) Reweighting
Perform gradient descent updates and reweighting asynchronously:
◮ evaluate the reweight sum R on the entire log of size n using parameters w′
◮ update using minibatches of size m, with m ≪ n
◮ periodically update R
→ retains all desirable properties
DPM+OSL
  $\hat{R}_{\text{DPM+OSL}}(\pi_w) = \frac{\frac{1}{m}\sum_{t=1}^{m} \delta_t\, \pi_w(y_t \mid x_t)}{\frac{1}{n}\sum_{t=1}^{n} \pi_{w'}(y_t \mid x_t)}$
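A sketch of the same idea in code, again assuming PyTorch and parser-supplied log-probabilities rather than the authors' implementation:

```python
import torch

def osl_reweight_sum(all_log_probs_stale: torch.Tensor) -> torch.Tensor:
    """One-step-late reweight sum (1/n) * sum_t pi_{w'}(y_t | x_t), evaluated on the
    entire log with the fixed parameters w'; no gradient flows through it."""
    with torch.no_grad():
        return torch.exp(all_log_probs_stale).mean()

def dpm_osl_loss(minibatch_log_probs: torch.Tensor,
                 minibatch_deltas: torch.Tensor,
                 reweight_sum: torch.Tensor) -> torch.Tensor:
    """DPM+OSL: minibatch numerator under the current weights w, divided by the
    periodically refreshed one-step-late reweight sum."""
    return (minibatch_deltas * torch.exp(minibatch_log_probs)).mean() / reweight_sum
```

Between refreshes, every minibatch update reuses the same denominator, so the estimator keeps the variance reduction of DPM+R while the numerator remains a small-minibatch quantity.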

  14. Token-Level Feedback
DPM+T
  $\hat{R}_{\text{DPM+T}}(\pi_w) = \frac{1}{n}\sum_{t=1}^{n}\left[\sum_{j=1}^{|y|} \delta_j\, \pi_w(y_j \mid x_t)\right]$
DPM+T+OSL
  $\hat{R}_{\text{DPM+T+OSL}}(\pi_w) = \frac{\frac{1}{m}\sum_{t=1}^{m}\left[\sum_{j=1}^{|y|} \delta_j\, \pi_w(y_j \mid x_t)\right]}{\frac{1}{n}\sum_{t=1}^{n} \pi_{w'}(y_t \mid x_t)}$
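When feedback is available per output token rather than per sequence, only the numerator changes: each token's probability is weighted by its own loss. The sketch below follows the same hedged PyTorch conventions as before; for DPM+T+OSL the one-step-late denominator from the previous slide would be reused unchanged.

```python
import torch

def dpm_t_loss(token_log_probs: list[torch.Tensor],
               token_deltas: list[torch.Tensor]) -> torch.Tensor:
    """DPM+T: per-token losses delta_j weight the per-token probabilities
    pi_w(y_j | x_t); per-example sums are then averaged over the batch."""
    per_example = [(d * torch.exp(lp)).sum()
                   for lp, d in zip(token_log_probs, token_deltas)]
    return torch.stack(per_example).mean()
```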

  15. Experiments

  16. Experimental Setup
◮ sequence-to-sequence neural network (Nematus)
◮ deployed system: pre-trained on 2k question-parse pairs
◮ feedback collection:
  1. humans judged 1k system outputs
    ◮ average time to judge a parse: 16.4s
    ◮ most parses (> 70%) judged in < 10s
  2. simulated feedback for 23k system outputs
    ◮ token-wise comparison to the gold parse (sketched below)
◮ bandit-to-supervised conversion (B2S): all instances in the log with reward 1 are used as supervised training data
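A rough sketch of how such simulated token-level feedback and the B2S baseline could be derived from gold parses; the positional token matching and the reward-1 filter shown here are illustrative assumptions, not the authors' exact procedure.

```python
def simulate_token_feedback(system_parse: list[str], gold_parse: list[str]) -> list[float]:
    """Token-wise comparison to the gold parse: loss 0 for a token that matches the
    gold token at the same position, -1 otherwise."""
    return [0.0 if j < len(gold_parse) and tok == gold_parse[j] else -1.0
            for j, tok in enumerate(system_parse)]

def b2s_subset(log):
    """Bandit-to-supervised (B2S): keep only logged pairs whose parse received the
    maximal reward and treat them as supervised training data."""
    return [(x, y) for x, y, r in log if r == 1.0]
```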

  17. Experimental Results
[Bar chart, y-axis: F1 Score. Improvements over the baseline parser: with Human Feedback (1k), +0.34 for B2S and +0.99 for DPM+T+OSL; with Large-Scale Simulated Feedback (23k), +5.77 for B2S and +6.96 for DPM+T+OSL.]

  18. Take Away
Counterfactual Learning
◮ safely improve a system by collecting interaction logs
◮ applicable to any task if the underlying model is differentiable
◮ DPM+OSL: a new objective for stochastic minibatch learning
Improving a Semantic Parser
◮ collect feedback by making parses human-understandable
◮ judging a parse is often easier and faster than formulating a parse or an answer
NLmaps v2
◮ a large question-parse corpus for QA in the geographical domain
Future Work
◮ integrate the feedback form into the online NL interface to OSM
