Adversarial Learning for Neural Dialogue Generation

  1. Adversarial Learning for Neural Dialogue Generation. Jiwei Li, Will Monroe, Tianlin Shi, Alan Ritter, and Dan Jurafsky. EMNLP '17. Presented by Yiren Wang (CS546, Spring 2018)

  2. Main Contributions • Goal • End-to-end neural dialogue generation system • To produce sequences that are indistinguishable from human-generated dialogue utterances • Main Contributions • Adversarial training approach for response generation • Cast the task in a reinforcement learning framework.

  3. Outline • Model Architecture • Adversarial Reinforcement Learning: • Adversarial REINFORCE • Reward for Every Generation Step (REGS) • Teacher Forcing • Overall Algorithm (Pseudocode) • Experiment Results • Summary

  4. Adversarial Model • Overall Architecture

  5. Generative Model • Model: standard Seq2Seq model with attention mechanism • Input: dialogue history x • Output: response y • (Sutskever et al., 2014; Jean et al., 2014)
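
A minimal sketch of such a generator, assuming PyTorch and illustrative sizes (hidden width, vocabulary) rather than the paper's actual configuration:

```python
# Hedged sketch of the generator: a GRU encoder-decoder with dot-product
# attention over the encoded dialogue history. Sizes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Seq2SeqGenerator(nn.Module):
    def __init__(self, vocab_size=10000, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)
        self.decoder = nn.GRU(hidden * 2, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, history, response):
        # Encode the dialogue history x.
        enc_out, h = self.encoder(self.embed(history))           # (B, Tx, H)
        logits = []
        for t in range(response.size(1)):
            emb = self.embed(response[:, t:t + 1])                # (B, 1, H)
            # Dot-product attention over encoder states.
            scores = torch.bmm(enc_out, h[-1].unsqueeze(2))       # (B, Tx, 1)
            ctx = (F.softmax(scores, dim=1) * enc_out).sum(1, keepdim=True)
            _, h = self.decoder(torch.cat([emb, ctx], dim=-1), h)
            logits.append(self.out(h[-1]))
        return torch.stack(logits, dim=1)                         # (B, Ty, V)
```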

  6. Discriminative Model • Model: binary classifier, hierarchical encoder + 2-class softmax • Input: dialogue utterances {x, y} • Output: label indicating whether generated by human or by machine • Q+({x, y}) (by human) • Q-({x, y}) (by machine)
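
A hedged sketch of this discriminator: an utterance-level RNN encodes each turn, a dialogue-level RNN encodes the turn sequence, and a 2-way softmax produces the human/machine scores. Sizes and the exact input packaging are illustrative assumptions:

```python
# Sketch of the hierarchical discriminator (illustrative sizes).
import torch
import torch.nn as nn

class HierarchicalDiscriminator(nn.Module):
    def __init__(self, vocab_size=10000, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.utt_rnn = nn.GRU(hidden, hidden, batch_first=True)  # encodes each utterance
        self.dlg_rnn = nn.GRU(hidden, hidden, batch_first=True)  # encodes the turn sequence
        self.cls = nn.Linear(hidden, 2)                          # [machine, human] logits

    def forward(self, utterances):
        # utterances: (batch, n_turns, max_len) token ids for the history x
        # followed by the candidate response y as the last turn.
        b, n, t = utterances.shape
        _, h = self.utt_rnn(self.embed(utterances.view(b * n, t)))
        turn_vecs = h[-1].view(b, n, -1)                         # one vector per turn
        _, h = self.dlg_rnn(turn_vecs)
        return self.cls(h[-1])                                   # 2-class logits

# Q+({x, y}) is then torch.softmax(logits, dim=-1)[:, 1], the "human" probability.
```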

  7. Adversarial REINFORCE • Policy Gradient Training • Discriminator score is used as the reward for the generator • Generator is trained to maximize the expected reward

  8. Policy Gradient Training • Expected reward is approximated by the likelihood ratio

  9. Policy Gradient Training • Expected reward is approximated by the likelihood ratio • A baseline value reduces the variance of the estimate while keeping it unbiased • The policy is updated using the classifier (discriminator) score
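
The update itself can be sketched as follows; `generator.sample` and `discriminator.human_prob` are assumed helper names (not from the paper's code), and the constant baseline is a simplification of the learned baseline:

```python
# Hedged sketch of one REINFORCE update: the discriminator's "human"
# probability Q+({x, y}) is the reward, a baseline b reduces variance,
# and the gradient is (reward - b) * grad log p(y | x).
import torch

def reinforce_step(generator, discriminator, optimizer, history, baseline=0.5):
    # Sample a response y from the current policy and keep its log-probability.
    response, log_prob = generator.sample(history)            # assumed helper
    with torch.no_grad():
        reward = discriminator.human_prob(history, response)  # Q+({x, y}), assumed helper
    loss = (-(reward - baseline) * log_prob).mean()           # negative expected reward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward.mean().item()
```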

  10. Problem with Vanilla REINFORCE • Expectation of the reward is approximated by only one sample • Reward associated with the sample is used for all actions • Input: What's your name Human: I am John Machine: I don't know (negative reward)

  11. Problem with Vanilla REINFORCE • Expectation of the reward is approximated by only one sample • Reward associated with the sample is used for all actions, even tokens that deserve different credit • Input: What's your name Human: I am John Machine: I don't know (whole sequence: negative reward; the token "I", which also begins the human response, should get a neutral reward)

  12. Reward for Every Generation Step (REGS) • Strategies • Monte Carlo (MC) search • Training the discriminator for rewarding partially decoded sequences

  13. Strategy I: Monte Carlo (MC) Search • Repeat the sampling N times • Average score is the reward

  14. Strategy I: Monte Carlo (MC) Search • Repeat the sampling N times • The average score over the N samples is the reward

  15. Strategy I: Monte Carlo (MC) Search • Repeat the sampling N times • The average score over the N samples is the reward • √ More accurate ✘ Time consuming
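
A sketch of the MC reward for a partially decoded prefix y_{1:t}: roll out the remainder of the sequence N times with the current policy and average the discriminator scores. `generator.rollout` and `discriminator.human_prob` are assumed helper names, not from the paper's code:

```python
# Hedged sketch of Monte Carlo search for per-step rewards.
import torch

def mc_reward(generator, discriminator, history, prefix, n_rollouts=5):
    scores = []
    with torch.no_grad():
        for _ in range(n_rollouts):
            # Complete the prefix y_{1:t} into a full response with the current policy.
            full = generator.rollout(history, prefix)
            scores.append(discriminator.human_prob(history, full))
    # The average score over the N rollouts is the reward for this prefix.
    return torch.stack(scores).mean(0)
```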

  16. Strategy II: Reward Partially Decoded Sequences • Break generated sequences into partial subsequences • Sample one positive and one negative subsequence for discriminator training • The discriminator assigns a score to each partially generated sequence • √ Time efficient ✘ Less accurate
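
A sketch of Strategy II under the assumption that one positive and one negative prefix are drawn per example at random cut points (the selection details are a simplification); the discriminator can then be trained on, and later score, partial sequences directly:

```python
# Hedged sketch: sample balanced partial subsequences for discriminator training.
import random

def sample_partial_pair(history, human_resp, machine_resp):
    t_pos = random.randint(1, len(human_resp))
    t_neg = random.randint(1, len(machine_resp))
    positive = (history, human_resp[:t_pos])     # labeled human (Q+)
    negative = (history, machine_resp[:t_neg])   # labeled machine (Q-)
    return positive, negative
```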

  17. Unstable Training • ✘ The generator is only indirectly exposed to the gold-standard target • When the generator deteriorates: • The discriminator does an excellent job of distinguishing negative from positive examples • The generator only knows that its generated sequences are bad • But it loses track of what is good and how to push itself toward it • The loss of reward signal leads to a breakdown in training

  18. Teacher Forcing • Teacher forcing: "having a teacher intervene and force it to generate true responses" • Discriminator: assigns a reward of 1 to the human responses • Generator: uses this reward to update itself on human-generated examples • √ More direct access to the gold-standard targets
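
With the reward fixed to 1, the teacher-forcing step reduces to a maximum-likelihood update on the human response. The sketch below assumes the generator's forward pass returns per-step vocabulary logits and that responses start with a start-of-sequence token:

```python
# Hedged sketch of one teacher-forcing update on a human-generated example.
import torch.nn.functional as F

def teacher_forcing_step(generator, optimizer, history, human_response):
    logits = generator(history, human_response)                  # (B, T, V)
    # Predict token t+1 from tokens up to t (standard one-position shift).
    loss = F.cross_entropy(logits[:, :-1].transpose(1, 2), human_response[:, 1:])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```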

  19. Overall Algorithm
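
A high-level sketch of the training loop this slide outlines: alternate discriminator updates (human vs. machine responses) with generator updates (a REINFORCE step followed by a teacher-forcing step). `data.sample`, `generator.sample`, and `discriminator.update` are assumed helpers, and the step counts are illustrative:

```python
# Hedged sketch of the overall adversarial training loop.
def adversarial_training(generator, discriminator, g_opt, d_opt,
                         data, n_iters=1000, d_steps=5, g_steps=1):
    for _ in range(n_iters):
        # Update the discriminator on human vs. machine responses.
        for _ in range(d_steps):
            history, human_resp = data.sample()
            machine_resp, _ = generator.sample(history)
            discriminator.update(history, human_resp, machine_resp, d_opt)
        # Update the generator with REINFORCE, then stabilize with teacher forcing.
        for _ in range(g_steps):
            history, human_resp = data.sample()
            reinforce_step(generator, discriminator, g_opt, history)
            teacher_forcing_step(generator, g_opt, history, human_resp)
```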

  20. Results

  21. Summary • Adversarial training for response generation • Cast the model in the framework of reinforcement learning • Discriminator: Turing test • Generator: trained to maximize the reward from the discriminator

  22. Thanks!
