SLIDE 1

Adversarial Learning for Neural Dialogue Generation

Jiwei Li, Will Monroe, Tianlin Shi, Alan Ritter, and Dan Jurafsky. EMNLP’17. Presented by Yiren Wang (CS546, Spring 2018)

SLIDE 2

Main Contributions

  • Goal
  • End-to-end neural dialogue generation system
  • To produce sequences that are indistinguishable from human-generated dialogue utterances

  • Main Contributions
  • Adversarial training approach for response generation
  • Cast the task in a reinforcement learning framework.

SLIDE 3

Outline

  • Model Architecture
  • Adversarial Reinforcement Learning:
  • Adversarial REINFORCE
  • Reward for Every Generation Step (REGS)
  • Teacher Forcing
  • Overall Algorithm (Pseudocode)
  • Experiment Results
  • Summary

SLIDE 4

Adversarial Model

  • Overall Architecture

SLIDE 5

Generative Model

  • Model: Standard Seq2Seq model with attention mechanism
  • Input: dialogue history x
  • Output: response y

(Sutskever et al., 2014; Jean et al., 2014)
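As a concrete illustration, here is a minimal PyTorch sketch of such an attention Seq2Seq generator; the layer sizes, single-layer GRUs, and dot-product attention are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Seq2SeqGenerator(nn.Module):
    """Encodes a dialogue history x and decodes a response y step by step."""
    def __init__(self, vocab_size=10000, hidden_size=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.encoder = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.decoder = nn.GRU(2 * hidden_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, x, y):
        # x: (batch, src_len), y: (batch, tgt_len) token ids
        enc_out, h = self.encoder(self.embed(x))
        logits = []
        for t in range(y.size(1)):
            emb = self.embed(y[:, t:t + 1])                      # (batch, 1, H)
            # Dot-product attention over the encoder states
            attn = F.softmax(torch.bmm(emb, enc_out.transpose(1, 2)), dim=-1)
            ctx = torch.bmm(attn, enc_out)                       # (batch, 1, H)
            _, h = self.decoder(torch.cat([emb, ctx], dim=-1), h)
            logits.append(self.out(h[-1]))                       # next-token logits
        return torch.stack(logits, dim=1)                        # (batch, tgt_len, vocab)
```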

SLIDE 6

Discriminative Model

  • Model: binary classifier
  • Hierarchical encoder + 2-class softmax
  • Input: dialogue utterances {x, y}
  • Output: label indicating whether generated by human or by machine

  • Q+({x, y}) (by human)
  • Q−({x, y}) (by machine)
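A minimal sketch of this discriminator, again with illustrative sizes; the word-level and context-level GRUs stand in for the paper's hierarchical encoder:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalDiscriminator(nn.Module):
    """Scores a dialogue {x, y} as human- or machine-generated."""
    def __init__(self, vocab_size=10000, hidden_size=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.utt_rnn = nn.GRU(hidden_size, hidden_size, batch_first=True)  # word level
        self.ctx_rnn = nn.GRU(hidden_size, hidden_size, batch_first=True)  # utterance level
        self.cls = nn.Linear(hidden_size, 2)  # 2-class softmax: {human, machine}

    def forward(self, utterances):
        # utterances: (batch, n_utts, utt_len) token ids for the dialogue {x, y}
        b, n, l = utterances.shape
        _, h = self.utt_rnn(self.embed(utterances.view(b * n, l)))
        utt_vecs = h[-1].view(b, n, -1)        # one vector per utterance
        _, h = self.ctx_rnn(utt_vecs)
        # Column 0 is log Q+({x, y}) (human), column 1 is log Q-({x, y}) (machine)
        return F.log_softmax(self.cls(h[-1]), dim=-1)
```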

SLIDE 7

Adversarial REINFORCE

  • Policy Gradient Training
  • Discriminator score is used as reward for generator
  • Generator is trained to maximize the expected reward

SLIDE 8

Policy Gradient Training

J(θ) = E_{y ∼ p(y | x)}[ Q+({x, y}) ]

(the gradient of this expectation is approximated by the likelihood ratio)

SLIDE 9

Policy Gradient Training

∇J(θ) ≈ [ Q+({x, y}) − b({x, y}) ] ∇ log p(y | x)

(approximated by the likelihood ratio: Q+({x, y}) is the discriminator’s classification score; b({x, y}) is a baseline value that reduces the variance of the estimate while keeping it unbiased; the resulting gradient drives the policy updates in the parameter space)
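In code, one vanilla REINFORCE update could look like the sketch below; gen.sample (returning a sampled response with its per-step log-probabilities) and pack (building the discriminator's {x, y} input tensor) are assumed helpers, and the constant baseline is an illustrative simplification:

```python
import torch

def reinforce_step(gen, disc, optimizer, x, baseline=0.5):
    # Sample a response; log_probs holds log p(y_t | x, y_<t) per step
    y, log_probs = gen.sample(x)                # assumed helper
    with torch.no_grad():
        # Q+({x, y}): the discriminator's human-class probability
        reward = disc(pack(x, y)).exp()[:, 0]   # pack(): assumed helper
    # Vanilla REINFORCE: the same (reward - baseline) scales every step
    loss = -((reward - baseline).unsqueeze(1) * log_probs).sum(dim=1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward.mean().item()
```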

SLIDE 10

Problem with vanilla REINFORCE

  • Expectation of reward is approximated by only one sample
  • Reward associated with the sample is used for all actions

Input:   What’s your name?
Human:   I am John
Machine: I don’t know   (negative reward)

SLIDE 11

Problem with vanilla REINFORCE

  • Expectation of reward is approximated by only one sample
  • Reward associated with the sample is used for all actions

Input:   What’s your name?
Human:   I am John
Machine: I don’t know   (one negative reward for the whole sequence)

Machine: I (neutral reward) don’t know (negative reward)

SLIDE 12

Reward for Every Generation Step (REGS)

  • Strategies
  • Monte Carlo (MC) Search
  • Training Discriminator For Rewarding Partially Decoded Sequences

SLIDE 13

Strategy I: Monte Carlo (MC) Search

  • Repeats sampling N times
  • Average score is the reward

SLIDE 14

Strategy I: Monte Carlo (MC) Search

  • Repeats sampling N times
  • Average score is the reward

(figure: average reward over the N sampled rollouts)

SLIDE 15

Strategy I: Monte Carlo (MC) Search

  • Repeats sampling N times
  • Average score is the reward


√ More accurate ✘ Time consuming

(figure: average reward over the N sampled rollouts)
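A sketch of the MC search reward for one partially decoded prefix; gen.sample_continuation (rolling the prefix out to a full response) and pack are assumed helpers:

```python
import torch

def mc_reward(gen, disc, x, prefix, n_rollouts=5):
    # Complete the partial response `prefix` N times and average the scores
    scores = []
    for _ in range(n_rollouts):
        y_full = gen.sample_continuation(x, prefix)           # assumed helper
        with torch.no_grad():
            scores.append(disc(pack(x, y_full)).exp()[:, 0])  # Q+({x, y})
    return torch.stack(scores).mean(dim=0)                    # reward for this prefix
```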

SLIDE 16

Strategy II: Reward Partially Decoded Sequences

  • Break generated sequences into partial subsequences
  • Sample one positive and one negative subsequence


√ Time efficient ✘ Less accurate

(figure: the discriminator assigns a score to each partially-generated sequence)
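A sketch of the corresponding discriminator update: one positive (human) and one negative (machine) partial subsequence are sampled per dialogue, as the slide describes; pack is the assumed input-building helper from above:

```python
import random
import torch
import torch.nn.functional as F

def regs_disc_step(disc, optimizer, x, y_human, y_machine):
    # Sample one positive and one negative prefix length
    t_pos = random.randint(1, y_human.size(1))
    t_neg = random.randint(1, y_machine.size(1))
    logp_pos = disc(pack(x, y_human[:, :t_pos]))      # pack(): assumed helper
    logp_neg = disc(pack(x, y_machine[:, :t_neg]))
    # Class 0 = human, class 1 = machine (matches the log-softmax above)
    human = torch.zeros(logp_pos.size(0), dtype=torch.long)
    machine = torch.ones(logp_neg.size(0), dtype=torch.long)
    loss = F.nll_loss(logp_pos, human) + F.nll_loss(logp_neg, machine)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```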

SLIDE 17

Unstable Training

✘ Generator only indirectly exposed to the gold-standard target

  • When the generator deteriorates:
  • The discriminator does an excellent job distinguishing – from +
  • The generator only knows its generated sequences are bad
  • But it loses track of what is good and how to push itself towards good outputs
  • The loss of reward signals leads to a breakdown in training
SLIDE 18

Teacher Forcing

  • Discriminator:
  • assigns a reward of 1 to the human responses
  • Generator:
  • uses this reward to update itself on human-generated examples

“having a teacher intervene and force it to generate true responses”

√ more direct access to the gold-standard targets
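Because the human response gets a fixed reward of 1, this update reduces to an ordinary maximum-likelihood step on the gold-standard target, sketched below with the generator interface assumed earlier:

```python
import torch.nn.functional as F

def teacher_forcing_step(gen, optimizer, x, y_human):
    # Feed the gold response in and predict each next token (reward fixed at 1)
    logits = gen(x, y_human[:, :-1])
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           y_human[:, 1:].reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```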

SLIDE 19

Overall Algorithm
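A hedged sketch of how the pieces above fit together; the exact interleaving schedule of generator and discriminator updates is an assumption, not the paper's pseudocode:

```python
def adversarial_training(gen, disc, g_opt, d_opt, batches, n_iters=10000):
    """One possible schedule: alternate G-steps and D-steps each iteration."""
    for i, (x, y_human) in zip(range(n_iters), batches):
        # G-step: policy gradient with the discriminator score as reward
        reinforce_step(gen, disc, g_opt, x)
        # Teacher forcing: direct exposure to the gold-standard target
        teacher_forcing_step(gen, g_opt, x, y_human)
        # D-step: human responses are positive, sampled responses negative
        y_machine, _ = gen.sample(x)                  # assumed helper
        regs_disc_step(disc, d_opt, x, y_human, y_machine)
```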

SLIDE 20

Results

SLIDE 21

Summary

  • Adversarial training for response generation
  • Cast the model in the framework of reinforcement learning
  • Discriminator: Turing test
  • Generator: trained to maximize the reward from discriminator

SLIDE 22

Thanks!