SLIDE 1

SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient

Lantao Yu†, Weinan Zhang†, Jun Wang‡, Yong Yu†

†Shanghai Jiao Tong University, ‡University College London

SLIDE 2

Attribution

  • Multiple slides taken from
  • Hung-yi Lee
  • Paarth Neekhara
  • Ruirui Li
  • Original authors at AAAI 2017
  • Presented by: Pratyush Maini
SLIDE 3

Outline

  • 1. Introduction to GANs
  • 2. Brief theoretical overview of GANs
  • 3. Overview of GANs in Sequence Generation
  • 4. SeqGAN
  • 5. Other recent work: Unsupervised Conditional Sequence Generation
SLIDE 4

All Kinds of GAN …

https://github.com/hindupuravinash/the-gan-zoo (not updated since 2018.09)

More than 500 species in the zoo

SLIDE 5

All Kinds of GAN …

https://github.com/hindupuravinash/the-gan-zoo

GAN ACGAN BGAN DCGAN EBGAN fGAN GoGAN CGAN

……

Mihaela Rosca, Balaji Lakshminarayanan, David Warde-Farley, Shakir Mohamed, “Variational Approaches for Auto-Encoding Generative Adversarial Networks”, arXiv, 2017

SLIDE 6

Three Categories of GAN

  • 1. Generation: a random vector (e.g., [−0.3, 0.1, …, 0.9]) goes into the generator, and an image comes out.
  • 2. Conditional Generation: the generator maps text to an image (e.g., "Girl with red hair"), trained on paired data such as "blue eyes, red hair, short hair" paired with a matching face.
  • 3. Unsupervised Conditional Generation: map between domain x and domain y with unpaired data (e.g., photo → Vincent van Gogh's style).

SLIDE 7

Anime Face Generation

[Figure: random draws → Generator → example anime faces]

SLIDE 8

Basic Idea of GAN

Generator: a neural network (NN), or a function, that maps a high-dimensional input vector to an image. Each dimension of the input vector represents some characteristic (e.g., longer hair, blue hair, open mouth), so changing one dimension changes that attribute in the output.

Powered by: http://mattya.github.io/chainer-DCGAN/

SLIDE 9

Basic Idea of GAN

Discriminator: a neural network (NN), or a function, that takes an image and outputs a scalar. A larger value means the input looks real, a smaller value means it looks fake (e.g., real faces score near 1.0, generated ones near 0.1).

SLIDE 10

Outline

  • 1. Introduction to GANs
  • 2. Brief theoretical overview of GANs
  • 3. Overview of GANs in Sequence Generation
  • 4. SeqGAN
  • 5. Other recent work: Unsupervised Conditional Sequence Generation
SLIDE 11

Algorithm

  • Initialize generator and discriminator
  • In each training iteration:

Step 1: Fix generator G, and update discriminator D. The discriminator learns to assign high scores to real objects (randomly sampled from the database) and low scores to generated objects.

SLIDE 12

Algorithm

  • Initialize generator and discriminator
  • In each training iteration:

Step 2: Fix discriminator D, and update generator G by gradient ascent. Treating the generator and the (fixed) discriminator as one large network, the generator updates its parameters so that the discriminator's output score becomes large, i.e., it learns to "fool" the discriminator.

SLIDE 13

Algorithm

  • Initialize generator and discriminator
  • In each training iteration:
  • Learning D: sample some real objects from the database and generate some fake objects with G; update D (with G fixed) to score real objects high and generated ones low.
  • Learning G: feed random vectors to G and update G (with D fixed) so that D assigns its outputs high scores.
SLIDE 14

Anime Face Generation

100 updates

Source of training data: https://zhuanlan.zhihu.com/p/24767059

SLIDE 15

Anime Face Generation

1000 updates

SLIDE 16

Anime Face Generation

2000 updates

SLIDE 17

Anime Face Generation

5000 updates

SLIDE 18

Anime Face Generation

10,000 updates

SLIDE 19

Anime Face Generation

20,000 updates

SLIDE 20

Anime Face Generation

50,000 updates

SLIDE 21

In 2019, with StyleGAN ……

Source of video: https://www.gwern.net/Faces

SLIDE 22

The first GAN

[Ian J. Goodfellow, et al., NIPS, 2014]

SLIDE 23

Outline

  • 1. Introduction to GANs
  • 2. Brief theoretical overview of GANs
  • 3. Overview of GANs in Sequence Generation

1. Reinforcement Learning 2. GAN + RL

  • 4. SeqGAN
  • 5. Other recent work: Unsupervised Conditional Sequence Generation
SLIDE 24

NLP tasks usually involve sequence generation. How can we use GAN to improve sequence generation?

SLIDE 25

Reinforcement Learning

  • Chatbot as a conditional generator: an encoder-decoder maps an input sentence c to a response sentence x. [Li, et al., EMNLP, 2016]
  • A human reads the exchange and provides a reward R(c, x).
  • Learn to maximize the expected reward, e.g., by Policy Gradient.
  • Example: for c = "How are you?", the response "Not bad" receives reward +1, while "I'm John" receives reward −1.

SLIDE 26

Policy Gradient

Sample input-response pairs $(c^1, x^1), (c^2, x^2), \dots, (c^N, x^N)$ from the current policy $\theta^t$ and obtain rewards $R(c^1, x^1), R(c^2, x^2), \dots, R(c^N, x^N)$.

$$\nabla \bar{R}_{\theta} \approx \frac{1}{N} \sum_{i=1}^{N} R(c^i, x^i)\, \nabla \log P_{\theta}(x^i \mid c^i)$$

$$\theta^{t+1} \leftarrow \theta^{t} + \eta\, \nabla \bar{R}_{\theta^t}$$

If $R(c^i, x^i)$ is positive, update $\theta$ to increase $P_{\theta}(x^i \mid c^i)$; if $R(c^i, x^i)$ is negative, update $\theta$ to decrease $P_{\theta}(x^i \mid c^i)$.
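In code, the reward-weighted update above is usually implemented as a surrogate loss whose gradient matches $\nabla \bar{R}_{\theta}$. A hedged sketch, where `policy.sample` and `reward_fn` are illustrative stand-ins, not any library's API:

```python
# REINFORCE-style policy-gradient step: average reward-weighted
# negative log-probabilities, then take one optimizer step.
import torch

def policy_gradient_step(policy, optimizer, contexts, reward_fn):
    losses = []
    for c in contexts:
        x, log_prob = policy.sample(c)   # x^i ~ P_theta(x|c^i), with log P_theta(x^i|c^i)
        R = reward_fn(c, x)              # scalar reward R(c^i, x^i)
        losses.append(-R * log_prob)     # minimizing this ascends R * grad log P
    loss = torch.stack(losses).mean()    # -(1/N) sum_i R(c^i,x^i) log P(x^i|c^i)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```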

SLIDE 27

Policy Gradient

Maximum Likelihood vs. Reinforcement Learning (Policy Gradient):

  • Training data: maximum likelihood uses a fixed dataset $\{(c^1, \hat{x}^1), \dots, (c^N, \hat{x}^N)\}$; policy gradient uses $\{(c^1, x^1), \dots, (c^N, x^N)\}$ obtained from interaction.
  • Objective function: $\frac{1}{N} \sum_{i=1}^{N} \log P_{\theta}(\hat{x}^i \mid c^i)$ versus $\frac{1}{N} \sum_{i=1}^{N} R(c^i, x^i) \log P_{\theta}(x^i \mid c^i)$.
  • Gradient: $\frac{1}{N} \sum_{i=1}^{N} \nabla \log P_{\theta}(\hat{x}^i \mid c^i)$ versus $\frac{1}{N} \sum_{i=1}^{N} R(c^i, x^i)\, \nabla \log P_{\theta}(x^i \mid c^i)$, i.e., each example's gradient is weighted by $R(c^i, x^i)$.
  • Maximum likelihood is the special case where every training pair is treated as if $R(c^i, \hat{x}^i) = 1$.

SLIDE 28

Outline

  • 1. Introduction to GANs
  • 2. Brief theoretical overview of GANs
  • 3. Overview of GANs in Sequence Generation

1. Reinforcement Learning 2. GAN + RL

  • 4. SeqGAN
  • 5. Other recent work: Unsupervised Conditional Sequence Generation
SLIDE 29

Why do we need GAN?

  • Chatbot as example: a seq2seq model (encoder-decoder) maps an input sentence c to an output sentence x.
  • Training criterion: maximize the likelihood of the training data (e.g., A: "How are you?" B: "I'm good." ……).
  • Problem: given "How are you?", maximum likelihood can favor an output like "I'm John." (it overlaps with the reference "I'm good."), whereas a human judges "Not bad" to be the better response.

SLIDE 30

Conditional GAN

  • Replace human evaluation with machine evaluation: a discriminator reads the input sentence c and the chatbot's response sentence x (e.g., "I am busy.") and provides the reward R(c, x). [Li, et al., EMNLP, 2017]
  • However, there is an issue when you train your generator.

SLIDE 31

Three Categories of Solutions

  • Gumbel-softmax: [Matt J. Kusner, et al., arXiv, 2016][Weili Nie, et al., ICLR, 2019]
  • Continuous Input for Discriminator: [Sai Rajeswar, et al., arXiv, 2017][Ofir Press, et al., ICML workshop, 2017][Zhen Xu, et al., EMNLP, 2017][Alex Lamb, et al., NIPS, 2016][Yizhe Zhang, et al., ICML, 2017]
  • Reinforcement Learning: [Yu, et al., AAAI, 2017][Li, et al., EMNLP, 2017][Tong Che, et al., arXiv, 2017][Jiaxian Guo, et al., AAAI, 2018][Kevin Lin, et al., NIPS, 2017][William Fedus, et al., ICLR, 2018]

SLIDE 32

Continuous Input for Discriminator

  • Starting from <BOS>, the generator outputs a distribution over tokens (e.g., A vs. B) at every step. Use these distributions, rather than sampled tokens, as the input of the discriminator, avoiding the sampling process.
  • The discriminator outputs a scalar, and we can do backpropagation now to update the generator's parameters.

SLIDE 33

What is the problem?

  • A real sentence is a sequence of one-hot vectors.
  • A generated output is a sequence of distributions (e.g., 0.9, 0.7, 0.8 on the top tokens) and can never be one-hot.
  • The discriminator can immediately find the difference.
  • A discriminator with a constraint (e.g., WGAN) can be helpful.
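A small sketch of this mismatch, and of the Gumbel-softmax workaround from the previous slide (using PyTorch's `F.gumbel_softmax`; vocabulary size and temperature are arbitrary assumptions):

```python
# Real inputs are exactly one-hot; generator outputs are dense softmax
# distributions, which a discriminator can trivially tell apart.
import torch
import torch.nn.functional as F

vocab = 6
real = F.one_hot(torch.randint(vocab, (4,)), vocab).float()  # exactly one-hot
logits = torch.randn(4, vocab, requires_grad=True)
fake_soft = F.softmax(logits, dim=-1)       # dense: can never be one-hot

# Straight-through Gumbel-softmax: the forward pass is (near) one-hot,
# while the backward pass uses the differentiable soft sample.
fake_hard = F.gumbel_softmax(logits, tau=0.5, hard=True)
print(real[0], fake_soft[0], fake_hard[0])
```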

SLIDE 34

Three Categories of Solutions

  • Gumbel-softmax: [Matt J. Kusner, et al., arXiv, 2016][Weili Nie, et al., ICLR, 2019]
  • Continuous Input for Discriminator: [Sai Rajeswar, et al., arXiv, 2017][Ofir Press, et al., ICML workshop, 2017][Zhen Xu, et al., EMNLP, 2017][Alex Lamb, et al., NIPS, 2016][Yizhe Zhang, et al., ICML, 2017]
  • Reinforcement Learning: [Yu, et al., AAAI, 2017][Li, et al., EMNLP, 2017][Tong Che, et al., arXiv, 2017][Jiaxian Guo, et al., AAAI, 2018][Kevin Lin, et al., NIPS, 2017][William Fedus, et al., ICLR, 2018]

SLIDE 35

Tips for Sequence Generation GAN

RL is difficult to train. GAN is difficult to train. Sequence generation GAN (RL + GAN) combines both, so it is especially difficult to train.

SLIDE 36

Tips for Sequence Generation GAN

  • Typical setup: the discriminator scores only the complete response, e.g., "You is good" → 0.1, so we don't know which part is wrong.
  • Reward for Every Generation Step: score the partial sequences instead, e.g., "You" → 0.9, "You is" → 0.1, "You is good" → 0.1.
SLIDE 37

Tips for Sequence Generation GAN

  • Reward for Every Generation Step (e.g., "You" → 0.9, "You is" → 0.1, "You is good" → 0.1):
  • Method 1. Monte Carlo (MC) Search [Yu, et al., AAAI, 2017]
  • Method 2. Discriminator For Partially Decoded Sequences [Li, et al., EMNLP, 2017]
  • Method 3. Step-wise evaluation [Tuan, Lee, TASLP, 2019][Xu, et al., EMNLP, 2018][William Fedus, et al., ICLR, 2018]

SLIDE 38

Outline

  • 1. Introduction to GANs
  • 2. Brief theoretical overview of GANs
  • 3. Overview of GANs in Sequence Generation
  • 4. SeqGAN
  • 5. Other recent work: Unsupervised Conditional Sequence Generation
SLIDE 39

Task

  • 1. Given a dataset of real-world structured sequences, train a generative model Gθ to produce sequences that mimic the real ones.
  • 2. We want Gθ to fit the unknown true data distribution ptrue(yt | Y1:t−1), which is only revealed by the given dataset D = {Y1:T}.

SLIDE 40

  • Traditional objective: maximum likelihood estimation (MLE)
  • Check whether the true data has a high mass density under the learned model:

$$\max_{\theta} \frac{1}{|\mathcal{D}|} \sum_{Y_{1:T} \in \mathcal{D}} \sum_{t} \log G_{\theta}(y_t \mid Y_{1:t-1})$$

  • This suffers from so-called exposure bias in the inference stage:
  • Training: update the model by $\max_{\theta} \mathbb{E}_{Y \sim p_{true}} \sum_{t} \log G_{\theta}(y_t \mid Y_{1:t-1})$; the next token is always conditioned on the real prefix.
  • Inference: when generating the next token $\hat{y}_t$, sample from $G_{\theta}(\hat{y}_t \mid \hat{Y}_{1:t-1})$, i.e., conditioned on the model's own guessed prefix.
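The contrast between the two prefixes can be made concrete with a sketch (the autoregressive `model` returning next-token logits is an assumed stand-in, not the paper's implementation):

```python
# Training conditions on the REAL prefix (teacher forcing); inference
# conditions on the model's own GUESSED prefix, so errors can compound.
import torch

def mle_training_loss(model, Y):            # Y: 1-D LongTensor of real tokens
    loss = torch.tensor(0.0)
    for t in range(1, len(Y)):
        logits = model(Y[:t])                # condition on real prefix Y_{1:t-1}
        loss = loss - torch.log_softmax(logits, -1)[Y[t]]
    return loss                              # -sum_t log G_theta(y_t | Y_{1:t-1})

def inference(model, bos, T):
    Y_hat = [bos]
    for _ in range(T):
        logits = model(torch.tensor(Y_hat))  # condition on guessed prefix
        y = torch.distributions.Categorical(logits=logits).sample()
        Y_hat.append(int(y))
    return Y_hat
```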

SLIDE 41

A promising method: Generative Adversarial Nets (GANs)

  • Discriminator tries to correctly distinguish the true data and the fake model-generated data
  • Generator tries to generate high-quality data to fool the discriminator
  • Ideally, when D cannot distinguish the true and generated data, G nicely fits the true underlying data distribution

[Goodfellow I, Pouget-Abadie J, Mirza M, et al. 2014. Generative adversarial nets. In NIPS 2014.]

SLIDE 42

Generator Network in GANs

  • Must be differentiable
  • Popular implementation: multi-layer perceptron
  • Linked with the discriminator and gets guidance from it

$$x = G(z; \theta^{(G)})$$

$$\min_{G} \max_{D} \; \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$

SLIDE 43

Problem for Discrete Data

  • On continuous data, there is a direct gradient

$$\nabla_{\theta^{(G)}} \frac{1}{m} \sum_{i=1}^{m} \log\big(1 - D(G(z^{(i)}))\big)$$

  • that can guide the generator to (slightly) modify the output.
  • There is no direct gradient on discrete data.
  • Text generation example: "I caught a penguin in the park"
  • From Ian Goodfellow: "If you output the word 'penguin', you can't change that to 'penguin + .001' on the next step, because there is no such word as 'penguin + .001'. You have to go all the way from 'penguin' to 'ostrich'."

[https://www.reddit.com/r/MachineLearning/comments/40ldq6/generative_adversarial_networks_for_text/]

SLIDE 44

SeqGAN

  • The generator is a reinforcement learning policy Gθ(yt | Y1:t−1) for generating a sequence:
  • decide the next word to generate (the action) given the previous ones as the state.
  • The discriminator provides the reward (i.e., the probability of being true data) Dφ(Y1:T) for the whole sequence.

SLIDE 45

Sequence Generator

  • Objective: maximize the expected reward

$$J(\theta) = \mathbb{E}[R_T \mid s_0, \theta] = \sum_{y_1 \in \mathcal{Y}} G_{\theta}(y_1 \mid s_0) \cdot Q^{G_{\theta}}_{D_{\phi}}(s_0, y_1)$$

  • The state-action value function $Q^{G_{\theta}}_{D_{\phi}}(s, a)$ is the expected accumulative reward when we
  • start from state s,
  • take action a,
  • and follow policy $G_{\theta}$ until the end.
  • Reward is only given for the completed sequence (no immediate reward):

$$Q^{G_{\theta}}_{D_{\phi}}(s = Y_{1:T-1}, a = y_T) = D_{\phi}(Y_{1:T})$$

SLIDE 46

State-Action Value Setting

  • Reward is only given for the completed sequence; there is no immediate reward.
  • The last-step state-action value is therefore $Q^{G_{\theta}}_{D_{\phi}}(s = Y_{1:T-1}, a = y_T) = D_{\phi}(Y_{1:T})$.
  • For intermediate state-action values, use Monte Carlo search, following a roll-out policy $G_{\beta}$, to estimate the expected final reward:

$$Q^{G_{\theta}}_{D_{\phi}}(s = Y_{1:t-1}, a = y_t) = \begin{cases} \frac{1}{N} \sum_{n=1}^{N} D_{\phi}(Y^n_{1:T}), \; Y^n_{1:T} \in \mathrm{MC}^{G_{\beta}}(Y_{1:t}; N) & \text{for } t < T \\ D_{\phi}(Y_{1:t}) & \text{for } t = T \end{cases}$$

where $\{Y^1_{1:T}, \dots, Y^N_{1:T}\} = \mathrm{MC}^{G_{\beta}}(Y_{1:t}; N)$ denotes the $N$ sequences obtained by rolling out from $Y_{1:t}$.
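A hedged sketch of this estimate, where `rollout` (the roll-out policy $G_{\beta}$) and the discriminator `D_phi` are assumed callables rather than the authors' code:

```python
# Monte Carlo search: estimate an intermediate Q-value by completing the
# partial sequence N times and averaging the discriminator's scores.
import torch

def q_value(Y_1_to_t, t, T, N, rollout, D_phi):
    if t == T:                                  # last step: score full sequence
        return D_phi(Y_1_to_t)
    scores = []
    for _ in range(N):                          # MC^{G_beta}(Y_{1:t}; N)
        Y_full = rollout(Y_1_to_t, T)           # complete the sequence with G_beta
        scores.append(D_phi(Y_full))            # D_phi(Y^n_{1:T})
    return torch.stack(scores).mean()           # (1/N) sum_n D_phi(Y^n_{1:T})
```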

SLIDE 47

Training Sequence Discriminator

  • Objective: standard binary classification

$$\min_{\phi} \; -\mathbb{E}_{Y \sim p_{data}}[\log D_{\phi}(Y)] - \mathbb{E}_{Y \sim G_{\theta}}[\log(1 - D_{\phi}(Y))]$$

SLIDE 48

Training Sequence Generator

  • Policy gradient (REINFORCE)

[Richard Sutton et al. Policy Gradient Methods for Reinforcement Learning with Function Approximation. NIPS 1999.]

$$\nabla_{\theta} J(\theta) = \mathbb{E}_{Y_{1:t-1} \sim G_{\theta}}\Big[ \sum_{y_t \in \mathcal{Y}} \nabla_{\theta} G_{\theta}(y_t \mid Y_{1:t-1}) \cdot Q^{G_{\theta}}_{D_{\phi}}(Y_{1:t-1}, y_t) \Big]$$

$$\simeq \frac{1}{T} \sum_{t=1}^{T} \sum_{y_t \in \mathcal{Y}} \nabla_{\theta} G_{\theta}(y_t \mid Y_{1:t-1}) \cdot Q^{G_{\theta}}_{D_{\phi}}(Y_{1:t-1}, y_t)$$

$$= \frac{1}{T} \sum_{t=1}^{T} \sum_{y_t \in \mathcal{Y}} G_{\theta}(y_t \mid Y_{1:t-1})\, \nabla_{\theta} \log G_{\theta}(y_t \mid Y_{1:t-1}) \cdot Q^{G_{\theta}}_{D_{\phi}}(Y_{1:t-1}, y_t)$$

$$= \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_{y_t \sim G_{\theta}(y_t \mid Y_{1:t-1})}\big[\nabla_{\theta} \log G_{\theta}(y_t \mid Y_{1:t-1}) \cdot Q^{G_{\theta}}_{D_{\phi}}(Y_{1:t-1}, y_t)\big]$$

$$\theta \leftarrow \theta + \alpha_h \nabla_{\theta} J(\theta)$$
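In practice this gradient is implemented as a surrogate loss: the log-probability of each sampled token weighted by its estimated Q-value. A sketch under assumed interfaces (`generator.log_prob` and precomputed `q_values` are illustrative stand-ins):

```python
# One SeqGAN generator update: weight each step's log-probability by its
# Q-value estimate, average over T steps, and ascend (minimize the negation).
import torch

def seqgan_generator_step(generator, optimizer, Y, q_values):
    # Y: sampled token ids; q_values: Q^{G_theta}_{D_phi}(Y_{1:t-1}, y_t) per step
    loss = torch.tensor(0.0)
    for t in range(len(Y)):
        log_p = generator.log_prob(Y[:t], Y[t])   # log G_theta(y_t | Y_{1:t-1})
        loss = loss - q_values[t] * log_p         # negate: optimizer minimizes
    loss = loss / len(Y)                          # the 1/T factor above
    optimizer.zero_grad(); loss.backward(); optimizer.step()
```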

SLIDE 49

Overall Algorithm
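The paper's overall algorithm (shown as a figure on this slide) alternates pre-training, g-steps, and d-steps. A rough, hedged outline follows, with every helper argument standing in for a component sketched above rather than the authors' code:

```python
# High-level SeqGAN training loop: MLE pre-training of G, pre-training of D,
# then alternating policy-gradient g-steps and discriminator d-steps.
def train_seqgan(G, D, data,
                 pretrain_mle, pretrain_disc,         # pre-training routines
                 q_value_mc, policy_gradient_update,  # g-step components
                 train_disc, sync_rollout,            # d-step + roll-out update
                 g_steps=1, d_steps=5, epochs=50):
    pretrain_mle(G, data)            # 1. pre-train G by maximum likelihood
    pretrain_disc(D, G, data)        # 2. pre-train D on real vs. generated
    for _ in range(epochs):          # 3. adversarial training
        for _ in range(g_steps):     #    g-steps: policy-gradient updates of G
            Y = G.sample()
            q = [q_value_mc(G, D, Y, t) for t in range(1, len(Y) + 1)]
            policy_gradient_update(G, Y, q)
        sync_rollout(G)              #    roll-out policy G_beta tracks G_theta
        for _ in range(d_steps):     #    d-steps: retrain D on fresh samples
            train_disc(D, real=data.sample(), fake=G.sample())
```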

SLIDE 50

Sequence Generator Model

  • RNN with LSTM cells [Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural computation 9(8):1735–1780.]
  • At each step, the next token is drawn by softmax sampling over the vocabulary (e.g., "Shanghai" → "is" → "incredibly" → ?).
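A minimal sketch of such a generator (vocabulary and layer sizes, and the sampling interface, are illustrative assumptions):

```python
# LSTM language model that samples the next token from a softmax
# distribution over the vocabulary at every step.
import torch
import torch.nn as nn

class LSTMGenerator(nn.Module):
    def __init__(self, vocab=5000, emb=32, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.lstm = nn.LSTM(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)    # logits over the vocabulary

    def sample(self, bos=0, T=20):
        tokens, state = [bos], None
        for _ in range(T):
            x = self.embed(torch.tensor([[tokens[-1]]]))
            h, state = self.lstm(x, state)
            probs = torch.softmax(self.out(h[:, -1]), dim=-1)
            tokens.append(int(torch.multinomial(probs, 1)))  # softmax sampling
        return tokens
```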

SLIDE 51

Sequence Discriminator Model

[Kim, Y. 2014. Convolutional neural networks for sentence classification. EMNLP 2014.]
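A hedged sketch of a Kim-2014-style CNN sentence classifier serving as the discriminator (filter counts and widths are illustrative assumptions, not the paper's exact configuration):

```python
# CNN text classifier: parallel convolutions of several widths over token
# embeddings, max-over-time pooling, then a binary real/fake output.
import torch
import torch.nn as nn

class CNNDiscriminator(nn.Module):
    def __init__(self, vocab=5000, emb=32, n_filters=16, widths=(2, 3, 4)):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.convs = nn.ModuleList(nn.Conv1d(emb, n_filters, w) for w in widths)
        self.fc = nn.Linear(n_filters * len(widths), 1)

    def forward(self, tokens):                    # tokens: (batch, T)
        x = self.embed(tokens).transpose(1, 2)    # (batch, emb, T)
        feats = [torch.relu(c(x)).max(dim=2).values for c in self.convs]
        return torch.sigmoid(self.fc(torch.cat(feats, dim=1)))  # P(real)
```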

SLIDE 52

Inconsistency of Evaluation and Use

  • Evaluation: check whether the true data has a high mass density under the learned model, $\mathbb{E}_{x \sim p_{true}(x)}[\log G_{\theta}(x)]$, approximated by $\max_{\theta} \frac{1}{|\mathcal{D}|} \sum_{x \in \mathcal{D}} \log G_{\theta}(x)$.
  • Use: check whether the model-generated data is considered as real as possible, $\mathbb{E}_{x \sim G_{\theta}(x)}[\log p_{true}(x)]$, given a generator with a certain generalization ability.
  • The latter is more straightforward, but it is hard or impossible to directly calculate $p_{true}(x)$.

SLIDE 53

Experiments on Synthetic Data

  • Evaluation measure with Oracle
  • Use an oracle model (e.g., a randomly initialized LSTM) as the ground truth.
  • First, the oracle model produces some sequences as training data for the generative model.
  • Second, the oracle model can be considered as a human observer that accurately evaluates the perceptual quality of the generative model:

$$\mathrm{NLL}_{oracle} = -\mathbb{E}_{Y_{1:T} \sim G_{\theta}} \Big[ \sum_{t=1}^{T} \log G_{oracle}(y_t \mid Y_{1:t-1}) \Big]$$
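A sketch of how this metric could be computed by Monte Carlo sampling (`generator.sample` and `oracle.log_prob` are assumed interfaces, not library calls):

```python
# NLL_oracle: sample sequences from G_theta and average their negative
# log-likelihood under the fixed oracle model.
def nll_oracle(generator, oracle, n_samples=100, T=20):
    total = 0.0
    for _ in range(n_samples):
        Y = generator.sample(T=T)                    # Y_{1:T} ~ G_theta
        for t in range(1, len(Y)):
            total -= oracle.log_prob(Y[:t], Y[t])    # log G_oracle(y_t | Y_{1:t-1})
    return total / n_samples                         # Monte Carlo estimate
```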

SLIDE 54

Experiments on Synthetic Data

  • Evaluation measure with Oracle

$$\mathrm{NLL}_{oracle} = -\mathbb{E}_{Y_{1:T} \sim G_{\theta}} \Big[ \sum_{t=1}^{T} \log G_{oracle}(y_t \mid Y_{1:t-1}) \Big]$$

SLIDE 55

Experiments on Synthetic Data

  • The training strategy really matters.
SLIDE 56

Experiments on Real-World Data

  • Chinese poem generation
  • Obama political speech text generation
  • Midi music generation
SLIDE 57

Experiments on Real-World Data

  • Chinese poem generation
  • [Table: sample poems, human-written vs. machine-generated]

SLIDE 58

Obama Speech Text Generation

  • "i stood here today i have one and most important thing that not on violence throughout the horizon is OTHERS american fire and OTHERS but we need you are a strong source"
  • "for this business leadership will remember now i can't afford to start with just the way our european support for the right thing to protect those american story from the world and"
  • "i want to acknowledge you were going to be an outstanding job times for student medical education and warm the republicans who like my times if he said is that brought the"
  • "when he was told of this extraordinary honor that he was the most trusted man in america"
  • "but we also remember and celebrate the journalism that walter practiced a standard of honesty and integrity and responsibility to which so many of you have committed your careers. it's a standard that's a little bit harder to find today"
  • "i am honored to be here to pay tribute to the life and times of the man who chronicled our time."

(The slide contrasts machine-generated samples with human-written excerpts.)

SLIDE 59

Issues

  • Gradient vanishing problem:
  • The discriminator is trained to be much stronger than the generator, so any output instance of the generator is scored as almost 0.
  • It then becomes extremely hard for the generator to make any actual updates.
  • Mode collapse:
  • Due to the REINFORCE algorithm, G raises the probability of sampling the particular tokens that earn high evaluation from D.
  • As a result, G only manages to mimic a limited part of the target distribution.
SLIDE 60

Summary

  • We proposed a sequence generation method, called SeqGAN, to effectively train Generative Adversarial Nets for discrete structured sequence generation via policy gradient.
  • We designed an experiment framework with an oracle evaluation metric to accurately evaluate the "perceptual quality" of model-generated sequences.
SLIDE 61

Review

  • First solid and well-motivated study on using GANs for discrete sequences.
  • Extensive experimentation on both synthetic and real-world data with convincing results.
  • Requires a lot of engineering and hyper-parameter tuning: pre-training, GAN parameters, g-steps, d-steps, MC tree depth, etc.

SLIDE 62

Pros

  • 1. Succeed with RL+GAN / interesting idea [Everyone]
  • 2. Well written [Keshav, Rajas]
  • 3. Mathematical detail [Atishya, Jigyasa]
  • 4. Multiple domains explored [Shubham]
  • 5. Ablation study of train time [Pawan]
  • 6. Pretraining the generator with MLE can help reduce the high variance in the gradient estimate, as very few samples are used in each episode. [Jigyasa]
  • 7. The evaluation approach of using a randomly initialized LSTM as an oracle is a very creative idea that provides a nice way to automatically compare how close the generator distribution is to the actual model of the world. [Rajas]
  • 8. Using a CNN for the discriminator and getting good results is really noteworthy. [Vipul]

SLIDE 63

Cons

  • 1. Real-world experiments should include all baselines, not just MLE [Keshav, Siddhant, Saransh, Rajas]
  • 2. Difficult to convince the community, given the added complications [Keshav, Atishya, Vipul, Saransh]
  • 3. Needs more examples rather than just loss metrics. What about diversity? [Atishya, Jigyasa, Siddhant, Rajas]
  • 4. Limitation of poems? [Soumya]
  • 5. When to stop training? [Shubham]
  • 6. Using a language model as the source for synthetic data [Pawan]
  • 7. Doesn't offer any strong paradigm for intermediate reward calculation [Vipul]
  • 8. MCTS not feasible on large datasets [Lovish]
  • 9. BLEU [Lovish]
  • 10. The generator might start learning sentences in the gold set. [Rajas]
SLIDE 64

Extensions/Discussion

  • 1. Intermediate Rewards:
      1. K discriminators trained with partial/complete sequences [Keshav]
      2. K distinct Ds are expensive; weight sharing [Atishya]
      3. Use an LM for intermediate rewards: "surprise" value [Rajas/Soumya/Saransh]
  • 2. Pre-trained discriminators (low/med/high)
  • 3. Transformer models/LMs [Shubham]
  • 4. Optimization in continuous space with periodic discrete updates [Pawan]
  • 5. WGAN [Vipul]
  • 6. Information Retrieval [Siddhant]
      1. Won't work [Saransh]

SLIDE 65

Outline

  • 1. Introduction to GANs
  • 2. Brief theoretical overview of GANs
  • 3. Overview of GANs in Sequence Generation
  • 4. SeqGAN
  • 5. Other recent work: Unsupervised Conditional Sequence Generation

SLIDE 66

Unsupervised Conditional Sequence Generation

  • Text Style Transfer
  • Unsupervised Abstractive Summarization
  • Unsupervised Translation
  • Unsupervised Speech Recognition
SLIDE 67

Three Categories of Solutions

  • Gumbel-softmax: [Matt J. Kusner, et al., arXiv, 2016][Weili Nie, et al., ICLR, 2019]
  • Continuous Input for Discriminator: [Sai Rajeswar, et al., arXiv, 2017][Ofir Press, et al., ICML workshop, 2017][Zhen Xu, et al., EMNLP, 2017][Alex Lamb, et al., NIPS, 2016][Yizhe Zhang, et al., ICML, 2017]
  • Reinforcement Learning: [Yu, et al., AAAI, 2017][Li, et al., EMNLP, 2017][Tong Che, et al., arXiv, 2017][Jiaxian Guo, et al., AAAI, 2018][Kevin Lin, et al., NIPS, 2017][William Fedus, et al., ICLR, 2018]