SLIDE 1

SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실

MaskGAN: Better Text Generation via Filling in the ______

June 5, 2018 | Sungjae Cho (조성재), Interdisciplinary Program in Cognitive Science | sj.cho@snu.ac.kr

SLIDE 2

Abstract

  • Maximum likelihood and teacher forcing can result in poor sample quality, since generating text requires conditioning on sequences of words that may never have been observed at training time.
  • An actor-critic conditional GAN, MaskGAN, is introduced in this paper.
  • MaskGAN produces more realistic conditional and unconditional text samples than a maximum-likelihood-trained model.

SLIDE 3

Prerequisites

  • GAN (Goodfellow et al., 2014)
  • mode collapse = mode dropping
  • Seq2seq model (Sutskever et al., 2014)
  • maximum likelihood estimation
  • stochastic gradient descent
  • Pretraining
  • Autoregression (autoregressively)
  • BLEU score, n-gram
  • (Validation) perplexity
  • Reinforcement learning: reward, V-value, Q-value, advantage A
  • Policy gradient
  • REINFORCE algorithm
  • Actor-critic training algorithm
SLIDE 4

Motivations (From 1. Introduction & 2. Related Works)

  • Maximum-likelihood RNNs are the most common generative models for sequences.
  • Teacher forcing leads to unstable dynamics in the hidden states.
  • Professor forcing solves the above but does not encourage high sample quality.
  • GANs have shown incredible sample quality for images, but the discrete nature of text makes training a generator harder.
  • A reinforcement learning framework can be leveraged to train the generator via policy gradients.

SLIDE 5

  • 1. Introduction
  • GANs have only seen limited use for text sequences.
  • This is due to the discrete nature of text, which makes it infeasible to propagate the gradient from the discriminator back to the generator as in standard GAN training.
  • We overcome this by using reinforcement learning (RL) to train the generator, while the discriminator is still trained via maximum likelihood and stochastic gradient descent.

SLIDE 6

  • 2. Related Works

Main Related Works

  • SeqGAN (Yu et al., 2017)
  • Trains a language model by using policy gradients to train the generator to fool a CNN-based discriminator that discriminates between real and synthetic text.
  • Professor Forcing (Lamb et al., 2016)
  • An alternative to training an RNN with teacher forcing: a discriminator discriminates between the hidden states of a generator RNN conditioned on real and on synthetic samples.
  • GANs for dialogue generation (Li et al., 2017)
  • Their method applies REINFORCE with Monte Carlo sampling on the generator.
  • An actor-critic algorithm for sequence prediction (Bahdanau et al., 2017)
  • The rewards are task-specific scores such as BLEU, instead of rewards supplied by a discriminator in an adversarial setting.

SLIDE 7

  • 2. Related Works

Our work is distinct in that it uses:

  • An actor-critic training procedure on a task designed to provide rewards at every time step (Li et al., 2017)
  • The in-filling task, which may mitigate the problem of severe mode collapse
  • The critic, which helps the generator converge more rapidly by reducing the high variance of the gradient updates

SLIDE 8

  • 3. MaskGAN | 3.1. Notation
  • $x_t$: an input token at time $t$
  • $y_t$, $x_t^{\text{real}}$: a target token at time $t$
  • $\langle m \rangle$: a masked token (the original token is replaced with a hidden token)
  • $\hat{x}_t$: the filled-in token at the $t$-th position
  • $\tilde{x}_t$: a filled-in token passed to the discriminator ($\tilde{x}_t = \hat{x}_t$); $\tilde{x}_t$ may be either real or fake.

SLIDE 9

  • 3. MaskGAN | 3.2. Architecture | Notations

Notations

  • $\mathbf{x} = (x_1, \ldots, x_T)$: a discrete sequence
  • $\mathbf{m} = (m_1, \ldots, m_T)$: a binary mask of the same length, generated deterministically or stochastically
  • $m_t \in \{0, 1\}$
  • $m_t$ selects whether the token at time $t$ will remain.
  • $\mathbf{m}(\mathbf{x})$: the masked sequence, i.e., the original real context
  • If $\mathbf{x} = (x_1, x_2, x_3)$ and $\mathbf{m} = (1, 0, 1)$, then $\mathbf{m}(\mathbf{x}) = (x_1, \langle m \rangle, x_3)$.

SLIDE 10

  • 3. MaskGAN | 3.2. Architecture | Problem Setup
  • Start with a ground-truth discrete sequence $\mathbf{x} = (x_1, \ldots, x_T)$ and a binary mask of the same length, $\mathbf{m} = (m_1, \ldots, m_T)$. Applying the mask to the input sequence creates $\mathbf{m}(\mathbf{x})$, a sequence with blanks. For example:

    x    : a  b  c  d  e
    m    : 1  0  0  1  1
    m(x) : a  _  _  d  e

  • The goal of the generator is to autoregressively fill in the missing tokens, conditioned on the previous tokens and the mask.
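As a concrete illustration of the masking operation above, here is a minimal Python sketch (not the authors' code; the string "<m>" stands in for the masked token):

    # Minimal sketch of the masking operation m(x): keep a token where the mask is 1,
    # replace it with the hidden token "<m>" where the mask is 0.
    def apply_mask(tokens, mask):
        assert len(tokens) == len(mask)
        return [tok if keep else "<m>" for tok, keep in zip(tokens, mask)]

    x = ["a", "b", "c", "d", "e"]
    m = [1, 0, 0, 1, 1]
    print(apply_mask(x, m))  # ['a', '<m>', '<m>', 'd', 'e']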

SLIDE 11

  • 3. MaskGAN | 3.2. Architecture | Generator

Generator architecture

  • Seq2seq encoder-decoder architecture
  • Input: 650-dimensional embedding (soft embedding).
  • Output: a vocabulary-sized output (one-hot embedding).
  • The encoder reads in the masked sequence.
  • The decoder imputes the missing tokens using the encoder hidden states.
  • It autoregressively fills in the missing tokens (see the sketch below).
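The sketch below illustrates one plausible reading of this in-filling loop: the decoder copies the real token at unmasked positions and samples at masked ones. The helpers encode, decode_step, and sample are hypothetical stand-ins for the seq2seq model, not the paper's implementation.

    # Hedged sketch of autoregressive in-filling with a seq2seq model.
    # encode, decode_step, and sample are hypothetical model-specific callables.
    def fill_in(tokens, mask, encode, decode_step, sample):
        masked = [tok if keep else "<m>" for tok, keep in zip(tokens, mask)]
        enc_states = encode(masked)              # the encoder reads the masked sequence
        filled, state, prev = [], None, "<s>"    # "<s>" is an assumed start-of-sequence token
        for t, keep in enumerate(mask):
            logits, state = decode_step(prev, state, enc_states)  # distribution over the vocabulary
            x_hat = tokens[t] if keep else sample(logits)         # copy a real token or sample a fill-in
            filled.append(x_hat)
            prev = x_hat                         # condition the next step on what was just emitted
        return filled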

SLIDE 12

  • 3. MaskGAN | 3.2. Architecture | Discriminator

Discriminator architecture

  • The discriminator has an architecture identical to the generator's, except that
  • the output is a scalar probability at each time step,
    $D_\phi(\tilde{x}_t \mid \tilde{x}_{0:T}, \mathbf{m}(\mathbf{x})) = P(\tilde{x}_t = x_t^{\text{real}} \mid \tilde{x}_{0:T}, \mathbf{m}(\mathbf{x}))$,
  • rather than a distribution over the vocabulary, as in the generator's case.
  • Set the reward at time $t$ as $r_t \equiv \log D_\phi(\tilde{x}_t \mid \tilde{x}_{0:T}, \mathbf{m}(\mathbf{x}))$.

[Figure: the discriminator's seq2seq encoder reads the masked sequence $\mathbf{m}(\mathbf{x})$ (e.g., "a __ __ d e"); its decoder reads the filled-in sequence $\tilde{x}_{0:T}$ (e.g., $\tilde{x}_1$=a, $\tilde{x}_2$=x, $\tilde{x}_3$=y, $\tilde{x}_4$=d, $\tilde{x}_5$=e) and outputs $P(\tilde{x}_t = x_t^{\text{real}})$ at each position.]
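A small sketch of how the per-time-step reward $r_t = \log D_\phi(\cdot)$ could be derived from the discriminator outputs shown above (illustrative only; probs stands in for the per-token probabilities produced by $D_\phi$):

    import math

    # probs[t] = D_phi(x~_t | x~_{0:T}, m(x)): the probability that the token at position t is real.
    # The reward at each time step is its logarithm, r_t = log D_phi(...).
    def rewards_from_discriminator(probs):
        return [math.log(max(p, 1e-12)) for p in probs]  # clamp to avoid log(0)

    print(rewards_from_discriminator([0.9, 0.2, 0.6]))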

SLIDE 13

  • 3. MaskGAN | 3.2. Architecture | Discriminator

Discriminator architecture

  • The discriminator is given the filled-in sequence $\tilde{x}_{0:T}$ from the generator.
  • We also give the discriminator the true context $\mathbf{m}(\mathbf{x})$, i.e., $x_{0:T}^{\text{real}}$ with blanks.
  • The discriminator $D_\phi$ computes the probability of each token $\tilde{x}_t$ being real ($\tilde{x}_t = x_t^{\text{real}}$), given the true context of the masked sequence $\mathbf{m}(\mathbf{x})$.

[Figure: the same discriminator diagram as on the previous slide.]

SLIDE 14

  • 3. MaskGAN | 3.2. Architecture | Critic

Critic network

  • The critic network is implemented as an additional head off the discriminator.
  • The critic network estimates the value function of the filled-in sequence, $V(\hat{x}_{0:t})$, with the discounted total return $R_t = \sum_{s=t}^{T} \gamma^{s} r_s$.
  • Action and state: $a_t \equiv \hat{x}_t$, $s_t \equiv (\hat{x}_1, \ldots, \hat{x}_{t-1})$
  • Reward = the discriminator's log probability: $r_t = \log D_\phi(\tilde{x}_t \mid \tilde{x}_{0:T}, \mathbf{m}(\mathbf{x})) = \log P(\tilde{x}_t = x_t^{\text{real}} \mid \tilde{x}_{0:T}, \mathbf{m}(\mathbf{x}))$
  • Baseline: $b_t = V(\hat{x}_{0:t})$ (discounted total return $R_t$; state value function $V$)

[Figure: the same discriminator diagram as on slide 12, annotated with the reward $r_t$, the return $R_t$, and the value function $V$.]
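To make the critic's quantities concrete, here is a minimal sketch of the discounted return $R_t = \sum_{s=t}^{T} \gamma^{s} r_s$ and of the difference $R_t - b_t$; the baseline numbers stand in for the critic head's estimates and are purely illustrative:

    # R_t = sum_{s=t}^{T} gamma^s * r_s, written exactly as on this slide
    # (the exponent is s, as stated, rather than s - t).
    def discounted_returns(rewards, gamma=0.99):
        T = len(rewards)
        return [sum(gamma ** s * rewards[s] for s in range(t, T)) for t in range(T)]

    # Advantage-like weight used by the policy gradient: R_t - b_t,
    # where b_t is the critic's value estimate V(x_hat_{0:t}).
    def advantages(returns, baselines):
        return [R - b for R, b in zip(returns, baselines)]

    r = [-0.1, -1.6, -0.5]                      # example rewards r_t = log D_phi(...)
    R = discounted_returns(r)
    print(advantages(R, [-1.0, -1.2, -0.6]))    # illustrative critic baselines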

SLIDE 15

  • 3. MaskGAN | 3.3. Training
  • Our model is not fully differentiable due to the sampling operation.
  • We estimate the gradient with respect to the generator parameters $\theta$ via policy gradients.
  • The generator seeks to maximize the cumulative total reward $R = \sum_{t=1}^{T} R_t$, where $R_t$ is the return at time $t$.
  • We optimize the parameters of the generator, $\theta$, by performing gradient ascent on $\mathbb{E}_{G_\theta}[R_t]$.
  • Using one of the REINFORCE family of algorithms, we can estimate
    $\nabla_\theta \mathbb{E}_{G}[R_t] = (R_t - b_t)\,\nabla_\theta \log G_\theta(\hat{x}_t)$, where $b_t = V_G(\hat{x}_{0:t})$.

(Figure labels: gradient of the generator; gradient of the discriminator.)
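Below is a hedged sketch of the REINFORCE-with-baseline update implied by this equation. In practice log_probs would be autodiff tensors of $\log G_\theta(\hat{x}_t)$ collected while sampling; here plain numbers are used, and the surrogate loss is the quantity whose gradient matches $(R_t - b_t)\,\nabla_\theta \log G_\theta(\hat{x}_t)$ (an illustration, not the authors' training code):

    # Surrogate loss whose gradient w.r.t. theta is -(R_t - b_t) * d/dtheta log G_theta(x_hat_t),
    # summed over time; minimizing it performs gradient ascent on E[R_t].
    # The factor (R_t - b_t) is treated as a constant (no gradient flows through it).
    def reinforce_surrogate_loss(log_probs, returns, baselines):
        return -sum((R - b) * lp for lp, R, b in zip(log_probs, returns, baselines))

    # Illustrative numbers: log-probs of the sampled tokens, returns, and critic baselines.
    print(reinforce_surrogate_loss([-2.3, -0.7, -1.2], [-1.9, -1.4, -0.5], [-1.5, -1.0, -0.8]))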

SLIDE 16

  • 3. MaskGAN | 3.4. Alternative Approaches for Long Sequences and Large Vocabularies (Optional)

This subsection is about how to extend this research.

Problem: long sequences of words
  • Solution: curriculum learning, i.e., increment the maximum sequence length from $T$ to $T+1$ and continue training once a convergence criterion is satisfied.

Problem: large vocabularies ⇒ high variance with REINFORCE methods
  • Solution: instead of generating a reward only on the sampled token, compute the reward for every possible token $v$ (see the sketch below).
  • This incurs a computational penalty.
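One way to picture the full-vocabulary reward mentioned above; this is only an illustrative sketch under the assumption that the per-token rewards are averaged under the generator's distribution, not the paper's exact formulation:

    import math

    # Illustrative: at one time step, compute a reward r(v) = log D_phi(v | context) for every
    # possible token v and average it under the generator's distribution G(v), instead of
    # rewarding only the single sampled token. Lower variance, but more compute.
    def expected_full_vocab_reward(gen_probs, disc_probs):
        return sum(g * math.log(max(d, 1e-12)) for g, d in zip(gen_probs, disc_probs))

    # Toy 4-token vocabulary: generator probabilities and discriminator "real" probabilities.
    print(expected_full_vocab_reward([0.1, 0.6, 0.2, 0.1], [0.3, 0.8, 0.5, 0.1]))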

SLIDE 17

  • 3. MaskGAN | 3.5. Method Details

1. Train a language model using standard maximum likelihood training.

  • Use the trained language model weights for the seq2seq encoder and decoder modules.

2. Train the seq2seq model on the in-filling task using maximum likelihood.

  • Select the model producing the lowest validation perplexity on the masked task via a hyperparameter sweep (= grid search) over 500 runs.

3. Including the critic network

  • decreased the variance of gradient estimates, and
  • substantially improved training.

SLIDE 18

  • 4. Evaluation

Suggested Evaluation Method

  • Compute the number of unique n-grams produced by the generator that occur in the validation corpus, for small n (2 ≤ n ≤ 5).
  • Compute the geometric average of these counts over the generated sequences.

About the Evaluation

  • This evaluation measures the degree of mode collapse, and it is motivated by BLEU.
  • Mode collapse example from MaskGAN:
  • Ex.1: It is a very funny film that is very funny It s a very funny movie and it s charming
  • Mode-collapsed samples have a small number of unique n-grams.
  • 17 unique 2-grams out of 19 2-grams in Ex.1 (17 out of 19):
  • (It, is), (is, a), (a, very), (very, funny), (funny, film), (film, that), (that, is), (is, very), (funny, it), (It, s), (s, a), (a, very), (very, funny), (funny, movie), (movie, and), (and, it), (it, s), (s, charming)
  • Count the 2-grams that appear in the validation corpus, which the generator has never seen; this counts unique 2-grams that could not simply have been memorized from training data.
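A minimal sketch of this metric (assumed details: whitespace tokenization and a plain geometric mean; the toy corpus string is made up, and this is not the authors' evaluation script):

    import math

    def ngrams(tokens, n):
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    # Number of unique n-grams of a generated sample that also occur in the validation corpus.
    def unique_ngrams_in_corpus(sample, corpus, n):
        return len(ngrams(sample.split(), n) & ngrams(corpus.split(), n))

    def geometric_mean(counts):
        counts = [max(c, 1) for c in counts]  # avoid log(0) when there are no matches
        return math.exp(sum(math.log(c) for c in counts) / len(counts))

    sample = "It is a very funny film that is very funny It s a very funny movie and it s charming"
    corpus = "it is a very funny movie and it is charming"  # toy stand-in for the validation corpus
    counts = [unique_ngrams_in_corpus(sample, corpus, n) for n in range(2, 5)]
    print(counts, geometric_mean(counts))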

SLIDE 19

  • 5. Experiments
  • Dataset: The Penn Treebank (PTB), IMDB dataset
  • Mask rate: the ratio of blank words to the total words
  • Samples: Conditional and unconditional samples
  • Conditional sample: 0 < (masking rate) < 1.
  • (masking rate) = 0.5 was used in this paper for conditional samples.
  • Unconditional sample: (masking rate) = 1, i.e., the entire input is masked.

SLIDE 20

  • 5. Experiments

Validation perplexity

  • Validation perplexity: perplexity on the validation set.
  • Perplexity is a measure of confusion.
  • Low perplexity:
  • Low cross-entropy error
  • Good performance
  • More likely to predict the next word in the sequence (low negative log-likelihood)
  • High perplexity:
  • High cross-entropy error
  • Bad performance
  • Less likely to predict the next word in the sequence
  • $\text{Perplexity} = 2^{J}$, where
    $J = \frac{1}{T} \sum_{t=1}^{T} J_t = -\frac{1}{T} \sum_{t=1}^{T} \sum_{j=1}^{V} y_{t,j} \log_2 \hat{y}_{t,j}$
    ($V$: vocabulary size; $y_{t,j}$: one-hot target; $\hat{y}_{t,j}$: predicted probability).
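A small sketch of this computation (assuming base-2 logarithms so that perplexity $= 2^{J}$; with one-hot targets the inner sum over the vocabulary reduces to the probability assigned to the correct word):

    import math

    # pred_probs[t] = the model probability assigned to the correct next word y_t.
    # J is the average cross-entropy per token (base 2); perplexity = 2 ** J.
    def perplexity(pred_probs):
        J = -sum(math.log2(p) for p in pred_probs) / len(pred_probs)
        return 2 ** J

    print(perplexity([0.25, 0.1, 0.5]))  # higher assigned probabilities give lower perplexity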

SLIDE 21

5.1 The Penn Treebank (PTB)

  • A vocabulary of 10,000 unique words
  • The training set contains 930,000 words.
  • The validation set contains 74,000 words.
  • The test set contains 82,000 words.

Training

  • Pretrained the commonly used variational LSTM language model, following Gal & Ghahramani (2016), to a validation perplexity of 78.
  • Loaded the weights from the language model into the MaskGAN generator.
  • Pretrained the generator with a masking rate of 0.5 to a validation perplexity of 55.3.
  • Pretrained the discriminator on samples produced by the current generator and on real training text.

SLIDE 22

5.1 The Penn Treebank (PTB)

Conditional samples

SLIDE 23

5.1 The Penn Treebank (PTB)


Unconditional samples

SLIDE 24

5.2 IMDB Movie Dataset

  • 100,000 movie reviews taken from IMDB
  • 25,000 labeled training instances
  • 25,000 labeled test instances
  • 50,000 unlabeled training instances
  • The label indicates the sentiment of the review and may be either positive or negative.
  • We use the first 40 words of each review in the training set to train our models, which leads to a dataset of 3 million words.

SLIDE 25

5.2 IMDB Movie Dataset

Training

  • Identical to the training process for PTB
  • Pretrained the LSTM language model to a validation perplexity of 105.6.
  • Loaded the weights from the language model into the MaskGAN generator.
  • Pretrained the generator to a validation perplexity of 105.6.
  • Masking rate of 0.5 (half the text blanked)
  • Pretrained the discriminator on samples produced by the current generator and on real training text.

SLIDE 26

5.2 IMDB Movie Dataset

Conditional samples

SLIDE 27

5.2 IMDB Movie Dataset

Unconditional samples

SLIDE 28

5.3 Perplexity of Generated Samples

  • Skip

SLIDE 29

5.4 Mode Collapse

Mode Collapse

  • Mode collapse: the model has learned an overly simple generating distribution (= a simple sequence pattern).
  • Example of mode collapse from MaskGAN:

It is a very funny film that is very funny It s a very funny movie and it s charming

[Figure: an illustration contrasting a mode-collapsed distribution with a desirable model distribution.]

SLIDE 30

5.4 Mode Collapse

Mode Collapse

  • Using the evaluation introduced in Section 4, mode collapse can be measured by directly calculating certain n-gram statistics.
  • MaskGAN does show some mode collapse, evidenced by a reduced number of unique quadgrams.
  • Mode dropping occurs near the tail end of sequences.
  • Conjecture: this is because the generator is unlikely to have produced all the previous words correctly.

SLIDE 31

5.5 Human Evaluation

  • The quality of generative models is still best measured by unbiased human evaluation.
  • Theis et al. (2016) also show how validation perplexity does not necessarily correlate with sample quality. (from 5.4 Mode Collapse)

SLIDE 32

5.5 Human Evaluation

MaskGAN generates more human-looking samples than MaskMLE on the IMDB dataset.

MaskGAN's samples are still inferior to real samples.

SLIDE 33

5.5 Human Evaluation

The performance gap between MaskGAN and MaskMLE is smaller here than on IMDB, maybe because of the dataset size.

SLIDE 34

  • 6. Discussion

Results

1. We generally found that training where contiguous blocks of words were masked produced better samples.
  • Compared to non-contiguous blocks
  • Conjecture: this allows the generator an opportunity to explore longer sequences in a free-running mode.
2. We found that policy gradient methods were effective in conjunction with a learned critic.
3. We also found that the use of attention was important for the in-filled words to be sufficiently conditioned on the input context.
  • Without attention, the in-filling would produce reasonable subsequences that became implausible in the context of the adjacent surrounding words.
4. In general, we think the proposed contiguous in-filling task is a good approach to reduce mode collapse and help with training stability for textual GANs.
5. We show that MaskGAN samples on a larger dataset (IMDB reviews) are significantly better than those of the corresponding tuned MaskMLE model, as shown by human evaluation.
6. We also show we can produce high-quality samples despite the MaskGAN model having a much higher perplexity on the ground-truth test set.
  • High quality: measured by human evaluation / Higher perplexity: higher loss

SLIDE 35

  • C. Failure Modes

1. Mode dropping is less extreme than in SeqGAN but still noticeable.

  • It is a very funny film that is very funny It s a very funny movie and it s charming It

2. Matching Syntax at Boundaries

  • Cartoon is one of those films me when I first saw it back in 2000

3. Loss of Global Context

  • This movie is terrible The plot is ludicrous The title is not more interesting and original This is a great movie Lord of the Rings was a great movie John Travolta is brilliant

  • Underline denotes the blank.
SLIDE 36

Questions

Answered 7 questions out of 9

SLIDE 37

Question 1 | 최상우

  • "GANs have had a lot of success in producing more realistic images than other approaches, but they have only seen limited use for text sequences. This is due to the discrete nature of text making it infeasible to propagate the gradient from the discriminator back to the generator as in standard GAN training." This passage says that GANs are limited in generating text sequences and briefly explains why. Could you explain that reason more concretely, using equations?

SLIDE 38

Question 1 | 최상우 | Answer

  • Image
  • Each generated pixel intensity ranges from 0 to 1; the intensity is a real number.
  • Intensities can be interpolated.
  • $G$ is differentiable with respect to $\theta$.
  • Text (= words)
  • Words cannot be interpolated.
  • Its objective function will be a step function.
  • $G$ is not differentiable with respect to $\theta$.

SLIDE 39

Question 2-1 | 손보경

  • Q1: Please elaborate on why the contiguous in-filling task design reduces mode collapse compared to a design without it.

  • The authors do not state the reason explicitly; they only say they believe it will reduce mode collapse.
  • Generated samples are unlikely to get all the previous words right. ⇒ Long sequences are hard to learn, so mode collapse can occur. ⇒ The in-filling task lets the model work with short word sequences.
  • The words near the blank turn the GAN into a conditional GAN, which enables it to generate more diverse samples.

SLIDE 40

Question 2-2 | 손보경

  • Q2: The effect of the critic is described as "the generator converge[s] more rapidly by reducing the high-variance of the gradient updates in an extremely high action-space environment (p. 3)". Please explain why using the value function learned by the critic as a baseline reduces the variance of the gradient estimator.

  • When training with policy gradients, if the reward $R_t$ is always greater than 0, the probability $P(a_t)$ of the action $a_t$ on that trajectory is increased. Even if $a_t$ is actually of poor quality, $P(a_t)$ still increases.
  • ⇒ When learning the policy (policy parameters), the variance over the policy (policy parameters) becomes high = high variance of the gradient updates = the gradient estimator variance is high.
  • Provide a baseline $b_t$: if $R_t$ is smaller than $b_t$, i.e., if $R_t - b_t < 0$, then $P(a_t)$ is decreased.
  • Choosing $b_t$ so that $R_t - b_t = A(a_t, s_t) = Q(a_t, s_t) - V(s_t)$ means taking "the reward expected when action $a_t$ is taken in state $s_t$" (the meaning of $A$) as the reference.
  • The critic network learns $A(a_t, s_t)$.

SLIDE 41

Question 3 | 이경민

  • You explained that policy gradients are used, as described in Section 3.3, because the model is not fully differentiable. Why does using policy gradients make training possible even when the objective is not differentiable?

SLIDE 42

Question 3 | 이경민 | Answer

Why training is possible with policy gradients even when the objective is not differentiable

  • Because an algorithm from the REINFORCE family of policy gradient methods is used: although the objective function is not differentiable, REINFORCE approximates its gradient as shown below, so gradient-based learning is still possible.
  • Working through the derivations of policy gradient methods and REINFORCE gives a more detailed understanding.

SLIDE 43

Question 3 | 이경민 | Answer

Why training is possible with policy gradients even when the objective is not differentiable

SLIDE 44

Question 4 | 김종인

  • Text generation with the MaskGAN algorithm on the Penn Treebank dataset looks good if we only consider the results in 5.1.1. However, probably not every sentence comes out that well (the samples were likely cherry-picked). Fundamentally, what do you think about the potential and the limits of text generation?

SLIDE 45

Question 4 | 김종인 | Answer

Q5-1: The potential of MaskGAN to contribute to text generation

  • In grammaticality and topicality, MaskGAN performs much better than the other models (LM, MaskMLE).

Q5-2: The limits of MaskGAN for text generation

  • Comparing the grammaticality and topicality of real samples with MaskGAN's generated output, MaskGAN falls far short.
  • Mode collapse, matching syntax at boundaries, and loss of global context are observed. (Appendix C)

SLIDE 46

Question 5 | 장한솔

  • On page 2 it is explained that training instability and mode dropping have been problems for GANs. Please explain how each of these is a problem when generating images and when generating text.

  • Training instability refers to a GAN model that was training well no longer training after even a small hyperparameter change. It is the same problem whether the target is images or text.
  • Mode dropping: because samples $z$ are mapped onto only part of the output distribution over $y$, the model generates monotonous output.
  • Text example: "It is a very funny film that is very funny It s a very funny movie and it s charming"
  • Image example

SLIDE 47

Question 6 | 조석현

  • I do not fully understand the metric proposed in the evaluation section. Could you explain it in simpler terms?

  • Evaluation: compute the number of unique n-grams produced by the generator that occur in the validation corpus, for small n. Compute the geometric average of these counts; the average reflects the performance of the generator.
  • This evaluation measures the degree of mode collapse.
  • Mode collapse example from MaskGAN:
  • Ex.1: It is a very funny film that is very funny It s a very funny movie and it s charming
  • Mode-collapsed samples have a small number of unique n-grams.
  • 17 unique 2-grams out of 19 2-grams in Ex.1 (17 out of 19):
  • (It, is), (is, a), (a, very), (very, funny), (funny, film), (film, that), (that, is), (is, very), (funny, it), (It, s), (s, a), (a, very), (very, funny), (funny, movie), (movie, and), (and, it), (it, s), (s, charming)
  • Count the 2-grams that appear in the validation corpus, which the generator has never seen; this counts unique 2-grams that could not simply have been memorized from training data.

SLIDE 48

Question 7 | 최성호

  • At the bottom of page 4, it says that $R_t - b_t$ is an estimate of the advantage $A(a_t, s_t) = Q(a_t, s_t) - V(s_t)$. Could you explain the background of this?

  • Keywords to consult: policy gradient, REINFORCE, the variance problem of REINFORCE, actor-critic algorithm
  • See Question 2-2 (and the summary after this list).
  • Also, are there any cases where GANs have been applied to video generation?
  • Deep Multi-Scale Video Prediction Beyond Mean Square Error [arXiv][github]
  • Video Generation From Text [arXiv]
  • Dynamics Transfer GAN: Generating Video by Transferring Arbitrary Temporal Dynamics from a Source Video to a Single Target Image [arXiv]
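For reference, a compact textbook-style summary of the background behind $R_t - b_t$ as an advantage estimate (standard policy-gradient material, not taken from the slides):

    % REINFORCE with a baseline: for any b_t that does not depend on the action a_t,
    %   E[ b_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) ] = 0,
    % so subtracting b_t leaves the gradient unbiased while reducing its variance.
    \nabla_\theta J(\theta)
      = \mathbb{E}\big[ (R_t - b_t)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \big],
      \qquad b_t = V(s_t).
    % Since R_t is an unbiased sample of Q(a_t, s_t), the weight R_t - V(s_t) is a sample
    % estimate of the advantage A(a_t, s_t) = Q(a_t, s_t) - V(s_t). An actor-critic method
    % learns V (the critic) alongside the policy (the actor).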

SLIDE 49

End

Thank you!
