MaskGAN: Better Text Generation via Filling in the ______
June 5, 2018
Sungjae Cho (Interdisciplinary Program in Cognitive Science), sj.cho@snu.ac.kr
SNU Spoken Language Processing Lab, Seoul National University
Abstract
- Maximum likelihood and teacher forcing can result in poor sample quality, since generating text requires conditioning on sequences of words that may have never been observed at training time.
- An actor-critic conditional GAN, MaskGAN, is introduced in this paper.
- MaskGAN produces more realistic conditional and unconditional text samples compared
to a maximum likelihood trained model.
Prerequisites
- GAN (Goodfellow et al., 2014)
- mode collapse = mode dropping
- Seq2seq model (Sutskever et al., 2014)
- maximum likelihood estimation
- stochastic gradient descent
- Pretraining
- Autoregression (autoregressively)
- BLEU score, n-gram
- (Validation) Perplexity
- Reinforcement learning;
reward, V-value, Q-value, advantage A
- Policy gradient
- REINFORCE algorithm
- actor-critic training algorithm
Motivations (From 1. Introduction & 2. Related Works)
- Maximum-likelihood RNNs are the most common generative models for sequences.
- Teacher forcing leads to unstable dynamics in the hidden states.
- Professor Forcing solves the above but does not encourage high sample quality.
- GANs have shown incredible sample quality for images, but the discrete nature of text makes training a generator harder.
- A reinforcement learning framework can be leveraged to train the generator via policy gradients.
1. Introduction
- GANs have only seen limited use for text sequences.
- This is due to the discrete nature of text, which makes it infeasible to propagate the gradient from the discriminator back to the generator as in standard GAN training.
- We overcome this by using Reinforcement Learning (RL) to train the generator while the
discriminator is still trained via maximum likelihood and stochastic gradient descent.
2. Related Works
Main Related Works
- SeqGAN (Yu et al., 2017)
- trains a language model by using policy gradients to train the generator
- to fool a CNN-based discriminator that discriminates between real and synthetic text
- Professor Forcing (Lamb et al., 2016)
- An alternative to training an RNN with teacher forcing, using a discriminator to discriminate the hidden states of a generator RNN that is conditioned on real and synthetic samples
- GANs for dialogue generation (Li et al., 2017)
- Their method applies REINFORCE with Monte Carlo sampling on the generator.
- An actor-critic algorithm for sequence prediction (Bahdanau et al., 2017)
- The rewards are task-specific scores such as BLEU
- instead of having rewards supplied by a discriminator in an adversarial setting
2. Related Works
Our work is distinct in that it uses
- an actor-critic training procedure on a task designed to provide rewards at every time step (Li et al., 2017),
- the in-filling task, which may mitigate the problem of severe mode collapse, and
- a critic that helps the generator converge more rapidly by reducing the high variance of the gradient updates.
3. MaskGAN | 3.1. Notation
- x_t: an input token at time t
- y_t, x_t^real: a target token at time t
- <m>: a masked token (the original token is replaced with a hidden token)
- x̂_t: the filled-in token for the t-th word
- x̃_t: a filled-in token passed to the discriminator (x̃_t = x̂_t)
- x̃_t may be either real or fake.
3. MaskGAN | 3.2. Architecture | Notation
Notation
- x = (x_1, ..., x_T): a discrete sequence
- m = (m_1, ..., m_T): a binary mask of the same length, generated deterministically or stochastically
- m_t ∈ {0, 1}
- m_t selects whether the token at time t will remain.
- m(x): the masked sequence
- If x = (x_1, x_2, x_3) and m = (1, 0, 1), then m(x) = (x_1, <m>, x_3).
- m(x) is the original real context.
3. MaskGAN | 3.2. Architecture | Problem Setup
- Start with a ground-truth discrete sequence x = (x_1, ..., x_T) and a binary mask of the same length, m = (m_1, ..., m_T). Applying the mask to the input sequence creates m(x), a sequence with blanks. For example:

  x     a  b  c  d  e
  m     1  0  0  1  1
  m(x)  a  _  _  d  e

- The goal of the generator is to autoregressively fill in the missing tokens, conditioned on the previous tokens and the mask.
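A minimal Python sketch of the mask application shown above (the token strings and the mask-symbol name are illustrative, not fixed by the paper):

    MASK_TOKEN = "<m>"

    def apply_mask(x, m):
        # Keep x_t where m_t == 1; replace it with <m> where m_t == 0.
        return [token if keep else MASK_TOKEN for token, keep in zip(x, m)]

    print(apply_mask(["a", "b", "c", "d", "e"], [1, 0, 0, 1, 1]))
    # ['a', '<m>', '<m>', 'd', 'e']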
3. MaskGAN | 3.2. Architecture | Generator
Generator architecture
- Seq2seq encoder-decoder architecture
- Input: 650-dimensional input (soft embedding).
- Output: vocabulary-size output (one-hot embedding).
- The encoder reads in a masked sequence.
- The decoder imputes the missing tokens by using the encoder hidden states.
- It autoregressively fills in the missing tokens.
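A minimal PyTorch sketch of such a seq2seq in-filling generator. This is an assumption-laden illustration, not the paper's configuration: single-layer LSTMs, no attention, and the class and argument names are invented here.

    import torch
    import torch.nn as nn

    class MaskGANGenerator(nn.Module):
        # Illustrative in-filling generator: encoder reads m(x), decoder fills in blanks.
        def __init__(self, vocab_size, emb_dim=650, hidden_dim=650):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_dim)
            self.encoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
            self.decoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
            self.out = nn.Linear(hidden_dim, vocab_size)

        def forward(self, masked_ids, real_ids, mask):
            # Encoder reads the masked sequence m(x); its final state seeds the decoder.
            _, state = self.encoder(self.embed(masked_ids))
            filled = [real_ids[:, 0]]            # assume the first token is unmasked context
            log_probs = []
            prev = real_ids[:, :1]
            for t in range(1, real_ids.size(1)):
                out, state = self.decoder(self.embed(prev), state)
                dist = torch.distributions.Categorical(logits=self.out(out[:, -1]))
                sample = dist.sample()
                # Keep the real token where mask == 1 ("remain"); use the sample where blanked.
                token = torch.where(mask[:, t].bool(), real_ids[:, t], sample)
                log_probs.append(dist.log_prob(token))   # log-probability of the token used at step t
                filled.append(token)
                prev = token.unsqueeze(1)                # feed the filled-in token back autoregressively
            return torch.stack(filled, dim=1), torch.stack(log_probs, dim=1)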
3. MaskGAN | 3.2. Architecture | Discriminator
Discriminator architecture
- The discriminator has an identical architecture to the generator, except that
- the output at each time step is a scalar probability, D_φ(x̃_t | x̃_{0:T}, m(x)) = P(x̃_t = x_t^real | x̃_{0:T}, m(x)),
- rather than a distribution over the vocabulary, as in the generator.
- Set the reward at time t as r_t ≡ log D_φ(x̃_t | x̃_{0:T}, m(x)).
[Figure: the discriminator, a seq2seq encoder-decoder, encodes the masked context m(x) = (a, _, _, d, e); its decoder reads the filled-in sequence x̃_{1:5} = (a, x, y, d, e) and outputs P(x̃_t = x_t^real | x̃_{0:T}, m(x)) at each position.]
3. MaskGAN | 3.2. Architecture | Discriminator
Discriminator architecture
- The discriminator is given the filled-in sequence x̃_{0:T} from the generator.
- We also give the discriminator the true context m(x), i.e., the real tokens x_{0:T}^real that were not masked.
- The discriminator D_φ computes the probability of each token x̃_t being real (x̃_t = x_t^real), given the true context of the masked sequence m(x).
[Figure: the same discriminator diagram as on the previous slide.]
3. MaskGAN | 3.2. Architecture | Critic
Critic network
- The critic network is implemented as an additional head on the discriminator.
- The critic network estimates the value function of the filled-in sequence, V_t(x̂_{0:t}), where R_t = Σ_{s=t}^{T} γ^s r_s is the discounted total return.
- Action and state: a_t ≡ x̂_t, s_t ≡ (x̂_1, ..., x̂_{t-1})
[Figure: the discriminator diagram again, with the critic head attached to the discriminator.]
- Reward (a log probability): r_t = log D_φ(x̃_t | x̃_{0:T}, m(x)) = log P(x̃_t = x_t^real | x̃_{0:T}, m(x))
- Discounted total return: R_t = Σ_{s=t}^{T} γ^s r_s
- Baseline: b_t = V_t(x̂_{0:t}), the state value function estimated by the critic
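A small NumPy sketch (names are illustrative) of how the per-token rewards and the discounted total returns above could be computed from the discriminator's probabilities:

    import numpy as np

    def rewards_and_returns(p_real, gamma=0.99):
        # p_real[t] = D_phi(x~_t | x~_{0:T}, m(x)): probability that the filled-in token is real.
        r = np.log(p_real)                      # reward r_t = log D_phi(...)
        R = np.zeros_like(r)
        running = 0.0
        for t in reversed(range(len(r))):       # accumulate R_t = sum_{s>=t} gamma^s * r_s from the tail
            running += gamma ** t * r[t]
            R[t] = running
        return r, R

    r, R = rewards_and_returns(np.array([0.9, 0.2, 0.4, 0.8, 0.7]))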
3. MaskGAN | 3.3. Training
- Our model is not fully differentiable due to the sampling operation.
- We estimate the gradient with respect to the generator parameters θ via policy gradients.
- The generator seeks to maximize the cumulative total reward.
- We optimize the parameters of the generator, θ, by performing gradient ascent on E_{G_θ}[R_t], where R = Σ_{t=1}^{T} R_t; R_t is the reward at time t and R is the cumulative total reward.
- Using one of the REINFORCE family of algorithms, we can estimate the gradient as
  ∇_θ E_G[R_t] = (R_t - b_t) ∇_θ log G_θ(x̂_t), where b_t = V^G(x̂_{0:t}) is the critic's estimate.
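A minimal PyTorch-style sketch of the resulting REINFORCE-with-baseline update (names are illustrative; log_probs would hold log G_θ(x̂_t) for the sampled tokens, returns the R_t, and baselines the critic's b_t):

    import torch

    def generator_pg_loss(log_probs, returns, baselines):
        # Minimizing this surrogate performs gradient ascent on (R_t - b_t) * log G_theta(x_hat_t).
        advantages = (returns - baselines).detach()   # do not backpropagate through the advantage
        return -(advantages * log_probs).sum()

    # Toy usage with dummy tensors; in practice log_probs comes from the generator's sampled tokens.
    log_probs = torch.randn(5, requires_grad=True)
    loss = generator_pg_loss(log_probs, torch.randn(5), torch.randn(5))
    loss.backward()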
3. MaskGAN | 3.4. Alternative Approaches for Long Sequences and Large Vocabularies (Optional)

Problem: Long sequences of words
Solution: Curriculum learning. Increment the maximum sequence length from T to T + 1 and continue training once a convergence criterion is satisfied.

Problem: Large vocabularies ⇒ variance with REINFORCE methods
Solution: Instead of generating a reward only on the sampled token, compute the reward for each possible token v. This incurs a computational penalty.
This subsection is about how to extend this research.
3. MaskGAN | 3.5. Method Details
1. Train a language model using standard maximum likelihood training.
- Use the trained language model weights for the seq2seq encoder and decoder modules.
2. Train the seq2seq model on the in-filling task using maximum likelihood.
- Select the model producing the lowest validation perplexity on the masked task
via a hyperparameter sweep (= grid search) over 500 runs.
3. Including the critic network
- decreased the variance of gradient estimates, and
- substantially improved training.
4. Evaluation
Suggested Evaluation Method
- Compute the number of unique n-grams produced by the generator that occur in the validation corpus, for small n (2 ≤ n ≤ 5).
- Compute the geometric average of these counts over the generated sequences.
About the Evaluation
- This evaluation is for measuring the degree of mode collapse, and it is motivated by BLEU.
- Mode collapse example from MaskGAN:
- Ex.1: It is a very funny film that is very funny It s a very funny movie and it s charming
- Mode-collapse examples have a small number of unique n-grams.
- 17 unique 2-grams out of 19 2-grams in Ex. 1 (17 out of 19)
- (It, is), (is, a), (a, very), (very, funny), (funny, film), (film, that), (that, is), (is, very), (funny, It), (It, s), (s, a), (a, very), (very, funny), (funny, movie), (movie, and), (and, it), (it, s), (s, charming)
- Count the 2-grams that also occur in the validation corpus, which the generator has never seen; this counts unique 2-grams that cannot simply have been memorized from the training data.
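A small Python sketch of this metric (function and variable names are illustrative): count the generator's unique n-grams that also appear in the validation corpus, then take the geometric mean over n = 2..5:

    from math import exp, log

    def ngrams(tokens, n):
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    def unique_ngram_metric(generated_tokens, validation_tokens, n_values=(2, 3, 4, 5)):
        counts = []
        for n in n_values:
            matched = ngrams(generated_tokens, n) & ngrams(validation_tokens, n)
            counts.append(len(matched))          # unique generated n-grams that occur in validation
        return exp(sum(log(max(c, 1)) for c in counts) / len(counts))   # geometric mean

    sample = "It is a very funny film that is very funny It s a very funny movie and it s charming".split()
    print(unique_ngram_metric(sample, sample))   # toy call; a real validation corpus would be used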
5. Experiments
- Dataset: The Penn Treebank (PTB), IMDB dataset
- Mask rate: the ratio of blank words to the total words
- Samples: conditional and unconditional samples
- Conditional sample: (masking rate) < 1, so some real words remain as context.
- (masking rate) = 0.5 was used in this paper for conditional samples.
- Unconditional sample: (masking rate) = 1, i.e., every word is blanked.
5. Experiments
Validation perplexity
- Validation perplexity: Perplexity on the validation set.
- Perplexity is a measure of confusion
- Low perplexity:
- Low cross-entropy error
- Good performance.
- Low negative log-likelihood; the model is more likely to predict the next word in the sequence correctly.
- High perplexity:
- High cross-entropy error
- Bad performance.
- The model is less likely to predict the next word in the sequence correctly.
- Perplexity = 2^J
- J = (1/T) Σ_{t=1}^{T} J_t = -(1/T) Σ_{t=1}^{T} Σ_{k=1}^{V} y_{t,k} log_2 ŷ_{t,k}, where y_{t,k} is the one-hot target and ŷ_{t,k} is the predicted probability of word k at time t.
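A quick numerical sketch in Python (illustrative names) of the relation between average cross-entropy and perplexity:

    import math

    def perplexity(correct_word_probs):
        # correct_word_probs[t]: probability the model assigned to the correct word at step t.
        J = -sum(math.log2(p) for p in correct_word_probs) / len(correct_word_probs)  # average cross-entropy in bits
        return 2 ** J

    print(perplexity([0.5, 0.25, 0.125]))   # average cross-entropy = 2 bits, so perplexity = 4.0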
5.1 The Penn Treebank (PTB)
- A vocabulary of 10,000 unique words
- The training set contains 930,000 words.
- The validation set contains 74,000 words.
- The test set contains 82,000 words.
Training
- Pretrained the commonly-used variational LSTM language model
following Gal & Ghahramani (2016) to a validation perplexity of 78.
- Loaded the weights from the language model into the MaskGAN generator
- Pretrained the generator with masking rate of 0.5 to a validation perplexity of 55.3.
- Pretrained the discriminator on the samples produced from
the current generator and real training text.
5.1 The Penn Treebank (PTB)
Conditional samples
5.1 The Penn Treebank (PTB)
Unconditional samples
5.2 IMDB Movie Dataset
- 100,000 movie reviews taken from IMDB
- 25,000 labeled training instances
- 25,000 labeled test instances
- 50,000 unlabeled training instances
- The label indicates the sentiment of the review and may be either positive or negative.
- We use the first 40 words of each review in the training set to train our models, which leads to a dataset of 3 million words.
5.2 IMDB Movie Dataset
Training
- Identical to the training process in PTB
- Pretrained the LSTM language model to a validation perplexity of 105.6.
- Loaded the weights from the language model into the MaskGAN generator.
- Pretrained the generator to a validation perplexity of 105.6.
- masking rate of 0.5 (half the text blanked)
- Pretrained the discriminator on the samples produced from the current generator
and real training text.
5.2 IMDB Movie Dataset
Conditional samples
5.2 IMDB Movie Dataset
Unconditional samples
5.3 Perplexity of Generated Samples
- Skip
5.4 Mode Collapse
Mode Collapse
- Mode collapse: the model has learned a generating distribution that is too simple (a simple sequence pattern).
- Example of mode collapse by the MaskGAN
It is a very funny film that is very funny It s a very funny movie and it s charming
[Figure: a mode-collapsed output distribution vs. a desirable model's distribution.]
5.4 Mode Collapse
Mode Collapse
- The evaluation introduced in Section 4
- Mode collapse can be measured by directly calculating certain n-gram statistics.
- MaskGAN does show some mode collapse, evidenced by the reduced number of unique quadgrams.
- Mode dropping occurs near the tail end of sequences.
- Conjecture: this is because the generated samples are unlikely to have generated all the previous words correctly, so errors accumulate toward the end.
5.5 Human Evaluation
- The quality of generative models is still best measured by unbiased human evaluation.
- Theis et al. (2016) also show that validation perplexity does not necessarily correlate with sample quality. (from 5.4 Mode Collapse)
5.5 Human Evaluation
MaskGAN generates more human-looking samples than MaskMLE on the IMDB dataset.
MaskGAN still generates samples inferior to the real samples.
5.5 Human Evaluation
The performance gap between MaskGAN and MaskMLE is smaller here than on the IMDB dataset, possibly because of the dataset size.
6. Discussion
Results
1. We generally found that training where contiguous blocks of words were masked produced better samples
- compared to non-contiguous blocks.
- Conjecture: this gives the generator an opportunity to explore longer sequences in a free-running mode.
2. We found that policy gradient methods were effective in conjunction with a learned critic.
3. We also found that attention was important for the in-filled words to be sufficiently conditioned on the input context.
- Without attention, the in-filling would produce reasonable subsequences that became implausible in the context of the adjacent surrounding words.
4. In general, we think the proposed contiguous in-filling task is a good approach to reduce mode collapse and help with training stability for textual GANs.
5. We show that MaskGAN samples on a larger dataset (IMDB reviews) are significantly better than those of the corresponding tuned MaskMLE model, as shown by human evaluation.
6. We also show that we can produce high-quality samples despite the MaskGAN model having much higher perplexity on the ground-truth test set.
- High quality: measured by human evaluation. Higher perplexity: higher loss.
C. Failure Modes
1. Mode Dropping is less extreme than SeqGAN but still noticeable.
- It is a very funny film that is very funny It s a very funny movie and it s charming It
2. Matching Syntax at Boundaries
- Cartoon is one of those films me when I first saw it back in 2000
3. Loss of Global Context
- This movie is terrible The plot is ludicrous The title is not more interesting and original This is a
great movie Lord of the Rings was a great movie John Travolta is brilliant
- Underline denotes the blank.
Questions
Answered 7 questions out of 9
Question 1 | 최상우
- "GANs have had a lot of success in producing more realistic images than other approaches, but they have only seen limited use for text sequences. This is due to the discrete nature of text making it infeasible to propagate the gradient from the discriminator back to the generator as in standard GAN training." This passage says GANs are limited in generating text sequences and briefly explains why. Please explain the reason more concretely, using equations.
Question 1 | 최상우 | Answer
- Image
- Each generated pixel intensity ranges from 0 to 1; the intensity is a real number.
- Intensities can be interpolated.
- G is differentiable with respect to θ_G, so the discriminator's gradient can flow back to the generator.
- Text (= words)
- Words cannot be interpolated.
- The objective function becomes a step function.
- G is not differentiable with respect to θ_G.
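A tiny PyTorch illustration (not from the slides) of this point: gradients flow through a continuous output, but not through a sampled discrete token.

    import torch

    logits = torch.randn(5, requires_grad=True)

    # Continuous, image-like output: the gradient flows back to the parameters.
    torch.sigmoid(logits).sum().backward()
    print(logits.grad is not None)                       # True

    # Discrete, text-like output: a sampled token index carries no gradient.
    token = torch.multinomial(torch.softmax(logits, dim=0), 1)
    print(token.requires_grad, token.grad_fn)            # False None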
Question 2-1 | 손보경
- Q1: Please elaborate on why the contiguous in-filling task design reduces mode collapse compared with a design that is not contiguous.
- The authors do not clearly state the reason; they only say they believe it will reduce mode collapse.
- Generated samples are unlikely to generate all the previous words correctly. ⇒ Long sequences are hard to learn, which can cause mode collapse. ⇒ The in-filling task lets the model work with short word sequences.
- The words near the blank turn the GAN into a conditional GAN, which enables it to generate more diverse samples.
Question 2-2 | 손보경
- Q2: The paper also describes the critic's effect as "the generator converge more rapidly by reducing the high-variance of the gradient updates in an extremely high action-space environment (p.3)". Please explain why using the value function learned by the critic as a baseline reduces the variance of the gradient estimator.
- When training with policy gradients, if the reward R_t is always positive, the probability P(a_t) of every action a_t on the trajectory is increased, even when a_t is actually of poor quality.
- ⇒ When learning the policy (policy parameters), the variance of the updates to the policy parameters becomes high (= high variance of the gradient updates = high gradient-estimator variance).
- Introduce a baseline b_t: if R_t is smaller than b_t, i.e., R_t - b_t < 0, then P(a_t) is decreased.
- Interpreting R_t - b_t as the advantage A(a_t, s_t) = Q(a_t, s_t) - V(s_t) means the baseline b_t = V(s_t) is the reward expected from state s_t, so the update is scaled by how much better taking a_t is than expected (the meaning of A).
- The critic network learns this baseline (the value function).
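A small NumPy illustration of this effect under an assumed toy setup (a one-parameter Bernoulli policy with always-positive rewards; everything here is invented for illustration): subtracting a baseline leaves the mean of the REINFORCE gradient estimate unchanged while lowering its variance.

    import numpy as np

    rng = np.random.default_rng(0)
    p = 0.3                                        # probability of action a = 1 under the policy
    reward = {0: 1.0, 1: 2.0}                      # always-positive rewards
    baseline = p * reward[1] + (1 - p) * reward[0] # expected reward under the policy

    def grad_samples(use_baseline, n=100_000):
        a = rng.random(n) < p                      # sampled actions
        r = np.where(a, reward[1], reward[0])
        dlogp = np.where(a, 1.0 / p, -1.0 / (1.0 - p))   # d/dp log pi(a)
        scale = (r - baseline) if use_baseline else r
        return scale * dlogp

    for flag in (False, True):
        g = grad_samples(flag)
        print(f"baseline={flag}: mean={g.mean():.2f}, variance={g.var():.2f}")
    # Both means are ~1.0; the variance drops from roughly 13.8 to roughly 0.8 with the baseline.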
Question 3 | 이경민
- As explained in Section 3.3, policy gradients are used because the model is not fully differentiable. I am curious why training is possible with policy gradients even when the model is not differentiable.
Question 3 | 이경민 | Answer
Why training is possible with policy gradients even when the model is not differentiable
- Because an algorithm from the REINFORCE family of policy gradient methods is used: the objective is not differentiable through the sampled tokens, but REINFORCE approximates its gradient as shown on the next slide, enabling gradient-based learning.
- Working through the derivation of policy gradient methods and the REINFORCE algorithm shows this in more detail.
Question 3 | 이경민 | Answer
Why training is possible with policy gradients even when the model is not differentiable
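A standard sketch of the REINFORCE identity (stated here for reference in the notation used above, not reproduced from the slides): the gradient of the expected reward is rewritten so that only log G_θ needs to be differentiated, never the sampling operation itself.

    ∇_θ E_{x̂ ~ G_θ}[R]
      = ∇_θ Σ_{x̂} G_θ(x̂) R(x̂)
      = Σ_{x̂} G_θ(x̂) ∇_θ log G_θ(x̂) R(x̂)
      = E_{x̂ ~ G_θ}[ R(x̂) ∇_θ log G_θ(x̂) ]

Here R(x̂) is the reward supplied by the discriminator for the sampled sequence x̂; it never has to be differentiated with respect to θ or the discrete tokens, while log G_θ(x̂) is differentiable in θ.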
Question 4 | 김종인
- Text generation with the MaskGAN algorithm on the Penn Treebank dataset looks good if one considers only the results in Section 5.1.1. I suspect not every sentence comes out that well (the examples were probably cherry-picked), so I am curious what you think about the fundamental potential and limits of text generation.
Question 4 | 김종인 | Answer
Q4-1 MaskGAN's potential contribution to text generation
- In Grammaticality and Topicality, MaskGAN performs much better than the other models (LM, MaskMLE).
Q4-2 MaskGAN's limitations for text generation
- Comparing the Grammaticality and Topicality of real samples with MaskGAN's generations, MaskGAN falls far short.
- Mode collapse, matching syntax at boundaries, and loss of global context were observed as problems (Appendix C).
Question 5 | 장한솔
- On page 2, the paper explains that training instability and mode dropping have been problems for GANs. Please explain how each is a problem when the target is images and when it is text.
- Training instability refers to a GAN that was training well failing to train after even a small hyperparameter change. It arises in the same way for images and for text.
- Mode dropping maps samples to only a part of the output distribution, so the model generates monotonous output.
- Text example: "It is a very funny film that is very funny It s a very funny movie and it s charming"
- Image example [figure]
Question 6 | 조석현
- I do not quite understand the metric proposed in the evaluation section. Could you explain it simply?
- Evaluation: Compute the number of unique n-grams produced by the generator that occur in the validation corpus, for small n. Compute the geometric average of these numbers; the average indicates the performance of the generator.
- This evaluation is for measuring the degree of mode collapse.
- Mode collapse example from MaskGAN:
- Ex. 1: It is a very funny film that is very funny It s a very funny movie and it s charming
- Mode-collapse examples have a small number of unique n-grams.
- 17 unique 2-grams out of 19 2-grams in Ex. 1 (17 out of 19)
- (It, is), (is, a), (a, very), (very, funny), (funny, film), (film, that), (that, is), (is, very), (funny, It), (It, s), (s, a), (a, very), (very, funny), (funny, movie), (movie, and), (and, it), (it, s), (s, charming)
- Count the 2-grams that also occur in the validation corpus, which the generator has never seen; this counts unique 2-grams that cannot simply have been memorized from the training data.
Question 7 | 최성호
- At the bottom of page 4, R_t - b_t is described as an estimate of the advantage A(a_t, s_t) = Q(a_t, s_t) - V(s_t). I would appreciate an explanation of the background behind this.
- Keywords to consult: policy gradient, REINFORCE, the variance problem of REINFORCE, actor-critic algorithms
- See Question 2-2.
- I am also curious whether there are cases where GANs have been applied to video generation.
- Deep Multi-Scale Video Prediction Beyond Mean Square Error [arXiv][github]
- Video Generation From Text [arXiv]
- Dynamics Transfer GAN: Generating Video by Transferring Arbitrary Temporal Dynamics from a Source Video to a Single Target Image [arXiv]
End
Thank you!