CS11-747 Neural Networks for NLP
Models w/ Latent Random Variables
Graham Neubig
Site https://phontron.com/class/nn4nlp2017/
Discriminative vs. Generative Models

- Discriminative model: calculate the probability of the output given the input, P(Y|X).
- Generative model: calculate the probability of the data itself, P(X), or of the data jointly with other variables, P(X,Y).
Types of Variables

- Deterministic variables are calculated according to some deterministic function of their inputs, and always take a single value.
- Random variables obey a probability distribution, and may take any of several (or infinite) values.
Quiz: in a model trained using MLE/teacher forcing, which variables are deterministic and which are random?
Why Latent Random Variables?

Underlying factors affect the text/images/speech that we are observing: for example, a writer has an underlying intent, but the observer is not sure exactly what it is. Deterministic variables cannot capture this ambiguity.
A latent variable model samples z from a prior and applies a deterministic function mapping z to x, where this function is usually a neural net:

    z ∼ N(0, I)
    x = f(z; Θ)
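This generative story is simple to write down directly; here is a minimal PyTorch sketch (the decoder architecture and sizes are illustrative assumptions):

```python
import torch
import torch.nn as nn

# The generative story: sample z from a standard normal prior, then map it
# to an observation with a deterministic neural net f(z; Θ).
latent_dim, data_dim = 16, 784          # illustrative sizes

f = nn.Sequential(                      # f(z; Θ): here a small MLP
    nn.Linear(latent_dim, 128),
    nn.ReLU(),
    nn.Linear(128, data_dim),
)

z = torch.randn(latent_dim)             # z ∼ N(0, I)
x = f(z)                                # x = f(z; Θ)
```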
Maximum likelihood training maximizes the log-likelihood of the data:

    log P(X) = Σ_{x∈X} log P(x; θ)
With a latent variable, computing each example's likelihood requires marginalizing over z, which can be approximated by sampling from the prior:

    P(x; θ) = ∫ P(x | z; θ) P(z) dz ≈ (1/|S(x)|) Σ_{z∈S(x)} P(x | z; θ),  where S(x) := {z′ : z′ ∼ P(z)}
[Figure: the current data point x and the small region of latent samples z with non-negligible P(x | z)]
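A minimal sketch of this naive estimator (decoder_log_prob is a hypothetical function returning log P(x | z; θ)); because only a small region of latent samples has non-negligible P(x | z), a good estimate needs very many samples:

```python
import math
import torch

# Naive Monte Carlo estimate of log P(x; θ): draw z ∼ P(z) and average
# P(x | z; θ) over the samples, working in log space for numerical stability.
def log_marginal_likelihood(x, decoder_log_prob, latent_dim, n_samples=1000):
    zs = torch.randn(n_samples, latent_dim)        # S(x): samples from the prior
    log_px_given_z = torch.stack([decoder_log_prob(x, z) for z in zs])
    # log (1/n) Σ_i P(x | z_i) via logsumexp
    return torch.logsumexp(log_px_given_z, dim=0) - math.log(n_samples)
```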
Solution: an inference model Q(z|x) that predicts which values of z are likely to have produced x, allowing efficient training. We "encode" with the inference model and "decode" with the generative model.
Rather than sampling z from the prior, we sample from the approximating distribution Q(z|x; φ), which concentrates on plausible values of z:

    P(x; θ) = ∫ P(x | z; θ) P(z) dz = E_{z∼P(z)}[P(x | z; θ)]  ≈  E_{z∼Q(z|x;φ)}[P(x | z; θ)]

The derivation below accounts for the cost of this substitution.
Starting from the definition of KL divergence:

    KL[Q(z | x)||P(z | x)] = E_{z∼Q(z|x)}[log Q(z | x) − log P(z | x)]

Applying Bayes' rule to P(z | x):

    KL[Q(z | x)||P(z | x)] = E_{z∼Q(z|x)}[log Q(z | x) − log P(x | z) − log P(z)] + log P(x)

Rearranging and negating:

    log P(x) − KL[Q(z | x)||P(z | x)] = E_{z∼Q(z|x)}[log P(x | z)] − E_{z∼Q(z|x)}[log Q(z | x) − log P(z)]

Applying the definition of KL divergence again to the last term:

    log P(x) − KL[Q(z | x)||P(z | x)] = E_{z∼Q(z|x)}[log P(x | z)] − KL[Q(z | x)||P(z)]
The right-hand side is the variational lower bound that we maximize (since KL ≥ 0, it lower-bounds log P(x)):

    log P(x) − KL[Q(z | x)||P(z | x)] = E_{z∼Q(z|x)}[log P(x | z)] − KL[Q(z | x)||P(z)]

The first term on the right is the reconstruction log-likelihood, approximated by sampling from Q; the second term can be computed in closed form when Q(z | x) and P(z) are Gaussians.
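A minimal sketch of a single-sample estimate of this bound, assuming a diagonal-Gaussian Q(z | x) = N(μ, diag(σ²)) and a standard normal prior (recon_log_prob is a hypothetical function returning the decoder's log P(x | z)):

```python
import torch

def elbo(mu, logvar, recon_log_prob):
    # Reparameterize: z = μ + σ·ε with ε ∼ N(0, I), so z stays differentiable
    # with respect to μ and log σ² (needed to backprop through the sample).
    eps = torch.randn_like(mu)
    z = mu + torch.exp(0.5 * logvar) * eps
    reconstruction = recon_log_prob(z)   # E_{z∼Q}[log P(x | z)], one-sample estimate
    # KL[N(μ, σ²) || N(0, I)] in closed form
    kl = 0.5 * torch.sum(mu ** 2 + torch.exp(logvar) - logvar - 1.0)
    return reconstruction - kl           # maximize; negate for a training loss
```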
Figure Credit: Doersch (2016)
Variational Autoencoders for Language

Instead of generating from a normal language model, or from a conditioned model P(y|x) (e.g. translation, image captioning), we generate from a model that conditions on a latent variable z. In the autoencoder setting, z is the latent variable and the input and output are identical: the model reconstructs the sentence x from z.
[Architecture: Sentence x → Q RNN (inference model) → Latent z → P RNN (generative model) → Sentence x]
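A sketch of this architecture (the GRUs, the sizes, and passing z only through the decoder's initial state are illustrative assumptions):

```python
import torch
import torch.nn as nn

class SentenceVAE(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256, latent_dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.q_rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)   # inference model Q
        self.to_mu = nn.Linear(hidden_dim, latent_dim)
        self.to_logvar = nn.Linear(hidden_dim, latent_dim)
        self.z_to_h = nn.Linear(latent_dim, hidden_dim)
        self.p_rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)   # generative model P
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x):                     # x: (batch, seq_len) word ids
        emb = self.embed(x)
        _, h = self.q_rnn(emb)                # encode sentence x with Q
        mu, logvar = self.to_mu(h[-1]), self.to_logvar(h[-1])
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterized sample
        h0 = torch.tanh(self.z_to_h(z)).unsqueeze(0)
        dec, _ = self.p_rnn(emb, h0)          # decode x conditioned on z (teacher forcing)
        return self.out(dec), mu, logvar      # logits, plus Q's parameters for the KL term
```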
Difficulties in training: the objective is a standard model plus a regularization term

    E_{z∼Q(z|x)}[log P(x | z)] − KL[Q(z | x)||P(z)]

and a standard encoder-decoder without the KL divergence term is much easier to learn! The first term requires a good generative model, while the second just requires the encoder to set the mean/variance of Q(z|x) to match the prior, so the model can learn to rely solely on the decoder and ignore the latent variable.
Solution: KL cost annealing. Multiply the KL term by a weight that starts at zero, then gradually increases to 1 over the course of training.
Figure Credit: Bowman et al. (2016)
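A sketch of the annealing schedule (the linear ramp and warmup length are illustrative choices):

```python
# Scale the KL term by a weight that ramps from 0 to 1 during training.
def kl_weight(step, warmup_steps=10000):
    return min(1.0, step / warmup_steps)

# inside the training loop:
#   loss = -reconstruction + kl_weight(step) * kl
```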
Even then, with a sufficiently strong decoder the optimal strategy is to ignore z when it is not necessary (Chen et al. 2017). A complementary solution is to weaken the decoder, e.g. by dropping out the previous word in x during decoding (Bowman et al. 2015), or by using a less powerful decoder architecture such as a dilated CNN (Yang et al. 2017).
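A sketch of word dropout on the decoder inputs (the dropout rate and UNK id are illustrative assumptions):

```python
import torch

# Randomly replace previous words fed to the decoder with UNK, so the
# decoder cannot rely solely on its word history and must use z.
def word_dropout(prev_words, unk_id=0, rate=0.4):
    mask = torch.rand(prev_words.shape) < rate
    return prev_words.masked_fill(mask, unk_id)
```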
Discrete latent variables: when z is discrete, the marginal likelihood is a sum rather than an integral:

    P(x; θ) = Σ_z P(x | z; θ) P(z)
Two basic strategies:

- Enumeration: if the number of possible values of z is small, we can just sum over all of them (a sketch follows this list).
- Sampling: sample a subset of values of z and optimize with respect to this subset; the resulting gradient estimates rely on sampling, resulting in very high variance.
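A sketch of the enumeration strategy (decoder_log_prob is a hypothetical function returning log P(x | z = k; θ)):

```python
import torch

# Exact marginal for a small discrete latent: log P(x) = log Σ_k P(z=k) P(x|z=k).
def log_marginal(x, log_prior, decoder_log_prob):
    # log_prior: tensor of shape (K,) holding log P(z = k)
    log_joint = torch.stack([log_prior[k] + decoder_log_prob(x, k)
                             for k in range(log_prior.shape[0])])
    return torch.logsumexp(log_joint, dim=0)
```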
A third option for discrete variables is the Gumbel-softmax relaxation (Maddison et al. 2017, Jang et al. 2017).

Original categorical sampling method:

    ẑ = cat-sample(P(z | x))

Reparameterized method:

    ẑ = argmax(log P(z | x) + Gumbel(0, 1))

where the Gumbel distribution is Gumbel(0, 1) = −log(−log(Uniform(0, 1))).

The argmax is still not differentiable, so to allow gradients to flow we replace it with a softmax with temperature τ:

    ẑ = softmax((log P(z | x) + Gumbel(0, 1)) / τ)
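A sketch of the relaxed sampler (logits are the unnormalized log P(z | x)):

```python
import torch
import torch.nn.functional as F

def gumbel_softmax_sample(logits, tau=1.0):
    u = torch.rand_like(logits)
    gumbel = -torch.log(-torch.log(u))        # Gumbel(0, 1) = −log(−log(Uniform(0, 1)))
    # Temperature τ controls the relaxation: as τ → 0 the output approaches
    # a one-hot categorical sample, but gradients become noisier.
    return F.softmax((logits + gumbel) / tau, dim=-1)
```

PyTorch also ships this as torch.nn.functional.gumbel_softmax, including a hard (straight-through) variant.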
Applications include document modeling and question-answer pair selection: document modeling benefits from the smooth latent space, while for question-answer selection the latent variable perhaps acts mainly as more regularization.
We can also introduce a latent code c for various aspects of the text that we would like to control (e.g. sentiment), alongside z. A related model can be trained using either an auto-encoding or an encoder-decoder objective (Zhou and Neubig 2017).
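A sketch of conditioning generation on both z and a controllable code c (names and sizes are illustrative):

```python
import torch
import torch.nn as nn

latent_dim, code_dim, hidden_dim = 32, 2, 256
z_c_to_h = nn.Linear(latent_dim + code_dim, hidden_dim)

z = torch.randn(1, latent_dim)            # latent content vector
c = torch.tensor([[1.0, 0.0]])            # one-hot code, e.g. positive sentiment
h0 = torch.tanh(z_c_to_h(torch.cat([z, c], dim=-1)))  # decoder initial state
```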