CS 533: Natural Language Processing
Autoencoders and VAEs
Karl Stratos
Rutgers University
Aside: Protective Measures are Meaningful (figure slide)
◮ Set up 1-1 meeting for proposal feedback (March 25-27)
◮ Proposal and A4 due March 24
◮ Exam: discussion
◮ EM: loose ends (hard EM)
◮ Autoencoders and VAEs
◮ VAE training techniques
◮ Observed data comes from the population distribution popX
◮ LVGM: model defining a joint distribution over X and Z
◮ Learning: estimate pXZ by maximizing the log-likelihood of the data

      max over pXZ of (1/N) Σ_{i=1}^N log Σ_z pXZ(x(i), z)
EM algorithm

Input: data x(1) . . . x(N) ∼ popX, definition of pXZ
Output: local optimum of

      max over pXZ of (1/N) Σ_{i=1}^N log Σ_z pXZ(x(i), z)

Repeat until convergence:

1. E-step. For i = 1 . . . N and z ∈ Z:
      qZ|X(z|x(i)) ← pXZ(x(i), z) / Σ_{z′} pXZ(x(i), z′)
2. M-step.
      pXZ ← arg max over p̄XZ of (1/N) Σ_{i=1}^N Σ_z qZ|X(z|x(i)) log p̄XZ(x(i), z)
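To make the two steps concrete, here is a minimal NumPy sketch of EM for a mixture of K spherical Gaussians; the fixed variance sigma2, the uniform-weight initialization, and initializing means from random data points are assumptions of the sketch, not prescribed by the algorithm above.

import numpy as np

def em_spherical_gmm(X, K, iters=50, sigma2=1.0):
    """Soft EM for a mixture of K spherical Gaussians with fixed variance."""
    N, d = X.shape
    rng = np.random.default_rng(0)
    mu = X[rng.choice(N, size=K, replace=False)].copy()   # component means
    log_pi = np.full(K, -np.log(K))                       # mixture weights (log)
    for _ in range(iters):
        # E-step: q(z|x^(i)) proportional to p(x^(i), z), normalized over z.
        logits = log_pi - ((X[:, None, :] - mu[None]) ** 2).sum(-1) / (2 * sigma2)
        q = np.exp(logits - logits.max(axis=1, keepdims=True))
        q /= q.sum(axis=1, keepdims=True)                 # (N, K) posteriors
        # M-step: maximize the expected complete-data log-likelihood.
        Nk = q.sum(axis=0)                                # soft counts per component
        mu = (q.T @ X) / Nk[:, None]
        log_pi = np.log(Nk / N)
    return mu, np.exp(log_pi)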
Hard EM algorithm

Input: data x(1) . . . x(N) ∼ popX, definition of pXZ
Output: local optimum of

      max over pXZ, (z1 . . . zN) ∈ Z^N of (1/N) Σ_{i=1}^N log pXZ(x(i), zi)

Repeat until convergence:

1. Hard E-step.
      (z1 . . . zN) ← arg max over (z̄1 . . . z̄N) ∈ Z^N of Σ_{i=1}^N log pXZ(x(i), z̄i)
2. M-step.
      pXZ ← arg max over p̄XZ of (1/N) Σ_{i=1}^N log p̄XZ(x(i), zi)
Example: hard EM for a mixture of spherical Gaussians

◮ x ∈ R^d, z ∈ {1 . . . K}
◮ Model parameters to learn: µ1 . . . µK ∈ R^d
◮ Negative log joint probability as a function of parameters:
      −log pXZ(x, z) ∝ ||x − µz||^2 (up to additive constants, assuming identity covariance and a uniform prior over z)
◮ Observed x(1) . . . x(N) ∈ R^d, latents z1 . . . zN ∈ {1 . . . K}
◮ Hard EM updates (see the sketch below):

      Hard E-step:  zi ← arg min over z ∈ {1 . . . K} of ||x(i) − µz||^2   for i = 1 . . . N
      M-step:       µk ← mean of {x(i) : zi = k}   for k = 1 . . . K

◮ This is exactly the k-means algorithm
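A minimal NumPy sketch of these hard EM updates, i.e. k-means; the initialization from random data points is an assumption of the sketch.

import numpy as np

def kmeans(X, K, iters=50):
    """Hard EM on the spherical Gaussian mixture above = k-means."""
    rng = np.random.default_rng(0)
    mu = X[rng.choice(len(X), size=K, replace=False)].copy()
    for _ in range(iters):
        # Hard E-step: assign each point to its nearest mean.
        z = ((X[:, None, :] - mu[None]) ** 2).sum(-1).argmin(axis=1)
        # M-step: each mean becomes the average of its assigned points.
        for k in range(K):
            if (z == k).any():
                mu[k] = X[z == k].mean(axis=0)
    return mu, z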
◮ Neural autoencoding: observed X, latent Z
◮ Running example
    ◮ X: sentence
    ◮ Z: m-dimensional real-valued vector
◮ We need to define
    ◮ qZ|X: encoder that transforms a sentence into a distribution over R^m
    ◮ pX|Z: decoder that transforms a vector z ∈ R^m into a distribution over sentences
    ◮ pZ: prior that defines a distribution over R^m
◮ Distributions parameterized by neural networks
Encoder qZ|X

◮ Input. Sentence x ∈ V^T
◮ Parameters. Word embeddings E ∈ R^{|V|×d}, LSTMCell, feedforward layers producing µ(x), σ^2(x) ∈ R^m
◮ Forward.

      h1, c1 ← LSTMCell(Ex1, (0d, 0d))
      h2, c2 ← LSTMCell(Ex2, (h1, c1))
      . . .
      hT, cT ← LSTMCell(ExT, (hT−1, cT−1))
      µ(x), σ^2(x) ← FF(hT)

◮ Distribution over R^m conditioned on x: qZ|X(·|x) = N(µ(x), diag(σ^2(x)))
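A minimal PyTorch sketch of this encoder, assuming a single unbatched sentence; the head names ff_mu and ff_logvar, and predicting log σ^2(x) rather than σ^2(x) for numerical stability, are choices of the sketch.

import torch
import torch.nn as nn

class Encoder(nn.Module):
    """q(Z|X): LSTM encoder producing a diagonal Gaussian over R^m."""
    def __init__(self, vocab_size, d, m):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d)   # E in R^{|V| x d}
        self.cell = nn.LSTMCell(d, d)
        self.ff_mu = nn.Linear(d, m)             # mu(x)
        self.ff_logvar = nn.Linear(d, m)         # log sigma^2(x)

    def forward(self, x):                        # x: LongTensor of shape (T,)
        h = torch.zeros(1, self.cell.hidden_size)
        c = torch.zeros(1, self.cell.hidden_size)
        for t in range(x.size(0)):               # h_t, c_t <- LSTMCell(E x_t, (h, c))
            h, c = self.cell(self.emb(x[t]).unsqueeze(0), (h, c))
        return self.ff_mu(h), self.ff_logvar(h)  # parameters of N(mu, diag(sigma^2))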
Decoder pX|Z

◮ Input. Vector z ∈ R^m
◮ Parameters. Word embeddings E ∈ R^{|V|×d} (often tied with the encoder's), LSTMCell, feedforward FF2
◮ Forward. Given sentence y ∈ V^L compute its probability

      h1, c1 ← LSTMCell(Ey1, FF2(z))
      h2, c2 ← LSTMCell(Ey2, (h1, c1))
      . . .
      hL, cL ← LSTMCell(EyL, (hL−1, cL−1))

  where pX|Z(y|z) is the product over the L positions of softmax distributions over V computed from the hidden states
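A matching PyTorch sketch of the decoder; realizing FF2(z) as two linear maps producing (h0, c0), and the linear-plus-softmax output layer, are assumptions of the sketch.

import torch
import torch.nn as nn

class Decoder(nn.Module):
    """p(X|Z): LSTM decoder scoring a sentence given a code z in R^m."""
    def __init__(self, vocab_size, d, m, emb=None):
        super().__init__()
        # Pass the encoder's embedding to tie E, else create a fresh one.
        self.emb = emb if emb is not None else nn.Embedding(vocab_size, d)
        self.cell = nn.LSTMCell(d, d)
        self.ff_h = nn.Linear(m, d)              # FF2(z) -> h_0
        self.ff_c = nn.Linear(m, d)              # FF2(z) -> c_0
        self.out = nn.Linear(d, vocab_size)

    def nll(self, y, z):                         # y: (L,) indices, z: (1, m)
        h, c = self.ff_h(z), self.ff_c(z)
        nll = torch.zeros(())
        for t in range(y.size(0)):
            # Score y_t from the previous hidden state, then consume y_t.
            logp = torch.log_softmax(self.out(h), dim=-1)   # (1, |V|)
            nll = nll - logp[0, y[t]]
            h, c = self.cell(self.emb(y[t]).unsqueeze(0), (h, c))
        return nll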
Prior pZ

◮ Simplest: fixed standard normal pZ = N(0m, Im)
    ◮ Parameters. None
◮ Can also make it more expressive, for instance a mixture of K Gaussians

      pZ(z) = Σ_{k=1}^K softmax(γ)_k N(z; µk, diag(σ^2_k))

    ◮ Parameters. γ ∈ R^K and µk, σ^2_k ∈ R^m for k = 1 . . . K
    ◮ Multimodal instead of unimodal
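For the mixture prior, a sketch of the log-density computation; the shapes (gamma holding the K mixture logits, mu and logvar of shape (K, m)) are assumptions of the sketch.

import math
import torch

def mixture_prior_log_density(z, gamma, mu, logvar):
    """log pZ(z) for a mixture of K diagonal Gaussians; z: (m,)."""
    log_pi = torch.log_softmax(gamma, dim=0)                 # softmax(gamma)_k
    # log N(z; mu_k, diag(sigma_k^2)) for each component k, summed over dims
    log_comp = -0.5 * ((z - mu) ** 2 / logvar.exp()
                       + logvar + math.log(2 * math.pi)).sum(dim=1)
    return torch.logsumexp(log_pi + log_comp, dim=0)         # log sum_k pi_k N_k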
◮ Sentence X, m-dimensional vector Z
◮ Learnable parameters
    ◮ Word embeddings E shared by encoder and decoder
    ◮ LSTM and feedforward parameters in qZ|X
    ◮ LSTM and feedforward parameters in pX|Z
    ◮ (Optional) Parameters in the prior pZ
◮ We will now consider learning all these parameters together in a single objective
The VAE objective:

      max over pZ, pX|Z, qZ|X of E_{x∼popX} [ E_{z∼qZ|X(·|x)} [log pX|Z(x|z)] − DKL(qZ|X(·|x) || pZ) ]
Special case: plain autoencoding

      max over pX|Z, LSTM of E_{x∼popX} [ log pX|Z(x | LSTM(x)) ]

◮ Encoder is a deterministic LSTM mapping x to a vector
◮ No regularization (hence no role for prior)
Special case: denoising autoencoding

      max over pX|Z, LSTM of E_{x∼popX, ǫ∼pE} [ log pX|Z(x | LSTM(ǫ(x))) ]

◮ Equivalent to learning an encoder that is robust to the input noise ǫ
◮ Still no regularization, so no prior
◮ Example: masked language modeling
Example: BERT masked language modeling

      Input:       [CLS] the dog [MASK] [SEP] the cat [MASK] away [SEP]
      Predictions: IsNext        barked              ran

(Devlin et al., 2019)
The VAE objective again:

      max over pZ, pX|Z, qZ|X of E_{x∼popX} [ E_{z∼qZ|X(·|x)} [log pX|Z(x|z)] − DKL(qZ|X(·|x) || pZ) ]

◮ Great deal of flexibility in terms of how to optimize it
◮ Popular approach for the current setting
    ◮ Optimize the reconstruction term by sampling + reparameterization trick:

          z ∼ qZ|X(·|x)  ⇔  ǫ ∼ N(0m, Im),  z = µ(x) + σ(x) ⊙ ǫ

    ◮ Optimize the KL term in closed form:

          DKL(N(µ(x), diag(σ^2(x))) || N(0m, Im)) = (1/2) Σ_{i=1}^m [ σ^2_i(x) + µ_i(x)^2 − 1 − log σ^2_i(x) ]
VAE training step. Given a sentence x ∼ popX (in general a minibatch):

1. Encode: µ(x), σ^2(x) ← Encoder(x), where µ(x), σ^2(x) ∈ R^m
2. Compute the KL term in closed form:
      κ ← (1/2) Σ_{i=1}^m [ σ^2_i(x) + µ_i(x)^2 − 1 − log σ^2_i(x) ]
3. Compute the reconstruction term with the reparameterization trick:
      ǫ ∼ N(0m, Im),  ρ ← DecoderNLL(x, µ(x) + σ(x) ⊙ ǫ)
4. Take a gradient step on ρ + βκ, where β ≥ 0 is some weight.
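The whole step as a PyTorch sketch, reusing the Encoder and Decoder sketches above; beta is the KL weight.

import torch

def vae_loss(encoder, decoder, x, beta=1.0):
    """One VAE loss computation for a single sentence x."""
    mu, logvar = encoder(x)                  # mu(x), log sigma^2(x), each (1, m)
    # kappa: closed-form KL( N(mu, diag(sigma^2)) || N(0, I) )
    kappa = 0.5 * (logvar.exp() + mu ** 2 - 1 - logvar).sum()
    # rho: reconstruction NLL via the reparameterization trick
    eps = torch.randn_like(mu)               # eps ~ N(0, I)
    z = mu + (0.5 * logvar).exp() * eps      # z = mu + sigma * eps
    rho = decoder.nll(x, z)
    return rho + beta * kappa                # take a gradient step on rho + beta*kappa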
◮ Representation learning. Run encoder on a sentence x to obtain a continuous representation of x
◮ Controlled generation. Run decoder on some seed vector to generate a sentence
◮ Can “interpolate” between two sentences x1, x2 (see the sketch below):

      z1 ∼ qZ|X(·|x1),  z2 ∼ qZ|X(·|x2)
      xα ← Decode(αz1 + (1 − α)z2),  α ∈ [0, 1]
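A sketch of the interpolation recipe; generate stands for an assumed helper that decodes a sentence from a vector z (e.g. greedily), which is not defined here.

import torch

def interpolate(encoder, generate, x1, x2, alphas=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Decode points on the segment between the codes of x1 and x2."""
    with torch.no_grad():
        def sample(x):                       # z ~ q(Z|x) via reparameterization
            mu, logvar = encoder(x)
            return mu + (0.5 * logvar).exp() * torch.randn_like(mu)
        z1, z2 = sample(x1), sample(x2)
        return [generate(a * z1 + (1 - a) * z2) for a in alphas]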
A Surprisingly Effective Fix for Deep Latent Variable Modeling of Text (Li et al., 2019)
Generating Diverse High-Fidelity Images with VQ-VAE-2 (Razavi et al., 2019)
The VAE objective lower-bounds the log-likelihood:

      E_{x∼popX} [ E_{z∼qZ|X(·|x)} [log pX|Z(x|z)] − DKL(qZ|X(·|x) || pZ) ]  ≤  E_{x∼popX} [log pX(x)]

◮ Thus when you optimize VAE you are maximizing a lower bound on the marginal log-likelihood
◮ Taking gradient steps for decoder/encoder/prior jointly
◮ Difference with the classical EM: we no longer insist on setting qZ|X to the exact posterior pZ|X in the E-step
    ◮ Train a separate variational model qZ|X alongside pXZ
◮ Posterior collapse
◮ Quantities to monitor
Posterior collapse. Consider the objective with a fixed prior pZ = N(0m, Im):

      max over pX|Z, qZ|X of E_{x∼popX} [ E_{z∼qZ|X(·|x)} [log pX|Z(x|z)] − DKL(qZ|X(·|x) || N(0m, Im)) ]
If the decoder pX|Z is expressive enough to model x while ignoring z, the objective can be maximized with zero KL by setting qZ|X(·|x) = N(0m, Im) for every x, reducing it to

      max over pX|Z of E_{x∼popX} E_{z∼N(0m,Im)} [ log pX|Z(x|z) ]

where z carries no information about x.
What collapse looks like: the encoder maps every sentence to the prior,

      qZ|X(·|The company said it expects to report net income of $UNK-NUM million)
    = qZ|X(·|The two sides hadn’t met since Oct. 18.)
    = qZ|X(·|The inquiry soon focused on the judge.)
      . . .
    = qZ|X(·|Whatever sentence you provide)
    = N(0m, Im)

and the decoder ignores z:

      z = (0.1, 0.3, . . . , −0.7)    → The company said it expects to report net income of $UNK-NUM million
      z = (−0.6, 0.2, . . . , 0.2)    → The company said it expects to report net income of $UNK-NUM million
      . . .
      z = (0.2, 0.1, . . . , 0.1)     → The company said it expects to report net income of $UNK-NUM million
      z = (−0.8, −0.5, . . . , −0.5)  → The company said it expects to report net income of $UNK-NUM million
◮ Free bits (Kingma et al., 2016): replace the KL term with

      Σ_{i=1}^m max(λ, (1/2)[σ^2_i(x) + µ_i(x)^2 − 1 − log σ^2_i(x)])

  so that pushing a dimension’s KL below the target rate λ yields no further gain
◮ KL annealing (Bowman et al., 2016): weight on KL gradually increased from 0 to 1 over the course of training
◮ Current best practice (Li et al., 2019): do both with encoder pretraining (see the sketch below)
    ◮ Pretrain without KL term
    ◮ Reset decoder
    ◮ Train with annealing on the free-bits KL term
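A sketch of the free-bits KL computation, matching the formula above; lam is the per-dimension target rate (its default here is illustrative).

import torch

def free_bits_kl(mu, logvar, lam=0.5):
    """KL term with free bits: floor each dimension's KL at lam."""
    # Per-dimension KL( N(mu_i, sigma_i^2) || N(0, 1) ), shape (1, m)
    kl_per_dim = 0.5 * (logvar.exp() + mu ** 2 - 1 - logvar)
    # clamp(min=lam) = max(lam, KL_i): no gradient incentive below lam
    return torch.clamp(kl_per_dim, min=lam).sum()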
Quantities to monitor during training:

◮ NLL, estimated by the negative ELBO (an upper bound):

      −E_{x∼pop} [log pX(x)] ≤ E_{x∼pop} [ E_{z∼qZ|X(·|x)} [−log pX|Z(x|z)] + DKL(qZ|X(·|x) || pZ) ]

◮ Reconstruction error
◮ KL
◮ Mutual information between X and Z
◮ Number of active units (Burda et al., 2016); see the sketch below
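A sketch of the active-units count, following the Burda et al. (2016) criterion Var_x[µ_i(x)] > 0.01; using the posterior mean as the unit's activation and the 0.01 threshold follow that paper, while the helper name is illustrative.

import torch

def num_active_units(encoder, sentences, threshold=0.01):
    """Count latent dimensions whose posterior mean varies across inputs."""
    with torch.no_grad():
        mus = torch.cat([encoder(x)[0] for x in sentences], dim=0)  # (N, m)
        return int((mus.var(dim=0) > threshold).sum())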
Other applications:

◮ “Document hashing”: autoencode documents into compact binary latent codes for fast retrieval
◮ See introduction of Pelsmaeker and Aziz (2019) for other applications of deep latent variable models in NLP