Probabilistic Graphical Models
Inference & Learning in DL
Zhiting Hu
Lecture 19, March 29, 2017
Reading:
1
Deep Generative Models
2
- Explicit probabilistic models
  - Provide an explicit parametric specification of the distribution of x
  - Tractable likelihood function p_θ(x)
  - E.g., sigmoid belief nets, deep generative models parameterized with NNs (next slides)
3
- Explicit probabilistic models
  - Provide an explicit parametric specification of the distribution of x
  - Tractable likelihood function p_θ(x)
  - E.g., Sigmoid Belief Nets

$$p_\theta\big(x_i^{(\ell)} = 1 \mid \mathbf{x}^{(\ell+1)}\big) = \sigma\big(\mathbf{w}_i^\top \mathbf{x}^{(\ell+1)} + b_i\big), \qquad \mathbf{x}^{(\ell)} \in \{0,1\}^{D_\ell}$$
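To make the generative process concrete, here is a minimal numpy sketch of ancestral sampling in a small sigmoid belief net; the layer sizes, parameter names, and the uniform top-layer prior are illustrative choices, not from the slides.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sample_sbn(W2, b2, W1, b1, rng=np.random):
    """Ancestral sampling through a two-layer sigmoid belief net: top -> hidden -> visible."""
    x3 = (rng.rand(W2.shape[1]) < 0.5).astype(float)                    # top-layer binary units (illustrative 0.5 prior)
    x2 = (rng.rand(W2.shape[0]) < sigmoid(W2 @ x3 + b2)).astype(float)  # hidden layer given top layer
    x1 = (rng.rand(W1.shape[0]) < sigmoid(W1 @ x2 + b1)).astype(float)  # visible layer given hidden layer
    return x1
```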
4
- Explicit probabilistic models
  - Provide an explicit parametric specification of the distribution of x
  - Tractable likelihood function p_θ(x)
  - E.g., deep generative model parameterized with NNs (e.g., VAEs)
5
- Implicit probabilistic models
  - Define a stochastic process to simulate data x
  - Do not require a tractable likelihood function
  - Data simulator
  - A natural approach for problems in population genetics, weather, ecology, etc.
  - E.g., generate data from a deterministic equation given parameters and random noise (e.g., GANs)
- Consider a probabilistic model p_θ(x, z), with latent variables z and observed data x
- Assume a variational distribution q(z)
- Lower bound for the log-likelihood (see the bound below)
6
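For reference, the bound in question is the standard evidence lower bound (ELBO):

$$\log p_\theta(x) \;\ge\; \mathbb{E}_{q(z)}\big[\log p_\theta(x, z) - \log q(z)\big] \;=\; \mathcal{L}(\theta, q),$$

with equality when q(z) = p_θ(z|x).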
- Consider a generative model p_θ(x|z), with prior p(z)
- Variational bound (as above)
- Use an inference network q_φ(z|x)
- Maximize the bound w.r.t. the model parameters θ and the inference-network parameters φ
7
- Wake-sleep algorithm [Hinton et al., Science 1995]
- Generally applicable to a wide range of generative models
- Consider a generative model p_θ(x|z), with prior p(z)
- Free energy
- Inference network q_φ(z|x)
  - a.k.a. recognition network
8
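In the standard formulation, the free energy is the negative of the variational bound above:

$$F(x; \theta, \phi) \;=\; \mathbb{E}_{q_\phi(z \mid x)}\big[\log q_\phi(z \mid x) - \log p_\theta(x, z)\big] \;=\; -\mathcal{L}(\theta, \phi; x) \;\ge\; -\log p_\theta(x).$$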
- Free energy (as defined above)
- Minimize the free energy w.r.t. the generative parameters θ
9
[Figure courtesy: Maei's slides]
- Free energy (as defined above)
- Minimizing the free energy w.r.t. the recognition parameters φ directly is computationally expensive / high variance
- Instead, update φ using samples from the generative model:
  - "Dreaming" up samples from p_θ(x, z) through a top-down pass
  - Use the samples as targets for updating the recognition network
10
- Wake phase:
  - Use the recognition network to perform a bottom-up pass in order to create samples for the layers above (from data)
  - Train the generative network using samples obtained from the recognition model
- Sleep phase:
  - Use the generative weights to reconstruct data by performing a top-down pass
  - Train the recognition weights using samples obtained from the generative model
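A schematic of one wake-sleep iteration, assuming hypothetical sample / log_prob / dream interfaces on the generative and recognition networks (these names are illustrative, not from the slides):

```python
import torch

def wake_sleep_step(gen, rec, x, opt_gen, opt_rec):
    """One wake-sleep iteration (schematic).

    gen: generative network with gen.log_prob(x, z) and gen.dream() -> (z, x) ~ p_theta
    rec: recognition network with rec.sample(x) -> z and rec.log_prob(z, x)
    """
    # Wake phase: bottom-up samples from the recognition net are targets for the generator.
    z = rec.sample(x).detach()                 # inferred latents, treated as fixed
    loss_gen = -gen.log_prob(x, z).mean()      # maximize log p_theta(x, z)
    opt_gen.zero_grad(); loss_gen.backward(); opt_gen.step()

    # Sleep phase: top-down "dreamed" samples from the generator are targets for the recognition net.
    z_dream, x_dream = gen.dream()             # ancestral sample (z, x) ~ p_theta
    loss_rec = -rec.log_prob(z_dream, x_dream).mean()  # maximize log q_phi(z | x)
    opt_rec.zero_grad(); loss_rec.backward(); opt_rec.step()
```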
- Drawbacks:
  - KL is not symmetric
  - Doesn't optimize a well-defined objective function
  - Not guaranteed to converge
11
- Variational Auto-Encoders (VAEs) [Kingma & Welling, 2014]
- Enjoy similar applicability to the wake-sleep algorithm
  - Not applicable to discrete latent variables
  - Alternative: use control variates as in reinforcement learning [Mnih & Gregor, 2014]
12
- Generative model p_θ(x|z), with prior p(z)
  - a.k.a. decoder
- Inference network q_φ(z|x)
  - a.k.a. encoder, recognition network
- Variational lower bound (written out below)
13
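Written out in the notation above, the bound is

$$\mathcal{L}(\theta, \phi; x) \;=\; \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] \;-\; \mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p(z)\big) \;\le\; \log p_\theta(x).$$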
- Variational lower bound
- Optimize w.r.t. θ
  - The same as the wake phase
- Optimize w.r.t. φ
  - Directly computing the gradient with MC estimation gives a REINFORCE-like update rule, which suffers from high variance [Mnih & Gregor, 2014] (next lecture for more on REINFORCE)
  - VAEs use a reparameterization trick to reduce variance (see the sketch below)
14
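A minimal sketch of the reparameterization trick for a diagonal-Gaussian q_φ(z|x); the function and argument names are illustrative:

```python
import torch

def reparameterized_sample(mu, log_var):
    """Draw z ~ N(mu, diag(exp(log_var))) as a differentiable function of (mu, log_var).

    Writing z = mu + sigma * eps with eps ~ N(0, I) moves the randomness into eps,
    so gradients flow to the encoder parameters through mu and log_var.
    """
    eps = torch.randn_like(mu)                 # noise independent of phi
    return mu + torch.exp(0.5 * log_var) * eps
```

The Monte Carlo estimate of the bound then plugs this z into log p_θ(x|z), while the Gaussian KL term can be computed analytically.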
15
16
[Figure courtesy: Chang's slides]
17
$$\nabla_\theta\, \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x, z)\big] \;=\; \mathbb{E}_{q_\phi(z \mid x)}\big[\nabla_\theta \log p_\theta(x, z)\big]$$
18
19
20
Generated MNIST images [Gregor et al., 2015]
- Element-wise reconstruction error
  - For image generation, reconstructs every pixel
  - Sensitive to irrelevant variation, e.g., translations
- Variant: feature-wise (perceptual-level) reconstruction [Dosovitskiy et al., 2016] (see the sketch below)
  - Use a pre-trained neural network to extract features of the data
  - Generated images are required to have feature vectors similar to those of the data
- Variant: combine VAEs with GANs [Larsen et al., 2016] (more later)
21
Reconstruction results with different losses
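A minimal sketch of a feature-wise (perceptual) reconstruction loss; `feature_net` stands for some fixed, pre-trained feature extractor and is an assumed name:

```python
import torch
import torch.nn.functional as F

def perceptual_reconstruction_loss(x, x_recon, feature_net):
    """Compare reconstruction and data in the feature space of a fixed pre-trained network,
    rather than pixel by pixel."""
    with torch.no_grad():
        target_feat = feature_net(x)       # features of the real data; no gradient needed
    recon_feat = feature_net(x_recon)      # gradients flow back to the decoder through this term
    return F.mse_loss(recon_feat, target_feat)
```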
- Not applicable to discrete latent variables
  - Differentiable reparameterization does not apply to discrete variables
  - The wake-sleep algorithm and GANs allow discrete latents
- Variant: marginalize out discrete latents [Kingma et al., 2014]
  - Expensive when the discrete space is large
- Variant: use continuous approximations
  - Gumbel-softmax [Jang et al., 2017] for approximating multinomial variables (see the sketch below)
- Variant: combine VAEs with the wake-sleep algorithm [Hu et al., 2017]
22
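A minimal sketch of Gumbel-softmax sampling for a categorical latent (the temperature default is an illustrative choice):

```python
import torch
import torch.nn.functional as F

def gumbel_softmax_sample(logits, temperature=1.0):
    """Differentiable approximation of a categorical sample.

    Add Gumbel(0, 1) noise to the logits and apply a temperature-controlled softmax;
    as temperature -> 0 the samples approach one-hot categorical draws.
    """
    gumbel_noise = -torch.log(-torch.log(torch.rand_like(logits) + 1e-20) + 1e-20)
    return F.softmax((logits + gumbel_noise) / temperature, dim=-1)
```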
- Simple, fixed prior distribution, e.g., p(z) = 𝒩(z; 0, I)
  - For ease of inference and learning
  - Limited flexibility: converts the data distribution to a fixed, single-mode prior distribution
- Variant: use hierarchical nonparametric priors [Goyal et al., 2017]
  - E.g., Dirichlet process, nested Chinese restaurant process (more later)
  - Learn the structures of the priors jointly with the model
23
24
25
- Implicit probabilistic models
  - Define a stochastic process to simulate data x
  - Do not require a tractable likelihood function
  - Data simulator
  - A natural approach for problems in population genetics, weather, ecology, etc.
  - E.g., generate data from a deterministic equation given parameters and random noise (e.g., GANs)
26
- Generator G
  - Maps a noise variable z to the data space
- Discriminator D
  - Outputs the probability that an input came from the data rather than the generator
- Learning
  - Train D to maximize the probability of assigning the correct label to both training examples and generated samples
  - Train G to fool the discriminator
27
[Figure courtesy: Kim's slides]
- For D: binary cross-entropy with label 1 for real, 0 for fake
- For G: minimize the objective below
- Alternate training of D and G (see the sketch after the objective)
28
$$\min_G \; \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big]$$
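A minimal sketch of the alternating updates, assuming a PyTorch-style generator `G` and a discriminator `D` that outputs probabilities of shape (batch, 1); all names and shapes are illustrative:

```python
import torch
import torch.nn.functional as F

def gan_step(G, D, x_real, opt_d, opt_g, noise_dim):
    """One alternating GAN update: a discriminator step, then a generator step."""
    batch = x_real.size(0)

    # Discriminator step: label 1 for real data, 0 for generated samples.
    z = torch.randn(batch, noise_dim)
    x_fake = G(z).detach()                                  # do not backprop into G here
    d_loss = F.binary_cross_entropy(D(x_real), torch.ones(batch, 1)) + \
             F.binary_cross_entropy(D(x_fake), torch.zeros(batch, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: fool the discriminator (non-saturating form, discussed a couple of slides later).
    z = torch.randn(batch, noise_dim)
    g_loss = F.binary_cross_entropy(D(G(z)), torch.ones(batch, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```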
29
$$
\begin{aligned}
C(G) &= \max_D V(G, D) \\
&= \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D^*_G(x)\big] + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D^*_G(G(z))\big)\big] \\
&= \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D^*_G(x)\big] + \mathbb{E}_{x \sim p_g}\big[\log\big(1 - D^*_G(x)\big)\big] \\
&= \mathbb{E}_{x \sim p_{\text{data}}}\!\left[\log \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}\right] + \mathbb{E}_{x \sim p_g}\!\left[\log \frac{p_g(x)}{p_{\text{data}}(x) + p_g(x)}\right],
\end{aligned}
$$

where the optimal discriminator is D*_G(x) = p_data(x) / (p_data(x) + p_g(x)). The global minimum C* = −log 4 is achieved if and only if p_g = p_data.
- Alternate between k steps of optimizing D and one step of optimizing G
- The original generator objective

  $$\min_G \; \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big]$$

  suffers from vanishing gradients when D is too strong
- Used instead in practice:

  $$\max_G \; \mathbb{E}_{z \sim p_z}\big[\log D(G(z))\big]$$

30
- Instability of training
  - Requires a careful balance between the training of D and G
  - [Arjovsky & Bottou, 2017]: under certain conditions, the gradient of the practical generator objective E_z[log D(G(z))] is a centered Cauchy distribution with infinite expectation and variance
- Mode collapse
  - Generated samples are often from only a few modes of the data distribution
- Many practical training techniques have been proposed
  - E.g., [Salimans et al., 2016]: minibatch discrimination, one-sided label smoothing, …
31
- Generating images
32
- Generating images
33
- Generating images
34
- Variational Auto-encoders:
  - Probabilistic graphical model framework
  - Allow efficient Bayesian inference
  - Generated samples tend to be blurry
    - An issue of maximum likelihood training
- Generative Adversarial Networks:
  - Generate sharp images
  - Do not support inference (x → z)
  - Do not support discrete visible variables
35
- Do not support inference
  - No mechanism for inferring z from x
  - Variant: additionally learn an inference network [Dumoulin et al., 2016; Donahue et al., 2016]
36
37
- Do not support discrete visible variables
  - Non-differentiability of samples hinders gradient backpropagation
- Variant: treat generator training as policy learning [Yu et al., 2017]
  - High variance, slow convergence
- Variant: continuous approximations
  - Gumbel-softmax [Kusner & Hernández-Lobato, 2016]
  - Only qualitative results with a small discrete space
- Variant: combine VAEs and GANs
  - Use VAEs to handle discrete visibles, GANs to handle discrete latents
  - Wake-sleep style learning
38
- Unable to control the attributes of generated samples
  - Uninterpretability of the input latent vector z
  - An issue shared with VAEs and other DNN methods
- Variant: add a mutual-information regularizer to enforce disentangled hidden codes [Chen et al., 2016]
  - Unsupervised
  - The semantics of each dimension are observed after training, rather than designated by users in a controllable way
39
- Unable to control the attributes of generated samples
  - Uninterpretability of the input latent vector z
  - An issue shared with VAEs and other DNN methods
- Variant: add a mutual-information regularizer to enforce disentangled hidden codes [Chen et al., 2016]
  - Unsupervised
  - The semantics of each dimension are observed after training, rather than designated by users in a controllable way
- Variant: use supervision information to enforce designated semantics on certain dimensions of z [Odena et al., 2017; Hu et al., 2017]
- Harnessing DNNs with logic rules [Hu et al., 2016]
- Deep NNs
  - Heavily rely on massive labeled data
  - Uninterpretable
  - Hard to encode human intention / domain knowledge
40
- How humans learn
  - Learn from concrete examples (as DNNs do)
  - Learn from general knowledge and rich experiences [Minsky, 1980; Lake et al., 2015]
  - E.g., the past tense of verbs1:
    add -> added, accept -> accepted, ignore -> ignored, end -> ended, block -> blocked, love -> loved
41
1 https://www.technologyreview.com/s/544606/can-this-man-make-aimore-human
- Logic rules
- Input-target space: (X, Y)
- Soft first-order logic (FOL) rules: (R, λ)
42
- Neural network p_θ(y|x)
43
- Neural network p_θ(y|x)
- Train to imitate the outputs of a rule-regularized teacher
44
- Neural network p_θ(y|x)
- Train to imitate the outputs of a rule-regularized teacher
45
- Teacher network q(y|x)
46
47
- At each iteration (sketched below):
  - Construct the teacher network q(y|x) by projecting p_θ(y|x) onto the rule-regularized subspace
  - Update p_θ to imitate both the true labels and the teacher's soft predictions
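A rough sketch of one such iteration, in the spirit of the cited approach; `rule_penalty(x)` is a hypothetical function returning per-class penalties for rule violations, `C` the rule strength, and `pi` the imitation weight:

```python
import torch
import torch.nn.functional as F

def rule_distillation_step(model, x, y_true, rule_penalty, optimizer, pi=0.5, C=1.0):
    """Build a rule-regularized teacher from the student's own predictions,
    then train the student on both the true labels and the teacher's soft outputs."""
    logits = model(x)                                    # student p_theta(y|x)
    p_student = F.softmax(logits, dim=-1)

    # Teacher: project the student distribution toward the rules,
    # q(y|x) proportional to p_theta(y|x) * exp(-C * rule_penalty(x, y)).
    with torch.no_grad():
        q_teacher = p_student * torch.exp(-C * rule_penalty(x))   # (batch, num_classes)
        q_teacher = q_teacher / q_teacher.sum(dim=-1, keepdim=True)

    # Student update: imitate the true labels and the teacher's soft predictions.
    loss = (1 - pi) * F.cross_entropy(logits, y_true) + \
           pi * F.kl_div(F.log_softmax(logits, dim=-1), q_teacher, reduction='batchmean')
    optimizer.zero_grad(); loss.backward(); optimizer.step()
```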
- Sentence => positive / negative
- Base network: CNN [Kim, 2014]
- Rule: a sentence S with structure A-but-B => the sentiment of B dominates
48
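One way to write this as a soft FOL rule, roughly following the formulation in the cited paper, where σ_θ(B)_+ denotes the network's soft prediction that clause B is positive:

$$\text{has-structure-``A-but-B''}(S) \;\Rightarrow\; \big(\mathbb{1}[y = +] \Leftrightarrow \sigma_\theta(B)_+\big).$$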
49
[Results table: accuracy (%)]
References:
- [Dosovitskiy et al., 2016] "Generating Images with Perceptual Similarity Metrics based on Deep Networks", NIPS'16
- [Larsen et al., 2016] "Autoencoding beyond pixels using a learned similarity metric", ICML'16
- [Kingma et al., 2014] "Semi-supervised learning with deep generative models", NIPS'14
- [Jang et al., 2017] "Categorical Reparameterization with Gumbel-Softmax", ICLR'17
- [Hu et al., 2017] "Controllable Text Generation", 2017
- [Goyal et al., 2017] "Nonparametric Variational Auto-encoders for Hierarchical Representation Learning", 2017
- [Goodfellow et al., 2014] "Generative Adversarial Nets", NIPS'14
- [Arjovsky & Bottou, 2017] "Towards Principled Methods for Training Generative Adversarial Networks", ICLR'17
- [Salimans et al., 2016] "Improved Techniques for Training GANs", NIPS'16
- [Isola et al., 2016] "Image-to-Image Translation with Conditional Adversarial Networks", 2016
- [Purushotham et al., 2017] "Variational Recurrent Adversarial Deep Domain Adaptation", ICLR'17
- [Ho & Ermon, 2016] "Generative Adversarial Imitation Learning", NIPS'16
- [Dumoulin et al., 2016] "Adversarially Learned Inference", ICLR'17
- [Donahue et al., 2016] "Adversarial Feature Learning", 2016
- [Yu et al., 2017] "SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient", AAAI'17
- [Kusner & Hernández-Lobato, 2016] "GANs for sequences of discrete elements with the Gumbel-softmax distribution", 2016
- [Chen et al., 2016] "InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets", NIPS'16
- [Odena et al., 2017] "Conditional image synthesis with auxiliary classifier GANs", 2017
- [Hu et al., 2016] "Harnessing DNNs with logic rules", ACL'16
- [Minsky, 1980] "Learning meaning", Technical Report, AI Lab Memo, 1980
- [Lake et al., 2015] "Human-level concept learning through probabilistic program induction", Science'15
- [Kim, 2014] "Convolutional neural networks for sentence classification", EMNLP'14

Figure credits:
- Hamid Reza Maei, Wake-Sleep algorithm for Representational Learning
- Mark Chang, Variational Autoencoder
- Namju Kim, Generative Adversarial Networks (GAN)
50