SLIDE 1 Raising the Reliability of Estimates of Generative Performance of MRFs
Yuri Burda, Fields Institute. Joint work with Roger Grosse and Ruslan Salakhutdinov
Workshop on Big Data and Statistical Machine Learning, Fields Institute, January 28, 2015
SLIDE 2 Markov Random Fields
- Express relations between random variables
- Are very powerful
- Can be used as generative models
SLIDE 7
MRFs are powerful
SLIDE 8 Restricted Boltzmann Machines – an example of an MRF
[Figure: an image supplies the visible variables v; the hidden variables h are latent]
Energy: E(v, h) = -b^T v - c^T h - v^T W h (pair-wise term v^T W h; unary terms b^T v, c^T h)
SLIDE 9 Generating Samples from RBMs
Run a Markov chain (alternating Gibbs sampling):
- Start from a random visible configuration v
- Alternately sample h ~ p(h | v) and v ~ p(v | h)
- As T → ∞, the chain converges to the equilibrium distribution of the RBM
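The alternating Gibbs chain can be sketched in a few lines of NumPy; the weights here are random placeholders rather than a trained RBM, so this only illustrates the sampling mechanics:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_sample_rbm(W, b, c, n_steps, rng):
    """Alternating Gibbs sampling for a binary RBM with energy
    E(v, h) = -b@v - c@h - v@W@h.  Starts from a random visible
    vector; for large n_steps the chain approaches equilibrium."""
    n_vis, n_hid = W.shape
    v = (rng.random(n_vis) < 0.5).astype(float)                      # random start
    for _ in range(n_steps):
        h = (rng.random(n_hid) < sigmoid(c + v @ W)).astype(float)   # sample h | v
        v = (rng.random(n_vis) < sigmoid(b + W @ h)).astype(float)   # sample v | h
    return v

rng = np.random.default_rng(0)
W = rng.normal(0.0, 0.1, size=(6, 4))   # placeholder parameters
b = np.zeros(6)
c = np.zeros(4)
v = gibbs_sample_rbm(W, b, c, n_steps=100, rng=rng)
```

In practice one would plug in the parameters of a trained RBM with far more units; the chain structure is unchanged.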
SLIDE 17 Generating Samples from MRFs
- In general, useful MRFs admit tractable MCMC transition operators
- By running MCMC for many steps, one can generate (approximate) samples from the MRF's equilibrium distribution
SLIDE 18
Model comparison/selection
SLIDE 19 You can't judge a model by its samples
Better to compare validation/test data log-likelihood
SLIDE 21 Computing log-likelihood for MRFs
log p(v) = log f(v) − log Z, where f(v) = Σ_h e^{−E(v,h)} and Z = Σ_{v,h} e^{−E(v,h)}
It's hard to compute the exponentially large sum Z
But simple enough to approximate it
SLIDE 28 Importance sampling
Z = Σ_x f(x) = E_{x∼q}[f(x)/q(x)] ≈ (1/K) Σ_{k=1}^{K} f(x_k)/q(x_k), with x_k ∼ q
Variance of this estimator can be large
SLIDE 35 Annealed Importance Sampling
Z_K/Z_0 = (Z_1/Z_0)(Z_2/Z_1)···(Z_K/Z_{K−1})
Each factor is a ratio of normalizers of two nearby distributions, so it can be estimated reliably; the variance of each factor does not depend on the size of the overall gap between the initial and target distributions
SLIDE 40 Annealed Importance Sampling
[Figure: annealing path from a simple base distribution to the MRF; each step applies an MCMC transition operator with an intermediate energy interpolating between the base energy and the MRF energy]
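A minimal AIS sketch for the same style of toy model (independent binary units). Because the units are independent, an exact resample serves as the MCMC transition leaving each intermediate distribution invariant; in a real MRF this step would be a few Gibbs updates instead. All parameters are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n, K, n_chains = 10, 100, 500
rng = np.random.default_rng(1)
theta = rng.normal(0.0, 1.0, size=n)
betas = np.linspace(0.0, 1.0, K + 1)            # inverse temperatures 0 -> 1

def log_f(x, beta):                             # log of unnormalized p_beta
    return beta * (x @ theta)

# p_0 is uniform, so Z_0 = 2^n; AIS estimates the ratio Z_K / Z_0
log_w = np.zeros(n_chains)
x = (rng.random((n_chains, n)) < 0.5).astype(float)   # exact samples from p_0
for k in range(1, K + 1):
    log_w += log_f(x, betas[k]) - log_f(x, betas[k - 1])
    # transition leaving p_{beta_k} invariant (exact here: independent bits)
    x = (rng.random((n_chains, n)) < sigmoid(betas[k] * theta)).astype(float)

log_Z_est = n * np.log(2.0) + np.log(np.mean(np.exp(log_w)))
log_Z_true = np.sum(np.log1p(np.exp(theta)))    # closed form for independent bits
```

Even with only 100 intermediate distributions the per-factor ratios are nearly constant, so the weights have tiny variance; compare with the single-step importance sampler above.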
SLIDE 47 Annealed Importance Sampling
- AIS reduces the variance of the estimator dramatically
- However, like all importance samplers, it tends to underestimate the sum
- Hence it is unreliable for model comparison/selection
[Plot: AIS estimates falling below the true value, labeled "underestimate"]
SLIDE 53 Importance Sampling underestimates
Jensen's inequality: E[log Ẑ] ≤ log E[Ẑ] = log Z
Markov's inequality: P(Ẑ ≥ a·Z) ≤ 1/a, so the estimate is rarely much too large, but can be much too small
The proposal distribution q is likely to miss some modes of f
SLIDE 60
Importance Sampling underestimates Z, and hence overestimates the log-likelihood log p(v) = log f(v) − log Z!
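The downward bias from Jensen's inequality is easy to see in simulation: with few samples and a mismatched proposal, the average of log Ẑ falls clearly below log Z. The peaked toy target and uniform proposal below are illustrative choices:

```python
import numpy as np

n = 10
rng = np.random.default_rng(2)
theta = rng.normal(0.0, 2.0, size=n)            # peaked target: poor match to uniform q
log_Z_true = np.sum(np.log1p(np.exp(theta)))    # exact, since the bits are independent

K, reps = 20, 2000                              # deliberately few samples per estimate
log_Z_hats = np.empty(reps)
for r in range(reps):
    xs = (rng.random((K, n)) < 0.5).astype(float)
    w = np.exp(xs @ theta) * 2.0 ** n           # f(x) / q(x), q uniform
    log_Z_hats[r] = np.log(w.mean())

bias = log_Z_hats.mean() - log_Z_true           # negative: log Z_hat underestimates
```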
SLIDE 62 Contrast with situation in Directed Graphical Models
- Easy to get exact samples (ancestral sampling)
- Evaluating p(v) = Σ_h p(v, h) still requires evaluating an exponentially large sum
SLIDE 67 Importance Sampling to the rescue
p(v) = Σ_h p(v, h) = E_{h∼q(h|v)}[p(v, h)/q(h|v)] ≈ (1/K) Σ_{k=1}^{K} p(v, h_k)/q(h_k|v), with h_k ∼ q(·|v)
Moreover, the estimator we get is a stochastic lower bound on log p(v)
We just have to make sure that the proposal distribution q(h|v) is close to the posterior p(h|v)
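As a concrete instance, here is a tiny sigmoid belief network with hypothetical random parameters, small enough that the importance-sampling estimate of log p(v) can be checked against exact enumeration of the hidden states. The prior is used as the proposal purely for simplicity; the point of the slide is that a q(h|v) close to the posterior is what makes the estimator usable at realistic model sizes:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(4)
nh, nv = 4, 6                                # hypothetical tiny network
pi = np.full(nh, 0.5)                        # prior over hidden bits
W = rng.normal(0.0, 1.0, size=(nh, nv))
b = rng.normal(0.0, 0.5, size=nv)

def log_bern(x, p):                          # sum of Bernoulli log-probabilities
    return float(np.sum(x * np.log(p) + (1 - x) * np.log(1 - p)))

def log_joint(h, v):                         # log p(v, h) = log p(h) + log p(v | h)
    return log_bern(h, pi) + log_bern(v, sigmoid(h @ W + b))

v = (rng.random(nv) < 0.5).astype(float)

# exact log p(v) by enumerating all 2^nh hidden configurations
hs = np.array([[(i >> j) & 1 for j in range(nh)]
               for i in range(2 ** nh)], dtype=float)
log_p_exact = np.logaddexp.reduce(np.array([log_joint(h, v) for h in hs]))

# importance sampling: p(v) = E_{h~q}[p(v, h) / q(h)], here with q = prior
K = 20_000
h_samp = (rng.random((K, nh)) < pi).astype(float)
log_w = np.array([log_joint(h, v) - log_bern(h, pi) for h in h_samp])
log_p_hat = np.logaddexp.reduce(log_w) - np.log(K)
```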
SLIDE 74 Reverse Annealing: a directed graphical model out of an undirected one
[Figure: the AIS chain read as a directed model — start from the base distribution and apply the sequence of MCMC transitions with intermediate energies; the final state is an approximate sample from the MRF, and the whole chain defines a directed graphical model]
SLIDE 78 Reverse Annealing Importance Sampling Estimator — RAISE
- As easy to implement as AIS
- Gives a model that approximates the original MRF
- Asymptotically equivalent to the original MRF
- Exact samples can be generated from the approximate model
- RAISE is a stochastic lower bound on the log-likelihood of the approximate model
SLIDE 83 In practice
Evaluated the procedure on
- Small RBMs
- Large RBMs trained with
- Contrastive Divergence with 1 step
- Contrastive Divergence with 25 steps
- Persistent Contrastive Divergence
- Large Deep Belief Networks
- Large Deep Boltzmann Machines
- Models of handwritten digits and models of handwritten characters
SLIDE 84 Typical Results
[Plots: MNIST DBN; MNIST DBM]
SLIDE 85 Typical Results
[Plot: MNIST, CD-1 trained RBM, 500 hidden units]
SLIDE 86 Non-typical Results
[Plot: MNIST, CD-1 trained RBM with 20 hidden units]
SLIDE 87 MNIST and Omniglot RBMs Results
- CSL: the Conservative Sampling-based Log-likelihood estimator of Bengio et al.
  (Y. Bengio, L. Yao, and K. Cho. Bounding the test log-likelihood of generative models.)
- RAISE errs on the side of underestimating the log-likelihood.
- The gap is typically very small!
SLIDE 88 Empirical observations
- Annealing from the data base-rates model typically gives better AIS estimates than annealing from the uniform distribution
- The RAISE model approximates the original MRF reasonably well with 1,000 – 100,000 intermediate distributions
- For models that don't model the data distribution well (overfitting, undertraining, etc.) the RAISE model can be substantially better than the original MRF.
SLIDE 91 Empirical observations
- It's really hard to know when AIS is or isn't working, and RAISE can give a clue about that
- It's likely that most, but not all, published results based on AIS estimates with enough intermediate distributions are reliable.
SLIDE 93 Computational Tricks
- RAISE requires estimating a large sum for each test sample, which is computationally expensive
- The method of control variates gives a way of using few test samples to achieve reasonably reliable estimates
- RAISE estimates and MRF unnormalized probabilities tend to be tightly correlated
- Hence (1/n) Σ_{i=1}^{n} log f(v_i) + (1/k) Σ_{j=1}^{k} (RAISE(v_j) − log f(v_j)) is a low-variance estimator of the average RAISE estimate over the test set. Here v_1, …, v_n are random test set samples and k is small.
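The control-variates estimator can be sketched with synthetic numbers standing in for both quantities (all values below are hypothetical; in the real setting log f(v) comes from the MRF's unnormalized probability and L(v) from RAISE):

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 10_000, 100                     # test-set size; expensive evaluations

# synthetic stand-ins: cheap log f(v_i) for every test sample, and an
# expensive estimate L(v_i) that is tightly correlated with it
log_f = rng.normal(-90.0, 5.0, size=n)
L = log_f - 10.0 + rng.normal(0.0, 0.3, size=n)

true_mean = L.mean()                   # target: average over the full test set
subset = rng.choice(n, size=k, replace=False)

naive = L[subset].mean()                                # uses k samples only
cv = log_f.mean() + (L[subset] - log_f[subset]).mean()  # control-variate estimate
```

Because the residual L − log f has far smaller variance than L itself, the control-variate estimate is much more stable than the naive subset average at the same k.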
SLIDE 98 Pretraining Very Very Deep Models
- Train an RBM or a DBN
- Unroll the model using RAISE to create a sigmoid belief network with 100 or 1000 layers
- Use p and q to fine-tune the model with an appropriate algorithm:
  - wake-sleep (Hinton et al., 1995)
  - reweighted wake-sleep (Bornschein & Bengio, 2014)
  - neural variational inference (Mnih & Gregor, 2013)
- Brag about the deepest network that has ever been trained
SLIDE 102 Conclusions
- Comparing MRF models using variants of importance sampling (AIS, sequential Monte Carlo, etc.) is unreliable
- RAISE is as easy to use as AIS, and should be used either instead of or in conjunction with AIS when comparing models
- One can (in principle) replace any MRF with a directed graphical model that has a tractable approximation to the posterior
SLIDE 105 Thank you
Burda, Y., Grosse, R., and Salakhutdinov, R. Accurate and Conservative Estimates of MRF Log-likelihood using Reverse Annealing. AISTATS 2015.