SLIDE 1

Raising the Reliability of Estimates of Generative Performance of MRFs

Yuri Burda, Fields Institute Joint work with Roger Grosse and Ruslan Salakhutdinov

Workshop on Big Data and Statistical Machine Learning, Fields Institute, January 28, 2015

SLIDES 2–6

Markov Random Fields

  • Express relations between random variables
  • Are very powerful
  • Can be used as generative models

SLIDE 7

MRFs are powerful

SLIDE 8

Restricted Boltzmann Machines – example of an MRF

(Figure: image pixels are the visible variables v; the hidden variables h sit above them.)

Energy, with pair-wise and unary terms as labelled on the slide (standard RBM form):

  E(v, h) = −v⊤W h − b⊤v − c⊤h

SLIDES 9–16

Generating Samples from RBMs

Run a Markov chain (alternating Gibbs sampling), starting from a random visible configuration. As T → ∞ the chain converges to the equilibrium distribution.

SLIDE 17

Generating Samples from MRFs

  • In general, useful MRFs admit tractable MCMC transition operators
  • Running MCMC for many steps, one can generate (approximate) samples from the MRF's equilibrium distribution
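As a concrete illustration, here is a minimal NumPy sketch of the alternating Gibbs chain for a binary RBM. The function name, the toy weights, and the chain length are illustrative assumptions, not the talk's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_sample(W, b, c, n_steps=1000):
    """Alternating Gibbs sampling for a binary RBM with weight matrix W
    (n_vis x n_hid), visible biases b, hidden biases c. Returns an
    (approximate) sample from the equilibrium distribution."""
    v = rng.integers(0, 2, size=len(b)).astype(float)   # random start
    for _ in range(n_steps):
        # sample hidden units given visibles, then visibles given hiddens
        h = (rng.random(len(c)) < sigmoid(v @ W + c)).astype(float)
        v = (rng.random(len(b)) < sigmoid(W @ h + b)).astype(float)
    return v

# Tiny demo: 3 visible and 2 hidden units with arbitrary small weights.
v = gibbs_sample(0.1 * np.ones((3, 2)), np.zeros(3), np.zeros(2), n_steps=50)
```

Running the chain longer brings the samples closer to the equilibrium distribution, which is exactly the "T = infinity" limit on the slides.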

SLIDES 18–20

Model comparison/selection

You can't judge a model by its samples. Better to compare validation/test data log-likelihood.

SLIDES 21–27

Computing log-likelihood for MRFs

For an MRF p(x) = f(x)/Z with Z = Σ_x f(x), the hard part is the partition function Z: it's hard to compute an exponentially large sum, but simple enough to approximate it.
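To see why the sum is only tractable for tiny models, here is an illustrative NumPy sketch (not from the talk) that computes log Z for a binary RBM by brute-force enumeration:

```python
import itertools
import numpy as np

def exact_log_Z(W, b, c):
    """Log partition function of a binary RBM by brute-force enumeration
    over all 2^(n_vis + n_hid) configurations. The sum is exponentially
    large, so this is only feasible for tiny models."""
    n_vis, n_hid = W.shape
    vals = []
    for v in itertools.product([0.0, 1.0], repeat=n_vis):
        v = np.array(v)
        for h in itertools.product([0.0, 1.0], repeat=n_hid):
            h = np.array(h)
            vals.append(v @ W @ h + b @ v + c @ h)   # this is -E(v, h)
    vals = np.array(vals)
    m = vals.max()
    return m + np.log(np.exp(vals - m).sum())        # stable log-sum-exp

# With all-zero parameters every configuration has energy 0,
# so Z = 2^(n_vis + n_hid) = 16 for a 2x2 model.
lz = exact_log_Z(np.zeros((2, 2)), np.zeros(2), np.zeros(2))
```

Doubling the number of units squares the number of terms, which is why the rest of the talk is about approximating this sum instead.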

SLIDES 28–34

Importance sampling

Estimate Z = Σ_x f(x) with samples from a tractable proposal q: since Z = E_{x∼q}[f(x)/q(x)], take Ẑ = (1/k) Σ_i f(x_i)/q(x_i) with x_i ∼ q.

Variance of this estimator can be large.
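A minimal sketch of the importance-sampling estimator on a 1-D toy problem; the Gaussian target and the broad Gaussian proposal are assumptions chosen for illustration (the true Z here is sqrt(2*pi)):

```python
import numpy as np

rng = np.random.default_rng(0)

def importance_estimate_Z(log_f, q_sample, q_logpdf, k=10000):
    """Importance-sampling estimate of Z = integral of f:
    Z_hat = mean of f(x_i)/q(x_i) for x_i ~ q."""
    xs = q_sample(k)
    log_w = log_f(xs) - q_logpdf(xs)   # log importance weights
    return np.exp(log_w).mean()

# Toy target: unnormalized Gaussian f(x) = exp(-x^2/2), so Z = sqrt(2*pi).
log_f = lambda x: -0.5 * x ** 2
# Broad proposal q = N(0, 2^2); a poor proposal would blow up the variance.
q_sample = lambda k: rng.normal(0.0, 2.0, size=k)
q_logpdf = lambda x: -0.5 * (x / 2.0) ** 2 - np.log(2.0 * np.sqrt(2 * np.pi))

Z_hat = importance_estimate_Z(log_f, q_sample, q_logpdf)
```

With a proposal far from the target the weights become heavy-tailed and the estimator's variance grows, which is the problem AIS addresses next.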

SLIDES 35–40

Annealed Importance Sampling

[equations lost in extraction; surviving caption fragment: "does not depend on size of"]
SLIDES 41–46

Annealed Importance Sampling

Anneal from a tractable base distribution through a sequence of intermediate distributions to the target MRF. Each intermediate step uses an MCMC transition operator for an MRF with the corresponding intermediate energy.
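The annealing scheme can be sketched on the same kind of 1-D toy problem. The geometric path between base and target and the Metropolis transition operators below are standard choices; the whole example is illustrative, not the talk's RBM implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Base q0 = N(0, 2^2) (normalized, so Z_0 = 1); target f(x) = exp(-x^2/2)
# with true log Z = log sqrt(2*pi).
log_q0 = lambda x: -0.5 * (x / 2.0) ** 2 - np.log(2.0 * np.sqrt(2 * np.pi))
log_f = lambda x: -0.5 * x ** 2
# Geometric path of intermediate unnormalized distributions f_b.
log_path = lambda x, b: (1 - b) * log_q0(x) + b * log_f(x)

def ais_log_Z(n_chains=500, n_betas=100, mh_steps=2):
    """AIS: exact samples from the base, then alternate weight updates
    and MCMC transitions for each intermediate distribution."""
    betas = np.linspace(0.0, 1.0, n_betas)
    x = rng.normal(0.0, 2.0, size=n_chains)      # exact base samples
    log_w = np.zeros(n_chains)
    for b_prev, b in zip(betas[:-1], betas[1:]):
        log_w += log_path(x, b) - log_path(x, b_prev)   # weight update
        for _ in range(mh_steps):                       # Metropolis at beta=b
            prop = x + rng.normal(0.0, 0.5, size=n_chains)
            accept = np.log(rng.random(n_chains)) < log_path(prop, b) - log_path(x, b)
            x = np.where(accept, prop, x)
    m = log_w.max()
    return m + np.log(np.exp(log_w - m).mean())  # log of mean weight

log_Z_est = ais_log_Z()
```

Because neighbouring intermediate distributions are close, each incremental weight has low variance, which is what makes AIS so much more stable than plain importance sampling.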

SLIDES 47–52

Annealed Importance Sampling

  • AIS reduces the variance of the estimator dramatically
  • However, like all importance samplers, it tends to underestimate the sum
  • Hence it is unreliable for model comparison/selection

It underestimates the partition function, and hence overestimates the log-likelihood.
SLIDES 53–59

Importance Sampling underestimates

  • Jensen's inequality: E[log Ẑ] ≤ log E[Ẑ] = log Z, so the estimator underestimates log Z on average
  • Markov's inequality: P(Ẑ ≥ e^b Z) ≤ e^(−b), so large overestimates are exponentially unlikely
  • The proposal distribution q is likely to miss some modes of f
SLIDES 60–61

Importance Sampling underestimates Z, and hence overestimates the log-likelihood!

SLIDES 62–66

Contrast with the situation in Directed Graphical Models

Easy to get exact samples (by ancestral sampling). Evaluating the likelihood still requires evaluating an exponentially large sum over the hidden variables.

SLIDES 67–73

Importance Sampling to the rescue

Estimate p(v) = Σ_h p(v, h) by sampling h_i ∼ q(h | v) and averaging p(v, h_i)/q(h_i | v).

Moreover, the estimator we get is a stochastic lower bound on the log-likelihood.

We just have to make sure that the proposal distribution is close to the posterior.
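A toy discrete example of this estimator; the two-state model and the uniform proposal are assumptions chosen for illustration. Averaged over many runs, the estimate sits just below the true log-likelihood, as Jensen's inequality predicts:

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny directed model: hidden h ~ Bernoulli(0.3), then visible v | h.
p_h = np.array([0.7, 0.3])
p_v_given_h = np.array([[0.9, 0.1],    # p(v | h=0)
                        [0.2, 0.8]])   # p(v | h=1)

def log_p_hat(v, q=np.array([0.5, 0.5]), k=5):
    """Importance-sampled estimate of log p(v): average p(v, h)/q(h)
    over h ~ q. In expectation this is a lower bound on log p(v)."""
    hs = rng.choice(2, size=k, p=q)
    w = p_h[hs] * p_v_given_h[hs, v] / q[hs]
    return np.log(w.mean())

true_log_p = np.log(p_h @ p_v_given_h[:, 1])          # exact log p(v=1)
est = np.mean([log_p_hat(1) for _ in range(2000)])    # average estimate
```

The closer q is to the posterior p(h | v), the flatter the weights and the smaller the gap between the average estimate and the true log-likelihood.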

slide-74
SLIDE 74

Reverse Annealing: a directed graphical model out of an undirected one

slide-75
SLIDE 75

Reverse Annealing: a directed graphical model out of an undirected one

slide-76
SLIDE 76

Reverse Annealing: a directed graphical model out of an undirected one

— MCMC

  • perator for an MRF

with energy — MCMC

  • perator for an MRF

with energy

slide-77
SLIDE 77

Reverse Annealing: a directed graphical model out of an undirected one

— MCMC

  • perator for an MRF

with energy — MCMC

  • perator for an MRF

with energy

SLIDES 78–82

Reverse Annealing Importance Sampling Estimator — RAISE

  • As easy to implement as AIS
  • Gives a model that approximates the original MRF
  • Asymptotically equivalent to the original MRF
  • Exact samples can be generated from the approximate model
  • RAISE is a stochastic lower bound on the log-likelihood of the approximate model
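A hedged 1-D sketch of the reverse-annealing idea, using a toy Gaussian stand-in for an MRF rather than the paper's RBM implementation: run the chain backwards from a test point, accumulating the telescoping weights (this relies on the Metropolis kernel being reversible, so it is its own reverse operator):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setting: base N(0, 2^2), unnormalized target f(x) = exp(-x^2/2),
# so the normalized target is N(0, 1).
log_p0 = lambda x: -0.5 * (x / 2.0) ** 2 - np.log(2.0 * np.sqrt(2 * np.pi))
log_f = lambda x: -0.5 * x ** 2
log_fk = lambda x, b: (1 - b) * log_p0(x) + b * log_f(x)  # geometric path

def mh_step(x, b, n=2):
    """Metropolis steps targeting f_b; reversible, hence its own reverse."""
    for _ in range(n):
        prop = x + rng.normal(0.0, 0.5)
        if np.log(rng.random()) < log_fk(prop, b) - log_fk(x, b):
            x = prop
    return x

def raise_log_p(v, n_betas=100):
    """RAISE sketch: run the annealing chain backwards from the test point v,
    giving log w = log p0(x_0) + sum_k [log f_k(x_k) - log f_k(x_{k-1})].
    exp(log w) is unbiased for the annealed model's probability of v, so
    log w is a stochastic lower bound on its log-likelihood."""
    x = v
    log_w = 0.0
    for b in np.linspace(0.0, 1.0, n_betas)[:0:-1]:  # beta_K ... beta_1
        log_w += log_fk(x, b)
        x = mh_step(x, b)          # draw x_{k-1} from the reverse operator
        log_w -= log_fk(x, b)
    return log_w + log_p0(x)

# Average over repeated runs at a test point v = 1.0; with enough
# intermediate distributions this approaches log N(0, 1) density at 1.0.
est = np.mean([raise_log_p(1.0) for _ in range(200)])
```

Note the contrast with AIS: AIS overestimates the log-likelihood on average, while this estimator errs on the conservative side.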

SLIDE 83

In practice

Evaluated the procedure on:

  • Small RBMs
  • Large RBMs trained with:
    • Contrastive Divergence with 1 step (CD-1)
    • Contrastive Divergence with 25 steps (CD-25)
    • Persistent Contrastive Divergence
  • Large Deep Belief Networks
  • Large Deep Boltzmann Machines
  • Models of handwritten digits (MNIST) and models of handwritten characters (Omniglot)

SLIDE 84

Typical Results

(Figures: results for an MNIST DBN and an MNIST DBM.)

SLIDE 85

Typical Results

(Figure: MNIST, CD-1 trained RBM, 500 hidden units.)

SLIDE 86

Non-typical Results

(Figure: MNIST, CD-1 trained RBM with 20 units.)
SLIDE 87

MNIST and Omniglot RBMs Results

  • CSL: the Conservative Sampling-based Log-likelihood estimator of Bengio et al. (Y. Bengio, L. Yao, and K. Cho. Bounding the test log-likelihood of generative models.)
  • RAISE errs on the side of underestimating the log-likelihood.
  • The gap is typically very small!
SLIDES 88–90

Empirical observations

  • Annealing from the data base-rates model typically gives better AIS estimates than annealing from the uniform distribution
  • The RAISE model approximates the original MRF reasonably well with 1,000–100,000 intermediate distributions
  • For models that don't model the data distribution well (overfitting, undertrained, etc.) the RAISE model can be substantially better than the original MRF

SLIDES 91–92

Empirical observations

  • It's really hard to know when AIS is or isn't working, and RAISE can give a clue about that
  • It's likely that most, but not all, published results based on AIS estimates with enough intermediate distributions are reliable

SLIDES 93–97

Computational Tricks

RAISE requires estimating a large sum for each test sample, which is computationally expensive.

The method of control variates gives a way of using few test samples to achieve reasonably reliable estimates: RAISE estimates and MRF unnormalized log-probabilities tend to be tightly correlated. Hence

  (1/k) Σ_{i ≤ k} [RAISE(v_i) − log f(v_i)] + (1/N) Σ_{i ≤ N} log f(v_i)

is a low-variance estimator of the average test log-likelihood, where v_1, …, v_N are random test set samples and k is small.
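A toy sketch of the control-variates idea on synthetic data (not the RAISE estimator itself): when a cheap quantity correlates tightly with an expensive one, evaluating the expensive quantity on only a few examples is enough for a low-variance estimate of its mean.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Expensive" per-example values y_i (stand-in for RAISE estimates) are
# tightly correlated with "cheap" values g_i (stand-in for unnormalized
# log-probabilities) whose full-population mean we can afford to compute.
N, k = 10000, 50
g = rng.normal(0.0, 1.0, size=N)            # cheap quantity, known for all N
y = g + rng.normal(0.0, 0.1, size=N)        # expensive quantity, correlated

idx = rng.choice(N, size=k, replace=False)  # evaluate y on only k examples
naive = y[idx].mean()                        # plain small-sample estimate
cv = (y[idx] - g[idx]).mean() + g.mean()     # control-variate estimate
```

The residual y − g has a standard deviation of 0.1 instead of roughly 1, so for the same small k the control-variate estimate of mean(y) is far more accurate than the naive one.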

SLIDES 98–101

Pretraining Very Very Deep Models

  • Train an RBM or a DBN
  • Unroll the model using RAISE to create a sigmoid belief network with 100 or 1000 layers
  • Use p and q to fine-tune the model with an appropriate algorithm: wake-sleep (Hinton et al., 1995), reweighted wake-sleep (Bornschein and Bengio, 2014), neural variational inference (Mnih and Gregor, 2013)
  • Brag about the deepest network that has ever been trained

SLIDES 102–104

Conclusions

  • Comparing MRF models using variants of importance sampling (AIS, sequential Monte Carlo, etc.) is unreliable
  • RAISE is as easy to use as AIS, and should be used either instead of or in conjunction with AIS when comparing models
  • One can (in principle) replace any MRF with a directed graphical model that has a tractable approximation to the posterior

SLIDE 105

Thank you

Burda Y., Grosse R., Salakhutdinov R. Accurate and Conservative Estimates of MRF Log-likelihood using Reverse Annealing. AISTATS 2015.