SLIDE 1

Sandwiching the marginal likelihood using bidirectional Monte Carlo

Roger Grosse, Ryan Adams, Zoubin Ghahramani

SLIDE 2

Introduction

  • When comparing different statistical models, we’d like a quantitative criterion which trades off model complexity and fit to the data
  • In a Bayesian setting, we often use the marginal likelihood
  • Defined as the probability of the data, with all parameters and latent variables integrated out
  • Motivation: plug into Bayes’ Rule

$$p(\mathcal{M}_i \mid D) = \frac{p(\mathcal{M}_i)\, p(D \mid \mathcal{M}_i)}{\sum_j p(\mathcal{M}_j)\, p(D \mid \mathcal{M}_j)}$$
SLIDE 3

Introduction: marginal likelihood

[Figure: an example compositional matrix decomposition built from component matrices (labeled G, M, and their transposes in the original figure)]

need to integrate out all of the component matrices and their hyperparameters

SLIDE 4

Introduction

  • Advantages of marginal likelihood (ML)
  • Accounts for model complexity in a sophisticated way
  • Closely related to description length
  • Measures the model’s ability to generalize to unseen examples
  • ML is used in those rare cases where it is tractable
  • e.g. Gaussian processes, fully observed Bayes nets
  • Unfortunately, it’s typically very hard to compute because it requires a very high-dimensional integral
  • While ML has been criticized on many fronts, the proposed alternatives pose similar computational difficulties

SLIDE 5

Introduction

  • Focus on latent variable models
  • parameters θ, latent variables z, observations y
  • assume i.i.d. observations
  • Marginal likelihood requires summing or integrating out latent variables and parameters
  • Similar to computing the partition function

$$p(\mathbf{y}) = \int p(\theta) \prod_{i=1}^{N} \sum_{z_i} p(z_i \mid \theta)\, p(y_i \mid z_i, \theta)\, d\theta \qquad\qquad Z = \sum_{x \in \mathcal{X}} f(x)$$

[Figure: graphical model over θ, z, y]

SLIDE 6

Introduction

  • Problem: exact marginal likelihood computation is intractable
  • There are many algorithms to approximate it, but we don’t know how well they work

SLIDE 7

Why evaluating ML estimators is hard

The answer to life, the universe, and everything is...

42

SLIDE 8

Why evaluating ML estimators is hard

The marginal likelihood is…

log p(D) = −23814.7

SLIDE 9

Why evaluating ML estimators is hard

  • How does one deal with this in practice?
  • polynomial-time approximations for partition functions of ferromagnetic Ising models
  • test on very small instances which can be solved exactly
  • run a bunch of estimators and see if they agree with each other
SLIDE 10

Log-ML lower bounds

  • One marginal likelihood estimator is simple importance sampling:

$$\{\theta^{(k)}, z^{(k)}\}_{k=1}^{K} \sim q \qquad \hat{p}(D) = \frac{1}{K} \sum_{k=1}^{K} \frac{p(\theta^{(k)}, z^{(k)}, D)}{q(\theta^{(k)}, z^{(k)})} \qquad \mathbb{E}[\hat{p}(D)] = p(D)$$

  • This is an unbiased estimator
  • Unbiased estimators are stochastic lower bounds:

$$\mathbb{E}[\log \hat{p}(D)] \le \log p(D) \;\;\text{(Jensen's inequality)} \qquad \Pr\!\left(\log \hat{p}(D) > \log p(D) + b\right) \le e^{-b} \;\;\text{(Markov's inequality)}$$

  • Many widely used algorithms have the same property!
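
To make the lower-bound property concrete, here is a minimal sketch (a toy example of ours, not from the talk): a conjugate Gaussian model whose exact log-ML is known in closed form, estimated by simple importance sampling from the prior.

```python
# Toy illustration (not from the talk): simple importance sampling with q = prior.
# Model: theta ~ N(0, 1), y | theta ~ N(theta, 1), so the exact marginal is
# p(y) = N(y; 0, 2). The log of the unbiased estimate underestimates log p(y).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
y = 1.5                                            # a single "dataset"
true_log_ml = norm.logpdf(y, loc=0.0, scale=np.sqrt(2.0))

def simple_is_log_ml(num_samples):
    """Log of the simple importance sampling estimate of p(y), proposal = prior."""
    theta = rng.standard_normal(num_samples)       # theta^(k) ~ q = p(theta)
    log_w = norm.logpdf(y, loc=theta, scale=1.0)   # weights p(y | theta^(k)); prior cancels
    return np.logaddexp.reduce(log_w) - np.log(num_samples)

estimates = np.array([simple_is_log_ml(10) for _ in range(1000)])
print(f"true log p(y)        : {true_log_ml:.4f}")
print(f"mean of log estimates: {estimates.mean():.4f}  (below the true value, as Jensen predicts)")
```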

SLIDE 11

Log-ML lower bounds

[Figure: log-ML estimates from variational Bayes, the Chib-Murray-Salakhutdinov estimator, annealed importance sampling (AIS), and sequential Monte Carlo (SMC). True value?]

SLIDE 12

How to obtain an upper bound?

  • Harmonic Mean Estimator:

$$\{\theta^{(k)}, z^{(k)}\}_{k=1}^{K} \sim p(\theta, z \mid D) \qquad \hat{p}(D) = \frac{K}{\sum_{k=1}^{K} 1 / p(D \mid \theta^{(k)}, z^{(k)})} \qquad \mathbb{E}\!\left[\frac{1}{\hat{p}(D)}\right] = \frac{1}{p(D)}$$

  • Equivalent to simple importance sampling, but with the roles of the proposal and target distributions reversed
  • Unbiased estimate of the reciprocal of the ML
  • Gives a stochastic upper bound on the log-ML
  • Caveat 1: only an upper bound if you sample exactly from the posterior, which is generally intractable
  • Caveat 2: this is the Worst Monte Carlo Estimator (Neal, 2008)
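
On the same toy Gaussian model as before, exact posterior samples are available in closed form (θ | y ~ N(y/2, 1/2)), so the harmonic mean estimator can be sketched directly; this is illustrative only, not from the talk.

```python
# Toy sketch of the harmonic mean estimator (illustrative, not from the talk).
# Model: theta ~ N(0, 1), y | theta ~ N(theta, 1); posterior theta | y ~ N(y/2, 1/2).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
y, K = 1.5, 1000
true_log_ml = norm.logpdf(y, scale=np.sqrt(2.0))

theta = y / 2 + np.sqrt(0.5) * rng.standard_normal(K)   # exact posterior samples
log_lik = norm.logpdf(y, loc=theta)                      # p(y | theta^(k))

# p_hat(y) = K / sum_k 1 / p(y | theta^(k)), computed in log space
log_p_hat = np.log(K) - np.logaddexp.reduce(-log_lik)
print(f"harmonic mean log-ML estimate: {log_p_hat:.4f}   true: {true_log_ml:.4f}")
```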

SLIDE 13

Annealed importance sampling (Neal, 2001)

[Figure: a sequence of distributions p_0, p_1, p_2, ..., p_{K-1}, p_K interpolating between a tractable initial distribution (e.g. the prior) and an intractable target distribution (e.g. the posterior)]

SLIDE 14

Annealed importance sampling

(Neal, 2001)

Given: unnormalized distributions $f_0, \ldots, f_K$; MCMC transition operators $T_0, \ldots, T_K$; $f_0$ easy to sample from and to compute the partition function of.

$x \sim f_0$, $\; w := 1$
For $i = 0, \ldots, K - 1$:
  $w := w \cdot f_{i+1}(x) / f_i(x)$
  $x :\sim T_{i+1}(x)$

Then $\mathbb{E}[w] = Z_K / Z_0$, and averaging $S$ independent runs gives

$$\hat{Z}_K = \frac{Z_0}{S} \sum_{s=1}^{S} w^{(s)}$$
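
A minimal runnable sketch of this recursion, under assumptions of our own (not the talk's code): a geometric path between a 1-D Gaussian $f_0$ and an unnormalized Gaussian target $f_K$, with a random-walk Metropolis step standing in for the transition operators $T_i$.

```python
# Minimal AIS sketch (assumed toy setup): geometric path between a Gaussian f_0
# and an unnormalized Gaussian target f_K, with random-walk Metropolis as T_i.
import numpy as np

rng = np.random.default_rng(0)

def log_f0(x):                      # N(0, 1), known partition function Z_0 = sqrt(2*pi)
    return -0.5 * x**2

def log_fK(x):                      # unnormalized target: a Gaussian bump at 2
    return -0.5 * ((x - 2.0) / 0.5) ** 2

def log_f(x, beta):                 # geometric intermediate distribution f_beta
    return (1.0 - beta) * log_f0(x) + beta * log_fK(x)

def metropolis_step(x, beta, step=0.5):
    prop = x + step * rng.standard_normal()
    if np.log(rng.random()) < log_f(prop, beta) - log_f(x, beta):
        return prop
    return x

def ais_run(K=1000):
    betas = np.linspace(0.0, 1.0, K + 1)
    x = rng.standard_normal()       # exact sample from f_0
    log_w = 0.0
    for i in range(K):
        log_w += log_f(x, betas[i + 1]) - log_f(x, betas[i])   # w *= f_{i+1}(x) / f_i(x)
        x = metropolis_step(x, betas[i + 1])                    # x :~ T_{i+1}(x)
    return log_w                    # E[exp(log_w)] = Z_K / Z_0

log_ratios = np.array([ais_run() for _ in range(100)])
Z0 = np.sqrt(2.0 * np.pi)
print("AIS estimate of Z_K:", Z0 * np.exp(log_ratios).mean())
print("true Z_K           :", 0.5 * np.sqrt(2.0 * np.pi))      # integral of exp(log_fK)
```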

SLIDE 15

Annealed importance sampling

(Neal, 2001)

[Figure: forward chain x_0, x_1, ..., x_K through p_0, p_1, ..., p_K via transitions T_1, ..., T_K, and backward chain via the reverse transitions $\tilde{T}_1, \ldots, \tilde{T}_K$]

Forward:

$$w := \prod_{i=1}^{K} \frac{f_i(x_{i-1})}{f_{i-1}(x_{i-1})} = \frac{Z_K}{Z_0}\, \frac{q_{\mathrm{back}}(x_0, x_1, \ldots, x_K)}{q_{\mathrm{fwd}}(x_0, x_1, \ldots, x_K)} \qquad \mathbb{E}[w] = \frac{Z_K}{Z_0}$$

Backward:

$$w := \prod_{i=1}^{K} \frac{f_{i-1}(x_i)}{f_i(x_i)} = \frac{Z_0}{Z_K}\, \frac{q_{\mathrm{fwd}}(x_0, x_1, \ldots, x_K)}{q_{\mathrm{back}}(x_0, x_1, \ldots, x_K)} \qquad \mathbb{E}[w] = \frac{Z_0}{Z_K}$$

SLIDE 16

Bidirectional Monte Carlo

  • Initial distribution: prior $p(\theta, z)$
  • Target distribution: posterior $p(\theta, z \mid D) = p(\theta, z, D) / p(D)$
  • Partition function: $Z = \int p(\theta, z, D)\, d\theta\, dz = p(D)$
  • Forward chain: $\mathbb{E}[w] = Z_K / Z_0 = p(D)$, a stochastic lower bound on $\log p(D)$
  • Backward chain (requires an exact posterior sample!): $\mathbb{E}[w] = Z_0 / Z_K = 1 / p(D)$, a stochastic upper bound on $\log p(D)$
SLIDE 17

Bidirectional Monte Carlo

How to get an exact sample?

Two ways to sample from $p(\theta, z, D)$:

  • forward sample: $\theta, z \sim p(\theta, z)$, then $D \sim p(D \mid \theta, z)$
  • generate data, then perform inference: $D \sim p(D)$, then $\theta, z \sim p(\theta, z \mid D)$

These are the same joint distribution. Therefore, the parameters and latent variables used to generate the data are an exact posterior sample!

SLIDE 18

Bidirectional Monte Carlo

Summary of algorithm:

  • Simulate data: $\theta, z \sim p_{\theta, z}$, then $y \sim p_{y \mid \theta, z}(\cdot \mid \theta, z)$
  • Obtain a stochastic lower bound on $\log p(y)$ by running AIS forwards
  • Obtain a stochastic upper bound on $\log p(y)$ by running AIS backwards, starting from $(\theta, z)$

The two bounds will converge given enough intermediate distributions.
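
Here is an illustrative end-to-end sketch of the sandwich (our own toy example, not the paper's code): a conjugate Gaussian model whose true log marginal likelihood is known in closed form, with forward AIS run from the prior and backward AIS run from the parameter value that generated the data.

```python
# Illustrative BDMC sketch (toy example, not the paper's code).
# Model: theta ~ N(0, 1), y_i | theta ~ N(theta, 1); the exact marginal is
# y ~ N(0, I + 11^T). Intermediate distributions:
# f_beta(theta) = p(theta) * p(y | theta)^beta, so Z_0 = 1 and Z_K = p(y).
import numpy as np
from scipy.stats import norm, multivariate_normal

rng = np.random.default_rng(0)
N, K = 20, 2000

# Simulate data; the generating theta_star is then an exact posterior sample.
theta_star = rng.standard_normal()
y = theta_star + rng.standard_normal(N)
true_log_ml = multivariate_normal.logpdf(y, mean=np.zeros(N),
                                         cov=np.eye(N) + np.ones((N, N)))

def log_lik(theta):
    return norm.logpdf(y, loc=theta).sum()

def log_prior(theta):
    return norm.logpdf(theta)

def mh_step(theta, beta, step=0.3):
    """Metropolis-Hastings transition leaving f_beta invariant."""
    prop = theta + step * rng.standard_normal()
    log_acc = (log_prior(prop) + beta * log_lik(prop)
               - log_prior(theta) - beta * log_lik(theta))
    return prop if np.log(rng.random()) < log_acc else theta

betas = np.linspace(0.0, 1.0, K + 1)

def ais(theta, forward=True):
    """Forward AIS: E[exp(log_w)] = p(y). Backward AIS: E[exp(log_w)] = 1/p(y)."""
    schedule = betas if forward else betas[::-1]
    log_w = 0.0
    for b_cur, b_next in zip(schedule[:-1], schedule[1:]):
        log_w += (b_next - b_cur) * log_lik(theta)   # log f_next(theta) - log f_cur(theta)
        theta = mh_step(theta, b_next)
    return log_w

lower = ais(rng.standard_normal(), forward=True)   # forward chain starts from the prior
upper = -ais(theta_star, forward=False)            # backward chain starts from the posterior sample
print(f"lower {lower:.2f}  <=  true {true_log_ml:.2f}  <=  upper {upper:.2f}")
```

With K large enough the printed lower and upper bounds pinch together around the true value; shrinking K makes the gap visible.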

SLIDE 19

Experiments

  • BDMC lets us compute ground truth log-ML values for data simulated from a model
  • We can use these ground truth values to benchmark log-ML estimators!
  • Obtained ground truth ML for simulated data for
  • clustering
  • low rank approximation
  • binary attributes
  • Compared a wide variety of ML estimators
  • MCMC operators shared between all algorithms wherever possible
SLIDE 20

Results: binary attributes

[Figure: log-ML estimates vs. the true value for the harmonic mean estimator, the Bayesian information criterion (BIC), and likelihood weighting]

SLIDE 21

Results: binary attributes

[Figure: log-ML estimates vs. the true value for variational Bayes and Chib-Murray-Salakhutdinov]

SLIDE 22

Results: binary attributes (zoomed in)

[Figure: log-ML estimates vs. the true value for nested sampling, annealed importance sampling (AIS), sequential Monte Carlo, reverse AIS, and reverse SMC]

SLIDE 23

Results: binary attributes

Which estimators give accurate results?

[Figure: mean squared error vs. time (seconds) for variational Bayes, Chib-Murray-Salakhutdinov, AIS, sequential Monte Carlo (SMC), harmonic mean, likelihood weighting, and nested sampling; annotated with the accuracy needed to distinguish simple matrix factorizations]

SLIDE 24

Results: low rank approximation

[Figure: annealed importance sampling (AIS) results]

SLIDE 25

Recommendations

  • Try AIS first
  • If AIS is too slow, try sequential Monte Carlo or nested sampling
  • Can’t fix a bad algorithm by averaging many samples
  • Don’t trust naive confidence intervals; estimators need to be evaluated rigorously
SLIDE 26

On the quantitative evaluation of decoder-based generative models

Yuhuai Wu, Yuri Burda, Ruslan Salakhutdinov

SLIDE 27

Decoder-based generative models

  • Define a generative process:
  • sample latent variables z from a simple (fixed) prior p(z)
  • pass them through a decoder network to get x = f(z)
  • Examples:
  • variational autoencoders (Kingma and Welling, 2014)
  • generative adversarial networks (Goodfellow et al., 2014)
  • generative moment matching networks (Li et al., 2015; Dziugaite et al., 2015)
  • nonlinear independent components estimation (Dinh et al., 2015)
SLIDE 28

Decoder-based generative models

  • Variational autoencoder (VAE)
  • Train both a generator (decoder) and a recognition network (encoder)
  • Optimize a variational lower bound on the log-likelihood
  • Generative adversarial network (GAN)
  • Train a generator (decoder) and a discriminator
  • Discriminator wants to distinguish model samples from the training data
  • Generator wants to fool the discriminator
  • Generative moment matching network (GMMN)
  • Train a generative network such that certain statistics match between the generated samples and the data

SLIDE 29

Decoder-based generative models

Some impressive-looking samples (Denton et al., 2015; Radford et al., 2016):

[Figure: generated image samples]

But how well do these models capture the distribution?

SLIDE 30

Decoder-based generative models

Looking at samples can be misleading:

SLIDE 31

Decoder-based generative models

  • GAN, 10 dim: LLD = 328.7
  • GAN, 50 dim, 200 epochs: LLD = 543.5
  • GAN, 50 dim, 1000 epochs: LLD = 625.5

SLIDE 32

Evaluating decoder-based models

  • Want to quantitatively evaluate generative models in terms of the probability of held-out data
  • Problem: a GAN or GMMN with k latent dimensions can only generate within a k-dimensional submanifold!
  • Standard (but unsatisfying) solution: impose a spherical Gaussian observation model $p_\sigma(x \mid z) = \mathcal{N}(f(z), \sigma^2 I)$ and tune $\sigma$ on a validation set
  • Problem: this still requires computing an intractable integral:

$$p_\sigma(x) = \int p(z)\, p_\sigma(x \mid z)\, dz$$
SLIDE 33

Evaluating decoder-based models

  • For some models, we can tractably compute log-likelihoods, or at least a reasonable lower bound
  • Tractable likelihoods for models with reversible decoders (e.g. NICE)
  • Variational autoencoders: ELBO lower bound
  • Importance weighted autoencoder (IWAE)
  • In general, we don’t have accurate and tractable bounds
  • Even in the cases of VAEs and IWAEs, we don’t know how accurate the bounds are

$$\log p(x) \ge \mathbb{E}_{q(z \mid x)}[\log p(x \mid z)] - D_{\mathrm{KL}}\!\left(q(z \mid x) \,\|\, p(z)\right) \qquad \log p(x) \ge \mathbb{E}_{q(z \mid x)}\!\left[\log \frac{p(x, z)}{q(z \mid x)}\right]$$
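
For intuition about how these bounds relate, here is a small sketch (our own toy example, not from the talk) on a model where $\log p(x)$ is exact; the "encoder" $q(z \mid x)$ is a deliberately crude Gaussian, and all specifics are assumptions for illustration.

```python
# Toy sketch comparing the ELBO and the IWAE bound on a model where log p(x)
# is exact: p(z) = N(0, 1), p(x | z) = N(z, 1), so p(x) = N(0, 2).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = 1.0
q_mean, q_std = 0.3 * x, 1.2                    # mismatched approximate posterior

def log_joint(z):
    return norm.logpdf(z) + norm.logpdf(x, loc=z)

def log_q(z):
    return norm.logpdf(z, loc=q_mean, scale=q_std)

S, k = 5000, 50
z = q_mean + q_std * rng.standard_normal((S, k))                 # z ~ q(z | x)
log_w = log_joint(z) - log_q(z)                                  # log p(x, z) - log q(z | x)

elbo = log_w.mean()                                              # E_q[log p(x, z) / q(z | x)]
iwae = np.mean(np.logaddexp.reduce(log_w, axis=1) - np.log(k))   # E[log (1/k) sum_i w_i]
true = norm.logpdf(x, scale=np.sqrt(2.0))
print(f"ELBO {elbo:.4f}  <=  IWAE-{k} {iwae:.4f}  <=  log p(x) {true:.4f}")
```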

SLIDE 34

Evaluating decoder-based models

  • Currently, results are reported using kernel density estimation (KDE)
  • Can show this is a stochastic lower bound:

$$z^{(1)}, \ldots, z^{(S)} \sim p(z) \qquad \hat{p}_\sigma(x) = \frac{1}{S} \sum_{s=1}^{S} p_\sigma(x \mid z^{(s)}) \qquad \mathbb{E}[\log \hat{p}_\sigma(x)] \le \log p_\sigma(x)$$

  • Unlikely to perform well in high dimensions
  • Papers caution the reader not to trust the results
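
A sketch of this KDE estimator, i.e. simple importance sampling from the prior under the Gaussian observation model. The `decoder` here is a made-up stand-in (a random tanh layer), not a trained network.

```python
# Sketch of the KDE / prior-sampling log-likelihood estimator (illustrative only).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
latent_dim, data_dim, sigma = 10, 784, 0.05

W = 0.1 * rng.standard_normal((latent_dim, data_dim))   # stand-in decoder weights

def decoder(z):
    """Hypothetical decoder f(z); in practice this is the trained network."""
    return np.tanh(z @ W)

def kde_log_likelihood(x, num_samples=10_000):
    """log p_hat_sigma(x) = log [(1/S) sum_s N(x; f(z^(s)), sigma^2 I)], z^(s) ~ p(z)."""
    z = rng.standard_normal((num_samples, latent_dim))            # z^(s) ~ p(z)
    log_p_x_given_z = norm.logpdf(x, loc=decoder(z), scale=sigma).sum(axis=1)
    return np.logaddexp.reduce(log_p_x_given_z) - np.log(num_samples)

# A test point drawn from the model itself; the estimate is a stochastic lower
# bound on log p_sigma(x), and it degrades badly in high dimensions.
x = decoder(rng.standard_normal((1, latent_dim)))[0] + sigma * rng.standard_normal(data_dim)
print("KDE log-likelihood estimate:", kde_log_likelihood(x))
```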

SLIDE 35

Evaluating decoder-based models

  • Our approach: integrate out latent variables using AIS, with Hamiltonian Monte Carlo (HMC) as the transition operator
  • Validate the accuracy of the estimates on simulated data using BDMC
  • Experiment details
  • Real-valued MNIST dataset
  • VAEs, GANs, GMMNs with the following decoder architectures:
  • 10-64-256-256-1024-784
  • 50-1024-1024-1024-784
  • Spherical Gaussian observations imposed on all models (including the VAE)

SLIDE 36

How accurate are AIS and KDE?

[Figure: AIS vs. KDE on GMMN-50; x-axis: seconds (log scale), y-axis: log-likelihood; curves: KDE, forward AIS, backward AIS]

SLIDE 37

How accurate is the IWAE bound?

[Figure: AIS vs. IWAE; x-axis: seconds (log scale), y-axis: log-likelihood (roughly −88 to −85.5); curves: IWAE, AIS, AIS+encoder]

SLIDE 38

Estimation of variance parameter

[Figure: GAN-50 with varying variance; x-axis: variance (0.005 to 0.025), y-axis: log-likelihood; curves: Train AIS, Valid AIS, Train KDE, Valid KDE]

SLIDE 39

Comparison of different models

  • For GANs and GMMNs, there is no statistically significant difference between training and test log-likelihoods! These models are not just memorizing training examples.
  • A larger model gives a much higher log-likelihood
  • VAEs achieve much higher log-likelihoods than GANs and GMMNs
  • AIS estimates are accurate (small BDMC gap)

SLIDE 40

Training curves for a GMMN

[Figure: GMMN-50 training curves; x-axis: number of epochs (1000 to 10000), y-axis: log-likelihood; curves: Train AIS, Valid AIS, Train KDE, Valid KDE]

SLIDE 41

Training curves for a VAE

[Figure: VAE-50 training curves; x-axis: number of epochs (100 to 1000), y-axis: log-likelihood; curves: Train AIS, Valid AIS, Train KDE, Valid KDE, Train IWAE, Valid IWAE]

SLIDE 42

Missing modes

The GAN seriously misallocates probability mass between modes:

[Figure: panels at 200 epochs and 1000 epochs]

But this effect by itself is too small to explain why it underperforms the VAE by over 350 nats

SLIDE 43

Missing modes

  • To see if the network is missing modes, let’s visualize posterior samples given observations
  • Use AIS to approximately sample z from p(z | x), then run the decoder
  • Using BDMC, we can validate the accuracy of AIS samples on simulated data

SLIDE 44

Missing modes

Visualization of posterior samples for validation images

[Figure: panels for data, GAN-10, VAE-10, GMMN-10, GAN-50, VAE-50, GMMN-50]

SLIDE 45

Missing modes

Posterior samples on the training set

[Figure: panels for data, GAN-10, VAE-10, GMMN-10, GAN-50, VAE-50, GMMN-50]

SLIDE 46

Missing modes

Conjecture: the GAN acts like a frustrated student

[Figure: panels at 200 epochs and 1000 epochs]

SLIDE 47

Conclusions

  • AIS gives high-accuracy log-likelihood estimates on MNIST (as validated by BDMC)
  • This lets us observe interesting phenomena that are invisible to KDE
  • GANs and GMMNs are not just memorizing training examples
  • VAEs achieve substantially higher log-likelihoods than GANs and GMMNs
  • This appears to reflect a failure to model certain modes of the data distribution
  • Recognition nets can overfit
  • Networks may continue to improve during training, even if KDE estimates don’t reflect that
  • It will be interesting to measure the effects of other algorithmic improvements to these networks