Probabilistic & Unsupervised Learning: Sampling Methods
Maneesh Sahani (maneesh@gatsby.ucl.ac.uk)
Gatsby Computational Neuroscience Unit, and MSc ML/CSML, Dept Computer Science, University College London
Term 1, Autumn 2013
Sampling

For inference and learning we need to compute both:

◮ Posterior distributions (on latents and/or parameters) or predictive distributions.
◮ Expectations with respect to these distributions.

Both are often intractable. Deterministic approximations to distributions (factored variational / mean-field; BP; EP) or to expectations (Bethe / Kikuchi methods) provide tractability, at the expense of a fixed approximation penalty. An alternative is to represent distributions and compute expectations using randomly generated samples. Results are consistent, often unbiased, and precision can generally be improved to an arbitrary degree by increasing the number of samples.
Intractabilities and approximations

◮ Inference – computational intractability
  ◮ Factored variational approx
  ◮ Loopy BP / EP / Power EP
  ◮ LP relaxations / convexified BP
  ◮ Gibbs sampling, other MCMC
◮ Inference – analytic intractability
  ◮ Laplace approximation (global)
  ◮ Parametric variational approx (for special cases)
  ◮ Message approximations (linearised, sigma-point, Laplace)
  ◮ Assumed-density methods and Expectation Propagation
  ◮ (Sequential) Monte Carlo methods
◮ Learning – intractable partition function
  ◮ Sampling parameters
  ◮ Contrastive divergence
  ◮ Score matching
◮ Model selection
  ◮ Laplace approximation / BIC
  ◮ Variational Bayes
  ◮ (Annealed) importance sampling
  ◮ Reversible-jump MCMC

Not a complete list!
The integration problem
We commonly need to compute expected-value integrals of the form:

∫ F(x) p(x) dx,

where F(x) is some function of a random variable X with probability density p(x). Three typical difficulties (illustrated by figures in the original slides): a complicated function F under a simple density; a simple function under a complicated density; and a non-analytic integral (or sum) in very many dimensions.
Simple Monte-Carlo Integration
Evaluate:

∫ F(x) p(x) dx

Idea: draw samples from p(x), evaluate F(x), average the values.

∫ F(x) p(x) dx ≃ (1/T) Σ_{t=1}^{T} F(x^(t)),

where the x^(t) are (independent) samples drawn from p(x). Convergence to the integral follows from the strong law of large numbers.
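As a concrete illustration, here is a minimal Python sketch of simple Monte Carlo integration, estimating E[F(x)] for F(x) = x² under p = N(0, 1) (the exact answer is Var[x] = 1):

```python
import numpy as np

rng = np.random.default_rng(0)

# Estimate E[F(x)] for F(x) = x^2 under p(x) = N(0, 1); exact value is 1.
T = 100_000
x = rng.standard_normal(T)      # x^(t) ~ p(x), iid
estimate = np.mean(x ** 2)      # (1/T) sum_t F(x^(t))

print(estimate)                 # close to 1.0
```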
Analysis of simple Monte-Carlo
Attractions:

◮ unbiased:

E[(1/T) Σ_{t=1}^{T} F(x^(t))] = E[F(x)]

◮ variance falls as 1/T, independent of dimension:

V[(1/T) Σ_t F(x^(t))] = E[((1/T) Σ_t F(x^(t)))²] − E[F(x)]²
= (1/T²)(T E[F(x)²] + (T² − T) E[F(x)]²) − E[F(x)]²
= (1/T)(E[F(x)²] − E[F(x)]²)

Problems:

◮ May be difficult or impossible to obtain the samples directly from p(x).
◮ Regions of high density p(x) may not correspond to regions where F(x) departs most from its mean value (and thus each F(x) evaluation might have very high variance).
Importance sampling
Idea: sample from a proposal distribution q(x) and weight those samples by p(x)/q(x). With samples x^(t) ∼ q(x):

∫ F(x) p(x) dx = ∫ F(x) (p(x)/q(x)) q(x) dx ≃ (1/T) Σ_{t=1}^{T} F(x^(t)) w(x^(t)),

provided q(x) is non-zero wherever p(x) is; the weights are w(x^(t)) ≡ p(x^(t))/q(x^(t)).

◮ Handles cases where p(x) is difficult to sample.
◮ Can direct samples towards high values of the integrand F(x)p(x), rather than just high p(x) alone (e.g. p a prior and F a likelihood).
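A minimal Python sketch of this estimator, again for F(x) = x² and p = N(0, 1), using a deliberately wide proposal q = N(0, 2²) so that the weights stay bounded:

```python
import numpy as np

rng = np.random.default_rng(0)

def norm_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Target p = N(0, 1), proposal q = N(0, 2^2): wider than p, so weights bounded.
T = 200_000
x = rng.normal(0.0, 2.0, size=T)                    # x^(t) ~ q
w = norm_pdf(x, 0.0, 1.0) / norm_pdf(x, 0.0, 2.0)   # w(x^(t)) = p/q
estimate = np.mean(x ** 2 * w)                      # E_p[x^2] = 1

print(estimate)                                     # close to 1.0
```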
Analysis of importance sampling
Attractions:

◮ Unbiased: E_q[F(x)w(x)] = ∫ F(x) (p(x)/q(x)) q(x) dx = E_p[F(x)].
◮ Variance could be smaller than simple Monte Carlo if

E_q[(F(x)w(x))²] − E_q[F(x)w(x)]² < E_p[F(x)²] − E_p[F(x)]²

The "optimal" proposal is q(x) = p(x)F(x)/Z_q: every sample yields the same estimate,

F(x)w(x) = F(x) p(x) / (p(x)F(x)/Z_q) = Z_q;

but computing the normaliser Z_q requires solving the original problem!
Problems:

◮ May be hard to construct or sample q(x) to give small variance.
◮ Variance of the weights could be unbounded:

V[w(x)] = E_q[w(x)²] − E_q[w(x)]²
E_q[w(x)] = ∫ q(x) w(x) dx = 1
E_q[w(x)²] = ∫ (p(x)²/q(x)²) q(x) dx = ∫ p(x)²/q(x) dx

e.g. p(x) = N(0, 1), q(x) = N(1, 0.1) ⇒ V[w] ∝ ∫ e^{49x² − 100x + 50} dx = ∞; the Monte Carlo average may then be dominated by a few samples, not even necessarily in the region of large integrand.
Importance sampling — unnormalised distributions
Suppose we only know p(x) and/or q(x) up to constants,

p(x) = p̃(x)/Z_p,  q(x) = q̃(x)/Z_q

where Z_p, Z_q are unknown or too expensive to compute, but that we can nevertheless draw samples from q(x).

◮ We can still apply importance sampling by estimating the normaliser:

∫ F(x) p(x) dx ≈ (Σ_t F(x^(t)) w(x^(t))) / (Σ_t w(x^(t))),   w(x) = p̃(x)/q̃(x)

◮ This estimate is only consistent (biased for finite T; it converges to the true value as T → ∞).
◮ In particular, we have

(1/T) Σ_t w(x^(t)) → ⟨p̃(x)/q̃(x)⟩_q = ∫ dx (Z_p p(x))/(Z_q q(x)) q(x) = Z_p/Z_q

so with known Z_q we can estimate the partition function of p.
◮ (This is the importance-sampled integral with F(x) = 1.)
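A small Python sketch of estimating the normaliser: with p̃(x) = e^{−x²/2} (so Z_p = √(2π) ≈ 2.5066) and a normalised proposal (Z_q = 1), the mean weight converges to Z_p/Z_q:

```python
import numpy as np

rng = np.random.default_rng(0)

# Unnormalised target ptilde(x) = exp(-x^2/2), so Zp = sqrt(2*pi).
# Proposal q = N(0, 2^2), used in normalised form so Zq = 1.
T = 200_000
sigma_q = 2.0
x = rng.normal(0.0, sigma_q, size=T)
q = np.exp(-0.5 * (x / sigma_q) ** 2) / (sigma_q * np.sqrt(2 * np.pi))
w = np.exp(-0.5 * x ** 2) / q       # w = ptilde / qtilde
Zp_estimate = np.mean(w)            # -> Zp / Zq = sqrt(2*pi)

print(Zp_estimate)                  # close to 2.5066
```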
Importance sampling — effective sample size
The variance of the weights is critical to the variance of the estimate:

V[w(x)] = E_q[w(x)²] − E_q[w(x)]²
E_q[w(x)] = ∫ q(x) w(x) dx = 1
E_q[w(x)²] = ∫ (p(x)²/q(x)²) q(x) dx = ∫ p(x)²/q(x) dx

A small effective sample size may diagnose ineffectiveness of importance sampling. A popular estimate:

T (1 + V̂_sample[w(x)/Ê_sample[w(x)]])^{−1} = (Σ_t w(x^(t)))² / Σ_t w(x^(t))²

However, a large effective sample size does not prove effectiveness (e.g. if no high-weight samples were found, or if q places little mass where F(x) is large).
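The right-hand expression is easy to compute from the weights alone; a small Python sketch with two extreme weight vectors illustrates its range (T for equal weights, 1 when a single sample dominates):

```python
import numpy as np

def ess(w):
    # Effective sample size: (sum w)^2 / sum w^2, between 1 and len(w).
    return np.sum(w) ** 2 / np.sum(w ** 2)

w_good = np.ones(1000)        # equal weights: ESS = T = 1000
w_bad = np.zeros(1000)
w_bad[0] = 1.0                # one dominant sample: ESS = 1

print(ess(w_good), ess(w_bad))   # 1000.0 1.0
```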
Drawing samples
Now consider the problem of generating samples from an arbitrary distribution p(x). Standard samplers are available for Uniform[0, 1] and N(0, 1).

◮ Other univariate distributions:

u ∼ Uniform[0, 1],  x = G^{−1}(u)  with  G(x) = ∫_{−∞}^{x} p(x′) dx′  the target CDF

◮ Multivariate normal with covariance C:

r_i ∼ N(0, 1),  x = C^{1/2} r   [⇒ ⟨xxᵀ⟩ = C^{1/2} ⟨rrᵀ⟩ C^{1/2} = C]
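A quick Python sketch of the inverse-CDF method for an Exponential(λ) target, where G(x) = 1 − e^{−λx} gives G^{−1}(u) = −log(1 − u)/λ:

```python
import numpy as np

rng = np.random.default_rng(0)

# Inverse-CDF sampling for Exponential(lam): G(x) = 1 - exp(-lam * x),
# so x = G^{-1}(u) = -log(1 - u) / lam for u ~ Uniform[0, 1].
lam = 2.0
u = rng.uniform(size=100_000)
x = -np.log(1.0 - u) / lam

print(x.mean())   # close to 1/lam = 0.5
```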
Rejection Sampling
Idea: sample from an upper bound on p(x), rejecting some samples.

◮ Find a distribution q(x) and a constant c such that ∀x, p(x) ≤ cq(x).
◮ Sample x* from q(x) and accept x* with probability p(x*)/(cq(x*)).
◮ Reject the rest.

Let y* ∼ Uniform[0, cq(x*)]; then the joint proposal (x*, y*) is a point drawn uniformly from the area under the cq(x) curve. The proposal is accepted if y* ≤ p(x*), i.e. if the point falls under the p(x) curve. The probability of proposing and accepting x* is q(x*)dx · p(x*)/(cq(x*)) = p(x*)/c dx. Thus accepted x* ∼ p(x), with average probability of acceptance 1/c.
Rejection Sampling
Attractions:

◮ Unbiased: an accepted x* is a true sample from p(x).
◮ Diagnostics easier than (say) importance sampling: the number of accepted samples is the true sample size.

Problem:

◮ It may be difficult to find a q(x) with a small c ⇒ lots of wasted area.

Examples:

◮ Compute p(X_i = b | X_j = a) in a directed graphical model: sample from P(X), reject unless X_j = a, and average the indicator function I(X_i = b).
◮ Compute E[x² | x > 4] for x ∼ N(0, 1).

Unnormalised distributions: say we only know p(x), q(x) up to a constant, p(x) = p̃(x)/Z_p, q(x) = q̃(x)/Z_q, where Z_p, Z_q are unknown or too expensive to compute, but we can still sample from q(x). We can still apply rejection sampling using a c with p̃(x) ≤ c q̃(x). Still unbiased.
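A minimal Python sketch of rejection sampling, here for the Beta(2, 2) density p(x) = 6x(1 − x) on [0, 1] with a uniform proposal and c = 1.5 (the maximum of p):

```python
import numpy as np

rng = np.random.default_rng(0)

# Target p(x) = 6 x (1 - x) on [0, 1] (Beta(2, 2)); proposal q = Uniform[0, 1].
# p(x) <= c q(x) with c = 1.5, since max p = 1.5 at x = 0.5.
c = 1.5
samples = []
while len(samples) < 50_000:
    x = rng.uniform()                              # x* ~ q
    if rng.uniform() < 6 * x * (1 - x) / c:        # accept w.p. p(x*)/(c q(x*))
        samples.append(x)
samples = np.array(samples)

print(samples.mean())   # Beta(2, 2) mean = 0.5
```

The average acceptance rate here is 1/c = 2/3, so roughly a third of the proposals are wasted.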
Relationship between importance and rejection sampling
If we have a c for which cq(x) is an upper bound on p(x), then the importance weights are also upper bounded: p(x)/q(x) ≤ c. So the importance weights have finite variance and importance sampling is well-behaved. The same upper-bound condition both makes rejection sampling work and keeps importance sampling well-behaved.
Learning in Boltzmann Machines
log p(s_V, s_H | W, b) = Σ_{ij} W_{ij} s_i s_j − Σ_i b_i s_i − log Z,  with  Z = Σ_s e^{Σ_{ij} W_{ij} s_i s_j − Σ_i b_i s_i}.

Generalised (gradient M-step) EM requires the parameter step

∆W_{ij} ∝ ∂/∂W_{ij} ⟨log p(s_V, s_H | W, b)⟩_{p(s_H | s_V)}

Write ⟨·⟩_c (clamped) for expectations under p(s_H | s_V). Then

∇W_{ij} = ⟨s_i s_j⟩_c − ⟨s_i s_j⟩_u

with ⟨·⟩_u (unclamped) an expectation under the current joint distribution p(s_H, s_V).
Computing expectations
How do we find the required expectations?

◮ Junction tree is generally intractable in all but the sparsest nets (triangulation of loops makes cliques grow very large).
◮ Rejection and importance sampling require good proposal distributions, which are difficult to find.
◮ Loopy belief propagation fails in nets with strong correlations.
◮ Mean-field methods are possible, but approximate (and pretty inaccurate).

It is, however, easy to compute conditional samples: given the settings of the nodes in the Markov blanket of s_i, we can compute and normalise the scalar p(s_i | s_{\i}) and toss a (virtual) coin. This suggests an iterative approach:

◮ Choose variable settings randomly (set any clamped nodes to their clamped values).
◮ Cycle through the (unclamped) s_i, choosing s_i ∼ p(s_i | s_{\i}).

After enough samples, we might expect to reach a sample from the correct distribution. This is an example of Gibbs sampling, also called the heat bath or Glauber dynamics.
Markov chain Monte Carlo (MCMC) methods
Suppose we seek samples from a distribution p*(x). Let us construct a Markov chain:

x_0 → x_1 → x_2 → x_3 → x_4 → x_5 …

where x_0 ∼ p_0(x) and T(x → x′) = p(X_t = x′ | X_{t−1} = x) is the Markov chain transition probability from x to x′, and we can easily sample from each of these. Then the marginal distributions in the chain are x_t ∼ p_t(x), with the property that:

p_t(x′) = Σ_x p_{t−1}(x) T(x → x′)

Under some conditions, these marginals converge to an invariant/stationary/equilibrium distribution characterised by T, with:

p_∞(x′) = Σ_x p_∞(x) T(x → x′)  ∀x′

If we can choose T(x → x′) so as to ensure p_∞ = p*, and sample from the Markov chain for long enough, we can obtain samples from distributions arbitrarily close to p*.
Constructing a Markov chain for MCMC
When does the Markov chain x_0 → x_1 → x_2 → x_3 → … with marginals x_0 ∼ p_0(x) and

p_t(x′) = Σ_x p_{t−1}(x) T(x → x′)

according to transition probability T(x → x′) have the right invariant distribution?

◮ First we need convergence to a unique stationary distribution regardless of the initial state x_0 ∼ p_0(x): this is a form of ergodicity.

lim_{t→∞} p_t(x) = p_∞(x)

A sufficient condition for the Markov chain to be ergodic is that T^k(x → x′) > 0 for all x and all x′ where p_∞(x′) > 0, for some k. (*) That is, if the equilibrium distribution gives non-zero probability to state x′, then the Markov chain should be able to reach x′ from any x after some finite number of steps, k.

◮ A useful sufficient condition for p*(x) being invariant is detailed balance:

p*(x′) T(x′ → x) = p*(x) T(x → x′)  (**)

If T and p* satisfy both (*) and (**), then the marginal of the chain defined by T will converge to p*.
Gibbs Sampling
A method for sampling from a multivariate distribution p(x).

Idea: sample from the conditional of each variable given the settings of the other variables. Repeatedly:

1) pick i (either at random or in turn)
2) replace x_i by a sample from the conditional distribution p(x_i | x_{\i}) = p(x_i | x_1, …, x_{i−1}, x_{i+1}, …, x_n)

Gibbs sampling is feasible if it is easy to sample from these conditional probabilities. This creates a Markov chain x^(1) → x^(2) → x^(3) → … [Slide figure: 20 (half-)iterations of Gibbs sampling on a bivariate Gaussian.] Under some (mild) conditions, the equilibrium distribution of this Markov chain, i.e. p(x^(∞)), is p(x).
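A minimal Python sketch of Gibbs sampling for the bivariate Gaussian example: with unit marginal variances and correlation ρ, each conditional is x_i | x_j ∼ N(ρ x_j, 1 − ρ²):

```python
import numpy as np

rng = np.random.default_rng(0)

# Gibbs sampling from N(0, [[1, rho], [rho, 1]]):
# each conditional is x_i | x_j ~ N(rho * x_j, 1 - rho^2).
rho = 0.8
sd = np.sqrt(1 - rho ** 2)
x1, x2 = 0.0, 0.0
samples = np.empty((20_000, 2))
for t in range(len(samples)):
    x1 = rng.normal(rho * x2, sd)   # sample x1 | x2
    x2 = rng.normal(rho * x1, sd)   # sample x2 | x1
    samples[t] = (x1, x2)

corr = np.corrcoef(samples.T)[0, 1]
print(corr)   # close to rho = 0.8
```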
Detailed balance for Gibbs sampling
We can show that Gibbs sampling has the right stationary distribution p(x) by showing that the detailed balance condition is met. The transition probabilities are given by:

T(x → x′) = π_i p(x′_i | x_{\i})

where π_i is the probability of choosing to update the ith variable (to handle deterministic sweeps instead of random updates, we need to consider transitions due to one full sweep). Then we have:

T(x → x′) p(x) = π_i p(x′_i | x_{\i}) · p(x_i | x_{\i}) p(x_{\i})   [the last two factors are p(x)]

and

T(x′ → x) p(x′) = π_i p(x_i | x′_{\i}) · p(x′_i | x′_{\i}) p(x′_{\i})   [the last two factors are p(x′)]

But x′_{\i} = x_{\i}, so detailed balance holds.
Gibbs Sampling in Graphical Models
Initialize all variables to some settings. Sample each variable conditioned on the other variables (equivalently, conditioned on its Markov blanket). The BUGS software implements this algorithm for very general probabilistic models (though not very large ones).
The Metropolis-Hastings algorithm
Gibbs sampling can be slow (x_i may be almost fully determined by x_{\i}), and the conditionals may be intractable. A global transition might be better.

Idea: propose a change to the current state; accept or reject (a kind of rejection sampling). Each step, starting from the current state x:

1. Propose a new state x′ using a proposal distribution S(x′|x) = S(x → x′).
2. Accept the new state with probability min(1, p(x′)S(x′ → x) / p(x)S(x → x′)).
3. Otherwise retain the old state.

[Slide figure: 20 iterations of global Metropolis sampling from a bivariate Gaussian; rejected proposals are dotted.]

◮ The original Metropolis algorithm assumed a symmetric proposal, S(x′|x) = S(x|x′); Hastings generalised it.
◮ Local (changing one or a few x_i's) vs global (changing all of x) proposal distributions.
◮ Efficiency is dictated by the balance between a high acceptance rate and a large step size.
◮ One may adapt S(x → x′) to balance these, but stationarity only holds once S is fixed.
◮ Note: we need only compute ratios of probabilities (no normalising constants).
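A minimal Python sketch of random-walk Metropolis (symmetric proposal, so the Hastings correction cancels) for an unnormalised standard-normal target, working with log probabilities for numerical stability:

```python
import numpy as np

rng = np.random.default_rng(0)

# Random-walk Metropolis for p(x) ∝ exp(-x^2/2).
# Only the ratio p(x')/p(x) is needed, so the normaliser never appears.
def log_p(x):
    return -0.5 * x ** 2

x = 0.0
samples = np.empty(50_000)
for t in range(len(samples)):
    x_prop = x + rng.normal(0.0, 1.0)                 # symmetric proposal S(x'|x)
    if np.log(rng.uniform()) < log_p(x_prop) - log_p(x):
        x = x_prop                                    # accept
    samples[t] = x                                    # rejection keeps old state

print(samples.var())   # close to 1.0
```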
Detailed balance for Metropolis-Hastings
The transition kernel is:

T(x → x′) = S(x → x′) min(1, p(x′)S(x′ → x) / p(x)S(x → x′))  for x′ ≠ x,

with T(x → x) absorbing the expected rejection probability. Without loss of generality, assume p(x′)S(x′ → x) ≤ p(x)S(x → x′). Then

p(x)T(x → x′) = p(x)S(x → x′) · p(x′)S(x′ → x) / (p(x)S(x → x′)) = p(x′)S(x′ → x)

and

p(x′)T(x′ → x) = p(x′)S(x′ → x) · 1 = p(x′)S(x′ → x)

so detailed balance holds.
Practical MCMC
Markov chain theory guarantees that

(1/T) Σ_{t=1}^{T} F(x_t) → E[F(x)]  as T → ∞.

But given finite computational resources we have to compromise…
◮ Convergence diagnosis is hard. Usually plot various useful statistics, e.g. log probability,
clusters obtained, factor loadings, and eye-ball convergence.
◮ Control runs: initial runs of the Markov chain used to set parameters like step size etc for
good convergence. These are discarded.
◮ Burn-in: discard first samples from Markov chain before convergence. ◮ Collecting samples: usually run Markov chain for a number of iterations between
collected samples to reduce dependence between samples.
◮ Number of runs: for the same amount of computation, we can either run one long
Markov chain (best chance of convergence), lots of short chains (wasted burn-ins, but chains are independent), or in between.
Practical MCMC
◮ Multiple transition kernels: different transition kernels have different convergence
properties and it may often be a good idea to use multiple kernels.
◮ Mixing MCMC transitions in a way that depends on the last sample (“adaptive
transition”) risks breaking detailed balance with respect to the target and creating a different invariant distribution.
◮ Can sometimes be rescued by treating mixed transitions as a mixture proposal
distribution and introducing Hastings accept/reject step.
◮ Integrated autocorrelation time: an estimate of the number of steps needed for F(x_t) to become independent (no guarantees; it probably underestimates the true autocorrelation time). Assume wlog E[F(x)] = 0.

V[(1/T) Σ_{t=1}^{T} F(x_t)] = E[((1/T) Σ_{t=1}^{T} F(x_t))²] = (V[F(x)]/T) (1 + 2 Σ_{t=1}^{T−1} (1 − t/T) C_t/C_0)

where C_t = E[F(x_i) F(x_{i+t})] are the autocorrelations. The integrated autocorrelation time, the factor in parentheses, is the number of correlated samples we need in excess of true iid samples to achieve the same variance. As T → ∞, this becomes:

1 + 2 Σ_{t=1}^{∞} C_t/C_0
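A small Python sketch estimating the integrated autocorrelation time from samples of an AR(1) chain x_t = a x_{t−1} + noise, for which the exact value is (1 + a)/(1 − a); the sum is truncated at a finite lag, as any practical estimator must be:

```python
import numpy as np

rng = np.random.default_rng(0)

# AR(1) chain x_t = a x_{t-1} + noise: C_t/C_0 = a^t, so the exact
# integrated autocorrelation time is 1 + 2 * sum_t a^t = (1 + a)/(1 - a).
a = 0.9
T = 200_000
x = np.empty(T)
x[0] = 0.0
for t in range(1, T):
    x[t] = a * x[t - 1] + rng.standard_normal()

x = x - x.mean()
c0 = np.mean(x * x)
tau = 1.0
for lag in range(1, 200):                 # truncate the sum at a finite lag
    ct = np.mean(x[:-lag] * x[lag:])      # empirical autocorrelation C_lag
    tau += 2.0 * ct / c0

print(tau)   # near (1 + 0.9)/(1 - 0.9) = 19
```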
Annealing
Very often we need to sample from an unnormalised distribution: p(x) = (1/Z) e^{−E(x)}. MCMC sampling works well in this setting, but may mix slowly. (For Gibbs sampling it is usually possible to normalise the conditionals.) It is often useful to introduce a temperature 1/β:

p_β(x) = (1/Z_β) e^{−βE(x)}

When β → 0 (temperature → ∞) all states are equally likely: easy to sample and mix. As β → 1, p_β → p.

Simulated annealing: start the chain with β small and gradually increase it to 1. Can be used for optimisation by taking β → ∞ (provably finds the global mode, albeit using an exponentially slow annealing schedule).

Annealed importance sampling: use importance sampling to correct for the fact that the chain need not have converged at each β. Equivalently, use annealing to construct a good proposal for importance sampling.
Auxiliary variable methods

Many practical MCMC methods are based on the principle of auxiliary variables:

◮ We want to sample from p*(x), but a suitable Markov chain is either difficult to design, or does not mix well.
◮ Introduce a new variable y, defined using a conditional p(y|x).
◮ Sample from the joint p*(x)p(y|x) (often by Gibbs).
◮ Discard the ys. The xs are drawn from the target p*.
Hybrid/Hamiltonian Monte Carlo: overview
◮ The transition processes of the Metropolis-Hastings and Gibbs algorithms are essentially shaped random walks.
◮ The typical distance travelled by a random walk in n steps is proportional to √n, so these methods explore the distribution slowly. We would like to seek regions of high probability while avoiding random-walk behaviour.
◮ Let the target distribution be p(x), and suppose that we can compute the gradient of p with respect to x.
◮ Can we use the gradient information to shape a proposal distribution without breaking detailed balance?
◮ Hybrid or Hamiltonian Monte Carlo: introduce a fictitious physical system describing a particle with position x and momentum v (an auxiliary variable).
◮ Make proposals by simulating motion in the dynamical system, so that the marginal distribution over position corresponds to the target.
◮ An MH acceptance step corrects for the errors of the discretised simulation.
Hamiltonian Monte Carlo: the dynamical system
In the physical system, "positions" x corresponding to the random variables of interest are augmented by momentum variables v:

p(x, v) ∝ exp(−H(x, v)),  H(x, v) = E(x) + K(v),  E(x) = − log p(x),  K(v) = (1/2) Σ_i v_i²

With these definitions, ∫ p(x, v) dv = p(x) (the desired distribution) and p(v) = N(0, I). We think of E(x) as the potential energy of being in state x, and K(v) as the kinetic energy associated with momentum v. We assume "mass" = 1, so momentum = velocity. The physical system evolves at constant total energy H according to Hamiltonian dynamics:

dx_i/dτ = ∂H/∂v_i = v_i,    dv_i/dτ = −∂H/∂x_i = −∂E/∂x_i.

The first equation says the derivative of position is velocity. The second says that the system accelerates in the direction that decreases potential energy. Think of a ball rolling on a frictionless hilly surface.
Hamiltonian Monte Carlo: how to simulate the dynamical system
We can simulate the above differential equations by discretising time and iterating over finite differences on a computer. This introduces small (we hope) errors. (The errors we care about are those which change the total energy; we will correct for these by occasionally rejecting moves that change the energy.) A good simulation scheme for HMC is leapfrog simulation. We take L discrete steps of size ε to simulate the system evolving for Lε time:

v̂_i(τ + ε/2) = v̂_i(τ) − (ε/2) ∂E(x̂(τ))/∂x_i
x̂_i(τ + ε) = x̂_i(τ) + ε v̂_i(τ + ε/2)/m_i
v̂_i(τ + ε) = v̂_i(τ + ε/2) − (ε/2) ∂E(x̂(τ + ε))/∂x_i

(with the masses m_i = 1 here, so the position update is simply x̂_i(τ) + ε v̂_i(τ + ε/2)).
Hamiltonian Monte Carlo: properties of the dynamical system
Hamiltonian dynamics has the following important properties:
◮ preserves total energy, H, ◮ is reversible in time ◮ preserves phase space volumes (Liouville’s theorem)
The leapfrog discretisation only approximately preserves the total energy H, but
◮ is reversible in time ◮ preserves phase space volume
The dynamical system is simulated using the leapfrog discretisation, and the new state is used as a proposal in the Metropolis algorithm to eliminate the errors introduced by the leapfrog approximation.
Stochastic dynamics
Changes in the total energy H are introduced by interleaving the deterministic leapfrog transitions with stochastic updates of the momentum variables. Since the distribution of the momenta is independent Gaussian, this is easily done by a trivial Gibbs sampling step (which doesn't depend on x). In practice, it is often useful to introduce persistence in the momentum to further suppress random walks:

v_i ← α v_i + √(1 − α²) ε,  where ε ∼ N(0, 1) and the persistence parameter satisfies 0 ≤ α < 1.
Hamiltonian Monte Carlo Algorithm
1. A new state is proposed by deterministically simulating a trajectory with L discrete steps from (x, v) to (x*, v*). To compensate for errors of discretisation, the new state (x*, v*) is accepted with probability:

min(1, p(v*, x*)/p(v, x)) = min(1, e^{−(H(v*, x*) − H(v, x))})

Otherwise the state remains the same. [Formally, this is a Metropolis step with a proposal defined by choosing the direction of momentum uniformly at random and simulating L (reversible) leapfrog steps.]

2. Resample the momentum vector (by a trivial Gibbs sampling step, or with persistence):

v ∼ p(v|x) = p(v) = N(0, I)

[Slide figure: L = 20 leapfrog iterations when sampling from a bivariate Gaussian.]
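A minimal Python sketch of the full algorithm for a standard-normal target (E(x) = x²/2, so ∂E/∂x = x), with the leapfrog scheme and Metropolis correction as above; ε and L are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# HMC for E(x) = x^2/2 (standard-normal target), unit mass.
def grad_E(x):
    return x

def leapfrog(x, v, eps, L):
    v = v - 0.5 * eps * grad_E(x)       # initial half step in momentum
    for _ in range(L - 1):
        x = x + eps * v                 # full step in position
        v = v - eps * grad_E(x)         # full step in momentum
    x = x + eps * v
    v = v - 0.5 * eps * grad_E(x)       # final half step in momentum
    return x, v

x = 0.0
samples = np.empty(20_000)
for t in range(len(samples)):
    v = rng.standard_normal()           # resample momentum: v ~ N(0, 1)
    x_new, v_new = leapfrog(x, v, eps=0.2, L=20)
    dH = (x_new ** 2 - x ** 2) / 2 + (v_new ** 2 - v ** 2) / 2
    if np.log(rng.uniform()) < -dH:     # Metropolis correction for leapfrog error
        x = x_new
    samples[t] = x

print(samples.var())   # close to 1.0
```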
Langevin Monte Carlo

If we take exactly one leapfrog step, the HMC updates reduce to:

v ∼ N(0, I)
x*_i = x_i + ε v̂_i(τ + ε/2) = x_i − (ε²/2) ∂E(x)/∂x_i + ε v_i
v*_i = v̂_i(τ + ε/2) − (ε/2) ∂E(x*)/∂x_i = v_i − (ε/2) ∂E(x)/∂x_i − (ε/2) ∂E(x*)/∂x_i

p_accept = min(1, (p(x*)/p(x)) e^{−(1/2)(‖v*‖² − ‖v‖²)})

Note that the proposal for x* looks like a step up the gradient of log p(x) plus Gaussian noise. The relative scales of the step and noise are adjusted to keep the Hamiltonian energy constant.

◮ It is possible to rewrite the acceptance probability in terms of (x, x*) alone: this is equivalent to a Hastings acceptance rule taking the asymmetric proposal into account.
◮ In practice, with small ε, a single leapfrog step introduces very small discretisation errors in energy, so p_accept ≈ 1.
◮ Thus the acceptance step is often neglected: Langevin sampling.
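A minimal Python sketch of Langevin sampling in this neglected-acceptance form, for E(x) = x²/2; note the small discretisation bias that the omitted Metropolis step would otherwise remove:

```python
import numpy as np

rng = np.random.default_rng(0)

# Unadjusted Langevin sampling for E(x) = x^2/2 (standard-normal target):
# a single leapfrog step with the acceptance test neglected.
eps = 0.3
x = 0.0
samples = np.empty(100_000)
for t in range(len(samples)):
    v = rng.standard_normal()                 # fresh momentum each step
    x = x - 0.5 * eps ** 2 * x + eps * v      # gradient step plus Gaussian noise
    samples[t] = x

print(samples.var())   # close to 1, up to O(eps^2) discretisation bias
```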
Slice sampling

◮ An efficient auxiliary-variable Gibbs sampler for low-dimensional distributions: can be used as an internal component of a high-D Gibbs or M-H chain.
◮ Start with a sample x.
◮ Sample an auxiliary variable y | x ∼ Uniform[0, p*(x)]:

p(x, y) = p*(x) p(y|x) = p*(x) · (1/p*(x)) = 1  if 0 ≤ y ≤ p*(x),  0 otherwise

◮ Sample x′ from p(x′|y):

p(x′|y) ∝ p(x′, y) = 1 if p*(x′) ≥ y,  0 otherwise

So x′ ∼ Uniform[{x : p*(x) ≥ y}].
◮ Defining {x : p*(x) ≥ y} is often difficult, so instead we rejection sample within an (adaptive) y-level "slice" around the old x, extending outside at least the local mode.
Slice sampling – defining the y-level slices

◮ Finding the boundaries of {x : p*(x) ≥ y} requires inverting the density function: often very difficult.
◮ The target stationary distribution is uniform under the p*(x) density curve. So to preserve stationarity it is sufficient to have:
  ◮ ergodicity
  ◮ detailed balance ⇒ T((x, y) → (x′, y′)) = T((x′, y′) → (x, y)).
◮ ⇒ It is sufficient to transition uniformly within the local mode (assuming contiguous support).
◮ Grow the slice in steps of size ∆ until both ends lie above the density curve; then rejection sample within the slice.
◮ The first step is randomly positioned around the current sample.
◮ The slice may cross over into neighbouring modes. If so, keep extending. The same process started at a sample in the next mode can cross back into the initial mode, preserving reversibility.
◮ When rejection sampling, any samples that fall above the density can be used to restrict the proposal slice (adaptive RS).
◮ First step randomly positioned around current sample. ◮ Slice may cross over neighbouring modes. If so, keep extending. Same process
started at sample in next mode can cross back into initial mode, preserving reversibility.
◮ When rejection sampling, any samples that fall above density can be used to restrict
proposal slice (adaptive RS).
Slice sampling – defining the y-level slices
p∗(x) y′
◮ Finding the boundaries of {x : p∗(x) ≥ y} requires inverting the density function: often
very difficult.
◮ Target stationary distribution is uniform under p∗(x) density curve. So to preserve
stationarity it is sufficient to have:
◮ ergodicity ◮ detailed balance ⇒ T((x, y) → (x′, y′)) = T((x′, y′) → (x, y)).
◮ ⇒ sufficient to transition uniformly within local mode (assuming contiguous support). ◮ Grow slice in steps of size ∆ until both ends are above density; rejection sample.
◮ First step randomly positioned around current sample. ◮ Slice may cross over neighbouring modes. If so, keep extending. Same process
started at sample in next mode can cross back into initial mode, preserving reversibility.
◮ When rejection sampling, any samples that fall above density can be used to restrict
proposal slice (adaptive RS).
Slice sampling – defining the y-level slices
p∗(x) y′
◮ Finding the boundaries of {x : p∗(x) ≥ y} requires inverting the density function: often
very difficult.
◮ Target stationary distribution is uniform under p∗(x) density curve. So to preserve
stationarity it is sufficient to have:
◮ ergodicity ◮ detailed balance ⇒ T((x, y) → (x′, y′)) = T((x′, y′) → (x, y)).
◮ ⇒ sufficient to transition uniformly within local mode (assuming contiguous support). ◮ Grow slice in steps of size ∆ until both ends are above density; rejection sample.
◮ First step randomly positioned around current sample. ◮ Slice may cross over neighbouring modes. If so, keep extending. Same process
started at sample in next mode can cross back into initial mode, preserving reversibility.
◮ When rejection sampling, any samples that fall above density can be used to restrict
proposal slice (adaptive RS).
Slice sampling – defining the y-level slices
[Figure: density p∗(x) with a horizontal slice at level y′]
◮ Finding the boundaries of {x : p∗(x) ≥ y} requires inverting the density function: often very difficult.
◮ The target stationary distribution is uniform under the p∗(x) density curve. So to preserve stationarity it is sufficient to have:
◮ ergodicity
◮ detailed balance ⇒ T((x, y) → (x′, y′)) = T((x′, y′) → (x, y)).
◮ ⇒ sufficient to transition uniformly within the local mode (assuming contiguous support).
◮ Grow the slice in steps of size ∆ until both ends lie above the density curve (i.e. until p∗ at each end falls below y); then rejection sample uniformly within the slice.
◮ The first step is randomly positioned around the current sample.
◮ The slice may cross over neighbouring modes. If so, keep extending: the same process started at a sample in the next mode can cross back into the initial mode, preserving reversibility.
◮ When rejection sampling, any samples that fall above the density can be used to restrict the proposal slice (adaptive rejection sampling).
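The stepping-out and shrinkage procedure described in the bullets above can be sketched in a few lines of Python (a minimal 1-D sketch; the function name `slice_sample` and its log-density interface are illustrative, not from the slides):

```python
import numpy as np

def slice_sample(log_p, x0, n_samples, step=1.0, rng=None):
    """1-D slice sampler with stepping-out and shrinkage.

    log_p: log of the unnormalised density p*(x).
    step:  initial slice-growing step size (the Delta of the slides).
    """
    rng = np.random.default_rng(rng)
    x = x0
    samples = np.empty(n_samples)
    for t in range(n_samples):
        # Sample the auxiliary level y' uniformly under the density curve.
        log_y = log_p(x) + np.log(rng.uniform())
        # Stepping out: first step randomly positioned around the current
        # sample, then grow until both ends lie above the density curve.
        left = x - step * rng.uniform()
        right = left + step
        while log_p(left) > log_y:
            left -= step
        while log_p(right) > log_y:
            right += step
        # Rejection sample within the slice; rejected points shrink the
        # proposal slice (the adaptive-RS idea).
        while True:
            x_new = rng.uniform(left, right)
            if log_p(x_new) > log_y:
                x = x_new
                break
            if x_new < x:
                left = x_new
            else:
                right = x_new
        samples[t] = x
    return samples
```

For example, `slice_sample(lambda x: -0.5 * x**2, 0.0, 5000)` draws from a standard normal; no step-size tuning beyond the initial `step` is needed.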
Evaluating the evidence — Annealed Importance Sampling (AIS)
◮ Bayesian learning often depends on evaluating the marginal likelihood
p(D) = ∫ dθ p(D|θ) p(θ)
◮ Prior mass and likelihood may not agree, so the simple Monte-Carlo estimate (with samples drawn from p(θ)) may have high variance (particularly in high dimensions).
◮ Samples from the unnormalised posterior p(θ|D) ∝ p(D|θ)p(θ) can often be found by MCMC; but the samples alone do not provide an estimate of p(D).
◮ Idea: use MCMC transitions from a known distribution to form the proposal for importance sampling.
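As a concrete 1-D illustration of the simple Monte-Carlo estimate (the model and the numbers here are invented for this sketch): with prior θ ∼ N(0, 1) and likelihood y | θ ∼ N(θ, 1), the evidence is available in closed form (y ∼ N(0, 2)), so the estimator can be checked:

```python
import numpy as np

rng = np.random.default_rng(0)
y = 1.5                                   # a single observed datum

# Closed-form evidence for this conjugate model: y ~ N(0, 2).
true_evidence = np.exp(-y**2 / 4.0) / np.sqrt(2 * np.pi * 2.0)

# Simple Monte-Carlo estimate: average the likelihood over prior samples.
S = 100_000
theta = rng.standard_normal(S)            # theta^(s) ~ p(theta) = N(0, 1)
lik = np.exp(-0.5 * (y - theta) ** 2) / np.sqrt(2 * np.pi)
estimate = lik.mean()                     # consistent estimate of p(D)
```

In one dimension this works well; as the dimension of θ grows, the region where p(D|θ) is large occupies vanishingly small prior mass, and the variance of this estimator explodes — the problem AIS addresses.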
Annealed importance sampling
◮ Consider a sequence of densities
pj(x) = (1/Zj) p∗(x)^βj p0(x)^(1−βj),   1 = β1 > · · · > βn = 0
where p1 = p∗ is the target density and pn = p0 is easy to sample and normalise (perhaps a prior).
◮ Let Tj(x → x′) be an MCMC transition rule that leaves pj invariant.
◮ Draw samples from q(x1 . . . xn−1) = pn Tn−1 . . . T2:
x(t)_n−1 ∼ pn;  x(t)_n−2 ∼ Tn−1(x(t)_n−1 → ·);  . . . ;  x(t)_1 ∼ T2(x(t)_2 → ·)
◮ Importance-weight the samples with
w(t) = (p∗(x(t)_1)/p0(x(t)_1))^(β1−β2) · (p∗(x(t)_2)/p0(x(t)_2))^(β2−β3) · · · (p∗(x(t)_n−1)/p0(x(t)_n−1))^(βn−1−βn)
◮ Then:
(1/T) Σt w(t) → Z∗/Z0
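The annealing run above can be sketched as follows (a minimal sketch, assuming a 1-D state, random-walk Metropolis for the Tj, and an increasing β schedule; the function name and interface are illustrative):

```python
import numpy as np

def ais_log_evidence(log_pstar, log_p0, sample_p0, betas, n_runs,
                     mh_steps=5, prop_std=0.5, rng=None):
    """Annealed importance sampling estimate of log(Z*/Z0).

    betas is the annealing schedule in *increasing* order, from
    betas[0] = 0 (the base p0) to betas[-1] = 1 (the target p*).
    Each run anneals a sample from p0 towards p*, applying a few
    Metropolis steps at every intermediate p_b ∝ p*^b p0^(1-b).
    """
    rng = np.random.default_rng(rng)
    log_w = np.zeros(n_runs)
    for t in range(n_runs):
        x = sample_p0(rng)                        # x ~ p_n = p0
        for k in range(1, len(betas)):
            b_prev, b = betas[k - 1], betas[k]
            # Incremental weight (p*/p0)^(b - b_prev) at the current x.
            log_w[t] += (b - b_prev) * (log_pstar(x) - log_p0(x))
            # Metropolis transition leaving p_b invariant.
            log_pb = lambda z: b * log_pstar(z) + (1.0 - b) * log_p0(z)
            for _ in range(mh_steps):
                x_prop = x + prop_std * rng.standard_normal()
                if np.log(rng.uniform()) < log_pb(x_prop) - log_pb(x):
                    x = x_prop
    # Log of the weight average, computed stably in log space.
    return np.logaddexp.reduce(log_w) - np.log(n_runs)
```

For instance, with base N(0, 1) (normalised, so Z0 = 1) and unnormalised target exp(−(x−1)²/2) (so Z∗ = √(2π)), the returned value should approach log √(2π) ≈ 0.919.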
Annealed importance weights
◮ Define the reversed transition probability:
←Tj(x′ → x) = Tj(x → x′) pj(x)/pj(x′)
(only a different function if detailed balance doesn’t hold).
◮ We drew samples from the joint q(x1 . . . xn−1). We need a joint target:
p(x1 . . . xn−1) = p1(x1) ←T2(x1 → x2) ←T3(x2 → x3) · · · ←Tn−1(xn−2 → xn−1)
(Note that x1 is drawn from the target distribution p∗.)
◮ The importance weights are (dropping the sample index (t) for clarity):
w = p/q
= p1(x1) ←T2(x1 → x2) ←T3(x2 → x3) · · · ←Tn−1(xn−2 → xn−1) / [T2(x2 → x1) T3(x3 → x2) · · · Tn−1(xn−1 → xn−2) pn(xn−1)]
= p1(x1) · (p2(x2)/p2(x1)) · (p3(x3)/p3(x2)) · · · (pn−1(xn−1)/pn−1(xn−2)) · 1/pn(xn−1)
= [p∗(x1)^β1 p0(x1)^(1−β1) / p∗(x1)^β2 p0(x1)^(1−β2)] · [p∗(x2)^β2 p0(x2)^(1−β2) / p∗(x2)^β3 p0(x2)^(1−β3)] · · · [p∗(xn−1)^(βn−1) p0(xn−1)^(1−βn−1) / p∗(xn−1)^βn p0(xn−1)^(1−βn)]
= (p∗(x1)/p0(x1))^(β1−β2) (p∗(x2)/p0(x2))^(β2−β3) · · · (p∗(xn−1)/p0(xn−1))^(βn−1−βn)
Other Ideas in MCMC
◮ Rao-Blackwellisation or collapsing: integrate out variables that are tractable, to lower variance or improve convergence.
◮ Exact sampling: yields exact samples from the equilibrium distribution of a Markov chain, using the idea of coupling from the past: if two Markov chains use the same set of pseudo-random numbers, then even if they started in different states, once they transition to the same state they remain together thereafter.
◮ Adaptive rejection sampling: during rejection sampling, if a sample is rejected, use it to improve the proposal distribution.
◮ Nested sampling: another, quite different, Monte-Carlo integration method.
◮ . . .
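The coupling-from-the-past idea in the exact-sampling bullet can be sketched for a finite-state chain (Propp–Wilson style; a toy sketch, not an optimised implementation — it runs one coupled copy of the chain from every starting state, reusing the same uniforms as it restarts further back in the past):

```python
import numpy as np

def cftp_sample(P, rng=None):
    """Exact sample from the stationary distribution of transition
    matrix P via coupling from the past."""
    rng = np.random.default_rng(rng)
    n = P.shape[0]
    cum = np.cumsum(P, axis=1)          # row-wise CDFs for inverse sampling
    randoms = []                        # randoms[k] drives the step -(k+1) -> -k
    T = 1
    while True:
        while len(randoms) < T:         # extend further into the past
            randoms.append(rng.uniform())
        states = np.arange(n)           # one chain in every state at time -T
        for k in range(T - 1, -1, -1):  # oldest step first
            u = randoms[k]              # shared randomness couples the chains
            states = np.array([np.searchsorted(cum[s], u) for s in states])
        if np.all(states == states[0]):
            return int(states[0])       # coalesced: exact stationary sample
        T *= 2                          # not yet coalesced: double the horizon
```

For the 2-state chain P = [[0.7, 0.3], [0.4, 0.6]], repeated calls produce draws distributed exactly according to the stationary distribution (4/7, 3/7) — no burn-in needed.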
Sampling – Importance Resampling (SIR)
Consider message passing in an analytically intractable graphical model (e.g. a nonlinear state-space model), where we represent messages using samples.
◮ MCMC requires burn-in to generate accurate samples, so it is unsuited to online settings.
◮ Proposal-based methods (e.g. rejection sampling) generate immediate samples.
◮ SIR is another proposal-based approach to generating samples from an unnormalised distribution:
◮ Sample ξ(s) ∼ q(x), and calculate importance weights w(s) = p(ξ(s))/q(ξ(s)).
◮ Define q̃(x) = Σ_{s=1}^S w(s) δ(x − ξ(s)) / Σ_{s=1}^S w(s).
◮ Resample x(t) ∼ q̃(x).
Resampled points yield consistent expectations (as in importance sampling):
Ex[F(x)] = ∫ dx F(x) q̃(x) = Σ_{s=1}^S w(s) F(ξ(s)) / Σ_{s=1}^S w(s)
but they are not unbiased for S < ∞, even if all distributions are normalised.
◮ For integration, SIR introduces added sampling noise relative to IS (although this trades off against weight variance). But for message passing, unweighted samples are redrawn towards the bulk of the distribution.
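The three SIR steps above can be sketched directly (a minimal sketch; the function name and the Gaussian test densities are illustrative, and the log-space weight normalisation is an added numerical precaution):

```python
import numpy as np

def sir_sample(log_p, log_q, sample_q, S, n_out, rng=None):
    """Sampling-importance resampling: draw S proposals from q, weight
    by p/q, then resample n_out unweighted points from the atoms."""
    rng = np.random.default_rng(rng)
    xi = sample_q(rng, S)                    # xi^(s) ~ q(x)
    log_w = log_p(xi) - log_q(xi)            # unnormalised log weights
    w = np.exp(log_w - log_w.max())          # stabilise before normalising
    w /= w.sum()                             # q~ = weighted atoms at xi^(s)
    idx = rng.choice(S, size=n_out, p=w)     # resample x^(t) ~ q~
    return xi[idx]
```

For example, targeting an unnormalised N(2, 1) through an N(0, 2²) proposal returns unweighted samples concentrated around the target mean — the property that makes SIR useful for sample-based message passing.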
Sequential Monte Carlo — particle filtering
[Figure: state-space model with latent chain x1 → x2 → x3 → · · · → xT and observations y1, y2, y3, . . . , yT]