
SLIDE 1

Intractable Likelihood Functions

Michael Gutmann

Probabilistic Modelling and Reasoning (INFR11134) School of Informatics, University of Edinburgh

Spring semester 2018

SLIDE 2

Recap

p(x | y_o) = Σ_z p(x, y_o, z) / Σ_{x,z} p(x, y_o, z)

Assume that x, y, z each are d = 500 dimensional, and that each element of the vectors can take K = 10 values.
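
To get a sense of the scale (a back-of-the-envelope sketch, not from the slides), count the terms in the sums above:

```python
# Brute-force cost of the sums above (illustrative only).
d, K = 500, 10

# The denominator sums p(x, y_o, z) over every joint configuration of (x, z):
# K^d values for x times K^d values for z, i.e. K^(2d) terms.
print(f"terms in the denominator sum: 10^{2 * d}")     # 10^1000

# Storing the full joint table p(x, y, z) would need K^(3d) entries.
print(f"entries in the full joint table: 10^{3 * d}")  # 10^1500
```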

◮ Topic 1: Representation. We discussed reasonable weak assumptions to efficiently represent p(x, y, z).

◮ Topic 2: Exact inference. We have seen that the same assumptions allow us, under certain conditions, to efficiently compute the posterior probability or derived quantities.

SLIDE 3

Recap

p(x | y_o) = Σ_z p(x, y_o, z) / Σ_{x,z} p(x, y_o, z)

◮ Topic 3: Learning. How can we learn the non-negative numbers p(x, y, z) from data?

  ◮ Probabilistic, statistical, and Bayesian models
  ◮ Learning by parameter estimation and learning by Bayesian inference
  ◮ Basic models to illustrate the concepts
  ◮ Models for factor and independent component analysis, and their estimation by maximising the likelihood

◮ Issue 4: For some models, exact inference and learning are too costly even after fully exploiting the factorisation (independence assumptions) that was made to efficiently represent p(x, y, z).

◮ Topic 4: Approximate inference and learning

SLIDE 4

Recap

Examples we have seen where inference and learning are too costly:

◮ Computing marginals when we cannot exploit the factorisation.

◮ During variable elimination, we may generate new factors that depend on many variables, so that subsequent steps are costly.

◮ Even if we can compute p(x|y_o), if x is high-dimensional, we will generally not be able to compute expectations such as

E[g(x) | y_o] = ∫ g(x) p(x|y_o) dx

for some function g.

◮ Solving optimisation problems such as argmax_θ ℓ(θ) can be computationally costly.

◮ Here: focus on computational issues when evaluating ℓ(θ) that are caused by high-dimensional integrals (sums).

SLIDE 5

Computing integrals

∫_{x∈S} f(x) dx,   S ⊆ R^d

◮ In some cases, closed-form solutions are possible.

◮ If x is low-dimensional (d ≤ 2 or 3), highly accurate numerical methods exist (e.g. Simpson’s rule); see https://en.wikipedia.org/wiki/Numerical_integration.

[Figure: a 1D function integrated numerically on a grid.]

◮ Curse of dimensionality: solutions feasible in low dimensions quickly become computationally prohibitive as the dimension d increases.

◮ We then say that evaluating the integral (sum) is computationally “intractable”.
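
As an illustration (my own sketch, not from the slides): composite Simpson’s rule is cheap and accurate in 1D, but a tensor-product grid needs m^d points in d dimensions.

```python
import numpy as np

def simpson(f, a, b, n=101):
    """Composite Simpson's rule on [a, b] with an odd number n of grid points."""
    assert n % 2 == 1, "Simpson's rule needs an odd number of points"
    x = np.linspace(a, b, n)
    h = (b - a) / (n - 1)
    w = np.ones(n)
    w[1:-1:2] = 4.0  # odd interior points
    w[2:-1:2] = 2.0  # even interior points
    return h / 3.0 * np.sum(w * f(x))

# Highly accurate in 1D: the standard Gaussian density integrates to one.
print(simpson(lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi), -8.0, 8.0))

# But a tensor-product grid with m points per axis needs m^d evaluations:
m, d = 101, 500
print(f"grid points in {d} dimensions: about 10^{int(d * np.log10(m))}")
```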

SLIDE 6

Program

1. Intractable likelihoods due to unobserved variables
2. Intractable likelihoods due to intractable partition functions
3. Combined case of unobserved variables and intractable partition functions

SLIDE 7

Program

1. Intractable likelihoods due to unobserved variables
   ◮ Unobserved variables
   ◮ The likelihood function is implicitly defined via an integral
   ◮ The gradient of the log-likelihood can be computed by solving an inference problem
2. Intractable likelihoods due to intractable partition functions
3. Combined case of unobserved variables and intractable partition functions

SLIDE 8

Unobserved variables

◮ Observed data D correspond to observations of some random variables.

◮ Our model may contain random variables for which we do not have observations, i.e. “unobserved variables”.

◮ Conceptually, we can distinguish between

  ◮ hidden/latent variables: random variables that are important for the model description but for which we (normally) never observe data (see e.g. HMM, factor analysis)

  ◮ variables for which data are missing: these are random variables that are (normally) observed but for which D does not contain observations for some reason (e.g. some people refuse to answer in polls, malfunction of the measurement device, etc.)

SLIDE 9

The likelihood in the presence of unobserved variables

◮ The likelihood function is (proportional to the) probability that the model generates data like the observed one for parameter θ.

◮ We thus need to know the distribution of the variables for which we have data (e.g. the “visibles” v).

◮ If the model is defined in terms of the visibles and unobserved variables u, we have to marginalise out the unobserved variables (sum rule) to obtain the distribution of the visibles

p(v; θ) = ∫_u p(u, v; θ) du

(replace the integral with a sum in case of discrete variables).

◮ The likelihood function is implicitly defined via an integral

L(θ) = p(D; θ) = ∫_u p(u, D; θ) du,

which is generally intractable.
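
For intuition, a minimal sketch (a toy model of my own, not from the slides): L(θ) = ∫ p(u, D; θ) du estimated by Monte Carlo, for a model whose marginal is also available in closed form as a check.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
v_obs = 1.3  # a single observed data point (hypothetical)

def likelihood_mc(theta, n_samples=100_000):
    """Monte Carlo estimate of L(theta) = integral of p(u, v_obs; theta) du
    for the toy model u ~ N(0, 1), v | u ~ N(theta + u, 1)."""
    u = rng.standard_normal(n_samples)             # samples from p(u)
    return norm.pdf(v_obs, loc=theta + u).mean()   # E_u[ p(v_obs | u; theta) ]

theta = 0.5
print(likelihood_mc(theta))                  # MC estimate
print(norm.pdf(v_obs, theta, np.sqrt(2)))    # exact marginal N(v; theta, 2)
```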

SLIDE 10

Evaluating the likelihood by solving an inference problem

◮ The problem of computing the integral

p(v; θ) = ∫_u p(u, v; θ) du

corresponds to a marginal inference problem.

◮ Even if an analytical solution is not possible, we can sometimes exploit the properties of the model (independencies!) to numerically compute the marginal efficiently (e.g. by message passing).

◮ For each likelihood evaluation, we then have to solve a marginal inference problem.

◮ Example: In HMMs, the likelihood of θ can be computed using the alpha recursion (see e.g. Barber Section 23.2). Note that this only provides the value of L(θ) at a specific value of θ, and not the whole function.
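
A minimal sketch of the alpha recursion for a discrete HMM (my own illustration; the names A, B, pi for the transition matrix, emission matrix, and initial distribution are assumptions, not from the slides):

```python
import numpy as np

def hmm_log_likelihood(obs, A, B, pi):
    """Alpha recursion (forward algorithm): returns log p(obs; theta).

    obs: observation indices, length T
    A:   K x K transitions, A[i, j] = p(h_t = j | h_{t-1} = i)
    B:   K x M emissions,   B[j, m] = p(v_t = m | h_t = j)
    pi:  initial state distribution, length K
    """
    alpha = pi * B[:, obs[0]]          # alpha_1(h) = p(h_1) p(v_1 | h_1)
    log_lik = np.log(alpha.sum())
    alpha /= alpha.sum()               # rescale to avoid underflow
    for v in obs[1:]:
        alpha = (alpha @ A) * B[:, v]  # sum out previous state, emit v
        log_lik += np.log(alpha.sum())
        alpha /= alpha.sum()
    return log_lik

# Example: 2 hidden states, 2 symbols.
A = np.array([[0.9, 0.1], [0.2, 0.8]])
B = np.array([[0.8, 0.2], [0.3, 0.7]])
pi = np.array([0.5, 0.5])
print(hmm_log_likelihood([0, 1, 1, 0], A, B, pi))
```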

SLIDE 11

Evaluating the gradient by solving an inference problem

◮ The likelihood is often maximised by gradient ascent

θ′ = θ + ε ∇θℓ(θ)

where ε denotes the step size.

◮ The gradient ∇θℓ(θ) is given by

∇θℓ(θ) = E[∇θ log p(u, D; θ) | D; θ]

where the expectation is taken with respect to p(u|D; θ).

SLIDE 12

Evaluating the gradient by solving an inference problem

∇θℓ(θ) = E[∇θ log p(u, D; θ) | D; θ]

Interpretation:

◮ ∇θ log p(u, D; θ) is the gradient of the log-likelihood if we had observed the data (u, D) (gradient after “filling in” the data).

◮ p(u|D; θ) indicates which values of u are plausible given D (and when using parameter value θ).

◮ ∇θℓ(θ) is the average of the gradients, weighted by the plausibility of the values that are used to fill in the missing data.
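
A concrete sketch (a toy example of my own, not from the slides): a two-component Gaussian mixture where u_i ∈ {0, 1} picks the component. The posterior p(u|D; θ) is tractable here, so the gradient can be computed exactly by weighting the filled-in gradients with the responsibilities.

```python
import numpy as np
from scipy.stats import norm

def grad_loglik(mu, v):
    """Gradient of l(mu) for the toy mixture u_i ~ Bern(0.5),
    v_i | u_i = k ~ N(mu[k], 1). Implements
    grad l = E[ grad log p(u, D; mu) | D; mu ]."""
    # posterior responsibilities p(u_i = k | v_i; mu): the plausibility weights
    log_w = np.stack([norm.logpdf(v, mu[0]), norm.logpdf(v, mu[1])])
    r = np.exp(log_w - np.logaddexp(log_w[0], log_w[1]))  # 2 x n
    # complete-data gradient wrt mu[k] is sum_i 1[u_i = k] (v_i - mu[k]);
    # filling in u with its posterior gives the weighted average below
    return np.array([np.sum(r[k] * (v - mu[k])) for k in (0, 1)])

rng = np.random.default_rng(1)
v = np.concatenate([rng.normal(-2, 1, 100), rng.normal(2, 1, 100)])
mu = np.array([-1.0, 1.0])
for _ in range(200):                  # gradient ascent, small step size
    mu = mu + 0.01 * grad_loglik(mu, v)
print(mu)                             # approaches (-2, 2)
```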

SLIDE 13

Proof

The key to the proof of ∇θℓ(θ) = E[∇θ log p(u, D; θ) | D; θ] is that f′(x) = (log f(x))′ f(x) for a positive function f(x).

∇θℓ(θ) = ∇θ log ∫_u p(u, D; θ) du

        = (1 / ∫_u p(u, D; θ) du) ∫_u ∇θ p(u, D; θ) du

        = (∫_u ∇θ p(u, D; θ) du) / p(D; θ)

        = (∫_u [∇θ log p(u, D; θ)] p(u, D; θ) du) / p(D; θ)

        = ∫_u [∇θ log p(u, D; θ)] p(u|D; θ) du

        = E[∇θ log p(u, D; θ) | D; θ]

where we have used that p(u|D; θ) = p(u, D; θ) / p(D; θ).
SLIDE 14

How helpful is the connection to inference?

◮ The (log-)likelihood and its gradient can be computed by solving an inference problem.

◮ This is helpful if the inference problems can be solved relatively efficiently.

◮ Allows one to use approximate inference methods (e.g. sampling) for likelihood-based learning.

SLIDE 15

Program

1. Intractable likelihoods due to unobserved variables
   ◮ Unobserved variables
   ◮ The likelihood function is implicitly defined via an integral
   ◮ The gradient of the log-likelihood can be computed by solving an inference problem
2. Intractable likelihoods due to intractable partition functions
3. Combined case of unobserved variables and intractable partition functions

SLIDE 16

Program

1. Intractable likelihoods due to unobserved variables
2. Intractable likelihoods due to intractable partition functions
   ◮ Unnormalised models and the partition function
   ◮ The likelihood function is implicitly defined via an integral
   ◮ The gradient of the log-likelihood can be computed by solving an inference problem
3. Combined case of unobserved variables and intractable partition functions

SLIDE 17

Unnormalised statistical models

◮ Unnormalised statistical models: statistical models where some elements p̃(x; θ) do not integrate/sum to one:

∫ p̃(x; θ) dx = Z(θ) ≠ 1

◮ The partition function Z(θ) can be used to normalise unnormalised models via

p(x; θ) = p̃(x; θ) / Z(θ)

◮ But Z(θ) is only implicitly defined via an integral: to evaluate Z at θ, we have to compute an integral.
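
A minimal sketch (a hypothetical model of my own): for a small discrete model, Z(θ) is just a sum over all states, and dividing by it turns p̃ into a proper distribution.

```python
import numpy as np

# Hypothetical unnormalised model on x in {0, ..., 9}: p_tilde(x; theta) = exp(-theta * x)
x = np.arange(10)
theta = 0.5

p_tilde = np.exp(-theta * x)
Z = p_tilde.sum()      # partition function: sum over all 10 states
p = p_tilde / Z        # normalised model p(x; theta) = p_tilde(x; theta) / Z(theta)

print(Z, p.sum())      # p now sums to one
```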

SLIDE 18

The partition function is part of the likelihood function

◮ Consider

p(x; θ) = p̃(x; θ) / Z(θ) = exp(−θx²/2) / √(2π/θ)

◮ Log-likelihood function for the precision θ ≥ 0:

ℓ(θ) = −n log √(2π/θ) − (θ/2) Σ_{i=1}^n x_i²

◮ The data-dependent and data-independent terms balance each other.

◮ Ignoring Z(θ) leads to a meaningless solution.

◮ Errors in approximations of Z(θ) lead to errors in the MLE.

[Figure: the log-likelihood and its data-independent and data-dependent terms, plotted against the precision θ ∈ [0.5, 2].]
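
The example can be reproduced in a few lines (my own sketch; the data are simulated): the MLE balances the two terms at θ = n/Σᵢ xᵢ², while maximising the data-dependent term alone pushes θ towards zero.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0, 1, 200)       # data with true precision 1
n, s = len(x), np.sum(x**2)

theta = np.linspace(0.05, 3, 500)
data_term = -theta * s / 2                         # from log p_tilde(x; theta)
z_term = -n * np.log(np.sqrt(2 * np.pi / theta))   # from -n log Z(theta)

print(theta[np.argmax(data_term + z_term)])  # MLE ~ n / s, close to 1
print(theta[np.argmax(data_term)])           # ignoring Z: smallest theta in grid
```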

SLIDE 19

The partition function is part of the likelihood function

◮ Assume you want to learn the parameters of an unnormalised statistical model p̃(x; θ) by maximising the likelihood.

◮ For the likelihood function, we need the normalised statistical model p(x; θ):

p(x; θ) = p̃(x; θ) / Z(θ),   Z(θ) = ∫ p̃(x; θ) dx

◮ The partition function enters the log-likelihood function:

ℓ(θ) = Σ_{i=1}^n log p(x_i; θ) = Σ_{i=1}^n log p̃(x_i; θ) − n log Z(θ)

◮ If the partition function is expensive to evaluate, evaluating and maximising the likelihood function is expensive.

SLIDE 20

The partition function in Bayesian inference

◮ Since the likelihood function is needed in Bayesian inference, intractable partition functions are also an issue here.

◮ The posterior is

p(θ|D) ∝ L(θ) p(θ) ∝ (p̃(D; θ) / Z(θ)) p(θ)

◮ This requires the partition function.

◮ If the partition function is expensive to evaluate, likelihood-based learning (MLE or Bayesian inference) is expensive.

SLIDE 21

Evaluating ∇θℓ(θ) by solving an inference problem

◮ When we interpreted MLE as moment matching, we found that (see slide 51 of Basics of Model-Based Learning)

∇θℓ(θ) = Σ_{i=1}^n m(x_i; θ) − n ∫ m(x; θ) p(x; θ) dx

        ∝ (1/n) Σ_{i=1}^n m(x_i; θ) − E[m(x; θ)]

where the expectation is taken with respect to p(x; θ) and m(x; θ) = ∇θ log p̃(x; θ).

◮ Gradient ascent on ℓ(θ) is possible if the expected value can be computed.

◮ The problem of computing the partition function becomes the problem of computing the expected value with respect to p(x; θ).
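
A sketch with a toy model of my own (not from the slides): for p̃(x; θ) = exp(−θx²/2) we have m(x; θ) = −x²/2, and the model expectation is estimated by sampling from p(x; θ) = N(0, 1/θ). Here that sampling is easy; in general it is exactly the hard part.

```python
import numpy as np

rng = np.random.default_rng(0)
x_data = rng.normal(0, 1, 500)      # data, true precision 1

def grad_ll(theta, n_model_samples=10_000):
    """(1/n) sum_i m(x_i; theta) - E[m(x; theta)] with m(x; theta) = -x**2 / 2."""
    m_data = np.mean(-x_data**2 / 2)
    x_model = rng.normal(0, 1 / np.sqrt(theta), n_model_samples)  # ~ p(x; theta)
    return m_data - np.mean(-x_model**2 / 2)

theta = 0.3
for _ in range(200):                # gradient ascent with a small step size
    theta += 0.5 * grad_ll(theta)
print(theta)                        # ~ n / sum(x**2), i.e. close to 1
```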

SLIDE 22

Program

1. Intractable likelihoods due to unobserved variables
2. Intractable likelihoods due to intractable partition functions
   ◮ Unnormalised models and the partition function
   ◮ The likelihood function is implicitly defined via an integral
   ◮ The gradient of the log-likelihood can be computed by solving an inference problem
3. Combined case of unobserved variables and intractable partition functions

SLIDE 23

Program

1. Intractable likelihoods due to unobserved variables
2. Intractable likelihoods due to intractable partition functions
3. Combined case of unobserved variables and intractable partition functions
   ◮ Restricted Boltzmann machine example
   ◮ The likelihood function is implicitly defined via an integral
   ◮ The gradient of the log-likelihood can be computed by solving two inference problems

SLIDE 24

Unnormalised models with unobserved variables

In some cases, we have both unobserved variables and intractable partition functions. Example: Restricted Boltzmann machines (see Tutorial 2)

◮ Unnormalised statistical model (binary v_i, h_j ∈ {0, 1}):

p(v, h; W, a, b) ∝ exp(v⊤Wh + a⊤v + b⊤h)

◮ Partition function (see solutions to Tutorial 2):

Z(W, a, b) = Σ_{v,h} exp(v⊤Wh + a⊤v + b⊤h)

           = Σ_v exp(Σ_i a_i v_i) Π_{j=1}^{dim(h)} [1 + exp(Σ_i v_i W_ij + b_j)]

◮ This quickly becomes very expensive to compute as the number of visibles increases.
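
The formula can be transcribed directly (my own sketch): h is summed out analytically, and the remaining sum enumerates all 2^dim(v) visible configurations, which is what blows up.

```python
import numpy as np
from itertools import product

def rbm_partition(W, a, b):
    """Z(W, a, b) = sum_v exp(a.v) prod_j (1 + exp(v.W[:, j] + b[j])),
    enumerating all 2^dim(v) visible configurations (h summed out in
    closed form). Feasible only for tiny models."""
    Z = 0.0
    for bits in product([0, 1], repeat=len(a)):
        v = np.array(bits, dtype=float)
        Z += np.exp(a @ v) * np.prod(1.0 + np.exp(v @ W + b))
    return Z

rng = np.random.default_rng(0)
dv, dh = 8, 5                     # small enough to enumerate: 2^8 = 256 terms
W = 0.1 * rng.standard_normal((dv, dh))
a = 0.1 * rng.standard_normal(dv)
b = 0.1 * rng.standard_normal(dh)
print(rbm_partition(W, a, b))
```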

SLIDE 25

Unobserved variables and intractable partition functions

◮ Assume we have data D about the visibles v, and the statistical model is specified as

p(u, v; θ) ∝ p̃(u, v; θ),   ∫_{u,v} p̃(u, v; θ) du dv = Z(θ) ≠ 1

◮ The log-likelihood features two generally intractable integrals:

ℓ(θ) = log ∫_u p̃(u, D; θ) du − log ∫_{u,v} p̃(u, v; θ) du dv

SLIDE 26

Unobserved variables and intractable partition functions

◮ The gradient ∇θℓ(θ) is given by the difference of two expectations:

∇θℓ(θ) = E[m(u, D; θ) | D; θ] − E[m(u, v; θ); θ]

where m(u, v; θ) = ∇θ log p̃(u, v; θ).

◮ The first expectation is with respect to p(u|D; θ).

◮ The second expectation is with respect to p(u, v; θ).

◮ Gradient ascent on ℓ(θ) is possible if the two expectations can be computed.

◮ As before, we need to solve inference problems as part of the learning process.
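
For the RBM, the two expectations can be sketched as follows (my own illustration, not from the slides): the first expectation is exact because p(h|v) factorises, while the second is approximated with a few Gibbs steps in the style of contrastive divergence, so this is an approximate gradient, not exact inference.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rbm_grad_W(V, W, a, b, k=10, n_chains=100):
    """Approximate gradient of l wrt W, with m(v, h; W) = grad_W log p_tilde = v h^T."""
    # first expectation: E[v h^T | D; theta], exact since p(h | v) factorises
    pos = V.T @ sigmoid(V @ W + b) / len(V)
    # second expectation: E[v h^T; theta] under the model, via k Gibbs steps
    v = (rng.random((n_chains, len(a))) < 0.5).astype(float)
    for _ in range(k):
        h = (rng.random((n_chains, len(b))) < sigmoid(v @ W + b)).astype(float)
        v = (rng.random((n_chains, len(a))) < sigmoid(h @ W.T + a)).astype(float)
    neg = v.T @ sigmoid(v @ W + b) / n_chains
    return pos - neg

# Usage on random toy data (hypothetical sizes).
V = (rng.random((50, 8)) < 0.5).astype(float)
W = 0.01 * rng.standard_normal((8, 4))
a, b = np.zeros(8), np.zeros(4)
W += 0.1 * rbm_grad_W(V, W, a, b)   # one approximate ascent step
```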

SLIDE 27

Proof

For the second term, due to the log partition function, the same calculations as before give

∇θ log Z(θ) = ∫ [∇θ log p̃(u, v; θ)] p(u, v; θ) du dv

(replace x with (u, v) in the derivations on slide 50 of Basics of Model-Based Learning).

This is an expectation of the “moments”

m(u, v; θ) = ∇θ log p̃(u, v; θ)

with respect to p(u, v; θ).

SLIDE 28

Proof

For the first term, the same steps as for the case of normalised models with unobserved variables give

∇θ log ∫_u p̃(u, D; θ) du = (∫_u [∇θ log p̃(u, D; θ)] p̃(u, D; θ) du) / p̃(D; θ)

where p̃(D; θ) = ∫_u p̃(u, D; θ) du. And since

p̃(u, D; θ) / p̃(D; θ) = (p̃(u, D; θ)/Z(θ)) / (p̃(D; θ)/Z(θ)) = p(u, D; θ) / p(D; θ) = p(u|D; θ)

we have

∇θ log ∫_u p̃(u, D; θ) du = ∫_u [∇θ log p̃(u, D; θ)] p(u|D; θ) du = ∫_u m(u, D; θ) p(u|D; θ) du

which is the posterior expectation of the “moments” evaluated at D, where the expectation is taken with respect to the posterior p(u|D; θ).

SLIDE 29

Program recap

1. Intractable likelihoods due to unobserved variables
   ◮ Unobserved variables
   ◮ The likelihood function is implicitly defined via an integral
   ◮ The gradient of the log-likelihood can be computed by solving an inference problem
2. Intractable likelihoods due to intractable partition functions
   ◮ Unnormalised models and the partition function
   ◮ The likelihood function is implicitly defined via an integral
   ◮ The gradient of the log-likelihood can be computed by solving an inference problem
3. Combined case of unobserved variables and intractable partition functions
   ◮ Restricted Boltzmann machine example
   ◮ The likelihood function is implicitly defined via an integral
   ◮ The gradient of the log-likelihood can be computed by solving two inference problems
