SLIDE 1

Stein Variational Newton & other Sampling-Based Inference Methods

Robert Scheichl
Interdisciplinary Center for Scientific Computing & Institute of Applied Mathematics, Universität Heidelberg

Collaborators: G. Detommaso (Bath); T. Cui (Monash); A. Spantini & Y. Marzouk (MIT); K. Anaya-Izquierdo & S. Dolgov (Bath); C. Fox (Otago)

RICAM Special Semester on Optimization
Workshop 3 – Optimization and Inversion under Uncertainty
Linz, November 11, 2019


SLIDES 2–3

Inverse Problems

Data y, parameter x:

    y = F(x) + e

with forward model F (a PDE) and observation/model errors e, where y ∈ R^{N_y}, x ∈ X and F : X → R^{N_y}.

  • Data y are limited in number, noisy, and indirect.
  • Parameter x is often a function (discretisation needed): continuous, bounded, and sufficiently smooth.


SLIDES 4–6

Bayesian Interpretation

The (physical) model gives π(y|x), the conditional probability of observing y given x. However, to predict, control, optimise or quantify uncertainty, the interest is often really in π(x|y), the conditional probability of possible causes x given the observed data y – the inverse problem. By Bayes' rule:

    π_pos(x) := π(x|y) ∝ π(y|x) π_pr(x)

Extract information from π_pos (means, covariances, event probabilities, predictions) by evaluating posterior expectations:

    E_{π_pos}[h(x)] = ∫ h(x) π_pos(x) dx
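As a concrete illustration (not from the slides): evaluating the unnormalised posterior only needs the forward model, the noise model and the prior. A minimal numpy sketch, assuming additive Gaussian noise and a Gaussian prior; the names F, sigma_e and sigma_pr are placeholders.

```python
import numpy as np

def log_posterior_unnorm(x, y, F, sigma_e=0.3, sigma_pr=1.0):
    """Unnormalised log pi_pos(x) = log pi(y|x) + log pi_pr(x), assuming
    additive Gaussian noise e ~ N(0, sigma_e^2 I) and a Gaussian prior
    x ~ N(0, sigma_pr^2 I)."""
    residual = np.atleast_1d(y - F(x))
    log_lik = -0.5 * np.dot(residual, residual) / sigma_e**2
    log_prior = -0.5 * np.dot(x, x) / sigma_pr**2
    return log_lik + log_prior
```

Posterior expectations E[h(x)] then follow, e.g., by MCMC or by the transport and tensor methods of the remaining slides.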


SLIDES 7–10

Bayes' Rule and Classical Inversion

Classically [Hadamard, 1923]: the inverse map "F⁻¹" (y → x) is typically ill-posed, i.e. it lacks (a) existence, (b) uniqueness or (c) boundedness.

  • The least-squares solution x̂ is the maximum likelihood estimate.
  • The prior distribution π_pr "acts" as a regulariser – well-posedness!
  • The solution of the regularised least-squares problem is the maximum a posteriori (MAP) estimator.

However, in the Bayesian setting, the full posterior π_pos contains more information than the MAP estimator alone, e.g. the posterior covariance matrix reveals components of x that are (relatively) more or less certain.

Possible to sample/explore via Metropolis-Hastings MCMC (in theory).


SLIDES 11–13

Variational Bayes (as opposed to Metropolis-Hastings MCMC)

Aim: characterise the posterior distribution (density π_pos) analytically (at least approximately) for more efficient inference. This is a challenging task since:

  • x ∈ R^d is typically high-dimensional (e.g., a discretised function)
  • π_pos is in general non-Gaussian (even if π_pr and the observation noise are Gaussian)
  • evaluations of the likelihood may be expensive (e.g., solution of a PDE)

Key Tools

Transport Maps, Optimisation, Principal Component Analysis, Model Order Reduction, Hierarchies, Sparsity, Low-Rank Approximation


SLIDES 14–17

Deterministic Couplings of Probability Measures

[Diagram: a map T transporting the reference density η to the target density π]

Core idea [Moselhy, Marzouk, 2012]

  • Choose a reference distribution η (e.g., standard Gaussian).
  • Seek a transport map T : R^d → R^d such that T♯η = π (or equivalently its inverse S = T⁻¹).

In principle, this enables exact (independent, unweighted) sampling! Satisfying these conditions only approximately can still be useful!
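To make the pushforward concrete (an illustrative sketch, not from the slides): once such a map is available, sampling reduces to transforming reference draws. The affine T below and its coefficients A, b are made up for illustration; in practice T comes from an optimisation step.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical affine map T(z) = A z + b, assumed fitted so that T_sharp(eta) ~ pi
A = np.array([[1.0, 0.0],
              [0.5, 0.8]])
b = np.array([0.2, -0.1])

z = rng.standard_normal((1000, 2))   # independent draws z ~ eta = N(0, I)
x = z @ A.T + b                      # x = T(z): independent, unweighted samples
```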


SLIDES 18–22

Variational Inference

Goal: sampling from a target density π(x).

Given a reference density p, find an invertible map T̂ such that

    T̂ := argmin_T DKL(T♯p ‖ π) = argmin_T DKL(p ‖ (T⁻¹)♯π)

where

    T♯p(x) := p(T⁻¹(x)) |det ∇x T⁻¹(x)|   ... push-forward of p
    DKL(p ‖ q) := ∫ log( p(x)/q(x) ) p(x) dx   ... Kullback-Leibler divergence

  • Advantage of using DKL: we do not need the normalising constant of π.
  • Minimise over some suitable class T of maps T (where ideally the Jacobian determinant |det ∇x T⁻¹(x)| is easy to evaluate).
  • To improve: enrich the class T, or use samples of T̂♯p as proposals for MCMC or in importance sampling (see below).


SLIDES 23–26

Many Choices ("Architectures") for T Possible

Examples (list not comprehensive!):

1. Optimal Transport & Knothe-Rosenblatt Rearrangement [Moselhy, Marzouk, 2012], [Marzouk, Moselhy, Parno, Spantini, 2016]
2. Normalizing Flows [Rezende, Mohamed, 2015] (and related methods in the ML literature)
3. Kernel-based variational inference: Stein Variational Methods [Liu, Wang, 2016], [Detommaso, Cui, Spantini, Marzouk, RS, 2018], [Chen, Wu, Chen, O'Leary-Roseberry, Ghattas, arXiv 2019]
4. Layers of low-rank maps [Bigoni, Zahm, Spantini, Marzouk, arXiv 2019]
5. Layers of hierarchical invertible neural networks (HINT) – not today! [Detommaso, Kruse, Ardizzone, Rother, Köthe, RS, arXiv 2019]
6. Low-rank tensor approx. & Knothe-Rosenblatt rearrangement [Dolgov, Anaya-Izquierdo, Fox, RS, 2019]

SLIDE 27

A Stein Variational Newton (SVN) Method

[Detommaso, Cui, Spantini, Marzouk, RS, 2018]


SLIDES 28–30

Stein Variational Gradient Descent

[Liu, Wang, 2016]

Construct T̂ as a composition of simple maps T̂ℓ:

    T̂ := T̂1 ∘ ··· ∘ T̂ℓ ∘ ··· ,   where T̂ℓ := I + Q̂ℓ

Stein Variational Gradient Descent (SVGD) picks the steepest-descent direction in a Reproducing Kernel Hilbert Space (RKHS) H^d with reproducing kernel k : R^d × R^d → R.

Given a reference measure pℓ in the ℓ-th step, define Jpℓ : H^d → R s.t.

    Jpℓ[Q] := DKL( (I + Q)♯ pℓ ‖ π )

Then Q̂ℓ is chosen to satisfy Jpℓ[Q̂ℓ] < Jpℓ[0]. SVGD uses (functional) gradient descent in H^d and picks

    Q̂ℓ(z) := −∇Jpℓ[0] = E_{x∼pℓ}[ ∇x log π(x) k(x, z) + ∇x k(x, z) ]


SLIDES 31–32

Stein Variational Gradient Descent (cont.)

[Liu, Wang, 2016]

Finally one defines p_{ℓ+1} := (T̂ℓ)♯pℓ = (I + Q̂ℓ)♯pℓ.

In practice, pℓ is taken as the empirical density of N particles {x_j^(ℓ)}_{j=1}^N (as in filtering or sequential Monte Carlo methods), such that

    Q̂ℓ(z) := (1/N) Σ_{j=1}^N [ ∇x log π(x_j^(ℓ)) k(x_j^(ℓ), z) + ∇x k(x_j^(ℓ), z) ]

Algorithm 2: Stein variational gradient descent (SVGD)
Input: particles (x_j^(ℓ))_{j=1}^N, step size ε
Output: particles (x_j^(ℓ+1))_{j=1}^N
for j = 1, 2, ..., N do
    x_j^(ℓ+1) ← T̂ℓ(x_j^(ℓ)) := x_j^(ℓ) + ε Q̂ℓ(x_j^(ℓ))
end for
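A minimal numpy sketch of one SVGD update (Algorithm 2), assuming the isotropic Gaussian kernel of [Liu, Wang, 2016]; grad_log_pi is an assumed user-supplied callable returning ∇ log π at each particle, not something defined on the slides.

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    """Isotropic Gaussian kernel k(x, z) = exp(-gamma ||x - z||^2) on all
    particle pairs, plus the gradients d/dx_j k(x_j, x_i)."""
    diff = X[:, None, :] - X[None, :, :]          # (N, N, d): diff[j, i] = x_j - x_i
    K = np.exp(-gamma * np.sum(diff**2, axis=-1)) # (N, N): K[j, i] = k(x_j, x_i)
    gradK = -2.0 * gamma * diff * K[:, :, None]   # (N, N, d): grad_{x_j} k(x_j, x_i)
    return K, gradK

def svgd_step(X, grad_log_pi, eps=0.1, gamma=1.0):
    """One SVGD update x_i <- x_i + eps * Q(x_i), with
    Q(z) = (1/N) sum_j [grad log pi(x_j) k(x_j, z) + grad_{x_j} k(x_j, z)]."""
    N = X.shape[0]
    K, gradK = rbf_kernel(X, gamma)
    G = grad_log_pi(X)                            # (N, d): rows grad log pi(x_j)
    Q = (K.T @ G + gradK.sum(axis=0)) / N         # (N, d): Q[i] = Q(x_i)
    return X + eps * Q
```

Iterating svgd_step moves the particle ensemble from p towards π.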


SLIDES 33–36

1st Improvement: Using Second-Order Information

Particles are evolved sequentially from the initial distribution p_0 = p to the final distribution p_L ≈ π.

SVGD is a deterministic first-order optimisation algorithm. We can accelerate it by introducing second-order information!

Representing Q̂ℓ(x) = Σ_{j=1}^N c_j k_j(x), where k_j(x) := k(x, x_j^(ℓ)), the (exact) Newton step can be computed by solving the linear system H c = g, where

    H_mn := E_pℓ[ −∇² log π k_m k_n + ∇k_m ∇k_n^⊤ ],   m, n = 1, ..., N,
    g_m  := E_pℓ[ ∇ log π k_m + ∇k_m ],   m = 1, ..., N.

In practice, use a block-diagonal approximation (inexact Newton):

    H_mm c_m = g_m ,   for m = 1, ..., N,   and set Q̂ℓ(x_m) = c_m .

SLIDE 37

A Stein Variational Newton Method

Algorithm 3: Stein variational (inexact) Newton
Input: particles (x_j^(ℓ))_{j=1}^N, step size ε
Output: particles (x_j^(ℓ+1))_{j=1}^N
1: for m = 1, 2, ..., N do
2:     Evaluate the gradient g_m and the Hessian H_mm, replacing ∇² log π by its Gauss-Newton approximation (only needs gradient info and is SPD)
3:     Solve the linear system H_mm c_m = g_m and set Q̂ℓ(x_m^(ℓ)) := c_m
4:     Update particle m: x_m^(ℓ+1) ← x_m^(ℓ) + ε Q̂ℓ(x_m^(ℓ))
5: end for
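A sketch of the corresponding block-diagonal (inexact) Newton update, reusing rbf_kernel from the SVGD sketch above. This is an illustration of Algorithm 3 under stated assumptions, not the authors' reference implementation; gn_hess_log_pi is an assumed user-supplied callable returning an SPD Gauss-Newton approximation of −∇² log π.

```python
import numpy as np

def svn_step(X, grad_log_pi, gn_hess_log_pi, eps=1.0, gamma=1.0):
    """One inexact SVN update following Algorithm 3, with Monte Carlo
    estimates of H_mm and g_m over the particle ensemble."""
    N, d = X.shape
    K, gradK = rbf_kernel(X, gamma)               # from the SVGD sketch above
    G = grad_log_pi(X)                            # (N, d)
    A = np.stack([gn_hess_log_pi(x) for x in X])  # (N, d, d): approx of -hess log pi
    X_new = np.empty_like(X)
    for m in range(N):
        H_mm = np.zeros((d, d))
        g_m = np.zeros(d)
        for j in range(N):                        # empirical expectation over p_l
            H_mm += A[j] * K[j, m]**2 + np.outer(gradK[j, m], gradK[j, m])
            g_m += G[j] * K[j, m] + gradK[j, m]
        c_m = np.linalg.solve(H_mm / N, g_m / N)  # Q_l(x_m) = c_m
        X_new[m] = X[m] + eps * c_m
    return X_new
```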


SLIDES 38–39

2nd Improvement: Kernel Based on Hessian Information

[Liu, Wang, 2016] chose the simple isotropic Gaussian kernel

    k(x, z) = exp(−γ ‖x − z‖₂²)

However, the kernel should mimic the shape of the target distribution. We use a scaled & averaged Hessian (available at no extra cost!):

    M ≈ (1/d) E_pℓ[−∇² log π]

and then construct the (data-informed) kernel

    k(x, z) = exp(−½ ‖x − z‖²_M)

(In practice, use the Gauss-Newton Hessian approximation H and a MC average.)
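An illustrative sketch (names assumed) of this data-informed kernel and its gradient for an SPD metric M, in the same format as rbf_kernel above:

```python
import numpy as np

def hessian_kernel(X, M):
    """Data-informed kernel k(x, z) = exp(-0.5 (x-z)^T M (x-z)) and its
    gradients grad_{x_j} k(x_j, x_i), for a symmetric positive definite
    metric M (e.g. the scaled, ensemble-averaged Gauss-Newton Hessian)."""
    diff = X[:, None, :] - X[None, :, :]             # (N, N, d)
    Md = diff @ M                                    # (N, N, d): M (x_j - x_i)
    K = np.exp(-0.5 * np.sum(diff * Md, axis=-1))    # (N, N)
    gradK = -Md * K[:, :, None]                      # grad_{x_j} k = -M (x_j - x_i) k
    return K, gradK
```

Swapping hessian_kernel in place of rbf_kernel in the sketches above gives the "-H" variants compared below.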

SLIDE 40

Test Case 1: Two-Dimensional "Double Banana"

  • Reference distribution (prior): p = N(0, I)
  • Forward model: F(x) = log( (1 − x₁)² + 100 (x₂ − x₁²)² )   (Rosenbrock function)
  • Observation: y = F(x_true) + ξ, with x_true ∼ N(0, I), ξ ∼ N(0, 0.09 I)
  • Number of particles: N = 1000
  • Compare SVN-H, SVN-I, SVGD-H and SVGD-I ("H" stands for the scaled Hessian kernel, "I" for the isotropic kernel)
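For concreteness (an illustration under the stated model, with the observed value y left as a placeholder): the gradient of the unnormalised log-posterior for this test case, in a form that can be passed as grad_log_pi to the SVGD/SVN sketches above.

```python
import numpy as np

def grad_log_pi_banana(X, y=3.0, sigma2=0.09):
    """Gradient of log pi(x) = -0.5 ||x||^2 - (y - F(x))^2 / (2 sigma2),
    with F(x) = log((1 - x1)^2 + 100 (x2 - x1^2)^2). The value of y is a
    placeholder; in the experiment y = F(x_true) + xi."""
    x1, x2 = X[:, 0], X[:, 1]
    r = (1 - x1)**2 + 100 * (x2 - x1**2)**2
    F = np.log(r)
    dF_dx1 = (-2 * (1 - x1) - 400 * x1 * (x2 - x1**2)) / r
    dF_dx2 = 200 * (x2 - x1**2) / r
    w = (y - F) / sigma2                  # d/dF of the Gaussian log-likelihood
    grad = -X                             # gradient of the N(0, I) log-prior
    grad[:, 0] += w * dF_dx1
    grad[:, 1] += w * dF_dx2
    return grad
```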


SLIDES 41–42

[Figure/Video: Test Case 1 results]

SLIDE 43

Test Case 2: 100-Dimensional Conditional Diffusion

  • Reference distribution: p = N(0, C) with C(t, t′) = min(t, t′)
  • Forward model: F(u) = [û_t5, û_t10, ..., û_t100]^⊤ ∈ R²⁰, where (û_ti)_{i=1}^{100} is the Euler-Maruyama discretisation of
        du_t = β u (1 − u²)/(1 + u²) dt + dx_t ,   u_0 = 0,
    for t ∈ [0, 1] with step size Δt = 1/100
  • Observation: y = F(x_true) + ξ, with x_true ∼ N(0, I), ξ ∼ N(0, 0.01 I)
  • Number of particles: N = 1000
  • Compare SVN-H, SVN-I, SVGD-H and SVGD-I ("H" stands for the scaled Hessian kernel, "I" for the isotropic kernel)
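A sketch of the forward model (illustrative only): the drift value beta = 10 and the parameterisation of the Brownian increments by a standard-normal vector x are assumptions, not stated on the slide.

```python
import numpy as np

def forward_conditional_diffusion(x, beta=10.0, n_steps=100, obs_every=5):
    """Euler-Maruyama discretisation of du = beta*u*(1-u^2)/(1+u^2) dt + dx,
    u_0 = 0, on [0, 1], observed at every 5th step."""
    dt = 1.0 / n_steps
    u = 0.0
    obs = []
    for i in range(n_steps):
        drift = beta * u * (1 - u**2) / (1 + u**2)
        u = u + drift * dt + x[i] * np.sqrt(dt)   # x[i] ~ N(0, 1): scaled increment
        if (i + 1) % obs_every == 0:
            obs.append(u)
    return np.array(obs)                          # in R^20 for the defaults
```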

SLIDE 44

[Figure: posterior particle plots for Test Case 2. Panels: SVN-H after 10, 50 and 100 iterations; SVN-I after 11, 54 and 108 iterations; SVGD-H after 40, 198 and 395 iterations; SVGD-I after 134, 668 and 1336 iterations.]

SLIDE 45

Compare SVN-H with Hamiltonian MCMC (HMC)

[Figure: reconstructed paths for the conditional diffusion – "true" vs. SVN-H vs. HMC, with observations "obs".]

SLIDE 46

Approximation and Sampling of Multivariate Probability Distributions in the Tensor Train Decomposition

[Dolgov, Anaya-Izquierdo, Fox, RS, 2019]


SLIDES 47–48

Recall: General Variational Inference

In general, Variational Inference aims to find

    argmin_T DKL(T♯η ‖ π)

Note that

    DKL(T♯η ‖ π) = −E_{u∼η}[ log π(T(u)) + log |det ∇T(u)| ] + const

A particularly useful family are the Knothe-Rosenblatt rearrangements (see [Marzouk, Moselhy, Parno, Spantini, 2016]):

    T(x) = ( T¹(x₁), T²(x₁, x₂), ..., T^d(x₁, ..., x_d) )^⊤

Then: log |det ∇T(u)| = Σ_k log ∂_{x_k} T^k
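The identity above turns map-fitting into an optimisation over samples from η. A minimal sketch of the Monte Carlo objective (not from the slides; T, log_det_grad_T and log_pi_unnorm are assumed user-supplied callables):

```python
import numpy as np

def kl_loss(T, log_det_grad_T, log_pi_unnorm, d, n_samples=1000, rng=None):
    """Monte Carlo estimate, up to an additive constant, of
    DKL(T_sharp eta || pi) = -E_{u ~ eta}[log pi(T(u)) + log|det grad T(u)|],
    with eta = N(0, I)."""
    rng = rng or np.random.default_rng(0)
    u = rng.standard_normal((n_samples, d))
    return -np.mean(log_pi_unnorm(T(u)) + log_det_grad_T(u))
```

For a triangular (Knothe-Rosenblatt) map, log_det_grad_T is just the sum of the log diagonal derivatives, which is what makes this family cheap.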


SLIDES 49–53

Knothe-Rosenblatt via Conditional Distribution Sampling

In fact, there exists a unique triangular map satisfying T♯η = π (for absolutely continuous η, π on R^d). It can be computed explicitly via Conditional Distribution Sampling¹. Any density factorises into a product of conditional densities:

    π(x₁, ..., x_d) = π₁(x₁) π₂(x₂|x₁) ··· π_d(x_d|x₁, ..., x_{d−1})

and each factor can be sampled (up to normalisation with known scaling factor) by 1D CDF-inversion:

1st step: produce a sample x₁^i via 1D CDF-inversion from

    π₁(x₁) ∼ ∫ π(x₁, x₂, ..., x_d) dx₂ ··· dx_d

k-th step: given x₁^i, ..., x_{k−1}^i, sample x_k^i via 1D CDF-inversion from

    π_k(x_k | x₁^i, ..., x_{k−1}^i) ∼ ∫ π(x₁^i, ..., x_{k−1}^i, x_k, x_{k+1}, ..., x_d) dx_{k+1} ··· dx_d

Problem: (d − k)-dimensional integration at the k-th step!

¹ Rosenblatt '52; Devroye '86; Hörmann, Leydold, Derflinger '04
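The 1D building block, as a grid-based sketch (a hypothetical helper, not from the slides):

```python
import numpy as np

def cdf_inverse_1d(grid, density_vals, u):
    """Draw one sample from an unnormalised 1D density given by its values
    on a grid, by inverting the (piecewise-constant) CDF at u ~ U(0, 1)."""
    w = density_vals / density_vals.sum()   # normalise (scaling factor known)
    cdf = np.cumsum(w)
    return grid[np.searchsorted(cdf, u)]
```

Applying this coordinate by coordinate, with the marginalisation integrals supplied, yields one Knothe-Rosenblatt sample; the next slides show how TT surrogates make those integrals cheap.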

SLIDE 54

Low-Rank Tensor Approximation of Distributions

(Presented already several times.)

Low-rank tensor decomposition ⇔ separation of variables: a tensor grid with n points per direction (or n polynomial basis functions) has O(n^d) entries, while the decomposition stores only O(dn).

Approximate:

    π(x₁, ..., x_d) ≈ Σ_{|α|≤r} π¹_α(x₁) π²_α(x₂) ··· π^d_α(x_d)   (tensor product decomposition)

Construction, integrals and samples are all available at O(dn) cost!


SLIDES 55–57

Tensor Train (TT) Surrogates for High-Dimensional Distributions

[Dolgov, Anaya-Izquierdo, Fox, RS, 2019]

  • Generic – not problem-specific ("black box")
  • Cross approximation: "sequential" design along 1D lines
  • Separable product form: π̃(x₁, ..., x_d) = Σ_{|α|≤r} π¹_α(x₁) ··· π^d_α(x_d)
  • Cheap construction/storage & low number of model evaluations – linear in d
  • Cheap integration w.r.t. x – linear in d
  • Cheap samples via the conditional distribution method (see below) – linear in d
  • Tuneable approximation error ε (by adapting the ranks r) ⟹ cost & storage (poly)logarithmic in ε
  • Many known ways to use this surrogate for fast inference! (as proposals for MCMC, as control variates, importance weighting, ...)


SLIDES 58–59

A Theoretical Result

[Rohrbach, Dolgov, Grasedyck, RS, in preparation]

For Gaussian distributions we have the following result. Let π : R^d → R, x ↦ exp(−½ x^⊤ Σ x), and for each k partition

    Σ = [ Σ₁₁^(k)   Γ_k^⊤  ]
        [ Γ_k       Σ₂₂^(k) ]    where Γ_k ∈ R^{(d−k)×k}.

Theorem. Let Σ be SPD with λ_min > 0, ρ := max_k rank(Γ_k) and σ := max_{k,i} σ_i^(k), where the σ_i^(k) are the singular values of Γ_k. Then, for all ε > 0, there exists a TT-approximation π̃_ε such that

    ‖π − π̃_ε‖_{L²(R^d)} ≤ ε ‖π‖_{L²(R^d)}

and the TT-ranks of π̃_ε are bounded by

    r ≤ ( 1 + 7σ/λ_min · log(7ρd/ε) )^ρ .


SLIDES 60–61

Conditional Distribution Sampler for TT (TT-CD Sampler)

For the TT approximation

    π̃(x) = Σ_{α_k=1}^{r_k} (0 < k < d)  π¹_{α₁}(x₁) · π²_{α₁,α₂}(x₂) · π³_{α₂,α₃}(x₃) ··· π^d_{α_{d−1}}(x_d)

the k-th step of the CD sampler, given x₁^i, ..., x_{k−1}^i, simplifies to

    π̃_k(x_k | x₁^i, ..., x_{k−1}^i) ∼ Σ_{α₁,...,α_{d−1}} π¹_{α₁}(x₁^i) ··· π^{k−1}_{α_{k−2},α_{k−1}}(x_{k−1}^i) · π^k_{α_{k−1},α_k}(x_k) · ∫ π^{k+1}_{α_k,α_{k+1}}(x_{k+1}) dx_{k+1} ··· ∫ π^d_{α_{d−1}}(x_d) dx_d

To sample: simple 1D CDF-inversion – linear in d.
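A toy sketch of this idea (illustrative only, not the algorithm of the paper in full): it assumes entrywise-nonnegative TT cores on uniform grids over [0, 1] and uses rectangle-rule quadrature for the interface integrals; the names and signature are made up.

```python
import numpy as np

def tt_cd_sample(cores, grids, rng=None):
    """Draw one sample from a TT density with cores of shape
    (r_{k-1}, n_k, r_k), assuming nonnegative cores for simplicity."""
    rng = rng or np.random.default_rng(0)
    d = len(cores)
    # Right interface integrals over x_{k+1}, ..., x_d (rectangle rule)
    R = [None] * (d + 1)
    R[d] = np.ones(1)
    for k in range(d - 1, -1, -1):
        R[k] = np.einsum('aib,b->a', cores[k], R[k + 1]) / cores[k].shape[1]
    left = np.ones(1)                       # prefix for already-sampled coordinates
    x = np.zeros(d)
    for k in range(d):
        p = np.einsum('a,aib,b->i', left, cores[k], R[k + 1])  # 1D marginal on grid
        p = np.maximum(p, 0.0)
        p /= p.sum()
        idx = rng.choice(len(p), p=p)       # discrete CDF-inversion on the grid
        x[k] = grids[k][idx]
        left = np.einsum('a,ab->b', left, cores[k][:, idx, :])
    return x
```

The point of the sketch: every step touches only one core, so the cost of a sample is linear in d.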


SLIDES 62–64

How to Use the TT-CD Sampler to Estimate E_π[Q]?

Problem: we are sampling from the approximation π̃ = π + O(ε).

Option 0: Biased estimator E_π[Q] ≈ E_π̃[Q] via i.i.d. MC quadrature. Can use QMC "seeds" instead of random ones.

[Figure: i.i.d. vs. QMC seed points mapped through the TT-CD sampler.]


SLIDES 65–67

Sampling from the Exact π: Unbiased Estimates of E_π[Q]

Option 1: Use {x^i_π̃} as (i.i.d.) proposals in Metropolis-Hastings. Accept proposal x^i_π̃ with probability

    α = min( 1, [π(x^i_π̃) π̃(x^{i−1}_π)] / [π(x^{i−1}_π) π̃(x^i_π̃)] )

Can prove that the rejection rate ∼ ε and the IACT τ ∼ 1 + ε.

Option 2: Use π̃ for importance weighting + QMC quadrature:

    E_π[Q] ≈ (1/Z) (1/N) Σ_{i=1}^N Q(x^i_π̃) π(x^i_π̃) / π̃(x^i_π̃),   with   Z = (1/N) Σ_{i=1}^N π(x^i_π̃) / π̃(x^i_π̃)

We can use an unbiased (randomised) QMC rule for both integrals.

Option 3: Use the biased QMC estimator as a control variate (MLMCMC).
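Minimal sketches of Options 1 and 2 (illustrative; sample_tilde, pdf_tilde, pdf_pi and Q are assumed user-supplied callables, with the pdfs allowed to be unnormalised):

```python
import numpy as np

def independence_mh(sample_tilde, pdf_tilde, pdf_pi, n_steps, rng=None):
    """Option 1: Metropolis-Hastings with i.i.d. proposals drawn from the
    surrogate pi_tilde (e.g. via the TT-CD sampler)."""
    rng = rng or np.random.default_rng(0)
    x = sample_tilde()
    chain = [x]
    for _ in range(n_steps):
        x_prop = sample_tilde()
        alpha = min(1.0, (pdf_pi(x_prop) * pdf_tilde(x)) /
                         (pdf_pi(x) * pdf_tilde(x_prop)))
        if rng.uniform() < alpha:
            x = x_prop
        chain.append(x)
    return chain

def importance_estimate(Q, X, pdf_pi, pdf_tilde):
    """Option 2: self-normalised importance sampling estimate of E_pi[Q],
    with rows of X drawn from pi_tilde (random or QMC seeds)."""
    w = pdf_pi(X) / pdf_tilde(X)
    return np.sum(w * Q(X)) / np.sum(w)
```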


SLIDES 68–71

Numerical Experiments: (Artificial) Inverse Diffusion Problem

    −∇·(κ(s, x) ∇u) = 0,   s ∈ (0, 1)²,
    u|_{s₁=0} = 1,   u|_{s₁=1} = 0,   ∂u/∂n|_{s₂=0} = ∂u/∂n|_{s₂=1} = 0.

  • Karhunen-Loève expansion² of log κ(s, x) = Σ_{k=1}^d φ_k(s) x_k with prior x_k ∼ U[−1, 1], ‖φ_k‖_∞ = O(k^{−3/2}) and d = 11
  • Discretisation with bilinear FEs on a uniform mesh with h = 1/64
  • Data: average pressure at 9 locations (synthetic, i.e. for some x∗)
  • QoI: probability that the flux exceeds 1.5

[Figure: permeability field and pressure solution with measurement locations.]

² Eigel, Pfeffer, Schneider, 2016.


SLIDES 72–73

Comparison Against DRAM (for the Inverse Diffusion Problem)

[Figure: relative error in P(F > 1.5) vs. CPU time for noise levels σ²_e = 0.01 and σ²_e = 0.001, down to the discretisation error, comparing TT-MH, TT-qIW and DRAM.]

  • TT-MH: TT conditional-distribution samples (i.i.d.) as proposals for MCMC
  • TT-qIW: TT surrogate for importance sampling with QMC
  • DRAM: Delayed Rejection Adaptive Metropolis [Haario et al, 2006]

SLIDE 74

Samples – Comparison TT-CD vs. DRAM

[Figure: two-dimensional marginals of posterior samples from DRAM (top) and from TT-MH with i.i.d. seeds (bottom).]


SLIDES 75–77

Conclusions

Inverse problems under uncertainty – variational inference.

  • Central idea: characterise complex/intractable distributions by constructing deterministic couplings.
  • Central tool: optimisation of the Kullback-Leibler divergence.
  • Many types of approximation classes (non-exhaustive): sparse maps, decomposable maps, neural nets; kernel-based approaches; low-rank structure.
  • Main Topic 1: Newton acceleration and data-informed kernels for Stein Variational Methods.
  • Main Topic 2: TT surrogates for efficient samplers in high dimensions; use approximate maps to accelerate MCMC or in an importance sampler.

SLIDE 78

References

1. Moselhy, Marzouk, Bayesian inference with optimal maps, J Comput Phys 231, 2012 [arXiv:1109.1516]
2. Rezende, Mohamed, Variational inference with normalizing flows, ICML'15: Proc. 32nd Inter. Conf. Machine Learning, Vol. 37, 2015 [arXiv:1505.05770]
3. Marzouk, Moselhy, Parno, Spantini, Sampling via measure transport: An introduction, Handbook of Uncertainty Quantification (Ghanem, Higdon, Owhadi, Eds.), 2016 [arXiv:1602.05023]
4. Liu, Wang, Stein variational gradient descent: A general purpose Bayesian inference algorithm, NIPS 2016, Vol. 29, 2016 [arXiv:1608.04471]
5. Detommaso, Cui, Spantini, Marzouk, RS, A Stein variational Newton method, NIPS 2018, Vol. 31, 2018 [arXiv:1806.03085]
6. Dolgov, Anaya-Izquierdo, Fox, RS, Approximation and sampling of multivariate probability distributions in the tensor train decomposition, Statistics & Comput. (online first), 2019 [arXiv:1810.01212]
7. Detommaso, Kruse, Ardizzone, Rother, Köthe, RS, HINT: Hierarchical invertible neural transport for general & sequential Bayesian inference, 2019 [arXiv:1905.10687]