SLIDE 1

Multilevel methods for fast Bayesian optimal experimental design

Raúl Tempone

Alexander von Humboldt Professor, RWTH-Aachen, KAUST
Joint work with J. Beck (KAUST), B.M. Dia (KFUPM), and L. Espath (RWTH)

DCSE Fall School on ROM and UQ, November 4–8, 2019

SLIDE 2

Optimal Experimental Design (OED)

◮ Find the design that maximizes the success of an experiment for some desired experimental goal
◮ Predict the outcome without performing the real experiment
◮ Model-based approach: important when resources are limited

SLIDE 3

Motivating problem: Electrical Impedance Tomography

Quantities of interest: material properties
Measurements: electrical potentials at electrodes
Goal: find the electrode placement (design) that maximizes the informativeness of the measurements

SLIDE 4

Data model setting

Data model assumption:
$$Y(\xi) = g(\theta, \xi) + \epsilon,$$
where
◮ $Y = (y_1, y_2, \ldots, y_q) \in \mathbb{R}^q$, the measurements
◮ $\epsilon \in \mathbb{R}^q$, the measurement error
◮ $g \in \mathbb{R}^q$, the model predictions (usually the solution of a mathematical model)
◮ $\theta \in \mathbb{R}^d$, the quantities of interest
◮ $\xi = (\xi_1, \xi_2, \ldots, \xi_s)$, the design parameters

Measurement error assumption: the measurement errors, $\epsilon$, are zero-mean Gaussian with covariance matrix $\Sigma_\epsilon$.

SLIDE 5

Data model with repetitive experiments

Data model with repetitive experiments:
$$Y_i(\xi) = g(\theta, \xi) + \epsilon_i, \qquad i = 1, \ldots, N_e,$$
where $N_e$ is the number of repetitive experiments with design $\xi$. The full set of data is $Y = \{Y_i\}_{i=1}^{N_e}$.

SLIDE 6

Simple OED example: a linear regression model

Linear regression model:
$$Y = X\theta + \epsilon,$$
where
◮ $Y$, the vector of observation values
◮ $X$, the design matrix
◮ $\epsilon$, the zero-mean error vector with known covariance matrix $\Sigma_\epsilon$
◮ $\theta$, the unknown parameter vector of interest

Inverse problem: estimate $\theta$.

Classical approach:
◮ Least-squares estimate ($\arg\min_\theta \epsilon^T \epsilon$): $\hat\theta = (X^T X)^{-1} X^T Y$
◮ $\mathrm{Cov}(\hat\theta) = (X^T X)^{-1} X^T \Sigma_\epsilon X (X^T X)^{-1}$

OED goal: find the design $X$ that minimizes $\mathrm{Cov}(\hat\theta)$.
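As a small illustration (my own, not from the slides), the sketch below evaluates this covariance for two hypothetical design matrices and reports its trace and determinant, the quantities used by the alphabetic criteria on the next slide; the matrices X1, X2 and the noise covariance are made-up assumptions.

```python
# Minimal sketch: compare two hypothetical designs via Cov(theta_hat) (all inputs made up).
import numpy as np

def cov_ols(X, Sigma_eps):
    """(X^T X)^-1 X^T Sigma_eps X (X^T X)^-1, covariance of the least-squares estimate."""
    XtX_inv = np.linalg.inv(X.T @ X)
    return XtX_inv @ X.T @ Sigma_eps @ X @ XtX_inv

rng = np.random.default_rng(0)
Sigma_eps = 0.1 * np.eye(5)                         # known error covariance (assumption)
X1 = rng.normal(size=(5, 2))                        # candidate design 1 (assumption)
X2 = np.column_stack([np.ones(5), np.arange(5.0)])  # candidate design 2 (assumption)

for name, X in [("X1", X1), ("X2", X2)]:
    Lam = cov_ols(X, Sigma_eps)
    print(name, "tr:", np.trace(Lam), "det:", np.linalg.det(Lam))  # A- and D-criteria
```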

SLIDE 7

Alphabetic optimalities

What does it mean to minimize $\mathrm{Cov}(\hat\theta) \overset{\mathrm{def}}{=} \Lambda$?

◮ A-optimality: minimize the trace of the covariance matrix, $\mathrm{tr}(\Lambda)$
◮ C-optimality: minimize the variance of a predefined linear combination of the quantities of interest, $c^T \Lambda c$
◮ D-optimality: minimize the determinant of the covariance matrix, $\det(\Lambda)$
◮ E-optimality: minimize the maximum eigenvalue of the covariance matrix $\Lambda$

For nonlinear models the problem is much more challenging! Entropy-based expected information gain in a Bayesian setting: Lindley, D. V. (1956). On a measure of the information provided by an experiment. The Annals of Mathematical Statistics, 27(4), 986-1005.

SLIDE 8

Entropy-based OED criterion

1948, information entropy definition: Claude Shannon introduced the concept of information entropy,
$$\text{Entropy} = \sum_i p_i \log \frac{1}{p_i},$$
for probabilities $p_i$. The differential entropy for a continuous random variable with probability density function (pdf) $p(x)$ with support $\mathcal{X}$ is
$$\text{Differential Entropy} = \int_{\mathcal{X}} \log\!\left(\frac{1}{p(x)}\right) p(x)\, dx.$$

1956, Shannon entropy-based OED: Dennis V. Lindley proposed an entropy-based measure of the information provided by an experiment: the relative entropy of the posterior pdf $\pi(\theta|Y)$ with respect to the prior pdf $\pi(\theta)$,
$$\text{Relative Entropy} = \underbrace{\int \log\!\left(\frac{1}{\pi(\theta)}\right) \pi(\theta|Y)\, d\theta}_{=\,\text{Cross Entropy}} \;-\; \underbrace{\int \log\!\left(\frac{1}{\pi(\theta|Y)}\right) \pi(\theta|Y)\, d\theta}_{=\,\text{Differential Entropy}},$$
in expectation.

SLIDE 9

Information gain

The information gain (Lindley, 1956) is here defined by the Kullback-Leibler (KL) divergence between the prior and posterior pdfs,
$$D_{\mathrm{KL}}\big(\pi(\theta|Y, \xi)\,\|\,\pi(\theta)\big) \overset{\mathrm{def}}{=} \int_\Theta \log\!\left(\frac{\pi(\theta|Y, \xi)}{\pi(\theta)}\right) \pi(\theta|Y, \xi)\, d\theta,$$
where
◮ $\pi(\theta)$ is the prior pdf for $\theta$
◮ $\pi(\theta|Y, \xi)$ is the posterior pdf for $\theta$ given $Y$ for design $\xi$

How to obtain the posterior pdf? Use Bayes' rule:
$$\pi(\theta|Y, \xi) = \frac{p(Y|\theta, \xi)}{p(Y|\xi)}\, \pi(\theta),$$
where
◮ $p(Y|\theta, \xi)$ is the likelihood function for $Y$
◮ $p(Y|\xi)$ is the evidence, i.e., $p(Y|\xi) = \int_\Theta p(Y|\theta, \xi)\, \pi(\theta)\, d\theta$
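For intuition only (not part of the slides), here is a minimal sketch of the information gain in a linear-Gaussian case where Bayes' rule and the KL divergence are available in closed form; the prior, noise level, and datum are arbitrary assumptions.

```python
# Minimal sketch: information gain for a 1D conjugate-Gaussian model (all numbers assumed).
import numpy as np

def kl_gauss(mu_post, var_post, mu_prior, var_prior):
    """D_KL( N(mu_post, var_post) || N(mu_prior, var_prior) )."""
    return 0.5 * (np.log(var_prior / var_post)
                  + (var_post + (mu_post - mu_prior) ** 2) / var_prior - 1.0)

mu0, var0 = 0.0, 1.0   # prior pi(theta) = N(mu0, var0)
var_eps = 0.1          # noise variance in the data model Y = theta + eps
y = 0.7                # one observed measurement

# Bayes' rule in closed form for this conjugate model
var_post = 1.0 / (1.0 / var0 + 1.0 / var_eps)
mu_post = var_post * (mu0 / var0 + y / var_eps)

print("information gain D_KL(posterior || prior):",
      kl_gauss(mu_post, var_post, mu0, var0))
```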

SLIDE 10

Expected Information Gain (EIG) criterion

Information gain:
$$D_{\mathrm{KL}}\big(\pi(\theta|Y, \xi)\,\|\,\pi(\theta)\big) = \int_\Theta \log\!\left(\frac{p(Y|\theta, \xi)}{p(Y|\xi)}\right) \frac{p(Y|\theta, \xi)\, \pi(\theta)}{p(Y|\xi)}\, d\theta.$$
The information about $Y$ is given by the data model. The expected information gain (EIG) is given by
$$I(\xi) \overset{\mathrm{def}}{=} \int D_{\mathrm{KL}}\big(\pi(\theta|Y, \xi)\,\|\,\pi(\theta)\big)\, p(Y)\, dY = \int_\Theta \int \log\!\left(\frac{p(Y|\theta, \xi)}{\int_\Theta p(Y|\theta, \xi)\, \pi(\theta)\, d\theta}\right) p(Y|\theta, \xi)\, dY\, \pi(\theta)\, d\theta = E_\theta\!\left[E_{Y|\theta}\!\left[\log\!\left(\frac{p(Y|\theta, \xi)}{E_\theta[p(Y|\theta, \xi)]}\right)\right]\right],$$
where, because of the assumed data model, the likelihood follows a multivariate normal distribution,
$$p(Y|\theta, \xi) = \det(2\pi\Sigma_\epsilon)^{-\frac{N_e}{2}} \exp\!\left(-\frac{1}{2}\sum_{i=1}^{N_e} \|y_i - g(\theta, \xi)\|^2_{\Sigma_\epsilon^{-1}}\right).$$
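Since every estimator below repeatedly evaluates this likelihood, a small helper (my own; the toy forward model and covariance are assumptions) for the log of the expression above may be useful:

```python
# Sketch of log p(Y | theta, xi) under the Gaussian data model (toy g and Sigma_eps assumed).
import numpy as np

def log_likelihood(Y, theta, xi, g, Sigma_eps):
    """Y has shape (Ne, q); returns the log of the multivariate normal likelihood above."""
    Ne, q = Y.shape
    Sigma_inv = np.linalg.inv(Sigma_eps)
    _, logdet = np.linalg.slogdet(2.0 * np.pi * Sigma_eps)
    resid = Y - g(theta, xi)                                  # broadcast over Ne repetitions
    quad = np.einsum('iq,qr,ir->', resid, Sigma_inv, resid)   # sum_i ||y_i - g||^2_{Sigma^-1}
    return -0.5 * Ne * logdet - 0.5 * quad

g = lambda theta, xi: np.array([theta * xi, theta**2 * xi])   # toy forward model (assumption)
Sigma_eps = 1e-3 * np.eye(2)
Y = np.array([[0.11, 0.02], [0.09, 0.01]])                    # Ne = 2 repetitions, q = 2
print(log_likelihood(Y, theta=0.5, xi=0.2, g=g, Sigma_eps=Sigma_eps))
```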

SLIDE 11

EIG surface

SLIDE 12

Estimating the EIG

The expected information gain can be computed separately for each design $\xi$.

EIG criterion:
$$I = E_\theta\!\left[E_{Y|\theta}\!\left[\log\!\left(\frac{p(Y|\theta)}{E_\theta[p(Y|\theta)]}\right)\right]\right],$$
where the likelihood is given by the data model,
$$p(Y|\theta) = \det(2\pi\Sigma_\epsilon)^{-\frac{N_e}{2}} \exp\!\left(-\frac{1}{2}\sum_{i=1}^{N_e} \|y_i - g(\theta)\|^2_{\Sigma_\epsilon^{-1}}\right).$$

Conventional choice: Double-loop Monte Carlo sampling

SLIDE 13

Double-loop Monte Carlo (DLMC) for estimating EIG

The DLMC estimator for the EIG:
$$I_{\mathrm{dl}} \overset{\mathrm{def}}{=} \frac{1}{N} \sum_{n=1}^{N} \log\!\left(\frac{p(Y_n|\theta_n)}{\frac{1}{M}\sum_{m=1}^{M} p(Y_n|\theta_{n,m})}\right),$$
where $\theta_n, \theta_{n,m} \overset{\mathrm{i.i.d.}}{\sim} \pi(\theta)$ and $Y_n \overset{\mathrm{i.i.d.}}{\sim} p(Y|\theta_n)$.

Average work model: $W = N \times M$ likelihood evaluations!

Bias and variance can be modeled (Ryan, 2003):
◮ Bias: $|E[I_{\mathrm{dl}}] - I| \approx \frac{C_1}{M}$, for some constant $C_1 > 0$ (not unbiased!)
◮ Variance: $V[I_{\mathrm{dl}}] \approx \frac{C_2}{N} + \frac{C_3}{NM}$, for some constants $C_2, C_3 > 0$.
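To make the double loop concrete, here is a minimal sketch of $I_{\mathrm{dl}}$ for a toy scalar model; the forward model, prior, noise level, and sample sizes are my own assumptions, not the slides' example.

```python
# Minimal DLMC sketch for the EIG (toy scalar model; all numbers are assumptions).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
sigma_eps = 0.05                      # noise standard deviation
g = lambda theta, xi: theta * xi      # toy forward model

def dlmc_eig(xi, N, M):
    theta_out = rng.uniform(0.0, 1.0, size=N)                   # theta_n ~ pi = U(0,1)
    Y = g(theta_out, xi) + sigma_eps * rng.standard_normal(N)   # Y_n ~ p(Y | theta_n)
    total = 0.0
    for n in range(N):
        like = norm.pdf(Y[n], loc=g(theta_out[n], xi), scale=sigma_eps)
        theta_in = rng.uniform(0.0, 1.0, size=M)                # theta_{n,m} ~ pi
        evidence = np.mean(norm.pdf(Y[n], loc=g(theta_in, xi), scale=sigma_eps))
        total += np.log(like / evidence)
    return total / N                  # cost ~ N*M likelihood evaluations

print("DLMC EIG estimate:", dlmc_eig(xi=0.8, N=2000, M=200))
```

Increasing M shrinks the bias of the inner average, while increasing N shrinks the statistical error, in line with the bias and variance models above.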

SLIDE 14

How to control the estimator error?

Control the accuracy: choose $N$ and $M$ to control the estimator's accuracy.

Goal: for a specified error tolerance $\mathrm{TOL} > 0$, at a confidence level given by $0 < \alpha \ll 1$, select the values of $N$ and $M$ that minimize the computational work while satisfying
$$P\big(|I_{\mathrm{dl}}(N, M) - I| \le \mathrm{TOL}\big) \ge 1 - \alpha.$$

SLIDE 15

Total Error = Bias + Statistical Error

Split the total error of a random estimator $\mathcal{I}$ of $I$ into a bias component and a statistical error component,
$$|\mathcal{I} - I| \le |\mathcal{I} - E[\mathcal{I}]| + |E[\mathcal{I}] - I|.$$

Optimization problem: minimize the estimator's work, $W$, subject to
$$|E[\mathcal{I}] - I| \le (1 - \kappa)\,\mathrm{TOL}, \qquad |\mathcal{I} - E[\mathcal{I}]| \le \kappa\,\mathrm{TOL},$$
for $\kappa \in (0, 1)$, where the latter (statistical) constraint should hold with probability $1 - \alpha$. By using the CLT, the statistical error constraint is replaced by
$$C_\alpha \sqrt{V[\mathcal{I}]} \le \kappa\,\mathrm{TOL},$$
for $C_\alpha = \Phi^{-1}\!\left(1 - \frac{\alpha}{2}\right)$, where $\Phi$ is the cumulative distribution function (CDF) of a standard normal random variable.

SLIDE 16

Optimal choice of DLMC parameters

Optimization problem for DLMC:
$$(N^*, M^*, \kappa^*) = \arg\min_{(N, M, \kappa)} N \times M \quad \text{subject to} \quad
\begin{cases}
\dfrac{C_{\mathrm{dl},3}}{M} \le (1 - \kappa)\,\mathrm{TOL}, \\[6pt]
\dfrac{C_{\mathrm{dl},1}}{N} + \dfrac{C_{\mathrm{dl},2}}{NM} \le \left(\dfrac{\kappa}{C_\alpha}\right)^{2} \mathrm{TOL}^2.
\end{cases}$$

Optimal choice of the number of samples (Beck et al., 2018):
$$N^* = \frac{C_\alpha^2}{2\kappa^*}\,\frac{C_{\mathrm{dl},1}}{1 - \kappa^*}\,\mathrm{TOL}^{-2}, \qquad M^* = \frac{0.5\, C_{\mathrm{dl},2}}{1 - \kappa^*}\,\mathrm{TOL}^{-1}.$$

Optimal average work: for a specified TOL with the specified probability of success,
$$\text{Optimal average work} = N^* M^* \propto \mathrm{TOL}^{-3}, \quad \text{as } \mathrm{TOL} \to 0.$$
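As a rough sketch (my own simplification, not the exact optimizer above), one can pick M from the bias constraint and N from the leading variance term for a fixed split κ; the constants C_bias and C_var are placeholders that would have to be estimated in practice.

```python
# Rough sketch: N and M from TOL using leading-order constants and a fixed kappa (assumptions).
from math import ceil
from scipy.stats import norm

def dlmc_parameters(TOL, C_bias, C_var, alpha=0.05, kappa=2.0 / 3.0):
    C_alpha = norm.ppf(1.0 - alpha / 2.0)              # ~1.96 for alpha = 0.05
    M = ceil(C_bias / ((1.0 - kappa) * TOL))           # bias ~ C_bias / M <= (1 - kappa) TOL
    N = ceil(C_alpha**2 * C_var / (kappa * TOL)**2)    # var  ~ C_var / N  <= (kappa TOL / C_alpha)^2
    return N, M

for TOL in (1e-1, 1e-2, 1e-3):
    N, M = dlmc_parameters(TOL, C_bias=0.5, C_var=1.0)
    print(f"TOL={TOL:g}: N={N}, M={M}, work ~ N*M = {N * M:.2e}")  # grows like TOL^-3
```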

SLIDE 17

Improving the constant: Laplace-based importance sampling

Assumption: the posterior pdf of $\theta$ can be well approximated by a Gaussian.

Laplace approximation: approximate the posterior pdf, $\pi(\theta|Y)$, by a multivariate normal pdf, $\tilde\pi(\theta|Y)$, following $N\big(\hat\theta(Y), \hat\Sigma(\hat\theta(Y))\big)$, i.e.,
$$\tilde\pi(\theta|Y) = (2\pi)^{-\frac{d}{2}} \det(\hat\Sigma)^{-\frac{1}{2}} \exp\!\left(-\frac{1}{2}\|\theta - \hat\theta\|^2_{\hat\Sigma^{-1}}\right),$$
with maximum a posteriori (MAP) estimate
$$\hat\theta(Y) \overset{\mathrm{def}}{=} \arg\max_{\theta \in \Theta} \left(-\frac{1}{2}\sum_{i=1}^{N_e} \|Y_i - g(\theta)\|^2_{\Sigma_\epsilon^{-1}} + \log(\pi(\theta))\right),$$
and
$$\hat\Sigma^{-1}(\theta) = N_e\, J(\theta)^T \Sigma_\epsilon^{-1} J(\theta) - \nabla_\theta\nabla_\theta \log(\pi(\theta)) + \mathcal{O}_P\!\left(\frac{1}{\sqrt{N_e}}\right).$$

This optimization problem can typically be solved sufficiently with ~20-40 extra solves per outer sample (indexed by $n$).
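A minimal sketch of the Laplace step for a toy 1D model (forward model, prior, and data are my assumptions): find the MAP by numerical optimization and take the curvature at the MAP, here by a finite difference rather than the Jacobian formula above.

```python
# Minimal Laplace-approximation sketch (toy 1D model; prior, noise, data are assumptions).
import numpy as np
from scipy.optimize import minimize

sigma_eps = 0.05
g = lambda theta: np.sin(theta) + theta            # toy forward model
Y = np.array([0.9, 0.95, 1.02])                    # Ne = 3 repeated measurements

def neg_log_post(t):
    theta = float(np.ravel(t)[0])
    misfit = 0.5 * np.sum((Y - g(theta))**2) / sigma_eps**2
    return misfit + 0.5 * theta**2                 # standard normal prior (assumption)

res = minimize(neg_log_post, x0=np.array([0.0]), method="Nelder-Mead")
theta_map = res.x[0]

# Curvature at the MAP via a finite-difference second derivative
h = 1e-4
hess = (neg_log_post(theta_map + h) - 2 * neg_log_post(theta_map)
        + neg_log_post(theta_map - h)) / h**2
print(f"MAP = {theta_map:.4f}, Laplace variance = {1.0 / hess:.6f}")
```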

SLIDE 18

DLMC with Laplace-based Importance Sampling (DLMCIS)

Laplace-based importance sampling for the inner expectation:
$$p(Y) = \int p(Y|\theta)\, \pi(\theta)\, d\theta = \int \frac{p(Y|\theta)\, \pi(\theta)}{\tilde\pi(\theta|Y)}\, \tilde\pi(\theta|Y)\, d\theta.$$

DLMCIS estimator (Ryan et al., 2015; Beck et al., 2018):
$$I_{\mathrm{dlis}} \overset{\mathrm{def}}{=} \frac{1}{N} \sum_{n=1}^{N} \log\!\left(\frac{p(Y_n|\theta_n)}{\frac{1}{M}\sum_{m=1}^{M} \frac{p(Y_n|\tilde\theta_{n,m})\, \pi(\tilde\theta_{n,m})}{\tilde\pi(\tilde\theta_{n,m}|Y_n)}}\right),$$
where $\tilde\theta_{n,m} \overset{\mathrm{i.i.d.}}{\sim} \tilde\pi(\theta|Y_n)$ (no additional bias!).

Optimal average work: $\propto \mathrm{TOL}^{-3}$ as $\mathrm{TOL} \to 0$. The proportionality constant is substantially smaller for DLMCIS than for DLMC! (variance reduction)
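A hedged sketch of the reweighted inner average for a toy linear-Gaussian setup (my assumptions), where the Laplace proposal coincides with the exact posterior; the point is only the weight $p(Y|\theta)\pi(\theta)/\tilde\pi(\theta|Y)$.

```python
# Importance-sampling sketch for the inner evidence (toy linear-Gaussian case; assumptions).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
mu0, s0 = 0.0, 1.0     # prior N(mu0, s0^2)
s_eps = 0.2            # noise std; data model Y = theta + eps
y = 1.3                # one observed datum

# Laplace proposal: here it equals the exact posterior, since the model is linear-Gaussian
s_post2 = 1.0 / (1.0 / s0**2 + 1.0 / s_eps**2)
mu_post = s_post2 * (mu0 / s0**2 + y / s_eps**2)

def evidence_is(M):
    """(1/M) sum_m p(y|theta_m) pi(theta_m) / pi_tilde(theta_m|y),  theta_m ~ pi_tilde."""
    th = rng.normal(mu_post, np.sqrt(s_post2), size=M)
    w = (norm.pdf(y, loc=th, scale=s_eps) * norm.pdf(th, loc=mu0, scale=s0)
         / norm.pdf(th, loc=mu_post, scale=np.sqrt(s_post2)))
    return np.mean(w)

exact = norm.pdf(y, loc=mu0, scale=np.sqrt(s0**2 + s_eps**2))   # closed-form evidence
print("IS estimate (M=100):", evidence_is(100), " exact:", exact)
```

Because the proposal here matches the posterior exactly, every weight equals the evidence and the estimator has zero variance; with a Laplace proposal that only approximates the posterior, the variance is small rather than zero, which is the source of the smaller constant.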

SLIDE 19

Numerical demonstration

Nonlinear scalar model:
$$y_i(\xi) = \theta^3 \xi^2 + \theta \exp(-|0.2 - \xi|) + \epsilon_i, \qquad i = 1, 2, \ldots, N_e,$$
with
◮ $\theta \sim U(0, 1)$
◮ $\epsilon_i \overset{\mathrm{i.i.d.}}{\sim} N(0, 10^{-3})$
◮ $N_e = 1$ and $N_e = 10$
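The test model can be written down directly; a small sketch that draws synthetic data from it (the seed and the chosen $\xi$ are mine):

```python
# Sketch of the nonlinear scalar test model (seed and xi chosen here for illustration).
import numpy as np

rng = np.random.default_rng(3)

def sample_data(theta, xi, Ne):
    """y_i(xi) = theta^3 xi^2 + theta exp(-|0.2 - xi|) + eps_i,  eps_i ~ N(0, 1e-3)."""
    mean = theta**3 * xi**2 + theta * np.exp(-abs(0.2 - xi))
    return mean + np.sqrt(1e-3) * rng.standard_normal(Ne)

theta = rng.uniform(0.0, 1.0)        # theta ~ U(0, 1)
print("Ne=1 :", sample_data(theta, xi=0.5, Ne=1))
print("Ne=10:", sample_data(theta, xi=0.5, Ne=10))
```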

SLIDE 20

Expected information gain

Figure: Expected information gain

SLIDE 21

Computational work for a single design

Figure: Average running time vs. TOL

SLIDE 22

Optimal estimator parameter choices

Figure: Left: Ne = 1, Right: Ne = 10

SLIDE 23

Consistency between TOL and Error

Figure: Error vs. TOL

SLIDE 24

When g needs to be approximated

Assumptions:
◮ $g(\theta, \xi)$ needs to be computed numerically.
◮ We only have access to the numerical approximation $g_h$ of $g$, where $h > 0$ is the mesh size.
◮ The (weak) approximation error for $g_h$ is modeled as: Bias $= \mathcal{O}(h^{\eta_w})$, with $\eta_w > 0$.
◮ The computational work for $g_h$ is modeled as: Average work $\propto h^{-\gamma}$, with $\gamma > 0$.

Then the approximate likelihood (replacing $p(Y|\theta)$) is
$$p_h(Y|\theta) \overset{\mathrm{def}}{=} \det(2\pi\Sigma_\epsilon)^{-\frac{N_e}{2}} \exp\!\left(-\frac{1}{2}\sum_{i=1}^{N_e} \|y_i - g_h(\theta)\|^2_{\Sigma_\epsilon^{-1}}\right).$$

SLIDE 25

Numerical discretization in EIG

Let us write the EIG as $I = E[Z]$, where
$$Z(\theta) \overset{\mathrm{def}}{=} E[f(Y, \theta)\,|\,\theta] \quad \text{and} \quad f(Y, \theta) \overset{\mathrm{def}}{=} \log\!\left(\frac{p(Y|\theta)}{p(Y)}\right).$$
Then, using a numerical approximation $g_h$ of $g$, the EIG is approximated by
$$I_h \overset{\mathrm{def}}{=} E[Z_h], \quad \text{with} \quad Z_h(\theta) \overset{\mathrm{def}}{=} E[f_h(Y, \theta)\,|\,\theta] \quad \text{and} \quad f_h(Y, \theta) \overset{\mathrm{def}}{=} \log\!\left(\frac{p_h(Y|\theta)}{p_h(Y)}\right).$$

SLIDE 26

Optimal method parameters for DLMC

The solution is given by (Beck et al., 2018):
$$N^* = \frac{C_\alpha^2}{2\kappa^*}\,\frac{C_{\mathrm{dl},1}}{1 - \kappa^*}\left(1 + \frac{\gamma}{2\eta_w}\right)\mathrm{TOL}^{-2}, \qquad M^* = \frac{C_{\mathrm{dl},2}}{2(1 - \kappa^*)}\left(1 + \frac{\gamma}{2\eta_w}\right)\mathrm{TOL}^{-1}, \qquad h^* = \left(\frac{\gamma}{\eta_w}\,\frac{\kappa^*}{2 C_w}\right)^{1/\eta_w}\mathrm{TOL}^{1/\eta_w},$$
and $\kappa^*$ is given by a second-order equation.

Optimal average work: for a specified TOL with the specified probability of success,
$$\text{Optimal average work} = N^* M^* (h^*)^{-\gamma} \propto \mathrm{TOL}^{-\left(3 + \frac{\gamma}{\eta_w}\right)}, \quad \text{as } \mathrm{TOL} \to 0.$$

SLIDE 27

Multilevel Double-Loop Monte Carlo (MLDLMC) for EIG

Assume we have numerical discretizations $g_{h_\ell}$, where $h_\ell$, $\ell = 0, 1, \ldots$, is a sequence of decreasing mesh-element sizes, e.g., $h_\ell = h_0 2^{-\ell}$. First, following the standard multilevel method by Giles (2008), let us write the approximation of the EIG with mesh-element size $h_L$ as a telescoping sum with respect to the level $\ell$,
$$I_{h_L} = \sum_{\ell=0}^{L} E\big[\Delta[Z_\ell]\big] \approx I, \quad \text{where } Z_\ell = Z_{h_\ell} \text{ and } \Delta[Z_\ell(\theta)] \overset{\mathrm{def}}{=} \begin{cases} Z_\ell(\theta) - Z_{\ell-1}(\theta), & \text{if } \ell > 0, \\ Z_0(\theta), & \text{if } \ell = 0. \end{cases}$$
Furthermore, we assume the strong error model
$$V\big[\Delta[Z_\ell(\theta)]\big] \approx C_s\, h_\ell^{\eta_s}, \quad \text{for } \ell > 0,$$
where $\eta_s > 0$, typically $\eta_s \approx 2\eta_w$.
SLIDE 28

MLDLMC for EIG

We use the approximate evidence obtained with Laplace-based importance sampling,
$$\hat p_\ell\big(Y; \{\theta_m\}_{m=1}^{M_\ell}\big) \overset{\mathrm{def}}{=} \frac{1}{M_\ell} \sum_{m=1}^{M_\ell} p_\ell(Y|\theta_m)\, R_\ell(\theta_m; Y) \approx p_\ell(Y),$$
with $M_\ell = \lceil \nu^{\ell} M_0 \rceil$ for $\ell > 0$, where $\nu > 1$, $R_\ell(\theta_m; Y)$ is the importance-sampling weight, and the $\theta_m$ are drawn from the Laplace proposal $\tilde\pi(\theta|Y)$. Thus, we introduce
$$\hat Z_\ell\big(\theta; \{\theta_m\}_{m=1}^{M}\big) \overset{\mathrm{def}}{=} E_\ell\big[\hat f_\ell(Y, \theta; \{\theta_m\}_{m=1}^{M})\,\big|\,\theta\big],$$
with
$$\hat f_\ell\big(Y, \theta; \{\theta_m\}_{m=1}^{M}\big) \overset{\mathrm{def}}{=} \log\!\left(\frac{p_\ell(Y|\theta)}{\hat p_\ell\big(Y; \{\theta_m\}_{m=1}^{M}\big)}\right).$$

SLIDE 29

MLDLMC for EIG

The proposed multilevel double-loop Monte Carlo (MLDLMC) estimator with Laplace-based importance sampling for the expected information gain can be written as
$$I_L \overset{\mathrm{def}}{=} \frac{1}{N_0} \sum_{n=1}^{N_0} \hat f_0\big(Y^{(0)}_{0,n}, \theta^{(0)}_n; \{\theta^{(0)}_{m,n}\}_{m=1}^{M_0}\big) + \sum_{\ell=1}^{L} \frac{1}{N_\ell} \sum_{n=1}^{N_\ell} \left[\hat f_\ell\big(Y^{(\ell)}_{\ell,n}, \theta^{(\ell)}_n; \{\theta^{(\ell)}_{m,n}\}_{m=1}^{M_\ell}\big) - \hat f_{\ell-1}\big(Y^{(\ell)}_{\ell-1,n}, \theta^{(\ell)}_n; \{\theta^{(\ell)}_{m,n}\}_{m=1}^{M_{\ell-1}}\big)\right],$$
where
◮ $Y^{(k)}_{\ell,n} \overset{\mathrm{i.i.d.}}{\sim} p_\ell(Y|\theta^{(k)}_n)$
◮ $\theta^{(k)}_n \overset{\mathrm{i.i.d.}}{\sim} \pi(\theta)$
◮ $\theta^{(k)}_{m,n} \overset{\mathrm{i.i.d.}}{\sim} N\big(\hat\theta(Y^{(k)}_{k,n}),\ \hat\Sigma(\hat\theta(Y^{(k)}_{k,n}))\big)$ (Laplace-based importance sampling)
◮ $\{\theta^{(k)}_{m,n}\}_{m=1}^{M_{\ell-1}} \subseteq \{\theta^{(k)}_{m,n}\}_{m=1}^{M_\ell}$ (shared inner samples within each of the differences)

SLIDE 30

Asymptotic average work of MLDLMC

For a given $\mathrm{TOL} > 0$, the optimal $N_\ell$ is, under our given error and work assumptions,
$$N_\ell = \left(\frac{C_\alpha}{\kappa\,\mathrm{TOL}}\right)^{2} \sqrt{\frac{V_\ell}{M_\ell\, W(g_\ell)}}\, \sum_{\ell'=0}^{L} \sqrt{V_{\ell'}\, M_{\ell'}\, W(g_{\ell'})}.$$
The average work of the MLDLMC estimator is modeled as
$$W(I_{\mathrm{MLDLMC}}) \propto \sum_{\ell=0}^{L} M_\ell\, N_\ell\, h_\ell^{-\gamma},$$
and this leads us to the optimal work of the MLDLMC estimator,
$$W(I_{\mathrm{MLDLMC}}) = \left(\frac{C_\alpha}{\kappa\,\mathrm{TOL}}\right)^{2} \left(\sum_{\ell=0}^{L} \sqrt{V_\ell\, M_\ell\, W(g_\ell)}\right)^{2}, \quad \text{as } \mathrm{TOL} \to 0.$$
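A small sketch of the allocation formula above; the per-level variances, inner sample sizes, and per-solve costs are made-up placeholders.

```python
# Sketch of the optimal N_l allocation (V_l, M_l, and per-solve work W_l are made up here).
import numpy as np
from scipy.stats import norm

def optimal_Nl(V, M, Wg, TOL, kappa=0.5, alpha=0.05):
    """N_l = (C_alpha/(kappa TOL))^2 sqrt(V_l/(M_l W_l)) * sum_k sqrt(V_k M_k W_k)."""
    V, M, Wg = map(np.asarray, (V, M, Wg))
    C_alpha = norm.ppf(1.0 - alpha / 2.0)
    S = np.sum(np.sqrt(V * M * Wg))
    return np.ceil((C_alpha / (kappa * TOL))**2 * np.sqrt(V / (M * Wg)) * S).astype(int)

V  = [1e-2, 2e-3, 5e-4, 1.2e-4]   # variances of the level differences (assumed)
M  = [10, 40, 160, 640]           # inner samples per level, M_l = 4 M_{l-1}
Wg = [1.0, 4.0, 16.0, 64.0]       # cost per forward solve, ~ h_l^(-gamma) with gamma = 2
print(optimal_Nl(V, M, Wg, TOL=0.02))
```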

SLIDE 31

Asymptotic average work of MLDLMC

Under the given assumptions, and with $h_\ell \propto 2^{-\ell}$ and $M_\ell = 4 M_{\ell-1}$ for $\ell > 0$, the average computational work of the MLDLMC estimator behaves as
$$W(I_{\mathrm{MLDLMC}}) \propto \begin{cases} \mathrm{TOL}^{-2}, & \text{if } \eta_s - \gamma - \eta_w > 0, \\ \mathrm{TOL}^{-2}\big(\log(\mathrm{TOL}^{-1})\big)^{2}, & \text{if } \eta_s - \gamma - \eta_w = 0, \\ \mathrm{TOL}^{-2\left(1 + \frac{\gamma - \eta_s}{2\eta_w}\right)}, & \text{if } \eta_s - \gamma - \eta_w < 0, \end{cases}$$
as $\mathrm{TOL} \to 0$, under the restriction $\eta_w \ge \min(\eta_s, \gamma)/2$.

More general result in (Beck et al., 2019): with $h_\ell \propto \beta^{\ell}$ and $M_\ell = \nu M_{\ell-1}$, under the assumption $\log_\beta(\nu) \le \eta_w$.

SLIDE 32

EIT problem with 4 plies

Figure: Experimental setup for an EIT problem with 4 plies

SLIDE 33

Experiment formulation

The data model is given as
$$Y = U_h(\theta) + \epsilon,$$
without repeated experiments. The noise is $\epsilon \sim N(0, 10^{-4})$, and $\theta_i \sim U\!\left(\frac{\pi}{2+i} - 0.05,\ \frac{\pi}{2+i} + 0.05\right)$, for $i = 1, \ldots, 4$. The output is the potential at nine electrodes, i.e., $y \in \mathbb{R}^{N_{el}-1}$. $U_h$ is the potential at the electrodes and can be obtained from the solution of an elliptic PDE with $h$-convergence rate $\eta_w \approx 1.2$ and work rate $\gamma \approx 2$.
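The prior on the four ply angles can be sampled directly; a two-line sketch (the seed is my choice):

```python
# Sketch: draw theta_i ~ U(pi/(2+i) - 0.05, pi/(2+i) + 0.05) for i = 1, ..., 4.
import numpy as np

rng = np.random.default_rng(5)
centers = np.pi / (2 + np.arange(1, 5))               # pi/3, pi/4, pi/5, pi/6
theta = rng.uniform(centers - 0.05, centers + 0.05)   # one prior sample of the 4 angles
print(theta)
```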

SLIDE 34

MLDLMC analysis

For the test we use $e_{el} = e_{sh} = e_{sp} = 2$. The figure shows the convergence of the telescoping differences, $\Delta \hat Z_\ell$, with respect to the level $\ell$.

Figure: The expected value and variance of the telescoping differences, $\Delta \hat Z_\ell$.

SLIDE 35

MLDLMC consistency

Figure: Error vs. TOL

The confidence parameter used is $\alpha = 0.05$ (i.e., a 95% confidence level).

SLIDE 36

Comparison

The performance of the methods can be understood by looking at the average running time against TOL.


Figure: A numerical comparison between DLMCIS (i.e., standard DLMC with importance sampling) and MLDLMC for the EIT problem with 4 plies.

SLIDE 37

Thanks for your attention

◮ Beck, J., Mansour Dia, B., Espath, L., & Tempone, R. (2019). Multilevel double loop Monte Carlo and stochastic collocation methods with importance sampling for Bayesian optimal experimental design. arXiv preprint arXiv:1811.11469.
◮ Beck, J., Mansour Dia, B., Espath, L., Long, Q., & Tempone, R. (2018). Fast Bayesian experimental design: Laplace-based importance sampling for the expected information gain. Computer Methods in Applied Mechanics and Engineering, 334, 523-553.
◮ Collier, N., Haji-Ali, A.-L., Nobile, F., von Schwerin, E., & Tempone, R. (2015). A continuation multilevel Monte Carlo algorithm. BIT Numerical Mathematics, 55(2), 399-432.
◮ Long, Q., Scavino, M., Tempone, R., & Wang, S. (2013). Fast estimation of expected information gains for Bayesian experimental designs based on Laplace approximations. Computer Methods in Applied Mechanics and Engineering, 259, 24-39.

SLIDE 38

Multilevel Double Loop Stochastic Collocation (MLDLSC)

A multilevel double loop stochastic collocation (MLDLSC) method for estimating EIG (Beck et al., 2019):


Figure: DLMC, MLDLMC, and MLDLSC. All methods use Laplace-based importance sampling.
