Outcome-weighted sampling for Bayesian analysis Themis Sapsis and - PowerPoint PPT Presentation

Outcome-weighted sampling for Bayesian analysis Themis Sapsis and Antoine Blanchard Department of Mechanical Engineering Massachusetts Institute of Technology Funding: ONR, AFOSR, Sloan April 23, 2020 1 / 46

Problems-Motivation Risk Quantification Optimization under uncertainty 2 / 46

Challenges Challenge I : High-dimensional parameter spaces Intrinsic instabilities Stochastic loads Random parameters Challenge II : Need for expensive models Complex dynamics Hard to isolate dynamical mechanisms 3 / 46

The focus of this work Goal : Develop sampling strategies appropriate for expensive models and high-dimensional parameter spaces Models in fluids: Navier-Stokes, NL Schr¨ odinger, Euler Critical region of parameters is unknown Importance sampling based methods too expensive Input-space PCA focuses on subspaces, not sufficient 4 / 46

Risk Quantification: Problem setup x ∈ R m : Uncertain parameters; pdf: f x y ∈ R d : Output or quantities of interest; expensive to compute Risk Quantification Problem: Compute the statistics of y with the minimum number of experiments, i.e. input parameters { x 1 , x 2 , ..., x N } 5 / 46

A Bayesian approach Employ a linear regression model with an input vector x of length m that multiplies a coefficient vector A to produce an output vector y of length d , with Gaussian noise added: y = Ax + e (1) e ∼ N ( 0 , V ) (2) We are given a data set of pairs: D = { ( y 1 , x 1 ) , ( y 2 , x 2 ) , ..., ( y N , x N ) } . We set Y = [ y 1 , y 2 , ..., y N ] and X = [ x 1 , x 2 , ..., x N ] . 6 / 46

A Bayesian approach From Bayesian regression, we obtain the pdf for new inputs x : p ( y | x , D , V ) = N ( S yx S − 1 xx x , V ( 1 + c )) , c = x T S − 1 xx x , S xx = XX T + K S yx = YX T Question : How to choose the next input point x N + 1 = h ? 7 / 46

1. Minimizing the model uncertainty Given a hypothetical input point x N + 1 = h , we have at x p ( y | x , D ′ , V ) = N ( S yx S − 1 xx x , V ( 1 + c )) , c = x T S ′− 1 xx x , yx S ′− 1 xx x = S yx S − 1 xx x , assuming y N + 1 = S yx S − 1 where S ′ xx h . We minimize the model uncertainty by choosing h such that the distribution for c converges to zero (at least for the x we are interested): µ c ( h ) = E [ x T S ′− 1 xx x ] = tr [ S ′− 1 x S ′− 1 xx µ x = tr [ S ′− 1 xx C xx ] + µ T xx R xx ] (valid for any f x ) 8 / 46

1. Minimizing the model uncertainty Interpretation of the sampling process 1. The selection of the new sample does not depend on Y . 2. We diagonalize R xx ; let ˆ x i , i = 1 , ..., m be the principal directions arranged according to the eigenvalues σ 2 i + µ 2 x i . ˆ To minimize d � µ c ( h ) = tr [ S ′− 1 ( σ 2 i + µ 2 x i )[ S ′− 1 h ∈ S m − 1 , xx R xx ] = x ] ii , ˆ x ˆ ˆ i = 1 we need to sample in directions with the largest σ 2 i + µ 2 x i . ˆ 3. After sufficient sampling in this direction, the scheme switches to the next most important direction and so on. 4. Emphasis on input directions with large uncertainty, even those that have zero effect to the output. 9 / 46

2. Maximizing the x,y mutual information Maximizing the entropy transfer or mutual information between the input and output variables, when a new sample is added: I ( x , y | D ′ ) = E x + E y | D ′ − E x , y | D ′ . We have: � � E x , y ( h ) = f xy ( y , x | D ′ ) log f xy ( y , x | D ′ ) y x � � E y | x ( x | D ′ ) f x ( x ) + f x ( x ) log f x ( x ) = x x = E x [ E y | x ( D ′ )] + E x . 10 / 46

2. Maximizing the x,y mutual information Given a new input point x N + 1 = h , we have at any input x p ( y | x , D ′ , V ) = N ( S yx S − 1 xx x , V ( 1 + c )) , c = x T S ′− 1 xx x , Therefore, I ( x , y | D ′ , V ) = E y ( h ) − d 2 E x [log( 1 + c ( x ; h ))] − 1 2 log | 2 π e V | Note 1: Valid for any distribution f x Note 2: Hard to compute for high dimensions 11 / 46

2. Maximizing the x,y mutual information Gaussian approximation The Gaussian approximation of the entropy criterion: I G ( x , y | D ′ , V ) = 1 2 log | V ( 1 + µ c ( h )) + S yx S − 1 xx C xx S − 1 xx S T yx | − 1 2 log | V | − d 2 E x [log( 1 + c ( x ; h ))] , Note 1 : The effect of Y appears only through a single scalar/vector and with no coupling on the new point h . Note 2 : Asymptotically (i.e. for small σ 2 c ) the criterion becomes I G ( x , y | D ′ ) = 1 2 log | I + V − 1 S yx S − 1 xx C xx S − 1 xx S T yx | − � � µ c ( h ) d − tr [[ V + S yx S − 1 xx C xx S − 1 xx S T yx ] − 1 V ] + O ( µ 2 c ) 2 12 / 46

3. Output-weighted optimal sampling Let y 0 be the rv defined as the mean model: y 0 ≜ S yx S − 1 xx x We define the perturbed model: y + ≜ S yx S − 1 xx x + β r V ( 1 + x T S ′− 1 xx x ) , where β is a scaling factor to be chosen later and r V the most dominant eigenvector of V . We define the distance (Mohamad & Sapsis, PNAS, 2018) � D Log 1 ( y + � y 0 ; h ) = | log f y + ( y ; h ) − log f y 0 ( y ) | d y S y where S y is a finite sub-domain of y . 13 / 46

3. Output-weighted optimal sampling We can show that for bounded pdfs: D KL ( y + � y 0 ; h ) � κ D Log 1 ( y + � y 0 ; h ) , where κ is a constant. D Log 1 is more conservative compared with the KL divergence. Significantly improved performance in terms of convergence for f y . Criterion D Log 1 ( y + � y 0 ) is hard to compute/optimize. 14 / 46

3. Output-weighted optimal sampling Under appropriate smoothness conditions standard inequalities for derivatives of smooth functions give (Sapsis, Proc Roy Soc A, 2020): � f x ( x ) f y 0 ( y 0 ( x )) σ 2 lim β → 0 D Log 1 ( y + � y 0 ; h ) ≤ κ 0 y ( x ; h ) d x . 15 / 46

3. Output-weighted optimal sampling We define the output-weighted model error criterion � f x ( x ) f y 0 ( y 0 ( x )) σ 2 Q [ h ] ≜ y ( x ; h ) d x . Model error weighted according to the importance 1 (probability) of the input Model error inversely weighted according to the probability 2 of the output: emphasis is given to outputs with low probability (rare events) Relevant criterion (Verdinelli & Kadane, 1992) � U ( D ′ ) = q 1 y 0 ( x ) . 1 d x + q 2 E xy | D ′ . 16 / 46

3. Output-weighted optimal sampling Approximation of the criterion � f x ( x ) Q [ σ 2 f y 0 ( y 0 ( x )) σ 2 y ] ≜ y ( x ; h ) d x . Denominator approximation in S y for symmetric f y and scalar y f − 1 y 0 ( y ) ≃ p 1 + p 2 ( y − µ y ) 2 , where p 1 , p 2 are constants chosen so that m.s. error is min We employ a Gaussian approximation for f y 0 (only for this step) and over the interval S y = [ µ y , µ y + βσ y ] we obtain √ �� β � √ 2 dz − β 3 p 2 = 5 2 π z 2 z 2 e p 1 = 2 πσ y and β 5 σ y 3 0 17 / 46

3. Output-weighted optimal sampling Approximation of the criterion We collect all the computed terms and obtain (for Gaussian x ) Q βσ y ( h ) 1 = p 1 ( β )( 1 + tr [ S ′− 1 xx C xx ] + µ T x S ′− 1 xx µ x ) σ 2 V + p 2 ( β ) c 0 ( 1 + µ T x S ′− 1 xx µ x − tr [ S ′− 1 xx C xx ]) + 2 p 2 tr [ S − 1 xx S T yx S yx S − 1 xx C xx S ′− 1 xx C xx ] . For zero mean input we have Q βσ y ( h ) 1 = ( p 1 − p 2 c 0 ) tr [ S ′− 1 xx C xx ] σ 2 V + 2 p 2 tr [ S ′− 1 xx C xx 0 S − 1 xx S T yx S yx S − 1 xx C xx ] + const. 18 / 46

3. Output-weighted optimal sampling Gradient of the criterion For general functions of the form λ [ h ] = tr [ S ′− 1 xx C ] , where C is a symmetric matrix. The gradient takes the form ∂λ = − 2 h T S ′− 1 xx CS ′− 1 xx . ∂ h k 19 / 46

Example 1: 2-dimensional input � σ 2 � 0 ) and σ 2 1 y ( x ) = ˆ a 1 x 1 +ˆ a 2 x 2 + � , where x ∼ N ( 0 , V = 0 . 05 . ˆ σ 2 0 2 a 2 = 1 . 3, and σ 2 1 = 1 . 4 , σ 2 Case I : ˆ a 1 = 0 . 8 , ˆ 2 = 0 . 6 . a 2 = 2 . 0, and σ 2 1 = 2 . 0 , σ 2 Case II: ˆ a 1 = 0 . 01 , ˆ 2 = 0 . 2 . 20 / 46

Results for the 2D problem 21 / 46

Example 2: A 20-dimensional input 20 � a m x m + � , where x m ∼ N ( 0 , σ 2 y ( x ) = m ) , m = 1 , ..., 20 , ˆ ˆ m = 1 � � 3 � � m 10 − 3 , m = 1 , ..., 20 , a m = 1 + 40 ˆ 10 � 1 � 1 σ 2 128 ( m − 10 ) 3 10 − 1 , m = 1 , ..., 20 . m = 4 + For the observation noise we consider two cases: Case I: σ 2 � = 0 . 05 (accurate observations) Case II: σ 2 � = 0 . 5 (noisy observations) 22 / 46

Example 2: A 20-dimensional input Coefficients, ˆ α m , of the map ˆ y ( x ) (black curve) plotted together with the variance of each input direction σ 2 m (red curve). 23 / 46

Example 2: A 20-dimensional input Performance of the two adaptive approaches based on µ c and Q ∞ . 24 / 46

Example 2: A 20-dimensional input Energy of the different components of h with respect to the number of iteration N for Case I of the high dimensional problem. 25 / 46

Optimal sampling for nonlinear regression Let the input x ∈ X ⊂ R m , be expressed as a function of another input z ∈ Z ⊂ R s where the input value has distribution f z and Z be a compact set. We choose a set of basis functions x = φ ( z ) . The distribution of the output values will be p ( y | z , D , V ) = N ( S y φ S − 1 φφ φ ( z ) , V ( 1 + c )) , c = φ ( z ) T S − 1 φφ φ ( z ) , N � φ ( z i ) φ ( z i ) T S φφ = i = 1 26 / 46

Outcome-weighted sampling for Bayesian analysis Themis Sapsis and - PowerPoint PPT Presentation

Outcome-weighted sampling for Bayesian analysis Themis Sapsis and Antoine Blanchard Department of Mechanical Engineering Massachusetts Institute of Technology Funding: ONR, AFOSR, Sloan April 23, 2020 1 / 46 Problems-Motivation Risk

Weighted graphs Weighted graphs Weighted graphs Weighted graphs Graphs with numbers, called

Sampling Methods Oliver Schulte - CMPT 419/726 Bishop PRML Ch. 11 Sampling Rejection Sampling

Chapter 7. Sampling Chapter 7. Sampling methods? methods? Two types of sampling methods Two

Multiple importance sampling Slides for CS6630 lecture 6 sampling the BRDF sampling the

What is the strengths and weakness of these sampling methods? Sampling Strengths /

Introduction to Outcome Harvesting Open Contracting Programme Agenda Definition of Outcome

Outcome Based Approach in Outcome Based Approach in Outcome Based Approach in Outcome Based

Weighted graphs 2 Weighted graphs So far we have only considered weighted graphs with

Weighted graphs 3 Weighted graph Edges in weighted graph are assigned a weight: w(v 1 , v 2 ),

Sampling Overview R toy sampling Non-probability sampling Probability Methods (AKA random)

Sampling Sediment and Sampling Sediment and Sampling Sediment and Porewater Sampling Sediment

Sampling Methods CMSC 678 UMBC Outline Recap Monte Carlo methods Sampling Techniques Uniform

Being Bayesian About Being Bayesian About Net work St ruct ure Net work St ruct ure A Bayesian

Outline Intro to RL and Bayesian Learning History of Bayesian RL Model-based Bayesian

Outcome Analysis Presentation Nigeria October 2018 Livelihood Zone Map Outcome Analysis Result

Outcome Analysis Presentation Nigeria February 2019 Livelihood Zone Map Outcome Analysis

Interpolation CS3220 - Summer 2008 Jonathan Kaldor Interpolation Weve looked at the

Review: Church Encodings true = \x.\y.x; // Booleans false = \x.\y.y; pair = \x.\y.\f.f x y;

Wiener-Hopf kernel es/ma/on NEU 466M Instructor: Professor Ila R. Fiete Spring 2016 Problem

A Universal Language A Universal Language Scheme. It contains terms and rules describing

Introduction to the lambda calculus Polyvios.Pratikakis@imag.fr Based on slides by Jeff Foster,

Cache Policies Philipp Koehn 21 October 2019 Philipp Koehn Computer Systems Fundamentals: Cache

Dynamic Web Applications via Collaborative Hybrid Analysis Xiaoyin Wang* UC Berkeley Lu Zhang

... ... e8 9c 4f e5 ff call 0040f79ch 8b f0 mov