SLIDE 1

Multilevel methods for fast Bayesian optimal experimental design

Raúl Tempone

Alexander von Humboldt Professor, RWTH-Aachen, KAUST
Joint work with J. Beck (KAUST), B.M. Dia (KFUPM), and L. Espath (RWTH)

DCSE Fall School on ROM and UQ, November 4–8, 2019

SLIDE 2

Optimal Experimental Design (OED)

◮ Find the design that maximizes the success of an experiment for some desired experimental goal
◮ Predict the outcome without performing the real experiment
◮ Model-based approach: important when resources are limited

SLIDE 3

Motivating problem: Electrical Impedance Tomography

Quantities of interest: material properties
Measurements: electrical potentials at electrodes
Goal: find the electrode placement (design) that maximizes the informativeness of the measurements

SLIDE 4

Data model setting

Data model assumption:
$$Y(\xi) = g(\theta, \xi) + \epsilon,$$
where
◮ $Y = (y_1, y_2, \ldots, y_q) \in \mathbb{R}^q$, the measurements
◮ $\epsilon \in \mathbb{R}^q$, the measurement error
◮ $g \in \mathbb{R}^q$, the model predictions (usually the solution of a mathematical model)
◮ $\theta \in \mathbb{R}^d$, the quantities of interest
◮ $\xi = (\xi_1, \xi_2, \ldots, \xi_s)$, the design parameters

Measurement error assumption: the measurement errors, $\epsilon$, are zero-mean Gaussian with covariance matrix $\Sigma_\epsilon$.

SLIDE 5

Data model with repetitive experiments

Data model with repetitive experiments:
$$Y_i(\xi) = g(\theta, \xi) + \epsilon_i, \qquad i = 1, \ldots, N_e,$$
where $N_e$ is the number of repetitive experiments with design $\xi$. The full set of data is $Y = \{Y_i\}_{i=1}^{N_e}$.

SLIDE 6

Simple OED example: a linear regression model

Linear regression model:
$$Y = X\theta + \epsilon,$$
where
◮ $Y$, the vector of observation values
◮ $X$, the design matrix
◮ $\epsilon$, the zero-mean error vector with known covariance matrix $\Sigma_\epsilon$
◮ $\theta$, the unknown parameter vector of interest

Inverse problem: estimate $\theta$.

Classical approach:
◮ Least-squares estimate ($\arg\min_\theta \epsilon^T \epsilon$): $\hat\theta = (X^T X)^{-1} X^T Y$
◮ $\mathrm{Cov}(\hat\theta) = (X^T X)^{-1} X^T \Sigma_\epsilon X (X^T X)^{-1}$

OED goal: find the design $X$ that minimizes $\mathrm{Cov}(\hat\theta)$.
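As a small illustration (my own, not from the slides), the sketch below evaluates this covariance for two hypothetical design matrices and reports its trace and determinant, the quantities used by the alphabetic criteria on the next slide; the matrices X1, X2 and the noise covariance are made-up assumptions.

```python
# Minimal sketch: compare two hypothetical designs via Cov(theta_hat) (all inputs made up).
import numpy as np

def cov_ols(X, Sigma_eps):
    """(X^T X)^-1 X^T Sigma_eps X (X^T X)^-1, covariance of the least-squares estimate."""
    XtX_inv = np.linalg.inv(X.T @ X)
    return XtX_inv @ X.T @ Sigma_eps @ X @ XtX_inv

rng = np.random.default_rng(0)
Sigma_eps = 0.1 * np.eye(5)                         # known error covariance (assumption)
X1 = rng.normal(size=(5, 2))                        # candidate design 1 (assumption)
X2 = np.column_stack([np.ones(5), np.arange(5.0)])  # candidate design 2 (assumption)

for name, X in [("X1", X1), ("X2", X2)]:
    Lam = cov_ols(X, Sigma_eps)
    print(name, "tr:", np.trace(Lam), "det:", np.linalg.det(Lam))  # A- and D-criteria
```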

SLIDE 7

Alphabetic optimalities

What does it mean to minimize $\mathrm{Cov}(\hat\theta) \overset{\mathrm{def}}{=} \Lambda$?

◮ A-optimality: minimize the trace of the covariance matrix, $\mathrm{tr}(\Lambda)$
◮ C-optimality: minimize the variance of a predefined linear combination of the quantities of interest, $c^T \Lambda c$
◮ D-optimality: minimize the determinant of the covariance matrix, $\det(\Lambda)$
◮ E-optimality: minimize the maximum eigenvalue of the covariance matrix $\Lambda$

For nonlinear models the problem is much more challenging! Entropy-based expected information gain in a Bayesian setting: Lindley, D. V. (1956). On a measure of the information provided by an experiment. The Annals of Mathematical Statistics, 27(4), 986-1005.

SLIDE 8

Entropy-based OED criterion

1948, information entropy definition: Claude Shannon introduced the concept of information entropy,
$$\text{Entropy} = \sum_i p_i \log \frac{1}{p_i},$$
for probabilities $p_i$. The differential entropy for a continuous random variable with probability density function (pdf) $p(x)$ with support $\mathcal{X}$ is
$$\text{Differential Entropy} = \int_{\mathcal{X}} \log\!\left(\frac{1}{p(x)}\right) p(x)\, dx.$$

1956, Shannon entropy-based OED: Dennis V. Lindley proposed an entropy-based measure of the information provided by an experiment: the relative entropy of the posterior pdf $\pi(\theta|Y)$ with respect to the prior pdf $\pi(\theta)$,
$$\text{Relative Entropy} = \underbrace{\int \log\!\left(\frac{1}{\pi(\theta)}\right) \pi(\theta|Y)\, d\theta}_{=\,\text{Cross Entropy}} \;-\; \underbrace{\int \log\!\left(\frac{1}{\pi(\theta|Y)}\right) \pi(\theta|Y)\, d\theta}_{=\,\text{Differential Entropy}},$$
in expectation.

SLIDE 9

Information gain

The information gain (Lindley, 1956) is here defined by the Kullback-Leibler (KL) divergence between the prior and posterior pdfs,
$$D_{\mathrm{KL}}\big(\pi(\theta|Y, \xi)\,\|\,\pi(\theta)\big) \overset{\mathrm{def}}{=} \int_\Theta \log\!\left(\frac{\pi(\theta|Y, \xi)}{\pi(\theta)}\right) \pi(\theta|Y, \xi)\, d\theta,$$
where
◮ $\pi(\theta)$ is the prior pdf for $\theta$
◮ $\pi(\theta|Y, \xi)$ is the posterior pdf for $\theta$ given $Y$ for design $\xi$

How to obtain the posterior pdf? Use Bayes' rule:
$$\pi(\theta|Y, \xi) = \frac{p(Y|\theta, \xi)}{p(Y|\xi)}\, \pi(\theta),$$
where
◮ $p(Y|\theta, \xi)$ is the likelihood function for $Y$
◮ $p(Y|\xi)$ is the evidence, i.e., $p(Y|\xi) = \int_\Theta p(Y|\theta, \xi)\, \pi(\theta)\, d\theta$
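For intuition only (not part of the slides), here is a minimal sketch of the information gain in a linear-Gaussian case where Bayes' rule and the KL divergence are available in closed form; the prior, noise level, and datum are arbitrary assumptions.

```python
# Minimal sketch: information gain for a 1D conjugate-Gaussian model (all numbers assumed).
import numpy as np

def kl_gauss(mu_post, var_post, mu_prior, var_prior):
    """D_KL( N(mu_post, var_post) || N(mu_prior, var_prior) )."""
    return 0.5 * (np.log(var_prior / var_post)
                  + (var_post + (mu_post - mu_prior) ** 2) / var_prior - 1.0)

mu0, var0 = 0.0, 1.0   # prior pi(theta) = N(mu0, var0)
var_eps = 0.1          # noise variance in the data model Y = theta + eps
y = 0.7                # one observed measurement

# Bayes' rule in closed form for this conjugate model
var_post = 1.0 / (1.0 / var0 + 1.0 / var_eps)
mu_post = var_post * (mu0 / var0 + y / var_eps)

print("information gain D_KL(posterior || prior):",
      kl_gauss(mu_post, var_post, mu0, var0))
```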

SLIDE 10

Expected Information Gain (EIG) criterion

Information gain:
$$D_{\mathrm{KL}}\big(\pi(\theta|Y, \xi)\,\|\,\pi(\theta)\big) = \int_\Theta \log\!\left(\frac{p(Y|\theta, \xi)}{p(Y|\xi)}\right) \frac{p(Y|\theta, \xi)\, \pi(\theta)}{p(Y|\xi)}\, d\theta.$$
The information about $Y$ is given by the data model. The expected information gain (EIG) is given by
$$I(\xi) \overset{\mathrm{def}}{=} \int D_{\mathrm{KL}}\big(\pi(\theta|Y, \xi)\,\|\,\pi(\theta)\big)\, p(Y)\, dY = \int_\Theta \int \log\!\left(\frac{p(Y|\theta, \xi)}{\int_\Theta p(Y|\theta, \xi)\, \pi(\theta)\, d\theta}\right) p(Y|\theta, \xi)\, dY\, \pi(\theta)\, d\theta = E_\theta\!\left[E_{Y|\theta}\!\left[\log\!\left(\frac{p(Y|\theta, \xi)}{E_\theta[p(Y|\theta, \xi)]}\right)\right]\right],$$
where, because of the assumed data model, the likelihood follows a multivariate normal distribution,
$$p(Y|\theta, \xi) = \det(2\pi\Sigma_\epsilon)^{-\frac{N_e}{2}} \exp\!\left(-\frac{1}{2}\sum_{i=1}^{N_e} \|y_i - g(\theta, \xi)\|^2_{\Sigma_\epsilon^{-1}}\right).$$
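Since every estimator below repeatedly evaluates this likelihood, a small helper (my own; the toy forward model and covariance are assumptions) for the log of the expression above may be useful:

```python
# Sketch of log p(Y | theta, xi) under the Gaussian data model (toy g and Sigma_eps assumed).
import numpy as np

def log_likelihood(Y, theta, xi, g, Sigma_eps):
    """Y has shape (Ne, q); returns the log of the multivariate normal likelihood above."""
    Ne, q = Y.shape
    Sigma_inv = np.linalg.inv(Sigma_eps)
    _, logdet = np.linalg.slogdet(2.0 * np.pi * Sigma_eps)
    resid = Y - g(theta, xi)                                  # broadcast over Ne repetitions
    quad = np.einsum('iq,qr,ir->', resid, Sigma_inv, resid)   # sum_i ||y_i - g||^2_{Sigma^-1}
    return -0.5 * Ne * logdet - 0.5 * quad

g = lambda theta, xi: np.array([theta * xi, theta**2 * xi])   # toy forward model (assumption)
Sigma_eps = 1e-3 * np.eye(2)
Y = np.array([[0.11, 0.02], [0.09, 0.01]])                    # Ne = 2 repetitions, q = 2
print(log_likelihood(Y, theta=0.5, xi=0.2, g=g, Sigma_eps=Sigma_eps))
```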

SLIDE 11

EIG surface

SLIDE 12

Estimating the EIG

The expected information gain can be computed separately for each design $\xi$.

EIG criterion:
$$I = E_\theta\!\left[E_{Y|\theta}\!\left[\log\!\left(\frac{p(Y|\theta)}{E_\theta[p(Y|\theta)]}\right)\right]\right],$$
where the likelihood is given by the data model,
$$p(Y|\theta) = \det(2\pi\Sigma_\epsilon)^{-\frac{N_e}{2}} \exp\!\left(-\frac{1}{2}\sum_{i=1}^{N_e} \|y_i - g(\theta)\|^2_{\Sigma_\epsilon^{-1}}\right).$$

Conventional choice: Double-loop Monte Carlo sampling

SLIDE 13

Double-loop Monte Carlo (DLMC) for estimating EIG

The DLMC estimator for the EIG:
$$I_{\mathrm{dl}} \overset{\mathrm{def}}{=} \frac{1}{N} \sum_{n=1}^{N} \log\!\left(\frac{p(Y_n|\theta_n)}{\frac{1}{M}\sum_{m=1}^{M} p(Y_n|\theta_{n,m})}\right),$$
where $\theta_n, \theta_{n,m} \overset{\mathrm{i.i.d.}}{\sim} \pi(\theta)$ and $Y_n \overset{\mathrm{i.i.d.}}{\sim} p(Y|\theta_n)$.

Average work model: $W = N \times M$ likelihood evaluations!

Bias and variance can be modeled (Ryan, 2003):
◮ Bias: $|E[I_{\mathrm{dl}}] - I| \approx \frac{C_1}{M}$, for some constant $C_1 > 0$ (not unbiased!)
◮ Variance: $V[I_{\mathrm{dl}}] \approx \frac{C_2}{N} + \frac{C_3}{NM}$, for some constants $C_2, C_3 > 0$.
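To make the double loop concrete, here is a minimal sketch of $I_{\mathrm{dl}}$ for a toy scalar model; the forward model, prior, noise level, and sample sizes are my own assumptions, not the slides' example.

```python
# Minimal DLMC sketch for the EIG (toy scalar model; all numbers are assumptions).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
sigma_eps = 0.05                      # noise standard deviation
g = lambda theta, xi: theta * xi      # toy forward model

def dlmc_eig(xi, N, M):
    theta_out = rng.uniform(0.0, 1.0, size=N)                   # theta_n ~ pi = U(0,1)
    Y = g(theta_out, xi) + sigma_eps * rng.standard_normal(N)   # Y_n ~ p(Y | theta_n)
    total = 0.0
    for n in range(N):
        like = norm.pdf(Y[n], loc=g(theta_out[n], xi), scale=sigma_eps)
        theta_in = rng.uniform(0.0, 1.0, size=M)                # theta_{n,m} ~ pi
        evidence = np.mean(norm.pdf(Y[n], loc=g(theta_in, xi), scale=sigma_eps))
        total += np.log(like / evidence)
    return total / N                  # cost ~ N*M likelihood evaluations

print("DLMC EIG estimate:", dlmc_eig(xi=0.8, N=2000, M=200))
```

Increasing M shrinks the bias of the inner average, while increasing N shrinks the statistical error, in line with the bias and variance models above.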

SLIDE 14

How to control the estimator error?

Control the accuracy: choose $N$ and $M$ to control the estimator's accuracy.

Goal: for a specified error tolerance $\mathrm{TOL} > 0$, at a confidence level given by $0 < \alpha \ll 1$, select the values of $N$ and $M$ that minimize the computational work while satisfying
$$P\big(|I_{\mathrm{dl}}(N, M) - I| \le \mathrm{TOL}\big) \ge 1 - \alpha.$$

SLIDE 15

Total Error = Bias + Statistical Error

Split the total error of a random estimator $\mathcal{I}$ of $I$ into a bias component and a statistical error component,
$$|\mathcal{I} - I| \le |\mathcal{I} - E[\mathcal{I}]| + |E[\mathcal{I}] - I|.$$

Optimization problem: minimize the estimator's work, $W$, subject to
$$|E[\mathcal{I}] - I| \le (1 - \kappa)\,\mathrm{TOL}, \qquad |\mathcal{I} - E[\mathcal{I}]| \le \kappa\,\mathrm{TOL},$$
for $\kappa \in (0, 1)$, where the latter (statistical) constraint should hold with probability $1 - \alpha$. By using the CLT, the statistical error constraint is replaced by
$$C_\alpha \sqrt{V[\mathcal{I}]} \le \kappa\,\mathrm{TOL},$$
for $C_\alpha = \Phi^{-1}\!\left(1 - \frac{\alpha}{2}\right)$, where $\Phi$ is the cumulative distribution function (CDF) of a standard normal random variable.

SLIDE 16

Optimal choice of DLMC parameters

Optimization problem for DLMC:
$$(N^*, M^*, \kappa^*) = \arg\min_{(N, M, \kappa)} N \times M \quad \text{subject to} \quad
\begin{cases}
\dfrac{C_{\mathrm{dl},3}}{M} \le (1 - \kappa)\,\mathrm{TOL}, \\[6pt]
\dfrac{C_{\mathrm{dl},1}}{N} + \dfrac{C_{\mathrm{dl},2}}{NM} \le \left(\dfrac{\kappa}{C_\alpha}\right)^{2} \mathrm{TOL}^2.
\end{cases}$$

Optimal choice of the number of samples (Beck et al., 2018):
$$N^* = \frac{C_\alpha^2}{2\kappa^*}\,\frac{C_{\mathrm{dl},1}}{1 - \kappa^*}\,\mathrm{TOL}^{-2}, \qquad M^* = \frac{0.5\, C_{\mathrm{dl},2}}{1 - \kappa^*}\,\mathrm{TOL}^{-1}.$$

Optimal average work: for a specified TOL with the specified probability of success,
$$\text{Optimal average work} = N^* M^* \propto \mathrm{TOL}^{-3}, \quad \text{as } \mathrm{TOL} \to 0.$$
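As a rough sketch (my own simplification, not the exact optimizer above), one can pick M from the bias constraint and N from the leading variance term for a fixed split κ; the constants C_bias and C_var are placeholders that would have to be estimated in practice.

```python
# Rough sketch: N and M from TOL using leading-order constants and a fixed kappa (assumptions).
from math import ceil
from scipy.stats import norm

def dlmc_parameters(TOL, C_bias, C_var, alpha=0.05, kappa=2.0 / 3.0):
    C_alpha = norm.ppf(1.0 - alpha / 2.0)              # ~1.96 for alpha = 0.05
    M = ceil(C_bias / ((1.0 - kappa) * TOL))           # bias ~ C_bias / M <= (1 - kappa) TOL
    N = ceil(C_alpha**2 * C_var / (kappa * TOL)**2)    # var  ~ C_var / N  <= (kappa TOL / C_alpha)^2
    return N, M

for TOL in (1e-1, 1e-2, 1e-3):
    N, M = dlmc_parameters(TOL, C_bias=0.5, C_var=1.0)
    print(f"TOL={TOL:g}: N={N}, M={M}, work ~ N*M = {N * M:.2e}")  # grows like TOL^-3
```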

SLIDE 17

Improving the constant: Laplace-based importance sampling

Assumption: the posterior pdf of $\theta$ can be well approximated by a Gaussian.

Laplace approximation: approximate the posterior pdf, $\pi(\theta|Y)$, by a multivariate normal pdf, $\tilde\pi(\theta|Y)$, following $N\big(\hat\theta(Y), \hat\Sigma(\hat\theta(Y))\big)$, i.e.,
$$\tilde\pi(\theta|Y) = (2\pi)^{-\frac{d}{2}} \det(\hat\Sigma)^{-\frac{1}{2}} \exp\!\left(-\frac{1}{2}\|\theta - \hat\theta\|^2_{\hat\Sigma^{-1}}\right),$$
with maximum a posteriori (MAP) estimate
$$\hat\theta(Y) \overset{\mathrm{def}}{=} \arg\max_{\theta \in \Theta} \left(-\frac{1}{2}\sum_{i=1}^{N_e} \|Y_i - g(\theta)\|^2_{\Sigma_\epsilon^{-1}} + \log(\pi(\theta))\right),$$
and
$$\hat\Sigma^{-1}(\theta) = N_e\, J(\theta)^T \Sigma_\epsilon^{-1} J(\theta) - \nabla_\theta\nabla_\theta \log(\pi(\theta)) + \mathcal{O}_P\!\left(\frac{1}{\sqrt{N_e}}\right).$$

This optimization problem can typically be solved sufficiently with ~20-40 extra solves per outer sample (indexed by $n$).
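A minimal sketch of the Laplace step for a toy 1D model (forward model, prior, and data are my assumptions): find the MAP by numerical optimization and take the curvature at the MAP, here by a finite difference rather than the Jacobian formula above.

```python
# Minimal Laplace-approximation sketch (toy 1D model; prior, noise, data are assumptions).
import numpy as np
from scipy.optimize import minimize

sigma_eps = 0.05
g = lambda theta: np.sin(theta) + theta            # toy forward model
Y = np.array([0.9, 0.95, 1.02])                    # Ne = 3 repeated measurements

def neg_log_post(t):
    theta = float(np.ravel(t)[0])
    misfit = 0.5 * np.sum((Y - g(theta))**2) / sigma_eps**2
    return misfit + 0.5 * theta**2                 # standard normal prior (assumption)

res = minimize(neg_log_post, x0=np.array([0.0]), method="Nelder-Mead")
theta_map = res.x[0]

# Curvature at the MAP via a finite-difference second derivative
h = 1e-4
hess = (neg_log_post(theta_map + h) - 2 * neg_log_post(theta_map)
        + neg_log_post(theta_map - h)) / h**2
print(f"MAP = {theta_map:.4f}, Laplace variance = {1.0 / hess:.6f}")
```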

SLIDE 18

DLMC with Laplace-based Importance Sampling (DLMCIS)

Laplace-based importance sampling for the inner expectation:
$$p(Y) = \int p(Y|\theta)\, \pi(\theta)\, d\theta = \int \frac{p(Y|\theta)\, \pi(\theta)}{\tilde\pi(\theta|Y)}\, \tilde\pi(\theta|Y)\, d\theta.$$

DLMCIS estimator (Ryan et al., 2015; Beck et al., 2018):
$$I_{\mathrm{dlis}} \overset{\mathrm{def}}{=} \frac{1}{N} \sum_{n=1}^{N} \log\!\left(\frac{p(Y_n|\theta_n)}{\frac{1}{M}\sum_{m=1}^{M} \frac{p(Y_n|\tilde\theta_{n,m})\, \pi(\tilde\theta_{n,m})}{\tilde\pi(\tilde\theta_{n,m}|Y_n)}}\right),$$
where $\tilde\theta_{n,m} \overset{\mathrm{i.i.d.}}{\sim} \tilde\pi(\theta|Y_n)$ (no additional bias!).

Optimal average work: $\propto \mathrm{TOL}^{-3}$ as $\mathrm{TOL} \to 0$. The proportionality constant is substantially smaller for DLMCIS than for DLMC! (variance reduction)
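A hedged sketch of the reweighted inner average for a toy linear-Gaussian setup (my assumptions), where the Laplace proposal coincides with the exact posterior; the point is only the weight $p(Y|\theta)\pi(\theta)/\tilde\pi(\theta|Y)$.

```python
# Importance-sampling sketch for the inner evidence (toy linear-Gaussian case; assumptions).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
mu0, s0 = 0.0, 1.0     # prior N(mu0, s0^2)
s_eps = 0.2            # noise std; data model Y = theta + eps
y = 1.3                # one observed datum

# Laplace proposal: here it equals the exact posterior, since the model is linear-Gaussian
s_post2 = 1.0 / (1.0 / s0**2 + 1.0 / s_eps**2)
mu_post = s_post2 * (mu0 / s0**2 + y / s_eps**2)

def evidence_is(M):
    """(1/M) sum_m p(y|theta_m) pi(theta_m) / pi_tilde(theta_m|y),  theta_m ~ pi_tilde."""
    th = rng.normal(mu_post, np.sqrt(s_post2), size=M)
    w = (norm.pdf(y, loc=th, scale=s_eps) * norm.pdf(th, loc=mu0, scale=s0)
         / norm.pdf(th, loc=mu_post, scale=np.sqrt(s_post2)))
    return np.mean(w)

exact = norm.pdf(y, loc=mu0, scale=np.sqrt(s0**2 + s_eps**2))   # closed-form evidence
print("IS estimate (M=100):", evidence_is(100), " exact:", exact)
```

Because the proposal here matches the posterior exactly, every weight equals the evidence and the estimator has zero variance; with a Laplace proposal that only approximates the posterior, the variance is small rather than zero, which is the source of the smaller constant.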

SLIDE 19

Numerical demonstration

Nonlinear scalar model:
$$y_i(\xi) = \theta^3 \xi^2 + \theta \exp(-|0.2 - \xi|) + \epsilon_i, \qquad i = 1, 2, \ldots, N_e,$$
with
◮ $\theta \sim U(0, 1)$
◮ $\epsilon_i \overset{\mathrm{i.i.d.}}{\sim} N(0, 10^{-3})$
◮ $N_e = 1$ and $N_e = 10$
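The test model can be written down directly; a small sketch that draws synthetic data from it (the seed and the chosen $\xi$ are mine):

```python
# Sketch of the nonlinear scalar test model (seed and xi chosen here for illustration).
import numpy as np

rng = np.random.default_rng(3)

def sample_data(theta, xi, Ne):
    """y_i(xi) = theta^3 xi^2 + theta exp(-|0.2 - xi|) + eps_i,  eps_i ~ N(0, 1e-3)."""
    mean = theta**3 * xi**2 + theta * np.exp(-abs(0.2 - xi))
    return mean + np.sqrt(1e-3) * rng.standard_normal(Ne)

theta = rng.uniform(0.0, 1.0)        # theta ~ U(0, 1)
print("Ne=1 :", sample_data(theta, xi=0.5, Ne=1))
print("Ne=10:", sample_data(theta, xi=0.5, Ne=10))
```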

SLIDE 20

Expected information gain

Figure: Expected information gain

SLIDE 21

Computational work for a single design

Figure: Average running time vs. TOL

SLIDE 22

Optimal estimator parameter choices

Figure: Left: Ne = 1, Right: Ne = 10

SLIDE 23

Consistency between TOL and Error

Figure: Error vs. TOL

SLIDE 24

When g needs to be approximated

Assumptions:
◮ $g(\theta, \xi)$ needs to be computed numerically.
◮ We only have access to the numerical approximation $g_h$ of $g$, where $h > 0$ is the mesh size.
◮ The (weak) approximation error for $g_h$ is modeled as: Bias $= \mathcal{O}(h^{\eta_w})$, with $\eta_w > 0$.
◮ The computational work for $g_h$ is modeled as: Average work $\propto h^{-\gamma}$, with $\gamma > 0$.

Then the approximate likelihood (replacing $p(Y|\theta)$) is
$$p_h(Y|\theta) \overset{\mathrm{def}}{=} \det(2\pi\Sigma_\epsilon)^{-\frac{N_e}{2}} \exp\!\left(-\frac{1}{2}\sum_{i=1}^{N_e} \|y_i - g_h(\theta)\|^2_{\Sigma_\epsilon^{-1}}\right).$$

SLIDE 25

Numerical discretization in EIG

Let us write the EIG as $I = E[Z]$, where
$$Z(\theta) \overset{\mathrm{def}}{=} E[f(Y, \theta)\,|\,\theta] \quad \text{and} \quad f(Y, \theta) \overset{\mathrm{def}}{=} \log\!\left(\frac{p(Y|\theta)}{p(Y)}\right).$$
Then, using a numerical approximation $g_h$ of $g$, the EIG is approximated by
$$I_h \overset{\mathrm{def}}{=} E[Z_h], \quad \text{with} \quad Z_h(\theta) \overset{\mathrm{def}}{=} E[f_h(Y, \theta)\,|\,\theta] \quad \text{and} \quad f_h(Y, \theta) \overset{\mathrm{def}}{=} \log\!\left(\frac{p_h(Y|\theta)}{p_h(Y)}\right).$$

SLIDE 26

Optimal method parameters for DLMC

The solution is given by (Beck et al., 2018):
$$N^* = \frac{C_\alpha^2}{2\kappa^*}\,\frac{C_{\mathrm{dl},1}}{1 - \kappa^*}\left(1 + \frac{\gamma}{2\eta_w}\right)\mathrm{TOL}^{-2}, \qquad M^* = \frac{C_{\mathrm{dl},2}}{2(1 - \kappa^*)}\left(1 + \frac{\gamma}{2\eta_w}\right)\mathrm{TOL}^{-1}, \qquad h^* = \left(\frac{\gamma}{\eta_w}\,\frac{\kappa^*}{2 C_w}\right)^{1/\eta_w}\mathrm{TOL}^{1/\eta_w},$$
and $\kappa^*$ is given by a second-order equation.

Optimal average work: for a specified TOL with the specified probability of success,
$$\text{Optimal average work} = N^* M^* (h^*)^{-\gamma} \propto \mathrm{TOL}^{-\left(3 + \frac{\gamma}{\eta_w}\right)}, \quad \text{as } \mathrm{TOL} \to 0.$$

SLIDE 27

Multilevel Double-Loop Monte Carlo (MLDLMC) for EIG

Assume we have numerical discretizations $g_{h_\ell}$, where $h_\ell$, $\ell = 0, 1, \ldots$, is a sequence of decreasing mesh-element sizes, e.g., $h_\ell = h_0 2^{-\ell}$. First, following the standard multilevel method by Giles (2008), let us write the approximation of the EIG with mesh-element size $h_L$ as a telescoping sum with respect to the level $\ell$,
$$I_{h_L} = \sum_{\ell=0}^{L} E\big[\Delta[Z_\ell]\big] \approx I, \quad \text{where } Z_\ell = Z_{h_\ell} \text{ and } \Delta[Z_\ell(\theta)] \overset{\mathrm{def}}{=} \begin{cases} Z_\ell(\theta) - Z_{\ell-1}(\theta), & \text{if } \ell > 0, \\ Z_0(\theta), & \text{if } \ell = 0. \end{cases}$$
Furthermore, we assume the strong error model
$$V\big[\Delta[Z_\ell(\theta)]\big] \approx C_s\, h_\ell^{\eta_s}, \quad \text{for } \ell > 0,$$
where $\eta_s > 0$, typically $\eta_s \approx 2\eta_w$.
SLIDE 28

MLDLMC for EIG

We use the approximate evidence obtained with Laplace-based importance sampling,
$$\hat p_\ell\big(Y; \{\theta_m\}_{m=1}^{M_\ell}\big) \overset{\mathrm{def}}{=} \frac{1}{M_\ell} \sum_{m=1}^{M_\ell} p_\ell(Y|\theta_m)\, R_\ell(\theta_m; Y) \approx p_\ell(Y),$$
with $M_\ell = \lceil \nu^{\ell} M_0 \rceil$ for $\ell > 0$, where $\nu > 1$, $R_\ell(\theta_m; Y)$ is the importance-sampling weight, and the $\theta_m$ are drawn from the Laplace proposal $\tilde\pi(\theta|Y)$. Thus, we introduce
$$\hat Z_\ell\big(\theta; \{\theta_m\}_{m=1}^{M}\big) \overset{\mathrm{def}}{=} E_\ell\big[\hat f_\ell(Y, \theta; \{\theta_m\}_{m=1}^{M})\,\big|\,\theta\big],$$
with
$$\hat f_\ell\big(Y, \theta; \{\theta_m\}_{m=1}^{M}\big) \overset{\mathrm{def}}{=} \log\!\left(\frac{p_\ell(Y|\theta)}{\hat p_\ell\big(Y; \{\theta_m\}_{m=1}^{M}\big)}\right).$$

SLIDE 29

MLDLMC for EIG

The proposed multilevel double-loop Monte Carlo (MLDLMC) estimator with Laplace-based importance sampling for the expected information gain can be written as
$$I_L \overset{\mathrm{def}}{=} \frac{1}{N_0} \sum_{n=1}^{N_0} \hat f_0\big(Y^{(0)}_{0,n}, \theta^{(0)}_n; \{\theta^{(0)}_{m,n}\}_{m=1}^{M_0}\big) + \sum_{\ell=1}^{L} \frac{1}{N_\ell} \sum_{n=1}^{N_\ell} \left[\hat f_\ell\big(Y^{(\ell)}_{\ell,n}, \theta^{(\ell)}_n; \{\theta^{(\ell)}_{m,n}\}_{m=1}^{M_\ell}\big) - \hat f_{\ell-1}\big(Y^{(\ell)}_{\ell-1,n}, \theta^{(\ell)}_n; \{\theta^{(\ell)}_{m,n}\}_{m=1}^{M_{\ell-1}}\big)\right],$$
where
◮ $Y^{(k)}_{\ell,n} \overset{\mathrm{i.i.d.}}{\sim} p_\ell(Y|\theta^{(k)}_n)$
◮ $\theta^{(k)}_n \overset{\mathrm{i.i.d.}}{\sim} \pi(\theta)$
◮ $\theta^{(k)}_{m,n} \overset{\mathrm{i.i.d.}}{\sim} N\big(\hat\theta(Y^{(k)}_{k,n}),\ \hat\Sigma(\hat\theta(Y^{(k)}_{k,n}))\big)$ (Laplace-based importance sampling)
◮ $\{\theta^{(k)}_{m,n}\}_{m=1}^{M_{\ell-1}} \subseteq \{\theta^{(k)}_{m,n}\}_{m=1}^{M_\ell}$ (shared inner samples within each of the differences)

SLIDE 30

Asymptotic average work of MLDLMC

For a given $\mathrm{TOL} > 0$, the optimal $N_\ell$ is, under our given error and work assumptions,
$$N_\ell = \left(\frac{C_\alpha}{\kappa\,\mathrm{TOL}}\right)^{2} \sqrt{\frac{V_\ell}{M_\ell\, W(g_\ell)}}\, \sum_{\ell'=0}^{L} \sqrt{V_{\ell'}\, M_{\ell'}\, W(g_{\ell'})}.$$
The average work of the MLDLMC estimator is modeled as
$$W(I_{\mathrm{MLDLMC}}) \propto \sum_{\ell=0}^{L} M_\ell\, N_\ell\, h_\ell^{-\gamma},$$
and this leads us to the optimal work of the MLDLMC estimator,
$$W(I_{\mathrm{MLDLMC}}) = \left(\frac{C_\alpha}{\kappa\,\mathrm{TOL}}\right)^{2} \left(\sum_{\ell=0}^{L} \sqrt{V_\ell\, M_\ell\, W(g_\ell)}\right)^{2}, \quad \text{as } \mathrm{TOL} \to 0.$$
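A small sketch of the allocation formula above; the per-level variances, inner sample sizes, and per-solve costs are made-up placeholders.

```python
# Sketch of the optimal N_l allocation (V_l, M_l, and per-solve work W_l are made up here).
import numpy as np
from scipy.stats import norm

def optimal_Nl(V, M, Wg, TOL, kappa=0.5, alpha=0.05):
    """N_l = (C_alpha/(kappa TOL))^2 sqrt(V_l/(M_l W_l)) * sum_k sqrt(V_k M_k W_k)."""
    V, M, Wg = map(np.asarray, (V, M, Wg))
    C_alpha = norm.ppf(1.0 - alpha / 2.0)
    S = np.sum(np.sqrt(V * M * Wg))
    return np.ceil((C_alpha / (kappa * TOL))**2 * np.sqrt(V / (M * Wg)) * S).astype(int)

V  = [1e-2, 2e-3, 5e-4, 1.2e-4]   # variances of the level differences (assumed)
M  = [10, 40, 160, 640]           # inner samples per level, M_l = 4 M_{l-1}
Wg = [1.0, 4.0, 16.0, 64.0]       # cost per forward solve, ~ h_l^(-gamma) with gamma = 2
print(optimal_Nl(V, M, Wg, TOL=0.02))
```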

SLIDE 31

Asymptotic average work of MLDLMC

Under the given assumptions, and with $h_\ell \propto 2^{-\ell}$ and $M_\ell = 4 M_{\ell-1}$ for $\ell > 0$, the average computational work of the MLDLMC estimator behaves as
$$W(I_{\mathrm{MLDLMC}}) \propto \begin{cases} \mathrm{TOL}^{-2}, & \text{if } \eta_s - \gamma - \eta_w > 0, \\ \mathrm{TOL}^{-2}\big(\log(\mathrm{TOL}^{-1})\big)^{2}, & \text{if } \eta_s - \gamma - \eta_w = 0, \\ \mathrm{TOL}^{-2\left(1 + \frac{\gamma - \eta_s}{2\eta_w}\right)}, & \text{if } \eta_s - \gamma - \eta_w < 0, \end{cases}$$
as $\mathrm{TOL} \to 0$, under the restriction $\eta_w \ge \min(\eta_s, \gamma)/2$.

More general result in (Beck et al., 2019): with $h_\ell \propto \beta^{\ell}$ and $M_\ell = \nu M_{\ell-1}$, under the assumption $\log_\beta(\nu) \le \eta_w$.

SLIDE 32

EIT problem with 4 plies

Figure: Experimental setup for an EIT problem with 4 plies

SLIDE 33

Experiment formulation

The data model is given as
$$Y = U_h(\theta) + \epsilon,$$
without repeated experiments. The noise is $\epsilon \sim N(0, 10^{-4})$, and $\theta_i \sim U\!\left(\frac{\pi}{2+i} - 0.05,\ \frac{\pi}{2+i} + 0.05\right)$, for $i = 1, \ldots, 4$. The output is the potential at nine electrodes, i.e., $y \in \mathbb{R}^{N_{el}-1}$. $U_h$ is the potential at the electrodes and can be obtained from the solution of an elliptic PDE with $h$-convergence rate $\eta_w \approx 1.2$ and work rate $\gamma \approx 2$.
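The prior on the four ply angles can be sampled directly; a two-line sketch (the seed is my choice):

```python
# Sketch: draw theta_i ~ U(pi/(2+i) - 0.05, pi/(2+i) + 0.05) for i = 1, ..., 4.
import numpy as np

rng = np.random.default_rng(5)
centers = np.pi / (2 + np.arange(1, 5))               # pi/3, pi/4, pi/5, pi/6
theta = rng.uniform(centers - 0.05, centers + 0.05)   # one prior sample of the 4 angles
print(theta)
```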

SLIDE 34

MLDLMC analysis

For the test we use $e_{el} = e_{sh} = e_{sp} = 2$. The figure shows the convergence of the telescoping differences, $\Delta \hat Z_\ell$, with respect to the level $\ell$.

Figure: The expected value and variance of the telescoping differences, $\Delta \hat Z_\ell$.

SLIDE 35

MLDLMC consistency

Figure: Error vs. TOL

The confidence parameter used is $\alpha = 0.05$ (i.e., a 95% confidence level).

SLIDE 36

Comparison

The performance of the methods can be understood by looking at the average running time against TOL.


Figure: A numerical comparison between DLMCIS (i.e., standard DLMC with importance sampling) and MLDLMC for the EIT problem with 4 plies.

SLIDE 37

Thanks for your attention

◮ Beck, J., Mansour Dia, B., Espath, L., & Tempone, R. (2019). Multilevel double loop Monte Carlo and stochastic collocation methods with importance sampling for Bayesian optimal experimental design. arXiv preprint arXiv:1811.11469.
◮ Beck, J., Mansour Dia, B., Espath, L., Long, Q., & Tempone, R. (2018). Fast Bayesian experimental design: Laplace-based importance sampling for the expected information gain. Computer Methods in Applied Mechanics and Engineering, 334, 523-553.
◮ Collier, N., Haji-Ali, A.-L., Nobile, F., von Schwerin, E., & Tempone, R. (2015). A continuation multilevel Monte Carlo algorithm. BIT Numerical Mathematics, 55(2), 399-432.
◮ Long, Q., Scavino, M., Tempone, R., & Wang, S. (2013). Fast estimation of expected information gains for Bayesian experimental designs based on Laplace approximations. Computer Methods in Applied Mechanics and Engineering, 259, 24-39.

SLIDE 38

Multilevel Double Loop Stochastic Collocation (MLDLSC)

A multilevel double loop stochastic collocation (MLDLSC) method for estimating EIG (Beck et al., 2019):


Figure: DLMC, MLDLMC, and MLDLSC. All methods use Laplace-based importance sampling.
