SLIDE 1

The Convergence of the Laplace Approximation and Noise-Level-Robust Monte Carlo Methods for Bayesian Inverse Problems

Daniel Rudolf, Claudia Schillings, Björn Sprungk, Philipp Wacker

Institute of Mathematical Stochastics, University of Göttingen

Workshop “Optimization and Inversion under Uncertainty”, RICAM Linz, November 15th, 2019

SLIDE 2

Bayesian Inverse Problems

Infer an unknown x ∈ ℝ^d given noisy observations of a forward map G : ℝ^d → ℝ^J,

  y = G(x) + ε,  ε ∼ N(0, n^{−1} Σ),  n ∈ ℕ.

Given a prior measure µ0 for x, here µ0 = N(0, C0), we obtain the posterior

  µn(dx) = (1/Zn) exp(−nΦ(x)) µ0(dx),  Φ(x) = ½ |y − G(x)|²_{Σ^{−1}},

where Zn := ∫_{ℝ^d} e^{−nΦ(x)} dµ0.

Objective: sample (approximately) from µn and compute

  Eµn[f] = ∫_{ℝ^d} f(x) µn(dx),  f ∈ L¹_{µ0}(ℝ).

In this talk we are interested in the case of increasing precision n → ∞.
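The following is a minimal illustrative sketch (not from the slides) of such a setup in Python; the forward map G, prior covariance C0, noise covariance Σ, and precision level n below are arbitrary toy choices used only to make the potential Φ and the unnormalized log-posterior concrete.

```python
# Toy Bayesian inverse problem: y = G(x) + eps, eps ~ N(0, Sigma/n), prior N(0, C0).
# All concrete choices below are illustrative assumptions, not from the talk.
import numpy as np

d, J = 2, 2                      # dimension of unknown x and of data y
C0 = np.eye(d)                   # prior covariance, mu_0 = N(0, C0)
Sigma = np.eye(J)                # noise covariance up to the precision factor 1/n
n = 100                          # precision level

def G(x):
    """Toy nonlinear forward map R^d -> R^J."""
    return np.array([np.exp((x[1] - x[0]) / 5.0), np.sin(x[1] - x[0])])

rng = np.random.default_rng(0)
x_true = rng.standard_normal(d)
y = G(x_true) + rng.multivariate_normal(np.zeros(J), Sigma / n)

def Phi(x):
    """Misfit potential Phi(x) = 1/2 |y - G(x)|^2 in the Sigma^{-1}-weighted norm."""
    r = y - G(x)
    return 0.5 * r @ np.linalg.solve(Sigma, r)

def log_post_unnormalized(x):
    """Log posterior density w.r.t. Lebesgue measure, up to the constant -log Z_n."""
    return -n * Phi(x) - 0.5 * x @ np.linalg.solve(C0, x)
```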

SLIDE 7

Computational Bayesian Inference

Computational methods for approximate sampling or integrating w.r.t. µn:

  Markov chain Monte Carlo
  Importance sampling
  Sequential Monte Carlo and particle filters
  Quasi-Monte Carlo and numerical quadrature
  ...

Common computational challenges:

1. Expensive evaluation of the forward model G → multilevel or surrogate methods

2. High-dimensional or even infinite-dimensional state space, e.g., function spaces → intense research in recent years for all mentioned methods

3. Concentrated µn due to informative data, i.e., n ≫ 1 or J ≫ 1 → analyzed so far in [Beskos et al., 2018] and [Schillings & Schwab, 2016]

SLIDE 8

Outline

1. Laplace Approximation
2. Markov Chain Monte Carlo
3. Importance Sampling
4. Quasi-Monte Carlo

SLIDE 9

Next

1. Laplace Approximation
2. Markov Chain Monte Carlo
3. Importance Sampling
4. Quasi-Monte Carlo

SLIDE 12

General Approach For Noise-Level Robust Sampling

Prior-based sampling or integration will suffer from the increasing difference between µn and µ0 as n → ∞, i.e.,

  dµn/dµ0 ∝ e^{−nΦ} → δ_{argmin Φ}  and  dTV(µn, µ0) → 1.

Idea: base sampling methods on a suitable (simple) reference measure mimicking the (increasing) concentration of µn.

Here, the Laplace approximation of µn:

  Lµn := N(xn, Cn),  xn := argmin_x { nΦ(x) + ½ |C0^{−1/2} x|² },  Cn := ( n∇²Φ(xn) + C0^{−1} )^{−1}.

Very common approximation in Bayesian statistics and OED ([Long et al., 2013], [Alexanderian et al., 2016], [Chen & Ghattas, 2017], ...).
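As a minimal illustrative sketch (assuming the toy Phi, C0, and n defined earlier), the MAP point xn and the Laplace covariance Cn can be computed with an off-the-shelf optimizer and a finite-difference Hessian; this is a shortcut for illustration, not the authors' implementation.

```python
# Sketch: compute the Laplace approximation L(mu_n) = N(x_n, C_n) numerically,
# assuming Phi, C0, n from the toy setup above. Illustrative only.
import numpy as np
from scipy.optimize import minimize

def neg_log_post(x):
    # n*Phi(x) + 1/2 |C0^{-1/2} x|^2, the functional minimized by the MAP point x_n
    return n * Phi(x) + 0.5 * x @ np.linalg.solve(C0, x)

def hessian_fd(f, x, eps=1e-5):
    """Simple central finite-difference Hessian (illustrative shortcut)."""
    dim = x.size
    H = np.zeros((dim, dim))
    for i in range(dim):
        for j in range(dim):
            e_i, e_j = np.eye(dim)[i] * eps, np.eye(dim)[j] * eps
            H[i, j] = (f(x + e_i + e_j) - f(x + e_i - e_j)
                       - f(x - e_i + e_j) + f(x - e_i - e_j)) / (4 * eps**2)
    return H

res = minimize(neg_log_post, np.zeros(d), method="Nelder-Mead")
x_n = res.x                                          # MAP point x_n
H_Phi = hessian_fd(Phi, x_n)                         # approximation of grad^2 Phi(x_n)
C_n = np.linalg.inv(n * H_Phi + np.linalg.inv(C0))   # Laplace covariance C_n
```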

SLIDE 14

Laplace’s Method for Asymptotics of Integrals [Laplace, 1774]

[Wong, 2001]: Considering integrals

  J(n) := ∫_D f(x) exp(−nΦ(x)) dx,  D ⊆ ℝ^d,

with sufficiently smooth f and Φ, we have, under suitable conditions, as n → ∞,

  J(n) = e^{−nΦ(x⋆)} n^{−d/2} ( f(x⋆) √det(2π H⋆^{−1}) + O(n^{−1}) ),

where x⋆ := argmin_{x∈ℝ^d} Φ ∈ D and H⋆ := ∇²Φ(x⋆) > 0.

Yields: given a smooth Lebesgue density of µ0, then for suitable f

  | ∫_{ℝ^d} f dµn − ∫_{ℝ^d} f dN(x⋆, (nH⋆)^{−1}) | ∈ O(n^{−1}).
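A quick numerical check of Laplace's method in d = 1 (with an illustrative choice of f and Φ, not taken from the slides): the quadrature value of J(n) approaches the leading-order term as n grows.

```python
# Sketch: compare J(n) = \int f(x) exp(-n*Phi(x)) dx with its Laplace-method
# leading term for a 1D example. f and Phi below are illustrative choices.
import numpy as np
from scipy.integrate import quad

f = np.cos                      # smooth integrand
Phi1 = lambda x: 0.5 * x**2     # minimizer x_star = 0, Phi1''(x_star) = 1

for n_level in [10, 100, 1000]:
    # integrand is negligible outside [-1, 1] for these n
    J_num, _ = quad(lambda x: f(x) * np.exp(-n_level * Phi1(x)), -1.0, 1.0)
    # leading term: e^{-n Phi(x*)} n^{-1/2} sqrt(2*pi / Phi''(x*)) f(x*)
    J_laplace = np.sqrt(2 * np.pi / n_level) * f(0.0)
    print(n_level, J_num, J_laplace, abs(J_num - J_laplace))
```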

SLIDE 16

Convergence of Laplace Approximation

Theorem ([Schillings, S., Wacker, 2019]). Assume Φ ∈ C³(ℝ^d), that xn is unique and Cn > 0 for all sufficiently large n, that a unique minimizer x⋆ := argmin_{x∈ℝ^d} Φ(x) exists with ∇²Φ(x⋆) > 0 and

  inf_{|x−x⋆|>r} Φ(x) ≥ Φ(x⋆) + δ_r,  δ_r > 0,

and that lim_{n→∞} xn = x⋆. Then dH(µn, Lµn) ∈ O(n^{−1/2}).

Closely related to the Bernstein–von Mises theorem, but:

  Covariance of Lµn depends on the given data (BvM: Fisher information)
  Misspecification (“ground truth” not in prior support) not important
  Density dµn/dLµn also exists in Hilbert spaces (for Gaussian µ0)

SLIDE 17

Remarks

The convergence theorem can be extended, under suitable assumptions, to

1. any prior µ0 which is absolutely continuous w.r.t. the Lebesgue measure,

2. sequences of potentials Φn, e.g., Φn(x) = (1/(2n)) Σ_{i=1}^{n} |y_i − G(x)|²,

3. the underdetermined case G : ℝ^d → ℝ^J, J < d, if µ0 is Gaussian and G acts only on a linear active subspace M with dim(M) ≤ J:

     G(x + m) = G(x)  ∀x ∈ ℝ^d, ∀m ∈ M⊥,

4. approximations x̃n, C̃n of xn, Cn such that ‖x̃n − xn‖, ‖C̃n − Cn‖ ∈ O(n^{−1}).

SLIDE 18

Examples

Example 1: µ0 = N(0, I₂), Φ(x) = ½ |y − G(x)|², G(x) = [exp((x₂ − x₁)/5), sin(x₂ − x₁)]⊤

[Figure: posterior µn and Laplace approximation Lµn; Hellinger distance dH(µn, Lµn) vs. n, observed rate −0.50]

Example 2: µ0 = N(0, I₂), Φ(x) = ½ |0 − G(x)|² with G(x) = x₂ − x₁²

[Figure: posterior µn and Laplace approximation Lµn; Hellinger distance dH(µn, Lµn) vs. n, observed rate 0.00]

SLIDE 19

Next

1. Laplace Approximation
2. Markov Chain Monte Carlo
3. Importance Sampling
4. Quasi-Monte Carlo

SLIDE 20

Markov Chain Monte Carlo (MCMC)

Construct a Markov chain (Xm)_{m∈ℕ} with invariant measure µn, i.e., Xm ∼ µn ⇒ X_{m+1} ∼ µn.

Given suitable conditions, we have Xm → µn in distribution as m → ∞, and for f ∈ L²_{µ0}(ℝ)

  S_M(f) := (1/M) Σ_{m=1}^{M} f(Xm) → Eµn[f]  almost surely as M → ∞.

The autocorrelation of the Markov chain affects the efficiency:

  M · E[ (S_M(f) − Eµn[f])² ] → Varµn(f) · ( 1 + 2 Σ_{m=1}^{∞} Corr(f(X₁), f(X_{1+m})) )  as M → ∞,

where the bracketed factor is the integrated autocorrelation time (IACT).
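A rough sketch (not from the slides) of how the IACT of a scalar functional of the chain can be estimated from samples; the truncation of the empirical autocorrelation sum is a simple heuristic chosen only for illustration.

```python
# Sketch: estimate the integrated autocorrelation time (IACT) of a scalar chain.
# The truncation rule below is a simple illustrative heuristic.
import numpy as np

def iact(samples, max_lag=None):
    """Estimate 1 + 2 * sum_m corr(f(X_1), f(X_{1+m})) from a 1D array of samples."""
    x = np.asarray(samples, dtype=float)
    x = x - x.mean()
    M = x.size
    if max_lag is None:
        max_lag = M // 10
    var = x @ x / M
    tau = 1.0
    for m in range(1, max_lag):
        rho = (x[:-m] @ x[m:]) / (M * var)   # empirical lag-m autocorrelation
        if rho < 0.0:                        # stop once correlations become noise
            break
        tau += 2.0 * rho
    return tau
```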

SLIDE 23

Metropolis-Hastings (MH) algorithm [Metropolis et al., 1953]

Given the current state Xm = x,

1. draw a proposal Ym = y according to the proposal kernel P(x, ·),

2. accept the proposed y with acceptance probability α(x, y), i.e., set X_{m+1} = y with probability α(x, y) and X_{m+1} = x with probability 1 − α(x, y).

The correct α = αn for µn-invariance is well known.
The efficiency of the MH algorithm depends entirely on a “good” choice of the proposal P.
Goal: construct proposals such that the efficiency/autocorrelation is robust w.r.t. n → ∞.
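A generic sketch of the MH loop for a symmetric (random-walk) proposal, in which case α(x, y) reduces to the posterior density ratio; log_post_unnormalized, C0, and d refer to the toy setup sketched earlier.

```python
# Sketch of Metropolis-Hastings with a *symmetric* proposal, for which
# alpha(x, y) = min(1, pi_n(y) / pi_n(x)). Illustrative only.
import numpy as np

def mh_chain(log_density, propose, x0, n_steps, rng):
    """Run MH with a symmetric proposal kernel `propose(x, rng)`."""
    x = np.array(x0, dtype=float)
    lp = log_density(x)
    chain = np.empty((n_steps, x.size))
    for m in range(n_steps):
        y = propose(x, rng)
        lp_y = log_density(y)
        if np.log(rng.uniform()) < lp_y - lp:   # accept with prob min(1, pi(y)/pi(x))
            x, lp = y, lp_y
        chain[m] = x
    return chain

rng = np.random.default_rng(1)
s = 0.5                                        # tunable stepsize (illustrative value)
rw_propose = lambda x, rng: x + s * rng.multivariate_normal(np.zeros(d), C0)
samples_rw = mh_chain(log_post_unnormalized, rw_propose, np.zeros(d), 5000, rng)
```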

SLIDE 25

Gaussian Random Walk-MH

Proposal kernel: P(x) = N(x, s²C0) with tunable stepsize s > 0.

[Figure: trace plots of the chain for s = 0.02 (mean acceptance ≈ 0.95), s = 0.6 (≈ 0.25), and s = 5 (≈ 0.01)]

If πn : ℝ^d → (0, ∞) denotes the density of µn:

  αn(x, y) = min{ 1, πn(y)/πn(x) }.

Dimension-robust version: pCN proposal [Beskos et al., 2008]

  P(x) = N( √(1 − s²) x, s²C0 ),  s ∈ (0, 1],

is µ0-reversible, which yields

  αn(x, y) = min{ 1, ( exp(−Φ(y)) / exp(−Φ(x)) )^n }.
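A sketch of the pCN proposal and the acceptance rule stated above, written against the toy Phi, C0, and n from the earlier sketches; note that the acceptance probability involves only Φ, not the prior density.

```python
# Sketch: pCN-Metropolis-Hastings. The proposal N(sqrt(1-s^2) x, s^2 C0) is
# mu_0-reversible, so alpha_n(x, y) = min(1, exp(-n (Phi(y) - Phi(x)))).
import numpy as np

def pcn_chain(Phi, C0, n, s, x0, n_steps, rng):
    x = np.array(x0, dtype=float)
    phi_x = Phi(x)
    dim = x.size
    chain = np.empty((n_steps, dim))
    for m in range(n_steps):
        xi = rng.multivariate_normal(np.zeros(dim), C0)
        y = np.sqrt(1.0 - s**2) * x + s * xi          # pCN proposal
        phi_y = Phi(y)
        if np.log(rng.uniform()) < -n * (phi_y - phi_x):
            x, phi_x = y, phi_y
        chain[m] = x
    return chain

samples_pcn = pcn_chain(Phi, C0, n, s=0.2, x0=np.zeros(d), n_steps=5000,
                        rng=np.random.default_rng(2))
```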

SLIDE 27

Idea for Noise-Level Robust MH Algorithms

Inform the proposal P about the (increasing) concentration of µn by using (an approximation of) the posterior covariance for proposing (cf. [Tierney, 1994], [Haario et al., 2001], [Martin et al., 2012], ...).

Here, we use the covariance Cn of the Laplace approximation Lµn.

[Rudolf & S., 2018]: candidates for noise-level-robust RW- and pCN-variants:

  H-RW:  Pn(x) = N(x, s²Cn),
  generalized pCN (gpCN):  Pn(x) = N(A_{s,n} x, s²Cn),

where the bounded linear operator A_{s,n} ensures µ0-reversibility (cf. operator-weighted proposals [Law, 2013] and [Cui et al., 2016]).
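The H-RW variant only swaps in the Laplace covariance as proposal covariance; a sketch reusing mh_chain and the x_n, C_n computed in the earlier Laplace sketch (the gpCN operator A_{s,n} is nontrivial and omitted here).

```python
# Sketch: H-RW proposal N(x, s^2 C_n) built from the Laplace covariance C_n.
# The acceptance rule is the same density ratio as for the plain random walk.
import numpy as np

rng = np.random.default_rng(3)
s = 1.0                                         # illustrative stepsize
L_n = np.linalg.cholesky(C_n)                   # factor of the Laplace covariance
hrw_propose = lambda x, rng: x + s * (L_n @ rng.standard_normal(x.size))
samples_hrw = mh_chain(log_post_unnormalized, hrw_propose, x_n, 5000, rng)
```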

SLIDE 28

Numerical Experiment

Problem: infer the coefficient in a 1D BVP by observing the solution at 4 points.

Proposals:
  RW: P0(x) = N(x, s²C0)
  pCN: P0(x) = N(√(1 − s²) x, s²C0)
  H-RW: Pn(x) = N(x, s²Cn)
  gpCN: Pn(x) = N(A_{s,n} x, s²Cn)

Results:

[Figure: prior; posterior for n^{−1} = 10^{−2}; posterior for n^{−1} = 10^{−4}]

SLIDE 29

Numerical Experiment

Problem and proposals as above. Results:

[Figure: IACT vs. dimension d and IACT vs. precision level n for the RW, pCN, H-RW, and gpCN proposals]

SLIDE 31

Noise-Level Robustness of MH Algorithms

Given µn-invariant Markov chains (Xm)_{m∈ℕ}, we can study whether

  lim_{n→∞} Σ_{m=0}^{∞} Corr(f(X₁), f(X_{1+m})) < ∞,  f ∈ L²_{µ0}(ℝ).

To start, we consider simpler efficiency indicators:

  Mean acceptance rate: E[αn(Xm, Ym)]
  Lag-1 autocorrelation: Corr(a⊤Xm, a⊤X_{m+1}),  a ∈ ℝ^d

Noise-level robust efficiency is defined as

  lim_{n→∞} E[αn(Xm, Ym)] > 0,  lim_{n→∞} Corr(a⊤Xm, a⊤X_{m+1}) < 1.

SLIDE 33

Noise-Level Robustness of MH Algorithms cont’d

[S., 2017]: For Gaussian posteriors µn = Lµn = N(xn, Cn), the proposals

  Pn(x) = N(x, s²Cn),  Pn(x) = N(A_{s,n} x, s²Cn)

yield

  lim_{n→∞} E[αn(Xm, Ym)] > 0,  lim_{n→∞} Corr(a⊤Xm, a⊤X_{m+1}) < 1.   (1)

Convergence of the Laplace approximation lifts this to the non-Gaussian case:

Theorem ([Rudolf, S., 2019]). Given dH(µn, Lµn) → 0, we have for the H-RW and gpCN proposals Pn(u) = N(u, s²Cn) and Pn(u) = N(A_{s,n} u, s²Cn) that (1) holds.

SLIDE 34

Numerical Experiment for Increasing Concentration

Linear forward map G (convolution operator) applied to the unknown function.
Gaussian prior and noise ε ∼ N(0, n^{−1} I₄) yield a Gaussian posterior.
Examine the mean acceptance rate vs. proposal stepsize s:

[Figure: mean acceptance rate vs. stepsize s for P(u) = N(u, s²C0) (RW) and P(u) = N(√(1 − s²) u, s²C0) (pCN), for n = 10, 40, 200, 1000]

SLIDE 35

Numerical Experiment for Increasing Concentration

Same setting (linear forward map, Gaussian posterior), now with the Laplace-based proposals:

[Figure: mean acceptance rate vs. stepsize s for P(u) = N(u, s²Cn) (H-RW) and P(u) = N(A_{s,n} u, s²Cn) (gpCN), for n = 10, 40, 200, 1000]

SLIDE 36

Numerical Experiment for Increasing Concentration

Nonlinear forward map G (exp ∘ convolution operator).
Gaussian prior and noise ε ∼ N(0, n^{−1} I₄) yield a non-Gaussian posterior.
Examine the mean acceptance rate vs. proposal stepsize s:

[Figure: mean acceptance rate vs. stepsize s for P(u) = N(u, s²Cn) (H-RW) and P(u) = N(A_{s,n} u, s²Cn) (gpCN), for n = 10, 40, 200, 1000]

SLIDE 37

Next

1. Laplace Approximation
2. Markov Chain Monte Carlo
3. Importance Sampling
4. Quasi-Monte Carlo

SLIDE 40

Self-Normalizing Importance Sampling

Given an importance distribution ν and i.i.d. samples Xm ∼ ν, m = 1, ..., M, use

  Eµn[f] = ∫_{ℝ^d} f e^{−nΦ} dµ0 / ∫_{ℝ^d} e^{−nΦ} dµ0 ≈ Σ_{m=1}^{M} wn(Xm) f(Xm) / Σ_{m=1}^{M} wn(Xm),  wn ∝ dµn/dν.

The SLLN yields

  Σ_{m=1}^{M} wn(Xm) f(Xm) / Σ_{m=1}^{M} wn(Xm) → Eµn[f]  almost surely as M → ∞,

and given that

  Vµn,ν(f) := E_ν[ (dµn/dν)² (f − Eµn[f])² ] < ∞,

a CLT holds with asymptotic variance Vµn,ν(f) as M → ∞.

How does Vµn,ν(f) behave as n → ∞ for suitable ν?
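A compact sketch of the self-normalized estimator; here `log_weight` only has to return log dµn/dν up to an additive constant, since constants cancel in the ratio. Names and interfaces are illustrative.

```python
# Sketch: self-normalized importance sampling estimate of E_{mu_n}[f].
# `sample_nu(rng)` draws from the importance distribution nu and `log_weight(x)`
# returns log d(mu_n)/d(nu) up to an additive constant (cancels in the ratio).
import numpy as np

def snis_estimate(f, sample_nu, log_weight, M, rng):
    X = np.array([sample_nu(rng) for _ in range(M)])
    logw = np.array([log_weight(x) for x in X])
    w = np.exp(logw - logw.max())              # stabilize before exponentiating
    fX = np.array([f(x) for x in X])
    return (w * fX).sum() / w.sum()
```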

SLIDE 42

Prior Importance Sampling

Choose the prior measure as importance distribution, ν = µ0, i.e.,

  Xm ∼ µ0 i.i.d.,  wn(x) = exp(−nΦ(x)).

The asymptotic variance of prior importance sampling is

  Vµn,µ0(f) = (1/Zn²) ∫_{ℝ^d} (f − Eµn[f])² e^{−2nΦ} dµ0,  Zn = ∫_{ℝ^d} e^{−nΦ} dµ0.

Theorem ([Schillings, S., Wacker, 2019]). Given dH(µn, Lµn) → 0, sufficiently smooth Φ, and f ∈ L¹_{µ0}(ℝ) with ∇f(x⋆) ≠ 0, then

  Vµn,µ0(f) ∼ n^{d/2−1},  n → ∞.

⇒ Prior importance sampling becomes less efficient as the posterior concentrates.

SLIDE 45

Laplace-Based Importance Sampling

Choose ν = Lµn, i.e.,

  Xm ∼ N(xn, Cn),  wn(x) = exp(−n [Φ(x) − T₂Φ(x; xn)]),

where T₂Φ(·; xn) denotes the second-order Taylor polynomial of Φ at the MAP point xn.

Applied, e.g., for fast Bayesian optimal experimental design [Beck et al., 2018].

Existence of Vµn,Lµn(f) is not ensured and requires at least quadratic growth of Φ.

Theorem ([Schillings, S., Wacker, 2019]). Given dH(µn, Lµn) → 0, sufficiently smooth Φ, and f ∈ L²_{µ0}(ℝ), we have

  | Σ_{m=1}^{M} wn(Xm) f(Xm) / Σ_{m=1}^{M} wn(Xm) − Eµn[f] | ∈ o_P(n^{−δ}),  δ < 1/2.

⇒ Laplace-based importance sampling becomes more efficient as n → ∞.
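A sketch of Laplace-based importance sampling with the weights stated on the slide, building T₂Φ from finite-difference derivatives of Φ at xn; it reuses Phi, x_n, C_n, hessian_fd, and snis_estimate from the earlier sketches and is illustrative rather than the authors' code.

```python
# Sketch: Laplace-based importance sampling with nu = N(x_n, C_n) and weights
# w_n(x) = exp(-n [Phi(x) - T2Phi(x; x_n)]), where T2Phi is the second-order
# Taylor polynomial of Phi at x_n (derivatives via finite differences).
import numpy as np

def grad_fd(f, x, eps=1e-6):
    return np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                     for e in np.eye(x.size)])

g_n = grad_fd(Phi, x_n)
H_n = hessian_fd(Phi, x_n)

def T2Phi(x):
    dx = x - x_n
    return Phi(x_n) + g_n @ dx + 0.5 * dx @ H_n @ dx

log_weight = lambda x: -n * (Phi(x) - T2Phi(x))
sample_laplace = lambda rng: rng.multivariate_normal(x_n, C_n)
est = snis_estimate(lambda x: x[0], sample_laplace, log_weight,
                    M=10_000, rng=np.random.default_rng(4))
```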

SLIDE 46

Simple Example

Prior: µ0 = U([−½, ½]^d), noise: ε ∼ N(0, n^{−1} I_d), forward map: G = (G₁, ..., G_d) with

  G₁(x) = exp(x₁/5),  G₂(x) = x₂ − x₁²,  G₃(x) = x₃,  G₄(x) = 2x₄ + x₁².

[Figure: 256 prior samples vs. 256 Laplace-based samples in the (x₁, x₂)-plane]

SLIDE 47

Simple Example

Same setup. Prior importance sampling (M = 10⁵):

[Figure: empirical RMSE vs. n; observed rates r = −0.25 (d = 1), 0.05 (d = 2), 0.30 (d = 3), 0.53 (d = 4)]

SLIDE 48

Simple Example

Same setup. Laplace-based importance sampling (M = 10⁵):

[Figure: empirical RMSE vs. n; observed rates r = −0.57 (d = 1), −0.54 (d = 2), −0.52 (d = 3), −0.50 (d = 4)]

SLIDE 49

Next

1. Laplace Approximation
2. Markov Chain Monte Carlo
3. Importance Sampling
4. Quasi-Monte Carlo

SLIDE 51

Quasi-Monte Carlo Integration

For a uniform prior µ0 = U([−½, ½]^d), approximate integrals

  ∫_{[−½,½]^d} f e^{−nΦ} dµ0 ≈ (1/M) Σ_{m=1}^{M} e^{−nΦ(Xm)} f(Xm)

using randomly shifted lattice rules [Sloan, Kuo, Joe, 2002], where

  Xm = frac( m z / M + ∆ ) − ½,  z ∈ {1, ..., M − 1}^d,  ∆ ∼ U([−½, ½]^d).

Problem: for increasing n → ∞, the usual bound for the mean squared error

  E[ | Zn − (1/M) Σ_{m=1}^{M} e^{−nΦ(Xm)} |² ]

behaves like n^{d/2} [Schillings, S., Wacker, 2019].
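A sketch of a randomly shifted rank-1 lattice rule on [−½, ½]^d as written above; the generating vector z is an arbitrary illustrative choice (in practice it would be chosen by a component-by-component construction), and the toy integrand only demonstrates the interface.

```python
# Sketch: randomly shifted rank-1 lattice rule on [-1/2, 1/2]^d, as on the slide.
# The generating vector z is an arbitrary illustrative choice.
import numpy as np

def shifted_lattice(M, dim, z, rng):
    """Return M points X_m = frac(m z / M + Delta) - 1/2 in [-1/2, 1/2]^dim."""
    shift = rng.uniform(size=dim)                          # random shift (mod 1)
    m = np.arange(1, M + 1)[:, None]
    return np.mod(m * z[None, :] / M + shift, 1.0) - 0.5

def qmc_estimate(h, M, dim, z, rng):
    """Approximate the integral of h over [-1/2, 1/2]^dim by the lattice rule."""
    X = shifted_lattice(M, dim, z, rng)
    return np.mean([h(x) for x in X])

rng = np.random.default_rng(5)
z = np.array([1, 182667, 469891, 498753])          # illustrative generating vector
Phi_toy = lambda x: 0.5 * np.sum((x - 0.1)**2)     # illustrative potential on the cube
Z_est = qmc_estimate(lambda x: np.exp(-100 * Phi_toy(x)), M=2**10, dim=4, z=z, rng=rng)
```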

SLIDE 52

Laplace-based Quasi-Monte Carlo

Apply the Laplace-based transform

  Tn(x) := xn + τ Cn^{1/2} x,  x ∈ [−½, ½]^d,

to move the lattice points Xm to where µn concentrates (τ ensuring Tn(x) ∈ [−½, ½]^d).

[Figure: 256 shifted lattice points and the transformed points in the (x₁, x₂)-plane]

SLIDE 53

Laplace-based Quasi-Monte Carlo

Lemma ([Schillings, S., Wacker, 2019]). Given dH(Lµn, µn) → 0 and sufficiently smooth Φ, we obtain for the transformed shifted lattice rule

  (1/Zn²) E[ | Zn − ( det(τ Cn^{1/2}) / M ) Σ_{m=1}^{M} e^{−nΦ(Tn(Xm))} |² ] ≤ C(τ, M) ∈ O(n⁰).

⇒ Bounded relative error for computing the decaying normalization constant Zn → 0.
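A sketch of the transform Tn and the transformed lattice-rule estimator of Zn from the lemma; x_n, C_n, Phi, n, shifted_lattice, z, and rng come from the earlier sketches, and the rule for choosing τ below is a crude illustrative clipping rather than the authors' choice.

```python
# Sketch: Laplace-based transform T_n(x) = x_n + tau * C_n^{1/2} x applied to the
# lattice points, and the transformed estimator of Z_n from the lemma.
import numpy as np

L_chol = np.linalg.cholesky(C_n)        # Cholesky factor used as a square root of C_n

def choose_tau(x_n, L_chol, half_width=0.5):
    # crude bound so that x_n + tau * L_chol @ x stays in the cube for x in the cube
    # (assumes x_n itself lies inside the cube); purely illustrative
    reach = 0.5 * np.abs(L_chol).sum(axis=1)
    margin = half_width - np.abs(x_n)
    return min(1.0, float(np.min(margin / reach)))

tau = choose_tau(x_n, L_chol)
T_n = lambda x: x_n + tau * (L_chol @ x)

X = shifted_lattice(2**10, x_n.size, z[:x_n.size], rng)
det_factor = np.prod(np.diag(L_chol)) * tau**x_n.size       # det(tau * C_n^{1/2})
Z_n_transformed = det_factor * np.mean([np.exp(-n * Phi(T_n(x))) for x in X])
```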

SLIDE 54

Simple Example cont’d

Prior: µ0 = U([−½, ½]^d), noise: ε ∼ N(0, n^{−1} I_d), forward map: G = (G₁, ..., G_d) with

  G₁(x) = exp(x₁/5),  G₂(x) = x₂ − x₁²,  G₃(x) = x₃,  G₄(x) = 2x₄ + x₁².

Relative errors for: Zn,  Z′n = ∫_{[−½,½]^d} f e^{−nΦ} dµ0,  Eµn[f] = Z′n / Zn.

SLIDE 55

Simple Example cont’d

Same setup. Prior QMC:

[Figure: empirical RMSE vs. n. For Z′n: rates r = −4.42 (d = 1), 0.90 (d = 2), 1.45 (d = 3), 0.97 (d = 4). For Zn: r = −4.36, 1.11, 1.58, 0.99. For Z′n/Zn: r = −4.47, 0.71, 0.83, 0.12.]

SLIDE 56

Simple Example cont’d

Same setup. Laplace-based QMC:

[Figure: empirical RMSE vs. n. For Z′n: rates r = −0.19 (d = 1), −0.33 (d = 2), −0.46 (d = 3), −0.53 (d = 4). For Zn: r = 0.03, −0.76, −0.68, −0.65. For Z′n/Zn: r = −1.20, −2.25, −2.23, −2.22.]

SLIDE 57

Example: Lognormal Elliptic PDE

Computing the posterior mean of the log coefficient given noisy data with ε ∼ N(0, (1/n) I_d):

[Figure: RMSE vs. number of samples for plain QMC and for transformed (Laplace-based) QMC, for n = 1, 10¹, ..., 10⁸]

SLIDE 58

Example: Lognormal Elliptic PDE

Computing the posterior mean of the log coefficient given noisy data with ε ∼ N(0, (1/n) I_d):

[Figure: RMSE vs. number of samples for the IS/MC ratio estimator and for transformed (Laplace-based) IS, for n = 1, 10¹, ..., 10⁸]

SLIDE 59

Summary

  Bayesian inference with informative data requires noise-level robust sampling.
  Prior-based sampling methods suffer from decreasing observational noise.
  Robust sampling methods are obtainable by using the Laplace approximation.
  First theoretical results on noise-level robustness of importance sampling, MCMC, and QMC.

Some open issues:

  Spectral-gap robustness for Laplace-based MCMC
  Convergence of the Laplace approximation and sampling analysis in Hilbert spaces
  Beyond Laplace: what to do if the posterior concentrates along nonlinear manifolds?

SLIDE 60

References

  A. Alexanderian, N. Petra, G. Stadler, O. Ghattas. A fast and scalable method for A-optimal design of experiments for infinite-dimensional Bayesian nonlinear inverse problems. SIAM J. Sci. Comput., 38(1):A243–A272 (2016).

  A. Beskos, G. O. Roberts, A. Thiery, N. Pillai. Asymptotic analysis of the random-walk Metropolis algorithm on ridged densities. Ann. Appl. Probab., 28(5):2966–3001 (2018).

  Q. Long, M. Scavino, R. Tempone, S. Wang. Fast estimation of expected information gains for Bayesian experimental designs based on Laplace approximations. Computer Methods in Applied Mechanics and Engineering, 259:24–39 (2013).

  P. Chen, U. Villa, O. Ghattas. Hessian-based adaptive sparse quadrature for infinite-dimensional Bayesian inverse problems. Computer Methods in Applied Mechanics and Engineering, 327:147–172 (2017).

  D. Rudolf, B. Sprungk. On a generalization of the preconditioned Crank–Nicolson Metropolis algorithm. Found. Comput. Math., 18(2):309–343 (2018).

  C. Schillings, Ch. Schwab. Scaling limits in computational Bayesian inversion. ESAIM M2AN, 50(6):1825–1856 (2016).

  C. Schillings, B. Sprungk, P. Wacker. On the convergence of the Laplace approximation and noise-level-robustness of Laplace-based Monte Carlo methods for Bayesian inverse problems. arXiv:1901.03958 (2019).