Transport & Multilevel Approaches for Large-Scale PDE-Constrained Bayesian Inference - PowerPoint PPT Presentation



SLIDE 1

Transport & Multilevel Approaches for Large-Scale PDE-Constrained Bayesian Inference

Robert Scheichl

Institute of Applied Mathematics & Interdisciplinary Center for Scientific Computing

Heidelberg University

Collaborators:

K Anaya-Izquierdo & S Dolgov (Bath); C Fox (Otago); T Dodwell (Exeter); AL Teckentrup (Edinburgh); T Cui (Monash); G Detommaso (Amazon)

“Computational Statistics and Data-Driven Models”

ICERM, Brown University, March 23, 2020


SLIDE 3

Inverse Problems

Data y, parameter x:

y = F(x) + e

with forward model (PDE) F and observation/model errors e.

y ∈ R^Ny, x ∈ X, F : X → R^Ny. Data y are limited in number, noisy, and indirect. Parameter x is often a function (discretisation needed); continuous, bounded, and sufficiently smooth.

R. Scheichl (Heidelberg), PDE-Constrained Bayesian Inference, ICERM, 23/03/20


SLIDE 6

Bayesian interpretation

The (physical) model gives π(y|x), the conditional probability of observing y given x. However, to predict, control, optimise or quantify uncertainty, the interest is often really in π(x|y), the conditional probability of possible causes x given the observed data y – the inverse problem:

πpos(x) := π(x|y) ∝ π(y|x) πpr(x)   (Bayes’ rule)

Extract information from πpos (means, covariances, event probabilities, predictions) by evaluating posterior expectations:

Eπpos[h(x)] = ∫ h(x) πpos(x) dx


SLIDE 10

Bayes’ Rule and Classical Inversion

Classically [Hadamard, 1923]: the inverse map “F⁻¹” (y → x) is typically ill-posed, i.e. it lacks (a) existence, (b) uniqueness or (c) boundedness.

The classical least-squares solution x̂ is the maximum likelihood estimate; the prior distribution πpr “acts” as a regulariser – well-posedness! The regularised least-squares solution is the maximum a posteriori (MAP) estimate.

However, in the Bayesian setting, the full posterior πpos contains more information than the MAP estimator alone, e.g. the posterior covariance matrix reveals components of x that are (relatively) more or less certain.

Challenges: high dimension, expensive likelihood & the (inaccessible) normalising constant

π(y) := ∫ π(y|x) πpr(x) dx

Require a sample-based approach to break the “Curse of Dimensionality”.


SLIDE 12

Traditional Work Horse: Markov Chain Monte Carlo

ALGORITHM 1 (Metropolis-Hastings Markov Chain Monte Carlo)
Choose initial state x0 ∈ X.
At state xn, generate proposal x′ ∈ X from distribution q(x′|xn), e.g. via a random walk: x′ ∼ N(xn, ε²I).
Accept x′ as a sample with probability

α(x′|xn) = min { 1, [π(x′|y) q(xn|x′)] / [π(xn|y) q(x′|xn)] }

i.e. xn+1 = x′ with probability α(x′|xn); otherwise xn+1 = xn.

The samples h(xn) of some output function (“statistic”) h(·) can be used for inference as usual – even though not i.i.d.:

Eπ(x|y)[h(x)] ≈ (1/N) Σ_{n=1}^N h(xn) =: ĥ_MetH
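As a concrete illustration of Algorithm 1, here is a minimal random-walk Metropolis-Hastings sampler (a sketch, not code from the talk; the 2D Gaussian target, the step size and the chain length are made-up illustration values):

```python
import numpy as np

def metropolis_hastings(log_post, x0, n_samples, eps=0.5, seed=0):
    """Random-walk Metropolis-Hastings: proposals x' ~ N(x_n, eps^2 I).

    The proposal is symmetric, so q(x_n|x')/q(x'|x_n) = 1 and the
    acceptance probability reduces to min(1, pi(x'|y)/pi(x_n|y))."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    samples = [x.copy()]
    for _ in range(n_samples):
        x_prop = x + eps * rng.standard_normal(x.shape)
        # Work with log-densities; the normalising constant pi(y) cancels.
        if np.log(rng.uniform()) < log_post(x_prop) - log_post(x):
            x = x_prop
        samples.append(x.copy())
    return np.array(samples)

# Toy target: standard 2D Gaussian posterior (for illustration only).
log_post = lambda x: -0.5 * np.sum(x**2)
chain = metropolis_hastings(log_post, x0=[3.0, 3.0], n_samples=20000)
print(chain[5000:].mean(axis=0))  # sample mean approaches (0, 0) after burn-in
```

Note that only ratios of posterior densities enter the accept/reject step, which is exactly why the inaccessible normalising constant from the previous slides is not needed.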


SLIDE 15

Slow Convergence of Random Walk Metropolis-Hastings

But sampling with Metropolis-Hastings can be very inefficient ...

(due to burn-in, small step size and large number of rejections)

[Figure: four panels of random-walk Metropolis-Hastings sample paths on a 2D target]

... not like this ...

[Figure: four further panels of sample paths on the same 2D target]

... on top of the slow Monte Carlo convergence rate of O(N^{-1/2})!
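The O(N^{-1/2}) rate itself is easy to verify numerically, even with plain i.i.d. Monte Carlo (a self-contained sketch; the statistic h(x) = x², the N(0,1) target and the repetition counts are chosen purely for illustration):

```python
import numpy as np

def mc_rmse(N, reps=200, seed=1):
    """RMSE of the plain Monte Carlo estimator of E[x^2] = 1, x ~ N(0,1)."""
    rng = np.random.default_rng(seed)
    estimates = (rng.standard_normal((reps, N)) ** 2).mean(axis=1)
    return np.sqrt(np.mean((estimates - 1.0) ** 2))

# Theory: RMSE = sqrt(Var[x^2]/N) = sqrt(2/N), i.e. O(N^{-1/2}):
# every 16x increase in N only buys a 4x smaller error.
for N in [100, 1600, 25600]:
    print(N, mc_rmse(N))
```

MCMC inherits this rate, with the additional penalty that correlated samples reduce the effective sample size further.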


SLIDE 19

Variational Bayes (as opposed to Metropolis-Hastings MCMC)

Aim to characterise the posterior distribution (density πpos) analytically (at least approximately) for more efficient inference. This is a challenging task since:

x ∈ R^d is typically high-dimensional (e.g., a discretised function)
πpos is in general non-Gaussian (even if πpr and the observational noise are Gaussian)
evaluations of the likelihood may be expensive (e.g., solution of a PDE)

Key Tools – a playground for a numerical analyst!

Transport Maps, Optimisation, Principal Component Analysis, Model Order Reduction, Hierarchies, Sparsity, Low Rank Approximation


SLIDE 22

Deterministic Couplings of Probability Measures

Core idea [Moselhy, Marzouk, 2012]

Choose a reference distribution η (e.g., standard Gaussian).
Seek an invertible transport map T : R^d → R^d such that T♯η = π (push-forward).

[Figure: map T transporting the reference η to the target π]

In principle, this enables exact (independent, unweighted) sampling!
Approximately satisfying these conditions is still useful: preconditioning!


SLIDE 27

Variational Inference

Goal: Sampling from target density π(x). Given a reference density η, find an invertible map T̂ such that

T̂ := argmin_T DKL(T♯η ‖ π) = argmin_T DKL(η ‖ T⁻¹♯π)

where

DKL(p ‖ q) := ∫ log( p(x)/q(x) ) p(x) dx   ... Kullback-Leibler divergence
T♯p(x) := p(T⁻¹(x)) |det ∇x T⁻¹(x)|   ... push-forward of p

Advantage of using DKL: the normalising constant for π is not needed.

Minimise over some suitable class T of maps T (where ideally the Jacobian determinant det ∇x T⁻¹(x) is easy to evaluate).

To improve: enrich the class T, or use samples of T⁻¹♯π as proposals for MCMC or in importance sampling (see below).
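The push-forward formula T♯p(x) = p(T⁻¹(x)) |det ∇x T⁻¹(x)| can be sanity-checked numerically. This sketch (with a made-up affine map, not an example from the talk) verifies that pushing N(0,1) through T(u) = μ + s·u yields exactly the N(μ, s²) density:

```python
import numpy as np

# Reference density eta = N(0,1):
eta = lambda u: np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

mu, s = 1.0, 2.0                        # illustrative affine map T(u) = mu + s*u
T_inv = lambda x: (x - mu) / s          # inverse map
pushforward = lambda x: eta(T_inv(x)) / s  # |d/dx T^{-1}(x)| = 1/s

# The push-forward must coincide with the N(mu, s^2) density:
x = np.linspace(-6, 8, 201)
gauss = np.exp(-0.5 * ((x - mu) / s) ** 2) / (s * np.sqrt(2 * np.pi))
print(np.max(np.abs(pushforward(x) - gauss)))  # essentially zero
```

For richer map families the same formula is what makes the density of T♯η, and hence the KL objective, computable.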


SLIDE 30

Many Choices (“Architectures”) for T possible

Examples: (list not comprehensive!!)

1. Optimal Transport or Knothe-Rosenblatt Rearrangement
   [Moselhy, Marzouk, 2012], [Marzouk, Moselhy, Parno, Spantini, 2016]

2. Normalizing or Autoregressive Flows [Rezende, Mohamed, 2015]
   (and related methods in the ML literature)

3. Kernel-based variational inference: Stein Variational Methods
   [Liu, Wang, 2016], [Detommaso, Cui, Spantini, Marzouk, RS, 2018], [Chen, Wu, Chen, O’Leary-Roseberry, Ghattas, 2019] – not today!

4. Layers of low-rank maps [Bigoni, Zahm, Spantini, Marzouk, arXiv 2019]

5. Layers of hierarchical invertible neural networks (HINT)
   [Detommaso, Kruse, Ardizzone, Rother, Köthe, RS, arXiv:1905.10687] – not today!

6. Low-rank tensor approximation of the Knothe-Rosenblatt rearrangement
   [Dolgov, Anaya-Izquierdo, Fox, RS, 2019]

SLIDE 31

Approximation and Sampling of Multivariate Probability Distributions in the Tensor Train Decomposition

[Dolgov, Anaya-Izquierdo, Fox, RS, 2019]


SLIDE 33

Variational Inference with Triangular Maps

In general, in Variational Inference we aim to find argmin_T DKL(T♯η ‖ π). Note:

DKL(T♯η ‖ π) = −E_{u∼η}[ log π(T(u)) + log |det ∇T(u)| ] + const

A particularly useful family T are the Knothe-Rosenblatt triangular rearrangements (see [Marzouk, Moselhy, Parno, Spantini, 2016]):

T(x) = ( T1(x1), T2(x1, x2), ..., Td(x1, x2, ..., xd) )

(= autoregressive flow in ML)

Then: log |det ∇T(u)| = Σ_k log ∂xk Tk
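The log-determinant identity can be checked numerically: the Jacobian of a triangular map is triangular, so its determinant is the product of the diagonal partials ∂xk Tk. A sketch with a made-up monotone triangular map in 2D (purely illustrative, not from the talk):

```python
import numpy as np

# Toy triangular map: T1(x1) = x1 + x1^3/3, T2(x1, x2) = sin(x1) + 2*x2.
# Both diagonal partials are positive, so the map is invertible.
def T(x):
    return np.array([x[0] + x[0] ** 3 / 3, np.sin(x[0]) + 2 * x[1]])

def logdet_diagonal(x):
    # sum_k log dTk/dxk: dT1/dx1 = 1 + x1^2, dT2/dx2 = 2
    return np.log(1 + x[0] ** 2) + np.log(2.0)

x = np.array([0.7, -1.2])
# Full Jacobian by central finite differences, for comparison:
h = 1e-6
J = np.column_stack([(T(x + h * e) - T(x - h * e)) / (2 * h)
                     for e in np.eye(2)])
print(np.log(abs(np.linalg.det(J))), logdet_diagonal(x))  # the two agree
```

This cheap diagonal formula is what makes the KL objective above tractable for triangular/autoregressive architectures.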


SLIDE 38

Knothe-Rosenblatt via Conditional Distribution Sampling

In fact, there exists a unique triangular map satisfying T♯η = π (for absolutely continuous η, π on R^d).

Conditional Distribution Sampling [Rosenblatt ’52] (explicitly available!)

Any density factorises into a product of conditional densities:

π(x1, ..., xd) = π1(x1) π2(x2|x1) ··· πd(xd|x1, ..., xd−1)

1st step: produce sample x1^i via 1D CDF-inversion from

π1(x1) ∝ ∫ π(x1, x2, ..., xd) dx2 ··· dxd

k-th step: given x1^i, ..., x(k−1)^i, sample xk^i via 1D CDF-inversion from

πk(xk | x1^i, ..., x(k−1)^i) ∝ ∫ π(x1^i, ..., x(k−1)^i, xk, xk+1, ..., xd) dxk+1 ··· dxd

Problem: (d − k)-dimensional integration at the k-th step!
Remedy: find an approximation π̃ ≈ π for which integration is cheap!
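The conditional distribution sampler above can be sketched for d = 2 with the density tabulated on a grid: first draw x1 from the marginal π1 by 1D CDF inversion, then draw x2 from the conditional π2(·|x1). The correlated Gaussian target and the grid sizes are illustrative choices, not from the talk:

```python
import numpy as np

n = 400
grid = np.linspace(-5, 5, n)
X1, X2 = np.meshgrid(grid, grid, indexing="ij")
rho = 0.8
# Toy target: bivariate Gaussian with unit variances and correlation rho.
pi = np.exp(-(X1**2 - 2 * rho * X1 * X2 + X2**2) / (2 * (1 - rho**2)))

def sample_cdf_inv(density_1d, u):
    """Invert the CDF of a tabulated 1D density (piecewise constant)."""
    cdf = np.cumsum(density_1d)
    cdf /= cdf[-1]
    return grid[np.searchsorted(cdf, u)]

rng = np.random.default_rng(0)
marg1 = pi.sum(axis=1)                 # pi_1(x1), up to normalisation
samples = []
for u1, u2 in rng.uniform(size=(5000, 2)):
    x1 = sample_cdf_inv(marg1, u1)     # step 1: marginal of x1
    i1 = np.searchsorted(grid, x1)     # conditional slice at this x1
    x2 = sample_cdf_inv(pi[i1, :], u2) # step 2: conditional of x2 | x1
    samples.append((x1, x2))
samples = np.array(samples)
print(np.corrcoef(samples.T)[0, 1])    # close to rho = 0.8
```

For d = 2 the marginalisation is a single sum; the point of the slide is that for general d the k-th step needs a (d − k)-dimensional integral, which is what the low-rank approximation removes.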

SLIDE 39

Low-rank Tensor Approximation of Distributions

Low-rank tensor decomposition ⇔ separation of variables:

Tensor grid with n points per direction (or n polynomial basis functions): the full tensor has O(n^d) entries, the decomposition only O(dnr).

Approximate:

π(x1, ..., xd) ≈ Σ_{|α|≤r} π1_α(x1) π2_α(x2) ··· πd_α(xd)   (tensor product decomposition)

Many low-rank tensor formats exist [Kolda, Bader ’09], [Hackbusch ’12]


SLIDE 41

Conditional Distribution Sampler (with factorised distribution)

For the low-rank tensor approximation

π(x) ≈ π̃(x) = Σ_{|α|≤r} π1_α(x1) · π2_α(x2) ··· πd_α(xd)

the k-th step of the CD sampler, given x1^i, ..., x(k−1)^i, simplifies to

π̃k(xk | x1^i, ..., x(k−1)^i) ∝ Σ_{|α|≤r} π1_α(x1^i) ··· π(k−1)_α(x(k−1)^i) · πk_α(xk) · ( ∫ π(k+1)_α(xk+1) dxk+1 ) ··· ( ∫ πd_α(xd) dxd )

Repeated 1D integrals! – linear in d

To sample (in each step): simple 1D CDF-inversions – linear in d
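The “repeated 1D integrals” point can be demonstrated directly: for a separable (rank-r) density, marginalising out trailing variables needs only 1D quadratures of the individual factors, never a high-dimensional integral. A sketch with made-up rank-2 factors in d = 3 (illustration only):

```python
import numpy as np

grid = np.linspace(-3, 3, 120)
dx = grid[1] - grid[0]
# Toy rank-2 factors f_a, g_a, h_a of pi~(x) = sum_a f_a(x1) g_a(x2) h_a(x3):
f = np.stack([np.exp(-grid**2), np.exp(-(grid - 1) ** 2)])      # shape (2, n)
g = np.stack([np.exp(-2 * grid**2), np.exp(-(grid + 1) ** 2)])
h = np.stack([np.exp(-grid**2 / 2), np.exp(-grid**2)])

# Marginal pi_1(x1) = sum_a f_a(x1) * (int g_a) * (int h_a): 1D integrals only.
weights = g.sum(axis=1) * dx * h.sum(axis=1) * dx               # one per alpha
marg1_sep = (f * weights[:, None]).sum(axis=0)

# Brute force for comparison: assemble the full 120^3 tensor and sum it out.
full = np.einsum("ai,aj,ak->ijk", f, g, h)
marg1_full = full.sum(axis=(1, 2)) * dx * dx
print(np.max(np.abs(marg1_sep - marg1_full)))   # agree up to rounding
```

The separable route touches O(dnr) numbers, while the brute-force marginal touches all n^d entries; exactly the gap the slide is pointing at.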


SLIDE 43

Low-rank Decomposition (Two Variables)

Collect discretised values of π(θ1, θ2) on an n × n grid into a matrix:

P(i, j) = Σ_{α=1}^r Vα(i) Wα(j) + O(ε)

Rank r ≪ n (exploiting structure, smoothness, ...)
mem(V) + mem(W) = 2nr ≪ n² = mem(P)
SVD provides the optimal ε for fixed r (s.t. min_{V,W} ‖P − VW*‖²_F)

But it requires all n² entries of P! (n^d in d dimensions!)
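A quick numerical sketch of this slide: discretise a smooth 2D density on an n × n grid and compress it by truncated SVD. The particular density and grid size are made up for illustration; the point is the rapid decay of the error in the rank r:

```python
import numpy as np

n = 256
t = np.linspace(-4, 4, n)
T1, T2 = np.meshgrid(t, t, indexing="ij")
# Toy non-separable density values (the cross term forces rank > 1):
P = np.exp(-(T1**2 + 0.5 * T1 * T2 + T2**2))

U, s, Vt = np.linalg.svd(P)
for r in [2, 5, 10]:
    Pr = (U[:, :r] * s[:r]) @ Vt[:r]           # best rank-r approximation
    err = np.linalg.norm(P - Pr) / np.linalg.norm(P)
    print(r, err)                               # fast decay with r
# Storage for the factors: 2*n*r numbers instead of n^2 for P itself.
```

The catch stated on the slide is visible in the code: the SVD needed the full matrix P, all n² entries, which is exactly what the cross algorithm on the next slide avoids.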


SLIDE 46

Cross Algorithm (construct low-rank factorisation from few entries)

Interpolation arguments show: for some suitable index sets I, J ⊂ {1, ..., n} with |I| = |J| = r, the cross decomposition

P̃ = P(:, J) P(I, J)⁻¹ P(I, :) ≈ P

Maxvol principle gives the ‘best’ indices I, J [Goreinov, Tyrtyshnikov ’01]:

|det P(I, J)| = max_{Î,Ĵ} |det P(Î, Ĵ)|  ⇒  ‖P − P̃‖_C ≤ (r+1) min_{rank P̂ = r} ‖P − P̂‖₂   (NP-hard)

But how can we find good sets I, J in practice? And how can we generalise this to d > 2?
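The cross decomposition P(:, J) P(I, J)⁻¹ P(I, :) can be tried on a synthetic, exactly rank-r matrix. Here the index sets are picked by greedy full pivoting on the residual (an ACA-style stand-in for the maxvol principle, used purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 100, 4
P = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))  # exact rank r

I, J = [], []
R = P.copy()
for _ in range(r):
    i, j = np.unravel_index(np.argmax(np.abs(R)), R.shape)  # largest pivot
    I.append(i); J.append(j)
    # Deflate: subtract the rank-1 cross through the chosen pivot.
    R = R - np.outer(R[:, j], R[i, :]) / R[i, j]

# Skeleton/cross reconstruction from r columns and r rows of P only:
P_cross = P[:, J] @ np.linalg.solve(P[np.ix_(I, J)], P[I, :])
print(np.linalg.norm(P - P_cross) / np.linalg.norm(P))  # ~ machine precision
```

Only 2nr entries of P (the chosen rows and columns) enter the final factorisation, in contrast to the n² entries the SVD needed on the previous slide.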


SLIDE 53

Alternating Iteration (for cross approximation)

[Figure: alternating sweeps over row indices i1, i2, i3 and column indices j1, j2, j3]

Practically realizable strategy (with O(2nr) samples & O(nr²) flops).

For numerical stability, use rank-revealing QR in practice.
To adapt the rank, expand V → [V  Z] (with residual Z).

Several similar algorithms exist, e.g. ACA [Bebendorf ’00] or EIM [Barrault et al ’04].


SLIDE 56

Tensor Train (TT) Decomposition (Many Variables)

Many variables: Matrix Product States / Tensor Train

π(i1, ..., id) = Σ_{αk=1,...,rk; 0<k<d} π1_{α1}(i1) · π2_{α1,α2}(i2) · π3_{α2,α3}(i3) ··· πd_{αd−1}(id)

[Wilson ’75] (comput. physics), [White ’93], [Verstraete ’04]; [Oseledets ’09]

TT blocks πk are three-dimensional rk−1 × n × rk tensors with TT ranks r1, ..., rd−1 ≤ r.

Storage: O(dnr²) – linear in d
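A minimal sketch of the TT format (toy random cores, not an actual density): each entry of the tensor is a short chain of small matrix products, and the storage is O(dnr²) instead of n^d:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, r = 6, 10, 3
ranks = [1] + [r] * (d - 1) + [1]                 # r_0 = r_d = 1
# TT cores G_k of shape r_{k-1} x n x r_k:
cores = [rng.standard_normal((ranks[k], n, ranks[k + 1])) for k in range(d)]

def tt_entry(cores, idx):
    """Entry (i1,...,id) = G1[i1] @ G2[i2] @ ... @ Gd[id]."""
    v = np.ones((1, 1))
    for G, i in zip(cores, idx):
        v = v @ G[:, i, :]                        # chain of small matrix products
    return v[0, 0]

# Compare against the fully assembled tensor (only feasible for tiny d, n):
full = cores[0][0]
for G in cores[1:]:
    full = np.tensordot(full, G, axes=([-1], [0]))
full = full[..., 0]                               # drop the trailing rank-1 axis

idx = (1, 4, 2, 0, 7, 3)
print(abs(tt_entry(cores, idx) - full[idx]))      # agree up to rounding
print(sum(c.size for c in cores), n**d)           # O(dnr^2) vs n^d numbers
```

For these toy sizes the cores hold a few hundred numbers while the full tensor has a million entries, which is the "linear in d" storage claim on the slide.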


SLIDE 63

TT Cross – An Efficient Computation of a TT Decomposition

Given random initial sets J0, ..., Jd, iterate: [Oseledets, Tyrtyshnikov ’10]

1. Update the kth TT block: πk(·) = π(Ik−1, ·, Jk)
2. Update the kth index set: Ik = pivots_row(πk)
   (using the maxvol principle on different matricisations of the tensor in each step)

[Figure: nested index sets (j1, k1), (j2, k2), (j3, k3) and i1, i2, i3 in the TT-cross sweep]

3. Set k → k + 1 and move to the next block.
4. When k = d, switch direction and update index set Jk−1.

Cost: O(dnr²) samples & O(dnr³) flops per iteration – linear in d


slide-66
SLIDE 66

Tensor Train (TT) Transport Maps (Summary & Comments)

[Dolgov, Anaya-Izquierdo, Fox, RS, 2019]

Generic – not problem specific (‘black box’)
Cross approximation: ‘sequential’ design along 1D lines
Separable product form: π̃(x1, . . . , xd) = Σ_{|α|≤r} π¹_α(x1) · · · π^d_α(xd)
Cheap construction/storage & low number of model evals (linear in d)
Cheap integration w.r.t. x (linear in d)
Cheap samples via the conditional distribution method (linear in d)
Tuneable approximation error ε (by adapting the ranks r) ⇒ cost & storage (poly)logarithmic in ε (next slide)
Many known ways to use these samples for fast inference!
(as proposals for MCMC, as control variates, importance weighting, . . . )
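The "cheap samples via the conditional distribution method" bullet can be sketched in a toy rank-2 setting: once the density is in separable form, the x1-marginal and the conditional in x2 are cheap 1D objects, and exact (inverse-CDF) sampling proceeds coordinate by coordinate. The grid, factors, and rank below are illustrative assumptions, not the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = np.linspace(-4, 4, n)

# toy separable (rank-2) density pi(x1, x2) = sum_a f_a(x1) g_a(x2) on a grid
F = np.stack([np.exp(-x**2), np.exp(-(x - 1)**2)])   # x1-factors f_a
G = np.stack([np.exp(-(x + 1)**2), np.exp(-x**2)])   # x2-factors g_a

def sample(rng):
    # 1) marginal in x1: integrate out x2 analytically thanks to separability
    m1 = F.T @ G.sum(axis=1)              # unnormalised marginal on the grid
    cdf1 = np.cumsum(m1); cdf1 /= cdf1[-1]
    i1 = np.searchsorted(cdf1, rng.uniform())
    # 2) conditional in x2 given x1 = x[i1], again a cheap 1D inverse CDF
    m2 = G.T @ F[:, i1]
    cdf2 = np.cumsum(m2); cdf2 /= cdf2[-1]
    return x[i1], x[np.searchsorted(cdf2, rng.uniform())]

pts = np.array([sample(rng) for _ in range(4000)])
print(pts.mean(axis=0))   # close to (0.5, -0.5) for this equal-weight mixture
```

In d dimensions the same sweep over a TT decomposition gives one exact sample from π̃ at O(d) cost, which is the Rosenblatt-transform sampler referred to on the slide.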


slide-67
SLIDE 67

Theoretical Result

[Rohrbach, Dolgov, Grasedyck, RS, 2020]

For Gaussian distributions π(x) we have the following result: Let π : R^d → R, x ↦ exp(−½ xᵀΣx), and for each k define the block partitioning

Σ = [ Σ^(k)_11  Γkᵀ ; Γk  Σ^(k)_22 ],   where Γk ∈ R^((d−k)×k).

Theorem. Let Σ be SPD with λmin > 0. Suppose ρ := maxk rank(Γk) and σ := maxk,i σ^(k)_i, where the σ^(k)_i are the singular values of Γk. Then, for all ε > 0, there exists a TT approximation π̃ε such that ‖π − π̃ε‖_L²(R^d) ≤ ε ‖π‖_L²(R^d), and the TT ranks of π̃ε are bounded by

r ≤ ( 1 + (7σ/λmin) · log(7ρd/ε) )^ρ

(polylogarithmic growth)

slide-70
SLIDE 70

How to use the TT-CD sampler to estimate E_π[Q]?

Problem: We are sampling from the approximation π̃ = π + O(ε).

Option 0: Classical variational inference
Explicit integration (linear in d): get the biased estimator E_π̃[Q] ≈ E_π[Q]
Non-smooth Q: use Monte Carlo sampling, or better, QMC ‘seeds’

[Figure: 2D projection of the 11D map (problem specification below!)]

slide-73
SLIDE 73

Sampling from exact π: Unbiased estimates of E_π[Q]

using the TT approximation as preconditioner, importance weight or control variate

Option 1: Use {x^i_π̃} as (i.i.d.) proposals in Metropolis–Hastings
Accept the proposal x^i_π̃ with probability

α = min( 1, π(x^i_π̃) π̃(x^{i−1}_π) / ( π(x^{i−1}_π) π̃(x^i_π̃) ) )

Can prove that the rejection rate ∼ ε and the IACT τ ∼ 1 + ε

Option 2: Use π̃ for importance weighting with QMC quadrature

E_π[Q] ≈ (1/Z) (1/N) Σ_{i=1}^N Q(x^i_π̃) π(x^i_π̃) / π̃(x^i_π̃),   with Z = (1/N) Σ_{i=1}^N π(x^i_π̃) / π̃(x^i_π̃)

We can use an unbiased (randomised) QMC rule for both integrals.

Option 3: Use the estimate w.r.t. π̃ as a control variate (multilevel MCMC)
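Options 1 and 2 can be demonstrated in one dimension with an explicit surrogate. A minimal sketch, assuming a Gaussian target π and a standard-normal surrogate π̃ in place of the TT approximation (and plain Monte Carlo seeds instead of QMC):

```python
import numpy as np

rng = np.random.default_rng(1)
target = lambda x: np.exp(-0.5 * (x - 0.3)**2)   # exact (unnormalised) pi, mean 0.3
approx = lambda x: np.exp(-0.5 * x**2)           # surrogate pi~, cheap to sample

# Option 1: i.i.d. surrogate draws as independence proposals in Metropolis-Hastings
x, chain = 0.0, []
for _ in range(20000):
    y = rng.standard_normal()                    # proposal ~ pi~
    if rng.uniform() < min(1.0, target(y) * approx(x) / (target(x) * approx(y))):
        x = y
    chain.append(x)
mh_mean = np.mean(chain)

# Option 2: self-normalised importance weighting of surrogate samples
xs = rng.standard_normal(20000)
w = target(xs) / approx(xs)
iw_mean = np.sum(w * xs) / np.sum(w)             # ratio estimator, Z in the denominator

print(mh_mean, iw_mean)   # both are consistent estimators of E_pi[x] = 0.3
```

The closer π̃ is to π, the closer the acceptance probability and the weights are to 1, which is the mechanism behind the stated rejection rate ∼ ε.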

slide-77
SLIDE 77

Numerical Example (Inverse Stationary Diffusion Problem)

Model Problem (representative for subsurface flow or structural mechanics)

−∇ · (κ(ξ, x)∇u(ξ, x)) = 0,   ξ ∈ (0, 1)²,
u|ξ1=0 = 1,  u|ξ1=1 = 0,  ∂u/∂n|ξ2=0 = 0,  ∂u/∂n|ξ2=1 = 0.

Karhunen–Loève expansion of log κ(ξ, x) = Σ_{k=1}^d φk(ξ) xk with prior: d = 11, xk ∼ U[−1, 1], ‖φk‖∞ = O(k^{−3/2}) [Eigel, Pfeffer, Schneider ’16]

Discretisation with bilinear FEs on a uniform mesh with h = 1/64.
Data: average pressure at 9 locations (synthetic, i.e. generated for some true x∗)
QoI Q = h(u(·, x)): probability that the flux exceeds 1.5 (not smooth!)
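The prior parametrisation can be sketched directly. The modes below are hypothetical stand-ins with the stated decay ‖φk‖∞ = O(k^{−3/2}) (the actual φk of Eigel, Pfeffer & Schneider differ), but the assembly of the coefficient field from x ∈ [−1, 1]^11 is as on the slide:

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 11, 65                                   # 11 KL modes; 65 nodes = 64-cell mesh
xi1, xi2 = np.meshgrid(np.linspace(0, 1, n), np.linspace(0, 1, n))

def phi(k):
    """Hypothetical smooth mode with amplitude O(k^{-3/2})."""
    return k**-1.5 * np.cos(np.pi * k * xi1) * np.cos(np.pi * k * xi2)

xk = rng.uniform(-1, 1, d)                      # prior: x_k ~ U[-1, 1]
log_kappa = sum(xk[k - 1] * phi(k) for k in range(1, d + 1))
kappa = np.exp(log_kappa)                       # diffusion coefficient kappa(xi, x)
print(kappa.shape, float(kappa.min()))          # positive by construction
```

Evaluating the posterior density at a given x then amounts to solving the PDE with this κ and comparing the 9 averaged pressures to the data.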

slide-79
SLIDE 79

Comparison against DRAM (for the inverse diffusion problem)

noise level σ²_e = 0.01   and   noise level σ²_e = 0.001

[Figure: relative error in P(F > 1.5) vs. CPU time (10¹–10⁴ s) for TT-MH, TT-qIW and DRAM at both noise levels; dashed line marks the discretisation error]

TT-MH: TT conditional-distribution samples (i.i.d.) as proposals for MCMC
TT-qIW: TT surrogate for importance sampling with QMC
DRAM: Delayed Rejection Adaptive Metropolis [Haario et al, 2006]


slide-80
SLIDE 80

Samples – Comparison TT-CD vs. DRAM

[Figure: 2D projections of posterior samples; DRAM (top row) vs. TT-MH with i.i.d. seeds (bottom row)]


slide-81
SLIDE 81

Multilevel Markov Chain Monte Carlo

[Dodwell, Ketelsen, RS, Teckentrup, 2015 & 2019], [Cui, Detommaso, RS, 2019]


slide-82
SLIDE 82

Exploiting Model Hierarchy (same inverse diffusion problem)

[Figure: hierarchy of meshes/FE spaces Vℓ and parameter spaces Xℓ, ℓ = 0, . . . , L]

Here hℓ = h0 · 2^{−ℓ} and Mℓ ≈ M0 · 2^{2ℓ}

slide-87
SLIDE 87

Monte Carlo

(assuming first that π can be sampled – forward problem)

Standard Monte Carlo estimator for E[Q] (where Q = h(u(·, x)) ∈ R):

Q̂^MC_L := (1/N) Σ_{i=1}^N Q^(i)_L,   with Q^(i)_L i.i.d. samples from Model(L)

Convergence of plain vanilla MC (mean square error):

E[ (Q̂^MC_L − E[Q])² ] =: MSE = V[Q_L]/N (sampling error) + (E[Q_L − Q])² (model error, “bias”)

Assuming |E[Qℓ − Q]| = O(2^{−αℓ}) and E[Costℓ] = O(2^{γℓ}), to get MSE = O(ε²) we need L ∼ log₂(ε^{−1})/α and N ∼ ε^{−2}.

Monte Carlo Complexity Theorem   (2D model problem with AMG: α = 1, γ = 2)

Cost(Q̂^MC_L) = O(N · M_L) = O(ε^{−2−γ/α}) to obtain MSE = O(ε²).
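The sampling-error/bias split can be seen on a toy integrand where both error terms are known. A sketch with a synthetic Q_L (the additive `bias` stands in for the discretisation error E[Q_L − Q]; no PDE is solved here):

```python
import numpy as np

rng = np.random.default_rng(3)
bias = 0.02                                     # plays the role of E[Q_L - Q]

def sample_QL(N):
    """Toy level-L samples: Q = sin(U), U ~ U(0, pi), so E[Q] = 2/pi."""
    return np.sin(rng.uniform(0, np.pi, N)) + bias

N = 100_000
QL = sample_QL(N)
Qhat = QL.mean()                                # standard MC estimator
mse_estimate = QL.var() / N + bias**2           # sampling error + model error
print(Qhat, mse_estimate)                       # Qhat near 2/pi + 0.02
```

Halving the bias needs a finer (more expensive) level; halving the sampling error needs 4x the samples. Balancing the two terms is exactly what fixes L and N in the theorem.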

slide-93
SLIDE 93

Multilevel Monte Carlo

[Heinrich, ’98], [Giles, ’07]

Basic Idea: Note that trivially (due to the linearity of E)

E[Q_L] = E[Q_0] + Σ_{ℓ=1}^L E[Qℓ − Qℓ−1]     Control Variates!!

Define the following multilevel MC estimator for E[Q]:

Q̂^MLMC_L := Q̂^MC_0 + Σ_{ℓ=1}^L Ŷ^MC_ℓ,   where Yℓ := Qℓ − Qℓ−1

Key Observation: (Variance reduction! Corrections are cheaper!)

Level L: V[Q_L − Q_{L−1}] → 0 as L → ∞ ⇒ N_L = O(1) (best case)
Level ℓ: Nℓ optimised to “balance” with the cost on levels 0 and L
Level 0: N_0 ∼ N but Cost_0 = O(M_0) = O(1)
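A minimal sketch of the telescoping estimator on a synthetic level hierarchy, assuming Qℓ → Q with O(2^{−ℓ}) weak error; the crucial point is that each correction term uses the *same* random input on both of its levels, which is what makes V[Qℓ − Qℓ−1] small:

```python
import numpy as np

rng = np.random.default_rng(4)

def Q(x, l):
    """Toy level-l approximation of Q(x) = sin(x), with O(2^{-l}) error."""
    return np.sin(x) + 2.0**-l * np.cos(3 * x)

def mlmc(L, N):
    """E[Q_L] = E[Q_0] + sum_l E[Q_l - Q_{l-1}], each term by plain MC."""
    est = Q(rng.uniform(0, np.pi, N[0]), 0).mean()
    for l in range(1, L + 1):
        x = rng.uniform(0, np.pi, N[l])          # coupled: same x on both levels
        est += (Q(x, l) - Q(x, l - 1)).mean()
    return est

# geometrically decreasing sample sizes, as the complexity theorem prescribes
est = mlmc(L=5, N=[100_000, 20_000, 5_000, 1_000, 500, 200])
print(est)   # near E[sin(U)] = 2/pi
```

Most samples live on the cheap coarse level; only a handful of expensive fine-level solves are needed for the small corrections.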

slide-96
SLIDE 96

Complexity Theorem [Giles, ’07], [Cliffe, Giles, RS, Teckentrup, ’11]

Assume approximation error O(2^{−αℓ}), cost per sample O(2^{γℓ}), and

V[Qℓ − Qℓ−1] = O(2^{−βℓ})     (strong error / variance reduction)

Then there exist L, {Nℓ}_{ℓ=0}^L to obtain MSE = O(ε²) with

Cost(Q̂^MLMC_L) = O( ε^{−2−max(0, (γ−β)/α)} )   (+ possible log-factor)

using dependent or independent estimators Q̂^MC_0 and {Ŷ^MC_ℓ}_{ℓ=1}^L.

Running example (for smooth functionals & AMG): α ≈ 1, β ≈ 2, γ ≈ 2

Cost(Q̂^MLMC_L) = O(ε^{−max(2, γ/α)}) = O(max(N_0, M_L)) ≈ O(ε^{−2})

Optimality: Asymptotic cost of one deterministic solve (to tolerance ε)!
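The theorem's sample-size choice can be evaluated numerically. A sketch of the standard allocation Nℓ ∝ sqrt(Vℓ/Cℓ) (from the underlying Lagrange-multiplier argument), plugged with the running example's rates β ≈ 2, γ ≈ 2; the constants are illustrative:

```python
import numpy as np

eps, L, beta, gamma = 1e-3, 6, 2.0, 2.0
l = np.arange(L + 1)
V = 2.0 ** (-beta * l)                       # V_l = V[Q_l - Q_{l-1}] (up to constants)
C = 2.0 ** (gamma * l)                       # cost per sample on level l
# allocation minimising total cost subject to sum(V_l / N_l) <= eps^2 / 2
N = np.ceil(2 * eps**-2 * np.sqrt(V / C) * np.sum(np.sqrt(V * C))).astype(int)
print(N)                                     # decays like 2^{-(beta+gamma)l/2} = 4^{-l}
print(int(np.sum(N * C)))                    # total cost: O(eps^{-2}) since beta = gamma
```

Because V_l C_l is constant when β = γ, every level contributes equally to the total cost, which is the borderline case of the theorem.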


slide-97
SLIDE 97

Numerical Example (Multilevel MC)

Running example with Q = ‖u‖_{L²(D)} and h0 = 1/8; lognormal diffusion coefficient with exponential covariance (σ² = 1, λ = 0.3)

slide-102
SLIDE 102

Inverse Problem: Multilevel Markov Chain Monte Carlo

Posterior distribution for the PDE model problem (Bayes):

πℓ(xℓ | y_obs) ∝ exp(−‖y_obs − Fℓ(xℓ)‖²_{Σobs}) π_prior(xℓ)

What were the key ingredients of “standard” multilevel Monte Carlo?

Telescoping sum: E[Q_L] = E[Q_0] + Σ_{ℓ=1}^L E[Qℓ − Qℓ−1]
Models on coarser levels are much cheaper to solve (M_0 ≪ M_L).
V[Qℓ − Qℓ−1] → 0 as ℓ → ∞ ⇒ far fewer samples on the finer levels.

But Important! In MCMC the target distribution πℓ depends on ℓ:

E_{π_L}[Q_L] = E_{π_0}[Q_0] (standard MCMC) + Σ_ℓ ( E_{πℓ}[Qℓ] − E_{πℓ−1}[Qℓ−1] ) (multilevel MCMC, NEW)

Q̂^MLMetH_{h,s} := (1/N_0) Σ_{n=1}^{N_0} Q_0(z^n_{0,0}) + Σ_{ℓ=1}^L (1/Nℓ) Σ_{n=1}^{Nℓ} ( Qℓ(z^n_{ℓ,ℓ}) − Qℓ−1(z^n_{ℓ,ℓ−1}) )

slide-104
SLIDE 104

Multilevel Markov Chain Monte Carlo – Algorithm

[Dodwell, Ketelsen, RS, Teckentrup, JUQ 2015 or SIREV 2019]

ALGORITHM 2 (Multilevel Metropolis–Hastings MCMC for Qℓ − Qℓ−1)
At states z^n_{ℓ,0}, . . . , z^n_{ℓ,ℓ} of ℓ + 1 Markov chains on levels 0, . . . , ℓ:

1. k = 0: Set x⁰_0 := z^n_{ℓ,0}. Generate samples x^i_0 ∼ π_0 (coarse posterior) via basic Metropolis–Hastings.

2. k > 0: Set x⁰_k := z^n_{ℓ,k}. Generate samples x^i_k ∼ π_k as follows:

(a) Propose x′_k = x^{(i+1)t_{k−1}}_{k−1}     (subsample to reduce correlation!)

(b) Accept x′_k with probability

α^ML(x′_k | x^i_k) = min( 1, π_k(x′_k) π_{k−1}(x^i_k) / ( π_k(x^i_k) π_{k−1}(x′_k) ) )     [JS Liu, 2001]

i.e. set x^{i+1}_k = x′_k with probability α^ML(x′_k | x^i_k); otherwise x^{i+1}_k = x^i_k.

(c) Set z^{n+1}_{ℓ,k} := x^{T_k}_k with T_k := Π_{j=k}^{ℓ−1} t_j.
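The two-level core of Algorithm 2 can be sketched with 1D stand-ins for the coarse and fine posteriors: the coarse chain is subsampled and its states are proposed to the fine level, accepted with the ratio from step (b). All densities and tuning constants here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
pi0 = lambda x: np.exp(-0.5 * x**2)            # coarse posterior (unnormalised)
pi1 = lambda x: np.exp(-0.5 * (x - 0.2)**2)    # fine posterior, slightly shifted

def coarse_step(x):
    """One random-walk Metropolis step targeting pi0."""
    y = x + 0.8 * rng.standard_normal()
    return y if rng.uniform() < min(1.0, pi0(y) / pi0(x)) else x

t, x0, x1, fine_chain = 5, 0.0, 0.0, []        # t: subsampling rate
for _ in range(20000):
    for _ in range(t):                         # subsample the coarse chain (step 2a)
        x0 = coarse_step(x0)
    # surrogate-transition accept/reject on the fine level (step 2b)
    if rng.uniform() < min(1.0, pi1(x0) * pi0(x1) / (pi1(x1) * pi0(x0))):
        x1 = x0
    fine_chain.append(x1)
print(np.mean(fine_chain))                     # near the fine-posterior mean 0.2
```

Because the coarse and fine posteriors are close, the fine-level acceptance probability is close to 1, which is what keeps the variance of the correction terms small.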


slide-105
SLIDE 105

Comments

Each {z^n_{ℓ,k}}_{n≥1} is a Markov chain targeting π_k, k = 0, . . . , ℓ.

In the limit of infinite subsampling rate, the chains are unbiased and the multilevel algorithm is consistent (no bias between levels).
(In practice, with a subsampling rate ≈ IACT, the bias is negligible.)

Main Theoretical Results from [Dodwell, Ketelsen, RS, Teckentrup, ’15]

E_{πℓ,πℓ}[ 1 − α^ML(·|·) ] = O(h_ℓ^{1−δ}) ∀δ > 0     (exponential covariance)
V_{πℓ,πℓ−1}[ Qℓ(z^n_{ℓ,ℓ}) − Qℓ−1(z^n_{ℓ,ℓ−1}) ] = O(h_ℓ^{1−δ}) ∀δ > 0

The algorithm is a type of surrogate transition method [Liu 2001], related also to delayed acceptance [Christen, Fox, ’05]. But crucially, it also exploits the variance reduction idea of MLMC, and the paper provides actual rates for the diffusion problem!

slide-110
SLIDE 110

More Sophisticated Proposals – Multilevel DILI

[Cui, Detommaso, RS, arXiv:1910.12431]

Original work: pCN random walk proposal [Cotter, Dashti, Stuart ’12] (no gradient/Hessian info)

Better: DILI [Cui, Law, Marzouk, ’16] (dimension-independent likelihood-informed):
samples from a preconditioned Langevin equation using a low-rank Hessian approximation (LIS) at a number of points (incl. the MAP point)

[Cui et al, ’19]: hierarchical construction of the LIS (which is significantly cheaper!) and combination of DILI with MLMCMC.

Numerical experiment: much higher dimensional and more complicated than above, using a lognormal prior.

slide-112
SLIDE 112

Numerical Comparison: IACTs & CPU Times

Refined parameters:
Level ℓ       0      1     2     3
iact(pCN)   4300    45    48    24
iact(DILI)    34    11    3.6   2.0

Corrections Qℓ(z^n_{ℓ,ℓ}) − Qℓ−1(z^n_{ℓ,ℓ−1}):
Level ℓ       0      1     2     3
iact(pCN)   4100    4.9   2.8   1.9
iact(DILI)    9.0   4.6   2.4   1.8

[Figure: CPU time in seconds (10³–10⁷) vs. relative error (2.8·10⁻³, 5.7·10⁻³, 1.27·10⁻²); single-level costs of ∼1 CPU month, ∼9 CPU days and ∼2 CPU days, with 1 CPU day marked]

slide-115
SLIDE 115

Conclusions

Large-scale PDE-constrained Bayesian inference with sparse data

Idea 1: Characterise complex/intractable distributions by constructing deterministic couplings
Variational inference: optimisation of the Kullback–Leibler divergence
(Many types: sparse, decomposable, neural nets, polynomial, kernel-based)
Alternative: low-rank tensor factorisation and conditional distribution sampling (Rosenblatt transform) [Stats & Comput, 2019]
Scales with dimension; comparable computational efficiency to NNs
Unbiased estimates via Metropolisation or importance weighting

Idea 2: Use model hierarchies – Multilevel MCMC [SINUM, 2019]
Variance reduction and much better complexities (proven!)
Better IACT on fine levels through the surrogate transition method
Further acceleration (especially on the coarsest level) by using DILI


slide-116
SLIDE 116

References

1. Dolgov, Anaya-Izquierdo, Fox, RS, Approximation and sampling of multivariate probability distributions in the tensor train decomposition, Statistics & Computing 30, 2020 [arXiv:1810.01212]
2. Dodwell, Ketelsen, RS, Teckentrup, A hierarchical multilevel Markov chain MC algorithm [. . . ], SIAM/ASA J Uncertain Quantif 3, 2015 [arXiv:1303.7343]
3. Cui, Detommaso, RS, Multilevel dimension-independent likelihood-informed MCMC for large-scale inverse problems, submitted, 2019 [arXiv:1910.12431]
4. Moselhy, Marzouk, Bayesian inference with optimal maps, J Comput Phys 231, 2012 [arXiv:1109.1516]
5. Rezende, Mohamed, Variational inference with normalizing flows, ICML’15, Proc. 32nd Inter. Conf. Machine Learning, Vol. 37, 2015 [arXiv:1505.05770]
6. Marzouk, Moselhy, Parno, Spantini, Sampling via measure transport: An introduction, Handbook of UQ (Ghanem et al, Eds), 2016 [arXiv:1602.05023]
7. Detommaso, Cui, Spantini, Marzouk, RS, A Stein variational Newton method, NIPS 2018, Vol. 31, 2018 [arXiv:1806.03085]
8. Kruse, Detommaso, RS, Köthe, HINT: Hierarchical invertible neural transport for density estimation & Bayesian inference, 2019 [arXiv:1905.10687]
