SLIDE 1

Stein Variational Newton & other Sampling-Based Inference Methods

Robert Scheichl
Interdisciplinary Center for Scientific Computing & Institute of Applied Mathematics, Universität Heidelberg

Collaborators: G. Detommaso (Bath); T. Cui (Monash); A. Spantini & Y. Marzouk (MIT); K. Anaya-Izquierdo & S. Dolgov (Bath); C. Fox (Otago)

RICAM Special Semester on Optimization
Workshop 3 – Optimization and Inversion under Uncertainty
Linz, November 11, 2019


SLIDES 2–3

Inverse Problems

Data y, parameter x:

    y = F(x) + e

with forward model F (a PDE) and observation/model errors e, where y ∈ R^{N_y}, x ∈ X and F : X → R^{N_y}.

  • Data y are limited in number, noisy, and indirect.
  • Parameter x is often a function (discretisation needed): continuous, bounded, and sufficiently smooth.


SLIDES 4–6

Bayesian Interpretation

The (physical) model gives π(y|x), the conditional probability of observing y given x. However, to predict, control, optimise or quantify uncertainty, the interest is often really in π(x|y), the conditional probability of possible causes x given the observed data y – the inverse problem. By Bayes' rule:

    π_pos(x) := π(x|y) ∝ π(y|x) π_pr(x)

Extract information from π_pos (means, covariances, event probabilities, predictions) by evaluating posterior expectations:

    E_{π_pos}[h(x)] = ∫ h(x) π_pos(x) dx
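As a concrete illustration (not from the slides): evaluating the unnormalised posterior only needs the forward model, the noise model and the prior. A minimal numpy sketch, assuming additive Gaussian noise and a Gaussian prior; the names F, sigma_e and sigma_pr are placeholders.

```python
import numpy as np

def log_posterior_unnorm(x, y, F, sigma_e=0.3, sigma_pr=1.0):
    """Unnormalised log pi_pos(x) = log pi(y|x) + log pi_pr(x), assuming
    additive Gaussian noise e ~ N(0, sigma_e^2 I) and a Gaussian prior
    x ~ N(0, sigma_pr^2 I)."""
    residual = np.atleast_1d(y - F(x))
    log_lik = -0.5 * np.dot(residual, residual) / sigma_e**2
    log_prior = -0.5 * np.dot(x, x) / sigma_pr**2
    return log_lik + log_prior
```

Posterior expectations E[h(x)] then follow, e.g., by MCMC or by the transport and tensor methods of the remaining slides.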


SLIDES 7–10

Bayes' Rule and Classical Inversion

Classically [Hadamard, 1923]: the inverse map "F⁻¹" (y → x) is typically ill-posed, i.e. it lacks (a) existence, (b) uniqueness or (c) boundedness.

  • The least-squares solution x̂ is the maximum likelihood estimate.
  • The prior distribution π_pr "acts" as a regulariser – well-posedness!
  • The solution of the regularised least-squares problem is the maximum a posteriori (MAP) estimator.

However, in the Bayesian setting, the full posterior π_pos contains more information than the MAP estimator alone, e.g. the posterior covariance matrix reveals components of x that are (relatively) more or less certain.

Possible to sample/explore via Metropolis-Hastings MCMC (in theory).


SLIDES 11–13

Variational Bayes (as opposed to Metropolis-Hastings MCMC)

Aim: characterise the posterior distribution (density π_pos) analytically (at least approximately) for more efficient inference. This is a challenging task since:

  • x ∈ R^d is typically high-dimensional (e.g., a discretised function)
  • π_pos is in general non-Gaussian (even if π_pr and the observation noise are Gaussian)
  • evaluations of the likelihood may be expensive (e.g., solution of a PDE)

Key Tools

Transport Maps, Optimisation, Principal Component Analysis, Model Order Reduction, Hierarchies, Sparsity, Low-Rank Approximation


SLIDES 14–17

Deterministic Couplings of Probability Measures

[Diagram: a map T transporting the reference density η to the target density π]

Core idea [Moselhy, Marzouk, 2012]

  • Choose a reference distribution η (e.g., standard Gaussian).
  • Seek a transport map T : R^d → R^d such that T♯η = π (or equivalently its inverse S = T⁻¹).

In principle, this enables exact (independent, unweighted) sampling! Satisfying these conditions only approximately can still be useful!
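To make the pushforward concrete (an illustrative sketch, not from the slides): once such a map is available, sampling reduces to transforming reference draws. The affine T below and its coefficients A, b are made up for illustration; in practice T comes from an optimisation step.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical affine map T(z) = A z + b, assumed fitted so that T_sharp(eta) ~ pi
A = np.array([[1.0, 0.0],
              [0.5, 0.8]])
b = np.array([0.2, -0.1])

z = rng.standard_normal((1000, 2))   # independent draws z ~ eta = N(0, I)
x = z @ A.T + b                      # x = T(z): independent, unweighted samples
```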


SLIDES 18–22

Variational Inference

Goal: sampling from a target density π(x).

Given a reference density p, find an invertible map T̂ such that

    T̂ := argmin_T DKL(T♯p ‖ π) = argmin_T DKL(p ‖ (T⁻¹)♯π)

where

    T♯p(x) := p(T⁻¹(x)) |det ∇x T⁻¹(x)|   ... push-forward of p
    DKL(p ‖ q) := ∫ log( p(x)/q(x) ) p(x) dx   ... Kullback-Leibler divergence

  • Advantage of using DKL: we do not need the normalising constant of π.
  • Minimise over some suitable class T of maps T (where ideally the Jacobian determinant |det ∇x T⁻¹(x)| is easy to evaluate).
  • To improve: enrich the class T, or use samples of T̂♯p as proposals for MCMC or in importance sampling (see below).


SLIDES 23–26

Many Choices ("Architectures") for T Possible

Examples (list not comprehensive!):

1. Optimal Transport & Knothe-Rosenblatt Rearrangement [Moselhy, Marzouk, 2012], [Marzouk, Moselhy, Parno, Spantini, 2016]
2. Normalizing Flows [Rezende, Mohamed, 2015] (and related methods in the ML literature)
3. Kernel-based variational inference: Stein Variational Methods [Liu, Wang, 2016], [Detommaso, Cui, Spantini, Marzouk, RS, 2018], [Chen, Wu, Chen, O'Leary-Roseberry, Ghattas, arXiv 2019]
4. Layers of low-rank maps [Bigoni, Zahm, Spantini, Marzouk, arXiv 2019]
5. Layers of hierarchical invertible neural networks (HINT) – not today! [Detommaso, Kruse, Ardizzone, Rother, Köthe, RS, arXiv 2019]
6. Low-rank tensor approx. & Knothe-Rosenblatt rearrangement [Dolgov, Anaya-Izquierdo, Fox, RS, 2019]

SLIDE 27

A Stein Variational Newton (SVN) Method

[Detommaso, Cui, Spantini, Marzouk, RS, 2018]


SLIDES 28–30

Stein Variational Gradient Descent

[Liu, Wang, 2016]

Construct T̂ as a composition of simple maps T̂ℓ:

    T̂ := T̂1 ∘ ··· ∘ T̂ℓ ∘ ··· ,   where T̂ℓ := I + Q̂ℓ

Stein Variational Gradient Descent (SVGD) picks the steepest-descent direction in a Reproducing Kernel Hilbert Space (RKHS) H^d with reproducing kernel k : R^d × R^d → R.

Given a reference measure pℓ in the ℓ-th step, define Jpℓ : H^d → R s.t.

    Jpℓ[Q] := DKL( (I + Q)♯ pℓ ‖ π )

Then Q̂ℓ is chosen to satisfy Jpℓ[Q̂ℓ] < Jpℓ[0]. SVGD uses (functional) gradient descent in H^d and picks

    Q̂ℓ(z) := −∇Jpℓ[0] = E_{x∼pℓ}[ ∇x log π(x) k(x, z) + ∇x k(x, z) ]


SLIDES 31–32

Stein Variational Gradient Descent (cont.)

[Liu, Wang, 2016]

Finally one defines p_{ℓ+1} := (T̂ℓ)♯pℓ = (I + Q̂ℓ)♯pℓ.

In practice, pℓ is taken as the empirical density of N particles {x_j^(ℓ)}_{j=1}^N (as in filtering or sequential Monte Carlo methods), such that

    Q̂ℓ(z) := (1/N) Σ_{j=1}^N [ ∇x log π(x_j^(ℓ)) k(x_j^(ℓ), z) + ∇x k(x_j^(ℓ), z) ]

Algorithm 2: Stein variational gradient descent (SVGD)
Input: particles (x_j^(ℓ))_{j=1}^N, step size ε
Output: particles (x_j^(ℓ+1))_{j=1}^N
for j = 1, 2, ..., N do
    x_j^(ℓ+1) ← T̂ℓ(x_j^(ℓ)) := x_j^(ℓ) + ε Q̂ℓ(x_j^(ℓ))
end for
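A minimal numpy sketch of one SVGD update (Algorithm 2), assuming the isotropic Gaussian kernel of [Liu, Wang, 2016]; grad_log_pi is an assumed user-supplied callable returning ∇ log π at each particle, not something defined on the slides.

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    """Isotropic Gaussian kernel k(x, z) = exp(-gamma ||x - z||^2) on all
    particle pairs, plus the gradients d/dx_j k(x_j, x_i)."""
    diff = X[:, None, :] - X[None, :, :]          # (N, N, d): diff[j, i] = x_j - x_i
    K = np.exp(-gamma * np.sum(diff**2, axis=-1)) # (N, N): K[j, i] = k(x_j, x_i)
    gradK = -2.0 * gamma * diff * K[:, :, None]   # (N, N, d): grad_{x_j} k(x_j, x_i)
    return K, gradK

def svgd_step(X, grad_log_pi, eps=0.1, gamma=1.0):
    """One SVGD update x_i <- x_i + eps * Q(x_i), with
    Q(z) = (1/N) sum_j [grad log pi(x_j) k(x_j, z) + grad_{x_j} k(x_j, z)]."""
    N = X.shape[0]
    K, gradK = rbf_kernel(X, gamma)
    G = grad_log_pi(X)                            # (N, d): rows grad log pi(x_j)
    Q = (K.T @ G + gradK.sum(axis=0)) / N         # (N, d): Q[i] = Q(x_i)
    return X + eps * Q
```

Iterating svgd_step moves the particle ensemble from p towards π.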


SLIDES 33–36

1st Improvement: Using Second-Order Information

Particles are evolved sequentially from the initial distribution p_0 = p to the final distribution p_L ≈ π.

SVGD is a deterministic first-order optimisation algorithm. We can accelerate it by introducing second-order information!

Representing Q̂ℓ(x) = Σ_{j=1}^N c_j k_j(x), where k_j(x) := k(x, x_j^(ℓ)), the (exact) Newton step can be computed by solving the linear system H c = g, where

    H_mn := E_pℓ[ −∇² log π k_m k_n + ∇k_m ∇k_n^⊤ ],   m, n = 1, ..., N,
    g_m  := E_pℓ[ ∇ log π k_m + ∇k_m ],   m = 1, ..., N.

In practice, use a block-diagonal approximation (inexact Newton):

    H_mm c_m = g_m ,   for m = 1, ..., N,   and set Q̂ℓ(x_m) = c_m .

SLIDE 37

A Stein Variational Newton Method

Algorithm 3: Stein variational (inexact) Newton
Input: particles (x_j^(ℓ))_{j=1}^N, step size ε
Output: particles (x_j^(ℓ+1))_{j=1}^N
1: for m = 1, 2, ..., N do
2:     Evaluate the gradient g_m and the Hessian H_mm, replacing ∇² log π by its Gauss-Newton approximation (only needs gradient info and is SPD)
3:     Solve the linear system H_mm c_m = g_m and set Q̂ℓ(x_m^(ℓ)) := c_m
4:     Update particle m: x_m^(ℓ+1) ← x_m^(ℓ) + ε Q̂ℓ(x_m^(ℓ))
5: end for
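A sketch of the corresponding block-diagonal (inexact) Newton update, reusing rbf_kernel from the SVGD sketch above. This is an illustration of Algorithm 3 under stated assumptions, not the authors' reference implementation; gn_hess_log_pi is an assumed user-supplied callable returning an SPD Gauss-Newton approximation of −∇² log π.

```python
import numpy as np

def svn_step(X, grad_log_pi, gn_hess_log_pi, eps=1.0, gamma=1.0):
    """One inexact SVN update following Algorithm 3, with Monte Carlo
    estimates of H_mm and g_m over the particle ensemble."""
    N, d = X.shape
    K, gradK = rbf_kernel(X, gamma)               # from the SVGD sketch above
    G = grad_log_pi(X)                            # (N, d)
    A = np.stack([gn_hess_log_pi(x) for x in X])  # (N, d, d): approx of -hess log pi
    X_new = np.empty_like(X)
    for m in range(N):
        H_mm = np.zeros((d, d))
        g_m = np.zeros(d)
        for j in range(N):                        # empirical expectation over p_l
            H_mm += A[j] * K[j, m]**2 + np.outer(gradK[j, m], gradK[j, m])
            g_m += G[j] * K[j, m] + gradK[j, m]
        c_m = np.linalg.solve(H_mm / N, g_m / N)  # Q_l(x_m) = c_m
        X_new[m] = X[m] + eps * c_m
    return X_new
```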


SLIDES 38–39

2nd Improvement: Kernel Based on Hessian Information

[Liu, Wang, 2016] chose the simple isotropic Gaussian kernel

    k(x, z) = exp(−γ ‖x − z‖₂²)

However, the kernel should mimic the shape of the target distribution. We use a scaled & averaged Hessian (available at no extra cost!):

    M ≈ (1/d) E_pℓ[−∇² log π]

and then construct the (data-informed) kernel

    k(x, z) = exp(−½ ‖x − z‖²_M)

(In practice, use the Gauss-Newton Hessian approximation H and a MC average.)
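An illustrative sketch (names assumed) of this data-informed kernel and its gradient for an SPD metric M, in the same format as rbf_kernel above:

```python
import numpy as np

def hessian_kernel(X, M):
    """Data-informed kernel k(x, z) = exp(-0.5 (x-z)^T M (x-z)) and its
    gradients grad_{x_j} k(x_j, x_i), for a symmetric positive definite
    metric M (e.g. the scaled, ensemble-averaged Gauss-Newton Hessian)."""
    diff = X[:, None, :] - X[None, :, :]             # (N, N, d)
    Md = diff @ M                                    # (N, N, d): M (x_j - x_i)
    K = np.exp(-0.5 * np.sum(diff * Md, axis=-1))    # (N, N)
    gradK = -Md * K[:, :, None]                      # grad_{x_j} k = -M (x_j - x_i) k
    return K, gradK
```

Swapping hessian_kernel in place of rbf_kernel in the sketches above gives the "-H" variants compared below.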

SLIDE 40

Test Case 1: Two-Dimensional "Double Banana"

  • Reference distribution (prior): p = N(0, I)
  • Forward model: F(x) = log( (1 − x₁)² + 100 (x₂ − x₁²)² )   (Rosenbrock function)
  • Observation: y = F(x_true) + ξ, with x_true ∼ N(0, I), ξ ∼ N(0, 0.09 I)
  • Number of particles: N = 1000
  • Compare SVN-H, SVN-I, SVGD-H and SVGD-I ("H" stands for the scaled Hessian kernel, "I" for the isotropic kernel)
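For concreteness (an illustration under the stated model, with the observed value y left as a placeholder): the gradient of the unnormalised log-posterior for this test case, in a form that can be passed as grad_log_pi to the SVGD/SVN sketches above.

```python
import numpy as np

def grad_log_pi_banana(X, y=3.0, sigma2=0.09):
    """Gradient of log pi(x) = -0.5 ||x||^2 - (y - F(x))^2 / (2 sigma2),
    with F(x) = log((1 - x1)^2 + 100 (x2 - x1^2)^2). The value of y is a
    placeholder; in the experiment y = F(x_true) + xi."""
    x1, x2 = X[:, 0], X[:, 1]
    r = (1 - x1)**2 + 100 * (x2 - x1**2)**2
    F = np.log(r)
    dF_dx1 = (-2 * (1 - x1) - 400 * x1 * (x2 - x1**2)) / r
    dF_dx2 = 200 * (x2 - x1**2) / r
    w = (y - F) / sigma2                  # d/dF of the Gaussian log-likelihood
    grad = -X                             # gradient of the N(0, I) log-prior
    grad[:, 0] += w * dF_dx1
    grad[:, 1] += w * dF_dx2
    return grad
```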


SLIDES 41–42

[Figure/Video: Test Case 1 results]

SLIDE 43

Test Case 2: 100-Dimensional Conditional Diffusion

  • Reference distribution: p = N(0, C) with C(t, t′) = min(t, t′)
  • Forward model: F(u) = [û_t5, û_t10, ..., û_t100]^⊤ ∈ R²⁰, where (û_ti)_{i=1}^{100} is the Euler-Maruyama discretisation of
        du_t = β u (1 − u²)/(1 + u²) dt + dx_t ,   u_0 = 0,
    for t ∈ [0, 1] with step size Δt = 1/100
  • Observation: y = F(x_true) + ξ, with x_true ∼ N(0, I), ξ ∼ N(0, 0.01 I)
  • Number of particles: N = 1000
  • Compare SVN-H, SVN-I, SVGD-H and SVGD-I ("H" stands for the scaled Hessian kernel, "I" for the isotropic kernel)
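A sketch of the forward model (illustrative only): the drift value beta = 10 and the parameterisation of the Brownian increments by a standard-normal vector x are assumptions, not stated on the slide.

```python
import numpy as np

def forward_conditional_diffusion(x, beta=10.0, n_steps=100, obs_every=5):
    """Euler-Maruyama discretisation of du = beta*u*(1-u^2)/(1+u^2) dt + dx,
    u_0 = 0, on [0, 1], observed at every 5th step."""
    dt = 1.0 / n_steps
    u = 0.0
    obs = []
    for i in range(n_steps):
        drift = beta * u * (1 - u**2) / (1 + u**2)
        u = u + drift * dt + x[i] * np.sqrt(dt)   # x[i] ~ N(0, 1): scaled increment
        if (i + 1) % obs_every == 0:
            obs.append(u)
    return np.array(obs)                          # in R^20 for the defaults
```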

SLIDE 44

[Figure: posterior particle plots for Test Case 2. Panels: SVN-H after 10, 50 and 100 iterations; SVN-I after 11, 54 and 108 iterations; SVGD-H after 40, 198 and 395 iterations; SVGD-I after 134, 668 and 1336 iterations.]

SLIDE 45

Compare SVN-H with Hamiltonian MCMC (HMC)

[Figure: reconstructed paths for the conditional diffusion – "true" vs. SVN-H vs. HMC, with observations "obs".]

SLIDE 46

Approximation and Sampling of Multivariate Probability Distributions in the Tensor Train Decomposition

[Dolgov, Anaya-Izquierdo, Fox, RS, 2019]


SLIDES 47–48

Recall: General Variational Inference

In general, Variational Inference aims to find

    argmin_T DKL(T♯η ‖ π)

Note that

    DKL(T♯η ‖ π) = −E_{u∼η}[ log π(T(u)) + log |det ∇T(u)| ] + const

A particularly useful family are the Knothe-Rosenblatt rearrangements (see [Marzouk, Moselhy, Parno, Spantini, 2016]):

    T(x) = ( T¹(x₁), T²(x₁, x₂), ..., T^d(x₁, ..., x_d) )^⊤

Then: log |det ∇T(u)| = Σ_k log ∂_{x_k} T^k
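The identity above turns map-fitting into an optimisation over samples from η. A minimal sketch of the Monte Carlo objective (not from the slides; T, log_det_grad_T and log_pi_unnorm are assumed user-supplied callables):

```python
import numpy as np

def kl_loss(T, log_det_grad_T, log_pi_unnorm, d, n_samples=1000, rng=None):
    """Monte Carlo estimate, up to an additive constant, of
    DKL(T_sharp eta || pi) = -E_{u ~ eta}[log pi(T(u)) + log|det grad T(u)|],
    with eta = N(0, I)."""
    rng = rng or np.random.default_rng(0)
    u = rng.standard_normal((n_samples, d))
    return -np.mean(log_pi_unnorm(T(u)) + log_det_grad_T(u))
```

For a triangular (Knothe-Rosenblatt) map, log_det_grad_T is just the sum of the log diagonal derivatives, which is what makes this family cheap.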


SLIDES 49–53

Knothe-Rosenblatt via Conditional Distribution Sampling

In fact, there exists a unique triangular map satisfying T♯η = π (for absolutely continuous η, π on R^d). It can be computed explicitly via Conditional Distribution Sampling¹. Any density factorises into a product of conditional densities:

    π(x₁, ..., x_d) = π₁(x₁) π₂(x₂|x₁) ··· π_d(x_d|x₁, ..., x_{d−1})

and each factor can be sampled (up to normalisation with known scaling factor) by 1D CDF-inversion:

1st step: produce a sample x₁^i via 1D CDF-inversion from

    π₁(x₁) ∼ ∫ π(x₁, x₂, ..., x_d) dx₂ ··· dx_d

k-th step: given x₁^i, ..., x_{k−1}^i, sample x_k^i via 1D CDF-inversion from

    π_k(x_k | x₁^i, ..., x_{k−1}^i) ∼ ∫ π(x₁^i, ..., x_{k−1}^i, x_k, x_{k+1}, ..., x_d) dx_{k+1} ··· dx_d

Problem: (d − k)-dimensional integration at the k-th step!

¹ Rosenblatt '52; Devroye '86; Hörmann, Leydold, Derflinger '04
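The 1D building block, as a grid-based sketch (a hypothetical helper, not from the slides):

```python
import numpy as np

def cdf_inverse_1d(grid, density_vals, u):
    """Draw one sample from an unnormalised 1D density given by its values
    on a grid, by inverting the (piecewise-constant) CDF at u ~ U(0, 1)."""
    w = density_vals / density_vals.sum()   # normalise (scaling factor known)
    cdf = np.cumsum(w)
    return grid[np.searchsorted(cdf, u)]
```

Applying this coordinate by coordinate, with the marginalisation integrals supplied, yields one Knothe-Rosenblatt sample; the next slides show how TT surrogates make those integrals cheap.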

SLIDE 54

Low-Rank Tensor Approximation of Distributions

(Presented already several times.)

Low-rank tensor decomposition ⇔ separation of variables: a tensor grid with n points per direction (or n polynomial basis functions) has O(n^d) entries, while the decomposition stores only O(dn).

Approximate:

    π(x₁, ..., x_d) ≈ Σ_{|α|≤r} π¹_α(x₁) π²_α(x₂) ··· π^d_α(x_d)   (tensor product decomposition)

Construction, integrals and samples are all available at O(dn) cost!


SLIDES 55–57

Tensor Train (TT) Surrogates for High-Dimensional Distributions

[Dolgov, Anaya-Izquierdo, Fox, RS, 2019]

  • Generic – not problem-specific ("black box")
  • Cross approximation: "sequential" design along 1D lines
  • Separable product form: π̃(x₁, ..., x_d) = Σ_{|α|≤r} π¹_α(x₁) ··· π^d_α(x_d)
  • Cheap construction/storage & low number of model evaluations – linear in d
  • Cheap integration w.r.t. x – linear in d
  • Cheap samples via the conditional distribution method (see below) – linear in d
  • Tuneable approximation error ε (by adapting the ranks r) ⟹ cost & storage (poly)logarithmic in ε
  • Many known ways to use this surrogate for fast inference! (as proposals for MCMC, as control variates, importance weighting, ...)


SLIDES 58–59

A Theoretical Result

[Rohrbach, Dolgov, Grasedyck, RS, in preparation]

For Gaussian distributions we have the following result. Let π : R^d → R, x ↦ exp(−½ x^⊤ Σ x), and for each k partition

    Σ = [ Σ₁₁^(k)   Γ_k^⊤  ]
        [ Γ_k       Σ₂₂^(k) ]    where Γ_k ∈ R^{(d−k)×k}.

Theorem. Let Σ be SPD with λ_min > 0, ρ := max_k rank(Γ_k) and σ := max_{k,i} σ_i^(k), where the σ_i^(k) are the singular values of Γ_k. Then, for all ε > 0, there exists a TT-approximation π̃_ε such that

    ‖π − π̃_ε‖_{L²(R^d)} ≤ ε ‖π‖_{L²(R^d)}

and the TT-ranks of π̃_ε are bounded by

    r ≤ ( 1 + 7σ/λ_min · log(7ρd/ε) )^ρ .


SLIDES 60–61

Conditional Distribution Sampler for TT (TT-CD Sampler)

For the TT approximation

    π̃(x) = Σ_{α_k=1}^{r_k} (0 < k < d)  π¹_{α₁}(x₁) · π²_{α₁,α₂}(x₂) · π³_{α₂,α₃}(x₃) ··· π^d_{α_{d−1}}(x_d)

the k-th step of the CD sampler, given x₁^i, ..., x_{k−1}^i, simplifies to

    π̃_k(x_k | x₁^i, ..., x_{k−1}^i) ∼ Σ_{α₁,...,α_{d−1}} π¹_{α₁}(x₁^i) ··· π^{k−1}_{α_{k−2},α_{k−1}}(x_{k−1}^i) · π^k_{α_{k−1},α_k}(x_k) · ∫ π^{k+1}_{α_k,α_{k+1}}(x_{k+1}) dx_{k+1} ··· ∫ π^d_{α_{d−1}}(x_d) dx_d

To sample: simple 1D CDF-inversion – linear in d.
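A toy sketch of this idea (illustrative only, not the algorithm of the paper in full): it assumes entrywise-nonnegative TT cores on uniform grids over [0, 1] and uses rectangle-rule quadrature for the interface integrals; the names and signature are made up.

```python
import numpy as np

def tt_cd_sample(cores, grids, rng=None):
    """Draw one sample from a TT density with cores of shape
    (r_{k-1}, n_k, r_k), assuming nonnegative cores for simplicity."""
    rng = rng or np.random.default_rng(0)
    d = len(cores)
    # Right interface integrals over x_{k+1}, ..., x_d (rectangle rule)
    R = [None] * (d + 1)
    R[d] = np.ones(1)
    for k in range(d - 1, -1, -1):
        R[k] = np.einsum('aib,b->a', cores[k], R[k + 1]) / cores[k].shape[1]
    left = np.ones(1)                       # prefix for already-sampled coordinates
    x = np.zeros(d)
    for k in range(d):
        p = np.einsum('a,aib,b->i', left, cores[k], R[k + 1])  # 1D marginal on grid
        p = np.maximum(p, 0.0)
        p /= p.sum()
        idx = rng.choice(len(p), p=p)       # discrete CDF-inversion on the grid
        x[k] = grids[k][idx]
        left = np.einsum('a,ab->b', left, cores[k][:, idx, :])
    return x
```

The point of the sketch: every step touches only one core, so the cost of a sample is linear in d.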


SLIDES 62–64

How to Use the TT-CD Sampler to Estimate E_π[Q]?

Problem: we are sampling from the approximation π̃ = π + O(ε).

Option 0: Biased estimator E_π[Q] ≈ E_π̃[Q] via i.i.d. MC quadrature. Can use QMC "seeds" instead of random ones.

[Figure: i.i.d. vs. QMC seed points mapped through the TT-CD sampler.]


SLIDES 65–67

Sampling from the Exact π: Unbiased Estimates of E_π[Q]

Option 1: Use {x^i_π̃} as (i.i.d.) proposals in Metropolis-Hastings. Accept proposal x^i_π̃ with probability

    α = min( 1, [π(x^i_π̃) π̃(x^{i−1}_π)] / [π(x^{i−1}_π) π̃(x^i_π̃)] )

Can prove that the rejection rate ∼ ε and the IACT τ ∼ 1 + ε.

Option 2: Use π̃ for importance weighting + QMC quadrature:

    E_π[Q] ≈ (1/Z) (1/N) Σ_{i=1}^N Q(x^i_π̃) π(x^i_π̃) / π̃(x^i_π̃),   with   Z = (1/N) Σ_{i=1}^N π(x^i_π̃) / π̃(x^i_π̃)

We can use an unbiased (randomised) QMC rule for both integrals.

Option 3: Use the biased QMC estimator as a control variate (MLMCMC).
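Minimal sketches of Options 1 and 2 (illustrative; sample_tilde, pdf_tilde, pdf_pi and Q are assumed user-supplied callables, with the pdfs allowed to be unnormalised):

```python
import numpy as np

def independence_mh(sample_tilde, pdf_tilde, pdf_pi, n_steps, rng=None):
    """Option 1: Metropolis-Hastings with i.i.d. proposals drawn from the
    surrogate pi_tilde (e.g. via the TT-CD sampler)."""
    rng = rng or np.random.default_rng(0)
    x = sample_tilde()
    chain = [x]
    for _ in range(n_steps):
        x_prop = sample_tilde()
        alpha = min(1.0, (pdf_pi(x_prop) * pdf_tilde(x)) /
                         (pdf_pi(x) * pdf_tilde(x_prop)))
        if rng.uniform() < alpha:
            x = x_prop
        chain.append(x)
    return chain

def importance_estimate(Q, X, pdf_pi, pdf_tilde):
    """Option 2: self-normalised importance sampling estimate of E_pi[Q],
    with rows of X drawn from pi_tilde (random or QMC seeds)."""
    w = pdf_pi(X) / pdf_tilde(X)
    return np.sum(w * Q(X)) / np.sum(w)
```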


SLIDES 68–71

Numerical Experiments: (Artificial) Inverse Diffusion Problem

    −∇·(κ(s, x) ∇u) = 0,   s ∈ (0, 1)²,
    u|_{s₁=0} = 1,   u|_{s₁=1} = 0,   ∂u/∂n|_{s₂=0} = ∂u/∂n|_{s₂=1} = 0.

  • Karhunen-Loève expansion² of log κ(s, x) = Σ_{k=1}^d φ_k(s) x_k with prior x_k ∼ U[−1, 1], ‖φ_k‖_∞ = O(k^{−3/2}) and d = 11
  • Discretisation with bilinear FEs on a uniform mesh with h = 1/64
  • Data: average pressure at 9 locations (synthetic, i.e. for some x∗)
  • QoI: probability that the flux exceeds 1.5

[Figure: permeability field and pressure solution with measurement locations.]

² Eigel, Pfeffer, Schneider, 2016.


SLIDES 72–73

Comparison Against DRAM (for the Inverse Diffusion Problem)

[Figure: relative error in P(F > 1.5) vs. CPU time for noise levels σ²_e = 0.01 and σ²_e = 0.001, down to the discretisation error, comparing TT-MH, TT-qIW and DRAM.]

  • TT-MH: TT conditional-distribution samples (i.i.d.) as proposals for MCMC
  • TT-qIW: TT surrogate for importance sampling with QMC
  • DRAM: Delayed Rejection Adaptive Metropolis [Haario et al, 2006]

SLIDE 74

Samples – Comparison TT-CD vs. DRAM

[Figure: two-dimensional marginals of posterior samples from DRAM (top) and from TT-MH with i.i.d. seeds (bottom).]


SLIDES 75–77

Conclusions

Inverse problems under uncertainty – variational inference.

  • Central idea: characterise complex/intractable distributions by constructing deterministic couplings.
  • Central tool: optimisation of the Kullback-Leibler divergence.
  • Many types of approximation classes (non-exhaustive): sparse maps, decomposable maps, neural nets; kernel-based approaches; low-rank structure.
  • Main Topic 1: Newton acceleration and data-informed kernels for Stein Variational Methods.
  • Main Topic 2: TT surrogates for efficient samplers in high dimensions; use approximate maps to accelerate MCMC or in an importance sampler.

SLIDE 78

References

1. Moselhy, Marzouk, Bayesian inference with optimal maps, J Comput Phys 231, 2012 [arXiv:1109.1516]
2. Rezende, Mohamed, Variational inference with normalizing flows, ICML'15: Proc. 32nd Inter. Conf. Machine Learning, Vol. 37, 2015 [arXiv:1505.05770]
3. Marzouk, Moselhy, Parno, Spantini, Sampling via measure transport: An introduction, Handbook of Uncertainty Quantification (Ghanem, Higdon, Owhadi, Eds.), 2016 [arXiv:1602.05023]
4. Liu, Wang, Stein variational gradient descent: A general purpose Bayesian inference algorithm, NIPS 2016, Vol. 29, 2016 [arXiv:1608.04471]
5. Detommaso, Cui, Spantini, Marzouk, RS, A Stein variational Newton method, NIPS 2018, Vol. 31, 2018 [arXiv:1806.03085]
6. Dolgov, Anaya-Izquierdo, Fox, RS, Approximation and sampling of multivariate probability distributions in the tensor train decomposition, Statistics & Comput. (online first), 2019 [arXiv:1810.01212]
7. Detommaso, Kruse, Ardizzone, Rother, Köthe, RS, HINT: Hierarchical invertible neural transport for general & sequential Bayesian inference, 2019 [arXiv:1905.10687]