SLIDE 1 Bayesian Probabilistic Numerical Methods
- J. Cockayne1
- M. Girolami2,3
- H. C. Lie4,5
- C. Oates3,6
- T. J. Sullivan4,5
- A. Teckentrup3,7
SAMSI–Lloyds–Turing Workshop on Probabilistic Numerical Methods, Alan Turing Institute, London, UK, 11 April 2018
1University of Warwick, UK 2Imperial College London, UK 3Alan Turing Institute, London, UK 4Free University of Berlin, DE 5Zuse Institute Berlin, DE 6Newcastle University, UK 7University of Edinburgh, UK
SLIDE 2
A Probabilistic Treatment of Numerics?
The last 5 years have seen a renewed interest in probabilistic perspectives on numerical tasks — e.g. quadrature, ODE and PDE solution, optimisation — continuing a theme with a long heritage: Poincaré (1896); Larkin (1970); Diaconis (1988); Skilling (1992). There are many ways to motivate this modelling choice:
- To a statistician’s eye, numerical tasks look like inverse problems.
- Worst-case errors are often too pessimistic — perhaps we should adopt an average-case viewpoint (Traub et al., 1988; Ritter, 2000; Trefethen, 2008)?
- “Big data” problems often require (random) subsampling.
- If discretisation error is not properly accounted for, then biased and over-confident inferences result (Conrad et al., 2016). However, the necessary numerical analysis in nonlinear and evolutionary contexts can be hard!
- Accounting for the impact of discretisation error in a statistical way allows forward and Bayesian inverse problems to speak a common statistical language.
To make these ideas precise and to relate them to one another, some concrete definitions are needed!
2/41
SLIDE 3 Outline
- 1. Numerics: An Inference Perspective
- 2. Bayes’ Theorem via Disintegration
- 3. Optimal Information
- 4. Numerical Disintegration
- 5. Coherent Pipelines of BPNMs
- 6. Randomised Bayesian Inverse Problems
- 7. Closing Remarks
3/41
SLIDE 4
An Inference Perspective on Numerical Tasks
SLIDE 5–7
An Abstract View of Numerical Methods i
An abstract setting for numerical tasks consists of three spaces and two functions:
- X, where an unknown/variable object x or u lives; dim X = ∞;
- A, where we observe information A(x), via a function A: X → A; dim A < ∞;
- Q, with a quantity of interest Q: X → Q.
Example 1 (Quadrature) X = C0([0, 1]; R), A = ([0, 1] × R)^m, Q = R, A(u) = (t_i, u(t_i))_{i=1}^m, Q(u) = ∫_0^1 u(t) dt.
Conventional numerical methods are cleverly-designed functions b: A → Q: they estimate Q(x) by b(A(x)). N.B. Some methods try to “invert” A, form an estimate of x, then apply Q. Vanilla Monte Carlo — b((t_i, y_i)_{i=1}^n) := (1/n) ∑_{i=1}^n y_i — does not! (cf. O’Hagan, 1987) 4/41
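To make the abstract triple (X, A, Q) concrete, here is a minimal Python sketch of Example 1. The integrand, the node placement and the two rules b are illustrative assumptions only; they are not the methods developed later in the talk.

```python
import numpy as np

# Quadrature in the abstract (X, A, Q) setting: X = C^0([0,1]; R), Q = R.
rng = np.random.default_rng(0)
u = np.sin                                   # an element of X (illustrative choice)

def A(u, t):
    """Information operator: evaluate u at the nodes t."""
    return t, u(t)

def b_trapezoid(info):
    """A conventional method b: A -> Q, the composite trapezoidal rule."""
    t, y = info
    return 0.5 * np.sum((y[1:] + y[:-1]) * np.diff(t))

def b_monte_carlo(info):
    """Vanilla Monte Carlo: uses only the values y, not the node locations (cf. O'Hagan, 1987)."""
    _, y = info
    return np.mean(y)

t_grid = np.linspace(0.0, 1.0, 11)           # deterministic design
t_rand = np.sort(rng.uniform(0.0, 1.0, 11))  # random design, as Monte Carlo presumes

Q_true = 1.0 - np.cos(1.0)                   # Q(u) = integral of sin over [0, 1]
print("trapezoid  :", b_trapezoid(A(u, t_grid)))
print("Monte Carlo:", b_monte_carlo(A(u, t_rand)))
print("truth      :", Q_true)
```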
SLIDE 8 An Abstract View of Numerical Methods ii
Question: What makes for a “good” numerical method? (Larkin, 1970)
Answer 1, Gauss: b ◦ A = Q on a “large” finite-dimensional subspace of X.
Answer 2, Sard (1949): b ◦ A − Q is “small” on X. In what sense?
- The worst-case error: eWC := sup_{x∈X} ∥b(A(x)) − Q(x)∥_Q.
- The average-case error with respect to a probability measure µ on X: eAC := ∫_X ∥b(A(x)) − Q(x)∥_Q µ(dx).
To a Bayesian, seeing the additional structure of µ, there is “only one way forward”: if x ∼ µ, then b(A(x)) should be obtained by conditioning µ and then applying Q. But is this Bayesian solution always well-defined, and what are its error properties?
5/41
SLIDE 9–14
- Rev. Bayes Does Some Numerics i
[Commutative diagram: spaces X, A, Q along the top, linked by A and a deterministic method b: A → Q; “Go Probabilistic!”: P_X, P_A, P_Q below, linked by δ, A#, Q#, the disintegration a ↦ µ^a, and the PNM output a ↦ B(µ, a).]
Example 2 (Quadrature) X = C0([0, 1]; R), A = ([0, 1] × R)^m, Q = R, A(u) = (t_i, u(t_i))_{i=1}^m, Q(u) = ∫_0^1 u(t) dt.
A deterministic numerical method uses only the spaces and data to produce a point estimate of the integral. A probabilistic numerical method converts an additional belief about the integrand into a belief about the integral.
Definition 2 (Bayesian PNM) A PNM B(µ, ·): A → P_Q with prior µ ∈ P_X is Bayesian for a quantity of interest Q: X → Q and information operator A: X → A if the bottom-left A–P_X–P_Q triangle commutes, i.e. the output of B is the push-forward of the conditional distribution µ^a through Q: B(µ, a) = Q#µ^a, for A#µ-almost all a ∈ A. Zellner (1988) calls B an “information processing rule”.
6/41
SLIDE 15–16
- Rev. Bayes Does Some Numerics ii
Definition 3 (Bayesian PNM) A PNM B with prior µ ∈ P_X is Bayesian for a quantity of interest Q and information A if its output is the push-forward of the conditional distribution µ^a through Q: B(µ, a) = Q#µ^a, for A#µ-almost all a ∈ A.
Example 4 Under the Gaussian Brownian motion prior on X = C0([0, 1]; R), the posterior mean / MAP estimator for the definite integral is the trapezoidal rule, i.e. integration using linear interpolation (Sul′din, 1959, 1960). The integrated Brownian motion prior corresponds to integration using cubic spline interpolation.
7/41
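Example 4 can be checked numerically. The sketch below (nodes and integrand are illustrative assumptions) conditions the jointly Gaussian vector consisting of the integral and the point evaluations under a Brownian motion prior, using Cov(u(s), u(t)) = min(s, t), Cov(∫u, u(t)) = t − t²/2 and Var(∫u) = 1/3; the posterior mean of the integral reproduces the trapezoidal rule.

```python
import numpy as np

# Bayesian quadrature under a Brownian motion prior on u (so u(0) = 0 almost surely).
t = np.array([0.0, 0.2, 0.45, 0.7, 1.0])     # illustrative nodes, including both endpoints
y = np.sin(3.0 * t)                          # observed values u(t_i); note u(0) = 0

K = np.minimum.outer(t, t)                   # Cov(u(t_i), u(t_j)) = min(t_i, t_j)
c = t - t**2 / 2.0                           # Cov(integral, u(t))
v = 1.0 / 3.0                                # Var(integral)

jitter = 1e-10 * np.eye(len(t))              # K is singular because u(0) = 0 a.s.; regularise
w = np.linalg.solve(K + jitter, c)
post_mean = w @ y                            # posterior mean of the integral
post_var = v - c @ w                         # posterior variance of the integral

trapezoid = 0.5 * np.sum((y[1:] + y[:-1]) * np.diff(t))
print("posterior mean  :", post_mean)
print("trapezoidal rule:", trapezoid)        # agrees with the posterior mean
print("posterior var   :", post_var)
```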
SLIDE 17
A Rogue’s Gallery of Bayesian and non-Bayesian PNMs
8/41
SLIDE 18
Generalising Bayes’ Theorem via Disintegration
SLIDE 19
Bayes’ Theorem
Thus, we are expressing PNMs in terms of Bayesian inverse problems (Stuart, 2010). But a naïve interpretation of Bayes’ rule makes no sense here, because supp(µ^a) ⊆ X^a := {x ∈ X | A(x) = a}, typically µ(X^a) = 0, and — in contrast to typical statistical inverse problems — we think of the observation process as noiseless. E.g. the quadrature example from earlier, with A(u) = (t_i, u(t_i))_{i=1}^m.
Thus, we cannot take the usual approach of defining µ^a via its prior density as dµ^a/dµ(x) ∝ likelihood(x|a), because this density “wants” to be the indicator function 1[x ∈ X^a]. While linear-algebraic tricks work for linear conditioning of Gaussians, in general we condition on events of measure zero using disintegration.
9/41
SLIDE 20
Disintegration i
Write µ(f) ≡ Eµ[f] ≡ ∫_X f(x) µ(dx).
Definition 5 (Disintegration) A disintegration of µ ∈ P_X with respect to a measurable map A: X → A is a map A → P_X, a ↦ µ^a, such that
- µ^a(X \ X^a) = 0 for A#µ-almost all a ∈ A; (support)
- for each measurable f: X → [0, ∞), a ↦ µ^a(f) is measurable; (measurability)
- µ(f) = A#µ(µ^a(f)), i.e. ∫_X f(x) µ(dx) = ∫_A [ ∫_{X^a} f(x) µ^a(dx) ] (A#µ)(da). (conditioning/reconstruction)
10/41
SLIDE 21
Disintegration ii
Theorem 6 (Disintegration theorem (Chang and Pollard, 1997, Thm. 1)) Let X be a metric space and let µ ∈ P_X be inner regular. If the Borel σ-algebra on A is countably generated and contains all singletons {a} for a ∈ A, then there is an essentially unique disintegration {µ^a}_{a∈A} of µ with respect to A. (If {ν^a}_{a∈A} is another such disintegration, then {a ∈ A : µ^a ≠ ν^a} is an A#µ-null set.)
Example 7 For µ ∈ P_{R²} with continuous Lebesgue density ρ: R² → [0, ∞), i.e. dµ(x1, x2) = ρ(x1, x2) d(x1, x2), the disintegration of µ with respect to the vertical projection A(x1, x2) := x1 is just the family of measures µ^a, where µ^a has Lebesgue density ρ(a, ·)/Z_a on the vertical line {(a, x2) | x2 ∈ R}, and Z_a := ∫_R ρ(a, x2) dx2.
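To make Example 7 concrete, here is a minimal grid-based sketch (the particular density and grid are illustrative assumptions): it builds the conditional densities ρ(a, ·)/Z_a and verifies the reconstruction property of Definition 5 for one test function.

```python
import numpy as np

# Disintegrate a density rho on R^2 with respect to the projection A(x1, x2) = x1.
x1 = np.linspace(-3, 3, 241)
x2 = np.linspace(-3, 3, 241)
dx1, dx2 = x1[1] - x1[0], x2[1] - x2[0]
X1, X2 = np.meshgrid(x1, x2, indexing="ij")

rho = np.exp(-0.5 * (X1**2 + X2**2 - X1 * X2) / 0.75)   # unnormalised correlated Gaussian
rho /= rho.sum() * dx1 * dx2                             # normalise to a probability density

Z = rho.sum(axis=1) * dx2              # Z_a = integral of rho(a, x2) dx2, the density of A#mu
cond = rho / Z[:, None]                # disintegration: mu^a has density rho(a, .)/Z_a

# Reconstruction: mu(f) = integral over a of [integral of f dmu^a] (A#mu)(da), for f(x) = x2^2.
f = X2**2
lhs = (f * rho).sum() * dx1 * dx2
rhs = ((f * cond).sum(axis=1) * dx2 * Z).sum() * dx1
print(lhs, rhs)                        # the two integrals agree up to floating-point rounding
```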
Except for nice situations like this, Gaussian measures, etc. (Owhadi and Scovel, 2015), disintegrations cannot be computed exactly — we have to work approximately.
11/41
SLIDE 22
Optimal Information: the Worst, the Average, and the Bayesian
SLIDE 23
Measures of Error
Suppose we have a loss function L: Q × Q → R, e.g. L(q, q′) := ∥q − q′∥²_Q.
The worst-case loss for a classical numerical method b: A → Q is eWC(A, b) := sup_{x∈X} L(b(A(x)), Q(x)).
The average-case loss under a probability measure µ ∈ P_X is eAC(A, b) := ∫_X L(b(A(x)), Q(x)) µ(dx), and for a PNM B, eAC(A, B) := ∫_X [ ∫_Q L(q, Q(x)) B(µ, A(x))(dq) ] µ(dx).
Kadane and Wasilkowski (1985) show that the minimiser B is a non-random Bayes decision rule b, and the minimiser A is “optimal information” for this task.
A BPNM B has “no choice” but to be Q♯µ^a once A(x) = a is given; optimality of A means minimising the Bayesian loss
eBPN(A) := ∫_X [ ∫_Q L(q, Q(x)) (Q♯µ^{A(x)})(dq) ] µ(dx).
12/41
SLIDE 24–25
Optimal Information: AC = BPN?
Theorem 8 (AC = BPN for quadratic loss; Cockayne, Oates, Sullivan, and Girolami, 2017) For a quadratic loss L(q, q′) := ∥q − q′∥²_Q on a Hilbert space Q, optimal information for BPNM and ACE coincide (though the minimal values may differ).
Example 9 (AC = BPN in general?) Decide whether or not a card drawn fairly at random is ♦, incurring unit loss if you guess wrongly; you can choose to be told whether the card is red (A1) or is non-♣ (A2).
X = {♣, ♦, ♥, ♠}, µ = Unif(X), Q = {0, 1} ⊂ R;
A1 = {0, 1}, A1(x) = 1[x ∈ {♦, ♥}], Q(x) = 1[x = ♦];
A2 = {0, 1}, A2(x) = 1[x ∈ {♦, ♥, ♠}], L(q, q′) = 1[q ≠ q′].
Which information operator, A1 or A2, is better? (Note that eWC(Ai, b) = 1 for all deterministic b!)
13/41
SLIDE 26–30
Optimal Information: AC ≠ BPN!
X = {♣, ♦, ♥, ♠}, µ = Unif(X), Q = {0, 1} ⊂ R; A1(x) = red vs. black, A2(x) = ¬♣ vs. ♣, Q(x) = 1[x = ♦], L(q, q′) = 1[q ≠ q′]. Taking x in the order ♣, ♦, ♥, ♠:
eAC(A1, b) = (1/4) ( L(b(black), 0) + L(b(red), 1) + L(b(red), 0) + L(b(black), 0) )
eAC(A1, 0) = (1/4) ( 0 + 1 + 0 + 0 ) = 1/4
eAC(A1, id) = (1/4) ( 0 + 0 + 1 + 0 ) = 1/4
eAC(A2, b) = (1/4) ( L(b(♣), 0) + L(b(¬♣), 1) + L(b(¬♣), 0) + L(b(¬♣), 0) )
eAC(A2, 0) = (1/4) ( 0 + 1 + 0 + 0 ) = 1/4
eBPN(A1) = (1/4) ( E_{Q♯µ^black} L(·, 0) + E_{Q♯µ^red} L(·, 1) + E_{Q♯µ^red} L(·, 0) + E_{Q♯µ^black} L(·, 0) )
         = (1/4) ( (1/2·0 + 1/2·0) + (1/2·0 + 1/2·1) + (1/2·1 + 1/2·0) + (1/2·0 + 1/2·0) ) = 1/4
eBPN(A2) = (1/4) ( E_{Q♯µ^♣} L(·, 0) + E_{Q♯µ^{¬♣}} L(·, 1) + E_{Q♯µ^{¬♣}} L(·, 0) + E_{Q♯µ^{¬♣}} L(·, 0) )
         = (1/4) ( (1·0) + (1/3·0 + 1/3·1 + 1/3·1) + (1/3·1 + 1/3·0 + 1/3·0) + (1/3·1 + 1/3·0 + 1/3·0) ) = 1/3
14/41
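These numbers can be confirmed by brute force. The sketch below (the string encoding of the cards is an implementation choice, not notation from the slides) finds the best average-case deterministic rule for each information operator and computes the Bayesian risk; it prints 1/4 for both average-case risks and for eBPN(A1), but 1/3 for eBPN(A2).

```python
from itertools import product

cards = ["club", "diamond", "heart", "spade"]
Q = {x: int(x == "diamond") for x in cards}
A1 = {x: int(x in ("diamond", "heart")) for x in cards}   # "is the card red?"
A2 = {x: int(x != "club") for x in cards}                  # "is the card non-club?"

def e_AC(A, b):
    """Average-case 0-1 loss of a deterministic rule b: information -> {0, 1}."""
    return sum(int(b[A[x]] != Q[x]) for x in cards) / 4

def e_BPN(A):
    """Bayesian risk: the BPNM reports Q#mu^a, the conditional law of Q given A = a."""
    risk = 0.0
    for x in cards:
        fibre = [z for z in cards if A[z] == A[x]]          # support of mu^{A(x)}
        risk += sum(int(Q[z] != Q[x]) for z in fibre) / len(fibre)
    return risk / 4

best_AC_1 = min(e_AC(A1, dict(zip((0, 1), b))) for b in product((0, 1), repeat=2))
best_AC_2 = min(e_AC(A2, dict(zip((0, 1), b))) for b in product((0, 1), repeat=2))
print(best_AC_1, best_AC_2)          # 0.25 0.25
print(e_BPN(A1), e_BPN(A2))          # 0.25 0.333...
```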
SLIDE 31
Numerical Disintegration
SLIDE 32 Numerical Disintegration i
The exact disintegration “µ^a(dx) ∝ 1[A(x) = a] µ(dx)” can be accessed numerically via relaxation, with approximation guarantees provided a ↦ µ^a is “nice”, e.g. A#µ ∈ P_A has a smooth Lebesgue density. Consider the relaxed posterior µ^a_δ(dx) ∝ ϕ(∥A(x) − a∥_A/δ) µ(dx) with 0 < δ ≪ 1.
Essentially any ϕ: [0, ∞) → [0, 1] tending continuously to 1 at 0 and decaying quickly enough to 0 at ∞ will do, e.g. ϕ(r) := 1[r < 1] or ϕ(r) := exp(−r²).
Definition 10 The integral probability metric on P_X with respect to a normed space F of test functions f: X → R is dF(µ, ν) := sup { |µ(f) − ν(f)| : f ∈ F, ∥f∥_F ≤ 1 }.
- F = bounded continuous functions with uniform norm ↔ total variation.
- F = bounded Lipschitz continuous functions with Lipschitz norm ↔ Wasserstein.
15/41
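A minimal sketch of how the relaxed posterior µ^a_δ can be accessed in practice by reweighting prior samples (self-normalised importance sampling). The two-dimensional Gaussian toy problem is an assumption made purely for illustration; it is not one of the examples in the slides.

```python
import numpy as np

# Relaxed posterior mu^a_delta(dx) proportional to phi(||A(x) - a|| / delta) mu(dx),
# with phi(r) = exp(-r^2).  Toy setting: mu = N(0, I_2) on R^2, A(x) = x1 + x2, a = 1,
# so the fibre X^a is a line of mu-measure zero.
rng = np.random.default_rng(1)

def relaxed_posterior_expectation(f, a=1.0, delta=0.05, n=200_000):
    x = rng.standard_normal((n, 2))                 # samples from the prior mu
    r = np.abs(x[:, 0] + x[:, 1] - a)               # ||A(x) - a||
    w = np.exp(-(r / delta) ** 2)                   # relaxation weights phi(r / delta)
    w /= w.sum()                                    # self-normalise
    return np.sum(w * f(x))

# The exact disintegration is Gaussian here: E[x1 | x1 + x2 = 1] = 0.5.
for delta in (1.0, 0.3, 0.05):
    est = relaxed_posterior_expectation(lambda x: x[:, 0], delta=delta)
    print(f"delta = {delta:5.2f}  relaxed E[x1] ~ {est:.3f}  (exact conditional mean = 0.5)")
```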
SLIDE 33 Numerical Disintegration ii
“µ^a(dx) ∝ 1[A(x) = a] µ(dx)”    µ^a_δ(dx) ∝ ϕ(∥A(x) − a∥_A/δ) µ(dx)    dF(µ, ν) := sup { |µ(f) − ν(f)| : ∥f∥_F ≤ 1 }
Theorem 11 (Cockayne, Oates, Sullivan, and Girolami, 2017, Theorem 4.3) If a ↦ µ^a is γ-Hölder from (A, ∥·∥_A) into (P_X, dF), then the approximation µ^a_δ ≈ µ^a inherits this rate as a function of δ. That is,
dF(µ^a, µ^{a′}) ≤ C ∥a − a′∥^γ for a, a′ ∈ A   ⟹   dF(µ^a, µ^a_δ) ≤ C Cϕ δ^γ for A#µ-almost all a ∈ A.
Open question: when does the hypothesis, a quantitative version of the Tjur property (Tjur, 1980), actually hold?
16/41
SLIDE 34
Numerical Disintegration iii
To evaluate expectations against µ^a we can extrapolate expectations against µ^a_δ (Schillings and Schwab, 2016). To sample µ^a_δ we take inspiration from rare event simulation and use tempering schemes to sample the posterior. Set δ0 > δ1 > . . . > δN and consider µ^a_{δ0}, µ^a_{δ1}, . . . , µ^a_{δN}:
- µ^a_{δ0} is easy to sample — often µ^a_{δ0} = µ.
- µ^a_{δN} has δN close to zero and is hard to sample.
Intermediate distributions define a “ladder” which takes us from prior to posterior. Even within this framework, there is considerable choice of sampling scheme, e.g. brute-force MCMC, SMC, QMC, pCN, …
17/41
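The sketch below strips the ladder idea to its bare bones on the same toy problem as before: prior particles are reweighted and resampled down a sequence of decreasing δ. A practical scheme would interleave MCMC or pCN moves between rungs, as the slides describe; this simplified version only shows the mechanics of the ladder.

```python
import numpy as np

# Sequential importance resampling over a delta-ladder (no rejuvenation moves).
# Toy target as above: mu = N(0, I_2), A(x) = x1 + x2, a = 1, phi(r) = exp(-r^2).
rng = np.random.default_rng(2)

def log_phi(x, a, delta):
    return -((x[:, 0] + x[:, 1] - a) / delta) ** 2

n = 50_000
deltas = np.logspace(1, -3, 30)              # ladder from delta = 10 down to delta = 1e-3
x = rng.standard_normal((n, 2))              # particles from the prior mu
prev = np.zeros(n)                           # log phi at "delta = infinity" (phi = 1)

for d in deltas:
    logw = log_phi(x, 1.0, d) - prev         # incremental weights towards mu^a_d
    w = np.exp(logw - logw.max())
    w /= w.sum()
    x = x[rng.choice(n, size=n, p=w)]        # multinomial resampling at each rung
    prev = log_phi(x, 1.0, d)                # potential of the resampled particles

print("E[x1] at the bottom of the ladder:", x[:, 0].mean())   # close to the exact value 0.5
```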
SLIDE 35–36 Example: Painlevé’s First Transcendental i
A multivalent boundary value problem:
u′′(t) − u(t)² = −t for t ≥ 0,   u(0) = 0,   u(t)/√t → 1 as t → +∞,
with the far-field condition imposed in practice as u(10) = √10.
[Figure 1: The two solutions of Painlevé’s first transcendental and their spectra in the orthonormal Chebyshev polynomial basis over [0, 10] (positive and negative branches).]
18/41
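For orientation, a classical (non-probabilistic) way to see that the truncated problem really has two solutions is to hand it to a standard BVP solver from different initial guesses. The sketch below uses scipy.integrate.solve_bvp; the initial guesses are assumptions, and convergence of the second guess to the negative branch is not guaranteed.

```python
import numpy as np
from scipy.integrate import solve_bvp

# Truncated BVP:  u'' - u^2 = -t,  u(0) = 0,  u(10) = sqrt(10).
def rhs(t, y):
    return np.vstack([y[1], y[0] ** 2 - t])          # state y = (u, u')

def bc(ya, yb):
    return np.array([ya[0], yb[0] - np.sqrt(10.0)])

t = np.linspace(0.0, 10.0, 200)

# Guess following the positive branch u ~ sqrt(t).
y_pos = np.vstack([np.sqrt(t), np.zeros_like(t)])
sol_pos = solve_bvp(rhs, bc, t, y_pos, max_nodes=10_000)

# Guess that dips below zero, aiming for the other branch.
y_neg = np.vstack([-2.0 * np.sin(t), np.zeros_like(t)])
sol_neg = solve_bvp(rhs, bc, t, y_neg, max_nodes=10_000)

print(sol_pos.status, sol_neg.status)                 # 0 means the solver converged
print(sol_pos.sol(5.0)[0], sol_neg.sol(5.0)[0])       # if distinct solutions were found, these differ
```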
SLIDE 37–39 Example: Painlevé’s First Transcendental iii
Parallel tempered pCN with 100 δ-values log-spaced from δ = 10 to δ = 10⁻⁴ and 10⁸ iterations recovers both solutions in approximately the same proportions as the posterior densities at the two exact solutions. ✓ SMC reliably recovers one solution, but not both simultaneously. ! ?
Of course, this comes at the price of MCMC… ✗
[Figure: posterior samples of x(t) over t ∈ [0, 10] at three rungs of the ladder, δ = 1.0e+01, δ = 5.5e−01 and δ = 1.0e−04.]
19/41
SLIDE 40
Coherent Pipelines of BPNMs
SLIDE 41
Computational Pipelines
Numerical methods usually form part of pipelines. Prime example: a PDE solve is a forward model in an inverse problem. Motivation for PNMs in the context of Bayesian inverse problems: make the forward and inverse problem speak the same statistical language! We can compose PNMs in series, e.g. B2(B1(µ, a1), a2) is formally B(µ, (a1, a2))… although figuring out what the spaces Xi, Ai and operators Ai etc. are is a headache!
20/41
SLIDE 42–43
Coherence i
More generally, we compose PNMs in a graphical way by allowing input information nodes (□) to feed into method nodes (■), which in turn output new information. N.B. Deterministic data at the left-most □ nodes, then random variables as outputs, realisations of which get fed into the next ■.
We define the corresponding dependency graph by replacing each □→■→□ by □→□, and number the vertices in an increasing fashion, so that i → i′ implies i < i′. The independence properties of the random variables at each node are crucial.
[Figure: an example pipeline of numbered □ and ■ nodes, and the corresponding numbered dependency graph.]
21/41
SLIDE 44
Coherence ii
Definition 12 A prior µ is coherent for the dependency graph if — when the “leaf” input nodes are A♯µ-distributed and the remaining nodes are B(µ, parents)-distributed — every node Yk is conditionally independent of all older non-parent nodes given its direct parents:
Yk ⊥⊥ Y_{{1,...,k−1} \ parents(k)} | Y_{parents(k)}.
This is weaker than the Markov condition for directed acyclic graphs (Lauritzen, 1991): we do not insist that the variables at the source nodes are independent.
22/41
SLIDE 45–46
Coherency Theorem
Theorem 13 (Cockayne, Oates, Sullivan, and Girolami, 2017, Theorem 5.9) If a pipeline of PNMs is such that the prior is coherent for the dependency graph and the component PNMs are all Bayesian, then the pipeline is itself the Bayesian PNM (data at leaves) →■→ (final output).
Redundant structure in the pipeline (recycled information) will break coherence, and hence Bayesianity of the pipeline. In principle, coherence, and hence being Bayesian, depend upon the prior. This should not be surprising: as a loose analogy, one doesn’t expect the trapezoidal rule to be a good way to integrate very smooth functions.
23/41
SLIDE 47–50
Split Integration: Coherence
[Pipeline diagram: u(t0), …, u(t_{m−1}) and u(t_m) feed B1(µ, ·), producing ∫_0^{0.5} u(t) dt; u(t_m) and u(t_{m+1}), …, u(t_{2m}) feed B2(µ, ·), producing ∫_{0.5}^1 u(t) dt; B3(µ, ·) combines the two outputs into ∫_0^1 u(t) dt.]
Integrate a function over [0, 1] in two steps using nodes 0 ≤ t0 < · · · < t_{m−1} < 0.5, t_m = 0.5, and t_{m+1} < · · · < t_{2m} ≤ 1.
Is ∫_{0.5}^1 u(t) dt independent of u(t0), . . . , u(t_{m−1}) given u(t_m), . . . , u(t_{2m})?
For a Brownian motion prior on the integrand u, yes. For an integrated BM prior on u, i.e. a BM prior on u′, no; a covariance check along these lines is sketched after this slide.
So how do we elicit an appropriate prior that respects the problem’s structure? ! ?
And is being fully Bayesian worth it in terms of cost and robustness? Cf. Owhadi et al. (2015a,b) and Jacob et al. (2017). ! ?
24/41
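Because everything here is jointly Gaussian under either prior, the conditional-independence question reduces to whether a conditional cross-covariance vanishes, and this can be checked numerically. The grid discretisation and node choices below are illustrative assumptions.

```python
import numpy as np

# Is the right-half integral conditionally independent of the left-half node values,
# given the right-half node values?  Zero conditional cross-covariance <=> yes (Gaussian case).
n = 801
t = np.linspace(0.0, 1.0, n)
dt = t[1] - t[0]

K_bm = np.minimum.outer(t, t)                       # Brownian motion prior on u
C = np.tril(np.ones((n, n)), k=-1) * dt             # cumulative integration operator
K_ibm = C @ K_bm @ C.T                              # integrated BM prior on u (BM prior on u')

idx_left = np.searchsorted(t, np.linspace(0.05, 0.45, 5))    # nodes t_0, ..., t_{m-1}
idx_right = np.searchsorted(t, np.linspace(0.50, 1.00, 6))   # nodes t_m, ..., t_{2m}
w2 = np.where(t >= 0.5, dt, 0.0)                    # crude weights: right-half integral as w2^T u

def max_cond_cross_cov(K):
    cov_I_left = w2 @ K[:, idx_left]
    cov_I_right = w2 @ K[:, idx_right]
    K_rr = K[np.ix_(idx_right, idx_right)] + 1e-12 * np.eye(len(idx_right))
    K_rl = K[np.ix_(idx_right, idx_left)]
    cond = cov_I_left - cov_I_right @ np.linalg.solve(K_rr, K_rl)
    return np.max(np.abs(cond))

print("BM prior :", max_cond_cross_cov(K_bm))       # numerically zero: the split is coherent
print("IBM prior:", max_cond_cross_cov(K_ibm))      # clearly nonzero: coherence fails
```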
SLIDE 51
Randomised Bayesian Inverse Problems
SLIDE 52
Randomised Bayesian inverse problems
A Bayesian inverse problem can be seen as a simple pipeline of the form □→■→□→■→□, with the first method being the forward solve and the second the inversion. How much does replacing the traditional deterministic forward solve by a PNM (or just a random surrogate) affect the solution of the BIP, à la Stuart (2010)? Usual posterior µ^y over U with prior µ0 and negative log-likelihood Φ = Φ(·; y): U → R:
dµ^y/dµ0 (u) = exp(−Φ(u)) / Z(y). (1)
Let now ΦN : Ω × U → R be a measurable function that provides a random approximation to Φ, and denote by νN the distribution of ΦN.
25/41
SLIDE 53 Random and deterministic approximate posteriors
Replacing Φ by ΦN in (1), we obtain a random approximation µ^samp_N:
dµ^samp_N/dµ0 (u) := exp(−ΦN(u)) / Z^samp_N,   Z^samp_N := Eµ0[ exp(−ΦN(·)) ]. (2)
Taking the expectation of the random likelihood gives a deterministic approximation:
dµ^marg_N/dµ0 (u) := EνN[ exp(−ΦN(u)) ] / EνN[ Z^samp_N ]. (3)
An alternative deterministic approximation can be obtained by taking the expected value of the density (Z^samp_N)⁻¹ exp(−ΦN(u)) in (2). However, µ^marg_N has a clear interpretation as the posterior obtained by approximating the true data likelihood exp(−Φ(u)) by EνN[ exp(−ΦN(u)) ], and it is more amenable to sampling methods such as pseudo-marginal MCMC (Beaumont, 2003; Andrieu and Roberts, 2009).
26/41
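The following toy sketch, a one-dimensional setting invented purely for illustration and not an example from the slides, computes the exact posterior, one realisation of µ^samp_N and the deterministic µ^marg_N on a grid, together with the Hellinger distances that the convergence results below control.

```python
import numpy as np

# Prior mu0 = N(0,1), exact potential Phi(u) = (y - u)^2 / 2, and a random approximation
# Phi_N(u) = Phi(u) + xi(u) with xi(u) ~ N(0, u^2 / N), independent across draws.
rng = np.random.default_rng(3)
y, N = 1.0, 10
u = np.linspace(-5, 5, 2001)
du = u[1] - u[0]
prior = np.exp(-u**2 / 2)
Phi = (y - u) ** 2 / 2

def normalise(p):
    return p / (p.sum() * du)

post = normalise(prior * np.exp(-Phi))                          # exact posterior

xi = rng.standard_normal(u.shape) * np.abs(u) / np.sqrt(N)      # one realisation of xi(u)
post_samp = normalise(prior * np.exp(-(Phi + xi)))              # sample posterior (random)

mean_lik = np.exp(-Phi + u**2 / (2 * N))                        # E[exp(-Phi_N(u))], lognormal mean
post_marg = normalise(prior * mean_lik)                         # marginal posterior (deterministic)

def hellinger(p, q):
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2) * du)

print("d_H(exact, marginal):", hellinger(post, post_marg))
print("d_H(exact, sample)  :", hellinger(post, post_samp), "(one realisation)")
```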
SLIDE 54–55 Summary of convergence rates
Theorem 14 (Lie, Sullivan, and Teckentrup, 2017) For suitable Hölder exponents p1, p′1, p2, . . . quantifying the integrability of Φ and ΦN, we obtain deterministic convergence µ^marg_N → µ and mean-square convergence µ^samp_N → µ in the Hellinger metric:
dH(µ, µ^marg_N) ≤ C ∥ EνN[ |Φ − ΦN|^{p′2} ]^{1/p′2} ∥_{L^{2p′1 p′3}_{µ0}(U)},
EνN[ dH(µ, µ^samp_N)² ]^{1/2} ≤ D ∥ EνN[ |Φ − ΦN|^{2q′1} ]^{1/(2q′1)} ∥_{L^{2q′2}_{µ0}(U)}.
27/41
SLIDE 56 Convergence of µ^marg_N
Theorem 15 (Deterministic convergence of the marginal posterior) Suppose there exist positive scalars C1, C2, C3, that do not depend on N, such that for the Hölder conjugate exponent pairs (p1, p′1), (p2, p′2), and (p3, p′3), we have
min { ∥ EνN[ e^{−ΦN} ]^{−1} ∥_{L^{p1}_{µ0}(U)}, ∥ e^{Φ} ∥_{L^{p1}_{µ0}(U)} } ≤ C1(p1);
∥ EνN[ (e^{−Φ} + e^{−ΦN})^{p2} ]^{1/p2} ∥_{L^{2p′1 p3}_{µ0}(U)} ≤ C2(p1, p2, p3);
C3^{−1} ≤ EνN[ Z^samp_N ] ≤ C3.
Then there exists a scalar C = C(C1, C2, C3, Z) > 0 that does not depend on N, such that
dH(µ, µ^marg_N) ≤ C ∥ EνN[ |Φ − ΦN|^{p′2} ]^{1/p′2} ∥_{L^{2p′1 p′3}_{µ0}(U)},
C(C1, C2, C3, Z) = ( C1(p1) Z + C3 max{ Z^{−3}, C3³ } ) C2²(p1, p2, p3). 28/41
SLIDE 57 Convergence of µ^samp_N
Theorem 16 (Mean-square convergence of the sample posterior) Suppose there exist positive scalars D1, D2, that do not depend on N, such that for Hölder conjugate exponent pairs (q1, q′1) and (q2, q′2), we have
∥ EνN[ (e^{−Φ/2} + e^{−ΦN/2})^{2q1} ]^{1/q1} ∥_{L^{q2}_{µ0}(U)} ≤ D1(q1, q2);
∥ EνN[ ( Z^samp_N max{ Z^{−3}, (Z^samp_N)^{−3} } (e^{−Φ} + e^{−ΦN})² )^{q1} ]^{1/q1} ∥_{L^{q2}_{µ0}(U)} ≤ D2(q1, q2).
Then
EνN[ dH(µ, µ^samp_N)² ]^{1/2} ≤ (D1 + D2) ∥ EνN[ |Φ − ΦN|^{2q′1} ]^{1/(2q′1)} ∥_{L^{2q′2}_{µ0}(U)}.
29/41
SLIDE 58
Applicability of convergence theorems i
The assumptions of Theorems 15 and 16 are satisfied when the exact potential Φ and the approximation quality ΦN ≈ Φ are suitably well behaved. Recall from (1) that Z is the normalisation constant of µ. Therefore, for µ to be well-defined, we must have 0 < Z < ∞; in particular, there exists 0 < C3 < ∞ such that C3^{−1} < Z < C3.
Assumption 17 There exists C0 ∈ R that does not depend on N such that, for all N ∈ N,
Φ ≥ −C0 and νN({ΦN | ΦN ≥ −C0}) = 1, (4)
and for any 0 < C3 < +∞ with the property that C3^{−1} < Z < C3, there exists N∗(C3) ∈ N such that, for all N ≥ N∗,
Eµ0[ EνN[ |ΦN − Φ| ] ] ≤ (1/(2e^{C0})) min{ Z − C3^{−1}, C3 − Z }. (5)
30/41
SLIDE 59 Applicability of convergence theorems ii
Lemma 18 Suppose that Assumption 17 holds with C0 as in (4) and C3 and N∗(C3) as in (5), that exp(Φ) ∈ L^{p∗}_{µ0}(U) for some 1 ≤ p∗ ≤ +∞ with conjugate exponent (p∗)′, and that there exists some C4 ∈ R that does not depend on N, such that, for all N ∈ N, νN({ ΦN | Eµ0[ΦN] ≤ C4 }) = 1. Then the hypotheses of Theorem 15 hold, with p1 = p∗, p2 = p3 = +∞, C1 = ∥e^Φ∥_{L^{p∗}_{µ0}}, C2 = 2e^{C0}, and C3 as above. Moreover, the hypotheses of Theorem 16 hold, with q1 = q2 = ∞, D1 = 4e^{C0}, D2 = 4e^{3C0} max{ C3^{−3}, e^{3C4} }. 31/41
SLIDE 60 Applicability of convergence theorems iii
Lemma 19 Suppose that Assumption 17 holds with C0 as in (4) and C3 and N∗(C3) as in (5), and that there exists some 2 < ρ∗ < +∞ such that EνN[exp(ρ∗ΦN)] ∈ L¹_{µ0}. Then the hypotheses of Theorem 15 hold, with p1 = ρ∗, p2 = p3 = +∞, C1 = ∥EνN[exp(ρ∗ΦN)]∥^{1/ρ∗}_{L¹_{µ0}}, C2 = 2e^{C0}, and C3 as above. Moreover, the hypotheses of Theorem 16 hold, with q1 = ρ∗/2, q2 = +∞, D1 = 4e^{C0}, D2 = 4e^{2C0} ( C3^{−3} e^{C0} + ∥EνN[e^{ρ∗ΦN}]∥^{2/ρ∗}_{L¹_{µ0}} ).
32/41
SLIDE 61 Example: Monte Carlo approximation of high-dimensional misfits
We consider a Monte Carlo approximation ΦN of a quadratic potential Φ (Nemirovski et al., 2008; Shapiro et al., 2009), further applied and analysed in the MAP estimator context by Le et al. (2017). This approximation is particularly useful for data y ∈ R^J, J ≫ 1.
Φ(u) := (1/2) ∥Γ^{−1/2}(y − G(u))∥²
      = (1/2) (Γ^{−1/2}(y − G(u)))ᵀ E[σσᵀ] (Γ^{−1/2}(y − G(u)))   where E[σ] = 0 ∈ R^J, E[σσᵀ] = I_{J×J}
      = (1/2) E[ (σᵀ Γ^{−1/2}(y − G(u)))² ]
      ≈ (1/(2N)) ∑_{i=1}^{N} (σ^{(i)ᵀ} Γ^{−1/2}(y − G(u)))²   for i.i.d. σ^{(1)}, . . . , σ^{(N)} distributed as σ
      = (1/2) ∥Σ_Nᵀ Γ^{−1/2}(y − G(u))∥²   for Σ_N := (1/√N) [σ^{(1)} · · · σ^{(N)}] ∈ R^{J×N}
      =: ΦN(u).
33/41
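A minimal sketch of the randomised misfit for an assumed toy linear forward map G (the dimensions, noise level and Gaussian choice of σ are illustrative assumptions): ΦN is an unbiased randomisation of Φ whose spread around Φ shrinks like 1/√N.

```python
import numpy as np

# Randomised misfit Phi_N(u) = 0.5 * || Sigma_N^T Gamma^{-1/2} (y - G(u)) ||^2.
rng = np.random.default_rng(4)
J, K, N = 2000, 5, 10
G_mat = rng.standard_normal((J, K)) / np.sqrt(J)        # toy linear forward operator
u_true = rng.standard_normal(K)
y = G_mat @ u_true + 0.05 * rng.standard_normal(J)      # synthetic data; Gamma = I for simplicity

def Phi(u):
    r = y - G_mat @ u
    return 0.5 * r @ r

def Phi_N(u, Sigma_N):
    r = y - G_mat @ u
    s = Sigma_N.T @ r                                   # N-dimensional projected residual
    return 0.5 * s @ s

Sigma_N = rng.standard_normal((J, N)) / np.sqrt(N)      # columns sigma^(i) / sqrt(N), Gaussian choice
u_test = rng.standard_normal(K)
print("Phi  :", Phi(u_test))
print("Phi_N:", Phi_N(u_test, Sigma_N))                 # unbiased for Phi; relative spread ~ sqrt(2/N)
```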
SLIDE 62
Example: Monte Carlo approximation of high-dimensional misfits
The analysis and numerical studies in Le et al. (2017, Sections 3–4) suggest that a good choice for the R^J-valued random vector σ would be one with independent and identically distributed (i.i.d.) entries from a sub-Gaussian probability distribution. Examples of sub-Gaussian distributions considered include the Gaussian distribution, σj ∼ N(0, 1) for j = 1, . . . , J, and the ℓ-sparse distribution: for ℓ ∈ [0, 1), let s := 1/(1 − ℓ) ≥ 1 and set, for j = 1, . . . , J,
σj := √s · ξj,   where ξj = 1 with probability 1/(2s),   0 with probability ℓ = 1 − 1/s,   −1 with probability 1/(2s). 34/41
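A small sampler for the ℓ-sparse distribution just described, checking that its entries have mean zero and unit variance (the parameter values are illustrative):

```python
import numpy as np

# Entries take the values +sqrt(s), 0, -sqrt(s) with s = 1/(1 - ell), so that
# E[sigma_j] = 0 and E[sigma_j^2] = 1.
def sample_sparse_sigma(J, ell, rng):
    s = 1.0 / (1.0 - ell)
    vals = np.array([np.sqrt(s), 0.0, -np.sqrt(s)])
    probs = np.array([1.0 / (2 * s), 1.0 - 1.0 / s, 1.0 / (2 * s)])
    return rng.choice(vals, size=J, p=probs)

rng = np.random.default_rng(5)
sigma = sample_sparse_sigma(J=100_000, ell=0.9, rng=rng)
print("mean ~ 0:", sigma.mean(), " variance ~ 1:", sigma.var(), " sparsity:", np.mean(sigma == 0.0))
```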
SLIDE 63
Example: Monte Carlo approximation of high-dimensional misfits
Le et al. (2017) observe that, for large J and moderate N ≈ 10, the random potential ΦN and the original potential Φ are very similar, in particular having approximately the same minimisers and minimum values. Statistically, this corresponds to the maximum likelihood estimators under Φ and ΦN being very similar; after weighting by a prior, it corresponds to similarity of maximum a posteriori (MAP) estimators. Here, we study the BIP instead of the MAP problem, and thus the corresponding conjecture is that the deterministic posterior dµ(u) ∝ exp(−Φ(u)) dµ0(u) is well approximated by the random posterior dµ^samp_N(u) ∝ exp(−ΦN(u)) dµ0(u).
35/41
SLIDE 64
Example: Well-posedness of BIPs with Monte Carlo misfits
Applying the general Theorem 16 from earlier gives the following transfer of the Monte Carlo convergence rate from the approximation of Φ to the approximation of µ: ✓
Proposition 20 Suppose that the entries of σ are i.i.d. ℓ-sparse, for some ℓ ∈ [0, 1), and that Φ ∈ L²_{µ0}(U). Then there exists a constant C, independent of N, such that
( Eσ[ dH(µ, µ^samp_N)² ] )^{1/2} ≤ C/√N.
For technical reasons to do with the non-compactness of the support and finiteness of MGFs of maxima, the current proof technique does not work for the Gaussian case. ! ?
36/41
SLIDE 65
Closing Remarks
SLIDE 66–67
Closing Remarks
Numerical methods can be characterised in a Bayesian fashion. ✓ This does not coincide with average-case analysis and IBC. ✓ BPNMs can be composed into pipelines, e.g. for inverse problems. ✓ Bayes’ rule as disintegration → (expensive!) numerical implementation. ✓/✗
Lots of room to improve computational cost and bias. ! ?
Departures from the “Bayesian gold standard” can be assessed in terms of a cost–accuracy tradeoff. ! ?
How to choose/design an appropriate (numerically-analytically right) prior? ! ?
Full details and further applications in Cockayne, Oates, Sullivan, and Girolami (2017) arXiv:1702.03673 and Lie, Sullivan, and Teckentrup (2017) arXiv:1712.05717.
Thank You
37/41
SLIDE 68 References i
- N. L. Ackerman, C. E. Freer, and D. M. Roy. On computability and disintegration. Math. Structures Comput. Sci., 27(8):
1287–1314, 2017. doi:10.1017/S0960129516000098.
- C. Andrieu and G. O. Roberts. The pseudo-marginal approach for efficient Monte Carlo computations. Ann. Statist., 37(2):
697–725, 2009. doi:10.1214/07-AOS574.
- M. A. Beaumont. Estimation of population growth or decline in genetically monitored populations. Genetics, 164(3):
1139–1160, 2003.
- P. G. Bissiri, C. C. Holmes, and S. G. Walker. A general framework for updating belief distributions. J. R. Stat. Soc. Ser. B.
Stat. Methodol., 78(5):1103–1130, 2016. doi:10.1111/rssb.12158.
- J. T. Chang and D. Pollard. Conditioning as disintegration. Statist. Neerlandica, 51(3):287–317, 1997.
doi:10.1111/1467-9574.00056.
- J. Cockayne, C. J. Oates, T. J. Sullivan, and M. Girolami. Bayesian probabilistic numerical methods, 2017. arXiv:1702.03673.
- P. R. Conrad, M. Girolami, S. Särkkä, A. M. Stuart, and K. C. Zygalakis. Statistical analysis of differential equations:
introducing probability measures on numerical solutions. Stat. Comput., 27(4), 2016. doi:10.1007/s11222-016-9671-0.
- P. Diaconis. Bayesian numerical analysis. In Statistical Decision Theory and Related Topics, IV, Vol. 1 (West Lafayette, Ind., 1986),
pages 163–175. Springer, New York, 1988.
- M. Giry. A categorical approach to probability theory. In Categorical aspects of topology and analysis (Ottawa, Ont., 1980),
volume 915 of Lecture Notes in Math., pages 68–85. Springer, Berlin-New York, 1982. 38/41
SLIDE 69 References ii
- P. E. Jacob, L. M. Murray, C. C. Holmes, and C. P. Robert. Better together? Statistical learning in models made of modules,
2017. arXiv:1708.08719.
- J. B. Kadane and G. W. Wasilkowski. Average case ϵ-complexity in computer science. A Bayesian view. In Bayesian
Statistics, 2 (Valencia, 1983), pages 361–374. North-Holland, Amsterdam, 1985.
- F. M. Larkin. Optimal approximation in Hilbert spaces with reproducing kernel functions. Math. Comp., 24:911–921, 1970.
doi:10.2307/2004625.
- S. Lauritzen. Graphical Models. Oxford University Press, 1991.
- E. B. Le, A. Myers, T. Bui-Thanh, and Q. P. Nguyen. A data-scalable randomized misfit approach for solving large-scale
PDE-constrained inverse problems. Inverse Probl., 33(6):065003, 2017. doi:10.1088/1361-6420/aa6cbd.
- H. C. Lie, T. J. Sullivan, and A. L. Teckentrup. Random forward models and log-likelihoods in Bayesian inverse problems,
2017. arXiv:1712.05717.
- A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming.
SIAM J. Optim., 19(4):1574–1609, 2008. doi:10.1137/070704277.
- A. O’Hagan. Monte Carlo is fundamentally unsound. Statistician, 36(2/3):247–249, 1987. doi:10.2307/2348519.
- H. Owhadi and C. Scovel. Conditioning Gaussian measure on Hilbert space, 2015. arXiv:1506.04208.
- H. Owhadi, C. Scovel, and T. J. Sullivan. Brittleness of Bayesian inference under finite information in a continuous world.
Electron. J. Stat., 9(1):1–79, 2015a. doi:10.1214/15-EJS989.
- H. Owhadi, C. Scovel, and T. J. Sullivan. On the brittleness of Bayesian inference. SIAM Rev., 57(4):566–582, 2015b.
doi:10.1137/130938633. 39/41
SLIDE 70 References iii
- H. Poincaré. Calcul des Probabilités. Georges Carré, Paris, 1896.
- K. Ritter. Average-Case Analysis of Numerical Problems, volume 1733 of Lecture Notes in Mathematics. Springer-Verlag, Berlin,
2000. doi:10.1007/BFb0103934.
- A. Sard. Best approximate integration formulas; best approximation formulas. Amer. J. Math., 71:80–91, 1949.
doi:10.2307/2372095.
- C. Schillings and C. Schwab. Scaling limits in computational Bayesian inversion. ESAIM Math. Model. Numer. Anal., 50(6):
1825–1856, 2016. doi:10.1051/m2an/2016005.
- A. Shapiro, D. Dentcheva, and A. Ruszczyński. Lectures on Stochastic Programming: Modeling and Theory, volume 9 of
MPS/SIAM Series on Optimization. Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA; Mathematical Programming Society (MPS), Philadelphia, PA, 2009. doi:10.1137/1.9780898718751.
- J. Skilling. Bayesian solution of ordinary differential equations. In C. R. Smith, G. J. Erickson, and P. O. Neudorfer, editors,
Maximum Entropy and Bayesian Methods, volume 50 of Fundamental Theories of Physics, pages 23–37. Springer, 1992. doi:10.1007/978-94-017-2219-3.
- A. M. Stuart. Inverse problems: a Bayesian perspective. Acta Numer., 19:451–559, 2010. doi:10.1017/S0962492910000061.
- A. V. Sul′din. Wiener measure and its applications to approximation methods. I. Izv. Vysš. Učebn. Zaved. Matematika, 6(13):
145–158, 1959.
- A. V. Sul′din. Wiener measure and its applications to approximation methods. II. Izv. Vysš. Učebn. Zaved. Matematika, 5(18):
165–179, 1960. 40/41
SLIDE 71 References iv
- T. Tjur. Probability Based on Radon Measures. John Wiley & Sons, Ltd., Chichester, 1980. Wiley Series in Probability and
Mathematical Statistics.
- J. F. Traub, G. W. Wasilkowski, and H. Woźniakowski. Information-Based Complexity. Computer Science and Scientific
Computing. Academic Press, Inc., Boston, MA, 1988. With contributions by A. G. Werschulz and T. Boult.
- L. N. Trefethen. Is Gauss quadrature better than Clenshaw–Curtis? SIAM Rev., 50(1):67–87, 2008. doi:10.1137/060659831.
- A. Zellner. Optimal information processing and Bayes’s theorem. Amer. Statist., 42(4):278–284, 1988. doi:10.2307/2685143.
41/41