SLIDE 1 What is an Optimal Bayesian Method?
Newcastle University & Lloyd’s Register Foundation – Alan Turing Institute Programme on Data-Centric Engineering. November 2018 @ RICAM, Multivariate Algorithms and Information-Based Complexity.
SLIDE 2
Collaborators
Jon Cockayne (University of Warwick), Mark Girolami (Imperial College London & Alan Turing Institute), Dennis Prangle (Newcastle University), Tim Sullivan (Free University of Berlin & Zuse Institute Berlin)
SLIDE 3
Aims
The aims of this talk are as follows:
◮ To recall that average case analysis and Bayesian decision theory are identical.
◮ To recall that Bayesian decision theory can be cast in a more general Bayesian experimental design framework.
◮ To survey some alternative optimality criteria from the Bayesian experimental design context.
◮ To pose the question of whether these alternative criteria lead to different notions of optimal information in the numerical/IBC context.
◮ To recall the definition of a probabilistic numerical method.
◮ To develop an appropriate optimality criterion for a probabilistic numerical method.
◮ (If time allows) To showcase some recent work on probabilistic numerical methods.
SLIDE 11 Notation
◮ Denote the state space as X.
◮ Consider the task of approximating a quantity of interest φ : X → Φ.
◮ We are allowed to select an experiment e ∈ E.
◮ Information about the state x ∈ X is provided via a random variable Y ∈ Ye with Y ∼ πY|x,e.
◮ This includes the case of deterministic information: Y ∼ δ(ye(x)).
◮ Let de : Ye → Φ denote a numerical method; the set of all allowed methods is De.
Example (Numerical integration)
X = C(0, 1), φ(x) = ∫₀¹ x(t) dt, e = [t0, . . . , tn] with t0 = 0, ti−1 < ti, tn = 1, and ye(x) = [x(t0), . . . , x(tn)],
e.g. histogram method: de(ye(x)) = ∑ᵢ₌₁ⁿ x(ti−1)(ti − ti−1)
e.g. trapezoidal method: de(ye(x)) = ∑ᵢ₌₁ⁿ ½(x(ti−1) + x(ti))(ti − ti−1)
What about optimality in this context?
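The two example quadrature rules above can be sketched in code as follows (a minimal illustration; the uniform grid and the integrand x(t) = sin(πt) are arbitrary choices, not from the slides):

```python
# Histogram and trapezoidal quadrature, computed from the information
# ye(x) = [x(t0), ..., x(tn)] on a grid 0 = t0 < ... < tn = 1.
import math

def histogram_rule(t, y):
    # sum over i of x(t_{i-1}) * (t_i - t_{i-1})
    return sum(y[i - 1] * (t[i] - t[i - 1]) for i in range(1, len(t)))

def trapezoidal_rule(t, y):
    # sum over i of (x(t_{i-1}) + x(t_i)) / 2 * (t_i - t_{i-1})
    return sum((y[i - 1] + y[i]) / 2 * (t[i] - t[i - 1]) for i in range(1, len(t)))

n = 100
t = [i / n for i in range(n + 1)]           # uniform information e = [0, 1/n, ..., 1]
y = [math.sin(math.pi * s) for s in t]      # ye(x) for the test integrand x(t) = sin(pi t)

print(histogram_rule(t, y))    # both approximate the integral of sin(pi t) over [0, 1],
print(trapezoidal_rule(t, y))  # whose exact value is 2/pi
```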
SLIDE 23
Contents
Background
Average Case Analysis
Bayesian Decision Theory
Bayesian Experimental Design
Probabilistic Numerical Methods
Bayesian Probabilistic Numerical Methods
Optimality for a BPNM
Applications and Going Forward
SLIDE 24
Background
SLIDE 25 Average Case Analysis
IBC/ACA: Traub, Wasilkowski, and Woźniakowski [1988]
Idea: “A numerical method should work well for typical problems”
◮ A typical problem is modelled as a random variable X ∈ X with X ∼ πX.
◮ Information is considered to be deterministic: Y ∈ Ye with Y ∼ δ(ye(x)).
◮ Suppose Φ is a normed space with norm ‖·‖Φ.
◮ Average case error: ACEp(de) = ( ∫ ‖φ(x) − de(ye(x))‖Φᵖ dπX(x) )^(1/p)
◮ Average case optimal method, optimal information and minimal error:
d∗e ∈ arg inf_{de ∈ De} ACEp(de),   e∗ ∈ arg inf_{e ∈ E} ACEp(d∗e),   inf_{e ∈ E} ACEp(d∗e)

Example (Numerical integration, continued; Sul’din [1959, 1960])
For πX the standard Wiener process with E[X(t)] = 0, E[X(t)X(t′)] = min(t, t′), the trapezoidal method de(ye(x)) = ∑ᵢ₌₁ⁿ ½(x(ti−1) + x(ti))(ti − ti−1) is average case optimal for p = 2 and Φ = R, ‖ϕ‖Φ = |ϕ|. The optimal information is e = [0, 1/n, 2/n, . . . , 1].
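The average case error of the trapezoidal method under the Wiener prior can be checked by Monte Carlo (a sketch; the true integral is itself approximated on a much finer grid, and the closed-form value 1/(n√12) quoted in the comment is the standard Brownian-bridge calculation, not from the slides):

```python
# Monte Carlo estimate of ACE_2 for the trapezoidal rule with uniform
# information e = [0, 1/n, ..., 1], under the standard Wiener prior.
import random

random.seed(0)

def brownian_path(m):
    # Standard Wiener process sampled at t_j = j/m via independent increments.
    w, path = 0.0, [0.0]
    for _ in range(m):
        w += random.gauss(0.0, 1.0) * (1.0 / m) ** 0.5
        path.append(w)
    return path

def trapezoid(vals, h):
    return sum((vals[i - 1] + vals[i]) / 2 * h for i in range(1, len(vals)))

m, n, reps = 1024, 8, 2000     # fine grid size, information size, MC samples
sq_err = 0.0
for _ in range(reps):
    path = brownian_path(m)
    truth = trapezoid(path, 1.0 / m)                     # proxy for the integral of X
    coarse = [path[j * (m // n)] for j in range(n + 1)]  # information ye(X)
    sq_err += (trapezoid(coarse, 1.0 / n) - truth) ** 2
ace2 = (sq_err / reps) ** 0.5
print(ace2)   # theory: ACE_2 = 1/(n * sqrt(12)), about 0.036 for n = 8
```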
SLIDE 32 Bayesian Decision Theory
BDT: Berger [1985], ...
Idea: “Take the best decision according to your personal belief”
◮ Epistemic uncertainty in the state is modelled as a random variable X ∈ X with X ∼ πX.
◮ Information could be random: Y ∼ πY|x,e.
◮ Consider a space of actions A and a decision rule de : Ye → A.
◮ Consider a loss function ℓ : X × A → [0, ∞].
◮ Bayes risk: BR(de) = ∫∫ ℓ(x, de(y)) dπY|x,e(y) dπX(x)
◮ Bayes rule: d∗e ∈ arg inf_{de ∈ De} BR(de)
◮ Optimal experiment: e∗ ∈ arg inf_{e ∈ E} BR(d∗e)
◮ Average case analysis is the special case with A = Φ, ℓ(x, a) = ‖φ(x) − a‖Φᵖ, πY|x,e = δ(ye(x)) [Kadane and Wasilkowski, 1985].
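Under squared-error loss the Bayes rule is the posterior mean; a Monte Carlo sketch in a conjugate model (the prior, noise level, and sample size are arbitrary choices for illustration):

```python
# Bayes risk in the model X ~ N(0, 1), Y | x ~ N(x, s2): the posterior
# mean beats any other decision rule, here compared against d(y) = y.
import random

random.seed(1)
s2 = 0.5                          # observation noise variance (arbitrary)

def posterior_mean(y):
    # X | y ~ N(y / (1 + s2), s2 / (1 + s2)) for the prior N(0, 1)
    return y / (1.0 + s2)

def bayes_risk(rule, reps=50_000):
    # BR(d) = E_{X,Y}[ (X - d(Y))^2 ], estimated by Monte Carlo
    total = 0.0
    for _ in range(reps):
        x = random.gauss(0.0, 1.0)
        y = x + random.gauss(0.0, s2 ** 0.5)
        total += (x - rule(y)) ** 2
    return total / reps

print(bayes_risk(posterior_mean))   # close to the posterior variance s2/(1+s2) = 1/3
print(bayes_risk(lambda y: y))      # close to s2 = 1/2, strictly worse
```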
SLIDE 41
Remark #1: Characterising a Bayes Rule
A Bayes rule is characterised by the actions a = d∗e(y) that it takes; these are called Bayes acts. Sometimes it is possible to characterise the Bayes act:
SLIDE 42 Remark #1: Characterising a Bayes Rule
Proposition
Consider A = X = Rᵈ. Let ℓ(x, a) = ‖φ(x) − φ(a)‖₂² where φ : X → Rᵐ, m ∈ N. Assume that φ is twice continuously differentiable and that the matrix dφ/da = [∂φᵢ/∂aⱼ] has full row rank at all a ∈ A. Then any Bayes act a ∈ A∗e(y) satisfies
φ(a) = ∫ φ(x) dπX|y,e(x).   (1)
Moreover, if there exists a unique solution to Eqn. (1) and the function φ is coercive, then this solution is a Bayes act.
SLIDE 44 Remark #1: Characterising a Bayes Rule
Example (Linear regression)
◮ Let X ∼ N(µ0, Σ0), X ∈ Rᵈ, and Y|x ∼ N(Ae x, Σ), Y ∈ Rⁿ, where the matrix Σ0 is positive definite and the matrix Ae ∈ Rⁿˣᵈ is determined by the choice of experiment e ∈ E.
◮ Consider a loss ℓ(x, x′) = (x − x′)⊤Λ(x − x′) where Λ is a positive semi-definite matrix with a square root Λ^(1/2).
◮ Then a Bayes decision rule d∗e is defined through the Bayes act(s) a ∈ Rᵈ which, from the Proposition, satisfy Λ^(1/2) a = Λ^(1/2) µy,e, where X|y ∼ N(µy,e, Σe) with
Σe = (Ae⊤ Σ⁻¹ Ae + Σ0⁻¹)⁻¹,   µy,e = Σe (Ae⊤ Σ⁻¹ y + Σ0⁻¹ µ0).
◮ If Λ is positive definite then it also follows from the Proposition that µy,e is the unique Bayes act.
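The linear regression example can be checked numerically in a scalar case (a sketch; d = 1, two observations, and all numbers below are arbitrary choices):

```python
# Compute the posterior mean mu_{y,e} for a scalar linear-Gaussian model
# and check that it minimises the expected posterior squared-error loss,
# i.e. that it is the Bayes act.
import random

random.seed(2)
mu0, s0_2 = 1.0, 2.0            # prior N(mu0, s0_2)
a_e, v = [1.0, 2.0], 0.5        # design: y_i ~ N(a_i * x, v), i = 1, 2
x_true = 0.7
y = [a * x_true + random.gauss(0.0, v ** 0.5) for a in a_e]

# Scalar versions of Sigma_e = (A' S^{-1} A + S0^{-1})^{-1} and mu_{y,e}
prec = sum(a * a for a in a_e) / v + 1.0 / s0_2
sigma_e = 1.0 / prec
mu_ye = sigma_e * (sum(a * yi for a, yi in zip(a_e, y)) / v + mu0 / s0_2)

# Expected posterior loss E[(X - a)^2 | y] = (mu_ye - a)^2 + sigma_e is
# minimised exactly at a = mu_ye; grid search should land next to it.
grid = [i / 100 for i in range(-200, 300)]
best = min(grid, key=lambda a: (mu_ye - a) ** 2 + sigma_e)
print(mu_ye, best)   # best is the grid point nearest to mu_ye
```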
SLIDE 48 Remark #2: Admissibility
Other, more adversarial notions of optimality for general decision rules, such as admissibility, need not coincide with the Bayesian notion of optimality. A decision rule de ∈ De is called admissible if there exists no d′e ∈ De such that
∫ ℓ(x, d′e(y)) dπY|x,e(y) ≤ ∫ ℓ(x, de(y)) dπY|x,e(y)
for all x ∈ X, with strict inequality for some x ∈ X.

Example
Consider estimation of x ∈ R based on Y|x ∼ N(x, 1) and with ℓ(x, x′) = (x − x′)². An admissible decision rule is d(y) = y, but this is not a Bayes rule for any proper prior πX on R.
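Although d(y) = y is not a Bayes rule for any proper prior, it is the limit of Bayes rules: under the prior N(0, τ²) and Y|x ∼ N(x, 1), the Bayes rule for squared loss is the posterior mean d(y) = τ²/(1 + τ²) · y, a standard conjugate-Gaussian fact. A quick numerical check:

```python
# Shrinkage coefficient of the Bayes rule under priors N(0, tau2):
# tau2 / (1 + tau2) tends to 1 as tau2 grows, recovering d(y) = y.
shrinkage = [tau2 / (1.0 + tau2) for tau2 in (1.0, 10.0, 100.0, 10_000.0)]
print(shrinkage)   # increases towards 1
```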
SLIDE 51 Bayesian Experimental Design
BED: Chaloner and Verdinelli [1995], ...
Idea: “Design the experiment that I believe will be the most useful”
Consider a utility function u(e, x, y) and let

    BED(e) = − ∫∫ u(e, x, y) dπY|x,e(y) dπX(x)

An optimal experiment is defined as e∗ ∈ arg inf_{e∈E} BED(e).
N.B. This contains Bayesian decision theory as a special case with

    u(e, y) = − ∫ ℓ(x′, d∗e(y)) dπX|y,e(x′)

since then

    BED(e) = ∫∫∫ ℓ(x′, d∗e(y)) dπX|y,e(x′) dπY|x,e(y) dπX(x)
           = ∫∫ ℓ(x′, d∗e(y)) dπX|y,e(x′) dπY|e(y)
           = ∫∫ ℓ(x′, d∗e(y)) dπY|x′,e(y) dπX(x′) = BR(e, d∗e)
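For the conjugate Gaussian model the identity BED(e) = BR(e, d∗e) can be verified by Monte Carlo: averaging the loss of the Bayes act over the joint distribution of (x, y) recovers tr(Σe). A minimal sketch, assuming a standard Gaussian prior, unit observation noise, squared-error loss, and an illustrative design matrix:

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 2, 4

# Hypothetical linear-Gaussian experiment: prior N(0, I), noise N(0, I).
A = rng.standard_normal((n, d))
Sigma_e = np.linalg.inv(A.T @ A + np.eye(d))     # posterior covariance

# BED(e) = E_x E_{y|x} [ loss of the Bayes act d*_e(y) = mu_{y,e} ]; for
# squared-error loss this Bayes risk collapses to tr(Sigma_e).
m = 100_000
x = rng.standard_normal((m, d))                  # x ~ prior, one per row
y = x @ A.T + rng.standard_normal((m, n))        # y | x, one per row
mu = y @ A @ Sigma_e                             # row-wise mu_{y,e} = Sigma_e A^T y
bed_mc = np.mean(np.sum((x - mu) ** 2, axis=1))  # Monte Carlo estimate of BED(e)
bed_exact = np.trace(Sigma_e)
```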
SLIDE 55 Remark #3: Comment on Practical Issues
In practice the criterion BED(e) = BR(e, d∗e) is difficult to compute, as for each experiment e ∈ E an optimisation is required to identify a Bayes act.
Consider X = R^m. A number of approximations have been developed, based on a Gaussian approximation πX|y,e ≈ N(µy,e, Σe):

    BED(e) ≈ tr(ΛΣe)                  (A-optimal)
    BED(e) ≈ det(Λ^{1/2} Σe Λ^{1/2})  (D-optimal)
    . . .

These are called the alphabet criteria, for some positive semi-definite matrix Λ with square root Λ^{1/2}.
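The alphabet criteria are cheap to evaluate once Σe is available. The sketch below selects among a handful of hypothetical designs by the A- and D-criteria (the design space, dimensions and matrices are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(3)
d = 3

def posterior_cov(A):
    # Sigma_e for a prior N(mu0, I) and unit observation noise (hypothetical setup).
    return np.linalg.inv(A.T @ A + np.eye(d))

# A small hypothetical design space E: each experiment e is a 4 x d design matrix.
designs = [rng.standard_normal((4, d)) for _ in range(5)]
Lam = np.eye(d)

a_crit = [np.trace(Lam @ posterior_cov(A)) for A in designs]   # A-criterion tr(Lam Sigma_e)
d_crit = [np.linalg.det(posterior_cov(A)) for A in designs]    # D-criterion (here Lam = I)
e_a_star = int(np.argmin(a_crit))                              # A-optimal experiment
e_d_star = int(np.argmin(d_crit))                              # D-optimal experiment
```

The two criteria need not pick the same experiment, which is one reason the choice of utility matters.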
Example (Bayes A-, c- and E-optimal)
◮ Consider the loss ℓ(x, x′) = ‖x − x′‖²_Λ, where ‖x‖_Λ := ‖Λ^{1/2} x‖₂.
◮ Suppose that the posterior πX|y,e is a Gaussian N(µy,e, Σe).
◮ Then a Bayes decision rule d∗e is defined through the Bayes act(s) a, which satisfy Λ^{1/2} a = Λ^{1/2} µy,e due to Remark #1.
◮ Now observe that, for a Bayes act,

    ∫ ℓ(x, a) dπX|y,e(x) = ∫ (x − µy,e)^⊤ Λ (x − µy,e) dπX|y,e(x) = tr(ΛΣe),

which is independent of y.
◮ It follows that BR(e, d∗e) = tr(ΛΣe).
◮ Selecting e to minimise tr(ΛΣe), or tr(Σe) in the common case where Λ = I, is called Bayes A-optimal [Owen, 1970, Brooks, 1972, 1974, 1976, Duncan and DeGroot, 1976, Brooks, 1977].
◮ In the special case where Λ = cc^⊤ is a rank-1 matrix, the criterion is called Bayes c-optimal [El-Krunz and Studden, 1991]. Relatedly, minimisation of sup_{‖c‖=1} tr(cc^⊤ Σe) is called Bayes E-optimal [Chaloner, 1984].
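The c- and E-criteria reduce to simple linear algebra: for Λ = cc^⊤ the trace collapses to the quadratic form c^⊤ Σe c, and the supremum over unit vectors is the largest eigenvalue of Σe. A sketch with a hypothetical posterior covariance:

```python
import numpy as np

rng = np.random.default_rng(4)
d = 3
M = rng.standard_normal((d, d))
Sigma_e = M @ M.T + np.eye(d)      # a hypothetical (positive definite) posterior covariance

c = rng.standard_normal(d)
c /= np.linalg.norm(c)             # a unit direction of interest

# c-optimal criterion: Lam = c c^T is rank one, so tr(c c^T Sigma_e) = c^T Sigma_e c.
c_crit = np.trace(np.outer(c, c) @ Sigma_e)

# E-optimal criterion: sup over unit c of c^T Sigma_e c = largest eigenvalue of Sigma_e.
e_crit = np.max(np.linalg.eigvalsh(Sigma_e))
```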
Example (Bayes D-optimal)
◮ Consider the loss ℓ(x, x′) = 0 if ‖x − x′‖_Λ ≤ ε and ℓ(x, x′) = 1 otherwise.
◮ Suppose that the posterior πX|y,e is a Gaussian N(µy,e, Σe).
◮ A Bayes decision rule d∗e is defined through the Bayes act(s), which one can verify include µy,e.
◮ Now observe that

    ∫ ℓ(x, µy,e) dπX|y,e(x) = ∫ 1{‖x − µy,e‖_Λ > ε} dπX|y,e(x),

which equals one minus the probability that ‖Z‖₂ ≤ ε, where Z ∼ N(0, Λ^{1/2} Σe Λ^{1/2}).
◮ For small ε, this is 1 − O(c_d^{−1} det(Λ^{1/2} Σe Λ^{1/2})^{−1/2} ε^d), where c_d := 2^{d/2} Γ(d/2 + 1).
◮ Note in particular that this is independent of y. It follows that BR(e, d∗e) is minimised when det(Λ^{1/2} Σe Λ^{1/2}) is minimised. This criterion is called Bayes D-optimal [Tiao and Afonja, 1976].
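The small-ball approximation behind the D-criterion can be checked by simulation: for small ε the mass that N(0, S) places in the ε-ball scales like ε^d / (c_d det(S)^{1/2}), so shrinking det(S) shrinks the 0-1 Bayes risk. A sketch with an illustrative 2 × 2 covariance S standing in for Λ^{1/2} Σe Λ^{1/2}:

```python
import numpy as np
from math import gamma

rng = np.random.default_rng(5)
d = 2
S = np.array([[1.0, 0.3],
              [0.3, 0.5]])                 # hypothetical Lam^{1/2} Sigma_e Lam^{1/2}
eps = 0.05

# Monte Carlo estimate of P(||Z||_2 <= eps) for Z ~ N(0, S).
Z = rng.multivariate_normal(np.zeros(d), S, size=1_000_000)
p_mc = np.mean(np.sum(Z ** 2, axis=1) <= eps ** 2)

# Small-ball asymptotic: P(||Z||_2 <= eps) ~ eps^d / (c_d det(S)^{1/2}),
# with c_d = 2^{d/2} Gamma(d/2 + 1).
c_d = 2 ** (d / 2) * gamma(d / 2 + 1)
p_asym = eps ** d / (c_d * np.sqrt(np.linalg.det(S)))
```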
SLIDE 70 Bayesian Experimental Design
Bayesian experimental design is a more general framework, and it may be interesting to ask how optimal experiments depend on the choice of utility, and whether some utilities lead to different results in terms of IBC.
Next in this talk: Probabilistic numerical methods, and why for these methods the experimental design framework may be more appropriate than the decision-theoretic framework.
SLIDE 72
Probabilistic Numerical Methods
SLIDE 73
Probabilistic Numerical Methods
PN: Larkin [1972], Diaconis [1988], O’Hagan [1992], ...
Idea: “Perform formal, systematic uncertainty quantification for the quantity of interest φ(x)”
◮ Probabilities are arguably the most popular framework for UQ.
◮ Consider for the moment a deterministic information map ye : X → Ye.
◮ A probabilistic numerical method is a map De : Ye → PΦ, where PΦ is the set of distributions on Φ.
◮ Note that this contains deterministic decision rules as a special case: De(y) = δ(de(y)).
◮ In fact, this is mathematically identical to a randomised decision rule in BDT.
SLIDE 78 Bayesian Probabilistic Numerical Methods
◮ Let {πX|y,e}_{y∈Ye} denote a disintegration of πX along the map ye : X → Ye.
◮ Let φ# denote the push-forward (i.e. φ#π(S) = π(φ^{−1}(S))).
◮ A probabilistic numerical method is Bayesian if De(y) = φ#πX|y,e for πY|e-a.a. y ∈ Ye, for some distribution πX, called a prior.
Example (Numerical integration)
e.g. the Bayesian quadrature method [Larkin, 1972]

    De(ye(x)) = N( Σ_{i=1}^n (x(t_{i−1}) + x(t_i))/2 · (t_i − t_{i−1}),  Σ_{i=1}^n (t_i − t_{i−1})³/12 )

is obtained by disintegrating the standard Wiener prior πX.
◮ In general one can also consider randomness Y ∼ πY|x,e, but this is atypical in a traditional numerical task.
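The Bayesian quadrature posterior above is easy to implement: the mean is the trapezoid rule and the variance accumulates (t_i − t_{i−1})³/12 over the intervals. A minimal sketch, assuming observations on a grid covering [0, 1] with x(0) = 0 (consistent with the standard Wiener prior); the grid and test integrand are illustrative:

```python
import numpy as np

def bq_wiener(t, x):
    """Posterior N(mean, var) for the integral of x over [0, 1], given values
    x(t_i) at grid points t (including the endpoints), under the Wiener prior."""
    dt = np.diff(t)
    mean = np.sum((x[:-1] + x[1:]) / 2 * dt)   # trapezoid rule = posterior mean
    var = np.sum(dt ** 3) / 12                 # per-interval Brownian-bridge variance
    return mean, var

t = np.linspace(0.0, 1.0, 11)
x = np.sin(2 * np.pi * t)                      # test integrand, observed at the t_i
mean, var = bq_wiener(t, x)
```

Here the true integral of sin(2πs) over [0, 1] is 0, and the posterior variance quantifies the residual uncertainty from seeing only 11 function values.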
SLIDE 83 Remark #4: Bayes is Optimal
Q: Why are we particularly interested in Bayesian probabilistic numerical methods? Several reasons (mainly interpretation and composability), but here is a cute optimality argument due to Bernardo [1979]:
◮ Consider Bayesian decision theory with A = PX.
◮ Consider a loss of the form ℓ(x, a) = − log (da/dπX)(x), which is an example of a proper scoring rule.
◮ Then a Bayes rule is de(y) = πX|y,e for each y ∈ Ye, and the Bayes risk is − ∫ DKL(πX|y,e ‖ πX) dπY|e(y). So if πX is our prior belief, then (at least in this sense) posteriors are the right thing to report.
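Bernardo's argument can be illustrated numerically: among candidate distributional reports scored by the log loss, the expected loss under the posterior is smallest when the report equals the posterior. The sketch below uses a hypothetical posterior N(1, 0.5²) and the density form of the score, −log a(x), which differs from −log (da/dπX)(x) only by a term that does not depend on the report:

```python
import numpy as np

rng = np.random.default_rng(6)

# Hypothetical posterior X | y, e ~ N(1, 0.5^2); samples stand in for integration.
mu_post, sd_post = 1.0, 0.5
x = rng.normal(mu_post, sd_post, 200_000)

def neg_log_gauss(x, mu, sd):
    # Log loss -log a(x) for a Gaussian report a = N(mu, sd^2).
    return 0.5 * np.log(2 * np.pi * sd ** 2) + (x - mu) ** 2 / (2 * sd ** 2)

# Candidate reports (mean, sd); the posterior itself is among them.
candidates = [(0.0, 0.5), (1.0, 0.5), (1.0, 1.0), (2.0, 0.3)]
scores = [np.mean(neg_log_gauss(x, m, s)) for (m, s) in candidates]
best = candidates[int(np.argmin(scores))]
```

Because the log score is strictly proper, the posterior wins whenever it is in the candidate set.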
SLIDE 89
(Some) Existing Work on Bayesian PNM
Integration:
◮ Briol F-X, CJO, Girolami M, Osborne MA, Sejdinovic D. Probabilistic Integration: A Role in Statistical Computation? (with discussion and rejoinder). Statistical Science, 2018.
◮ Xi X, Briol F-X, Girolami M. Bayesian Quadrature for Multiple Related Integrals. ICML 2018.
◮ CJO, Niederer S, Lee A, Briol F-X, Girolami M. Probabilistic Models for Integration Error in Assessment of Functional Cardiac Models. NIPS 2017.
◮ Karvonen T, Särkkä S. Fully Symmetric Kernel Quadrature. SIAM Journal on Scientific Computing, 40(2), pp. A697–A720, 2018.
◮ Kanagawa M, Sriperumbudur BK, Fukumizu K. Convergence Guarantees for Kernel-Based Quadrature Rules in Misspecified Settings. NIPS 2016.
◮ Jagadeeswaran R, Hickernell FJ. Fast Automatic Bayesian Cubature Using Lattice Sampling. arXiv:1809.09803, 2018.
Differential Equations:
◮ Owhadi H. Bayesian Numerical Homogenization. Multiscale Modeling & Simulation, 13(3), pp. 812–828, 2015.
◮ CJO, Cockayne J, Aykroyd RG, Girolami M. Bayesian Probabilistic Numerical Methods for Industrial Process Monitoring. arXiv:1707.06107
◮ Cockayne J, CJO, Sullivan T, Girolami M. Probabilistic Meshless Methods for Bayesian Inverse Problems. arXiv:1605.07811
Linear Solvers:
◮ Cockayne J, CJO, Ipsen I, Girolami M. A Bayesian Conjugate Gradient Method. arXiv:1801.05242
◮ Bartels S, Cockayne J, Ipsen I, Girolami M, Hennig P. Probabilistic Linear Solvers: A Unifying View.
Not many papers focus on optimality at this point.
SLIDE 91 Desiderata for a PNM
Some desiderata for a probabilistic numerical method:
1. Should be Bayesian for some prior πX
2. Should “put mass close” to the true quantity of interest φ(x)
3. Should be “well-calibrated”
4. Should be “easy to optimise” over e ∈ E
N.B. These cannot all be satisfied in general! Note that Bayesian decision theory does not fully account for (1), (2), (3) or (4).
[Figure: a posterior πX|y,e over X, a Bayes act a, the truth x, and the loss ℓ(x, a) = (x − a)².]
Our plan, in the remainder of the talk, is to build an optimality criterion based on (1), (2) and (4), and then to defer to the analyst for (3) through the choice of πX.
SLIDE 99 An Optimality Criterion
A recent proposal in Cockayne, CJO, Sullivan, Girolami, to appear in SIAM Review, 2018:
Idea: Consider a loss function ℓ(x, x′) and use this to define a utility
u(e, x, y) = − ∫ ℓ(x, x′) dπX|y,e(x′).
N.B. this could encode a quantity of interest, e.g. ℓ(x, x′) = ‖φ(x) − φ(x′)‖²Φ.
Then consider the implied Bayesian experimental design criterion
BPN(e) := BED(e) = ∫∫∫ ℓ(x, x′) dπX|y,e(x′) dπY|x,e(y) dπX(x),   E∗BPN := arg inf e∈E BPN(e).
[Figure: posterior πX|y,e with draw x′ and true state x, under the loss ℓ(x, x′) = (x − x′)²]
SLIDE 101 Desiderata (2): Should “put mass close” to the true quantity of interest φ(x)
To build intuition, let
◮ X = Rᵈ
◮ φ(x) = x and ℓ(x, x′) = ‖x − x′‖₂^p
Then
BPN(e) = ∫∫∫ ℓ(x, x′) dπX|y,e(x′) dπY|x,e(y) dπX(x),
where the inner integral satisfies ∫ ℓ(x, x′) dπX|y,e(x′) = DW,p(δ(x), πX|y,e)^p and DW,p is the p-th Wasserstein distance.
Note that the BPN criterion does not capture the statistical notion of the posterior being well-calibrated.
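As a quick numerical sanity check (my own sketch, not from the slides), the inner integral with ℓ(x, x′) = |x − x′| can be compared against the 1-Wasserstein distance between a Dirac at x and a sampled posterior:

```python
# Sanity check (illustrative): with l(x, x') = |x - x'|, the inner integral
# of BPN, int |x - x'| dpi_{X|y,e}(x'), equals the 1-Wasserstein distance
# D_{W,1}(delta_x, pi_{X|y,e}). Compare against scipy on a sampled Gaussian
# stand-in for the posterior.
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
x = 0.3                                               # true state
posterior_sample = rng.normal(0.5, 0.2, size=10_000)  # draws from pi_{X|y,e}

inner_integral = np.mean(np.abs(posterior_sample - x))  # int |x - x'| dpi(x')
w1 = wasserstein_distance([x], posterior_sample)        # D_{W,1}(delta_x, pi)
```

The two quantities agree up to floating-point error, since the optimal coupling against a point mass simply transports every posterior atom to x.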
SLIDE 104 Desiderata (4): Should be “easy to optimise” over e ∈ E
Re-write the BPN objective as follows:
BPN(e) = ∫∫∫ ℓ(x, x′) dπX|y,e(x′) dπX|y,e(x) dπY|e(y).   (1)
Thus BPN(e) can be “easily” approximated using (Markov chain) Monte Carlo methods to sample from πX|y,e.
Example (Numerical integration, continued)
Consider ℓ(x, x′) = ∫₀¹ (x(t) − x′(t))² dt.
Note that the posterior πX|y,e is a collection of independent Brownian bridges Xi : [ti−1, ti] → R with Xi(ti−1) = x(ti−1) and Xi(ti) = x(ti), with covariance function
cov(Xi(t), Xi(t′)) = (ti − t′)(t − ti−1) / (ti − ti−1),   ti−1 ≤ t ≤ t′ ≤ ti.
Through some calculation we end up with an optimal experiment ti = i/n.
It is interesting to observe that in this case BED(e) = 2 BR(e, d∗e). In particular, we must have the same optimal information as recovered earlier with ACA/BDT.
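The nested Monte Carlo approximation of (1) can be sketched on a toy conjugate-Gaussian problem (my own illustration; the model and all names are assumptions, not from the talk). Drawing x from the prior and then y | x yields y ~ πY|e marginally, and the Gaussian posterior gives an exact value to compare against:

```python
# Nested Monte Carlo estimate of BPN(e) as in (1), on a toy conjugate model:
# X ~ N(0, 1), Y | X ~ N(X, sigma^2), squared-error loss. The posterior is
# N(y / (1 + sigma^2), sigma^2 / (1 + sigma^2)), so the exact value
# BPN(e) = E[(x - x')^2] = 2 Var(X | y) is available in closed form.
import numpy as np

rng = np.random.default_rng(1)
sigma2 = 0.5                               # observation noise (the "design" e)
post_var = sigma2 / (1.0 + sigma2)         # posterior variance (y-independent)

n_outer, n_inner = 200, 200
total = 0.0
for _ in range(n_outer):
    x_prior = rng.normal()                           # x ~ pi_X
    y = x_prior + rng.normal(scale=np.sqrt(sigma2))  # y ~ pi_{Y|x,e}
    post_mean = y / (1.0 + sigma2)
    x1 = rng.normal(post_mean, np.sqrt(post_var), size=n_inner)  # x  ~ pi_{X|y,e}
    x2 = rng.normal(post_mean, np.sqrt(post_var), size=n_inner)  # x' ~ pi_{X|y,e}
    total += np.mean((x1 - x2) ** 2)                 # average of l(x, x')
bpn_estimate = total / n_outer

bpn_exact = 2.0 * post_var   # = 2/3 for sigma2 = 0.5
```

With 200 × 200 loss evaluations the estimate lands within a few percent of the exact value, illustrating why sampling from πX|y,e is all that the criterion requires.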
SLIDE 109 Desiderata (4): Should be “easy to optimise” over e ∈ E
Take a linear elliptic PDE:
∆x(t) = f(t), t ∈ (0, 1)²
x(t) = 0, t2 ∈ {0, 1}
∂x(t)/∂t1 = 0, t1 ∈ {0, 1}
Consider the loss ℓ(x, x′) = ( ∫ |x(t) − x′(t)|^p dt )^{1/p}. The prior πX was taken to be Gaussian with E[X(t)] = 0 and E[X(t)X(t′)] = exp(−‖t − t′‖₂²).
The optimal experiment can be “easily” approximated in a greedy manner.
[Figures: BPN-greedy designs, with points t1, t2, . . . selected sequentially; shown for p = 2 (t1 up to t9) and for p → ∞ (t1 up to t18).]
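A hedged sketch of how such a greedy loop might look (a toy Gaussian-process stand-in on (0,1)² with the stated kernel, not the authors' PDE implementation). For p = 2 and squared loss, minimising a quadrature proxy for BPN over the next design point reduces to minimising the integrated posterior variance:

```python
# Toy greedy design on (0,1)^2: zero-mean GP prior with kernel
# exp(-||t - t'||_2^2) and noiseless pointwise observations. For p = 2 the
# BPN criterion is proportional to the integrated posterior variance, which
# we approximate by averaging the posterior variance over a random grid.
# Illustrative stand-in only, not the PDE solver used in the talk.
import numpy as np

def kernel(A, B):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2)

rng = np.random.default_rng(2)
grid = rng.uniform(size=(400, 2))   # candidates doubling as quadrature nodes
prior_var = np.ones(len(grid))      # k(t, t) = 1 for this kernel

design = []                         # indices of greedily chosen design points
for _ in range(5):
    best_j, best_score = None, np.inf
    for j in range(len(grid)):
        if j in design:
            continue
        idx = design + [j]
        K = kernel(grid[idx], grid[idx]) + 1e-8 * np.eye(len(idx))
        k_star = kernel(grid, grid[idx])
        # posterior variance at every grid point after observing at grid[idx]
        post_var = prior_var - np.einsum(
            "ij,jk,ik->i", k_star, np.linalg.inv(K), k_star)
        score = post_var.mean()     # proxy for BPN(e) with p = 2
        if score < best_score:
            best_j, best_score = j, score
    design.append(best_j)
```

Each outer iteration scans all remaining candidates and keeps the one that shrinks the averaged posterior variance most, which is what "greedy" means here; a full optimisation over all n-point designs would be combinatorial.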
SLIDE 136 Relationship to Bayesian Decision Theory (/Average Case Analysis)
Recall that for Sul’din’s numerical integration example the optimal information from BDT/ACA coincided with the optimal information under BPN. It is therefore interesting to ask whether E∗BPN = E∗BDT in general.
Proposition (A positive result)
Consider a loss function of the form ℓ(x, x′) = ‖φ(x) − φ(x′)‖²Φ, where φ : X → Φ takes values in an inner product space Φ, with inner product ⟨·, ·⟩Φ and induced norm ‖ϕ‖Φ = ⟨ϕ, ϕ⟩Φ^{1/2}. Suppose that any Bayes act a ∈ A∗e(ye) satisfies∗
φ(a) = ∫ φ(x) dπX|y,e(x).   (2)
Then E∗BPN = E∗BDT.
∗From the earlier Proposition a sufficient condition is A = X = Rᵈ, φ is twice continuously differentiable and the matrix dφ/da has full row rank.
SLIDE 139
Relationship to Bayesian Decision Theory (/Average Case Analysis)
Proposition (A negative result)
Suppose that the state space X can be partitioned into three disjoint subsets, each with positive probability under πX. Then there exists a loss function ℓ and a set of candidate experiments E such that E∗BPN ≠ E∗BDT.
Potential open questions:
◮ An explicit characterisation of the loss functions ℓ for which E∗BPN = E∗BDT is not, at least to my knowledge, available at present.
◮ The analytic intractability of optimal experiments in all but the simplest of numerical tasks leaves it unclear whether there exists a numerical task of practical importance for which E∗BPN ≠ E∗BDT.
◮ Any utility function u from the Bayesian experimental design literature provides a criterion BED(e) that can be studied from an information-based complexity standpoint (analogous to the nth minimal error in ACA).
SLIDE 143 Some Recent Applications of Probabilistic Numerical Methods
CJO, Cockayne, Aykroyd, Girolami (2018). Bayesian Probabilistic Numerical Methods in Time-Dependent State Estimation for Industrial Hydrocyclone Equipment. arXiv:1707.06107
SLIDE 144 Some Recent Applications of Probabilistic Numerical Methods
CJO, Niederer S, Lee A, Briol F-X, Girolami M. Probabilistic Models for Integration Error in Assessment of Functional Cardiac Models. In NIPS 2017.
SLIDE 145
Conclusion
In this talk we argued that
◮ Bayesian experimental design is quite general and could motivate alternative notions of optimality to be studied in IBC.
◮ Probabilistic numerical methods have different desiderata compared to classical numerical methods...
◮ ... but nevertheless sometimes∗ their optimal information coincides (∗in search of an “if and only if” result).
Thank you for your attention!
SLIDE 147 References I
J. O. Berger. Statistical Decision Theory and Bayesian Analysis. Springer Science & Business Media, 1985.
J. M. Bernardo. Expected information as expected utility. The Annals of Statistics, pages 686–690, 1979.
R. Brooks. A decision theory approach to optimal regression designs. Biometrika, 59(3):563–571, 1972.
R. Brooks. On the choice of an experiment for prediction in linear regression. Biometrika, 61(2):303–311, 1974.
R. Brooks. Optimal regression designs for prediction when prior knowledge is available. Metrika, 23(1):221–230, 1976.
R. Brooks. Optimal regression design for control in linear regression. Biometrika, 64(2):319–325, 1977.
K. Chaloner. Optimal Bayesian experimental design for linear models. The Annals of Statistics, pages 283–300, 1984.
K. Chaloner and I. Verdinelli. Bayesian experimental design: a review. Statistical Science, pages 273–304, 1995.
P. Diaconis. Bayesian numerical analysis. In Statistical Decision Theory and Related Topics IV, volume 1, pages 163–175. Springer-Verlag New York, 1988.
G. Duncan and M. DeGroot. A mean squared error approach to optimal design theory. In Proceedings of the 1976 Conference on Information: Science and Systems, pages 217–221. The Johns Hopkins University, 1976.
S. M. El-Krunz and W. J. Studden. Bayesian optimal designs for linear regression models. The Annals of Statistics, pages 2183–2208, 1991.
J. B. Kadane and G. W. Wasilkowski. Average case ε-complexity in computer science: a Bayesian view. In Bayesian Statistics, pages 361–374. Elsevier, North-Holland, 1985.
F. M. Larkin. Gaussian measure in Hilbert space and applications in numerical analysis. The Rocky Mountain Journal of Mathematics, 2(3):379–421, 1972.
A. O’Hagan. Some Bayesian numerical analysis. Bayesian Statistics, 4:345–363, 1992.
R. Owen. The optimum design of a two-factor experiment using prior information. The Annals of Mathematical Statistics, pages 1917–1934, 1970.
SLIDE 148 References II
A. V. Sul’din. Wiener measure and its applications to approximation methods. I. Izv. Vysš. Učebn. Zaved. Matematika, 6(13):145–158, 1959.
A. V. Sul’din. Wiener measure and its applications to approximation methods. II. Izv. Vysš. Učebn. Zaved. Matematika, 5(18):165–179, 1960.
G. C. Tiao and B. Afonja. Some Bayesian considerations of the choice of design for ranking, selection and estimation. Annals of the Institute of Statistical Mathematics, 28:167–186, 1976.
J. F. Traub, G. W. Wasilkowski, and H. Woźniakowski. Information-Based Complexity. Academic Press, New York, 1988.