SLIDE 1

What is an Optimal Bayesian Method?

  • Chris J. Oates

Newcastle University & Lloyd's Register Foundation – Alan Turing Institute Programme on Data-Centric Engineering
November 2018 @ RICAM, Multivariate Algorithms and Information-Based Complexity

SLIDE 2

Collaborators

◮ Jon Cockayne (University of Warwick)
◮ Mark Girolami (Imperial College London & Alan Turing Institute)
◮ Dennis Prangle (Newcastle University)
◮ Tim Sullivan (Free University of Berlin & Zuse Institute Berlin)

SLIDE 3

Aims

The aims of this talk are as follows:

◮ To recall that average case analysis and Bayesian decision theory are identical.
◮ To recall that Bayesian decision theory can be cast in a more general Bayesian experimental design framework.
◮ To survey some alternative optimality criteria from the Bayesian experimental design context.
◮ To pose the question of whether these alternative criteria lead to different notions of optimal information in the numerical/IBC context.
◮ To recall the definition of a probabilistic numerical method.
◮ To develop an appropriate optimality criterion for a probabilistic numerical method.
◮ (If time allows) To showcase some recent work on probabilistic numerical methods.

SLIDE 11

Notation

◮ Denote the state space by X.
◮ Consider the task of approximating a quantity of interest φ : X → Φ.
◮ We are allowed to select an experiment e ∈ E.
◮ Information about the state x ∈ X is provided via a random variable Y ∈ Y_e with Y ∼ π_{Y|x,e}.
  ◮ This includes the case of a deterministic function: Y ∼ δ(y_e(x)).
◮ Let d_e : Y_e → Φ denote a numerical method. The set of all allowed methods is D_e.

Example (Numerical integration)

X = C(0, 1), φ(x) = ∫_0^1 x(t) dt, e = [t_0, . . . , t_n] with t_0 = 0, t_{i−1} < t_i, t_n = 1, and y_e(x) = [x(t_0), . . . , x(t_n)].

e.g. the histogram method: d_e(y_e(x)) = Σ_{i=1}^n x(t_{i−1}) (t_i − t_{i−1})

e.g. the trapezoidal method: d_e(y_e(x)) = Σ_{i=1}^n ((x(t_{i−1}) + x(t_i)) / 2) (t_i − t_{i−1})

What about optimality in this context?
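The two quadrature rules above can be sketched numerically. This is an illustrative check only: the integrand x(t) = exp(t) and the uniform grid are assumptions of mine, not taken from the slides.

```python
import numpy as np

# Illustrative sketch of the two methods d_e on the slide, applied to the
# (assumed) integrand x(t) = exp(t) on [0, 1], whose integral is e - 1.
t = np.linspace(0.0, 1.0, 11)      # experiment e = [t_0, ..., t_n] with n = 10
y = np.exp(t)                      # information y_e(x) = [x(t_0), ..., x(t_n)]
dt = np.diff(t)                    # interval lengths t_i - t_{i-1}

histogram = np.sum(y[:-1] * dt)                  # sum of x(t_{i-1}) (t_i - t_{i-1})
trapezoid = np.sum((y[:-1] + y[1:]) / 2.0 * dt)  # sum of (x(t_{i-1}) + x(t_i))/2 (t_i - t_{i-1})

truth = np.e - 1.0                 # phi(x) = int_0^1 x(t) dt
print(abs(histogram - truth), abs(trapezoid - truth))
```

For this smooth integrand the trapezoidal method is markedly more accurate than the histogram method, which previews the optimality discussion that follows.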

SLIDE 23

Contents

◮ Background
◮ Average Case Analysis
◮ Bayesian Decision Theory
◮ Bayesian Experimental Design
◮ Probabilistic Numerical Methods
◮ Bayesian Probabilistic Numerical Methods
◮ Optimality for a BPNM
◮ Applications and Going Forward

SLIDE 24

Background

SLIDE 25

Average Case Analysis

IBC/ACA: Traub, Wasilkowski, and Woźniakowski [1988]
Idea: "A numerical method should work well for typical problems."

◮ A typical problem is modelled as a random variable X ∈ X with X ∼ π_X.
◮ Information is considered to be deterministic: Y ∈ Y_e with Y ∼ δ(y_e(x)).
◮ Suppose Φ is a normed space with norm ‖·‖_Φ.
◮ Average case error:

  ACE_p(d_e) = ( ∫ ‖φ(x) − d_e(y_e(x))‖_Φ^p dπ_X(x) )^{1/p}

◮ Average case optimal method, optimal information, and minimal error:

  d*_e ∈ arg inf_{d_e ∈ D_e} ACE_p(d_e),   e* ∈ arg inf_{e ∈ E} ACE_p(d*_e),   inf_{e ∈ E} ACE_p(d*_e)

Example (Numerical integration, continued; Sul'din [1959, 1960])

For π_X the standard Wiener process, with E[X(t)] = 0 and E[X(t)X(t′)] = min(t, t′), the trapezoidal method

  d_e(y_e(x)) = Σ_{i=1}^n ((x(t_{i−1}) + x(t_i)) / 2) (t_i − t_{i−1})

is average case optimal for p = 2, Φ = R and ‖ϕ‖_Φ = |ϕ|. The optimal information is e = [0, 1/n, 2/n, . . . , 1].
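Sul'din's result can be probed by simulation. The sketch below is my own construction (Monte Carlo over sampled Wiener paths, with an arbitrary skewed design as a foil); it estimates ACE_2 of the trapezoidal method under two designs with the same number of nodes.

```python
import numpy as np

# Monte Carlo sketch of ACE_2 for the trapezoidal method when pi_X is the
# standard Wiener process.  The uniform design e = [0, 1/n, ..., 1] (the
# optimal information on the slide) is compared with an arbitrary skewed
# design; sample sizes and grids are illustrative choices.
rng = np.random.default_rng(0)
m, n_fine = 4000, 1000
t_fine = np.linspace(0.0, 1.0, n_fine + 1)
dW = rng.normal(0.0, np.sqrt(1.0 / n_fine), size=(m, n_fine))
X = np.concatenate([np.zeros((m, 1)), np.cumsum(dW, axis=1)], axis=1)  # Wiener paths

def trap(paths, t):
    """Trapezoidal method d_e applied to each sampled path."""
    dt = np.diff(t)
    return np.sum((paths[:, :-1] + paths[:, 1:]) / 2.0 * dt, axis=1)

truth = trap(X, t_fine)            # fine-grid proxy for phi(x) = int_0^1 x(t) dt

def ace2(nodes):
    """Estimate ACE_2 for the design given by `nodes`."""
    idx = np.round(nodes * n_fine).astype(int)
    return np.sqrt(np.mean((truth - trap(X[:, idx], t_fine[idx])) ** 2))

uniform = np.linspace(0.0, 1.0, 11)                          # n = 10, optimal
skewed = np.concatenate([[0.0], np.linspace(0.8, 1.0, 10)])  # also n = 10
print(ace2(uniform), ace2(skewed))
```

The uniform design should come out clearly ahead (its ACE_2 is of order 1/n), illustrating the optimal-information claim on the slide.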

SLIDE 32

Bayesian Decision Theory

BDT: Berger [1985], ...
Idea: "Take the best decision according to your personal belief."

◮ Epistemic uncertainty in the state is modelled as a random variable X ∈ X with X ∼ π_X.
◮ Information could be random: Y ∼ π_{Y|x,e}.
◮ Consider a space of actions A and a decision rule d_e : Y_e → A.
◮ Consider a loss function ℓ : X × A → [0, ∞].
◮ Bayes risk:

  BR(d_e) = ∫∫ ℓ(x, d_e(y)) dπ_{Y|x,e}(y) dπ_X(x)

◮ Bayes rule: d*_e ∈ arg inf_{d_e ∈ D_e} BR(d_e)
◮ Optimal experiment: e* ∈ arg inf_{e ∈ E} BR(d*_e)
◮ Average case analysis is a special case with A = Φ, ℓ(x, a) = ‖φ(x) − a‖_Φ^p, π_{Y|x,e} = δ(y_e(x)) [Kadane and Wasilkowski, 1985].
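The Bayes risk and Bayes rule can be made concrete in a conjugate sketch; the Gaussian model and the comparator rule below are my own illustrative choices, not from the talk.

```python
import numpy as np

# Sketch of the Bayes-risk comparison under an assumed conjugate model:
# X ~ N(0, 1) (prior pi_X), Y | x ~ N(x, sigma^2) (random information),
# quadratic loss l(x, a) = (x - a)^2.  The Bayes rule is the posterior
# mean d*(y) = y / (1 + sigma^2); the rule d(y) = y ignores the prior.
rng = np.random.default_rng(1)
sigma = 1.0
x = rng.normal(0.0, 1.0, size=200_000)          # draws from pi_X
y = x + rng.normal(0.0, sigma, size=x.size)     # draws from pi_{Y|x,e}

def bayes_risk(d):
    """Monte Carlo estimate of BR(d) under the assumed model."""
    return np.mean((x - d(y)) ** 2)

br_bayes = bayes_risk(lambda z: z / (1.0 + sigma**2))
br_naive = bayes_risk(lambda z: z)
print(br_bayes, br_naive)
```

Analytically the posterior-mean rule attains BR = sigma^2 / (1 + sigma^2) = 0.5 here, while d(y) = y attains BR = sigma^2 = 1, and the Monte Carlo estimates reflect this gap.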

SLIDE 41

Remark #1: Characterising a Bayes Rule

A Bayes rule is characterised by the actions a = d*_e(y) that it takes; these are called Bayes acts. Sometimes it is possible to characterise the Bayes act:

Proposition

Consider A = X = R^d. Let ℓ(x, a) = ‖φ(x) − φ(a)‖_2^2, where φ : X → R^m, m ∈ N. Assume that φ is twice continuously differentiable and that the matrix [dφ/da]_{i,j} = ∂φ_i(a)/∂a_j has full row rank at all a ∈ A. Then any Bayes act a ∈ A*_e(y_e) satisfies

  φ(a) = ∫ φ(x) dπ_{X|y,e}(x).   (1)

Moreover, if there exists a unique solution to Eqn. (1) and the function φ is coercive, then this solution is a Bayes act.
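The fixed-point characterisation (1) can be checked numerically in one dimension. Everything in this sketch is an assumed illustration: the posterior π_{X|y,e} = N(1, 0.25) and the choice φ(a) = a + a^3 (coercive, with φ′(a) = 1 + 3a^2 > 0, so the full-rank condition holds everywhere).

```python
import numpy as np

# One-dimensional check of the Proposition: under l(x, a) = (phi(x) - phi(a))^2,
# the Bayes act a should satisfy phi(a) = E[phi(X) | y].
rng = np.random.default_rng(3)
samples = rng.normal(1.0, 0.5, size=100_000)    # draws from the assumed posterior

def phi(a):
    return a + a**3                             # coercive, phi'(a) = 1 + 3a^2 > 0

target = np.mean(phi(samples))                  # Monte Carlo E[phi(X) | y]

# Grid search for the action minimising the Monte Carlo expected loss.
grid = np.linspace(0.8, 1.6, 801)
losses = [np.mean((phi(samples) - phi(a)) ** 2) for a in grid]
a_star = grid[int(np.argmin(losses))]

print(a_star, phi(a_star), target)              # phi(a_star) should be near target
```

The grid minimiser recovers the act a with phi(a) equal to the posterior mean of phi, up to Monte Carlo and grid resolution, as Eqn. (1) predicts.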

SLIDE 44

Remark #1: Characterising a Bayes Rule

Example (Linear regression)

◮ Let X ∼ N(µ_0, Σ_0), X ∈ R^d, and Y|x ∼ N(A_e x, Σ), Y ∈ R^n, where the matrix Σ_0 is positive definite and the matrix A_e ∈ R^{n×d} is determined by the choice of experiment e ∈ E.
◮ Consider a loss ℓ(x, x′) = (x − x′)^⊤ Λ (x − x′), where Λ is a positive semi-definite matrix with a square root Λ^{1/2}.
◮ Then a Bayes decision rule d*_e is defined through the Bayes act(s) a ∈ R^d which, from the Proposition, satisfy Λ^{1/2} a = Λ^{1/2} µ_{y,e}, where X|y ∼ N(µ_{y,e}, Σ_e) with

  Σ_e = (A_e^⊤ Σ^{−1} A_e + Σ_0^{−1})^{−1},   µ_{y,e} = Σ_e (A_e^⊤ Σ^{−1} y + Σ_0^{−1} µ_0).

◮ If Λ is positive definite then it also follows from the Proposition that µ_{y,e} is the unique Bayes act.
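The posterior mean formula can be sanity-checked against the equivalent "data-space" (Kalman-gain) form of Gaussian conditioning. The matrices below are arbitrary illustrative choices, and the noise precision Σ^{−1} is written out explicitly for a general Σ.

```python
import numpy as np

# Numerical check that the information-form posterior mean from the slide
# agrees with the equivalent conditioning formula
#   mu_0 + Sigma_0 A^T (A Sigma_0 A^T + Sigma)^{-1} (y - A mu_0).
rng = np.random.default_rng(2)
d, n = 3, 5
A = rng.normal(size=(n, d))                                   # A_e, fixed by e
mu0 = rng.normal(size=d)
L0 = rng.normal(size=(d, d)); Sigma0 = L0 @ L0.T + np.eye(d)  # prior covariance
L = rng.normal(size=(n, n)); Sigma = L @ L.T + np.eye(n)      # noise covariance
y = rng.normal(size=n)

Si = np.linalg.inv(Sigma)
Sigma_e = np.linalg.inv(A.T @ Si @ A + np.linalg.inv(Sigma0))
mu_ye = Sigma_e @ (A.T @ Si @ y + np.linalg.inv(Sigma0) @ mu0)

gain = Sigma0 @ A.T @ np.linalg.inv(A @ Sigma0 @ A.T + Sigma)
mu_alt = mu0 + gain @ (y - A @ mu0)
print(np.allclose(mu_ye, mu_alt))
```

The two expressions coincide by the Woodbury identity, so the check passes for any well-conditioned choice of the matrices.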

slide-48
SLIDE 48

Remark #2: Admissibility

Other, more adversarial notions of optimality for general decision rules, such as admissibility, need not coincide with the Bayesian notion of optimality. A decision rule de ∈ De is called admissible if there exists no d′e ∈ De such that

  ∫ ℓ(x, d′e(y)) dπY|x,e(y) ≤ ∫ ℓ(x, de(y)) dπY|x,e(y)

for all x ∈ X, with strict inequality for some x ∈ X.

Example

Consider estimation of x ∈ R based on Y|x ∼ N(x, 1) and with ℓ(x, x′) = (x − x′)². An admissible decision rule is d(y) = y, but this is not a Bayes rule for any proper prior πX on R.
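The gap between admissibility and Bayes optimality in this example can be seen by comparing frequentist risk curves; the prior variance τ² = 1 below is an illustrative choice.

```python
# Risk comparison for the example above: d(y) = y has constant frequentist
# risk 1 under squared-error loss, while the Bayes rule for a N(0, τ²) prior
# is the shrinkage rule d_B(y) = c·y with c = τ²/(1 + τ²). Neither rule
# dominates the other, so admissibility does not single out the Bayes rule.
tau2 = 1.0
c = tau2 / (1.0 + tau2)

def risk_identity(x):
    # E[(x - Y)²] with Y|x ~ N(x, 1): constant in x.
    return 1.0

def risk_shrinkage(x):
    # E[(x - cY)²] = c²·Var(Y) + ((1 - c)x)² = c² + (1 - c)²x²
    return c**2 + (1.0 - c) ** 2 * x**2

assert risk_shrinkage(0.0) < risk_identity(0.0)   # shrinkage wins near x = 0
assert risk_shrinkage(5.0) > risk_identity(5.0)   # identity wins for large |x|
```

The risk curves cross, which is consistent with the slide's point: d(y) = y is admissible yet is not a Bayes rule for any proper prior.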

slide-51
SLIDE 51

Bayesian Experimental Design

BED: Chaloner and Verdinelli [1995], ...
Idea: “Design the experiment that I believe will be the most useful.”
Consider a utility function u(e, x, y) and let

  BED(e) = −∫∫ u(e, x, y) dπY|x,e(y) dπX(x).

An optimal experiment is defined as e∗ ∈ arg inf_{e∈E} BED(e).
N.B. This contains Bayesian decision theory as a special case with

  u(e, y) = −∫ ℓ(x′, d∗e(y)) dπX|y,e(x′),

since then

  BED(e) = ∫∫∫ ℓ(x′, d∗e(y)) dπX|y,e(x′) dπY|x,e(y) dπX(x)
         = ∫∫ ℓ(x′, d∗e(y)) dπX|y,e(x′) dπY|e(y)
         = ∫∫ ℓ(x′, d∗e(y)) dπY|x′,e(y) dπX(x′) = BR(e, d∗e).
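The identity BED(e) = BR(e, d∗e) can be checked by Monte Carlo in a toy scalar model; the model and the two candidate noise levels below are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy model: X ~ N(0, 1), Y | x, e ~ N(x, σ_e²), squared-error loss.
# The Bayes rule is the posterior mean d*_e(y) = y / (1 + σ_e²), with
# analytic Bayes risk σ_e² / (1 + σ_e²).
def bed_mc(sigma2, n=200_000):
    x = rng.standard_normal(n)
    y = x + np.sqrt(sigma2) * rng.standard_normal(n)
    d = y / (1.0 + sigma2)            # Bayes act for each simulated y
    return np.mean((x - d) ** 2)      # averaged loss ≈ BED(e) = BR(e, d*_e)

for sigma2 in (0.5, 2.0):
    analytic = sigma2 / (1.0 + sigma2)
    assert abs(bed_mc(sigma2) - analytic) < 0.01

# The lower-noise experiment is the optimal one under this criterion.
assert bed_mc(0.5) < bed_mc(2.0)
```

The Monte Carlo estimates match the analytic Bayes risks, and minimising BED(e) over the two experiments correctly selects the less noisy one.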

slide-55
SLIDE 55

Remark #3: Comment on Practical Issues

In practice the criterion BED(e) = BR(e, d∗e) is difficult to compute, as for each experiment e ∈ E an optimisation is required to identify a Bayes act.
slide-56
SLIDE 56

Remark #3: Comment on Practical Issues

In practice the criterion BED(e) = BR(e, d∗e) is difficult to compute, as for each experiment e ∈ E an optimisation is required to identify a Bayes act.

Consider X = R^m. A number of approximations have been developed, based on a Gaussian approximation πX|y,e ≈ N(µy,e, Σe):

  BED(e) ≈ tr(ΛΣe)                  (A-optimal)
           det(Λ^{1/2} Σe Λ^{1/2})  (D-optimal)
           ...

These are called the alphabet criteria, for some positive semi-definite matrix Λ with square root Λ^{1/2}.

slide-57
SLIDE 57

Remark #3: Comment on Practical Issues

In practice the criterion BED(e) = BR(e, d∗e) is difficult to compute, as for each experiment e ∈ E an optimisation is required to identify a Bayes act.

Example (Bayes A-, c- and E-optimal)

◮ Consider the loss ℓ(x, x′) = ‖x − x′‖²_Λ, where ‖x‖_Λ := ‖Λ^{1/2} x‖₂.
◮ Suppose that the posterior πX|y,e is a Gaussian N(µy,e, Σe).
◮ Then a Bayes decision rule d∗e is defined through the Bayes act(s) a, which satisfy Λ^{1/2} a = Λ^{1/2} µy,e due to Remark #1.
◮ Now observe that for a Bayes act ∫ ℓ(x, a) dπX|y,e(x) = ∫ (x − µy,e)⊤Λ(x − µy,e) dπX|y,e(x) = tr(ΛΣe), which is independent of y.
◮ It follows that BR(e, d∗e) = tr(ΛΣe).
◮ Selecting e to minimise tr(ΛΣe), or tr(Σe) in the common case where Λ = I, is called Bayes A-optimal [Owen, 1970, Brooks, 1972, 1974, 1976, Duncan and DeGroot, 1976, Brooks, 1977].
◮ In the special case where Λ = cc⊤ is a rank-1 matrix, the optimality criterion is called Bayes c-optimal [El-Krunz and Studden, 1991]. Relatedly, minimisation of sup_{‖c‖=1} tr(cc⊤Σe) is called Bayes E-optimal [Chaloner, 1984].
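In the Gaussian setting above, each criterion reduces to a simple matrix functional of Σe, and different criteria can rank the same experiments differently. The two posterior covariances below are illustrative.

```python
import numpy as np

# Alphabet criteria for two hypothetical posterior covariances Σ_e (Λ = I).
Sigma_a = np.diag([1.0, 0.1])   # experiment "a": one direction well resolved
Sigma_b = np.diag([0.4, 0.4])   # experiment "b": balanced resolution

def a_opt(S):                   # A-optimal criterion: tr(Σ_e)
    return np.trace(S)

def d_opt(S):                   # D-optimal criterion: det(Σ_e)
    return np.linalg.det(S)

def c_opt(S, c):                # c-optimal criterion: tr(ccᵀΣ_e) = cᵀΣ_e c
    return c @ S @ c

def e_opt(S):                   # E-optimal: sup over unit c, i.e. largest eigenvalue
    return np.linalg.eigvalsh(S).max()

# The criteria disagree on which experiment is "optimal":
assert a_opt(Sigma_b) < a_opt(Sigma_a)   # b wins on A-optimality (0.8 < 1.1)
assert d_opt(Sigma_a) < d_opt(Sigma_b)   # a wins on D-optimality (0.1 < 0.16)
assert e_opt(Sigma_b) < e_opt(Sigma_a)   # b wins on E-optimality (0.4 < 1.0)
```

This disagreement is exactly the point raised later in the talk: different utilities (hence different alphabet criteria) can lead to different notions of optimal information.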

slide-64
SLIDE 64

Remark #3: Comment on Practical Issues

In practice the criterion BED(e) = BR(e, d∗e) is difficult to compute, as for each experiment e ∈ E an optimisation is required to identify a Bayes act.

Example (Bayes D-optimal)

◮ Consider the loss ℓ(x, x′) = 0 if ‖x − x′‖_Λ ≤ ǫ and ℓ(x, x′) = 1 otherwise.
◮ Suppose that the posterior πX|y,e is a Gaussian N(µy,e, Σe).
◮ A Bayes decision rule d∗e is defined through the Bayes act(s), which one can verify include µy,e.
◮ Now observe that ∫ ℓ(x, µy,e) dπX|y,e(x) = ∫ 1{‖x − µy,e‖_Λ > ǫ} dπX|y,e(x), which is equal to one minus the probability that ‖Z‖₂ ≤ ǫ, where Z ∼ N(0, Λ^{1/2} Σe Λ^{1/2}).
◮ For small ǫ, this is 1 − O(c_d⁻¹ det(Λ^{1/2} Σe Λ^{1/2})^{−1/2} ǫ^d), where c_d := 2^{d/2} Γ(d/2 + 1).
◮ Note in particular that this is independent of y. It follows that BR(e, d∗e) is minimised when det(Λ^{1/2} Σe Λ^{1/2}) is minimised. This criterion is called Bayes D-optimal [Tiao and Afonja, 1976].
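The small-ball heuristic behind D-optimality can be checked by Monte Carlo; the two covariances, the choice Λ = I and the radius ǫ below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

# Monte Carlo check of the small-ball argument above (Λ = I): the 0-1 risk
# 1 − P(‖Z‖₂ ≤ ǫ) under Z ~ N(0, Σ_e) is smaller for the experiment whose
# posterior covariance has the smaller determinant.
def miss_prob(Sigma, eps, n=400_000):
    L = np.linalg.cholesky(Sigma)
    z = rng.standard_normal((n, Sigma.shape[0])) @ L.T   # Z ~ N(0, Σ_e)
    return np.mean(np.linalg.norm(z, axis=1) > eps)      # 0-1 loss risk

Sigma_small_det = np.diag([0.2, 0.2])   # det = 0.04
Sigma_large_det = np.diag([0.5, 0.5])   # det = 0.25

# Smaller determinant ⇒ smaller 0-1 risk ⇒ preferred by the D-criterion.
assert miss_prob(Sigma_small_det, 0.3) < miss_prob(Sigma_large_det, 0.3)
```

The experiment that the D-criterion prefers is indeed the one with smaller simulated 0-1 risk, consistent with the derivation on the slide.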

slide-70
SLIDE 70

Bayesian Experimental Design

Bayesian experimental design is a more general framework, and it may be interesting to ask how optimal experiments depend on the choice of utility, and whether some utilities lead to different results in terms of IBC.

Next in this talk: probabilistic numerical methods, and why for these methods the experimental design framework may be more appropriate than the decision-theoretic framework.

slide-72
SLIDE 72

Probabilistic Numerical Methods

slide-73
SLIDE 73

Probabilistic Numerical Methods

PN: Larkin [1972], Diaconis [1988], O’Hagan [1992], ...
Idea: “Perform formal, systematic uncertainty quantification for the quantity of interest φ(x).”
◮ Probabilities are arguably the most popular framework for UQ.
◮ Consider for the moment a deterministic information map ye : X → Ye.
◮ A probabilistic numerical method is a map De : Ye → PΦ, where PΦ is the set of distributions on Φ.
◮ Note that this contains deterministic decision rules as a special case: De(y) = δ(de(y)).
◮ It is, in fact, mathematically identical to a randomised decision rule in BDT.

slide-78
SLIDE 78

Bayesian Probabilistic Numerical Methods

◮ Let {πX|y,e}_{y∈Ye} denote a disintegration of πX along the map ye : X → Ye.
◮ Let φ# denote the push-forward (i.e. φ#π(S) = π(φ⁻¹(S))).
◮ A probabilistic numerical method is Bayesian if De(y) = φ#πX|y,e for πY|e-almost all y ∈ Ye, for some distribution πX, called a prior.

Example (Numerical integration)

e.g. the Bayesian quadrature method [Larkin, 1972]

  De(ye(x)) = N( Σ_{i=1}^n (x(t_{i−1}) + x(t_i))/2 · (t_i − t_{i−1}),  Σ_{i=1}^n (t_i − t_{i−1})³/12 )

is obtained from disintegrating the standard Wiener prior πX.

◮ In general one can also consider randomness Y ∼ πY|x,e, but this is atypical in a traditional numerical task.
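The Bayesian quadrature posterior above is just the trapezoidal rule paired with a Wiener-prior variance, so it is easy to implement and check. The integrand sin(t) and the uniform node set are illustrative choices.

```python
import numpy as np

# Posterior for ∫₀¹ x(t) dt under a standard Wiener prior on x, conditioned
# on evaluations at nodes t_0 < … < t_n (with t_0 = 0, t_n = 1): the mean is
# the trapezoid rule and the variance is Σ (t_i − t_{i−1})³ / 12.
def bayes_quadrature(f, t):
    fx = f(t)
    dt = np.diff(t)
    mean = np.sum((fx[:-1] + fx[1:]) / 2.0 * dt)   # trapezoid rule
    var = np.sum(dt**3) / 12.0                     # Wiener-prior posterior variance
    return mean, var

t = np.linspace(0.0, 1.0, 11)
mean, var = bayes_quadrature(np.sin, t)

truth = 1.0 - np.cos(1.0)                          # ∫₀¹ sin(t) dt exactly
assert var > 0.0
assert abs(mean - truth) < 3.0 * np.sqrt(var)      # truth inside ±3 sd
```

Here the output is not a single number but a distribution N(mean, var), and the check confirms that the true integral lies well within the posterior's credible range, i.e. the uncertainty quantification is meaningful for this integrand.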

slide-83
SLIDE 83

Remark #4: Bayes is Optimal

Q: Why are we particularly interested in Bayesian probabilistic numerical methods? There are several reasons (mainly interpretation and composability), but here is a cute optimality argument due to Bernardo [1979]:

◮ Consider Bayesian decision theory with A = PX.
◮ Consider a loss of the form ℓ(x, a) = −log( (da/dπX)(x) ), which is an example of a proper scoring rule.
◮ Then a Bayes rule is de(y) = πX|y,e for each y ∈ Ye, and the associated Bayes risk is −∫ DKL(πX|y,e ‖ πX) dπY|e(y). So if πX is our prior belief, then (at least in this sense) posteriors are the right thing to report.
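Bernardo's argument can be illustrated on a finite state space, where the expected posterior loss of any reported distribution is easy to evaluate exactly. The prior, posterior and candidate reports below are arbitrary illustrative choices.

```python
import numpy as np

# Discrete illustration: under the log score ℓ(x, a) = −log(a(x)/π_X(x)),
# the expected posterior loss of reporting a distribution a is minimised by
# reporting the posterior itself.
prior = np.array([0.5, 0.3, 0.2])       # stands in for π_X
posterior = np.array([0.1, 0.6, 0.3])   # stands in for π_{X|y,e}

def expected_loss(report):
    return -np.sum(posterior * np.log(report / prior))

candidates = [
    posterior,
    np.array([1 / 3, 1 / 3, 1 / 3]),
    prior,
    np.array([0.2, 0.5, 0.3]),
]
losses = [expected_loss(a) for a in candidates]
assert np.argmin(losses) == 0           # the posterior attains the minimum

# At the minimum the value is −KL(posterior ‖ prior) ≤ 0.
kl = np.sum(posterior * np.log(posterior / prior))
assert np.isclose(losses[0], -kl)
```

Reporting the posterior beats every other candidate report, and the attained value is the negative Kullback–Leibler divergence from the prior, matching the slide's claim.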

slide-89
SLIDE 89

(Some) Existing Work on Bayesian PNM

Integration:
◮ Briol F-X, CJO, Girolami M, Osborne MA, Sejdinovic D. Probabilistic Integration: A Role in Statistical Computation? (with discussion and rejoinder). Statistical Science, 2018.
◮ Xi X, Briol F-X, Girolami M. Bayesian Quadrature for Multiple Related Integrals. ICML 2018.
◮ CJO, Niederer S, Lee A, Briol F-X, Girolami M. Probabilistic Models for Integration Error in Assessment of Functional Cardiac Models. NIPS 2017.
◮ Karvonen T, Särkkä S. Fully Symmetric Kernel Quadrature. SIAM Journal on Scientific Computing, 40(2), pp.A697–A720.
◮ Kanagawa M, Sriperumbudur BK, Fukumizu K. Convergence Guarantees for Kernel-Based Quadrature Rules in Misspecified Settings. NIPS 2016.
◮ Jagadeeswaran R, Hickernell FJ. Fast Automatic Bayesian Cubature Using Lattice Sampling. arXiv:1809.09803, 2018.

Differential Equations:
◮ Owhadi H. Bayesian Numerical Homogenization. Multiscale Modeling & Simulation, 13(3), pp.812–828, 2015.
◮ CJO, Cockayne J, Aykroyd RG, Girolami M. Bayesian Probabilistic Numerical Methods for Industrial Process Monitoring. arXiv:1707.06107.
◮ Cockayne J, CJO, Sullivan T, Girolami M. Probabilistic Meshless Methods for Bayesian Inverse Problems. arXiv:1605.07811.

Linear Solvers:
◮ Cockayne J, CJO, Ipsen I, Girolami M. A Bayesian Conjugate Gradient Method. arXiv:1801.05242.
◮ Bartels S, Cockayne J, Ipsen I, Girolami M, Hennig P. Probabilistic Linear Solvers: A Unifying View.

Not many papers focus on optimality at this point.

slide-91
SLIDE 91

Desiderata for a PNM

Some desiderata for a probabilistic numerical method:

  • 1. Should be Bayesian for some prior πX
  • 2. Should “put mass close” to the true quantity of interest φ(x)
  • 3. Should be “well-calibrated”
  • 4. Should be “easy to optimise” over e ∈ E

N.B. These cannot all be satisfied in general! Note that Bayesian decision theory does not fully account for (1), (2), (3) or (4).

[Figure: the decision-theoretic loss ℓ(x, a) = (x − a)² compares an act a against the true x under the posterior πX|y,e.]

Our plan, in the remainder of the talk, is to build an optimality criterion based on (1), (2) and (4), and then to defer to the analyst for (3) through the choice of πX.

slide-99
SLIDE 99

An Optimality Criterion

A recent proposal in Cockayne, CJO, Sullivan, Girolami, to appear in SIAM Review, 2018:

Idea: Consider a loss function ℓ(x, x′) and use this to define a utility

u(e, x, y) = − ∫ ℓ(x, x′) dπX|y,e(x′).

N.B. this could encode a quantity of interest, e.g. ℓ(x, x′) = ‖φ(x) − φ(x′)‖Φ².

Then consider the implied Bayesian experimental design criterion

BPN(e) := BED(e) = ∫∫∫ ℓ(x, x′) dπX|y,e(x′) dπY|x,e(y) dπX(x),
E∗BPN := arg inf e∈E BPN(e).

[Figure: a posterior sample x′ is compared to the truth x through ℓ(x, x′) = (x − x′)².]
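The criterion BPN(e) is rarely available in closed form, but it is easy to estimate. As a sketch (assuming a hypothetical conjugate Gaussian toy model in which the "experiment" e is simply the observation-noise level σe), the Monte Carlo estimate can be checked against the exact value:

```python
import random

random.seed(0)

# Toy conjugate model: prior X ~ N(0, 1); data Y | x, e ~ N(x, sigma_e^2),
# where the "experiment" e is the choice of observation noise sigma_e.
def posterior_params(y, sigma_e):
    # Standard Gaussian conjugacy: posterior X | y, e is N(m, v)
    v = 1.0 / (1.0 + 1.0 / sigma_e**2)
    m = v * y / sigma_e**2
    return m, v

def bpn_monte_carlo(sigma_e, n=200_000):
    # BPN(e) = E_y E_{x, x' ~ posterior} [ (x - x')^2 ]
    total = 0.0
    for _ in range(n):
        x_true = random.gauss(0.0, 1.0)
        y = random.gauss(x_true, sigma_e)
        m, v = posterior_params(y, sigma_e)
        x = random.gauss(m, v**0.5)
        xp = random.gauss(m, v**0.5)
        total += (x - xp)**2
    return total / n

sigma_e = 0.5
# For squared loss the inner double integral is 2 * posterior variance,
# independent of y, so BPN(e) = 2 v in closed form.
v = 1.0 / (1.0 + 1.0 / sigma_e**2)
est = bpn_monte_carlo(sigma_e)
assert abs(est - 2 * v) < 0.05
```

In this model smaller σe gives smaller BPN(e), as expected: more informative experiments concentrate the posterior.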

slide-101
SLIDE 101

Desiderata (2): Should “put mass close” to the true quantity of interest φ(x)

To build intuition, let

◮ X = R^d
◮ φ(x) = x and ℓ(x, x′) = ‖x − x′‖_2^p

Then

BPN(e) = ∫∫∫ ℓ(x, x′) dπX|y,e(x′) dπY|x,e(y) dπX(x)
       = ∫∫ DW,p(δ(x), πX|y,e)^p dπY|x,e(y) dπX(x),

where DW,p is the pth Wasserstein distance: the inner integral ∫ ℓ(x, x′) dπX|y,e(x′) is exactly DW,p(δ(x), πX|y,e)^p, since the only coupling of δ(x) with πX|y,e transports all mass from x. Note that the BPN criterion does not capture the statistical notion of the posterior being well-calibrated.
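The point-mass identity used above can be sanity-checked numerically. A small sketch for p = 1 on a hypothetical discrete posterior π: the expected loss E_{x′~π}|x − x′| is compared against W₁(δ(x), π) computed independently from the CDF formula W₁(μ, ν) = ∫ |Fμ(s) − Fν(s)| ds.

```python
# When one argument of the Wasserstein distance is a point mass delta(x),
# the only admissible coupling transports all mass from x, so the inner
# integral of the BPN criterion, E_{x' ~ pi} |x - x'|^p, equals
# D_{W,p}(delta(x), pi)^p exactly. We verify the p = 1 case on a discrete pi.

support = [0.0, 1.0, 3.0]   # hypothetical posterior support
weights = [0.2, 0.5, 0.3]   # hypothetical posterior weights
x = 1.5                     # the "truth"

# Left-hand side: E_{x' ~ pi} |x - x'|
expected_loss = sum(w * abs(x - s) for s, w in zip(support, weights))

# Right-hand side: W_1(delta_x, pi) via Riemann integration of |F - G|
def cdf(pts, wts, s):
    return sum(w for p, w in zip(pts, wts) if p <= s)

steps = 10_000
ds = 4.0 / steps  # integrate over [0, 4], which covers both supports
w1 = sum(abs(cdf([x], [1.0], i * ds) - cdf(support, weights, i * ds)) * ds
         for i in range(steps + 1))

assert abs(w1 - expected_loss) < 1e-2
```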

slide-104
SLIDE 104

Desiderata (4): Should be “easy to optimise” over e ∈ E

Re-write the BPN objective as follows:

BPN(e) = ∫∫∫ ℓ(x, x′) dπX|y,e(x′) dπX|y,e(x) dπY|e(y).   (1)

Thus BPN(e) can be “easily” approximated using (Markov chain) Monte Carlo methods to sample from πX|y,e.

Example (Numerical integration, continued)

Consider ℓ(x, x′) = ∫₀¹ (x(t) − x′(t))² dt. Note that the posterior πX|y,e is a collection of independent Brownian bridges Xi : [ti−1, ti] → R with Xi(ti−1) = x(ti−1) and Xi(ti) = x(ti), with covariance function

cov(Xi(t), Xi(t′)) = (ti − t′)(t − ti−1) / (ti − ti−1),   ti−1 ≤ t ≤ t′ ≤ ti.

Through some calculation we end up with an optimal experiment ti = i/n.

It is interesting to observe that in this case BED(e) = 2 BR(e, d∗e). In particular, we recover the same optimal information as obtained earlier with ACA/BDT.
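The "some calculation" on the slide can be sketched in a few lines. Under the assumptions implied above (Brownian-motion prior pinned at x(0) = 0, evaluation nodes 0 < t₁ < … < tₙ = 1), integrating the bridge variance (ti − t)(t − ti−1)/(ti − ti−1) over a cell of width h gives h²/6, and the two independent posterior copies in (1) contribute a factor 2, so BPN(e) = Σᵢ hᵢ²/3. The uniform grid tᵢ = i/n is then confirmed as the minimiser:

```python
# BPN(e) for the numerical integration example: on each cell of width h the
# posterior is an independent Brownian bridge, whose variance integrates to
# h^2 / 6 over the cell; the two independent posterior copies in the BPN
# objective contribute a factor 2, giving BPN(e) = sum_i h_i^2 / 3.
def bpn_integration(nodes):
    ts = [0.0] + sorted(nodes)
    return sum((b - a) ** 2 / 3.0 for a, b in zip(ts, ts[1:]))

n = 4
uniform = [i / n for i in range(1, n + 1)]   # the optimal design t_i = i/n
skewed = [0.1, 0.2, 0.3, 1.0]                # same number of nodes, clustered

# Closed form at the uniform grid: n * (1/n)^2 / 3 = 1 / (3n)
assert abs(bpn_integration(uniform) - 1.0 / (3 * n)) < 1e-12
# Any clustered design of the same size does worse
assert bpn_integration(uniform) < bpn_integration(skewed)
```

Minimising Σᵢ hᵢ²/3 subject to Σᵢ hᵢ = 1 is a convex problem whose solution is equal cell widths, which is exactly tᵢ = i/n.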
slide-109
SLIDE 109

Desiderata (4): Should be “easy to optimise” over e ∈ E

Take a linear elliptic PDE:

∆x(t) = f(t),   t ∈ (0, 1)²
x(t) = 0,   t₂ ∈ {0, 1}
∂x(t)/∂t₁ = 0,   t₁ ∈ {0, 1}

Consider ℓ(x, x′) = ( ∫ (x(t) − x′(t))^p dt )^{1/p}. The prior πX was taken to be Gaussian with E[X(t)] = 0, E[X(t)X(t′)] = exp(−‖t − t′‖₂²).

The optimal experiment can be “easily” approximated in a greedy manner.

[Figure sequence: the greedily selected design points t₁, …, t₉ for p = 2, and t₁, …, t₁₈ for p → ∞.]
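A sketch of the greedy loop, under simplifying assumptions: the Gaussian prior with the squared-exponential covariance from the slide is conditioned on direct point evaluations (rather than the PDE information operator used in the actual experiment), and for p = 2 the next design point is taken to be the candidate of largest posterior variance. The 5 × 5 grid and all names are illustrative.

```python
import numpy as np

# Squared-exponential covariance from the slide: k(t, t') = exp(-||t - t'||^2)
def k(t, tp):
    d = np.asarray(t) - np.asarray(tp)
    return np.exp(-np.dot(d, d))

grid = [(i / 4, j / 4) for i in range(5) for j in range(5)]  # candidate points

def posterior_var(design, t, jitter=1e-10):
    # GP posterior variance at t after conditioning on evaluations at `design`
    if not design:
        return k(t, t)
    K = np.array([[k(a, b) for b in design] for a in design])
    K += jitter * np.eye(len(design))  # numerical stabiliser
    kv = np.array([k(a, t) for a in design])
    return k(t, t) - kv @ np.linalg.solve(K, kv)

# Greedy design: repeatedly add the grid point of largest posterior variance
design = []
for _ in range(5):
    design.append(max(grid, key=lambda t: posterior_var(design, t)))

# The design spreads out over the domain: no point is ever picked twice
assert len(set(design)) == len(design)
```

Each greedy step is a one-dimensional search over the candidate set, which is what makes the criterion "easy to optimise" relative to a joint search over all design points at once.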

slide-136
SLIDE 136

Relationship to Bayesian Decision Theory (/Average Case Analysis)

Recall that for Sul’din’s numerical integration example the optimal information from BDT/ACA coincided with the optimal information under BPN. It is therefore interesting to ask whether E∗BPN = E∗BDT in general.

Proposition (A positive result)
Consider a loss function of the form ℓ(x, x′) = ‖φ(x) − φ(x′)‖Φ², where φ: X → Φ takes values in an inner product space Φ, with inner product ⟨·, ·⟩Φ and induced norm ‖ϕ‖Φ = ⟨ϕ, ϕ⟩Φ^{1/2}. Suppose that any Bayes act a ∈ A∗e(ye) satisfies∗

φ(a) = ∫ φ(x) dπX|y,e(x).   (2)

Then E∗BPN = E∗BDT.

∗From the earlier Proposition, a sufficient condition is A = X = R^d, φ twice continuously differentiable and the matrix dφ/da of full row rank.
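A one-line expansion (a standard identity, sketched here rather than quoted from the paper) shows why condition (2) forces the two notions to agree for this quadratic loss. Writing m = ∫ φ(x) dπX|y,e(x) and taking X, X′ independent draws from the posterior,

```latex
\mathbb{E}\,\|\varphi(X) - \varphi(X')\|_\Phi^2
  = \mathbb{E}\,\|\varphi(X) - m\|_\Phi^2
  + 2\,\mathbb{E}\,\langle \varphi(X) - m,\; m - \varphi(X') \rangle_\Phi
  + \mathbb{E}\,\|m - \varphi(X')\|_\Phi^2
  = 2\,\mathbb{E}\,\|\varphi(X) - m\|_\Phi^2,
```

since the cross term vanishes by independence and E[φ(X′)] = m. The BPN integrand is therefore exactly twice the Bayes risk of an act a with φ(a) = m, so minimising one criterion over e ∈ E minimises the other.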

slide-139
SLIDE 139

Relationship to Bayesian Decision Theory (/Average Case Analysis)

Proposition (A negative result)
Suppose that the state space X can be partitioned into three disjoint subsets, each with positive probability under πX. Then there exists a loss function ℓ and a set of candidate experiments E such that E∗BPN ≠ E∗BDT.

Potential open questions:
◮ An explicit characterisation of the loss functions ℓ for which E∗BPN = E∗BDT is not, at least to my knowledge, available at present.
◮ The analytic intractability of optimal experiments in all but the simplest of numerical tasks leaves it unclear whether there exists a numerical task of practical importance for which E∗BPN ≠ E∗BDT.
◮ Any utility function u from the Bayesian experimental design literature provides a criterion BED(e) that can be studied from an information-based complexity standpoint (analogous to the nth minimal error in ACA).

slide-143
SLIDE 143

Some Recent Applications of Probabilistic Numerical Methods

CJO, Cockayne J, Aykroyd RG, Girolami M (2018). Bayesian Probabilistic Numerical Methods in Time-Dependent State Estimation for Industrial Hydrocyclone Equipment. arXiv:1707.06107.

slide-144
SLIDE 144

Some Recent Applications of Probabilistic Numerical Methods

CJO, Niederer S, Lee A, Briol F-X, Girolami M. Probabilistic Models for Integration Error in Assessment of Functional Cardiac Models. NIPS 2017.

slide-145
SLIDE 145

Conclusion

In this talk we argued that:

◮ Bayesian experimental design is quite general and could motivate alternative notions of optimality to be studied in IBC.
◮ Probabilistic numerical methods have different desiderata compared to classical numerical methods...
◮ ...but nevertheless sometimes∗ their optimal information coincides (∗in search of an “if and only if” result).

Thank you for your attention!

slide-147
SLIDE 147

References I

  • J. O. Berger. Statistical Decision Theory and Bayesian Analysis. Springer Science & Business Media, 1985.
  • J. M. Bernardo. Expected information as expected utility. The Annals of Statistics, pages 686–690, 1979.
  • R. Brooks. A decision theory approach to optimal regression designs. Biometrika, 59(3):563–571, 1972.
  • R. Brooks. On the choice of an experiment for prediction in linear regression. Biometrika, 61(2):303–311, 1974.
  • R. Brooks. Optimal regression designs for prediction when prior knowledge is available. Metrika, 23(1):221–230, 1976.
  • R. Brooks. Optimal regression design for control in linear regression. Biometrika, 64(2):319–325, 1977.
  • K. Chaloner. Optimal Bayesian experimental design for linear models. The Annals of Statistics, pages 283–300, 1984.
  • K. Chaloner and I. Verdinelli. Bayesian experimental design: A review. Statistical Science, pages 273–304, 1995.
  • P. Diaconis. Bayesian numerical analysis. In Statistical Decision Theory and Related Topics IV, volume 1, pages 163–175. Springer-Verlag New York, 1988.
  • G. Duncan and M. DeGroot. A mean squared error approach to optimal design theory. In Proceedings of the 1976 Conference on Information: Sciences and Systems, pages 217–221. The Johns Hopkins University, 1976.
  • S. M. El-Krunz and W. J. Studden. Bayesian optimal designs for linear regression models. The Annals of Statistics, pages 2183–2208, 1991.
  • J. B. Kadane and G. W. Wasilkowski. Average case ε-complexity in computer science: A Bayesian view. In Bayesian Statistics, pages 361–374. Elsevier, North-Holland, 1985.
  • F. M. Larkin. Gaussian measure in Hilbert space and applications in numerical analysis. The Rocky Mountain Journal of Mathematics, 2(3):379–421, 1972.
  • A. O’Hagan. Some Bayesian numerical analysis. Bayesian Statistics, 4:345–363, 1992.
  • R. Owen. The optimum design of a two-factor experiment using prior information. The Annals of Mathematical Statistics, pages 1917–1934, 1970.

slide-148
SLIDE 148

References II

  • A. V. Sul’din. Wiener measure and its applications to approximation methods. I. Izv. Vysš. Učebn. Zaved. Matematika, 6(13):145–158, 1959.
  • A. V. Sul’din. Wiener measure and its applications to approximation methods. II. Izv. Vysš. Učebn. Zaved. Matematika, 5(18):165–179, 1960.
  • G. C. Tiao and B. Afonja. Some Bayesian considerations of the choice of design for ranking, selection and estimation. Annals of the Institute of Statistical Mathematics, 28:167–186, 1976.
  • J. F. Traub, G. W. Wasilkowski, and H. Woźniakowski. Information-Based Complexity. Academic Press, New York, 1988.