SLIDE 1 What is an Optimal Bayesian Method?
Newcastle University & Lloyd’s Register Foundation – Alan Turing Institute Programme on Data-Centric Engineering. November 2018 @ RICAM, Multivariate Algorithms and Information-Based Complexity.
SLIDE 2
Collaborators
Jon Cockayne (University of Warwick), Mark Girolami (Imperial College London & Alan Turing Institute), Dennis Prangle (Newcastle University), Tim Sullivan (Free University of Berlin & Zuse Institute Berlin)
SLIDE 3
Aims
The aims of this talk are as follows:
◮ To recall that average case analysis and Bayesian decision theory are identical.
◮ To recall that Bayesian decision theory can be cast in a more general Bayesian experimental design framework.
◮ To survey some alternative optimality criteria from the Bayesian experimental design context.
◮ To pose the question of whether these alternative criteria lead to different notions of optimal information in the numerical/IBC context.
◮ To recall the definition of a probabilistic numerical method.
◮ To develop an appropriate optimality criterion for a probabilistic numerical method.
◮ (If time allows) To showcase some recent work on probabilistic numerical methods.
SLIDE 11 Notation
◮ Denote the state space as X.
◮ Consider the task of approximating a quantity of interest φ : X → Φ.
◮ We are allowed to select an experiment e ∈ E.
◮ Information about the state x ∈ X is provided via a random variable Y ∈ Ye with Y ∼ πY|x,e.
◮ This includes the case of deterministic information: Y ∼ δ(ye(x)).
◮ Let de : Ye → Φ denote a numerical method; the set of all allowed methods is De.
Example (Numerical integration)
X = C(0, 1), φ(x) = ∫₀¹ x(t) dt, e = [t0, . . . , tn] with t0 = 0, ti−1 < ti, tn = 1, and ye(x) = [x(t0), . . . , x(tn)],
e.g. histogram method: de(ye(x)) = ∑ᵢ₌₁ⁿ x(ti−1)(ti − ti−1)
e.g. trapezoidal method: de(ye(x)) = ∑ᵢ₌₁ⁿ ½(x(ti−1) + x(ti))(ti − ti−1)
What about optimality in this context?
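The two example quadrature rules above can be sketched in code as follows (a minimal illustration; the uniform grid and the integrand x(t) = sin(πt) are arbitrary choices, not from the slides):

```python
# Histogram and trapezoidal quadrature, computed from the information
# ye(x) = [x(t0), ..., x(tn)] on a grid 0 = t0 < ... < tn = 1.
import math

def histogram_rule(t, y):
    # sum over i of x(t_{i-1}) * (t_i - t_{i-1})
    return sum(y[i - 1] * (t[i] - t[i - 1]) for i in range(1, len(t)))

def trapezoidal_rule(t, y):
    # sum over i of (x(t_{i-1}) + x(t_i)) / 2 * (t_i - t_{i-1})
    return sum((y[i - 1] + y[i]) / 2 * (t[i] - t[i - 1]) for i in range(1, len(t)))

n = 100
t = [i / n for i in range(n + 1)]           # uniform information e = [0, 1/n, ..., 1]
y = [math.sin(math.pi * s) for s in t]      # ye(x) for the test integrand x(t) = sin(pi t)

print(histogram_rule(t, y))    # both approximate the integral of sin(pi t) over [0, 1],
print(trapezoidal_rule(t, y))  # whose exact value is 2/pi
```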
SLIDE 23
Contents
Background
Average Case Analysis
Bayesian Decision Theory
Bayesian Experimental Design
Probabilistic Numerical Methods
Bayesian Probabilistic Numerical Methods
Optimality for a BPNM
Applications and Going Forward
SLIDE 24
Background
SLIDE 25 Average Case Analysis
IBC/ACA: Traub, Wasilkowski, and Woźniakowski [1988]
Idea: “A numerical method should work well for typical problems”
◮ A typical problem is modelled as a random variable X ∈ X with X ∼ πX.
◮ Information is considered to be deterministic: Y ∈ Ye with Y ∼ δ(ye(x)).
◮ Suppose Φ is a normed space with norm ‖·‖Φ.
◮ Average case error: ACEp(de) = ( ∫ ‖φ(x) − de(ye(x))‖Φᵖ dπX(x) )^(1/p)
◮ Average case optimal method, optimal information and minimal error:
d∗e ∈ arg inf_{de ∈ De} ACEp(de),   e∗ ∈ arg inf_{e ∈ E} ACEp(d∗e),   inf_{e ∈ E} ACEp(d∗e)

Example (Numerical integration, continued; Sul’din [1959, 1960])
For πX the standard Wiener process with E[X(t)] = 0, E[X(t)X(t′)] = min(t, t′), the trapezoidal method de(ye(x)) = ∑ᵢ₌₁ⁿ ½(x(ti−1) + x(ti))(ti − ti−1) is average case optimal for p = 2 and Φ = R, ‖ϕ‖Φ = |ϕ|. The optimal information is e = [0, 1/n, 2/n, . . . , 1].
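The average case error of the trapezoidal method under the Wiener prior can be checked by Monte Carlo (a sketch; the true integral is itself approximated on a much finer grid, and the closed-form value 1/(n√12) quoted in the comment is the standard Brownian-bridge calculation, not from the slides):

```python
# Monte Carlo estimate of ACE_2 for the trapezoidal rule with uniform
# information e = [0, 1/n, ..., 1], under the standard Wiener prior.
import random

random.seed(0)

def brownian_path(m):
    # Standard Wiener process sampled at t_j = j/m via independent increments.
    w, path = 0.0, [0.0]
    for _ in range(m):
        w += random.gauss(0.0, 1.0) * (1.0 / m) ** 0.5
        path.append(w)
    return path

def trapezoid(vals, h):
    return sum((vals[i - 1] + vals[i]) / 2 * h for i in range(1, len(vals)))

m, n, reps = 1024, 8, 2000     # fine grid size, information size, MC samples
sq_err = 0.0
for _ in range(reps):
    path = brownian_path(m)
    truth = trapezoid(path, 1.0 / m)                     # proxy for the integral of X
    coarse = [path[j * (m // n)] for j in range(n + 1)]  # information ye(X)
    sq_err += (trapezoid(coarse, 1.0 / n) - truth) ** 2
ace2 = (sq_err / reps) ** 0.5
print(ace2)   # theory: ACE_2 = 1/(n * sqrt(12)), about 0.036 for n = 8
```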
SLIDE 32 Bayesian Decision Theory
BDT: Berger [1985], ...
Idea: “Take the best decision according to your personal belief”
◮ Epistemic uncertainty in the state is modelled as a random variable X ∈ X with X ∼ πX.
◮ Information could be random: Y ∼ πY|x,e.
◮ Consider a space of actions A and a decision rule de : Ye → A.
◮ Consider a loss function ℓ : X × A → [0, ∞].
◮ Bayes risk: BR(de) = ∫∫ ℓ(x, de(y)) dπY|x,e(y) dπX(x)
◮ Bayes rule: d∗e ∈ arg inf_{de ∈ De} BR(de)
◮ Optimal experiment: e∗ ∈ arg inf_{e ∈ E} BR(d∗e)
◮ Average case analysis is the special case with A = Φ, ℓ(x, a) = ‖φ(x) − a‖Φᵖ, πY|x,e = δ(ye(x)) [Kadane and Wasilkowski, 1985].
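Under squared-error loss the Bayes rule is the posterior mean; a Monte Carlo sketch in a conjugate model (the prior, noise level, and sample size are arbitrary choices for illustration):

```python
# Bayes risk in the model X ~ N(0, 1), Y | x ~ N(x, s2): the posterior
# mean beats any other decision rule, here compared against d(y) = y.
import random

random.seed(1)
s2 = 0.5                          # observation noise variance (arbitrary)

def posterior_mean(y):
    # X | y ~ N(y / (1 + s2), s2 / (1 + s2)) for the prior N(0, 1)
    return y / (1.0 + s2)

def bayes_risk(rule, reps=50_000):
    # BR(d) = E_{X,Y}[ (X - d(Y))^2 ], estimated by Monte Carlo
    total = 0.0
    for _ in range(reps):
        x = random.gauss(0.0, 1.0)
        y = x + random.gauss(0.0, s2 ** 0.5)
        total += (x - rule(y)) ** 2
    return total / reps

print(bayes_risk(posterior_mean))   # close to the posterior variance s2/(1+s2) = 1/3
print(bayes_risk(lambda y: y))      # close to s2 = 1/2, strictly worse
```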
SLIDE 41
Remark #1: Characterising a Bayes Rule
A Bayes rule is characterised by the actions a = d∗e(y) that it takes; these are called Bayes acts. Sometimes it is possible to characterise the Bayes act:
SLIDE 42 Remark #1: Characterising a Bayes Rule
Proposition
Consider A = X = Rᵈ. Let ℓ(x, a) = ‖φ(x) − φ(a)‖₂² where φ : X → Rᵐ, m ∈ N. Assume that φ is twice continuously differentiable and that the matrix dφ/da = [∂φᵢ/∂aⱼ] has full row rank at all a ∈ A. Then any Bayes act a ∈ A∗e(y) satisfies
φ(a) = ∫ φ(x) dπX|y,e(x).   (1)
Moreover, if there exists a unique solution to Eqn. (1) and the function φ is coercive, then this solution is a Bayes act.
SLIDE 44 Remark #1: Characterising a Bayes Rule
Example (Linear regression)
◮ Let X ∼ N(µ0, Σ0), X ∈ Rᵈ, and Y|x ∼ N(Ae x, Σ), Y ∈ Rⁿ, where the matrix Σ0 is positive definite and the matrix Ae ∈ Rⁿˣᵈ is determined by the choice of experiment e ∈ E.
◮ Consider a loss ℓ(x, x′) = (x − x′)⊤Λ(x − x′) where Λ is a positive semi-definite matrix with a square root Λ^(1/2).
◮ Then a Bayes decision rule d∗e is defined through the Bayes act(s) a ∈ Rᵈ which, from the Proposition, satisfy Λ^(1/2) a = Λ^(1/2) µy,e, where X|y ∼ N(µy,e, Σe) with
Σe = (Ae⊤ Σ⁻¹ Ae + Σ0⁻¹)⁻¹,   µy,e = Σe (Ae⊤ Σ⁻¹ y + Σ0⁻¹ µ0).
◮ If Λ is positive definite then it also follows from the Proposition that µy,e is the unique Bayes act.
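The linear regression example can be checked numerically in a scalar case (a sketch; d = 1, two observations, and all numbers below are arbitrary choices):

```python
# Compute the posterior mean mu_{y,e} for a scalar linear-Gaussian model
# and check that it minimises the expected posterior squared-error loss,
# i.e. that it is the Bayes act.
import random

random.seed(2)
mu0, s0_2 = 1.0, 2.0            # prior N(mu0, s0_2)
a_e, v = [1.0, 2.0], 0.5        # design: y_i ~ N(a_i * x, v), i = 1, 2
x_true = 0.7
y = [a * x_true + random.gauss(0.0, v ** 0.5) for a in a_e]

# Scalar versions of Sigma_e = (A' S^{-1} A + S0^{-1})^{-1} and mu_{y,e}
prec = sum(a * a for a in a_e) / v + 1.0 / s0_2
sigma_e = 1.0 / prec
mu_ye = sigma_e * (sum(a * yi for a, yi in zip(a_e, y)) / v + mu0 / s0_2)

# Expected posterior loss E[(X - a)^2 | y] = (mu_ye - a)^2 + sigma_e is
# minimised exactly at a = mu_ye; grid search should land next to it.
grid = [i / 100 for i in range(-200, 300)]
best = min(grid, key=lambda a: (mu_ye - a) ** 2 + sigma_e)
print(mu_ye, best)   # best is the grid point nearest to mu_ye
```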
SLIDE 48 Remark #2: Admissibility
Other, more adversarial notions of optimality for general decision rules, such as admissibility, need not coincide with the Bayesian notion of optimality. A decision rule de ∈ De is called admissible if there exists no d′e ∈ De such that
∫ ℓ(x, d′e(y)) dπY|x,e(y) ≤ ∫ ℓ(x, de(y)) dπY|x,e(y)
for all x ∈ X, with strict inequality for some x ∈ X.

Example
Consider estimation of x ∈ R based on Y|x ∼ N(x, 1) and with ℓ(x, x′) = (x − x′)². An admissible decision rule is d(y) = y, but this is not a Bayes rule for any proper prior πX on R.
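Although d(y) = y is not a Bayes rule for any proper prior, it is the limit of Bayes rules: under the prior N(0, τ²) and Y|x ∼ N(x, 1), the Bayes rule for squared loss is the posterior mean d(y) = τ²/(1 + τ²) · y, a standard conjugate-Gaussian fact. A quick numerical check:

```python
# Shrinkage coefficient of the Bayes rule under priors N(0, tau2):
# tau2 / (1 + tau2) tends to 1 as tau2 grows, recovering d(y) = y.
shrinkage = [tau2 / (1.0 + tau2) for tau2 in (1.0, 10.0, 100.0, 10_000.0)]
print(shrinkage)   # increases towards 1
```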
SLIDE 51 Bayesian Experimental Design
BED: Chaloner and Verdinelli [1995], ...
Idea: “Design the experiment that I believe will be the most useful”
Consider a utility function u(e, x, y) and let

    BED(e) = − ∫∫ u(e, x, y) dπY|x,e(y) dπX(x)

An optimal experiment is defined as e∗ ∈ arg inf_{e∈E} BED(e).
N.B. This contains Bayesian decision theory as a special case with

    u(e, y) = − ∫ ℓ(x′, d∗e(y)) dπX|y,e(x′)

since then

    BED(e) = ∫∫∫ ℓ(x′, d∗e(y)) dπX|y,e(x′) dπY|x,e(y) dπX(x)
           = ∫∫ ℓ(x′, d∗e(y)) dπX|y,e(x′) dπY|e(y)
           = ∫∫ ℓ(x′, d∗e(y)) dπY|x′,e(y) dπX(x′) = BR(e, d∗e)
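For the conjugate Gaussian model the identity BED(e) = BR(e, d∗e) can be verified by Monte Carlo: averaging the loss of the Bayes act over the joint distribution of (x, y) recovers tr(Σe). A minimal sketch, assuming a standard Gaussian prior, unit observation noise, squared-error loss, and an illustrative design matrix:

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 2, 4

# Hypothetical linear-Gaussian experiment: prior N(0, I), noise N(0, I).
A = rng.standard_normal((n, d))
Sigma_e = np.linalg.inv(A.T @ A + np.eye(d))     # posterior covariance

# BED(e) = E_x E_{y|x} [ loss of the Bayes act d*_e(y) = mu_{y,e} ]; for
# squared-error loss this Bayes risk collapses to tr(Sigma_e).
m = 100_000
x = rng.standard_normal((m, d))                  # x ~ prior, one per row
y = x @ A.T + rng.standard_normal((m, n))        # y | x, one per row
mu = y @ A @ Sigma_e                             # row-wise mu_{y,e} = Sigma_e A^T y
bed_mc = np.mean(np.sum((x - mu) ** 2, axis=1))  # Monte Carlo estimate of BED(e)
bed_exact = np.trace(Sigma_e)
```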
SLIDE 55 Remark #3: Comment on Practical Issues
In practice the criterion BED(e) = BR(e, d∗e) is difficult to compute, as for each experiment e ∈ E an optimisation is required to identify a Bayes act.
Consider X = R^m. A number of approximations have been developed, based on a Gaussian approximation πX|y,e ≈ N(µy,e, Σe):

    BED(e) ≈ tr(ΛΣe)                  (A-optimal)
    BED(e) ≈ det(Λ^{1/2} Σe Λ^{1/2})  (D-optimal)
    . . .

These are called the alphabet criteria, for some positive semi-definite matrix Λ with square root Λ^{1/2}.
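The alphabet criteria are cheap to evaluate once Σe is available. The sketch below selects among a handful of hypothetical designs by the A- and D-criteria (the design space, dimensions and matrices are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(3)
d = 3

def posterior_cov(A):
    # Sigma_e for a prior N(mu0, I) and unit observation noise (hypothetical setup).
    return np.linalg.inv(A.T @ A + np.eye(d))

# A small hypothetical design space E: each experiment e is a 4 x d design matrix.
designs = [rng.standard_normal((4, d)) for _ in range(5)]
Lam = np.eye(d)

a_crit = [np.trace(Lam @ posterior_cov(A)) for A in designs]   # A-criterion tr(Lam Sigma_e)
d_crit = [np.linalg.det(posterior_cov(A)) for A in designs]    # D-criterion (here Lam = I)
e_a_star = int(np.argmin(a_crit))                              # A-optimal experiment
e_d_star = int(np.argmin(d_crit))                              # D-optimal experiment
```

The two criteria need not pick the same experiment, which is one reason the choice of utility matters.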
Example (Bayes A-, c- and E-optimal)
◮ Consider the loss ℓ(x, x′) = ‖x − x′‖²_Λ, where ‖x‖_Λ := ‖Λ^{1/2} x‖₂.
◮ Suppose that the posterior πX|y,e is a Gaussian N(µy,e, Σe).
◮ Then a Bayes decision rule d∗e is defined through the Bayes act(s) a, which satisfy Λ^{1/2} a = Λ^{1/2} µy,e due to Remark #1.
◮ Now observe that, for a Bayes act,

    ∫ ℓ(x, a) dπX|y,e(x) = ∫ (x − µy,e)^⊤ Λ (x − µy,e) dπX|y,e(x) = tr(ΛΣe),

which is independent of y.
◮ It follows that BR(e, d∗e) = tr(ΛΣe).
◮ Selecting e to minimise tr(ΛΣe), or tr(Σe) in the common case where Λ = I, is called Bayes A-optimal [Owen, 1970, Brooks, 1972, 1974, 1976, Duncan and DeGroot, 1976, Brooks, 1977].
◮ In the special case where Λ = cc^⊤ is a rank-1 matrix, the criterion is called Bayes c-optimal [El-Krunz and Studden, 1991]. Relatedly, minimisation of sup_{‖c‖=1} tr(cc^⊤ Σe) is called Bayes E-optimal [Chaloner, 1984].
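The c- and E-criteria reduce to simple linear algebra: for Λ = cc^⊤ the trace collapses to the quadratic form c^⊤ Σe c, and the supremum over unit vectors is the largest eigenvalue of Σe. A sketch with a hypothetical posterior covariance:

```python
import numpy as np

rng = np.random.default_rng(4)
d = 3
M = rng.standard_normal((d, d))
Sigma_e = M @ M.T + np.eye(d)      # a hypothetical (positive definite) posterior covariance

c = rng.standard_normal(d)
c /= np.linalg.norm(c)             # a unit direction of interest

# c-optimal criterion: Lam = c c^T is rank one, so tr(c c^T Sigma_e) = c^T Sigma_e c.
c_crit = np.trace(np.outer(c, c) @ Sigma_e)

# E-optimal criterion: sup over unit c of c^T Sigma_e c = largest eigenvalue of Sigma_e.
e_crit = np.max(np.linalg.eigvalsh(Sigma_e))
```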
Example (Bayes D-optimal)
◮ Consider the loss ℓ(x, x′) = 0 if ‖x − x′‖_Λ ≤ ε and ℓ(x, x′) = 1 otherwise.
◮ Suppose that the posterior πX|y,e is a Gaussian N(µy,e, Σe).
◮ A Bayes decision rule d∗e is defined through the Bayes act(s), which one can verify include µy,e.
◮ Now observe that

    ∫ ℓ(x, µy,e) dπX|y,e(x) = ∫ 1{‖x − µy,e‖_Λ > ε} dπX|y,e(x),

which equals one minus the probability that ‖Z‖₂ ≤ ε, where Z ∼ N(0, Λ^{1/2} Σe Λ^{1/2}).
◮ For small ε, this is 1 − O(c_d^{−1} det(Λ^{1/2} Σe Λ^{1/2})^{−1/2} ε^d), where c_d := 2^{d/2} Γ(d/2 + 1).
◮ Note in particular that this is independent of y. It follows that BR(e, d∗e) is minimised when det(Λ^{1/2} Σe Λ^{1/2}) is minimised. This criterion is called Bayes D-optimal [Tiao and Afonja, 1976].
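The small-ball approximation behind the D-criterion can be checked by simulation: for small ε the mass that N(0, S) places in the ε-ball scales like ε^d / (c_d det(S)^{1/2}), so shrinking det(S) shrinks the 0-1 Bayes risk. A sketch with an illustrative 2 × 2 covariance S standing in for Λ^{1/2} Σe Λ^{1/2}:

```python
import numpy as np
from math import gamma

rng = np.random.default_rng(5)
d = 2
S = np.array([[1.0, 0.3],
              [0.3, 0.5]])                 # hypothetical Lam^{1/2} Sigma_e Lam^{1/2}
eps = 0.05

# Monte Carlo estimate of P(||Z||_2 <= eps) for Z ~ N(0, S).
Z = rng.multivariate_normal(np.zeros(d), S, size=1_000_000)
p_mc = np.mean(np.sum(Z ** 2, axis=1) <= eps ** 2)

# Small-ball asymptotic: P(||Z||_2 <= eps) ~ eps^d / (c_d det(S)^{1/2}),
# with c_d = 2^{d/2} Gamma(d/2 + 1).
c_d = 2 ** (d / 2) * gamma(d / 2 + 1)
p_asym = eps ** d / (c_d * np.sqrt(np.linalg.det(S)))
```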
SLIDE 70 Bayesian Experimental Design
Bayesian experimental design is a more general framework, and it may be interesting to ask how optimal experiments depend on the choice of utility, and whether some utilities lead to different results in terms of IBC.
Next in this talk: Probabilistic numerical methods, and why for these methods the experimental design framework may be more appropriate than the decision-theoretic framework.
SLIDE 72
Probabilistic Numerical Methods
SLIDE 73
Probabilistic Numerical Methods
PN: Larkin [1972], Diaconis [1988], O’Hagan [1992], ...
Idea: “Perform formal, systematic uncertainty quantification for the quantity of interest φ(x)”
◮ Probabilities are arguably the most popular framework for UQ.
◮ Consider for the moment a deterministic information map ye : X → Ye.
◮ A probabilistic numerical method is a map De : Ye → PΦ, where PΦ is the set of distributions on Φ.
◮ Note that this contains deterministic decision rules as a special case: De(y) = δ(de(y)).
◮ In fact, this is mathematically identical to a randomised decision rule in BDT.
SLIDE 78 Bayesian Probabilistic Numerical Methods
◮ Let {πX|y,e}_{y∈Ye} denote a disintegration of πX along the map ye : X → Ye.
◮ Let φ# denote the push-forward (i.e. φ#π(S) = π(φ^{−1}(S))).
◮ A probabilistic numerical method is Bayesian if De(y) = φ#πX|y,e for πY|e-a.a. y ∈ Ye, for some distribution πX, called a prior.
Example (Numerical integration)
e.g. the Bayesian quadrature method [Larkin, 1972]

    De(ye(x)) = N( Σ_{i=1}^n (x(t_{i−1}) + x(t_i))/2 · (t_i − t_{i−1}),  Σ_{i=1}^n (t_i − t_{i−1})³/12 )

is obtained by disintegrating the standard Wiener prior πX.
◮ In general one can also consider randomness Y ∼ πY|x,e, but this is atypical in a traditional numerical task.
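The Bayesian quadrature posterior above is easy to implement: the mean is the trapezoid rule and the variance accumulates (t_i − t_{i−1})³/12 over the intervals. A minimal sketch, assuming observations on a grid covering [0, 1] with x(0) = 0 (consistent with the standard Wiener prior); the grid and test integrand are illustrative:

```python
import numpy as np

def bq_wiener(t, x):
    """Posterior N(mean, var) for the integral of x over [0, 1], given values
    x(t_i) at grid points t (including the endpoints), under the Wiener prior."""
    dt = np.diff(t)
    mean = np.sum((x[:-1] + x[1:]) / 2 * dt)   # trapezoid rule = posterior mean
    var = np.sum(dt ** 3) / 12                 # per-interval Brownian-bridge variance
    return mean, var

t = np.linspace(0.0, 1.0, 11)
x = np.sin(2 * np.pi * t)                      # test integrand, observed at the t_i
mean, var = bq_wiener(t, x)
```

Here the true integral of sin(2πs) over [0, 1] is 0, and the posterior variance quantifies the residual uncertainty from seeing only 11 function values.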
SLIDE 83 Remark #4: Bayes is Optimal
Q: Why are we particularly interested in Bayesian probabilistic numerical methods? Several reasons (mainly interpretation and composability), but here is a cute optimality argument due to Bernardo [1979]:
◮ Consider Bayesian decision theory with A = PX.
◮ Consider a loss of the form ℓ(x, a) = − log (da/dπX)(x), which is an example of a proper scoring rule.
◮ Then a Bayes rule is de(y) = πX|y,e for each y ∈ Ye, and the Bayes risk is − ∫ DKL(πX|y,e ‖ πX) dπY|e(y). So if πX is our prior belief, then (at least in this sense) posteriors are the right thing to report.
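Bernardo's argument can be illustrated numerically: among candidate distributional reports scored by the log loss, the expected loss under the posterior is smallest when the report equals the posterior. The sketch below uses a hypothetical posterior N(1, 0.5²) and the density form of the score, −log a(x), which differs from −log (da/dπX)(x) only by a term that does not depend on the report:

```python
import numpy as np

rng = np.random.default_rng(6)

# Hypothetical posterior X | y, e ~ N(1, 0.5^2); samples stand in for integration.
mu_post, sd_post = 1.0, 0.5
x = rng.normal(mu_post, sd_post, 200_000)

def neg_log_gauss(x, mu, sd):
    # Log loss -log a(x) for a Gaussian report a = N(mu, sd^2).
    return 0.5 * np.log(2 * np.pi * sd ** 2) + (x - mu) ** 2 / (2 * sd ** 2)

# Candidate reports (mean, sd); the posterior itself is among them.
candidates = [(0.0, 0.5), (1.0, 0.5), (1.0, 1.0), (2.0, 0.3)]
scores = [np.mean(neg_log_gauss(x, m, s)) for (m, s) in candidates]
best = candidates[int(np.argmin(scores))]
```

Because the log score is strictly proper, the posterior wins whenever it is in the candidate set.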
SLIDE 89
(Some) Existing Work on Bayesian PNM
Integration:
◮ Briol F-X, CJO, Girolami M, Osborne MA, Sejdinovic D. Probabilistic Integration: A Role in Statistical Computation? (with discussion and rejoinder). Statistical Science, 2018.
◮ Xi X, Briol F-X, Girolami M. Bayesian Quadrature for Multiple Related Integrals. ICML 2018.
◮ CJO, Niederer S, Lee A, Briol F-X, Girolami M. Probabilistic Models for Integration Error in Assessment of Functional Cardiac Models. NIPS 2017.
◮ Karvonen T, Särkkä S. Fully Symmetric Kernel Quadrature. SIAM Journal on Scientific Computing, 40(2), pp. A697–A720, 2018.
◮ Kanagawa M, Sriperumbudur BK, Fukumizu K. Convergence Guarantees for Kernel-Based Quadrature Rules in Misspecified Settings. NIPS 2016.
◮ Jagadeeswaran R, Hickernell FJ. Fast Automatic Bayesian Cubature Using Lattice Sampling. arXiv:1809.09803, 2018.
Differential Equations:
◮ Owhadi H. Bayesian Numerical Homogenization. Multiscale Modeling & Simulation, 13(3), pp. 812–828, 2015.
◮ CJO, Cockayne J, Aykroyd RG, Girolami M. Bayesian Probabilistic Numerical Methods for Industrial Process Monitoring. arXiv:1707.06107
◮ Cockayne J, CJO, Sullivan T, Girolami M. Probabilistic Meshless Methods for Bayesian Inverse Problems. arXiv:1605.07811
Linear Solvers:
◮ Cockayne J, CJO, Ipsen I, Girolami M. A Bayesian Conjugate Gradient Method. arXiv:1801.05242
◮ Bartels S, Cockayne J, Ipsen I, Girolami M, Hennig P. Probabilistic Linear Solvers: A Unifying View.
Not many papers focus on optimality at this point.
SLIDE 91 Desiderata for a PNM
Some desiderata for a probabilistic numerical method:
1. Should be Bayesian for some prior πX
2. Should “put mass close” to the true quantity of interest φ(x)
3. Should be “well-calibrated”
4. Should be “easy to optimise” over e ∈ E
N.B. These cannot all be satisfied in general! Note that Bayesian decision theory does not fully account for (1), (2), (3) or (4).
[Figure: a posterior πX|y,e over X, a Bayes act a, the truth x, and the loss ℓ(x, a) = (x − a)².]
Our plan, in the remainder of the talk, is to build an optimality criterion based on (1), (2) and (4), and then to defer to the analyst for (3) through the choice of πX.
SLIDE 99 An Optimality Criterion
A recent proposal in Cockayne, CJO, Sullivan, Girolami, to appear in SIAM Review, 2018:
Idea: Consider a loss function ℓ(x, x′) and use this to define a utility
u(e, x, y) = − ∫ ℓ(x, x′) dπX|y,e(x′).
N.B. this could encode a quantity of interest, e.g. ℓ(x, x′) = ‖φ(x) − φ(x′)‖²Φ.
Then consider the implied Bayesian experimental design criterion
BPN(e) := BED(e) = ∫∫∫ ℓ(x, x′) dπX|y,e(x′) dπY|x,e(y) dπX(x),   E∗BPN := arg inf e∈E BPN(e).
[Figure: posterior πX|y,e with draw x′ and true state x, under the loss ℓ(x, x′) = (x − x′)²]
SLIDE 101 Desiderata (2): Should “put mass close” to the true quantity of interest φ(x)
To build intuition, let
◮ X = Rᵈ
◮ φ(x) = x and ℓ(x, x′) = ‖x − x′‖₂^p
Then
BPN(e) = ∫∫∫ ℓ(x, x′) dπX|y,e(x′) dπY|x,e(y) dπX(x),
where the inner integral satisfies ∫ ℓ(x, x′) dπX|y,e(x′) = DW,p(δ(x), πX|y,e)^p and DW,p is the p-th Wasserstein distance.
Note that the BPN criterion does not capture the statistical notion of the posterior being well-calibrated.
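As a quick numerical sanity check (my own sketch, not from the slides), the inner integral with ℓ(x, x′) = |x − x′| can be compared against the 1-Wasserstein distance between a Dirac at x and a sampled posterior:

```python
# Sanity check (illustrative): with l(x, x') = |x - x'|, the inner integral
# of BPN, int |x - x'| dpi_{X|y,e}(x'), equals the 1-Wasserstein distance
# D_{W,1}(delta_x, pi_{X|y,e}). Compare against scipy on a sampled Gaussian
# stand-in for the posterior.
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
x = 0.3                                               # true state
posterior_sample = rng.normal(0.5, 0.2, size=10_000)  # draws from pi_{X|y,e}

inner_integral = np.mean(np.abs(posterior_sample - x))  # int |x - x'| dpi(x')
w1 = wasserstein_distance([x], posterior_sample)        # D_{W,1}(delta_x, pi)
```

The two quantities agree up to floating-point error, since the optimal coupling against a point mass simply transports every posterior atom to x.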
SLIDE 104 Desiderata (4): Should be “easy to optimise” over e ∈ E
Re-write the BPN objective as follows:
BPN(e) = ∫∫∫ ℓ(x, x′) dπX|y,e(x′) dπX|y,e(x) dπY|e(y).   (1)
Thus BPN(e) can be “easily” approximated using (Markov chain) Monte Carlo methods to sample from πX|y,e.
Example (Numerical integration, continued)
Consider ℓ(x, x′) = ∫₀¹ (x(t) − x′(t))² dt.
Note that the posterior πX|y,e is a collection of independent Brownian bridges Xi : [ti−1, ti] → R with Xi(ti−1) = x(ti−1) and Xi(ti) = x(ti), with covariance function
cov(Xi(t), Xi(t′)) = (ti − t′)(t − ti−1) / (ti − ti−1),   ti−1 ≤ t ≤ t′ ≤ ti.
Through some calculation we end up with an optimal experiment ti = i/n.
It is interesting to observe that in this case BED(e) = 2 BR(e, d∗e). In particular, we must have the same optimal information as recovered earlier with ACA/BDT.
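The nested Monte Carlo approximation of (1) can be sketched on a toy conjugate-Gaussian problem (my own illustration; the model and all names are assumptions, not from the talk). Drawing x from the prior and then y | x yields y ~ πY|e marginally, and the Gaussian posterior gives an exact value to compare against:

```python
# Nested Monte Carlo estimate of BPN(e) as in (1), on a toy conjugate model:
# X ~ N(0, 1), Y | X ~ N(X, sigma^2), squared-error loss. The posterior is
# N(y / (1 + sigma^2), sigma^2 / (1 + sigma^2)), so the exact value
# BPN(e) = E[(x - x')^2] = 2 Var(X | y) is available in closed form.
import numpy as np

rng = np.random.default_rng(1)
sigma2 = 0.5                               # observation noise (the "design" e)
post_var = sigma2 / (1.0 + sigma2)         # posterior variance (y-independent)

n_outer, n_inner = 200, 200
total = 0.0
for _ in range(n_outer):
    x_prior = rng.normal()                           # x ~ pi_X
    y = x_prior + rng.normal(scale=np.sqrt(sigma2))  # y ~ pi_{Y|x,e}
    post_mean = y / (1.0 + sigma2)
    x1 = rng.normal(post_mean, np.sqrt(post_var), size=n_inner)  # x  ~ pi_{X|y,e}
    x2 = rng.normal(post_mean, np.sqrt(post_var), size=n_inner)  # x' ~ pi_{X|y,e}
    total += np.mean((x1 - x2) ** 2)                 # average of l(x, x')
bpn_estimate = total / n_outer

bpn_exact = 2.0 * post_var   # = 2/3 for sigma2 = 0.5
```

With 200 × 200 loss evaluations the estimate lands within a few percent of the exact value, illustrating why sampling from πX|y,e is all that the criterion requires.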
SLIDE 109 Desiderata (4): Should be “easy to optimise” over e ∈ E
Take a linear elliptic PDE:
∆x(t) = f(t), t ∈ (0, 1)²
x(t) = 0, t2 ∈ {0, 1}
∂x(t)/∂t1 = 0, t1 ∈ {0, 1}
Consider the loss ℓ(x, x′) = ( ∫ |x(t) − x′(t)|^p dt )^{1/p}. The prior πX was taken to be Gaussian with E[X(t)] = 0 and E[X(t)X(t′)] = exp(−‖t − t′‖₂²).
The optimal experiment can be “easily” approximated in a greedy manner.
[Figures: BPN-greedy designs, with points t1, t2, . . . selected sequentially; shown for p = 2 (t1 up to t9) and for p → ∞ (t1 up to t18).]
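A hedged sketch of how such a greedy loop might look (a toy Gaussian-process stand-in on (0,1)² with the stated kernel, not the authors' PDE implementation). For p = 2 and squared loss, minimising a quadrature proxy for BPN over the next design point reduces to minimising the integrated posterior variance:

```python
# Toy greedy design on (0,1)^2: zero-mean GP prior with kernel
# exp(-||t - t'||_2^2) and noiseless pointwise observations. For p = 2 the
# BPN criterion is proportional to the integrated posterior variance, which
# we approximate by averaging the posterior variance over a random grid.
# Illustrative stand-in only, not the PDE solver used in the talk.
import numpy as np

def kernel(A, B):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2)

rng = np.random.default_rng(2)
grid = rng.uniform(size=(400, 2))   # candidates doubling as quadrature nodes
prior_var = np.ones(len(grid))      # k(t, t) = 1 for this kernel

design = []                         # indices of greedily chosen design points
for _ in range(5):
    best_j, best_score = None, np.inf
    for j in range(len(grid)):
        if j in design:
            continue
        idx = design + [j]
        K = kernel(grid[idx], grid[idx]) + 1e-8 * np.eye(len(idx))
        k_star = kernel(grid, grid[idx])
        # posterior variance at every grid point after observing at grid[idx]
        post_var = prior_var - np.einsum(
            "ij,jk,ik->i", k_star, np.linalg.inv(K), k_star)
        score = post_var.mean()     # proxy for BPN(e) with p = 2
        if score < best_score:
            best_j, best_score = j, score
    design.append(best_j)
```

Each outer iteration scans all remaining candidates and keeps the one that shrinks the averaged posterior variance most, which is what "greedy" means here; a full optimisation over all n-point designs would be combinatorial.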
SLIDE 136 Relationship to Bayesian Decision Theory (/Average Case Analysis)
Recall that for Sul’din’s numerical integration example the optimal information from BDT/ACA coincided with the optimal information under BPN. It is therefore interesting to ask whether E∗BPN = E∗BDT in general.
Proposition (A positive result)
Consider a loss function of the form ℓ(x, x′) = ‖φ(x) − φ(x′)‖²Φ, where φ : X → Φ takes values in an inner product space Φ, with inner product ⟨·, ·⟩Φ and induced norm ‖ϕ‖Φ = ⟨ϕ, ϕ⟩Φ^{1/2}. Suppose that any Bayes act a ∈ A∗e(ye) satisfies∗
φ(a) = ∫ φ(x) dπX|y,e(x).   (2)
Then E∗BPN = E∗BDT.
∗From the earlier Proposition a sufficient condition is A = X = Rᵈ, φ is twice continuously differentiable and the matrix dφ/da has full row rank.
SLIDE 139
Relationship to Bayesian Decision Theory (/Average Case Analysis)
Proposition (A negative result)
Suppose that the state space X can be partitioned into three disjoint subsets, each with positive probability under πX. Then there exists a loss function ℓ and a set of candidate experiments E such that E∗BPN ≠ E∗BDT.
Potential open questions:
◮ An explicit characterisation of the loss functions ℓ for which E∗BPN = E∗BDT is not, at least to my knowledge, available at present.
◮ The analytic intractability of optimal experiments in all but the simplest of numerical tasks leaves it unclear whether there exists a numerical task of practical importance for which E∗BPN ≠ E∗BDT.
◮ Any utility function u from the Bayesian experimental design literature provides a criterion BED(e) that can be studied from an information-based complexity standpoint (analogous to the nth minimal error in ACA).
SLIDE 143 Some Recent Applications of Probabilistic Numerical Methods
CJO, Cockayne, Aykroyd, Girolami (2018). Bayesian Probabilistic Numerical Methods in Time-Dependent State Estimation for Industrial Hydrocyclone Equipment. arXiv:1707.06107
SLIDE 144 Some Recent Applications of Probabilistic Numerical Methods
CJO, Niederer S, Lee A, Briol F-X, Girolami M. Probabilistic Models for Integration Error in Assessment of Functional Cardiac Models. In NIPS 2017.
SLIDE 145
Conclusion
In this talk we argued that
◮ Bayesian experimental design is quite general and could motivate alternative notions of optimality to be studied in IBC.
◮ Probabilistic numerical methods have different desiderata compared to classical numerical methods...
◮ ... but nevertheless sometimes∗ their optimal information coincides (∗in search of an “if and only if” result).
Thank you for your attention!
SLIDE 147 References I
J. O. Berger. Statistical Decision Theory and Bayesian Analysis. Springer Science & Business Media, 1985.
J. M. Bernardo. Expected information as expected utility. The Annals of Statistics, pages 686–690, 1979.
R. Brooks. A decision theory approach to optimal regression designs. Biometrika, 59(3):563–571, 1972.
R. Brooks. On the choice of an experiment for prediction in linear regression. Biometrika, 61(2):303–311, 1974.
R. Brooks. Optimal regression designs for prediction when prior knowledge is available. Metrika, 23(1):221–230, 1976.
R. Brooks. Optimal regression design for control in linear regression. Biometrika, 64(2):319–325, 1977.
K. Chaloner. Optimal Bayesian experimental design for linear models. The Annals of Statistics, pages 283–300, 1984.
K. Chaloner and I. Verdinelli. Bayesian experimental design: a review. Statistical Science, pages 273–304, 1995.
P. Diaconis. Bayesian numerical analysis. In Statistical Decision Theory and Related Topics IV, volume 1, pages 163–175. Springer-Verlag New York, 1988.
G. Duncan and M. DeGroot. A mean squared error approach to optimal design theory. In Proceedings of the 1976 Conference on Information: Science and Systems, pages 217–221. The Johns Hopkins University, 1976.
S. M. El-Krunz and W. J. Studden. Bayesian optimal designs for linear regression models. The Annals of Statistics, pages 2183–2208, 1991.
J. B. Kadane and G. W. Wasilkowski. Average case ε-complexity in computer science: a Bayesian view. In Bayesian Statistics, pages 361–374. Elsevier, North-Holland, 1985.
F. M. Larkin. Gaussian measure in Hilbert space and applications in numerical analysis. The Rocky Mountain Journal of Mathematics, 2(3):379–421, 1972.
A. O’Hagan. Some Bayesian numerical analysis. Bayesian Statistics, 4:345–363, 1992.
R. Owen. The optimum design of a two-factor experiment using prior information. The Annals of Mathematical Statistics, pages 1917–1934, 1970.
SLIDE 148 References II
A. V. Sul’din. Wiener measure and its applications to approximation methods. I. Izv. Vysš. Učebn. Zaved. Matematika, 6(13):145–158, 1959.
A. V. Sul’din. Wiener measure and its applications to approximation methods. II. Izv. Vysš. Učebn. Zaved. Matematika, 5(18):165–179, 1960.
G. C. Tiao and B. Afonja. Some Bayesian considerations of the choice of design for ranking, selection and estimation. Annals of the Institute of Statistical Mathematics, 28:167–186, 1976.
J. F. Traub, G. W. Wasilkowski, and H. Woźniakowski. Information-Based Complexity. Academic Press, New York, 1988.