SLIDE 1

Meta-Bayesian Analysis A Bayesian decision-theoretic analysis of Bayesian inference under model misspecification Jun Yang joint work with Daniel Roy

Department of Statistical Sciences University of Toronto

World Congress in Probability and Statistics July 11, 2016

Meta-Bayesian Analysis (Yang) 1
SLIDE 2

Motivation

“All models are wrong, some are useful.” — George Box “truth [...] is much too complicated to allow anything but approximations.” – John von Neumann

◮ Subjective Bayesianism: alluring but impossible to practice when the model is wrong

◮ Prior probability = degree of belief... in what?

What is a prior?

◮ Is there any role for (subjective) Bayesianism?

Our proposal: a more inclusive and pragmatic definition of “prior”. Our approach: Bayesian decision theory.

SLIDE 3

Example: Grossly Misspecified Model

Setting: Machine learning

data are a collection of documents:

◮ Model: Latent Dirichlet Allocation (LDA), aka “topic modeling”

◮ Prior belief: π̃ ≡ 0, i.e., no setting of LDA is faithful to our true beliefs about the data.

◮ Conjugate prior: π(dθ) ∼ Dirichlet(α)

What is the meaning of a prior on LDA parameters? Pragmatic question: If we use an LDA model (for whatever reason), how should we choose our “prior”?

SLIDE 4

Example: Accurate but still Misspecified Model

Setting: Careful Science

data are experimental measurements:

◮ Model: (Qθ)θ∈Θ, painstakingly produced after years of effort

◮ Prior belief: π̃ ≡ 0, i.e., no Qθ is 100% faithful to our true beliefs about the data.

What is the meaning of a prior in a misspecified model?

(All models are misspecified.)

Pragmatic question: How should we choose a “prior”?

SLIDE 5

Standard Bayesian Analysis for Prediction

◮ Qθ(·): model on X × Y given parameter θ
  X: what you will observe; Y: what you will then predict
◮ π(·): prior on θ
◮ (πQ)(·) = ∫ Qθ(·) π(dθ): marginal distribution on X × Y

Believe (X, Y) ∼ πQ.

The Task
1. Observe X.
2. Choose action Ŷ.
3. Suffer loss L(Ŷ, Y).

The Goal
Minimize expected loss. The Bayes optimal action minimizes expected loss under the conditional distribution of Y given X = x, written πQ(dy|x):

BayesOptAction(πQ, x) = argmin_a ∫ L(a, y) πQ(dy|x)

◮ Quadratic loss → posterior mean.
◮ Self-information loss (log loss) → posterior πQ(·|x).
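Under quadratic loss, BayesOptAction is the predictive mean. A minimal numerical sketch of this (not from the slides; a conjugate normal model with illustrative values), where a grid search over actions against Monte Carlo draws from πQ(dy|x) recovers the predictive mean:

```python
import numpy as np

rng = np.random.default_rng(0)

# Conjugate normal setup (illustrative values, not from the slides):
# prior theta ~ N(0, V); model X|theta ~ N(theta, s2), Y|theta ~ N(theta, s2).
V, s2 = 1.0, 1.0
x = 0.8  # the observed X

# Posterior over theta given X = x, and the predictive distribution of Y.
post_var = 1.0 / (1.0 / V + 1.0 / s2)
post_mean = post_var * x / s2
pred_mean, pred_var = post_mean, post_var + s2

# BayesOptAction(piQ, x) = argmin_a E[L(a, Y) | X = x] with L(a, y) = (a - y)^2.
ys = rng.normal(pred_mean, np.sqrt(pred_var), size=200_000)  # Y ~ piQ(dy | x)
actions = np.arange(-2.0, 2.0, 0.01)
exp_loss = [np.mean((a - ys) ** 2) for a in actions]
a_hat = actions[int(np.argmin(exp_loss))]

print(a_hat, pred_mean)  # the empirical argmin sits at the predictive mean
```

The grid/Monte Carlo step is of course redundant here (the argmin is the predictive mean in closed form); it only illustrates the definition.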

SLIDE 6

Meta-Bayesian Analysis

◮ (Qθ)θ∈Θ: the model, i.e., a family of distributions on X × Y.
◮ Don’t believe Qθ, i.e., the model is misspecified.
◮ P: represents our true belief on X × Y.

Believe (X, Y) ∼ P, but we will use Qθ to predict.

The Task
1. Choose (surrogate) prior π.
2. Observe X.
3. Take action Ŷ = BayesOptAction(πQ, x).
4. Suffer loss L(Ŷ, Y).

The Goal
Minimize expected loss with respect to P, not πQ.

SLIDE 7

Meta-Bayesian Analysis

Key ideas:

◮ Believe (X, Y ) ∼ P ◮ But predict using πQ(·|X = x) for some prior π ◮ Prior π is an choice/decision/action. ◮ Loss associated with π and (x, y) is

L∗(π, (x, y)) = L(BayesOptAction(πQ, x), y)

Meta-Bayesian risk

◮ Bayes risk under P of doing Bayesian analysis under πQ

R(P, π) =

  • L∗(π, (x, y))P(dx × dy).

◮ Meta-Bayesian optimal prior minimizes meta-Bayesian risk:

inf

π∈F R(P, π),

where F is some set of priors under consideration.
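R(P, π) can be estimated by plain Monte Carlo: draw (x, y) from P, act via BayesOptAction(πQ, x), and average the loss. A sketch under an assumed toy setup (a normal location model with quadratic loss; none of the specific values come from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)

# True belief P (assumed for illustration): theta ~ N(0, 1), then
# X, Y | theta ~ N(theta, r2) independently.
# Surrogate model Q_theta = N(theta, s2) with prior pi = N(0, V).
r2, s2, V = 4.0, 1.0, 0.5
n = 500_000

theta = rng.normal(0.0, 1.0, n)
x = theta + rng.normal(0.0, np.sqrt(r2), n)  # (x, y) ~ P
y = theta + rng.normal(0.0, np.sqrt(r2), n)

# Under quadratic loss, BayesOptAction(piQ, x) is the predictive mean
# x * V / (V + s2).
a = x * V / (V + s2)

# Meta-Bayesian risk R(P, pi) = E_P[ L*(pi, (X, Y)) ].
risk = np.mean((a - y) ** 2)
print(risk)
```

Because the action is linear in x here, the same risk is also available in closed form, which makes this a convenient sanity check.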

SLIDE 8

Meta-Bayesian Analysis

Recipe

◮ Step 1: State P, Qθ, and select a loss function L; ◮ Step 2: Choose prior π that minimizes meta-Bayesian risk.

Examples

◮ Log loss: minimize the conditional relative entropy

inf_π ∫ KL( P2(x, ·) || πQ(·|x) ) P1(dx),

where P(dx, dy) = P1(dx) P2(x, dy).

◮ Quadratic loss: minimize the expected squared distance between the posterior means of πQ(·|x) and P2(x, ·):

inf_π ∫ ||m_{πQ}(x) − m_{P2}(x)||₂² P1(dx)
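When both conditionals are normal, the inner KL term of the log-loss objective has a closed form. A small helper sketching this (univariate case; the example values are illustrative, not from the slides):

```python
import math

def kl_normal(m1, v1, m2, v2):
    """KL( N(m1, v1) || N(m2, v2) ) for univariate normals (v's are variances)."""
    return 0.5 * (math.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

# KL(P2(x, .) || piQ(. | x)) at one x, with both predictives normal (toy values).
kl = kl_normal(0.4, 1.5, 0.3, 2.0)
print(kl)  # nonnegative, and zero iff the two predictives coincide
```

The log-loss objective then integrates this quantity over x against P1(dx).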

SLIDE 9

Meta-Bayesian Analysis

High-level Goals

◮ Meta-Bayesian analysis for Qθ under P is generally no easier than doing Bayesian analysis under P directly.

◮ But P serves only as a placeholder for an impossible-to-express true belief.

◮ Our theoretical approach is to prove general theorems true of broad classes of “true beliefs” P.

◮ The hope is that this will tell us something deep about subjective Bayesianism.

The remaining slides present some key findings. Meta-Bayesianism sometimes violates traditional Bayesian tenets.

SLIDE 10

Meta-Bayesian 101: if true belief is realizable

When model is well-specified

◮ There exists π such that P = ∫ Qθ π(dθ) (i.e., P = πQ).

◮ The meta-Bayesian loss reduces to the expected loss in traditional Bayesian analysis.

◮ Self-consistency: π is the meta-Bayesian optimal prior.

Meta-Bayesian analysis reduces to traditional Bayesian analysis when the model is well-specified.

SLIDE 11

Meta-Bayesian Analysis for i.i.d. Normal Model

Example: i.i.d. Normal

◮ true belief P: X, Y i.i.d. N(θ, r²), with π̃(dθ) ∼ N(0, 1).
◮ model: Qθ = N(θ, s²), where s² ≠ r².
◮ prior π: N(0, V), with one parameter V.
◮ X ∈ R^n, Y ∈ R^k.

[Figure: optimal V versus s for the simple normal model with r = 4, under quadratic and log loss.]

Results for n = 1 and k = 1

◮ Predictive distribution of Y given X = x:
  under P: N( x/(1 + r²), r² + r²/(1 + r²) )
  under πQ: N( x/(1 + s²/V), s² + s²/(1 + s²/V) )
◮ Quadratic loss: V_opt = s²/r².
◮ Log loss: V_opt balances the predictive mean and variance.
◮ If well-specified (s² = r²), V_opt = 1 for both losses.

In general, the optimal prior depends on n, k and the loss!
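The quadratic-loss optimum V_opt = s²/r² can be checked numerically: for n = k = 1 the risk of the prior N(0, V) has a closed form under P, and a grid search recovers the stated minimizer (the particular r and s below are assumptions for illustration):

```python
import numpy as np

# Closed-form quadratic-loss risk for n = k = 1: with c = V / (V + s2),
# the action is c * X, and under P we have Var X = Var Y = 1 + r2 and
# Cov(X, Y) = 1, so R(V) = E[(c X - Y)^2] = c^2 (1 + r2) - 2 c + (1 + r2).
r2, s2 = 16.0, 4.0  # r = 4, s = 2 (illustrative)

V = np.arange(0.01, 5.0, 0.001)
c = V / (V + s2)
risk = c**2 * (1 + r2) - 2 * c + (1 + r2)

V_opt = V[int(np.argmin(risk))]
print(V_opt, s2 / r2)  # the grid minimizer matches s^2 / r^2
```

Minimizing over c gives c = 1/(1 + r²), i.e., V/(V + s²) = 1/(1 + r²), which rearranges to V = s²/r².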

SLIDE 12

General Results when P is a mixture of i.i.d.

Theorem (Berk 1966). The posterior distribution of θ concentrates asymptotically on the point minimizing the KL divergence.

[Handwritten sketch: θ* = argmin_θ KL(P || Qθ), the point the posterior concentrates on.]

Conjecture

◮ For each ψ ∈ Ψ, assume there is a unique parameter φ(ψ) ∈ Θ

such that Qφ(ψ) minimizes the KL divergence with ˜ Pψ.

◮ Maybe “KL-projection” of prior, i.e., ˜

π = ˜ ν ◦ φ−1, is optimal.

SLIDE 13

General Results when P is a mixture of i.i.d.

◮ Let π̃ = ν̃ ∘ φ⁻¹ and let ν̃(dψ|θ) be the disintegration of ν̃ along φ.

◮ We can transform the true model over Ψ to one over Θ:

Pθ = ∫ P̃ψ ν̃(dψ|θ).

◮ Belief about the first k observations: P^(k) = ∫_Θ P_θ^k π̃(dθ).

Theorem (Y.–Roy)

For every θ ∈ Θ, assume θ is the unique point in Θ achieving the infimum inf_{θ′∈Θ} KL(Q_{θ′} || Pθ). Then

KL(P^(k) || π*_k Q^k) − KL(P^(k) || π̃ Q^k) → 0 as k → ∞,

where the two KL terms are the risks R(P, π*_k) and R(P, π̃) under log loss.

The true belief about the asymptotic “location” of the posterior distribution is an asymptotically optimal (surrogate) prior.

SLIDE 14

Meta-Bayesian Analysis for i.i.d. Bernoulli Model

Example

data are coin tosses: 10001001100001000100100

◮ true belief P: two-state {0, 1} Markov chain with transition matrix

  ( 1 − p    p  )
  (   q    1 − q )

◮ model: Q_θ^k = Bernoulli(θ)^k.

◮ true prior belief: ν̃(dp, dq) = π̃(dθ) κ̃(dψ|θ), where θ = p/(p + q) is the limiting relative frequency of 1’s (LRF).
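That θ = p/(p + q) is the limiting relative frequency of 1’s is just the chain’s stationary probability of state 1, which a quick power iteration confirms (the transition probabilities below are illustrative, not from the slides):

```python
import numpy as np

p, q = 0.3, 0.1  # illustrative transition probabilities
T = np.array([[1 - p, p],
              [q, 1 - q]])  # rows: current state 0/1; columns: next state

# Iterate the chain from an arbitrary start until the distribution settles.
dist = np.array([1.0, 0.0])
for _ in range(10_000):
    dist = dist @ T

print(dist[1], p / (p + q))  # long-run frequency of 1's equals p / (p + q)
```

The stationary distribution solves π = πT, giving (q, p)/(p + q), so the mass on state 1 is p/(p + q).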

SLIDE 15

What does a prior on an i.i.d. Bernoulli model mean?

Conjecture

The optimal prior for the model Q_θ^k is our true belief π̃(dθ) on the LRF.

In general, false!

Counterexample

Assume we know θ = 1/2.

◮ Truth: sticky Markov chain: 0000001111111100000011111111
◮ Model: i.i.d. sequence: 0010011101001011001001001001

If we make one observation (n = 1) and then make one prediction (k = 1), we are better off with a Beta(0.01, 0.01) prior than with the true belief δ_{1/2} on the LRF.

[Figure: density of the Beta(0.01, 0.01) prior over θ.]
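The counterexample can be verified exactly under quadratic loss (one natural choice; the slide does not name the loss). The chain’s stickiness is also not given, so a value is assumed below; with p = q the LRF is exactly 1/2, yet the Beta(0.01, 0.01) surrogate, whose predictive tracks the single observed toss, beats δ_{1/2}:

```python
# Sticky two-state chain with p = q (so theta = 1/2 exactly); value assumed.
p = 0.05           # P(flip): the chain repeats the previous symbol w.p. 0.95
a, b = 0.01, 0.01  # Beta(0.01, 0.01) surrogate prior

# n = 1 observation X, k = 1 prediction of Y. Start the chain at stationarity,
# so P(X = 1) = 1/2 and P(Y = X) = 1 - p.

# delta_{1/2} prior on the LRF: the predictive mean is always 1/2.
risk_delta = (0.5 * ((0.5 - 1) ** 2 * (1 - p) + (0.5 - 0) ** 2 * p)
              + 0.5 * ((0.5 - 0) ** 2 * (1 - p) + (0.5 - 1) ** 2 * p))

# Beta(0.01, 0.01): after X = 1 the predictive mean is (a + 1) / (a + b + 1).
m1 = (a + 1) / (a + b + 1)   # prediction after seeing X = 1
m0 = a / (a + b + 1)         # prediction after seeing X = 0
risk_beta = (0.5 * ((m1 - 1) ** 2 * (1 - p) + (m1 - 0) ** 2 * p)
             + 0.5 * ((m0 - 0) ** 2 * (1 - p) + (m0 - 1) ** 2 * p))

print(risk_beta, risk_delta)  # the Beta prior wins despite theta = 1/2 being known
```

With p = 0.05 the δ_{1/2} risk is 0.25 while the Beta(0.01, 0.01) risk is about 0.05: the near-degenerate Beta prior effectively predicts “same as last toss”, which matches the sticky truth.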
SLIDE 16

What does a prior on an i.i.d. Bernoulli model mean?

Theorem (Y.–Roy)

1. Let Q_θ^k be the i.i.d. Bernoulli model.
2. Let P be the true belief and assume P believes in the LRF.
3. Let π̃(dθ) be the true belief about the LRF and assume π̃ is absolutely continuous.
4. Let π*_k = argmin_π R(P, π) be an optimal surrogate prior.

Then

KL(P^(k) || π*_k Q^k) − KL(P^(k) || π̃ Q^k) → 0 as k → ∞,

where the two KL terms are the risks R(P, π*_k) and R(P, π̃) under log loss.

The true belief about the limiting relative frequency is an asymptotically optimal (surrogate) prior.
SLIDE 17

Conclusion and Future work

Conclusion

◮ The standard definition of a (subjective) prior is too restrictive.
◮ A more useful definition uses Bayesian decision theory.
◮ The meta-Bayesian prior is the one you believe will lead to the best results.

Future Work

◮ Beyond choosing priors: general meta-Bayesian analysis (optimal prediction algorithms)

◮ Analysis of the rationality of non-subjective procedures (e.g., switching, empirical Bayes)
