Meta-Bayesian Analysis
Jun Yang, joint work with Daniel M. Roy
Department of Statistical Sciences, University of Toronto
ISBA 2016, June 16, 2016
Motivation
"All models are wrong, but some are useful." (George Box)
"Truth [...] is much too complicated to allow anything but approximations." (John von Neumann)
◮ Subjective Bayesianism: alluring, but impossible to practice when the model is wrong
◮ Prior probability = degree of belief... in what?
What is a prior?
◮ Is there any role for (subjective) Bayesianism?
Our proposal: a more inclusive and pragmatic definition of "prior".
Our approach: Bayesian decision theory.
Example: Grossly Misspecified Model
Setting: machine learning; the data are a collection of documents.
◮ Model: Latent Dirichlet Allocation (LDA), aka "topic modeling"
◮ Prior belief: π̃ ≡ 0, i.e., no setting of LDA is faithful to the data-generating process
◮ Conjugate prior: π(dθ) = Dirichlet(α)
What is the meaning of a prior on LDA parameters?
Pragmatic question: if we use an LDA model (for whatever reason), how should we choose our "prior"?
Example: Accurate but still Misspecified Model
Setting: careful science; the data are experimental measurements.
◮ Model: (Qθ)θ∈Θ, painstakingly produced after years of effort
◮ Prior belief: π̃ ≡ 0, i.e., no Qθ is 100% faithful to the data-generating process
What is the meaning of a prior in a misspecified model? (All models are misspecified.)
Pragmatic question: how should we choose a "prior"?
Standard Bayesian Analysis for Prediction
◮ Qθ(·): model on X × Y given parameter θ, where X is what you will observe and Y is what you will then predict
◮ π(·): prior on θ
◮ (πQ)(·) = ∫ Qθ(·) π(dθ): marginal distribution on X × Y

Believe (X, Y) ∼ πQ.

The Task: observe X = x; predict Ŷ. Incur loss L(Ŷ, Y).

The Goal: minimize expected loss. The Bayes optimal action minimizes expected loss under the conditional distribution of Y given X = x, written πQ(dy | x):

    BayesOptAction(πQ, x) = argmin_a ∫ L(a, y) πQ(dy | x)

◮ Quadratic loss → posterior mean.
◮ Self-information loss (log loss) → posterior πQ(· | x).
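To make this concrete, here is a minimal numeric sketch, not from the talk: a grid-discretized i.i.d. Bernoulli model, where under quadratic loss BayesOptAction reduces to the posterior mean. All names and the grid discretization are illustrative assumptions.

```python
# A minimal sketch, assuming a grid over the Bernoulli parameter; names are
# illustrative, not from the talk. Quadratic loss -> posterior-mean prediction.
import numpy as np

thetas = np.linspace(0.01, 0.99, 99)           # grid over the Bernoulli parameter
prior = np.full(len(thetas), 1 / len(thetas))  # a uniform surrogate prior pi

def posterior(x):
    """pi(theta | x) for a 0/1 sequence x under Q_theta = Bernoulli(theta)."""
    k = sum(x)
    likelihood = thetas**k * (1 - thetas)**(len(x) - k)
    weights = prior * likelihood
    return weights / weights.sum()

def bayes_opt_quadratic(x):
    """argmin_a E[(a - Y)^2 | X = x], i.e. P(Y = 1 | x) = E[theta | x] here."""
    return float(posterior(x) @ thetas)

print(bayes_opt_quadratic([1, 0, 0, 0, 1, 0, 0, 1, 1]))  # approx. (4+1)/(9+2)
```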
Meta-Bayesian Analysis
◮ (Qθ)θ∈Θ: the model, i.e., a family of distributions on X × Y
◮ Don't believe Qθ, i.e., the model is misspecified
◮ P: represents our true belief on X × Y

Believe (X, Y) ∼ P, but we will use Qθ.

The Task: observe X = x; predict Ŷ = BayesOptAction(πQ, x). Incur loss L(Ŷ, Y).

The Goal: minimize expected loss with respect to P, not πQ.
Meta-Bayesian Analysis
Key ideas:
◮ Believe (X, Y) ∼ P
◮ But predict using πQ(· | X = x) for some prior π
◮ The prior π is a choice/decision/action
◮ The loss associated with π and (x, y) is

    L*(π, (x, y)) = L(BayesOptAction(πQ, x), y)
Meta-Bayesian risk
◮ The Bayes risk under P of doing Bayesian analysis under πQ:

    R(P, π) = ∫ L*(π, (x, y)) P(dx, dy)

◮ A meta-Bayesian optimal prior minimizes the meta-Bayesian risk:

    inf_{π∈F} R(P, π),

where F is some set of priors under consideration.
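A hedged Monte Carlo sketch of this risk, not from the talk: sample_P, bayes_opt_action, and loss are assumed callables introduced here for illustration. Draw (X, Y) from the true belief P, act as if (X, Y) ∼ πQ, and average the loss.

```python
# A hedged sketch of R(P, pi) by Monte Carlo; the three callables are
# assumptions for illustration, not part of the talk's notation.
import numpy as np

def meta_bayesian_risk(sample_P, bayes_opt_action, loss, n_sims=100_000, seed=0):
    """Estimate R(P, pi) = E_{(X,Y)~P}[ L(BayesOptAction(piQ, X), Y) ]."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_sims):
        x, y = sample_P(rng)                   # (X, Y) from the true belief P
        total += loss(bayes_opt_action(x), y)  # predict under piQ, score under P
    return total / n_sims
```

Minimizing over a family F then amounts to a search over priors, e.g., over the hyperparameter of a Beta(a, a) family.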
Meta-Bayesian Analysis
Recipe
◮ Step 1: state P, Qθ, and select a loss function L;
◮ Step 2: choose the prior π that minimizes the meta-Bayesian risk.

Examples
◮ Log loss: minimize the conditional relative entropy

    inf_π ∫ KL(P2(x, ·) ∥ πQ(· | x)) P1(dx),

  where P(dx, dy) = P1(dx) P2(x, dy).
◮ Quadratic loss: minimize the expected quadratic distance between the posterior means of πQ(· | x) and P2(x, ·):

    inf_π ∫ ( E_{πQ}[Y | x] − E_{P2}[Y | x] )^2 P1(dx).
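The quadratic-loss objective follows from a standard bias-variance decomposition, sketched below (not spelled out on the slides), writing E_{πQ}[Y | x] for the posterior-mean prediction:

```latex
% A sketch of the decomposition behind the quadratic-loss example: the
% cross term vanishes because E_{P_2}[ Y - E_{P_2}[Y | x] \mid x ] = 0.
\[
  R(P, \pi)
  = \int \bigl( \mathbb{E}_{\pi Q}[Y \mid x] - \mathbb{E}_{P_2}[Y \mid x] \bigr)^2 P_1(dx)
    + \underbrace{\int \operatorname{Var}_{P_2}(Y \mid x)\, P_1(dx)}_{\text{independent of } \pi}
\]
```

Only the squared distance between the two posterior means depends on π, so minimizing R(P, π) is exactly the displayed quadratic-loss problem.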
Meta-Bayesian Analysis
High-level Goals
◮ Meta-Bayesian analysis for Qθ under P is generally no easier
than doing Bayesian analysis under P directly.
◮ But P serves only as a placeholder for an
impossible-to-express true belief.
◮ Our theoretical approach is to attempt to prove general
theorems true of broad classes of “true beliefs” P.
◮ The hope is that this will tell us something deep about subjective Bayesianism.

The remaining slides present some key findings.
Meta-Bayesian 101: optimal prior depends on loss
Data are coin tosses: 10001001100001000100100
◮ Model: i.i.d. Bernoulli(θ) sequence, unknown θ
◮ True prior belief: π̃(dθ)

Problem Setting
◮ X = {0, 1}^n, Y = {0, 1}^k
◮ P: [Bernoulli(θ)]^{n+k}, θ ∼ π̃(dθ)
◮ Q_θ: [Bernoulli(θ)]^{n+k}, θ ∼ π(dθ)

Results from Meta-Bayesian Analysis
◮ Log loss: π should match the first n + k moments of π̃. The optimal prior usually depends on n and k!
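To illustrate moment matching for n + k = 2: a Beta surrogate is pinned down by the first two moments of the true belief π̃ via the method of moments. A minimal sketch; the function name and numbers are illustrative assumptions, not from the talk.

```python
# A minimal sketch: method-of-moments Beta(a, b) matching E[theta] and
# E[theta^2] of the true belief pi~. Illustrative only.
def beta_from_two_moments(m1, m2):
    """Return (a, b) with Beta mean m1 and second moment m2.

    Requires 0 < m2 - m1**2 < m1 * (1 - m1) for a valid Beta prior.
    """
    var = m2 - m1**2
    s = m1 * (1 - m1) / var - 1     # s = a + b
    return m1 * s, (1 - m1) * s

a, b = beta_from_two_moments(0.5, 0.26)   # mean 0.5, variance 0.01
print(a, b)                               # -> 12.0 12.0, i.e., Beta(12, 12)
```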
Meta-Bayesian Analysis for i.i.d. Bernoulli Model
Example
◮ True belief P: a two-state {0, 1} Markov chain with transition matrix

      [ 1 − p     p   ]
      [   q     1 − q ]

◮ Model: Q_θ^k = Bernoulli(θ)^k
◮ True prior belief: ν̃(dp, dq) = π̃(dθ) κ̃(dψ | θ), where θ = p / (p + q) is the limiting relative frequency of 1's (LRF)
What does a prior on an i.i.d. Bernoulli model mean?
Conjecture
The optimal prior for the model Q_θ^k is our true belief π̃(dθ) on the LRF.
Theorem (Y.–Roy)
False.
Example for n = 1 and k = 1
◮ Sticky Markov chain: 0000001111111100000011111111
◮ i.i.d. model: 0010011101001011001001001001
◮ Beta(0.01, 0.01) is better (see the simulation sketch below)
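A hedged simulation of this n = k = 1 comparison, not the authors' code: the stickiness p = q = 0.05 and the Beta(1, 1) baseline are illustrative assumptions. Under the sticky chain, the posterior-mean prediction from Beta(0.01, 0.01) essentially copies the observed toss, beating both a uniform prior and a point mass at the true LRF θ = 1/2.

```python
# A hedged simulation sketch: quadratic-loss meta-Bayesian risk for n = k = 1
# when the true belief P is a sticky two-state Markov chain. Illustrative
# parameter values; not the authors' code.
import numpy as np

p = q = 0.05                                  # sticky chain; LRF = p/(p+q) = 1/2
rng = np.random.default_rng(0)

def sample_P(rng):
    x = int(rng.integers(2))                  # stationary start: P(X1 = 1) = 1/2
    stay = 1 - (p if x == 0 else q)
    y = x if rng.random() < stay else 1 - x   # X2 | X1 under the chain
    return x, y

def predict(x, a, b):
    """Posterior-mean prediction of X2 under a Beta(a, b) prior, one toss seen."""
    return (a + x) / (a + b + 1)

draws = [sample_P(rng) for _ in range(200_000)]
for a in (0.01, 1.0):                         # Beta(0.01, 0.01) vs uniform Beta(1, 1)
    risk = np.mean([(predict(x, a, a) - y)**2 for x, y in draws])
    print(f"Beta({a}, {a}): estimated risk {risk:.3f}")
print("point mass at theta = 1/2: risk 0.25") # always predicts 0.5
```

With these values the estimated risks come out near 0.05 for Beta(0.01, 0.01) and 0.13 for Beta(1, 1), versus 0.25 for the point mass at the true LRF, matching the slide's claim.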
[Figure: density f_π(θ) of the Beta(0.01, 0.01) prior over θ ∈ [0, 1]]

What does a prior on an i.i.d. Bernoulli model mean?
Theorem (Y.–Roy)
Let Q_θ^k be the i.i.d. Bernoulli model, let π̃(dθ) be the true belief about the LRF, and assume π̃ is absolutely continuous. Let π_k = argmin_π R(P, π) be an optimal surrogate prior. Then

    KL(P^(k) ∥ π_k Q^k) − KL(P^(k) ∥ π̃ Q^k) → 0 as k → ∞.

True belief about the limiting relative frequency is an asymptotically optimal (surrogate) prior.
General Results when P is a mixture of i.i.d.
Theorem (Berk 1966): the posterior distribution of θ concentrates asymptotically on the point minimizing the KL divergence from the true distribution to the model.
Problem Setting
◮ P = ∫ P̃_ψ ν̃(dψ), where each P̃_ψ is i.i.d.
◮ Let Ψ_θ be the set of ψ such that Q_θ is the closest model to P̃_ψ.
◮ Define P_θ = ∫ P̃_ψ ν̃(dψ | θ) and π̃(dθ) = ∫_{Ψ_θ} ν̃(dψ).
General Results when P is a mixture of i.i.d.
Theorem (Y.–Roy)
Let π̃P^(k) = ∫ P_θ^(k) π̃(dθ), where P_θ^(k) is i.i.d. and θ attains the infimum inf_{θ′∈Θ} KL(Q_{θ′}^(k) ∥ P_θ^(k)) for k = 1. Then

    KL(P^(k) ∥ π_k Q^k) − KL(P^(k) ∥ π̃ Q^k) → 0 as k → ∞.

True belief about the asymptotic "location" of the posterior distribution is an asymptotically optimal (surrogate) prior.
Conclusion and Future Work
Conclusion
◮ The standard definition of a (subjective) prior is too restrictive
◮ A more useful definition uses Bayesian decision theory
◮ A meta-Bayesian prior is one you believe will lead to the best results
Future Work
◮ Beyond choosing priors: General Meta-Bayesian analysis
(optimal prediction algorithms)
◮ Analysis of the rationality of non-subjective procedures
(e.g., switching, empirical Bayes)