Meta-Bayesian Analysis
Jun Yang, joint work with Daniel M. Roy
Department of Statistical Sciences, University of Toronto
ISBA 2016, June 16, 2016
Motivation
"All models are wrong, but some are useful." (George Box)
"Truth [...] is much too complicated to allow anything but approximations." (John von Neumann)
◮ Subjective Bayesianism: alluring, but impossible to practice when the model is wrong
◮ Prior probability = degree of belief... in what?
What is a prior?
◮ Is there any role for (subjective) Bayesianism?
Our proposal: a more inclusive and pragmatic definition of "prior".
Our approach: Bayesian decision theory.
Example: Grossly Misspecified Model
Setting: machine learning; the data are a collection of documents.
◮ Model: Latent Dirichlet Allocation (LDA), aka "topic modeling"
◮ Prior belief: π̃ ≡ 0, i.e., no setting of LDA is faithful to the data-generating process
◮ Conjugate prior: π(dθ) = Dirichlet(α)
What is the meaning of a prior on LDA parameters?
Pragmatic question: if we use an LDA model (for whatever reason), how should we choose our "prior"?
Example: Accurate but still Misspecified Model
Setting: careful science; the data are experimental measurements.
◮ Model: (Qθ)θ∈Θ, painstakingly produced after years of effort
◮ Prior belief: π̃ ≡ 0, i.e., no Qθ is 100% faithful to the data-generating process
What is the meaning of a prior in a misspecified model? (All models are misspecified.)
Pragmatic question: how should we choose a "prior"?
Standard Bayesian Analysis for Prediction
◮ Qθ(·): model on X × Y given parameter θ, where X is what you will observe and Y is what you will then predict
◮ π(·): prior on θ
◮ (πQ)(·) = ∫ Qθ(·) π(dθ): marginal distribution on X × Y

Believe (X, Y) ∼ πQ.

The Task: observe X = x; predict Ŷ. Incur loss L(Ŷ, Y).

The Goal: minimize expected loss. The Bayes optimal action minimizes expected loss under the conditional distribution of Y given X = x, written πQ(dy | x):

    BayesOptAction(πQ, x) = argmin_a ∫ L(a, y) πQ(dy | x)

◮ Quadratic loss → posterior mean.
◮ Self-information loss (log loss) → posterior πQ(· | x).
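To make this concrete, here is a minimal numeric sketch, not from the talk: a grid-discretized i.i.d. Bernoulli model, where under quadratic loss BayesOptAction reduces to the posterior mean. All names and the grid discretization are illustrative assumptions.

```python
# A minimal sketch, assuming a grid over the Bernoulli parameter; names are
# illustrative, not from the talk. Quadratic loss -> posterior-mean prediction.
import numpy as np

thetas = np.linspace(0.01, 0.99, 99)           # grid over the Bernoulli parameter
prior = np.full(len(thetas), 1 / len(thetas))  # a uniform surrogate prior pi

def posterior(x):
    """pi(theta | x) for a 0/1 sequence x under Q_theta = Bernoulli(theta)."""
    k = sum(x)
    likelihood = thetas**k * (1 - thetas)**(len(x) - k)
    weights = prior * likelihood
    return weights / weights.sum()

def bayes_opt_quadratic(x):
    """argmin_a E[(a - Y)^2 | X = x], i.e. P(Y = 1 | x) = E[theta | x] here."""
    return float(posterior(x) @ thetas)

print(bayes_opt_quadratic([1, 0, 0, 0, 1, 0, 0, 1, 1]))  # approx. (4+1)/(9+2)
```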
Meta-Bayesian Analysis
◮ (Qθ)θ∈Θ: the model, i.e., a family of distributions on X × Y
◮ Don't believe Qθ, i.e., the model is misspecified
◮ P: represents our true belief on X × Y

Believe (X, Y) ∼ P, but we will use Qθ.

The Task: observe X = x; predict Ŷ = BayesOptAction(πQ, x). Incur loss L(Ŷ, Y).

The Goal: minimize expected loss with respect to P, not πQ.
Meta-Bayesian Analysis
Key ideas:
◮ Believe (X, Y) ∼ P
◮ But predict using πQ(· | X = x) for some prior π
◮ The prior π is a choice/decision/action
◮ The loss associated with π and (x, y) is

    L*(π, (x, y)) = L(BayesOptAction(πQ, x), y)
Meta-Bayesian risk
◮ The Bayes risk under P of doing Bayesian analysis under πQ:

    R(P, π) = ∫ L*(π, (x, y)) P(dx, dy)

◮ A meta-Bayesian optimal prior minimizes the meta-Bayesian risk:

    inf_{π∈F} R(P, π),

where F is some set of priors under consideration.
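A hedged Monte Carlo sketch of this risk, not from the talk: sample_P, bayes_opt_action, and loss are assumed callables introduced here for illustration. Draw (X, Y) from the true belief P, act as if (X, Y) ∼ πQ, and average the loss.

```python
# A hedged sketch of R(P, pi) by Monte Carlo; the three callables are
# assumptions for illustration, not part of the talk's notation.
import numpy as np

def meta_bayesian_risk(sample_P, bayes_opt_action, loss, n_sims=100_000, seed=0):
    """Estimate R(P, pi) = E_{(X,Y)~P}[ L(BayesOptAction(piQ, X), Y) ]."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_sims):
        x, y = sample_P(rng)                   # (X, Y) from the true belief P
        total += loss(bayes_opt_action(x), y)  # predict under piQ, score under P
    return total / n_sims
```

Minimizing over a family F then amounts to a search over priors, e.g., over the hyperparameter of a Beta(a, a) family.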
Meta-Bayesian Analysis
Recipe
◮ Step 1: state P, Qθ, and select a loss function L;
◮ Step 2: choose the prior π that minimizes the meta-Bayesian risk.

Examples
◮ Log loss: minimize the conditional relative entropy

    inf_π ∫ KL(P2(x, ·) ∥ πQ(· | x)) P1(dx),

  where P(dx, dy) = P1(dx) P2(x, dy).
◮ Quadratic loss: minimize the expected quadratic distance between the posterior means of πQ(· | x) and P2(x, ·):

    inf_π ∫ ( E_{πQ}[Y | x] − E_{P2}[Y | x] )^2 P1(dx).
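The quadratic-loss objective follows from a standard bias-variance decomposition, sketched below (not spelled out on the slides), writing E_{πQ}[Y | x] for the posterior-mean prediction:

```latex
% A sketch of the decomposition behind the quadratic-loss example: the
% cross term vanishes because E_{P_2}[ Y - E_{P_2}[Y | x] \mid x ] = 0.
\[
  R(P, \pi)
  = \int \bigl( \mathbb{E}_{\pi Q}[Y \mid x] - \mathbb{E}_{P_2}[Y \mid x] \bigr)^2 P_1(dx)
    + \underbrace{\int \operatorname{Var}_{P_2}(Y \mid x)\, P_1(dx)}_{\text{independent of } \pi}
\]
```

Only the squared distance between the two posterior means depends on π, so minimizing R(P, π) is exactly the displayed quadratic-loss problem.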
Meta-Bayesian Analysis
High-level Goals
◮ Meta-Bayesian analysis for Qθ under P is generally no easier
than doing Bayesian analysis under P directly.
◮ But P serves only as a placeholder for an
impossible-to-express true belief.
◮ Our theoretical approach is to attempt to prove general
theorems true of broad classes of “true beliefs” P.
◮ The hope is that this will tell us something deep about subjective Bayesianism.

The remaining slides present some key findings.
Meta-Bayesian 101: optimal prior depends on loss
Data are coin tosses: 10001001100001000100100
◮ Model: i.i.d. Bernoulli(θ) sequence, unknown θ
◮ True prior belief: π̃(dθ)

Problem Setting
◮ X = {0, 1}^n, Y = {0, 1}^k
◮ P: [Bernoulli(θ)]^{n+k}, θ ∼ π̃(dθ)
◮ Q_θ: [Bernoulli(θ)]^{n+k}, θ ∼ π(dθ)

Results from Meta-Bayesian Analysis
◮ Log loss: π should match the first n + k moments of π̃. The optimal prior usually depends on n and k!
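To illustrate moment matching for n + k = 2: a Beta surrogate is pinned down by the first two moments of the true belief π̃ via the method of moments. A minimal sketch; the function name and numbers are illustrative assumptions, not from the talk.

```python
# A minimal sketch: method-of-moments Beta(a, b) matching E[theta] and
# E[theta^2] of the true belief pi~. Illustrative only.
def beta_from_two_moments(m1, m2):
    """Return (a, b) with Beta mean m1 and second moment m2.

    Requires 0 < m2 - m1**2 < m1 * (1 - m1) for a valid Beta prior.
    """
    var = m2 - m1**2
    s = m1 * (1 - m1) / var - 1     # s = a + b
    return m1 * s, (1 - m1) * s

a, b = beta_from_two_moments(0.5, 0.26)   # mean 0.5, variance 0.01
print(a, b)                               # -> 12.0 12.0, i.e., Beta(12, 12)
```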
Meta-Bayesian Analysis for i.i.d. Bernoulli Model
Example
◮ True belief P: a two-state {0, 1} Markov chain with transition matrix

      [ 1 − p     p   ]
      [   q     1 − q ]

◮ Model: Q_θ^k = Bernoulli(θ)^k
◮ True prior belief: ν̃(dp, dq) = π̃(dθ) κ̃(dψ | θ), where θ = p / (p + q) is the limiting relative frequency of 1's (LRF)
What does a prior on an i.i.d. Bernoulli model mean?
Conjecture
The optimal prior for the model Q_θ^k is our true belief π̃(dθ) on the LRF.
Theorem (Y.–Roy)
False.
Example for n = 1 and k = 1
◮ Sticky Markov chain: 0000001111111100000011111111
◮ i.i.d. model: 0010011101001011001001001001
◮ Beta(0.01, 0.01) is better (see the simulation sketch below)
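A hedged simulation of this n = k = 1 comparison, not the authors' code: the stickiness p = q = 0.05 and the Beta(1, 1) baseline are illustrative assumptions. Under the sticky chain, the posterior-mean prediction from Beta(0.01, 0.01) essentially copies the observed toss, beating both a uniform prior and a point mass at the true LRF θ = 1/2.

```python
# A hedged simulation sketch: quadratic-loss meta-Bayesian risk for n = k = 1
# when the true belief P is a sticky two-state Markov chain. Illustrative
# parameter values; not the authors' code.
import numpy as np

p = q = 0.05                                  # sticky chain; LRF = p/(p+q) = 1/2
rng = np.random.default_rng(0)

def sample_P(rng):
    x = int(rng.integers(2))                  # stationary start: P(X1 = 1) = 1/2
    stay = 1 - (p if x == 0 else q)
    y = x if rng.random() < stay else 1 - x   # X2 | X1 under the chain
    return x, y

def predict(x, a, b):
    """Posterior-mean prediction of X2 under a Beta(a, b) prior, one toss seen."""
    return (a + x) / (a + b + 1)

draws = [sample_P(rng) for _ in range(200_000)]
for a in (0.01, 1.0):                         # Beta(0.01, 0.01) vs uniform Beta(1, 1)
    risk = np.mean([(predict(x, a, a) - y)**2 for x, y in draws])
    print(f"Beta({a}, {a}): estimated risk {risk:.3f}")
print("point mass at theta = 1/2: risk 0.25") # always predicts 0.5
```

With these values the estimated risks come out near 0.05 for Beta(0.01, 0.01) and 0.13 for Beta(1, 1), versus 0.25 for the point mass at the true LRF, matching the slide's claim.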
[Figure: density f_π(θ) of the Beta(0.01, 0.01) prior over θ ∈ [0, 1]]

What does a prior on an i.i.d. Bernoulli model mean?
Theorem (Y.–Roy)
Let Q_θ^k be the i.i.d. Bernoulli model, let π̃(dθ) be the true belief about the LRF, and assume π̃ is absolutely continuous. Let π_k = argmin_π R(P, π) be an optimal surrogate prior. Then

    KL(P^(k) ∥ π_k Q^k) − KL(P^(k) ∥ π̃ Q^k) → 0 as k → ∞.

True belief about the limiting relative frequency is an asymptotically optimal (surrogate) prior.
General Results when P is a mixture of i.i.d.
Theorem (Berk 1966): the posterior distribution of θ concentrates asymptotically on the point minimizing the KL divergence from the true distribution to the model.
Problem Setting
◮ P = ∫ P̃_ψ ν̃(dψ), where each P̃_ψ is i.i.d.
◮ Let Ψ_θ be the set of ψ such that Q_θ is the closest model to P̃_ψ.
◮ Define P_θ = ∫ P̃_ψ ν̃(dψ | θ) and π̃(dθ) = ∫_{Ψ_θ} ν̃(dψ).
General Results when P is a mixture of i.i.d.
Theorem (Y.–Roy)
Let π̃P^(k) = ∫ P_θ^(k) π̃(dθ), where P_θ^(k) is i.i.d. and θ attains the infimum inf_{θ′∈Θ} KL(Q_{θ′}^(k) ∥ P_θ^(k)) for k = 1. Then

    KL(P^(k) ∥ π_k Q^k) − KL(P^(k) ∥ π̃ Q^k) → 0 as k → ∞.

True belief about the asymptotic "location" of the posterior distribution is an asymptotically optimal (surrogate) prior.
Conclusion and Future Work
Conclusion
◮ The standard definition of a (subjective) prior is too restrictive
◮ A more useful definition uses Bayesian decision theory
◮ A meta-Bayesian prior is one you believe will lead to the best results
Future Work
◮ Beyond choosing priors: General Meta-Bayesian analysis
(optimal prediction algorithms)
◮ Analysis of the rationality of non-subjective procedures
(e.g., switching, empirical Bayes)