Introduction to Bayesian Statistics
Lecture 11: Model Comparison
Rung-Ching Tsai
Department of Mathematics National Taiwan Normal University
May 20, 2015
Measure of predictive accuracy: the expected log predictive density

  elpd = E_ỹ[ log p(ỹ | θ̂_mle) ]

is estimated by

  elpd_AIC = log p(y | θ̂_mle) − k,

where k is the number of parameters. AIC (= −2 elpd_AIC) is an approximately unbiased estimator of predictive accuracy.
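As a toy illustration (hypothetical data; a Normal(μ, σ²) model with both parameters estimated by maximum likelihood, so k = 2), elpd_AIC can be computed directly:

```python
import math

# Hypothetical data for illustration
y = [2.1, 1.9, 2.4, 2.0, 1.6]
n = len(y)

# Maximum-likelihood estimates for a Normal(mu, sigma^2) model
mu_mle = sum(y) / n
sigma2_mle = sum((yi - mu_mle) ** 2 for yi in y) / n

# Log likelihood at the MLE: log p(y | theta_mle)
loglik = sum(
    -0.5 * math.log(2 * math.pi * sigma2_mle)
    - (yi - mu_mle) ** 2 / (2 * sigma2_mle)
    for yi in y
)

k = 2                     # number of estimated parameters
elpd_aic = loglik - k     # bias-corrected estimate of predictive accuracy
aic = -2 * elpd_aic       # conventional scale: AIC = -2 log p(y|theta_mle) + 2k
```

On the conventional scale smaller AIC is better; on the elpd scale larger is better.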
What is the ‘deviance’?

The deviance is −2 times the log likelihood, evaluated at a certain point in parameter space:

  D(y, θ) = −2 log p(y | θ)

e.g., for Y_1, Y_2, · · · , Y_n ∼ Binomial(n_i, θ_i), the deviance is

  −2 Σ_i [ y_i log θ_i + (n_i − y_i) log(1 − θ_i) + log C(n_i, y_i) ].

Likelihood values greater than 1 lead to negative deviance, and this is entirely appropriate.
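The binomial deviance above can be computed directly; the counts and probabilities below are made up for illustration:

```python
import math

def binomial_deviance(y, n, theta):
    """D(y, theta) = -2 log p(y | theta) for independent Binomial(n_i, theta_i)."""
    logp = sum(
        yi * math.log(ti) + (ni - yi) * math.log(1 - ti) + math.log(math.comb(ni, yi))
        for yi, ni, ti in zip(y, n, theta)
    )
    return -2 * logp

# Hypothetical data: 3/10 and 7/10 successes, evaluated at theta = (0.3, 0.7)
D = binomial_deviance(y=[3, 7], n=[10, 10], theta=[0.3, 0.7])
```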
Summarizing fit with the deviance D = −2 log p(y | θ):

Use the posterior mean deviance D̄ = E_{θ|y}[D] as a measure of fit. A smaller D̄ indicates better fit, but D̄ alone does not penalize model complexity, which argues against using D̄ by itself.
The effective number of parameters, p_D^(1), is defined as

  p_D^(1) = E_{θ|y}[−2 log p(y|θ)] − (−2 log p(y|θ̃)) = E_{θ|y}[D(y, θ)] − D(y, θ̃),

where θ̃ = E[θ|y]; that is, posterior mean deviance minus deviance at the posterior mean. From posterior simulations θ^1, …, θ^L it is estimated by

  p̂_D^(1) = D̂_avg(y) − D_θ̂(y) = (1/L) Σ_{l=1}^L ( D(y, θ^l) − D_θ̂(y) ).
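A minimal sketch of this simulation estimate, assuming a toy model y_i ∼ N(θ, 1) with a flat prior, so the posterior of θ is N(ȳ, 1/n) (all numbers hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
y = np.array([1.2, 0.8, 1.5, 1.0])   # hypothetical data
n = len(y)

L = 5000
draws = rng.normal(y.mean(), 1 / np.sqrt(n), size=L)  # posterior simulations

def deviance(theta):
    # D(y, theta) = -2 log p(y | theta) for N(theta, 1) observations
    return float(np.sum(np.log(2 * np.pi) + (y - theta) ** 2))

D_avg = np.mean([deviance(t) for t in draws])  # posterior mean deviance
D_at_mean = deviance(draws.mean())             # deviance at the posterior mean
p_D1 = D_avg - D_at_mean                       # effective number of parameters
```

With one free parameter estimated only from the data, p_D1 comes out close to 1.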
Alternatively, the effective number of parameters can be estimated from half the posterior variance of the model-level deviance; this estimate is known as p_D^(2) (Gelman et al., 2004):

  p_D^(2) = var_{θ|y}[D(y, θ)] / 2,

estimated from posterior simulations by

  p̂_D^(2) = (1/2) · (1/(L − 1)) Σ_{l=1}^L ( D(y, θ^l) − D̂_avg(y) )².
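The same toy setup (y_i ∼ N(θ, 1), flat prior, posterior N(ȳ, 1/n); numbers hypothetical) gives a sketch of p_D^(2), which should again be near 1:

```python
import numpy as np

rng = np.random.default_rng(3)
y = np.array([1.2, 0.8, 1.5, 1.0])   # hypothetical data
n = len(y)

draws = rng.normal(y.mean(), 1 / np.sqrt(n), size=5000)  # posterior simulations

# Deviance D(y, theta^l) of each posterior draw
devs = np.array([np.sum(np.log(2 * np.pi) + (y - t) ** 2) for t in draws])

# Half the sample posterior variance of the deviance (the 1/(L-1) factor
# comes from ddof=1)
p_D2 = devs.var(ddof=1) / 2
```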
Properties of p_D^(1) and p_D^(2):

p_D^(1) is not invariant to reparameterization (the subject of much criticism). For normal linear models, p_D^(1) = tr(H), where H is the hat matrix that projects the data onto the fitted values (Hy = ŷ); thus p_D^(1) = Σ_i h_ii, the sum of the leverages. In general, its justification depends on asymptotic normality of the posterior distribution.

p_D can be thought of as the number of ‘unconstrained’ parameters in the model, where a parameter counts as: 1 if it is estimated with no constraints or prior information; 0 if it is fully constrained or if all the information about the parameter comes from the prior distribution; or an intermediate value if both the data and the prior are informative.

Both p_D^(1) and p_D^(2) should be positive. A negative p_D^(1) indicates one or more problems: the log-likelihood is non-concave, there is a conflict between the prior and the data, or the posterior mean is a poor estimator (such as with a bimodal posterior).
A criterion should trade off the goodness of fit of the model and the corresponding complexity of the model. The Deviance Information Criterion (DIC) is based on this principle:

  DIC = goodness of fit + complexity

In terms of predictive accuracy,

  elpd_DIC = log p(y | θ̂_Bayes) − p_DIC.

The complexity correction p_D^(1) rests on a χ² approximation to −2 × the log likelihood and is asymptotically correct in expectation.
  DIC = D(θ̂_Bayes) + 2 p_D^(1) = D̄ + p_D^(1), or DIC = D̄ + p_D^(2)

DIC can be used to compare models fit by different methods, as long as the dependent variable does not change between models, making DIC a very flexible model-fit statistic. It approximates out-of-sample predictive error as the sample size becomes large, and it is valid only when the joint posterior distribution is approximately multivariate normal. Because DIC penalizes goodness of fit with model complexity (p_D^(1) or p_D^(2)), simpler models are preferred.
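A sketch of DIC = D̄ + p_D^(1) for a toy Beta-Binomial model, assuming a uniform prior so the posterior is Beta(1 + y, 1 + n − y) (all numbers made up):

```python
import math
import numpy as np

rng = np.random.default_rng(1)
y_obs, n = 7, 10   # hypothetical data: 7 successes out of 10 trials

def deviance(theta):
    # D(y, theta) = -2 log p(y | theta) for a single Binomial(n, theta)
    return -2 * (y_obs * math.log(theta) + (n - y_obs) * math.log(1 - theta)
                 + math.log(math.comb(n, y_obs)))

# Posterior simulations under a uniform prior: Beta(1 + y, 1 + n - y)
draws = rng.beta(1 + y_obs, 1 + n - y_obs, size=5000)

D_bar = np.mean([deviance(t) for t in draws])  # posterior mean deviance
p_D1 = D_bar - deviance(draws.mean())          # effective number of parameters
DIC = D_bar + p_D1                             # equivalently D(theta_hat) + 2 p_D1
```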
DIC is intended for short-term predictions, in the same spirit as Akaike’s criterion. Models are compared by their differences in DIC; very roughly, a difference of more than about 10 rules out the model with the higher DIC. If the models being compared lead to very different inferences, then it could be misleading just to report the model with the lowest DIC.
WAIC: the expected log pointwise predictive density is estimated by

  elppd_WAIC = ( Σ_{i=1}^n log p_post(y_i) ) − p_WAIC,

with effective number of parameters

  p_WAIC = Σ_{i=1}^n var_post( log p(y_i | θ) ).
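A sketch of this computation, assuming a toy model y_i ∼ N(θ, 1) with a flat prior so the posterior of θ is N(ȳ, 1/n) (hypothetical data):

```python
import numpy as np

rng = np.random.default_rng(2)
y = np.array([0.2, -0.5, 0.9, 0.1])   # hypothetical data
n = len(y)

draws = rng.normal(y.mean(), 1 / np.sqrt(n), size=5000)  # posterior simulations

# (S, n) matrix of pointwise log densities log p(y_i | theta^s)
loglik = -0.5 * np.log(2 * np.pi) - 0.5 * (y[None, :] - draws[:, None]) ** 2

# lppd: sum over i of the log of the posterior-averaged density of y_i
lppd = np.sum(np.log(np.mean(np.exp(loglik), axis=0)))

# p_WAIC: sum over i of the posterior variance of log p(y_i | theta)
p_waic = np.sum(np.var(loglik, axis=0, ddof=1))

elppd_waic = lppd - p_waic
```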
Comparing two or more models:

  p(H2 | y) / p(H1 | y) = [ p(H2) / p(H1) ] × [ p(y | H2) / p(y | H1) ]

where p(H2) / p(H1) is the “prior odds” and p(y | H2) / p(y | H1) is the “Bayes factor”, with

  p(y | H) = ∫ p(y | θ, H) p(θ | H) dθ.

e.g., y ∼ N(θ, σ²/n) with p(θ) ∝ U(−A, A) for some large A: the marginal likelihood, and hence the Bayes factor, depends on the arbitrary choice of A.
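The dependence on A can be checked numerically: under the U(−A, A) prior the marginal likelihood is the Normal density of ȳ averaged over (−A, A), which behaves like 1/(2A) once A covers the likelihood (toy numbers below):

```python
import math

def Phi(x):
    # Standard normal CDF
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def marginal_likelihood(ybar, s, A):
    # p(y | H) = integral over (-A, A) of (1/(2A)) * N(ybar; theta, s^2) dtheta
    return (Phi((A - ybar) / s) - Phi((-A - ybar) / s)) / (2 * A)

m_small = marginal_likelihood(ybar=0.3, s=0.5, A=10)    # hypothetical numbers
m_large = marginal_likelihood(ybar=0.3, s=0.5, A=100)
# Increasing A tenfold shrinks p(y | H), and any Bayes factor built from it,
# by roughly a factor of ten.
```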
Example: H1: the woman is affected, θ = 1; H2: the woman is unaffected, θ = 0. There are no obvious scientific models in between; i.e., the parameter space is truly discrete.
Example (8 schools): y_j ∼ N(θ_j, σ_j²), for j = 1, . . . , 8.
H1: no pooling, p(θ1, · · · , θ8) ∝ 1; H2: complete pooling, θ1 = . . . = θ8 = θ with p(θ) ∝ 1.
Under these improper prior densities, p(y | H) is not defined, so the Bayes factor cannot be computed for any data!
Interpreting the Bayes factor B[H2:H1]:

  2 log(B[H2:H1])   B[H2:H1]     Evidence favoring H2 over H1
  0 to 2            1 to 3       Not worth more than a bare mention
  2 to 6            3 to 20      Positive
  6 to 10           20 to 150    Strong
  > 10              > 150        Very Strong
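For reporting, the table can be wrapped in a small helper (a hypothetical function; the thresholds and category strings simply mirror the rows above):

```python
def evidence_category(two_log_B):
    """Map 2*log(B[H2:H1]) to its verbal evidence category."""
    if two_log_B > 10:
        return "Very Strong"
    if two_log_B > 6:
        return "Strong"
    if two_log_B > 2:
        return "Positive"
    return "Not worth more than a bare mention"
```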