Introduction to Bayesian Statistics, Lecture 11: Model Comparison (PowerPoint PPT Presentation)


SLIDE 1

Introduction to Bayesian Statistics

Lecture 11: Model Comparison

Rung-Ching Tsai

Department of Mathematics National Taiwan Normal University

May 20, 2015

SLIDE 2

Evaluating and Comparing Models

  • Measure of predictive accuracy
  • log predictive density as a measure of fit
  • Out-of-sample predictive accuracy as a gold standard
  • deviance, information criteria and cross-validation
  • Within-sample predictive accuracy
  • Subtracting an adjustment
  • Cross-validation
  • Model comparison based on predictive performance
  • Model comparison based on Bayes factor


SLIDE 3

Akaike information criterion (AIC)

elpd_AIC = log p(y | θ̂_mle) − k

  • elpd = expected log predictive density
  • Based on fit to observed data given maximum likelihood estimates
  • Goal: use the expected log predictive density (elpd),

elpd = E_ỹ[ log p(ỹ | θ̂_mle) ]

  • the expectation averages over the predictive distribution of ỹ
  • AIC began life with Akaike's (1973) theorem, which established that AIC is an unbiased estimator of predictive accuracy.
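As a minimal numerical sketch (not from the slides; the Normal model and data are invented for illustration), the AIC quantities above can be computed as:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
y = rng.normal(loc=5.0, scale=2.0, size=100)   # invented data

# MLE for a Normal(mu, sigma^2) model: sample mean and (biased) sample sd
mu_hat, sigma_hat = y.mean(), y.std()

# log p(y | theta_mle): log predictive density of the data at the MLE
log_lik = stats.norm.logpdf(y, loc=mu_hat, scale=sigma_hat).sum()

k = 2                      # number of fitted parameters (mu and sigma)
elpd_aic = log_lik - k     # estimated expected log predictive density
aic = -2 * elpd_aic        # conventional scale: AIC = -2 log p(y|theta_mle) + 2k
```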


SLIDE 4

deviance

What is the ‘deviance’?

  • For a likelihood p(y | θ), we define the deviance as

    D(y, θ) = −2 log p(y | θ)

    e.g. for Y_1, Y_2, · · · , Y_n with Y_i ∼ Binomial(n_i, θ_i), the deviance is

    −2 Σ_i [ y_i log θ_i + (n_i − y_i) log(1 − θ_i) + log (n_i choose y_i) ]

  • It is possible to have a negative deviance. Deviance is derived from the likelihood and evaluated at a certain point in parameter space. Likelihood values greater than 1 lead to negative deviance, and are perfectly legitimate.
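A quick check of the Binomial deviance formula (counts and probabilities invented for the example): the sum of `logpmf` terms matches the expanded expression on the slide.

```python
import numpy as np
from scipy import stats
from scipy.special import gammaln

y = np.array([3, 7, 5])            # invented observed counts
n = np.array([10, 10, 10])         # numbers of trials
theta = np.array([0.3, 0.7, 0.5])  # success probabilities

# D(y, theta) = -2 log p(y | theta) for independent Binomial observations
deviance = -2 * stats.binom.logpmf(y, n, theta).sum()

# the same quantity written out term by term, log C(n, y) via gammaln
log_choose = gammaln(n + 1) - gammaln(y + 1) - gammaln(n - y + 1)
by_hand = -2 * (y * np.log(theta)
                + (n - y) * np.log(1 - theta)
                + log_choose).sum()
```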


SLIDE 5

mean deviance as measure of fit

  • Dempster (1974) suggested plotting the posterior distribution of the deviance D = −2 log p(y | θ)
  • Use the posterior mean deviance D̄ = E[D | y] as a measure of fit
  • Invariant to the parameterization of θ
  • Robust, generally converges well
  • But more complex models will fit the data better and so will have smaller D̄
  • Need some measure of model complexity to trade off against D̄
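A small sketch of the posterior mean deviance (not from the slides; a single invented Binomial observation with a conjugate Uniform prior, so the posterior can be sampled directly):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
y, n = 7, 10                                   # one invented Binomial observation

# under a Uniform(0, 1) prior the posterior is Beta(y + 1, n - y + 1)
theta = rng.beta(y + 1, n - y + 1, size=4000)  # posterior draws

D = -2 * stats.binom.logpmf(y, n, theta)       # deviance at each draw
D_bar = D.mean()                               # posterior mean deviance
```

By Jensen-type reasoning, `D_bar` exceeds the deviance evaluated at the posterior mean of θ; that gap is exactly the complexity measure introduced on the next slide.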


SLIDE 6

counting parameters and model complexity: p_D^(1)

  • Bayesian measure of model complexity (Spiegelhalter et al, 2002):

    E_{θ|y}[−2 log p(y | θ)] − (−2 log p(y | θ̃)) = E_{θ|y}[D(y, θ)] − D(y, θ̃),

    where θ̃ = E[θ | y]; the measure is defined as posterior mean deviance minus deviance at the posterior mean.

  • the measure of the effective number of parameters of a Bayesian model:

    p_D^(1) = Ê_{θ|y}[D(y, θ)] − D(y, θ̃)
            = D̂_avg(y) − D_θ̃(y)
            = (1/L) Σ_{l=1}^{L} ( D(y, θ^l) − D_θ̃(y) )
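The Monte Carlo estimate of p_D^(1) above can be sketched as follows (invented single-Binomial example with a conjugate Uniform prior, so posterior draws are exact):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
y, n = 7, 10                                    # one invented Binomial observation
theta = rng.beta(y + 1, n - y + 1, size=4000)   # posterior draws, Uniform prior

D = -2 * stats.binom.logpmf(y, n, theta)                  # D(y, theta^l)
D_avg = D.mean()                                          # posterior mean deviance
D_at_mean = -2 * stats.binom.logpmf(y, n, theta.mean())   # deviance at posterior mean

p_d1 = D_avg - D_at_mean    # effective number of parameters, p_D^(1)
```

With a single weakly informed parameter, `p_d1` should land somewhere below 1, consistent with the "intermediate value" interpretation two slides on.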


SLIDE 7

counting parameters and model complexity: p_D^(2)

  • A related way to measure model complexity is as half the posterior variance of the model-level deviance; its estimate is known as p_D^(2) (Gelman et al, 2004):

    p_D^(2) = v̂ar_{θ|y}[D(y, θ)] / 2
            = (1/2) · (1/(L − 1)) Σ_{l=1}^{L} ( D(y, θ^l) − D̂_avg(y) )²
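The same invented single-Binomial setup gives a one-line estimate of p_D^(2); `ddof=1` matches the 1/(L − 1) sample-variance convention in the formula.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
y, n = 7, 10                                    # one invented Binomial observation
theta = rng.beta(y + 1, n - y + 1, size=4000)   # posterior draws, Uniform prior

D = -2 * stats.binom.logpmf(y, n, theta)        # deviance at each draw

p_d2 = D.var(ddof=1) / 2    # half the posterior variance of the deviance
```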


SLIDE 8

comparison of p_D^(1) and p_D^(2)

  • p_D^(1) is not invariant to reparameterization (the subject of much criticism).
  • In normal linear hierarchical models: p_D^(1) = tr(H), where H is the hat matrix which projects the data onto the fitted values, Hy = ŷ. Thus p_D^(1) = Σ_i h_ii = sum of the leverages. In general, the justification depends on asymptotic normality of the posterior distribution.
  • p_D^(1) or p_D^(2) can be thought of as the number of 'unconstrained' parameters in the model, where a parameter counts as: 1 if it is estimated with no constraints or prior information; 0 if it is fully constrained or if all the information about the parameter comes from the prior distribution; or an intermediate value if both the data and the prior are informative.
  • p_D^(1) and p_D^(2) should be positive. A negative p_D^(1) value indicates one or more problems: the log-likelihood is non-concave, there is a conflict between the prior and the data, or the posterior mean is a poor estimator (as with a bimodal posterior).
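The tr(H) identity is easy to verify numerically (invented Gaussian design matrix; for a fixed-effects normal linear model with flat priors, the effective number of parameters is exactly the number of coefficients):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 50, 3
X = rng.normal(size=(n, p))        # invented full-rank design matrix

# hat matrix: H @ y gives the least-squares fitted values
H = X @ np.linalg.solve(X.T @ X, X.T)

trace_H = np.trace(H)              # sum of leverages h_ii; equals p here
```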


SLIDE 9

Deviance information criterion (DIC)

  • use a criterion based on the trade-off between the fit of the data to the model and the corresponding complexity of the model
  • Spiegelhalter et al (2002) proposed a Bayesian model comparison criterion based on this principle: the Deviance Information Criterion,

    DIC = goodness of fit + complexity

elpd_DIC = log p(y | θ̂_Bayes) − p_DIC

  • Based on fit to observed data given the posterior mean
  • Effective number of parameters p_DIC computed based on a normal approximation (χ² approximation to −2 log likelihood): p_D^(1) or p_D^(2)
  • Either p_D^(1) or p_D^(2) is asymptotically OK in expectation


SLIDE 10

Model comparison-using DIC

  • The DIC is then defined analogously to AIC as

    DIC = D(θ̂_Bayes) + 2 p_D^(1) = D̄ + p_D^(1),   or   DIC = D̄ + p_D^(2)

  • DIC may be compared across different models and even different methods, as long as the dependent variable does not change between models, making DIC a very flexible model fit statistic.
  • Like AIC and BIC, DIC is an asymptotic approximation as the sample size becomes large. DIC is valid only when the joint posterior distribution is approximately multivariate normal.
  • Models with smaller DIC should be preferred. Since DIC increases with model complexity (p_D^(1) or p_D^(2)), simpler models are preferred.
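Putting the pieces together, a DIC computation from posterior draws might be sketched like this (invented single-Binomial example; note the two forms of the formula agree by construction):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
y, n = 7, 10                                    # one invented Binomial observation
theta = rng.beta(y + 1, n - y + 1, size=4000)   # posterior draws, Uniform prior

D = -2 * stats.binom.logpmf(y, n, theta)                  # deviance at each draw
D_bar = D.mean()                                          # posterior mean deviance
D_hat = -2 * stats.binom.logpmf(y, n, theta.mean())       # deviance at posterior mean

p_d1 = D_bar - D_hat
dic = D_hat + 2 * p_d1      # equivalently D_bar + p_d1
```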


SLIDE 11

How do I compare different DICs?

  • The model with the minimum DIC estimate will make the best short-term predictions, in the same spirit as Akaike's criterion.
  • It is difficult to say what would constitute an important difference in DIC. Very roughly:
  • differences of more than 10 might definitely rule out the model with the higher DIC;
  • differences between 5 and 10 are substantial;
  • if the difference in DIC is, say, less than 5, and the models make very different inferences, then it could be misleading just to report the model with the lowest DIC.


SLIDE 12

Watanabe-Akaike information criterion (WAIC)

elppd_WAIC = ( Σ_{i=1}^{n} log p_post(y_i) ) − p_WAIC

  • elppd = expected log posterior predictive density
  • Based on the posterior predictive fit to the observed data
  • p_WAIC = Σ_{i=1}^{n} var_post( log p(y_i | θ) )
  • Compute p_post and var_post using posterior simulations
  • Requires a partition of the data into pointwise terms y_i
  • Connection to leave-one-out cross-validation
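A WAIC sketch for an invented conjugate Normal-mean model (known σ = 1, N(0, 10²) prior, so exact posterior draws are available; `logsumexp` averages the pointwise densities stably):

```python
import numpy as np
from scipy import stats
from scipy.special import logsumexp

rng = np.random.default_rng(4)
y = rng.normal(0.5, 1.0, size=20)        # invented data, known sigma = 1

# conjugate Normal posterior for mu under a N(0, 10^2) prior
post_var = 1.0 / (1.0 / 10**2 + len(y) / 1.0)
post_mean = post_var * y.sum()
mu = rng.normal(post_mean, np.sqrt(post_var), size=4000)   # posterior draws

# pointwise log predictive density: rows = draws, columns = observations
lp = stats.norm.logpdf(y[None, :], mu[:, None], 1.0)

# sum_i log p_post(y_i): log of the posterior-averaged density per observation
lppd = (logsumexp(lp, axis=0) - np.log(lp.shape[0])).sum()

p_waic = lp.var(axis=0, ddof=1).sum()    # summed posterior variances
elppd_waic = lppd - p_waic
```

With one free parameter, `p_waic` should come out near 1, mirroring the effective-parameter interpretation of p_D.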


SLIDE 13

Model comparison-Bayes factor

Comparing two or more models:

    p(H2 | y) / p(H1 | y) = [ p(H2) / p(H1) ] × [ p(y | H2) / p(y | H1) ]

  • p(H2)/p(H1) is the "prior odds"
  • B[H2 : H1] = p(y | H2)/p(y | H1) is the "Bayes factor", with

    p(y | H) = ∫ p(y | θ, H) p(θ | H) dθ

  • Problem with p(y | H):
  • the integral depends on irrelevant tail properties of the prior density
  • consider ȳ ∼ N(θ, σ²/n) and p(θ) ∝ U(−A, A), for some large A
  • the marginal p(y) is then proportional to 1/A
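The 1/A behaviour is easy to see numerically (ȳ and the standard error are invented; the integral over the Uniform(−A, A) prior has a closed form via the Normal CDF):

```python
import numpy as np
from scipy import stats

ybar, se = 1.3, 0.5     # invented: ybar ~ N(theta, se^2), se = sigma / sqrt(n)

def marginal(A):
    """p(ybar | H) under the prior theta ~ Uniform(-A, A), in closed form."""
    # integral of N(ybar | theta, se^2) over theta in (-A, A), times 1/(2A)
    mass = stats.norm.cdf((A - ybar) / se) - stats.norm.cdf((-A - ybar) / se)
    return mass / (2 * A)

# for large A the likelihood mass is fully contained, so doubling A
# halves the marginal: p(ybar) is proportional to 1/A
ratio = marginal(200.0) / marginal(100.0)
```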


SLIDE 14

An example where the Bayes factor is good

  • Genetics example with

    H1: the woman is affected, θ = 1
    H2: the woman is unaffected, θ = 0

  • prior odds are p(H2)/p(H1) = 1
  • Bayes factor of the data is p(y|H2)/p(y|H1) = 1.0/0.25 = 4
  • the posterior odds are thus p(H2|y)/p(H1|y) = 4
  • Two features allow the Bayes factor to be helpful here:
  • each of the discrete alternatives makes scientific sense, and there are no obvious scientific models in between; i.e., a truly discrete parameter space
  • a model of probabilities, with no unbounded parameters
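The arithmetic of the example, spelled out:

```python
# posterior odds = prior odds x Bayes factor, with the slide's numbers
prior_odds = 1.0                    # p(H2) / p(H1)
bayes_factor = 1.0 / 0.25           # p(y | H2) / p(y | H1) = 4
posterior_odds = prior_odds * bayes_factor   # -> 4.0, i.e. p(H2|y)/p(H1|y) = 4
```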


SLIDE 15

An example where the Bayes factor is bad

  • 8 schools example: y_j ∼ N(θ_j, σ_j²), for j = 1, . . . , 8.

    H1: no pooling, p(θ_1, · · · , θ_8) ∝ 1
    H2: complete pooling, θ_1 = . . . = θ_J = θ, p(θ) ∝ 1

  • Bayes factor is 0/0
  • Instead, express the flat priors as N(0, A²) and let A get large
  • Now the Bayes factor depends strongly on A
  • As A → ∞, the complete-pooling model gets 100% of the probability, for any data!
  • Also a horrible dependence on J
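This pathology can be demonstrated directly. The sketch below uses the classic 8 schools data and puts N(0, A²) priors on both models; both marginal likelihoods then have closed forms (the complete-pooling marginal is a one-dimensional Gaussian integral done analytically). The log Bayes factor for pooling grows roughly like (J − 1) log A.

```python
import numpy as np
from scipy import stats

# the classic 8 schools data: estimated effects and standard errors
y = np.array([28., 8., -3., 7., -1., 1., 18., 12.])
s = np.array([15., 10., 16., 11., 9., 11., 10., 18.])

def log_bayes_factor(A):
    """log B[H2 : H1] when the flat priors are replaced by N(0, A^2)."""
    # H1 (no pooling): theta_j ~ N(0, A^2) indep., so y_j ~ N(0, s_j^2 + A^2)
    log_m1 = stats.norm.logpdf(y, 0.0, np.sqrt(s**2 + A**2)).sum()
    # H2 (complete pooling): theta ~ N(0, A^2), integrated out in closed form
    w = 1.0 / s**2                     # precision weights
    P = w.sum() + 1.0 / A**2           # posterior precision of theta
    b = (w * y).sum()
    log_m2 = (stats.norm.logpdf(y, 0.0, s).sum()
              + 0.5 * b**2 / P
              - 0.5 * np.log(A**2 * P))
    return log_m2 - log_m1

# widening the "flat" prior by a factor of 100 shifts the Bayes factor
# by roughly 7 * log(100) ~ 32 in favor of complete pooling
growth = log_bayes_factor(1e4) - log_bayes_factor(1e2)
```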


SLIDE 16

Interpretation of Bayes Factors

  • Jeffreys (1961) and Kass & Raftery (1995):

    2 log(B[H2 : H1])   B[H2 : H1]   Favor H2 over H1
    0 to 2              1 to 3       Not worth a bare mention
    2 to 6              3 to 20      Positive
    6 to 10             20 to 150    Strong
    > 10                > 150        Very Strong

  • B[H2 : H1] = 1/B[H1 : H2]
  • The interpretation is on the same scale as deviance and likelihood-ratio statistics
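The table translates directly into a small lookup function (a convenience sketch, not part of the original slides):

```python
def evidence_strength(two_log_b):
    """Kass & Raftery (1995) verbal labels for 2 log(B[H2 : H1])."""
    if two_log_b < 2:
        return "Not worth a bare mention"
    if two_log_b < 6:
        return "Positive"
    if two_log_b < 10:
        return "Strong"
    return "Very Strong"
```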
