Introduction to Bayesian Statistics, Lecture 11: Model Comparison (PowerPoint PPT Presentation)


SLIDE 1

Introduction to Bayesian Statistics

Lecture 11: Model Comparison

Rung-Ching Tsai

Department of Mathematics National Taiwan Normal University

May 20, 2015

SLIDE 2

Evaluating and Comparing Models

  • Measure of predictive accuracy
  • log predictive density as a measure of fit
  • Out-of-sample predictive accuracy as a gold standard
  • deviance, information criteria and cross-validation
  • Within-sample predictive accuracy
  • Subtracting an adjustment
  • Cross-validation
  • Model comparison based on predictive performance
  • Model comparison based on Bayes factor


SLIDE 3

Akaike information criterion (AIC)

elpd_AIC = log p(y | θ̂_mle) − k

  • elpd = expected log predictive density
  • Based on fit to observed data given maximum likelihood estimates
  • Goal: use the expected log predictive density (elpd),

elpd = E_ỹ[ log p(ỹ | θ̂_mle) ]

  • the expectation averages over the predictive distribution of ỹ
  • AIC began life with Akaike's (1973) theorem, which established that AIC is an unbiased estimator of predictive accuracy.
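As a minimal numerical sketch (not from the slides; the Normal model and data are invented for illustration), the AIC quantities above can be computed as:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
y = rng.normal(loc=5.0, scale=2.0, size=100)   # invented data

# MLE for a Normal(mu, sigma^2) model: sample mean and (biased) sample sd
mu_hat, sigma_hat = y.mean(), y.std()

# log p(y | theta_mle): log predictive density of the data at the MLE
log_lik = stats.norm.logpdf(y, loc=mu_hat, scale=sigma_hat).sum()

k = 2                      # number of fitted parameters (mu and sigma)
elpd_aic = log_lik - k     # estimated expected log predictive density
aic = -2 * elpd_aic        # conventional scale: AIC = -2 log p(y|theta_mle) + 2k
```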


SLIDE 4

deviance

What is the ‘deviance’?

  • For a likelihood p(y | θ), we define the deviance as

    D(y, θ) = −2 log p(y | θ)

    e.g. for Y_1, Y_2, · · · , Y_n with Y_i ∼ Binomial(n_i, θ_i), the deviance is

    −2 Σ_i [ y_i log θ_i + (n_i − y_i) log(1 − θ_i) + log (n_i choose y_i) ]

  • It is possible to have a negative deviance. Deviance is derived from the likelihood and evaluated at a certain point in parameter space. Likelihood values greater than 1 lead to negative deviance, and are perfectly legitimate.
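A quick check of the Binomial deviance formula (counts and probabilities invented for the example): the sum of `logpmf` terms matches the expanded expression on the slide.

```python
import numpy as np
from scipy import stats
from scipy.special import gammaln

y = np.array([3, 7, 5])            # invented observed counts
n = np.array([10, 10, 10])         # numbers of trials
theta = np.array([0.3, 0.7, 0.5])  # success probabilities

# D(y, theta) = -2 log p(y | theta) for independent Binomial observations
deviance = -2 * stats.binom.logpmf(y, n, theta).sum()

# the same quantity written out term by term, log C(n, y) via gammaln
log_choose = gammaln(n + 1) - gammaln(y + 1) - gammaln(n - y + 1)
by_hand = -2 * (y * np.log(theta)
                + (n - y) * np.log(1 - theta)
                + log_choose).sum()
```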


SLIDE 5

mean deviance as measure of fit

  • Dempster (1974) suggested plotting the posterior distribution of the deviance D = −2 log p(y | θ)
  • Use the posterior mean deviance D̄ = E[D | y] as a measure of fit
  • Invariant to the parameterization of θ
  • Robust, generally converges well
  • But more complex models will fit the data better and so will have smaller D̄
  • Need some measure of model complexity to trade off against D̄
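A small sketch of the posterior mean deviance (not from the slides; a single invented Binomial observation with a conjugate Uniform prior, so the posterior can be sampled directly):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
y, n = 7, 10                                   # one invented Binomial observation

# under a Uniform(0, 1) prior the posterior is Beta(y + 1, n - y + 1)
theta = rng.beta(y + 1, n - y + 1, size=4000)  # posterior draws

D = -2 * stats.binom.logpmf(y, n, theta)       # deviance at each draw
D_bar = D.mean()                               # posterior mean deviance
```

By Jensen-type reasoning, `D_bar` exceeds the deviance evaluated at the posterior mean of θ; that gap is exactly the complexity measure introduced on the next slide.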


SLIDE 6

counting parameters and model complexity: p_D^(1)

  • Bayesian measure of model complexity (Spiegelhalter et al, 2002):

    E_{θ|y}[−2 log p(y | θ)] − (−2 log p(y | θ̃)) = E_{θ|y}[D(y, θ)] − D(y, θ̃),

    where θ̃ = E[θ | y]; the measure is defined as posterior mean deviance minus deviance at the posterior mean.

  • the measure of the effective number of parameters of a Bayesian model:

    p_D^(1) = Ê_{θ|y}[D(y, θ)] − D(y, θ̃)
            = D̂_avg(y) − D_θ̃(y)
            = (1/L) Σ_{l=1}^{L} ( D(y, θ^l) − D_θ̃(y) )
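The Monte Carlo estimate of p_D^(1) above can be sketched as follows (invented single-Binomial example with a conjugate Uniform prior, so posterior draws are exact):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
y, n = 7, 10                                    # one invented Binomial observation
theta = rng.beta(y + 1, n - y + 1, size=4000)   # posterior draws, Uniform prior

D = -2 * stats.binom.logpmf(y, n, theta)                  # D(y, theta^l)
D_avg = D.mean()                                          # posterior mean deviance
D_at_mean = -2 * stats.binom.logpmf(y, n, theta.mean())   # deviance at posterior mean

p_d1 = D_avg - D_at_mean    # effective number of parameters, p_D^(1)
```

With a single weakly informed parameter, `p_d1` should land somewhere below 1, consistent with the "intermediate value" interpretation two slides on.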


SLIDE 7

counting parameters and model complexity: p_D^(2)

  • A related way to measure model complexity is as half the posterior variance of the model-level deviance; its estimate is known as p_D^(2) (Gelman et al, 2004):

    p_D^(2) = v̂ar_{θ|y}[D(y, θ)] / 2
            = (1/2) · (1/(L − 1)) Σ_{l=1}^{L} ( D(y, θ^l) − D̂_avg(y) )²
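The same invented single-Binomial setup gives a one-line estimate of p_D^(2); `ddof=1` matches the 1/(L − 1) sample-variance convention in the formula.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
y, n = 7, 10                                    # one invented Binomial observation
theta = rng.beta(y + 1, n - y + 1, size=4000)   # posterior draws, Uniform prior

D = -2 * stats.binom.logpmf(y, n, theta)        # deviance at each draw

p_d2 = D.var(ddof=1) / 2    # half the posterior variance of the deviance
```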


SLIDE 8

comparison of p_D^(1) and p_D^(2)

  • p_D^(1) is not invariant to reparameterization (the subject of much criticism).
  • In normal linear hierarchical models: p_D^(1) = tr(H), where H is the hat matrix which projects the data onto the fitted values, Hy = ŷ. Thus p_D^(1) = Σ_i h_ii = sum of the leverages. In general, the justification depends on asymptotic normality of the posterior distribution.
  • p_D^(1) or p_D^(2) can be thought of as the number of 'unconstrained' parameters in the model, where a parameter counts as: 1 if it is estimated with no constraints or prior information; 0 if it is fully constrained or if all the information about the parameter comes from the prior distribution; or an intermediate value if both the data and the prior are informative.
  • p_D^(1) and p_D^(2) should be positive. A negative p_D^(1) value indicates one or more problems: the log-likelihood is non-concave, there is a conflict between the prior and the data, or the posterior mean is a poor estimator (as with a bimodal posterior).
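The tr(H) identity is easy to verify numerically (invented Gaussian design matrix; for a fixed-effects normal linear model with flat priors, the effective number of parameters is exactly the number of coefficients):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 50, 3
X = rng.normal(size=(n, p))        # invented full-rank design matrix

# hat matrix: H @ y gives the least-squares fitted values
H = X @ np.linalg.solve(X.T @ X, X.T)

trace_H = np.trace(H)              # sum of leverages h_ii; equals p here
```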


SLIDE 9

Deviance information criterion (DIC)

  • use a criterion based on the trade-off between the fit of the data to the model and the corresponding complexity of the model
  • Spiegelhalter et al (2002) proposed a Bayesian model comparison criterion based on this principle: the Deviance Information Criterion,

    DIC = goodness of fit + complexity

elpd_DIC = log p(y | θ̂_Bayes) − p_DIC

  • Based on fit to observed data given the posterior mean
  • Effective number of parameters p_DIC computed based on a normal approximation (χ² approximation to −2 log likelihood): p_D^(1) or p_D^(2)
  • Either p_D^(1) or p_D^(2) is asymptotically OK in expectation


SLIDE 10

Model comparison-using DIC

  • The DIC is then defined analogously to AIC as

    DIC = D(θ̂_Bayes) + 2 p_D^(1) = D̄ + p_D^(1),   or   DIC = D̄ + p_D^(2)

  • DIC may be compared across different models and even different methods, as long as the dependent variable does not change between models, making DIC a very flexible model fit statistic.
  • Like AIC and BIC, DIC is an asymptotic approximation as the sample size becomes large. DIC is valid only when the joint posterior distribution is approximately multivariate normal.
  • Models with smaller DIC should be preferred. Since DIC increases with model complexity (p_D^(1) or p_D^(2)), simpler models are preferred.
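Putting the pieces together, a DIC computation from posterior draws might be sketched like this (invented single-Binomial example; note the two forms of the formula agree by construction):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
y, n = 7, 10                                    # one invented Binomial observation
theta = rng.beta(y + 1, n - y + 1, size=4000)   # posterior draws, Uniform prior

D = -2 * stats.binom.logpmf(y, n, theta)                  # deviance at each draw
D_bar = D.mean()                                          # posterior mean deviance
D_hat = -2 * stats.binom.logpmf(y, n, theta.mean())       # deviance at posterior mean

p_d1 = D_bar - D_hat
dic = D_hat + 2 * p_d1      # equivalently D_bar + p_d1
```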


SLIDE 11

How do I compare different DICs?

  • The model with the minimum DIC estimate will make the best short-term predictions, in the same spirit as Akaike's criterion.
  • It is difficult to say what would constitute an important difference in DIC. Very roughly:
  • differences of more than 10 might definitely rule out the model with the higher DIC;
  • differences between 5 and 10 are substantial;
  • if the difference in DIC is, say, less than 5, and the models make very different inferences, then it could be misleading just to report the model with the lowest DIC.


SLIDE 12

Watanabe-Akaike information criterion (WAIC)

elppd_WAIC = ( Σ_{i=1}^{n} log p_post(y_i) ) − p_WAIC

  • elppd = expected log posterior predictive density
  • Based on the posterior predictive fit to the observed data
  • p_WAIC = Σ_{i=1}^{n} var_post( log p(y_i | θ) )
  • Compute p_post and var_post using posterior simulations
  • Requires a partition of the data into pointwise terms y_i
  • Connection to leave-one-out cross-validation
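A WAIC sketch for an invented conjugate Normal-mean model (known σ = 1, N(0, 10²) prior, so exact posterior draws are available; `logsumexp` averages the pointwise densities stably):

```python
import numpy as np
from scipy import stats
from scipy.special import logsumexp

rng = np.random.default_rng(4)
y = rng.normal(0.5, 1.0, size=20)        # invented data, known sigma = 1

# conjugate Normal posterior for mu under a N(0, 10^2) prior
post_var = 1.0 / (1.0 / 10**2 + len(y) / 1.0)
post_mean = post_var * y.sum()
mu = rng.normal(post_mean, np.sqrt(post_var), size=4000)   # posterior draws

# pointwise log predictive density: rows = draws, columns = observations
lp = stats.norm.logpdf(y[None, :], mu[:, None], 1.0)

# sum_i log p_post(y_i): log of the posterior-averaged density per observation
lppd = (logsumexp(lp, axis=0) - np.log(lp.shape[0])).sum()

p_waic = lp.var(axis=0, ddof=1).sum()    # summed posterior variances
elppd_waic = lppd - p_waic
```

With one free parameter, `p_waic` should come out near 1, mirroring the effective-parameter interpretation of p_D.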


SLIDE 13

Model comparison-Bayes factor

Comparing two or more models:

    p(H2 | y) / p(H1 | y) = [ p(H2) / p(H1) ] × [ p(y | H2) / p(y | H1) ]

  • p(H2)/p(H1) is the "prior odds"
  • B[H2 : H1] = p(y | H2)/p(y | H1) is the "Bayes factor", with

    p(y | H) = ∫ p(y | θ, H) p(θ | H) dθ

  • Problem with p(y | H):
  • the integral depends on irrelevant tail properties of the prior density
  • consider ȳ ∼ N(θ, σ²/n) and p(θ) ∝ U(−A, A), for some large A
  • the marginal p(y) is then proportional to 1/A
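The 1/A behaviour is easy to see numerically (ȳ and the standard error are invented; the integral over the Uniform(−A, A) prior has a closed form via the Normal CDF):

```python
import numpy as np
from scipy import stats

ybar, se = 1.3, 0.5     # invented: ybar ~ N(theta, se^2), se = sigma / sqrt(n)

def marginal(A):
    """p(ybar | H) under the prior theta ~ Uniform(-A, A), in closed form."""
    # integral of N(ybar | theta, se^2) over theta in (-A, A), times 1/(2A)
    mass = stats.norm.cdf((A - ybar) / se) - stats.norm.cdf((-A - ybar) / se)
    return mass / (2 * A)

# for large A the likelihood mass is fully contained, so doubling A
# halves the marginal: p(ybar) is proportional to 1/A
ratio = marginal(200.0) / marginal(100.0)
```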


SLIDE 14

An example where the Bayes factor is good

  • Genetics example with

    H1: the woman is affected, θ = 1
    H2: the woman is unaffected, θ = 0

  • prior odds are p(H2)/p(H1) = 1
  • Bayes factor of the data is p(y|H2)/p(y|H1) = 1.0/0.25 = 4
  • the posterior odds are thus p(H2|y)/p(H1|y) = 4
  • Two features allow the Bayes factor to be helpful here:
  • each of the discrete alternatives makes scientific sense, and there are no obvious scientific models in between; i.e., a truly discrete parameter space
  • a model of probabilities, with no unbounded parameters
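The arithmetic of the example, spelled out:

```python
# posterior odds = prior odds x Bayes factor, with the slide's numbers
prior_odds = 1.0                    # p(H2) / p(H1)
bayes_factor = 1.0 / 0.25           # p(y | H2) / p(y | H1) = 4
posterior_odds = prior_odds * bayes_factor   # -> 4.0, i.e. p(H2|y)/p(H1|y) = 4
```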


SLIDE 15

An example where the Bayes factor is bad

  • 8 schools example: y_j ∼ N(θ_j, σ_j²), for j = 1, . . . , 8.

    H1: no pooling, p(θ_1, · · · , θ_8) ∝ 1
    H2: complete pooling, θ_1 = . . . = θ_J = θ, p(θ) ∝ 1

  • Bayes factor is 0/0
  • Instead, express the flat priors as N(0, A²) and let A get large
  • Now the Bayes factor depends strongly on A
  • As A → ∞, the complete-pooling model gets 100% of the probability, for any data!
  • Also a horrible dependence on J
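This pathology can be demonstrated directly. The sketch below uses the classic 8 schools data and puts N(0, A²) priors on both models; both marginal likelihoods then have closed forms (the complete-pooling marginal is a one-dimensional Gaussian integral done analytically). The log Bayes factor for pooling grows roughly like (J − 1) log A.

```python
import numpy as np
from scipy import stats

# the classic 8 schools data: estimated effects and standard errors
y = np.array([28., 8., -3., 7., -1., 1., 18., 12.])
s = np.array([15., 10., 16., 11., 9., 11., 10., 18.])

def log_bayes_factor(A):
    """log B[H2 : H1] when the flat priors are replaced by N(0, A^2)."""
    # H1 (no pooling): theta_j ~ N(0, A^2) indep., so y_j ~ N(0, s_j^2 + A^2)
    log_m1 = stats.norm.logpdf(y, 0.0, np.sqrt(s**2 + A**2)).sum()
    # H2 (complete pooling): theta ~ N(0, A^2), integrated out in closed form
    w = 1.0 / s**2                     # precision weights
    P = w.sum() + 1.0 / A**2           # posterior precision of theta
    b = (w * y).sum()
    log_m2 = (stats.norm.logpdf(y, 0.0, s).sum()
              + 0.5 * b**2 / P
              - 0.5 * np.log(A**2 * P))
    return log_m2 - log_m1

# widening the "flat" prior by a factor of 100 shifts the Bayes factor
# by roughly 7 * log(100) ~ 32 in favor of complete pooling
growth = log_bayes_factor(1e4) - log_bayes_factor(1e2)
```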


SLIDE 16

Interpretation of Bayes Factors

  • Jeffreys (1961) and Kass & Raftery (1995):

    2 log(B[H2 : H1])   B[H2 : H1]   Favor H2 over H1
    0 to 2              1 to 3       Not worth a bare mention
    2 to 6              3 to 20      Positive
    6 to 10             20 to 150    Strong
    > 10                > 150        Very Strong

  • B[H2 : H1] = 1/B[H1 : H2]
  • The interpretation is on the same scale as deviance and likelihood-ratio statistics
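The table translates directly into a small lookup function (a convenience sketch, not part of the original slides):

```python
def evidence_strength(two_log_b):
    """Kass & Raftery (1995) verbal labels for 2 log(B[H2 : H1])."""
    if two_log_b < 2:
        return "Not worth a bare mention"
    if two_log_b < 6:
        return "Positive"
    if two_log_b < 10:
        return "Strong"
    return "Very Strong"
```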
