SLIDE 1

Prediction and Model Comparison

Applied Bayesian Statistics

Dr. Earvin Balderama

Department of Mathematics & Statistics, Loyola University Chicago

October 31, 2017

Last edited October 25, 2017 by <ebalderama@luc.edu>
SLIDE 2 MCMC

(Bayesian) Modeling is all about

MCMCMC

SLIDE 3 MCMC

MCMCMC

• Steps in (Bayesian) modeling:
  1. Model Creation (Choice; Computation)
  2. Model Checking (Criticism; Diagnostics)
  3. Model Comparison (Choice; Selection; Change)

SLIDE 4 MCMC

MCMCMC

• Steps in (Bayesian) modeling:
  1. Model Creation (Choice; Computation)
  2. Model Checking (Criticism; Diagnostics)
  3. Model Comparison (Choice; Selection; Change)
  4. Repeat!

SLIDE 5 MCMC

What we’re focusing on today

Recall conditional distributions:

$$f(A \mid B) = \frac{f(A, B)}{f(B)} \qquad \text{(conditional = joint / marginal)}$$

This time, we'll give some attention to the marginal distribution:

$$f(\theta \mid y) = \frac{f(y \mid \theta)\, f(\theta)}{f(y)} = \frac{f(y \mid \theta)\, f(\theta)}{\int f(y \mid \theta)\, f(\theta)\, d\theta}$$
SLIDE 6 MCMC

Outline

1. Some Popular Bayesian fit statistics
   • Connection to classical statistics
2. Predictive Distributions
   • Prior predictive distribution
   • Posterior predictive distribution
   • Posterior predictive checks
3. Predictive Performance
   • Precision
   • Accuracy
   • Extreme values

SLIDE 7 MCMC

Modeling

Classical methods...
1. Standardized Pearson residuals
2. p-values
3. Likelihood ratio
4. MLE

...also apply in a Bayesian analysis:
1. Posterior mean of the standardized residuals
2. Posterior probabilities
3. Bayes factor
4. Posterior mean

SLIDE 8 Popular Model Fit Statistics

Bayes Factor

For determining which model fits the data "better," the Bayes factor is commonly used in a hypothesis test. Given data y and two competing models, M1 and M2, with parameter vectors θ1 and θ2, respectively, the Bayes factor is a measure of how much the data favors Model 1 over Model 2:

$$BF(y) = \frac{f_1(y)}{f_2(y)} = \frac{\int f(y \mid \theta_1)\, f(\theta_1)\, d\theta_1}{\int f(y \mid \theta_2)\, f(\theta_2)\, d\theta_2}$$

Note: The Bayes factor is an odds ratio: the ratio of the posterior and prior odds of favoring Model 1 over Model 2.
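In conjugate settings the marginal likelihoods in this ratio have closed forms, which makes a small worked example possible. The following R sketch is entirely illustrative (the function name marg, the Beta priors, and the data values are assumptions, not from the slides); it compares two binomial models that differ only in their priors:

# marginal likelihood of y successes in n trials under a Beta(a, b) prior
# (the beta-binomial pmf): binomial likelihood integrated against the prior
marg <- function(y, n, a, b) choose(n, y) * beta(y + a, n - y + b) / beta(a, b)

# Bayes factor favoring Model 1 (flat Beta(1, 1) prior) over
# Model 2 (informative Beta(10, 10) prior), for y = 7 successes in n = 10 trials
BF <- marg(7, 10, 1, 1) / marg(7, 10, 10, 10)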
SLIDE 9 Popular Model Fit Statistics

Bayes Factor

The good:
• More robust than frequentist hypothesis testing.
• Often used for testing a "full model" vs. a "reduced model," as in classical statistics.
• One model doesn't need to be nested within the other model.

The bad:
• Difficult to compute, although easy to approximate with software.
• Only defined for proper marginal density functions.
• Computation is conditional on one of the models being true.

Because of this, Gelman considers Bayes factors irrelevant and prefers looking at distance measures between data and model.

Many distance measures to choose from! One of which is...

SLIDE 10 Popular Model Fit Statistics

DIC

Like many good measures of model fit and comparison, the Deviance Information Criterion (DIC) includes
1. how well the model fits the data (goodness of fit), and
2. the complexity of the model (effective number of parameters).

The Deviance Information Criterion (DIC) is given by

$$DIC = \bar{D} + p_D$$

where
1. $\bar{D} = E(D)$ is the "mean deviance," and
2. $p_D = \bar{D} - D(\bar{\theta})$ is the "mean deviance minus the deviance at the posterior means."

SLIDE 11 Popular Model Fit Statistics

Deviance

Define the deviance as

$$D = -2 \log f(y \mid \theta)$$

Example: Poisson likelihood

$$D = -2 \log \prod_i \frac{\mu_i^{y_i} e^{-\mu_i}}{y_i!} = -2 \sum_i \left( -\mu_i + y_i \log \mu_i - \log(y_i!) \right)$$
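As a minimal sketch, the deviance can be evaluated at each posterior draw in R. This reuses the y and post.lambda names from the snippets later in the deck and assumes a common Poisson mean rather than observation-specific µi:

# D(s) = -2 * sum_i log f(y_i | lambda^(s)), one value per posterior draw
D.samples <- sapply(post.lambda,
                    function(lam) -2 * sum(dpois(y, lam, log = TRUE)))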
SLIDE 12 Popular Model Fit Statistics

DIC

DIC can then be rewritten as

$$DIC = \bar{D} + p_D = D(\bar{\theta}) + 2 p_D = -2 \log f(y \mid \bar{\theta}) + 2 p_D$$

(using $\bar{D} = p_D + D(\bar{\theta})$), which is a generalization of

$$AIC = -2 \log f\left(y \mid \hat{\theta}_{MLE}\right) + 2k$$

DIC can be used to compare different models as well as different methods. Preferred models have low DIC values.
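Continuing the earlier sketch (same assumed names, common Poisson mean), DIC follows directly from the deviance draws:

Dbar <- mean(D.samples)  # mean deviance
# deviance evaluated at the posterior mean of lambda
Dhat <- -2 * sum(dpois(y, mean(post.lambda), log = TRUE))
pD   <- Dbar - Dhat      # effective number of parameters
DIC  <- Dbar + pD        # equivalently Dhat + 2 * pD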

SLIDE 13 Popular Model Fit Statistics

DIC

Requires the joint posterior distribution to be approximately multivariate normal. Doesn't work well with:
• highly non-linear models
• mixture models
• models with discrete parameters
• models with missing data

If pD is negative:
• the log-likelihood may be non-concave
• the prior may be misspecified
• the posterior mean may not be a good estimator

SLIDE 14 Predictive Distributions

Predictions

Maybe a better (best?) way to decide between competing models is to rank them based on how “well” each model does in predicting future observations.

SLIDE 15 Predictive Distributions

The plug-in approach to prediction

Example: Consider the regression model

$$Y_i \overset{ind}{\sim} \text{Normal}\left( \beta_0 + X_{i1}\beta_1 + \cdots + X_{ip}\beta_p,\ \sigma^2 \right)$$

Suppose we have a new covariate vector $X_{new}$ and we would like to predict the corresponding response $Y_{new}$. The "plug-in" approach would be to fix β and σ at their posterior means $\hat{\beta}$ and $\hat{\sigma}$ to make predictions:

$$Y_{new} \mid \hat{\beta}, \hat{\sigma} \sim \text{Normal}\left( X_{new}\hat{\beta},\ \hat{\sigma}^2 \right).$$

SLIDE 16 Predictive Distributions

The plug-in approach to prediction

However, this plug-in approach suppresses uncertainty about the parameters, β and σ. Therefore, the prediction intervals will be too narrow, leading to undercoverage. We need to account for all uncertainty when making predictions, including our uncertainty about β and σ.

SLIDE 17 Predictive Distributions

Predictive distributions

In Bayesian analyses, predictive distributions are used for comparing models in terms of how "well" each model does in predicting future observations.

The idea is that we want to explore the predictive distributions of the unknown observations, which account for the uncertainty in predicting those observations. Having distributions for the unknown future observations comes naturally in a Bayesian analysis because of the uncertainty distributions for the unknown model parameters.

First, a question: Before any data is observed, what could we use for predictions?

SLIDE 18 Predictive Distributions

Prior Predictive Distribution

Before any data is observed, what could we use for predictions? We have a likelihood function, but to account for all uncertainty when making predictions, we marginalize over the model parameters. The marginal likelihood is what one would expect the data to look like after averaging over the prior distribution of θ, so it is also called the prior predictive distribution:

$$f(y) = \int f(y \mid \theta)\, f(\theta)\, d\theta$$
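Sampling from the prior predictive is straightforward: draw θ from the prior, then draw y from the likelihood given that draw. A hedged sketch for a Poisson likelihood with a Gamma prior (the prior family and hyperparameters are illustrative assumptions, not from the slides):

S <- 5000
lambda.prior <- rgamma(S, shape = 2, rate = 0.5)  # theta ~ prior
y.prior.pred <- rpois(S, lambda.prior)            # y ~ f(y | theta), one per draw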
SLIDE 19 Predictive Distributions

Posterior Predictive Distribution

More interestingly, what if a set of data y has already been observed? How can we make predictions for future (or new, or unobserved) observations ynew? We can sample from the marginal posterior likelihood of ynew, called the posterior predictive distribution (PPD):

$$f(y_{new} \mid y) = \int f(y_{new} \mid \theta)\, f(\theta \mid y)\, d\theta$$

This distribution is what one would expect ynew to look like after observing y and averaging over the posterior distribution of θ given y. The concept of the PPD applies generally (e.g., logistic regression).

SLIDE 20 Predictive Distributions

Posterior Predictive Distribution

Equivalently, ynew can be considered missing values and treated as additional parameters to be estimated in a Bayesian framework. (More on missing values later.)

Example: For a complete dataset, we may want to randomly assign NA values to some number m of observations, which creates a test set ymis = {y1, y2, . . . , ym}. After MCMC, the m posterior predictive distributions, P1, P2, . . . , Pm, can be used to compute overall measures of model goodness-of-fit, as well as predictive performance measures for each yi in the test set.
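A minimal sketch of creating such a test set in R (all names illustrative; the MCMC itself would then impute the NA entries as extra parameters):

m <- 20                              # illustrative test-set size
mis.idx <- sample(seq_along(y), m)   # randomly choose m observations to hold out
y.mis <- y[mis.idx]                  # held-out truth, kept aside for scoring
y.obs <- y
y.obs[mis.idx] <- NA                 # fit the model to y.obs; ynew fills the NAs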

SLIDE 21 Predictive Distributions

Posterior Predictive Distribution

The MCMC framework makes it very easy to draw samples from Ynew's PPD. This is one of the reasons for the claim that "Bayesian methods naturally quantify uncertainty."

Example: For each MCMC iteration t,
1. we have updates $\beta^{(t)}$ and $\sigma^{2(t)}$;
2. we sample $y_{new}^{(t)} \sim \text{Normal}\left( X\beta^{(t)},\ \sigma^{2(t)} \right)$.

Then $y_{new}^{(1)}, y_{new}^{(2)}, \ldots, y_{new}^{(S)}$ are samples from the PPD.
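A minimal sketch of this loop, assuming Xnew is a new covariate vector (including the intercept) and that post.beta (an S x p matrix) and post.sigma2 (a length-S vector) hold the posterior draws; all of these names are assumptions:

S <- nrow(post.beta)
ynew <- numeric(S)
for (t in 1:S) {
  mu.t <- sum(Xnew * post.beta[t, ])  # X_new %*% beta^(t)
  # draw from the likelihood at this iteration's parameter values
  ynew[t] <- rnorm(1, mean = mu.t, sd = sqrt(post.sigma2[t]))
}
# ynew now holds S draws from the posterior predictive distribution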

SLIDE 22 Predictive Distributions

Posterior Predictive Ordinate

The posterior predictive ordinate (PPOi) is the density of the posterior predictive distribution evaluated at an observation yi:

$$PPO_i = f(y_i \mid y) = \int f(y_i \mid \theta)\, f(\theta \mid y)\, d\theta$$

PPOi can be used to estimate the probability of observing yi in the future after having already observed y. We can estimate the ith posterior predictive ordinate by

$$\widehat{PPO}_i = \frac{1}{S} \sum_{s=1}^{S} f\left( y_i \mid \theta^{(s)} \right)$$
SLIDE 23 Predictive Distributions

Posterior Predictive Ordinate

We can easily compute PPOi after running MCMC. Example (Poisson count model): For each observation i, compute the summand at every iteration of the MCMC, then average over the S iterations to get PPOi. In R, suppose our data is the vector y, and the posterior samples are in the vector post.lambda.

ppo <- numeric(N)  # N = number of observations
for (i in 1:N) {
  # average the Poisson density of y[i] over the posterior draws of lambda
  ppo[i] <- mean(dpois(y[i], post.lambda))
}

SLIDE 24 Predictive Distributions

Conditional Predictive Ordinate

The conditional predictive ordinate (CPOi) estimates the probability of observing yi in the future after having already observed y−i:

$$CPO_i = f(y_i \mid y_{-i}) = \cdots = \left( \int \frac{1}{f(y_i \mid \theta)}\, f(\theta \mid y)\, d\theta \right)^{-1}$$

CPOi can be estimated by taking the inverse of the posterior mean of the inverse density function value of yi (the harmonic mean of the likelihood of yi):

$$\widehat{CPO}_i = \left( \frac{1}{S} \sum_{s=1}^{S} \frac{1}{f\left( y_i \mid \theta^{(s)} \right)} \right)^{-1}$$

Low CPOi values suggest possible outliers, high-leverage points, and influential observations.

SLIDE 25 Predictive Distributions

Conditional Predictive Ordinate

Proof (using $f(y_{-i} \mid \theta) = f(y \mid \theta) / f(y_i \mid \theta)$, which holds when the observations are conditionally independent given θ):

$$
\begin{aligned}
CPO_i = f(y_i \mid y_{-i})
&= \left( \frac{f(y_{-i})}{f(y)} \right)^{-1} \\
&= \left( \int \frac{f(y_{-i} \mid \theta)\, f(\theta)}{f(y)}\, d\theta \right)^{-1} \\
&= \left( \int \frac{1}{f(y_i \mid \theta)} \cdot \frac{f(y \mid \theta)\, f(\theta)}{f(y)}\, d\theta \right)^{-1} \\
&= \left( \int \frac{1}{f(y_i \mid \theta)}\, f(\theta \mid y)\, d\theta \right)^{-1} \\
&= \left( E_{\theta \mid y}\!\left[ \frac{1}{f(y_i \mid \theta)} \right] \right)^{-1}
\end{aligned}
$$

SLIDE 26 Predictive Distributions

Conditional Predictive Ordinate

We can easily compute CPOi after running MCMC. Example (Poisson count model): In R, suppose our data is the vector y, and the posterior samples are in the vector post.lambda.

cpo <- numeric(N)  # N = number of observations
for (i in 1:N) {
  # inverse of the posterior mean of the inverse likelihood (harmonic mean)
  cpo[i] <- 1 / mean(1 / dpois(y[i], post.lambda))
}

SLIDE 27 Predictive Distributions

Predictive Ordinates

Estimate of PPO:

$$\widehat{PPO}_i = \frac{1}{S} \sum_{s=1}^{S} f\left( y_i \mid \theta^{(s)} \right)$$

Estimate of CPO:

$$\widehat{CPO}_i = \left( \frac{1}{S} \sum_{s=1}^{S} \frac{1}{f\left( y_i \mid \theta^{(s)} \right)} \right)^{-1}$$

Notes:
• PPO is good for prediction, but violates the likelihood principle.
• CPO is based on leave-one-out cross-validation.

SLIDE 28 Predictive Distributions

LPML

The log pseudo-marginal likelihood (LPML) is the sum of the log CPOs and is an estimator of the log marginal likelihood:

$$LPML = \sum_{i=1}^{n} \log(CPO_i)$$

Preferred models have large LPML values.
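Given the cpo vector from the earlier snippet, LPML is a one-liner in R:

lpml <- sum(log(cpo))  # larger LPML indicates a preferred model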

SLIDE 29 Predictive Distributions

For hypothesis testing

A ratio of pseudo-marginal likelihoods (on the log scale, a difference of LPMLs) is a surrogate for the Bayes factor. Another overall measure for model comparison is the posterior Bayes factor, which is simply the Bayes factor but using the posterior predictive distributions.

SLIDE 30 Predictive Performance

Predictive Performance

So far, we have seen different ways to rank models in terms of how good they seem to be at prediction. Let's look now at some measures that quantify the actual performance in prediction, which always boils down to two things:
1. Precision
2. Accuracy

Combining information from measures of precision (e.g., MSE) and measures of accuracy (e.g., coverage) is important for model comparison.
SLIDE 31 Predictive Performance

Measures of Predictive Precision

The mean absolute deviation (MAD) is the mean of the absolute deviations between each observed value, yi, and the median of its posterior predictive distribution, Pi.

Mean Absolute Deviation:

$$MAD = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \tilde{P}_i \right|$$

where $\tilde{P}_i$ denotes the median of $P_i$.

Note: You can also use the median absolute deviation (also MAD!) for a more robust statistic.

SLIDE 32 Predictive Performance

Measures of Predictive Precision

MSE is the mean of the squared deviations (errors) between each observed value, yi, and the mean of its posterior predictive distribution, Pi.

Mean Squared Error:

$$MSE = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \bar{P}_i \right)^2$$

The average standard deviation of the Pi's can also be helpful.

Mean Standard Deviation:

$$SD = \frac{1}{n} \sum_{i=1}^{n} \sigma_{P_i}$$
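A hedged sketch of all three measures, assuming P is an S x m matrix whose columns hold posterior predictive samples for the m held-out observations y.mis (names carried over from the earlier illustrative snippets):

P.med  <- apply(P, 2, median)     # PPD medians
P.mean <- apply(P, 2, mean)       # PPD means
MAD <- mean(abs(y.mis - P.med))   # mean absolute deviation
MSE <- mean((y.mis - P.mean)^2)   # mean squared error
SD  <- mean(apply(P, 2, sd))      # average PPD standard deviation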

SLIDE 33 Predictive Performance

Measures of Predictive Accuracy

Coverage is the proportion of test-set observations yi falling inside some interval of their posterior predictive distributions Pi.

90% Coverage:

$$C(90\%) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\left( P_i^{(.05)} < y_i < P_i^{(.95)} \right)$$

where $\mathbf{1}$ is the indicator function and $P_i^{(q)}$ is the estimated qth quantile of Pi.

This shows how well the model does in creating posterior predictive distributions that actually capture the true value.

Note: A high coverage probability can simply be a result of high-variance posterior predictive distributions.
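Under the same assumptions as before (PPD sample matrix P, held-out values y.mis), 90% coverage can be estimated as:

lo <- apply(P, 2, quantile, probs = 0.05)  # estimated .05 quantiles
hi <- apply(P, 2, quantile, probs = 0.95)  # estimated .95 quantiles
C90 <- mean(y.mis > lo & y.mis < hi)       # proportion inside the 90% intervals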

SLIDE 34 Predictive Performance

Prediction of Extreme Values

The Brier score is the squared difference between the posterior predictive probability of exceeding a certain value and whether or not the actual observation exceeds that value.

The Brier score for a test-set observation yi, given a certain value c, can be computed as

$$BS_i = \left( \mathbf{1}(y_i > c) - P_i(y_i > c) \right)^2$$

where $\mathbf{1}$ is the indicator function and $P_i(y > c)$ is the posterior predictive probability that y > c.
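A minimal sketch under the same assumptions, with an illustrative threshold standing in for c:

c0 <- 10                                     # illustrative exceedance threshold c
p.exceed <- colMeans(P > c0)                 # posterior predictive P(y_i > c)
BS <- (as.numeric(y.mis > c0) - p.exceed)^2  # Brier score per test-set observation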

SLIDE 35 Predictive Performance

Prediction of Extreme Values

Another score that can be used as a measure of predictive accuracy for extreme values is the quantile score. The quantile score for a test-set observation yi can be computed as

$$QS_i = 2 \times \left( \mathbf{1}\left( y_i < P_i^{(q)} \right) - q \right) \times \left( P_i^{(q)} - y_i \right)$$

where $\mathbf{1}$ is the indicator function and $P_i^{(q)}$ is the estimated qth quantile of Pi.
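And the quantile score at a chosen level q, same assumptions:

q <- 0.95                               # illustrative quantile level
Pq <- apply(P, 2, quantile, probs = q)  # estimated q-th quantile of each PPD
QS <- 2 * (as.numeric(y.mis < Pq) - q) * (Pq - y.mis)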
SLIDE 36 Predictive Performance

Prediction of Extreme Values

The average of the quantile scores and the average of the Brier scores can be used for model comparison; they evaluate how well a model captures extreme values. Smaller scores are better: larger Brier scores suggest a lack of predictive accuracy, and larger quantile scores suggest that the observed value is very far from its estimated quantile value from P.

SLIDE 37 Conclusion

Other Posterior Predictive Checks

1. Other Bayesian p-value tests
2. Gelman chi-square tests, and other chi-square tests
3. Quantile ratio
4. Predictive concordance
5. Bayesian Predictive Information Criterion (BPIC)
6. L-criterion
7. ...and many more...

SLIDE 38 Conclusion

Summary

1. Separate your research into the three MC's.
2. The list of model checks is not exhaustive!
3. Choose some based on the focus of your research.
4. Statistics is only a guide.
