Prediction and Model Comparison
Applied Bayesian Statistics
- Dr. Earvin Balderama
Department of Mathematics & Statistics Loyola University Chicago
October 31, 2017
(Bayesian) Modeling is all about
MCMCMC

1. Model Creation (Choice; Computation)
2. Model Checking (Criticism; Diagnostics)
3. Model Comparison (Choice; Selection; Change)
4. Repeat!
What we're focusing on today
Recall conditional distributions:
$$f(A \mid B) = \frac{f(A, B)}{f(B)} \qquad \left( \text{conditional} = \frac{\text{joint}}{\text{marginal}} \right)$$
This time, we'll give some attention to the marginal distribution:
$$f(\theta \mid y) = \frac{f(y \mid \theta)\, f(\theta)}{f(y)} \propto f(y \mid \theta)\, f(\theta)$$
Outline

1. Some Popular Bayesian fit statistics
   - Connection to classical statistics
2. Predictive Distributions
   - Prior predictive distribution
   - Posterior predictive distribution
   - Posterior predictive checks
3. Predictive Performance
   - Precision
   - Accuracy
   - Extreme values
Modeling
Classical methods...

1. Standardized Pearson residuals
2. p-values
3. Likelihood ratio
4. MLE

...also apply in a Bayesian analysis:

1. Posterior mean of the standardized residuals
2. Posterior probabilities
3. Bayes factor
4. Posterior mean
Bayes Factor
For determining which model fits the data "better," the Bayes factor is commonly used in a hypothesis test. Given data y and two competing models, M1 and M2, with parameter vectors θ1 and θ2, respectively, the Bayes factor is a measure of how much the data favor Model 1 over Model 2:
$$BF(y) = \frac{f_1(y)}{f_2(y)} = \frac{\int f_1(y \mid \theta_1)\, f_1(\theta_1)\, d\theta_1}{\int f_2(y \mid \theta_2)\, f_2(\theta_2)\, d\theta_2}$$
Note: The Bayes factor is an odds ratio: the ratio of the posterior odds to the prior odds of Model 1 versus Model 2.
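As a rough illustration (not from the slides), each marginal likelihood can be approximated by averaging the likelihood over draws from that model's prior. The toy data and Gamma prior settings below are entirely hypothetical:

# Hypothetical sketch: approximate the Bayes factor for two Poisson models
# that differ only in their Gamma priors on lambda.
set.seed(1)
y <- c(3, 5, 2, 4, 6)   # toy data
S <- 1e5                # number of prior draws
lik <- function(lambda) sapply(lambda, function(l) prod(dpois(y, l)))
m1 <- mean(lik(rgamma(S, shape = 2,  rate = 0.5)))  # estimate of f1(y)
m2 <- mean(lik(rgamma(S, shape = 20, rate = 5)))    # estimate of f2(y)
BF <- m1 / m2           # BF > 1 favors Model 1

This simple Monte Carlo estimator can be very noisy; it is only meant to make the formula concrete.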
Bayes Factor
The good:
- More robust than frequentist hypothesis testing.
- Often used for testing a "full model" vs. a "reduced model," as in classical statistics.
- One model doesn't need to be nested within the other model.

The bad:
- Difficult to compute, although easy to approximate with software.
- Only defined for proper marginal density functions.
- Computation is conditional on one of the models being true.
Because of this, Gelman considers Bayes factors irrelevant; he prefers looking at distance measures between the data and the model. There are many distance measures to choose from! One of which is...
DIC
Like many good measures of model fit and comparison, the Deviance Information Criterion (DIC) accounts for
1. how well the model fits the data (goodness of fit), and
2. the complexity of the model (effective number of parameters).

The DIC is given by
$$\text{DIC} = \bar{D} + p_D$$
where
1. $\bar{D} = E(D)$ is the "mean deviance," and
2. $p_D = \bar{D} - D(\bar{\theta})$ is the "mean deviance minus deviance at the means."
Deviance
Define the deviance as
$$D = -2 \log f(y \mid \theta)$$

Example: Poisson likelihood
$$D = -2 \log \prod_{i=1}^{n} \frac{\mu_i^{y_i} e^{-\mu_i}}{y_i!}$$
DIC
DIC can then be rewritten as
$$\text{DIC} = \bar{D} + p_D = p_D + D(\bar{\theta}) + p_D \qquad (\text{since } \bar{D} = p_D + D(\bar{\theta}))$$
$$= D(\bar{\theta}) + 2 p_D = -2 \log f(y \mid \bar{\theta}) + 2 p_D,$$
which is a generalization of
$$\text{AIC} = -2 \log f(y \mid \hat{\theta}_{MLE}) + 2k.$$
DIC can be used to compare different models as well as different methods. Preferred models have low DIC values.
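As a minimal sketch (not from the slides), DIC for a simple Poisson model can be computed from MCMC output; here y is assumed to be the data vector and post.lambda the vector of S posterior draws of the Poisson mean, matching the objects used in the PPO/CPO examples later:

# Deviance at a given value of lambda: D = -2 log f(y | lambda)
dev <- function(lambda) -2 * sum(dpois(y, lambda, log = TRUE))
Dbar <- mean(sapply(post.lambda, dev))  # mean deviance over posterior draws
Dhat <- dev(mean(post.lambda))          # deviance at the posterior mean
pD   <- Dbar - Dhat                     # effective number of parameters
DIC  <- Dbar + pD                       # equivalently, Dhat + 2 * pD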
DIC
DIC requires the joint posterior distribution to be approximately multivariate normal. It doesn't work well with
- highly non-linear models
- mixture models with discrete parameters
- models with missing data

If pD is negative:
- the log-likelihood may be non-concave
- the prior may be misspecified
- the posterior mean may not be a good estimator
Predictions
Maybe a better (best?) way to decide between competing models is to rank them based on how “well” each model does in predicting future observations.
The plug-in approach to prediction
Example: Consider the regression model
$$Y_i \overset{ind}{\sim} \text{Normal}(X_i^\top \beta, \sigma^2).$$
Suppose we have a new covariate vector Xnew and we would like to predict the corresponding response Ynew. The "plug-in" approach would be to fix β and σ at their posterior means $\hat{\beta}$ and $\hat{\sigma}$ to make predictions:
$$Y_{new} \mid \hat{\beta}, \hat{\sigma} \sim \text{Normal}(X_{new}^\top \hat{\beta}, \hat{\sigma}^2).$$
The plug-in approach to prediction
However, this plug-in approach suppresses uncertainty about the parameters, β and σ. Therefore, the prediction intervals will be too narrow, leading to undercoverage. We need to account for all uncertainty when making predictions, including our uncertainty about β and σ.
Predictive distributions
In Bayesian analyses, predictive distributions are used for comparing models in terms of how "well" each model does in predicting future observations. The idea is that we want to explore the predictive distributions of the unknown observations, which arise naturally in a Bayesian analysis because we have uncertainty distributions for the unknown model parameters.

First, a question: Before any data are observed, what could we use for predictions?
Prior Predictive Distribution
Before any data are observed, what could we use for predictions? We have a likelihood function, but to account for all uncertainty when making predictions, we marginalize over the model parameters. The marginal likelihood is what one would expect data to look like after averaging over the prior distribution of θ, so it is also called the prior predictive distribution:
$$f(y) = \int f(y \mid \theta)\, f(\theta)\, d\theta$$
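For intuition, a small R sketch (with illustrative prior settings, not from the slides): draws from the prior predictive of a Poisson model with a Gamma prior on lambda are obtained by first drawing lambda from the prior, then y given lambda:

S <- 5000
lambda_prior <- rgamma(S, shape = 2, rate = 0.5)  # draws from the prior
y_prior_pred <- rpois(S, lambda_prior)            # draws from f(y)
hist(y_prior_pred)  # what we expect data to look like before seeing any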
Posterior Predictive Distribution
More interestingly, what if a set of data y has already been observed? How can we make predictions for future (or new, or unobserved) observations ynew? We can use the marginal posterior likelihood of ynew, called the posterior predictive distribution (PPD):
$$f(y_{new} \mid y) = \int f(y_{new} \mid \theta)\, f(\theta \mid y)\, d\theta$$
This distribution is what one would expect ynew to look like after observing y and averaging over the posterior distribution of θ given y. The concept of the PPD applies generally (e.g., logistic regression).
Posterior Predictive Distribution
Equivalently, ynew can be considered missing values and treated as additional parameters to be estimated in a Bayesian framework. (More on missing values later.)

Example: For a certain complete dataset, we may want to randomly assign NA values to some number m of observations, which creates a test set ymis = {y1, y2, . . . , ym}. After MCMC, the m posterior predictive distributions, P1, P2, . . . , Pm, can be used to determine measures of overall model goodness-of-fit, as well as predictive performance measures for each yi in the test set.
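A small sketch of this setup, with hypothetical object names (y is the complete data vector, m is the test-set size):

m <- 10
miss_idx <- sample(length(y), m)  # randomly choose m observations
y_train <- y
y_train[miss_idx] <- NA           # these m values form the test set y_mis
# In JAGS/BUGS-style samplers, the NA entries are treated as unknown
# parameters, so their posterior draws form the PPDs P_1, ..., P_m.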
Posterior Predictive Distribution
The MCMC framework makes it very easy to draw samples from Ynew's PPD. This is one of the reasons for the claim that "Bayesian methods naturally quantify uncertainty."

Example: For each MCMC iteration t:
1. we have updates $\beta^{(t)}$ and $\sigma^{2(t)}$;
2. we sample $y_{new}^{(t)} \sim \text{Normal}(X_{new}^\top \beta^{(t)}, \sigma^{2(t)})$.

Then $y_{new}^{(1)}, y_{new}^{(2)}, \ldots, y_{new}^{(S)}$ are samples from the PPD.
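A hedged R sketch of this loop, assuming post.beta is an S x p matrix of draws, post.sigma a length-S vector of draws, and Xnew the new covariate vector (all names illustrative):

S <- nrow(post.beta)
y_new <- numeric(S)
for (t in 1:S) {
  mu_t <- sum(Xnew * post.beta[t, ])   # X_new' beta^(t)
  y_new[t] <- rnorm(1, mean = mu_t, sd = post.sigma[t])
}
quantile(y_new, c(0.05, 0.5, 0.95))    # e.g., a 90% prediction interval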
Posterior Predictive Ordinate
The posterior predictive ordinate (PPOi) is the density of the posterior predictive distribution evaluated at an observation yi:
$$\text{PPO}_i = f(y_i \mid y) = \int f(y_i \mid \theta)\, f(\theta \mid y)\, d\theta$$
PPOi can be used to estimate the probability of observing yi in the future after having already observed y. We can estimate the ith posterior predictive ordinate by
$$\widehat{\text{PPO}}_i = \frac{1}{S} \sum_{s=1}^{S} f(y_i \mid \theta^{(s)})$$
Posterior Predictive Ordinate
We can easily compute PPOi after running MCMC.

Example: Poisson count model. For each observation i, compute the summation term at every iteration of the MCMC, then average over the S iterations to get PPOi. In R, suppose our data is the vector y, and the posterior samples are in the vector post.lambda.
ppo <- numeric(N)  # N = number of observations
for (i in 1:N) {
  # average the Poisson density of y[i] over the posterior draws of lambda
  ppo[i] <- mean(dpois(y[i], post.lambda))
}
Conditional Predictive Ordinate
The conditional predictive ordinate (CPOi) estimates the probability of observing yi in the future after having observed the rest of the data, y−i:
$$\text{CPO}_i = f(y_i \mid y_{-i}) = \cdots = \left( \int \frac{1}{f(y_i \mid \theta)}\, f(\theta \mid y)\, d\theta \right)^{-1}$$
CPOi can be estimated by taking the inverse of the posterior mean of the inverse density function value of yi (the harmonic mean of the likelihood of yi):
$$\widehat{\text{CPO}}_i = \left( \frac{1}{S} \sum_{s=1}^{S} \frac{1}{f(y_i \mid \theta^{(s)})} \right)^{-1}$$
Low CPOi values suggest possible outliers, or high-leverage and influential observations.
Conditional Predictive Ordinate
Proof:
$$\text{CPO}_i = f(y_i \mid y_{-i}) = \left( \frac{f(y_{-i})}{f(y)} \right)^{-1} = \left( \int \frac{f(y_{-i} \mid \theta)\, f(\theta)}{f(y)}\, d\theta \right)^{-1}$$
$$= \left( \int \frac{1}{f(y_i \mid \theta)} \cdot \frac{f(y \mid \theta)\, f(\theta)}{f(y)}\, d\theta \right)^{-1} = \left( \int \frac{1}{f(y_i \mid \theta)}\, f(\theta \mid y)\, d\theta \right)^{-1} = \left( E_{\theta \mid y}\!\left[ \frac{1}{f(y_i \mid \theta)} \right] \right)^{-1}$$
(The step $f(y_{-i} \mid \theta) = f(y \mid \theta)/f(y_i \mid \theta)$ uses conditional independence of the observations given θ.)
Conditional Predictive Ordinate
We can easily compute CPOi after running MCMC.

Example: Poisson count model. In R, suppose our data is the vector y, and the posterior samples are in the vector post.lambda.
cpo <- numeric(N)  # N = number of observations
for (i in 1:N) {
  # inverse of the posterior mean of the inverse density (harmonic mean)
  cpo[i] <- 1 / mean(1 / dpois(y[i], post.lambda))
}
Predictive Ordinates
Estimate of PPO:
$$\widehat{\text{PPO}}_i = \frac{1}{S} \sum_{s=1}^{S} f(y_i \mid \theta^{(s)})$$
Estimate of CPO:
$$\widehat{\text{CPO}}_i = \left( \frac{1}{S} \sum_{s=1}^{S} \frac{1}{f(y_i \mid \theta^{(s)})} \right)^{-1}$$
Notes:
- PPO is good for prediction, but violates the likelihood principle.
- CPO is based on leave-one-out cross-validation.
LPML
The log-pseudo marginal likelihood (LPML) is the sum of the log CPO's and is an estimator of the log marginal likelihood:
$$\text{LPML} = \sum_{i=1}^{n} \log(\text{CPO}_i)$$
Preferred models have large LPML values.
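Continuing the Poisson example, LPML is one line in R once the cpo vector from the earlier code is available:

lpml <- sum(log(cpo))  # larger values indicate a preferred model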
For hypothesis testing
A ratio of pseudo marginal likelihoods, exp(LPML1 − LPML2), is a surrogate for the Bayes factor (the pseudo-Bayes factor). Another overall measure for model comparison is the posterior Bayes factor, which is simply the Bayes factor but using the posterior predictive distributions.
Predictive Performance
So far, we have seen different ways to rank models in terms of how good they seem to be in prediction. Let's look now at some measures that quantify the actual performance in prediction, which always boils down to two things:
1. Precision
2. Accuracy

Combining information from measures of precision (e.g., MSE) and measures of accuracy (e.g., coverage) gives a fuller picture of predictive performance.
Measures of Predictive Precision
The mean absolute deviation (MAD) is the mean of the absolute values of the deviations between the actual observed value, yi, and the median of the respective posterior predictive distribution, Pi.

Mean Absolute Deviation
$$\text{MAD} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \text{median}(P_i) \right|$$
The median makes MAD a robust statistic.
Measures of Predictive Precision
MSE is the mean of the squared deviations (errors) between the actual observed value, yi, and the mean of the respective posterior predictive distribution, Pi.

Mean Squared Error
$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \text{mean}(P_i) \right)^2$$
The average standard deviation of the P's can also be helpful.

Mean Standard Deviation
$$\overline{\text{SD}} = \frac{1}{n} \sum_{i=1}^{n} \sigma_{P_i}$$
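A sketch of these precision measures in R, assuming ppd is an S x m matrix whose columns are posterior predictive draws P_i for the m test-set observations in y_test (all names illustrative):

pred_med  <- apply(ppd, 2, median)      # PPD medians
pred_mean <- apply(ppd, 2, mean)        # PPD means
MAD <- mean(abs(y_test - pred_med))     # mean absolute deviation
MSE <- mean((y_test - pred_mean)^2)     # mean squared error
SD  <- mean(apply(ppd, 2, sd))          # mean PPD standard deviation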
Measures of Predictive Accuracy
Coverage is the proportion of test set observations yi falling inside some interval of the respective posterior predictive distributions, Pi.

90% Coverage
$$C(90\%) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\left( P_i^{(.05)} < y_i < P_i^{(.95)} \right)$$
where $P_i^{(.05)}$ and $P_i^{(.95)}$ are the 5th and 95th percentiles of $P_i$.

This shows how well the model does in creating posterior predictive distributions that actually capture the true value. Note: A high coverage probability can simply be a result of high-variance posterior predictive distributions.
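Using the same hypothetical ppd matrix and y_test vector as above, 90% coverage can be estimated from the 5th and 95th percentiles of each PPD:

lo <- apply(ppd, 2, quantile, probs = 0.05)  # lower interval endpoints
hi <- apply(ppd, 2, quantile, probs = 0.95)  # upper interval endpoints
C90 <- mean(lo < y_test & y_test < hi)       # proportion of y_i captured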
Prediction of Extreme Values
The Brier score is the squared difference between the posterior predictive probability of exceeding a certain value and whether or not the actual observation exceeds that value.

The Brier score for a test set observation yi, given a certain value c, can be computed as
$$\text{BS}_i = \left( P(y_i > c) - \mathbb{1}(y_i > c) \right)^2$$
where $\mathbb{1}$ is the indicator function and $P(y_i > c)$ is the posterior predictive probability that $y_i > c$.
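A sketch using the same hypothetical objects; the threshold c_val is whatever extreme value is of interest:

c_val <- 10                                       # illustrative threshold
p_exceed <- colMeans(ppd > c_val)                 # P(y_i > c) for each PPD
BS <- (p_exceed - as.numeric(y_test > c_val))^2   # Brier score per observation
mean(BS)                                          # average Brier score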
Prediction of Extreme Values
Another score that can be used as a measure of predictive accuracy for extreme values is the quantile score. The quantile score for a test set observation yi, at a quantile level τ, can be computed as
$$\text{QS}_i = 2 \times \left( \mathbb{1}(y_i < q_i^{(\tau)}) - \tau \right) \left( q_i^{(\tau)} - y_i \right)$$
where $q_i^{(\tau)}$ is the τ quantile of the posterior predictive distribution $P_i$.
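A sketch of the quantile score at level tau = 0.95 (an illustrative choice for extreme values), with the same hypothetical ppd and y_test:

tau <- 0.95
q_tau <- apply(ppd, 2, quantile, probs = tau)  # tau-quantile of each PPD
QS <- 2 * (as.numeric(y_test < q_tau) - tau) * (q_tau - y_test)
mean(QS)                                       # smaller is better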
Prediction of Extreme Values
The average of the quantile scores and the average of the Brier Scores can be used for model comparison. They can be used to evaluate how well a model captures extreme values. Smaller scores are better. Larger Brier scores suggest lack of predictive accuracy. Larger quantile scores suggest that the observed value is very far from its estimated quantile value from P.
Other Posterior Predictive Checks
1. Other Bayesian p-value tests
2. Gelman Chi-Square tests, and other Chi-Square tests
3. Quantile Ratio
4. Predictive Concordance
5. Bayesian Predictive Information Criterion (BPIC)
6. L-criterion
7. ...and many more...
Summary
1. Separate your research into the three MC's.
2. The list of model checks is not exhaustive!
3. Choose some based on the focus of your research.
4. Statistics is only a guide.