Basics of Bayesian Inference


SLIDES 1–5

Basics of Bayesian Inference

• A frequentist thinks of unknown parameters as fixed
• A Bayesian thinks of parameters as random, and thus coming from distributions (just like the data)
• A Bayesian writes down a prior distribution for θ and combines it with the likelihood for the observed data Y to obtain the posterior distribution of θ. All statistical inferences then follow from summarizing the posterior.
• This approach expands the class of candidate models and facilitates hierarchical modeling, where it is important to properly account for various sources of uncertainty (e.g., spatial vs. nonspatial heterogeneity)
• The classical (frequentist) approach to inference leads to awkward interpretations and struggles to account properly for these sources of uncertainty

Basics of Bayesian Inference – p. 1

SLIDE 6

Basics of Bayesian Inference

• In the simplest form, we start with a model/distribution for the data given unknowns (parameters), f(y|θ)
• Since the data are observed (hence known) while θ is not, we can equivalently view this as a function of θ given y, and call it the likelihood, L(θ; y)
• We write the prior distribution for θ as π(θ)
• The joint model for the data and parameters is then f(y|θ)π(θ)
• Conditioning in the opposite direction, the same joint model factors as

π(θ|y) m(y) .

• The first term is the posterior distribution of θ; the second is the marginal distribution of the data
• We see that π(θ|y) ∝ f(y|θ)π(θ), with m(y) the normalizing constant
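To make the proportionality concrete, here is a minimal R sketch (not from the slides) that builds a posterior on a grid; the Beta(2, 2) prior and the data y = 7 out of n = 10 are hypothetical choices:

## Grid approximation of p(theta | y): likelihood times prior, normalized;
## dividing by the grid sum is a numerical stand-in for dividing by m(y)
theta <- seq(0.001, 0.999, length.out = 999)  # grid over (0, 1), step 0.001
prior <- dbeta(theta, 2, 2)                   # pi(theta), hypothetical prior
like  <- dbinom(7, size = 10, prob = theta)   # f(y | theta) at y = 7
post  <- prior * like                         # unnormalized posterior
post  <- post / (sum(post) * 0.001)           # grid estimate of 1/m(y)
plot(theta, post, type = "l")                 # posterior density for theta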

Basics of Bayesian Inference – p. 2

SLIDE 7

Basics of Bayesian Inference

• More generally, we would have a prior distribution π(θ|λ), where λ is a vector of hyperparameters
• In fact, we can think of θ even more generally as the “process” of interest, with some parts known and some parts unknown. Then we can write the joint model as

f(y | process, θ) f(process | θ, λ) π(θ|λ) π(λ) ,

a hierarchical specification
• If λ is known, the posterior distribution for θ is given by

p(θ | y, λ) = p(y, θ | λ) / p(y | λ)
            = f(y|θ) π(θ|λ) / ∫ f(y|θ) π(θ|λ) dθ
            = f(y|θ) π(θ|λ) / m(y | λ) .

Basics of Bayesian Inference – p. 3

SLIDE 8

Basics of Bayesian Inference

• Since λ will not be known, a second-stage (hyperprior) distribution h(λ) will be required, so that

p(θ|y) = p(y, θ) / p(y) = ∫ f(y|θ) π(θ|λ) h(λ) dλ / ∫∫ f(y|θ) π(θ|λ) h(λ) dθ dλ .

• Alternatively, we might replace λ in p(θ | y, λ) by an estimate λ̂; this is called empirical Bayes analysis
• Comparing the posterior p(θ|y) with the prior π(θ) shows how the data have updated our beliefs; this change is referred to as Bayesian learning

Basics of Bayesian Inference – p. 4

SLIDES 9–11

Illustration of Bayes’ Theorem

• Suppose f(y|θ) = N(y|θ, σ²), with θ ∈ ℜ and σ > 0 known
• If we take π(θ|λ) = N(θ|µ, τ²), where λ = (µ, τ)′ is fixed and known, then it is easy to show that

p(θ|y) = N( θ | (σ²/(σ² + τ²)) µ + (τ²/(σ² + τ²)) y , σ²τ²/(σ² + τ²) ) .

• Note that the posterior mean E(θ|y) is a weighted average of the prior mean µ and the data value y, with weights depending on our relative uncertainty
• The posterior precision (reciprocal of the variance) is 1/σ² + 1/τ², the sum of the likelihood and prior precisions

Basics of Bayesian Inference – p. 5
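A quick R check of this update, as a sketch; the numerical values (µ = 2, τ = 1, y = 6, σ = 1) anticipate the example on the next slide:

## Normal-normal update: posterior mean is a precision-weighted average
mu <- 2; tau <- 1       # prior N(mu, tau^2)
y <- 6; sigma <- 1      # single observation from N(theta, sigma^2)
w <- tau^2 / (sigma^2 + tau^2)                           # weight on the data y
post_mean <- (sigma^2 / (sigma^2 + tau^2)) * mu + w * y  # = 4 here
post_var  <- sigma^2 * tau^2 / (sigma^2 + tau^2)         # = 0.5
1 / post_var == 1 / sigma^2 + 1 / tau^2                  # precisions add: TRUE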

SLIDES 12–14

Illustration (continued)

• As a concrete example, let µ = 2, τ = 1, ȳ = 6, and σ = 1:

[Figure: densities over θ of the prior, the posterior with n = 1, and the posterior with n = 10]

• When n = 1, prior and likelihood receive equal weight
• When n = 10, the data begin to dominate the prior
• The posterior variance goes to zero as n → ∞

Basics of Bayesian Inference – p. 6
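A hedged R sketch that reproduces a figure like the one above, assuming the standard normal-normal update for n observations with sample mean ȳ:

## Prior vs. posteriors for n = 1 and n = 10 in the normal-normal model;
## posterior precision is n/sigma^2 + 1/tau^2
mu <- 2; tau <- 1; ybar <- 6; sigma <- 1
post <- function(n) {
  prec <- n / sigma^2 + 1 / tau^2                 # posterior precision
  m <- (n / sigma^2 * ybar + mu / tau^2) / prec   # posterior mean
  c(mean = m, sd = sqrt(1 / prec))
}
theta <- seq(-2, 9, length.out = 400)
plot(theta, dnorm(theta, mu, tau), type = "l", ylim = c(0, 1.4),
     xlab = expression(theta), ylab = "density")  # prior
p1 <- post(1)
lines(theta, dnorm(theta, p1["mean"], p1["sd"]), lty = 2)
p10 <- post(10)
lines(theta, dnorm(theta, p10["mean"], p10["sd"]), lty = 3)
legend("topleft", lty = 1:3,
       legend = c("prior", "posterior, n = 1", "posterior, n = 10"))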

SLIDE 15

Notes on priors

• The prior here is conjugate: it leads to a posterior distribution for θ that is a member of the same distributional family as the prior
• Setting τ² = ∞ corresponds to an arbitrarily vague (or noninformative) prior. The posterior is then

p(θ|y) = N( θ | ȳ, σ²/n ) ,

the same as the likelihood!
• The limit of the conjugate (normal) prior here is a uniform (or “flat”) prior, and thus the posterior is the normalized likelihood
• The flat prior is improper, since ∫ p(θ) dθ = +∞. However, as long as the posterior is integrable, i.e.,

∫_Θ f(y|θ) π(θ) dθ < ∞ ,

an improper prior can be used!

Basics of Bayesian Inference – p. 7

SLIDE 16

A linear model example

• Let Y be an n × 1 data vector, X an n × p matrix of covariates, and adopt the likelihood and prior structure

Y|β ∼ Nn(Xβ, Σ) and β ∼ Np(Aα, V ) .

• Then the posterior distribution of β|Y is

β|Y ∼ N(Dd, D) , where D⁻¹ = XᵀΣ⁻¹X + V⁻¹ and d = XᵀΣ⁻¹Y + V⁻¹Aα .

• V⁻¹ = 0 delivers a “flat” prior; if additionally Σ = σ²In, we get

β|Y ∼ N( β̂ , σ²(XᵀX)⁻¹ ) , where β̂ = (XᵀX)⁻¹Xᵀy ⟺ the usual likelihood analysis!
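A minimal R sketch of this posterior computation on simulated data; the prior settings (A = I, α = 0, V = 100 I) are hypothetical:

## Conjugate linear-model posterior beta | Y ~ N(Dd, D)
set.seed(1)
n <- 50; p <- 2
X <- cbind(1, rnorm(n))                    # n x p design matrix
y <- X %*% c(1, 2) + rnorm(n)              # simulated data, Sigma = I_n
Sigma_inv <- diag(n)
A <- diag(p); alpha <- rep(0, p); V_inv <- diag(p) / 100  # vague prior
D <- solve(t(X) %*% Sigma_inv %*% X + V_inv)              # posterior covariance
d <- t(X) %*% Sigma_inv %*% y + V_inv %*% A %*% alpha     # as on the slide
drop(D %*% d)                              # posterior mean; close to (1, 2)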

Basics of Bayesian Inference – p. 8

SLIDE 17

More on priors

How do we choose priors?
• Prior robustness; sensitivity to the prior
• Informative vs. noninformative
• Dangers with improper priors; appealing, but...
• There is always some prior information
• Prior elicitation
• Priors based upon previous experiments (previous posteriors can be current priors)
• Hyperpriors?

Basics of Bayesian Inference – p. 9

SLIDE 18

More on priors

Back to conjugacy:
• Y|µ ∼ N(µ, σ²), µ ∼ N(µ0, τ²): then marginally Y ∼ Normal, and conditionally µ|y ∼ Normal
• For vectors, Y|µ ∼ N(µ, Σ), µ ∼ N(µ0, V ): then marginally Y ∼ Normal, and conditionally µ|y ∼ Normal
• For variances, with Y|µ ∼ N(µ, σ²): if σ² ∼ IG(a, b), then σ²|y ∼ IG (with updated parameters)
• Never use IG(ε, ε) for small ε: it is almost improper and, with variance components, leads to almost improper posteriors
• A similar result holds for Σ, but with inverse Wishart distributions
• Other conjugacies: Poisson with Gamma; Binomial with Beta
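A small R check of the Binomial-with-Beta conjugacy just mentioned (the numbers are hypothetical): the grid-normalized product of prior and likelihood matches the conjugate Beta posterior:

## Beta(a, b) prior with y successes in n trials gives Beta(a + y, b + n - y)
a <- 2; b <- 2; n <- 10; y <- 7
theta <- seq(0.001, 0.999, by = 0.001)
exact <- dbeta(theta, a + y, b + n - y)             # conjugate posterior
grid  <- dbeta(theta, a, b) * dbinom(y, n, theta)   # prior x likelihood
grid  <- grid / (sum(grid) * 0.001)                 # normalize on the grid
max(abs(exact - grid))                              # near zero (grid error only)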

Basics of Bayesian Inference – p. 10

SLIDE 19

Bayesian updating

• Often referred to as “crossing bridges as you come to them”
• Simplifies sequential data collection
• Simplest version: Y1, Y2 independent given θ. The joint model is

p(y2|θ) p(y1|θ) π(θ) ∝ p(y2|θ) π(θ|y1) ,

i.e., Y1 updates π(θ) to π(θ|y1) before Y2 arrives
• Works for more than two updates, for updating in blocks, and for dependent as well as independent data
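A tiny R sketch of the updating idea, using two hypothetical binomial batches with a Beta prior; updating batch by batch gives the same posterior as updating once with all the data:

## Sequential Bayesian updating: yesterday's posterior is today's prior
a <- 1; b <- 1                    # uniform prior on theta
y1 <- 3; n1 <- 5                  # first batch: 3 successes in 5 trials
y2 <- 4; n2 <- 5                  # second batch
a1 <- a + y1; b1 <- b + n1 - y1   # posterior after y1 = prior for y2
a2 <- a1 + y2; b2 <- b1 + n2 - y2 # posterior after both batches
c(a2, b2)                         # same as one update with 7 of 10 overall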

Basics of Bayesian Inference – p. 11

SLIDE 20

CIHM

• Conditionally independent hierarchical model:

Πi p(yi|θi) Πi p(θi|η) π(η)

• If η were known, the model would not be interesting: a separate model for each i
• So, take η unknown: lots of learning about η, not as much about each θi
• The model implies the θi are exchangeable; learning about the θi takes the form of shrinkage (see the sketch below)
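As a hedged illustration of shrinkage, here is an R sketch with hypothetical data, treating the hyperparameter τ² as known and estimating η crudely from the data (a simplification of the full hierarchical analysis):

## y_i | theta_i ~ N(theta_i, 1), theta_i | eta ~ N(eta, tau^2):
## posterior means pull each y_i toward the estimated eta
set.seed(2)
tau2 <- 0.5
theta <- rnorm(8, 3, sqrt(tau2))            # unit-level means
y <- rnorm(8, theta, 1)                     # one observation per unit
eta_hat <- mean(y)                          # crude estimate of eta
B <- 1 / (1 + tau2)                         # shrinkage factor toward eta
shrunk <- B * eta_hat + (1 - B) * y         # approximate E(theta_i | y)
round(rbind(y = y, shrunk = shrunk), 2)     # shrunk values are less spread out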

Basics of Bayesian Inference – p. 12

SLIDE 21

Principles of Bayesian inference

• Usual issues: point and interval estimation, hypothesis testing ⇔ model comparison, prediction
• In fact, with a posterior distribution we can do more general inference, including probability statements about the unknowns
• For parameters, we use the posterior directly
• For prediction, the posterior predictive distribution
• For model comparison, it depends upon the utility for the model: with primary interest in explanation we would use Bayes factors; with primary interest in prediction we would use cross-validation
• Model adequacy is difficult. Sometimes the marginal density ordinate m(yobs) is used, but it is hard to calibrate. So, typically, we use the usual EDA tools with cross-validation, using f(yi|y−i)

Basics of Bayesian Inference – p. 13

SLIDE 22

Bayesian estimation

• Point estimation: choose an appropriate measure of centrality: the posterior mean, median, or mode
• Interval estimation: equal-tail interval. Consider qL and qU, the α/2- and (1 − α/2)-quantiles of p(θ|y):

∫_{−∞}^{qL} p(θ|y) dθ = α/2  and  ∫_{−∞}^{qU} p(θ|y) dθ = 1 − α/2 .

• Then P(qL < θ < qU | y) = 1 − α. Thus, this interval is a (1 − α) credible set (“Bayesian CI”) for θ
• Easy to compute, but not necessarily the best interval. The highest posterior density (HPD) set is best, but hard to compute
• Direct interpretation: “The probability that θ lies in (qL, qU) is 1 − α”
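Since the equal-tail interval comes straight from the quantile function while the HPD requires a search, here is an R sketch for a unimodal posterior; it uses the Beta(8, 4) posterior from the example on the next slide:

## Equal-tail vs. HPD interval: the HPD is the shortest interval
## containing 1 - alpha posterior mass (valid for a unimodal density)
a <- 8; b <- 4; alpha <- 0.05
qbeta(c(alpha / 2, 1 - alpha / 2), a, b)       # equal-tail 95% interval
width <- function(p) diff(qbeta(c(p, p + 1 - alpha), a, b))
p_lo <- optimize(width, c(0, alpha))$minimum   # lower-tail mass minimizing width
qbeta(c(p_lo, p_lo + 1 - alpha), a, b)         # 95% HPD interval (shorter)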

Basics of Bayesian Inference – p. 14

SLIDES 23–25

Ex: Y ∼ Bin(10, θ), θ ∼ U(0, 1), yobs = 7

[Figure: posterior density of θ over (0, 1), with the posterior median (solid) and 95% equal-tail interval (dotted) marked]

• Plot the Beta(yobs + 1, n − yobs + 1) = Beta(8, 4) posterior in R/S:

> theta <- seq(from=0, to=1, length=101)
> yobs <- 7; n <- 10
> plot(theta, dbeta(theta, yobs+1, n-yobs+1), type="l")

• Add the posterior median (solid vertical line) and the 95% equal-tail Bayesian CI (dotted vertical lines):

> abline(v=qbeta(.5, yobs+1, n-yobs+1))
> abline(v=qbeta(c(.025, .975), yobs+1, n-yobs+1), lty=2)

Basics of Bayesian Inference – p. 15

SLIDES 26–32

Bayesian hypothesis testing

• The classical approach bases the accept/reject decision on the

p-value = P{T(Y) more “extreme” than T(yobs) | θ, H0} ,

where “extremeness” is in the direction of HA
• Several troubles with this approach:
  • hypotheses must be nested
  • a p-value only offers evidence against the null, and this evidence is badly distorted with a point null
  • a p-value is not the “probability that H0 is true” (but is often erroneously interpreted this way)
  • two experiments with different designs but identical likelihoods could result in different p-values, violating the Likelihood Principle
  • why reject based on an unobserved tail region?

Basics of Bayesian Inference – p. 16

SLIDE 33

Bayesian hypothesis testing (cont’d)

• Select the model with the largest posterior probability,

P(Mi|y) = p(y|Mi) p(Mi) / Σj p(y|Mj) p(Mj) ,

where

p(y|Mi) = ∫ f(y|θi, Mi) πi(θi) dθi .

• Awkward: where would the p(Mi) come from? And p(y|Mi) may be difficult to compute
• For two models, the Bayes factor is

BF = [P(M1|y)/P(M2|y)] / [P(M1)/P(M2)] = p(y|M1) / p(y|M2) .

• It can be used for any pair of models
• It reduces to the likelihood ratio if both hypotheses are simple
• It is the posterior odds relative to the prior odds
• It is not defined if πi(θi) is improper
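A small R sketch of a Bayes factor computation for two hypothetical binomial models, where the marginal likelihood under a Beta prior is available in closed form:

## y = 7 successes in n = 10 trials;
## M1: theta ~ Beta(1, 1) vs. M2: theta = 0.5 (a simple point model)
y <- 7; n <- 10
marg_M1 <- choose(n, y) * beta(1 + y, 1 + n - y) / beta(1, 1)  # p(y | M1)
marg_M2 <- dbinom(y, n, 0.5)                                   # p(y | M2)
BF <- marg_M1 / marg_M2   # about 0.78: mild evidence for the point model
BF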

Basics of Bayesian Inference – p. 17

SLIDE 34

Bayesian hypothesis testing via DIC

• A generalization of the Akaike Information Criterion (AIC) to the case of hierarchical models, based on the posterior distribution of the deviance statistic

D(θ) = −2 log f(y|θ) + 2 log h(y) ,

where f(y|θ) is the likelihood and h(y) is any standardizing function of the data alone
• Summarize the fit of a model by the posterior expectation of the deviance, D̄ = Eθ|y[D]
• Summarize the complexity of a model by the effective number of parameters,

pD = Eθ|y[D] − D(Eθ|y[θ]) = D̄ − D(θ̄) .

Basics of Bayesian Inference – p. 18

SLIDE 35

Bayesian hypothesis testing via DIC

• The Deviance Information Criterion (DIC) is then

DIC = D̄ + pD = 2D̄ − D(θ̄) ,

with smaller values indicating preferred models
• Both building blocks of DIC and pD, namely Eθ|y[D] and D(Eθ|y[θ]), are easily estimated via MCMC methods, and in fact are automatic within WinBUGS
• While pD has a scale (effective model size), DIC does not, so only differences in DIC across models matter
• DIC can be sensitive to parametrization and to “focus”: f(y|θ) is “focused on θ”, while p(y|η) = ∫ f(y|θ) p(θ|η) dθ is “focused on η”
• DIC measures comparative explanatory performance, and tends to select “bigger” models
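A hedged R sketch of the DIC computation from posterior draws, using a toy normal-mean model with draws standing in for MCMC output:

## Deviance here is -2 log f(y | theta), taking h(y) = 1
set.seed(3)
y <- c(5.2, 6.8, 5.9, 6.3, 5.6)                     # toy data, sigma = 1
draws <- rnorm(5000, mean(y), 1 / sqrt(length(y)))  # stand-in for MCMC draws
dev <- function(theta) -2 * sum(dnorm(y, theta, 1, log = TRUE))
Dbar <- mean(sapply(draws, dev))                    # posterior mean deviance
pD <- Dbar - dev(mean(draws))                       # effective no. of parameters
DIC <- Dbar + pD
c(pD = pD, DIC = DIC)                               # pD should be near 1 here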

Basics of Bayesian Inference – p. 19

SLIDE 36

Prediction

• Often the intended use for a model is prediction
• We then need the posterior predictive distribution: to predict a new Y0,

p(y0|y) = ∫ p(y0|θ) p(θ|y) dθ

• This works even when the y’s are dependent (p(y0|θ) is then replaced by p(y0|y, θ))
• Since θ is only an artificial, unobservable, theoretical object, perhaps we should only judge models in predictive space
• This suggests cross-validation: compare a hold-out sample with its associated posterior predictive distributions
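A minimal R sketch of sampling from the posterior predictive distribution, reusing the Beta(8, 4) posterior from the earlier binomial example:

## Composition sampling: draw theta from the posterior,
## then y0 from the likelihood at that theta
set.seed(4)
theta_draws <- rbeta(10000, 8, 4)                   # theta ~ p(theta | y)
y0 <- rbinom(10000, size = 10, prob = theta_draws)  # y0 ~ p(y0 | theta)
table(y0) / length(y0)                              # estimate of p(y0 | y)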

Basics of Bayesian Inference – p. 20

SLIDE 37

Posterior predictive loss criterion

• When looking at predictive performance, we again need to penalize for model complexity
• We need a loss function that rewards goodness of fit to the observed data as well as predictive performance for new or replicate data
• We introduce a balanced loss function; for the squared error loss case, we obtain

Dk = (k/(k + 1)) G + P , where G = Σl (E(Yl,new|y) − yl,obs)² and P = Σl Var(Yl,new|y)

• The posterior predictive means and variances are readily computed
• G is a goodness-of-fit term; P is a penalty
• Dk measures comparative predictive performance; small values are preferred
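A hedged R sketch of computing G, P, and Dk from posterior predictive replicates, for a hypothetical common-θ binomial model (large k approximates k → ∞):

## Posterior predictive loss criterion D_k from replicate draws Y_{l,new} | y
set.seed(5)
y_obs <- c(7, 5, 8); n <- 10                  # three hypothetical counts
th <- rbeta(10000, 1 + sum(y_obs), 1 + sum(n - y_obs))    # posterior draws
reps <- sapply(y_obs, function(yl) rbinom(10000, n, th))  # replicates per l
G <- sum((colMeans(reps) - y_obs)^2)          # goodness-of-fit term
P <- sum(apply(reps, 2, var))                 # penalty term
k <- 1e6                                      # weight k/(k+1) near 1
Dk <- k / (k + 1) * G + P
c(G = G, P = P, Dk = Dk)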

Basics of Bayesian Inference – p. 21

SLIDE 38

The Bayesian computation problem

• What is the problem? INTEGRATION
• In practical problems of interest, in hierarchical models, the dimension of the parameter set is large. But then:
  • to normalize the joint density so that we can make probability statements requires a high-dimensional integration
  • to compute any desired expectation requires a ratio of high-dimensional integrations
  • to marginalize, i.e., to find the distribution of any parameter (or function of the parameters), requires a ratio of high-dimensional integrations
• Such computational limitations long restricted the use of Bayesian inference to toy problems

Basics of Bayesian Inference – p. 22

SLIDE 39

The Bayesian computation problem (cont.)

• Sampling-based methods (that is, sampling from the high-dimensional posterior distribution), in conjunction with the wide availability of inexpensive, high-speed computing, have solved the computation problem
• Now the tables are turned: through hierarchical modeling in a Bayesian framework, models that are inaccessible in a classical framework can be handled
• This is particularly the case for spatial data models, due to concerns with asymptotics and with getting the uncertainty “right”
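As a minimal illustration of the sampling-based idea: once we can draw from the posterior, expectations and probability statements reduce to sample summaries, with no explicit integration. A sketch using the Beta(8, 4) posterior from the earlier example:

## Monte Carlo summaries replace high-dimensional integrals
draws <- rbeta(10000, 8, 4)         # posterior draws (MCMC output in general)
mean(draws)                         # E(theta | y) without integrating
quantile(draws, c(.025, .975))      # 95% equal-tail credible interval
mean(draws > 0.5)                   # P(theta > 0.5 | y)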

Basics of Bayesian Inference – p. 23