Introduction to Bayesian Statistics, Lecture 9: Hierarchical Models


SLIDE 1

Introduction to Bayesian Statistics

Lecture 9: Hierarchical Models

Rung-Ching Tsai

Department of Mathematics National Taiwan Normal University

May 6, 2015

SLIDE 2

Example

  • Data: Weekly weights of 30 young rats (Gelfand, Hills, Racine-Poon, & Smith, 1990).

    Day      8    15    22    29    36
    Rat 1    151  199   246   283   320
    Rat 2    145  199   249   293   354
    ...
    Rat 30   153  200   244   286   324

  • Model:

    $Y_{ij} = \alpha + \beta x_j + \epsilon_{ij}$, where $Y_{ij}$ is the weight of the $i$-th rat on day $x_j$, and $\epsilon_{ij} \sim \mathrm{Normal}(0, \sigma^2)$.

  • What is the assumption on the growth of the 30 rats in this model?
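The model above assumes a single common growth line for every rat. As a minimal sketch of what that pooled fit looks like, the following evaluates the common $(\alpha, \beta)$ by least squares using only the three rats printed on the slide (the full data set has 30 rats, so the numbers here are illustrative):

```python
import numpy as np

# Days of measurement and the three rats' weights shown on the slide
# (the full data set has 30 rats; this subset is only for illustration).
days = np.array([8, 15, 22, 29, 36])
weights = np.array([
    [151, 199, 246, 283, 320],  # rat 1
    [145, 199, 249, 293, 354],  # rat 2
    [153, 200, 244, 286, 324],  # rat 30
])

# Pooled model: one common intercept alpha and slope beta for all rats.
x = np.tile(days, weights.shape[0])   # stacked day covariate
y = weights.ravel()                   # stacked responses
beta, alpha = np.polyfit(x, y, 1)     # least-squares line fit

print(f"alpha = {alpha:.1f}, beta = {beta:.2f} grams/day")
```

Because every rat shares the same $(\alpha, \beta)$, any rat-to-rat variation in growth rate is absorbed into the error term $\epsilon_{ij}$.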

SLIDE 3

Example

  • Data: Number of failures and length of operation time of 10 power plant pumps (George, Makov, & Smith, 1993).

    Pump      1     2     3     4     5     6     7     8     9     10
    time      94.5  15.7  62.9  126   5.24  31.4  1.05  1.05  2.1   10.5
    failures  5     1     5     14    3     19    1     1     4     22

  • Model:

    $X_i \sim \mathrm{Poisson}(\lambda t_i)$, where $X_i$ is the number of failures of pump $i$, $\lambda$ is the failure rate, and $t_i$ is the length of operation time of pump $i$ (in 1000s of hours).

  • What is the assumption on the failure rates of the 10 power plant pumps in this model?
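Under the common-rate assumption, the maximum-likelihood estimate of $\lambda$ is total failures divided by total exposure. A short sketch with the slide's data, contrasting the pooled estimate with the per-pump rates:

```python
# Pump data from the slide: operation time (1000s of hours) and failure counts.
time = [94.5, 15.7, 62.9, 126, 5.24, 31.4, 1.05, 1.05, 2.1, 10.5]
failures = [5, 1, 5, 14, 3, 19, 1, 1, 4, 22]

# With X_i ~ Poisson(lambda * t_i) and a single common rate lambda,
# the MLE is total failures divided by total exposure.
lam_hat = sum(failures) / sum(time)
print(f"common rate MLE: {lam_hat:.3f} failures per 1000 hours")

# Per-pump empirical rates vary by a factor of about 40, which casts
# doubt on the single-rate assumption.
rates = [x / t for x, t in zip(failures, time)]
print(f"per-pump rates range from {min(rates):.3f} to {max(rates):.3f}")
```

The wide spread of the per-pump rates is exactly the problem the next slide raises.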

SLIDE 4

Possible problems with above approaches

  • A single (α, β) may be inadequate to fit all the rats. Likewise, a common failure rate λ for all the power plant pumps may not be suitable.

  • Separate, unrelated (αi, βi) for each rat, or λi for each pump, are likely to overfit the data. Some information about the parameters of one rat or one pump can be obtained from the others' data.

SLIDE 5

Motivation for hierarchical models

  • A natural idea is to assume that the (αi, βi)'s or λi's are samples from a common population distribution. A model in which the distribution of the observed outcomes is conditional on parameters that themselves have a probability specification is known as a hierarchical or multilevel model.

  • The new parameters introduced to govern the population distribution of the parameters are called hyperparameters.

  • Thus, we need to estimate the parameters governing the population distribution of the (αi, βi), rather than each (αi, βi) separately.

SLIDE 6

Bayesian approach to hierarchical models

  • Model specification
    • specify the sampling distribution of the data: p(y|θ)
    • specify the population distribution of θ: p(θ|φ), where φ is the hyperparameter
  • Bayesian estimation
    • specify the prior for the hyperparameter, p(φ). Many levels are possible; the hyperprior distribution at the highest level is often chosen to be non-informative.
    • combine with the model specification above: p(y|θ) and p(θ|φ)
    • find the joint posterior distribution of the parameter θ and the hyperparameter φ:
      $$p(\theta, \phi \mid y) \propto p(\theta, \phi)\, p(y \mid \theta, \phi) = p(\theta, \phi)\, p(y \mid \theta) \propto p(\phi)\, p(\theta \mid \phi)\, p(y \mid \theta)$$
    • point and credible-interval estimation for φ and θ
    • predictive distribution for $\tilde{y}$

SLIDE 7

Analytical derivation of conditional/marginal dist.

  • Write out the joint posterior distribution:
      $$p(\theta, \phi \mid y) \propto p(\phi)\, p(\theta \mid \phi)\, p(y \mid \theta)$$
  • Determine analytically the conditional posterior density of θ given φ: p(θ|φ, y)
  • Obtain the marginal posterior distribution of φ, either by integration,
      $$p(\phi \mid y) = \int p(\theta, \phi \mid y)\, d\theta,$$
    or via the identity
      $$p(\phi \mid y) = \frac{p(\theta, \phi \mid y)}{p(\theta \mid \phi, y)}.$$

SLIDE 8

Simulations from the posterior distributions

  • 1. Two steps to simulate a random draw from the joint posterior distribution of θ and φ, p(θ, φ|y):
    • draw φ from its marginal posterior distribution p(φ|y)
    • draw the parameter θ from its conditional posterior distribution p(θ|φ, y)
  • 2. If desired, draw predictive values $\tilde{y}$ from the posterior predictive distribution given the drawn θ.

SLIDE 9

Example: Rat tumors

  • Goal: Estimating the risk of tumor in a group of rats
  • Data (number of rats that developed a tumor / number of rats in the experiment):
  • 1. 70 historical experiments:

0/20 0/20 0/20 0/20 0/20 0/20 0/20 0/19 0/19 0/19 0/19 0/18 0/18 0/17 1/20 1/20 1/20 1/20 1/19 1/19 1/18 1/18 2/25 2/24 2/23 2/20 2/20 2/20 2/20 2/20 2/20 1/10 5/49 2/19 5/46 3/27 2/17 7/49 7/47 3/20 3/20 2/13 9/48 10/50 4/20 4/20 4/20 4/20 4/20 4/20 4/20 10/48 4/19 4/19 4/19 5/22 11/46 12/49 5/20 5/20 6/23 5/19 6/22 6/20 6/20 6/20 16/52 15/47 15/46 9/24

  • 2. Current experiment: 4/14

SLIDE 10

Bayesian approach to hierarchical models

  • Model specification
    • sampling distribution of the data: $y_j \sim \mathrm{Binomial}(n_j, \theta_j)$, $j = 1, 2, \ldots, 71$
    • population distribution of θ: $\theta_j \sim \mathrm{Beta}(\alpha, \beta)$, where α and β are the hyperparameters

  • Bayesian estimation
    • non-informative prior for the hyperparameters: p(α, β)
    • combine with the model specification above: p(θ|α, β)
    • find the joint posterior distribution of the parameters θ and the hyperparameters α and β:
      $$p(\theta, \alpha, \beta \mid y) \propto p(\alpha, \beta)\, p(\theta \mid \alpha, \beta)\, p(y \mid \theta, \alpha, \beta) \propto p(\alpha, \beta) \prod_{j=1}^{J} \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, \theta_j^{\alpha-1} (1-\theta_j)^{\beta-1} \prod_{j=1}^{J} \theta_j^{y_j} (1-\theta_j)^{n_j-y_j}$$

SLIDE 11

Analytical derivation of conditional/marginal dist.

  • the joint posterior distribution:
      $$p(\theta, \alpha, \beta \mid y) \propto p(\alpha, \beta) \prod_{j=1}^{J} \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, \theta_j^{\alpha-1} (1-\theta_j)^{\beta-1} \prod_{j=1}^{J} \theta_j^{y_j} (1-\theta_j)^{n_j-y_j}$$
  • the conditional posterior density of θ given α and β:
      $$p(\theta \mid \alpha, \beta, y) = \prod_{j=1}^{J} \frac{\Gamma(\alpha+\beta+n_j)}{\Gamma(\alpha+y_j)\,\Gamma(\beta+n_j-y_j)}\, \theta_j^{\alpha+y_j-1} (1-\theta_j)^{\beta+n_j-y_j-1}$$
  • the marginal posterior distribution of α and β:
      $$p(\alpha, \beta \mid y) = \frac{p(\theta, \alpha, \beta \mid y)}{p(\theta \mid \alpha, \beta, y)} \propto p(\alpha, \beta) \prod_{j=1}^{J} \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, \frac{\Gamma(\alpha+y_j)\,\Gamma(\beta+n_j-y_j)}{\Gamma(\alpha+\beta+n_j)}$$

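The marginal posterior $p(\alpha, \beta \mid y)$ is a product of gamma-function ratios, so it is evaluated on the log scale. A minimal sketch using the standard library's `lgamma`, with the $(\alpha+\beta)^{-5/2}$ hyperprior adopted on the next slide; the data here are a hypothetical small subset of the slide-9 experiments plus the current one (4/14), chosen only to keep the example short:

```python
from math import lgamma, log

def log_post_ab(a, b, y, n):
    """Unnormalized log marginal posterior log p(alpha, beta | y) for the
    beta-binomial hierarchy, using the hyperprior
    p(alpha, beta) proportional to (alpha + beta)^(-5/2)."""
    lp = -2.5 * log(a + b)                       # log hyperprior
    for yj, nj in zip(y, n):
        lp += (lgamma(a + b) - lgamma(a) - lgamma(b)
               + lgamma(a + yj) + lgamma(b + nj - yj)
               - lgamma(a + b + nj))
    return lp

# Hypothetical subset of the slide-9 experiments, plus the current 4/14.
y = [0, 1, 2, 5, 4, 10, 4, 4]
n = [20, 20, 25, 49, 20, 50, 19, 14]

# Hyperparameters whose implied mean alpha/(alpha+beta) is near the data's
# overall tumor rate (~0.14) score far better than ones implying a high rate.
print(log_post_ab(2.0, 14.0, y, n), log_post_ab(14.0, 2.0, y, n))
```

Evaluating this function over a grid of $(\log(\alpha/\beta), \log(\alpha+\beta))$ values gives the contour plot and grid approximation described on the following slides.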
SLIDE 12

Choice of hyperprior distribution

  • Idea: To set up a 'non-informative' hyperprior distribution
  • $p\left(\operatorname{logit}\left(\frac{\alpha}{\alpha+\beta}\right), \log(\alpha+\beta)\right) = p\left(\log\left(\frac{\alpha}{\beta}\right), \log(\alpha+\beta)\right) \propto 1$:
    NO GOOD, because it leads to an improper posterior.
  • $p\left(\frac{\alpha}{\alpha+\beta}, \alpha+\beta\right) \propto 1$ or $p(\alpha, \beta) \propto 1$:
    NO GOOD, because the posterior density is not integrable in the limit.
  • $p\left(\frac{\alpha}{\alpha+\beta}, (\alpha+\beta)^{-1/2}\right) \propto 1 \iff p(\alpha, \beta) \propto (\alpha+\beta)^{-5/2} \iff p\left(\log\left(\frac{\alpha}{\beta}\right), \log(\alpha+\beta)\right) \propto \alpha\beta\,(\alpha+\beta)^{-5/2}$:
    OK, because it leads to a proper posterior.

SLIDE 13

Computing marginal posterior of the hyperparameters

  • Compute the relative (unnormalized) posterior density on a grid of values that covers the effective range of (α, β):
    • $\left(\log\left(\frac{\alpha}{\beta}\right), \log(\alpha+\beta)\right) \in [-1, -2.5] \times [1.5, 3]$
    • $\left(\log\left(\frac{\alpha}{\beta}\right), \log(\alpha+\beta)\right) \in [-1.3, -2.3] \times [1, 5]$
  • Draw a contour plot of the marginal density of $\left(\log\left(\frac{\alpha}{\beta}\right), \log(\alpha+\beta)\right)$; contour lines are at 0.05, 0.15, ..., 0.95 times the density at the mode.
  • Normalize by approximating the posterior distribution as a step function over the grid and setting the total probability in the grid to 1.
  • Compute the posterior moments based on the grid of $\left(\log\left(\frac{\alpha}{\beta}\right), \log(\alpha+\beta)\right)$. For example, E(α|y) is estimated by
      $$\mathrm{E}(\alpha \mid y) \approx \sum_{\log(\alpha/\beta),\, \log(\alpha+\beta)} \alpha\; p\!\left(\log\left(\tfrac{\alpha}{\beta}\right), \log(\alpha+\beta) \,\Big|\, y\right)$$

SLIDE 14

Sampling from the joint posterior

  • 1. Simulate 1000 draws of $\left(\log\left(\frac{\alpha}{\beta}\right), \log(\alpha+\beta)\right)$ from their posterior distribution using the discrete-grid sampling procedure.
  • 2. For l = 1, ..., 1000:
    • transform the l-th draw of $\left(\log\left(\frac{\alpha}{\beta}\right), \log(\alpha+\beta)\right)$ to the scale of (α, β) to yield a draw of the hyperparameters from their marginal posterior distribution
    • for each j = 1, ..., J, sample θj from its conditional posterior distribution $\theta_j \mid \alpha, \beta, y \sim \mathrm{Beta}(\alpha+y_j, \beta+n_j-y_j)$
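The two steps above can be sketched as follows. The grid density used here is a toy placeholder (in the real analysis it comes from evaluating the slide-11 marginal posterior on the grid), so the numbers are illustrative only; the grid ranges are those from the previous slides:

```python
import numpy as np

rng = np.random.default_rng(0)

# Grid over the transformed hyperparameters.
u = np.linspace(-2.3, -1.3, 50)      # log(alpha/beta)
v = np.linspace(1.0, 5.0, 50)        # log(alpha+beta)
U, V = np.meshgrid(u, v, indexing="ij")

# Placeholder for the unnormalized marginal posterior on the grid; the
# real analysis evaluates p(alpha, beta | y) here instead.
dens = np.exp(-((U + 1.8) ** 2 + (V - 2.9) ** 2))
p = (dens / dens.sum()).ravel()      # step-function normalization

# Step 1: discrete-grid sampling of the hyperparameters, then transform
# back: alpha+beta = e^v and alpha/beta = e^u.
idx = rng.choice(p.size, size=1000, p=p)
log_ratio, log_total = U.ravel()[idx], V.ravel()[idx]
alpha = np.exp(log_total) / (1.0 + np.exp(-log_ratio))
beta = np.exp(log_total) / (1.0 + np.exp(log_ratio))

# Step 2: for one experiment (the current one, 4/14), sample theta_j
# from its Beta conditional posterior, once per hyperparameter draw.
y_j, n_j = 4, 14
theta = rng.beta(alpha + y_j, beta + n_j - y_j)
```

Each entry of `theta` is one posterior draw of the current experiment's tumor rate, with hyperparameter uncertainty propagated through.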

SLIDE 15

Displaying the results

  • Plot the posterior means and 95% intervals for the θj's (Figure 5.4 on page 131).
  • The rates θj are shrunk from their sample point estimates, yj/nj, towards the mean of the population distribution.
  • Experiments with few observations are shrunk more and have higher posterior variances.
  • Note that posterior variability is higher in the full Bayesian analysis, reflecting posterior uncertainty in the hyperparameters.

SLIDE 16

Hierarchical normal models (I)

  • Model specification
    • sampling distribution of the data: $y_{ij} \mid \theta_j \sim \mathrm{Normal}(\theta_j, \sigma^2)$, $i = 1, \ldots, n_j$, $j = 1, 2, \ldots, J$, with $\sigma^2$ known
    • population distribution of θ: $\theta_j \sim \mathrm{Normal}(\mu, \tau^2)$, where μ and τ are the hyperparameters. That is,
      $$p(\theta_1, \ldots, \theta_J \mid \mu, \tau) = \prod_{j=1}^{J} \mathrm{N}(\theta_j \mid \mu, \tau^2)$$
      $$p(\theta_1, \ldots, \theta_J) = \int \prod_{j=1}^{J} \left[\mathrm{N}(\theta_j \mid \mu, \tau^2)\right] p(\mu, \tau)\, d(\mu, \tau)$$

SLIDE 17

Hierarchical normal models (II)

  • Bayesian estimation
    • non-informative prior for the hyperparameters: p(μ, τ) = p(μ|τ) p(τ) ∝ p(τ)
    • combine with the model specification above: p(θ|μ, τ)
    • find the joint posterior distribution of the parameters θ and the hyperparameters μ and τ:
      $$p(\theta, \mu, \tau \mid y) \propto p(\mu, \tau)\, p(\theta \mid \mu, \tau)\, p(y \mid \theta) \propto p(\mu, \tau) \prod_{j=1}^{J} \mathrm{N}(\theta_j \mid \mu, \tau^2) \prod_{j=1}^{J} \mathrm{N}(\bar{y}_{.j} \mid \theta_j, \sigma^2/n_j)$$

SLIDE 18

Conditional posterior of θ given (µ, τ), p(θ|µ, τ, y)

  • Prior: $\theta_j \mid \mu, \tau \sim \mathrm{Normal}(\mu, \tau^2)$
  • Conditional posterior: $\theta_j \mid \mu, \tau, y \sim \mathrm{Normal}(\hat{\theta}_j, V_j)$, where
      $$\hat{\theta}_j = \frac{\frac{n_j}{\sigma^2}\,\bar{y}_{.j} + \frac{1}{\tau^2}\,\mu}{\frac{n_j}{\sigma^2} + \frac{1}{\tau^2}}, \qquad V_j = \frac{1}{\frac{n_j}{\sigma^2} + \frac{1}{\tau^2}}$$

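The conditional posterior mean is a precision-weighted average of the group mean and the population mean. A minimal sketch (the function name is ours, not from the slides):

```python
def theta_conditional(ybar_j, n_j, sigma2, mu, tau2):
    """Posterior mean and variance of theta_j given (mu, tau): a
    precision-weighted average of the group mean ybar_j and the
    population mean mu."""
    data_precision = n_j / sigma2     # precision of ybar_j around theta_j
    prior_precision = 1.0 / tau2      # precision of the population prior
    V_j = 1.0 / (data_precision + prior_precision)
    theta_hat = V_j * (data_precision * ybar_j + prior_precision * mu)
    return theta_hat, V_j
```

As $\tau^2 \to \infty$ the prior is ignored and $\hat{\theta}_j \to \bar{y}_{.j}$ (no pooling); as $\tau^2 \to 0$, $\hat{\theta}_j \to \mu$ (complete pooling).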
SLIDE 19

Marginal posterior of µ and τ, p(µ, τ|y)

$$p(\mu, \tau \mid y) \propto p(\mu, \tau)\, p(y \mid \mu, \tau)$$
Since $\bar{y}_{.j} \mid \mu, \tau \sim \mathrm{Normal}\!\left(\mu, \frac{\sigma^2}{n_j} + \tau^2\right)$, we have
$$p(\mu, \tau \mid y) \propto p(\mu, \tau) \prod_{j=1}^{J} \mathrm{N}\!\left(\bar{y}_{.j} \,\Big|\, \mu, \frac{\sigma^2}{n_j} + \tau^2\right)$$

SLIDE 20

Posterior of µ given τ, p(µ|τ, y)

$$p(\mu, \tau \mid y) = p(\mu \mid \tau, y)\, p(\tau \mid y) \;\Rightarrow\; p(\mu \mid \tau, y) = \frac{p(\mu, \tau \mid y)}{p(\tau \mid y)}$$
Therefore, $\mu \mid \tau, y \sim \mathrm{Normal}(\hat{\mu}, V_\mu)$, where
$$\hat{\mu} = \frac{\sum_{j=1}^{J} \frac{1}{\sigma^2/n_j + \tau^2}\, \bar{y}_{.j}}{\sum_{j=1}^{J} \frac{1}{\sigma^2/n_j + \tau^2}} \qquad \text{and} \qquad V_\mu^{-1} = \sum_{j=1}^{J} \frac{1}{\sigma^2/n_j + \tau^2}$$

SLIDE 21

Posterior distribution of τ, p(τ|y)

$$p(\tau \mid y) = \frac{p(\mu, \tau \mid y)}{p(\mu \mid \tau, y)} \propto \frac{p(\tau) \prod_{j=1}^{J} \mathrm{N}\!\left(\bar{y}_{.j} \mid \mu, \frac{\sigma^2}{n_j} + \tau^2\right)}{\mathrm{N}(\mu \mid \hat{\mu}, V_\mu)} \propto \frac{p(\tau) \prod_{j=1}^{J} \mathrm{N}\!\left(\bar{y}_{.j} \mid \hat{\mu}, \frac{\sigma^2}{n_j} + \tau^2\right)}{\mathrm{N}(\hat{\mu} \mid \hat{\mu}, V_\mu)} \propto p(\tau)\, V_\mu^{1/2} \prod_{j=1}^{J} \left(\frac{\sigma^2}{n_j} + \tau^2\right)^{-1/2} \exp\!\left(-\frac{(\bar{y}_{.j} - \hat{\mu})^2}{2\left(\frac{\sigma^2}{n_j} + \tau^2\right)}\right)$$

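This one-dimensional marginal is easy to evaluate on a grid of τ values. A minimal sketch, assuming a flat p(τ) and toy group means (the data below are hypothetical, not from the slides):

```python
import numpy as np

def log_p_tau(tau, ybar, n, sigma2):
    """Unnormalized log p(tau | y), assuming a flat p(tau)."""
    s2 = sigma2 / n + tau ** 2          # marginal variance of each ybar_j
    w = 1.0 / s2                        # precisions
    V_mu = 1.0 / w.sum()
    mu_hat = (w * ybar).sum() * V_mu    # precision-weighted mean
    return (0.5 * np.log(V_mu)
            - 0.5 * np.sum(np.log(s2))
            - 0.5 * np.sum((ybar - mu_hat) ** 2 / s2))

# Hypothetical group means with n_j = 5 observations each and sigma2 = 1.
ybar = np.array([1.2, 0.8, 1.9, 0.3])
n = np.array([5, 5, 5, 5])
taus = np.linspace(0.01, 5, 500)
logp = np.array([log_p_tau(t, ybar, n, 1.0) for t in taus])
p = np.exp(logp - logp.max())
p /= p.sum()                            # step-function normalization
```

Draws of τ from this grid, followed by μ from p(μ|τ, y) and the θj from their Normal conditionals, complete the simulation of the hierarchical normal model.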
SLIDE 22

Prior distribution of τ, p(τ)

$$p(\tau \mid y) = \frac{p(\mu, \tau \mid y)}{p(\mu \mid \tau, y)} \propto \frac{p(\tau) \prod_{j=1}^{J} \mathrm{N}\!\left(\bar{y}_{.j} \mid \mu, \frac{\sigma^2}{n_j} + \tau^2\right)}{\mathrm{N}(\mu \mid \hat{\mu}, V_\mu)} \propto \frac{p(\tau) \prod_{j=1}^{J} \mathrm{N}\!\left(\bar{y}_{.j} \mid \hat{\mu}, \frac{\sigma^2}{n_j} + \tau^2\right)}{\mathrm{N}(\hat{\mu} \mid \hat{\mu}, V_\mu)} \propto p(\tau)\, V_\mu^{1/2} \prod_{j=1}^{J} \left(\frac{\sigma^2}{n_j} + \tau^2\right)^{-1/2} \exp\!\left(-\frac{(\bar{y}_{.j} - \hat{\mu})^2}{2\left(\frac{\sigma^2}{n_j} + \tau^2\right)}\right)$$
