Introduction to Bayesian Statistics, Lecture 9: Hierarchical Models


SLIDE 1

Introduction to Bayesian Statistics

Lecture 9: Hierarchical Models

Rung-Ching Tsai

Department of Mathematics National Taiwan Normal University

May 6, 2015

SLIDE 2

Example

  • Data: Weekly weights of 30 young rats (Gelfand, Hills, Racine-Poon, & Smith, 1990).

    Day      8    15    22    29    36
    Rat 1    151  199   246   283   320
    Rat 2    145  199   249   293   354
    ...
    Rat 30   153  200   244   286   324

  • Model:

    $Y_{ij} = \alpha + \beta x_j + \epsilon_{ij}$, where $Y_{ij}$ is the weight of the $i$-th rat on day $x_j$, and $\epsilon_{ij} \sim \mathrm{Normal}(0, \sigma^2)$.

  • What is the assumption on the growth of the 30 rats in this model?
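The model above assumes a single common growth line for every rat. As a minimal sketch of what that pooled fit looks like, the following evaluates the common $(\alpha, \beta)$ by least squares using only the three rats printed on the slide (the full data set has 30 rats, so the numbers here are illustrative):

```python
import numpy as np

# Days of measurement and the three rats' weights shown on the slide
# (the full data set has 30 rats; this subset is only for illustration).
days = np.array([8, 15, 22, 29, 36])
weights = np.array([
    [151, 199, 246, 283, 320],  # rat 1
    [145, 199, 249, 293, 354],  # rat 2
    [153, 200, 244, 286, 324],  # rat 30
])

# Pooled model: one common intercept alpha and slope beta for all rats.
x = np.tile(days, weights.shape[0])   # stacked day covariate
y = weights.ravel()                   # stacked responses
beta, alpha = np.polyfit(x, y, 1)     # least-squares line fit

print(f"alpha = {alpha:.1f}, beta = {beta:.2f} grams/day")
```

Because every rat shares the same $(\alpha, \beta)$, any rat-to-rat variation in growth rate is absorbed into the error term $\epsilon_{ij}$.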

SLIDE 3

Example

  • Data: Number of failures and length of operation time of 10 power plant pumps (George, Makov, & Smith, 1993).

    Pump      1     2     3     4     5     6     7     8     9     10
    time      94.5  15.7  62.9  126   5.24  31.4  1.05  1.05  2.1   10.5
    failures  5     1     5     14    3     19    1     1     4     22

  • Model:

    $X_i \sim \mathrm{Poisson}(\lambda t_i)$, where $X_i$ is the number of failures of pump $i$, $\lambda$ is the failure rate, and $t_i$ is the length of operation time of pump $i$ (in 1000s of hours).

  • What is the assumption on the failure rates of the 10 power plant pumps in this model?
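Under the common-rate assumption, the maximum-likelihood estimate of $\lambda$ is total failures divided by total exposure. A short sketch with the slide's data, contrasting the pooled estimate with the per-pump rates:

```python
# Pump data from the slide: operation time (1000s of hours) and failure counts.
time = [94.5, 15.7, 62.9, 126, 5.24, 31.4, 1.05, 1.05, 2.1, 10.5]
failures = [5, 1, 5, 14, 3, 19, 1, 1, 4, 22]

# With X_i ~ Poisson(lambda * t_i) and a single common rate lambda,
# the MLE is total failures divided by total exposure.
lam_hat = sum(failures) / sum(time)
print(f"common rate MLE: {lam_hat:.3f} failures per 1000 hours")

# Per-pump empirical rates vary by a factor of about 40, which casts
# doubt on the single-rate assumption.
rates = [x / t for x, t in zip(failures, time)]
print(f"per-pump rates range from {min(rates):.3f} to {max(rates):.3f}")
```

The wide spread of the per-pump rates is exactly the problem the next slide raises.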

SLIDE 4

Possible problems with above approaches

  • A single (α, β) may be inadequate to fit all the rats. Likewise, a common failure rate λ for all the power plant pumps may not be suitable.

  • Separate, unrelated (αi, βi) for each rat, or λi for each pump, are likely to overfit the data. Some information about the parameters of one rat or one pump can be obtained from the others' data.

SLIDE 5

Motivation for hierarchical models

  • A natural idea is to assume that the (αi, βi)'s or λi's are samples from a common population distribution. A model in which the distribution of the observed outcomes is conditional on parameters that themselves have a probability specification is known as a hierarchical or multilevel model.

  • The new parameters introduced to govern the population distribution of the parameters are called hyperparameters.

  • Thus, we need to estimate the parameters governing the population distribution of the (αi, βi), rather than each (αi, βi) separately.

SLIDE 6

Bayesian approach to hierarchical models

  • Model specification
    • specify the sampling distribution of the data: p(y|θ)
    • specify the population distribution of θ: p(θ|φ), where φ is the hyperparameter
  • Bayesian estimation
    • specify the prior for the hyperparameter, p(φ). Many levels are possible; the hyperprior distribution at the highest level is often chosen to be non-informative.
    • combine with the model specification above: p(y|θ) and p(θ|φ)
    • find the joint posterior distribution of the parameter θ and the hyperparameter φ:
      $$p(\theta, \phi \mid y) \propto p(\theta, \phi)\, p(y \mid \theta, \phi) = p(\theta, \phi)\, p(y \mid \theta) \propto p(\phi)\, p(\theta \mid \phi)\, p(y \mid \theta)$$
    • point and credible-interval estimation for φ and θ
    • predictive distribution for $\tilde{y}$

SLIDE 7

Analytical derivation of conditional/marginal dist.

  • Write out the joint posterior distribution:
      $$p(\theta, \phi \mid y) \propto p(\phi)\, p(\theta \mid \phi)\, p(y \mid \theta)$$
  • Determine analytically the conditional posterior density of θ given φ: p(θ|φ, y)
  • Obtain the marginal posterior distribution of φ, either by integration,
      $$p(\phi \mid y) = \int p(\theta, \phi \mid y)\, d\theta,$$
    or via the identity
      $$p(\phi \mid y) = \frac{p(\theta, \phi \mid y)}{p(\theta \mid \phi, y)}.$$

SLIDE 8

Simulations from the posterior distributions

  • 1. Two steps to simulate a random draw from the joint posterior distribution of θ and φ, p(θ, φ|y):
    • draw φ from its marginal posterior distribution p(φ|y)
    • draw the parameter θ from its conditional posterior distribution p(θ|φ, y)
  • 2. If desired, draw predictive values $\tilde{y}$ from the posterior predictive distribution given the drawn θ.

SLIDE 9

Example: Rat tumors

  • Goal: Estimating the risk of tumor in a group of rats
  • Data (number of rats that developed a tumor / number of rats in the experiment):
  • 1. 70 historical experiments:

0/20 0/20 0/20 0/20 0/20 0/20 0/20 0/19 0/19 0/19 0/19 0/18 0/18 0/17 1/20 1/20 1/20 1/20 1/19 1/19 1/18 1/18 2/25 2/24 2/23 2/20 2/20 2/20 2/20 2/20 2/20 1/10 5/49 2/19 5/46 3/27 2/17 7/49 7/47 3/20 3/20 2/13 9/48 10/50 4/20 4/20 4/20 4/20 4/20 4/20 4/20 10/48 4/19 4/19 4/19 5/22 11/46 12/49 5/20 5/20 6/23 5/19 6/22 6/20 6/20 6/20 16/52 15/47 15/46 9/24

  • 2. Current experiment: 4/14

SLIDE 10

Bayesian approach to hierarchical models

  • Model specification
    • sampling distribution of the data: $y_j \sim \mathrm{Binomial}(n_j, \theta_j)$, $j = 1, 2, \ldots, 71$
    • population distribution of θ: $\theta_j \sim \mathrm{Beta}(\alpha, \beta)$, where α and β are the hyperparameters

  • Bayesian estimation
    • non-informative prior for the hyperparameters: p(α, β)
    • combine with the model specification above: p(θ|α, β)
    • find the joint posterior distribution of the parameters θ and the hyperparameters α and β:
      $$p(\theta, \alpha, \beta \mid y) \propto p(\alpha, \beta)\, p(\theta \mid \alpha, \beta)\, p(y \mid \theta, \alpha, \beta) \propto p(\alpha, \beta) \prod_{j=1}^{J} \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, \theta_j^{\alpha-1} (1-\theta_j)^{\beta-1} \prod_{j=1}^{J} \theta_j^{y_j} (1-\theta_j)^{n_j-y_j}$$

SLIDE 11

Analytical derivation of conditional/marginal dist.

  • the joint posterior distribution:
      $$p(\theta, \alpha, \beta \mid y) \propto p(\alpha, \beta) \prod_{j=1}^{J} \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, \theta_j^{\alpha-1} (1-\theta_j)^{\beta-1} \prod_{j=1}^{J} \theta_j^{y_j} (1-\theta_j)^{n_j-y_j}$$
  • the conditional posterior density of θ given α and β:
      $$p(\theta \mid \alpha, \beta, y) = \prod_{j=1}^{J} \frac{\Gamma(\alpha+\beta+n_j)}{\Gamma(\alpha+y_j)\,\Gamma(\beta+n_j-y_j)}\, \theta_j^{\alpha+y_j-1} (1-\theta_j)^{\beta+n_j-y_j-1}$$
  • the marginal posterior distribution of α and β:
      $$p(\alpha, \beta \mid y) = \frac{p(\theta, \alpha, \beta \mid y)}{p(\theta \mid \alpha, \beta, y)} \propto p(\alpha, \beta) \prod_{j=1}^{J} \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, \frac{\Gamma(\alpha+y_j)\,\Gamma(\beta+n_j-y_j)}{\Gamma(\alpha+\beta+n_j)}$$

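The marginal posterior $p(\alpha, \beta \mid y)$ is a product of gamma-function ratios, so it is evaluated on the log scale. A minimal sketch using the standard library's `lgamma`, with the $(\alpha+\beta)^{-5/2}$ hyperprior adopted on the next slide; the data here are a hypothetical small subset of the slide-9 experiments plus the current one (4/14), chosen only to keep the example short:

```python
from math import lgamma, log

def log_post_ab(a, b, y, n):
    """Unnormalized log marginal posterior log p(alpha, beta | y) for the
    beta-binomial hierarchy, using the hyperprior
    p(alpha, beta) proportional to (alpha + beta)^(-5/2)."""
    lp = -2.5 * log(a + b)                       # log hyperprior
    for yj, nj in zip(y, n):
        lp += (lgamma(a + b) - lgamma(a) - lgamma(b)
               + lgamma(a + yj) + lgamma(b + nj - yj)
               - lgamma(a + b + nj))
    return lp

# Hypothetical subset of the slide-9 experiments, plus the current 4/14.
y = [0, 1, 2, 5, 4, 10, 4, 4]
n = [20, 20, 25, 49, 20, 50, 19, 14]

# Hyperparameters whose implied mean alpha/(alpha+beta) is near the data's
# overall tumor rate (~0.14) score far better than ones implying a high rate.
print(log_post_ab(2.0, 14.0, y, n), log_post_ab(14.0, 2.0, y, n))
```

Evaluating this function over a grid of $(\log(\alpha/\beta), \log(\alpha+\beta))$ values gives the contour plot and grid approximation described on the following slides.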
SLIDE 12

Choice of hyperprior distribution

  • Idea: To set up a 'non-informative' hyperprior distribution
  • $p\left(\operatorname{logit}\left(\frac{\alpha}{\alpha+\beta}\right), \log(\alpha+\beta)\right) = p\left(\log\left(\frac{\alpha}{\beta}\right), \log(\alpha+\beta)\right) \propto 1$:
    NO GOOD, because it leads to an improper posterior.
  • $p\left(\frac{\alpha}{\alpha+\beta}, \alpha+\beta\right) \propto 1$ or $p(\alpha, \beta) \propto 1$:
    NO GOOD, because the posterior density is not integrable in the limit.
  • $p\left(\frac{\alpha}{\alpha+\beta}, (\alpha+\beta)^{-1/2}\right) \propto 1 \iff p(\alpha, \beta) \propto (\alpha+\beta)^{-5/2} \iff p\left(\log\left(\frac{\alpha}{\beta}\right), \log(\alpha+\beta)\right) \propto \alpha\beta\,(\alpha+\beta)^{-5/2}$:
    OK, because it leads to a proper posterior.

SLIDE 13

Computing marginal posterior of the hyperparameters

  • Compute the relative (unnormalized) posterior density on a grid of values that covers the effective range of (α, β):
    • $\left(\log\left(\frac{\alpha}{\beta}\right), \log(\alpha+\beta)\right) \in [-1, -2.5] \times [1.5, 3]$
    • $\left(\log\left(\frac{\alpha}{\beta}\right), \log(\alpha+\beta)\right) \in [-1.3, -2.3] \times [1, 5]$
  • Draw a contour plot of the marginal density of $\left(\log\left(\frac{\alpha}{\beta}\right), \log(\alpha+\beta)\right)$; contour lines are at 0.05, 0.15, ..., 0.95 times the density at the mode.
  • Normalize by approximating the posterior distribution as a step function over the grid and setting the total probability in the grid to 1.
  • Compute the posterior moments based on the grid of $\left(\log\left(\frac{\alpha}{\beta}\right), \log(\alpha+\beta)\right)$. For example, E(α|y) is estimated by
      $$\mathrm{E}(\alpha \mid y) \approx \sum_{\log(\alpha/\beta),\, \log(\alpha+\beta)} \alpha\; p\!\left(\log\left(\tfrac{\alpha}{\beta}\right), \log(\alpha+\beta) \,\Big|\, y\right)$$

SLIDE 14

Sampling from the joint posterior

  • 1. Simulate 1000 draws of $\left(\log\left(\frac{\alpha}{\beta}\right), \log(\alpha+\beta)\right)$ from their posterior distribution using the discrete-grid sampling procedure.
  • 2. For l = 1, ..., 1000:
    • transform the l-th draw of $\left(\log\left(\frac{\alpha}{\beta}\right), \log(\alpha+\beta)\right)$ to the scale of (α, β) to yield a draw of the hyperparameters from their marginal posterior distribution
    • for each j = 1, ..., J, sample θj from its conditional posterior distribution $\theta_j \mid \alpha, \beta, y \sim \mathrm{Beta}(\alpha+y_j, \beta+n_j-y_j)$
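The two steps above can be sketched as follows. The grid density used here is a toy placeholder (in the real analysis it comes from evaluating the slide-11 marginal posterior on the grid), so the numbers are illustrative only; the grid ranges are those from the previous slides:

```python
import numpy as np

rng = np.random.default_rng(0)

# Grid over the transformed hyperparameters.
u = np.linspace(-2.3, -1.3, 50)      # log(alpha/beta)
v = np.linspace(1.0, 5.0, 50)        # log(alpha+beta)
U, V = np.meshgrid(u, v, indexing="ij")

# Placeholder for the unnormalized marginal posterior on the grid; the
# real analysis evaluates p(alpha, beta | y) here instead.
dens = np.exp(-((U + 1.8) ** 2 + (V - 2.9) ** 2))
p = (dens / dens.sum()).ravel()      # step-function normalization

# Step 1: discrete-grid sampling of the hyperparameters, then transform
# back: alpha+beta = e^v and alpha/beta = e^u.
idx = rng.choice(p.size, size=1000, p=p)
log_ratio, log_total = U.ravel()[idx], V.ravel()[idx]
alpha = np.exp(log_total) / (1.0 + np.exp(-log_ratio))
beta = np.exp(log_total) / (1.0 + np.exp(log_ratio))

# Step 2: for one experiment (the current one, 4/14), sample theta_j
# from its Beta conditional posterior, once per hyperparameter draw.
y_j, n_j = 4, 14
theta = rng.beta(alpha + y_j, beta + n_j - y_j)
```

Each entry of `theta` is one posterior draw of the current experiment's tumor rate, with hyperparameter uncertainty propagated through.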

SLIDE 15

Displaying the results

  • Plot the posterior means and 95% intervals for the θj's (Figure 5.4 on page 131).
  • The rates θj are shrunk from their sample point estimates, yj/nj, towards the mean of the population distribution.
  • Experiments with few observations are shrunk more and have higher posterior variances.
  • Note that posterior variability is higher in the full Bayesian analysis, reflecting posterior uncertainty in the hyperparameters.

SLIDE 16

Hierarchical normal models (I)

  • Model specification
    • sampling distribution of the data: $y_{ij} \mid \theta_j \sim \mathrm{Normal}(\theta_j, \sigma^2)$, $i = 1, \ldots, n_j$, $j = 1, 2, \ldots, J$, with $\sigma^2$ known
    • population distribution of θ: $\theta_j \sim \mathrm{Normal}(\mu, \tau^2)$, where μ and τ are the hyperparameters. That is,
      $$p(\theta_1, \ldots, \theta_J \mid \mu, \tau) = \prod_{j=1}^{J} \mathrm{N}(\theta_j \mid \mu, \tau^2)$$
      $$p(\theta_1, \ldots, \theta_J) = \int \prod_{j=1}^{J} \left[\mathrm{N}(\theta_j \mid \mu, \tau^2)\right] p(\mu, \tau)\, d(\mu, \tau)$$

SLIDE 17

Hierarchical normal models (II)

  • Bayesian estimation
    • non-informative prior for the hyperparameters: p(μ, τ) = p(μ|τ) p(τ) ∝ p(τ)
    • combine with the model specification above: p(θ|μ, τ)
    • find the joint posterior distribution of the parameters θ and the hyperparameters μ and τ:
      $$p(\theta, \mu, \tau \mid y) \propto p(\mu, \tau)\, p(\theta \mid \mu, \tau)\, p(y \mid \theta) \propto p(\mu, \tau) \prod_{j=1}^{J} \mathrm{N}(\theta_j \mid \mu, \tau^2) \prod_{j=1}^{J} \mathrm{N}(\bar{y}_{.j} \mid \theta_j, \sigma^2/n_j)$$

SLIDE 18

Conditional posterior of θ given (µ, τ), p(θ|µ, τ, y)

  • Prior: $\theta_j \mid \mu, \tau \sim \mathrm{Normal}(\mu, \tau^2)$
  • Conditional posterior: $\theta_j \mid \mu, \tau, y \sim \mathrm{Normal}(\hat{\theta}_j, V_j)$, where
      $$\hat{\theta}_j = \frac{\frac{n_j}{\sigma^2}\,\bar{y}_{.j} + \frac{1}{\tau^2}\,\mu}{\frac{n_j}{\sigma^2} + \frac{1}{\tau^2}}, \qquad V_j = \frac{1}{\frac{n_j}{\sigma^2} + \frac{1}{\tau^2}}$$

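The conditional posterior mean is a precision-weighted average of the group mean and the population mean. A minimal sketch (the function name is ours, not from the slides):

```python
def theta_conditional(ybar_j, n_j, sigma2, mu, tau2):
    """Posterior mean and variance of theta_j given (mu, tau): a
    precision-weighted average of the group mean ybar_j and the
    population mean mu."""
    data_precision = n_j / sigma2     # precision of ybar_j around theta_j
    prior_precision = 1.0 / tau2      # precision of the population prior
    V_j = 1.0 / (data_precision + prior_precision)
    theta_hat = V_j * (data_precision * ybar_j + prior_precision * mu)
    return theta_hat, V_j
```

As $\tau^2 \to \infty$ the prior is ignored and $\hat{\theta}_j \to \bar{y}_{.j}$ (no pooling); as $\tau^2 \to 0$, $\hat{\theta}_j \to \mu$ (complete pooling).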
SLIDE 19

Marginal posterior of µ and τ, p(µ, τ|y)

$$p(\mu, \tau \mid y) \propto p(\mu, \tau)\, p(y \mid \mu, \tau)$$
Since $\bar{y}_{.j} \mid \mu, \tau \sim \mathrm{Normal}\!\left(\mu, \frac{\sigma^2}{n_j} + \tau^2\right)$, we have
$$p(\mu, \tau \mid y) \propto p(\mu, \tau) \prod_{j=1}^{J} \mathrm{N}\!\left(\bar{y}_{.j} \,\Big|\, \mu, \frac{\sigma^2}{n_j} + \tau^2\right)$$

SLIDE 20

Posterior of µ given τ, p(µ|τ, y)

$$p(\mu, \tau \mid y) = p(\mu \mid \tau, y)\, p(\tau \mid y) \;\Rightarrow\; p(\mu \mid \tau, y) = \frac{p(\mu, \tau \mid y)}{p(\tau \mid y)}$$
Therefore, $\mu \mid \tau, y \sim \mathrm{Normal}(\hat{\mu}, V_\mu)$, where
$$\hat{\mu} = \frac{\sum_{j=1}^{J} \frac{1}{\sigma^2/n_j + \tau^2}\, \bar{y}_{.j}}{\sum_{j=1}^{J} \frac{1}{\sigma^2/n_j + \tau^2}} \qquad \text{and} \qquad V_\mu^{-1} = \sum_{j=1}^{J} \frac{1}{\sigma^2/n_j + \tau^2}$$

SLIDE 21

Posterior distribution of τ, p(τ|y)

$$p(\tau \mid y) = \frac{p(\mu, \tau \mid y)}{p(\mu \mid \tau, y)} \propto \frac{p(\tau) \prod_{j=1}^{J} \mathrm{N}\!\left(\bar{y}_{.j} \mid \mu, \frac{\sigma^2}{n_j} + \tau^2\right)}{\mathrm{N}(\mu \mid \hat{\mu}, V_\mu)} \propto \frac{p(\tau) \prod_{j=1}^{J} \mathrm{N}\!\left(\bar{y}_{.j} \mid \hat{\mu}, \frac{\sigma^2}{n_j} + \tau^2\right)}{\mathrm{N}(\hat{\mu} \mid \hat{\mu}, V_\mu)} \propto p(\tau)\, V_\mu^{1/2} \prod_{j=1}^{J} \left(\frac{\sigma^2}{n_j} + \tau^2\right)^{-1/2} \exp\!\left(-\frac{(\bar{y}_{.j} - \hat{\mu})^2}{2\left(\frac{\sigma^2}{n_j} + \tau^2\right)}\right)$$

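This one-dimensional marginal is easy to evaluate on a grid of τ values. A minimal sketch, assuming a flat p(τ) and toy group means (the data below are hypothetical, not from the slides):

```python
import numpy as np

def log_p_tau(tau, ybar, n, sigma2):
    """Unnormalized log p(tau | y), assuming a flat p(tau)."""
    s2 = sigma2 / n + tau ** 2          # marginal variance of each ybar_j
    w = 1.0 / s2                        # precisions
    V_mu = 1.0 / w.sum()
    mu_hat = (w * ybar).sum() * V_mu    # precision-weighted mean
    return (0.5 * np.log(V_mu)
            - 0.5 * np.sum(np.log(s2))
            - 0.5 * np.sum((ybar - mu_hat) ** 2 / s2))

# Hypothetical group means with n_j = 5 observations each and sigma2 = 1.
ybar = np.array([1.2, 0.8, 1.9, 0.3])
n = np.array([5, 5, 5, 5])
taus = np.linspace(0.01, 5, 500)
logp = np.array([log_p_tau(t, ybar, n, 1.0) for t in taus])
p = np.exp(logp - logp.max())
p /= p.sum()                            # step-function normalization
```

Draws of τ from this grid, followed by μ from p(μ|τ, y) and the θj from their Normal conditionals, complete the simulation of the hierarchical normal model.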
SLIDE 22

Prior distribution of τ, p(τ)

$$p(\tau \mid y) = \frac{p(\mu, \tau \mid y)}{p(\mu \mid \tau, y)} \propto \frac{p(\tau) \prod_{j=1}^{J} \mathrm{N}\!\left(\bar{y}_{.j} \mid \mu, \frac{\sigma^2}{n_j} + \tau^2\right)}{\mathrm{N}(\mu \mid \hat{\mu}, V_\mu)} \propto \frac{p(\tau) \prod_{j=1}^{J} \mathrm{N}\!\left(\bar{y}_{.j} \mid \hat{\mu}, \frac{\sigma^2}{n_j} + \tau^2\right)}{\mathrm{N}(\hat{\mu} \mid \hat{\mu}, V_\mu)} \propto p(\tau)\, V_\mu^{1/2} \prod_{j=1}^{J} \left(\frac{\sigma^2}{n_j} + \tau^2\right)^{-1/2} \exp\!\left(-\frac{(\bar{y}_{.j} - \hat{\mu})^2}{2\left(\frac{\sigma^2}{n_j} + \tau^2\right)}\right)$$
