Case Study: Approximate Bayesian Inference for Latent Gaussian - PowerPoint PPT Presentation

Case Study: Approximate Bayesian Inference for Latent Gaussian Models by Using Integrated Nested Laplace Approximations BIOSTAT830: Graphical Models December 08, 2016

Introduction - INLA ◮ Inference for latent Gaussian Markov random field (GMRF) models, avoiding MCMC simulations ◮ Fast Bayesian inference using accurate, multiple types of approximations to ◮ pr ( θ | y ): marginal density for the model parameters ◮ pr ( x i | y ): marginal posterior densities for one (or more) latent variables. ◮ Can be used for model criticisms: 1. Fast cross-validation 2. Bayes factors and deviation information criterion (DIC) can be efficiently calculated for model comparisoins ◮ Software inla available from R; very easy to use

Supported Models ◮ Hierarchical GMRF of the form: likelihood : y j | η j , θ 1 ∼ pr ( y j | η j , θ 1 ) , j ∈ J , n f − 1 � w ki f k ( c ki ) + z ′ linear predictor : η i = Offset i + i β + ǫ i , k =0 i = 0 , . . . , n η − 1 . ◮ J ⊂ { 0 , 1 , . . . , n η − 1 } , i.e., not all latent η are observed through data y ◮ pr ( y j | η j , θ 1 ): likelihood of data; known link function ◮ ǫ = ( ǫ 0 , . . . , ǫ n η − 1 ) ′ | λ η ∼ N ( 0 , λ η I ); λ η denotes precision ◮ η = { η i } : a vector of linear predictors, e.g., in GLM ◮ w k : in the k -th nonlinear effect, the known weights, one for each observed data point

Supported Models (continued) ◮ Hierarchical GMRF of the form: likelihood : y j | η j , θ 1 ∼ pr ( y j | η j , θ 1 ) , j ∈ J , n f − 1 w ki f k ( c ki ) + z ′ � linear predictor : η i = Offset i + i β + ǫ i , k =0 i = 0 , . . . , n η − 1 . ◮ f k ( c ki ): nonlinear effect of covariate k for observation i 1. Nonlinear effects: time trends and seasonal effects, two dimensional surfaces, iid random intercepts, slopes and spatial random effects. 2. The unknown functions f k = ( f 0 k , . . . , f m k − 1 , k ) ′ are modelled as GMRF given some parameter θ f k : f k | θ f k ∼ N ( 0 , Q − 1 k ) ◮ z i : A vector of n β covariates assumed to have a linear effect ◮ β : The vector of unknown parameters

Supported Models (continued) ◮ x = ( η ′ , f ′ 0 , . . . , f ′ n f − 1 , β ′ ) ′ : full vector of latent variables; Dimension: n = n η + � n f − 1 j =0 m j + n β ; note we parameterized x by η instead of ǫ ◮ All the elements of vector x are defined as GMRFs: n η − 1 n f − 1 pr ( η i | f 0 , . . . , f n f − 1 , β , λ − 1 � � pr ( x | θ 2 ) = pr ( f k | θ f k ) η ) i =0 k =0 n β − 1 � × pr ( β m ) , m =0 where   n f − 1 iid � f k ( c ki ) + z ′  , β m η i | f 0 , . . . , f n f − 1 , β ∼ N ∼ N (0 , λ β ) i β , λ η  k =0 and θ 2 = { log λ η , θ f 0 , . . . , θ f nf − 1 } is a vector of unknown hyperparameters.

Prior ◮ Specify priors on the hyperparameters: θ 2 = { log λ η , θ f 0 , . . . , θ f nf − 1 , log λ β } ◮ Need not be Gaussian

Examples ◮ Time series model : c k = t for time, f k for nonlinear trends or seasonal effects η t = f trend ( t ) + f seasonal ( t ) + z ′ t β ◮ Generalized additive models (GAM) : 1. pr ( y i | η i , θ 1 ) belongs to an exponential family 2. c ki ’s are univariate, continuous covariates 3. f k ’s are smooth functions

Examples ◮ Generalized additive mixed models (GAMM) for longitudinal data ◮ Individuals: i = 0 , · · · , n i − 1, observed at time points t 0 , t 1 , . . . . A GAMM extends a GAM by introducing individual specific random effects: η it = f 0 ( c it 0 )+ . . . + f n f − 1 ( c it , n f − 1 )+ b oi w it 0 + . . . + b n b − 1 , i w it , n b − 1 , where η it is the linear predictor for individual i at time t , c itk , k = 0 , . . . , n f − 1 , w itq , q = 0 , . . . , n b − 1 are covariate values for individual i at time t , and b 0 i , . . . , b n b − 1 , i is a vector of n b individual-specific random intercepts (if w itq = 1) or slopes. ◮ Just define r = ( i , t ) and c kr = c kit for k = 0 , . . . , n f − 1 and c n f − 1+ q , r = w qit , f n f − 1+ q ( c ( n f − 1+ q ) , r ) = b qi w qit for q = 0 , . . . , n b .

Examples ◮ Geoadditive models (Kammann and Wand, 2003, JRSS-C): η i = f 1 ( c 0 i ) + . . . + f n f − 1 ( c n f − 1 , i ) + f spatial ( s i ) + z ′ i β , where s i indicates the location of observation i and f spatial is a spatially correlated effect.

Examples ◮ ANOVA-type interaction model : For the effect of two continuous covariates w and v : η i = f 1 ( w i ) + f 2 ( v i ) + f 1 , 2 ( w i , v i ) + . . . , where f 1 , f 2 are the main effects and f 1 , 2 is a two dimensional interaction surface. As a special case, we just define c 1 i = w i , c 2 i = v i and c 3 i = ( w i , v i ), ◮ Univariate stochastic volatility model ◮ Time series models with Gaussian likelihood where the variance (not the mean) of the observed data is part of the latent GMRF model: y i | η i ∼ N (0 , exp( η i )) , and, for example, model the latent field η as an autoregressive model of order 1.

Bayesian for Spatial and Spatio-temporal Models (Blangiardo and Cameletti, 2015, Wiley)

INLA for Spatial Area Data: Suicides in London ◮ Disease mapping is commonly used in small area studies to assess the pattern of a disease ◮ To identify areas characterized by unusually high or low relative risk (Lawson 2009)

London Cholera Outbreak in 1854 - John Snow’s Cholera map in dot style; dots represent deaths from cholera in London in 1854 to detect the source of the disease

Example: Suicide Mortality ◮ 32 Boroughs in London; 1989-1993 ◮ For the i -th area, the number of suicides y i : y i ∼ Poisson ( λ i ) , where λ i = ρ i E i , a product of rate ρ i and the expected number of suicides E i ◮ Linear predictor defined on logarithmic scale: η i = log( ρ i ) = α + v i + ν i , where α is the intercept, v i = f 1 ( i ) and ν i = f 2 ( i ) are two area specific effects.

Besag-York-Mollie (BYM) model (Besag et al. 1991) ◮ v i : spatially structured residual, modeled using an intrinsic conditional autoregressive structure (iCAR): v i | v j � = i ∼ Normal ( m i , s 2 i ) � j ∈N ( i ) v j m i = |N ( i ) | σ 2 s 2 v i = |N ( i ) | , where |N ( i ) | is the number of areas which share boundaries with the i -th one. ◮ ν i : unstructured residual; modeled by exchangeable prior: ν i ∼ Normal (0 , σ 2 )

Priors ◮ log τ ν ∼ log − Gamma (1 , 0 . 0005) ◮ log τ v ∼ log − Gamma (1 , 0 . 0005)

Goal 1. Posterior for borough-specific relative risks of suicides, compared to the whole of London: pr (exp( v i + ν i ) | y ) 2. Posterior exceedance probability: pr (exp( v i + ν i ) > 1 | y ) 3. Fraction of structured variance component

Incorporating Risk Factors ◮ Extension: when risk factors are available and the aim of the study is to evaluate their effect on the risk of death (or disease) ◮ Ecological regression model ◮ For example: Index of social deprivation ( x 1 ), index of social fragmentation (describing lack of social connections and of sense of community) ( x 2 ) ◮ Model: η i = α + β 1 x 1 i + β 2 x 2 i + v i + ν i ◮ Can be fitted using the R-INLA package

London Suicide Rates Mapping Figure 1: suicide_rates

Other Spatial Examples (FiveThirtyEight) Figure 2: Stehpen Curry

Background: Gaussian Markov Random Fields (GMRF) ◮ GMRF: x = ( x 1 , . . . , x n ) ′ with Markov property that for some i � = j , x i ⊥ x j | x − ij ◮ Can be encoded by precision matrix Q : Q ij = 0 if and only if x i ⊥ x j | x − ij ◮ Density function with mean vector µ : pr ( x ) = (2 π ) − n / 2 | Q | − 1 / 2 exp {− 1 2( x − µ ) ′ Q ( x − µ ) } ◮ Most cases: Q is sparse: only O ( n ) of the n 2 entries are nonzero ◮ Can handle extra linear constraints: Ax = e for a k × n matrix A of rank k ◮ Computational note : Simulation usually based on lower Cholesky decomposition Q = LL ′ , with L preserving the sparseness in Q . See Section 2.1 in Rue et al. (2009) for more details.

Background: Gaussian approximation (under regularity conditions) ◮ Find a Gaussian density q ( z ) to approximate a density � f ( z ) d z p ( z ) = 1 Z f ( z ), where Z = ◮ One-dimensional case ◮ Multi-dimensional case ◮ Need to find mode z 0 (Newton or quasi-Newton methods) ◮ Need not know the normalizing constant Z ◮ Central Limit Theorem, approximate becomes better as sample size n increases if f ( z ; Data ) is a posterior distribution of model parameters ◮ Typically better for marginal and conditional posteriors than joint posteriors (marginals are averages across other distributions!) ◮ Can use transformations (e.g., logit or log) to approximate a distribution over a constrained space ◮ Not so useful if there is skewness, or if interested in extreme values that are far from the mode

Background: Gaussian approximation - Density at Extreme Values (Bishop CM, 2006, Sec 4.4)

Case Study: Approximate Bayesian Inference for Latent Gaussian - PowerPoint PPT Presentation

Case Study: Approximate Bayesian Inference for Latent Gaussian Models by Using Integrated Nested Laplace Approximations BIOSTAT830: Graphical Models December 08, 2016 Introduction - INLA Inference for latent Gaussian Markov random field

Approximate Bayesian inference for latent Gaussian models avard Rue 1 H Department of

Inference in Bayesian networks Chapter 14.45 Chapter 14.45 1 Outline Exact inference

Bayesian networks: approximate inference Machine Intelligence Thomas D. Nielsen September 2008

CS440/ECE448 Lecture 15: Bayesian Inference and Bayesian Learning Slides by Svetlana Lazebnik,

Approximate Bayesian Computation Chris Drovandi, Charisse Farr October 24, 2012 Chris Drovandi,

Approximate Bayesian Computation Dr. Jarad Niemi STAT 615 - Iowa State University December 5,

Approximate inference: Sampling methods Probabilistic Graphical Models Sharif University of

Inference in Bayesian networks Chapter 14.45 Chapter 14.45 1 Outline Exact inference

Basics of Bayesian Inference A frequentist thinks of unknown parameters as fixed Basics of

Bayesian Networks Philipp Koehn 2 April 2020 Philipp Koehn Artificial Intelligence: Bayesian

Bayesian Networks Philipp Koehn 6 April 2017 Philipp Koehn Artificial Intelligence: Bayesian

Bayesian Networks Philipp Koehn 29 October 2015 Philipp Koehn Artificial Intelligence: Bayesian

Two Approximate- Programmability Birds, One Statistical- Inference Stone Adrian Sampson

Being Bayesian About Being Bayesian About Net work St ruct ure Net work St ruct ure A Bayesian

Outline Intro to RL and Bayesian Learning History of Bayesian RL Model-based Bayesian

Meta-Bayesian Analysis A Bayesian decision-theoretic analysis of Bayesian inference under model

Finding Latent Code Errors via Machine Learning over Program Executions Yuriy Brun Michael D.

Learning Latent Dynamics for Planning from Pixels Danijar Hafner, Timothy Lillicrap, Ian Fischer,

A Discriminative Latent Variable Model for Online Clustering Rajhans Samdani, Kai-Wei Chang , Dan

Poster #24 1 Applied AI Lab, Oxford Robotics Institute 2 Department of Statistics, University of

Part 3: Latent representations and unsupervised learning Dale Schuurmans University of Alberta

The semnova Package for Latent Repeated Measures ANOVA Benedikt Langenberg, RWTH Aachen

POIR 613: Computational Social Science Pablo Barber a School of International Relations

lavaan : an R package for structural equation modeling and more Yves Rosseel Department of Data

Sambuz

Useful Links

Newsletter

Mail Us