

  1. Evaluating predictive loss for models with observation-level latent variables
     Russell Millar, University of Auckland, Dec 2015. (Slide 1 / 20)

  2. Motivation
     So, you've fitted a Bayesian model... very likely more than one, and you'd like to know which is preferred. How?
     - Bayes factors? Computationally challenging; sensitive to priors.
     - Posterior predictive checks? Poor performance at detecting model deficiencies; not addressing the question directly.
     - Predictive loss? DIC, WAIC, cross-validation.

  3. Notation
     y = (y_1, ..., y_n), observations with density p(y)
     θ ∈ R^d, parameter vector
     p(y | θ), the likelihood
     p(θ), prior
     z, future realizations from the true distribution of y
     D(θ) = −2 log p(y | θ), the deviance function

  4. DIC, the Dirty Information Criterion
     Widely used: Spiegelhalter et al. (2002) has > 6,500 cites. DIC can be written as
         DIC = −2 E_{θ|y}[log p(y | θ)] + p = D̄(θ) + p,
     where p is a penalty term to correct for using the data twice. A Taylor series expansion of D(θ) around θ̄ = E_{θ|y}[θ] "suggests" that p can be estimated as the posterior expected value of D(θ) − D(θ̄), giving
         p_D = D̄(θ) − D(θ̄).
     - Easy to estimate from a posterior sample.
     - Not invariant to re-parameterization, due to the use of θ̄.
     - p_D can be negative if the deviance is not concave.
     - Never explicitly stated what DIC is trying to estimate!!!
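The slide's recipe "estimate p_D from a posterior sample" can be made concrete. A minimal Python sketch, using an invented N(μ, 1) toy model with known variance and simulated conjugate posterior draws (all names and data here are illustrative, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data from a N(mu, 1) model; mu is the only unknown.
y = rng.normal(2.0, 1.0, size=50)
n, S = len(y), 4000

# Under a flat prior, mu | y ~ N(ybar, 1/n); draw a mock "posterior sample".
mu_post = rng.normal(y.mean(), 1.0 / np.sqrt(n), size=S)

def deviance(mu):
    """D(theta) = -2 log p(y | theta) for the N(mu, 1) likelihood."""
    return -2.0 * np.sum(-0.5 * np.log(2 * np.pi) - 0.5 * (y - mu) ** 2)

D_bar = np.mean([deviance(m) for m in mu_post])  # posterior mean deviance
D_hat = deviance(mu_post.mean())                 # deviance at the posterior mean
p_D = D_bar - D_hat                              # estimated penalty
DIC = D_bar + p_D                                # equivalently D_hat + 2*p_D
```

With one free parameter, p_D should come out close to 1 here.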

  5. DIC, the Dirty Information Criterion
     Since
         D̄(θ) = E_{θ|y}[D(θ)] = −2 E_{θ|y}[log p(y | θ)],
     you might suspect that DIC is estimating the expected predictive deviance
         −2 E_z E_{θ|y}[log p(z | θ)].   (1)
     But it's not: it needs a heavier penalty for using y in place of z.[1] The extra-penalized form
         DIC* = D̄(θ) + 2 p_D
     is an asymptotically unbiased estimator of (1).
     [1] van der Linde (2005) & Ando (2011).

  6. WAIC, Widely Applicable Information Criterion
     Sumio Watanabe (2009) developed a singular learning theory using algebraic-geometry results of Heisuke Hironaka (who earned a Fields Medal in 1970 for this work). It is assumed that the p(y_i | θ) are independent.
     Watanabe defines several WAIC variants. One particular variant has gained popularity due to:
     - Its asymptotic equivalence with Bayesian leave-one-out cross-validation (LOO-CV), Watanabe (2010).
     - Its high degree of approximation to its target loss.

  7. WAIC, Widely Applicable Information Criterion
         WAIC = −2 Σ_{i=1}^n log p(y_i | y) + 2V
              = −2 Σ_{i=1}^n log ∫ p(y_i | θ) p(θ | y) dθ + 2V,
     where
         V = Σ_{i=1}^n Var_{θ|y}(log p(y_i | θ)).
     Watanabe showed that E_Y[WAIC] is an asymptotically unbiased estimator of E_Y[B], where
         B = −2 Σ_{i=1}^n E_{Z_i}[log p(z_i | y)] = −2 Σ_{i=1}^n E_{Z_i}[ log ∫ p(z_i | θ) p(θ | y) dθ ].
     This holds under very general conditions, including for non-identifiable, singular and unrealizable models.
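These two WAIC terms can be computed directly from an S × n matrix of pointwise log-likelihoods. A Python sketch using invented normal data and simulated posterior draws (illustrative assumptions, not the talk's code):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
y = rng.normal(0.0, 1.0, size=40)
n, S = len(y), 4000
mu_post = rng.normal(y.mean(), 1.0 / np.sqrt(n), size=S)  # posterior draws for mu

# S x n matrix of pointwise log-likelihoods log p(y_i | theta^(s))
loglik = norm.logpdf(y[None, :], loc=mu_post[:, None], scale=1.0)

# First term: log of the posterior-averaged density for each observation
lppd_i = np.log(np.mean(np.exp(loglik), axis=0))

# Penalty: sum over i of the posterior variance of log p(y_i | theta)
V = np.sum(np.var(loglik, axis=0, ddof=1))

WAIC = -2 * np.sum(lppd_i) + 2 * V
```

In practice the inner average is computed on the log scale (log-sum-exp) for numerical stability; the plain mean is kept here for clarity.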

  8. LOO-CVL, Leave-one-out cross-validation
     Letting y_{−i} denote the observations with y_i removed, a natural approximation for B is the LOO-CVL estimator
         CVL = Σ_{i=1}^n CVL_i,
     where
         CVL_i = −2 log p(y_i | y_{−i}) = −2 log ∫ p(y_i | θ) p(θ | y_{−i}) dθ.   (2)
     CVL has asymptotic bias of O(1/n) as an estimator of B. But direct estimation of CVL can be very computationally intensive, since it requires samples from the n posteriors p(θ | y_{−i}), i = 1, ..., n. This direct estimator will be denoted ĈVL.
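For conjugate models the leave-one-out predictive density p(y_i | y_{−i}) is available in closed form, which sidesteps the n refits. A Python sketch for the N(μ, 1), flat-prior case (a toy example of my own, not from the talk): with μ | y_{−i} ~ N(mean(y_{−i}), 1/(n−1)), the predictive for y_i is N(mean(y_{−i}), 1 + 1/(n−1)).

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
y = rng.normal(0.0, 1.0, size=30)
n = len(y)

CVL = 0.0
for i in range(n):
    y_minus = np.delete(y, i)                    # observations with y_i removed
    pred_sd = np.sqrt(1.0 + 1.0 / (n - 1))       # predictive sd under flat prior
    # CVL_i = -2 log p(y_i | y_{-i}), equation (2), in closed form
    CVL += -2.0 * norm.logpdf(y[i], loc=y_minus.mean(), scale=pred_sd)
```

For non-conjugate models each term would instead require its own MCMC run on y_{−i}, which is exactly the cost that motivates the importance-sampling shortcut on the next slide.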

  9. Importance sampling approximation to LOO-CVL
     p(y_i | y_{−i}) can be expressed as the harmonic mean of p(y_i | θ) with respect to the full posterior,
         p(y_i | y_{−i}) = ( ∫ (1 / p(y_i | θ)) p(θ | y) dθ )^{−1},
     and so p(y_i | y_{−i}) can be estimated as
         p̂(y_i | y_{−i}) = S / Σ_{s=1}^S (1 / p(y_i | θ^{(s)})),   (3)
     where θ^{(s)}, s = 1, ..., S, is a sample from p(θ | y). Thus each CVL_i, i = 1, ..., n, and hence CVL = Σ_{i=1}^n CVL_i, can be estimated from a single posterior sample.
     Note that (3) can also be written as a self-normalized importance-sampling estimator,
         p̂(y_i | y_{−i}) = Σ_{s=1}^S p(y_i | θ^{(s)}) w_{si} / Σ_{s=1}^S w_{si},   (4)
     where w_{si} = p(y_i | θ^{(s)})^{−1}. The importance-sampling estimator of CVL will be denoted ÎSCVL.
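A Python sketch of the harmonic-mean estimator (3), again on an invented N(μ, 1) toy model; it also checks numerically that the self-normalized form (4) coincides with (3) when the weights are w_si = 1/p(y_i | θ^(s)):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
y = rng.normal(0.0, 1.0, size=30)
n, S = len(y), 5000
mu_post = rng.normal(y.mean(), 1.0 / np.sqrt(n), size=S)  # draws from p(theta | y)

# S x n matrix of pointwise likelihoods p(y_i | theta^(s))
lik = norm.pdf(y[None, :], loc=mu_post[:, None], scale=1.0)

# Harmonic-mean estimate (3): S / sum_s 1/p(y_i | theta^(s))
p_loo = S / np.sum(1.0 / lik, axis=0)

# Self-normalized IS form (4) with w_si = 1/p(y_i | theta^(s))
w = 1.0 / lik
p_loo_sn = np.sum(lik * w, axis=0) / np.sum(w, axis=0)

ISCVL = -2.0 * np.sum(np.log(p_loo))
```

Since lik * w is identically 1, the numerator of (4) is S and the two forms agree exactly, as the test confirms.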

  10. Importance sampling approximation to LOO-CVL
     Note that p̂(y_i | y_{−i}) can be highly unstable when θ^{(s)} is in the tails of p(y_i | θ^{(s)}).
     It is very useful to quantify the reliability of importance sampling using the notion of effective sample size. For evaluating CVL_i using (2), the effective sample size is with respect to a sample from p(θ | y_{−i}). For observation i, ESS_i can be calculated as
         ESS_i = S w̄_i² / w²̄_i,
     where w_{si} = p(y_i | θ^{(s)})^{−1}, w̄_i is the mean of the weights w_{si}, s = 1, ..., S, and w²̄_i is the mean of the squared weights w²_{si}, s = 1, ..., S.
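The ESS formula is a one-liner given the weight matrix. A Python sketch on the same kind of invented toy model as above; by construction each ESS_i lies between 1 (one weight dominates) and S (all weights equal):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
y = rng.normal(0.0, 1.0, size=30)
S = 5000
mu_post = rng.normal(y.mean(), 1.0 / np.sqrt(len(y)), size=S)

lik = norm.pdf(y[None, :], loc=mu_post[:, None], scale=1.0)
w = 1.0 / lik                         # w_si = p(y_i | theta^(s))^{-1}

# ESS_i = S * (mean of w_si)^2 / (mean of w_si^2), one value per observation
ESS = S * w.mean(axis=0) ** 2 / (w ** 2).mean(axis=0)
```

Observations in the tails of their own likelihood produce heavy-tailed weights and hence small ESS_i, flagging the unstable terms of ÎSCVL.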

  11. Evaluation of predictive loss
     Recent work has examined the relative performance of WAIC, CVL and IS-CVL in the context of normal models. I have been examining their performance with regard to:
     - Model focus (i.e., the level of the hierarchy at which the likelihood is specified).
     - Use with non-normal data.
     Models for over-dispersed count data incorporate both of these issues. E.g., the negative binomial density can be expressed directly (marginal focus), or as a Poisson density conditional on an underlying gamma latent variable (conditional focus).
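The marginal/conditional equivalence can be checked numerically: with λ ~ Γ(α, rate α/μ), the Poisson-gamma mixture is exactly negative binomial with size α and mean μ. A Python sketch with illustrative parameter values (α = 2, μ = 3, and the count 4 are arbitrary choices of mine):

```python
import numpy as np
from scipy.stats import nbinom, poisson, gamma
from scipy.integrate import quad

alpha, mu = 2.0, 3.0   # illustrative over-dispersion and mean parameters
k = 4                  # an arbitrary count at which to compare the densities

# Marginal focus: negative binomial directly.
# scipy's (n, p) parametrization gives mean mu via p = alpha / (alpha + mu).
marg = nbinom.pmf(k, n=alpha, p=alpha / (alpha + mu))

# Conditional focus: Poisson(lambda) mixed over lambda ~ Gamma(alpha, rate alpha/mu)
mix, _ = quad(
    lambda lam: poisson.pmf(k, lam) * gamma.pdf(lam, a=alpha, scale=mu / alpha),
    0, np.inf,
)
```

The two densities agree, but the two foci define different likelihoods, and hence different values of WAIC and CVL, which is the point of the comparison.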

  12. Evaluation of predictive loss, y ∼ Pois(λ)
     [Figure: expected loss curves E_y[B] and E_y[WAIC] plotted against λ₀.]
     The WAIC approximation is not so good until the normal approximation (to the Poisson) kicks in, at around λ₀ = 5.

  13. Evaluation of predictive loss, y ∼ Pois(λ)
     FYI, the underlying R code to numerically evaluate B for y ∼ Pois(λ₀):

     BayesLoss = function(y, lambda0, alpha=0.001, beta=0.001) {
       yrep_limits = qpois(c(1e-15, 1-1e-15), lambda0)
       yrep_grid = seq(yrep_limits[1], yrep_limits[2])  # Grid of values for reps
       grid_probs = dpois(yrep_grid, lambda0)           # Probabilities over the grid
       grid_pd = dnbinom(yrep_grid, size=y+alpha, mu=(y+alpha)/(beta+1))  # Pred density
       BLoss = -2*sum(grid_probs*log(grid_pd))  # Predictive loss, B, for a given y
       return(BLoss)
     }

  14. Simulation study with over-dispersed count data
     How well can the predictive criteria distinguish the following three models?
     - Poisson: y_i | μ ∼ Pois(μ)
     - PGA: y_i | λ_i ∼ Pois(λ_i), where λ_i ∼ Γ(α, α/μ)
     - PLN: y_i | λ_i ∼ Pois(λ_i), where λ_i ∼ LN(log(μ) − 0.5τ², τ²)
     These are conditional-level specifications.
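Simulating from these three conditional specifications is straightforward. A Python sketch with illustrative parameter values (μ = 3, α = 2, τ = 0.7 are my own choices); the lognormal mean is shifted by −0.5τ² so that E[λ_i] = μ in the PLN model:

```python
import numpy as np

rng = np.random.default_rng(7)
mu, alpha, tau, n = 3.0, 2.0, 0.7, 10_000   # illustrative parameter values

# Poisson: no over-dispersion
y_pois = rng.poisson(mu, size=n)

# PGA: Poisson-gamma (conditional specification of the negative binomial)
lam_g = rng.gamma(shape=alpha, scale=mu / alpha, size=n)  # mean mu, rate alpha/mu
y_pga = rng.poisson(lam_g)

# PLN: Poisson-lognormal, mean-corrected so E[lambda_i] = mu
lam_ln = rng.lognormal(mean=np.log(mu) - 0.5 * tau**2, sigma=tau, size=n)
y_pln = rng.poisson(lam_ln)
```

All three samples share mean μ, but PGA and PLN are over-dispersed relative to the Poisson, which is what the predictive criteria are asked to detect.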
