

slide-1
SLIDE 1

Evaluating predictive loss for models with observation-level latent variables

Russell Millar University of Auckland Dec 2015

Russell Millar University of Auckland Predictive loss Dec 2015 1 / 20

slide-2
SLIDE 2

Motivation

So, you’ve fitted a Bayesian model... very likely more than one, and you’d like to know which is preferred. How? Bayes factors?

◮ Computationally challenging.
◮ Sensitive to priors.

Posterior predictive checks?

◮ Poor performance at detecting model deficiencies.
◮ Does not address the question directly.

Predictive loss?

◮ DIC
◮ WAIC
◮ Cross-validation

slide-3
SLIDE 3

Notation

y = (y1, ..., yn), observations with density p(y)
θ ∈ ℝ^d, parameter vector
p(y|θ), the likelihood
p(θ), prior
z, future realizations from the true distribution of y
D(θ) = −2 log p(y|θ), the deviance function


slide-4
SLIDE 4

DIC, the Dirty Information Criterion

Widely used: Spiegelhalter et al. (2002) has > 6 500 cites. DIC can be written as

DIC = −2Eθ|y[log p(y|θ)] + p = D̄(θ) + p ,

where p is a penalty term to correct for using the data twice. A Taylor series expansion of D(θ) around θ̄ = Eθ|y[θ] “suggests” that p can be estimated as the posterior expected value of D(θ) − D(θ̄), giving

pD = D̄(θ) − D(θ̄) .

◮ Easy to estimate from a posterior sample.
◮ Not invariant to re-parameterization, due to the use of θ̄.
◮ pD can be negative if the deviance is not concave.
◮ It is never explicitly stated what DIC is trying to estimate!!!
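As a minimal sketch (in Python rather than the deck's R; the function and argument names are illustrative, not from the talk), pD and DIC can be computed from posterior draws of the deviance, along with the extra-penalized DIC∗ discussed later in the talk:

```python
import numpy as np

def dic(deviance_draws, dev_at_postmean):
    # deviance_draws: D(theta^(s)) = -2 log p(y | theta^(s)) over posterior draws s.
    # dev_at_postmean: D(theta_bar), the deviance at the posterior mean of theta.
    Dbar = float(np.mean(deviance_draws))   # posterior mean deviance, D-bar
    pD = Dbar - dev_at_postmean             # pD = D-bar - D(theta_bar)
    return {"pD": pD,
            "DIC": Dbar + pD,               # = D(theta_bar) + 2*pD
            "DICstar": Dbar + 2.0 * pD}     # extra-penalized form
```

Note that a negative pD (non-concave deviance) shows up immediately here as Dbar < dev_at_postmean.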



slide-6
SLIDE 6

DIC, the Dirty Information Criterion

Since D̄(θ) = Eθ|y[D(θ)] = −2Eθ|y[log p(y|θ)], you might suspect that DIC is estimating the expected predictive deviance

−2 Ez Eθ|y[log p(z|θ)] . (1)

But it is not: it needs a heavier penalty for using y in place of z.¹ The extra-penalized form

DIC∗ = D̄(θ) + 2pD

is an asymptotically unbiased estimator of (1).

¹van der Linde (2005) & Ando (2011).


slide-8
SLIDE 8

WAIC, Widely Applicable Information Criteria

Sumio Watanabe (2009) developed a singular learning theory, derived using algebraic geometry results developed by Heisuke Hironaka (who earned a Fields Medal in 1970 for his work). It is assumed that the p(yi|θ) are independent. Watanabe defines several WAIC variants. One particular variant has gained popularity due to:
◮ Its asymptotic equivalence with Bayesian leave-one-out cross-validation (LOO-CV), Watanabe (2010).
◮ Its high degree of approximation to its target loss.


slide-9
SLIDE 9

WAIC, Widely Applicable Information Criteria

WAIC = −2 ∑_{i=1}^n log p(yi|y) + 2V = −2 ∑_{i=1}^n log ∫ p(yi|θ) p(θ|y) dθ + 2V ,

where

V = ∑_{i=1}^n Varθ|y(log p(yi|θ)) .

Watanabe showed that EY[WAIC] is an asymptotically unbiased estimator of EY(B), where

B = −2 ∑_{i=1}^n EZi[log p(zi|y)] = −2 ∑_{i=1}^n EZi[ log ∫ p(zi|θ) p(θ|y) dθ ] .

This holds under very general conditions, including for non-identifiable, singular and unrealizable models.
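Given a matrix of pointwise log-likelihood values log p(yi|θ(s)) saved from an MCMC run, this WAIC variant can be sketched in Python (illustrative names, not the talk's code; the first term uses a log-sum-exp for stability):

```python
import numpy as np

def waic(loglik):
    # loglik: (S, n) matrix of log p(y_i | theta^(s)) over S posterior draws.
    # lppd_i = log( mean_s p(y_i | theta^(s)) ), computed via log-sum-exp.
    m = loglik.max(axis=0)
    lppd = m + np.log(np.mean(np.exp(loglik - m), axis=0))
    V = loglik.var(axis=0, ddof=1)   # V_i = Var_{theta|y}(log p(y_i | theta))
    return -2.0 * lppd.sum() + 2.0 * V.sum()
```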



slide-11
SLIDE 11

LOO-CVL, Leave-one-out Cross-validation

Letting y−i denote the observations with yi removed, a natural approximation for B is the LOO-CVL estimator

CVL = ∑_{i=1}^n CVLi , where CVLi = −2 log p(yi|y−i) = −2 log ∫ p(yi|θ) p(θ|y−i) dθ . (2)

CVL has asymptotic bias of O(1/n) as an estimator of B. But direct estimation of CVL can be very computationally intensive since it requires samples from the n posteriors p(θ|y−i), i = 1, ..., n. This direct estimator will be denoted CVL.
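The direct estimator can be sketched as the following Python loop, where sample_posterior and loglik are hypothetical user-supplied helpers (a refit routine and a pointwise likelihood; neither is from the talk). The n refits are exactly what makes this expensive:

```python
import numpy as np

def direct_cvl(y, sample_posterior, loglik):
    # sample_posterior(y_subset) -> posterior draws given those data (refit).
    # loglik(yi, draws)          -> (S,) array of log p(yi | theta^(s)).
    total = 0.0
    for i in range(len(y)):
        draws = sample_posterior(np.delete(y, i))   # refit without y_i
        ll = loglik(y[i], draws)
        m = ll.max()                                # stable -2 log p(y_i | y_-i)
        total += -2.0 * (m + np.log(np.mean(np.exp(ll - m))))
    return total
```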


slide-12
SLIDE 12

Importance sampling approximation to LOO-CVL

p(yi|y−i) can be expressed as the harmonic mean of p(yi|θ) with respect to the full posterior,

p(yi|y−i) = [ ∫ (1/p(yi|θ)) p(θ|y) dθ ]^−1 ,

and so p(yi|y−i) can be estimated as

p̂(yi|y−i) = S / ∑_{s=1}^S 1/p(yi|θ(s)) , (3)

where θ(s), s = 1, ..., S, is a sample from p(θ|y). Thus each CVLi, i = 1, ..., n, and hence CVL = ∑_{i=1}^n CVLi, can be estimated from a single posterior sample.

Note that (3) can also be written as a self-normalizing importance-sampling estimator,

p̂(yi|y−i) = ∑_{s=1}^S p(yi|θ(s)) wsi / ∑_{s=1}^S wsi , (4)

where wsi = p(yi|θ(s))^−1. The importance-sampling estimator of CVL will be denoted ISCVL.
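Using the same (S, n) matrix of pointwise log-likelihoods as WAIC, the harmonic-mean estimator (3) gives ISCVL from a single posterior sample. A Python sketch (illustrative names; the log-sum-exp over −loglik keeps the reciprocal likelihoods from overflowing):

```python
import numpy as np

def iscvl(loglik):
    # loglik: (S, n) matrix of log p(y_i | theta^(s)) from ONE posterior sample.
    # Harmonic-mean estimate (3): p_hat(y_i|y_-i) = S / sum_s 1/p(y_i|theta^(s)).
    S = loglik.shape[0]
    m = (-loglik).max(axis=0)
    log_phat = np.log(S) - (m + np.log(np.sum(np.exp(-loglik - m), axis=0)))
    return -2.0 * log_phat.sum()    # ISCVL estimate of CVL
```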


slide-13
SLIDE 13

Importance sampling approximation to LOO-CVL

Note that p̂(yi|y−i) can be highly unstable when yi is in the tails of p(yi|θ(s)) for some draws θ(s). It is very useful to quantify the reliability of importance sampling using the notion of effective sample size. Here the effective sample size is with respect to a sample from p(θ|y−i) for evaluating CVLi using (2). For observation i, ESSi can be calculated as

ESSi = S (w̄i)² / (w̄²)i ,

where wsi = p(yi|θ(s))^−1, w̄i is the mean of the weights wsi, s = 1, ..., S, and (w̄²)i is the mean of the squared weights w²si, s = 1, ..., S.
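A Python sketch of the per-observation ESS computation from the same log-likelihood matrix (illustrative, not the talk's code):

```python
import numpy as np

def loo_ess(loglik):
    # loglik: (S, n) matrix of log p(y_i | theta^(s)).
    # Importance weights w_si = 1/p(y_i | theta^(s)); then
    # ESS_i = (sum_s w_si)^2 / sum_s w_si^2  ( = S * wbar_i^2 / mean(w_si^2) ).
    w = np.exp(-loglik)
    return w.sum(axis=0) ** 2 / (w ** 2).sum(axis=0)
```

When the weights are roughly equal, ESSi is close to S; a few dominant weights drive it toward 1, flagging an unreliable p̂(yi|y−i).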



slide-15
SLIDE 15

Evaluation of predictive loss

Recent work has examined the relative performance of WAIC, CVL and ISCVL in the context of normal models. I have been examining their performance with regard to:
◮ Model focus (i.e., the level of the hierarchy at which the likelihood is specified).
◮ Use with non-normal data.
Models for over-dispersed count data incorporate both of these issues. E.g., the negative binomial density can be expressed directly (marginal focus), or as a Poisson density conditional on an underlying gamma latent variable (conditional focus).


slide-16
SLIDE 16

Evaluation of predictive loss, y ∼ Pois(λ)

[Figure: expected loss plotted against λ0, comparing Ey[B] and Ey[WAIC].]

The WAIC approximation is not so good until the normal approximation (to the Poisson) kicks in, at around λ0 = 5.


slide-17
SLIDE 17

Evaluation of predictive loss, y ∼ Pois(λ)

FYI, the underlying R code to numerically evaluate B for y ∼ Pois(λ0).

BayesLoss = function(y, lambda0, alpha=0.001, beta=0.001) {
  yrep_limits = qpois(c(1e-15, 1 - 1e-15), lambda0)
  yrep_grid = seq(yrep_limits[1], yrep_limits[2])   # Grid of values for reps
  grid_probs = dpois(yrep_grid, lambda0)            # Probabilities over the grid
  grid_pd = dnbinom(yrep_grid, size=y+alpha, mu=(y+alpha)/(beta+1))  # Pred density
  BLoss = -2 * sum(grid_probs * log(grid_pd))       # Predictive loss, B, for a given y
  return(BLoss)
}



slide-20
SLIDE 20

Simulation study with over-dispersed count data

How well can the predictive criteria distinguish the following three models?
◮ Poisson: yi|µ ∼ Pois(µ)
◮ PGA: yi|λi ∼ Pois(λi), where λi ∼ Γ(α, α/µ)
◮ PLN: yi|λi ∼ Pois(λi), where λi ∼ LN(log(µ) − 0.5τ², τ²)
These are conditional-level specifications. For the PLN the marginal-level likelihood is

p(yi|µ, τ) = ∫ [ e^{−λi} λi^{yi} / yi! ] × [ e^{−(log λi − ν)²/(2τ²)} / (√(2π) τ λi) ] dλi ,

where ν = log(µ) − 0.5τ². ...or just dpoilog(y[i], nu, tau) in R.
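For readers without R's poilog package, the PLN marginal pmf can be approximated by quadrature over u = log λ (the integrand is then a Poisson term weighted by a normal density). A Python sketch with an illustrative name, dpoilog_py, which is not part of any library:

```python
import numpy as np
from math import lgamma

def dpoilog_py(y, nu, tau, ngrid=4000):
    # Poisson-lognormal pmf p(y | nu, tau): integrate Pois(y | e^u) * N(u | nu, tau^2)
    # over a wide grid in u = log(lambda), using a simple Riemann sum.
    u = np.linspace(nu - 8.0 * tau, nu + 8.0 * tau, ngrid)
    du = u[1] - u[0]
    log_pois = y * u - np.exp(u) - lgamma(y + 1)   # log Pois(y | e^u), stable in logs
    normal = np.exp(-(u - nu) ** 2 / (2 * tau ** 2)) / (np.sqrt(2 * np.pi) * tau)
    return float(np.sum(np.exp(log_pois) * normal) * du)
```

With the simulation's settings (µ = 1, τ = 1.5, so ν = log(1) − 0.5τ²) the pmf sums to one and has mean µ, up to truncation and grid error.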


slide-21
SLIDE 21

Simulation study with over-dispersed count data

The simulation generated yi, i = 1, ..., 160 from each of the three models (using µ = 1 and τ = 1.5), and fitted each of the three models to these data.

WAICc and ISCVLc denote the predicted losses estimated using the conditional-level likelihood; WAICm and ISCVLm denote the marginal-level versions. It can be shown that:
◮ CVLc and CVLm are identical, and are valid approximations to Bm.
◮ WAICm is a valid approximation to Bm.
◮ WAICc may, or may not, be a valid approximation to Bc.


slide-22
SLIDE 22

Simulation study: Conditional-level comparison

True model   Criterion      Fitted model              Propn minimum
                            P      PGA    PLN         P      PGA    PLN
P            ISCVLc         419.1  419.6  419.5       0.83   0.10   0.07
             WAICc          419.1  419.0  419.1       0.60   0.28   0.12
             min ESS        4612   207    1359
PGA          ISCVLc         731.0  272.8  291.2       0.00   0.99   0.01
             WAICc          730.9  219.4  240.1       0.00   1.00   0.00
             min ESS        188    2      2
PLN          ISCVLc         643.5  374.5  377.4       0.00   0.66   0.34
             WAICc          644.2  319.0  333.5       0.00   1.00   0.00
             min ESS        23     2      2

Table: Mean values (over 100 simulations) of ISCVL and WAIC, and hierarchical means of minimum ESS, from fitting Poisson (P), Poisson-gamma (PGA) and Poisson-lognormal (PLN) models to simulated data. The posterior sample size was 5 000.


slide-23
SLIDE 23

Simulation study: Marginal-level comparison

True model   Criterion      Fitted model              Propn minimum
                            P      PGA    PLN         P      PGA    PLN
P            ISCVLm         419.1  419.6  419.6       0.87   0.06   0.07
             WAICm          419.1  419.6  419.6       0.87   0.06   0.07
             min ESS        4612   4439   4424
PGA          ISCVLm         731.0  345.9  351.2       0.00   0.94   0.06
             WAICm          730.9  345.9  351.2       0.00   0.94   0.06
             min ESS        188    1070   4166
PLN          ISCVLm         643.5  412.8  406.6       0.00   0.20   0.80
             WAICm          644.2  412.6  406.5       0.00   0.20   0.80
             min ESS        23     40     952

Table: Mean values (over 100 simulations) of ISCVL and WAIC, and hierarchical means of minimum ESS, from fitting Poisson (P), Poisson-gamma (PGA) and Poisson-lognormal (PLN) models to simulated data. The posterior sample size was 5 000.


slide-24
SLIDE 24

Application to counts of goatfish


slide-25
SLIDE 25

Application to counts of goatfish

             Criterion      Fitted model              ∆
                            P      PGA    PLN
Conditional  CVLc           482.1  349.7  355.1       5.4
             ISCVLc         479.8  319.9  328.7       8.8
             WAICc          477.5  273.9  286.0       12.1
             min ESS        14.3   4.3    1.5
Marginal     CVLm           482.1  349.7  355.1       5.4
             ISCVLm         479.8  349.6  355.1       5.5
             WAICm          477.5  348.2  354.5       6.3
             min ESS        14.3   189.7  2108.6

Table: CVL, ISCVL, WAIC and minimum effective sample size from fitting Poisson (P), Poisson-gamma (PGA) and Poisson-lognormal (PLN) models to goatfish count data. ∆ gives the difference between the PGA and PLN losses. The posterior sample size was 10 000.


slide-26
SLIDE 26

Summary: Take home advice

◮ Use the marginal-level likelihood where possible (it has fatter tails than the conditional-level likelihood). Here, CVLc was reliable at the conditional level.
◮ Be sure to check the effective sample size when using ISCVL (an ESS in the hundreds appeared to be enough).
◮ Regularized forms of ISCVL were examined, but did not provide any improvement.
◮ It is a good idea to evaluate both ISCVL and WAIC, and hope that they differ little (since they are different approximations to the same thing).
◮ WAIC can be unreliable if Varθ|y(log p(yi|θ)) > 1 for any i: this corresponds to a high-influence point, and the underlying WAIC approximation to B is liable to be inaccurate.
