
A Kullback-Leibler Divergence for Bayesian Model Comparison with Applications to Diabetes Studies

Chen-Pin Wang, UTHSCSA
Malay Ghosh, U. Florida

Lehmann Symposium, May 9, 2011

Background

• KLD: the expected (with respect to the reference model) logarithm of the ratio of the probability density functions (p.d.f.'s) of two models:

\[
\int r(t_n \mid \theta) \, \log \frac{r(t_n \mid \theta)}{f(t_n \mid \theta)} \, dt_n
\]

• KLD: a measure of the discrepancy of information about $\theta$ contained in the data revealed by two competing models (K-L; Lindley; Bernardo; Akaike; Schwarz; Goutis and Robert).
• Challenge in the Bayesian framework:
  – identify priors that are compatible under the competing models;
  – ensure that the resulting integrated likelihoods are proper.
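As a minimal illustration of the KLD integral above, the following sketch approximates it with the trapezoidal rule and compares the result with the closed form for two normal densities. The normal densities, the integration limits, and the step count are assumptions chosen for illustration; they are not taken from the slides.

```python
import math

def normal_pdf(t, mean, sd):
    """Density of N(mean, sd^2) at t."""
    z = (t - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def kld_numeric(r_pdf, f_pdf, lo, hi, steps=20000):
    """Trapezoidal approximation of the integral of r(t) * log(r(t)/f(t)) over [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        t = lo + i * h
        r, f = r_pdf(t), f_pdf(t)
        if r > 0.0 and f > 0.0:  # skip points where either density underflows to zero
            weight = 0.5 if i in (0, steps) else 1.0
            total += weight * r * math.log(r / f)
    return total * h

def kld_normal_closed_form(mu_r, sd_r, mu_f, sd_f):
    """KLD from N(mu_r, sd_r^2) (reference) to N(mu_f, sd_f^2) (fitted)."""
    return math.log(sd_f / sd_r) + (sd_r ** 2 + (mu_r - mu_f) ** 2) / (2.0 * sd_f ** 2) - 0.5

# Hypothetical reference and fitted densities.
r = lambda t: normal_pdf(t, 0.0, 1.0)
f = lambda t: normal_pdf(t, 0.5, 1.5)

kld_rf = kld_numeric(r, f, -12.0, 12.0)  # reference model r
kld_fr = kld_numeric(f, r, -12.0, 12.0)  # roles swapped
print(kld_rf, kld_fr)
```

The two values differ, which is why the choice of reference model matters: the expectation is always taken with respect to the reference density.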

G-R KLD

• Remedy: the Kullback-Leibler projection of Goutis and Robert (1998), or G-R KLD: the infimum of the KLD between the likelihood under the reference model and all possible likelihoods arising from the competing model.
• If the reference model is correctly specified, the G-R KLD is the KLD between the reference model and the competing model evaluated at its MLE (ref. Akaike 1974).
• G-R KLD overcomes the challenges associated with prior elicitation in calculating KLD under the Bayesian framework.

G-R KLD

• The Bayesian estimate of G-R KLD: integrating the G-R KLD with respect to the posterior distribution of model parameters under the reference model.
  – The Bayesian estimate of G-R KLD is not subject to impropriety of the prior as long as the posterior under the reference model is proper.
  – G-R KLD is suitable for comparing the predictivity of the competing models.
  – G-R KLD was originally developed for comparing nested GLMs with a known true model, and its extension to general model comparison remains limited.

Notations

• The $X_i$'s are i.i.d., originating from model $g$ governed by $\theta \in \Theta$.
• $T_n = T(X_1, \dots, X_n)$: the statistic for model diagnostics.
• Two competing models: $r$ for the reference model and $f$ for the fitted model.
• Assume that the prior $\pi_r(\theta)$ leads to a proper posterior under $r$.

Our proposed KLD

• $\mathrm{KLD}_t(r, f \mid \theta)$ quantifies the relative model fit for the statistic $T_n$ between models $r$ and $f$.
• $\mathrm{KLD}_t(r, f \mid \theta)$ is identical to the G-R KLD when the reference model $r$ is the correct model.
• In general, $\mathrm{KLD}_t(r, f \mid \theta)$ is not necessarily the same as the G-R KLD.
• $\mathrm{KLD}_t(r, f \mid \theta)$ needs no additional adjustment for non-nested situations.
• $\mathrm{KLD}_t(r, f \mid \theta)$ is more practical than the G-R KLD.

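Regularity Conditions I

(A1) For each $x$, both $\log r(x \mid \theta)$ and $\log f(x \mid \theta)$ are three times continuously differentiable in $\theta$. Further, there exist neighborhoods $N_r(\delta) = (\theta - \delta_r, \theta + \delta_r)$ and $N_f(\delta) = (\theta - \delta_f, \theta + \delta_f)$ of $\theta$ and integrable functions $H_{\theta,\delta_r}(x)$ and $H_{\theta,\delta_f}(x)$ such that

\[
\sup_{\theta' \in N_r(\delta)} \left| \frac{\partial^k}{\partial\theta^k} \log r(x \mid \theta) \Big|_{\theta = \theta'} \right| \le H_{\theta,\delta_r}(x)
\]

and

\[
\sup_{\theta' \in N_f(\delta)} \left| \frac{\partial^k}{\partial\theta^k} \log f(x \mid \theta) \Big|_{\theta = \theta'} \right| \le H_{\theta,\delta_f}(x)
\]

for $k = 1, 2, 3$.

(A2) For all sufficiently large $\lambda > 0$,

\[
E_r\left[ \sup_{|\theta' - \theta| > \lambda} \log \frac{r(x \mid \theta')}{r(x \mid \theta)} \right] < 0;
\qquad
E_f\left[ \sup_{|\theta' - \theta| > \lambda} \log \frac{f(x \mid \theta')}{f(x \mid \theta)} \right] < 0.
\]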

Regularity Conditions II

(A3)

\[
E_r\left[ \sup_{\theta' \in (\theta - \delta, \theta + \delta)} \log r(x \mid \theta') \,\Big|\, \theta \right] \to E_r[\log r(x \mid \theta)] \quad \text{as } \delta \to 0;
\]

\[
E_f\left[ \sup_{\theta' \in (\theta - \delta, \theta + \delta)} \log f(x \mid \theta') \,\Big|\, \theta \right] \to E_f[\log f(x \mid \theta)] \quad \text{as } \delta \to 0.
\]

(A4) The prior density $\pi(\theta)$ is continuously differentiable in a neighborhood of $\theta$ and $\pi(\theta) > 0$.

(A5) Suppose that $T_n$ is asymptotically normally distributed under both models such that

\[
r(T_n \mid \theta) = \sigma_r^{-1}(\theta) \, \phi\left( \sqrt{n} \{ T_n - \mu_r(\theta) \} / \sigma_r(\theta) \right) + O(n^{-1/2});
\]

\[
f(T_n \mid \theta) = \sigma_f^{-1}(\theta) \, \phi\left( \sqrt{n} \{ T_n - \mu_f(\theta) \} / \sigma_f(\theta) \right) + O(n^{-1/2}).
\]

Theorem 1. Assume the regularity conditions (A1)-(A5). Then

\[
\frac{2 \, \mathrm{KLD}_t(r, f \mid U_n)}{n} - \frac{\{ \hat\mu_f(U_n) - \hat\mu_r(U_n) \}^2}{\hat\sigma_f^2(U_n)} = o_p(1) \tag{3}
\]

when $\mu_f(\theta) \ne \mu_r(\theta)$, and

\[
2 \, \mathrm{KLD}_t(r, f \mid U_n) - Q\left( \frac{\hat\sigma_r^2(U_n)}{\hat\sigma_f^2(U_n)} \right) = o_p(1) \tag{4}
\]

when $\mu_r(\theta) = \mu_f(\theta)$ but $\sigma_r^2(\theta) \ne \sigma_f^2(\theta)$, where $Q(y) = y - \log(y) - 1$.
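A heuristic way to see where the two regimes of Theorem 1 come from is sketched below. It assumes that the normal approximations in (A5) hold exactly (the $O(n^{-1/2})$ remainders are dropped) and treats $\theta$ as fixed, so it is a motivating calculation rather than the authors' proof.

\[
\begin{aligned}
r(T_n \mid \theta) &\approx N\left( \mu_r, \sigma_r^2 / n \right), \qquad f(T_n \mid \theta) \approx N\left( \mu_f, \sigma_f^2 / n \right), \\
2 \, \mathrm{KLD}_t(r, f \mid \theta) &\approx -\log \frac{\sigma_r^2}{\sigma_f^2} + \frac{\sigma_r^2}{\sigma_f^2} - 1 + \frac{n (\mu_r - \mu_f)^2}{\sigma_f^2} \\
&= Q\left( \frac{\sigma_r^2}{\sigma_f^2} \right) + n \, \frac{(\mu_r - \mu_f)^2}{\sigma_f^2}, \qquad Q(y) = y - \log(y) - 1 .
\end{aligned}
\]

Dividing by $n$ leaves only the mean term when $\mu_r \ne \mu_f$, which matches (3). When the means coincide, the mean term vanishes and only $Q(\sigma_r^2 / \sigma_f^2)$ remains, which matches (4).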

Remarks for Theorem 1

• $\mathrm{KLD}_t(r, f \mid \theta)$ is also a divergence of model parameter estimates.
• Model comparison in real applications may rely on the fit to a multi-dimensional statistic. The results in Theorem 1 are applicable to the multivariate case with a fixed dimension.
• $\mathrm{KLD}_t(r, f \mid \theta)$ can be viewed as the discrepancy between $r$ and $f$ in terms of their posterior predictivity of $T_n$.
• We study how $\mathrm{KLD}_t(r, f \mid \theta)$ is connected to a weighted posterior predictive p-value, a typical Bayesian technique to assess model discrepancy (see Rubin 1984; Gelman et al. 1996).

Weighted Posterior Predictive P-value

\[
\mathrm{WPPP}_r(U_n) \equiv \int \left[ \int \left( \int_{-\infty}^{t_n} f^*(y_n \mid \hat\theta_f) \, dy_n \right) r^*(t_n \mid \theta) \, dt_n \right] \pi_r(\theta \mid U_n) \, d\theta, \tag{5}
\]

where $r^*$ and $f^*$ are the predictive density functions of $T_n$ under $r$ and $f$, respectively.

• WPPP is equivalent to the weighted posterior predictive p-value of $T_n$ under $f$ with respect to the posterior predictive distribution of $T_n$ under $r$.
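The nested integrals in (5) can be approximated by simulation: draw $\theta$ from the posterior under $r$, draw $t_n$ from the predictive density under $r$, and average the predictive c.d.f. under $f$ at $t_n$. The sketch below is a minimal Monte Carlo version of that idea. The function name `wppp_monte_carlo`, the normal predictive distributions, the degenerate posterior draws, and all numeric values are hypothetical and only serve to make the example runnable.

```python
import math
import random
from statistics import NormalDist

def wppp_monte_carlo(posterior_draws, sample_t_under_r, cdf_under_f,
                     draws_per_theta=200, seed=0):
    """Monte Carlo approximation of the triple integral in (5)."""
    rng = random.Random(seed)
    total, count = 0.0, 0
    for theta in posterior_draws:           # outer integral over pi_r(theta | U_n)
        for _ in range(draws_per_theta):    # middle integral over r*(t_n | theta)
            t = sample_t_under_r(theta, rng)
            total += cdf_under_f(t)         # inner integral: c.d.f. of f* at t_n
            count += 1
    return total / count

# Toy setup (every value here is an assumption for illustration).
n = 50
posterior_draws = [(0.0, 1.0)] * 50                      # (mean, variance) draws under r
f_pred = NormalDist(mu=0.0, sigma=math.sqrt(2.0 / n))    # predictive of T_n under fitted f

def sample_t_under_r(theta, rng):
    """Draw T_n from N(mean, variance / n), the assumed predictive under r."""
    mean, var = theta
    return rng.gauss(mean, math.sqrt(var / n))

wppp = wppp_monte_carlo(posterior_draws, sample_t_under_r, f_pred.cdf)
print(wppp)
```

In this toy setup the two models share the same mean for $T_n$ and differ only in variance, so the estimate lands near 0.5.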

Theorem 2.

\[
\frac{2 \, \mathrm{KLD}_t(r, f \mid U_n)}{n} = \frac{\{ \Phi^{-1}(\mathrm{WPPP}_r(U_n)) \}^2}{n} + \left( \frac{(\hat\mu_r(U_n) - \hat\mu_f(U_n))^2}{\hat\sigma_f^2(U_n) + \hat\sigma_r^2(U_n)} \right) \frac{\hat\sigma_r^2(U_n)}{\hat\sigma_f^2(U_n)} + o_p(1) \tag{6}
\]

when $\mu_f(\theta) \ne \mu_r(\theta)$. Let $Q(y) = y - \log(y) - 1$. Then

\[
2 \, \mathrm{KLD}_t(r, f \mid U_n) - Q\left( \frac{\hat\sigma_r^2(U_n)}{\hat\sigma_f^2(U_n)} \right) = o_p(1) \tag{7}
\]

and

\[
\mathrm{WPPP}_r(U_n) - 0.5 = o_p(1) \tag{8}
\]

when $\mu_r(\theta) = \mu_f(\theta)$ but $\sigma_r^2(\theta) \ne \sigma_f^2(\theta)$.

Remarks on Theorem 2

• It shows the asymptotic relationship between $\mathrm{KLD}_t(r, f \mid u_n)$ and WPPP.
• Suppose that $\mu_f(\theta) \ne \mu_r(\theta)$.
  – Both $2 \, \mathrm{KLD}_t(r, f \mid U_n)$ and $\{ \Phi^{-1}(\mathrm{WPPP}_r(U_n)) \}^2$ are of order $O_p(n)$.
  – $2 \, \mathrm{KLD}_t(r, f \mid U_n)$ is greater than $\{ \Phi^{-1}(\mathrm{WPPP}_r(U_n)) \}^2$ by an $O_p(n)$ term that assumes positive values with probability 1.
• When $\mu_r(\theta) = \mu_f(\theta)$ (i.e., both $r$ and $f$ assume the same mean of $T_n$) but $\sigma_f^2(\theta) \ne \sigma_r^2(\theta)$,
  – $\Phi^{-1}(\mathrm{WPPP}_r(U_n))$ converges to 0; $\mathrm{WPPP}_r(U_n)$ converges to 0.5.
  – $\mathrm{KLD}_t(r, f \mid U_n)$ converges to a positive quantity of order $O_p(1)$.
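The relationship between the KLD and the WPPP when the means differ can be checked numerically in an idealized setting. The sketch below assumes that $T_n$ is exactly normal under both models and that the parameters are known, so the WPPP reduces to $P(Y < T)$ for independent normal $Y$ and $T$. The function name `theorem2_sides` and the parameter values are hypothetical.

```python
import math
from statistics import NormalDist

def Q(y):
    """Q(y) = y - log(y) - 1, as defined in Theorem 2."""
    return y - math.log(y) - 1.0

def theorem2_sides(mu_r, var_r, mu_f, var_f, n):
    """Return both sides of (6), without the o_p(1) term, under exact normality."""
    std_normal = NormalDist()
    delta_sq = (mu_r - mu_f) ** 2
    # 2 * KLD from N(mu_r, var_r / n) to N(mu_f, var_f / n).
    two_kld = Q(var_r / var_f) + n * delta_sq / var_f
    # P(Y < T) for independent Y ~ N(mu_f, var_f / n) and T ~ N(mu_r, var_r / n).
    wppp = std_normal.cdf(math.sqrt(n) * (mu_r - mu_f) / math.sqrt(var_r + var_f))
    z = std_normal.inv_cdf(wppp)  # fails if wppp rounds to exactly 0 or 1 (very large n)
    lhs = two_kld / n
    rhs = z * z / n + delta_sq / (var_f + var_r) * (var_r / var_f)
    return lhs, rhs

for n in (25, 100):
    lhs, rhs = theorem2_sides(mu_r=0.2, var_r=1.0, mu_f=0.0, var_f=2.0, n=n)
    print(n, lhs, rhs, lhs - rhs)
```

In this idealized setting the gap between the two sides equals $Q(\sigma_r^2 / \sigma_f^2) / n$, which shrinks as $n$ grows; that is the role played by the $o_p(1)$ term.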

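Example 1. Let $X_i \overset{\text{i.i.d.}}{\sim} g_\theta(x_i) = \phi\left( (x_i - \theta_1) / \sqrt{\theta_2} \right) / \sqrt{\theta_2}$, where $\theta_2 > 0$. Let $T_n = \sqrt{n} \, [ (\sum_i X_i) / n - \theta_1 ] / \sqrt{\kappa}$. Let $r = g$ and $f_\theta(x_i) = \phi\left( (x_i - \theta_1) / \sqrt{\kappa} \right) / \sqrt{\kappa}$. Then

• $\mu_r(\theta) = E_r(T_n) = \mu_f(\theta) = E_f(T_n) = \theta_1$, $\sigma_r^2(\theta) = \theta_2$, $\sigma_f^2(\theta) = \kappa$.
•
\[
2 \lim_{n \to \infty} \mathrm{KLD}_t(r, f \mid u_n) = -\log\left( \frac{\hat\theta_2(u_n)}{\kappa} \right) + \frac{\hat\theta_2(u_n)}{\kappa} - 1
\quad
\begin{cases}
> 0 & \text{if } \kappa \ne \theta_2, \\
= 0 & \text{if } \kappa = \theta_2.
\end{cases}
\]
• $T_n$ is the MLE for $\theta_1$ under both $r$ and $f$.
• $\lim_{n \to \infty} \mathrm{WPPP}(U_n) = 0.5$.
• $\mathrm{WPPP}(U_n)$ is asymptotically equivalent to the KLD approaches.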

Example 2. Assume $X_i \overset{\text{i.i.d.}}{\sim} g_\theta(x_i) = \exp\{ -\theta / (1 - \theta) \} \{ \theta / (1 - \theta) \}^{x_i} / x_i!$, where $0 < \theta < 1$. Let $T_n = \bar X_n / (1 + \bar X_n)$, $r = g$, and $f_\theta(x_i) = \theta^{x_i} (1 - \theta)$. Then

• $\mu_r(\theta) = \mu_f(\theta) = \theta$, $\sigma_r^2(\theta) = \theta (1 - \theta)^3$, and $\sigma_f^2(\theta) = \theta (1 - \theta)^2$.
• $\theta = E(X_i) / (1 + E(X_i))$.
• $T_n$ is the MLE for $\theta$ under both $r$ and $f$.
• $2 \lim_{n \to \infty} \mathrm{KLD}_t(r, f \mid u_n) = -\log(1 - \hat\theta(u_n)) + (1 - \hat\theta(u_n)) - 1 > 0$ for $0 < \theta < 1$.
• $\lim_{n \to \infty} \mathrm{WPPP}(U_n) = 0.5$.
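The limit in Example 2 is $Q$ evaluated at the variance ratio $\sigma_r^2 / \sigma_f^2 = 1 - \theta$, with $Q(y) = y - \log(y) - 1$. The sketch below computes $\hat\theta$, the two asymptotic variances, and the limiting value of $2 \, \mathrm{KLD}_t$ from a sample of counts. The function name `example2_summary` and the data are hypothetical, and the sample mean is assumed to be strictly positive.

```python
import math

def Q(y):
    """Q(y) = y - log(y) - 1."""
    return y - math.log(y) - 1.0

def example2_summary(xs):
    """Plug-in quantities for Example 2 from observed counts xs (sample mean > 0)."""
    x_bar = sum(xs) / len(xs)
    theta_hat = x_bar / (1.0 + x_bar)               # T_n, the MLE of theta
    var_r = theta_hat * (1.0 - theta_hat) ** 3      # sigma_r^2 under the Poisson model r
    var_f = theta_hat * (1.0 - theta_hat) ** 2      # sigma_f^2 under the geometric model f
    return theta_hat, var_r, var_f, Q(var_r / var_f)

counts = [1, 0, 2, 1, 1]  # hypothetical data with sample mean 1
theta_hat, var_r, var_f, two_kld_limit = example2_summary(counts)
print(theta_hat, var_r, var_f, two_kld_limit)
```

Because the variance ratio $1 - \hat\theta$ is strictly below 1 for any positive sample mean, the limiting value is strictly positive, in line with the inequality stated in the example.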

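Example 3. Assume

\[
X_i \overset{\text{i.i.d.}}{\sim} g_\theta(x_i) = \frac{\Gamma((\theta_2 + 1) / 2)}{\sqrt{\pi \theta_2} \, \Gamma(\theta_2 / 2)} \left( 1 + (x_i - \theta_1)^2 / \theta_2 \right)^{-(1 + \theta_2) / 2},
\]

where $\theta_2 > 2$. Let $T_n = \bar X$. Let $r = g$ and $f_\theta(x_i) = \phi(x_i - \theta_1)$. Then

• $\mu_f(\theta) = \mu_r(\theta) = \theta_1$, $\sigma_r^2(\theta) = \theta_2 / (\theta_2 - 2)$, and $\sigma_f^2(\theta) = 1$.
• $2 \lim_{n \to \infty} \mathrm{KLD}_t(r, f \mid u_n) = -\log\left( \theta_2(u_n) / (\theta_2(u_n) - 2) \right) + \theta_2(u_n) / (\theta_2(u_n) - 2) - 1 \ge 0$ for all $\theta_2$, with equality if and only if $\theta_2 = \infty$.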

Example 4. Assume $X_i \overset{\text{i.i.d.}}{\sim} g_\theta(x_i) = \exp(-x_i / \theta) / \theta$. Let $r = g$ and $f_\theta(x_i) = \exp(-x_i)$, $T_n = \min\{ X_1, \dots, X_n \}$. Then

• $r_\theta(t_n) = n \exp(-n t_n / \theta) / \theta$ and $f_\theta(t_n) = n \exp(-n t_n)$.
• $\mathrm{WPPP}_f(\bar x_n) = E_f\left( \Pr(T_n^* < T_n) \mid \bar x_n \right) \to \dfrac{\bar x_n}{\bar x_n + 1}$.
• $\mathrm{KLD}_t(r, f \mid \bar x_n) \to -\log(\bar x_n) + n (\bar x_n - 1)$.
• The asymptotic equivalence between $\mathrm{KLD}_t(r, f \mid u_n)$ and $\mathrm{WPPP}_f(u_n)$ does not hold in the sense of Theorem 2 due to the violation of the asymptotic normality assumption.
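The WPPP limit in Example 4 can be checked by simulation. The sketch below assumes one reading of that limit: $T_n^*$ is the minimum under $f$ (exponential with rate $n$), $T_n$ is the minimum under $r$ (exponential with rate $n / \theta$), the two are independent, and $\theta$ is fixed at a plug-in value standing in for $\bar x_n$. The function name and the values of $\theta$ and $n$ are hypothetical.

```python
import random

def wppp_limit_monte_carlo(theta, n, draws=20000, seed=0):
    """Estimate Pr(T* < T) for T* ~ Exp(rate n) and T ~ Exp(rate n / theta)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(draws):
        t_star = rng.expovariate(n)          # minimum of n unit-rate exponentials (model f)
        t_ref = rng.expovariate(n / theta)   # minimum of n exponentials with mean theta (model r)
        if t_star < t_ref:
            hits += 1
    return hits / draws

theta = 2.0  # hypothetical plug-in value playing the role of the sample mean
estimate = wppp_limit_monte_carlo(theta, n=10)
closed_form = theta / (theta + 1.0)
print(estimate, closed_form)
```

Under these assumptions the probability does not depend on $n$ and stays away from 0.5 unless $\theta = 1$, which illustrates why the normal-theory behavior of Theorem 2 does not carry over to this example.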
