 
Higher Order Analysis of Bayesian Cross Validation in Regular Asymptotic Theory
Information Geometry and Its Applications IV, in honor of Prof. Amari's 80th birthday
June 12-17, 2016, Liblice, Czech Republic
Sumio Watanabe, Tokyo Institute of Technology

Purpose of this Research
To answer a question in Bayesian statistics: "Is choosing a prior by minimizing cross validation really optimal for minimizing the generalization loss?"
Reference: S. Watanabe, Bayesian Cross Validation and WAIC for Predictive Prior Design in Regular Asymptotic Theory, arXiv:1503.07970

Why Higher Order Analysis is Necessary
(1) In Bayesian statistics, how to choose (or optimize) a prior is frequently discussed.
(2) In regular statistical models, the first-order asymptotic behavior does not depend on the prior.
(3) Higher order analysis is therefore necessary to study the effect of a prior.

Optimality Measure of a Prior
In this presentation, we study the optimality of a prior in the following setting.
(1) Evaluation measure: generalization loss (= KL loss) of the Bayes predictive distribution.
(2) Optimization criteria: cross validation, information criteria, and marginal likelihood.
(3) Statistical model: regular.

Contents
1. Foundations of Bayesian Statistics
2. Main Theorem
3. Proof
4. Example

Notations: Model and Prior
(1) $q(x)$: an unknown true probability density on $\mathbb{R}^N$.
(2) $X^n = (X_1, X_2, \ldots, X_n)$: a set of random variables that are independently subject to $q(x)$.
(3) $p(x|w)$: a probability density on $\mathbb{R}^N$ for a given parameter $w \in \mathbb{R}^d$.
Note: $q(x)$ is not realizable by $p(x|w)$ in general.
(4) $\varphi_0(w)$: a fixed prior on $\mathbb{R}^d$ (possibly improper).
$\varphi(w)$: a candidate prior on $\mathbb{R}^d$ (possibly improper).

Definition of Bayesian Estimation
(1) The posterior distribution is defined by
$$ p(w|X^n) = \frac{1}{Z}\,\varphi(w) \prod_{i=1}^{n} p(X_i|w), $$
where $Z$ is a normalizing constant.
Note: Even if the prior is improper, we assume $Z$ is finite.
(2) $E_w[\ ]$ denotes the expected value over $p(w|X^n)$, and $V_w[\ ]$ denotes the variance over $p(w|X^n)$.
(3) The predictive distribution is $p(x|X^n) = E_w[\,p(x|w)\,]$.

Generalization and Cross Validation
(1) Random generalization loss:
$$ G_n(\varphi) = -\int q(x) \log p(x|X^n)\,dx. $$
(2) Average generalization loss: $E[G_n(\varphi)]$.
(3) Cross validation loss (leave-one-out):
$$ CV_n(\varphi) = -\frac{1}{n} \sum_{i=1}^{n} \log p(X_i \mid X^n \setminus X_i). $$
(4) Average cross validation loss: $E[CV_n(\varphi)]$.

ISCV and WAIC
(1) Importance sampling cross validation (Gelfand et al., 1992):
$$ ISCV_n(\varphi) = \frac{1}{n} \sum_{i=1}^{n} \log E_w[\,1/p(X_i|w)\,]. $$
It is proved that $CV_n(\varphi) = ISCV_n(\varphi)$.
(2) Widely Applicable Information Criterion (Watanabe, 2009):
$$ WAIC_n(\varphi) = -\frac{1}{n} \sum_{i=1}^{n} \log E_w[\,p(X_i|w)\,] + \frac{1}{n} \sum_{i=1}^{n} V_w[\,\log p(X_i|w)\,]. $$
In regular models, $CV_n(\varphi) = WAIC_n(\varphi) + O_p(1/n^3)$.

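In practice both criteria are usually estimated from posterior draws. The following is a minimal sketch, assuming a table `log_lik[s][i]` holding $\log p(X_i|w_s)$ for draws $w_s$ from the posterior; the function names and the table layout are illustrative choices, not part of the original slides, and posterior expectations are replaced by plain averages over the draws.

```python
import math


def log_mean_exp(values):
    """Numerically stable log of the average of exp(v) over a list of floats."""
    v_max = max(values)
    return v_max + math.log(sum(math.exp(v - v_max) for v in values) / len(values))


def variance(values):
    """Population variance, used as the estimate of the posterior variance V_w."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def columns(log_lik):
    """Regroup log_lik[s][i] into one list of draws per data point i."""
    n = len(log_lik[0])
    return [[row[i] for row in log_lik] for i in range(n)]


def iscv(log_lik):
    """ISCV_n = (1/n) sum_i log E_w[1 / p(X_i|w)]."""
    cols = columns(log_lik)
    return sum(log_mean_exp([-v for v in col]) for col in cols) / len(cols)


def waic(log_lik):
    """WAIC_n = -(1/n) sum_i log E_w[p(X_i|w)] + (1/n) sum_i V_w[log p(X_i|w)]."""
    cols = columns(log_lik)
    training_loss = -sum(log_mean_exp(col) for col in cols) / len(cols)
    functional_variance = sum(variance(col) for col in cols) / len(cols)
    return training_loss + functional_variance
```

Working in log space avoids overflow in $1/p(X_i|w)$, which can be very large for draws that fit a data point poorly.
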
Marginal Likelihood
For a prior function $\varphi(w)$ that is not necessarily normalized, the prior probability distribution is $\varphi(w) / \int \varphi(w)\,dw$.
The minus log marginal likelihood (I. J. Good) is
$$ F_n(\varphi) = -\log \int \varphi(w) \prod_{i=1}^{n} p(X_i|w)\,dw + \log \int \varphi(w)\,dw. $$
Note: If $\int \varphi(w)\,dw = \infty$, the marginal likelihood cannot be defined, whereas CV and WAIC can be defined.
Note: If the marginal likelihood is employed as a criterion, the prior function must be proper. However, the optimal prior function that minimizes the generalization loss may be improper in general.

Basic Question
By definition, for an arbitrary integer $n > 1$,
$$ E[G_{n-1}(\varphi)] = E[F_n(\varphi)] - E[F_{n-1}(\varphi)], $$
$$ E[G_{n-1}(\varphi)] = E[CV_n(\varphi)]. $$
However, as random variables,
$$ G_{n-1}(\varphi) \neq F_n(\varphi) - F_{n-1}(\varphi), \qquad G_{n-1}(\varphi) \neq CV_n(\varphi). $$
Basic Question: Assume $\varphi(w) = \varphi(w|\alpha)$, where $\alpha$ is a hyperparameter. Does the $\alpha$ that minimizes $CV_n(\varphi)$ or $F_n(\varphi)$ also minimize $G_n(\varphi)$ and $E[G_n(\varphi)]$ asymptotically?

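The first identity can be checked directly from the definitions. The worked step below is a sketch added for illustration; $E_w^{(n-1)}$ is shorthand introduced here (not in the original) for the posterior average given the first $n-1$ samples.

```latex
\begin{align*}
F_n(\varphi) - F_{n-1}(\varphi)
  &= -\log \frac{\int \varphi(w) \prod_{i=1}^{n} p(X_i|w)\,dw}
                {\int \varphi(w) \prod_{i=1}^{n-1} p(X_i|w)\,dw} \\
  &= -\log E_w^{(n-1)}\big[\,p(X_n|w)\,\big]
   = -\log p(X_n \mid X^{n-1}), \\
E\big[F_n(\varphi) - F_{n-1}(\varphi)\big]
  &= E\Big[-\int q(x) \log p(x \mid X^{n-1})\,dx\Big]
   = E\big[G_{n-1}(\varphi)\big].
\end{align*}
```

The prior normalization terms cancel in the difference. The second identity follows in the same way, because each leave-one-out term $-\log p(X_i \mid X^n \setminus X_i)$ has the same expectation as $-\log p(X_n \mid X^{n-1})$. Both identities hold only after averaging over the data.
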
Notations I
(1) $\varphi_0(w)$: a fixed prior (for example, $\varphi_0(w) \equiv 1$).
(2) $L_n(w) = -\frac{1}{n} \sum_{i=1}^{n} \log p(X_i|w) - \frac{1}{n} \log \varphi_0(w)$
(3) $w^* = \arg\min_w L_n(w)$: the MAP estimator for $\varphi_0(w)$.
(4) $L(w) = -\int q(x) \log p(x|w)\,dx$
(5) $w_0 = \arg\min_w L(w)$
Note: If $\varphi_0(w) \equiv 1$ is chosen as the fixed prior, then it is improper, $L_n(w)$ is the minus log likelihood divided by $n$, and $w^*$ is the MLE. $w^*$ does not depend on the candidate prior.

Notations II
(1) For a given function $f(w)$,
$$ f_{k_1 k_2 \cdots k_m}(w) = \frac{\partial}{\partial w_{k_1}} \frac{\partial}{\partial w_{k_2}} \cdots \frac{\partial}{\partial w_{k_m}} f(w). $$
(2) Einstein's summation convention: repeated indices are summed over,
$$ A_{k_1 k_2} B_{k_2 k_3} = \sum_{k_2=1}^{d} A_{k_1 k_2} B_{k_2 k_3}. $$
(3) Assumption: $(L(w))_{k_1 k_2}$ is positive definite in a neighborhood of $w_0$.
$(g_n)^{k_1 k_2}(w)$ = inverse matrix of $(L_n(w))_{k_1 k_2}$
$(g)^{k_1 k_2}(w)$ = inverse matrix of $(L(w))_{k_1 k_2}$
Note: These functions do not depend on the candidate prior.

Notations III
Correlations:
$$ (F_n)_{k_1, k_2}(w) = \frac{1}{n} \sum_{i=1}^{n} (\log p(X_i|w))_{k_1} (\log p(X_i|w))_{k_2} $$
$$ (F_n)_{k_1 k_2, k_3}(w) = \frac{1}{n} \sum_{i=1}^{n} (\log p(X_i|w))_{k_1 k_2} (\log p(X_i|w))_{k_3} $$
Average correlations:
$$ (F)_{k_1, k_2}(w) = E[\,(F_n)_{k_1, k_2}(w)\,] $$
$$ (F)_{k_1 k_2, k_3}(w) = E[\,(F_n)_{k_1 k_2, k_3}(w)\,] $$
Note: These functions do not depend on the candidate prior.

Notations IV
For higher order analysis, the following functions are necessary.
$$ (A_n)^{k_1 k_2}(w) = \tfrac{1}{2} (g_n)^{k_1 k_2}(w) $$
$$ (B_n)^{k_1 k_2}(w) = \tfrac{1}{2} \left\{ (g_n)^{k_1 k_2}(w) + (g_n)^{k_1 k_3}(w)\,(g_n)^{k_2 k_4}(w)\,(F_n)_{k_3, k_4}(w) \right\} $$
$$ (C_n)^{k_1}(w) = (g_n)^{k_1 k_2}(w)\,(g_n)^{k_3 k_4}(w)\,(F_n)_{k_2 k_4, k_3}(w) - \tfrac{1}{2} (g_n)^{k_1 k_2}(w)\,(g_n)^{k_3 k_4}(w)\,(L_n)_{k_2 k_3 k_4}(w) - \tfrac{1}{2} (g_n)^{k_1 k_2}(w)\,(g_n)^{k_3 k_4}(w)\,(g_n)^{k_5 k_6}(w)\,(L_n)_{k_2 k_3 k_5}(w)\,(F_n)_{k_4, k_6}(w) $$
Definitions of $(A)^{k_1 k_2}(w)$, $(B)^{k_1 k_2}(w)$, and $(C)^{k_1}(w)$: these are defined by the same equations as $(A_n)^{k_1 k_2}(w)$, $(B_n)^{k_1 k_2}(w)$, and $(C_n)^{k_1}(w)$, using $(g)^{k_1 k_2}(w)$, $(F)_{k_1, k_2}(w)$, $(F)_{k_1 k_2, k_3}(w)$, and $(L)_{k_1 k_2 k_3}(w)$ instead of $(g_n)^{k_1 k_2}(w)$, $(F_n)_{k_1, k_2}(w)$, $(F_n)_{k_1 k_2, k_3}(w)$, and $(L_n)_{k_1 k_2 k_3}(w)$.
Note: These functions do not depend on the candidate prior.

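As a reading aid, the contractions can be written out for a one-parameter model ($d = 1$), where every index equals 1 and each tensor is a scalar. The scalar shorthand below is only this specialization, not additional notation from the original slides.

```latex
\begin{align*}
g_n &= 1 / L_n''(w), \qquad
F_{1,1} = \frac{1}{n}\sum_{i=1}^{n} \big(\partial_w \log p(X_i|w)\big)^2, \qquad
F_{11,1} = \frac{1}{n}\sum_{i=1}^{n} \big(\partial_w^2 \log p(X_i|w)\big)\big(\partial_w \log p(X_i|w)\big), \\
A_n &= \tfrac{1}{2}\, g_n, \\
B_n &= \tfrac{1}{2}\left( g_n + g_n^2\, F_{1,1} \right), \\
C_n &= g_n^2\, F_{11,1} - \tfrac{1}{2}\, g_n^2\, L_n''' - \tfrac{1}{2}\, g_n^3\, L_n'''\, F_{1,1}.
\end{align*}
```
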
Notations V
For higher order analysis, the following are also necessary.
Ratio of the candidate and fixed priors: $\Phi(w) = \varphi(w) / \varphi_0(w)$.
Mathematical relations between the priors $\varphi(w)$ and $\varphi_0(w)$:
$$ M_n(\varphi, w) = (A_n)^{k_1 k_2}(w)\,(\log\Phi)_{k_1} (\log\Phi)_{k_2} + (B_n)^{k_1 k_2}(w)\,(\log\Phi)_{k_1 k_2} + (C_n)^{k_1}(w)\,(\log\Phi)_{k_1} $$
$$ M(\varphi, w) = (A)^{k_1 k_2}(w)\,(\log\Phi)_{k_1} (\log\Phi)_{k_2} + (B)^{k_1 k_2}(w)\,(\log\Phi)_{k_1 k_2} + (C)^{k_1}(w)\,(\log\Phi)_{k_1} $$
Note: None of $(A)^{k_1 k_2}(w)$, $(B)^{k_1 k_2}(w)$, $(C)^{k_1}(w)$, $(A_n)^{k_1 k_2}(w)$, $(B_n)^{k_1 k_2}(w)$, and $(C_n)^{k_1}(w)$ depends on the candidate prior $\varphi(w)$. A candidate prior affects only $\log\Phi$.

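Once the three tensors and the derivatives of $\log\Phi$ are available at a point $w$, the relation is a plain contraction. The sketch below assumes they are passed as nested lists; the function name `prior_relation` and its argument names are hypothetical, chosen only for this illustration.

```python
def prior_relation(A, B, C, grad_log_phi, hess_log_phi):
    """Evaluate A^{ab} (log Phi)_a (log Phi)_b + B^{ab} (log Phi)_{ab} + C^a (log Phi)_a.

    A, B          : d x d nested lists (the tensors with two indices)
    C             : list of length d
    grad_log_phi  : first derivatives of log Phi at the evaluation point
    hess_log_phi  : second derivatives of log Phi at the evaluation point
    """
    idx = range(len(C))
    quadratic = sum(A[a][b] * grad_log_phi[a] * grad_log_phi[b] for a in idx for b in idx)
    curvature = sum(B[a][b] * hess_log_phi[a][b] for a in idx for b in idx)
    linear = sum(C[a] * grad_log_phi[a] for a in idx)
    return quadratic + curvature + linear
```

When the candidate prior equals the fixed prior, $\log\Phi \equiv 0$, all derivatives vanish, and the function returns zero.
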
Theorem (1)
Let $w^*$ be the MAP estimator for $\varphi_0(w)$. The mathematical relations between priors asymptotically satisfy
$$ M_n(\varphi, w^*) = M(\varphi, w_0) + O_p(1/n^{1/2}), $$
$$ E[M_n(\varphi, w^*)] = M(\varphi, w_0) + O(1/n). $$
Note: Minimizing $M_n(\varphi, w^*)$ is asymptotically equivalent to minimizing $E[M_n(\varphi, w^*)]$ and $M(\varphi, w_0)$.

Theorem (2)
Cross validation asymptotically satisfies
$$ CV_n(\varphi) = CV_n(\varphi_0) + \frac{1}{n^2} M_n(\varphi, w^*) + O_p(1/n^3), $$
$$ E[CV_n(\varphi)] = E[CV_n(\varphi_0)] + \frac{1}{n^2} M(\varphi, w_0) + O(1/n^3). $$
Note: Minimizing $CV_n(\varphi)$ is asymptotically equivalent to minimizing $M_n(\varphi, w^*)$.
Note: Minimizing $CV_n(\varphi)$ is asymptotically equivalent to minimizing $E[CV_n(\varphi)]$.

Theorem (3)
Generalization loss asymptotically satisfies
$$ G_n(\varphi) = G_n(\varphi_0) + O_p(1/n^{3/2}), $$
$$ E[G_n(\varphi)] = E[G_n(\varphi_0)] + \frac{1}{n^2} M(\varphi, w_0) + O(1/n^3). $$
Note: Minimizing $CV_n(\varphi)$ is not asymptotically equivalent to minimizing $G_n(\varphi)$.
Note: Minimizing $CV_n(\varphi)$ is asymptotically equivalent to minimizing $E[G_n(\varphi)]$.
Note: Minimizing $E[G_n(\varphi)]$ can be performed by minimizing $CV_n(\varphi)$.
Note: Minimizing $G_n(\varphi)$ seems to be impossible if we do not know the true distribution.

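To see why the notes treat the random loss and its average differently, it may help to place the prior-dependent parts of the three statements side by side. The block below only rearranges the statements of the theorem, using $M_n(\varphi, w^*) = M(\varphi, w_0) + O_p(n^{-1/2})$ from part (1); it is a reading aid, not part of the proof.

```latex
\begin{align*}
CV_n(\varphi) - CV_n(\varphi_0)
  &= \frac{1}{n^2} M(\varphi, w_0) + O_p(n^{-5/2}), \\
E[G_n(\varphi)] - E[G_n(\varphi_0)]
  &= \frac{1}{n^2} M(\varphi, w_0) + O(n^{-3}), \\
G_n(\varphi) - G_n(\varphi_0)
  &= O_p(n^{-3/2}).
\end{align*}
```

The first two differences share the leading term $M(\varphi, w_0)/n^2$, so asymptotically they are minimized by the same prior. For the random loss, the theorem gives only an $O_p(n^{-3/2})$ statement, which is of larger order than $n^{-2}$, so the $M$ term does not determine its minimizer.
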
Proof
See arXiv:1503.07970.

An Example
Model: $p(x|s,m) = (s/2\pi)^{1/2} \exp(-(s/2)(x-m)^2)$
True distribution: $q(x) = p(x|1,1)$
Prior: $\varphi(s,m|\mu,\lambda) = s^{\mu} \exp(-(\lambda/2)\, s\, (m^2+1))$, where $(\mu, \lambda)$ is a set of hyperparameters
Proper $\Leftrightarrow$ $\mu > -1/2$, $\lambda > 0$
Fixed prior: $\varphi_0(s,m) = 1$
$(m^*, s^*)$: MAP = MLE

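An Example
With the parameter ordered as $w = (m, s)$, the tensors at the MLE are
$$ (A_n)^{k_1 k_2}(w^*) = \begin{pmatrix} 1/(2s^*) & 0 \\ 0 & s^{*2} \end{pmatrix}, $$
$$ (B_n)^{k_1 k_2}(w^*) = \begin{pmatrix} 1/s^* & -s^{*2} M_3/2 \\ -s^{*2} M_3/2 & (s^{*2} + s^{*4} M_4)/2 \end{pmatrix}, $$
$$ (C_n)^{k_1}(w^*) = \left(0,\; s^* + s^{*3} M_4\right), $$
where $M_k = \frac{1}{n} \sum_{i=1}^{n} (X_i - m^*)^k$.
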
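These closed forms can be checked numerically against the general contractions defining $B_n$ and $C_n$. The sketch below assumes the parameter order $w = (m, s)$, the fixed prior $\varphi_0 = 1$, and central moments $M_k = \frac{1}{n}\sum_i (X_i - m^*)^k$; all function names are hypothetical. For this model the contraction yields $s^* + s^{*3} M_4$ for the second component of $C_n$.

```python
def mle_and_moments(xs):
    """MLE (m*, s*) of the normal model and central moments M3, M4 about m*."""
    n = len(xs)
    m = sum(xs) / n
    s = n / sum((x - m) ** 2 for x in xs)  # precision = 1 / empirical variance
    m3 = sum((x - m) ** 3 for x in xs) / n
    m4 = sum((x - m) ** 4 for x in xs) / n
    return m, s, m3, m4


def empirical_tensors(xs, m, s):
    """g, F_{a,b}, F_{ab,c}, L_{abc} at the MLE w = (m, s); index 0 = m, 1 = s."""
    n, idx = len(xs), range(2)
    grads = [[s * (x - m), 0.5 / s - 0.5 * (x - m) ** 2] for x in xs]
    hessians = [[[-s, x - m], [x - m, -0.5 / s ** 2]] for x in xs]
    F2 = [[sum(gr[a] * gr[b] for gr in grads) / n for b in idx] for a in idx]
    F21 = [[[sum(h[a][b] * gr[c] for h, gr in zip(hessians, grads)) / n
             for c in idx] for b in idx] for a in idx]
    # Third derivatives of L_n: only L_mms (and permutations) and L_sss are nonzero.
    L3 = [[[0.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]]]
    L3[0][0][1] = L3[0][1][0] = L3[1][0][0] = 1.0
    L3[1][1][1] = -1.0 / s ** 3
    g = [[1.0 / s, 0.0], [0.0, 2.0 * s ** 2]]  # inverse Hessian of L_n at the MLE
    return g, F2, F21, L3


def contract_B(g, F2):
    """B^{ab} = (1/2) { g^{ab} + g^{ac} g^{be} F_{c,e} }."""
    idx = range(2)
    return [[0.5 * (g[a][b] + sum(g[a][c] * g[b][e] * F2[c][e] for c in idx for e in idx))
             for b in idx] for a in idx]


def contract_C(g, F2, F21, L3):
    """C^{k1} from the three-term contraction, summing over all repeated indices."""
    idx = range(2)
    C = []
    for k1 in idx:
        total = 0.0
        for k2 in idx:
            for k3 in idx:
                for k4 in idx:
                    total += g[k1][k2] * g[k3][k4] * F21[k2][k4][k3]
                    total -= 0.5 * g[k1][k2] * g[k3][k4] * L3[k2][k3][k4]
                    for k5 in idx:
                        for k6 in idx:
                            total -= (0.5 * g[k1][k2] * g[k3][k4] * g[k5][k6]
                                      * L3[k2][k3][k5] * F2[k4][k6])
        C.append(total)
    return C


def closed_form(s, m3, m4):
    """Closed-form A, B, C for the normal example at the MLE."""
    A = [[0.5 / s, 0.0], [0.0, s ** 2]]
    B = [[1.0 / s, -0.5 * s ** 2 * m3],
         [-0.5 * s ** 2 * m3, 0.5 * (s ** 2 + s ** 4 * m4)]]
    C = [0.0, s + s ** 3 * m4]
    return A, B, C
```

Comparing `contract_B` and `contract_C` with `closed_form` on any small data set (for example six arbitrary numbers) is a quick way to confirm the index placement in the definitions.
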
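An Example
The mathematical relation between the priors $\varphi(s,m)$ and $\varphi_0(s,m)$ is
$$ M_n(\varphi, m^*, s^*) = \tfrac{1}{2}\lambda^2 s^* m^{*2} + \left(-\tfrac{1}{2}\lambda s^* m^{*2} + \mu - \tfrac{1}{2}\lambda s^*\right)^2 + \left(-\tfrac{1}{2}\lambda s^* m^{*2} + \tfrac{1}{2}\mu - \tfrac{1}{2}\lambda s^*\right)\left(1 + s^{*2} M_4\right) - \lambda + \lambda m^* s^{*2} M_3. $$
Simulation: $\lambda$ is fixed and $\mu$ is optimized.
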
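Because the expression above is quadratic in $\mu$ for fixed $\lambda$, its minimizer over $\mu$ has a closed form, obtained by setting the derivative with respect to $\mu$ to zero. The sketch below transcribes the formula and that stationary point; the function names are hypothetical, and $M_3$, $M_4$ are taken to be the empirical central moments about $m^*$ (an assumption of this sketch).

```python
def mle_and_moments(xs):
    """MLE (m*, s*) of the normal model and central moments M3, M4 about m*."""
    n = len(xs)
    m = sum(xs) / n
    s = n / sum((x - m) ** 2 for x in xs)  # precision = 1 / empirical variance
    m3 = sum((x - m) ** 3 for x in xs) / n
    m4 = sum((x - m) ** 4 for x in xs) / n
    return m, s, m3, m4


def m_n(mu, lam, m, s, m3, m4):
    """Closed-form M_n(phi, m*, s*) for the example, term by term."""
    a = 0.5 * lam * s * (m * m + 1.0)  # lam*s*m^2/2 + lam*s/2
    return (0.5 * lam ** 2 * s * m * m
            + (mu - a) ** 2
            + (0.5 * mu - a) * (1.0 + s * s * m4)
            - lam
            + lam * m * s * s * m3)


def optimal_mu(lam, m, s, m4):
    """Stationary point in mu: 2 (mu - a) + (1 + s^2 M4) / 2 = 0."""
    a = 0.5 * lam * s * (m * m + 1.0)
    return a - 0.25 * (1.0 + s * s * m4)


if __name__ == "__main__":
    data = [0.1, 0.9, 1.4, 2.2, 0.7, 1.1]  # arbitrary illustrative sample
    m, s, m3, m4 = mle_and_moments(data)
    lam = 0.5  # hypothetical fixed value of lambda
    mu = optimal_mu(lam, m, s, m4)
    print(mu, m_n(mu, lam, m, s, m3, m4))
```

Setting $\mu = 0$ and $\lambda = 0$ makes the candidate prior equal to the fixed prior, and the expression then evaluates to zero, which is a convenient sanity check.
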
Information Criteria
Generalization loss: $G_n(\mu) = -E_x[\,\log p(x|X^n)\,]$
Importance sampling cross validation: $ISCV_n(\mu) = \frac{1}{n} \sum_{i=1}^{n} \log E_w[\,1/p(X_i|w)\,]$
Widely Applicable Information Criterion: $WAIC_n(\mu) = -\frac{1}{n} \sum_{i=1}^{n} \log E_w[\,p(X_i|w)\,] + \frac{1}{n} \sum_{i=1}^{n} V_w[\,\log p(X_i|w)\,]$
Deviance Information Criterion (Spiegelhalter et al.): $DIC_n(\mu) = \frac{1}{n} \sum_{i=1}^{n} \log p(X_i \mid E_w[w]) - \frac{2}{n} \sum_{i=1}^{n} E_w[\,\log p(X_i|w)\,]$
Minus log marginal likelihood: $F_n(\mu) = -\log \int \varphi(w) \prod_{i=1}^{n} p(X_i|w)\,dw + \log \int \varphi(w)\,dw$
Higher order CV: $WAICR_n(\mu) = \frac{1}{n^2} M_n(\varphi, w^*)$

Simulation Results
[Figure: panels showing ISCV($\mu$) - ISCV(0), G($\mu$) - G(0), WAIC($\mu$) - WAIC(0), F($\mu$) - F(0), DIC($\mu$) - DIC(0), and WAICR($\mu$); one region is labeled "Improper".]