SLIDE 1
Higher Order Analysis of Bayesian Cross Validation in Regular Asymptotic Theory
Information Geometry and Its Applications IV (in honor of Prof. Amari’s 80th birthday), June 12-17, 2016, Liblice, Czech Republic
Sumio Watanabe Tokyo Institute of Technology
SLIDE 2 Purpose of this Research
Answer to the Bayesian question: “Is choosing a prior by minimizing cross validation really optimal for minimizing the generalization loss?”
Reference: S. Watanabe, “Bayesian Cross Validation and WAIC for Predictive Prior Design in Regular Asymptotic Theory,” arXiv:1503.07970.
SLIDE 3
Why Higher Order is Necessary
(1) In Bayesian statistics, how to choose (or optimize) a prior is frequently discussed.
(2) In regular statistical models, the first order statistics do not depend on the prior.
(3) Hence higher order analysis is necessary to study the effect of a prior.
SLIDE 4
Optimality Measure of a Prior
In this presentation, we study the optimality of a prior in the following setting.
(1) Evaluation measure: generalization loss (= KL loss) of the Bayes predictive distribution.
(2) Optimization criteria: cross validation, information criteria, and marginal likelihood.
(3) Statistical model: regular.
SLIDE 5
Contents
1. Foundations of Bayesian Statistics
2. Main Theorem
3. Proof
4. Example
SLIDE 6
Notations: Model and Prior
(1) q(x): an unknown true probability density on R^N.
(2) Xn = (X1, X2, …, Xn): a set of random variables independently subject to q(x).
(3) p(x|w): a probability density on R^N for a given parameter w in R^d.
(4) φ0(w): a fixed prior on R^d (possibly improper).
(5) φ(w): a candidate prior on R^d (possibly improper).
Note: q(x) is not realizable by p(x|w) in general.
SLIDE 7 Definition of Bayesian Estimation
(1) Posterior distribution: p(w|Xn) = (1/Z) φ(w) Π_{i=1}^n p(Xi|w), where Z is a normalizing constant.
Note: Even if a prior is improper, we assume Z is finite.
(2) Ew[ ] denotes the expectation over p(w|Xn); Vw[ ] denotes the variance over p(w|Xn).
(3) Predictive distribution: p(x|Xn) = Ew[ p(x|w) ].
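A minimal numerical sketch of (1)-(3), under assumptions that are not from the slides: a toy model p(x|w) = N(x|w, 1), a proper candidate prior φ(w) = N(w|0, 10^2), and synthetic data from q(x) = N(1, 1), with the posterior evaluated on a parameter grid.

```python
# Sketch: posterior p(w|Xn) and predictive p(x|Xn) = Ew[p(x|w)] on a grid.
# Assumed toy setup: p(x|w) = N(x|w, 1), prior phi(w) = N(w|0, 10^2).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
X = rng.normal(1.0, 1.0, size=50)       # Xn, i.i.d. from q(x) = N(1, 1)

w = np.linspace(-5.0, 5.0, 2001)        # grid over the parameter space
dw = w[1] - w[0]
log_post = norm.logpdf(X[:, None], loc=w).sum(axis=0) + norm.logpdf(w, 0.0, 10.0)
post = np.exp(log_post - log_post.max())
post /= post.sum() * dw                 # normalize: Z is absorbed here

def predictive(x):
    """Predictive density p(x|Xn) = Ew[p(x|w)], a posterior average."""
    return np.sum(norm.pdf(x, loc=w) * post) * dw

print(predictive(1.0))
```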
SLIDE 8 Generalization and Cross Validation
(1) Random generalization loss: Gn(φ) = - ∫ q(x) log p(x|Xn) dx.
(2) Average generalization loss: E[ Gn(φ) ].
(3) Cross validation loss (leave-one-out): CVn(φ) = - (1/n) Σ_{i=1}^n log p(Xi|Xn - Xi).
(4) Average cross validation loss: E[ CVn(φ) ].
SLIDE 9 ISCV and WAIC
(1) Importance sampling CV (Gelfand et al., 1992):
ISCVn(φ) = (1/n) Σ_{i=1}^n log Ew[ 1/p(Xi|w) ].
It is proved that CVn(φ) = ISCVn(φ).
(2) Widely Applicable Information Criterion (Watanabe, 2009):
WAICn(φ) = - (1/n) Σ_{i=1}^n log Ew[ p(Xi|w) ] + (1/n) Σ_{i=1}^n Vw[ log p(Xi|w) ].
In regular models, CVn(φ) = WAICn(φ) + Op(1/n^3).
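A sketch of how ISCVn and WAICn can be computed from posterior draws, replacing Ew[ ] and Vw[ ] by Monte Carlo averages; the toy model p(x|w) = N(x|w, 1) and the draws w_draws are assumptions, not part of the slides.

```python
# Sketch: ISCVn and WAICn from S posterior draws {w_s}.
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

def iscv_waic(X, w_draws):
    logp = norm.logpdf(X[:, None], loc=w_draws[None, :])  # log p(Xi|ws), (n, S)
    S = w_draws.size
    # ISCVn = (1/n) sum_i log Ew[1/p(Xi|w)]
    iscv = np.mean(logsumexp(-logp, axis=1) - np.log(S))
    # WAICn = -(1/n) sum_i log Ew[p(Xi|w)] + (1/n) sum_i Vw[log p(Xi|w)]
    waic = (-np.mean(logsumexp(logp, axis=1) - np.log(S))
            + np.mean(np.var(logp, axis=1)))
    return iscv, waic
```

Since log Ew[ 1/p(Xi|w) ] = -log p(Xi|Xn - Xi), the first quantity reproduces the leave-one-out loss from a single posterior, without n refits.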
SLIDE 10 Marginal Likelihood
For an improper prior φ(w), the a priori probability distribution is φ(w) / ∫ φ(w) dw.
The minus log marginal likelihood (I. J. Good) is
Fn(φ) = - log ∫ φ(w) Π_{i=1}^n p(Xi|w) dw + log ∫ φ(w) dw.
Note: If ∫ φ(w) dw = ∞, the marginal likelihood cannot be defined, whereas CV and WAIC can be defined.
Note: If you employ the marginal likelihood as a criterion, the prior must be proper. However, the prior that minimizes the generalization loss may be improper in general.
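A sketch of Fn by grid integration in the same assumed toy setup as above; it makes the notes concrete: the term log ∫ φ(w) dw is finite only for a proper prior, while ISCV and WAIC require only the posterior.

```python
# Sketch: minus log marginal likelihood Fn(phi) by grid integration.
# If the prior is improper, the second term diverges and Fn is undefined.
import numpy as np
from scipy.stats import norm

def minus_log_marginal(X, w, log_phi):
    dw = w[1] - w[0]
    # log of phi(w) * prod_i p(Xi|w) on the grid
    log_joint = norm.logpdf(X[:, None], loc=w).sum(axis=0) + log_phi
    c = log_joint.max()
    term1 = -(c + np.log(np.sum(np.exp(log_joint - c)) * dw))
    term2 = np.log(np.sum(np.exp(log_phi)) * dw)
    return term1 + term2
```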
SLIDE 11
Basic Question
By definition, for an arbitrary integer n > 1,
E[ Gn-1(φ) ] = E[ CVn(φ) ],
E[ Gn-1(φ) ] = E[ Fn(φ) ] - E[ Fn-1(φ) ].
However, Gn-1(φ) is not equal to CVn(φ), and Gn-1(φ) is not equal to Fn(φ) - Fn-1(φ).
Basic Question: Assume φ(w) = φ(w|a), where a is a hyperparameter. Does the a that minimizes CVn(φ) or Fn(φ) also minimize Gn(φ) and E[ Gn(φ) ], asymptotically?
SLIDE 12
Contents
1. Basic Bayesian procedures
2. Main Theorem
3. Proof
4. Example
SLIDE 13 Notations I
(1) φ0(w): a fixed prior (for example, φ0(w) ≡ 1).
(2) Ln(w) = - (1/n) Σ_{i=1}^n log p(Xi|w) - (1/n) log φ0(w).
(3) w* = argmin Ln(w): the MAP estimator for φ0(w).
(4) L(w) = - ∫ q(x) log p(x|w) dx.
(5) w0 = argmin L(w).
Note: If φ0(w) ≡ 1 is chosen as the fixed prior, it is improper; then Ln(w) is the minus log likelihood (divided by n) and w* is the MLE. w* does not depend on a candidate prior.
SLIDE 14 Notations II
(1) For a given function f(w), fk1k2…km(w) = (∂/∂wk1)(∂/∂wk2) … (∂/∂wkm) f(w).
(2) Einstein’s summation convention: Ak1k2 Bk2k3 = Σ_{k2=1}^d Ak1k2 Bk2k3.
(3) Assumption: (L(w))k1k2 is positive definite in a neighborhood of w0.
(gn)k1k2(w) = the inverse matrix of (Ln(w))k1k2; (g)k1k2(w) = the inverse matrix of (L(w))k1k2.
Note: These functions do not depend on a candidate prior.
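The summation convention in (2) is a tensor contraction over the repeated index; a minimal check with numpy.einsum (an illustration, not part of the slides):

```python
# Einstein convention: Ak1k2 Bk2k3 sums over the repeated index k2,
# i.e., an ordinary matrix product.
import numpy as np

d = 3
A = np.arange(d * d, dtype=float).reshape(d, d)
B = np.arange(d * d, dtype=float).reshape(d, d) + 1.0
C = np.einsum('ij,jk->ik', A, B)  # C[k1,k3] = sum_{k2} A[k1,k2] * B[k2,k3]
assert np.allclose(C, A @ B)
```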
SLIDE 15
Notations III
Correlations:
(Fn)k1,k2(w) = (1/n) Σ_{i=1}^n (log p(Xi|w))k1 (log p(Xi|w))k2,
(Fn)k1k2,k3(w) = (1/n) Σ_{i=1}^n (log p(Xi|w))k1k2 (log p(Xi|w))k3.
Average correlations:
(F)k1,k2(w) = E[ (Fn)k1,k2(w) ],
(F)k1k2,k3(w) = E[ (Fn)k1k2,k3(w) ].
Note: These functions do not depend on a candidate prior.
SLIDE 16 Notations IV
Definitions of (A)k1k2(w), (B)k1k2(w), and (C)k1(w)
For higher order analysis, the following functions are necessary.
(An)k1k2(w) = (1/2) (gn)k1k2(w),
(Bn)k1k2(w) = (1/2) { (gn)k1k2(w) + (gn)k1k3(w) (gn)k2k4(w) (Fn)k3,k4(w) },
(Cn)k1(w) = (gn)k1k2(w) (gn)k3k4(w) (Fn)k2k4,k3(w)
  - (1/2) (gn)k1k2(w) (gn)k3k4(w) (Ln)k2k3k4(w)
  - (1/2) (gn)k1k2(w) (gn)k3k4(w) (gn)k5k6(w) (Ln)k2k3k5(w) (Fn)k4,k6(w).
(A)k1k2(w), (B)k1k2(w), and (C)k1(w) are defined by the same equations as (An)k1k2(w), (Bn)k1k2(w), and (Cn)k1(w), using (g)k1k2(w), (F)k1,k2(w), and (F)k1k2,k3(w) instead of (gn)k1k2(w), (Fn)k1,k2(w), and (Fn)k1k2,k3(w).
Note: These functions do not depend on a candidate prior.
SLIDE 17 Notations V
For higher order analysis, the following are also necessary.
Φ(w) = φ(w)/φ0(w): the ratio of the candidate and fixed priors.
The mathematical relations between the priors φ(w) and φ0(w) are
Mn(φ,w) = (An)k1k2(w) (log Φ)k1 (log Φ)k2 + (Bn)k1k2(w) (log Φ)k1k2 + (Cn)k1(w) (log Φ)k1,
M(φ,w) = (A)k1k2(w) (log Φ)k1 (log Φ)k2 + (B)k1k2(w) (log Φ)k1k2 + (C)k1(w) (log Φ)k1.
Note: None of (A)k1k2(w), (B)k1k2(w), (C)k1(w), (An)k1k2(w), (Bn)k1k2(w), and (Cn)k1(w) depends on the candidate prior φ(w). A candidate prior affects only log Φ.
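Given the arrays (An), (Bn), (Cn) and the gradient and Hessian of log Φ at a point w, Mn(φ,w) is a single contraction; a sketch with hypothetical, precomputed inputs:

```python
# Sketch: evaluating Mn(phi, w) as written above.
# An, Bn are d x d arrays, Cn has length d; grad and hess are the first and
# second derivatives of log Phi = log(phi(w)/phi0(w)) at w (all assumed given).
import numpy as np

def M_n(An, Bn, Cn, grad, hess):
    return (np.einsum('ij,i,j->', An, grad, grad)   # (An)k1k2 (logPhi)k1 (logPhi)k2
            + np.einsum('ij,ij->', Bn, hess)        # (Bn)k1k2 (logPhi)k1k2
            + np.einsum('i,i->', Cn, grad))         # (Cn)k1 (logPhi)k1
```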
SLIDE 18
Theorem
(1) Mathematical relations asymptotically satisfy
Mn(φ, w*) = M(φ, w0) + Op(1/n^{1/2}),
E[ Mn(φ, w*) ] = M(φ, w0) + O(1/n),
where w* is the MAP estimator for φ0(w).
Note: Minimizing Mn(φ, w*) is asymptotically equivalent to minimizing E[ Mn(φ, w*) ] and M(φ, w0).
SLIDE 19
Theorem
(2) Cross validation asymptotically satisfies
CVn(φ) = CVn(φ0) + (1/n^2) Mn(φ, w*) + Op(1/n^3),
E[ CVn(φ) ] = E[ CVn(φ0) ] + (1/n^2) M(φ, w0) + O(1/n^3).
Note: Minimizing CVn(φ) is asymptotically equivalent to minimizing Mn(φ, w*).
Note: Minimizing CVn(φ) is asymptotically equivalent to minimizing E[ CVn(φ) ].
SLIDE 20
Theorem
(3) Generalization loss asymptotically satisfies
Gn(φ) = Gn(φ0) + Op(1/n^{3/2}),
E[ Gn(φ) ] = E[ Gn(φ0) ] + (1/n^2) M(φ, w0) + O(1/n^3).
Note: Minimizing CVn(φ) is not asymptotically equivalent to minimizing Gn(φ).
Note: Minimizing CVn(φ) is asymptotically equivalent to minimizing E[ Gn(φ) ].
Note: Minimizing Gn(φ) seems to be impossible if we do not know the true distribution.
Note: Minimizing E[ Gn(φ) ] can be performed by minimizing CVn(φ).
SLIDE 21
Contents
1. Basic Bayesian procedures
2. Main Theorem
3. Proof (arXiv:1503.07970)
4. Example
SLIDE 22
Contents
1. Basic Bayesian procedures
2. Main Theorem
3. Proof
4. Example
SLIDE 23
An Example
Model: p(x|s,m) = (s/2π)^{1/2} exp( -(s/2)(x-m)^2 ).
True distribution: q(x) = p(x|1,1).
Candidate prior: φ(s,m|μ,λ) = s^μ exp( -λs(m^2+1) ), where (μ,λ) is a set of hyperparameters.
Proper ⇔ μ > -1/2 and λ > 0.
Fixed prior: φ0(s,m) ≡ 1.
(m*, s*): the MAP estimator, which equals the MLE since φ0 ≡ 1.
SLIDE 24 An Example
At w* = (m*, s*), writing Mk = (1/n) Σ_{i=1}^n (Xi - m*)^k for the k-th empirical central moment:
(An)k1k2(w*) = [ 1/(2s*)      0
                 0            s*^2 ],
(Bn)k1k2(w*) = [ 1/s*         -s*^2 M3/2
                 -s*^2 M3/2   (s*^2/2)(1 + s*^2 M4) ],
(Cn)k1(w*) = ( 0, s* + s*^3 M3 ).
SLIDE 25 An Example
The mathematical relation between the priors φ(s,m|μ,λ) and φ0(s,m) ≡ 1 results in
Mn(φ, m*, s*) = (1/2) λ^2 s* m*^2 + ( -λs*m*^2/2 + μ - λs*/2 )^2 + ( -λs*m*^2/2 + μ/2 - λs*/2 )(1 + s*^2 M4).
Simulation: λ is fixed and μ is optimized.
SLIDE 26 Information Criteria
Importance sampling cross validation: ISCVn(μ) = (1/n) Σ_{i=1}^n log Ew[ 1/p(Xi|w) ].
Widely Applicable Information Criterion: WAICn(μ) = - (1/n) Σ_{i=1}^n log Ew[ p(Xi|w) ] + (1/n) Σ_{i=1}^n Vw[ log p(Xi|w) ].
Higher order CV: WAICRn(μ) = (1/n^2) Mn(φ, w*).
Minus log marginal likelihood: Fn(μ) = - log ∫ φ(w) Π_{i=1}^n p(Xi|w) dw + log ∫ φ(w) dw.
Deviance Information Criterion (Spiegelhalter et al.): DICn(μ) = (1/n) Σ_{i=1}^n log p(Xi| Ew[w] ) - (2/n) Σ_{i=1}^n log Ew[ p(Xi|w) ].
Generalization loss: Gn(μ) = - Ex[ log p(x|Xn) ], where Ex[ ] denotes the expectation over x ~ q(x).
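A simulation sketch for this example (illustrative; not the author’s original code). Since the prior has normal-gamma form, the posterior of (s,m) can be sampled exactly: s from a Gamma distribution, then m|s from a Normal; ISCVn(μ) and WAICn(μ) are then evaluated over a grid of μ at fixed λ. The values λ = 1, n = 100, and the μ grid are assumptions.

```python
# Model N(m, 1/s), true q = N(1,1), prior phi(s,m|mu,lam) = s^mu exp(-lam*s*(m^2+1)).
# Exact posterior sampling via the normal-gamma form of the prior.
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

rng = np.random.default_rng(0)

def posterior_draws(X, mu, lam, S=20000):
    n, xbar = len(X), X.mean()
    shape = mu + n / 2 + 0.5
    rate = lam + np.sum(X**2) / 2 - n**2 * xbar**2 / (2 * (n + 2 * lam))
    s = rng.gamma(shape, 1.0 / rate, size=S)                 # s | Xn
    m = rng.normal(n * xbar / (n + 2 * lam),                 # m | s, Xn
                   1.0 / np.sqrt(s * (n + 2 * lam)))
    return s, m

def criteria(X, mu, lam=1.0):
    s, m = posterior_draws(X, mu, lam)
    logp = norm.logpdf(X[:, None], loc=m[None, :], scale=1.0 / np.sqrt(s)[None, :])
    S = s.size
    iscv = np.mean(logsumexp(-logp, axis=1) - np.log(S))
    waic = (-np.mean(logsumexp(logp, axis=1) - np.log(S))
            + np.mean(np.var(logp, axis=1)))
    return iscv, waic

X = rng.normal(1.0, 1.0, size=100)
for mu in [-1.0, -0.5, 0.0, 0.5, 1.0]:
    print(mu, criteria(X, mu))   # the prior is improper for mu <= -1/2
```

Note that the posterior remains proper even for improper priors (μ ≤ -1/2), so ISCV and WAIC stay computable there, unlike the marginal likelihood.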
SLIDE 27 Simulation Results
[Figure: ISCVn(μ)-ISCVn(0), WAICn(μ)-WAICn(0), WAICRn(μ), Fn(μ)-Fn(0), DICn(μ)-DICn(0), and Gn(μ)-Gn(0), plotted against the hyperparameter μ; the region of improper priors (μ ≤ -1/2) is marked.]
SLIDE 28
Experimental Discussion
Model: p(x|s,m) = (s/2π)^{1/2} exp( -(s/2)(x-m)^2 ). Prior: φ(s,m|μ,λ) = s^μ exp( -λs(m^2+1) ). True: q(x) = p(x|1,1).
From the viewpoint of hyperparameter optimization:
(1) The variance of the random generalization loss is far larger than that of cross validation and the information criteria.
(2) φ(s,m|μ,λ) is improper at the optimal μ that minimizes the average generalization loss, so this optimum cannot be found by maximizing the marginal likelihood.
SLIDE 29 Conclusion
- 1. A higher order asymptotic theory of Bayesian cross validation is established.
- 2. The average generalization loss is minimized by minimizing the cross validation loss or WAIC.
- 3. The average generalization loss is not minimized by using the marginal likelihood or DIC.
- 4. The random generalization loss is not minimized by any criterion; this seems to be impossible.
SLIDE 30 Future Study
- 1. Understanding the results from the viewpoint of information geometry.
- 2. In singular models, choosing a prior often affects the first order statistics.