SLIDE 1

Asymptotic Bayesian Generalization Error when Training and Test Distributions are Different

Keisuke Yamazaki 1), Motoaki Kawanabe 2), Sumio Watanabe 1), Masashi Sugiyama 1), Klaus-Robert Müller 2),3)

1) Tokyo Institute of Technology 2) Fraunhofer FIRST, IDA 3) Technical University of Berlin

SLIDE 2

Summary of This Talk

Our target situation is non-regular models under covariate shift.

Non-regular models are a class of practical parametric models, including Gaussian mixtures, neural networks, and hidden Markov models.

Covariate shift is the setting where the training and test input distributions are different.

                  | regular           | non-regular
  standard        | statistics        | algebraic geometry
  covariate shift | importance weight | (our target)

SLIDE 3

Summary of Our Theoretical Results

Analytic expression of generalization error in large sample cases

Small order terms, which can be ignored in the absence of covariate shift, play an important role. In practice, however, these terms are difficult to analyze.

Upper bound of generalization error in small sample cases

Our bound is computable for any sample size, and it elucidates the worst-case generalization error.

SLIDE 4

Contents

  • 1. Explanation of the table
  • 2. Our results

                  | regular | non-regular
  standard        |         |
  covariate shift |         |

First, I'll explain the table; then I'll show our results.

SLIDE 5

Contents

  • 1. Explanation of the table
  • 2. Our results

                  | regular | non-regular
  standard        |         |
  covariate shift |         |

SLIDE 6

Regression Problem

Training phase: learn the input-output relation from training samples.

[Figure: training samples in the (x, y) plane]
  • The true function r(y|x) produces the training output values.
  • The input density q(x) generates the training input points.
  • The model is fitted to the training samples.

SLIDE 7

Regression Problem

Test phase: predict test output values at given test input points.

[Figure: test samples in the (x, y) plane]
  • The model is used for estimating the test output values.
  • The input density q(x) again generates the test input data.
  • We evaluate the test error (performance).

SLIDE 8

Input Distribution in the Standard Setting

The training and test distributions are the same: a single input density q(x) generates both the training and the test input data.

[Figure: training and test samples, both drawn from the same q(x)]

SLIDE 9

Input Distribution in Practical Situations

The training and test input distributions are …

[Figure: the training distribution q(x) over x]

SLIDE 10

Input Distribution in Practical Situations

The training and test input distributions are different!!!

This situation is called covariate shift. It appears in:
  • Bioinformatics [Baldi et al., 1998]
  • Econometrics [Heckman, 1979]
  • Brain-computer interfaces [Wolpaw et al., 2002]
  • etc.

[Figure: the training distribution vs. the test distribution over x]

SLIDE 11

Input Distribution in Practical Situations

The training and test input distributions are different!!!

Covariate shift appears in bioinformatics [Baldi et al., 1998], econometrics [Heckman, 1979], brain-computer interfaces [Wolpaw et al., 2002], etc.

Under covariate shift, the conditional distribution r(y|x) doesn't change; only the input density shifts, q(x) ⇒ q_1(x).

[Figure: the training distribution vs. the test distribution over x]

SLIDE 12

Input Distribution in Practical Situations

The training and test input distributions are different!!!

Due to the change of the data region, the performance also changes, and a standard technique does NOT work under covariate shift.

[Figure: the training distribution vs. the test distribution over x]

SLIDE 13

Contents

  • 1. Explanation of the table
  • 2. Our results

                  | regular | non-regular
  standard        |         |
  covariate shift |         |

SLIDE 14

Classes of Learning Models

Parametric models:
  • Regular: polynomial regression, linear models, etc.
  • Non-regular: neural networks, Gaussian mixtures, hidden Markov models, Bayesian networks, stochastic context-free grammars, etc.

Non-/semi-parametric models: SVM, etc.

Non-regular models have hierarchical structure or hidden variables, so it is important to analyze them.

SLIDE 15

Our Learning Method is Bayesian

Bayesian learning [parametric] constructs the predictive distribution as the average of the models over the posterior:

$$p(y \mid x, X^n, Y^n) = \int p(y \mid x, w)\, p(w \mid X^n, Y^n)\, dw$$

[Diagram: a spectrum of estimators, from frequentists' maximum likelihood through MAP to full Bayes]

SLIDE 16

Contents

  • 1. Parametric Bayesian framework
  • 2. Our results

                  | regular | non-regular
  standard        |         |
  covariate shift |         |

Our interest here is the generalization performance in each setting. Before looking at each case, let us define how to measure it.

SLIDE 17

How to Measure Generalization Performance

Kullback divergence (or log-loss):

$$D(p_1 \| p_2) = \int p_1(x) \log \frac{p_1(x)}{p_2(x)}\, dx$$

It measures the distance between densities:

$$p_1(x) = p_2(x) \iff D(p_1 \| p_2) = 0, \qquad p_1(x) \neq p_2(x) \iff D(p_1 \| p_2) > 0.$$

We use D(true function || predictive distribution), the Kullback divergence from the true distribution to the predictive distribution.
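As a quick illustration (not part of the talk), here is a minimal numeric check of this definition for two hypothetical 1-D Gaussian densities; the grid, means, and variances are arbitrary choices.

```python
# Minimal numeric sketch of D(p1 || p2) = \int p1 log(p1/p2) dx on a grid.
import numpy as np

def gauss(x, mu, s):
    return np.exp(-0.5 * ((x - mu) / s) ** 2) / (s * np.sqrt(2 * np.pi))

def kl_divergence(p1, p2, dx):
    # Riemann-sum approximation of the Kullback divergence
    return np.sum(p1 * np.log(p1 / p2)) * dx

x = np.linspace(-10.0, 10.0, 20001)
dx = x[1] - x[0]
p1, p2 = gauss(x, 0.0, 1.0), gauss(x, 1.0, 1.0)

print(kl_divergence(p1, p1, dx))  # identical densities: ~0
print(kl_divergence(p1, p2, dx))  # different densities: ~0.5 > 0
```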

SLIDE 18

Expected Kullback Divergence Is Our Generalization Error

$$G(n) = E_{X^n, Y^n}\!\left[ \int r(y \mid x)\, q(x) \log \frac{r(y \mid x)}{p(y \mid x, X^n, Y^n)}\, dx\, dy \right]$$

We take the expectation over all training samples (X^n, Y^n), so the generalization error G is a function of the training sample size n.
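To make the definition concrete, here is a hedged Monte-Carlo sketch of G(n) for a toy conjugate model of our own choosing (a 1-D Bayesian linear model with unit noise, true slope 0.5, and standard normal inputs); none of these choices come from the talk.

```python
# Monte-Carlo sketch of G(n): expected KL from the true conditional r(y|x)
# to the Bayes predictive p(y|x, X^n, Y^n), for a toy conjugate model:
#   r(y|x) = N(y; w*.x, 1),  p(y|x,w) = N(y; w.x, 1),  prior w ~ N(0, 1).
import numpy as np

rng = np.random.default_rng(0)
w_true = 0.5

def kl_gauss(mu1, var1, mu2, var2):
    # closed-form KL( N(mu1,var1) || N(mu2,var2) )
    return 0.5 * (np.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)

def gen_error(n, trials=1000, n_test=1000):
    errs = []
    for _ in range(trials):
        x = rng.normal(size=n)                    # training inputs from q(x)
        y = w_true * x + rng.normal(size=n)       # outputs from r(y|x)
        prec = 1.0 + np.sum(x * x)                # posterior precision of w
        m, v = np.sum(x * y) / prec, 1.0 / prec   # posterior mean and variance
        xt = rng.normal(size=n_test)              # test inputs, same q(x) here
        # the Bayes predictive at xt is N(m*xt, 1 + xt^2 * v)
        errs.append(np.mean(kl_gauss(w_true * xt, 1.0, m * xt, 1.0 + xt * xt * v)))
    return float(np.mean(errs))

for n in (10, 100, 1000):
    print(n, gen_error(n))  # decays roughly like S/n with S = 1/2 (one parameter)
```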

SLIDE 19

What Do We Want to Know?

Learning curve: the generalization error as a function of the sample size.

When the sample size n is sufficiently large,

$$G(n) = B + \frac{S}{n} + o\!\left(\frac{1}{n}\right),$$

where B is the bias and S is the speed of convergence:
  • B = 0 when the model can realize the true function, r(y|x) = p(y|x, \hat{w}) for some parameter \hat{w};
  • B > 0 when r(y|x) \neq p(y|x, w) for all w.

[Figure: learning curves, the value of the errors vs. the sample size n]
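The sketch below (again our own toy, not the talk's) computes the learning curve in closed form for a conjugate Gaussian-mean model and fits the asymptotic form G(n) = B + S/n by least squares; in this realizable, regular toy the fit should give B ≈ 0 and S ≈ 1/2.

```python
# Fit the asymptotic form G(n) = B + S/n to a closed-form toy learning curve.
# Assumed toy model: y ~ N(w, 1), prior w ~ N(0, 1), true w* = 0.5.
import numpy as np

w_star = 0.5

def G(n):
    v = 1.0 / (1.0 + n)                  # posterior variance of w
    e2 = (n + w_star ** 2) * v ** 2      # E[(w* - posterior mean)^2] over training sets
    # expected KL( N(w*, 1) || N(posterior mean, 1 + v) )
    return 0.5 * (np.log(1.0 + v) + (1.0 + e2) / (1.0 + v) - 1.0)

ns = np.array([10.0, 20.0, 50.0, 100.0, 200.0, 500.0])
A = np.stack([np.ones_like(ns), 1.0 / ns], axis=1)   # features (1, 1/n)
B_hat, S_hat = np.linalg.lstsq(A, G(ns), rcond=None)[0]
print(B_hat, S_hat)   # realizable and regular: B ~ 0, S ~ (dim of w)/2 = 0.5
```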

SLIDE 20

Contents

  • 1. Parametric Bayesian framework
  • 2. Our results

                  | regular | non-regular
  standard        |         |
  covariate shift |         |

Now, we take a careful look at each case separately.

SLIDE 21

Regular Models in the Standard Input Distribution

In statistics, the analysis has a long history, and the learning curve is well studied:

$$G(n) = B + \frac{S}{n} + o\!\left(\frac{1}{n}\right)$$

  • S = (dimension of the parameter space) / 2
  • B = distance from the true function to the optimal model

[Figure: the true function and the optimal model in the space of distributions]
SLIDE 22

Contents

  • 1. Parametric Bayesian framework
  • 2. Our results

                  | regular    | non-regular
  standard        | statistics |
  covariate shift |            |

SLIDE 23

Regular Models under Covariate Shift

The importance weight improves the generalization error [Shimodaira, 2000].

Importance weight: IW(x) = q_1(x) / q_0(x), the ratio of the test input density to the training input density. Weighting the loss by IW turns a training-distribution average into a test-distribution average:

$$\int q_0(x) \times IW(x) \times \mathrm{Loss}(x)\, dx = \int q_0(x) \times \frac{q_1(x)}{q_0(x)} \times \mathrm{Loss}(x)\, dx = \int q_1(x)\, \mathrm{Loss}(x)\, dx$$

$$G(n) = B + \frac{S}{n} + o\!\left(\frac{1}{n}\right)$$

  • B: the distance to the optimal model with respect to the test distribution.
  • S: the original speed of convergence plus a factor from the importance weight.
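A minimal Monte-Carlo sketch of the identity above, with two Gaussian input densities and a squared loss chosen purely for illustration:

```python
# Check E_{q1}[Loss] = E_{q0}[ (q1/q0) * Loss ] by Monte Carlo.
import numpy as np

rng = np.random.default_rng(1)

def gauss(x, mu, s):
    return np.exp(-0.5 * ((x - mu) / s) ** 2) / (s * np.sqrt(2 * np.pi))

q0 = lambda x: gauss(x, 0.0, 1.0)   # training input density (assumed)
q1 = lambda x: gauss(x, 1.0, 1.0)   # test input density (assumed)
loss = lambda x: x ** 2             # any per-input loss

x0 = rng.normal(0.0, 1.0, 200000)   # samples from q0
x1 = rng.normal(1.0, 1.0, 200000)   # samples from q1

iw = q1(x0) / q0(x0)                # importance weights
print(np.mean(loss(x1)))            # direct estimate of E_{q1}[Loss] (~2.0)
print(np.mean(iw * loss(x0)))       # importance-weighted estimate (~2.0)
```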

SLIDE 24

Contents

  • 1. Parametric Bayesian framework
  • 2. Our results

                  | regular           | non-regular
  standard        | statistics        |
  covariate shift | importance weight |

SLIDE 25

Non-Regular Models without Covariate Shift

Stochastic complexity: the average of the negative log marginal likelihood,

$$U(n) = -E_{X^n, Y^n}\!\left[ \log \int \prod_{i=1}^{n} p(y_i \mid x_i, w)\, \varphi(w)\, dw \right],$$

where n is the training data size and \varphi(w) is the prior. The marginal likelihood is used for model selection or for the optimization of the prior.

An asymptotic form of the stochastic complexity is

$$U(n) = a \log n + b \log\log n + o(\log\log n).$$
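As a rough illustration, the sketch below evaluates one draw of the negative log marginal likelihood inside U(n) by grid quadrature, for a hypothetical 1-D model and prior of our own choosing (the talk's non-regular models would need a multi-dimensional integral):

```python
# One draw of -log \int \prod_i p(y_i|x_i,w) phi(w) dw, by grid quadrature.
# Toy model (assumed): p(y|x,w) = N(y; w*x, 1), prior phi(w) = N(0, 1).
import numpy as np

rng = np.random.default_rng(2)
n = 50
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(size=n)            # data from a true slope of 0.5

w = np.linspace(-5.0, 5.0, 4001)            # parameter grid
dw = w[1] - w[0]
log_norm = lambda z: -0.5 * z ** 2 - 0.5 * np.log(2 * np.pi)

# log prod_i p(y_i|x_i,w) + log phi(w), for every grid value of w
log_joint = np.array([np.sum(log_norm(y - wi * x)) for wi in w]) + log_norm(w)

# log marginal likelihood via log-sum-exp for numerical stability
m = log_joint.max()
logZ = m + np.log(np.sum(np.exp(log_joint - m)) * dw)
print(-logZ)   # U(n) is the average of this quantity over training sets
```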

SLIDE 26

Analysis of Generalization Error in the Absence of Covariate Shift

According to the definition, G(n) = U(n+1) − U(n), so by a very simple subtraction

$$G(n) = U(n+1) - U(n) = a \log\frac{n+1}{n} + b\,\bigl(\log\log(n+1) - \log\log n\bigr) + \cdots$$

Generalization error:

$$G(n) = \frac{a}{n} + \frac{b}{n \log n} + o\!\left(\frac{1}{n \log n}\right),$$

which includes the regular cases.
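A quick numeric check of this subtraction (the constants a and b below are arbitrary):

```python
# Verify U(n+1) - U(n) ~ a/n + b/(n log n) for U(n) = a log n + b log log n.
import numpy as np

a, b = 1.5, -1.0
U = lambda n: a * np.log(n) + b * np.log(np.log(n))

for n in (1e2, 1e4, 1e6):
    exact = U(n + 1) - U(n)
    approx = a / n + b / (n * np.log(n))
    print(int(n), exact, approx)   # the two agree increasingly well as n grows
```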

SLIDE 27

Contents

  • 1. Parametric Bayesian framework
  • 2. Our results

A) Large sample cases B) Finite sample cases

                  | regular           | non-regular
  standard        | statistics        | stochastic complexity
  covariate shift | importance weight | (still open!)

The analysis for non-regular models under covariate shift is still open!!!

SLIDE 28

Kullback Divergence w.r.t. the Test Distribution

Standard case (test inputs from the training density q_0):

$$G_0(n) = E_{X^n, Y^n}\!\left[ \int r(y \mid x)\, q_0(x) \log \frac{r(y \mid x)}{p(y \mid x, X^n, Y^n)}\, dx\, dy \right]$$

Covariate shift (test inputs from the test density q_1):

$$G_1(n) = E_{X^n, Y^n}\!\left[ \int r(y \mid x)\, q_1(x) \log \frac{r(y \mid x)}{p(y \mid x, X^n, Y^n)}\, dx\, dy \right]$$

In both cases the training samples (X^n, Y^n) are drawn with q_0(x); only the input density inside the divergence changes.

SLIDE 29

Stochastic Complexity under Covariate Shift

We define the shifted stochastic complexity:

$$U_i(n+1) = -E_{X^n, Y^n}\, E^{(i)}_{x_{n+1}, y_{n+1}}\!\left[ \log \int \prod_{j=1}^{n+1} p(y_j \mid x_j, w)\, \varphi(w)\, dw \right],$$

where the expectation E^{(i)} draws the (n+1)-th input from q_i(x). The expectation of the test data is the only difference from the previous definition,

$$U_0(n) = -E_{X^n, Y^n}\!\left[ \log \int \prod_{j=1}^{n} p(y_j \mid x_j, w)\, \varphi(w)\, dw \right]$$

(n: the training data size).

SLIDE 30

Following the previous study, we make an assumption: an asymptotic form of the stochastic complexity is

$$U_i(n) \cong a_i \log n + b_i \log\log n + c_i + \frac{d_i}{n} + \cdots$$

(n: the training data size). The previous assumption was

$$U(n) = a \log n + b \log\log n + o(\log\log n).$$

SLIDE 31

We Obtain an Analytic Expression of the Generalization Error by Subtraction

According to the definition,

$$G_1(n) = U_1(n+1) - U_0(n).$$

Substituting the asymptotic forms

$$U_1(n+1) = a_1 \log(n+1) + b_1 \log\log(n+1) + c_1 + \frac{d_1}{n+1} + \cdots,$$

$$U_0(n) = a_0 \log n + b_0 \log\log n + c_0 + \frac{d_0}{n} + \cdots$$

gives

$$G_1(n) = (a_1 - a_0)\log n + (b_1 - b_0)\log\log n + (c_1 - c_0) + \frac{a_1 + d_1 - d_0}{n} + o\!\left(\frac{1}{n}\right).$$

Based on a property of the learning curve, the expression can be simplified.

SLIDE 32

Small Order Terms Cannot Be Ignored

A property of the learning curve (it must converge) gives a_1 = a_0 and b_1 = b_0, so the leading terms cancel. In the standard asymptotic analysis the small order terms c_i and d_i of

$$U_i(n) \cong a_i \log n + b_i \log\log n + c_i + \frac{d_i}{n} + \cdots$$

are ignored, since without covariate shift G(n) ≅ a/n + b/(n log n). Under covariate shift, however, they determine the error:

Theorem 1

$$G_1(n) = (c_1 - c_0) + \frac{a_1 + d_1 - d_0}{n} + o\!\left(\frac{1}{n}\right)$$

SLIDE 33

Evaluation of Small Order Terms is Difficult!!!!

A simple neural network example: the learning function is y = a tanh(bx), fitted to a fixed true function.

[Figure: the true function and the learning function in the (x, y) plane]

Evaluating the small order terms is very hard even in very simple settings.

SLIDE 34

Contents

  • 1. Parametric Bayesian framework
  • 2. Our results

A) Large sample cases B) Finite sample cases

                  | regular           | non-regular
  standard        | statistics        | stochastic complexity
  covariate shift | importance weight |

SLIDE 35

We Obtain a Finite-Sample Upper Bound

Theorem 2

$$G_1(n) \le M\, G_0(n), \qquad M = \max_x \frac{q_1(x)}{q_0(x)} < \infty,$$

where M is the maximum ratio of the input densities.

The upper bound can be easily computed!!! We can overcome the difficulty in the previous theorem.
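For instance, with the two hypothetical Gaussian densities below (a wide training density and a narrower test density, so that M stays finite), the constant in Theorem 2 can be found by a simple grid maximization:

```python
# Compute M = max_x q1(x)/q0(x) on a grid, for assumed Gaussian densities.
import numpy as np

def gauss(x, mu, s):
    return np.exp(-0.5 * ((x - mu) / s) ** 2) / (s * np.sqrt(2 * np.pi))

q0 = lambda x: gauss(x, 0.0, 2.0)   # training density: wide (assumed)
q1 = lambda x: gauss(x, 1.0, 1.0)   # test density: narrow, so the ratio stays bounded

x = np.linspace(-20.0, 20.0, 200001)
M = np.max(q1(x) / q0(x))
print(M)   # plug into G_1(n) <= M * G_0(n) for a computable worst-case bound
```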

SLIDE 36

We Can Obtain the Worst-Case Learning Curve

Back to the previous example: the learning function y = a tanh(bx) fitted to a fixed true function.

[Figure: the value of the errors vs. the sample size]
  • G_1(n): error under covariate shift
  • G_0(n): error without shift
  • M G_0(n): upper bound

SLIDE 37

Conclusions

We analyzed the Bayesian generalization error of non-regular models (Gaussian mixtures, HMMs, neural networks, etc.) under covariate shift, i.e., a change of the input distribution.

We proved that small order terms of the stochastic complexity, which can usually be ignored, play important roles; directly evaluating the generalization error, however, is very hard.

We derived a computable finite-sample upper bound, which elucidates the worst-case generalization error.