SLIDE 1

Asymptotic Bayesian Generalization Error when Training and Test Distributions are Different

Keisuke Yamazaki 1), Motoaki Kawanabe 2), Sumio Watanabe 1), Masashi Sugiyama 1), Klaus-Robert Müller 2),3)

1) Tokyo Institute of Technology 2) Fraunhofer FIRST, IDA 3) Technical University of Berlin

SLIDE 2

Summary of This Talk

Our target situation is non-regular models under covariate shift.

Non-regular models are a class of practical parametric models, including Gaussian mixtures, neural networks, and hidden Markov models.

Covariate shift is the setting where the training and test input distributions are different.

                  | regular           | non-regular
  standard        | statistics        | algebraic geometry
  covariate shift | importance weight | (our target)

SLIDE 3

Summary of Our Theoretical Results

Analytic expression of generalization error in large sample cases

Small order terms, which can be ignored in the absence of covariate shift, play an important role. In practice, however, these terms are difficult to analyze.

Upper bound of generalization error in small sample cases

Our bound is computable for any sample size, and it elucidates the worst-case generalization error.

SLIDE 4

Contents

  • 1. Explanation of the table
  • 2. Our results

                  | regular | non-regular
  standard        |         |
  covariate shift |         |

First, I'll explain the table; then I'll show our results.

SLIDE 5

Contents

  • 1. Explanation of the table
  • 2. Our results

                  | regular | non-regular
  standard        |         |
  covariate shift |         |

SLIDE 6

Regression Problem

Training phase: learn the input-output relation from training samples.

[Figure: training samples in the (x, y) plane]
  • The true function r(y|x) produces the training output values.
  • The input density q(x) generates the training input points.
  • The model is fitted to the training samples.

SLIDE 7

Regression Problem

Test phase: predict test output values at given test input points.

[Figure: test samples in the (x, y) plane]
  • The model is used for estimating the test output values.
  • The input density q(x) again generates the test input data.
  • We evaluate the test error (performance).

SLIDE 8

Input Distribution in the Standard Setting

The training and test distributions are the same: a single input density q(x) generates both the training and the test input data.

[Figure: training and test samples, both drawn from the same q(x)]

SLIDE 9

Input Distribution in Practical Situations

The training and test input distributions are …

[Figure: the training distribution q(x) over x]

SLIDE 10

Input Distribution in Practical Situations

The training and test input distributions are different!!!

This situation is called covariate shift. It appears in:
  • Bioinformatics [Baldi et al., 1998]
  • Econometrics [Heckman, 1979]
  • Brain-computer interfaces [Wolpaw et al., 2002]
  • etc.

[Figure: the training distribution vs. the test distribution over x]

SLIDE 11

Input Distribution in Practical Situations

The training and test input distributions are different!!!

Covariate shift appears in bioinformatics [Baldi et al., 1998], econometrics [Heckman, 1979], brain-computer interfaces [Wolpaw et al., 2002], etc.

Under covariate shift, the conditional distribution r(y|x) doesn't change; only the input density shifts, q(x) ⇒ q_1(x).

[Figure: the training distribution vs. the test distribution over x]

SLIDE 12

Input Distribution in Practical Situations

The training and test input distributions are different!!!

Due to the change of the data region, the performance also changes, and a standard technique does NOT work under covariate shift.

[Figure: the training distribution vs. the test distribution over x]

SLIDE 13

Contents

  • 1. Explanation of the table
  • 2. Our results

                  | regular | non-regular
  standard        |         |
  covariate shift |         |

SLIDE 14

Classes of Learning Models

Parametric models:
  • Regular: polynomial regression, linear models, etc.
  • Non-regular: neural networks, Gaussian mixtures, hidden Markov models, Bayesian networks, stochastic context-free grammars, etc.

Non-/semi-parametric models: SVM, etc.

Non-regular models have hierarchical structure or hidden variables, so it is important to analyze them.

SLIDE 15

Our Learning Method is Bayesian

Bayesian learning [parametric] constructs the predictive distribution as the average of the models over the posterior:

$$p(y \mid x, X^n, Y^n) = \int p(y \mid x, w)\, p(w \mid X^n, Y^n)\, dw$$

[Diagram: a spectrum of estimators, from frequentists' maximum likelihood through MAP to full Bayes]

SLIDE 16

Contents

  • 1. Parametric Bayesian framework
  • 2. Our results

                  | regular | non-regular
  standard        |         |
  covariate shift |         |

Our interest here is the generalization performance in each setting. Before looking at each case, let us define how to measure it.

SLIDE 17

How to Measure Generalization Performance

Kullback divergence (or log-loss):

$$D(p_1 \| p_2) = \int p_1(x) \log \frac{p_1(x)}{p_2(x)}\, dx$$

It measures the distance between densities:

$$p_1(x) = p_2(x) \iff D(p_1 \| p_2) = 0, \qquad p_1(x) \neq p_2(x) \iff D(p_1 \| p_2) > 0.$$

We use D(true function || predictive distribution), the Kullback divergence from the true distribution to the predictive distribution.
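As a quick illustration (not part of the talk), here is a minimal numeric check of this definition for two hypothetical 1-D Gaussian densities; the grid, means, and variances are arbitrary choices.

```python
# Minimal numeric sketch of D(p1 || p2) = \int p1 log(p1/p2) dx on a grid.
import numpy as np

def gauss(x, mu, s):
    return np.exp(-0.5 * ((x - mu) / s) ** 2) / (s * np.sqrt(2 * np.pi))

def kl_divergence(p1, p2, dx):
    # Riemann-sum approximation of the Kullback divergence
    return np.sum(p1 * np.log(p1 / p2)) * dx

x = np.linspace(-10.0, 10.0, 20001)
dx = x[1] - x[0]
p1, p2 = gauss(x, 0.0, 1.0), gauss(x, 1.0, 1.0)

print(kl_divergence(p1, p1, dx))  # identical densities: ~0
print(kl_divergence(p1, p2, dx))  # different densities: ~0.5 > 0
```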

SLIDE 18

Expected Kullback Divergence Is Our Generalization Error

$$G(n) = E_{X^n, Y^n}\!\left[ \int r(y \mid x)\, q(x) \log \frac{r(y \mid x)}{p(y \mid x, X^n, Y^n)}\, dx\, dy \right]$$

We take the expectation over all training samples (X^n, Y^n), so the generalization error G is a function of the training sample size n.
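To make the definition concrete, here is a hedged Monte-Carlo sketch of G(n) for a toy conjugate model of our own choosing (a 1-D Bayesian linear model with unit noise, true slope 0.5, and standard normal inputs); none of these choices come from the talk.

```python
# Monte-Carlo sketch of G(n): expected KL from the true conditional r(y|x)
# to the Bayes predictive p(y|x, X^n, Y^n), for a toy conjugate model:
#   r(y|x) = N(y; w*.x, 1),  p(y|x,w) = N(y; w.x, 1),  prior w ~ N(0, 1).
import numpy as np

rng = np.random.default_rng(0)
w_true = 0.5

def kl_gauss(mu1, var1, mu2, var2):
    # closed-form KL( N(mu1,var1) || N(mu2,var2) )
    return 0.5 * (np.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)

def gen_error(n, trials=1000, n_test=1000):
    errs = []
    for _ in range(trials):
        x = rng.normal(size=n)                    # training inputs from q(x)
        y = w_true * x + rng.normal(size=n)       # outputs from r(y|x)
        prec = 1.0 + np.sum(x * x)                # posterior precision of w
        m, v = np.sum(x * y) / prec, 1.0 / prec   # posterior mean and variance
        xt = rng.normal(size=n_test)              # test inputs, same q(x) here
        # the Bayes predictive at xt is N(m*xt, 1 + xt^2 * v)
        errs.append(np.mean(kl_gauss(w_true * xt, 1.0, m * xt, 1.0 + xt * xt * v)))
    return float(np.mean(errs))

for n in (10, 100, 1000):
    print(n, gen_error(n))  # decays roughly like S/n with S = 1/2 (one parameter)
```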

SLIDE 19

What Do We Want to Know?

Learning curve: the generalization error as a function of the sample size.

When the sample size n is sufficiently large,

$$G(n) = B + \frac{S}{n} + o\!\left(\frac{1}{n}\right),$$

where B is the bias and S is the speed of convergence:
  • B = 0 when the model can realize the true function, r(y|x) = p(y|x, \hat{w}) for some parameter \hat{w};
  • B > 0 when r(y|x) \neq p(y|x, w) for all w.

[Figure: learning curves, the value of the errors vs. the sample size n]
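The sketch below (again our own toy, not the talk's) computes the learning curve in closed form for a conjugate Gaussian-mean model and fits the asymptotic form G(n) = B + S/n by least squares; in this realizable, regular toy the fit should give B ≈ 0 and S ≈ 1/2.

```python
# Fit the asymptotic form G(n) = B + S/n to a closed-form toy learning curve.
# Assumed toy model: y ~ N(w, 1), prior w ~ N(0, 1), true w* = 0.5.
import numpy as np

w_star = 0.5

def G(n):
    v = 1.0 / (1.0 + n)                  # posterior variance of w
    e2 = (n + w_star ** 2) * v ** 2      # E[(w* - posterior mean)^2] over training sets
    # expected KL( N(w*, 1) || N(posterior mean, 1 + v) )
    return 0.5 * (np.log(1.0 + v) + (1.0 + e2) / (1.0 + v) - 1.0)

ns = np.array([10.0, 20.0, 50.0, 100.0, 200.0, 500.0])
A = np.stack([np.ones_like(ns), 1.0 / ns], axis=1)   # features (1, 1/n)
B_hat, S_hat = np.linalg.lstsq(A, G(ns), rcond=None)[0]
print(B_hat, S_hat)   # realizable and regular: B ~ 0, S ~ (dim of w)/2 = 0.5
```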

SLIDE 20

Contents

  • 1. Parametric Bayesian framework
  • 2. Our results

                  | regular | non-regular
  standard        |         |
  covariate shift |         |

Now, we take a careful look at each case separately.

SLIDE 21

Regular Models in the Standard Input Distribution

In statistics, the analysis has a long history, and the learning curve is well studied:

$$G(n) = B + \frac{S}{n} + o\!\left(\frac{1}{n}\right)$$

  • S = (dimension of the parameter space) / 2
  • B = distance from the true function to the optimal model

[Figure: the true function and the optimal model in the space of distributions]
SLIDE 22

Contents

  • 1. Parametric Bayesian framework
  • 2. Our results

                  | regular    | non-regular
  standard        | statistics |
  covariate shift |            |

SLIDE 23

Regular Models under Covariate Shift

The importance weight improves the generalization error [Shimodaira, 2000].

Importance weight: IW(x) = q_1(x) / q_0(x), the ratio of the test input density to the training input density. Weighting the loss by IW turns a training-distribution average into a test-distribution average:

$$\int q_0(x) \times IW(x) \times \mathrm{Loss}(x)\, dx = \int q_0(x) \times \frac{q_1(x)}{q_0(x)} \times \mathrm{Loss}(x)\, dx = \int q_1(x)\, \mathrm{Loss}(x)\, dx$$

$$G(n) = B + \frac{S}{n} + o\!\left(\frac{1}{n}\right)$$

  • B: the distance to the optimal model with respect to the test distribution.
  • S: the original speed of convergence plus a factor from the importance weight.
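A minimal Monte-Carlo sketch of the identity above, with two Gaussian input densities and a squared loss chosen purely for illustration:

```python
# Check E_{q1}[Loss] = E_{q0}[ (q1/q0) * Loss ] by Monte Carlo.
import numpy as np

rng = np.random.default_rng(1)

def gauss(x, mu, s):
    return np.exp(-0.5 * ((x - mu) / s) ** 2) / (s * np.sqrt(2 * np.pi))

q0 = lambda x: gauss(x, 0.0, 1.0)   # training input density (assumed)
q1 = lambda x: gauss(x, 1.0, 1.0)   # test input density (assumed)
loss = lambda x: x ** 2             # any per-input loss

x0 = rng.normal(0.0, 1.0, 200000)   # samples from q0
x1 = rng.normal(1.0, 1.0, 200000)   # samples from q1

iw = q1(x0) / q0(x0)                # importance weights
print(np.mean(loss(x1)))            # direct estimate of E_{q1}[Loss] (~2.0)
print(np.mean(iw * loss(x0)))       # importance-weighted estimate (~2.0)
```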

SLIDE 24

Contents

  • 1. Parametric Bayesian framework
  • 2. Our results

                  | regular           | non-regular
  standard        | statistics        |
  covariate shift | importance weight |

SLIDE 25

Non-Regular Models without Covariate Shift

Stochastic complexity: the average of the negative log marginal likelihood,

$$U(n) = -E_{X^n, Y^n}\!\left[ \log \int \prod_{i=1}^{n} p(y_i \mid x_i, w)\, \varphi(w)\, dw \right],$$

where n is the training data size and \varphi(w) is the prior. The marginal likelihood is used for model selection or for the optimization of the prior.

An asymptotic form of the stochastic complexity is

$$U(n) = a \log n + b \log\log n + o(\log\log n).$$
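As a rough illustration, the sketch below evaluates one draw of the negative log marginal likelihood inside U(n) by grid quadrature, for a hypothetical 1-D model and prior of our own choosing (the talk's non-regular models would need a multi-dimensional integral):

```python
# One draw of -log \int \prod_i p(y_i|x_i,w) phi(w) dw, by grid quadrature.
# Toy model (assumed): p(y|x,w) = N(y; w*x, 1), prior phi(w) = N(0, 1).
import numpy as np

rng = np.random.default_rng(2)
n = 50
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(size=n)            # data from a true slope of 0.5

w = np.linspace(-5.0, 5.0, 4001)            # parameter grid
dw = w[1] - w[0]
log_norm = lambda z: -0.5 * z ** 2 - 0.5 * np.log(2 * np.pi)

# log prod_i p(y_i|x_i,w) + log phi(w), for every grid value of w
log_joint = np.array([np.sum(log_norm(y - wi * x)) for wi in w]) + log_norm(w)

# log marginal likelihood via log-sum-exp for numerical stability
m = log_joint.max()
logZ = m + np.log(np.sum(np.exp(log_joint - m)) * dw)
print(-logZ)   # U(n) is the average of this quantity over training sets
```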

SLIDE 26

Analysis of Generalization Error in the Absence of Covariate Shift

According to the definition, G(n) = U(n+1) − U(n), so by a very simple subtraction

$$G(n) = U(n+1) - U(n) = a \log\frac{n+1}{n} + b\,\bigl(\log\log(n+1) - \log\log n\bigr) + \cdots$$

Generalization error:

$$G(n) = \frac{a}{n} + \frac{b}{n \log n} + o\!\left(\frac{1}{n \log n}\right),$$

which includes the regular cases.
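A quick numeric check of this subtraction (the constants a and b below are arbitrary):

```python
# Verify U(n+1) - U(n) ~ a/n + b/(n log n) for U(n) = a log n + b log log n.
import numpy as np

a, b = 1.5, -1.0
U = lambda n: a * np.log(n) + b * np.log(np.log(n))

for n in (1e2, 1e4, 1e6):
    exact = U(n + 1) - U(n)
    approx = a / n + b / (n * np.log(n))
    print(int(n), exact, approx)   # the two agree increasingly well as n grows
```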

SLIDE 27

Contents

  • 1. Parametric Bayesian framework
  • 2. Our results

A) Large sample cases B) Finite sample cases

                  | regular           | non-regular
  standard        | statistics        | stochastic complexity
  covariate shift | importance weight | (still open!)

The analysis for non-regular models under covariate shift is still open!!!

SLIDE 28

Kullback Divergence w.r.t. the Test Distribution

Standard case (test inputs from the training density q_0):

$$G_0(n) = E_{X^n, Y^n}\!\left[ \int r(y \mid x)\, q_0(x) \log \frac{r(y \mid x)}{p(y \mid x, X^n, Y^n)}\, dx\, dy \right]$$

Covariate shift (test inputs from the test density q_1):

$$G_1(n) = E_{X^n, Y^n}\!\left[ \int r(y \mid x)\, q_1(x) \log \frac{r(y \mid x)}{p(y \mid x, X^n, Y^n)}\, dx\, dy \right]$$

In both cases the training samples (X^n, Y^n) are drawn with q_0(x); only the input density inside the divergence changes.

SLIDE 29

Stochastic Complexity under Covariate Shift

We define the shifted stochastic complexity:

$$U_i(n+1) = -E_{X^n, Y^n}\, E^{(i)}_{x_{n+1}, y_{n+1}}\!\left[ \log \int \prod_{j=1}^{n+1} p(y_j \mid x_j, w)\, \varphi(w)\, dw \right],$$

where the expectation E^{(i)} draws the (n+1)-th input from q_i(x). The expectation of the test data is the only difference from the previous definition,

$$U_0(n) = -E_{X^n, Y^n}\!\left[ \log \int \prod_{j=1}^{n} p(y_j \mid x_j, w)\, \varphi(w)\, dw \right]$$

(n: the training data size).

SLIDE 30

Following the previous study, we make an assumption: an asymptotic form of the stochastic complexity is

$$U_i(n) \cong a_i \log n + b_i \log\log n + c_i + \frac{d_i}{n} + \cdots$$

(n: the training data size). The previous assumption was

$$U(n) = a \log n + b \log\log n + o(\log\log n).$$

SLIDE 31

We Obtain an Analytic Expression of the Generalization Error by Subtraction

According to the definition,

$$G_1(n) = U_1(n+1) - U_0(n).$$

Substituting the asymptotic forms

$$U_1(n+1) = a_1 \log(n+1) + b_1 \log\log(n+1) + c_1 + \frac{d_1}{n+1} + \cdots,$$

$$U_0(n) = a_0 \log n + b_0 \log\log n + c_0 + \frac{d_0}{n} + \cdots$$

gives

$$G_1(n) = (a_1 - a_0)\log n + (b_1 - b_0)\log\log n + (c_1 - c_0) + \frac{a_1 + d_1 - d_0}{n} + o\!\left(\frac{1}{n}\right).$$

Based on a property of the learning curve, the expression can be simplified.

SLIDE 32

Small Order Terms Cannot Be Ignored

A property of the learning curve (it must converge) gives a_1 = a_0 and b_1 = b_0, so the leading terms cancel. In the standard asymptotic analysis the small order terms c_i and d_i of

$$U_i(n) \cong a_i \log n + b_i \log\log n + c_i + \frac{d_i}{n} + \cdots$$

are ignored, since without covariate shift G(n) ≅ a/n + b/(n log n). Under covariate shift, however, they determine the error:

Theorem 1

$$G_1(n) = (c_1 - c_0) + \frac{a_1 + d_1 - d_0}{n} + o\!\left(\frac{1}{n}\right)$$

SLIDE 33

Evaluation of Small Order Terms is Difficult!!!!

A simple neural network example: the learning function is y = a tanh(bx), fitted to a fixed true function.

[Figure: the true function and the learning function in the (x, y) plane]

Evaluating the small order terms is very hard even in very simple settings.

SLIDE 34

Contents

  • 1. Parametric Bayesian framework
  • 2. Our results

A) Large sample cases B) Finite sample cases

                  | regular           | non-regular
  standard        | statistics        | stochastic complexity
  covariate shift | importance weight |

SLIDE 35

We Obtain a Finite-Sample Upper Bound

Theorem 2

$$G_1(n) \le M\, G_0(n), \qquad M = \max_x \frac{q_1(x)}{q_0(x)} < \infty,$$

where M is the maximum ratio of the input densities.

The upper bound can be easily computed!!! We can overcome the difficulty in the previous theorem.
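For instance, with the two hypothetical Gaussian densities below (a wide training density and a narrower test density, so that M stays finite), the constant in Theorem 2 can be found by a simple grid maximization:

```python
# Compute M = max_x q1(x)/q0(x) on a grid, for assumed Gaussian densities.
import numpy as np

def gauss(x, mu, s):
    return np.exp(-0.5 * ((x - mu) / s) ** 2) / (s * np.sqrt(2 * np.pi))

q0 = lambda x: gauss(x, 0.0, 2.0)   # training density: wide (assumed)
q1 = lambda x: gauss(x, 1.0, 1.0)   # test density: narrow, so the ratio stays bounded

x = np.linspace(-20.0, 20.0, 200001)
M = np.max(q1(x) / q0(x))
print(M)   # plug into G_1(n) <= M * G_0(n) for a computable worst-case bound
```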

SLIDE 36

We Can Obtain the Worst-Case Learning Curve

Back to the previous example: the learning function y = a tanh(bx) fitted to a fixed true function.

[Figure: the value of the errors vs. the sample size]
  • G_1(n): error under covariate shift
  • G_0(n): error without shift
  • M G_0(n): upper bound

SLIDE 37

Conclusions

We analyzed the Bayesian generalization error of non-regular models (Gaussian mixtures, HMMs, neural networks, etc.) under covariate shift, i.e., a change of the input distribution.

We proved that small order terms of the stochastic complexity, which can usually be ignored, play important roles; directly evaluating the generalization error, however, is very hard.

We derived a computable finite-sample upper bound, which elucidates the worst-case generalization error.