SLIDE 1

Model Selection with Small Samples

Masashi Sugiyama and Hidemitsu Ogawa
Department of Computer Science, Tokyo Institute of Technology, Japan
ICANNGA2001, April 25, 2001

SLIDE 2

Supervised Learning

From training examples $\{(x_m, y_m)\}_{m=1}^{M}$, where

$$y_m = f(x_m) + \epsilon_m$$

($\epsilon_m$: noise with mean $0$ and variance $\sigma^2$), obtain a learning result $\hat{f}(x)$ that minimizes the generalization error $J_G$:

$$J_G = E \left[ \int \big( \hat{f}(u) - f(u) \big)^2 p(u) \, du \right]$$

$E$: expectation over the noise, $p(u)$: density of future test input points $u$, $f(x)$: target function, $\hat{f}(x)$: learning result

SLIDE 3

For the Time Being, We Assume...

The target function $f(x)$ is a linear combination of specified basis functions $\{\varphi_i(x)\}_{i=1}^{\mu}$:

$$f(x) = \sum_{i=1}^{\mu} \theta_i \, \varphi_i(x)$$

The correlation matrix $U$ of the future input is known:

$$U_{ij} = \int \varphi_i(u) \, \varphi_j(u) \, p(u) \, du$$

Later, we will discuss the case when these assumptions do not hold.

SLIDE 4

Subset Regression Models

$S$: subset of the indices $\{1, 2, \ldots, \mu\}$ of the basis functions ($\mu$: number of basis functions)

$$\hat{f}_S(x) = \sum_{i \in S} [\hat{\theta}_S]_i \, \varphi_i(x)$$

$\hat{\theta}_S$ is determined so that the training error

$$\sum_{m=1}^{M} \big( \hat{f}_S(x_m) - y_m \big)^2$$

is minimized:

$$\hat{\theta}_S = X_S y, \quad X_S = (A_S^\top A_S)^{\dagger} A_S^\top, \quad y = (y_1, y_2, \ldots, y_M)^\top$$

$$[A_S]_{mi} = \begin{cases} \varphi_i(x_m) & i \in S \\ 0 & i \notin S \end{cases}$$

($\dagger$: Moore-Penrose generalized inverse, needed since $A_S$ has zero columns for $i \notin S$)
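As a concrete illustration (not part of the original slides), here is a minimal numpy sketch of the subset estimator; the names `fit_subset`, `Phi`, and `S` are hypothetical:

```python
import numpy as np

def fit_subset(Phi, y, S):
    """Least-squares fit restricted to the basis subset S.

    Phi: (M, mu) design matrix with Phi[m, i] = phi_i(x_m).
    y:   (M,) vector of training outputs.
    S:   iterable of column indices kept in the model.
    Returns the mu-dimensional theta_hat_S (zero outside S).
    """
    A_S = np.zeros_like(Phi)
    A_S[:, list(S)] = Phi[:, list(S)]   # zero out columns outside S
    # theta_hat_S = X_S y with X_S the Moore-Penrose pseudo-inverse of A_S
    return np.linalg.pinv(A_S) @ y
```

With `S = range(mu)` this reduces to the largest-model estimator $\hat{\theta}_u$ introduced on slide 6.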

SLIDE 5

Model Selection

Select the best subset $S$ of basis functions so that the generalization error $J_G$ is minimized:

$$J_G = E \left[ \int \big( \hat{f}(u) - f(u) \big)^2 p(u) \, du \right]$$

However, $J_G$ includes the unknown target function $f(x)$. We derive an estimate of $J_G$ called the subspace information criterion (SIC), and the model is determined so that SIC is minimized.

SLIDE 6

Key Idea: Unbiased Estimate

$\hat{\theta}_u$: minimum training error estimate for the largest model

$$\hat{f}_u(x) = \sum_{i=1}^{\mu} [\hat{\theta}_u]_i \, \varphi_i(x)$$

$$\hat{\theta}_u = X_u y, \quad X_u = (A^\top A)^{-1} A^\top, \quad [A]_{mi} = \varphi_i(x_m), \quad y = (y_1, y_2, \ldots, y_M)^\top$$

$\hat{\theta}_u$ is an unbiased estimate of the true parameter $\theta$ ($E$: expectation over the noise):

$$E \hat{\theta}_u = \theta$$

$\hat{\theta}_u$ is used for estimating the generalization error of $\hat{\theta}_S$.
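A quick Monte Carlo sanity check of the unbiasedness claim, under an assumed toy basis; all names and constants here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
M, mu, sigma = 100, 5, 0.3
x = rng.uniform(-np.pi, np.pi, M)
# Hypothetical toy basis: {1, sin x, cos x, sin 2x, cos 2x}
Phi = np.column_stack([np.ones(M), np.sin(x), np.cos(x),
                       np.sin(2 * x), np.cos(2 * x)])
theta = rng.normal(size=mu)        # "true" parameter of the toy target

X_u = np.linalg.pinv(Phi)          # equals (A^T A)^{-1} A^T at full column rank
# Average theta_hat_u over many noise draws; it should approach theta.
avg = np.mean([X_u @ (Phi @ theta + sigma * rng.normal(size=M))
               for _ in range(10_000)], axis=0)
print(np.max(np.abs(avg - theta))) # close to 0
```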

SLIDE 7

Bias / Variance Decomposition

$$J_G = E \left[ \int \big( \hat{f}_S(u) - f(u) \big)^2 p(u) \, du \right] = E \| \hat{\theta}_S - \theta \|_U^2$$

$$= \underbrace{\| E \hat{\theta}_S - \theta \|_U^2}_{J_B[S] \ \text{(bias)}} + \underbrace{E \| \hat{\theta}_S - E \hat{\theta}_S \|_U^2}_{J_V[S] \ \text{(variance)}}$$

$\| \theta \|_U^2 = \theta^\top U \theta$, $U_{ij} = \int \varphi_i(u) \, \varphi_j(u) \, p(u) \, du$, $E$: expectation over the noise
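The decomposition is the standard orthogonality argument; writing $\delta = \hat{\theta}_S - E\hat{\theta}_S$, which has zero mean over the noise, the cross term vanishes:

$$E \| \hat{\theta}_S - \theta \|_U^2 = E \| \delta + (E\hat{\theta}_S - \theta) \|_U^2 = E \| \delta \|_U^2 + 2 \langle U E\delta, \, E\hat{\theta}_S - \theta \rangle + \| E\hat{\theta}_S - \theta \|_U^2$$

and $E\delta = 0$ leaves exactly $J_V[S] + J_B[S]$.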

SLIDE 8

Unbiased Estimate of Variance

The variance term can be expressed as

$$J_V[S] = E \| \hat{\theta}_S - E \hat{\theta}_S \|_U^2 = \sigma^2 \, \mathrm{trace}\big( U X_S X_S^\top \big)$$

so an unbiased estimate is obtained by replacing $\sigma^2$ with $\hat{\sigma}^2$:

$$\hat{J}_V[S] = \hat{\sigma}^2 \, \mathrm{trace}\big( U X_S X_S^\top \big), \quad E \hat{J}_V = J_V$$

$$\hat{\sigma}^2 = \frac{\| A \hat{\theta}_u - y \|^2}{M - \mu}$$

($\hat{\theta}_S = X_S y$, $y = (y_1, y_2, \ldots, y_M)^\top$, $E$: expectation over the noise)
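A sketch of both estimates in numpy, following the formulas above (helper names are hypothetical; `Phi` is the full design matrix as before):

```python
import numpy as np

def sigma2_hat(Phi, y):
    """Noise variance estimate  ||A theta_hat_u - y||^2 / (M - mu)."""
    M, mu = Phi.shape
    r = Phi @ (np.linalg.pinv(Phi) @ y) - y
    return (r @ r) / (M - mu)

def J_V_hat(Phi, y, S, U):
    """Unbiased variance estimate  sigma2_hat * trace(U X_S X_S^T)."""
    A_S = np.zeros_like(Phi)
    A_S[:, list(S)] = Phi[:, list(S)]
    X_S = np.linalg.pinv(A_S)
    return sigma2_hat(Phi, y) * np.trace(U @ X_S @ X_S.T)
```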

SLIDE 9

Unbiased Estimate of Bias

$$J_B[S] = \| E \hat{\theta}_S - \theta \|_U^2$$

The unknown $\theta$ is replaced by the rough estimate $\hat{\theta}_u$ (recall $E \hat{\theta}_u = \theta$), and the extra noise contribution is subtracted:

$$\hat{J}_B[S] = \| \hat{\theta}_S - \hat{\theta}_u \|_U^2 - \hat{\sigma}^2 \, \mathrm{trace}\big( U \tilde{X} \tilde{X}^\top \big), \quad \tilde{X} = X_S - X_u$$

$$E \hat{J}_B = J_B$$

Indeed, with $z = (f(x_1), f(x_2), \ldots, f(x_M))^\top$ and $\epsilon = (\epsilon_1, \epsilon_2, \ldots, \epsilon_M)^\top$ we have $\hat{\theta}_S - \hat{\theta}_u = \tilde{X}(z + \epsilon)$; taking the expectation over the noise cancels the cross term and yields $\| \tilde{X} z \|_U^2 + \sigma^2 \, \mathrm{trace}( U \tilde{X} \tilde{X}^\top )$, whose first term equals $J_B[S]$.

SLIDE 10

Subspace Information Criterion (SIC)

$$\mathrm{SIC}[S] = \hat{J}_B[S] + \hat{J}_V[S] = \| \hat{\theta}_S - \hat{\theta}_u \|_U^2 - \hat{\sigma}^2 \, \mathrm{trace}\big( U \tilde{X} \tilde{X}^\top \big) + \hat{\sigma}^2 \, \mathrm{trace}\big( U X_S X_S^\top \big)$$

$$\tilde{X} = X_S - X_u, \quad E \, \mathrm{SIC}[S] = J_G[S]$$

SIC is an unbiased estimate, with finite samples, of the generalization error

$$J_G[S] = E \left[ \int \big( \hat{f}_S(u) - f(u) \big)^2 p(u) \, du \right]$$
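Putting the bias and variance pieces together, a self-contained numpy sketch of the criterion (a plain transcription of the formula above, not the authors' reference code):

```python
import numpy as np

def SIC(Phi, y, S, U):
    """Subspace information criterion for the subset model S.

    Phi: (M, mu) design matrix with [A]_{mi} = phi_i(x_m).
    U:   (mu, mu) correlation matrix of the future input.
    """
    M, mu = Phi.shape
    A_S = np.zeros_like(Phi)
    A_S[:, list(S)] = Phi[:, list(S)]
    X_u, X_S = np.linalg.pinv(Phi), np.linalg.pinv(A_S)
    theta_u, theta_S = X_u @ y, X_S @ y
    r = Phi @ theta_u - y
    s2 = (r @ r) / (M - mu)                    # sigma_hat^2
    d = theta_S - theta_u
    Xt = X_S - X_u
    return (d @ U @ d                          # ||theta_S - theta_u||_U^2
            - s2 * np.trace(U @ Xt @ Xt.T)     # noise correction (bias part)
            + s2 * np.trace(U @ X_S @ X_S.T))  # variance part
```

The selected model is the candidate subset $S$ with the smallest value of `SIC(Phi, y, S, U)`.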

SLIDE 11

When Assumptions Do Not Hold (1)

When the target function $f(x)$ is not included in $L = \mathrm{span}\{\varphi_i(x)\}_{i=1}^{\mu}$:

$$J_G = E \left[ \int \big( \hat{f}_S(u) - f(u) \big)^2 p(u) \, du \right] = E \| \hat{\theta}_S - \theta^* \|_U^2 + \mathrm{const.}$$

$\theta^*$: parameter of the best approximation of $f$ in $L$

SIC is an asymptotically unbiased estimate of $E \| \hat{\theta}_S - \theta^* \|_U^2$:

$$E \, \mathrm{SIC} \to E \| \hat{\theta}_S - \theta^* \|_U^2 \quad \text{as } M \to \infty$$

($M$: number of training samples)

SLIDE 12

When Assumptions Do Not Hold (2)

When the correlation matrix $U$ is not available:

If unlabeled samples $\{u_m\}_{m=1}^{M'}$ (samples without output values) are available, $U$ is estimated by

$$\hat{U}_{ij} = \frac{1}{M'} \sum_{m=1}^{M'} \varphi_i(u_m) \, \varphi_j(u_m)$$

If just the training samples $\{x_m\}_{m=1}^{M}$ are used instead, with $\hat{U} = I$ ($I$: identity matrix), SIC essentially agrees with Mallows' $C_P$.
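A one-function sketch of the empirical estimate from unlabeled inputs (representing `basis` as a list of callables is an assumed convention):

```python
import numpy as np

def U_hat(basis, u):
    """Empirical correlation matrix from unlabeled inputs u:
    U_hat[i, j] = (1/M') * sum_m phi_i(u_m) * phi_j(u_m).

    basis: list of vectorized callables phi_i.
    u:     (M',) array of input points without output values.
    """
    Phi_u = np.column_stack([phi(u) for phi in basis])
    return (Phi_u.T @ Phi_u) / len(u)
```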

SLIDE 13

Computer Simulation

Target function:

$$f(x) = \frac{1}{10} \sum_{p=1}^{50} \big( \sin px + \cos px \big)$$

$x_m$: randomly created in $[-\pi, \pi]$

$y_m = f(x_m) + \epsilon_m$, where $\epsilon_m$ is subject to $N(0, \sigma^2)$

Basis functions: $\{1, \sin px, \cos px\}_{p=1}^{100}$, so $\mu = 201$

Compared models: $\{S_{10}, \ldots, S_{100}\}$, where $S_n := \{1, \sin px, \cos px\}_{p=1}^{n}$

$$\mathrm{Error} = \frac{1}{2\pi} \int_{-\pi}^{\pi} \big( \hat{f}(u) - f(u) \big)^2 du$$
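For concreteness, a compact sketch of this setup, reusing the hypothetical `SIC` helper from slide 10; the seed is arbitrary, and $U$ is computed in closed form assuming the uniform input density on $[-\pi, \pi]$ implied by the error formula:

```python
import numpy as np

rng = np.random.default_rng(1)
M, sigma2 = 250, 0.2               # one of the four (M, sigma^2) settings
P = 100                            # basis {1, sin px, cos px}_{p=1..100}

def f(x):                          # target function of this slide
    return sum(np.sin(p * x) + np.cos(p * x) for p in range(1, 51)) / 10

x = rng.uniform(-np.pi, np.pi, M)
y = f(x) + rng.normal(0.0, np.sqrt(sigma2), M)

# Design matrix of the largest model (mu = 201 columns).
Phi = np.column_stack([np.ones(M)]
                      + [np.sin(p * x) for p in range(1, P + 1)]
                      + [np.cos(p * x) for p in range(1, P + 1)])

def subset(n):                     # column indices of model S_n
    return [0] + list(range(1, n + 1)) + list(range(P + 1, P + n + 1))

# For uniform p(u) on [-pi, pi]: the constant has unit norm,
# each sin/cos has norm 1/2, and cross terms vanish.
U = np.diag([1.0] + [0.5] * (2 * P))
best_n = min(range(10, 101), key=lambda n: SIC(Phi, y, subset(n), U))
```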

SLIDE 14

Compared Methods

- SIC
- Mallows' $C_P$
- Leave-one-out cross-validation (CV)
- Akaike's information criterion (AIC)
- Sugiura's corrected AIC (cAIC)
- Schwarz's Bayesian information criterion (BIC)
- Vapnik's measure (VM)

Simulations are performed 100 times with

$$(M, \sigma^2) = (250, 0.2), (500, 0.2), (250, 0.6), (500, 0.6)$$

($M$: number of training samples, $\sigma^2$: noise variance)

SLIDE 15

Easiest Case (Curve)

[Figure: learned curves for $(M, \sigma^2) = (250, 0.2), (500, 0.2), (250, 0.6), (500, 0.6)$.]

SLIDE 16

Easiest Case (Error)

[Figure: generalization errors for $(M, \sigma^2) = (250, 0.2), (500, 0.2), (250, 0.6), (500, 0.6)$.]

SLIDE 17

Hardest Case

[Figure: results for $(M, \sigma^2) = (250, 0.2), (500, 0.2), (250, 0.6), (500, 0.6)$.]

SLIDE 18

Summary of Simulations

[Table: for each setting of $M$ and $\sigma^2$, each method (SIC, $C_P$, CV, AIC, cAIC, BIC, VM) is classified as "works well", "selects smaller models", or "selects larger models".]

($M$: number of training samples, $\sigma^2$: noise variance)

SLIDE 19

When $f(x)$ Is Not in $L = \mathrm{span}\{\varphi_i(x)\}_{i=1}^{\mu}$

Interpolation of chaotic series: similar results were obtained!

slide-20
SLIDE 20

April 25, 2001. ICANNGA2001 20

Conclusions Conclusions

We proposed a new model selection criterion called subspace information criterion (SIC). SIC gives an unbiased estimate of generalization error. Computer simulations showed that SIC works well with small samples and large noise.