

SLIDE 1

A New Information Criterion for the Selection of Subspace Models

Masashi Sugiyama, Hidemitsu Ogawa
Department of Computer Science, Tokyo Institute of Technology, Japan

SLIDE 2

Function Approximation

f(x): target function
\hat{f}(x): learning result
x_m: sample point
y_m: sample value
y_m = f(x_m) + n_m

[Figure: the target function f(x), the learning result \hat{f}(x), and the samples (x_1, y_1), (x_2, y_2), (x_3, y_3), \ldots]

Obtain the optimal approximation \hat{f}(x) to f(x) by using the training examples \{(x_m, y_m)\}_{m=1}^{M}.

SLIDE 3

Model

Generally, function approximation is performed by estimating the parameters of a prefixed set of functions, called a model.

e.g. polynomial: \hat{f}(x) = \sum_{n=0}^{N} a_n x^n
3-layer neural networks: \hat{f}(x) = \sum_{n=1}^{N} a_n \sigma(x; b_n)

The choice of the model complexity (e.g. order of polynomial, number of units) is crucial for optimal generalization.
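For concreteness, here is a minimal Python sketch (mine, not from the slides) of fitting the polynomial model by estimating the parameters a_n with least squares; the data and the chosen order N are hypothetical:

    import numpy as np

    def fit_polynomial(x, y, N):
        # Design matrix with columns 1, x, x^2, ..., x^N, so that
        # f_hat(x) = sum_n a_n x^n; the a_n are estimated by least squares.
        Phi = np.vander(x, N + 1, increasing=True)
        a, *_ = np.linalg.lstsq(Phi, y, rcond=None)
        return lambda t: np.vander(np.atleast_1d(t), N + 1, increasing=True) @ a

    rng = np.random.default_rng(0)
    x = rng.uniform(-1.0, 1.0, 30)
    y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(30)  # noisy samples
    f_hat = fit_polynomial(x, y, N=5)  # N is the complexity to be selected

Choosing N here is exactly the model selection problem the next slides address.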

SLIDE 4

Model Selection

Select the best model, i.e., the model that provides the optimal generalization capability.

[Figure: fits by a simple model, an appropriate model, and a complex model; each panel shows the target function and the learning result.]

SLIDE 5

Motivation and goal

Most of the traditional model selection criteria do not work well when the number of training examples is small.

e.g. AIC (Akaike, 1974), BIC (Schwarz, 1978), MDL (Rissanen, 1978), NIC (Murata, Yoshizawa, & Amari, 1994)

Goal: devise a model selection criterion that works well even when the number of training examples is small.

SLIDE 6

Setting

f: learning target function
S: model (a family of functions f_\theta)
\hat{f}_\theta: learning result function obtained by model S
H: Hilbert space including S and f
E_n: expectation over the noise

From a set of models, select the model minimizing the generalization error E_n \|\hat{f}_\theta - f\|^2.

[Figure: the model S within the Hilbert space H, with the target f and the learning result \hat{f}_\theta.]

SLIDE 7

Least mean squares (LMS) learning

LMS learning is aimed at minimizing the training error
\sum_{m=1}^{M} \bigl(\hat{f}_\theta(x_m) - y_m\bigr)^2.

The LMS learning result function \hat{f}_\theta is given as
\hat{f}_\theta = X^{+} y, \quad X = \sum_{m=1}^{M} \bigl(e_m \otimes K(\cdot, x_m)\bigr)

y = (y_1, y_2, \ldots, y_M)^\top
K(x, x'): reproducing kernel of S
e_m: m-th standard basis vector in \mathbb{C}^M
X^{+}: Moore-Penrose generalized inverse
(f \otimes g): Neumann-Schatten product, (f \otimes g)h = \langle h, g \rangle f
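A hedged Python sketch of LMS learning in this form: the operator X is represented by a matrix of basis-function values and X^{+} by numpy's Moore-Penrose pseudo-inverse. The trigonometric basis is borrowed from the simulation later in the talk, and the helper names are mine:

    import numpy as np

    def trig_design(x, N):
        # Columns are the basis functions 1, sin(nx), cos(nx) for n = 1..N,
        # evaluated at the sample points x (the model S_N used later).
        cols = [np.ones_like(x)]
        for n in range(1, N + 1):
            cols += [np.sin(n * x), np.cos(n * x)]
        return np.column_stack(cols)

    def lms_fit(x, y, N):
        # theta = A^+ y minimizes the training error sum_m (f_hat(x_m) - y_m)^2
        A = trig_design(x, N)
        theta = np.linalg.pinv(A) @ y
        return lambda t: trig_design(np.atleast_1d(t), N) @ theta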

SLIDE 8

Assumptions (1)

The mean noise is zero.
The noise covariance matrix is given as \sigma^2 I.
\sigma^2 is generally unknown.

SLIDE 9

Assumptions (2)

One of the models gives an unbiased learning result \hat{f}_u:
\hat{f}_u = X_u y, \quad E_n \hat{f}_u = f

K_H(x, x'): reproducing kernel of H
M: the number of training examples

Roughly speaking, \mathrm{span}\{K_H(\cdot, x_m)\}_{m=1}^{M} = H if M \ge \dim H.

If \mathrm{span}\{K_H(\cdot, x_m)\}_{m=1}^{M} = H, then
X_u = \Bigl(\sum_{m=1}^{M} \bigl(e_m \otimes K_H(\cdot, x_m)\bigr)\Bigr)^{+}.

SLIDE 10

Generalization error and bias/variance

E_n \|\hat{f}_\theta - f\|^2 = \|E_n \hat{f}_\theta - f\|^2 + E_n \|\hat{f}_\theta - E_n \hat{f}_\theta\|^2
(generalization error = bias + variance)

E_n: expectation over the noise

[Figure: f, E_n\hat{f}_\theta, and \hat{f}_\theta, with the bias and variance indicated.]
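A small Monte Carlo check of this decomposition; the sketch measures the squared error at the sample points for simplicity, and the target, model, and noise level are hypothetical choices of mine:

    import numpy as np

    rng = np.random.default_rng(1)
    M, sigma, trials = 20, 0.3, 10000
    x = np.linspace(-1.0, 1.0, M)
    f = np.sin(np.pi * x)                   # target values at the sample points
    Phi = np.vander(x, 4, increasing=True)  # cubic polynomial model
    P = Phi @ np.linalg.pinv(Phi)           # LMS "hat" matrix: f_hat = P y

    f_hats = np.array([P @ (f + sigma * rng.standard_normal(M))
                       for _ in range(trials)])
    gen = np.mean(np.sum((f_hats - f) ** 2, axis=1))      # E_n ||f_hat - f||^2
    bias = np.sum((f_hats.mean(axis=0) - f) ** 2)         # ||E_n f_hat - f||^2
    var = np.mean(np.sum((f_hats - f_hats.mean(axis=0)) ** 2, axis=1))
    print(gen, bias + var)  # the two agree up to Monte Carlo error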

SLIDE 11

Estimation of bias

Since \hat{f}_u is unbiased, f = E_n \hat{f}_u, the bias of \hat{f}_\theta is estimated from \hat{f}_\theta - \hat{f}_u:

\|E_n\hat{f}_\theta - f\|^2 = \|\hat{f}_\theta - \hat{f}_u\|^2 - 2\,\mathrm{Re}\,\langle \hat{f}_\theta - \hat{f}_u,\; X_\theta n - X_u n \rangle + \|X_\theta n - X_u n\|^2

Replacing the noise terms by their expectations yields

\|E_n\hat{f}_\theta - f\|^2 \approx \|\hat{f}_\theta - \hat{f}_u\|^2 - \sigma^2\,\mathrm{tr}\bigl((X_\theta - X_u)(X_\theta - X_u)^*\bigr)

n = (n_1, n_2, \ldots, n_M)^\top: noise vector
\sigma^2: noise variance
X^*: adjoint operator of X

[Figure: f, E_n\hat{f}_\theta, \hat{f}_\theta, and \hat{f}_u, with f = E_n\hat{f}_u; the bias is the distance from E_n\hat{f}_\theta to f.]
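To see why this estimate is unbiased, take E_n of \|\hat{f}_\theta - \hat{f}_u\|^2 using \hat{f}_\theta - \hat{f}_u = (E_n\hat{f}_\theta - f) + (X_\theta - X_u)n, E_n n = 0, and E_n\|Xn\|^2 = \sigma^2 \operatorname{tr}(XX^*); a one-step check, written in LaTeX for readability:

    \begin{align*}
    E_n \|\hat{f}_\theta - \hat{f}_u\|^2
      &= \|E_n\hat{f}_\theta - f\|^2
         + \sigma^2 \operatorname{tr}\bigl((X_\theta - X_u)(X_\theta - X_u)^*\bigr),\\
    \text{hence}\quad
    \|E_n\hat{f}_\theta - f\|^2
      &= E_n\Bigl[\|\hat{f}_\theta - \hat{f}_u\|^2
         - \sigma^2 \operatorname{tr}\bigl((X_\theta - X_u)(X_\theta - X_u)^*\bigr)\Bigr].
    \end{align*}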

SLIDE 12

Estimation of noise variance

\hat{\sigma}^2 = \frac{1}{M - \dim H} \sum_{m=1}^{M} \bigl(\hat{f}_u(x_m) - y_m\bigr)^2

\hat{\sigma}^2 is an unbiased estimate of the noise variance \sigma^2.

The generalization error is then estimated as

E_n\|\hat{f}_\theta - f\|^2 \approx \|\hat{f}_\theta - \hat{f}_u\|^2 - \sigma^2\,\mathrm{tr}\bigl((X_\theta - X_u)(X_\theta - X_u)^*\bigr) + \sigma^2\,\mathrm{tr}\bigl(X_\theta X_\theta^*\bigr)

(generalization error \approx bias estimate + variance)
X^*: adjoint operator of X
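A Python sketch of this estimator, assuming the unbiased fit over H is computed from an M x dim(H) matrix A_H of basis values (my naming; the basis of H is left abstract):

    import numpy as np

    def noise_variance_estimate(A_H, y):
        # Residual sum of squares of the unbiased (full-space H) LMS fit,
        # divided by M - dim(H), as on the slide.
        f_u_at_x = A_H @ (np.linalg.pinv(A_H) @ y)  # values f_u_hat(x_m)
        M, dim_H = A_H.shape
        return np.sum((f_u_at_x - y) ** 2) / (M - dim_H)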

SLIDE 13

Subspace Information Criterion (SIC)

From a set of models, select the model minimizing the following SIC:

\mathrm{SIC} = \|\hat{f}_\theta - \hat{f}_u\|^2 - \hat{\sigma}^2\,\mathrm{tr}\bigl((X_\theta - X_u)(X_\theta - X_u)^*\bigr) + \hat{\sigma}^2\,\mathrm{tr}\bigl(X_\theta X_\theta^*\bigr)

The model minimizing SIC is called the minimum SIC (MSIC) model. The MSIC model is expected to provide the optimal generalization capability.
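A hedged Python sketch of computing SIC. Functions in H are represented by coefficient vectors in an orthonormal basis, so the operators become matrices and the adjoint a transpose; the matrix names and the zero-row padding scheme are my assumptions:

    import numpy as np

    def sic(X_theta, X_u, y, sigma2_hat):
        # X_theta, X_u: (dim H, M) matrices mapping y to coefficient vectors
        # in a common orthonormal basis of H (a subspace model is padded
        # with zero rows); then f_theta_hat = X_theta y and X* = X.T.
        D = X_theta - X_u
        bias_estimate = np.sum((D @ y) ** 2) - sigma2_hat * np.trace(D @ D.T)
        variance = sigma2_hat * np.trace(X_theta @ X_theta.T)
        return bias_estimate + variance

Model selection then loops over the candidate models and keeps the one with the smallest SIC value.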

SLIDE 14

Validity of SIC

SIC gives an unbiased estimate of the generalization error:

E_n\,\mathrm{SIC} = E_n \|\hat{f}_\theta - f\|^2

E_n: expectation over the noise

cf. AIC gives an asymptotically unbiased estimate of the generalization error.

SIC will work well even when the number of training examples is small.

SLIDE 15

Illustrative Simulation

Target function:
f(x) = 2\sin x + 2\cos x - 2\sin 2x - 2\cos 2x + 2\sin 3x - 2\cos 3x + 2\sin 4x - 2\cos 4x + 2\sin 5x - 2\cos 5x

Sample points and values:
x_m = -\pi + \frac{(2m - 1)\pi}{M}, \quad y_m = f(x_m) + n_m, \quad n_m \sim N(0, 3)

S_N: Hilbert space spanned by \{1, \sin nx, \cos nx\}_{n=1}^{N} defined on [-\pi, \pi]
Compared models: S_1, S_2, \ldots, S_{20}
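A Python sketch of this data-generating process as reconstructed above; the sign pattern of the target and the noise variance were read off a garbled slide, so both should be treated as approximate:

    import numpy as np

    SIGNS = [+1, +1, -1, -1, +1, -1, +1, -1, +1, -1]  # as reconstructed above

    def target(x):
        # 2 sin(nx) and 2 cos(nx) terms for n = 1..5, with the signs above
        terms = []
        for n in range(1, 6):
            terms += [2 * np.sin(n * x), 2 * np.cos(n * x)]
        return sum(s * t for s, t in zip(SIGNS, terms))

    def sample(M, rng):
        m = np.arange(1, M + 1)
        x = -np.pi + (2 * m - 1) * np.pi / M              # equispaced in (-pi, pi)
        y = target(x) + rng.normal(0.0, np.sqrt(3.0), M)  # noise read as N(0, 3)
        return x, y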

SLIDE 16

Compared model selection criteria:

  • SIC
  • Network information criterion (NIC) (Murata, Yoshizawa, & Amari, 1994), a generalized AIC

H = S_{20}: \dim(H) = 41

In this simulation, SIC and NIC can be compared fairly.

\mathrm{Error} = \frac{1}{2\pi}\|\hat{f} - f\|^2 = \frac{1}{2\pi}\int_{-\pi}^{\pi} \bigl(\hat{f}(x) - f(x)\bigr)^2\,dx
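Since averaging (f_hat - f)^2 over a uniform grid on [-pi, pi] approximates (1/2\pi) times the integral, this error measure can be computed with a short Python sketch (the grid size is my choice):

    import numpy as np

    def error(f_hat, f, n_grid=10001):
        # Uniform-grid average of (f_hat - f)^2 over [-pi, pi], which
        # approximates (1/2pi) * integral of (f_hat(x) - f(x))^2 dx.
        t = np.linspace(-np.pi, np.pi, n_grid)
        return np.mean((f_hat(t) - f(t)) ** 2)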

SLIDE 17

M = 200
MSIC model: S_5 (Error 0.11)
MNIC model: S_6 (Error 0.17)
Optimal model: S_5 (Error 0.11)

SLIDE 18

M = 100
MSIC model: S_5 (Error 0.37)
MNIC model: S_9 (Error 0.75)
Optimal model: S_5 (Error 0.37)

SLIDE 19

M = 50
MSIC model: S_5 (Error 0.98)
MNIC model: S_20 (Error 3.36)
Optimal model: S_5 (Error 0.98)

SIC works well even when M is small.

SLIDE 20

Unrealizable case

Estimate a chaotic series \{h_p\}_{p=1}^{200} from the sample values \{y_m\}_{m=1}^{M}, where M = 100.

SLIDE 21

Estimation of chaotic series

Consider the sample points
x_p = 0.995\left(\frac{2(p-1)}{200} - 1\right)
corresponding to the chaotic series \{h_p\}_{p=1}^{200}.

\hat{h}_p = \hat{f}\left(0.995\left(\frac{2(p-1)}{200} - 1\right)\right) is an estimate of h_p.

\mathrm{Error} = \sum_{p=1}^{200} \bigl(\hat{h}_p - h_p\bigr)^2

We perform the simulation 1000 times.
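The slides do not specify which chaotic series \{h_p\} is used, so the Python sketch below uses a logistic-map sequence purely as a hypothetical stand-in; only the evaluation points x_p and the error formula come from the slide:

    import numpy as np

    def make_series(P=200):
        # Hypothetical stand-in: a logistic-map sequence (the actual
        # chaotic series used in the talk is not specified here).
        h = np.empty(P)
        h[0] = 0.3
        for p in range(1, P):
            h[p] = 3.8 * h[p - 1] * (1.0 - h[p - 1])
        return h

    def chaos_error(f_hat, h):
        p = np.arange(1, len(h) + 1)
        x = 0.995 * (2 * (p - 1) / 200 - 1)   # x_p from the slide
        return np.sum((f_hat(x) - h) ** 2)    # Error = sum_p (h_hat_p - h_p)^2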

SLIDE 22

Compared model selection criteria:

  • SIC
  • NIC: \{x_m\}_{m=1}^{M} are regarded as uniformly distributed, and the log loss is adopted as the loss function.

S_N: Hilbert space spanned by \{x^n\}_{n=0}^{N} defined on [-1, 1]
H = S_{40}: \dim(H) = 41
Compared models: S_{15}, S_{20}, S_{25}, S_{30}, S_{35}, S_{40}

SLIDE 23

M = 250
SIC: mean error 0.0021
NIC: mean error 0.0022

SLIDE 24

M = 150
SIC: mean error 0.0058
NIC: mean error 0.013

SLIDE 25

M = 50
SIC: mean error 0.018
NIC: mean error 0.040

SIC works well even when M is small.

SLIDE 26

Conclusions

  • We proposed a new model selection criterion named the subspace information criterion (SIC).
  • SIC gives an unbiased estimate of the generalization error.
  • SIC works well even when the number of training examples is small.