1
A New Information Criterion for the Selection of Subspace Models

Masashi Sugiyama, Hidemitsu Ogawa
Department of Computer Science, Tokyo Institute of Technology, Japan
2
Function Approximation

Obtain the optimal approximation \hat{f}(x) to the target function f(x) using the training examples \{(x_m, y_m)\}_{m=1}^{M}, where

  y_m = f(x_m) + n_m

x_m: sample point, y_m: sample value, n_m: noise.

[Figure: target function f(x), learning result \hat{f}(x), and training examples (x_1, y_1), (x_2, y_2), (x_3, y_3).]
3
Model

Generally, function approximation is performed by estimating the parameters of a prefixed set of functions called a model, e.g.

  polynomial:             \hat{f}(x) = \sum_{n=0}^{N} a_n x^n

  3-layer neural network: \hat{f}(x) = \sum_{n=1}^{N} a_n \sigma(x; b_n)

The choice of the model complexity (e.g. the order of the polynomial, the number of units) is crucial for optimal generalization.
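As a small illustration of the polynomial model above (our sketch, not from the original slides; the name fit_polynomial and the use of NumPy are assumptions), the parameters a_0, ..., a_N can be estimated by ordinary least squares:

import numpy as np

def fit_polynomial(x, y, N):
    """Estimate a_0, ..., a_N in f_hat(x) = sum_{n=0}^{N} a_n x^n by least squares."""
    A = np.vander(x, N + 1, increasing=True)       # columns: 1, x, ..., x^N
    coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coeffs

Varying N here is exactly the model-complexity choice that the following slides address.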
4
Model Selection

Select the best model, i.e. the one providing the optimal generalization capability.

[Figure: target function and learning results for a simple model, an appropriate model, and a complex model.]
5
Motivation and goal

Most traditional model selection criteria do not work well when the number of training examples is small, e.g. AIC (Akaike, 1974), BIC (Schwarz, 1978), MDL (Rissanen, 1978), and NIC (Murata, Yoshizawa, & Amari, 1994).

Goal: devise a model selection criterion that works well even when the number of training examples is small.
6
Setting

f: target function (the learning target)
S: model (a family of functions f_\theta)
\hat{f}_\theta: learning result function obtained with model S
H: Hilbert space including S and f
E_n: expectation over the noise

From a set of models, select the model minimizing the generalization error E_n \|\hat{f}_\theta - f\|^2.

[Figure: relation among f, f_\theta, \hat{f}_\theta, S, and H.]
7
Least mean squares (LMS) learning

LMS learning is aimed at minimizing the training error

  \sum_{m=1}^{M} ( \hat{f}_\theta(x_m) - y_m )^2

The LMS learning result function \hat{f}_\theta is given as

  \hat{f}_\theta = X_\theta y,   X_\theta = \Big( \sum_{m=1}^{M} e_m \otimes K(\cdot, x_m) \Big)^{+}

where
  y = (y_1, y_2, \ldots, y_M)'
  e_m: m-th standard basis vector in C^M
  K(x, x'): reproducing kernel of S
  {}^{+}: Moore-Penrose generalized inverse
  \otimes: Neumann-Schatten product, (f \otimes g) h = \langle h, g \rangle f
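A minimal finite-dimensional sketch of LMS learning (our illustration, assuming the model is spanned by basis functions \phi_1, ..., \phi_k; the names lms_learning and Phi are hypothetical). The Moore-Penrose pseudoinverse of the design matrix plays the role of the operator ( \sum_m e_m \otimes K(\cdot, x_m) )^{+}:

import numpy as np

def lms_learning(Phi, y):
    """Phi[m, j] = phi_j(x_m); y[m] = y_m.

    Returns the coefficient vector of the LMS learning result, i.e. the
    minimum-norm least-squares solution f_hat = X_theta y with X_theta = Phi^+.
    """
    return np.linalg.pinv(Phi) @ y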
8
Assumptions (1)

The mean of the noise is zero.
The noise covariance matrix is given as \sigma^2 I.
\sigma^2 is generally unknown.
9
Assumptions (2)

One of the models gives an unbiased learning result \hat{f}_u:

  E_n \hat{f}_u = f,   \hat{f}_u = X_u y

K_H(x, x'): reproducing kernel of H
M: the number of training examples

Roughly speaking, \{K_H(\cdot, x_m)\}_{m=1}^{M} spans H if M \ge \dim(H).

If \{K_H(\cdot, x_m)\}_{m=1}^{M} spans H, then

  X_u = \Big( \sum_{m=1}^{M} e_m \otimes K_H(\cdot, x_m) \Big)^{+}
10

Generalization error and bias/variance

  E_n \|\hat{f}_\theta - f\|^2 = \|E_n \hat{f}_\theta - f\|^2 + E_n \|\hat{f}_\theta - E_n \hat{f}_\theta\|^2
   (generalization error)         (bias)                       (variance)

E_n: expectation over the noise

[Figure: relation among f, \hat{f}_\theta, and E_n \hat{f}_\theta, showing the bias and variance components of the generalization error.]
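For completeness, the decomposition follows from one standard expansion (added here; it is not spelled out on the slide):

\begin{align*}
E_n \|\hat{f}_\theta - f\|^2
 &= E_n \big\| (\hat{f}_\theta - E_n\hat{f}_\theta) + (E_n\hat{f}_\theta - f) \big\|^2 \\
 &= E_n \|\hat{f}_\theta - E_n\hat{f}_\theta\|^2
  + \|E_n\hat{f}_\theta - f\|^2
  + 2\,\mathrm{Re}\,\big\langle E_n\hat{f}_\theta - E_n\hat{f}_\theta,\; E_n\hat{f}_\theta - f \big\rangle
\end{align*}

The cross term vanishes because E_n(\hat{f}_\theta - E_n\hat{f}_\theta) = 0, leaving exactly the bias plus the variance.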
11

Estimation of bias

Since \hat{f}_u is unbiased (E_n \hat{f}_u = f), the bias of \hat{f}_\theta can be estimated from the difference between the two learning results:

  \|E_n \hat{f}_\theta - f\|^2 \approx \|\hat{f}_\theta - \hat{f}_u\|^2 - \sigma^2 \,\mathrm{tr}\big( (X_\theta - X_u)(X_\theta - X_u)^* \big)

where
  n = (n_1, n_2, \ldots, n_M)^T: noise vector
  \sigma^2: noise variance
  X^*: adjoint operator of X

The subtracted trace term removes the noise contribution to \|\hat{f}_\theta - \hat{f}_u\|^2, since E_n \|(X_\theta - X_u) n\|^2 = \sigma^2 \,\mathrm{tr}\big( (X_\theta - X_u)(X_\theta - X_u)^* \big).

[Figure: f = E_n \hat{f}_u; the distance between \hat{f}_\theta and \hat{f}_u approximates the bias.]
12

Estimation of noise variance

  \hat{\sigma}^2 = \frac{ \sum_{m=1}^{M} ( \hat{f}_u(x_m) - y_m )^2 }{ M - \dim(H) }

\hat{\sigma}^2 is an unbiased estimate of \sigma^2.

Substituting \hat{\sigma}^2 and adding the variance term yields an estimate of the generalization error:

  E_n \|\hat{f}_\theta - f\|^2 \approx \|\hat{f}_\theta - \hat{f}_u\|^2 - \hat{\sigma}^2 \,\mathrm{tr}( X X^* ) + \hat{\sigma}^2 \,\mathrm{tr}( X_\theta X_\theta^* )
   (generalization error)       (bias estimate)                                               (variance)

where X = X_\theta - X_u, \sigma^2: noise variance, X^*: adjoint operator of X.
13

Subspace Information Criterion (SIC)

From a set of models, select the model minimizing the following SIC:

  \mathrm{SIC} = \|\hat{f}_\theta - \hat{f}_u\|^2 - \hat{\sigma}^2 \,\mathrm{tr}( X X^* ) + \hat{\sigma}^2 \,\mathrm{tr}( X_\theta X_\theta^* ),   X = X_\theta - X_u

The model minimizing SIC is called the minimum SIC (MSIC) model. The MSIC model is expected to provide the optimal generalization capability.
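A minimal matrix sketch of computing SIC (our illustration, not the authors' code), assuming functions are represented by their coefficient vectors in an orthonormal basis of H, so the H-norm is the Euclidean norm and the adjoint X^* is the matrix transpose; sic_score and its arguments are hypothetical names:

import numpy as np

def sic_score(A, y, k):
    """SIC for the subspace model spanned by the first k basis functions.

    A : (M, d) design matrix, A[m, j] = j-th basis function of H at x_m (needs M > d)
    y : (M,)  sample values y_m = f(x_m) + n_m
    k : dimension of the candidate subspace model
    """
    M, d = A.shape
    # Learning operator of the unbiased (full) model: X_u = A^+ (Moore-Penrose)
    X_u = np.linalg.pinv(A)
    # Learning operator of the subspace model: LMS on the first k columns,
    # zero-padded so it maps into the same coefficient space as X_u
    X_th = np.zeros((d, M))
    X_th[:k, :] = np.linalg.pinv(A[:, :k])
    f_th, f_u = X_th @ y, X_u @ y
    # Unbiased noise-variance estimate: full-model residual over (M - dim H)
    sigma2 = np.sum((A @ f_u - y) ** 2) / (M - d)
    D = X_th - X_u
    # SIC = ||f_th - f_u||^2 - s^2 tr(D D*) + s^2 tr(X_th X_th*)
    return (np.sum((f_th - f_u) ** 2)
            - sigma2 * np.trace(D @ D.T)
            + sigma2 * np.trace(X_th @ X_th.T))

Model selection then amounts to evaluating sic_score for each candidate subspace dimension k and keeping the minimizer.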
14

Validity of SIC

SIC gives an unbiased estimate of the generalization error:

  E_n \,\mathrm{SIC} = E_n \|\hat{f}_\theta - f\|^2

E_n: expectation over the noise

cf. AIC gives an asymptotically unbiased estimate of the generalization error.

Hence SIC will work well even when the number of training examples is small.
15

Illustrative Simulation

Training examples: y_m = f(x_m) + n_m, with sample points x_m = \frac{2\pi m}{M} - \pi and additive Gaussian noise n_m.

Target function: a trigonometric polynomial

  f(x) = \sum_{k=1}^{5} ( a_k \sin kx + b_k \cos kx ),   a_k, b_k \in \{2, -2\}

S_N: Hilbert space spanned by \{1, \sin nx, \cos nx\}_{n=1}^{N}, defined on [-\pi, \pi]

Compared models: \{S_1, S_2, \ldots, S_{20}\}
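A hedged re-creation of this setup in Python (sic_score is the function from the earlier sketch; the coefficient signs, noise level, and seed below are stand-ins, not the slide's exact values):

import numpy as np

rng = np.random.default_rng(0)
M, N_max = 200, 20
x = 2 * np.pi * np.arange(1, M + 1) / M - np.pi   # equispaced sample points in (-pi, pi]
# stand-in signs for the slide's +/-2 coefficients
f = lambda t: sum(2 * np.sin(k * t) - 2 * np.cos(k * t) for k in range(1, 6))
y = f(x) + rng.normal(scale=0.5, size=M)          # y_m = f(x_m) + n_m (assumed noise level)

# orthonormal trigonometric basis of H = S_20 on [-pi, pi]: dim(H) = 2*20 + 1 = 41
cols = [np.ones(M) / np.sqrt(2 * np.pi)]
for n in range(1, N_max + 1):
    cols += [np.sin(n * x) / np.sqrt(np.pi), np.cos(n * x) / np.sqrt(np.pi)]
A = np.stack(cols, axis=1)

# model S_N uses the first 2N + 1 basis functions; keep the N minimizing SIC
scores = {N: sic_score(A, y, 2 * N + 1) for N in range(1, N_max + 1)}
print("MSIC model: S_%d" % min(scores, key=scores.get))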
16

Compared model selection criteria

- SIC
- Network information criterion (NIC) (Murata, Yoshizawa, & Amari, 1994), a generalized AIC

H: H = S_{20}, \dim(H) = 41

In this simulation, SIC and NIC are compared fairly.

  \mathrm{Error} = \|\hat{f} - f\|^2 = \frac{1}{2\pi} \int_{-\pi}^{\pi} ( \hat{f}(x) - f(x) )^2 \, dx
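The error measure above is easy to approximate numerically; a small sketch (generalization_error is our name), assuming f_hat and f are vectorized callables on [-\pi, \pi]:

import numpy as np

def generalization_error(f_hat, f, n_grid=10_000):
    """Trapezoidal approximation of (1/(2*pi)) * int_{-pi}^{pi} (f_hat - f)^2 dx."""
    x = np.linspace(-np.pi, np.pi, n_grid)
    return np.trapz((f_hat(x) - f(x)) ** 2, x) / (2 * np.pi)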
17

M = 200

MSIC model:    S_5  (Error 0.11)
MNIC model:    S_6  (Error 0.17)
Optimal model: S_5  (Error 0.11)
18

M = 100

MSIC model:    S_5  (Error 0.37)
MNIC model:    S_9  (Error 0.75)
Optimal model: S_5  (Error 0.37)
19

M = 50

MSIC model:    S_5   (Error 0.98)
MNIC model:    S_20  (Error 3.36)
Optimal model: S_5   (Error 0.98)

SIC works well even when M is small.
20

Unrealizable case

Estimate a chaotic series \{h_p\}_{p=1}^{200} from the sample values \{y_m\}_{m=1}^{M}, with M = 100.
21

Estimation of chaotic series

Consider the sample points x_p = -0.995 + \frac{2}{200}(p - 1) corresponding to the chaotic series \{h_p\}_{p=1}^{200}, i.e. h_p = f(x_p).

Then \hat{h}_p = \hat{f}\big( -0.995 + \frac{2}{200}(p - 1) \big) is an estimate of h_p, and

  \mathrm{Error} = \sum_{p=1}^{200} ( \hat{h}_p - h_p )^2

We perform the simulation 1000 times.
22

Compared model selection criteria

- SIC
- NIC (the log loss is adopted as the loss function; \{x_m\}_{m=1}^{M} are regarded as uniformly distributed)

S_N: Hilbert space spanned by \{x^n\}_{n=0}^{N}, defined on [-1, 1]
H: H = S_{40}, \dim(H) = 41

Compared models: \{S_{15}, S_{20}, S_{25}, S_{30}, S_{35}, S_{40}\}
23

M = 250

SIC: mean Error 0.0021
NIC: mean Error 0.0022
24

M = 150

SIC: mean Error 0.0058
NIC: mean Error 0.013
25

M = 50

SIC: mean Error 0.018
NIC: mean Error 0.040

SIC works well even when M is small.
26

Conclusions

- We proposed a new model selection criterion named the subspace information criterion (SIC).
- SIC gives an unbiased estimate of the generalization error.
- SIC works well even when the number of training examples is small.