April 25, 2001. ICANNGA2001

Model Selection with Small Samples

Masashi Sugiyama and Hidemitsu Ogawa
Department of Computer Science, Tokyo Institute of Technology, Japan
Supervised Learning

From training examples $\{(x_m, y_m)\}_{m=1}^{M}$, obtain a learning result $\hat{f}(x)$ that minimizes the generalization error $J_G$:

$$J_G = E\left[\int \bigl(\hat{f}(u) - f(u)\bigr)^2 p(u)\, du\right]$$

where
- $f(x)$: target function; $\hat{f}(x)$: learning result
- $y_m = f(x_m) + \varepsilon_m$, with noise $\varepsilon_m$ of mean $0$ and variance $\sigma^2$
- $p(u)$: density of future test input points $u$
- $E$: expectation over the noise
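As an illustrative sketch (not from the slides), the generalization error $J_G$ above can be approximated by Monte-Carlo sampling of future test inputs $u \sim p(u)$. The target $\sin u$, the learning result $0.9\sin u$, and the uniform $p(u)$ below are all hypothetical choices for illustration.

```python
import numpy as np

# Monte-Carlo approximation of the generalization error
# J_G = E[ integral (f_hat(u) - f(u))^2 p(u) du ]
# for one fixed learning result (so the outer noise expectation is trivial).

rng = np.random.default_rng(0)
f = np.sin                                # hypothetical target function
f_hat = lambda u: 0.9 * np.sin(u)         # hypothetical learning result
u = rng.uniform(-np.pi, np.pi, 200_000)   # test inputs, p(u) uniform on [-pi, pi]
J = np.mean((f_hat(u) - f(u)) ** 2)
# Analytically, J = 0.01 * E[sin^2 u] = 0.005 for this choice.
```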
For the Time Being, We Assume…

- The target function is a linear combination of $\mu$ specified basis functions $\{\varphi_i(x)\}_{i=1}^{\mu}$:
  $$f(x) = \sum_{i=1}^{\mu} \theta_i \varphi_i(x)$$
- The correlation matrix $U$ of the future input $u$ is known:
  $$U_{ij} = \int \varphi_i(u)\, \varphi_j(u)\, p(u)\, du$$

Later, we will discuss the case when these assumptions do not hold.
Subset Regression Models

Let $S$ be a subset of the indices $\{1, 2, \ldots, \mu\}$, where $\mu$ is the number of basis functions. The subset regression model is

$$\hat{f}_S(x) = \sum_{i \in S} [\hat{\theta}_S]_i\, \varphi_i(x),$$

where $\hat{\theta}_S$ is determined so that the training error

$$\sum_{m=1}^{M} \bigl(\hat{f}_S(x_m) - y_m\bigr)^2$$

is minimized:

$$\hat{\theta}_S = X_S y, \qquad X_S = (A_S^\top A_S)^{-1} A_S^\top, \qquad y = (y_1, y_2, \ldots, y_M)^\top,$$

$$[A_S]_{mi} = \begin{cases} \varphi_i(x_m) & i \in S \\ 0 & i \notin S, \end{cases}$$

with the inverse understood as a generalized inverse when $A_S$ has zero columns.
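A minimal NumPy sketch of the least-squares fit in a subset model; `design_matrix`, `subset_ls`, and the toy sin/cos basis are illustrative names and choices, not from the slides.

```python
import numpy as np

# Subset regression: [A_S]_{mi} = phi_i(x_m) for i in S, and theta_S_hat
# minimizes the training error sum_m (f_S_hat(x_m) - y_m)^2.

def design_matrix(x, basis):
    """Stack the basis functions phi_i evaluated at the sample points x."""
    return np.column_stack([phi(x) for phi in basis])

def subset_ls(x, y, basis, S):
    """Least-squares fit using only the basis functions indexed by S."""
    A_S = design_matrix(x, [basis[i] for i in S])
    theta_S, *_ = np.linalg.lstsq(A_S, y, rcond=None)
    return theta_S

# Tiny example: recover f(x) = 2 sin(x) from noiseless samples.
basis = [np.sin, np.cos]
x = np.linspace(-np.pi, np.pi, 20)
y = 2.0 * np.sin(x)
theta = subset_ls(x, y, basis, S=[0])  # subset containing sin only
```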
Model Selection

Select the best subset $S$ of basis functions so that the generalization error $J_G$ is minimized:

$$J_G = E\left[\int \bigl(\hat{f}(u) - f(u)\bigr)^2 p(u)\, du\right]$$

However, $J_G$ includes the unknown target function $f(x)$. We therefore derive an estimate of $J_G$, called the subspace information criterion (SIC), and the model is determined so that SIC is minimized.
Key Idea: Unbiased Estimate

Let $\hat{\theta}_u$ be the minimum-training-error estimate in the largest model:

$$\hat{f}_u(x) = \sum_{i=1}^{\mu} [\hat{\theta}_u]_i\, \varphi_i(x), \qquad \hat{\theta}_u = X_u y, \qquad X_u = (A^\top A)^{-1} A^\top, \qquad [A]_{mi} = \varphi_i(x_m),$$

with $y = (y_1, y_2, \ldots, y_M)^\top$. Then $\hat{\theta}_u$ is an unbiased estimate of the true parameter $\theta$:

$$E\, \hat{\theta}_u = \theta,$$

where $E$ denotes expectation over the noise. This $\hat{\theta}_u$ is used for estimating the generalization error of $\hat{\theta}_S$.
Bias / Variance Decomposition

$$J_G[S] = E\int \bigl(\hat{f}_S(u) - f(u)\bigr)^2 p(u)\, du = E\,\|\hat{\theta}_S - \theta\|_U^2$$
$$= \underbrace{\|E\hat{\theta}_S - \theta\|_U^2}_{\text{bias } J_B[S]} \;+\; \underbrace{E\,\|\hat{\theta}_S - E\hat{\theta}_S\|_U^2}_{\text{variance } J_V[S]},$$

where $\|\theta\|_U^2 = \theta^\top U \theta$, $U_{ij} = \int \varphi_i(u)\varphi_j(u)\, p(u)\, du$, and $E$ denotes expectation over the noise.
Unbiased Estimate of Variance

The variance term is

$$J_V[S] = E\,\|\hat{\theta}_S - E\hat{\theta}_S\|_U^2 = \sigma^2\, \mathrm{trace}\bigl(U X_S X_S^\top\bigr),$$

so an unbiased estimate $\hat{J}_V$ is obtained by plugging in an unbiased estimate $\hat{\sigma}^2$ of the noise variance:

$$\hat{J}_V[S] = \hat{\sigma}^2\, \mathrm{trace}\bigl(U X_S X_S^\top\bigr), \qquad E\,\hat{J}_V = J_V,$$

$$\hat{\sigma}^2 = \frac{\|A\hat{\theta}_u - y\|^2}{M - \mu}, \qquad E\,\hat{\sigma}^2 = \sigma^2.$$
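A small sketch of the noise-variance estimate above; the trigonometric basis, sample size, and true parameters below are assumed for illustration.

```python
import numpy as np

# Noise-variance estimate sigma2_hat = ||A theta_u_hat - y||^2 / (M - mu),
# where theta_u_hat is the least-squares fit in the largest model.

rng = np.random.default_rng(0)
M, mu, sigma2 = 200, 3, 0.25
x = rng.uniform(-np.pi, np.pi, M)
A = np.column_stack([np.ones(M), np.sin(x), np.cos(x)])  # [A]_{mi} = phi_i(x_m)
theta_true = np.array([1.0, 2.0, -1.0])
y = A @ theta_true + rng.normal(0.0, np.sqrt(sigma2), M)

theta_u = np.linalg.lstsq(A, y, rcond=None)[0]           # minimum-training-error fit
sigma2_hat = np.sum((A @ theta_u - y) ** 2) / (M - mu)   # should be close to 0.25
```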
Unbiased Estimate of Bias

The bias term is $J_B[S] = \|E\hat{\theta}_S - \theta\|_U^2$. Since $E\hat{\theta}_u = \theta$, a rough estimate is obtained by replacing $E\hat{\theta}_S$ with $\hat{\theta}_S$ and $\theta$ with $\hat{\theta}_u$, and then correcting for the extra noise contribution:

$$\hat{J}_B[S] = \|\hat{\theta}_S - \hat{\theta}_u\|_U^2 - \hat{\sigma}^2\, \mathrm{trace}\bigl(U \tilde{X} \tilde{X}^\top\bigr), \qquad \tilde{X} = X_S - X_u.$$

Indeed, writing $z = (f(x_1), f(x_2), \ldots, f(x_M))^\top$ and $\varepsilon = (\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_M)^\top$, we have $\hat{\theta}_S - \hat{\theta}_u = \tilde{X}(z + \varepsilon)$, and the correction makes the rough estimate unbiased:

$$E\,\hat{J}_B = J_B.$$
Subspace Information Criterion (SIC)

$$\mathrm{SIC}[S] = \hat{J}_B[S] + \hat{J}_V[S] = \|\hat{\theta}_S - \hat{\theta}_u\|_U^2 - \hat{\sigma}^2\, \mathrm{trace}\bigl(U \tilde{X} \tilde{X}^\top\bigr) + \hat{\sigma}^2\, \mathrm{trace}\bigl(U X_S X_S^\top\bigr),$$

where $\tilde{X} = X_S - X_u$. SIC is an unbiased estimate of the generalization error $J_G$ with finite samples:

$$E\, \mathrm{SIC}[S] = J_G[S] = E\left[\int \bigl(\hat{f}_S(u) - f(u)\bigr)^2 p(u)\, du\right].$$
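The criterion can be put together as a short function. This is a minimal sketch under the assumptions of the preceding slides (the function name `sic`, the use of `pinv` as the generalized inverse, and the toy basis below are illustrative, not from the slides).

```python
import numpy as np

# SIC[S] = ||th_S - th_u||_U^2
#          - s2 * trace(U (X_S - X_u)(X_S - X_u)^T)
#          + s2 * trace(U X_S X_S^T),
# with ||v||_U^2 = v^T U v.  pinv handles the zeroed-out columns of A_S.

def sic(A, y, S, U):
    """SIC for the subset S of basis-function indices, given design matrix A."""
    M, mu = A.shape
    X_u = np.linalg.pinv(A)                       # largest-model estimator
    A_S = np.zeros_like(A)
    A_S[:, S] = A[:, S]                           # keep only columns in S
    X_S = np.linalg.pinv(A_S)
    th_u, th_S = X_u @ y, X_S @ y
    s2 = np.sum((A @ th_u - y) ** 2) / (M - mu)   # unbiased noise-variance estimate
    d = th_S - th_u
    Xt = X_S - X_u
    return float(d @ U @ d
                 - s2 * np.trace(U @ Xt @ Xt.T)
                 + s2 * np.trace(U @ X_S @ X_S.T))
```

With $\hat{U} = I$ (as in the later slide on unavailable $U$), the model minimizing `sic(A, y, S, np.eye(mu))` over candidate subsets $S$ would be selected.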
When Assumptions Do Not Hold (1)

When the target function $f(x)$ is not included in the span $L$ of $\{\varphi_i(x)\}_{i=1}^{\mu}$, the generalization error becomes

$$J_G = E\left[\int \bigl(\hat{f}_S(u) - f(u)\bigr)^2 p(u)\, du\right] = E\,\|\hat{\theta}_S - \theta^*\|_U^2 + \mathrm{const.},$$

where $\theta^*$ gives the best approximation of $f(x)$ in $L$. SIC is then an asymptotically unbiased estimate of $E\,\|\hat{\theta}_S - \theta^*\|_U^2$:

$$E\, \mathrm{SIC} \to E\,\|\hat{\theta}_S - \theta^*\|_U^2 \quad \text{as } M \to \infty,$$

where $M$ is the number of training samples.
When Assumptions Do Not Hold (2)

When the correlation matrix $U$ is not available:

- If unlabeled samples $\{u_m\}_{m=1}^{M'}$ (samples without output values) are available, $U$ is replaced by the empirical estimate
  $$\hat{U}_{ij} = \frac{1}{M'} \sum_{m=1}^{M'} \varphi_i(u_m)\, \varphi_j(u_m).$$
- The training samples $\{x_m\}_{m=1}^{M}$ can be used instead of unlabeled samples.
- If the identity matrix is used ($\hat{U} = I$), SIC essentially agrees with Mallows's $C_P$.
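The empirical estimate $\hat{U}$ is one matrix product. A sketch with an assumed trigonometric basis and uniform $p(u)$ on $[-\pi,\pi]$ (both illustrative choices):

```python
import numpy as np

# Empirical correlation matrix from unlabeled inputs u_m:
# U_hat_ij = (1/M') * sum_m phi_i(u_m) * phi_j(u_m).

rng = np.random.default_rng(0)
u = rng.uniform(-np.pi, np.pi, 100_000)   # unlabeled samples, p(u) uniform
Phi = np.column_stack([np.ones_like(u), np.sin(u), np.cos(u)])
U_hat = Phi.T @ Phi / len(u)

# For uniform p(u) on [-pi, pi], U_hat converges to diag(1, 1/2, 1/2).
```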
Computer Simulation

- Target function: $f(x) = \dfrac{1}{10} \sum_{p=1}^{50} (\sin px + \cos px)$
- Training inputs: $x_m$ randomly created in $[-\pi, \pi]$
- Outputs: $y_m = f(x_m) + \varepsilon_m$, with $\varepsilon_m$ subject to $N(0, \sigma^2)$
- Basis functions: $\{1, \sin px, \cos px\}_{p=1}^{100}$, so $\mu = 201$
- Compared models: $\{S_{10}, \ldots, S_{100}\}$, where $S_n = \{1, \sin px, \cos px\}_{p=1}^{n}$
- Error measure: $\mathrm{Error} = \dfrac{1}{2\pi} \int_{-\pi}^{\pi} \bigl(\hat{f}(u) - f(u)\bigr)^2\, du$
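The data-generation step of this setup can be sketched as follows; the $1/10$ scaling of the target is a reconstruction from the garbled slide, so treat the exact coefficient as an assumption.

```python
import numpy as np

# Simulation data: x_m uniform in [-pi, pi], y_m = f(x_m) + eps_m,
# eps_m ~ N(0, sigma^2), with the (assumed) target
# f(x) = (1/10) * sum_{p=1}^{50} (sin(p x) + cos(p x)).

def f(x):
    p = np.arange(1, 51)[:, None]                    # p = 1, ..., 50
    return 0.1 * np.sum(np.sin(p * x) + np.cos(p * x), axis=0)

rng = np.random.default_rng(0)
M, sigma2 = 250, 0.2                                 # one (M, sigma^2) setting
x = rng.uniform(-np.pi, np.pi, M)
y = f(x) + rng.normal(0.0, np.sqrt(sigma2), M)
```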
Compared Methods

- SIC
- Mallows's $C_P$
- Leave-one-out cross-validation (CV)
- Akaike's information criterion (AIC)
- Sugiura's corrected AIC (cAIC)
- Schwarz's Bayesian information criterion (BIC)
- Vapnik's measure (VM)

Simulations are performed 100 times with $(M, \sigma^2) = (250, 0.2), (500, 0.2), (250, 0.6), (500, 0.6)$, where $M$ is the number of training samples and $\sigma^2$ is the noise variance.
Easiest Case (Curve)

[Figure: learned curves for $(M, \sigma^2) = (250, 0.2), (500, 0.2), (250, 0.6), (500, 0.6)$.]

Easiest Case (Error)

[Figure: error distributions for $(M, \sigma^2) = (250, 0.2), (500, 0.2), (250, 0.6), (500, 0.6)$.]

Hardest Case

[Figure: results for $(M, \sigma^2) = (250, 0.2), (500, 0.2), (250, 0.6), (500, 0.6)$.]
Summary of Simulations

[Table: for each $(M, \sigma^2)$ setting, each method (SIC, $C_P$, CV, AIC, cAIC, BIC, VM) is classified as "works well", "selects smaller" models, or "selects larger" models; $M$ is the number of training samples and $\sigma^2$ is the noise variance.]
When $f(x)$ Is Not in the Span $L$ of $\{\varphi_i(x)\}_{i=1}^{\mu}$

Interpolation of a chaotic series: similar results were obtained!