Model Selection with Small Samples

Model Selection with Small Samples - PowerPoint PPT Presentation



  1. Model Selection with Small Samples. Masashi Sugiyama and Hidemitsu Ogawa, Department of Computer Science, Tokyo Institute of Technology, Japan. ICANNGA 2001, April 25, 2001.

  2. Supervised Learning. From training examples $\{(x_m, y_m)\}_{m=1}^{M}$, where $y_m = f(x_m) + \epsilon_m$ and the noise $\epsilon_m$ has mean $0$ and variance $\sigma^2$, obtain a learning result $\hat{f}(x)$ that minimizes the generalization error $J_G$:
     $$J_G = E \int \big(\hat{f}(u) - f(u)\big)^2 p(u)\, du$$
     Here $E$ denotes expectation over the noise, and $u$ denotes future test input points with density $p(u)$.

  3. For the Time Being, We Assume...
     - The target function $f(x)$ is a linear combination of $\mu$ specified basis functions $\{\varphi_i(x)\}_{i=1}^{\mu}$:
       $$f(x) = \sum_{i=1}^{\mu} \theta_i \varphi_i(x)$$
     - The correlation matrix $U$ of the future input $u$ is known:
       $$U_{ij} = \int \varphi_i(u)\, \varphi_j(u)\, p(u)\, du$$
     Later, we will discuss the case when these assumptions do not hold.

  4. Subset Regression Models. For a subset $S$ of the indices $\{1, 2, \ldots, \mu\}$,
     $$\hat{f}_S(x) = \sum_{i \in S} [\hat{\theta}_S]_i\, \varphi_i(x)$$
     The parameter $\hat{\theta}_S$ is determined so that the training error $\sum_{m=1}^{M} \big(\hat{f}_S(x_m) - y_m\big)^2$ is minimized:
     $$\hat{\theta}_S = X_S y, \qquad X_S = (A_S^T A_S)^{-1} A_S^T$$
     where $[A_S]_{mi} = \varphi_i(x_m)$ if $i \in S$ and $0$ if $i \notin S$, and $y = (y_1, y_2, \ldots, y_M)^T$.
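The subset least-squares fit above can be sketched in a few lines of numpy. This is a minimal illustration, not the authors' code; the basis, target, and function names are illustrative choices.

```python
import numpy as np

def fit_subset(A, y, S):
    """Least-squares fit restricted to the basis functions indexed by S.

    A : (M, mu) design matrix, A[m, i] = phi_i(x_m)
    y : (M,) output values
    S : list of column indices defining the subset model
    Returns the full-length coefficient vector, zero outside S.
    """
    theta_S, *_ = np.linalg.lstsq(A[:, S], y, rcond=None)
    theta = np.zeros(A.shape[1])
    theta[S] = theta_S
    return theta

# Toy example: quadratic basis {1, x, x^2}, fit only the subset {1, x}.
x = np.linspace(-1.0, 1.0, 20)
A = np.stack([np.ones_like(x), x, x**2], axis=1)
y = 2.0 + 3.0 * x                      # noiseless linear target
theta = fit_subset(A, y, [0, 1])       # recovers (2, 3, 0)
```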

  5. Model Selection. Select the best subset $S$ of basis functions so that the generalization error $J_G$ is minimized:
     $$J_G = E \int \big(\hat{f}_S(u) - f(u)\big)^2 p(u)\, du$$
     However, $J_G$ includes the unknown target function $f(x)$. We derive an estimate of $J_G$ called the subspace information criterion (SIC), and the model is determined so that SIC is minimized.

  6. Key Idea: Unbiased Estimate. Let the largest model be $\hat{f}_u(x) = \sum_{i=1}^{\mu} [\hat{\theta}_u]_i\, \varphi_i(x)$, where $\hat{\theta}_u$ is the minimum-training-error estimate:
     $$\hat{\theta}_u = X_u y, \qquad X_u = (A^T A)^{-1} A^T, \qquad [A]_{mi} = \varphi_i(x_m), \qquad y = (y_1, y_2, \ldots, y_M)^T$$
     $\hat{\theta}_u$ is an unbiased estimate of the true parameter $\theta$: $E\, \hat{\theta}_u = \theta$, where $E$ denotes expectation over the noise. $\hat{\theta}_u$ is used for estimating the generalization error of $\hat{\theta}_S$.

  7. Bias / Variance Decomposition. With the weighted norm $\|\theta\|_U^2 = \theta^T U \theta$, where $U_{ij} = \int \varphi_i(u)\, \varphi_j(u)\, p(u)\, du$,
     $$J_G[S] = E \int \big(\hat{f}_S(u) - f(u)\big)^2 p(u)\, du = E\, \|\hat{\theta}_S - \theta\|_U^2 = \underbrace{\|E\hat{\theta}_S - \theta\|_U^2}_{\text{Bias } J_B[S]} + \underbrace{E\,\|\hat{\theta}_S - E\hat{\theta}_S\|_U^2}_{\text{Variance } J_V[S]}$$
     $E$ denotes expectation over the noise.
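The decomposition can be checked numerically by Monte Carlo. The sketch below uses $U = I$ for simplicity; the basis, target coefficients, and noise level are illustrative choices, not from the slides.

```python
import numpy as np

# Sketch: E||theta_S - theta||^2 = ||E theta_S - theta||^2 + E||theta_S - E theta_S||^2,
# estimated over repeated noisy samples (U = I for simplicity).
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 30)
A = np.stack([np.ones_like(x), x, x**2], axis=1)   # basis {1, x, x^2}
theta_true = np.array([1.0, 2.0, 0.5])
pinv_S = np.linalg.pinv(A[:, :2])                  # subset model S = {1, x}: biased

estimates = []
for _ in range(2000):
    y = A @ theta_true + rng.normal(0.0, 0.3, x.size)
    th = np.zeros(3)
    th[:2] = pinv_S @ y
    estimates.append(th)
estimates = np.array(estimates)

mean_est = estimates.mean(axis=0)                  # sample analogue of E theta_S
J_G = np.mean(np.sum((estimates - theta_true) ** 2, axis=1))
J_B = np.sum((mean_est - theta_true) ** 2)         # squared bias
J_V = np.mean(np.sum((estimates - mean_est) ** 2, axis=1))  # variance
# The sample versions satisfy J_G = J_B + J_V exactly.
```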

  8. Unbiased Estimate of Variance.
     $$J_V[S] = E\, \|\hat{\theta}_S - E\hat{\theta}_S\|_U^2 = \sigma^2\, \mathrm{trace}\big(U X_S X_S^T\big)$$
     Replacing $\sigma^2$ by the unbiased estimate $\hat{\sigma}^2 = \|A\hat{\theta}_u - y\|^2 / (M - \mu)$ gives
     $$\hat{J}_V[S] = \hat{\sigma}^2\, \mathrm{trace}\big(U X_S X_S^T\big), \qquad E\, \hat{J}_V = J_V$$
     ($E$: expectation over the noise)
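A minimal numpy sketch of the two quantities on this slide, assuming the definitions above; the helper names and toy data are illustrative, not from the slides.

```python
import numpy as np

def noise_variance(A, y):
    """Unbiased noise-variance estimate from the largest model:
    sigma^2_hat = ||A theta_u - y||^2 / (M - mu)."""
    M, mu = A.shape
    theta_u, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = A @ theta_u - y
    return (r @ r) / (M - mu)

def variance_term(U, X_S, sigma2):
    """J_V_hat[S] = sigma^2_hat * trace(U X_S X_S^T)."""
    return sigma2 * np.trace(U @ X_S @ X_S.T)

# Toy check: noiseless data lying in the span of the basis
# should give a noise-variance estimate of (nearly) zero.
x = np.linspace(-1.0, 1.0, 12)
A = np.stack([np.ones_like(x), x, x**2], axis=1)
y = 1.0 - 0.5 * x                                   # inside the span
sigma2 = noise_variance(A, y)

X_S = np.zeros((3, x.size))
X_S[:2, :] = np.linalg.pinv(A[:, :2])               # learning matrix of subset {1, x}
jv = variance_term(np.eye(3), X_S, 0.3)             # positive for any nonzero X_S
```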

  9. Unbiased Estimate of Bias. Let $z = (f(x_1), f(x_2), \ldots, f(x_M))^T$, $\epsilon = (\epsilon_1, \epsilon_2, \ldots, \epsilon_M)^T$, and $X_0 = X_S - X_u$, so that $\hat{\theta}_S - \hat{\theta}_u = X_0 z + X_0 \epsilon$. Since $E\hat{\theta}_u = \theta$, the difference $\hat{\theta}_S - \hat{\theta}_u$ serves as a rough estimate of the bias $E\hat{\theta}_S - \theta$; correcting for the noise contribution $E\,\|X_0 \epsilon\|_U^2 = \sigma^2\, \mathrm{trace}(U X_0 X_0^T)$ yields
     $$\hat{J}_B[S] = \|\hat{\theta}_S - \hat{\theta}_u\|_U^2 - \hat{\sigma}^2\, \mathrm{trace}\big(U X_0 X_0^T\big), \qquad E\, \hat{J}_B = J_B$$

  10. Subspace Information Criterion (SIC).
      $$\mathrm{SIC}[S] = \hat{J}_B[S] + \hat{J}_V[S] = \|\hat{\theta}_S - \hat{\theta}_u\|_U^2 - \hat{\sigma}^2\, \mathrm{trace}\big(U X_0 X_0^T\big) + \hat{\sigma}^2\, \mathrm{trace}\big(U X_S X_S^T\big)$$
      where $X_0 = X_S - X_u$. SIC is an unbiased estimate of the generalization error $J_G$ with finite samples:
      $$E\, \mathrm{SIC}[S] = J_G[S] = E \int \big(\hat{f}_S(u) - f(u)\big)^2 p(u)\, du$$
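Putting the pieces together, the criterion can be sketched end to end. This follows the formula above but is my own illustration (pseudo-inverses for the learning matrices, $U = I$, and a toy cubic basis are assumptions, not part of the slides).

```python
import numpy as np

def sic(A, y, U, S):
    """SIC[S] = ||theta_S - theta_u||^2_U
               - sigma^2_hat * tr(U X_0 X_0^T)   (bias estimate)
               + sigma^2_hat * tr(U X_S X_S^T),  (variance estimate)
    with X_0 = X_S - X_u.  A sketch of the slide's formula."""
    M, mu = A.shape
    X_u = np.linalg.pinv(A)                 # learning matrix of the largest model
    X_S = np.zeros((mu, M))
    X_S[S, :] = np.linalg.pinv(A[:, S])     # subset learning matrix, zero-padded
    theta_u = X_u @ y
    theta_S = X_S @ y
    r = A @ theta_u - y
    sigma2 = (r @ r) / (M - mu)             # unbiased noise-variance estimate
    d = theta_S - theta_u
    X0 = X_S - X_u
    bias_hat = d @ U @ d - sigma2 * np.trace(U @ X0 @ X0.T)
    var_hat = sigma2 * np.trace(U @ X_S @ X_S.T)
    return bias_hat + var_hat

# Toy usage: score nested subsets and pick the one with the smallest SIC.
rng = np.random.default_rng(1)
x = np.linspace(-1.0, 1.0, 40)
A = np.stack([np.ones_like(x), x, x**2, x**3], axis=1)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.1, x.size)     # linear target + noise
scores = {tuple(S): sic(A, y, np.eye(4), S)
          for S in ([0, 1], [0, 1, 2], [0, 1, 2, 3])}
```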

  11. When Assumptions Do Not Hold (1). When the target function $f(x)$ is not included in the span of $\{\varphi_i(x)\}_{i=1}^{\mu}$:
      $$J_G = E \int \big(\hat{f}_S(u) - f(u)\big)^2 p(u)\, du = E\, \|\hat{\theta}_S - \theta^*\|_U^2 + \text{const.}$$
      where $\theta^*$ gives the best approximation of $f$ in the span of $\{\varphi_i(x)\}_{i=1}^{\mu}$. SIC is an asymptotically unbiased estimate of $E\,\|\hat{\theta}_S - \theta^*\|_U^2$:
      $$E\, \mathrm{SIC} \to E\, \|\hat{\theta}_S - \theta^*\|_U^2 \quad \text{as } M \to \infty$$
      ($M$: number of training samples)

  12. When Assumptions Do Not Hold (2). When the correlation matrix $U_{ij} = \int \varphi_i(u)\, \varphi_j(u)\, p(u)\, du$ is not available:
      - If unlabeled samples $\{u_m\}_{m=1}^{M'}$ (samples without output values) are available, use
        $$\hat{U}_{ij} = \frac{1}{M'} \sum_{m=1}^{M'} \varphi_i(u_m)\, \varphi_j(u_m)$$
      - If the training samples $\{x_m\}_{m=1}^{M}$ are used instead, SIC essentially agrees with Mallows's $C_P$.
      - Otherwise, just use the identity matrix: $\hat{U} = I$.
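The empirical correlation matrix from unlabeled samples is straightforward to compute. A sketch, assuming a trigonometric basis and uniform $p(u)$ on $[-\pi, \pi]$ (illustrative choices), for which the true matrix is $\mathrm{diag}(1, 1/2, 1/2)$:

```python
import numpy as np

def empirical_U(basis, u_samples):
    """U_hat_{ij} = (1/M') sum_m phi_i(u_m) phi_j(u_m), from unlabeled inputs."""
    Phi = np.stack([phi(u_samples) for phi in basis], axis=1)   # (M', mu)
    return (Phi.T @ Phi) / len(u_samples)

# Basis {1, sin u, cos u}; with u uniform on [-pi, pi] the true
# correlation matrix is diag(1, 1/2, 1/2), which U_hat approaches.
basis = [lambda u: np.ones_like(u), np.sin, np.cos]
rng = np.random.default_rng(0)
u = rng.uniform(-np.pi, np.pi, 100_000)
U_hat = empirical_U(basis, u)
```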

  13. Computer Simulation.
      - Target function: $f(x) = \frac{1}{10} \sum_{p=1}^{50} (\sin px + \cos px)$
      - $x_m$: randomly created in $[-\pi, \pi]$
      - $y_m = f(x_m) + \epsilon_m$, with $\epsilon_m$ subject to $N(0, \sigma^2)$
      - Basis functions ($\mu = 201$): $\{1, \sin px, \cos px\}_{p=1}^{100}$
      - Compared models: $\{S_0, S_{10}, \ldots, S_{100}\}$, where $S_n = \{1, \sin px, \cos px\}_{p=1}^{n}$
      - Error: $\frac{1}{2\pi} \int_{-\pi}^{\pi} \big(\hat{f}(u) - f(u)\big)^2 du$
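The data-generating setup above can be sketched as follows. The target formula is my reconstruction of the garbled slide text, and the function names and seed are illustrative assumptions.

```python
import numpy as np

def target(x):
    """Reconstructed slide-13 target:
    f(x) = (1/10) * sum_{p=1}^{50} (sin(p x) + cos(p x))."""
    p = np.arange(1, 51)
    px = np.outer(np.atleast_1d(x), p)
    return (np.sin(px) + np.cos(px)).sum(axis=1) / 10.0

def make_training_set(M, sigma2, seed=0):
    """x_m uniform on [-pi, pi]; y_m = f(x_m) + N(0, sigma2) noise."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-np.pi, np.pi, M)
    y = target(x) + rng.normal(0.0, np.sqrt(sigma2), M)
    return x, y

# One of the slide-14 settings: M = 250 training samples, noise variance 0.6.
x, y = make_training_set(250, 0.6)
```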

  14. Compared Methods.
      - SIC
      - Mallows's $C_P$
      - Leave-one-out cross-validation (CV)
      - Akaike's information criterion (AIC)
      - Sugiura's corrected AIC (cAIC)
      - Schwarz's Bayesian information criterion (BIC)
      - Vapnik's measure (VM)
      Simulations are performed 100 times for each of $(M, \sigma^2) = (250, 0.6), (500, 0.6), (250, 0.2), (500, 0.2)$, where $M$ is the number of training samples and $\sigma^2$ the noise variance.

  15. Easiest Case (Curve). [Figure: learned curves for $(M, \sigma^2) = (250, 0.6), (500, 0.6), (250, 0.2), (500, 0.2)$.]

  16. Easiest Case (Error). [Figure: error distributions for $(M, \sigma^2) = (250, 0.6), (500, 0.6), (250, 0.2), (500, 0.2)$.]

  17. Hardest Case. [Figure: results for $(M, \sigma^2) = (250, 0.6), (500, 0.6), (250, 0.2), (500, 0.2)$.]

  18. Summary of Simulations. [Table: for each criterion (SIC, $C_P$, CV, AIC, cAIC, BIC, VM) and each setting of $M$ and $\sigma^2$, the outcome is classified as "works well", "selects smaller models", or "selects larger models". $M$: number of training samples; $\sigma^2$: noise variance.]

  19. When $f(x)$ Is Not in the Span of $\{\varphi_i(x)\}_{i=1}^{\mu}$. Interpolation of a chaotic series: similar results were obtained.

  20. Conclusions.
      - We proposed a new model selection criterion called the subspace information criterion (SIC).
      - SIC gives an unbiased estimate of the generalization error.
      - Computer simulations showed that SIC works well with small samples and large noise.
