1
A New Information Criterion for the Selection of Subspace Models

Masashi Sugiyama, Hidemitsu Ogawa
Department of Computer Science, Tokyo Institute of Technology, Japan
2
Function Approximation

Obtain the optimal approximation \hat{f}(x) to the target function f(x) using the training examples \{(x_m, y_m)\}_{m=1}^{M}, where

  y_m = f(x_m) + n_m

x_m: sample point, y_m: sample value, n_m: noise.

[Figure: target function f(x), learning result \hat{f}(x), and training examples (x_1, y_1), (x_2, y_2), (x_3, y_3).]
3
Model

Generally, function approximation is performed by estimating the parameters of a prefixed set of functions called a model, e.g.

  polynomial:             \hat{f}(x) = \sum_{n=0}^{N} a_n x^n

  3-layer neural network: \hat{f}(x) = \sum_{n=1}^{N} a_n \sigma(x; b_n)

The choice of the model complexity (e.g. the order of the polynomial, the number of units) is crucial for optimal generalization.
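As a small illustration of the polynomial model above (our sketch, not from the original slides; the name fit_polynomial and the use of NumPy are assumptions), the parameters a_0, ..., a_N can be estimated by ordinary least squares:

import numpy as np

def fit_polynomial(x, y, N):
    """Estimate a_0, ..., a_N in f_hat(x) = sum_{n=0}^{N} a_n x^n by least squares."""
    A = np.vander(x, N + 1, increasing=True)       # columns: 1, x, ..., x^N
    coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coeffs

Varying N here is exactly the model-complexity choice that the following slides address.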
4
Model Selection

Select the best model, i.e. the one providing the optimal generalization capability.

[Figure: target function and learning results for a simple model, an appropriate model, and a complex model.]
5
Motivation and goal

Most traditional model selection criteria do not work well when the number of training examples is small, e.g. AIC (Akaike, 1974), BIC (Schwarz, 1978), MDL (Rissanen, 1978), and NIC (Murata, Yoshizawa, & Amari, 1994).

Goal: devise a model selection criterion that works well even when the number of training examples is small.
6
Setting

f: target function (the learning target)
S: model (a family of functions f_\theta)
\hat{f}_\theta: learning result function obtained with model S
H: Hilbert space including S and f
E_n: expectation over the noise

From a set of models, select the model minimizing the generalization error E_n \|\hat{f}_\theta - f\|^2.

[Figure: relation among f, f_\theta, \hat{f}_\theta, S, and H.]
7
Least mean squares (LMS) learning

LMS learning is aimed at minimizing the training error

  \sum_{m=1}^{M} ( \hat{f}_\theta(x_m) - y_m )^2

The LMS learning result function \hat{f}_\theta is given as

  \hat{f}_\theta = X_\theta y,   X_\theta = \Big( \sum_{m=1}^{M} e_m \otimes K(\cdot, x_m) \Big)^{+}

where
  y = (y_1, y_2, \ldots, y_M)'
  e_m: m-th standard basis vector in C^M
  K(x, x'): reproducing kernel of S
  {}^{+}: Moore-Penrose generalized inverse
  \otimes: Neumann-Schatten product, (f \otimes g) h = \langle h, g \rangle f
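A minimal finite-dimensional sketch of LMS learning (our illustration, assuming the model is spanned by basis functions \phi_1, ..., \phi_k; the names lms_learning and Phi are hypothetical). The Moore-Penrose pseudoinverse of the design matrix plays the role of the operator ( \sum_m e_m \otimes K(\cdot, x_m) )^{+}:

import numpy as np

def lms_learning(Phi, y):
    """Phi[m, j] = phi_j(x_m); y[m] = y_m.

    Returns the coefficient vector of the LMS learning result, i.e. the
    minimum-norm least-squares solution f_hat = X_theta y with X_theta = Phi^+.
    """
    return np.linalg.pinv(Phi) @ y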
8
Assumptions (1)

The mean of the noise is zero.
The noise covariance matrix is given as \sigma^2 I.
\sigma^2 is generally unknown.
9
Assumptions (2)

One of the models gives an unbiased learning result \hat{f}_u:

  E_n \hat{f}_u = f,   \hat{f}_u = X_u y

K_H(x, x'): reproducing kernel of H
M: the number of training examples

Roughly speaking, \{K_H(\cdot, x_m)\}_{m=1}^{M} spans H if M \ge \dim(H).

If \{K_H(\cdot, x_m)\}_{m=1}^{M} spans H, then

  X_u = \Big( \sum_{m=1}^{M} e_m \otimes K_H(\cdot, x_m) \Big)^{+}
10

Generalization error and bias/variance

  E_n \|\hat{f}_\theta - f\|^2 = \|E_n \hat{f}_\theta - f\|^2 + E_n \|\hat{f}_\theta - E_n \hat{f}_\theta\|^2
   (generalization error)         (bias)                       (variance)

E_n: expectation over the noise

[Figure: relation among f, \hat{f}_\theta, and E_n \hat{f}_\theta, showing the bias and variance components of the generalization error.]
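For completeness, the decomposition follows from one standard expansion (added here; it is not spelled out on the slide):

\begin{align*}
E_n \|\hat{f}_\theta - f\|^2
 &= E_n \big\| (\hat{f}_\theta - E_n\hat{f}_\theta) + (E_n\hat{f}_\theta - f) \big\|^2 \\
 &= E_n \|\hat{f}_\theta - E_n\hat{f}_\theta\|^2
  + \|E_n\hat{f}_\theta - f\|^2
  + 2\,\mathrm{Re}\,\big\langle E_n\hat{f}_\theta - E_n\hat{f}_\theta,\; E_n\hat{f}_\theta - f \big\rangle
\end{align*}

The cross term vanishes because E_n(\hat{f}_\theta - E_n\hat{f}_\theta) = 0, leaving exactly the bias plus the variance.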
11

Estimation of bias

Since \hat{f}_u is unbiased (E_n \hat{f}_u = f), the bias of \hat{f}_\theta can be estimated from the difference between the two learning results:

  \|E_n \hat{f}_\theta - f\|^2 \approx \|\hat{f}_\theta - \hat{f}_u\|^2 - \sigma^2 \,\mathrm{tr}\big( (X_\theta - X_u)(X_\theta - X_u)^* \big)

where
  n = (n_1, n_2, \ldots, n_M)^T: noise vector
  \sigma^2: noise variance
  X^*: adjoint operator of X

The subtracted trace term removes the noise contribution to \|\hat{f}_\theta - \hat{f}_u\|^2, since E_n \|(X_\theta - X_u) n\|^2 = \sigma^2 \,\mathrm{tr}\big( (X_\theta - X_u)(X_\theta - X_u)^* \big).

[Figure: f = E_n \hat{f}_u; the distance between \hat{f}_\theta and \hat{f}_u approximates the bias.]
12

Estimation of noise variance

  \hat{\sigma}^2 = \frac{ \sum_{m=1}^{M} ( \hat{f}_u(x_m) - y_m )^2 }{ M - \dim(H) }

\hat{\sigma}^2 is an unbiased estimate of \sigma^2.

Substituting \hat{\sigma}^2 and adding the variance term yields an estimate of the generalization error:

  E_n \|\hat{f}_\theta - f\|^2 \approx \|\hat{f}_\theta - \hat{f}_u\|^2 - \hat{\sigma}^2 \,\mathrm{tr}( X X^* ) + \hat{\sigma}^2 \,\mathrm{tr}( X_\theta X_\theta^* )
   (generalization error)       (bias estimate)                                               (variance)

where X = X_\theta - X_u, \sigma^2: noise variance, X^*: adjoint operator of X.
13

Subspace Information Criterion (SIC)

From a set of models, select the model minimizing the following SIC:

  \mathrm{SIC} = \|\hat{f}_\theta - \hat{f}_u\|^2 - \hat{\sigma}^2 \,\mathrm{tr}( X X^* ) + \hat{\sigma}^2 \,\mathrm{tr}( X_\theta X_\theta^* ),   X = X_\theta - X_u

The model minimizing SIC is called the minimum SIC (MSIC) model. The MSIC model is expected to provide the optimal generalization capability.
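A minimal matrix sketch of computing SIC (our illustration, not the authors' code), assuming functions are represented by their coefficient vectors in an orthonormal basis of H, so the H-norm is the Euclidean norm and the adjoint X^* is the matrix transpose; sic_score and its arguments are hypothetical names:

import numpy as np

def sic_score(A, y, k):
    """SIC for the subspace model spanned by the first k basis functions.

    A : (M, d) design matrix, A[m, j] = j-th basis function of H at x_m (needs M > d)
    y : (M,)  sample values y_m = f(x_m) + n_m
    k : dimension of the candidate subspace model
    """
    M, d = A.shape
    # Learning operator of the unbiased (full) model: X_u = A^+ (Moore-Penrose)
    X_u = np.linalg.pinv(A)
    # Learning operator of the subspace model: LMS on the first k columns,
    # zero-padded so it maps into the same coefficient space as X_u
    X_th = np.zeros((d, M))
    X_th[:k, :] = np.linalg.pinv(A[:, :k])
    f_th, f_u = X_th @ y, X_u @ y
    # Unbiased noise-variance estimate: full-model residual over (M - dim H)
    sigma2 = np.sum((A @ f_u - y) ** 2) / (M - d)
    D = X_th - X_u
    # SIC = ||f_th - f_u||^2 - s^2 tr(D D*) + s^2 tr(X_th X_th*)
    return (np.sum((f_th - f_u) ** 2)
            - sigma2 * np.trace(D @ D.T)
            + sigma2 * np.trace(X_th @ X_th.T))

Model selection then amounts to evaluating sic_score for each candidate subspace dimension k and keeping the minimizer.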
14

Validity of SIC

SIC gives an unbiased estimate of the generalization error:

  E_n \,\mathrm{SIC} = E_n \|\hat{f}_\theta - f\|^2

E_n: expectation over the noise

cf. AIC gives an asymptotically unbiased estimate of the generalization error.

Hence SIC will work well even when the number of training examples is small.
15

Illustrative Simulation

Training examples: y_m = f(x_m) + n_m, with sample points x_m = \frac{2\pi m}{M} - \pi and additive Gaussian noise n_m.

Target function: a trigonometric polynomial

  f(x) = \sum_{k=1}^{5} ( a_k \sin kx + b_k \cos kx ),   a_k, b_k \in \{2, -2\}

S_N: Hilbert space spanned by \{1, \sin nx, \cos nx\}_{n=1}^{N}, defined on [-\pi, \pi]

Compared models: \{S_1, S_2, \ldots, S_{20}\}
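A hedged re-creation of this setup in Python (sic_score is the function from the earlier sketch; the coefficient signs, noise level, and seed below are stand-ins, not the slide's exact values):

import numpy as np

rng = np.random.default_rng(0)
M, N_max = 200, 20
x = 2 * np.pi * np.arange(1, M + 1) / M - np.pi   # equispaced sample points in (-pi, pi]
# stand-in signs for the slide's +/-2 coefficients
f = lambda t: sum(2 * np.sin(k * t) - 2 * np.cos(k * t) for k in range(1, 6))
y = f(x) + rng.normal(scale=0.5, size=M)          # y_m = f(x_m) + n_m (assumed noise level)

# orthonormal trigonometric basis of H = S_20 on [-pi, pi]: dim(H) = 2*20 + 1 = 41
cols = [np.ones(M) / np.sqrt(2 * np.pi)]
for n in range(1, N_max + 1):
    cols += [np.sin(n * x) / np.sqrt(np.pi), np.cos(n * x) / np.sqrt(np.pi)]
A = np.stack(cols, axis=1)

# model S_N uses the first 2N + 1 basis functions; keep the N minimizing SIC
scores = {N: sic_score(A, y, 2 * N + 1) for N in range(1, N_max + 1)}
print("MSIC model: S_%d" % min(scores, key=scores.get))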
16

Compared model selection criteria

- SIC
- Network information criterion (NIC) (Murata, Yoshizawa, & Amari, 1994), a generalized AIC

H: H = S_{20}, \dim(H) = 41

In this simulation, SIC and NIC are compared fairly.

  \mathrm{Error} = \|\hat{f} - f\|^2 = \frac{1}{2\pi} \int_{-\pi}^{\pi} ( \hat{f}(x) - f(x) )^2 \, dx
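The error measure above is easy to approximate numerically; a small sketch (generalization_error is our name), assuming f_hat and f are vectorized callables on [-\pi, \pi]:

import numpy as np

def generalization_error(f_hat, f, n_grid=10_000):
    """Trapezoidal approximation of (1/(2*pi)) * int_{-pi}^{pi} (f_hat - f)^2 dx."""
    x = np.linspace(-np.pi, np.pi, n_grid)
    return np.trapz((f_hat(x) - f(x)) ** 2, x) / (2 * np.pi)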
17

M = 200

MSIC model:    S_5  (Error 0.11)
MNIC model:    S_6  (Error 0.17)
Optimal model: S_5  (Error 0.11)
18

M = 100

MSIC model:    S_5  (Error 0.37)
MNIC model:    S_9  (Error 0.75)
Optimal model: S_5  (Error 0.37)
19

M = 50

MSIC model:    S_5   (Error 0.98)
MNIC model:    S_20  (Error 3.36)
Optimal model: S_5   (Error 0.98)

SIC works well even when M is small.
20

Unrealizable case

Estimate a chaotic series \{h_p\}_{p=1}^{200} from the sample values \{y_m\}_{m=1}^{M}, with M = 100.
21

Estimation of chaotic series

Consider the sample points x_p = -0.995 + \frac{2}{200}(p - 1) corresponding to the chaotic series \{h_p\}_{p=1}^{200}, i.e. h_p = f(x_p).

Then \hat{h}_p = \hat{f}\big( -0.995 + \frac{2}{200}(p - 1) \big) is an estimate of h_p, and

  \mathrm{Error} = \sum_{p=1}^{200} ( \hat{h}_p - h_p )^2

We perform the simulation 1000 times.
22

Compared model selection criteria

- SIC
- NIC (the log loss is adopted as the loss function; \{x_m\}_{m=1}^{M} are regarded as uniformly distributed)

S_N: Hilbert space spanned by \{x^n\}_{n=0}^{N}, defined on [-1, 1]
H: H = S_{40}, \dim(H) = 41

Compared models: \{S_{15}, S_{20}, S_{25}, S_{30}, S_{35}, S_{40}\}
23

M = 250

SIC: mean Error 0.0021
NIC: mean Error 0.0022
24

M = 150

SIC: mean Error 0.0058
NIC: mean Error 0.013
25

M = 50

SIC: mean Error 0.018
NIC: mean Error 0.040

SIC works well even when M is small.
26

Conclusions

- We proposed a new model selection criterion named the subspace information criterion (SIC).
- SIC gives an unbiased estimate of the generalization error.
- SIC works well even when the number of training examples is small.