Learning Theory and Model Selection
Weinan Zhang Shanghai Jiao Tong University http://wnzhang.net CS420, Machine Learning, Lecture 10
http://wnzhang.net/teaching/cs420/index.html
Content
- Learning Theory
- Bias-Variance Decomposition
Machine Learning Theory
Computational learning theory analyzes learning problems and specific algorithms in terms of computational complexity and sample complexity, i.e. the number of training samples sufficient to learn hypotheses of a given accuracy.
A typical generalization error bound:
\epsilon(d, N, \delta) = \sqrt{ \frac{1}{2N} \left( \log d + \log \frac{1}{\delta} \right) }
where \epsilon is the error, N is the number of training samples, d is the size of the hypothesis space, and 1 - \delta is the probability that the bound on the hypothesis holds (probability of correctness).
Underfitting: the model or algorithm cannot capture the underlying trend of the data.
Overfitting: the model describes random noise instead of the underlying relationship.
[Figure: Linear model: underfitting; 4th-order model: well fitting; 15th-order model: overfitting]
Regularization helps prevent the model from overfitting the data.
\min_\theta \frac{1}{N} \sum_{i=1}^{N} L(y_i, f_\theta(x_i)) + \lambda \Omega(\theta)
Bias-Variance Decomposition
Assume the data is generated as Y = f(X) + \epsilon, with noise \epsilon \sim N(0, \sigma_\epsilon^2).
The expected prediction error at a point x_0:
Err(x_0) = E[(Y - \hat{f}(X))^2 \mid X = x_0]
         = E[(\epsilon + f(x_0) - \hat{f}(x_0))^2]
         = E[\epsilon^2] + \underbrace{E[2\epsilon (f(x_0) - \hat{f}(x_0))]}_{=0} + E[(f(x_0) - \hat{f}(x_0))^2]
         = \sigma_\epsilon^2 + E[(f(x_0) - E[\hat{f}(x_0)] + E[\hat{f}(x_0)] - \hat{f}(x_0))^2]
         = \sigma_\epsilon^2 + E[(f(x_0) - E[\hat{f}(x_0)])^2] + E[(E[\hat{f}(x_0)] - \hat{f}(x_0))^2] - 2 E[(f(x_0) - E[\hat{f}(x_0)])(E[\hat{f}(x_0)] - \hat{f}(x_0))]
         = \sigma_\epsilon^2 + E[(f(x_0) - E[\hat{f}(x_0)])^2] + E[(E[\hat{f}(x_0)] - \hat{f}(x_0))^2] - \underbrace{2\,(f(x_0) E[\hat{f}(x_0)] - f(x_0) E[\hat{f}(x_0)] - E[\hat{f}(x_0)]^2 + E[\hat{f}(x_0)]^2)}_{=0}
         = \sigma_\epsilon^2 + (E[\hat{f}(x_0)] - f(x_0))^2 + E[(\hat{f}(x_0) - E[\hat{f}(x_0)])^2]
         = \sigma_\epsilon^2 + Bias^2(\hat{f}(x_0)) + Var(\hat{f}(x_0))
Here \sigma_\epsilon^2 is the observation noise (irreducible error); the first cross term vanishes because \epsilon is independent of \hat{f} and E[\epsilon] = 0.
Err(x_0) = \sigma_\epsilon^2 + Bias^2(\hat{f}(x_0)) + Var(\hat{f}(x_0)), \quad where Y = f(X) + \epsilon, \ \epsilon \sim N(0, \sigma_\epsilon^2)
Bias: how far away the expected prediction is from the truth.
Variance: how uncertain the prediction is (given different training settings, e.g. data and initialization).
[Figure: true f(x) vs. fitted \hat{f}(x) under a simple and a complex model]

                 Simple model    Complex model
Bias             High            Low
Variance         Low             High
Regularization   High            Low
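To make the decomposition concrete, here is a minimal simulation sketch (my own, not from the slides) that estimates Bias^2 and Var of polynomial fits at a test point; the true function, noise level, and polynomial degrees are illustrative assumptions echoing the under/overfitting figure:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)    # assumed "true" function f
sigma = 0.3                            # noise std (irreducible error)
x0, n_train, n_repeats = 0.5, 30, 500  # test point, sample size, # training sets

for degree in (1, 4, 15):
    preds = []
    for _ in range(n_repeats):
        # Draw a fresh training set: Y = f(X) + eps
        x = rng.uniform(0, 1, n_train)
        y = f(x) + rng.normal(0, sigma, n_train)
        coef = np.polyfit(x, y, degree)     # least-squares polynomial fit
        preds.append(np.polyval(coef, x0))  # prediction f_hat(x0)
    preds = np.array(preds)
    bias2 = (preds.mean() - f(x0)) ** 2     # Bias^2(f_hat(x0))
    var = preds.var()                       # Var(f_hat(x0))
    print(f"degree={degree:2d}  bias^2={bias2:.4f}  var={var:.4f}")
```

Low-degree fits show high bias and low variance; high-degree fits show the reverse.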
Figures provided by Max Welling
Model complexity and regularization control the trade-off between bias and variance.
Figures provided by Max Welling
Bias-variance decomposition with regularization:
[Figure: schematic of the bias-variance trade-off. The truth lies in the full MODEL SPACE; the closest fit in population differs from the truth by the model bias. Restricting to a RESTRICTED MODEL SPACE (regularization) adds estimation bias between the closest fit and the regularized fit, while estimation variance scatters each realization around it.]
Slide credit: Liqing Zhang
Learning Theory
- Empirical Risk Minimization
- Finite Hypothesis Space
- Infinite Hypothesis Space
We train the model over the whole training data, and the model is then used on test data.
[Diagram: Raw Data → Data Formalization → Training Data → Model → Evaluation on Test Data]
Key question: does the learned model have good prediction capacity on unobserved data?
Expected risk over the joint data distribution p(x, y):
R(f) = E[L(Y, f(X))] = \int_{\mathcal{X} \times \mathcal{Y}} L(y, f(x))\, p(x, y)\, dx\, dy
Empirical risk over the training data:
\hat{R}(f) = \frac{1}{N} \sum_{i=1}^{N} L(y_i, f(x_i))
Theorem (finite hypothesis space). Let F = \{f_1, f_2, \ldots, f_d\}. For any function f \in F, with probability no less than 1 - \delta, it satisfies
R(f) \le \hat{R}(f) + \epsilon(d, N, \delta), \quad where \ \epsilon(d, N, \delta) = \sqrt{ \frac{1}{2N} \left( \log d + \log \frac{1}{\delta} \right) }
See Section 1.7 of Dr. Hang Li's textbook.
Hoeffding's inequality. Let X_1, X_2, \ldots, X_N be bounded independent random variables with X_i \in [a, b], and let the average variable be Z = \frac{1}{N} \sum_{i=1}^{N} X_i. Then the following inequalities hold:
P(Z - E[Z] \ge t) \le \exp\left( \frac{-2Nt^2}{(b - a)^2} \right), \qquad P(E[Z] - Z \ge t) \le \exp\left( \frac{-2Nt^2}{(b - a)^2} \right)
http://cs229.stanford.edu/extra-notes/hoeffding.pdf
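As a sanity check (my own sketch, not from the slides), one can compare the empirical tail probability of a sample mean against the Hoeffding bound for Bernoulli variables bounded in [0, 1]:

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, t, trials = 100, 0.5, 0.1, 100_000

# Z is the mean of N independent Bernoulli(p) variables, each in [0, 1]
Z = rng.binomial(N, p, size=trials) / N
empirical = np.mean(Z - p >= t)   # P(Z - E[Z] >= t), Monte Carlo estimate
bound = np.exp(-2 * N * t**2)     # Hoeffding: exp(-2Nt^2/(b-a)^2) with b-a=1
print(f"empirical tail: {empirical:.4f}  Hoeffding bound: {bound:.4f}")
```

The empirical tail (about 0.03 here) indeed stays below the bound (about 0.14).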
For a single function f with loss values bounded in [0, 1], Hoeffding's inequality gives, for any \epsilon > 0,
P(R(f) - \hat{R}(f) \ge \epsilon) \le \exp(-2N\epsilon^2)
For the finite hypothesis space F = \{f_1, f_2, \ldots, f_d\}, since 0 \le R(f) \le 1, the union bound gives
P(\exists f \in F : R(f) - \hat{R}(f) \ge \epsilon) = P\left( \bigcup_{f \in F} \{ R(f) - \hat{R}(f) \ge \epsilon \} \right) \le \sum_{f \in F} P(R(f) - \hat{R}(f) \ge \epsilon) \le d \exp(-2N\epsilon^2)
Equivalently,
P(\forall f \in F : R(f) - \hat{R}(f) < \epsilon) \ge 1 - d \exp(-2N\epsilon^2)
Setting \delta = d \exp(-2N\epsilon^2) \iff \epsilon = \sqrt{ \frac{1}{2N} \log \frac{d}{\delta} }
The generalization error is bounded with probability at least 1 - \delta:
P(R(f) < \hat{R}(f) + \epsilon) \ge 1 - \delta
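A quick numeric reading of the bound (a sketch of mine; the values of d, N, and \delta are illustrative):

```python
import math

def eps_bound(d, N, delta):
    """epsilon(d, N, delta) = sqrt((log d + log(1/delta)) / (2N))."""
    return math.sqrt((math.log(d) + math.log(1 / delta)) / (2 * N))

# The gap shrinks as N grows and widens only slowly as the hypothesis
# space grows (log d), which is why the bound is useful at all.
for d, N in [(10, 100), (10, 10_000), (10**6, 10_000)]:
    print(f"d={d:>7}, N={N:>6}: R(f) <= R_hat(f) + {eps_bound(d, N, 0.05):.4f}")
```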
Infinite Hypothesis Space
Hypothesis spaces parameterized by real numbers actually contain an infinite number of functions, e.g.
f(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2
f(x) = \sigma(W_3 (W_2 \tanh(W_1 x + b_1) + b_2) + b_3)
An intuitive argument: suppose the hypothesis space is parameterized by m real numbers, each stored as a 64-bit double floating point number. Then the space contains at most 2^{64m} different hypotheses, and (treating \log 2^{64m} = 64m \log 2 \approx 64m up to the constant \log 2):
\epsilon(d, N, \delta) = \sqrt{ \frac{1}{2N} \left( \log d + \log \frac{1}{\delta} \right) } \ \Rightarrow\ \epsilon(d, N, \delta) = \sqrt{ \frac{1}{2N} \left( 64m + \log \frac{1}{\delta} \right) } \ \Rightarrow\ N = \frac{1}{2\epsilon^2} \left( 64m + \log \frac{1}{\delta} \right) = O_{\epsilon,\delta}(m)
To acquire generalization error no higher than \epsilon with probability at least 1 - \delta, we need N training samples with
N \ge \frac{1}{2\epsilon^2} \left( 64m + \log \frac{1}{\delta} \right) = O_{\epsilon,\delta}(m)
i.e. the required sample size grows linearly with the number of model parameters.
- For 1-dimensional linear regression f(x) = \theta_0 + \theta_1 x, we normally need around 10 points to fit a straight line with some confidence.
- For 2-dimensional linear regression f(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2, we normally need around 20 points to fit a hyperplane with some confidence.
- For 1-million-dimensional linear regression f(x) = \theta_0 + \sum_{i=1}^{10^6} \theta_i x_i, we normally need around 10 million points to fit the hyperplane with some confidence.
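Plugging numbers into the double-precision argument (a rough sketch; \epsilon and \delta are my illustrative choices, and the 64m term makes the constant far more pessimistic than the "10 points per parameter" rule of thumb above):

```python
import math

def n_samples(m, eps, delta):
    """N >= (64*m + log(1/delta)) / (2*eps^2), the finite-precision bound."""
    return math.ceil((64 * m + math.log(1 / delta)) / (2 * eps**2))

# Sample complexity grows linearly with the number of parameters m.
for m in (2, 3, 10**6 + 1):
    print(f"m={m:>8}: N >= {n_samples(m, eps=0.1, delta=0.05):,}")
```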
Such high dimensionality is common with one-hot encoded categorical features, e.g.
x = [Weekday=Friday, Gender=Male, City=Shanghai, …]
encoded as the sparse binary vector
x = [0,0,0,0,1,0,0, 0,1, 0,0,1,0,…,0, …]
In sparse (libSVM-style) format, each line lists the label followed by index:value pairs:
1 5:1 9:1 12:1 45:1 154:1 509:1 4089:1 45314:1 988576:1
0 2:1 7:1 18:1 34:1 176:1 510:1 3879:1 71310:1 818034:1
…
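A minimal sketch (my own, with a hypothetical vocabulary) of turning categorical fields into sparse index:value rows like those above:

```python
# Hypothetical feature vocabulary: each (field, value) pair gets a global index.
vocab = {("Weekday", "Friday"): 5, ("Gender", "Male"): 9, ("City", "Shanghai"): 12}

def one_hot_sparse(instance):
    """Map {field: value} to sorted libSVM-style 'index:1' pairs."""
    idxs = sorted(vocab[(f, v)] for f, v in instance.items() if (f, v) in vocab)
    return " ".join(f"{i}:1" for i in idxs)

x = {"Weekday": "Friday", "Gender": "Male", "City": "Shanghai"}
print("1", one_hot_sparse(x))  # "1 5:1 9:1 12:1" for a positive-label instance
```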
Shattering: a model class shatters a set of data points x^{(1)}, x^{(2)}, \ldots, x^{(n)} if, for every possible labeling over those points, there exists a model in that class that obtains zero training error.
[Figure: a three-point set in the plane; the linear model class shatters it]
The larger the sets that can be shattered, the more expressive the hypothesis space is, i.e. the less biased.
VC (Vapnik-Chervonenkis) dimension: the VC dimension of a hypothesis space H defined over instance space X is the size of the largest finite subset of X shattered by H. If arbitrarily large finite subsets can be shattered, then VC(H) = \infty.
If there exists at least one subset of size d that can be shattered, then VC(H) \ge d. If no subset of size d can be shattered, then VC(H) < d.
[Photos: Vladimir Vapnik and Alexey Chervonenkis]
Slide credit: Ray Mooney
Three non-collinear points in the plane can be shattered by linear models: all 2^3 = 8 possible labelings can be separated. Three points lying on a straight line can NOT be shattered. Since shattering only requires some subset of each size, the VC dimension of 2D linear models is at least 3.
Example: axis-parallel rectangles in the real plane, i.e. conjunctions of intervals on two real-valued features. Some sets of 4 instances can be shattered; other sets of 4 instances cannot.
No set of 5 instances can be shattered: there are at most 4 distinct extreme points (min and max on each of the 2 dimensions), and any rectangle including these 4 must also include any possible 5th point. Hence the VC dimension of this class is 4.
Slide credit: Ray Mooney
Sample complexity from VC dimension: the following number of examples has been shown to be sufficient for PAC learning (Blumer et al., 1989):
N = \frac{1}{\epsilon} \left( 4 \log_2 \frac{2}{\delta} + 8\, VC(H) \log_2 \frac{13}{\epsilon} \right)
Compared to the finite hypothesis space bound
N = \frac{1}{2\epsilon^2} \left( \log |H| + \log \frac{1}{\delta} \right)
it has some extra constants and an extra \log_2(1/\epsilon) factor. Since VC(H) \le \log_2 |H|, it can provide a tighter upper bound on the number of examples needed for PAC learning.
Slide credit: Ray Mooney
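Comparing the two bounds numerically (my own sketch; \epsilon and \delta are illustrative, and VC(H) = 3 matches the 2D linear model discussed next, whose 3 parameters would cost 64 x 3 = 192 bits in the finite-precision argument):

```python
import math

def n_vc(vc, eps, delta):
    """Blumer et al. (1989): sufficient N for PAC learning from VC dimension."""
    return math.ceil((4 * math.log2(2 / delta) + 8 * vc * math.log2(13 / eps)) / eps)

def n_finite(size_h, eps, delta):
    """Finite hypothesis space: N = (log|H| + log(1/delta)) / (2*eps^2)."""
    return math.ceil((math.log(size_h) + math.log(1 / delta)) / (2 * eps**2))

eps, delta = 0.1, 0.05
print("VC bound with VC(H)=3:        ", n_vc(3, eps, delta))
print("Finite bound with |H|=2^192:  ", n_finite(2**192, eps, delta))
```

Here the VC-based bound is several times smaller than the finite-precision bound.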
A natural measure of complexity? The VC dimension of linear models in d dimensions is d + 1, which is almost identical to the number of parameters needed to define a hyperplane, e.g.
f(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2
in 2 dimensions has 3 parameters.
However, VC dimension can be unrelated to the number of parameters: the hypothesis class
h(x) = \sin(ax + b)
can shatter any random set of 1D data points, and thus has infinite VC dimension despite having only two real parameters.
Large deep neural networks also have (effectively) infinite VC dimension: they can fit randomly labeled images, and even random-noise images, driving the model to zero training loss.
Zhang, Chiyuan, et al. "Understanding deep learning requires rethinking generalization." arXiv preprint arXiv:1611.03530 (2016).
Cross Validation
Randomly split the original training data into training data and validation data:
[Diagram: Original Training Data → Random Split → Training Data + Validation Data → Model → Evaluation]

K-fold Cross Validation
1. Set the hyperparameters.
2. For K times repeat: train the model on K-1 folds of the dataset and evaluate it on the remaining fold, leading to an evaluation score.
[Figure: the data split into folds 1 2 3 4 5, each fold held out in turn]
3. Average the K evaluation scores as the model performance. (A minimal sketch of the procedure follows below.)
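The promised sketch of K-fold cross validation (mine, not from the slides; `train` and `evaluate` are generic placeholder functions, numpy only):

```python
import numpy as np

def k_fold_cv(X, y, train, evaluate, k=5, seed=0):
    """Steps 1-3: split into k folds, train on k-1, score the held-out fold."""
    idx = np.random.default_rng(seed).permutation(len(X))
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        val = folds[i]                                  # held-out fold
        trn = np.concatenate([f for j, f in enumerate(folds) if j != i])
        model = train(X[trn], y[trn])                   # fit on the other k-1 folds
        scores.append(evaluate(model, X[val], y[val]))  # score on the held-out fold
    return float(np.mean(scores))                       # average of the k scores

# Usage with a trivial 1D least-squares line as the "model":
X = np.linspace(0, 1, 50)
y = 2 * X + 0.1 * np.random.default_rng(1).normal(size=50)
train = lambda X, y: np.polyfit(X, y, 1)
evaluate = lambda m, X, y: -np.mean((np.polyval(m, X) - y) ** 2)  # negative MSE
print("CV score:", k_fold_cv(X, y, train, evaluate, k=5))
```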
After selecting the hyperparameters, train the model over the whole training data; the model can then be used on test data.
[Diagram: Raw Data → Data Formalization → Training Data → Model → Evaluation on Test Data]
How about the model performance, i.e. its generalization ability?
Feature engineering examples for images: SIFT, Spin image, HoG, RIFT, Textons, GLOH.
Feature engineering example for text (bag of words):
"SJTU is a public research university in Shanghai, China, established in 1896. Now it is one of C9 universities in China."
→ SJTU:1, is:2, a:1, public:1, research:1, university:2, in:3, Shanghai:1, China:2, establish:1, 1896:1, now:1, it:1, one:1, …
Each instance is formalized into a high-dimensional vector.
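A minimal sketch reproducing the counts above (simple lowercasing and tokenizing; note the slide's counts also merge inflected forms like "university/universities" and "established/establish", which plain counting does not):

```python
import re
from collections import Counter

text = ("SJTU is a public research university in Shanghai, China, established "
        "in 1896. Now it is one of C9 universities in China.")

# Tokenize: lowercase and keep alphanumeric runs; no stemming applied.
tokens = re.findall(r"[a-z0-9]+", text.lower())
counts = Counter(tokens)
print(counts.most_common(8))  # e.g. [('in', 3), ('is', 2), ('china', 2), ...]
```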
With high-dimensional data we need more samples to train a reliable model, i.e. to keep the generalization error small:
N \ge \frac{1}{2\epsilon^2} \left( 64m + \log \frac{1}{\delta} \right) = O_{\epsilon,\delta}(m)
Err(x_0) = \sigma_\epsilon^2 + Bias^2(\hat{f}(x_0)) + Var(\hat{f}(x_0))
L2 regularization:
\min_\theta \frac{1}{N} \sum_{i=1}^{N} L(y_i, f_\theta(x_i)) + \lambda \|\theta\|_2^2
L1 regularization:
\min_\theta \frac{1}{N} \sum_{i=1}^{N} L(y_i, f_\theta(x_i)) + \lambda \|\theta\|_1
\Omega(\theta) = \|\theta\|_2^2 = \sum_{m=1}^{M} \theta_m^2, \qquad \Omega(\theta) = \|\theta\|_1 = \sum_{m=1}^{M} |\theta_m|
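A small illustration (mine, not from the slides) of the two penalties and their (sub)gradients, which is where L1's sparsity-inducing behavior comes from:

```python
import numpy as np

theta = np.array([0.5, -2.0, 0.0, 3.0])

l2_penalty = np.sum(theta ** 2)     # Omega(theta) = ||theta||_2^2
l2_grad = 2 * theta                 # gradient shrinks each weight proportionally

l1_penalty = np.sum(np.abs(theta))  # Omega(theta) = ||theta||_1
l1_subgrad = np.sign(theta)         # constant-magnitude pull toward 0 -> sparsity

print(l2_penalty, l1_penalty)       # 13.25 5.5
```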
Dimensionality reduction methods (using inputs only):
            Linear                        Non-linear
Selection   Correlation between inputs    Mutual information between inputs
Projection  Principal component analysis  Sammon's mapping, self-organizing maps

Dimensionality reduction methods (using inputs and target):
            Linear                                               Non-linear
Selection   Correlation between inputs and target                Mutual information between inputs and target, greedy selection, genetic algorithms
Projection  Linear discriminant analysis, partial least squares  Multilayer perceptrons, auto-encoders, projection pursuit
Information gain: measures the information obtained for category prediction by knowing the presence or absence of a feature (term) t.
Yang, Yiming, and Jan O. Pedersen. "A comparative study on feature selection in text categorization." ICML. Vol. 97. 1997.
G(t) = -\sum_{i=1}^{m} P(c_i) \log P(c_i) + P(t) \sum_{i=1}^{m} P(c_i \mid t) \log P(c_i \mid t) + P(\bar{t}) \sum_{i=1}^{m} P(c_i \mid \bar{t}) \log P(c_i \mid \bar{t})
where c_1, \ldots, c_m are the categories, t denotes the presence of the term and \bar{t} its absence.
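A small sketch computing G(t) from document counts (hypothetical two-category corpus; the probabilities are maximum-likelihood estimates from the counts):

```python
import math

def plogp(p):
    return p * math.log(p) if p > 0 else 0.0

def info_gain(n_tc, n_c, n_t, N):
    """G(t) from counts: n_tc[i] = docs in category i containing t,
    n_c[i] = docs in category i, n_t = docs containing t, N = all docs."""
    g = -sum(plogp(c / N) for c in n_c)                 # -sum P(c) log P(c)
    g += (n_t / N) * sum(plogp(a / n_t) for a in n_tc)  # + P(t) sum P(c|t) log P(c|t)
    g += ((N - n_t) / N) * sum(plogp((c - a) / (N - n_t))
                               for c, a in zip(n_c, n_tc))  # + P(t_bar) term
    return g

# Hypothetical counts: 100 docs, 2 categories, term t appears in 30 docs.
print(info_gain(n_tc=[25, 5], n_c=[50, 50], n_t=30, N=100))
```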
Mutual information: measures the dependence between two random variables:
I(X; Y) = \sum_{y \in Y} \sum_{x \in X} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}
(I(X; Y) = 0 for independent random variables.)
For a term t and a category c, the pointwise mutual information is estimated from document counts as
I(t, c) = \log \frac{P(t, c)}{P(t)\, P(c)} \simeq \log \frac{A \times N}{(A + C) \times (A + B)}
where A is the number of documents containing t in category c, B the number containing t outside c, C the number in c not containing t, and N the total number of documents.
Yang, Yiming, and Jan O. Pedersen. "A comparative study on feature selection in text categorization." ICML. Vol. 97. 1997.
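The estimate in code (hypothetical counts consistent with the definitions above):

```python
import math

def pmi(A, B, C, N):
    """I(t,c) ~= log(A*N / ((A+C)*(A+B))); A, B, C, N as defined above."""
    return math.log(A * N / ((A + C) * (A + B)))

# Hypothetical: A=25 docs with t in c, B=5 with t outside c, C=25 in c without t.
print(f"I(t,c) = {pmi(A=25, B=5, C=25, N=100):.3f}")  # > 0: t and c co-occur often
```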
To score a term globally, combine the category-specific scores:
I_{avg}(t) = \sum_{i=1}^{m} P(c_i)\, I(t, c_i), \qquad I_{max}(t) = \max_{i=1}^{m} \{ I(t, c_i) \}
Yang, Yiming, and Jan O. Pedersen. "A comparative study on feature selection in text categorization." ICML. Vol. 97. 1997.
\chi^2 statistic: measures the dependence between a term and a category via the 2x2 contingency table:
\chi^2(t, c) = \frac{N \times (AD - CB)^2}{(A + C) \times (B + D) \times (A + B) \times (C + D)}
where D is the number of documents neither in c nor containing t. As before,
\chi^2_{avg}(t) = \sum_{i=1}^{m} P(c_i)\, \chi^2(t, c_i), \qquad \chi^2_{max}(t) = \max_{i=1}^{m} \{ \chi^2(t, c_i) \}
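The same contingency table in code (hypothetical counts; D completes the table):

```python
def chi2(A, B, C, D):
    """chi^2(t,c) = N*(AD - CB)^2 / ((A+C)(B+D)(A+B)(C+D))."""
    N = A + B + C + D
    return N * (A * D - C * B) ** 2 / ((A + C) * (B + D) * (A + B) * (C + D))

# Hypothetical: A=25, B=5, C=25, D=45 (N=100).
print(f"chi2(t,c) = {chi2(25, 5, 25, 45):.2f}")
```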
Feature selection results:
[Figure: kNN on the Reuters dataset (9,610 training documents, 3,662 test documents)]
[Figure: linear model on the Reuters dataset (9,610 training documents, 3,662 test documents)]
Yang, Yiming, and Jan O. Pedersen. "A comparative study on feature selection in text categorization." ICML. Vol. 97. 1997.
Occam's Razor and MDL
Consider hypotheses expressed (coded) in bits in some representation language. A learner that always returns the consistent hypothesis representable with the least number of bits follows the Minimum Description Length principle. If the target concept is representable with at most n bits, the sample complexity is bounded by:
\left( \log \frac{1}{\delta} + \log 2^n \right) / \epsilon = \left( \log \frac{1}{\delta} + n \log 2 \right) / \epsilon
Slide credit: Ray Mooney
In practice, we penalize complexity within a parametric model space \{ f_\theta(\cdot) \}:
\min_\theta \underbrace{ \frac{1}{N} \sum_{i=1}^{N} L(y_i, f_\theta(x_i)) }_{\text{Original loss}} + \underbrace{ \lambda\, \Omega(\theta) }_{\text{Penalty on assumptions}}
Slide credit: Ray Mooney
Hyperparameters control the model complexity, or capacity to learn. They cannot be updated by the model training process and need to be predefined.
Model selection: trying different hyperparameter values, training the corresponding models, and choosing the values that test better. Much of machine learning practice cares how to select the optimal hyperparameters.
For example, with L2 regularization:
\min_\theta \frac{1}{N} \sum_{i=1}^{N} L(y_i, f_\theta(x_i)) + \lambda \|\theta\|_2^2
the model parameter is \theta and the hyperparameter is \lambda.
Bayesian Model Selection
Posterior over parameters w given data D and model (hypothesis space) H:
p(w \mid D, H) = \frac{p(D \mid w, H)\, p(w \mid H)}{p(D \mid H)}
Posterior over models:
p(H \mid D) \propto p(D \mid H)\, p(H)
where the evidence (marginal likelihood) is
p(D \mid H) = \int_w p(D \mid w, H)\, p(w \mid H)\, dw
http://mlg.eng.cam.ac.uk/zoubin/papers/05occam/occam.pdf
[Figure: a simple model H_1 can only fit data in a narrow region C_1, where its evidence p(D|H_1) is high; a complex model H_2 can model data in a wider region, but assigns lower evidence to any particular dataset in C_1.]
http://rsta.royalsocietypublishing.org/content/371/1984/20110553
It is the evidence p(D \mid H) that results in an automatic Occam's razor: since \int_D p(D \mid H)\, dD = 1, a complex model that spreads probability over many possible datasets can only assign low probability to each of them, while a simple model concentrates its mass on the few datasets it can fit.
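A classic toy illustration of this automatic Occam's razor (my own sketch, not from the slides): compare the evidence of a simple model H1 (fair coin, no free parameters) against a flexible model H2 (unknown bias with a uniform prior) on coin-flip data. When the data look fair, the simpler model wins even though H2 contains it as a special case:

```python
from math import comb

def evidence_fair(k, n):
    """p(D|H1): fair coin, no parameters -> Binomial(k; n, 0.5)."""
    return comb(n, k) * 0.5**n

def evidence_uniform(k, n):
    """p(D|H2) = integral of Binomial(k; n, theta) over a Uniform(0,1) prior
    = C(n,k) * B(k+1, n-k+1) = 1/(n+1), a standard closed form."""
    return 1 / (n + 1)

for k, n in [(10, 20), (18, 20)]:
    e1, e2 = evidence_fair(k, n), evidence_uniform(k, n)
    winner = "H1 (fair)" if e1 > e2 else "H2 (unknown bias)"
    print(f"{k}/{n} heads: p(D|H1)={e1:.4f}  p(D|H2)={e2:.4f}  ->  {winner}")
```

With 10/20 heads the fair coin has higher evidence; with 18/20 heads the flexible model wins, exactly the trade-off the normalization argument predicts.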