

  1. Lecture 5: Regularization - ML Methodology. Aykut Erdem, February 2016, Hacettepe University

  2. Recall from last time… Linear Regression. Model: $y(x) = w_0 + w_1 x$, with weights $\mathbf{w} = (w_0, w_1)$. Squared-error loss: $\ell(\mathbf{w}) = \sum_{n=1}^{N} \left[ t^{(n)} - (w_0 + w_1 x^{(n)}) \right]^2$. Gradient Descent Update Rule: $\mathbf{w} \leftarrow \mathbf{w} + 2\lambda \left( t^{(n)} - y(x^{(n)}) \right) x^{(n)}$. Closed Form Solution: $\mathbf{w} = (\mathbf{X}^{\mathsf T}\mathbf{X})^{-1}\mathbf{X}^{\mathsf T}\mathbf{t}$ (equivalently $\mathbf{w} = (\Phi^{\mathsf T}\Phi)^{-1}\Phi^{\mathsf T}\mathbf{y}$ with a feature matrix $\Phi$). [Figure from Bishop: root-mean-square error $E_{\mathrm{RMS}}$ on the training and test sets versus polynomial order $M$.]
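A minimal numpy sketch of both fitting routes from this recap; the toy data, learning rate, and iteration counts are illustrative assumptions, not from the slides:

```python
import numpy as np

# Toy 1-D data: t = 1 + 2x plus noise (made-up values for illustration).
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=50)
t = 1.0 + 2.0 * x + rng.normal(0, 0.1, size=50)
X = np.column_stack([np.ones_like(x), x])       # design matrix with a bias column

# Closed-form solution: w = (X^T X)^{-1} X^T t
w_closed = np.linalg.solve(X.T @ X, X.T @ t)

# Stochastic gradient descent with the slide's update rule:
# w <- w + 2*lambda*(t^(n) - y(x^(n))) * x^(n), where lambda is the step size.
w = np.zeros(2)
lam = 0.01
for epoch in range(200):
    for n in range(len(x)):
        y_n = w @ X[n]                          # current prediction y(x^(n))
        w = w + 2 * lam * (t[n] - y_n) * X[n]

print(w_closed, w)                              # both should be close to (1, 2)
```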

  3. 1-D regression illustrates key concepts • Data fits - is the linear model best (model selection)? - The simplest models do not capture all the important variations (signal) in the data: they underfit - A more complex model may overfit the training data (fit not only the signal but also the noise in the data), especially if there is not enough data to constrain the model • One method of assessing fit: test generalization = the model's ability to predict the held-out data • Optimization is essential: stochastic and batch iterative approaches; analytic when available slide by Richard Zemel 3

  4. Today • Regularization • Machine Learning Methodology - validation - cross-validation (k-fold, leave-one-out) - model selection 4

  5. Regularization 5

  6. Regularized Least Squares • A technique to control the overfitting phenomenon • Add a penalty term to the error function in order to discourage the coefficients from reaching large values. Ridge regression: $\widetilde{E}(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{ y(x_n, \mathbf{w}) - t_n \}^2 + \frac{\lambda}{2} \lVert \mathbf{w} \rVert^2$, where $\lVert \mathbf{w} \rVert^2 \equiv \mathbf{w}^{\mathsf T}\mathbf{w} = w_0^2 + w_1^2 + \ldots + w_M^2$, and the coefficient $\lambda$ governs the relative importance of the regularization term compared with the sum-of-squares error. slide by Erik Sudderth 6
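A sketch of how this penalty changes the fit in practice. The regularized error above is minimized in closed form by $\mathbf{w} = (\lambda\mathbf{I} + \Phi^{\mathsf T}\Phi)^{-1}\Phi^{\mathsf T}\mathbf{t}$ (a standard result, not spelled out on this slide); the sinusoidal toy data and the λ values below are illustrative assumptions:

```python
import numpy as np

def ridge_fit(Phi, t, lam):
    """Minimize 0.5*||Phi w - t||^2 + 0.5*lam*||w||^2 in closed form."""
    M = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ t)

# Degree-9 polynomial features for noisy sinusoidal toy data (made-up example).
rng = np.random.default_rng(1)
x = rng.uniform(0, 1, size=10)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=10)
Phi = np.vander(x, N=10, increasing=True)        # columns [1, x, x^2, ..., x^9]

w_unreg = ridge_fit(Phi, t, lam=0.0)             # unregularized: huge coefficients
w_reg   = ridge_fit(Phi, t, lam=np.exp(-18))     # regularized: much smaller ones
print(np.abs(w_unreg).max(), np.abs(w_reg).max())
```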

  7. The effect of regularization. [Figure: fits of the M = 9 polynomial (t versus x) for ln λ = −18 and ln λ = 0.] slide by Erik Sudderth 7

  8. The effect of regularization. Coefficients of the fitted M = 9 polynomial for three regularization strengths:

          ln λ = −∞       ln λ = −18    ln λ = 0
   w*_0   0.35            0.35          0.13
   w*_1   232.37          4.74          -0.05
   w*_2   -5321.83        -0.77         -0.06
   w*_3   48568.31        -31.97        -0.05
   w*_4   -231639.30      -3.89         -0.03
   w*_5   640042.26       55.28         -0.02
   w*_6   -1061800.52     41.32         -0.01
   w*_7   1042400.18      -45.95        -0.00
   w*_8   -557682.99      -91.53        0.00
   w*_9   125201.43       72.68         0.01

  [Figure: training and test E_RMS versus ln λ.] The corresponding coefficients from the fitted polynomials, showing that regularization has the desired effect of reducing the magnitude of the coefficients. slide by Erik Sudderth 8

  9. A more general regularizer: $\frac{1}{2} \sum_{n=1}^{N} \{ t_n - \mathbf{w}^{\mathsf T}\boldsymbol{\phi}(\mathbf{x}_n) \}^2 + \frac{\lambda}{2} \sum_{j=1}^{M} |w_j|^q$ [Figure: contours of the regularization term for q = 0.5, q = 1, q = 2, q = 4.] slide by Richard Zemel 9
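A short sketch of evaluating this more general penalized error for different exponents q (the function and variable names are my own, not from the slides):

```python
import numpy as np

def penalized_error(w, Phi, t, lam, q):
    """0.5 * sum-of-squares error + (lam/2) * sum_j |w_j|^q."""
    residuals = t - Phi @ w
    return 0.5 * residuals @ residuals + 0.5 * lam * np.sum(np.abs(w) ** q)

# q = 2 recovers ridge regression, q = 1 gives the lasso-style penalty,
# and q < 1 makes the regularizer non-convex.
```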

  10. Machine Learning Methodology 10

  11. Recap: Regression • In regression, labels y_i are continuous • Classification/regression are solved very similarly • Everything we have done so far transfers to classification with very minor changes • Error: sum of distances from examples to the fitted model [Figure: 1-D data with a fitted curve and the residuals to each example.] slide by Olga Veksler 11

  12. Training/Test Data Split • Talked about splitting data into training/test sets - training data is used to fit parameters - test data is used to assess how the classifier generalizes to new data • What if the classifier has "non-tunable" parameters? - a parameter is "non-tunable" if tuning (or training) it on the training data leads to overfitting - Examples: ‣ k in the kNN classifier ‣ number of hidden units in a MNN ‣ number of hidden layers in a MNN ‣ etc. slide by Olga Veksler 12

  13. Example of Overfitting • Want to fit a polynomial machine f(x, w) • Instead of fixing the polynomial degree, make it a parameter d - learning machine f(x, w, d) • Consider just three choices for d - degree 1 - degree 2 - degree 3 • Training error is a bad measure to choose d - degree 3 is the best according to the training error, but overfits the data slide by Olga Veksler 13

  14. Training/Test Data Split • What about the test error? Seems appropriate - degree 2 is the best model according to the test error • Except: what do we report as the test error now? • The test error should be computed on data that was not used for training at all! • Here we used the "test" data for training, i.e. for choosing the model slide by Olga Veksler 14

  15. Validation data • Same question when choosing among several classifiers - our polynomial degree example can be looked at as choosing among 3 classifiers (degree 1, 2, or 3) • Solution: split the labeled data into three parts - Training (60%): train the tunable parameters w - Validation (20%): use only to train other parameters or to select the classifier - Test (20%): assess final performance slide by Olga Veksler 15
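A minimal sketch of that 60% / 20% / 20% split; the shuffling seed and helper name are my own choices, and X, y are assumed to be numpy arrays:

```python
import numpy as np

def train_val_test_split(X, y, seed=0):
    """Shuffle the labeled data, then split it 60% / 20% / 20%."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_train = int(0.6 * len(X))
    n_val = int(0.2 * len(X))
    train, val, test = np.split(idx, [n_train, n_train + n_val])
    return (X[train], y[train]), (X[val], y[val]), (X[test], y[test])
```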

  16. Training/Validation/Test labeled data: Training (60%), Validation (20%), Test (20%). Training error: computed on training examples. Validation error: computed on validation examples. Test error: computed on test examples. slide by Olga Veksler 16

  17. Training/Validation/Test Data [Figure: fits for d = 1, d = 2, and d = 3, with validation errors 3.3, 1.8, and 3.4 respectively.] • Training Data • Validation Data - d = 2 is chosen (lowest validation error) • Test Data - a test error of 1.3 is computed for d = 2 slide by Olga Veksler 17
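A sketch of this workflow with numpy's polynomial fitting: fit each candidate degree on the training set, pick the degree with the lowest validation error, then report the test error once for that choice. The data splits are assumed to come from a split like the one above, and the resulting numbers will not reproduce the slide's 1.8 / 1.3:

```python
import numpy as np

def mse(w, x, y):
    """Mean squared error of the polynomial with coefficients w."""
    return np.mean((np.polyval(w, x) - y) ** 2)

def select_degree(x_tr, y_tr, x_va, y_va, x_te, y_te, degrees=(1, 2, 3)):
    """Pick the degree with the lowest validation error; report its test error."""
    fits = {d: np.polyfit(x_tr, y_tr, deg=d) for d in degrees}
    best_d = min(degrees, key=lambda d: mse(fits[d], x_va, y_va))
    return best_d, mse(fits[best_d], x_te, y_te)   # test error computed only once
```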

  18. Choosing Parameters: Example [Figure: training error and validation error versus the number of hidden units (up to 50).] • Need to choose the number of hidden units for a MNN - the more hidden units, the better we can fit the training data - but at some point we overfit the data slide by Olga Veksler 18
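One way to produce such a curve, sketched here with scikit-learn's MLPClassifier; the library choice, the candidate sizes, and the training settings are assumptions, since the slides don't prescribe an implementation:

```python
from sklearn.neural_network import MLPClassifier

def hidden_unit_curve(X_tr, y_tr, X_va, y_va, sizes=(2, 5, 10, 20, 50)):
    """Training / validation error as a function of the number of hidden units."""
    curve = []
    for h in sizes:
        net = MLPClassifier(hidden_layer_sizes=(h,), max_iter=2000, random_state=0)
        net.fit(X_tr, y_tr)
        curve.append((h, 1.0 - net.score(X_tr, y_tr), 1.0 - net.score(X_va, y_va)))
    return curve   # choose the size with the lowest validation error
```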

  19. Diagnosing Underfitting/Overfitting • Underfitting: large training error, large validation error • Just Right: small training error, small validation error • Overfitting: small training error, large validation error slide by Olga Veksler 19

  20. Fixing Underfitting/Overfitting • Fixing Underfitting - getting more training examples will not help - get more features - try a more complex classifier ‣ if using a MNN, try more hidden units • Fixing Overfitting - getting more training examples might help - try a smaller set of features - try a less complex classifier ‣ if using a MNN, try fewer hidden units slide by Olga Veksler 20

  21. Train/Test/Validation Method • Good news - very simple • Bad news - wastes data - in general, the more data we have, the better the estimated parameters - we estimate parameters on 40% less data, since 20% is removed for the test set and 20% for the validation set - if we have a small dataset, our test (or validation) set might just be lucky or unlucky • Cross-validation is a method for performance evaluation that wastes less data slide by Olga Veksler 21
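A minimal k-fold cross-validation sketch; k-fold is one of the schemes listed on the "Today" slide but is not spelled out in the later slides, so the helper below, its fold count, and the plug-in fit/predict arguments are my own illustrative choices:

```python
import numpy as np

def k_fold_cv_error(x, y, k, fit, predict):
    """Average held-out squared error over k folds.

    `fit(x_train, y_train)` returns a model; `predict(model, x)` returns predictions.
    """
    idx = np.random.default_rng(0).permutation(len(x))
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(x[train], y[train])
        errors.append(np.mean((predict(model, x[val]) - y[val]) ** 2))
    return np.mean(errors)

# Example: 5-fold CV error of a quadratic polynomial fit.
# err = k_fold_cv_error(x, y, k=5,
#                       fit=lambda xs, ys: np.polyfit(xs, ys, deg=2),
#                       predict=np.polyval)
```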

  22. Small Dataset [Figure: three fits to the same small dataset (y versus x).] • Linear Model: Mean Squared Error = 2.4 • Quadratic Model: Mean Squared Error = 0.9 • Join-the-dots Model: Mean Squared Error = 2.2 slide by Olga Veksler 22

  23. LOOCV (Leave-one-out Cross Validation) For k = 1 to n: 1. Let (x_k, y_k) be the k-th example [Figure: 1-D dataset, y versus x.] slide by Olga Veksler 23

  24. LOOCV (Leave-one-out Cross Validation) For k = 1 to n: 1. Let (x_k, y_k) be the k-th example 2. Temporarily remove (x_k, y_k) from the dataset slide by Olga Veksler 24

  25. LOOCV (Leave-one-out Cross Validation) For k = 1 to n: 1. Let (x_k, y_k) be the k-th example 2. Temporarily remove (x_k, y_k) from the dataset 3. Train on the remaining n − 1 examples slide by Olga Veksler 25

  26. LOOCV (Leave-one-out Cross Validation) For k = 1 to n: 1. Let (x_k, y_k) be the k-th example 2. Temporarily remove (x_k, y_k) from the dataset 3. Train on the remaining n − 1 examples 4. Note your error on (x_k, y_k) slide by Olga Veksler 26
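Putting the four steps together as a sketch for the polynomial-fitting example; the helper name and the choice of polynomial models are illustrative assumptions:

```python
import numpy as np

def loocv_error(x, y, degree):
    """Leave-one-out cross-validation error of a degree-`degree` polynomial fit."""
    n = len(x)
    errors = []
    for k in range(n):
        keep = np.arange(n) != k                          # temporarily remove example k
        w = np.polyfit(x[keep], y[keep], deg=degree)      # train on the remaining n - 1
        errors.append((np.polyval(w, x[k]) - y[k]) ** 2)  # error on (x_k, y_k)
    return np.mean(errors)                                # mean squared LOOCV error
```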
