
CS 6316 Machine Learning: Model Selection and Validation, Yangfeng Ji - PowerPoint PPT Presentation



  1. CS 6316 Machine Learning: Model Selection and Validation
  Yangfeng Ji, Department of Computer Science, University of Virginia

  2. Overview

  3. Polynomials
  Polynomial regression with different degrees [figure with three panels: (a) d = 1, (b) d = 3, (c) d = 15]
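  A minimal sketch of how the degree d controls the flexibility of polynomial regression, using numpy.polyfit on a synthetic data set; the sine target, the noise level, and the sample size are assumptions made only for illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.sort(rng.uniform(0.0, 1.0, size=20))
    y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)  # assumed toy target

    for d in (1, 3, 15):
        coeffs = np.polyfit(x, y, deg=d)   # least-squares fit of a degree-d polynomial (may warn about conditioning for d = 15)
        train_mse = np.mean((np.polyval(coeffs, x) - y) ** 2)
        print(f"degree d = {d:2d}: training MSE = {train_mse:.4f}")

  Training error typically decreases as d grows, while a very large d (here d = 15 on 20 points) tends to overfit.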

  4. Boosting
  AdaBoost combines T weak classifiers to form a (strong) classifier
      h(x) = \mathrm{sign}\left( \sum_{t=1}^{T} w_t h_t(x) \right)    (1)
  where T controls the model complexity [Mohri et al., 2018, Page 147]
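  A minimal sketch of the weighted vote in equation (1); the decision stumps and the weights w_t below are placeholders chosen for illustration, not the output of AdaBoost training.

    import numpy as np

    def stump(feature, threshold):
        """A weak classifier h_t(x): +1 if x[feature] > threshold, else -1."""
        return lambda X: np.where(X[:, feature] > threshold, 1, -1)

    # Placeholder weak classifiers h_t and weights w_t (AdaBoost would learn both)
    weak_classifiers = [stump(0, 0.2), stump(0, 0.5), stump(1, 0.7)]
    weights = np.array([0.4, 0.35, 0.25])

    def strong_classifier(X):
        # h(x) = sign( sum_{t=1}^{T} w_t h_t(x) ), equation (1)
        scores = sum(w * h(X) for w, h in zip(weights, weak_classifiers))
        return np.sign(scores)

    X = np.array([[0.1, 0.9], [0.6, 0.3], [0.3, 0.8]])
    print(strong_classifier(X))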

  5-7. Structural Risk Minimization
  Take linear regression with \ell_2 regularization as an example. Let H_\lambda denote the hypothesis space defined by the following objective function:
      L_{S,\ell_2}(h_w) = \frac{1}{m} \sum_{i=1}^{m} (h_w(x_i) - y_i)^2 + \lambda \|w\|^2    (2)
  where \lambda is the regularization parameter.
  ◮ The basic idea of SRM is to start from a small hypothesis space (e.g., H_\lambda with a small \lambda), then gradually increase \lambda to obtain a larger H_\lambda
  ◮ Another example: Support Vector Machines (next lecture)
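  A minimal sketch of objective (2) for a sweep over \lambda, using the closed-form solution of regularized least squares; the synthetic data and the grid of \lambda values are assumptions for illustration only.

    import numpy as np

    rng = np.random.default_rng(0)
    m, n = 50, 5
    X = rng.normal(size=(m, n))
    y = X @ rng.normal(size=n) + rng.normal(scale=0.1, size=m)

    def regularized_least_squares(X, y, lam):
        # Minimizes (1/m) sum_i (h_w(x_i) - y_i)^2 + lam * ||w||^2 (equation (2), no bias term)
        m, n = X.shape
        return np.linalg.solve(X.T @ X / m + lam * np.eye(n), X.T @ y / m)

    for lam in (0.0, 0.01, 0.1, 1.0, 10.0):
        w = regularized_least_squares(X, y, lam)
        mse = np.mean((X @ w - y) ** 2)
        print(f"lambda = {lam:5.2f}  training MSE = {mse:.4f}  ||w|| = {np.linalg.norm(w):.3f}")

  Each value of \lambda defines a different hypothesis space H_\lambda; SRM sweeps over this family of spaces.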

  8. Model Evaluation and Selection
  Since we cannot compute the true error of any given hypothesis h ∈ H:
  ◮ How do we evaluate the performance of a given model?
  ◮ How do we select the best model among a few candidates?

  9. Model Validation

  10. Validation Set
  The simplest way to estimate the true error of a predictor h:
  ◮ Independently sample an additional set of examples V of size m_v:
      V = \{ (x_1, y_1), \ldots, (x_{m_v}, y_{m_v}) \}    (3)
  ◮ Evaluate the predictor h on this validation set:
      L_V(h) = \frac{ |\{ i \in [m_v] : h(x_i) \neq y_i \}| }{ m_v }    (4)
  Usually, L_V(h) is a good approximation of L_D(h)
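  A minimal sketch of the validation error in equation (4) for a fixed predictor h; the predictor and the small validation set below are placeholders.

    import numpy as np

    def validation_error(h, X_val, y_val):
        """L_V(h): the fraction of validation examples with h(x_i) != y_i (equation (4))."""
        return np.mean(h(X_val) != y_val)

    # Placeholder predictor (already trained elsewhere) and validation set V
    h = lambda X: np.sign(X[:, 0])
    X_val = np.array([[0.5], [-1.2], [2.0], [-0.3]])
    y_val = np.array([1, -1, -1, -1])

    print("L_V(h) =", validation_error(h, X_val, y_val))  # prints 0.25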

  11. Theorem
  Let h be some predictor and assume that the loss function is in [0, 1]. Then, for every \delta \in (0, 1), with probability of at least 1 - \delta over the choice of a validation set V of size m_v, we have
      |L_V(h) - L_D(h)| \leq \sqrt{ \frac{\log(2/\delta)}{2 m_v} }    (5)
  where
  ◮ L_V(h): the validation error
  ◮ L_D(h): the true error
  [Shalev-Shwartz and Ben-David, 2014, Theorem 11.1]
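  A small numeric check of the bound in (5); the choices of \delta and m_v are arbitrary and only meant to show how the bound shrinks as the validation set grows.

    import math

    def validation_gap_bound(m_v, delta):
        # sqrt(log(2/delta) / (2 * m_v)): holds with probability >= 1 - delta (equation (5))
        return math.sqrt(math.log(2.0 / delta) / (2.0 * m_v))

    for m_v in (100, 1000, 10000):
        print(f"m_v = {m_v:5d}, delta = 0.05: |L_V(h) - L_D(h)| <= {validation_gap_bound(m_v, 0.05):.4f}")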

  12-13. Sample Complexity
  ◮ The fundamental theorem of learning:
      L_D(h) \leq L_S(h) + \sqrt{ \frac{C \, (d + \log(1/\delta))}{m} }    (6)
  where d is the VC dimension of the corresponding hypothesis space
  ◮ On the other hand, from the previous theorem:
      L_D(h) \leq L_V(h) + \sqrt{ \frac{\log(2/\delta)}{2 m_v} }    (7)
  ◮ A good validation set should have a similar number of examples as the training set
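  A sketch comparing the gap terms of (6) and (7) for assumed values of d, m, m_v, and \delta; since the constant C in (6) is left unspecified, C = 1 is used here purely as an assumption.

    import math

    def training_gap(d, m, delta, C=1.0):
        # gap term in equation (6); the constant C = 1 is an assumption
        return math.sqrt(C * (d + math.log(1.0 / delta)) / m)

    def validation_gap(m_v, delta):
        # gap term in equation (7)
        return math.sqrt(math.log(2.0 / delta) / (2.0 * m_v))

    d, delta = 50, 0.05
    for m in (1000, 10000):
        print(f"m = m_v = {m:5d}: training-based gap = {training_gap(d, m, delta):.4f}, "
              f"validation gap = {validation_gap(m, delta):.4f}")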

  14. Model Selection

  15-17. Model Selection Procedure
  Given the training set S and the validation set V:
  ◮ For each model configuration c, find the best hypothesis h_c(x, S):
      h_c(x, S) = \operatorname{argmin}_{h' \in H_c} L_S(h'(x, S))    (8)
  ◮ With the collection of best models under different configurations, H' = \{ h_{c_1}(x, S), \ldots, h_{c_k}(x, S) \}, find the overall best hypothesis:
      h(x, S) = \operatorname{argmin}_{h' \in H'} L_V(h'(x, S))    (9)
  ◮ This is similar to learning with the finite hypothesis space H'
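  A minimal sketch of the two-step procedure in (8) and (9), with the polynomial degree as the configuration c and numpy.polyfit as the learner; the candidate degrees, the synthetic data, and the split sizes are assumptions for illustration.

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.uniform(0.0, 1.0, size=60)
    y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)
    x_train, y_train = x[:40], y[:40]      # training set S
    x_val, y_val = x[40:], y[40:]          # validation set V

    def mse(coeffs, x, y):
        return np.mean((np.polyval(coeffs, x) - y) ** 2)

    # Step 1 (equation (8)): for each configuration c, fit the best hypothesis h_c on S
    configs = [1, 3, 5, 9, 15]
    candidates = {d: np.polyfit(x_train, y_train, deg=d) for d in configs}

    # Step 2 (equation (9)): among the fitted candidates H', pick the lowest validation error
    best_d = min(candidates, key=lambda d: mse(candidates[d], x_val, y_val))
    print("selected degree:", best_d, " validation MSE:", round(mse(candidates[best_d], x_val, y_val), 4))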

  18-19. Model Configuration/Hyperparameters
  Consider polynomial regression:
      H_d = \{ w_0 + w_1 x + \cdots + w_d x^d : w_0, w_1, \ldots, w_d \in \mathbb{R} \}    (10)
  ◮ the degree of the polynomial, d
  ◮ the regularization coefficient \lambda, as in \lambda \cdot \|w\|_2^2
  ◮ the bias term w_0
  Additional factors during learning:
  ◮ optimization methods
  ◮ dimensionality of the inputs, etc.
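  A short sketch of how these hyperparameters combine into a set of model configurations to search over; the candidate values below are assumptions.

    from itertools import product

    degrees = [1, 3, 15]           # degree d of the polynomial
    lambdas = [0.0, 0.1, 1.0]      # regularization coefficient lambda in lambda * ||w||_2^2
    use_bias = [True, False]       # whether to include the bias term w_0

    configurations = list(product(degrees, lambdas, use_bias))
    print(len(configurations), "configurations, e.g.:", configurations[:3])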

  20. Limitation of Keeping a Validation Set
  If the validation set is
  ◮ small, then it could be biased and may not give a good approximation of the true error
  ◮ large, e.g., of the same order as the training set, then we waste information by not using those examples for training

  21-23. k-Fold Cross Validation
  The basic procedure of k-fold cross validation:
  ◮ Split the whole data set into k parts
  ◮ For each model configuration, run the learning procedure k times
  ◮ Each time, pick one part as the validation set and the rest as the training set
  ◮ Take the average of the k validation errors as the model error
  [figure: the data partitioned into Fold 1 through Fold 5]

  24. Cross-Validation
  Algorithm:
  1: Input: (1) training set S; (2) set of parameter values \Theta; (3) learning algorithm A; (4) integer k
  2: Partition S into S_1, S_2, \ldots, S_k
  3: for \theta \in \Theta do
  4:     for i = 1, \ldots, k do
  5:         h_{i,\theta} = A(S \setminus S_i; \theta)
  6:     end for
  7:     Err(\theta) = \frac{1}{k} \sum_{i=1}^{k} L_{S_i}(h_{i,\theta})
  8: end for
  9: Output: \theta^* = \operatorname{argmin}_{\theta} Err(\theta), and the hypothesis h = A(S; \theta^*)
  In practice, k is usually 5 or 10.
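  A minimal sketch of the algorithm above; the learning algorithm A (regularized least squares), the parameter set \Theta, the squared loss, and the synthetic data are all assumptions made for illustration.

    import numpy as np

    def cross_validation(X, y, thetas, A, k=5):
        """Return theta* minimizing the average k-fold validation error, and A(S; theta*)."""
        folds = np.array_split(np.arange(len(X)), k)              # partition S into S_1, ..., S_k
        err = {}
        for theta in thetas:
            fold_losses = []
            for i in range(k):
                train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
                h = A(X[train_idx], y[train_idx], theta)          # h_{i,theta} = A(S \ S_i; theta)
                fold_losses.append(np.mean((h(X[folds[i]]) - y[folds[i]]) ** 2))
            err[theta] = np.mean(fold_losses)                     # Err(theta)
        theta_star = min(err, key=err.get)
        return theta_star, A(X, y, theta_star)                    # retrain on all of S with theta*

    def A(X, y, theta):
        # Assumed learner: regularized least squares with coefficient theta
        w = np.linalg.solve(X.T @ X + theta * np.eye(X.shape[1]), X.T @ y)
        return lambda X_new: X_new @ w

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))
    y = X @ rng.normal(size=5) + rng.normal(scale=0.1, size=100)
    theta_star, h = cross_validation(X, y, thetas=[0.01, 0.1, 1.0, 10.0], A=A, k=5)
    print("selected theta:", theta_star)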

  25-27. Train-Validation-Test Split
  ◮ Training set: used for learning with a pre-selected hypothesis space, such as
      ◮ logistic regression for classification
      ◮ polynomial regression with d = 15 and \lambda = 0.1
  ◮ Validation set: used for selecting the best hypothesis across multiple hypothesis spaces
      ◮ similar to learning with a finite hypothesis space H'
  ◮ Test set: only used for evaluating the overall best hypothesis
  Typical splits of all available data [figure: Train / Val / Test, or Fold 1 through Fold 5 plus a held-out Test set when using cross validation]
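  A minimal sketch of one typical split of all available data; the 60/20/20 proportions and the data size are assumptions, not a prescription from the slides.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 1000
    indices = rng.permutation(n)                    # shuffle before splitting

    n_train, n_val = int(0.6 * n), int(0.2 * n)     # assumed 60/20/20 split
    train_idx = indices[:n_train]                   # used to fit each candidate model
    val_idx = indices[n_train:n_train + n_val]      # used to select among the candidates
    test_idx = indices[n_train + n_val:]            # touched only for the final evaluation

    print(len(train_idx), len(val_idx), len(test_idx))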

  28. Model Selection in Practice

  29-31. What To Do If Learning Fails
  There are many elements that can help fix the learning procedure:
  ◮ Get a larger sample
  ◮ Change the hypothesis class by
      ◮ enlarging it
      ◮ reducing it
      ◮ completely changing it
      ◮ changing the parameters you consider
  ◮ Change the feature representation of the data (usually domain dependent)
  [Shalev-Shwartz and Ben-David, 2014, Page 151]
