
Validation and Testing (COMPSCI 371D — Machine Learning) - Presentation Slides



  1. Validation and Testing (COMPSCI 371D — Machine Learning)

  2. Outline
   1 Training, Testing, and Model Selection
   2 A Generative Data Model
   3 Model Selection: Validation
   4 Model Selection: Cross-Validation
   5 Model Selection: The Bootstrap

  3. Training, Testing, and Model Selection - Training and Testing
   • Empirical risk is the average loss over the training set: $L_T(h) \stackrel{\text{def}}{=} \frac{1}{|T|} \sum_{(x,y) \in T} \ell(y, h(x))$
   • Training is Empirical Risk Minimization: $\mathrm{ERM}_T(\mathcal{H}) \in \arg\min_{h \in \mathcal{H}} L_T(h)$ (a fitting problem)
   • Not enough for machine learning: must generalize
   • Small loss on "previously unseen data"
   • How do we know? Evaluate on a separate test set S
   • This is called testing the predictor
   • How do we know that S and T are "related"?
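A minimal sketch (not from the slides) of empirical risk and ERM for least-squares polynomial fitting; the data, the squared loss, and the helper names (empirical_risk, erm_polynomial) are illustrative assumptions, not course code.

    import numpy as np

    def empirical_risk(h, data, loss):
        """L_T(h): average loss of predictor h over a dataset of (x, y) pairs."""
        return np.mean([loss(y, h(x)) for x, y in data])

    def erm_polynomial(train, degree):
        """ERM over polynomials of a fixed degree; least squares = ERM under squared loss."""
        x = np.array([p[0] for p in train])
        y = np.array([p[1] for p in train])
        coeffs = np.polyfit(x, y, degree)
        return lambda x_new: np.polyval(coeffs, x_new)

    # Hypothetical data: T is the training set, S a separate test set, both noisy sine samples
    rng = np.random.default_rng(0)
    def sample_set(n):
        x = rng.uniform(0.0, 1.0, n)
        return list(zip(x, np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(n)))

    T, S = sample_set(30), sample_set(30)
    sq_loss = lambda y, y_hat: (y - y_hat) ** 2
    h = erm_polynomial(T, degree=3)
    print("training risk L_T(h):", empirical_risk(h, T, sq_loss))
    print("test risk     L_S(h):", empirical_risk(h, S, sq_loss))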

  4. Training, Testing, and Model Selection - Model Selection
   • Hyper-parameters: the degree k of a polynomial, the number k of neighbors in k-NN
   • How to choose them? Why not just include them with the parameters, and train?
   • Difficulty 0: k-NN has no training! No big deal
   • Difficulty 1: k ∈ ℕ, while v ∈ ℝ^m for some predictors, so this would be a hybrid optimization. Medium deal, just a technical difficulty
   • Difficulty 2: the answer from training would be trivial! One can always achieve zero risk on T (see the sketch after this slide)
   • So k must be chosen separately from training. It tunes generalization
   • This is what makes it a hyper-parameter
   • Choosing hyper-parameters is called model selection
   • Evaluate choices on a separate validation set V
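A brief sketch (not from the slides) of Difficulty 2 for k-NN classification: with k = 1 the training risk is zero by construction, so minimizing training risk over k tells us nothing. The data and helper names are hypothetical.

    import numpy as np

    def knn_predict(k, X_train, y_train, x):
        """Majority vote among the k nearest training points (Euclidean distance)."""
        nearest = np.argsort(np.linalg.norm(X_train - x, axis=1))[:k]
        return np.bincount(y_train[nearest]).argmax()

    def error_rate(k, X_train, y_train, X_eval, y_eval):
        preds = np.array([knn_predict(k, X_train, y_train, x) for x in X_eval])
        return np.mean(preds != y_eval)

    # Hypothetical two-class data: overlapping Gaussian clouds
    rng = np.random.default_rng(1)
    def sample_clouds(n_per_class):
        X = np.vstack([rng.normal(0, 1, (n_per_class, 2)), rng.normal(1, 1, (n_per_class, 2))])
        y = np.array([0] * n_per_class + [1] * n_per_class)
        return X, y

    X_T, y_T = sample_clouds(50)
    X_S, y_S = sample_clouds(50)
    for k in (1, 5, 15):
        print(k, "training error:", error_rate(k, X_T, y_T, X_T, y_T),
              "test error:", error_rate(k, X_T, y_T, X_S, y_S))
    # k = 1 always attains zero training error, so training risk cannot select k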

  5. Training, Testing, and Model Selection - Model Selection, Training, Testing
   • "Model" = H
   • Given a parametric family of hypothesis spaces, model selection selects one particular member of the family
   • Given a specific hypothesis space, training selects one particular predictor out of it
   • Use V to select the model, T to train, S to test
   • V, T, S are mutually disjoint but "related"
   • What does "related" mean?
   • Train on cats and test on horses?

  6. A Generative Data Model
   • What does "related" mean?
   • Every sample (x, y) comes from a joint probability distribution p(x, y)
   • True for training, validation, and test data, and for data seen during deployment
   • For the latter, y is "out there" but unknown
   • The goal of machine learning:
   • Define the (statistical) risk $L_p(h) = \mathbb{E}_p[\ell(y, h(x))] = \int \ell(y, h(x))\, p(x, y)\, dx\, dy$
   • Learning performs (Statistical) Risk Minimization: $\mathrm{RM}_p(\mathcal{H}) \in \arg\min_{h \in \mathcal{H}} L_p(h)$
   • Lowest risk on H: $L_p(\mathcal{H}) \stackrel{\text{def}}{=} \min_{h \in \mathcal{H}} L_p(h)$
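A small sketch (not from the slides) of what the statistical risk means: for a toy p(x, y) that we invent here (playing the oracle), the empirical risk on a modest i.i.d. sample approximates L_p(h), which we in turn estimate to high accuracy with a very large sample. The toy distribution, predictor, and names are assumptions for illustration.

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_p(n):
        """Draw n i.i.d. samples from a toy p(x, y): x ~ Uniform(0, 1), y = sin(2πx) + noise."""
        x = rng.uniform(0.0, 1.0, n)
        y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(n)
        return x, y

    h = lambda x: 1.0 - 2.0 * x                     # some fixed predictor, chosen arbitrarily
    sq_loss = lambda y, y_hat: (y - y_hat) ** 2

    x_test, y_test = sample_p(100)                  # a test set of affordable size
    x_big, y_big = sample_p(1_000_000)              # stand-in for the true risk L_p(h)

    print("empirical risk on 100 samples:", np.mean(sq_loss(y_test, h(x_test))))
    print("Monte Carlo estimate of L_p(h):", np.mean(sq_loss(y_big, h(x_big))))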

  7. A Generative Data Model - p is Unknown
   • So, we don't need training data anymore?
   • We typically do not know p(x, y)
   • x = an image? Or a sentence?
   • Can we not estimate p?
   • The curse of dimensionality, again
   • We typically cannot find RM_p(H) or L_p(H)
   • That is the goal all the same

  8. A Generative Data Model - So Why Talk About It?
   • Why talk about p(x, y) if we cannot know it?
   • L_p(h) is a mean, and we can estimate means
   • We can sandwich L_p(h) or L_p(H) between bounds over all possible choices of p
   • What else would we do anyway?
   • p is conceptually clean and simple
   • The unattainable holy grail
   • Think of p as an oracle that sells samples from X × Y: she knows p, we don't
   • Samples cost money and effort! [Example: MNIST Database]

  9. A Generative Data Model - Even More Importantly...
   • We know what "related" means: T, V, S are all drawn independently from p(x, y)
   • We know what "generalize" means: find $\mathrm{RM}_p(\mathcal{H}) \in \arg\min_{h \in \mathcal{H}} L_p(h)$
   • We know the goal of machine learning

  10. Model Selection: Validation - Validation
   • Parametric family of hypothesis spaces: $\mathcal{H} = \bigcup_{\pi \in \Pi} \mathcal{H}_\pi$
   • Finding a good vector π̂ of hyper-parameters is called model selection
   • A popular method is called validation
   • Use a validation set V separate from T
   • Pick the hyper-parameter vector for which the predictor trained on the training set minimizes the validation risk: $\hat{\pi} = \arg\min_{\pi \in \Pi} L_V(\mathrm{ERM}_T(\mathcal{H}_\pi))$
   • When the set Π of hyper-parameters is finite, try them all

  11. Model Selection: Validation - Validation Algorithm

      procedure VALIDATION(H, Π, T, V, ℓ)
          L̂ = ∞                               ⊲ Stores the best risk so far on V
          for π ∈ Π do
              h ∈ arg min_{h′ ∈ H_π} L_T(h′)   ⊲ Use loss ℓ to compute the best predictor ERM_T(H_π) on T
              L = L_V(h)                       ⊲ Use loss ℓ to evaluate the predictor's risk on V
              if L < L̂ then
                  (π̂, ĥ, L̂) = (π, h, L)        ⊲ Keep track of the best hyper-parameters, predictor, and risk
              end if
          end for
          return (π̂, ĥ, L̂)                     ⊲ Return best hyper-parameters, predictor, and risk estimate
      end procedure
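A sketch (not the course's code) of the VALIDATION procedure for a concrete case: Π is a set of polynomial degrees, H_π the polynomials of degree π, and ℓ the squared loss; the function and variable names are assumptions for illustration.

    import numpy as np

    def validation(degrees, T, V):
        """Return the best degree π̂, its predictor ĥ (coefficients), and validation risk L̂."""
        (x_T, y_T), (x_V, y_V) = T, V
        best = (None, None, np.inf)                   # (π̂, ĥ, L̂), with L̂ = ∞ initially
        for degree in degrees:                        # for π ∈ Π
            coeffs = np.polyfit(x_T, y_T, degree)     # ERM_T(H_π) under squared loss
            risk_V = np.mean((y_V - np.polyval(coeffs, x_V)) ** 2)   # L_V(h)
            if risk_V < best[2]:
                best = (degree, coeffs, risk_V)       # keep track of the best so far
        return best

    # Hypothetical data: noisy sine samples split into training and validation sets
    rng = np.random.default_rng(0)
    def sample_set(n):
        x = rng.uniform(0.0, 1.0, n)
        return x, np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(n)

    T, V = sample_set(40), sample_set(40)
    degree_hat, h_hat, risk_hat = validation(range(1, 10), T, V)
    print("selected degree:", degree_hat, "validation risk:", risk_hat)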

  12. Model Selection: Validation - Validation for Infinite Sets
   • When Π is not finite, scan and find a local minimum
   • Example: polynomial degree
     [Figure: training risk and validation risk as a function of the polynomial degree k, with example fits for k = 1, 2, 3, 6, 9]
   • When Π is not countable, scan a grid and find a local minimum

  13. Model Selection: Cross-Validation - Resampling Methods for Validation
   • Validation is good but expensive: it needs separate data
   • A pity not to use V as part of T!
   • Resampling methods split T into T_k and V_k for k = 1, ..., K
   • (This K has nothing to do with the number of classes or the polynomial degree!)
   • For each π and each k, train on T_k and test on V_k to measure performance
   • The average performance over k is taken as the validation risk for π
   • Let π̂ be the best π
   • Train the predictor in H_π̂ on all of T
   • Cross-validation and the bootstrap differ in how the splits are made

  14. Model Selection: Cross-Validation - K-Fold Cross-Validation
   • V_1, ..., V_K are a partition of T into approximately equal-sized sets
   • T_k = T \ V_k
   • For each π ∈ Π and each k = 1, ..., K: train on T_k, measure performance on V_k
   • The average performance over k is the validation risk for π
   • Pick π̂ as the π with the best average performance
   • Train the predictor in H_π̂ on all of T
   • Since performance is an average, we also get a variance!
   • We don't have that for standard validation

  15. Model Selection: Cross-Validation - Cross-Validation Algorithm

      procedure CROSSVALIDATION(H, Π, T, K, ℓ)
          {V_1, ..., V_K} = SPLIT(T, K)        ⊲ Split T into K approximately equal-sized sets at random
          L̂ = ∞                                ⊲ Will hold the lowest risk over Π
          for π ∈ Π do
              (s, s2) = (0, 0)                 ⊲ Will hold the sum of risks and of their squares, for mean and variance
              for k = 1, ..., K do
                  T_k = T \ V_k                ⊲ Use all of T except V_k as the training set
                  h ∈ arg min_{h′ ∈ H_π} L_{T_k}(h′)   ⊲ Use the loss ℓ to compute h = ERM_{T_k}(H_π)
                  L = L_{V_k}(h)               ⊲ Use the loss ℓ to compute the risk of h on V_k
                  (s, s2) = (s + L, s2 + L²)   ⊲ Keep track of quantities needed for risk mean and variance
              end for
              L = s / K                        ⊲ Sample mean of the risk over the K folds
              if L < L̂ then
                  σ² = (s2 − s²/K) / (K − 1)   ⊲ Sample variance of the risk over the K folds
                  (π̂, L̂, σ̂²) = (π, L, σ²)      ⊲ Keep track of the best hyper-parameters and their risk statistics
              end if
          end for
          ĥ = arg min_{h ∈ H_π̂} L_T(h)         ⊲ Train the predictor afresh on all of T with the best hyper-parameters
          return (π̂, ĥ, L̂, σ̂²)                 ⊲ Return best hyper-parameters, predictor, and risk statistics
      end procedure
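A sketch (not the course's code) of K-fold cross-validation for the same polynomial-degree example used above; the squared loss and all names are illustrative assumptions.

    import numpy as np

    def cross_validation(degrees, x, y, K=5, seed=0):
        """Return (π̂, ĥ, L̂, σ̂²): best degree, predictor retrained on all of T, risk mean and variance."""
        rng = np.random.default_rng(seed)
        folds = np.array_split(rng.permutation(len(x)), K)        # indices of V_1, ..., V_K
        best = (None, np.inf, None)                               # (π̂, L̂, σ̂²)
        for degree in degrees:                                    # for π ∈ Π
            risks = []
            for val_idx in folds:                                 # for k = 1, ..., K
                train_idx = np.setdiff1d(np.arange(len(x)), val_idx)       # T_k = T \ V_k
                coeffs = np.polyfit(x[train_idx], y[train_idx], degree)    # ERM_{T_k}(H_π)
                risks.append(np.mean((y[val_idx] - np.polyval(coeffs, x[val_idx])) ** 2))  # L_{V_k}(h)
            mean, var = np.mean(risks), np.var(risks, ddof=1)     # sample mean and variance over folds
            if mean < best[1]:
                best = (degree, mean, var)
        degree_hat, risk_hat, var_hat = best
        h_hat = np.polyfit(x, y, degree_hat)                      # retrain on all of T with π̂
        return degree_hat, h_hat, risk_hat, var_hat

    # Hypothetical data, as before
    rng = np.random.default_rng(1)
    x = rng.uniform(0.0, 1.0, 60)
    y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(60)
    degree_hat, h_hat, risk_hat, var_hat = cross_validation(range(1, 10), x, y, K=5)
    print("selected degree:", degree_hat, "risk mean:", risk_hat, "risk variance:", var_hat)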

  16. Model Selection: Cross-Validation - How big is K?
   • T_k has |T|(K − 1)/K samples, so the predictor in each fold is a bit worse than the final predictor
   • Smaller K: a more pessimistic risk estimate (upward bias, because we train on a smaller T_k)
   • Bigger K decreases the bias of the risk estimate (training on a bigger T_k)
   • Why not K = N?
   • LOOCV (Leave-One-Out Cross-Validation): train on all but one data point, test on that data point, repeat
   • Any issue?
   • Nadeau and Bengio recommend K = 15

  17. Model Selection: The Bootstrap - The Bootstrap
   • Bag or multiset: a set that allows for multiple instances
   • {a, a, b, b, b, c} has cardinality 6; multiplicities: 2 for a, 3 for b, and 1 for c
   • A set is also a bag: {a, b, c}
   • Bootstrap: same as cross-validation, except
   • T_k: N samples drawn uniformly at random from T, with replacement
   • V_k = T \ T_k
   • T_k is a bag, V_k is a set
   • Repetitions change the training risk to a weighted average: $L_{T_k}(h) = \frac{1}{N} \sum_{n=1}^{N} \ell(y_n, h(x_n)) = \frac{1}{N} \sum_{j=1}^{J} m_j \, \ell(y_j, h(x_j))$, where j ranges over the J distinct samples in T_k and m_j are their multiplicities
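A sketch (not the course's code) of one bootstrap round: T_k is a bag of indices drawn with replacement, V_k is the out-of-bag set, and the weighted form of the training risk above is checked against the direct average. The per-sample losses and all names are hypothetical.

    import numpy as np

    def bootstrap_split(N, rng):
        """One bootstrap round: a bag of N indices drawn with replacement, plus the out-of-bag set."""
        bag = rng.integers(0, N, size=N)                 # indices into T, with repetitions (T_k)
        out_of_bag = np.setdiff1d(np.arange(N), bag)     # V_k = T \ T_k
        return bag, out_of_bag

    def weighted_training_risk(bag, losses):
        """(1/N) Σ_j m_j ℓ_j over the distinct samples in the bag, with m_j the multiplicities."""
        idx, m = np.unique(bag, return_counts=True)
        return np.sum(m * losses[idx]) / len(bag)

    # Hypothetical per-sample losses ℓ(y_n, h(x_n)) for some fixed predictor h
    rng = np.random.default_rng(0)
    losses = rng.uniform(0.0, 1.0, 20)
    bag, oob = bootstrap_split(len(losses), rng)
    print("bag risk (weighted):", weighted_training_risk(bag, losses))
    print("bag risk (direct):  ", losses[bag].mean())    # same value, by the formula above
    print("out-of-bag set V_k: ", oob)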
