IAML: Overfitting and Capacity Control
Charles Sutton and Victor Lavrenko
School of Informatics, Semester 1
Outline
◮ Generalization error
◮ Estimating generalization error
◮ Example: polynomial regression
◮ Under- and over-fitting
◮ Cross-validation
◮ Regularization
◮ Reading: W & F §5.1, 5.3

Generalization error
◮ The real aim of supervised learning is to do well on test data that is not known during training:

    E_train = (1/n) Σ_{i=1}^{n} error(f_D(x_i), y_i)

    E_gen = ∫ error(f_D(x), y) p(y, x) dx

  where p(y, x) is the probability density of the input data and f_D(x) is the predictor after training on dataset D.
◮ We cannot measure the generalization error E_gen directly.
◮ For example, in linear regression:
  ◮ f_D(x_i) = ŵ^T φ(x_i)
  ◮ error(ŷ, y) = (ŷ − y)^2
◮ The key point: our learning method chooses f_D so as to optimize E_train. Often E_gen > E_train, because the model has been fitted using the training data.
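The two error measures above can be estimated numerically. Below is a minimal pure-Python sketch, not from the slides, that fits a line by least squares and approximates E_gen with a large fresh sample (all data and names are illustrative):

```python
import random

# Sketch of E_train vs E_gen for 1-D least-squares line fitting.  E_gen is
# approximated by the average error on a large fresh sample, standing in
# for the integral over p(y, x).
random.seed(0)

def sample(n):
    xs = [random.uniform(0, 1) for _ in range(n)]
    ys = [2 * x + random.gauss(0, 0.1) for x in xs]   # y = 2x + noise
    return xs, ys

def fit_line(xs, ys):
    # closed-form least squares for f(x) = a*x + b
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

def mse(model, xs, ys):
    a, b = model
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

train_x, train_y = sample(20)
gen_x, gen_y = sample(10_000)     # large sample approximating E_gen

model = fit_line(train_x, train_y)
e_train = mse(model, train_x, train_y)
e_gen = mse(model, gen_x, gen_y)
# Usually e_gen >= e_train, since the model was chosen to minimize e_train.
```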

Polynomial regression

    φ(x) = (1, x, x^2, ..., x^M)^T

[Figure: polynomial fits to noisy data for M = 0, 1, 3, 9. Figure credit: Chris Bishop, PRML]

Under- and Overfitting
◮ Choosing values of the parameters that minimize the training error may not lead to the best generalization performance.
◮ If the model is too simple, it will not be able to represent the patterns that exist. This is underfitting.
◮ If the model is too complex, it will memorize the training data. It will remember "noise", i.e., patterns in the data that occur only due to chance. This is called overfitting.
◮ Overfitting: a hypothesis f ∈ F is said to overfit the data if there exists some alternative hypothesis f′ ∈ F such that f has a smaller training error than f′, but f′ has a smaller generalization error than f.
◮ We need a balance between the two.

Training vs Generalization Error

[Figure: training error decreases with model complexity, while generalization error first falls and then rises. Adapted from a figure by Sam Roweis.]

Knobs are your friend
◮ Every data set will require a different balance between over- and underfitting. This depends on how much data we have and how complex the actual relationship is.
◮ In general we need: (a) a knob that causes the algorithm to favour simpler or more complex rules, and (b) a procedure for setting this knob based on data, to choose the right balance.
◮ This is why all the learning algorithms in Weka have parameters.
  ◮ For decision trees: the parameters of the pruning algorithm
  ◮ For polynomial regression: M (the order of the polynomial)
  ◮ For k-nearest neighbour: k
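Bishop's polynomial example above can be reproduced numerically. The sketch below (synthetic data, not from the slides) fits polynomials of order M = 0, 1, 3, 9 and compares training and test RMS error:

```python
import numpy as np

# Fit polynomials of increasing order M to noisy samples of sin(2*pi*x),
# echoing the figure from the slides (data here is synthetic).
rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 10)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, 10)
x_test = np.linspace(0, 1, 200)
y_test = np.sin(2 * np.pi * x_test)           # noise-free target

def rms(M):
    coeffs = np.polyfit(x_train, y_train, M)  # least-squares fit of degree M
    train = np.sqrt(np.mean((np.polyval(coeffs, x_train) - y_train) ** 2))
    test = np.sqrt(np.mean((np.polyval(coeffs, x_test) - y_test) ** 2))
    return train, test

for M in (0, 1, 3, 9):
    print(M, rms(M))
# Training error shrinks monotonically as M grows; at M = 9 the 10 training
# points are interpolated almost exactly, yet the test error stays large.
```

This is exactly the under/overfitting trade-off: M = 0 cannot represent the sine wave, while M = 9 memorizes the noise.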

Knobs are your friend (cont.)
◮ For linear regression: ????

Regularization
◮ Regularization is a general approach to add a "complexity knob" to a learning algorithm. It requires that the parameters be continuous (i.e., regression OK, decision trees not).
◮ If we penalize polynomials that have large values for their coefficients, we will get less wiggly solutions:

    Ẽ(w) = |y − Φw|^2 + λ|w|^2

◮ The solution is

    ŵ = (Φ^T Φ + λI)^{-1} Φ^T y

◮ This is known as ridge regression.
◮ Rather than using a discrete control parameter like M (model order), we can use a continuous parameter λ.
◮ Caution: don't shrink the bias term! (The one that corresponds to the all-1s feature.)

Regularized Loss Function
◮ The overall cost function is the sum of two parabolic bowls. The sum is also a parabolic bowl.
◮ The combined minimum lies on the line between the minimum of the squared error and the origin.
◮ The regularizer just shrinks the weights. (Credit: Geoff Hinton)
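The ridge solution has a one-line closed form. Here is a minimal numpy sketch, with illustrative data (not from the slides) and the bias left unpenalized as the slides advise:

```python
import numpy as np

# Sketch of the ridge solution w = (Phi^T Phi + lambda*I)^{-1} Phi^T y.
def ridge_fit(Phi, y, lam):
    pen = np.eye(Phi.shape[1])
    pen[0, 0] = 0.0          # don't shrink the bias (the all-1s feature)
    return np.linalg.solve(Phi.T @ Phi + lam * pen, Phi.T @ y)

x = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
Phi = np.stack([np.ones_like(x), x, x ** 2], axis=1)   # phi(x) = (1, x, x^2)
y = 1.0 + 2.0 * x                                      # exactly linear data

w_small = ridge_fit(Phi, y, lam=1e-8)   # ~ordinary least squares: (1, 2, 0)
w_big = ridge_fit(Phi, y, lam=100.0)    # non-bias weights shrink toward 0
```

With a tiny λ the fit recovers the true linear coefficients; with a large λ the weights are shrunk while the unpenalized bias absorbs the mean of y.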

The effect of regularization for M = 9

[Figure: degree-9 fits with ln λ = −18 and ln λ = 0, and a plot of training and test E_RMS against ln λ. Figure credit: Chris Bishop, PRML]

For regular old linear regression, we had:
◮ Define the task: regression
◮ Decide on the model structure: linear regression model
◮ Decide on the score function: squared error (likelihood)
◮ Decide on the optimization/search method to optimize the score function: calculus (analytic solution)

But with ridge regression we have:
◮ Define the task: regression
◮ Decide on the model structure: linear regression model
◮ Decide on the score function: squared error with quadratic regularization
◮ Decide on the optimization/search method to optimize the score function: calculus (analytic solution)

Notice how you can train the same model structure with different score functions. This is the first time we have seen this. This is important.
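The λ sweep in the figure can be imitated numerically. A sketch with synthetic data, not from the slides (and, for simplicity, a penalized bias, which the slides caution against):

```python
import numpy as np

# Sweep the continuous knob lambda for a fixed M = 9 polynomial, mirroring
# the figure: ln(lambda) = -18 barely regularizes, ln(lambda) = 0 shrinks hard.
rng = np.random.default_rng(1)
def make(n):
    x = rng.uniform(0, 1, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(0, 0.2, n)

def design(x, M=9):
    return np.stack([x ** j for j in range(M + 1)], axis=1)

x_tr, y_tr = make(10)
x_te, y_te = make(1000)
Phi_tr, Phi_te = design(x_tr), design(x_te)

def rms_for(lam):
    w = np.linalg.solve(Phi_tr.T @ Phi_tr + lam * np.eye(10), Phi_tr.T @ y_tr)
    err = lambda Phi, y: np.sqrt(np.mean((Phi @ w - y) ** 2))
    return err(Phi_tr, y_tr), err(Phi_te, y_te)

for ln_lam in (-18, -9, 0):
    print(ln_lam, rms_for(np.exp(ln_lam)))
# With almost no regularization the training error is tiny but the test
# error is much larger; increasing lambda trades training fit for smoothness.
```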

A Knob-Setting Procedure
◮ Regularization was a way of adding a "capacity control" knob.
◮ But how do we set its value? e.g., the regularization parameter λ.
◮ It won't work to do it on the training set (why not?).
◮ We will cover two choices:
  ◮ a validation set
  ◮ cross-validation

Using a validation set
◮ Split the labelled data into a training set, a validation set, and a test set.
  ◮ Training set: use for training.
  ◮ Validation set: tune the "knobs" according to performance on the validation set.
  ◮ Test set: check how the final model performs.
◮ There are no right answers, but, for example, you could choose 60% training, 20% validation, 20% test.

Example of using a validation set
Consider polynomial regression:
1. For each m = 1, 2, ..., M (you choose M in advance):
2.   Train the polynomial regression using φ(x) = (1, x, x^2, ..., x^m)^T on the training set (e.g., by minimizing squared error). This produces a predictor f_m(x).
3.   Measure the error of f_m on the validation set.
4. End for.
5. Choose the f_m with the best validation error.
6. Measure the error of f_m on the test set to see how well you should expect it to perform.

Cross-validation
◮ The idea of holding out a separate validation set seems rather wasteful of data → k-fold cross-validation.
◮ Divide the labelled data into k parts (or folds), train on k − 1 folds, and validate on one. Do this k times, holding out a different fold each time. Common choices for k are 3 or 10.
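The numbered validation-set procedure can be sketched directly. Synthetic data, not from the slides; the 60/20/20 split follows the suggestion above:

```python
import numpy as np

# Choose the polynomial order m on a validation set, then report test error.
rng = np.random.default_rng(2)
def sample(n):
    x = rng.uniform(0, 1, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(0, 0.2, n)

x_tr, y_tr = sample(30)   # training set (60%)
x_va, y_va = sample(10)   # validation set (20%)
x_te, y_te = sample(10)   # test set (20%)

def val_rms(m):
    c = np.polyfit(x_tr, y_tr, m)                              # step 2
    err = np.sqrt(np.mean((np.polyval(c, x_va) - y_va) ** 2))  # step 3
    return c, err

M = 8                                                   # chosen in advance
fits = [val_rms(m) for m in range(1, M + 1)]            # steps 1-4
best_m = 1 + min(range(M), key=lambda i: fits[i][1])    # step 5
c_best = fits[best_m - 1][0]
test_rms = np.sqrt(np.mean((np.polyval(c_best, x_te) - y_te) ** 2))  # step 6
```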

Cross-validation (pretty)

[Figure: three folds of the data; each fold in turn serves as the validation part while the remaining folds are used for training.]

Cross-validation
◮ Validation performance is the average of the validation performance on each of the k folds.
◮ Choose the m with the maximum validation performance.
◮ If k = n, then we have leave-one-out cross-validation (LOO-CV).
◮ Once you have selected m, pool all of the data back together and train as usual on that value only.

Continuous Knobs
◮ For a discrete knob like the polynomial order m, we could simply search all values.
◮ What about a quadratic regularization parameter λ? What do we do then?
◮ Pick a grid of values to search. In practice you want the grid to vary geometrically for this sort of parameter. e.g., try λ ∈ {0.01, 0.1, 0.5, 1.0, 5.0, 10.0}. Don't bother trying 2.0, 3.0, 7.0.
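The fold bookkeeping can be sketched in a few lines of pure Python (index-based folds for illustration; a real implementation would shuffle the data first):

```python
# Partition indices 0..n-1 into k consecutive folds.
def kfold_indices(n, k):
    folds = []
    for i in range(k):
        start, stop = i * n // k, (i + 1) * n // k
        val = list(range(start, stop))                       # held-out fold
        train = [j for j in range(n) if j < start or j >= stop]
        folds.append((train, val))
    return folds

folds = kfold_indices(10, 3)
# Every example is validated exactly once across the k folds.
validated = sorted(i for _, val in folds for i in val)

# With k = n each validation fold is a single example: leave-one-out CV.
loo = kfold_indices(5, 5)
```

Averaging the validation error of a model over the k folds gives the cross-validation score used to compare knob settings.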

Problems with cross-validation
◮ You can still overfit! If you exhaustively try a really large number of possible approaches and knob settings, you could by chance happen to find a parameter setting that predicts all the training data well.
◮ It can be expensive computationally.
  ◮ Sometimes there are tricks to reduce the computation.

Summary
◮ Generalization error vs training error
◮ Under- and over-fitting
◮ Using knobs to control the complexity of a predictor
◮ Estimate generalization error with a validation set (or CV)
◮ Regularization
