
IAML: Overfitting and Capacity Control

Charles Sutton and Victor Lavrenko, School of Informatics, Semester 1


Outline

◮ Generalization error
◮ Estimating generalization error
◮ Example: polynomial regression
◮ Under- and over-fitting
◮ Cross-validation
◮ Regularization
◮ Reading: W & F §5.1, 5.3


Generalization error

◮ The real aim of supervised learning is to do well on test data that is not known during training:

$$E_{\text{train}} = \frac{1}{n} \sum_{i=1}^{n} \text{error}(f_D(x_i), y_i)$$

$$E_{\text{gen}} = \int \text{error}(f_D(x), y)\, p(y, x)\, dy\, dx$$

where $p(y, x)$ is the probability density of the input data and $f_D(x)$ is the predictor after training on dataset $D$. For example, in linear regression,

◮ $f_D(x_i) = \hat{w}^T \phi(x_i)$
◮ $\text{error}(\hat{y}, y) = (\hat{y} - y)^2$



◮ We cannot measure the generalization error $E_{\text{gen}}$ directly.
◮ The key point is: our learning method chooses $f_D$ so as to optimize $E_{\text{train}}$. Often $E_{\text{gen}} > E_{\text{train}}$, because the model has been fitted using the training data.
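Since $E_{\text{gen}}$ involves the unknown density $p(y, x)$, in practice we approximate it by the average error on a large held-out sample. A minimal sketch (illustrative NumPy; the data generator, sample sizes, and function names are my own assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    # Hypothetical data source: y = sin(2*pi*x) plus Gaussian noise
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(2 * np.pi * x) + 0.3 * rng.normal(size=n)
    return x, y

def design(x):
    # phi(x) = (1, x)^T for plain linear regression
    return np.column_stack([np.ones_like(x), x])

x_tr, y_tr = make_data(30)
w, *_ = np.linalg.lstsq(design(x_tr), y_tr, rcond=None)  # minimize squared error

x_te, y_te = make_data(100_000)                          # large held-out sample
E_train = np.mean((design(x_tr) @ w - y_tr) ** 2)
E_gen_est = np.mean((design(x_te) @ w - y_te) ** 2)      # Monte Carlo estimate of E_gen
print(E_train, E_gen_est)                                # typically E_gen_est > E_train
```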


slide-2
SLIDE 2

Polynomial regression

$\phi(x) = (1, x, x^2, \ldots, x^M)^T$

[Figure: polynomial fits for M = 0, 1, 3, and 9; axes x and t. Figure credit: Chris Bishop, PRML]
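As a sketch of this basis expansion in code (illustrative NumPy; the function names are my own):

```python
import numpy as np

def poly_features(x, M):
    # Design matrix whose rows are phi(x_i) = (1, x_i, x_i^2, ..., x_i^M)
    return np.vander(x, N=M + 1, increasing=True)

def fit_poly(x, y, M):
    # Least-squares fit: minimize |y - Phi w|^2
    Phi = poly_features(x, M)
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return w
```

With a small training set, fit_poly with M = 9 can drive the training error to nearly zero while fitting the noise, which is exactly the failure mode the next slide names.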

Under- and Overfitting

◮ Choosing values of the parameters that minimize the training error may not lead to the best generalization performance.
◮ If the model is too simple, it will not be able to represent the patterns that exist. This is underfitting.
◮ If the model is too complex, it will memorize the training data. It will remember “noise”, i.e., patterns in the data that occur only due to chance. This is called overfitting.
◮ Overfitting: a hypothesis f ∈ F is said to overfit the data if there exists some alternative hypothesis f′ ∈ F such that f has a smaller training error than f′, but f′ has a smaller generalization error than f.
◮ We need a balance between the two.


Training vs Generalization Error

[Figure: training vs. generalization error. Adapted from a figure by Sam Roweis.]


Knobs are your friend

◮ Every data set will require a different balance between over- and underfitting. It depends on how much data we have and how complex the actual relationship is.
◮ In general we need: (a) a knob that causes the algorithm to favour simpler or more complex rules, and (b) a procedure for setting this knob based on data, to choose the right balance.
◮ This is why all the learning algorithms in Weka have parameters.
◮ For decision trees: the parameters of the pruning algorithm
◮ For polynomial regression: M (the order of the polynomial)
◮ For k-nearest neighbor: k
◮ For linear regression: ????


Regularization

◮ Regularization is a general approach to add a “complexity knob” to a learning algorithm. It requires that the parameters be continuous (i.e., regression OK, decision trees not).
◮ If we penalize polynomials that have large values for their coefficients, we will get less wiggly solutions:

$$\tilde{E}(w) = |y - \Phi w|^2 + \lambda |w|^2$$

◮ The solution is

$$\hat{w} = (\Phi^T \Phi + \lambda I)^{-1} \Phi^T y$$

◮ This is known as ridge regression.
◮ Rather than using a discrete control parameter like M (model order), we can use a continuous parameter λ.
◮ Caution: don’t shrink the bias term! (The one that corresponds to the all-ones feature.)
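A sketch of the closed-form solution that keeps the bias unpenalized, as the caution above advises (illustrative; assumes column 0 of Φ is the all-ones feature):

```python
import numpy as np

def ridge_fit(Phi, y, lam):
    # w = (Phi^T Phi + lam * I)^(-1) Phi^T y, with the bias weight left unshrunk
    D = Phi.shape[1]
    I = np.eye(D)
    I[0, 0] = 0.0  # do not penalize the weight on the all-ones (bias) column
    # Solving the linear system is cheaper and more stable than an explicit inverse
    return np.linalg.solve(Phi.T @ Phi + lam * I, Phi.T @ y)
```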


Regularized Loss Function

◮ The overall cost function is the sum of two parabolic bowls. The sum is also a parabolic bowl.
◮ The combined minimum lies on the line between the minimum of the squared error and the origin.
◮ The regularizer just shrinks the weights.

Credit: Geoff Hinton


The effect of regularization for M = 9

[Figure: M = 9 fits with no regularization, with ln λ = −18, and with ln λ = 0; axes x and t. Figure credit: Chris Bishop, PRML]

M = 9

[Figure: root-mean-square error E_RMS on the training and test sets as a function of ln λ. Figure credit: Chris Bishop, PRML]
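A curve like this can be reproduced as a sketch, reusing the illustrative poly_features and ridge_fit helpers from earlier (the data and the grid of λ values are assumptions):

```python
import numpy as np

# Assumes x_tr, y_tr, x_te, y_te as in the earlier sketches
Phi_tr, Phi_te = poly_features(x_tr, 9), poly_features(x_te, 9)  # M = 9 basis
train_rms, test_rms = [], []
for ln_lam in np.linspace(-35, 0, 50):                 # sweep ln(lambda)
    w = ridge_fit(Phi_tr, y_tr, np.exp(ln_lam))
    train_rms.append(np.sqrt(np.mean((Phi_tr @ w - y_tr) ** 2)))
    test_rms.append(np.sqrt(np.mean((Phi_te @ w - y_te) ** 2)))
# Training RMS grows slowly as lambda increases; test RMS typically dips, then rises
```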

For regular old linear regression, we had

◮ Define the task: regression
◮ Decide on the model structure: linear regression model
◮ Decide on the score function: squared error (likelihood)
◮ Decide on optimization/search method to optimize the score function: calculus (analytic solution)


But with ridge regression we have

◮ Define the task: regression
◮ Decide on the model structure: linear regression model
◮ Decide on the score function: squared error with quadratic regularization
◮ Decide on optimization/search method to optimize the score function: calculus (analytic solution)

Notice how you can train the same model structure with different score functions. This is the first time we have seen this. This is important.



A Knob-Setting Procedure

◮ Regularization was a way of adding “capacity control”, a knob.
◮ But how do we set its value? e.g., the regularization parameter λ
◮ It won’t work to do it on the training set (why not?)
◮ We will cover two choices:
  ◮ Validation set
  ◮ Cross-validation

Using a validation set

◮ Split the labelled data into a training set, a validation set, and a test set.
◮ Training set: use for training.
◮ Validation set: tune the “knobs” according to performance on the validation set.
◮ Test set: check how the final model performs.
◮ No right answers, but for example, could choose 60% training, 20% validation, 20% test.


Example of using a validation set

Consider polynomial regression:

1. For each m = 1, 2, . . . , M (you choose M in advance):
2. Train the polynomial regression using $\phi(x) = (1, x, x^2, \ldots, x^m)^T$ on the training set (e.g., by minimizing squared error). This produces a predictor $f_m(x)$.
3. Measure the error of $f_m$ on the validation set.
4. End for.
5. Choose the $f_m$ with the best validation error.
6. Measure the error of $f_m$ on the test set to see how well you should expect it to perform.
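A sketch of these steps in code, reusing the illustrative fit_poly and poly_features helpers from earlier (the split variables and function names are assumptions):

```python
import numpy as np

def val_error(w, m, x, y):
    # Mean squared error of predictor f_m on (x, y)
    return np.mean((poly_features(x, m) @ w - y) ** 2)

def select_order(x_tr, y_tr, x_val, y_val, M):
    best_m, best_w, best_err = None, None, np.inf
    for m in range(1, M + 1):                  # step 1
        w = fit_poly(x_tr, y_tr, m)            # step 2: train on the training set
        err = val_error(w, m, x_val, y_val)    # step 3: validation error
        if err < best_err:                     # step 5: keep the best f_m
            best_m, best_w, best_err = m, w, err
    return best_m, best_w

# m, w = select_order(x_tr, y_tr, x_val, y_val, M=9)
# test_err = val_error(w, m, x_te, y_te)       # step 6: expected performance
```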


Cross-validation

◮ The idea of holding out a separate validation set seems rather wasteful of data → k-fold cross-validation.
◮ Divide the labelled data into k parts (or folds), train on k − 1 folds, and validate on one. Do this k times, holding out a different fold each time. Common choices for k are 3 or 10.
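A minimal k-fold loop, sketched with NumPy and the illustrative fit_poly helper (the fold mechanics and names are assumptions):

```python
import numpy as np

def cv_error(x, y, m, k=10, seed=0):
    # Average validation error of polynomial order m over k folds
    idx = np.random.default_rng(seed).permutation(len(x))
    folds = np.array_split(idx, k)
    errs = []
    for i in range(k):
        val = folds[i]                                     # hold out fold i
        tr = np.concatenate([f for j, f in enumerate(folds) if j != i])
        w = fit_poly(x[tr], y[tr], m)                      # train on the other k-1
        errs.append(np.mean((poly_features(x[val], m) @ w - y[val]) ** 2))
    return float(np.mean(errs))
```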



Cross-validation (pretty)

[Diagram: the data split into five blocks; fold 1 tests on block 1 and trains on blocks 2–5, fold 2 tests on block 2 and trains on blocks 1 and 3–5, fold 3 tests on block 3 and trains on the rest, and so on.]


Cross-validation

◮ Validation performance is the average of the validation performance on each of the k folds.
◮ Choose m with the maximum validation performance.
◮ If k = n, then we have leave-one-out cross-validation (LOO-CV).
◮ Once you have selected m, pool all of the data back together and train as usual on that value only.



Continuous Knobs

◮ For a discrete knob like the polynomial order m we could simply search all values.
◮ What about a continuous knob like the quadratic regularization parameter λ? What do we do then?
◮ Pick a grid of values to search. In practice you want the grid to vary geometrically for this sort of parameter: e.g., try λ ∈ {0.01, 0.1, 0.5, 1.0, 5.0, 10.0}. Don’t bother trying 2.0, 3.0, 7.0.
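A geometric grid is one line with NumPy (the exact range is an assumption), and each candidate λ can then be scored with the validation-set or cross-validation procedure above:

```python
import numpy as np

# Six lambdas spaced geometrically from 1e-2 to 1e1:
# roughly [0.01, 0.040, 0.16, 0.63, 2.5, 10.0]
lams = np.logspace(-2, 1, num=6)
# errs = [cv_ridge_error(Phi, y, lam) for lam in lams]  # hypothetical scoring helper
```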



Problems with cross-validation

◮ You can still overfit! If you exhaustively try a really large number of possible approaches and knob settings, you could by chance happen to find a parameter setting that predicts all the training data well.
◮ It can be expensive computationally.
◮ Sometimes there are tricks to reduce the computation.


Summary

◮ Generalization error vs training error
◮ Under- and over-fitting
◮ Using knobs to control the complexity of a predictor
◮ Estimate generalization error with a validation set (or CV)
◮ Regularization
