SLIDE 1

Lecture 5:
− Regularization
− ML Methodology

Aykut Erdem
February 2016, Hacettepe University

SLIDE 2

[Figure: root-mean-square error E_RMS on the training and test sets versus polynomial order M, from Bishop.]

Recall from last time… Linear Regression

from Bishop

y(x) = w_0 + w_1 x, \quad \mathbf{w} = (w_0, w_1)

\ell(\mathbf{w}) = \sum_{n=1}^{N} \left[ t^{(n)} - (w_0 + w_1 x^{(n)}) \right]^2

Gradient Descent Update Rule:

\mathbf{w} \leftarrow \mathbf{w} + 2\lambda \left( t^{(n)} - y(x^{(n)}) \right) x^{(n)}

Closed Form Solution:

\mathbf{w} = (X^T X)^{-1} X^T \mathbf{t}
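To make both routes concrete, here is a minimal NumPy sketch of the gradient-descent update and the closed-form solution (the synthetic data, learning rate, and variable names are illustrative assumptions, not from the lecture):

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 20)                     # inputs x^(n)
t = 2.0 + 3.0 * x + rng.normal(0, 0.1, 20)    # targets t^(n) = w0 + w1*x + noise
X = np.column_stack([np.ones_like(x), x])     # design matrix with a bias column

# Closed form: w = (X^T X)^{-1} X^T t
w_closed = np.linalg.solve(X.T @ X, X.T @ t)

# Stochastic gradient descent with the slide's update rule,
# w <- w + 2*lam*(t^(n) - y(x^(n))) x^(n), with lam playing the learning rate
w = np.zeros(2)
lam = 0.05
for epoch in range(200):
    for n in rng.permutation(len(x)):
        w = w + 2 * lam * (t[n] - X[n] @ w) * X[n]

print(w_closed, w)  # both estimates should recover roughly (2.0, 3.0)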

SLIDE 3

1-D regression illustrates key concepts

  • Data fits – is the linear model best (model selection)?
    − The simplest models do not capture all the important variations (signal) in the data: they underfit
    − A more complex model may overfit the training data (fit not only the signal but also the noise in the data), especially if there is not enough data to constrain the model
  • One method of assessing fit: test generalization = the model's ability to predict held-out data
  • Optimization is essential: stochastic and batch iterative approaches; analytic when available

slide by Richard Zemel

SLIDE 4

Today

  • Regularization
  • Machine Learning Methodology
    − validation
    − cross-validation (k-fold, leave-one-out)
    − model selection

SLIDE 5

Regularization


SLIDE 6

Regularized Least Squares

  • A technique to control the overfitting phenomenon
  • Add a penalty term to the error function in order to discourage the coefficients from reaching large values

E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{ y(x_n, \mathbf{w}) - t_n \}^2 + \frac{\lambda}{2} \|\mathbf{w}\|^2

where \|\mathbf{w}\|^2 \equiv \mathbf{w}^T \mathbf{w} = w_0^2 + w_1^2 + \ldots + w_M^2, and the coefficient \lambda governs the relative importance of the regularization term compared with the sum-of-squares error term.

This form of regularizer is known as ridge regression, which is minimized by

\mathbf{w} = (\lambda I + \Phi^T \Phi)^{-1} \Phi^T \mathbf{t}

slide by Erik Sudderth
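As a quick illustration, here is a hedged NumPy sketch of the ridge closed form above, using a polynomial basis (the helper name and the example degree are my own choices, not the lecture's code):

import numpy as np

def fit_ridge(x, t, M, lam):
    """Minimize E(w): return w = (lam*I + Phi^T Phi)^{-1} Phi^T t."""
    Phi = np.vander(x, M + 1, increasing=True)   # columns phi_j(x) = x^j, j = 0..M
    A = lam * np.eye(M + 1) + Phi.T @ Phi
    return np.linalg.solve(A, Phi.T @ t)

# e.g. the M = 9 fit on the next slide: fit_ridge(x, t, 9, np.exp(-18))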

SLIDE 7

The effect of regularization

[Figure: two M = 9 polynomial fits to the data, for ln λ = −18 (left) and ln λ = 0 (right).]

slide by Erik Sudderth

SLIDE 8

[Figure: E_RMS on the training and test sets versus ln λ.]

The effect of regularization

         ln λ = −∞    ln λ = −18    ln λ = 0
w⋆_0           0.35          0.35        0.13
w⋆_1         232.37          4.74       −0.05
w⋆_2       −5321.83         −0.77       −0.06
w⋆_3       48568.31        −31.97       −0.05
w⋆_4     −231639.30         −3.89       −0.03
w⋆_5      640042.26         55.28       −0.02
w⋆_6    −1061800.52         41.32       −0.01
w⋆_7     1042400.18        −45.95       −0.00
w⋆_8     −557682.99        −91.53        0.00
w⋆_9      125201.43         72.68        0.01

The corresponding coefficients from the fitted polynomials, showing that regularization has the desired effect of reducing the magnitude of the coefficients.

slide by Erik Sudderth

SLIDE 9

A more general regularizer

\frac{1}{2} \sum_{n=1}^{N} \{ t_n - \mathbf{w}^T \phi(x_n) \}^2 + \frac{\lambda}{2} \sum_{j=1}^{M} |w_j|^q

[Figure: contours of the regularization term for q = 0.5, q = 1, q = 2, q = 4.]

slide by Richard Zemel
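A small sketch of this general error function for arbitrary q (function and argument names are my own; for simplicity the penalty here runs over all coefficients):

import numpy as np

def general_regularized_error(w, Phi, t, lam, q):
    """(1/2) sum_n (t_n - w^T phi(x_n))^2 + (lam/2) sum_j |w_j|^q."""
    data_term = 0.5 * np.sum((t - Phi @ w) ** 2)
    penalty = 0.5 * lam * np.sum(np.abs(w) ** q)
    return data_term + penalty

# q = 2 recovers ridge regression; q = 1 gives the lasso penalty.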

SLIDE 10

Machine Learning Methodology

SLIDE 11

Recap: Regression

  • In regression, labels y_i are continuous
  • Classification/regression are solved very similarly
  • Everything we have done so far transfers to classification with very minor changes
  • Error: sum of distances from examples to the fitted model

[Figure: 1-D regression example, data points (x, y) with a fitted model.]

slide by Olga Veksler

SLIDE 12

Training/Test Data Split

  • Talked about splitting data into training/test sets
    − training data is used to fit parameters
    − test data is used to assess how the classifier generalizes to new data
  • What if the classifier has "non‐tunable" parameters?
    − a parameter is "non‐tunable" if tuning (or training) it on the training data leads to overfitting
  • Examples:
    − k in the kNN classifier
    − number of hidden units in an MNN
    − number of hidden layers in an MNN
    − etc …

slide by Olga Veksler

SLIDE 13

Example of Overfitting

  • Want to fit a polynomial machine f(x, w)
  • Instead of fixing the polynomial degree, make it a parameter d
    − learning machine f(x, w, d)
  • Consider just three choices for d:
    − degree 1
    − degree 2
    − degree 3
  • Training error is a bad measure to choose d
    − degree 3 is the best according to the training error, but it overfits the data

[Figure: polynomial fits of degree 1, 2, and 3 to the same data.]

slide by Olga Veksler
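A tiny illustration of this point, on synthetic data of my own (not the lecture's): degree 3 always wins on training error even though the data comes from a line.

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
y = 1 + 2 * x + rng.normal(0, 0.3, 10)        # truly linear data plus noise

for d in (1, 2, 3):
    w = np.polyfit(x, y, d)
    mse = np.mean((y - np.polyval(w, x)) ** 2)
    print(d, mse)                              # training MSE falls as d grows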
SLIDE 14

Training/Test Data Split

  • What about the test error? Seems appropriate
    − degree 2 is the best model according to the test error
  • Except, what do we report as the test error now?
  • The test error should be computed on data that was not used for training at all!
  • Here we used the "test" data for training, i.e. for choosing the model

slide by Olga Veksler
SLIDE 15

Validation data

  • Same question when choosing among several classifiers
    − our polynomial degree example can be looked at as choosing among 3 classifiers (degree 1, 2, or 3)
  • Solution: split the labeled data into three parts

labeled data: Training 60% | Validation 20% | Test 20%
  − Training: train the tunable parameters w
  − Validation: train other parameters, or select the classifier
  − Test: use only to assess final performance

slide by Olga Veksler
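A minimal sketch of this 60/20/20 split in plain NumPy (the function name and seed are illustrative; in practice scikit-learn's train_test_split does the same job):

import numpy as np

def split_60_20_20(X, y, seed=0):
    idx = np.random.default_rng(seed).permutation(len(X))
    n_train, n_val = int(0.6 * len(X)), int(0.2 * len(X))
    tr = idx[:n_train]                       # 60% for training
    va = idx[n_train:n_train + n_val]        # 20% for validation
    te = idx[n_train + n_val:]               # 20% held out for the final test
    return (X[tr], y[tr]), (X[va], y[va]), (X[te], y[te])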

SLIDE 16

Training/Validation

labeled data: Training 60% | Validation 20% | Test 20%
  − Training error: computed on training examples
  − Validation error: computed on validation examples
  − Test error: computed on test examples

slide by Olga Veksler

SLIDE 17

Training/Validation/Test Data

  • Training Data
  • Validation Data
    − validation error: 3.3 (d = 1), 1.8 (d = 2), 3.4 (d = 3)
    − d = 2 is chosen
  • Test Data
    − test error 1.3 computed for d = 2

slide by Olga Veksler

SLIDE 18

Choosing Parameters: Example

  • Need to choose the number of hidden units for an MNN
  • The more hidden units, the better we can fit the training data
  • But at some point we overfit the data

[Figure: training error and validation error versus the number of hidden units.]

slide by Olga Veksler
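A hedged sketch of this sweep, using scikit-learn's MLPRegressor as the MNN (the dataset X_train/y_train and X_val/y_val and the list of sizes are assumptions, not the lecture's setup):

from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

train_err, val_err = [], []
for h in [1, 2, 5, 10, 20, 50]:
    net = MLPRegressor(hidden_layer_sizes=(h,), max_iter=2000, random_state=0)
    net.fit(X_train, y_train)
    train_err.append(mean_squared_error(y_train, net.predict(X_train)))
    val_err.append(mean_squared_error(y_val, net.predict(X_val)))
# training error keeps shrinking as h grows; pick h where validation error bottoms out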

SLIDE 19

Diagnosing Underfitting/Overfitting

Underfitting:
  − large training error
  − large validation error

Just Right:
  − small training error
  − small validation error

Overfitting:
  − small training error
  − large validation error

slide by Olga Veksler

SLIDE 20

Fixing Underfitting/Overfitting

  • Fixing Underfitting
    − getting more training examples will not help
    − get more features
    − try a more complex classifier
    − if using an MNN, try more hidden units
  • Fixing Overfitting
    − getting more training examples might help
    − try a smaller set of features
    − try a less complex classifier
    − if using an MNN, try fewer hidden units

slide by Olga Veksler

SLIDE 21

Train/Test/Validation Method

  • Good news:
    − very simple
  • Bad news:
    − wastes data
      − in general, the more data we have, the better the estimated parameters
      − we estimate parameters on 40% less data, since 20% is removed for the test set and 20% for the validation set
    − if we have a small dataset, our test (validation) set might just be lucky or unlucky
  • Cross-validation is a method for performance evaluation that wastes less data

slide by Olga Veksler

SLIDE 22

Small Dataset

[Figure: three models fit to the same small dataset.]

  • Linear Model: Mean Squared Error = 2.4
  • Quadratic Model: Mean Squared Error = 0.9
  • Join-the-dots Model: Mean Squared Error = 2.2

slide by Olga Veksler

SLIDE 23

LOOCV (Leave-one-out Cross Validation)

[Figure: data points (x, y).]

For k = 1 to n
  1. Let (x_k, y_k) be the kth example

slide by Olga Veksler

SLIDE 24

LOOCV (Leave-one-out Cross Validation)

[Figure: data points (x, y).]

For k = 1 to n
  1. Let (x_k, y_k) be the kth example
  2. Temporarily remove (x_k, y_k) from the dataset

slide by Olga Veksler

SLIDE 25

LOOCV (Leave-one-out Cross Validation)

[Figure: data points (x, y).]

For k = 1 to n
  1. Let (x_k, y_k) be the kth example
  2. Temporarily remove (x_k, y_k) from the dataset
  3. Train on the remaining n − 1 examples

slide by Olga Veksler

SLIDE 26

LOOCV (Leave-one-out Cross Validation)

[Figure: data points (x, y).]

For k = 1 to n
  1. Let (x_k, y_k) be the kth example
  2. Temporarily remove (x_k, y_k) from the dataset
  3. Train on the remaining n − 1 examples
  4. Note your error on (x_k, y_k)

slide by Olga Veksler

SLIDE 27

LOOCV (Leave-one-out Cross Validation)

[Figure: data points (x, y).]

For k = 1 to n
  1. Let (x_k, y_k) be the kth example
  2. Temporarily remove (x_k, y_k) from the dataset
  3. Train on the remaining n − 1 examples
  4. Note your error on (x_k, y_k)
When you've done all points, report the mean error

slide by Olga Veksler
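The loop above translates almost line for line into code; a minimal sketch, assuming generic fit/predict callables (all names are my own):

import numpy as np

def loocv_mse(x, y, fit, predict):
    n = len(x)
    errors = []
    for k in range(n):
        mask = np.arange(n) != k            # 2. temporarily remove (x_k, y_k)
        model = fit(x[mask], y[mask])       # 3. train on the remaining n-1 examples
        errors.append((y[k] - predict(model, x[k])) ** 2)   # 4. note the error
    return np.mean(errors)                  # report the mean error

# e.g. quadratic regression:
#   loocv_mse(x, y, fit=lambda x, y: np.polyfit(x, y, 2),
#             predict=lambda w, x: np.polyval(w, x))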
SLIDE 28

LOOCV (Leave-one-out Cross Validation)

[Figure: the nine leave-one-out fits for linear regression.]

MSE_LOOCV = 2.12

slide by Olga Veksler
SLIDE 29

LOOCV for Quadratic Regression

[Figure: the nine leave-one-out fits for quadratic regression.]

MSE_LOOCV = 0.962

slide by Olga Veksler

SLIDE 30

LOOCV for Join-the-dots

[Figure: the nine leave-one-out fits for the join-the-dots model.]

MSE_LOOCV = 3.33

slide by Olga Veksler

SLIDE 31

Which kind of Cross Validation?

Method          Downside                                                Upside
Test set        may give an unreliable estimate of future performance   cheap
Leave-one-out   expensive                                               doesn't waste data

  • Can we get the best of both worlds?

slide by Olga Veksler
SLIDE 32

K-Fold Cross Validation

[Figure: data points (x, y), randomly colored into three partitions.]

  • Randomly break the dataset into k partitions
    − in this example, we have k = 3 partitions, colored red, green, and blue

slide by Olga Veksler

SLIDE 33

K-Fold Cross Validation

[Figure: data points (x, y).]

  • Randomly break the dataset into k partitions
    − in this example, we have k = 3 partitions, colored red, green, and blue
  • For the blue partition: train on all points not in the blue partition. Find the test‐set sum of errors on the blue points

slide by Olga Veksler

SLIDE 34

K-Fold Cross Validation

[Figure: data points (x, y).]

  • Randomly break the dataset into k partitions
    − in this example, we have k = 3 partitions, colored red, green, and blue
  • For the blue partition: train on all points not in the blue partition. Find the test‐set sum of errors on the blue points
  • For the green partition: train on all points not in the green partition. Find the test‐set sum of errors on the green points

slide by Olga Veksler

SLIDE 35

K-Fold Cross Validation

[Figure: data points (x, y).]

  • Randomly break the dataset into k partitions
    − in this example, we have k = 3 partitions, colored red, green, and blue
  • For the blue partition: train on all points not in the blue partition. Find the test‐set sum of errors on the blue points
  • For the green partition: train on all points not in the green partition. Find the test‐set sum of errors on the green points
  • For the red partition: train on all points not in the red partition. Find the test‐set sum of errors on the red points

slide by Olga Veksler

SLIDE 36

K-Fold Cross Validation

[Figure: the three linear-regression fold fits.]

Linear Regression: MSE_3FOLD = 2.05

  • Randomly break the dataset into k partitions
    − in this example, we have k = 3 partitions, colored red, green, and blue
  • For the blue partition: train on all points not in the blue partition. Find the test‐set sum of errors on the blue points
  • For the green partition: train on all points not in the green partition. Find the test‐set sum of errors on the green points
  • For the red partition: train on all points not in the red partition. Find the test‐set sum of errors on the red points
  • Report the mean error

slide by Olga Veksler
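A hedged sketch of the k-fold procedure just described (the partition logic and names are my own; fit/predict are generic callables as in the earlier LOOCV sketch):

import numpy as np

def kfold_mse(x, y, fit, predict, k=3, seed=0):
    idx = np.random.default_rng(seed).permutation(len(x))
    errors = []
    for fold in np.array_split(idx, k):     # e.g. the blue, green, red partitions
        train = np.setdiff1d(idx, fold)     # train on all points not in this fold
        model = fit(x[train], y[train])
        errors.extend((y[fold] - predict(model, x[fold])) ** 2)
    return np.mean(errors)                  # report the mean error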

SLIDE 37

K-Fold Cross Validation

[Figure: the three quadratic-regression fold fits.]

Quadratic Regression: MSE_3FOLD = 1.11

  • Randomly break the dataset into k partitions
    − in this example, we have k = 3 partitions, colored red, green, and blue
  • For the blue partition: train on all points not in the blue partition. Find the test‐set sum of errors on the blue points
  • For the green partition: train on all points not in the green partition. Find the test‐set sum of errors on the green points
  • For the red partition: train on all points not in the red partition. Find the test‐set sum of errors on the red points
  • Report the mean error

slide by Olga Veksler

SLIDE 38

K-Fold Cross Validation

[Figure: the three join-the-dots fold fits.]

Join-the-dots: MSE_3FOLD = 2.93

  • Randomly break the dataset into k partitions
    − in this example, we have k = 3 partitions, colored red, green, and blue
  • For the blue partition: train on all points not in the blue partition. Find the test‐set sum of errors on the blue points
  • For the green partition: train on all points not in the green partition. Find the test‐set sum of errors on the green points
  • For the red partition: train on all points not in the red partition. Find the test‐set sum of errors on the red points
  • Report the mean error

slide by Olga Veksler

SLIDE 39

Which kind of Cross Validation?

Method          Downside                                                Upside
Test set        may give an unreliable estimate of future performance   cheap
Leave-one-out   expensive                                               doesn't waste data
10-fold         wastes 10% of the data; 10 times more expensive         only wastes 10%; only 10 times more
                than the test-set method                                expensive instead of n times
3-fold          wastes more data than 10-fold; more expensive           slightly better than the test-set method
                than the test-set method
n-fold          identical to Leave-one-out

slide by Olga Veksler

SLIDE 40

Cross-validation for classification

  • Instead of computing the sum of squared errors on a test set, you should compute...

slide by Andrew Moore

SLIDE 41

Cross-validation for classification

  • Instead of computing the sum of squared errors on a test set, you should compute…

The total number of misclassifications on a test set

slide by Andrew Moore

SLIDE 42

Cross-validation for classification

  • Instead of computing the sum of squared errors on a test set, you should compute…

The total number of misclassifications on a test set

  • What's the LOOCV error of 1-NN?
  • What's the LOOCV error of 3-NN?
  • What's the LOOCV error of 22-NN?

slide by Andrew Moore
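To make these questions concrete, here is a sketch of LOOCV for a k-NN classifier that counts misclassifications instead of squared error (all names are my own; it assumes integer class labels):

import numpy as np

def knn_predict(X_tr, y_tr, x, k):
    nearest = np.argsort(np.linalg.norm(X_tr - x, axis=1))[:k]
    return np.bincount(y_tr[nearest]).argmax()   # majority vote (integer labels)

def loocv_misclassifications(X, y, k):
    n = len(X)
    return sum(knn_predict(X[np.arange(n) != i], y[np.arange(n) != i], X[i], k) != y[i]
               for i in range(n))                # total misclassifications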

SLIDE 43

Cross-validation for classification

  • Choosing k for a k‐nearest-neighbor classifier
  • Choosing kernel parameters for an SVM
  • Any other "free" parameter of a classifier
  • Choosing features to use
  • Choosing which classifier to use

slide by Andrew Moore

SLIDE 44

CV-based Model Selection

  • We’re trying to decide which algorithm to use.
  • We train each machine and make a table...

[Table: training error for each machine f1–f6 (values shown on the slide).]

slide by Olga Veksler

SLIDE 45

CV-based Model Selection

  • We’re trying to decide which algorithm to use.
  • We train each machine and make a table...

[Table: training error and 10-fold CV error for each machine f1–f6 (values shown on the slide).]

slide by Olga Veksler

SLIDE 46

CV-based Model Selection

  • We’re trying to decide which algorithm to use.
  • We train each machine and make a table...

[Table: training error, 10-fold CV error, and the final choice among machines f1–f6 (values shown on the slide).]

slide by Olga Veksler

SLIDE 47

CV-based Model Selection

  • Example: Choosing “k” for a k‐nearest‐neighbor regression.
  • Step 1: Compute LOOCV error for six different model classes:

[Table: training error and 10-fold CV error for k = 1 … 6, with the chosen k marked (values shown on the slide).]

  • Step 2: Choose model that gave the best CV score
  • Train with all the data, and that’s the final model you’ll use

slide by Olga Veksler
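A sketch of the two steps for this k-NN regression example, reusing the loocv_mse helper from the earlier LOOCV sketch (the tiny 1-D k-NN regressor below and the data x, y are illustrative assumptions, not the lecture's code):

import numpy as np

def knn_regress(model, x_query):
    x_tr, y_tr, k = model
    nearest = np.argsort(np.abs(x_tr - x_query))[:k]
    return y_tr[nearest].mean()              # average the k nearest targets

# Step 1: CV error for each model class k = 1..6
scores = {k: loocv_mse(x, y, lambda X, Y, k=k: (X, Y, k), knn_regress)
          for k in range(1, 7)}

# Step 2: pick the best CV score, then train on all the data
best_k = min(scores, key=scores.get)
final_model = (x, y, best_k)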

SLIDE 48

CV-based Model Selection

  • Why stop at k = 6?
    − No good reason, except it looked like things were getting worse as k was increasing
  • Are we guaranteed that a local optimum of k vs. LOOCV will be the global optimum?
    − No, in fact the relationship can be very bumpy
  • What should we do if we are depressed at the expense of doing LOOCV for k = 1 through 1000?
    − Try k = 1, 2, 4, 8, 16, 32, 64, …, 1024
    − Then do hill-climbing from an initial guess at k

slide by Olga Veksler
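This coarse-then-fine search is easy to sketch (cv_error is assumed to map a value of k to its cross-validation error; the unit step size and the bounds are my own choices):

def search_k(cv_error, k_max=1024):
    # Coarse pass: try k = 1, 2, 4, 8, ..., k_max
    k = min((2 ** i for i in range(k_max.bit_length())), key=cv_error)
    # Hill-climb from the coarse winner; no global guarantee, since the
    # relationship between k and CV error can be very bumpy
    while True:
        best = min([k] + [kk for kk in (k - 1, k + 1) if 1 <= kk <= k_max],
                   key=cv_error)
        if best == k:
            return k
        k = best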