Introduction to Machine Learning: Model Validation and Selection


SLIDE 1

Model Validation and Selection

  • Dr. Ilija Bogunovic

Learning and Adaptive Systems (las.ethz.ch)

Introduction to Machine Learning

SLIDE 2

Recap: Achieving generalization

Fundamental assumption: our data set is generated independently and identically distributed (i.i.d.) from some unknown distribution P:

$(x_i, y_i) \sim P(X, Y)$

Our goal is to minimize the expected error (true risk) under P:

$R(w) = \mathbb{E}_{x,y}\big[(y - w^T x)^2\big] = \int P(x, y)\,(y - w^T x)^2 \, dx \, dy$

SLIDE 3

Recap: Evaluating predictive performance

Training error (empirical risk) systematically underestimates true risk:

$\mathbb{E}_D\big[\hat{R}_D(\hat{w}_D)\big] < \mathbb{E}_D\big[R(\hat{w}_D)\big]$
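To make this bias concrete, here is a small simulation sketch (my own illustration, not from the slides; all names are hypothetical). It fits least squares on a small training set and compares the training error with the error on a large fresh sample, which approximates the true risk:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 20
w_true = rng.normal(size=d)

def sample(m):
    """Draw m i.i.d. points from the (here: known) distribution P."""
    X = rng.normal(size=(m, d))
    y = X @ w_true + rng.normal(scale=0.5, size=m)
    return X, y

gaps = []
for _ in range(200):
    X, y = sample(n)
    w_hat = np.linalg.lstsq(X, y, rcond=None)[0]   # fit w on D
    train_err = np.mean((y - X @ w_hat) ** 2)      # empirical risk R_hat_D(w_hat_D)
    X_big, y_big = sample(100_000)                 # fresh sample approximates R(w_hat_D)
    gaps.append(np.mean((y_big - X_big @ w_hat) ** 2) - train_err)

print(f"average (true risk - training error) = {np.mean(gaps):.3f}  (positive)")
```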

SLIDE 4

Recap: More realistic evaluation?

Want to avoid underestimating the prediction error. Idea: use a separate test set from the same distribution P.

  • Obtain training data $D_{\text{train}}$ and test data $D_{\text{test}}$
  • Optimize w on the training set: $\hat{w} = \arg\min_{w} \hat{R}_{\text{train}}(w)$
  • Evaluate on the test set: $\hat{R}_{\text{test}}(\hat{w}) = \frac{1}{|D_{\text{test}}|} \sum_{(x,y) \in D_{\text{test}}} (y - \hat{w}^T x)^2$

Then:

$\mathbb{E}_{D_{\text{train}}, D_{\text{test}}}\big[\hat{R}_{D_{\text{test}}}(\hat{w}_{D_{\text{train}}})\big] = \mathbb{E}_{D_{\text{train}}}\big[R(\hat{w}_{D_{\text{train}}})\big]$
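A minimal NumPy sketch of this protocol (data and split sizes are hypothetical): optimize w on $D_{\text{train}}$ only, then report the error on the held-out $D_{\text{test}}$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 3
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=n)

idx = rng.permutation(n)              # random train/test split
train, test = idx[:150], idx[150:]

# Optimize w on the training set only
w_hat = np.linalg.lstsq(X[train], y[train], rcond=None)[0]

# R_hat_test(w_hat): mean squared error on the held-out points
test_err = np.mean((y[test] - X[test] @ w_hat) ** 2)
print(f"test error estimate: {test_err:.3f}")
```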

SLIDE 5

Why? Because $D_{\text{test}}$ is drawn independently of the learned weights $\hat{w}_{D_{\text{train}}}$, each test term $(y - \hat{w}^T x)^2$ has expectation equal to the true risk $R(\hat{w}_{D_{\text{train}}})$, so the test error is an unbiased estimate.

SLIDE 6

Recap: Evaluating predictive performance

Training error (empirical risk) systematically underestimates true risk; using an independent test set avoids this bias:

$\mathbb{E}_D\big[\hat{R}_D(\hat{w}_D)\big] < \mathbb{E}_D\big[R(\hat{w}_D)\big], \qquad \mathbb{E}_{D_{\text{train}}, D_{\text{test}}}\big[\hat{R}_{D_{\text{test}}}(\hat{w}_{D_{\text{train}}})\big] = \mathbb{E}_{D_{\text{train}}}\big[R(\hat{w}_{D_{\text{train}}})\big]$

SLIDE 7

First attempt: Evaluation for model selection

Obtain training data $D_{\text{train}}$ and test data $D_{\text{test}}$. Fit each candidate model (e.g., degree m of a polynomial) and pick the one that does best on the test set:

$\hat{w}_m = \arg\min_{w : \deg(w) \le m} \hat{R}_{\text{train}}(w), \qquad \hat{m} = \arg\min_m \hat{R}_{\text{test}}(\hat{w}_m)$

Do you see a problem?
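As a sketch, this is what the "first attempt" looks like for polynomial degree selection (hypothetical data; np.polyfit/np.polyval perform the degree-m least-squares fit):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, size=120)
y = np.sin(3 * x) + rng.normal(scale=0.2, size=x.shape)
x_tr, y_tr = x[:80], y[:80]      # D_train
x_te, y_te = x[80:], y[80:]      # D_test

test_err = {}
for m in range(1, 11):
    w_m = np.polyfit(x_tr, y_tr, m)              # w_hat_m on D_train
    test_err[m] = np.mean((y_te - np.polyval(w_m, x_te)) ** 2)

m_hat = min(test_err, key=test_err.get)          # argmin_m R_hat_test(w_hat_m)
print("selected degree:", m_hat)
```

Because $\hat{m}$ is chosen by minimizing these same test errors, reporting $\hat{R}_{\text{test}}(\hat{w}_{\hat{m}})$ is optimistically biased; the next slide makes this precise.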

SLIDE 8

Overfitting to test set

  • Test error is itself random!
  • Variance usually increases for more complex models
  • Optimizing for a single test set creates bias

[Figure: error vs. degree of polynomial, comparing test error and true risk]

SLIDE 9

Solution: Pick multiple test sets!

Key idea: instead of using a single test set, use multiple test sets and average to decrease variance! Dilemma: any data I use for testing I can't use for training → using multiple independent test sets is expensive and wasteful.

SLIDE 10

Evaluation for model selection

For each candidate model m (e.g., polynomial degree), repeat the following procedure for i = 1, ..., k:

  • Split the same data set into training and validation sets: $D = D^{(i)}_{\text{train}} \uplus D^{(i)}_{\text{val}}$
  • Train the model: $\hat{w}_i = \arg\min_w \hat{R}^{(i)}_{\text{train}}(w)$
  • Estimate the error: $\hat{R}^{(i)}_m = \hat{R}^{(i)}_{\text{val}}(\hat{w}_i)$

Select the model: $\hat{m} = \arg\min_m \frac{1}{k} \sum_{i=1}^{k} \hat{R}^{(i)}_m$
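A NumPy sketch of this procedure (hypothetical data, with polynomial degree as the model index m):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, size=100)
y = np.sin(3 * x) + rng.normal(scale=0.2, size=x.shape)

def cv_error(x, y, m, k=5):
    """(1/k) * sum_i R_val^(i) for a degree-m polynomial."""
    folds = np.array_split(np.random.default_rng(0).permutation(len(x)), k)
    errs = []
    for i in range(k):
        val = folds[i]                                               # D_val^(i)
        tr = np.concatenate([folds[j] for j in range(k) if j != i])  # D_train^(i)
        w_i = np.polyfit(x[tr], y[tr], m)                            # w_hat_i
        errs.append(np.mean((y[val] - np.polyval(w_i, x[val])) ** 2))
    return np.mean(errs)

m_hat = min(range(1, 11), key=lambda m: cv_error(x, y, m))           # argmin_m
print("selected degree:", m_hat)
```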

SLIDE 11

How should we do the splitting?

Randomly (Monte Carlo cross-validation):

  • Pick a training set of given size uniformly at random
  • Validate on the remaining points
  • Estimate the prediction error by averaging the validation error over multiple random trials

k-fold cross-validation (→ default choice):

  • Partition the data into k "folds"
  • Train on k−1 folds, evaluate on the remaining fold
  • Estimate the prediction error by averaging the validation errors obtained while varying the validation fold (see the sketch below)
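One way to generate both kinds of splits, sketched here with scikit-learn's KFold and ShuffleSplit (a library choice of mine; the slides don't prescribe any tool):

```python
import numpy as np
from sklearn.model_selection import KFold, ShuffleSplit

X = np.arange(20).reshape(-1, 1)   # dummy data, 20 points

# k-fold CV: each point lands in exactly one validation fold
for _, val in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    print("k-fold validation fold:", val)

# Monte Carlo CV: validation sets drawn uniformly at random, independently
for _, val in ShuffleSplit(n_splits=5, test_size=0.25, random_state=0).split(X):
    print("Monte Carlo validation set:", val)
```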


SLIDE 12

k-fold cross-validation

[Figure: the data set partitioned into folds $D_1, D_2, \ldots, D_i, \ldots, D_k$; each fold takes one turn as the validation set]

SLIDE 13

Accuracy of cross-validation

The cross-validation error estimate is very nearly unbiased for large enough k. (Demo)

SLIDE 14

Cross-validation

How large should we pick k?

Too small:
  • Risk of overfitting to the test set
  • Using too little data for training → risk of underfitting to the training set

Too large:
  • In general, better performance!
  • k = n is perfectly fine (called leave-one-out cross-validation, LOOCV)
  • Higher computational complexity

In practice, k = 5 or k = 10 is often used and works well.

SLIDE 15

Best practice for evaluating supervised learning

  • Split the data set into training and test sets
  • Never look at the test set when fitting the model; for example, use k-fold cross-validation on the training set
  • Report the final accuracy on the test set (but never optimize on the test set)!

Caveat: this only works if the data is i.i.d. Be careful, for example, if there are temporal trends or other dependencies.

SLIDE 16

Supervised learning summary so far

  • Model/objective: loss function (squared loss, $\ell_p$-loss)
  • Method: exact solution, gradient descent
  • Model selection: k-fold cross-validation, Monte Carlo CV
  • Representation/features: linear hypotheses, nonlinear hypotheses through feature transformations
  • Evaluation metric: mean squared error

SLIDE 17

Model selection more generally

For polynomial regression, model complexity is naturally controlled by the degree. In general, there may not be an ordering of the features that aligns with complexity:

  • E.g., how should we order words in the bag-of-words model?
  • For a collection of nonlinear feature transformations, model complexity is no longer naturally "ordered":

$x \mapsto \log(x + c), \qquad x \mapsto x^{\alpha}, \qquad x \mapsto \sin(ax + b)$
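For instance, a hand-built feature map stacking such transforms might look like this (the constants c, α, a, b are hypothetical choices; the point is that there is no single natural complexity ordering among the resulting features):

```python
import numpy as np

def feature_map(x, c=1.0, alpha=0.5, a=2.0, b=0.0):
    """Stack nonlinear transforms of a 1-D input into a feature matrix."""
    return np.column_stack([
        np.log(x + c),        # x -> log(x + c)
        x ** alpha,           # x -> x^alpha
        np.sin(a * x + b),    # x -> sin(ax + b)
    ])

x = np.linspace(0.1, 2.0, 5)
print(feature_map(x))         # shape (5, 3): one column per transform
```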

SLIDE 18

Demo: Overfitting → Large Weights

SLIDE 19

Regularization

If we only seek to minimize our loss (optimize data fit), we can get very complex models (large weights). Solution? Regularization! Encourage small weights via penalty functions (regularizers).

SLIDE 20

Ridge regression

Regularized optimization problem:

$\min_w \frac{1}{n} \sum_{i=1}^{n} (y_i - w^T x_i)^2 + \lambda \|w\|_2^2$

Can optimize using gradient descent, or still find an analytical solution:

$\hat{w} = (X^T X + \lambda I)^{-1} X^T y$

Note that now the scale of x matters!
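A sketch of the analytical solution on hypothetical data. The formula matches the slide's; note that λ there absorbs the 1/n factor of the averaged objective. Solving the linear system is preferable to forming the matrix inverse explicitly:

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Ridge estimator: solve (X^T X + lam * I) w = X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 4))
y = X @ np.array([2.0, -1.0, 0.0, 0.5]) + rng.normal(scale=0.1, size=50)
print(ridge_closed_form(X, y, lam=0.1))
```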

SLIDE 21

Renormalizing data: Standardization

Ensure that each feature has zero mean and unit variance:

$\tilde{x}_{i,j} = \frac{x_{i,j} - \hat{\mu}_j}{\hat{\sigma}_j}, \qquad \hat{\mu}_j = \frac{1}{n} \sum_{i=1}^{n} x_{i,j}, \qquad \hat{\sigma}_j^2 = \frac{1}{n} \sum_{i=1}^{n} (x_{i,j} - \hat{\mu}_j)^2$

Here $x_{i,j}$ is the value of the j-th feature of the i-th data point.
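A sketch with hypothetical data. One practical point worth a comment: at prediction time, test data must be standardized with the training set's $\hat{\mu}$ and $\hat{\sigma}$, not its own:

```python
import numpy as np

def standardize(X):
    """Per-feature standardization: zero mean, unit variance (columns = features)."""
    mu = X.mean(axis=0)        # mu_hat_j
    sigma = X.std(axis=0)      # sigma_hat_j (1/n convention, as on the slide)
    return (X - mu) / sigma, mu, sigma

rng = np.random.default_rng(5)
X_train = rng.normal(loc=5.0, scale=3.0, size=(100, 3))
X_train_std, mu, sigma = standardize(X_train)

X_test = rng.normal(loc=5.0, scale=3.0, size=(20, 3))
X_test_std = (X_test - mu) / sigma   # reuse the training statistics on test data
```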

SLIDE 22

Gradient descent for ridge regression

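The slide's body is worked on the board; as a sketch, gradient descent on the ridge objective from slide 20 could look like this (step size η and iteration count are hypothetical choices):

```python
import numpy as np

def ridge_gd(X, y, lam, eta=0.01, steps=2000):
    """Gradient descent on (1/n) * sum_i (y_i - w^T x_i)^2 + lam * ||w||_2^2."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        grad = -(2.0 / n) * X.T @ (y - X @ w) + 2.0 * lam * w
        w = w - eta * grad
    return w
```

Rewriting the update as w ← (1 − 2ηλ)w + η(2/n)Xᵀ(y − Xw) shows the effect of the regularizer: each step first shrinks the weights toward zero before the data-fit gradient is applied.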

SLIDE 23

Demo: Regularization


SLIDE 24

How to choose regularization parameter?

Cross-validation! Typically pick candidate values of λ logarithmically spaced, and select the one minimizing the cross-validated error of

$\min_w \frac{1}{n} \sum_{i=1}^{n} (y_i - w^T x_i)^2 + \lambda \|w\|_2^2$
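A sketch combining the two on hypothetical data (np.logspace gives the logarithmically spaced grid):

```python
import numpy as np

def cv_ridge_error(X, y, lam, k=5):
    """k-fold cross-validated MSE for ridge regression with parameter lam."""
    folds = np.array_split(np.random.default_rng(0).permutation(len(y)), k)
    errs = []
    for i in range(k):
        val = folds[i]
        tr = np.concatenate([folds[j] for j in range(k) if j != i])
        w = np.linalg.solve(X[tr].T @ X[tr] + lam * np.eye(X.shape[1]),
                            X[tr].T @ y[tr])
        errs.append(np.mean((y[val] - X[val] @ w) ** 2))
    return np.mean(errs)

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 6))
y = X @ rng.normal(size=6) + rng.normal(scale=0.5, size=100)

lambdas = np.logspace(-4, 2, 13)                      # 10^-4, ..., 10^2
lam_hat = min(lambdas, key=lambda l: cv_ridge_error(X, y, l))
print("selected lambda:", lam_hat)
```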

SLIDE 25

Regularization path


SLIDE 26

Outlook: Fundamental tradeoff in ML

Need to trade off loss (goodness of fit) and simplicity. A lot of supervised learning problems can be written in this way:

$\min_w \hat{R}(w) + \lambda C(w)$

Can control complexity by varying the regularization parameter $\lambda$. Many other types of regularizers exist and are very useful (more later in this class).

SLIDE 27

Supervised learning summary so far

  • Model/objective: loss function (squared loss, $\ell_p$-loss) + regularization ($L_2$ norm)
  • Method: exact solution, gradient descent
  • Model selection: k-fold cross-validation, Monte Carlo CV
  • Representation/features: linear hypotheses, nonlinear hypotheses through feature transformations
  • Evaluation metric: mean squared error

SLIDE 28

What you need to know

Linear regression as model and optimization problem:
  • How do you solve it? Closed form vs. gradient descent
  • Can represent non-linear functions using basis functions

Model validation:
  • Resampling; cross-validation

Model selection for regression:
  • Comparing different models via cross-validation

Regularization:
  • Adding a penalty function to control the magnitude of the weights
  • Choose the regularization parameter via cross-validation