Introduction to Machine Learning: Model Validation and Selection


SLIDE 1

Model Validation and Selection

  • Dr. Ilija Bogunovic

Learning and Adaptive Systems (las.ethz.ch)

Introduction to Machine Learning

SLIDE 2

Recap: Achieving generalization

Fundamental assumption: our data set is generated independently and identically distributed (i.i.d.) from some unknown distribution P:

$(x_i, y_i) \sim P(X, Y)$

Our goal is to minimize the expected error (true risk) under P:

$R(w) = \mathbb{E}_{x,y}\big[(y - w^T x)^2\big] = \int P(x, y)\,(y - w^T x)^2 \, dx \, dy$

SLIDE 3

Recap: Evaluating predictive performance

Training error (empirical risk) systematically underestimates true risk:

$\mathbb{E}_D\big[\hat{R}_D(\hat{w}_D)\big] < \mathbb{E}_D\big[R(\hat{w}_D)\big]$
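To make this bias concrete, here is a small simulation sketch (my own illustration, not from the slides; all names are hypothetical). It fits least squares on a small training set and compares the training error with the error on a large fresh sample, which approximates the true risk:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 20
w_true = rng.normal(size=d)

def sample(m):
    """Draw m i.i.d. points from the (here: known) distribution P."""
    X = rng.normal(size=(m, d))
    y = X @ w_true + rng.normal(scale=0.5, size=m)
    return X, y

gaps = []
for _ in range(200):
    X, y = sample(n)
    w_hat = np.linalg.lstsq(X, y, rcond=None)[0]   # fit w on D
    train_err = np.mean((y - X @ w_hat) ** 2)      # empirical risk R_hat_D(w_hat_D)
    X_big, y_big = sample(100_000)                 # fresh sample approximates R(w_hat_D)
    gaps.append(np.mean((y_big - X_big @ w_hat) ** 2) - train_err)

print(f"average (true risk - training error) = {np.mean(gaps):.3f}  (positive)")
```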

SLIDE 4

Recap: More realistic evaluation?

Want to avoid underestimating the prediction error. Idea: use a separate test set from the same distribution P.

  • Obtain training data $D_{\text{train}}$ and test data $D_{\text{test}}$
  • Optimize w on the training set: $\hat{w} = \arg\min_{w} \hat{R}_{\text{train}}(w)$
  • Evaluate on the test set: $\hat{R}_{\text{test}}(\hat{w}) = \frac{1}{|D_{\text{test}}|} \sum_{(x,y) \in D_{\text{test}}} (y - \hat{w}^T x)^2$

Then:

$\mathbb{E}_{D_{\text{train}}, D_{\text{test}}}\big[\hat{R}_{D_{\text{test}}}(\hat{w}_{D_{\text{train}}})\big] = \mathbb{E}_{D_{\text{train}}}\big[R(\hat{w}_{D_{\text{train}}})\big]$
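A minimal NumPy sketch of this protocol (data and split sizes are hypothetical): optimize w on $D_{\text{train}}$ only, then report the error on the held-out $D_{\text{test}}$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 3
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=n)

idx = rng.permutation(n)              # random train/test split
train, test = idx[:150], idx[150:]

# Optimize w on the training set only
w_hat = np.linalg.lstsq(X[train], y[train], rcond=None)[0]

# R_hat_test(w_hat): mean squared error on the held-out points
test_err = np.mean((y[test] - X[test] @ w_hat) ** 2)
print(f"test error estimate: {test_err:.3f}")
```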

SLIDE 5

Why? Because $D_{\text{test}}$ is drawn independently of the learned weights $\hat{w}_{D_{\text{train}}}$, each test term $(y - \hat{w}^T x)^2$ has expectation equal to the true risk $R(\hat{w}_{D_{\text{train}}})$, so the test error is an unbiased estimate.

SLIDE 6

Recap: Evaluating predictive performance

Training error (empirical risk) systematically underestimates true risk; using an independent test set avoids this bias:

$\mathbb{E}_D\big[\hat{R}_D(\hat{w}_D)\big] < \mathbb{E}_D\big[R(\hat{w}_D)\big], \qquad \mathbb{E}_{D_{\text{train}}, D_{\text{test}}}\big[\hat{R}_{D_{\text{test}}}(\hat{w}_{D_{\text{train}}})\big] = \mathbb{E}_{D_{\text{train}}}\big[R(\hat{w}_{D_{\text{train}}})\big]$

SLIDE 7

First attempt: Evaluation for model selection

Obtain training data $D_{\text{train}}$ and test data $D_{\text{test}}$. Fit each candidate model (e.g., degree m of a polynomial) and pick the one that does best on the test set:

$\hat{w}_m = \arg\min_{w : \deg(w) \le m} \hat{R}_{\text{train}}(w), \qquad \hat{m} = \arg\min_m \hat{R}_{\text{test}}(\hat{w}_m)$

Do you see a problem?
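As a sketch, this is what the "first attempt" looks like for polynomial degree selection (hypothetical data; np.polyfit/np.polyval perform the degree-m least-squares fit):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, size=120)
y = np.sin(3 * x) + rng.normal(scale=0.2, size=x.shape)
x_tr, y_tr = x[:80], y[:80]      # D_train
x_te, y_te = x[80:], y[80:]      # D_test

test_err = {}
for m in range(1, 11):
    w_m = np.polyfit(x_tr, y_tr, m)              # w_hat_m on D_train
    test_err[m] = np.mean((y_te - np.polyval(w_m, x_te)) ** 2)

m_hat = min(test_err, key=test_err.get)          # argmin_m R_hat_test(w_hat_m)
print("selected degree:", m_hat)
```

Because $\hat{m}$ is chosen by minimizing these same test errors, reporting $\hat{R}_{\text{test}}(\hat{w}_{\hat{m}})$ is optimistically biased; the next slide makes this precise.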

SLIDE 8

Overfitting to test set

  • Test error is itself random!
  • Variance usually increases for more complex models
  • Optimizing for a single test set creates bias

[Figure: error vs. degree of polynomial, comparing test error and true risk]

SLIDE 9

Solution: Pick multiple test sets!

Key idea: instead of using a single test set, use multiple test sets and average to decrease variance! Dilemma: any data I use for testing I can't use for training → using multiple independent test sets is expensive and wasteful.

SLIDE 10

Evaluation for model selection

For each candidate model m (e.g., polynomial degree), repeat the following procedure for i = 1, ..., k:

  • Split the same data set into training and validation sets: $D = D^{(i)}_{\text{train}} \uplus D^{(i)}_{\text{val}}$
  • Train the model: $\hat{w}_i = \arg\min_w \hat{R}^{(i)}_{\text{train}}(w)$
  • Estimate the error: $\hat{R}^{(i)}_m = \hat{R}^{(i)}_{\text{val}}(\hat{w}_i)$

Select the model: $\hat{m} = \arg\min_m \frac{1}{k} \sum_{i=1}^{k} \hat{R}^{(i)}_m$
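A NumPy sketch of this procedure (hypothetical data, with polynomial degree as the model index m):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, size=100)
y = np.sin(3 * x) + rng.normal(scale=0.2, size=x.shape)

def cv_error(x, y, m, k=5):
    """(1/k) * sum_i R_val^(i) for a degree-m polynomial."""
    folds = np.array_split(np.random.default_rng(0).permutation(len(x)), k)
    errs = []
    for i in range(k):
        val = folds[i]                                               # D_val^(i)
        tr = np.concatenate([folds[j] for j in range(k) if j != i])  # D_train^(i)
        w_i = np.polyfit(x[tr], y[tr], m)                            # w_hat_i
        errs.append(np.mean((y[val] - np.polyval(w_i, x[val])) ** 2))
    return np.mean(errs)

m_hat = min(range(1, 11), key=lambda m: cv_error(x, y, m))           # argmin_m
print("selected degree:", m_hat)
```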

SLIDE 11

How should we do the splitting?

Randomly (Monte Carlo cross-validation):

  • Pick a training set of given size uniformly at random
  • Validate on the remaining points
  • Estimate the prediction error by averaging the validation error over multiple random trials

k-fold cross-validation (→ default choice):

  • Partition the data into k "folds"
  • Train on k−1 folds, evaluate on the remaining fold
  • Estimate the prediction error by averaging the validation errors obtained while varying the validation fold (see the sketch below)
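One way to generate both kinds of splits, sketched here with scikit-learn's KFold and ShuffleSplit (a library choice of mine; the slides don't prescribe any tool):

```python
import numpy as np
from sklearn.model_selection import KFold, ShuffleSplit

X = np.arange(20).reshape(-1, 1)   # dummy data, 20 points

# k-fold CV: each point lands in exactly one validation fold
for _, val in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    print("k-fold validation fold:", val)

# Monte Carlo CV: validation sets drawn uniformly at random, independently
for _, val in ShuffleSplit(n_splits=5, test_size=0.25, random_state=0).split(X):
    print("Monte Carlo validation set:", val)
```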


SLIDE 12

k-fold cross-validation

[Figure: the data set partitioned into folds $D_1, D_2, \ldots, D_i, \ldots, D_k$; each fold takes one turn as the validation set]

SLIDE 13

Accuracy of cross-validation

The cross-validation error estimate is very nearly unbiased for large enough k. (Demo)

SLIDE 14

Cross-validation

How large should we pick k?

Too small:
  • Risk of overfitting to the test set
  • Using too little data for training → risk of underfitting to the training set

Too large:
  • In general, better performance!
  • k = n is perfectly fine (called leave-one-out cross-validation, LOOCV)
  • Higher computational complexity

In practice, k = 5 or k = 10 is often used and works well.

SLIDE 15

Best practice for evaluating supervised learning

  • Split the data set into training and test sets
  • Never look at the test set when fitting the model; for example, use k-fold cross-validation on the training set
  • Report the final accuracy on the test set (but never optimize on the test set)!

Caveat: this only works if the data is i.i.d. Be careful, for example, if there are temporal trends or other dependencies.

SLIDE 16

Supervised learning summary so far

  • Model/objective: loss function (squared loss, $\ell_p$-loss)
  • Method: exact solution, gradient descent
  • Model selection: k-fold cross-validation, Monte Carlo CV
  • Representation/features: linear hypotheses, nonlinear hypotheses through feature transformations
  • Evaluation metric: mean squared error

SLIDE 17

Model selection more generally

For polynomial regression, model complexity is naturally controlled by the degree. In general, there may not be an ordering of the features that aligns with complexity:

  • E.g., how should we order words in the bag-of-words model?
  • For a collection of nonlinear feature transformations, model complexity is no longer naturally "ordered":

$x \mapsto \log(x + c), \qquad x \mapsto x^{\alpha}, \qquad x \mapsto \sin(ax + b)$
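For instance, a hand-built feature map stacking such transforms might look like this (the constants c, α, a, b are hypothetical choices; the point is that there is no single natural complexity ordering among the resulting features):

```python
import numpy as np

def feature_map(x, c=1.0, alpha=0.5, a=2.0, b=0.0):
    """Stack nonlinear transforms of a 1-D input into a feature matrix."""
    return np.column_stack([
        np.log(x + c),        # x -> log(x + c)
        x ** alpha,           # x -> x^alpha
        np.sin(a * x + b),    # x -> sin(ax + b)
    ])

x = np.linspace(0.1, 2.0, 5)
print(feature_map(x))         # shape (5, 3): one column per transform
```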

SLIDE 18

Demo: Overfitting → Large Weights

SLIDE 19

Regularization

If we only seek to minimize our loss (optimize data fit), we can get very complex models (large weights). Solution? Regularization! Encourage small weights via penalty functions (regularizers).

SLIDE 20

Ridge regression

Regularized optimization problem:

$\min_w \frac{1}{n} \sum_{i=1}^{n} (y_i - w^T x_i)^2 + \lambda \|w\|_2^2$

Can optimize using gradient descent, or still find an analytical solution:

$\hat{w} = (X^T X + \lambda I)^{-1} X^T y$

Note that now the scale of x matters!
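A sketch of the analytical solution on hypothetical data. The formula matches the slide's; note that λ there absorbs the 1/n factor of the averaged objective. Solving the linear system is preferable to forming the matrix inverse explicitly:

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Ridge estimator: solve (X^T X + lam * I) w = X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 4))
y = X @ np.array([2.0, -1.0, 0.0, 0.5]) + rng.normal(scale=0.1, size=50)
print(ridge_closed_form(X, y, lam=0.1))
```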

SLIDE 21

Renormalizing data: Standardization

Ensure that each feature has zero mean and unit variance:

$\tilde{x}_{i,j} = \frac{x_{i,j} - \hat{\mu}_j}{\hat{\sigma}_j}, \qquad \hat{\mu}_j = \frac{1}{n} \sum_{i=1}^{n} x_{i,j}, \qquad \hat{\sigma}_j^2 = \frac{1}{n} \sum_{i=1}^{n} (x_{i,j} - \hat{\mu}_j)^2$

Here $x_{i,j}$ is the value of the j-th feature of the i-th data point.
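A sketch with hypothetical data. One practical point worth a comment: at prediction time, test data must be standardized with the training set's $\hat{\mu}$ and $\hat{\sigma}$, not its own:

```python
import numpy as np

def standardize(X):
    """Per-feature standardization: zero mean, unit variance (columns = features)."""
    mu = X.mean(axis=0)        # mu_hat_j
    sigma = X.std(axis=0)      # sigma_hat_j (1/n convention, as on the slide)
    return (X - mu) / sigma, mu, sigma

rng = np.random.default_rng(5)
X_train = rng.normal(loc=5.0, scale=3.0, size=(100, 3))
X_train_std, mu, sigma = standardize(X_train)

X_test = rng.normal(loc=5.0, scale=3.0, size=(20, 3))
X_test_std = (X_test - mu) / sigma   # reuse the training statistics on test data
```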

SLIDE 22

Gradient descent for ridge regression

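The slide's body is worked on the board; as a sketch, gradient descent on the ridge objective from slide 20 could look like this (step size η and iteration count are hypothetical choices):

```python
import numpy as np

def ridge_gd(X, y, lam, eta=0.01, steps=2000):
    """Gradient descent on (1/n) * sum_i (y_i - w^T x_i)^2 + lam * ||w||_2^2."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        grad = -(2.0 / n) * X.T @ (y - X @ w) + 2.0 * lam * w
        w = w - eta * grad
    return w
```

Rewriting the update as w ← (1 − 2ηλ)w + η(2/n)Xᵀ(y − Xw) shows the effect of the regularizer: each step first shrinks the weights toward zero before the data-fit gradient is applied.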

SLIDE 23

Demo: Regularization


SLIDE 24

How to choose regularization parameter?

Cross-validation! Typically pick candidate values of λ logarithmically spaced, and select the one minimizing the cross-validated error of

$\min_w \frac{1}{n} \sum_{i=1}^{n} (y_i - w^T x_i)^2 + \lambda \|w\|_2^2$
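A sketch combining the two on hypothetical data (np.logspace gives the logarithmically spaced grid):

```python
import numpy as np

def cv_ridge_error(X, y, lam, k=5):
    """k-fold cross-validated MSE for ridge regression with parameter lam."""
    folds = np.array_split(np.random.default_rng(0).permutation(len(y)), k)
    errs = []
    for i in range(k):
        val = folds[i]
        tr = np.concatenate([folds[j] for j in range(k) if j != i])
        w = np.linalg.solve(X[tr].T @ X[tr] + lam * np.eye(X.shape[1]),
                            X[tr].T @ y[tr])
        errs.append(np.mean((y[val] - X[val] @ w) ** 2))
    return np.mean(errs)

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 6))
y = X @ rng.normal(size=6) + rng.normal(scale=0.5, size=100)

lambdas = np.logspace(-4, 2, 13)                      # 10^-4, ..., 10^2
lam_hat = min(lambdas, key=lambda l: cv_ridge_error(X, y, l))
print("selected lambda:", lam_hat)
```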

SLIDE 25

Regularization path


SLIDE 26

Outlook: Fundamental tradeoff in ML

Need to trade off loss (goodness of fit) and simplicity. A lot of supervised learning problems can be written in this way:

$\min_w \hat{R}(w) + \lambda C(w)$

Can control complexity by varying the regularization parameter $\lambda$. Many other types of regularizers exist and are very useful (more later in this class).

SLIDE 27

Supervised learning summary so far

  • Model/objective: loss function (squared loss, $\ell_p$-loss) + regularization ($L_2$ norm)
  • Method: exact solution, gradient descent
  • Model selection: k-fold cross-validation, Monte Carlo CV
  • Representation/features: linear hypotheses, nonlinear hypotheses through feature transformations
  • Evaluation metric: mean squared error

SLIDE 28

What you need to know

Linear regression as model and optimization problem:
  • How do you solve it? Closed form vs. gradient descent
  • Can represent non-linear functions using basis functions

Model validation:
  • Resampling; cross-validation

Model selection for regression:
  • Comparing different models via cross-validation

Regularization:
  • Adding a penalty function to control the magnitude of the weights
  • Choose the regularization parameter via cross-validation