Introduction to Machine Learning Tuning: Nested Resampling – PowerPoint PPT Presentation



SLIDE 1

Introduction to Machine Learning Tuning: Nested Resampling Motivation

compstat-lmu.github.io/lecture_i2ml

SLIDE 2

MOTIVATION

Selecting the best model from a set of potential candidates (e.g., different classes of learners, different hyperparameter settings, different feature sets, different preprocessing, …) is an important part of most machine learning problems.

Problem:


  • Introduction to Machine Learning – 1 / 8
SLIDE 3

MOTIVATION

We cannot evaluate our finally selected learner on the same resampling splits that we used to perform model selection, e.g., to tune its hyperparameters. By repeatedly evaluating the learner on the same test set, or on the same CV splits, information about the test set “leaks” into our evaluation. There is a danger of overfitting to the resampling splits (overtuning): the final performance estimate will be optimistically biased. One could also view this as a problem similar to multiple testing.

SLIDE 4

INSTRUCTIVE AND PROBLEMATIC EXAMPLE

Assume a binary classification problem with equal class sizes and a learner with hyperparameter λ. Here, the learner is a (nonsense) feature-independent classifier, where λ has no effect: it simply predicts random labels with equal probability. Its true generalization error is therefore 50%. A cross-validation of the learner (with any fixed λ) will easily show this, provided the partitioned data set for CV is not too small. Now let's “tune” it by trying out 100 different λ values. We repeat this experiment 50 times and average the results.
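This experiment can be sketched in a few lines of numpy (a simplified sketch, not the lecture's exact code; since λ has no effect, each tuning iteration just redraws random predictions, and the error on the held-out data stands in for the CV score):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200  # size of the cross-validated data set

def cv_error(rng, n):
    """Misclassification rate of a feature-independent random classifier."""
    y = rng.integers(0, 2, n)     # true labels, balanced classes
    pred = rng.integers(0, 2, n)  # random predictions (same for any lambda)
    return np.mean(pred != y)

# "Tuning": 100 lambda values, each evaluated once.
errors = np.array([cv_error(rng, n) for _ in range(100)])
print(f"mean CV error over 100 lambdas: {errors.mean():.3f}")  # near 0.50
print(f"best (minimum) 'tuning error':  {errors.min():.3f}")   # below 0.50
```

Each individual score is close to the true error of 50%, but the minimum over all 100 candidates is systematically lower.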

SLIDE 5

INSTRUCTIVE AND PROBLEMATIC EXAMPLE

[Figure: best “tuning error” (y-axis, 0.40–0.48) vs. tuning iteration (x-axis, 25–100), for data sizes 100, 200, and 500.]

Plotted is the best “tuning error” (i.e., the performance of the model with fixed λ as evaluated by cross-validation) after k tuning iterations. We performed the experiment for different sizes of the cross-validated learning data.

SLIDE 6

INSTRUCTIVE AND PROBLEMATIC EXAMPLE

[Figure: test error densities. Left panel: n=200, #runs=1, best err=0.48. Right panel: n=200, #runs=10, best err=0.45.]

For a single experiment, the CV score will be nearly 0.5, as expected. When we calculate error rates, we are essentially sampling from a (rescaled) binomial distribution, and the scores of multiple experiments are accordingly arranged around the expected mean of 0.5.

SLIDE 7

INSTRUCTIVE AND PROBLEMATIC EXAMPLE

[Figure: test error densities with individual scores marked. Left panel: n=200, #runs=100, best err=0.42. Right panel: n=200, #runs=1000, best err=0.36.]

But in tuning we take the minimum of those scores! So we no longer estimate the “average performance”; we get an estimate of the “best case” performance instead. The more we sample, the more biased this value becomes.
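The growing bias of the minimum can be sketched directly by sampling from the (rescaled) binomial distribution described above (a hypothetical simulation, not the slides' exact experiment):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200  # error rate of a random classifier on n points ~ Binomial(n, 0.5) / n

best = {}
for runs in (1, 10, 100, 1000):
    errs = rng.binomial(n, 0.5, size=runs) / n
    best[runs] = errs.min()  # what tuning reports: the best score seen
    print(f"best error over {runs:4d} runs: {best[runs]:.3f}")
```

A single run sits near 0.5, but the minimum over many runs drifts further and further below the true error, mirroring the best-err values in the figures above.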

SLIDE 8

UNTOUCHED TEST SET PRINCIPLE

Countermeasure: simulate what actually happens when the model is applied. All parts of model building (including model selection and preprocessing) should be embedded in the model-finding process on the training data. The test set should only be touched once, so we have no way of “cheating”. The test data set is used only once, after the model is completely trained and all decisions, for example on specific hyperparameters, have been made. Only then are the performance estimates obtained from the test set unbiased estimates of the true performance.
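A minimal sketch of this principle with scikit-learn (the dataset, learner, and parameter grid are illustrative assumptions, not from the lecture):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# All model selection happens on the training data only:
# GridSearchCV cross-validates each max_depth within X_train.
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid={"max_depth": [2, 4, 8]}, cv=5)
search.fit(X_train, y_train)

# The test set is touched exactly once, after tuning is finished.
acc = search.score(X_test, y_test)
print(f"test accuracy: {acc:.3f}")
```

Because `X_test` played no part in choosing `max_depth`, its accuracy is an honest estimate of the selected model's performance.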

SLIDE 9

UNTOUCHED TEST SET PRINCIPLE

For steps that themselves require resampling (e.g., hyperparameter tuning), this results in nested resampling, i.e., resampling strategies at both levels: an inner resampling loop to find what works best, based on the training data, and an outer evaluation on data not used for tuning, to get honest estimates of the expected performance on new data.
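Nested resampling can be sketched with scikit-learn by placing a tuner inside an outer cross-validation (estimator, grid, and dataset are again illustrative choices):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Inner loop: 3-fold CV inside GridSearchCV selects the hyperparameter C
# using only the outer training folds.
inner = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=3)

# Outer loop: 5-fold CV evaluates the entire tuning procedure on data
# the inner loop never saw, avoiding the optimistic bias of overtuning.
scores = cross_val_score(inner, X, y, cv=5)
print(f"nested CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Note that the outer score evaluates the tuning *procedure*, not one fixed λ: each outer fold may select a different best hyperparameter.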
