BBM406 Fundamentals of Machine Learning Lecture 5: ML Methodology - PowerPoint PPT Presentation

Illustration: detail from The Alchemist Discovering Phosphorus by Joseph Wright (1771) BBM406 Fundamentals of   Machine Learning Lecture 5: ML Methodology Aykut Erdem // Hacettepe University // Fall 2019

About class projects • This semester the theme is machine learning for good. • To be done in groups of 3 people. • Deliverables: Proposal, blog posts, progress report, project presentations (classroom + video presentations), final report and code • For more details please check the project webpage:   http://web.cs.hacettepe.edu.tr/~aykut/classes/fall2019/bbm406/project.html. 2

Recall from last time… Linear Regression y ( x ) = w 0 + w 1 x w = ( w 0 , w 1 ) Gradient Descent Update Rule: ⇣ ⌘ N t ( n ) − y ( x ( n ) ) x ( n ) w ← w + 2 λ i 2 h X t ( n ) − ( w 0 + w 1 x ( n ) ) ` ( w ) = n =1 Closed Form Solution: � − 1 X T t X T X � w = 3

Recall from last time… Some key concepts • Data fits – is linear model best ( model selection )? − Simplest models do not capture all the important variations (signal) in the data: underfit − More complex model may overfit the training data   (fit not only the signal but also the noise in the data), especially if not enough data to constrain model • One method of assessing fit: − test generalization = model’s ability to predict   the held out data • Regularization ln λ = −∞ ln λ = − 18 ln λ = 0 1 Training inde- � N w ⋆ 0.35 0.35 0.13 E ( w ) = 1 0 { y ( x n , w ) − t n } 2 + λ � Test � w ⋆ 232.37 4.74 -0.05 2 ∥ w ∥ 2 1 � w ⋆ -5321.83 -0.77 -0.06 2 2 slide by Richard Zemel E RMS w ⋆ 48568.31 -31.97 -0.05 n =1 3 0.5 w ⋆ -231639.30 -3.89 -0.03 4 w ⋆ 640042.26 55.28 -0.02 where ∥ w ∥ 2 ≡ w T w = w 2 5 0 + w 2 1 + . . . + w 2 M , w ⋆ -1061800.52 41.32 -0.01 6 w ⋆ 1042400.18 -45.95 -0.00 7 importance of the regularization term compared 0 w ⋆ -557682.99 -91.53 0.00 8 0 3 6 9 M w ⋆ 125201.43 72.68 0.01 9 4

   Today • Machine Learning Methodology - validation - cross-validation (k-fold, leave-one-out) - model selection   5

Machine Learning   Methodology 6

Recap: Regression • In regression, labels y i are y continuous • Classification/regression are solved very similarly 6 • Everything we have done so 3 far transfers to classification with very minor changes x • Error: sum of distances from 1 4 8 examples to the fitted slide by Olga Veksler model 7

Training/Test Data Split • Talked about splitting data in training/test sets - training data is used to fit parameters - test data is used to assess how classifier generalizes to new data • What if classifier has “non ‐ tunable” parameters? - a parameter is “non ‐ tunable” if tuning (or training) it on the training data leads to overfitting - Examples: ‣ k in kNN classifier ‣ number of hidden units in MNN slide by Olga Veksler ‣ number of hidden layers in MNN ‣ etc … 8

Example of Overfitting • Want to fit a polynomial machine f ( x , w ) • Instead of fixing polynomial degree,   y make it parameter d - learning machine f ( x , w,d ) • Consider just three choices for d - degree 1 - degree 2 - degree 3   x • Training error is a bad measure to choose d − degree 3 is the best according to the training error, but overfits the data slide by Olga Veksler 9

Training/Test Data Split • What about test error? Seems appropriate − degree 2 is the best model according to the test error • Except what do we report as the test error now? • Test error should be computed on data that was not used for slide by Olga Veksler training at all! • Here used “test” data for training, i.e. choosing model 10

Validation data • Same question when choosing among several classifiers - our polynomial degree example can be looked at as choosing among 3 classifiers (degree 1, 2, or 3) slide by Olga Veksler 11

Validation data • Same question when choosing among several classifiers - our polynomial degree example can be looked at as choosing among 3 classifiers (degree 1, 2, or 3) • Solution: split the labeled data into three parts labeled data Training Validation Test ≈ 60% ≈ 20% ≈ 20% train other   use only to   train tunable   parameters,   slide by Olga Veksler assess final   parameters w or to select   performance classifier 12

Training/Validation labeled data Training Validation Test ≈ 60% ≈ 20% ≈ 20% Training error:   Validation Test error:   computed on training   error:   computed example computed on   on   validation   test examples examples slide by Olga Veksler 13

Training/Validation/Test Data validation error: 3.3 validation error: 1.8 validation error: 3.4 • Training Data • Validation Data - d = 2 is chosen slide by Olga Veksler • Test Data - 1.3 test error computed for d = 2 14

Choosing Parameters: Example error Validation error Training error number of base functions 50 • Need to choose number of hidden units for a MNN slide by Olga Veksler - The more hidden units, the better can fit training data - But at some point we overfit the data 15

Diagnosing Underfitting/Overfitting Underfitting Just Right Overfitting • large training error • small training error • small training error • large validation error • small validation error • large validation error slide by Olga Veksler 16

Fixing Underfitting/Overfitting • Fixing Underfitting - getting more training examples will not help - get more features - try more complex classifier ‣ if using MLP , try more hidden units   • Fixing Overfitting - getting more training examples might help - try smaller set of features - Try less complex classifier slide by Olga Veksler ‣ If using MLP , try less hidden units 17

Train/Test/Validation Method • Good news - Very simple   • Bad news: - Wastes data - in general, the more data we have, the better are the estimated parameters - we estimate parameters on 40% less data, since 20% removed for test and 20% for validation data - If we have a small dataset our test (validation) set might just be lucky or unlucky slide by Olga Veksler • Cross Validation is a method for performance evaluation that wastes less data 18

Small Dataset Linear Model: Quadratic Model: Join the dots Model: y y y x x x Mean Squared Error = 2.4 Mean Squared Error = 0.9 Mean Squared Error = 2.2 slide by Olga Veksler 19

LOOCV (Leave-one-out Cross Validation) For k=1 to n 1. Let ( x k , y k ) be the k th example 2. Temporarily remove ( x k , y k ) from the dataset 3. Train on the remaining n-1 examples y 4. Note your error on ( x k , y k ) When you’ve done all points, x report the mean error slide by Olga Veksler 20

LOOCV (Leave-one-out Cross Validation) MSE LOOCV   = 2.12 y y y x x x y y y x x x slide by Olga Veksler y y y x x x 25

LOOCV for Quadratic Regression MSE LOOCV   = 0.96 y y y x x x y y y x x x slide by Olga Veksler y y y x x x 26

LOOCV for Joint The Dots MSE LOOCV   = 3.33 y y y x x x y y y x x x slide by Olga Veksler y y y x x x 27

Which kind of Cross Validation? � � � � Downside Upside Test � set may � give � unreliable �� cheap estimate � of � future � performance Leave � one � expensive � doesn’t � waste � out data • Can we get the best of both worlds? slide by Olga Veksler � � � � � � � 28

K-Fold Cross Validation • Randomly break the dataset into k partitions • In this example, we have k=3 partitions colored red green and blue y x slide by Olga Veksler 29

K-Fold Cross Validation • Randomly break the dataset into k partitions • In this example, we have k=3 partitions colored red green and blue • For the blue partition: train on all points not y in the blue partition. Find test ‐ set sum of errors on blue points x slide by Olga Veksler 30

BBM406 Fundamentals of Machine Learning Lecture 5: ML Methodology - PowerPoint PPT Presentation

Illustration: detail from The Alchemist Discovering Phosphorus by Joseph Wright (1771) BBM406 Fundamentals of Machine Learning Lecture 5: ML Methodology Aykut Erdem // Hacettepe University // Fall 2019 About class projects This semester

BBM406 Fundamentals of Machine Learning Lecture 1: Course outline and logistics An overview

BBM406 Fundamentals of Machine Learning Lecture 23: Dimensionality Reduction Aykut Erdem //

BBM406 Fundamentals of Machine Learning Lecture 6: Learning theory Probability Review Aykut

BBM406 Fundamentals of Machine Learning Lecture 18: Decision Trees Aykut Erdem // Hacettepe

BBM406 Fundamentals of Machine Learning Lecture 9: Logistic Regression Discriminative vs.

BBM406 Fundamentals of Machine Learning Lecture 11: Multi-layer Perceptron Forward Pass

BBM406 Fundamentals of Machine Learning Lecture 13: Introduction to Deep Learning Aykut

BBM406 Fundamentals of Machine Learning Lecture 7: Probability Review (contd.) Maximum

BBM406 Fundamentals of Machine Learning Lecture 10: Linear Discriminant Functions Perceptron

BBM406 Fundamentals of Machine Learning Lecture 19: What is Ensemble Learning? Bagging

BBM406 Fundamentals of Machine Learning Lecture 8: Maximum a Posteriori (MAP) Nave Bayes

BBM406 Fundamentals of Machine Learning Lecture 2: Machine Learning by Examples, Nearest

BBM406 Fundamentals of Machine Learning Lecture 20: AdaBoost Aykut Erdem // Hacettepe

BBM406 Fundamentals of Machine Learning Lecture 15: Support Vector Machines Aykut Erdem //

BBM406 Fundamentals of Machine Learning Lecture 17: Kernel Trick for SVMs Risk and Loss

BBM406 Fundamentals of Machine Learning Lecture 12: Computational Graph Backpropagation

Linear Regression - Estimating Parameters Bernd Schr oder logo1 Bernd Schr oder

Error Handling Marco Chiarandini Department of Mathematics & Computer Science University of

CS7015 (Deep Learning) : Lecture 3 Sigmoid Neurons, Gradient Descent, Feedforward Neural Networks,

Training Neural Networks with Local Error Signals Arild Nkland Lars H. Eidnes Local learning

Clustering / Unsupervised Learning The target features are not given in the training examples The

Certifying Non-negativity with Lasserres Hierarchy and Semidefinite Programming Victor Magron ,

Lecture 10. Simple linear regression 2020 (1) Using one r.v. to predict another X and Y are

Earthmover resilience & testing in ordered structures Eldar Fischer Omri Ben-Eliezer

BBM406 Fundamentals of Machine Learning Lecture 5: ML Methodology - PowerPoint PPT Presentation

Illustration: detail from The Alchemist Discovering Phosphorus by Joseph Wright (1771) BBM406 Fundamentals of Machine Learning Lecture 5: ML Methodology Aykut Erdem // Hacettepe University // Fall 2019 About class projects This semester

BBM406 Fundamentals of Machine Learning Lecture 1: Course outline and logistics An overview

BBM406 Fundamentals of Machine Learning Lecture 23: Dimensionality Reduction Aykut Erdem //

BBM406 Fundamentals of Machine Learning Lecture 6: Learning theory Probability Review Aykut

BBM406 Fundamentals of Machine Learning Lecture 18: Decision Trees Aykut Erdem // Hacettepe

BBM406 Fundamentals of Machine Learning Lecture 9: Logistic Regression Discriminative vs.

BBM406 Fundamentals of Machine Learning Lecture 11: Multi-layer Perceptron Forward Pass

BBM406 Fundamentals of Machine Learning Lecture 13: Introduction to Deep Learning Aykut

BBM406 Fundamentals of Machine Learning Lecture 7: Probability Review (contd.) Maximum

BBM406 Fundamentals of Machine Learning Lecture 10: Linear Discriminant Functions Perceptron

BBM406 Fundamentals of Machine Learning Lecture 19: What is Ensemble Learning? Bagging

BBM406 Fundamentals of Machine Learning Lecture 8: Maximum a Posteriori (MAP) Nave Bayes

BBM406 Fundamentals of Machine Learning Lecture 2: Machine Learning by Examples, Nearest

BBM406 Fundamentals of Machine Learning Lecture 20: AdaBoost Aykut Erdem // Hacettepe

BBM406 Fundamentals of Machine Learning Lecture 15: Support Vector Machines Aykut Erdem //

BBM406 Fundamentals of Machine Learning Lecture 17: Kernel Trick for SVMs Risk and Loss

BBM406 Fundamentals of Machine Learning Lecture 12: Computational Graph Backpropagation

Linear Regression - Estimating Parameters Bernd Schr oder logo1 Bernd Schr oder

Error Handling Marco Chiarandini Department of Mathematics &amp; Computer Science University of

CS7015 (Deep Learning) : Lecture 3 Sigmoid Neurons, Gradient Descent, Feedforward Neural Networks,

Training Neural Networks with Local Error Signals Arild Nkland Lars H. Eidnes Local learning

Clustering / Unsupervised Learning The target features are not given in the training examples The

Certifying Non-negativity with Lasserres Hierarchy and Semidefinite Programming Victor Magron ,

Lecture 10. Simple linear regression 2020 (1) Using one r.v. to predict another X and Y are

Earthmover resilience &amp; testing in ordered structures Eldar Fischer Omri Ben-Eliezer

Error Handling Marco Chiarandini Department of Mathematics & Computer Science University of

Earthmover resilience & testing in ordered structures Eldar Fischer Omri Ben-Eliezer