Cross Validation and Penalized Linear Regression (PowerPoint PPT Presentation)

SLIDE 1

Cross Validation and Penalized Linear Regression


Tufts COMP 135: Introduction to Machine Learning https://www.cs.tufts.edu/comp/135/2019s/

Many slides attributable to: Prof. Mike Hughes, Erik Sudderth (UCI), Finale Doshi-Velez (Harvard), James, Witten, Hastie, Tibshirani (ISL/ESL books)

SLIDE 2

CV & Penalized LR Objectives

  • Regression with transformations of features
  • Cross Validation
  • L2 penalties
  • L1 penalties

SLIDE 3

What will we learn?


Figure: the three machine learning paradigms (supervised, unsupervised, reinforcement learning). Supervised learning works from data-label pairs \{(x_n, y_n)\}_{n=1}^{N}, with a task, a performance measure, and training / prediction / evaluation stages.

SLIDE 4


Task: Regression

Figure: within supervised learning, regression maps an input x to an output y, where y is a numeric variable (e.g., sales in dollars).

SLIDE 5


Review: Linear Regression

Optimization problem ("Least Squares"):

\min_{w,b} \sum_{n=1}^{N} \left( y_n - \hat{y}(x_n, w, b) \right)^2

Exact formula for the optimal values of w, b exists! The math works in 1D and for many dimensions:

[w_1 \; \ldots \; w_F \; b]^T = (\tilde{X}^T \tilde{X})^{-1} \tilde{X}^T y

\tilde{X} = \begin{bmatrix} x_{11} & \ldots & x_{1F} & 1 \\ x_{21} & \ldots & x_{2F} & 1 \\ \vdots & & \vdots & \vdots \\ x_{N1} & \ldots & x_{NF} & 1 \end{bmatrix}
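Below is a minimal NumPy sketch of this exact formula (my own illustration, not from the deck); the toy data and the names X_tilde and w_b are arbitrary.

```python
import numpy as np

# Toy data: N=5 examples, F=2 features
X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.5],
              [4.0, 3.0],
              [5.0, 2.5]])
y = np.array([3.0, 2.0, 4.5, 6.0, 7.0])

# Append a column of ones so the bias b is the last entry of the solution
X_tilde = np.hstack([X, np.ones((X.shape[0], 1))])

# Least-squares solution [w_1 ... w_F b]^T = (X~^T X~)^{-1} X~^T y
w_b = np.linalg.solve(X_tilde.T @ X_tilde, X_tilde.T @ y)
w, b = w_b[:-1], w_b[-1]
print("weights:", w, "bias:", b)
```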

SLIDE 6

Recap: solving linear regression

  • More examples than features (N > F)
  • Same number of examples and features (N=F)
  • Fewer examples than features (N < F) or low rank


Case N > F, and the inverse of X^T X exists (needs to be full rank): an optimal weight vector exists and the formula applies; it will likely have non-zero training error (the system is overdetermined).
Case N = F, and the inverse of X^T X exists (needs to be full rank): an optimal weight vector exists, the formula applies, and it will have zero error on the training set.
Case N < F, or X is low rank: infinitely many optimal weight vectors exist with zero training error; the inverse of X^T X does not exist, so naively the formula will fail.
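A small sketch of the N < F case (my own illustration, not from the deck): X^T X is singular, so the naive formula fails, but numpy.linalg.lstsq still returns one of the infinitely many zero-error solutions (the minimum-norm one).

```python
import numpy as np

rng = np.random.default_rng(0)
N, F = 3, 5                        # fewer examples than features
X = rng.normal(size=(N, F))
y = rng.normal(size=N)

XtX = X.T @ X
print(np.linalg.matrix_rank(XtX))  # rank <= N < F, so X^T X is not invertible

# lstsq returns the minimum-norm solution among the infinitely many
# weight vectors that achieve zero training error
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(X @ w, y))       # True: zero error on the training set
```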

SLIDE 7

Recap

  • Squared error is special
  • Exact formulas for estimating parameters
  • Most metrics do not have exact formulas
  • Take derivative, set to zero, try to solve, …. HARD!
  • Example: absolute error
  • General algorithm: Gradient Descent!
  • As long as the first derivative exists, we can iterate to estimate optimal parameters
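To make the recap concrete, here is a minimal gradient-descent sketch for squared-error linear regression (my own illustration; the step size and iteration count are arbitrary choices).

```python
import numpy as np

def grad_squared_error(w, X, y):
    """Gradient of sum_n (y_n - x_n^T w)^2 with respect to w."""
    return -2.0 * X.T @ (y - X @ w)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

w = np.zeros(3)
step_size = 0.001
for _ in range(500):
    w = w - step_size * grad_squared_error(w, X, y)
print(w)   # approaches the exact least-squares solution
```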

SLIDE 8


Transformations of Features

SLIDE 9

Fitting a line isn’t always ideal

SLIDE 10

Can fit linear functions to nonlinear features


A nonlinear function of x:

\hat{y}(x_i) = \theta_0 + \theta_1 x_i + \theta_2 x_i^2 + \theta_3 x_i^3

can be written as a linear function of the transformed features

\phi(x_i) = [1 \;\; x_i \;\; x_i^2 \;\; x_i^3]

\hat{y}(x_i) = \sum_{g=1}^{4} \theta_g \, \phi_g(x_i) = \theta^T \phi(x_i)

"Linear regression" means linear in the parameters (weights, biases). Features can be arbitrary transforms of the raw data.

SLIDE 11

What feature transform to use?

  • Anything that works for your data!
  • sin / cos for periodic data
  • polynomials for high-order dependencies
  • interactions between feature dimensions
  • Many other choices possible


\phi(x_i) = [1 \;\; x_i \;\; x_i^2 \;\; \ldots]

\phi(x_i) = [1 \;\; x_{i1} x_{i2} \;\; x_{i3} x_{i4} \;\; \ldots]

SLIDE 12


Linear Regression with Transformed Features

Prediction with transformed features:

\hat{y}(x_i) = \theta^T \phi(x_i), \qquad \phi(x_i) = [1 \;\; \phi_1(x_i) \;\; \phi_2(x_i) \;\; \ldots \;\; \phi_{G-1}(x_i)]

Optimization problem ("Least Squares"):

\min_{\theta} \sum_{n=1}^{N} \left( y_n - \theta^T \phi(x_n) \right)^2

Exact solution:

\theta^* = (\Phi^T \Phi)^{-1} \Phi^T y

\Phi = \begin{bmatrix} 1 & \phi_1(x_1) & \ldots & \phi_{G-1}(x_1) \\ 1 & \phi_1(x_2) & \ldots & \phi_{G-1}(x_2) \\ \vdots & & & \vdots \\ 1 & \phi_1(x_N) & \ldots & \phi_{G-1}(x_N) \end{bmatrix} \qquad (N \times G \text{ matrix})
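A minimal NumPy sketch of least squares with polynomial feature transforms (my own illustration, not from the deck); np.vander builds the N x G matrix Phi with columns 1, x, x^2, x^3.

```python
import numpy as np

# Toy 1D data with a nonlinear trend
rng = np.random.default_rng(0)
x = np.linspace(-2, 2, 30)
y = 0.5 * x**3 - x + 0.3 * rng.normal(size=x.size)

# Build Phi: one row per example, columns [1, x, x^2, x^3]  (N x G with G=4)
Phi = np.vander(x, 4, increasing=True)

# Exact least-squares solution theta* = (Phi^T Phi)^{-1} Phi^T y
theta = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
y_hat = Phi @ theta
print("learned coefficients:", theta)
```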

SLIDE 13

Cross Validation

SLIDE 14


Generalize: sample to population

SLIDE 15

Labeled dataset


Figure: a labeled dataset with an x column and a y column. Each row represents one example; assume rows are arranged "uniformly at random" (order doesn't matter).

SLIDE 16

Split into train and test


Figure: the labeled dataset (x and y columns) split into a train portion and a test portion.

SLIDE 17

Model Complexity vs Error


Figure: error vs. model complexity, showing the underfitting regime (low complexity) and the overfitting regime (high complexity).

SLIDE 18

How to fit best model?


Option: Fit on train, select on validation
1) Fit each model to training data
2) Evaluate each model on validation data
3) Select model with lowest validation error
4) Report error on test set

Figure: the labeled dataset (x and y columns) split into train, validation, and test portions.
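A minimal sketch of the four steps above on toy data (my own illustration, not from the deck): the candidate models are polynomials of different degrees, and the fixed split sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=60)
y = 0.5 * x**3 - x + 0.3 * rng.normal(size=x.size)

# Fixed split: 40 train, 10 validation, 10 test
x_tr, y_tr = x[:40], y[:40]
x_va, y_va = x[40:50], y[40:50]
x_te, y_te = x[50:], y[50:]

def fit_poly(x, y, degree):
    """Least-squares fit of a polynomial of the given degree."""
    Phi = np.vander(x, degree + 1, increasing=True)
    return np.linalg.lstsq(Phi, y, rcond=None)[0]

def mse(theta, x, y):
    Phi = np.vander(x, theta.size, increasing=True)
    return np.mean((y - Phi @ theta) ** 2)

# 1) fit each candidate model on train, 2) evaluate each on validation
val_err = {d: mse(fit_poly(x_tr, y_tr, d), x_va, y_va) for d in range(10)}
# 3) select the model (degree) with lowest validation error
best_d = min(val_err, key=val_err.get)
# 4) report error on the test set
print(best_d, mse(fit_poly(x_tr, y_tr, best_d), x_te, y_te))
```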

SLIDE 19

How to fit best model?


Option: Fit on train, select on validation
1) Fit each model to training data
2) Evaluate each model on validation data
3) Select model with lowest validation error
4) Report error on test set

Figure: the labeled dataset (x and y columns) split into train, validation, and test portions.

Concerns

  • Will train be too small?
  • Make better use of data?
SLIDE 20

Estimating Heldout Error with Fixed Validation Set

Figure: validation-set error estimates from a single random split and from 10 other random splits. (Credit: ISL Textbook, Chapter 5)

SLIDE 21

3-fold Cross Validation


Divide the labeled dataset into 3 even-sized parts (folds). Fit the model 3 independent times; each time, leave one fold out as validation and keep the remaining folds as training.

Figure: the dataset (x and y columns) divided into fold 1, fold 2, and fold 3; in each fit, a different fold serves as validation and the rest as training.

Heldout error estimate: average of the validation error across all 3 fits
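A minimal NumPy sketch of 3-fold cross validation (my own illustration, not from the deck), using the same kind of polynomial-fitting helpers as above.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=60)
y = 0.5 * x**3 - x + 0.3 * rng.normal(size=x.size)

def fit_poly(x, y, degree):
    Phi = np.vander(x, degree + 1, increasing=True)
    return np.linalg.lstsq(Phi, y, rcond=None)[0]

def mse(theta, x, y):
    Phi = np.vander(x, theta.size, increasing=True)
    return np.mean((y - Phi @ theta) ** 2)

# Divide the shuffled indices into 3 even-sized folds
folds = np.array_split(rng.permutation(x.size), 3)

val_errors = []
for k in range(3):
    val_idx = folds[k]
    train_idx = np.concatenate([folds[j] for j in range(3) if j != k])
    theta = fit_poly(x[train_idx], y[train_idx], degree=3)
    val_errors.append(mse(theta, x[val_idx], y[val_idx]))

# Heldout error estimate: average validation error across the 3 fits
print(np.mean(val_errors))
```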

SLIDE 22

K-fold CV: How many folds K?

  • Can do as few as 2 folds
  • Can do as many as N folds ("leave one out", where each fit trains on N-1 examples)
  • Usual rule of thumb: 5-fold or 10-fold CV (see the sketch after this list)
  • Computation runtime scales linearly with K
  • Larger K also means each fit uses more training data, so each fit might take longer too
  • Each fit is independent and parallelizable
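A minimal sketch of 5-fold CV following the rule of thumb above, using scikit-learn's KFold and LinearRegression (my own illustration; the deck does not prescribe a particular library).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
val_errors = []
for train_idx, val_idx in kf.split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    pred = model.predict(X[val_idx])
    val_errors.append(np.mean((y[val_idx] - pred) ** 2))

# Heldout error estimate: average validation MSE over the 5 folds
print(np.mean(val_errors))
```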

SLIDE 23


Estimating Heldout Error with Cross Validation

Figure: cross-validation error estimates from 9 separate splits, each with 10 folds, and from leave-one-out cross validation. (Credit: ISL Textbook, Chapter 5)

SLIDE 24

What to do about underfitting?

  • Increase model complexity
  • Add more features!

SLIDE 25

What to do about overfitting?

  • Select complexity with cross validation
  • Control single-fit complexity with a penalty!

SLIDE 26

Zero degree polynomial

Credit: Slides from course by Prof. Erik Sudderth (UCI)

SLIDE 27

1st degree polynomial

Credit: Slides from course by Prof. Erik Sudderth (UCI)

SLIDE 28

3rd degree polynomial

Credit: Slides from course by Prof. Erik Sudderth (UCI)

SLIDE 29

9th degree polynomial

Credit: Slides from course by Prof. Erik Sudderth (UCI)

SLIDE 30

Error vs Complexity


Figure: heldout error (square root of mean squared error) plotted against polynomial degree.

SLIDE 31


Figure: fitted curves for polynomial degrees 1, 3, and 9. (Credit: Slides from course by Prof. Erik Sudderth, UCI)

SLIDE 32

Idea: Penalize magnitude of weights


J(\theta) = \frac{1}{2} \sum_{n=1}^{N} (y_n - \theta^T \tilde{x}_n)^2 + \alpha \sum_{f} \theta_f^2, \qquad \alpha \ge 0

Penalty strength: larger alpha means we prefer smaller magnitude weights.

SLIDE 33

Idea: Penalize magnitude of weights


J(\theta) = \frac{1}{2} \sum_{n=1}^{N} (y_n - \theta^T \tilde{x}_n)^2 + \alpha \sum_{f} \theta_f^2

Written via matrix/vector product notation:

J(\theta) = \frac{1}{2} (y - \tilde{X}\theta)^T (y - \tilde{X}\theta) + \alpha \, \theta^T \theta

SLIDE 34

Exact solution for L2 penalized linear regression


Optimization problem: "Penalized Least Squares"

\min_{\theta} \; \frac{1}{2} (y - \tilde{X}\theta)^T (y - \tilde{X}\theta) + \alpha \, \theta^T \theta

Solution:

\theta^* = (\tilde{X}^T \tilde{X} + \alpha I)^{-1} \tilde{X}^T y

If alpha > 0, the matrix \tilde{X}^T \tilde{X} + \alpha I is always invertible!
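A minimal NumPy sketch of this closed-form solution (my own illustration, not from the deck).

```python
import numpy as np

def ridge_fit(X_tilde, y, alpha):
    """Closed-form L2-penalized least squares: (X~^T X~ + alpha*I)^{-1} X~^T y."""
    G = X_tilde.shape[1]
    return np.linalg.solve(X_tilde.T @ X_tilde + alpha * np.eye(G), X_tilde.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.1 * rng.normal(size=50)
X_tilde = np.hstack([X, np.ones((50, 1))])   # append bias column

for alpha in [0.0, 1.0, 100.0]:
    theta = ridge_fit(X_tilde, y, alpha)
    print(alpha, np.round(theta, 3))   # larger alpha shrinks weight magnitudes
```

For alpha = 0 this reduces to ordinary least squares; as alpha grows, the printed weights shrink toward zero. Note that this sketch penalizes the bias term too (it sits inside X_tilde); library implementations such as scikit-learn's Ridge typically leave the intercept unpenalized.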

SLIDE 35

Slides on L1/L2 penalties

See slides 71-82 from the UC-Irvine course here: https://canvas.eee.uci.edu/courses/8278/files/2735313/

SLIDE 36

Pair Coding Activity

  • Try existing gradient descent code: https://github.com/tufts-ml-courses/comp135-19s-assignments/blob/master/labs/GradientDescentDemo.ipynb
  • Optimizes scalar slope to produce minimum error
  • Try step sizes of 0.0001, 0.02, 0.05, 0.1
  • Add L2 penalty with alpha > 0
  • Write calc_penalized_loss and calc_penalized_grad (see the sketch after this list)
  • What happens to estimated slope value w?
  • Repeat with L1 penalty with alpha > 0
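As a starting point, here is one possible shape for the two functions in the L2 case (my own sketch; the deck names calc_penalized_loss and calc_penalized_grad but does not show their signatures, so the arguments here are hypothetical).

```python
import numpy as np

# Hypothetical signatures: the deck names the functions but not their arguments,
# so this is one reasonable shape for the L2-penalized, scalar-slope case.
def calc_penalized_loss(w, x, y, alpha):
    """Squared error of predictions w*x plus an L2 penalty alpha * w^2."""
    return np.sum((y - w * x) ** 2) + alpha * w ** 2

def calc_penalized_grad(w, x, y, alpha):
    """Derivative of calc_penalized_loss with respect to the scalar slope w."""
    return -2.0 * np.sum(x * (y - w * x)) + 2.0 * alpha * w

# Tiny gradient-descent loop to see how alpha shrinks the estimated slope
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])   # roughly y = 2x

for alpha in [0.0, 10.0]:
    w = 0.0
    for _ in range(200):
        w -= 0.005 * calc_penalized_grad(w, x, y, alpha)
    print(alpha, round(w, 3))   # larger alpha pulls w below the unpenalized slope
```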
