Linear Regression with Polynomial Features, Cross Validation, and Hyperparameter Selection


1. Tufts COMP 135: Introduction to Machine Learning (https://www.cs.tufts.edu/comp/135/2020f/)
Linear Regression with Polynomial Features, Cross Validation, and Hyperparameter Selection
Many slides attributable to: Prof. Mike Hughes; Erik Sudderth (UCI); Finale Doshi-Velez (Harvard); James, Witten, Hastie, Tibshirani (ISL/ESL books)

2. Objectives for Today (day 04)
• Regression with transformations of features
  • Especially, polynomial features
• Ways to estimate generalization error
  • Fixed validation set
  • K-fold cross validation
• Hyperparameter selection

3. What will we learn?
[Figure: course overview. Supervised learning: training data of label pairs $\{x_n, y_n\}_{n=1}^N$ feeds a task; the model makes predictions; evaluation uses a performance measure. Unsupervised learning and reinforcement learning shown for contrast.]

4. Task: Regression
y is a numeric variable, e.g. sales in $$
[Figure: scatter of examples in the x-y plane with a fitted regression function]

5. Review: Linear Regression
Optimization problem: "Least Squares"

$$\min_{w, b} \; \sum_{n=1}^{N} \left( y_n - \hat{y}(x_n, w, b) \right)^2$$

An exact formula for the optimal values of $w, b$ exists. Stack the features with a trailing column of ones:

$$\tilde{X} = \begin{bmatrix} x_{11} & \dots & x_{1F} & 1 \\ x_{21} & \dots & x_{2F} & 1 \\ \vdots & & \vdots & \vdots \\ x_{N1} & \dots & x_{NF} & 1 \end{bmatrix}, \qquad [w_1 \; \dots \; w_F \; b]^T = (\tilde{X}^T \tilde{X})^{-1} \tilde{X}^T y$$

The math works in 1D and for many dimensions.
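A minimal NumPy sketch of the closed-form solution above; the toy data, true weights, and noise scale are invented for illustration:

```python
import numpy as np

# Toy dataset: N examples with F features (values are illustrative only)
rng = np.random.default_rng(0)
N, F = 100, 3
X = rng.normal(size=(N, F))
y = X @ np.array([1.5, -2.0, 0.5]) + 4.0 + 0.1 * rng.normal(size=N)

# Append a column of ones so the bias b becomes the last entry of theta
X_tilde = np.hstack([X, np.ones((N, 1))])

# Exact least-squares solution: theta = (X~^T X~)^{-1} X~^T y
# (np.linalg.lstsq is the numerically safer way to solve the same problem)
theta = np.linalg.solve(X_tilde.T @ X_tilde, X_tilde.T @ y)
w, b = theta[:F], theta[F]
```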

6. Transformations of Features

7. Fitting a line isn’t always ideal

8. Can fit linear functions to nonlinear features
A nonlinear function of $x$:

$$\hat{y}(x_i) = \theta_0 + \theta_1 x_i + \theta_2 x_i^2 + \theta_3 x_i^3$$

With the feature transform $\phi(x_i) = [1 \;\; x_i \;\; x_i^2 \;\; x_i^3]$, this can be written as a linear function of $\theta$:

$$\hat{y}(x_i) = \sum_{g=1}^{4} \theta_g \, \phi_g(x_i) = \theta^T \phi(x_i)$$

"Linear regression" means linear in the parameters (weights, biases). Features can be arbitrary transforms of the raw data.
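A short sketch of this idea, with coefficients invented for illustration; the point is that $\hat{y}$ is nonlinear in $x$ but linear in $\theta$:

```python
import numpy as np

def phi(x):
    """Cubic polynomial feature map: phi(x) = [1, x, x^2, x^3]."""
    return np.stack([np.ones_like(x), x, x**2, x**3], axis=-1)

theta = np.array([0.5, 1.0, -2.0, 0.3])  # illustrative coefficients
x = np.linspace(-1.0, 1.0, 5)
yhat = phi(x) @ theta                     # linear in theta, cubic in x
```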

9. What feature transform to use?
• Anything that works for your data!
• sin / cos for periodic data
• polynomials for high-order dependencies: $\phi(x_i) = [1 \;\; x_i \;\; x_i^2 \;\; \dots]$
• interactions between feature dimensions: $\phi(x_i) = [1 \;\; x_{i1} x_{i2} \;\; x_{i3} x_{i4} \;\; \dots]$
• Many other choices possible (a couple of these transforms are sketched below)
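Two such transforms under assumed conventions: a known period of 24 (e.g. hour-of-day data) and a 2-feature raw input for the interaction case; both are illustrative sketches, not the only way to build these features:

```python
import numpy as np

def phi_periodic(x, period=24.0):
    # sin/cos features for data with a known period (e.g. hour of day)
    return np.stack([np.ones_like(x),
                     np.sin(2 * np.pi * x / period),
                     np.cos(2 * np.pi * x / period)], axis=-1)

def phi_interactions(X):
    # pairwise interaction features for a 2-dimensional raw input
    x1, x2 = X[:, 0], X[:, 1]
    return np.stack([np.ones_like(x1), x1, x2, x1 * x2], axis=-1)
```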

10. Linear Regression with Transformed Features

$$\phi(x_i) = [1 \;\; \phi_1(x_i) \;\; \phi_2(x_i) \;\; \dots \;\; \phi_{G-1}(x_i)], \qquad \hat{y}(x_i) = \theta^T \phi(x_i)$$

Optimization problem: "Least Squares"

$$\min_{\theta} \; \sum_{n=1}^{N} \left( y_n - \theta^T \phi(x_n) \right)^2$$

Exact solution, with the $N \times G$ design matrix $\Phi$:

$$\Phi = \begin{bmatrix} 1 & \phi_1(x_1) & \dots & \phi_{G-1}(x_1) \\ 1 & \phi_1(x_2) & \dots & \phi_{G-1}(x_2) \\ \vdots & & & \vdots \\ 1 & \phi_1(x_N) & \dots & \phi_{G-1}(x_N) \end{bmatrix}, \qquad \theta^* = (\Phi^T \Phi)^{-1} \Phi^T y$$
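In scikit-learn the same model can be assembled from standard building blocks; a minimal sketch, assuming sine-wave toy data and degree 3 purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Toy 1D regression data (invented for illustration)
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=(30, 1))
y = np.sin(2 * np.pi * x[:, 0]) + 0.1 * rng.normal(size=30)

# PolynomialFeatures builds Phi (including the constant column);
# LinearRegression then solves the least-squares problem over Phi.
model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
model.fit(x, y)
yhat = model.predict(x)
```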

11. 0th degree polynomial features
[Figure: true function (dashed), training data (circles), and predictions from LR using 0th degree polynomial features (solid)]
Credit: Slides from course by Prof. Erik Sudderth (UCI)

12. 1st degree polynomial features
[Figure: true function (dashed), training data (circles), and predictions from LR using 1st degree polynomial features (solid)]
Credit: Slides from course by Prof. Erik Sudderth (UCI)

13. 3rd degree polynomial features
[Figure: true function (dashed), training data (circles), and predictions from LR using 3rd degree polynomial features (solid)]
Credit: Slides from course by Prof. Erik Sudderth (UCI)

14. 9th degree polynomial features
[Figure: true function (dashed), training data (circles), and predictions from LR using 9th degree polynomial features (solid)]
Credit: Slides from course by Prof. Erik Sudderth (UCI)

15. Error vs Degree
[Figure: mean squared error vs. polynomial degree]

16. Error vs Model Complexity
[Figure: error vs. model complexity. A 0-degree polynomial underfits (left); a high-degree polynomial overfits (right).]

17. What to do about underfitting?
Increase model complexity (add more features!)
What to do about overfitting?
• Select among several complexity levels the one that generalizes best (today)
• Control complexity with a penalty in the training objective (next class)

18. Hyperparameter Selection
Selection problem: What polynomial degree to use?
[Figure: mean squared error vs. polynomial degree, for training and test sets]
If we picked the lowest training error, we'd select a 9-degree polynomial. If we picked the lowest test error, we'd select a 3- or 4-degree polynomial.
"Parameter" (e.g. weight values in linear regression): a numerical variable controlling quality of fit that we can effectively estimate by minimizing error on the training set.
"Hyperparameter" (e.g. degree of polynomial features): a numerical variable controlling model complexity / quality of fit whose value we cannot effectively estimate from the training set.

19. Goal of regression (supervised ML) is to generalize: sample to population
For any regression task, we might want to:
• Train a model (estimate parameters)
  • Requires calling `fit` on a training labeled dataset
• Select hyperparameters (e.g. which degree of polynomial?)
  • Requires evaluating predictions on a validation labeled dataset
• Report its ability on data it has never seen before ("generalization error" or "test error")
  • Requires comparing predictions to a test labeled dataset
Should ALWAYS use different labeled datasets to do each of these things!

20. Two Ways to Measure Generalization Error
• Fixed Validation Set
• Cross-Validation

21. Labeled dataset
[Figure: feature matrix x (N x F) alongside label vector y (N x 1)]
Each row represents one example. Assume rows are arranged "uniformly at random" (order doesn't matter).

22. Split into train and test
[Figure: rows of x and y partitioned into a train block and a test block]

23. Selection via Fixed Validation Set
Option: Fit on train, select on validation
1) Fit each model to training data
2) Evaluate each model on validation data
3) Select model with lowest validation error
4) Report error on test set
[Figure: rows of x and y partitioned into train, validation, and test blocks]
These four steps are sketched in code below.
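An end-to-end sketch of the four steps, selecting a polynomial degree; the sine-wave data and the 60/20/20 split sizes are assumptions for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Toy data and a 60/20/20 train/validation/test split (sizes assumed)
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=(100, 1))
y = np.sin(2 * np.pi * x[:, 0]) + 0.1 * rng.normal(size=100)
x_tr, x_tmp, y_tr, y_tmp = train_test_split(x, y, test_size=0.4, random_state=0)
x_va, x_te, y_va, y_te = train_test_split(x_tmp, y_tmp, test_size=0.5, random_state=0)

best_degree, best_err = None, np.inf
for degree in range(10):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_tr, y_tr)                                  # 1) fit on train
    err = mean_squared_error(y_va, model.predict(x_va))    # 2) evaluate on validation
    if err < best_err:
        best_degree, best_err = degree, err                # 3) select lowest validation error

final = make_pipeline(PolynomialFeatures(best_degree), LinearRegression())
final.fit(x_tr, y_tr)
test_err = mean_squared_error(y_te, final.predict(x_te))   # 4) report error on test set
```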

24. Selection via Fixed Validation Set
Option: Fit on train, select on validation
1) Fit each model to training data
2) Evaluate each model on validation data
3) Select model with lowest validation error
4) Report error on test set
Concerns:
• What sizes to pick? Will train be too small?
• Is the validation set used effectively? (only to evaluate predictions?)

25. For small datasets, randomness in the validation split will impact selection
[Figure: validation error curves for a single random split (left) vs. 10 other random splits (right)]
Credit: ISL Textbook, Chapter 5

26. 3-fold Cross Validation
Divide the labeled dataset into 3 even-sized parts (folds 1, 2, 3).
Fit the model 3 independent times. Each time, leave one fold out as validation and keep the remaining folds as training.
[Figure: the three train/validation assignments, one per fold]
Heldout error estimate: average of the validation error across all 3 fits. A code sketch follows below.
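A minimal sketch of this procedure using scikit-learn's KFold splitter; the toy data and degree 3 are assumptions:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Toy data (invented for illustration)
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=(60, 1))
y = np.sin(2 * np.pi * x[:, 0]) + 0.1 * rng.normal(size=60)

# Fit 3 independent times; each fold serves exactly once as validation
errs = []
for tr_idx, va_idx in KFold(n_splits=3, shuffle=True, random_state=0).split(x):
    model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
    model.fit(x[tr_idx], y[tr_idx])
    errs.append(mean_squared_error(y[va_idx], model.predict(x[va_idx])))

heldout_estimate = np.mean(errs)  # average validation error across the 3 fits
```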

27. K-fold CV: How many folds K?
• Can do as low as 2 folds
• Can do as high as N folds ("leave-one-out": each fold holds a single example)
• Usual rule of thumb: 5-fold or 10-fold CV
• Computation runtime scales linearly with K
• Larger K also means each fit uses more training data, so each fit might take longer too
• Each fit is independent and parallelizable
A sketch of CV-based hyperparameter selection appears below.
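Putting slides 18 and 26-27 together, a hedged sketch of selecting the polynomial degree by 5-fold CV; the data, candidate degrees, and scoring choice are assumptions, and n_jobs=-1 exploits the fact that the fold fits are independent:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Toy data (invented for illustration)
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=(60, 1))
y = np.sin(2 * np.pi * x[:, 0]) + 0.1 * rng.normal(size=60)

# 5-fold CV estimate of heldout MSE for each candidate degree;
# n_jobs=-1 runs the independent fold fits in parallel
cv_mse = {}
for degree in range(10):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    scores = cross_val_score(model, x, y, cv=5,
                             scoring="neg_mean_squared_error", n_jobs=-1)
    cv_mse[degree] = -scores.mean()

best_degree = min(cv_mse, key=cv_mse.get)
```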

28. Estimating Heldout Error with Cross Validation
[Figure: leave-one-out CV (N folds, left) vs. 9 separate runs of CV, each with 10 folds (right)]
Credit: ISL Textbook, Chapter 5
