Cross Validation and Penalized Linear Regression (PowerPoint PPT Presentation)

SLIDE 1

Cross Validation and Penalized Linear Regression


Tufts COMP 135: Introduction to Machine Learning https://www.cs.tufts.edu/comp/135/2019s/

Many slides attributable to: Prof. Mike Hughes, Erik Sudderth (UCI), Finale Doshi-Velez (Harvard), James, Witten, Hastie, Tibshirani (ISL/ESL books)

SLIDE 2

CV & Penalized LR Objectives

  • Regression with transformations of features
  • Cross Validation
  • L2 penalties
  • L1 penalties

SLIDE 3

What will we learn?


Figure: the three machine learning paradigms (supervised, unsupervised, reinforcement learning). Supervised learning works from data-label pairs \{(x_n, y_n)\}_{n=1}^{N}, with a task, a performance measure, and training / prediction / evaluation stages.

SLIDE 4


Task: Regression

Figure: within supervised learning, regression maps an input x to an output y, where y is a numeric variable (e.g., sales in dollars).

SLIDE 5


Review: Linear Regression

Optimization problem ("Least Squares"):

\min_{w,b} \sum_{n=1}^{N} \left( y_n - \hat{y}(x_n, w, b) \right)^2

Exact formula for the optimal values of w, b exists! The math works in 1D and for many dimensions:

[w_1 \; \ldots \; w_F \; b]^T = (\tilde{X}^T \tilde{X})^{-1} \tilde{X}^T y

\tilde{X} = \begin{bmatrix} x_{11} & \ldots & x_{1F} & 1 \\ x_{21} & \ldots & x_{2F} & 1 \\ \vdots & & \vdots & \vdots \\ x_{N1} & \ldots & x_{NF} & 1 \end{bmatrix}
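Below is a minimal NumPy sketch of this exact formula (my own illustration, not from the deck); the toy data and the names X_tilde and w_b are arbitrary.

```python
import numpy as np

# Toy data: N=5 examples, F=2 features
X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.5],
              [4.0, 3.0],
              [5.0, 2.5]])
y = np.array([3.0, 2.0, 4.5, 6.0, 7.0])

# Append a column of ones so the bias b is the last entry of the solution
X_tilde = np.hstack([X, np.ones((X.shape[0], 1))])

# Least-squares solution [w_1 ... w_F b]^T = (X~^T X~)^{-1} X~^T y
w_b = np.linalg.solve(X_tilde.T @ X_tilde, X_tilde.T @ y)
w, b = w_b[:-1], w_b[-1]
print("weights:", w, "bias:", b)
```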

SLIDE 6

Recap: solving linear regression

  • More examples than features (N > F)
  • Same number of examples and features (N=F)
  • Fewer examples than features (N < F) or low rank


Case N > F, and the inverse of X^T X exists (needs to be full rank): an optimal weight vector exists and the formula applies; it will likely have non-zero training error (the system is overdetermined).
Case N = F, and the inverse of X^T X exists (needs to be full rank): an optimal weight vector exists, the formula applies, and it will have zero error on the training set.
Case N < F, or X is low rank: infinitely many optimal weight vectors exist with zero training error; the inverse of X^T X does not exist, so naively the formula will fail.
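A small sketch of the N < F case (my own illustration, not from the deck): X^T X is singular, so the naive formula fails, but numpy.linalg.lstsq still returns one of the infinitely many zero-error solutions (the minimum-norm one).

```python
import numpy as np

rng = np.random.default_rng(0)
N, F = 3, 5                        # fewer examples than features
X = rng.normal(size=(N, F))
y = rng.normal(size=N)

XtX = X.T @ X
print(np.linalg.matrix_rank(XtX))  # rank <= N < F, so X^T X is not invertible

# lstsq returns the minimum-norm solution among the infinitely many
# weight vectors that achieve zero training error
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(X @ w, y))       # True: zero error on the training set
```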

SLIDE 7

Recap

  • Squared error is special
  • Exact formulas for estimating parameters
  • Most metrics do not have exact formulas
  • Take derivative, set to zero, try to solve, …. HARD!
  • Example: absolute error
  • General algorithm: Gradient Descent!
  • As long as the first derivative exists, we can iterate to estimate optimal parameters
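To make the recap concrete, here is a minimal gradient-descent sketch for squared-error linear regression (my own illustration; the step size and iteration count are arbitrary choices).

```python
import numpy as np

def grad_squared_error(w, X, y):
    """Gradient of sum_n (y_n - x_n^T w)^2 with respect to w."""
    return -2.0 * X.T @ (y - X @ w)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

w = np.zeros(3)
step_size = 0.001
for _ in range(500):
    w = w - step_size * grad_squared_error(w, X, y)
print(w)   # approaches the exact least-squares solution
```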

SLIDE 8


Transformations of Features

SLIDE 9

Fitting a line isn’t always ideal

SLIDE 10

Can fit linear functions to nonlinear features


A nonlinear function of x:

\hat{y}(x_i) = \theta_0 + \theta_1 x_i + \theta_2 x_i^2 + \theta_3 x_i^3

can be written as a linear function of the transformed features

\phi(x_i) = [1 \;\; x_i \;\; x_i^2 \;\; x_i^3]

\hat{y}(x_i) = \sum_{g=1}^{4} \theta_g \, \phi_g(x_i) = \theta^T \phi(x_i)

"Linear regression" means linear in the parameters (weights, biases). Features can be arbitrary transforms of the raw data.

SLIDE 11

What feature transform to use?

  • Anything that works for your data!
  • sin / cos for periodic data
  • polynomials for high-order dependencies
  • interactions between feature dimensions
  • Many other choices possible


\phi(x_i) = [1 \;\; x_i \;\; x_i^2 \;\; \ldots]

\phi(x_i) = [1 \;\; x_{i1} x_{i2} \;\; x_{i3} x_{i4} \;\; \ldots]

SLIDE 12


Linear Regression with Transformed Features

Prediction with transformed features:

\hat{y}(x_i) = \theta^T \phi(x_i), \qquad \phi(x_i) = [1 \;\; \phi_1(x_i) \;\; \phi_2(x_i) \;\; \ldots \;\; \phi_{G-1}(x_i)]

Optimization problem ("Least Squares"):

\min_{\theta} \sum_{n=1}^{N} \left( y_n - \theta^T \phi(x_n) \right)^2

Exact solution:

\theta^* = (\Phi^T \Phi)^{-1} \Phi^T y

\Phi = \begin{bmatrix} 1 & \phi_1(x_1) & \ldots & \phi_{G-1}(x_1) \\ 1 & \phi_1(x_2) & \ldots & \phi_{G-1}(x_2) \\ \vdots & & & \vdots \\ 1 & \phi_1(x_N) & \ldots & \phi_{G-1}(x_N) \end{bmatrix} \qquad (N \times G \text{ matrix})
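A minimal NumPy sketch of least squares with polynomial feature transforms (my own illustration, not from the deck); np.vander builds the N x G matrix Phi with columns 1, x, x^2, x^3.

```python
import numpy as np

# Toy 1D data with a nonlinear trend
rng = np.random.default_rng(0)
x = np.linspace(-2, 2, 30)
y = 0.5 * x**3 - x + 0.3 * rng.normal(size=x.size)

# Build Phi: one row per example, columns [1, x, x^2, x^3]  (N x G with G=4)
Phi = np.vander(x, 4, increasing=True)

# Exact least-squares solution theta* = (Phi^T Phi)^{-1} Phi^T y
theta = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
y_hat = Phi @ theta
print("learned coefficients:", theta)
```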

SLIDE 13

Cross Validation

SLIDE 14


Generalize: sample to population

SLIDE 15

Labeled dataset


Figure: a labeled dataset with an x column and a y column. Each row represents one example; assume rows are arranged "uniformly at random" (order doesn't matter).

SLIDE 16

Split into train and test


Figure: the labeled dataset (x and y columns) split into a train portion and a test portion.

SLIDE 17

Model Complexity vs Error


Figure: error vs. model complexity, showing the underfitting regime (low complexity) and the overfitting regime (high complexity).

SLIDE 18

How to fit best model?


Option: Fit on train, select on validation
1) Fit each model to training data
2) Evaluate each model on validation data
3) Select model with lowest validation error
4) Report error on test set

Figure: the labeled dataset (x and y columns) split into train, validation, and test portions.
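A minimal sketch of the four steps above on toy data (my own illustration, not from the deck): the candidate models are polynomials of different degrees, and the fixed split sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=60)
y = 0.5 * x**3 - x + 0.3 * rng.normal(size=x.size)

# Fixed split: 40 train, 10 validation, 10 test
x_tr, y_tr = x[:40], y[:40]
x_va, y_va = x[40:50], y[40:50]
x_te, y_te = x[50:], y[50:]

def fit_poly(x, y, degree):
    """Least-squares fit of a polynomial of the given degree."""
    Phi = np.vander(x, degree + 1, increasing=True)
    return np.linalg.lstsq(Phi, y, rcond=None)[0]

def mse(theta, x, y):
    Phi = np.vander(x, theta.size, increasing=True)
    return np.mean((y - Phi @ theta) ** 2)

# 1) fit each candidate model on train, 2) evaluate each on validation
val_err = {d: mse(fit_poly(x_tr, y_tr, d), x_va, y_va) for d in range(10)}
# 3) select the model (degree) with lowest validation error
best_d = min(val_err, key=val_err.get)
# 4) report error on the test set
print(best_d, mse(fit_poly(x_tr, y_tr, best_d), x_te, y_te))
```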

SLIDE 19

How to fit best model?


Option: Fit on train, select on validation
1) Fit each model to training data
2) Evaluate each model on validation data
3) Select model with lowest validation error
4) Report error on test set

Figure: the labeled dataset (x and y columns) split into train, validation, and test portions.

Concerns

  • Will train be too small?
  • Make better use of data?
SLIDE 20

Estimating Heldout Error with Fixed Validation Set

Figure: validation-set error estimates from a single random split and from 10 other random splits. (Credit: ISL Textbook, Chapter 5)

SLIDE 21

3-fold Cross Validation


Divide the labeled dataset into 3 even-sized parts (folds). Fit the model 3 independent times; each time, leave one fold out as validation and keep the remaining folds as training.

Figure: the dataset (x and y columns) divided into fold 1, fold 2, and fold 3; in each fit, a different fold serves as validation and the rest as training.

Heldout error estimate: average of the validation error across all 3 fits
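A minimal NumPy sketch of 3-fold cross validation (my own illustration, not from the deck), using the same kind of polynomial-fitting helpers as above.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=60)
y = 0.5 * x**3 - x + 0.3 * rng.normal(size=x.size)

def fit_poly(x, y, degree):
    Phi = np.vander(x, degree + 1, increasing=True)
    return np.linalg.lstsq(Phi, y, rcond=None)[0]

def mse(theta, x, y):
    Phi = np.vander(x, theta.size, increasing=True)
    return np.mean((y - Phi @ theta) ** 2)

# Divide the shuffled indices into 3 even-sized folds
folds = np.array_split(rng.permutation(x.size), 3)

val_errors = []
for k in range(3):
    val_idx = folds[k]
    train_idx = np.concatenate([folds[j] for j in range(3) if j != k])
    theta = fit_poly(x[train_idx], y[train_idx], degree=3)
    val_errors.append(mse(theta, x[val_idx], y[val_idx]))

# Heldout error estimate: average validation error across the 3 fits
print(np.mean(val_errors))
```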

SLIDE 22

K-fold CV: How many folds K?

  • Can do as few as 2 folds
  • Can do as many as N folds ("leave one out", where each fit trains on N-1 examples)
  • Usual rule of thumb: 5-fold or 10-fold CV (see the sketch after this list)
  • Computation runtime scales linearly with K
  • Larger K also means each fit uses more training data, so each fit might take longer too
  • Each fit is independent and parallelizable
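A minimal sketch of 5-fold CV following the rule of thumb above, using scikit-learn's KFold and LinearRegression (my own illustration; the deck does not prescribe a particular library).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
val_errors = []
for train_idx, val_idx in kf.split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    pred = model.predict(X[val_idx])
    val_errors.append(np.mean((y[val_idx] - pred) ** 2))

# Heldout error estimate: average validation MSE over the 5 folds
print(np.mean(val_errors))
```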

SLIDE 23


Estimating Heldout Error with Cross Validation

Figure: cross-validation error estimates from 9 separate splits, each with 10 folds, and from leave-one-out cross validation. (Credit: ISL Textbook, Chapter 5)

SLIDE 24

What to do about underfitting?

  • Increase model complexity
  • Add more features!

SLIDE 25

What to do about overfitting?

  • Select complexity with cross validation
  • Control single-fit complexity with a penalty!

SLIDE 26

Zero degree polynomial

Credit: Slides from course by Prof. Erik Sudderth (UCI)

SLIDE 27

1st degree polynomial

Credit: Slides from course by Prof. Erik Sudderth (UCI)

SLIDE 28

3rd degree polynomial

Credit: Slides from course by Prof. Erik Sudderth (UCI)

SLIDE 29

9th degree polynomial

Credit: Slides from course by Prof. Erik Sudderth (UCI)

SLIDE 30

Error vs Complexity


Figure: heldout error (square root of mean squared error) plotted against polynomial degree.

SLIDE 31


Figure: fitted curves for polynomial degrees 1, 3, and 9. (Credit: Slides from course by Prof. Erik Sudderth, UCI)

SLIDE 32

Idea: Penalize magnitude of weights


J(\theta) = \frac{1}{2} \sum_{n=1}^{N} (y_n - \theta^T \tilde{x}_n)^2 + \alpha \sum_{f} \theta_f^2, \qquad \alpha \ge 0

Penalty strength: larger alpha means we prefer smaller magnitude weights.

SLIDE 33

Idea: Penalize magnitude of weights


J(\theta) = \frac{1}{2} \sum_{n=1}^{N} (y_n - \theta^T \tilde{x}_n)^2 + \alpha \sum_{f} \theta_f^2

Written via matrix/vector product notation:

J(\theta) = \frac{1}{2} (y - \tilde{X}\theta)^T (y - \tilde{X}\theta) + \alpha \, \theta^T \theta

SLIDE 34

Exact solution for L2 penalized linear regression


Optimization problem: "Penalized Least Squares"

\min_{\theta} \; \frac{1}{2} (y - \tilde{X}\theta)^T (y - \tilde{X}\theta) + \alpha \, \theta^T \theta

Solution:

\theta^* = (\tilde{X}^T \tilde{X} + \alpha I)^{-1} \tilde{X}^T y

If alpha > 0, the matrix \tilde{X}^T \tilde{X} + \alpha I is always invertible!
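A minimal NumPy sketch of this closed-form solution (my own illustration, not from the deck).

```python
import numpy as np

def ridge_fit(X_tilde, y, alpha):
    """Closed-form L2-penalized least squares: (X~^T X~ + alpha*I)^{-1} X~^T y."""
    G = X_tilde.shape[1]
    return np.linalg.solve(X_tilde.T @ X_tilde + alpha * np.eye(G), X_tilde.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.1 * rng.normal(size=50)
X_tilde = np.hstack([X, np.ones((50, 1))])   # append bias column

for alpha in [0.0, 1.0, 100.0]:
    theta = ridge_fit(X_tilde, y, alpha)
    print(alpha, np.round(theta, 3))   # larger alpha shrinks weight magnitudes
```

For alpha = 0 this reduces to ordinary least squares; as alpha grows, the printed weights shrink toward zero. Note that this sketch penalizes the bias term too (it sits inside X_tilde); library implementations such as scikit-learn's Ridge typically leave the intercept unpenalized.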

SLIDE 35

Slides on L1/L2 penalties

See slides 71-82 from the UC-Irvine course here: https://canvas.eee.uci.edu/courses/8278/files/2735313/

SLIDE 36

Pair Coding Activity

  • Try existing gradient descent code: https://github.com/tufts-ml-courses/comp135-19s-assignments/blob/master/labs/GradientDescentDemo.ipynb
  • Optimizes scalar slope to produce minimum error
  • Try step sizes of 0.0001, 0.02, 0.05, 0.1
  • Add L2 penalty with alpha > 0
  • Write calc_penalized_loss and calc_penalized_grad (see the sketch after this list)
  • What happens to estimated slope value w?
  • Repeat with L1 penalty with alpha > 0
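As a starting point, here is one possible shape for the two functions in the L2 case (my own sketch; the deck names calc_penalized_loss and calc_penalized_grad but does not show their signatures, so the arguments here are hypothetical).

```python
import numpy as np

# Hypothetical signatures: the deck names the functions but not their arguments,
# so this is one reasonable shape for the L2-penalized, scalar-slope case.
def calc_penalized_loss(w, x, y, alpha):
    """Squared error of predictions w*x plus an L2 penalty alpha * w^2."""
    return np.sum((y - w * x) ** 2) + alpha * w ** 2

def calc_penalized_grad(w, x, y, alpha):
    """Derivative of calc_penalized_loss with respect to the scalar slope w."""
    return -2.0 * np.sum(x * (y - w * x)) + 2.0 * alpha * w

# Tiny gradient-descent loop to see how alpha shrinks the estimated slope
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])   # roughly y = 2x

for alpha in [0.0, 10.0]:
    w = 0.0
    for _ in range(200):
        w -= 0.005 * calc_penalized_grad(w, x, y, alpha)
    print(alpha, round(w, 3))   # larger alpha pulls w below the unpenalized slope
```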
