SLIDE 1

Linear Regression & Gradient Descent


Tufts COMP 135: Introduction to Machine Learning https://www.cs.tufts.edu/comp/135/2019s/

Many slides attributable to: Prof. Mike Hughes, Erik Sudderth (UCI), Finale Doshi-Velez (Harvard), and James, Witten, Hastie, Tibshirani (ISL/ESL books)
SLIDE 2

LR & GD Unit Objectives

  • Exact solutions of least squares
    • 1D case without bias
    • 1D case with bias
    • General case
  • Gradient descent for least squares

SLIDE 3

What will we learn?

[Diagram: the three paradigms of machine learning (Supervised Learning, Unsupervised Learning, Reinforcement Learning). Supervised learning uses (data x, label y) pairs $\{x_n, y_n\}_{n=1}^N$ and a performance measure, in a Training / Prediction / Evaluation pipeline.]

SLIDE 4


Task: Regression

[Diagram: regression highlighted within supervised learning.] Regression: given features x, predict a label y, where y is a numeric variable, e.g. sales in $$.

SLIDE 5

Visualizing errors

SLIDE 6

Regression: Evaluation Metrics

  • mean squared error
  • mean absolute error

Mean squared error: $\frac{1}{N} \sum_{n=1}^{N} (y_n - \hat{y}_n)^2$

Mean absolute error: $\frac{1}{N} \sum_{n=1}^{N} |y_n - \hat{y}_n|$
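As a quick illustration (not from the original slides), both metrics are one-liners in NumPy; the function names here are mine:

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    # (1/N) * sum_n (y_n - yhat_n)^2
    return np.mean((y_true - y_pred) ** 2)

def mean_absolute_error(y_true, y_pred):
    # (1/N) * sum_n |y_n - yhat_n|
    return np.mean(np.abs(y_true - y_pred))
```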

SLIDE 7

Linear Regression

Parameters: weight vector $w = [w_1, w_2, \ldots, w_f, \ldots, w_F]$ and bias scalar $b$

Prediction: $\hat{y}(x_i) \triangleq \sum_{f=1}^{F} w_f x_{if} + b$

Training: find the weights and bias that minimize error
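A minimal sketch of this prediction rule in NumPy (names are mine, not from the slides):

```python
import numpy as np

def predict(X, w, b):
    # yhat_n = sum_f w_f * x_nf + b for each row x_n
    # X: (N, F) feature matrix, w: (F,) weight vector, b: scalar bias
    return X @ w + b
```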

SLIDE 8

Sales vs. Ad Budgets

SLIDE 9

Linear Regression: Training

$$\min_{w,b} \; \sum_{n=1}^{N} \big( y_n - \hat{y}(x_n, w, b) \big)^2$$

Optimization problem: “Least Squares”

SLIDE 10


Linear Regression: Training

$$\min_{w,b} \; \sum_{n=1}^{N} \big( y_n - \hat{y}(x_n, w, b) \big)^2$$

Optimization problem: "Least Squares"

An exact formula for the optimal values of w, b exists! With only one feature (F = 1):

$$w = \frac{\sum_{n=1}^{N} (x_n - \bar{x})(y_n - \bar{y})}{\sum_{n=1}^{N} (x_n - \bar{x})^2}, \qquad b = \bar{y} - w \bar{x}$$

where $\bar{x} = \mathrm{mean}(x_1, \ldots, x_N)$ and $\bar{y} = \mathrm{mean}(y_1, \ldots, y_N)$.
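A sketch of this F = 1 closed form in NumPy (function name is mine):

```python
import numpy as np

def fit_least_squares_1d(x, y):
    # w = sum_n (x_n - xbar)(y_n - ybar) / sum_n (x_n - xbar)^2
    # b = ybar - w * xbar
    xbar, ybar = x.mean(), y.mean()
    w = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
    b = ybar - w * xbar
    return w, b
```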

Where does this come from?

SLIDE 11


Linear Regression: Training

$$\min_{w,b} \; \sum_{n=1}^{N} \big( y_n - \hat{y}(x_n, w, b) \big)^2$$

Optimization problem: "Least Squares"

An exact formula for the optimal values of w, b exists! (Where does this come from? See the derivation notes on the next slide.) With many features (F ≥ 1):

$$[w_1 \ldots w_F \; b]^T = (\tilde{X}^T \tilde{X})^{-1} \tilde{X}^T y, \qquad \tilde{X} = \begin{bmatrix} x_{11} & \ldots & x_{1F} & 1 \\ x_{21} & \ldots & x_{2F} & 1 \\ \vdots & & \vdots & \vdots \\ x_{N1} & \ldots & x_{NF} & 1 \end{bmatrix}$$
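A sketch of the closed form in NumPy (names are mine); solving the linear system is numerically safer than forming the explicit inverse:

```python
import numpy as np

def fit_least_squares(X, y):
    # Append a constant column so the bias is learned as the last entry
    X_tilde = np.hstack([X, np.ones((X.shape[0], 1))])
    # Normal equations: (X~^T X~) theta = X~^T y
    theta = np.linalg.solve(X_tilde.T @ X_tilde, X_tilde.T @ y)
    return theta[:-1], theta[-1]  # weights w, bias b
```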

SLIDE 12

Derivation Notes

http://www.cs.tufts.edu/comp/135/2019s/notes/day03_linear_regression.pdf

SLIDE 13

When does the Least Squares estimator exist?

  • Fewer examples than features (N < F): infinitely many solutions!
  • Same number of examples and features (N = F): optimum exists if X is full rank
  • More examples than features (N > F): optimum exists if X is full rank
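A small illustration (mine, not from the slides) of why N < F breaks the closed form: the Gram matrix $\tilde{X}^T \tilde{X}$ loses rank, so the normal equations are singular:

```python
import numpy as np

rng = np.random.default_rng(0)
# N=3 examples, F=5 features: fewer examples than features
X_tilde = np.hstack([rng.normal(size=(3, 5)), np.ones((3, 1))])
gram = X_tilde.T @ X_tilde  # (F+1) x (F+1)
print(np.linalg.matrix_rank(gram), gram.shape[0])
# rank 3 < 6: singular system, infinitely many least-squares solutions
```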

SLIDE 14

More compact notation

$$\theta = [b \; w_1 \; w_2 \ldots w_F], \qquad \tilde{x}_n = [1 \; x_{n1} \; x_{n2} \ldots x_{nF}]$$

$$\hat{y}(x_n, \theta) = \theta^T \tilde{x}_n, \qquad J(\theta) \triangleq \sum_{n=1}^{N} \big( y_n - \hat{y}(x_n, \theta) \big)^2$$
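The cost $J(\theta)$ in this notation, as a minimal NumPy sketch (names are mine):

```python
import numpy as np

def cost_J(theta, X_tilde, y):
    # J(theta) = sum_n (y_n - theta^T x~_n)^2
    residuals = y - X_tilde @ theta
    return np.sum(residuals ** 2)
```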

SLIDE 15

Idea: Optimize via small steps

SLIDE 16

Derivatives point uphill

SLIDE 17


To minimize, go downhill

Step in the opposite direction of the derivative

SLIDE 18

Steepest descent algorithm

input: initial $\theta \in \mathbb{R}$
input: step size $\alpha \in \mathbb{R}^+$
while not converged:
    $\theta \leftarrow \theta - \alpha \frac{d}{d\theta} J(\theta)$
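A direct Python translation of this loop (the names and the stopping test are my choices, assuming a callable grad_J that returns $\frac{d}{d\theta} J(\theta)$):

```python
import numpy as np

def steepest_descent(grad_J, theta_init, alpha=0.01, tol=1e-8, max_iters=10000):
    # theta <- theta - alpha * dJ/dtheta, repeated until the update is ~zero
    theta = theta_init
    for _ in range(max_iters):
        step = alpha * grad_J(theta)
        theta = theta - step
        if np.max(np.abs(step)) < tol:  # converged
            break
    return theta

# Example: J(theta) = theta^2 has derivative 2*theta, minimum at 0
theta_star = steepest_descent(lambda th: 2.0 * th, theta_init=5.0)
```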

SLIDE 20

How to set step size?

SLIDE 21

How to set step size?

  • Simple and usually effective: pick a small constant, e.g. $\alpha = 0.01$
  • Improve: decay over iterations, e.g. $\alpha_t = C / t$ or $\alpha_t = (C + t)^{-0.9}$
  • Improve: line search for the best value at each step
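The first two strategies as hypothetical Python helpers (the constants are illustrative, not prescribed by the slides):

```python
def constant_step(t, alpha=0.01):
    # Simple: the same small constant at every iteration
    return alpha

def inverse_decay(t, C=1.0):
    # alpha_t = C / t: shrink the step as iterations accumulate (t >= 1)
    return C / t

def power_decay(t, C=1.0):
    # alpha_t = (C + t)^(-0.9)
    return (C + t) ** -0.9
```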

SLIDE 22

How to assess convergence?

  • Ideal: stop when the derivative equals zero
  • Practical heuristics: stop when …
    • the change in loss becomes small: $|J(\theta_t) - J(\theta_{t-1})| < \epsilon$
    • the step size is indistinguishable from zero: $\alpha \left| \frac{d}{d\theta} J(\theta) \right| < \epsilon$
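Both heuristics combined in one hypothetical helper (the threshold eps is illustrative):

```python
import numpy as np

def has_converged(J_curr, J_prev, grad, alpha, eps=1e-8):
    # Heuristic 1: loss barely changed; heuristic 2: update is ~zero
    small_loss_change = abs(J_curr - J_prev) < eps
    tiny_step = alpha * np.max(np.abs(grad)) < eps
    return small_loss_change or tiny_step
```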

SLIDE 23

Visualizing the cost function

"Level set" contours: all points with the same function value

SLIDE 24

In 2D parameter space


gradient = vector of partial derivatives
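In symbols (a standard definition, stated here for completeness rather than taken from the slide, using the $\theta$ notation above):

$$\nabla_\theta J(\theta) = \left[ \frac{\partial J}{\partial \theta_0}, \frac{\partial J}{\partial \theta_1}, \ldots, \frac{\partial J}{\partial \theta_F} \right]$$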

SLIDE 25

Gradient Descent DEMO

https://github.com/tufts-ml-courses/comp135-19s-assignments/blob/master/labs/GradientDescentDemo.ipynb

SLIDE 26

Fitting a line isn’t always ideal

SLIDE 27

Can fit linear functions to nonlinear features

A nonlinear function of x:

$$\hat{y}(x_i) = \theta_0 + \theta_1 x_i + \theta_2 x_i^2 + \theta_3 x_i^3$$

can be written as a linear function of $\phi(x_i) = [x_i \; x_i^2 \; x_i^3]$:

$$\hat{y}(\phi(x_i)) = \theta_0 + \theta_1 \phi(x_i)_1 + \theta_2 \phi(x_i)_2 + \theta_3 \phi(x_i)_3$$

"Linear regression" means linear in the parameters (weights, biases). Features can be arbitrary transforms of the raw data.
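A sketch of the cubic transform in NumPy (name is mine); the result can feed the same least-squares fit as before, since the model stays linear in $\theta$:

```python
import numpy as np

def poly_features(x, degree=3):
    # phi(x_i) = [x_i, x_i^2, ..., x_i^degree] for a 1D array x of shape (N,)
    return np.column_stack([x ** d for d in range(1, degree + 1)])
```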

SLIDE 28

What feature transform to use?

  • Anything that works for your data!
  • sin / cos for periodic data
  • polynomials for high-order dependencies
  • interactions between feature dimensions
  • Many other choices possible

Examples:

Interactions between feature dimensions: $\phi(x_i) = [x_{i1} x_{i2} \;\; x_{i3} x_{i4}]$

Polynomials: $\phi(x_i) = [x_i \; x_i^2 \; x_i^3]$
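For instance, a hypothetical sin/cos transform for the periodic-data case mentioned above (the period value is illustrative):

```python
import numpy as np

def periodic_features(x, period=12.0):
    # phi(x_i) = [sin(2 pi x_i / period), cos(2 pi x_i / period)]
    # e.g. period=12 for monthly seasonality (a hypothetical choice)
    omega = 2.0 * np.pi / period
    return np.column_stack([np.sin(omega * x), np.cos(omega * x)])
```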