SLIDE 1

Linear Regression + Optimization for ML

10-601 Introduction to Machine Learning
Machine Learning Department, School of Computer Science, Carnegie Mellon University

Matt Gormley, Lecture 8, Feb. 07, 2020

SLIDE 2

Q&A

Q: How can I get more one-on-one interaction with the course staff?

A: Attend office hours as soon after the homework release as possible!

SLIDE 3

Reminders

  • Homework 3: KNN, Perceptron, Lin. Reg.
    – Out: Wed, Feb. 05 (+1 day)
    – Due: Wed, Feb. 12 at 11:59pm
  • Today’s In-Class Poll
    – http://p8.mlcourse.org

SLIDE 4

LINEAR REGRESSION

SLIDE 5

Regression Problems

Chalkboard:
  – Definition of Regression
  – Linear functions
  – Residuals
  – Notation trick: fold in the intercept

SLIDE 6

OPTIMIZATION FOR ML

The Big Picture

SLIDE 7

Optimization for ML

Not quite the same setting as in other fields…
  – The function we are optimizing might not be the true goal (e.g. likelihood vs. generalization error)
  – Precision might not matter (e.g. the data is noisy, so optimizing to within 1e-16 might not help)
  – Stopping early can help generalization error (i.e. “early stopping” is a technique for regularization – discussed more next time)

SLIDE 8

min vs. argmin

Consider y = f(x) = x² + 1.

v* = min_x f(x)
x* = argmin_x f(x)

1. Q: What is v*?
   A: v* = 1, the minimum value of the function.
2. Q: What is x*?
   A: x* = 0, the argument that yields the minimum value.
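A quick numerical illustration of the distinction (a sketch, not from the slides), using a coarse grid search over f(x) = x² + 1:

```python
import numpy as np

def f(x):
    return x ** 2 + 1

# Evaluate f on a fine grid, then pick out min (the value) vs. argmin (the input).
xs = np.linspace(-3, 3, 601)   # symmetric grid that includes x = 0 exactly
ys = f(xs)

v_star = ys.min()              # v* = min_x f(x): the smallest value attained
x_star = xs[ys.argmin()]       # x* = argmin_x f(x): the input achieving it

print(v_star, x_star)          # 1.0 0.0
```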

SLIDE 9

Linear Regression as Function Approximation

SLIDE 10

SLIDE 11

Contour Plots

1. Each level curve is labeled with a value.
2. The value label indicates the value of the function for all points lying on that level curve.
3. Just like a topographical map, but for a function.

Example: J(θ) = J(θ1, θ2) = (10(θ1 − 0.5))² + (6(θ2 − 0.4))²

SLIDE 12

Optimization by Random Guessing

Optimization Method #0: Random Guessing
1. Pick a random θ
2. Evaluate J(θ)
3. Repeat steps 1 and 2 many times
4. Return the θ that gives the smallest J(θ)

J(θ) = J(θ1, θ2) = (10(θ1 − 0.5))² + (6(θ2 − 0.4))²

  t | θ1  | θ2  | J(θ1, θ2)
  1 | 0.2 | 0.2 | 10.4
  2 | 0.3 | 0.7 | 7.2
  3 | 0.6 | 0.4 | 1.0
  4 | 0.9 | 0.7 | 19.2
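Method #0 is easy to sketch in code (the helper name, the search range [0, 1)², and the guess count are illustrative assumptions, not from the slides):

```python
import random

def J(theta1, theta2):
    """Objective from the slides: minimum value 0 at (0.5, 0.4)."""
    return (10 * (theta1 - 0.5)) ** 2 + (6 * (theta2 - 0.4)) ** 2

def random_guessing(num_guesses=10000, seed=0):
    rng = random.Random(seed)
    best_theta, best_val = None, float("inf")
    for _ in range(num_guesses):
        theta = (rng.random(), rng.random())   # 1. pick a random theta
        val = J(*theta)                        # 2. evaluate J(theta)
        if val < best_val:                     # 4. keep the best seen so far
            best_theta, best_val = theta, val
    return best_theta, best_val

theta, val = random_guessing()
print(theta, val)   # a point near (0.5, 0.4) with J near 0
```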

SLIDE 13

Optimization by Random Guessing

Optimization Method #0: Random Guessing
1. Pick a random θ
2. Evaluate J(θ)
3. Repeat steps 1 and 2 many times
4. Return the θ that gives the smallest J(θ)

For Linear Regression:
  • The objective function is Mean Squared Error (MSE): MSE = J(w, b) = J(θ1, θ2)
  • In the contour plot, each level curve is labeled with its MSE – lower means a better fit
  • The minimum corresponds to the parameters (w, b) = (θ1, θ2) that best fit the training dataset

SLIDE 14

Linear Regression by Random Guessing

Optimization Method #0: Random Guessing
1. Pick a random θ
2. Evaluate J(θ)
3. Repeat steps 1 and 2 many times
4. Return the θ that gives the smallest J(θ)

[Figure: candidate fits h(x; θ(1)), …, h(x; θ(4)) on a plot of # tourists (thousands) vs. time; the target y = h*(x) is unknown]

For Linear Regression:
  • The target function h*(x) is unknown
  • We only have access to h*(x) through the training examples (x(i), y(i))
  • We want the h(x; θ(t)) that best approximates h*(x)
  • We enable generalization via an inductive bias that restricts the hypothesis class to linear functions

SLIDE 15

Linear Regression by Random Guessing

Optimization Method #0: Random Guessing
1. Pick a random θ
2. Evaluate J(θ)
3. Repeat steps 1 and 2 many times
4. Return the θ that gives the smallest J(θ)

J(θ) = J(θ1, θ2) = (10(θ1 − 0.5))² + (6(θ2 − 0.4))²

[Figure: the four guesses from the table shown both as fits h(x; θ(1)), …, h(x; θ(4)) on the tourists-vs-time data and as points on the contour plot of J]

SLIDE 16

OPTIMIZATION METHOD #1: GRADIENT DESCENT

SLIDE 17

Optimization for ML

Chalkboard:
  – Unconstrained optimization
  – Derivatives
  – Gradient

SLIDE 18

Topographical Maps

SLIDE 19

Topographical Maps

SLIDE 20

Gradients

SLIDE 21

Gradients

These are the gradients that Gradient Ascent would follow.

SLIDE 22

(Negative) Gradients

These are the negative gradients that Gradient Descent would follow.

SLIDE 23

(Negative) Gradient Paths

Shown are the paths that Gradient Descent would follow if it were making infinitesimally small steps.

SLIDE 24

Pros and Cons of Gradient Descent

  • Simple and often quite effective on ML tasks
  • Often very scalable
  • Only applies to smooth (differentiable) functions
  • Might find a local minimum rather than a global one

Slide courtesy of William Cohen

SLIDE 25

Gradient Descent

Chalkboard:
  – Gradient Descent Algorithm
  – Details: starting point, stopping criterion, line search

SLIDE 26

Gradient Descent

Algorithm 1: Gradient Descent
  1: procedure GD(D, θ(0))
  2:   θ ← θ(0)
  3:   while not converged do
  4:     θ ← θ − λ∇θJ(θ)
  5:   return θ

In order to apply GD to Linear Regression, all we need is the gradient of the objective function (i.e. the vector of partial derivatives):

∇θJ(θ) = [ dJ(θ)/dθ1, dJ(θ)/dθ2, …, dJ(θ)/dθN ]ᵀ
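A minimal Python sketch of Algorithm 1 (the helper names, the fixed step size λ, and the tolerance are illustrative assumptions, not from the slides):

```python
import numpy as np

def gradient_descent(grad_J, theta0, lam=0.1, tol=1e-6, max_iters=10000):
    """Repeatedly step opposite the gradient until converged.

    grad_J: function returning the gradient vector of J at theta
    lam:    step size (the lambda in the update rule)
    """
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iters):
        g = grad_J(theta)
        if np.linalg.norm(g) < tol:   # converged: gradient near zero
            break
        theta = theta - lam * g       # theta <- theta - lam * grad J(theta)
    return theta

# The running example J(θ) = (10(θ1 − 0.5))² + (6(θ2 − 0.4))² has gradient
# [200(θ1 − 0.5), 72(θ2 − 0.4)], so GD should land near (0.5, 0.4).
grad = lambda t: np.array([200 * (t[0] - 0.5), 72 * (t[1] - 0.4)])
print(gradient_descent(grad, [0.0, 0.0], lam=0.005))   # ≈ [0.5, 0.4]
```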

SLIDE 27

Gradient Descent

Algorithm 1: Gradient Descent
  1: procedure GD(D, θ(0))
  2:   θ ← θ(0)
  3:   while not converged do
  4:     θ ← θ − λ∇θJ(θ)
  5:   return θ

There are many possible ways to detect convergence. For example, we could check whether the L2 norm of the gradient, ||∇θJ(θ)||₂, is below some small tolerance. Alternatively, we could check that the reduction in the objective function from one iteration to the next is small.
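The two stopping criteria above can be sketched as a small helper (a hypothetical function with illustrative tolerances, not from the slides):

```python
import numpy as np

def converged(grad, J_prev, J_curr, grad_tol=1e-6, obj_tol=1e-9):
    """Return True if either stopping criterion fires.

    1. The L2 norm of the gradient is below a small tolerance.
    2. The reduction in the objective from one iteration to the next is small.
    """
    small_gradient = np.linalg.norm(grad) < grad_tol
    small_progress = abs(J_prev - J_curr) < obj_tol
    return small_gradient or small_progress

print(converged(np.array([1e-8, 0.0]), 1.0, 0.5))   # True: gradient norm is tiny
print(converged(np.array([0.3, 0.1]), 1.0, 0.5))    # False: still making progress
```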

SLIDE 28

GRADIENT DESCENT FOR LINEAR REGRESSION

SLIDE 29

Linear Regression as Function Approximation

SLIDE 30

Linear Regression by Gradient Descent

Optimization Method #1: Gradient Descent
1. Pick a random θ
2. Repeat:
   a. Evaluate the gradient ∇J(θ)
   b. Step opposite the gradient
3. Return the θ that gives the smallest J(θ)

J(θ) = J(θ1, θ2) = (10(θ1 − 0.5))² + (6(θ2 − 0.4))²

  t | θ1   | θ2   | J(θ1, θ2)
  1 | 0.01 | 0.02 | 25.2
  2 | 0.30 | 0.12 | 8.7
  3 | 0.51 | 0.30 | 1.5
  4 | 0.59 | 0.43 | 0.2

SLIDE 31

Linear Regression by Gradient Descent

Optimization Method #1: Gradient Descent
1. Pick a random θ
2. Repeat:
   a. Evaluate the gradient ∇J(θ)
   b. Step opposite the gradient
3. Return the θ that gives the smallest J(θ)

[Figure: the fits h(x; θ(1)), …, h(x; θ(4)) at iterations t = 1, …, 4 on the tourists-vs-time data, approaching the unknown target y = h*(x)]

SLIDE 32

Linear Regression by Gradient Descent

[Figure: the same four iterates shown both as fits h(x; θ(1)), …, h(x; θ(4)) on the data and as a path on the contour plot of J(θ1, θ2)]

SLIDE 33

Linear Regression by Gradient Descent

[Figure: mean squared error J(θ1, θ2) vs. iteration t, decreasing as the fits h(x; θ(t)) improve]

SLIDE 34

Linear Regression by Gradient Descent

[Figure: the convergence curve of MSE vs. iteration shown alongside the contour plot of J and the sequence of fits]

SLIDE 35

Optimization for Linear Regression

Chalkboard:
  – Computing the gradient for Linear Regression
  – Gradient Descent for Linear Regression

SLIDE 36

Gradient Calculation for Linear Regression

[used by Gradient Descent]
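The derivation itself is done on the chalkboard; as a sketch, assuming the standard MSE objective J(θ) = (1/N) Σᵢ (θᵀx⁽ⁱ⁾ − y⁽ⁱ⁾)², the gradient works out to ∇θJ(θ) = (2/N) Xᵀ(Xθ − y), which a finite-difference check confirms:

```python
import numpy as np

def mse(theta, X, y):
    """J(theta) = (1/N) * sum_i (theta^T x_i - y_i)^2"""
    residuals = X @ theta - y
    return (residuals ** 2).mean()

def mse_gradient(theta, X, y):
    """Vector of partial derivatives: (2/N) * X^T (X theta - y)."""
    N = X.shape[0]
    return (2.0 / N) * X.T @ (X @ theta - y)

# Sanity check: compare against a central finite-difference approximation.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)
theta = rng.normal(size=3)

eps = 1e-6
fd = np.array([
    (mse(theta + eps * e, X, y) - mse(theta - eps * e, X, y)) / (2 * eps)
    for e in np.eye(3)
])
print(np.allclose(mse_gradient(theta, X, y), fd, atol=1e-5))   # True
```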

SLIDE 37

GD for Linear Regression

Gradient Descent for Linear Regression repeatedly takes steps opposite the gradient of the objective function.

SLIDE 38

CONVEXITY

SLIDE 39

Convexity

SLIDE 40

Convexity

Convex Function
  • Each local minimum is a global minimum

Nonconvex Function
  • A nonconvex function is not convex
  • Each local minimum is not necessarily a global minimum

SLIDE 41

Convexity

Each local minimum of a convex function is also a global minimum. A strictly convex function has a unique global minimum.

SLIDE 42

CONVEXITY AND LINEAR REGRESSION

SLIDE 43

Convexity and Linear Regression

The Mean Squared Error function, which we minimize for learning the parameters of Linear Regression, is convex! …but in the general case it is not strictly convex.
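One way to see this (a sketch, not from the slides): the Hessian of the MSE is (2/N) XᵀX, which is always positive semidefinite, so MSE is convex; it is positive definite (strict convexity) only when X has full column rank. A rank-deficient design, e.g. with a duplicated feature column, produces a flat direction:

```python
import numpy as np

def mse_hessian(X):
    """Hessian of J(theta) = (1/N)||X theta - y||^2 is (2/N) X^T X (independent of theta)."""
    N = X.shape[0]
    return (2.0 / N) * X.T @ X

rng = np.random.default_rng(1)

# Full column rank -> all eigenvalues positive -> strictly convex
X_full = rng.normal(size=(10, 3))
eig_full = np.linalg.eigvalsh(mse_hessian(X_full))

# Duplicate a column -> rank deficient -> a zero eigenvalue -> convex, not strictly
X_dup = np.column_stack([X_full, X_full[:, 0]])
eig_dup = np.linalg.eigvalsh(mse_hessian(X_dup))

print(eig_full.min() > 0)              # True: strictly convex
print(np.isclose(eig_dup.min(), 0))    # True: flat direction exists
```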

SLIDE 44

Regression Loss Functions

In-Class Exercise: Which of the following could be used as loss functions for training a linear regression model? Select all that apply.

SLIDE 45

Solving Linear Regression

[In-class question and answer; the question content was not captured in extraction]

SLIDE 46

OPTIMIZATION METHOD #2: CLOSED FORM SOLUTION

SLIDE 47

Calculus and Optimization

In-Class Exercise: Plot three functions. [The functions were given on the slide image.]

SLIDE 48

Optimization: Closed Form Solutions

Chalkboard:
  – Zero derivatives
  – Example: 1-D function
  – Example: higher dimensions

SLIDE 49

CLOSED FORM SOLUTION FOR LINEAR REGRESSION

SLIDE 50

Linear Regression as Function Approximation

SLIDE 51

Linear Regression: Closed Form

Optimization Method #2: Closed Form
1. Evaluate the closed-form expression for θMLE
2. Return θMLE

[Figure: the single closed-form fit h(x; θMLE) on the tourists-vs-time data, with θMLE = (0.59, 0.43) marked at the minimum of the contour plot, J ≈ 0.2]

SLIDE 52

Optimization for Linear Regression

Chalkboard:
  – Closed-form (Normal Equations)

SLIDE 53

Computational Complexity of OLS

To solve the Ordinary Least Squares problem we compute θ = (XᵀX)⁻¹Xᵀy. The cost is linear in the number of examples, N, and polynomial in the number of features, M: forming XᵀX costs O(NM²), and solving the resulting M×M system costs O(M³).
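A sketch of the closed-form solve (the helper name is illustrative; solving the normal equations directly rather than forming an explicit inverse is a standard numerical-stability choice, not something the slides prescribe):

```python
import numpy as np

def ols_closed_form(X, y):
    """Solve the normal equations (X^T X) theta = X^T y.

    np.linalg.solve is preferred over computing (X^T X)^{-1} explicitly.
    Cost: O(N M^2) to form X^T X, plus O(M^3) to solve the M x M system.
    """
    return np.linalg.solve(X.T @ X, X.T @ y)

# Sanity check: recover known parameters from noiseless data.
rng = np.random.default_rng(0)
N, M = 100, 3
X = rng.normal(size=(N, M))
true_theta = np.array([2.0, -1.0, 0.5])
y = X @ true_theta

theta_hat = ols_closed_form(X, y)
print(np.allclose(theta_hat, true_theta))   # True
```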

SLIDE 54

Gradient Descent

Cases where we should consider gradient descent:
1. What if we cannot find a closed-form solution?
2. What if we can, but it’s inefficient to compute?
3. What if we can, but it’s numerically unstable to compute?

SLIDE 55

Convergence Curves

[Figure: training MSE vs. epoch for Gradient Descent, SGD, and the closed-form (normal equations) solution; figure adapted from Eric P. Xing]

  • SGD reduces MSE much more rapidly than GD
  • For GD / SGD, the training MSE is initially large due to the uninformed initialization
  • Def: an epoch is a single pass through the training data
    1. For GD, there is only one update per epoch
    2. For SGD, there are N updates per epoch, where N = (# train examples)
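A minimal sketch contrasting the two update schedules on a toy linear-regression problem (function names, step sizes, and data are illustrative assumptions): GD makes one full-gradient update per epoch, while SGD makes N single-example updates, so after the same number of epochs SGD has typically driven the training MSE down further.

```python
import numpy as np

def epoch_gd(theta, X, y, lr):
    """One epoch of GD: a single update using the full-dataset MSE gradient."""
    N = X.shape[0]
    return theta - lr * (2.0 / N) * X.T @ (X @ theta - y)

def epoch_sgd(theta, X, y, lr, rng):
    """One epoch of SGD: N updates, each using one example's gradient."""
    for i in rng.permutation(X.shape[0]):
        grad_i = 2.0 * (X[i] @ theta - y[i]) * X[i]
        theta = theta - lr * grad_i
    return theta

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = X @ np.array([1.0, -2.0])          # noiseless targets for the sanity check
theta_gd = theta_sgd = np.zeros(2)     # uninformed initialization

for _ in range(5):                     # same number of epochs for both
    theta_gd = epoch_gd(theta_gd, X, y, lr=0.1)
    theta_sgd = epoch_sgd(theta_sgd, X, y, lr=0.02, rng=rng)

mse = lambda t: ((X @ t - y) ** 2).mean()
print(mse(theta_sgd) < mse(theta_gd))  # SGD ahead after the same epoch count
```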