Least Mean Squares Regression (Machine Learning) - PowerPoint PPT Presentation


SLIDE 1

Machine Learning

Least Mean Squares Regression

SLIDE 2

Least Squares Method for regression

  • Examples
  • The LMS objective
  • Gradient descent
  • Incremental/stochastic gradient descent


SLIDE 4

What's the mileage?

Suppose we want to predict the mileage of a car from its weight and age.

Weight (×100 lb) x1   Age (years) x2   Mileage
31.5                  6                21
36.2                  2                25
43.1                  ?                18
27.6                  2                30

What we want: A function that can predict mileage using x1 and x2.

SLIDE 5

Linear regression: The strategy

Assumption: The output is a linear function of the inputs:

  Mileage = w0 + w1 x1 + w2 x2

(w0, w1, w2 are the parameters of the model. They are also called weights; collectively, they form a vector.)

Learning: Use the training data to find the best possible value of w.

Prediction: Given the values of x1 and x2 for a new car, use the learned w to predict the Mileage of the new car.

Predicting continuous values using a linear model

SLIDE 7

Linear regression: The strategy

  • Inputs are vectors: x ∈ ℜ^d
  • Outputs are real numbers: y ∈ ℜ
  • We have a training set D = { (x1, y1), (x2, y2), ⋯ }
  • We want to approximate y as
      y = w1 x1 + w2 x2 + ⋯ + wd xd = wᵀx

w is the learned weight vector in ℜ^d.

For simplicity, we will assume that the first feature x1 is always 1. This makes the notation easier.
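The prediction y = wᵀx with a leading 1 feature can be sketched as follows (a minimal illustration; the weight values below are made up, not learned from the mileage data):

```python
import numpy as np

def predict(w, x):
    """Compute y = w^T x after prepending the constant feature 1 to x."""
    x = np.concatenate(([1.0], x))   # first feature is always 1
    return float(np.dot(w, x))

# Hypothetical weights for the mileage example: Mileage = w0 + w1*weight + w2*age
w = np.array([40.0, -0.5, -1.0])
print(predict(w, np.array([31.5, 6.0])))   # 40 - 0.5*31.5 - 1.0*6 = 18.25
```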

SLIDE 8

Examples

[Figure: one-dimensional data, input on the horizontal axis, output y on the vertical axis]

One-dimensional input: predict using y = w1 + w2 x2.

The linear function is not our only choice. We could have tried to fit the data as another polynomial.

Two-dimensional input: predict using y = w1 + w2 x2 + w3 x3.

SLIDE 12

Least Squares Method for regression

  • Examples
  • The LMS objective
  • Gradient descent
  • Incremental/stochastic gradient descent

SLIDE 13

What is the best weight vector?

Question: How do we know which weight vector is the best one for a training set?

For an input (xi, yi) in the training set, the cost of a mistake is:

Define the cost (or loss) for a particular weight vector w to be:

One strategy for learning: Find the w with the least cost on this data.

Sum of squared costs over the training set

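Written out, the "sum of squared costs" objective takes the following standard form (the 1/2 factor is a common convention that simplifies the gradient; the slide's exact scaling is not shown):

```latex
J(\mathbf{w}) \;=\; \frac{1}{2} \sum_{i} \left( y_i - \mathbf{w}^\top \mathbf{x}_i \right)^2
```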

SLIDE 18

Least Mean Squares (LMS) Regression

Learning: minimizing mean squared error

SLIDE 19

Least Mean Squares (LMS) Regression

Different strategies exist for learning by optimization.

  • Gradient descent is a popular algorithm

(For this particular minimization objective, there is also an analytical solution. No need for gradient descent.)

Learning: minimizing mean squared error

SLIDE 20

Least Squares Method for regression

  • Examples
  • The LMS objective
  • Gradient descent
  • Incremental/stochastic gradient descent

SLIDE 21

Gradient descent

General strategy for minimizing a function J(w):

  • Start with an initial guess for w, say w0
  • Iterate till convergence:
    – Compute the gradient of J at wt
    – Update wt to get wt+1 by taking a step in the opposite direction of the gradient

[Figure: J(w) plotted against w; successive iterates w0, w1, w2, w3 move toward the minimum of J]

Intuition: The gradient is the direction of steepest increase in the function. To get to the minimum, go in the opposite direction. We are trying to minimize J(w).

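The update step described above, written as an equation (the standard gradient descent form, consistent with "a step in the opposite direction of the gradient" with learning rate r):

```latex
\mathbf{w}^{t+1} \;=\; \mathbf{w}^{t} \;-\; r \, \nabla J(\mathbf{w}^{t})
```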

SLIDE 27

Gradient descent for LMS

We are trying to minimize J(w).

1. Initialize w0
2. For t = 0, 1, 2, ….
   1. Compute the gradient of J(w) at wt. Call it ∇J(wt)
   2. Update w as follows:

      wt+1 = wt − r ∇J(wt)

r is called the learning rate. (For now, a small constant. We will get to this later.)

What is the gradient of J?

SLIDE 29

Gradient of the cost

We are trying to minimize J(w).

  • The gradient is of the form:
  • Remember that w is a vector with d elements:
    w = [w1, w2, w3, …, wj, …, wd]


SLIDE 35

Gradient of the cost

We are trying to minimize J(w).

  • The gradient is of the form:

One element of the gradient vector: a sum over examples of (Error × Input)
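The "Sum of (Error × Input)" structure comes from differentiating the squared-error objective. Assuming the conventional J(w) = ½ Σᵢ (yᵢ − wᵀxᵢ)², one element of the gradient is:

```latex
\frac{\partial J}{\partial w_j}
  \;=\; \frac{\partial}{\partial w_j}\,\frac{1}{2}\sum_i \left(y_i - \mathbf{w}^\top\mathbf{x}_i\right)^2
  \;=\; -\sum_i \underbrace{\left(y_i - \mathbf{w}^\top\mathbf{x}_i\right)}_{\text{error}}\,\underbrace{x_{ij}}_{\text{input}}
```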

SLIDE 37

Gradient descent for LMS

We are trying to minimize J(w).

1. Initialize w0
2. For t = 0, 1, 2, …. (until total error is below a threshold)
   1. Compute the gradient of J(w) at wt. Call it ∇J(wt)
      (Evaluate the function for each training example to compute the error and construct the gradient vector. One element of ∇J(wt) is the sum of (error × input) for one weight wj.)
   2. Update w as follows:

      wt+1 = wt − r ∇J(wt)

r is called the learning rate. (For now, a small constant. We will get to this later.)

This algorithm is guaranteed to converge to the minimum of J if r is small enough. Why? The objective J is a convex function.
SLIDE 43

Least Squares Method for regression

  • Examples
  • The LMS objective
  • Gradient descent
  • Incremental/stochastic gradient descent

SLIDE 44

Gradient descent for LMS

We are trying to minimize J(w).

1. Initialize w0
2. For t = 0, 1, 2, …. (until total error is below a threshold)
   1. Compute the gradient of J(w) at wt. Call it ∇J(wt)
      (Evaluate the function for each training example to compute the error and construct the gradient vector.)
   2. Update w as follows:

      wt+1 = wt − r ∇J(wt)

The weight vector is not updated until all errors are calculated. Why not make early updates to the weight vector as soon as we encounter errors, instead of waiting for a full pass over the data?

SLIDE 46

Incremental/Stochastic gradient descent

  • Repeat for each example (xi, yi):
    – Pretend that the entire training set is represented by this single example
    – Use this example to calculate the gradient and update the model
  • Contrast with batch gradient descent, which makes one update to the weight vector for every pass over the data

SLIDE 47

Incremental/Stochastic gradient descent

1. Initialize w
2. For t = 0, 1, 2, …. (until error below some threshold)
   – For each training example (xi, yi):
     • Update w. For each element of the weight vector (wj):

       wj ← wj + r (yi − wᵀxi) xij

Contrast with the previous method, where the weights are updated only after all examples are processed once.

This update rule is also called the Widrow-Hoff rule in the neural networks literature.

Online/Incremental algorithms are often preferred when the training set is very large: they may get close to the optimum much faster than the batch version.
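A minimal sketch of the stochastic version with the Widrow-Hoff update (again with assumed toy data and a fixed epoch count in place of an error threshold; visiting examples in sequence is also a choice):

```python
import numpy as np

def lms_sgd(X, y, r=0.01, n_epochs=100):
    """Stochastic gradient descent for LMS: update after every single example."""
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for x_i, y_i in zip(X, y):
            error = y_i - np.dot(w, x_i)   # error on this one example
            w = w + r * error * x_i        # Widrow-Hoff update, no full pass
    return w

# Same toy data generated from y = 1 + 2*x as before
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
print(lms_sgd(X, y, r=0.05, n_epochs=200))  # close to [1. 2.]
```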

SLIDE 51

Learning Rates and Convergence

  • In the general (non-separable) case, the learning rate r must decrease to zero to guarantee convergence
  • The learning rate is also called the step size
    – More sophisticated algorithms choose the step size automatically and converge faster
  • Choosing a better starting point can also have an impact
  • Gradient descent and its stochastic version are very simple algorithms
    – Yet, almost all the algorithms we will learn in the class can be traced back to gradient descent algorithms for different loss functions and different hypothesis spaces

SLIDE 52

Linear regression: Summary

  • What we want: Predict a real-valued output using a feature representation of the input
  • Assumption: The output is a linear function of the inputs
  • Learning by minimizing total cost
    – Gradient descent and stochastic gradient descent to find the best weight vector
    – This particular optimization can also be computed directly by framing the problem as a matrix problem
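The "matrix problem" route mentioned above is the standard normal-equations solution for least squares. A sketch under that assumption (the data here is made up; for ill-conditioned X, np.linalg.lstsq is numerically safer than forming XᵀX):

```python
import numpy as np

# Rows of X are feature vectors (first column is the constant 1); Y holds outputs.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
Y = np.array([1.0, 3.0, 5.0, 7.0])

# Normal equations: w* = (X^T X)^{-1} X^T Y minimizes ||X w - Y||^2.
w_star = np.linalg.solve(X.T @ X, X.T @ Y)
print(w_star)  # the data is exactly linear, so w* recovers [1. 2.]
```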

SLIDE 53

Exercises

1. Use the gradient descent algorithm to solve the mileage problem (on paper, or write a small program).
2. LMS regression can be solved analytically. Given a dataset D = { (x1, y1), (x2, y2), …, (xm, ym) }, define matrix X and vector Y as follows: (definitions not shown). Show that the optimization problem we saw earlier is equivalent to (equation not shown). This can be solved analytically. Show that the solution w* is (equation not shown).

Hint: You have to take the derivative of the objective with respect to the vector w and set it to zero.