SLIDE 1

Regression

Practical Machine Learning Fabian Wauthier 09/10/2009

Adapted from slides by Kurt Miller and Romain Thibaux

1
SLIDE 2

Outline

  • Ordinary Least Squares Regression
    – Online version
    – Normal equations
    – Probabilistic interpretation
  • Overfitting and Regularization
  • Overview of additional topics
    – L1 Regression
    – Quantile Regression
    – Generalized linear models
    – Kernel Regression and Locally Weighted Regression

2
SLIDE 3

Outline

  • Ordinary Least Squares Regression
    – Online version
    – Normal equations
    – Probabilistic interpretation
  • Overfitting and Regularization
  • Overview of additional topics
    – L1 Regression
    – Quantile Regression
    – Generalized linear models
    – Kernel Regression and Locally Weighted Regression

3
SLIDE 4

Regression vs. Classification

X → Y

X can be anything:
  • continuous (ℝ, ℝᵈ, …)
  • discrete ({0,1}, {1,…,k}, …)
  • structured (tree, string, …)

Classification: Y is discrete:
  – {0,1}: binary
  – {1,…,k}: multi-class
  – tree, etc.: structured

4
SLIDE 5

Regression vs. Classification

X → Y

X can be anything:
  • continuous (ℝ, ℝᵈ, …)
  • discrete ({0,1}, {1,…,k}, …)
  • structured (tree, string, …)

Classification methods: Perceptron, Logistic Regression, Support Vector Machine, Decision Tree, Random Forest, kernel trick

5
SLIDE 6

Regression vs. Classification

X → Y

X can be anything:
  • continuous (ℝ, ℝᵈ, …)
  • discrete ({0,1}, {1,…,k}, …)
  • structured (tree, string, …)

Regression: Y is continuous:
  – ℝ, ℝᵈ

6
SLIDE 7

Examples

  • Voltage ⇒ Temperature
  • Processes, memory ⇒ Power consumption
  • Protein structure ⇒ Energy
  • Robot arm controls ⇒ Torque at effector
  • Location, industry, past losses ⇒ Premium

7
SLIDE 8

Linear regression

Given examples (xᵢ, yᵢ), i = 1, …, n, predict y_{n+1} given a new point x_{n+1}.

[Figure: table of example (x, y) pairs and a scatter plot with a fitted line]

8
SLIDE 9

Linear regression

We wish to estimate ŷ by a linear function of our data x:

ŷ_{n+1} = w₀ + w₁x_{n+1,1} + w₂x_{n+1,2} = w⊤x_{n+1}

where w is a parameter to be estimated and we have used the standard convention of letting the first component of x be 1.

9
SLIDE 10

Choosing the regressor

Of the many regression fits that approximate the data, which should we choose?

Xᵢ = (1, xᵢ)⊤

[Figure: observations with several candidate regression lines]

10
SLIDE 11

LMS Algorithm (Least Mean Squares)

In order to clarify what we mean by a good choice of w, we will define a cost function for how well we are doing on the training data:

Cost = (1/2) Σᵢ₌₁ⁿ (w⊤xᵢ − yᵢ)²,  Xᵢ = (1, xᵢ)⊤

Each term is the error or "residual" between the prediction and the observation.

[Figure: observations, fitted line, and the residuals between them]

11
SLIDE 12

LMS Algorithm (Least Mean Squares)

The best choice of w is the one that minimizes our cost function:

E = (1/2) Σᵢ₌₁ⁿ (w⊤xᵢ − yᵢ)² = Σᵢ₌₁ⁿ Eᵢ

In order to optimize this equation, we use standard gradient descent:

w_{t+1} := w_t − α ∂E/∂w

where

∂E/∂w = Σᵢ₌₁ⁿ ∂Eᵢ/∂w  and  ∂Eᵢ/∂w = (1/2) ∂/∂w (w⊤xᵢ − yᵢ)² = (w⊤xᵢ − yᵢ)xᵢ

12
SLIDE 13

LMS Algorithm (Least Mean Squares)

The LMS algorithm is an online method that performs the following update for each new data point i:

w_{t+1} := w_t − α ∂Eᵢ/∂w = w_t + α(yᵢ − xᵢ⊤w)xᵢ

13
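To make the update concrete, here is a minimal NumPy sketch of the online LMS rule; the function name, learning rate, and toy data stream are illustrative choices, not from the slides:

```python
import numpy as np

def lms_update(w, x, y, alpha):
    """One online LMS step: w <- w + alpha * (y - x^T w) * x."""
    return w + alpha * (y - x @ w) * x

# Toy stream from y = 1 + 2x plus noise; w should approach [1, 2].
rng = np.random.default_rng(0)
w = np.zeros(2)
for _ in range(5000):
    x1 = rng.uniform(0, 10)
    x = np.array([1.0, x1])                        # first component fixed to 1
    y = 1.0 + 2.0 * x1 + rng.normal(scale=0.1)
    w = lms_update(w, x, y, alpha=0.005)
print(w)                                           # close to [1, 2]
```

A small α keeps the iterates stable; too large an α makes the online updates diverge.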
SLIDE 14

LMS, Logistic regression, and Perceptron updates

  • LMS: w_{t+1} := w_t + α(yᵢ − xᵢ⊤w)xᵢ
  • Logistic Regression: w_{t+1} := w_t + α(yᵢ − f_w(xᵢ))xᵢ
  • Perceptron: w_{t+1} := w_t + α(yᵢ − f_w(xᵢ))xᵢ

All three updates have the same form; they differ only in the prediction f_w(xᵢ).

14
SLIDE 15

Ordinary Least Squares (OLS)

Cost = (1/2) Σᵢ₌₁ⁿ (w⊤xᵢ − yᵢ)²,  Xᵢ = (1, xᵢ)⊤

Each term is the error or "residual" between the prediction and the observation.

[Figure: observations, fitted line, and the residuals between them]

15
SLIDE 16

Minimize the sum squared error

E = (1/2) Σᵢ₌₁ⁿ (w⊤xᵢ − yᵢ)²
  = (1/2)(Xw − y)⊤(Xw − y)
  = (1/2)(w⊤X⊤Xw − 2y⊤Xw + y⊤y)

where X is the n × d matrix of stacked examples. Then

∂E/∂w = X⊤Xw − X⊤y

Setting the derivative equal to zero gives us the Normal Equations:

X⊤Xw = X⊤y  ⇒  w = (X⊤X)⁻¹X⊤y

16
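As a sketch (on made-up data), the normal equations can be solved directly in NumPy; np.linalg.lstsq does the same job with better numerical behavior:

```python
import numpy as np

# Made-up 1-D data with a bias column, so X is n x 2.
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=50)
X = np.column_stack([np.ones_like(x), x])          # rows are X_i = (1, x_i)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=50)

w = np.linalg.solve(X.T @ X, X.T @ y)              # normal equations
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)    # equivalent, more stable
print(w, w_lstsq)                                  # both near [1, 2]
```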
SLIDE 17

A geometric interpretation

We solved

∂E/∂w = X⊤(Xw − y) = 0

⇒ the residuals are orthogonal to the columns of X
⇒ ŷ = Xw gives the best reconstruction of y in the range of X

17
SLIDE 18

[Figure: the subspace S spanned by the columns [X]₁, [X]₂ of X, with y and its projection y′]

y′ is an orthogonal projection of y onto the subspace S spanned by the columns of X. The residual vector y − y′ is orthogonal to S.

18
SLIDE 19

Computing the solution

We compute

w = (X⊤X)⁻¹X⊤y

If X⊤X is invertible, then (X⊤X)⁻¹X⊤ coincides with the pseudoinverse X⁺ of X.

If X⊤X is not invertible, there is no unique solution w. In that case

w = X⁺y

chooses the solution with smallest Euclidean norm, and that solution is unique.

An alternative way to deal with a non-invertible X⊤X is to add a small portion of the identity matrix (= Ridge regression).

19
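A small illustration of the non-invertible case (the duplicated column is contrived for the example): np.linalg.pinv computes X⁺ and returns the minimum-norm solution.

```python
import numpy as np

# Two identical columns make X^T X singular, so (X^T X)^{-1} does not exist.
X = np.array([[1.0, 2.0, 2.0],
              [1.0, 3.0, 3.0],
              [1.0, 4.0, 4.0]])
y = np.array([5.0, 7.0, 9.0])

w = np.linalg.pinv(X) @ y   # w = X^+ y, the smallest-norm solution
print(w)                    # the duplicated columns share the weight equally
print(X @ w)                # still reproduces y
```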
SLIDE 20

Beyond lines and planes

Linear models become powerful function approximators when we consider non-linear feature transformations.

All the math is the same! Predictions are still linear in X!

[Figure: a curved fit through (x, y) data]

20
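For instance, a quadratic fit is plain OLS on transformed features; the data below is synthetic, for illustration only:

```python
import numpy as np

# Fit y ~ w0 + w1*x + w2*x^2 by running ordinary least squares on the
# expanded features [1, x, x^2]; the solver itself is unchanged.
rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=40)
y = 3.0 - 1.0 * x + 0.5 * x**2 + rng.normal(scale=1.0, size=40)

X = np.column_stack([np.ones_like(x), x, x**2])
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w)   # roughly [3, -1, 0.5]
```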
SLIDE 21

Geometric interpretation

ŷ = w₀ + w₁x + w₂x²

[Matlab demo]

[Figure: quadratic fit to the data]

21
SLIDE 22

Ordinary Least Squares [summary]

Given examples (xᵢ, yᵢ), i = 1, …, n. Let X be the n × d matrix of stacked feature vectors, for example Xᵢ = (1, xᵢ)⊤. Minimize E = (1/2) Σᵢ₌₁ⁿ (w⊤Xᵢ − yᵢ)² by solving the Normal Equations X⊤Xw = X⊤y. Predict ŷ = w⊤x for a new point x.

22
SLIDE 23

Probabilistic interpretation

Likelihood: yᵢ = w⊤xᵢ + noise, with the noise Gaussian, i.e. p(yᵢ|xᵢ, w) = N(yᵢ | w⊤xᵢ, σ²). Maximizing this likelihood over w recovers the least squares fit.

[Figure: data with the fitted regression line]

23
SLIDE 24

[Figure: conditional Gaussians p(y|x) whose mean µ varies linearly with x; example means µ = 3, 5, 8]

24
SLIDE 25

BREAK

25
SLIDE 26

Outline

  • Ordinary Least Squares Regression
    – Online version
    – Normal equations
    – Probabilistic interpretation
  • Overfitting and Regularization
  • Overview of additional topics
    – L1 Regression
    – Quantile Regression
    – Generalized linear models
    – Kernel Regression and Locally Weighted Regression

26
SLIDE 27

Overfitting

  • So the more features the better? NO!
  • Carefully selected features can improve model accuracy.
  • But adding too many can lead to overfitting.
  • Feature selection will be discussed in a separate lecture.

27
SLIDE 28

Overfitting

Degree 15 polynomial

[Matlab demo]

[Figure: a degree 15 polynomial fit passing through the training points]

28
SLIDE 29

Ridge Regression (Regularization)

Minimize (1/2) Σᵢ₌₁ⁿ (w⊤xᵢ − yᵢ)² + (ε/2)||w||² with "small" ε by solving

(X⊤X + εI)w = X⊤y

[Continue Matlab demo]

[Figure: effect of regularization on a degree 19 polynomial fit]

29
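A minimal sketch of the ridge solve; the function name and default ε are illustrative:

```python
import numpy as np

def ridge_fit(X, y, eps=1e-2):
    """Solve (X^T X + eps*I) w = X^T y; eps > 0 makes the system invertible."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + eps * np.eye(d), X.T @ y)
```

Larger ε shrinks w harder toward zero; as ε → 0 the solution approaches the OLS fit.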
SLIDE 30

Probabilistic interpretation

Likelihood, Prior, Posterior:

P(w|X, y) = P(w, x₁, …, xₙ, y₁, …, yₙ) / P(x₁, …, xₙ, y₁, …, yₙ)
          ∝ P(w, x₁, …, xₙ, y₁, …, yₙ)
          ∝ exp(−(ε/2σ²)||w||²) ∏ᵢ exp(−(1/2σ²)(Xᵢ⊤w − yᵢ)²)
          = exp(−(1/2σ²)[ε||w||² + Σᵢ (Xᵢ⊤w − yᵢ)²])

Maximizing this posterior over w is therefore equivalent to minimizing the ridge objective ε||w||² + Σᵢ (Xᵢ⊤w − yᵢ)².

30
SLIDE 31

Outline

  • Ordinary Least Squares Regression
    – Online version
    – Normal equations
    – Probabilistic interpretation
  • Overfitting and Regularization
  • Overview of additional topics
    – L1 Regression
    – Quantile Regression
    – Generalized linear models
    – Kernel Regression and Locally Weighted Regression

31
SLIDE 32

Errors in Variables (Total Least Squares)

32
SLIDE 33

Sensitivity to outliers

High weight is given to outliers, since the squared error penalizes them quadratically.

[Figure: "Temperature at noon" data where an outlier drags the OLS fit; influence function of the squared error]

33
SLIDE 34

L1 Regression

Minimize the sum of absolute residuals, Σᵢ |w⊤xᵢ − yᵢ|, which can be posed as a linear program.

[Matlab demo]

[Figure: influence function of the absolute error, which is bounded]

34
SLIDE 35

Quantile Regression

[Figure: CPU utilization [MHz] vs. workload (ViewItem.php) [req/s], with fits of the mean CPU and the 95th percentile of CPU]

Slide courtesy of Peter Bodik

35
SLIDE 36

Generalized Linear Models

Probabilistic interpretation of OLS: the mean of the Gaussian conditional is linear in Xᵢ.

OLS: linearly predict the mean of a Gaussian conditional.
GLM: predict the mean of some other conditional density:

yᵢ|xᵢ ∼ p(f(Xᵢ⊤w))

We may need to transform the linear prediction by f(·) to produce a valid parameter.

36
SLIDE 37

Example: "Poisson regression"

Suppose the data are event counts: y ∈ ℕ₀.

Typical distribution for count data: Poisson

Poisson(y|λ) = e^{−λ}λ^y / y!

The mean parameter is λ > 0, so say we predict λ = f(x⊤w) = exp(x⊤w).

GLM: yᵢ|xᵢ ∼ Poisson(f(Xᵢ⊤w))

37
SLIDE 38

[Figure: conditional Poissons p(y|x) whose mean λ varies with x; example means λ = 3, 5, 8]

38
SLIDE 39

Poisson regression: learning

As for OLS: optimize w by maximizing the likelihood of the data. Equivalently: maximize the log likelihood.

Likelihood:

L = ∏ᵢ Poisson(yᵢ | f(Xᵢ⊤w))

Log likelihood:

ℓ = Σᵢ [Xᵢ⊤w yᵢ − exp(Xᵢ⊤w)] + const.

Batch gradient:

∂ℓ/∂w = Σᵢ (yᵢ − exp(Xᵢ⊤w))Xᵢ = Σᵢ (yᵢ − f(Xᵢ⊤w))Xᵢ

where yᵢ − f(Xᵢ⊤w) plays the role of the "residual".

39
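Putting the gradient to work, here is a sketch of batch gradient ascent on the Poisson log likelihood; the step size, iteration count, and synthetic data are illustrative choices:

```python
import numpy as np

def poisson_fit(X, y, alpha=1e-4, steps=5000):
    """Gradient ascent on l(w) using dl/dw = sum_i (y_i - exp(X_i^T w)) X_i."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w += alpha * (X.T @ (y - np.exp(X @ w)))   # "residual" times inputs
    return w

# Synthetic counts with a log-linear mean exp(0.5 + 1.0*x).
rng = np.random.default_rng(3)
x = rng.uniform(0, 2, size=200)
X = np.column_stack([np.ones_like(x), x])
y = rng.poisson(np.exp(0.5 + 1.0 * x))
print(poisson_fit(X, y))    # near [0.5, 1.0]
```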
SLIDE 40

LMS, Logistic regression, Perceptron and GLM updates

  • GLM (online): w_{t+1} := w_t + α(yᵢ − f_w(xᵢ))xᵢ
  • LMS: w_{t+1} := w_t + α(yᵢ − xᵢ⊤w)xᵢ
  • Logistic Regression: w_{t+1} := w_t + α(yᵢ − f_w(xᵢ))xᵢ
  • Perceptron: w_{t+1} := w_t + α(yᵢ − f_w(xᵢ))xᵢ

40
SLIDE 41

Kernel Regression and Locally Weighted Linear Regression

  • Kernel Regression: take a very very conservative function approximator called AVERAGING and locally weight it.
  • Locally Weighted Linear Regression: take a conservative function approximator called LINEAR REGRESSION and locally weight it.

Slide from Paul Viola 2003

41
SLIDE 42

Kernel Regression

[Figure: kernel regression fit, sigma = 1]

42
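In code, kernel regression is just a locally weighted average. The following sketch uses a Gaussian kernel; the bandwidth σ, names, and toy data are illustrative:

```python
import numpy as np

def kernel_regression(x_query, x_train, y_train, sigma=1.0):
    """Nadaraya-Watson: prediction is a kernel-weighted average of the y's."""
    k = np.exp(-(x_train - x_query) ** 2 / (2.0 * sigma ** 2))
    return np.sum(k * y_train) / np.sum(k)

# Toy usage on a noisy sine wave.
x_train = np.linspace(0, 10, 50)
y_train = np.sin(x_train) + np.random.default_rng(4).normal(scale=0.2, size=50)
print(kernel_regression(5.0, x_train, y_train, sigma=1.0))  # near sin(5)
```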
SLIDE 43

Locally Weighted Linear Regression (LWR)

OLS cost function:

E = (1/2) Σᵢ₌₁ⁿ (w⊤xᵢ − yᵢ)²

LWR cost function:

E′ = Σᵢ₌₁ⁿ k(xᵢ − x)(w⊤xᵢ − yᵢ)²

[Matlab demo]

[Figure: kernel regression fit, sigma = 1]

43
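And a corresponding sketch of LWR: at each query point, solve the kernel-weighted least squares problem above with a local line as the model. The Gaussian kernel, bandwidth, and names are illustrative:

```python
import numpy as np

def lwr_predict(x_query, x_train, y_train, sigma=1.0):
    """Minimize E' = sum_i k(x_i - x)(w^T X_i - y_i)^2, then predict at x."""
    k = np.exp(-(x_train - x_query) ** 2 / (2.0 * sigma ** 2))  # local weights
    X = np.column_stack([np.ones_like(x_train), x_train])       # X_i = (1, x_i)
    A = X.T @ (k[:, None] * X)                                  # X^T K X
    b = X.T @ (k * y_train)                                     # X^T K y
    w = np.linalg.solve(A, b)                                   # weighted normal eqs
    return w[0] + w[1] * x_query
```

Unlike plain kernel regression, the local model is a line, so LWR can follow slopes near the query point instead of flattening them.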
SLIDE 44

Heteroscedasticity

[Figure: #requests per minute vs. time (days); the spread of the observations changes over time]

44
SLIDE 45

What we covered

  • Ordinary Least Squares Regression
    – Online version
    – Normal equations
    – Probabilistic interpretation
  • Overfitting and Regularization
  • Overview of additional topics
    – L1 Regression
    – Quantile Regression
    – Generalized linear models
    – Kernel Regression and Locally Weighted Regression

45