SLIDE 1

Chapter 4: Training Regression Models

  • Dr. Xudong Liu

Assistant Professor, School of Computing, University of North Florida. Monday, 10/14/2019.

SLIDE 2

Overview

  • Linear regression
      • Normal equation
      • Gradient descent
          • Batch gradient descent
          • Stochastic gradient descent
          • Mini-batch gradient descent
  • Polynomial regression
  • Regularization for linear models
      • Ridge regression
      • Lasso regression
      • Elastic Net
  • Logistic regression
  • Softmax regression

SLIDE 3

Linear Regression Model
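For reference, the linear regression model (in the notation used on the later slides) predicts:

$\hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n = \theta^T \cdot x$

where n is the number of features, $x_i$ is the i-th feature value (with $x_0 = 1$), and the $\theta_j$ are the model parameters.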

SLIDE 4

Linear Regression Model

SLIDE 5

Learning Linear Regression Models

We try to learn θ so that the following MSE cost function is minimized.
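Written out (it also appears on slide 7), the MSE cost function of a linear model over m training examples is:

$\mathrm{MSE}(\theta) = \frac{1}{m}\sum_{i=1}^{m}\left(\theta^T \cdot X^{(i)} - y^{(i)}\right)^2$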

SLIDE 6

Normal Equation

To find the θ that minimizes the cost function, we can apply the following closed-form solution (the normal equation): $\hat{\theta} = (X^T \cdot X)^{-1} \cdot X^T y$. How is it derived?

SLIDE 7

Normal Equation

Under this cost function, we are looking for a line that minimizes the sum of the squared vertical distances from the data points to the line. The cost function is convex, so all we need to do is compute the cost function's partial derivative w.r.t. θ and set it to 0 to solve for θ. The θ that minimizes $\frac{1}{m}\sum_{i=1}^{m}(\theta^T \cdot X^{(i)} - y^{(i)})^2$ also minimizes $\sum_{i=1}^{m}(\theta^T \cdot X^{(i)} - y^{(i)})^2$.

SLIDE 8

Normal Equation

  1. $\sum_{i=1}^{m}(\theta^T \cdot X^{(i)} - y^{(i)})^2 = (y - X\theta)^T(y - X\theta) = E_\theta$

  2. $\frac{\partial E_\theta}{\partial \theta} = (-X)^T(y - X\theta) + (-X)^T(y - X\theta)$, because $\frac{\partial u^T v}{\partial \theta} = \frac{\partial u^T}{\partial \theta}\,v + \frac{\partial v^T}{\partial \theta}\,u$

  3. Thus, $\frac{\partial E_\theta}{\partial \theta} = 2X^T(X\theta - y)$

  4. Setting $\frac{\partial E_\theta}{\partial \theta} = 0$, we can solve for $\theta = (X^T \cdot X)^{-1} \cdot X^T y$

SLIDE 9

Normal Equation Experiment
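A minimal NumPy sketch of such an experiment, assuming synthetic data generated as y = 4 + 3x + Gaussian noise (the data-generating coefficients and sample size here are made up for illustration):

    import numpy as np

    # Synthetic data: y = 4 + 3x + Gaussian noise (made-up coefficients for illustration)
    rng = np.random.default_rng(42)
    m = 100
    X = 2 * rng.random((m, 1))
    y = 4 + 3 * X + rng.normal(size=(m, 1))

    # Add x0 = 1 to every example so that theta_0 acts as the bias term
    X_b = np.c_[np.ones((m, 1)), X]

    # Normal equation: theta = (X^T X)^{-1} X^T y
    theta_hat = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y
    print(theta_hat)  # should be close to [[4.], [3.]]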

SLIDE 10

Normal Equation Complexity

Using the normal equation takes time roughly $O(mn^2 + n^3)$, where n is the number of features and m is the number of examples: forming $X^T X$ costs about $O(mn^2)$, and inverting it costs about $O(n^3)$. So it scales well with a large number of examples, but poorly with a large number of features. Now let's look at other ways of learning θ that may be better when there are a lot of features or too many training examples to fit in memory.

SLIDE 11

Gradient Descent

The general idea is to tweak parameters iteratively in order to minimize a cost function. Analogy: suppose you are lost in the mountains in a dense fog and can only feel the slope of the ground below your feet. A good strategy to get to the bottom quickly is to go downhill in the direction of the steepest slope. The size of each step taken while walking downhill is called the learning rate.

SLIDE 12

Learning Rate Too Small

SLIDE 13

Learning Rate Too Big

SLIDE 14

Gradient Descent Pitfalls

SLIDE 15

Gradient Descent with Feature Scaling

When all features have a similar scale, GD tends to converge quickly.
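A minimal sketch of feature scaling with sklearn's StandardScaler before running a gradient-descent-based regressor (the data below is an assumed placeholder):

    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import SGDRegressor

    # Placeholder data with two features on very different scales
    X = np.array([[1.0, 2000.0], [2.0, 3000.0], [3.0, 1000.0], [4.0, 4000.0]])
    y = np.array([1.0, 2.0, 3.0, 4.0])

    # Standardize every feature to zero mean and unit variance
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    # A gradient-descent-based regressor converges more easily on the scaled features
    sgd_reg = SGDRegressor(max_iter=1000)
    sgd_reg.fit(X_scaled, y)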

SLIDE 16

Batch Gradient Descent

The gradient vector of the MSE cost function is $\nabla_\theta \mathrm{MSE}(\theta) = \frac{2}{m}X^T(X\theta - y)$, which is something we already computed for the normal equation! Gradient descent step: $\theta^{(\text{next step})} = \theta - \eta\,\nabla_\theta \mathrm{MSE}(\theta)$.
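A minimal NumPy sketch of batch gradient descent using this update rule; the helper name, learning rate, and iteration count are illustrative assumptions, and X_b is assumed to already include the bias column:

    import numpy as np

    def batch_gradient_descent(X_b, y, eta=0.1, n_iterations=1000):
        """Batch GD for linear regression; X_b must include the bias column of 1s."""
        m, n = X_b.shape
        theta = np.zeros((n, 1))                           # start from all-zero parameters
        for _ in range(n_iterations):
            gradients = 2 / m * X_b.T @ (X_b @ theta - y)  # gradient of the MSE cost
            theta = theta - eta * gradients                # one gradient descent step
        return theta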

SLIDE 17

Batch Gradient Descent: Learning Rates

SLIDE 18

Other Gradient Descent Algorithms

  • Batch GD: uses the whole training set to compute the gradient at each step.
  • Stochastic GD: uses one random training example to compute the gradient at each step.
  • Mini-batch GD: uses a small random set of training examples (a mini-batch) to compute the gradient at each step.
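For contrast with the batch version sketched earlier, here is a minimal sketch of stochastic gradient descent that picks one random example per step; the decaying learning schedule and helper name are illustrative assumptions, not from the slides:

    import numpy as np

    def stochastic_gradient_descent(X_b, y, n_epochs=50, t0=5, t1=50):
        """Stochastic GD for linear regression; X_b must include the bias column of 1s."""
        m, n = X_b.shape
        theta = np.zeros((n, 1))
        rng = np.random.default_rng(0)
        for epoch in range(n_epochs):
            for i in range(m):
                idx = rng.integers(m)                        # one random training example
                xi, yi = X_b[idx:idx + 1], y[idx:idx + 1]
                gradients = 2 * xi.T @ (xi @ theta - yi)     # gradient estimate from that example
                eta = t0 / (epoch * m + i + t1)              # simple decaying learning rate (assumed schedule)
                theta = theta - eta * gradients
        return theta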

SLIDE 19

Comparing Linear Regression Algorithms

SLIDE 20

Polynomial Regression

Univariate polynomial: $\hat{y} = \theta_0 + \theta_1 x + \dots + \theta_d x^d$

Multivariate polynomial: more complex. E.g., for degree 2 and 2 variables, $\hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1 x_2 + \theta_4 x_1^2 + \theta_5 x_2^2$.

In general, for degree d and n variables, there are $\binom{n+d}{d} = \frac{(n+d)!}{d!\,n!}$ terms.¹

Clearly, polynomial regression is more general than linear regression and can fit non-linear data.

¹ https://mathoverflow.net/questions/225953/number-of-polynomial-terms-for-certain-degree-and-certain-number-of-variables

SLIDE 21

Polynomial Regression Learning

We may transform the given attributes to obtain additional higher-degree attributes. Then the problem boils down to a linear regression problem. The following data is generated from $y = 0.5x_1^2 + 1.0x_1 + 2.0$ plus Gaussian noise.

SLIDE 22

Polynomial Regression Learning

We first transform $x_1$ into a polynomial feature of degree 2, then fit the data to learn $\hat{y} = 0.56x_1^2 + 0.93x_1 + 1.78$.

The PolynomialFeatures class in sklearn can produce all $\binom{n+d}{d}$ polynomial features.

Of course, in practice we would not know the exact degree from which the data was generated.
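A sketch of this transform-then-fit approach with sklearn's PolynomialFeatures and LinearRegression; the synthetic data mirrors the quadratic example on the previous slide, with sample size and value range assumed for illustration:

    import numpy as np
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.linear_model import LinearRegression

    # Data generated from y = 0.5*x1^2 + 1.0*x1 + 2.0 + Gaussian noise, as on the previous slide
    rng = np.random.default_rng(42)
    m = 100
    X = 6 * rng.random((m, 1)) - 3
    y = 0.5 * X ** 2 + X + 2 + rng.normal(size=(m, 1))

    # Add the degree-2 feature, then fit an ordinary linear regression on the expanded features
    poly = PolynomialFeatures(degree=2, include_bias=False)
    X_poly = poly.fit_transform(X)            # columns: [x1, x1^2]
    lin_reg = LinearRegression().fit(X_poly, y)
    print(lin_reg.intercept_, lin_reg.coef_)  # roughly 2.0 and [1.0, 0.5]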

SLIDE 23

Polynomial Regression Learning: Overfitting

Generally, the higher the polynomial degree, the better the model fits the training data. The danger is overfitting, which makes the model generalize poorly on testing/future data.

SLIDE 24

Learning Curves: Underfitting

Adding more training examples will not help with underfitting. Instead, we need to use a more complex model.

SLIDE 25

Learning Curves: Overfitting

Adding more training examples may help with overfitting. Another way to battle overfitting is regularization.

SLIDE 26

Ridge Regression

To regularize a model is to constrain it: the less freedom it has, the harder it will be for it to overfit. For linear regression, this regularization is typically achieved by constraining the weights (the θ's) of the model. The first way to constrain the weights is Ridge regression, which simply adds $\alpha\frac{1}{2}\sum_{i=1}^{n}\theta_i^2$ to the cost function:

$J(\theta) = \mathrm{MSE}(\theta) + \alpha\frac{1}{2}\sum_{i=1}^{n}\theta_i^2$

Notice θ0 is NOT constrained. Remember to scale the data before using Ridge regression. α is a hyperparameter: a bigger α results in a flatter, smoother model.

SLIDE 27

Ridge Regression: α

SLIDE 28

Ridge Regression

Closed-form solution: $\hat{\theta} = (X^T \cdot X + \alpha A)^{-1} \cdot X^T y$, where A is the (n + 1) × (n + 1) identity matrix except that the top-left cell is 0 (so the bias term is not regularized).

In sklearn: from sklearn.linear_model import Ridge

Gradient for (stochastic) GD: $\nabla_\theta J(\theta) = \frac{2}{m}X^T(X\theta - y) + \alpha\theta$

In sklearn: SGDRegressor(penalty="l2")
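A minimal sketch of both sklearn options; the alpha values and synthetic data are placeholders chosen for illustration:

    import numpy as np
    from sklearn.linear_model import Ridge, SGDRegressor

    rng = np.random.default_rng(0)
    X = rng.random((100, 3))
    y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

    # Ridge regression (sklearn picks a solver for the regularized problem)
    ridge_reg = Ridge(alpha=1.0)
    ridge_reg.fit(X, y)

    # Stochastic GD with an l2 penalty, as mentioned on this slide
    sgd_reg = SGDRegressor(penalty="l2", alpha=0.1, max_iter=1000)
    sgd_reg.fit(X, y)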

SLIDE 29

Lasso Regression

The second way to constrain the weights is Lasso regression. Lasso: least absolute shrinkage and selection operator. It adds an ℓ1 norm, instead of the ℓ2 norm used in Ridge, to the cost function:

$J(\theta) = \mathrm{MSE}(\theta) + \alpha\sum_{i=1}^{n}|\theta_i|$

p-norm: $\|x\|_p = \left(\sum_{i=1}^{n}|x_i|^p\right)^{1/p}$

Lasso tends to completely eliminate the weights of the least important features (i.e., set them to 0), so it automatically performs feature selection. The role of α is the same as the α in Ridge regression.

SLIDE 30

Lasso Regression

SLIDE 31

Lasso Regression

Closed-form solution: does not exist, because J(θ) is not differentiable at $\theta_i = 0$.

(Sub)gradient for (stochastic) GD: $\nabla_\theta J(\theta) = \frac{2}{m}X^T(X\theta - y) + \alpha\,\mathrm{sign}(\theta)$, where $\mathrm{sign}(\theta_i) = -1$ if $\theta_i < 0$; $0$ if $\theta_i = 0$; and $1$ if $\theta_i > 0$.

In sklearn: SGDRegressor(penalty="l1"), or from sklearn.linear_model import Lasso
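As with Ridge, a brief sketch of both sklearn routes; the alpha values and synthetic data (only two informative features) are placeholders:

    import numpy as np
    from sklearn.linear_model import Lasso, SGDRegressor

    rng = np.random.default_rng(0)
    X = rng.random((100, 5))
    y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * rng.normal(size=100)   # only 2 informative features

    lasso_reg = Lasso(alpha=0.1)
    lasso_reg.fit(X, y)
    print(lasso_reg.coef_)      # weights of the uninformative features tend to be driven to 0

    # Stochastic GD with an l1 penalty, as mentioned on this slide
    sgd_reg = SGDRegressor(penalty="l1", alpha=0.1, max_iter=1000)
    sgd_reg.fit(X, y)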

SLIDE 32

Elastic Net

The last way to constrain the weights is Elastic Net, a combination of Ridge and Lasso. It combines both penalty terms:

$J(\theta) = \mathrm{MSE}(\theta) + r\alpha\sum_{i=1}^{n}|\theta_i| + \frac{1-r}{2}\alpha\sum_{i=1}^{n}\theta_i^2$

where r is the mix ratio between Lasso (r = 1) and Ridge (r = 0).

When to use which?

Ridge is a good default. If you suspect some features are not useful, use Lasso or Elastic Net. When there are more features than training examples, prefer Elastic Net.
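A short sketch with sklearn's ElasticNet, where l1_ratio plays the role of the mix ratio r above; the alpha, l1_ratio, and data values are placeholders:

    import numpy as np
    from sklearn.linear_model import ElasticNet

    rng = np.random.default_rng(0)
    X = rng.random((100, 5))
    y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * rng.normal(size=100)

    # l1_ratio corresponds to the mix ratio r: 1.0 is pure Lasso, 0.0 is pure Ridge
    elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5)
    elastic_net.fit(X, y)
    print(elastic_net.coef_)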

SLIDE 33

Logistic Regression

Logistic regression outputs class probabilities, which can be used to predict classes for binary classification problems. Although it is called regression, it is most often used for binary classification. Multi-class regression or classification? Softmax regression (next).

The logistic regression model: $\hat{p} = h_\theta(x) = \sigma(\theta^T \cdot x)$, where $\sigma(t) = \frac{1}{1+e^{-t}}$ is the logistic function.

SLIDE 34

Logistic Regression

Once $\hat{p}$ is computed for example x, it predicts the class $\hat{y}$ to be 0 if $\hat{p} < 0.5$, and 1 otherwise. In other words, it predicts $\hat{y}$ to be 0 if $\theta^T \cdot x$ is negative, and 1 otherwise. Now we design a cost function, first for one single training example. For positive examples, we want the cost to be small when the probability is big; for negative examples, when the probability is small.

Cost function per single training example: $c(\theta) = -\log(\hat{p})$ if $y = 1$; $-\log(1 - \hat{p})$ if $y = 0$.

Overall cost function (log loss): $J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log(\hat{p}^{(i)}) + (1 - y^{(i)})\log(1 - \hat{p}^{(i)})\right]$

SLIDE 35

Logistic Regression

There is no known closed-form equation for minimizing J(θ), but J(θ) is convex, so gradient descent can be used. Log loss partial derivative:

$\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m}\sum_{i=1}^{m}\left(\sigma(\theta^T \cdot x^{(i)}) - y^{(i)}\right)x_j^{(i)}$

This is very similar to linear regression's partial derivative:

$\frac{\partial \mathrm{MSE}(\theta)}{\partial \theta_j} = \frac{2}{m}\sum_{i=1}^{m}\left(\theta^T \cdot x^{(i)} - y^{(i)}\right)x_j^{(i)}$

Like linear regression, we can perform SGD, BGD and MBGD. In sklearn, logistic regression by default uses l2 penalty to regularize.
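A minimal NumPy sketch of batch gradient descent for logistic regression using the partial derivative above; the helper names, learning rate, and iteration count are illustrative assumptions:

    import numpy as np

    def sigmoid(t):
        return 1.0 / (1.0 + np.exp(-t))

    def logistic_regression_gd(X_b, y, eta=0.1, n_iterations=1000):
        """Batch GD for logistic regression; X_b includes the bias column, y holds 0/1 labels."""
        m, n = X_b.shape
        theta = np.zeros(n)
        for _ in range(n_iterations):
            p_hat = sigmoid(X_b @ theta)            # estimated probabilities for all examples
            gradients = X_b.T @ (p_hat - y) / m     # log loss gradient from this slide
            theta = theta - eta * gradients
        return theta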

SLIDE 36

Iris Dataset Example

SLIDE 37

Iris Dataset Example: Decision Boundary

The decision boundary is at a petal width of around 1.6 cm.

Figure: Consider only one feature: petal width
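A sketch of this single-feature experiment with sklearn; in the standard sklearn iris dataset, class 2 is Iris virginica, and the exact boundary value depends on the fitted model:

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression

    iris = load_iris()
    X = iris.data[:, 3:]                    # petal width only
    y = (iris.target == 2).astype(int)      # 1 if Iris virginica, else 0

    log_reg = LogisticRegression()
    log_reg.fit(X, y)

    # Find where the estimated probability first reaches 50%
    X_new = np.linspace(0, 3, 1000).reshape(-1, 1)
    proba = log_reg.predict_proba(X_new)[:, 1]
    print(X_new[proba >= 0.5][0])           # around 1.6 cm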

SLIDE 38

Iris Dataset Example: Decision Boundary

The dashed line is the decision boundary, representing the points where the model estimates a 50% probability.

This line is $\theta_0 + \theta_1 x_1 + \theta_2 x_2 = 0$.

Figure: Consider two features: petal length and sepal length

SLIDE 39

Softmax Regression

Logistic regression by default does not solve multi-class problems, but it can be extended to do so; the extension is called softmax regression. Idea: given an instance x, we first compute a score $s_k(x)$ for each class k, then estimate the probability of each class by applying the softmax function.

Scoring function: $s_k(x) = \theta_k^T \cdot x$

Softmax function: $\hat{p}_k = \frac{e^{s_k(x)}}{\sum_{j=1}^{K} e^{s_j(x)}}$

Prediction: the class with the highest estimated probability, $\hat{y} = \operatorname{argmax}_k \hat{p}_k$
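A brief sketch of softmax regression in sklearn on the iris dataset; the multinomial setting makes LogisticRegression behave as softmax regression, and the hyperparameter C and the feature choice are placeholders:

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression

    iris = load_iris()
    X = iris.data[:, (2, 3)]    # petal length, petal width
    y = iris.target             # three classes

    # Multinomial logistic regression = softmax regression
    softmax_reg = LogisticRegression(multi_class="multinomial", solver="lbfgs", C=10)
    softmax_reg.fit(X, y)

    print(softmax_reg.predict([[5, 2]]))           # predicted class
    print(softmax_reg.predict_proba([[5, 2]]))     # estimated class probabilities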

SLIDE 40

Softmax Regression Cost Function

The cost function for softmax regression is cross entropy. Below, $y_k^{(i)} = 1$ if $y^{(i)} = k$, and 0 otherwise. If K = 2, the cross entropy function is the log loss function. Cross entropy gradient for class k:
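For reference, the standard cross entropy cost for softmax regression and its gradient with respect to $\theta_k$ (presumably the forms intended here) are:

$J(\Theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K} y_k^{(i)}\log\left(\hat{p}_k^{(i)}\right)$

$\nabla_{\theta_k} J(\Theta) = \frac{1}{m}\sum_{i=1}^{m}\left(\hat{p}_k^{(i)} - y_k^{(i)}\right)x^{(i)}$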

SLIDE 41

Iris Dataset Example: Decision Boundary
