SLIDE 1

Linear Regression

Yijun Zhao

Northeastern University

Fall 2016

SLIDE 2

Regression Examples

Any attributes (x) ⇒ continuous value (y):

{age, major, gender, race} ⇒ GPA
{income, credit score, profession} ⇒ loan
{college, major, GPA} ⇒ future income
. . .

SLIDE 3

Regression Examples

Data often comes in, or can be converted into, matrix form:

Age  Gender  Race  Major     GPA
20   0       A     Art       3.85
22   0       C     Engineer  3.90
25   1       A     Engineer  3.50
24   0       AA    Art       3.60
19   1       H     Art       3.70
18   1       C     Engineer  3.00
30   0       AA    Engineer  3.80
25   0       C     Engineer  3.95
28   1       A     Art       4.00
26   0       C     Engineer  3.20

SLIDE 4

Formal Problem Setup

Given N observations {(x_1, y_1), (x_2, y_2), . . . , (x_N, y_N)}, a regression problem tries to uncover the function y_i = f(x_i) ∀i = 1, 2, . . . , N such that, for a new input value x*, we can accurately predict the corresponding value y* = f(x*).

SLIDE 5

Linear Regression

Assume the function f is a linear combination of the components in x.

Formally, let x = (1, x_1, x_2, . . . , x_d)^T. Then

y = ω_0 + ω_1 x_1 + ω_2 x_2 + · · · + ω_d x_d = w^T x

where w = (ω_0, ω_1, ω_2, . . . , ω_d)^T.

w is the parameter to estimate!

Prediction: y* = w^T x*
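
A minimal numerical sketch of this prediction step (NumPy, with made-up weights and feature values):

```python
import numpy as np

# Hypothetical weights w = (w0, w1, w2) and a raw two-feature input.
w = np.array([0.5, 2.0, -1.0])       # w0 = 0.5 is the intercept
x_raw = np.array([3.0, 4.0])

x = np.concatenate(([1.0], x_raw))   # prepend x0 = 1
y_star = w @ x                       # y* = w^T x*
print(y_star)                        # 0.5 + 2.0*3.0 - 1.0*4.0 = 2.5
```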

SLIDE 6

Visual Illustration

Figure: 1D and 2D linear regression

SLIDE 7

Error Measure

Mean Squared Error (MSE):

E(w) = (1/N) Σ_{n=1}^{N} (w^T x_n − y_n)² = (1/N) ‖Xw − y‖²

where X is the matrix whose rows are x_1^T, x_2^T, . . . , x_N^T, and y = (y_1, y_2, . . . , y_N)^T.
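
The two forms of E(w) can be checked against each other numerically; a small sketch with synthetic (made-up) data:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 100, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])  # rows are x_n^T, with x0 = 1
y = rng.normal(size=N)
w = rng.normal(size=d + 1)

# Sum form: (1/N) * sum over n of (w^T x_n - y_n)^2
mse_sum = sum((w @ X[n] - y[n]) ** 2 for n in range(N)) / N

# Matrix form: (1/N) * ||Xw - y||^2
residual = X @ w - y
mse_matrix = (residual @ residual) / N

assert np.isclose(mse_sum, mse_matrix)
```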

SLIDE 8

Minimizing Error Measure

E(w) = (1/N) ‖Xw − y‖²

∇E(w) = (2/N) X^T (Xw − y) = 0

⇒ X^T X w = X^T y

⇒ w = X†y, where X† = (X^T X)^{−1} X^T is the 'pseudo-inverse' of X

SLIDE 9

LR Algorithm Summary

Ordinary Least Squares (OLS) Algorithm

1. Construct the matrix X and the vector y from the dataset {(x_1, y_1), (x_2, y_2), . . . , (x_N, y_N)} (each x includes x_0 = 1): the rows of X are x_1^T, x_2^T, . . . , x_N^T, and y = (y_1, y_2, . . . , y_N)^T.
2. Compute X† = (X^T X)^{−1} X^T.
3. Return w = X†y.
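
A NumPy sketch of these three steps, assuming X^T X is invertible (np.linalg.solve is used rather than forming (X^T X)^{−1} explicitly; it computes the same w more stably):

```python
import numpy as np

def ols_fit(X_raw, y):
    """Fit w by the normal equations; X_raw is (N, d), y is (N,)."""
    N = X_raw.shape[0]
    X = np.hstack([np.ones((N, 1)), X_raw])   # each x includes x0 = 1
    # Solving X^T X w = X^T y is equivalent to w = X† y.
    return np.linalg.solve(X.T @ X, X.T @ y)

# Usage on synthetic data with known true weights:
rng = np.random.default_rng(1)
X_raw = rng.normal(size=(200, 2))
y = 1.0 + 3.0 * X_raw[:, 0] - 2.0 * X_raw[:, 1] + 0.1 * rng.normal(size=200)
print(ols_fit(X_raw, y))   # approximately [1.0, 3.0, -2.0]
```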

SLIDE 10

Gradient Descent

Why? To minimize our target function E(w) by moving downhill in the steepest direction.

SLIDE 11

Gradient Descent

Gradient Descent Algorithm

Initialize the weights w(0) for time t = 0.
For t = 0, 1, 2, . . . do:
  Compute the gradient g_t = ∇E(w(t)).
  Set the direction to move: v_t = −g_t.
  Update w(t+1) = w(t) + η v_t.
  Iterate until it is time to stop.
Return the final weights w.
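
A sketch of this loop applied to the linear-regression error E(w) = (1/N)‖Xw − y‖²; a fixed iteration budget stands in for the unspecified stopping rule:

```python
import numpy as np

def gd_fit(X, y, eta=0.1, num_iters=1000):
    """Gradient descent on E(w) = (1/N)||Xw - y||^2; X already includes x0 = 1."""
    N, D = X.shape
    w = np.zeros(D)                          # initialize w(0)
    for t in range(num_iters):               # fixed budget as the stopping rule
        g = (2.0 / N) * (X.T @ (X @ w - y))  # gradient g_t = ∇E(w(t))
        w = w - eta * g                      # w(t+1) = w(t) + η v_t, with v_t = -g_t
    return w

# With a small enough η this converges to the OLS solution w = X† y.
```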

SLIDE 12

Gradient Descent

How does η affect the algorithm?

Use η = 0.1 (a practical observation).
Use a variable step size: η_t = η ‖∇E‖.

SLIDE 13

OLS or Gradient Descent?

SLIDE 14

Computational Complexity

OLS: O(ND² + D³) (form X^T X, then invert it)
Gradient Descent: O(ND) per iteration

OLS is expensive when D is large!

SLIDE 15

Linear Regression

What is the Probabilistic Interpretation?

SLIDE 16

Normal Distribution

Figure: right-skewed, left-skewed, and random distributions

SLIDE 17

Normal Distribution

mean = median = mode
symmetry about the center

x ∼ N(µ, σ²) ⇒ f(x) = (1 / (σ√(2π))) e^{−(x−µ)² / (2σ²)}
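
A direct transcription of this density as code (a sketch; the names are made up):

```python
import numpy as np

def normal_pdf(x, mu=0.0, sigma=1.0):
    """f(x) = 1/(sigma*sqrt(2*pi)) * exp(-(x - mu)^2 / (2*sigma^2))."""
    return np.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

xs = np.array([-1.0, 0.0, 1.0])
print(normal_pdf(xs))   # symmetric: f(-1) == f(1); the peak sits at the mean
```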

SLIDE 18

Central Limit Theorem

All things bell shaped! Random occurrences over a large population tend to wash out the asymmetry and uniformness of individual events, and a more 'natural' distribution ensues. The name for it is the Normal distribution (the bell curve).

Formal definition: if y_1, . . . , y_n are i.i.d. with mean µ_y and 0 < σ_y² < ∞, then when n is large the distribution of ȳ is well approximated by a normal distribution N(µ_y, σ_y²/n).
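
A quick simulation illustrating the theorem, using uniform draws as the decidedly non-bell-shaped starting distribution (the values of n and the number of trials are arbitrary):

```python
import numpy as np

# Averages of i.i.d. Uniform(0, 1) draws, where mu_y = 1/2 and sigma_y^2 = 1/12.
rng = np.random.default_rng(42)
n, trials = 100, 10_000
ybar = rng.uniform(0, 1, size=(trials, n)).mean(axis=1)

print(ybar.mean())   # ≈ 0.5          = mu_y
print(ybar.var())    # ≈ (1/12) / 100 = sigma_y^2 / n
```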

SLIDE 19

Central Limit Theorem

Example:

SLIDE 20

LR: Probabilistic Interpretation

SLIDE 21

LR: Probabilistic Interpretation

Assume y_i = w^T x_i + ε_i with Gaussian noise ε_i ∼ N(0, σ²). Then

prob(y_i | x_i) = (1 / (√(2π) σ)) e^{−(w^T x_i − y_i)² / (2σ²)}

SLIDE 22

LR: Probabilistic Interpretation

Likelihood of the entire dataset:

L ∝ ∏_i e^{−(w^T x_i − y_i)² / (2σ²)} = e^{−(1/(2σ²)) Σ_i (w^T x_i − y_i)²}

Maximize L ⟺ Minimize Σ_i (w^T x_i − y_i)²
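
A small numerical check of this equivalence (a sketch: SciPy's generic minimizer stands in for any optimizer, and since σ does not affect the argmin, the negative log-likelihood is coded only up to constants):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
X = np.hstack([np.ones((50, 1)), rng.normal(size=(50, 1))])
y = X @ np.array([1.0, 2.0]) + 0.3 * rng.normal(size=50)

# -log L is, up to constants and the fixed factor 1/(2σ²), the sum of squares.
neg_log_likelihood = lambda w: np.sum((X @ w - y) ** 2)

w_mle = minimize(neg_log_likelihood, x0=np.zeros(2)).x
w_ols = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(w_mle, w_ols, atol=1e-4))   # True: MLE coincides with OLS
```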

SLIDE 23

Non-linear Transformation

Linear is limited. Linear models become powerful when we consider non-linear feature transformations:

X_i = (1, x_i, x_i²) ⇒ y_i = ω_0 + ω_1 x_i + ω_2 x_i²
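
A sketch of this transformation in practice: quadratic data fit by ordinary linear regression on the features (1, x_i, x_i²) (the data-generating coefficients are made up):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-2, 2, size=100)
y = 1.0 - 0.5 * x + 2.0 * x ** 2 + 0.1 * rng.normal(size=100)

X = np.column_stack([np.ones_like(x), x, x ** 2])   # X_i = (1, x_i, x_i^2)
w = np.linalg.solve(X.T @ X, X.T @ y)               # plain linear regression
print(w)   # approximately [1.0, -0.5, 2.0]
```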

SLIDE 24

Overfitting

SLIDE 25

Overfitting

How do we know we have overfitted?

E_in: error on the training data
E_out: error on the test data

Example:

SLIDE 26

Overfitting

How to avoid overfitting?

Use more data
Evaluate on a parameter tuning set
Regularization

SLIDE 27

Regularization

Attempts to impose the "Occam's razor" principle: add a penalty term for model complexity. Most commonly used:

L2 regularization (ridge regression) minimizes:
E(w) = ‖Xw − y‖² + λ‖w‖², where λ ≥ 0 and ‖w‖² = w^T w

L1 regularization (LASSO) minimizes:
E(w) = ‖Xw − y‖² + λ‖w‖₁, where λ ≥ 0 and ‖w‖₁ = Σ_{i=1}^{D} |ω_i|

SLIDE 28

Regularization

L2: closed-form solution w = (X^T X + λI)^{−1} X^T y
L1: no closed-form solution; use quadratic programming:
minimize ‖Xw − y‖² s.t. ‖w‖₁ ≤ s
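
The L2 closed form translates directly to code; a sketch (for simplicity it penalizes all weights, including ω_0, whereas in practice the intercept is often left unpenalized):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """L2-regularized least squares: w = (X^T X + λI)^{-1} X^T y."""
    D = X.shape[1]
    # λ = 0 recovers plain OLS; larger λ shrinks the weights toward zero.
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)
```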

SLIDE 29

L2 Regularization Example

SLIDE 30

Model Selection

Which model? A central problem in supervised learning.

Simple models "underfit" the data:
  a constant function
  a linear model applied to quadratic data

Complex models "overfit" the data:
  high-degree polynomials
  a model with hidden logic that fits the data to completion

SLIDE 31

Bias-Variance Trade-off

Consider E[(1/N) Σ_{n=1}^{N} (w^T x_n − y_n)²] and let ŷ = w^T x_n. E[(ŷ − y_n)²] can be decomposed into (reading):

var{noise} + bias² + var{ŷ}

var{noise}: can't be reduced
bias² + var{ŷ} is what counts for prediction
High bias²: model mismatch, often due to "underfitting"
High var{ŷ}: training set and test set mismatch, often due to "overfitting"
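
A simulation sketch of this decomposition (the quadratic target, polynomial degrees, and sample sizes are all made up): refitting a model on many resampled training sets lets us estimate bias² and var{ŷ} directly:

```python
import numpy as np

rng = np.random.default_rng(4)
f = lambda x: 1.0 + 2.0 * x - x ** 2      # true (noise-free) target function
x_test = np.linspace(-1, 1, 50)

def fit_many(degree, trials=500, n=20, noise=0.3):
    """Predictions of a degree-p polynomial refit on many fresh training sets."""
    preds = np.empty((trials, x_test.size))
    for t in range(trials):
        x = rng.uniform(-1, 1, size=n)
        y = f(x) + noise * rng.normal(size=n)
        preds[t] = np.polyval(np.polyfit(x, y, degree), x_test)
    return preds

for degree in (0, 2, 9):
    p = fit_many(degree)
    bias2 = np.mean((p.mean(axis=0) - f(x_test)) ** 2)   # bias^2
    variance = np.mean(p.var(axis=0))                     # var{ŷ}
    print(degree, round(bias2, 4), round(variance, 4))
# Low degree: high bias^2, low variance. High degree: low bias^2, high variance.
```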

SLIDE 32

Bias-Variance Trade-off

Often:
low bias ⇒ high variance
low variance ⇒ high bias

Trade-off:

SLIDE 33

How to choose λ ?

But we still need to pick λ. Use the test set data? NO!

Set aside another evaluation set:
small evaluation set ⇒ inaccurate estimated error
large evaluation set ⇒ small training set

Cross-Validation

SLIDE 34

Cross Validation (CV)

Divide the data into K folds. In turn, train on all but the kth fold, and test on the kth fold.

SLIDE 35

Cross Validation (CV)

How to choose K? Common choices are K = 5, 10, or N (leave-one-out CV, LOOCV). Measure the average performance across the K folds. Cost of computation: K folds × choices of λ.
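
A sketch of K-fold CV used to score one value of λ for ridge regression (X is assumed to already include the constant column; the fold split is a simple random partition):

```python
import numpy as np

def cv_error(X, y, lam, K=5):
    """Average validation MSE of ridge regression across K folds."""
    folds = np.array_split(np.random.default_rng(0).permutation(len(y)), K)
    errors = []
    for k in range(K):
        val = folds[k]                                            # kth fold: test
        trn = np.concatenate([folds[j] for j in range(K) if j != k])  # rest: train
        w = np.linalg.solve(X[trn].T @ X[trn] + lam * np.eye(X.shape[1]),
                            X[trn].T @ y[trn])
        errors.append(np.mean((X[val] @ w - y[val]) ** 2))
    return np.mean(errors)

# Pick the λ with the lowest cross-validated error, e.g.:
# best_lam = min([0.01, 0.1, 1.0, 10.0], key=lambda lam: cv_error(X, y, lam))
```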

SLIDE 36

Learning Curve

A learning curve plots the performance of the algorithm as a function of the size of the training data.
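
A sketch of how such a curve is computed: fit on the first n training points for increasing n and record the test error (it assumes n ≥ D so the normal equations are solvable; plotting the returned errors against the sizes gives the curve):

```python
import numpy as np

def learning_curve(X_train, y_train, X_test, y_test, sizes):
    """Test MSE of OLS fitted on the first n training points, for each n."""
    errors = []
    for n in sizes:                      # each n must satisfy n >= X_train.shape[1]
        Xn, yn = X_train[:n], y_train[:n]
        w = np.linalg.solve(Xn.T @ Xn, Xn.T @ yn)
        errors.append(np.mean((X_test @ w - y_test) ** 2))
    return errors
```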

SLIDE 37

Learning Curve
