Applied Machine Learning: Linear Regression. Siamak Ravanbakhsh, COMP 551 (Fall 2020). PowerPoint PPT Presentation.


SLIDE 1

Applied Machine Learning

Linear Regression

Siamak Ravanbakhsh

COMP 551 (fall 2020)

SLIDE 2

Learning objectives

  • linear model
  • evaluation criteria
  • how to find the best fit
  • geometric interpretation
  • maximum likelihood interpretation

SLIDE 3

Motivation

effect of income inequality on health and social problems

source: http://chrisauld.com/2012/10/07/what-do-we-know-about-the-effect-of-income-inequality-on-health/

History: the method of least squares was invented by Legendre and Gauss (1800s). Gauss, at age 24, used it to predict the future location of Ceres (the largest asteroid in the asteroid belt).

SLIDE 4

Motivation (?)

SLIDE 5

Notation (recall)

each instance: $x^{(n)} \in \mathbb{R}^D$ (one instance), with label $y^{(n)} \in \mathbb{R}$

vectors are assumed to be column vectors:

$x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_D \end{bmatrix} = [x_1, x_2, \ldots, x_D]^\top$

each instance has D features indexed by d; for example, $x_d^{(n)} \in \mathbb{R}$ is the feature d of instance n (a feature)

we assume N instances in the dataset $\mathcal{D} = \{(x^{(n)}, y^{(n)})\}_{n=1}^{N}$

SLIDE 6

Notation (recall)

design matrix: concatenate all instances; each row is a datapoint (one instance), each column is a feature

$X = \begin{bmatrix} x^{(1)\top} \\ x^{(2)\top} \\ \vdots \\ x^{(N)\top} \end{bmatrix} = \begin{bmatrix} x_1^{(1)} & x_2^{(1)} & \cdots & x_D^{(1)} \\ \vdots & \vdots & \ddots & \vdots \\ x_1^{(N)} & x_2^{(N)} & \cdots & x_D^{(N)} \end{bmatrix} \in \mathbb{R}^{N \times D}$

SLIDE 7

Notation

Example: microarray data $X \in \mathbb{R}^{N \times D}$ contains gene expression levels, with rows indexing patients (n) and columns indexing genes (d); the labels (y) can be a {cancer / no cancer} label for each patient

SLIDE 8

Linear model

$f_w(x) = w_0 + w_1 x_1 + \ldots + w_D x_D, \qquad f_w : \mathbb{R}^D \to \mathbb{R}$

$w_0$ is the bias or intercept; $w$ holds the model parameters or weights

assuming a scalar output (we will generalize to a vector later)

simplification: concatenate a 1 to x, so that $x = [1, x_1, \ldots, x_D]^\top$ and $f_w(x) = w^\top x$
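To make the bias trick concrete, here is a minimal NumPy sketch (the data values and weights are made up for illustration): prepending a column of ones to X lets the intercept ride along inside the weight vector.

```python
import numpy as np

# toy data: N = 4 instances, D = 2 features (values are arbitrary)
X = np.array([[1.0, 2.0],
              [0.5, 1.5],
              [2.0, 0.5],
              [1.5, 1.0]])
w0 = 0.3                      # bias / intercept
w = np.array([0.7, -0.2])     # weights for the D features

# explicit form: f_w(x) = w0 + w1*x1 + ... + wD*xD
y_hat_explicit = w0 + X @ w

# bias trick: prepend a column of ones and fold w0 into the weight vector
X_aug = np.hstack([np.ones((X.shape[0], 1)), X])   # N x (D+1)
w_aug = np.concatenate([[w0], w])                  # (D+1,)
y_hat_trick = X_aug @ w_aug

assert np.allclose(y_hat_explicit, y_hat_trick)
```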

SLIDE 9

Loss function

  • objective: find parameters to fit the data, $f_w(x^{(n)}) \approx y^{(n)} \;\; \forall n$
  • minimize a measure of the difference between $\hat{y}^{(n)} = f_w(x^{(n)})$ and $y^{(n)}$

squared error loss (a.k.a. L2 loss), for a single instance (a function of labels):

$L(y, \hat{y}) \triangleq \frac{1}{2} (y - \hat{y})^2$

sum of squared errors cost function, for the whole dataset (the factor of $\frac{1}{2}$ is for future convenience):

$J(w) = \frac{1}{2} \sum_{n=1}^{N} (y^{(n)} - w^\top x^{(n)})^2$

loss (one instance) versus cost (the whole dataset)
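As a quick sketch, the cost above is a couple of lines of NumPy (the function name is ours, not from the slides):

```python
import numpy as np

def cost(w, X, y):
    """Sum-of-squared-errors cost J(w) = 1/2 * sum_n (y^(n) - w^T x^(n))^2."""
    residual = y - X @ w          # vector of per-instance errors
    return 0.5 * residual @ residual
```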

SLIDE 10

Linear Least Squares

Example (D = 1), plus bias (so effectively D = 2!):

$w^* = \arg\min_w \sum_n (y^{(n)} - w^\top x^{(n)})^2, \qquad x = [x_1]$

[figure: datapoints $(x^{(1)}, y^{(1)}), \ldots, (x^{(4)}, y^{(4)})$ in the (x, y) plane, the fitted line $f_{w^*}(x) = w_0^* + w_1^* x_1$, and the residual $y^{(3)} - f(x^{(3)})$ shown as the vertical distance to the line]

SLIDE 11

Linear Least Squares

Example (D = 2), plus bias (so effectively D = 3!):

$w^* = \arg\min_w \sum_n (y^{(n)} - w^\top x^{(n)})^2$

[figure: datapoints in $(x_1, x_2, y)$ space and the fitted plane $f_{w^*}(x) = w_0^* + w_1^* x_1 + w_2^* x_2$]

SLIDE 12

Matrix form

instead of $\hat{y}^{(n)} = w^\top x^{(n)}$ (a $1 \times D$ row times a $D \times 1$ column, giving a scalar in $\mathbb{R}$), use the design matrix to write

$\hat{y} = Xw$, with $X \in \mathbb{R}^{N \times D}$, $\hat{y} \in \mathbb{R}^{N \times 1}$, $w \in \mathbb{R}^{D \times 1}$

Linear least squares:

$\arg\min_w \frac{1}{2} \|y - Xw\|_2^2 = \arg\min_w \frac{1}{2} (y - Xw)^\top (y - Xw)$

i.e., the squared L2 norm of the residual vector

SLIDE 13

Minimizing the cost

the cost function is a smooth function of w; find the minimum by setting the partial derivatives to zero

SLIDE 14

Simple case: D = 1 (no intercept)

model: $f_w(x) = wx$ (both scalar)

cost function: $J(w) = \frac{1}{2} \sum_n (y^{(n)} - w x^{(n)})^2$

derivative: $\frac{dJ}{dw} = \sum_n x^{(n)} (w x^{(n)} - y^{(n)})$

setting the derivative to zero: $w^* = \frac{\sum_n x^{(n)} y^{(n)}}{\sum_n (x^{(n)})^2}$

[figure: the quadratic cost J plotted as a function of w, with the minimum at $w^*$]

this is a global minimum because the cost is smooth and convex (more on convexity later)
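A minimal numeric check of this closed form (toy values chosen arbitrarily): the derivative should vanish at $w^*$.

```python
import numpy as np

# toy 1-D data, no intercept (values are arbitrary)
x = np.array([0.5, 1.0, 2.0, 3.0])
y = np.array([1.1, 1.9, 4.2, 5.8])

# closed form from setting dJ/dw = 0
w_star = np.sum(x * y) / np.sum(x ** 2)

# sanity check: dJ/dw = sum_n x_n (w x_n - y_n) should be ~0 at w_star
dJ = np.sum(x * (w_star * x - y))
print(w_star, dJ)   # dJ is zero up to floating point error
```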

SLIDE 15

Gradient

for a multivariate function $J(w_0, w_1)$: partial derivatives instead of the derivative

$\frac{\partial}{\partial w_1} J(w_0, w_1) \triangleq \lim_{\epsilon \to 0} \frac{J(w_0, w_1 + \epsilon) - J(w_0, w_1)}{\epsilon}$

= the derivative when the other variables are fixed

critical point: all partial derivatives are zero

gradient: the vector of all partial derivatives

$\nabla J(w) = [\frac{\partial}{\partial w_1} J(w), \cdots, \frac{\partial}{\partial w_D} J(w)]^\top$

[figure: two surface plots of J over $(w_0, w_1)$, illustrating a partial derivative and a critical point]
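The limit definition suggests a finite-difference check: the sketch below compares the analytic least squares gradient $X^\top (Xw - y)$ (derived on the next slides) against a numerical approximation. This is our own illustration, not part of the slides.

```python
import numpy as np

def grad(w, X, y):
    """Analytic gradient of J(w) = 1/2 ||y - Xw||^2, i.e. X^T (Xw - y)."""
    return X.T @ (X @ w - y)

def grad_fd(w, X, y, eps=1e-6):
    """Finite-difference approximation: one partial derivative per coordinate."""
    J = lambda v: 0.5 * np.sum((y - X @ v) ** 2)
    g = np.zeros_like(w)
    for d in range(len(w)):
        e = np.zeros_like(w)
        e[d] = eps
        g[d] = (J(w + e) - J(w - e)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = rng.normal(size=5)
w = rng.normal(size=3)
print(np.allclose(grad(w, X, y), grad_fd(w, X, y), atol=1e-4))  # True
```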

SLIDE 16

Finding w (any D)

the cost is a smooth and convex function of w

setting $\frac{\partial}{\partial w_d} J(w) = \frac{\partial}{\partial w_d} \sum_n \frac{1}{2} (y^{(n)} - f_w(x^{(n)}))^2 = 0$

and using the chain rule, $\frac{\partial J}{\partial w_d} = \frac{dJ}{df_w} \frac{\partial f_w}{\partial w_d}$, we get

$\sum_n (w^\top x^{(n)} - y^{(n)}) x_d^{(n)} = 0 \quad \forall d \in \{1, \ldots, D\}$

[figure: the convex cost surface J(w) over $(w_0, w_1)$]

SLIDE 17

Normal equation

$\sum_n (y^{(n)} - w^\top x^{(n)}) x_d^{(n)} = 0 \quad \forall d$

matrix form (using the design matrix):

$X^\top (y - Xw) = 0$

here $X^\top$ is $D \times N$ and the residual $y - Xw$ is $N \times 1$; each row enforces one of the D equations

Normal equation: so called because for the optimal w, the residual vector is normal to the column space of the design matrix

[figure: $y \in \mathbb{R}^N$ and its projection $\hat{y}$ onto the plane spanned by the columns $x_1, x_2$ of the design matrix (e.g., $x_2$ is the 2nd column); the residual $y - Xw$ is orthogonal to that plane]

  • the optimal weights w* satisfy $X^\top X w = X^\top y$, a system of D linear equations of the form $Aw = b$
SLIDE 18

Direct solution

$X^\top (y - Xw) = 0 \;\Rightarrow\; X^\top X w = X^\top y$

$w^* = (X^\top X)^{-1} X^\top y$

dimensions: $(X^\top X)^{-1}$ is $D \times D$, $X^\top$ is $D \times N$, $y$ is $N \times 1$; the matrix $(X^\top X)^{-1} X^\top$ is the pseudo-inverse of X

$\hat{y} = Xw^* = X (X^\top X)^{-1} X^\top y$, where $X (X^\top X)^{-1} X^\top$ is the projection matrix into the column space of X

we can get a closed form solution!
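A small sketch of the direct solution in NumPy (synthetic data of our own making): we solve the normal equations rather than inverting $X^\top X$, and verify the geometric picture.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 50, 3
X = rng.normal(size=(N, D))
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=N)

# normal equations: solve X^T X w = X^T y (no explicit inversion)
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# numerically stabler route (QR/SVD under the hood), same answer here
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(w_normal, w_lstsq))              # True

# the residual is normal to the column space of X
residual = y - X @ w_normal
print(np.allclose(X.T @ residual, 0, atol=1e-8))   # True
```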

SLIDE 19

Uniqueness of the solution

we can get a closed form solution: $w^* = (X^\top X)^{-1} X^\top y$

what if the covariance matrix $X^\top X$ is not invertible?

recall the eigenvalue decomposition of the covariance matrix, $X^\top X = Q \Lambda Q^\top$; when is this not invertible?

$(Q \Lambda Q^\top)^{-1} = Q \Lambda^{-1} Q^\top$, where $\Lambda^{-1} = \begin{bmatrix} \frac{1}{\lambda_1} & & & \\ & \frac{1}{\lambda_2} & & \\ & & \ddots & \\ & & & \frac{1}{\lambda_D} \end{bmatrix}$

this matrix is not well-defined when some eigenvalues are zero!

SLIDE 20

Uniqueness of the solution

under some assumptions, we can get a closed form solution!

$(Q \Lambda Q^\top)^{-1} = Q \Lambda^{-1} Q^\top$, where $\Lambda^{-1} = \mathrm{diag}(\frac{1}{\lambda_1}, \frac{1}{\lambda_2}, \ldots, \frac{1}{\lambda_D})$; this does not exist when some eigenvalues are zero!

that is, if features are completely correlated ... or more generally if features are not linearly independent: there exists some $\{\alpha_d\}$ such that $\sum_d \alpha_d x_d = 0$

SLIDE 21

Uniqueness of the solution

recall: features being completely correlated ... or more generally not linearly independent means there exists some $\{\alpha_d\}$ such that $\sum_d \alpha_d x_d = 0$

examples:

  • having a binary feature $x_1$ as well as its negation $x_2 = (1 - x_1)$
  • when we have many features and $D \geq N$ (e.g., microarray data: patients (n) by genes (d))

if $w^*$ satisfies the normal equation $X^\top (y - X w^*) = 0$, then $w^* + c\alpha$ for all $c$ are also solutions; alternatively, this is because $X(w^* + c\alpha) = Xw^* + cX\alpha = Xw^*$ (since $X\alpha = 0$)
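A small sketch of the binary-feature example (our own toy data): with a bias column, the columns $1, x_1, 1 - x_1$ are linearly dependent, $X^\top X$ is singular, and a whole family of weight vectors gives the same fit.

```python
import numpy as np

x1 = np.array([1.0, 0.0, 1.0, 1.0, 0.0, 0.0])   # a binary feature
x2 = 1.0 - x1                                   # ... and its negation
N = len(x1)
X = np.column_stack([np.ones(N), x1, x2])       # bias column + x1 + x2

# linear dependence: 1*bias - 1*x1 - 1*x2 = 0
alpha = np.array([1.0, -1.0, -1.0])
print(np.allclose(X @ alpha, 0))                # True
print(np.linalg.matrix_rank(X.T @ X))           # 2 < D = 3: singular

y = np.array([0.9, 0.1, 1.2, 1.0, -0.1, 0.2])
w = np.linalg.pinv(X) @ y                       # minimum-norm solution

# any w + c*alpha makes exactly the same predictions (same cost)
for c in [1.0, -3.7]:
    print(np.allclose(X @ (w + c * alpha), X @ w))  # True
```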

SLIDE 22

Time complexity

$w^* = (X^\top X)^{-1} X^\top y$   (dimensions: $D \times D$, $D \times N$, $N \times 1$)

  • computing $X^\top X$: $D \times D$ elements, each requiring N multiplications: $O(D^2 N)$
  • matrix inversion: $O(D^3)$
  • computing $X^\top y$: D elements, each using N ops: $O(ND)$

total complexity for N > D is $O(ND^2 + D^3)$

in practice we don't directly use matrix inversion (it is unstable); however, other more stable solutions (e.g., Gaussian elimination) have similar complexity

SLIDE 23

Multiple targets

instead of $y \in \mathbb{R}^N$ we now have $Y \in \mathbb{R}^{N \times D'}$

$W = (X^\top X)^{-1} X^\top Y$   ($D \times D$ times $D \times N$ times $N \times D'$)

$\hat{Y} = XW$   ($N \times D$ times $D \times D'$ gives $N \times D'$)

a different weight vector for each target: each column of Y is associated with a column of W
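As a sketch (synthetic data, our own variable names), one solve handles all targets at once, since np.linalg.solve accepts a matrix right-hand side:

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, D_out = 40, 3, 2
X = rng.normal(size=(N, D))
W_true = rng.normal(size=(D, D_out))
Y = X @ W_true + 0.05 * rng.normal(size=(N, D_out))

# each column of W fits the corresponding column of Y
W = np.linalg.solve(X.T @ X, X.T @ Y)   # shape (D, D_out)
Y_hat = X @ W                           # shape (N, D_out)
print(W.shape, np.allclose(W, W_true, atol=0.1))
```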

SLIDE 24

Feature engineering

so far we learned a linear function $f_w = \sum_d w_d x_d$

sometimes this may be too simplistic

idea: create new, more useful features out of the initial set of given features, e.g., $x_1^2$, $x_1 x_2$, $\log(x)$ ... how about $x_1 + 2 x_3$? (that one is already a linear combination of existing features, so it adds nothing)

SLIDE 25

Nonlinear basis functions

so far we learned a linear function $f_w = \sum_d w_d x_d$

let's denote the set of all features by $\phi_d(x) \;\; \forall d$; the problem of linear regression doesn't change: $f_w = \sum_d w_d \phi_d(x)$, where $\phi_d(x)$ is the new $x_d$

replacing X with $\Phi$ (each row one instance, each column a (nonlinear) feature):

$\Phi = \begin{bmatrix} \phi_1(x^{(1)}) & \phi_2(x^{(1)}) & \cdots & \phi_D(x^{(1)}) \\ \phi_1(x^{(2)}) & \phi_2(x^{(2)}) & \cdots & \phi_D(x^{(2)}) \\ \vdots & \vdots & \ddots & \vdots \\ \phi_1(x^{(N)}) & \phi_2(x^{(N)}) & \cdots & \phi_D(x^{(N)}) \end{bmatrix}$

the solution simply becomes $(\Phi^\top \Phi) w^* = \Phi^\top y$, with $\Phi$ in place of X

SLIDE 26

Nonlinear basis functions

example: the original input is a scalar $x \in \mathbb{R}$

polynomial bases: $\phi_k(x) = x^k$

Gaussian bases: $\phi_k(x) = e^{-\frac{(x - \mu_k)^2}{s^2}}$

Sigmoid bases: $\phi_k(x) = \frac{1}{1 + e^{-\frac{x - \mu_k}{s}}}$
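A compact sketch of fitting with Gaussian bases (our own synthetic data; the basis centers and s = 1 are choices of ours, mirroring the next slide): build $\Phi$, then run the same linear least squares.

```python
import numpy as np

def gaussian_features(x, mus, s=1.0):
    """Phi[n, k] = exp(-(x_n - mu_k)^2 / s^2), plus a leading bias column."""
    Phi = np.exp(-((x[:, None] - mus[None, :]) ** 2) / s ** 2)
    return np.hstack([np.ones((len(x), 1)), Phi])

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, size=30))
y = np.sin(x) + 0.1 * rng.normal(size=30)   # a nonlinear target

mus = np.linspace(0, 10, 8)                 # basis centers
Phi = gaussian_features(x, mus, s=1.0)      # N x (K+1)

# same linear least squares problem, with Phi in place of X
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
y_hat = Phi @ w
print(np.mean((y - y_hat) ** 2))            # small training error
```
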
SLIDE 27

Example: Gaussian bases

$\phi_k(x) = e^{-\frac{(x - \mu_k)^2}{s^2}}, \qquad \hat{y}^{(n)} = w_0 + \sum_k w_k \phi_k(x)$

we are using a fixed standard deviation of s = 1

[figure: the green curve (our fit) is the sum of these scaled Gaussian bases plus the intercept; each basis is scaled by the corresponding weight]

SLIDE 28

Example: Sigmoid bases

$\phi_k(x) = \frac{1}{1 + e^{-\frac{x - \mu_k}{s}}}, \qquad \hat{y}^{(n)} = w_0 + \sum_k w_k \phi_k(x)$

we are using a fixed scale of s = 1

[figure: the green curve (our fit) is the sum of these scaled sigmoid bases plus the intercept; each basis is scaled by the corresponding weight]

SLIDE 29

Probabilistic interpretation

idea: given the dataset $\mathcal{D} = \{(x^{(1)}, y^{(1)}), \ldots, (x^{(N)}, y^{(N)})\}$, learn a probabilistic model $p(y \mid x; w)$

consider $p(y \mid x; w)$ with the following form:

$p_w(y \mid x) = \mathcal{N}(y \mid w^\top x, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(y - w^\top x)^2}{2\sigma^2}}$

assume a fixed variance, say $\sigma^2 = 1$

Q: how to fit the model? A: maximize the conditional likelihood!

[figure: the regression line $w^\top x$ over (x, y) with normal curves for the error drawn sideways; image: http://blog.nguyenvq.com/blog/2009/05/12/linear-regression-plot-with-normal-curves-for-error-sideways/]

SLIDE 30

Maximum likelihood & linear regression

  • conditional probability: $p(y \mid x; w) = \mathcal{N}(y \mid w^\top x, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(y - w^\top x)^2}{2\sigma^2}}$

likelihood: $L(w) = \prod_{n=1}^{N} p(y^{(n)} \mid x^{(n)}; w)$

log-likelihood: $\ell(w) = -\sum_n \frac{1}{2\sigma^2} (y^{(n)} - w^\top x^{(n)})^2 + \text{constants}$

max-likelihood parameters:

$w^* = \arg\max_w \ell(w) = \arg\min_w \frac{1}{2} \sum_n (y^{(n)} - w^\top x^{(n)})^2$, i.e., linear least squares!

whenever we use the squared error loss, we are assuming Gaussian noise!

[figure: the regression line $w^\top x$ with sideways normal error curves; image: http://blog.nguyenvq.com/blog/2009/05/12/linear-regression-plot-with-normal-curves-for-error-sideways/]
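A quick numeric illustration of the equivalence (synthetic data with Gaussian noise, our own construction): the least squares solution attains the highest conditional log-likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)
N, sigma = 200, 1.0
X = rng.normal(size=(N, 2))
w_true = np.array([2.0, -1.0])
y = X @ w_true + sigma * rng.normal(size=N)    # Gaussian noise model

def log_lik(w):
    """Conditional Gaussian log-likelihood ell(w), constants included."""
    r = y - X @ w
    return -np.sum(r ** 2) / (2 * sigma ** 2) - 0.5 * N * np.log(2 * np.pi * sigma ** 2)

w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)   # the least squares fit

# perturbing the least squares solution can only lower the likelihood
for delta in [np.array([0.1, 0.0]), np.array([-0.05, 0.2])]:
    print(log_lik(w_ls) >= log_lik(w_ls + delta))   # True
```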

SLIDE 31

Summary

linear regression:

  • models targets as a linear function of the features
  • fit the model by minimizing the sum of squared errors
  • has a direct solution with complexity $O(ND^2 + D^3)$
  • has a probabilistic (maximum likelihood) interpretation
  • we can build more expressive models using any number of non-linear features