

SLIDE 1

Applied Machine Learning
Linear Regression

Siamak Ravanbakhsh
COMP 551 (Winter 2020)

SLIDE 2

Learning objectives

  • linear model
  • evaluation criteria
  • how to find the best fit
  • geometric interpretation

SLIDES 3-4

Motivation

effect of income inequality on health and social problems

source: http://chrisauld.com/2012/10/07/what-do-we-know-about-the-effect-of-income-inequality-on-health/

History: the method of least squares was invented by Legendre and Gauss (1800s). Gauss, at age 24, used it to predict the future location of Ceres (the largest asteroid in the asteroid belt).

SLIDE 5

Motivation (?)

SLIDES 6-10

Representing data

each instance consists of an input (one instance) $x^{(n)} \in \mathbb{R}^D$ and a label $y^{(n)} \in \mathbb{R}$

vectors are assumed to be column vectors:

$x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_D \end{bmatrix} = [x_1, x_2, \ldots, x_D]^\top$

each instance has D features (each $x_d$ is a feature), indexed by $d$; for example, $x_d^{(n)}$ is feature $d$ of instance $n$

we assume N instances in the dataset $\mathcal{D} = \{(x^{(n)}, y^{(n)})\}_{n=1}^N$

SLIDES 11-15

Representing data

design matrix: concatenate all instances; each row is a datapoint (one instance), each column is a feature (one feature)

$X = \begin{bmatrix} x^{(1)\top} \\ x^{(2)\top} \\ \vdots \\ x^{(N)\top} \end{bmatrix} = \begin{bmatrix} x_1^{(1)} & x_2^{(1)} & \cdots & x_D^{(1)} \\ \vdots & \vdots & \ddots & \vdots \\ x_1^{(N)} & x_2^{(N)} & \cdots & x_D^{(N)} \end{bmatrix} \in \mathbb{R}^{N \times D}$
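As a concrete illustration of this convention, a minimal NumPy sketch (the shapes and random data below are made up, not from the slides) that stacks instances into a design matrix:

import numpy as np

rng = np.random.default_rng(0)
N, D = 5, 3
instances = [rng.normal(size=D) for _ in range(N)]  # the x^(n), each of shape (D,)
X = np.stack(instances)                             # design matrix, shape (N, D): row n is x^(n) transposed
y = rng.normal(size=N)                              # one scalar label y^(n) per instance
print(X.shape, y.shape)                             # (5, 3) (5,)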

SLIDE 16

Representing data

Example: microarray data. $X \in \mathbb{R}^{N \times D}$ contains gene expression levels, with one row per patient ($n$) and one column per gene ($d$); the labels $y$ can be {cancer / no cancer}, one label for each patient.

SLIDES 17-21

Linear model

$f_w(x) = w_0 + w_1 x_1 + \ldots + w_D x_D$, with $f_w : \mathbb{R}^D \to \mathbb{R}$

assuming a scalar output; will generalize to a vector output later

$w_0$ is the bias or intercept, and $w$ are the model parameters or weights

simplification: concatenate a 1 to $x$, so $x = [1, x_1, \ldots, x_D]^\top$ and $f_w(x) = w^\top x$

yh_n = np.dot(w, x)
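A small sketch of the "concatenate a 1" trick; the particular numbers below are made up for illustration:

import numpy as np

x = np.array([2.0, -1.0, 0.5])        # one instance, shape (D,)
w = np.array([0.5, 1.0, -2.0, 0.1])   # weights [w_0, w_1, ..., w_D], shape (D+1,)

x_aug = np.concatenate(([1.0], x))    # x = [1, x_1, ..., x_D]^T
yh_n = np.dot(w, x_aug)               # w_0 + w_1 x_1 + ... + w_D x_D
print(yh_n)                           # 0.5 + 2.0 + 2.0 + 0.05 = 4.55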

SLIDES 22-28

Loss function

objective: find parameters to fit the data $(x^{(n)}, y^{(n)})\ \forall n$, that is, $f_w(x^{(n)}) \approx y^{(n)}$

minimize a measure of difference between $\hat{y}^{(n)} = f_w(x^{(n)})$ and $y^{(n)}$

squared error loss (a.k.a. L2 loss), for a single instance (a function of the labels):

$L(y, \hat{y}) \triangleq \frac{1}{2}(y - \hat{y})^2$   (the factor $\frac{1}{2}$ is for future convenience)

sum of squared errors cost function, for the whole dataset (versus the loss, which is per instance):

$J(w) = \frac{1}{2} \sum_{n=1}^{N} \big(y^{(n)} - w^\top x^{(n)}\big)^2$
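A tiny sketch of the loss-versus-cost distinction (assuming NumPy arrays X of shape (N, D), y of shape (N,), and weights w of shape (D,)):

import numpy as np

def loss(y_n, yh_n):
    # squared error loss for a single instance
    return 0.5 * (y_n - yh_n) ** 2

def cost(w, X, y):
    # sum of squared errors over the whole dataset
    return sum(loss(y[n], X[n] @ w) for n in range(len(y)))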

SLIDES 29-34

Example (D = 1)   + bias (D = 2)!

the input has a single feature, $x = [x_1]$

[figure: data points $(x^{(1)}, y^{(1)}), \ldots, (x^{(4)}, y^{(4)})$ in the $(x_1, y)$ plane, the fitted line $f_{w^*}(x) = w_0^* + w_1^* x_1$, and the residual $y^{(3)} - f(x^{(3)})$]

Linear Least Squares: $\min_w \sum_n \big(y^{(n)} - w^\top x^{(n)}\big)^2$

SLIDE 35

Example (D = 2)   + bias (D = 3)!

Linear Least Squares: $w^* = \arg\min_w \sum_n \big(y^{(n)} - w^\top x^{(n)}\big)^2$

[figure: data points over the $(x_1, x_2)$ plane and the fitted plane $f_{w^*}(x) = w_0^* + w_1^* x_1 + w_2^* x_2$]

SLIDES 36-39

Matrix form

instead of $\hat{y}^{(n)} = w^\top x^{(n)}$   ($1 \times D$ times $D \times 1$, a scalar in $\mathbb{R}$)

use the design matrix to write $\hat{y} = Xw$   ($N \times D$ times $D \times 1$, giving an $N \times 1$ vector)

Linear least squares: $\arg\min_w \frac{1}{2}\lVert y - Xw \rVert^2 = \frac{1}{2}(y - Xw)^\top (y - Xw)$, the squared L2 norm of the residual vector

yh = np.dot(X, w)
cost = np.sum((yh - y)**2)/2.   # or: cost = np.mean((yh - y)**2)/2.

SLIDES 40-41

Minimizing the cost

the objective is a smooth function of w; find the minimum by setting partial derivatives to zero

[figure: the cost surface in weight space $(w_0, w_1)$ and the corresponding fit in data space $(x, y)$; image: Grosse, Farahmand, Carrasquilla]

SLIDES 42-46

Simple case: D = 1

model: $f_w(x) = wx$   (both scalar)

cost function: $J(w) = \frac{1}{2}\sum_n \big(y^{(n)} - w x^{(n)}\big)^2$

derivative: $\frac{dJ}{dw} = \sum_n x^{(n)}\big(w x^{(n)} - y^{(n)}\big)$

setting the derivative to zero: $w = \frac{\sum_n x^{(n)} y^{(n)}}{\sum_n (x^{(n)})^2}$

this is the global minimum because the cost is smooth and convex (more on convexity later)

[figure: the cost $J$ as a function of $w$]
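A minimal sketch of the D = 1 closed form above; the data is made up (roughly y = 2x plus noise):

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 5, size=50)
y = 2.0 * x + rng.normal(scale=0.3, size=50)

w = np.sum(x * y) / np.sum(x ** 2)   # w = sum_n x^(n) y^(n) / sum_n (x^(n))^2
print(w)                             # close to 2.0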

SLIDES 47-50

Gradient

for a multivariate function $J(w_0, w_1)$, use partial derivatives instead of the derivative; a partial derivative is the derivative when the other variables are held fixed:

$\frac{\partial}{\partial w_1} J(w_0, w_1) \triangleq \lim_{\epsilon \to 0} \frac{J(w_0, w_1 + \epsilon) - J(w_0, w_1)}{\epsilon}$

critical point: all partial derivatives are zero

gradient: the vector of all partial derivatives, $\nabla J(w) = \left[\frac{\partial}{\partial w_1} J(w), \cdots, \frac{\partial}{\partial w_D} J(w)\right]^\top$

[figure: the surface of $J$ over the $(w_0, w_1)$ plane]
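A small numerical illustration of the partial-derivative definition above, using a made-up quadratic J (an assumption, not the course's cost function):

import numpy as np

def J(w):
    return 0.5 * (w[0] - 1.0) ** 2 + 2.0 * (w[1] + 3.0) ** 2

def partial(J, w, i, eps=1e-6):
    w_shift = w.copy()
    w_shift[i] += eps
    return (J(w_shift) - J(w)) / eps   # finite-difference approximation of the limit

w = np.array([0.0, 0.0])
grad = np.array([partial(J, w, i) for i in range(w.size)])  # vector of all partials
print(grad)                            # approximately [-1., 12.]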

SLIDES 51-53

Finding w (any D)

the cost is a smooth and convex function of w

setting $\frac{\partial}{\partial w_d} J(w) = \frac{\partial}{\partial w_d} \sum_n \frac{1}{2}\big(y^{(n)} - f_w(x^{(n)})\big)^2 = 0$

using the chain rule: $\frac{\partial J}{\partial w_d} = \frac{dJ}{df_w}\frac{\partial f_w}{\partial w_d}$

we get $\sum_n \big(w^\top x^{(n)} - y^{(n)}\big)\, x_d^{(n)} = 0 \quad \forall d \in \{1, \ldots, D\}$

[figure: contours of $J(w)$ over the $(w_0, w_1)$ plane]
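The D equations above stack into the gradient of the cost, $\nabla J(w) = X^\top (Xw - y)$; a one-line sketch assuming NumPy arrays X of shape (N, D), y of shape (N,), and w of shape (D,):

import numpy as np

def grad_J(w, X, y):
    # entry d is sum_n (w^T x^(n) - y^(n)) x_d^(n)
    return X.T @ (X @ w - y)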

SLIDES 54-57

Normal equation

$\sum_n \big(y^{(n)} - w^\top x^{(n)}\big)\, x_d^{(n)} = 0 \quad \forall d$   (a system of D linear equations)

matrix form (using the design matrix): $X^\top (y - Xw) = 0$, where $X^\top$ is $D \times N$ and $y - Xw$ is $N \times 1$; each row enforces one of the D equations

"normal equation": for the optimal w, the residual vector is normal to the column space of the design matrix

[figure: $y \in \mathbb{R}^N$ projected onto the span of the columns $x_1, x_2$ of $X$, giving $\hat{y}$, with the residual $y - Xw$ orthogonal to that plane]
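A quick numerical check of this orthogonality on made-up data (shapes are assumptions): at the least-squares solution the residual is orthogonal to every column of X.

import numpy as np

rng = np.random.default_rng(0)
N, D = 100, 4
X = rng.normal(size=(N, D))
y = rng.normal(size=N)

w = np.linalg.lstsq(X, y, rcond=None)[0]
print(X.T @ (y - X @ w))   # approximately the zero vector: X^T (y - Xw) = 0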

SLIDES 58-64

Direct solution

we can get a closed-form solution!

$X^\top (y - Xw) = 0 \;\Rightarrow\; X^\top X w = X^\top y$

$w = (X^\top X)^{-1} X^\top y$   ($X^\top X$ is $D \times D$, $X^\top$ is $D \times N$, $y$ is $N \times 1$)

$(X^\top X)^{-1} X^\top$ is the pseudo-inverse of X

$\hat{y} = Xw = X (X^\top X)^{-1} X^\top y$, where $X (X^\top X)^{-1} X^\top$ is the projection matrix into the column space of X

w = np.linalg.lstsq(X, y)[0]

[figure: the projection of $y \in \mathbb{R}^N$ onto the column space of X, as on the previous slide]

SLIDES 65-69

Time complexity

$w = (X^\top X)^{-1} X^\top y$

  • $X^\top y$: $O(ND)$, D elements, each using N operations
  • $X^\top X$: $O(D^2 N)$, D x D elements, each requiring N multiplications
  • matrix inversion: $O(D^3)$

total complexity for N > D is $O(ND^2 + D^3)$

in practice we don't directly use matrix inversion (it is numerically unstable)
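A small sketch (made-up data) of avoiding the explicit inverse: solve the D x D system $X^\top X w = X^\top y$ with a linear solver, or use a least-squares routine directly.

import numpy as np

rng = np.random.default_rng(0)
N, D = 1000, 10
X = rng.normal(size=(N, D))
y = rng.normal(size=N)

w_inv = np.linalg.inv(X.T @ X) @ (X.T @ y)       # explicit inverse, shown only for comparison
w_solve = np.linalg.solve(X.T @ X, X.T @ y)      # preferred over forming the inverse
w_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]   # SVD-based least squares
print(np.allclose(w_inv, w_solve), np.allclose(w_solve, w_lstsq))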

SLIDES 70-73

Multiple targets

instead of $y \in \mathbb{R}^N$ we have $Y \in \mathbb{R}^{N \times D'}$

$\hat{Y} = XW$   ($X$ is $N \times D$, $W$ is $D \times D'$, $\hat{Y}$ is $N \times D'$)

a different weight vector for each target; each column of Y is associated with a column of W

$W = (X^\top X)^{-1} X^\top Y$   ($D \times D$, $D \times N$, $N \times D'$)

W = np.linalg.lstsq(X, Y)[0]
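A quick sketch (random data, shapes are assumptions) showing that the multi-target solution is the single-target solution applied column by column:

import numpy as np

rng = np.random.default_rng(0)
N, D, Dp = 100, 5, 3
X = rng.normal(size=(N, D))
Y = rng.normal(size=(N, Dp))

W = np.linalg.lstsq(X, Y, rcond=None)[0]         # shape (D, D'): one weight column per target
w0 = np.linalg.lstsq(X, Y[:, 0], rcond=None)[0]  # the first target solved on its own
print(W.shape, np.allclose(W[:, 0], w0))         # (5, 3) True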

SLIDES 74-77

Nonlinear basis functions

so far we learned a linear function $f = w_0 + \sum_d w_d x_d$

nothing changes if we have nonlinear bases: $f = w_0 + \sum_d w_d \phi_d(x)$

the solution simply becomes $w = (\Phi^\top \Phi)^{-1} \Phi^\top y$, replacing X with

$\Phi = \begin{bmatrix} \phi_1(x^{(1)}) & \phi_2(x^{(1)}) & \cdots & \phi_D(x^{(1)}) \\ \phi_1(x^{(2)}) & \phi_2(x^{(2)}) & \cdots & \phi_D(x^{(2)}) \\ \vdots & \vdots & \ddots & \vdots \\ \phi_1(x^{(N)}) & \phi_2(x^{(N)}) & \cdots & \phi_D(x^{(N)}) \end{bmatrix}$

each row is one instance, each column is a (nonlinear) feature
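A minimal sketch of this "replace X with $\Phi$" recipe, using a degree-3 polynomial basis $\phi_k(x) = x^k$ on made-up 1-D data (the data and the choice of basis are assumptions for illustration):

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=60)
y = x ** 3 - x + rng.normal(scale=0.2, size=60)

Phi = np.stack([x ** k for k in range(4)], axis=1)  # shape (N, 4); the k = 0 column acts as the bias
w = np.linalg.lstsq(Phi, y, rcond=None)[0]          # the same least-squares solve as before
yh = Phi @ w                                        # fitted values
print(w)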

SLIDES 78-81

Nonlinear basis functions

examples (the original input is a scalar, $x \in \mathbb{R}$):

  • polynomial bases: $\phi_k(x) = x^k$
  • Gaussian bases: $\phi_k(x) = e^{-\frac{(x - \mu_k)^2}{s^2}}$
  • sigmoid bases: $\phi_k(x) = \frac{1}{1 + e^{-\frac{x - \mu_k}{s}}}$

SLIDES 82-87

Example: Gaussian bases

$\phi_k(x) = e^{-\frac{(x - \mu_k)^2}{s^2}}$

data: $y^{(n)} = \sin(x^{(n)}) + \cos(\sqrt{|x^{(n)}|}) + \epsilon$

import numpy as np
import matplotlib.pyplot as plt

# x: N
# y: N
plt.plot(x, y, 'b.')
phi = lambda x, mu: np.exp(-(x - mu)**2)
mu = np.linspace(0, 10, 10)          # 10 Gaussian bases
Phi = phi(x[:, None], mu[None, :])   # N x 10
w = np.linalg.lstsq(Phi, y)[0]
yh = np.dot(Phi, w)
plt.plot(x, yh, 'g-')

[figure: the noisy data, the curve before adding noise, and our fit to the data using 10 Gaussian bases]

SLIDE 88

Example: sigmoid bases

$\phi_k(x) = \frac{1}{1 + e^{-\frac{x - \mu_k}{s}}}$

data: $y^{(n)} = \sin(x^{(n)}) + \cos(\sqrt{|x^{(n)}|}) + \epsilon$

# x: N
# y: N
plt.plot(x, y, 'b.')
phi = lambda x, mu: 1 / (1 + np.exp(-(x - mu)))
mu = np.linspace(0, 10, 10)          # 10 sigmoid bases
Phi = phi(x[:, None], mu[None, :])   # N x 10
w = np.linalg.lstsq(Phi, y)[0]
yh = np.dot(Phi, w)
plt.plot(x, yh, 'g-')

[figure: our fit to the data using 10 sigmoid bases]

SLIDES 89-93

Problematic settings

$W = (X^\top X)^{-1} X^\top Y$

what if we have a large dataset ($N > 100{,}000{,}000$)?

  • use stochastic gradient descent

what if $X^\top X$ is not invertible?

  • the columns of X (features) are not linearly independent (either redundant features or D > N)
  • then $W^*$ is not unique; make it unique by removing redundant features or by regularization (later!)
  • or find one of the solutions: decomposition-based methods (not discussed) still work; use gradient descent (later!)

w = np.linalg.lstsq(X, Y)[0]
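A small sketch of the non-invertible case above, with a deliberately duplicated feature so that $X^\top X$ is singular; np.linalg.lstsq and the pseudo-inverse still return a valid (minimum-norm) least-squares solution. Using pinv here is an illustration, not something the slides prescribe.

import numpy as np

rng = np.random.default_rng(0)
N = 50
x1 = rng.normal(size=N)
X = np.column_stack([x1, 2 * x1])                # the second column is redundant
y = 3 * x1 + rng.normal(scale=0.1, size=N)

w_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]   # minimum-norm least-squares solution
w_pinv = np.linalg.pinv(X) @ y                   # the same solution via the pseudo-inverse
print(np.allclose(w_lstsq, w_pinv))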

SLIDE 94

Summary

  • linear regression models the targets as a linear function of the features
  • fit the model by minimizing the sum of squared errors
  • it has a direct solution with complexity $O(ND^2 + D^3)$
  • gradient descent: future lecture
  • we can build more expressive models using any number of nonlinear features
  • ensure the features are linearly independent