

SLIDE 1

Lecture 4:

− Linear Regression (cont’d.)
− Optimization
− Generalization
− Model complexity
− Regularization

Aykut Erdem

October 2017 Hacettepe University

SLIDE 2

Administrative

  • Assignment 1 is out!
  • It is due October 20 (i.e. in two weeks).
  • It includes

− Pencil-and-paper derivations
− Implementing a kNN classifier
− numpy/Python code

SLIDE 3

Classifying Bird Species


  • Caltech-UCSD Birds 200 dataset (200 bird species)

− 5033 training and 1000 test images

  • You may want to split the training set into train and validation (more on this next week)
  • Do not use test data for training or parameter tuning
  • Features:

− Attributes
− Color histogram
− HOG features
− Deep CNN features

  • Report performance on test data

adapted from Sanja Fidler

Hooded Oriole (Icterus cucullatus)

SLIDE 4

Recall from last time… Kernel Regression

1-NN for Regression

[Figure: for each query point x, the 1-NN prediction is the target value of the closest training example.]

  • Given training pairs (y_1, z_1), …, (y_n, z_n), with inputs y_j ∈ Y and targets z_j ∈ ℜ, and a distance function L : Y × Y → ℜ, the 1-NN prediction at a query y′ is the target z_j of the training point that minimizes L(y_j, y′)

Weighted K-NN for Regression

Distance metric (Minkowski): D = ( Σ_{i=1}^{n} |x_i − y_i|^p )^{1/p}

Kernel weights: w_i = exp(−d(x_i, query)² / σ²), where σ is the kernel width

SLIDE 5

Recall from last time… Least-Squares Regression


slide by Sanja Fidler

y(x) = function(x, w)

SLIDE 6

Recall from last time… Least-Squares Regression

  • Define a model
    Linear: y(x) = function(x, w)
  • Standard loss/cost/objective function measures the squared error between y and the true value t
  • For a particular hypothesis (y(x), defined by a choice of w, drawn in red), what does the loss represent geometrically?

slide by Sanja Fidler

SLIDE 7

Recall from last time… Least-Squares Regression

  • Define a model
    Linear: y(x) = w0 + w1x
  • Standard loss/cost/objective function measures the squared error between y and the true value t:

    ℓ(w) = Σ_{n=1}^{N} [t^(n) − y(x^(n))]²

  • For a particular hypothesis (y(x), defined by a choice of w, drawn in red), what does the loss represent geometrically?

slide by Sanja Fidler

SLIDE 8

Recall from last time… Least-Squares Regression

  • Define a model
    Linear: y(x) = w0 + w1x
  • Standard loss/cost/objective function measures the squared error between y and the true value t
    Linear model: ℓ(w) = Σ_{n=1}^{N} [t^(n) − (w0 + w1 x^(n))]²
  • The loss for the red hypothesis is the sum of the squared vertical errors (squared lengths of green vertical lines)

slide by Sanja Fidler

SLIDE 9

Recall from last time… Least-Squares Regression

  • Define a model
    Linear: y(x) = w0 + w1x
  • Standard loss/cost/objective function measures the squared error between y and the true value t
    Linear model: ℓ(w) = Σ_{n=1}^{N} [t^(n) − (w0 + w1 x^(n))]²
  • How do we obtain the weights w = (w0, w1)?

slide by Sanja Fidler

SLIDE 10

Recall from last time… Least-Squares Regression

  • Define a model
    Linear: y(x) = w0 + w1x
  • Standard loss/cost/objective function measures the squared error between y and the true value t
    Linear model: ℓ(w) = Σ_{n=1}^{N} [t^(n) − (w0 + w1 x^(n))]²
  • How do we obtain the weights w = (w0, w1)? Find the w that minimizes the loss ℓ(w)

slide by Sanja Fidler

SLIDE 11

Optimizing the Objective

  • One straightforward method: gradient descent
    − initialize w (e.g., randomly)
    − repeatedly update w based on the gradient: w ← w − λ ∂ℓ/∂w
  • λ is the learning rate
  • For a single training case, this gives the LMS update rule:

    w ← w + 2λ (t^(n) − y(x^(n))) x^(n)

    where (t^(n) − y(x^(n))) is the error
  • Note: as the error approaches zero, so does the update (w stops changing)

slide by Sanja Fidler
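As a concrete illustration of these updates, here is a minimal numpy sketch of batch gradient descent for y(x) = w0 + w1x; the averaged update, the learning rate, and the iteration count are illustrative assumptions.

```python
import numpy as np

def fit_linear_gd(x, t, lam=0.1, n_iters=1000):
    """Gradient descent for y(x) = w0 + w1*x under squared-error loss (sketch)."""
    w0, w1 = 0.0, 0.0                 # initialize w (here: zeros)
    for _ in range(n_iters):
        err = t - (w0 + w1 * x)       # errors t^(n) - y(x^(n))
        # LMS updates, averaged over the batch so the step size is scale-free in N
        w0 += 2 * lam * np.mean(err)
        w1 += 2 * lam * np.mean(err * x)
    return w0, w1
```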

SLIDE 12

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

Optimizing the Objective

SLIDE 13

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

Optimizing the Objective

SLIDE 14

Effect of learning rate λ

  • Large λ => fast convergence, but larger residual error; also possible oscillations
  • Small λ => slow convergence, but small residual error

slide by Erik Sudderth

[Figure: ℓ(w) plotted against w0 for large and small learning rates.]

SLIDE 15

Optimizing Across Training Set

  • Two ways to generalize this for all examples in the training set:
  • 1. Batch updates: sum or average updates across every example n, then change the parameter values
  • 2. Stochastic/online updates: update the parameters for each training case in turn, according to its own gradients


Algorithm 1 Stochastic gradient descent
1: Randomly shuffle examples in the training set
2: for i = 1 to N do
3:   Update: w ← w + 2λ(t^(i) − y(x^(i))) x^(i)   (update for a linear model)
4: end for

slide by Sanja Fidler

w ← w + 2λ (t^(n) − y(x^(n))) x^(n)
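A matching numpy sketch of one epoch of Algorithm 1; the multi-dimensional model y(x) = w^T x, the function name, and the seeded generator are illustrative assumptions.

```python
import numpy as np

def sgd_epoch(X, t, w, lam=0.01, seed=0):
    """One stochastic gradient descent epoch for a linear model y(x) = w^T x (sketch)."""
    rng = np.random.default_rng(seed)
    for i in rng.permutation(len(t)):    # 1: randomly shuffle the examples
        err = t[i] - X[i] @ w            #    error t^(i) - y(x^(i))
        w = w + 2 * lam * err * X[i]     # 3: LMS update for this single training case
    return w
```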

SLIDE 16

Optimizing Across Training Set

  • Two ways to generalize this for all examples in the training set:
  • 1. Batch updates: sum or average updates across every example n, then change the parameter values
  • 2. Stochastic/online updates: update the parameters for each training case in turn, according to its own gradients

slide by Sanja Fidler

w ← w + 2λ (t^(n) − y(x^(n))) x^(n)

  • Underlying assumption: the samples are independent and identically distributed (i.i.d.)

SLIDE 17

Analytical Solution

  • For some objectives, we can also find the optimal solution analytically
  • This is the case for linear least-squares regression
  • How?

slide by Sanja Fidler

SLIDE 18

Vectorization

  • Consider our model: y(x) = w0 + w1x
  • Let w = [w0, w1]^T and x^T = [1, x]
  • We can write the model in vectorized form as y(x) = w^T x

SLIDE 19

Vectorization

  • Consider our model with N instances:

    t = [t^(1), t^(2), …, t^(N)]^T ∈ R^{N×1}

    X = [1, x^(1); 1, x^(2); …; 1, x^(N)] ∈ R^{N×2}

    w = [w0, w1]^T ∈ R^{2×1}

  • Then:

    ℓ(w) = Σ_{n=1}^{N} [w^T x^(n) − t^(n)]² = (Xw − t)^T (Xw − t)

slide by Sanja Fidler

SLIDE 20
Analytical Solution

  • Instead of using GD, solve for the optimal w analytically
    − Notice the solution is where ∂ℓ/∂w = 0
  • Derivation:

    ℓ(w) = (Xw − t)^T (Xw − t)
         = w^T X^T X w − t^T X w − w^T X^T t + t^T t
         = w^T X^T X w − 2 w^T X^T t + t^T t

    − Take the derivative, set it equal to 0, then solve for w:

    ∂/∂w [w^T X^T X w − 2 w^T X^T t + t^T t] = 0
    X^T X w − X^T t = 0
    X^T X w = X^T t

  • Closed-form solution: w = (X^T X)^{−1} X^T t

  • If X^T X is not invertible (i.e., singular), we may need to:
    − Use the pseudo-inverse instead of the inverse (in Python, numpy.linalg.pinv(a))
    − Remove redundant (not linearly independent) features
    − Remove extra features to ensure that d ≤ N
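A minimal numpy sketch of this closed form on synthetic data; the toy data and variable names are illustrative assumptions.

```python
import numpy as np

# Toy 1-D data (assumed): t = 1 + 2x + noise
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=50)
t = 1.0 + 2.0 * x + 0.1 * rng.standard_normal(50)

# Design matrix X with a leading column of ones for the bias w0
X = np.column_stack([np.ones_like(x), x])

# Closed form: solve X^T X w = X^T t (avoids forming the explicit inverse)
w = np.linalg.solve(X.T @ X, X.T @ t)

# Pseudo-inverse variant, usable even when X^T X is singular
w_pinv = np.linalg.pinv(X) @ t
```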

SLIDE 21


SLIDE 22

Multi-dimensional Inputs

  • One method of extending the model is to consider other input dimensions:

    y(x) = w0 + w1x1 + w2x2

  • In the Boston housing example, we can look at the number of rooms

slide by Sanja Fidler

SLIDE 23

Linear Regression with Multi-dimensional Inputs

  • Imagine now we want to predict the median house price from these multi-dimensional observations
  • Each house is a data point n, with observations indexed by j:

    x^(n) = (x^(n)_1, …, x^(n)_j, …, x^(n)_d)

  • We can incorporate the bias w0 into w by using x0 = 1; then

    y(x) = w0 + Σ_{j=1}^{d} wj xj = w^T x

  • We can then solve for w = (w0, w1, …, wd). How?
  • We can use gradient descent to solve for each coefficient, or compute w analytically (how does the solution change?)

    recall: w = (X^T X)^{−1} X^T t

slide by Sanja Fidler

SLIDE 24

More Powerful Models?

  • What if our linear model is not good? How can we create a more complicated model?

slide by Sanja Fidler

SLIDE 25

Fitting a Polynomial

  • What if our linear model is not good? How can we create a more complicated model?
  • We can create a more complicated model by defining input variables that are combinations of components of x
  • Example: an M-th order polynomial function of a one-dimensional feature x:

    y(x, w) = w0 + Σ_{j=1}^{M} wj x^j

    where x^j is the j-th power of x

  • We can use the same approach to optimize for the weights w
  • How do we do that? (One possible recipe is sketched below.)

slide by Sanja Fidler
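One possible recipe, sketched in numpy: expand x into polynomial features and reuse the same least-squares machinery; the helper names and the use of np.vander are illustrative assumptions.

```python
import numpy as np

def fit_polynomial(x, t, M):
    """Least-squares fit of y(x, w) = w0 + sum_j wj x^j (sketch)."""
    # Columns are x^0, x^1, ..., x^M, so the first weight is the bias w0
    X = np.vander(x, M + 1, increasing=True)
    # Same closed-form solution as before, on the expanded inputs
    return np.linalg.pinv(X) @ t

def predict_polynomial(x, w):
    return np.vander(x, len(w), increasing=True) @ w
```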

SLIDE 26

Some types of basis functions in 1-D

  • Polynomials: φj(x) = x^j
  • Gaussians: φj(x) = exp(−(x − µj)² / (2s²))
  • Sigmoids: φj(x) = σ((x − µj) / s), where σ(a) = 1 / (1 + exp(−a))

slide by Erik Sudderth
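A minimal numpy sketch of computing these basis functions over a 1-D input; the grid of centers µj and the width s are illustrative assumptions.

```python
import numpy as np

def gaussian_basis(x, centers, s=0.2):
    """phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2)), plus a constant column for the bias."""
    phi = np.exp(-((x[:, None] - centers[None, :]) ** 2) / (2 * s ** 2))
    return np.column_stack([np.ones_like(x), phi])

def sigmoid_basis(x, centers, s=0.2):
    """phi_j(x) = sigma((x - mu_j) / s), with sigma(a) = 1 / (1 + exp(-a))."""
    phi = 1.0 / (1.0 + np.exp(-(x[:, None] - centers[None, :]) / s))
    return np.column_stack([np.ones_like(x), phi])
```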

SLIDE 27

    y(x, w) = w0 + w1 x1 + w2 x2 + … = w^T x
    y(x, w) = w0 + w1 φ1(x) + w2 φ2(x) + … = w^T φ(x)

    (w0 is the bias term)

Two types of linear model that are equivalent with respect to learning

  • The first model has the same number of adaptive coefficients as the dimensionality of the data + 1
  • The second model has the same number of adaptive coefficients as the number of basis functions + 1
  • Once we have replaced the data by the outputs of the basis functions, fitting the second model is exactly the same problem as fitting the first model (unless we use the kernel trick)

slide by Erik Sudderth

SLIDE 28

General linear regression problem

  • Using our new notation, basis-function linear regression can be written as

    y = Σ_{j=0}^{k} wj φj(x)

    where φj(x) can be either xj (for multivariate regression) or one of the nonlinear bases we defined
  • Once again, we can use least squares to find the optimal solution

slide by E. P. Xing

SLIDE 29

LMS for the general linear regression problem

  • Our model: y = Σ_{j=0}^{k} wj φj(x)
  • Our goal is to minimize the following loss function:

    J(w) = Σ_i (y_i − w^T φ(x_i))²

  • Moving to vector notation and taking the derivative w.r.t. w:

    ∂/∂w Σ_i (y_i − w^T φ(x_i))² = −2 Σ_i (y_i − w^T φ(x_i)) φ(x_i)^T

  • Equating to 0, we get:

    Σ_i y_i φ(x_i)^T = w^T Σ_i φ(x_i) φ(x_i)^T

  • Here w is a vector of dimension k+1, φ(x_i) is a vector of dimension k+1, and y_i is a scalar

slide by E. P. Xing

SLIDE 30

LMS for the general linear regression problem

  • We take the derivative w.r.t. w:

    ∂/∂w Σ_i (y_i − w^T φ(x_i))² = −2 Σ_i (y_i − w^T φ(x_i)) φ(x_i)^T

  • Equating to 0, we get:

    Σ_i y_i φ(x_i)^T = w^T Σ_i φ(x_i) φ(x_i)^T

  • Define:

    Φ = [ φ0(x1)  φ1(x1)  …  φk(x1)
          φ0(x2)  φ1(x2)  …  φk(x2)
          ⋮
          φ0(xn)  φ1(xn)  …  φk(xn) ]

  • Then, solving for w, we get:

    w = (Φ^T Φ)^{−1} Φ^T y

slide by E. P. Xing

SLIDE 31

LMS for the general linear regression problem

    J(w) = Σ_i (y_i − w^T φ(x_i))²

  • Solving for w, we get:

    w = (Φ^T Φ)^{−1} Φ^T y

    where Φ is an n × (k+1) matrix, y is a vector with n entries, and w is a vector with k+1 entries
  • This solution is also known as the ‘pseudo-inverse’

slide by E. P. Xing
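Putting the pieces together, a minimal numpy sketch of the pseudo-inverse solution on a basis-expanded design matrix Φ; the toy data and the gaussian_basis helper (sketched earlier) are illustrative assumptions.

```python
import numpy as np

# Toy data (assumed): a noisy sine wave
rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=30)
y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(30)

# n x (k+1) design matrix from Gaussian basis functions (helper sketched above)
centers = np.linspace(-1.0, 1.0, 9)
Phi = gaussian_basis(x, centers, s=0.3)

# w = (Phi^T Phi)^{-1} Phi^T y, computed via the pseudo-inverse
w = np.linalg.pinv(Phi) @ y
```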

SLIDE 32

0th order polynomial

slide by Erik Sudderth

SLIDE 33

1st order polynomial

slide by Erik Sudderth

SLIDE 34

3rd order polynomial

slide by Erik Sudderth

SLIDE 35

9th order polynomial

slide by Erik Sudderth

SLIDE 36

Which Fit is Best?

slide by Sanja Fidler, from Bishop

SLIDE 37

Root Mean Square (RMS) Error

[Figure: polynomial fits for M = 0, 1, 3, 9.]

E(w) = (1/2) Σ_{n=1}^{N} {y(xn, w) − tn}²

E_RMS = √(2E(w⋆)/N)

The division by N allows us to compare different sizes of data sets on an equal footing, and the square root ensures that E_RMS is measured on the same scale (and in the same units) as the target variable t

slide by Erik Sudderth
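A short numpy sketch of E_RMS; the function name and the assumption that the predictions y and targets t are equal-length arrays are illustrative.

```python
import numpy as np

def rms_error(y, t):
    """E_RMS = sqrt(2 E(w*) / N), with E(w) = 0.5 * sum of squared errors."""
    e = 0.5 * np.sum((y - t) ** 2)
    return np.sqrt(2.0 * e / len(t))
```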

SLIDE 38

Root Mean Square (RMS) Error

E(w) = (1/2) Σ_{n=1}^{N} (tn − φ(xn)^T w)² = (1/2) ‖t − Φw‖²

[Figure: training and test E_RMS versus polynomial order M.]

slide by Erik Sudderth

SLIDE 39

Generalization

  • Generalization = model’s ability to predict the held out data
  • What is happening?
  • Our model with M = 9 overfits the data (it models also noise)

slide by Sanja Fidler

[Figure: training and test E_RMS versus polynomial order M.]

SLIDE 40

Generalization

  • Generalization = model’s ability to predict the held out data
  • What is happening?
  • Our model with M = 9 overfits the data (it models also noise)
  • Not a problem if we have lots of training examples


slide by Sanja Fidler

SLIDE 41

Generalization

  • Generalization = model’s ability to predict the held out data
  • What is happening?
  • Our model with M = 9 overfits the data (it models also noise)
  • Let’s look at the estimated weights for various M in the case of fewer examples

slide by Sanja Fidler

SLIDE 42

Generalization

  • Generalization = model’s ability to predict the held out data
  • What is happening?
  • Our model with M = 9 overfits the data (it models also noise)
  • Let’s look at the estimated weights for various M in the case of fewer examples
  • The weights are becoming huge to compensate for the noise
  • One way of dealing with this is to encourage the weights to be small (this way no input dimension will have too much influence on prediction). This is called regularization.

slide by Sanja Fidler

SLIDE 43

[Figure: training and test E_RMS versus polynomial order M.]

1-D regression illustrates key concepts

  • Data fits – is a linear model best (model selection)?
    − The simplest models do not capture all the important variations (signal) in the data: they underfit
    − A more complex model may overfit the training data (fit not only the signal but also the noise in the data), especially if there is not enough data to constrain the model
  • One method of assessing fit: test generalization = the model’s ability to predict the held-out data
  • Optimization is essential: stochastic and batch iterative approaches; analytic when available

slide by Richard Zemel

SLIDE 44

Regularized Least Squares

  • A technique to control the overfitting phenomenon
  • Add a penalty term to the error function in order to discourage the coefficients from reaching large values:

    E(w) = (1/2) Σ_{n=1}^{N} {y(xn, w) − tn}² + (λ/2) ‖w‖²

    where ‖w‖² ≡ w^T w = w0² + w1² + … + wM², and λ governs the relative importance of the regularization term compared with the sum-of-squares error
  • This is ridge regression, which is minimized by w = (λI + Φ^T Φ)^{−1} Φ^T t

slide by Erik Sudderth
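A minimal numpy sketch of this regularized closed form; the function name and the default choice of λ are illustrative assumptions.

```python
import numpy as np

def fit_ridge(Phi, t, lam=1e-3):
    """Regularized least squares: w = (lam * I + Phi^T Phi)^{-1} Phi^T t (sketch)."""
    k = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(k) + Phi.T @ Phi, Phi.T @ t)
```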

SLIDE 45

The effect of regularization

[Figure: M = 9 polynomial fits with ln λ = −18 and ln λ = 0.]

slide by Erik Sudderth

SLIDE 46

The effect of regularization

[Figure: training and test E_RMS versus ln λ, for M = 9.]

        ln λ = −∞      ln λ = −18    ln λ = 0
w⋆0     0.35           0.35          0.13
w⋆1     232.37         4.74          −0.05
w⋆2     −5321.83       −0.77         −0.06
w⋆3     48568.31       −31.97        −0.05
w⋆4     −231639.30     −3.89         −0.03
w⋆5     640042.26      55.28         −0.02
w⋆6     −1061800.52    41.32         −0.01
w⋆7     1042400.18     −45.95        −0.00
w⋆8     −557682.99     −91.53        0.00
w⋆9     125201.43      72.68         0.01

The corresponding coefficients from the fitted polynomials, showing that regularization has the desired effect of reducing the magnitude of the coefficients.

slide by Erik Sudderth

SLIDE 47

A more general regularizer

    (1/2) Σ_{n=1}^{N} {tn − w^T φ(xn)}² + (λ/2) Σ_{j=1}^{M} |wj|^q

[Figure: contours of the regularization term for q = 0.5, q = 1, q = 2, q = 4.]

slide by Richard Zemel

SLIDE 48

[Figure: training and test E_RMS versus polynomial order M.]

1-D regression illustrates key concepts

  • Data fits – is a linear model best (model selection)?
    − The simplest models do not capture all the important variations (signal) in the data: they underfit
    − A more complex model may overfit the training data (fit not only the signal but also the noise in the data), especially if there is not enough data to constrain the model
  • One method of assessing fit: test generalization = the model’s ability to predict the held-out data
  • Optimization is essential: stochastic and batch iterative approaches; analytic when available

slide by Richard Zemel

SLIDE 49

Next Lecture:

Machine Learning Methodology
