
Linear regression

Course of Machine Learning, Master Degree in Computer Science
University of Rome "Tor Vergata"
Giorgio Gambosi, a.a. 2018-2019


Linear models

  • Linear combination of the input features:

    y(x, w) = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_D x_D,  with x = (x_1, \dots, x_D)

  • Linear function of the parameters w.
  • Linear function of the features x. Extension to a linear combination of basis functions:

    y(x, w) = w_0 + \sum_{j=1}^{M-1} w_j \phi_j(x)

  • Letting \phi_0(x) = 1, this becomes y(x, w) = w^T \phi(x), as in the sketch below.
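As a concrete illustration, here is a minimal NumPy sketch (function names and parameter values are mine, not from the slides) that builds the feature vector with \phi_0(x) = 1 and evaluates y(x, w) = w^T \phi(x) for a polynomial basis:

    import numpy as np

    def phi(x, M):
        # Feature vector (phi_0(x), ..., phi_{M-1}(x)) with phi_j(x) = x^j, so phi_0(x) = 1
        return np.array([x**j for j in range(M)])

    w = np.array([0.5, -1.0, 2.0])   # example parameters, M = 3
    x = 0.7
    y = w @ phi(x, len(w))           # y(x, w) = w^T phi(x)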


Basis functions

  • Many types (implemented in the sketch below):
  • Polynomial (global functions):

    \phi_j(x) = x^j

  • Gaussian (local):

    \phi_j(x) = \exp\left( -\frac{(x - \mu_j)^2}{2s^2} \right)

  • Sigmoid (local):

    \phi_j(x) = \sigma\left( \frac{x - \mu_j}{s} \right) = \frac{1}{1 + e^{-(x - \mu_j)/s}}

  • Hyperbolic tangent (local):

    \phi_j(x) = 2\sigma\left( \frac{x - \mu_j}{s} \right) - 1 = \frac{1 - e^{-(x - \mu_j)/s}}{1 + e^{-(x - \mu_j)/s}} = \tanh\left( \frac{x - \mu_j}{2s} \right)
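A minimal NumPy sketch of these basis functions (vectorized over x; the names are mine):

    import numpy as np

    def gaussian_basis(x, mu, s):
        # phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2))
        return np.exp(-(x - mu)**2 / (2 * s**2))

    def sigmoid_basis(x, mu, s):
        # phi_j(x) = sigma((x - mu_j) / s)
        return 1.0 / (1.0 + np.exp(-(x - mu) / s))

    def tanh_basis(x, mu, s):
        # 2 sigma((x - mu_j)/s) - 1, i.e. tanh((x - mu_j) / (2 s))
        return 2.0 * sigmoid_basis(x, mu, s) - 1.0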


Maximum likelihood and least squares

  • Assume additive Gaussian noise:

    t = y(x, w) + \varepsilon,  with p(\varepsilon) = N(\varepsilon | 0, \beta^{-1})

    where \beta = 1/\sigma^2 is the precision.

  • Then

    p(t | x, w, \beta) = N(t | y(x, w), \beta^{-1})

    and the expectation of the conditional distribution is

    E[t | x] = \int t \, p(t | x) \, dt = y(x, w)


Maximum likelihood and least squares

  • The likelihood of a given training set (X, t) is

    p(t | X, w, \beta) = \prod_{i=1}^{N} N(t_i | w^T \phi(x_i), \beta^{-1})

  • The corresponding log-likelihood is then

    \ln p(t | X, w, \beta) = \sum_{i=1}^{N} \ln N(t_i | w^T \phi(x_i), \beta^{-1}) = \frac{N}{2} \ln \beta - \frac{N}{2} \ln(2\pi) - \beta E_D(w)

    where (see the sketch below)

    E_D(w) = \frac{1}{2} \sum_{i=1}^{N} \left( t_i - w^T \phi(x_i) \right)^2 = \frac{1}{2} (\Phi w - t)^T (\Phi w - t)
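A sketch of the sum-of-squares error and the log-likelihood it induces, given a hypothetical design matrix Phi (N × M), targets t, parameters w, and precision beta:

    import numpy as np

    def sum_of_squares_error(Phi, t, w):
        # E_D(w) = 1/2 * ||Phi w - t||^2
        r = Phi @ w - t
        return 0.5 * (r @ r)

    def log_likelihood(Phi, t, w, beta):
        # ln p(t | X, w, beta) = N/2 ln(beta) - N/2 ln(2 pi) - beta E_D(w)
        N = len(t)
        return N / 2 * np.log(beta) - N / 2 * np.log(2 * np.pi) \
            - beta * sum_of_squares_error(Phi, t, w)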


Maximum likelihood and least squares

  • Maximizing the log-likelihood w.r.t. w is equivalent to minimizing the error function E_D(w).
  • Maximization is performed by setting the gradient to 0:

    0 = \frac{\partial}{\partial w} \ln p(t | X, w, \beta) = \sum_{i=1}^{N} \left( t_i - w^T \phi(x_i) \right) \phi(x_i)^T = \sum_{i=1}^{N} t_i \phi(x_i)^T - w^T \left( \sum_{i=1}^{N} \phi(x_i) \phi(x_i)^T \right)

  • Result (see the sketch below):

    w_{ML} = (\Phi^T \Phi)^{-1} \Phi^T t    (normal equations for least squares)

    \beta_{ML}^{-1} = \frac{1}{N} \sum_{i=1}^{N} \left( t_i - w_{ML}^T \phi(x_i) \right)^2

    where

    \Phi = \begin{pmatrix} \phi_0(x_1) & \phi_1(x_1) & \cdots & \phi_{M-1}(x_1) \\ \phi_0(x_2) & \phi_1(x_2) & \cdots & \phi_{M-1}(x_2) \\ \vdots & \vdots & \ddots & \vdots \\ \phi_0(x_N) & \phi_1(x_N) & \cdots & \phi_{M-1}(x_N) \end{pmatrix}
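A minimal sketch of the ML solution on synthetic data (the data-generating choices are mine; a linear solve is used instead of an explicit matrix inverse for numerical stability):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 1, size=50)
    t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=50)

    M = 4                                   # polynomial basis of degree M - 1
    Phi = np.vander(x, M, increasing=True)  # row i is (1, x_i, x_i^2, x_i^3)

    w_ml = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)  # normal equations
    beta_ml_inv = np.mean((t - Phi @ w_ml)**2)      # ML estimate of the noise variance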


Least squares geometry

  • t = (t_1, \dots, t_N)^T is a vector in ℝ^N.
  • Each basis function φ_j applied to x_1, \dots, x_N yields a vector ϕ_j = (φ_j(x_1), \dots, φ_j(x_N))^T ∈ ℝ^N.
  • If M < N, the vectors ϕ_0, \dots, ϕ_{M-1} define a subspace S of dimension (at most) M.
  • y = (y(x_1, w), \dots, y(x_N, w))^T is a vector in ℝ^N: it can be represented as the linear combination y = \sum_{j=0}^{M-1} w_j ϕ_j. Hence, it belongs to S.
  • Given t ∈ ℝ^N, y ∈ ℝ^N is the vector in the subspace S at minimal squared distance from t.
  • Given t ∈ ℝ^N and the vectors ϕ_0, \dots, ϕ_{M-1}, w_{ML} is such that y is the vector in S nearest to t.


Gradient descent

  • The minimum of E_D(w) may also be computed numerically, by means of gradient descent methods.
  • Initial assignment w^{(0)} = (w_0^{(0)}, w_1^{(0)}, \dots, w_D^{(0)}), with a corresponding error value

    E_D(w^{(0)}) = \frac{1}{2} \sum_{i=1}^{N} \left( t_i - (w^{(0)})^T \phi(x_i) \right)^2

  • Iteratively, the current value w^{(i-1)} is modified in the direction of steepest descent of E_D(w).
  • At step i, w_j^{(i-1)} is updated as follows:

    w_j^{(i)} := w_j^{(i-1)} - \eta \left. \frac{\partial E_D(w)}{\partial w_j} \right|_{w = w^{(i-1)}}


Gradient descent

  • In matrix notation:

    w^{(i)} := w^{(i-1)} - \eta \left. \frac{\partial E_D(w)}{\partial w} \right|_{w = w^{(i-1)}}

  • By the definition of E_D(w), when a single training item (x_i, t_i) is processed at step i (sequential, or stochastic, gradient descent), this becomes (see the sketch below):

    w^{(i)} := w^{(i-1)} + \eta \left( t_i - (w^{(i-1)})^T \phi(x_i) \right) \phi(x_i)
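A sketch of both update rules under these definitions (learning rates and iteration counts are arbitrary choices of mine):

    import numpy as np

    def gradient_descent(Phi, t, eta=0.01, steps=1000):
        # Batch rule: w := w + eta * Phi^T (t - Phi w), i.e. a full-gradient step
        w = np.zeros(Phi.shape[1])
        for _ in range(steps):
            w += eta * Phi.T @ (t - Phi @ w)
        return w

    def sgd(Phi, t, eta=0.05, epochs=50):
        # Sequential rule: w := w + eta * (t_i - w^T phi(x_i)) phi(x_i)
        w = np.zeros(Phi.shape[1])
        for _ in range(epochs):
            for phi_i, t_i in zip(Phi, t):
                w += eta * (t_i - w @ phi_i) * phi_i
        return w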


Regularized least squares

  • Regularization term in the cost function:

    E_D(w) + \lambda E_W(w)

    with E_D(w) dependent on the dataset (and on the parameters), and E_W(w) dependent on the parameters alone.

  • The regularization coefficient \lambda controls the relative importance of the two terms.
  • Simple form:

    E_W(w) = \frac{1}{2} w^T w = \frac{1}{2} \sum_{j=0}^{M-1} w_j^2

  • With the sum-of-squares cost function (weight decay):

    E(w) = \frac{1}{2} \sum_{i=1}^{N} \left( t_i - w^T \phi(x_i) \right)^2 + \frac{\lambda}{2} w^T w = \frac{1}{2} (\Phi w - t)^T (\Phi w - t) + \frac{\lambda}{2} w^T w

    with solution (see the sketch below)

    w = (\lambda I + \Phi^T \Phi)^{-1} \Phi^T t
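The regularized solution is a one-line change to the normal equations; a sketch under the same hypothetical setup as above:

    import numpy as np

    def ridge_fit(Phi, t, lam):
        # w = (lambda I + Phi^T Phi)^{-1} Phi^T t
        M = Phi.shape[1]
        return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)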


Regularization

  • A more general form:

    E(w) = \frac{1}{2} \sum_{i=1}^{N} \left( t_i - w^T \phi(x_i) \right)^2 + \frac{\lambda}{2} \sum_{j=0}^{M-1} |w_j|^q

  • The case q = 1 is denoted as lasso: sparse models are favored (in blue, the level curves of the cost function).


Bias vs variance: an example

  • Consider the case of the function y = sin 2πx and assume L = 100 training sets T_1, \dots, T_L are available, each of size n = 25.
  • Given M = 24 Gaussian basis functions \phi_1(x), \dots, \phi_M(x), from each training set T_i a prediction function y_i(x) is derived by minimizing the regularized cost function (see the sketch below)

    E_D(w) = \frac{1}{2} (\Phi w - t)^T (\Phi w - t) + \frac{\lambda}{2} w^T w
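A sketch of this experiment; the basis centers, widths, sampling interval, and noise level are my assumptions, since the slides do not specify them:

    import numpy as np

    rng = np.random.default_rng(1)
    L, n, M = 100, 25, 24
    lam = np.exp(2.6)            # ln(lambda) = 2.6, as in the next slide
    mu = np.linspace(0, 1, M)    # assumed basis centers
    s = 1.0 / M                  # assumed basis width

    def design(x):
        # N x M design matrix of Gaussian basis functions
        return np.exp(-(x[:, None] - mu)**2 / (2 * s**2))

    x_grid = np.linspace(0, 1, 200)
    preds = []
    for _ in range(L):
        x = rng.uniform(0, 1, n)
        t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n)
        Phi = design(x)
        w = np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)
        preds.append(design(x_grid) @ w)

    avg = np.mean(preds, axis=0)  # estimate of the expectation of y_i(x)
    var = np.var(preds, axis=0)   # pointwise variance across training sets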


An example

[Figure: two panels with axes x and t, for ln λ = 2.6.]

Left: a possible plot of the prediction functions y_i(x) (i = 1, \dots, 100), as derived from the training sets T_i, i = 1, \dots, 100, setting ln λ = 2.6. Right: their expectation, together with the unknown function y = sin 2πx. The prediction functions y_i(x) do not differ much from one another (small variance), but their expectation is a bad approximation of the unknown function (large bias).


An example

[Figure: two panels with axes x and t, for ln λ = −0.31.]

Plot of the prediction functions obtained with ln λ = −0.31.


An example

[Figure: two panels with axes x and t, for ln λ = −2.4.]

Plot of the prediction functions obtained with ln λ = −2.4. As λ decreases, the variance increases (the prediction functions y_i(x) differ more from each other), while the bias decreases (their expectation is a better approximation of y = sin 2πx).


An example

  • Plot of (bias)², variance, and their sum as functions of λ: as λ increases, the bias increases and the variance decreases. Their sum has a minimum in correspondence with the optimal value of λ.
  • The term E_x[σ²_{y|x}] shows an inherent limit to the approximability of y = sin 2πx.


Bayesian approach to regression

  • Applying maximum likelihood to determine the values of the model parameters is prone to overfitting: hence the need for a regularization term E_W(w).
  • In order to control model complexity, a Bayesian approach assumes a prior distribution over the parameter values.


Prior distribution

The posterior is proportional to the prior times the likelihood, and the likelihood is Gaussian (Gaussian noise):

    p(t | \Phi, w, \beta) = \prod_{i=1}^{n} N(t_i | w^T \phi(x_i), \beta^{-1})

The conjugate of a Gaussian is a Gaussian: choosing a Gaussian prior distribution for w, p(w) = N(w | m_0, S_0), results in a Gaussian posterior distribution

    p(w | t, \Phi) = N(w | m_N, S_N) \propto p(t | \Phi, w) p(w)

where

    m_N = S_N (S_0^{-1} m_0 + \beta \Phi^T t)
    S_N^{-1} = S_0^{-1} + \beta \Phi^T \Phi


Prior distribution

A common approach: a zero-mean isotropic Gaussian prior distribution for w,

    p(w | \alpha) = \prod_{j=0}^{M-1} \left( \frac{\alpha}{2\pi} \right)^{1/2} e^{-\frac{\alpha}{2} w_j^2}

  • The parameters in w are assumed independent and identically distributed, according to a Gaussian with mean 0, uniform variance σ² = α^{-1}, and null covariance.
  • The prior distribution is defined through a hyperparameter α, inversely proportional to the variance.


Posterior distribution

Given the likelihood

    p(t | \Phi, w, \beta) = \prod_{i=1}^{n} N(t_i | w^T \phi(x_i), \beta^{-1}) \propto \prod_{i=1}^{n} e^{-\frac{\beta}{2} (t_i - w^T \phi(x_i))^2}

the posterior distribution for w derives from Bayes' rule:

    p(w | t, \Phi, \alpha, \beta) = \frac{p(t | \Phi, w, \beta) \, p(w | \alpha)}{p(t | \Phi, \alpha, \beta)} \propto p(t | \Phi, w, \beta) \, p(w | \alpha)


In this case

It is possible to show that, assuming

    p(w) = N(w | 0, \alpha^{-1} I)
    p(t | w, \Phi) = N(t | \Phi w, \beta^{-1} I)

the posterior distribution is itself Gaussian,

    p(w | t, \Phi, \alpha, \beta) = N(w | m_N, S_N)

with (see the sketch below)

    S_N = (\alpha I + \beta \Phi^T \Phi)^{-1}
    m_N = \beta S_N \Phi^T t

Note that if α → 0 the prior tends to have infinite variance, and we have minimal information on w before the training set is considered. In this case,

    m_N \to (\beta \Phi^T \Phi)^{-1} (\beta \Phi^T t) = (\Phi^T \Phi)^{-1} \Phi^T t

that is, w_{ML}, the ML estimate of w.
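A sketch of the posterior computation under these assumptions (an explicit inverse is used to match the formula; in practice a solve would be preferable):

    import numpy as np

    def posterior(Phi, t, alpha, beta):
        # S_N = (alpha I + beta Phi^T Phi)^{-1},  m_N = beta S_N Phi^T t
        M = Phi.shape[1]
        S_N = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)
        m_N = beta * S_N @ Phi.T @ t
        return m_N, S_N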


Maximum a Posteriori

  • Given the posterior distribution p(w | \Phi, t, \alpha, \beta), we may derive the value w_{MAP} which maximizes it (the mode of the distribution).
  • This is equivalent to maximizing its logarithm:

    \log p(w | \Phi, t, \alpha, \beta) = \log p(t | w, \Phi, \beta) + \log p(w | \alpha) - \log p(t | \Phi, \beta)

    and, since p(t | \Phi, \beta) is constant w.r.t. w,

    w_{MAP} = \arg\max_w \log p(w | \Phi, t, \alpha, \beta) = \arg\max_w \left( \log p(t | w, \Phi, \beta) + \log p(w | \alpha) \right)

    that is,

    w_{MAP} = \arg\min_w \left( -\log p(t | \Phi, w, \beta) - \log p(w | \alpha) \right)


Derivation of MAP

By the assumptions on the prior and the likelihood,

    w_{MAP} = \arg\min_w \left( \frac{\beta}{2} \sum_{i=1}^{n} (t_i - w^T \phi(x_i))^2 + \frac{\alpha}{2} \sum_{j=0}^{M-1} w_j^2 + \text{constants} \right) = \arg\min_w \left( \sum_{i=1}^{n} (t_i - w^T \phi(x_i))^2 + \frac{\alpha}{\beta} \sum_{j=0}^{M-1} w_j^2 \right)

This is equivalent to considering the cost function

    E_{MAP}(w) = \sum_{i=1}^{n} (t_i - w^T \phi(x_i))^2 + \frac{\alpha}{\beta} w^T w

that is, a regularized least-squares cost function with λ = α/β.


Sequential learning

  • The Bayesian approach can be applied when the training set is acquired incrementally (sequential learning).
  • Fundamental property: if the prior and the likelihood are Gaussian, the posterior is Gaussian too.
  • Let T_1, T_2 be two independent training sets, acquired one after the other. Then

    p(w | T_1, T_2) \propto p(T_1, T_2 | w) p(w) = p(T_2 | w) p(T_1 | w) p(w) \propto p(T_2 | w) p(w | T_1)


Sequential learning

  • The posterior after observing T_1 can be used as a prior for the next training set acquired (see the sketch below).
  • In general, for a sequence T_1, \dots, T_n of training sets,

    p(w | T_1, \dots, T_n) \propto p(T_n | w) p(w | T_1, \dots, T_{n-1})
    p(w | T_1, \dots, T_{n-1}) \propto p(T_{n-1} | w) p(w | T_1, \dots, T_{n-2})
    \dots
    p(w | T_1) \propto p(T_1 | w) p(w)
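With Gaussian prior and likelihood, this amounts to applying the general posterior update m_N = S_N(S_0^{-1} m_0 + β Φ^T t), S_N^{-1} = S_0^{-1} + β Φ^T Φ repeatedly, feeding each posterior back in as the next prior; a sketch:

    import numpy as np

    def bayes_update(m0, S0, Phi, t, beta):
        # S_N^{-1} = S_0^{-1} + beta Phi^T Phi ; m_N = S_N (S_0^{-1} m0 + beta Phi^T t)
        S0_inv = np.linalg.inv(S0)
        S_N = np.linalg.inv(S0_inv + beta * Phi.T @ Phi)
        m_N = S_N @ (S0_inv @ m0 + beta * Phi.T @ t)
        return m_N, S_N

    # Usage: start from the prior N(0, I/alpha) and fold in batches T_1, T_2, ...
    # m, S = np.zeros(M), np.eye(M) / alpha
    # for Phi_i, t_i in batches:
    #     m, S = bayes_update(m, S, Phi_i, t_i, beta)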


Example

  • Input variable x, target variable t, linear regression y(x, w_0, w_1) = w_0 + w_1 x.
  • The dataset is generated by applying the function y = a_0 + a_1 x (with a_0 = −0.3, a_1 = 0.5) to values uniformly sampled in [−1, 1], with added Gaussian noise (µ = 0, σ = 0.2).
  • Assume the prior distribution p(w_0, w_1) is a bivariate Gaussian with µ = 0 and Σ = σ²I = 0.04 I.

Left: the prior distribution of (w_0, w_1); right: 6 lines sampled from the distribution.


Example

After observing the item (x_1, y_1) (circle in the right figure). Left: the posterior distribution p(w_0, w_1 | x_1, y_1); right: 6 lines sampled from the distribution.


Example

After observing the items (x_1, y_1), (x_2, y_2) (circles in the right figure). Left: the posterior distribution p(w_0, w_1 | x_1, y_1, x_2, y_2); right: 6 lines sampled from the distribution.


Example

After observing a set of n items (x_1, y_1), \dots, (x_n, y_n) (circles in the right figure). Left: the posterior distribution p(w_0, w_1 | x_i, y_i, i = 1, \dots, n); right: 6 lines sampled from the distribution.


Example

  • As the number of observed items increases, the distribution of the parameters (w_0, w_1) tends to concentrate (its variance decreases to 0) around a mean point (a_0, a_1).
  • As a consequence, the sampled lines concentrate around y = a_0 + a_1 x.


Approaches to prediction in linear regression

Classical

  • A value w_{LS} for w is learned through a point estimate, performed by minimizing a quadratic cost function or, equivalently, by maximizing the likelihood (ML) under the hypothesis of Gaussian noise; regularization can be applied to modify the cost function and limit overfitting.
  • Given any x, the obtained value w_{LS} is used to predict the corresponding t as y = x^T w_{LS}, with the input augmented as x = (1, x)^T, or, in general, as y = φ(x)^T w_{LS}.


Approaches to prediction in linear regression

Bayesian point estimation

  • The posterior distribution p(w | t, \Phi, \alpha, \beta) is derived, and a point estimate is performed on it by computing the mode w_{MAP} of the distribution (MAP).
  • Equivalent to the classical approach, as w_{MAP} corresponds to w_{LS} if λ = α/β.
  • The prediction, for a value x, is a Gaussian distribution p(y | φ(x)^T w_{MAP}, β) for y, with mean φ(x)^T w_{MAP} and variance β^{-1}.
  • The distribution is not derived directly from the posterior p(w | t, \Phi, \alpha, \beta): it is built, instead, as a Gaussian with mean depending on the expectation of the posterior, and variance given by the assumed noise.


Approaches to prediction in linear regression

Fully Bayesian

  • The real interest is not in estimating w or its distribution p(w | t, \Phi, \alpha, \beta), but in deriving the predictive distribution p(y | x). This can be done by taking the expectation of the probability p(y | x, w, \beta) predicted by a model instance with respect to the distribution p(w | t, \Phi, \alpha, \beta) over model instances, that is,

    p(y | x, t, \Phi, \alpha, \beta) = \int p(y | x, w, \beta) \, p(w | t, \Phi, \alpha, \beta) \, dw

  • p(y | x, w, \beta) is assumed Gaussian, and p(w | t, \Phi, \alpha, \beta) is Gaussian by the assumption that the likelihood p(t | w, \Phi, \beta) and the prior p(w | \alpha) are themselves Gaussian and by their being conjugate:

    p(y | x, w, \beta) = N(y | w^T \phi(x), \beta^{-1})
    p(w | t, \Phi, \alpha, \beta) = N(w | \beta S_N \Phi^T t, S_N),  where S_N = (\alpha I + \beta \Phi^T \Phi)^{-1}


Approaches to prediction in linear regression

Fully Bayesian

Under these hypotheses, p(y | x) is Gaussian (see the sketch below),

    p(y | x, t, \Phi, \alpha, \beta) = N(y | m(x), \sigma^2(x))

with mean m(x) = \beta \phi(x)^T S_N \Phi^T t and variance \sigma^2(x) = \frac{1}{\beta} + \phi(x)^T S_N \phi(x).

  • 1/β is a measure of the uncertainty intrinsic to the observed data (noise).
  • φ(x)^T S_N φ(x) is the uncertainty w.r.t. the values derived for the parameters w.
  • As the noise distribution and the distribution of w are independent Gaussians, their variances add.
  • φ(x)^T S_N φ(x) → 0 as n → ∞, and the only remaining uncertainty is the one intrinsic to data observation.
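A sketch of the predictive mean and variance under these formulas, reusing the posterior quantities m_N and S_N computed earlier:

    import numpy as np

    def predictive(phi_x, m_N, S_N, beta):
        # m(x) = phi(x)^T m_N, with m_N = beta S_N Phi^T t
        # sigma^2(x) = 1/beta + phi(x)^T S_N phi(x)
        mean = phi_x @ m_N
        var = 1.0 / beta + phi_x @ S_N @ phi_x
        return mean, var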


Example

  • Predictive distribution for y = sin 2πx, applying a model with 9 Gaussian basis functions and training sets of 1, 2, 4, and 25 items, respectively.
  • Left: the items in the training sets (sampled uniformly, with added Gaussian noise); the expectation of the predictive distribution (red), as a function of x; the variance of this distribution (pink shade, within 1 standard deviation from the mean), as a function of x.
  • Right: the items in the training sets, and 5 possible curves approximating y = sin 2πx, derived by sampling from the posterior distribution p(w | t, \Phi, \alpha, \beta).


Example

[Figures: predictive distribution for n = 1 (left) and n = 2 (right).]


Example

[Figures: predictive distribution for n = 4 (left) and n = 25 (right).]


Equivalent kernel

  • The expectation of the predictive distribution can also be written as

    y(x) = \beta \phi(x)^T S_N \Phi^T t = \sum_{i=1}^{n} \beta \phi(x)^T S_N \phi(x_i) t_i

  • The prediction can then be seen as a linear combination of the target values t_i of the items in the training set, with weights dependent on the item values x_i (and on x):

    y(x) = \sum_{i=1}^{n} \kappa(x, x_i) t_i

    The weight function \kappa(x, x') = \beta \phi(x)^T S_N \phi(x') is called the equivalent kernel, or linear smoother (see the sketch below).
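A sketch of the equivalent kernel and of the prediction written as a linear smoother:

    import numpy as np

    def equivalent_kernel(phi_x, Phi, S_N, beta):
        # kappa(x, x_i) = beta phi(x)^T S_N phi(x_i): one weight per training item
        return beta * phi_x @ S_N @ Phi.T

    # y(x) = sum_i kappa(x, x_i) t_i
    # k = equivalent_kernel(phi_x, Phi, S_N, beta)
    # y = k @ t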


Equivalent kernel

Right: plot, on the plane (x, x_i), of a sample equivalent kernel in the case of Gaussian basis functions. Left: plot as a function of x_i, for three different values of x. In deriving y, the equivalent kernel tends to assign greater relevance to the target values t_i corresponding to items x_i near to x.


Equivalent kernel

The same localization property also holds for different basis functions.

[Figure: left, κ(0, x') in the case of polynomial basis functions; right, κ(0, x') in the case of Gaussian basis functions.]


Equivalent kernel

Some properties:

  • It is possible to prove that \sum_{i=1}^{n} \kappa(x, x_i) = 1 for any x.
  • The covariance between y(x) and y(x') is given by

    cov(y(x), y(x')) = cov(\phi(x)^T w, w^T \phi(x')) = \phi(x)^T S_N \phi(x') = \frac{1}{\beta} \kappa(x, x')

    so predicted values at nearby points are highly correlated.

  • Instead of introducing basis functions, which result in a kernel, we may define a localized kernel directly and use it to make predictions (Gaussian processes).
  • The equivalent kernel can be expressed as an inner product \kappa(x, x') = \psi(x)^T \psi(x') for a suitable set of functions \psi(x) = \beta^{1/2} S_N^{1/2} \phi(x).


Alternative approach to linear regression

  • First approach: define a set of basis functions, which are
    • used to derive w,
    • or (by means of the resulting equivalent kernel) used to compute y(x) directly, as a linear combination of the training set items.
  • New approach: a suitable kernel is defined directly and used to compute y(x).