SLIDE 1
CS480/680 Machine Learning Lecture 3: January 14th, 2020

Linear Regression
Zahra Sheikhbahaee

SLIDE 2

First Assignment

  • Available 15th of January
  • Deadline 23rd of January
  • Teaching assistants responsible for the first assignment:
  • Gaurav Gupta g27gupta@uwaterloo.ca
  • Colin Michiel Vandenhof cm5vandenhof@uwaterloo.ca
  • Office hour for TAs: 20th of January at 1:00 pm in DC2584


SLIDE 3

Outline

  • Linear regression
  • Ridge regression
  • Lasso
  • Bayesian linear regression


SLIDE 4

Linear Models for Regression

Definition: Given a training set comprising $N$ observations $\{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N\}$ with corresponding real-valued targets $t_1, t_2, \ldots, t_N$, the goal is to predict the value of $t_{N+1}$ for a new value of $\mathbf{x}_{N+1}$. The form of a linear regression model is

$$f(\mathbf{x}) = w_0 + \sum_{j=1}^{D} w_j x_j$$

where the coefficients $w_j$ are unknown.


SLIDE 5

Linear Models for Regression

Recall the linear regression model $f(\mathbf{x}) = w_0 + \sum_{j=1}^{D} w_j x_j$ with unknown coefficients $w_j$.

  • Basis expansions: $x_2 = x_1^2,\ x_3 = x_1^3,\ \ldots$ (a polynomial regression)
  • We can have interactions between variables, i.e. $x_3 = x_1 \cdot x_2$
  • By using nonlinear basis functions, $f(\mathbf{x}) = w_0 + \sum_{j=1}^{M-1} w_j \phi_j(\mathbf{x})$, we allow the function $f(\mathbf{x})$ to be a nonlinear function of the input vector $\mathbf{x}$ while remaining linear w.r.t. $\mathbf{w}$ (see the sketch below).


SLIDE 6

Basis Function Choices

  • Polynomial

$$\phi_j(x) = x^j$$

  • Gaussian

$$\phi_j(x) = \exp\left(-\frac{(x - \mu_j)^2}{2s^2}\right)$$

  • Sigmoidal

$$\phi_j(x) = \sigma\left(\frac{x - \mu_j}{s}\right) \quad \text{with} \quad \sigma(a) = \frac{1}{1 + e^{-a}}$$
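The Gaussian and sigmoidal choices map directly to code. A minimal sketch (the function names and the grid of centers $\mu_j$ are our own choices):

```python
import numpy as np

def gaussian_basis(x, centers, s):
    """phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2)), one column per center mu_j."""
    x = np.asarray(x).reshape(-1, 1)          # shape (N, 1)
    mu = np.asarray(centers).reshape(1, -1)   # shape (1, M)
    return np.exp(-((x - mu) ** 2) / (2 * s ** 2))

def sigmoidal_basis(x, centers, s):
    """phi_j(x) = logistic((x - mu_j) / s), one column per center mu_j."""
    x = np.asarray(x).reshape(-1, 1)
    mu = np.asarray(centers).reshape(1, -1)
    return 1.0 / (1.0 + np.exp(-(x - mu) / s))

centers = np.linspace(0, 1, 9)  # e.g. nine evenly spaced centers
Phi = gaussian_basis([0.1, 0.5, 0.9], centers, s=0.1)
print(Phi.shape)  # (3, 9): one row per input, one column per basis function
```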


SLIDE 7

Minimizing the Residual Sum of Squares (RSS)

  • Assumption: The output variable $t$ is given by a deterministic function $f(\mathbf{x}, \mathbf{w})$ with additive Gaussian noise:

$$t = f(\mathbf{x}, \mathbf{w}) + \epsilon$$

where $\epsilon$ is a zero-mean Gaussian random variable, $\epsilon \sim \mathcal{N}(0, \sigma^2)$. Assuming these data points are drawn independently from the distribution,

$$p(\mathbf{t} \mid f(\mathbf{x}, \mathbf{w}), \sigma^2\mathbb{I}) = \prod_{n=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{1}{2\sigma^2}\Big(t_n - w_0 - \sum_{j=1}^{M-1} w_j \phi_j(\mathbf{x}_n)\Big)^2\right]$$

$$\mathrm{RSS}(\mathbf{w}) = \sum_{n=1}^{N} \big(t_n - \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n)\big)^2 = \|\boldsymbol{\epsilon}\|_2^2$$

where $\|\cdot\|_2^2$ is the square of the $\ell_2$ norm.
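In code, the RSS and the Gaussian log-likelihood differ only by terms that do not depend on $\mathbf{w}$, so maximizing one is minimizing the other. A small sketch (helper names are our own):

```python
import numpy as np

def rss(w, Phi, t):
    """Residual sum of squares ||t - Phi w||_2^2."""
    r = t - Phi @ w
    return r @ r

def log_likelihood(w, Phi, t, sigma2):
    """Gaussian log-likelihood; note the -RSS/(2 sigma^2) term."""
    n = len(t)
    return -0.5 * n * np.log(2 * np.pi * sigma2) - rss(w, Phi, t) / (2 * sigma2)
```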


SLIDE 8

Minimizing RSS

  • Compute the gradient of the log-likelihood function:

$$\nabla_{\mathbf{w}} \ln p(\mathbf{t} \mid \mathbf{w}, \sigma^2\mathbb{I}) = \sum_{n=1}^{N} \{t_n - \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n)\}\, \boldsymbol{\phi}(\mathbf{x}_n)^T$$

  • Setting the gradient to zero:

$$\mathbf{w}_{ML} = (\boldsymbol{\Phi}^T\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^T\mathbf{t}, \qquad w_0 = \bar{t} - \sum_{j=1}^{M-1} w_j \bar{\phi}_j, \quad \bar{\phi}_j = \frac{1}{N}\sum_{n=1}^{N} \phi_j(\mathbf{x}_n)$$

where $\boldsymbol{\Phi}$ is the design matrix:

$$\boldsymbol{\Phi} = \begin{pmatrix} \phi_0(\mathbf{x}_1) & \cdots & \phi_{M-1}(\mathbf{x}_1) \\ \phi_0(\mathbf{x}_2) & \cdots & \phi_{M-1}(\mathbf{x}_2) \\ \vdots & & \vdots \\ \phi_0(\mathbf{x}_N) & \cdots & \phi_{M-1}(\mathbf{x}_N) \end{pmatrix}$$
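The closed form $(\boldsymbol{\Phi}^T\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^T\mathbf{t}$ is exactly what a least-squares solver computes; a minimal sketch:

```python
import numpy as np

def fit_ml(Phi, t):
    """Maximum-likelihood weights w_ML = (Phi^T Phi)^{-1} Phi^T t.

    np.linalg.lstsq solves the same normal equations but avoids
    explicitly inverting Phi^T Phi, which is numerically safer.
    """
    w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    return w_ml
```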


SLIDE 9

Minimizing RSS

  • Solving $\ln p(\mathbf{t} \mid \mathbf{w}, \sigma^2\mathbb{I})$ w.r.t. the noise parameter $\sigma^2$:

$$\sigma^2_{ML} = \frac{1}{N}\sum_{n=1}^{N} \big(t_n - \mathbf{w}_{ML}^T\boldsymbol{\phi}(\mathbf{x}_n)\big)^2$$
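This estimate is just the mean squared residual at the ML weights; a one-function sketch:

```python
import numpy as np

def noise_variance_ml(Phi, t, w_ml):
    """sigma^2_ML = (1/N) * sum_n (t_n - w_ML . phi(x_n))^2."""
    residual = t - Phi @ w_ml
    return np.mean(residual ** 2)
```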

[Figure: Squared error for a linear regression model. Gray line: the true noise-free function.]

SLIDE 10

Bayesian view of Linear Regression

  • Bayes Rule: The posterior is proportional to likelihood times prior:

$$p(\mathbf{w} \mid \mathbf{t}, \mathbf{x}) \propto p(\mathbf{t} \mid \mathbf{x}, \mathbf{w})\, p(\mathbf{w} \mid \mu_0, \sigma_0^2)$$

Let's use a Gaussian prior for the weights $\mathbf{w}$:

$$p(\mathbf{w} \mid \mu_0, \sigma_0^2) = \mathcal{N}(0, \sigma_0^2) = \frac{1}{\sqrt{2\pi\sigma_0^2}} \exp\left(-\frac{\mathbf{w}^2}{2\sigma_0^2}\right)$$

The posterior is

$$p(\mathbf{w} \mid \mathbf{t}, \mathbf{x}) \propto \exp\left(-\frac{1}{2\sigma^2}\big(\mathbf{t} - \mathbf{w}^T\boldsymbol{\Phi}(\mathbf{x})\big)^2\right) \exp\left(-\frac{\mathbf{w}^2}{2\sigma_0^2}\right)$$


SLIDE 11

Bayesian view of Linear Regression (Ridge Regression)

The log posterior is

$$\log p(\mathbf{w} \mid \mathbf{t}, \mathbf{x}) \propto -\frac{1}{2\sigma^2}\big(\mathbf{t} - \mathbf{w}^T\boldsymbol{\Phi}(\mathbf{x})\big)^2 - \frac{\mathbf{w}^2}{2\sigma_0^2}$$

If we assume that $\sigma^2 = 1$ and $\mu = \frac{1}{\sigma_0^2}$, then

$$\log p(\mathbf{w} \mid \mathbf{t}, \mathbf{x}) \propto -\frac{1}{2}\|\mathbf{t} - \mathbf{w}^T\boldsymbol{\Phi}(\mathbf{x})\|_2^2 - \frac{\mu}{2}\|\mathbf{w}\|_2^2 \quad \text{subject to} \quad \sum_{j=0}^{M-1} w_j^2 \le d$$

$$\mathbf{w}_{Ridge} = (\boldsymbol{\Phi}^T\boldsymbol{\Phi} + \mu\mathbb{I})^{-1}\boldsymbol{\Phi}^T\mathbf{t}$$

First term: the mean squared error. Second term: a complexity penalty with $\mu \ge 0$. $\mathbf{w}_{Ridge}$ is the posterior mode. Goal: Regularization is the most common way to avoid overfitting; it reduces model complexity by shrinking the coefficients toward zero.
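A minimal sketch of the ridge closed form (solving the linear system rather than forming the inverse):

```python
import numpy as np

def fit_ridge(Phi, t, mu):
    """Ridge weights w = (Phi^T Phi + mu I)^{-1} Phi^T t, with mu >= 0."""
    m = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + mu * np.eye(m), Phi.T @ t)

# Larger mu shrinks the coefficients harder toward zero:
# w_small = fit_ridge(Phi, t, mu=0.01); w_big = fit_ridge(Phi, t, mu=10.0)
```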


SLIDE 12

Lasso

Definition: The lasso cost function is obtained by replacing the squared $\mathbf{w}$ in the second term of ridge regression with $|\mathbf{w}|$:

$$\log p(\mathbf{w} \mid \mathbf{t}, \mathbf{x}) \propto -\frac{1}{2}\|\mathbf{t} - \mathbf{w}^T\boldsymbol{\Phi}(\mathbf{x})\|_2^2 - \frac{\mu}{2}|\mathbf{w}|$$

where for some $s > 0$, $\sum_{j=0}^{M-1} |w_j| < s$.

For the lasso, the prior is a Laplace distribution:

$$p(y \mid \nu, c) = \frac{1}{2c}\exp\left(-\frac{|y - \nu|}{c}\right)$$

Goal: Besides helping to reduce overfitting, the lasso can be used for variable selection: it makes it easier to eliminate input variables that do not contribute to the output.
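There is no closed form for the lasso, but coordinate descent with soft-thresholding is a standard solver; the sketch below is our own implementation of that technique, not the slides', and it shows how the $\ell_1$ penalty drives some coefficients exactly to zero:

```python
import numpy as np

def soft_threshold(rho, alpha):
    """S(rho, alpha) = sign(rho) * max(|rho| - alpha, 0)."""
    return np.sign(rho) * np.maximum(np.abs(rho) - alpha, 0.0)

def fit_lasso(Phi, t, alpha, n_iters=200):
    """Coordinate descent for (1/2)||t - Phi w||^2 + alpha * ||w||_1.

    Assumes no column of Phi is identically zero.
    """
    n, m = Phi.shape
    w = np.zeros(m)
    z = np.sum(Phi ** 2, axis=0)  # per-coordinate curvature Phi_j . Phi_j
    for _ in range(n_iters):
        for j in range(m):
            # residual with feature j's current contribution removed
            r_j = t - Phi @ w + Phi[:, j] * w[j]
            w[j] = soft_threshold(Phi[:, j] @ r_j, alpha) / z[j]
    return w  # exactly-zero entries mark eliminated variables
```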


SLIDE 13

The Bias-Variance Decomposition

  • Let's assume that $T = f(x) + \epsilon$ where $\mathbb{E}[\epsilon] = 0$ and $\mathrm{Var}(\epsilon) = \sigma^2$.
  • The expected prediction error of the regression model $\hat{f}(x)$ at the input point $X = x_0$ using expected squared-error loss:

$$\begin{aligned} \mathrm{Err}(x_0) &= \mathbb{E}\big[(T - \hat{f}(x_0))^2 \mid X = x_0\big] \\ &= \mathbb{E}\big[\big(T - f(x_0) + f(x_0) - \hat{f}(x_0)\big)^2 \mid X = x_0\big] \\ &= \mathbb{E}\big[(T - f(x_0))^2 \mid X = x_0\big] + 2\,\mathbb{E}\big[(T - f(x_0))\big(f(x_0) - \hat{f}(x_0)\big) \mid X = x_0\big] + \mathbb{E}\big[\big(f(x_0) - \hat{f}(x_0)\big)^2 \mid X = x_0\big] \\ &= \mathrm{Var}(\epsilon) + 2\,\mathbb{E}[\epsilon]\,\mathbb{E}\big[f(x_0) - \hat{f}(x_0) \mid X = x_0\big] + \mathbb{E}\big[\big(f(x_0) - \hat{f}(x_0)\big)^2 \mid X = x_0\big] \end{aligned}$$

Since $\mathbb{E}[\epsilon] = 0$, the cross term vanishes. The first term cannot be avoided because it is the variance of the target around its true mean.


SLIDE 14

The Bias-Variance Decomposition (continued)

  • Compute the expectation w.r.t. the probability of the given data set:

$$\begin{aligned} \mathbb{E}\big[\big(f(x_0) - \hat{f}(x_0)\big)^2 \mid X = x_0\big] &= \mathbb{E}\big[\big(f(x_0) - \mathbb{E}[\hat{f}(x_0)] + \mathbb{E}[\hat{f}(x_0)] - \hat{f}(x_0)\big)^2 \mid X = x_0\big] \\ &= \big(\mathbb{E}[\hat{f}(x_0)] - f(x_0)\big)^2 + 2\,\big(f(x_0) - \mathbb{E}[\hat{f}(x_0)]\big)\,\mathbb{E}\big[\mathbb{E}[\hat{f}(x_0)] - \hat{f}(x_0) \mid X = x_0\big] + \mathbb{E}\big[\big(\hat{f}(x_0) - \mathbb{E}[\hat{f}(x_0)]\big)^2 \mid X = x_0\big] \\ &= \mathrm{Bias}^2\big(\hat{f}(x_0)\big) + \mathrm{Var}\big(\hat{f}(x_0)\big) \end{aligned}$$

where the middle term vanishes because $\mathbb{E}\big[\mathbb{E}[\hat{f}(x_0)] - \hat{f}(x_0)\big] = 0$.

The first term is the amount by which the average of our estimate differs from its true mean. The second term is the expected squared deviation of $\hat{f}(x_0)$ around its mean. More complex model → bias ↓ and variance ↑.
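A quick simulation makes the decomposition concrete: refit the same model on many independently drawn training sets and measure, at one point $x_0$, how far the average prediction is from the truth (bias) and how much predictions scatter (variance). The toy setup below is our own choice:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)  # true noise-free function
x0, sigma, degree = 0.35, 0.2, 3
n_sets, n_points = 100, 25

preds = np.empty(n_sets)
for i in range(n_sets):
    x = rng.uniform(0, 1, n_points)
    t = f(x) + rng.normal(0, sigma, n_points)
    Phi = np.vander(x, degree + 1, increasing=True)
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    preds[i] = (np.vander([x0], degree + 1, increasing=True) @ w)[0]

bias2 = (preds.mean() - f(x0)) ** 2  # squared bias at x0
variance = preds.var()               # variance of the estimator at x0
print(f"bias^2 = {bias2:.4f}, variance = {variance:.4f}")
```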


SLIDE 15

Bias-Variance Trade-Off

  • Flexible models: low bias, high variance
  • Rigid models: high bias, low variance


[Figure: 100 data sets, $N = 25$ data points in each set, generated from $h(x) = \sin(2\pi x)$; a model with 25 parameters; the corresponding average of the 100 fits (red) along with the sinusoidal function from which the data sets were generated (green).]

SLIDE 16

Bayesian Linear Regression

  • Bayesian linear regression avoids the over-fitting problem of maximum likelihood and automatically determines model complexity using the training data alone.

$$p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \boldsymbol{\mu}_0, \boldsymbol{\Sigma}_0)$$

where $\mathbf{w}$ are the model parameters and $p(\mathbf{t} \mid \mathbf{w})$ is the likelihood function.

$$p(\mathbf{w} \mid \mathbf{t}) = \mathcal{N}(\mathbf{w} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) \propto p(\mathbf{w})\, p(\mathbf{t} \mid \mathbf{w})$$

$$\boldsymbol{\mu} = \boldsymbol{\Sigma}\Big(\boldsymbol{\Sigma}_0^{-1}\boldsymbol{\mu}_0 + \frac{1}{\sigma^2}\boldsymbol{\Phi}^T\mathbf{t}\Big), \qquad \boldsymbol{\Sigma}^{-1} = \boldsymbol{\Sigma}_0^{-1} + \frac{1}{\sigma^2}\boldsymbol{\Phi}^T\boldsymbol{\Phi}$$

If data points arrive sequentially, then the posterior distribution at any stage acts as the prior distribution for the subsequent data point.
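The update equations translate directly into code; a sketch of the conjugate posterior update, which can be chained for sequential data (the helper name is ours):

```python
import numpy as np

def posterior(Phi, t, mu0, Sigma0, sigma2):
    """Gaussian posterior N(w | mu, Sigma) for Bayesian linear regression.

    Sigma^{-1} = Sigma0^{-1} + (1/sigma2) Phi^T Phi
    mu         = Sigma (Sigma0^{-1} mu0 + (1/sigma2) Phi^T t)
    """
    Sigma_inv = np.linalg.inv(Sigma0) + (Phi.T @ Phi) / sigma2
    Sigma = np.linalg.inv(Sigma_inv)
    mu = Sigma @ (np.linalg.solve(Sigma0, mu0) + (Phi.T @ t) / sigma2)
    return mu, Sigma

# Sequential learning: yesterday's posterior is today's prior.
# mu, Sigma = posterior(Phi_1, t_1, mu0, Sigma0, sigma2)
# mu, Sigma = posterior(Phi_2, t_2, mu, Sigma, sigma2)
```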


SLIDE 17

Bayesian Learning


Illustration for a simple linear model $y(x, \mathbf{w}) = w_0 + w_1 x$ with prior $p(\mathbf{w} \mid \beta) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \beta^{-1}\mathbb{I})$. True parameters: $w_0 = -0.3$, $w_1 = 0.5$, $\beta = 2.0$.

SLIDE 18

Predictive Distribution

We are interested in making a prediction of $t^*$ for a new value $x^*$:

$$\begin{aligned} p(t^* \mid x^*, \mathbf{t}, \mathbf{x}) &= \int p(t^* \mid x^*, \mathbf{w})\, p(\mathbf{w} \mid \mathbf{x}, \mathbf{t})\, d\mathbf{w} \\ &\propto \int \exp\Big(-\frac{1}{2}(\mathbf{w} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{w} - \boldsymbol{\mu})\Big) \exp\Big(-\frac{1}{2\sigma^2}\big(t^* - \mathbf{w}^T \boldsymbol{\phi}(x^*)\big)^2\Big)\, d\mathbf{w} \\ &= \mathcal{N}\Big(t^* \,\Big|\, \sigma^{-2}\boldsymbol{\phi}(x^*)^T \boldsymbol{\Sigma}\, \boldsymbol{\Phi}^T \mathbf{t},\ \sigma^2 + \boldsymbol{\phi}(x^*)^T \boldsymbol{\Sigma}\, \boldsymbol{\phi}(x^*)\Big) \end{aligned}$$
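Reading the mean and variance off the result above (with a zero-mean prior, so $\boldsymbol{\mu} = \sigma^{-2}\boldsymbol{\Sigma}\boldsymbol{\Phi}^T\mathbf{t}$), a minimal sketch:

```python
import numpy as np

def predictive(phi_star, Phi, t, Sigma, sigma2):
    """Predictive mean and variance of t* at features phi_star = phi(x*).

    Assumes a zero-mean prior, so the posterior mean is
    mu = (1/sigma2) Sigma Phi^T t.
    """
    mu = Sigma @ (Phi.T @ t) / sigma2
    mean = phi_star @ mu                        # phi(x*)^T mu
    var = sigma2 + phi_star @ Sigma @ phi_star  # noise + parameter uncertainty
    return mean, var
```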

[Figure: A model with nine Gaussian basis functions.]