CS480/680 Machine Learning, Lecture 3: January 14th, 2020 - Linear Regression



  1. CS480/680 Machine Learning, Lecture 3: January 14th, 2020. Linear Regression. Zahra Sheikhbahaee, CS480/680 Winter 2020

  2. First Assignment
• Available 15th of January
• Deadline 23rd of January
• Teaching assistants responsible for the first assignment:
  • Gaurav Gupta g27gupta@uwaterloo.ca
  • Colin Michiel Vandenhof cm5vandenhof@uwaterloo.ca
• Office hour for TAs: 20th of January at 1:00 pm in DC2584

  3. Outline
• Linear regression
• Ridge regression
• Lasso
• Bayesian linear regression

  4. Linear Models for Regression
Definition: Given a training set comprising $N$ observations $\{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N\}$ with corresponding real-valued targets $t_1, t_2, \ldots, t_N$, the goal is to predict the value of $t_{N+1}$ for a new value $\mathbf{x}_{N+1}$. The form of a linear regression model is
$y(\mathbf{x}) = w_0 + \sum_{j=1}^{D} w_j x_j$
$w_j$: unknown coefficients

  5. Linear Models for Regression
Definition: Given a training set comprising $N$ observations $\{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N\}$ with corresponding real-valued targets $t_1, t_2, \ldots, t_N$, the goal is to predict the value of $t_{N+1}$ for a new value $\mathbf{x}_{N+1}$. The form of a linear regression model is
$y(\mathbf{x}) = w_0 + \sum_{j=1}^{D} w_j x_j$
$w_j$: unknown coefficients
• Basis expansions: $x_2 = x_1^2,\; x_3 = x_1^3,\; \ldots$ (a polynomial regression)
• We can have interactions between variables, e.g. $x_3 = x_1 \cdot x_2$
• By using nonlinear basis functions, as in $y(\mathbf{x}) = w_0 + \sum_{j=1}^{M-1} w_j \phi_j(\mathbf{x})$, we allow the function $y(\mathbf{x})$ to be a nonlinear function of the input vector $\mathbf{x}$ while remaining linear w.r.t. $\mathbf{w}$.

  6. Basis Function Choices
• Polynomial: $\phi_j(x) = x^j$
• Gaussian: $\phi_j(x) = \exp\!\left(-\frac{(x - \mu_j)^2}{2 s^2}\right)$
• Sigmoidal: $\phi_j(x) = \sigma\!\left(\frac{x - \mu_j}{s}\right)$ with $\sigma(a) = \frac{1}{1 + e^{-a}}$
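A minimal numpy sketch of these three basis-function choices; the centers $\mu_j$ and the width $s$ are modeling choices, not values given on the slide:

```python
import numpy as np

def polynomial_basis(x, M):
    # phi_j(x) = x^j for j = 0, ..., M-1 (j = 0 gives the constant/bias column)
    return np.vander(x, M, increasing=True)

def gaussian_basis(x, centers, s):
    # phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2))
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * s ** 2))

def sigmoidal_basis(x, centers, s):
    # phi_j(x) = sigma((x - mu_j) / s), with sigma(a) = 1 / (1 + exp(-a))
    return 1.0 / (1.0 + np.exp(-(x[:, None] - centers[None, :]) / s))

# Example: a design matrix of 9 Gaussian basis functions plus a bias column
x = np.linspace(0, 1, 50)
centers = np.linspace(0, 1, 9)
Phi = np.hstack([np.ones((len(x), 1)), gaussian_basis(x, centers, s=0.1)])
```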

  7. Minimizing the Residual Sum of Squares (RSS)
• Assumption: the output variable $t$ is given by a deterministic function $y(\mathbf{x}, \mathbf{w})$ with additive Gaussian noise,
$t = y(\mathbf{x}, \mathbf{w}) + \epsilon$,
where $\epsilon$ is a zero-mean Gaussian random variable, $\epsilon \sim \mathcal{N}(0, \sigma^2)$.
• Assuming the data points are drawn independently from this distribution, the likelihood is
$p(\mathbf{t} \mid \mathbf{w}, \sigma^2) = \prod_{n=1}^{N} (2\pi\sigma^2)^{-1/2} \exp\!\left[-\frac{1}{2\sigma^2}\left(t_n - w_0 - \sum_{j=1}^{M-1} w_j \phi_j(\mathbf{x}_n)\right)^{\!2}\right]$
• $\mathrm{RSS}(\mathbf{w}) = \sum_{n=1}^{N} \left(t_n - \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_n)\right)^2 = \|\boldsymbol{\epsilon}\|_2^2$, the square of the $\ell_2$ norm.

  8. Minimizing RSS
• Compute the gradient of the log-likelihood function:
$\nabla_{\mathbf{w}} \ln p(\mathbf{t} \mid \mathbf{w}, \sigma^2) = \sum_{n=1}^{N} \left\{t_n - \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_n)\right\} \boldsymbol{\phi}(\mathbf{x}_n)^\top$
• Setting the gradient to zero:
$\mathbf{w}_{\mathrm{ML}} = (\boldsymbol{\Phi}^\top \boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^\top \mathbf{t}$
$w_0 = \bar{t} - \sum_{j=1}^{M-1} w_j \bar{\phi}_j$, where $\bar{\phi}_j = \frac{1}{N}\sum_{n=1}^{N} \phi_j(\mathbf{x}_n)$
• $\boldsymbol{\Phi}$ is the design matrix:
$\boldsymbol{\Phi} = \begin{pmatrix} \phi_0(\mathbf{x}_1) & \cdots & \phi_{M-1}(\mathbf{x}_1) \\ \phi_0(\mathbf{x}_2) & \cdots & \phi_{M-1}(\mathbf{x}_2) \\ \vdots & & \vdots \\ \phi_0(\mathbf{x}_N) & \cdots & \phi_{M-1}(\mathbf{x}_N) \end{pmatrix}$
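A short numpy sketch of the closed-form maximum-likelihood fit on synthetic data; the $\sin(2\pi x)$ target, polynomial basis, and noise level are illustrative choices, and the last line anticipates the noise-variance estimate on the next slide:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1-D data: t = sin(2*pi*x) + Gaussian noise
N, M = 50, 4
x = rng.uniform(0, 1, N)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, N)

# Design matrix with polynomial basis functions phi_j(x) = x^j (phi_0 = 1)
Phi = np.vander(x, M, increasing=True)

# w_ML = (Phi^T Phi)^{-1} Phi^T t; lstsq avoids forming the inverse explicitly
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)

# ML estimate of the noise variance: mean squared residual (see the next slide)
sigma2_ml = np.mean((t - Phi @ w_ml) ** 2)
print(w_ml, sigma2_ml)
```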

  9. Minimizing RSS
• Solving $\ln p(\mathbf{t} \mid \mathbf{w}, \sigma^2)$ w.r.t. the noise parameter $\sigma^2$:
$\sigma^2_{\mathrm{ML}} = \frac{1}{N}\sum_{n=1}^{N} \left(t_n - \mathbf{w}_{\mathrm{ML}}^\top \boldsymbol{\phi}(\mathbf{x}_n)\right)^2$
• This is the squared error for the linear regression model. [Figure; gray line: the true noise-free function.]

  10. Bayesian View of Linear Regression
• Bayes' rule: the posterior is proportional to the likelihood times the prior,
$p(w \mid t, x) \propto p(t \mid x, w)\, p(w \mid \mu_0, \sigma_0^2)$
• Let's use a Gaussian prior for the weight $w$:
$p(w \mid \mu_0, \sigma_0^2) = \mathcal{N}(0, \sigma_0^2) = \frac{1}{\sqrt{2\pi\sigma_0^2}} \exp\!\left(-\frac{w^2}{2\sigma_0^2}\right)$
• The posterior is
$p(w \mid t, x) \propto \exp\!\left(-\frac{1}{2\sigma^2}\left(t - \mathbf{w}^\top \boldsymbol{\Phi}(\mathbf{x})\right)^2\right) \exp\!\left(-\frac{w^2}{2\sigma_0^2}\right)$

  11. Bayesian View of Linear Regression (Ridge Regression)
• The log posterior is
$\log p(\mathbf{w} \mid t, x) \propto -\frac{1}{2\sigma^2}\left(t - \mathbf{w}^\top \boldsymbol{\Phi}(\mathbf{x})\right)^2 - \frac{\|\mathbf{w}\|^2}{2\sigma_0^2}$
• If we assume that $\sigma^2 = 1$ and $\lambda = \frac{1}{\sigma_0^2}$, then
$\log p(\mathbf{w} \mid t, x) \propto -\frac{1}{2}\left\|\mathbf{t} - \mathbf{w}^\top \boldsymbol{\Phi}(\mathbf{x})\right\|_2^2 - \frac{\lambda}{2}\|\mathbf{w}\|_2^2$, subject to $\sum_{j=0}^{M-1} w_j^2 \leq d^2$
$\mathbf{w}_{\mathrm{Ridge}} = (\boldsymbol{\Phi}^\top \boldsymbol{\Phi} + \lambda \mathbb{I})^{-1} \boldsymbol{\Phi}^\top \mathbf{t}$
• First term: the mean squared error. Second term: a complexity penalty, with $\lambda \geq 0$. $\mathbf{w}_{\mathrm{Ridge}}$ is the posterior mode.
• Goal: regularization is the most common way to avoid overfitting and reduce model complexity, by shrinking coefficients toward zero.
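A minimal sketch of the closed-form ridge solution; note that, since the sum on this slide starts at $j = 0$, this version penalizes the bias weight $w_0$ as well:

```python
import numpy as np

def ridge_fit(Phi, t, lam):
    # w_Ridge = (Phi^T Phi + lam * I)^{-1} Phi^T t:
    # the posterior mode under a zero-mean isotropic Gaussian prior on w
    M = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ t)

# Example usage: lam = 0 recovers the maximum-likelihood solution
# w = ridge_fit(Phi, t, lam=0.1)
```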

  12. Lasso
Definition: The lasso cost function is obtained by replacing the squared norm of $\mathbf{w}$ in the second term of ridge regression with the $\ell_1$ norm $\|\mathbf{w}\|_1$:
$\log p(\mathbf{w} \mid t, x) \propto -\frac{1}{2}\left\|\mathbf{t} - \mathbf{w}^\top \boldsymbol{\Phi}(\mathbf{x})\right\|_2^2 - \lambda \|\mathbf{w}\|_1$,
where for some $u > 0$, $\sum_{j=0}^{M-1} |w_j| < u$.
• For the lasso, the prior is a Laplace distribution:
$p(x \mid \mu, b) = \frac{1}{2b}\exp\!\left(-\frac{|x - \mu|}{b}\right)$
• Goal: besides helping to reduce overfitting, the lasso can be used for variable selection, making it easier to eliminate input variables that do not contribute to the output.
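The slide does not prescribe an optimization method, and the $\ell_1$ penalty has no closed-form solution; one standard choice is proximal gradient descent (ISTA) with soft-thresholding, sketched below under that assumption:

```python
import numpy as np

def soft_threshold(v, thresh):
    # Proximal operator of the l1 norm: shrink each coordinate toward zero
    return np.sign(v) * np.maximum(np.abs(v) - thresh, 0.0)

def lasso_ista(Phi, t, lam, n_iters=1000):
    # Minimize 0.5 * ||t - Phi w||^2 + lam * ||w||_1 by proximal gradient (ISTA)
    w = np.zeros(Phi.shape[1])
    step = 1.0 / np.linalg.norm(Phi, ord=2) ** 2   # 1 / Lipschitz constant of the gradient
    for _ in range(n_iters):
        grad = Phi.T @ (Phi @ w - t)
        w = soft_threshold(w - step * grad, lam * step)
    return w
```

Weights of inputs that do not contribute to the output tend to be driven exactly to zero, which is the variable-selection effect mentioned above.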

  13. The Bias-Variance Decomposition
• Let's assume that $Y = f(x) + \epsilon$, where $\mathbb{E}[\epsilon] = 0$ and $\mathrm{Var}(\epsilon) = \sigma^2$.
• The expected prediction error of the regression model $\hat{f}(x)$ at the input point $X = x_0$, using the expected squared-error loss:
$\mathrm{Err}(x_0) = \mathbb{E}\!\left[(Y - \hat{f}(x_0))^2 \mid X = x_0\right]$
$= \mathbb{E}\!\left[(Y - f(x_0) + f(x_0) - \hat{f}(x_0))^2 \mid X = x_0\right]$
$= \mathbb{E}\!\left[(Y - f(x_0))^2 \mid X = x_0\right] + 2\,\mathbb{E}\!\left[(Y - f(x_0))(f(x_0) - \hat{f}(x_0)) \mid X = x_0\right] + \mathbb{E}\!\left[(f(x_0) - \hat{f}(x_0))^2 \mid X = x_0\right]$
$= \mathrm{Var}(\epsilon) + 2\,\mathbb{E}[\epsilon]\,\mathbb{E}\!\left[f(x_0) - \hat{f}(x_0)\right] + \mathbb{E}\!\left[(f(x_0) - \hat{f}(x_0))^2 \mid X = x_0\right]$
• The first term cannot be avoided, because it is the variance of the target around its true mean; the cross term vanishes since $\mathbb{E}[\epsilon] = 0$.

  14. The Bias-Variance Decomposition
• Compute the expectation w.r.t. the distribution of the given data set:
$\mathbb{E}\!\left[(f(x_0) - \hat{f}(x_0))^2 \mid X = x_0\right] = \mathbb{E}\!\left[(f(x_0) - \mathbb{E}[\hat{f}(x_0)] + \mathbb{E}[\hat{f}(x_0)] - \hat{f}(x_0))^2 \mid X = x_0\right]$
$= \left(\mathbb{E}[\hat{f}(x_0)] - f(x_0)\right)^2 + \mathbb{E}\!\left[(\hat{f}(x_0) - \mathbb{E}[\hat{f}(x_0)])^2 \mid X = x_0\right]$ (the cross term vanishes because $\mathbb{E}\!\left[\hat{f}(x_0) - \mathbb{E}[\hat{f}(x_0)]\right] = 0$)
$= \mathrm{Bias}^2\!\left[\hat{f}(x_0)\right] + \mathrm{Var}\!\left[\hat{f}(x_0) \mid X = x_0\right]$
• The first term is the amount by which the average of our estimate differs from its true mean. The second term is the expected squared deviation of $\hat{f}(x_0)$ around its mean.
• More complex model → bias ↓ and variance ↑

  15. Bias-Variance Trade-Off
• Flexible models: low bias, high variance
• Rigid models: high bias, low variance
[Figure: 100 data sets with 25 data points in each set, $h(x) = \sin(2\pi x)$, 25 total parameters; the average of the 100 fits (red) is shown along with the sinusoidal function from which the data sets were generated (green).]
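A sketch that mimics this experiment and empirically checks the decomposition from the previous two slides; the Gaussian-basis width, noise level, and regularization strength are assumed values, not taken from the slide:

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_design(x, centers, s=0.1):
    # Bias column plus Gaussian basis functions centered at `centers`
    Phi = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * s ** 2))
    return np.hstack([np.ones((len(x), 1)), Phi])

n_sets, n_points = 100, 25            # 100 data sets, 25 points each
centers = np.linspace(0, 1, 24)       # 24 Gaussian bases + bias = 25 parameters
lam = 1e-3                            # regularization strength (assumed)
x_test = np.linspace(0, 1, 200)
Phi_test = gaussian_design(x_test, centers)

fits = []
for _ in range(n_sets):
    x = rng.uniform(0, 1, n_points)
    t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n_points)   # noise std assumed
    Phi = gaussian_design(x, centers)
    w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ t)
    fits.append(Phi_test @ w)

fits = np.array(fits)
avg_fit = fits.mean(axis=0)                                    # the "red" average fit
bias2 = np.mean((avg_fit - np.sin(2 * np.pi * x_test)) ** 2)   # squared bias
variance = np.mean(fits.var(axis=0))                           # variance across data sets
print("bias^2:", bias2, "variance:", variance)
```

Increasing lam makes the fits more rigid (bias up, variance down); decreasing it does the opposite.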

  16. Bayesian Linear Regression
• Bayesian linear regression avoids the over-fitting problem of maximum likelihood and automatically determines model complexity using the training data alone.
$p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \boldsymbol{\mu}_0, \boldsymbol{\Sigma}_0)$, where $\mathbf{w}$ are the model parameters and $p(\mathbf{t} \mid \mathbf{w})$ is the likelihood function.
$p(\mathbf{w} \mid \mathbf{t}) = \mathcal{N}(\mathbf{w} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) \propto p(\mathbf{w})\, p(\mathbf{t} \mid \mathbf{w})$
$\boldsymbol{\mu} = \boldsymbol{\Sigma}\left(\boldsymbol{\Sigma}_0^{-1}\boldsymbol{\mu}_0 + \frac{1}{\sigma^2}\boldsymbol{\Phi}^\top \mathbf{t}\right)$
$\boldsymbol{\Sigma}^{-1} = \boldsymbol{\Sigma}_0^{-1} + \frac{1}{\sigma^2}\boldsymbol{\Phi}^\top \boldsymbol{\Phi}$
• If data points arrive sequentially, then the posterior distribution at any stage acts as the prior distribution for the subsequent data point.
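A minimal sketch of the posterior update on this slide; the prior mean $\mu_0$, prior covariance $\Sigma_0$, and noise variance $\sigma^2$ must be supplied by the modeler:

```python
import numpy as np

def posterior(Phi, t, mu0, Sigma0, sigma2):
    # Gaussian posterior N(w | mu, Sigma) for prior N(w | mu0, Sigma0)
    # and Gaussian noise with variance sigma2:
    #   Sigma^{-1} = Sigma0^{-1} + (1/sigma2) Phi^T Phi
    #   mu        = Sigma (Sigma0^{-1} mu0 + (1/sigma2) Phi^T t)
    Sigma0_inv = np.linalg.inv(Sigma0)
    Sigma = np.linalg.inv(Sigma0_inv + (Phi.T @ Phi) / sigma2)
    mu = Sigma @ (Sigma0_inv @ mu0 + (Phi.T @ t) / sigma2)
    return mu, Sigma
```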

  17. Bayesian Learning
• For a simple linear model: $t(x, \mathbf{w}) = w_0 + w_1 x$
• Prior: $p(\mathbf{w} \mid \alpha) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \alpha^{-1}\mathbb{I})$
• True parameters: $w_0 = -0.3$, $w_1 = 0.5$; prior precision $\alpha = 2.0$
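A sketch of sequential Bayesian learning on this toy model, using the update rule from the previous slide; the noise standard deviation (0.2) and the input range are assumptions, since the slide only lists the true weights and the prior precision:

```python
import numpy as np

rng = np.random.default_rng(1)

w_true = np.array([-0.3, 0.5])   # true parameters from the slide
alpha = 2.0                      # prior precision from the slide
sigma = 0.2                      # noise standard deviation (assumed)

# Prior: N(w | 0, alpha^{-1} I)
mu = np.zeros(2)
Sigma = np.eye(2) / alpha

for _ in range(20):
    x = rng.uniform(-1, 1)
    phi = np.array([1.0, x])                  # features (1, x) for t = w0 + w1*x
    t = w_true @ phi + rng.normal(0, sigma)

    # The current posterior acts as the prior for the new data point
    Sigma_old_inv = np.linalg.inv(Sigma)
    Sigma = np.linalg.inv(Sigma_old_inv + np.outer(phi, phi) / sigma**2)
    mu = Sigma @ (Sigma_old_inv @ mu + phi * t / sigma**2)

print("posterior mean after 20 points:", mu)  # approaches [-0.3, 0.5]
```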

  18. Predictive Distribution
• We are interested in making predictions of $t^*$ for a new value $x^*$:
$p(t^* \mid x^*, \mathbf{t}, \mathbf{x}) = \int p(t^* \mid x^*, \mathbf{w})\, p(\mathbf{w} \mid \mathbf{x}, \mathbf{t})\, d\mathbf{w}$
$\propto \int \exp\!\left(-\frac{1}{2}(\mathbf{w} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1}(\mathbf{w} - \boldsymbol{\mu})\right)\exp\!\left(-\frac{1}{2\sigma^2}\left(t^* - \mathbf{w}^\top \boldsymbol{\phi}(x^*)\right)^2\right) d\mathbf{w}$
$= \mathcal{N}\!\left(t^* \,\middle|\, \sigma^{-2}\,\boldsymbol{\phi}(x^*)^\top \boldsymbol{\Sigma}\,\boldsymbol{\Phi}^\top \mathbf{t},\; \sigma^2 + \boldsymbol{\phi}(x^*)^\top \boldsymbol{\Sigma}\,\boldsymbol{\phi}(x^*)\right)$
[Figure: a model with nine Gaussian basis functions.]
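A minimal sketch of the predictive mean and variance; with a zero prior mean, the slide's mean expression $\sigma^{-2}\,\boldsymbol{\phi}(x^*)^\top\boldsymbol{\Sigma}\,\boldsymbol{\Phi}^\top\mathbf{t}$ reduces to $\boldsymbol{\phi}(x^*)^\top\boldsymbol{\mu}$, which is what the code uses:

```python
import numpy as np

def predictive(phi_star, mu, Sigma, sigma2):
    # p(t* | x*, t, x) = N(t* | phi(x*)^T mu, sigma^2 + phi(x*)^T Sigma phi(x*))
    mean = phi_star @ mu
    var = sigma2 + phi_star @ Sigma @ phi_star
    return mean, var

# Example usage, given (mu, Sigma) from the posterior on slide 16:
# mean, var = predictive(phi_star, mu, Sigma, sigma2)
```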
