CS480/680 Machine Learning, Winter 2020
Lecture 3: Linear Regression (January 14th, 2020)
Zahra Sheikhbahaee, University of Waterloo
First Assignment
- Available 15th of January
- Deadline 23rd of January
- Teaching assistants responsible for the first assignment
- Gaurav Gupta g27gupta@uwaterloo.ca
- Colin Michiel Vandenhof cm5vandenhof@uwaterloo.ca
- Office hour for TAs: 20th of January at 1:00 pm in DC2584
Outline
- Linear regression
- Ridge regression
- Lasso
- Bayesian linear regression
Linear Models for Regression
Definition: Given a training set comprising $N$ observations $\{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N\}$ with corresponding real-valued targets $t_1, t_2, \ldots, t_N$, the goal is to predict the value of $t_{N+1}$ for a new value $\mathbf{x}_{N+1}$. The form of a linear regression model is
$$f(\mathbf{x}) = w_0 + \sum_{j=1}^{D} w_j x_j,$$
where the $w_j$ are unknown coefficients.
- Basis expansions: $x_2 = x_1^2,\ x_3 = x_1^3,\ \ldots,\ x_D = x_1^D$ (a polynomial regression)
- We can have interactions between variables, e.g. $x_3 = x_1 \cdot x_2$
- By using nonlinear basis functions, $f(\mathbf{x}) = w_0 + \sum_{j=1}^{M-1} w_j \phi_j(\mathbf{x})$, we allow the function $f(\mathbf{x})$ to be a non-linear function of the input vector $\mathbf{x}$ while remaining linear w.r.t. the weights $\mathbf{w}$ (a sketch of these expansions follows below)
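The following is a minimal NumPy sketch of how such expansions work; the data and the particular feature choices are illustrative assumptions, not from the slides. Each expansion simply adds columns to a feature matrix while the model stays linear in the weights.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.uniform(-1, 1, size=50)  # first input variable
x2 = rng.uniform(-1, 1, size=50)  # second input variable

# Polynomial expansion of x1: each power becomes a new feature column.
X_poly = np.column_stack([x1, x1**2, x1**3])

# Interaction between variables: x1 * x2 as an additional feature.
X_inter = np.column_stack([x1, x2, x1 * x2])

# Either way, f(x) = w0 + sum_j w_j * (feature_j) is still linear in the
# weights w, so ordinary least squares applies unchanged.
```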
Basis Function Choices
- Polynomial: $\phi_j(x) = x^j$
- Gaussian: $\phi_j(x) = \exp\left(-\dfrac{(x - \mu_j)^2}{2s^2}\right)$
- Sigmoidal: $\phi_j(x) = \sigma\left(\dfrac{x - \mu_j}{s}\right)$ with $\sigma(a) = \dfrac{1}{1 + e^{-a}}$ (a sketch of all three follows below)
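A minimal NumPy sketch of these three basis-function choices; the centres `mus` and width `s` below are values I chose for illustration, not ones given on the slide.

```python
import numpy as np

def poly_basis(x, j):
    """Polynomial basis: phi_j(x) = x^j."""
    return x ** j

def gauss_basis(x, mu_j, s):
    """Gaussian basis: phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2))."""
    return np.exp(-(x - mu_j) ** 2 / (2 * s ** 2))

def sigmoid_basis(x, mu_j, s):
    """Sigmoidal basis: phi_j(x) = sigma((x - mu_j) / s)."""
    return 1.0 / (1.0 + np.exp(-(x - mu_j) / s))

# Stack basis outputs (plus a constant column for w0) into a design matrix.
x = np.linspace(0, 1, 25)
mus = np.linspace(0, 1, 9)  # nine evenly spaced centres (an assumption)
s = 0.1
Phi = np.column_stack([np.ones_like(x)] +
                      [gauss_basis(x, mu, s) for mu in mus])
```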
Minimizing the Residual Sum of Squares (RSS)
- Assumption: the output variable $t$ is given by a deterministic function $y(\mathbf{x}, \mathbf{w})$ with additive Gaussian noise, $t = y(\mathbf{x}, \mathbf{w}) + \epsilon$, where $\epsilon \sim \mathcal{N}(0, \sigma^2)$ is a zero-mean Gaussian random variable.
- Assuming the data points are drawn independently from this distribution, the likelihood is
$$p(\mathbf{t} \mid y(\mathbf{x}, \mathbf{w}), \sigma^2 \mathbf{I}) = \prod_{n=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{1}{2\sigma^2}\left(t_n - w_0 - \sum_{j=1}^{M-1} w_j \phi_j(\mathbf{x}_n)\right)^2\right]$$
- The residual sum of squares is
$$\mathrm{RSS}(\mathbf{w}) = \sum_{n=1}^{N} \left(t_n - \mathbf{w}^{T}\boldsymbol{\phi}(\mathbf{x}_n)\right)^2 = \|\boldsymbol{\epsilon}\|_2^2,$$
where $\|\cdot\|_2^2$ is the square of the $\ell_2$ norm.
Minimizing RSS
- Compute the gradient of the log-likelihood function:
$$\nabla_{\mathbf{w}} \ln p(\mathbf{t} \mid \mathbf{w}, \sigma^2 \mathbf{I}) = \sum_{n=1}^{N} \left\{t_n - \mathbf{w}^{T}\boldsymbol{\phi}(\mathbf{x}_n)\right\} \boldsymbol{\phi}(\mathbf{x}_n)^{T}$$
- Setting the gradient to zero yields
$$\mathbf{w}_{\mathrm{ML}} = (\boldsymbol{\Phi}^{T}\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^{T}\mathbf{t}, \qquad w_0 = \bar{t} - \sum_{j=1}^{M-1} w_j \bar{\phi}_j, \qquad \bar{\phi}_j = \frac{1}{N}\sum_{n=1}^{N} \phi_j(\mathbf{x}_n)$$
where $\boldsymbol{\Phi}$ is the design matrix
$$\boldsymbol{\Phi} = \begin{pmatrix} \phi_0(\mathbf{x}_1) & \cdots & \phi_{M-1}(\mathbf{x}_1) \\ \phi_0(\mathbf{x}_2) & \cdots & \phi_{M-1}(\mathbf{x}_2) \\ \vdots & & \vdots \\ \phi_0(\mathbf{x}_N) & \cdots & \phi_{M-1}(\mathbf{x}_N) \end{pmatrix}$$
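A minimal sketch of this closed-form solution on synthetic data; the $\sin(2\pi x)$ target and the polynomial basis are assumptions for illustration. In practice a least-squares solver is preferred over forming $(\boldsymbol{\Phi}^{T}\boldsymbol{\Phi})^{-1}$ explicitly, for numerical stability.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 25, 4
x = rng.uniform(0, 1, N)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, N)

# Design matrix with a polynomial basis: Phi[n, j] = phi_j(x_n) = x_n^j.
Phi = np.vander(x, M, increasing=True)

# w_ML = (Phi^T Phi)^{-1} Phi^T t, computed via a least-squares solver.
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)
```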
Minimizing RSS
- Maximizing $\ln p(\mathbf{t} \mid \mathbf{w}, \sigma^2 \mathbf{I})$ w.r.t. the noise parameter $\sigma^2$ gives
$$\sigma_{\mathrm{ML}}^2 = \frac{1}{N}\sum_{n=1}^{N}\left(t_n - \mathbf{w}_{\mathrm{ML}}^{T}\boldsymbol{\phi}(\mathbf{x}_n)\right)^2$$
[Figure: squared error for a linear regression model; the gray line shows the true noise-free function.]
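Continuing the sketch above (with `Phi`, `t`, and `w_ml` as defined there), the maximum-likelihood noise variance is just the mean squared residual:

```python
# sigma^2_ML = (1/N) * sum_n (t_n - w_ML^T phi(x_n))^2
residuals = t - Phi @ w_ml
sigma2_ml = np.mean(residuals ** 2)
```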
Bayesian view of Linear Regression
- Bayes rule: the posterior is proportional to the likelihood times the prior,
$$p(\mathbf{w} \mid t, \mathbf{x}) \propto p(t \mid \mathbf{x}, \mathbf{w})\, p(\mathbf{w} \mid m_0, \sigma_0^2)$$
- Let's use a Gaussian prior for the weights $\mathbf{w}$:
$$p(w \mid m_0, \sigma_0^2) = \mathcal{N}(0, \sigma_0^2) = \frac{1}{\sqrt{2\pi\sigma_0^2}} \exp\left(-\frac{w^2}{2\sigma_0^2}\right)$$
- The posterior is then
$$p(\mathbf{w} \mid t, \mathbf{x}) \propto \exp\left(-\frac{1}{2\sigma^2}\left(t - \mathbf{w}^{T}\boldsymbol{\phi}(\mathbf{x})\right)^2\right) \exp\left(-\frac{w^2}{2\sigma_0^2}\right)$$
Bayesian view of Linear Regression (Ridge Regression)
The log posterior is
$$\log p(\mathbf{w} \mid t, \mathbf{x}) \propto -\frac{1}{2\sigma^2}\left(t - \mathbf{w}^{T}\boldsymbol{\phi}(\mathbf{x})\right)^2 - \frac{w^2}{2\sigma_0^2}$$
If we assume that $\sigma^2 = 1$ and set $\lambda = \frac{1}{\sigma_0^2}$, then
$$\log p(\mathbf{w} \mid t, \mathbf{x}) \propto -\frac{1}{2}\left\|\mathbf{t} - \mathbf{w}^{T}\boldsymbol{\phi}(\mathbf{x})\right\|_2^2 - \frac{\lambda}{2}\|\mathbf{w}\|_2^2 \quad \text{subject to} \quad \sum_{j=0}^{M-1} w_j^2 \le \tau$$
$$\mathbf{w}_{\mathrm{ridge}} = (\boldsymbol{\Phi}^{T}\boldsymbol{\Phi} + \lambda\mathbf{I})^{-1}\boldsymbol{\Phi}^{T}\mathbf{t}$$
- First term: the mean squared error; $\mathbf{w}_{\mathrm{ridge}}$ is the posterior mode.
- Second term: a complexity penalty with $\lambda \ge 0$.
- Goal: regularization is the most common way to avoid overfitting; it reduces model complexity by shrinking the coefficients toward zero. A closed-form sketch follows below.
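A minimal sketch of the ridge solution on the same kind of synthetic data as before; the value of `lam` is an arbitrary assumption.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 1, 25)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 25)
Phi = np.vander(x, 10, increasing=True)  # 10 polynomial features

lam = 1e-3  # complexity penalty lambda >= 0 (assumed value)
# w_ridge = (Phi^T Phi + lambda I)^{-1} Phi^T t
w_ridge = np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]),
                          Phi.T @ t)
```

Unlike the maximum-likelihood solution, the added $\lambda\mathbf{I}$ keeps the linear system well conditioned even when $\boldsymbol{\Phi}^{T}\boldsymbol{\Phi}$ is nearly singular.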
Lasso
Definition: the lasso cost function is obtained by replacing the squared $\mathbf{w}$ in the second term of ridge regression with $|\mathbf{w}|$:
$$\log p(\mathbf{w} \mid t, \mathbf{x}) \propto -\frac{1}{2}\left\|\mathbf{t} - \mathbf{w}^{T}\boldsymbol{\phi}(\mathbf{x})\right\|_2^2 - \frac{\lambda}{2}\|\mathbf{w}\|_1,$$
where for some $\tau > 0$, $\sum_{j=0}^{M-1} |w_j| < \tau$.
For the lasso, the prior is a Laplace distribution:
$$p(x \mid \mu, b) = \frac{1}{2b} \exp\left(-\frac{|x - \mu|}{b}\right)$$
Goal: besides helping to reduce overfitting, the lasso can be used for variable selection; it makes it easier to eliminate input variables that do not contribute to the output. A sketch of a standard solver follows below.
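The slide does not name an algorithm; a standard choice is iterative soft-thresholding (ISTA), sketched below for the objective $\frac{1}{2}\|\mathbf{t} - \boldsymbol{\Phi}\mathbf{w}\|^2 + \lambda\|\mathbf{w}\|_1$.

```python
import numpy as np

def soft_threshold(v, thresh):
    """Proximal operator of the L1 norm: shrink each entry toward zero."""
    return np.sign(v) * np.maximum(np.abs(v) - thresh, 0.0)

def lasso_ista(Phi, t, lam, n_iter=2000):
    """Minimise 0.5 * ||t - Phi w||^2 + lam * ||w||_1 by proximal gradient."""
    step = 1.0 / np.linalg.norm(Phi, 2) ** 2  # 1 / Lipschitz constant
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        grad = Phi.T @ (Phi @ w - t)          # gradient of the smooth part
        w = soft_threshold(w - step * grad, step * lam)
    return w
```

The soft-thresholding step is what drives some coefficients exactly to zero, which is the variable-selection behaviour described above.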
The Bias-Variance Decomposition
- Let's assume that $t = f(\mathbf{x}) + \epsilon$, where $\mathbb{E}[\epsilon] = 0$ and $\mathrm{Var}(\epsilon) = \sigma^2$.
- The expected prediction error of the regression fit $\hat{f}(\mathbf{x})$ at the input point $\mathbf{x} = \mathbf{x}_0$, using expected squared-error loss, is
$$\begin{aligned} \mathrm{Err}(\mathbf{x}_0) &= \mathbb{E}\left[\left(t - \hat{f}(\mathbf{x}_0)\right)^2 \mid \mathbf{x} = \mathbf{x}_0\right] \\ &= \mathbb{E}\left[\left(t - f(\mathbf{x}_0) + f(\mathbf{x}_0) - \hat{f}(\mathbf{x}_0)\right)^2 \mid \mathbf{x} = \mathbf{x}_0\right] \\ &= \mathbb{E}\left[\left(t - f(\mathbf{x}_0)\right)^2 \mid \mathbf{x} = \mathbf{x}_0\right] + 2\,\mathbb{E}\left[\left(t - f(\mathbf{x}_0)\right)\left(f(\mathbf{x}_0) - \hat{f}(\mathbf{x}_0)\right) \mid \mathbf{x} = \mathbf{x}_0\right] + \mathbb{E}\left[\left(f(\mathbf{x}_0) - \hat{f}(\mathbf{x}_0)\right)^2 \mid \mathbf{x} = \mathbf{x}_0\right] \\ &= \mathrm{Var}(\epsilon) + 2\,\mathbb{E}[\epsilon]\,\mathbb{E}\left[f(\mathbf{x}_0) - \hat{f}(\mathbf{x}_0) \mid \mathbf{x} = \mathbf{x}_0\right] + \mathbb{E}\left[\left(f(\mathbf{x}_0) - \hat{f}(\mathbf{x}_0)\right)^2 \mid \mathbf{x} = \mathbf{x}_0\right] \end{aligned}$$
Since $\mathbb{E}[\epsilon] = 0$, the cross term vanishes. The first term cannot be avoided because it is the variance of the target around its true mean.
- Compute the expectation of the remaining term w.r.t. the distribution of the given data sets:
$$\begin{aligned} \mathbb{E}\left[\left(f(\mathbf{x}_0) - \hat{f}(\mathbf{x}_0)\right)^2 \mid \mathbf{x} = \mathbf{x}_0\right] &= \mathbb{E}\left[\left(f(\mathbf{x}_0) - \mathbb{E}[\hat{f}(\mathbf{x}_0)] + \mathbb{E}[\hat{f}(\mathbf{x}_0)] - \hat{f}(\mathbf{x}_0)\right)^2 \mid \mathbf{x} = \mathbf{x}_0\right] \\ &= \mathbb{E}\left[\left(\mathbb{E}[\hat{f}(\mathbf{x}_0)] - f(\mathbf{x}_0)\right)^2 \mid \mathbf{x} = \mathbf{x}_0\right] + 2\,\mathbb{E}\left[\left(f(\mathbf{x}_0) - \mathbb{E}[\hat{f}(\mathbf{x}_0)]\right)\left(\mathbb{E}[\hat{f}(\mathbf{x}_0)] - \hat{f}(\mathbf{x}_0)\right) \mid \mathbf{x} = \mathbf{x}_0\right] + \mathbb{E}\left[\left(\hat{f}(\mathbf{x}_0) - \mathbb{E}[\hat{f}(\mathbf{x}_0)]\right)^2 \mid \mathbf{x} = \mathbf{x}_0\right] \\ &= \mathrm{Bias}^2\left(\hat{f}(\mathbf{x}_0)\right) + \mathrm{Var}\left(\hat{f}(\mathbf{x}_0)\right), \end{aligned}$$
where the cross term again vanishes in expectation. The first term is the amount by which the average of our estimate differs from the true mean; the second term is the expected squared deviation of $\hat{f}(\mathbf{x}_0)$ around its mean. A more complex model lowers the bias but raises the variance.
Bias-Variance Trade-Off
- Flexible models: low bias, high variance
- Rigid models: high bias, low variance
[Figure: $L = 100$ data sets with $N = 25$ data points each, generated from $h(x) = \sin(2\pi x)$; the model has $M = 25$ total parameters. The average of the 100 fits (red) is shown along with the sinusoidal function from which the data sets were generated (green).]
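A minimal simulation in the spirit of this experiment; using a ridge-penalised polynomial basis here is my assumption, standing in for the slide's model. Averaging the many fits exposes the bias, and their spread measures the variance.

```python
import numpy as np

rng = np.random.default_rng(3)
L, N, M, lam = 100, 25, 25, 1e-3
h = lambda x: np.sin(2 * np.pi * x)   # true noise-free function
x_test = np.linspace(0, 1, 100)

preds = np.empty((L, x_test.size))
for i in range(L):
    x = rng.uniform(0, 1, N)
    t = h(x) + rng.normal(0, 0.3, N)
    Phi = np.vander(x, M, increasing=True)
    w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ t)
    preds[i] = np.vander(x_test, M, increasing=True) @ w

bias2 = np.mean((preds.mean(axis=0) - h(x_test)) ** 2)  # squared bias
variance = np.mean(preds.var(axis=0))                   # variance of the fits
print(f"bias^2 = {bias2:.4f}, variance = {variance:.4f}")
```

Rerunning with a larger `lam` (a more rigid model) raises the bias term and lowers the variance term, which is the trade-off summarised above.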
Bayesian Linear Regression
- Bayesian linear regression avoids the over-fitting problem of maximum likelihood and automatically determines model complexity using the training data alone.
$$p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_0, \boldsymbol{\Sigma}_0) \qquad \mathbf{w}: \text{model parameters} \qquad p(\mathbf{t} \mid \mathbf{w}): \text{the likelihood function}$$
$$p(\mathbf{w} \mid \mathbf{t}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}, \boldsymbol{\Sigma}) \propto p(\mathbf{w})\, p(\mathbf{t} \mid \mathbf{w})$$
$$\mathbf{m} = \boldsymbol{\Sigma}\left(\boldsymbol{\Sigma}_0^{-1}\mathbf{m}_0 + \frac{1}{\sigma^2}\boldsymbol{\Phi}^{T}\mathbf{t}\right), \qquad \boldsymbol{\Sigma}^{-1} = \boldsymbol{\Sigma}_0^{-1} + \frac{1}{\sigma^2}\boldsymbol{\Phi}^{T}\boldsymbol{\Phi}$$
- If data points arrive sequentially, then the posterior distribution at any stage acts as the prior distribution for the subsequent data point.
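A minimal sketch of this posterior update; the prior settings, noise level, and data below are assumed values. Because the posterior is again Gaussian, sequential learning just feeds $\mathbf{m}$ and $\boldsymbol{\Sigma}$ back in as the next prior.

```python
import numpy as np

def posterior(Phi, t, m0, S0, sigma2):
    """Posterior N(m, Sigma) for a Gaussian prior N(m0, S0), noise var sigma2."""
    S0_inv = np.linalg.inv(S0)
    S = np.linalg.inv(S0_inv + (Phi.T @ Phi) / sigma2)   # Sigma
    m = S @ (S0_inv @ m0 + (Phi.T @ t) / sigma2)         # m
    return m, S

rng = np.random.default_rng(4)
x = rng.uniform(-1, 1, 20)
t = -0.3 + 0.5 * x + rng.normal(0, 0.2, 20)   # the true line from the next slide
Phi = np.column_stack([np.ones_like(x), x])    # phi(x) = (1, x)

m0, S0 = np.zeros(2), np.eye(2) / 2.0          # N(0, alpha^{-1} I) with alpha = 2
m, S = posterior(Phi, t, m0, S0, sigma2=0.04)
```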
Bayesian Learning
For a simple linear model $y(x, \mathbf{a}) = a_0 + a_1 x$ with prior $p(\mathbf{a} \mid \alpha) = \mathcal{N}(\mathbf{a} \mid \mathbf{0}, \alpha^{-1}\mathbf{I})$. True parameters: $a_0 = -0.3$, $a_1 = 0.5$, $\alpha = 2.0$.
Predictive Distribution
We are interested in making a prediction $t^*$ for a new value $\mathbf{x}^*$:
$$\begin{aligned} p(t^* \mid \mathbf{x}^*, \mathbf{t}, \mathbf{x}) &= \int p(t^* \mid \mathbf{x}^*, \mathbf{w})\, p(\mathbf{w} \mid \mathbf{x}, \mathbf{t})\, \mathrm{d}\mathbf{w} \\ &\propto \int \exp\left(-\frac{1}{2}(\mathbf{w} - \mathbf{m})^{T}\boldsymbol{\Sigma}^{-1}(\mathbf{w} - \mathbf{m})\right) \exp\left(-\frac{1}{2\sigma^2}\left(t^* - \mathbf{w}^{T}\boldsymbol{\phi}(\mathbf{x}^*)\right)^2\right) \mathrm{d}\mathbf{w} \\ &= \mathcal{N}\left(t^* \,\middle|\, \sigma^{-2}\boldsymbol{\phi}(\mathbf{x}^*)^{T}\boldsymbol{\Sigma}\boldsymbol{\Phi}^{T}\mathbf{t},\; \sigma^2 + \boldsymbol{\phi}(\mathbf{x}^*)^{T}\boldsymbol{\Sigma}\,\boldsymbol{\phi}(\mathbf{x}^*)\right) \end{aligned}$$
[Figure: predictive distribution for a model with nine Gaussian basis functions.]
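A minimal sketch of this predictive density, reusing the posterior mean `m` and covariance `S` from the previous sketch; note that with a zero-mean prior, the mean $\sigma^{-2}\boldsymbol{\phi}(\mathbf{x}^*)^{T}\boldsymbol{\Sigma}\boldsymbol{\Phi}^{T}\mathbf{t}$ equals $\mathbf{m}^{T}\boldsymbol{\phi}(\mathbf{x}^*)$.

```python
import numpy as np

def predictive(phi_star, m, S, sigma2):
    """Mean and variance of p(t* | x*): N(m^T phi*, sigma^2 + phi*^T Sigma phi*)."""
    mean = m @ phi_star
    var = sigma2 + phi_star @ S @ phi_star
    return mean, var

# Example query with the linear basis phi(x) = (1, x) at x* = 0.5,
# using m and S from the posterior sketch above:
# mean, var = predictive(np.array([1.0, 0.5]), m, S, sigma2=0.04)
```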