Linear Regression
CSCI 447/547 MACHINE LEARNING
Outline
- Linear Models
- 1D Ordinary Least Squares (OLS)
- Solution of OLS
- Interpretation
- Anscombe’s Quartet
- Multivariate OLS
- OLS Pros and Cons
Optional Reading
Terminology
- Features (Covariates or predictors)
- Labels (Variates or targets)
- Regression
- Classification
Types of Machine Learning
- Unsupervised
Finding structure in data
- Supervised
Predict outputs from labeled data
[Figure: two height-vs-weight scatter plots. Classification: categorical output data (women vs. men), handled by logistic regression. Prediction: continuous output data, handled by OLS regression.]
What is a Linear Model?
- Predict Housing Prices
Depends on: area, number of bedrooms, number of bathrooms
The hypothesis is that the relationship is linear:
Price = k_1(\text{Area}) + k_2(\#\text{bed}) + k_3(\#\text{bath})
In general: y_i = a_0 + a_1 x_1 + a_2 x_2 + \dots
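A minimal sketch of such a model in Python; the coefficient values below are made up purely for illustration:

```python
# Minimal sketch of a linear model prediction.
# The coefficient values are made up purely for illustration.

def predict_price(area, n_bed, n_bath,
                  k=(150.0, 10000.0, 5000.0), bias=20000.0):
    """Linear model: price = bias + k1*area + k2*n_bed + k3*n_bath."""
    return bias + k[0] * area + k[1] * n_bed + k[2] * n_bath

print(predict_price(area=1200, n_bed=3, n_bath=2))  # 240000.0
```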
Why Use Linear Models?
- Interpretable
Relationships are easy to see
- Low Complexity
Prevents overfitting
- Scalable
Scale up to more data, larger problems
- Baseline
Can benchmark other methods against them
Examples of Use
- Example: MNIST dataset of handwritten digits
Best performance: neural networks with regularization, 99.79% accurate; takes about a day to train; more difficult to build
Logistic regression: 92.5% accurate; takes seconds to train; can be built with less expertise (see the sketch below)
- Building Blocks of Later Techniques
Optional Reading
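As a quick stand-in experiment, here is a minimal sketch of a logistic-regression digit classifier using scikit-learn's small built-in digits dataset (8x8 images, not the full 28x28 MNIST set, so the accuracy will differ from the slide's figures):

```python
# Logistic regression as a fast linear baseline for digit classification.
# Uses scikit-learn's bundled 8x8 digits dataset as a stand-in for MNIST.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

clf = LogisticRegression(max_iter=1000)  # trains in seconds
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```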
Definition of 1-Dimensional OLS
- The Problem Statement
i indexes observations; we have N of them, i = 1…N
x is the independent variable (feature)
y is the dependent variable (output variable)
y = ax + b, where a and b are constants
\hat{y}_i = a x_i + b, or equivalently y_i = a x_i + b + \varepsilon_i
Two unknowns: we want to solve for a and b
The Loss Function
- L = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2
- The goal is to minimize this function
- Substituting \hat{y}_i = a x_i + b, the equation becomes:
- L = \sum_{i=1}^{N} (y_i - a x_i - b)^2
- This is the equation we want to minimize
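As a minimal sketch, this squared-error loss can be computed directly in NumPy (the array values here are arbitrary illustrative data):

```python
import numpy as np

def ols_loss(a, b, x, y):
    """Sum of squared residuals: L = sum_i (y_i - a*x_i - b)**2."""
    residuals = y - (a * x + b)
    return np.sum(residuals ** 2)

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])  # roughly y = 2x + 1
print(ols_loss(2.0, 1.0, x, y))     # small loss near the true line
```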
Solution of OLS
- Derivation
L = \sum_{i=1}^{N} (y_i - a x_i - b)^2
- We want to minimize L
- Take the derivative of the loss function with respect to each variable and set it to zero:
\frac{\partial L}{\partial a} = 0, \quad \frac{\partial L}{\partial b} = 0
\frac{\partial L}{\partial a} = \sum_{i=1}^{N} 2 (y_i - a x_i - b)(-x_i) = 0
\Rightarrow \sum_{i=1}^{N} x_i y_i - a \sum_{i=1}^{N} x_i^2 - b \sum_{i=1}^{N} x_i = 0
Solution of OLS
- Derivation
\frac{\partial L}{\partial b} = \sum_{i=1}^{N} 2 (y_i - a x_i - b)(-1) = 0
\Rightarrow \sum_{i=1}^{N} y_i - a \sum_{i=1}^{N} x_i - bN = 0
b = \frac{1}{N} \sum_{i=1}^{N} y_i - \frac{a}{N} \sum_{i=1}^{N} x_i
This is the closed-form solution for b (the mean of y minus a times the mean of x)
Solution of OLS
- Derivation
From the first equation,
\frac{\partial L}{\partial a} = \sum_{i=1}^{N} x_i y_i - a \sum_{i=1}^{N} x_i^2 - b \sum_{i=1}^{N} x_i = 0
\Rightarrow \sum_{i=1}^{N} x_i y_i = a \sum_{i=1}^{N} x_i^2 + \sum_{i=1}^{N} x_i \left( \frac{1}{N} \sum_{i=1}^{N} y_i - \frac{a}{N} \sum_{i=1}^{N} x_i \right)
a = \frac{\sum_{i=1}^{N} x_i y_i - \frac{1}{N} \sum_{i=1}^{N} x_i \sum_{i=1}^{N} y_i}{\sum_{i=1}^{N} x_i^2 - \frac{1}{N} \left( \sum_{i=1}^{N} x_i \right)^2}
This is the closed form solution for a
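A minimal sketch of these closed-form expressions in NumPy (the sample data are arbitrary illustrative values, and ols_1d is a hypothetical helper name):

```python
import numpy as np

def ols_1d(x, y):
    """Closed-form 1-D OLS: returns slope a and intercept b."""
    n = len(x)
    a = (np.sum(x * y) - np.sum(x) * np.sum(y) / n) / \
        (np.sum(x ** 2) - np.sum(x) ** 2 / n)
    b = np.mean(y) - a * np.mean(x)
    return a, b

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])   # roughly y = 2x + 1
a, b = ols_1d(x, y)
print(a, b)                          # close to 2 and 1
# Cross-check against NumPy's built-in least-squares fit:
print(np.polyfit(x, y, 1))           # [a, b]
```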
Solution of OLS
- Optimal Choices
Interpretation
- Interpretation of a and b
a is the slope of the line (the tangent of the angle θ): the effect of the independent variable on the dependent variable
b is the intercept of the line
[Figure: fitted line \hat{y} = ax + b, with x the independent variable, y the dependent variable, and θ the angle the line makes with the x-axis]
Interpretation
- Interpretation of L
L = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2
Expresses how well the solution captures the variation in the data:
R^2 = 1 - \mathrm{MSE}/\mathrm{Var}(y), \quad R^2 \in [0, 1]
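A minimal sketch of computing R^2 for a fitted line, reusing the hypothetical ols_1d helper from the earlier sketch:

```python
import numpy as np

def r_squared(x, y, a, b):
    """R^2 = 1 - MSE / Var(y) for the fit y_hat = a*x + b."""
    y_hat = a * x + b
    mse = np.mean((y - y_hat) ** 2)
    return 1.0 - mse / np.var(y)

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])
a, b = ols_1d(x, y)              # helper from the previous sketch
print(r_squared(x, y, a, b))     # close to 1 for near-linear data
```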
Anscombe's Quartet
[Figure: the four scatter plots of Anscombe's Quartet, each with the same best-fit line]
- Same values for mean, variance, and best-fit line
- R^2 values are the same for each example
- But linear regression may not be the best choice for the last three examples
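As a sketch of this point, fitting each of the four datasets gives nearly identical coefficients; this assumes seaborn's bundled "anscombe" example dataset is reachable (load_dataset fetches it from the seaborn-data repository):

```python
# Fit OLS to each of Anscombe's four datasets; the coefficients come out
# nearly identical even though the scatter plots look very different.
import numpy as np
import seaborn as sns

df = sns.load_dataset("anscombe")
for name, group in df.groupby("dataset"):
    a, b = np.polyfit(group["x"], group["y"], 1)
    print(f"dataset {name}: slope={a:.2f}, intercept={b:.2f}")
```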
Multivariable OLS
- Definition of Model
- Data Matrix
- The Loss Function
Multivariable OLS
- i = an observation
- N = number of observations
- i = 1…N
- M = number of features
- xi = [xi1, xi2, …, xiM]
- yi - dependent variable
- Data matrix: X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1M} \\ \vdots & \vdots & \ddots & \vdots \\ x_{N1} & x_{N2} & \cdots & x_{NM} \end{bmatrix}
Multivariable OLS
- y = ax + b \cdot \mathbf{1}
- Add a column of all 1's to the left of the data matrix so the bias term is included
- y_i = B_0 + B_1 x_{i1} + B_2 x_{i2} + \dots + B_M x_{iM}
- \hat{y}_i = x_i \cdot B, where B = [B_0, \dots, B_M]^T, so \hat{y} = XB
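A minimal sketch of building the augmented design matrix and evaluating \hat{y} = XB (the numbers are arbitrary illustrative values):

```python
import numpy as np

# Arbitrary illustrative data: N=4 observations, M=2 features.
X_raw = np.array([[1.0, 2.0],
                  [2.0, 0.5],
                  [3.0, 1.5],
                  [4.0, 3.0]])

# Prepend a column of ones so B[0] acts as the bias term B_0.
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])

B = np.array([1.0, 2.0, -0.5])   # [B0, B1, B2], made-up coefficients
y_hat = X @ B                    # y_hat_i = B0 + B1*x_i1 + B2*x_i2
print(y_hat)
```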
Multivariable OLS
- Loss Function
L = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2
- We still want to minimize L:
L = \sum_{i=1}^{N} \left( y_i - (B_0 + B_1 x_{i1} + \dots + B_M x_{iM}) \right)^2
L = \sum_{i=1}^{N} (y_i - x_i B)^2
- In norm notation, L is the squared L2 norm of the residual vector:
L = \| y - XB \|_2^2 = (y - XB)^T (y - XB)
Optimization
- A Few Facts from Matrix Calculus
\frac{d(ax)}{dx} = a
\frac{d(ax^2)}{dx} = 2ax
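These scalar facts have vector analogues; as a note not stated on the slide, the standard matrix-calculus identities used in the next step (for a constant vector a and a symmetric matrix A) are:

```latex
% Matrix analogues of the scalar derivative facts above,
% for a constant vector a and a symmetric matrix A:
\frac{\partial\, (a^{T} B)}{\partial B} = a,
\qquad
\frac{\partial\, (B^{T} A B)}{\partial B} = 2 A B .
```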
Optimization
- Minimizing the Loss
L = (y - XB)^T (y - XB)
\frac{\partial L}{\partial B} = 0
\frac{\partial}{\partial B} (y - XB)^T (y - XB) = 0
\frac{\partial}{\partial B} \left( y^T y - y^T XB - B^T X^T y + B^T X^T X B \right) = 0 \qquad \text{(using } (XY)^T = Y^T X^T \text{)}
-(X^T y) - (X^T y) + 2 (X^T X) B = 0
X^T y = (X^T X) B
B = (X^T X)^{-1} X^T y
(assuming X^T X is invertible, which is true when X has full rank, i.e., its columns are linearly independent)
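A minimal sketch of the normal equations in NumPy; the true coefficients below are made up, and in practice np.linalg.solve (or lstsq) is preferred over forming the explicit inverse:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 1 + 2*x1 - 0.5*x2 + noise (made-up coefficients).
N, M = 100, 2
X_raw = rng.normal(size=(N, M))
y = 1.0 + 2.0 * X_raw[:, 0] - 0.5 * X_raw[:, 1] + 0.1 * rng.normal(size=N)

X = np.hstack([np.ones((N, 1)), X_raw])   # bias column

# Normal equations: (X^T X) B = X^T y.
B = np.linalg.solve(X.T @ X, X.T @ y)
print(B)  # approximately [1.0, 2.0, -0.5]

# Equivalent, numerically safer route:
B_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(B_lstsq)
```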
OLS Pros and Cons
- OLS
Pros
- Efficient to compute
- Unique minimum
- Stable under perturbation of the data
- Easy to interpret
Cons
- Influenced by outliers (see the sketch below)
- (X^T X)^{-1} may not exist, since features may not be linearly independent
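As a small sketch of the outlier sensitivity (arbitrary illustrative data), corrupting a single observation noticeably shifts the fitted line:

```python
import numpy as np

x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0                   # exact line y = 2x + 1

print(np.polyfit(x, y, 1))          # [2.0, 1.0]

y_outlier = y.copy()
y_outlier[-1] += 30.0               # corrupt one observation
print(np.polyfit(x, y_outlier, 1))  # slope and intercept shift noticeably
```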
Summary
- Linear Models
- 1D Ordinary Least Squares (OLS)
- Solution of OLS
- Interpretation
- Anscombe’s Quartet
- Multivariate OLS
- OLS Pros and Cons