

SLIDE 1

CSCI 447/547 MACHINE LEARNING

Linear Regression

SLIDE 2

Outline

  • Linear Models
  • 1D Ordinary Least Squares (OLS)
  • Solution of OLS
  • Interpretation
  • Anscombe’s Quartet
  • Multivariate OLS
  • OLS Pros and Cons
SLIDE 3

Optional Reading

SLIDE 4

Terminology

  • Features (Covariates or predictors)
  • Labels (Variates or targets)
  • Regression
  • Classification
SLIDE 5

Types of Machine Learning

  • Unsupervised
    – Finding structure in data
  • Supervised
    – Predict from given data

[Figure: two height-vs-weight scatter plots. Left: classification – categorical output data (e.g., women vs. men), handled by logistic regression. Right: regression (prediction) – continuous output data, handled by OLS regression.]

SLIDE 6

What is a Linear Model?

  • Predict Housing Prices
    – Depends on:
      – Area
      – # of bedrooms
      – # of bathrooms
    – Hypothesis is that the relationship is linear:
      Price = k1(Area) + k2(#bed) + k3(#bath)
      yi = a0 + a1x1 + a2x2 + …
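To make the hypothesis concrete, here is a minimal Python sketch; the base price and the coefficients are invented purely for illustration, not estimated from data.

    # Minimal sketch of the linear housing-price hypothesis above.
    # All coefficient values are invented for illustration only.
    def predict_price(area: float, n_bed: int, n_bath: int) -> float:
        a0 = 50_000.0                            # hypothetical base price
        a1, a2, a3 = 120.0, 10_000.0, 5_000.0    # hypothetical k1, k2, k3
        return a0 + a1 * area + a2 * n_bed + a3 * n_bath

    print(predict_price(1500, 3, 2))  # 50000 + 180000 + 30000 + 10000 = 270000.0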

SLIDE 7

Why Use Linear Models?

  • Interpretable
    – Relationships are easy to see
  • Low Complexity
    – Prevents overfitting
  • Scalable
    – Scale up to more data, larger problems
  • Baseline
    – Can benchmark other methods against them

SLIDE 8

Examples of Use

  • Example of Use
    – MNIST dataset – handwritten digits
    – Best performance – neural networks and regularization
      – 99.79% accurate
      – Takes about a day to train
      – More difficult to build
    – Logistic Regression (see the sketch below)
      – 92.5% accurate
      – Takes seconds to train
      – Can be built with less expertise
  • Building Blocks of Later Techniques
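As a rough sketch of that logistic-regression baseline (not the course's actual setup), the snippet below fits scikit-learn's LogisticRegression on its small built-in 8×8 digits dataset as a stand-in for full MNIST, so the printed accuracy will differ from the 92.5% quoted:

    # Sketch: logistic regression as a fast digit-classification baseline.
    # Uses sklearn's small 8x8 digits set as a stand-in for full MNIST.
    from sklearn.datasets import load_digits
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = load_digits(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    clf = LogisticRegression(max_iter=1000)  # trains in seconds, not days
    clf.fit(X_train, y_train)
    print(f"test accuracy: {clf.score(X_test, y_test):.3f}")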
SLIDE 9

Optional Reading

SLIDE 10

Definition of 1-Dimensional OLS

  • The Problem Statement
    – i is an observation; we have N of them, i = 1…N
    – x is the independent variable (feature)
    – y is the dependent variable (output variable)
    – Model: y = ax + b, where a and b are constants
    – ŷi = axi + b, OR yi = axi + b + εi
    – Two unknowns – want to solve for a and b

SLIDE 11

The Loss Function

  • L = ∑i=1..N (yi – ŷi)²
  • Goal is to minimize this function
  • Using ŷi = axi + b, the equation becomes:
  • L = ∑i=1..N (yi – axi – b)²
  • So this is the equation we want to minimize

SLIDE 12

Solution of OLS

  • Derivation
    – L = ∑i=1..N (yi – axi – b)²
    – Want to minimize L
    – Take the derivative of the loss function with respect to each variable:
      ∂L/∂a = 0, ∂L/∂b = 0
    – ∂L/∂a = ∑i=1..N 2(yi – axi – b)(–xi) = 0
      ⇒ ∑i=1..N xiyi – a ∑i=1..N xi² – b ∑i=1..N xi = 0

SLIDE 13

Solution of OLS

  • Derivation
    – ∂L/∂b = ∑i=1..N 2(yi – axi – b)(–1) = 0
      ⇒ ∑i=1..N yi – a ∑i=1..N xi – bN = 0
    – b = (1/N) ∑i=1..N yi – (a/N) ∑i=1..N xi
    – This is the closed-form solution for b

SLIDE 14

Solution of OLS

  • Derivation
    – From the first equation:
      ∂L/∂a = 0 ⇒ ∑i=1..N xiyi – a ∑i=1..N xi² – b ∑i=1..N xi = 0
    – Substituting the solution for b:
      ∑i=1..N xiyi = a ∑i=1..N xi² + ∑i=1..N xi ((1/N) ∑i=1..N yi – (a/N) ∑i=1..N xi)
    – Solving for a:
      a = (∑i=1..N xiyi – (1/N) ∑i=1..N xi ∑i=1..N yi) / (∑i=1..N xi² – (1/N) (∑i=1..N xi)²)
    – This is the closed-form solution for a
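A minimal NumPy sketch of the two closed-form expressions just derived; the synthetic data, true parameters, and random seed are illustrative only.

    import numpy as np

    def ols_1d(x: np.ndarray, y: np.ndarray) -> tuple[float, float]:
        """Closed-form 1D OLS: returns (a, b) from the derivation above."""
        n = len(x)
        # a = (sum(x*y) - (1/n)*sum(x)*sum(y)) / (sum(x^2) - (1/n)*(sum(x))^2)
        a = (np.sum(x * y) - np.sum(x) * np.sum(y) / n) / \
            (np.sum(x ** 2) - np.sum(x) ** 2 / n)
        # b = (1/n)*sum(y) - (a/n)*sum(x)
        b = (np.sum(y) - a * np.sum(x)) / n
        return a, b

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 10, size=50)
    y = 2.0 * x + 1.0 + rng.normal(0, 0.5, size=50)  # true a = 2, b = 1
    print(ols_1d(x, y))  # should recover roughly (2.0, 1.0)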

SLIDE 15

Solution of OLS

  • Optimal Choices
SLIDE 16

Interpretation

  • Interpretation of a and b
    – a is the slope of the line
      – the tangent of the angle θ
      – the effect of the independent variable on the dependent variable
    – b is the intercept of the line

[Figure: fitted line at angle θ to the horizontal; x is the independent variable, y is the dependent variable.]

SLIDE 17

Interpretation

  • Interpretation of L
    – L = ∑i=1..N (yi – ŷi)²
    – Expresses how well the solution captures the variation in the data
    – R² = 1 – MSE/Var(y)
    – R² ∈ [0, 1]
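A small sketch of this definition in NumPy; the function name r_squared is ours, not a library API.

    import numpy as np

    def r_squared(y: np.ndarray, y_hat: np.ndarray) -> float:
        """R^2 = 1 - MSE / Var(y), as defined above."""
        mse = np.mean((y - y_hat) ** 2)
        return 1.0 - mse / np.var(y)

    y = np.array([1.0, 2.0, 3.0, 4.0])
    print(r_squared(y, y))                     # perfect fit -> 1.0
    print(r_squared(y, np.full(4, y.mean())))  # predicting the mean -> 0.0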

SLIDE 18

Interpretation

SLIDE 19

Anscombe’s Quartet

SLIDE 20

Anscombe’s Quartet

  • Same values for mean, variance, and best-fit line
  • R² values are the same for each example
  • But … linear regression may not be the best choice for the last three examples
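The claim is easy to check numerically. The sketch below uses the classic published values for the first two of Anscombe's four sets; the numbers are the well-known 1973 data, hard-coded here for illustration.

    import numpy as np

    # Classic values for Anscombe's sets I and II (x is shared by both).
    x  = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
    y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])
    y2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])

    for name, y in [("I", y1), ("II", y2)]:
        a, b = np.polyfit(x, y, 1)  # slope and intercept of the best-fit line
        print(f"set {name}: mean={y.mean():.2f}  var={y.var(ddof=1):.2f}  "
              f"fit: y = {a:.2f}x + {b:.2f}")
    # Both sets print mean 7.50, variance 4.13, and fit y = 0.50x + 3.00,
    # yet a plot shows set II is clearly curved, not linear.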

SLIDE 21

Multivariable OLS

  • Definition of Model
    – Data Matrix
    – The Loss Function

SLIDE 22

Multivariable OLS

  • i = an observation
  • N = number of observations, i = 1…N
  • M = number of features
  • xi = [xi1, xi2, …, xiM]
  • yi = the dependent variable
  • Data matrix (N × M):

    X = [ x11  x12  …  x1M ]
        [  …    …  …   …  ]
        [ xN1  xN2  …  xNM ]

SLIDE 23

Multivariable OLS

  • Data matrix (N × M):

    X = [ x11  x12  …  x1M ]
        [  …    …  …   …  ]
        [ xN1  xN2  …  xNM ]

  • In the 1D case the model was y = ax + b·(1)
  • Add a column of all 1's to the left of the data matrix so the bias term is included (see the sketch below)
  • ŷi = B0 + B1xi1 + B2xi2 + … + BMxiM
  • In vector form, ŷi = xi · B with B = [B0, B1, …, BM]ᵀ, so ŷ = XB
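A minimal NumPy sketch of the bias-column trick; the toy matrix and coefficient values are illustrative only.

    import numpy as np

    X = np.array([[2.0, 3.0],
                  [1.0, 0.0],
                  [4.0, 5.0]])                     # toy data matrix, N=3, M=2
    X_aug = np.column_stack([np.ones(len(X)), X])  # prepend the all-1's column
    B = np.array([10.0, 1.0, 2.0])                 # [B0, B1, B2]
    print(X_aug @ B)  # each entry is B0 + B1*xi1 + B2*xi2 -> [18. 11. 24.]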

SLIDE 24

Multivariable OLS

  • Loss Function
    – L = ∑i=1..N (yi – ŷi)²
  • Still want to minimize L
    – L = ∑i=1..N (yi – (B0 + B1xi1 + … + BMxiM))²
    – L = ∑i=1..N (yi – xiB)²
    – In norm notation – the squared L2 norm of the residual vector:
      L = ‖y – XB‖₂²
    – L = (y – XB)ᵀ(y – XB)

SLIDE 25

Optimization

  • A Few Facts from Matrix Calculus
    – Scalar versions: ∂(ax)/∂x = a and ∂(ax²)/∂x = 2ax
    – Matrix analogues: ∂(aᵀx)/∂x = a and ∂(xᵀAx)/∂x = 2Ax (for symmetric A)

SLIDE 26

Optimization

  • Minimizing the Loss
    – L = (y – XB)ᵀ(y – XB)
    – Set ∂L/∂B = 0
    – ∂[(y – XB)ᵀ(y – XB)]/∂B = 0
    – ∂(yᵀy – yᵀXB – BᵀXᵀy + BᵀXᵀXB)/∂B = 0   (using (XY)ᵀ = YᵀXᵀ)
    – –(Xᵀy) – (Xᵀy) + 2(XᵀX)B = 0
    – Xᵀy = (XᵀX)B
    – B = (XᵀX)⁻¹ Xᵀy (assuming XᵀX is invertible, which is true if X has full column rank, that is, none of its columns are linearly dependent)
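A minimal NumPy sketch of this closed form; it solves the normal equations (XᵀX)B = Xᵀy directly with np.linalg.solve rather than forming the explicit inverse, which is the usual numerical choice. The synthetic data is illustrative only.

    import numpy as np

    def ols_fit(X: np.ndarray, y: np.ndarray) -> np.ndarray:
        """Solve the normal equations (X^T X) B = X^T y.

        Raises np.linalg.LinAlgError when X^T X is singular,
        i.e. when X does not have full column rank.
        """
        return np.linalg.solve(X.T @ X, X.T @ y)

    rng = np.random.default_rng(1)
    X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])  # bias + 2 features
    B_true = np.array([1.0, 2.0, -3.0])
    y = X @ B_true + rng.normal(0, 0.1, size=100)
    print(ols_fit(X, y))  # approximately [ 1.  2. -3.]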

SLIDE 27

OLS Pros and Cons

  • OLS
    – Pros
      – Efficient to compute
      – Unique minimum
      – Stable under perturbation of the data
      – Easy to interpret
    – Cons
      – Influenced by outliers
      – (XᵀX)⁻¹ may not exist – features may not be linearly independent

SLIDE 28

Summary

  • Linear Models
  • 1D Ordinary Least Squares (OLS)
  • Solution of OLS
  • Interpretation
  • Anscombe’s Quartet
  • Multivariate OLS
  • OLS Pros and Cons