SLIDE 1

Linear Regression

Jia-Bin Huang Virginia Tech

Spring 2019

ECE-5424G / CS-5824

SLIDE 2

SLIDE 3

SLIDE 4

Administrative

  • Office hour
  • Chen Gao
  • Shih-Yang Su
  • Feedback (Thanks!)
  • Notation?
  • More descriptive slides?
  • Video/audio recording?
  • TA hours (uniformly spread over the week)?
SLIDE 5

Recap: Machine learning algorithms

           | Supervised Learning | Unsupervised Learning
Discrete   | Classification      | Clustering
Continuous | Regression          | Dimensionality reduction

SLIDE 6

Recap: Nearest neighbor classifier

  • Training data

๐‘ฆ 1 , ๐‘ง 1 , ๐‘ฆ 2 , ๐‘ง 2 , โ‹ฏ , ๐‘ฆ ๐‘‚ , ๐‘ง ๐‘‚

  • Learning

Do nothing.

  • Testing

โ„Ž ๐‘ฆ = ๐‘ง(๐‘™), where ๐‘™ = argmini ๐ธ(๐‘ฆ, ๐‘ฆ(๐‘—))

SLIDE 7

Recap: Instance/Memory-based Learning

  • 1. A distance metric
  • Continuous? Discrete? PDF? Gene data? Learn the metric?
  • 2. How many nearby neighbors to look at?
  • 1? 3? 5? 15?
  • 3. A weighting function (optional)
  • Closer neighbors matter more
  • 4. How to fit with the local points?
  • Kernel regression

Slide credit: Carlos Guestrin
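
One common way items 1-4 come together is Nadaraya-Watson-style kernel regression; a sketch assuming a Gaussian weighting function (the names and the bandwidth parameter are illustrative, not from the slide):

```python
import numpy as np

def kernel_regress(X_train, y_train, x_query, bandwidth=1.0):
    """Predict a weighted average of all training labels."""
    dists = np.linalg.norm(X_train - x_query, axis=1)  # 1. distance metric
    w = np.exp(-dists**2 / (2 * bandwidth**2))         # 3. closer neighbors weigh more
    return w @ y_train / w.sum()                       # 4. fit: weighted mean
```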

SLIDE 8

Validation set

  • Splitting the training set: hold out a fake test set to tune hyper-parameters

Slide credit: CS231 @ Stanford

SLIDE 9

Cross-validation

  • 5-fold cross-validation: split the training data into 5 equal folds
  • Use 4 of them for training and 1 for validation; rotate which fold is held out (see the sketch below)

Slide credit: CS231 @ Stanford
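
A sketch of the 5-fold split described above (the helper name and seed are hypothetical):

```python
import numpy as np

def five_folds(m, seed=0):
    """Shuffle m example indices and split them into 5 roughly equal folds."""
    perm = np.random.default_rng(seed).permutation(m)
    return np.array_split(perm, 5)

folds = five_folds(47)
for f in range(5):                      # each fold takes one turn as validation set
    val_idx = folds[f]
    train_idx = np.concatenate([folds[g] for g in range(5) if g != f])
    # ...train with train_idx, score hyper-parameters on val_idx...
```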

SLIDE 10

Things to remember

  • Supervised Learning
  • Training/testing data; classification/regression; Hypothesis
  • k-NN
  • Simplest learning algorithm
  • With sufficient data, very hard to beat "strawman" approach
  • Kernel regression/classification
  • Set k to n (number of data points) and choose the kernel width
  • Smoother than k-NN
  • Problems with k-NN
  • Curse of dimensionality
  • Not robust to irrelevant features
  • Slow NN search: must remember (very large) dataset for prediction
SLIDE 11

Today's plan: Linear Regression

  • Model representation
  • Cost function
  • Gradient descent
  • Features and polynomial regression
  • Normal equation
SLIDE 12

Linear Regression

  • Model representation
  • Cost function
  • Gradient descent
  • Features and polynomial regression
  • Normal equation
SLIDE 13

Training set → Learning Algorithm → Hypothesis $h$

Size of house $x$ → $h$ → Estimated price $y$

Regression: real-valued output

SLIDE 14

House pricing prediction

[Scatter plot: size in feet^2 on the x-axis vs. price ($) in 1000's on the y-axis]

SLIDE 15
  • Notation:
  • $m$ = number of training examples
  • $x$ = input variable / features
  • $y$ = output variable / target variable
  • $(x, y)$ = one training example
  • $(x^{(i)}, y^{(i)})$ = $i$-th training example

Training set

Size in feet^2 (x) | Price ($) in 1000's (y)
2104 | 460
1416 | 232
1534 | 315
852 | 178
… | …

๐‘› = 47 Examples: ๐‘ฆ(1) = 2104 ๐‘ฆ(2) = 1416 ๐‘ง(1) = 460

Slide credit: Andrew Ng

SLIDE 16

Model representation

๐‘ง = โ„Ž๐œ„ ๐‘ฆ = ๐œ„0 + ๐œ„1๐‘ฆ

Shorthand: $h(x)$

Training set → Learning Algorithm → Hypothesis $h$: size of house → estimated price

[Plot: size in feet^2 vs. price ($) in 1000's, with a straight-line fit]

Univariate linear regression

Slide credit: Andrew Ng

SLIDE 17

Linear Regression

  • Model representation
  • Cost function
  • Gradient descent
  • Features and polynomial regression
  • Normal equation
SLIDE 18

Training set

  • Hypothesis: $h_\theta(x) = \theta_0 + \theta_1 x$

๐œ„0, ๐œ„1: parameters/weights How to choose ๐œ„๐‘—โ€™s?

Size in feet^2 (x) | Price ($) in 1000's (y)
2104 | 460
1416 | 232
1534 | 315
852 | 178
… | …

๐‘› = 47

Slide credit: Andrew Ng

SLIDE 19

โ„Ž๐œ„ ๐‘ฆ = ๐œ„0 + ๐œ„1๐‘ฆ

[Three plots of $h_\theta(x)$ over $x$:]

$\theta_0 = 1.5,\ \theta_1 = 0$ (constant); $\theta_0 = 0,\ \theta_1 = 0.5$ (line through the origin); $\theta_0 = 1,\ \theta_1 = 0.5$

Slide credit: Andrew Ng

SLIDE 20

Cost function

  • Idea:

Choose ๐œ„0, ๐œ„1 so that โ„Ž๐œ„ ๐‘ฆ is close to ๐‘ง for our training example (๐‘ฆ, ๐‘ง)

[Scatter plot: size in feet^2 ($x$) vs. price ($) in 1000's ($y$)]

โ„Ž๐œ„ ๐‘ฆ ๐‘— = ๐œ„0 + ๐œ„1๐‘ฆ(๐‘—)

๐พ ๐œ„0, ๐œ„1 = 1 2๐‘› เท

๐‘—=1 ๐‘›

โ„Ž๐œ„ ๐‘ฆ ๐‘— โˆ’ ๐‘ง ๐‘—

2

minimize

1 2๐‘› ฯƒ๐‘—=1 ๐‘›

โ„Ž๐œ„ ๐‘ฆ ๐‘— โˆ’ ๐‘ง ๐‘—

2

๐œ„0, ๐œ„1

minimize ๐พ ๐œ„0, ๐œ„1

๐œ„0, ๐œ„1

Cost function

Slide credit: Andrew Ng

SLIDE 21
  • Hypothesis: $h_\theta(x) = \theta_0 + \theta_1 x$
  • Parameters: $\theta_0, \theta_1$
  • Cost function: $J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$
  • Goal: $\min_{\theta_0, \theta_1} J(\theta_0, \theta_1)$

Simplified (fix $\theta_0 = 0$):

  • Hypothesis: $h_\theta(x) = \theta_1 x$
  • Parameters: $\theta_1$
  • Cost function: $J(\theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$
  • Goal: $\min_{\theta_1} J(\theta_1)$

Slide credit: Andrew Ng
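
A direct transcription of this cost function into NumPy; the toy data are the four examples from the earlier training-set table:

```python
import numpy as np

def cost(theta0, theta1, x, y):
    """J(theta0, theta1) = 1/(2m) * sum_i (h_theta(x^(i)) - y^(i))^2."""
    m = len(x)
    h = theta0 + theta1 * x               # h_theta(x) for every example at once
    return np.sum((h - y) ** 2) / (2 * m)

x = np.array([2104.0, 1416.0, 1534.0, 852.0])  # size in feet^2
y = np.array([460.0, 232.0, 315.0, 178.0])     # price in $1000's
print(cost(0.0, 0.2, x, y))  # J for the candidate line h(x) = 0.2 x
```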

SLIDE 22

โ„Ž๐œ„ ๐‘ฆ , function of ๐‘ฆ

1 2 3 1 2 3 ๐‘ฆ ๐‘ง 1 2 3 1 2 3 ๐พ ๐œ„1 ๐œ„1

๐พ ๐œ„1 , function of ๐œ„1

Slide credit: Andrew Ng

SLIDES 23-26

[The same pair of plots repeated for different values of $\theta_1$: each choice of $\theta_1$ gives one line $h_\theta(x)$ on the left and traces out one point on the cost curve $J(\theta_1)$ on the right.]

Slide credit: Andrew Ng

SLIDE 27
  • Hypothesis: โ„Ž๐œ„ ๐‘ฆ = ๐œ„0 + ๐œ„1๐‘ฆ
  • Parameters: ๐œ„0, ๐œ„1
  • Cost function: ๐พ ๐œ„0, ๐œ„1 =

1 2๐‘› ฯƒ๐‘—=1 ๐‘›

โ„Ž๐œ„ ๐‘ฆ ๐‘— โˆ’ ๐‘ง ๐‘—

2

  • Goal: minimize ๐พ ๐œ„0, ๐œ„1

๐œ„0, ๐œ„1

Slide credit: Andrew Ng

SLIDE 28

Cost function

Slide credit: Andrew Ng

SLIDE 29

How do we find good $\theta_0, \theta_1$ that minimize $J(\theta_0, \theta_1)$?

Slide credit: Andrew Ng

SLIDE 30

Linear Regression

  • Model representation
  • Cost function
  • Gradient descent
  • Features and polynomial regression
  • Normal equation
SLIDE 31

Gradient descent

Have some function $J(\theta_0, \theta_1)$. Want $\arg\min_{\theta_0, \theta_1} J(\theta_0, \theta_1)$.

Outline:

  • Start with some $\theta_0, \theta_1$
  • Keep changing $\theta_0, \theta_1$ to reduce $J(\theta_0, \theta_1)$ until we hopefully end up at a minimum

Slide credit: Andrew Ng

SLIDE 32

Slide credit: Andrew Ng

SLIDE 33

Gradient descent

Repeat until convergence {
    $\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$    (for $j = 0$ and $j = 1$)
}

$\alpha$: learning rate (step size)

$\frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$: partial derivative (rate of change)

Slide credit: Andrew Ng

SLIDE 34

Gradient descent

Incorrect:

    temp0 := $\theta_0 - \alpha \frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1)$
    $\theta_0$ := temp0
    temp1 := $\theta_1 - \alpha \frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1)$    (evaluated with the already-updated $\theta_0$)
    $\theta_1$ := temp1

Correct (simultaneous update):

    temp0 := $\theta_0 - \alpha \frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1)$
    temp1 := $\theta_1 - \alpha \frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1)$
    $\theta_0$ := temp0
    $\theta_1$ := temp1

Slide credit: Andrew Ng
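
In code the distinction is purely one of evaluation order: both partial derivatives must be computed from the current $(\theta_0, \theta_1)$ before either parameter is overwritten. A sketch using the linear-regression gradients derived a few slides later (the data values are made up):

```python
import numpy as np

def grad_J(theta0, theta1, x, y):
    """Partial derivatives of J with respect to theta0 and theta1."""
    m = len(x)
    err = theta0 + theta1 * x - y
    return err.sum() / m, (err * x).sum() / m

x = np.array([1.0, 2.0, 3.0]); y = np.array([1.0, 2.5, 3.5])
theta0, theta1, alpha = 0.0, 0.0, 0.1

# Correct: both gradients use the same (theta0, theta1) snapshot.
g0, g1 = grad_J(theta0, theta1, x, y)
theta0, theta1 = theta0 - alpha * g0, theta1 - alpha * g1

# Incorrect would be assigning theta0 first and then calling grad_J
# with the already-updated theta0 when computing theta1's update.
```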

SLIDE 35

๐œ„1 โ‰” ๐œ„1 โˆ’ ๐›ฝ ๐œ– ๐œ–๐œ„1 ๐พ ๐œ„1

1 2 3 1 2 3 ๐พ ๐œ„1 ๐œ„1

๐œ– ๐œ–๐œ„1 ๐พ ๐œ„1 > 0 ๐œ– ๐œ–๐œ„1 ๐พ ๐œ„1 < 0

Slide credit: Andrew Ng

SLIDE 36

Learning rate

SLIDE 37

Gradient descent for linear regression

Repeat until convergence {
    $\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$    (for $j = 0$ and $j = 1$)
}

  • Linear regression model

$h_\theta(x) = \theta_0 + \theta_1 x$

$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$

Slide credit: Andrew Ng

SLIDE 38

Computing partial derivative

  • $\frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) = \frac{\partial}{\partial \theta_j} \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 = \frac{\partial}{\partial \theta_j} \frac{1}{2m} \sum_{i=1}^{m} \left( \theta_0 + \theta_1 x^{(i)} - y^{(i)} \right)^2$

  • $j = 0$:  $\frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$

  • $j = 1$:  $\frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}$

Slide credit: Andrew Ng

SLIDE 39

Gradient descent for linear regression

Repeat until convergence {
    $\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$
    $\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}$
}

Update $\theta_0$ and $\theta_1$ simultaneously.

Slide credit: Andrew Ng
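
Putting the two update rules inside a loop gives the whole algorithm; a sketch on the toy housing data (the very small $\alpha$ compensates for the unscaled size feature, anticipating the feature-scaling discussion below):

```python
import numpy as np

def gradient_descent(x, y, alpha=1e-7, iters=10000):
    """Batch gradient descent for h_theta(x) = theta0 + theta1 * x."""
    m = len(x)
    theta0 = theta1 = 0.0
    for _ in range(iters):
        err = theta0 + theta1 * x - y             # h_theta(x^(i)) - y^(i)
        t0 = theta0 - alpha * err.sum() / m       # both updates computed first,
        t1 = theta1 - alpha * (err * x).sum() / m
        theta0, theta1 = t0, t1                   # ...then applied simultaneously
    return theta0, theta1

x = np.array([2104.0, 1416.0, 1534.0, 852.0])
y = np.array([460.0, 232.0, 315.0, 178.0])
print(gradient_descent(x, y))
```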

SLIDE 40

Batch gradient descent

  • "Batch": each step of gradient descent uses all the training examples

Repeat until convergence {
    $\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$
    $\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}$
}

$m$: number of training examples

Slide credit: Andrew Ng

SLIDE 41

Linear Regression

  • Model representation
  • Cost function
  • Gradient descent
  • Features and polynomial regression
  • Normal equation
SLIDE 42

Training dataset

Size in feet^2 (x) | Price ($) in 1000's (y)
2104 | 460
1416 | 232
1534 | 315
852 | 178
… | …

โ„Ž๐œ„ ๐‘ฆ = ๐œ„0 + ๐œ„1๐‘ฆ

Slide credit: Andrew Ng

SLIDE 43

Multiple features (input variables)

Size in feet^2 (๐‘ฆ1) Number of bedrooms (๐‘ฆ2) Number of floors (๐‘ฆ3) Age of home (years) (๐‘ฆ4) Price ($) in 1000โ€™s (y) 2104 5 1 45 460 1416 3 2 40 232 1534 3 2 30 315 852 2 1 36 178 โ€ฆ โ€ฆ Notation: ๐‘œ = Number of features ๐‘ฆ(๐‘—)= Input features of ๐‘—๐‘ขโ„Ž training example ๐‘ฆ๐‘˜

(๐‘—)= Value of feature ๐‘˜ in ๐‘—๐‘ขโ„Ž training example

๐‘ฆ3

(2) =?

๐‘ฆ3

(4) =?

Slide credit: Andrew Ng

SLIDE 44

Hypothesis

Previously: $h_\theta(x) = \theta_0 + \theta_1 x$

Now: $h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \theta_4 x_4$

Slide credit: Andrew Ng

SLIDE 45

โ„Ž๐œ„ ๐‘ฆ = ๐œ„0 + ๐œ„1๐‘ฆ1 + ๐œ„2๐‘ฆ2 + โ‹ฏ + ๐œ„๐‘œ๐‘ฆ๐‘œ

  • For convenience of notation, define ๐‘ฆ0 = 1

(๐‘ฆ0

(๐‘—) = 1 for all examples)

  • ๐’š =

๐‘ฆ0 ๐‘ฆ1 ๐‘ฆ2 โ‹ฎ ๐‘ฆ๐‘œ โˆˆ ๐‘†๐‘œ+1 ๐œพ = ๐œ„0 ๐œ„1 ๐œ„2 โ‹ฎ ๐œ„๐‘œ โˆˆ ๐‘†๐‘œ+1

  • โ„Ž๐œ„ ๐‘ฆ = ๐œ„0 + ๐œ„1๐‘ฆ1 + ๐œ„2๐‘ฆ2 + โ‹ฏ + ๐œ„๐‘œ๐‘ฆ๐‘œ

= ๐œพโŠค๐’š

Slide credit: Andrew Ng
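
The $x_0 = 1$ trick in code: prepend a 1 to the raw feature vector and the whole hypothesis collapses to a single dot product (all numbers here are made up for illustration):

```python
import numpy as np

theta = np.array([80.0, 0.1, 25.0])   # [theta0, theta1, theta2], illustrative values
x_raw = np.array([2104.0, 3.0])       # features x1, x2 of one house

x = np.concatenate(([1.0], x_raw))    # define x0 = 1
h = theta @ x                         # h_theta(x) = theta^T x
```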

SLIDE 46

Gradient descent

  • Previously ($n = 1$):

Repeat until convergence {
    $\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$
    $\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}$
}

  • New algorithm ($n \geq 1$):

Repeat until convergence {
    $\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$
}

Simultaneously update $\theta_j$ for $j = 0, 1, \cdots, n$.

Slide credit: Andrew Ng
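
Stacking the examples as rows of a matrix $X$ (with a leading column of ones for $x_0$) turns the per-parameter rule into one vectorized update, which is simultaneous by construction; a sketch:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, iters=1000):
    """Batch GD; X has shape (m, n+1) with X[:, 0] all ones (x0 = 1)."""
    m = X.shape[0]
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        err = X @ theta - y                        # h_theta(x^(i)) - y^(i) for all i
        theta = theta - alpha * (X.T @ err) / m    # every theta_j updated at once
    return theta
```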

SLIDE 47

Gradient descent in practice: Feature scaling

  • Idea: make sure features are on a similar scale (e.g., $-1 \leq x_i \leq 1$)
  • E.g., $x_1$ = size (0-2000 feet^2), $x_2$ = number of bedrooms (1-5)

[Contour plots of the cost over $(\theta_1, \theta_2)$: elongated ellipses without scaling, nearly circular contours with scaling]

Slide credit: Andrew Ng
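
The slide only asks that features end up on a similar scale; standardizing each column (subtract the mean, divide by the standard deviation) is one common way to get there:

```python
import numpy as np

def scale_features(X):
    """Standardize each feature column to mean 0 and standard deviation 1."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma, mu, sigma   # keep mu, sigma to scale test inputs too

X = np.array([[2104.0, 5.0], [1416.0, 3.0], [1534.0, 3.0], [852.0, 2.0]])
X_scaled, mu, sigma = scale_features(X)  # size and bedroom counts now comparable
```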

SLIDE 48

Gradient descent in practice: Learning rate

  • Automatic convergence test: declare convergence if $J(\theta)$ decreases by less than some small threshold in one iteration
  • $\alpha$ too small: slow convergence
  • $\alpha$ too large: may not converge
  • To choose $\alpha$, try 0.001, …, 0.01, …, 0.1, …, 1

Image credit: CS231n@Stanford
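
A sketch of the suggested sweep: run a fixed number of iterations per candidate $\alpha$ and watch whether $J$ shrinks (a huge or NaN cost flags an $\alpha$ that is too large); assumes $X$ already has the ones column and scaled features:

```python
import numpy as np

def cost(theta, X, y):
    return np.sum((X @ theta - y) ** 2) / (2 * len(y))

def sweep_alpha(X, y, alphas=(0.001, 0.01, 0.1, 1.0), iters=100):
    for alpha in alphas:
        theta = np.zeros(X.shape[1])
        for _ in range(iters):
            theta -= alpha * (X.T @ (X @ theta - y)) / len(y)
        print(f"alpha={alpha}: J={cost(theta, X, y):.4f}")
```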

SLIDE 49

House prices prediction

  • โ„Ž๐œ„ ๐‘ฆ = ๐œ„0 + ๐œ„1 ร— frontage + ๐œ„2 ร— depth
  • Area

๐‘ฆ = frontage ร— depth

  • โ„Ž๐œ„ ๐‘ฆ = ๐œ„0 + ๐œ„1๐‘ฆ

Slide credit: Andrew Ng

SLIDE 50

Polynomial regression

  • โ„Ž๐œ„ ๐‘ฆ = ๐œ„0 + ๐œ„1๐‘ฆ1 + ๐œ„2๐‘ฆ2 + ๐œ„3๐‘ฆ3

= ๐œ„0 + ๐œ„1 ๐‘ก๐‘—๐‘จ๐‘“ + ๐œ„2 ๐‘ก๐‘—๐‘จ๐‘“ 2 + ๐œ„3 ๐‘ก๐‘—๐‘จ๐‘“ 3

Price ($) in 1000โ€™s 500 1000 1500 2000 2500 100 200 300 400 Size in feet^2

๐‘ฆ1 = (size) ๐‘ฆ2 = (size)^2 ๐‘ฆ3 = (size)^3

Slide credit: Andrew Ng
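
Building the polynomial features is just stacking powers of the one raw input as columns; note how wildly the column ranges differ, which is exactly why the feature-scaling slide matters here:

```python
import numpy as np

size = np.array([2104.0, 1416.0, 1534.0, 852.0])

# x1 = size, x2 = size^2, x3 = size^3, plus the leading x0 = 1 column
X = np.column_stack([np.ones_like(size), size, size**2, size**3])
```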

SLIDE 51

Linear Regression

  • Model representation
  • Cost function
  • Gradient descent
  • Features and polynomial regression
  • Normal equation
SLIDE 52

($x_0$) | Size in feet^2 ($x_1$) | Number of bedrooms ($x_2$) | Number of floors ($x_3$) | Age of home in years ($x_4$) | Price ($) in 1000's ($y$)
1 | 2104 | 5 | 1 | 45 | 460
1 | 1416 | 3 | 2 | 40 | 232
1 | 1534 | 3 | 2 | 30 | 315
1 | 852 | 2 | 1 | 36 | 178
… | … | … | … | … | …

$y = \begin{bmatrix} 460 \\ 232 \\ 315 \\ 178 \end{bmatrix}$

$\theta = (X^\top X)^{-1} X^\top y$

Slide credit: Andrew Ng

SLIDE 53

Least squares solution

  • $J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 = \frac{1}{2m} \sum_{i=1}^{m} \left( \theta^\top x^{(i)} - y^{(i)} \right)^2 = \frac{1}{2m} \left\| X\theta - y \right\|_2^2$

  • Set the gradient to zero: $\frac{\partial}{\partial \theta} J(\theta) = 0$

  • $\theta = (X^\top X)^{-1} X^\top y$
SLIDE 54

Justification/interpretation 1

  • Loss minimization

$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 = \frac{1}{m} \sum_{i=1}^{m} \ell\!\left( h_\theta(x^{(i)}), y^{(i)} \right)$

  • $\ell(y, \hat{y}) = \frac{1}{2} \left\| y - \hat{y} \right\|_2^2$: least squares loss

  • Empirical Risk Minimization (ERM): $\frac{1}{m} \sum_{i=1}^{m} \ell\!\left( y^{(i)}, \hat{y} \right)$

SLIDE 55

Justification/interpretation 2

  • Probabilistic model
  • Assume a linear model with Gaussian errors:

$p_\theta\!\left( y^{(i)} \mid x^{(i)} \right) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{1}{2\sigma^2} \left( y^{(i)} - \theta^\top x^{(i)} \right)^2 \right)$

  • Solving maximum likelihood: maximizing the likelihood (equivalently, its log) is the same as minimizing the sum of squared errors:

$\arg\max_\theta \prod_{i=1}^{m} p_\theta\!\left( y^{(i)} \mid x^{(i)} \right) = \arg\max_\theta \log \prod_{i=1}^{m} p_\theta\!\left( y^{(i)} \mid x^{(i)} \right) = \arg\min_\theta \frac{1}{2\sigma^2} \sum_{i=1}^{m} \left( \theta^\top x^{(i)} - y^{(i)} \right)^2$

Image credit: CS 446@UIUC

SLIDE 56

Justification/interpretation 3

  • Geometric interpretation

$X = \begin{bmatrix} 1 & \leftarrow (x^{(1)})^\top \rightarrow \\ 1 & \leftarrow (x^{(2)})^\top \rightarrow \\ \vdots & \vdots \\ 1 & \leftarrow (x^{(m)})^\top \rightarrow \end{bmatrix} = \begin{bmatrix} \uparrow & \uparrow & \uparrow & & \uparrow \\ c_1 & c_2 & c_3 & \cdots & c_n \\ \downarrow & \downarrow & \downarrow & & \downarrow \end{bmatrix}$

  • $X\theta$ lies in the column space of $X$, i.e., $\mathrm{span}(\{c_1, c_2, \cdots, c_n\})$
  • The residual $X\theta - y$ is orthogonal to the column space of $X$
  • $X^\top (X\theta - y) = 0 \;\Rightarrow\; (X^\top X)\theta = X^\top y$

[Figure: $y$, its projection $X\theta$ onto the column space of $X$, and the residual $X\theta - y$]

SLIDE 57

$m$ training examples, $n$ features

Gradient descent

  • Need to choose $\alpha$
  • Needs many iterations
  • Works well even when $n$ is large

Normal equation

  • No need to choose $\alpha$
  • No need to iterate
  • Need to compute $(X^\top X)^{-1}$
  • Slow if $n$ is very large

Slide credit: Andrew Ng

SLIDE 58

Things to remember

  • Model representation

$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n = \theta^\top x$

  • Cost function

$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$

  • Gradient descent for linear regression

Repeat until convergence { $\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$ }

  • Features and polynomial regression

Can combine features; can use different functions to generate features (e.g., polynomial)

  • Normal equation

$\theta = (X^\top X)^{-1} X^\top y$
SLIDE 59

Next

  • Naรฏve Bayes, Logistic regression, Regularization