SLIDE 1

Linear Regression


Tufts COMP 135: Introduction to Machine Learning https://www.cs.tufts.edu/comp/135/2019s/

Many slides attributable to: Prof. Mike Hughes, Erik Sudderth (UCI), Finale Doshi-Velez (Harvard), and James, Witten, Hastie, Tibshirani (ISL/ESL books).

SLIDE 2

Objectives for Today (day 03)

  • Training “least squares” linear regression
    • Simplest case: 1-dim. features without intercept
    • Simple case: 1-dim. features with intercept
    • General case: many features with intercept
  • Concepts (algebraic and graphical view)
    • Where do the formulas come from?
    • When are optimal solutions unique?
  • Programming:
    • How to solve linear systems in Python
    • Hint: use np.linalg.solve; avoid np.linalg.inv (see the sketch after this list)
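As a quick illustration of that last hint, here is a minimal sketch (not from the original slides; the 3x3 system is made up) comparing the two NumPy calls:

```python
import numpy as np

# Made-up 3x3 system A x = b, just to illustrate the two calls.
A = np.array([[3.0, 1.0, 0.0],
              [1.0, 2.0, 1.0],
              [0.0, 1.0, 4.0]])
b = np.array([1.0, 2.0, 3.0])

# Preferred: solve the linear system directly.
x = np.linalg.solve(A, b)

# Discouraged: forming the explicit inverse is slower and less numerically stable.
x_via_inv = np.linalg.inv(A) @ b

print(np.allclose(x, x_via_inv))  # True (up to floating-point error)
```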

SLIDE 3

What will we learn?

[Figure: overview of supervised, unsupervised, and reinforcement learning. Supervised learning uses data-label pairs $\{x_n, y_n\}_{n=1}^{N}$ (data x, label y), a task, and a performance measure, organized into training, prediction, and evaluation stages.]

SLIDE 4


Task: Regression

Regression is the supervised learning task in which the label y is a numeric variable, e.g. sales in dollars.

SLIDE 5

Visualizing errors

SLIDE 6

Evaluation Metrics for Regression

  • mean squared error: $\text{MSE} = \frac{1}{N} \sum_{n=1}^{N} (y_n - \hat{y}_n)^2$
  • mean absolute error: $\text{MAE} = \frac{1}{N} \sum_{n=1}^{N} |y_n - \hat{y}_n|$

Today, we’ll focus on mean squared error (MSE). Mean squared error is smooth everywhere, has good analytical properties, and is widely studied, so it is a common choice. NB: In many applications, absolute error (or other error metrics) may be more suitable if computational or analytical convenience were not the chief concern.
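A minimal NumPy sketch of both metrics (the function names and example numbers below are illustrative, not from the slides):

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    # MSE = (1/N) * sum_n (y_n - yhat_n)^2
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean((y_true - y_pred) ** 2)

def mean_absolute_error(y_true, y_pred):
    # MAE = (1/N) * sum_n |y_n - yhat_n|
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean(np.abs(y_true - y_pred))

y_N    = np.array([3.0, -0.5, 2.0, 7.0])   # true responses (made-up)
yhat_N = np.array([2.5,  0.0, 2.0, 8.0])   # predicted responses (made-up)
print(mean_squared_error(y_N, yhat_N))     # 0.375
print(mean_absolute_error(y_N, yhat_N))    # 0.5
```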
SLIDE 7

Linear Regression 1-dim features, no bias

Parameters: weight scalar $w$

Prediction:

$$\hat{y}(x_i) \triangleq w \cdot x_{i1}$$

Graphical interpretation: pick a line with slope w that goes through the origin (figure shows example lines with w = 1.0, w = 0.5, w = 0.0).

Training:

Input: training set of N observed examples of features x and responses y.
Output: value of w that minimizes mean squared error on the training set.

SLIDE 8

Training for 1-dim, no-bias LR


Training objective: minimize squared error (“least squares” estimation):

$$\min_{w \in \mathbb{R}} \; \sum_{n=1}^{N} \big( y_n - \hat{y}(x_n, w) \big)^2$$

Formula for the parameter that minimizes the objective:

$$w^* = \frac{\sum_{n=1}^{N} y_n x_n}{\sum_{n=1}^{N} x_n^2}$$

When can you use this formula? When you observe at least 1 example with a non-zero feature. Otherwise, all possible w values yield the same (lowest possible) training error, because every line in our hypothesis space goes through the origin, so when every feature is zero the predictions no longer depend on w.

How to derive the formula (see notes):

  • 1. Compute gradient of objective, as a function of w
  • 2. Set gradient equal to zero and solve for w
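The closed-form estimate translates directly into NumPy. A minimal sketch under the stated assumption (at least one non-zero feature); the function name and toy numbers are illustrative:

```python
import numpy as np

def fit_1d_no_bias(x_N, y_N):
    # w* = (sum_n y_n x_n) / (sum_n x_n^2)
    x_N = np.asarray(x_N, dtype=float)
    y_N = np.asarray(y_N, dtype=float)
    denom = np.sum(x_N ** 2)
    if denom == 0.0:
        raise ValueError("Need at least one example with a non-zero feature.")
    return np.sum(y_N * x_N) / denom

# Toy data roughly following y = 2x (made-up numbers)
x_N = np.array([1.0, 2.0, 3.0])
y_N = np.array([2.1, 3.9, 6.0])
print(fit_1d_no_bias(x_N, y_N))  # approximately 2.0
```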
SLIDE 9

For details, see derivation notes

https://www.cs.tufts.edu/comp/135/2020f/notes/day03_linear_regression.pdf

SLIDE 10

Linear Regression 1-dim features, with bias

Parameters: weight scalar $w$, bias scalar $b$

Prediction:

$$\hat{y}(x_i) \triangleq w \cdot x_{i1} + b$$

Graphical interpretation: predict along a line with slope w and intercept b (figure shows example lines with w = 1.0, b = 0.0 and w = −0.2, b = 0.6).

Training:

Input: training set of N observed examples of features x and responses y.
Output: values of w and b that minimize mean squared error on the training set.

SLIDE 11

Training for 1-dim, with-bias LR


Training objective: minimize squared error (“least squares” estimation):

$$\min_{w \in \mathbb{R},\, b \in \mathbb{R}} \; \sum_{n=1}^{N} \big( y_n - \hat{y}(x_n, w, b) \big)^2$$

Formula for the parameters that minimize the objective:

$$w = \frac{\sum_{n=1}^{N} (x_n - \bar{x})(y_n - \bar{y})}{\sum_{n=1}^{N} (x_n - \bar{x})^2}, \qquad b = \bar{y} - w \bar{x}$$

where $\bar{x} = \mathrm{mean}(x_1, \ldots, x_N)$ and $\bar{y} = \mathrm{mean}(y_1, \ldots, y_N)$.

When can you use this formula? When you observe at least 2 examples with distinct 1-dim. features. Otherwise, many (w, b) values will be equally good (lowest possible training error), because many lines in our hypothesis space go through a single point.

How to derive the formula (see notes):

  • 1. Compute gradient of objective wrt w, as a function of w and b
  • 2. Compute gradient of objective wrt b, as a function of w and b
  • 3. Set (1) and (2) equal to zero and solve for w and b (2 equations, 2 unknowns)
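A sketch of this closed-form estimate in NumPy (function name and toy data are illustrative; assumes at least 2 distinct feature values, as noted above):

```python
import numpy as np

def fit_1d_with_bias(x_N, y_N):
    # w = sum_n (x_n - xbar)(y_n - ybar) / sum_n (x_n - xbar)^2
    # b = ybar - w * xbar
    x_N = np.asarray(x_N, dtype=float)
    y_N = np.asarray(y_N, dtype=float)
    xbar, ybar = x_N.mean(), y_N.mean()
    denom = np.sum((x_N - xbar) ** 2)
    if denom == 0.0:
        raise ValueError("Need at least 2 examples with distinct feature values.")
    w = np.sum((x_N - xbar) * (y_N - ybar)) / denom
    b = ybar - w * xbar
    return w, b

# Toy data roughly following y = 0.5x + 1 (made-up numbers)
x_N = np.array([0.0, 1.0, 2.0, 3.0])
y_N = np.array([1.1, 1.4, 2.0, 2.6])
print(fit_1d_with_bias(x_N, y_N))  # w near 0.5, b near 1.0
```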

SLIDE 12

Linear Regression F-dim features, with bias

Parameters: weight vector $w = [w_1, w_2, \ldots, w_F]$, bias scalar $b$

Prediction:

$$\hat{y}(x_i) \triangleq \sum_{f=1}^{F} w_f x_{if} + b$$

Graphical interpretation: predict along one plane in (F+1)-dim. space.

Training:

Input: training set of N observed examples of features x and responses y.
Output: values of w and b that minimize mean squared error on the training set.
SLIDE 13

Training for F-dim, with-bias LR


Training objective: minimize squared error (“least squares” estimation):

$$\min_{w \in \mathbb{R}^F,\, b \in \mathbb{R}} \; \sum_{n=1}^{N} \big( y_n - \hat{y}(x_n, w, b) \big)^2$$

Formula for the parameters that minimize the objective:

$$[w_1 \; \ldots \; w_F \; b]^T = (\tilde{X}^T \tilde{X})^{-1} \tilde{X}^T y$$

where

$$\tilde{X} = \begin{bmatrix} x_{11} & \ldots & x_{1F} & 1 \\ x_{21} & \ldots & x_{2F} & 1 \\ \vdots & & \vdots & \vdots \\ x_{N1} & \ldots & x_{NF} & 1 \end{bmatrix}, \qquad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}$$

When can you use this formula? When you observe at least F+1 examples whose bias-augmented feature vectors (the rows of $\tilde{X}$) are linearly independent, so that $\tilde{X}^T \tilde{X}$ is invertible. Otherwise, infinitely many (w, b) values will yield the lowest possible training error.

How to derive the formula (see notes):

  • 1. Compute gradient of objective wrt each entry of w, and wrt scalar b (F+1 total expressions)
  • 2. Set all gradients equal to zero and solve for w and b (F+1 equations, F+1 unknowns)
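In code, rather than forming the inverse $(\tilde{X}^T \tilde{X})^{-1}$ explicitly, we can solve the corresponding linear system with np.linalg.solve, as hinted on the objectives slide. A minimal sketch with illustrative names; it assumes $\tilde{X}^T \tilde{X}$ is invertible:

```python
import numpy as np

def fit_linear_regression(x_NF, y_N):
    # Append a column of ones so the last coefficient plays the role of b,
    # matching the parameter order [w_1, ..., w_F, b] on this slide.
    x_NF = np.asarray(x_NF, dtype=float)
    y_N = np.asarray(y_N, dtype=float)
    N = x_NF.shape[0]
    xtilde_NG = np.hstack([x_NF, np.ones((N, 1))])   # shape (N, F+1)

    # Solve (xtilde^T xtilde) theta = xtilde^T y instead of inverting.
    A = xtilde_NG.T @ xtilde_NG                       # (F+1, F+1)
    rhs = xtilde_NG.T @ y_N                           # (F+1,)
    theta_G = np.linalg.solve(A, rhs)

    w_F, b = theta_G[:-1], theta_G[-1]
    return w_F, b
```

Note that np.linalg.solve raises a LinAlgError when the system is singular, which lines up with the uniqueness condition above.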
SLIDE 14

More compact notation


$$\theta = [\, b \; w_1 \; w_2 \; \ldots \; w_F \,], \qquad \tilde{x}_n = [\, 1 \; x_{n1} \; x_{n2} \; \ldots \; x_{nF} \,]$$

$$\hat{y}(x_n, \theta) = \theta^T \tilde{x}_n$$

$$J(\theta) \triangleq \sum_{n=1}^{N} \big( y_n - \hat{y}(x_n, \theta) \big)^2$$
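With this compact notation, the prediction and the cost J(θ) are each one line of NumPy. A small sketch (names illustrative); this J(θ) is the quantity contoured on the next slide:

```python
import numpy as np

def predict(theta_G, xtilde_NG):
    # yhat_n = theta^T xtilde_n for every row xtilde_n = [1, x_n1, ..., x_nF]
    return xtilde_NG @ theta_G

def cost_J(theta_G, xtilde_NG, y_N):
    # J(theta) = sum_n (y_n - yhat_n)^2
    resid_N = y_N - predict(theta_G, xtilde_NG)
    return np.sum(resid_N ** 2)
```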

SLIDE 15

Visualizing the cost function

“Level set” contours: all points with the same function value.

SLIDE 16

Breakout!

  • Do the day03 lab!
  • Ask questions in Live Q&A on Piazza
