SLIDE 1

Penalized Linear Regression

Prof. Mike Hughes

Tufts COMP 135: Introduction to Machine Learning
https://www.cs.tufts.edu/comp/135/2020f/

Many slides attributable to: Erik Sudderth (UCI); Finale Doshi-Velez (Harvard); James, Witten, Hastie, Tibshirani (ISL/ESL books)

SLIDE 2

Today’s objectives (day 05)

  • Recap: Overfitting with high-degree features
  • Remedy: Add L2 penalty to the loss ("Ridge")
      • Avoid high magnitude weights
  • Remedy: Add L1 penalty to the loss ("Lasso")
      • Avoid high magnitude weights
      • Often, some weights exactly zero (feature selection)


SLIDE 3

What will we learn?


[Diagram: supervised learning, contrasted with unsupervised learning and reinforcement learning. Supervised learning: data, label pairs $\{x_n, y_n\}_{n=1}^{N}$, a performance measure, and a task mapping data $x$ to label $y$, with training, prediction, and evaluation stages.]

SLIDE 4


Task: Regression

[Diagram: regression is a supervised learning task: given input $x$, predict $y$, where $y$ is a numeric variable, e.g. sales in $$.]

SLIDE 5


Review: Linear Regression

Optimization problem: "Least Squares"

$$\min_{\theta \in \mathbb{R}^{F+1}} \; \sum_{n=1}^{N} \big( y_n - \hat{y}(x_n, \theta) \big)^2$$

Exact formula for estimating the optimal parameter vector values:

$$\theta^* = (\tilde{X}^T \tilde{X})^{-1} \tilde{X}^T y,
\qquad
\tilde{X} = \begin{bmatrix}
x_{11} & \dots & x_{1F} & 1 \\
x_{21} & \dots & x_{2F} & 1 \\
\vdots &       & \vdots & \vdots \\
x_{N1} & \dots & x_{NF} & 1
\end{bmatrix},
\qquad
y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}$$

You can use this formula when you observe at least F+1 examples that are linearly independent. Otherwise, many theta values yield the lowest possible training error (many linear functions make perfect predictions on the training set).
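
As a minimal sketch (not from the slides; the helper name is ours), this formula in NumPy, with the all-ones column supplying the bias term:

```python
# Minimal sketch (NumPy): theta* = (X~^T X~)^{-1} X~^T y via a linear solve.
import numpy as np

def fit_least_squares(X, y):
    """X: (N, F) raw features; y: (N,) targets. Returns theta of shape (F+1,)."""
    X_tilde = np.hstack([X, np.ones((X.shape[0], 1))])  # append the all-ones bias column
    # Solving the linear system is numerically safer than forming the inverse.
    return np.linalg.solve(X_tilde.T @ X_tilde, X_tilde.T @ y)
```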

SLIDE 6


Review: Linear Regression with Transformed Features

Prediction with transformed features:

$$\hat{y}(x_i) = \theta^T \phi(x_i),
\qquad
\phi(x_i) = [1 \;\; \phi_1(x_i) \;\; \phi_2(x_i) \;\; \dots \;\; \phi_{G-1}(x_i)]$$

Optimization problem: "Least Squares"

$$\min_{\theta} \sum_{n=1}^{N} \big( y_n - \theta^T \phi(x_n) \big)^2$$

Exact solution:

$$\theta^* = (\Phi^T \Phi)^{-1} \Phi^T y,
\qquad
\Phi = \begin{bmatrix}
1 & \phi_1(x_1) & \dots & \phi_{G-1}(x_1) \\
1 & \phi_1(x_2) & \dots & \phi_{G-1}(x_2) \\
\vdots & & \ddots & \vdots \\
1 & \phi_1(x_N) & \dots & \phi_{G-1}(x_N)
\end{bmatrix}$$

Here $\Phi$ is an N x G matrix and $\theta$ is a G x 1 vector.
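
As a minimal sketch (hypothetical helper names; assumes 1-d inputs), building $\Phi$ from a list of basis functions and reusing the same linear solve:

```python
# Minimal sketch (NumPy): least squares on transformed features.
import numpy as np

def make_Phi(x, basis_funcs):
    """x: (N,) inputs; basis_funcs: [phi_1, ..., phi_{G-1}], each mapping (N,) -> (N,).
    Returns the N x G matrix whose rows are [1, phi_1(x_n), ..., phi_{G-1}(x_n)]."""
    return np.column_stack([np.ones(x.shape[0])] + [f(x) for f in basis_funcs])

def fit_transformed(x, y, basis_funcs):
    Phi = make_Phi(x, basis_funcs)
    return np.linalg.solve(Phi.T @ Phi, Phi.T @ y)  # theta* = (Phi^T Phi)^{-1} Phi^T y

# e.g. 3rd degree polynomial features (G = 4):
# theta = fit_transformed(x, y, [lambda t: t, lambda t: t ** 2, lambda t: t ** 3])
```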

SLIDE 7

0th degree polynomial features

Credit: Slides from course by Prof. Erik Sudderth (UCI)

[Figure: dashed curve = true function; dots = training data; solid curve = predictions from linear regression using polynomial features]

$\phi(x_i) = [1]$

# parameters: G = 1

SLIDE 8


1st degree polynomial features

[Figure: dashed curve = true function; dots = training data; solid curve = predictions from linear regression using polynomial features]

$\phi(x_i) = [1 \;\; x_{i1}]$

# parameters: G = 2

SLIDE 9


3rd degree polynomial features

[Figure: dashed curve = true function; dots = training data; solid curve = predictions from linear regression using polynomial features]

$\phi(x_i) = [1 \;\; x_{i1} \;\; x_{i1}^2 \;\; x_{i1}^3]$

# parameters: G = 4

SLIDE 10

9th degree polynomial features


[Figure: dashed curve = true function; dots = training data; solid curve = predictions from linear regression using polynomial features]

$\phi(x_i) = [1 \;\; x_{i1} \;\; x_{i1}^2 \;\; x_{i1}^3 \;\; \dots \;\; x_{i1}^9]$

# parameters: G = 10

SLIDE 11

Error vs Complexity


[Figure: mean squared error plotted against polynomial degree]

SLIDE 12


Weight Values vs Complexity

[Table: estimated regression coefficients, shown for polynomial degrees 1, 3, and 9]

WOW! These values are very large.
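
To get a feel for this, a sketch on our own toy data (a synthetic example, not the slide's dataset):

```python
# Sketch (NumPy): fit polynomial features of degree 1, 3, 9 to noisy toy data
# and watch the largest weight magnitude grow with the degree.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=15)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=15)  # noisy "true function"

for degree in [1, 3, 9]:
    Phi = np.vander(x, degree + 1)                   # columns [x^d, ..., x, 1]
    theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)  # unpenalized least squares
    print(degree, np.abs(theta).max())               # typically huge by degree 9
```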

SLIDE 13

Idea: Add Penalty Term to Loss

Goal: Avoid finding weights with large magnitude.
Result: Ridge regression, a method with objective:

$$J(\theta) = \sum_{n=1}^{N} \big( y_n - \theta^T \phi(x_n) \big)^2 + \alpha \sum_{g=1}^{G} \theta_g^2$$

Hyperparameter: penalty strength "alpha", with $\alpha \ge 0$.

Alpha = 0 recovers the original unpenalized linear regression. Larger alpha means we prefer smaller magnitude weights.

Penalty term: the sum of squares of the entries of theta, i.e. the square of the "L2 norm" of the theta vector. Thus this method is also called "L2-penalized" linear regression.
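
As a minimal sketch (not from the slides; the helper name ridge_loss is ours), this objective in NumPy:

```python
# Minimal sketch (NumPy): the Ridge objective J(theta), written exactly as the sum above.
import numpy as np

def ridge_loss(theta, Phi, y, alpha):
    """Phi: (N, G) transformed features; theta: (G,) weights; alpha >= 0."""
    residuals = y - Phi @ theta  # y_n - theta^T phi(x_n) for every example n
    return np.sum(residuals ** 2) + alpha * np.sum(theta ** 2)
```
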
SLIDE 14

Rewrite in matrix notation?


Rewriting, this is equivalent to

$$J(\theta) = (y - \Phi\theta)^T (y - \Phi\theta) + \alpha\, \theta^T \theta$$

We can rewrite the sum of squares as an inner product of the theta vector with itself.

$$\Phi = \begin{bmatrix}
1 & \phi_1(x_1) & \dots & \phi_{G-1}(x_1) \\
1 & \phi_1(x_2) & \dots & \phi_{G-1}(x_2) \\
\vdots & & \ddots & \vdots \\
1 & \phi_1(x_N) & \dots & \phi_{G-1}(x_N)
\end{bmatrix} \;\; (N \times G),
\qquad
y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix} \;\; (N \times 1),
\qquad
\theta = \begin{bmatrix} \theta_1 \\ \theta_2 \\ \vdots \\ \theta_G \end{bmatrix} \;\; (G \times 1)$$

N: num. examples; G: num. transformed features
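
A quick numerical check of this equivalence, as a sketch with arbitrary random values:

```python
# Minimal sketch (NumPy): the sum form and the matrix form of J(theta) agree.
import numpy as np

rng = np.random.default_rng(0)
Phi, y, theta, alpha = rng.normal(size=(5, 3)), rng.normal(size=5), rng.normal(size=3), 0.5

sum_form = np.sum((y - Phi @ theta) ** 2) + alpha * np.sum(theta ** 2)
matrix_form = (y - Phi @ theta) @ (y - Phi @ theta) + alpha * (theta @ theta)
assert np.isclose(sum_form, matrix_form)
```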

SLIDE 15

Estimating weights for L2 penalized linear regression


Optimization problem: "Penalized Least Squares"

$$\min_{\theta} \; J(\theta) = (y - \Phi\theta)^T (y - \Phi\theta) + \alpha\, \theta^T \theta$$

Solution:

$$\theta^* = (\Phi^T \Phi + \alpha I_G)^{-1} \Phi^T y$$

where $I_G$ is the G x G identity matrix.

If alpha > 0, the matrix is always invertible! There is always one unique optimal theta vector, provided by this formula.
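
As a minimal sketch (helper name ours), the closed-form solve in NumPy:

```python
# Minimal sketch (NumPy): theta* = (Phi^T Phi + alpha * I_G)^{-1} Phi^T y.
import numpy as np

def fit_ridge(Phi, y, alpha):
    """Phi: (N, G) transformed features; alpha > 0 guarantees a unique solution."""
    G = Phi.shape[1]
    # For alpha > 0 the matrix Phi^T Phi + alpha*I is positive definite, hence
    # invertible, so this solve always succeeds.
    return np.linalg.solve(Phi.T @ Phi + alpha * np.eye(G), Phi.T @ y)
```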

SLIDE 16

What happens if we rescale a feature in unpenalized LR?


Suppose we changed units on the "volume" of the engine feature, from liters to mL. Remember, 1 liter = 1000 mL.

Before:

vol_in_L   hp      mi_per_gal
2.1        115.0   30.0
2.3        150.0   25.0
2.5        193.0   21.2

Learned weights: vol [-51.25], hp [0.15], bias [120.375]

After:

vol_in_mL  hp      mi_per_gal
2100       115.0   30.0
2300       150.0   25.0
2500       193.0   21.2

Learned weights: vol [-0.05125], hp [0.15], bias [120.375]

Answer: Just "rescale" the individual weight for that single feature. No other weights will change.
SLIDE 17

What happens if we rescale a feature in penalized LR?


Suppose we changed units on the "volume" of the engine feature, from liters to mL. Remember, 1 liter = 1000 mL.

Before:

vol_in_L   hp      mi_per_gal
2.1        115.0   30.0
2.3        150.0   25.0
2.5        193.0   21.2

Learned weights: vol [18.428], hp [-0.199], bias [13.407]

After:

vol_in_mL  hp      mi_per_gal
2100       115.0   30.0
2300       150.0   25.0
2500       193.0   21.2

Learned weights: vol [0.027], hp [-0.248], bias [1.439]

Answer: Because all weights contribute to the penalty term, ALL learned weights change. (Here alpha = 0.01.)
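
To see this behavior end to end, a sketch refitting on the slide's tiny table (note: scikit-learn's Ridge leaves the intercept unpenalized, unlike the formula on Slide 13, so the exact numbers may differ from the slide's):

```python
# Sketch (scikit-learn): refit Ridge(alpha=0.01) after rescaling only the volume column.
import numpy as np
from sklearn.linear_model import Ridge

X_liters = np.array([[2.1, 115.0], [2.3, 150.0], [2.5, 193.0]])  # vol_in_L, hp
y = np.array([30.0, 25.0, 21.2])                                 # mi_per_gal

X_ml = X_liters.copy()
X_ml[:, 0] *= 1000.0  # liters -> mL rescales one feature only

for name, X in [("liters", X_liters), ("mL", X_ml)]:
    model = Ridge(alpha=0.01).fit(X, y)
    print(name, model.coef_, model.intercept_)
# Unlike the unpenalized case, ALL learned coefficients change between the two
# fits, not just the one attached to the rescaled feature.
```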

SLIDE 18

Ridge Regression is more sensitive to the scale of your features.


Before feeding data into a Ridge regression model, you should standardize the scale of all features, so the penalty acts on each feature in a more uniform way (a pipeline sketch follows the list below).

  • Rescale each column between 0 and 1
      • sklearn's MinMaxScaler
      • https://scikit-learn.org/stable/modules/preprocessing.html#scaling-features-to-a-range
  • Transform each column to have mean 0 and variance 1
      • sklearn's StandardScaler
      • https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler

OR, you can impose your own feature-specific penalties if you want.
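
A minimal sketch of that preprocessing (the alpha value and the train/test variable names are placeholders, not from the slides):

```python
# Minimal sketch (scikit-learn): standardize features, then fit Ridge.
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler  # or MinMaxScaler for [0, 1] scaling

pipeline = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
# pipeline.fit(x_train, y_train)   # scaler statistics are estimated on training data only
# pipeline.predict(x_test)         # test data is rescaled with those same statistics
```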

SLIDE 19

Lasso Regression (L1 penalty)


$$\min_{\theta} \; (y - \Phi\theta)^T (y - \Phi\theta) + \alpha \sum_{g=1}^{G} |\theta_g|$$

Penalty term: the sum of absolute values of the entries, aka the L1 norm of the vector theta. (N: num. examples; G: num. transformed features.)

Like the L2 penalty (Ridge), the Lasso objective above encourages small magnitude weights. We'll see in lab: the L1 penalty encourages theta to be a sparse vector. At modest alpha values, many entries of theta could be exactly zero. This can be used for feature selection (only features with non-zero weights matter).
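
A minimal sketch of that sparsity effect on synthetic data (the data, seed, and alpha are chosen purely for illustration):

```python
# Sketch (scikit-learn): at a modest alpha, Lasso sets many weights exactly to zero.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.1, size=100)  # only 2 features matter

model = Lasso(alpha=0.1).fit(X, y)
print(model.coef_)                  # most entries are exactly 0.0
print(np.flatnonzero(model.coef_))  # indices of the selected (non-zero) features
```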

SLIDE 20

Today’s objectives (day 05)

  • Recap: Overfitting with high-degree features
  • Remedy: Add L2 penalty to the loss ("Ridge")
      • Avoid high magnitude weights
  • Remedy: Add L1 penalty to the loss ("Lasso")
      • Avoid high magnitude weights
      • Often, some weights exactly zero (feature selection)
