SLIDE 1

Penalized Linear Regression

Prof. Mike Hughes

Tufts COMP 135: Introduction to Machine Learning
https://www.cs.tufts.edu/comp/135/2020f/

Many slides attributable to: Erik Sudderth (UCI); Finale Doshi-Velez (Harvard); James, Witten, Hastie, Tibshirani (ISL/ESL books)

SLIDE 2

Today’s objectives (day 05)

  • Recap: Overfitting with high-degree features
  • Remedy: Add L2 penalty to the loss ("Ridge")
      • Avoid high magnitude weights
  • Remedy: Add L1 penalty to the loss ("Lasso")
      • Avoid high magnitude weights
      • Often, some weights exactly zero (feature selection)


SLIDE 3

What will we learn?


[Diagram: supervised learning, contrasted with unsupervised learning and reinforcement learning. Supervised learning: data, label pairs $\{x_n, y_n\}_{n=1}^{N}$, a performance measure, and a task mapping data $x$ to label $y$, with training, prediction, and evaluation stages.]

SLIDE 4


Task: Regression

[Diagram: regression is a supervised learning task: given input $x$, predict $y$, where $y$ is a numeric variable, e.g. sales in $$.]

SLIDE 5


Review: Linear Regression

Optimization problem: "Least Squares"

$$\min_{\theta \in \mathbb{R}^{F+1}} \; \sum_{n=1}^{N} \big( y_n - \hat{y}(x_n, \theta) \big)^2$$

Exact formula for estimating the optimal parameter vector values:

$$\theta^* = (\tilde{X}^T \tilde{X})^{-1} \tilde{X}^T y,
\qquad
\tilde{X} = \begin{bmatrix}
x_{11} & \dots & x_{1F} & 1 \\
x_{21} & \dots & x_{2F} & 1 \\
\vdots &       & \vdots & \vdots \\
x_{N1} & \dots & x_{NF} & 1
\end{bmatrix},
\qquad
y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}$$

You can use this formula when you observe at least F+1 examples that are linearly independent. Otherwise, many theta values yield the lowest possible training error (many linear functions make perfect predictions on the training set).
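
As a minimal sketch (not from the slides; the helper name is ours), this formula in NumPy, with the all-ones column supplying the bias term:

```python
# Minimal sketch (NumPy): theta* = (X~^T X~)^{-1} X~^T y via a linear solve.
import numpy as np

def fit_least_squares(X, y):
    """X: (N, F) raw features; y: (N,) targets. Returns theta of shape (F+1,)."""
    X_tilde = np.hstack([X, np.ones((X.shape[0], 1))])  # append the all-ones bias column
    # Solving the linear system is numerically safer than forming the inverse.
    return np.linalg.solve(X_tilde.T @ X_tilde, X_tilde.T @ y)
```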

SLIDE 6


Review: Linear Regression with Transformed Features

Prediction with transformed features:

$$\hat{y}(x_i) = \theta^T \phi(x_i),
\qquad
\phi(x_i) = [1 \;\; \phi_1(x_i) \;\; \phi_2(x_i) \;\; \dots \;\; \phi_{G-1}(x_i)]$$

Optimization problem: "Least Squares"

$$\min_{\theta} \sum_{n=1}^{N} \big( y_n - \theta^T \phi(x_n) \big)^2$$

Exact solution:

$$\theta^* = (\Phi^T \Phi)^{-1} \Phi^T y,
\qquad
\Phi = \begin{bmatrix}
1 & \phi_1(x_1) & \dots & \phi_{G-1}(x_1) \\
1 & \phi_1(x_2) & \dots & \phi_{G-1}(x_2) \\
\vdots & & \ddots & \vdots \\
1 & \phi_1(x_N) & \dots & \phi_{G-1}(x_N)
\end{bmatrix}$$

Here $\Phi$ is an N x G matrix and $\theta$ is a G x 1 vector.
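
As a minimal sketch (hypothetical helper names; assumes 1-d inputs), building $\Phi$ from a list of basis functions and reusing the same linear solve:

```python
# Minimal sketch (NumPy): least squares on transformed features.
import numpy as np

def make_Phi(x, basis_funcs):
    """x: (N,) inputs; basis_funcs: [phi_1, ..., phi_{G-1}], each mapping (N,) -> (N,).
    Returns the N x G matrix whose rows are [1, phi_1(x_n), ..., phi_{G-1}(x_n)]."""
    return np.column_stack([np.ones(x.shape[0])] + [f(x) for f in basis_funcs])

def fit_transformed(x, y, basis_funcs):
    Phi = make_Phi(x, basis_funcs)
    return np.linalg.solve(Phi.T @ Phi, Phi.T @ y)  # theta* = (Phi^T Phi)^{-1} Phi^T y

# e.g. 3rd degree polynomial features (G = 4):
# theta = fit_transformed(x, y, [lambda t: t, lambda t: t ** 2, lambda t: t ** 3])
```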

SLIDE 7

0th degree polynomial features

Credit: Slides from course by Prof. Erik Sudderth (UCI)

[Figure: dashed curve = true function; dots = training data; solid curve = predictions from linear regression using polynomial features]

$\phi(x_i) = [1]$

# parameters: G = 1

SLIDE 8


1st degree polynomial features

[Figure: dashed curve = true function; dots = training data; solid curve = predictions from linear regression using polynomial features]

$\phi(x_i) = [1 \;\; x_{i1}]$

# parameters: G = 2

SLIDE 9


3rd degree polynomial features

[Figure: dashed curve = true function; dots = training data; solid curve = predictions from linear regression using polynomial features]

$\phi(x_i) = [1 \;\; x_{i1} \;\; x_{i1}^2 \;\; x_{i1}^3]$

# parameters: G = 4

SLIDE 10

9th degree polynomial features


[Figure: dashed curve = true function; dots = training data; solid curve = predictions from linear regression using polynomial features]

$\phi(x_i) = [1 \;\; x_{i1} \;\; x_{i1}^2 \;\; x_{i1}^3 \;\; \dots \;\; x_{i1}^9]$

# parameters: G = 10

SLIDE 11

Error vs Complexity


[Figure: mean squared error plotted against polynomial degree]

SLIDE 12


Weight Values vs Complexity

[Table: estimated regression coefficients, shown for polynomial degrees 1, 3, and 9]

WOW! These values are very large.
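
To get a feel for this, a sketch on our own toy data (a synthetic example, not the slide's dataset):

```python
# Sketch (NumPy): fit polynomial features of degree 1, 3, 9 to noisy toy data
# and watch the largest weight magnitude grow with the degree.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=15)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=15)  # noisy "true function"

for degree in [1, 3, 9]:
    Phi = np.vander(x, degree + 1)                   # columns [x^d, ..., x, 1]
    theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)  # unpenalized least squares
    print(degree, np.abs(theta).max())               # typically huge by degree 9
```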

SLIDE 13

Idea: Add Penalty Term to Loss

Goal: Avoid finding weights with large magnitude.
Result: Ridge regression, a method with objective:

$$J(\theta) = \sum_{n=1}^{N} \big( y_n - \theta^T \phi(x_n) \big)^2 + \alpha \sum_{g=1}^{G} \theta_g^2$$

Hyperparameter: penalty strength "alpha", with $\alpha \ge 0$.

Alpha = 0 recovers the original unpenalized linear regression. Larger alpha means we prefer smaller magnitude weights.

Penalty term: the sum of squares of the entries of theta, i.e. the square of the "L2 norm" of the theta vector. Thus this method is also called "L2-penalized" linear regression.
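
As a minimal sketch (not from the slides; the helper name ridge_loss is ours), this objective in NumPy:

```python
# Minimal sketch (NumPy): the Ridge objective J(theta), written exactly as the sum above.
import numpy as np

def ridge_loss(theta, Phi, y, alpha):
    """Phi: (N, G) transformed features; theta: (G,) weights; alpha >= 0."""
    residuals = y - Phi @ theta  # y_n - theta^T phi(x_n) for every example n
    return np.sum(residuals ** 2) + alpha * np.sum(theta ** 2)
```
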
SLIDE 14

Rewrite in matrix notation?


Rewriting, this is equivalent to

$$J(\theta) = (y - \Phi\theta)^T (y - \Phi\theta) + \alpha\, \theta^T \theta$$

We can rewrite the sum of squares as an inner product of the theta vector with itself.

$$\Phi = \begin{bmatrix}
1 & \phi_1(x_1) & \dots & \phi_{G-1}(x_1) \\
1 & \phi_1(x_2) & \dots & \phi_{G-1}(x_2) \\
\vdots & & \ddots & \vdots \\
1 & \phi_1(x_N) & \dots & \phi_{G-1}(x_N)
\end{bmatrix} \;\; (N \times G),
\qquad
y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix} \;\; (N \times 1),
\qquad
\theta = \begin{bmatrix} \theta_1 \\ \theta_2 \\ \vdots \\ \theta_G \end{bmatrix} \;\; (G \times 1)$$

N: num. examples; G: num. transformed features
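
A quick numerical check of this equivalence, as a sketch with arbitrary random values:

```python
# Minimal sketch (NumPy): the sum form and the matrix form of J(theta) agree.
import numpy as np

rng = np.random.default_rng(0)
Phi, y, theta, alpha = rng.normal(size=(5, 3)), rng.normal(size=5), rng.normal(size=3), 0.5

sum_form = np.sum((y - Phi @ theta) ** 2) + alpha * np.sum(theta ** 2)
matrix_form = (y - Phi @ theta) @ (y - Phi @ theta) + alpha * (theta @ theta)
assert np.isclose(sum_form, matrix_form)
```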

SLIDE 15

Estimating weights for L2 penalized linear regression


Optimization problem: "Penalized Least Squares"

$$\min_{\theta} \; J(\theta) = (y - \Phi\theta)^T (y - \Phi\theta) + \alpha\, \theta^T \theta$$

Solution:

$$\theta^* = (\Phi^T \Phi + \alpha I_G)^{-1} \Phi^T y$$

where $I_G$ is the G x G identity matrix.

If alpha > 0, the matrix is always invertible! There is always one unique optimal theta vector, provided by this formula.
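
As a minimal sketch (helper name ours), the closed-form solve in NumPy:

```python
# Minimal sketch (NumPy): theta* = (Phi^T Phi + alpha * I_G)^{-1} Phi^T y.
import numpy as np

def fit_ridge(Phi, y, alpha):
    """Phi: (N, G) transformed features; alpha > 0 guarantees a unique solution."""
    G = Phi.shape[1]
    # For alpha > 0 the matrix Phi^T Phi + alpha*I is positive definite, hence
    # invertible, so this solve always succeeds.
    return np.linalg.solve(Phi.T @ Phi + alpha * np.eye(G), Phi.T @ y)
```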

SLIDE 16

What happens if we rescale a feature in unpenalized LR?


Suppose we changed units on the "volume" of the engine feature, from liters to mL. Remember, 1 liter = 1000 mL.

Before:

vol_in_L   hp      mi_per_gal
2.1        115.0   30.0
2.3        150.0   25.0
2.5        193.0   21.2

Learned weights: vol [-51.25], hp [0.15], bias [120.375]

After:

vol_in_mL  hp      mi_per_gal
2100       115.0   30.0
2300       150.0   25.0
2500       193.0   21.2

Learned weights: vol [-0.05125], hp [0.15], bias [120.375]

Answer: Just "rescale" the individual weight for that single feature. No other weights will change.
SLIDE 17

What happens if we rescale a feature in penalized LR?


Suppose we changed units on the "volume" of the engine feature, from liters to mL. Remember, 1 liter = 1000 mL.

Before:

vol_in_L   hp      mi_per_gal
2.1        115.0   30.0
2.3        150.0   25.0
2.5        193.0   21.2

Learned weights: vol [18.428], hp [-0.199], bias [13.407]

After:

vol_in_mL  hp      mi_per_gal
2100       115.0   30.0
2300       150.0   25.0
2500       193.0   21.2

Learned weights: vol [0.027], hp [-0.248], bias [1.439]

Answer: Because all weights contribute to the penalty term, ALL learned weights change. (Here alpha = 0.01.)
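
To see this behavior end to end, a sketch refitting on the slide's tiny table (note: scikit-learn's Ridge leaves the intercept unpenalized, unlike the formula on Slide 13, so the exact numbers may differ from the slide's):

```python
# Sketch (scikit-learn): refit Ridge(alpha=0.01) after rescaling only the volume column.
import numpy as np
from sklearn.linear_model import Ridge

X_liters = np.array([[2.1, 115.0], [2.3, 150.0], [2.5, 193.0]])  # vol_in_L, hp
y = np.array([30.0, 25.0, 21.2])                                 # mi_per_gal

X_ml = X_liters.copy()
X_ml[:, 0] *= 1000.0  # liters -> mL rescales one feature only

for name, X in [("liters", X_liters), ("mL", X_ml)]:
    model = Ridge(alpha=0.01).fit(X, y)
    print(name, model.coef_, model.intercept_)
# Unlike the unpenalized case, ALL learned coefficients change between the two
# fits, not just the one attached to the rescaled feature.
```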

SLIDE 18

Ridge Regression is more sensitive to the scale of your features.


Before feeding data into a Ridge regression model, you should standardize the scale of all features, so the penalty acts on each feature in a more uniform way (a pipeline sketch follows the list below).

  • Rescale each column between 0 and 1
      • sklearn's MinMaxScaler
      • https://scikit-learn.org/stable/modules/preprocessing.html#scaling-features-to-a-range
  • Transform each column to have mean 0 and variance 1
      • sklearn's StandardScaler
      • https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler

OR, you can impose your own feature-specific penalties if you want.
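
A minimal sketch of that preprocessing (the alpha value and the train/test variable names are placeholders, not from the slides):

```python
# Minimal sketch (scikit-learn): standardize features, then fit Ridge.
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler  # or MinMaxScaler for [0, 1] scaling

pipeline = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
# pipeline.fit(x_train, y_train)   # scaler statistics are estimated on training data only
# pipeline.predict(x_test)         # test data is rescaled with those same statistics
```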

SLIDE 19

Lasso Regression (L1 penalty)


$$\min_{\theta} \; (y - \Phi\theta)^T (y - \Phi\theta) + \alpha \sum_{g=1}^{G} |\theta_g|$$

Penalty term: the sum of absolute values of the entries, aka the L1 norm of the vector theta. (N: num. examples; G: num. transformed features.)

Like the L2 penalty (Ridge), the Lasso objective above encourages small magnitude weights. We'll see in lab: the L1 penalty encourages theta to be a sparse vector. At modest alpha values, many entries of theta could be exactly zero. This can be used for feature selection (only features with non-zero weights matter).
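
A minimal sketch of that sparsity effect on synthetic data (the data, seed, and alpha are chosen purely for illustration):

```python
# Sketch (scikit-learn): at a modest alpha, Lasso sets many weights exactly to zero.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.1, size=100)  # only 2 features matter

model = Lasso(alpha=0.1).fit(X, y)
print(model.coef_)                  # most entries are exactly 0.0
print(np.flatnonzero(model.coef_))  # indices of the selected (non-zero) features
```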

SLIDE 20

Today’s objectives (day 05)

  • Recap: Overfitting with high-degree features
  • Remedy: Add L2 penalty to the loss ("Ridge")
      • Avoid high magnitude weights
  • Remedy: Add L1 penalty to the loss ("Lasso")
      • Avoid high magnitude weights
      • Often, some weights exactly zero (feature selection)
