SLIDE 1

Linear Regression & Gradient Descent


Tufts COMP 135: Introduction to Machine Learning https://www.cs.tufts.edu/comp/135/2019s/

Many slides attributable to: Prof. Mike Hughes, Erik Sudderth (UCI), Finale Doshi-Velez (Harvard), and James, Witten, Hastie, Tibshirani (ISL/ESL books)
SLIDE 2

LR & GD Unit Objectives

  • Exact solutions of least squares
    • 1D case without bias
    • 1D case with bias
    • General case
  • Gradient descent for least squares

SLIDE 3

What will we learn?

[Diagram: the three paradigms of machine learning (Supervised Learning, Unsupervised Learning, Reinforcement Learning). Supervised learning uses (data x, label y) pairs $\{x_n, y_n\}_{n=1}^N$ and a performance measure, in a Training / Prediction / Evaluation pipeline.]

SLIDE 4


Task: Regression

[Diagram: regression highlighted within supervised learning.] Regression: given features x, predict a label y, where y is a numeric variable, e.g. sales in $$.

SLIDE 5

Visualizing errors

SLIDE 6

Regression: Evaluation Metrics

  • mean squared error
  • mean absolute error

Mean squared error: $\frac{1}{N} \sum_{n=1}^{N} (y_n - \hat{y}_n)^2$

Mean absolute error: $\frac{1}{N} \sum_{n=1}^{N} |y_n - \hat{y}_n|$
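As a quick illustration (not from the original slides), both metrics are one-liners in NumPy; the function names here are mine:

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    # (1/N) * sum_n (y_n - yhat_n)^2
    return np.mean((y_true - y_pred) ** 2)

def mean_absolute_error(y_true, y_pred):
    # (1/N) * sum_n |y_n - yhat_n|
    return np.mean(np.abs(y_true - y_pred))
```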

SLIDE 7

Linear Regression

Parameters: weight vector $w = [w_1, w_2, \ldots, w_f, \ldots, w_F]$ and bias scalar $b$

Prediction: $\hat{y}(x_i) \triangleq \sum_{f=1}^{F} w_f x_{if} + b$

Training: find the weights and bias that minimize error
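A minimal sketch of this prediction rule in NumPy (names are mine, not from the slides):

```python
import numpy as np

def predict(X, w, b):
    # yhat_n = sum_f w_f * x_nf + b for each row x_n
    # X: (N, F) feature matrix, w: (F,) weight vector, b: scalar bias
    return X @ w + b
```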

SLIDE 8

Sales vs. Ad Budgets

SLIDE 9

Linear Regression: Training

$$\min_{w,b} \; \sum_{n=1}^{N} \big( y_n - \hat{y}(x_n, w, b) \big)^2$$

Optimization problem: “Least Squares”

SLIDE 10


Linear Regression: Training

$$\min_{w,b} \; \sum_{n=1}^{N} \big( y_n - \hat{y}(x_n, w, b) \big)^2$$

Optimization problem: "Least Squares"

An exact formula for the optimal values of w, b exists! With only one feature (F = 1):

$$w = \frac{\sum_{n=1}^{N} (x_n - \bar{x})(y_n - \bar{y})}{\sum_{n=1}^{N} (x_n - \bar{x})^2}, \qquad b = \bar{y} - w \bar{x}$$

where $\bar{x} = \mathrm{mean}(x_1, \ldots, x_N)$ and $\bar{y} = \mathrm{mean}(y_1, \ldots, y_N)$.
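A sketch of this F = 1 closed form in NumPy (function name is mine):

```python
import numpy as np

def fit_least_squares_1d(x, y):
    # w = sum_n (x_n - xbar)(y_n - ybar) / sum_n (x_n - xbar)^2
    # b = ybar - w * xbar
    xbar, ybar = x.mean(), y.mean()
    w = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
    b = ybar - w * xbar
    return w, b
```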

Where does this come from?

SLIDE 11


Linear Regression: Training

$$\min_{w,b} \; \sum_{n=1}^{N} \big( y_n - \hat{y}(x_n, w, b) \big)^2$$

Optimization problem: "Least Squares"

An exact formula for the optimal values of w, b exists! (Where does this come from? See the derivation notes on the next slide.) With many features (F ≥ 1):

$$[w_1 \ldots w_F \; b]^T = (\tilde{X}^T \tilde{X})^{-1} \tilde{X}^T y, \qquad \tilde{X} = \begin{bmatrix} x_{11} & \ldots & x_{1F} & 1 \\ x_{21} & \ldots & x_{2F} & 1 \\ \vdots & & \vdots & \vdots \\ x_{N1} & \ldots & x_{NF} & 1 \end{bmatrix}$$
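A sketch of the closed form in NumPy (names are mine); solving the linear system is numerically safer than forming the explicit inverse:

```python
import numpy as np

def fit_least_squares(X, y):
    # Append a constant column so the bias is learned as the last entry
    X_tilde = np.hstack([X, np.ones((X.shape[0], 1))])
    # Normal equations: (X~^T X~) theta = X~^T y
    theta = np.linalg.solve(X_tilde.T @ X_tilde, X_tilde.T @ y)
    return theta[:-1], theta[-1]  # weights w, bias b
```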

SLIDE 12

Derivation Notes

http://www.cs.tufts.edu/comp/135/2019s/notes/day03_linear_regression.pdf

SLIDE 13

When does the Least Squares estimator exist?

  • Fewer examples than features (N < F): infinitely many solutions!
  • Same number of examples and features (N = F): optimum exists if X is full rank
  • More examples than features (N > F): optimum exists if X is full rank
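A small illustration (mine, not from the slides) of why N < F breaks the closed form: the Gram matrix $\tilde{X}^T \tilde{X}$ loses rank, so the normal equations are singular:

```python
import numpy as np

rng = np.random.default_rng(0)
# N=3 examples, F=5 features: fewer examples than features
X_tilde = np.hstack([rng.normal(size=(3, 5)), np.ones((3, 1))])
gram = X_tilde.T @ X_tilde  # (F+1) x (F+1)
print(np.linalg.matrix_rank(gram), gram.shape[0])
# rank 3 < 6: singular system, infinitely many least-squares solutions
```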

SLIDE 14

More compact notation

$$\theta = [b \; w_1 \; w_2 \ldots w_F], \qquad \tilde{x}_n = [1 \; x_{n1} \; x_{n2} \ldots x_{nF}]$$

$$\hat{y}(x_n, \theta) = \theta^T \tilde{x}_n, \qquad J(\theta) \triangleq \sum_{n=1}^{N} \big( y_n - \hat{y}(x_n, \theta) \big)^2$$
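The cost $J(\theta)$ in this notation, as a minimal NumPy sketch (names are mine):

```python
import numpy as np

def cost_J(theta, X_tilde, y):
    # J(theta) = sum_n (y_n - theta^T x~_n)^2
    residuals = y - X_tilde @ theta
    return np.sum(residuals ** 2)
```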

SLIDE 15

Idea: Optimize via small steps

SLIDE 16

Derivatives point uphill

SLIDE 17


To minimize, go downhill

Step in the opposite direction of the derivative

SLIDE 18

Steepest descent algorithm

input: initial $\theta \in \mathbb{R}$
input: step size $\alpha \in \mathbb{R}^+$
while not converged:
    $\theta \leftarrow \theta - \alpha \frac{d}{d\theta} J(\theta)$
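A direct Python translation of this loop (the names and the stopping test are my choices, assuming a callable grad_J that returns $\frac{d}{d\theta} J(\theta)$):

```python
import numpy as np

def steepest_descent(grad_J, theta_init, alpha=0.01, tol=1e-8, max_iters=10000):
    # theta <- theta - alpha * dJ/dtheta, repeated until the update is ~zero
    theta = theta_init
    for _ in range(max_iters):
        step = alpha * grad_J(theta)
        theta = theta - step
        if np.max(np.abs(step)) < tol:  # converged
            break
    return theta

# Example: J(theta) = theta^2 has derivative 2*theta, minimum at 0
theta_star = steepest_descent(lambda th: 2.0 * th, theta_init=5.0)
```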

SLIDE 20

How to set step size?

SLIDE 21

How to set step size?

  • Simple and usually effective: pick a small constant, e.g. $\alpha = 0.01$
  • Improve: decay over iterations, e.g. $\alpha_t = C / t$ or $\alpha_t = (C + t)^{-0.9}$
  • Improve: line search for the best value at each step
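The first two strategies as hypothetical Python helpers (the constants are illustrative, not prescribed by the slides):

```python
def constant_step(t, alpha=0.01):
    # Simple: the same small constant at every iteration
    return alpha

def inverse_decay(t, C=1.0):
    # alpha_t = C / t: shrink the step as iterations accumulate (t >= 1)
    return C / t

def power_decay(t, C=1.0):
    # alpha_t = (C + t)^(-0.9)
    return (C + t) ** -0.9
```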

SLIDE 22

How to assess convergence?

  • Ideal: stop when the derivative equals zero
  • Practical heuristics: stop when …
    • the change in loss becomes small: $|J(\theta_t) - J(\theta_{t-1})| < \epsilon$
    • the step size is indistinguishable from zero: $\alpha \left| \frac{d}{d\theta} J(\theta) \right| < \epsilon$
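Both heuristics combined in one hypothetical helper (the threshold eps is illustrative):

```python
import numpy as np

def has_converged(J_curr, J_prev, grad, alpha, eps=1e-8):
    # Heuristic 1: loss barely changed; heuristic 2: update is ~zero
    small_loss_change = abs(J_curr - J_prev) < eps
    tiny_step = alpha * np.max(np.abs(grad)) < eps
    return small_loss_change or tiny_step
```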

SLIDE 23

Visualizing the cost function

"Level set" contours: all points with the same function value

SLIDE 24

In 2D parameter space


gradient = vector of partial derivatives
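In symbols (a standard definition, stated here for completeness rather than taken from the slide, using the $\theta$ notation above):

$$\nabla_\theta J(\theta) = \left[ \frac{\partial J}{\partial \theta_0}, \frac{\partial J}{\partial \theta_1}, \ldots, \frac{\partial J}{\partial \theta_F} \right]$$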

SLIDE 25

Gradient Descent DEMO

https://github.com/tufts-ml-courses/comp135-19s-assignments/blob/master/labs/GradientDescentDemo.ipynb

SLIDE 26

Fitting a line isn’t always ideal

SLIDE 27

Can fit linear functions to nonlinear features

A nonlinear function of x:

$$\hat{y}(x_i) = \theta_0 + \theta_1 x_i + \theta_2 x_i^2 + \theta_3 x_i^3$$

can be written as a linear function of $\phi(x_i) = [x_i \; x_i^2 \; x_i^3]$:

$$\hat{y}(\phi(x_i)) = \theta_0 + \theta_1 \phi(x_i)_1 + \theta_2 \phi(x_i)_2 + \theta_3 \phi(x_i)_3$$

"Linear regression" means linear in the parameters (weights, biases). Features can be arbitrary transforms of the raw data.
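A sketch of the cubic transform in NumPy (name is mine); the result can feed the same least-squares fit as before, since the model stays linear in $\theta$:

```python
import numpy as np

def poly_features(x, degree=3):
    # phi(x_i) = [x_i, x_i^2, ..., x_i^degree] for a 1D array x of shape (N,)
    return np.column_stack([x ** d for d in range(1, degree + 1)])
```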

SLIDE 28

What feature transform to use?

  • Anything that works for your data!
  • sin / cos for periodic data
  • polynomials for high-order dependencies
  • interactions between feature dimensions
  • Many other choices possible

Examples:

Interactions between feature dimensions: $\phi(x_i) = [x_{i1} x_{i2} \;\; x_{i3} x_{i4}]$

Polynomials: $\phi(x_i) = [x_i \; x_i^2 \; x_i^3]$
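For instance, a hypothetical sin/cos transform for the periodic-data case mentioned above (the period value is illustrative):

```python
import numpy as np

def periodic_features(x, period=12.0):
    # phi(x_i) = [sin(2 pi x_i / period), cos(2 pi x_i / period)]
    # e.g. period=12 for monthly seasonality (a hypothetical choice)
    omega = 2.0 * np.pi / period
    return np.column_stack([np.sin(omega * x), np.cos(omega * x)])
```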