SLIDE 1

Linear Regression + Optimization for ML

10-601 Introduction to Machine Learning
Machine Learning Department, School of Computer Science, Carnegie Mellon University

Matt Gormley, Lecture 8, Feb. 07, 2020

SLIDE 2

Q&A

Q: How can I get more one-on-one interaction with the course staff?

A: Attend office hours as soon after the homework release as possible!

SLIDE 3

Reminders

  • Homework 3: KNN, Perceptron, Lin. Reg.
    – Out: Wed, Feb. 05 (+1 day)
    – Due: Wed, Feb. 12 at 11:59pm
  • Today’s In-Class Poll
    – http://p8.mlcourse.org

SLIDE 4

LINEAR REGRESSION

SLIDE 5

Regression Problems

Chalkboard:
  – Definition of Regression
  – Linear functions
  – Residuals
  – Notation trick: fold in the intercept

SLIDE 6

OPTIMIZATION FOR ML

The Big Picture

SLIDE 7

Optimization for ML

Not quite the same setting as in other fields…
  – The function we are optimizing might not be the true goal (e.g. likelihood vs. generalization error)
  – Precision might not matter (e.g. the data is noisy, so optimizing to within 1e-16 might not help)
  – Stopping early can help generalization error (i.e. “early stopping” is a technique for regularization – discussed more next time)

SLIDE 8

min vs. argmin

Consider y = f(x) = x² + 1.

v* = min_x f(x)
x* = argmin_x f(x)

1. Q: What is v*?
   A: v* = 1, the minimum value of the function.
2. Q: What is x*?
   A: x* = 0, the argument that yields the minimum value.
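A quick numerical illustration of the distinction (a sketch, not from the slides), using a coarse grid search over f(x) = x² + 1:

```python
import numpy as np

def f(x):
    return x ** 2 + 1

# Evaluate f on a fine grid, then pick out min (the value) vs. argmin (the input).
xs = np.linspace(-3, 3, 601)   # symmetric grid that includes x = 0 exactly
ys = f(xs)

v_star = ys.min()              # v* = min_x f(x): the smallest value attained
x_star = xs[ys.argmin()]       # x* = argmin_x f(x): the input achieving it

print(v_star, x_star)          # 1.0 0.0
```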

SLIDE 9

Linear Regression as Function Approximation

SLIDE 10

SLIDE 11

Contour Plots

1. Each level curve is labeled with a value.
2. The value label indicates the value of the function for all points lying on that level curve.
3. Just like a topographical map, but for a function.

Example: J(θ) = J(θ1, θ2) = (10(θ1 − 0.5))² + (6(θ2 − 0.4))²

SLIDE 12

Optimization by Random Guessing

Optimization Method #0: Random Guessing
1. Pick a random θ
2. Evaluate J(θ)
3. Repeat steps 1 and 2 many times
4. Return the θ that gives the smallest J(θ)

J(θ) = J(θ1, θ2) = (10(θ1 − 0.5))² + (6(θ2 − 0.4))²

  t | θ1  | θ2  | J(θ1, θ2)
  1 | 0.2 | 0.2 | 10.4
  2 | 0.3 | 0.7 | 7.2
  3 | 0.6 | 0.4 | 1.0
  4 | 0.9 | 0.7 | 19.2
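Method #0 is easy to sketch in code (the helper name, the search range [0, 1)², and the guess count are illustrative assumptions, not from the slides):

```python
import random

def J(theta1, theta2):
    """Objective from the slides: minimum value 0 at (0.5, 0.4)."""
    return (10 * (theta1 - 0.5)) ** 2 + (6 * (theta2 - 0.4)) ** 2

def random_guessing(num_guesses=10000, seed=0):
    rng = random.Random(seed)
    best_theta, best_val = None, float("inf")
    for _ in range(num_guesses):
        theta = (rng.random(), rng.random())   # 1. pick a random theta
        val = J(*theta)                        # 2. evaluate J(theta)
        if val < best_val:                     # 4. keep the best seen so far
            best_theta, best_val = theta, val
    return best_theta, best_val

theta, val = random_guessing()
print(theta, val)   # a point near (0.5, 0.4) with J near 0
```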

SLIDE 13

Optimization by Random Guessing

Optimization Method #0: Random Guessing
1. Pick a random θ
2. Evaluate J(θ)
3. Repeat steps 1 and 2 many times
4. Return the θ that gives the smallest J(θ)

For Linear Regression:
  • The objective function is Mean Squared Error (MSE): MSE = J(w, b) = J(θ1, θ2)
  • In the contour plot, each level curve is labeled with its MSE – lower means a better fit
  • The minimum corresponds to the parameters (w, b) = (θ1, θ2) that best fit the training dataset

SLIDE 14

Linear Regression by Random Guessing

Optimization Method #0: Random Guessing
1. Pick a random θ
2. Evaluate J(θ)
3. Repeat steps 1 and 2 many times
4. Return the θ that gives the smallest J(θ)

[Figure: candidate fits h(x; θ(1)), …, h(x; θ(4)) on a plot of # tourists (thousands) vs. time; the target y = h*(x) is unknown]

For Linear Regression:
  • The target function h*(x) is unknown
  • We only have access to h*(x) through the training examples (x(i), y(i))
  • We want the h(x; θ(t)) that best approximates h*(x)
  • We enable generalization via an inductive bias that restricts the hypothesis class to linear functions

SLIDE 15

Linear Regression by Random Guessing

Optimization Method #0: Random Guessing
1. Pick a random θ
2. Evaluate J(θ)
3. Repeat steps 1 and 2 many times
4. Return the θ that gives the smallest J(θ)

J(θ) = J(θ1, θ2) = (10(θ1 − 0.5))² + (6(θ2 − 0.4))²

[Figure: the four guesses from the table shown both as fits h(x; θ(1)), …, h(x; θ(4)) on the tourists-vs-time data and as points on the contour plot of J]

SLIDE 16

OPTIMIZATION METHOD #1: GRADIENT DESCENT

SLIDE 17

Optimization for ML

Chalkboard:
  – Unconstrained optimization
  – Derivatives
  – Gradient

SLIDE 18

Topographical Maps

SLIDE 19

Topographical Maps

SLIDE 20

Gradients

SLIDE 21

Gradients

These are the gradients that Gradient Ascent would follow.

SLIDE 22

(Negative) Gradients

These are the negative gradients that Gradient Descent would follow.

SLIDE 23

(Negative) Gradient Paths

Shown are the paths that Gradient Descent would follow if it were making infinitesimally small steps.

SLIDE 24

Pros and Cons of Gradient Descent

  • Simple and often quite effective on ML tasks
  • Often very scalable
  • Only applies to smooth (differentiable) functions
  • Might find a local minimum rather than a global one

Slide courtesy of William Cohen

SLIDE 25

Gradient Descent

Chalkboard:
  – Gradient Descent Algorithm
  – Details: starting point, stopping criterion, line search

SLIDE 26

Gradient Descent

Algorithm 1: Gradient Descent
  1: procedure GD(D, θ(0))
  2:   θ ← θ(0)
  3:   while not converged do
  4:     θ ← θ − λ∇θJ(θ)
  5:   return θ

In order to apply GD to Linear Regression, all we need is the gradient of the objective function (i.e. the vector of partial derivatives):

∇θJ(θ) = [ dJ(θ)/dθ1, dJ(θ)/dθ2, …, dJ(θ)/dθN ]ᵀ
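A minimal Python sketch of Algorithm 1 (the helper names, the fixed step size λ, and the tolerance are illustrative assumptions, not from the slides):

```python
import numpy as np

def gradient_descent(grad_J, theta0, lam=0.1, tol=1e-6, max_iters=10000):
    """Repeatedly step opposite the gradient until converged.

    grad_J: function returning the gradient vector of J at theta
    lam:    step size (the lambda in the update rule)
    """
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iters):
        g = grad_J(theta)
        if np.linalg.norm(g) < tol:   # converged: gradient near zero
            break
        theta = theta - lam * g       # theta <- theta - lam * grad J(theta)
    return theta

# The running example J(θ) = (10(θ1 − 0.5))² + (6(θ2 − 0.4))² has gradient
# [200(θ1 − 0.5), 72(θ2 − 0.4)], so GD should land near (0.5, 0.4).
grad = lambda t: np.array([200 * (t[0] - 0.5), 72 * (t[1] - 0.4)])
print(gradient_descent(grad, [0.0, 0.0], lam=0.005))   # ≈ [0.5, 0.4]
```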

SLIDE 27

Gradient Descent

Algorithm 1: Gradient Descent
  1: procedure GD(D, θ(0))
  2:   θ ← θ(0)
  3:   while not converged do
  4:     θ ← θ − λ∇θJ(θ)
  5:   return θ

There are many possible ways to detect convergence. For example, we could check whether the L2 norm of the gradient, ||∇θJ(θ)||₂, is below some small tolerance. Alternatively, we could check that the reduction in the objective function from one iteration to the next is small.
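The two stopping criteria above can be sketched as a small helper (a hypothetical function with illustrative tolerances, not from the slides):

```python
import numpy as np

def converged(grad, J_prev, J_curr, grad_tol=1e-6, obj_tol=1e-9):
    """Return True if either stopping criterion fires.

    1. The L2 norm of the gradient is below a small tolerance.
    2. The reduction in the objective from one iteration to the next is small.
    """
    small_gradient = np.linalg.norm(grad) < grad_tol
    small_progress = abs(J_prev - J_curr) < obj_tol
    return small_gradient or small_progress

print(converged(np.array([1e-8, 0.0]), 1.0, 0.5))   # True: gradient norm is tiny
print(converged(np.array([0.3, 0.1]), 1.0, 0.5))    # False: still making progress
```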

SLIDE 28

GRADIENT DESCENT FOR LINEAR REGRESSION

SLIDE 29

Linear Regression as Function Approximation

SLIDE 30

Linear Regression by Gradient Descent

Optimization Method #1: Gradient Descent
1. Pick a random θ
2. Repeat:
   a. Evaluate the gradient ∇J(θ)
   b. Step opposite the gradient
3. Return the θ that gives the smallest J(θ)

J(θ) = J(θ1, θ2) = (10(θ1 − 0.5))² + (6(θ2 − 0.4))²

  t | θ1   | θ2   | J(θ1, θ2)
  1 | 0.01 | 0.02 | 25.2
  2 | 0.30 | 0.12 | 8.7
  3 | 0.51 | 0.30 | 1.5
  4 | 0.59 | 0.43 | 0.2

SLIDE 31

Linear Regression by Gradient Descent

Optimization Method #1: Gradient Descent
1. Pick a random θ
2. Repeat:
   a. Evaluate the gradient ∇J(θ)
   b. Step opposite the gradient
3. Return the θ that gives the smallest J(θ)

[Figure: the fits h(x; θ(1)), …, h(x; θ(4)) at iterations t = 1, …, 4 on the tourists-vs-time data, approaching the unknown target y = h*(x)]

SLIDE 32

Linear Regression by Gradient Descent

[Figure: the same four iterates shown both as fits h(x; θ(1)), …, h(x; θ(4)) on the data and as a path on the contour plot of J(θ1, θ2)]

SLIDE 33

Linear Regression by Gradient Descent

[Figure: mean squared error J(θ1, θ2) vs. iteration t, decreasing as the fits h(x; θ(t)) improve]

SLIDE 34

Linear Regression by Gradient Descent

[Figure: the convergence curve of MSE vs. iteration shown alongside the contour plot of J and the sequence of fits]

SLIDE 35

Optimization for Linear Regression

Chalkboard:
  – Computing the gradient for Linear Regression
  – Gradient Descent for Linear Regression

SLIDE 36

Gradient Calculation for Linear Regression

[used by Gradient Descent]
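The derivation itself is done on the chalkboard; as a sketch, assuming the standard MSE objective J(θ) = (1/N) Σᵢ (θᵀx⁽ⁱ⁾ − y⁽ⁱ⁾)², the gradient works out to ∇θJ(θ) = (2/N) Xᵀ(Xθ − y), which a finite-difference check confirms:

```python
import numpy as np

def mse(theta, X, y):
    """J(theta) = (1/N) * sum_i (theta^T x_i - y_i)^2"""
    residuals = X @ theta - y
    return (residuals ** 2).mean()

def mse_gradient(theta, X, y):
    """Vector of partial derivatives: (2/N) * X^T (X theta - y)."""
    N = X.shape[0]
    return (2.0 / N) * X.T @ (X @ theta - y)

# Sanity check: compare against a central finite-difference approximation.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)
theta = rng.normal(size=3)

eps = 1e-6
fd = np.array([
    (mse(theta + eps * e, X, y) - mse(theta - eps * e, X, y)) / (2 * eps)
    for e in np.eye(3)
])
print(np.allclose(mse_gradient(theta, X, y), fd, atol=1e-5))   # True
```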

SLIDE 37

GD for Linear Regression

Gradient Descent for Linear Regression repeatedly takes steps opposite the gradient of the objective function.

SLIDE 38

CONVEXITY

SLIDE 39

Convexity

SLIDE 40

Convexity

Convex Function
  • Each local minimum is a global minimum

Nonconvex Function
  • A nonconvex function is not convex
  • Each local minimum is not necessarily a global minimum

SLIDE 41

Convexity

Each local minimum of a convex function is also a global minimum. A strictly convex function has a unique global minimum.

SLIDE 42

CONVEXITY AND LINEAR REGRESSION

SLIDE 43

Convexity and Linear Regression

The Mean Squared Error function, which we minimize for learning the parameters of Linear Regression, is convex! …but in the general case it is not strictly convex.
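One way to see this (a sketch, not from the slides): the Hessian of the MSE is (2/N) XᵀX, which is always positive semidefinite, so MSE is convex; it is positive definite (strict convexity) only when X has full column rank. A rank-deficient design, e.g. with a duplicated feature column, produces a flat direction:

```python
import numpy as np

def mse_hessian(X):
    """Hessian of J(theta) = (1/N)||X theta - y||^2 is (2/N) X^T X (independent of theta)."""
    N = X.shape[0]
    return (2.0 / N) * X.T @ X

rng = np.random.default_rng(1)

# Full column rank -> all eigenvalues positive -> strictly convex
X_full = rng.normal(size=(10, 3))
eig_full = np.linalg.eigvalsh(mse_hessian(X_full))

# Duplicate a column -> rank deficient -> a zero eigenvalue -> convex, not strictly
X_dup = np.column_stack([X_full, X_full[:, 0]])
eig_dup = np.linalg.eigvalsh(mse_hessian(X_dup))

print(eig_full.min() > 0)              # True: strictly convex
print(np.isclose(eig_dup.min(), 0))    # True: flat direction exists
```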

SLIDE 44

Regression Loss Functions

In-Class Exercise: Which of the following could be used as loss functions for training a linear regression model? Select all that apply.

SLIDE 45

Solving Linear Regression

[In-class question and answer; the question content was not captured in extraction]

SLIDE 46

OPTIMIZATION METHOD #2: CLOSED FORM SOLUTION

SLIDE 47

Calculus and Optimization

In-Class Exercise: Plot three functions. [The functions were given on the slide image.]

SLIDE 48

Optimization: Closed Form Solutions

Chalkboard:
  – Zero derivatives
  – Example: 1-D function
  – Example: higher dimensions

SLIDE 49

CLOSED FORM SOLUTION FOR LINEAR REGRESSION

SLIDE 50

Linear Regression as Function Approximation

SLIDE 51

Linear Regression: Closed Form

Optimization Method #2: Closed Form
1. Evaluate the closed-form expression for θMLE
2. Return θMLE

[Figure: the single closed-form fit h(x; θMLE) on the tourists-vs-time data, with θMLE = (0.59, 0.43) marked at the minimum of the contour plot, J ≈ 0.2]

SLIDE 52

Optimization for Linear Regression

Chalkboard:
  – Closed-form (Normal Equations)

SLIDE 53

Computational Complexity of OLS

To solve the Ordinary Least Squares problem we compute θ = (XᵀX)⁻¹Xᵀy. The cost is linear in the number of examples, N, and polynomial in the number of features, M: forming XᵀX costs O(NM²), and solving the resulting M×M system costs O(M³).
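A sketch of the closed-form solve (the helper name is illustrative; solving the normal equations directly rather than forming an explicit inverse is a standard numerical-stability choice, not something the slides prescribe):

```python
import numpy as np

def ols_closed_form(X, y):
    """Solve the normal equations (X^T X) theta = X^T y.

    np.linalg.solve is preferred over computing (X^T X)^{-1} explicitly.
    Cost: O(N M^2) to form X^T X, plus O(M^3) to solve the M x M system.
    """
    return np.linalg.solve(X.T @ X, X.T @ y)

# Sanity check: recover known parameters from noiseless data.
rng = np.random.default_rng(0)
N, M = 100, 3
X = rng.normal(size=(N, M))
true_theta = np.array([2.0, -1.0, 0.5])
y = X @ true_theta

theta_hat = ols_closed_form(X, y)
print(np.allclose(theta_hat, true_theta))   # True
```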

SLIDE 54

Gradient Descent

Cases where we should consider gradient descent:
1. What if we cannot find a closed-form solution?
2. What if we can, but it’s inefficient to compute?
3. What if we can, but it’s numerically unstable to compute?

SLIDE 55

Convergence Curves

[Figure: training MSE vs. epoch for Gradient Descent, SGD, and the closed-form (normal equations) solution; figure adapted from Eric P. Xing]

  • SGD reduces MSE much more rapidly than GD
  • For GD / SGD, the training MSE is initially large due to the uninformed initialization
  • Def: an epoch is a single pass through the training data
    1. For GD, there is only one update per epoch
    2. For SGD, there are N updates per epoch, where N = (# train examples)
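A minimal sketch contrasting the two update schedules on a toy linear-regression problem (function names, step sizes, and data are illustrative assumptions): GD makes one full-gradient update per epoch, while SGD makes N single-example updates, so after the same number of epochs SGD has typically driven the training MSE down further.

```python
import numpy as np

def epoch_gd(theta, X, y, lr):
    """One epoch of GD: a single update using the full-dataset MSE gradient."""
    N = X.shape[0]
    return theta - lr * (2.0 / N) * X.T @ (X @ theta - y)

def epoch_sgd(theta, X, y, lr, rng):
    """One epoch of SGD: N updates, each using one example's gradient."""
    for i in rng.permutation(X.shape[0]):
        grad_i = 2.0 * (X[i] @ theta - y[i]) * X[i]
        theta = theta - lr * grad_i
    return theta

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = X @ np.array([1.0, -2.0])          # noiseless targets for the sanity check
theta_gd = theta_sgd = np.zeros(2)     # uninformed initialization

for _ in range(5):                     # same number of epochs for both
    theta_gd = epoch_gd(theta_gd, X, y, lr=0.1)
    theta_sgd = epoch_sgd(theta_sgd, X, y, lr=0.02, rng=rng)

mse = lambda t: ((X @ t - y) ** 2).mean()
print(mse(theta_sgd) < mse(theta_gd))  # SGD ahead after the same epoch count
```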