Linear Regression / Optimization for ML


SLIDE 1

Linear Regression / Optimization for ML

10-601 Introduction to Machine Learning
Machine Learning Department, School of Computer Science, Carnegie Mellon University

Matt Gormley
Lecture 7, Feb. 6, 2019

SLIDE 2

Q&A


SLIDE 3

Reminders

  • Homework 2: Decision Trees
    – Out: Wed, Jan 23
    – Due: Wed, Feb 6 at 11:59pm
  • Homework 3: KNN, Perceptron, Lin. Reg.
    – Out: Wed, Feb 6
    – Due: Fri, Feb 15 at 11:59pm
  • Today’s In-Class Poll
    – http://p7.mlcourse.org

SLIDE 4

ANALYSIS OF PERCEPTRON


SLIDE 5

Geometric Margin

Definition: The margin of an example x w.r.t. a linear separator w is the distance from x to the plane w · x = 0 (or the negative of that distance if x is on the wrong side).

[Figure: the margin of a positive example and of a negative example, measured as distances to the separating hyperplane defined by w]

Slide from Nina Balcan

SLIDE 6

Geometric Margin

Definition: The margin γ_w of a set of examples S w.r.t. a linear separator w is the smallest margin over points x ∈ S.

[Figure: a set of positive and negative examples; γ_w is the smallest distance from any point to the separator w]

Definition: The margin of an example x w.r.t. a linear separator w is the distance from x to the plane w · x = 0 (or the negative of that distance if x is on the wrong side).

Slide from Nina Balcan

SLIDE 7

Geometric Margin

Definition: The margin of an example x w.r.t. a linear separator w is the distance from x to the plane w · x = 0 (or the negative of that distance if x is on the wrong side).

Definition: The margin γ_w of a set of examples S w.r.t. a linear separator w is the smallest margin over points x ∈ S.

Definition: The margin γ of a set of examples S is the maximum γ_w over all linear separators w.

[Figure: the maximum-margin separator w for a set of positive and negative examples]

Slide from Nina Balcan
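To make these definitions concrete, here is a minimal NumPy sketch (an illustrative addition, not from the slides; the function names are hypothetical) that computes the signed margin of a single example and the margin of a whole dataset with respect to a fixed separator w:

    import numpy as np

    def example_margin(w, x, y):
        # Signed distance from x to the hyperplane w . x = 0:
        # positive if x is on the correct side for label y in {-1, +1}.
        return y * np.dot(w, x) / np.linalg.norm(w)

    def dataset_margin(w, X, Y):
        # Margin of a set of examples w.r.t. w: the smallest per-example margin.
        return min(example_margin(w, x, y) for x, y in zip(X, Y))

    # Toy data: two positive and two negative points in 2-D.
    X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])
    Y = np.array([+1, +1, -1, -1])
    w = np.array([1.0, 1.0])
    print(dataset_margin(w, X, Y))  # smallest signed distance to the separator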

SLIDE 8

Linear Separability


Def: For a binary classification problem, a set of examples S is linearly separable if there exists a linear decision boundary that can separate the points.

[Figure: Cases 1–4, four 2-D arrangements of positive and negative points, illustrating datasets that are and are not linearly separable]
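As a quick illustration (not from the slides; names are hypothetical), a candidate separator (w, b) separates a dataset exactly when every example lands strictly on its correct side; for an XOR-like arrangement, no choice of (w, b) passes this test:

    import numpy as np

    def separates(w, b, X, Y):
        # True iff the hyperplane w . x + b = 0 puts every example
        # strictly on its correct side (labels in {-1, +1}).
        return bool(np.all(Y * (X @ w + b) > 0))

    # XOR-like arrangement: not linearly separable by any (w, b).
    X = np.array([[0, 0], [1, 1], [0, 1], [1, 0]], dtype=float)
    Y = np.array([-1, -1, +1, +1])
    print(separates(np.array([1.0, -1.0]), 0.0, X, Y))  # False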
SLIDE 9

Analysis: Perceptron


Slide adapted from Nina Balcan

(Normalized margin: multiplying all points by 100, or dividing all points by 100, doesn’t change the number of mistakes; algo is invariant to scaling.)

Perceptron Mistake Bound

Guarantee: If the data has margin γ and all points lie inside a ball of radius R, then Perceptron makes ≤ (R/γ)² mistakes.

[Figure: linearly separable data with margin γ, all points inside a ball of radius R, and the separator θ∗]

SLIDE 10

Analysis: Perceptron


Slide adapted from Nina Balcan

(Normalized margin: multiplying all points by 100, or dividing all points by 100, doesn’t change the number of mistakes; algo is invariant to scaling.)

Perceptron Mistake Bound

Guarantee: If the data has margin γ and all points lie inside a ball of radius R, then Perceptron makes ≤ (R/γ)² mistakes.

[Figure: linearly separable data with margin γ, all points inside a ball of radius R, and the separator θ∗]

Def: We say that the (batch) perceptron algorithm has converged if it stops making mistakes on the training data (perfectly classifies the training data).

Main Takeaway: For linearly separable data, if the perceptron algorithm cycles repeatedly through the data, it will converge in a finite # of steps.
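For concreteness (a worked example, not from the slides): if every input satisfies ||x|| ≤ R = 10 and the data has margin γ = 0.5, the bound guarantees at most (10/0.5)² = 400 mistakes, no matter how the examples are ordered or how many passes are made.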

SLIDE 11

Analysis: Perceptron


Figure from Nina Balcan

Perceptron Mistake Bound

[Figure: linearly separable data with margin γ, all points inside a ball of radius R, and the separator θ∗]

Theorem 0.1 (Block (1962), Novikoff (1962)). Given dataset: D = {(x(i), y(i))}, i = 1, …, N.

Suppose:
1. Finite size inputs: ||x(i)|| ≤ R
2. Linearly separable data: ∃θ∗ s.t. ||θ∗|| = 1 and y(i)(θ∗ · x(i)) ≥ γ, ∀i

Then: The number of mistakes made by the Perceptron algorithm on this dataset is k ≤ (R/γ)².

SLIDE 12

Analysis: Perceptron

13

Figure from Nina Balcan

Perceptron Mistake Bound

[Figure: linearly separable data with margin γ, all points inside a ball of radius R, and the separator θ∗]

Theorem 0.1 (Block (1962), Novikoff (1962)). Given dataset: D = {(x(i), y(i))}, i = 1, …, N.

Suppose:
1. Finite size inputs: ||x(i)|| ≤ R
2. Linearly separable data: ∃θ∗ s.t. ||θ∗|| = 1 and y(i)(θ∗ · x(i)) ≥ γ, ∀i

Then: The number of mistakes made by the Perceptron algorithm on this dataset is k ≤ (R/γ)².

Common Misunderstanding: The radius R is centered at the origin, not at the center of the points.

SLIDE 13

Analysis: Perceptron


Proof of Perceptron Mistake Bound: We will show that there exist constants A and B s.t.

Ak ≤ ||θ(k+1)|| ≤ B √k

Covered in Recitation

SLIDE 14

Analysis: Perceptron


[Figure: linearly separable data with margin γ, all points inside a ball of radius R, and the separator θ∗]

Theorem 0.1 (Block (1962), Novikoff (1962)). Given dataset: D = {(x(i), y(i))}, i = 1, …, N.

Suppose:
1. Finite size inputs: ||x(i)|| ≤ R
2. Linearly separable data: ∃θ∗ s.t. ||θ∗|| = 1 and y(i)(θ∗ · x(i)) ≥ γ, ∀i

Then: The number of mistakes made by the Perceptron algorithm on this dataset is k ≤ (R/γ)².

Algorithm 1 Perceptron Learning Algorithm (Online)

1: procedure PERCEPTRON(D = {(x(1), y(1)), (x(2), y(2)), …})
2:   θ ← 0, k = 1                      ▷ Initialize parameters
3:   for i ∈ {1, 2, …} do              ▷ For each example
4:     if y(i)(θ(k) · x(i)) ≤ 0 then   ▷ If mistake
5:       θ(k+1) ← θ(k) + y(i) x(i)     ▷ Update parameters
6:       k ← k + 1
7:   return θ

Covered in Recitation
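Below is a minimal, runnable NumPy version of Algorithm 1 (an illustrative sketch, not the course's reference implementation). It cycles repeatedly through the data, as in the batch setting, until a full pass makes no mistakes, which the theorem guarantees will happen for linearly separable data:

    import numpy as np

    def perceptron(X, Y, max_epochs=1000):
        # Labels Y in {-1, +1}; X has one example per row.
        theta = np.zeros(X.shape[1])
        for _ in range(max_epochs):
            mistakes = 0
            for x, y in zip(X, Y):
                if y * np.dot(theta, x) <= 0:  # mistake (or on the boundary)
                    theta += y * x             # add / subtract the features
                    mistakes += 1
            if mistakes == 0:                  # converged: a mistake-free pass
                break
        return theta

    # Linearly separable toy data, with the bias folded in as a constant 1 feature.
    X = np.array([[1, 2, 1], [1, 1, 2], [1, -1, -2], [1, -2, -1]], dtype=float)
    Y = np.array([+1, +1, -1, -1])
    theta = perceptron(X, Y)
    print(np.sign(X @ theta))  # matches Y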

SLIDE 15

Analysis: Perceptron


Proof of Perceptron Mistake Bound:
Part 1: for some A, Ak ≤ ||θ(k+1)||

θ(k+1) · θ∗ = (θ(k) + y(i) x(i)) · θ∗        by Perceptron algorithm update
            = θ(k) · θ∗ + y(i)(θ∗ · x(i))
            ≥ θ(k) · θ∗ + γ                  by assumption
⇒ θ(k+1) · θ∗ ≥ kγ                           by induction on k, since θ(1) = 0
⇒ ||θ(k+1)|| ≥ kγ                            since ||a|| ||b|| ≥ a · b (Cauchy-Schwarz inequality) and ||θ∗|| = 1

Covered in Recitation

SLIDE 16

Analysis: Perceptron


Proof of Perceptron Mistake Bound:
Part 2: for some B, ||θ(k+1)|| ≤ B √k

||θ(k+1)||² = ||θ(k) + y(i) x(i)||²                  by Perceptron algorithm update
            = ||θ(k)||² + (y(i))² ||x(i)||² + 2 y(i)(θ(k) · x(i))
            ≤ ||θ(k)||² + (y(i))² ||x(i)||²          since the kth mistake means y(i)(θ(k) · x(i)) ≤ 0
            ≤ ||θ(k)||² + R²                         since (y(i))² ||x(i)||² = ||x(i)||² ≤ R² by assumption and (y(i))² = 1
⇒ ||θ(k+1)||² ≤ kR²                                  by induction on k, since θ(1) = 0
⇒ ||θ(k+1)|| ≤ √k R

Covered in Recitation

SLIDE 17

Analysis: Perceptron


Proof of Perceptron Mistake Bound:
Part 3: Combining the bounds finishes the proof.

kγ ≤ ||θ(k+1)|| ≤ √k R  ⇒  k ≤ (R/γ)²

The total number of mistakes must be at most this bound.

Covered in Recitation

SLIDE 18

Analysis: Perceptron

What if the data is not linearly separable?
1. Perceptron will not converge in this case (it can’t!)
2. However, Freund & Schapire (1999) show that by projecting the points (hypothetically) into a higher dimensional space, we can achieve a similar bound on the number of mistakes made on one pass through the sequence of examples.

Theorem 2. Let ⟨(x1, y1), …, (xm, ym)⟩ be a sequence of labeled examples with ||xi|| ≤ R. Let u be any vector with ||u|| = 1 and let γ > 0. Define the deviation of each example as di = max{0, γ − yi(u · xi)}, and define D = √(Σi=1..m di²). Then the number of mistakes of the online perceptron algorithm on this sequence is bounded by ((R + D)/γ)².
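A small NumPy sketch (an illustrative addition, not from the paper or the slides; names are hypothetical) that evaluates this bound for a chosen reference vector u and target margin γ:

    import numpy as np

    def fs_mistake_bound(X, Y, u, gamma):
        # Freund & Schapire (1999) bound: ((R + D) / gamma)^2, where
        # d_i = max(0, gamma - y_i (u . x_i)) and D = sqrt(sum_i d_i^2).
        R = np.max(np.linalg.norm(X, axis=1))
        d = np.maximum(0.0, gamma - Y * (X @ u))
        D = np.sqrt(np.sum(d ** 2))
        return ((R + D) / gamma) ** 2

For separable data with margin γ (all di = 0, so D = 0), this recovers the (R/γ)² bound.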

SLIDE 19

Perceptron Exercises


Question: Unlike Decision Trees and K-Nearest Neighbors, the Perceptron algorithm does not suffer from overfitting because it does not have any hyperparameters that could be over-tuned on the training data.

  • A. True
  • B. False
  • C. True and False
SLIDE 20

Summary: Perceptron

  • Perceptron is a linear classifier
  • Simple learning algorithm: when a mistake is made, add / subtract the features
  • Perceptron will converge if the data are linearly separable; it will not converge if the data are linearly inseparable
  • For both linearly separable and inseparable data, we can bound the number of mistakes (geometric argument)
  • Extensions support nonlinear separators and structured prediction

SLIDE 21

Perceptron Learning Objectives

You should be able to…

  • Explain the difference between online learning and batch learning
  • Implement the perceptron algorithm for binary classification [CIML]
  • Determine whether the perceptron algorithm will converge based on properties of the dataset, and the limitations of the convergence guarantees
  • Describe the inductive bias of perceptron and the limitations of linear models
  • Draw the decision boundary of a linear model
  • Identify whether a dataset is linearly separable or not
  • Defend the use of a bias term in perceptron

SLIDE 22

LINEAR REGRESSION AS FUNCTION APPROXIMATION


SLIDE 23

Regression

Example Applications:

  • Stock price prediction
  • Forecasting epidemics
  • Speech synthesis
  • Generation of images (e.g. Deep Dream)
  • Predicting the number of tourists on Machu Picchu on a given day

SLIDE 24

Regression Problems

Chalkboard

– Definition of Regression
– Linear functions
– Residuals
– Notation trick: fold in the intercept
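The intercept-folding trick takes two lines of NumPy (an illustrative sketch, assuming the usual design-matrix setup): prepend a constant-1 feature so the intercept becomes just another learned weight:

    import numpy as np

    # Fold in the intercept: y ≈ w . x + b becomes y ≈ θ . x'
    # where x' = [1, x] and θ = [b, w].
    X = np.random.randn(100, 3)                          # N = 100 examples, 3 features
    X_prime = np.hstack([np.ones((X.shape[0], 1)), X])   # shape (100, 4)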

SLIDE 25

Linear Regression as Function Approximation

Chalkboard

– Objective function: Mean squared error
– Hypothesis space: Linear functions
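As a concrete sketch of the objective (an illustrative addition, assuming the intercept has been folded into X): the mean squared error of the linear hypothesis h(x) = θ · x over a dataset:

    import numpy as np

    def mse(theta, X, y):
        # Mean squared error of predictions X @ theta against targets y.
        residuals = X @ theta - y
        return np.mean(residuals ** 2)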

SLIDE 26

OPTIMIZATION IN CLOSED FORM


SLIDE 27

Optimization for ML

Not quite the same setting as other fields…

– Function we are optimizing might not be the true goal (e.g. likelihood vs. generalization error)
– Precision might not matter (e.g. the data is noisy, so being optimal up to 1e-16 might not help)
– Stopping early can help generalization error (i.e. “early stopping” is a technique for regularization, discussed more next time)

SLIDE 28


Topographical Maps

SLIDE 29


Topographical Maps

SLIDE 30

Calculus and Optimization

In-Class Exercise: Plot three functions:

Answer Here:

SLIDE 31

Optimization for ML

Chalkboard

– Unconstrained optimization
– Convex, concave, nonconvex
– Derivatives
– Zero derivatives
– Gradient and Hessian

SLIDE 32

Optimization: Closed form solutions

Chalkboard

– Example: 1-D function
– Example: higher dimensions
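A minimal 1-D instance of the closed-form recipe (a worked example, not necessarily the one on the chalkboard): to minimize f(x) = (x − 2)², set the derivative f′(x) = 2(x − 2) to zero, giving x = 2; since f″(x) = 2 > 0, this critical point is a minimum.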

SLIDE 33

Convexity


There is only one local optimum if the function is convex

Slide adapted from William Cohen

SLIDE 34

Convexity


There is only one local optimum if the function is convex

Slide adapted from William Cohen

The Mean Squared Error function, which we will minimize for learning the parameters of Linear Regression, is convex!
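One way to see this (a sketch under the standard linear regression setup, not from the slides): the Hessian of the MSE objective J(θ) = (1/N) ||Xθ − y||² is (2/N) XᵀX, which is positive semidefinite, so J is convex. A quick numerical check:

    import numpy as np

    N, M = 100, 3
    X = np.random.randn(N, M)
    H = (2.0 / N) * X.T @ X                          # Hessian of the MSE objective
    print(np.all(np.linalg.eigvalsh(H) >= -1e-12))   # all eigenvalues >= 0: PSD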

SLIDE 35

CLOSED FORM SOLUTION FOR LINEAR REGRESSION


SLIDE 36

Optimization for Linear Regression

Chalkboard

– Closed-form (Normal Equations)
– Computational complexity of the closed-form solution
– Stability of the closed-form solution
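A minimal sketch of the closed-form solution (an illustrative addition, not the chalkboard derivation): solve the Normal Equations XᵀXθ = Xᵀy; solving the linear system directly is cheaper and more stable than forming an explicit inverse:

    import numpy as np

    def linreg_closed_form(X, y):
        # Solve the Normal Equations X^T X θ = X^T y.
        return np.linalg.solve(X.T @ X, X.T @ y)

    # Toy check: recover known weights from noiseless data.
    N = 100
    X = np.hstack([np.ones((N, 1)), np.random.randn(N, 3)])  # intercept folded in
    theta_true = np.array([0.5, 1.0, -2.0, 3.0])
    y = X @ theta_true
    print(np.allclose(linreg_closed_form(X, y), theta_true))  # True

    # For ill-conditioned X, prefer np.linalg.lstsq(X, y, rcond=None)[0].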

SLIDE 37

SLIDE 38

Function Approximation

Chalkboard

– The Big Picture
