Linear Regression / Optimization for ML
10-601 Introduction to Machine Learning
Machine Learning Department, School of Computer Science, Carnegie Mellon University
Matt Gormley, Lecture 7, Feb. 6, 2019
Reminders
– Homework 2: Out: Wed, Jan 23; Due: Wed, Feb 6 at 11:59pm
– Out: Wed, Feb 6; Due: Fri, Feb 15 at 11:59pm
– http://p7.mlcourse.org
ANALYSIS OF PERCEPTRON
Geometric Margin
Definition: The margin of example x w.r.t. a linear separator w is the distance from x to the plane w · x = 0 (or the negative of that distance if x is on the wrong side).
Definition: The margin γ_w of a set of examples S w.r.t. a linear separator w is the smallest margin over points x ∈ S.
Definition: The margin γ of a set of examples S is the maximum γ_w over all linear separators w.
[Figure: margin of a positive example and of a negative example relative to the separating hyperplane w · x = 0]
Slide from Nina Balcan
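These definitions translate directly into code. A minimal sketch (hypothetical helper names; the signed distance divides by ||w|| so the result is a geometric distance):

```python
import numpy as np

def example_margin(w, x, y):
    """Signed distance from x to the plane w . x = 0; negative if x is on the wrong side."""
    return y * np.dot(w, x) / np.linalg.norm(w)

def dataset_margin(w, X, Y):
    """Margin of a set of examples w.r.t. w: the smallest margin over the points."""
    return min(example_margin(w, x, y) for x, y in zip(X, Y))
```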
Linear Separability
Def: For a binary classification problem, a set of examples S is linearly separable if there exists a linear decision boundary that can separate the positive points from the negative points.
[Figure: example datasets, some linearly separable and some not]
Analysis: Perceptron
Slide adapted from Nina Balcan
Perceptron Mistake Bound
Guarantee: If the data has margin γ and all points lie inside a ball of radius R, then Perceptron makes ≤ (R/γ)² mistakes.
(Normalized margin: multiplying all points by 100, or dividing all points by 100, doesn't change the number of mistakes; the algorithm is invariant to scaling.)
[Figure: linearly separable points with margin γ, contained in a ball of radius R, with separator θ∗]
Def: We say that the (batch) perceptron algorithm has converged if it stops making mistakes on the training data (perfectly classifies the training data). Main Takeaway: For linearly separable data, if the perceptron algorithm cycles repeatedly through the data, it will converge in a finite # of steps.
Analysis: Perceptron
Figure from Nina Balcan
Perceptron Mistake Bound
[Figure: linearly separable points with separator θ∗]
Theorem 0.1 (Block (1962), Novikoff (1962)). Given dataset: D = {(x^(i), y^(i))}_{i=1}^N.
Suppose:
1. Finite size inputs: ||x^(i)|| ≤ R
2. Linearly separable data: there exists θ∗ with ||θ∗|| = 1 such that y^(i)(θ∗ · x^(i)) ≥ γ, ∀i
Then: The number of mistakes made by the Perceptron algorithm on this dataset is k ≤ (R/γ)².
Common Misunderstanding: it is tempting to think the radius R is measured from the center of the points; in fact the assumption ||x^(i)|| ≤ R bounds the distance of each point from the origin.
Analysis: Perceptron
Proof of Perceptron Mistake Bound: We will show that there exist constants A and B s.t.
Ak ≤ ||θ^(k+1)|| ≤ B√k
Covered in Recitation
Analysis: Perceptron
[Figure: linearly separable points with separator θ∗]
Theorem 0.1 (restated above): under the same assumptions, the number of mistakes is k ≤ (R/γ)².
Algorithm 1 Perceptron Learning Algorithm (Online)
1: procedure Perceptron(D = {(x^(1), y^(1)), (x^(2), y^(2)), . . .})
2:   θ ← 0, k = 1                          ▷ Initialize parameters
3:   for i ∈ {1, 2, . . .} do              ▷ For each example
4:     if y^(i)(θ^(k) · x^(i)) ≤ 0 then    ▷ If mistake
5:       θ^(k+1) ← θ^(k) + y^(i) x^(i)     ▷ Update parameters
6:       k ← k + 1
7:   return θ
Covered in Recitation
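A direct translation of the pseudocode into Python (a minimal sketch; assumes labels in {-1, +1} and the rows of X as feature vectors):

```python
import numpy as np

def perceptron(X, Y, epochs=1):
    """Online perceptron: whenever a mistake is made, add y * x to the parameters."""
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):                 # cycle repeatedly through the data
        for x, y in zip(X, Y):
            if y * np.dot(theta, x) <= 0:   # mistake: wrong side (or on the boundary)
                theta = theta + y * x       # update parameters
    return theta
```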
Analysis: Perceptron
Proof of Perceptron Mistake Bound
Part 1: for some A, Ak ≤ ||θ^(k+1)||
θ^(k+1) · θ∗ = (θ^(k) + y^(i) x^(i)) · θ∗      by Perceptron algorithm update
            = θ^(k) · θ∗ + y^(i)(θ∗ · x^(i))
            ≥ θ^(k) · θ∗ + γ                    by assumption
⇒ θ^(k+1) · θ∗ ≥ kγ                             by induction on k, since θ^(1) = 0
⇒ ||θ^(k+1)|| ≥ kγ                              by the Cauchy-Schwarz inequality (||u|| ||v|| ≥ u · v) and ||θ∗|| = 1
Covered in Recitation
Analysis: Perceptron
Proof of Perceptron Mistake Bound
Part 2: for some B, ||θ^(k+1)|| ≤ B√k
||θ^(k+1)||² = ||θ^(k) + y^(i) x^(i)||²                        by Perceptron algorithm update
            = ||θ^(k)||² + (y^(i))² ||x^(i)||² + 2 y^(i)(θ^(k) · x^(i))
            ≤ ||θ^(k)||² + (y^(i))² ||x^(i)||²                 since kth mistake ⇒ y^(i)(θ^(k) · x^(i)) ≤ 0
            ≤ ||θ^(k)||² + R²                                  since (y^(i))² ||x^(i)||² = ||x^(i)||² ≤ R² by assumption and (y^(i))² = 1
⇒ ||θ^(k+1)||² ≤ kR²                                           by induction on k, since θ^(1) = 0
⇒ ||θ^(k+1)|| ≤ √k R
Covered in Recitation
Analysis: Perceptron
Proof of Perceptron Mistake Bound
Part 3: Combining the bounds finishes the proof.
kγ ≤ ||θ^(k+1)|| ≤ √k R
⇒ k ≤ (R/γ)²
The total number of mistakes must be less than this.
Covered in Recitation
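The bound is easy to check empirically by instrumenting the perceptron sketch above to count its mistakes on a small separable dataset (a rough illustration; the data, separator, and function names below are made up for the example):

```python
import numpy as np

def perceptron_mistakes(X, Y, epochs=10):
    """Online perceptron that also counts the mistakes it makes."""
    theta, mistakes = np.zeros(X.shape[1]), 0
    for _ in range(epochs):
        for x, y in zip(X, Y):
            if y * np.dot(theta, x) <= 0:
                theta, mistakes = theta + y * x, mistakes + 1
    return theta, mistakes

# Separable toy data; compare the mistake count to the (R / gamma)^2 bound.
X = np.array([[2.0, 1.0], [1.0, 2.0], [-2.0, -1.0], [-1.0, -2.0]])
Y = np.array([1, 1, -1, -1])
theta_star = np.array([1.0, 1.0]) / np.sqrt(2)                # a unit-norm separator
gamma = min(y * np.dot(theta_star, x) for x, y in zip(X, Y))  # margin of the data
R = max(np.linalg.norm(x) for x in X)                         # radius of the data
_, k = perceptron_mistakes(X, Y)
print(k, "<=", (R / gamma) ** 2)                              # 1 <= ~1.11
```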
Analysis: Perceptron
What if the data is not linearly separable?
1. Perceptron will not converge in this case (it can't!)
2. However, Freund & Schapire (1999) show that by projecting the points (hypothetically) into a higher dimensional space, we can achieve a similar bound on the number of mistakes made on one pass through the data.
20
Theorem2. Let⟨(x1, y1), . . . , (xm, ym)⟩beasequenceoflabeledexampleswith∥xi∥ ≤ R. Let u be any vector with ∥u∥ = 1 and let γ > 0. Define the deviation of each example as di = max{0, γ − yi(u · xi)}, and define D = m
i=1 d2 i . Then the number of mistakes of the online perceptron algorithmR + D γ 2 .
Perceptron Exercises
Question: Unlike Decision Trees and K-Nearest Neighbors, the Perceptron algorithm does not suffer from overfitting because it does not have any hyperparameters that could be over-tuned on the training data.
Summary: Perceptron
– Simple learning algorithm: whenever a mistake is made, add / subtract the features
– Perceptron will converge if the data are linearly separable; it will not converge if the data are linearly inseparable
– For linearly separable data, we can bound the number of mistakes (geometric argument)
– Extensions of Perceptron handle nonlinear separators and structured prediction
Perceptron Learning Objectives
You should be able to…
– Explain the difference between online learning and batch learning
– Implement the perceptron algorithm for binary classification [CIML]
– Determine whether the perceptron algorithm will converge based on properties of the dataset, and the limitations of the convergence guarantees
– Describe the inductive bias of perceptron and the limitations of linear models
LINEAR REGRESSION AS FUNCTION APPROXIMATION
Regression
Example Applications:
– Generating images (e.g. Deep Dream)
– Predicting the number of visitors to Machu Picchu on a given day
Regression Problems
Chalkboard
– Definition of Regression
– Linear functions
– Residuals
– Notation trick: fold in the intercept
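For instance, the intercept trick amounts to appending a constant feature of 1 to every example, so the bias becomes just another weight (a minimal sketch, not the chalkboard derivation itself):

```python
import numpy as np

# Fold in the intercept:  y ≈ w · x + b  becomes  y ≈ w' · x'
# with x' = [x, 1] and w' = [w, b].
X = np.array([[1.0, 2.0],
              [3.0, 4.0]])
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])   # append a constant 1 column
```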
Linear Regression as Function Approximation
Chalkboard
– Objective function: Mean squared error
– Hypothesis space: Linear Functions
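Concretely, for a linear hypothesis h(x) = θ · x (with the intercept folded in), the mean squared error objective can be evaluated as follows (a sketch with illustrative names):

```python
import numpy as np

def mean_squared_error(theta, X, Y):
    """J(theta) = (1/N) * sum_i (theta . x^(i) - y^(i))^2"""
    residuals = X @ theta - Y        # residual of each training example
    return np.mean(residuals ** 2)
```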
OPTIMIZATION IN CLOSED FORM
32
Optimization for ML
Not quite the same setting as other fields…
– The function we are optimizing might not be the true goal (e.g. likelihood vs. generalization error)
– Precision might not matter (e.g. the data is noisy, so being optimal up to 1e-16 might not help)
– Stopping early can help generalization error (i.e. "early stopping" is a technique for regularization, discussed more next time)
Topographical Maps
[Figure: topographical map examples]
Calculus and Optimization
In-Class Exercise: Plot three functions.
Answer Here:
Optimization for ML
Chalkboard
– Unconstrained optimization
– Convex, concave, nonconvex
– Derivatives
– Zero derivatives
– Gradient and Hessian
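As a concrete instance of these ideas, the gradient and Hessian of the mean squared error objective sketched earlier have simple closed forms (illustrative code, not the chalkboard derivation itself):

```python
import numpy as np

def mse_gradient(theta, X, Y):
    """Gradient of J(theta) = (1/N)||X theta - Y||^2: zero exactly at a critical point."""
    return (2.0 / X.shape[0]) * X.T @ (X @ theta - Y)

def mse_hessian(X):
    """Hessian of the MSE objective: (2/N) X^T X, a constant matrix."""
    return (2.0 / X.shape[0]) * X.T @ X
```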
Optimization: Closed form solutions
Chalkboard
– Example: 1-D function
– Example: higher dimensions
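A worked 1-D instance (illustrative; not necessarily the example used on the chalkboard):

```latex
J(\theta) = (\theta - 2)^2, \qquad
\frac{dJ}{d\theta} = 2(\theta - 2) = 0 \;\Rightarrow\; \theta^* = 2, \qquad
\frac{d^2 J}{d\theta^2} = 2 > 0 \;\Rightarrow\; \text{minimum.}
```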
Convexity
There is only one local optimum if the function is convex
Slide adapted from William Cohen
The Mean Squared Error function, which we will minimize for learning the parameters of Linear Regression, is convex!
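One way to check this convexity claim (a short sketch, using the design-matrix form of the objective):

```latex
J(\theta) = \tfrac{1}{N}\lVert X\theta - y \rVert^2
\quad\Rightarrow\quad
\nabla^2 J(\theta) = \tfrac{2}{N} X^\top X \succeq 0
```

The Hessian is positive semidefinite everywhere, so J is convex.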
CLOSED FORM SOLUTION FOR LINEAR REGRESSION
Optimization for Linear Regression
Chalkboard
– Closed-form (Normal Equations)
– Computational complexity of Closed-form Solution
– Stability of Closed-form Solution
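The normal equations give θ̂ = (XᵀX)⁻¹ Xᵀ y; a minimal sketch of the closed-form solver (illustrative names; in practice one solves the linear system rather than forming the inverse, which also helps with stability):

```python
import numpy as np

def linreg_normal_equations(X, Y):
    """Solve X^T X theta = X^T Y for theta (O(N d^2) to form X^T X, O(d^3) to solve)."""
    return np.linalg.solve(X.T @ X, X.T @ Y)
```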
Function Approximation
Chalkboard
– The Big Picture