SLIDE 1

Machine Learning Basics Lecture 4: SVM I

Princeton University COS 495 Instructor: Yingyu Liang

SLIDE 2

Review: machine learning basics

SLIDE 3

Math formulation

  • Given training data (x_i, y_i), 1 ≤ i ≤ n, i.i.d. from distribution D
  • Find y = f(x) ∈ ℋ that minimizes the empirical loss

    L̂(f) = (1/n) Σ_{i=1}^n ℓ(f, x_i, y_i)

  • s.t. the expected loss is small

    L(f) = 𝔼_{(x,y)~D} [ℓ(f, x, y)]
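As a concrete numeric illustration of the empirical loss (not from the slides; the squared loss, linear predictor, and toy data below are my own assumptions):

```python
import numpy as np

def empirical_loss(w, X, Y):
    """Empirical loss (1/n) * sum_i l(f, x_i, y_i) with squared loss and f(x) = w.x."""
    preds = X @ w                      # f(x_i) for every training point
    return np.mean((preds - Y) ** 2)

# Toy training set: n = 4 points in d = 2 dimensions.
X = np.array([[1.0, 2.0], [0.5, -1.0], [2.0, 0.0], [-1.0, 1.0]])
Y = np.array([3.0, -0.5, 2.0, 0.0])
print(empirical_loss(np.array([1.0, 1.0]), X, Y))
```

The expected loss L(f) cannot be computed this way, since D is unknown; the empirical average above is its estimate from the sample.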

SLIDE 4

Machine learning 1-2-3

  • Collect data and extract features
  • Build model: choose hypothesis class ℋ and loss function ℓ
  • Optimization: minimize the empirical loss
SLIDE 5

Loss function

  • π‘š2 loss: linear regression
  • Cross-entropy: logistic regression
  • Hinge loss: Perceptron
  • General principle: maximum likelihood estimation (MLE)
  • π‘š2 loss: corresponds to Normal distribution
  • logistic regression: corresponds to sigmoid conditional distribution
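For concreteness, the three losses as functions of the score f(x) and the label y (an illustrative sketch; it assumes labels y ∈ {+1, −1} for the classification losses, and takes the Perceptron's "hinge" as max(0, −y·f(x)), whereas the SVM hinge of later lectures adds a margin of 1):

```python
import numpy as np

def l2_loss(f, y):
    """Squared loss (f - y)^2, as in linear regression."""
    return (f - y) ** 2

def cross_entropy_loss(f, y):
    """Logistic / cross-entropy loss log(1 + exp(-y f)) for y in {+1, -1}."""
    return np.log(1.0 + np.exp(-y * f))

def perceptron_loss(f, y):
    """Perceptron criterion max(0, -y f); the SVM hinge would be max(0, 1 - y f)."""
    return np.maximum(0.0, -y * f)

score, label = 0.8, +1
print(l2_loss(score, label), cross_entropy_loss(score, label), perceptron_loss(score, label))
```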
SLIDE 6

Optimization

  • Linear regression: closed form solution
  • Logistic regression: gradient descent
  • Perceptron: stochastic gradient descent
  • General principle: local improvement
  • SGD: Perceptron; can also be applied to linear regression/logistic regression
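A minimal sketch of the Perceptron's stochastic "local improvement" update (the learning rate, epoch count, and toy data are arbitrary illustrative choices):

```python
import numpy as np

def perceptron_sgd(X, Y, epochs=10, lr=1.0):
    """Perceptron trained point by point: nudge w whenever an example is misclassified."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in np.random.permutation(n):
            if Y[i] * (w @ X[i]) <= 0:   # misclassified (or on the boundary)
                w += lr * Y[i] * X[i]    # local improvement step
    return w

X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
Y = np.array([+1, +1, -1, -1])
w = perceptron_sgd(X, Y)
print(w, np.sign(X @ w))   # learned weights and their predictions on the training points
```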
SLIDE 7

Principle for hypothesis class?

  • Yes, there exists a general principle (at least philosophically)
  • Different names/faces/connections
  • Occam’s razor
  • VC dimension theory
  • Minimum description length
  • Tradeoff between bias and variance; uniform convergence
  • The curse of dimensionality
  • Running example: Support Vector Machine (SVM)
SLIDE 8

Motivation

SLIDE 9

Linear classification

(π‘₯βˆ—)π‘ˆπ‘¦ = 0 Class +1 Class -1 π‘₯βˆ— (π‘₯βˆ—)π‘ˆπ‘¦ > 0 (π‘₯βˆ—)π‘ˆπ‘¦ < 0 Assume perfect separation between the two classes

SLIDE 10

Attempt

  • Given training data (x_i, y_i), 1 ≤ i ≤ n, i.i.d. from distribution D
  • Hypothesis: y = sign(f_w(x)) = sign(w^T x)
  • y = +1 if w^T x > 0
  • y = −1 if w^T x < 0
  • Let's assume that we can optimize to find w
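The hypothesis above, written out (a tiny illustrative helper; the weights and points are made up):

```python
import numpy as np

def predict(w, X):
    """Linear classifier: y = sign(w^T x) for each row x of X."""
    return np.sign(X @ w)

w = np.array([1.0, -2.0])
X = np.array([[3.0, 1.0], [0.5, 1.0]])
print(predict(w, X))   # [ 1. -1.]
```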
SLIDE 11

Multiple optimal solutions?

Class +1 Class -1 π‘₯2 π‘₯3 π‘₯1 Same on empirical loss; Different on test/expected loss

SLIDE 12

What about π‘₯1?

Class +1 Class -1 π‘₯1

New test data

SLIDE 13

What about π‘₯3?

Class +1 Class -1 π‘₯3

New test data

SLIDE 14

Most confident: w_2

Class +1 Class -1 π‘₯2

New test data

SLIDE 15

Intuition: margin

Class +1 Class -1 π‘₯2

large margin

SLIDE 16

Margin

SLIDE 17

Margin

  • Lemma 1: x has distance |f_w(x)| / ‖w‖ to the hyperplane f_w(x) = w^T x = 0

Proof:

  • w is orthogonal to the hyperplane
  • The unit direction is w / ‖w‖
  • The projection of x onto this direction is (w / ‖w‖)^T x = f_w(x) / ‖w‖

[Figure: point x, weight vector w, and the projection of x onto the direction w / ‖w‖.]
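A quick numeric check of Lemma 1 (illustrative; the vectors are made up): the distance obtained by projecting x onto the unit normal matches |f_w(x)| / ‖w‖.

```python
import numpy as np

w = np.array([3.0, 4.0])          # normal vector of the hyperplane w^T x = 0
x = np.array([2.0, 1.0])

u = w / np.linalg.norm(w)         # unit direction orthogonal to the hyperplane
dist_by_projection = abs(u @ x)   # length of the projection of x onto u
dist_by_lemma = abs(w @ x) / np.linalg.norm(w)

print(dist_by_projection, dist_by_lemma)   # both 2.0
```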

SLIDE 18

Margin: with bias

  • Claim 1: π‘₯ is orthogonal to the hyperplane 𝑔

π‘₯,𝑐 𝑦 = π‘₯π‘ˆπ‘¦ + 𝑐 = 0

Proof:

  • pick any 𝑦1 and 𝑦2 on the hyperplane
  • π‘₯π‘ˆπ‘¦1 + 𝑐 = 0
  • π‘₯π‘ˆπ‘¦2 + 𝑐 = 0
  • So π‘₯π‘ˆ(𝑦1 βˆ’ 𝑦2) = 0
SLIDE 19

Margin: with bias

  • Claim 2: 0 has distance −b / ‖w‖ to the hyperplane w^T x + b = 0

Proof:

  • pick any x_1 on the hyperplane
  • Project x_1 onto the unit direction w / ‖w‖ to get the (signed) distance
  • (w / ‖w‖)^T x_1 = −b / ‖w‖ since w^T x_1 + b = 0

SLIDE 20

Margin: with bias

  • Lemma 2: x has distance |f_{w,b}(x)| / ‖w‖ to the hyperplane f_{w,b}(x) = w^T x + b = 0

Proof:

  • Let x = x_⊥ + r · w / ‖w‖, where x_⊥ lies on the hyperplane; then |r| is the distance
  • Multiply both sides by w^T and add b
  • Left hand side: w^T x + b = f_{w,b}(x)
  • Right hand side: w^T x_⊥ + r · w^T w / ‖w‖ + b = 0 + r ‖w‖
  • So f_{w,b}(x) = r ‖w‖, i.e., the distance is |r| = |f_{w,b}(x)| / ‖w‖

SLIDE 21

𝑧 𝑦 = π‘₯π‘ˆπ‘¦ + π‘₯0 The notation here is:

Figure from Pattern Recognition and Machine Learning, Bishop

SLIDE 22

Support Vector Machine (SVM)

SLIDE 23

SVM: objective

  • Margin over all training data points:

    γ = min_i |f_{w,b}(x_i)| / ‖w‖

  • Since we only want a correct f_{w,b}, and recall y_i ∈ {+1, −1}, we have

    γ = min_i y_i f_{w,b}(x_i) / ‖w‖

  • If f_{w,b} is incorrect on some x_i, the margin is negative

SLIDE 24

SVM: objective

  • Maximize margin over all training data points:

    max_{w,b} γ = max_{w,b} min_i y_i f_{w,b}(x_i) / ‖w‖ = max_{w,b} min_i y_i (w^T x_i + b) / ‖w‖

  • A bit complicated …
SLIDE 25

SVM: simplified objective

  • Observation: when (w, b) is scaled by a factor c, the margin is unchanged:

    y_i (c w^T x_i + c b) / ‖c w‖ = y_i (w^T x_i + b) / ‖w‖

  • Let's consider a fixed scale such that

    y_{i*} (w^T x_{i*} + b) = 1

    where x_{i*} is the point closest to the hyperplane

SLIDE 26

SVM: simplified objective

  • Let's consider a fixed scale such that

    y_{i*} (w^T x_{i*} + b) = 1

    where x_{i*} is the point closest to the hyperplane

  • Now we have for all data

    y_i (w^T x_i + b) ≥ 1, and the equality holds for at least one i

  • Then the margin is 1 / ‖w‖

SLIDE 27

SVM: simplified objective

  • Optimization simplified to

    min_{w,b} (1/2) ‖w‖²
    s.t. y_i (w^T x_i + b) ≥ 1, ∀i

  • How to find the optimum ŵ*?

SLIDE 28

SVM: principle for hypothesis class

SLIDE 29

Thought experiment

  • Suppose we pick an R, and suppose we can decide whether there exists a w satisfying

    (1/2) ‖w‖² ≤ R
    y_i (w^T x_i + b) ≥ 1, ∀i

  • Decrease R until we cannot find a w satisfying the inequalities (see the sketch below)
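A sketch of this thought experiment, again leaning on cvxpy (an external dependency; the halving schedule and toy data are arbitrary illustrative choices) for the feasibility check at each candidate R:

```python
import cvxpy as cp
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 0.5], [-2.0, -1.0], [-0.5, -2.0]])
Y = np.array([+1, +1, -1, -1])

def feasible(R):
    """Is there a (w, b) with (1/2)||w||^2 <= R and y_i (w^T x_i + b) >= 1 for all i?"""
    w, b = cp.Variable(2), cp.Variable()
    constraints = [0.5 * cp.sum_squares(w) <= R,
                   cp.multiply(Y, X @ w + b) >= 1]
    prob = cp.Problem(cp.Minimize(0), constraints)
    prob.solve()
    return prob.status == cp.OPTIMAL

R = 10.0
while feasible(R / 2):   # keep shrinking the hypothesis class while a valid w still exists
    R /= 2
print("smallest feasible R found by halving:", R)
```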
SLIDE 30

Thought experiment

  • ŵ* is the best weight (i.e., satisfying the smallest R)

SLIDE 31

Thought experiment

  • ŵ* is the best weight (i.e., satisfying the smallest R)

SLIDE 32

Thought experiment

  • ŵ* is the best weight (i.e., satisfying the smallest R)

SLIDE 33

Thought experiment

  • ŵ* is the best weight (i.e., satisfying the smallest R)

SLIDE 34

Thought experiment

  • ŵ* is the best weight (i.e., satisfying the smallest R)

SLIDE 35

Thought experiment

  • To handle the difference between empirical and expected losses →
  • Choose large margin hypothesis (high confidence) →
  • Choose a small hypothesis class

[Figure: ŵ* and the region that corresponds to the hypothesis class.]

SLIDE 36

Thought experiment

  • Principle: use smallest hypothesis class still with a correct/good one
  • Also true beyond SVM
  • Also true for the case without perfect separation between the two classes
  • Math formulation: VC-dim theory, etc.

ෝ π‘₯βˆ—

Corresponds to the hypothesis class

SLIDE 37

Thought experiment

  • Principle: use smallest hypothesis class still with a correct/good one
  • Whatever you know about the ground truth, add it as constraint/regularizer

ෝ π‘₯βˆ—

Corresponds to the hypothesis class

SLIDE 38

SVM: optimization

  • Optimization (Quadratic Programming):

    min_{w,b} (1/2) ‖w‖²
    s.t. y_i (w^T x_i + b) ≥ 1, ∀i

  • Solved by the Lagrange multiplier method:

    ℒ(w, b, α) = (1/2) ‖w‖² − Σ_i α_i [y_i (w^T x_i + b) − 1]

    where the α_i are the Lagrange multipliers

  • Details in next lecture
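In practice, a library solver handles this QP. For instance, scikit-learn's linear SVC with a very large C approximates the hard-margin problem above (an approximation via an external library, not the derivation of the next lecture; the toy data are made up):

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 0.5], [-2.0, -1.0], [-0.5, -2.0]])
Y = np.array([+1, +1, -1, -1])

# Large C -> (near) hard-margin linear SVM; the solver works through the Lagrangian dual.
clf = SVC(kernel="linear", C=1e6).fit(X, Y)

print(clf.coef_, clf.intercept_)   # w and b
print(clf.support_vectors_)        # points that (approximately) attain y_i (w^T x_i + b) = 1
```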
SLIDE 39

Reading

  • Review Lagrange multiplier method
  • E.g., Section 5 in Andrew Ng's note on SVM, posted on the course website:

http://www.cs.princeton.edu/courses/archive/spring16/cos495/