Support Vector Machines Part 1 (Yingyu Liang, Computer Sciences 760)



SLIDE 1

Support Vector Machines Part 1

Yingyu Liang Computer Sciences 760 Fall 2017

http://pages.cs.wisc.edu/~yliang/cs760/

Some of the slides in these lectures have been adapted/borrowed from materials developed by Mark Craven, David Page, Jude Shavlik, Tom Mitchell, Nina Balcan, Matt Gormley, Elad Hazan, Tom Dietterich, and Pedro Domingos.

SLIDE 2

Goals for the lecture

you should understand the following concepts

  • the margin
  • the linear support vector machine
  • the primal and dual formulations of SVM learning
  • support vectors
  • VC-dimension and maximizing the margin


SLIDE 3

Motivation

SLIDE 4

Linear classification

Separating hyperplane: (w*)^T x = 0. Points with (w*)^T x > 0 are Class +1; points with (w*)^T x < 0 are Class -1. Assume perfect separation between the two classes.

SLIDE 5

Attempt

  • Given training data (x_i, y_i), 1 ≤ i ≤ n, i.i.d. from distribution D
  • Hypothesis: y = sign(f_w(x)) = sign(w^T x)
      • y = +1 if w^T x > 0
      • y = -1 if w^T x < 0
  • Let's assume that we can optimize to find w
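To make the hypothesis concrete, here is a minimal sketch in Python (the weight vector and points are made-up illustrative values, not from the slides):

```python
def predict(w, x):
    """Hypothesized linear classifier: y = sign(w^T x).

    w, x: lists of floats of equal length (no bias term here,
    matching the through-the-origin setup on this slide)."""
    score = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if score > 0 else -1

# hypothetical weight vector and test points
w = [1.0, -1.0]
print(predict(w, [2.0, 0.5]))   # w^T x = 1.5 > 0  -> +1
print(predict(w, [0.0, 3.0]))   # w^T x = -3.0 < 0 -> -1
```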
SLIDE 6

Multiple optimal solutions?

[Figure: Class +1 and Class -1 with three separating hyperplanes w_1, w_2, w_3]

All three are the same on empirical loss, but different on test/expected loss.

SLIDE 7

What about w_1?

[Figure: Class +1 and Class -1 with separator w_1, plus new test data]

SLIDE 8

What about w_3?

[Figure: Class +1 and Class -1 with separator w_3, plus new test data]

SLIDE 9

Most confident: w_2

[Figure: Class +1 and Class -1 with separator w_2, plus new test data]

SLIDE 10

Intuition: margin

[Figure: separator w_2 with a large margin to both Class +1 and Class -1]

SLIDE 11

Margin

SLIDE 12

Margin

  • Lemma 1: x has distance |f_w(x)| / ||w|| to the hyperplane f_w(x) = w^T x = 0

Proof:

  • w is orthogonal to the hyperplane
  • The unit direction is w / ||w||
  • The projection of x onto this direction is (w / ||w||)^T x = f_w(x) / ||w||
SLIDE 13

Margin: with bias

  • Claim 1: w is orthogonal to the hyperplane f_{w,b}(x) = w^T x + b = 0

Proof:

  • pick any x_1 and x_2 on the hyperplane
  • w^T x_1 + b = 0
  • w^T x_2 + b = 0
  • So w^T (x_1 - x_2) = 0, i.e., w is orthogonal to every direction lying in the hyperplane
SLIDE 14

Margin: with bias

  • Claim 2: 0 has (signed) distance -b / ||w|| to the hyperplane w^T x + b = 0

Proof:

  • pick any x_1 on the hyperplane
  • project x_1 onto the unit direction w / ||w|| to get the distance
  • (w / ||w||)^T x_1 = -b / ||w||, since w^T x_1 + b = 0

SLIDE 15

Margin: with bias

  • Lemma 2: x has distance |f_{w,b}(x)| / ||w|| to the hyperplane f_{w,b}(x) = w^T x + b = 0

Proof:

  • Write x = x_perp + r · w / ||w||, where x_perp lies on the hyperplane; then |r| is the distance
  • Multiply both sides by w^T and add b
  • Left-hand side: w^T x + b = f_{w,b}(x)
  • Right-hand side: w^T x_perp + r · (w^T w) / ||w|| + b = 0 + r ||w||
  • So r = f_{w,b}(x) / ||w||

SLIDE 16

The notation in the figure is y(x) = w^T x + w_0 (the bias is written w_0).

Figure from Pattern Recognition and Machine Learning, Bishop

SLIDE 17

Support Vector Machine (SVM)

SLIDE 18

SVM: objective

  • Margin over all training data points:

    γ = min_i |f_{w,b}(x_i)| / ||w||

  • Since we only want correct f_{w,b}, and recall y_i ∈ {+1, -1}, we have

    γ = min_i y_i f_{w,b}(x_i) / ||w||

  • If f_{w,b} is incorrect on some x_i, the margin is negative

SLIDE 19

SVM: objective

  • Maximize the margin over all training data points:

    max_{w,b} γ = max_{w,b} min_i y_i f_{w,b}(x_i) / ||w|| = max_{w,b} min_i y_i (w^T x_i + b) / ||w||

  • A bit complicated …
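The min-over-points margin above can be written directly as code (the toy data set below is hypothetical):

```python
import math

def margin(w, b, data):
    """Margin min_i y_i (w^T x_i + b) / ||w|| over the data.
    Negative if (w, b) misclassifies some point."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
               for x, y in data) / norm

# hypothetical linearly separable toy set
data = [([1.0, 1.0], 1), ([2.0, 2.0], 1), ([-1.0, -1.0], -1)]
print(margin([1.0, 1.0], 0.0, data))   # minimum value 2, so 2/sqrt(2) ≈ 1.414
```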
SLIDE 20

SVM: simplified objective

  • Observation: when (w, b) is scaled by a factor t, the margin is unchanged:

    y_i (t w^T x_i + t b) / ||t w|| = y_i (w^T x_i + b) / ||w||

  • Let's consider a fixed scale such that

    y_{i*} (w^T x_{i*} + b) = 1

    where x_{i*} is the point closest to the hyperplane

SLIDE 21

SVM: simplified objective

  • Let's consider a fixed scale such that

    y_{i*} (w^T x_{i*} + b) = 1

    where x_{i*} is the point closest to the hyperplane

  • Now we have for all data

    y_i (w^T x_i + b) ≥ 1

    and equality holds for at least one i

  • Then the margin is 1 / ||w||

SLIDE 22

SVM: simplified objective

  • Optimization simplified to

    min_{w,b} (1/2) ||w||²
    s.t. y_i (w^T x_i + b) ≥ 1, ∀i

  • How to find the optimum (w*, b*)?
  • Solved by the Lagrange multiplier method
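On a tiny hypothetical two-point set, the simplified primal can even be brute-forced; the search below is restricted to w = (a, a), b = 0 by symmetry and is purely illustrative:

```python
# Brute-force the primal  min (1/2)||w||^2  s.t.  y_i (w^T x_i + b) >= 1
# on the hypothetical toy set x1 = (1, 1), y1 = +1; x2 = (-1, -1), y2 = -1.
# By symmetry we search w = (a, a), b = 0; the analytic optimum is a = 0.5.
xs = [(1.0, 1.0), (-1.0, -1.0)]
ys = [1, -1]

def feasible(a, b):
    return all(y * (a * x[0] + a * x[1] + b) >= 1.0
               for x, y in zip(xs, ys))

candidates = [(i / 1000.0, 0.0) for i in range(0, 2001)]
best = min((a for a, b in candidates if feasible(a, b)), key=lambda a: a * a)
print(best)   # 0.5: w = (0.5, 0.5) gives margin 1/||w|| = sqrt(2)
```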
SLIDE 23

Lagrange multiplier

SLIDE 24

Lagrangian

  • Consider the optimization problem:

    min_w f(w)
    s.t. h_i(w) = 0, ∀ 1 ≤ i ≤ m

  • Lagrangian:

    ℒ(w, β) = f(w) + Σ_i β_i h_i(w)

    where the β_i are called Lagrange multipliers

SLIDE 25

Lagrangian

  • Consider the optimization problem:

    min_w f(w)
    s.t. h_i(w) = 0, ∀ 1 ≤ i ≤ m

  • Solved by setting the derivatives of the Lagrangian to 0:

    ∂ℒ/∂w_i = 0;  ∂ℒ/∂β_i = 0

SLIDE 26

Generalized Lagrangian

  • Consider the optimization problem:

    min_w f(w)
    s.t. g_i(w) ≤ 0, ∀ 1 ≤ i ≤ l
         h_k(w) = 0, ∀ 1 ≤ k ≤ m

  • Generalized Lagrangian:

    ℒ(w, α, β) = f(w) + Σ_i α_i g_i(w) + Σ_k β_k h_k(w)

    where the α_i, β_k are called Lagrange multipliers

SLIDE 27

Generalized Lagrangian

  • Consider the quantity:

    θ_P(w) := max_{α,β: α_i ≥ 0} ℒ(w, α, β)

  • Why?

    θ_P(w) = f(w) if w satisfies all the constraints; +∞ if w does not satisfy the constraints

  • So minimizing f(w) is the same as minimizing θ_P(w):

    min_w f(w) = min_w θ_P(w) = min_w max_{α,β: α_i ≥ 0} ℒ(w, α, β)
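The case analysis for θ_P can be illustrated on a one-dimensional toy problem (f, g, and the multiplier grid below are hypothetical):

```python
# Toy problem: f(w) = w^2, one inequality constraint g(w) = 1 - w <= 0 (i.e. w >= 1).
def f(w): return w * w
def g(w): return 1.0 - w
def L(w, alpha): return f(w) + alpha * g(w)

def theta_P(w, alphas):
    """max over a finite grid of multipliers alpha >= 0 (stand-in for the sup)."""
    return max(L(w, a) for a in alphas)

alphas = [0.0, 1.0, 10.0, 100.0, 1000.0]   # growing multipliers
print(theta_P(2.0, alphas))   # feasible w=2: max is at alpha=0 -> f(2) = 4.0
print(theta_P(0.0, alphas))   # infeasible w=0: grows with alpha -> 1000.0 on this grid
```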

SLIDE 28

Lagrange duality

  • The primal problem

    p* := min_w f(w) = min_w max_{α,β: α_i ≥ 0} ℒ(w, α, β)

  • The dual problem

    d* := max_{α,β: α_i ≥ 0} min_w ℒ(w, α, β)

  • Always true (weak duality):

    d* ≤ p*

SLIDE 29

Lagrange duality

  • The primal problem

    p* := min_w f(w) = min_w max_{α,β: α_i ≥ 0} ℒ(w, α, β)

  • The dual problem

    d* := max_{α,β: α_i ≥ 0} min_w ℒ(w, α, β)

  • Interesting case: when do we have d* = p*?

SLIDE 30

Lagrange duality

  • Theorem: under proper conditions, there exist w*, α*, β* such that

    d* = ℒ(w*, α*, β*) = p*

  • Moreover, w*, α*, β* satisfy the Karush-Kuhn-Tucker (KKT) conditions:

    ∂ℒ/∂w_i = 0,  α_i g_i(w) = 0
    g_i(w) ≤ 0,  h_k(w) = 0,  α_i ≥ 0

SLIDE 31

Lagrange duality

  • Theorem: under proper conditions, there exist w*, α*, β* such that

    d* = ℒ(w*, α*, β*) = p*

  • Moreover, w*, α*, β* satisfy the KKT conditions:

    ∂ℒ/∂w_i = 0,  α_i g_i(w) = 0 (dual complementarity)
    g_i(w) ≤ 0,  h_k(w) = 0,  α_i ≥ 0

SLIDE 32

Lagrange duality

  • Theorem: under proper conditions, there exist w*, α*, β* such that

    d* = ℒ(w*, α*, β*) = p*

  • Moreover, w*, α*, β* satisfy the KKT conditions:

    ∂ℒ/∂w_i = 0,  α_i g_i(w) = 0 (dual complementarity)
    g_i(w) ≤ 0,  h_k(w) = 0 (primal constraints),  α_i ≥ 0 (dual constraints)

SLIDE 33

Lagrange duality

  • What are the proper conditions?
  • One set of conditions (Slater's conditions):
      • f, g_i convex, h_k affine, and there exists w satisfying all g_i(w) < 0
  • Other sets of conditions exist
  • Check textbooks, e.g., Convex Optimization by Boyd and Vandenberghe
SLIDE 34

SVM: optimization

SLIDE 35

SVM: optimization

  • Optimization (quadratic programming):

    min_{w,b} (1/2) ||w||²
    s.t. y_i (w^T x_i + b) ≥ 1, ∀i

  • Generalized Lagrangian:

    ℒ(w, b, α) = (1/2) ||w||² - Σ_i α_i [y_i (w^T x_i + b) - 1]

    where the α_i are the Lagrange multipliers

SLIDE 36

SVM: optimization

  • KKT conditions:

    ∂ℒ/∂w = 0  →  w = Σ_i α_i y_i x_i    (1)
    ∂ℒ/∂b = 0  →  0 = Σ_i α_i y_i        (2)

  • Plug into ℒ:

    ℒ(w, b, α) = Σ_i α_i - (1/2) Σ_{i,k} α_i α_k y_i y_k x_i^T x_k    (3)

    combined with 0 = Σ_i α_i y_i, α_i ≥ 0

SLIDE 37

SVM: optimization

  • Reduces to the dual problem:

    max_α Σ_i α_i - (1/2) Σ_{i,k} α_i α_k y_i y_k x_i^T x_k
    s.t. Σ_i α_i y_i = 0, α_i ≥ 0

  • Since w = Σ_i α_i y_i x_i, we have

    w^T x + b = Σ_i α_i y_i x_i^T x + b

  • Only depends on inner products
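On a hypothetical two-point data set, the dual can be maximized by a simple grid search (the reduction alpha_1 = alpha_2 = a follows from the equality constraint, since y_1 = +1 and y_2 = -1):

```python
# Toy set: x1 = (1, 1) with y1 = +1, x2 = (-1, -1) with y2 = -1.
# With alpha1 = alpha2 = a, the dual objective reduces to 2a - 4a^2.
xs = [(1.0, 1.0), (-1.0, -1.0)]
ys = [1, -1]

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

def dual(a):
    alphas = [a, a]
    return (sum(alphas)
            - 0.5 * sum(alphas[i] * alphas[k] * ys[i] * ys[k] * dot(xs[i], xs[k])
                        for i in range(2) for k in range(2)))

best_a = max((i / 10000.0 for i in range(10001)), key=dual)
w = [sum(best_a * ys[i] * xs[i][d] for i in range(2)) for d in range(2)]
print(best_a)   # 0.25, the analytic maximizer of 2a - 4a^2
print(w)        # w = sum_i alpha_i y_i x_i = [0.5, 0.5]
```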

SLIDE 38

Support Vectors

  • the instances with α_i > 0 are called support vectors
  • they lie on the margin boundary
  • the solution is NOT changed if we delete the instances with α_i = 0
  • the final solution is a sparse linear combination of the training instances
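A sketch of the sparsity point: prediction only needs the instances with α_i > 0 (all numbers below are hypothetical toy values, e.g. extending the earlier two-point example with a far-away third point):

```python
# The third point gets alpha = 0 and can be deleted without changing the classifier.
xs = [(1.0, 1.0), (-1.0, -1.0), (3.0, 3.0)]
ys = [1, -1, 1]
alphas = [0.25, 0.25, 0.0]   # the far point (3, 3) is not a support vector
b = 0.0

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

def predict(x):
    """Sparse decision function: sum over support vectors only."""
    s = sum(a * y * dot(xi, x) for a, y, xi in zip(alphas, ys, xs) if a > 0) + b
    return 1 if s > 0 else -1

support = [i for i, a in enumerate(alphas) if a > 0]
print(support)               # [0, 1] -- only the margin-boundary points
print(predict((0.5, 0.0)))   # +1
print(predict((-2.0, 0.0)))  # -1
```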
SLIDE 39

Learning theory justification

  • Vapnik showed a connection between the margin and the VC-dimension:

    VC ≤ 4R² / margin²(h)

    where R is a constant dependent on the training data and VC is the VC-dimension of the hypothesis class

  • The VC-dimension appears in the error bound

    error_D(h) ≤ error_train(h) + sqrt( (VC (log(2n/VC) + 1) + log(4/δ)) / n )

    where error_D(h) is the error on the true distribution and error_train(h) is the training set error

  • thus to minimize the VC-dimension (and to improve the error bound): maximize the margin