Linear, Binary SVM Classifiers — COMPSCI 371D Machine Learning



SLIDE 1

Linear, Binary SVM Classifiers

COMPSCI 371D — Machine Learning

COMPSCI 371D — Machine Learning Linear, Binary SVM Classifiers 1 / 17

slide-2
SLIDE 2

Outline

1. What Linear, Binary SVM Classifiers Do
2. Margin
3. Loss and Regularized Risk
4. Training an SVM is a Quadratic Program
5. The KKT Conditions and the Support Vectors

SLIDE 3

What Linear, Binary SVM Classifiers Do

The Separable Case

  • Where to place the boundary?
  • The number of degrees of freedom grows with the dimension d

SLIDE 4

What Linear, Binary SVM Classifiers Do

SVMs Maximize the Smallest Margin

  • Placing the boundary as far as possible from the nearest samples improves generalization
  • Leave as much empty space around the boundary as possible
  • Only the points that barely make the margin matter
  • These are the support vectors
  • Initially, we don’t know which points will be support vectors

SLIDE 5

What Linear, Binary SVM Classifiers Do

The General Case

  • If the data is not linearly separable, there must be misclassified samples. These have a negative margin
  • Assign a penalty that increases when the smallest margin diminishes (penalize a small margin between classes) and grows with any negative margin (penalize misclassified samples)
  • Give different weights to the two penalties (cross-validation!)
  • Find the optimal compromise: minimum risk (total penalty)

SLIDE 6

Margin

Separating Hyperplane

  • X = R^d and Y = {−1, 1} (more convenient labels)
  • Hyperplane: n^T x + c = 0 with ‖n‖ = 1
  • Decision rule: ŷ = h(x) = sign(n^T x + c)
  • n points towards the ŷ = 1 half-space
  • If y is the true label, the decision is correct if
      n^T x + c ≥ 0 when y = 1
      n^T x + c ≤ 0 when y = −1
  • More compactly, the decision is correct if y(n^T x + c) ≥ 0
  • SVMs want this inequality to hold with a margin
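To make the rule concrete, here is a minimal numpy sketch of the decision rule and the correctness check; the specific values of n, c, and x are made up for illustration, not from the slides.

```python
import numpy as np

def classify(x, n, c):
    """Linear decision rule: y_hat = sign(n^T x + c)."""
    return int(np.sign(n @ x + c))

# A unit normal n and an offset c define the hyperplane n^T x + c = 0.
n = np.array([0.6, 0.8])          # ||n|| = 1
c = -1.0

x, y = np.array([2.0, 1.0]), 1    # a sample and its true label
print(classify(x, n, c))          # predicted label in {-1, +1}
print(y * (n @ x + c) >= 0)       # True iff the decision is correct
```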

SLIDE 7

Margin

Margin

  • The margin of (x, y) is the signed distance of x from the boundary: positive if x is on the correct side of the boundary, negative otherwise:
      µ_v(x, y) := y (n^T x + c)
  • v = (n, c)
  • Margin of a training set T:
      µ_v(T) := min_{(x,y)∈T} µ_v(x, y)
  • The boundary separates T if µ_v(T) > 0

[Figure: a separating hyperplane with unit normal n pointing into the ŷ = 1 half-space, and the signed margins µ_v(x, 1) and µ_v(x, −1) of one sample on each side]
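The two definitions translate directly into code; a small sketch with made-up data (the arrays below are illustrative):

```python
import numpy as np

def margin(x, y, n, c):
    """Signed margin mu_v(x, y) = y * (n^T x + c), with ||n|| = 1."""
    return y * (n @ x + c)

def training_set_margin(X, Y, n, c):
    """mu_v(T): the smallest margin over the training set T."""
    return min(margin(x, y, n, c) for x, y in zip(X, Y))

X = np.array([[2.0, 1.0], [0.0, 0.0], [1.0, 3.0]])
Y = np.array([1, -1, 1])
n, c = np.array([0.6, 0.8]), -1.0
# The hyperplane separates T exactly when this minimum is positive.
print(training_set_margin(X, Y, n, c))
```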

SLIDE 8

Loss and Regularized Risk

The Hinge Loss

  • Reference margin µ∗ > 0 (unknown, to be determined)
  • Hinge loss:
      ℓ_v(x, y) := (1/µ∗) max{0, µ∗ − µ_v(x, y)}
  • Training samples with µ_v(x, y) ≥ µ∗ are classified correctly with a margin of at least µ∗
  • Some loss is incurred as soon as µ_v(x, y) < µ∗, even if the sample is classified correctly

[Figure: the reference margins at distance µ∗ on either side of the separating hyperplane, and the hinge losses ℓ_v(x, 1) and ℓ_v(x, −1) incurred by samples that fail to clear them]
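The hinge loss is a one-liner; this sketch (with made-up values for n, c, and µ∗) shows that a correctly classified sample still pays a loss when its margin falls short of µ∗:

```python
import numpy as np

def hinge_loss(x, y, n, c, mu_star):
    """l_v(x, y) = (1 / mu_star) * max{0, mu_star - mu_v(x, y)}."""
    mu = y * (n @ x + c)                   # signed margin of (x, y)
    return max(0.0, mu_star - mu) / mu_star

n, c, mu_star = np.array([0.6, 0.8]), -1.0, 2.0
print(hinge_loss(np.array([2.0, 1.0]), 1, n, c, mu_star))  # 0.5: correct, but inside the margin
print(hinge_loss(np.array([4.0, 2.0]), 1, n, c, mu_star))  # 0.0: clears the reference margin
```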

SLIDE 9

Loss and Regularized Risk

The Training Risk

  • The training risk for SVMs is not just (1/N) Σ_{n=1}^N ℓ_v(x_n, y_n)
  • A regularization term is added to force µ∗ to be large
  • Separating hyperplane is n^T x + c = 0
  • Let w^T x + b = 0 with w = ωn, b = ωc, and ω = ‖w‖ = 1/µ∗
  • ω is a reciprocal scaling factor if w is changed for a fixed b: large margin, small ω
  • Make the risk higher when ω is large (small margin):
      L_T(w, b) := (1/2)‖w‖² + (C/N) Σ_{n=1}^N ℓ_(w,b)(x_n, y_n)
    where
      ℓ_(w,b)(x, y) = (1/µ∗) max{0, µ∗ − µ_(w,b)(x, y)}
                    = (1/µ∗) max{0, µ∗ − y(n^T x + c)}
                    = max{0, 1 − y(w^T x + b)}
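Written out, the regularized risk is a few lines of numpy; the data and parameters below are illustrative:

```python
import numpy as np

def svm_risk(w, b, X, Y, C):
    """L_T(w, b) = 0.5 ||w||^2 + (C/N) sum_n max{0, 1 - y_n (w^T x_n + b)}."""
    hinge = np.maximum(0.0, 1.0 - Y * (X @ w + b))
    return 0.5 * (w @ w) + (C / len(Y)) * hinge.sum()

X = np.array([[2.0, 1.0], [0.0, 0.0], [1.0, 3.0]])
Y = np.array([1.0, -1.0, 1.0])
print(svm_risk(np.array([1.2, 1.6]), -2.0, X, Y, C=1.0))
```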

SLIDE 10

Loss and Regularized Risk

Regularized Risk

  • ERM classifier:
      (w∗, b∗) = ERM_T(w, b) = arg min_(w,b) L_T(w, b)
    where
      L_T(w, b) := (1/2)‖w‖² + (C/N) Σ_{n=1}^N ℓ_(w,b)(x_n, y_n)
      ℓ_(w,b)(x_n, y_n) := max{0, 1 − y_n(w^T x_n + b)}
  • C determines a trade-off
  • Large C ⇒ the regularizer (1/2)‖w‖² matters less ⇒ larger ω ⇒ smaller margin ⇒ fewer samples within the margin
  • We buy a larger margin by accepting more samples inside it
  • C is a hyper-parameter: cross-validation!
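As a sketch of that cross-validation step, one could search over C with scikit-learn; the library, the synthetic dataset, and the grid of C values below are illustrative choices, not prescribed by the slides:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Synthetic, nearly linearly separable data (illustrative).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
Y = np.sign(X[:, 0] + X[:, 1] + 0.1 * rng.normal(size=200))

# 5-fold cross-validation over a grid of C values.
search = GridSearchCV(LinearSVC(), {"C": [0.01, 0.1, 1.0, 10.0]}, cv=5)
search.fit(X, Y)
print(search.best_params_)
```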

SLIDE 11

Training an SVM is a Quadratic Program

Rephrasing Training as a Quadratic Program

  • (w∗, b∗) = arg min_(w,b) (1/2)‖w‖² + (C/N) Σ_{n=1}^N ℓ_n
    where, with ν_n := y_n(w^T x_n + b),
      ℓ_n = ℓ_(w,b)(ν_n) := max{0, 1 − y_n(w^T x_n + b)} = max{0, 1 − ν_n}
  • Not differentiable because of the max: bummer!
  • Neat trick:
      • Introduce new slack variables ξ_n = ℓ_n
      • Note that ξ_n = ℓ_n is the same as ξ_n = min_{ξ≥ℓ_n} ξ

[Figure: the hinge loss ℓ_(w,b)(ν) = max{0, 1 − ν} as a function of ν, with the slack ξ_n equal to the loss ℓ_n at ν_n]

  • We moved ℓ_n from the target to a constraint

SLIDE 12

Training an SVM is a Quadratic Program

Rephrasing Training as a Quadratic Program

  • Changed from
      (w∗, b∗) = arg min_(w,b) (1/2)‖w‖² + (C/N) Σ_{n=1}^N ℓ_n(ν_n)
    to
      (w∗, b∗) = arg min_(w,b) (1/2)‖w‖² + (C/N) Σ_{n=1}^N ξ_n
    where the ξ_n are new variables subject to the constraints ξ_n ≥ ℓ_n(ν_n)
  • Now the target is a quadratic function of w, b, ξ_1, . . . , ξ_N
  • However, the constraints are not affine
  • No problem: ξ_n ≥ ℓ_n is the same as ξ_n ≥ 0 and ξ_n ≥ 1 − ν_n, as the check below illustrates
  • Two affine constraints instead of one nonlinear one
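A quick numeric check of this equivalence (illustrative, not from the slides): for each ν, the smallest ξ that satisfies the two affine constraints equals the hinge loss max{0, 1 − ν}.

```python
import numpy as np

for nu in np.linspace(-2.0, 3.0, 11):
    xis = np.linspace(0.0, 5.0, 5001)                  # candidate slack values
    feasible = xis[(xis >= 0.0) & (xis >= 1.0 - nu)]   # the two affine constraints
    assert np.isclose(feasible.min(), max(0.0, 1.0 - nu))
print("min over feasible xi matches the hinge loss")
```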


SLIDE 13

Training an SVM is a Quadratic Program

Quadratic Program Formulation

  • We achieve differentiability at the cost of adding N slack variables ξ_n:
  • Old:
      min_(w,b) (1/2)‖w‖² + (C/N) Σ_{n=1}^N ℓ_(w,b)(x_n, y_n)
    where ℓ_(w,b)(x_n, y_n) := max{0, 1 − y_n(w^T x_n + b)}
  • New:
      min_(w,b,ξ) f(w, ξ) where f(w, ξ) = (1/2)‖w‖² + γ Σ_{n=1}^N ξ_n
    subject to the constraints
      y_n(w^T x_n + b) − 1 + ξ_n ≥ 0
      ξ_n ≥ 0
    and with γ := C/N
  • We have our quadratic program!
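As an illustrative sketch, the program can be handed to an off-the-shelf solver almost verbatim; cvxpy is one possible choice of QP solver, and the toy data below are made up:

```python
import cvxpy as cp
import numpy as np

# A tiny linearly separable toy set (illustrative).
X = np.array([[2.0, 1.0], [1.0, 3.0], [0.0, 0.0], [-1.0, 0.5]])
Y = np.array([1.0, 1.0, -1.0, -1.0])
N, d = X.shape
gamma = 1.0 / N                                      # gamma = C / N with C = 1

w, b, xi = cp.Variable(d), cp.Variable(), cp.Variable(N)
objective = cp.Minimize(0.5 * cp.sum_squares(w) + gamma * cp.sum(xi))
constraints = [cp.multiply(Y, X @ w + b) - 1 + xi >= 0, xi >= 0]
cp.Problem(objective, constraints).solve()
print(w.value, b.value, xi.value)
```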

SLIDE 14

The KKT Conditions and the Support Vectors

The KKT Conditions

SVM Quadratic Program:
    min_(w,b,ξ) f(w, ξ) where f(w, ξ) = (1/2)‖w‖² + γ Σ_{n=1}^N ξ_n
subject to the constraints
    y_n(w^T x_n + b) − 1 + ξ_n ≥ 0
    ξ_n ≥ 0

KKT Conditions (u = (w, b, ξ)):
    ∇f(u∗) = Σ_{i∈A(u∗)} α∗_i ∇c_i(u∗) with α∗_i ≥ 0

SLIDE 15

The KKT Conditions and the Support Vectors

Differentiating Target and Constraints

  • f = (1/2)‖w‖² + γ Σ_{n=1}^N ξ_n
  • Two types of constraints:
      c_j = y_j(w^T x_j + b) − 1 + ξ_j ≥ 0
      d_k = ξ_k ≥ 0
  • Unknowns: w, b, ξ_n
  • Derivatives:
      ∂f/∂w = w        ∂f/∂b = 0       ∂f/∂ξ_n = γ
      ∂c_j/∂w = y_j x_j   ∂c_j/∂b = y_j   ∂c_j/∂ξ_j = 1
      ∂d_k/∂w = 0       ∂d_k/∂b = 0     ∂d_k/∂ξ_k = 1
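These derivatives are easy to sanity-check numerically; a small finite-difference sketch with made-up values (γ, w, b, and ξ below are arbitrary):

```python
import numpy as np

gamma = 0.25
f = lambda w, b, xi: 0.5 * (w @ w) + gamma * xi.sum()

w, b, xi = np.array([1.0, -2.0]), 0.7, np.array([0.3, 0.0, 1.2])
eps = 1e-6
print((f(w + np.array([eps, 0.0]), b, xi) - f(w, b, xi)) / eps)       # ~ w[0]
print((f(w, b + eps, xi) - f(w, b, xi)) / eps)                        # ~ 0
print((f(w, b, xi + np.array([eps, 0.0, 0.0])) - f(w, b, xi)) / eps)  # ~ gamma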

SLIDE 16

The KKT Conditions and the Support Vectors

KKT Conditions

w∗ = Σ_{n∈A(u∗)} α∗_n y_n x_n

0 = Σ_{n∈A(u∗)} α∗_n y_n

γ = α∗_n + β∗_n for n = 1, . . . , N

0 ≤ α∗_j, β∗_k

A(u∗) is the set of indices where the constraints c_j ≥ 0 are active

SLIDE 17

The KKT Conditions and the Support Vectors

The Support Vectors

  • The representer theorem:
      w∗ = Σ_{n∈A(u∗)} α∗_n y_n x_n
  • The separating-hyperplane parameter w is a linear combination of the active training data points x_n
  • Misclassified and low-margin points are active (α_n > 0)
  • In the separable case, data points on the margin boundaries are active
  • Either way, these data points are called the support vectors
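This is easy to see empirically; an illustrative check with scikit-learn (the library and the synthetic data are assumptions, not part of the slides). For a linear kernel, SVC exposes the support vectors, and dual_coef_ stores the products α_n y_n, so the representer sum should reproduce the learned w:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
Y = np.sign(X[:, 0] - X[:, 1] + 0.2 * rng.normal(size=100))

clf = SVC(kernel="linear", C=1.0).fit(X, Y)
# dual_coef_ holds alpha_n * y_n for the support vectors only.
w_from_svs = clf.dual_coef_ @ clf.support_vectors_   # sum_n alpha_n y_n x_n
print(np.allclose(w_from_svs, clf.coef_))            # True: representer theorem
```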
