

slide-1
SLIDE 1

Machine Learning

A Geometric Approach

Professor Liang Huang

Linear Classification: Support Vector Machines (SVM)

some slides from Alex Smola (CMU)

CIML book Chap 7.7

slide-2
SLIDE 2

Linear Separator

Spam Ham

slide-3
SLIDE 3

From Perceptron to SVM

Perceptron line (online):

  • 1959 Rosenblatt: invention of the perceptron
  • 1962 Novikoff: convergence proof
  • 1999 Freund/Schapire: voted/averaged perceptron (revived)
  • 2002 Collins: structured perceptron
  • 2003 Crammer/Singer: MIRA (online approx. of max margin; conservative updates)
  • 2005* McDonald/Crammer/Pereira: structured MIRA
  • 2006 Singer group: aggressive MIRA

SVM line (batch):

  • 1964 Vapnik & Chervonenkis (then the fall of the USSR)
  • 1997 Cortes/Vapnik: SVM (+ max margin + kernels + soft margin; inseparable case)
  • 2007-2010* Singer group: Pegasos (subgradient descent, minibatch)

(much of this work from AT&T Research, ex-AT&T researchers, and their students)

*mentioned in lectures but optional (the other papers are all covered in detail)

slide-4
SLIDE 4

Large Margin Classifier

f(x) = ⟨w, x⟩ + b   (a linear function)
⟨w, x⟩ + b ≥ 1 on one side of the margin, ⟨w, x⟩ + b ≤ −1 on the other

slide-5
SLIDE 5

Large Margin Classifier

f(x) = ⟨w, x⟩ + b   (a linear function)
⟨w, x⟩ + b ≥ 1 on one side of the margin, ⟨w, x⟩ + b ≤ −1 on the other

slide-6
SLIDE 6

Why large margins?

  • Maximum robustness relative to uncertainty
  • Symmetry breaking
  • Independent of correctly classified instances
  • Easy to find for easy problems

(figure labels: +, r, ρ)

slide-7
SLIDE 7

Feature Map Φ

  • SVM is often used with kernels
slide-8
SLIDE 8

Large Margin Classifier

margin hyperplanes: ⟨w, x⟩ + b = 1 and ⟨w, x⟩ + b = −1
decision regions: ⟨w, x⟩ + b ≥ 1 and ⟨w, x⟩ + b ≤ −1

functional margin: y_i(w · x_i)
geometric margin: y_i(w · x_i)/‖w‖ = 1/‖w‖   (when the functional margin is fixed to 1)

slide-9
SLIDE 9

Large Margin Classifier

margin hyperplanes: ⟨w, x⟩ + b = 1 and ⟨w, x⟩ + b = −1

  • max. geometric margin, s.t. functional margin is at least 1

SVM objective (max version):

    max_w 1/‖w‖   s.t.  ∀(x, y) ∈ D:  y(w · x) ≥ 1

Q1: what if we want a functional margin of 2? Q2: what if we want a geometric margin of 1?
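A quick numeric answer to Q1/Q2 (a minimal sketch on made-up toy data; the particular w is only illustrative): scaling w rescales the functional margin but leaves the geometric margin unchanged, which is why we can fix the functional margin to 1 and then optimize ‖w‖.

    import numpy as np

    # toy separable data (illustrative); labels y in {+1, -1}; bias folded in / omitted
    X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])
    y = np.array([1.0, 1.0, -1.0, -1.0])
    w = np.array([1.0, 1.0])                        # some separating weight vector

    def functional_margin(w):
        return np.min(y * (X @ w))                  # min_i y_i (w . x_i)

    def geometric_margin(w):
        return functional_margin(w) / np.linalg.norm(w)

    print(functional_margin(w), geometric_margin(w))          # 3.0 and ~2.12
    print(functional_margin(2 * w), geometric_margin(2 * w))  # 6.0 (doubled) and ~2.12 (unchanged)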

slide-10
SLIDE 10

Large Margin Classifier

margin hyperplanes: ⟨w, x⟩ + b = 1 and ⟨w, x⟩ + b = −1

  • min. weight vector, s.t. functional margin is at least 1

SVM objective (min version):

    min_w ‖w‖   s.t.  ∀(x, y) ∈ D:  y(w · x) ≥ 1

interpretation: small models generalize better

slide-11
SLIDE 11

Large Margin Classifier

margin hyperplanes: ⟨w, x⟩ + b = 1 and ⟨w, x⟩ + b = −1

  • min. weight vector, s.t. functional margin is at least 1

SVM objective (min version):

    min_w ½‖w‖²   s.t.  ∀(x, y) ∈ D:  y(w · x) ≥ 1

‖w‖ is not differentiable (at 0), but ‖w‖² is.

slide-12
SLIDE 12

SVM vs. MIRA

  • SVM: min weight vector to enforce a functional margin of at least 1 on ALL EXAMPLES
  • MIRA: min weight change to enforce a functional margin of at least 1 on THIS EXAMPLE
  • MIRA is a 1-step or online approximation of SVM
  • Aggressive MIRA → SVM as p → 1

SVM:    min_w ½‖w‖²   s.t.  ∀(x, y) ∈ D:  y(w · x) ≥ 1
MIRA:   min_{w′} ‖w′ − w‖²   s.t.  w′ · x ≥ 1

(figure labels: x_i, w, MIRA, perceptron)

slide-13
SLIDE 13

Convex Hull Interpretation

  • max. distance between convex hulls

why don't we use convex hulls for SVMs in practice?

how many support vectors in 2D?

the weight vector is determined by the support vectors alone
c.f. the perceptron, where  w = Σ_{(x,y) ∈ errors} y · x    (what about MIRA?)

slide-14
SLIDE 14

Convexity and Convex Hulls

convex combination

slide-15
SLIDE 15

Optimization

  • Primal optimization problem
  • Convex optimization: convex function over convex set!
  • Quadratic prog.: quadratic function w/ linear constraints

    min_w ½‖w‖²   s.t.  ∀(x, y) ∈ D:  y(w · x) ≥ 1

(quadratic objective, linear constraints)

slide-16
SLIDE 16

MIRA as QP

  • MIRA is a trivial QP; can be solved geometrically
  • what about multiple constraints (e.g. minibatch)?

    MIRA:   min_{w′} ‖w′ − w‖²   s.t.  w′ · x ≥ 1

(figure: w_i and x_i; the distances involved are 1/‖x_i‖, (w_i · x_i)/‖x_i‖, and (1 − w_i · x_i)/‖x_i‖ to the constraint plane w · x_i = 1)
slide-17
SLIDE 17

Optimization

  • Primal optimization problem
  • Convex optimization: convex function over convex set!
  • Lagrange function

Derivatives in w need to vanish

    min_w ½‖w‖²   s.t.  ∀(x, y) ∈ D:  y(w · x) ≥ 1

    L(w, b, α) = ½‖w‖² − Σ_i α_i [y_i(⟨x_i, w⟩ + b) − 1]

    ∂_w L(w, b, α) = w − Σ_i α_i y_i x_i = 0
    ∂_b L(w, b, α) = Σ_i α_i y_i = 0
    ⟹  w = Σ_i y_i α_i x_i

the model is a linear combination of a small subset of the input (the support vectors), i.e., those with α_i > 0

slide-18
SLIDE 18

Lagrangian & Saddle Point

  • equality: min x² s.t. x = 1
  • inequality: min x² s.t. x ≥ 1
  • Lagrangian: L(x, α) = x² − α(x − 1)
  • the derivative in x needs to vanish
  • optimality is at a saddle point with respect to α
  • min_x in the primal ⟹ max_α in the dual

(figure axes: x, α)

slide-19
SLIDE 19

Constrained Optimization

  • Quadratic Programming
  • Quadratic Objective
  • Linear Constraints

KKT condition (complementary slackness)

  • the optimal point is achieved at active constraints, i.e., where α_i > 0   (α_i = 0 ⟹ inactive)

    minimize_{w,b} ½‖w‖²   subject to  y_i(⟨x_i, w⟩ + b) ≥ 1

    α_i [y_i(⟨w, x_i⟩ + b) − 1] = 0

    w = Σ_i y_i α_i x_i

Karush–Kuhn–Tucker

slide-20
SLIDE 20

w

KKT => Support Vectors

    minimize_{w,b} ½‖w‖²   subject to  y_i(⟨x_i, w⟩ + b) ≥ 1

    w = Σ_i y_i α_i x_i

Karush Kuhn Tucker (KKT) Optimality Condition:

    α_i [y_i(⟨w, x_i⟩ + b) − 1] = 0

    α_i = 0  ⟹  y_i(⟨w, x_i⟩ + b) ≥ 1
    α_i > 0  ⟹  y_i(⟨w, x_i⟩ + b) = 1

slide-21
SLIDE 21

    w = Σ_i y_i α_i x_i

Properties

  • Weight vector w as weighted linear combination of instances
  • Only points on margin matter (ignore the rest and get same solution)
  • Only inner products matter
  • Quadratic program
  • We can replace the inner product by a kernel
  • Keeps instances away from the margin
slide-22
SLIDE 22

Alternative: Primal=>Dual

  • Lagrange function
  • Derivatives in w need to vanish
  • Plugging w back into L yields

    L(w, b, α) = ½‖w‖² − Σ_i α_i [y_i(⟨x_i, w⟩ + b) − 1]

    ∂_w L(w, b, α) = w − Σ_i α_i y_i x_i = 0
    ∂_b L(w, b, α) = Σ_i α_i y_i = 0
    ⟹  w = Σ_i y_i α_i x_i

    maximize_α  −½ Σ_{i,j} α_i α_j y_i y_j ⟨x_i, x_j⟩ + Σ_i α_i
    subject to  Σ_i α_i y_i = 0  and  α_i ≥ 0      (α: dual variables)
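The substitution step, written out (a sketch consistent with the formulas above):

    \begin{aligned}
    L(w,b,\alpha) &= \tfrac{1}{2}\|w\|^2 - \sum_i \alpha_i\,[y_i(\langle x_i, w\rangle + b) - 1] \\
      &= \tfrac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_i y_j \langle x_i, x_j\rangle
         - \sum_{i,j}\alpha_i\alpha_j y_i y_j \langle x_i, x_j\rangle
         - b\sum_i \alpha_i y_i + \sum_i \alpha_i \\
      &= -\tfrac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_i y_j \langle x_i, x_j\rangle + \sum_i \alpha_i
    \end{aligned}

(using w = Σ_i α_i y_i x_i for the first two terms, and Σ_i α_i y_i = 0 to remove the b term)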

slide-23
SLIDE 23

w

Primal vs. Dual

Primal:

    minimize_{w,b} ½‖w‖²   subject to  y_i(⟨x_i, w⟩ + b) ≥ 1

Dual:

    maximize_α  −½ Σ_{i,j} α_i α_j y_i y_j ⟨x_i, x_j⟩ + Σ_i α_i
    subject to  Σ_i α_i y_i = 0  and  α_i ≥ 0      (α: dual variables)

    w = Σ_i y_i α_i x_i

slide-24
SLIDE 24

Solving the optimization problem

  • Dual problem
  • If the problem is small enough (1000s of variables), we can use an off-the-shelf solver (CVXOPT, CPLEX, OOQP, LOQO)
  • For larger problems, use the fact that only the SVs matter and solve in blocks (active set method)

    maximize_α  −½ Σ_{i,j} α_i α_j y_i y_j ⟨x_i, x_j⟩ + Σ_i α_i
    subject to  Σ_i α_i y_i = 0  and  α_i ≥ 0      (α: dual variables)
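One way to use an off-the-shelf solver here is CVXOPT (named above); a sketch with my own helper name, using the soft-margin box 0 ≤ α_i ≤ C (take C very large to approximate the hard-margin dual shown above). CVXOPT minimizes ½αᵀPα + qᵀα, so the dual is negated:

    import numpy as np
    from cvxopt import matrix, solvers

    def svm_dual_qp(X, y, C=1.0):
        n = len(y)
        y = y.astype(float)
        K = X @ X.T                                      # Gram matrix of inner products <x_i, x_j>
        P = matrix(np.outer(y, y) * K)                   # P_ij = y_i y_j <x_i, x_j>
        q = matrix(-np.ones(n))                          # negated  sum_i alpha_i
        G = matrix(np.vstack([-np.eye(n), np.eye(n)]))   # -alpha_i <= 0  and  alpha_i <= C
        h = matrix(np.hstack([np.zeros(n), C * np.ones(n)]))
        A = matrix(y.reshape(1, -1))                     # sum_i alpha_i y_i = 0
        b = matrix(0.0)
        alpha = np.ravel(solvers.qp(P, q, G, h, A, b)['x'])
        w = (alpha * y) @ X                              # w = sum_i y_i alpha_i x_i
        on_margin = (alpha > 1e-6) & (alpha < C - 1e-6)
        bias = float(np.mean(y[on_margin] - X[on_margin] @ w)) if on_margin.any() else 0.0
        return alpha, w, bias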

slide-25
SLIDE 25

Quadratic Program in Dual

  • Dual problem

    maximize_α  −½ αᵀQα − αᵀb   subject to  α ≥ 0

  • Quadratic Programming
  • Objective: quadratic function
  • Q is positive semidefinite
  • Constraints: linear functions
  • Methods
  • Gradient Descent
  • Coordinate Descent (a.k.a. the Hildreth algorithm)
  • Sequential Minimal Optimization (SMO)

Q: what is the Q in the SVM primal? how about the Q in the SVM dual?

slide-26
SLIDE 26

Convex QP

  • if Q is positive (semi)definite, i.e., xᵀQx ≥ 0 for all x, then the QP is convex ⟹ a local min/max is a global min/max
  • if Q = 0, it reduces to linear programming
  • if Q is indefinite ⟹ saddle point
  • general QP is NP-hard; convex QP is polynomial-time

(diagram: LP ⊂ QP ⊂ CP, with SVM inside QP)
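A quick numeric convexity check (a sketch; the helper name is mine): a QP objective ½xᵀQx + ... is convex exactly when Q is positive semidefinite, i.e., all eigenvalues of its symmetric part are nonnegative.

    import numpy as np

    def is_convex_qp(Q, tol=1e-10):
        eigs = np.linalg.eigvalsh((Q + Q.T) / 2)     # eigenvalues of the symmetric part
        return bool(np.all(eigs >= -tol))

    print(is_convex_qp(np.array([[4.0, 1.0], [1.0, 2.0]])))    # True: the Q used in the example below
    print(is_convex_qp(np.array([[1.0, 0.0], [0.0, -1.0]])))   # False: indefinite, gives a saddle point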

slide-27
SLIDE 27

QP: Hildreth Algorithm

  • idea 1: update one coordinate while fixing all other coordinates
  • e.g., updating coordinate i means solving:

    argmax_{α_i}  −½ αᵀQα − αᵀb   subject to  α ≥ 0

a quadratic function of a single variable; at the maximum, the first-order derivative is 0

slide-28
SLIDE 28

QP: Hildreth Algorithm

  • idea 2: choose another coordinate and repeat until a stopping criterion is met:
  • the maximum is reached, or
  • the increase between 2 consecutive iterations is very small, or
  • after some # of iterations
  • how to choose the coordinate: sweep pattern
  • Sequential:
  • 1, 2, ..., n, 1, 2, ..., n, ...
  • 1, 2, ..., n, n-1, n-2, ..., 1, 2, ...
  • Random: permutation of 1, 2, ..., n
  • Maximal descent: choose i with maximal descent in the objective
slide-29
SLIDE 29

QP: Hildreth Algorithm

    initialize α_i ← 0 for all i
    repeat:
        pick i following the sweep pattern
        solve  α_i ← argmax_{α_i}  −½ αᵀQα − αᵀb   subject to  α ≥ 0
    until a stopping criterion is met
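A minimal Python sketch of the loop above (the function name and stopping tolerance are my own choices): sweep the coordinates, and for each one take the 1-D unconstrained maximizer and clip it at 0.

    import numpy as np

    def hildreth(Q, b, sweeps=100, tol=1e-8):
        """Coordinate ascent for  max_a  -1/2 a^T Q a - a^T b   s.t.  a >= 0."""
        n = len(b)
        alpha = np.zeros(n)
        def obj(a):
            return -0.5 * a @ Q @ a - a @ b
        prev = obj(alpha)
        for _ in range(sweeps):
            for i in range(n):                       # sequential sweep pattern 1, 2, ..., n
                rest = Q[i] @ alpha - Q[i, i] * alpha[i]
                # set the 1-D derivative -(Q alpha)_i - b_i to 0, then clip at alpha_i >= 0
                alpha[i] = max(0.0, -(b[i] + rest) / Q[i, i])
            cur = obj(alpha)
            if cur - prev < tol:                     # negligible improvement: stop
                break
            prev = cur
        return alpha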

slide-30
SLIDE 30

QP: Hildreth Algorithm

  • choose coordinates 1, 2, 1, 2, ...

    maximize_α  −½ αᵀQα − αᵀb   with  Q = [[4, 1], [1, 2]],  b = [−6, −4],   subject to  α ≥ 0
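Running the sketch from the previous slide on this example (the 1, 2, 1, 2, ... choice is exactly the sequential sweep):

    Q = np.array([[4.0, 1.0], [1.0, 2.0]])
    b = np.array([-6.0, -4.0])
    print(hildreth(Q, b))     # approaches [8/7, 10/7]; here the unconstrained optimum is already >= 0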

slide-31
SLIDE 31

QP: Hildreth Algorithm

  • pros:
  • extremely simple
  • no gradient calculation
  • easy to implement
  • cons:
  • converges slowly compared to other methods
  • can’t deal with too many constraints
  • works for minibatch MIRA but not SVM
slide-32
SLIDE 32

Linear Separator

Spam Ham

slide-33
SLIDE 33

Large Margin Classifier

f(x) = ⟨w, x⟩ + b   (a linear function)
⟨w, x⟩ + b ≥ 1,  ⟨w, x⟩ + b ≤ −1

a linear separator is impossible

slide-34
SLIDE 34

Large Margin Classifier

⟨w, x⟩ + b ≥ 1,  ⟨w, x⟩ + b ≤ −1

Theorem (Minsky & Papert): finding the minimum error separating hyperplane is NP-hard

minimum error separator is impossible

slide-35
SLIDE 35

Adding slack variables

⟨w, x⟩ + b ≤ −1 + ξ   and   ⟨w, x⟩ + b ≥ 1 − ξ

Convex optimization problem

minimize the amount of slack

slide-36
SLIDE 36

margin violation vs. misclassification

misclassification is also margin violation (ξ>0)

slide-37
SLIDE 37

Adding slack variables

  • Hard margin problem
  • With slack variables

    hard margin:    minimize_{w,b} ½‖w‖²   subject to  y_i(⟨w, x_i⟩ + b) ≥ 1

    with slack:     minimize_{w,b} ½‖w‖² + C Σ_i ξ_i   subject to  y_i(⟨w, x_i⟩ + b) ≥ 1 − ξ_i  and  ξ_i ≥ 0

The problem is always feasible. Proof: w = 0, b = 0, ξ_i = 1 (this also yields an upper bound).

C = 0? C = +∞?   C = +∞ ⟹ no tolerance for violations ⟹ hard margin

w determines ξ

slide-38
SLIDE 38

Review: Convex Optimization

  • Primal optimization problem
  • Lagrange function
  • First order optimality conditions in x
  • Solve for x and plug it back into L

(keep explicit constraints)

    minimize_x f(x)   subject to  c_i(x) ≤ 0

    L(x, α) = f(x) + Σ_i α_i c_i(x)

    ∂_x L(x, α) = ∂_x f(x) + Σ_i α_i ∂_x c_i(x) = 0

    maximize_α  L(x(α), α)

slide-39
SLIDE 39
  • Primal optimization problem
  • Lagrange function
  • optimality in (w, ξ) is at a saddle point with α, η
  • Derivatives in (w, ξ) need to vanish

Dual Problem

    minimize_{w,b} ½‖w‖² + C Σ_i ξ_i   subject to  y_i(⟨w, x_i⟩ + b) ≥ 1 − ξ_i  and  ξ_i ≥ 0

    L(w, b, ξ, α, η) = ½‖w‖² + C Σ_i ξ_i − Σ_i α_i [y_i(⟨x_i, w⟩ + b) + ξ_i − 1] − Σ_i η_i ξ_i

slide-40
SLIDE 40

Dual Problem

  • Lagrange function
  • Derivatives in w need to vanish
  • Plugging terms back into L yields

    L(w, b, ξ, α, η) = ½‖w‖² + C Σ_i ξ_i − Σ_i α_i [y_i(⟨x_i, w⟩ + b) + ξ_i − 1] − Σ_i η_i ξ_i

    ∂_w L(w, b, ξ, α, η) = w − Σ_i α_i y_i x_i = 0
    ∂_b L(w, b, ξ, α, η) = Σ_i α_i y_i = 0
    ∂_{ξ_i} L(w, b, ξ, α, η) = C − α_i − η_i = 0

    maximize_α  −½ Σ_{i,j} α_i α_j y_i y_j ⟨x_i, x_j⟩ + Σ_i α_i
    subject to  Σ_i α_i y_i = 0  and  α_i ∈ [0, C]      (α: dual variables; the box constraint bounds each example's influence)

slide-41
SLIDE 41

w

Karush Kuhn Tucker Conditions

    w = Σ_i y_i α_i x_i

    L(w, b, ξ, α, η) = ½‖w‖² + C Σ_i ξ_i − Σ_i α_i [y_i(⟨x_i, w⟩ + b) + ξ_i − 1] − Σ_i η_i ξ_i

    α_i [y_i(⟨w, x_i⟩ + b) + ξ_i − 1] = 0,    η_i ξ_i = 0

    0 ≤ α_i = C − η_i ≤ C

    α_i = 0       ⟹  y_i(⟨w, x_i⟩ + b) ≥ 1
    0 < α_i < C   ⟹  y_i(⟨w, x_i⟩ + b) = 1
    α_i = C       ⟹  y_i(⟨w, x_i⟩ + b) ≤ 1

why are these three cases not disjoint?

(figure labels: α = 0, 0 < α < C, α = C)
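A small sketch for checking these three cases on a trained model (names are mine; margin_i stands for y_i(⟨w, x_i⟩ + b) and alpha_i for the dual variable, e.g. taken from the sklearn examples on the later slides):

    def kkt_case(alpha_i, margin_i, C, tol=1e-6):
        """Classify a training point by its KKT case and check the implied margin condition."""
        if alpha_i < tol:
            return "alpha=0", margin_i >= 1 - tol        # expect y_i f(x_i) >= 1
        if alpha_i > C - tol:
            return "alpha=C", margin_i <= 1 + tol        # expect y_i f(x_i) <= 1
        return "0<alpha<C", abs(margin_i - 1) <= 1e-3    # expect y_i f(x_i) = 1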

slide-42
SLIDE 42

Support Vectors and Violations

all circled and squared examples are support vectors (α > 0); they include ξ = 0 (hard-margin SVs), ξ > 0 (margin violations), and ξ > 1 (misclassifications)

slide-43
SLIDE 43

Support Vectors and Violations

all circled and squared examples are support vectors (α > 0); they include ξ = 0 (hard-margin SVs), ξ > 0 (margin violations), and ξ > 1 (misclassifications)

(0<α≤C, ξ≥0) support vectors

(α=C, ξ>0)

margin violations

non-support vectors (α=0, ξ=0)

(α=C, ξ>1)

misclassifications

α=C

slide-44
SLIDE 44

SVM with sklearn

python demo.py 1e10 python demo.py 1e10

slide-45
SLIDE 45

SVM with sklearn

python demo.py 0.01 python demo.py 0.1

slide-46
SLIDE 46

SVM with sklearn


In [2]: clf = svm.SVC(kernel='linear', C=1e10)
In [3]: X = [[1,1], [1,-1], [-1,1], [-1,-1]]
In [4]: Y = [1,1,-1,-1]
In [5]: clf.fit(X, Y)
In [6]: clf.support_
Out[6]: array([3, 1], dtype=int32)
In [7]: clf.dual_coef_
Out[7]: array([[-0.5, 0.5]])
In [8]: clf.coef_
Out[8]: array([[ 1., 0.]])
In [9]: clf.intercept_
Out[9]: array([-0.])
In [10]: clf.support_vectors_
Out[10]: array([[-1., -1.],
                [ 1., -1.]])
In [11]: clf.n_support_
Out[11]: array([1, 1], dtype=int32)

In [2]: X = [[1,1], [1,-1], [-1,1], [-1,-1], [-0.1,-0.1]]
In [3]: Y = [1,1,-1,-1, 1]
In [4]: clf = svm.SVC(kernel='linear', C=1)
In [5]: clf.fit(X, Y)
In [6]: clf.support_vectors_
Out[6]: array([[-1. ,  1. ],
               [-1. , -1. ],
               [ 1. , -1. ],
               [-0.1, -0.1]])
In [7]: clf.dual_coef_
Out[7]: array([[-0.45, -0.6 ,  0.05,  1.  ]])
In [8]: clf.coef_
Out[8]: array([[ 1.00000000e+00, 1.49011611e-09]])
In [9]: clf.intercept_
Out[9]: array([-0.])

α=C

α=C

slide-47
SLIDE 47

SVM with sklearn


In [2]: clf = svm.SVC(kernel='linear', C=1e10)
In [3]: X = [[1,1], [1,-1], [-1,1], [-1,-1]]
In [4]: Y = [1,1,-1,-1]
In [5]: clf.fit(X, Y)
In [6]: clf.support_
Out[6]: array([3, 1], dtype=int32)
In [7]: clf.dual_coef_
Out[7]: array([[-0.5, 0.5]])
In [8]: clf.coef_
Out[8]: array([[ 1., 0.]])
In [9]: clf.intercept_
Out[9]: array([-0.])
In [10]: clf.support_vectors_
Out[10]: array([[-1., -1.],
                [ 1., -1.]])
In [11]: clf.n_support_
Out[11]: array([1, 1], dtype=int32)

In [2]: X = [[1,1], [1,-1], [-1,1], [-1,-1], [-0.1,-0.1]]
In [3]: Y = [1,1,-1,-1, 1]
In [12]: clf = svm.SVC(kernel='linear', C=1e10)
In [13]: clf.fit(X, Y)
In [14]: clf.coef_
Out[14]: array([[ 2.02010102e+00, 1.00999543e-04]])
In [15]: clf.intercept_
Out[15]: array([ 1.02013469])
In [16]: clf.support_vectors_
Out[16]: array([[-1.  ,  1.  ],
                [-1.  , -1.  ],
                [-0.01, -0.01]])
In [17]: clf.dual_coef_
Out[17]: array([[-1.01000001, -1.03050607,  2.04050608]])

slide-48
SLIDE 48

SVM with sklearn


In [2]: clf = svm.SVC(kernel='linear', C=1e10)
In [3]: X = [[1,1], [1,-1], [-1,1], [-1,-1]]
In [4]: Y = [1,1,-1,-1]
In [5]: clf.fit(X, Y)
In [6]: clf.support_
Out[6]: array([3, 1], dtype=int32)
In [7]: clf.dual_coef_
Out[7]: array([[-0.5, 0.5]])
In [8]: clf.coef_
Out[8]: array([[ 1., 0.]])
In [9]: clf.intercept_
Out[9]: array([-0.])
In [10]: clf.support_vectors_
Out[10]: array([[-1., -1.],
                [ 1., -1.]])
In [11]: clf.n_support_
Out[11]: array([1, 1], dtype=int32)

In [2]: X = [[1,1], [1,-1], [-1,1], [-1,-1], [-2,0]]
In [3]: Y = [1,1,-1,-1, 1]
In [4]: clf = svm.SVC(kernel='linear', C=1)
In [5]: clf.fit(X, Y)
In [6]: clf.coef_
Out[6]: array([[ 1., 0.]])
In [7]: clf.support_vectors_
Out[7]: array([[-1.,  1.],
               [-1., -1.],
               [ 1.,  1.],
               [ 1., -1.],
               [-2.,  0.]])
In [8]: clf.dual_coef_
Out[8]: array([[-1. , -1. ,  0.5,  0.5,  1. ]])
In [9]: clf.intercept_
Out[9]: array([-0.])

α=C α=C α=C

    α_i = 0       ⟹  y_i(⟨w, x_i⟩ + b) ≥ 1
    0 < α_i < C   ⟹  y_i(⟨w, x_i⟩ + b) = 1
    α_i = C       ⟹  y_i(⟨w, x_i⟩ + b) ≤ 1

what if C=1e10?

slide-49
SLIDE 49

hard vs. soft margins

slide-50
SLIDE 50

hard vs. soft margins

slide-51
SLIDE 51

C=1


slide-52
SLIDE 52

C=20


slide-53
SLIDE 53

Optimization

From Constrained Optimization to Unconstrained Optimization (back to Primal)

Learning an SVM has been formulated as a constrained optimization problem over w and ξ:

    min_{w ∈ R^d, ξ_i ∈ R^+}  ‖w‖² + C Σ_i^N ξ_i   subject to  y_i(wᵀx_i + b) ≥ 1 − ξ_i  for i = 1 ... N

The constraint y_i(wᵀx_i + b) ≥ 1 − ξ_i can be written more concisely as

    y_i f(x_i) ≥ 1 − ξ_i

which, together with ξ_i ≥ 0, is equivalent to

    ξ_i = max(0, 1 − y_i f(x_i))

Hence the learning problem is equivalent to the unconstrained optimization problem over w:

    min_{w ∈ R^d}  ‖w‖² + C Σ_i^N max(0, 1 − y_i f(x_i))
    (regularization)      (loss function)

slides 49-56 from Andrew Zisserman (Oxford/DeepMind); with annotations

w determines ξ
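A small sketch of the equivalence (and of "w determines ξ"): for any fixed (w, b) the tightest feasible slacks are exactly the hinge losses, so the constrained and unconstrained objectives coincide.

    import numpy as np

    def svm_objective(w, b, X, y, C):
        scores = X @ w + b                           # f(x_i) = w^T x_i + b
        xi = np.maximum(0.0, 1.0 - y * scores)       # optimal slacks: w (and b) determine xi
        return w @ w + C * xi.sum(), xi              # ||w||^2 + C * sum_i xi_i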

slide-54
SLIDE 54

Loss function

(figure labels: w, Support Vector, wᵀx + b = 0)

    min_{w ∈ R^d}  ‖w‖² + C Σ_i^N max(0, 1 − y_i f(x_i))

Points are in three categories:

  • 1. yif(xi) > 1

Point is outside margin. No contribution to loss

  • 2. yif(xi) = 1

Point is on margin. No contribution to loss. As in hard margin case.

  • 3. yif(xi) < 1

Point violates margin constraint. Contributes to loss

loss function


(margin violation ξ>0, including misclassification ξ>1)

slide-55
SLIDE 55

Loss functions

  • SVM uses “hinge” loss
  • an approximation to the 0-1 loss

    hinge loss: max(0, 1 − y_i f(x_i)),   plotted against y_i f(x_i)

(plot annotations: y_i f(x_i) ≥ 1 is good; y_i f(x_i) < 0 is very bad (misclassification); 0 ≤ y_i f(x_i) < 1 is not good enough)

(perceptron uses a shifted hinge-loss touching the origin)

slide-56
SLIDE 56

SVM

    min_{w ∈ R^d}  C Σ_i^N max(0, 1 − y_i f(x_i)) + ‖w‖²
                   (convex)                        (convex)

convex + convex = convex!

slide-57
SLIDE 57

Gradient (or steepest) descent algorithm for SVM

First, rewrite the optimization problem as an average:

    min_w C(w) = (λ/2)‖w‖² + (1/N) Σ_i^N max(0, 1 − y_i f(x_i))
               = (1/N) Σ_i^N [ (λ/2)‖w‖² + max(0, 1 − y_i f(x_i)) ]

(with λ = 2/(NC), up to an overall scale of the problem) and f(x) = wᵀx + b

Because the hinge loss is not differentiable, a sub-gradient is computed.

To minimize a cost function C(w), use the iterative update

    w_{t+1} ← w_t − η_t ∇_w C(w_t)

where η is the learning rate.

slide-58
SLIDE 58

Sub-gradient for hinge loss

    L(x_i, y_i; w) = max(0, 1 − y_i f(x_i)),    f(x_i) = wᵀx_i + b

sub-gradients (as a function of y_i f(x_i)):

    ∂L/∂w = −y_i x_i   if  y_i f(x_i) < 1
    ∂L/∂w = 0          otherwise

slide-59
SLIDE 59

Sub-gradient descent algorithm for SVM

    C(w) = (1/N) Σ_i^N [ (λ/2)‖w‖² + L(x_i, y_i; w) ]

The iterative update is

    w_{t+1} ← w_t − η ∇_w C(w_t)
            ← w_t − η (1/N) Σ_i^N (λ w_t + ∇_w L(x_i, y_i; w_t))      (batch gradient)

where η is the learning rate. Then each iteration t involves cycling through the training data with the updates (online gradient; just like the perceptron, whose condition is y_i f(x_i) ≤ 0):

    w_{t+1} ← w_t − η (λ w_t − y_i x_i)   if  y_i f(x_i) < 1
            ← w_t − η λ w_t               otherwise

In the Pegasos algorithm the learning rate is set at η_t = 1/(λt)
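A minimal sketch of these updates (bias omitted; names are mine). With the Pegasos rate η_t = 1/(λt) and one randomly chosen example per step, this is the stochastic/online version:

    import numpy as np

    def pegasos(X, y, lam=0.1, epochs=10, seed=0):
        """SGD on  (lam/2)||w||^2 + (1/N) sum_i max(0, 1 - y_i w.x_i)."""
        rng = np.random.default_rng(seed)
        n, d = X.shape
        w = np.zeros(d)
        t = 0
        for _ in range(epochs):
            for i in rng.permutation(n):             # randomly sample from the training data
                t += 1
                eta = 1.0 / (lam * t)                # Pegasos learning rate
                if y[i] * (w @ X[i]) < 1:            # margin violated: sub-gradient is lam*w - y_i*x_i
                    w = (1 - eta * lam) * w + eta * y[i] * X[i]
                else:                                # only the regularizer contributes
                    w = (1 - eta * lam) * w
        return w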

slide-60
SLIDE 60


slide-61
SLIDE 61

Pegasos – Stochastic Gradient Descent Algorithm

Randomly sample from the training data

(figure: training objective ("energy", log scale) vs. iterations, plus 2-D plots of the sampled data)

SGD is online update: gradient on one example (unbiasedly) approximates the gradient on the whole training data (SGD)

