SLIDE 1

Introduction to Machine Learning

  • 5. Support Vector Classification

Alex Smola Carnegie Mellon University

http://alex.smola.org/teaching/cmu2013-10-701 10-701

SLIDE 2

Outline

  • Support Vector Classification: large margin separation, optimization problem
  • Properties: support vectors, kernel expansion
  • Soft margin classifier: dual problem, robustness

SLIDE 3

Support Vector Machines

http://maktoons.blogspot.com/2009/03/support-vector-machine.html

SLIDES 4-10

Linear Separator

(figure-only slides; data points labeled Spam and Ham)

SLIDE 11

Large Margin Classifier

$$\langle w, x \rangle + b \ge 1 \qquad \langle w, x \rangle + b \le -1 \qquad f(x) = \langle w, x \rangle + b$$

linear function

SLIDE 12

Large Margin Classifier

$$\langle w, x \rangle + b = 1 \qquad \langle w, x \rangle + b = -1$$

Margin between the two hyperplanes:

$$\frac{\langle x_+ - x_-, w \rangle}{2\,\|w\|} = \frac{1}{2\,\|w\|}\Big[\big[\langle x_+, w \rangle + b\big] - \big[\langle x_-, w \rangle + b\big]\Big] = \frac{1}{\|w\|}$$

(figure: margin and weight vector w)

SLIDE 13

Large Margin Classifier

$$\langle w, x \rangle + b = 1 \qquad \langle w, x \rangle + b = -1$$

  • Optimization problem

$$\underset{w,b}{\text{maximize}} \;\; \frac{1}{\|w\|} \quad \text{subject to} \quad y_i\,[\langle x_i, w \rangle + b] \ge 1$$

SLIDE 14

Large Margin Classifier

$$\langle w, x \rangle + b = 1 \qquad \langle w, x \rangle + b = -1$$

  • Optimization problem (equivalent convex form)

$$\underset{w,b}{\text{minimize}} \;\; \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i\,[\langle x_i, w \rangle + b] \ge 1$$
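This constrained problem is a quadratic program, so it can be handed to a generic QP solver. A minimal sketch assuming CVXOPT (mentioned later in the deck) is installed and the data are linearly separable (otherwise the hard-margin problem is infeasible); the function name and the stacking of (w, b) into one variable are illustrative choices, not from the slides:

```python
import numpy as np
from cvxopt import matrix, solvers

def hard_margin_svm_primal(X, y):
    """Solve min_{w,b} 1/2 ||w||^2  s.t.  y_i (<w, x_i> + b) >= 1.
    X: (n, d) array, y: (n,) array of +/-1. Returns (w, b)."""
    n, d = X.shape
    # decision variable z = [w; b]
    P = np.zeros((d + 1, d + 1))
    P[:d, :d] = np.eye(d)                 # quadratic penalty only on ||w||^2, not on b
    q = np.zeros(d + 1)
    # constraints rewritten as  -y_i * ([x_i, 1] . z) <= -1
    G = -y[:, None] * np.hstack([X, np.ones((n, 1))])
    h = -np.ones(n)
    sol = solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h))
    z = np.array(sol['x']).ravel()
    return z[:d], z[d]
```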

SLIDE 15

Dual Problem

  • Primal optimization problem
  • Lagrange function

Optimality in (w, b) is at a saddle point with α

  • Derivatives in w, b need to vanish

$$\underset{w,b}{\text{minimize}} \;\; \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i\,[\langle x_i, w \rangle + b] \ge 1$$

$$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_i \alpha_i \big[ y_i [\langle x_i, w \rangle + b] - 1 \big]$$

(the multipliers $\alpha_i \ge 0$ attach to the constraints)

SLIDE 16

Dual Problem

  • Lagrange function
  • Derivatives in w, b need to vanish
  • Plugging terms back into L yields the dual

$$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_i \alpha_i \big[ y_i [\langle x_i, w \rangle + b] - 1 \big]$$

$$\partial_w L(w, b, \alpha) = w - \sum_i \alpha_i y_i x_i = 0 \qquad \partial_b L(w, b, \alpha) = \sum_i \alpha_i y_i = 0$$

$$\underset{\alpha}{\text{maximize}} \;\; -\frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle + \sum_i \alpha_i \quad \text{subject to} \quad \sum_i \alpha_i y_i = 0 \;\text{ and }\; \alpha_i \ge 0$$
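The dual is again a quadratic program, now in α with one equality constraint and nonnegativity constraints. A minimal sketch of solving it with CVXOPT (in the solver's minimization convention, so the objective is negated); the helper name and the 1e-6 support-vector threshold are illustrative assumptions:

```python
import numpy as np
from cvxopt import matrix, solvers

def hard_margin_svm_dual(X, y):
    """Maximize sum_i a_i - 1/2 sum_ij a_i a_j y_i y_j <x_i, x_j>
    subject to sum_i a_i y_i = 0 and a_i >= 0 (hard-margin dual)."""
    n = X.shape[0]
    K = X @ X.T                                   # matrix of inner products <x_i, x_j>
    P = matrix(np.outer(y, y) * K)                # minimize 1/2 a'Pa - 1'a  (negated dual)
    q = matrix(-np.ones(n))
    G = matrix(-np.eye(n))                        # -a_i <= 0, i.e. a_i >= 0
    h = matrix(np.zeros(n))
    A = matrix(y.reshape(1, -1).astype(float))    # sum_i a_i y_i = 0
    b = matrix(0.0)
    alpha = np.array(solvers.qp(P, q, G, h, A, b)['x']).ravel()
    w = (alpha * y) @ X                           # w = sum_i a_i y_i x_i
    sv = alpha > 1e-6                             # points with a_i > 0 are support vectors
    b_offset = np.mean(y[sv] - X[sv] @ w)         # from y_i (<w, x_i> + b) = 1 on the margin
    return alpha, w, b_offset
```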

SLIDE 17

Support Vector Machines

$$\underset{w,b}{\text{minimize}} \;\; \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i\,[\langle x_i, w \rangle + b] \ge 1$$

$$\underset{\alpha}{\text{maximize}} \;\; -\frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle + \sum_i \alpha_i \quad \text{subject to} \quad \sum_i \alpha_i y_i = 0 \;\text{ and }\; \alpha_i \ge 0$$

$$w = \sum_i y_i \alpha_i x_i$$

SLIDE 18

Support Vectors

$$\underset{w,b}{\text{minimize}} \;\; \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i\,[\langle x_i, w \rangle + b] \ge 1$$

$$w = \sum_i y_i \alpha_i x_i$$

Karush-Kuhn-Tucker optimality condition:

$$\alpha_i \big[ y_i [\langle w, x_i \rangle + b] - 1 \big] = 0$$

so for every point either $\alpha_i = 0$, or $\alpha_i > 0 \implies y_i [\langle w, x_i \rangle + b] = 1$ (the point sits exactly on the margin).
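The complementary slackness condition gives a quick numerical sanity check on a solved model: every point with α_i > 0 should lie exactly on the margin. A short sketch reusing the hypothetical hard_margin_svm_dual helper from the previous slide on a toy separable dataset:

```python
import numpy as np

# toy separable data: two shifted Gaussian blobs labeled +1 / -1
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+2, 0.5, (20, 2)), rng.normal(-2, 0.5, (20, 2))])
y = np.hstack([np.ones(20), -np.ones(20)])

alpha, w, b = hard_margin_svm_dual(X, y)          # sketch from the previous slide
margins = y * (X @ w + b)                         # y_i (<w, x_i> + b)

sv = alpha > 1e-6
print("support vectors:", np.sum(sv), "of", len(y))
# KKT: alpha_i * (margin_i - 1) = 0, so margins of support vectors are ~1
print("margins at support vectors:", margins[sv])
print("minimum margin over all points:", margins.min())   # should be >= 1 up to tolerance
```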

SLIDE 19

Properties

$$w = \sum_i y_i \alpha_i x_i$$

  • Weight vector w as weighted linear combination of instances
  • Only points on margin matter (ignore the rest and get same solution)
  • Only inner products matter
  • Quadratic program
  • We can replace the inner product by a kernel
  • Keeps instances away from the margin
SLIDES 20-21

Example (figure-only slides)

SLIDE 22

Why large margins?

  • Maximum robustness relative to uncertainty
  • Symmetry breaking
  • Independent of correctly classified instances
  • Easy to find for easy problems

(figure: data points, with labels r and ρ)

SLIDE 23

Support Vector Machines

CLASSIFIERS

SLIDES 24-26

Large Margin Classifier

$$\langle w, x \rangle + b \ge 1 \qquad \langle w, x \rangle + b \le -1 \qquad f(x) = \langle w, x \rangle + b$$

linear function; here a linear separator is impossible

SLIDES 27-29

Large Margin Classifier

$$\langle w, x \rangle + b \ge 1 \qquad \langle w, x \rangle + b \le -1$$

Theorem (Minsky & Papert): finding the minimum error separating hyperplane is NP hard

so computing the minimum error separator exactly is intractable in practice

SLIDES 30-32

Adding slack variables

$$\langle w, x \rangle + b \ge 1 - \xi \qquad\qquad \langle w, x \rangle + b \le -1 + \xi$$

Convex optimization problem: minimize the amount of slack

SLIDE 33

Intermezzo: Convex Programs for Dummies

  • Primal optimization problem
  • Lagrange function
  • First order optimality conditions in x
  • Solve for x and plug it back into L (keep explicit constraints)

$$\underset{x}{\text{minimize}} \;\; f(x) \quad \text{subject to} \quad c_i(x) \le 0$$

$$L(x, \alpha) = f(x) + \sum_i \alpha_i c_i(x)$$

$$\partial_x L(x, \alpha) = \partial_x f(x) + \sum_i \alpha_i \partial_x c_i(x) = 0$$

$$\underset{\alpha}{\text{maximize}} \;\; L(x(\alpha), \alpha)$$

SLIDES 34-36

Adding slack variables (continued)

$$\langle w, x \rangle + b \ge 1 - \xi \qquad\qquad \langle w, x \rangle + b \le -1 + \xi$$

Convex optimization problem: minimize the amount of slack

SLIDE 37

Adding slack variables

  • Hard margin problem

$$\underset{w,b}{\text{minimize}} \;\; \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i\,[\langle w, x_i \rangle + b] \ge 1$$

  • With slack variables

$$\underset{w,b,\xi}{\text{minimize}} \;\; \frac{1}{2}\|w\|^2 + C\sum_i \xi_i \quad \text{subject to} \quad y_i\,[\langle w, x_i \rangle + b] \ge 1 - \xi_i \;\text{ and }\; \xi_i \ge 0$$

The problem is always feasible. Proof: $w = 0$, $b = 0$, $\xi_i = 1$ satisfies all constraints (this also yields an upper bound of $Cm$ on the optimal value for $m$ points).
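The soft-margin primal is still a quadratic program, now over (w, b, ξ). A minimal CVXOPT sketch under the same assumptions as before (function name illustrative); with C > 0 it is feasible for any dataset, as the w = 0, b = 0, ξ = 1 point above shows:

```python
import numpy as np
from cvxopt import matrix, solvers

def soft_margin_svm_primal(X, y, C=1.0):
    """min 1/2 ||w||^2 + C sum_i xi_i
    s.t. y_i (<w, x_i> + b) >= 1 - xi_i and xi_i >= 0.
    Decision variable z = [w (d entries); b; xi (n entries)]."""
    n, d = X.shape
    P = np.zeros((d + 1 + n, d + 1 + n))
    P[:d, :d] = np.eye(d)                         # quadratic term only on w
    q = np.hstack([np.zeros(d + 1), C * np.ones(n)])
    # -y_i(<w,x_i> + b) - xi_i <= -1   and   -xi_i <= 0
    G_top = np.hstack([-y[:, None] * X, -y[:, None], -np.eye(n)])
    G_bot = np.hstack([np.zeros((n, d + 1)), -np.eye(n)])
    G = np.vstack([G_top, G_bot])
    h = np.hstack([-np.ones(n), np.zeros(n)])
    z = np.array(solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h))['x']).ravel()
    return z[:d], z[d], z[d + 1:]                 # w, b, slacks
```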

SLIDE 38

Dual Problem

  • Primal optimization problem
  • Lagrange function

Optimality in (w, b, ξ) is at a saddle point with (α, η)

  • Derivatives in w, b, ξ need to vanish

$$\underset{w,b,\xi}{\text{minimize}} \;\; \frac{1}{2}\|w\|^2 + C\sum_i \xi_i \quad \text{subject to} \quad y_i\,[\langle w, x_i \rangle + b] \ge 1 - \xi_i \;\text{ and }\; \xi_i \ge 0$$

$$L(w, b, \xi, \alpha, \eta) = \frac{1}{2}\|w\|^2 + C\sum_i \xi_i - \sum_i \alpha_i \big[ y_i [\langle x_i, w \rangle + b] + \xi_i - 1 \big] - \sum_i \eta_i \xi_i$$

SLIDE 39

Dual Problem

  • Lagrange function
  • Derivatives in w, b, ξ need to vanish
  • Plugging terms back into L yields the dual

$$L(w, b, \xi, \alpha, \eta) = \frac{1}{2}\|w\|^2 + C\sum_i \xi_i - \sum_i \alpha_i \big[ y_i [\langle x_i, w \rangle + b] + \xi_i - 1 \big] - \sum_i \eta_i \xi_i$$

$$\partial_w L = w - \sum_i \alpha_i y_i x_i = 0 \qquad \partial_b L = \sum_i \alpha_i y_i = 0 \qquad \partial_{\xi_i} L = C - \alpha_i - \eta_i = 0$$

$$\underset{\alpha}{\text{maximize}} \;\; -\frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle + \sum_i \alpha_i \quad \text{subject to} \quad \sum_i \alpha_i y_i = 0 \;\text{ and }\; \alpha_i \in [0, C]$$

The box constraint $\alpha_i \in [0, C]$ bounds the influence of any single point.
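Relative to the hard-margin dual, only the inequality constraints change: α_i ∈ [0, C] instead of α_i ≥ 0. A sketch of the modified constraint matrices, reusing the CVXOPT setup from the hypothetical hard_margin_svm_dual above:

```python
import numpy as np
from cvxopt import matrix

# In the soft-margin dual the box constraint alpha_i in [0, C] replaces alpha_i >= 0.
# Written as G alpha <= h for CVXOPT:
#   -alpha_i <= 0        (lower bound)
#    alpha_i <= C        (upper bound)
def box_constraints(n, C):
    G = matrix(np.vstack([-np.eye(n), np.eye(n)]))
    h = matrix(np.hstack([np.zeros(n), C * np.ones(n)]))
    return G, h
# Everything else (P, q, A, b) is unchanged from the hard-margin dual sketch.
```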

SLIDE 40

Karush-Kuhn-Tucker Conditions

$$w = \sum_i y_i \alpha_i x_i$$

$$\underset{\alpha}{\text{maximize}} \;\; -\frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle + \sum_i \alpha_i \quad \text{subject to} \quad \sum_i \alpha_i y_i = 0 \;\text{ and }\; \alpha_i \in [0, C]$$

Complementary slackness: $\alpha_i \big[ y_i [\langle w, x_i \rangle + b] + \xi_i - 1 \big] = 0$ and $\eta_i \xi_i = 0$, hence

$$\alpha_i = 0 \implies y_i [\langle w, x_i \rangle + b] \ge 1 \qquad 0 < \alpha_i < C \implies y_i [\langle w, x_i \rangle + b] = 1 \qquad \alpha_i = C \implies y_i [\langle w, x_i \rangle + b] \le 1$$

SLIDES 41-68

(figure-only slides: four sequences of soft margin decision boundaries, each sweeping C = 1, 2, 5, 10, 20, 50, 100)

SLIDE 69

Solving the optimization problem

  • Dual problem
  • If the problem is small enough (1000s of variables) we can use an off-the-shelf solver (CVXOPT, CPLEX, OOQP, LOQO)
  • For larger problems use the fact that only SVs matter and solve in blocks (active set method)

$$\underset{\alpha}{\text{maximize}} \;\; -\frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle + \sum_i \alpha_i \quad \text{subject to} \quad \sum_i \alpha_i y_i = 0 \;\text{ and }\; \alpha_i \in [0, C]$$
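In practice one rarely calls a generic QP solver directly; libraries such as scikit-learn wrap LIBSVM, which uses a working-set (SMO-style) decomposition in the spirit of the blockwise approach above. A minimal usage sketch, assuming scikit-learn is installed:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+1, 1.0, (50, 2)), rng.normal(-1, 1.0, (50, 2))])
y = np.hstack([np.ones(50), -np.ones(50)])

clf = SVC(kernel='linear', C=10.0)     # soft margin, linear kernel
clf.fit(X, y)
print("number of support vectors per class:", clf.n_support_)
print("dual coefficients y_i * alpha_i:", clf.dual_coef_)   # stored only for the support vectors
```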

SLIDE 70

Nonlinear Separation

SLIDE 71

The Kernel Trick

  • Linear soft margin problem

$$\underset{w,b,\xi}{\text{minimize}} \;\; \frac{1}{2}\|w\|^2 + C\sum_i \xi_i \quad \text{subject to} \quad y_i\,[\langle w, x_i \rangle + b] \ge 1 - \xi_i \;\text{ and }\; \xi_i \ge 0$$

  • Dual problem

$$\underset{\alpha}{\text{maximize}} \;\; -\frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle + \sum_i \alpha_i \quad \text{subject to} \quad \sum_i \alpha_i y_i = 0 \;\text{ and }\; \alpha_i \in [0, C]$$

  • Support vector expansion

$$f(x) = \sum_i \alpha_i y_i \langle x_i, x \rangle + b$$

SLIDE 72

The Kernel Trick

  • Linear soft margin problem (in feature space)

$$\underset{w,b,\xi}{\text{minimize}} \;\; \frac{1}{2}\|w\|^2 + C\sum_i \xi_i \quad \text{subject to} \quad y_i\,[\langle w, \phi(x_i) \rangle + b] \ge 1 - \xi_i \;\text{ and }\; \xi_i \ge 0$$

  • Dual problem

$$\underset{\alpha}{\text{maximize}} \;\; -\frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j\, k(x_i, x_j) + \sum_i \alpha_i \quad \text{subject to} \quad \sum_i \alpha_i y_i = 0 \;\text{ and }\; \alpha_i \in [0, C]$$

  • Support vector expansion

$$f(x) = \sum_i \alpha_i y_i\, k(x_i, x) + b$$
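Since both the dual and the decision function touch the data only through inner products, swapping ⟨x_i, x_j⟩ for a kernel k(x_i, x_j) is a one-line change in the earlier dual sketch. A minimal illustration with a Gaussian (RBF) kernel; the kernel width gamma and the helper names are illustrative assumptions:

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian kernel k(a, b) = exp(-gamma * ||a - b||^2), computed pairwise."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

# In the dual QP sketch from before, simply use K = rbf_kernel(X, X, gamma)
# instead of K = X @ X.T. Prediction uses the support vector expansion
#   f(x) = sum_i alpha_i y_i k(x_i, x) + b
def decision_function(X_train, y_train, alpha, b, X_test, gamma=1.0):
    K = rbf_kernel(X_test, X_train, gamma)
    return K @ (alpha * y_train) + b
```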

SLIDES 73-100

(figure-only slides: four sequences of kernel soft margin decision boundaries, each sweeping C = 1, 2, 5, 10, 20, 50, 100)

SLIDE 101

And now with a narrower kernel

SLIDES 102-105 (figure-only slides)

SLIDE 106

And now with a very wide kernel

SLIDE 107 (figure-only slide)

SLIDE 108

Nonlinear separation

  • Increasing C allows for more nonlinearities and decreases the number of training errors
  • The SV boundary need not be contiguous
  • The kernel width adjusts the function class (see the sketch below)
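Both knobs from these experiments are exposed directly in standard libraries: C controls the slack penalty and gamma the RBF kernel width. A small sketch of sweeping them with scikit-learn (dataset and values chosen for illustration only):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0]**2 + X[:, 1]**2 < 1.0, 1, -1)   # labels that no linear separator can match

for C in [1, 10, 100]:
    for gamma in [0.1, 1.0, 10.0]:                   # small gamma = wide kernel, large gamma = narrow
        acc = cross_val_score(SVC(kernel='rbf', C=C, gamma=gamma), X, y, cv=5).mean()
        print(f"C={C:>3}  gamma={gamma:>4}  cv accuracy={acc:.2f}")
```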
SLIDE 109

Risk and Loss

SLIDE 110

Loss function point of view

  • Constrained quadratic program

$$\underset{w,b,\xi}{\text{minimize}} \;\; \frac{1}{2}\|w\|^2 + C\sum_i \xi_i \quad \text{subject to} \quad y_i\,[\langle w, x_i \rangle + b] \ge 1 - \xi_i \;\text{ and }\; \xi_i \ge 0$$

  • Risk minimization setting

$$\underset{w,b}{\text{minimize}} \;\; \frac{1}{2}\|w\|^2 + C\sum_i \max\big[0,\; 1 - y_i\,[\langle w, x_i \rangle + b]\big]$$

The equivalence follows from choosing the minimal feasible slack for a given (w, b) pair, namely $\xi_i = \max[0, 1 - y_i[\langle w, x_i \rangle + b]]$; the second term is the (scaled) empirical risk.
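A quick numerical check of this equivalence: the slacks returned by the constrained QP should coincide with the hinge losses max(0, 1 - y f(x)) at the solution. A sketch reusing the hypothetical soft_margin_svm_primal helper from the slide 37 sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(+1, 1.5, (30, 2)), rng.normal(-1, 1.5, (30, 2))])
y = np.hstack([np.ones(30), -np.ones(30)])

w, b, xi = soft_margin_svm_primal(X, y, C=1.0)        # sketch from slide 37
hinge = np.maximum(0.0, 1.0 - y * (X @ w + b))        # max(0, 1 - y_i f(x_i))

print("max |xi_i - hinge_i|:", np.abs(xi - hinge).max())   # ~0 up to solver tolerance
```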

SLIDE 111

Soft margin as proxy for binary

  • Soft margin loss: $\max(0,\, 1 - y f(x))$
  • Binary loss: $\mathbf{1}\{y f(x) < 0\}$

(figure: the soft margin loss is a convex upper bound on the binary loss, plotted against the margin $y f(x)$)

SLIDE 112

More loss functions

  • Logistic: $\log\big[1 + e^{-f(x)}\big]$
  • Huberized loss:

$$\begin{cases} 0 & \text{if } f(x) > 1 \\ \frac{1}{2}(1 - f(x))^2 & \text{if } f(x) \in [0, 1] \\ \frac{1}{2} - f(x) & \text{if } f(x) < 0 \end{cases}$$

  • Soft margin: $\max(0,\, 1 - f(x))$

All three are (asymptotically) linear for large negative margins and (asymptotically) 0 for large positive margins.
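For reference, the three losses as plain functions of the margin m = y f(x), written exactly as in the piecewise definitions above (a small sketch; the function names are illustrative):

```python
import numpy as np

def logistic_loss(m):
    return np.log1p(np.exp(-m))                 # log(1 + e^{-m})

def huberized_loss(m):
    return np.where(m > 1, 0.0,
           np.where(m < 0, 0.5 - m, 0.5 * (1.0 - m)**2))

def hinge_loss(m):                              # "soft margin" loss
    return np.maximum(0.0, 1.0 - m)

margins = np.linspace(-2, 2, 9)
print(np.column_stack([margins, logistic_loss(margins),
                       huberized_loss(margins), hinge_loss(margins)]))
```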

SLIDE 113

Risk minimization view

  • Find a function f minimizing the classification error
  • Compute the empirical average
  • Minimization is nonconvex
  • Overfitting as we minimize the empirical error
  • Compute a convex upper bound on the loss
  • Add regularization for capacity control (see the sketch below)

$$R[f] := \mathbf{E}_{(x,y)\sim p(x,y)}\big[\{y f(x) < 0\}\big] \qquad R_{\mathrm{emp}}[f] := \frac{1}{m}\sum_{i=1}^m \{y_i f(x_i) < 0\}$$

$$R_{\mathrm{reg}}[f] := \frac{1}{m}\sum_{i=1}^m \max(0,\; 1 - y_i f(x_i)) + \lambda\, \Omega[f]$$

The last term is the regularizer; the remaining question is how to control λ.
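The regularized risk R_reg is convex for a linear model f(x) = ⟨w, x⟩ + b with Ω[f] = ½‖w‖², so it can also be minimized directly by (sub)gradient descent rather than by a QP. A minimal Pegasos-style sketch under those assumptions (the step-size schedule and iteration count are illustrative):

```python
import numpy as np

def hinge_risk_sgd(X, y, lam=0.1, epochs=20, seed=0):
    """Minimize (1/m) sum_i max(0, 1 - y_i(<w, x_i> + b)) + (lam/2) ||w||^2
    by stochastic subgradient descent with Pegasos-style step sizes."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b, t = np.zeros(d), 0.0, 0
    for _ in range(epochs):
        for i in rng.permutation(n):
            t += 1
            eta = 1.0 / (lam * t)                 # decreasing step size
            margin = y[i] * (X[i] @ w + b)
            w *= (1.0 - eta * lam)                # gradient step on the regularizer
            if margin < 1:                        # subgradient of the hinge term is active
                w += eta * y[i] * X[i]
                b += eta * y[i]
    return w, b
```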

SLIDE 114

Summary

  • Support Vector Classification: large margin separation, optimization problem
  • Properties: support vectors, kernel expansion
  • Soft margin classifier: dual problem, robustness