Lecture 17: Multi-class SVMs and Kernels. Aykut Erdem, December 2016, Hacettepe University.



SLIDE 1

Lecture 17:
  • Multi-class SVMs
  • Kernels

Aykut Erdem
December 2016, Hacettepe University

SLIDE 2

Administrative

  • We will have a make-up lecture on Saturday, December 17, 2016 (I will check the date today).
  • Project progress reports are due today!

SLIDE 3

Last time… Support Vector Machines

The two classes satisfy $\langle w, x \rangle + b \ge 1$ and $\langle w, x \rangle + b \le -1$; the decision function $f(x) = \langle w, x \rangle + b$ is a linear function.

slide by Alex Smola

SLIDE 4

Last time… Support Vector Machines

Margin boundaries: $\langle w, x \rangle + b = 1$ and $\langle w, x \rangle + b = -1$.

  • Optimization problem
$$\underset{w,b}{\text{maximize}} \;\; \frac{1}{\|w\|} \quad \text{subject to} \quad y_i \,[\langle x_i, w \rangle + b] \ge 1$$

slide by Alex Smola

SLIDE 5

Last time… Support Vector Machines

Margin boundaries: $\langle w, x \rangle + b = 1$ and $\langle w, x \rangle + b = -1$.

  • Optimization problem
$$\underset{w,b}{\text{minimize}} \;\; \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i \,[\langle x_i, w \rangle + b] \ge 1$$

slide by Alex Smola

SLIDE 6

Last time… Support Vector Machines

Primal problem:
$$\underset{w,b}{\text{minimize}} \;\; \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i \,[\langle x_i, w \rangle + b] \ge 1$$

Dual problem:
$$\underset{\alpha}{\text{maximize}} \;\; -\frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle + \sum_i \alpha_i \quad \text{subject to} \quad \sum_i \alpha_i y_i = 0 \;\text{ and }\; \alpha_i \ge 0$$

with $w = \sum_i y_i \alpha_i x_i$.

slide by Alex Smola
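The dual problem is what SVM solvers actually work with. Below is a minimal sketch (assuming NumPy and scikit-learn are available; neither is named in the lecture) that fits a linear SVM on a toy separable dataset and recovers the primal weight vector from the dual coefficients via $w = \sum_i y_i \alpha_i x_i$.

```python
# A minimal sketch: fit a linear SVM, then rebuild w from the support vectors.
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) + [2, 2], rng.randn(20, 2) - [2, 2]])
y = np.hstack([np.ones(20), -np.ones(20)])

clf = SVC(kernel="linear", C=1e6).fit(X, y)     # a very large C approximates the hard margin

# dual_coef_ holds y_i * alpha_i for the support vectors only
w_from_dual = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(w_from_dual, clf.coef_))      # True: both give the same weight vector
```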

SLIDE 7

Last time… Large Margin Classifier

Margin boundaries: $\langle w, x \rangle + b = 1$ and $\langle w, x \rangle + b = -1$. The support vectors are the points with $\alpha_i > 0$.

SLIDE 8

Last time… Soft-margin Classifier

In general the constraints $\langle w, x \rangle + b \ge 1$ and $\langle w, x \rangle + b \le -1$ cannot all be satisfied.

Theorem (Minsky & Papert)
Finding the minimum error separating hyperplane is NP hard, so computing the minimum error separator is practically infeasible.

slide by Alex Smola

SLIDE 9

Last time… Adding Slack Variables

Relaxed constraints: $\langle w, x \rangle + b \ge 1 - \xi$ and $\langle w, x \rangle + b \le -1 + \xi$, with $\xi_i \ge 0$.

Convex optimization problem: minimize the amount of slack.

slide by Alex Smola

SLIDE 10

Last time… Adding Slack Variables

Relaxed constraints: $\langle w, x \rangle + b \ge 1 - \xi$ and $\langle w, x \rangle + b \le -1 + \xi$, with $\xi_i \ge 0$.

Convex optimization problem: minimize the amount of slack.

  • For $0 < \xi \le 1$ the point is between the margin and correctly classified
  • For $\xi > 1$ the point is misclassified

adopted from Andrew Zisserman

SLIDE 11

Last time… Adding Slack Variables

  • Hard margin problem
$$\underset{w,b}{\text{minimize}} \;\; \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i \,[\langle w, x_i \rangle + b] \ge 1$$

  • With slack variables
$$\underset{w,b}{\text{minimize}} \;\; \frac{1}{2}\|w\|^2 + C \sum_i \xi_i \quad \text{subject to} \quad y_i \,[\langle w, x_i \rangle + b] \ge 1 - \xi_i \;\text{ and }\; \xi_i \ge 0$$

The problem is always feasible. Proof: take $w = 0$, $b = 0$, and $\xi_i = 1$ (this also yields an upper bound on the objective).

slide by Alex Smola

SLIDE 12

Soft-margin classifier

  • Optimisation problem:
$$\underset{w,b}{\text{minimize}} \;\; \frac{1}{2}\|w\|^2 + C \sum_i \xi_i \quad \text{subject to} \quad y_i \,[\langle w, x_i \rangle + b] \ge 1 - \xi_i \;\text{ and }\; \xi_i \ge 0$$

C is a regularization parameter:
  • small C allows constraints to be easily ignored → large margin
  • large C makes constraints hard to ignore → narrow margin
  • C = ∞ enforces all constraints: hard margin

adopted from Andrew Zisserman
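A minimal sketch (using scikit-learn, an assumption on my part) of the trade-off above: since the margin width is $2/\|w\|$, we can read off how the margin shrinks and the support-vector count changes as C grows.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(1)
X = np.vstack([rng.randn(30, 2) + [1.5, 1.5], rng.randn(30, 2) - [1.5, 1.5]])
y = np.hstack([np.ones(30), -np.ones(30)])

for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2.0 / np.linalg.norm(clf.coef_)          # margin width is 2 / ||w||
    print(f"C={C:>6}: margin = {margin:.3f}, support vectors = {clf.n_support_.sum()}")
# Expected pattern: small C -> wide margin (constraints easily ignored),
# large C -> narrow margin (constraints hard to ignore).
```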
SLIDE 13

Demo time…

13

SLIDE 14

This week

  • Multi-class classification
  • Introduction to kernels

14

SLIDES 15-17

Multi-class classification

[Figures illustrating the multi-class classification setting.]

slides by Eric Xing

SLIDE 18

One versus all classification

  • Learn 3 classifiers:
    – "−" vs. {o, +}, weights w−
    – "+" vs. {o, −}, weights w+
    – "o" vs. {+, −}, weights wo
  • Predict label using: $\hat{y} = \arg\max_{y} \; \langle w_y, x \rangle + b_y$
  • Any problems?
  • Could we learn this dataset?

slide by Eric Xing
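A minimal sketch (scikit-learn assumed; the dataset is a stand-in for the one on the slide) of one-versus-all with linear SVMs: one binary classifier per class, prediction by the largest score.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=150, centers=3, random_state=0)    # three classes: 0, 1, 2

# One binary SVM per class: class c vs. the rest
classifiers = [LinearSVC(C=1.0).fit(X, (y == c).astype(int)) for c in range(3)]

scores = np.column_stack([clf.decision_function(X) for clf in classifiers])
y_pred = scores.argmax(axis=1)                                  # label with the highest score
print("training accuracy:", (y_pred == y).mean())
```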

SLIDE 19

Multi-class SVM

  • Simultaneously learn 3 sets of weights: w+, w−, wo
  • How do we guarantee the correct labels?
  • Need new constraints!

The "score" of the correct class must be better than the "score" of the wrong classes.

slide by Eric Xing

SLIDE 20

Multi-class SVM

  • As for the SVM, we introduce slack variables and maximize the margin: the score of the correct class must beat the score of every wrong class by a margin, minus a slack term.

To predict, we use: $\hat{y} = \arg\max_{y} \; \langle w_y, x \rangle + b_y$

Now can we learn it?

slide by Eric Xing
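A minimal sketch (scikit-learn assumed) contrasting the two strategies on these slides: independently trained one-vs-rest SVMs versus jointly learned weights whose constraints require the correct class to out-score the wrong ones (the Crammer-Singer formulation, one concrete instance of the idea above).

```python
from sklearn.svm import LinearSVC
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=300, centers=3, random_state=0)

ovr = LinearSVC(multi_class="ovr", C=1.0).fit(X, y)               # one vs. all
joint = LinearSVC(multi_class="crammer_singer", C=1.0).fit(X, y)  # joint multi-class SVM

print("one-vs-rest accuracy:   ", ovr.score(X, y))
print("crammer_singer accuracy:", joint.score(X, y))
```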

SLIDE 21

Kernels

21

slide by Alex Smola

SLIDE 22

Non-linear features

  • Regression: we got nonlinear functions by preprocessing
  • Perceptron:
    • Map data into feature space $x \to \phi(x)$
    • Solve the problem in this space
    • Query-replace $\langle x, x' \rangle$ by $\langle \phi(x), \phi(x') \rangle$ in the code
  • Feature Perceptron: solution lies in the span of the $\phi(x_i)$

slide by Alex Smola

SLIDE 23

Non-linear features

  • Separating surfaces are circles, hyperbolae, parabolae

slide by Alex Smola

SLIDE 24

Solving XOR

  • XOR is not linearly separable
  • Mapping into 3 dimensions, $(x_1, x_2) \mapsto (x_1, x_2, x_1 x_2)$, makes it easily solvable

slide by Alex Smola
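A minimal sketch (NumPy and scikit-learn assumed) of the XOR example: a linear SVM cannot fit the four XOR points in 2D, but after adding the product feature $x_1 x_2$ it separates them perfectly.

```python
import numpy as np
from sklearn.svm import LinearSVC

X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
y = np.array([1, -1, -1, 1])                        # XOR labels

X3 = np.column_stack([X, X[:, 0] * X[:, 1]])        # (x1, x2) -> (x1, x2, x1*x2)

print("2D accuracy:", LinearSVC(C=100.0).fit(X, y).score(X, y))    # cannot reach 1.0
print("3D accuracy:", LinearSVC(C=100.0).fit(X3, y).score(X3, y))  # 1.0: linearly separable
```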

SLIDE 25

Linear Separation with Quadratic Kernels

25

slide by Alex Smola

SLIDE 26

Quadratic Features

Quadratic features in $\mathbb{R}^2$:
$$\Phi(x) := \left( x_1^2, \; \sqrt{2}\, x_1 x_2, \; x_2^2 \right)$$

Dot product:
$$\langle \Phi(x), \Phi(x') \rangle = \left\langle \left( x_1^2, \sqrt{2}\, x_1 x_2, x_2^2 \right), \left( x_1'^2, \sqrt{2}\, x_1' x_2', x_2'^2 \right) \right\rangle = \langle x, x' \rangle^2.$$

Insight: the trick works for any polynomial of order $d$ via $\langle x, x' \rangle^d$.

slide by Alex Smola
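A quick numerical check (plain NumPy) of the identity on this slide: the explicit quadratic feature map and the squared dot product give the same value.

```python
import numpy as np

def phi(x):
    # explicit quadratic feature map Phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2)
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

rng = np.random.RandomState(0)
x, xp = rng.randn(2), rng.randn(2)

lhs = phi(x) @ phi(xp)          # feature map, then dot product
rhs = (x @ xp) ** 2             # kernel evaluated directly in input space
print(np.isclose(lhs, rhs))     # True
```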

SLIDE 27

Computational Efficiency

Problem
Extracting features can sometimes be very costly. Example: second order features in 1000 dimensions. This leads to about $5 \cdot 10^5$ numbers. For higher order polynomial features it is much worse.

Solution
Don't compute the features; try to compute the dot products implicitly. For some features this works…

Definition
A kernel function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a symmetric function in its arguments for which the following property holds: $k(x, x') = \langle \Phi(x), \Phi(x') \rangle$ for some feature map $\Phi$. If $k(x, x')$ is much cheaper to compute than $\Phi(x)$…

slide by Alex Smola

SLIDE 28

Recap: The Perceptron

  • Nothing happens if classified correctly
  • Weight vector is a linear combination of the training points
  • Classifier is a linear combination of inner products

initialize w = 0 and b = 0
repeat
  if $y_i \,[\langle w, x_i \rangle + b] \le 0$ then
    $w \leftarrow w + y_i x_i$ and $b \leftarrow b + y_i$
  end if
until all classified correctly

$$w = \sum_{i \in I} y_i x_i \qquad f(x) = \sum_{i \in I} y_i \langle x_i, x \rangle + b$$

slide by Alex Smola

SLIDE 29

Recap: The Perceptron on features

  • Nothing happens if classified correctly
  • Weight vector is a linear combination of the feature vectors
  • Classifier is a linear combination of inner products

initialize w = 0 and b = 0
repeat
  pick $(x_i, y_i)$ from the data
  if $y_i (w \cdot \Phi(x_i) + b) \le 0$ then
    $w \leftarrow w + y_i \Phi(x_i)$ and $b \leftarrow b + y_i$
  end if
until $y_i (w \cdot \Phi(x_i) + b) > 0$ for all $i$

$$w = \sum_{i \in I} y_i \phi(x_i) \qquad f(x) = \sum_{i \in I} y_i \langle \phi(x_i), \phi(x) \rangle + b$$

slide by Alex Smola

SLIDE 30

The Kernel Perceptron

  • Nothing happens if classified correctly
  • Weight vector is a linear combination of the feature vectors
  • Classifier is a linear combination of inner products

initialize f = 0
repeat
  pick $(x_i, y_i)$ from the data
  if $y_i f(x_i) \le 0$ then
    $f(\cdot) \leftarrow f(\cdot) + y_i k(x_i, \cdot) + y_i$
  end if
until $y_i f(x_i) > 0$ for all $i$

$$w = \sum_{i \in I} y_i \phi(x_i) \qquad f(x) = \sum_{i \in I} y_i \langle \phi(x_i), \phi(x) \rangle + b = \sum_{i \in I} y_i k(x_i, x) + b$$

slide by Alex Smola
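A minimal sketch (plain NumPy; the RBF kernel and the toy data are my own choices) of the kernel perceptron above: instead of an explicit weight vector we keep one mistake count per training point and evaluate $f(x) = \sum_i \alpha_i y_i k(x_i, x) + b$.

```python
import numpy as np

def rbf(a, b, gamma=1.0):
    # Gaussian RBF kernel k(a, b) = exp(-gamma * ||a - b||^2)
    return np.exp(-gamma * np.sum((a - b) ** 2))

def kernel_perceptron(X, y, kernel=rbf, epochs=20):
    n = len(X)
    alpha, b = np.zeros(n), 0.0
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    for _ in range(epochs):
        mistakes = 0
        for i in range(n):
            f_xi = np.sum(alpha * y * K[:, i]) + b
            if y[i] * f_xi <= 0:          # misclassified: add y_i * k(x_i, .) and y_i to f
                alpha[i] += 1.0
                b += y[i]
                mistakes += 1
        if mistakes == 0:
            break
    return alpha, b

# XOR, which a linear perceptron cannot solve, is learned with the RBF kernel:
X = np.array([[-1., -1.], [-1., 1.], [1., -1.], [1., 1.]])
y = np.array([1., -1., -1., 1.])
alpha, b = kernel_perceptron(X, y)
preds = np.sign([np.sum(alpha * y * [rbf(xi, x) for xi in X]) + b for x in X])
print(preds)   # matches y once training has converged
```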

SLIDE 31

Processing Pipeline

  • Original data
  • Data in feature space (implicit)
  • Solve in feature space using kernels

31

slide by Alex Smola

SLIDE 32

Polynomial Kernels

Idea
We want to extend $k(x, x') = \langle x, x' \rangle^2$ to $k(x, x') = (\langle x, x' \rangle + c)^d$ where $c > 0$ and $d \in \mathbb{N}$. Prove that such a kernel corresponds to a dot product.

Proof strategy
Simple and straightforward: compute the explicit sum given by the kernel, i.e.
$$k(x, x') = (\langle x, x' \rangle + c)^d = \sum_{i=0}^{d} \binom{d}{i} \langle x, x' \rangle^i \, c^{d-i}.$$
Individual terms $\langle x, x' \rangle^i$ are dot products for some $\Phi_i(x)$.

slide by Alex Smola

SLIDE 33

Kernel Conditions

Computability
We have to be able to compute $k(x, x')$ efficiently (much cheaper than the dot products themselves).

"Nice and Useful" Functions
The features themselves have to be useful for the learning problem at hand. Quite often this means smooth functions.

Symmetry
Obviously $k(x, x') = k(x', x)$ due to the symmetry of the dot product $\langle \Phi(x), \Phi(x') \rangle = \langle \Phi(x'), \Phi(x) \rangle$.

Dot Product in Feature Space
Is there always a $\Phi$ such that $k$ really is a dot product?

slide by Alex Smola

SLIDE 34

Mercer's Theorem

The Theorem
For any symmetric function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ which is square integrable in $\mathcal{X} \times \mathcal{X}$ and which satisfies
$$\int_{\mathcal{X} \times \mathcal{X}} k(x, x')\, f(x) f(x')\, dx\, dx' \ge 0 \quad \text{for all } f \in L_2(\mathcal{X}),$$
there exist $\phi_i : \mathcal{X} \to \mathbb{R}$ and numbers $\lambda_i \ge 0$ where
$$k(x, x') = \sum_i \lambda_i \phi_i(x) \phi_i(x') \quad \text{for all } x, x' \in \mathcal{X}.$$

Interpretation
The double integral is the continuous version of a vector-matrix-vector multiplication. For positive semidefinite matrices we have
$$\sum_i \sum_j k(x_i, x_j)\, \alpha_i \alpha_j \ge 0.$$

slide by Alex Smola

SLIDE 35

Properties

Distance in Feature Space
Distance between points in feature space via
$$d(x, x')^2 := \|\Phi(x) - \Phi(x')\|^2 = \langle \Phi(x), \Phi(x) \rangle - 2\langle \Phi(x), \Phi(x') \rangle + \langle \Phi(x'), \Phi(x') \rangle = k(x, x) + k(x', x') - 2k(x, x')$$

Kernel Matrix
To compare observations we compute dot products, so we study the matrix $K$ given by $K_{ij} = \langle \Phi(x_i), \Phi(x_j) \rangle = k(x_i, x_j)$, where the $x_i$ are the training patterns.

Similarity Measure
The entries $K_{ij}$ tell us the overlap between $\Phi(x_i)$ and $\Phi(x_j)$, so $k(x_i, x_j)$ is a similarity measure.

slide by Alex Smola

SLIDE 36

Properties

K is Positive Semidefinite
Claim: $\alpha^\top K \alpha \ge 0$ for all $\alpha \in \mathbb{R}^m$ and all kernel matrices $K \in \mathbb{R}^{m \times m}$. Proof:
$$\sum_{i,j}^{m} \alpha_i \alpha_j K_{ij} = \sum_{i,j}^{m} \alpha_i \alpha_j \langle \Phi(x_i), \Phi(x_j) \rangle = \left\langle \sum_{i}^{m} \alpha_i \Phi(x_i), \sum_{j}^{m} \alpha_j \Phi(x_j) \right\rangle = \left\| \sum_{i=1}^{m} \alpha_i \Phi(x_i) \right\|^2 \ge 0$$

Kernel Expansion
If $w$ is given by a linear combination of the $\Phi(x_i)$ we get
$$\langle w, \Phi(x) \rangle = \left\langle \sum_{i=1}^{m} \alpha_i \Phi(x_i), \Phi(x) \right\rangle = \sum_{i=1}^{m} \alpha_i k(x_i, x).$$

slide by Alex Smola
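A small numerical illustration (NumPy) of the claim above: the Gram matrix of a valid kernel has no negative eigenvalues, so $\alpha^\top K \alpha \ge 0$ for any $\alpha$. The Gaussian RBF kernel is used here only as an example.

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(50, 3)

sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-0.5 * sq_dists)                      # Gaussian RBF kernel matrix

eigvals = np.linalg.eigvalsh(K)                  # symmetric matrix, so use eigvalsh
print("smallest eigenvalue:", eigvals.min())     # >= 0 up to numerical precision

alpha = rng.randn(50)
print("alpha^T K alpha =", alpha @ K @ alpha)    # nonnegative
```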

SLIDE 37

Examples

Examples of kernels $k(x, x')$:

  • Linear: $\langle x, x' \rangle$
  • Laplacian RBF: $\exp(-\lambda \|x - x'\|)$
  • Gaussian RBF: $\exp(-\lambda \|x - x'\|^2)$
  • Polynomial: $(\langle x, x' \rangle + c)^d$, $c \ge 0$, $d \in \mathbb{N}$
  • B-Spline: $B_{2n+1}(x - x')$
  • Cond. Expectation: $E_c[p(x|c)\, p(x'|c)]$

Simple trick for checking Mercer's condition: compute the Fourier transform of the kernel and check that it is nonnegative.

slide by Alex Smola
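A rough sketch (NumPy) of the "Fourier transform trick" above for translation-invariant kernels: sample $k(x - x')$ on a grid and check that its discrete Fourier transform is nonnegative. The grid, truncation, and kernel parameters here are arbitrary choices for illustration.

```python
import numpy as np

t = np.linspace(-20, 20, 2001)
for name, k in [("Gaussian RBF", np.exp(-t**2)),
                ("Laplacian RBF", np.exp(-np.abs(t)))]:
    spectrum = np.fft.fft(np.fft.ifftshift(k)).real          # FT of a symmetric function is real
    print(f"{name}: min of spectrum = {spectrum.min():.2e}")  # >= 0 up to numerical error
```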

SLIDES 38-42

[Figure slides: Linear Kernel, Laplacian Kernel, Gaussian Kernel, Polynomial of order 3, B3 Spline Kernel.]

slides by Alex Smola

SLIDE 43

Kernels in Computer Vision

  • Features x = histogram (of color, texture, etc.)
  • Common kernels:
    • Intersection Kernel
    • Chi-square Kernel

slide by Dhruv Batra
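A minimal sketch (NumPy) of the histogram intersection kernel named on this slide, $k(h, h') = \sum_i \min(h_i, h'_i)$; any normalization used in the lecture's figures is not reproduced here, and the histograms below are random stand-ins.

```python
import numpy as np

def intersection_kernel(H1, H2):
    # H1: (n, d) and H2: (m, d) histograms; returns the (n, m) Gram matrix of sum_i min(., .)
    return np.minimum(H1[:, None, :], H2[None, :, :]).sum(axis=-1)

rng = np.random.RandomState(0)
H = rng.rand(4, 8)
H /= H.sum(axis=1, keepdims=True)        # normalize each histogram to sum to 1
print(intersection_kernel(H, H))         # diagonal entries equal 1 for normalized histograms
```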

SLIDE 44

[Figure]

slide by Dhruv Batra
Image credit: Subhransu Maji

SLIDE 45

The Kernel Trick for SVMs

slide by Alex Smola

SLIDE 46

The Kernel Trick for SVMs

  • Linear soft margin problem
$$\underset{w,b}{\text{minimize}} \;\; \frac{1}{2}\|w\|^2 + C \sum_i \xi_i \quad \text{subject to} \quad y_i \,[\langle w, x_i \rangle + b] \ge 1 - \xi_i \;\text{ and }\; \xi_i \ge 0$$

  • Dual problem
$$\underset{\alpha}{\text{maximize}} \;\; -\frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle + \sum_i \alpha_i \quad \text{subject to} \quad \sum_i \alpha_i y_i = 0 \;\text{ and }\; \alpha_i \in [0, C]$$

  • Support vector expansion
$$f(x) = \sum_i \alpha_i y_i \langle x_i, x \rangle + b$$

slide by Alex Smola

SLIDE 47

The Kernel Trick for SVMs

  • Soft margin problem in feature space
$$\underset{w,b}{\text{minimize}} \;\; \frac{1}{2}\|w\|^2 + C \sum_i \xi_i \quad \text{subject to} \quad y_i \,[\langle w, \phi(x_i) \rangle + b] \ge 1 - \xi_i \;\text{ and }\; \xi_i \ge 0$$

  • Dual problem
$$\underset{\alpha}{\text{maximize}} \;\; -\frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j k(x_i, x_j) + \sum_i \alpha_i \quad \text{subject to} \quad \sum_i \alpha_i y_i = 0 \;\text{ and }\; \alpha_i \in [0, C]$$

  • Support vector expansion
$$f(x) = \sum_i \alpha_i y_i k(x_i, x) + b$$

slide by Alex Smola
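A minimal sketch (scikit-learn assumed) of the kernel trick in practice: the dual solver only ever touches $k(x_i, x_j)$, so we can hand it a precomputed Gram matrix and never build feature vectors.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

def rbf_gram(A, B, gamma=1.0):
    # Gaussian RBF Gram matrix between the rows of A and the rows of B
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

K = rbf_gram(X, X)
clf = SVC(kernel="precomputed", C=10.0).fit(K, y)

# To predict at new points we only need k(x_i, x) for the training points x_i:
X_test, y_test = make_moons(n_samples=50, noise=0.2, random_state=1)
print("test accuracy:", clf.score(rbf_gram(X_test, X), y_test))
```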

SLIDES 48-55

[Plots: decision boundary of a kernel SVM on the same dataset as C increases through 1, 2, 5, 10, 20, 50, 100; the support vectors and the level sets y = -1, y = 0, y = +1 are marked.]

slides by Alex Smola
SLIDES 56-76

[Plots: the same sweep over C = 1, 2, 5, 10, 20, 50, 100, repeated three more times.]

slides by Alex Smola
SLIDES 77-81

And now with a narrower kernel

[Plots: the same experiment with a narrower kernel.]

slides by Alex Smola
SLIDES 82-83

And now with a very wide kernel

[Plots: the same experiment with a very wide kernel.]

slides by Alex Smola
SLIDE 84

Nonlinear Separation

  • Increasing C allows for more nonlinearities
  • Decreases the number of errors
  • The SV decision boundary need not be contiguous
  • The kernel width adjusts the function class

slide by Alex Smola
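A minimal sketch (scikit-learn assumed) of the last bullet: in the library's RBF kernel $k(x, x') = \exp(-\gamma \|x - x'\|^2)$, a large $\gamma$ plays the role of a narrow kernel and a small $\gamma$ of a wide one, which changes the flexibility of the decision boundary.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=200, noise=0.25, random_state=0)

for gamma in [0.1, 1.0, 100.0]:           # wide ... narrow kernel
    clf = SVC(kernel="rbf", C=10.0, gamma=gamma).fit(X, y)
    print(f"gamma={gamma:>6}: train accuracy = {clf.score(X, y):.2f}, "
          f"support vectors = {clf.n_support_.sum()}")
# A very narrow kernel typically fits the training set almost perfectly (overfitting risk).
```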

SLIDE 85

Overfitting?

  • Huge feature space with kernels: should we worry about overfitting?
  • The SVM objective seeks a solution with a large margin
  • Theory says that a large margin leads to good generalization (we will see this in a couple of lectures)
  • But everything overfits sometimes!
  • We can control overfitting by:
    • Setting C
    • Choosing a better kernel
    • Varying the parameters of the kernel (width of the Gaussian, etc.)

slide by Alex Smola
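A minimal sketch (scikit-learn assumed) of the control knobs on this slide: pick C and the kernel width by cross-validation rather than by training accuracy, so the chosen model is judged on data it was not fit to.

```python
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=300, noise=0.3, random_state=0)

param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1, 10]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5).fit(X, y)

print("best parameters:", search.best_params_)
print("cross-validated accuracy: %.2f" % search.best_score_)
```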