Lecture 17:
− Multi-class SVMs
− Kernels
Aykut Erdem
December 2016 Hacettepe University
Administrative
− We will have a make-up lecture on Saturday, December 17, 2016 (I will check the date today).
− Project progress reports are due today!
$\langle w, x \rangle + b \geq 1$   $\langle w, x \rangle + b \leq -1$
$f(x) = \langle w, x \rangle + b$: a linear function
slide by Alex Smola
$\langle w, x \rangle + b = 1$   $\langle w, x \rangle + b = -1$   (margin hyperplanes, normal vector $w$)

maximize over $w, b$:  $\dfrac{1}{\|w\|}$  subject to  $y_i\,[\langle x_i, w \rangle + b] \geq 1$
slide by Alex Smola
$\langle w, x \rangle + b = 1$   $\langle w, x \rangle + b = -1$   (margin hyperplanes, normal vector $w$)

Equivalently, minimize over $w, b$:  $\dfrac{1}{2}\|w\|^2$  subject to  $y_i\,[\langle x_i, w \rangle + b] \geq 1$
slide by Alex Smola
Primal problem: minimize over $w, b$:  $\dfrac{1}{2}\|w\|^2$  subject to  $y_i\,[\langle x_i, w \rangle + b] \geq 1$

Dual problem: maximize over $\alpha$:  $-\dfrac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle + \sum_i \alpha_i$  subject to  $\sum_i \alpha_i y_i = 0$ and $\alpha_i \geq 0$

Solution: $w = \sum_i y_i \alpha_i x_i$
slide by Alex Smola
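A brief sketch, not spelled out on the slide, of how the dual arises from the primal via the Lagrangian: introduce multipliers $\alpha_i \geq 0$ and form
$$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_i \alpha_i \big( y_i [\langle x_i, w \rangle + b] - 1 \big).$$
Setting the derivatives with respect to $w$ and $b$ to zero gives $w = \sum_i \alpha_i y_i x_i$ and $\sum_i \alpha_i y_i = 0$; substituting these back into $L$ yields exactly the dual objective $\sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle$ above.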
$\langle w, x \rangle + b = 1$   $\langle w, x \rangle + b = -1$   (margin hyperplanes, normal vector $w$)

Support vectors: $\alpha_i > 0 \implies$ the point $x_i$ lies on a margin hyperplane; all other points satisfy $\langle w, x \rangle + b \geq 1$ or $\langle w, x \rangle + b \leq -1$.
Theorem (Minsky & Papert). Finding the minimum-error separating hyperplane is NP-hard.
Consequently, computing the minimum-error separator exactly is practically impossible.
slide by Alex Smola
Relax the constraints with slack: $\langle w, x \rangle + b \geq 1 - \xi$ on the positive side and $\langle w, x \rangle + b \leq -1 + \xi$ on the negative side. This remains a convex optimization problem; we then minimize the amount of slack.
slide by Alex Smola
Slack variables $\xi_i \geq 0$: the constraints become $\langle w, x \rangle + b \geq 1 - \xi$ and $\langle w, x \rangle + b \leq -1 + \xi$, still a convex optimization problem in which we minimize the amount of slack. A point with $0 < \xi \leq 1$ is still correctly classified, but lies inside the margin.
adopted from Andrew Zisserman
Hard margin: minimize over $w, b$:  $\dfrac{1}{2}\|w\|^2$  subject to  $y_i\,[\langle w, x_i \rangle + b] \geq 1$

Soft margin: minimize over $w, b$:  $\dfrac{1}{2}\|w\|^2 + C \sum_i \xi_i$  subject to  $y_i\,[\langle w, x_i \rangle + b] \geq 1 - \xi_i$ and $\xi_i \geq 0$

The soft-margin problem is always feasible: for example, $w = 0$, $b = 0$ and $\xi_i = 1$ satisfy all constraints.
slide by Alex Smola
minimize over $w, b$:  $\dfrac{1}{2}\|w\|^2 + C \sum_i \xi_i$  subject to  $y_i\,[\langle w, x_i \rangle + b] \geq 1 - \xi_i$ and $\xi_i \geq 0$
adopted from Andrew Zisserman
C is a regularization parameter:
− small C: slack is cheap, so constraints are easily ignored → large margin
− large C: slack is expensive, so constraints are hard to ignore → narrow margin
slide by Eric Xing
slide by Eric Xing
slide by Eric Xing
Train one binary classifier per class (one vs. rest):
− "−" vs. {o, +}, weights $w_-$
− "+" vs. {o, −}, weights $w_+$
− "o" vs. {+, −}, weights $w_o$
slide by Eric Xing
What about the correct labels? The "score" of the correct class must be better than the "score" of the wrong classes (weight vectors $w_+$, $w_-$, $w_o$):
slide by Eric Xing
To predict, we use: $\hat{y} = \arg\max_{y} \langle w_y, x \rangle$, the class whose weight vector gives the highest score.
Now, can we learn it?
slide by Eric Xing
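To make the prediction rule concrete, here is a minimal sketch in Python (the function and variable names are illustrative, not from the slides):

import numpy as np

def predict_one_vs_rest(x, weights, biases):
    # weights: dict mapping class label -> weight vector w_y
    # biases:  dict mapping class label -> bias b_y
    # pick the class whose linear score <w_y, x> + b_y is largest
    scores = {y: np.dot(w, x) + biases[y] for y, w in weights.items()}
    return max(scores, key=scores.get)

# Toy example with the three classes '+', '-', 'o' used above (made-up numbers)
weights = {'+': np.array([1.0, 0.0]),
           '-': np.array([-1.0, 0.0]),
           'o': np.array([0.0, 1.0])}
biases = {'+': 0.0, '-': 0.0, 'o': 0.0}
print(predict_one_vs_rest(np.array([0.2, 0.9]), weights, biases))   # -> 'o'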
slide by Alex Smola
We got nonlinear functions by preprocessing: map the inputs $x \to \phi(x)$ and replace the dot product $\langle x, x' \rangle$ by $\langle \phi(x), \phi(x') \rangle$; the solution is then expressed in terms of the $\phi(x_i)$.
slide by Alex Smola
Circles, hyperbolae, parabolae
slide by Alex Smola
$(x_1, x_2) \to (x_1, x_2, x_1 x_2)$
slide by Alex Smola
slide by Alex Smola
Quadratic features in $\mathbb{R}^2$:
$$\Phi(x) := \left(x_1^2,\ \sqrt{2}\, x_1 x_2,\ x_2^2\right)$$
Dot product:
$$\langle \Phi(x), \Phi(x') \rangle = \left\langle \left(x_1^2, \sqrt{2}\, x_1 x_2, x_2^2\right),\ \left(x_1'^2, \sqrt{2}\, x_1' x_2', x_2'^2\right) \right\rangle = \langle x, x' \rangle^2.$$
Insight: the trick works for any polynomial of order $d$ via $\langle x, x' \rangle^d$.
slide by Alex Smola
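A quick numerical check of the identity above, as a minimal Python sketch (the helper name phi is made up):

import numpy as np

def phi(x):
    # explicit quadratic feature map on R^2, as defined above
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
lhs = np.dot(phi(x), phi(xp))   # dot product in feature space
rhs = np.dot(x, xp) ** 2        # kernel evaluated directly in input space
print(lhs, rhs)                 # both equal 1 (up to floating-point rounding)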
Problem: extracting features can sometimes be very costly. Example: second-order features in 1000 dimensions already give about $5 \cdot 10^5$ numbers; for higher-order polynomial features it is much worse.

Solution: don't compute the features; try to compute the dot products directly.

Definition: a kernel function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a symmetric function in its arguments for which the following property holds: $k(x, x') = \langle \Phi(x), \Phi(x') \rangle$ for some feature map $\Phi$. If $k(x, x')$ is much cheaper to compute than $\Phi(x)$ . . .
slide by Alex Smola
The perceptron, written in terms of inner products:
initialize $w = 0$ and $b = 0$
repeat
  if $y_i\,[\langle w, x_i \rangle + b] \leq 0$ then
    $w \leftarrow w + y_i x_i$ and $b \leftarrow b + y_i$
  end if
until all classified correctly

Solution: $w = \sum_{i \in I} y_i x_i$, hence $f(x) = \sum_{i \in I} y_i \langle x_i, x \rangle + b$
slide by Alex Smola
The perceptron on features, again in terms of inner products:
initialize $w = 0$ and $b = 0$
repeat
  pick $(x_i, y_i)$ from the data
  if $y_i (w \cdot \Phi(x_i) + b) \leq 0$ then
    $w \leftarrow w + y_i \Phi(x_i)$ and $b \leftarrow b + y_i$
until $y_i (w \cdot \Phi(x_i) + b) > 0$ for all $i$
end

Solution: $w = \sum_{i \in I} y_i \phi(x_i)$, hence $f(x) = \sum_{i \in I} y_i \langle \phi(x_i), \phi(x) \rangle + b$
slide by Alex Smola
Since $w = \sum_{i \in I} y_i \phi(x_i)$:

initialize $f = 0$
repeat
  pick $(x_i, y_i)$ from the data
  if $y_i f(x_i) \leq 0$ then
    $f(\cdot) \leftarrow f(\cdot) + y_i k(x_i, \cdot) + y_i$
until $y_i f(x_i) > 0$ for all $i$
end

$f(x) = \sum_{i \in I} y_i \langle \phi(x_i), \phi(x) \rangle + b = \sum_{i \in I} y_i k(x_i, x) + b$
slide by Alex Smola
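The kernelized perceptron above, as a minimal Python sketch (the Gaussian RBF kernel and the fixed number of passes are assumptions for illustration):

import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    # Gaussian RBF kernel k(x, z) = exp(-gamma * ||x - z||^2)
    return np.exp(-gamma * np.sum((x - z) ** 2))

def kernel_perceptron(X, y, kernel=rbf_kernel, epochs=10):
    n = len(X)
    alpha = np.zeros(n)   # mistake counts: f(x) = sum_i alpha_i y_i k(x_i, x) + b
    b = 0.0
    for _ in range(epochs):
        for i in range(n):
            f_xi = sum(alpha[j] * y[j] * kernel(X[j], X[i]) for j in range(n)) + b
            if y[i] * f_xi <= 0:      # mistake: add y_i k(x_i, .) + y_i to f
                alpha[i] += 1.0
                b += y[i]
    return alpha, b

# Toy usage: XOR-like data that is not linearly separable in input space
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([-1., 1., 1., -1.])
alpha, b = kernel_perceptron(X, y)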
slide by Alex Smola
Idea: we want to extend $k(x, x') = \langle x, x' \rangle^2$ to $k(x, x') = (\langle x, x' \rangle + c)^d$ where $c > 0$ and $d \in \mathbb{N}$, and prove that such a kernel corresponds to a dot product.

Proof strategy: simple and straightforward; compute the explicit sum given by the kernel, i.e.
$$k(x, x') = (\langle x, x' \rangle + c)^d = \sum_{i=0}^{d} \binom{d}{i} \langle x, x' \rangle^i\, c^{d-i}.$$
The individual terms $\langle x, x' \rangle^i$ are dot products for some $\Phi_i(x)$.
slide by Alex Smola
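A concrete worked case for $d = 2$, spelled out here for clarity:
$$(\langle x, x' \rangle + c)^2 = \langle x, x' \rangle^2 + 2c\,\langle x, x' \rangle + c^2,$$
i.e. the quadratic kernel plus a scaled linear kernel plus a constant. For $x \in \mathbb{R}^2$ one corresponding feature map is
$$\Phi(x) = \big(x_1^2,\ \sqrt{2}\,x_1 x_2,\ x_2^2,\ \sqrt{2c}\,x_1,\ \sqrt{2c}\,x_2,\ c\big),$$
since $\langle \Phi(x), \Phi(x') \rangle = \langle x, x' \rangle^2 + 2c\,\langle x, x' \rangle + c^2$.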
Computability: we have to be able to compute $k(x, x')$ efficiently (much cheaper than the dot products of explicit features themselves).
"Nice and useful" functions: the features themselves have to be useful for the learning problem at hand; quite often this means smooth functions.
Symmetry: obviously $k(x, x') = k(x', x)$, due to the symmetry of the dot product $\langle \Phi(x), \Phi(x') \rangle = \langle \Phi(x'), \Phi(x) \rangle$.
Dot product in feature space: is there always a $\Phi$ such that $k$ really is a dot product?
slide by Alex Smola
The Theorem (Mercer). For any symmetric function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ which is square integrable in $\mathcal{X} \times \mathcal{X}$ and which satisfies
$$\int_{\mathcal{X} \times \mathcal{X}} k(x, x')\, f(x)\, f(x')\, dx\, dx' \geq 0 \quad \text{for all } f \in L_2(\mathcal{X}),$$
there exist functions $\phi_i : \mathcal{X} \to \mathbb{R}$ and numbers $\lambda_i \geq 0$ such that
$$k(x, x') = \sum_i \lambda_i\, \phi_i(x)\, \phi_i(x') \quad \text{for all } x, x' \in \mathcal{X}.$$
Interpretation: the double integral is the continuous version of a vector-matrix-vector multiplication; for positive semidefinite matrices we have $\sum_i \sum_j k(x_i, x_j)\, \alpha_i \alpha_j \geq 0$.
slide by Alex Smola
Distance in feature space: the distance between points in feature space is
$$d(x, x')^2 := \|\Phi(x) - \Phi(x')\|^2 = \langle \Phi(x), \Phi(x) \rangle - 2\langle \Phi(x), \Phi(x') \rangle + \langle \Phi(x'), \Phi(x') \rangle = k(x, x) + k(x', x') - 2k(x, x').$$
Kernel matrix: to compare observations we compute dot products, so we study the matrix $K$ given by $K_{ij} = \langle \Phi(x_i), \Phi(x_j) \rangle = k(x_i, x_j)$, where the $x_i$ are the training patterns.
Similarity measure: the entries $K_{ij}$ tell us the overlap between $\Phi(x_i)$ and $\Phi(x_j)$, so $k(x_i, x_j)$ is a similarity measure.
slide by Alex Smola
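Both quantities are easy to compute once a kernel is chosen; a minimal sketch (the Gaussian RBF kernel here is an assumed example):

import numpy as np

def gaussian_rbf(x, z, gamma=0.5):
    return np.exp(-gamma * np.sum((x - z) ** 2))

def kernel_matrix(X, kernel):
    # K_ij = k(x_i, x_j) over all training patterns
    n = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

def feature_space_distance(x, z, kernel):
    # d(x, z)^2 = k(x, x) + k(z, z) - 2 k(x, z)
    return np.sqrt(kernel(x, x) + kernel(z, z) - 2.0 * kernel(x, z))

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
print(kernel_matrix(X, gaussian_rbf))
print(feature_space_distance(X[0], X[1], gaussian_rbf))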
K is positive semidefinite. Claim: $\alpha^\top K \alpha \geq 0$ for all $\alpha \in \mathbb{R}^m$ and all kernel matrices $K \in \mathbb{R}^{m \times m}$. Proof:
$$\sum_{i,j}^{m} \alpha_i \alpha_j K_{ij} = \sum_{i,j}^{m} \alpha_i \alpha_j \langle \Phi(x_i), \Phi(x_j) \rangle = \left\langle \sum_{i}^{m} \alpha_i \Phi(x_i),\ \sum_{j}^{m} \alpha_j \Phi(x_j) \right\rangle = \left\| \sum_{i=1}^{m} \alpha_i \Phi(x_i) \right\|^2 \geq 0.$$
Kernel expansion: if $w$ is given by a linear combination of the $\Phi(x_i)$, we get
$$\langle w, \Phi(x) \rangle = \left\langle \sum_{i=1}^{m} \alpha_i \Phi(x_i),\ \Phi(x) \right\rangle = \sum_{i=1}^{m} \alpha_i k(x_i, x).$$
slide by Alex Smola
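A quick numerical illustration of the claim (an assumed example: a Gaussian RBF kernel matrix on random points):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
gamma = 0.5
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-gamma * sq_dists)            # kernel matrix K_ij = k(x_i, x_j)

# Eigenvalues of a PSD matrix are nonnegative (up to numerical error)
print(np.linalg.eigvalsh(K).min() >= -1e-10)   # expect True

# Direct check of alpha^T K alpha >= 0 for a random alpha
alpha = rng.normal(size=20)
print(alpha @ K @ alpha >= 0)                  # expect True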
Examples of kernels $k(x, x')$:
− Linear: $\langle x, x' \rangle$
− Laplacian RBF: $\exp(-\lambda \|x - x'\|)$
− Gaussian RBF: $\exp(-\lambda \|x - x'\|^2)$
− Polynomial: $(\langle x, x' \rangle + c)^d$, $c \geq 0$, $d \in \mathbb{N}$
− B-spline: $B_{2n+1}(x - x')$
− Conditional expectation: $\mathbb{E}_c[p(x \mid c)\, p(x' \mid c)]$
Simple trick for checking Mercer's condition: compute the Fourier transform of the kernel and check that it is nonnegative.
slide by Alex Smola
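For reference, minimal Python sketches of the first few kernels in the list (the function names are my own):

import numpy as np

def linear_kernel(x, z):
    return np.dot(x, z)

def laplacian_rbf(x, z, lam=1.0):
    return np.exp(-lam * np.linalg.norm(x - z))

def gaussian_rbf(x, z, lam=1.0):
    return np.exp(-lam * np.sum((x - z) ** 2))

def polynomial_kernel(x, z, c=1.0, d=3):
    return (np.dot(x, z) + c) ** d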
slide by Alex Smola
slide by Alex Smola
slide by Alex Smola
slide by Alex Smola
slide by Alex Smola
slide by Dhruv Batra
slide by Dhruv Batra
Image credit: Subhransu Maji
slide by Alex Smola
Linear soft-margin SVM:
$$f(x) = \sum_i \alpha_i y_i \langle x_i, x \rangle + b$$
Dual: maximize over $\alpha$:  $-\dfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle + \sum_i \alpha_i$  subject to  $\sum_i \alpha_i y_i = 0$ and $\alpha_i \in [0, C]$

Primal: minimize over $w, b$:  $\dfrac{1}{2}\|w\|^2 + C \sum_i \xi_i$  subject to  $y_i\,[\langle w, x_i \rangle + b] \geq 1 - \xi_i$ and $\xi_i \geq 0$
slide by Alex Smola
Kernelized soft-margin SVM:
$$f(x) = \sum_i \alpha_i y_i k(x_i, x) + b$$
Dual: maximize over $\alpha$:  $-\dfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j k(x_i, x_j) + \sum_i \alpha_i$  subject to  $\sum_i \alpha_i y_i = 0$ and $\alpha_i \in [0, C]$

Primal: minimize over $w, b$:  $\dfrac{1}{2}\|w\|^2 + C \sum_i \xi_i$  subject to  $y_i\,[\langle w, \phi(x_i) \rangle + b] \geq 1 - \xi_i$ and $\xi_i \geq 0$
slide by Alex Smola
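A minimal sketch of training such a kernelized soft-margin SVM in practice, here with scikit-learn's SVC (an assumed tooling choice, not from the slides; the data is synthetic). Varying C previews the regularization effect shown in the figures below:

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0, 1, -1)   # circular decision boundary

for C in [1, 10, 100]:
    clf = SVC(C=C, kernel='rbf', gamma=1.0)   # Gaussian RBF kernel
    clf.fit(X, y)
    # Larger C penalizes slack more heavily, typically leaving fewer support vectors
    print(C, clf.n_support_.sum(), clf.score(X, y))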
[Figure slides by Alex Smola: decision boundary (y = 0), margins (y = 1, y = −1) and support vectors of the kernelized soft-margin SVM for C = 1, 2, 5, 10, 20, 50, 100; the sweep over C is repeated for several settings, followed by further figure-only slides.]
− Do we need to worry about overfitting?
− Generalization (we will see this in a couple of lectures)
− etc.
slide by Alex Smola