Aykut Erdem // Hacettepe University // Fall 2019
Lecture 17:
Kernel Trick for SVMs, Risk and Loss, Support Vector Regression
BBM406
Fundamentals of Machine Learning
Photo by Arthur Gretton, CMU Machine Learning Protestors at G20
Project progress reports: due soon!
Due: December 22, 2019 (11:59pm). Each group should submit a project progress report by December 22, 2019. The report should be 3-4 pages and should describe the following points as clearly as possible:
- The problem you will explore. Explain why you find it interesting.
- The method(s) expected to form the basis of the project. State whether you will extend an existing method or you are going to devise your own approach.
- The dataset(s) you will employ in your evaluation. Provide your preliminary results (if any).
Deadlines are much closer than they appear
Last time… Soft-margin Classifier

⟨w, x⟩ + b ≥ 1 for positive examples, ⟨w, x⟩ + b ≤ −1 for negative examples.

Theorem (Minsky & Papert): Finding the minimum error separating hyperplane is NP-hard.
Hence, computing the minimum error separator is practically impossible.
Last time… Adding Slack Variables

Relax the constraints with slack variables ξ_i ≥ 0:
  ⟨w, x⟩ + b ≥ 1 − ξ for positive examples, ⟨w, x⟩ + b ≤ −1 + ξ for negative examples.

Convex optimization problem: minimize the amount of slack.

Points with 0 < ξ_i ≤ 1 lie inside the margin but are still correctly classified.

slide by Alex Smola (adapted from Andrew Zisserman)
Last time… Adding Slack Variables

Hard-margin problem:
  minimize_{w,b}  ½ ‖w‖²
  subject to  y_i [⟨w, x_i⟩ + b] ≥ 1

Soft-margin problem:
  minimize_{w,b}  ½ ‖w‖² + C Σ_i ξ_i
  subject to  y_i [⟨w, x_i⟩ + b] ≥ 1 − ξ_i  and  ξ_i ≥ 0

The soft-margin problem is always feasible. Proof: w = 0, b = 0 and ξ_i = 1 satisfy all constraints (this also yields an upper bound on the optimal value).

slide by Alex Smola
minimize_{w,b}  ½ ‖w‖² + C Σ_i ξ_i
subject to  y_i [⟨w, x_i⟩ + b] ≥ 1 − ξ_i  and  ξ_i ≥ 0

C is a regularization parameter:
- small C: slack is cheap, constraints are easily violated → large margin
- large C: slack is expensive, constraints are hard to violate → narrow margin
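To make the soft-margin primal concrete, here is a minimal sketch that solves it as a quadratic program with cvxpy on synthetic data. The solver library, the toy data, and the variable names are illustrative assumptions; the slides do not prescribe any particular implementation.

```python
import numpy as np
import cvxpy as cp  # assumed QP solver library, not from the slides

rng = np.random.default_rng(0)
m, d, C = 60, 2, 1.0
# two roughly separable Gaussian blobs with labels +1 / -1
X = np.vstack([rng.normal(loc=+1.5, size=(m // 2, d)),
               rng.normal(loc=-1.5, size=(m // 2, d))])
y = np.hstack([np.ones(m // 2), -np.ones(m // 2)])

w, b, xi = cp.Variable(d), cp.Variable(), cp.Variable(m)
objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
constraints = [cp.multiply(y, X @ w + b) >= 1 - xi,  # y_i [<w, x_i> + b] >= 1 - xi_i
               xi >= 0]
cp.Problem(objective, constraints).solve()
print(w.value, b.value, float(xi.value.sum()))
```

Increasing C makes slack more expensive, which shrinks the total slack and narrows the margin.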
How do we predict the correct labels when there are more than two classes? The "score" of the correct class must be better than the "score" of the wrong classes.

[Figure: weight vectors w_+, w_−, w_o of a multi-class linear classifier]

To predict, we use the class whose weight vector gives the highest score. Now, can we learn it?

slide by Eric Xing
Last time… Quadratic Features

Quadratic features in R²:
  Φ(x) := (x₁², √2 x₁x₂, x₂²)

Dot product:
  ⟨Φ(x), Φ(x')⟩ = ⟨(x₁², √2 x₁x₂, x₂²), (x'₁², √2 x'₁x'₂, x'₂²)⟩ = ⟨x, x'⟩²

Insight: the trick works for any polynomial of order d via ⟨x, x'⟩^d.

slide by Alex Smola
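A quick numerical check of the identity above; this NumPy snippet is an illustrative sketch, not part of the slides.

```python
import numpy as np

def phi(x):
    # explicit quadratic feature map on R^2: (x1^2, sqrt(2) x1 x2, x2^2)
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

rng = np.random.default_rng(0)
x, xp = rng.normal(size=2), rng.normal(size=2)

explicit = phi(x) @ phi(xp)           # dot product in feature space
kernel = (x @ xp) ** 2                # kernel evaluation in input space
print(np.isclose(explicit, kernel))   # True: <Phi(x), Phi(x')> = <x, x'>^2
```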
Problem: Extracting features can sometimes be very costly. Example: second-order features in 1000 dimensions already lead to roughly 5 · 10⁵ numbers; for higher-order polynomial features it is much worse.

Solution: Don't compute the features, try to compute dot products directly.

Definition: A kernel function k : X × X → R is a symmetric function in its arguments for which the following property holds:
  k(x, x') = ⟨Φ(x), Φ(x')⟩ for some feature map Φ.
If k(x, x') is much cheaper to compute than Φ(x) …

slide by Alex Smola
Examples of kernels k(x, x'):
- Linear: ⟨x, x'⟩
- Laplacian RBF: exp(−λ‖x − x'‖)
- Gaussian RBF: exp(−λ‖x − x'‖²)
- Polynomial: (⟨x, x'⟩ + c)^d,  c ≥ 0, d ∈ N
- B-Spline: B_{2n+1}(x − x')
- Cond. Expectation: E_c[p(x|c) p(x'|c)]

Simple trick for checking Mercer's condition: compute the Fourier transform of the kernel and check that it is nonnegative.

slide by Alex Smola
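As a small illustration (not from the slides), a few of these kernels can be evaluated on whole data matrices at once; the function and variable names below are assumptions.

```python
import numpy as np

def linear_kernel(X, Z):
    return X @ Z.T

def gaussian_rbf_kernel(X, Z, lam=1.0):
    # exp(-lam * ||x - z||^2) for all pairs, using ||x - z||^2 = ||x||^2 + ||z||^2 - 2<x, z>
    sq = (X**2).sum(1)[:, None] + (Z**2).sum(1)[None, :] - 2 * X @ Z.T
    return np.exp(-lam * sq)

def polynomial_kernel(X, Z, c=1.0, d=3):
    return (X @ Z.T + c) ** d

X = np.random.default_rng(0).normal(size=(5, 3))
K = gaussian_rbf_kernel(X, X)
# a Mercer kernel matrix is symmetric positive semidefinite
print(np.allclose(K, K.T), np.linalg.eigvalsh(K).min() >= -1e-10)
```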
The dual SVM with a linear kernel:
  f(x) = Σ_i α_i y_i ⟨x_i, x⟩ + b

  maximize_α  −½ Σ_{i,j} α_i α_j y_i y_j ⟨x_i, x_j⟩ + Σ_i α_i
  subject to  Σ_i α_i y_i = 0  and  α_i ∈ [0, C]

corresponding to the primal
  minimize_{w,b}  ½ ‖w‖² + C Σ_i ξ_i
  subject to  y_i [⟨w, x_i⟩ + b] ≥ 1 − ξ_i  and  ξ_i ≥ 0.

The kernel trick: replace every inner product by a kernel k,
  f(x) = Σ_i α_i y_i k(x_i, x) + b

  maximize_α  −½ Σ_{i,j} α_i α_j y_i y_j k(x_i, x_j) + Σ_i α_i
  subject to  Σ_i α_i y_i = 0  and  α_i ∈ [0, C]

corresponding to the primal
  minimize_{w,b}  ½ ‖w‖² + C Σ_i ξ_i
  subject to  y_i [⟨w, φ(x_i)⟩ + b] ≥ 1 − ξ_i  and  ξ_i ≥ 0.
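In practice this dual QP is solved by standard SVM libraries; the following scikit-learn sketch (an illustrative assumption, not part of the slides) shows the effect of the kernel and of C.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

for C in (0.1, 1, 100):
    clf = SVC(kernel="rbf", gamma=1.0, C=C).fit(X, y)
    # small C: wide margin, many support vectors; large C: few margin violations
    print(f"C={C:>5}: {clf.n_support_.sum()} support vectors, "
          f"train acc = {clf.score(X, y):.2f}")
```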
[Figures: decision boundaries (level sets y = −1, 0, +1) and support vectors of kernel SVMs trained with C = 1, 2, 5, 10, 20, 50, 100, shown for several datasets and kernels. Slides by Alex Smola.]
Do we need to worry about overfitting? This relates to generalization (we will see this in a couple of lectures).

slide by Alex Smola
Risk and Loss

The constrained soft-margin problem
  minimize_{w,b}  ½ ‖w‖² + C Σ_i ξ_i
  subject to  y_i [⟨w, x_i⟩ + b] ≥ 1 − ξ_i  and  ξ_i ≥ 0
is equivalent to the unconstrained problem
  minimize_{w,b}  ½ ‖w‖² + C Σ_i max[0, 1 − y_i [⟨w, x_i⟩ + b]].
This follows from plugging in the minimal slack variable for a given (w, b) pair. The second term is the empirical risk under the hinge loss.

[Figure: the hinge loss max(0, 1 − y f(x)) is a convex upper bound on the binary loss 1{y f(x) < 0}, plotted as a function of the margin y f(x).]

slide by Alex Smola
Convex surrogates for the binary loss, written as functions of the margin f(x) for a positive example:
- Hinge (soft-margin) loss: max(0, 1 − f(x))
- A smoothed, piecewise-quadratic variant:
    0               if f(x) > 1
    ½ (1 − f(x))²   if f(x) ∈ [0, 1]
    ½ − f(x)        if f(x) < 0
- Logistic loss: log[1 + e^{−f(x)}]
All of these are (asymptotically) linear for very negative margins and (asymptotically) 0 for very positive margins.
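A minimal NumPy sketch of these surrogates as functions of the margin y·f(x) (an illustration, not from the slides):

```python
import numpy as np

def zero_one_loss(margin):           # 1{y f(x) < 0}
    return (margin < 0).astype(float)

def hinge_loss(margin):              # max(0, 1 - y f(x))
    return np.maximum(0.0, 1.0 - margin)

def smoothed_hinge_loss(margin):     # piecewise-quadratic variant from above
    return np.where(margin > 1, 0.0,
           np.where(margin < 0, 0.5 - margin, 0.5 * (1.0 - margin) ** 2))

def logistic_loss(margin):           # log(1 + exp(-y f(x)))
    return np.log1p(np.exp(-margin))

m = np.linspace(-3, 3, 7)
print(np.all(hinge_loss(m) >= zero_one_loss(m)))  # hinge upper-bounds the 0/1 loss
```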
Binary loss and risk:
  R[f] := E_{(x,y) ∼ p(x,y)} [1{y f(x) < 0}]                      (expected risk)
  R_emp[f] := (1/m) Σ_{i=1}^m 1{y_i f(x_i) < 0}                   (empirical risk)
  R_reg[f] := (1/m) Σ_{i=1}^m max(0, 1 − y_i f(x_i)) + λ Ω[f]     (regularized risk)

The regularization term Ω[f] penalizes complex functions; λ controls how strongly.

Minimizing the binary empirical risk directly is problematic:
- Minimization is nonconvex.
- We overfit as we minimize the empirical error.

slide by Alex Smola
The same construction works for a general loss l(y, f(x)):
  R[f] := E_{(x,y) ∼ p(x,y)} [l(y, f(x))]
  R_emp[f] := (1/m) Σ_{i=1}^m l(y_i, f(x_i))
  R_reg[f] := (1/m) Σ_{i=1}^m l(y_i, f(x_i)) + λ Ω[f]

Loss functions for regression:
- Squared loss: l(y, f(x)) = ½ (y − f(x))²
- Absolute loss: l(y, f(x)) = |y − f(x)|
- ε-insensitive loss: l(y, f(x)) = max(0, |y − f(x)| − ε), which allows some deviation without a penalty.

slide by Alex Smola
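These three regression losses in NumPy (an illustrative sketch, not from the slides):

```python
import numpy as np

def squared_loss(y, f):
    return 0.5 * (y - f) ** 2

def absolute_loss(y, f):
    return np.abs(y - f)

def eps_insensitive_loss(y, f, eps=0.1):
    # zero penalty inside the epsilon-tube, linear outside
    return np.maximum(0.0, np.abs(y - f) - eps)

f = np.linspace(0.0, 2.0, 9)            # predictions around the target y = 1
print(eps_insensitive_loss(1.0, f))     # flat (zero) near f = y, linear farther away
```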
Penalized least mean squares:
  minimize_w  (1/2m) Σ_{i=1}^m (y_i − ⟨x_i, w⟩)² + (λ/2) ‖w‖²

Setting the derivative to zero:
  ∂_w [...] = (1/m) Σ_{i=1}^m [x_i x_iᵀ w − x_i y_i] + λ w = [(1/m) X Xᵀ + λ 1] w − (1/m) X y = 0
hence
  w = [X Xᵀ + λ m 1]⁻¹ X y.

X Xᵀ is the outer-product matrix of the data in X; the linear system can be solved, e.g., by Conjugate Gradient or via the Sherman-Morrison-Woodbury identity.

slide by Alex Smola
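A minimal NumPy sketch of this closed-form solution (illustrative; here X stores one example per column so that w = (X Xᵀ + λ m 1)⁻¹ X y applies directly):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, lam = 3, 50, 0.1
X = rng.normal(size=(d, m))                      # one example per column
w_true = np.array([1.0, -2.0, 0.5])
y = w_true @ X + 0.01 * rng.normal(size=m)

w = np.linalg.solve(X @ X.T + lam * m * np.eye(d), X @ y)
print(w)                                         # close to w_true
```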
Penalized least mean squares ... now with kernels

  minimize_w  (1/2m) Σ_{i=1}^m (y_i − ⟨φ(x_i), w⟩)² + (λ/2) ‖w‖²

Decompose w = w_∥ + w_⊥ into a part w_∥ in the span of the φ(x_i) and an orthogonal part w_⊥. The empirical-risk term depends on w only through the inner products ⟨φ(x_i), w⟩, i.e. only on w_∥, while ‖w‖² = ‖w_∥‖² + ‖w_⊥‖². Hence the optimum has w_⊥ = 0 and w lies in the span of the data.

slide by Alex Smola
Ansatz: expand w in the span of the mapped data and solve for the coefficients as a linear system:
  w = Σ_i α_i φ(x_i)

Substituting into
  minimize_w  (1/2m) Σ_{i=1}^m (y_i − ⟨φ(x_i), w⟩)² + (λ/2) ‖w‖²
and writing K_ij = ⟨φ(x_i), φ(x_j)⟩ gives
  minimize_α  (1/2m) Σ_{i=1}^m (y_i − Σ_j K_ij α_j)² + (λ/2) Σ_{i,j} α_i α_j K_ij
with solution
  α = (K + mλ 1)⁻¹ y.
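A kernel ridge regression sketch following α = (K + mλ 1)⁻¹ y; the Gaussian RBF kernel and the sinc target are illustrative assumptions, not from the slide:

```python
import numpy as np

def rbf_kernel(A, B, gamma=10.0):
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(40, 1))
y = np.sinc(X[:, 0]) + 0.05 * rng.normal(size=40)

m, lam = len(y), 1e-3
K = rbf_kernel(X, X)
alpha = np.linalg.solve(K + m * lam * np.eye(m), y)   # alpha = (K + m*lambda*I)^{-1} y

X_test = np.linspace(-3, 3, 5).reshape(-1, 1)
print(rbf_kernel(X_test, X) @ alpha)                  # f(x) = sum_i alpha_i k(x_i, x)
```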
SVM Regression (ε-insensitive loss)

[Figure: data points around a regression curve with an ε-tube (+ε / −ε); points outside the tube incur slack ξ. The loss as a function of y − f(x) is zero inside the tube and grows linearly outside: we don't care about deviations within the tube.]

slide by Alex Smola
Primal problem:
  minimize_{w,b}  ½ ‖w‖² + C Σ_{i=1}^m [ξ_i + ξ_i*]
  subject to  ⟨w, x_i⟩ + b ≤ y_i + ε + ξ_i   and  ξ_i ≥ 0
              ⟨w, x_i⟩ + b ≥ y_i − ε − ξ_i*  and  ξ_i* ≥ 0

Lagrange function:
  L = ½ ‖w‖² + C Σ_{i=1}^m [ξ_i + ξ_i*] − Σ_{i=1}^m [η_i ξ_i + η_i* ξ_i*]
      + Σ_{i=1}^m α_i [⟨w, x_i⟩ + b − y_i − ε − ξ_i]
      + Σ_{i=1}^m α_i* [y_i − ε − ξ_i* − ⟨w, x_i⟩ − b]

slide by Alex Smola
Setting the derivatives of the Lagrangian to zero:
  ∂_w L = 0 = w + Σ_i [α_i − α_i*] x_i
  ∂_b L = 0 = Σ_i [α_i − α_i*]
  ∂_{ξ_i} L = 0 = C − η_i − α_i
  ∂_{ξ_i*} L = 0 = C − η_i* − α_i*

Substituting back yields the dual problem:
  minimize_{α,α*}  ½ (α − α*)ᵀ K (α − α*) + ε 1ᵀ(α + α*) + yᵀ(α − α*)
  subject to  1ᵀ(α − α*) = 0  and  α_i, α_i* ∈ [0, C]

This is solved the same way as the classification problem, but with an added quadratic penalty on the coefficients.

slide by Alex Smola
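In practice the ε-SVR dual is solved by standard libraries; this scikit-learn sketch (an assumption for illustration, not from the slides) shows how the tube width ε affects the number of support vectors:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sinc(X[:, 0]) + 0.1 * rng.normal(size=200)

for eps in (0.1, 0.2, 0.5):
    model = SVR(kernel="rbf", C=10.0, epsilon=eps).fit(X, y)
    # a wider tube ignores more points, leaving fewer support vectors
    print(f"epsilon={eps}: {len(model.support_)} support vectors")
```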
[Figures: SVR approximation of sinc x with ε-tubes of width 0.1, 0.2, and 0.5 (curves sinc x + ε, sinc x − ε, and the approximation), together with the resulting support vectors. Slides by Alex Smola.]
Huber's robust loss: quadratic for small residuals, linear for large ones,
  l(y, f(x)) = ½ (y − f(x))²   if |y − f(x)| < 1
               |y − f(x)| − ½   otherwise
This behaves like a trimmed mean estimator.

slide by Alex Smola
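A tiny NumPy sketch of this loss (illustrative; the threshold of 1 from the formula above is exposed as a parameter):

```python
import numpy as np

def huber_loss(y, f, delta=1.0):
    # quadratic inside |y - f| < delta, linear outside
    r = np.abs(y - f)
    return np.where(r < delta, 0.5 * r**2, delta * (r - 0.5 * delta))

r = np.linspace(-3, 3, 7)
print(huber_loss(r, 0.0))   # squared-loss behaviour near 0, absolute-loss behaviour far out
```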
SVMs in practice: large problems can be handled via approximate methods, and they achieve strong results in many example applications (e.g., handwritten digit recognition, compared against LeNet's 0.9% error).

slide by Alex Smola
SVM software offers fast optimization, can handle very large datasets, and is available as C++ code; implementations also deal with unbalanced data, etc., and are distributed as open-source software.

slide by Sanja Fidler
Decision Trees

slide by Alex Smola