Machine Learning
Fall 2017
Professor Liang Huang
Kernels
(Kernels, Kernelized Perceptron and SVM)
(Chap. 12 of CIML)
Nonlinear Features
XOR data: x1: +1, x2: −1, x3: +1, x4: −1
Concatenated (combined) features for XOR: x = (x1, x2, x1·x2)
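A tiny numpy illustration of this idea (a sketch; the slide only gives the labels, so the coordinates below assume the standard XOR corners (±1, ±1)): with the extra conjunction feature x1·x2, the four points become linearly separable.

import numpy as np

# assumed XOR corners; only the labels +1/-1/+1/-1 come from the slide
X = np.array([[ 1,  1],    # x1: +1
              [ 1, -1],    # x2: -1
              [-1, -1],    # x3: +1
              [-1,  1]])   # x4: -1
y = np.array([+1, -1, +1, -1])

# concatenated (combined) features: (x1, x2, x1*x2)
Phi = np.hstack([X, (X[:, 0] * X[:, 1])[:, None]])

# the weight vector w = (0, 0, 1) separates the mapped data: sign(x1*x2) = label
w = np.array([0.0, 0.0, 1.0])
print(np.sign(Phi @ w))   # [ 1. -1.  1. -1.]  == y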
[Figure: the XOR points x1: +1, x2: −1, x3: +1, x4: −1 mapped x → φ(x) into feature space]
Separating surfaces with quadratic features: circles, hyperbolae, parabolae
Problem: extracting features can sometimes be very costly. Example: second-order features in 1000 dimensions already give about 5 · 10⁵ numbers; for higher-order polynomial features it is much worse. Solution: don't compute the features, compute dot products instead.
Definition: a kernel function k : X × X → ℝ is a symmetric function of its arguments for which the following property holds: k(x, x′) = ⟨Φ(x), Φ(x′)⟩ for some feature map Φ. If k(x, x′) is much cheaper to compute than Φ(x) . . .
Quadratic Features in ℝ²
  Φ(x) := (x1², √2·x1x2, x2²)
Dot product:
  ⟨Φ(x), Φ(x′)⟩ = ⟨(x1², √2·x1x2, x2²), (x1′², √2·x1′x2′, x2′²)⟩ = ⟨x, x′⟩² = k(x, x′)
Insight: the trick works for any polynomial of order d via ⟨x, x′⟩^d (just inner products in the original space).
For x ∈ ℝⁿ with the quadratic φ: naive φ(x) is O(n²), and φ(x)·φ(x′) is O(n²); the kernel k(x, x′) is O(n).
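A quick numerical sanity check of this identity (an illustrative sketch, not from the lecture):

import numpy as np

# For Phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2), the dot product <Phi(x), Phi(x')>
# equals the kernel k(x, x') = <x, x'>^2, computed in O(n) instead of O(n^2).
def phi(x):
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

rng = np.random.default_rng(0)
x, xp = rng.normal(size=2), rng.normal(size=2)

explicit = phi(x) @ phi(xp)     # build the O(n^2) features, then take the dot product
kernel   = (x @ xp) ** 2        # O(n) kernel evaluation
print(np.isclose(explicit, kernel))   # True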
initialize w ← 0, b ← 0
repeat
  pick (xi, yi) from the data
  if yi(w · Φ(xi) + b) ≤ 0 then
    w ← w + yi Φ(xi)
    b ← b + yi
until yi(w · Φ(xi) + b) > 0 for all i
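A runnable Python sketch of the primal perceptron above (function and variable names are mine; the feature map phi is passed in):

import numpy as np

def primal_perceptron(X, y, phi, max_epochs=100):
    d = len(phi(X[0]))
    w, b = np.zeros(d), 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):                 # go over the data
            if yi * (w @ phi(xi) + b) <= 0:      # mistake (or on the boundary)
                w = w + yi * phi(xi)             # w <- w + yi * phi(xi)
                b = b + yi                       # b <- b + yi
                mistakes += 1
        if mistakes == 0:                        # all examples correct: stop
            break
    return w, b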
w = ∑_{i∈I} αi φ(xi)
f(x) = ∑_{i∈I} αi ⟨φ(xi), φ(x)⟩
initialize f = 0
repeat
  pick (xi, yi) from the data
  if yi f(xi) ≤ 0 then
    f(·) ← f(·) + yi k(xi, ·) + yi
until yi f(xi) > 0 for all i
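The kernelized counterpart, as a sketch: instead of w we keep one coefficient αi per training example (plus a bias b), so f(x) = ∑_i αi k(xi, x) + b, matching the update f(·) ← f(·) + yi k(xi, ·) + yi above (names are mine):

import numpy as np

def kernel_perceptron(X, y, k, max_epochs=100):
    n = len(X)
    alpha, b = np.zeros(n), 0.0
    f = lambda x: sum(alpha[j] * k(X[j], x) for j in range(n)) + b
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(n):
            if y[i] * f(X[i]) <= 0:       # mistake on (x_i, y_i)
                alpha[i] += y[i]          # f(.) <- f(.) + y_i k(x_i, .)
                b += y[i]                 # ... + y_i  (bias update)
                mistakes += 1
        if mistakes == 0:
            break
    return alpha, b

# example kernel: quadratic k(x, x') = (x . x')^2
quad = lambda x, xp: float(np.dot(x, xp)) ** 2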
f(x) = ∑_{i∈I} αi ⟨φ(xi), φ(x)⟩ = ∑_{i∈I} αi k(xi, x)
w = ∑_{i∈I} αi φ(xi)
Functional Form
Dual Form: update linear coefficients αi ← αi + yi (increase its vote by 1); implicitly w = ∑_{i∈I} αi φ(xi), so classification uses f(x) = ∑_{i∈I} αi ⟨φ(xi), φ(x)⟩ = ∑_{i∈I} αi k(xi, x)
implicitly equivalent to:
Primal Form: update weights w ← w + yi φ(xi); classify with f(x) = w · φ(x)
Dual Perceptron
update: αi ← αi + yi, implicitly w = ∑_{i∈I} αi φ(xi)
classify: f(x) = w · φ(x) = [∑_{i∈I} αi φ(xi)] · φ(x) = ∑_{i∈I} αi ⟨φ(xi), φ(x)⟩ = ∑_{i∈I} αi k(xi, x)
  (explicit ⟨φ(xi), φ(x)⟩ is slow, O(d²); the kernel k(xi, x) is fast, O(d))

initialize αi = 0 for all i
repeat
  pick (xi, yi) from the data
  if yi f(xi) ≤ 0 then αi ← αi + yi
until yi f(xi) > 0 for all i
Dual Perceptron vs. Primal Perceptron
Dual: update the linear coefficients αi ← αi + yi; implicitly w = ∑_{i∈I} αi φ(xi); classify with f(x) = ∑_{i∈I} αi k(xi, x) (fast, O(d), vs. slow, O(d²), for explicit ⟨φ(xi), φ(x)⟩)
Primal: update the weights w ← w + yi φ(xi); classify with f(x) = w · φ(x)
if #features >> #examples, the dual is easier
Q: when is #features >> #examples? A: higher-order polynomial kernels
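To see how quickly the feature count grows, note that a degree-d (inhomogeneous) polynomial kernel over ℝⁿ corresponds to all monomials of degree ≤ d, i.e. C(n+d, d) features. A short illustrative check:

from math import comb

# dimension of the feature space of the polynomial kernel (<x, x'> + c)^d:
# all monomials of degree <= d in n variables, i.e. C(n + d, d)
for n, d in [(1000, 2), (1000, 3), (1000, 5)]:
    print(n, d, comb(n + d, d))
# 1000 2 501501          -> already ~5*10^5 features
# 1000 3 167668501       -> far more features than training examples
# 1000 5 8459043543951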
Pros/Cons of Kernel in Dual
Dual Perceptron: update linear coefficients αi ← αi + yi, implicitly w = ∑_{i∈I} αi φ(xi); classify with f(x) = ∑_{i∈I} αi k(xi, x)
Pro: each kernel evaluation k(xi, x) is fast, O(d), where the explicit ⟨φ(xi), φ(x)⟩ would be slow, O(d²)
Con: the training examples must be kept around (memory), and classifying a test point means going over the training examples
Dual Perceptron vs. Primal Perceptron: a worked example with the linear kernel (identity map)
data: x1 = (0, 1): −1, x2 = (2, 1): +1, x3 = (0, −1): +1

Primal Perceptron (update w on each new example):
  x1: −1 → w = (0, −1)
  x2: +1 → w = (2, 0)
  x3: +1 → w = (2, −1)

Dual Perceptron (update α on each new example; w implicit):
  x1: −1 → α = (−1, 0, 0)
  x2: +1 → α = (−1, 1, 0)
  x3: +1 → α = (−1, 1, 1), i.e. w = −x1 + x2 + x3; final implicit w = (2, −1)

geometric interpretation: classify x as positive when the sum of its dot-products with x2 and x3 is bigger than its dot-product with x1 (agreement with positives > agreement with negatives)
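A quick numpy check of this worked example (the three points and labels are taken from the slide):

import numpy as np

# with the linear kernel (identity map), the dual coefficients alpha = (-1, +1, +1)
# give the same weight vector as the primal updates
X = np.array([[0,  1],    # x1, label -1
              [2,  1],    # x2, label +1
              [0, -1]])   # x3, label +1
alpha = np.array([-1, 1, 1])
w = alpha @ X              # -x1 + x2 + x3
print(w)                   # [ 2 -1]  -> matches the primal w = (2, -1)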
Dual Perceptron with the quadratic kernel on the XOR data (x1: +1, x2: −1, x3: +1, x4: −1)
update on each new example (w implicit):
  x1: +1 → α = (+1, 0, 0, 0), i.e. w = φ(x1)
  x2: −1 → α = (+1, −1, 0, 0), i.e. w = φ(x1) − φ(x2)

classification rule in the dual, geometrically:
  (x · x1)² > (x · x2)² ⇒ cos²θ1 > cos²θ2 ⇒ |cos θ1| > |cos θ2|
in the dual, algebraically (writing the test point's own coordinates as x1, x2):
  (x · x1)² > (x · x2)² ⇒ (x1 + x2)² > (x1 − x2)² ⇒ x1 x2 > 0

also verify in the primal:
  k(x, x′) = (x · x′)² ⇔ φ(x) = (x1², x2², √2 x1x2)
  w = φ(x1) − φ(x2) = (0, 0, 2√2)
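A numerical check of this XOR run. The slide gives only the labels, so the coordinates below are an assumption (the standard XOR corners), chosen to be consistent with w = (0, 0, 2√2):

import numpy as np

phi  = lambda x: np.array([x[0]**2, x[1]**2, np.sqrt(2)*x[0]*x[1]])
quad = lambda x, xp: float(np.dot(x, xp))**2            # k(x, x') = (x . x')^2

x1, x2, x3, x4 = np.array([1,1]), np.array([1,-1]), np.array([-1,-1]), np.array([-1,1])

# primal view: w = phi(x1) - phi(x2) = (0, 0, 2*sqrt(2))
print(phi(x1) - phi(x2))

# dual view: f(x) = k(x1, x) - k(x2, x) = (x.x1)^2 - (x.x2)^2; check all four points
for x, y in [(x1, +1), (x2, -1), (x3, +1), (x4, -1)]:
    f = quad(x1, x) - quad(x2, x)
    print(x, y, f, np.sign(f) == y)       # sign agrees with the label on every point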
Idea: we want to extend k(x, x′) = ⟨x, x′⟩² to k(x, x′) = (⟨x, x′⟩ + c)^d where c > 0 and d ∈ ℕ. Prove that such a kernel corresponds to a dot product.
Proof strategy: simple and straightforward: compute the explicit sum given by the kernel, i.e.
  k(x, x′) = (⟨x, x′⟩ + c)^d = ∑_{i=0}^{d} C(d, i) ⟨x, x′⟩^i c^(d−i)
Individual terms ⟨x, x′⟩^i are dot products for some Φi(x).
+c is just augmenting the space. Simpler proof: add an extra coordinate x0 = √c to every x, so that ⟨x, x′⟩ + c becomes an ordinary dot product in the augmented space.
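A quick numerical check of both arguments (a sketch; the test vectors and the choice c = 2, d = 3 are arbitrary):

import numpy as np
from math import comb

rng = np.random.default_rng(1)
x, xp = rng.normal(size=3), rng.normal(size=3)
c, d = 2.0, 3

# binomial expansion: (<x, x'> + c)^d = sum_i C(d, i) <x, x'>^i c^(d - i)
kernel   = (x @ xp + c) ** d
binomial = sum(comb(d, i) * (x @ xp)**i * c**(d - i) for i in range(d + 1))

# simpler proof: augment each x with an extra coordinate x0 = sqrt(c),
# then <x~, x~'> = <x, x'> + c, so the kernel is just <x~, x~'>^d
x_aug, xp_aug = np.append(x, np.sqrt(c)), np.append(xp, np.sqrt(c))
augmented = (x_aug @ xp_aug) ** d

print(np.allclose([binomial, augmented], kernel))   # True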
Dual Perceptron (another run, now with five training examples x1 … x5)
update on each new example (w implicit):
  x1: +1 → α = (+1, 0, 0, 0, 0), i.e. w = φ(x1)
  x2: −1 → α = (+1, −1, 0, 0, 0), i.e. w = φ(x1) − φ(x2)
  x3: −1 → α = (+1, −1, −1, 0, 0)
k(x, x′) = (x · x′)² ⇔ φ(x) = (x1², x2², √2 x1x2)
k(x, x′) = (x · x′ + 1)² ⇔ φ(x) = ?
Examples of kernels k(x, x′):
  Linear: ⟨x, x′⟩
  Laplacian RBF: exp(−λ‖x − x′‖)
  Gaussian RBF: exp(−λ‖x − x′‖²)
  Polynomial: (⟨x, x′⟩ + c)^d, c ≥ 0, d ∈ ℕ
  B-Spline: B_{2n+1}(x − x′)
  E_c[p(x|c) p(x′|c)]
Simple trick for checking Mercer's condition: compute the Fourier transform of the kernel and check that it is nonnegative.
you only need to know polynomial and gaussian.
(RBF kernels distort distance; polynomial kernels distort angle)
The Theorem: for any symmetric function k : X × X → ℝ which is square integrable in X × X and which satisfies
  ∫_{X×X} k(x, x′) f(x) f(x′) dx dx′ ≥ 0 for all f ∈ L2(X),
there exist φi : X → ℝ and numbers λi ≥ 0 with
  k(x, x′) = ∑_i λi φi(x) φi(x′) for all x, x′ ∈ X.
Interpretation: the double integral is the continuous version of a vector-matrix-vector multiplication. For positive semidefinite matrices we have ∑_i ∑_j k(xi, xj) αi αj ≥ 0.
Distance in Feature Space: distance between points in feature space via
  d(x, x′)² := ‖Φ(x) − Φ(x′)‖² = ⟨Φ(x), Φ(x)⟩ − 2⟨Φ(x), Φ(x′)⟩ + ⟨Φ(x′), Φ(x′)⟩ = k(x, x) + k(x′, x′) − 2k(x, x′)
Kernel Matrix: to compare observations we compute dot products, so we study the matrix K given by Kij = ⟨Φ(xi), Φ(xj)⟩ = k(xi, xj), where the xi are the training patterns.
Similarity Measure: the entries Kij tell us the overlap between Φ(xi) and Φ(xj), so k(xi, xj) is a similarity measure.
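A short sketch of both objects, the kernel (Gram) matrix and the kernel-induced distance, using a Gaussian RBF kernel as the example (data and names are mine):

import numpy as np

def gram_matrix(X, k):
    # K_ij = k(x_i, x_j)
    n = len(X)
    return np.array([[k(X[i], X[j]) for j in range(n)] for i in range(n)])

def kernel_distance(x, xp, k):
    # d(x, x')^2 = k(x, x) + k(x', x') - 2 k(x, x')
    return np.sqrt(k(x, x) + k(xp, xp) - 2 * k(x, xp))

rbf = lambda x, xp, sigma=1.0: np.exp(-np.sum((x - xp)**2) / (2 * sigma**2))

X = np.random.default_rng(2).normal(size=(5, 2))
K = gram_matrix(X, rbf)
print(K.shape, kernel_distance(X[0], X[1], rbf))
# for a Mercer kernel, K is symmetric positive semidefinite:
print(np.all(np.linalg.eigvalsh(K) >= -1e-9))   # True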
for HW2, you don’t need to randomly choose training examples. just go over all training examples in the original order, and call that an epoch (same as HW1).
SVM with the Gaussian RBF kernel (the default kernel in sklearn):
  f(x) = ∑_{i=1}^{N} αi yi exp(−‖x − xi‖² / 2σ²) + b
[Figure: decision function contours f(x) = +1, f(x) = 0, f(x) = −1 for σ = 1.0, C = ∞]
σ = 1.0, C = 100 and σ = 1.0, C = 10: decreasing C gives a wider (soft) margin.
[Figures: the same decision function f(x) = ∑_{i=1}^{N} αi yi exp(−‖x − xi‖²/2σ²) + b, plotted for σ = 1.0 with C = 100, C = 10, and C = ∞]
σ = 0.25, C = ∞ and σ = 0.1, C = ∞: decreasing σ moves towards a nearest-neighbour classifier.
[Figures: the same decision function for σ = 0.25 and σ = 0.1 with C = ∞]
this is in contrast with C: smaller C => wider margin (underfitting); larger C => narrower margin (overfitting)
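A small sklearn sketch of these settings. Note that sklearn's SVC parameterizes the RBF kernel as exp(−γ‖x − x′‖²), so γ corresponds to 1/(2σ²) in the notation above, and C = ∞ is approximated by a very large C; the toy data is mine, not the lecture's:

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0]**2 + X[:, 1]**2 > 1.0, 1, -1)      # toy nonlinear labels

for sigma, C in [(1.0, 1e6), (1.0, 100), (1.0, 10), (0.25, 1e6), (0.1, 1e6)]:
    clf = SVC(kernel='rbf', gamma=1.0 / (2 * sigma**2), C=C).fit(X, y)
    print(f"sigma={sigma:<5} C={C:<8g} train acc={clf.score(X, y):.3f} "
          f"#support vectors={clf.support_vectors_.shape[0]}")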
K-nearest neighbours (K = 1: predict the label of the training example closest to x)

  K     training error    testing error
  1     0.0               0.15
  3     0.0760            0.1340
  7     0.1320            0.1110
  21    0.1120            0.0920
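A small sklearn sketch of the same K sweep; the data here is a synthetic stand-in, so the errors will not match the table above:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 2))
y_train = np.where(X_train.sum(axis=1) + rng.normal(scale=0.5, size=200) > 0, 1, -1)
X_test = rng.normal(size=(200, 2))
y_test = np.where(X_test.sum(axis=1) > 0, 1, -1)

for K in [1, 3, 7, 21]:
    knn = KNeighborsClassifier(n_neighbors=K).fit(X_train, y_train)
    print(f"K={K:<3} train error={1 - knn.score(X_train, y_train):.4f} "
          f"test error={1 - knn.score(X_test, y_test):.4f}")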
small k: overfitting; large k: underfitting
what about k=N?