Learning From Data, Lecture 25: The Kernel Trick
Learning with only inner products: The Kernel
- M. Magdon-Ismail
CSCI 4100/6100
recap: Large Margin is Better

Controlling Overfitting:

[Figure: $E_{\text{out}}$ versus $\gamma(\text{random hyperplane})/\gamma(\text{SVM})$; the SVM's larger margin gives lower $E_{\text{out}}$ than a random separating hyperplane, with complexity scaling like $R^2/\gamma^2$.]

$$E_{\text{cv}} \le \frac{\#\,\text{support vectors}}{N}$$

Non-Separable Data:

$$\underset{b,w,\xi}{\text{minimize}}\ \tfrac{1}{2} w^t w + C\sum_{n=1}^{N}\xi_n \qquad \text{subject to: } y_n(w^t x_n + b) \ge 1 - \xi_n,\ \ \xi_n \ge 0 \ \text{ for } n = 1,\dots,N$$

[Figure: decision boundaries for $\Phi_2$ + SVM, $\Phi_3$ + SVM, and $\Phi_3$ + pseudoinverse algorithm.]
Complex hypothesis that does not overfit because it is ‘simple’, controlled by only a few support vectors.
Mechanics of the nonlinear transform
$$x_n \in \mathcal{X}\ \xrightarrow{\ \Phi\ }\ z_n = \Phi(x_n) \in \mathcal{Z} \qquad (\text{and back via `}\Phi^{-1}\text{' to interpret the result in } \mathcal{X})$$

Learn $\tilde g(z) = \mathrm{sign}(\tilde w^t z)$ in the $\mathcal{Z}$-space; the final hypothesis in the $\mathcal{X}$-space is $g(x) = \tilde g(\Phi(x)) = \mathrm{sign}(\tilde w^t \Phi(x))$.

$\mathcal{X}$-space is $\mathbb{R}^d$: inputs $x = (1, x_1, \dots, x_d)^t$; data $x_1, \dots, x_N$ with labels $y_1, \dots, y_N$; no weights; $d_{\mathrm{vc}} = d + 1$.

$\mathcal{Z}$-space is $\mathbb{R}^{\tilde d}$: inputs $z = \Phi(x) = (1, \Phi_1(x), \dots, \Phi_{\tilde d}(x))^t = (1, z_1, \dots, z_{\tilde d})^t$; data $z_1, \dots, z_N$ with the same labels $y_1, \dots, y_N$; weights $\tilde w = (w_0, w_1, \dots, w_{\tilde d})^t$; $d_{\mathrm{vc}} = \tilde d + 1$.

Final hypothesis: $g(x) = \mathrm{sign}(\tilde w^t \Phi(x))$.
Have to transform the data to the Z-space.
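To make the transform mechanics concrete, here is a minimal numpy sketch (the library choice and the name `phi2` are illustrative, not from the lecture) of a 2nd-order polynomial $\Phi$ applied to a dataset:

```python
import numpy as np

def phi2(x):
    # z = Phi(x): map x = (x1, x2) into a 2nd-order Z-space,
    # with the leading 1 playing the role of the z0 coordinate.
    x1, x2 = x
    return np.array([1.0, x1, x2, x1**2, x1 * x2, x2**2])

# Transform the data: every x_n becomes z_n = Phi(x_n); labels are unchanged.
X = np.array([[0.0, 0.0], [2.0, 2.0], [2.0, 0.0], [3.0, 0.0]])
Z = np.vstack([phi2(x) for x in X])   # shape (N, d~ + 1)
```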
Topic for this lecture

Can we learn using only inner products, without ever computing the transform to the $\mathcal{Z}$-space? The kernel makes this possible.
Primal versus dual
Primal:
$$\underset{b,w}{\text{minimize}}\ \tfrac{1}{2} w^t w \qquad \text{subject to: } y_n(w^t x_n + b) \ge 1 \ \text{ for } n = 1,\dots,N$$

Dual:
$$\underset{\alpha}{\text{minimize}}\ \tfrac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N}\alpha_n\alpha_m y_n y_m (x_n^t x_m) - \sum_{n=1}^{N}\alpha_n \qquad \text{subject to: } \sum_{n=1}^{N}\alpha_n y_n = 0,\ \ \alpha_n \ge 0 \ \text{ for } n = 1,\dots,N$$

Recover the primal solution from the dual (the sums run over the support vectors, those $x_n$ with $\alpha_n^* > 0$):
$$w^* = \sum_{\alpha_n^* > 0}\alpha_n^* y_n x_n, \qquad b^* = y_s - w^{*t} x_s \quad (\text{any } s \text{ with } \alpha_s^* > 0)$$

The final hypothesis $g(x) = \mathrm{sign}(w^t x + b)$ becomes
$$g(x) = \mathrm{sign}(w^{*t} x + b^*) = \mathrm{sign}\Big(\sum_{\alpha_n^* > 0}\alpha_n^* y_n\, x_n^t(x - x_s) + y_s\Big),$$
with $N$ optimization variables $\alpha$.
Vector-matrix form
Primal:
$$\underset{b,w}{\text{minimize}}\ \tfrac{1}{2} w^t w \qquad \text{subject to: } y_n(w^t x_n + b) \ge 1 \ \text{ for } n = 1,\dots,N$$

Dual, in vector-matrix form:
$$\underset{\alpha}{\text{minimize}}\ \tfrac{1}{2}\alpha^t G\alpha - 1^t\alpha \qquad (G_{nm} = y_n y_m\, x_n^t x_m)$$
$$\text{subject to: } y^t\alpha = 0,\quad \alpha \ge 0$$

The solution is recovered exactly as before, from the support vectors:
$$w^* = \sum_{\alpha_n^* > 0}\alpha_n^* y_n x_n, \qquad b^* = y_s - w^{*t} x_s \quad (\text{any } s \text{ with } \alpha_s^* > 0),$$
giving $g(x) = \mathrm{sign}(w^{*t} x + b^*) = \mathrm{sign}\big(\sum_{\alpha_n^* > 0}\alpha_n^* y_n\, x_n^t(x - x_s) + y_s\big)$, with $N$ optimization variables $\alpha$.
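As a sanity check on the vector-matrix form, here is a short numpy sketch (the names are illustrative) that builds the signed data matrix and $G$:

```python
import numpy as np

def dual_qp_matrix(X, y):
    # G_nm = y_n y_m x_n^t x_m; equivalently G = Xs Xs^t,
    # where Xs is the signed data matrix with rows y_n x_n^t.
    Xs = y[:, None] * X
    return Xs @ Xs.T

X = np.array([[0.0, 0.0], [2.0, 2.0], [2.0, 0.0], [3.0, 0.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
G = dual_qp_matrix(X, y)   # symmetric and positive semi-definite
```

Since $G = X_s X_s^t$ is a Gram matrix, the dual objective is convex and the QP is well posed.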
The Lagrangian
Replace the constrained primal by an unconstrained minimization over $(b, w)$ of the Lagrangian
$$\mathcal{L}(b, w, \alpha) = \tfrac{1}{2} w^t w - \sum_{n=1}^{N}\alpha_n\big(y_n(w^t x_n + b) - 1\big),$$
where the $\alpha_n \ge 0$ are the Lagrange multipliers and the terms $y_n(w^t x_n + b) - 1$ are the constraints. Formally: use the KKT conditions to transform the primal.

Intuition (maximizing $\mathcal{L}$ over $\alpha \ge 0$):
- constraint violated, $y_n(w^t x_n + b) - 1 < 0$ $\Longrightarrow$ $\alpha_n \to \infty$ gives $\mathcal{L} \to \infty$;
- constraint satisfied with room to spare, $y_n(w^t x_n + b) - 1 > 0$ $\Longrightarrow$ $\alpha_n = 0$ (max $\mathcal{L}$ w.r.t. $\alpha_n$): these are the non-support vectors.

Conclusion: at the optimum, $\alpha_n(y_n(w^t x_n + b) - 1) = 0$, so $\mathcal{L} = \tfrac{1}{2} w^t w$ is minimized and the constraints $1 - y_n(w^t x_n + b) \le 0$ are satisfied.
From the Lagrangian to the dual
$$\mathcal{L} = \tfrac{1}{2} w^t w - \sum_{n=1}^{N}\alpha_n\big(y_n(w^t x_n + b) - 1\big)$$

Set $\partial\mathcal{L}/\partial b = 0$:
$$\frac{\partial\mathcal{L}}{\partial b} = -\sum_{n=1}^{N}\alpha_n y_n = 0 \ \Longrightarrow\ \sum_{n=1}^{N}\alpha_n y_n = 0$$

Set $\partial\mathcal{L}/\partial w = 0$:
$$\frac{\partial\mathcal{L}}{\partial w} = w - \sum_{n=1}^{N}\alpha_n y_n x_n = 0 \ \Longrightarrow\ w = \sum_{n=1}^{N}\alpha_n y_n x_n$$

Substitute into $\mathcal{L}$ to maximize w.r.t. $\alpha \ge 0$:
$$\mathcal{L} = \tfrac{1}{2} w^t w - w^t\!\sum_{n=1}^{N}\alpha_n y_n x_n - b\sum_{n=1}^{N}\alpha_n y_n + \sum_{n=1}^{N}\alpha_n = -\tfrac{1}{2} w^t w + \sum_{n=1}^{N}\alpha_n = -\tfrac{1}{2}\sum_{n,m=1}^{N}\alpha_n\alpha_m y_n y_m\, x_n^t x_m + \sum_{n=1}^{N}\alpha_n$$

Maximizing this is exactly the dual QP:
$$\underset{\alpha}{\text{minimize}}\ \tfrac{1}{2}\alpha^t G\alpha - 1^t\alpha \quad (G_{nm} = y_n y_m\, x_n^t x_m), \qquad \text{subject to: } y^t\alpha = 0,\ \alpha \ge 0,$$
with $w = \sum_{n=1}^{N}\alpha_n^* y_n x_n$, and
$$\alpha_s > 0 \ \Longrightarrow\ y_s(w^t x_s + b) - 1 = 0 \ \Longrightarrow\ b = y_s - w^t x_s.$$
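The substitution step can be checked numerically: for any feasible $\alpha$ (that is, $\alpha \ge 0$ with $y^t\alpha = 0$) and $w = \sum_n \alpha_n y_n x_n$, the Lagrangian equals minus the dual objective. A small numpy sketch (the data and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 8, 3
X = rng.standard_normal((N, d))
y = np.array([1.0, 1.0, 1.0, 1.0, -1.0, -1.0, -1.0, -1.0])

# A feasible alpha: alpha >= 0 and y^t alpha = 0 (class sums balanced).
alpha = rng.random(N)
alpha[y > 0] /= alpha[y > 0].sum()
alpha[y < 0] /= alpha[y < 0].sum()

w = (alpha * y) @ X                        # stationary w = sum_n alpha_n y_n x_n
G = (y[:, None] * X) @ (y[:, None] * X).T
L = -0.5 * w @ w + alpha.sum()             # Lagrangian after substitution
dual = 0.5 * alpha @ G @ alpha - alpha.sum()
assert np.isclose(L, -dual)                # maximizing L == minimizing the dual
```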
Example
Signed data matrix:
$$X = \begin{pmatrix} 0 & 0\\ 2 & 2\\ 2 & 0\\ 3 & 0\end{pmatrix},\quad y = \begin{pmatrix}-1\\-1\\+1\\+1\end{pmatrix} \ \longrightarrow\ X_s = \begin{pmatrix} 0 & 0\\ -2 & -2\\ 2 & 0\\ 3 & 0\end{pmatrix} \ \longrightarrow\ G = X_s X_s^t = \begin{pmatrix} 0 & 0 & 0 & 0\\ 0 & 8 & -4 & -6\\ 0 & -4 & 4 & 6\\ 0 & -6 & 6 & 9\end{pmatrix}$$

Quadratic programming solves
$$\underset{u}{\text{minimize}}\ \tfrac{1}{2}u^t Q u + p^t u \qquad \text{subject to: } Au \ge c,$$
and the dual SVM, $\underset{\alpha}{\text{minimize}}\ \tfrac{1}{2}\alpha^t G\alpha - 1^t\alpha$ subject to $y^t\alpha = 0$, $\alpha \ge 0$, fits this form with
$$u = \alpha,\quad Q = G,\quad p = -1_N,\quad A = \begin{pmatrix} y^t\\ -y^t\\ I_N\end{pmatrix},\quad c = 0_{N+2}.$$

$$\text{QP}(Q, p, A, c)\ \longrightarrow\ \alpha^* = \begin{pmatrix}\tfrac12\\[2pt] \tfrac12\\[2pt] 1\\[2pt] 0\end{pmatrix} \ \Longrightarrow\ w = \sum_{n=1}^{4}\alpha_n^* y_n x_n = \begin{pmatrix}1\\ -1\end{pmatrix},\qquad \gamma = \frac{1}{\lVert w\rVert} = \frac{1}{\sqrt{2}}$$

Optimal separator: $x_1 - x_2 - 1 = 0$, with $\alpha = \tfrac12, \tfrac12, 1$ on the three support vectors and $\alpha = 0$ on the remaining point: non-support vectors $\Longrightarrow \alpha_n = 0$.
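The claimed solution is easy to verify with numpy (a check of my own, not part of the lecture):

```python
import numpy as np

X = np.array([[0.0, 0.0], [2.0, 2.0], [2.0, 0.0], [3.0, 0.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
alpha = np.array([0.5, 0.5, 1.0, 0.0])       # alpha* from the QP

assert np.isclose(y @ alpha, 0.0)            # feasibility: y^t alpha = 0
w = (alpha * y) @ X                          # w* = (1, -1)
s = 2                                        # any index with alpha_s* > 0
b = y[s] - w @ X[s]                          # b* = -1
margins = y * (X @ w + b)
assert np.allclose(margins[alpha > 0], 1.0)  # support vectors on the margin
assert np.all(margins >= 1.0 - 1e-12)        # all constraints satisfied
print(w, b, 1 / np.linalg.norm(w))           # [1, -1], -1, gamma = 1/sqrt(2)
```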
Dual linear-SVM QP algorithm
1: Input: $X$, $y$.
2: Let $p = -1_N$ (the $N$-vector of $-1$s), $c = 0_{N+2}$ (the $(N{+}2)$-vector of zeros), and construct
$$X_s = \begin{pmatrix} y_1 x_1^t\\ \vdots\\ y_N x_N^t\end{pmatrix},\qquad Q = X_s X_s^t,\qquad A = \begin{pmatrix} y^t\\ -y^t\\ I_{N\times N}\end{pmatrix}.$$
3: $\alpha^* \leftarrow \text{QP}(Q, p, A, c)$.
4: Return
$$w^* = \sum_{\alpha_n^* > 0}\alpha_n^* y_n x_n, \qquad b^* = y_s - w^{*t} x_s \quad (\text{any } s \text{ with } \alpha_s^* > 0).$$
5: The final hypothesis is $g(x) = \mathrm{sign}(w^{*t} x + b^*)$.

This solves
$$\underset{\alpha}{\text{minimize}}\ \tfrac{1}{2}\alpha^t G\alpha - 1^t\alpha \qquad \text{subject to: } y^t\alpha = 0,\ \alpha \ge 0.$$
(Some packages allow equality and bound constraints and can solve this type of QP directly.)
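Lacking a dedicated QP package, a generic constrained solver works for small $N$. Here is a hedged sketch using scipy's SLSQP in place of the QP$(\cdot)$ call (scipy is my choice here, not the lecture's):

```python
import numpy as np
from scipy.optimize import minimize

def dual_svm(X, y):
    # Hard-margin dual: minimize 1/2 a^t G a - 1^t a
    # subject to y^t a = 0 and a >= 0.
    N = len(y)
    G = (y[:, None] * X) @ (y[:, None] * X).T
    res = minimize(lambda a: 0.5 * a @ G @ a - a.sum(),
                   np.zeros(N),
                   jac=lambda a: G @ a - np.ones(N),
                   method="SLSQP",
                   bounds=[(0.0, None)] * N,
                   constraints=[{"type": "eq", "fun": lambda a: y @ a}])
    alpha = res.x
    w = (alpha * y) @ X
    s = int(np.argmax(alpha))                # a support vector: alpha_s > 0
    b = y[s] - w @ X[s]
    return w, b, alpha

X = np.array([[0.0, 0.0], [2.0, 2.0], [2.0, 0.0], [3.0, 0.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
w, b, alpha = dual_svm(X, y)   # ~ (1, -1), -1, (1/2, 1/2, 1, 0)
```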
Primal versus dual (non-separable)
Primal (soft margin):
$$\underset{b,w,\xi}{\text{minimize}}\ \tfrac{1}{2} w^t w + C\sum_{n=1}^{N}\xi_n \qquad \text{subject to: } y_n(w^t x_n + b) \ge 1 - \xi_n,\ \ \xi_n \ge 0 \ \text{ for } n = 1,\dots,N$$

Dual:
$$\underset{\alpha}{\text{minimize}}\ \tfrac{1}{2}\alpha^t G\alpha - 1^t\alpha \qquad \text{subject to: } y^t\alpha = 0,\ \ C \ge \alpha \ge 0$$

The only change from the separable case is the upper bound $\alpha \le C$. Recover
$$w^* = \sum_{n=1}^{N}\alpha_n^* y_n x_n, \qquad b^* = y_s - w^{*t} x_s \quad (\text{any } s \text{ with } C > \alpha_s^* > 0),$$
and the final hypothesis is again
$$g(x) = \mathrm{sign}(w^{*t} x + b^*) = \mathrm{sign}\Big(\sum_{\alpha_n^* > 0}\alpha_n^* y_n\, x_n^t(x - x_s) + y_s\Big),$$
with $N$ optimization variables $\alpha$.
Inner product algorithm
The entire algorithm touches the data only through inner products. Solve
$$\underset{\alpha}{\text{minimize}}\ \tfrac{1}{2}\alpha^t G\alpha - 1^t\alpha \qquad (G_{nm} = y_n y_m\, x_n^t x_m)$$
subject to $y^t\alpha = 0$, $C \ge \alpha \ge 0$; then, with any $s$ satisfying $C > \alpha_s^* > 0$,
$$b^* = y_s - \sum_{\alpha_n^* > 0}\alpha_n^* y_n (x_n^t x_s), \qquad g(x) = \mathrm{sign}\Big(\sum_{\alpha_n^* > 0}\alpha_n^* y_n (x_n^t x) + b^*\Big).$$
Z-space inner product algorithm
To learn in a feature space, replace each $x$ by $z = \Phi(x)$: only inner products in the $\mathcal{Z}$-space are needed. Solve the same QP with
$$G_{nm} = y_n y_m\, z_n^t z_m;$$
then, with any $s$ satisfying $C > \alpha_s^* > 0$,
$$b^* = y_s - \sum_{\alpha_n^* > 0}\alpha_n^* y_n (z_n^t z_s), \qquad g(x) = \mathrm{sign}\Big(\sum_{\alpha_n^* > 0}\alpha_n^* y_n (z_n^t z) + b^*\Big).$$
Can we compute $z^t z'$ efficiently?
Everything above depends on the data only through inner products $z^t z'$. So the question becomes: can we compute $z^t z' = \Phi(x)^t\Phi(x')$ directly from $x$ and $x'$, without ever constructing $z$ and $z'$? If so, we never need to visit the $\mathcal{Z}$-space, since the whole pipeline is
$$G_{nm} = y_n y_m\, z_n^t z_m, \qquad b = y_s - \sum_{\alpha_n^* > 0}\alpha_n^* y_n (z_n^t z_s), \qquad g(x) = \mathrm{sign}\Big(\sum_{\alpha_n^* > 0}\alpha_n^* y_n (z_n^t z) + b\Big).$$
The Kernel
Example: 2nd-order polynomial transform
$$\Phi(x) = \big(x_1,\ x_2,\ x_1^2,\ \sqrt{2}\,x_1 x_2,\ x_2^2\big)^t$$

$$K(x, x') = \Phi(x)^t\Phi(x') = x_1 x_1' + x_2 x_2' + x_1^2 x_1'^2 + 2 x_1 x_2 x_1' x_2' + x_2^2 x_2'^2 \qquad \leftarrow O(\tilde d) = O(d^2)$$

$$= \Big(\tfrac{1}{2} + x^t x'\Big)^2 - \tfrac{1}{4} \qquad \leftarrow \text{computed quickly in } \mathcal{X}\text{-space, in } O(d)$$
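The algebraic identity is easy to confirm numerically (a check under the slide's definition of $\Phi$; numpy is my choice):

```python
import numpy as np

def phi(x):
    # The slide's 2nd-order transform.
    x1, x2 = x
    return np.array([x1, x2, x1**2, np.sqrt(2) * x1 * x2, x2**2])

def K(x, xp):
    # Same inner product, computed directly in X-space in O(d).
    return (0.5 + x @ xp) ** 2 - 0.25

rng = np.random.default_rng(1)
for _ in range(5):
    x, xp = rng.standard_normal(2), rng.standard_normal(2)
    assert np.isclose(phi(x) @ phi(xp), K(x, xp))
```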
Gaussian kernel
Example: Gaussian kernel in 1 dimension. Take the infinite-dimensional transform
$$\Phi(x) = e^{-x^2}\Big(\sqrt{\tfrac{2^0}{0!}},\ \sqrt{\tfrac{2^1}{1!}}\,x,\ \sqrt{\tfrac{2^2}{2!}}\,x^2,\ \sqrt{\tfrac{2^3}{3!}}\,x^3,\ \sqrt{\tfrac{2^4}{4!}}\,x^4,\ \dots\Big)^t$$
(an infinite-dimensional $\Phi$). Then
$$K(x, x') = \Phi(x)^t\Phi(x') = e^{-x^2} e^{-x'^2}\sum_{i=0}^{\infty}\frac{(2xx')^i}{i!} = e^{-x^2} e^{-x'^2} e^{2xx'} = e^{-(x-x')^2}.$$
In $d$ dimensions the same construction gives the Gaussian kernel $K(x, x') = e^{-\lVert x - x'\rVert^2}$.
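A truncation of the infinite series already matches the closed form to machine precision for moderate $x$, as this sketch shows (numpy; the cutoff of 25 terms is my arbitrary choice):

```python
import numpy as np
from math import factorial

def phi_gauss(x, terms=25):
    # Truncated Gaussian feature map in 1-d:
    # Phi_i(x) = e^{-x^2} * sqrt(2^i / i!) * x^i,  i = 0, 1, 2, ...
    i = np.arange(terms)
    coef = np.sqrt(2.0**i / np.array([float(factorial(k)) for k in i]))
    return np.exp(-x**2) * coef * x**i

x, xp = 0.7, -0.3
assert np.isclose(phi_gauss(x) @ phi_gauss(xp), np.exp(-(x - xp) ** 2))
```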
Bypass Z-space
We never need $\Phi$ or $z_n = \Phi(x_n)$ explicitly; the kernel $K$ does all the work on $x_n \in \mathcal{X}$.

1: Input: $X$, $y$, regularization parameter $C$.
2: Compute $G$: $G_{nm} = y_n y_m K(x_n, x_m)$.
3: Solve (QP):
$$\underset{\alpha}{\text{minimize}}\ \tfrac{1}{2}\alpha^t G\alpha - 1^t\alpha \qquad \text{subject to: } y^t\alpha = 0,\ \ C \ge \alpha \ge 0 \ \longrightarrow\ \alpha^*,$$
and pick an index $s$ with $C > \alpha_s^* > 0$.
4: $b^* = y_s - \sum_{\alpha_n^* > 0}\alpha_n^* y_n K(x_n, x_s)$.
5: The final hypothesis is
$$g(x) = \mathrm{sign}\Big(\sum_{\alpha_n^* > 0}\alpha_n^* y_n K(x_n, x) + b^*\Big).$$
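Putting the pieces together, here is a compact end-to-end sketch (scipy's SLSQP again standing in for the QP solver; the function names and the choice $C = 10$ are illustrative):

```python
import numpy as np
from scipy.optimize import minimize

def kernel_svm(X, y, K, C=1.0):
    # Steps 2-5 of the algorithm: dual soft-margin SVM using only
    # kernel evaluations K(x, x'), never the transform Phi.
    N = len(y)
    G = np.outer(y, y) * np.array([[K(a, b_) for b_ in X] for a in X])
    res = minimize(lambda a: 0.5 * a @ G @ a - a.sum(),
                   np.zeros(N),
                   jac=lambda a: G @ a - np.ones(N),
                   method="SLSQP",
                   bounds=[(0.0, C)] * N,
                   constraints=[{"type": "eq", "fun": lambda a: y @ a}])
    alpha = res.x
    sv = np.flatnonzero(alpha > 1e-8)                      # support vectors
    s = int(np.flatnonzero((alpha > 1e-8) & (alpha < C - 1e-8))[0])
    b = y[s] - sum(alpha[n] * y[n] * K(X[n], X[s]) for n in sv)
    def g(x):                                              # final hypothesis
        return np.sign(sum(alpha[n] * y[n] * K(X[n], x) for n in sv) + b)
    return g

gaussian = lambda x, xp: np.exp(-np.sum((x - xp) ** 2))
X = np.array([[0.0, 0.0], [2.0, 2.0], [2.0, 0.0], [3.0, 0.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
g = kernel_svm(X, y, gaussian, C=10.0)
print(g(np.array([3.0, 0.0])))   # +1, classified without ever forming Phi
```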
The Kernel-SVM philosophy
SVM:
- high $\tilde d$ $\to$ complicated separator
- small # support vectors $\to$ low effective complexity
- high $\tilde d$ $\to$ expensive or infeasible computation
- kernel $\to$ computationally feasible to go to high $\tilde d$

So the SVM can go to high (infinite) $\tilde d$; kernelized regression can go to high (infinite) $\tilde d$ in the same way.