Introduction to Machine Learning
5. Support Vector Classification
Alex Smola, Carnegie Mellon University
http://alex.smola.org/teaching/cmu2013-10-701 (10-701)
Outline
- Large Margin Separation, optimization problem
- Support Vectors, kernel expansion
- Dual problem, robustness
[Figure: classifying mail as spam vs. ham with a linear separator; image from http://maktoons.blogspot.com/2009/03/support-vector-machine.html]
Large Margin Separation

Linear function: f(x) = \langle w, x \rangle + b, with \langle w, x \rangle + b \geq 1 on the positive side and \langle w, x \rangle + b \leq -1 on the negative side.

Margin: take x_+ on the hyperplane \langle w, x \rangle + b = 1 and x_- on \langle w, x \rangle + b = -1. Projecting their difference onto the unit normal w / \|w\| gives half the distance between the two hyperplanes:

  \frac{\langle x_+ - x_-, w \rangle}{2 \|w\|}
  = \frac{1}{2 \|w\|} \big[ [\langle x_+, w \rangle + b] - [\langle x_-, w \rangle + b] \big]
  = \frac{1}{\|w\|}

Maximizing the margin yields the optimization problem

  maximize_{w,b} \frac{1}{\|w\|}  subject to  y_i [\langle x_i, w \rangle + b] \geq 1

or, equivalently, the convex quadratic program

  minimize_{w,b} \frac{1}{2} \|w\|^2  subject to  y_i [\langle x_i, w \rangle + b] \geq 1.
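To make the constraints and the margin concrete, here is a minimal numeric check; the toy data and the hand-picked feasible (w, b) are illustrative, not from the slides:

import numpy as np

# Toy separable data with labels in {-1, +1}; all values are illustrative.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = np.array([0.5, 0.5]), -1.0

functional = y * (X @ w + b)    # y_i [<w, x_i> + b]; all >= 1, so (w, b) is feasible
print("functional margins:", functional)
print("geometric margin 1/||w||:", 1.0 / np.linalg.norm(w))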
Optimality in (w, b) is at a saddle point with the multipliers \alpha: attaching one multiplier \alpha_i \geq 0 to each constraint y_i [\langle x_i, w \rangle + b] \geq 1 gives the Lagrangian

  L(w, b, \alpha) = \frac{1}{2} \|w\|^2 - \sum_i \alpha_i \big[ y_i [\langle x_i, w \rangle + b] - 1 \big].
Setting the derivatives with respect to w and b to zero,

  \partial_w L(w, b, \alpha) = w - \sum_i \alpha_i y_i x_i = 0
  \partial_b L(w, b, \alpha) = -\sum_i \alpha_i y_i = 0

and substituting both conditions back into L yields the dual problem

  maximize_\alpha  -\frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle + \sum_i \alpha_i
  subject to  \sum_i \alpha_i y_i = 0  and  \alpha_i \geq 0.
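Spelling out the substitution step (a worked derivation from the two stationarity conditions above):

\begin{align*}
L &= \frac{1}{2} \Big\| \sum_i \alpha_i y_i x_i \Big\|^2
     - \sum_i \alpha_i y_i \Big\langle x_i, \sum_j \alpha_j y_j x_j \Big\rangle
     - b \underbrace{\sum_i \alpha_i y_i}_{= 0} + \sum_i \alpha_i \\
  &= -\frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle + \sum_i \alpha_i
\end{align*}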
Primal and dual are linked through the expansion

  w = \sum_i y_i \alpha_i x_i

so the weight vector is a linear combination of training points.
Support Vectors

Karush-Kuhn-Tucker optimality condition:

  \alpha_i \big[ y_i [\langle w, x_i \rangle + b] - 1 \big] = 0

so for each point either \alpha_i = 0, or \alpha_i > 0 \Rightarrow y_i [\langle w, x_i \rangle + b] = 1. Only the points on the margin, the support vectors, contribute to

  w = \sum_i y_i \alpha_i x_i.
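Given a dual solution, this expansion recovers the primal solution. A minimal sketch; alpha is assumed to come from a QP solver, and the tolerance and helper name are illustrative:

import numpy as np

def recover_primal(X, y, alpha, tol=1e-6):
    w = (alpha * y) @ X              # w = sum_i y_i alpha_i x_i
    sv = alpha > tol                 # support vectors: alpha_i > 0
    # on the margin y_i [<w, x_i> + b] = 1, hence b = y_i - <w, x_i>
    b = np.mean(y[sv] - X[sv] @ w)
    return w, b, sv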
[Figure: margin \rho vs. perturbation radius r; a large margin buys robustness relative to uncertainty in the inputs, and correctly classified instances far from the boundary make for easy problems]
Classifiers

For overlapping classes, a linear separator f(x) = \langle w, x \rangle + b satisfying y_i [\langle w, x_i \rangle + b] \geq 1 for all i is impossible.

Theorem (Minsky & Papert). Finding the minimum-error separating hyperplane is NP-hard.

So settling for the minimum-error separator is impossible in practice, too.
Adding slack variables

Relax the constraints to

  \langle w, x_i \rangle + b \geq 1 - \xi_i   (positive class)
  \langle w, x_i \rangle + b \leq -1 + \xi_i  (negative class)

with \xi_i \geq 0, and minimize the total amount of slack. This keeps the problem a convex optimization problem, with explicit constraints.
Recall the general recipe for constrained convex problems:

  minimize_x f(x)  subject to  c_i(x) \leq 0

  L(x, \alpha) = f(x) + \sum_i \alpha_i c_i(x),  \alpha_i \geq 0

  \partial_x L(x, \alpha) = \partial_x f(x) + \sum_i \alpha_i \partial_x c_i(x) = 0

Solving the last condition for x as a function of \alpha and substituting back gives the dual:

  maximize_\alpha  L(x(\alpha), \alpha).
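A one-dimensional worked instance of this recipe, chosen purely for illustration: minimize f(x) = x^2 subject to c(x) = 1 - x \leq 0.

\begin{align*}
L(x, \alpha) &= x^2 + \alpha (1 - x), \qquad \alpha \geq 0 \\
\partial_x L &= 2x - \alpha = 0 \;\Rightarrow\; x(\alpha) = \alpha / 2 \\
L(x(\alpha), \alpha) &= \alpha - \frac{\alpha^2}{4}
  \;\Rightarrow\; \alpha^* = 2, \; x^* = 1
\end{align*}

The dual maximum \alpha^* = 2 reproduces the constrained minimum x^* = 1, which is exactly the pattern used for the SVM below.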
Soft-margin SVM

  minimize_{w,b,\xi} \frac{1}{2} \|w\|^2 + C \sum_i \xi_i
  subject to  y_i [\langle w, x_i \rangle + b] \geq 1 - \xi_i  and  \xi_i \geq 0

The problem is always feasible. Proof: w = 0, b = 0, \xi_i = 1 satisfies all constraints. This choice also yields the upper bound C m on the optimal objective value.
Optimality in (w, b, \xi) is at a saddle point with multipliers \alpha, \eta \geq 0:

  L(w, b, \xi, \alpha, \eta) = \frac{1}{2} \|w\|^2 + C \sum_i \xi_i
    - \sum_i \alpha_i \big[ y_i [\langle x_i, w \rangle + b] + \xi_i - 1 \big]
    - \sum_i \eta_i \xi_i

Setting the derivatives to zero:

  \partial_w L(w, b, \xi, \alpha, \eta) = w - \sum_i \alpha_i y_i x_i = 0
  \partial_b L(w, b, \xi, \alpha, \eta) = -\sum_i \alpha_i y_i = 0
  \partial_{\xi_i} L(w, b, \xi, \alpha, \eta) = C - \alpha_i - \eta_i = 0
Since \eta_i \geq 0, the last condition bounds \alpha_i \leq C. Substituting back gives the dual

  maximize_\alpha  -\frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle + \sum_i \alpha_i
  subject to  \sum_i \alpha_i y_i = 0  and  \alpha_i \in [0, C]

The box constraint bounds the influence of any single point, and as before

  w = \sum_i y_i \alpha_i x_i.
The Karush-Kuhn-Tucker conditions

  \alpha_i \big[ y_i [\langle w, x_i \rangle + b] + \xi_i - 1 \big] = 0  and  \eta_i \xi_i = 0

split the training points into three cases:

  \alpha_i = 0      \Rightarrow  y_i [\langle w, x_i \rangle + b] \geq 1   (outside the margin)
  0 < \alpha_i < C  \Rightarrow  y_i [\langle w, x_i \rangle + b] = 1      (on the margin)
  \alpha_i = C      \Rightarrow  y_i [\langle w, x_i \rangle + b] \leq 1   (margin violators, \xi_i \geq 0)
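In code these cases sort the training points once a dual solution is available; a minimal sketch with an illustrative tolerance:

import numpy as np

def kkt_buckets(alpha, C, tol=1e-6):
    outside = alpha < tol                            # alpha_i = 0: outside the margin
    on_margin = (alpha >= tol) & (alpha <= C - tol)  # 0 < alpha_i < C: on the margin
    violators = alpha > C - tol                      # alpha_i = C: margin violators
    return outside, on_margin, violators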
[Figure: decision boundaries for C = 1, 2, 5, 10, 20, 50, 100; larger C penalizes slack more heavily]
Solving the dual: we can use an off-the-shelf solver (CVXOPT, CPLEX, OOQP, LOQO), or exploit the fact that only the support vectors matter and solve in blocks (active set method).
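As a concrete instance, here is a sketch of the soft-margin dual as a CVXOPT quadratic program, minimizing \frac{1}{2} \alpha^T P \alpha - 1^T \alpha with P_{ij} = y_i y_j \langle x_i, x_j \rangle; the function name and tolerances are illustrative:

import numpy as np
from cvxopt import matrix, solvers

def svm_dual_qp(X, y, C=1.0):
    m = X.shape[0]
    K = X @ X.T                                      # Gram matrix <x_i, x_j>
    P = matrix(np.outer(y, y) * K)                   # P_ij = y_i y_j <x_i, x_j>
    q = matrix(-np.ones(m))                          # maximize sum_i alpha_i
    G = matrix(np.vstack([-np.eye(m), np.eye(m)]))   # -alpha_i <= 0 and alpha_i <= C
    h = matrix(np.hstack([np.zeros(m), C * np.ones(m)]))
    A = matrix(y.astype(float)).T                    # sum_i y_i alpha_i = 0
    b = matrix(0.0)
    alpha = np.ravel(solvers.qp(P, q, G, h, A, b)['x'])
    w = (alpha * y) @ X                              # w = sum_i y_i alpha_i x_i
    on_margin = (alpha > 1e-6) & (alpha < C - 1e-6)
    b_offset = np.mean(y[on_margin] - X[on_margin] @ w)
    return alpha, w, b_offset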
The resulting decision function is

  f(x) = \sum_i \alpha_i y_i \langle x_i, x \rangle + b

where \alpha solves the dual above.
Kernel expansion

The data enter only through inner products, so we may replace \langle x_i, x_j \rangle by a kernel k(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle. The primal becomes

  minimize_{w,b,\xi} \frac{1}{2} \|w\|^2 + C \sum_i \xi_i
  subject to  y_i [\langle w, \phi(x_i) \rangle + b] \geq 1 - \xi_i  and  \xi_i \geq 0

the dual becomes

  maximize_\alpha  -\frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j k(x_i, x_j) + \sum_i \alpha_i
  subject to  \sum_i \alpha_i y_i = 0  and  \alpha_i \in [0, C]

and the decision function becomes

  f(x) = \sum_i \alpha_i y_i k(x_i, x) + b.
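A minimal sketch of this decision function with a Gaussian RBF kernel; the kernel choice and gamma are illustrative, and alpha, b are assumed to come from a dual solver such as the QP sketch above:

import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # k(a, b) = exp(-gamma * ||a - b||^2), computed for all pairs of rows
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq)

def decision_function(X_train, y_train, alpha, b, X_test, gamma=1.0):
    K = rbf_kernel(X_train, X_test, gamma)  # K[i, j] = k(x_i, x_test_j)
    return (alpha * y_train) @ K + b        # f(x) = sum_i alpha_i y_i k(x_i, x) + b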
[Figure: decision boundaries for C = 1, 2, 5, 10, 20, 50, 100, here for the kernelized classifier]
Hinge loss

The constrained problem

  minimize_{w,b,\xi} \frac{1}{2} \|w\|^2 + C \sum_i \xi_i
  subject to  y_i [\langle w, x_i \rangle + b] \geq 1 - \xi_i  and  \xi_i \geq 0

is equivalent to the unconstrained problem

  minimize_{w,b} \frac{1}{2} \|w\|^2 + C \sum_i \max \big[ 0, 1 - y_i [\langle w, x_i \rangle + b] \big]

This follows from finding the minimal feasible slack for a given (w, b) pair, namely \xi_i = \max[0, 1 - y_i [\langle w, x_i \rangle + b]].
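The unconstrained form suggests a simple first-order method. A minimal subgradient-descent sketch of this exact objective; the learning rate and epoch count are illustrative choices, not from the slides:

import numpy as np

def svm_subgradient(X, y, C=1.0, lr=0.01, epochs=200):
    m, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        active = margins < 1                   # points where the hinge is active
        g_w = w - C * (y[active] @ X[active])  # subgradient in w
        g_b = -C * y[active].sum()             # subgradient in b
        w -= lr * g_w
        b -= lr * g_b
    return w, b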
Loss functions

The empirical risk term uses the hinge loss max(0, 1 - y f(x)), a convex upper bound on the binary loss \{y f(x) < 0\}, written as a function of the margin y f(x). For a positive example (y = 1), related convex losses are:

- Hinge loss: max(0, 1 - f(x))
- Smoothed (Huberized) hinge:
    0                         if f(x) > 1
    \frac{1}{2} (1 - f(x))^2  if f(x) \in [0, 1]
    \frac{1}{2} - f(x)        if f(x) < 0
- Logistic loss: \log \big[ 1 + e^{-f(x)} \big], asymptotically linear as f(x) \to -\infty and asymptotically 0 as f(x) \to +\infty.
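For comparison, the losses as functions of the margin t = y f(x); a short numeric sketch with an illustrative grid:

import numpy as np

def binary_loss(t):   return (t < 0).astype(float)
def hinge_loss(t):    return np.maximum(0.0, 1.0 - t)
def smooth_hinge(t):  # 0 above 1, quadratic on [0, 1], linear below 0
    return np.where(t > 1, 0.0, np.where(t >= 0, 0.5 * (1 - t) ** 2, 0.5 - t))
def logistic_loss(t): return np.log1p(np.exp(-t))

t = np.linspace(-2.0, 2.0, 9)
for loss in (binary_loss, hinge_loss, smooth_hinge, logistic_loss):
    print(loss.__name__, np.round(loss(t), 3))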
Risk functionals

  Expected risk:     R[f] := E_{(x,y) \sim p(x,y)} \big[ \{ y f(x) < 0 \} \big]
  Empirical risk:    R_{emp}[f] := \frac{1}{m} \sum_{i=1}^m \{ y_i f(x_i) < 0 \}
  Regularized risk:  R_{reg}[f] := \frac{1}{m} \sum_{i=1}^m \max(0, 1 - y_i f(x_i)) + \lambda \Omega[f]
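A minimal sketch computing the empirical and regularized risk for a linear classifier, with the illustrative choice \Omega[f] = \|w\|^2:

import numpy as np

def risks(X, y, w, b, lam=0.1):
    f = X @ w + b
    r_emp = np.mean(y * f < 0)                   # 0-1 empirical risk
    r_reg = np.mean(np.maximum(0.0, 1.0 - y * f)) + lam * (w @ w)
    return r_emp, r_reg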
Regularization trades off the empirical risk against the complexity term \Omega[f]; how to control \lambda is the model selection question.
Summary
- Large Margin Separation, optimization problem
- Support Vectors, kernel expansion
- Dual problem, robustness