Machine Learning
A Geometric Approach
Professor Liang Huang
Linear Classification: Support Vector Machines (SVM)
some slides from Alex Smola (CMU)
CIML book Chap 7.7
Linear Separator (figure: separating spam from ham)
From Perceptron to SVM (timeline):
1959 Rosenblatt: perceptron invented
1962 Novikoff: convergence proof
1964 Vapnik & Chervonenkis (work revived after the fall of the USSR)
1997 Cortes/Vapnik: SVM (batch) — + max margin + kernels + soft margin for the inseparable case [AT&T Research]
1999 Freund/Schapire: voted/averaged perceptron — revived
2002 Collins: structured perceptron
2003 Crammer/Singer: MIRA — conservative updates, + max margin
2005* McDonald/Crammer/Pereira: structured MIRA
2006 Singer group: aggressive
2007–2010* Singer group: Pegasos — subgradient descent, minibatch [ex-AT&T researchers and students]
*mentioned in lectures but optional (the other papers are all covered in detail)
linear function: f(x) = ⟨w, x⟩ + b
⟨w, x⟩ + b ≥ 1 on one side of the margin, ⟨w, x⟩ + b ≤ −1 on the other
why a large margin? robustness relative to uncertainty: correctly classified instances stay correct under small perturbations
(figure: perturbation radius r inside margin ρ — easy problems)
margin hyperplanes: ⟨w, x⟩ + b = 1 and ⟨w, x⟩ + b = −1
functional margin: yi(w · xi)
geometric margin: yi(w · xi)/‖w‖ = 1/‖w‖, s.t. the functional margin is at least 1
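A minimal numpy sketch of the two margin notions (the weights, bias, and data here are illustrative, not from the lecture):

    import numpy as np

    w, b = np.array([1.0, 0.0]), 0.0                         # illustrative weight vector and bias
    X = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]], dtype=float)
    y = np.array([1, 1, -1, -1])

    scores = X @ w + b                                        # f(x) = <w, x> + b
    functional_margin = y * scores                            # yi (w . xi + b)
    geometric_margin = functional_margin / np.linalg.norm(w)  # divide by ||w||
    print(functional_margin.min(), geometric_margin.min())    # dataset margin = smallest per-example margin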
SVM objective (max version):
max_w 1/‖w‖  s.t. ∀(x, y) ∈ D, y(w · x) ≥ 1
(figure: regions ⟨w, x⟩ + b ≥ 1 and ⟨w, x⟩ + b ≤ −1)
Q1: what if we want functional margin of 2? Q2: what if we want geometric margin of 1?
SVM objective (min version):
min_w ‖w‖  s.t. ∀(x, y) ∈ D, y(w · x) ≥ 1
interpretation: small models generalize better
(s.t. the functional margin is at least 1)
SVM objective (min version):
min_w ½‖w‖²  s.t. ∀(x, y) ∈ D, y(w · x) ≥ 1
(s.t. the functional margin is at least 1)
‖w‖ is not differentiable, but ‖w‖² is.
margin of at least 1 on ALL EXAMPLES (SVM) vs. margin of at least 1 on THIS EXAMPLE (MIRA):
SVM:  min_w ½‖w‖²  s.t. ∀(x, y) ∈ D, y(w · x) ≥ 1
MIRA: min_w′ ‖w′ − w‖²  s.t. w′ · x ≥ 1
(figure: perceptron vs. MIRA updates for example xi)
why don't we use convex hulls for SVMs in practice?
how many support vectors in 2D?
the weight vector is determined by the support vectors alone
c.f. perceptron: w = Σ(x,y)∈errors y · x; what about MIRA?
(figure: convex combination)
the objective ½‖w‖² is quadratic, the constraints are linear:
min_w ½‖w‖²  s.t. ∀(x, y) ∈ D, y(w · x) ≥ 1
(figure: MIRA update for example xi ⊕ — the distance from wi to the constraint plane w · xi = 1 is (1 − wi · xi)/‖xi‖)
MIRA: min_w′ ‖w′ − w‖²  s.t. w′ · x ≥ 1
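A minimal sketch (my illustration) of this MIRA step, using the closed-form solution of the little optimization problem above: project w onto the half-space where the example has functional margin at least 1.

    import numpy as np

    def mira_update(w, x, y):
        """Smallest change to w (in Euclidean norm) so that y (w'.x) >= 1."""
        margin = y * (w @ x)
        if margin >= 1:
            return w                                    # constraint already satisfied: no update
        return w + ((1 - margin) / (x @ x)) * y * x     # projection onto the half-space

    w = mira_update(np.zeros(2), np.array([1.0, 2.0]), +1)
    print(w, w @ np.array([1.0, 2.0]))                  # new functional margin is exactly 1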
Lagrangian (with multipliers αi for the constraints):
L(w, b, α) = ½‖w‖² − Σi αi [yi [⟨xi, w⟩ + b] − 1]
derivatives in w need to vanish
the model is a linear combination of the support vectors, i.e., those with αi > 0
min_w ½‖w‖²  s.t. ∀(x, y) ∈ D, y(w · x) ≥ 1
∂wL(w, b, α) = w − Σi αi yi xi = 0
∂bL(w, b, α) = Σi αi yi = 0
⇒ w = Σi yi αi xi
KKT condition (complementary slackness): support vectors are those with αi > 0 (αi = 0 ⇒ constraint inactive)
Karush–Kuhn–Tucker (KKT) Optimality Condition:
minimize_{w,b} ½‖w‖²  subject to  yi [⟨xi, w⟩ + b] ≥ 1
w = Σi yi αi xi
complementary slackness: αi [yi [⟨w, xi⟩ + b] − 1] = 0
so either αi = 0, or αi > 0 ⇒ yi [⟨w, xi⟩ + b] = 1 (xi is on the margin: a support vector)
Substituting the stationarity conditions back into the Lagrangian:
L(w, b, α) = ½‖w‖² − Σi αi [yi [⟨xi, w⟩ + b] − 1]
∂wL(w, b, α) = w − Σi αi yi xi = 0,  ∂bL(w, b, α) = Σi αi yi = 0,  w = Σi yi αi xi
gives the dual problem in the dual variables α:
maximize_α −½ Σi,j αi αj yi yj ⟨xi, xj⟩ + Σi αi  subject to  Σi αi yi = 0 and αi ≥ 0

Primal vs. Dual:
Primal: minimize_{w,b} ½‖w‖²  subject to  yi [⟨xi, w⟩ + b] ≥ 1
Dual: maximize_α −½ Σi,j αi αj yi yj ⟨xi, xj⟩ + Σi αi  subject to  Σi αi yi = 0 and αi ≥ 0
(w = Σi yi αi xi recovers the primal solution from the dual variables)
we can use an off-the-shelf solver (CVXOPT, CPLEX, OOQP, LOQO)
for large problems, only the support vectors matter, so we can solve in blocks (active set method).
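For illustration only (this is my sketch, not the lecture's demo), the hard-margin dual above can be handed to CVXOPT, one of the solvers listed; cvxopt.solvers.qp minimizes ½αᵀPα + qᵀα subject to Gα ≤ h and Aα = b, so we negate the dual objective:

    import numpy as np
    from cvxopt import matrix, solvers

    # toy separable data (illustrative)
    X = np.array([[1., 1.], [1., -1.], [-1., 1.], [-1., -1.]])
    y = np.array([1., 1., -1., -1.])
    n = len(y)

    K = X @ X.T                                          # Gram matrix of inner products <xi, xj>
    P = matrix(np.outer(y, y) * K + 1e-8 * np.eye(n))    # Qij = yi yj <xi, xj> (tiny ridge for stability)
    q = matrix(-np.ones(n))                              # maximizing sum(alpha) = minimizing -sum(alpha)
    G = matrix(-np.eye(n))                               # -alpha_i <= 0, i.e. alpha_i >= 0
    h = matrix(np.zeros(n))
    A = matrix(y.reshape(1, -1))                         # equality constraint: sum_i alpha_i y_i = 0
    b = matrix(0.0)

    solvers.options['show_progress'] = False
    alpha = np.ravel(solvers.qp(P, q, G, h, A, b)['x'])
    w = (alpha * y) @ X                                  # recover w = sum_i alpha_i y_i x_i
    print(alpha.round(3), w.round(3))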
in matrix form (dual variables α):
maximize_α −½ Σi,j αi αj yi yj ⟨xi, xj⟩ + Σi αi  subject to  Σi αi yi = 0 and αi ≥ 0
i.e., maximize_α −½ αᵀQα − αᵀb  subject to  α ≥ 0  (written without the equality constraint, as when there is no bias term)
Optimization: SMO (Sequential Minimal Optimization)
Q: what's the Q in the SVM primal? what about the Q in the SVM dual?
convex QP ⇒ a local min/max is a global min/max
(figure: problem classes LP ⊂ QP ⊂ CP; SVM is a convex QP)
coordinate descent: optimize one coordinate αi at a time
initialize αi = 0 for all i
repeat:
  pick i following a sweep pattern
  solve αi ← argmax_{αi} −½ αᵀQα − αᵀb  subject to αi ≥ 0  (all other coordinates held fixed)
until the stopping criterion is met
with the other coordinates fixed this is a quadratic function of a single variable: the maximum is where the first-order derivative is 0 (clipped at αi ≥ 0)
example: maximize_α −½ αᵀ [4 1; 1 2] α − αᵀ [−6, −4]ᵀ  subject to  α ≥ 0
what if the data is not linearly separable? (figure: spam vs. ham)
linear function f(x) = ⟨w, x⟩ + b: a linear separator is impossible
Theorem (Minsky & Papert): finding the minimum error separating hyperplane is NP-hard
⇒ even the minimum-error separator is intractable
add slack: require ⟨w, x⟩ + b ≥ 1 − ξ with ξ ≥ 0, and minimize the amount of slack
misclassification is also a margin violation (ξ > 0)
soft-margin SVM (a convex optimization problem):
minimize_{w,b} ½‖w‖² + C Σi ξi  subject to  yi [⟨w, xi⟩ + b] ≥ 1 − ξi and ξi ≥ 0
the problem is always feasible; proof: w = 0, b = 0, ξi = 1 (this also yields an upper bound on the objective)
compare the hard-margin version: minimize_{w,b} ½‖w‖²  subject to  yi [⟨w, xi⟩ + b] ≥ 1
what happens when C = 0? when C = +∞? C = +∞ ⇒ no tolerance for violations ⇒ hard margin
note: w determines ξ (we could eliminate ξ, but for now keep the explicit constraints)
constrained optimization in general:
minimize_x f(x)  subject to  ci(x) ≤ 0
L(x, α) = f(x) + Σi αi ci(x)
∂xL(x, α) = ∂xf(x) + Σi αi ∂xci(x) = 0
then maximize_α L(x(α), α)
soft-margin Lagrangian (multipliers αi for the margin constraints, ηi for ξi ≥ 0):
minimize_{w,b} ½‖w‖² + C Σi ξi  subject to  yi [⟨w, xi⟩ + b] ≥ 1 − ξi and ξi ≥ 0
L(w, b, ξ, α, η) = ½‖w‖² + C Σi ξi − Σi αi [yi [⟨xi, w⟩ + b] + ξi − 1] − Σi ηi ξi
∂wL(w, b, ξ, α, η) = w − Σi αi yi xi = 0
∂bL(w, b, ξ, α, η) = Σi αi yi = 0
∂ξiL(w, b, ξ, α, η) = C − αi − ηi = 0
soft-margin dual (dual variables α):
maximize_α −½ Σi,j αi αj yi yj ⟨xi, xj⟩ + Σi αi  subject to  Σi αi yi = 0 and αi ∈ [0, C]
the box constraint αi ∈ [0, C] bounds the influence of each example
as before, w = Σi yi αi xi
complementary slackness: αi [yi [⟨w, xi⟩ + b] + ξi − 1] = 0 and ηi ξi = 0, with 0 ≤ αi = C − ηi ≤ C
three cases from the KKT conditions:
αi = 0 ⇒ yi [⟨w, xi⟩ + b] ≥ 1
0 < αi < C ⇒ yi [⟨w, xi⟩ + b] = 1
αi = C ⇒ yi [⟨w, xi⟩ + b] ≤ 1
Q: why are these cases not disjoint?
(figure: examples labeled α=0, 0<α<C, α=C on both sides of the separator)
all circled and squared examples in the figure are support vectors (α > 0); they include ξ=0 (hard-margin SVs), ξ>0 (margin violations), and ξ>1 (misclassifications)
support vectors: 0 < α ≤ C, ξ ≥ 0
margin violations: α = C, ξ > 0
misclassifications: α = C, ξ > 1
non-support vectors: α = 0, ξ = 0
demo: python demo.py 1e10 (≈ hard margin); python demo.py 0.01; python demo.py 0.1
(figure: + + ‒ ‒ training points)
In [2]: clf = svm.SVC(kernel='linear', C=1e10)
In [3]: X = [[1,1], [1,-1], [-1,1], [-1,-1]]
In [4]: Y = [1,1,-1,-1]
In [5]: clf.fit(X, Y)
In [6]: clf.support_
Out[6]: array([3, 1], dtype=int32)
In [7]: clf.dual_coef_
Out[7]: array([[-0.5,  0.5]])
In [8]: clf.coef_
Out[8]: array([[ 1.,  0.]])
In [9]: clf.intercept_
Out[9]: array([-0.])
In [10]: clf.support_vectors_
Out[10]: array([[-1., -1.],
       [ 1., -1.]])
In [11]: clf.n_support_
Out[11]: array([1, 1], dtype=int32)
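As an aside (my illustration, not part of the original demo), the fitted model lets us check the KKT relation w = Σi yi αi xi directly: sklearn stores yi αi for each support vector in dual_coef_, so dual_coef_ @ support_vectors_ should reproduce coef_.

    import numpy as np
    from sklearn import svm

    clf = svm.SVC(kernel='linear', C=1e10)
    clf.fit([[1, 1], [1, -1], [-1, 1], [-1, -1]], [1, 1, -1, -1])
    # dual_coef_ holds y_i * alpha_i for the support vectors, so the product below
    # reconstructs w = sum_i y_i alpha_i x_i
    print(clf.dual_coef_ @ clf.support_vectors_)   # [[1. 0.]]
    print(clf.coef_)                               # matches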
vs. soft margin: add an outlier [-0.1,-0.1] labeled + and use C=1:
In [2]: X = [[1,1], [1,-1], [-1,1], [-1,-1], [-0.1,-0.1]]
In [3]: Y = [1,1,-1,-1, 1]
In [4]: clf = svm.SVC(kernel='linear', C=1)
In [5]: clf.fit(X, Y)
In [6]: clf.support_vectors_
Out[6]: array([[-1. ,  1. ],
       [-1. , -1. ],
       [ 1. , -1. ],
       [-0.1, -0.1]])
In [7]: clf.dual_coef_
Out[7]: array([[-0.45, -0.6 ,  0.05,  1.  ]])
In [8]: clf.coef_
Out[8]: array([[  1.00000000e+00,   1.49011611e-09]])
In [9]: clf.intercept_
Out[9]: array([-0.])
(figure: the violating point is labeled α=C)
(left panel: the same hard-margin run as above; right panel: keep the outlier but set C=1e10)
In [2]: X = [[1,1], [1,-1], [-1,1], [-1,-1], [-0.1,-0.1]]
In [3]: Y = [1,1,-1,-1, 1]
In [12]: clf = svm.SVC(kernel='linear', C=1e10)
In [13]: clf.fit(X, Y)
In [14]: clf.coef_
Out[14]: array([[  2.02010102e+00,   1.00999543e-04]])
In [15]: clf.intercept_
Out[15]: array([ 1.02013469])
In [16]: clf.support_vectors_
Out[16]: array([[-1.  ,  1.  ],
       [-1.  , -1.  ],
       [-0.01, -0.01]])
In [17]: clf.dual_coef_
Out[17]: array([[-1.01000001, -1.03050607,  2.04050608]])
(left panel: the same hard-margin run as above; right panel: add a far-away + point [-2,0] and use C=1)
In [2]: X = [[1,1], [1,-1], [-1,1], [-1,-1], [-2,0]]
In [3]: Y = [1,1,-1,-1, 1]
In [4]: clf = svm.SVC(kernel='linear', C=1)
In [5]: clf.fit(X, Y)
In [6]: clf.coef_
Out[6]: array([[ 1.,  0.]])
In [7]: clf.support_vectors_
Out[7]: array([[-1.,  1.],
       [-1., -1.],
       [ 1.,  1.],
       [ 1., -1.],
       [-2.,  0.]])
In [8]: clf.dual_coef_
Out[8]: array([[-1. , -1. ,  0.5,  0.5,  1. ]])
In [9]: clf.intercept_
Out[9]: array([-0.])
(figure: three points at α=C)
αi = 0 ⇒ yi [⟨w, xi⟩ + b] ≥ 1
0 < αi < C ⇒ yi [⟨w, xi⟩ + b] = 1
αi = C ⇒ yi [⟨w, xi⟩ + b] ≤ 1
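As an illustration (not part of the original demo), these three cases can be checked on a fitted model; the sketch below reuses the soft-margin toy data from above and computes ξi = max(0, 1 − yi f(xi)) and αi = |dual_coef_| for each support vector.

    import numpy as np
    from sklearn import svm

    X = np.array([[1., 1.], [1., -1.], [-1., 1.], [-1., -1.], [-0.1, -0.1]])
    y = np.array([1, 1, -1, -1, 1])
    C = 1.0
    clf = svm.SVC(kernel='linear', C=C).fit(X, y)

    alpha = np.abs(clf.dual_coef_[0])                      # alpha_i for each support vector
    margins = y[clf.support_] * clf.decision_function(X[clf.support_])
    xi = np.maximum(0.0, 1.0 - margins)                    # slack xi = max(0, 1 - y f(x))

    for a, s in zip(alpha, xi):
        if s > 1:
            kind = "misclassified (xi > 1, alpha = C)"
        elif s > 1e-9:
            kind = "margin violation (0 < xi <= 1, alpha = C)"
        else:
            kind = "on the margin (xi = 0)"
        print(f"alpha={a:.2f}  xi={s:.2f}  {kind}")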
what if C=1e10?
(figures: decision boundaries for C=1 and C=20)
Optimization: From Constrained Optimization to Unconstrained Optimization (back to Primal)
(slides from Andrew Zisserman (Oxford/DeepMind), with annotations)

Learning an SVM has been formulated as a constrained optimization problem over w and ξ:
min_{w∈Rᵈ, ξi∈R⁺} ||w||² + C Σi^N ξi  subject to  yi (wᵀxi + b) ≥ 1 − ξi for i = 1 … N
The constraint yi (wᵀxi + b) ≥ 1 − ξi can be written more concisely as yi f(xi) ≥ 1 − ξi, which, together with ξi ≥ 0, is equivalent to
ξi = max(0, 1 − yi f(xi))
Hence the learning problem is equivalent to the unconstrained optimization problem over w:
min_{w∈Rᵈ} ||w||² + C Σi^N max(0, 1 − yi f(xi))
(regularization term ||w||² plus loss function Σi max(0, 1 − yi f(xi)))
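A small numpy sketch (mine, not from the slides) of this unconstrained objective; the weights and data are illustrative:

    import numpy as np

    def svm_objective(w, b, X, y, C):
        """Unconstrained primal: ||w||^2 + C * sum_i max(0, 1 - y_i f(x_i))."""
        scores = X @ w + b                          # f(x_i) = w^T x_i + b
        hinge = np.maximum(0.0, 1.0 - y * scores)   # per-example slack / hinge loss
        return w @ w + C * hinge.sum()

    X = np.array([[1., 1.], [1., -1.], [-1., 1.], [-1., -1.]])
    y = np.array([1., 1., -1., -1.])
    print(svm_objective(np.array([1., 0.]), 0.0, X, y, C=1.0))   # = 1.0: no slack, ||w||^2 = 1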
Loss function (w determines ξ)
(figure: separator wᵀx + b = 0 with support vectors)
min_{w∈Rᵈ} ||w||² + C Σi^N max(0, 1 − yi f(xi))
Points are in three categories:
Point is outside the margin: no contribution to the loss.
Point is on the margin: no contribution to the loss, as in the hard-margin case.
Point violates the margin constraint: contributes to the loss (margin violation ξ > 0, including misclassification ξ > 1).
Loss functions
hinge loss: max(0, 1 − yi f(xi)), plotted as a function of yi f(xi)
(figure: yi f(xi) ≥ 1 is good; 0 < yi f(xi) < 1 is not good enough; yi f(xi) < 0 is very bad — a misclassification)
(the perceptron uses a shifted hinge loss touching the origin)
SVM: min_{w∈Rᵈ} C Σi^N max(0, 1 − yi f(xi)) + ||w||²
both terms are convex: convex + convex = convex!
Gradient (or steepest) descent algorithm for SVM
First, rewrite the optimization problem as an average:
min_w C(w) = λ/2 ||w||² + 1/N Σi^N max(0, 1 − yi f(xi)) = 1/N Σi^N ( λ/2 ||w||² + max(0, 1 − yi f(xi)) )
(with λ = 2/(NC) up to an overall scale of the problem) and f(x) = wᵀx + b
Because the hinge loss is not differentiable, a sub-gradient is computed.
To minimize a cost function C(w), use the iterative update
wt+1 ← wt − ηt ∇w C(wt)
where η is the learning rate.
Sub-gradient for hinge loss
L(xi, yi; w) = max(0, 1 − yi f(xi)),  f(xi) = wᵀxi + b
sub-gradients (as a function of yi f(xi)):
∂L/∂w = −yi xi  if yi f(xi) < 1
∂L/∂w = 0       otherwise
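A one-function sketch (illustrative) of this sub-gradient:

    import numpy as np

    def hinge_subgradient(w, b, x, y):
        """A sub-gradient of max(0, 1 - y (w.x + b)) with respect to w."""
        if y * (x @ w + b) < 1:
            return -y * x            # margin violated: -y_i x_i
        return np.zeros_like(w)      # zero is a valid (sub-)gradient here

    print(hinge_subgradient(np.array([1., 0.]), 0.0, np.array([0.5, 2.0]), 1))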
Sub-gradient descent algorithm for SVM
C(w) = 1/N Σi^N ( λ/2 ||w||² + L(xi, yi; w) )
The iterative update is
wt+1 ← wt − η ∇wt C(wt) = wt − η 1/N Σi^N ( λ wt + ∇w L(xi, yi; wt) )
where η is the learning rate  (this is the batch gradient).
Then each iteration t involves cycling through the training data with the updates:
wt+1 ← wt − η (λ wt − yi xi)   if yi f(xi) < 1   (just like the perceptron, whose condition is yi f(xi) ≤ 0)
wt+1 ← wt − η λ wt             otherwise
In the Pegasos algorithm the learning rate is set at ηt = 1/(λt).
Pegasos – Stochastic Gradient Descent Algorithm
Randomly sample from the training data
(figure: convergence plot — objective ("energy") vs. iteration for Pegasos)
SGD is an online update: the gradient on one example (unbiasedly) approximates the gradient on the whole training data.
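A short Pegasos-style SGD sketch (illustrative; it follows the update above with ηt = 1/(λt) and ignores the bias term for simplicity):

    import numpy as np

    def pegasos(X, y, lam=0.1, epochs=100, seed=0):
        """Pegasos-style SGD for the (bias-free) objective
        lambda/2 ||w||^2 + (1/N) sum_i max(0, 1 - y_i w.x_i)."""
        rng = np.random.default_rng(seed)
        n, d = X.shape
        w = np.zeros(d)
        t = 0
        for _ in range(epochs):
            for i in rng.permutation(n):        # randomly sample from the training data
                t += 1
                eta = 1.0 / (lam * t)           # Pegasos learning rate eta_t = 1/(lambda t)
                if y[i] * (X[i] @ w) < 1:       # margin violated: sub-gradient is lambda w - y_i x_i
                    w = (1 - eta * lam) * w + eta * y[i] * X[i]
                else:
                    w = (1 - eta * lam) * w
        return w

    # illustrative toy data
    X = np.array([[1., 1.], [1., -1.], [-1., 1.], [-1., -1.]])
    y = np.array([1., 1., -1., -1.])
    print(pegasos(X, y))                        # converges toward roughly [1, 0] on this toy data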