Introduction to Machine Learning CMU-10701
Support Vector Machines
Barnabás Póczos & Aarti Singh 2014 Spring
Linear classifiers: which line is better?
http://barnabas-cmu-10701.appspot.com/
Which decision boundary is better?
(Figure: points of Class 1 and Class 2 separated by a linear decision boundary, with the margin marked as the gap to the closest points of each class.)
Data: training samples $(x_i, y_i)$, $i = 1, \dots, n$, with inputs $x_i \in \mathbb{R}^m$ and labels $y_i \in \{-1, +1\}$.
The hyperplane $w \cdot x + b = 0$ splits the input space into two half-spaces: $w \cdot x + b > 0$ on one side and $w \cdot x + b < 0$ on the other.
The weight vector $w$ is normal to the separating hyperplane.
Classify as: $+1$ if $w \cdot x + b \ge 1$; $-1$ if $w \cdot x + b \le -1$; Universe explodes if $-1 < w \cdot x + b < 1$.
Plus-plane: $\{x : w \cdot x + b = +1\}$. Minus-plane: $\{x : w \cdot x + b = -1\}$. Classifier boundary: $\{x : w \cdot x + b = 0\}$.
Classification rule: $\hat{y}(x) = \operatorname{sign}(w \cdot x + b)$. Goal: find the maximum margin classifier. How large is the margin of this classifier?
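A minimal sketch (my own illustration with made-up weights, not code from the lecture) of this classification rule in NumPy:

```python
# Linear classification rule sign(w . x + b); w, b and the test points below
# are hypothetical placeholders.
import numpy as np

def classify(X, w, b):
    """Return +1 / -1 labels for the rows of X under the rule sign(w . x + b)."""
    return np.where(X @ w + b >= 0, 1, -1)

w, b = np.array([1.0, -2.0]), 0.5
print(classify(np.array([[3.0, 1.0], [0.0, 2.0]]), w, b))   # -> [ 1 -1]
```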
The plus-plane is $\{x : w \cdot x + b = +1\}$ and the minus-plane is $\{x : w \cdot x + b = -1\}$.
Let $x^+$ be a point on the plus-plane and $x^-$ the closest point to it on the minus-plane, so $x^+ = x^- + \lambda w$ for some $\lambda > 0$.
From $w \cdot (x^- + \lambda w) + b = 1$ we get $-1 + \lambda\,(w \cdot w) = 1$, so $\lambda = \dfrac{2}{w \cdot w}$.
Margin width: $M = \|x^+ - x^-\| = \lambda \|w\| = \dfrac{2}{\sqrt{w \cdot w}}$.
Maximizing $M = \dfrac{2}{\sqrt{w \cdot w}}$ is the same as minimizing $w \cdot w$!
We can assume $b = 0$ (for example by appending a constant $1$ feature to $x$, so that the bias is absorbed into $w$).
The rule "classify as $+1$ if $w \cdot x_i + b \ge 1$ and $-1$ if $w \cdot x_i + b \le -1$" is the same as requiring $y_i (w \cdot x_i + b) \ge 1$ for every training point.
Hard-margin SVM (primal): find $w$ and $b$ to minimize $\tfrac{1}{2}\, w \cdot w$ subject to $y_i (w \cdot x_i + b) \ge 1$, $i = 1, \dots, n$.
This is a QP problem (m-dimensional): quadratic cost function, linear constraints.
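A minimal sketch (my own, assuming SciPy is available; the toy data are hypothetical, not from the lecture) that solves this small primal QP with a general-purpose constrained optimizer:

```python
# Hard-margin primal: minimize 1/2 w.w subject to y_i (w.x_i + b) >= 1.
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data (hypothetical example)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def objective(params):                  # params = (w_1, ..., w_m, b)
    w = params[:-1]
    return 0.5 * w @ w

constraints = [{"type": "ineq",         # "ineq" means fun(params) >= 0
                "fun": lambda p, i=i: y[i] * (X[i] @ p[:-1] + p[-1]) - 1.0}
               for i in range(len(y))]

res = minimize(objective, x0=np.zeros(X.shape[1] + 1), constraints=constraints)
w, b = res.x[:-1], res.x[-1]
print("w:", w, "b:", b, "margin:", 2.0 / np.linalg.norm(w))
```

For real problems one would use a dedicated QP or SVM solver; this is only to make the optimization problem concrete.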
Efficient algorithms exist for QP; in practice they often solve the dual problem instead of the primal.
Moving the constraints into the objective function gives the Lagrangian
$L(w, b, \alpha) = \tfrac{1}{2}\, w \cdot w - \sum_i \alpha_i \big[ y_i (w \cdot x_i + b) - 1 \big]$, with multipliers $\alpha_i \ge 0$.
Solve: $\min_{w, b}\; \max_{\alpha \ge 0}\; L(w, b, \alpha)$.
The $i$-th constraint is active exactly when $\alpha_i > 0$.
Solving: set the derivatives of $L$ to zero:
$\partial L / \partial w = 0 \;\Rightarrow\; w = \sum_i \alpha_i y_i x_i$, and $\partial L / \partial b = 0 \;\Rightarrow\; \sum_i \alpha_i y_i = 0$.
When $\alpha_i > 0$, the constraint is tight: $y_i (w \cdot x_i + b) = 1$.
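The slides' derivation is shown only as pictures; as a bridge to the next slides, here is the standard substitution of $w = \sum_i \alpha_i y_i x_i$ back into the Lagrangian, which produces the dual objective:

```latex
\begin{aligned}
L(w, b, \alpha)
  &= \tfrac{1}{2}\, w \cdot w - \sum_i \alpha_i \big[ y_i (w \cdot x_i + b) - 1 \big] \\
  &= \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)
     - \sum_{i,j} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)
     - b \underbrace{\sum_i \alpha_i y_i}_{=\,0}
     + \sum_i \alpha_i \\
  &= \sum_i \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j).
\end{aligned}
```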
Lagrange function (general form): for a problem $\min_w f(w)$ subject to $g_i(w) \le 0$, define $L(w, \alpha) = f(w) + \sum_i \alpha_i g_i(w)$ with $\alpha_i \ge 0$.
Primal problem: $\min_w \max_{\alpha \ge 0} L(w, \alpha)$, which equals the original constrained problem.
The Lagrange dual problem: $\max_{\alpha \ge 0} \min_w L(w, \alpha)$. Its value never exceeds the primal value (weak duality); for the SVM QP the two optima coincide (strong duality).
The dual of the hard-margin SVM is again a Quadratic Program, but now n-dimensional (one variable $\alpha_i$ per training sample):
maximize $\sum_i \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$ subject to $\alpha_i \ge 0$ and $\sum_i \alpha_i y_i = 0$.
Lemma: the primal solution is recovered from the dual solution as $w = \sum_i \alpha_i y_i x_i$ and $b = y_j - w \cdot x_j$ for any $j$ with $\alpha_j > 0$.
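A minimal sketch (assuming the cvxopt package; the toy data are hypothetical, not from the lecture) of solving this dual QP numerically and recovering $w$ and $b$:

```python
# Hard-margin dual: min_a 1/2 a^T P a - 1^T a  s.t.  a >= 0,  y^T a = 0,
# with P_ij = y_i y_j (x_i . x_j).
import numpy as np
from cvxopt import matrix, solvers

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

P = matrix(np.outer(y, y) * (X @ X.T))
q = matrix(-np.ones(n))
G = matrix(-np.eye(n))               # -a <= 0, i.e. a >= 0
h = matrix(np.zeros(n))
A = matrix(y.reshape(1, -1))         # y^T a = 0
b = matrix(0.0)

solvers.options["show_progress"] = False
alpha = np.array(solvers.qp(P, q, G, h, A, b)["x"]).ravel()

w = (alpha * y) @ X                  # w = sum_i alpha_i y_i x_i
sv = alpha > 1e-6                    # support vectors: alpha_i > 0
b_val = np.mean(y[sv] - X[sv] @ w)   # b from the tight constraints
print("support vectors:", np.where(sv)[0], "w:", w, "b:", b_val)
```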
Hard SVM assumes the samples are linearly separable...
What can we do if the data is not linearly separable???
If the data set is not linearly separable, then by adding new features (mapping the data into a larger feature space) it might become linearly separable.
Make up a new feature! Sort of… a feature computed from the existing ones: map each 1-D point $x_k$ to $(x_k,\, x_k^2)$.
In 1-D no single threshold on $x$ separates the two classes (the figure shows them interleaved around $x = 0$); after adding the squared feature, a straight line separates them. Separable! MAGIC!
Now drop this "augmented" data into our linear SVM.
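A minimal sketch (my own toy data, and scikit-learn's LinearSVC as the linear SVM; neither comes from the lecture) of exactly this trick:

```python
# 1-D data that no single threshold separates becomes linearly separable
# after augmenting each point x with the extra feature x^2.
import numpy as np
from sklearn.svm import LinearSVC

x = np.array([-3.0, -2.5, 2.5, 3.0, -0.5, 0.0, 0.5, 1.0])
y = np.array([1, 1, 1, 1, -1, -1, -1, -1])        # far from 0 -> +1, near 0 -> -1

X_orig = x.reshape(-1, 1)                          # original 1-D feature
X_aug = np.column_stack([x, x ** 2])               # augmented features (x, x^2)

svm = LinearSVC(C=1e4, max_iter=10_000)
print(svm.fit(X_orig, y).score(X_orig, y))         # below 1.0: not separable
print(svm.fit(X_aug, y).score(X_aug, y))           # 1.0: separable in (x, x^2)
```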
n points in general position in an (n−1)-dimensional space are always linearly separable by a hyperplane!
⇒ it is good to map the data into high-dimensional spaces.
Having n training points, is it always good enough to map the data into a feature space of dimension n−1? Even if we don't know how many test points we will get, or what they are... we might want to map our data into a huge (even infinite-dimensional) feature space.
Overfitting? Generalization error?... We don't care for now...
Use features of features of features of features….
Solutions: map the data into a high-dimensional feature space
⇒ the training samples become linearly separable in the feature space
⇒ hard SVM can be applied
⇒ but this risks overfitting...
(and hard SVM still assumes the samples are linearly separable)
The hard SVM problem can be rewritten without constraints:
$\min_{w, b}\; \tfrac{1}{2}\, w \cdot w + \sum_i \ell_{\infty}\big( y_i (w \cdot x_i + b) \big)$,
where the loss $\ell_{\infty}(z)$ is $\infty$ for a misclassification / margin violation ($z < 1$) and $0$ for a correct classification ($z \ge 1$).
We can try to solve the soft version of it: instead of hard constraints (which require the points to be linearly separable), your loss is only 1 instead of $\infty$ if you misclassify an instance:
$\min_{w, b}\; \tfrac{1}{2}\, w \cdot w + C \sum_i \ell_{0/1}\big( y_i (w \cdot x_i + b) \big)$,
where $\ell_{0/1}(z) = 1$ for a misclassification ($z < 0$) and $0$ for a correct classification ($z \ge 0$).
The 0-1 loss is not convex in $y f(x)$ ⇒ it is not convex in $w$, either... and we like only convex functions... Let us approximate it with convex functions!
Picture is taken from R. Herbrich
One convex upper bound is the hinge loss $\max\big(0,\, 1 - y_i (w \cdot x_i + b)\big)$, which gives the soft-SVM objective
$\min_{w, b}\; \tfrac{1}{2}\, w \cdot w + C \sum_i \max\big(0,\, 1 - y_i (w \cdot x_i + b)\big)$,
where $C > 0$ trades off the margin against the training errors.
The hinge loss upper bounds the 0-1 loss
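A quick numeric check (my own sketch, not from the slides) that the hinge loss indeed upper bounds the 0-1 loss at every margin value:

```python
# Hinge loss max(0, 1 - z) vs. 0-1 loss 1[z < 0], where z = y * f(x).
import numpy as np

z = np.linspace(-3, 3, 601)              # margin values y * f(x)
zero_one = (z < 0).astype(float)         # 0-1 loss
hinge = np.maximum(0.0, 1.0 - z)         # hinge loss
assert np.all(hinge >= zero_one)         # the upper bound holds everywhere
```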
Equivalently, introducing slack variables $\xi_i \ge 0$:
$\min_{w, b, \xi}\; \tfrac{1}{2}\, w \cdot w + C \sum_i \xi_i$ subject to $y_i (w \cdot x_i + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$,
where at the optimum $\xi_i = \max\big(0,\, 1 - y_i (w \cdot x_i + b)\big)$. We can use this form, too.
What is the dual form of the primal soft SVM? It is
$\max_{\alpha}\; \sum_i \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$ subject to $0 \le \alpha_i \le C$ and $\sum_i \alpha_i y_i = 0$,
where the only change from the hard-margin dual is the upper bound $\alpha_i \le C$ (box constraints).
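A minimal sketch (my own; plain subgradient descent on the hinge-loss form of the objective rather than the QP, with hypothetical toy data) of training the soft SVM:

```python
# Subgradient descent on  1/2 ||w||^2 + C * sum_i max(0, 1 - y_i (w.x_i + b)).
import numpy as np

def soft_svm_subgradient(X, y, C=1.0, lr=0.01, epochs=500):
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1                               # inside the margin or misclassified
        grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])   # toy data
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = soft_svm_subgradient(X, y)
print(np.sign(X @ w + b))               # should reproduce y on this toy set
```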
KKT conditions: at the optimum $\alpha_i \ge 0$, the primal constraints $y_i (w \cdot x_i + b) \ge 1$ hold, and complementary slackness $\alpha_i \big[ y_i (w \cdot x_i + b) - 1 \big] = 0$ holds for every $i$.
(Figure: the decision boundary with the half-spaces $w \cdot x + b > 0$ and $w \cdot x + b < 0$, and the margin marked on both sides.)
Hard SVM: the linear hyperplane is defined by the "support vectors"; moving the other points a little doesn't affect the decision boundary. Only the support vectors are needed to predict the labels of new points.
Only a few $\alpha_j$ can be non-zero: exactly those where the constraint is tight, i.e. $y_j (\langle w, x_j \rangle + b) = 1$. The points with $\alpha_j > 0$ are the support vectors; for all other points $\alpha_j = 0$.
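A minimal sketch (assuming scikit-learn; random toy data of my own) showing that a trained SVM keeps only a few support vectors, which are all the predictor needs:

```python
# Fit a linear SVM and inspect how many training points became support vectors.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+2, 1, size=(50, 2)),
               rng.normal(-2, 1, size=(50, 2))])
y = np.array([1] * 50 + [-1] * 50)

clf = SVC(kernel="linear", C=10.0).fit(X, y)
print("support vectors:", clf.support_vectors_.shape[0], "out of", len(y))
print("dual coefficients (alpha_i * y_i) shape:", clf.dual_coef_.shape)
```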
One-vs-rest: learn 3 classifiers separately, class k vs. the rest, giving $(w_k, b_k)$ for $k = 1, 2, 3$; predict $y = \arg\max_k\; w_k \cdot x + b_k$.
But the $w_k$'s may not be on the same scale. Note: if $(w, b)$ separates one class from the rest, then $(a w,\, a b)$ for any $a > 0$ is also a solution, so the raw scores are not directly comparable.
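A minimal sketch (my own, with hypothetical 3-class toy data and scikit-learn's LinearSVC as the binary learner) of this one-vs-rest scheme:

```python
# Train one binary SVM per class (class k vs. the rest) and predict by arg-max
# over the raw scores w_k . x + b_k.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(1)
centers = [[0, 3], [3, 0], [-3, -3]]
X = np.vstack([rng.normal(c, 0.5, size=(30, 2)) for c in centers])
y = np.repeat([0, 1, 2], 30)

models = [LinearSVC(C=1.0, max_iter=10_000).fit(X, (y == k).astype(int))
          for k in range(3)]

def predict(x):
    scores = [m.decision_function(x.reshape(1, -1))[0] for m in models]
    return int(np.argmax(scores))

print(predict(np.array([0.2, 2.8])))    # should be class 0 for this toy layout
```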
Multiclass SVM (joint optimization): simultaneously learn the 3 sets of weights; predict $y = \arg\max_k\; w^{(k)} \cdot x + b^{(k)}$.
Constraints: for each training point, the score of the correct class must beat the score of every other class; the margin is the gap between the correct class and the nearest other class.
With joint optimization the $w^{(k)}$'s are on the same scale.
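One standard way to write this joint problem (my formulation of the "margin = gap between correct class and nearest" idea; the slides' exact notation is not visible in this transcript):

```latex
\min_{\{w^{(k)},\, b^{(k)}\}_{k=1}^{K}} \;\; \tfrac{1}{2} \sum_{k=1}^{K} \|w^{(k)}\|^2
\qquad \text{s.t.} \qquad
w^{(y_i)} \cdot x_i + b^{(y_i)} \;\ge\; w^{(k)} \cdot x_i + b^{(k)} + 1
\quad \forall i,\ \forall k \ne y_i .
```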
SVM: hinge loss $\max\big(0,\, 1 - y (w \cdot x + b)\big)$, a convex upper bound on the 0-1 loss.
Logistic regression: log loss (negative log conditional likelihood) $\log\big(1 + e^{-y (w \cdot x + b)}\big)$.
Both the hinge loss and the log loss are convex surrogates for the 0-1 loss.
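A small sketch (my own) comparing the two surrogates as functions of the margin $z = y(w \cdot x + b)$; the hinge loss bounds the 0-1 loss directly, the log loss does so after rescaling by $1/\log 2$:

```python
import numpy as np

z = np.linspace(-4, 4, 801)
hinge = np.maximum(0.0, 1.0 - z)          # SVM surrogate
log_loss = np.log1p(np.exp(-z))           # logistic regression surrogate
zero_one = (z < 0).astype(float)

assert np.all(hinge >= zero_one)
assert np.all(log_loss / np.log(2) >= zero_one)
```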
"Without b": if we fix $b = 0$, the dual has no equality constraint. "With b": the dual keeps the constraint $\sum_i \alpha_i y_i = 0$, which came from $\partial L / \partial b = 0$.
$\Phi(x) = \big(x_1^3,\ x_2^3,\ x_3^3,\ x_1^2 x_2 x_3,\ \dots\big)$
For example, polynomial kernels of degree d: d = 1 gives a linear decision boundary, d = 2 a quadratic one.
m – number of input features, d – degree of the polynomial. The number of degree-d monomials is $\binom{m+d-1}{d}$, which grows fast: for d = 6 and m = 100 it is about 1.6 billion terms.
The feature space becomes really large very quickly!
$\Phi(x)$ is a high-dimensional feature map, but we never need it explicitly, as long as we can compute the dot product $\Phi(x) \cdot \Phi(z)$ fast using some kernel $K(x, z)$.
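A minimal sketch (my own example, using the homogeneous degree-2 kernel rather than any specific kernel from the slides) of this "kernel trick": the kernel value equals the dot product of the explicit quadratic feature maps, without ever forming them:

```python
# K(x, z) = (x . z)^2 equals Phi(x) . Phi(z) with Phi(x) = (x_i * x_j for all i, j).
import numpy as np

def phi(x):
    return np.outer(x, x).ravel()          # explicit m^2-dimensional feature map

rng = np.random.default_rng(0)
x, z = rng.normal(size=5), rng.normal(size=5)

explicit = phi(x) @ phi(z)                 # m^2 products
kernel = (x @ z) ** 2                      # m products, same value
print(np.isclose(explicit, kernel))        # True
```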
The Gaussian (RBF) kernel $K(x, z) = \exp\big(-\|x - z\|^2 / (2\sigma^2)\big)$ corresponds to an infinite-dimensional feature space (recall the series expansion of the exponential function).
Which functions can be used as kernels??? …and why are they called kernels???
A symmetric function $K$ can be used as a kernel when it is positive semi-definite (Mercer's condition); with a non-PSD "kernel" the QP is no longer convex ⇒ trouble!
A 2nd-order (quadratic) decision boundary in the input space is a conic section: parabola, hyperbola, or ellipse.
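A minimal sketch (assuming scikit-learn; the circular toy data are my own) of fitting an SVM with a degree-2 polynomial kernel, whose decision boundary in the input space is such a conic section:

```python
# Degree-2 polynomial kernel SVM on data whose true boundary is a circle.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(np.sum(X ** 2, axis=1) < 1.0, 1, -1)     # inside the unit circle -> +1

clf = SVC(kernel="poly", degree=2, coef0=1.0, C=10.0).fit(X, y)
print("training accuracy:", clf.score(X, y))          # should be close to 1
```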