10. Support Vector Machines
Foundations of Machine Learning, CentraleSupélec, Fall 2017
Chloé-Agathe Azencott, Centre for Computational Biology, Mines ParisTech
chloe-agathe.azencott@mines-paristech.fr
Learning objectives
- Define a large-margin classifier in the separable case.
- Write the corresponding primal and dual optimization problems.
- Re-write the optimization problem in the case of non-separable data.
- Use the kernel trick to apply soft-margin SVMs to non-linear cases.
- Define kernels for real-valued data, strings, and graphs.
The linearly separable case: hard-margin SVMs
Linear classifier
Assume the data is linearly separable: there exists a line that separates + from -.
Linear classifier
Which one is better?
Margin of a linear classifier
Margin: twice the distance from the separating hyperplane to the closest training point.
Largest margin classifier: Support vector machines
Support vectors
Formalization
- Training set: {(x_i, y_i)}, i = 1, ..., n, with x_i in ℝ^p and y_i in {-1, +1}.
- What are the equations of the 3 parallel hyperplanes?
- How is the "blue" region defined? The "orange" one?
Largest margin hyperplane
What is the size of the margin γ?
For the canonical hyperplanes ⟨w, x⟩ + b = +1 and ⟨w, x⟩ + b = -1, the margin is γ = 2 / ||w||.
Optimization problem
- Training set: {(x_i, y_i)}, i = 1, ..., n.
- Assume the data to be linearly separable.
- Goal: find (w, b) that define the hyperplane with the largest margin.
Optimization problem
- Margin maximization: maximizing γ = 2 / ||w|| amounts to minimizing (1/2) ||w||².
- Correct classification of the training points:
– For positive examples: ⟨w, x_i⟩ + b ≥ +1
– For negative examples: ⟨w, x_i⟩ + b ≤ -1
– Summarized as: y_i (⟨w, x_i⟩ + b) ≥ 1
- Optimization problem:
minimize (1/2) ||w||² over (w, b), subject to y_i (⟨w, x_i⟩ + b) ≥ 1 for i = 1, ..., n.
Optimization problem
- Find (w, b) that minimize (1/2) ||w||² under the n constraints y_i (⟨w, x_i⟩ + b) - 1 ≥ 0.
- We introduce one dual variable α_i for each constraint (i.e. each training point).
- Lagrangian: L(w, b, α) = (1/2) ||w||² - Σ_i α_i [y_i (⟨w, x_i⟩ + b) - 1].
Lagrange dual of the SVM
- Lagrange dual function: q(α) = min over (w, b) of L(w, b, α).
- Lagrange dual problem: maximize q(α) subject to α ≥ 0.
- Strong duality: under Slater's conditions, the optimum of the primal is the optimum of the dual. Here the function to optimize is convex and the constraints are affine, so strong duality holds.
Minimizing the Lagrangian of the SVM
- L(w, b, α) is convex quadratic in w and minimized for: ∇_w L = 0, i.e. w = Σ_i α_i y_i x_i.
- L(w, b, α) is affine in b. Its minimum is -∞, except if: Σ_i α_i y_i = 0.
SVM dual problem
- Lagrange dual function: q(α) = Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y_i y_j ⟨x_i, x_j⟩ when Σ_i α_i y_i = 0 (and -∞ otherwise).
- Dual problem: maximize q(α) subject to α ≥ 0.
Maximizing a quadratic function under box constraints can be solved efficiently using dedicated software (a minimal sketch follows).
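To make the dual problem concrete, here is a minimal sketch (not from the slides) that solves this quadratic program on a tiny, hypothetical toy set using SciPy's SLSQP solver; the variable names (X, y, alpha) are illustrative.

# Hard-margin SVM dual: maximize sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j <x_i, x_j>
# subject to alpha_i >= 0 and sum_i alpha_i y_i = 0.
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])  # toy separable data
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

G = (y[:, None] * X) @ (y[:, None] * X).T            # G_ij = y_i y_j <x_i, x_j>
neg_dual = lambda a: 0.5 * a @ G @ a - a.sum()       # we minimize the negative dual

res = minimize(neg_dual, np.zeros(n), method="SLSQP",
               bounds=[(0.0, None)] * n,                             # alpha_i >= 0
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])  # sum_i alpha_i y_i = 0
alpha = res.x

w = (alpha * y) @ X                  # w* = sum_i alpha_i y_i x_i
sv = np.argmax(alpha)                # any point with alpha_i > 0 is a support vector
b = y[sv] - X[sv] @ w                # from y_i (<w*, x_i> + b*) = 1
print(np.round(alpha, 3), w, b)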
Optimal hyperplane
- Once the optimal α* is found, we recover (w*, b*): w* = Σ_i α_i* y_i x_i.
- Determining b*:
– The closest positive point to the separating hyperplane verifies ⟨w*, x⟩ + b* = +1.
– The closest negative point to the separating hyperplane verifies ⟨w*, x⟩ + b* = -1.
- The decision function is hence: f(x) = sign(Σ_i α_i* y_i ⟨x_i, x⟩ + b*).
Lagrangian
- Minimize f(w) under the constraint g(w) ≥ 0 (abusive notation: g(w) stands for g(w, b)).
(Figure: iso-contours of f, the feasible region where g(w) ≥ 0, and the unconstrained minimum of f.)
- If the unconstrained minimum of f lies in the feasible region, it is the solution.
- If it does not, the solution lies on the boundary g(w) = 0; in terms of gradients, ∇f(w) and ∇g(w) are then parallel: ∇f(w) = α ∇g(w) for some α ≥ 0.
- Case 1: the unconstrained minimum lies in the feasible region; then ∇f(w) = 0 and we can take α = 0.
- Case 2: it does not; then g(w) = 0 and ∇f(w) = α ∇g(w) with α > 0.
- Both cases are summarized as: ∇f(w) = α ∇g(w), with α ≥ 0 and α g(w) = 0.
- Lagrangian: L(w, α) = f(w) - α g(w); α is called the Lagrange multiplier.
- Minimize f(w) under the n constraints g_i(w) ≥ 0, i = 1, ..., n.
- How do we deal with n constraints? Use n Lagrange multipliers α_1, ..., α_n ≥ 0.
- Lagrangian: L(w, α) = f(w) - Σ_i α_i g_i(w).
Support vectors
- Karush-Kuhn-Tucker conditions: at the optimum, α_i g_i(w) = 0 for each i.
Either α_i = 0 (case 1) or g_i(w) = 0 (case 2).
- For the SVM, g_i(w) = 0 means y_i (⟨w, x_i⟩ + b) = 1: the point lies exactly on the margin.
- Training points with α_i > 0 are the support vectors; points with α_i = 0 do not contribute to the solution.
The non-linearly separable case: soft-margin SVMs
Soft-margin SVMs
What if the data are not linearly separable?
Soft-margin SVMs
- Find a trade-off between a large margin and few errors. What does this remind you of?
SVM error: hinge loss
- We want, for all i: y_i (⟨w, x_i⟩ + b) ≥ 1.
- Hinge loss function: max(0, 1 - y f(x)).
- The hinge loss is zero when y f(x) ≥ 1 and increases linearly as y f(x) decreases below 1.
Soft-margin SVMs
- Find a trade-off between a large margin and few errors.
- Error: the sum of the hinge losses over the training points, Σ_i max(0, 1 - y_i (⟨w, x_i⟩ + b)).
- The soft-margin SVM solves (see the sketch below):
minimize (1/2) ||w||² + C Σ_i max(0, 1 - y_i (⟨w, x_i⟩ + b)).
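As a small illustration (mine, not the lecture's code), the objective above can be evaluated directly with NumPy for any candidate (w, b); the data below is a hypothetical stand-in.

# Soft-margin SVM objective: 1/2 ||w||^2 + C * sum_i max(0, 1 - y_i (<w, x_i> + b))
import numpy as np

def hinge_loss(y, scores):
    # hinge loss max(0, 1 - y f(x)), labels y in {-1, +1}
    return np.maximum(0.0, 1.0 - y * scores)

def soft_margin_objective(w, b, X, y, C=1.0):
    scores = X @ w + b
    return 0.5 * np.dot(w, w) + C * hinge_loss(y, scores).sum()

X = np.array([[1.0, 1.0], [2.0, 0.5], [-1.0, -1.0], [-0.5, -2.0], [0.2, -0.1]])
y = np.array([1, 1, -1, -1, 1])
print(soft_margin_objective(np.array([0.5, 0.5]), 0.0, X, y, C=10.0))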
The C parameter
- Large C: makes few errors.
- Small C: ensures a large margin.
- Intermediate C: finds a trade-off.
It is important to control C
(Figure: prediction error as a function of C, on the training data and on new data.)
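A curve like the one described above can be reproduced with scikit-learn's validation_curve; this is a minimal sketch on synthetic data, not the figure from the slides.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=10, random_state=0)  # synthetic stand-in
C_range = np.logspace(-3, 3, 10)
train_scores, valid_scores = validation_curve(SVC(kernel="linear"), X, y,
                                              param_name="C", param_range=C_range, cv=5)

plt.semilogx(C_range, 1 - train_scores.mean(axis=1), label="error on training data")
plt.semilogx(C_range, 1 - valid_scores.mean(axis=1), label="error on held-out data")
plt.xlabel("C"); plt.ylabel("prediction error"); plt.legend(); plt.show()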
Slack variables
- The problem above is equivalent to:
minimize (1/2) ||w||² + C Σ_i ξ_i subject to y_i (⟨w, x_i⟩ + b) ≥ 1 - ξ_i and ξ_i ≥ 0.
- The slack variable ξ_i measures the distance between y_i f(x_i) and 1 when the margin constraint is violated.
Lagrangian of the soft-margin SVM
- Primal: minimize (1/2) ||w||² + C Σ_i ξ_i subject to y_i (⟨w, x_i⟩ + b) ≥ 1 - ξ_i and ξ_i ≥ 0.
- Lagrangian: L(w, b, ξ, α, β) = (1/2) ||w||² + C Σ_i ξ_i - Σ_i α_i [y_i (⟨w, x_i⟩ + b) - 1 + ξ_i] - Σ_i β_i ξ_i.
- Minimize the Lagrangian (partial derivatives in w, b, ξ).
- Apply the KKT conditions.
Dual formulation of the soft-margin SVM
- Dual: maximize q(α) = Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y_i y_j ⟨x_i, x_j⟩
- under the constraints: 0 ≤ α_i ≤ C and Σ_i α_i y_i = 0.
- KKT conditions: α_i = 0 for points outside the margin ("easy"), α_i = C for points that violate the margin ("hard"), and 0 < α_i < C for points exactly on the margin ("somewhat hard").
Support vectors of the soft-margin SVM
- α_i = 0: the point is correctly classified, outside the margin (not a support vector).
- 0 < α_i < C: the point lies exactly on the margin.
- α_i = C: the point is inside the margin or misclassified.
Primal vs. dual
- What is the dimension of the primal problem? What about the dual?
- Primal: (w, b) has dimension (p+1). Favored if the data is low-dimensional.
- Dual: α has dimension n. Favored if there is little data available.
The non-linear case: kernel SVMs.
Non-linear SVMs
Non-linear mapping to a feature space
Map the data from the input space (here ℝ) to a feature space (here ℝ²) in which it becomes linearly separable.
SVM in the feature space
- Train: maximize Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y_i y_j ⟨φ(x_i), φ(x_j)⟩
under the constraints 0 ≤ α_i ≤ C and Σ_i α_i y_i = 0.
- Predict with the decision function f(x) = Σ_i α_i y_i ⟨φ(x_i), φ(x)⟩ + b*.
Kernels
- For a given mapping φ from the space of objects X to some Hilbert space H, the kernel between two objects x and x' is the inner product of their images in the feature space:
K(x, x') = ⟨φ(x), φ(x')⟩.
- E.g. the linear kernel: X = ℝ^p, φ the identity, K(x, x') = ⟨x, x'⟩.
- Kernels allow us to formalize the notion of similarity.
Dot product and similarity
- Normalized dot product = cosine of the angle between the two feature vectors.
(Figure: two vectors plotted against feature 1 and feature 2.)
Kernel trick
- Many linear algorithms (in particular, linear SVMs) can be performed in the feature space H without explicitly computing the images φ(x), but instead by computing the kernels K(x, x').
- It is sometimes easy to compute kernels which correspond to large-dimensional feature spaces: K(x, x') is often much simpler to compute than φ(x).
SVM with a kernel
- Train: maximize Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y_i y_j K(x_i, x_j)
under the constraints 0 ≤ α_i ≤ C and Σ_i α_i y_i = 0.
- Predict with the decision function f(x) = Σ_i α_i y_i K(x_i, x) + b* (see the sketch below).
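In practice (a minimal sketch with scikit-learn, not the lecture's code), the kernel can be passed either as a built-in option or as a precomputed Gram matrix:

import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)   # synthetic non-linear data

clf = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X, y)           # built-in Gaussian kernel

K = rbf_kernel(X, X, gamma=1.0)                                # Gram matrix K_ij = K(x_i, x_j)
clf_pre = SVC(kernel="precomputed", C=1.0).fit(K, y)           # same SVM, precomputed kernel

x_new = np.array([[0.0, 0.5]])
print(clf.predict(x_new), clf_pre.predict(rbf_kernel(x_new, X, gamma=1.0)))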
Which functions are kernels?
- A function K(x, x') defined on a set X is a kernel iff there exists a Hilbert space H and a mapping φ: X → H such that, for any x, x' in X: K(x, x') = ⟨φ(x), φ(x')⟩.
- A function K(x, x') defined on a set X is positive definite iff it is symmetric and satisfies:
Σ_i Σ_j a_i a_j K(x_i, x_j) ≥ 0 for any n, any x_1, ..., x_n in X, and any a_1, ..., a_n in ℝ.
- Theorem [Aronszajn, 1950]: K is a kernel iff it is positive definite.
Positive definite matrices
- Have a unique Cholesky decomposition K = L L*, with L lower triangular, with positive elements on the diagonal.
- Their sesquilinear form ⟨a, a'⟩ = a* K a' is an inner product:
– conjugate symmetry
– linearity in the first argument
– positive definiteness
Polynomial kernels
- Compute (1 + ⟨x, x'⟩)² for x, x' in ℝ²: it is the inner product of the explicit feature vectors φ(x) = (1, √2 x₁, √2 x₂, x₁², x₂², √2 x₁x₂).
- More generally, K(x, x') = (1 + ⟨x, x'⟩)^d is an inner product in a feature space of all monomials of degree up to d.
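A quick numerical check (mine) that the degree-2 polynomial kernel on ℝ² really is an inner product of explicit feature maps:

import numpy as np

def phi(x):
    # explicit feature map for (1 + <x, x'>)^2 on R^2
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

x, xp = np.array([1.0, 2.0]), np.array([-0.5, 3.0])
print((1.0 + x @ xp) ** 2, phi(x) @ phi(xp))   # both values coincide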
Gaussian kernel
K(x, x') = exp(-||x - x'||² / (2σ²))
What is the dimension of the feature space?
The feature space has infinite dimension.
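Even though the feature space is infinite-dimensional, the kernel matrix itself is cheap to compute; a minimal sketch (mine) comparing an explicit computation with scikit-learn's rbf_kernel:

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

X = np.random.default_rng(0).normal(size=(5, 3))
sigma = 1.5

sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)   # ||x_i - x_j||^2
K_manual = np.exp(-sq_dists / (2 * sigma ** 2))

# scikit-learn uses exp(-gamma ||x - x'||^2), i.e. gamma = 1 / (2 sigma^2)
K_sklearn = rbf_kernel(X, gamma=1.0 / (2 * sigma ** 2))
print(np.allclose(K_manual, K_sklearn))   # True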
Toy example
Toy example: linear SVM
Toy example: polynomial SVM (d=2)
Kernels for strings
Protein sequence classification
Goal: predict which proteins are secreted or not, based on their sequence.
Substring-based representations
- Represent strings based on the presence/absence of substrings u of fixed length k.
– Number of occurrences of u in x: spectrum kernel [Leslie et al., 2002].
– Number of occurrences of u in x, up to m mismatches: mismatch kernel [Leslie et al., 2004].
– Number of occurrences of u in x, allowing gaps, with a weight decaying exponentially with the number of gaps: substring kernel [Lodhi et al., 2002].
Spectrum kernel
- Implementation:
– Formally, K(x, x') = Σ_u N_u(x) N_u(x') is a sum over |A^k| terms, where N_u(x) is the number of occurrences of the k-mer u in x.
– At most |x| - k + 1 of the terms N_u(x) are non-zero.
– Hence: computation in O(|x| + |x'|).
- Fast prediction for a new sequence x:
f(x) = Σ_i α_i y_i K(x_i, x) + b can be written as a function of only the |x| - k + 1 k-mers of x, i.e. f(x) = Σ_u w_u N_u(x) with weights w_u precomputed from the training sequences (a sketch follows).
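A minimal implementation sketch (mine, not the lecture's) of the spectrum kernel using k-mer count dictionaries, which realizes the O(|x| + |x'|) computation:

from collections import Counter

def kmer_counts(x, k):
    # counts of the at most |x| - k + 1 k-mers occurring in the string x
    return Counter(x[i:i + k] for i in range(len(x) - k + 1))

def spectrum_kernel(x, xp, k=3):
    # K(x, x') = sum over k-mers u of N_u(x) * N_u(x')
    cx, cxp = kmer_counts(x, k), kmer_counts(xp, k)
    return sum(n * cxp[u] for u, n in cx.items() if u in cxp)

print(spectrum_kernel("MKVLAAGLLAL", "MKTLLLALAVA", k=3))   # toy protein-like strings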
The choice of kernel matters
Performance of several kernels on the SCOP superfamily recognition task [Saigo et al., 2004].
Kernels for graphs
Graph data
- Molecules
- Images [Harchaoui & Bach, 2007]
Subgraph-based representations
Represent a graph by a vector indexed by subgraphs: an entry is 0 when there is no occurrence of the corresponding subgraph (e.g. the 1st feature) and 1 (or the number of occurrences) when it occurs at least once (e.g. the 10th feature).
Tanimoto & MinMax
- For binary fingerprints x, x', the Tanimoto similarity is K(x, x') = ⟨x, x'⟩ / (⟨x, x⟩ + ⟨x', x'⟩ - ⟨x, x'⟩).
- For count vectors, the MinMax similarity is K(x, x') = Σ_u min(x_u, x'_u) / Σ_u max(x_u, x'_u).
- The Tanimoto and MinMax similarities are kernels.
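A minimal sketch (mine) of both similarities computed from subgraph-count vectors:

import numpy as np

def tanimoto(x, xp):
    # Tanimoto similarity for binary fingerprint vectors
    dot = np.dot(x, xp)
    return dot / (np.dot(x, x) + np.dot(xp, xp) - dot)

def minmax(x, xp):
    # MinMax similarity for non-negative count vectors
    return np.minimum(x, xp).sum() / np.maximum(x, xp).sum()

x = np.array([1, 0, 1, 1, 0, 2, 0])        # hypothetical subgraph counts
xp = np.array([1, 1, 0, 1, 0, 1, 0])
print(tanimoto((x > 0).astype(float), (xp > 0).astype(float)), minmax(x, xp))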
Which subgraphs to use?
- Indexing by all subgraphs...
– Computing all subgraph occurrences is NP-hard.
– Actually, finding whether a given subgraph occurs in a graph is NP-hard in general.
http://jeremykun.com/2015/11/12/a-quasipolynomial-time-algorithm-for-graph-isomorphism-the-details/
Which subgraphs to use?
- Specific subgraphs that lead to computationally efficient indexing:
– Subgraphs selected based on domain knowledge, e.g. chemical fingerprints
– All frequent subgraphs [Helma et al., 2004]
– All paths up to length k [Nicholls 2005]
– All walks up to length k [Mahé et al., 2005]
– All trees up to depth k [Rogers, 2004]
– All shortest paths [Borgwardt & Kriegel, 2005]
– All subgraphs up to k vertices (graphlets) [Shervashidze et al., 2009]
Which subgraphs to use?
(Figure: a path of length 5, a walk of length 5, and a tree of depth 2.)
Which subgraphs to use?
(Figure: examples of paths, walks, and trees [Harchaoui & Bach, 2007].)
The choice of kernel matters
Predicting inhibitors for 60 cancer cell lines [Mahé & Vert, 2009].
The choice of kernel matters
[Harchaoui & Bach, 2007]
- COREL14: 1400 natural images, 14 classes.
- Kernels: histogram (H), walk kernel (W), subtree kernel (TW), weighted subtree kernel (wTW), combination (M).
Summary
- Linearly separable case: hard-margin SVM
- Non-separable, but still linear: soft-margin SVM
- Non-linear: kernel SVM
- Kernels for
– real-valued data
– strings
– graphs.
References
- A Course in Machine Learning. http://ciml.info/dl/v0_99/ciml-v0_99-all.pdf
– Soft-margin SVM: Chap 7.7
– Kernel SVM: Chap 11.1 – 11.6
- The Elements of Statistical Learning. http://web.stanford.edu/~hastie/ElemStatLearn/
– Separating hyperplane: Chap 4.5.2
– Soft-margin SVM: Chap 12.1 – 12.2
– Kernel SVM: Chap 12.3
– String kernels: Chap 18.5.1
- Learning with Kernels. http://agbs.kyb.tuebingen.mpg.de/lwk/
– Soft-margin SVM: Chap 1.4
– Kernel SVM: Chap 1.5
– SVR: Chap 1.6
– Kernels: Chap 2.1
- Convex Optimization. https://web.stanford.edu/~boyd/cvxbook/
– SVM optimization: Chap 8.6.1
Practical matters
- Preparing for the exam
– Previous exams with solutions on the course website
- Next week: special session! 2 x 1.5 hrs
– Introduction to artificial neural networks
– Introduction to deep learning and TensorFlow (J. Boyd); a Jupyter notebook will be available for download
– Deep learning for bioimaging (P. Naylor)
Lab
- Redefining cross_validate (a minimal sketch follows)
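A minimal sketch of what such a cross_validate function might look like; the signature and the choice of returning decision-function scores are assumptions, not necessarily the lab's exact interface.

import numpy as np
from sklearn.base import clone
from sklearn.model_selection import KFold

def cross_validate(design_matrix, labels, classifier, cv_folds=None):
    # return one cross-validated score per sample (here: decision-function values)
    if cv_folds is None:
        cv_folds = KFold(n_splits=5, shuffle=True, random_state=0).split(design_matrix)
    pred = np.zeros(labels.shape)
    for tr, te in cv_folds:
        clf = clone(classifier)                      # fresh estimator for each fold
        clf.fit(design_matrix[tr], labels[tr])
        pred[te] = clf.decision_function(design_matrix[te])
    return pred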
Linear SVM
The data is not easily separated by a hyperplane. Support vectors are either correctly classified points that support the margin, or errors. Many support vectors suggest the data is not easy to separate and that there are many errors.
Linear kernel matrix
No visible pattern. Dark lines correspond to vectors with the highest magnitude.
Linear kernel matrix (after feature scaling)
The kernel values are on a smaller scale than previously. The diagonal emerges (the most similar sample to an observation is itself). Many small values.
Linear SVM with optimal C
An SVM classifier with optimized C. On each (train, test) pair (tr, te):
- scaling factors are computed on Xtr
- Xtr and Xte are scaled accordingly
- for each value of C, an SVM is cross-validated on Xtr_scaled
- the best of these SVMs is trained on the full Xtr_scaled and applied to Xte_scaled (this produces one prediction per data point of X)
A sketch of this procedure is given below.
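The procedure above can be written compactly with scikit-learn; this is a sketch of one possible implementation, with synthetic data standing in for the lab's X and y.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold, cross_val_predict
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=20, random_state=0)   # stand-in data

pipe = Pipeline([("scale", StandardScaler()),        # scaling factors computed on Xtr only
                 ("svm", SVC(kernel="linear"))])
inner = GridSearchCV(pipe, {"svm__C": np.logspace(-3, 3, 7)}, cv=5)         # picks the best C

outer = KFold(n_splits=5, shuffle=True, random_state=0)
y_pred = cross_val_predict(inner, X, y, cv=outer)    # one prediction per data point of X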
Polynomial kernel SVM
Polynomial kernel with r=0, d=2, computed on X_scaled: the matrix is really close to the identity, so nothing can be learned. This gets worse if you increase d. Changing r can give us a more reasonable matrix.
(Figure: kernel matrices for r = 10, 100, 1000, 10000, 100000, 1000000. For small r the kernel matrix is almost the identity matrix; for r = 1000000 it is almost all 1s; the reasonable range of values for r lies in between.)
- For a fair comparison with the linear kernel, cross-validate C and r (see the sketch below).
- For r, use a logspace between 10000 and 100000, based on your observation of the kernel matrix.
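A sketch of this cross-validation (mine); in scikit-learn's SVC the offset r of the polynomial kernel is the coef0 parameter, and synthetic data stands in for the lab's scaled matrix.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X_scaled, y = make_classification(n_samples=200, n_features=20, random_state=0)   # stand-in

param_grid = {"C": np.logspace(-3, 3, 7),
              "coef0": np.logspace(4, 5, 5)}        # r between 10000 and 100000
grid = GridSearchCV(SVC(kernel="poly", degree=2), param_grid, cv=5).fit(X_scaled, y)
print(grid.best_params_)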
Gaussian kernel SVM
- What values of gamma should we use? Start by spreading out values.
- When gamma > 1e-2, the kernel matrix is close to the identity.
- When gamma = 1e-5, the kernel matrix is getting close to a matrix of all 1s.
- If we choose gamma much smaller, the kernel matrix is going to be so close to a matrix of all 1s that the SVM won't learn well.
Gaussian kernel SVM
- The kernel matrix is more reasonable when gamma is between 5e-5 and 5e-4.
Gaussian kernel SVM
- The best performance we obtain is indeed for a gamma of 5e-5.
- To fairly compare to the linear SVM, one should cross-validate C (a sketch follows).
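A sketch (mine) of the joint cross-validation of C and gamma, again with synthetic stand-in data:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X_scaled, y = make_classification(n_samples=200, n_features=20, random_state=0)   # stand-in

param_grid = {"C": np.logspace(-2, 2, 5), "gamma": [5e-5, 1e-4, 5e-4, 1e-3]}
grid = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5).fit(X_scaled, y)
print(grid.best_params_, grid.best_score_)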
Linear SVM decision boundary
Quadratic SVM decision boundary