SLIDE 1
  • 10. Support Vector Machines

Foundations of Machine Learning CentraleSupélec — Fall 2017 Chloé-Agathe Azencott

Centre for Computational Biology, Mines ParisTech

chloe-agathe.azencott@mines-paristech.fr

SLIDE 2

Learning objectives

  • Define a large-margin classifier in the separable case.
  • Write the corresponding primal and dual optimization problems.
  • Rewrite the optimization problem in the case of non-separable data.
  • Use the kernel trick to apply soft-margin SVMs to non-linear cases.
  • Define kernels for real-valued data, strings, and graphs.

SLIDE 3

The linearly separable case: hard-margin SVMs

SLIDE 4

Linear classifier

Assume the data is linearly separable: there exists a line that separates + from -.

SLIDE 5

Linear classifier

SLIDE 6

Linear classifier

SLIDE 7

Linear classifier

SLIDE 8

Linear classifier

SLIDE 9

Linear classifier

SLIDE 10

Linear classifier

SLIDE 11

Linear classifier

Which one is better?

SLIDE 12

Margin of a linear classifier

Margin: twice the distance from the separating hyperplane to the closest training point.

SLIDE 13

Margin of a linear classifier

SLIDE 14

Margin of a linear classifier

SLIDE 15

Largest margin classifier: support vector machines

SLIDE 16

Support vectors

SLIDE 17

Formalization

  • Training set
  • What are the equations of the 3 parallel hyperplanes?
  • How is the “blue” region defined? The “orange” one?

SLIDE 18

Largest margin hyperplane

What is the size of the margin γ?

SLIDE 19

Largest margin hyperplane

SLIDE 20

Optimization problem

  • Training set
  • Assume the data to be linearly separable.
  • Goal: find the (w, b) that define the hyperplane with the largest margin.

SLIDE 21

Optimization problem

  • Margin maximization: minimize
  • Correct classification of the training points:
    – For positive examples:
    – For negative examples:
    – Summarized as ?

SLIDE 22

Optimization problem

  • Margin maximization: minimize
  • Correct classification of the training points:
    – For positive examples:
    – For negative examples:
    – Summarized as:
  • Optimization problem:
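
A standard way to write the resulting problem, for reference (this is the usual textbook hard-margin formulation, reconstructed here rather than copied from the slide images; the sample notation x_i, y_i matches the dual variables α_i used later):

$$ \min_{w,\,b}\ \frac{1}{2}\lVert w\rVert^2 \quad \text{subject to} \quad y_i\left(\langle w, x_i\rangle + b\right) \ge 1, \quad i = 1, \dots, n. $$

Minimizing $\lVert w\rVert^2$ maximizes the margin $\gamma = 2/\lVert w\rVert$, and the single constraint summarizes the conditions $\langle w, x_i\rangle + b \ge +1$ for positive examples and $\le -1$ for negative ones.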

SLIDE 23

Optimization problem

  • Find the (w, b) that minimize

under the n constraints

  • We introduce one dual variable αi for each constraint (i.e. each training point).
  • Lagrangian: ?

SLIDE 24

Optimization problem

  • Find the (w, b) that minimize

under the n constraints

  • We introduce one dual variable αi for each constraint (i.e. each training point).
  • Lagrangian:

SLIDE 25

Lagrange dual of the SVM

  • Lagrange dual function:
  • Lagrange dual problem:
  • Strong duality: under Slater’s conditions, the optimum of the primal is the optimum of the dual. (The function to optimize is convex and the constraints are affine.)

SLIDE 26

Minimizing the Lagrangian of the SVM

  • L(w, b, α) is convex quadratic in w and minimized for
  • L(w, b, α) is affine in b. Its minimum is −∞ except if ?

SLIDE 27

Minimizing the Lagrangian of the SVM

  • L(w, b, α) is convex quadratic in w and minimized for:
  • L(w, b, α) is affine in b. Its minimum is −∞ except if:

SLIDE 28

Minimizing the Lagrangian of the SVM

  • L(w, b, α) is convex quadratic in w and minimized for:
  • L(w, b, α) is affine in b. Its minimum is −∞ except if: ?

SLIDE 29

Minimizing the Lagrangian of the SVM

  • L(w, b, α) is convex quadratic in w and minimized for:
  • L(w, b, α) is affine in b. Its minimum is −∞ except if:
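
For reference, the standard stationarity conditions obtained by setting the partial derivatives of the Lagrangian to zero (the usual derivation, not copied from the slide images) are:

$$ \nabla_w L = 0 \;\Rightarrow\; w = \sum_{i=1}^{n} \alpha_i y_i x_i, \qquad \frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{n} \alpha_i y_i = 0. $$

Substituting these back into L(w, b, α) yields the dual function q(α) used on the next slide.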

SLIDE 30

SVM dual problem

  • Lagrange dual function:
  • Dual problem: maximize q(α) subject to α ≥ 0.

Maximizing a quadratic function under box constraints can be solved efficiently using dedicated software.
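
As an illustration of "dedicated software", here is a minimal sketch that solves the hard-margin dual with the generic QP solver from cvxopt, on a small separable toy problem. The function name and tolerance are illustrative, the equality constraint from the stationarity conditions is included explicitly, and this is a sketch rather than the course's reference implementation.

```python
# Sketch: solve the hard-margin SVM dual
#   max_alpha  sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j <x_i, x_j>
#   s.t.       alpha_i >= 0,  sum_i alpha_i y_i = 0
# with cvxopt's QP solver. X: (n, p) array, y: labels in {-1, +1}.
import numpy as np
from cvxopt import matrix, solvers

def hard_margin_svm_dual(X, y):
    n = X.shape[0]
    K = X @ X.T                                  # linear-kernel Gram matrix
    P = matrix(np.outer(y, y) * K)               # quadratic term (QP minimizes)
    q = matrix(-np.ones(n))                      # linear term: minimize -sum alpha_i
    G = matrix(-np.eye(n))                       # -alpha_i <= 0  <=>  alpha_i >= 0
    h = matrix(np.zeros(n))
    A = matrix(y.reshape(1, -1).astype(float))   # sum_i alpha_i y_i = 0
    b = matrix(0.0)
    alpha = np.ravel(solvers.qp(P, q, G, h, A, b)['x'])
    w = (alpha * y) @ X                          # w* = sum_i alpha_i y_i x_i
    sv = alpha > 1e-6                            # support vectors have alpha_i > 0
    b_star = np.mean(y[sv] - X[sv] @ w)          # b* recovered from support vectors
    return w, b_star, alpha
```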

SLIDE 31

Optimal hyperplane

  • Once the optimal α* is found, we recover (w*, b*).
  • Determining b*:
    – The closest positive point to the separating hyperplane verifies
    – The closest negative point to the separating hyperplane verifies
  • The decision function is hence:

SLIDE 32

Lagrangian

  • minimize f(w) under the constraint g(w) ≥ 0

How do we write this in terms of the gradients of f and g?

(abusive notation: g(w, b))

SLIDE 33

Lagrangian

  • minimize f(w) under the constraint g(w) ≥ 0

[Figure: feasible region, iso-contours of f, unconstrained minimum of f]

If the minimum of f(w) doesn't lie in the feasible region, where's our solution?

SLIDE 34

Lagrangian

  • minimize f(w) under the constraint g(w) ≥ 0

[Figure: feasible region, iso-contours of f, unconstrained minimum of f]

How do we write this in terms of the gradients of f and g?

SLIDE 35

Lagrangian

  • minimize f(w) under the constraint g(w) ≥ 0

[Figure: feasible region, iso-contours of f, unconstrained minimum of f]

How do we write this in terms of the gradients of f and g?

SLIDE 36

Lagrangian

  • minimize f(w) under the constraint g(w) ≥ 0

[Figure: feasible region, iso-contours of f, unconstrained minimum of f]

How do we write this in terms of the gradients of f and g?

SLIDE 37

Lagrangian

  • minimize f(w) under the constraint g(w) ≥ 0

Case 1: the unconstrained minimum lies in the feasible region. Case 2: it does not. How do we summarize both cases?

SLIDE 38

Lagrangian

  • minimize f(w) under the constraint g(w) ≥ 0

Case 1: the unconstrained minimum lies in the feasible region. Case 2: it does not.

  – Summarized as:

SLIDE 39

Lagrangian

  • minimize f(w) under the constraint g(w) ≥ 0

Lagrangian: α is called the Lagrange multiplier.

SLIDE 40

Lagrangian

  • minimize f(w) under the constraints gi(w) ≥ 0

How do we deal with n constraints?

SLIDE 41

Lagrangian

  • minimize f(w) under the constraints gi(w) ≥ 0

Use n Lagrange multipliers:

  – Lagrangian:

SLIDE 42

Support vectors

  • Karush-Kuhn-Tucker conditions:

Either αi = 0 (case 1) or gi = 0 (case 2).

Case 1: Case 2:

[Figure: feasible region, iso-contours of f, unconstrained minimum of f]

SLIDE 43

Support vectors

[Figure: points with α = 0 and points with α > 0]

SLIDE 44

The non-linearly separable case: soft-margin SVMs

SLIDE 45

Soft-margin SVMs

What if the data are not linearly separable?

SLIDE 46

Soft-margin SVMs

SLIDE 47

Soft-margin SVMs

SLIDE 48

Soft-margin SVMs

  • Find a trade-off between large margin and few errors. What does this remind you of?

SLIDE 49

SVM error: hinge loss

  • We want for all i:
  • Hinge loss function:

What's the shape of the hinge loss?

SLIDE 50

SVM error: hinge loss

  • We want for all i:
  • Hinge loss function:

[Plot: the hinge loss]
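
A minimal numpy sketch of the hinge loss max(0, 1 − y·f(x)) plotted above; the function name and arguments are illustrative, not taken from the lab code.

```python
# Sketch: hinge loss, zero when y * f(x) >= 1, linear in the violation otherwise.
import numpy as np

def hinge_loss(y, fx):
    """y: labels in {-1, +1}; fx: decision values f(x) = <w, x> + b."""
    return np.maximum(0.0, 1.0 - y * fx)

print(hinge_loss(np.array([1, 1, -1]), np.array([2.0, 0.5, -0.2])))  # [0.  0.5 0.8]
```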

SLIDE 51

Soft-margin SVMs

  • Find a trade-off between large margin and few errors.
  • Error:
  • The soft-margin SVM solves:
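
For reference, a standard way to write what the last bullet refers to, using the hinge loss from the previous slides (the usual textbook soft-margin objective, stated here as an assumption rather than copied from the slide image):

$$ \min_{w,\,b}\ \frac{1}{2}\lVert w\rVert^2 \;+\; C \sum_{i=1}^{n} \max\bigl(0,\ 1 - y_i(\langle w, x_i\rangle + b)\bigr). $$

The first term asks for a large margin, the second penalizes training errors, and C sets the trade-off discussed on the next slide.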

SLIDE 52

The C parameter

  • Large C: makes few errors
  • Small C: ensures a large margin
  • Intermediate C: finds a trade-off

SLIDE 53

It is important to control C

[Plot: prediction error as a function of C, on training data and on new data]
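
A short scikit-learn sketch of how C is typically controlled in practice, by cross-validating over a grid; the toy data generated with make_classification stands in for the (already scaled) course data.

```python
# Sketch: choose C by 5-fold cross-validation on a logarithmic grid.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=10, random_state=0)  # toy data
param_grid = {'C': np.logspace(-3, 3, 7)}
search = GridSearchCV(SVC(kernel='linear'), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```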

SLIDE 54

Slack variables

is equivalent to: (slack variable ξ: the distance between y·f(x) and 1)

SLIDE 55

Lagrangian of the soft-margin SVM

  • Primal
  • Lagrangian
  • Minimize the Lagrangian (partial derivatives in w, b, ξ)
  • KKT conditions

SLIDE 56

Dual formulation of the soft-margin SVM

  • Dual: maximize
  • under the constraints
  • KKT conditions:

[Figure: training points labelled “easy”, “hard”, “somewhat hard”]

SLIDE 57

Support vectors of the soft-margin SVM

[Figure: points with α = 0, 0 < α < C, and α = C]

SLIDE 58

Primal vs. dual

  • What is the dimension of the primal problem?
  • What is the dimension of the dual problem?

SLIDE 59

Primal vs. dual

  • Primal: (w, b) has dimension (p+1). Favored if the data is low-dimensional.
  • Dual: α has dimension n. Favored if there is little data available.

SLIDE 60

The non-linear case: kernel SVMs.

SLIDE 61

Non-linear SVMs

SLIDE 62

Non-linear mapping to a feature space

[Figure: data on the real line ℝ]

SLIDE 63

Non-linear mapping to a feature space

[Figure: non-linear mapping from ℝ to ℝ²]

SLIDE 64

SVM in the feature space

  • Train:

under the constraints

  • Predict with the decision function

SLIDE 65

Kernels

For a given mapping from the space of objects X to some Hilbert space H, the kernel between two objects x and x' is the inner product of their images in the feature space.

  • E.g.
  • Kernels allow us to formalize the notion of similarity.

SLIDE 66

Dot product and similarity

  • Normalized dot product = cosine

[Figure: two vectors in the (feature 1, feature 2) plane]

SLIDE 67

Kernel trick

  • Many linear algorithms (in particular, linear SVMs) can be performed in the feature space H without explicitly computing the images φ(x), but instead by computing kernels K(x, x').
  • It is sometimes easy to compute kernels which correspond to large-dimensional feature spaces: K(x, x') is often much simpler to compute than φ(x).

SLIDE 68

SVM in the feature space

  • Train:

under the constraints

  • Predict with the decision function

SLIDE 69

SVM with a kernel

  • Train:

under the constraints

  • Predict with the decision function
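
A small scikit-learn sketch of exactly this replacement: the SVM is trained and used through Gram matrices only, never through φ(x). The data split and the linear kernel below are illustrative placeholders; any valid kernel matrix can be plugged in.

```python
# Sketch: train and apply an SVM from precomputed kernel (Gram) matrices.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=150, n_features=5, random_state=0)
X_train, y_train, X_test = X[:100], y[:100], X[100:]     # stand-in split

K_train = X_train @ X_train.T   # K_train[i, j] = K(x_i, x_j) on training points
K_test = X_test @ X_train.T     # K_test[i, j]  = K(x_test_i, x_train_j)

clf = SVC(kernel='precomputed', C=1.0)
clf.fit(K_train, y_train)       # the SVM only ever sees kernel values
y_pred = clf.predict(K_test)
```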

SLIDE 70

Which functions are kernels?

  • A function K(x, x') defined on a set X is a kernel iff there exists a Hilbert space H and a mapping φ: X → H such that, for any x, x' in X:
  • A function K(x, x') defined on a set X is positive definite iff it is symmetric and satisfies:
  • Theorem [Aronszajn, 1950]: K is a kernel iff it is positive definite.

SLIDE 71

Positive definite matrices

  • Have a unique Cholesky decomposition
    L: lower triangular, with positive elements on the diagonal
  • Sesquilinear form is an inner product:
    – conjugate symmetry
    – linearity in the first argument
    – positive definiteness

SLIDE 72

Polynomial kernels

Compute ?

SLIDE 73

Polynomial kernels

More generally, for is an inner product in a feature space of all monomials of degree up to d.
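
A small numeric sketch of this statement for d = 2 and x in ℝ²: the kernel (⟨x, x'⟩ + 1)² (a standard polynomial-kernel form, used here as an assumption since the slide formula is an image) equals an explicit dot product of degree-≤2 monomial features. The feature map phi below is illustrative.

```python
# Sketch: check that (<x, x'> + 1)^2 equals <phi(x), phi(x')> for an explicit
# map to monomials of degree <= 2 (with the usual sqrt(2) weights).
import numpy as np

def phi(x):
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,        # degree-1 monomials
                     x1**2, x2**2, np.sqrt(2) * x1 * x2])     # degree-2 monomials

x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
assert np.isclose((x @ xp + 1.0)**2, phi(x) @ phi(xp))
```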

SLIDE 74

Gaussian kernel

What is the dimension of the feature space?

SLIDE 75

Gaussian kernel

The feature space has infinite dimension.
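
A short numpy sketch of the Gaussian (RBF) kernel K(x, x') = exp(−γ‖x − x'‖²), computed for whole data matrices at once; function and variable names are illustrative.

```python
# Sketch: Gaussian (RBF) kernel matrix between the rows of X and Xp.
import numpy as np

def gaussian_kernel(X, Xp, gamma=1.0):
    sq_dists = (np.sum(X**2, axis=1)[:, None]
                + np.sum(Xp**2, axis=1)[None, :]
                - 2.0 * X @ Xp.T)               # ||x - x'||^2 for all pairs
    return np.exp(-gamma * sq_dists)

K = gaussian_kernel(np.array([[0.0, 0.0], [1.0, 0.0]]), np.array([[0.0, 0.0]]), gamma=0.5)
# K[0, 0] = 1.0 (identical points), K[1, 0] = exp(-0.5) ~ 0.61
```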

SLIDE 76

SLIDE 77

Toy example

SLIDE 78

Toy example: linear SVM

SLIDE 79

Toy example: polynomial SVM (d=2)

SLIDE 80

Kernels for strings

SLIDE 81

Protein sequence classification

Goal: predict which proteins are secreted or not, based on their sequence.

SLIDE 82

Substring-based representations

  • Represent strings based on the presence/absence of substrings of fixed length.

Strings of length k ?

SLIDE 83

Substring-based representations

  • Represent strings based on the presence/absence of substrings of fixed length.
    – Number of occurrences of u in x: spectrum kernel [Leslie et al., 2002].

SLIDE 84

Substring-based representations

  • Represent strings based on the presence/absence of substrings of fixed length.
    – Number of occurrences of u in x: spectrum kernel [Leslie et al., 2002].
    – Number of occurrences of u in x, up to m mismatches: mismatch kernel [Leslie et al., 2004].

SLIDE 85

Substring-based representations

  • Represent strings based on the presence/absence of substrings of fixed length.
    – Number of occurrences of u in x: spectrum kernel [Leslie et al., 2002].
    – Number of occurrences of u in x, up to m mismatches: mismatch kernel [Leslie et al., 2004].
    – Number of occurrences of u in x, allowing gaps, with a weight decaying exponentially with the number of gaps: substring kernel [Lodhi et al., 2002].

SLIDE 86

Spectrum kernel

  • Implementation:
    – Formally, a sum over |A|^k terms
    – How many non-zero terms in ? ?

SLIDE 87

Spectrum kernel

  • Implementation:
    – Formally, a sum over |A|^k terms
    – At most |x| - k + 1 non-zero terms in
    – Hence: computation in O(|x| + |x'|)
  • Prediction for a new sequence x:

Write f(x) as a function of only |x| - k + 1 weights. ?

SLIDE 88

Spectrum kernel

  • Implementation:
    – Formally, a sum over |A|^k terms
    – At most |x| - k + 1 non-zero terms in
    – Hence: computation in O(|x| + |x'|)
  • Fast prediction for a new sequence x:
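
A minimal sketch of the computation described above: count the k-mers present in each string and take the dot product of the two sparse count vectors, so only the k-mers that actually occur are touched, giving the O(|x| + |x'|) cost. Function names are illustrative.

```python
# Sketch: k-spectrum kernel between two strings via sparse k-mer counts.
from collections import Counter

def kmer_counts(x, k):
    """Sparse count vector: k-mer -> number of occurrences in x."""
    return Counter(x[i:i + k] for i in range(len(x) - k + 1))

def spectrum_kernel(x, xp, k):
    cx, cxp = kmer_counts(x, k), kmer_counts(xp, k)
    if len(cx) > len(cxp):            # iterate over the smaller dictionary
        cx, cxp = cxp, cx
    return sum(count * cxp.get(u, 0) for u, count in cx.items())

print(spectrum_kernel("GATTACA", "ATTACCA", k=3))   # 3 shared 3-mers: ATT, TTA, TAC
```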

SLIDE 89

The choice of kernel matters

Performance of several kernels on the SCOP superfamily recognition task [Saigo et al., 2004]

SLIDE 90

Kernels for graphs

SLIDE 91

Graph data

  • Molecules
  • Images

[Harchaoui & Bach, 2007]

SLIDE 92

Subgraph-based representations

[Figure: substructure-indicator vector of a graph; e.g. no occurrence of the 1st feature, 1+ occurrences of the 10th feature]

SLIDE 93

Tanimoto & MinMax

  • The Tanimoto and MinMax similarities are kernels.
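
A small numpy sketch of the two similarities for non-negative fingerprint/count vectors; the formulas below (Tanimoto = ⟨x, x'⟩ / (⟨x, x⟩ + ⟨x', x'⟩ − ⟨x, x'⟩), MinMax = Σ min / Σ max) are the standard definitions, restated here as an assumption since the slide formulas are images.

```python
# Sketch: Tanimoto and MinMax similarities between non-negative vectors.
import numpy as np

def tanimoto(x, xp):
    dot = x @ xp
    return dot / (x @ x + xp @ xp - dot)

def minmax(x, xp):
    return np.minimum(x, xp).sum() / np.maximum(x, xp).sum()

x, xp = np.array([1, 0, 2, 1]), np.array([0, 1, 1, 1])
print(tanimoto(x, xp), minmax(x, xp))   # 0.5, 0.4
```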

SLIDE 94

Which subgraphs to use?

  • Indexing by all subgraphs...
    – Computing all subgraph occurrences is NP-hard.
    – Actually, finding whether a given subgraph occurs in a graph is NP-hard in general.

http://jeremykun.com/2015/11/12/a-quasipolynomial-time-algorithm-for-graph-isomorphism-the-details/

SLIDE 95

Which subgraphs to use?

  • Specific subgraphs that lead to computationally efficient indexing:
    – Subgraphs selected based on domain knowledge, e.g. chemical fingerprints
    – All frequent subgraphs [Helma et al., 2004]
    – All paths up to length k [Nicholls 2005]
    – All walks up to length k [Mahé et al., 2005]
    – All trees up to depth k [Rogers, 2004]
    – All shortest paths [Borgwardt & Kriegel, 2005]
    – All subgraphs up to k vertices (graphlets) [Shervashidze et al., 2009]

SLIDE 96

Which subgraphs to use?

[Figure: a path of length 5, a walk of length 5, a tree of depth 2]

SLIDE 97

Which subgraphs to use?

[Figure: paths, walks, trees; Harchaoui & Bach, 2007]

SLIDE 98

The choice of kernel matters

Predicting inhibitors for 60 cancer cell lines [Mahé & Vert, 2009]

SLIDE 99

The choice of kernel matters

[Harchaoui & Bach, 2007]

  • COREL14: 1400 natural images, 14 classes
  • Kernels: histogram (H), walk kernel (W), subtree kernel (TW), weighted subtree kernel (wTW), combination (M).

SLIDE 100

Summary

  • Linearly separable case: hard-margin SVM
  • Non-separable, but still linear: soft-margin SVM
  • Non-linear: kernel SVM
  • Kernels for:
    – real-valued data
    – strings
    – graphs.

SLIDE 101

  • A Course in Machine Learning. http://ciml.info/dl/v0_99/ciml-v0_99-all.pdf
    – Soft-margin SVM: Chap 7.7
    – Kernel SVM: Chap 11.1 – 11.6
  • The Elements of Statistical Learning. http://web.stanford.edu/~hastie/ElemStatLearn/
    – Separating hyperplane: Chap 4.5.2
    – Soft-margin SVM: Chap 12.1 – 12.2
    – Kernel SVM: Chap 12.3
    – String kernels: Chap 18.5.1
  • Learning with Kernels. http://agbs.kyb.tuebingen.mpg.de/lwk/
    – Soft-margin SVM: Chap 1.4
    – Kernel SVM: Chap 1.5
    – SVR: Chap 1.6
    – Kernels: Chap 2.1
  • Convex Optimization. https://web.stanford.edu/~boyd/cvxbook/
    – SVM optimization: Chap 8.6.1

SLIDE 102

Practical matters

  • Preparing for the exam
    – Previous exams with solutions on the course website
  • Next week: special session! 2 x 1.5 hrs
    – Introduction to artificial neural networks
    – Introduction to deep learning and TensorFlow (J. Boyd)
      (Jupyter notebook will be available for download)
    – Deep learning for bioimaging (P. Naylor)

SLIDE 103

Lab

  • Redefining cross_validate

SLIDE 104

Linear SVM

The data is not easily separated by a hyperplane. Support vectors are either correctly classified points that support the margin, or errors. Many support vectors suggest the data is not easy to separate and there are many errors.

SLIDE 105

Linear kernel matrix

No visible pattern. Dark lines correspond to vectors with the highest magnitude.

SLIDE 106

Linear kernel matrix (after feature scaling)

The kernel values are on a smaller scale than previously. The diagonal emerges (the most similar sample to an observation is itself). Many small values.

SLIDE 107

SLIDE 108

Linear SVM with optimal C

An SVM classifier with optimized C.

On each pair (tr, te):

  • scaling factors are computed on Xtr
  • Xtr, Xte are scaled accordingly
  • for each value of C, an SVM is cross-validated on Xtr_scaled
  • the best of these SVMs is trained on the full Xtr_scaled and applied to Xte_scaled (this produces one prediction per data point of X)
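
A sketch of this procedure with scikit-learn; the toy data, fold counts, and C grid are placeholders, and this is one way to implement the steps above rather than the lab's reference solution.

```python
# Sketch: outer CV split; scale with statistics from Xtr only; inner CV picks C;
# the best SVM is refit on Xtr_scaled and predicts Xte_scaled.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=20, random_state=0)  # stand-in data
y_pred = np.empty_like(y, dtype=float)

for tr, te in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    scaler = StandardScaler().fit(X[tr])              # scaling factors computed on Xtr only
    Xtr_scaled, Xte_scaled = scaler.transform(X[tr]), scaler.transform(X[te])
    search = GridSearchCV(SVC(kernel='linear'), {'C': np.logspace(-3, 3, 7)}, cv=5)
    search.fit(Xtr_scaled, y[tr])                     # inner CV selects the best C
    y_pred[te] = search.predict(Xte_scaled)           # one prediction per data point of X
```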

SLIDE 109

Polynomial kernel SVM

Polynomial kernel with r=0, d=2, computed on X_scaled. The matrix is really close to the identity, so nothing can be learned. This gets worse if you increase d. Changing r can give us a more reasonable matrix.

SLIDE 110

[Figure: polynomial kernel matrices for r = 10, 100, 1000, 10000, 100000, 1000000; at r=10 the kernel matrix is almost the identity matrix, at r=1000000 it is almost all 1s; the reasonable range of values for r lies in between]

SLIDE 111

  • For a fair comparison with the linear kernel, cross-validate C and r.
  • For r, use a logspace between 10000 and 100000, based on your observation of the kernel matrix.
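
A sketch of that grid search with scikit-learn, where the polynomial kernel's r parameter corresponds to coef0 in SVC (an assumption about the lab's setup); the generated toy data stands in for the lab's scaled data.

```python
# Sketch: jointly cross-validate C and r (= coef0) for a degree-2 polynomial SVM,
# with r on a log-spaced grid between 1e4 and 1e5 as suggested above.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_scaled = StandardScaler().fit_transform(X)     # stand-in for the lab's scaled data
param_grid = {'C': np.logspace(-3, 3, 7),
              'coef0': np.logspace(4, 5, 5)}     # r between 10000 and 100000
search = GridSearchCV(SVC(kernel='poly', degree=2), param_grid, cv=5)
search.fit(X_scaled, y)
print(search.best_params_)
```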

SLIDE 112

Gaussian kernel SVM

  • What values of gamma should we use? Start by spreading out values.
  • When gamma > 1e-2, the kernel matrix is close to the identity.
  • When gamma = 1e-5, the kernel matrix is getting close to a matrix of all 1s.
  • If we choose gamma much smaller, the kernel matrix is going to be so close to a matrix of all 1s that the SVM won’t learn well.
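
A quick sketch of "spreading out values" of gamma and inspecting the resulting kernel matrices, as the slide suggests; rbf_kernel is scikit-learn's pairwise RBF kernel, and the generated toy data stands in for the lab's scaled data.

```python
# Sketch: average off-diagonal kernel value for a spread of gammas;
# values near 0 mean K is close to the identity, values near 1 mean K is all 1s.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=20, random_state=0)
X_scaled = StandardScaler().fit_transform(X)          # stand-in for the lab data

for gamma in np.logspace(-7, -1, 7):
    K = rbf_kernel(X_scaled, gamma=gamma)
    off_diag = K[~np.eye(K.shape[0], dtype=bool)]     # kernel values between distinct points
    print(f"gamma = {gamma:.0e}   mean off-diagonal value = {off_diag.mean():.3f}")
```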

SLIDE 113

Gaussian kernel SVM

  • What values of gamma should we use? Start by spreading out values.
  • The kernel matrix is more reasonable when gamma is between 5e-5 and 5e-4.

SLIDE 114

Gaussian kernel SVM

  • The best performance we obtain is indeed for a gamma of 5e-5.
  • To fairly compare to the linear SVM, one should cross-validate C.

SLIDE 115

Linear SVM decision boundary

SLIDE 116

Quadratic SVM decision boundary

SLIDE 117

Separating XOR

RBF kernel: accuracy = 0.98; linear kernel: accuracy = 0.48.