Support Vector Machines Part 1 (Yingyu Liang, Computer Sciences 760)



SLIDE 1

Support Vector Machines Part 1

Yingyu Liang Computer Sciences 760 Fall 2017

http://pages.cs.wisc.edu/~yliang/cs760/

Some of the slides in these lectures have been adapted/borrowed from materials developed by Mark Craven, David Page, Jude Shavlik, Tom Mitchell, Nina Balcan, Matt Gormley, Elad Hazan, Tom Dietterich, and Pedro Domingos.

SLIDE 2

Goals for the lecture

you should understand the following concepts

  • the margin
  • the linear support vector machine
  • the primal and dual formulations of SVM learning
  • support vectors
  • VC-dimension and maximizing the margin


SLIDE 3

Motivation

SLIDE 4

Linear classification

Separating hyperplane: (w*)^T x = 0. Points with (w*)^T x > 0 are Class +1; points with (w*)^T x < 0 are Class -1. Assume perfect separation between the two classes.

SLIDE 5

Attempt

  • Given training data (x_i, y_i), 1 ≤ i ≤ n, i.i.d. from distribution D
  • Hypothesis: y = sign(f_w(x)) = sign(w^T x)
      • y = +1 if w^T x > 0
      • y = -1 if w^T x < 0
  • Let's assume that we can optimize to find w
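To make the hypothesis concrete, here is a minimal sketch in Python (the weight vector and points are made-up illustrative values, not from the slides):

```python
def predict(w, x):
    """Hypothesized linear classifier: y = sign(w^T x).

    w, x: lists of floats of equal length (no bias term here,
    matching the through-the-origin setup on this slide)."""
    score = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if score > 0 else -1

# hypothetical weight vector and test points
w = [1.0, -1.0]
print(predict(w, [2.0, 0.5]))   # w^T x = 1.5 > 0  -> +1
print(predict(w, [0.0, 3.0]))   # w^T x = -3.0 < 0 -> -1
```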
SLIDE 6

Multiple optimal solutions?

[Figure: Class +1 and Class -1 with three separating hyperplanes w_1, w_2, w_3]

All three are the same on empirical loss, but different on test/expected loss.

SLIDE 7

What about w_1?

[Figure: Class +1 and Class -1 with separator w_1, plus new test data]

SLIDE 8

What about w_3?

[Figure: Class +1 and Class -1 with separator w_3, plus new test data]

SLIDE 9

Most confident: w_2

[Figure: Class +1 and Class -1 with separator w_2, plus new test data]

SLIDE 10

Intuition: margin

[Figure: separator w_2 with a large margin to both Class +1 and Class -1]

SLIDE 11

Margin

SLIDE 12

Margin

  • Lemma 1: x has distance |f_w(x)| / ||w|| to the hyperplane f_w(x) = w^T x = 0

Proof:

  • w is orthogonal to the hyperplane
  • The unit direction is w / ||w||
  • The projection of x onto this direction is (w / ||w||)^T x = f_w(x) / ||w||
SLIDE 13

Margin: with bias

  • Claim 1: w is orthogonal to the hyperplane f_{w,b}(x) = w^T x + b = 0

Proof:

  • pick any x_1 and x_2 on the hyperplane
  • w^T x_1 + b = 0
  • w^T x_2 + b = 0
  • So w^T (x_1 - x_2) = 0, i.e., w is orthogonal to every direction lying in the hyperplane
SLIDE 14

Margin: with bias

  • Claim 2: 0 has (signed) distance -b / ||w|| to the hyperplane w^T x + b = 0

Proof:

  • pick any x_1 on the hyperplane
  • project x_1 onto the unit direction w / ||w|| to get the distance
  • (w / ||w||)^T x_1 = -b / ||w||, since w^T x_1 + b = 0

SLIDE 15

Margin: with bias

  • Lemma 2: x has distance |f_{w,b}(x)| / ||w|| to the hyperplane f_{w,b}(x) = w^T x + b = 0

Proof:

  • Write x = x_perp + r · w / ||w||, where x_perp lies on the hyperplane; then |r| is the distance
  • Multiply both sides by w^T and add b
  • Left-hand side: w^T x + b = f_{w,b}(x)
  • Right-hand side: w^T x_perp + r · (w^T w) / ||w|| + b = 0 + r ||w||
  • So r = f_{w,b}(x) / ||w||

SLIDE 16

The notation in the figure is y(x) = w^T x + w_0 (the bias is written w_0).

Figure from Pattern Recognition and Machine Learning, Bishop

SLIDE 17

Support Vector Machine (SVM)

SLIDE 18

SVM: objective

  • Margin over all training data points:

    γ = min_i |f_{w,b}(x_i)| / ||w||

  • Since we only want correct f_{w,b}, and recall y_i ∈ {+1, -1}, we have

    γ = min_i y_i f_{w,b}(x_i) / ||w||

  • If f_{w,b} is incorrect on some x_i, the margin is negative

SLIDE 19

SVM: objective

  • Maximize the margin over all training data points:

    max_{w,b} γ = max_{w,b} min_i y_i f_{w,b}(x_i) / ||w|| = max_{w,b} min_i y_i (w^T x_i + b) / ||w||

  • A bit complicated …
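The min-over-points margin above can be written directly as code (the toy data set below is hypothetical):

```python
import math

def margin(w, b, data):
    """Margin min_i y_i (w^T x_i + b) / ||w|| over the data.
    Negative if (w, b) misclassifies some point."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
               for x, y in data) / norm

# hypothetical linearly separable toy set
data = [([1.0, 1.0], 1), ([2.0, 2.0], 1), ([-1.0, -1.0], -1)]
print(margin([1.0, 1.0], 0.0, data))   # minimum value 2, so 2/sqrt(2) ≈ 1.414
```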
SLIDE 20

SVM: simplified objective

  • Observation: when (w, b) is scaled by a factor t, the margin is unchanged:

    y_i (t w^T x_i + t b) / ||t w|| = y_i (w^T x_i + b) / ||w||

  • Let's consider a fixed scale such that

    y_{i*} (w^T x_{i*} + b) = 1

    where x_{i*} is the point closest to the hyperplane

SLIDE 21

SVM: simplified objective

  • Let's consider a fixed scale such that

    y_{i*} (w^T x_{i*} + b) = 1

    where x_{i*} is the point closest to the hyperplane

  • Now we have for all data

    y_i (w^T x_i + b) ≥ 1

    and equality holds for at least one i

  • Then the margin is 1 / ||w||

SLIDE 22

SVM: simplified objective

  • Optimization simplified to

    min_{w,b} (1/2) ||w||²
    s.t. y_i (w^T x_i + b) ≥ 1, ∀i

  • How to find the optimum (w*, b*)?
  • Solved by the Lagrange multiplier method
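On a tiny hypothetical two-point set, the simplified primal can even be brute-forced; the search below is restricted to w = (a, a), b = 0 by symmetry and is purely illustrative:

```python
# Brute-force the primal  min (1/2)||w||^2  s.t.  y_i (w^T x_i + b) >= 1
# on the hypothetical toy set x1 = (1, 1), y1 = +1; x2 = (-1, -1), y2 = -1.
# By symmetry we search w = (a, a), b = 0; the analytic optimum is a = 0.5.
xs = [(1.0, 1.0), (-1.0, -1.0)]
ys = [1, -1]

def feasible(a, b):
    return all(y * (a * x[0] + a * x[1] + b) >= 1.0
               for x, y in zip(xs, ys))

candidates = [(i / 1000.0, 0.0) for i in range(0, 2001)]
best = min((a for a, b in candidates if feasible(a, b)), key=lambda a: a * a)
print(best)   # 0.5: w = (0.5, 0.5) gives margin 1/||w|| = sqrt(2)
```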
SLIDE 23

Lagrange multiplier

SLIDE 24

Lagrangian

  • Consider the optimization problem:

    min_w f(w)
    s.t. h_i(w) = 0, ∀ 1 ≤ i ≤ m

  • Lagrangian:

    ℒ(w, β) = f(w) + Σ_i β_i h_i(w)

    where the β_i are called Lagrange multipliers

SLIDE 25

Lagrangian

  • Consider the optimization problem:

    min_w f(w)
    s.t. h_i(w) = 0, ∀ 1 ≤ i ≤ m

  • Solved by setting the derivatives of the Lagrangian to 0:

    ∂ℒ/∂w_i = 0;  ∂ℒ/∂β_i = 0

SLIDE 26

Generalized Lagrangian

  • Consider the optimization problem:

    min_w f(w)
    s.t. g_i(w) ≤ 0, ∀ 1 ≤ i ≤ l
         h_k(w) = 0, ∀ 1 ≤ k ≤ m

  • Generalized Lagrangian:

    ℒ(w, α, β) = f(w) + Σ_i α_i g_i(w) + Σ_k β_k h_k(w)

    where the α_i, β_k are called Lagrange multipliers

SLIDE 27

Generalized Lagrangian

  • Consider the quantity:

    θ_P(w) := max_{α,β: α_i ≥ 0} ℒ(w, α, β)

  • Why?

    θ_P(w) = f(w) if w satisfies all the constraints; +∞ if w does not satisfy the constraints

  • So minimizing f(w) is the same as minimizing θ_P(w):

    min_w f(w) = min_w θ_P(w) = min_w max_{α,β: α_i ≥ 0} ℒ(w, α, β)
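The case analysis for θ_P can be illustrated on a one-dimensional toy problem (f, g, and the multiplier grid below are hypothetical):

```python
# Toy problem: f(w) = w^2, one inequality constraint g(w) = 1 - w <= 0 (i.e. w >= 1).
def f(w): return w * w
def g(w): return 1.0 - w
def L(w, alpha): return f(w) + alpha * g(w)

def theta_P(w, alphas):
    """max over a finite grid of multipliers alpha >= 0 (stand-in for the sup)."""
    return max(L(w, a) for a in alphas)

alphas = [0.0, 1.0, 10.0, 100.0, 1000.0]   # growing multipliers
print(theta_P(2.0, alphas))   # feasible w=2: max is at alpha=0 -> f(2) = 4.0
print(theta_P(0.0, alphas))   # infeasible w=0: grows with alpha -> 1000.0 on this grid
```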

SLIDE 28

Lagrange duality

  • The primal problem

    p* := min_w f(w) = min_w max_{α,β: α_i ≥ 0} ℒ(w, α, β)

  • The dual problem

    d* := max_{α,β: α_i ≥ 0} min_w ℒ(w, α, β)

  • Always true (weak duality):

    d* ≤ p*

SLIDE 29

Lagrange duality

  • The primal problem

    p* := min_w f(w) = min_w max_{α,β: α_i ≥ 0} ℒ(w, α, β)

  • The dual problem

    d* := max_{α,β: α_i ≥ 0} min_w ℒ(w, α, β)

  • Interesting case: when do we have d* = p*?

SLIDE 30

Lagrange duality

  • Theorem: under proper conditions, there exist w*, α*, β* such that

    d* = ℒ(w*, α*, β*) = p*

  • Moreover, w*, α*, β* satisfy the Karush-Kuhn-Tucker (KKT) conditions:

    ∂ℒ/∂w_i = 0,  α_i g_i(w) = 0
    g_i(w) ≤ 0,  h_k(w) = 0,  α_i ≥ 0

SLIDE 31

Lagrange duality

  • Theorem: under proper conditions, there exist w*, α*, β* such that

    d* = ℒ(w*, α*, β*) = p*

  • Moreover, w*, α*, β* satisfy the KKT conditions:

    ∂ℒ/∂w_i = 0,  α_i g_i(w) = 0 (dual complementarity)
    g_i(w) ≤ 0,  h_k(w) = 0,  α_i ≥ 0

SLIDE 32

Lagrange duality

  • Theorem: under proper conditions, there exist w*, α*, β* such that

    d* = ℒ(w*, α*, β*) = p*

  • Moreover, w*, α*, β* satisfy the KKT conditions:

    ∂ℒ/∂w_i = 0,  α_i g_i(w) = 0 (dual complementarity)
    g_i(w) ≤ 0,  h_k(w) = 0 (primal constraints),  α_i ≥ 0 (dual constraints)

SLIDE 33

Lagrange duality

  • What are the proper conditions?
  • One set of conditions (Slater's conditions):
      • f, g_i convex, h_k affine, and there exists w satisfying all g_i(w) < 0
  • Other sets of conditions exist
  • Check textbooks, e.g., Convex Optimization by Boyd and Vandenberghe
SLIDE 34

SVM: optimization

SLIDE 35

SVM: optimization

  • Optimization (quadratic programming):

    min_{w,b} (1/2) ||w||²
    s.t. y_i (w^T x_i + b) ≥ 1, ∀i

  • Generalized Lagrangian:

    ℒ(w, b, α) = (1/2) ||w||² - Σ_i α_i [y_i (w^T x_i + b) - 1]

    where the α_i are the Lagrange multipliers

SLIDE 36

SVM: optimization

  • KKT conditions:

    ∂ℒ/∂w = 0  →  w = Σ_i α_i y_i x_i    (1)
    ∂ℒ/∂b = 0  →  0 = Σ_i α_i y_i        (2)

  • Plug into ℒ:

    ℒ(w, b, α) = Σ_i α_i - (1/2) Σ_{i,k} α_i α_k y_i y_k x_i^T x_k    (3)

    combined with 0 = Σ_i α_i y_i, α_i ≥ 0

SLIDE 37

SVM: optimization

  • Reduces to the dual problem:

    max_α Σ_i α_i - (1/2) Σ_{i,k} α_i α_k y_i y_k x_i^T x_k
    s.t. Σ_i α_i y_i = 0, α_i ≥ 0

  • Since w = Σ_i α_i y_i x_i, we have

    w^T x + b = Σ_i α_i y_i x_i^T x + b

  • Only depends on inner products
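On a hypothetical two-point data set, the dual can be maximized by a simple grid search (the reduction alpha_1 = alpha_2 = a follows from the equality constraint, since y_1 = +1 and y_2 = -1):

```python
# Toy set: x1 = (1, 1) with y1 = +1, x2 = (-1, -1) with y2 = -1.
# With alpha1 = alpha2 = a, the dual objective reduces to 2a - 4a^2.
xs = [(1.0, 1.0), (-1.0, -1.0)]
ys = [1, -1]

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

def dual(a):
    alphas = [a, a]
    return (sum(alphas)
            - 0.5 * sum(alphas[i] * alphas[k] * ys[i] * ys[k] * dot(xs[i], xs[k])
                        for i in range(2) for k in range(2)))

best_a = max((i / 10000.0 for i in range(10001)), key=dual)
w = [sum(best_a * ys[i] * xs[i][d] for i in range(2)) for d in range(2)]
print(best_a)   # 0.25, the analytic maximizer of 2a - 4a^2
print(w)        # w = sum_i alpha_i y_i x_i = [0.5, 0.5]
```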

SLIDE 38

Support Vectors

  • the instances with α_i > 0 are called support vectors
  • they lie on the margin boundary
  • the solution is NOT changed if we delete the instances with α_i = 0
  • the final solution is a sparse linear combination of the training instances
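A sketch of the sparsity point: prediction only needs the instances with α_i > 0 (all numbers below are hypothetical toy values, e.g. extending the earlier two-point example with a far-away third point):

```python
# The third point gets alpha = 0 and can be deleted without changing the classifier.
xs = [(1.0, 1.0), (-1.0, -1.0), (3.0, 3.0)]
ys = [1, -1, 1]
alphas = [0.25, 0.25, 0.0]   # the far point (3, 3) is not a support vector
b = 0.0

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

def predict(x):
    """Sparse decision function: sum over support vectors only."""
    s = sum(a * y * dot(xi, x) for a, y, xi in zip(alphas, ys, xs) if a > 0) + b
    return 1 if s > 0 else -1

support = [i for i, a in enumerate(alphas) if a > 0]
print(support)               # [0, 1] -- only the margin-boundary points
print(predict((0.5, 0.0)))   # +1
print(predict((-2.0, 0.0)))  # -1
```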
SLIDE 39

Learning theory justification

  • Vapnik showed a connection between the margin and the VC-dimension:

    VC ≤ 4R² / margin²(h)

    where R is a constant dependent on the training data and VC is the VC-dimension of the hypothesis class

  • The VC-dimension appears in the error bound

    error_D(h) ≤ error_train(h) + sqrt( (VC (log(2n/VC) + 1) + log(4/δ)) / n )

    where error_D(h) is the error on the true distribution and error_train(h) is the training set error

  • thus to minimize the VC-dimension (and to improve the error bound): maximize the margin