SLIDE 1

Machine Learning Basics Lecture 4: SVM I

Princeton University COS 495 Instructor: Yingyu Liang

SLIDE 2

Review: machine learning basics

SLIDE 3

Math formulation

  • Given training data (x_i, y_i), 1 ≤ i ≤ n, i.i.d. from distribution D
  • Find y = f(x) ∈ ℋ that minimizes the empirical loss

    L̂(f) = (1/n) Σ_{i=1}^n ℓ(f, x_i, y_i)

  • s.t. the expected loss is small

    L(f) = 𝔼_{(x,y)~D} [ℓ(f, x, y)]
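As a concrete numeric illustration of the empirical loss (not from the slides; the squared loss, linear predictor, and toy data below are my own assumptions):

```python
import numpy as np

def empirical_loss(w, X, Y):
    """Empirical loss (1/n) * sum_i l(f, x_i, y_i) with squared loss and f(x) = w.x."""
    preds = X @ w                      # f(x_i) for every training point
    return np.mean((preds - Y) ** 2)

# Toy training set: n = 4 points in d = 2 dimensions.
X = np.array([[1.0, 2.0], [0.5, -1.0], [2.0, 0.0], [-1.0, 1.0]])
Y = np.array([3.0, -0.5, 2.0, 0.0])
print(empirical_loss(np.array([1.0, 1.0]), X, Y))
```

The expected loss L(f) cannot be computed this way, since D is unknown; the empirical average above is its estimate from the sample.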

SLIDE 4

Machine learning 1-2-3

  • Collect data and extract features
  • Build model: choose hypothesis class ℋ and loss function ℓ
  • Optimization: minimize the empirical loss
SLIDE 5

Loss function

  • π‘š2 loss: linear regression
  • Cross-entropy: logistic regression
  • Hinge loss: Perceptron
  • General principle: maximum likelihood estimation (MLE)
  • π‘š2 loss: corresponds to Normal distribution
  • logistic regression: corresponds to sigmoid conditional distribution
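For concreteness, the three losses as functions of the score f(x) and the label y (an illustrative sketch; it assumes labels y ∈ {+1, −1} for the classification losses, and takes the Perceptron's "hinge" as max(0, −y·f(x)), whereas the SVM hinge of later lectures adds a margin of 1):

```python
import numpy as np

def l2_loss(f, y):
    """Squared loss (f - y)^2, as in linear regression."""
    return (f - y) ** 2

def cross_entropy_loss(f, y):
    """Logistic / cross-entropy loss log(1 + exp(-y f)) for y in {+1, -1}."""
    return np.log(1.0 + np.exp(-y * f))

def perceptron_loss(f, y):
    """Perceptron criterion max(0, -y f); the SVM hinge would be max(0, 1 - y f)."""
    return np.maximum(0.0, -y * f)

score, label = 0.8, +1
print(l2_loss(score, label), cross_entropy_loss(score, label), perceptron_loss(score, label))
```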
SLIDE 6

Optimization

  • Linear regression: closed form solution
  • Logistic regression: gradient descent
  • Perceptron: stochastic gradient descent
  • General principle: local improvement
  • SGD: Perceptron; can also be applied to linear regression/logistic regression
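A minimal sketch of the Perceptron's stochastic "local improvement" update (the learning rate, epoch count, and toy data are arbitrary illustrative choices):

```python
import numpy as np

def perceptron_sgd(X, Y, epochs=10, lr=1.0):
    """Perceptron trained point by point: nudge w whenever an example is misclassified."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in np.random.permutation(n):
            if Y[i] * (w @ X[i]) <= 0:   # misclassified (or on the boundary)
                w += lr * Y[i] * X[i]    # local improvement step
    return w

X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
Y = np.array([+1, +1, -1, -1])
w = perceptron_sgd(X, Y)
print(w, np.sign(X @ w))   # learned weights and their predictions on the training points
```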
SLIDE 7

Principle for hypothesis class?

  • Yes, there exists a general principle (at least philosophically)
  • Different names/faces/connections
  • Occam’s razor
  • VC dimension theory
  • Minimum description length
  • Tradeoff between bias and variance; uniform convergence
  • The curse of dimensionality
  • Running example: Support Vector Machine (SVM)
SLIDE 8

Motivation

SLIDE 9

Linear classification

(π‘₯βˆ—)π‘ˆπ‘¦ = 0 Class +1 Class -1 π‘₯βˆ— (π‘₯βˆ—)π‘ˆπ‘¦ > 0 (π‘₯βˆ—)π‘ˆπ‘¦ < 0 Assume perfect separation between the two classes

SLIDE 10

Attempt

  • Given training data (x_i, y_i), 1 ≤ i ≤ n, i.i.d. from distribution D
  • Hypothesis: y = sign(f_w(x)) = sign(w^T x)
  • y = +1 if w^T x > 0
  • y = −1 if w^T x < 0
  • Let's assume that we can optimize to find w
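The hypothesis above, written out (a tiny illustrative helper; the weights and points are made up):

```python
import numpy as np

def predict(w, X):
    """Linear classifier: y = sign(w^T x) for each row x of X."""
    return np.sign(X @ w)

w = np.array([1.0, -2.0])
X = np.array([[3.0, 1.0], [0.5, 1.0]])
print(predict(w, X))   # [ 1. -1.]
```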
SLIDE 11

Multiple optimal solutions?

Class +1 Class -1 π‘₯2 π‘₯3 π‘₯1 Same on empirical loss; Different on test/expected loss

SLIDE 12

What about π‘₯1?

Class +1 Class -1 π‘₯1

New test data

SLIDE 13

What about π‘₯3?

Class +1 Class -1 π‘₯3

New test data

SLIDE 14

Most confident: w_2

Class +1 Class -1 π‘₯2

New test data

SLIDE 15

Intuition: margin

Class +1 Class -1 π‘₯2

large margin

SLIDE 16

Margin

SLIDE 17

Margin

  • Lemma 1: x has distance |f_w(x)| / ‖w‖ to the hyperplane f_w(x) = w^T x = 0

Proof:

  • w is orthogonal to the hyperplane
  • The unit direction is w / ‖w‖
  • The projection of x onto this direction is (w / ‖w‖)^T x = f_w(x) / ‖w‖

[Figure: point x, weight vector w, and the projection of x onto the direction w / ‖w‖.]
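A quick numeric check of Lemma 1 (illustrative; the vectors are made up): the distance obtained by projecting x onto the unit normal matches |f_w(x)| / ‖w‖.

```python
import numpy as np

w = np.array([3.0, 4.0])          # normal vector of the hyperplane w^T x = 0
x = np.array([2.0, 1.0])

u = w / np.linalg.norm(w)         # unit direction orthogonal to the hyperplane
dist_by_projection = abs(u @ x)   # length of the projection of x onto u
dist_by_lemma = abs(w @ x) / np.linalg.norm(w)

print(dist_by_projection, dist_by_lemma)   # both 2.0
```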

SLIDE 18

Margin: with bias

  • Claim 1: π‘₯ is orthogonal to the hyperplane 𝑔

π‘₯,𝑐 𝑦 = π‘₯π‘ˆπ‘¦ + 𝑐 = 0

Proof:

  • pick any 𝑦1 and 𝑦2 on the hyperplane
  • π‘₯π‘ˆπ‘¦1 + 𝑐 = 0
  • π‘₯π‘ˆπ‘¦2 + 𝑐 = 0
  • So π‘₯π‘ˆ(𝑦1 βˆ’ 𝑦2) = 0
SLIDE 19

Margin: with bias

  • Claim 2: 0 has distance −b / ‖w‖ to the hyperplane w^T x + b = 0

Proof:

  • pick any x_1 on the hyperplane
  • Project x_1 onto the unit direction w / ‖w‖ to get the (signed) distance
  • (w / ‖w‖)^T x_1 = −b / ‖w‖ since w^T x_1 + b = 0

SLIDE 20

Margin: with bias

  • Lemma 2: x has distance |f_{w,b}(x)| / ‖w‖ to the hyperplane f_{w,b}(x) = w^T x + b = 0

Proof:

  • Let x = x_⊥ + r · w / ‖w‖, where x_⊥ lies on the hyperplane; then |r| is the distance
  • Multiply both sides by w^T and add b
  • Left hand side: w^T x + b = f_{w,b}(x)
  • Right hand side: w^T x_⊥ + r · w^T w / ‖w‖ + b = 0 + r ‖w‖
  • So f_{w,b}(x) = r ‖w‖, i.e., the distance is |r| = |f_{w,b}(x)| / ‖w‖

SLIDE 21

𝑧 𝑦 = π‘₯π‘ˆπ‘¦ + π‘₯0 The notation here is:

Figure from Pattern Recognition and Machine Learning, Bishop

SLIDE 22

Support Vector Machine (SVM)

SLIDE 23

SVM: objective

  • Margin over all training data points:

    γ = min_i |f_{w,b}(x_i)| / ‖w‖

  • Since we only want a correct f_{w,b}, and recall y_i ∈ {+1, −1}, we have

    γ = min_i y_i f_{w,b}(x_i) / ‖w‖

  • If f_{w,b} is incorrect on some x_i, the margin is negative

SLIDE 24

SVM: objective

  • Maximize margin over all training data points:

    max_{w,b} γ = max_{w,b} min_i y_i f_{w,b}(x_i) / ‖w‖ = max_{w,b} min_i y_i (w^T x_i + b) / ‖w‖

  • A bit complicated …
SLIDE 25

SVM: simplified objective

  • Observation: when (w, b) is scaled by a factor c, the margin is unchanged:

    y_i (c w^T x_i + c b) / ‖c w‖ = y_i (w^T x_i + b) / ‖w‖

  • Let's consider a fixed scale such that

    y_{i*} (w^T x_{i*} + b) = 1

    where x_{i*} is the point closest to the hyperplane

SLIDE 26

SVM: simplified objective

  • Let's consider a fixed scale such that

    y_{i*} (w^T x_{i*} + b) = 1

    where x_{i*} is the point closest to the hyperplane

  • Now we have for all data

    y_i (w^T x_i + b) ≥ 1, and the equality holds for at least one i

  • Then the margin is 1 / ‖w‖

SLIDE 27

SVM: simplified objective

  • Optimization simplified to

    min_{w,b} (1/2) ‖w‖²
    s.t. y_i (w^T x_i + b) ≥ 1, ∀i

  • How to find the optimum ŵ*?

SLIDE 28

SVM: principle for hypothesis class

SLIDE 29

Thought experiment

  • Suppose we pick an R, and suppose we can decide whether there exists a w satisfying

    (1/2) ‖w‖² ≤ R
    y_i (w^T x_i + b) ≥ 1, ∀i

  • Decrease R until we cannot find a w satisfying the inequalities (see the sketch below)
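A sketch of this thought experiment, again leaning on cvxpy (an external dependency; the halving schedule and toy data are arbitrary illustrative choices) for the feasibility check at each candidate R:

```python
import cvxpy as cp
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 0.5], [-2.0, -1.0], [-0.5, -2.0]])
Y = np.array([+1, +1, -1, -1])

def feasible(R):
    """Is there a (w, b) with (1/2)||w||^2 <= R and y_i (w^T x_i + b) >= 1 for all i?"""
    w, b = cp.Variable(2), cp.Variable()
    constraints = [0.5 * cp.sum_squares(w) <= R,
                   cp.multiply(Y, X @ w + b) >= 1]
    prob = cp.Problem(cp.Minimize(0), constraints)
    prob.solve()
    return prob.status == cp.OPTIMAL

R = 10.0
while feasible(R / 2):   # keep shrinking the hypothesis class while a valid w still exists
    R /= 2
print("smallest feasible R found by halving:", R)
```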
SLIDE 30

Thought experiment

  • ŵ* is the best weight (i.e., satisfying the smallest R)

SLIDE 31

Thought experiment

  • ŵ* is the best weight (i.e., satisfying the smallest R)

SLIDE 32

Thought experiment

  • ŵ* is the best weight (i.e., satisfying the smallest R)

SLIDE 33

Thought experiment

  • ŵ* is the best weight (i.e., satisfying the smallest R)

SLIDE 34

Thought experiment

  • ŵ* is the best weight (i.e., satisfying the smallest R)

SLIDE 35

Thought experiment

  • To handle the difference between empirical and expected losses →
  • Choose large margin hypothesis (high confidence) →
  • Choose a small hypothesis class

[Figure: ŵ* and the region that corresponds to the hypothesis class.]

SLIDE 36

Thought experiment

  • Principle: use smallest hypothesis class still with a correct/good one
  • Also true beyond SVM
  • Also true for the case without perfect separation between the two classes
  • Math formulation: VC-dim theory, etc.

ෝ π‘₯βˆ—

Corresponds to the hypothesis class

SLIDE 37

Thought experiment

  • Principle: use smallest hypothesis class still with a correct/good one
  • Whatever you know about the ground truth, add it as constraint/regularizer

ෝ π‘₯βˆ—

Corresponds to the hypothesis class

SLIDE 38

SVM: optimization

  • Optimization (Quadratic Programming):

    min_{w,b} (1/2) ‖w‖²
    s.t. y_i (w^T x_i + b) ≥ 1, ∀i

  • Solved by the Lagrange multiplier method:

    ℒ(w, b, α) = (1/2) ‖w‖² − Σ_i α_i [y_i (w^T x_i + b) − 1]

    where the α_i are the Lagrange multipliers

  • Details in next lecture
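In practice, a library solver handles this QP. For instance, scikit-learn's linear SVC with a very large C approximates the hard-margin problem above (an approximation via an external library, not the derivation of the next lecture; the toy data are made up):

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 0.5], [-2.0, -1.0], [-0.5, -2.0]])
Y = np.array([+1, +1, -1, -1])

# Large C -> (near) hard-margin linear SVM; the solver works through the Lagrangian dual.
clf = SVC(kernel="linear", C=1e6).fit(X, Y)

print(clf.coef_, clf.intercept_)   # w and b
print(clf.support_vectors_)        # points that (approximately) attain y_i (w^T x_i + b) = 1
```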
SLIDE 39

Reading

  • Review Lagrange multiplier method
  • E.g., Section 5 in Andrew Ng's note on SVM, posted on the course website:

http://www.cs.princeton.edu/courses/archive/spring16/cos495/