SLIDE 1

Support Vector Machines Part 2

Yingyu Liang Computer Sciences 760 Fall 2017

http://pages.cs.wisc.edu/~yliang/cs760/

Some of the slides in these lectures have been adapted/borrowed from materials developed by Mark Craven, David Page, Jude Shavlik, Tom Mitchell, Nina Balcan, Matt Gormley, Elad Hazan, Tom Dietterich, and Pedro Domingos.

SLIDE 2

Goals for the lecture

you should understand the following concepts

  • soft margin SVM
  • support vector regression
  • the kernel trick
  • polynomial kernel
  • Gaussian/RBF kernel
  • valid kernels and Mercer’s theorem
  • kernels and neural networks

SLIDE 3

Variants: soft-margin and SVR

SLIDE 4

Hard-margin SVM

  • Optimization (Quadratic Programming):

$$\min_{w,b}\ \frac{1}{2}\|w\|^2 \qquad \text{s.t.}\quad y_j(w^\top x_j + b) \ge 1,\ \forall j$$

SLIDE 5

Soft-margin SVM [Cortes & Vapnik, Machine Learning 1995]

  • if the training instances are not linearly separable, the previous formulation will fail
  • we can adjust our approach by using slack variables (denoted by $\xi_j$) to tolerate errors

$$\min_{w,b,\xi_j}\ \frac{1}{2}\|w\|^2 + C\sum_j \xi_j \qquad \text{s.t.}\quad y_j(w^\top x_j + b) \ge 1 - \xi_j,\quad \xi_j \ge 0,\ \forall j$$

  • $C$ determines the relative importance of maximizing margin vs. minimizing slack
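A minimal sketch of this trade-off, assuming scikit-learn's SVC (whose C parameter plays the role of the slack weight above) and a made-up overlapping dataset:

```python
# Overlapping (non-separable) synthetic blobs; sklearn's C is the slack
# weight from the slide.  Illustrative sketch only.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (50, 2)), rng.normal(1.0, 1.0, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # small C: wide margin, more slack; large C: narrow margin, less slack
    print(f"C={C}: {len(clf.support_)} support vectors, "
          f"train accuracy {clf.score(X, y):.2f}")
```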

SLIDE 6

The effect of $C$ in soft-margin SVM

Figure from Ben-Hur & Weston, Methods in Molecular Biology 2010

SLIDE 7

Hinge loss

  • when we covered neural nets, we talked about minimizing squared loss and cross-entropy loss
  • SVMs minimize hinge loss

[Plot: loss (error) vs. model output $h(x)$ for a positive instance ($y = 1$), comparing squared loss, 0/1 loss, and hinge loss]
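The three losses in the plot are easy to compare directly; a small illustrative sketch with hypothetical model outputs:

```python
# The three losses from the plot, evaluated for a positive example (y = 1).
import numpy as np

def hinge_loss(y, h):     # max(0, 1 - y*h); zero once the margin exceeds 1
    return np.maximum(0.0, 1.0 - y * h)

def zero_one_loss(y, h):  # 1 if the predicted sign is wrong, else 0
    return (np.sign(h) != y).astype(float)

def squared_loss(y, h):   # (y - h)^2; penalizes even confident correct outputs
    return (y - h) ** 2

h = np.linspace(-2.0, 2.0, 5)       # hypothetical model outputs
print(hinge_loss(1, h))             # [3. 2. 1. 0. 0.]
print(zero_one_loss(1, h))          # [1. 1. 1. 0. 0.]
print(squared_loss(1, h))           # [9. 4. 1. 0. 1.]
```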

SLIDE 8

Support Vector Regression

  • the SVM idea can also be applied in regression tasks
  • an $\varepsilon$-insensitive error function specifies that a training instance is well explained if the model's prediction is within $\varepsilon$ of $y_j$

[Figure: the $\varepsilon$-tube around the regression function, bounded by $(w^\top x + b) - y = \varepsilon$ and $y - (w^\top x + b) = \varepsilon$]

SLIDE 9

Support Vector Regression

  • Regression using slack variables (denoted by $\xi_j$, $\hat{\xi}_j$) to tolerate errors

$$\min_{w,b,\xi_j,\hat{\xi}_j}\ \frac{1}{2}\|w\|^2 + C\sum_j \left(\xi_j + \hat{\xi}_j\right)$$

$$\text{s.t.}\quad (w^\top x_j + b) - y_j \le \varepsilon + \xi_j,\quad y_j - (w^\top x_j + b) \le \varepsilon + \hat{\xi}_j,\quad \xi_j, \hat{\xi}_j \ge 0,\ \forall j$$

  • slack variables allow predictions for some training instances to be off by more than $\varepsilon$
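A minimal sketch of $\varepsilon$-insensitive regression, assuming scikit-learn's SVR on synthetic data; epsilon is the tube half-width and C weights the slack:

```python
# epsilon-insensitive regression on synthetic data (illustrative sketch,
# assuming scikit-learn's SVR).
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-3.0, 3.0, (80, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(0.0, 0.1, 80)

# epsilon: half-width of the "well explained" tube; C: weight on the slack
model = SVR(kernel="rbf", C=1.0, epsilon=0.1).fit(X, y)
# support vectors are the points on or outside the tube
print("support vectors:", len(model.support_), "of", len(X))
```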
SLIDE 10

Kernel methods

SLIDE 11

Features

[Figure: an input image $x$ is mapped by "extract features" to a color histogram over red, green, and blue values, i.e., $x \mapsto \varphi(x)$]

SLIDE 12

Features

A proper feature mapping can turn a non-linear problem into a linear one!

SLIDE 13

Recall: SVM dual form

  • Reduces to the dual problem:

$$\mathcal{L}(w, b, \boldsymbol{\alpha}) = \sum_j \alpha_j - \frac{1}{2}\sum_{j,k} \alpha_j \alpha_k\, y_j y_k\, x_j^\top x_k$$

$$\sum_j \alpha_j y_j = 0,\quad \alpha_j \ge 0$$

  • Since $w = \sum_j \alpha_j y_j x_j$, we have $w^\top x + b = \sum_j \alpha_j y_j\, x_j^\top x + b$
  • These only depend on inner products
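To see the inner-product dependence concretely, this sketch (assuming scikit-learn, where dual_coef_ stores $\alpha_j y_j$ for the support vectors) rebuilds the decision value by hand:

```python
# Rebuilding the decision value from support vectors and inner products
# only; illustrative sketch with a made-up toy dataset.
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [2.5, 1.5], [-1.0, -1.0], [-1.5, -0.5]])
y = np.array([1, 1, -1, -1])
clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C ~ hard margin

x_new = np.array([1.0, 1.0])
# sum_j alpha_j y_j (x_j^T x_new) + b, over support vectors only
f = clf.dual_coef_ @ (clf.support_vectors_ @ x_new) + clf.intercept_
print(f, clf.decision_function([x_new]))      # the two values agree
```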

SLIDE 14

Features

  • Using SVM on the feature space $\{\varphi(x_j)\}$: only need $\varphi(x_j)^\top \varphi(x_k)$
  • Conclusion: no need to design $\varphi(\cdot)$, only need to design the kernel

$$k(x_j, x_k) = \varphi(x_j)^\top \varphi(x_k)$$

SLIDE 15

Polynomial kernels

  • Fix degree $d$ and constant $c$:

$$k(x, x') = (x^\top x' + c)^d$$

  • What is $\varphi(x)$?
  • Expand the expression to get $\varphi(x)$
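For $d = 2$ and $x \in \mathbb{R}^2$, expanding the square gives an explicit $\varphi$; a quick numerical check (illustrative sketch):

```python
# Numerical check that (x^T x' + c)^2 = phi(x)^T phi(x') for the explicit
# degree-2 feature map obtained by expanding the square.
import numpy as np

def phi(x, c):
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2,
                     np.sqrt(2 * c) * x1, np.sqrt(2 * c) * x2, c])

x, xp, c = np.array([1.0, 2.0]), np.array([3.0, -1.0]), 1.0
print((x @ xp + c) ** 2)        # 4.0
print(phi(x, c) @ phi(xp, c))   # 4.0 as well
```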
SLIDE 16

Polynomial kernels

Figure from Foundations of Machine Learning, by M. Mohri, A. Rostamizadeh, and A. Talwalkar

SLIDE 17

SVMs with polynomial kernels

Figure from Ben-Hur & Weston, Methods in Molecular Biology 2010

SLIDE 18

Gaussian/RBF kernels

  • Fix bandwidth $\sigma$:

$$k(x, x') = \exp\left(-\|x - x'\|^2 / 2\sigma^2\right)$$

  • Also called radial basis function (RBF) kernels
  • What is $\varphi(x)$? Consider the un-normalized version

$$k'(x, x') = \exp\left(x^\top x' / \sigma^2\right)$$

  • Power series expansion:

$$k'(x, x') = \sum_{j=0}^{+\infty} \frac{(x^\top x')^j}{\sigma^{2j}\, j!}$$

SLIDE 19

The RBF kernel illustrated

Figures from openclassroom.stanford.edu (Andrew Ng)

[Figure: RBF-kernel decision boundaries for three bandwidth settings: 10, 100, 1000]

SLIDE 20

Mercer's condition for kernels

  • Theorem: $k(x, x')$ has an expansion

$$k(x, x') = \sum_{j=1}^{+\infty} a_j\, \varphi_j(x)\, \varphi_j(x')$$

if and only if for any function $c(x)$,

$$\int\!\!\int c(x)\, c(x')\, k(x, x')\, dx\, dx' \ge 0$$

(omitting some technical conditions on $k$ and $c$)

SLIDE 21

Constructing new kernels

  • Kernels are closed under positive scaling, sum, product, pointwise limit, and composition with a power series $\sum_{j=0}^{+\infty} a_j\, k^j(x, x')$
  • Example: if $k_1(x, x')$ and $k_2(x, x')$ are kernels, then so is

$$k(x, x') = 2k_1(x, x') + 3k_2(x, x')$$

  • Example: if $k_1(x, x')$ is a kernel, then so is

$$k(x, x') = \exp(k_1(x, x'))$$
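These closure rules can be sanity-checked numerically: the Gram matrix of a valid kernel is positive semidefinite, so its smallest eigenvalue should be (numerically) non-negative. An illustrative sketch on random data:

```python
# Gram matrices of valid kernels are PSD; check that sum, product, and
# exp preserve this on random data (a numerical sanity check, not a proof).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))

K1 = X @ X.T                                                    # linear kernel
K2 = np.exp(-np.sum((X[:, None] - X[None]) ** 2, axis=-1) / 2)  # RBF kernel

for name, K in [("2*K1 + 3*K2", 2 * K1 + 3 * K2),
                ("K1 * K2 (elementwise)", K1 * K2),
                ("exp(K1)", np.exp(K1))]:
    print(name, "min eigenvalue:", np.linalg.eigvalsh(K).min())  # ~>= 0
```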

SLIDE 22

Kernel algebra

  • given a valid kernel, we can make new valid kernels using a variety of operators

kernel composition ↔ mapping composition:

  • $k(x, v) = k_a(x, v) + k_b(x, v)$ ↔ $\varphi(x) = (\varphi_a(x), \varphi_b(x))$
  • $k(x, v) = \gamma\, k_a(x, v),\ \gamma > 0$ ↔ $\varphi(x) = \sqrt{\gamma}\, \varphi_a(x)$
  • $k(x, v) = k_a(x, v)\, k_b(x, v)$ ↔ $\varphi_l(x) = \varphi_{ai}(x)\, \varphi_{bj}(x)$
  • $k(x, v) = f(x)\, f(v)\, k_a(x, v)$ ↔ $\varphi(x) = f(x)\, \varphi_a(x)$

SLIDE 23

Kernels vs. neural networks

SLIDE 24

Features

[Figure: an input image $x$ is mapped by "extract features" to a color histogram $\varphi(x)$ over red, green, and blue values; "build hypothesis" then fits $y = w^\top \varphi(x)$ on top]

SLIDE 25

Features: part of the model

build hypothesis:

$$y = w^\top \varphi(x)$$

a linear model over the features $\varphi(x)$, but a nonlinear model over the raw input

SLIDE 26

Polynomial kernels

Figure from Foundations of Machine Learning, by M. Mohri, A. Rostamizadeh, and A. Talwalkar

SLIDE 27

Polynomial kernel SVM as two layer neural network

[Network diagram: inputs $x_1, x_2$; a fixed first layer computes the features $x_1^2,\ x_2^2,\ \sqrt{2}\,x_1 x_2,\ \sqrt{2c}\,x_1,\ \sqrt{2c}\,x_2,\ c$; the output is $y = \mathrm{sign}(w^\top \varphi(x) + b)$]

The first layer is fixed. If we also learn the first layer, it becomes a two-layer neural network.
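One way to see the equivalence: a degree-2 polynomial-kernel SVM and a linear SVM trained on the explicit "first layer" $\varphi$ should produce the same predictions. A sketch assuming scikit-learn's SVC, whose poly kernel is $(\gamma\, x^\top x' + \text{coef0})^{\text{degree}}$:

```python
# A degree-2 poly-kernel SVM vs. a linear SVM on the explicit feature map
# ("fixed first layer"); predictions should match.  Illustrative sketch.
import numpy as np
from sklearn.svm import SVC

def phi(X, c=1.0):
    x1, x2 = X[:, 0], X[:, 1]
    return np.stack([x1**2, x2**2, np.sqrt(2) * x1 * x2,
                     np.sqrt(2 * c) * x1, np.sqrt(2 * c) * x2,
                     np.full_like(x1, c)], axis=1)

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = np.sign(X[:, 0]**2 + X[:, 1]**2 - 1.0)    # circular decision boundary

# gamma=1, coef0=1, degree=2 gives exactly (x^T x' + 1)^2
kernel_svm = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0, C=1.0).fit(X, y)
linear_svm = SVC(kernel="linear", C=1.0).fit(phi(X), y)
agree = (kernel_svm.predict(X) == linear_svm.predict(phi(X))).mean()
print("agreement:", agree)                     # ~1.0
```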

SLIDE 28

Comments on SVMs

  • we can find solutions that are globally optimal (maximize the margin)
      • because the learning task is framed as a convex optimization problem
      • no local minima, in contrast to multi-layer neural nets
  • there are two formulations of the optimization: primal and dual
      • the dual represents the classifier decision in terms of support vectors
      • the dual enables the use of kernel functions
  • we can use a wide range of optimization methods to learn SVMs
      • standard quadratic programming solvers
      • SMO [Platt, 1999]
      • linear programming solvers for some formulations
      • etc.

SLIDE 29

Comments on SVMs

  • kernels provide a powerful way to
      • allow nonlinear decision boundaries
      • represent/compare complex objects such as strings and trees
      • incorporate domain knowledge into the learning task
  • using the kernel trick, we can implicitly use high-dimensional mappings without explicitly computing them
  • one SVM can represent only a binary classification task; multi-class problems are handled using multiple SVMs and some encoding
  • empirically, SVMs have shown (close to) state-of-the-art accuracy for many tasks
  • the kernel idea can be extended to other tasks (anomaly detection, regression, etc.)