SLIDE 1

  • SVMs, Duality and the Kernel Trick (cont.)

Machine Learning – 10701/15781
Carlos Guestrin, Carnegie Mellon University
March 1st, 2006

Two SVM tutorials are linked on the class website (please read both):

  - High-level presentation with applications (Hearst 1998)
  - Detailed tutorial (Burges 1998)

SLIDE 2

  • SVMs reminder
SLIDE 3

  • Today’s lecture

Learn one of the most interesting and exciting recent advances in machine learning:

  - The “kernel trick”
  - High-dimensional feature spaces at no extra cost!

But first, a detour: constrained optimization!

SLIDE 4

  • Dual SVM interpretation

w·x + b = 0

SLIDE 5

  • Dual SVM formulation – the linearly separable case

  maximize over α:  Σi αi − ½ Σi Σj αi αj yi yj (xi·xj)
  subject to:       αi ≥ 0 for all i,  Σi αi yi = 0
  recover the primal solution as w = Σi αi yi xi

SLIDE 6

  • Reminder from last time: What if the data is not linearly separable?

Use features of features, of features of features….

Feature space can get really large really quickly!

SLIDE 7

  • Higher order polynomials

[Figure: number of monomial terms versus number of input dimensions, for polynomial degrees d = 2, 3, 4; m – input features, d – degree of polynomial]

The number of terms grows fast! For d = 6, m = 100: about 1.6 billion terms.
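To make the growth concrete, here is a small sketch (the helper name is ours): by stars and bars, the number of distinct monomials of degree exactly d in m variables is C(m + d − 1, d).

```python
from math import comb

def monomial_terms(m: int, d: int) -> int:
    """Distinct monomials of degree exactly d in m variables: C(m + d - 1, d)."""
    return comb(m + d - 1, d)

for d in (2, 3, 4, 6):
    print(d, monomial_terms(100, d))
# d = 6, m = 100 -> 1,609,344,100: the "about 1.6 billion terms" on the slide
```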

SLIDE 8

  • Dual formulation only depends on dot-products, not on w!

SLIDE 9

  • Finally: the “kernel trick”!

  - Never represent features explicitly
  - Compute dot products in closed form
  - Constant-time high-dimensional dot-products for many classes of features

Very interesting theory – Reproducing Kernel Hilbert Spaces. Not covered in detail in 10701/15781; more in 10702.
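A quick numerical sketch of the trick, assuming the standard degree-2 monomial feature map for 2-D inputs (variable names are ours): the kernel value computed purely in input space matches the dot product of the explicitly constructed features.

```python
import numpy as np

def phi(x):
    """Explicit degree-2 monomial feature map for 2-D input:
    phi(x) = [x1^2, sqrt(2)*x1*x2, x2^2]."""
    x1, x2 = x
    return np.array([x1 * x1, np.sqrt(2) * x1 * x2, x2 * x2])

rng = np.random.default_rng(0)
x, z = rng.normal(size=2), rng.normal(size=2)

explicit = phi(x) @ phi(z)   # dot product of explicitly built features
kernel = (x @ z) ** 2        # degree-2 polynomial kernel, input space only
assert np.isclose(explicit, kernel)   # same value, no features materialized
```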

SLIDE 10

  • Common kernels

  - Polynomials of degree d: K(u,v) = (u·v)^d
  - Polynomials of degree up to d: K(u,v) = (u·v + 1)^d
  - Gaussian kernels: K(u,v) = exp(−‖u − v‖² / 2σ²)
  - Sigmoid: K(u,v) = tanh(η u·v + ν)
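A minimal sketch of the kernels above (parameter names η, ν, σ are our choices; conventions vary):

```python
import numpy as np

def poly_kernel(u, v, d=2):
    return (u @ v) ** d                  # polynomials of degree exactly d

def poly_up_to_kernel(u, v, d=2):
    return (u @ v + 1.0) ** d            # polynomials of degree up to d

def gaussian_kernel(u, v, sigma=1.0):
    return np.exp(-np.sum((u - v) ** 2) / (2 * sigma ** 2))

def sigmoid_kernel(u, v, eta=1.0, nu=0.0):
    return np.tanh(eta * (u @ v) + nu)   # not always a valid (PSD) kernel
```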

SLIDE 11

  • Overfitting?

Huge feature space with kernels: what about overfitting???

  - Maximizing the margin leads to a sparse set of support vectors
  - Some interesting theory says that SVMs search for a simple hypothesis with large margin
  - Often robust to overfitting

SLIDE 12

  • What about at classification time?

For a new input x, if we need to represent Φ(x), we are in trouble!

Recall the classifier: sign(w·Φ(x) + b). Using kernels, we are cool!

SLIDE 13

  • SVMs with kernels

  - Choose a set of features and a kernel function
  - Solve the dual problem to obtain the support vectors αi
  - At classification time, compute: f(x) = Σi αi yi K(xi, x) + b
  - Classify as sign(f(x))
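A sketch of this classification-time computation, assuming the αi, b, and kernel K were already obtained from the dual (function and variable names are ours):

```python
import numpy as np

def svm_predict(x_new, X_sv, y_sv, alphas, b, K):
    """Return sign( sum_i alpha_i y_i K(x_i, x_new) + b ) over the support vectors."""
    f = sum(a * y * K(x_i, x_new) for a, y, x_i in zip(alphas, y_sv, X_sv)) + b
    return np.sign(f)
```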

SLIDE 14

  • Remember kernel regression???

  1. wi = exp(−D(xi, query)² / Kw²)
  2. How to fit with the local points? Predict the weighted average of the outputs:
     predict = Σi wi yi / Σi wi
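A minimal sketch of these two steps (Nadaraya-Watson style; names are ours):

```python
import numpy as np

def kernel_regression(query, X, y, Kw=1.0):
    """predict = sum_i w_i y_i / sum_i w_i with w_i = exp(-D(x_i, query)^2 / Kw^2)."""
    d2 = np.sum((X - query) ** 2, axis=1)   # squared distances D(x_i, query)^2
    w = np.exp(-d2 / Kw ** 2)               # step 1: the weights w_i
    return np.dot(w, y) / np.sum(w)         # step 2: weighted average of outputs
```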

SLIDE 15

  • SVMs v. Kernel Regression

[Side-by-side comparison of the SVM and kernel regression prediction equations]
SLIDE 16

  • SVMs v. Kernel Regression

[Side-by-side comparison of the SVM and kernel regression prediction equations]

Differences:

SVMs:
  - Learn weights αi (and bandwidth)
  - Often sparse solution

KR:
  - Fixed “weights”, learn bandwidth
  - Solution may not be sparse
  - Much simpler to implement

SLIDE 17

  • What’s the difference between SVMs and Logistic Regression?

                                            Logistic Regression | SVMs
  Loss function                             Log-loss            | Hinge loss
  High-dimensional features with kernels    No                  | Yes!

SLIDE 18

  • Kernels in logistic regression

  - Define the weights in terms of support vectors: w = Σi αi Φ(xi)
  - Derive a simple gradient descent rule on αi
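A sketch of such a gradient descent rule, assuming f(x) = Σj αj K(xj, x), labels y in {−1, +1}, and log-loss (the learning rate and iteration count are arbitrary choices of ours):

```python
import numpy as np

def kernel_logreg(Kmat, y, lr=0.1, iters=500):
    """Kmat: m x m kernel matrix K(x_i, x_j); y: array of labels in {-1,+1}.
    Minimizes log-loss sum_i log(1 + exp(-y_i f(x_i))) by gradient descent."""
    m = len(y)
    alpha = np.zeros(m)
    for _ in range(iters):
        f = Kmat @ alpha                   # f(x_i) for all training points
        s = 1.0 / (1.0 + np.exp(y * f))    # sigma(-y_i f(x_i))
        grad = -Kmat @ (y * s)             # d(log-loss)/d(alpha), K symmetric
        alpha -= lr / m * grad
    return alpha
```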

SLIDE 19

  • What’s the difference between SVMs and Logistic Regression? (Revisited)

                                            Logistic Regression | SVMs
  Loss function                             Log-loss            | Hinge loss
  High-dimensional features with kernels    Yes!                | Yes!
  Solution sparse                           Almost always no!   | Often yes!
  Semantics of output                       Real probabilities  | “margin”

SLIDE 20

  • What you need to know

  - Dual SVM formulation and how it’s derived
  - The kernel trick
  - Deriving the polynomial kernel
  - Common kernels
  - Kernelized logistic regression
  - Differences between SVMs and logistic regression

SLIDE 21

  • Acknowledgment

SVM applet:

http://www.site.uottawa.ca/~gcaron/applets.htm

SLIDE 22

  • PAC-learning, VC Dimension and Margin-based Bounds

Machine Learning – 10701/15781
Carlos Guestrin, Carnegie Mellon University
March 1st, 2006

More details:

  - General: http://www.learning-with-kernels.org/
  - Example of more complex bounds: http://www.research.ibm.com/people/t/tzhang/papers/jmlr02_cover.ps.gz

SLIDE 23

  • What now…

We have explored many ways of learning from data. But…

  - How good is our classifier, really?
  - How much data do I need to make it “good enough”?

SLIDE 24

  • A simple setting…

Classification:

  - m data points
  - Finite number of possible hypotheses (e.g., decision trees of depth d)

A learner finds a hypothesis h that is consistent with the training data:

  - Gets zero error in training: errortrain(h) = 0

What is the probability that h has more than ε true error?

  errortrue(h) ≥ ε

SLIDE 25

  • How likely is a bad hypothesis to get m data points right?

A hypothesis h that is consistent with the training data got m i.i.d. points right.

  - Prob. that h with errortrue(h) ≥ ε gets one data point right: ≤ 1 − ε
  - Prob. that h with errortrue(h) ≥ ε gets m data points right: ≤ (1 − ε)^m
SLIDE 26

  • But there are many possible hypotheses that are consistent with the training data

SLIDE 27

  • How likely is the learner to pick a bad hypothesis?

  - Prob. that h with errortrue(h) ≥ ε gets m data points right: ≤ (1 − ε)^m
  - There are k hypotheses consistent with the data
  - How likely is the learner to pick a bad one?

SLIDE 28

  • Union bound

P(A or B or C or D or …) ≤ P(A) + P(B) + P(C) + P(D) + …

SLIDE 29

  • How likely is the learner to pick a bad hypothesis?

  - Prob. that h with errortrue(h) ≥ ε gets m data points right: ≤ (1 − ε)^m
  - There are k hypotheses consistent with the data
  - How likely is the learner to pick a bad one? By the union bound: ≤ k (1 − ε)^m

SLIDE 30

  • Review: Generalization error in finite hypothesis spaces [Haussler ’88]

Theorem: For a finite hypothesis space H and a dataset D with m i.i.d. samples, 0 < ε < 1: for any learned hypothesis h that is consistent on the training data:

  P(errortrue(h) > ε) ≤ |H| e^(−mε)

SLIDE 31

  • Using a PAC bound

Typically, 2 use cases (set |H| e^(−mε) ≤ δ and solve):

  1. Pick ε and δ; the bound gives you m:  m ≥ (ln |H| + ln(1/δ)) / ε
  2. Pick m and δ; the bound gives you ε:  ε ≥ (ln |H| + ln(1/δ)) / m
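Both use cases, as a small sketch (function names are ours):

```python
from math import log, ceil

def samples_needed(H_size, eps, delta):
    """Use case 1: pick eps and delta, solve |H| e^{-m eps} <= delta for m."""
    return ceil((log(H_size) + log(1 / delta)) / eps)

def error_guarantee(H_size, m, delta):
    """Use case 2: pick m and delta, solve for eps."""
    return (log(H_size) + log(1 / delta)) / m

print(samples_needed(2 ** 20, eps=0.05, delta=0.01))
print(error_guarantee(2 ** 20, m=10000, delta=0.01))
```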

SLIDE 32

  • Review: Generalization error in finite hypothesis spaces [Haussler ’88]

Theorem: For a finite hypothesis space H and a dataset D with m i.i.d. samples, 0 < ε < 1: for any learned hypothesis h that is consistent on the training data:

  P(errortrue(h) > ε) ≤ |H| e^(−mε)

Even if h makes zero errors on the training data, it may make errors at test time.

SLIDE 33

  • Limitations of Haussler ‘88 bound

  - Requires a consistent classifier
  - Depends on the size of the hypothesis space

SLIDE 34

  • What if our classifier does not have zero error on the training data?

A learner with zero training error may make mistakes on the test set. What about a learner with errortrain(h) > 0 on the training set?

SLIDE 35

  • Simpler question: What’s the expected error of a hypothesis?

The error of a hypothesis is like estimating the parameter of a coin!

Chernoff bound: for m i.i.d. coin flips x1,…,xm, where xi ∈ {0,1} and P(xi = 1) = θ, for 0 < ε < 1:

  P( θ − (1/m) Σi xi > ε ) ≤ e^(−2mε²)
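The same bound rearranged both ways, as a sketch (function names are ours):

```python
from math import log, sqrt, ceil

def flips_needed(eps, delta):
    """Flips needed so the estimate is within eps with probability >= 1 - delta,
    from exp(-2 m eps^2) <= delta."""
    return ceil(log(1 / delta) / (2 * eps ** 2))

def error_radius(m, delta):
    """How tight an estimate m flips buy you at confidence 1 - delta."""
    return sqrt(log(1 / delta) / (2 * m))
```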

SLIDE 36

  • Using the Chernoff bound to estimate the error of a single hypothesis

SLIDE 37

  • But we are comparing many hypotheses: Union bound

For each hypothesis hi:  P( errortrue(hi) − errortrain(hi) > ε ) ≤ e^(−2mε²)

What if I am comparing two hypotheses, h1 and h2?

SLIDE 38

  • Generalization bound for |H| hypotheses

Theorem: For a finite hypothesis space H and a dataset D with m i.i.d. samples, 0 < ε < 1: for any learned hypothesis h:

  P( errortrue(h) − errortrain(h) > ε ) ≤ |H| e^(−2mε²)
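Setting the right-hand side to δ and solving for ε gives, with probability at least 1 − δ: errortrue(h) ≤ errortrain(h) + sqrt((ln |H| + ln(1/δ)) / (2m)). A small sketch (names are ours):

```python
from math import log, sqrt

def true_error_bound(train_error, H_size, m, delta):
    """error_true <= error_train + sqrt((ln|H| + ln(1/delta)) / (2m)),
    with probability >= 1 - delta, simultaneously for all h in H."""
    return train_error + sqrt((log(H_size) + log(1 / delta)) / (2 * m))

print(true_error_bound(0.05, H_size=2 ** 20, m=10000, delta=0.01))
```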

SLIDE 39

  • PAC bound and Bias-Variance tradeoff

Important: the PAC bound holds for all h, but it doesn’t guarantee that the algorithm finds the best h!!!

Or, after moving some terms around, with probability at least 1 − δ:

  errortrue(h) ≤ errortrain(h) + sqrt( (ln |H| + ln(1/δ)) / (2m) )

SLIDE 40

  • What about the size of the hypothesis space?

How large is the hypothesis space?

SLIDE 41

  • Boolean formulas with n binary features: |H| = 2^(2^n) (one hypothesis for each possible truth table over the 2^n inputs)
SLIDE 42

  • Number of decision trees of depth k

Recursive solution: given n attributes, let Hk = number of decision trees of depth k.

  - H0 = 2
  - Hk+1 = (#choices of root attribute) × (#possible left subtrees) × (#possible right subtrees) = n · Hk · Hk

Write Lk = log2 Hk:

  - L0 = 1
  - Lk+1 = log2 n + 2 Lk

So Lk = (2^k − 1)(1 + log2 n) + 1.
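A quick check that the closed form matches the recursion (a sketch; names are ours):

```python
from math import log2

def L_recursive(k, n):
    """L_0 = 1, L_{k+1} = log2 n + 2 L_k."""
    L = 1.0
    for _ in range(k):
        L = log2(n) + 2 * L
    return L

def L_closed(k, n):
    """Closed form L_k = (2^k - 1)(1 + log2 n) + 1."""
    return (2 ** k - 1) * (1 + log2(n)) + 1

for k in range(6):
    assert abs(L_recursive(k, 10) - L_closed(k, 10)) < 1e-9
```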

SLIDE 43

  • PAC bound for decision trees of depth k

Plugging ln |H| = Lk · ln 2 into m ≥ (ln |H| + ln(1/δ)) / ε:

Bad!!! The number of points needed is exponential in the depth!

But, for m data points, the decision tree can’t get too big…
The number of leaves is never more than the number of data points.

SLIDE 44

  • Number of decision trees with k leaves

Hk = number of decision trees with k leaves; H0 = 2

Loose bound (one way to get one): a tree with k leaves has k − 1 internal nodes, so Hk ≤ (#binary tree shapes with k leaves) × n^(k−1) × 2^k ≤ 4^(k−1) n^(k−1) 2^k.

Reminder: m ≥ (ln |H| + ln(1/δ)) / ε

SLIDE 45

  • PAC bound for decision trees with k leaves – Bias-Variance revisited

SLIDE 46

  • What did we learn from decision trees?

Bias-Variance tradeoff formalized. Moral of the story:

  - Complexity of learning is measured not in terms of the size of the hypothesis space, but in the maximum number of points that allows consistent classification
  - Complexity m: no bias, lots of variance
  - Lower than m: some bias, less variance

SLIDE 47

  • What about continuous hypothesis spaces?

Continuous hypothesis space:

  - |H| = ∞
  - Infinite variance???

As with decision trees, we only care about the maximum number of points that can be classified exactly!

SLIDE 48

  • How many points can a linear boundary classify exactly? (1-D) (answer: 2 points)

SLIDE 49

  • How many points can a linear boundary classify exactly? (2-D) (answer: 3 points)

SLIDE 50

  • How many points can a linear boundary classify exactly? (d-D) (answer: d + 1 points)

SLIDE 51

  • PAC bound using VC dimension

The number of training points that can be classified exactly is the VC dimension!!!

It measures the relevant size of the hypothesis space, as with decision trees with k leaves.

SLIDE 52

  • Shattering a set of points: H shatters a set of points if, for every possible labeling of the points, some h ∈ H classifies all of them correctly
SLIDE 53

  • VC dimension: the VC dimension of H is the size of the largest set of points that H can shatter
SLIDE 54

  • Examples of VC dimension

  - Linear classifiers: VC(H) = d + 1, for d features plus constant term b
  - Neural networks: VC(H) = #parameters; local minima mean NNs will probably not find the best parameters
  - 1-Nearest neighbor? (it can shatter any set of distinct points, so its VC dimension is infinite)

SLIDE 55

  • PAC bound for SVMs

SVMs use a linear classifier. For d features, VC(H) = d + 1.

SLIDE 56

  • VC dimension and SVMs: Problems!!!

What about kernels?

  - Polynomials: the number of features grows really fast ⇒ bad bound (n – input features, p – degree of polynomial)
  - Gaussian kernels can classify any set of points exactly

And the bound doesn’t take the margin into account.

SLIDE 57

  • Margin-based VC dimension

H: class of linear classifiers w·Φ(x) (with b = 0)

Canonical form: minj |w·Φ(xj)| = 1

VC(H) ≤ R² (w·w)

  - Doesn’t depend on the number of features!!!
  - R² = maxj Φ(xj)·Φ(xj) – the magnitude of the data
  - R² is bounded even for Gaussian kernels ⇒ bounded VC dimension

Large margin ⇒ low w·w ⇒ low VC dimension. Very cool!
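A sketch of this capacity estimate for a learned w in an explicit feature space (names are ours; with kernels one would compute R² and w·w through the kernel matrix instead):

```python
import numpy as np

def margin_vc_estimate(Phi_X, w):
    """Margin-based capacity estimate VC(H) <= R^2 (w.w).
    Phi_X: rows are the feature vectors Phi(x_j); w: learned weight vector."""
    R2 = np.max(np.sum(Phi_X ** 2, axis=1))   # R^2 = max_j Phi(x_j).Phi(x_j)
    return R2 * np.dot(w, w)                  # small w.w (large margin) => small bound
```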

SLIDE 58

  • Applying margin VC to SVMs?

VC(H) ≤ R² (w·w), where R² = maxj Φ(xj)·Φ(xj) is the magnitude of the data and doesn’t depend on the choice of w.

SVMs minimize w·w. So do SVMs minimize the VC dimension to get the best bound? Not quite right:

  - The bound assumes the VC dimension is chosen before looking at the data
  - It would require a union bound over an infinite number of possible VC dimensions…

But, it can be fixed!

SLIDE 59

  • Structural risk minimization theorem

For a family of hyperplanes with margin γ > 0:  w·w ≤ 1/γ²

SVMs maximize margin γ + hinge loss

Optimize the tradeoff: training error (bias) versus margin γ (variance)

SLIDE 60

  • Reality check – Bounds are loose

The bound can be very loose. Why should you care?

  - There are tighter, albeit more complicated, bounds
  - Bounds give us formal guarantees that empirical studies can’t provide
  - Bounds give us intuition about the complexity of problems and the convergence rate of algorithms

[Plot: ε versus m (in units of 10⁵) for d = 2, 20, 200, 2000]

SLIDE 61

  • What you need to know

  - Finite hypothesis spaces:
    - how the results are derived
    - counting the number of hypotheses
    - mistakes on training data
  - Complexity of the classifier depends on the number of points that can be classified exactly:
    - finite case – decision trees
    - infinite case – VC dimension
  - Bias-Variance tradeoff in learning theory
  - Margin-based bound for SVMs
  - Remember: will your algorithm find the best classifier?