SLIDE 1

Statistical Machine Learning: A Crash Course
Part II: Classification & SVMs

Stefan Roth, 11.05.2012 | Department of Computer Science | GRIS

SLIDE 2

Bayesian Decision Theory

■ Decision rule:
  • Decide C1 if p(C1|x) > p(C2|x)
  • This is equivalent to p(x|C1)p(C1) > p(x|C2)p(C2)
  • Which is equivalent to p(x|C1)/p(x|C2) > p(C2)/p(C1)
  • We do not need the normalization!

■ Bayes optimal classifier:
  • A classifier obeying this rule is called a Bayes optimal classifier.

SLIDE 3

Bayesian Decision Theory

■ Bayesian decision theory:
  • Model and estimate the class-conditional densities p(x|Ck) as well as the class priors p(Ck)
  • Compute the posterior p(Ck|x) = p(x|Ck)p(Ck) / p(x)
  • Minimize the error probability by maximizing p(Ck|x)

■ New approach:
  • Directly encode the "decision boundary"
  • Without modeling the densities directly
  • Still minimize the error probability

SLIDE 4

Discriminant Functions

■ Formulate classification using comparisons:
  • Discriminant functions: y1(x), ..., yK(x)
  • Classify x as class Ck iff yk(x) > yj(x), ∀j ≠ k
  • Example: Discriminant functions from the Bayes classifier:
    yk(x) = p(Ck|x)
    yk(x) = p(x|Ck)p(Ck)
    yk(x) = log p(x|Ck) + log p(Ck)

SLIDE 5

Discriminant Functions

■ Special case: 2 classes
  • y1(x) > y2(x)  ⇔  y1(x) − y2(x) > 0  ⇔  y(x) > 0
  • Example: Bayes classifier
    y(x) = p(C1|x) − p(C2|x)
    y(x) = log [p(x|C1)/p(x|C2)] + log [p(C1)/p(C2)]

SLIDE 6

Example: Bayes classifier

■ 2 classes, Gaussian class-conditionals:
  • (Figure: plots of p(x|C1)p(C1) and p(x|C2)p(C2), their difference p(x|C1)p(C1) − p(x|C2)p(C2), and the log-ratio log [p(x|C1)p(C1) / (p(x|C2)p(C2))], together with the resulting decision boundary; a small numeric sketch follows below.)
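
To make this concrete, here is a minimal sketch of such a Bayes classifier with two 1D Gaussian class-conditionals; the means, variances, and priors are made-up values for illustration, and SciPy is assumed to be available:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical class-conditionals p(x|C1), p(x|C2) and priors p(C1), p(C2)
mu1, sigma1, prior1 = 0.0, 1.0, 0.6
mu2, sigma2, prior2 = 2.0, 1.5, 0.4

def discriminant(x):
    # y(x) = log p(x|C1)p(C1) - log p(x|C2)p(C2); decide C1 if y(x) > 0
    return (norm.logpdf(x, mu1, sigma1) + np.log(prior1)
            - norm.logpdf(x, mu2, sigma2) - np.log(prior2))

for x in np.linspace(-3, 5, 9):
    y = discriminant(x)
    print(f"x = {x:+.1f}   y(x) = {y:+.3f}   decide {'C1' if y > 0 else 'C2'}")
```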

SLIDE 7

Linear Discriminant Functions

■ 2-class problem
■ Simplest case: linear decision boundary
  • Linear discriminant function: y(x) = wᵀx + w0
  • w is the normal vector of the boundary, w0 the offset.
  • y(x) > 0: decide class 1, otherwise class 2

SLIDE 8

Linear Discriminant Functions

■ Illustration for the 2D case: y(x) = wᵀx + w0
  • The decision boundary y = 0 separates the regions R1 (y > 0) and R2 (y < 0); w is normal to the boundary.
  • y(x)/‖w‖ is the signed distance of x to the decision boundary, and w0/‖w‖ the offset of the boundary from the origin (see the sketch below).
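
A tiny sketch of this computation (w, w0, and the query point are made-up values):

```python
import numpy as np

w = np.array([2.0, -1.0])   # hypothetical normal vector
w0 = 0.5                    # hypothetical offset
x = np.array([1.0, 3.0])    # hypothetical query point

y = w @ x + w0                          # y(x) = w^T x + w0
signed_dist = y / np.linalg.norm(w)     # signed distance to the boundary y(x) = 0

print("y(x) =", y, "| class:", 1 if y > 0 else 2, "| signed distance:", signed_dist)
```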

SLIDE 9

Linear Discriminant Functions

■ 2 basic cases (figure): linearly separable vs. not linearly separable

SLIDE 10

Multi-Class Case

■ What if we constructed a multi-class classifier from several 2-class classifiers?
  • If we base our decision rule on binary decisions, this may lead to ambiguities (regions marked "?" in the figure).
  • One-versus-all (one-versus-the-rest): one classifier per class Ck against "not Ck"
  • One-versus-one: one classifier per pair of classes

SLIDE 11

Multi-Class Case

■ Better solution (we have seen this already):
  • Use discriminant functions y1(x), ..., yK(x) to encode how strongly we believe in each class.
  • Decision rule: decide Ck iff yk(x) > yj(x), ∀j ≠ k (see the sketch below).
  • If the discriminant functions are linear, the decision regions are connected and convex.
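
A minimal sketch of this argmax decision rule over K linear discriminants (the weights and the query point are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
K, d = 3, 2                      # number of classes, input dimension
W = rng.normal(size=(K, d))      # hypothetical weight vectors, one row per class
w0 = rng.normal(size=K)          # hypothetical offsets

def classify(x):
    # Evaluate all y_k(x) = w_k^T x + w0_k and decide for the largest one
    return int(np.argmax(W @ x + w0))

print("decide class", classify(np.array([0.5, -1.0])))
```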

SLIDE 12

Discriminant Functions

■ Why might we want to use discriminant functions?
■ Example: 2 classes
  • We could easily fit the class-conditionals using Gaussians and use a Bayes classifier.
  • How about now?
  • Do these points matter for making the decision between the two classes?

SLIDE 13

Distribution-free Classifiers

■ Main idea:
  • We do not necessarily need to model all details of the class-conditional distributions to come up with a good decision boundary.
  • The class-conditionals may have many intricacies that do not matter at the end of the day.
  • If we can learn where to place the decision boundary directly, we can avoid some of the complexity.

■ Nonetheless:
  • It would be unwise to believe that such classifiers are inherently superior to probabilistic ones.

SLIDE 14

First Attempt: Least Squares

■ Try to achieve a certain value of the discriminant function:
  • Training data: X = {x1, ..., xn}, xi ∈ R^d
  • Labels: Y = {y1, ..., yn}, yi ∈ {−1, 1}
  • y(x) = +1 ⇔ x ∈ C1,  y(x) = −1 ⇔ x ∈ C2

■ Linear discriminant function:
  • Try to enforce xiᵀw + w0 = yi, ∀i = 1, ..., n
  • One linear equation for each training data point/label pair.

SLIDE 15

First Attempt: Least Squares

■ Linear equation system:
  • Define x̂i = (xiᵀ, 1)ᵀ and ŵ = (wᵀ, w0)ᵀ
  • Rewrite the equation system: x̂iᵀŵ = yi, ∀i = 1, ..., n
  • Or in matrix-vector notation: X̂ᵀŵ = y, with X̂ = [x̂1, ..., x̂n] and y = (y1, ..., yn)ᵀ

SLIDE 16

First Attempt: Least Squares

■ Overdetermined system of equations X̂ᵀŵ = y:
  • n equations, d+1 unknowns

■ Look for the least squares solution:
  • ‖X̂ᵀŵ − y‖² → min
  • (X̂ᵀŵ − y)ᵀ(X̂ᵀŵ − y) → min
  • ŵᵀX̂X̂ᵀŵ − 2yᵀX̂ᵀŵ + yᵀy → min
  • Set the derivative to zero: 2X̂X̂ᵀŵ − 2X̂y = 0
  • ŵ = (X̂X̂ᵀ)⁻¹X̂y, where (X̂X̂ᵀ)⁻¹X̂ is the pseudo-inverse (see the sketch below).
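
A minimal numpy sketch of this least-squares classifier on made-up 2D data; note that the data matrix below stacks the augmented points as rows, i.e. it plays the role of X̂ᵀ from the slides:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical 2D training data: two blobs with labels +1 / -1
X = np.vstack([rng.normal([0, 0], 0.5, (20, 2)), rng.normal([2, 2], 0.5, (20, 2))])
y = np.hstack([-np.ones(20), np.ones(20)])

X_aug = np.hstack([X, np.ones((len(X), 1))])   # append a constant 1 to absorb w0

w_hat = np.linalg.pinv(X_aug) @ y              # least-squares solution via pseudo-inverse
w, w0 = w_hat[:2], w_hat[2]

pred = np.sign(X @ w + w0)
print("training accuracy:", np.mean(pred == y))
```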

SLIDE 17

First Attempt: Least Squares

■ Problem: Least squares is very sensitive to outliers.
  • (Figure: without outliers the least-squares discriminant works; with outliers present it breaks down.)

SLIDE 18

New Strategy

■ If our classes are linearly separable, we want to make sure that we find a separating (hyper)plane.
■ First such algorithm we will see:
  • The perceptron algorithm [Rosenblatt, 1962]
  • A true classic!

SLIDE 19

Perceptron Algorithm

■ Perceptron discriminant function: y(x) = sign(wᵀx + b)
■ Algorithm (a runnable sketch follows below):
  • Start with some "weight" vector w and some bias b.
  • For all data points xi with class labels yi ∈ {−1, 1}:
    • If xi is correctly classified, i.e. y(xi) = yi, do nothing.
    • Otherwise, if yi = 1, update the parameters using: w ← w + xi, b ← b + 1
    • Otherwise, if yi = −1, update the parameters using: w ← w − xi, b ← b − 1
  • Repeat until convergence.
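
A short numpy sketch of these updates on made-up, linearly separable data; both cases are written compactly as w ← w + yi·xi, b ← b + yi:

```python
import numpy as np

rng = np.random.default_rng(2)
X = np.vstack([rng.normal([0, 0], 0.4, (20, 2)), rng.normal([3, 3], 0.4, (20, 2))])
y = np.hstack([-np.ones(20), np.ones(20)])

w, b = np.zeros(2), 0.0
for epoch in range(100):
    mistakes = 0
    for xi, yi in zip(X, y):
        if yi * (w @ xi + b) <= 0:       # xi is misclassified (or on the boundary)
            w, b = w + yi * xi, b + yi   # perceptron update
            mistakes += 1
    if mistakes == 0:                    # converged: every point correctly classified
        break

print("epochs:", epoch + 1, "| w =", w, "| b =", b)
```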

SLIDES 20-23

Perceptron Algorithm

■ Intuition: (sequence of figures showing the perceptron updates on a 2D toy problem)

SLIDE 24

Does it work?

■ We just saw that the perceptron works in this simple case, but does it work in general?
  • Notational convenience as before: wᵀx + b = ŵᵀx̂

■ Novikoff's theorem [1962]:
  Theorem 2.1. Let S = {(x1, y1), ..., (xn, yn)} be a data set containing at least one positive and one negative example. Set R ≡ maxᵢ ‖xᵢ‖. Assume that there exists a weight vector w* with ‖w*‖ = 1 such that γᵢ = yᵢ⟨w*, xᵢ⟩ ≥ γ for all examples. Then the perceptron will not make more than (R/γ)² update steps.

  • This means that if the data is linearly separable, the perceptron algorithm will find a separating hyperplane.
  • If it is not separable, the algorithm will never converge.

SLIDE 25

But is it useful?

■ How often is real data linearly separable?
■ Simple failure example: the "XOR function"
  • History: Minsky & Papert [1969] criticized the perceptron for not being able to handle this case, which halted research on this and related techniques for decades.

SLIDE 26

Other Feature Spaces

■ It took a long time until people realized that there is a simple way out.
■ Key idea: Transform the input data nonlinearly so that the problem becomes linearly separable!

SLIDE 27

Generalized Linear Discriminant Functions

■ Instead of using the linear discriminant function y(x) = wᵀx + b, we use a generalized version: y(x) = wᵀφ(x) + b
  • Here φ(x) is a nonlinear transformation.

■ All our linear algorithms (i.e. least squares, the perceptron) can be used after the transformation:
  • We can represent and find nonlinear decision boundaries.

SLIDE 28

Generalized Linear Discriminant Functions

■ Simple example: Polar coordinates
  φ([x1, x2]ᵀ) = (√(x1² + x2²), arctan(x2/x1))ᵀ
  • If one class lies within a circle and the other one outside, we can use a linear classifier on the radius from the center to perfectly classify the data.
  • We use a linear classifier in the transformed feature space, but if we look at the original space, the classifier is nonlinear!

SLIDE 29

Polynomial Transformations

■ Let us consider a more general case:
  • Suppose we want to represent decision boundaries that are polynomials of a certain degree.
  • Simplest case (other than linear): degree 2.

■ Quadratic (decision) surfaces: y(x) = xᵀAx + bᵀx + c, with x, b ∈ R^n and A ∈ R^{n×n}
  • Assume we are in 2D:
    y(x) = A11·x1² + (A12 + A21)·x1x2 + A22·x2² + b1·x1 + b2·x2 + c

SLIDE 30

Polynomial Transformations

■ How can we find a suitable transformation φ(x) that maps x into a space in which this quadratic
  y(x) = A11·x1² + (A12 + A21)·x1x2 + A22·x2² + b1·x1 + b2·x2 + c
  can be represented by a linear decision boundary?
  • That is actually quite easy: φ(x) = (x1², x1x2, x2², x1, x2)ᵀ
  • Then we can use our linear classifiers to also find quadratic decision boundaries using y(x) = wᵀφ(x) + b

SLIDE 31

Polynomial Transformations

■ Example: Circular decision boundary (see the code sketch below):
  y(x) = x1² + x2² − r²
       = 1·x1² + 0·x1x2 + 1·x2² + 0·x1 + 0·x2 − r²
       = (1, 0, 1, 0, 0)·(x1², x1x2, x2², x1, x2)ᵀ − r²
       = (1, 0, 1, 0, 0)·φ(x) − r²
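
A small sketch of this quadratic feature map and the resulting circular boundary (the radius r is a made-up value):

```python
import numpy as np

def phi(x):
    # Quadratic feature map (x1^2, x1*x2, x2^2, x1, x2)
    x1, x2 = x
    return np.array([x1**2, x1 * x2, x2**2, x1, x2])

r = 1.0                                   # hypothetical radius
w = np.array([1.0, 0.0, 1.0, 0.0, 0.0])   # weights encoding x1^2 + x2^2
b = -r**2                                 # offset encoding -r^2

for x in ([0.2, 0.3], [1.5, 0.0], [0.7, 0.7]):
    y = w @ phi(np.array(x)) + b          # linear in phi(x), circular in x
    print(x, "inside" if y < 0 else "outside")
```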

SLIDE 32

Polynomial Transformations

■ What about higher dimensions (N > 2)?
■ What about higher-order polynomials (d > 2)?
■ We can use the same idea:
  • We transform the vector x = (x1, ..., xN)ᵀ into the space of all monomials of degree d.
  • What does that mean? All possible terms of the form x_{i1} · x_{i2} · ... · x_{id}
  • Problem: There are binom(d + N − 1, d) such terms!
  • Example: d = 5, N = 256 ⇒ ≈ 10^10 dimensions in the transformed space (checked below)
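
A quick check of that count (the number of unordered monomials of degree d in N variables is the binomial coefficient named above):

```python
from math import comb

d, N = 5, 256
print(comb(d + N - 1, d))   # 9525431552, i.e. roughly 10^10 monomials
```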

SLIDE 33

Polynomial Transformations

■ This is clearly too high-dimensional to be practical.
  • We would need to have 10^10 parameters...
  • Generally, we should have many more data points than we have parameters.
  • But wanting to classify 16 x 16 images of, say, handwritten digits doesn't sound like an unreasonable task.

■ What to do?
  • Let us revisit the perceptron.

SLIDE 34

Perceptron Algorithm (recap)

■ Perceptron discriminant function: y(x) = sign(wᵀx + b)
■ Algorithm:
  • Start with some "weight" vector w and some bias b.
  • For all data points xi with class labels yi ∈ {−1, 1}:
    • If xi is correctly classified, i.e. y(xi) = yi, do nothing.
    • Otherwise, if yi = 1, update the parameters using: w ← w + xi, b ← b + 1
    • Otherwise, if yi = −1, update the parameters using: w ← w − xi, b ← b − 1
  • Repeat until convergence.

SLIDE 35

Toward a Nonlinear Perceptron

■ Let us first rewrite the update rule: w ← w + yi·xi
  • This works for both cases: yi = +1 and yi = −1

■ Multiple rounds of the perceptron simply sum up data points:
  w = Σ_{i=1}^{n} αi yi xi
  • n: number of data points
  • αi: number of times that xi was selected for an update

SLIDE 36

Toward a Nonlinear Perceptron

■ Consequence:
  • We can express w as a weighted sum of the training data points xi:
    w = Σ_{i=1}^{n} αi yi xi

■ Discriminant function:
  y(x) = sign(wᵀx + b) = sign( Σ_{i=1}^{n} αi yi xiᵀx + b )

SLIDE 37

Dual Parameterization

■ We have a new, so-called dual parameterization of our discriminant function:
  y(x) = sign( Σ_{i=1}^{n} αi yi xiᵀx + b )
  • There are only as many parameters αi as there are training examples.
  • The dimensionality of x does not matter anymore.

■ We can now formulate a dual perceptron algorithm.

SLIDE 38

Dual Perceptron Algorithm

■ Dual perceptron discriminant function:
  y(x) = sign( Σ_{i=1}^{n} αi yi xiᵀx + b )
■ Algorithm:
  • Start with αi = 0 for all i and with b = 0.
  • For all data points xi with class labels yi ∈ {−1, 1}:
    • If xi is correctly classified, i.e. y(xi) = yi, do nothing.
    • Otherwise update the parameters using: αi ← αi + 1, b ← b + yi
  • Repeat until convergence.

SLIDE 39

Dual Perceptron Algorithm

■ Back to the nonlinear case:
  y(x) = sign( Σ_{i=1}^{n} αi yi φ(xi)ᵀφ(x) + b )
  • The feasibility of this dual nonlinear perceptron depends only on whether we can compute the scalar product φ(xi)ᵀφ(x) between the transformed features efficiently (see the sketch below).
  • Later: We will see that for a large class of mappings φ(x) this scalar product can be computed efficiently without having to go to the high-dimensional feature space.
  • In fact, we can even use "infinite-dimensional features".
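
A compact sketch of the dual perceptron with a pluggable scalar product; the kernel below is just the ordinary dot product, but any k(xi, xj) = φ(xi)ᵀφ(xj) could be substituted (the data and the kernel choice are illustrative assumptions):

```python
import numpy as np

def kernel(a, b):
    # Plain dot product; replace by any valid kernel k(a, b) = phi(a)^T phi(b)
    return a @ b

rng = np.random.default_rng(3)
X = np.vstack([rng.normal([0, 0], 0.4, (15, 2)), rng.normal([2.5, 2.5], 0.4, (15, 2))])
y = np.hstack([-np.ones(15), np.ones(15)])

alpha, b = np.zeros(len(X)), 0.0

def predict(x):
    # Dual discriminant: sign( sum_i alpha_i y_i k(x_i, x) + b )
    return np.sign(sum(alpha[i] * y[i] * kernel(X[i], x) for i in range(len(X))) + b)

for _ in range(100):
    mistakes = 0
    for i in range(len(X)):
        if predict(X[i]) != y[i]:
            alpha[i] += 1          # one more update counted for this point
            b += y[i]
            mistakes += 1
    if mistakes == 0:
        break

print("updates per point:", alpha.astype(int), "| b =", b)
```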

SLIDE 40

Generalization Abilities

■ Example: Classify students who will pass the exam. What are some common challenges?
  • Problem 1: Possibly irrelevant features (figure axes: shoe size, matriculation number)
  • Problem 2: Overfitting from an overly complex model
  • We are really interested in the generalization ability and the corresponding risk.

SLIDE 41

Cross Validation

■ Split the data into training and test sets:
  • Estimate the generalization abilities from the test error.
  • Choose the model with the minimal test error (figure: training and test error as a function of the # of model parameters).
  • Extreme case: Leave-one-out (figure: runs 1-4, each holding out a different part of the data for testing and training on the rest; a k-fold sketch follows below).
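
A minimal k-fold cross-validation sketch; the nearest-class-mean classifier and the data are placeholders, and any of the classifiers from these slides could be plugged in instead:

```python
import numpy as np

def cross_validate(X, y, fit, predict, k=4):
    # Split the data into k folds; train on k-1 folds, test on the held-out fold
    folds = np.array_split(np.arange(len(X)), k)
    errors = []
    for i in range(k):
        test = folds[i]
        train = np.hstack([folds[j] for j in range(k) if j != i])
        model = fit(X[train], y[train])
        errors.append(np.mean(predict(model, X[test]) != y[test]))
    return np.mean(errors)

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(3, 1, (20, 2))])
y = np.hstack([-np.ones(20), np.ones(20)])
perm = rng.permutation(len(X))
X, y = X[perm], y[perm]

# Placeholder classifier: assign each point to the nearest class mean
fit = lambda X, y: (X[y == -1].mean(axis=0), X[y == 1].mean(axis=0))
predict = lambda m, X: np.where(np.linalg.norm(X - m[0], axis=1) <
                                np.linalg.norm(X - m[1], axis=1), -1, 1)
print("estimated test error:", cross_validate(X, y, fit, predict))
```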

SLIDE 42

Statistical Learning Theory

■ Statistical learning theory (Vapnik):
  • ... is concerned with the question of how one can control the generalization abilities (of a learning machine).
  • ... aims at a formal description of the generalization ability.
  • The goal is to develop a rigorous theory as opposed to commonly used heuristics.

■ Important note:
  • This may be a noble goal, but the theory unfortunately does not say that much about real problems...

SLIDE 43

Risk

■ Assessment of prediction performance:
  • Loss function L(y, f(x; w))
  • Empirical risk: R_emp(w) = (1/N) Σ_{i=1}^{N} L(yi, f(xi; w))
  • Example: quadratic loss function (see the snippet below):
    R_emp(w) = (1/N) Σ_{i=1}^{N} (yi − f(xi; w))²
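
As a tiny illustration, the empirical risk under the quadratic loss is just the mean of the squared residuals (the labels and predictions below are made up):

```python
import numpy as np

def empirical_risk(y_true, y_pred, loss=lambda y, f: (y - f) ** 2):
    # R_emp(w) = (1/N) * sum_i L(y_i, f(x_i; w))
    return np.mean(loss(np.asarray(y_true), np.asarray(y_pred)))

y_true = np.array([1, -1, 1, 1, -1])
y_pred = np.array([0.8, -0.9, 0.3, 1.1, -0.2])   # hypothetical outputs f(x_i; w)
print("empirical risk (quadratic loss):", empirical_risk(y_true, y_pred))
```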

SLIDE 44

Risk

■ In reality, we are instead interested in the true risk:
  R(w) = ∫ L(y, f(x; w)) p(x, y) dx dy
  • p(x, y) is the true probability density of x and y.
  • The risk is the expected error over all data sets; it is the expectation of the generalization error.
  • Problem: p(x, y) is fixed, but usually unknown.
  • We cannot compute the (actual) risk directly.

SLIDE 45

Generalization Abilities

■ Empirical vs. true risk:
  • All 3 decision boundaries in the figure have zero empirical risk. Which one is preferable? Which one generalizes best to unseen data?

SLIDE 46

Risk

■ True risk:
  • Advantage: Measure of the generalization ability
  • Disadvantage: In general, we do not know p(x, y).

■ Empirical risk:
  • Disadvantage: No direct measure of the generalization ability
  • Advantage: Does not depend on p(x, y)
  • Learning algorithms often minimize the empirical risk.

■ We are interested in the dependencies between these two risks.

SLIDE 47

Risk Bound

■ Idea:
  • Determine an upper bound on the true risk based on the empirical risk:
    R(w) ≤ R_emp(w) + ε(N, p, h)
  • N: # of training data points
  • p: probability that the bound is met
  • h: "learning power" of the learning machine (formally: VC-dimension)

SLIDE 48

"Learning Power"

■ Illustration [Florian Markowetz]: classifiers that are too simple, too complex, or a good tradeoff (figure legend: positive examples, negative examples, a new patient to classify; "?" marks uncertain predictions).

SLIDE 49

VC-Dimension

■ VC stands for Vapnik-Chervonenkis.
■ (Informal) definition of the VC-dimension:
  • The VC-dimension of a family of functions is the maximal number of data points that can be correctly classified by a function from that family, no matter which label configuration the data points have.
  • The VC-dimension is a measure of the capacity or "learning power" of a classifier.

SLIDE 50

Risk Bound

■ (Figure illustrating the behavior of the risk bound and the empirical risk R_emp(w).)

SLIDE 51

Structural Risk Minimization

■ Principle:
  • Given a family of models fi(x; wi) with non-decreasing VC-dimension h1 ≤ h2 ≤ ... ≤ hn:
  • Minimize the empirical risk for every model.
  • Choose the model that minimizes the risk bound (the right-hand side of the inequality).
  • In general, this is not the same model that minimizes the empirical risk.

■ Note:
  • This is formally justified by the bound on the true risk.
  • The result is only sensible, however, if the upper bound on the true risk is a tight bound.

SLIDE 52

Structural Risk Minimization

■ How can we implement structural risk minimization?
  R(w) ≤ R_emp(w) + ε(N, p, h)

■ Classically:
  • Keep ε(N, p, h) constant and minimize R_emp(w).
  • By keeping some model parameters fixed (e.g. # of hidden nodes etc.), ε(N, p, h) is fixed.

■ Support Vector Machines (SVM):
  • Keep R_emp(w) constant and minimize ε(N, p, h).
  • In practice: R_emp(w) = 0 with separable data.
  • ε(N, p, h) is controlled by changing the VC-dimension ("capacity control").

SLIDE 53

Support Vector Machines

■ ...or short SVMs:
  • Linear classifiers (generalized later)
  • Approximate implementation of the structural risk minimization principle.
  • If the data is linearly separable, the empirical risk of SVM classifiers will be zero, and the risk bound will be approximately minimized.
  • Because of that, SVMs have built-in "guaranteed" generalization abilities.

SLIDE 54

Support Vector Machines

■ For now: linearly separable data
  • N training data points: {xi, yi}, i = 1, ..., N, with xi ∈ R^d and yi ∈ {−1, 1}
  • Hyperplane that separates the data: y(x) = wᵀx + w0 (figure as in slide 8: boundary y = 0 between regions R1 and R2)
  • Which hyperplane shall we use?
  • How can we minimize the VC-dimension?

SLIDE 55

Support Vector Machines

■ Intuitively:
  • We should find the hyperplane with the maximum "distance" to the data: maximize the margin, i.e. the distance to the closest data point.
  • This hyperplane generalizes best to unseen data (figure: hyperplanes y = −1, y = 0, y = 1 and the margin between them).

SLIDE 56

Support Vector Machines

■ Maximizing the margin:
  • Why does that make sense?
  • Why does this minimize the VC-dimension?

■ Key result (Vapnik):
  • If the data points lie in a sphere of radius R, i.e. ‖xi‖ < R,
  • and the margin of the linear classifier in d dimensions is γ,
  • then: h ≤ min(d, 4R²/γ²)

■ Maximizing the margin lowers a bound on the VC-dimension!

SLIDE 57

Support Vector Machines

■ Want to find a hyperplane so that the data is linearly separated:
  yi(wᵀxi + b) ≥ 1, ∀i
  • Enforce yi(wᵀxi + b) = 1 for at least one data point.

■ Then we can easily express the margin:
  • The distance of a point to the hyperplane is y(xi)/‖w‖ = (wᵀxi + b)/‖w‖.
  • This means that the margin is 1/‖w‖.

SLIDE 58

Support Vector Machines

■ Illustration (figure): hyperplanes y = −1, y = 0, y = 1.
■ Support vectors:
  • All data points that lie on the margin, i.e. yi(wᵀxi + b) = 1.

SLIDE 59

Support Vector Machines

■ Maximizing the margin 1/‖w‖ is equivalent to minimizing ‖w‖².
■ Formulate as a constrained optimization problem:
  arg min_{w,b} (1/2)‖w‖²  s.t.  yi(wᵀxi + b) − 1 ≥ 0
■ Lagrangian formulation:
  • Minimize the Lagrangian (with multipliers αi ≥ 0):
    L(w, b, α) = (1/2)‖w‖² − Σ_{i=1}^{N} αi (yi(wᵀxi + b) − 1)

SLIDE 60

Support Vector Machines

■ Minimize L(w, b, α) = (1/2)‖w‖² − Σ_{i=1}^{N} αi (yi(wᵀxi + b) − 1)
■ As usual, set the gradient to zero:
  • ∂L(w, b, α)/∂w = 0  ⇒  w = Σ_{i=1}^{N} αi yi xi
  • ∂L(w, b, α)/∂b = 0  ⇒  Σ_{i=1}^{N} αi yi = 0
  • The separating hyperplane is a linear combination of the input data.
  • But what are the αi?

SLIDE 61

Sparsity

■ Important property:
  • Almost all of the αi are zero.
  • There are only a few support vectors.
  • But the hyperplane was written as w = Σ_{i=1}^{N} αi yi xi.

■ SVMs are sparse learning machines:
  • The classifier only depends on a few data points.

SLIDE 62

Dual Form

■ Let's rewrite the Lagrangian:
  L(w, b, α) = (1/2)‖w‖² − Σ_{i=1}^{N} αi (yi(wᵀxi + b) − 1)
             = (1/2)‖w‖² − Σ_{i=1}^{N} αi yi wᵀxi − b Σ_{i=1}^{N} αi yi + Σ_{i=1}^{N} αi
■ But we know that Σ_{i=1}^{N} αi yi = 0, so:
  L̂(w, α) = (1/2)‖w‖² − Σ_{i=1}^{N} αi yi wᵀxi + Σ_{i=1}^{N} αi

SLIDE 63

Dual Form

■ Use the constraint that w = Σ_{i=1}^{N} αi yi xi:
  L̂(w, α) = (1/2)‖w‖² − Σ_{i=1}^{N} αi yi wᵀxi + Σ_{i=1}^{N} αi
           = (1/2)‖w‖² − Σ_{i=1}^{N} αi yi Σ_{j=1}^{N} αj yj xjᵀxi + Σ_{i=1}^{N} αi
           = (1/2)‖w‖² − Σ_{i=1}^{N} Σ_{j=1}^{N} αi αj yi yj (xjᵀxi) + Σ_{i=1}^{N} αi

SLIDE 64

Dual Form

■ We further use the fact that:
  (1/2)‖w‖² = (1/2) wᵀw = (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} αi αj yi yj (xjᵀxi)
■ From which we obtain the so-called Wolfe dual by substitution:
  L̃(α) = Σ_{i=1}^{N} αi − (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} αi αj yi yj (xjᵀxi)
  • We now solve the original problem by maximizing this dual function.

SLIDE 65

Support Vector Machines

■ Maximize
  L̃(α) = Σ_{i=1}^{N} αi − (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} αi αj yi yj (xjᵀxi)
  • under the conditions that Σ_{i=1}^{N} αi yi = 0 and αi ≥ 0.

■ The separating hyperplane is given by the N_S support vectors (a numeric check follows below):
  w = Σ_{i=1}^{N_S} αi yi xi
  • b can be calculated as well, but we will skip that here...
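
As an illustration beyond the slides, one can solve this quadratic program with an off-the-shelf SVM solver and check the relation w = Σ αi yi xi; the sketch assumes scikit-learn is available and uses a large C to approximate the hard-margin case:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(5)
# Hypothetical linearly separable data
X = np.vstack([rng.normal([0, 0], 0.4, (25, 2)), rng.normal([3, 3], 0.4, (25, 2))])
y = np.hstack([-np.ones(25), np.ones(25)])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C approximates the hard margin

# dual_coef_ holds alpha_i * y_i for the support vectors
w_from_dual = clf.dual_coef_ @ clf.support_vectors_
print("number of support vectors:", len(clf.support_vectors_))
print("w from the dual variables:", w_from_dual.ravel())
print("w reported by the solver: ", clf.coef_.ravel())
```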

SLIDE 66

SVMs so far

■ Both the original SVM formulation (primal) as well as the derived dual formulation are quadratic programming problems.
  • They have unique solutions
  • ... which can be computed efficiently.

■ Why did we bother to go to the dual form?
  • To go beyond linear classifiers!

SLIDE 67

Nonlinear SVMs

■ Nonlinear transformation of the data: φ : R^d → H, applied to x ∈ R^d
■ Hyperplane wᵀφ(x) + b = 0 in H (linear classifier in H).
■ Nonlinear classifier in R^d.
■ This is the same trick that we applied to the perceptron.
  • What is so special here?

SLIDE 68

Nonlinear SVMs

■ Dual form:
  L̃(α) = Σ_{i=1}^{N} αi − (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} αi αj yi yj (xjᵀxi)
  • subject to Σ_{i=1}^{N} αi yi = 0 and αi ≥ 0

■ With a nonlinear transformation, we obtain:
  L̃(α) = Σ_{i=1}^{N} αi − (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} αi αj yi yj (φ(xj)ᵀφ(xi))
  • φ(xi) only appears in scalar products with another φ(xj).
  • We only need to be able to evaluate scalar products.

SLIDE 69

Nonlinear SVMs

■ But what about the discriminant function y(x) = wᵀφ(x) + b?
  • We still need to represent w. No?
  • No, we can represent w differently: w = Σ_{i=1}^{N_S} αi yi φ(xi), where N_S is the # of support vectors.
  • The nonlinear discriminant function can thus be written as:
    y(x) = Σ_{i=1}^{N_S} αi yi φ(xi)ᵀφ(x) + b
  • It can also be written with scalar products of the nonlinear features only.

SLIDE 70

Nonlinear SVMs

■ Both the dual optimization problem and the discriminant function can be written in terms of scalar products of the features.
  • We have already seen this when we talked about the dual version of the perceptron.
  • In fact, the discriminant function even has the very same functional form:
    y(x) = Σ_{i=1}^{N_S} αi yi φ(xi)ᵀφ(x) + b
  • Key difference: In an SVM the parameters αi maximize the margin of the classifier.
  • SVMs therefore have built-in generalization properties.

SLIDE 71

Kernel Trick

■ "Kernel trick": Replace every occurrence of a scalar product between features with a kernel function:
  K(xi, xj) = φ(xi)ᵀφ(xj)
  • If we can find a kernel function that is equivalent to this scalar product, we can avoid mapping into a high-dimensional space and instead compute the scalar product directly.

■ What are examples of such kernels and when do they exist?

SLIDE 72

Example: Polynomial Kernel

■ Polynomial kernel of 2nd degree: K(x, y) = (xᵀy)², with x, y ∈ R²
  • Equivalence (verified numerically below):
    K(x, y) = (xᵀy)² = x1²y1² + 2x1x2y1y2 + x2²y2²
            = φ(x)ᵀφ(y)  with  φ(x) = (x1², √2·x1x2, x2²)ᵀ
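
A quick numeric check of this equivalence on random 2D vectors (purely illustrative):

```python
import numpy as np

def phi(x):
    # Explicit feature map for the 2nd-degree polynomial kernel in 2D
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

rng = np.random.default_rng(6)
x, y = rng.normal(size=2), rng.normal(size=2)

print((x @ y) ** 2)      # kernel evaluated directly
print(phi(x) @ phi(y))   # scalar product in the transformed space (same value)
```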

SLIDE 73

Example: Polynomial Kernel

■ Polynomial kernel of 2nd degree: K(x, y) = (xᵀy)², with x, y ∈ R²
  • Another equivalent feature map:
    K(x, y) = (xᵀy)² = x1²y1² + 2x1x2y1y2 + x2²y2² = φ(x)ᵀφ(y)
    with φ(x) = (1/√2)·(x1² − x2², 2x1x2, x1² + x2²)ᵀ
  • This means that φ(x) is not unique for a given K(x, y).

SLIDE 74

Example: Polynomial Kernel

■ In general: Polynomials of degree d
  • Let C_d(x) be the transformation that maps a vector into the space of all ordered monomials of degree d.
  • We can represent all polynomials of degree d as linear functions in this transformed space.
  • Example (d = 2):
    • Ordered monomials: x1², x1x2, x2x1, x2²
    • Unordered monomials: x1², x1x2, x2²
  • The kernel K(x, y) = (xᵀy)^d lets us compute arbitrary scalar products without doing the explicit mapping:
    K(x, y) = (xᵀy)^d = C_d(x)ᵀC_d(y)

SLIDE 75

Example: Polynomial Kernel

■ Polynomials of degree d: K(x, y) = (xᵀy)^d
  • Dimensionality of the transformed space H: binom(d + N − 1, d)
  • Example: N = 16 × 16 = 256, d = 4 ⇒ dim(H) = 183,181,376
  • The classifier has VC-dimension dim(H) + 1!

SLIDE 76

SVM: Linear Case

■ (Figure.)

SLIDE 77

SVM with Kernels

■ Polynomial kernel of degree 3:
  • (Figure captions: linearly separable, classifier almost linear; not linearly separable in the original space.)

SLIDE 78

Constructing Kernels

■ So far we proceeded as follows:
  • We identify some nonlinear transformation φ(x) that we think will be useful.
  • Then we find a kernel K(xi, xj) that allows us to compute the scalar product without making the mapping explicit:
    K(xi, xj) = φ(xi)ᵀφ(xj)

■ What do kernels do?
  • They measure similarity (in a transformed space).
  • But what if we have a notion of similarity and want to encode this in a kernel function K(xi, xj) directly?

SLIDE 79

Radial Basis Function Kernel

■ Gaussian radial basis function (RBF) kernel:
  K(x, y) = exp( −‖x − y‖² / (2σ²) )
  • Measures similarities (a Gram-matrix sketch follows below).
  • Interesting property: H is infinite-dimensional.
  • Intuition: exp(x) = 1 + x/1! + x²/2! + ... + xⁿ/n! + ...
  • Since we only use the kernel function, this is no problem.
  • But: The hyperplane also has infinite VC-dimension!
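
A small sketch computing the RBF Gram matrix for a handful of made-up points (σ is an illustrative choice):

```python
import numpy as np

def rbf_gram(X, Y, sigma=1.0):
    # K[i, j] = exp(-||x_i - y_j||^2 / (2 sigma^2))
    sq_dists = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2 * sigma**2))

X = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 3.0]])
print(np.round(rbf_gram(X, X), 3))   # nearby points give values near 1, distant points near 0
```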

SLIDE 80

Radial Basis Function Kernel

■ (Figure.)

SLIDE 81

VC-Dimension for Gaussian RBF Kernel

■ Intuition:
  • If we can make the radius of the kernel arbitrarily small, then at some point every data point will have its "own" kernel, so any labeling of the data can be realized.
  • But in contrast: If we bound the radius of the RBF from below, we can limit the VC-dimension!

SLIDE 82

Kernels

■ Question: Is the Gaussian RBF kernel a valid kernel, i.e. is there a mapping {H, φ} with φ : R^d → H so that K(x, y) = φ(x)ᵀφ(y)?
  • How can we assess this more generally?

■ Mercer's Condition:
  • A function K(x, y) is a valid kernel if for every g(x) with ∫ g(x)² dx < ∞ it holds that:
    ∫∫ K(x, y) g(x) g(y) dx dy ≥ 0
  • (A finite-sample check follows below.)
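
In practice one often checks the finite-sample counterpart of this condition: the Gram matrix on any finite set of points must be positive semi-definite. A small sketch with illustrative random points:

```python
import numpy as np

def rbf(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma**2))

rng = np.random.default_rng(7)
X = rng.normal(size=(10, 3))                       # arbitrary sample points
K = np.array([[rbf(a, b) for b in X] for a in X])  # Gram matrix

print("smallest eigenvalue:", np.linalg.eigvalsh(K).min())   # >= 0 (up to rounding) for a valid kernel
```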

SLIDE 83

Kernels satisfying Mercer's condition

■ Inhomogeneous polynomial kernel: K(x, y) = (xᵀy + c)^d
  • Can also represent polynomials of degree < d.
■ Gaussian RBF kernel: K(x, y) = exp( −‖x − y‖² / (2σ²) )
■ Hyperbolic tangent kernel: K(x, y) = tanh(a·xᵀy + b)

SLIDE 84

Combining Kernels

■ It may not always be easy to check whether Mercer's condition is satisfied, but it is possible to construct new kernels out of known ones.
■ If K1(x, y) and K2(x, y) are valid kernels, then so are (see the sketch below):
  • c · K1(x, y)
  • K1(x, y) + K2(x, y)
  • K1(x, y) · K2(x, y)
  • f(x) · K1(x, y) · f(y)
  • Etc.
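
A short sketch building combined Gram matrices from two base kernels on made-up points; the combinations listed above stay positive semi-definite:

```python
import numpy as np

def linear_gram(X):
    return X @ X.T

def rbf_gram(X, sigma=1.0):
    d2 = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2 * sigma**2))

X = np.random.default_rng(8).normal(size=(6, 2))
K1, K2 = linear_gram(X), rbf_gram(X)

K_sum, K_prod, K_scaled = K1 + K2, K1 * K2, 2.0 * K1
for K in (K_sum, K_prod, K_scaled):
    print(np.linalg.eigvalsh(K).min() >= -1e-10)   # still positive semi-definite
```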

SLIDE 85

Non-separable data

■ What if our data is not linearly separable?
  • Simple solution: Transform the features into a space in which they become linearly separable, e.g. using an RBF kernel with a small kernel radius.
  • Problem: Such a classifier will have a very high VC-dimension, and thus a large capacity. This will lead to overfitting.
  • Solution: Allow data points to "violate the margin".

SLIDE 86

Support Vector Machines

■ Instead of requiring that the data is perfectly linearly separable,
  wᵀxi + b ≥ +1 for yi = +1
  wᵀxi + b ≤ −1 for yi = −1,
■ we allow for small violations ξi from perfect separation:
  wᵀxi + b ≥ +1 − ξi for yi = +1
  wᵀxi + b ≤ −1 + ξi for yi = −1
  ξi ≥ 0 ∀i

SLIDE 87

Support Vector Machines

■ More concisely, we require that:
  yi(wᵀxi + b) ≥ 1 − ξi,  ξi ≥ 0  ∀i
  • The ξi are called "slack variables".

■ Illustration (figure): hyperplanes y = −1, y = 0, y = 1; points with ξ = 0 respect the margin, points with ξ < 1 violate the margin but are still correctly classified, and points with ξ > 1 are misclassified.

SLIDE 88

Support Vector Machines

■ We have to penalize the deviations:
  arg min_{w,b} (1/2)‖w‖² + C Σ_{i=1}^{N} ξi   s.t.  yi(wᵀxi + b) − 1 + ξi ≥ 0,  ξi ≥ 0
  • Maximize the margin while minimizing the penalty for all data points that violate the margin.
  • The weight C allows us to specify a trade-off; it is typically determined through cross-validation (see the sketch below).
  • Even if the data is separable, it may be better to allow for an occasional penalty.
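
A minimal sketch of choosing C by cross-validation, using scikit-learn's soft-margin SVM as the solver (the data and the grid of C values are illustrative assumptions):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(9)
# Hypothetical overlapping classes (not linearly separable)
X = np.vstack([rng.normal([0, 0], 1.0, (50, 2)), rng.normal([2, 2], 1.0, (50, 2))])
y = np.hstack([-np.ones(50), np.ones(50)])

for C in (0.01, 0.1, 1, 10, 100):
    scores = cross_val_score(SVC(kernel="linear", C=C), X, y, cv=5)
    print(f"C = {C:>6}: cross-validated accuracy = {scores.mean():.3f}")
```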

SLIDE 89

Support Vector Machines

■ Dual formulation: Maximize
  L̃(α) = Σ_{i=1}^{N} αi − (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} αi αj yi yj (xjᵀxi)
  • under the conditions that Σ_{i=1}^{N} αi yi = 0 and 0 ≤ αi ≤ C (the "box constraint").

■ The separating hyperplane is given by the N_S support vectors:
  w = Σ_{i=1}^{N_S} αi yi xi

SLIDE 90

Application Example 1

■ Text classification: Joachims (1997)
  • Problem: Classify documents into a number of categories (topics, etc.)
  • The text is represented using word statistics, i.e. histograms of the word frequency:
    • We count how often every word occurs and ignore their order ("bag of words").
    • Very high-dimensional feature space (roughly 10,000 dimensions)
    • Very few features that are not relevant (difficult to apply feature selection or dimensionality reduction)

SLIDE 91

Application Example 1

■ (Figure.)

SLIDE 92

Application Example 2

■ Handwritten digit classification
■ U.S. Postal Service Database

SLIDE 93

Application Example 2

■ Human performance: 2.5% error
■ Various learning algorithms:
  • 16.2%: Decision tree (C4.5)
  • 5.9%: 2-layer neural network
  • 5.1%: LeNet 1 - 5-layer neural network

■ Various SVM results:
  • 4.0%: Polynomial kernel (p=3, 274 support vectors)
  • 4.1%: Gaussian kernel (σ=0.3, 291 support vectors)

SLIDE 94

Application Example 2

■ Very little overfitting
  • Good generalization.

SLIDE 95

Application Example 2

■ Even better results:
  • Supply knowledge about invariances in the data (geometric deformations, etc.).
  • 2.7% error: Elastic matching (no learning)
    • Uses knowledge of how digits can deform
    • Classifies a test digit by finding the template that requires the least deformation

■ Recent results:
  • With more training data, better modeling of invariances, etc.
  • Error down to about 0.5% with SVMs and 0.4% with neural networks.