SLIDE 1


Max-Margin Classifier

Oliver Schulte, CMPT 726
Bishop PRML Ch. 7

SLIDE 2

Outline

  • Maximum Margin Criterion
  • Math: Maximizing the Margin
  • Non-Separable Data

SLIDE 3

Kernels and Non-linear Mappings

  • Where does the maximization problem come from?
  • The intuition comes from the primal version, which is based on a feature mapping φ.
  • Theorem: Every valid kernel k(x, y) is the dot product φ(x)ᵀφ(y) for some set of basis functions (feature mapping) φ.
  • The feature space φ(x) could be high-dimensional, even infinite.
  • This is good because if the data aren’t separable in the original input space (x), they may be separable in the feature space φ(x) (see the sketch below).
  • We can think about how to find a good linear separator using the dot product in high dimensions, then transfer this back to kernels in the original input space.
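
As a concrete illustration of the separability point (my own toy example, not from the slides): data that are not linearly separable in a 1-D input space become separable under the assumed mapping φ(x) = (x, x²).

```python
import numpy as np

# 1-D data: class +1 if |x| > 1, class -1 otherwise. Not linearly
# separable on the real line (positives lie on both sides of the negatives).
x = np.array([-2.0, -1.5, -0.5, 0.0, 0.5, 1.5, 2.0])
t = np.where(np.abs(x) > 1, 1, -1)

# Feature mapping phi(x) = (x, x^2): in this 2-D feature space the
# separator w = (0, 1), b = -1 (i.e. x^2 = 1) classifies perfectly.
phi = np.column_stack([x, x ** 2])
w, b = np.array([0.0, 1.0]), -1.0
print(np.all(np.sign(phi @ w + b) == t))  # True
```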


SLIDE 7

Why Kernels?

  • If we can use dot products with features, why bother with kernels?
  • It is often easier to specify how similar two things are (a dot product) than to construct an explicit feature space φ.
  • E.g. graphs, sets, strings (NIPS 2009 best student paper award).
  • There are high-dimensional (even infinite-dimensional) spaces that have efficient-to-compute kernels.

SLIDE 8

Kernel Trick

  • In previous lectures on linear models, we would explicitly compute φ(xi) for each datapoint, then run the algorithm in feature space.
  • For some feature spaces, the dot product φ(xi)ᵀφ(xj) can be computed efficiently.
  • The efficient method is computation of a kernel function k(xi, xj) = φ(xi)ᵀφ(xj).
  • The kernel trick is to rewrite an algorithm so that x enters only in the form of dot products.
  • The menu:
  • Kernel trick examples
  • Kernel functions

SLIDE 12

A Kernel Trick

  • Let’s look at the nearest-neighbour classification algorithm.
  • For input point xi, find the point xj with smallest distance:

    ||xi − xj||² = (xi − xj)ᵀ(xi − xj) = xiᵀxi − 2xiᵀxj + xjᵀxj

  • If we used a non-linear feature space φ(·):

    ||φ(xi) − φ(xj)||² = φ(xi)ᵀφ(xi) − 2φ(xi)ᵀφ(xj) + φ(xj)ᵀφ(xj)
                       = k(xi, xi) − 2k(xi, xj) + k(xj, xj)

  • So nearest-neighbour can be done in a high-dimensional feature space without actually moving to it (sketch below).
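
A minimal sketch of this kernelized nearest-neighbour computation (my own illustration; the quadratic kernel k(x, z) = (1 + xᵀz)² from the next slide is assumed):

```python
import numpy as np

def k(x, z):
    # Assumed example kernel: k(x, z) = (1 + x^T z)^2.
    return (1.0 + x @ z) ** 2

def feature_distance_sq(xi, xj):
    # ||phi(xi) - phi(xj)||^2 from kernel evaluations only:
    # k(xi, xi) - 2 k(xi, xj) + k(xj, xj).
    return k(xi, xi) - 2.0 * k(xi, xj) + k(xj, xj)

def nearest_neighbour(x_query, X_train, t_train):
    # 1-NN in feature space, without ever computing phi explicitly.
    d = [feature_distance_sq(x_query, xn) for xn in X_train]
    return t_train[int(np.argmin(d))]

X_train = np.array([[0.0, 1.0], [2.0, 2.0], [-1.0, 0.5]])
t_train = np.array([1, -1, 1])
print(nearest_neighbour(np.array([0.2, 0.9]), X_train, t_train))  # 1
```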


SLIDE 15

Example: The Quadratic Kernel Function

  • Consider again the kernel function k(x, z) = (1 + xᵀz)².
  • With x, z ∈ R²:

$$
\begin{aligned}
k(x, z) &= (1 + x_1 z_1 + x_2 z_2)^2 = 1 + 2x_1 z_1 + 2x_2 z_2 + x_1^2 z_1^2 + 2x_1 z_1 x_2 z_2 + x_2^2 z_2^2 \\
&= (1, \sqrt{2}\,x_1, \sqrt{2}\,x_2, x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)\,(1, \sqrt{2}\,z_1, \sqrt{2}\,z_2, z_1^2, \sqrt{2}\,z_1 z_2, z_2^2)^T \\
&= \phi(x)^T \phi(z)
\end{aligned}
$$

  • So this particular kernel function does correspond to a dot product in a feature space (it is valid); see the numerical check below.
  • Computing k(x, z) is faster than explicitly computing φ(x)ᵀφ(z).
  • In higher dimensions, with a larger exponent, it is much faster.
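
A quick numerical check of this identity (a sketch; φ is written out explicitly only to verify the claim):

```python
import numpy as np

def k(x, z):
    # Quadratic kernel on R^2.
    return (1.0 + x @ z) ** 2

def phi(x):
    # The explicit 6-D feature map corresponding to the quadratic kernel.
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

x, z = np.array([0.3, -1.2]), np.array([2.0, 0.5])
print(np.isclose(k(x, z), phi(x) @ phi(z)))  # True
```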

SLIDE 18

Regression Kernelized

  • Many classifiers can be written as using only dot products.
  • Kernelization = replace dot products by a kernel.
  • E.g., the kernel solution for regularized least squares regression is y(x) = k(x)ᵀ(K + λI_N)⁻¹t, vs. y(x) = φ(x)ᵀ(ΦᵀΦ + λI_M)⁻¹Φᵀt for the original version (sketch below).
  • N is the number of datapoints (size of the Gram matrix K).
  • M is the number of basis functions (size of the matrix ΦᵀΦ).
  • Bad if N > M, but good otherwise.
  • k(x) = (k(x, x₁), . . . , k(x, x_N))ᵀ is the vector of kernel values over the data points xₙ.
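
A minimal sketch of this kernelized regression (kernel ridge regression), assuming the quadratic kernel from the earlier slides as an example:

```python
import numpy as np

def k(x, z):
    # Assumed example kernel.
    return (1.0 + x @ z) ** 2

# Toy 1-D training data (rows are datapoints).
X = np.array([[0.0], [1.0], [2.0], [3.0]])
t = np.array([1.0, 2.0, 0.5, 3.0])
lam = 0.1  # regularization constant lambda

# Gram matrix K (N x N) and the dual coefficients (K + lam I_N)^-1 t.
N = len(X)
K = np.array([[k(X[n], X[m]) for m in range(N)] for n in range(N)])
alpha = np.linalg.solve(K + lam * np.eye(N), t)

def predict(x_new):
    # y(x) = k(x)^T (K + lam I_N)^-1 t, where k(x)_n = k(x, x_n).
    kx = np.array([k(x_new, X[n]) for n in range(N)])
    return kx @ alpha

print(predict(np.array([1.5])))
```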

SLIDE 19

Outline

  • Maximum Margin Criterion
  • Math: Maximizing the Margin
  • Non-Separable Data

SLIDE 20

Linear Classification

  • Consider a two-class classification problem.
  • Use a linear model y(x) = wᵀφ(x) + b, followed by a threshold function.
  • For now, let’s assume the training data are linearly separable.
  • Recall that the perceptron would converge to a perfect classifier for such data.
  • But there are many such perfect classifiers.

SLIDE 21

Max Margin

[Figure: decision boundary y = 0 with margin boundaries y = 1 and y = −1]

  • We can define the margin of a classifier as the minimum distance to any example.
  • In support vector machines, the decision boundary which maximizes the margin is chosen.
  • Intuitively, this is the line “right in the middle” between the two classes.

SLIDE 22

Marginal Geometry

[Figure from Ch. 4: geometry of the linear discriminant, showing w, the decision surface y = 0 between regions R1 (y > 0) and R2 (y < 0), the projection x⊥, and the distances y(x)/||w|| and −w₀/||w||]

  • Recall from Ch. 4: y(x) = wᵀx + b.
  • (Using the Ch. 4 notation x for the input rather than φ(x).)
  • wᵀx/||w|| − (−b/||w||) = y(x)/||w|| is the signed distance to the decision boundary (numeric check below).
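
A small numeric check of the signed-distance formula (assumed example values for w and b):

```python
import numpy as np

w = np.array([3.0, 4.0])  # assumed example weights, ||w|| = 5
b = -5.0

def signed_distance(x):
    # y(x)/||w|| = (w^T x + b)/||w||: positive on the y > 0 side,
    # negative on the y < 0 side, zero on the decision boundary.
    return (w @ x + b) / np.linalg.norm(w)

print(signed_distance(np.array([1.0, 0.5])))  # 0.0: on the boundary
print(signed_distance(np.array([2.0, 1.0])))  # 1.0: one unit into y > 0
```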

SLIDE 23

Support Vectors

[Figure: maximum-margin decision boundary y = 0 with support vectors on y = 1 and y = −1]

  • Assuming the data are separated by the hyperplane, the distance to the decision boundary is tₙy(xₙ)/||w||.
  • The maximum margin criterion chooses w, b by:

$$\arg\max_{w,b}\ \left\{ \frac{1}{\|w\|} \min_n \left[ t_n (w^T \phi(x_n) + b) \right] \right\}$$

  • Points with this minimum value are known as support vectors.

SLIDE 24

Canonical Representation

  • This optimization problem is complex:

$$\arg\max_{w,b}\ \left\{ \frac{1}{\|w\|} \min_n \left[ t_n (w^T \phi(x_n) + b) \right] \right\}$$

  • Note that rescaling w → κw and b → κb does not change the distance tₙy(xₙ)/||w|| (many equivalent answers).
  • So for the point x* closest to the surface, we can use w → w/y(x*) and b → b/y(x*) so that: t*(wᵀφ(x*) + b) = 1.
  • All other points are at least this far away: ∀n, tₙ(wᵀφ(xₙ) + b) ≥ 1.
  • Under these constraints, the optimization becomes:

$$\arg\max_{w,b}\ \frac{1}{\|w\|} = \arg\min_{w,b}\ \frac{1}{2}\|w\|^2$$


SLIDE 28

Canonical Representation

  • So the optimization problem is now a constrained optimization problem:

$$\arg\min_{w,b}\ \frac{1}{2}\|w\|^2 \quad \text{s.t.}\quad \forall n,\ t_n (w^T \phi(x_n) + b) \ge 1$$

  • To solve this, we need to take a detour into Lagrange multipliers.

SLIDE 29

Outline

  • Maximum Margin Criterion
  • Math: Maximizing the Margin
  • Non-Separable Data

SLIDE 30

Lagrange Multipliers

[Figure: constraint surface g(x) = 0, with ∇f(x) and ∇g(x) at a point x_A]

Consider the problem:

$$\max_x\ f(x) \quad \text{s.t.}\quad g(x) = 0$$

  • Points on g(x) = 0 must have ∇g(x) normal to the surface.
  • A stationary point must have no change in f in the direction of the constraint surface, so ∇f(x) must also be normal to the surface.
  • So there must be some λ ≠ 0 such that ∇f(x) + λ∇g(x) = 0.
  • Define the Lagrangian: L(x, λ) = f(x) + λg(x).
  • Stationary points of L(x, λ) have ∇ₓL(x, λ) = ∇f(x) + λ∇g(x) = 0 and ∇_λL(x, λ) = g(x) = 0.


SLIDE 34

Lagrange Multipliers Example

[Figure: constraint line g(x₁, x₂) = 0 and the stationary point (x₁*, x₂*)]

  • Consider the problem:

$$\max_x\ f(x_1, x_2) = 1 - x_1^2 - x_2^2 \quad \text{s.t.}\quad g(x_1, x_2) = x_1 + x_2 - 1 = 0$$

  • Lagrangian:

$$L(x, \lambda) = 1 - x_1^2 - x_2^2 + \lambda (x_1 + x_2 - 1)$$

  • Stationary points require:

$$\partial L / \partial x_1 = -2x_1 + \lambda = 0,\quad \partial L / \partial x_2 = -2x_2 + \lambda = 0,\quad \partial L / \partial \lambda = x_1 + x_2 - 1 = 0$$

  • So the stationary point is (x₁*, x₂*) = (1/2, 1/2), with λ = 1 (verified symbolically below).
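
The stationary point can be checked mechanically, e.g. with sympy (a sketch):

```python
import sympy as sp

x1, x2, lam = sp.symbols('x1 x2 lam')
f = 1 - x1**2 - x2**2
g = x1 + x2 - 1
L = f + lam * g  # Lagrangian L(x, lam) = f(x) + lam*g(x)

# Stationarity in x1 and x2, plus the constraint g = 0.
sol = sp.solve([sp.diff(L, x1), sp.diff(L, x2), g], [x1, x2, lam])
print(sol)  # {x1: 1/2, x2: 1/2, lam: 1}
```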

SLIDE 35

Lagrange Multipliers - Inequality Constraints

[Figure: region g(x) > 0 with boundary g(x) = 0; stationary points x_A in the interior and x_B on the boundary]

Consider the problem:

$$\max_x\ f(x) \quad \text{s.t.}\quad g(x) \ge 0$$

  • Optimization over a region: solutions are either at stationary points (gradient 0) in the region g(x) > 0, or on the boundary. As before, L(x, λ) = f(x) + λg(x).
  • Solutions are at stationary points of the Lagrangian with either:
  • ∇f(x) = 0 and λ = 0 (in the region), or
  • ∇f(x) = −λ∇g(x) and λ > 0 (on the boundary; > 0 for maximizing f).
  • For both, λg(x) = 0.
  • Find stationary points of L s.t. g(x) ≥ 0, λ ≥ 0, λg(x) = 0 (illustrated in the sketch below).
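
As an illustration of these conditions (my own toy example): the objective from the previous slide, now with the inequality constraint x₁ + x₂ − 1 ≥ 0. The unconstrained maximizer (0, 0) violates the constraint, so the solution must lie on the boundary with λ > 0:

```python
import numpy as np
from scipy.optimize import minimize

# Maximize f(x) = 1 - x1^2 - x2^2 subject to g(x) = x1 + x2 - 1 >= 0.
# scipy minimizes, so we pass -f.
f = lambda x: -(1 - x[0]**2 - x[1]**2)
g = lambda x: x[0] + x[1] - 1

res = minimize(f, x0=np.array([1.0, 0.0]), method='SLSQP',
               constraints=[{'type': 'ineq', 'fun': g}])

# The solution sits on the boundary g(x) = 0, consistent with lambda > 0
# and the complementarity condition lambda * g(x) = 0.
print(res.x, g(res.x))  # ~[0.5, 0.5], ~0.0
```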

SLIDE 37

Outline

  • Maximum Margin Criterion
  • Math: Maximizing the Margin
  • Non-Separable Data

SLIDE 38

Now Where Were We

  • So the optimization problem is now a constrained optimization problem:

$$\arg\min_{w,b}\ \frac{1}{2}\|w\|^2 \quad \text{s.t.}\quad \forall n,\ t_n (w^T \phi(x_n) + b) \ge 1$$

  • For this problem, the Lagrangian (with N multipliers aₙ) is:

$$L(w, b, a) = \frac{1}{2}\|w\|^2 - \sum_{n=1}^{N} a_n \left[ t_n (w^T \phi(x_n) + b) - 1 \right]$$

  • We can find the derivatives of L w.r.t. w, b and set them to 0:

$$w = \sum_{n=1}^{N} a_n t_n \phi(x_n), \qquad 0 = \sum_{n=1}^{N} a_n t_n$$

SLIDE 39

Dual Formulation

  • Plugging those equations into L removes w and b, resulting in a version of L where ∇_{w,b}L = 0:

$$\tilde{L}(a) = \sum_{n=1}^{N} a_n - \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} a_n a_m t_n t_m\, \phi(x_n)^T \phi(x_m)$$

  • This new L̃ is the dual representation of the problem (maximize with constraints); a numeric sketch follows below.
  • There is another formula for finding b, like the bias in linear regression.
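
A sketch of solving this dual numerically on a toy dataset with a general-purpose optimizer (only to make the formulas concrete; real SVM implementations use dedicated QP solvers such as SMO):

```python
import numpy as np
from scipy.optimize import minimize

# Tiny separable 2-D dataset, labels t in {+1, -1}.
X = np.array([[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0], [-2.0, -1.5]])
t = np.array([1.0, 1.0, -1.0, -1.0])
K = X @ X.T  # linear kernel: phi(x) = x, so phi(xn)^T phi(xm) = xn^T xm

def neg_dual(a):
    # Negative of L~(a) = sum_n a_n - 1/2 sum_{n,m} a_n a_m t_n t_m K_nm.
    return -(a.sum() - 0.5 * (a * t) @ K @ (a * t))

res = minimize(neg_dual, x0=np.zeros(len(t)), method='SLSQP',
               bounds=[(0, None)] * len(t),
               constraints=[{'type': 'eq', 'fun': lambda a: a @ t}])
a = res.x

# Recover w from the stationarity condition w = sum_n a_n t_n phi(x_n),
# and b from a support vector (a_n > 0): t_n (w^T x_n + b) = 1.
w = (a * t) @ X
sv = np.argmax(a)
b = t[sv] - w @ X[sv]
print(np.sign(X @ w + b))  # matches t on this toy data
```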

SLIDE 40

Outline

  • Maximum Margin Criterion
  • Math: Maximizing the Margin
  • Non-Separable Data

SLIDE 41

Non-Separable Data

[Figure: soft-margin decision boundary with slack: ξ = 0 on or beyond the margin, 0 < ξ < 1 inside the margin, ξ > 1 misclassified]

  • For most problems, the data will not be linearly separable (even in feature space φ).
  • We can relax the constraints from tₙy(xₙ) ≥ 1 to tₙy(xₙ) ≥ 1 − ξₙ.
  • The ξₙ ≥ 0 are called slack variables.
  • ξₙ = 0: satisfies the original constraint, so xₙ is on the margin or the correct side of the margin.
  • 0 < ξₙ < 1: inside the margin, but still correctly classified.
  • ξₙ > 1: misclassified.

SLIDE 42

Loss Function For Non-separable Data

[Figure: soft-margin decision boundary with slack variables ξ]

  • Non-zero slack variables are bad; penalize them while maximizing the margin:

$$\min\ C \sum_{n=1}^{N} \xi_n + \frac{1}{2}\|w\|^2$$

  • The constant C > 0 controls the trade-off between a large margin and incorrect points (non-zero slack).
  • Set C using cross-validation.
  • The optimization is the same quadratic, with different constraints, and still convex.

SLIDE 43

SVM Loss Function

  • The SVM for the separable case solved the problem:

$$\arg\min_w\ \frac{1}{2}\|w\|^2 \quad \text{s.t.}\quad \forall n,\ t_n y_n \ge 1$$

  • We can write this as:

$$\arg\min_w\ \sum_{n=1}^{N} E_\infty(t_n y_n - 1) + \lambda\|w\|^2, \qquad E_\infty(z) = \begin{cases} 0 & z \ge 0 \\ \infty & \text{otherwise} \end{cases}$$

  • The non-separable case relaxes this to:

$$\arg\min_w\ \sum_{n=1}^{N} E_{SV}(t_n y_n - 1) + \lambda\|w\|^2, \qquad E_{SV}(t_n y_n - 1) = [1 - y_n t_n]_+ \text{ (hinge loss)}$$

  • [u]₊ = u if u ≥ 0, 0 otherwise (sketch below).
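
A sketch of minimizing this hinge-loss objective by subgradient descent on a toy dataset (an assumed illustration; practical SVM training solves the QP instead):

```python
import numpy as np

def train(X, t, lam=0.1, lr=0.01, steps=2000):
    # Subgradient descent on  sum_n [1 - t_n y_n]_+  +  lam * ||w||^2,
    # where y_n = w^T x_n + b and [u]_+ = max(u, 0) is the hinge.
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        margin = t * (X @ w + b)
        active = margin < 1  # points with non-zero hinge loss
        # Subgradient of the hinge term is -t_n x_n on active points.
        gw = -(t[active, None] * X[active]).sum(axis=0) + 2 * lam * w
        gb = -t[active].sum()
        w, b = w - lr * gw, b - lr * gb
    return w, b

X = np.array([[1.5, 1.0], [2.0, 2.0], [-1.0, -1.5], [-2.0, -1.0]])
t = np.array([1.0, 1.0, -1.0, -1.0])
w, b = train(X, t)
print(np.sign(X @ w + b))  # [ 1.  1. -1. -1.]
```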

SLIDE 46

Loss Functions

[Figure: the loss functions E(z) plotted against z]

  • Linear classifiers: compare the loss functions used for learning.
  • z = yₙtₙ; a point is misclassified iff z < 0 (with tₙ ∈ {+1, −1}).
  • Black is the misclassification (0/1) error.
  • Transformed simple linear classifier, squared error: (yₙ − tₙ)².
  • Transformed logistic regression, cross-entropy error: tₙ ln yₙ.
  • SVM, hinge loss: ξₙ = [1 − yₙtₙ]₊
  • Positive only if there is a mistake, otherwise 0.
  • Encourages sparse solutions (numeric comparison below).
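
To make the comparison concrete, a sketch evaluating each loss as a function of z = yₙtₙ (the logistic curve is written in the standard ln(1 + e⁻ᶻ) form, up to scaling):

```python
import numpy as np

z = np.linspace(-2, 2, 9)  # z = y_n * t_n

losses = {
    '0/1': (z < 0).astype(float),        # misclassification error
    'squared': (1 - z) ** 2,             # (y - t)^2 rewritten via z, since t^2 = 1
    'hinge': np.maximum(1 - z, 0.0),     # SVM: [1 - z]_+
    'logistic': np.log(1 + np.exp(-z)),  # cross-entropy, up to scaling
}
for name, loss in losses.items():
    print(f'{name:>8}:', np.round(loss, 2))
```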

SLIDE 47

Two Views of Learning as Optimization

  • The original SVM goal was of the form:
  • Find the simplest hypothesis that is consistent with the data, or
  • Maximize simplicity, given a consistency constraint.
  • This general idea appears in much scientific model building, in image processing, and in other applications.
  • Bayesian methods use a criterion of the form:
  • Find a trade-off between simplicity and data fit, or
  • Maximize a sum of the type (data fit + λ · simplicity),
  • e.g., ln P(D|M) + λ ln P(M), where the model prior P(M) is higher for simpler models.

SLIDE 48

Pros and Cons of Learning Criteria

  • The Bayesian approach has a solid probabilistic foundation in Bayes’ theorem.
  • It seems to be especially suitable for noisy data.
  • The constraint-based approach is often easy for users to understand.
  • It often leads to sparser, simpler models.
  • It is suitable for “clean” data.

SLIDE 49

Conclusion

  • Readings: Ch. 7 up to and including Ch. 7.1.2.
  • Many algorithms can be re-written with only dot products of features.
  • We’ve seen NN, perceptron, regression; also PCA, SVMs (later).
  • Maximum margin criterion for deciding on the decision boundary:
  • Linearly separable data.
  • Relax with slack variables for the non-separable case.
  • Global optimization is possible in both cases:
  • Convex problem (no local optima).
  • Descent methods converge to the global optimum.

SLIDE 50

Logistic Regression Learning: Iterative Reweighted Least Squares

  • Iterative reweighted least squares (IRLS) is a descent method.
  • As in gradient descent, start with an initial guess and improve it.
  • Gradient descent takes a step (how large?) in the gradient direction.
  • IRLS is a special case of a Newton-Raphson method.
  • Approximate the function using a second-order Taylor expansion:

$$\hat{f}(w + v) = f(w) + \nabla f(w)^T v + \frac{1}{2} v^T \nabla^2 f(w)\, v$$

  • The closed-form solution minimizing this approximation is straightforward: it is quadratic, so its derivatives are linear.
  • In IRLS this second-order Taylor expansion ends up being a weighted least-squares problem, as in the linear regression case.
  • Hence the name IRLS (sketch below).
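
A minimal sketch of IRLS for logistic regression (the update follows the standard Newton-Raphson form from Bishop Ch. 4; the toy data are my own):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def irls(Phi, t, steps=5):
    # Newton-Raphson on the cross-entropy error:
    #   w_new = w - (Phi^T R Phi)^-1 Phi^T (y - t),  R = diag(y_n (1 - y_n)).
    w = np.zeros(Phi.shape[1])
    for _ in range(steps):
        y = sigmoid(Phi @ w)
        R = np.diag(y * (1 - y))    # the "reweighting" in IRLS
        H = Phi.T @ R @ Phi         # Hessian
        grad = Phi.T @ (y - t)      # gradient
        w = w - np.linalg.solve(H, grad)
    return w

# Toy 1-D problem with a bias feature; targets t in {0, 1}.
x = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
Phi = np.column_stack([np.ones_like(x), x])
t = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])  # deliberately not separable
print(irls(Phi, t))
```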

SLIDE 55

Newton-Raphson

[Figure: one Newton step, from (x, f(x)) to (x + Δx_nt, f(x + Δx_nt)), using the second-order approximation of f]

  • Figure from Boyd and Vandenberghe, Convex Optimization.
  • An excellent reference, free for download online: http://www.stanford.edu/~boyd/cvxbook/