Max Margin Classifier
Oliver Schulte - CMPT 726
Bishop PRML Ch. 7

Outline
- Maximum Margin Criterion
- Math
- Maximizing the Margin
- Non-Separable Data
Kernels and Non-linear Mappings
- Where does the maximization problem come from?
- The intuition comes from the primal version, which is based on a feature mapping φ.
- Theorem: every valid kernel k(x, y) is the dot product φ(x)Tφ(y) for some set of basis functions (feature mapping) φ.
- The feature space φ(x) could be high-dimensional, even infinite-dimensional.
- This is good: if the data are not separable in the original input space (x), they may be separable in the feature space φ(x).
- We can think about how to find a good linear separator using the dot product in high dimensions, then transfer this back to kernels in the original input space.
Why Kernels?
- If we can use dot products with features, why bother with kernels?
- It is often easier to specify how similar two things are (a dot product) than to construct an explicit feature space φ.
- e.g. kernels on graphs, sets, strings (NIPS 2009 best student paper award).
- There are high-dimensional (even infinite-dimensional) spaces that have efficient-to-compute kernels.
Kernel Trick
- In previous lectures on linear models, we would explicitly compute φ(xi) for each datapoint, then run the algorithm in feature space.
- For some feature spaces, the dot product φ(xi)Tφ(xj) can be computed efficiently.
- The efficient method is computation of a kernel function k(xi, xj) = φ(xi)Tφ(xj).
- The kernel trick is to rewrite an algorithm so that x enters only in the form of dot products (see the sketch after this list).
- The menu:
  - Kernel trick examples
  - Kernel functions
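As a concrete illustration (my own sketch, not from the slides): an algorithm that touches the data only through a Gram matrix can swap in any kernel without ever forming φ explicitly. The kernel choice and data below are hypothetical.

```python
import numpy as np

def quadratic_kernel(X, Z):
    """k(x, z) = (1 + x^T z)^2, computed for all pairs at once."""
    return (1.0 + X @ Z.T) ** 2

# Hypothetical data: 5 points in R^2.
X = np.random.randn(5, 2)

# The Gram matrix K[i, j] = k(x_i, x_j) is all a kernelized
# algorithm ever needs -- phi(x) is never computed explicitly.
K = quadratic_kernel(X, X)
print(K.shape)  # (5, 5)
```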
A Kernel Trick
- Let's look at the nearest-neighbour classification algorithm.
- For input point xi, find the point xj with smallest distance:
  ||xi − xj||² = (xi − xj)T(xi − xj) = xiTxi − 2xiTxj + xjTxj
- If we used a non-linear feature space φ(·):
  ||φ(xi) − φ(xj)||² = φ(xi)Tφ(xi) − 2φ(xi)Tφ(xj) + φ(xj)Tφ(xj)
                     = k(xi, xi) − 2k(xi, xj) + k(xj, xj)
- So nearest-neighbour can be done in a high-dimensional feature space without actually moving to it (see the sketch below).
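A minimal sketch of this idea (my own illustration): nearest neighbour in feature space using only Gram-matrix entries.

```python
import numpy as np

def kernel_nn(K, i):
    """Index of the nearest neighbour of point i in feature space,
    using only kernel values:
    ||phi(x_i) - phi(x_j)||^2 = K[i,i] - 2 K[i,j] + K[j,j]."""
    d2 = K[i, i] - 2.0 * K[i, :] + np.diag(K)
    d2[i] = np.inf          # exclude the point itself
    return int(np.argmin(d2))

# Usage with any precomputed Gram matrix, e.g. the quadratic kernel:
X = np.random.randn(6, 2)
K = (1.0 + X @ X.T) ** 2
print(kernel_nn(K, 0))
```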
Example: The Quadratic Kernel Function
- Consider again the kernel function k(x, z) = (1 + xTz)².
- With x, z ∈ R²:
  k(x, z) = (1 + x1z1 + x2z2)²
          = 1 + 2x1z1 + 2x2z2 + x1²z1² + 2x1z1x2z2 + x2²z2²
          = (1, √2 x1, √2 x2, x1², √2 x1x2, x2²)(1, √2 z1, √2 z2, z1², √2 z1z2, z2²)T
          = φ(x)Tφ(z)
- So this particular kernel function does correspond to a dot product in a feature space (it is valid; verified numerically below).
- Computing k(x, z) is faster than explicitly computing φ(x)Tφ(z).
- In higher dimensions, with a larger exponent, it is much faster.
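To check the algebra numerically (my own sketch), the explicit 6-dimensional feature map reproduces k(x, z) = (1 + xTz)² up to floating-point error:

```python
import numpy as np

def phi(x):
    """Explicit feature map for the quadratic kernel on R^2."""
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1**2, np.sqrt(2) * x1 * x2, x2**2])

x, z = np.random.randn(2), np.random.randn(2)
k = (1.0 + x @ z) ** 2
assert np.isclose(k, phi(x) @ phi(z))  # kernel = dot product in feature space
```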
Regression Kernelized
- Many classifiers can be written using only dot products.
- Kernelization = replace dot products by a kernel.
- E.g., the kernel solution for regularized least squares regression is
  y(x) = k(x)T(K + λIN)−1t
  versus y(x) = φ(x)T(ΦTΦ + λIM)−1ΦTt for the original version (see the sketch below).
- N is the number of datapoints (size of the Gram matrix K).
- M is the number of basis functions (size of the matrix ΦTΦ).
- Bad if N > M, but good otherwise.
- k(x) = (k(x, x1), . . . , k(x, xN)) is the vector of kernel values over the data points xn.
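A minimal numpy sketch of the kernelized prediction; the data, the Gaussian kernel choice, and λ are all hypothetical.

```python
import numpy as np

def rbf(X, Z, gamma=1.0):
    """Gaussian kernel k(x, z) = exp(-gamma ||x - z||^2)."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

X = np.random.randn(20, 2)      # N = 20 training points
t = np.sin(X[:, 0])             # hypothetical targets
lam = 0.1

K = rbf(X, X)                   # N x N Gram matrix
alpha = np.linalg.solve(K + lam * np.eye(len(X)), t)

def predict(x_new):
    # y(x) = k(x)^T (K + lambda I_N)^{-1} t, with k(x)_n = k(x, x_n)
    return rbf(x_new[None, :], X) @ alpha

print(predict(np.zeros(2)))
```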
Linear Classification
- Consider a two-class classification problem.
- Use a linear model
  y(x) = wTφ(x) + b
  followed by a threshold function.
- For now, let's assume the training data are linearly separable.
- Recall that the perceptron would converge to a perfect classifier for such data.
- But there are many such perfect classifiers.
Max Margin
[Figure: decision boundary y = 0 with margin boundaries y = ±1; the margin is the gap between them.]
- We can define the margin of a classifier as the minimum distance to any example.
- In support vector machines, the decision boundary which maximizes the margin is chosen.
- Intuitively, this is the line "right in the middle" between the two classes.
Marginal Geometry
[Figure: geometry of the decision surface y = 0 separating regions R1 (y > 0) and R2 (y < 0); x⊥ is the projection of x onto the surface, and −b/||w|| is the distance of the surface from the origin.]
- Recall from Ch. 4: y(x) = wTx + b.
- (Using the Ch. 4 notation x for the input rather than φ(x).)
- y(x)/||w|| = (wTx + b)/||w|| is the signed distance of x to the decision boundary (see the numeric check below).
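A quick numeric check of this formula (the hyperplane and points are hypothetical):

```python
import numpy as np

w, b = np.array([3.0, 4.0]), -5.0   # hypothetical hyperplane w^T x + b = 0

def signed_distance(x):
    return (w @ x + b) / np.linalg.norm(w)

print(signed_distance(np.array([3.0, 4.0])))  # positive side: (25 - 5) / 5 = 4
print(signed_distance(np.array([0.0, 0.0])))  # negative side: -5 / 5 = -1
```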
Support Vectors
[Figure: maximum-margin boundary y = 0 with margin boundaries y = ±1; the points on the margin boundaries are the support vectors.]
- Assuming the data are separated by the hyperplane, the distance to the decision boundary is tny(xn)/||w||.
- The maximum margin criterion chooses w, b by:
  arg max_{w,b} { (1/||w||) min_n [tn(wTφ(xn) + b)] }
- Points achieving this minimum value are known as support vectors.
Canonical Representation
- This optimization problem is complex:
  arg max_{w,b} { (1/||w||) min_n [tn(wTφ(xn) + b)] }
- Note that rescaling w → κw and b → κb does not change the distance tny(xn)/||w|| (many equivalent answers).
- So for the point x∗ closest to the surface, we can use w → w/y(x∗) and b → b/y(x∗) so that:
  t∗(wTφ(x∗) + b) = 1
- All other points are at least this far away:
  ∀n, tn(wTφ(xn) + b) ≥ 1
- Under these constraints, the optimization becomes:
  arg max_{w,b} 1/||w|| = arg min_{w,b} (1/2)||w||²
Canonical Representation
- So the optimization problem is now a constrained optimization problem:
  arg min_{w,b} (1/2)||w||² s.t. ∀n, tn(wTφ(xn) + b) ≥ 1
- To solve this, we need to take a detour into Lagrange multipliers.
Lagrange Multipliers
[Figure: constraint surface g(x) = 0 with gradients ∇f(x) and ∇g(x) at a point xA on the surface.]
Consider the problem: max_x f(x) s.t. g(x) = 0
- Points on g(x) = 0 must have ∇g(x) normal to the surface.
- A stationary point must have no change in f in the direction of the constraint surface, so ∇f(x) must also be normal to the surface.
- So there must be some λ ≠ 0 such that ∇f(x) + λ∇g(x) = 0.
- Define the Lagrangian:
  L(x, λ) = f(x) + λg(x)
- Stationary points of L(x, λ) have ∇xL(x, λ) = ∇f(x) + λ∇g(x) = 0 and ∇λL(x, λ) = g(x) = 0.
Lagrange Multipliers Example
[Figure: constraint line g(x1, x2) = 0 in the (x1, x2) plane, with the stationary point (x1∗, x2∗).]
- Consider the problem:
  max_x f(x1, x2) = 1 − x1² − x2² s.t. g(x1, x2) = x1 + x2 − 1 = 0
- Lagrangian:
  L(x, λ) = 1 − x1² − x2² + λ(x1 + x2 − 1)
- Stationary points require:
  ∂L/∂x1 = −2x1 + λ = 0
  ∂L/∂x2 = −2x2 + λ = 0
  ∂L/∂λ = x1 + x2 − 1 = 0
- So the stationary point is (x1∗, x2∗) = (1/2, 1/2), λ = 1 (verified in the sketch below).
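The stationary point can be verified symbolically; a short sympy sketch:

```python
import sympy as sp

x1, x2, lam = sp.symbols('x1 x2 lam')
f = 1 - x1**2 - x2**2
g = x1 + x2 - 1
L = f + lam * g

# Solve grad_x L = 0 together with the constraint g = 0.
sol = sp.solve([sp.diff(L, x1), sp.diff(L, x2), g], [x1, x2, lam])
print(sol)  # {x1: 1/2, x2: 1/2, lam: 1}
```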
Lagrange Multipliers - Inequality Constraints
[Figure: region g(x) > 0 with boundary g(x) = 0; an interior stationary point xA and a boundary solution xB, with ∇f(x) and ∇g(x) shown.]
Consider the problem: max_x f(x) s.t. g(x) ≥ 0
- This is optimization over a region: solutions lie either at stationary points (gradient 0) in the region g(x) > 0, or on the boundary.
- Define the Lagrangian as before: L(x, λ) = f(x) + λg(x).
- Solutions are at stationary points of the Lagrangian with either:
  - ∇f(x) = 0 and λ = 0 (in the region), or
  - ∇f(x) = −λ∇g(x) and λ > 0 (on the boundary; > for maximizing f).
- In both cases, λg(x) = 0.
- So: find stationary points of L s.t. g(x) ≥ 0, λ ≥ 0, λg(x) = 0 (a numerical sketch follows).
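For comparison (my own sketch, not from the slides), a numerical solver handles the same kind of inequality-constrained problem; scipy's SLSQP method takes g(x) ≥ 0 constraints directly. The objective and constraint below reuse the earlier example, flipped so that the constraint is active.

```python
import numpy as np
from scipy.optimize import minimize

# Maximize f(x) = 1 - x1^2 - x2^2 subject to g(x) = x1 + x2 - 1 >= 0.
# scipy minimizes, so we pass -f; 'ineq' constraints mean fun(x) >= 0.
res = minimize(lambda x: -(1 - x[0]**2 - x[1]**2),
               x0=np.zeros(2),
               method='SLSQP',
               constraints=[{'type': 'ineq',
                             'fun': lambda x: x[0] + x[1] - 1}])
print(res.x)  # approximately [0.5, 0.5]: the solution lies on the boundary
```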
Now Where Were We
- So the optimization problem is now a constrained optimization problem:
  arg min_{w,b} (1/2)||w||² s.t. ∀n, tn(wTφ(xn) + b) ≥ 1
- For this problem, the Lagrangian (with N multipliers an) is:
  L(w, b, a) = (1/2)||w||² − Σ_{n=1}^{N} an [tn(wTφ(xn) + b) − 1]
- We can find the derivatives of L with respect to w and b and set them to 0:
  w = Σ_{n=1}^{N} an tn φ(xn)
  0 = Σ_{n=1}^{N} an tn
Dual Formulation
- Plugging those equations into L removes w and b, resulting in a version of L where ∇w,b L = 0:
  L̃(a) = Σ_{n=1}^{N} an − (1/2) Σ_{n=1}^{N} Σ_{m=1}^{N} an am tn tm φ(xn)Tφ(xm)
- This new L̃ is the dual representation of the problem: maximize L̃(a) subject to the constraints an ≥ 0 and Σ_{n=1}^{N} an tn = 0 (a minimal numerical sketch follows).
- There is another formula for finding b, like the bias in linear regression.
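A minimal sketch of maximizing the dual numerically (my own illustration, not the standard SVM solver): for the simpler no-bias variant of the SVM the constraint Σn an tn = 0 drops out, leaving only an ≥ 0, which projected gradient ascent can handle. The data, step size, and iteration count are hypothetical.

```python
import numpy as np

def dual_ascent(X, t, kernel, steps=2000, eta=1e-3):
    """Projected gradient ascent on
    L~(a) = sum_n a_n - 1/2 sum_{n,m} a_n a_m t_n t_m k(x_n, x_m)
    for a no-bias SVM, so the only constraint is a_n >= 0."""
    K = kernel(X, X)
    Q = (t[:, None] * t[None, :]) * K   # Q[n,m] = t_n t_m k(x_n, x_m)
    a = np.zeros(len(X))
    for _ in range(steps):
        a += eta * (1.0 - Q @ a)        # gradient of L~ w.r.t. a
        a = np.maximum(a, 0.0)          # project onto a_n >= 0
    return a

# Hypothetical separable data with labels t in {+1, -1}:
X = np.vstack([np.random.randn(10, 2) + 3, np.random.randn(10, 2) - 3])
t = np.array([1.0] * 10 + [-1.0] * 10)
a = dual_ascent(X, t, lambda A, B: A @ B.T)   # linear kernel
print(np.sum(a > 1e-6))  # number of (approximate) support vectors
```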
Non-Separable Data
[Figure: decision boundary y = 0 with margins y = ±1; slack values ξ = 0 on or outside the margin, ξ < 1 inside the margin, and ξ > 1 for misclassified points.]
- For most problems, the data will not be linearly separable (even in feature space φ).
- We can relax the constraints from tny(xn) ≥ 1 to tny(xn) ≥ 1 − ξn.
- The ξn ≥ 0 are called slack variables.
- ξn = 0: satisfies the original constraint, so xn is on the margin or on the correct side of it.
- 0 < ξn < 1: inside the margin, but still correctly classified.
- ξn > 1: misclassified.
Loss Function For Non-separable Data
[Figure: same margin picture as above, with slack values ξ marked.]
- Non-zero slack variables are bad; penalize them while maximizing the margin:
  min C Σ_{n=1}^{N} ξn + (1/2)||w||²
- The constant C > 0 controls the importance of a large margin versus incorrect points (non-zero slack).
- C is set using cross-validation (see the usage sketch below).
- The optimization is the same quadratic objective with different constraints, and it is still convex.
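In practice, setting C by cross-validation looks like the following scikit-learn usage sketch (the data and candidate grid of C values are hypothetical):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

X = np.random.randn(100, 2)
t = (X[:, 0] + X[:, 1] > 0).astype(int)   # hypothetical labels

# Search over the margin/slack trade-off C by 5-fold cross-validation.
grid = GridSearchCV(SVC(kernel='rbf'), {'C': [0.1, 1, 10, 100]}, cv=5)
grid.fit(X, t)
print(grid.best_params_)
```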
SVM Loss Function
- The SVM for the separable case solved the problem:
  arg min_w (1/2)||w||² s.t. ∀n, tnyn ≥ 1
- We can write this as:
  arg min_w Σ_{n=1}^{N} E∞(tnyn − 1) + λ||w||²
  where E∞(z) = 0 if z ≥ 0, ∞ otherwise.
- The non-separable case relaxes this to:
  arg min_w Σ_{n=1}^{N} ESV(tnyn − 1) + λ||w||²
  where ESV(tnyn − 1) = [1 − yntn]+ is the hinge loss.
- [u]+ = u if u ≥ 0, 0 otherwise.
Loss Functions
[Figure: loss functions E(z) plotted against z = yntn, for z from −2 to 2.]
- For linear classifiers, compare the loss functions used for learning (computed in the sketch below).
- With z = yntn and tn ∈ {+1, −1}, an example is misclassified iff z < 0.
- Black is the misclassification error.
- Transformed simple linear classifier, squared error: (yn − tn)²
- Transformed logistic regression, cross-entropy error, which as a function of z is ln(1 + exp(−z))
- SVM, hinge loss: ξn = [1 − yntn]+
  - positive only if there is a mistake, otherwise 0
  - encourages sparse solutions
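The compared losses as functions of z = yntn, in a short numpy sketch:

```python
import numpy as np

z = np.linspace(-2, 2, 9)                # z = y_n * t_n

misclass = (z < 0).astype(float)         # 0/1 misclassification error
squared = (z - 1) ** 2                   # squared error (y_n - t_n)^2 = (z - 1)^2
logistic = np.log(1 + np.exp(-z))        # logistic / cross-entropy loss
hinge = np.maximum(0, 1 - z)             # SVM hinge loss [1 - z]_+

print(np.column_stack([z, misclass, squared, logistic, hinge]))
```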
Two Views of Learning as Optimization
- The original SVM goal was of the form:
  - Find the simplest hypothesis that is consistent with the data, or
  - Maximize simplicity, given a consistency constraint.
- This general idea appears in much scientific model building, in image processing, and in other applications.
- Bayesian methods use a criterion of the form:
  - Find a trade-off between simplicity and data fit, or
  - Maximize a sum of the form (data fit + λ · simplicity),
  - e.g., ln P(D|M) + λ ln P(M), where the model prior P(M) is higher for simpler models.
Pros and Cons of Learning Criteria
- The Bayesian approach has a solid probabilistic foundation in Bayes' theorem.
- It seems to be especially suitable for noisy data.
- The constraint-based approach is often easy for users to understand.
- It often leads to sparser, simpler models.
- It is suitable for "clean" data.
Conclusion
- Readings: Ch. 7 up to and including Ch. 7.1.2.
- Many algorithms can be rewritten with only dot products of features.
- We've seen nearest neighbour, the perceptron, and regression; also PCA and SVMs (later).
- The maximum margin criterion decides on the decision boundary for linearly separable data.
- Relax with slack variables for the non-separable case.
- Global optimization is possible in both cases:
  - convex problem (no local optima)
  - descent methods converge to the global optimum
Logistic Regression Learning: Iterative Reweighted Least Squares
- Iterative reweighted least squares (IRLS) is a descent method.
- As in gradient descent, start with an initial guess and improve it.
- Gradient descent takes a step (how large?) in the gradient direction.
- IRLS is a special case of a Newton-Raphson method.
- Approximate the function using a second-order Taylor expansion around w:
  f̂(v) = f(w) + ∇f(w)T(v − w) + (1/2)(v − w)T∇²f(w)(v − w)
- The closed-form solution minimizing this is straightforward: the approximation is quadratic, so its derivatives are linear.
- In IRLS this second-order Taylor expansion ends up being a weighted least-squares problem, as in the linear regression case (see the sketch below).
- Hence the name IRLS.
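A minimal IRLS sketch for logistic regression (my own illustration): each Newton step solves the weighted least-squares system H Δw = ∇E with H = ΦTRΦ and R = diag(yn(1 − yn)). The data and iteration count are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def irls(Phi, t, iters=10):
    """Newton-Raphson / IRLS for logistic regression, targets t in {0, 1}."""
    w = np.zeros(Phi.shape[1])
    for _ in range(iters):
        y = sigmoid(Phi @ w)
        R = y * (1.0 - y)                 # diagonal of the weight matrix
        grad = Phi.T @ (y - t)            # gradient of the cross-entropy error
        H = Phi.T @ (R[:, None] * Phi)    # Hessian: Phi^T R Phi
        w -= np.linalg.solve(H, grad)     # Newton step
    return w

# Hypothetical noisy data; design matrix with a bias column.
X = np.random.randn(50, 2)
t = (X[:, 0] + 0.3 * np.random.randn(50) > 0).astype(float)
Phi = np.hstack([np.ones((50, 1)), X])
print(irls(Phi, t))
```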
Newton-Raphson
[Figure: a function f and its second-order approximation f̂; the Newton step moves from (x, f(x)) to (x + ∆xnt, f(x + ∆xnt)).]
- Figure from Boyd and Vandenberghe, Convex Optimization.
- An excellent reference, free for download online.