SVMs, Duality and the Kernel Trick (cont.)
Machine Learning 10701/15781, Carlos Guestrin, 2006

Two SVM tutorials linked in class website (please, read both):
High-level presentation with applications (Hearst 1998)
Detailed tutorial (Burges 1998)
Learn one of the most interesting and exciting recent advancements in machine learning:
The "kernel trick": high-dimensional feature spaces at no extra cost!
But first, a detour
Constrained optimization!
Decision boundary: w · x + b = 0
How many monomial terms are there in a degree-d polynomial feature expansion?
m - input features, d - degree of polynomial
Number of monomial terms of degree d: (m + d - 1 choose d), roughly m^d / d! - it grows fast!
d = 6, m = 100: about 1.6 billion terms
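A quick arithmetic check of that count (a sketch; the counting formula C(m + d - 1, d) for monomials of degree exactly d is standard, not quoted from the slides):

```python
from math import comb

m, d = 100, 6
# Number of distinct monomials of degree exactly d in m variables:
# choose d factors from m variables with repetition = C(m + d - 1, d).
print(comb(m + d - 1, d))  # 1609344100, i.e. about 1.6 billion terms
```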
Never represent features explicitly: compute dot products in closed form
Constant-time high-dimensional dot products for many classes of features
Very interesting theory: Reproducing Kernel Hilbert Spaces
Not covered in detail in 10701/15781; more in 10702
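To make this concrete, here is the usual degree-2 polynomial kernel derivation (the summary slide mentions deriving the polynomial kernel; this sketch is the standard textbook version):

```latex
K(u,v) = (u \cdot v)^2
       = \Big(\sum_{i=1}^{m} u_i v_i\Big)\Big(\sum_{j=1}^{m} u_j v_j\Big)
       = \sum_{i,j=1}^{m} (u_i u_j)(v_i v_j)
       = \Phi(u) \cdot \Phi(v),
\qquad \Phi(x) = \big(x_i x_j\big)_{i,j=1}^{m}
```

One dot product in the original m-dimensional input space replaces a dot product over roughly m² monomial features.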
Common kernels:
Polynomials of degree d
Polynomials of degree up to d
Gaussian kernels
Sigmoid
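A minimal sketch of these kernels in NumPy (standard textbook forms; the parameter names sigma, eta, nu are illustrative, not from the slides):

```python
import numpy as np

def poly_kernel(u, v, d):
    """Polynomial of degree d: K(u, v) = (u . v)^d."""
    return np.dot(u, v) ** d

def poly_kernel_up_to(u, v, d):
    """Polynomial of degree up to d: K(u, v) = (u . v + 1)^d."""
    return (np.dot(u, v) + 1) ** d

def gaussian_kernel(u, v, sigma):
    """Gaussian kernel: K(u, v) = exp(-||u - v||^2 / (2 sigma^2))."""
    diff = np.asarray(u) - np.asarray(v)
    return np.exp(-np.dot(diff, diff) / (2 * sigma ** 2))

def sigmoid_kernel(u, v, eta, nu):
    """Sigmoid kernel: K(u, v) = tanh(eta * (u . v) + nu)."""
    return np.tanh(eta * np.dot(u, v) + nu)
```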
Huge feature space with kernels: what about overfitting?
Maximizing the margin leads to a sparse set of support vectors
Some interesting theory says that SVMs search for simple hypotheses with large margin
Often robust to overfitting
For a new input x, if we needed to represent Φ(x) explicitly, we could be in trouble!
Recall the classifier: sign(w·Φ(x) + b)
Using kernels we are cool! w·Φ(x) = Σi αi yi K(xi, x), so only kernel evaluations with the support vectors are needed
Choose a set of features and a kernel function
Solve the dual problem to obtain the support vector weights αi
At classification time, for a new input x, compute f(x) = Σi αi yi K(xi, x) + b
Classify as sign(f(x))
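A minimal sketch of that decision rule in Python (function and argument names are illustrative; alphas, labels, and b are assumed to come from a dual solver):

```python
import numpy as np

def svm_predict(x, support_vectors, alphas, labels, b, kernel):
    """Kernelized SVM decision rule: sign(sum_i alpha_i y_i K(x_i, x) + b)."""
    f = b + sum(a * y * kernel(xi, x)
                for a, y, xi in zip(alphas, labels, support_vectors))
    return np.sign(f)
```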
Remember kernel regression???
1. Compute a weight for each training point: wi = exp(-D(xi, query)² / Kw²)
2. How to fit with the local points? Predict the weighted average of the outputs: predict = Σi wi yi / Σi wi
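A minimal sketch of this predictor (assuming Euclidean distance for D; the bandwidth Kw and array shapes are illustrative):

```python
import numpy as np

def kernel_regression(query, X, y, Kw):
    """Predict sum_i(w_i * y_i) / sum_i(w_i) with w_i = exp(-||x_i - query||^2 / Kw^2)."""
    d2 = np.sum((X - query) ** 2, axis=1)  # squared distances D(x_i, query)^2
    w = np.exp(-d2 / Kw ** 2)              # kernel weights
    return np.dot(w, y) / np.sum(w)        # weighted average of the outputs
```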
SVMs:
Learn weights αi (and bandwidth)
Often a sparse solution
KR:
Fixed "weights", learn bandwidth
Solution may not be sparse
Much simpler to implement
Kernelized logistic regression:
Define the weights in terms of the support vectors: w = Σi αi Φ(xi)
Derive a simple gradient descent rule on the αi
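A minimal sketch of that idea (logistic loss with labels in {-1, +1}; the gradient derivation, learning rate, and iteration count are illustrative choices, not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def kernel_logistic_regression(K, y, lr=0.1, n_iters=1000):
    """Gradient descent on alpha, where w = sum_i alpha_i Phi(x_i),
    so the margin on point j is f_j = w . Phi(x_j) = (K alpha)_j.

    K : (n, n) kernel matrix with K[i, j] = k(x_i, x_j)
    y : (n,) labels in {-1, +1}
    """
    alpha = np.zeros(len(y))
    for _ in range(n_iters):
        f = K @ alpha                       # f_j = w . Phi(x_j)
        grad = K @ (-y * sigmoid(-y * f))   # gradient of sum_j log(1 + exp(-y_j f_j))
        alpha -= lr * grad
    return alpha
```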
Dual SVM formulation
How it’s derived
The kernel trick
Deriving the polynomial kernel
Common kernels
Kernelized logistic regression
Differences between SVMs and logistic regression
SVM applet:
http://www.site.uottawa.ca/~gcaron/applets.htm
More details:
General: http://www.learning-with-kernels.org/
Example of more complex bounds: http://www.research.ibm.com/people/t/tzhang/papers/jmlr02_cover.ps.gz
We have explored many ways of learning from data
But…
How good is our classifier, really? How much data do I need to make it “good enough”?
Classification:
m data points
Finite number of possible hypotheses (e.g., decision trees of bounded depth)
A learner finds a hypothesis h that is consistent with the training data
Gets zero error in training: errortrain(h) = 0
What is the probability that h has more than ε true error?
errortrue(h) ≥ ε
A hypothesis h that is consistent with the training data got all m i.i.d. points right
A bad hypothesis, with errortrue(h) ≥ ε, gets one data point right with probability at most 1 - ε, so it gets all m points right with probability at most (1 - ε)^m
There are k hypotheses consistent with the data
How likely is the learner to pick a bad one?
Union bound: P(A or B or C or D or …) ≤ P(A) + P(B) + P(C) + P(D) + …
There are k hypotheses consistent with the data
How likely is the learner to pick a bad one? Apply the union bound over the k consistent hypotheses
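Putting the pieces together (a standard chain of inequalities consistent with the theorem on the next slide; 1 - ε ≤ e^{-ε} is used in the last step):

```latex
P\big(\exists\, h \in H :\ \mathrm{error}_{\mathrm{train}}(h) = 0 \ \wedge\ \mathrm{error}_{\mathrm{true}}(h) \ge \epsilon\big)
\;\le\; k\,(1-\epsilon)^m
\;\le\; |H|\,(1-\epsilon)^m
\;\le\; |H|\, e^{-m\epsilon}
```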
Theorem: for a finite hypothesis space H, a dataset D of m i.i.d. samples, and 0 < ε < 1, the probability that some hypothesis consistent with the training data has true error greater than ε is bounded:
P(∃ h: errortrain(h) = 0, errortrue(h) > ε) ≤ |H| e^{-mε}
Typically, 2 use cases:
1: Pick ε and δ, give you m
2: Pick m and δ, give you ε
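Setting |H| e^{-mε} ≤ δ and solving (standard algebra, not quoted from the slides) gives the two forms:

```latex
m \;\ge\; \frac{\ln|H| + \ln\frac{1}{\delta}}{\epsilon}
\qquad\text{or, equivalently,}\qquad
\epsilon \;\ge\; \frac{\ln|H| + \ln\frac{1}{\delta}}{m}
```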
Theorem (restated): with probability at least 1 - δ, every hypothesis h consistent with the training data satisfies errortrue(h) ≤ (ln|H| + ln(1/δ)) / m
Even if h makes zero errors on the training data, it may make errors at test time
The bound requires a consistent classifier, and it depends on the size of the hypothesis space only through ln|H|
A learner with zero training errors may still make errors at test time
What about a learner with errortrain(h) > 0 on the training set?
The true error of a hypothesis is like the probability of heads of a biased coin; the training error is the observed fraction of heads
Chernoff bound: for m i.i.d. coin flips x1, …, xm, with xi ∈ {0,1}, P(xi = 1) = θ, and 0 < ε < 1:
P(θ - (1/m) Σi xi > ε) ≤ e^{-2mε²}
Theorem: for a finite hypothesis space H, a dataset D of m i.i.d. samples, and 0 < ε < 1, for any learned hypothesis h:
P(errortrue(h) - errortrain(h) > ε) ≤ |H| e^{-2mε²}
Important: the PAC bound holds for all h, but it doesn't guarantee that the learning algorithm finds the best h!
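Setting |H| e^{-2mε²} ≤ δ and solving for ε gives the usual additive form (standard algebra):

```latex
\mathrm{error}_{\mathrm{true}}(h) \;\le\; \mathrm{error}_{\mathrm{train}}(h) + \sqrt{\frac{\ln|H| + \ln\frac{1}{\delta}}{2m}}
\qquad \text{with probability at least } 1 - \delta
```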
How large is the hypothesis space?
Recursive solution. Given n attributes, let Hk = number of decision trees of depth k:
H0 = 2
Hk+1 = (# choices of root attribute) × (# possible left subtrees) × (# possible right subtrees) = n × Hk × Hk
Write Lk = log2 Hk:
L0 = 1
Lk+1 = log2 n + 2 Lk
So Lk = (2^k - 1)(1 + log2 n) + 1
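A quick sanity check of the closed form against the recursion (the choice n = 4 is arbitrary, for illustration only):

```python
from math import log2

n = 4  # number of attributes (illustrative)

def L_closed(k):
    """Closed form for log2(number of depth-k decision trees)."""
    return (2 ** k - 1) * (1 + log2(n)) + 1

# Verify against the recursion L_0 = 1, L_{k+1} = log2(n) + 2 * L_k.
Lk = 1.0
for k in range(8):
    assert abs(Lk - L_closed(k)) < 1e-9
    Lk = log2(n) + 2 * Lk
print("closed form matches the recursion")
```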
Bad!!!
The number of data points needed is exponential in the depth!
But, for m data points, decision tree can’t get too big…
Instead, count Hk = the number of decision trees with k leaves, with H0 = 2
Now log2 Hk grows only linearly in the number of leaves k (on the order of k log2 n), not exponentially in the depth
Bias-Variance tradeoff formalized. Moral of the story:
Complexity of the hypothesis space on the order of m: no bias, lots of variance
Complexity lower than m: some bias, less variance
Continuous hypothesis space:
|H| = ∞. Infinite variance???
As with decision trees, we only care about the maximum number of points that can be classified exactly
The VC dimension is the maximum number of training points that can be classified exactly (shattered) by H, for every possible labeling of those points
It measures the relevant size of the hypothesis space, as the number of leaves did for decision trees
Linear classifiers:
VC(H) = d+1, for d features plus constant term b
Neural networks:
VC(H) = # parameters; but local minima mean NNs will probably not find the best hypothesis
1-Nearest neighbor?
SVMs use a linear classifier
For d features, VC(H) = d+1:
What about kernels?
Polynomials: the number of features grows really fast (with n input features and a degree-p polynomial) = bad bound
Gaussian kernels can classify any set of points exactly
H: class of linear classifiers w·Φ(x) (with b = 0)
Canonical form: minj |w·Φ(xj)| = 1
VC(H) ≤ R² (w·w)
Doesn't depend on the number of features!!!
R² = maxj Φ(xj)·Φ(xj), the magnitude of the data
R² is bounded even for Gaussian kernels, so the VC dimension is bounded
Large margin means low w·w, which means low VC dimension. Very cool!
VC(H) ≤ R² (w·w)
R² = maxj Φ(xj)·Φ(xj) is the magnitude of the data; it doesn't depend on the choice of w
SVMs minimize w·w, so do SVMs minimize the VC dimension to get the best bound? Not quite right:
That would require a union bound over the infinite number of possible VC dimensions…
But, it can be fixed!
For a family of hyperplanes with margin γ > 0 and w·w ≤ 1:
SVMs maximize the margin γ plus the hinge loss
Optimize the tradeoff between training error (bias) and margin γ
The bound can be very loose; why should you care?
There are tighter, albeit more complicated, bounds
Bounds give us formal guarantees that empirical studies can't provide
Bounds give us intuition about the complexity of problems and the convergence rate of algorithms
[Plot: number of examples m (in units of 10^5) required, shown for d = 2, 20, 200, 2000]
Finite hypothesis space
Deriving the results: counting the number of hypotheses, mistakes on the training data
Complexity of the classifier depends on the number of points that can be classified exactly
Finite case: decision trees; infinite case: VC dimension
Bias-Variance tradeoff in learning theory
Margin-based bound for SVMs
Remember: will your algorithm find the best classifier?