Machine Learning
Support Vector Machines: Training with Stochastic Gradient Descent
Outline:
- Support vector machines
- Training by maximizing margin
- The SVM objective
- Solving the SVM optimization problem
- Support vectors, duals and …
The SVM objective:
  min_w  ½ wᵀw + C Σ_i max(0, 1 − y_i wᵀx_i)

Regularization term: ½ wᵀw restricts the hypothesis space and pushes for better generalization. Other regularization terms impose other preferences.
Empirical loss: the hinge loss penalizes mistakes (and margin violations). Other loss functions impose other preferences.
C: a hyper-parameter that controls the tradeoff between a large margin and a small hinge-loss.
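As a concrete sketch, this objective can be written in a few lines of NumPy; the names here (svm_objective, X, y, C) are illustrative, not from the slides:

```python
import numpy as np

def svm_objective(w, X, y, C):
    """Regularized SVM objective: 0.5 * w.w + C * sum of hinge losses.
    X is an (n, d) example matrix, y holds labels in {-1, +1}."""
    margins = y * (X @ w)                   # y_i * w^T x_i per example
    hinge = np.maximum(0.0, 1.0 - margins)  # per-example hinge loss
    return 0.5 * w @ w + C * hinge.sum()

# Two separable points: both hinge terms are zero, leaving the regularizer.
X = np.array([[1.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, -1.0])
w = np.array([1.0, 0.0])
print(svm_objective(w, X, y, C=1.0))  # 0.5
```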
[Figure: a convex function f. For any u and v, the chord connecting (u, f(u)) and (v, f(v)) lies above the graph of f.]
Linear functions are convex, and a max of convex (in particular, linear) functions is convex. Some ways to show that a function is convex:
- the second derivative is non-negative (for one-dimensional functions)
- the Hessian is positive semi-definite (for vector functions)
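A quick numeric spot-check of the chord inequality for the hinge function max(0, 1 − s), which is a max of two linear functions and hence convex (a sketch, not a proof):

```python
import numpy as np

# Spot-check the convexity inequality
#   f(a*u + (1-a)*v) <= a*f(u) + (1-a)*f(v)
# for the hinge function f(s) = max(0, 1 - s) on random points.
def hinge(s):
    return max(0.0, 1.0 - s)

rng = np.random.default_rng(0)
for _ in range(1000):
    u, v = rng.uniform(-5, 5, size=2)
    a = rng.uniform(0, 1)
    assert hinge(a * u + (1 - a) * v) <= a * hinge(u) + (1 - a) * hinge(v) + 1e-12
print("convexity inequality holds on all sampled points")
```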
These are concave; these are neither convex nor concave. (A function f is concave if f(λu + (1 − λ)v) ≥ λf(u) + (1 − λ)f(v).)
- Very slow
- Still very slow!
[Figure: the curve J(w), with successive iterates w0, w1, w2, w3 moving downhill toward the minimum]
Intuition: the gradient is the direction of steepest increase of the function. To get to the minimum, go in the opposite direction. We are trying to minimize J(w).
Gradient descent update:
  w^{t+1} ← w^t − r ∇J(w^t)
r: called the learning rate. We are trying to minimize J(w).
This algorithm is guaranteed to converge to the minimum of J if the learning rate r_t is small enough. Why? The objective J(w) is a convex function.
Gradient descent
Stochastic Gradient descent
Many more updates than gradient descent, but each individual update is less computationally expensive.
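A sketch of the per-example version, under the same illustrative setup: the condition y_i wᵀx_i < 1 decides whether the hinge term contributes to the step, and the fixed learning rate is a simplifying choice.

```python
import numpy as np

def sgd_svm(X, y, C=1.0, r=0.01, epochs=100, seed=0):
    """SGD on the SVM objective: each step looks at one example,
    so a single update is far cheaper than a full-gradient step."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):          # shuffle each epoch
            if y[i] * (X[i] @ w) < 1.0:       # hinge loss active on example i
                g = w - C * n * y[i] * X[i]   # stochastic estimate of grad J
            else:
                g = w                         # only the regularizer contributes
            w = w - r * g
    return w

X = np.array([[2.0, 1.0], [1.0, 2.0], [-2.0, -1.0], [-1.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = sgd_svm(X, y)
print(np.sign(X @ w))
```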
[Example from Boyd]
g1 is a gradient at x1. g2 and g3 are both subgradients at x2. f is differentiable at x1; the tangent at this point is its only subgradient there. Formally, a vector g is a subgradient of f at a point x if f(z) ≥ f(x) + gᵀ(z − x) for all z.
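That inequality can be spot-checked numerically for the per-example hinge loss; returning 0 at the kink is one valid choice of subgradient, not the only one:

```python
import numpy as np

def hinge_subgradient(w, x, y):
    """One subgradient of f(w) = max(0, 1 - y * w^T x).
    Where f is differentiable this is the gradient; at the kink
    (y * w^T x = 1) we pick 0, a common convention."""
    if y * (x @ w) < 1.0:
        return -y * x
    return np.zeros_like(w)

def f(w, x, y):
    return max(0.0, 1.0 - y * (x @ w))

# Check f(z) >= f(w0) + g.(z - w0) at random points w0 and z.
rng = np.random.default_rng(1)
x, y = np.array([1.0, -2.0]), 1.0
for _ in range(1000):
    w0 = rng.uniform(-3, 3, size=2)
    z = rng.uniform(-3, 3, size=2)
    g = hinge_subgradient(w0, x, y)
    assert f(z, x, y) >= f(w0, x, y) + g @ (z - w0) - 1e-12
print("subgradient inequality verified at sampled points")
```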
r_t: learning rate, many tweaks possible
Important to shuffle the examples at the start of each epoch
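Schematically, with an illustrative decay schedule r_t = r0 / (1 + t), one common tweak among many:

```python
import numpy as np

rng = np.random.default_rng(0)
n, r0, t = 6, 0.1, 0
rates, orders = [], []
for epoch in range(3):
    order = rng.permutation(n)      # fresh shuffle at the start of each epoch
    orders.append(order.tolist())
    for i in order:
        rates.append(r0 / (1 + t))  # learning rate decays over updates
        t += 1
        # ... SGD update on example i with learning rate rates[-1] ...
print(rates[0], rates[-1])          # the rate shrinks as training proceeds
```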
Compare with the Perceptron update: if y wᵀx ≤ 0, update w ← w + r y x
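The two updates side by side, as a sketch (the function names are mine; r is the learning rate and C the tradeoff hyper-parameter). The Perceptron updates only on a sign mistake, while the SVM/SGD step updates on any margin violation and always shrinks w toward the regularizer's minimum:

```python
import numpy as np

def perceptron_update(w, x, y, r):
    if y * (x @ w) <= 0:              # update only on a sign mistake
        return w + r * y * x
    return w

def svm_sgd_update(w, x, y, r, C):
    if y * (x @ w) < 1:               # update on any margin violation
        return w - r * (w - C * y * x)
    return w - r * w                  # regularizer still shrinks w

w = np.zeros(2)
x, y = np.array([1.0, 2.0]), 1.0
print(perceptron_update(w, x, y, r=0.1))       # [0.1 0.2]
print(svm_sgd_update(w, x, y, r=0.1, C=1.0))   # [0.1 0.2] from w = 0;
# the two diverge once the margin is in (0, 1): the Perceptron stops
# updating, the SVM step keeps pushing until the margin reaches 1.
```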
- Very fast: run time does not depend on the number of examples.
- Compare with the Perceptron algorithm: the Perceptron does not maximize margin width.
- The convergence criterion is an issue: SGD can be aggressive in the beginning and reach a reasonably good solution fast, but convergence is slow if a very accurate weight vector is needed.
- Other optimizers exist, e.g. dual coordinate descent, implemented in liblinear.