Linear models
CMPSCI 689: Machine Learning
Subhransu Maji
24 February 2015 / 26 February 2015
Linear models
The perceptron combines the model and the learning algorithm as one. Is there a better way to learn linear models? We will separate models from learning algorithms.
Learning can be posed as an optimization problem: find the weights that make the fewest mistakes on the training data,

    min_w  Σ_n 1[y_n wᵀx_n < 0]
The perceptron algorithm will find an optimal w if the data is separable
However, if the data is not separable, optimizing this is NP-hard
In addition to minimizing training error, we want a simpler model
We can add a regularization term R(w) that prefers simpler models
Here λ is a hyperparameter of the optimization problem
    min_w  Σ_n 1[y_n wᵀx_n < 0] + λ R(w)

(the first term asks for the fewest mistakes, R(w) asks for a simpler model, and λ is the hyperparameter that trades them off)
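As a concrete illustration (not from the slides), a minimal Matlab sketch that evaluates this objective for a given weight vector; the toy data and the choice of squared-norm regularizer are assumptions (regularizers are discussed below):

% Evaluate the regularized zero/one objective for a given w (illustrative sketch).
X = randn(20, 5); y = sign(randn(20, 1));   % toy data: 20 points, 5 features, labels in {-1,+1}
w = randn(5, 1); lambda = 0.1;              % a candidate weight vector and hyperparameter
mistakes = sum(y .* (X*w) < 0);             % zero/one loss: number of misclassified points
Rw = sum(w.^2);                             % an assumed regularizer (squared norm)
objective = mistakes + lambda * Rw;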
The questions that remain are:
➡ Can we replace the zero/one loss and choose the regularizer R(w) so that there are efficient algorithms for solving it?
➡ Assuming we choose them appropriately, what algorithms exist for solving the regularized objective?
Zero/one loss is hard to optimize
Surrogate loss: replace the zero/one loss by a smooth function
Examples: hinge, logistic, exponential, and squared loss

[Figure: losses as a function of the prediction ŷ = wᵀx for a positive example (y = +1); the zero/one loss is not convex, while the surrogate losses shown are convex]
What are good regularization functions R(w) for hyperplanes? We would like the weights to be:
➡ Small: a change in the features causes only a small change to the score; robustness to noise
➡ Sparse: use as few features as possible; similar to controlling the depth of a decision tree
This is a form of inductive bias
Just like the surrogate loss function, we would like R(w) to be convex.
Small-weights regularizers:

    R_norm(w)   = sqrt( Σ_d w_d² )
    R_sqrd(w)   = Σ_d w_d²
    R_count(w)  = Σ_d 1[|w_d| > 0]        (not convex)
    R_p-norm(w) = ( Σ_d |w_d|^p )^(1/p)
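A small Matlab sketch (not from the slides) computing each of these regularizers for an example weight vector:

% Candidate regularizers evaluated on an example weight vector.
w = [0.5; -2; 0; 3];
p = 0.5;                                % an example exponent for the p-norm
R_norm  = sqrt(sum(w.^2));              % square root of the sum of squares (2-norm)
R_sqrd  = sum(w.^2);                    % sum of squares
R_count = sum(abs(w) > 0);              % number of non-zero weights (not convex)
R_pnorm = sum(abs(w).^p)^(1/p);         % general p-norm (convex only for p >= 1)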
[Figure: unit balls of the p-norm for several values of p; the p-norm is convex for p ≥ 1]
http://en.wikipedia.org/wiki/Lp_space
[Figure: unit balls of the p-norm for p = 2/3 and p = 0; the p-norm is not convex for 0 ≤ p < 1]

Counting non-zeros corresponds to the p = 0 case:

    R_count(w) = Σ_d 1[|w_d| > 0]
http://en.wikipedia.org/wiki/Lp_space
Select a suitable:
➡ surrogate loss function
➡ regularization function R(w)
➡ hyperparameter λ
Then minimize the regularized objective with respect to w:

    min_w  Σ_n ℓ(y_n, wᵀx_n) + λ R(w)

This framework for optimization is called Tikhonov regularization or, more generally, Structural Risk Minimization (SRM)
http://en.wikipedia.org/wiki/Tikhonov_regularization
Gradient descent:

    g^(k) ← ∇_p F(p) |_{p = p_k}          compute the gradient at the current location
    p_(k+1) ← p_k − η_k g^(k)             take a step down the gradient; η_k is the step size

[Figure: gradient descent iterates p_1, …, p_6 with step sizes η_1, η_2, η_3 on a convex function, where local optima = global optima, and on a non-convex function, where a local optimum need not be the global optimum]
The step size is important: too small and progress is slow, too large and the iterates overshoot the minimum.

[Figure: iterates p_1, …, p_6 with a good step size η_1 (steady progress towards the minimum) vs. a bad step size η_1 (oscillation around it)]

A strategy is to use large step sizes initially and small step sizes later:

    η_t ← η_0 / (t_0 + t)

A more advanced strategy is to adapt the step size to the curvature of the function.
http://stanford.edu/~boyd/cvxbook/
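A minimal Matlab sketch (not from the slides) of gradient descent with this decaying step-size schedule, applied to a simple convex function; the function, η_0, and t_0 are illustrative choices:

% Gradient descent with step size eta_t = eta0/(t0 + t) on F(p) = sum(p.^2).
F     = @(p) sum(p.^2);          % a simple convex function to minimize
gradF = @(p) 2*p;                % its gradient
p = [3; -2];                     % starting point
eta0 = 1; t0 = 10;               % step-size schedule parameters (illustrative)
for t = 1:100
    g   = gradF(p);              % compute the gradient at the current location
    eta = eta0 / (t0 + t);       % large steps initially, smaller steps later
    p   = p - eta * g;           % take a step down the gradient
end
disp(F(p));                      % should be close to the minimum value 0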
Example: gradient descent for the exponential loss with squared-norm regularization.

    L(w) = Σ_n exp(−y_n wᵀx_n) + (λ/2) ||w||²

    dL/dw = Σ_n −y_n x_n exp(−y_n wᵀx_n) + λ w

Gradient update:

    w ← w − η ( Σ_n −y_n x_n exp(−y_n wᵀx_n) + λ w )

Loss term: w ← w + c y_n x_n with c = η exp(−y_n wᵀx_n), which is high for misclassified points; similar to the perceptron update rule!
Regularization term: w ← (1 − ηλ) w, which shrinks the weights towards zero.
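A Matlab sketch of this update on toy data (the data, step size, and iteration count are illustrative):

% Batch gradient descent for L(w) = sum_n exp(-y_n w'x_n) + (lambda/2)||w||^2.
X = randn(50, 3); y = sign(randn(50, 1));        % toy data: 50 points, 3 features
w = zeros(3, 1); lambda = 0.1; eta = 0.01;
for iter = 1:200
    m = y .* (X*w);                              % margins y_n * w'x_n
    grad = -X' * (y .* exp(-m)) + lambda * w;    % sum_n -y_n x_n exp(-y_n w'x_n) + lambda*w
    w = w - eta * grad;                          % gradient update
end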
The objective decomposes over the training points, L(w) = Σ_n L_n(w), so the gradient dL/dw is a sum of n gradients, one per point.

    Batch gradient descent:   w ← w − η ( Σ_n dL_n/dw )    update weights after you see all points
    Online gradient descent:  w ← w − η ( dL_n/dw )        update weights after you see each point
Online gradients are the default method for multi-layer perceptrons
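A Matlab sketch (not from the slides) contrasting the two schemes on the exponential-loss objective from above; splitting the regularizer evenly across the points in the online version is one common choice, and all constants are illustrative:

% Batch vs. online gradient descent on L(w) = sum_n L_n(w),
% with L_n(w) = exp(-y_n w'x_n) + (lambda/(2N))||w||^2.
X = randn(50, 3); y = sign(randn(50, 1)); [N, D] = size(X);
lambda = 0.1; eta = 0.01;

% Batch: update the weights after you see all points.
wb = zeros(D, 1);
for iter = 1:100
    grad = -X' * (y .* exp(-y .* (X*wb))) + lambda * wb;   % sum of the N per-point gradients
    wb = wb - eta * grad;
end

% Online: update the weights after you see each point.
wo = zeros(D, 1);
for iter = 1:100
    for n = randperm(N)                                    % visit the points in random order
        gn = -y(n) * exp(-y(n) * (X(n,:)*wo)) * X(n,:)' + (lambda/N) * wo;
        wo = wo - eta * gn;                                % gradient at the nth point only
    end
end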
The hinge loss is not differentiable at z = 1:

    ℓ_hinge(y, wᵀx) = max(0, 1 − y wᵀx)

A subgradient at a point is the slope of any line through that point that stays below the function. For the hinge loss a possible subgradient is:

    dℓ_hinge/dw = 0       if y wᵀx > 1
                  −y x    otherwise

[Figure: the hinge loss as a function of z = y wᵀx, with a subgradient drawn at the kink z = 1]
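A small Matlab sketch of this subgradient for a single example (the numbers are illustrative):

% Hinge loss and one valid subgradient with respect to w for a single example (x, y).
x = [1; -2; 0.5]; y = +1; w = [0.2; 0.1; -0.3];   % illustrative values
hinge = max(0, 1 - y * (w'*x));                   % the hinge loss
if y * (w'*x) > 1
    g = zeros(size(w));                           % flat region: the gradient is zero
else
    g = -y * x;                                   % otherwise -y*x is a valid subgradient
end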
Example: subgradient descent for the hinge loss with squared-norm regularization.

    L(w) = Σ_n max(0, 1 − y_n wᵀx_n) + (λ/2) ||w||²

Subgradient:

    dL/dw = Σ_n −1[y_n wᵀx_n ≤ 1] y_n x_n + λ w

Update:

    w ← w − η ( Σ_n −1[y_n wᵀx_n ≤ 1] y_n x_n + λ w )

Loss term: w ← w + η y_n x_n whenever y_n wᵀx_n ≤ 1 (the perceptron update fires only when y_n wᵀx_n ≤ 0).
Regularization term: w ← (1 − ηλ) w, which shrinks the weights towards zero.
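A Matlab sketch of this subgradient update on toy data (step size and iteration count are illustrative):

% Batch subgradient descent for L(w) = sum_n max(0, 1 - y_n w'x_n) + (lambda/2)||w||^2.
X = randn(50, 3); y = sign(randn(50, 1));     % toy data
w = zeros(3, 1); lambda = 0.1; eta = 0.01;
for iter = 1:200
    active = (y .* (X*w)) <= 1;               % points violating the margin (or inside it)
    grad = -X' * (y .* active) + lambda * w;  % sum_n -1[y_n w'x_n <= 1] y_n x_n + lambda*w
    w = w - eta * grad;                       % subgradient update
end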
Example: squared loss with squared-norm regularization.

    L(w) = Σ_n (y_n − wᵀx_n)² + (λ/2) ||w||²

In matrix notation (with X the N×D matrix of inputs and y the vector of labels) the equivalent loss is:

    L(w) = ||y − Xw||² + (λ/2) ||w||²
Unlike the previous losses, this objective has an exact closed-form solution: at the optimum the gradient is zero, so

    dL/dw = −2 Xᵀ(y − Xw) + λ w = 0   ⟹   w = (XᵀX + (λ/2) I)⁻¹ Xᵀ y
Assume we have D features and N points.
➡ Overall time via matrix inversion: forming XᵀX costs O(ND²) and solving the D×D system costs O(D³)
➡ Overall time via gradient descent: each step costs O(ND), times the number of iterations, using

    dL/dw = Σ_n −2 (y_n − wᵀx_n) x_n + λ w

Which one is faster? A sketch comparing the two follows below.
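A Matlab sketch (not from the slides) comparing the closed-form solution with gradient descent on toy data; the sizes, step size, and iteration count are illustrative:

% Regularized squared loss: closed form vs. gradient descent.
N = 100; D = 5;
X = randn(N, D); y = randn(N, 1); lambda = 0.1;

% Closed form: solve (X'X + (lambda/2) I) w = X'y, roughly O(N*D^2 + D^3).
w_exact = (X'*X + (lambda/2)*eye(D)) \ (X'*y);

% Gradient descent: each step costs O(N*D).
w = zeros(D, 1); eta = 1e-3;
for iter = 1:500
    grad = -2 * X' * (y - X*w) + lambda * w;   % sum_n -2(y_n - w'x_n) x_n + lambda*w
    w = w - eta * grad;
end
disp(norm(w - w_exact));                       % should be small after enough iterations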
Which hyperplane is the best?
Maximize the distance to the nearest point (margin), while correctly classifying all the points
[Figure: a separating hyperplane with normal w and margin δ(w), the distance to the nearest point]
Separable case: hard margin SVM

    min_w  1/δ(w)                             maximize the margin
    subject to:  y_n wᵀx_n ≥ 1, ∀n            separate by a non-trivial margin

Non-separable case: soft margin SVM

    min_w  1/δ(w) + C Σ_n ξ_n                 maximize the margin, minimize the slack
    subject to:  y_n wᵀx_n ≥ 1 − ξ_n, ∀n      allow some slack
                 ξ_n ≥ 0, ∀n
[Figure: the hyperplane wᵀx = 0 with margin boundaries wᵀx − 1 = 0 and wᵀx + 1 = 0; the margin is δ(w) = 1/||w||]

    δ(w) = 1/||w||   so   min_w 1/δ(w)  ≡  min_w ||w||

Maximizing the margin = minimizing the norm.
Squaring the norm and halving it for convenience:

Separable case: hard margin SVM

    min_w  (1/2) ||w||²
    subject to:  y_n wᵀx_n ≥ 1, ∀n

Non-separable case: soft margin SVM

    min_w  (1/2) ||w||² + C Σ_n ξ_n
    subject to:  y_n wᵀx_n ≥ 1 − ξ_n, ∀n
                 ξ_n ≥ 0, ∀n
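As a sketch (not from the slides), the soft margin SVM can be handed directly to a quadratic-programming solver. The version below uses Matlab's quadprog (Optimization Toolbox) with the variables stacked as z = [w; ξ]; the toy data and the value of C are illustrative:

% Solve the soft margin SVM as a QP: minimize (1/2)||w||^2 + C*sum(xi)
% subject to y_n w'x_n >= 1 - xi_n and xi_n >= 0, with z = [w; xi].
X = randn(40, 2); y = sign(randn(40, 1)); [N, D] = size(X); C = 1;
H = blkdiag(eye(D), zeros(N));          % quadratic term: only w is penalized
f = [zeros(D, 1); C * ones(N, 1)];      % linear term: C times the sum of slacks
A = [-diag(y) * X, -eye(N)];            % encodes -y_n w'x_n - xi_n <= -1
b = -ones(N, 1);
lb = [-inf(D, 1); zeros(N, 1)];         % slacks must be non-negative
z = quadprog(H, f, A, b, [], [], lb, []);
w = z(1:D); xi = z(D+1:end);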
Suppose I tell you what w is, but forget to give you the slack variables. Can you derive the optimal slack for the nth example?

Soft margin SVM:

    min_w  (1/2) ||w||² + C Σ_n ξ_n
    subject to:  y_n wᵀx_n ≥ 1 − ξ_n, ∀n
                 ξ_n ≥ 0, ∀n

For example, what is ξ_n when y_n wᵀx_n = 0.2, and when y_n wᵀx_n = 2.0? The smallest feasible slack is

    ξ_n = 0                if y_n wᵀx_n ≥ 1
          1 − y_n wᵀx_n    otherwise

Substituting the optimal slack back into the objective gives

    min_w  (1/2) ||w||² + C Σ_n max(0, 1 − y_n wᵀx_n)

Same as hinge loss with squared-norm regularization!
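A small Matlab sketch computing the optimal slacks for the margins quoted above:

% Optimal slack for given margins y_n * w'x_n.
margins = [0.2; 2.0];          % the two example values from the exercise
xi = max(0, 1 - margins);      % gives xi = [0.8; 0]: slack is needed only when the margin is violated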
Under suitable conditions*, provided you pick the step sizes appropriately, the convergence rate of gradient descent is O(1/N): to reach an accuracy of roughly 0.001 you may have to run gradient descent for N = 1000 iterations.
For linear models (hinge/logistic/exponential loss) with squared-norm regularization there are off-the-shelf solvers that are fast in practice: SVMperf, LIBLINEAR, PEGASOS.

* the function is strongly convex, i.e., it curves up at least quadratically everywhere
Figures of various “p-norms” are from Wikipedia
Some of the slides are based on the CIML book by Hal Daumé III
Matlab code to plot the various loss functions:

% Code to plot various loss functions
y1 = 1;                                 % label of a positive example
y2 = linspace(-2, 3, 500);              % range of predictions w'x
zeroOneLoss = y1*y2 <= 0;
hingeLoss = max(0, 1 - y1*y2);
logisticLoss = log(1 + exp(-y1*y2))/log(2);
expLoss = exp(-y1*y2);
squaredLoss = (y1 - y2).^2;
% Plot them
figure(1); clf; hold on;
plot(y2, zeroOneLoss, 'k-', 'LineWidth', 1);
plot(y2, hingeLoss, 'b-', 'LineWidth', 1);
plot(y2, logisticLoss, 'r-', 'LineWidth', 1);
plot(y2, expLoss, 'g-', 'LineWidth', 1);
plot(y2, squaredLoss, 'm-', 'LineWidth', 1);
xlabel('Prediction', 'FontSize', 16);   % the prediction goes on the x-axis
ylabel('Loss', 'FontSize', 16);         % the loss goes on the y-axis
legend({'Zero/one', 'Hinge', 'Logistic', 'Exponential', 'Squared'}, 'Location', 'NorthEast', 'FontSize', 16);
box on;

[Output: the plot of the zero/one, hinge, logistic, exponential, and squared losses shown earlier]