Applied Machine Learning
Perceptron and Support Vector Machines
Siamak Ravanbakhsh
COMP 551 (Winter 2020)
Learning objectives
- geometry of linear classification
- Perceptron learning algorithm
- margin maximization and support vectors
- hinge loss and relation to logistic regression
Perceptron: historically a significant algorithm (the first neural network, or rather just a single neuron)
- biologically motivated model
- simple learning algorithm with a convergence proof
- beginning of connectionist AI
- its criticism in the book "Perceptrons" was a factor in the AI winter

Model: f(x) = sign(w^⊤ x + w_0)

image: https://cs.stanford.edu/people/eroberts/courses/soco/projects/neural-networks/Neuron/index.html

Note that we're using +1/-1 for labels rather than 0/1.
geometry of the decision boundary

The decision boundary is the set of points where y = w^⊤ x + w_0 = w_1 x_1 + w_2 x_2 + w_0 = 0.

This hyperplane has one dimension lower than D (the number of features).

For any two points a and b on the boundary, w^⊤(a − b) + w_0 − w_0 = 0, so w/||w|| is the unit normal vector to the boundary.

The orthogonal component (along w/||w||) of any point b on the boundary is w^⊤ b / ||w|| = −w_0 / ||w||.

Signed distance of any point c from the boundary (c_⊥ is its projection onto the boundary):

w^⊤ c / ||w|| − w^⊤ c_⊥ / ||w|| = (1/||w||)(w^⊤ c + w_0)
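To make the signed-distance formula concrete, here is a minimal numpy sketch (not from the slides; the weights and points are made-up values for illustration):

import numpy as np

# hypothetical hyperplane w^T x + w_0 = 0 and a few 2D points (illustrative values)
w = np.array([2.0, -1.0])
w0 = 0.5
C = np.array([[1.0, 1.0],
              [0.0, 0.0],
              [-1.0, 2.0]])

# signed distance of each point from the boundary: (w^T c + w_0) / ||w||
signed_dist = (C @ w + w0) / np.linalg.norm(w)
print(signed_dist)  # positive on the side w points towards, negative on the other side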
A point is misclassified when the label and prediction have different signs: y^(n) ŷ^(n) < 0; we try to make this product positive.

The distance to the boundary, −(1/||w||) y^(n)(w^⊤ x^(n) + w_0), is positive for points that are on the wrong side. Minimizing it is equivalent to minimizing −y^(n)(w^⊤ x^(n) + w_0).

So the Perceptron tries to minimize the distance of misclassified points from the decision boundary and push them to the right side.
revisiting the Perceptron

If y^(n) ŷ^(n) < 0, minimize J_n(w) = −y^(n)(w^⊤ x^(n))   (the bias is now included in w).

Use stochastic gradient descent: ∇J_n(w) = −y^(n) x^(n)

w^{t+1} ← w^{t} − α ∇J_n(w) = w^{t} + α y^(n) x^(n)

The Perceptron uses a learning rate of 1; this is okay because scaling w does not affect the prediction: sign(w^⊤ x) = sign(α w^⊤ x).

Perceptron convergence theorem: the algorithm is guaranteed to converge in a finite number of steps if the data is linearly separable.
Iris dataset (linearly separable case)

def Perceptron(X, y, max_iters):
    N, D = X.shape
    w = np.random.rand(D)
    for t in range(max_iters):
        n = np.random.randint(N)
        yh = np.sign(np.dot(X[n,:], w))
        if yh != y[n]:
            w = w + y[n]*X[n,:]
    return w

Note that the code is not checking for convergence.

Iteration 1: the initial decision boundary w^⊤ x = 0.

Iteration 10: after finding a linear separator, no further updates happen; the final boundary depends on the order of instances (different from all previous methods).
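As a usage sketch (not from the slides), here is one way the Perceptron above might be run on two Iris classes; the feature choice, the -1/+1 recoding, and the appended bias column are assumptions for illustration:

import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
mask = iris.target < 2                       # keep two classes (separable here)
X = iris.data[mask][:, :2]                   # two features, matching the 2D plots
y = np.where(iris.target[mask] == 0, -1, 1)  # labels in {-1, +1}
X = np.column_stack([X, np.ones(len(X))])    # constant column so the bias lives in w

w = Perceptron(X, y, max_iters=1000)
print("training accuracy:", np.mean(np.sign(X @ w) == y))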
Iris dataset (NOT linearly separable case)

The algorithm does not converge: there is always a wrong prediction, so the weights keep being updated.
Problems with the Perceptron:
- cyclic updates if the data is not linearly separable (we could try to make the data separable using additional features, but the data may be inherently noisy)
- even if the data is linearly separable, convergence could take many iterations
- the decision boundary may be suboptimal

Let's fix this last problem first, assuming linear separability.
The margin of a classifier (assuming correct classification) is the distance of the closest point to the decision boundary.

The signed distance of x^(n) is (1/||w||)(w^⊤ x^(n) + w_0); correcting for the sign gives the margin term (1/||w||) y^(n)(w^⊤ x^(n) + w_0), which is positive for correctly classified points.
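As a quick illustration (not from the slides), the margin of a given linear classifier on a dataset can be computed directly from this formula; the names below are hypothetical:

import numpy as np

def margin(X, y, w, w0):
    # smallest signed distance y^(n) (w^T x^(n) + w_0) / ||w|| over the dataset;
    # positive only if every point is correctly classified
    return np.min(y * (X @ w + w0)) / np.linalg.norm(w)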
Find the decision boundary with maximum margin (in the figure, the margin shown is not maximal):

max_{w, w_0} M   subject to   M ≤ (1/||w||_2) y^(n)(w^⊤ x^(n) + w_0)   ∀n

The points for which M = (1/||w||_2) y^(n)(w^⊤ x^(n) + w_0) are the ones that matter in finding the boundary; these are called support vectors.

The max-margin classifier is called the support vector machine (SVM).
If (w*, w_0*) is an optimal solution, then (c w*, c w_0*) is also optimal (same margin); fix the norm of w to avoid this: ||w||_2 = 1/M.

Fixing ||w||_2 = 1/M gives

max_{w, w_0} 1/||w||_2   subject to   1/||w||_2 ≤ (1/||w||_2) y^(n)(w^⊤ x^(n) + w_0)   ∀n

Simplifying, we get the hard-margin SVM objective:

min_{w, w_0} (1/2) ||w||_2^2   subject to   y^(n)(w^⊤ x^(n) + w_0) ≥ 1   ∀n
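For concreteness, a minimal sketch (not from the slides) of solving this hard-margin objective as a quadratic program, assuming the cvxpy package is available and the data is linearly separable:

import numpy as np
import cvxpy as cp

def hard_margin_svm(X, y):
    # X: (N, D) features, y: (N,) labels in {-1, +1}, assumed linearly separable
    N, D = X.shape
    w = cp.Variable(D)
    w0 = cp.Variable()
    objective = cp.Minimize(0.5 * cp.sum_squares(w))         # (1/2) ||w||_2^2
    constraints = [cp.multiply(y, X @ w + w0) >= 1]          # y^(n) (w^T x^(n) + w_0) >= 1
    cp.Problem(objective, constraints).solve()
    return w.value, w0.value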
Back to the problems with the Perceptron: maximizing the hard margin fixes the suboptimal decision boundary; now let's fix the non-separable case by maximizing a soft margin instead.
Allow points inside the margin and on the wrong side, but penalize them.

Instead of the hard constraint y^(n)(w^⊤ x^(n) + w_0) ≥ 1 ∀n, use

y^(n)(w^⊤ x^(n) + w_0) ≥ 1 − ξ^(n)   ∀n

with slack variables ξ^(n) ≥ 0 (one for each n):
- ξ^(n) = 0: the point satisfies the original margin constraint
- 0 < ξ^(n) < 1: correctly classified but inside the margin
- ξ^(n) > 1: incorrectly classified
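As a small illustration (not from the slides), the slack of each point under a given (w, w_0) can be computed and categorized as above, using the optimal slack value derived two slides below; all names here are hypothetical:

import numpy as np

def slacks(X, y, w, w0):
    # xi^(n) = max(0, 1 - y^(n) (w^T x^(n) + w_0))
    return np.maximum(0, 1 - y * (X @ w + w0))

# xi == 0: satisfies the margin, 0 < xi < 1: inside the margin, xi > 1: misclassified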
Soft-margin objective:

min_{w, w_0} (1/2) ||w||_2^2 + γ Σ_n ξ^(n)
subject to  y^(n)(w^⊤ x^(n) + w_0) ≥ 1 − ξ^(n)  ∀n,   ξ^(n) ≥ 0  ∀n

γ is a hyper-parameter that defines the importance of the constraints; for very large γ this becomes similar to the hard-margin SVM.
It would be nice to turn this into an unconstrained optimization.

If a point satisfies the margin, y^(n)(w^⊤ x^(n) + w_0) ≥ 1, the minimum slack is ξ^(n) = 0.
Otherwise, y^(n)(w^⊤ x^(n) + w_0) < 1, and the smallest slack is ξ^(n) = 1 − y^(n)(w^⊤ x^(n) + w_0).

So the optimal slack satisfying both cases is ξ^(n) = max(0, 1 − y^(n)(w^⊤ x^(n) + w_0)).

Replacing ξ^(n) in the objective, we get

min_{w, w_0} (1/2) ||w||_2^2 + γ Σ_n max(0, 1 − y^(n)(w^⊤ x^(n) + w_0))

which is the same as

min_{w, w_0} Σ_n max(0, 1 − y^(n)(w^⊤ x^(n) + w_0)) + (λ/2) ||w||_2^2

L_hinge(y, ŷ) = max(0, 1 − y ŷ) is called the hinge loss, so the soft-margin SVM is doing L2-regularized hinge loss minimization.
Perceptron vs. SVM

The Perceptron loss can be written as Σ_n max(0, −y^(n)(w^⊤ x^(n) + w_0)); it evaluates to zero for correctly classified points.

The SVM objective is Σ_n max(0, 1 − y^(n)(w^⊤ x^(n) + w_0)) + (λ/2) ||w||_2^2, so the 1 inside the max is the difference (plus the regularization).

Perceptron: finds some linear decision boundary, if one exists; uses stochastic gradient descent with a fixed learning rate.
SVM: for small λ, finds the max-margin decision boundary; depending on the formulation, we have many optimization choices.
Cost:

J(w) = Σ_n max(0, 1 − y^(n) w^⊤ x^(n)) + (λ/2) ||w||_2^2   (the bias is now included in w)

Check that the cost function is convex in w(?)

The hinge loss is not smooth (piecewise linear), so we use "stochastic" sub-gradient descent; whenever y^(n) ŷ^(n) < 1 we minimize 1 − y^(n)(w^⊤ x^(n)) + (λ/2) ||w||_2^2, and the update looks like the Perceptron's.

def cost(X, y, w, lamb=1e-3):
    z = np.dot(X, w)
    J = np.mean(np.maximum(0, 1 - y*z)) + lamb * np.dot(w[:-1], w[:-1])/2
    return J

def subgradient(X, y, w, lamb):
    N, D = X.shape
    z = np.dot(X, w)
    violations = np.nonzero(z*y < 1)[0]                 # points violating the margin
    grad = -np.dot(X[violations,:].T, y[violations])/N  # sub-gradient of the mean hinge loss
    grad[:-1] += lamb * w[:-1]                          # L2 term; the bias (last entry) is not regularized
    return grad
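As a quick sanity check (not from the slides), away from the hinge's kink points the sub-gradient above should agree with a finite-difference approximation of the cost; the synthetic data here is an assumption for illustration:

import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(size=(20, 2)), np.ones(20)])  # bias column last
y = np.where(rng.random(20) < 0.5, -1.0, 1.0)
w = rng.normal(size=3)
lamb = 1e-3

g = subgradient(X, y, w, lamb)
eps = 1e-6
g_fd = np.array([(cost(X, y, w + eps*e, lamb) - cost(X, y, w - eps*e, lamb)) / (2*eps)
                 for e in np.eye(3)])
print(np.allclose(g, g_fd, atol=1e-4))  # expect True unless w sits exactly on a kink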
Iris dataset (D=2, linearly separable case)

def SubGradientDescent(X, y, lr=1, eps=1e-18, max_iters=1000, lamb=1e-8):
    N, D = X.shape
    w = np.zeros(D)
    t = 0
    w_old = w + np.inf
    while np.linalg.norm(w - w_old) > eps and t < max_iters:
        g = subgradient(X, y, w, lamb=lamb)
        w_old = w
        w = w - lr*g/np.sqrt(t+1)
        t += 1
    return w

With a small λ (λ = 10^-8) this finds the max-margin boundary; compare to the Perceptron's decision boundary.
Iris dataset (D=2, NOT linearly separable case)

Running the same sub-gradient descent (a variant that stops when the gradient norm drops below eps) with a small λ (λ = 10^-8) gives soft margins; the Perceptron does not converge on this data.
Relation to logistic regression

Recall the simplified logistic regression cost for y ∈ {0, 1}:

J(w) = Σ_{n=1}^N y^(n) log(1 + e^{−z^(n)}) + (1 − y^(n)) log(1 + e^{z^(n)}),   where z^(n) = w^⊤ x^(n)  (includes the bias)

For y ∈ {−1, +1} we can write this as

J(w) = Σ_{n=1}^N log(1 + e^{−y^(n) z^(n)}) + (λ/2) ||w||_2^2   (also added L2 regularization)

Compare to the SVM cost for y ∈ {−1, +1}:

J(w) = Σ_n max(0, 1 − y^(n) z^(n)) + (λ/2) ||w||_2^2

[figure: loss as a function of y z, showing the 0-1 loss L_{0,1}, the squared loss L_2, the scaled cross-entropy loss L_CE (logistic regression), and the hinge loss L_hinge (SVM)]

Both the hinge loss and the (scaled) cross-entropy loss try to approximate the 0-1 loss (accuracy).
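A minimal sketch (not from the slides) of the comparison in the figure above, as a function of the margin y z; scaling the cross-entropy by 1/log 2 so that it passes through 1 at y z = 0 is an assumption about how the plot was scaled:

import numpy as np

yz = np.linspace(-2, 2, 401)               # y * z, the signed margin
L01 = (yz <= 0).astype(float)              # 0-1 loss
Lhinge = np.maximum(0, 1 - yz)             # hinge loss (SVM)
Lce = np.log1p(np.exp(-yz)) / np.log(2)    # scaled cross-entropy (logistic regression)

# both surrogate losses upper-bound the 0-1 loss
print(np.all(Lhinge >= L01), np.all(Lce >= L01))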
Multi-class classification: can we use multiple binary classifiers?

One idea: train C different 1-vs-(C−1) classifiers, z_c(x) = w_[c]^⊤ x.
At test time, choose the class with the highest score: c* = argmax_c z_c(x).
Problems: class imbalance, and it is not clear what it means to compare the values z_c(x).

image credit: Andrew Zisserman
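A minimal sketch (not from the slides) of this 1-vs-(C−1) scheme, reusing the Perceptron defined earlier (any binary linear classifier returning a weight vector would do); it assumes X already contains a bias column and y holds integer labels 0..C−1:

import numpy as np

def one_vs_rest_train(X, y, C, max_iters=1000):
    # one weight vector per class: class c vs. all other classes
    return np.stack([Perceptron(X, np.where(y == c, 1, -1), max_iters) for c in range(C)])

def one_vs_rest_predict(X, W):
    scores = X @ W.T            # z_c(x) = w_[c]^T x for every class c
    return np.argmax(scores, axis=1)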
Another idea: train C(C−1)/2 classifiers, one for each pair of classes.
At test time, choose the class with the highest vote.
Problems: computationally more demanding for large C, and there can be ambiguities in the final classification.
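A minimal sketch (not from the slides) of this pairwise scheme with voting, again reusing the Perceptron above under the same assumptions:

import numpy as np
from itertools import combinations

def one_vs_one_train(X, y, C, max_iters=1000):
    models = {}
    for a, b in combinations(range(C), 2):      # C(C-1)/2 class pairs
        mask = (y == a) | (y == b)
        models[(a, b)] = Perceptron(X[mask], np.where(y[mask] == a, 1, -1), max_iters)
    return models

def one_vs_one_predict(X, models, C):
    votes = np.zeros((X.shape[0], C))
    for (a, b), w in models.items():
        pred = np.sign(X @ w)
        votes[:, a] += (pred > 0)               # classifier votes for a
        votes[:, b] += (pred <= 0)              # otherwise it votes for b
    return np.argmax(votes, axis=1)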
Summary:
- geometry of linear classification
- Perceptron algorithm
- distance to the decision boundary (margin)
- max-margin classification and support vectors
- hard vs. soft SVM and the relation to the Perceptron
- hinge loss and its relation to logistic regression
- some ideas for max-margin multi-class classification