

slide-1
SLIDE 1

Applied Machine Learning

Perceptron and Support Vector Machines

Siamak Ravanbakhsh

COMP 551 (Winter 2020)

1

slide-2
SLIDE 2

Learning objectives

  • geometry of linear classification
  • Perceptron learning algorithm
  • margin maximization and support vectors
  • hinge loss and relation to logistic regression

2

slide-3
SLIDES 3-5

Perceptron

Model

f(x) = sign(w^⊤ x + w_0)

image: https://cs.stanford.edu/people/eroberts/courses/soco/projects/neural-networks/Neuron/index.html

historically a significant algorithm (first neural network, or rather just a neuron)

  • old implementation (1960's)
  • biologically motivated model
  • simple learning algorithm
  • convergence proof
  • beginning of connectionist AI
  • its criticism in the book "Perceptrons" was a factor in the AI winter

note that we're using +1/-1 for labels rather than 0/1.

3
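As a quick illustration (not part of the slides), a minimal NumPy sketch of this decision rule; the weight vector w and bias w0 stand for whatever a learning algorithm would produce:

import numpy as np

def predict(x, w, w0):
    # perceptron decision rule: +1 on one side of the hyperplane, -1 on the other
    # (np.sign returns 0 exactly on the boundary)
    return np.sign(np.dot(w, x) + w0)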

slide-6
SLIDES 6-9

geometry of the separating hyperplane

y = w^⊤ x + w_0 = w_1 x_1 + w_2 x_2 + w_0 = 0

(figure: a line in the (x1, x2) plane, with the region y > 0 on one side, y < 0 on the other, and two points a and b lying on the line)

this hyperplane has one dimension lower than D (number of features)

for any two points a and b on the line:  w^⊤ a + w_0 = w^⊤ b + w_0 = 0,  so  w^⊤ (a − b) = 0

so w/||w|| is the unit normal vector to the line

the orthogonal component of any point b on the line:  b_⊥ = (w^⊤ b / ||w||) (w / ||w||)

4 . 1

slide-10
SLIDES 10-16

geometry of the separating hyperplane

(figure: the same line in the (x1, x2) plane, a point c off the line, its projection c_⊥ onto the line, and its component along the unit normal w/||w||)

the orthogonal component of any point b on the line:  b_⊥ = (w^⊤ b / ||w||) (w / ||w||)

signed distance of any point c from the line:

(w^⊤ c / ||w||) − (w^⊤ c_⊥ / ||w||)  =  (1/||w||) (w^⊤ c + w_0)

(here c_⊥ is the projection of c onto the line, so w^⊤ c_⊥ = −w_0)

4 . 2
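A small NumPy check of this formula (not part of the slides); the hyperplane and the query point below are made-up values:

import numpy as np

w, w0 = np.array([3.0, 4.0]), -5.0      # hypothetical hyperplane 3*x1 + 4*x2 - 5 = 0
c = np.array([2.0, 1.0])                # hypothetical query point

signed_dist = (np.dot(w, c) + w0) / np.linalg.norm(w)
print(signed_dist)                      # 1.0, positive: c lies on the y > 0 side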

slide-17
SLIDES 17-20

Perceptron: objective

if y^(n) ŷ^(n) < 0, the label and the prediction have different signs

the signed distance of x^(n) to the boundary is  (1/||w||) (w^⊤ x^(n) + w_0)

trying to make y^(n) ŷ^(n) positive is equivalent to minimizing

−y^(n) (w^⊤ x^(n) + w_0)

which is positive for points that are on the wrong side

so the Perceptron tries to minimize the distance of misclassified points from the decision boundary and push them to the right side

5 . 1
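A minimal NumPy sketch of this criterion summed over the misclassified points (not from the slides); w and w0 are whatever the current parameters are:

import numpy as np

def perceptron_criterion(X, y, w, w0):
    # sum of -y*(w.x + w0) over misclassified points; zero when everything is on the right side
    z = X @ w + w0
    wrong = y * z < 0
    return -np.sum(y[wrong] * z[wrong])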

slide-21
SLIDES 21-26

Perceptron: optimization

revisiting: if y^(n) ŷ^(n) < 0, minimize  J_n(w) = −y^(n) (w^⊤ x^(n));  otherwise, do nothing
(now we included the bias in w)

use stochastic gradient descent:

∇J_n(w) = −y^(n) x^(n)

w^{t+1} ← w^{t} − α ∇J_n(w) = w^{t} + α y^(n) x^(n)

Perceptron uses a learning rate of 1; this is okay because scaling w does not affect the prediction:  sign(w^⊤ x) = sign(α w^⊤ x) for α > 0

Perceptron convergence theorem: the algorithm is guaranteed to converge in a finite number of steps if the data is linearly separable

5 . 2

slide-27
SLIDES 27-30

Perceptron: example

Iris dataset (linearly separable case)

iteration 1: initial decision boundary w^⊤ x = 0

import numpy as np

def Perceptron(X, y, max_iters):
    N, D = X.shape
    w = np.random.rand(D)                # random initial weights (bias included in w)
    for t in range(max_iters):
        n = np.random.randint(N)         # pick a random training instance
        yh = np.sign(np.dot(X[n,:], w))  # current prediction in {-1, +1}
        if yh != y[n]:                   # update only on a mistake
            w = w + y[n]*X[n,:]
    return w

note that the code is not checking for convergence

5 . 3

slide-31
SLIDES 31-33

Perceptron: example

Iris dataset (linearly separable case), iteration 10
(same Perceptron code as above; it is not checking for convergence)

observations:
  • after finding a linear separator, no further updates happen
  • the final boundary depends on the order of instances (different from all previous methods)

5 . 4

slide-34
SLIDES 34-36

Perceptron: example

Iris dataset (NOT linearly separable case)
(same Perceptron code as above; it is not checking for convergence)

the algorithm does not converge: there is always a wrong prediction, so the weights keep being updated

5 . 5

slide-37
SLIDES 37-40

Perceptron: issues

  • cyclic updates if the data is not linearly separable
      (try to make the data separable using additional features? the data may be inherently noisy)
  • even if the data is linearly separable, convergence could take many iterations
  • the decision boundary may be suboptimal  ← let's fix this problem first (assume linear separability)

5 . 7

slide-41
SLIDES 41-43

Margin

the margin of a classifier (assuming correct classification) is the distance of the closest point to the decision boundary

signed distance of x^(n):  (1/||w||) (w^⊤ x^(n) + w_0)

correcting for the sign (margin):  (1/||w||) y^(n) (w^⊤ x^(n) + w_0)

this is positive for correctly classified points

6 . 1
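A small NumPy sketch of this quantity (not from the slides): the margin of a classifier is the smallest signed distance over the training set, and it is positive only if every point is correctly classified.

import numpy as np

def classifier_margin(X, y, w, w0):
    # y * (w.x + w0) / ||w|| for each point; the minimum over points is the margin
    return np.min(y * (X @ w + w0)) / np.linalg.norm(w)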

slide-44
SLIDES 44-45

Max margin classifier

find the decision boundary with maximum margin

(figure: two separating boundaries for the same data; in the first the margin is not maximal, in the second the margin is maximum)

6 . 2

slide-46
SLIDES 46-49

Max margin classifier

find the decision boundary with maximum margin:

max_{w, w_0}  M
subject to  M ≤ (1/||w||_2) y^(n) (w^⊤ x^(n) + w_0)   ∀n

(figure: the margin M on each side of the decision boundary)

only the points (n) with  M = (1/||w||_2) y^(n) (w^⊤ x^(n) + w_0)  matter in finding the boundary

these are called support vectors

the max-margin classifier is called the support vector machine (SVM)

6 . 3

slide-50
SLIDES 50-53

Support Vector Machine

find the decision boundary with maximum margin:

max_{w, w_0}  M
subject to  M ≤ (1/||w||_2) y^(n) (w^⊤ x^(n) + w_0)   ∀n

observation: if (w*, w_0*) is an optimal solution, then (c w*, c w_0*) is also optimal (same margin)

fix the norm of w to avoid this:  ||w||_2 = 1/M

6 . 4
slide-54
SLIDES 54-57

Support Vector Machine

find the decision boundary with maximum margin:

max_{w, w_0}  M
subject to  M ≤ (1/||w||_2) y^(n) (w^⊤ x^(n) + w_0)   ∀n

fixing  ||w||_2 = 1/M,  this becomes

max_{w, w_0}  1/||w||_2
subject to  (1/||w||_2) y^(n) (w^⊤ x^(n) + w_0) ≥ 1/||w||_2   ∀n

simplifying, we get the hard-margin SVM objective:

min_{w, w_0}  (1/2) ||w||_2^2
subject to  y^(n) (w^⊤ x^(n) + w_0) ≥ 1   ∀n

6 . 5
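This constrained problem is a quadratic program, so off-the-shelf solvers handle it directly. A hedged sketch (not part of the course code), assuming the cvxpy package is available and the data is linearly separable:

import numpy as np
import cvxpy as cp

def hard_margin_svm(X, y):
    # minimize (1/2)||w||^2 subject to y^(n) (w.x^(n) + w0) >= 1 for all n
    N, D = X.shape
    w, w0 = cp.Variable(D), cp.Variable()
    objective = cp.Minimize(0.5 * cp.sum_squares(w))
    constraints = [cp.multiply(y, X @ w + w0) >= 1]
    cp.Problem(objective, constraints).solve()
    return w.value, w0.value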

slide-58
SLIDES 58-60

Perceptron: issues

  • cyclic updates if the data is not linearly separable
      (try to make the data separable using additional features? the data may be inherently noisy)
      ← now let's fix this problem: maximize a soft margin
  • even if the data is linearly separable, convergence could take many iterations
  • the decision boundary may be suboptimal  ← maximize the hard margin

7

slide-61
SLIDES 61-66

Soft margin constraints

allow points inside the margin and on the wrong side, but penalize them

instead of the hard constraint  y^(n) (w^⊤ x^(n) + w_0) ≥ 1   ∀n

use  y^(n) (w^⊤ x^(n) + w_0) ≥ 1 − ξ^(n)   ∀n

with slack variables ξ^(n) ≥ 0 (one for each n):

  • ξ^(n) = 0  if the point satisfies the original margin constraint
  • 0 < ξ^(n) < 1  if it is correctly classified but inside the margin
  • ξ^(n) > 1  if it is incorrectly classified

(figure: the margin band of width 1/||w||_2 on each side of the boundary, with the slack ξ^(n) marked for points violating it)

8 . 1

slide-67
SLIDES 67-68

Soft margin constraints

allow points inside the margin and on the wrong side, but penalize them

soft-margin objective:

min_{w, w_0}  (1/2) ||w||_2^2 + γ ∑_n ξ^(n)
subject to  y^(n) (w^⊤ x^(n) + w_0) ≥ 1 − ξ^(n)   ∀n
            ξ^(n) ≥ 0   ∀n

γ is a hyper-parameter that defines the importance of the constraints; for very large γ this becomes similar to the hard-margin SVM

8 . 2

slide-69
SLIDES 69-72

Hinge loss

it would be nice to turn this into an unconstrained optimization:

min_{w, w_0}  (1/2) ||w||_2^2 + γ ∑_n ξ^(n)
subject to  y^(n) (w^⊤ x^(n) + w_0) ≥ 1 − ξ^(n),  ξ^(n) ≥ 0   ∀n

if the point satisfies the margin,  y^(n) (w^⊤ x^(n) + w_0) ≥ 1,  the minimum slack is  ξ^(n) = 0

otherwise,  y^(n) (w^⊤ x^(n) + w_0) < 1,  and the smallest slack is  ξ^(n) = 1 − y^(n) (w^⊤ x^(n) + w_0)

so the optimal slack satisfying both cases is

ξ^(n) = max(0, 1 − y^(n) (w^⊤ x^(n) + w_0))

8 . 3
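A small NumPy sketch of this optimal-slack formula (not from the slides); the classifier and the three points below are made-up, chosen to hit the three regimes listed earlier:

import numpy as np

w, w0 = np.array([1.0, -1.0]), 0.0           # hypothetical classifier
X = np.array([[ 3.0, 0.0],                   # well outside the margin
              [ 0.5, 0.0],                   # correct but inside the margin
              [-1.0, 0.0]])                  # on the wrong side
y = np.array([1.0, 1.0, 1.0])

xi = np.maximum(0.0, 1.0 - y * (X @ w + w0)) # optimal slack for each point
print(xi)                                    # [0.   0.5  2. ]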

slide-73
SLIDES 73-78

Hinge loss

it would be nice to turn this into an unconstrained optimization

replacing  ξ^(n) = max(0, 1 − y^(n) (w^⊤ x^(n) + w_0))  in the soft-margin objective, we get

min_{w, w_0}  (1/2) ||w||_2^2 + γ ∑_n max(0, 1 − y^(n) (w^⊤ x^(n) + w_0))

which is the same as

min_{w, w_0}  ∑_n max(0, 1 − y^(n) (w^⊤ x^(n) + w_0)) + (1/(2γ)) ||w||_2^2

L_hinge(y, ŷ) = max(0, 1 − y ŷ)  is called the hinge loss

so soft-margin SVM is doing L2-regularized hinge loss minimization

8 . 4

slide-79
SLIDES 79-84

Perceptron vs. SVM

Perceptron: the per-point objective evaluates to zero if the point is correctly classified, otherwise it is  −y^(n) (w^⊤ x^(n) + w_0),  so it can be written as

min_{w, w_0}  ∑_n max(0, −y^(n) (w^⊤ x^(n) + w_0))

SVM:

min_{w, w_0}  ∑_n max(0, 1 − y^(n) (w^⊤ x^(n) + w_0)) + (λ/2) ||w||_2^2

so this is the difference! (plus regularization)

Perceptron: finds some linear decision boundary, if one exists; optimized by stochastic gradient descent with a fixed learning rate
SVM: for small lambda, finds the max-margin decision boundary; depending on the formulation we have many choices of optimization method

9 . 1

slide-85
SLIDES 85-91

Perceptron vs. SVM

J(w) = ∑_n max(0, 1 − y^(n) w^⊤ x^(n)) + (λ/2) ||w||_2^2
(now we included the bias in w)

check that the cost function is convex in w (?)

cost:

import numpy as np

def cost(X, y, w, lamb=1e-3):
    z = np.dot(X, w)
    # mean hinge loss plus L2 penalty; the bias (last entry of w) is not regularized
    J = np.mean(np.maximum(0, 1 - y*z)) + lamb * np.dot(w[:-1], w[:-1])/2
    return J

the hinge loss is not smooth (piecewise linear); if we use "stochastic" sub-gradient descent, the update will look like the Perceptron's:
if y^(n) ŷ^(n) < 1, minimize  1 − y^(n) (w^⊤ x^(n)) + (λ/2) ||w||_2^2;  otherwise, do nothing

def subgradient(X, y, w, lamb):
    N, D = X.shape
    z = np.dot(X, w)
    violations = np.nonzero(z*y < 1)[0]      # points violating the margin
    grad = -np.dot(X[violations,:].T, y[violations])/N
    grad[:-1] += lamb * w[:-1]               # subgradient of the L2 penalty (bias excluded)
    return grad

9 . 2

slide-92
SLIDES 92-95

Example

Iris dataset (D=2), linearly separable case

def SubGradientDescent(X, y, lr=1, eps=1e-18, max_iters=1000, lamb=1e-8):
    N, D = X.shape
    w = np.zeros(D)
    t = 0
    w_old = w + np.inf
    while np.linalg.norm(w - w_old) > eps and t < max_iters:
        g = subgradient(X, y, w, lamb=lamb)
        w_old = w
        w = w - lr*g/np.sqrt(t+1)    # decaying step size
        t += 1
    return w

max-margin boundary (using a small lambda, λ = 10^−8); compare to the Perceptron's decision boundary

9 . 3
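A hypothetical usage sketch (not from the slides), assuming the bias is handled by appending a column of ones to X so that the last entry of w plays the role of w_0, as in the cost and subgradient code above:

import numpy as np

# made-up toy data; the final column of ones acts as the bias feature
X = np.array([[ 2.0,  1.0, 1.0],
              [ 1.0,  2.5, 1.0],
              [-1.0, -1.5, 1.0],
              [-2.0, -0.5, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = SubGradientDescent(X, y, lamb=1e-8)
yh = np.sign(np.dot(X, w))   # predicted labels in {-1, +1}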

slide-96
SLIDES 96-98

Example

Iris dataset (D=2), NOT linearly separable case

def SubGradientDescent(X, y, lr=1, eps=1e-18, max_iters=1000, lamb=1e-8):
    N, D = X.shape
    w = np.zeros(D)
    g = np.inf
    t = 0
    while np.linalg.norm(g) > eps and t < max_iters:
        g = subgradient(X, y, w, lamb=lamb)
        w = w - lr*g/np.sqrt(t+1)
        t += 1
    return w

soft margins using a small lambda (λ = 10^−8); the Perceptron does not converge on this data

9 . 4

slide-99
SLIDES 99-102

SVM vs. logistic regression

recall: the simplified logistic regression cost for y ∈ {0, 1} (the bias is included in w), where z^(n) = w^⊤ x^(n):

J(w) = ∑_{n=1}^N  y^(n) log(1 + e^{−z^(n)}) + (1 − y^(n)) log(1 + e^{z^(n)})

for y ∈ {−1, +1} we can write this as (also adding L2 regularization):

J(w) = ∑_{n=1}^N  log(1 + e^{−y^(n) z^(n)}) + (λ/2) ||w||_2^2

compare to the SVM cost for y ∈ {−1, +1}:

J(w) = ∑_n  max(0, 1 − y^(n) z^(n)) + (λ/2) ||w||_2^2

(figure: losses as a function of the margin z·y: the 0-1 loss L_{0,1}, the squared loss L_2, the hinge loss L_hinge (SVM), and the scaled cross-entropy loss L_CE (logistic regression))

they both try to approximate the 0-1 loss (accuracy)

10
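A small NumPy sketch (not from the slides) evaluating these losses as a function of the margin m = y·z; the scaling of the cross-entropy term is one common plotting choice (dividing by log 2 so it passes through 1 at m = 0):

import numpy as np

m = np.linspace(-3, 3, 7)                          # margin values y*z
zero_one = (m <= 0).astype(float)                  # 0-1 loss
hinge = np.maximum(0.0, 1.0 - m)                   # hinge loss (SVM)
cross_entropy = np.log1p(np.exp(-m)) / np.log(2)   # scaled logistic loss

print(zero_one)
print(hinge)
print(np.round(cross_entropy, 2))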

slide-103
SLIDES 103-107

Multiclass classification

can we use multiple binary classifiers?

one versus the rest

training: train C different 1-vs-(C−1) classifiers,  z_c(x) = w^{[c]⊤} x

test time: choose the class with the highest score,  ẑ = arg max_c z_c(x)

problems: class imbalance; it is not clear what it means to compare the values z_c(x)

(image credit: Andrew Zisserman)

11 . 1
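A minimal NumPy sketch of this scheme (not from the slides); train_binary is a placeholder for any binary trainer that expects labels in {−1, +1}, for example the Perceptron or SubGradientDescent routines above:

import numpy as np

def train_one_vs_rest(X, y, C, train_binary):
    # one weight vector per class: class c against the rest
    return np.stack([train_binary(X, np.where(y == c, 1.0, -1.0)) for c in range(C)])

def predict_one_vs_rest(X, W):
    # score every class with its own weight vector and pick the highest
    return np.argmax(X @ W.T, axis=1)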

slide-108
SLIDES 108-111

Multiclass classification

can we use multiple binary classifiers?

one versus one

training: train C(C−1)/2 classifiers, one for each pair of classes

test time: choose the class with the highest vote

problems: computationally more demanding for large C; ambiguities in the final classification

11 . 2
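A minimal voting sketch (not from the slides); pairwise[(a, b)] is a hypothetical binary classifier returning a positive score for class a and a negative score for class b:

import numpy as np
from itertools import combinations

def predict_one_vs_one(x, pairwise, C):
    votes = np.zeros(C)
    for a, b in combinations(range(C), 2):
        winner = a if pairwise[(a, b)](x) > 0 else b
        votes[winner] += 1
    return int(np.argmax(votes))   # ties are one source of the ambiguity mentioned above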

slide-112
SLIDE 112

Summary

  • geometry of linear classification
  • Perceptron algorithm
  • distance to the decision boundary (margin)
  • max-margin classification and support vectors
  • hard vs. soft SVM, and the relation to the Perceptron
  • hinge loss and its relation to logistic regression
  • some ideas for max-margin multi-class classification

12