

SLIDE 1

Linear classifiers

CE-717: Machine Learning

Sharif University of Technology

M. Soleymani

Fall 2019

SLIDE 2

Topics

- Linear classifiers
- Perceptron
- Fisher (LDA)
- Multi-class classification

SVM will be covered in later lectures.

SLIDE 3

Classification problem

- Given: a training set of $N$ labeled input-output pairs $D = \{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^{N}$
  - $y \in \{1, \dots, K\}$
- Goal: given an input $\mathbf{x}$, assign it to one of $K$ classes
- Examples:
  - Spam filter
  - Handwritten digit recognition

SLIDE 4

Linear classifiers

- Decision boundaries are linear in $\mathbf{x}$, or linear in some given set of functions of $\mathbf{x}$
- Linearly separable data: data points that can be exactly classified by a linear decision surface
- Why linear classifiers?
  - Even when they are not optimal, their simplicity makes them useful
  - They are relatively easy to compute
  - In the absence of information suggesting otherwise, linear classifiers are attractive candidates for initial, trial classifiers.

SLIDE 5

Two Category

- $g(\mathbf{x}; \mathbf{w}) = \mathbf{w}^T\mathbf{x} + w_0 = w_0 + w_1 x_1 + \dots + w_d x_d$
  - $\mathbf{x} = [x_1\; x_2\; \dots\; x_d]$: input
  - $\mathbf{w} = [w_1\; w_2\; \dots\; w_d]$: weight vector
  - $w_0$: bias
- If $\mathbf{w}^T\mathbf{x} + w_0 \geq 0$ then decide $\mathcal{C}_1$, else $\mathcal{C}_2$

Decision surface (boundary): $\mathbf{w}^T\mathbf{x} + w_0 = 0$

$\mathbf{w}$ is orthogonal to every vector lying within the decision surface.
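The two-category decision rule above can be sketched in a few lines of Python (a minimal illustration; the function name and example weights are hypothetical, not from the lecture):

```python
import numpy as np

def linear_decide(w, w0, x):
    """Decide class 1 if w^T x + w0 >= 0, else class 2."""
    return 1 if w @ x + w0 >= 0 else 2

# hypothetical weights: decision boundary x1 + x2 - 3 = 0
w, w0 = np.array([1.0, 1.0]), -3.0
linear_decide(w, w0, np.array([2.0, 2.0]))   # 2 + 2 - 3 >= 0 -> class 1
linear_decide(w, w0, np.array([0.5, 0.5]))   # 0.5 + 0.5 - 3 < 0 -> class 2
```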

SLIDE 6

Example

Figure: in the $(x_1, x_2)$ plane, the decision boundary is the line $3 - \frac{3}{4}x_1 - x_2 = 0$; if $\mathbf{w}^T\mathbf{x} + w_0 \geq 0$ then decide $\mathcal{C}_1$, else $\mathcal{C}_2$.

SLIDE 7

Linear classifier: Two Category

- Decision boundary is a $(d-1)$-dimensional hyperplane $H$ in the $d$-dimensional feature space
  - The orientation of $H$ is determined by the normal vector $[w_1, \dots, w_d]$
  - $w_0$ determines the location of the surface
- The normal distance from the origin to the decision surface is $\frac{|w_0|}{\|\mathbf{w}\|}$
- Writing $\mathbf{x} = \mathbf{x}_\perp + r\frac{\mathbf{w}}{\|\mathbf{w}\|}$ (with $\mathbf{x}_\perp$ the projection of $\mathbf{x}$ onto the surface $g(\mathbf{x}) = 0$) gives

$$\mathbf{w}^T\mathbf{x} + w_0 = r\|\mathbf{w}\| \;\Rightarrow\; r = \frac{\mathbf{w}^T\mathbf{x} + w_0}{\|\mathbf{w}\|}$$

a signed measure of the perpendicular distance $r$ of the point $\mathbf{x}$ from the decision surface.

SLIDE 8

Linear boundary: geometry

Figure: the boundary $\mathbf{w}^T\mathbf{x} + w_0 = 0$ separates the region where $\mathbf{w}^T\mathbf{x} + w_0 > 0$ from the region where $\mathbf{w}^T\mathbf{x} + w_0 < 0$; $\mathbf{w}$ is normal to the boundary.

SLIDE 9

Non-linear decision boundary

- Choose non-linear features; the classifier is still linear in the parameters $\mathbf{w}$
- Example: $\boldsymbol{\phi}(\mathbf{x}) = [1, x_1, x_2, x_1^2, x_2^2, x_1 x_2]$ with $\mathbf{x} = [x_1, x_2]$
- $\mathbf{w} = [w_0, w_1, \dots, w_5] = [-1, 0, 0, 1, 1, 0]$ gives the boundary $-1 + x_1^2 + x_2^2 = 0$
- If $\mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}) \geq 0$ then $y = 1$, else $y = -1$

SLIDE 10

Cost function for linear classification

- Finding linear classifiers can be formulated as an optimization problem:
  - Select how to measure the prediction loss
  - Based on the training set $D = \{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^{N}$, a cost function $J(\mathbf{w})$ is defined
  - Solve the resulting optimization problem to find the parameters: find the optimal $\hat{g}(\mathbf{x}) = g(\mathbf{x}; \hat{\mathbf{w}})$ where $\hat{\mathbf{w}} = \operatorname{argmin}_{\mathbf{w}} J(\mathbf{w})$
- Criterion or cost functions for classification:
  - We will investigate several cost functions for the classification problem

SLIDE 11

SSE cost function for classification

$$J(\mathbf{w}) = \sum_{i=1}^{N} \left(\mathbf{w}^T\mathbf{x}^{(i)} - y^{(i)}\right)^2 \qquad (K = 2,\; y \in \{-1, +1\})$$

The SSE cost function is not suitable for classification:
- Least-squares loss penalizes "too correct" predictions (those that lie a long way on the correct side of the decision boundary)
- Least-squares loss also lacks robustness to noise
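The "too correct" penalty is easy to see numerically: with label $y = +1$, a confidently correct score of 5 incurs a far larger squared error than a barely correct score of 0.5 (a small sketch; the function name is illustrative):

```python
def sse_terms(scores, y):
    """Per-sample SSE terms (w^T x - y)^2 for a list of scores w^T x."""
    return [(s - y) ** 2 for s in scores]

# label y = +1; both scores are on the correct side of the boundary
sse_terms([0.5, 5.0], y=1)   # [0.25, 16.0] -> the very correct score is penalized more
```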

SLIDE 12

SSE cost function for classification

Figure [Bishop]: plots of the per-sample loss $(\mathbf{w}^T\mathbf{x} - y)^2$ against $\mathbf{w}^T\mathbf{x}$ for $y = 1$ and $y = -1$ ($K = 2$); correct predictions far from the boundary are still penalized by SSE.

SLIDE 13

SSE cost function for classification

- Is it more suitable if we set $g(\mathbf{x}; \mathbf{w}) = \operatorname{sign}(\mathbf{w}^T\mathbf{x})$?

$$J(\mathbf{w}) = \sum_{i=1}^{N} \left(\operatorname{sign}(\mathbf{w}^T\mathbf{x}^{(i)}) - y^{(i)}\right)^2 \qquad \operatorname{sign}(a) = \begin{cases} -1, & a < 0 \\ 1, & a \geq 0 \end{cases}$$

- $J(\mathbf{w})$ is now a piecewise constant function, proportional to the number of misclassifications, i.e., the training error incurred in classifying the training samples ($K = 2$)

SLIDE 14

SSE cost function

- Is it more suitable if we set $g(\mathbf{x}; \mathbf{w}) = \tau(\mathbf{w}^T\mathbf{x})$?

$$J(\mathbf{w}) = \sum_{i=1}^{N} \left(\tau(\mathbf{w}^T\mathbf{x}^{(i)}) - y^{(i)}\right)^2 \qquad \tau(a) = \frac{1 - e^{-a}}{1 + e^{-a}}$$

- We will see later in this lecture that the cost function of the logistic regression method is more suitable than this cost function for the classification problem ($K = 2$)

SLIDE 15

Perceptron algorithm

- Linear classifier, two-class: $y \in \{-1, 1\}$
  - $y = -1$ for $\mathcal{C}_2$, $y = 1$ for $\mathcal{C}_1$
- Goal: $\forall i,\; \mathbf{x}^{(i)} \in \mathcal{C}_1 \Rightarrow \mathbf{w}^T\mathbf{x}^{(i)} > 0$ and $\forall i,\; \mathbf{x}^{(i)} \in \mathcal{C}_2 \Rightarrow \mathbf{w}^T\mathbf{x}^{(i)} < 0$
- $g(\mathbf{x}; \mathbf{w}) = \operatorname{sign}(\mathbf{w}^T\mathbf{x})$

SLIDE 16

Perceptron criterion

$$J_P(\mathbf{w}) = -\sum_{i \in \mathcal{M}} \mathbf{w}^T\mathbf{x}^{(i)} y^{(i)}$$

$\mathcal{M}$: subset of the training data that are misclassified.

There may be many solutions; which one do we find?

SLIDE 17

Cost function

Figure [Duda, Hart & Stork, 2002]: the number of misclassifications as a cost function (piecewise constant in $(w_0, w_1)$) versus the perceptron cost function $J_P(\mathbf{w})$. There may be many solutions under these cost functions.

SLIDE 18

Batch Perceptron

"Gradient descent" on the optimization problem:

$$\mathbf{w}^{t+1} = \mathbf{w}^{t} - \eta\, \nabla_{\mathbf{w}} J_P(\mathbf{w}^{t}), \qquad \nabla_{\mathbf{w}} J_P(\mathbf{w}) = -\sum_{i \in \mathcal{M}} \mathbf{x}^{(i)} y^{(i)}$$

Batch Perceptron converges in a finite number of steps for linearly separable data:

Initialize $\mathbf{w}$
Repeat: $\mathbf{w} = \mathbf{w} + \eta \sum_{i \in \mathcal{M}} \mathbf{x}^{(i)} y^{(i)}$
Until $\left\| \eta \sum_{i \in \mathcal{M}} \mathbf{x}^{(i)} y^{(i)} \right\| < \theta$

SLIDE 19

Stochastic gradient descent for Perceptron

- Single-sample perceptron: if $\mathbf{x}^{(i)}$ is misclassified, $\mathbf{w}^{t+1} = \mathbf{w}^{t} + \eta\, \mathbf{x}^{(i)} y^{(i)}$
- Perceptron convergence theorem: if the training data are linearly separable, the single-sample perceptron is also guaranteed to find a solution in a finite number of steps

Fixed-increment single-sample Perceptron ($\eta$ can be set to 1 and the proof still works):

Initialize $\mathbf{w}$, $t \leftarrow 0$
repeat: $t \leftarrow t + 1$; $i \leftarrow t \bmod N$; if $\mathbf{x}^{(i)}$ is misclassified then $\mathbf{w} = \mathbf{w} + \mathbf{x}^{(i)} y^{(i)}$
until all patterns are properly classified
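The fixed-increment single-sample rule can be sketched directly (a minimal sketch; zero initialization and epoch-wise cycling are simplifying assumptions):

```python
import numpy as np

def perceptron_single_sample(X, y, max_epochs=100):
    """Fixed-increment single-sample perceptron (eta = 1): visit the
    samples cyclically and update on each mistake."""
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:   # x^(i) misclassified
                w = w + yi * xi      # w <- w + y^(i) x^(i)
                mistakes += 1
        if mistakes == 0:            # all patterns properly classified
            break
    return w

X = np.array([[1.0, 2.0, 0.0], [1.0, -2.0, 0.0]])  # bias absorbed via leading 1
y = np.array([1, -1])
w = perceptron_single_sample(X, y)
```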

SLIDE 20

Example

SLIDE 21

Perceptron: Example

Figure [Bishop]: each update changes $\mathbf{w}$ in a direction that corrects the error.

SLIDE 22

Online vs. Batch Learning

- Batch learning
  - Learn from all the examples at once
- Online learning
  - Gradually learn as each example is received
  - Examples: email classification; recommendation systems (recommending movies, predicting whether a user will be interested in a new news article)

SLIDE 23

Perceptron Convergence

- Assume there exists $\mathbf{w}^*$ with $\|\mathbf{w}^*\| = 1$ and some $\gamma > 0$ such that $y^{(t)}\, \mathbf{w}^{*T}\mathbf{x}^{(t)} \geq \gamma$ for all $t = 1, \dots, N$. Also assume that $\|\mathbf{x}^{(t)}\| \leq R$ for all $t = 1, \dots, N$. Then the Perceptron makes at most $\frac{R^2}{\gamma^2}$ errors.
SLIDE 24

Perceptron Convergence (more general case)

- It can be shown that the number of mistakes is bounded in terms of the quantities

$$R^2 = \max_{(\mathbf{x},y) \in D} \|\mathbf{x}\|^2, \qquad \nu = 2 \min_{(\mathbf{x},y) \in D} y\, (\hat{\mathbf{w}}^{0})^T\mathbf{x}, \qquad b = \min_{(\mathbf{x},y) \in D} y\, (\hat{\mathbf{w}}^{*})^T\mathbf{x}$$

where $\hat{\mathbf{w}}^{0}$ is the initial weight vector and $\hat{\mathbf{w}}^{*}$ is a separating solution.

SLIDE 25

Convergence of Perceptron

- For data sets that are not linearly separable, the single-sample perceptron learning algorithm will never converge [Duda, Hart & Stork, 2002]

SLIDE 26

Pocket algorithm

- For data that are not linearly separable due to noise: keep "in the pocket" the best $\mathbf{w}$ encountered so far.

Initialize $\mathbf{w}$
for $t = 1, \dots, T$:
  $i \leftarrow t \bmod N$
  if $\mathbf{x}^{(i)}$ is misclassified then $\mathbf{w}_{\text{new}} = \mathbf{w} + \mathbf{x}^{(i)} y^{(i)}$
  if $E_{\text{train}}(\mathbf{w}_{\text{new}}) < E_{\text{train}}(\mathbf{w})$ then $\mathbf{w} = \mathbf{w}_{\text{new}}$
end

$$E_{\text{train}}(\mathbf{w}) = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left[\operatorname{sign}(\mathbf{w}^T\mathbf{x}^{(i)}) \neq y^{(i)}\right]$$
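The pseudocode above can be sketched in Python (a minimal sketch following the slide's variant, in which a perceptron step is kept only when it lowers the training error; the toy data and zero initialization are assumptions):

```python
import numpy as np

def train_error(w, X, y):
    """E_train(w): fraction of misclassified samples."""
    return float(np.mean(np.sign(X @ w) != y))

def pocket(X, y, T=500):
    """Pocket algorithm: a tentative perceptron update is accepted
    only if it reduces the training error."""
    N = len(X)
    w = np.zeros(X.shape[1])
    for t in range(T):
        i = t % N
        if y[i] * (w @ X[i]) <= 0:               # x^(i) misclassified
            w_new = w + y[i] * X[i]              # tentative perceptron step
            if train_error(w_new, X, y) < train_error(w, X, y):
                w = w_new                        # pocket the better weights
    return w

X = np.array([[1.0, 2.0, 0.0], [1.0, -2.0, 0.0]])
y = np.array([1, -1])
w = pocket(X, y)
```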

SLIDE 27

Linear Discriminant Analysis (LDA)

- Fisher's Linear Discriminant Analysis:
  - Dimensionality reduction: finds linear combinations of features with large ratios of between-group scatter to within-group scatter (as new discriminant variables)
  - Classification: predicts the class of an observation $\mathbf{x}$ by first projecting it onto the space of discriminant variables and then classifying it in that space

SLIDE 28

Good Projection for Classification

- What is a good criterion?
  - Separating the different classes in the projected space

SLIDE 31

LDA Problem

- Problem definition:
  - $K = 2$ classes
  - $\{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^{N}$ training samples, with $N_1$ samples from the first class ($\mathcal{C}_1$) and $N_2$ samples from the second class ($\mathcal{C}_2$)
- Goal: find the direction $\mathbf{w}$ that we hope enables accurate classification
  - The projection of a sample $\mathbf{x}$ onto a line in direction $\mathbf{w}$ is $\mathbf{w}^T\mathbf{x}$
- What is a good measure of the separation between the projected points of the different classes?

SLIDE 32

Measure of Separation in the Projected Direction

- Is the direction of the line joining the class means a good candidate for $\mathbf{w}$? [Bishop]

SLIDE 33

Measure of Separation in the Projected Direction

- The direction of the line joining the class means is the solution of the following problem, which maximizes the separation of the projected class means:

$$\max_{\mathbf{w}} \; J(\mathbf{w}) = (\tilde{\mu}_1 - \tilde{\mu}_2)^2 \quad \text{s.t.} \; \|\mathbf{w}\| = 1$$

where $\tilde{\mu}_k = \mathbf{w}^T\boldsymbol{\mu}_k$ and $\boldsymbol{\mu}_k = \frac{1}{N_k} \sum_{\mathbf{x}^{(i)} \in \mathcal{C}_k} \mathbf{x}^{(i)}$ for $k = 1, 2$.

- What is the problem with a criterion that considers only $\tilde{\mu}_1 - \tilde{\mu}_2$?
  - It does not consider the variances of the classes in the projected direction

SLIDE 34

LDA Criterion

- Fisher's idea: maximize a function that gives
  - a large separation between the projected class means,
  - while also achieving a small variance within each class, thereby minimizing the class overlap:

$$J(\mathbf{w}) = \frac{(\tilde{\mu}_1 - \tilde{\mu}_2)^2}{\tilde{s}_1^2 + \tilde{s}_2^2}$$
SLIDE 35

LDA Criterion

- The scatters of the projected data are:

$$\tilde{s}_1^2 = \sum_{\mathbf{x}^{(i)} \in \mathcal{C}_1} \left(\mathbf{w}^T\mathbf{x}^{(i)} - \mathbf{w}^T\boldsymbol{\mu}_1\right)^2 \qquad \tilde{s}_2^2 = \sum_{\mathbf{x}^{(i)} \in \mathcal{C}_2} \left(\mathbf{w}^T\mathbf{x}^{(i)} - \mathbf{w}^T\boldsymbol{\mu}_2\right)^2$$

SLIDE 36

LDA Criterion

$$J(\mathbf{w}) = \frac{(\tilde{\mu}_1 - \tilde{\mu}_2)^2}{\tilde{s}_1^2 + \tilde{s}_2^2}$$

$$(\tilde{\mu}_1 - \tilde{\mu}_2)^2 = \left(\mathbf{w}^T\boldsymbol{\mu}_1 - \mathbf{w}^T\boldsymbol{\mu}_2\right)^2 = \mathbf{w}^T (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^T \mathbf{w}$$

$$\tilde{s}_1^2 = \sum_{\mathbf{x}^{(i)} \in \mathcal{C}_1} \left(\mathbf{w}^T\mathbf{x}^{(i)} - \mathbf{w}^T\boldsymbol{\mu}_1\right)^2 = \mathbf{w}^T \left[\sum_{\mathbf{x}^{(i)} \in \mathcal{C}_1} (\mathbf{x}^{(i)} - \boldsymbol{\mu}_1)(\mathbf{x}^{(i)} - \boldsymbol{\mu}_1)^T\right] \mathbf{w}$$

SLIDE 37

LDA Criterion

$$J(\mathbf{w}) = \frac{\mathbf{w}^T\mathbf{S}_B\mathbf{w}}{\mathbf{w}^T\mathbf{S}_W\mathbf{w}}$$

- Between-class scatter matrix: $\mathbf{S}_B = (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^T$
- Within-class scatter matrix: $\mathbf{S}_W = \mathbf{S}_1 + \mathbf{S}_2$, where $\mathbf{S}_k = \sum_{\mathbf{x}^{(i)} \in \mathcal{C}_k} (\mathbf{x}^{(i)} - \boldsymbol{\mu}_k)(\mathbf{x}^{(i)} - \boldsymbol{\mu}_k)^T$
- (A scatter matrix is $N$ times the covariance matrix)

SLIDE 38

LDA Derivation

$$J(\mathbf{w}) = \frac{\mathbf{w}^T\mathbf{S}_B\mathbf{w}}{\mathbf{w}^T\mathbf{S}_W\mathbf{w}}$$

Setting the gradient to zero:

$$\frac{\partial J}{\partial \mathbf{w}} = \frac{2\,\mathbf{S}_B\mathbf{w}\,(\mathbf{w}^T\mathbf{S}_W\mathbf{w}) - 2\,\mathbf{S}_W\mathbf{w}\,(\mathbf{w}^T\mathbf{S}_B\mathbf{w})}{(\mathbf{w}^T\mathbf{S}_W\mathbf{w})^2} = 0$$

$$\Rightarrow \mathbf{S}_B\mathbf{w} = \lambda\, \mathbf{S}_W\mathbf{w} \qquad \left(\lambda = \frac{\mathbf{w}^T\mathbf{S}_B\mathbf{w}}{\mathbf{w}^T\mathbf{S}_W\mathbf{w}}\right)$$

SLIDE 39

LDA Derivation

- $\mathbf{S}_B\mathbf{w}$ (for any vector $\mathbf{w}$) points in the same direction as $\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2$:

$$\mathbf{S}_B\mathbf{w} = (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^T\mathbf{w} \propto \boldsymbol{\mu}_1 - \boldsymbol{\mu}_2$$

- Thus, if $\mathbf{S}_W$ is full rank, we can solve the generalized eigenvalue problem $\mathbf{S}_B\mathbf{w} = \lambda\,\mathbf{S}_W\mathbf{w}$ (i.e., $\mathbf{S}_W^{-1}\mathbf{S}_B\mathbf{w} = \lambda\mathbf{w}$) immediately:

$$\mathbf{w} \propto \mathbf{S}_W^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$$

SLIDE 40

LDA Algorithm

- Find $\boldsymbol{\mu}_1$ and $\boldsymbol{\mu}_2$ as the means of class 1 and class 2, respectively
- Find $\mathbf{S}_1$ and $\mathbf{S}_2$ as the scatter matrices of class 1 and class 2, respectively
- $\mathbf{S}_W = \mathbf{S}_1 + \mathbf{S}_2$, $\mathbf{S}_B = (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^T$
- Feature extraction
  - $\mathbf{w} = \mathbf{S}_W^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$ is the eigenvector corresponding to the largest eigenvalue of $\mathbf{S}_W^{-1}\mathbf{S}_B$
- Classification
  - With $\mathbf{w} = \mathbf{S}_W^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$, a threshold on $\mathbf{w}^T\mathbf{x}$ classifies $\mathbf{x}$
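The steps above can be sketched in a few lines of numpy (a minimal sketch; it assumes $\mathbf{S}_W$ is invertible, and the threshold at the midpoint of the projected class means is one simple, hypothetical choice, not prescribed by the slide):

```python
import numpy as np

def fisher_lda(X1, X2):
    """w = S_W^{-1} (mu1 - mu2) for two classes given as (N_k, d) arrays."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - mu1).T @ (X1 - mu1)      # class-1 scatter matrix
    S2 = (X2 - mu2).T @ (X2 - mu2)      # class-2 scatter matrix
    Sw = S1 + S2                        # within-class scatter
    return np.linalg.solve(Sw, mu1 - mu2), mu1, mu2

def lda_classify(x, w, mu1, mu2):
    """Threshold w^T x at the midpoint of the projected class means."""
    thresh = w @ (mu1 + mu2) / 2.0
    return 1 if w @ x >= thresh else 2

X1 = np.array([[4.0, 2.0], [4.0, 4.0], [6.0, 2.0], [6.0, 4.0]])  # toy class 1
X2 = np.array([[0.0, 0.0], [0.0, 2.0], [2.0, 0.0], [2.0, 2.0]])  # toy class 2
w, mu1, mu2 = fisher_lda(X1, X2)
lda_classify(np.array([5.0, 3.0]), w, mu1, mu2)   # class 1
```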

SLIDE 41

Multi-class classification

- Solutions to multi-category problems:
  - Converting the problem to a set of two-class problems
  - Extending the learning algorithm to support multi-class: a function $g_j(\mathbf{x})$ is found for each class $j$, and $\hat{y} = \operatorname{argmax}_{j=1,\dots,K} g_j(\mathbf{x})$


SLIDE 43

Converting a multi-class problem to a set of two-class problems

- "One versus rest" (or "one against all")
  - For each class $\mathcal{C}_j$, a linear discriminant function separating the samples of $\mathcal{C}_j$ from all the other samples is found
  - Assumes the classes are totally linearly separable
- "One versus one"
  - $K(K-1)/2$ linear discriminant functions are used, one to separate the samples of each pair of classes
  - Assumes the classes are pairwise linearly separable
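Building the two sets of sub-problems is mechanical; a small sketch (function names are illustrative):

```python
from itertools import combinations

def one_vs_rest_labels(y, classes):
    """One two-class problem per class: +1 for class c, -1 for the rest."""
    return {c: [1 if yi == c else -1 for yi in y] for c in classes}

def one_vs_one_pairs(classes):
    """K(K-1)/2 pairwise problems, one per pair of classes."""
    return list(combinations(classes, 2))

classes = [1, 2, 3]
one_vs_rest_labels([1, 2, 3, 1], classes)[1]   # [1, -1, -1, 1]
one_vs_one_pairs(classes)                      # [(1, 2), (1, 3), (2, 3)], i.e. K(K-1)/2 = 3
```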

SLIDE 44

Multi-class classification

- One-vs-all (one-vs-rest)

Figure: three two-class problems in the $(x_1, x_2)$ plane, one per class (class 1 vs. rest, class 2 vs. rest, class 3 vs. rest).

SLIDE 45

Multi-class classification

- One-vs-one

Figure: three pairwise two-class problems in the $(x_1, x_2)$ plane, one per pair of classes (class 1 vs. 2, class 1 vs. 3, class 2 vs. 3).

SLIDE 46

Multi-class classification: ambiguity

- Converting the multi-class problem to a set of two-class problems can lead to regions in which the classification is undefined

Figure [Duda, Hart & Stork, 2002]: ambiguous regions for one versus rest and for one versus one.

SLIDE 47

Multi-class classification

- Solutions to multi-category problems:
  - Converting the problem to a set of two-class problems
  - Extending the learning algorithm to support multi-class: a function $g_j(\mathbf{x})$ is found for each class $j$, and $\hat{y} = \operatorname{argmax}_{j=1,\dots,K} g_j(\mathbf{x})$
- $\mathbf{x}$ is assigned to class $\mathcal{C}_j$ if $g_j(\mathbf{x}) > g_k(\mathbf{x})\;\; \forall k \neq j$

SLIDE 48

Discriminant Functions

- Discriminant functions: a discriminant function $g_j(\mathbf{x})$ for each class $\mathcal{C}_j$ ($j = 1, \dots, K$):
  - $\mathbf{x}$ is assigned to class $\mathcal{C}_j$ if $g_j(\mathbf{x}) > g_k(\mathbf{x})\;\; \forall k \neq j$
- Thus, we can easily divide the feature space into $K$ decision regions ($\mathcal{R}_j$: region of the $j$-th class):

$$\forall \mathbf{x},\; g_j(\mathbf{x}) > g_k(\mathbf{x})\;\; \forall k \neq j \;\Rightarrow\; \mathbf{x} \in \mathcal{R}_j$$

- Decision surfaces (or boundaries) can also be found using the discriminant functions
  - The boundary of $\mathcal{R}_j$ and $\mathcal{R}_k$, separating the samples of these two categories: $\forall \mathbf{x},\; g_j(\mathbf{x}) = g_k(\mathbf{x})$

SLIDE 49

Discriminant functions

- Discriminant functions can directly assign each vector $\mathbf{x}$ to a specific class
- A popular way of representing a classifier; many classification methods are based on discriminant functions
- Assumption: the classes are taken to be disjoint
  - The input space is thereby divided into decision regions whose boundaries are called decision boundaries or decision surfaces

SLIDE 50

Discriminant Functions: Two-Category

- Decision surface: $g(\mathbf{x}) = 0$
- For a two-category problem, we need only a single function $g : \mathbb{R}^d \to \mathbb{R}$:
  - $g_1(\mathbf{x}) = g(\mathbf{x})$
  - $g_2(\mathbf{x}) = -g(\mathbf{x})$

SLIDE 51

Multi-class classification: linear machine

- A linear discriminant function $g_j(\mathbf{x}) = \mathbf{w}_j^T\mathbf{x} + w_{j0}$ for each class $\mathcal{C}_j$ ($j = 1, \dots, K$):
  - $\mathbf{x}$ is assigned to class $\mathcal{C}_j$ if $g_j(\mathbf{x}) > g_k(\mathbf{x})\;\; \forall k \neq j$
- Decision surfaces (boundaries) can also be found using the discriminant functions
  - Boundary of the contiguous $\mathcal{R}_j$ and $\mathcal{R}_k$: $\forall \mathbf{x},\; g_j(\mathbf{x}) = g_k(\mathbf{x})$, i.e.

$$(\mathbf{w}_j - \mathbf{w}_k)^T\mathbf{x} + (w_{j0} - w_{k0}) = 0$$

- Decision regions are convex

SLIDE 52

Multi-class classification: linear machine

- Decision regions are convex
- Linear machines are most suitable for problems in which the class-conditional densities $p(\mathbf{x}|\mathcal{C}_j)$ are unimodal

Convex region definition: $\forall \mathbf{x}_A, \mathbf{x}_B \in \mathcal{R},\; 0 \leq \beta \leq 1 \;\Rightarrow\; \beta\mathbf{x}_A + (1-\beta)\mathbf{x}_B \in \mathcal{R}$

Proof that $\mathcal{R}_j$ is convex (using the linearity of $g_j$):

$$\mathbf{x}_A, \mathbf{x}_B \in \mathcal{R}_j \;\Rightarrow\; \forall k \neq j:\;\; g_j(\mathbf{x}_A) \geq g_k(\mathbf{x}_A),\;\; g_j(\mathbf{x}_B) \geq g_k(\mathbf{x}_B)$$

$$\Rightarrow\; \beta g_j(\mathbf{x}_A) + (1-\beta)\, g_j(\mathbf{x}_B) \geq \beta g_k(\mathbf{x}_A) + (1-\beta)\, g_k(\mathbf{x}_B)$$

$$\Rightarrow\; g_j\big(\beta\mathbf{x}_A + (1-\beta)\mathbf{x}_B\big) \geq g_k\big(\beta\mathbf{x}_A + (1-\beta)\mathbf{x}_B\big) \;\Rightarrow\; \beta\mathbf{x}_A + (1-\beta)\mathbf{x}_B \in \mathcal{R}_j$$

SLIDE 53

Multi-class classification: linear machine

Figure [Duda, Hart & Stork, 2002]: decision regions of a linear machine.

SLIDE 54

Perceptron: multi-class

$$\hat{y} = \operatorname{argmax}_{j=1,\dots,K} \mathbf{w}_j^T\mathbf{x}$$

$$J_P(\mathbf{W}) = -\sum_{i \in \mathcal{M}} \left(\mathbf{w}_{y^{(i)}} - \mathbf{w}_{\hat{y}^{(i)}}\right)^T \mathbf{x}^{(i)}$$

$\mathcal{M}$: subset of the training data that are misclassified, $\mathcal{M} = \{i \mid \hat{y}^{(i)} \neq y^{(i)}\}$

Initialize $\mathbf{W} = [\mathbf{w}_1, \dots, \mathbf{w}_K]$, $t \leftarrow 0$
repeat: $t \leftarrow (t + 1) \bmod N$; if $\mathbf{x}^{(i)}$ is misclassified then $\mathbf{w}_{\hat{y}^{(i)}} = \mathbf{w}_{\hat{y}^{(i)}} - \mathbf{x}^{(i)}$ and $\mathbf{w}_{y^{(i)}} = \mathbf{w}_{y^{(i)}} + \mathbf{x}^{(i)}$
until all patterns are properly classified

SLIDE 55

Resources

- C. Bishop, "Pattern Recognition and Machine Learning", Chapter 4.1.