SLIDE 1

Linear classifiers

CE-717: Machine Learning

Sharif University of Technology

  • M. Soleymani

Fall 2018

SLIDE 2

Topics

β€’ Linear classifiers
  β€’ Perceptron
  β€’ Fisher
  (SVM will be covered in later lectures)
β€’ Multi-class classification

SLIDE 3

Classification problem

β€’ Given: a training set of labeled input–output pairs $\mathcal{D} = \{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^{N}$
  β€’ $y \in \{1, \dots, K\}$
β€’ Goal: given an input $\mathbf{x}$, assign it to one of $K$ classes
β€’ Examples:
  β€’ Spam filter
  β€’ Handwritten digit recognition
  β€’ …

SLIDE 4

Linear classifiers

β€’ Decision boundaries are linear in $\mathbf{x}$, or linear in some given set of functions of $\mathbf{x}$
β€’ Linearly separable data: data points that can be exactly classified by a linear decision surface
β€’ Why linear classifiers?
  β€’ Even when they are not optimal, their simplicity is useful: they are relatively easy to compute
  β€’ In the absence of information suggesting otherwise, linear classifiers are attractive candidates for initial, trial classifiers

SLIDE 5

Two Category

β€’ $g(\mathbf{x}; \mathbf{w}) = \mathbf{w}^T\mathbf{x} + w_0 = w_0 + w_1 x_1 + \dots + w_d x_d$
  β€’ $\mathbf{x} = [x_1\ x_2\ \dots\ x_d]$
  β€’ $\mathbf{w} = [w_1\ w_2\ \dots\ w_d]$
  β€’ $w_0$: bias
β€’ Decision rule: if $\mathbf{w}^T\mathbf{x} + w_0 \ge 0$ then decide $\mathcal{C}_1$, else decide $\mathcal{C}_2$ (a small code sketch follows this slide)
β€’ Decision surface (boundary): $\mathbf{w}^T\mathbf{x} + w_0 = 0$
β€’ $\mathbf{w}$ is orthogonal to every vector lying within the decision surface
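As a minimal sketch (not part of the slides) of the two-category decision rule above; the weight vector, bias, and test point here are arbitrary values assumed only for illustration.

```python
import numpy as np

def g(x, w, w0):
    """Linear discriminant g(x; w) = w^T x + w0."""
    return w @ x + w0

def classify(x, w, w0):
    """Decide class 1 if g(x) >= 0, otherwise class 2."""
    return 1 if g(x, w, w0) >= 0 else 2

# Arbitrary parameters (assumed for illustration only)
w = np.array([-0.75, -1.0])   # normal vector of the decision surface
w0 = 3.0                      # bias
x = np.array([1.0, 1.0])
print(classify(x, w, w0))     # g(x) = 1.25 >= 0  ->  class 1
```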

SLIDE 6

Example

[Figure: a 2-D feature space $(x_1, x_2)$ with the decision boundary $3 - \tfrac{3}{4}x_1 - x_2 = 0$]

β€’ Decision rule: if $\mathbf{w}^T\mathbf{x} + w_0 \ge 0$ then decide $\mathcal{C}_1$, else decide $\mathcal{C}_2$

SLIDE 7

Linear classifier: Two Category

β€’ The decision boundary is a $(d-1)$-dimensional hyperplane $H$ in the $d$-dimensional feature space
  β€’ The orientation of $H$ is determined by the normal vector $[w_1, \dots, w_d]$
  β€’ $w_0$ determines the location of the surface
β€’ The normal distance from the origin to the decision surface is $\dfrac{|w_0|}{\|\mathbf{w}\|}$
β€’ Writing $\mathbf{x} = \mathbf{x}_\perp + r\,\dfrac{\mathbf{w}}{\|\mathbf{w}\|}$ with $\mathbf{x}_\perp$ on the surface ($g(\mathbf{x}_\perp) = 0$) gives $\mathbf{w}^T\mathbf{x} + w_0 = r\,\|\mathbf{w}\|$, so

$r = \dfrac{\mathbf{w}^T\mathbf{x} + w_0}{\|\mathbf{w}\|}$

is a signed measure of the perpendicular distance of the point $\mathbf{x}$ from the decision surface.

[Figure: a point $\mathbf{x}$ decomposed into $\mathbf{x}_\perp$ on the surface $g(\mathbf{x}) = 0$ plus a component along $\mathbf{w}$]

SLIDE 8

Linear boundary: geometry

[Figure: the hyperplane $\mathbf{w}^T\mathbf{x} + w_0 = 0$ separates the region $\mathbf{w}^T\mathbf{x} + w_0 > 0$ from the region $\mathbf{w}^T\mathbf{x} + w_0 < 0$; the normal vector $\mathbf{w}$ and the signed distance $\frac{\mathbf{w}^T\mathbf{x} + w_0}{\|\mathbf{w}\|}$ are indicated]

SLIDE 9

Non-linear decision boundary

β€’ Choose non-linear features
β€’ The classifier is still linear in the parameters $\mathbf{w}$
β€’ Feature map: $\boldsymbol{\phi}(\mathbf{x}) = [1,\ x_1,\ x_2,\ x_1^2,\ x_2^2,\ x_1 x_2]$ for $\mathbf{x} = [x_1, x_2]$
β€’ Decision rule: if $\mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}) \ge 0$ then $y = 1$, else $y = -1$ (a code sketch follows this slide)
β€’ Example: $\mathbf{w} = [w_0, w_1, \dots, w_5] = [-1, 0, 0, 1, 1, 0]$ gives the circular boundary $-1 + x_1^2 + x_2^2 = 0$

SLIDE 10

Cost function for linear classification

β€’ Finding a linear classifier can be formulated as an optimization problem:
  β€’ Select how to measure the prediction loss
  β€’ Based on the training set $\mathcal{D} = \{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^{N}$, a cost function $J(\mathbf{w})$ is defined
  β€’ Solve the resulting optimization problem to find the parameters:
    β€’ Find the optimal $\hat{g}(\mathbf{x}) = g(\mathbf{x}; \hat{\mathbf{w}})$ where $\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} J(\mathbf{w})$
β€’ Criterion or cost functions for classification:
  β€’ We will investigate several cost functions for the classification problem

SLIDE 11

SSE cost function for classification

The SSE cost function is not suitable for classification:
β€’ The least-squares loss penalizes "too correct" predictions (points that lie a long way on the correct side of the decision boundary)
β€’ The least-squares loss also lacks robustness to noise

$J(\mathbf{w}) = \sum_{i=1}^{N} \left(\mathbf{w}^T\mathbf{x}^{(i)} - y^{(i)}\right)^2 \qquad (K = 2)$

SLIDE 12

SSE cost function for classification

[Figure (from Bishop): the squared error $(\mathbf{w}^T\mathbf{x} - y)^2$ plotted against $\mathbf{w}^T\mathbf{x}$ for targets $y = 1$ and $y = -1$, showing correct predictions that are nevertheless penalized by SSE; $K = 2$]

SLIDE 13

SSE cost function for classification

β€’ Is it more suitable if we set $g(\mathbf{x}; \mathbf{w}) = \operatorname{sign}(\mathbf{w}^T\mathbf{x})$?

$J(\mathbf{w}) = \sum_{i=1}^{N} \left(\operatorname{sign}(\mathbf{w}^T\mathbf{x}^{(i)}) - y^{(i)}\right)^2, \qquad
\operatorname{sign}(a) = \begin{cases} -1, & a < 0 \\ \phantom{-}1, & a \ge 0 \end{cases} \qquad (K = 2)$

β€’ $J(\mathbf{w})$ is a piecewise-constant function of $\mathbf{w}$ that reflects the number of misclassifications, i.e., the training error incurred in classifying the training samples

[Figure: $J(\mathbf{w})$ as a piecewise-constant surface; $(\operatorname{sign}(\mathbf{w}^T\mathbf{x}) - y)^2$ against $\mathbf{w}^T\mathbf{x}$ for $y = 1$]

SLIDE 14

SSE cost function

β€’ Is it more suitable if we set $g(\mathbf{x}; \mathbf{w}) = \tau(\mathbf{w}^T\mathbf{x})$?

$J(\mathbf{w}) = \sum_{i=1}^{N} \left(\tau(\mathbf{w}^T\mathbf{x}^{(i)}) - y^{(i)}\right)^2, \qquad
\tau(a) = \frac{1 - e^{-a}}{1 + e^{-a}} \qquad (K = 2)$

β€’ We will see later in this lecture that the cost function of the logistic regression method is more suitable than this cost function for the classification problem

SLIDE 15

Perceptron algorithm

β€’ Linear classifier
β€’ Two-class: $y \in \{-1, 1\}$
  β€’ $y = -1$ for $\mathcal{C}_2$, $y = 1$ for $\mathcal{C}_1$
β€’ Goal:
  β€’ $\forall i,\ \mathbf{x}^{(i)} \in \mathcal{C}_1 \Rightarrow \mathbf{w}^T\mathbf{x}^{(i)} > 0$
  β€’ $\forall i,\ \mathbf{x}^{(i)} \in \mathcal{C}_2 \Rightarrow \mathbf{w}^T\mathbf{x}^{(i)} < 0$
β€’ $g(\mathbf{x}; \mathbf{w}) = \operatorname{sign}(\mathbf{w}^T\mathbf{x})$

SLIDE 16

Perceptron criterion

$J_P(\mathbf{w}) = -\sum_{i \in \mathcal{M}} \mathbf{w}^T\mathbf{x}^{(i)}\, y^{(i)}$

β€’ $\mathcal{M}$: the subset of training data that are misclassified
β€’ There may be many solutions — which one should we pick?

SLIDE 17

Cost function

[Figure (from Duda, Hart & Stork, 2002): the number of misclassifications as a cost function $J(\mathbf{w})$, and the Perceptron cost function $J_P(\mathbf{w})$, plotted over $(w_0, w_1)$; there may be many solutions under these cost functions]

SLIDE 18

Batch Perceptron

β€’ "Gradient descent" to solve the optimization problem (a code sketch follows this slide):

$\mathbf{w}^{t+1} = \mathbf{w}^{t} - \eta\, \nabla_{\mathbf{w}} J_P(\mathbf{w}^{t}), \qquad
\nabla_{\mathbf{w}} J_P(\mathbf{w}) = -\sum_{i \in \mathcal{M}} \mathbf{x}^{(i)} y^{(i)}$

β€’ Batch Perceptron algorithm:

Initialize $\mathbf{w}$
Repeat
  $\mathbf{w} \leftarrow \mathbf{w} + \eta \sum_{i \in \mathcal{M}} \mathbf{x}^{(i)} y^{(i)}$
Until $\left\|\eta \sum_{i \in \mathcal{M}} \mathbf{x}^{(i)} y^{(i)}\right\| < \epsilon$

β€’ The batch Perceptron converges in a finite number of steps for linearly separable data

SLIDE 19

Stochastic gradient descent for Perceptron

β€’ Single-sample Perceptron: if $\mathbf{x}^{(i)}$ is misclassified,

$\mathbf{w}^{t+1} = \mathbf{w}^{t} + \eta\, \mathbf{x}^{(i)} y^{(i)}$

β€’ Perceptron convergence theorem: if the training data are linearly separable, the single-sample Perceptron is also guaranteed to find a solution in a finite number of steps
β€’ Fixed-increment single-sample Perceptron ($\eta$ can be set to 1 and the proof still works; a code sketch follows this slide):

Initialize $\mathbf{w}$, $t \leftarrow 0$
Repeat
  $t \leftarrow t + 1$
  $i \leftarrow t \bmod N$
  if $\mathbf{x}^{(i)}$ is misclassified then $\mathbf{w} \leftarrow \mathbf{w} + \mathbf{x}^{(i)} y^{(i)}$
Until all patterns are correctly classified
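A compact sketch of the fixed-increment single-sample rule above (illustrative, not from the slides); it assumes labels in {-1, +1} and that a bias column has already been appended to X.

```python
import numpy as np

def single_sample_perceptron(X, y, max_epochs=100):
    """Fixed-increment single-sample Perceptron (eta = 1).

    Cycles through the samples; whenever x^(i) is misclassified
    (y_i * w^T x_i <= 0), adds x^(i) * y^(i) to w. Converges in finitely
    many updates if the data are linearly separable.
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(n):
            if y[i] * (w @ X[i]) <= 0:   # misclassified
                w += y[i] * X[i]
                mistakes += 1
        if mistakes == 0:                # all patterns correctly classified
            return w
    return w                             # may not converge if not separable
```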

SLIDE 20

Perceptron Convergence

β€’ It can be shown that, for linearly separable data, the number of updates is finite (the standard form of the bound is given after this slide); the bound involves
  β€’ $R^2 = \displaystyle\max_{(\mathbf{x}, y) \in \mathcal{D}} \|\mathbf{x}\|^2$
  β€’ $b = \displaystyle\min_{(\mathbf{x}, y) \in \mathcal{D}} y\, \mathbf{w}^{*T}\mathbf{x}$, for a separating weight vector $\mathbf{w}^*$
  β€’ a term depending on the initial weight vector, $\displaystyle 2\min_{(\mathbf{x}, y) \in \mathcal{D}} y\, \mathbf{w}^{(0)T}\mathbf{x}$

SLIDE 21

Example

SLIDE 22

Perceptron: Example

β€’ Each update changes $\mathbf{w}$ in a direction that corrects the error [Bishop]

SLIDE 23

Convergence of Perceptron

β€’ For data sets that are not linearly separable, the single-sample Perceptron learning algorithm will never converge

[Duda, Hart & Stork, 2002]

SLIDE 24

Pocket algorithm

β€’ For data that are not linearly separable (e.g., due to noise), keep "in your pocket" the best $\mathbf{w}$ encountered so far (a code sketch follows this slide):

Initialize $\mathbf{w}$
for $t = 1, \dots, T$
  $i \leftarrow t \bmod N$
  if $\mathbf{x}^{(i)}$ is misclassified then $\mathbf{w}_{\text{new}} \leftarrow \mathbf{w} + \mathbf{x}^{(i)} y^{(i)}$
  if $E_{\text{train}}(\mathbf{w}_{\text{new}}) < E_{\text{train}}(\mathbf{w})$ then $\mathbf{w} \leftarrow \mathbf{w}_{\text{new}}$
end

$E_{\text{train}}(\mathbf{w}) = \dfrac{1}{N} \sum_{n=1}^{N} \left[\operatorname{sign}(\mathbf{w}^T\mathbf{x}^{(n)}) \neq y^{(n)}\right]$

(where $[\cdot]$ is 1 if its argument holds and 0 otherwise)
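An illustrative NumPy sketch that mirrors the pseudo-code above (not from the slides): a Perceptron update on a misclassified sample is proposed, but kept only if it lowers the training error. A common variant instead runs the plain Perceptron underneath and stores the best weights in a separate "pocket" vector.

```python
import numpy as np

def train_error(w, X, y):
    """E_train(w): fraction of samples with sign(w^T x) != y."""
    return np.mean(np.sign(X @ w) != y)

def pocket(X, y, T=1000):
    """Pocket-style training: accept a Perceptron step only if it improves E_train."""
    n, d = X.shape
    w = np.zeros(d)
    for t in range(1, T + 1):
        i = t % n
        if y[i] * (X[i] @ w) <= 0:                    # x^(i) is misclassified
            w_new = w + y[i] * X[i]                   # candidate Perceptron update
            if train_error(w_new, X, y) < train_error(w, X, y):
                w = w_new                             # keep the better solution
    return w
```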

SLIDE 25

Linear Discriminant Analysis (LDA)

β€’ Fisher's Linear Discriminant Analysis:
  β€’ Dimensionality reduction: finds linear combinations of features with large ratios of between-group scatter to within-group scatter (used as new discriminant variables)
  β€’ Classification: predicts the class of an observation $\mathbf{x}$ by first projecting it onto the space of discriminant variables and then classifying it in that space

SLIDE 26

Good projection for classification

β€’ What is a good criterion?
  β€’ Separating the different classes in the projected space

SLIDE 27

Good projection for classification

β€’ What is a good criterion?
  β€’ Separating the different classes in the projected space

SLIDE 28

Good projection for classification

β€’ What is a good criterion?
  β€’ Separating the different classes in the projected space

[Figure: candidate projection direction $\mathbf{w}$]

SLIDE 29

LDA Problem

β€’ Problem definition:
  β€’ $K = 2$ classes
  β€’ $\{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^{N}$ training samples, with $N_1$ samples from the first class ($\mathcal{C}_1$) and $N_2$ samples from the second class ($\mathcal{C}_2$)
β€’ Goal: find the direction $\mathbf{w}$ that we hope will enable accurate classification
β€’ The projection of a sample $\mathbf{x}$ onto a line in direction $\mathbf{w}$ is $\mathbf{w}^T\mathbf{x}$
β€’ What is a good measure of the separation between the projected points of the different classes?

SLIDE 30

Measure of separation in the projected direction

β€’ Is the direction of the line joining the class means a good candidate for $\mathbf{w}$? [Bishop]

SLIDE 31

Measure of separation in the projected direction

β€’ The direction of the line joining the class means is the solution of the following problem; it maximizes the separation of the projected class means:

$\max_{\mathbf{w}}\ J(\mathbf{w}) = (\tilde{\mu}_1 - \tilde{\mu}_2)^2 \qquad \text{s.t.}\ \|\mathbf{w}\| = 1$

where

$\tilde{\mu}_1 = \mathbf{w}^T\boldsymbol{\mu}_1, \qquad \boldsymbol{\mu}_1 = \frac{1}{N_1}\sum_{\mathbf{x}^{(i)} \in \mathcal{C}_1} \mathbf{x}^{(i)}$

$\tilde{\mu}_2 = \mathbf{w}^T\boldsymbol{\mu}_2, \qquad \boldsymbol{\mu}_2 = \frac{1}{N_2}\sum_{\mathbf{x}^{(i)} \in \mathcal{C}_2} \mathbf{x}^{(i)}$

β€’ What is the problem with a criterion that considers only $\tilde{\mu}_1 - \tilde{\mu}_2$?
  β€’ It does not consider the variances of the classes in the projected direction

SLIDE 32

LDA Criteria

β€’ Fisher's idea: maximize a function that gives
  β€’ a large separation between the projected class means,
  β€’ while also achieving a small variance within each class, thereby minimizing the class overlap:

$J(\mathbf{w}) = \dfrac{(\tilde{\mu}_1 - \tilde{\mu}_2)^2}{\tilde{s}_1^2 + \tilde{s}_2^2}$

SLIDE 33

LDA Criteria

β€’ The scatters of the projected data are:

$\tilde{s}_1^2 = \sum_{\mathbf{x}^{(i)} \in \mathcal{C}_1} \left(\mathbf{w}^T\mathbf{x}^{(i)} - \mathbf{w}^T\boldsymbol{\mu}_1\right)^2$

$\tilde{s}_2^2 = \sum_{\mathbf{x}^{(i)} \in \mathcal{C}_2} \left(\mathbf{w}^T\mathbf{x}^{(i)} - \mathbf{w}^T\boldsymbol{\mu}_2\right)^2$

SLIDE 34

LDA Criteria

$J(\mathbf{w}) = \dfrac{(\tilde{\mu}_1 - \tilde{\mu}_2)^2}{\tilde{s}_1^2 + \tilde{s}_2^2}$

$(\tilde{\mu}_1 - \tilde{\mu}_2)^2 = \left(\mathbf{w}^T\boldsymbol{\mu}_1 - \mathbf{w}^T\boldsymbol{\mu}_2\right)^2 = \mathbf{w}^T(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^T\mathbf{w}$

$\tilde{s}_1^2 = \sum_{\mathbf{x}^{(i)} \in \mathcal{C}_1} \left(\mathbf{w}^T\mathbf{x}^{(i)} - \mathbf{w}^T\boldsymbol{\mu}_1\right)^2 = \mathbf{w}^T\left[\sum_{\mathbf{x}^{(i)} \in \mathcal{C}_1} (\mathbf{x}^{(i)} - \boldsymbol{\mu}_1)(\mathbf{x}^{(i)} - \boldsymbol{\mu}_1)^T\right]\mathbf{w}$

SLIDE 35

LDA Criteria

$J(\mathbf{w}) = \dfrac{\mathbf{w}^T\mathbf{S}_B\mathbf{w}}{\mathbf{w}^T\mathbf{S}_W\mathbf{w}}$

β€’ Between-class scatter matrix: $\mathbf{S}_B = (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^T$
β€’ Within-class scatter matrix: $\mathbf{S}_W = \mathbf{S}_1 + \mathbf{S}_2$, where

$\mathbf{S}_1 = \sum_{\mathbf{x}^{(i)} \in \mathcal{C}_1} (\mathbf{x}^{(i)} - \boldsymbol{\mu}_1)(\mathbf{x}^{(i)} - \boldsymbol{\mu}_1)^T, \qquad
\mathbf{S}_2 = \sum_{\mathbf{x}^{(i)} \in \mathcal{C}_2} (\mathbf{x}^{(i)} - \boldsymbol{\mu}_2)(\mathbf{x}^{(i)} - \boldsymbol{\mu}_2)^T$

β€’ (A scatter matrix is $N$ times the covariance matrix.)

SLIDE 36

LDA Derivation

$J(\mathbf{w}) = \dfrac{\mathbf{w}^T\mathbf{S}_B\mathbf{w}}{\mathbf{w}^T\mathbf{S}_W\mathbf{w}}$

$\dfrac{\partial J}{\partial \mathbf{w}} = \dfrac{2\,\mathbf{S}_B\mathbf{w}\,(\mathbf{w}^T\mathbf{S}_W\mathbf{w}) - 2\,\mathbf{S}_W\mathbf{w}\,(\mathbf{w}^T\mathbf{S}_B\mathbf{w})}{(\mathbf{w}^T\mathbf{S}_W\mathbf{w})^2} = 0$

$\Rightarrow\ (\mathbf{w}^T\mathbf{S}_W\mathbf{w})\,\mathbf{S}_B\mathbf{w} = (\mathbf{w}^T\mathbf{S}_B\mathbf{w})\,\mathbf{S}_W\mathbf{w}
\ \Rightarrow\ \mathbf{S}_B\mathbf{w} = \lambda\,\mathbf{S}_W\mathbf{w}$

(a generalized eigenvalue problem)

SLIDE 37

LDA Derivation

β€’ $\mathbf{S}_B\mathbf{w}$ (for any vector $\mathbf{w}$) points in the same direction as $\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2$:

$\mathbf{S}_B\mathbf{w} = (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^T\mathbf{w} \propto (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$

β€’ Thus, we can solve the eigenvalue problem immediately. If $\mathbf{S}_W$ is full rank:

$\mathbf{S}_B\mathbf{w} = \lambda\,\mathbf{S}_W\mathbf{w}
\ \Rightarrow\ \mathbf{S}_W^{-1}\mathbf{S}_B\mathbf{w} = \lambda\,\mathbf{w}
\ \Rightarrow\ \mathbf{w} \propto \mathbf{S}_W^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$

SLIDE 38

LDA Algorithm

β€’ Find $\boldsymbol{\mu}_1$ and $\boldsymbol{\mu}_2$ as the means of class 1 and class 2, respectively
β€’ Find $\mathbf{S}_1$ and $\mathbf{S}_2$ as the scatter matrices of class 1 and class 2, respectively
β€’ $\mathbf{S}_W = \mathbf{S}_1 + \mathbf{S}_2$
β€’ $\mathbf{S}_B = (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^T$
β€’ Feature extraction:
  β€’ $\mathbf{w} = \mathbf{S}_W^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$ is the eigenvector corresponding to the largest eigenvalue of $\mathbf{S}_W^{-1}\mathbf{S}_B$
β€’ Classification (a code sketch follows this slide):
  β€’ $\mathbf{w} = \mathbf{S}_W^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$
  β€’ Using a threshold on $\mathbf{w}^T\mathbf{x}$, we can classify $\mathbf{x}$
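An illustrative NumPy sketch of these steps (not from the slides); the classification threshold used here, the midpoint of the projected class means, is a common but assumed choice since the slide only says "a threshold on $\mathbf{w}^T\mathbf{x}$".

```python
import numpy as np

def fisher_lda(X1, X2):
    """Fisher LDA direction for two classes given as (N1, d) and (N2, d) arrays."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)    # class means
    S1 = (X1 - mu1).T @ (X1 - mu1)                 # scatter matrix of class 1
    S2 = (X2 - mu2).T @ (X2 - mu2)                 # scatter matrix of class 2
    Sw = S1 + S2                                   # within-class scatter
    w = np.linalg.solve(Sw, mu1 - mu2)             # w proportional to Sw^{-1}(mu1 - mu2)
    threshold = 0.5 * (w @ mu1 + w @ mu2)          # assumed threshold: midpoint
    return w, threshold

def classify(x, w, threshold):
    """Assign class 1 if the projection w^T x exceeds the threshold, else class 2."""
    return 1 if w @ x >= threshold else 2
```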

SLIDE 39

Converting a multi-class problem to a set of two-class problems

β€’ "One versus rest" (or "one against all")
  β€’ For each class $\mathcal{C}_j$, a linear discriminant function that separates the samples of $\mathcal{C}_j$ from all the other samples is found
  β€’ Assumes the data are totally linearly separable
β€’ "One versus one"
  β€’ $K(K-1)/2$ linear discriminant functions are used, one to separate the samples of each pair of classes
  β€’ Assumes the data are pairwise linearly separable

(A code sketch of both schemes follows this slide.)
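An illustrative sketch (not from the slides) of both decomposition schemes. The choice of a Perceptron as the base binary classifier, the argmax tie-breaking for one-vs-rest, and majority voting for one-vs-one are assumptions; the slides only describe the decomposition itself.

```python
import numpy as np
from itertools import combinations

def perceptron(X, y, epochs=100):
    """Plain fixed-increment Perceptron; y in {-1, +1}. Returns a weight vector."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:
                w += yi * xi
    return w

def one_vs_rest(X, y, K):
    """Train one 'class j versus the rest' classifier per class (labels 0..K-1)."""
    return [perceptron(X, np.where(y == j, 1, -1)) for j in range(K)]

def predict_ovr(x, W):
    """Assign the class whose discriminant w_j^T x is largest."""
    return int(np.argmax([w @ x for w in W]))

def one_vs_one(X, y, K):
    """Train K(K-1)/2 pairwise classifiers, one per pair of classes."""
    models = {}
    for a, b in combinations(range(K), 2):
        mask = (y == a) | (y == b)
        models[(a, b)] = perceptron(X[mask], np.where(y[mask] == a, 1, -1))
    return models

def predict_ovo(x, models, K):
    """Predict by majority vote over the pairwise classifiers."""
    votes = np.zeros(K, dtype=int)
    for (a, b), w in models.items():
        votes[a if w @ x >= 0 else b] += 1
    return int(np.argmax(votes))
```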

SLIDE 40

Multi-class classification

β€’ One-vs-all (one-vs-rest)

[Figure: three binary problems in the $(x_1, x_2)$ plane — class 1 vs. rest, class 2 vs. rest, class 3 vs. rest]

SLIDE 41

Multi-class classification

β€’ One-vs-one

[Figure: pairwise decision boundaries in the $(x_1, x_2)$ plane for classes 1, 2, and 3]

SLIDE 42

Multi-class classification: ambiguity

β€’ Converting the multi-class problem to a set of two-class problems can lead to regions in which the classification is undefined

[Figure (from Duda, Hart & Stork, 2002): ambiguous regions under the one-versus-rest and one-versus-one schemes]

SLIDE 43

Discriminant functions

β€’ A discriminant function can directly assign each vector $\mathbf{x}$ to a specific class
β€’ A popular way of representing a classifier
  β€’ Many classification methods are based on discriminant functions
β€’ Assumption: the classes are taken to be disjoint
  β€’ The input space is thereby divided into decision regions
  β€’ Their boundaries are called decision boundaries or decision surfaces

SLIDE 44

Discriminant Functions

β€’ A discriminant function $g_j(\mathbf{x})$ for each class $\mathcal{C}_j$ ($j = 1, \dots, K$):
  β€’ $\mathbf{x}$ is assigned to class $\mathcal{C}_j$ if $g_j(\mathbf{x}) > g_k(\mathbf{x})\ \ \forall k \neq j$
β€’ Thus, we can easily divide the feature space into $K$ decision regions ($\mathcal{R}_j$: region of the $j$-th class):

$\forall \mathbf{x}:\ g_j(\mathbf{x}) > g_k(\mathbf{x})\ \ \forall k \neq j \;\Rightarrow\; \mathbf{x} \in \mathcal{R}_j$

β€’ Decision surfaces (or boundaries) can also be found using the discriminant functions
  β€’ The boundary between $\mathcal{R}_j$ and $\mathcal{R}_k$, separating samples of these two categories: $g_j(\mathbf{x}) = g_k(\mathbf{x})$

SLIDE 45

Discriminant Functions: Two-Category

β€’ First, we explain the two-category classification problem and then discuss the multi-category problem
β€’ Binary classification: a target variable $y \in \{0, 1\}$ or $y \in \{-1, 1\}$
β€’ For a two-category problem it suffices to find a single function $g: \mathbb{R}^d \to \mathbb{R}$, with
  β€’ $g_1(\mathbf{x}) = g(\mathbf{x})$
  β€’ $g_2(\mathbf{x}) = -g(\mathbf{x})$
β€’ Decision surface: $g(\mathbf{x}) = 0$

SLIDE 46

Multi-class classification

β€’ Solutions to multi-category problems:
  β€’ Extend the learning algorithm to support multiple classes:
    β€’ A function $g_j(\mathbf{x})$ is found for each class $j$
    β€’ $\hat{y} = \arg\max_{j = 1, \dots, K} g_j(\mathbf{x})$, i.e., $\mathbf{x}$ is assigned to class $\mathcal{C}_j$ if $g_j(\mathbf{x}) > g_k(\mathbf{x})\ \forall k \neq j$
  β€’ Convert the problem to a set of two-class problems

SLIDE 47

Multi-class classification: linear machine

β€’ A linear discriminant function $g_j(\mathbf{x}) = \mathbf{w}_j^T\mathbf{x} + w_{j0}$ for each class $\mathcal{C}_j$ ($j = 1, \dots, K$):
  β€’ $\mathbf{x}$ is assigned to class $\mathcal{C}_j$ if $g_j(\mathbf{x}) > g_k(\mathbf{x})\ \ \forall k \neq j$
β€’ Decision surfaces (boundaries) can also be found using the discriminant functions
  β€’ The boundary between the contiguous regions $\mathcal{R}_j$ and $\mathcal{R}_k$: $g_j(\mathbf{x}) = g_k(\mathbf{x})$, i.e.,

$(\mathbf{w}_j - \mathbf{w}_k)^T\mathbf{x} + (w_{j0} - w_{k0}) = 0$

SLIDE 48

Multi-class classification: linear machine

[Figure from Duda, Hart & Stork, 2002]

SLIDE 49

Perceptron: multi-class

$\hat{y} = \arg\max_{j = 1, \dots, K} \mathbf{w}_j^T\mathbf{x}$

$J_P(\mathbf{W}) = -\sum_{i \in \mathcal{M}} \left(\mathbf{w}_{y^{(i)}} - \mathbf{w}_{\hat{y}^{(i)}}\right)^T \mathbf{x}^{(i)}$

β€’ $\mathcal{M}$: the subset of training data that are misclassified, $\mathcal{M} = \{\, i \mid \hat{y}^{(i)} \neq y^{(i)} \,\}$
β€’ Algorithm (a code sketch follows this slide):

Initialize $\mathbf{W} = [\mathbf{w}_1, \dots, \mathbf{w}_K]$, $t \leftarrow 0$
Repeat
  $t \leftarrow (t + 1) \bmod N$; take sample $i = t$
  if $\mathbf{x}^{(i)}$ is misclassified then
    $\mathbf{w}_{\hat{y}^{(i)}} \leftarrow \mathbf{w}_{\hat{y}^{(i)}} - \mathbf{x}^{(i)}$
    $\mathbf{w}_{y^{(i)}} \leftarrow \mathbf{w}_{y^{(i)}} + \mathbf{x}^{(i)}$
Until all patterns are correctly classified
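An illustrative NumPy sketch of this multi-class Perceptron update (not from the slides); it assumes integer labels 0..K-1 and that a bias column has been appended to X.

```python
import numpy as np

def multiclass_perceptron(X, y, K, max_epochs=100):
    """Multi-class Perceptron: predict argmax_j w_j^T x; on a mistake,
    add x to the true class's weights and subtract it from the predicted class's."""
    n, d = X.shape
    W = np.zeros((K, d))                    # one weight vector per class
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(n):
            y_hat = int(np.argmax(W @ X[i]))
            if y_hat != y[i]:               # x^(i) is misclassified
                W[y_hat] -= X[i]
                W[y[i]] += X[i]
                mistakes += 1
        if mistakes == 0:                   # all patterns correctly classified
            break
    return W
```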

SLIDE 50

Resources

β€’ C. Bishop, "Pattern Recognition and Machine Learning", Chapter 4.1.