

SLIDE 1

Linear classifiers

CE-717: Machine Learning

Sharif University of Technology

M. Soleymani

Fall 2016

SLIDE 2

Topics

• Discriminant functions
• Linear classifiers
  • Perceptron
  • Fisher
• Multi-class classification

(SVM will be covered in later lectures.)

SLIDE 3

Classification problem

• Given: a training set
  • a labeled set of N input–output pairs D = {(x^(i), y^(i))}_{i=1}^N
  • y ∈ {1, …, K}
• Goal: given an input x, assign it to one of K classes
• Examples:
  • spam filtering
  • handwritten digit recognition
  • …

SLIDE 4

Discriminant functions

• A discriminant function can directly assign each input vector x to a specific class
• A popular way of representing a classifier
• Many classification methods are based on discriminant functions
• Assumption: the classes are taken to be disjoint
  • The input space is thereby divided into decision regions
  • whose boundaries are called decision boundaries or decision surfaces.

SLIDE 5

Discriminant Functions

• Discriminant functions: one function g_j(x) for each class C_j (j = 1, …, K):
  • x is assigned to class C_j if
      g_j(x) > g_k(x)  for all k ≠ j
• Thus, we can easily divide the feature space into K decision regions:
      ∀x:  g_j(x) > g_k(x) ∀k ≠ j  ⇒  x ∈ R_j
  (R_j: region of the j-th class)
• Decision surfaces (or boundaries) can also be found using discriminant functions
  • Boundary between R_j and R_k, separating samples of these two categories:
      {x : g_j(x) = g_k(x)}

SLIDE 6

Discriminant Functions: Two-Category

• For a two-category problem, we only need a single function g : ℝ^d → ℝ:
  • g_1(x) = g(x)
  • g_2(x) = −g(x)
  • Decision surface: g(x) = 0
• First, we explain the two-category classification problem and then discuss the multi-category problems.
• Binary classification: a target variable y ∈ {0, 1} or y ∈ {−1, 1}

SLIDE 7

Linear classifiers

• Decision boundaries are linear in x, or linear in some given set of functions of x
• Linearly separable data: data points that can be exactly classified by a linear decision surface
• Why linear classifiers?
  • Even when they are not optimal, their simplicity makes them attractive
  • They are relatively easy to compute
  • In the absence of information suggesting otherwise, linear classifiers are attractive candidates for initial, trial classifiers.

SLIDE 8

Two Category

• g(x; w) = w^T x + w0 = w0 + w1 x1 + … + wd xd
  • x = [x1 x2 … xd]
  • w = [w1 w2 … wd]
  • w0: bias
• if w^T x + w0 ≥ 0 then C1, else C2

Decision surface (boundary): w^T x + w0 = 0
w is orthogonal to every vector lying within the decision surface

SLIDE 9

Example

Decision boundary in the (x1, x2) plane: 3 − (3/4) x1 − x2 = 0
(i.e., w = [−3/4, −1]^T and w0 = 3)
if w^T x + w0 ≥ 0 then C1, else C2

SLIDE 10

Linear classifier: Two Category

• The decision boundary is a (d − 1)-dimensional hyperplane H in the d-dimensional feature space
  • The orientation of H is determined by the normal vector [w1, …, wd]
  • w0 determines the location of the surface
  • The normal distance from the origin to the decision surface is |w0| / ‖w‖
• Writing x = x⊥ + r (w / ‖w‖), where x⊥ is the projection of x onto the surface (g(x⊥) = 0):
      w^T x + w0 = r ‖w‖   ⇒   r = (w^T x + w0) / ‖w‖
  so g(x) = w^T x + w0 gives a signed measure of the perpendicular distance r of the point x from the decision surface
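
A minimal sketch (assuming NumPy, and using the example boundary from SLIDE 9, w = [−3/4, −1], w0 = 3) of how the discriminant value, the decision rule, and the signed distance r = (w^T x + w0)/‖w‖ fit together:

import numpy as np

# Example boundary from SLIDE 9: 3 - (3/4) x1 - x2 = 0
w = np.array([-0.75, -1.0])   # normal vector of the hyperplane
w0 = 3.0                      # bias

def g(x):
    """Linear discriminant g(x) = w^T x + w0."""
    return w @ x + w0

def predict(x):
    """Decision rule: class C1 if g(x) >= 0, else C2."""
    return "C1" if g(x) >= 0 else "C2"

def signed_distance(x):
    """Signed perpendicular distance r = g(x) / ||w|| of x from the boundary."""
    return g(x) / np.linalg.norm(w)

x = np.array([1.0, 1.0])
print(predict(x), signed_distance(x))   # below the line: C1, r = 1.0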

SLIDE 11

Linear boundary: geometry

(figure: the hyperplane w^T x + w0 = 0, with the region w^T x + w0 > 0 on the side that w points toward and w^T x + w0 < 0 on the other side; w is normal to the boundary and w^T x + w0 grows with the distance from it)

SLIDE 12

Non-linear decision boundary

• Choose non-linear features
• The classifier is still linear in the parameters w

Example: with x = [x1, x2] and the feature map
      φ(x) = [1, x1, x2, x1², x2², x1 x2]
      w = [w0, w1, …, w5] = [−1, 0, 0, 1, 1, 0]
the decision boundary is −1 + x1² + x2² = 0, and the rule is:
      if w^T φ(x) ≥ 0 then y = 1, else y = −1
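
A small sketch (assuming NumPy; the feature map and weight vector are the ones on this slide) of a classifier that is non-linear in x but linear in the parameters w:

import numpy as np

def phi(x):
    """Feature map phi(x) = [1, x1, x2, x1^2, x2^2, x1*x2]."""
    x1, x2 = x
    return np.array([1.0, x1, x2, x1**2, x2**2, x1 * x2])

w = np.array([-1.0, 0.0, 0.0, 1.0, 1.0, 0.0])   # boundary: -1 + x1^2 + x2^2 = 0

def predict(x):
    """y = +1 if w^T phi(x) >= 0, else -1 (the boundary is the unit circle in input space)."""
    return 1 if w @ phi(x) >= 0 else -1

print(predict([0.2, 0.3]))   # inside the unit circle  -> -1
print(predict([1.5, 0.0]))   # outside the unit circle -> +1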

SLIDE 13

Cost Function for linear classification

• Finding a linear classifier can be formulated as an optimization problem:
  • Select how to measure the prediction loss
  • Based on the training set D = {(x^(i), y^(i))}_{i=1}^N, a cost function J(w) is defined
• Solve the resulting optimization problem to find the parameters:
  • Find the optimal g(x) = g(x; w*) where w* = argmin_w J(w)
• Criterion or cost functions for classification:
  • We will investigate several cost functions for the classification problem

SLIDE 14

SSE cost function for classification

The SSE cost function is not suitable for classification:
• The least-squares loss penalizes predictions that are "too correct" (points that lie a long way on the correct side of the decision boundary)
• The least-squares loss also lacks robustness to noise

      J(w) = Σ_{i=1}^{N} (w^T x^(i) − y^(i))²        (K = 2, y ∈ {−1, 1})
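
A tiny numeric illustration (hypothetical scores) of the "too correct" problem: with target y = 1, a confidently correct score w^T x = 10 contributes far more squared error than a barely correct one, even though both are classified correctly:

# squared-error contribution (w^T x - y)^2 of one sample with target y = 1
for score in [0.9, 1.0, 3.0, 10.0]:        # all on the correct side (score >= 0)
    print(score, (score - 1.0) ** 2)       # 0.01, 0.0, 4.0, 81.0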

SLIDE 15

SSE cost function for classification

(figure, from [Bishop]: the squared error (w^T x − y)² plotted against w^T x for y = 1 and for y = −1 (K = 2); correct predictions that lie far on the correct side of the boundary are still penalized by SSE)

SLIDE 16

SSE cost function for classification

• Is it more suitable if we set g(x; w) = sign(w^T x)?

      J(w) = Σ_{i=1}^{N} (sign(w^T x^(i)) − y^(i))²        (K = 2)

      sign(z) = −1 if z < 0,  +1 if z ≥ 0

• J(w) is then a piecewise-constant function of w, proportional to the number of misclassifications
  • it is the training error incurred in classifying the training samples

(figure: (sign(w^T x) − y)² for y = 1, a step function of w^T x)
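
A sketch (assuming NumPy and targets in {−1, +1}) of this piecewise-constant cost: it simply counts the misclassified training samples (up to the constant factor from the squared difference):

import numpy as np

def zero_one_cost(w, X, y):
    """Number of misclassified samples; X holds one sample per row, y is in {-1, +1}."""
    scores = X @ w
    predictions = np.where(scores >= 0, 1, -1)   # sign, with sign(0) treated as +1
    return int(np.sum(predictions != y))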

SLIDE 17

Perceptron algorithm

• Linear classifier
• Two-class: y ∈ {−1, 1}
  • y = −1 for C2, y = 1 for C1
• Goal:
      ∀i: x^(i) ∈ C1 ⇒ w^T x^(i) > 0
      ∀i: x^(i) ∈ C2 ⇒ w^T x^(i) < 0
• g(x; w) = sign(w^T x)

SLIDE 18

Perceptron criterion

      J_P(w) = − Σ_{i∈ℳ} w^T x^(i) y^(i)

      ℳ: subset of the training data that are misclassified

Many solutions? Which solution among them?

SLIDE 19

Cost function

(figure, from [Duda, Hart & Stork, 2002]: the number of misclassifications J(w) and the perceptron cost J_P(w), each plotted over the weights (w0, w1))

There may be many solutions under these cost functions.

SLIDE 20

Batch Perceptron

"Gradient descent" to solve the optimization problem:

      w^(t+1) = w^(t) − η ∇_w J_P(w^(t))
      ∇_w J_P(w) = − Σ_{i∈ℳ} x^(i) y^(i)

Batch Perceptron converges in a finite number of steps for linearly separable data:

      Initialize w
      Repeat
          w = w + η Σ_{i∈ℳ} x^(i) y^(i)
      Until ‖η Σ_{i∈ℳ} x^(i) y^(i)‖ < θ
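
A minimal NumPy sketch of the batch update above (the learning-rate value, stopping threshold, and iteration cap are assumptions; the bias is handled by appending a constant 1 feature to every sample):

import numpy as np

def batch_perceptron(X, y, eta=1.0, theta=1e-6, max_iter=1000):
    """Batch perceptron. X is N x d (with a constant-1 column for the bias), y is in {-1, +1}."""
    w = np.zeros(X.shape[1])
    for _ in range(max_iter):
        misclassified = y * (X @ w) <= 0                          # the set M
        update = eta * (X[misclassified] * y[misclassified, None]).sum(axis=0)
        if np.linalg.norm(update) < theta:                        # stop when the total update is tiny
            break
        w = w + update                                            # w <- w + eta * sum_{i in M} x^(i) y^(i)
    return w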

SLIDE 21

Stochastic gradient descent for Perceptron

• Single-sample perceptron:
  • If x^(i) is misclassified:
      w^(t+1) = w^(t) + η x^(i) y^(i)
• Perceptron convergence theorem (for linearly separable data):
  • If the training data are linearly separable, the single-sample perceptron is also guaranteed to find a solution in a finite number of steps

Fixed-increment single-sample Perceptron (η can be set to 1 and the proof still works):

      Initialize w, t ← 0
      repeat
          t ← t + 1
          i ← t mod N
          if x^(i) is misclassified then w = w + x^(i) y^(i)
      until all patterns are properly classified
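
The fixed-increment single-sample variant, as a sketch under the same assumptions (η = 1, targets in {−1, +1}, bias folded into the features):

import numpy as np

def single_sample_perceptron(X, y, max_epochs=100):
    """Fixed-increment single-sample perceptron (eta = 1)."""
    N, d = X.shape
    w = np.zeros(d)
    for _ in range(max_epochs):
        errors = 0
        for i in range(N):                      # i <- t mod N: cycle through the samples
            if y[i] * (w @ X[i]) <= 0:          # x^(i) is misclassified
                w = w + X[i] * y[i]             # w <- w + x^(i) y^(i)
                errors += 1
        if errors == 0:                         # all patterns properly classified
            break
    return w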

SLIDE 22

Example

SLIDE 23

Perceptron: Example

Change w in a direction that corrects the error. [Bishop]

SLIDE 24

Convergence of Perceptron

• For data sets that are not linearly separable, the single-sample perceptron learning algorithm will never converge.

[Duda, Hart & Stork, 2002]

SLIDE 25

Pocket algorithm

• For data that are not linearly separable (e.g., due to noise):
  • Keep "in the pocket" the best w encountered up to now.

      Initialize w
      for t = 1, …, T
          i ← t mod N
          if x^(i) is misclassified then
              w_new = w + x^(i) y^(i)
              if E_train(w_new) < E_train(w) then w = w_new
      end

      E_train(w) = (1/N) Σ_{n=1}^{N} [sign(w^T x^(n)) ≠ y^(n)]
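
A sketch of the pocket idea (the exact bookkeeping varies between presentations; here the perceptron update always runs, and the best weights seen so far are kept "in the pocket" and returned at the end):

import numpy as np

def train_error(w, X, y):
    """E_train(w): fraction of samples with sign(w^T x) != y (y in {-1, +1})."""
    return float(np.mean(np.where(X @ w >= 0, 1, -1) != y))

def pocket(X, y, T=1000):
    N = X.shape[0]
    w = np.zeros(X.shape[1])
    best_w, best_err = w.copy(), train_error(w, X, y)
    for t in range(T):
        i = t % N
        if y[i] * (w @ X[i]) <= 0:               # x^(i) is misclassified
            w = w + X[i] * y[i]                  # ordinary perceptron update
            err = train_error(w, X, y)
            if err < best_err:                   # pocket the best w found so far
                best_w, best_err = w.copy(), err
    return best_w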

SLIDE 26

Linear Discriminant Analysis (LDA)

• Fisher's Linear Discriminant Analysis:
  • Dimensionality reduction
    • Finds linear combinations of features with large ratios of between-group scatter to within-group scatter (as new discriminant variables)
  • Classification
    • Predicts the class of an observation x by first projecting it onto the space of discriminant variables and then classifying it in this space

SLIDE 27

Good Projection for Classification

• What is a good criterion?
  • Separating the different classes in the projected space

SLIDE 28

Good Projection for Classification

• What is a good criterion?
  • Separating the different classes in the projected space

SLIDE 29

Good Projection for Classification

• What is a good criterion?
  • Separating the different classes in the projected space

(figure: a projection direction w)

SLIDE 30

LDA Problem

• Problem definition:
  • Two classes
  • {(x^(i), y^(i))}_{i=1}^N training samples, with N1 samples from the first class (C1) and N2 samples from the second class (C2)
• Goal: find the direction w that we hope will enable accurate classification
• The projection of a sample x onto a line in direction w is w^T x
• What is a good measure of the separation between the projected points of the different classes?

SLIDE 31

Measure of Separation in the Projected Direction

• Is the direction of the line joining the class means a good candidate for w?

(figure from [Bishop])

SLIDE 32

Measure of Separation in the Projected Direction

• The direction of the line joining the class means is the solution of the following problem:
  • It maximizes the separation of the projected class means

      max_w  J(w) = (μ1′ − μ2′)²
      s.t.   ‖w‖ = 1

      where  μ1′ = w^T μ1,   μ1 = (1/N1) Σ_{x^(i)∈C1} x^(i)
             μ2′ = w^T μ2,   μ2 = (1/N2) Σ_{x^(i)∈C2} x^(i)

• What is the problem with a criterion that considers only (μ1′ − μ2′)?
  • It does not consider the variances of the classes in the projected direction

SLIDE 33

LDA Criteria

• Fisher's idea: maximize a function that gives
  • a large separation between the projected class means,
  • while also achieving a small variance within each class, thereby minimizing the class overlap:

      J(w) = (μ1′ − μ2′)² / (s1′² + s2′²)

SLIDE 34

LDA Criteria

• The scatters of the original data are:

      s1² = Σ_{x^(i)∈C1} ‖x^(i) − μ1‖²
      s2² = Σ_{x^(i)∈C2} ‖x^(i) − μ2‖²

• The scatters of the projected data are:

      s1′² = Σ_{x^(i)∈C1} (w^T x^(i) − w^T μ1)²
      s2′² = Σ_{x^(i)∈C2} (w^T x^(i) − w^T μ2)²

SLIDE 35

LDA Criteria

      J(w) = (μ1′ − μ2′)² / (s1′² + s2′²)

      (μ1′ − μ2′)² = (w^T μ1 − w^T μ2)² = w^T (μ1 − μ2)(μ1 − μ2)^T w

      s1′² = Σ_{x^(i)∈C1} (w^T x^(i) − w^T μ1)²
           = w^T [ Σ_{x^(i)∈C1} (x^(i) − μ1)(x^(i) − μ1)^T ] w

SLIDE 36

LDA Criteria

      J(w) = (w^T S_B w) / (w^T S_W w)

      S_B = (μ1 − μ2)(μ1 − μ2)^T          (between-class scatter matrix)
      S_W = S1 + S2                        (within-class scatter matrix)
      S1 = Σ_{x^(i)∈C1} (x^(i) − μ1)(x^(i) − μ1)^T
      S2 = Σ_{x^(i)∈C2} (x^(i) − μ2)(x^(i) − μ2)^T

(scatter matrix = N × covariance matrix)

SLIDE 37

LDA Derivation

      J(w) = (w^T S_B w) / (w^T S_W w)

Setting the gradient of J(w) with respect to w to zero:

      ∇_w J(w) = [ 2 S_B w (w^T S_W w) − 2 S_W w (w^T S_B w) ] / (w^T S_W w)² = 0
      ⇒ (w^T S_W w) S_B w = (w^T S_B w) S_W w
      ⇒ S_B w = J(w) S_W w          (a generalized eigenvalue problem)

SLIDE 38

LDA Derivation

• S_B w (for any vector w) points in the same direction as μ1 − μ2:

      S_B w = (μ1 − μ2)(μ1 − μ2)^T w ∝ (μ1 − μ2)

• Thus, if S_W is full-rank, we can solve the eigenvalue problem immediately:

      S_B w = λ S_W w   ⇒   S_W⁻¹ S_B w = λ w   ⇒   w ∝ S_W⁻¹ (μ1 − μ2)

SLIDE 39

LDA Algorithm

• Find μ1 and μ2 as the means of class 1 and class 2 respectively
• Find S1 and S2 as the scatter matrices of class 1 and class 2 respectively
• S_W = S1 + S2
• S_B = (μ1 − μ2)(μ1 − μ2)^T
• Feature extraction:
  • w = S_W⁻¹(μ1 − μ2) is the eigenvector corresponding to the largest eigenvalue of S_W⁻¹ S_B
• Classification:
  • w = S_W⁻¹(μ1 − μ2)
  • Using a threshold on w^T x, we can classify x
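
A compact NumPy sketch of this recipe (the midpoint-of-projected-means threshold at the end is an assumption; the slide only says "a threshold on w^T x"):

import numpy as np

def fisher_lda(X1, X2):
    """Fisher direction w = S_W^{-1}(mu1 - mu2) for two classes; rows of X1, X2 are samples."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - mu1).T @ (X1 - mu1)          # scatter matrix of class 1
    S2 = (X2 - mu2).T @ (X2 - mu2)          # scatter matrix of class 2
    SW = S1 + S2                            # within-class scatter
    w = np.linalg.solve(SW, mu1 - mu2)      # w proportional to S_W^{-1}(mu1 - mu2)
    threshold = w @ (mu1 + mu2) / 2.0       # assumed threshold: midpoint of the projected means
    return w, threshold

def lda_classify(x, w, threshold):
    """Class 1 if the projection w^T x exceeds the threshold, otherwise class 2."""
    return 1 if w @ x > threshold else 2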

SLIDE 40

Multi-class classification

• Solutions to multi-category problems:
  • Extend the learning algorithm to support multi-class:
    • A function g_j(x) is found for each class j
    • ŷ = argmax_{j=1,…,K} g_j(x)
      (x is assigned to class C_j if g_j(x) > g_k(x) for all k ≠ j)
  • Convert the problem to a set of two-class problems

SLIDE 41

Converting multi-class problem to a set of two-class problems

• "one versus rest" or "one against all"
  • For each class C_j, a linear discriminant function that separates the samples of C_j from all the other samples is found.
  • Requires the data to be totally linearly separable
• "one versus one"
  • K(K − 1)/2 linear discriminant functions are used, one to separate the samples of each pair of classes.
  • Requires the data to be pairwise linearly separable

(sketches of both reductions follow below)
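
A sketch of both reductions (assuming NumPy and any two-class trainer with the interface of the perceptron sketches above, i.e. it returns a weight vector; ties and undefined regions are ignored here):

import numpy as np
from itertools import combinations

def one_vs_rest(X, y, classes, train_binary):
    """One classifier per class: its own samples get +1, all the rest get -1."""
    return {c: train_binary(X, np.where(y == c, 1, -1)) for c in classes}

def one_vs_one(X, y, classes, train_binary):
    """One classifier per pair of classes, trained only on that pair's samples."""
    models = {}
    for a, b in combinations(classes, 2):
        mask = (y == a) | (y == b)
        models[(a, b)] = train_binary(X[mask], np.where(y[mask] == a, 1, -1))
    return models

def predict_one_vs_rest(x, models):
    """Assign the class whose one-versus-rest score w^T x is largest."""
    return max(models, key=lambda c: models[c] @ x)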

SLIDE 42

Multi-class classification

• One-vs-all (one-vs-rest)

(figure: for each of classes 1, 2, 3, a linear boundary separating that class from the rest)

SLIDE 43

Multi-class classification

• One-vs-one

(figure: a linear boundary for each pair of the classes 1, 2, 3)

SLIDE 44

Multi-class classification: ambiguity

• Converting the multi-class problem to a set of two-class problems can lead to regions in which the classification is undefined

(figure, from [Duda, Hart & Stork, 2002]: ambiguous regions under "one versus rest" and under "one versus one")

SLIDE 45

Multi-class classification: linear machine

• A discriminant function g_j(x) = w_j^T x + w_{j0} for each class C_j (j = 1, …, K):
  • x is assigned to class C_j if:
      g_j(x) > g_k(x)  for all k ≠ j
• Decision surfaces (boundaries) can also be found using the discriminant functions
  • Boundary of the contiguous regions R_j and R_k:
      ∀x: g_j(x) = g_k(x)
      (w_j − w_k)^T x + (w_{j0} − w_{k0}) = 0

SLIDE 46

Multi-class classification: linear machine

[Duda, Hart & Stork, 2002]

SLIDE 47

Perceptron: multi-class

      ŷ = argmax_{j=1,…,K} w_j^T x

      J_P(W) = − Σ_{i∈ℳ} (w_{y^(i)} − w_{ŷ^(i)})^T x^(i)

      ℳ: subset of the training data that are misclassified, ℳ = {i | ŷ^(i) ≠ y^(i)}

      Initialize W = [w_1, …, w_K], t ← 0
      repeat
          t ← t + 1
          i ← t mod N
          if x^(i) is misclassified then
              w_{ŷ^(i)} = w_{ŷ^(i)} − x^(i)
              w_{y^(i)} = w_{y^(i)} + x^(i)
      until all patterns are properly classified
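
A sketch of this multi-class perceptron (assumptions: class labels 0 … K−1, unit learning rate, bias folded into the features):

import numpy as np

def multiclass_perceptron(X, y, K, max_epochs=100):
    """One weight vector per class; on a mistake, reward the true class and punish the predicted one."""
    N, d = X.shape
    W = np.zeros((K, d))                        # row j is w_j
    for _ in range(max_epochs):
        errors = 0
        for i in range(N):
            pred = int(np.argmax(W @ X[i]))     # y_hat = argmax_j w_j^T x^(i)
            if pred != y[i]:                    # x^(i) is misclassified
                W[y[i]] += X[i]                 # w_{y^(i)} <- w_{y^(i)} + x^(i)
                W[pred] -= X[i]                 # w_{y_hat} <- w_{y_hat} - x^(i)
                errors += 1
        if errors == 0:
            break
    return W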

SLIDE 48

Resources

• C. Bishop, "Pattern Recognition and Machine Learning", Chapter 4.1.