

  1. Linear classifiers. CE-717: Machine Learning, Sharif University of Technology, M. Soleymani, Fall 2018

  2. Topics
  - Linear classifiers
  - Perceptron (SVM will be covered in later lectures)
  - Fisher
  - Multi-class classification

  3. Classification problem
  - Given: a training set, i.e. a labeled set of N input-output pairs D = {(x^{(i)}, y^{(i)})}_{i=1}^{N}, with y ∈ {1, ..., K}
  - Goal: given an input x, assign it to one of K classes
  - Examples: spam filtering, handwritten digit recognition, ...

  4. Linear classifiers
  - Decision boundaries are linear in x, or linear in some given set of functions of x
  - Linearly separable data: data points that can be exactly classified by a linear decision surface
  - Why linear classifiers?
    - Even when they are not optimal, we can exploit their simplicity
    - They are relatively easy to compute
    - In the absence of information suggesting otherwise, linear classifiers are attractive candidates for initial, trial classifiers

  5. Two-category case
  - g(x; w) = w^T x + w_0 = w_0 + w_1 x_1 + ... + w_d x_d
  - x = [x_1 x_2 ... x_d], w = [w_1 w_2 ... w_d], w_0: bias
  - Decision rule: if w^T x + w_0 β‰₯ 0 then C_1, else C_2
  - Decision surface (boundary): w^T x + w_0 = 0
  - w is orthogonal to every vector lying within the decision surface
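A minimal NumPy sketch of this discriminant and decision rule; the weight vector, bias, and test point below are made-up values for illustration only:

```python
import numpy as np

def g(x, w, w0):
    """Linear discriminant g(x; w) = w^T x + w0."""
    return w @ x + w0

def predict(x, w, w0):
    """Decision rule: class C1 if g(x) >= 0, otherwise class C2."""
    return "C1" if g(x, w, w0) >= 0 else "C2"

# Illustrative (made-up) parameters and point
w = np.array([3.0, -2.0])
w0 = 1.0
x = np.array([0.5, 2.0])
print(g(x, w, w0), predict(x, w, w0))
```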

  6. Example
  [figure] A 2-D example in the (x_1, x_2) plane: the decision boundary 3 - (3/4) x_1 - x_2 = 0, with the rule: if w^T x + w_0 β‰₯ 0 then C_1, else C_2

  7. Linear classifier: two-category case
  - The decision boundary is a (d-1)-dimensional hyperplane H in the d-dimensional feature space
  - The orientation of H is determined by the normal vector [w_1, ..., w_d]
  - w_0 determines the location of the surface; the normal distance from the origin to the decision surface is -w_0 / ||w||
  - Writing x = x_βŠ₯ + r w/||w|| gives r = (w^T x + w_0) / ||w||
  - So g(x) gives a signed measure of the perpendicular distance r of the point x from the decision surface (g(x) = 0 on the surface)
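As a quick check of the distance formula, here is a small sketch that evaluates r = (w^T x + w_0)/||w||; the boundary is the illustrative one from the example slide (3 - (3/4) x_1 - x_2 = 0), rewritten with w = [-3/4, -1] and w_0 = 3:

```python
import numpy as np

def signed_distance(x, w, w0):
    """Signed perpendicular distance r = (w^T x + w0) / ||w|| of point x from the hyperplane."""
    return (w @ x + w0) / np.linalg.norm(w)

w = np.array([-0.75, -1.0])
w0 = 3.0
print(signed_distance(np.array([0.0, 0.0]), w, w0))  # origin: |w0| / ||w|| = 2.4 away, positive side
print(signed_distance(np.array([4.0, 3.0]), w, w0))  # negative: this point lies on the other side
```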

  8. Linear boundary: geometry
  [figure] The regions w^T x + w_0 > 0 and w^T x + w_0 < 0 on either side of the boundary w^T x + w_0 = 0, with w normal to the boundary

  9. Non-linear decision boundary
  - Choose non-linear features; the classifier is still linear in the parameters
  - Example: Ο†(x) = [1, x_1, x_2, x_1^2, x_2^2, x_1 x_2] with x = [x_1, x_2]
  - w = [w_0, w_1, ..., w_5] = [-1, 0, 0, 1, 1, 0], giving the decision boundary -1 + x_1^2 + x_2^2 = 0
  - Decision rule: if w^T Ο†(x) β‰₯ 0 then y = 1, else y = -1
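A short sketch of this example: the classifier is linear in the parameters w but non-linear in x through the feature map Ο†, and the weights are the ones given on the slide:

```python
import numpy as np

def phi(x):
    """Quadratic feature map phi(x) = [1, x1, x2, x1^2, x2^2, x1*x2]."""
    x1, x2 = x
    return np.array([1.0, x1, x2, x1**2, x2**2, x1 * x2])

# Weights from the slide: decision boundary -1 + x1^2 + x2^2 = 0 (the unit circle)
w = np.array([-1.0, 0.0, 0.0, 1.0, 1.0, 0.0])

def predict(x):
    """y = 1 if w^T phi(x) >= 0 (on or outside the unit circle), else y = -1."""
    return 1 if w @ phi(x) >= 0 else -1

print(predict(np.array([0.0, 0.0])))  # inside the circle  -> -1
print(predict(np.array([2.0, 0.0])))  # outside the circle -> +1
```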

  10. Cost function for linear classification
  - Finding linear classifiers can be formulated as an optimization problem:
    - Select how to measure the prediction loss
    - Based on the training set D = {(x^{(i)}, y^{(i)})}_{i=1}^{N}, define a cost function J(w)
    - Solve the resulting optimization problem to find the parameters: Ε΅ = argmin_w J(w), giving the optimal ĝ(x) = g(x; Ε΅)
  - Criterion or cost functions for classification: we will investigate several cost functions for the classification problem

  11. SSE cost function for classification (K = 2)
    J(w) = Ξ£_{i=1}^{N} (w^T x^{(i)} - y^{(i)})^2
  The SSE cost function is not suitable for classification:
  - The least-squares loss penalizes 'too correct' predictions (those that lie a long way on the correct side of the decision boundary)
  - The least-squares loss also lacks robustness to noise
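A small illustration of the 'too correct' issue with SSE, using made-up 1-D data; both points are on the correct side of the boundary, yet the far-away one dominates the cost:

```python
import numpy as np

def sse_cost(w, X, y):
    """J(w) = sum_i (w^T x_i - y_i)^2 for labels y_i in {-1, +1}."""
    return np.sum((X @ w - y) ** 2)

# Made-up data: both samples are classified correctly by sign(w^T x) with w = [1]
X = np.array([[1.0], [5.0]])
y = np.array([1.0, 1.0])
w = np.array([1.0])
print(sse_cost(w, X, y))  # (1 - 1)^2 + (5 - 1)^2 = 16: the 'too correct' point x = 5 is heavily penalized
```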

  12. SSE cost function for classification (K = 2)
  [figure, from Bishop] Plots of the loss (w^T x - y)^2 against w^T x for y = 1 and y = -1: correct predictions far beyond the target value are still penalized by SSE

  13. SSE cost function for classification (K = 2)
  - Is it more suitable if we set g(x; w) = sign(w^T x), where sign(a) = -1 if a < 0 and 1 if a β‰₯ 0?
    J(w) = Ξ£_{i=1}^{N} (sign(w^T x^{(i)}) - y^{(i)})^2
  - J(w) is then a piecewise-constant function of w that counts the misclassifications, i.e. the training error incurred in classifying the training samples (so it gives no useful gradient information)
  [figure] Plot of (sign(w^T x) - y)^2 against w^T x for y = 1, and of the piecewise-constant J(w)

  14. SSE cost function (K = 2)
  - Is it more suitable if we set g(x; w) = Οƒ(w^T x), where Οƒ(a) = (1 - e^{-a}) / (1 + e^{-a})?
    J(w) = Ξ£_{i=1}^{N} (Οƒ(w^T x^{(i)}) - y^{(i)})^2
  - We will see later in this lecture that the cost function of the logistic regression method is more suitable than this cost function for the classification problem
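A sketch of this variant, assuming the symmetric sigmoid written on the slide (equivalently tanh(a/2)) and made-up toy data:

```python
import numpy as np

def sigma(a):
    """Symmetric sigmoid sigma(a) = (1 - exp(-a)) / (1 + exp(-a)) = tanh(a/2), with outputs in (-1, 1)."""
    return (1.0 - np.exp(-a)) / (1.0 + np.exp(-a))

def sse_sigmoid_cost(w, X, y):
    """J(w) = sum_i (sigma(w^T x_i) - y_i)^2 for labels in {-1, +1}."""
    return np.sum((sigma(X @ w) - y) ** 2)

# Made-up toy data
X = np.array([[1.0], [5.0], [-2.0]])
y = np.array([1.0, 1.0, -1.0])
print(sse_sigmoid_cost(np.array([1.0]), X, y))
```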

  15. Perceptron algorithm
  - Linear classifier, two-class: y ∈ {-1, 1}, with y = -1 for C_2 and y = 1 for C_1
  - Goal: find w such that
    βˆ€i, x^{(i)} ∈ C_1 β‡’ w^T x^{(i)} > 0
    βˆ€i, x^{(i)} ∈ C_2 β‡’ w^T x^{(i)} < 0
  - g(x; w) = sign(w^T x)

  16. Perceptron criterion
    J_P(w) = - Ξ£_{i ∈ M} w^T x^{(i)} y^{(i)}
  where M is the subset of training data that are misclassified
  - Many solutions? Which solution among them?
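A direct NumPy translation of the criterion, on made-up toy data; points exactly on the boundary are treated as misclassified here, which is a choice, not something fixed by the slide:

```python
import numpy as np

def perceptron_criterion(w, X, y):
    """J_P(w) = - sum over misclassified samples of y_i * (w^T x_i); non-negative by construction."""
    scores = X @ w
    misclassified = y * scores <= 0
    return -np.sum(y[misclassified] * scores[misclassified])

# Made-up toy data
X = np.array([[1.0, 2.0], [2.0, -1.0], [-1.0, -1.0]])
y = np.array([1, 1, -1])
print(perceptron_criterion(np.array([-0.5, 0.5]), X, y))
```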

  17. Cost functions
  [figure, from Duda, Hart & Stork, 2002] Two cost surfaces plotted over the weights (w_0, w_1): the number of misclassifications used as a cost function, and the perceptron criterion J_P(w); there may be many solutions under these cost functions

  18. Batch perceptron
  - Gradient descent to solve the optimization problem:
    w^{t+1} = w^t - Ξ· βˆ‡_w J_P(w^t), with βˆ‡_w J_P(w) = - Ξ£_{i ∈ M} x^{(i)} y^{(i)}
  - The batch perceptron converges in a finite number of steps for linearly separable data:
    Initialize w
    Repeat
      w = w + Ξ· Ξ£_{i ∈ M} x^{(i)} y^{(i)}
    Until ||Ξ· Ξ£_{i ∈ M} x^{(i)} y^{(i)}|| < ΞΈ
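A minimal sketch of the batch update, assuming labels in {-1, +1} and a constant 1 feature appended to absorb the bias; the toy data are made up:

```python
import numpy as np

def batch_perceptron(X, y, eta=1.0, theta=1e-6, max_iter=1000):
    """Batch perceptron: add eta times the sum of y_i * x_i over the misclassified samples
    until the update is smaller than the threshold theta."""
    w = np.zeros(X.shape[1])
    for _ in range(max_iter):
        misclassified = y * (X @ w) <= 0
        update = eta * (y[misclassified, None] * X[misclassified]).sum(axis=0)
        if np.linalg.norm(update) < theta:
            break
        w += update
    return w

# Made-up linearly separable toy data, last column is the constant 1 for the bias
X = np.array([[1.0, 2.0, 1.0], [2.0, 3.0, 1.0], [-1.0, -1.0, 1.0], [-2.0, -1.0, 1.0]])
y = np.array([1, 1, -1, -1])
print(batch_perceptron(X, y))
```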

  19. Stochastic gradient descent for the perceptron
  - Single-sample perceptron: if x^{(i)} is misclassified, w^{t+1} = w^t + Ξ· x^{(i)} y^{(i)}
  - Perceptron convergence theorem: if the training data are linearly separable, the single-sample perceptron is also guaranteed to find a solution in a finite number of steps
  - Fixed-increment single-sample perceptron (Ξ· can be set to 1 and the proof still works):
    Initialize w, t ← 0
    repeat
      t ← t + 1
      i ← t mod N
      if x^{(i)} is misclassified then w = w + x^{(i)} y^{(i)}
    until all patterns are properly classified
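The single-sample variant, following the pseudo-code above with Ξ· = 1; an epoch cap is added so the sketch also terminates on non-separable input:

```python
import numpy as np

def single_sample_perceptron(X, y, max_epochs=100):
    """Fixed-increment single-sample perceptron: cycle through the samples and update w
    on each misclassified one; stop after a full error-free pass."""
    N, d = X.shape
    w = np.zeros(d)
    for _ in range(max_epochs):
        errors = 0
        for i in range(N):
            if y[i] * (X[i] @ w) <= 0:   # misclassified (or on the boundary)
                w += y[i] * X[i]
                errors += 1
        if errors == 0:
            break
    return w

# Same made-up toy data as in the batch sketch
X = np.array([[1.0, 2.0, 1.0], [2.0, 3.0, 1.0], [-1.0, -1.0, 1.0], [-2.0, -1.0, 1.0]])
y = np.array([1, 1, -1, -1])
print(single_sample_perceptron(X, y))
```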

  20. Perceptron convergence
  - It can be shown that, for linearly separable data, the number of updates is bounded in terms of R = max_{(x,y)∈D} ||x|| and the margin Ξ³ = min_{(x,y)∈D} y w*^T x of a separating solution w* (the standard bound is of the order RΒ² ||w*||Β² / Ξ³Β²)

  21. Example [figure]

  22. Perceptron: example
  [figure, from Bishop] Each update changes w in a direction that corrects the error

  23. Convergence of the perceptron [Duda, Hart & Stork, 2002]
  - For data sets that are not linearly separable, the single-sample perceptron learning algorithm will never converge

  24. Pocket algorithm
  - For data that are not linearly separable (e.g. due to noise): keep 'in the pocket' the best w encountered so far
    Initialize w
    for t = 1, ..., T
      i ← t mod N
      if x^{(i)} is misclassified then
        w_new = w + x^{(i)} y^{(i)}
        if E_train(w_new) < E_train(w) then w = w_new
    end
  where E_train(w) = (1/N) Ξ£_{n=1}^{N} [sign(w^T x^{(n)}) β‰  y^{(n)}]
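A sketch following the pocket pseudo-code above, on made-up toy data with one deliberately flipped label so the set is not linearly separable:

```python
import numpy as np

def training_error(w, X, y):
    """Fraction of training samples whose predicted label (+1 if w^T x >= 0, else -1) differs from y."""
    preds = np.where(X @ w >= 0, 1, -1)
    return np.mean(preds != y)

def pocket(X, y, T=1000):
    """Pocket algorithm: perceptron-style updates, but a new weight vector is kept
    only if it lowers the training error."""
    N, d = X.shape
    w = np.zeros(d)
    for t in range(1, T + 1):
        i = t % N
        if y[i] * (X[i] @ w) <= 0:                     # misclassified
            w_new = w + y[i] * X[i]
            if training_error(w_new, X, y) < training_error(w, X, y):
                w = w_new
    return w

# Made-up toy data, last column is the constant 1 for the bias; the final label is a noisy flip
X = np.array([[1.0, 2.0, 1.0], [2.0, 3.0, 1.0], [-1.0, -1.0, 1.0], [-2.0, -1.0, 1.0], [1.5, 2.5, 1.0]])
y = np.array([1, 1, -1, -1, -1])
w = pocket(X, y)
print(w, training_error(w, X, y))
```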

  25. Linear Discriminant Analysis (LDA)
  - Fisher's linear discriminant analysis:
    - Dimensionality reduction: finds linear combinations of features with large ratios of between-group scatter to within-group scatter (as the new discriminant variables)
    - Classification: predicts the class of an observation x by first projecting it onto the space of discriminant variables and then classifying it in this space

  26-28. Good projection for classification
  - What is a good criterion? Separating the different classes in the projected space
  [figures] The same 2-D data projected onto different directions w, illustrating better and worse class separation after projection

  29. LDA problem
  - Problem definition:
    - K = 2 classes
    - Training samples {(x^{(i)}, y^{(i)})}_{i=1}^{N}, with N_1 samples from the first class (C_1) and N_2 samples from the second class (C_2)
  - Goal: find the best direction w that we hope will enable accurate classification
  - The projection of a sample x onto a line in direction w is w^T x
  - What is a measure of the separation between the projected points of different classes?

  30. Measure of separation in the projected direction
  - Is the direction of the line joining the class means a good candidate for w? [figure, from Bishop]

  31. Measure of separation in the projected direction
  - The direction of the line joining the class means is the solution of the following problem, which maximizes the separation of the projected class means:
    max_w J(w) = (mΜƒ_1 - mΜƒ_2)^2   s.t. ||w|| = 1
    where mΜƒ_k = w^T m_k, m_1 = (1/N_1) Ξ£_{x^{(i)} ∈ C_1} x^{(i)}, m_2 = (1/N_2) Ξ£_{x^{(i)} ∈ C_2} x^{(i)}
  - What is the problem with a criterion that considers only mΜƒ_1 - mΜƒ_2? It does not consider the variances of the classes in the projected direction
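A small sketch of this criterion on made-up 2-D data: project onto the (normalized) direction joining the class means and measure the separation of the projected means:

```python
import numpy as np

# Made-up 2-D toy data for the two classes
X1 = np.array([[2.0, 3.0], [3.0, 4.0], [2.5, 3.5]])   # class C1
X2 = np.array([[6.0, 1.0], [7.0, 2.0], [6.5, 1.5]])   # class C2

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)             # class means

# Direction of the line joining the class means, normalized so ||w|| = 1
w = (m1 - m2) / np.linalg.norm(m1 - m2)

# Separation of the projected class means, (m~1 - m~2)^2
m1_proj, m2_proj = w @ m1, w @ m2
print((m1_proj - m2_proj) ** 2)
```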

  32. LDA criterion
  - Fisher's idea: maximize a function that gives a large separation between the projected class means, while also achieving a small variance within each class, thereby minimizing the class overlap:
    J(w) = (mΜƒ_1 - mΜƒ_2)^2 / (sΜƒ_1^2 + sΜƒ_2^2)
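A sketch that evaluates this criterion for candidate directions, using the same made-up toy data; the within-class scatters sΜƒ_k^2 are taken here as the sums of squared deviations of the projections from their class mean, an assumption following the usual Fisher setup:

```python
import numpy as np

def fisher_criterion(w, X1, X2):
    """Fisher criterion J(w) = (m~1 - m~2)^2 / (s~1^2 + s~2^2) for a candidate direction w."""
    p1, p2 = X1 @ w, X2 @ w
    m1, m2 = p1.mean(), p2.mean()
    s1, s2 = np.sum((p1 - m1) ** 2), np.sum((p2 - m2) ** 2)
    return (m1 - m2) ** 2 / (s1 + s2)

# Same made-up toy data as above
X1 = np.array([[2.0, 3.0], [3.0, 4.0], [2.5, 3.5]])
X2 = np.array([[6.0, 1.0], [7.0, 2.0], [6.5, 1.5]])

# Compare the mean-difference direction against an arbitrary one
w_means = X1.mean(axis=0) - X2.mean(axis=0)
w_means /= np.linalg.norm(w_means)
print(fisher_criterion(w_means, X1, X2))
print(fisher_criterion(np.array([0.0, 1.0]), X1, X2))
```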
