

  1. Linear classifiers
     CE-717: Machine Learning
     Sharif University of Technology
     M. Soleymani, Fall 2016

  2. Topics
     • Discriminant functions
     • Linear classifiers
     • Perceptron (SVM will be covered in the later lectures)
     • Fisher
     • Multi-class classification

  3. Classification problem
     • Given: training set $\mathcal{D} = \{(\boldsymbol{x}^{(i)}, y^{(i)})\}_{i=1}^{N}$
       • a labeled set of $N$ input-output pairs
       • $y \in \{1, \dots, K\}$
     • Goal: given an input $\boldsymbol{x}$, assign it to one of $K$ classes
     • Examples:
       • Spam filter
       • Handwritten digit recognition
       • ...

  4. Discriminant functions
     • A discriminant function can directly assign each vector $\boldsymbol{x}$ to a specific class $k$
     • A popular way of representing a classifier
       • Many classification methods are based on discriminant functions
     • Assumption: the classes are taken to be disjoint
       • The input space is thereby divided into decision regions
       • Boundaries are called decision boundaries or decision surfaces

  5. Discriminant functions
     • Discriminant functions: a discriminant function $g_i(\boldsymbol{x})$ for each class $\mathcal{C}_i$ ($i = 1, \dots, K$):
       • $\boldsymbol{x}$ is assigned to class $\mathcal{C}_i$ if $g_i(\boldsymbol{x}) > g_j(\boldsymbol{x}) \;\; \forall j \neq i$
     • Thus, we can easily divide the feature space into $K$ decision regions
       • $\forall \boldsymbol{x}:\; g_i(\boldsymbol{x}) > g_j(\boldsymbol{x}) \;\; \forall j \neq i \;\Rightarrow\; \boldsymbol{x} \in \mathcal{R}_i$
       • $\mathcal{R}_i$: region of the $i$-th class
     • Decision surfaces (or boundaries) can also be found using discriminant functions
       • Boundary between $\mathcal{R}_i$ and $\mathcal{R}_j$, separating samples of these two categories: $g_i(\boldsymbol{x}) = g_j(\boldsymbol{x})$
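A minimal Python sketch of this decision rule (not part of the slides; the two linear discriminants in the example are made up for illustration):

```python
import numpy as np

def classify(x, discriminants):
    """Assign x to the class whose discriminant value is largest.

    discriminants: list of K functions g_i, each mapping a feature vector
    to a scalar score. Returns the winning index i, i.e. the class with
    g_i(x) > g_j(x) for all j != i (ties broken by the lowest index).
    """
    scores = np.array([g(x) for g in discriminants])
    return int(np.argmax(scores))

# Example with two hypothetical linear discriminants in 2-D:
g1 = lambda x: np.dot([1.0, 2.0], x) - 1.0
g2 = lambda x: np.dot([-1.0, 0.5], x) + 0.5
print(classify(np.array([1.0, 1.0]), [g1, g2]))  # -> 0 (class 1)
```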

  6. Discriminant functions: two-category
     • For a two-category problem, we need only a single function $g: \mathbb{R}^d \to \mathbb{R}$
       • $g_1(\boldsymbol{x}) = g(\boldsymbol{x})$
       • $g_2(\boldsymbol{x}) = -g(\boldsymbol{x})$
       • Decision surface: $g(\boldsymbol{x}) = 0$
     • First, we explain the two-category classification problem and then discuss multi-category problems
     • Binary classification: a target variable $y \in \{0, 1\}$ or $y \in \{-1, 1\}$

  7. Linear classifiers
     • Decision boundaries are linear in $\boldsymbol{x}$, or linear in some given set of functions of $\boldsymbol{x}$
     • Linearly separable data: data points that can be exactly classified by a linear decision surface
     • Why linear classifiers?
       • Even when they are not optimal, we can benefit from their simplicity
       • They are relatively easy to compute
       • In the absence of information suggesting otherwise, linear classifiers are attractive candidates for initial, trial classifiers

  8. Two-category
     • $g(\boldsymbol{x}; \boldsymbol{w}) = \boldsymbol{w}^T\boldsymbol{x} + w_0 = w_0 + w_1 x_1 + \dots + w_d x_d$
       • $\boldsymbol{x} = [x_1\; x_2\; \dots\; x_d]$
       • $\boldsymbol{w} = [w_1\; w_2\; \dots\; w_d]$
       • $w_0$: bias
     • Decision rule: if $\boldsymbol{w}^T\boldsymbol{x} + w_0 \geq 0$ then $\mathcal{C}_1$, else $\mathcal{C}_2$
     • Decision surface (boundary): $\boldsymbol{w}^T\boldsymbol{x} + w_0 = 0$
       • $\boldsymbol{w}$ is orthogonal to every vector lying within the decision surface
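The two-category rule above is a one-line threshold test; a small Python sketch (the weight values are illustrative, not from the slides):

```python
import numpy as np

def linear_decision(x, w, w0):
    """Two-category linear rule: class 1 if w^T x + w0 >= 0, else class 2."""
    g = np.dot(w, x) + w0
    return 1 if g >= 0 else 2

# Illustrative hyperplane 3*x1 - 2*x2 + 1 = 0:
w, w0 = np.array([3.0, -2.0]), 1.0
print(linear_decision(np.array([0.5, 2.0]), w, w0))  # g = -1.5 -> class 2
```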

  9. Example
     • Decision boundary: $3 - \frac{3}{4} x_1 - x_2 = 0$
     • If $\boldsymbol{w}^T\boldsymbol{x} + w_0 \geq 0$ then $\mathcal{C}_1$, else $\mathcal{C}_2$
     [Figure: the boundary plotted in the $(x_1, x_2)$ plane, crossing the $x_1$-axis at 4 and the $x_2$-axis at 3]

  10. Linear classifier: two-category
     • The decision boundary is a ($d-1$)-dimensional hyperplane $H$ in the $d$-dimensional feature space
       • The orientation of $H$ is determined by the normal vector $[w_1, \dots, w_d]$
       • $w_0$ determines the location of the surface
       • The normal distance from the origin to the decision surface is $\frac{w_0}{\|\boldsymbol{w}\|}$
     • Writing $\boldsymbol{x} = \boldsymbol{x}_\perp + r\,\frac{\boldsymbol{w}}{\|\boldsymbol{w}\|}$, with $\boldsymbol{x}_\perp$ the projection onto the surface ($g(\boldsymbol{x}_\perp) = 0$), gives
       $\boldsymbol{w}^T\boldsymbol{x} + w_0 = r\,\|\boldsymbol{w}\| \;\Rightarrow\; r = \frac{\boldsymbol{w}^T\boldsymbol{x} + w_0}{\|\boldsymbol{w}\|}$
     • So $g(\boldsymbol{x})$ gives a signed measure of the perpendicular distance $r$ of the point $\boldsymbol{x}$ from the decision surface
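The signed-distance formula can be checked numerically; this sketch reuses the illustrative hyperplane from the earlier snippet (again, not part of the slides):

```python
import numpy as np

def signed_distance(x, w, w0):
    """Signed perpendicular distance r = (w^T x + w0) / ||w|| of x from the
    hyperplane w^T x + w0 = 0 (positive on the side that w points toward)."""
    return (np.dot(w, x) + w0) / np.linalg.norm(w)

print(signed_distance(np.array([0.5, 2.0]), np.array([3.0, -2.0]), 1.0))
# -> -1.5 / sqrt(13) ~ -0.416
```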

  11. Linear boundary: geometry
     [Figure: the hyperplane $\boldsymbol{w}^T\boldsymbol{x} + w_0 = 0$ with normal vector $\boldsymbol{w}$, separating the half-space $\boldsymbol{w}^T\boldsymbol{x} + w_0 > 0$ from the half-space $\boldsymbol{w}^T\boldsymbol{x} + w_0 < 0$]

  12. Non-linear decision boundary
     • Choose non-linear features; the classifier is still linear in the parameters
       • $\boldsymbol{\phi}(\boldsymbol{x}) = [1,\; x_1,\; x_2,\; x_1^2,\; x_2^2,\; x_1 x_2]$, with $\boldsymbol{x} = [x_1, x_2]$
       • $\boldsymbol{w} = [w_0, w_1, \dots, w_5] = [-1, 0, 0, 1, 1, 0]$
       • Decision boundary: $-1 + x_1^2 + x_2^2 = 0$
     • Decision rule: if $\boldsymbol{w}^T\boldsymbol{\phi}(\boldsymbol{x}) \geq 0$ then $y = 1$, else $y = -1$
     [Figure: circular decision boundary of radius 1 in the $(x_1, x_2)$ plane]
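The feature map and weight vector on this slide can be written out directly; a short sketch assuming NumPy and the slide's values:

```python
import numpy as np

def phi(x):
    """Quadratic feature map from the slide: [1, x1, x2, x1^2, x2^2, x1*x2]."""
    x1, x2 = x
    return np.array([1.0, x1, x2, x1**2, x2**2, x1 * x2])

# Weights from the slide: the boundary -1 + x1^2 + x2^2 = 0 (unit circle).
w = np.array([-1.0, 0.0, 0.0, 1.0, 1.0, 0.0])

def classify(x):
    return 1 if np.dot(w, phi(x)) >= 0 else -1

print(classify(np.array([0.5, 0.5])))  # inside the circle  -> -1
print(classify(np.array([1.0, 1.0])))  # outside the circle -> +1
```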

  13. Cost function for linear classification
     • Finding a linear classifier can be formulated as an optimization problem:
       • Select how to measure the prediction loss
       • Based on the training set $\mathcal{D} = \{(\boldsymbol{x}^{(i)}, y^{(i)})\}_{i=1}^{N}$, a cost function $J(\boldsymbol{w})$ is defined
       • Solve the resulting optimization problem to find the parameters:
         • Find the optimal $\hat{\boldsymbol{w}} = \operatorname*{argmin}_{\boldsymbol{w}} J(\boldsymbol{w})$, where $g(\boldsymbol{x}) = g(\boldsymbol{x}; \hat{\boldsymbol{w}})$
     • Criteria or cost functions for classification:
       • We will investigate several cost functions for the classification problem

  14. SSE cost function for classification ($K = 2$)
     • $J(\boldsymbol{w}) = \sum_{i=1}^{N} \left(\boldsymbol{w}^T\boldsymbol{x}^{(i)} - y^{(i)}\right)^2$
     • The SSE cost function is not suitable for classification:
       • Least-squares loss penalizes "too correct" predictions (those that lie a long way on the correct side of the decision boundary)
       • Least-squares loss also lacks robustness to noise

  15. SSE cost function for classification ($K = 2$)
     [Figure, from Bishop: the squared error $(\boldsymbol{w}^T\boldsymbol{x} - y)^2$ plotted against $\boldsymbol{w}^T\boldsymbol{x}$ for $y = 1$ and $y = -1$; correct predictions that lie far beyond the target value are still penalized by SSE]

  16. SSE cost function for classification ($K = 2$)
     • Is it more suitable if we set $g(\boldsymbol{x}; \boldsymbol{w}) = \operatorname{sign}(\boldsymbol{w}^T\boldsymbol{x})$?
       • $J(\boldsymbol{w}) = \sum_{i=1}^{N} \left(\operatorname{sign}(\boldsymbol{w}^T\boldsymbol{x}^{(i)}) - y^{(i)}\right)^2$
       • $\operatorname{sign}(z) = \begin{cases} -1, & z < 0 \\ 1, & z \geq 0 \end{cases}$
     • $J(\boldsymbol{w})$ is a piecewise constant function that reflects the number of misclassifications, i.e., the training error incurred in classifying the training samples
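A small sketch of this piecewise-constant training-error criterion (not from the slides; labels are assumed to be in {-1, +1} and sign(0) is taken as +1, matching the slide's convention):

```python
import numpy as np

def training_errors(w, X, y):
    """Number of training misclassifications of the classifier sign(w^T x).

    X: (N, d) matrix of inputs, y: (N,) vector of labels in {-1, +1}.
    Because this count is piecewise constant in w, it gives no useful
    gradient information for optimization.
    """
    preds = np.where(X @ w >= 0, 1, -1)
    return int(np.sum(preds != y))
```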

  17. Perceptron algorithm
     • A linear classifier
     • Two-class: $y \in \{-1, 1\}$
       • $y = -1$ for $\mathcal{C}_2$, $y = 1$ for $\mathcal{C}_1$
     • Goal:
       • $\forall i,\; \boldsymbol{x}^{(i)} \in \mathcal{C}_1 \Rightarrow \boldsymbol{w}^T\boldsymbol{x}^{(i)} > 0$
       • $\forall i,\; \boldsymbol{x}^{(i)} \in \mathcal{C}_2 \Rightarrow \boldsymbol{w}^T\boldsymbol{x}^{(i)} < 0$
     • $g(\boldsymbol{x}; \boldsymbol{w}) = \operatorname{sign}(\boldsymbol{w}^T\boldsymbol{x})$

  18. Perceptron criterion
     • $J_P(\boldsymbol{w}) = -\sum_{i \in \mathcal{M}} \boldsymbol{w}^T\boldsymbol{x}^{(i)} y^{(i)}$
       • $\mathcal{M}$: subset of training data that are misclassified
     • There may be many solutions. Which one should we choose among them?
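The perceptron criterion is straightforward to compute; a sketch under the same assumptions as above (labels in {-1, +1}, samples as rows of a matrix X):

```python
import numpy as np

def perceptron_criterion(w, X, y):
    """Perceptron cost J_P(w) = - sum over misclassified i of (w^T x_i) y_i.

    A sample is treated as misclassified when y_i * w^T x_i <= 0, so every
    term added is non-negative and J_P(w) = 0 exactly when all samples are
    correctly (and strictly) classified.
    """
    margins = y * (X @ w)          # y_i * w^T x_i for every sample
    misclassified = margins <= 0
    return float(-np.sum(margins[misclassified]))
```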

  19. Cost function
     [Figure, from Duda, Hart, and Stork, 2002: the number of misclassifications as a cost function versus the perceptron criterion, plotted over the weight space $(w_0, w_1)$; there may be many solutions under these cost functions]

  20. Batch perceptron
     • "Gradient descent" to solve the optimization problem:
       • $\boldsymbol{w}^{t+1} = \boldsymbol{w}^{t} - \eta \nabla_{\boldsymbol{w}} J_P(\boldsymbol{w}^{t})$
       • $\nabla_{\boldsymbol{w}} J_P(\boldsymbol{w}) = -\sum_{i \in \mathcal{M}} \boldsymbol{x}^{(i)} y^{(i)}$
     • Batch perceptron converges in a finite number of steps for linearly separable data:

       Initialize $\boldsymbol{w}$
       Repeat
           $\boldsymbol{w} = \boldsymbol{w} + \eta \sum_{i \in \mathcal{M}} \boldsymbol{x}^{(i)} y^{(i)}$
       Until $\left\| \eta \sum_{i \in \mathcal{M}} \boldsymbol{x}^{(i)} y^{(i)} \right\| < \theta$
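A Python sketch of the batch pseudocode above (not from the slides; eta, theta, the iteration cap, and the zero initialization are illustrative choices, and the bias is assumed to be absorbed into X as a constant-1 column):

```python
import numpy as np

def batch_perceptron(X, y, eta=1.0, theta=1e-6, max_iter=1000):
    """Batch perceptron via gradient descent on the perceptron criterion J_P.

    X: (N, d) inputs (append a constant-1 column beforehand to learn a bias),
    y: (N,) labels in {-1, +1}.
    """
    w = np.zeros(X.shape[1])
    for _ in range(max_iter):
        margins = y * (X @ w)
        M = margins <= 0                                  # misclassified samples
        update = eta * (X[M] * y[M][:, None]).sum(axis=0)  # -eta * grad J_P
        w = w + update                                    # gradient-descent step
        if np.linalg.norm(update) < theta:                # stop when update is tiny
            break
    return w
```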

  21. Stochastic gradient descent for perceptron
     • Single-sample perceptron:
       • If $\boldsymbol{x}^{(i)}$ is misclassified: $\boldsymbol{w}^{t+1} = \boldsymbol{w}^{t} + \eta\, \boldsymbol{x}^{(i)} y^{(i)}$
     • Perceptron convergence theorem (for linearly separable data):
       • If the training data are linearly separable, the single-sample perceptron is also guaranteed to find a solution in a finite number of steps
     • Fixed-increment single-sample perceptron ($\eta$ can be set to 1 and the proof still works):

       Initialize $\boldsymbol{w}$, $t \leftarrow 0$
       Repeat
           $t \leftarrow t + 1$
           $i \leftarrow t \bmod N$
           if $\boldsymbol{x}^{(i)}$ is misclassified then $\boldsymbol{w} = \boldsymbol{w} + \boldsymbol{x}^{(i)} y^{(i)}$
       Until all patterns are properly classified
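A sketch of the fixed-increment single-sample version with eta = 1, as the slide suggests; the epoch cap is an added guard for non-separable data (see the later slides) and is not part of the slide's pseudocode:

```python
import numpy as np

def single_sample_perceptron(X, y, max_epochs=100):
    """Fixed-increment single-sample perceptron, cycling through the samples
    (equivalent to i = t mod N); terminates once a full pass makes no mistake.

    Guaranteed to stop for linearly separable data; max_epochs is a guard
    for the non-separable case, where it would otherwise loop forever.
    """
    N, d = X.shape
    w = np.zeros(d)
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(N):
            if y[i] * np.dot(w, X[i]) <= 0:   # x_i is misclassified
                w = w + X[i] * y[i]           # w <- w + x_i * y_i
                mistakes += 1
        if mistakes == 0:                     # all patterns properly classified
            return w
    return w
```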

  22. Example

  23. Perceptron: example
     • Change $\boldsymbol{w}$ in a direction that corrects the error
     [Bishop]

  24. Convergence of perceptron [Duda, Hart & Stork, 2002]
     • For data sets that are not linearly separable, the single-sample perceptron learning algorithm will never converge

  25. Pocket algorithm
     • For data that are not linearly separable due to noise:
       • Keep in a "pocket" the best $\boldsymbol{w}$ encountered so far

       Initialize $\boldsymbol{w}$
       for $t = 1, \dots, T$
           $i \leftarrow t \bmod N$
           if $\boldsymbol{x}^{(i)}$ is misclassified then
               $\boldsymbol{w}_{\text{new}} = \boldsymbol{w} + \boldsymbol{x}^{(i)} y^{(i)}$
               if $E_{\text{train}}(\boldsymbol{w}_{\text{new}}) < E_{\text{train}}(\boldsymbol{w})$ then $\boldsymbol{w} = \boldsymbol{w}_{\text{new}}$
       end

     • $E_{\text{train}}(\boldsymbol{w}) = \frac{1}{N} \sum_{n=1}^{N} \mathbb{1}\!\left[\operatorname{sign}(\boldsymbol{w}^T\boldsymbol{x}^{(n)}) \neq y^{(n)}\right]$
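A sketch following the slide's pocket pseudocode (labels in {-1, +1}, bias absorbed into X, and T chosen arbitrarily; as on the slide, an update is kept only if it lowers the training error, so the retained w is always the best one seen so far):

```python
import numpy as np

def train_error(w, X, y):
    """Fraction of samples with sign(w^T x) != y (sign(0) taken as +1)."""
    preds = np.where(X @ w >= 0, 1, -1)
    return np.mean(preds != y)

def pocket(X, y, T=1000):
    """Pocket-style training: perceptron updates are proposed, but a candidate
    weight vector is accepted only when it reduces the training error."""
    N, d = X.shape
    w = np.zeros(d)
    err = train_error(w, X, y)
    for t in range(1, T + 1):
        i = t % N                                # cycle through the samples
        if y[i] * np.dot(w, X[i]) <= 0:          # x_i is misclassified
            w_new = w + X[i] * y[i]              # candidate perceptron update
            err_new = train_error(w_new, X, y)
            if err_new < err:                    # keep it only if it improves
                w, err = w_new, err_new
    return w
```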

  26. Linear Discriminant Analysis (LDA)
     • Fisher's Linear Discriminant Analysis:
       • Dimensionality reduction
         • Finds linear combinations of features with large ratios of between-group scatter to within-group scatter (as new discriminant variables)
       • Classification
         • Predicts the class of an observation $\boldsymbol{x}$ by first projecting it onto the space of discriminant variables and then classifying it in this space

  27. Good projection for classification
     • What is a good criterion?
       • Separating different classes in the projected space

  28. Good projection for classification
     • What is a good criterion?
       • Separating different classes in the projected space

  29. Good projection for classification
     • What is a good criterion?
       • Separating different classes in the projected space
     [Figure: the projection direction $\boldsymbol{w}$]
