

SLIDE 1

Linear classifiers

CE-717: Machine Learning

Sharif University of Technology

M. Soleymani

Fall 2016

SLIDE 2

Topics

• Discriminant functions
• Linear classifiers
  • Perceptron
  • Fisher
• Multi-class classification

(SVM will be covered in later lectures.)

SLIDE 3

Classification problem

• Given: a training set
  • a labeled set of N input–output pairs D = {(x^(i), y^(i))}_{i=1}^N
  • y ∈ {1, …, K}
• Goal: given an input x, assign it to one of K classes
• Examples:
  • spam filtering
  • handwritten digit recognition
  • …

SLIDE 4

Discriminant functions

• A discriminant function can directly assign each input vector x to a specific class
• A popular way of representing a classifier
• Many classification methods are based on discriminant functions
• Assumption: the classes are taken to be disjoint
  • The input space is thereby divided into decision regions
  • whose boundaries are called decision boundaries or decision surfaces.

SLIDE 5

Discriminant Functions

• Discriminant functions: one function g_j(x) for each class C_j (j = 1, …, K):
  • x is assigned to class C_j if
      g_j(x) > g_k(x)  for all k ≠ j
• Thus, we can easily divide the feature space into K decision regions:
      ∀x:  g_j(x) > g_k(x) ∀k ≠ j  ⇒  x ∈ R_j
  (R_j: region of the j-th class)
• Decision surfaces (or boundaries) can also be found using discriminant functions
  • Boundary between R_j and R_k, separating samples of these two categories:
      {x : g_j(x) = g_k(x)}

SLIDE 6

Discriminant Functions: Two-Category

• For a two-category problem, we only need a single function g : ℝ^d → ℝ:
  • g_1(x) = g(x)
  • g_2(x) = −g(x)
  • Decision surface: g(x) = 0
• First, we explain the two-category classification problem and then discuss the multi-category problems.
• Binary classification: a target variable y ∈ {0, 1} or y ∈ {−1, 1}

SLIDE 7

Linear classifiers

• Decision boundaries are linear in x, or linear in some given set of functions of x
• Linearly separable data: data points that can be exactly classified by a linear decision surface
• Why linear classifiers?
  • Even when they are not optimal, their simplicity makes them attractive
  • They are relatively easy to compute
  • In the absence of information suggesting otherwise, linear classifiers are attractive candidates for initial, trial classifiers.

SLIDE 8

Two Category

• g(x; w) = w^T x + w0 = w0 + w1 x1 + … + wd xd
  • x = [x1 x2 … xd]
  • w = [w1 w2 … wd]
  • w0: bias
• if w^T x + w0 ≥ 0 then C1, else C2

Decision surface (boundary): w^T x + w0 = 0
w is orthogonal to every vector lying within the decision surface

SLIDE 9

Example

Decision boundary in the (x1, x2) plane: 3 − (3/4) x1 − x2 = 0
(i.e., w = [−3/4, −1]^T and w0 = 3)
if w^T x + w0 ≥ 0 then C1, else C2

SLIDE 10

Linear classifier: Two Category

• The decision boundary is a (d − 1)-dimensional hyperplane H in the d-dimensional feature space
  • The orientation of H is determined by the normal vector [w1, …, wd]
  • w0 determines the location of the surface
  • The normal distance from the origin to the decision surface is |w0| / ‖w‖
• Writing x = x⊥ + r (w / ‖w‖), where x⊥ is the projection of x onto the surface (g(x⊥) = 0):
      w^T x + w0 = r ‖w‖   ⇒   r = (w^T x + w0) / ‖w‖
  so g(x) = w^T x + w0 gives a signed measure of the perpendicular distance r of the point x from the decision surface
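
A minimal sketch (assuming NumPy, and using the example boundary from SLIDE 9, w = [−3/4, −1], w0 = 3) of how the discriminant value, the decision rule, and the signed distance r = (w^T x + w0)/‖w‖ fit together:

import numpy as np

# Example boundary from SLIDE 9: 3 - (3/4) x1 - x2 = 0
w = np.array([-0.75, -1.0])   # normal vector of the hyperplane
w0 = 3.0                      # bias

def g(x):
    """Linear discriminant g(x) = w^T x + w0."""
    return w @ x + w0

def predict(x):
    """Decision rule: class C1 if g(x) >= 0, else C2."""
    return "C1" if g(x) >= 0 else "C2"

def signed_distance(x):
    """Signed perpendicular distance r = g(x) / ||w|| of x from the boundary."""
    return g(x) / np.linalg.norm(w)

x = np.array([1.0, 1.0])
print(predict(x), signed_distance(x))   # below the line: C1, r = 1.0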

SLIDE 11

Linear boundary: geometry

(figure: the hyperplane w^T x + w0 = 0, with the region w^T x + w0 > 0 on the side that w points toward and w^T x + w0 < 0 on the other side; w is normal to the boundary and w^T x + w0 grows with the distance from it)

SLIDE 12

Non-linear decision boundary

• Choose non-linear features
• The classifier is still linear in the parameters w

Example: with x = [x1, x2] and the feature map
      φ(x) = [1, x1, x2, x1², x2², x1 x2]
      w = [w0, w1, …, w5] = [−1, 0, 0, 1, 1, 0]
the decision boundary is −1 + x1² + x2² = 0, and the rule is:
      if w^T φ(x) ≥ 0 then y = 1, else y = −1
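
A small sketch (assuming NumPy; the feature map and weight vector are the ones on this slide) of a classifier that is non-linear in x but linear in the parameters w:

import numpy as np

def phi(x):
    """Feature map phi(x) = [1, x1, x2, x1^2, x2^2, x1*x2]."""
    x1, x2 = x
    return np.array([1.0, x1, x2, x1**2, x2**2, x1 * x2])

w = np.array([-1.0, 0.0, 0.0, 1.0, 1.0, 0.0])   # boundary: -1 + x1^2 + x2^2 = 0

def predict(x):
    """y = +1 if w^T phi(x) >= 0, else -1 (the boundary is the unit circle in input space)."""
    return 1 if w @ phi(x) >= 0 else -1

print(predict([0.2, 0.3]))   # inside the unit circle  -> -1
print(predict([1.5, 0.0]))   # outside the unit circle -> +1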

SLIDE 13

Cost Function for linear classification

• Finding a linear classifier can be formulated as an optimization problem:
  • Select how to measure the prediction loss
  • Based on the training set D = {(x^(i), y^(i))}_{i=1}^N, a cost function J(w) is defined
• Solve the resulting optimization problem to find the parameters:
  • Find the optimal g(x) = g(x; w*) where w* = argmin_w J(w)
• Criterion or cost functions for classification:
  • We will investigate several cost functions for the classification problem

SLIDE 14

SSE cost function for classification

The SSE cost function is not suitable for classification:
• The least-squares loss penalizes predictions that are "too correct" (points that lie a long way on the correct side of the decision boundary)
• The least-squares loss also lacks robustness to noise

      J(w) = Σ_{i=1}^{N} (w^T x^(i) − y^(i))²        (K = 2, y ∈ {−1, 1})
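
A tiny numeric illustration (hypothetical scores) of the "too correct" problem: with target y = 1, a confidently correct score w^T x = 10 contributes far more squared error than a barely correct one, even though both are classified correctly:

# squared-error contribution (w^T x - y)^2 of one sample with target y = 1
for score in [0.9, 1.0, 3.0, 10.0]:        # all on the correct side (score >= 0)
    print(score, (score - 1.0) ** 2)       # 0.01, 0.0, 4.0, 81.0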

SLIDE 15

SSE cost function for classification

(figure, from [Bishop]: the squared error (w^T x − y)² plotted against w^T x for y = 1 and for y = −1 (K = 2); correct predictions that lie far on the correct side of the boundary are still penalized by SSE)

SLIDE 16

SSE cost function for classification

• Is it more suitable if we set g(x; w) = sign(w^T x)?

      J(w) = Σ_{i=1}^{N} (sign(w^T x^(i)) − y^(i))²        (K = 2)

      sign(z) = −1 if z < 0,  +1 if z ≥ 0

• J(w) is then a piecewise-constant function of w, proportional to the number of misclassifications
  • it is the training error incurred in classifying the training samples

(figure: (sign(w^T x) − y)² for y = 1, a step function of w^T x)
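
A sketch (assuming NumPy and targets in {−1, +1}) of this piecewise-constant cost: it simply counts the misclassified training samples (up to the constant factor from the squared difference):

import numpy as np

def zero_one_cost(w, X, y):
    """Number of misclassified samples; X holds one sample per row, y is in {-1, +1}."""
    scores = X @ w
    predictions = np.where(scores >= 0, 1, -1)   # sign, with sign(0) treated as +1
    return int(np.sum(predictions != y))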

SLIDE 17

Perceptron algorithm

• Linear classifier
• Two-class: y ∈ {−1, 1}
  • y = −1 for C2, y = 1 for C1
• Goal:
      ∀i: x^(i) ∈ C1 ⇒ w^T x^(i) > 0
      ∀i: x^(i) ∈ C2 ⇒ w^T x^(i) < 0
• g(x; w) = sign(w^T x)

SLIDE 18

Perceptron criterion

      J_P(w) = − Σ_{i∈ℳ} w^T x^(i) y^(i)

      ℳ: subset of the training data that are misclassified

Many solutions? Which solution among them?

SLIDE 19

Cost function

(figure, from [Duda, Hart & Stork, 2002]: the number of misclassifications J(w) and the perceptron cost J_P(w), each plotted over the weights (w0, w1))

There may be many solutions under these cost functions.

SLIDE 20

Batch Perceptron

"Gradient descent" to solve the optimization problem:

      w^(t+1) = w^(t) − η ∇_w J_P(w^(t))
      ∇_w J_P(w) = − Σ_{i∈ℳ} x^(i) y^(i)

Batch Perceptron converges in a finite number of steps for linearly separable data:

      Initialize w
      Repeat
          w = w + η Σ_{i∈ℳ} x^(i) y^(i)
      Until ‖η Σ_{i∈ℳ} x^(i) y^(i)‖ < θ
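
A minimal NumPy sketch of the batch update above (the learning-rate value, stopping threshold, and iteration cap are assumptions; the bias is handled by appending a constant 1 feature to every sample):

import numpy as np

def batch_perceptron(X, y, eta=1.0, theta=1e-6, max_iter=1000):
    """Batch perceptron. X is N x d (with a constant-1 column for the bias), y is in {-1, +1}."""
    w = np.zeros(X.shape[1])
    for _ in range(max_iter):
        misclassified = y * (X @ w) <= 0                          # the set M
        update = eta * (X[misclassified] * y[misclassified, None]).sum(axis=0)
        if np.linalg.norm(update) < theta:                        # stop when the total update is tiny
            break
        w = w + update                                            # w <- w + eta * sum_{i in M} x^(i) y^(i)
    return w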

SLIDE 21

Stochastic gradient descent for Perceptron

• Single-sample perceptron:
  • If x^(i) is misclassified:
      w^(t+1) = w^(t) + η x^(i) y^(i)
• Perceptron convergence theorem (for linearly separable data):
  • If the training data are linearly separable, the single-sample perceptron is also guaranteed to find a solution in a finite number of steps

Fixed-increment single-sample Perceptron (η can be set to 1 and the proof still works):

      Initialize w, t ← 0
      repeat
          t ← t + 1
          i ← t mod N
          if x^(i) is misclassified then w = w + x^(i) y^(i)
      until all patterns are properly classified
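
The fixed-increment single-sample variant, as a sketch under the same assumptions (η = 1, targets in {−1, +1}, bias folded into the features):

import numpy as np

def single_sample_perceptron(X, y, max_epochs=100):
    """Fixed-increment single-sample perceptron (eta = 1)."""
    N, d = X.shape
    w = np.zeros(d)
    for _ in range(max_epochs):
        errors = 0
        for i in range(N):                      # i <- t mod N: cycle through the samples
            if y[i] * (w @ X[i]) <= 0:          # x^(i) is misclassified
                w = w + X[i] * y[i]             # w <- w + x^(i) y^(i)
                errors += 1
        if errors == 0:                         # all patterns properly classified
            break
    return w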

SLIDE 22

Example

SLIDE 23

Perceptron: Example

Change w in a direction that corrects the error. [Bishop]

SLIDE 24

Convergence of Perceptron

• For data sets that are not linearly separable, the single-sample perceptron learning algorithm will never converge.

[Duda, Hart & Stork, 2002]

SLIDE 25

Pocket algorithm

• For data that are not linearly separable (e.g., due to noise):
  • Keep "in the pocket" the best w encountered up to now.

      Initialize w
      for t = 1, …, T
          i ← t mod N
          if x^(i) is misclassified then
              w_new = w + x^(i) y^(i)
              if E_train(w_new) < E_train(w) then w = w_new
      end

      E_train(w) = (1/N) Σ_{n=1}^{N} [sign(w^T x^(n)) ≠ y^(n)]
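
A sketch of the pocket idea (the exact bookkeeping varies between presentations; here the perceptron update always runs, and the best weights seen so far are kept "in the pocket" and returned at the end):

import numpy as np

def train_error(w, X, y):
    """E_train(w): fraction of samples with sign(w^T x) != y (y in {-1, +1})."""
    return float(np.mean(np.where(X @ w >= 0, 1, -1) != y))

def pocket(X, y, T=1000):
    N = X.shape[0]
    w = np.zeros(X.shape[1])
    best_w, best_err = w.copy(), train_error(w, X, y)
    for t in range(T):
        i = t % N
        if y[i] * (w @ X[i]) <= 0:               # x^(i) is misclassified
            w = w + X[i] * y[i]                  # ordinary perceptron update
            err = train_error(w, X, y)
            if err < best_err:                   # pocket the best w found so far
                best_w, best_err = w.copy(), err
    return best_w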

SLIDE 26

Linear Discriminant Analysis (LDA)

• Fisher's Linear Discriminant Analysis:
  • Dimensionality reduction
    • Finds linear combinations of features with large ratios of between-group scatter to within-group scatter (as new discriminant variables)
  • Classification
    • Predicts the class of an observation x by first projecting it onto the space of discriminant variables and then classifying it in this space

SLIDE 27

Good Projection for Classification

• What is a good criterion?
  • Separating the different classes in the projected space

SLIDE 28

Good Projection for Classification

• What is a good criterion?
  • Separating the different classes in the projected space

SLIDE 29

Good Projection for Classification

• What is a good criterion?
  • Separating the different classes in the projected space

(figure: a projection direction w)

SLIDE 30

LDA Problem

• Problem definition:
  • Two classes
  • {(x^(i), y^(i))}_{i=1}^N training samples, with N1 samples from the first class (C1) and N2 samples from the second class (C2)
• Goal: find the direction w that we hope will enable accurate classification
• The projection of a sample x onto a line in direction w is w^T x
• What is a good measure of the separation between the projected points of the different classes?

SLIDE 31

Measure of Separation in the Projected Direction

• Is the direction of the line joining the class means a good candidate for w?

(figure from [Bishop])

SLIDE 32

Measure of Separation in the Projected Direction

• The direction of the line joining the class means is the solution of the following problem:
  • It maximizes the separation of the projected class means

      max_w  J(w) = (μ1′ − μ2′)²
      s.t.   ‖w‖ = 1

      where  μ1′ = w^T μ1,   μ1 = (1/N1) Σ_{x^(i)∈C1} x^(i)
             μ2′ = w^T μ2,   μ2 = (1/N2) Σ_{x^(i)∈C2} x^(i)

• What is the problem with a criterion that considers only (μ1′ − μ2′)?
  • It does not consider the variances of the classes in the projected direction

SLIDE 33

LDA Criteria

• Fisher's idea: maximize a function that gives
  • a large separation between the projected class means,
  • while also achieving a small variance within each class, thereby minimizing the class overlap:

      J(w) = (μ1′ − μ2′)² / (s1′² + s2′²)

SLIDE 34

LDA Criteria

• The scatters of the original data are:

      s1² = Σ_{x^(i)∈C1} ‖x^(i) − μ1‖²
      s2² = Σ_{x^(i)∈C2} ‖x^(i) − μ2‖²

• The scatters of the projected data are:

      s1′² = Σ_{x^(i)∈C1} (w^T x^(i) − w^T μ1)²
      s2′² = Σ_{x^(i)∈C2} (w^T x^(i) − w^T μ2)²

SLIDE 35

LDA Criteria

      J(w) = (μ1′ − μ2′)² / (s1′² + s2′²)

      (μ1′ − μ2′)² = (w^T μ1 − w^T μ2)² = w^T (μ1 − μ2)(μ1 − μ2)^T w

      s1′² = Σ_{x^(i)∈C1} (w^T x^(i) − w^T μ1)²
           = w^T [ Σ_{x^(i)∈C1} (x^(i) − μ1)(x^(i) − μ1)^T ] w

SLIDE 36

LDA Criteria

      J(w) = (w^T S_B w) / (w^T S_W w)

      S_B = (μ1 − μ2)(μ1 − μ2)^T          (between-class scatter matrix)
      S_W = S1 + S2                        (within-class scatter matrix)
      S1 = Σ_{x^(i)∈C1} (x^(i) − μ1)(x^(i) − μ1)^T
      S2 = Σ_{x^(i)∈C2} (x^(i) − μ2)(x^(i) − μ2)^T

(scatter matrix = N × covariance matrix)

SLIDE 37

LDA Derivation

      J(w) = (w^T S_B w) / (w^T S_W w)

Setting the gradient of J(w) with respect to w to zero:

      ∇_w J(w) = [ 2 S_B w (w^T S_W w) − 2 S_W w (w^T S_B w) ] / (w^T S_W w)² = 0
      ⇒ (w^T S_W w) S_B w = (w^T S_B w) S_W w
      ⇒ S_B w = J(w) S_W w          (a generalized eigenvalue problem)

SLIDE 38

LDA Derivation

• S_B w (for any vector w) points in the same direction as μ1 − μ2:

      S_B w = (μ1 − μ2)(μ1 − μ2)^T w ∝ (μ1 − μ2)

• Thus, if S_W is full-rank, we can solve the eigenvalue problem immediately:

      S_B w = λ S_W w   ⇒   S_W⁻¹ S_B w = λ w   ⇒   w ∝ S_W⁻¹ (μ1 − μ2)

SLIDE 39

LDA Algorithm

• Find μ1 and μ2 as the means of class 1 and class 2 respectively
• Find S1 and S2 as the scatter matrices of class 1 and class 2 respectively
• S_W = S1 + S2
• S_B = (μ1 − μ2)(μ1 − μ2)^T
• Feature extraction:
  • w = S_W⁻¹(μ1 − μ2) is the eigenvector corresponding to the largest eigenvalue of S_W⁻¹ S_B
• Classification:
  • w = S_W⁻¹(μ1 − μ2)
  • Using a threshold on w^T x, we can classify x
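
A compact NumPy sketch of this recipe (the midpoint-of-projected-means threshold at the end is an assumption; the slide only says "a threshold on w^T x"):

import numpy as np

def fisher_lda(X1, X2):
    """Fisher direction w = S_W^{-1}(mu1 - mu2) for two classes; rows of X1, X2 are samples."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - mu1).T @ (X1 - mu1)          # scatter matrix of class 1
    S2 = (X2 - mu2).T @ (X2 - mu2)          # scatter matrix of class 2
    SW = S1 + S2                            # within-class scatter
    w = np.linalg.solve(SW, mu1 - mu2)      # w proportional to S_W^{-1}(mu1 - mu2)
    threshold = w @ (mu1 + mu2) / 2.0       # assumed threshold: midpoint of the projected means
    return w, threshold

def lda_classify(x, w, threshold):
    """Class 1 if the projection w^T x exceeds the threshold, otherwise class 2."""
    return 1 if w @ x > threshold else 2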

SLIDE 40

Multi-class classification

• Solutions to multi-category problems:
  • Extend the learning algorithm to support multi-class:
    • A function g_j(x) is found for each class j
    • ŷ = argmax_{j=1,…,K} g_j(x)
      (x is assigned to class C_j if g_j(x) > g_k(x) for all k ≠ j)
  • Convert the problem to a set of two-class problems

SLIDE 41

Converting multi-class problem to a set of two-class problems

• "one versus rest" or "one against all"
  • For each class C_j, a linear discriminant function that separates the samples of C_j from all the other samples is found.
  • Requires the data to be totally linearly separable
• "one versus one"
  • K(K − 1)/2 linear discriminant functions are used, one to separate the samples of each pair of classes.
  • Requires the data to be pairwise linearly separable

(sketches of both reductions follow below)
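
A sketch of both reductions (assuming NumPy and any two-class trainer with the interface of the perceptron sketches above, i.e. it returns a weight vector; ties and undefined regions are ignored here):

import numpy as np
from itertools import combinations

def one_vs_rest(X, y, classes, train_binary):
    """One classifier per class: its own samples get +1, all the rest get -1."""
    return {c: train_binary(X, np.where(y == c, 1, -1)) for c in classes}

def one_vs_one(X, y, classes, train_binary):
    """One classifier per pair of classes, trained only on that pair's samples."""
    models = {}
    for a, b in combinations(classes, 2):
        mask = (y == a) | (y == b)
        models[(a, b)] = train_binary(X[mask], np.where(y[mask] == a, 1, -1))
    return models

def predict_one_vs_rest(x, models):
    """Assign the class whose one-versus-rest score w^T x is largest."""
    return max(models, key=lambda c: models[c] @ x)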

SLIDE 42

Multi-class classification

• One-vs-all (one-vs-rest)

(figure: for each of classes 1, 2, 3, a linear boundary separating that class from the rest)

SLIDE 43

Multi-class classification

• One-vs-one

(figure: a linear boundary for each pair of the classes 1, 2, 3)

SLIDE 44

Multi-class classification: ambiguity

• Converting the multi-class problem to a set of two-class problems can lead to regions in which the classification is undefined

(figure, from [Duda, Hart & Stork, 2002]: ambiguous regions under "one versus rest" and under "one versus one")

SLIDE 45

Multi-class classification: linear machine

• A discriminant function g_j(x) = w_j^T x + w_{j0} for each class C_j (j = 1, …, K):
  • x is assigned to class C_j if:
      g_j(x) > g_k(x)  for all k ≠ j
• Decision surfaces (boundaries) can also be found using the discriminant functions
  • Boundary of the contiguous regions R_j and R_k:
      ∀x: g_j(x) = g_k(x)
      (w_j − w_k)^T x + (w_{j0} − w_{k0}) = 0

SLIDE 46

Multi-class classification: linear machine

[Duda, Hart & Stork, 2002]

SLIDE 47

Perceptron: multi-class

      ŷ = argmax_{j=1,…,K} w_j^T x

      J_P(W) = − Σ_{i∈ℳ} (w_{y^(i)} − w_{ŷ^(i)})^T x^(i)

      ℳ: subset of the training data that are misclassified, ℳ = {i | ŷ^(i) ≠ y^(i)}

      Initialize W = [w_1, …, w_K], t ← 0
      repeat
          t ← t + 1
          i ← t mod N
          if x^(i) is misclassified then
              w_{ŷ^(i)} = w_{ŷ^(i)} − x^(i)
              w_{y^(i)} = w_{y^(i)} + x^(i)
      until all patterns are properly classified
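
A sketch of this multi-class perceptron (assumptions: class labels 0 … K−1, unit learning rate, bias folded into the features):

import numpy as np

def multiclass_perceptron(X, y, K, max_epochs=100):
    """One weight vector per class; on a mistake, reward the true class and punish the predicted one."""
    N, d = X.shape
    W = np.zeros((K, d))                        # row j is w_j
    for _ in range(max_epochs):
        errors = 0
        for i in range(N):
            pred = int(np.argmax(W @ X[i]))     # y_hat = argmax_j w_j^T x^(i)
            if pred != y[i]:                    # x^(i) is misclassified
                W[y[i]] += X[i]                 # w_{y^(i)} <- w_{y^(i)} + x^(i)
                W[pred] -= X[i]                 # w_{y_hat} <- w_{y_hat} - x^(i)
                errors += 1
        if errors == 0:
            break
    return W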

SLIDE 48

Resources

• C. Bishop, "Pattern Recognition and Machine Learning", Chapter 4.1.