SLIDE 1

Linear classifiers

CE-717: Machine Learning

Sharif University of Technology

  • M. Soleymani

Fall 2018

SLIDE 2

Topics

β€’ Linear classifiers
  β€’ Perceptron
  β€’ Fisher
  (SVM will be covered in later lectures)
β€’ Multi-class classification

SLIDE 3

Classification problem

β€’ Given: a training set of labeled input–output pairs $\mathcal{D} = \{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^{N}$
  β€’ $y \in \{1, \dots, K\}$
β€’ Goal: given an input $\mathbf{x}$, assign it to one of $K$ classes
β€’ Examples:
  β€’ Spam filter
  β€’ Handwritten digit recognition
  β€’ …

SLIDE 4

Linear classifiers

β€’ Decision boundaries are linear in $\mathbf{x}$, or linear in some given set of functions of $\mathbf{x}$
β€’ Linearly separable data: data points that can be exactly classified by a linear decision surface
β€’ Why linear classifiers?
  β€’ Even when they are not optimal, their simplicity is useful: they are relatively easy to compute
  β€’ In the absence of information suggesting otherwise, linear classifiers are attractive candidates for initial, trial classifiers

SLIDE 5

Two Category

β€’ $g(\mathbf{x}; \mathbf{w}) = \mathbf{w}^T\mathbf{x} + w_0 = w_0 + w_1 x_1 + \dots + w_d x_d$
  β€’ $\mathbf{x} = [x_1\ x_2\ \dots\ x_d]$
  β€’ $\mathbf{w} = [w_1\ w_2\ \dots\ w_d]$
  β€’ $w_0$: bias
β€’ Decision rule: if $\mathbf{w}^T\mathbf{x} + w_0 \ge 0$ then decide $\mathcal{C}_1$, else decide $\mathcal{C}_2$ (a small code sketch follows this slide)
β€’ Decision surface (boundary): $\mathbf{w}^T\mathbf{x} + w_0 = 0$
β€’ $\mathbf{w}$ is orthogonal to every vector lying within the decision surface
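As a minimal sketch (not part of the slides) of the two-category decision rule above; the weight vector, bias, and test point here are arbitrary values assumed only for illustration.

```python
import numpy as np

def g(x, w, w0):
    """Linear discriminant g(x; w) = w^T x + w0."""
    return w @ x + w0

def classify(x, w, w0):
    """Decide class 1 if g(x) >= 0, otherwise class 2."""
    return 1 if g(x, w, w0) >= 0 else 2

# Arbitrary parameters (assumed for illustration only)
w = np.array([-0.75, -1.0])   # normal vector of the decision surface
w0 = 3.0                      # bias
x = np.array([1.0, 1.0])
print(classify(x, w, w0))     # g(x) = 1.25 >= 0  ->  class 1
```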

SLIDE 6

Example

[Figure: a 2-D feature space $(x_1, x_2)$ with the decision boundary $3 - \tfrac{3}{4}x_1 - x_2 = 0$]

β€’ Decision rule: if $\mathbf{w}^T\mathbf{x} + w_0 \ge 0$ then decide $\mathcal{C}_1$, else decide $\mathcal{C}_2$

SLIDE 7

Linear classifier: Two Category

β€’ The decision boundary is a $(d-1)$-dimensional hyperplane $H$ in the $d$-dimensional feature space
  β€’ The orientation of $H$ is determined by the normal vector $[w_1, \dots, w_d]$
  β€’ $w_0$ determines the location of the surface
β€’ The normal distance from the origin to the decision surface is $\dfrac{|w_0|}{\|\mathbf{w}\|}$
β€’ Writing $\mathbf{x} = \mathbf{x}_\perp + r\,\dfrac{\mathbf{w}}{\|\mathbf{w}\|}$ with $\mathbf{x}_\perp$ on the surface ($g(\mathbf{x}_\perp) = 0$) gives $\mathbf{w}^T\mathbf{x} + w_0 = r\,\|\mathbf{w}\|$, so

$r = \dfrac{\mathbf{w}^T\mathbf{x} + w_0}{\|\mathbf{w}\|}$

is a signed measure of the perpendicular distance of the point $\mathbf{x}$ from the decision surface.

[Figure: a point $\mathbf{x}$ decomposed into $\mathbf{x}_\perp$ on the surface $g(\mathbf{x}) = 0$ plus a component along $\mathbf{w}$]

SLIDE 8

Linear boundary: geometry

[Figure: the hyperplane $\mathbf{w}^T\mathbf{x} + w_0 = 0$ separates the region $\mathbf{w}^T\mathbf{x} + w_0 > 0$ from the region $\mathbf{w}^T\mathbf{x} + w_0 < 0$; the normal vector $\mathbf{w}$ and the signed distance $\frac{\mathbf{w}^T\mathbf{x} + w_0}{\|\mathbf{w}\|}$ are indicated]

SLIDE 9

Non-linear decision boundary

β€’ Choose non-linear features
β€’ The classifier is still linear in the parameters $\mathbf{w}$
β€’ Feature map: $\boldsymbol{\phi}(\mathbf{x}) = [1,\ x_1,\ x_2,\ x_1^2,\ x_2^2,\ x_1 x_2]$ for $\mathbf{x} = [x_1, x_2]$
β€’ Decision rule: if $\mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}) \ge 0$ then $y = 1$, else $y = -1$ (a code sketch follows this slide)
β€’ Example: $\mathbf{w} = [w_0, w_1, \dots, w_5] = [-1, 0, 0, 1, 1, 0]$ gives the circular boundary $-1 + x_1^2 + x_2^2 = 0$

SLIDE 10

Cost function for linear classification

β€’ Finding a linear classifier can be formulated as an optimization problem:
  β€’ Select how to measure the prediction loss
  β€’ Based on the training set $\mathcal{D} = \{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^{N}$, a cost function $J(\mathbf{w})$ is defined
  β€’ Solve the resulting optimization problem to find the parameters:
    β€’ Find the optimal $\hat{g}(\mathbf{x}) = g(\mathbf{x}; \hat{\mathbf{w}})$ where $\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} J(\mathbf{w})$
β€’ Criterion or cost functions for classification:
  β€’ We will investigate several cost functions for the classification problem

SLIDE 11

SSE cost function for classification

The SSE cost function is not suitable for classification:
β€’ The least-squares loss penalizes "too correct" predictions (points that lie a long way on the correct side of the decision boundary)
β€’ The least-squares loss also lacks robustness to noise

$J(\mathbf{w}) = \sum_{i=1}^{N} \left(\mathbf{w}^T\mathbf{x}^{(i)} - y^{(i)}\right)^2 \qquad (K = 2)$

SLIDE 12

SSE cost function for classification

[Figure (from Bishop): the squared error $(\mathbf{w}^T\mathbf{x} - y)^2$ plotted against $\mathbf{w}^T\mathbf{x}$ for targets $y = 1$ and $y = -1$, showing correct predictions that are nevertheless penalized by SSE; $K = 2$]

SLIDE 13

SSE cost function for classification

β€’ Is it more suitable if we set $g(\mathbf{x}; \mathbf{w}) = \operatorname{sign}(\mathbf{w}^T\mathbf{x})$?

$J(\mathbf{w}) = \sum_{i=1}^{N} \left(\operatorname{sign}(\mathbf{w}^T\mathbf{x}^{(i)}) - y^{(i)}\right)^2, \qquad
\operatorname{sign}(a) = \begin{cases} -1, & a < 0 \\ \phantom{-}1, & a \ge 0 \end{cases} \qquad (K = 2)$

β€’ $J(\mathbf{w})$ is a piecewise-constant function of $\mathbf{w}$ that reflects the number of misclassifications, i.e., the training error incurred in classifying the training samples

[Figure: $J(\mathbf{w})$ as a piecewise-constant surface; $(\operatorname{sign}(\mathbf{w}^T\mathbf{x}) - y)^2$ against $\mathbf{w}^T\mathbf{x}$ for $y = 1$]

SLIDE 14

SSE cost function

β€’ Is it more suitable if we set $g(\mathbf{x}; \mathbf{w}) = \tau(\mathbf{w}^T\mathbf{x})$?

$J(\mathbf{w}) = \sum_{i=1}^{N} \left(\tau(\mathbf{w}^T\mathbf{x}^{(i)}) - y^{(i)}\right)^2, \qquad
\tau(a) = \frac{1 - e^{-a}}{1 + e^{-a}} \qquad (K = 2)$

β€’ We will see later in this lecture that the cost function of the logistic regression method is more suitable than this cost function for the classification problem

SLIDE 15

Perceptron algorithm

β€’ Linear classifier
β€’ Two-class: $y \in \{-1, 1\}$
  β€’ $y = -1$ for $\mathcal{C}_2$, $y = 1$ for $\mathcal{C}_1$
β€’ Goal:
  β€’ $\forall i,\ \mathbf{x}^{(i)} \in \mathcal{C}_1 \Rightarrow \mathbf{w}^T\mathbf{x}^{(i)} > 0$
  β€’ $\forall i,\ \mathbf{x}^{(i)} \in \mathcal{C}_2 \Rightarrow \mathbf{w}^T\mathbf{x}^{(i)} < 0$
β€’ $g(\mathbf{x}; \mathbf{w}) = \operatorname{sign}(\mathbf{w}^T\mathbf{x})$

SLIDE 16

Perceptron criterion

$J_P(\mathbf{w}) = -\sum_{i \in \mathcal{M}} \mathbf{w}^T\mathbf{x}^{(i)}\, y^{(i)}$

β€’ $\mathcal{M}$: the subset of training data that are misclassified
β€’ There may be many solutions — which one should we pick?

SLIDE 17

Cost function

[Figure (from Duda, Hart & Stork, 2002): the number of misclassifications as a cost function $J(\mathbf{w})$, and the Perceptron cost function $J_P(\mathbf{w})$, plotted over $(w_0, w_1)$; there may be many solutions under these cost functions]

SLIDE 18

Batch Perceptron

β€’ "Gradient descent" to solve the optimization problem (a code sketch follows this slide):

$\mathbf{w}^{t+1} = \mathbf{w}^{t} - \eta\, \nabla_{\mathbf{w}} J_P(\mathbf{w}^{t}), \qquad
\nabla_{\mathbf{w}} J_P(\mathbf{w}) = -\sum_{i \in \mathcal{M}} \mathbf{x}^{(i)} y^{(i)}$

β€’ Batch Perceptron algorithm:

Initialize $\mathbf{w}$
Repeat
  $\mathbf{w} \leftarrow \mathbf{w} + \eta \sum_{i \in \mathcal{M}} \mathbf{x}^{(i)} y^{(i)}$
Until $\left\|\eta \sum_{i \in \mathcal{M}} \mathbf{x}^{(i)} y^{(i)}\right\| < \epsilon$

β€’ The batch Perceptron converges in a finite number of steps for linearly separable data

SLIDE 19

Stochastic gradient descent for Perceptron

β€’ Single-sample Perceptron: if $\mathbf{x}^{(i)}$ is misclassified,

$\mathbf{w}^{t+1} = \mathbf{w}^{t} + \eta\, \mathbf{x}^{(i)} y^{(i)}$

β€’ Perceptron convergence theorem: if the training data are linearly separable, the single-sample Perceptron is also guaranteed to find a solution in a finite number of steps
β€’ Fixed-increment single-sample Perceptron ($\eta$ can be set to 1 and the proof still works; a code sketch follows this slide):

Initialize $\mathbf{w}$, $t \leftarrow 0$
Repeat
  $t \leftarrow t + 1$
  $i \leftarrow t \bmod N$
  if $\mathbf{x}^{(i)}$ is misclassified then $\mathbf{w} \leftarrow \mathbf{w} + \mathbf{x}^{(i)} y^{(i)}$
Until all patterns are correctly classified
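A compact sketch of the fixed-increment single-sample rule above (illustrative, not from the slides); it assumes labels in {-1, +1} and that a bias column has already been appended to X.

```python
import numpy as np

def single_sample_perceptron(X, y, max_epochs=100):
    """Fixed-increment single-sample Perceptron (eta = 1).

    Cycles through the samples; whenever x^(i) is misclassified
    (y_i * w^T x_i <= 0), adds x^(i) * y^(i) to w. Converges in finitely
    many updates if the data are linearly separable.
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(n):
            if y[i] * (w @ X[i]) <= 0:   # misclassified
                w += y[i] * X[i]
                mistakes += 1
        if mistakes == 0:                # all patterns correctly classified
            return w
    return w                             # may not converge if not separable
```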

SLIDE 20

Perceptron Convergence

β€’ It can be shown that, for linearly separable data, the number of updates is finite (the standard form of the bound is given after this slide); the bound involves
  β€’ $R^2 = \displaystyle\max_{(\mathbf{x}, y) \in \mathcal{D}} \|\mathbf{x}\|^2$
  β€’ $b = \displaystyle\min_{(\mathbf{x}, y) \in \mathcal{D}} y\, \mathbf{w}^{*T}\mathbf{x}$, for a separating weight vector $\mathbf{w}^*$
  β€’ a term depending on the initial weight vector, $\displaystyle 2\min_{(\mathbf{x}, y) \in \mathcal{D}} y\, \mathbf{w}^{(0)T}\mathbf{x}$

SLIDE 21

Example

SLIDE 22

Perceptron: Example

β€’ Each update changes $\mathbf{w}$ in a direction that corrects the error [Bishop]

SLIDE 23

Convergence of Perceptron

β€’ For data sets that are not linearly separable, the single-sample Perceptron learning algorithm will never converge

[Duda, Hart & Stork, 2002]

SLIDE 24

Pocket algorithm

β€’ For data that are not linearly separable (e.g., due to noise), keep "in your pocket" the best $\mathbf{w}$ encountered so far (a code sketch follows this slide):

Initialize $\mathbf{w}$
for $t = 1, \dots, T$
  $i \leftarrow t \bmod N$
  if $\mathbf{x}^{(i)}$ is misclassified then $\mathbf{w}_{\text{new}} \leftarrow \mathbf{w} + \mathbf{x}^{(i)} y^{(i)}$
  if $E_{\text{train}}(\mathbf{w}_{\text{new}}) < E_{\text{train}}(\mathbf{w})$ then $\mathbf{w} \leftarrow \mathbf{w}_{\text{new}}$
end

$E_{\text{train}}(\mathbf{w}) = \dfrac{1}{N} \sum_{n=1}^{N} \left[\operatorname{sign}(\mathbf{w}^T\mathbf{x}^{(n)}) \neq y^{(n)}\right]$

(where $[\cdot]$ is 1 if its argument holds and 0 otherwise)
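An illustrative NumPy sketch that mirrors the pseudo-code above (not from the slides): a Perceptron update on a misclassified sample is proposed, but kept only if it lowers the training error. A common variant instead runs the plain Perceptron underneath and stores the best weights in a separate "pocket" vector.

```python
import numpy as np

def train_error(w, X, y):
    """E_train(w): fraction of samples with sign(w^T x) != y."""
    return np.mean(np.sign(X @ w) != y)

def pocket(X, y, T=1000):
    """Pocket-style training: accept a Perceptron step only if it improves E_train."""
    n, d = X.shape
    w = np.zeros(d)
    for t in range(1, T + 1):
        i = t % n
        if y[i] * (X[i] @ w) <= 0:                    # x^(i) is misclassified
            w_new = w + y[i] * X[i]                   # candidate Perceptron update
            if train_error(w_new, X, y) < train_error(w, X, y):
                w = w_new                             # keep the better solution
    return w
```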

SLIDE 25

Linear Discriminant Analysis (LDA)

β€’ Fisher's Linear Discriminant Analysis:
  β€’ Dimensionality reduction: finds linear combinations of features with large ratios of between-group scatter to within-group scatter (used as new discriminant variables)
  β€’ Classification: predicts the class of an observation $\mathbf{x}$ by first projecting it onto the space of discriminant variables and then classifying it in that space

SLIDE 26

Good projection for classification

β€’ What is a good criterion?
  β€’ Separating the different classes in the projected space

SLIDE 27

Good projection for classification

β€’ What is a good criterion?
  β€’ Separating the different classes in the projected space

SLIDE 28

Good projection for classification

β€’ What is a good criterion?
  β€’ Separating the different classes in the projected space

[Figure: candidate projection direction $\mathbf{w}$]

SLIDE 29

LDA Problem

β€’ Problem definition:
  β€’ $K = 2$ classes
  β€’ $\{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^{N}$ training samples, with $N_1$ samples from the first class ($\mathcal{C}_1$) and $N_2$ samples from the second class ($\mathcal{C}_2$)
β€’ Goal: find the direction $\mathbf{w}$ that we hope will enable accurate classification
β€’ The projection of a sample $\mathbf{x}$ onto a line in direction $\mathbf{w}$ is $\mathbf{w}^T\mathbf{x}$
β€’ What is a good measure of the separation between the projected points of the different classes?

SLIDE 30

Measure of separation in the projected direction

β€’ Is the direction of the line joining the class means a good candidate for $\mathbf{w}$? [Bishop]

SLIDE 31

Measure of separation in the projected direction

β€’ The direction of the line joining the class means is the solution of the following problem; it maximizes the separation of the projected class means:

$\max_{\mathbf{w}}\ J(\mathbf{w}) = (\tilde{\mu}_1 - \tilde{\mu}_2)^2 \qquad \text{s.t.}\ \|\mathbf{w}\| = 1$

where

$\tilde{\mu}_1 = \mathbf{w}^T\boldsymbol{\mu}_1, \qquad \boldsymbol{\mu}_1 = \frac{1}{N_1}\sum_{\mathbf{x}^{(i)} \in \mathcal{C}_1} \mathbf{x}^{(i)}$

$\tilde{\mu}_2 = \mathbf{w}^T\boldsymbol{\mu}_2, \qquad \boldsymbol{\mu}_2 = \frac{1}{N_2}\sum_{\mathbf{x}^{(i)} \in \mathcal{C}_2} \mathbf{x}^{(i)}$

β€’ What is the problem with a criterion that considers only $\tilde{\mu}_1 - \tilde{\mu}_2$?
  β€’ It does not consider the variances of the classes in the projected direction

SLIDE 32

LDA Criteria

β€’ Fisher's idea: maximize a function that gives
  β€’ a large separation between the projected class means,
  β€’ while also achieving a small variance within each class, thereby minimizing the class overlap:

$J(\mathbf{w}) = \dfrac{(\tilde{\mu}_1 - \tilde{\mu}_2)^2}{\tilde{s}_1^2 + \tilde{s}_2^2}$

SLIDE 33

LDA Criteria

β€’ The scatters of the projected data are:

$\tilde{s}_1^2 = \sum_{\mathbf{x}^{(i)} \in \mathcal{C}_1} \left(\mathbf{w}^T\mathbf{x}^{(i)} - \mathbf{w}^T\boldsymbol{\mu}_1\right)^2$

$\tilde{s}_2^2 = \sum_{\mathbf{x}^{(i)} \in \mathcal{C}_2} \left(\mathbf{w}^T\mathbf{x}^{(i)} - \mathbf{w}^T\boldsymbol{\mu}_2\right)^2$

SLIDE 34

LDA Criteria

$J(\mathbf{w}) = \dfrac{(\tilde{\mu}_1 - \tilde{\mu}_2)^2}{\tilde{s}_1^2 + \tilde{s}_2^2}$

$(\tilde{\mu}_1 - \tilde{\mu}_2)^2 = \left(\mathbf{w}^T\boldsymbol{\mu}_1 - \mathbf{w}^T\boldsymbol{\mu}_2\right)^2 = \mathbf{w}^T(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^T\mathbf{w}$

$\tilde{s}_1^2 = \sum_{\mathbf{x}^{(i)} \in \mathcal{C}_1} \left(\mathbf{w}^T\mathbf{x}^{(i)} - \mathbf{w}^T\boldsymbol{\mu}_1\right)^2 = \mathbf{w}^T\left[\sum_{\mathbf{x}^{(i)} \in \mathcal{C}_1} (\mathbf{x}^{(i)} - \boldsymbol{\mu}_1)(\mathbf{x}^{(i)} - \boldsymbol{\mu}_1)^T\right]\mathbf{w}$

SLIDE 35

LDA Criteria

$J(\mathbf{w}) = \dfrac{\mathbf{w}^T\mathbf{S}_B\mathbf{w}}{\mathbf{w}^T\mathbf{S}_W\mathbf{w}}$

β€’ Between-class scatter matrix: $\mathbf{S}_B = (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^T$
β€’ Within-class scatter matrix: $\mathbf{S}_W = \mathbf{S}_1 + \mathbf{S}_2$, where

$\mathbf{S}_1 = \sum_{\mathbf{x}^{(i)} \in \mathcal{C}_1} (\mathbf{x}^{(i)} - \boldsymbol{\mu}_1)(\mathbf{x}^{(i)} - \boldsymbol{\mu}_1)^T, \qquad
\mathbf{S}_2 = \sum_{\mathbf{x}^{(i)} \in \mathcal{C}_2} (\mathbf{x}^{(i)} - \boldsymbol{\mu}_2)(\mathbf{x}^{(i)} - \boldsymbol{\mu}_2)^T$

β€’ (A scatter matrix is $N$ times the covariance matrix.)

SLIDE 36

LDA Derivation

$J(\mathbf{w}) = \dfrac{\mathbf{w}^T\mathbf{S}_B\mathbf{w}}{\mathbf{w}^T\mathbf{S}_W\mathbf{w}}$

$\dfrac{\partial J}{\partial \mathbf{w}} = \dfrac{2\,\mathbf{S}_B\mathbf{w}\,(\mathbf{w}^T\mathbf{S}_W\mathbf{w}) - 2\,\mathbf{S}_W\mathbf{w}\,(\mathbf{w}^T\mathbf{S}_B\mathbf{w})}{(\mathbf{w}^T\mathbf{S}_W\mathbf{w})^2} = 0$

$\Rightarrow\ (\mathbf{w}^T\mathbf{S}_W\mathbf{w})\,\mathbf{S}_B\mathbf{w} = (\mathbf{w}^T\mathbf{S}_B\mathbf{w})\,\mathbf{S}_W\mathbf{w}
\ \Rightarrow\ \mathbf{S}_B\mathbf{w} = \lambda\,\mathbf{S}_W\mathbf{w}$

(a generalized eigenvalue problem)

SLIDE 37

LDA Derivation

β€’ $\mathbf{S}_B\mathbf{w}$ (for any vector $\mathbf{w}$) points in the same direction as $\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2$:

$\mathbf{S}_B\mathbf{w} = (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^T\mathbf{w} \propto (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$

β€’ Thus, we can solve the eigenvalue problem immediately. If $\mathbf{S}_W$ is full rank:

$\mathbf{S}_B\mathbf{w} = \lambda\,\mathbf{S}_W\mathbf{w}
\ \Rightarrow\ \mathbf{S}_W^{-1}\mathbf{S}_B\mathbf{w} = \lambda\,\mathbf{w}
\ \Rightarrow\ \mathbf{w} \propto \mathbf{S}_W^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$

SLIDE 38

LDA Algorithm

β€’ Find $\boldsymbol{\mu}_1$ and $\boldsymbol{\mu}_2$ as the means of class 1 and class 2, respectively
β€’ Find $\mathbf{S}_1$ and $\mathbf{S}_2$ as the scatter matrices of class 1 and class 2, respectively
β€’ $\mathbf{S}_W = \mathbf{S}_1 + \mathbf{S}_2$
β€’ $\mathbf{S}_B = (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^T$
β€’ Feature extraction:
  β€’ $\mathbf{w} = \mathbf{S}_W^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$ is the eigenvector corresponding to the largest eigenvalue of $\mathbf{S}_W^{-1}\mathbf{S}_B$
β€’ Classification (a code sketch follows this slide):
  β€’ $\mathbf{w} = \mathbf{S}_W^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$
  β€’ Using a threshold on $\mathbf{w}^T\mathbf{x}$, we can classify $\mathbf{x}$
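An illustrative NumPy sketch of these steps (not from the slides); the classification threshold used here, the midpoint of the projected class means, is a common but assumed choice since the slide only says "a threshold on $\mathbf{w}^T\mathbf{x}$".

```python
import numpy as np

def fisher_lda(X1, X2):
    """Fisher LDA direction for two classes given as (N1, d) and (N2, d) arrays."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)    # class means
    S1 = (X1 - mu1).T @ (X1 - mu1)                 # scatter matrix of class 1
    S2 = (X2 - mu2).T @ (X2 - mu2)                 # scatter matrix of class 2
    Sw = S1 + S2                                   # within-class scatter
    w = np.linalg.solve(Sw, mu1 - mu2)             # w proportional to Sw^{-1}(mu1 - mu2)
    threshold = 0.5 * (w @ mu1 + w @ mu2)          # assumed threshold: midpoint
    return w, threshold

def classify(x, w, threshold):
    """Assign class 1 if the projection w^T x exceeds the threshold, else class 2."""
    return 1 if w @ x >= threshold else 2
```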

SLIDE 39

Converting a multi-class problem to a set of two-class problems

β€’ "One versus rest" (or "one against all")
  β€’ For each class $\mathcal{C}_j$, a linear discriminant function that separates the samples of $\mathcal{C}_j$ from all the other samples is found
  β€’ Assumes the data are totally linearly separable
β€’ "One versus one"
  β€’ $K(K-1)/2$ linear discriminant functions are used, one to separate the samples of each pair of classes
  β€’ Assumes the data are pairwise linearly separable

(A code sketch of both schemes follows this slide.)
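An illustrative sketch (not from the slides) of both decomposition schemes. The choice of a Perceptron as the base binary classifier, the argmax tie-breaking for one-vs-rest, and majority voting for one-vs-one are assumptions; the slides only describe the decomposition itself.

```python
import numpy as np
from itertools import combinations

def perceptron(X, y, epochs=100):
    """Plain fixed-increment Perceptron; y in {-1, +1}. Returns a weight vector."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:
                w += yi * xi
    return w

def one_vs_rest(X, y, K):
    """Train one 'class j versus the rest' classifier per class (labels 0..K-1)."""
    return [perceptron(X, np.where(y == j, 1, -1)) for j in range(K)]

def predict_ovr(x, W):
    """Assign the class whose discriminant w_j^T x is largest."""
    return int(np.argmax([w @ x for w in W]))

def one_vs_one(X, y, K):
    """Train K(K-1)/2 pairwise classifiers, one per pair of classes."""
    models = {}
    for a, b in combinations(range(K), 2):
        mask = (y == a) | (y == b)
        models[(a, b)] = perceptron(X[mask], np.where(y[mask] == a, 1, -1))
    return models

def predict_ovo(x, models, K):
    """Predict by majority vote over the pairwise classifiers."""
    votes = np.zeros(K, dtype=int)
    for (a, b), w in models.items():
        votes[a if w @ x >= 0 else b] += 1
    return int(np.argmax(votes))
```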

SLIDE 40

Multi-class classification

β€’ One-vs-all (one-vs-rest)

[Figure: three binary problems in the $(x_1, x_2)$ plane — class 1 vs. rest, class 2 vs. rest, class 3 vs. rest]

SLIDE 41

Multi-class classification

β€’ One-vs-one

[Figure: pairwise decision boundaries in the $(x_1, x_2)$ plane for classes 1, 2, and 3]

SLIDE 42

Multi-class classification: ambiguity

β€’ Converting the multi-class problem to a set of two-class problems can lead to regions in which the classification is undefined

[Figure (from Duda, Hart & Stork, 2002): ambiguous regions under the one-versus-rest and one-versus-one schemes]

SLIDE 43

Discriminant functions

β€’ A discriminant function can directly assign each vector $\mathbf{x}$ to a specific class
β€’ A popular way of representing a classifier
  β€’ Many classification methods are based on discriminant functions
β€’ Assumption: the classes are taken to be disjoint
  β€’ The input space is thereby divided into decision regions
  β€’ Their boundaries are called decision boundaries or decision surfaces

SLIDE 44

Discriminant Functions

β€’ A discriminant function $g_j(\mathbf{x})$ for each class $\mathcal{C}_j$ ($j = 1, \dots, K$):
  β€’ $\mathbf{x}$ is assigned to class $\mathcal{C}_j$ if $g_j(\mathbf{x}) > g_k(\mathbf{x})\ \ \forall k \neq j$
β€’ Thus, we can easily divide the feature space into $K$ decision regions ($\mathcal{R}_j$: region of the $j$-th class):

$\forall \mathbf{x}:\ g_j(\mathbf{x}) > g_k(\mathbf{x})\ \ \forall k \neq j \;\Rightarrow\; \mathbf{x} \in \mathcal{R}_j$

β€’ Decision surfaces (or boundaries) can also be found using the discriminant functions
  β€’ The boundary between $\mathcal{R}_j$ and $\mathcal{R}_k$, separating samples of these two categories: $g_j(\mathbf{x}) = g_k(\mathbf{x})$

SLIDE 45

Discriminant Functions: Two-Category

β€’ First, we explain the two-category classification problem and then discuss the multi-category problem
β€’ Binary classification: a target variable $y \in \{0, 1\}$ or $y \in \{-1, 1\}$
β€’ For a two-category problem it suffices to find a single function $g: \mathbb{R}^d \to \mathbb{R}$, with
  β€’ $g_1(\mathbf{x}) = g(\mathbf{x})$
  β€’ $g_2(\mathbf{x}) = -g(\mathbf{x})$
β€’ Decision surface: $g(\mathbf{x}) = 0$

SLIDE 46

Multi-class classification

β€’ Solutions to multi-category problems:
  β€’ Extend the learning algorithm to support multiple classes:
    β€’ A function $g_j(\mathbf{x})$ is found for each class $j$
    β€’ $\hat{y} = \arg\max_{j = 1, \dots, K} g_j(\mathbf{x})$, i.e., $\mathbf{x}$ is assigned to class $\mathcal{C}_j$ if $g_j(\mathbf{x}) > g_k(\mathbf{x})\ \forall k \neq j$
  β€’ Convert the problem to a set of two-class problems

SLIDE 47

Multi-class classification: linear machine

β€’ A linear discriminant function $g_j(\mathbf{x}) = \mathbf{w}_j^T\mathbf{x} + w_{j0}$ for each class $\mathcal{C}_j$ ($j = 1, \dots, K$):
  β€’ $\mathbf{x}$ is assigned to class $\mathcal{C}_j$ if $g_j(\mathbf{x}) > g_k(\mathbf{x})\ \ \forall k \neq j$
β€’ Decision surfaces (boundaries) can also be found using the discriminant functions
  β€’ The boundary between the contiguous regions $\mathcal{R}_j$ and $\mathcal{R}_k$: $g_j(\mathbf{x}) = g_k(\mathbf{x})$, i.e.,

$(\mathbf{w}_j - \mathbf{w}_k)^T\mathbf{x} + (w_{j0} - w_{k0}) = 0$

SLIDE 48

Multi-class classification: linear machine

[Figure from Duda, Hart & Stork, 2002]

SLIDE 49

Perceptron: multi-class

$\hat{y} = \arg\max_{j = 1, \dots, K} \mathbf{w}_j^T\mathbf{x}$

$J_P(\mathbf{W}) = -\sum_{i \in \mathcal{M}} \left(\mathbf{w}_{y^{(i)}} - \mathbf{w}_{\hat{y}^{(i)}}\right)^T \mathbf{x}^{(i)}$

β€’ $\mathcal{M}$: the subset of training data that are misclassified, $\mathcal{M} = \{\, i \mid \hat{y}^{(i)} \neq y^{(i)} \,\}$
β€’ Algorithm (a code sketch follows this slide):

Initialize $\mathbf{W} = [\mathbf{w}_1, \dots, \mathbf{w}_K]$, $t \leftarrow 0$
Repeat
  $t \leftarrow (t + 1) \bmod N$; take sample $i = t$
  if $\mathbf{x}^{(i)}$ is misclassified then
    $\mathbf{w}_{\hat{y}^{(i)}} \leftarrow \mathbf{w}_{\hat{y}^{(i)}} - \mathbf{x}^{(i)}$
    $\mathbf{w}_{y^{(i)}} \leftarrow \mathbf{w}_{y^{(i)}} + \mathbf{x}^{(i)}$
Until all patterns are correctly classified
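An illustrative NumPy sketch of this multi-class Perceptron update (not from the slides); it assumes integer labels 0..K-1 and that a bias column has been appended to X.

```python
import numpy as np

def multiclass_perceptron(X, y, K, max_epochs=100):
    """Multi-class Perceptron: predict argmax_j w_j^T x; on a mistake,
    add x to the true class's weights and subtract it from the predicted class's."""
    n, d = X.shape
    W = np.zeros((K, d))                    # one weight vector per class
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(n):
            y_hat = int(np.argmax(W @ X[i]))
            if y_hat != y[i]:               # x^(i) is misclassified
                W[y_hat] -= X[i]
                W[y[i]] += X[i]
                mistakes += 1
        if mistakes == 0:                   # all patterns correctly classified
            break
    return W
```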

SLIDE 50

Resources

β€’ C. Bishop, "Pattern Recognition and Machine Learning", Chapter 4.1.