Linear & nonlinear classifiers Machine Learning Hamid Beigy - PowerPoint PPT Presentation

Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1396 1 / 44

Table of contents Introduction 1 Linear classifiers through origin 2 Perceptron algorithm 3 Support vector machines 4 Lagrangian optimization 5 Support vector machines (cont.) 6 Non-linear support vector machine 7 Generalized linear classifier 8 Linear discriminant analysis 9 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1396 2 / 44

Introduction In classification, the goal is to find a mapping from inputs X to outputs t ∈ { 1 , 2 , . . . , C } given a labeled set of input-output pairs (training set) S = { ( x 1 , t 1 ) , ( x 2 , t 2 ) , . . . , ( x N , t N ) } . Each training input x is a D − dimensional vector of numbers. Approaches for building a classifier. Generative approach: This approach first creates a joint model of the form of p ( x , C n ) and then to condition on x , then deriving p ( C n | x ). Discriminative approach: This approach creates a model of the form of p ( C n | x ) directly. Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1396 3 / 44

Linear classifiers through origin We consider the following type of linear classifiers. y ( x n ) = g ( x n ) = sign ( w 1 x n 1 + w 2 x n 2 + . . . + w D x nD ) ∈ {− 1 , +1 }   D ( ) ∑   = sign w T x n = sign w n x nj . j =1 w = ( w 1 , w 2 , . . . , w D ) T is a column vector of real valued parameters ( w ∈ R D ). Different values of w give different functions. x n = ( x n 1 , x n 2 , . . . , x nD ) T is a column vector of real values. This classifier changes its prediction only when the argument to the sign function changes from positive to negative (or vice versa). Geometrically, this transition in the feature space corresponds to crossing the decision boundary where the argument is exactly zero: all x such that w T x = 0. Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1396 4 / 44

The Perceptron algorithm We would like to find a linear classifier that makes the fewest mistakes on the training set. In other words, we want find w that minimizes the training error N ∑ 1 E E ( w ) = (1 − δ ( t n , g ( x n ))) N n =1 N ∑ 1 = ℓ ( t n , g ( x n )) . N n =1 δ ( t , t ′ ) = 1 if t = t ′ and 0 otherwise. ℓ is loss function called zero–one loss. What would be a reasonable algorithm for setting the parameters w ? We can just incrementally adjust the parameters so as to correct any mistakes that the corresponding classifier makes. Such an algorithm would seem to reduce the training error that counts the mistakes. The simplest algorithm of this type is the Perceptron update rule. We consider each training instances one by one, cycling through all the training instances, and adjust the parameters according to (drive it.) w ′ = w + t n x n if t n ̸ = g ( x n ) In other words, the parameters (classifier) is changed only if we make a mistake. These updates tend to correct mistakes. Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1396 5 / 44

The Perceptron algorithm (cont.) The parameters (classifier) is changed only if we make a mistake. To see this, When we make a mistake sign ( w T x k ) ̸ = t k . The inequality t k w T x k < 0 is hold. Consider w after updating t n ( w + t n x n ) T x n t n w ′ T x = t n w T x n + t 2 n x T = n x n t n w T x n + || x n || 2 = This means that, the value of t k w T x n increases as a result of the update (becomes more positive). If we consider the same feature vector repeatedly, then we will necessarily change the parameters such that the vector is classified correctly, i.e., the value of t k w T x n becomes positive. if the training examples are possible to classify correctly with a linear classifier, will the Perceptron algorithm find such a classifier? Yes, it does, and it will converge to such a classifier in a finite number of updates (mistakes). To drive this result (an alternative proof), please read section 3.3 of Pattern Recognition Book by Theodoridis and Koutroumbas Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1396 6 / 44

The Perceptron algorithm (cont.) We considered the linearly separable case in which the following inequality holds. t n ( w ∗ ) T x n > 0 for all n = 1 , 2 , . . . , N W ∗ is the weight learned by the Perceptron algorithm. Now assume we want to learn a hyperplane that classifies the training set with margin of γ > 0, i.e. we have t n ( w ∗ ) T x n > γ for all n = 1 , 2 , . . . , N Parameter γ > 0 is used to ensure that each example is classified correctly with a finite margin. Theorem When || x n || ≤ R for all n and some finite R, the Perceptron algorithm needs at most ( ) 2 || w ∗ || 2 updates of the weight vector (w). R γ Outline of proof. The convergence proof is based on combining the following two results, The inner product ( w ∗ ) T w ( k ) increases at least linearly with each update. 1 The squared norm || w ( k ) || 2 increases at most linearly in the number of updates k . 2 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1396 7 / 44

The Perceptron algorithm (cont.) We now give details of each part. Proof of part 1. The weight vector w updated when the training instance is not classified correctly. We consider 1 the inner product ( w ∗ ) T w ( k ) before and after each update. ( w ∗ ) T ( ) w ( k − 1) + t n x n ( w ∗ ) T w ( k ) = ( w ∗ ) T w ( k − 1) + t n ( w ∗ ) T x n = ( w ∗ ) T w ( k − 1) + γ ≥ ( w ∗ ) T w ( k − 2) + 2 γ ≥ ( w ∗ ) T w ( k − 3) + 3 γ ≥ . . . ( w ∗ ) T w (0) + k γ ≥ = k γ Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1396 8 / 44

The Perceptron algorithm (cont.) We now give details of part 2. Proof of part 2. The weight vector w updated when the training instance is not classified correctly. We consider 1 || w ( k ) || 2 before and after each update. � � w ( k ) � � � 2 2 � � � � � w ( k − 1) + t n x n = � � � � w ( k − 1) � ( w ( k − 1) ) T 2 � � x n + ∥ t n x n ∥ 2 = � + 2 t n � w ( k − 1) � � ( w ( k − 1) ) T 2 � � x n + ∥ x n ∥ 2 = + 2 t n � � � w ( k − 1) � 2 � � + ∥ x n ∥ 2 ≤ � � � w ( k − 1) � 2 � � + R 2 ≤ � � � w ( k − 2) � 2 � � + 2 R 2 ≤ � . . . � � w (0) � 2 � � + kR 2 = kR 2 ≤ � Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1396 9 / 44

The Perceptron algorithm (cont.) We now combine two parts. Combination of parts 1 & 2. The cos( x , y ) measures the similarity of x and y . update. 1 ( w ∗ , w ( k ) ) ( w ∗ ) T w ( k ) cos = ∥ ( w ∗ ) T ∥∥ w ( k ) ∥ k γ 1 ≥ ∥ ( w ∗ ) T ∥∥ w ( k ) ∥ k γ 2 ≥ √ kR 2 ∥ ( w ∗ ) T ∥ ≤ 1 The last inequality is because the cos is bounded by one. Hence, we have 2 √ kR 2 ∥ ( w ∗ ) T ∥ k ≤ γ ( R ) 2 � ( ∥ w ∗ ∥ ) 2 � ( w ∗ ) T � � 2 = R 2 ≤ γ γ Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1396 10 / 44

The Perceptron algorithm : margin and geometry Does ∥ w ∗ ∥ relate to the difficulty of the classification problem? γ γ Yes, its inverse ( ∥ w ∗ ∥ ) is the smallest distance in the feature space from any example to the decision boundary specified by w ∗ . In other words, it is a measure of separation of two classes. This distance is called geometric distance and denoted by γ geom . In order to calculate the geometric distance ( γ geom ), the distance from the decision boundary (( w ∗ ) T x = 0) to one of the examples x n for which t n ( w ∗ ) T x n = γ is measured. Since w ∗ is normal to the decision boundary, the shortest path from the boundary to the instance x n will be parallel to the normal. The instance for which t n ( w ∗ ) T x n = γ is therefore among those closest to the boundary. Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1396 11 / 44

Linear & nonlinear classifiers Machine Learning Hamid Beigy - PowerPoint PPT Presentation

Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1396 1 / 44 Table of contents Introduction 1

Nonlinear Classifiers II 2 Nonlinear Classifiers: Introduction Classifiers Supervised

Linear Classifiers: Expressiveness Machine Learning 1 Lecture outline Linear models:

CS440/ECE448 Lecture 22: Including Slides by Svetlana Lazebnik, 10/2016 Linear Classifiers

Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology

Cognitive Modeling Unseen Examples 2 Bayes Classifiers Lecture 14: Naive Bayes Classifiers

Nonlinear Control Lecture # 31 Nonlinear Observers Nonlinear Control Lecture # 31 Nonlinear

Nonlinear Control Lecture # 22 Special nonlinear Forms Nonlinear Control Lecture # 22 Special

Nonlinear Control Lecture # 21 Special nonlinear Forms Nonlinear Control Lecture # 21 Special

Linear Classifiers and the Perceptron William Cohen February 4, 2008 1 Linear classifiers

Linear, Binary SVM Classifiers COMPSCI 371D Machine Learning COMPSCI 371D Machine

CS 7616 Pattern Recognition Linear, Linear, Linear Aaron Bobick School of Interactive

Nonlinear Control Lecture # 8 Special nonlinear Forms Nonlinear Control Lecture # 8 Special

Nonlinear Control Lecture # 12 Nonlinear Observers and Output Feedback Stabilization Nonlinear

Nonlinear Control Lecture # 20 Special nonlinear Forms Nonlinear Control Lecture # 20 Special

Fusion of Continuous Output Classifiers Classifiers Jacob Hays Amit Pillay James DeFelice

Machine Learning Nave Bayes classifiers Types of classifiers We can divide the large

Solving large scale eigenvalue problems Lecture 3, March 7, 2018: Newton methods

Introduction to Generalized Additive Models Noam Ross Senior Research Scientist, EcoHealth

Invariant Gibbs measures for the three-dimensional wave equation with a Hartree nonlinearity

On the minimum distance of q -ary nonlinear codes Jaume Pujol, Merc` e Villanueva, Fanxuan

Cryptographic Criteria of Boolean Functions and S-Boxes Luca Mariot luca.mariot@unimib.it Guest

I ntroduction to Mobile Robotics Bayes Filter Extended Kalm an Filter Wolfram Burgard 1

webinar Mike Dent Director of Pharmacy Funding Harpreet Chana Head of Funding Strategy

Coordination Control of Multiple Mobile Robots Filippo Arrichiello webuser.unicas.it/arrichiello

Sambuz

Useful Links

Newsletter

Mail Us

Linear & nonlinear classifiers Machine Learning Hamid Beigy - PowerPoint PPT Presentation

Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1396 1 / 44 Table of contents Introduction 1

Nonlinear Classifiers II 2 Nonlinear Classifiers: Introduction Classifiers Supervised

Linear Classifiers: Expressiveness Machine Learning 1 Lecture outline Linear models:

CS440/ECE448 Lecture 22: Including Slides by Svetlana Lazebnik, 10/2016 Linear Classifiers

Linear &amp; nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology

Cognitive Modeling Unseen Examples 2 Bayes Classifiers Lecture 14: Naive Bayes Classifiers

Nonlinear Control Lecture # 31 Nonlinear Observers Nonlinear Control Lecture # 31 Nonlinear

Nonlinear Control Lecture # 22 Special nonlinear Forms Nonlinear Control Lecture # 22 Special

Nonlinear Control Lecture # 21 Special nonlinear Forms Nonlinear Control Lecture # 21 Special

Linear Classifiers and the Perceptron William Cohen February 4, 2008 1 Linear classifiers

Linear, Binary SVM Classifiers COMPSCI 371D Machine Learning COMPSCI 371D Machine

CS 7616 Pattern Recognition Linear, Linear, Linear Aaron Bobick School of Interactive

Nonlinear Control Lecture # 8 Special nonlinear Forms Nonlinear Control Lecture # 8 Special

Nonlinear Control Lecture # 12 Nonlinear Observers and Output Feedback Stabilization Nonlinear

Nonlinear Control Lecture # 20 Special nonlinear Forms Nonlinear Control Lecture # 20 Special

Fusion of Continuous Output Classifiers Classifiers Jacob Hays Amit Pillay James DeFelice

Machine Learning Nave Bayes classifiers Types of classifiers We can divide the large

Solving large scale eigenvalue problems Lecture 3, March 7, 2018: Newton methods

Introduction to Generalized Additive Models Noam Ross Senior Research Scientist, EcoHealth

Invariant Gibbs measures for the three-dimensional wave equation with a Hartree nonlinearity

On the minimum distance of q -ary nonlinear codes Jaume Pujol, Merc` e Villanueva, Fanxuan

Cryptographic Criteria of Boolean Functions and S-Boxes Luca Mariot luca.mariot@unimib.it Guest

I ntroduction to Mobile Robotics Bayes Filter Extended Kalm an Filter Wolfram Burgard 1

webinar Mike Dent Director of Pharmacy Funding Harpreet Chana Head of Funding Strategy

Coordination Control of Multiple Mobile Robots Filippo Arrichiello webuser.unicas.it/arrichiello

Sambuz

Useful Links

Newsletter

Mail Us

Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology