SLIDE 1

Data Mining

Linear & nonlinear classifiers

Hamid Beigy

Sharif University of Technology

Fall 1396

SLIDE 2

Table of contents

1. Introduction
2. Linear discriminant analysis
3. Linear classifiers
4. Support vector machines
5. Non-linear support vector machine
6. Multi-class Classifiers
   One-against-all classification
   One-against-one classification
   Error correcting coding classification

SLIDE 4

Introduction

In classification, the goal is to find a mapping from inputs X to outputs t ∈ {1, 2, . . . , C}, given a labeled set of input-output pairs (the training set) S = {(x1, t1), (x2, t2), . . . , (xN, tN)}. Each training input x is a D-dimensional vector of numbers. There are two main approaches for building a classifier:

Generative approach: first build a joint model of the form p(x, Cn), then condition on x to derive p(Cn | x).

Discriminative approach: build a model of the form p(Cn | x) directly.

SLIDE 6

Linear discriminant analysis (LDA)

One way to view a linear classification model is in terms of dimensionality reduction. Assume that we want to project a vector onto another vector to obtain a new point after a change of the basis vectors. Let a, b ∈ Rn be two n-dimensional vectors. An orthogonal decomposition of the vector b in the direction of another vector a is b = b∥ + b⊥ = p + r, where p = b∥ is parallel to a and r = b⊥ is perpendicular to a.

[Figure: vectors a and b in the (X1, X2) plane, with p = b∥ the projection of b onto a and r = b⊥ the perpendicular component.]

Vector p is called the orthogonal projection (or simply the projection) of b onto the vector a.

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 4 / 31

slide-7
SLIDE 7

Linear discriminant analysis (LDA)

p can be written as p = ca, where c is a scalar and p is parallel to a.

Thus r = b − p = b − ca. Since p and r are orthogonal, we have

p^T r = (ca)^T (b − ca) = c a^T b − c² a^T a = 0

This implies c = (a^T b) / (a^T a). Therefore, the projection of b onto a equals

p = b∥ = ca = ((a^T b) / (a^T a)) a

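A minimal NumPy sketch of this projection formula (the vectors here are my own toy values, not from the slides): it computes c = a^T b / a^T a, forms p = ca, and checks that the residual r = b − p is orthogonal to a.

    import numpy as np

    a = np.array([4.0, 2.0])      # direction vector a
    b = np.array([3.0, 4.0])      # vector b to be projected onto a

    c = (a @ b) / (a @ a)         # scalar c = a^T b / a^T a
    p = c * a                     # projection p = b_parallel
    r = b - p                     # residual  r = b_perp

    print(p, r, a @ r)            # a @ r is (numerically) zero: p and r are orthogonal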
SLIDE 8

Linear discriminant analysis (LDA)

Consider a two-class problem and suppose we take a D-dimensional input vector x and project it down to one dimension using z = W^T x. If we place a threshold on z and classify z ≥ w0 as class C1, and otherwise as class C2, then we obtain our standard linear classifier.


SLIDE 9

Linear discriminant analysis (cont.)

Consider a two-class problem in which there are N1 points of class C1 and N2 points of class C2. The mean vector of class Cj is given by

µj = (1/Nj) Σ_{i ∈ Cj} xi

The simplest measure of the separation of the classes, when projected onto W, is the separation of the projected class means. This suggests that we might choose W so as to maximize

m2 − m1 = W^T (µ2 − µ1), where mj = W^T µj

This expression can be made arbitrarily large simply by increasing the magnitude of W. To solve this problem, we could constrain W to have unit length, so that Σ_i wi² = 1.

Using a Lagrange multiplier to perform the constrained maximization, we then find that W ∝ (µ2 − µ1).

SLIDE 10

Linear discriminant analysis (cont.)

This approach has a problem: The following figure shows two classes that are well separated in the original two dimensional space but that have considerable overlap when projected onto the line joining their means.


This difficulty arises from the strongly non-diagonal covariances of the class distributions. The idea proposed by Fisher is to maximize a function that will give a large separation between the projected class means while also giving a small variance within each class, thereby minimizing the class overlap.

SLIDE 11

Linear discriminant analysis (cont.)

The idea proposed by Fisher is to maximize a function that will give a large separation between the projected class means while also giving a small variance within each class, thereby minimizing the class overlap. The projection z = W^T x transforms the set of labeled data points in x into a labeled set in the one-dimensional space z. The within-class variance of the transformed data from class Cj equals

sj² = Σ_{i ∈ Cj} (zi − mj)², where zi = W^T xi

We can define the total within-class variance for the whole data set to be s1² + s2².

The Fisher criterion is defined to be the ratio of the between-class variance to the within-class variance and is given by

J(W) = (m2 − m1)² / (s1² + s2²)

SLIDE 12

Linear discriminant analysis (cont.)

The between-class covariance matrix equals

SB = (µ2 − µ1)(µ2 − µ1)^T

The total within-class covariance matrix equals

SW = Σ_{i ∈ C1} (xi − µ1)(xi − µ1)^T + Σ_{i ∈ C2} (xi − µ2)(xi − µ2)^T

We have

(m1 − m2)² = (W^T µ1 − W^T µ2)² = W^T (µ1 − µ2)(µ1 − µ2)^T W = W^T SB W

SLIDE 13

Linear discriminant analysis (cont.)

Also we have

s1² = Σ_{i ∈ C1} (W^T xi − m1)² = Σ_{i ∈ C1} W^T (xi − µ1)(xi − µ1)^T W = W^T S1 W

and similarly s2² = W^T S2 W, with SW = S1 + S2. Hence, J(W) can be written as

J(W) = (W^T SB W) / (W^T SW W)

Setting the derivative of J(W) with respect to W to zero (using ∂(x^T A x)/∂x = (A + A^T) x) gives

W ∝ SW⁻¹ (µ2 − µ1)

The result W ∝ SW⁻¹ (µ2 − µ1) is known as Fisher's linear discriminant, although strictly it is not a discriminant but rather a specific choice of direction for projection of the data down to one dimension.
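As a minimal sketch of this result (the helper name and the toy data below are my own, not from the slides), one can form SW from the two classes and take W ∝ SW⁻¹(µ2 − µ1):

    import numpy as np

    def fisher_direction(X1, X2):
        """Fisher discriminant direction for two classes; rows of X1, X2 are examples."""
        mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
        S1 = (X1 - mu1).T @ (X1 - mu1)       # scatter of class 1
        S2 = (X2 - mu2).T @ (X2 - mu2)       # scatter of class 2
        SW = S1 + S2                         # total within-class scatter
        W = np.linalg.solve(SW, mu2 - mu1)   # W proportional to SW^{-1} (mu2 - mu1)
        return W / np.linalg.norm(W)         # return a unit-length direction

    rng = np.random.default_rng(0)
    X1 = rng.normal([0.0, 0.0], 1.0, size=(50, 2))   # class C1
    X2 = rng.normal([3.0, 2.0], 1.0, size=(50, 2))   # class C2
    print(fisher_direction(X1, X2))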

SLIDE 15

Linear classifiers

We consider the following type of linear classifiers:

y(xn) = g(xn) = sign(w1 xn1 + w2 xn2 + . . . + wD xnD) = sign(Σ_{j=1}^{D} wj xnj) = sign(w^T xn) ∈ {−1, +1}

w = (w1, w2, . . . , wD)^T ∈ R^D. Different values of w give different functions. xn = (xn1, xn2, . . . , xnD)^T is a column vector of real values.

This classifier changes its prediction only when the argument to the sign function changes from positive to negative (or vice versa). Geometrically, this transition in the feature space corresponds to crossing the decision boundary where the argument is exactly zero: all x such that w^T x = 0.

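A minimal sketch of this decision rule (the weights and the example point are illustrative):

    import numpy as np

    w = np.array([2.0, -1.0, 0.5])   # weight vector, D = 3
    xn = np.array([1.0, 3.0, 4.0])   # one input vector

    y = np.sign(w @ xn)              # sign(w^T x) in {-1, +1}; the boundary is w^T x = 0
    print(y)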

SLIDE 17

Support vector machines

Consider the problem of finding a separating hyperplane for a linearly separable dataset S = {(x1, t1), (x2, t2), . . . , (xN, tN)} with xi ∈ R^D and ti ∈ {−1, +1}. Which of the infinitely many separating hyperplanes should we choose?

Hyperplanes that pass too close to the training examples will be sensitive to noise and, therefore, less likely to generalize well for data outside the training set. It is reasonable to expect that a hyperplane that is farthest from all training examples will have better generalization capabilities.

We can find the maximum margin linear classifier by first identifying a classifier that correctly classifies all the examples and then increasing the geometric margin until we cannot increase the margin any further. We can also set up an optimization problem for directly maximizing the geometric margin.

SLIDE 18

Support vector machines (cont.)

We need the classifier to be correct on all the training examples (tn w^T xn ≥ +1 for all n = 1, 2, . . . , N). Subject to these constraints, we would like to maximize the geometric margin, 1/‖w‖. Hence, we have

Maximize 1/‖w‖ subject to tn w^T xn ≥ 1 for all n = 1, 2, . . . , N

We can alternatively minimize the inverse ‖w‖, or the inverse squared ‖w‖², subject to the same constraints:

Minimize (1/2)‖w‖² subject to tn w^T xn ≥ 1 for all n = 1, 2, . . . , N

The factor 1/2 is included merely for later convenience.

SLIDE 19

Support vector machines (cont.)

The SVM optimization problem can be written as

Minimize (1/2)‖w‖² subject to tn w^T xn ≥ 1 for all n = 1, 2, . . . , N

This optimization problem is in the standard SVM form and is a quadratic programming problem. We will modify the linear classifier slightly by adding an offset term so that the decision boundary does not have to go through the origin. In other words, the classifier that we consider has the form

g(x) = w^T x + b

where w is the weight vector and b is the bias of the separating hyperplane. The hyperplane is denoted by (w, b). The bias parameter changes the optimization problem to

Minimize (1/2)‖w‖² subject to tn (w^T xn + b) ≥ 1 for all n = 1, 2, . . . , N

SLIDE 20

Support vector machines (cont.)

The optimization problem for the SVM is defined as

Minimize (1/2)‖w‖² subject to tn (w^T xn + b) ≥ 1 for all n = 1, 2, . . . , N

In order to solve this constrained optimization problem, we introduce Lagrange multipliers αn ≥ 0, with one multiplier αn for each of the constraints, giving the Lagrangian function

L(w, b, α) = (1/2)‖w‖² − Σ_{n=1}^{N} αn [tn (w^T xn + b) − 1]

where α = (α1, α2, . . . , αN)^T.

Note the minus sign in front of the Lagrange multiplier term, because we are minimizing with respect to w and b, and maximizing with respect to α.

Setting the derivatives of L(w, b, α) with respect to w and b equal to zero, we obtain the following two equations:

∂L/∂w = 0 ⇒ w = Σ_{n=1}^{N} αn tn xn

∂L/∂b = 0 ⇒ 0 = Σ_{n=1}^{N} αn tn

SLIDE 21

Support vector machines (cont.)

L has to be minimized with respect to the primal variables w and b and maximized with respect to the dual variables αn. Eliminating w and b from L(w, b, α) using these conditions gives the dual representation of the problem, in which we maximize

L(α) = Σ_{n=1}^{N} αn − (1/2) Σ_{n=1}^{N} Σ_{m=1}^{N} αn αm tn tm xn^T xm

We need to maximize L(α) subject to the constraints

αn ≥ 0 for all n, and Σ_{n=1}^{N} αn tn = 0

A constrained optimization of this form satisfies the Karush-Kuhn-Tucker (KKT) conditions, which in this case require that the following three properties hold:

αn ≥ 0
tn g(xn) − 1 ≥ 0
αn [tn g(xn) − 1] = 0

To classify a data point x using the trained model, we evaluate the sign of g(x) defined by

g(x) = Σ_{n=1}^{N} αn tn xn^T x

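As a minimal illustration of this prediction rule (the α values below are assumed for the sake of the example, not obtained by actually solving the dual problem), g(x) = Σn αn tn xn^T x can be evaluated directly:

    import numpy as np

    X = np.array([[1.0, 2.0],          # training inputs x_n
                  [2.0, 0.5],
                  [-1.0, -1.5]])
    t = np.array([+1, +1, -1])         # labels t_n
    alpha = np.array([0.3, 0.0, 0.3])  # illustrative multipliers (note: sum of alpha_n t_n is 0)

    def g(x):
        return np.sum(alpha * t * (X @ x))   # g(x) = sum_n alpha_n t_n x_n^T x

    x_new = np.array([1.5, 1.0])
    print(np.sign(g(x_new)))           # predicted class in {-1, +1}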
SLIDE 22

Support vector machines (cont.)

We have assumed that the training data are linearly separable in the feature space. The resulting SVM will give exact separation of the training data. In practice, the class-conditional distributions may overlap, in which case exact separation of the training data can lead to poor generalization. We need a way to modify the SVM so as to allow some training examples to be misclassified. To do this, we introduce slack variables ξn ≥ 0, one slack variable for each training example. The slack variables are defined by ξn = 0 for examples that are inside the correct margin boundary and ξn = |tn − g(xn)| for other examples. Thus a data point that is on the decision boundary g(xn) = 0 will have ξn = 1, and data points with ξn > 1 will be misclassified.

SLIDE 23

Support vector machines (cont.)

The classification constraints become

tn g(xn) ≥ 1 − ξn for n = 1, 2, . . . , N

Our goal is now to maximize the margin while softly penalizing points that lie on the wrong side of the margin boundary. We therefore minimize

C Σ_{n=1}^{N} ξn + (1/2)‖w‖²

where C > 0 controls the trade-off between the slack variable penalty and the margin. We now wish to solve the following optimization problem:

Minimize (1/2)‖w‖² + C Σ_{n=1}^{N} ξn subject to tn g(xn) ≥ 1 − ξn and ξn ≥ 0 for all n = 1, 2, . . . , N

The corresponding Lagrangian is given by

L(w, b, α) = (1/2)‖w‖² + C Σ_{n=1}^{N} ξn − Σ_{n=1}^{N} αn [tn g(xn) − 1 + ξn] − Σ_{n=1}^{N} βn ξn

where αn ≥ 0 and βn ≥ 0 are Lagrange multipliers.

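As a hedged, practical illustration (scikit-learn is my choice here and is not referenced in the slides): a soft-margin linear SVM can be fit with a given C, and the relation w = Σn αn tn xn can be checked, since dual_coef_ stores the products αn tn for the support vectors.

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(-2.0, 1.0, size=(30, 2)),
                   rng.normal(+2.0, 1.0, size=(30, 2))])
    t = np.array([-1] * 30 + [+1] * 30)

    svm = SVC(kernel="linear", C=1.0).fit(X, t)            # C controls the slack penalty

    w_from_dual = svm.dual_coef_ @ svm.support_vectors_    # sum_n alpha_n t_n x_n
    print(np.allclose(w_from_dual, svm.coef_))             # matches the primal weight vector
    print(svm.intercept_)                                  # the bias b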
SLIDE 25

Non-linear support vector machine

Most data sets are not linearly separable. Instances that are not linearly separable in one dimension may be linearly separable in two dimensions. In this case, we have two solutions:

Increase the dimensionality of the data set by introducing a mapping φ.

Use a more complex model for the classifier.

SLIDE 26

Non-linear support vector machine (cont.)

To handle a non-linearly separable dataset, we use a mapping φ. For example, let x = (x1, x2)^T, z = (z1, z2, z3)^T, and φ : R² → R³. If we use the mapping z = φ(x) = (x1², √2 x1x2, x2²)^T, the dataset becomes linearly separable in R³.

Mapping a dataset to higher dimensions has two major problems:

In high dimensions, there is a risk of over-fitting.

In high dimensions, we have more computational cost.

The generalization capability in higher dimensions is ensured by using large-margin classifiers. The mapping is kept implicit rather than explicit.

SLIDE 27

Non-linear support vector machine (cont.)

The SVM uses the following discriminant function:

g(x) = Σ_{n=1}^{N} αn tn xn^T x

This solution depends on the dot product between two points xi and xj. The operations in the high-dimensional space φ(x) need not be performed explicitly if we can find a function K(xi, xj) such that K(xi, xj) = φ(xi)^T φ(xj). K(xi, xj) is called a kernel in the SVM.

Suppose x, z ∈ R^D and consider the following kernel:

K(x, z) = (x^T z)²

It is a valid kernel because

K(x, z) = (Σ_{i=1}^{D} xi zi)(Σ_{j=1}^{D} xj zj) = Σ_{i=1}^{D} Σ_{j=1}^{D} (xi xj)(zi zj) = φ(x)^T φ(z)

where the mapping φ for D = 2 is φ(x) = (x1x1, x1x2, x2x1, x2x2)^T.

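A minimal check of this identity for D = 2 (the vectors are toy values of my own choosing):

    import numpy as np

    def phi(x):
        """Explicit feature map for K(x, z) = (x^T z)^2 when D = 2."""
        return np.array([x[0] * x[0], x[0] * x[1], x[1] * x[0], x[1] * x[1]])

    x = np.array([1.0, 2.0])
    z = np.array([3.0, -1.0])

    print((x @ z) ** 2)        # kernel evaluated directly
    print(phi(x) @ phi(z))     # same value via the explicit mapping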
SLIDE 28

Non-linear support vector machine (cont.)

Show that the kernel K(x, z) = (x^T z + c)² is a valid kernel. A kernel K is valid if there is some mapping φ such that K(x, z) = φ(x)^T φ(z). Assume that K is a valid kernel and consider a set of N points; the kernel matrix K is the N × N square matrix defined as

K =
| k(x1, x1)  k(x1, x2)  · · ·  k(x1, xN) |
| k(x2, x1)  k(x2, x2)  · · ·  k(x2, xN) |
|    ...        ...     · · ·     ...    |
| k(xN, x1)  k(xN, x2)  · · ·  k(xN, xN) |

If K is a valid kernel then

kij = k(xi, xj) = φ(xi)^T φ(xj) = φ(xj)^T φ(xi) = k(xj, xi) = kji

Thus the kernel matrix is symmetric. It can also be shown that it is positive semi-definite (show it). Thus if K is a valid kernel, then the corresponding kernel matrix is symmetric positive semi-definite. This condition is both necessary and sufficient for K to be a valid kernel.

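A minimal numerical sketch of this property (the data and the choice c = 1 are illustrative): build the kernel matrix for K(x, z) = (x^T z + c)² and confirm that it is symmetric with non-negative eigenvalues.

    import numpy as np

    def K(x, z, c=1.0):
        return (x @ z + c) ** 2

    X = np.random.default_rng(0).normal(size=(5, 3))         # 5 points in R^3
    G = np.array([[K(xi, xj) for xj in X] for xi in X])      # N x N kernel matrix

    print(np.allclose(G, G.T))                   # symmetric
    print(np.linalg.eigvalsh(G).min() >= -1e-9)  # eigenvalues non-negative (up to round-off)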
SLIDE 29

Non-linear support vector machine (cont.)

Theorem (Mercer). Assume that K : R^D × R^D → R. Then for K to be a valid (Mercer) kernel, it is necessary and sufficient that for any {x1, x2, . . . , xN} (N > 1), the corresponding kernel matrix is symmetric positive semi-definite.

Some valid kernel functions:

Polynomial kernels: K(x, z) = (x^T z + 1)^p, where p is the degree of the polynomial and is specified by the user.

Radial basis function kernels: K(x, z) = exp(−‖x − z‖² / (2σ²)), where the width σ is specified by the user. This kernel corresponds to an infinite-dimensional mapping φ.

Sigmoid kernel: K(x, z) = tanh(β0 x^T z + β1). This kernel only meets Mercer's condition for certain values of β0 and β1.

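The three kernels above can be written as small functions; this is a minimal sketch, with the hyper-parameter defaults chosen arbitrarily.

    import numpy as np

    def poly_kernel(x, z, p=3):
        return (x @ z + 1.0) ** p                    # polynomial kernel of degree p

    def rbf_kernel(x, z, sigma=1.0):
        return np.exp(-np.linalg.norm(x - z) ** 2 / (2.0 * sigma ** 2))   # RBF kernel

    def sigmoid_kernel(x, z, beta0=0.01, beta1=0.0):
        return np.tanh(beta0 * (x @ z) + beta1)      # Mercer only for some beta0, beta1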
SLIDE 31

Multi-class Classifiers

In classification, the goal is to find a mapping from inputs X to outputs t ∈ {1, 2, . . . , C} given a labeled set of input-output pairs. In all our discussions so far, we have dealt with the two-class classification task, i.e. C = 2. We can either extend binary classifiers to C-class classification problems or combine several binary classifiers. In a C-class problem, we have the following three ways of using binary classifiers:

One-against-all: This approach is a straightforward extension of the two-class problem and treats the C-class problem as a set of C two-class problems.

One-against-one: In this approach, C(C − 1)/2 binary classifiers are trained and each classifier separates a pair of classes. The decision is made on the basis of a majority vote.

Error correcting coding: For a C-class problem, a number L of binary classifiers is used, where L is appropriately chosen by the designer. Each class is then represented by a binary code word of length L.

SLIDE 32

One-against-all classification

The extension is to consider a set of C two-class problems. For each class, we seek to design an optimal discriminant function gi(x) (for i = 1, 2, . . . , C) so that gi(x) > gj(x) for all j ≠ i whenever x ∈ Ci. Adopting the SVM methodology, we can design the discriminant functions so that gi(x) = 0 is the optimal hyperplane separating class Ci from all the others. Thus, each classifier is designed to give gi(x) > 0 for x ∈ Ci and gi(x) < 0 otherwise. Classification is then achieved according to the following rule:

Assign x to class Ci if i = argmax_k gk(x)

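A minimal sketch of this decision rule, assuming linear discriminants gi(x) = wi^T x + bi with illustrative weights (not taken from the slides):

    import numpy as np

    W = np.array([[ 1.0,  0.0],     # w for class C1
                  [-1.0,  1.0],     # w for class C2
                  [ 0.0, -1.0]])    # w for class C3
    b = np.array([0.0, 0.5, -0.5])

    def classify(x):
        g = W @ x + b                   # g_i(x) for i = 1, ..., C
        return int(np.argmax(g)) + 1    # assign x to the class with the largest g_i(x)

    print(classify(np.array([2.0, 1.0])))   # -> class 1 here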
SLIDE 33

Properties of one-against-all classification

The number of classifiers equals C. Each binary classifier deals with a rather asymmetric problem, in the sense that training is carried out with many more negative than positive examples; this becomes more serious when the number of classes is relatively large. This technique, however, may lead to indeterminate regions, where more than one gi(x) is positive.

SLIDE 34

One-against-one classification

In this case, C(C − 1)/2 binary classifiers are trained and each classifier separates a pair of classes. The decision is made on the basis of a majority vote. The obvious disadvantage of the technique is that a relatively large number of binary classifiers has to be trained.

SLIDE 35

Error correcting coding classification

In this approach, the classification task is treated in the context of error correcting coding. For a C-class problem, a number of, say, L binary classifiers are used, where L is appropriately chosen by the designer. Each class is now represented by a binary code word of length L. During training, for the ith classifier, i = 1, 2, . . . , L, the desired labels t for each class are chosen to be either −1 or +1. For each class, the desired labels may be different for the various classifiers. This is equivalent to constructing a C × L matrix of desired labels. For example, such a matrix can be constructed for C = 4 and L = 6.

SLIDE 36

Error correcting coding classification (cont.)

For example, consider such a matrix with C = 4 and L = 6. During training, the first classifier (corresponding to the first column of the matrix) is designed to respond with (−1, +1, +1, −1) for examples of classes C1, C2, C3, C4, respectively. The second classifier is trained to respond with (−1, −1, +1, −1), and so on. The procedure is equivalent to grouping the classes into L different pairs, and, for each pair, we train a binary classifier accordingly. Each row must be distinct and corresponds to a class.

SLIDE 37

Error correcting coding classification (cont.)

When an unknown pattern is presented, the output of each one of the L binary classifiers is recorded, resulting in a code word. Then the Hamming distance (the number of places where two code words differ) between this code word and each of the C class code words is measured, and the pattern is classified to the class corresponding to the smallest distance. This feature is the power of the technique: if the code words are designed so that the minimum Hamming distance between any pair of them is, say, d, then a correct decision will still be reached even if the decisions of at most ⌊(d − 1)/2⌋ out of the L classifiers are wrong.

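A minimal sketch of this decoding step (the code-word matrix and the classifier outputs below are illustrative, not the matrix from the slides):

    import numpy as np

    # each row is the length-L code word of one class (here C = 4, L = 6)
    codewords = np.array([[-1, -1, +1, -1, +1, -1],
                          [+1, -1, -1, +1, -1, +1],
                          [+1, +1, -1, -1, +1, +1],
                          [-1, -1, -1, +1, +1, -1]])

    outputs = np.array([+1, -1, -1, +1, -1, -1])      # decisions of the L binary classifiers

    hamming = np.sum(codewords != outputs, axis=1)    # distance to each class code word
    print(int(np.argmin(hamming)) + 1)                # classify to the nearest code word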