SLIDE 1

Data Mining

Linear & nonlinear classifiers

Hamid Beigy

Sharif University of Technology

Fall 1396

SLIDE 2

Table of contents

1. Introduction
2. Linear discriminant analysis
3. Linear classifiers
4. Support vector machines
5. Non-linear support vector machine
6. Multi-class Classifiers
   One-against-all classification
   One-against-one classification
   Error correcting coding classification

SLIDE 4

Introduction

In classification, the goal is to find a mapping from inputs X to outputs t ∈ {1, 2, . . . , C}, given a labeled set of input-output pairs (the training set) S = {(x1, t1), (x2, t2), . . . , (xN, tN)}. Each training input x is a D-dimensional vector of numbers. There are two main approaches for building a classifier:

Generative approach: first build a joint model of the form p(x, Cn), then condition on x to derive p(Cn | x).

Discriminative approach: build a model of the form p(Cn | x) directly.

SLIDE 6

Linear discriminant analysis (LDA)

One way to view a linear classification model is in terms of dimensionality reduction. Assume that we want to project a vector onto another vector to obtain a new point after a change of the basis vectors. Let a, b ∈ Rn be two n-dimensional vectors. An orthogonal decomposition of the vector b in the direction of another vector a is b = b∥ + b⊥ = p + r, where p = b∥ is parallel to a and r = b⊥ is perpendicular to a.

[Figure: vectors a and b in the (X1, X2) plane, with p = b∥ the projection of b onto a and r = b⊥ the perpendicular component.]

Vector p is called the orthogonal projection (or simply the projection) of b onto the vector a.

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 4 / 31

slide-7
SLIDE 7

Linear discriminant analysis (LDA)

p can be written as p = ca, where c is a scalar and p is parallel to a.

Thus r = b − p = b − ca. Since p and r are orthogonal, we have

p^T r = (ca)^T (b − ca) = c a^T b − c² a^T a = 0

This implies c = (a^T b) / (a^T a). Therefore, the projection of b onto a equals

p = b∥ = ca = ((a^T b) / (a^T a)) a

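A minimal NumPy sketch of this projection formula (the vectors here are my own toy values, not from the slides): it computes c = a^T b / a^T a, forms p = ca, and checks that the residual r = b − p is orthogonal to a.

    import numpy as np

    a = np.array([4.0, 2.0])      # direction vector a
    b = np.array([3.0, 4.0])      # vector b to be projected onto a

    c = (a @ b) / (a @ a)         # scalar c = a^T b / a^T a
    p = c * a                     # projection p = b_parallel
    r = b - p                     # residual  r = b_perp

    print(p, r, a @ r)            # a @ r is (numerically) zero: p and r are orthogonal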
SLIDE 8

Linear discriminant analysis (LDA)

Consider a two-class problem and suppose we take a D-dimensional input vector x and project it down to one dimension using z = W^T x. If we place a threshold on z and classify z ≥ w0 as class C1, and otherwise as class C2, then we obtain our standard linear classifier.


SLIDE 9

Linear discriminant analysis (cont.)

Consider a two-class problem in which there are N1 points of class C1 and N2 points of class C2. The mean vector of class Cj is given by

µj = (1/Nj) Σ_{i ∈ Cj} xi

The simplest measure of the separation of the classes, when projected onto W, is the separation of the projected class means. This suggests that we might choose W so as to maximize

m2 − m1 = W^T (µ2 − µ1), where mj = W^T µj

This expression can be made arbitrarily large simply by increasing the magnitude of W. To solve this problem, we could constrain W to have unit length, so that Σ_i wi² = 1.

Using a Lagrange multiplier to perform the constrained maximization, we then find that W ∝ (µ2 − µ1).

SLIDE 10

Linear discriminant analysis (cont.)

This approach has a problem: The following figure shows two classes that are well separated in the original two dimensional space but that have considerable overlap when projected onto the line joining their means.


This difficulty arises from the strongly non-diagonal covariances of the class distributions. The idea proposed by Fisher is to maximize a function that will give a large separation between the projected class means while also giving a small variance within each class, thereby minimizing the class overlap.

SLIDE 11

Linear discriminant analysis (cont.)

The idea proposed by Fisher is to maximize a function that will give a large separation between the projected class means while also giving a small variance within each class, thereby minimizing the class overlap. The projection z = W^T x transforms the set of labeled data points in x into a labeled set in the one-dimensional space z. The within-class variance of the transformed data from class Cj equals

sj² = Σ_{i ∈ Cj} (zi − mj)², where zi = W^T xi

We can define the total within-class variance for the whole data set to be s1² + s2².

The Fisher criterion is defined to be the ratio of the between-class variance to the within-class variance and is given by

J(W) = (m2 − m1)² / (s1² + s2²)

SLIDE 12

Linear discriminant analysis (cont.)

The between-class covariance matrix equals

SB = (µ2 − µ1)(µ2 − µ1)^T

The total within-class covariance matrix equals

SW = Σ_{i ∈ C1} (xi − µ1)(xi − µ1)^T + Σ_{i ∈ C2} (xi − µ2)(xi − µ2)^T

We have

(m1 − m2)² = (W^T µ1 − W^T µ2)² = W^T (µ1 − µ2)(µ1 − µ2)^T W = W^T SB W

SLIDE 13

Linear discriminant analysis (cont.)

Also we have

s1² = Σ_{i ∈ C1} (W^T xi − m1)² = Σ_{i ∈ C1} W^T (xi − µ1)(xi − µ1)^T W = W^T S1 W

and similarly s2² = W^T S2 W, with SW = S1 + S2. Hence, J(W) can be written as

J(W) = (W^T SB W) / (W^T SW W)

Setting the derivative of J(W) with respect to W to zero (using ∂(x^T A x)/∂x = (A + A^T) x) gives

W ∝ SW⁻¹ (µ2 − µ1)

The result W ∝ SW⁻¹ (µ2 − µ1) is known as Fisher's linear discriminant, although strictly it is not a discriminant but rather a specific choice of direction for projection of the data down to one dimension.
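As a minimal sketch of this result (the helper name and the toy data below are my own, not from the slides), one can form SW from the two classes and take W ∝ SW⁻¹(µ2 − µ1):

    import numpy as np

    def fisher_direction(X1, X2):
        """Fisher discriminant direction for two classes; rows of X1, X2 are examples."""
        mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
        S1 = (X1 - mu1).T @ (X1 - mu1)       # scatter of class 1
        S2 = (X2 - mu2).T @ (X2 - mu2)       # scatter of class 2
        SW = S1 + S2                         # total within-class scatter
        W = np.linalg.solve(SW, mu2 - mu1)   # W proportional to SW^{-1} (mu2 - mu1)
        return W / np.linalg.norm(W)         # return a unit-length direction

    rng = np.random.default_rng(0)
    X1 = rng.normal([0.0, 0.0], 1.0, size=(50, 2))   # class C1
    X2 = rng.normal([3.0, 2.0], 1.0, size=(50, 2))   # class C2
    print(fisher_direction(X1, X2))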

SLIDE 15

Linear classifiers

We consider the following type of linear classifiers:

y(xn) = g(xn) = sign(w1 xn1 + w2 xn2 + . . . + wD xnD) = sign(Σ_{j=1}^{D} wj xnj) = sign(w^T xn) ∈ {−1, +1}

w = (w1, w2, . . . , wD)^T ∈ R^D. Different values of w give different functions. xn = (xn1, xn2, . . . , xnD)^T is a column vector of real values.

This classifier changes its prediction only when the argument to the sign function changes from positive to negative (or vice versa). Geometrically, this transition in the feature space corresponds to crossing the decision boundary where the argument is exactly zero: all x such that w^T x = 0.

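A minimal sketch of this decision rule (the weights and the example point are illustrative):

    import numpy as np

    w = np.array([2.0, -1.0, 0.5])   # weight vector, D = 3
    xn = np.array([1.0, 3.0, 4.0])   # one input vector

    y = np.sign(w @ xn)              # sign(w^T x) in {-1, +1}; the boundary is w^T x = 0
    print(y)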

SLIDE 17

Support vector machines

Consider the problem of finding a separating hyperplane for a linearly separable dataset S = {(x1, t1), (x2, t2), . . . , (xN, tN)} with xi ∈ R^D and ti ∈ {−1, +1}. Which of the infinitely many separating hyperplanes should we choose?

Hyperplanes that pass too close to the training examples will be sensitive to noise and, therefore, less likely to generalize well for data outside the training set. It is reasonable to expect that a hyperplane that is farthest from all training examples will have better generalization capabilities.

We can find the maximum margin linear classifier by first identifying a classifier that correctly classifies all the examples and then increasing the geometric margin until we cannot increase the margin any further. We can also set up an optimization problem for directly maximizing the geometric margin.

SLIDE 18

Support vector machines (cont.)

We need the classifier to be correct on all the training examples (tn w^T xn ≥ +1 for all n = 1, 2, . . . , N). Subject to these constraints, we would like to maximize the geometric margin, 1/‖w‖. Hence, we have

Maximize 1/‖w‖ subject to tn w^T xn ≥ 1 for all n = 1, 2, . . . , N

We can alternatively minimize the inverse ‖w‖, or the inverse squared ‖w‖², subject to the same constraints:

Minimize (1/2)‖w‖² subject to tn w^T xn ≥ 1 for all n = 1, 2, . . . , N

The factor 1/2 is included merely for later convenience.

SLIDE 19

Support vector machines (cont.)

The SVM optimization problem can be written as

Minimize (1/2)‖w‖² subject to tn w^T xn ≥ 1 for all n = 1, 2, . . . , N

This optimization problem is in the standard SVM form and is a quadratic programming problem. We will modify the linear classifier slightly by adding an offset term so that the decision boundary does not have to go through the origin. In other words, the classifier that we consider has the form

g(x) = w^T x + b

where w is the weight vector and b is the bias of the separating hyperplane. The hyperplane is denoted by (w, b). The bias parameter changes the optimization problem to

Minimize (1/2)‖w‖² subject to tn (w^T xn + b) ≥ 1 for all n = 1, 2, . . . , N

SLIDE 20

Support vector machines (cont.)

The optimization problem for the SVM is defined as

Minimize (1/2)‖w‖² subject to tn (w^T xn + b) ≥ 1 for all n = 1, 2, . . . , N

In order to solve this constrained optimization problem, we introduce Lagrange multipliers αn ≥ 0, with one multiplier αn for each of the constraints, giving the Lagrangian function

L(w, b, α) = (1/2)‖w‖² − Σ_{n=1}^{N} αn [tn (w^T xn + b) − 1]

where α = (α1, α2, . . . , αN)^T.

Note the minus sign in front of the Lagrange multiplier term, because we are minimizing with respect to w and b, and maximizing with respect to α.

Setting the derivatives of L(w, b, α) with respect to w and b equal to zero, we obtain the following two equations:

∂L/∂w = 0 ⇒ w = Σ_{n=1}^{N} αn tn xn

∂L/∂b = 0 ⇒ 0 = Σ_{n=1}^{N} αn tn

SLIDE 21

Support vector machines (cont.)

L has to be minimized with respect to the primal variables w and b and maximized with respect to the dual variables αn. Eliminating w and b from L(w, b, α) using these conditions gives the dual representation of the problem, in which we maximize

L(α) = Σ_{n=1}^{N} αn − (1/2) Σ_{n=1}^{N} Σ_{m=1}^{N} αn αm tn tm xn^T xm

We need to maximize L(α) subject to the constraints

αn ≥ 0 for all n, and Σ_{n=1}^{N} αn tn = 0

A constrained optimization of this form satisfies the Karush-Kuhn-Tucker (KKT) conditions, which in this case require that the following three properties hold:

αn ≥ 0
tn g(xn) − 1 ≥ 0
αn [tn g(xn) − 1] = 0

To classify a data point x using the trained model, we evaluate the sign of g(x) defined by

g(x) = Σ_{n=1}^{N} αn tn xn^T x

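As a minimal illustration of this prediction rule (the α values below are assumed for the sake of the example, not obtained by actually solving the dual problem), g(x) = Σn αn tn xn^T x can be evaluated directly:

    import numpy as np

    X = np.array([[1.0, 2.0],          # training inputs x_n
                  [2.0, 0.5],
                  [-1.0, -1.5]])
    t = np.array([+1, +1, -1])         # labels t_n
    alpha = np.array([0.3, 0.0, 0.3])  # illustrative multipliers (note: sum of alpha_n t_n is 0)

    def g(x):
        return np.sum(alpha * t * (X @ x))   # g(x) = sum_n alpha_n t_n x_n^T x

    x_new = np.array([1.5, 1.0])
    print(np.sign(g(x_new)))           # predicted class in {-1, +1}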
SLIDE 22

Support vector machines (cont.)

We have assumed that the training data are linearly separable in the feature space. The resulting SVM will give exact separation of the training data. In practice, the class-conditional distributions may overlap, in which case exact separation of the training data can lead to poor generalization. We need a way to modify the SVM so as to allow some training examples to be misclassified. To do this, we introduce slack variables ξn ≥ 0, one slack variable for each training example. The slack variables are defined by ξn = 0 for examples that are inside the correct margin boundary and ξn = |tn − g(xn)| for other examples. Thus a data point that is on the decision boundary g(xn) = 0 will have ξn = 1, and data points with ξn > 1 will be misclassified.

SLIDE 23

Support vector machines (cont.)

The classification constraints become

tn g(xn) ≥ 1 − ξn for n = 1, 2, . . . , N

Our goal is now to maximize the margin while softly penalizing points that lie on the wrong side of the margin boundary. We therefore minimize

C Σ_{n=1}^{N} ξn + (1/2)‖w‖²

where C > 0 controls the trade-off between the slack variable penalty and the margin. We now wish to solve the following optimization problem:

Minimize (1/2)‖w‖² + C Σ_{n=1}^{N} ξn subject to tn g(xn) ≥ 1 − ξn and ξn ≥ 0 for all n = 1, 2, . . . , N

The corresponding Lagrangian is given by

L(w, b, α) = (1/2)‖w‖² + C Σ_{n=1}^{N} ξn − Σ_{n=1}^{N} αn [tn g(xn) − 1 + ξn] − Σ_{n=1}^{N} βn ξn

where αn ≥ 0 and βn ≥ 0 are Lagrange multipliers.

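As a hedged, practical illustration (scikit-learn is my choice here and is not referenced in the slides): a soft-margin linear SVM can be fit with a given C, and the relation w = Σn αn tn xn can be checked, since dual_coef_ stores the products αn tn for the support vectors.

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(-2.0, 1.0, size=(30, 2)),
                   rng.normal(+2.0, 1.0, size=(30, 2))])
    t = np.array([-1] * 30 + [+1] * 30)

    svm = SVC(kernel="linear", C=1.0).fit(X, t)            # C controls the slack penalty

    w_from_dual = svm.dual_coef_ @ svm.support_vectors_    # sum_n alpha_n t_n x_n
    print(np.allclose(w_from_dual, svm.coef_))             # matches the primal weight vector
    print(svm.intercept_)                                  # the bias b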
SLIDE 25

Non-linear support vector machine

Most data sets are not linearly separable. Instances that are not linearly separable in one dimension may be linearly separable in two dimensions. In this case, we have two solutions:

Increase the dimensionality of the data set by introducing a mapping φ.

Use a more complex model for the classifier.

SLIDE 26

Non-linear support vector machine (cont.)

To handle a non-linearly separable dataset, we use a mapping φ. For example, let x = (x1, x2)^T, z = (z1, z2, z3)^T, and φ : R² → R³. If we use the mapping z = φ(x) = (x1², √2 x1x2, x2²)^T, the dataset becomes linearly separable in R³.

Mapping a dataset to higher dimensions has two major problems:

In high dimensions, there is a risk of over-fitting.

In high dimensions, we have more computational cost.

The generalization capability in higher dimensions is ensured by using large-margin classifiers. The mapping is kept implicit rather than explicit.

SLIDE 27

Non-linear support vector machine (cont.)

The SVM uses the following discriminant function:

g(x) = Σ_{n=1}^{N} αn tn xn^T x

This solution depends on the dot product between two points xi and xj. The operations in the high-dimensional space φ(x) need not be performed explicitly if we can find a function K(xi, xj) such that K(xi, xj) = φ(xi)^T φ(xj). K(xi, xj) is called a kernel in the SVM.

Suppose x, z ∈ R^D and consider the following kernel:

K(x, z) = (x^T z)²

It is a valid kernel because

K(x, z) = (Σ_{i=1}^{D} xi zi)(Σ_{j=1}^{D} xj zj) = Σ_{i=1}^{D} Σ_{j=1}^{D} (xi xj)(zi zj) = φ(x)^T φ(z)

where the mapping φ for D = 2 is φ(x) = (x1x1, x1x2, x2x1, x2x2)^T.

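A minimal check of this identity for D = 2 (the vectors are toy values of my own choosing):

    import numpy as np

    def phi(x):
        """Explicit feature map for K(x, z) = (x^T z)^2 when D = 2."""
        return np.array([x[0] * x[0], x[0] * x[1], x[1] * x[0], x[1] * x[1]])

    x = np.array([1.0, 2.0])
    z = np.array([3.0, -1.0])

    print((x @ z) ** 2)        # kernel evaluated directly
    print(phi(x) @ phi(z))     # same value via the explicit mapping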
SLIDE 28

Non-linear support vector machine (cont.)

Show that the kernel K(x, z) = (x^T z + c)² is a valid kernel. A kernel K is valid if there is some mapping φ such that K(x, z) = φ(x)^T φ(z). Assume that K is a valid kernel and consider a set of N points; the kernel matrix K is the N × N square matrix defined as

K =
| k(x1, x1)  k(x1, x2)  · · ·  k(x1, xN) |
| k(x2, x1)  k(x2, x2)  · · ·  k(x2, xN) |
|    ...        ...     · · ·     ...    |
| k(xN, x1)  k(xN, x2)  · · ·  k(xN, xN) |

If K is a valid kernel then

kij = k(xi, xj) = φ(xi)^T φ(xj) = φ(xj)^T φ(xi) = k(xj, xi) = kji

Thus the kernel matrix is symmetric. It can also be shown that it is positive semi-definite (show it). Thus if K is a valid kernel, then the corresponding kernel matrix is symmetric positive semi-definite. This condition is both necessary and sufficient for K to be a valid kernel.

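A minimal numerical sketch of this property (the data and the choice c = 1 are illustrative): build the kernel matrix for K(x, z) = (x^T z + c)² and confirm that it is symmetric with non-negative eigenvalues.

    import numpy as np

    def K(x, z, c=1.0):
        return (x @ z + c) ** 2

    X = np.random.default_rng(0).normal(size=(5, 3))         # 5 points in R^3
    G = np.array([[K(xi, xj) for xj in X] for xi in X])      # N x N kernel matrix

    print(np.allclose(G, G.T))                   # symmetric
    print(np.linalg.eigvalsh(G).min() >= -1e-9)  # eigenvalues non-negative (up to round-off)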
SLIDE 29

Non-linear support vector machine (cont.)

Theorem (Mercer). Assume that K : R^D × R^D → R. Then for K to be a valid (Mercer) kernel, it is necessary and sufficient that for any {x1, x2, . . . , xN} (N > 1), the corresponding kernel matrix is symmetric positive semi-definite.

Some valid kernel functions:

Polynomial kernels: K(x, z) = (x^T z + 1)^p, where p is the degree of the polynomial and is specified by the user.

Radial basis function kernels: K(x, z) = exp(−‖x − z‖² / (2σ²)), where the width σ is specified by the user. This kernel corresponds to an infinite-dimensional mapping φ.

Sigmoid kernel: K(x, z) = tanh(β0 x^T z + β1). This kernel only meets Mercer's condition for certain values of β0 and β1.

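The three kernels above can be written as small functions; this is a minimal sketch, with the hyper-parameter defaults chosen arbitrarily.

    import numpy as np

    def poly_kernel(x, z, p=3):
        return (x @ z + 1.0) ** p                    # polynomial kernel of degree p

    def rbf_kernel(x, z, sigma=1.0):
        return np.exp(-np.linalg.norm(x - z) ** 2 / (2.0 * sigma ** 2))   # RBF kernel

    def sigmoid_kernel(x, z, beta0=0.01, beta1=0.0):
        return np.tanh(beta0 * (x @ z) + beta1)      # Mercer only for some beta0, beta1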
SLIDE 31

Multi-class Classifiers

In classification, the goal is to find a mapping from inputs X to outputs t ∈ {1, 2, . . . , C} given a labeled set of input-output pairs. In all our discussions so far, we have dealt with the two-class classification task, i.e. C = 2. We can either extend binary classifiers to C-class classification problems or combine several binary classifiers. In a C-class problem, we have the following three ways of using binary classifiers:

One-against-all: This approach is a straightforward extension of the two-class problem and treats the C-class problem as a set of C two-class problems.

One-against-one: In this approach, C(C − 1)/2 binary classifiers are trained and each classifier separates a pair of classes. The decision is made on the basis of a majority vote.

Error correcting coding: For a C-class problem, a number L of binary classifiers is used, where L is appropriately chosen by the designer. Each class is then represented by a binary code word of length L.

SLIDE 32

One-against-all classification

The extension is to consider a set of C two-class problems. For each class, we seek to design an optimal discriminant function gi(x) (for i = 1, 2, . . . , C) so that gi(x) > gj(x) for all j ≠ i whenever x ∈ Ci. Adopting the SVM methodology, we can design the discriminant functions so that gi(x) = 0 is the optimal hyperplane separating class Ci from all the others. Thus, each classifier is designed to give gi(x) > 0 for x ∈ Ci and gi(x) < 0 otherwise. Classification is then achieved according to the following rule:

Assign x to class Ci if i = argmax_k gk(x)

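A minimal sketch of this decision rule, assuming linear discriminants gi(x) = wi^T x + bi with illustrative weights (not taken from the slides):

    import numpy as np

    W = np.array([[ 1.0,  0.0],     # w for class C1
                  [-1.0,  1.0],     # w for class C2
                  [ 0.0, -1.0]])    # w for class C3
    b = np.array([0.0, 0.5, -0.5])

    def classify(x):
        g = W @ x + b                   # g_i(x) for i = 1, ..., C
        return int(np.argmax(g)) + 1    # assign x to the class with the largest g_i(x)

    print(classify(np.array([2.0, 1.0])))   # -> class 1 here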
SLIDE 33

Properties of one-against-all classification

The number of classifiers equals C. Each binary classifier deals with a rather asymmetric problem, in the sense that training is carried out with many more negative than positive examples; this becomes more serious when the number of classes is relatively large. This technique, however, may lead to indeterminate regions, where more than one gi(x) is positive.

SLIDE 34

One-against-one classification

In this case, C(C − 1)/2 binary classifiers are trained and each classifier separates a pair of classes. The decision is made on the basis of a majority vote. The obvious disadvantage of the technique is that a relatively large number of binary classifiers has to be trained.

SLIDE 35

Error correcting coding classification

In this approach, the classification task is treated in the context of error correcting coding. For a C-class problem, a number of, say, L binary classifiers are used, where L is appropriately chosen by the designer. Each class is now represented by a binary code word of length L. During training, for the ith classifier, i = 1, 2, . . . , L, the desired labels t for each class are chosen to be either −1 or +1. For each class, the desired labels may be different for the various classifiers. This is equivalent to constructing a C × L matrix of desired labels. For example, such a matrix can be constructed for C = 4 and L = 6.

SLIDE 36

Error correcting coding classification (cont.)

For example, consider such a matrix with C = 4 and L = 6. During training, the first classifier (corresponding to the first column of the matrix) is designed to respond with (−1, +1, +1, −1) for examples of classes C1, C2, C3, C4, respectively. The second classifier is trained to respond with (−1, −1, +1, −1), and so on. The procedure is equivalent to grouping the classes into L different pairs, and, for each pair, we train a binary classifier accordingly. Each row must be distinct and corresponds to a class.

SLIDE 37

Error correcting coding classification (cont.)

When an unknown pattern is presented, the output of each one of the L binary classifiers is recorded, resulting in a code word. Then the Hamming distance (the number of places where two code words differ) between this code word and each of the C class code words is measured, and the pattern is classified to the class corresponding to the smallest distance. This feature is the power of the technique: if the code words are designed so that the minimum Hamming distance between any pair of them is, say, d, then a correct decision will still be reached even if the decisions of at most ⌊(d − 1)/2⌋ out of the L classifiers are wrong.

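A minimal sketch of this decoding step (the code-word matrix and the classifier outputs below are illustrative, not the matrix from the slides):

    import numpy as np

    # each row is the length-L code word of one class (here C = 4, L = 6)
    codewords = np.array([[-1, -1, +1, -1, +1, -1],
                          [+1, -1, -1, +1, -1, +1],
                          [+1, +1, -1, -1, +1, +1],
                          [-1, -1, -1, +1, +1, -1]])

    outputs = np.array([+1, -1, -1, +1, -1, -1])      # decisions of the L binary classifiers

    hamming = np.sum(codewords != outputs, axis=1)    # distance to each class code word
    print(int(np.argmin(hamming)) + 1)                # classify to the nearest code word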