

SLIDE 1

Dimensionality reduction

Machine Learning

Hamid Beigy

Sharif University of Technology

Fall 1393


SLIDE 2

Outline

1. Introduction
2. Feature selection methods
3. Feature extraction methods
   • Principal component analysis
   • Factor analysis
   • Linear discriminant analysis


SLIDE 3

Introduction

The complexity of any classifier or regressor depends on the number of input variables or features. These complexities include:

• Time complexity: In most learning algorithms, the time complexity depends on the number of input dimensions (D) as well as on the size of the training set (N). Decreasing D decreases the time complexity of the algorithm for both the training and testing phases.
• Space complexity: Decreasing D also decreases the amount of memory needed for the training and testing phases.
• Sample complexity: Usually the number of training examples (N) is a function of the length of the feature vectors (D). Hence, decreasing the number of features also decreases the required number of training examples. As a rule of thumb, the number of training patterns should be 10 to 20 times the number of features.


SLIDE 4

Introduction

There are several reasons why we are interested in reducing dimensionality as a separate preprocessing step:

• Decreasing the time complexity of classifiers or regressors.
• Decreasing the cost of extracting/producing unnecessary features.
• Simpler models are more robust on small data sets.
• Simpler models have lower variance and are thus less dependent on noise and outliers.
• The description of the classifier or regressor is simpler/shorter.
• Visualization of the data is simpler.


SLIDE 5

Peaking phenomenon

In practice, for a finite N, increasing the number of features first yields an improvement in performance; but beyond a critical value, further increasing the number of features increases the probability of error. This is known as the peaking phenomenon.


SLIDE 6

Dimensionality reduction methods

There are two main methods for reducing the dimensionality of inputs:

• Feature selection: These methods select d (d < D) dimensions out of the D dimensions and discard the other D − d dimensions.
• Feature extraction: These methods find a new set of d (d < D) dimensions that are combinations of the original dimensions.


SLIDE 7

Feature selection methods

Feature selection methods select d (d < D) dimensions out of the D dimensions and discard the other D − d dimensions. Reasons for performing feature selection include:

• Increasing the predictive accuracy of classifiers or regressors.
• Removing irrelevant features.
• Enhancing learning efficiency (reducing computational and storage requirements).
• Reducing the cost of future data collection (making measurements on only those variables relevant for discrimination/prediction).
• Reducing the complexity of the resulting classifier/regressor description (providing an improved understanding of the data and the model).

Feature selection is not necessarily required as a pre-processing step for classification/regression algorithms to perform well: several algorithms employ regularization techniques to handle over-fitting, or averaging, as in ensemble methods.


SLIDE 8

Feature selection methods

Feature selection methods can be categorized into three categories:

• Filter methods: These methods use the statistical properties of features to filter out poorly informative features.
• Wrapper methods: These methods evaluate the feature subset within a classifier/regression algorithm. They are classifier/regressor dependent and usually perform better than filter methods.
• Embedded methods: These methods embed the search for the optimal subset into the classifier/regressor design. They are also classifier/regressor dependent.

There are two key steps in the feature selection process:

• Evaluation: An evaluation measure is a means of assessing a candidate feature subset.
• Subset generation: A subset generation method is a means of generating a subset for evaluation.


SLIDE 9

Feature selection methods (Evaluation measures)

A large number of features are not informative: either irrelevant or redundant.

• Irrelevant features are those that do not contribute to a classification or regression rule.
• Redundant features are those that are strongly correlated.

In order to choose a good feature set, we require a measure of how features contribute to the separation of classes, either individually or in the context of already selected features. We need to measure both relevancy and redundancy. There are two types of measures:

• Measures that rely on the general properties of the data. These assess the relevancy of individual features and are used to eliminate feature redundancy. They are independent of the final classifier and inexpensive to compute, but may not detect redundancy well.
• Measures that use a classification rule as part of their evaluation. In this approach, a classifier is designed using the reduced feature set and a measure of classifier performance is employed to assess the selected features. A widely used measure is the error rate.


SLIDE 10

Feature selection methods (Evaluation measures)

The following measures rely on the general properties of the data:

• Feature ranking: Features are ranked by a metric and those that fail to achieve a prescribed score are eliminated (a minimal sketch follows this list). Example metrics: Pearson correlation, mutual information, and information gain.
• Interclass distance: A measure of distance between classes is defined based on distances between members of each class. Example metric: Euclidean distance.
• Probabilistic distance: The computation of a probabilistic distance between class-conditional probability density functions, i.e. the distance between p(x|C1) and p(x|C2). Example metric: the Chernoff dissimilarity measure.
• Probabilistic dependency: Multi-class criteria that measure the distance between the class-conditional probability density functions and the mixture probability density function for the data irrespective of class, i.e. the distance between p(x|Ci) and p(x). Example metric: the Joshi dissimilarity measure.
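As a concrete illustration of feature ranking, here is a minimal sketch that scores features by mutual information using scikit-learn; the iris data and the 0.1 threshold are illustrative assumptions, not part of the slides.

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.feature_selection import mutual_info_classif

    # Toy data; in practice X is an (N, D) matrix and y holds class labels.
    X, y = load_iris(return_X_y=True)

    # Score each feature individually by its mutual information with the labels.
    scores = mutual_info_classif(X, y, random_state=0)

    # Eliminate features that fail to achieve a prescribed score (threshold is arbitrary).
    selected = np.where(scores > 0.1)[0]
    print("MI scores:", np.round(scores, 3), "selected:", selected)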


SLIDE 11

Feature selection methods (Search algorithms)

• Complete search: These methods are guaranteed to find the optimal subset of features according to a specified evaluation criterion. Exhaustive search and branch-and-bound methods are complete.
• Sequential search: In these methods, features are added or removed sequentially. These methods are not optimal, but are simple to implement and fast to produce results. A minimal forward-selection sketch follows this list.
  • Best individual N: The simplest method is to score each feature individually and then select the N top-ranked features.
  • Sequential forward selection: A bottom-up procedure that adds new features to the feature set one at a time until the final feature set is reached.
  • Generalized sequential forward selection: At each step, r > 1 features are added instead of a single feature.
  • Sequential backward elimination: A top-down procedure that deletes a single feature at a time until the final feature set is reached.
  • Generalized sequential backward elimination: At each step, r > 1 features are deleted instead of a single feature.
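The sketch below shows sequential forward selection as a wrapper method, scoring candidate subsets by cross-validated accuracy; the k-NN classifier, the iris data, and the target size d = 2 are illustrative assumptions.

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)
    D, d = X.shape[1], 2              # keep d out of the D original features

    selected = []                     # indices of chosen features
    while len(selected) < d:
        best_j, best_score = None, -np.inf
        for j in range(D):
            if j in selected:
                continue
            cols = selected + [j]
            # Wrapper evaluation: cross-validated accuracy on the candidate subset.
            score = cross_val_score(KNeighborsClassifier(), X[:, cols], y, cv=5).mean()
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)       # add the single best feature, one at a time
    print("Selected features:", selected)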


SLIDE 12

Feature extraction

In feature extraction, we are interested in finding a new set of k (k < D) dimensions that are combinations of the original D dimensions. Feature extraction methods may be supervised or unsupervised. Examples of feature extraction methods:

• Principal component analysis (PCA)
• Factor analysis (FA)
• Multi-dimensional scaling (MDS)
• ISOMap
• Locally linear embedding (LLE)
• Linear discriminant analysis (LDA)


SLIDE 13

Principal component analysis

PCA projects D-dimensional input vectors to k-dimensional vectors (k < D) via a linear mapping with minimum loss of information. The problem is to find a matrix W such that the mapping

    Z = WᵀX

results in the minimum loss of information. PCA is unsupervised and tries to maximize the variance: the principal component is the direction w1 such that the sample, after projection onto w1, is most spread out, so that the differences between the sample points become most apparent. For uniqueness of the solution, we require ||w1|| = 1. Let Σ = Cov(x) and consider the principal component z1 = w1ᵀx. We have

    Var(z1) = E[(w1ᵀx − w1ᵀµ)²]
            = E[(w1ᵀx − w1ᵀµ)(w1ᵀx − w1ᵀµ)]
            = E[w1ᵀ(x − µ)(x − µ)ᵀw1]
            = w1ᵀ E[(x − µ)(x − µ)ᵀ] w1
            = w1ᵀΣw1


SLIDE 14

Principal component analysis (cont.)

The mapping problem becomes

    w1 = argmax_w wᵀΣw   subject to w1ᵀw1 = 1

Writing this as a Lagrange problem, we maximize over w1

    w1ᵀΣw1 − α(w1ᵀw1 − 1)

Taking the derivative with respect to w1 and setting it equal to 0, we obtain

    2Σw1 = 2αw1  ⇒  Σw1 = αw1

Hence w1 is an eigenvector of Σ and α is the corresponding eigenvalue. Since we want to maximize Var(z1), we have

    Var(z1) = w1ᵀΣw1 = αw1ᵀw1 = α

Hence we choose the eigenvector with the largest eigenvalue, i.e. λ1 = α.
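A quick numerical check of this result on toy data: the variance of the sample projected onto the top eigenvector of the sample covariance should equal the largest eigenvalue. All variable names are illustrative.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4))   # toy correlated data

    S = np.cov(X, rowvar=False)             # sample covariance
    eigvals, eigvecs = np.linalg.eigh(S)    # eigenvalues in ascending order
    w1 = eigvecs[:, -1]                     # eigenvector of the largest eigenvalue

    z1 = (X - X.mean(axis=0)) @ w1          # project the centered sample onto w1
    print(np.var(z1, ddof=1), eigvals[-1])  # the two values agree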


SLIDE 15

Principal component analysis (cont.)

The second principal component, w2, should also:

• maximize variance
• be of unit length
• be orthogonal to w1 (z1 and z2 must be uncorrelated)

The mapping problem for the second principal component becomes

    w2 = argmax_w wᵀΣw   subject to w2ᵀw2 = 1 and w2ᵀw1 = 0

Writing this as a Lagrange problem, we maximize over w2

    w2ᵀΣw2 − α(w2ᵀw2 − 1) − β(w2ᵀw1 − 0)

Taking the derivative with respect to w2 and setting it equal to 0, we obtain

    2Σw2 − 2αw2 − βw1 = 0

Pre-multiplying by w1ᵀ, we obtain

    2w1ᵀΣw2 − 2αw1ᵀw2 − βw1ᵀw1 = 0

Note that w1ᵀw2 = 0 and that w1ᵀΣw2 = (w2ᵀΣw1)ᵀ = w2ᵀΣw1 is a scalar.


SLIDE 16

Principal component analysis (cont.)

Since Σw1 = λ1w1, we have

    w1ᵀΣw2 = w2ᵀΣw1 = λ1w2ᵀw1 = 0

Then β = 0 and the problem reduces to

    Σw2 = αw2

This implies that w2 should be the eigenvector of Σ with the second largest eigenvalue, λ2 = α. Similarly, the other dimensions are given by the eigenvectors in order of decreasing eigenvalues. Since Σ is symmetric, eigenvectors corresponding to two different eigenvalues are orthogonal. (Show it!) If Σ is positive definite (xᵀΣx > 0 for all non-null vectors x), then all its eigenvalues are positive. If Σ is singular, then its rank is k (k < D) and λi = 0 for i = k + 1, . . . , D.


SLIDE 17

Principal component analysis (cont.)

Define

    Z = Wᵀ(X − m)

where the k columns of W are the k leading eigenvectors of S (the sample estimator of Σ) and m is the sample mean of X. Subtracting m from X before projection centers the data at the origin. How do we normalize the variances?
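A minimal numpy sketch of this projection, assuming a data matrix X of shape (N, D) with samples as rows; all names are illustrative.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))   # toy correlated data

    m = X.mean(axis=0)                  # sample mean
    S = np.cov(X, rowvar=False)         # sample estimate of Σ

    eigvals, eigvecs = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1]   # sort eigenpairs by decreasing eigenvalue
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    k = 2
    W = eigvecs[:, :k]                  # k leading eigenvectors as columns
    Z = (X - m) @ W                     # centered projection, shape (N, k)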


SLIDE 18

Principal component analysis (cont.)

How to select k? If all eigenvalues are positive and |S| = λ1λ2 · · · λD (the product of the eigenvalues) is small, then some eigenvalues contribute little to the variance and may be discarded. The scree graph is the plot of the variance (eigenvalue) as a function of the number of eigenvectors.


SLIDE 19

Principal component analysis (cont.)

How to select k? We select the leading k components that explain, for example, more than 95% of the variance. The proportion of variance (POV) is

    POV = (λ1 + λ2 + · · · + λk) / (λ1 + λ2 + · · · + λD)

By plotting POV against k and analyzing it visually, we can choose k.
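Continuing the variables from the projection sketch above, a short sketch of choosing k by the proportion of variance; the 95% target comes from this slide.

    # eigvals is sorted in descending order, as in the projection sketch.
    pov = np.cumsum(eigvals) / np.sum(eigvals)
    k = int(np.searchsorted(pov, 0.95)) + 1   # smallest k whose POV exceeds 95%
    print("cumulative POV:", np.round(pov, 3), "-> chosen k =", k)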


SLIDE 20

Principal component analysis (cont.)

How to select k? Another possibility is to ignore the eigenvectors whose corresponding eigenvalues are less than the average input variance (why?). In the pre-processing phase, it is better to pre-process the data so that each dimension has zero mean and unit variance (why and when?). Question: Can we use the correlation matrix instead of the covariance matrix? Derive the solution for this variant of PCA.


SLIDE 21

Principal component analysis (cont.)

PCA is sensitive to outliers: a few points distant from the center have a large effect on the variances and thus on the eigenvectors. Question: How can we use robust estimation methods to calculate the parameters in the presence of outliers? A simple method is to discard the isolated data points that lie far away. Question: When D is large, calculating, sorting, and processing S may be tedious. Is it possible to calculate the eigenvectors and eigenvalues directly from the data, without explicitly calculating the covariance matrix?
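One standard answer to the last question is a singular value decomposition of the centered data matrix, which never forms the D × D covariance matrix. A sketch, continuing the earlier variables; the identity λi = σi²/(N − 1), relating the singular values of the centered data to the eigenvalues of S, is standard but stated here without proof.

    # Continuing with X from the earlier sketches.
    Xc = X - X.mean(axis=0)
    U, sing, Vt = np.linalg.svd(Xc, full_matrices=False)

    eigvecs_svd = Vt.T                       # right singular vectors = eigenvectors of S
    eigvals_svd = sing**2 / (len(X) - 1)     # eigenvalues of S from singular values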


SLIDE 22

Principal component analysis (reconstruction error)

In PCA, the input vector x is projected to the z-space as

    z = Wᵀ(x − µ)

When W is an orthogonal matrix (WᵀW = I), z can be back-projected to the original space as

    x̂ = Wz + µ

x̂ is the reconstruction of x from its representation in the z-space. Question: Show that PCA minimizes the reconstruction error

    ∑i ||x̂i − xi||²

The reconstruction error depends on how many of the leading components are taken into account.
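A sketch of the back-projection and the resulting reconstruction error, reusing m, W, Z, and k from the earlier projection sketch.

    # Back-project the k-dimensional representations and measure the squared error.
    X_hat = Z @ W.T + m
    error = np.sum((X_hat - X) ** 2)     # ∑i ||x̂i − xi||²
    print("Reconstruction error with k =", k, ":", error)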


SLIDE 23

Principal component analysis (cont.)

PCA is unsupervised and does not use label/output information. The Karhunen-Loève expansion allows using label/output information. Common principal components assumes that the principal components are the same for each class, while the variances of these components differ across classes. Flexible discriminant analysis does a linear projection to a lower-dimensional space in which all features are uncorrelated; this method uses a minimum-distance classifier.


SLIDE 24

Factor analysis

In PCA, from the original dimensions xi (for i = 1, . . . , D), we form a new set of variables z that are linear combinations of the xi:

    Z = Wᵀ(X − µ)

In factor analysis (FA), we assume there is a set of unobservable, latent factors zj (for j = 1, . . . , k) which, when acting in combination, generate x; thus the direction is the opposite of PCA's. The goal is to characterize the dependency among the observed variables by means of a smaller number of factors. Suppose there is a group of variables that have high correlation among themselves and low correlation with all the other variables. Then there may be a single underlying factor that gave rise to these variables. FA, like PCA, is a one-group procedure and is unsupervised. The aim is to model the data in a smaller-dimensional space without loss of information; in FA, this is measured by the correlation between variables.


SLIDE 25

Factor analysis (cont.)


SLIDE 26

Factor analysis (cont.)

As in PCA, we have a sample x drawn from some unknown probability density with E[x] = µ and Cov(x) = Σ. We assume that the factors are unit normals, E[zj] = 0, Var(zj) = 1, and are uncorrelated: Cov(zi, zj) = 0 for i ≠ j. To explain what is not explained by the factors, there is an added noise source for each input, denoted by ǫi. It is assumed to be zero-mean, E[ǫi] = 0, with some unknown variance, Var(ǫi) = ψi. These specific sources are uncorrelated among themselves, Cov(ǫi, ǫj) = 0 for i ≠ j, and are also uncorrelated with the factors, Cov(ǫi, zj) = 0 for all i, j. FA assumes that each input dimension xi can be written as a weighted sum of the k < D factors zj plus a residual term:

    xi − µi = vi1z1 + vi2z2 + . . . + vikzk + ǫi

In vector-matrix form,

    x − µ = Vz + ǫ

where V is the D × k matrix of weights, called factor loadings.


SLIDE 27

Factor analysis (cont.)

We assume that µ = 0; we can always add µ back after projection. Given Var(zj) = 1 and Var(ǫi) = ψi, we have

    Var(xi) = vi1² + vi2² + . . . + vik² + ψi

In vector-matrix form,

    Σ = Cov(x) = VVᵀ + Ψ

There are two uses of factor analysis:

• It can be used for knowledge extraction, when we find the loadings and try to express the variables using fewer factors.
• It can also be used for dimensionality reduction when k < D.


SLIDE 28

Factor analysis (cont.)

For dimensionality reduction, we need to find the factors zj from the xi. We want to find loadings wji such that

    zj = ∑i wjixi + ǫj    (sum over i = 1, . . . , D)

In vector form, for input x, this can be written as

    z = Wᵀx + ǫ

For N inputs, the problem reduces to linear regression, whose solution is

    W = (XᵀX)⁻¹XᵀZ

We do not know Z; multiplying and dividing both sides by N − 1, we obtain (derive it!)

    W = (N − 1)(XᵀX)⁻¹ · XᵀZ/(N − 1) = (XᵀX/(N − 1))⁻¹ · XᵀZ/(N − 1) = S⁻¹V

and hence

    Z = XW = XS⁻¹V
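In practice, the loadings and factors can be estimated by maximum likelihood; a minimal sketch with scikit-learn's FactorAnalysis, where the toy data and n_components=2 are illustrative assumptions.

    import numpy as np
    from sklearn.decomposition import FactorAnalysis

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 6)) @ rng.normal(size=(6, 6))   # toy correlated data

    fa = FactorAnalysis(n_components=2, random_state=0)
    Z = fa.fit_transform(X)          # factor scores, shape (N, k)
    V = fa.components_.T             # factor loadings, shape (D, k)
    psi = fa.noise_variance_         # specific variances ψi, shape (D,)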


SLIDE 29

Linear discriminant analysis

Linear discriminant analysis (LDA) is a supervised dimensionality reduction method for classification problems. Consider first the case of two classes, and suppose we take the D-dimensional input vector x and project it down to one dimension using

    z = Wᵀx

If we place a threshold on z and classify z ≥ w0 as class C1, and otherwise as class C2, then we obtain our standard linear classifier. In general, the projection onto one dimension leads to a considerable loss of information, and classes that are well separated in the original D-dimensional space may overlap strongly in one dimension. However, by adjusting the components of the weight vector W, we can select a projection that maximizes the class separation. Consider a two-class problem with N1 points of class C1 and N2 points of class C2, so that the mean vectors of the two classes are given by

    mj = (1/Nj) ∑i∈Cj xi


SLIDE 30

Linear discriminant analysis (cont.)

The simplest measure of the separation of the classes, when projected onto W, is the separation of the projected class means. This suggests choosing W so as to maximize

    m2 − m1 = Wᵀ(m2 − m1),   where mj = Wᵀmj

This expression can be made arbitrarily large simply by increasing the magnitude of W. To solve this problem, we constrain W to have unit length, so that ∑i Wi² = 1. Using a Lagrange multiplier to perform the constrained maximization, we then find that

    W ∝ (m2 − m1)


SLIDE 31

Linear discriminant analysis (cont.)

This approach has a problem: the figure on this slide shows two classes that are well separated in the original two-dimensional space but that overlap considerably when projected onto the line joining their means. This difficulty arises from the strongly non-diagonal covariances of the class distributions. The idea proposed by Fisher is to maximize a function that gives a large separation between the projected class means while also giving a small variance within each class, thereby minimizing the class overlap.


SLIDE 32

Linear discriminant analysis (cont.)

The idea proposed by Fisher is to maximize a function that gives a large separation between the projected class means while also giving a small variance within each class, thereby minimizing the class overlap. The projection z = Wᵀx transforms the set of labeled data points in x into a labeled set in the one-dimensional space z. The within-class variance of the transformed data from class Cj equals

    sj² = ∑i∈Cj (zi − mj)²   where zi = Wᵀxi

We define the total within-class variance for the whole data set to be s1² + s2². The Fisher criterion is defined to be the ratio of the between-class variance to the within-class variance:

    J(W) = (m2 − m1)² / (s1² + s2²)


SLIDE 33

Linear discriminant analysis (cont.)

The between-class covariance matrix equals

    SB = (m2 − m1)(m2 − m1)ᵀ

The total within-class covariance matrix equals

    SW = ∑i∈C1 (xi − m1)(xi − m1)ᵀ + ∑i∈C2 (xi − m2)(xi − m2)ᵀ

J(W) can then be written as

    J(W) = (WᵀSBW) / (WᵀSWW)

To maximize J(W), we differentiate with respect to W and obtain

    (WᵀSBW) SWW = (WᵀSWW) SBW

SBW is always in the direction of (m2 − m1) (show it!). We can drop the scalar factors WᵀSBW and WᵀSWW (why?). Multiplying both sides by SW⁻¹, we then obtain (show it!)

    W ∝ SW⁻¹(m2 − m1)
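A minimal numpy sketch of this two-class Fisher direction; the toy Gaussian classes and all names are illustrative.

    import numpy as np

    rng = np.random.default_rng(0)
    cov = [[1.0, 0.8], [0.8, 1.0]]                       # shared, non-diagonal covariance
    X1 = rng.multivariate_normal([0, 0], cov, size=100)  # class C1
    X2 = rng.multivariate_normal([2, 1], cov, size=100)  # class C2

    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)

    # Total within-class scatter matrix S_W.
    SW = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)

    # Fisher's direction: W ∝ S_W^{-1} (m2 − m1).
    w = np.linalg.solve(SW, m2 - m1)
    w /= np.linalg.norm(w)

    z1, z2 = X1 @ w, X2 @ w          # projected one-dimensional data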


SLIDE 34

Linear discriminant analysis (cont.)

The result W ∝ SW⁻¹(m2 − m1) is known as Fisher's linear discriminant, although strictly it is not a discriminant but rather a specific choice of direction for projecting the data down to one dimension. The projected data can subsequently be used to construct a discriminant function. The idea can be extended to multiple classes (read Section 4.1.6 of Bishop). How can Fisher's linear discriminant be used for dimensionality reduction? (Show it!)
