Feature Reduction and Selection
Selim Aksoy
Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr
CS 551, Spring 2019
◮ In practical multicategory applications, it is not unusual to encounter problems involving tens or hundreds of features.
◮ Intuitively, it may seem that each feature is useful for at least some of the discriminations.
◮ In general, if the performance obtained with a given set of features is inadequate, it is natural to consider adding new features.
◮ Even though increasing the number of features increases the cost and complexity of both the feature extraction and the classification, it is often reasonable to believe that the performance will improve.
◮ Unfortunately, it has frequently been observed in practice that, beyond a certain point, adding new features can actually lead to worse performance.
◮ This is called the curse of dimensionality.
◮ There are two issues that we must be careful about:
◮ How is the classification accuracy affected by the dimensionality (relative to the amount of training data)?
◮ How is the complexity of the classifier affected by the dimensionality?
◮ Potential reasons for increase in error include
◮ wrong assumptions in model selection,
◮ estimation errors due to the finite number of training samples for high-dimensional observations (overfitting).
◮ Potential solutions include
◮ reducing the dimensionality,
◮ simplifying the estimation.
◮ Dimensionality can be reduced by
◮ redesigning the features,
◮ selecting an appropriate subset among the existing features,
◮ combining existing features.
◮ Estimation can be simplified by
◮ assuming equal covariance for all classes (for the Gaussian case),
◮ using regularization,
◮ using prior information and a Bayes estimate,
◮ using heuristics such as conditional independence,
◮ ...
◮ All of the commonly used classifiers can suffer from the curse of dimensionality.
◮ While an exact relationship between the probability of error, the number of training samples, the number of features, and the complexity of the classifier is very difficult to establish, some guidelines have been suggested.
◮ It is generally accepted that using at least ten times as many training samples per class as the number of features (n/d > 10) is a good practice to follow.
◮ The more complex the classifier, the larger should the ratio of sample size to dimensionality be.
◮ One way of coping with the problem of high dimensionality is to reduce the dimensionality by combining features.
◮ Issues in feature reduction:
◮ Linear vs. non-linear transformations.
◮ Use of class labels or not (depends on the availability of labels).
◮ Training objective:
◮ minimizing classification error (discriminative training),
◮ minimizing reconstruction error (PCA),
◮ maximizing class separability (LDA),
◮ retaining interesting directions (projection pursuit),
◮ making features as independent as possible (ICA),
◮ embedding to lower dimensional manifolds (Isomap, LLE).
◮ Linear combinations are particularly attractive because they are simple to compute and analytically tractable.
◮ Linear methods project the high-dimensional data onto a lower dimensional subspace.
◮ Advantages of these projections include
◮ reduced complexity in estimation and classification,
◮ ability to visually examine the multivariate data in two or three dimensions.
◮ Given x ∈ R^d, the goal is to find a linear transformation A that gives y = A^T x ∈ R^{d′}, d′ < d, subject to an optimization criterion.
◮ Two classical approaches for finding optimal linear transformations are:
◮ Principal Components Analysis (PCA): Seeks a projection that best represents the data in a least-squares sense.
◮ Linear Discriminant Analysis (LDA): Seeks a projection that best separates the data in a least-squares sense.
◮ Given x_1, . . . , x_n ∈ R^d, the goal is to find a d′-dimensional subspace such that the reconstruction error of the samples in this subspace is minimized.
◮ The criterion function for the reconstruction error can be defined as
  J_{d′} = Σ_{k=1}^{n} ‖ (m + Σ_{i=1}^{d′} a_{ki} e_i) − x_k ‖²
  where m is the sample mean, e_1, . . . , e_{d′} are unit vectors spanning the subspace, and a_{ki} is the coefficient of x_k along e_i.
◮ It can be shown that J_{d′} is minimized when e_1, . . . , e_{d′} are the d′ eigenvectors of the scatter matrix
  S = Σ_{k=1}^{n} (x_k − m)(x_k − m)^T
  having the largest eigenvalues.
◮ The coefficients y = (y_1, . . . , y_{d′})^T, where y_i = e_i^T (x − m), are called the principal components.
◮ When the eigenvectors are sorted in descending order of the corresponding eigenvalues, the greatest amount of variance is captured by the first principal component, the second greatest by the second component, and so on.
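As a concrete illustration, the following is a minimal NumPy sketch of PCA via the eigendecomposition of the scatter matrix; the function and variable names (pca, X, d_prime) are illustrative and not from the slides.

```python
import numpy as np

def pca(X, d_prime):
    """Project the rows of X (n x d) onto the top d_prime principal components."""
    m = X.mean(axis=0)                   # sample mean
    Xc = X - m                           # centered data
    S = Xc.T @ Xc                        # scatter matrix (d x d)
    evals, evecs = np.linalg.eigh(S)     # eigenvalues in ascending order
    order = np.argsort(evals)[::-1]      # sort descending
    E = evecs[:, order[:d_prime]]        # e_1, ..., e_d' as columns
    Y = Xc @ E                           # principal components y_i = e_i^T (x - m)
    return Y, E, m, evals[order]
```

The data can then be approximately reconstructed as m + Y E^T, which is exactly the reconstruction that the criterion J_{d′} measures.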
◮ Often there will be just a few large eigenvalues, and this implies that most of the variation in the data is captured by the d′-dimensional subspace spanned by the corresponding eigenvectors.
◮ The actual subspace where the data may lie is related to the intrinsic dimensionality of the data, i.e., the smallest number of parameters needed to account for its observed properties.
◮ The geometric interpretation of intrinsic dimensionality is that the entire data set lies on or near a lower dimensional manifold embedded in the original d-dimensional space.
◮ Note that the intrinsic dimensionality is not the same as the linear dimensionality found by PCA; when the manifold is non-linear, a linear subspace may need more dimensions to contain the data.
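A common heuristic, not prescribed in the slides, is to pick d′ as the smallest number of leading eigenvalues that capture a fixed fraction of the total variance; a sketch using the sorted eigenvalues returned by the PCA code above:

```python
import numpy as np

def choose_d_prime(sorted_evals, fraction=0.95):
    """Smallest d' whose leading eigenvalues explain `fraction` of the total variance."""
    ratios = np.cumsum(sorted_evals) / np.sum(sorted_evals)
    return int(np.searchsorted(ratios, fraction)) + 1
```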
Figure: (a) Scatter plot. (b) Projection onto e_1. (c) Projection onto e_2.
◮ Whereas PCA seeks directions that are efficient for representation, discriminant analysis seeks directions that are efficient for discrimination.
◮ Given x_1, . . . , x_n ∈ R^d divided into two subsets D_1 and D_2 corresponding to the classes w_1 and w_2, the goal is to find the direction w for which the projections y = w^T x of the two classes are best separated.
◮ The criterion function for the best separation can be defined as
  J(w) = |m̃_1 − m̃_2|² / (s̃_1² + s̃_2²)
  where m̃_i = (1/#D_i) Σ_{y∈w_i} y is the mean and s̃_i² = Σ_{y∈w_i} (y − m̃_i)² is the scatter of the projected samples of class w_i.
◮ This is called Fisher's linear discriminant: the linear function y = w^T x that maximizes J(w).
◮ To compute the optimal w, we define the scatter matrices S_i = Σ_{x∈D_i} (x − m_i)(x − m_i)^T, the within-class scatter matrix S_W = S_1 + S_2, and the between-class scatter matrix S_B = (m_1 − m_2)(m_1 − m_2)^T, where m_i is the mean of the samples in D_i.
◮ Then, the criterion function becomes
  J(w) = (w^T S_B w) / (w^T S_W w)
  which is maximized by w = S_W^{−1}(m_1 − m_2).
◮ Note that S_W is symmetric and positive semidefinite, and it is usually nonsingular when n > d.
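A minimal sketch of the two-class Fisher discriminant, assuming the samples of the two classes are given as the rows of NumPy arrays X1 and X2 (illustrative names):

```python
import numpy as np

def fisher_lda(X1, X2):
    """Return the Fisher direction w proportional to S_W^{-1} (m_1 - m_2)."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - m1).T @ (X1 - m1)          # class scatter matrices
    S2 = (X2 - m2).T @ (X2 - m2)
    SW = S1 + S2                          # within-class scatter
    w = np.linalg.solve(SW, m1 - m2)      # optimal projection direction
    return w / np.linalg.norm(w)          # normalized; samples project as y = w^T x
```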
◮ Generalization to c classes involves c − 1 discriminant functions, i.e., the projection is from the d-dimensional space to a (c − 1)-dimensional space (assuming d ≥ c).
◮ The scatter matrices S_i are computed as
  S_i = Σ_{x∈D_i} (x − m_i)(x − m_i)^T.
◮ The within-class scatter matrix S_W is computed as
  S_W = Σ_{i=1}^{c} S_i.
◮ The between-class scatter matrix S_B is computed as
  S_B = Σ_{i=1}^{c} n_i (m_i − m)(m_i − m)^T
  where n_i is the number of samples in D_i and m = (1/n) Σ_x x is the total mean vector.
◮ Then, the criterion function becomes
  J(W) = |W^T S_B W| / |W^T S_W W|
  where W is the d × (c − 1) projection matrix and | · | denotes the determinant.
◮ It can be shown that J(W) is maximized when the columns of W are the eigenvectors of S_W^{−1} S_B having the largest eigenvalues.
◮ Because S_B is the sum of c matrices of rank one or less, and only c − 1 of these are independent, S_B has rank c − 1 or less, so at most c − 1 of the eigenvalues are nonzero.
◮ Once the transformation from the d-dimensional original space to the (c − 1)-dimensional new space is computed, a classifier can be trained in the reduced space.
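A hedged sketch of the c-class case, solving the eigenproblem for S_W^{-1} S_B with a pseudo-inverse in case S_W is singular; the names lda, labels, and d_prime are illustrative, and d_prime should be at most c − 1.

```python
import numpy as np

def lda(X, labels, d_prime):
    """Columns of W are the eigenvectors of S_W^{-1} S_B with the largest eigenvalues."""
    d = X.shape[1]
    m = X.mean(axis=0)
    SW, SB = np.zeros((d, d)), np.zeros((d, d))
    for c in np.unique(labels):
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)
        SW += (Xc - mc).T @ (Xc - mc)                  # within-class scatter
        SB += len(Xc) * np.outer(mc - m, mc - m)       # between-class scatter
    evals, evecs = np.linalg.eig(np.linalg.pinv(SW) @ SB)
    order = np.argsort(evals.real)[::-1]
    W = evecs[:, order[:d_prime]].real                 # at most c - 1 useful columns
    return X @ W, W
```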
Figure: (a) Scatter plot. (b) Projection onto the first PCA axis. (c) Projection onto the first LDA axis.
Figure: (a) Scatter plot. (b) Projection onto the first PCA axis. (c) Projection onto the first LDA axis.
◮ The isometric feature mapping (Isomap) algorithm combines the major algorithmic features of PCA and multi-dimensional scaling (MDS), namely
◮ computational efficiency,
◮ global optimality, and
◮ asymptotic convergence guarantees,
with the flexibility to learn a broad class of non-linear manifolds.
◮ The approach seeks to preserve the intrinsic geometry of the data, as captured in the geodesic manifold distances between all pairs of data points.
◮ The essential point is to estimate the geodesic distance between faraway points, given only input-space distances.
◮ For neighboring points, input-space distance provides a good approximation to geodesic distance.
◮ For faraway points, geodesic distance can be approximated by adding up a sequence of "short hops" between neighboring points.
◮ These approximations are computed efficiently by finding shortest paths in a graph with edges connecting neighboring data points.
◮ The Isomap algorithm has three steps.
◮ The first step determines which points are neighbors on the manifold M, based on the distances d_X(i, j) between pairs of points i, j in the input space X.
◮ A sparse graph G is defined over all data points by connecting points i and j if they are closer than a threshold ε (ε-Isomap), or if i is one of the K nearest neighbors of j (K-Isomap), with edge weights d_X(i, j).
◮ In the second step, Isomap estimates the geodesic distances d_M(i, j) between all pairs of points on the manifold M by computing their shortest path distances d_G(i, j) in the graph G.
◮ The final step applies classical multi-dimensional scaling to the matrix of graph distances D_G = {d_G(i, j)}, constructing an embedding of the data in a d′-dimensional Euclidean space Y that best preserves the manifold's estimated intrinsic geometry.
◮ The coordinate vectors y_i for points in Y are chosen to minimize the cost function
  E = ‖ τ(D_G) − τ(D_Y) ‖
  where D_Y denotes the matrix of Euclidean distances {d_Y(i, j) = ‖y_i − y_j‖}.
◮ The τ operator is defined as τ(D) = −HSH/2, where S is the matrix of squared distances {S_ij = D_ij²}, and H is the centering matrix {H_ij = δ_ij − 1/N}.
◮ The global minimum of the cost function is achieved by setting the coordinates y_i to the top d′ eigenvectors of the matrix τ(D_G).
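The three steps map onto standard library routines. Below is a minimal sketch assuming the K-neighbor variant, a connected neighborhood graph, and illustrative parameter names (n_neighbors, d_prime); it uses scikit-learn's kneighbors_graph and SciPy's shortest_path, which is an assumed wiring rather than the original implementation.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

def isomap(X, n_neighbors=10, d_prime=2):
    """Isomap: neighbor graph -> graph geodesics -> classical MDS."""
    # Step 1: sparse graph G with edge weights d_X(i, j) between neighbors.
    G = kneighbors_graph(X, n_neighbors, mode='distance')
    # Step 2: approximate geodesic distances by shortest paths in G.
    D_G = shortest_path(G, directed=False)
    # Step 3: classical MDS on the matrix of graph distances.
    N = D_G.shape[0]
    H = np.eye(N) - np.ones((N, N)) / N      # centering matrix H_ij = delta_ij - 1/N
    T = -H @ (D_G ** 2) @ H / 2              # tau(D_G) = -H S H / 2
    evals, evecs = np.linalg.eigh(T)
    idx = np.argsort(evals)[::-1][:d_prime]  # top d' eigenvectors
    return evecs[:, idx] * np.sqrt(np.maximum(evals[idx], 0))
```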
◮ Pros:
◮ A non-iterative, polynomial time procedure with a guarantee of global optimality.
◮ A guarantee of asymptotic convergence to the true structure of the manifold.
◮ Single free parameter (ε or K).
◮ Cons:
◮ Sensitive to noise.
◮ Computationally expensive (dense matrix eigen-reduction).
◮ The locally linear embedding (LLE) algorithm is based on simple geometric intuitions about locally linear patches of a non-linear manifold.
◮ Suppose that the data consist of N real-valued vectors x_i, each of dimensionality d, sampled from some underlying manifold.
◮ Provided there is sufficient data (such that the manifold is well-sampled), we expect each data point and its neighbors to lie on or close to a locally linear patch of the manifold.
◮ The local geometry of these patches is characterized by linear coefficients that reconstruct each data point from its neighbors.
◮ The reconstruction errors are measured by the cost function
  E(W) = Σ_i ‖ x_i − Σ_j W_ij x_j ‖².
◮ The weights W_ij summarize the contribution of the j'th data point to the i'th reconstruction.
◮ To compute the weights W_ij, the cost function is minimized subject to two constraints:
◮ each data point x_i is reconstructed only from its neighbors, enforcing W_ij = 0 if x_j is not one of the neighbors of x_i;
◮ the rows of the weight matrix sum to one: Σ_j W_ij = 1.
◮ The optimal weights W_ij subject to these constraints are found by solving a least-squares problem.
◮ The constrained weights that minimize these reconstruction errors obey an important symmetry: for any particular data point, they are invariant to rotations, rescalings, and translations of that data point and its neighbors.
◮ Suppose that the data lie on or near a smooth non-linear manifold of lower dimensionality d′ ≪ d.
◮ By design, the reconstruction weights W_ij reflect intrinsic geometric properties of the data that are invariant to exactly such transformations.
◮ Therefore, their characterization of local geometry in the original data space is expected to be equally valid for local patches on the manifold.
◮ LLE constructs a neighborhood-preserving mapping based on this idea.
◮ In the final step of the algorithm, each high-dimensional observation x_i is mapped to a low-dimensional vector y_i representing global internal coordinates on the manifold.
◮ This is done by choosing d′-dimensional coordinates y_i to minimize the embedding cost function
  Φ(Y) = Σ_i ‖ y_i − Σ_j W_ij y_j ‖².
◮ This cost function, like the previous one, is based on locally linear reconstruction errors, but here the weights W_ij are fixed while the coordinates y_i are optimized.
◮ The cost function can be minimized by solving a sparse N × N eigenvalue problem.
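A compact sketch of LLE following these two steps; it uses dense matrices for clarity, whereas the sparse structure of W would be exploited in practice, and the regularization term reg is an assumption needed when the local Gram matrix is singular (all names are illustrative).

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def lle(X, n_neighbors=10, d_prime=2, reg=1e-3):
    """Locally linear embedding: reconstruction weights, then an eigenproblem."""
    N = X.shape[0]
    idx = NearestNeighbors(n_neighbors=n_neighbors + 1).fit(X).kneighbors(X)[1][:, 1:]
    W = np.zeros((N, N))
    for i in range(N):
        Z = X[idx[i]] - X[i]                            # neighbors shifted to x_i
        C = Z @ Z.T                                     # local Gram matrix
        C += reg * np.trace(C) * np.eye(n_neighbors)    # regularize if singular
        w = np.linalg.solve(C, np.ones(n_neighbors))    # least-squares weights
        W[i, idx[i]] = w / w.sum()                      # rows sum to one
    # Embedding: eigenvectors of M = (I - W)^T (I - W) with the smallest
    # nonzero eigenvalues; the constant bottom eigenvector is discarded.
    M = (np.eye(N) - W).T @ (np.eye(N) - W)
    evals, evecs = np.linalg.eigh(M)
    return evecs[:, 1:d_prime + 1]
```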
◮ Pros:
◮ Globally optimal result.
◮ Single free parameter (number of neighbors, K).
◮ Simple linear algebra operations using sparse matrices.
◮ Cons:
◮ Sensitive to noise.
◮ No theoretical guarantees.
◮ An alternative to feature reduction, which uses linear or non-linear combinations of the original features, is feature selection, where dimensionality is reduced by selecting a subset of the existing features.
◮ The first step in feature selection is to define a criterion function that measures the quality of a candidate feature subset, often in terms of the classification error.
◮ Note that the use of classification error in the criterion function makes feature selection dependent on the specific classifier and training data that are used.
◮ The most straightforward approach would require
◮ examining all C(d, m) = d! / (m! (d − m)!) possible subsets of size m, and
◮ selecting the subset that performs the best according to the criterion function.
◮ The number of subsets grows combinatorially, making the exhaustive search impractical for even moderate values of d and m.
◮ Iterative procedures are often used instead, but they cannot guarantee the selection of the optimal subset.
◮ Sequential forward selection (a sketch is given after this list):
◮ First, the best single feature is selected.
◮ Then, pairs of features are formed using one of the remaining features and this best feature, and the best pair is selected.
◮ Next, triplets of features are formed using one of the remaining features and these two best features, and the best triplet is selected.
◮ This procedure continues until all or a predefined number of features are selected.
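A minimal sketch of the greedy forward procedure; X is assumed to be a NumPy array of samples by features, and criterion is an arbitrary user-supplied function (e.g., cross-validated accuracy of some classifier on the candidate subset). All names are illustrative.

```python
def sequential_forward_selection(X, y, criterion, n_select):
    """Greedily add the feature that maximizes the criterion at each step."""
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < n_select:
        scores = [criterion(X[:, selected + [j]], y) for j in remaining]
        best = remaining[scores.index(max(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected
```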
◮ Sequential backward selection (a sketch is given after this list):
◮ First, the criterion function is computed for the full set of d features.
◮ Then, each feature is deleted one at a time, the criterion function is computed for all subsets with d − 1 features, and the worst feature is discarded.
◮ Next, each feature among the remaining d − 1 is deleted one at a time, and the worst feature is discarded to form the best subset with d − 2 features.
◮ This procedure continues until one feature or a predefined number of features are left.
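The backward variant follows the same conventions as the forward sketch above; again, criterion and the array names are assumptions rather than slide content.

```python
def sequential_backward_selection(X, y, criterion, n_keep):
    """Greedily delete the feature whose removal hurts the criterion the least."""
    selected = list(range(X.shape[1]))
    while len(selected) > n_keep:
        scores = [criterion(X[:, [f for f in selected if f != j]], y) for j in selected]
        worst = selected[scores.index(max(scores))]   # highest score after deletion
        selected.remove(worst)
    return selected
```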
Figure: Sequential forward selection; classification accuracy as each candidate feature (DEM elevation, IKONOS and aerial spectral bands, and their Gabor texture features) is added.
Figure: Sequential backward selection; classification accuracy as each candidate feature (DEM elevation, IKONOS and aerial spectral bands, and their Gabor texture features) is removed.
◮ The choice between feature reduction and feature selection depends on the application domain and the specific training data that are available.
◮ Feature selection leads to savings in computational costs because only the selected features need to be computed, and the selected features retain their original physical interpretation.
◮ Feature reduction with transformations may provide a better discriminative ability, but the new features may not have a clear physical meaning.