Feature Reduction and Selection
Selim Aksoy Bilkent University Department of Computer Engineering saksoy@cs.bilkent.edu.tr
CS 551, Spring 2005
Introduction

In practical multicategory applications, it is not unusual to encounter problems involving a large number of features, which raises two questions:
◮ How is the classification accuracy affected by the dimensionality of the feature space (for a given amount of training data)?
◮ How is the computational complexity of the classifier affected by the dimensionality?
For two classes with equal prior probabilities and Gaussian densities with equal covariance matrices, the Bayes error is

P(error) = (1/√(2π)) ∫_{r/2}^{∞} e^{−u^2/2} du,

where r^2 = (μ_1 − μ_2)^T Σ^{−1} (μ_1 − μ_2) is the squared Mahalanobis distance between the class means, so the error decreases as r grows.
If the features are conditionally independent, the shared covariance matrix is diagonal, Σ = diag(σ_1^2, . . . , σ_d^2), and the squared Mahalanobis distance becomes

r^2 = ∑_{i=1}^{d} ((μ_{i1} − μ_{i2}) / σ_i)^2,

so every feature whose class means differ contributes to r and further reduces the Bayes error.
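As a rough numerical illustration (not from the slides), here is a minimal Python sketch that evaluates this error for a shared diagonal covariance; the function name bayes_error and the example values are made up, and NumPy and SciPy are assumed to be available.

    import numpy as np
    from scipy.stats import norm

    def bayes_error(mu1, mu2, sigma):
        # Squared Mahalanobis distance for a shared diagonal covariance
        # diag(sigma**2): r^2 = sum_i ((mu_i1 - mu_i2) / sigma_i)^2.
        mu1, mu2, sigma = map(np.asarray, (mu1, mu2, sigma))
        r = np.sqrt(np.sum(((mu1 - mu2) / sigma) ** 2))
        # P(error) = (1/sqrt(2*pi)) * integral_{r/2}^{inf} exp(-u^2/2) du,
        # i.e. the upper tail of the standard normal evaluated at r/2.
        return norm.sf(r / 2.0)

    # Adding a feature whose class means differ increases r and lowers the error.
    print(bayes_error([0, 0], [1, 1], [1, 1]))            # two features
    print(bayes_error([0, 0, 0], [1, 1, 1], [1, 1, 1]))   # three features, smaller error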
Figure 1: There is a non-zero Bayes error in the one-dimensional x1 space or the two-dimensional x1, x2 space. However, the Bayes error vanishes in the x1, x2, x3 space because of non-overlapping densities.
◮ reducing the dimensionality
◮ simplifying the estimation
◮ redesigning the features
◮ selecting an appropriate subset among the existing features
◮ combining existing features
◮ assuming equal covariance matrices for all classes (for the Gaussian case)
◮ using prior information and a Bayes estimate
◮ using heuristics such as conditional independence
Figure 2: The problem of insufficient data is analogous to problems in curve fitting. The training data (black dots) are selected from a quadratic function plus Gaussian noise. A high-order polynomial can fit the training points perfectly but varies wildly between them, so with limited data it is better to fit a simpler, lower-order polynomial for better generalization.
◮ Linear vs. non-linear transformations
◮ Use of class labels or not (depends on the availability of labeled training data)
◮ Training objective: best representation of the data vs. best separation of the classes
◮ reduced complexity in estimation and classification
◮ ability to visually examine the multivariate data in two or three dimensions
◮ Principal Components Analysis (PCA): Seeks a projection that best represents the data in a least-squares sense.
◮ Linear Discriminant Analysis (LDA): Seeks a projection that best separates the data in a least-squares sense.
Given samples x_1, . . . , x_n, PCA first computes the sample mean

m = (1/n) ∑_{k=1}^{n} x_k,

which is the single point that minimizes the sum of squared distances to the samples.
The scatter matrix of the samples is

S = ∑_{k=1}^{n} (x_k − m)(x_k − m)^T,

and the principal axes e_1, . . . , e_{d′} that minimize the squared reconstruction error are the eigenvectors of S corresponding to the largest eigenvalues.
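A minimal Python sketch of this computation, assuming NumPy is available; the function name pca_project and the variable names are illustrative, not from the slides.

    import numpy as np

    def pca_project(X, k):
        """Project the rows of X (n samples x d features) onto the k
        eigenvectors of the scatter matrix with the largest eigenvalues."""
        m = X.mean(axis=0)                # sample mean
        Xc = X - m                        # centered samples
        S = Xc.T @ Xc                     # scatter matrix (proportional to the sample covariance)
        evals, evecs = np.linalg.eigh(S)  # eigh: S is symmetric
        order = np.argsort(evals)[::-1]   # sort eigenvalues in decreasing order
        E = evecs[:, order[:k]]           # top-k principal axes e_1, ..., e_k
        return Xc @ E                     # projected (uncorrelated) features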
(a) Scatter plot. (b) Projection onto e1. (c) Projection onto e2.
Figure 3: Scatter plot (red dots) and the principal axes for a bivariate sample. The blue line shows the axis e1 with the greatest variance and the green line shows the axis e2 with the smallest variance. Features are now uncorrelated.
Figure 4: Scatter plot of the iris data. The diagonal cells show the histogram of each feature. The off-diagonal cells show pairwise scatter plots of the features x1, x2, x3, x4, ordered top-to-bottom and left-to-right. Red, green and blue points represent samples from the setosa, versicolor and virginica classes, respectively.
Figure 5: Scatter plot of the projection of the iris data onto the first two and the first three principal axes. Red, green and blue points represent samples for the setosa, versicolor and virginica classes, respectively.
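For reference, a short sketch of how a projection like the one in Figure 5 could be reproduced, assuming scikit-learn and matplotlib are available; this is not the code used to produce the original figure.

    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA
    import matplotlib.pyplot as plt

    iris = load_iris()
    Z = PCA(n_components=2).fit_transform(iris.data)  # project onto the first two principal axes
    plt.scatter(Z[:, 0], Z[:, 1], c=iris.target)      # colors correspond to the three classes
    plt.xlabel("first principal axis")
    plt.ylabel("second principal axis")
    plt.show()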
Figure 6: Projection of the same set of samples onto two different lines in the directions marked as w. The figure on the right shows greater separation between the red and black projected points.
For classes i = 1, 2 with sample sets D_i, the sample means are

m_i = (1/#D_i) ∑_{x ∈ D_i} x,

and after the projection y = w^T x, the projected means and scatters are

m̃_i = w^T m_i   and   s̃_i^2 = ∑_{y ∈ Y_i} (y − m̃_i)^2.

Fisher's linear discriminant seeks the direction w that maximizes the criterion

J(w) = |m̃_1 − m̃_2|^2 / (s̃_1^2 + s̃_2^2).
The direction that maximizes J(w) is

w = S_W^{−1} (m_1 − m_2),

where S_W = S_1 + S_2 is the within-class scatter matrix.
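A minimal sketch of this two-class solution in Python, assuming NumPy; fisher_lda_direction is an illustrative name, not from the slides.

    import numpy as np

    def fisher_lda_direction(X1, X2):
        """Fisher's linear discriminant for two classes: w = S_W^{-1} (m1 - m2)."""
        m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
        S1 = (X1 - m1).T @ (X1 - m1)      # within-class scatter of class 1
        S2 = (X2 - m2).T @ (X2 - m2)      # within-class scatter of class 2
        Sw = S1 + S2                      # total within-class scatter
        w = np.linalg.solve(Sw, m1 - m2)  # direction maximizing J(w)
        return w / np.linalg.norm(w)      # scale is irrelevant; normalize for convenience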
For c classes, the within-class scatter matrix generalizes to

S_W = ∑_{i=1}^{c} S_i,   where   S_i = ∑_{x ∈ D_i} (x − m_i)(x − m_i)^T.
The between-class scatter matrix is

S_B = ∑_{i=1}^{c} n_i (m_i − m)(m_i − m)^T,

where n_i is the number of samples in class i and m = (1/n) ∑ x is the total sample mean.
The projection matrix W that maximizes the ratio of between-class to within-class scatter has as its columns the eigenvectors of S_W^{−1} S_B having the largest eigenvalues (at most c − 1 of them are nonzero).
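A hedged sketch of the multi-class case, assuming SciPy is available for the generalized symmetric eigenproblem; the function name lda_projection and its interface are illustrative, not from the slides, and S_W is assumed to be nonsingular.

    import numpy as np
    from scipy.linalg import eigh

    def lda_projection(X, y, k):
        """Columns of W are the eigenvectors of S_W^{-1} S_B with the largest
        eigenvalues (at most c - 1 are meaningful for c classes)."""
        d = X.shape[1]
        m = X.mean(axis=0)                            # total sample mean
        Sw = np.zeros((d, d))
        Sb = np.zeros((d, d))
        for ci in np.unique(y):
            Xi = X[y == ci]
            mi = Xi.mean(axis=0)
            Sw += (Xi - mi).T @ (Xi - mi)             # within-class scatter
            Sb += len(Xi) * np.outer(mi - m, mi - m)  # between-class scatter
        # Generalized symmetric eigenproblem S_B w = lambda S_W w.
        evals, evecs = eigh(Sb, Sw)
        W = evecs[:, np.argsort(evals)[::-1][:k]]     # top-k discriminant directions
        return X @ W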
(a) Scatter plot. (b) Projection onto the first PCA axis. (c) Projection onto the first LDA axis.
Figure 7: Scatter plot and the PCA and LDA axes for a bivariate sample with two classes. The projection onto the first LDA axis gives better separation of the two classes than the projection onto the first PCA axis.
(a) Scatter plot. (b) Projection onto the first PCA axis. (c) Projection onto the first LDA axis.
Figure 8: Scatter plot and the PCA and LDA axes for another bivariate sample with two classes. Again, the projection onto the first LDA axis gives better class separation than the projection onto the first PCA axis.
Table 1: Feature reduction methods.
◮ examining all C(d, m) = d!/(m!(d − m)!) possible subsets of size m
◮ selecting the subset that performs the best according to the criterion function
◮ The number of subsets grows combinatorially with d, so exhaustive search quickly becomes impractical (a sketch follows below).
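A tiny sketch of exhaustive search, where J stands for a user-supplied (hypothetical) criterion function evaluated on a candidate subset, for example the cross-validated accuracy of a classifier trained on those features.

    from itertools import combinations

    def best_subset(features, m, J):
        """Exhaustive search: evaluate the criterion J on every subset of
        size m and return the subset with the highest score."""
        return max(combinations(features, m), key=J)

    # The number of subsets examined is C(d, m); e.g. C(20, 10) = 184756,
    # which is why exhaustive search is rarely practical.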
◮ First, the best single feature is selected.
◮ Then, pairs of features are formed using one of the remaining features together with the selected feature, and the best pair is selected.
◮ Next, triplets of features are formed using one of the remaining features together with the selected pair, and the best triplet is selected.
◮ This procedure continues until all or a predefined number of features are selected (a sketch follows below).
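A sketch of this forward procedure; J is again a hypothetical criterion function scoring a candidate subset, and the function name is illustrative.

    def sequential_forward_selection(features, m, J):
        """Greedy forward selection: grow the subset one feature at a time,
        always adding the feature that gives the best criterion value."""
        selected = []
        remaining = list(features)
        while remaining and len(selected) < m:
            best = max(remaining, key=lambda f: J(selected + [f]))
            selected.append(best)
            remaining.remove(best)
        return selected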
◮ First, the criterion function is computed for all d features together.
◮ Then, each feature is deleted one at a time, the criterion function is computed for the remaining d − 1 features, and the feature whose removal hurts the criterion the least is discarded.
◮ Next, each feature among the remaining d − 1 is deleted one at a time, and again the least useful feature is discarded.
◮ This procedure continues until one feature or a predefined number of features is left (a sketch follows below).
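A matching sketch of the backward procedure, with the same hypothetical criterion J and an illustrative function name.

    def sequential_backward_selection(features, m, J):
        """Greedy backward selection: start from all d features and repeatedly
        delete the feature whose removal hurts the criterion the least."""
        selected = list(features)
        while len(selected) > m:
            # For each candidate deletion, score the subset that would remain.
            worst = max(selected, key=lambda f: J([g for g in selected if g != f]))
            selected.remove(worst)
        return selected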
Table 2: Feature selection methods.