

1. Data Mining in Bioinformatics, Day 3: Feature Selection. Karsten Borgwardt, February 21 to March 4, 2011. Machine Learning & Computational Biology Research Group, MPIs Tübingen.

2. What is feature selection?
Abundance of features
- Usually, our output variable Y does not depend on all of our input features X.
- Why is this? X usually includes all features that could determine Y according to our prior knowledge, but we do not know for sure. In fact, we perform supervised learning to determine this dependence between input variables and output variables.
- '(Supervised) feature selection' means selecting the relevant subset of features for a particular learning task.

3. Why feature selection?
Reasons for feature selection
- to detect causal features
- to remove noisy features
- to reduce the set of features that has to be observed (cost, speed, data understanding)
Two modes of feature selection
- Filter approaches: select interesting features a priori, based on a quality function (information criterion)
- Wrapper approaches: select features that are interesting for one particular classifier

4. Optimisation problem
Combinatorial problem
- Given a set of features D and a quality function q, we try to find the subset S of D of cardinality n' that maximises q:
  $\operatorname{argmax}_{S \subset D,\, |S| = n'} \; q(S)$   (1)
Exponential runtime effort
- The computational effort for enumerating all possibilities is exponential in n', and hence intractable for large D and n'.
- In practice, we have to find a workaround!
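To make the combinatorial nature of (1) concrete, here is a minimal Python sketch of the exact search by exhaustive enumeration. The toy feature scores and the additive quality function q are hypothetical; they only illustrate why exhaustive enumeration becomes intractable as |D| and n' grow.

```python
from itertools import combinations

def best_subset(features, q, n_prime):
    """Enumerate all subsets of size n_prime and return the one maximising q.
    The number of candidates is C(|D|, n_prime), which explodes for large D."""
    return max(combinations(features, n_prime), key=lambda S: q(set(S)))

# Toy example: hypothetical per-feature scores and an additive quality function.
scores = {"f1": 0.9, "f2": 0.1, "f3": 0.7, "f4": 0.4}
q = lambda S: sum(scores[f] for f in S)
print(best_subset(scores, q, 2))   # ('f1', 'f3')
```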

5. Greedy selection
Take the currently best one
- Greedy selection is an alternative to exhaustive enumeration.
- The idea is to iteratively add the currently most informative feature to the selected set, or to remove the currently most uninformative feature from the solution set.
- These two variants of greedy feature selection are referred to as:
  - forward feature selection
  - backward elimination

6. Greedy selection
Forward feature selection (S initially contains all features in D):
1: S† ← ∅
2: repeat
3:   j ← argmax_{j ∈ S} q(S† ∪ {j})
4:   S† ← S† ∪ {j}
5:   S ← S \ {j}
6: until S = ∅
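A minimal Python sketch of the forward selection loop above, written against a generic quality function q. Stopping at a target number of features is an assumption; the pseudocode on this slide runs until S is empty.

```python
def forward_selection(features, q, n_target):
    """Greedily add the feature that currently improves q the most."""
    S_dagger, S = [], set(features)        # S†: selected features, S: remaining
    while S and len(S_dagger) < n_target:
        j = max(S, key=lambda f: q(S_dagger + [f]))
        S_dagger.append(j)                 # S† <- S† ∪ {j}
        S.remove(j)                        # S  <- S \ {j}
    return S_dagger
```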

7. Greedy selection
Backward elimination (S initially contains all features in D):
1: S† ← ∅
2: repeat
3:   j ← argmax_{j ∈ S} q(S \ {j})
4:   S ← S \ {j}
5:   S† ← S† ∪ {j}
6: until S = ∅
Optimality of greedy selection
- Only optimal if q decomposes over the elements of S:
  $q(S) = \sum_{X \in S} q(X)$   (2)
- Near-optimal if q is submodular (more details later)
- Otherwise there is no guarantee for optimality
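For comparison, a matching sketch of backward elimination: start from all features and repeatedly drop the one whose removal hurts q the least. As above, q is a placeholder and stopping at a target subset size is an assumption.

```python
def backward_elimination(features, q, n_target):
    """Greedily remove the feature whose removal leaves the best remaining set."""
    S = set(features)
    while len(S) > n_target:
        j = max(S, key=lambda f: q(S - {f}))   # currently least informative feature
        S.remove(j)                            # S <- S \ {j}
    return S
```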

8. Correlation Coefficient
Definition
The correlation coefficient ρ_{X,Y} between two random variables X and Y with expected values μ_X and μ_Y and standard deviations σ_X and σ_Y is defined as:
$\rho_{X,Y} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y}$   (3)
$\phantom{\rho_{X,Y}} = \frac{E((X - \mu_X)(Y - \mu_Y))}{\sigma_X \sigma_Y}$   (4)
where E is the expected value operator and cov means covariance.
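As a simple filter, each feature can be ranked by the absolute value of its empirical correlation coefficient with the output. A minimal sketch with NumPy; the data arrays X and y are placeholders.

```python
import numpy as np

def rank_by_correlation(X, y):
    """Return feature indices sorted by |rho(X_j, y)|, largest first."""
    rho = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    return np.argsort(-np.abs(rho)), rho

# Toy data in which y depends mainly on feature 2.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = X[:, 2] + 0.1 * rng.standard_normal(100)
order, rho = rank_by_correlation(X, y)
print(order[0])   # feature 2 should be ranked first
```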

9. Mutual Information
Definition
Given two random variables X and Y, we define the mutual information I as
$I(X, Y) = \sum_{y \in Y} \sum_{x \in X} p(x, y) \log\!\left(\frac{p(x, y)}{p(x)\, p(y)}\right)$   (5)
where X is the input variable, Y is the output variable, p(x, y) is the (joint) probability of observing x and y, and p(x) and p(y) are the marginal probabilities of observing x and y, respectively. log is usually the logarithm with base 2.
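A minimal sketch of the estimate in (5) for two discrete variables, with the probabilities replaced by empirical frequencies; the base-2 logarithm gives mutual information in bits. The input arrays are assumed to contain discrete values.

```python
import numpy as np

def mutual_information(x, y):
    """Empirical I(X;Y) in bits for two discrete-valued arrays of equal length."""
    mi = 0.0
    for a in np.unique(x):
        for b in np.unique(y):
            p_xy = np.mean((x == a) & (y == b))            # joint frequency
            p_x, p_y = np.mean(x == a), np.mean(y == b)    # marginal frequencies
            if p_xy > 0:
                mi += p_xy * np.log2(p_xy / (p_x * p_y))
    return mi
```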

10. HSIC
Definition
- The Hilbert-Schmidt Independence Criterion (HSIC) measures the dependence of two random variables.
- Given two random variables X and Y, an empirical estimate of the HSIC can be computed as
  $\operatorname{trace}(KHLH)$   (6)
  where K is a kernel on X, L is a kernel on Y, and H is a centering matrix with $H(i, j) = \delta(i, j) - \frac{1}{m}$.
- HSIC(X, Y) = 0 iff X and Y are independent.
- The larger HSIC(X, Y), the larger the dependence between X and Y.
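A minimal sketch of the empirical estimate trace(KHLH), here with linear kernels for K and L. The kernel choice and the omission of any normalisation constant are simplifying assumptions; published estimators typically divide this trace by (m-1)^2.

```python
import numpy as np

def hsic(X, Y):
    """Unnormalised empirical HSIC of samples X (m x d_x) and Y (m x d_y), linear kernels."""
    m = X.shape[0]
    K = X @ X.T                                # kernel matrix on X
    L = Y @ Y.T                                # kernel matrix on Y
    H = np.eye(m) - np.ones((m, m)) / m        # centering matrix: H(i, j) = delta(i, j) - 1/m
    return np.trace(K @ H @ L @ H)
```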

11. Submodular Functions I
Definition
A function q on a set D is said to be submodular if
$q(S \cup \{X\}) - q(S) \geq q(T \cup \{X\}) - q(T)$   (7)
where X ∈ D \ T, S ⊂ D, T ⊂ D, and S ⊆ T.
This is referred to as the property of 'diminishing returns': if S is a subset of T, then S benefits more from adding X than T does.

12. Submodular Functions II
Near-optimality (Nemhauser, Wolsey, and Fisher, 1978)
If q is a submodular, nondecreasing set function and q(∅) = 0, then the greedy algorithm is guaranteed to find a set S such that
$q(S) \geq \left(1 - \frac{1}{e}\right) \max_{|T| = |S|} q(T)$   (8)
This means that the solution of greedy selection reaches at least ~63% of the quality of the optimal solution.

13. Submodular Functions III
Example: sensor placement
- Imagine our features form a graph G = (D, E).
- Imagine the features are possible locations for a sensor. Each sensor covers a node v and its neighbourhood N(v), so that $q(S) = \left|\bigcup_{v \in S} \left(\{v\} \cup N(v)\right)\right|$.
- Now we want to pick locations in the graph such that our sensors cover as large an area of the graph as possible.
- q fulfills the following properties:
  - q(∅) = 0
  - q is non-decreasing
  - q is submodular
- Hence greedy selection will lead to near-optimal sensor placement!
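A minimal sketch of greedy sensor placement on a toy graph, where q(S) counts the nodes covered by the chosen sensors and their neighbourhoods; the adjacency structure is invented for illustration.

```python
def coverage(S, neighbours):
    """q(S): number of nodes covered by the sensors in S and their neighbourhoods."""
    covered = set()
    for v in S:
        covered |= {v} | neighbours[v]
    return len(covered)

def greedy_placement(nodes, neighbours, k):
    """Pick k sensor locations, each time adding the location with the largest marginal gain."""
    S = []
    for _ in range(k):
        v = max((u for u in nodes if u not in S),
                key=lambda u: coverage(S + [u], neighbours))
        S.append(v)
    return S

# Toy graph as an adjacency map: node -> set of neighbours.
neighbours = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4}, 4: {3}}
print(greedy_placement(list(neighbours), neighbours, 2))   # [2, 3] covers all 5 nodes
```

Because this coverage function is non-decreasing, submodular, and zero on the empty set, the greedy choice is guaranteed to reach at least about 63% of the optimal coverage, as stated on the previous slide.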

14. Wrapper methods
Two flavours
- Embedded: the selection process is really integrated into the learning algorithm.
- Not embedded (wrapper): the learning algorithm is employed as a quality measure.
Wrappers
- Simple wrapper: do prediction using one feature only; use classification accuracy as the measure of quality.
- Extend this to groups of features by heuristic search strategies (greedy, Monte Carlo, etc.).
Embedded
- Typical example: decision trees!
- ℓ0-norm SVM
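A minimal sketch of the simple wrapper described above: score each feature by the cross-validated accuracy of a classifier trained on that feature alone. The use of scikit-learn and of logistic regression as the base classifier are assumptions; any classifier could play this role.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def single_feature_wrapper(X, y, cv=5):
    """Rank features by the CV accuracy of a classifier trained on each feature alone."""
    acc = np.array([
        cross_val_score(LogisticRegression(), X[:, [j]], y, cv=cv).mean()
        for j in range(X.shape[1])
    ])
    return np.argsort(-acc), acc
```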

15. ℓ0-norm SVM
Three steps
1. Train a regular linear SVM (using ℓ1-norm or ℓ2-norm regularization).
2. Re-scale the input variables by multiplying them by the absolute values of the components of the weight vector w obtained.
3. Iterate the first two steps until convergence.
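A minimal sketch of this iterative re-scaling scheme using a linear SVM from scikit-learn. The fixed iteration count stands in for a proper convergence test, and the choice of LinearSVC is an assumption.

```python
import numpy as np
from sklearn.svm import LinearSVC

def l0_norm_svm_scaling(X, y, n_iter=20):
    """Iteratively re-scale inputs by |w|; features whose scale shrinks to ~0 are eliminated."""
    scale = np.ones(X.shape[1])
    for _ in range(n_iter):
        w = LinearSVC().fit(X * scale, y).coef_.ravel()   # step 1: train a linear SVM
        scale *= np.abs(w)                                # step 2: re-scale inputs by |w|
    return scale                                          # near-zero entries mark dropped features
```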

16. Unsupervised feature selection
Problem setting
- Even without a target variable y, we can select features that are informative according to some criterion.
Criteria (Guyon and Elisseeff, 2003)
- Saliency: a feature is salient if it has a high variance or range.
- Entropy: a feature has high entropy if the distribution of examples is uniform.
- Smoothness: a feature in a time series is smooth if on average its local curvature is moderate.
- Density: a feature is in a high-density region if it is highly connected with many other variables.
- Reliability: a feature is reliable if the measurement error bars are smaller than the variability.
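A minimal sketch of two of these criteria, saliency (per-feature variance) and entropy (of a histogram of each feature); the number of histogram bins is an arbitrary choice.

```python
import numpy as np

def saliency(X):
    """Per-feature variance; salient features have high variance."""
    return X.var(axis=0)

def feature_entropy(X, bins=10):
    """Entropy (in bits) of a histogram of each feature; high if values are spread uniformly."""
    ent = []
    for j in range(X.shape[1]):
        counts, _ = np.histogram(X[:, j], bins=bins)
        p = counts / counts.sum()
        p = p[p > 0]
        ent.append(-(p * np.log2(p)).sum())
    return np.array(ent)
```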

17. Feature Selection in Practice
Catalog of 10 questions by Guyon and Elisseeff
1. Do you have domain knowledge? If yes, construct a better set of ad hoc features.
2. Are your features commensurate? If no, consider normalizing them.
3. Do you suspect interdependence of features? If yes, expand your feature set by constructing conjunctive features or products of features, as much as your computer resources allow you.
4. Do you need to prune the input variables (e.g. for cost, speed or data understanding reasons)? If no, construct disjunctive features or weighted sums of features.

18. Feature Selection in Practice
Catalog of 10 questions by Guyon and Elisseeff (continued)
5. Do you need to assess features individually (e.g. to understand their influence on the system or because their number is so large that you need to do a first filtering)? If yes, use a variable ranking method; else, do it anyway to get baseline results.
6. Do you need a predictor? If no, stop.
7. Do you suspect your data is "dirty" (has a few meaningless input patterns and/or noisy outputs or wrong class labels)? If yes, detect the outlier examples using the top ranking variables obtained in step 5 as representation; check and/or discard them.

19. Feature Selection in Practice
Catalog of 10 questions by Guyon and Elisseeff (continued)
8. Do you know what to try first? If no, use a linear predictor. Use a forward selection method with the "probe" method as a stopping criterion, or use the ℓ0-norm embedded method. For comparison, following the ranking of step 5, construct a sequence of predictors of the same nature using increasing subsets of features. Can you match or improve performance with a smaller subset? If yes, try a non-linear predictor with that subset.
9. Do you have new ideas, time, computational resources, and enough examples? If yes, compare several feature selection methods, including your new idea, correlation coefficients, backward selection and embedded methods. Use linear and non-linear predictors. Select the best approach with model selection.

20. Feature Selection in Practice
Catalog of 10 questions by Guyon and Elisseeff (continued)
10. Do you want a stable solution (to improve performance and/or understanding)? If yes, subsample your data and redo your analysis for several "bootstraps".
