

1. Data Mining in Bioinformatics, Day 3: Feature Selection. Karsten Borgwardt, February 21 to March 4, 2011. Machine Learning & Computational Biology Research Group, MPIs Tübingen.

2. What is feature selection?
Abundance of features
- Usually, our output variable Y does not depend on all of our input features X.
- Why is this? X usually includes all features that could determine Y according to our prior knowledge, but we do not know for sure. In fact, we perform supervised learning to determine this dependence between input variables and output variables.
- '(Supervised) feature selection' means selecting the relevant subset of features for a particular learning task.

3. Why feature selection?
Reasons for feature selection
- to detect causal features
- to remove noisy features
- to reduce the set of features that has to be observed (cost, speed, data understanding)
Two modes of feature selection
- Filter approaches: select interesting features a priori, based on a quality function (information criterion)
- Wrapper approaches: select features that are interesting for one particular classifier

4. Optimisation problem
Combinatorial problem
- Given a set of features D and a quality function q, we try to find the subset S of D of cardinality n' that maximises q:
  $\operatorname{argmax}_{S \subset D,\, |S| = n'} \; q(S)$   (1)
Exponential runtime effort
- The computational effort for enumerating all possibilities is exponential in n', and hence intractable for large D and n'.
- In practice, we have to find a workaround!
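To make the combinatorial nature of (1) concrete, here is a minimal Python sketch of the exact search by exhaustive enumeration. The toy feature scores and the additive quality function q are hypothetical; they only illustrate why exhaustive enumeration becomes intractable as |D| and n' grow.

```python
from itertools import combinations

def best_subset(features, q, n_prime):
    """Enumerate all subsets of size n_prime and return the one maximising q.
    The number of candidates is C(|D|, n_prime), which explodes for large D."""
    return max(combinations(features, n_prime), key=lambda S: q(set(S)))

# Toy example: hypothetical per-feature scores and an additive quality function.
scores = {"f1": 0.9, "f2": 0.1, "f3": 0.7, "f4": 0.4}
q = lambda S: sum(scores[f] for f in S)
print(best_subset(scores, q, 2))   # ('f1', 'f3')
```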

5. Greedy selection
Take the currently best one
- Greedy selection is an alternative to exhaustive enumeration.
- The idea is to iteratively add the currently most informative feature to the selected set, or to remove the currently most uninformative feature from the solution set.
- These two variants of greedy feature selection are referred to as:
  - forward feature selection
  - backward elimination

6. Greedy selection
Forward feature selection (S initially contains all features in D):
1: S† ← ∅
2: repeat
3:   j ← argmax_{j ∈ S} q(S† ∪ {j})
4:   S† ← S† ∪ {j}
5:   S ← S \ {j}
6: until S = ∅
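A minimal Python sketch of the forward selection loop above, written against a generic quality function q. Stopping at a target number of features is an assumption; the pseudocode on this slide runs until S is empty.

```python
def forward_selection(features, q, n_target):
    """Greedily add the feature that currently improves q the most."""
    S_dagger, S = [], set(features)        # S†: selected features, S: remaining
    while S and len(S_dagger) < n_target:
        j = max(S, key=lambda f: q(S_dagger + [f]))
        S_dagger.append(j)                 # S† <- S† ∪ {j}
        S.remove(j)                        # S  <- S \ {j}
    return S_dagger
```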

7. Greedy selection
Backward elimination (S initially contains all features in D):
1: S† ← ∅
2: repeat
3:   j ← argmax_{j ∈ S} q(S \ {j})
4:   S ← S \ {j}
5:   S† ← S† ∪ {j}
6: until S = ∅
Optimality of greedy selection
- Only optimal if q decomposes over the elements of S:
  $q(S) = \sum_{X \in S} q(X)$   (2)
- Near-optimal if q is submodular (more details later)
- Otherwise there is no guarantee for optimality
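For comparison, a matching sketch of backward elimination: start from all features and repeatedly drop the one whose removal hurts q the least. As above, q is a placeholder and stopping at a target subset size is an assumption.

```python
def backward_elimination(features, q, n_target):
    """Greedily remove the feature whose removal leaves the best remaining set."""
    S = set(features)
    while len(S) > n_target:
        j = max(S, key=lambda f: q(S - {f}))   # currently least informative feature
        S.remove(j)                            # S <- S \ {j}
    return S
```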

8. Correlation Coefficient
Definition
The correlation coefficient ρ_{X,Y} between two random variables X and Y with expected values μ_X and μ_Y and standard deviations σ_X and σ_Y is defined as:
$\rho_{X,Y} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y}$   (3)
$\phantom{\rho_{X,Y}} = \frac{E((X - \mu_X)(Y - \mu_Y))}{\sigma_X \sigma_Y}$   (4)
where E is the expected value operator and cov means covariance.
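As a simple filter, each feature can be ranked by the absolute value of its empirical correlation coefficient with the output. A minimal sketch with NumPy; the data arrays X and y are placeholders.

```python
import numpy as np

def rank_by_correlation(X, y):
    """Return feature indices sorted by |rho(X_j, y)|, largest first."""
    rho = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    return np.argsort(-np.abs(rho)), rho

# Toy data in which y depends mainly on feature 2.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = X[:, 2] + 0.1 * rng.standard_normal(100)
order, rho = rank_by_correlation(X, y)
print(order[0])   # feature 2 should be ranked first
```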

9. Mutual Information
Definition
Given two random variables X and Y, we define the mutual information I as
$I(X, Y) = \sum_{y \in Y} \sum_{x \in X} p(x, y) \log\!\left(\frac{p(x, y)}{p(x)\, p(y)}\right)$   (5)
where X is the input variable, Y is the output variable, p(x, y) is the (joint) probability of observing x and y, and p(x) and p(y) are the marginal probabilities of observing x and y, respectively. log is usually the logarithm with base 2.
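A minimal sketch of the estimate in (5) for two discrete variables, with the probabilities replaced by empirical frequencies; the base-2 logarithm gives mutual information in bits. The input arrays are assumed to contain discrete values.

```python
import numpy as np

def mutual_information(x, y):
    """Empirical I(X;Y) in bits for two discrete-valued arrays of equal length."""
    mi = 0.0
    for a in np.unique(x):
        for b in np.unique(y):
            p_xy = np.mean((x == a) & (y == b))            # joint frequency
            p_x, p_y = np.mean(x == a), np.mean(y == b)    # marginal frequencies
            if p_xy > 0:
                mi += p_xy * np.log2(p_xy / (p_x * p_y))
    return mi
```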

10. HSIC
Definition
- The Hilbert-Schmidt Independence Criterion (HSIC) measures the dependence of two random variables.
- Given two random variables X and Y, an empirical estimate of the HSIC can be computed as
  $\operatorname{trace}(KHLH)$   (6)
  where K is a kernel on X, L is a kernel on Y, and H is a centering matrix with $H(i, j) = \delta(i, j) - \frac{1}{m}$.
- HSIC(X, Y) = 0 iff X and Y are independent.
- The larger HSIC(X, Y), the larger the dependence between X and Y.
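A minimal sketch of the empirical estimate trace(KHLH), here with linear kernels for K and L. The kernel choice and the omission of any normalisation constant are simplifying assumptions; published estimators typically divide this trace by (m-1)^2.

```python
import numpy as np

def hsic(X, Y):
    """Unnormalised empirical HSIC of samples X (m x d_x) and Y (m x d_y), linear kernels."""
    m = X.shape[0]
    K = X @ X.T                                # kernel matrix on X
    L = Y @ Y.T                                # kernel matrix on Y
    H = np.eye(m) - np.ones((m, m)) / m        # centering matrix: H(i, j) = delta(i, j) - 1/m
    return np.trace(K @ H @ L @ H)
```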

11. Submodular Functions I
Definition
A function q on a set D is said to be submodular if
$q(S \cup \{X\}) - q(S) \geq q(T \cup \{X\}) - q(T)$   (7)
where X ∈ D \ T, S ⊂ D, T ⊂ D, and S ⊆ T.
This is referred to as the property of 'diminishing returns': if S is a subset of T, then S benefits more from adding X than T does.

12. Submodular Functions II
Near-optimality (Nemhauser, Wolsey, and Fisher, 1978)
If q is a submodular, nondecreasing set function and q(∅) = 0, then the greedy algorithm is guaranteed to find a set S such that
$q(S) \geq \left(1 - \frac{1}{e}\right) \max_{|T| = |S|} q(T)$   (8)
This means that the solution of greedy selection reaches at least ~63% of the quality of the optimal solution.

13. Submodular Functions III
Example: sensor placement
- Imagine our features form a graph G = (D, E).
- Imagine the features are possible locations for a sensor. Each sensor covers a node v and its neighbourhood N(v), so that $q(S) = \left|\bigcup_{v \in S} \left(\{v\} \cup N(v)\right)\right|$.
- Now we want to pick locations in the graph such that our sensors cover as large an area of the graph as possible.
- q fulfills the following properties:
  - q(∅) = 0
  - q is non-decreasing
  - q is submodular
- Hence greedy selection will lead to near-optimal sensor placement!
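A minimal sketch of greedy sensor placement on a toy graph, where q(S) counts the nodes covered by the chosen sensors and their neighbourhoods; the adjacency structure is invented for illustration.

```python
def coverage(S, neighbours):
    """q(S): number of nodes covered by the sensors in S and their neighbourhoods."""
    covered = set()
    for v in S:
        covered |= {v} | neighbours[v]
    return len(covered)

def greedy_placement(nodes, neighbours, k):
    """Pick k sensor locations, each time adding the location with the largest marginal gain."""
    S = []
    for _ in range(k):
        v = max((u for u in nodes if u not in S),
                key=lambda u: coverage(S + [u], neighbours))
        S.append(v)
    return S

# Toy graph as an adjacency map: node -> set of neighbours.
neighbours = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4}, 4: {3}}
print(greedy_placement(list(neighbours), neighbours, 2))   # [2, 3] covers all 5 nodes
```

Because this coverage function is non-decreasing, submodular, and zero on the empty set, the greedy choice is guaranteed to reach at least about 63% of the optimal coverage, as stated on the previous slide.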

14. Wrapper methods
Two flavours
- Embedded: the selection process is really integrated into the learning algorithm.
- Not embedded (wrapper): the learning algorithm is employed as a quality measure.
Wrappers
- Simple wrapper: do prediction using one feature only; use classification accuracy as the measure of quality.
- Extend this to groups of features by heuristic search strategies (greedy, Monte Carlo, etc.).
Embedded
- Typical example: decision trees!
- ℓ0-norm SVM
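A minimal sketch of the simple wrapper described above: score each feature by the cross-validated accuracy of a classifier trained on that feature alone. The use of scikit-learn and of logistic regression as the base classifier are assumptions; any classifier could play this role.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def single_feature_wrapper(X, y, cv=5):
    """Rank features by the CV accuracy of a classifier trained on each feature alone."""
    acc = np.array([
        cross_val_score(LogisticRegression(), X[:, [j]], y, cv=cv).mean()
        for j in range(X.shape[1])
    ])
    return np.argsort(-acc), acc
```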

15. ℓ0-norm SVM
Three steps
1. Train a regular linear SVM (using ℓ1-norm or ℓ2-norm regularization).
2. Re-scale the input variables by multiplying them by the absolute values of the components of the weight vector w obtained.
3. Iterate the first two steps until convergence.
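A minimal sketch of this iterative re-scaling scheme using a linear SVM from scikit-learn. The fixed iteration count stands in for a proper convergence test, and the choice of LinearSVC is an assumption.

```python
import numpy as np
from sklearn.svm import LinearSVC

def l0_norm_svm_scaling(X, y, n_iter=20):
    """Iteratively re-scale inputs by |w|; features whose scale shrinks to ~0 are eliminated."""
    scale = np.ones(X.shape[1])
    for _ in range(n_iter):
        w = LinearSVC().fit(X * scale, y).coef_.ravel()   # step 1: train a linear SVM
        scale *= np.abs(w)                                # step 2: re-scale inputs by |w|
    return scale                                          # near-zero entries mark dropped features
```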

16. Unsupervised feature selection
Problem setting
- Even without a target variable y, we can select features that are informative according to some criterion.
Criteria (Guyon and Elisseeff, 2003)
- Saliency: a feature is salient if it has a high variance or range.
- Entropy: a feature has high entropy if the distribution of examples is uniform.
- Smoothness: a feature in a time series is smooth if on average its local curvature is moderate.
- Density: a feature is in a high-density region if it is highly connected with many other variables.
- Reliability: a feature is reliable if the measurement error bars are smaller than the variability.
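A minimal sketch of two of these criteria, saliency (per-feature variance) and entropy (of a histogram of each feature); the number of histogram bins is an arbitrary choice.

```python
import numpy as np

def saliency(X):
    """Per-feature variance; salient features have high variance."""
    return X.var(axis=0)

def feature_entropy(X, bins=10):
    """Entropy (in bits) of a histogram of each feature; high if values are spread uniformly."""
    ent = []
    for j in range(X.shape[1]):
        counts, _ = np.histogram(X[:, j], bins=bins)
        p = counts / counts.sum()
        p = p[p > 0]
        ent.append(-(p * np.log2(p)).sum())
    return np.array(ent)
```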

17. Feature Selection in Practice
Catalog of 10 questions by Guyon and Elisseeff
1. Do you have domain knowledge? If yes, construct a better set of ad hoc features.
2. Are your features commensurate? If no, consider normalizing them.
3. Do you suspect interdependence of features? If yes, expand your feature set by constructing conjunctive features or products of features, as much as your computer resources allow you.
4. Do you need to prune the input variables (e.g. for cost, speed or data understanding reasons)? If no, construct disjunctive features or weighted sums of features.

18. Feature Selection in Practice
Catalog of 10 questions by Guyon and Elisseeff (continued)
5. Do you need to assess features individually (e.g. to understand their influence on the system or because their number is so large that you need to do a first filtering)? If yes, use a variable ranking method; else, do it anyway to get baseline results.
6. Do you need a predictor? If no, stop.
7. Do you suspect your data is "dirty" (has a few meaningless input patterns and/or noisy outputs or wrong class labels)? If yes, detect the outlier examples using the top ranking variables obtained in step 5 as representation; check and/or discard them.

19. Feature Selection in Practice
Catalog of 10 questions by Guyon and Elisseeff (continued)
8. Do you know what to try first? If no, use a linear predictor. Use a forward selection method with the "probe" method as a stopping criterion, or use the ℓ0-norm embedded method. For comparison, following the ranking of step 5, construct a sequence of predictors of the same nature using increasing subsets of features. Can you match or improve performance with a smaller subset? If yes, try a non-linear predictor with that subset.
9. Do you have new ideas, time, computational resources, and enough examples? If yes, compare several feature selection methods, including your new idea, correlation coefficients, backward selection and embedded methods. Use linear and non-linear predictors. Select the best approach with model selection.

20. Feature Selection in Practice
Catalog of 10 questions by Guyon and Elisseeff (continued)
10. Do you want a stable solution (to improve performance and/or understanding)? If yes, subsample your data and redo your analysis for several "bootstraps".
