Feature Selection
CE-725: Statistical Pattern Recognition
Sharif University of Technology, Soleymani, Fall 2018
Some slides are based on Isabelle Guyon's slides: Introduction to Feature Selection, 2007.
Outline
- Dimensionality reduction
- Filter: univariate methods
- Multivariate filter & wrapper methods
- Search strategies
- Embedded approach
Dimensionality reduction: Feature selection vs. feature extraction
- Feature selection
  - Select a subset of a given feature set.
- Feature extraction (e.g., PCA, LDA)
  - A linear or non-linear transform on the original feature space.
Feature selection ($d' < d$):
$$\begin{bmatrix} x_1 \\ \vdots \\ x_d \end{bmatrix} \rightarrow \begin{bmatrix} x_{j_1} \\ \vdots \\ x_{j_{d'}} \end{bmatrix}$$

Feature extraction:
$$\begin{bmatrix} x_1 \\ \vdots \\ x_d \end{bmatrix} \rightarrow \begin{bmatrix} z_1 \\ \vdots \\ z_{d'} \end{bmatrix} = g\!\left(\begin{bmatrix} x_1 \\ \vdots \\ x_d \end{bmatrix}\right)$$
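A tiny NumPy sketch of this contrast (the column indices and the linear map below are illustrative stand-ins, not from the slides):

```python
# Minimal sketch: feature selection keeps a subset of the original columns,
# while feature extraction applies a transform g (here, a random linear map).
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(100, 10)                  # N = 100 samples, d = 10 features

# Selection: keep columns j1, ..., j_d' (indices chosen for illustration)
selected = X[:, [0, 3, 7]]              # d' = 3 original features survive intact

# Extraction: z = g(x), a linear transform mixing all original features
G = rng.randn(10, 3)                    # stand-in for a learned map (e.g., PCA)
extracted = X @ G                       # d' = 3 new features, none original

print(selected.shape, extracted.shape)  # (100, 3) (100, 3)
```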
Feature selection
- Data may contain many irrelevant and redundant variables, and often comparably few training examples.
- Consider supervised learning problems where the number of features $d$ is very large (perhaps $d \gg N$).
  - E.g., datasets with tens or hundreds of thousands of features and a (much) smaller number of data samples (text or document processing, gene expression array analysis).
[Figure: $N \times d$ data matrix with rows $\boldsymbol{x}^{(1)}, \dots, \boldsymbol{x}^{(N)}$ and feature columns $1, 2, \dots, d$; a subset of columns $j_1, \dots, j_{d'}$ is selected]
Why feature selection?
- FS is a way to find more accurate, faster, and easier-to-understand classifiers.
- Performance: enhancing the generalization ability
  - alleviating the effect of the curse of dimensionality
  - the higher the ratio of the number of training patterns $N$ to the number of free classifier parameters, the better the generalization of the learned classifier
- Efficiency: speeding up the learning process
- Interpretability: resulting in a model that is easier for humans to understand
Feature Selection

$$\boldsymbol{X} = \begin{bmatrix} x_1^{(1)} & \cdots & x_d^{(1)} \\ \vdots & \ddots & \vdots \\ x_1^{(N)} & \cdots & x_d^{(N)} \end{bmatrix}, \qquad \boldsymbol{Y} = \begin{bmatrix} y^{(1)} \\ \vdots \\ y^{(N)} \end{bmatrix}$$

$j_1, j_2, \dots, j_{d'}$: the selected features.

Supervised feature selection: given a labeled set of data points, select a subset of features for data representation.
Noise (or irrelevant) features
- Eliminating irrelevant features can decrease the classification error on test data.

[Figure: SVM decision boundary on informative features $x_1, x_2$ vs. the SVM decision boundary when one axis is a noise feature]
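A minimal sketch of this effect, assuming scikit-learn is available (the synthetic dataset and the number of noise features are illustrative, not from the slide):

```python
# Minimal sketch: appending pure-noise features to a toy dataset and
# comparing SVM cross-validated accuracy with and without them.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
X_noisy = np.hstack([X, rng.randn(200, 50)])  # append 50 pure-noise features

for name, data in [("informative only", X), ("with noise features", X_noisy)]:
    acc = cross_val_score(SVC(kernel="linear"), data, y, cv=5).mean()
    print(f"{name}: CV accuracy = {acc:.3f}")
```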
Drug Screening
[Weston et al., Bioinformatics, 2002]
$N = 1909$ compounds, $d = 139{,}351$ binary features describing three-dimensional properties of the molecule.

[Figure: classification performance vs. number of selected features]
Text Filtering
[Bekkerman et al., JMLR, 2003]
Top 3 words of some categories:
- alt.atheism: atheism, atheists, morality
- comp.graphics: image, jpeg, graphics
- sci.space: space, nasa, orbit
- soc.religion.christian: god, church, sin
- talk.politics.mideast: israel, armenian, turkish

Datasets: Reuters-21578 newswire collection, 114 semantic categories; 20 Newsgroups: 19,997 articles, 20 categories; WebKB: 8,282 web pages, 7 categories. Bag-of-words representation: >100,000 features.
Face Male/Female Classification
[Navot, Bachrach, and Tishby, ICML, 2004]
$N = 1000$ training images, $d = 60 \times 85 = 5100$ pixel features.

[Figure: pixels selected by Relief and by Simba with 100, 500, and 1000 features]
Some definitions
- One categorization of feature selection methods:
  - Univariate method (variable ranking): considers one variable (feature) at a time.
  - Multivariate method: considers subsets of features together.
- Another categorization:
  - Filter method: ranks features or feature subsets independently of the classifier, as a preprocessing step.
  - Wrapper method: uses a classifier to evaluate the score of features or feature subsets.
  - Embedded method: feature selection is done during the training of a classifier.
    - E.g., adding a regularization term $\|\boldsymbol{w}\|_1$ to the cost function of linear classifiers.
Filter: univariate
- Univariate filter method:
  - Score each feature $k$ based on the $k$-th column of the data matrix and the label vector.
  - Relevance of the feature for predicting labels: can the feature discriminate the patterns of different classes?
  - Rank features according to their score values and select the ones with the highest scores.
- How do you decide how many features $k$ to choose? E.g., use cross-validation to select among the possible values of $k$; a minimal sketch follows below.
- Advantage: computational and statistical scalability.
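A minimal sketch of choosing $k$ by cross-validation, assuming scikit-learn (the candidate $k$ values, the ANOVA F-score, and the downstream classifier are illustrative choices):

```python
# Minimal sketch: univariate filter selection, choosing the number of
# features k by cross-validating the downstream classifier.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=300, n_features=100, n_informative=10,
                           random_state=0)

best_k, best_acc = None, -1.0
for k in (5, 10, 20, 50, 100):
    # Score features one at a time (ANOVA F-score), keep the k best,
    # then evaluate the classifier with 5-fold cross-validation.
    pipe = make_pipeline(SelectKBest(f_classif, k=k),
                         LogisticRegression(max_iter=1000))
    acc = cross_val_score(pipe, X, y, cv=5).mean()
    if acc > best_acc:
        best_k, best_acc = k, acc
print(f"selected k = {best_k} (CV accuracy {best_acc:.3f})")
```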
Pearson Correlation Criterion

$$S(k) = \frac{\operatorname{cov}(X_k, Y)}{\sqrt{\operatorname{var}(X_k)\operatorname{var}(Y)}} \approx \frac{\sum_{i=1}^{N}\big(x_k^{(i)} - \bar{x}_k\big)\big(y^{(i)} - \bar{y}\big)}{\sqrt{\sum_{i=1}^{N}\big(x_k^{(i)} - \bar{x}_k\big)^2 \, \sum_{i=1}^{N}\big(y^{(i)} - \bar{y}\big)^2}}$$

- It can only detect linear dependencies between a feature and the target.

[Figure: scatter plots over features $x_1$ and $x_2$ illustrating $S(1) \gg S(2)$]
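A minimal NumPy sketch of this criterion (ranking by $|S(k)|$ is a common convention, not stated on the slide; the toy data is illustrative):

```python
# Minimal sketch: Pearson correlation score S(k) for each feature column,
# implemented directly with NumPy (X: N x d data matrix, y: target vector).
import numpy as np

def pearson_scores(X, y):
    Xc = X - X.mean(axis=0)          # center each feature column
    yc = y - y.mean()                # center the target
    num = Xc.T @ yc                  # per-feature covariance (up to 1/N)
    den = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    return np.abs(num / den)         # |correlation| as the ranking score

rng = np.random.RandomState(0)
X = rng.randn(200, 5)
y = 2 * X[:, 0] + 0.1 * rng.randn(200)   # only feature 0 is relevant
ranking = np.argsort(pearson_scores(X, y))[::-1]
print("features ranked by |Pearson|:", ranking)
```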
Single Variable Classifier: ROC Curve

- Thresholding a single feature yields a classifier; the ROC curve (sensitivity = TPR vs. 1 − specificity = FPR) summarizes its discriminative power, and the Area Under the Curve (AUC) can serve as the feature's score.

[Figure: class-conditional densities of features $x_1$ and $x_2$ for $y = 1$ and $y = -1$ (means $\mu_-$, $\mu_+$; spreads $\sigma_-$, $\sigma_+$), with the corresponding ROC curves and their AUC]
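A minimal sketch of AUC-based feature scoring, assuming scikit-learn (the synthetic data is illustrative):

```python
# Minimal sketch: using each raw feature value as a classifier score
# and ranking features by ROC AUC.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.RandomState(0)
n = 500
y = rng.randint(0, 2, size=n)
X = rng.randn(n, 4)
X[:, 0] += 1.5 * y        # feature 0 separates the classes; others are noise

aucs = np.array([roc_auc_score(y, X[:, k]) for k in range(X.shape[1])])
scores = np.maximum(aucs, 1 - aucs)  # AUC 0.5 = uninformative; treat <0.5 symmetrically
print("per-feature AUC scores:", np.round(scores, 3))
```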
Univariate Mutual Information
- Independence: $P(X, Y) = P(X)P(Y)$
- Mutual information as a measure of dependence:
$$MI(X, Y) = \mathbb{E}_{X,Y}\!\left[\log \frac{P(X, Y)}{P(X)\,P(Y)}\right]$$
- Score of $X_k$ based on its MI with the target: $S(k) = MI(X_k, Y)$
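A minimal NumPy sketch of the plug-in MI estimate for a discrete feature and discrete labels (the toy arrays are illustrative):

```python
# Minimal sketch: plug-in estimate of mutual information between a discrete
# feature x and discrete labels y, from the empirical joint distribution.
import numpy as np

def mutual_information(x, y):
    xs, x_idx = np.unique(x, return_inverse=True)
    ys, y_idx = np.unique(y, return_inverse=True)
    joint = np.zeros((len(xs), len(ys)))
    np.add.at(joint, (x_idx, y_idx), 1)
    joint /= joint.sum()                      # empirical P(x, y)
    px = joint.sum(axis=1, keepdims=True)     # P(x)
    py = joint.sum(axis=0, keepdims=True)     # P(y)
    nz = joint > 0                            # skip log(0) terms
    return float((joint[nz] * np.log(joint[nz] / (px @ py)[nz])).sum())

x = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y = np.array([0, 0, 1, 1, 0, 1, 1, 0])   # mostly follows x
print(f"MI(x, y) = {mutual_information(x, y):.3f} nats")
```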
Filter – univariate: Disadvantage
- Univariate methods may fail:
  - A feature may be important only in combination with other features.
- Redundant features:
  - They can select a group of dependent variables that carry similar information about the output, i.e., it is sufficient to use only one (or a few) of these variables.
Univariate methods: Failure
- Examples on which univariate feature analysis and scoring fail:

[Guyon-Elisseeff, JMLR 2004; Springer 2006]
Redundant features
- What is the relation between redundancy and correlation?
  - Are highly correlated features necessarily redundant?
  - What about completely correlated ones?
Multivariate feature selection
- Search in the space of all possible combinations of features.
  - All feature subsets: for $d$ features, there are $2^d$ possible subsets.
  - High computational and statistical complexity.
Search space for feature selection ($d = 4$)
[Figure: lattice of all $2^4 = 16$ feature subsets, from $(0,0,0,0)$ (no features) to $(1,1,1,1)$ (all features)] [Kohavi-John, 1997]
Multivariate methods: General procedure
- Subset generation: select a candidate feature subset for evaluation.
- Subset evaluation: compute the score (relevance value) of the subset.
- Stopping criterion: decide when to stop the search in the space of feature subsets.
- Validation: verify that the selected subset is valid.

[Figure: loop — original feature set → subset generation → subset evaluation → stopping criterion (no: generate another subset; yes: validation)]
Stopping criteria
- A predefined number of features is selected.
- A predefined number of iterations is reached.
- Addition (or deletion) of any feature does not result in a better subset.
- An optimal subset (according to the evaluation criterion) is obtained.
Filter and wrapper methods
- Wrappers use the classifier's performance to evaluate the feature subset used in the classifier.
  - Training $2^d$ classifiers is infeasible for large $d$.
  - Most wrapper algorithms use a heuristic search.
- Filters use an evaluation function that is cheaper to compute than the performance of the classifier,
  - e.g., the correlation coefficient.
Filters vs. wrappers
- Filter: ranks subsets of useful features independently of the classifier (original feature set → filter → feature subset → classifier).
- Wrapper: takes the classifier into account to rank feature subsets, e.g., using cross-validation to evaluate features (original feature set → multiple candidate subsets, each evaluated by the classifier).
Wrapper methods: Performance assessment
- For each feature subset, train the classifier on training data and assess its performance using evaluation techniques such as cross-validation.
Filter methods: Evaluation criteria
- Distance (e.g., Euclidean distance)
  - Class separability: features that make instances of the same class closer to each other than to instances of different classes.
- Information (e.g., information gain)
  - Select $S_1$ over $S_2$ if $IG(S_1, Y) > IG(S_2, Y)$.
- Dependency (e.g., correlation coefficient)
  - Good feature subsets contain features highly correlated with the class, yet uncorrelated with each other.
- Consistency (min-features bias)
  - Select features that guarantee no inconsistency in the data: inconsistent instances have the same feature vector but different class labels (a sketch follows the table below).
  - Prefers the smaller of the subsets that achieve consistency (min-features bias).

              f1   f2   class
  instance 1:  a    b    c1
  instance 2:  a    b    c2    ← inconsistent
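A minimal sketch of the consistency check (the function name and toy data are illustrative):

```python
# Minimal sketch: checking a candidate feature subset for inconsistency --
# two instances that agree on all selected features but differ in label.
import numpy as np

def is_consistent(X, y, subset):
    seen = {}
    for row, label in zip(X[:, subset], y):
        key = tuple(row)
        if key in seen and seen[key] != label:
            return False          # same feature vector, different labels
        seen[key] = label
    return True

X = np.array([["a", "b"],
              ["a", "b"]])
y = np.array(["c1", "c2"])
print(is_consistent(X, y, [0, 1]))   # False: the two instances conflict
```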
How to search the space of feature subsets?
- Searching the space of feature subsets is an NP-hard problem.
- Complete search is possible only for a small number of features.
- Greedy search is often used: forward selection or backward elimination (a sketch of forward selection follows below).
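A minimal sketch of sequential forward selection, assuming scikit-learn (the classifier and dataset are illustrative):

```python
# Minimal sketch: greedy sequential forward selection. At each step, add
# the feature whose inclusion gives the best cross-validated accuracy.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def forward_selection(X, y, n_select):
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < n_select:
        scores = [(cross_val_score(LogisticRegression(max_iter=1000),
                                   X[:, selected + [j]], y, cv=5).mean(), j)
                  for j in remaining]
        best_score, best_j = max(scores)   # pick the best-scoring candidate
        selected.append(best_j)
        remaining.remove(best_j)
    return selected

X, y = make_classification(n_samples=200, n_features=20, n_informative=3,
                           random_state=0)
print("selected features:", forward_selection(X, y, n_select=3))
```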
Subset selection or generation
- Search direction:
  - Forward
  - Backward
  - Random
- Search strategies:
  - Exhaustive / complete
    - Branch & bound
    - Best-first search
  - Heuristic / greedy
    - Sequential forward selection
    - Sequential backward elimination
    - Plus-l minus-r selection
    - Bidirectional search
    - Sequential floating selection
  - Non-deterministic
    - Simulated annealing
    - Genetic algorithms
Search strategies
- Complete: examine all combinations of feature subsets.
  - The optimal subset is achievable.
  - Too expensive if $d$ is large.
- Heuristic: selection is directed under certain guidelines.
  - Incremental generation of subsets.
  - Smaller search space and thus faster search.
  - May miss feature sets of high importance.
- Non-deterministic (random): no predefined way to select feature candidates (probabilistic approach).
  - The resulting subset depends on the number of trials.
  - Needs more user-defined parameters.
Filters vs. Wrappers
- Filters
  - (+) Fast execution: computing the evaluation function is faster than training a classifier.
  - (+) Generality: they evaluate intrinsic properties of the data rather than their interactions with a particular classifier ("good" for a larger family of classifiers).
  - (−) Tendency to select large subsets: their objective functions are generally monotonic (so they tend toward selecting the full feature set); a cutoff is required on the number of features.
- Wrappers
  - (−) Slow execution: a classifier must be trained for each feature subset (or several times, if cross-validation is used).
  - (−) Lack of generality: the solution is tied to the bias of the classifier used in the evaluation function.
  - (+) Ability to generalize: since they typically use cross-validation measures to evaluate classification accuracy, they have a mechanism to avoid overfitting.
  - (+) Accuracy: they generally achieve better accuracy than filters since they find a feature set suited to the intended classifier.
Examples of embedded methods
- Decision trees have a built-in mechanism to perform variable selection.
- Nested subset methods
  - (Input) node pruning techniques in neural networks are feature selection algorithms.
- Direct objective optimization
  - Combines goodness of fit and the number of variables in the objective function.
Direct objective optimization: example
- Performs feature selection as part of the learning procedure:

$$g(\boldsymbol{x}; \boldsymbol{w}) = \boldsymbol{w}^{T}\boldsymbol{x} + w_0$$

$$J(\boldsymbol{w}) = \frac{1}{N}\sum_{i=1}^{N} \ell\big(g(\boldsymbol{x}^{(i)}; \boldsymbol{w}),\, y^{(i)}\big) + \lambda\,\|\boldsymbol{w}\|_p$$

- In the limit as $p \to 0$, $\|\boldsymbol{w}\|_p^p$ is just the number of non-zero weights, i.e., the number of selected features.
- For some applications, L1-norm minimization suffices to drive enough weights to zero.
- Lasso: uses $\|\boldsymbol{w}\|_1$ as the regularization term; a minimal sketch follows below.
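A minimal sketch of Lasso-based embedded selection, assuming scikit-learn (the value of $\lambda$, here `alpha`, and the synthetic data are illustrative):

```python
# Minimal sketch: embedded selection with the Lasso -- the L1 penalty
# drives many weights exactly to zero.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
X = rng.randn(100, 20)
w_true = np.zeros(20)
w_true[:3] = [2.0, -1.5, 1.0]            # only 3 features matter
y = X @ w_true + 0.1 * rng.randn(100)

model = Lasso(alpha=0.1).fit(X, y)       # alpha plays the role of lambda
selected = np.flatnonzero(model.coef_)   # indices of non-zero weights
print("non-zero weights (selected features):", selected)
```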
Lp-Norms
[Figure from Wikipedia: unit circles of $L_p$ norms for several values of $p$]

Minkowski distance between $\boldsymbol{x} = (x_1, \dots, x_d)$ and $\boldsymbol{x}' = (x'_1, \dots, x'_d)$:

$$D(\boldsymbol{x}, \boldsymbol{x}') = \left(\sum_{i=1}^{d} \left|x_i - x'_i\right|^{p}\right)^{1/p}$$
Example
- With L1 regularization, the number of zero weights increases, which demonstrates the feature-selection property. [Bishop]
References