 
Introduction to Machine Learning
CentraleSupélec Paris, Fall 2017
3. Dimensionality Reduction
Chloé-Agathe Azencott
Centre for Computational Biology, Mines ParisTech
chloe-agathe.azencott@mines-paristech.fr
Learning objectives
● Give reasons why one would wish to reduce the dimensionality of a data set.
● Explain the difference between feature selection and feature extraction.
● Implement some filter strategies.
● Implement some wrapper strategies.
● Derive the computation of principal components from a “max variance” definition.
● Implement PCA.
Curse of dimensionality
● Methods / intuitions that work in low dimension may not apply to high dimensions.
● p=2: Fraction of the points within a square that fall outside of the circle of radius r inscribed in it: $1 - \frac{\pi r^2}{(2r)^2} = 1 - \frac{\pi}{4} \approx 0.21$
Curse of dimensionality
● p=3: Fraction of the points within a cube that fall outside of the sphere of radius r inscribed in it: $1 - \frac{\frac{4}{3} \pi r^3}{(2r)^3} = 1 - \frac{\pi}{6} \approx 0.48$
Curse of dimensionality
● Volume of a p-sphere of radius r: $V_p(r) = \frac{\pi^{p/2}}{\Gamma(p/2 + 1)} r^p$
  The Gamma function Γ generalizes the factorial: Γ(n) = (n-1)! for any positive integer n.
● When p ↗, the proportion of a hypercube outside of its inscribed hypersphere, $1 - \frac{\pi^{p/2}}{2^p \, \Gamma(p/2 + 1)}$, approaches 1.
● What this means:
  – hyperspace is very big
  – all points are far apart
  ⇒ dimensionality reduction.
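The growth of this proportion with p can be checked numerically. The sketch below evaluates the standard volume formula for a unit-radius hypersphere inside a hypercube of side 2. The function name `fraction_outside_inscribed_ball` is chosen here for illustration and does not come from the slides.

```python
import math


def fraction_outside_inscribed_ball(p: int) -> float:
    """Fraction of a hypercube's volume lying outside its inscribed hypersphere."""
    # Volume of the unit-radius p-dimensional ball: pi^(p/2) / Gamma(p/2 + 1).
    ball_volume = math.pi ** (p / 2) / math.gamma(p / 2 + 1)
    # The enclosing hypercube has side 2 (twice the radius), hence volume 2^p.
    cube_volume = 2.0 ** p
    return 1.0 - ball_volume / cube_volume


if __name__ == "__main__":
    for p in (2, 3, 5, 10, 20):
        print(f"p={p:2d}: {fraction_outside_inscribed_ball(p):.6f}")
```

For p=2 and p=3 this reproduces 1 - π/4 and 1 - π/6. By p=10 more than 99% of the hypercube lies outside the hypersphere.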
More reasons to reduce dimensionality
● Computational complexity (time and space)
● Interpretability
● Simpler models are more robust (less variance)
● Data visualization
● Cost of data acquisition
● Eliminate non-relevant attributes that can make it harder for an algorithm to learn.
Approaches to dimensionality reduction
● Feature selection
  Choose m < p features, ignore the remaining (p-m)
  – Filtering approaches
    Apply a statistical measure to assign a score to each feature (correlation, χ²-test).
  – Wrapper approaches
    Search problem: Find the best set of features for a given predictive model.
  – Embedded approaches
    Simultaneously fit a model and learn which features should be included.
All these feature selection approaches are supervised.
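As one possible illustration of a filtering approach, the sketch below scores each feature by the absolute value of its Pearson correlation with the target and keeps the m best. The function name `correlation_filter` and the choice of Pearson correlation are assumptions made for this example. The slides only name correlation and the χ²-test as candidate scores.

```python
import numpy as np


def correlation_filter(X: np.ndarray, y: np.ndarray, m: int) -> np.ndarray:
    """Return the indices of the m features with the largest |Pearson correlation| with y."""
    Xc = X - X.mean(axis=0)  # center each feature
    yc = y - y.mean()        # center the target
    # Denominator of the Pearson correlation, one value per feature.
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    # Constant features (zero denominator) get a score of 0 (assumption of this sketch).
    scores = np.abs(Xc.T @ yc) / np.where(denom > 0, denom, np.inf)
    # Sort by decreasing score and keep the top m.
    return np.argsort(scores)[::-1][:m]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    # Synthetic target that depends only on features 1 and 3.
    y = 3.0 * X[:, 1] - 2.0 * X[:, 3] + 0.1 * rng.normal(size=200)
    print(correlation_filter(X, y, m=2))
```

The score uses the labels y, which is why such a filter counts as supervised. Each feature is scored on its own, so interactions between features are ignored.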
Feature selection: Overview
[Figure: pipelines from all features to a predictor. Filter approaches: features are selected, then passed to the predictor. Wrapper approaches: feature sets 1, 2, …, M are each evaluated with the predictor (subset selection: forward selection, backward selection, floating selection). Embedded approaches: Lasso, Elastic Net (see Chap. 7).]
Feature selection: Subset selection
Subset selection
● Goal: Find the subset of features that leads to the best-performing algorithm.
● How many subsets of p features are there?
Subset selection
● Issue: $2^p$ such sets.
● Greedy approach: forward search
  Add the “best” feature at each step.
  $E(\mathcal{F})$: Error of a predictor trained only using the features in $\mathcal{F}$.
  – Initially: $\mathcal{F} = \emptyset$
  – New best feature: $j^* = \arg\min_{j \notin \mathcal{F}} E(\mathcal{F} \cup \{j\})$
  – stop if $E(\mathcal{F} \cup \{j^*\}) \geq E(\mathcal{F})$
  – else: $\mathcal{F} \leftarrow \mathcal{F} \cup \{j^*\}$
  Complexity: $O(p^2 \times C)$ where C = complexity of training and evaluating the model (might depend on p also). Much better than $O(2^p)$!
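A minimal sketch of forward search, under assumptions that are not in the slides: the predictor is ordinary least squares with an intercept, and the error is the mean squared error on a held-out validation set. The names `validation_error` and `forward_selection` are hypothetical.

```python
import numpy as np


def validation_error(X_tr, y_tr, X_va, y_va, features):
    """Validation MSE of least squares trained only on the given features (assumed error measure)."""
    # An intercept column is always included, so the empty set is the mean predictor.
    A_tr = np.column_stack([np.ones(len(X_tr))] + [X_tr[:, j] for j in features])
    A_va = np.column_stack([np.ones(len(X_va))] + [X_va[:, j] for j in features])
    coef, *_ = np.linalg.lstsq(A_tr, y_tr, rcond=None)
    return float(np.mean((A_va @ coef - y_va) ** 2))


def forward_selection(X_tr, y_tr, X_va, y_va):
    """Greedy forward search: add the best feature until the error stops decreasing."""
    selected = []                                  # initially: empty feature set
    best_error = validation_error(X_tr, y_tr, X_va, y_va, selected)
    remaining = set(range(X_tr.shape[1]))
    while remaining:
        # Error obtained by adding each remaining feature to the current set.
        errors = {j: validation_error(X_tr, y_tr, X_va, y_va, selected + [j])
                  for j in remaining}
        j_star = min(errors, key=errors.get)       # new best feature
        if errors[j_star] >= best_error:           # stop: no improvement
            break
        selected.append(j_star)                    # else: add it and continue
        remaining.remove(j_star)
        best_error = errors[j_star]
    return selected, best_error


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 6))
    y = 2.0 * X[:, 0] - 3.0 * X[:, 2] + 0.1 * rng.normal(size=300)
    print(forward_selection(X[:200], y[:200], X[200:], y[200:]))
```

Each pass of the loop trains at most p models and there are at most p passes, which gives the O(p² × C) count. Irrelevant features can still be added when they lower the validation error by chance.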
Subset selection
● Alternative strategies:
  – Backward search: start from {1, …, p}, eliminate features.
  – Floating search: alternately add q features and remove r features.
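Backward search can be sketched in the same spirit. The code below starts from the full feature set and removes the feature whose removal lowers the error the most. It assumes a least-squares predictor, a validation-set mean squared error, and a stopping rule of "no removal lowers the error". None of these choices are specified in the slides, and the function names are hypothetical.

```python
import numpy as np


def validation_error(X_tr, y_tr, X_va, y_va, features):
    """Validation MSE of least squares trained only on the given features (assumed error measure)."""
    A_tr = np.column_stack([np.ones(len(X_tr))] + [X_tr[:, j] for j in features])
    A_va = np.column_stack([np.ones(len(X_va))] + [X_va[:, j] for j in features])
    coef, *_ = np.linalg.lstsq(A_tr, y_tr, rcond=None)
    return float(np.mean((A_va @ coef - y_va) ** 2))


def backward_selection(X_tr, y_tr, X_va, y_va):
    """Greedy backward search: drop one feature at a time while the error decreases."""
    selected = list(range(X_tr.shape[1]))          # start from all p features
    best_error = validation_error(X_tr, y_tr, X_va, y_va, selected)
    while selected:
        # Error obtained by removing each feature from the current set.
        errors = {j: validation_error(X_tr, y_tr, X_va, y_va,
                                      [k for k in selected if k != j])
                  for j in selected}
        j_star = min(errors, key=errors.get)       # least useful feature
        if errors[j_star] >= best_error:           # stop: every removal hurts (assumed rule)
            break
        selected.remove(j_star)
        best_error = errors[j_star]
    return selected, best_error


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 6))
    y = 2.0 * X[:, 0] - 3.0 * X[:, 2] + 0.1 * rng.normal(size=300)
    print(backward_selection(X[:200], y[:200], X[200:], y[200:]))
```

Backward search trains its first models on all p features, so it tends to cost more per step than forward search when p is large.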
Approaches to dimensionality reduction
● Feature extraction
  Project the p features on m < p new dimensions
  Linear:
  ● Principal Components Analysis (PCA)
  ● Factor Analysis (FA)
  ● Non-negative Matrix Factorization (NMF)
  ● Linear Discriminant Analysis (LDA) (supervised)
  Non linear:
  ● Multidimensional scaling (MDS)
  ● Isometric feature mapping (Isomap)
  ● Locally Linear Embedding (LLE)
  ● Autoencoders
Most of these approaches are unsupervised.
Feature extraction: Principal Component Analysis
Principal Components Analysis (PCA)
● Goal: Find a low-dimensional space such that information loss is minimized when the data is projected on that space.
● Unsupervised: We're only looking at the data, not at any labels.
In PCA, we want the variance of the projected data to be maximized.
Warning! This requires standardizing the features.
[Figure: the same data set projected on $x_1$ and on $x_2$.]
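One common way to obtain the maximum-variance directions is the eigendecomposition of the covariance matrix of the standardized data. The sketch below follows that route as an assumption, since the derivation is not part of this section. The function name `pca` and its return values are illustrative, and it assumes no feature is constant.

```python
import numpy as np


def pca(X: np.ndarray, m: int):
    """Project X on its first m principal components (eigendecomposition sketch)."""
    # Standardize: mean 0 and variance 1 per feature (assumes no constant feature).
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # Covariance matrix of the standardized data (p x p).
    cov = Z.T @ Z / len(Z)
    # eigh handles symmetric matrices and returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Keep the m directions with the largest eigenvalues, largest first.
    order = np.argsort(eigvals)[::-1][:m]
    W = eigvecs[:, order]                 # p x m matrix of principal directions
    return Z @ W, W, eigvals[order]       # projections, directions, variances


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    latent = rng.normal(size=(500, 1))
    # Three noisy copies of one latent variable, on very different scales.
    X = np.hstack([latent, 100.0 * latent, 0.01 * latent]) + rng.normal(size=(500, 3)) * [0.3, 30.0, 0.003]
    T, W, variances = pca(X, m=2)
    print(variances)
```

The variance of the data projected on each returned direction equals the corresponding eigenvalue. Without the standardization step, the feature on the largest scale would dominate the first component.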
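Feature standardization
● Variance of feature j in data set D: $\mathrm{Var}(x_j) = \frac{1}{n} \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2$, where $\bar{x}_j = \frac{1}{n} \sum_{i=1}^{n} x_{ij}$ is the mean of feature j over the n samples of D.
● Features that take large values will have large variance.
  Compare [10, 20, 30, 40, 50] with [0.1, 0.2, 0.3, 0.4, 0.5].
● Standardization:
  – mean centering: give each feature a mean of 0
  – variance scaling: give each feature a variance of 1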
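The two example vectors make the effect of scale concrete. The sketch below computes their variances and standardizes them. It assumes the population variance (division by n). The function name `standardize` is chosen for this example.

```python
import numpy as np


def standardize(X: np.ndarray) -> np.ndarray:
    """Mean-center each column and scale it to unit variance (assumes no constant column)."""
    mean = X.mean(axis=0)   # per-feature mean
    std = X.std(axis=0)     # per-feature standard deviation (division by n)
    return (X - mean) / std


if __name__ == "__main__":
    # The two features from the slide, as the columns of one data matrix.
    X = np.array([[10, 0.1], [20, 0.2], [30, 0.3], [40, 0.4], [50, 0.5]], dtype=float)
    print("variances before:", X.var(axis=0))
    Z = standardize(X)
    print("means after:", Z.mean(axis=0))
    print("variances after:", Z.var(axis=0))
```

The two columns differ only by a factor of 100, so their variances differ by a factor of 10,000. After standardization they become identical.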