
3. Dimensionality Reduction - Chloé-Agathe Azencott, Centre for Computational Biology - PowerPoint PPT Presentation



  1. Introduction to Machine Learning CentraleSupélec Paris — Fall 2017 3. Dimensionality Reduction Chloé-Agathe Azencott Centre for Computational Biology, Mines ParisTech chloe-agathe.azencott@mines-paristech.fr

  2. Learning objectives ● Give reasons why one would wish to reduce the dimensionality of a data set. ● Explain the difference between feature selection and feature extraction. ● Implement some filter strategies. ● Implement some wrapper strategies. ● Derive the computation of principal components from a “max variance” definition. ● Implement PCA.

  3. Curse of dimensionality ● Methods / intuitions that work in low dimension may not apply to high dimensions. ● p=2: Fraction of the points within a square that fall outside of the circle inscribed in it: ?

  4. Curse of dimensionality ● Methods / intuitions that work in low dimension may not apply to high dimensions. ● p=2: Fraction of the points within a square that fall outside of the circle inscribed in it: for a circle of radius r inscribed in a square of side 2r, the fraction is 1 - πr² / (2r)² = 1 - π/4 ≈ 0.21.

  5. Curse of dimensionality ● Methods / intuitions that work in low dimension may not apply to high dimensions. ● p=3: Fraction of the points within a cube that fall outside of the sphere inscribed in it: for a sphere of radius r inscribed in a cube of side 2r, the fraction is 1 - (4/3)πr³ / (2r)³ = 1 - π/6 ≈ 0.48.

  6. Curse of dimensionality ● Volume of a p-sphere of radius r: V_p(r) = π^(p/2) r^p / Γ(p/2 + 1). The Gamma function Γ generalizes the factorial: Γ(n) = (n-1)!. ● When p ↗ the proportion of a hypercube outside of its inscribed hypersphere approaches 1. ● What this means: – hyperspace is very big – all points are far apart ⇒ dimensionality reduction.
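The claim on slide 6 can be checked numerically. Below is a minimal sketch (mine, not from the slides), assuming only the p-ball volume formula above: it computes the fraction of a hypercube that lies outside its inscribed hypersphere for a few values of p.

```python
# Sketch (not from the lecture): fraction of a hypercube lying outside its
# inscribed hypersphere, using V_p(r) = pi^(p/2) r^p / Gamma(p/2 + 1).
from math import pi, gamma

def fraction_outside(p):
    """Fraction of the [-1, 1]^p cube that falls outside the unit p-ball."""
    inside = pi ** (p / 2) / (2 ** p * gamma(p / 2 + 1))  # V_p(1) / (2r)^p with r = 1
    return 1 - inside

for p in (2, 3, 10, 50):
    print(p, round(fraction_outside(p), 4))
# 2 -> 0.2146, 3 -> 0.4764, 10 -> 0.9975, 50 -> ~1.0
```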

  7. More reasons to reduce dimensionality ● Computational complexity (time and space) ● Interpretability ● Simpler models are more robust (less variance) ● Data visualization ● Cost of data acquisition ● Eliminate non-relevant attributes that can make it harder for an algorithm to learn.

  8. Approaches to dimensionality reduction ● Feature selection Choose m < p features, ignore the remaining (p-m) – Filtering approaches Apply a statistical measure to assign a score to each feature (correlation, χ²-test). – Wrapper approaches Search problem: Find the best set of features for a given predictive model. – Embedded approaches Simultaneously fit a model and learn which features should be included. All these feature selection approaches are supervised.



  11. Approaches to dimensionality reduction ● Feature selection Choose m < p features, ignore the remaining (p-m) – Filtering approaches Apply a statistical measure to assign a score to each feature (correlation, χ²-test). – Wrapper approaches Search problem: Find the best set of features for a given predictive model. – Embedded approaches Simultaneously fit a model and learn which features should be included. Are those approaches supervised or unsupervised? All these feature selection approaches are supervised.
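As a concrete illustration of a filter approach, here is a hedged sketch (not the lecture's code): it scores each feature by the absolute Pearson correlation between that feature and the label, and keeps the m best-scoring ones. The function name and the toy data are illustrative assumptions.

```python
# Sketch of a filter strategy: rank features by |Pearson correlation| with y.
import numpy as np

def correlation_filter(X, y, m):
    """Return the indices of the m features most correlated with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    scores = np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    return np.argsort(scores)[::-1][:m]

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = X[:, 3] + 0.5 * X[:, 7] + rng.normal(scale=0.1, size=100)
print(correlation_filter(X, y, 2))  # expected to recover features 3 and 7
```

A χ²-based filter works the same way, with the χ² statistic of each (feature, label) pair as the score; scikit-learn exposes this pattern as SelectKBest.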

  12. Feature selection: Overview [Overview diagram: filter approaches score all features before a predictor is trained; wrapper approaches search over feature sets, training a predictor on each candidate set; embedded approaches (e.g. Lasso, Elastic Net) fit the predictor and select features in a single step. See Chap. 7.] Subset selection: forward selection, backward selection, floating selection.
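For the embedded branch of the diagram, a hedged sketch (assuming scikit-learn; the toy data are made up) of feature selection with the Lasso: the L1 penalty drives some coefficients exactly to zero, so fitting the model and choosing the features happen in one step.

```python
# Sketch of an embedded approach: Lasso keeps only features with non-zero weights.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = 3 * X[:, 0] - 2 * X[:, 5] + rng.normal(scale=0.1, size=100)

lasso = Lasso(alpha=0.1).fit(X, y)
kept = np.flatnonzero(lasso.coef_)  # indices of the selected features
print(kept)                         # expected to contain 0 and 5
```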

  13. Feature selection: Subset selection


  15. Subset selection ● Goal: Find the subset of features that leads to the best-performing algorithm. ● How many subsets of p features are there?

  16. Subset selection ● Goal: Find the subset of features that leads to the best-performing algorithm. ● Issue: there are 2^p such sets.

  17. Subset selection ● Goal: Find the subset of features that leads to the best-performing algorithm. ● Issue: there are 2^p such sets. ● Greedy approach: forward search. Let E(S) be the error of a predictor trained only using the features in S. Add the “best” feature at each step: – Initially: S = ∅ – New best feature: j* = argmin_{j ∉ S} E(S ∪ {j}) – stop if E(S ∪ {j*}) ≥ E(S) – else: S ← S ∪ {j*}

  18. Subset selection ● Goal: Find the subset of features that leads to the best-performing algorithm. ● Issue: there are 2^p such sets. ● Greedy approach: forward search. Let E(S) be the error of a predictor trained only using the features in S. Add the “best” feature at each step: – Initially: S = ∅ – New best feature: j* = argmin_{j ∉ S} E(S ∪ {j}) – stop if E(S ∪ {j*}) ≥ E(S) – else: S ← S ∪ {j*} What is the complexity of this algorithm?

  19. Subset selection ● Goal: Find the subset of features that leads to the best-performing algorithm. ● Issue: there are 2^p such sets. ● Greedy approach: forward search. Let E(S) be the error of a predictor trained only using the features in S. Add the “best” feature at each step: – Initially: S = ∅ – New best feature: j* = argmin_{j ∉ S} E(S ∪ {j}) – stop if E(S ∪ {j*}) ≥ E(S) – else: S ← S ∪ {j*} Complexity: O(p² × C) where C = complexity of training and evaluating the model (which might itself depend on p). Much better than O(2^p)!

  20. Subset selection ● Greedy approach: forward search. Let E(S) be the error of a predictor trained only using the features in S. Add the “best” feature at each step: – Initially: S = ∅ – New best feature: j* = argmin_{j ∉ S} E(S ∪ {j}) – stop if E(S ∪ {j*}) ≥ E(S) – else: S ← S ∪ {j*} Complexity: O(p²) model fits. ● Alternative strategies: – Backward search: start from {1, …, p}, eliminate features. – Floating search: alternately add q features and remove r features.
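Below is a hedged sketch of the forward search described above (a wrapper strategy). It assumes a scikit-learn style estimator and uses cross-validated mean squared error as E(S); the helper names, the choice of LinearRegression, and the toy data are illustrative, not from the slides.

```python
# Sketch of greedy forward selection (wrapper approach).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def error(model, X, y, features):
    """E(S): cross-validated error of the model trained on the features in S."""
    scores = cross_val_score(model, X[:, features], y, cv=5,
                             scoring="neg_mean_squared_error")
    return -scores.mean()

def forward_selection(X, y, model=None):
    model = model if model is not None else LinearRegression()
    selected, best_err = [], np.inf       # initially S is empty
    remaining = list(range(X.shape[1]))
    while remaining:
        # new best feature: the one whose addition gives the lowest error
        err, j = min((error(model, X, y, selected + [j]), j) for j in remaining)
        if err >= best_err:               # stop: no remaining feature helps
            break
        selected.append(j)                # else: add it to S
        remaining.remove(j)
        best_err = err
    return selected

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 2 * X[:, 1] - X[:, 4] + rng.normal(scale=0.1, size=100)
print(forward_selection(X, y))  # expected to pick features 1 and 4, then stop
```

Each outer iteration tries at most p candidate features and there are at most p iterations, which gives the O(p²) model fits quoted on the slide.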

  21. Approaches to dimensionality reduction ● Feature extraction Project the p features on m < p new dimensions. Linear: ● Principal Components Analysis (PCA) ● Factor Analysis (FA) ● Non-negative Matrix Factorization (NMF) ● Linear Discriminant Analysis (LDA) (supervised). Non-linear: ● Multidimensional scaling (MDS) ● Isometric feature mapping (Isomap) ● Locally Linear Embedding (LLE) ● Autoencoders. Most of these approaches are unsupervised.
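As a quick orientation, a hedged sketch (assuming scikit-learn; the random data are made up) showing that several of the listed feature-extraction methods share the same fit_transform interface:

```python
# Sketch: a few of the listed feature extraction methods in scikit-learn.
import numpy as np
from sklearn.decomposition import PCA, FactorAnalysis, NMF
from sklearn.manifold import MDS, Isomap, LocallyLinearEmbedding

rng = np.random.default_rng(0)
X = np.abs(rng.normal(size=(100, 10)))  # non-negative so that NMF also applies

# LDA is omitted here: being supervised, it also needs labels y.
for model in (PCA(n_components=2), FactorAnalysis(n_components=2),
              NMF(n_components=2, init="random", random_state=0),
              MDS(n_components=2), Isomap(n_components=2),
              LocallyLinearEmbedding(n_components=2)):
    Z = model.fit_transform(X)          # n_samples x 2 embedding
    print(type(model).__name__, Z.shape)
```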

  22. Feature extraction: Principal Component Analysis

  23. Principal Components Analysis (PCA) ● Goal: Find a low-dimensional space such that information loss is minimized when the data is projected on that space.

  24. Principal Components Analysis (PCA) ● Goal: Find a low-dimensional space such that information loss is minimized when the data is projected on that space. ● Unsupervised: We're only looking at the data, not at any labels.

  25. Principal Components Analysis (PCA) ● Goal: Find a low-dimensional space such that information loss is minimized when the data is projected on that space. ● Unsupervised: We're only looking at the data, not at any labels. In PCA, we want the variance to be maximized.

  26. Principal Components Analysis (PCA) ● Goal: Find a low-dimensional space such that information loss is minimized when the data is projected on that space. ● Unsupervised: We're only looking at the data, not at any labels. In PCA, we want the variance to be maximized. [Figure: projection on x1 vs. projection on x2.]

  27. Principal Components Analysis (PCA) ● Goal: Find a low-dimensional space such that information loss is minimized when the data is projected on that space. ● Unsupervised: We're only looking at the data, not at any labels. In PCA, we want the variance to be maximized. Warning! This requires standardizing the features. [Figure: projection on x1 vs. projection on x2.]
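A minimal sketch of PCA in the “max variance” spirit of these slides, using only NumPy: standardize the data, then take the top eigenvectors of its covariance matrix as the projection directions. The helper name and toy data are illustrative assumptions.

```python
# Sketch of PCA via eigendecomposition of the covariance of standardized data.
import numpy as np

def pca(X, m):
    """Project X (n_samples x p) onto its first m principal components."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize (see the next slides)
    cov = (Z.T @ Z) / Z.shape[0]               # p x p covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:m]      # top-m variance directions
    return Z @ eigvecs[:, order], eigvals[order]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 1] = 2 * X[:, 0] + 0.1 * rng.normal(size=200)  # strongly correlated pair
scores, variances = pca(X, 2)
print(scores.shape, variances)  # (200, 2) and the two largest variances
```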

  28. Feature standardization ● Variance of feature j in data set D: ?

  29. Feature standardization ● Variance of feature j in data set D: σ_j² = (1/n) Σ_{i=1}^{n} (x_j^i - μ_j)², where μ_j = (1/n) Σ_{i=1}^{n} x_j^i is the mean of feature j. ● Features that take large values will have large variance. Compare [10, 20, 30, 40, 50] with [0.1, 0.2, 0.3, 0.4, 0.5]. ● Standardization: – mean centering: give each feature a mean of 0 – variance scaling: give each feature a variance of 1


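A short sketch of standardization on the two example feature scales above; the scikit-learn cross-check is an assumption that StandardScaler is acceptable here.

```python
# Sketch: mean centering and variance scaling, column by column.
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[10., 0.1], [20., 0.2], [30., 0.3], [40., 0.4], [50., 0.5]])

mu = X.mean(axis=0)        # per-feature mean
sigma = X.std(axis=0)      # per-feature standard deviation
X_std = (X - mu) / sigma   # each column now has mean 0 and variance 1

assert np.allclose(X_std, StandardScaler().fit_transform(X))
print(X_std.mean(axis=0), X_std.var(axis=0))  # ~[0, 0] and [1, 1]
```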
