Statistics and learning: Multivariate statistics 2 and clustering


  1. Statistics and learning: Multivariate statistics 2 and clustering
     Emmanuel Rachelson and Matthieu Vignes, ISAE SupAero
     Wednesday 2nd and 9th October 2013
     E. Rachelson & M. Vignes (ISAE) SAD 2013 1 / 14

  6. Link to the previous session
     Goal of multivariate (exploratory) statistics: understanding high-dimensional data sets, reducing their 'useful' dimensions, representing them, seeking hidden or latent factors...
     Today we will:
     ◮ review PCA, if needed;
     ◮ introduce multidimensional scaling (MDS) as a factor analysis of a distance matrix;
     ◮ introduce canonical correlation analysis (CCA), for two groups of p and q quantitative variables;
     ◮ introduce correspondence analysis (CA), for 2 qualitative variables with several (many) levels;
     ◮ introduce clustering methods such as hierarchical clustering or k-means-like algorithms.

  10. Multidimensional scaling (MDS)
     ◮ Now only an index (a dissimilarity) between individuals is known; the variables are no longer observed. The data are an n × n matrix (think of distances).
     ◮ Goal: represent the cloud of points in a low-dimensional subspace.
     ◮ MDS = PCA on the distance matrix! (More precisely, classical MDS is an eigendecomposition of the doubly centred matrix of squared distances.)
     Easy example: road distances between 47 French cities. Is this distance Euclidean?
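
To make the MDS recipe concrete, here is a minimal sketch of classical (Torgerson) MDS, assuming NumPy is available. The function name `classical_mds` and the four-point toy data set are illustrative assumptions, not the course's 47-city table. The eigenvalues also give a way to approach the "Euclidean?" question: a dissimilarity matrix can be embedded exactly in a Euclidean space only if the doubly centred matrix has no negative eigenvalue, which road distances need not satisfy.

```python
import numpy as np


def classical_mds(D, k=2):
    """Classical (Torgerson) MDS of an n x n distance matrix D.

    Returns (coords, eigval): coords is n x k, eigval holds all n
    eigenvalues of the doubly centred matrix, in decreasing order.
    """
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n       # centring matrix
    B = -0.5 * J @ (D ** 2) @ J               # Gram matrix of the centred cloud
    eigval, eigvec = np.linalg.eigh(B)        # eigh returns ascending order
    order = np.argsort(eigval)[::-1]
    eigval, eigvec = eigval[order], eigvec[:, order]
    # Keep the k leading axes; clip tiny negative values before the square root
    coords = eigvec[:, :k] * np.sqrt(np.clip(eigval[:k], 0.0, None))
    return coords, eigval


# Toy data (hypothetical): four corners of a 3 x 4 rectangle in the plane
points = np.array([[0.0, 0.0], [3.0, 0.0], [3.0, 4.0], [0.0, 4.0]])
D = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

coords, eigval = classical_mds(D, k=2)
# Distances recomputed from the 2-D MDS coordinates
D_hat = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
print(np.allclose(D, D_hat), eigval.min())
```

Because the toy distances come from genuine planar points, two axes reproduce them exactly (up to rotation and reflection of the configuration). With real road distances one would instead inspect how many eigenvalues are clearly negative.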

  15. Canonical correlation analysis (CCA)
     ◮ Uses techniques close to PCA to achieve a kind of multiple-output multivariate regression.
     ◮ Goal: linking 2 groups of variables (X and Y) measured on the same individuals.
     ◮ Example from yesterday on the study of fatty acids and gene expression levels in mice: are some acids more abundant when some genes are over-expressed? Or conversely? → Practical session!
     ◮ Consists in looking for a pair of vectors, one related to X (gene expressions) and one to Y (metabolite levels), which are maximally correlated. The search is then iterated, each new pair being uncorrelated with the previous ones.
     ◮ Variables can be represented in either basis; it does not change the interpretation.

  16. CCA (cont'd)
     We need p, q ≤ n. We kept 10 genes and 11 fatty acids. More interpretation? → Practical session
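
The search for maximally correlated pairs can be sketched as follows, assuming NumPy. This uses a QR-then-SVD route, one standard way to compute canonical correlations; it is not necessarily what the practical session's software does. The function name `canonical_correlations` and the simulated data are hypothetical and stand in for the mice data set.

```python
import numpy as np


def canonical_correlations(X, Y):
    """Canonical correlations between the columns of X (n x p) and Y (n x q).

    Returns (rho, A, B): rho are the min(p, q) canonical correlations in
    decreasing order; the columns of A and B are weight vectors such that
    the centred X @ A[:, i] and centred Y @ B[:, i] have correlation rho[i].
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    n, p = X.shape
    q = Y.shape[1]
    if max(p, q) > n:
        raise ValueError("CCA needs p, q <= n")
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Orthonormal bases of the two centred column spaces
    Qx, Rx = np.linalg.qr(Xc)
    Qy, Ry = np.linalg.qr(Yc)
    # Singular values of Qx' Qy are the canonical correlations
    U, s, Vt = np.linalg.svd(Qx.T @ Qy)
    A = np.linalg.solve(Rx, U)        # weights on the X side (p x p)
    B = np.linalg.solve(Ry, Vt.T)     # weights on the Y side (q x q)
    return np.clip(s, 0.0, 1.0), A, B


# Simulated data (assumption): one latent factor z shared by X and Y
rng = np.random.default_rng(0)
n, p, q = 200, 3, 2
z = rng.normal(size=n)
X = rng.normal(size=(n, p))
X[:, 0] += 2 * z
Y = rng.normal(size=(n, q))
Y[:, 0] += 2 * z

rho, A, B = canonical_correlations(X, Y)
u = (X - X.mean(axis=0)) @ A          # canonical variates, X side
v = (Y - Y.mean(axis=0)) @ B          # canonical variates, Y side
print(rho)
```

The first pair `(u[:, 0], v[:, 0])` carries the strongest link between the two groups, and later pairs are uncorrelated with earlier ones, which is the iterative property described on the slide.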

  20. Correspondence analysis (CA)
     ◮ Called AFC (analyse factorielle des correspondances) in French.
     ◮ Similar concept to PCA: represent the distribution of the 2 variables and plot the individuals, but it applies to qualitative rather than quantitative data → contingency table $(n_{i,j})$.
     ◮ This is a double PCA (row and column profiles) on $X = (x_{i,j})$, where $x_{i,j} = \frac{f_{i,j}}{f_{i,.}\, f_{.,j}} - 1$ and $f_{i,j} = n_{i,j}/n$.
     ◮ Note that the $\chi^2$ statistic can be written $\chi^2 = n \sum_i \sum_j \tilde f_{i,j}\, x_{i,j}^2$, with $\tilde f_{i,j} = f_{i,.}\, f_{.,j}$ the frequency expected under independence.
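
The link between the matrix $X$, the $\chi^2$ statistic and the factorial axes can be checked numerically. The sketch below assumes NumPy and uses a made-up 3 × 4 contingency table (not the course data). It computes $\chi^2$ both from the formula above and from the usual observed-versus-expected form, then obtains CA coordinates from the SVD of the standardised residuals, a common way to implement CA.

```python
import numpy as np

# Hypothetical 3 x 4 contingency table of counts n_ij (illustration only)
N = np.array([[20.0, 10.0, 5.0, 5.0],
              [10.0, 20.0, 10.0, 5.0],
              [5.0, 10.0, 20.0, 30.0]])

n = N.sum()
F = N / n                       # relative frequencies f_ij
r = F.sum(axis=1)               # row margins f_i.
c = F.sum(axis=0)               # column margins f_.j
E = np.outer(r, c)              # frequencies expected under independence
X = F / E - 1.0                 # x_ij = f_ij / (f_i. f_.j) - 1

# Chi-squared from the slide's formula and from the classical definition
chi2_slide = n * np.sum(E * X ** 2)
chi2_classic = np.sum((N - n * E) ** 2 / (n * E))

# CA via SVD of the standardised residuals (f_ij - f_i. f_.j) / sqrt(f_i. f_.j)
S = (F - E) / np.sqrt(E)
U, sing, Vt = np.linalg.svd(S, full_matrices=False)
inertia = sing ** 2                                  # inertia carried by each axis
row_coords = (U * sing) / np.sqrt(r)[:, None]        # principal coordinates of rows
col_coords = (Vt.T * sing) / np.sqrt(c)[:, None]     # principal coordinates of columns

print(chi2_slide, chi2_classic, n * inertia.sum())
```

The three printed values agree: the total inertia decomposed by CA is $\chi^2 / n$, so the axes can be read as a breakdown of the departure from independence.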

  21. CA: an example
     Cultivated area in the Midi-Pyrénées region. Simultaneous representation of département and farm size (in 6 bins).

  22. Today
     ◮ "Clustering: unsupervised classification". Distance, hierarchical clustering (divisive or agglomerative).
     ◮ Keep in mind that this is still exploratory statistics, so the best clustering (including method, options, criterion, etc.) is the most useful one.
     ◮ End of the practical session on the mice data set.
     ◮ And a new guided session on multivariate statistics: CA on presidential elections, PCA and clustering (k-means and AHC) on a hotel data set, and multiple CA on 2 multiple-factor data sets.

  23. Clustering: grouping into classes
     Ever heard of that in your background?

  26. Cluster analysis or clustering
     ◮ Task of grouping objects so that objects belonging to the same group are 'more similar' to each other than to those in any other group → multi-objective optimisation task.
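
One common way to turn "more similar within a group than across groups" into a single criterion is k-means, which lowers the within-cluster sum of squares. The sketch below assumes NumPy and plain Lloyd iterations with random initial centres. The function name `kmeans`, the seed and the two simulated blobs are illustrative assumptions, not the hotel data set of the guided session.

```python
import numpy as np


def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd iterations; returns (labels, centers)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Initial centres: k distinct observations drawn at random
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centre
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        new_labels = d2.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break                       # assignments are stable
        labels = new_labels
        # Update step: each centre becomes the mean of its cluster
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return labels, centers


# Simulated data (assumption): two Gaussian blobs in the plane
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
                  rng.normal(10.0, 1.0, size=(50, 2))])

labels, centers = kmeans(data, k=2)

# Inertia decomposition: total = within-cluster + between-cluster
g = data.mean(axis=0)
total = ((data - g) ** 2).sum()
within = ((data - centers[labels]) ** 2).sum()
sizes = np.bincount(labels, minlength=2)
between = (sizes * ((centers - g) ** 2).sum(axis=1)).sum()
print(total, within, between)
```

Since the total is fixed by the data, a smaller within-cluster term means a larger between-cluster term, so here the two objectives (compact groups, well-separated groups) collapse into one. The result depends on the initial centres, which is one reason to compare several runs or methods in an exploratory setting.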
