SLIDE 1

Statistics and learning

Multivariate statistics 2 and clustering Emmanuel Rachelson and Matthieu Vignes

ISAE SupAero

Wednesday 2nd and 9th October 2013

E. Rachelson & M. Vignes (ISAE), SAD 2013, 1 / 14

SLIDE 6

Link to the previous session

Goal of multivariate (exploratory) statistics: understanding high-dimensional data sets, reducing them to their 'useful' dimensions, representing them, seeking hidden or latent factors... Today we will:

  • review PCA (if needed)
  • introduce Multidimensional scaling (MDS) as a factor analysis of a distance matrix
  • introduce Canonical correlation analysis (CCA): for p quantitative variables and q quantitative variables
  • introduce Correspondence analysis (CA): for 2 qualitative variables with several (many) levels
  • introduce clustering methods such as hierarchical clustering or K-means-like algorithms

SLIDE 10

Multidimensional scaling (MDS)

  • Now only an index between individuals is known; the variables themselves are no longer observed: an n × n matrix (think of distances).
  • Goal: represent the cloud of points in a low-dimensional subspace.
  • MDS = PCA on a distance matrix!

Easy example: road distances between 47 French cities. Is it Euclidean?
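The slogan "MDS = PCA on a distance matrix" corresponds to classical (Torgerson) scaling: double-centre the squared distances and take the top eigenvectors. The course practicals use R (where `cmdscale` does this); as an illustration only, here is a minimal NumPy sketch on synthetic points, since the 47-city distance matrix is not reproduced in the deck:

```python
import numpy as np

def classical_mds(D, k=2):
    """Classical (Torgerson) MDS: embed points from an n x n distance matrix D.
    Returns an (n, k) coordinate matrix whose pairwise distances approximate D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n           # centring matrix
    B = -0.5 * J @ (D ** 2) @ J                   # double-centred Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)          # eigenvalues in ascending order
    idx = np.argsort(eigvals)[::-1][:k]           # keep the k largest
    scale = np.sqrt(np.maximum(eigvals[idx], 0))  # clip tiny negatives (non-Euclidean noise)
    return eigvecs[:, idx] * scale

# Toy check: distances computed from known 2-D points are recovered (up to rotation).
rng = np.random.default_rng(0)
pts = rng.normal(size=(10, 2))
D = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
X = classical_mds(D, k=2)
D_hat = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
print(np.allclose(D, D_hat, atol=1e-6))  # → True
```

For a genuinely Euclidean distance matrix the reconstruction is exact; for road distances (the slide's question) some eigenvalues of B come out negative, which is precisely how one detects non-Euclideanness.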

SLIDE 15

Canonical correlation analysis (CCA)

  • Uses techniques close to PCA to achieve a kind of multiple-output multivariate regression.
  • Goal: linking 2 groups of variables (X and Y) measured on the same individuals.
  • Example from yesterday on the study of fatty acids and gene expression levels in mice: are some acids more present when some genes are over-expressed? Or conversely? → Practical session!
  • Consists in looking for a couple of vectors, one related to X (gene expressions) and one to Y (metabolite levels), which are maximally correlated; and then iteratively, with successive pairs uncorrelated with the previous ones.
  • Variables can be represented in either basis; it does not change the interpretation.
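The "couple of maximally correlated vectors" can be obtained by whitening each block and taking an SVD of the whitened cross-covariance. This is the standard textbook derivation, not the course's own code; the fatty-acid/gene data is replaced here by a synthetic set with one shared latent factor:

```python
import numpy as np

def cca_first_pair(X, Y, reg=1e-8):
    """First pair of canonical vectors (a, b) maximising corr(X a, Y b).
    reg is a small ridge term for numerical stability."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    n = X.shape[0]
    Sxx = Xc.T @ Xc / n + reg * np.eye(X.shape[1])
    Syy = Yc.T @ Yc / n + reg * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / n
    # Whiten each block, then SVD the whitened cross-covariance.
    Wx = np.linalg.inv(np.linalg.cholesky(Sxx))
    Wy = np.linalg.inv(np.linalg.cholesky(Syy))
    U, s, Vt = np.linalg.svd(Wx @ Sxy @ Wy.T)
    a = Wx.T @ U[:, 0]
    b = Wy.T @ Vt[0, :]
    return a, b, s[0]  # s[0] is the first canonical correlation

# Toy data: X and Y share a latent signal z, so the first canonical
# correlation should be close to 1.
rng = np.random.default_rng(1)
z = rng.normal(size=(200, 1))
X = np.hstack([z + 0.1 * rng.normal(size=(200, 1)), rng.normal(size=(200, 2))])
Y = np.hstack([z + 0.1 * rng.normal(size=(200, 1)), rng.normal(size=(200, 1))])
a, b, rho = cca_first_pair(X, Y)
print(rho > 0.9)  # → True
```

Further pairs (the "iteratively, uncorrelated" step of the slide) are simply the next singular triplets of the same whitened matrix.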

SLIDE 16

CCA (cont'd)

We need p, q ≤ n. We kept 10 genes and 11 fatty acids. More interpretation? → Practical session.


SLIDE 20

Correspondence analysis (CA)

  • Called AFC (analyse factorielle des correspondances) in French.
  • Similar in concept to PCA: represents the distribution of the 2 variables and plots the individuals, but applies to qualitative rather than quantitative data → contingency table (n_{i,j}).
  • This is a double PCA (row and column profiles) on x_{i,j} = f_{i,j} / (f_{i,.} f_{.,j}) − 1, with f_{i,j} = n_{i,j} / n.
  • Note that the χ² statistic writes χ² = n Σ_{i,j} f_{i,.} f_{.,j} x_{i,j}².
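The χ² identity above is easy to check numerically: the form n Σ f_{i,.} f_{.,j} x_{i,j}² agrees with the textbook Σ (observed − expected)² / expected. A small made-up table suffices:

```python
import numpy as np

# A made-up 2x2 contingency table, purely for illustration.
N = np.array([[30.0, 10.0],
              [10.0, 30.0]])
n = N.sum()
F = N / n                             # relative frequencies f_ij
fr = F.sum(axis=1, keepdims=True)     # row margins f_i.
fc = F.sum(axis=0, keepdims=True)     # column margins f_.j
X = F / (fr * fc) - 1                 # the matrix CA decomposes

# chi^2 in the slide's form ...
chi2_slide = n * np.sum(fr * fc * X ** 2)
# ... and in the (observed - expected)^2 / expected form
E = n * fr * fc
chi2_classic = np.sum((N - E) ** 2 / E)
print(chi2_slide)                          # → 20.0
print(np.isclose(chi2_slide, chi2_classic))  # → True
```

The two expressions are algebraically identical: with e_{ij} = n f_{i,.} f_{.,j}, one has (n_{ij} − e_{ij})² / e_{ij} = n f_{i,.} f_{.,j} x_{ij}².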

SLIDE 21

CA: an example

Cultivated area in the Midi-Pyrénées region. Simultaneous representation of départements and farm sizes (in 6 bins).

SLIDE 22

Today

  • "Clustering: unsupervised classification". Distance, hierarchical clustering (divisive or agglomerative).
  • Keep in mind that this is still exploratory statistics, so the best clustering (including method, options, criterion, etc.) is the most useful one!
  • End of the practical session on the mice data set.
  • A new guided session on multivariate stats: CA on presidential elections, PCA and clustering (k-means and AHC) on a hotel data set, and multiple CA on 2 multiple-factor data sets.

SLIDE 23

Clustering: grouping into classes

Ever heard of that in your background?


SLIDE 32

Cluster analysis or clustering

  • Task of grouping objects so that objects belonging to the same group are 'more similar' to each other than to those in any other group → a multi-objective optimisation task.
  • Several algorithms can do the job; their differences mainly lie in the distance used.
  • Different parameters (initialisation, distance used, stopping criterion...) may lead to different representations.


SLIDE 34

Clustering algorithms

Challenge: build your own clustering algorithm!

Let's quote only a few of the widespread clustering algorithms:

  • hierarchical clustering (with dissimilarity min → single, max → complete, or mean → average linkage)
  • centroid models (e.g. K-means clustering)
  • distribution models (statistical definition, e.g. multivariate Gaussian distributions)
  • graph or density models (e.g. cliques)
  • ...
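As a taste of the "build your own" challenge, here is a minimal Lloyd's K-means on synthetic data; a sketch only, not the reference implementation used in the practical session (R's `kmeans` does this with better initialisation):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm: alternate nearest-centre assignment
    and centroid update until the centres stop moving."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assign each point to its nearest centre
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        # recompute centroids, keeping the old centre if a cluster empties
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Two well-separated blobs: K-means should recover them exactly.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.3, size=(50, 2)),
               rng.normal(5, 0.3, size=(50, 2))])
labels, centers = kmeans(X, k=2)
print(labels[0] != labels[50])  # → True (the two blobs get different labels)
```

Swapping the distance or the centroid update is exactly where the other model families on this slide (medoids, Gaussian mixtures, density models) depart from this skeleton.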


SLIDE 36

Clustering: some formalism

  • Define a similarity (symmetry, self-similarity, boundedness) → dissimilarity.
  • A distance needs additional properties: d(i, j) = 0 ⇒ i = j, and the triangular inequality (e.g. the Euclidean distance, derived from a scalar product).
  • A goodness-of-fit of partitions can be defined: (i) external: TP, FP... → precision, sensitivity, or the Rand/Jaccard indices; or (ii) internal: the Dunn index D = min_i min_{j≠i} d(i, j) / max_k d′(k), the smallest between-cluster distance over the largest within-cluster diameter d′.
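The Dunn index is straightforward to compute from pairwise distances; this sketch assumes d′(k) denotes the diameter of cluster k, which is the usual convention:

```python
import numpy as np

def dunn_index(X, labels):
    """Dunn index: smallest between-cluster distance divided by the
    largest within-cluster diameter (higher = better-separated clusters)."""
    clusters = [X[labels == c] for c in np.unique(labels)]
    # minimal distance between points of different clusters
    min_between = min(
        np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1).min()
        for i, a in enumerate(clusters) for b in clusters[i + 1:]
    )
    # maximal diameter within a single cluster
    max_within = max(
        np.linalg.norm(c[:, None, :] - c[None, :, :], axis=-1).max()
        for c in clusters
    )
    return min_between / max_within

# Two tight, far-apart clusters give a large Dunn index.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
labels = np.array([0, 0, 1, 1])
print(dunn_index(X, labels) > 10)  # → True
```

Being internal, the index needs no ground-truth partition, unlike the precision/Rand/Jaccard criteria of branch (i).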

SLIDE 37

Homework

What do students choose after the French baccalauréat?

First describe, then represent this (simple) data set in some informative way. Hint: CA...

origin \ counselling | université | prep. clas. | other | Total
bac lit.             |         13 |           2 |     5 |    20
bac éco.             |         20 |           2 |     8 |    30
bac scient.          |         10 |           5 |     5 |    20
bac tech.            |          7 |           1 |    22 |    30
Total                |         50 |          10 |    40 |   100
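As a starting point for the homework (a hedged sketch of the mechanics, not the expected write-up): CA boils down to an SVD of the standardised residuals of the contingency table. The coordinate scaling below follows a common convention; CA software may scale the axes differently.

```python
import numpy as np

# The homework contingency table (rows: bac type; columns: counselling choice).
N = np.array([[13, 2, 5],
              [20, 2, 8],
              [10, 5, 5],
              [ 7, 1, 22]], dtype=float)
n = N.sum()
F = N / n
fr = F.sum(axis=1)                    # row margins
fc = F.sum(axis=0)                    # column margins
# Standardised residuals; CA is the SVD of this matrix.
S = (F - np.outer(fr, fc)) / np.sqrt(np.outer(fr, fc))
U, s, Vt = np.linalg.svd(S, full_matrices=False)
# Principal row / column coordinates on the first 2 axes
row_coords = (U[:, :2] * s[:2]) / np.sqrt(fr)[:, None]
col_coords = (Vt[:2].T * s[:2]) / np.sqrt(fc)[:, None]
print(row_coords.shape, col_coords.shape)  # → (4, 2) (3, 2)
# Sanity check: n times the total inertia equals the chi-squared statistic.
E = n * np.outer(fr, fc)
chi2 = ((N - E) ** 2 / E).sum()
print(np.isclose(n * (s ** 2).sum(), chi2))  # → True
```

Plotting the row and column coordinates on the same axes gives the simultaneous representation that CA is used for; interpreting which bac types sit near which counselling choices is the actual exercise.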


SLIDE 39

Finished

Next time: tests.

But before that: practice with R!