SLIDE 1

Statistics and learning

Multivariate statistics 2 and clustering Emmanuel Rachelson and Matthieu Vignes

ISAE SupAero

Wednesday 2nd and 9th October 2013

E. Rachelson & M. Vignes (ISAE), SAD 2013, 1 / 14

SLIDE 6

Link to the previous session

Goal of multivariate (exploratory) statistics: understanding high-dimensional data sets, reducing them to their 'useful' dimensions, representing them, seeking hidden or latent factors... Today we will:

  • review PCA (if needed)
  • introduce Multidimensional scaling (MDS) as a factor analysis of a distance matrix
  • introduce Canonical correlation analysis (CCA): for p quantitative variables and q quantitative variables
  • introduce Correspondence analysis (CA): for 2 qualitative variables with several (many) levels
  • introduce clustering methods such as hierarchical clustering or K-means-like algorithms

SLIDE 10

Multidimensional scaling (MDS)

  • Now only an index between individuals is known; the variables themselves are no longer observed: an n × n matrix (think of distances).
  • Goal: represent the cloud of points in a low-dimensional subspace.
  • MDS = PCA on a distance matrix!

Easy example: road distances between 47 French cities. Is it Euclidean?
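The slogan "MDS = PCA on a distance matrix" corresponds to classical (Torgerson) scaling: double-centre the squared distances and take the top eigenvectors. The course practicals use R (where `cmdscale` does this); as an illustration only, here is a minimal NumPy sketch on synthetic points, since the 47-city distance matrix is not reproduced in the deck:

```python
import numpy as np

def classical_mds(D, k=2):
    """Classical (Torgerson) MDS: embed points from an n x n distance matrix D.
    Returns an (n, k) coordinate matrix whose pairwise distances approximate D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n           # centring matrix
    B = -0.5 * J @ (D ** 2) @ J                   # double-centred Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)          # eigenvalues in ascending order
    idx = np.argsort(eigvals)[::-1][:k]           # keep the k largest
    scale = np.sqrt(np.maximum(eigvals[idx], 0))  # clip tiny negatives (non-Euclidean noise)
    return eigvecs[:, idx] * scale

# Toy check: distances computed from known 2-D points are recovered (up to rotation).
rng = np.random.default_rng(0)
pts = rng.normal(size=(10, 2))
D = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
X = classical_mds(D, k=2)
D_hat = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
print(np.allclose(D, D_hat, atol=1e-6))  # → True
```

For a genuinely Euclidean distance matrix the reconstruction is exact; for road distances (the slide's question) some eigenvalues of B come out negative, which is precisely how one detects non-Euclideanness.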

SLIDE 15

Canonical correlation analysis (CCA)

  • Uses techniques close to PCA to achieve a kind of multiple-output multivariate regression.
  • Goal: linking 2 groups of variables (X and Y) measured on the same individuals.
  • Example from yesterday on the study of fatty acids and gene expression levels in mice: are some acids more present when some genes are over-expressed? Or conversely? → Practical session!
  • Consists in looking for a couple of vectors, one related to X (gene expressions) and one to Y (metabolite levels), which are maximally correlated; and then iteratively, with successive pairs uncorrelated with the previous ones.
  • Variables can be represented in either basis; it does not change the interpretation.
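The "couple of maximally correlated vectors" can be obtained by whitening each block and taking an SVD of the whitened cross-covariance. This is the standard textbook derivation, not the course's own code; the fatty-acid/gene data is replaced here by a synthetic set with one shared latent factor:

```python
import numpy as np

def cca_first_pair(X, Y, reg=1e-8):
    """First pair of canonical vectors (a, b) maximising corr(X a, Y b).
    reg is a small ridge term for numerical stability."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    n = X.shape[0]
    Sxx = Xc.T @ Xc / n + reg * np.eye(X.shape[1])
    Syy = Yc.T @ Yc / n + reg * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / n
    # Whiten each block, then SVD the whitened cross-covariance.
    Wx = np.linalg.inv(np.linalg.cholesky(Sxx))
    Wy = np.linalg.inv(np.linalg.cholesky(Syy))
    U, s, Vt = np.linalg.svd(Wx @ Sxy @ Wy.T)
    a = Wx.T @ U[:, 0]
    b = Wy.T @ Vt[0, :]
    return a, b, s[0]  # s[0] is the first canonical correlation

# Toy data: X and Y share a latent signal z, so the first canonical
# correlation should be close to 1.
rng = np.random.default_rng(1)
z = rng.normal(size=(200, 1))
X = np.hstack([z + 0.1 * rng.normal(size=(200, 1)), rng.normal(size=(200, 2))])
Y = np.hstack([z + 0.1 * rng.normal(size=(200, 1)), rng.normal(size=(200, 1))])
a, b, rho = cca_first_pair(X, Y)
print(rho > 0.9)  # → True
```

Further pairs (the "iteratively, uncorrelated" step of the slide) are simply the next singular triplets of the same whitened matrix.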

SLIDE 16

CCA (cont'd)

We need p, q ≤ n. We kept 10 genes and 11 fatty acids. More interpretation? → Practical session.


SLIDE 20

Correspondence analysis (CA)

  • Called AFC (analyse factorielle des correspondances) in French.
  • Similar in concept to PCA: represents the distribution of the 2 variables and plots the individuals, but applies to qualitative rather than quantitative data → contingency table (n_{i,j}).
  • This is a double PCA (row and column profiles) on x_{i,j} = f_{i,j} / (f_{i,.} f_{.,j}) − 1, with f_{i,j} = n_{i,j} / n.
  • Note that the χ² statistic writes χ² = n Σ_{i,j} f_{i,.} f_{.,j} x_{i,j}².
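The χ² identity above is easy to check numerically: the form n Σ f_{i,.} f_{.,j} x_{i,j}² agrees with the textbook Σ (observed − expected)² / expected. A small made-up table suffices:

```python
import numpy as np

# A made-up 2x2 contingency table, purely for illustration.
N = np.array([[30.0, 10.0],
              [10.0, 30.0]])
n = N.sum()
F = N / n                             # relative frequencies f_ij
fr = F.sum(axis=1, keepdims=True)     # row margins f_i.
fc = F.sum(axis=0, keepdims=True)     # column margins f_.j
X = F / (fr * fc) - 1                 # the matrix CA decomposes

# chi^2 in the slide's form ...
chi2_slide = n * np.sum(fr * fc * X ** 2)
# ... and in the (observed - expected)^2 / expected form
E = n * fr * fc
chi2_classic = np.sum((N - E) ** 2 / E)
print(chi2_slide)                          # → 20.0
print(np.isclose(chi2_slide, chi2_classic))  # → True
```

The two expressions are algebraically identical: with e_{ij} = n f_{i,.} f_{.,j}, one has (n_{ij} − e_{ij})² / e_{ij} = n f_{i,.} f_{.,j} x_{ij}².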

SLIDE 21

CA: an example

Cultivated area in the Midi-Pyrénées region. Simultaneous representation of départements and farm sizes (in 6 bins).

SLIDE 22

Today

  • "Clustering: unsupervised classification". Distance, hierarchical clustering (divisive or agglomerative).
  • Keep in mind that this is still exploratory statistics, so the best clustering (including method, options, criterion, etc.) is the most useful one!
  • End of the practical session on the mice data set.
  • A new guided session on multivariate stats: CA on presidential elections, PCA and clustering (k-means and AHC) on a hotel data set, and multiple CA on 2 multiple-factor data sets.

SLIDE 23

Clustering: grouping into classes

Ever heard of that in your background?


SLIDE 32

Cluster analysis or clustering

  • Task of grouping objects so that objects belonging to the same group are 'more similar' to each other than to those in any other group → a multi-objective optimisation task.
  • Several algorithms can do the job; their differences mainly lie in the distance used.
  • Different parameters (initialisation, distance used, stopping criterion...) may lead to different representations.


SLIDE 34

Clustering algorithms

Challenge: build your own clustering algorithm!

Let's quote only a few of the widespread clustering algorithms:

  • hierarchical clustering (with dissimilarity min → single, max → complete, or mean → average linkage)
  • centroid models (e.g. K-means clustering)
  • distribution models (statistical definition, e.g. multivariate Gaussian distributions)
  • graph or density models (e.g. cliques)
  • ...
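As a taste of the "build your own" challenge, here is a minimal Lloyd's K-means on synthetic data; a sketch only, not the reference implementation used in the practical session (R's `kmeans` does this with better initialisation):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm: alternate nearest-centre assignment
    and centroid update until the centres stop moving."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assign each point to its nearest centre
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        # recompute centroids, keeping the old centre if a cluster empties
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Two well-separated blobs: K-means should recover them exactly.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.3, size=(50, 2)),
               rng.normal(5, 0.3, size=(50, 2))])
labels, centers = kmeans(X, k=2)
print(labels[0] != labels[50])  # → True (the two blobs get different labels)
```

Swapping the distance or the centroid update is exactly where the other model families on this slide (medoids, Gaussian mixtures, density models) depart from this skeleton.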


SLIDE 36

Clustering: some formalism

  • Define a similarity (symmetry, self-similarity, boundedness) → dissimilarity.
  • A distance needs additional properties: d(i, j) = 0 ⇒ i = j, and the triangular inequality (e.g. the Euclidean distance, derived from a scalar product).
  • A goodness-of-fit of partitions can be defined: (i) external: TP, FP... → precision, sensitivity, or the Rand/Jaccard indices; or (ii) internal: the Dunn index D = min_i min_{j≠i} d(i, j) / max_k d′(k), the smallest between-cluster distance over the largest within-cluster diameter d′.
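The Dunn index is straightforward to compute from pairwise distances; this sketch assumes d′(k) denotes the diameter of cluster k, which is the usual convention:

```python
import numpy as np

def dunn_index(X, labels):
    """Dunn index: smallest between-cluster distance divided by the
    largest within-cluster diameter (higher = better-separated clusters)."""
    clusters = [X[labels == c] for c in np.unique(labels)]
    # minimal distance between points of different clusters
    min_between = min(
        np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1).min()
        for i, a in enumerate(clusters) for b in clusters[i + 1:]
    )
    # maximal diameter within a single cluster
    max_within = max(
        np.linalg.norm(c[:, None, :] - c[None, :, :], axis=-1).max()
        for c in clusters
    )
    return min_between / max_within

# Two tight, far-apart clusters give a large Dunn index.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
labels = np.array([0, 0, 1, 1])
print(dunn_index(X, labels) > 10)  # → True
```

Being internal, the index needs no ground-truth partition, unlike the precision/Rand/Jaccard criteria of branch (i).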

SLIDE 37

Homework

What do students choose after the French baccalauréat?

First describe, then represent this (simple) data set in some informative way. Hint: CA...

origin \ counselling | université | prep. clas. | other | Total
bac lit.             |         13 |           2 |     5 |    20
bac éco.             |         20 |           2 |     8 |    30
bac scient.          |         10 |           5 |     5 |    20
bac tech.            |          7 |           1 |    22 |    30
Total                |         50 |          10 |    40 |   100
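As a starting point for the homework (a hedged sketch of the mechanics, not the expected write-up): CA boils down to an SVD of the standardised residuals of the contingency table. The coordinate scaling below follows a common convention; CA software may scale the axes differently.

```python
import numpy as np

# The homework contingency table (rows: bac type; columns: counselling choice).
N = np.array([[13, 2, 5],
              [20, 2, 8],
              [10, 5, 5],
              [ 7, 1, 22]], dtype=float)
n = N.sum()
F = N / n
fr = F.sum(axis=1)                    # row margins
fc = F.sum(axis=0)                    # column margins
# Standardised residuals; CA is the SVD of this matrix.
S = (F - np.outer(fr, fc)) / np.sqrt(np.outer(fr, fc))
U, s, Vt = np.linalg.svd(S, full_matrices=False)
# Principal row / column coordinates on the first 2 axes
row_coords = (U[:, :2] * s[:2]) / np.sqrt(fr)[:, None]
col_coords = (Vt[:2].T * s[:2]) / np.sqrt(fc)[:, None]
print(row_coords.shape, col_coords.shape)  # → (4, 2) (3, 2)
# Sanity check: n times the total inertia equals the chi-squared statistic.
E = n * np.outer(fr, fc)
chi2 = ((N - E) ** 2 / E).sum()
print(np.isclose(n * (s ** 2).sum(), chi2))  # → True
```

Plotting the row and column coordinates on the same axes gives the simultaneous representation that CA is used for; interpreting which bac types sit near which counselling choices is the actual exercise.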


SLIDE 39

Finished

Next time: tests.

But before that: practice with R!