Correspondence Analysis. P. CAZES CEREMADE, University Paris - - PowerPoint PPT Presentation



SLIDE 1

Some Comments on Correspondence Analysis.

  • P. CAZES

CEREMADE, University Paris Dauphine

SLIDE 2

Overview

  • Data analysis as an experimental science
  • The Laboratory of Statistics of University Paris 6 in the seventies
  • Coding
  • Correspondence analysis as a particular case of other methods
  • Correspondence analysis and modelling techniques
  • Correspondence analysis and data analysis since 2000
  • Bibliography
SLIDE 3

Data analysis as an experimental science

  • Theoretical results were discovered and proved after first being observed on computer listings, as with experiments in physics.
  • Indices (inertia rates, contributions, test values, etc.) were set up to validate the results, like error computations in physics.
  • Coding techniques define the ad hoc table to be analyzed and the sequence of analyses (descriptive, explanatory or decisional) needed to treat the data. This problem is analogous to setting up an experiment in physics.

SLIDE 4

Examples of results first seen on listings and then proved

  • In the CA of a contingency table crossing two sets I and J, the moments of inertia of the factorial axes of the clouds $N_I$ and $N_J$, associated with I and J respectively, are equal. This result is now standard (B. Escofier, PhD, 1965).
  • The CA of a doubled table of 0s and 1s has a total inertia equal to 1.

SLIDE 5

  • The CA of a doubled table of 0s and 1s is equivalent to the normed PCA (NPCA) of the undoubled (initial) table (Benzécri, J.P. Pagès, Bara: PhD, serum data, 1971).
    – Furthermore, $\lambda_{CA} = \lambda_{NPCA} / p$, where p is the number of variables (columns) of the initial table.
    – We then recover: inertia in CA = inertia in NPCA / p = p / p = 1.
    – Same representation of the rows on the factorial axes (with the factor $1/\sqrt{p}$ to pass from NPCA to CA).

SLIDE 6

The Laboratory of Statistics of University Paris 6 in the seventies

  • Prof. Benzécri was director of the laboratory and also head of the Master 2 (M2) in statistics (research master), with 150 to 200 students (the "greatest" M2 in France).
  • 40 PhDs defended each year since 1974 (3 or 4 defenses each Monday in May and June).
  • Very numerous and diverse examples of applications:
    – Biology
    – Ecology
    – Economy
    – Geology
    – Linguistics
    – Medicine
    – Physics
    – Psychology
    – Sociology

SLIDE 7

  • Diversity of the students' origins:
    – French, of course, but also African, Argentinean, Greek, Egyptian, Iranian, Irish, Lebanese, Syrian, Turkish, Vietnamese, etc.
  • Consequences:
    – Great discussions
    – Numerous ideas
    – Exceptional impact of the laboratory: publications, colloquiums, etc.
SLIDE 8

Publications of Professor Benzécri

  • Creation in 1976 of the Cahiers de l'Analyse des Données (CAD), which were digitized by NUMDAM at the end of 2010 but are not online to date.
  • Publication in 1973 of the two famous books on data analysis, L'analyse des données:
    – Tome 1 : La taxinomie
    – Tome 2 : L'analyse des correspondances
  • Publication in 1982 of the book Histoire et préhistoire de l'analyse des données.

SLIDE 9

  • Publication of the 5 books of the collection "Pratique de l'Analyse des Données":
    – Tome 1 : Analyse des correspondances. Exposé élémentaire, 1980 (translated into English by Gopalan in 1992)
    – Tome 2 : Abrégé théorique. Etude de cas modèles, 1980
    – Tome 3 : Linguistique & lexicologie, 1981
    – Tome 4 : Médecine, pharmacologie, physiologie clinique, 1992
    – Tome 5 : Economie, 1986

SLIDE 10

Colloquiums

  • Very friendly and productive
  • Held in numerous French universities starting in 1970:
    – Besançon
    – Marseille
    – Nice
    – Rennes
    – L'Arbresle, near Lyon
    – etc.

SLIDE 11

Coding

I Usual coding

  • Doubling of a data table (ratings, ranks, 0 and 1, etc.)
  • Complete disjunctive coding (0, 1)
  • Fuzzy coding:
    – barycentric coding of a quantitative variable into 3 or r modalities
    – coding that removes the subject's personal equation when the subjects give a certain number of ratings (the coding differs from one individual to another)

SLIDE 12

  • Case of an exchange table $k_{IJ}$ with I = J (Leontief table, import-export table). Example:
    – k(i, j): total imports from country i toward country j.
    – Do the CA of the table $(k_{IJ}, k_{JI})$, the juxtaposition of $k_{IJ}$ and its transpose $k_{JI}$.
    – This puts on a single row i all the exchanges of country i with the countries j (imports and exports).

SLIDE 13

    – Yagolnitzer [CAD, 1977] suggested doing the CA of the following table:

$\begin{pmatrix} k_{IJ} & k_{JI} \\ k_{JI} & k_{IJ} \end{pmatrix}$

    – Yagolnitzer's analysis is equivalent to doing:
      • the CA of the mean exchange table $(k_{IJ} + k_{JI})/2$, and
      • the factorial analysis of the flux table $(k_{IJ} - k_{JI})/2$ with the weightings (weights and metric) given by the CA of the table $(k_{IJ} + k_{JI})/2$.
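The equivalence can be illustrated on eigenvalues. This is a minimal sketch on a random square table (the data and the helper name `ca_residuals` are assumptions made for the illustration): the eigenvalues of the CA of the block table are the union of those of the CA of the mean table and those of the flux table analyzed with the weights and metric of the mean table.

```python
import numpy as np

def ca_residuals(table):
    """Standardized residuals (P - r c^T) / sqrt(r c^T) of a table."""
    P = table / table.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    return (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))

rng = np.random.default_rng(1)
n = 5
K = rng.integers(1, 20, size=(n, n)).astype(float)   # hypothetical exchange table

B = np.block([[K, K.T], [K.T, K]])   # Yagolnitzer's block table
M = (K + K.T) / 2                    # mean exchange table (symmetric)
A = (K - K.T) / 2                    # flux table (antisymmetric)

lam_block = np.linalg.svd(ca_residuals(B), compute_uv=False) ** 2

# CA of the mean table
lam_mean = np.linalg.svd(ca_residuals(M), compute_uv=False) ** 2

# Flux table with the weights and metric of the CA of M (no centering needed:
# the margins of an antisymmetric table weighted this way play no role here)
m = M.sum(axis=1) / M.sum()          # common row/column masses of M
S_flux = (A / M.sum()) / np.sqrt(np.outer(m, m))
lam_flux = np.linalg.svd(S_flux, compute_uv=False) ** 2

lam_split = np.sort(np.concatenate([lam_mean, lam_flux]))[::-1]
print(np.sort(lam_block)[::-1])
print(lam_split)
```

The two printed spectra coincide, so the block analysis separates into a symmetric part and a flux part.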

SLIDE 14

  • Other coding techniques:
    – use of supplementary (passive or illustrative) elements
      • to refine the interpretation that appears in the ternary table
      • in certain procedures such as discriminant analysis or scoring
      • etc.
    – etc.

SLIDE 15

II Coding that yields equivalence with other analyses

II.1 Case of Principal Component Analysis (PCA)

$X = \{ x_{ij} \mid 1 \le i \le n,\ 1 \le j \le p \}$ crosses a set I of n individuals with p variables.

The PCA of centered X is equivalent to the CA of the doubled table Y, with

$Y = \{ [(A + x_{ij})/2,\ (A - x_{ij})/2] \mid 1 \le i \le n,\ 1 \le j \le p \}$

where A is any positive real value.

$\lambda_{CA} = \lambda_{PCA} / (p A^2)$

Same representation of the rows on the factorial axes (with the factor $1/(A\sqrt{p})$ to pass from PCA to CA).

SLIDE 16

Terms of Y can be negative, so eigenvalues in the CA of Y can be greater than 1.

$A^2 \ge \bar{\lambda} \Rightarrow \lambda_{CA} \le 1$, where $\bar{\lambda}$ is the mean of the eigenvalues in the PCA of X.

Particular case (B. Escofier, CAD, 1979): if the variables of X are also standardized (variances equal to 1) and A = 1, the PCA of X, i.e. NPCA, is equivalent to the CA of Y (and here $A = \bar{\lambda} = 1$).
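A minimal numerical sketch of this equivalence, assuming random centered data and A = 2 (both are illustrative choices, as are the variable names). CA is computed through the SVD of the standardized residuals, which still works when some entries of Y are negative, as long as the margins are positive.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, A = 30, 3, 2.0
X = rng.normal(size=(n, p))
X -= X.mean(axis=0)                          # centered table

Y = np.hstack([(A + X) / 2, (A - X) / 2])    # doubled table (may contain negatives)

# CA of Y
P = Y / Y.sum()
r, c = P.sum(axis=1), P.sum(axis=0)
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, sv, Vt = np.linalg.svd(S, full_matrices=False)
lam_ca = sv[:p] ** 2
F_ca = (U[:, :p] * sv[:p]) / np.sqrt(r)[:, None]   # row coordinates in CA

# PCA of X (covariance with divisor n)
U_pca, s_pca, _ = np.linalg.svd(X / np.sqrt(n), full_matrices=False)
lam_pca = s_pca ** 2
F_pca = np.sqrt(n) * U_pca * s_pca                 # principal components

print(lam_ca, lam_pca / (p * A**2))
```

The CA eigenvalues equal the PCA eigenvalues divided by $pA^2$, and the row coordinates agree up to the sign of each axis and the factor $1/(A\sqrt{p})$.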

  

SLIDE 17

II.2 Analysis with respect to a model (Escofier, RSA, 1984)

  • Comparison of a frequency table $f_{IJ}$ with a reference table $m_{IJ}$:
    – Analyze the difference $f_{IJ} - m_{IJ}$ with the weightings given by the CA of $f_{IJ}$.
    – If $m_{IJ}$ has the same margins $f_I$ and $f_J$ as $f_{IJ}$, this comparison is equivalent to the CA of $f_{IJ} - m_{IJ} + f_I \otimes f_J$.
  • Particular cases:
    – Intraclass analysis:
      • I (or J) is partitioned into subsets: $I = \bigcup \{ I_k \mid k = 1, \dots, r \}$
    – Double intraclass analysis, or internal analysis:
      • I and J are both partitioned into subsets
    – Generalizations when partitions are replaced by graphs (Benali-Escofier, RSA, 1990; Cazes-Moreau, 1991)
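The equivalence stated for a reference table with the same margins follows from a short computation. In the sketch below, $g_{ij}$ is a name introduced here for the modified table; it is not notation from the slides.

```latex
% g_{ij}: hypothetical name for the table f_{IJ} - m_{IJ} + f_I \otimes f_J
g_{ij} = f_{ij} - m_{ij} + f_{i.} f_{.j}
% m has the same margins as f, and \sum_j f_{.j} = 1, so g keeps the margins of f:
\sum_j g_{ij} = f_{i.} - f_{i.} + f_{i.} = f_{i.}, \qquad
\sum_i g_{ij} = f_{.j} - f_{.j} + f_{.j} = f_{.j}
% hence the centered table analyzed by the CA of g is exactly the difference:
g_{ij} - g_{i.} g_{.j} = f_{ij} - m_{ij}
```

Since the CA of $g_{IJ}$ uses the margins $f_I$ and $f_J$ as weights and metric, it analyzes $f_{IJ} - m_{IJ}$ with the weightings of the CA of $f_{IJ}$.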

SLIDE 18

Correspondence analysis as a particular case of other methods

  • $k_{IJ}$: contingency table crossing 2 qualitative variables X and Y.
    – I and J: sets of modalities of X and Y respectively.
  • The CA of $k_{IJ}$ is a double factorial analysis:
    – factorial analysis of the row profiles and of the column profiles of $k_{IJ}$
  • CA is the canonical analysis of the two subspaces $W_X$ and $W_Y$ spanned by the indicator variables of the modalities of X and Y respectively.
    – Indeed, this point of view corresponds to the search for the optimal codings (in fact the factors), centered and standardized, of X and Y.

SLIDE 19

  • CA, as underlined by L. Lebart, can be considered as a double discriminant analysis:
    – In the first analysis, the variable to be explained is the qualitative variable Y and the explanatory variables are the indicator variables of X.
    – The second analysis is the same, exchanging X and Y.
  • CA also corresponds to Tucker's interbattery analysis (1958) of the tables $T_X$ and $T_Y$ associated with the indicator variables of X and Y respectively, with the diagonal weight metrics given by the margins of $k_{IJ}$ (or the margin rows, i.e. the column totals, of $T_X$ and $T_Y$).
  • Multiple correspondence analysis, or MCA (analysis of the complete disjunctive table associated with q qualitative variables $X_1, \dots, X_q$), is a particular case of Carroll's Generalized Canonical Analysis, where the associated subspaces are spanned by the indicator variables of $X_1, \dots, X_q$ respectively.

SLIDE 20

  • MCA is equivalent to the Multiple Factor Analysis (MFA, Escofier-Pagès, 1998) of the complete disjunctive table, each sub-table corresponding to the modalities of one of the variables $X_k$ ($1 \le k \le q$): the CA of each sub-table has all its eigenvalues equal to 1, so weighting each sub-table by the inverse square root of its largest eigenvalue (here 1) changes nothing.
  • The CA of a sub-table of the Burt table crossing two subsets of questions can be considered in many different ways (as multiple co-inertia analysis, Chessel, 1993).
  • Etc.

It is this ability of CA to be a particular case of numerous methods that gives it its great importance, from both the theoretical and the practical point of view.

SLIDE 21

Correspondence analysis and modelling techniques

  • I The reconstitution formula considered as a modelling technique

The exact data reconstitution formula of a frequency table $f_{IJ}$ from the margins $f_I$ and $f_J$ and the factors $\varphi_I^\alpha$, $\psi_J^\alpha$ (of variance 1) associated with the t non-null eigenvalues $\lambda_\alpha$ coming from CA is given by

$f_{ij} = f_{i.} f_{.j} \left( 1 + \sum_{\alpha=1}^{t} \lambda_\alpha^{1/2} \varphi_i^\alpha \psi_j^\alpha \right)$

If we keep the first r factors (r < t), we have the least squares approximation of order r, $f_{IJ}^*$, of $f_{IJ}$:

$f_{ij}^* = f_{i.} f_{.j} \left( 1 + \sum_{\alpha=1}^{r} \lambda_\alpha^{1/2} \varphi_i^\alpha \psi_j^\alpha \right)$

$f_{IJ} = f_{IJ}^* + e_{IJ}$

$\| e_{IJ} \|^2 = \sum \{ (e_{ij})^2 / (f_{i.} f_{.j}) \mid i \in I,\ j \in J \}$ is minimal.
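The reconstitution formula can be sketched in a few lines. This example assumes a small random contingency table (hypothetical data) and obtains the eigenvalues and the variance-1 factors from the SVD of the standardized residuals; `reconstitute` is a helper name introduced for the illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
K = rng.integers(1, 30, size=(6, 5)).astype(float)   # hypothetical counts
f = K / K.sum()
fi, fj = f.sum(axis=1), f.sum(axis=0)                # margins f_i. and f_.j

S = (f - np.outer(fi, fj)) / np.sqrt(np.outer(fi, fj))
U, sv, Vt = np.linalg.svd(S, full_matrices=False)
lam = sv ** 2                                        # eigenvalues lambda_alpha
phi = U / np.sqrt(fi)[:, None]                       # row factors, variance 1
psi = Vt.T / np.sqrt(fj)[:, None]                    # column factors, variance 1

def reconstitute(r):
    """Order-r reconstitution f* = f_i. f_.j (1 + sum of the first r terms)."""
    interaction = (phi[:, :r] * sv[:r]) @ psi[:, :r].T
    return np.outer(fi, fj) * (1.0 + interaction)

t = min(f.shape) - 1          # number of non-null eigenvalues for this table
f_full = reconstitute(t)      # exact reconstitution

r = 2
e = f - reconstitute(r)       # residual table e_IJ
err = (e ** 2 / np.outer(fi, fj)).sum()
print(err, lam[r:].sum())
```

With all t factors the table is recovered exactly. With r factors, the weighted squared error equals the sum of the discarded eigenvalues, which is the least squares property stated above.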

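SLIDE 22

Near independence, a first-order expansion ($\log(1 + x) \approx x$) gives approximately, with $\lambda_\alpha$ the eigenvalues and $\varphi^\alpha$, $\psi^\alpha$ the factors of CA:

$\log(f_{ij}^*) = \log(f_{i.}) + \log(f_{.j}) + \sum_{\alpha=1}^{r} \lambda_\alpha^{1/2} \varphi_i^\alpha \psi_j^\alpha$

This corresponds to the log-linear model

$\log(f_{ij}^*) = \mu + a_i + b_j + \gamma_{ij}$

with $\mu = 0$, $a_i = \log(f_{i.})$, $b_j = \log(f_{.j})$, and where the interaction term $\gamma_{ij}$ is of the form

$\gamma_{ij} = \sum_{\alpha=1}^{r} \lambda_\alpha^{1/2} \varphi_i^\alpha \psi_j^\alpha$

a term that reduces, if r = 1, to a multiplicative interaction term $\lambda_1^{1/2} \varphi_i^1 \psi_j^1$.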

SLIDE 23

  • Note that the absence of interaction corresponds to independence, so the log-linear model is of little interest with only 2 variables unless the interaction term is modelled.
  • First example (r = 1) of modelling suggested by CA:
    – If a single factor seems sufficient to explain the data, if the modalities of one or both of the variables crossed to obtain the table $f_{IJ}$ are ordered, and if this order is respected (or nearly so) on the first factorial axis, we can hypothesize constant spacing between adjacent modalities on this axis.
    – We can also assume equal values for two close modalities on this axis (and therefore in space), which amounts to pooling these modalities and therefore to summing the two associated rows or columns.
    – One thus obtains a more sophisticated model (than the initial CA model) whose parameters are estimated from CA under constraints, which is equivalent to a least squares fit.

SLIDE 24

    – An example of such an approach is given by Goodman (1985), who studies the link between mental state (4 modalities) and the socio-economic status of the parents (6 modalities), the two variables being measured on a sample of 1600 individuals.
    – In this modelling, Goodman estimates the parameters using either the least squares method (which corresponds to CA) or the maximum likelihood method, and he proposes some tests to validate the proposed model.

SLIDE 25

  • Second modelling example, where r = 2 (Worsley, 1987)
    – The table crosses a set J of 9 suicide methods with a set I, the product of sex by age cut into 17 classes, i.e. 34 modalities in total.
    – The first factorial axis (52% of the inertia) opposes men and women, while the second axis (38% of the inertia) seems to show a linear effect of age for each sex.
    – The log-linear model deduced from the reconstitution formula with r = 2 can be put in the following form, taking the preceding observations into account:

$\log(f_{ij}^*) = \mu + a_i + b_j + x_{1i} u_{1j} + x_{2i} u_{2j}$

where $x_{1i} = 0$ for a woman and 1 for a man, while $x_{2i} = k$ for a person (man or woman) in age class k ($1 \le k \le 17$).

SLIDE 26

  • II Correspondence analysis as an intermediate step in a modelling problem
    – Like any factorial analysis technique, CA acts as a data compression method before the modelling phase. In particular, this makes it possible to take the multicollinearity of the data into account.
    – Furthermore, using CA and a model together gives two complementary views of the data, which can help refine the modelling.
    – We will only quote three examples of the joint use of CA with other statistical methods in a modelling problem:

SLIDE 27

First example: mixture of distributions

$Q = \sum \{ \beta_m P_m \mid m = 1, \dots, s \}$

  • Q: probability law, estimated for instance by a histogram Y
  • $P_m$ (m = 1, …, s): known probability laws (for instance, Poisson laws with fixed parameters)
  • $\beta_m$: unknown proportions to estimate, with $\beta_m \ge 0$ and $\sum \{ \beta_m \mid m = 1, \dots, s \} = 1$   (1)
  • s: known

$E(y_i) = \sum \{ \beta_m P_{mi} \mid m = 1, \dots, s \}$

  • $y_i$: frequency of class i of histogram Y
  • $P_{mi}$ = Prob(class i of histogram Y under law $P_m$)

SLIDE 28

With

  • Y: vector of the $y_i$ (frequency of class i of the histogram of Y)
  • X: matrix of the $P_{mi}$ (Prob(class i of histogram Y under law $P_m$))
  • $\beta$: vector of the $\beta_m$

we have

$Y = X\beta + \varepsilon$

with the constraint (1) that the coefficients $\beta_m$ are positive or equal to zero and that their sum is equal to 1.

This problem is very badly conditioned, and regression (with constraints (1)) does not always give good results. Hence:

  • CA of X (compression)
  • Regression (with constraints (1)) on the first factors of the CA of X
  • Calculation of $\beta$ by the transition formula (Cazes, CAD, 1978).
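To make the constrained regression concrete, here is a minimal sketch of the direct problem only: least squares under constraint (1), solved by projected gradient on the simplex. It does not implement the CA compression and transition formula of [Caz78]. The Poisson parameters, the proportions and the noise-free histogram are assumptions chosen so that the problem is well conditioned.

```python
import numpy as np
from math import lgamma

def poisson_pmf(lam, k_max):
    """Poisson probabilities for the classes 0..k_max."""
    k = np.arange(k_max + 1)
    log_fact = np.array([lgamma(i + 1.0) for i in k])
    return np.exp(k * np.log(lam) - lam - log_fact)

def project_simplex(v):
    """Euclidean projection of v onto {b : b >= 0, sum(b) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / idx > 0)[0][-1]
    theta = css[rho] / (rho + 1)
    return np.maximum(v - theta, 0.0)

params = [2.0, 8.0, 16.0]                      # hypothetical fixed Poisson parameters
X = np.column_stack([poisson_pmf(l, 40) for l in params])   # X[i, m] = P_mi
beta_true = np.array([0.5, 0.3, 0.2])          # hypothetical proportions
y = X @ beta_true                              # noise-free expected histogram

step = 1.0 / np.linalg.norm(X, 2) ** 2         # 1 / Lipschitz constant of the gradient
beta = np.full(len(params), 1.0 / len(params))
for _ in range(5000):
    grad = X.T @ (X @ beta - y)                # gradient of 0.5 * ||X beta - y||^2
    beta = project_simplex(beta - step * grad)

print(beta)
```

When the laws $P_m$ are close to each other, the columns of X become nearly collinear and this direct fit becomes unstable, which is the situation that motivates the compression by CA described above.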

SLIDE 29

Second example: Scoring

  • Classic problem:
    – Predict the class of an individual (good or bad payer, alive or deceased, etc.) as a function of explanatory variables (qualitative and quantitative).
    – Or explain a variable Y with 2 modalities.
  • Numerous methods:
    – Disqual method (Saporta, 1977)
    – Barycentric Discriminant Analysis (Nakache et al., CAD, 1977)
    – Other methods: logistic regression, regression trees, etc.

SLIDE 30

  • Disqual method
    – Cut the numerical explanatory variables into classes.
    – Build the complete disjunctive table $k_{IJ}$ of all the explanatory variables.
    – CA of $k_{IJ}$ and visualization of the explanatory modalities and of the modalities to explain.
    – Discriminant factorial analysis on the factors coming from this CA, selected among the factors accounting for a fixed percentage of inertia (80 or 90%) and which are discriminant.
    – The score of an individual i of I is the abscissa of the projection of i on the discriminant factorial axis.

SLIDE 31

  • Variant: Barycentric Discriminant Analysis (BDA)
    – Cut the numerical explanatory variables into classes.
    – Build the complete disjunctive table $k_{IJ}$ of all the explanatory variables.
    – Build the table $k_{CJ}$ crossing the set C of the two modalities of the variable Y to be explained with the set J of all the explanatory modalities.
    – CA of $k_{CJ}$ (with $k_{IJ}$ as supplementary rows) and visualization of the explanatory modalities and of the modalities to explain.
    – Projection of I on the single axis of this analysis, followed by a classic discrimination.
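The BDA steps can be sketched as follows, assuming synthetic qualitative predictors (the data generation, the variable names and the nearest-class assignment rule are illustrative choices, not taken from the slides).

```python
import numpy as np

rng = np.random.default_rng(4)
n, q, levels = 200, 3, 3
y = rng.integers(0, 2, size=n)                 # variable to explain (2 modalities)
# Hypothetical predictors already cut into classes, shifted by class to carry signal
shift = y[:, None] * rng.integers(0, 2, size=(n, q))
codes = np.clip(rng.integers(0, levels, size=(n, q)) + shift, 0, levels - 1)

# Complete disjunctive table k_IJ (n rows, q * levels columns)
k_IJ = np.hstack([np.eye(levels)[codes[:, v]] for v in range(q)])
# Table k_CJ crossing the 2 classes with all explanatory modalities
k_CJ = np.vstack([k_IJ[y == cl].sum(axis=0) for cl in (0, 1)])

# CA of k_CJ: a 2-row table has a single non-trivial axis
P = k_CJ / k_CJ.sum()
r, c = P.sum(axis=1), P.sum(axis=0)
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, sv, Vt = np.linalg.svd(S, full_matrices=False)
F_class = U[:, 0] * sv[0] / np.sqrt(r)         # coordinates of the 2 classes
G = Vt[0] * sv[0] / np.sqrt(c)                 # coordinates of the modalities

# Individuals as supplementary rows, through the transition formula
profiles = k_IJ / k_IJ.sum(axis=1, keepdims=True)
scores = profiles @ G / sv[0]

# Simple discrimination rule: assign each individual to the nearer class point
pred = (np.abs(scores - F_class[1]) < np.abs(scores - F_class[0])).astype(int)
print((pred == y).mean())
```

Each class point is the barycenter of the scores of its individuals, which is where the name of the method comes from.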

SLIDE 32

  • Comparison between DISQUAL and BDA
    – If we keep all the factors in the DISQUAL method, we obtain the same discriminant axis in the space $R^J$ of the explanatory modalities as in BDA, i.e. the axis joining the barycenters of the two modalities to be predicted.
    – On the other hand, the discriminant scores are different:
      • in the first case, we project the individuals with the inertia metric, which amounts to working (with the usual metric) on the factors on J of variance 1;
      • in the second case, we use the chi-square metric coming from the CA of $k_{CJ}$ (or $k_{IJ}$), which amounts to working (with the usual metric) on the usual factors (of the CA of $k_{IJ}$), i.e. the factors whose variance is the eigenvalue.

SLIDE 33

  • Remarks
    – Finally, note that instead of the discriminant factorial analysis we can also run a logistic regression on the factors coming from the CA of the complete disjunctive table $k_{IJ}$.
    – Barycentric Discriminant Analysis (BDA), like DISQUAL, can easily be generalized to the case where the variable to explain Y has more than two modalities.

SLIDE 34

Third example (Morineau et al., 1994)

  • Problem: forecast the sales of periodicals of a certain number of wholesalers with better accuracy than that provided by the experts.
  • Data: table $k_{IT}$ crossing a set I of 1577 wholesalers with a set T of 157 weeks, whose general term is the total sales k(i, t) of wholesaler i in week t.

SLIDE 35

Solution

1) CA of the table of row profiles of $k_{IT}$, followed by an ascending hierarchical classification, defines classes of wholesalers ($I = \bigcup \{ I_c \mid c \in C \}$).

2) For each class c, the PCA of the corresponding sub-table $k_{I_c T}$ and the data reconstitution formula give (2 factors are sufficient), for $i \in I_c$:

$k(i, t) = m_c(t) + u_{c1}(t) F_{c1}(i) + u_{c2}(t) F_{c2}(i)$

where $m_c(t)$ is the mean of k(i, t) in class c, and $F_{c1}(i)$ and $F_{c2}(i)$ are the coordinates of i on the first 2 factorial axes.

3) Modelling of $m_c(t)$, $u_{c1}(t)$ and $u_{c2}(t)$ by a SARIMA process.
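Step 2 can be sketched for a single class as follows. This assumes synthetic sales built to have exactly two factors around the class mean, and an unweighted, unnormed PCA computed by SVD; the slides do not say which PCA variant was used, and the class size is made up.

```python
import numpy as np

rng = np.random.default_rng(5)
n_c, T = 40, 157                     # hypothetical class size; 157 weeks as in the data
weeks = np.arange(T)
m_true = 100 + 10 * np.sin(2 * np.pi * weeks / 52)   # synthetic seasonal mean profile
# Synthetic sub-table for one class c: mean profile plus exactly 2 factors
k = m_true + rng.normal(size=(n_c, 2)) @ rng.normal(size=(2, T))

m_c = k.mean(axis=0)                 # m_c(t): mean of k(i, t) in the class
U, sv, Vt = np.linalg.svd(k - m_c, full_matrices=False)
F = U[:, :2] * sv[:2]                # F_c1(i), F_c2(i): coordinates on the first 2 axes
u = Vt[:2]                           # u_c1(t), u_c2(t): axis vectors indexed by week

k_hat = m_c + F @ u                  # reconstitution with 2 factors
print(np.abs(k - k_hat).max())
```

In this sketch the reconstitution is exact because the data have rank 2 around the mean. On real sales it would be an approximation, and the three series $m_c(t)$, $u_{c1}(t)$ and $u_{c2}(t)$ would then be passed to the time series model of step 3.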

SLIDE 36

Correspondence analysis and data analysis since 2000

  • Links between CA, data analysis and data mining (large databases, complex data, etc.).
  • Connection between the data analysis and statistical learning communities.
    – Symposium SLDS (Statistical Learning and Data Sciences) at Paris-Dauphine University in April 2009.
    – The proceedings were published in a special issue (number 42, summer 2010) of the MODULAD review.
  • Extensions and interesting methodologies based on factorial analysis have been developed in the framework of sensory analysis, which has gained great importance in recent years.
  • Important use of PLS regression (for instance on near-infrared spectroscopy data, in quality control).
  • Development of the PLS approach
  • Etc.

SLIDE 37

  • Remark
    – Factorial methods and data analysis are not used enough in France,
    – but they could be of great help in numerous cases to users whose databases grow larger and more complex every day.

SLIDE 38

Bibliography

  • [Bas80] BASTIN, Ch., BENZECRI, J.P., BOURGARIT, Ch., CAZES, P. (1980): Pratique de l’analyse des données, Tome 2 : Abrégé théorique. Etude de cas modèles, Dunod, 477 pages.
  • [Ben90] BENALI, H., ESCOFIER, B. (1990): Analyse factorielle lissée et analyse des différences locales, RSA, Vol. 38, n° 2, pp. 55-76.
  • [Ben73] BENZECRI, J.P. (1973): L’analyse des données, Tome 1 : La taxinomie, 627 pages ; Tome 2 : L’analyse des correspondances, Dunod, 619 pages.
  • [Ben80] BENZECRI, J.P., BENZECRI, F. (1980): Pratique de l’analyse des données, Tome 1 : Analyse des correspondances. Exposé élémentaire, Dunod, 432 pages.
  • [Ben81] BENZECRI, J.P. (1981): Pratique de l’analyse des données, Tome 3 : Linguistique & lexicologie, Dunod, 575 pages.
  • [Ben82] BENZECRI, J.P. (1982): Histoire et préhistoire de l’analyse des données, Dunod, 159 pages.
  • [Ben86] BENZECRI, J.P., BENZECRI, F. (1986): Pratique de l’analyse des données, Tome 5 : Economie, Dunod, 543 pages.

SLIDE 39

  • [Ben92a] BENZECRI, J.P. (1992): Correspondence Analysis Handbook, Dekker, 678 pages.
  • [Ben92b] BENZECRI, J.P., BENZECRI, F., MAITI, G.D. (1992): Pratique de l’analyse des données, Tome 4 : Médecine, pharmacologie, physiologie clinique, Statmatic, 542 pages.
  • [Caz78] CAZES, P. (1978): Estimation de la statistique de multiplication du premier étage d’un photomultiplicateur à dynodes, [Photomultiplicateur], CAD, Vol. 3, n° 4, pp. 393-417.
  • [Caz91] CAZES, P., MOREAU, J. (1991): Analysis of a contingency table in which the rows and columns have a graph structure, in Symbolic-Numeric Data Analysis and Learning, Eds.
  • [Che93] CHESSEL, D., MERCIER, P. (1993): Couplage de triplets statistiques et liaisons espèces – environnement, Eds. LEBRETON, J.D., ASSELAIN, B., Masson, Paris, pp. 15-44.
  • [Esc79] ESCOFIER, B. (1979): Traitement simultané de variables qualitatives et quantitatives en analyse factorielle, [Qualitatives et Quantitatives], CAD, Vol. 4, n° 2, pp. 137-146.
  • [Esc84] ESCOFIER, B. (1984): Analyse factorielle en référence à un modèle. Application à l’analyse de tableaux d’échanges, RSA, Vol. 32, n° 4, pp. 25-36.
  • [Esc98] ESCOFIER, B., PAGES, J. (1998): Analyses factorielles simples et multiples. Objectifs, méthodes et interprétation, 3ème éd., Dunod, 300 pages.

SLIDE 40

  • [Goo85] GOODMAN, L.A. (1985): Correspondence analysis models, log-linear models, and log-bilinear models for the analysis of contingency tables, Proceedings of the 45th session of ISI, Amsterdam.
  • [Mor94] MORINEAU, A., SAMMARTINO, A.E., GETTLER-SUMMA, M., PARDOUX, C. (1994): Analyse des données et modélisation des séries temporelles. Application à la vente de périodiques, RSA, Vol. 42, n° 4, pp. 61-81.
  • [Nak77] NAKACHE, J.P., LORENTE, P., BENZECRI, J.P., CHASTANG, J.F. (1977): Aspects pronostiques et thérapeutiques de l’infarctus myocardique aigu compliqué d’une défaillance sévère de la pompe cardiaque. Application des méthodes de discrimination, [Aorte], CAD, Vol. 2, n° 4, pp. 415-434.
  • [Sap77] SAPORTA, G. (1977): Une méthode et un programme d’analyse discriminante sur variables qualitatives, Premières Journées Internationales, Analyse des Données et Informatique, INRIA, Versailles.
  • [Tuc58] TUCKER, L.R. (1958): An inter-battery method of factor analysis, Psychometrika, Vol. 23, n° 2, pp. 111-136.
  • [Wor87] WORSLEY (1987): Un exemple d’identification d’un modèle log-linéaire grâce à une analyse des correspondances, RSA, Vol. 35, n° 3, pp. 13-20.
  • [Yag77] YAGOLNITZER (1977): Comparaison de deux correspondances entre les mêmes ensembles, [Compar. Corr.], CAD, Vol. 2, n° 3, pp. 251-264.

SLIDE 42


THANK YOU