Some Comments on Correspondence Analysis.
- P. CAZES
CEREMADE, University Paris Dauphine
Paris 6 in the seventies
case of other methods.
techniques
since 2000
demonstrated after having been observed on the computer listings, as in experiments in physics.
the error computations in physics
table to be analyzed and the succession of analyses (descriptive, explanatory or decision-oriented) to be carried out to treat the data. This problem is analogous to the setup of an experiment in physics.
sets I and J: the moments of inertia of the factorial axes of the clouds N_I and N_J, associated respectively with I and J, are equal, a result which is now standard (B. Escofier, PhD, 1965)
total inertia equal to 1
equivalent to the normed PCA (NPCA) of the undoubled (initial) table (Benzécri, J.P. Pagès, Bara: PhD, Serums data, 1971).
– Furthermore, $\lambda_{CA} = \lambda_{NPCA} / p$, where p is the number of variables (columns) of the initial table.
– We then recover: inertia in CA = inertia in NPCA / p = p / p = 1.
– Same representation of the rows on the factorial axes (with the factor $1/\sqrt{p}$ to pass from NPCA to CA).
Responsible for the Master 2 (M2) in statistics (research master), with 150 to 200 students (the "largest" M2 in France)
each Monday in May and June)
– Biology – Ecology – Economy – Geology – Linguistics – Medicine – Physics – Psychology – Sociology
– French, of course, but also African, Argentinean, Greek, Egyptian, Iranian, Irish, Lebanese, Syrian, Turkish, Vietnamese, etc.
–Great discussions –numerous ideas, –exceptional impact of the laboratory :
Données (CAD), which was digitized by NUMDAM at the end of 2010 but is not yet online
Data Analysis: L'analyse des données :
– Tome 1 : la taxinomie
– Tome 2 : l'analyse des correspondances
préhistoire de l'analyse des données
"Pratique de l'Analyse des Données"
– Tome 1 : Analyse des correspondances. Exposé élémentaire, 1980 (translated into English by Gopalan in 1992)
– Tome 2 : Abrégé théorique. Etude de cas modèles, 1980
– Tome 3 : Linguistique & lexicologie, 1981
– Tome 4 : Médecine, pharmacologie, physiologie clinique, 1992
– Tome 5 : Economie, 1986
1970:
– Besançon – Marseille – Nice, – Rennes – l‟Arbresle near Lyon – etc…
1, etc…)
– barycentric coding with 3 or r modalities of a quantitative variable
– coding that removes the subject's personal equation when the subjects give a certain number of ratings (the coding differs from one individual to another)
I Usual coding
(Leontief table, import-export table) Example:
– k(i, j): total of the imports from country i toward country j
– Do CA of the table (kIJ, kJI), the juxtaposition of the table kIJ and its transpose kJI
– This puts on a single row i all the exchanges of country i with the countries j (imports and exports).
– Yagolnitzer [CAD, 1977] suggested doing CA of the following table:
\[ \begin{pmatrix} k_{IJ} & k_{JI} \\ k_{JI} & k_{IJ} \end{pmatrix} \]
– Yagolnitzer's analysis is equivalent to doing
the weightings (weights and metric) given by the CA of the table (kIJ + kJI)/2.
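The two codings of a flow table described above can be sketched in a few lines of plain Python. This is only an illustration: the 3-country table `k` and the helper names `transpose`, `juxtapose` and `yagolnitzer_table` are hypothetical and do not come from the talk.

```python
def transpose(table):
    """Return the transpose kJI of a square flow table kIJ."""
    return [list(col) for col in zip(*table)]


def juxtapose(left, right):
    """Put two tables with the same rows side by side."""
    return [row_l + row_r for row_l, row_r in zip(left, right)]


def yagolnitzer_table(k):
    """Build the block table [[kIJ, kJI], [kJI, kIJ]]."""
    kt = transpose(k)
    return juxtapose(k, kt) + juxtapose(kt, k)


# Hypothetical flows between 3 countries: k[i][j] is the flow from i to j.
k = [[0, 12, 5],
     [7, 0, 3],
     [4, 9, 0]]
kt = transpose(k)

# Juxtaposition (kIJ, kJI): row i holds everything sent and received by i.
side_by_side = juxtapose(k, kt)

# Block table suggested by Yagolnitzer, and the symmetrized table (kIJ + kJI)/2.
block = yagolnitzer_table(k)
symmetrized = [[(k[i][j] + kt[i][j]) / 2 for j in range(3)] for i in range(3)]

print(side_by_side[0])
print(len(block), len(block[0]))
```

Either table can then be passed to any CA routine; only the construction of the input table is shown here.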
– use of the supplementary (passive or illustrative) elements
ternary table
scoring
– Etc…
II Coding allowing obtaining the equivalence with
II.1 Case of Principal Component Analysis (PCA).
$X = \{x_{ij} \mid 1 \le i \le n,\ 1 \le j \le p\}$ crosses a set I of n individuals with p variables.
PCA of the centered X is equivalent to CA of the doubled table Y, with
$Y = \{[(A + x_{ij})/2,\ (A - x_{ij})/2] \mid 1 \le i \le n,\ 1 \le j \le p\}$
where A is any positive real value.
$\lambda_{CA} = \lambda_{PCA} / (p A^2)$
Same representation of the rows on the factorial axes (with the factor $1/(A\sqrt{p})$ to pass from PCA to CA).
The terms of Y can be negative, so the eigenvalues in CA of Y can be greater than 1:
$\bar{\lambda} \le A^2 \Rightarrow \lambda_{CA} \le 1$
where $\bar{\lambda}$ is the mean of the eigenvalues in PCA of X (the total inertia in CA of Y is $\bar{\lambda}/A^2$).
Particular case (B. Escofier, CAD, 1979): if the variables of X are also reduced (variances equal to 1) and A = 1, PCA of X, i.e. NPCA, is equivalent to CA of Y (and here $A = \bar{\lambda} = 1$).
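The equivalence between PCA of a centered table and CA of its doubled table can be checked numerically without any eigen-decomposition, by comparing distances and total inertia. The sketch below assumes uniform weights 1/n on the individuals and the identity metric in PCA; the data `x`, the value `a = 3.0` and all function names are made up for the illustration.

```python
def center(x):
    """Subtract the column means (every individual has weight 1/n)."""
    n = len(x)
    means = [sum(col) / n for col in zip(*x)]
    return [[v - m for v, m in zip(row, means)] for row in x]


def doubled_table(xc, a):
    """Replace each centered value x by the pair ((a + x)/2, (a - x)/2)."""
    return [[half for v in row for half in ((a + v) / 2, (a - v) / 2)]
            for row in xc]


def ca_frequencies(table):
    """Return the frequency table and its row and column margins."""
    total = sum(map(sum, table))
    f = [[v / total for v in row] for row in table]
    return f, [sum(row) for row in f], [sum(col) for col in zip(*f)]


def ca_total_inertia(table):
    """Total inertia of CA: sum of (f_ij - f_i f_j)^2 / (f_i f_j)."""
    f, fi, fj = ca_frequencies(table)
    return sum((f[i][j] - fi[i] * fj[j]) ** 2 / (fi[i] * fj[j])
               for i in range(len(fi)) for j in range(len(fj)))


def chi2_row_distance(table, i, k):
    """Squared chi-square distance between the profiles of rows i and k."""
    f, fi, fj = ca_frequencies(table)
    return sum((f[i][j] / fi[i] - f[k][j] / fi[k]) ** 2 / fj[j]
               for j in range(len(fj)))


x = [[1.0, 10.0], [2.0, 14.0], [4.0, 9.0], [5.0, 15.0]]  # hypothetical data
a = 3.0                                                   # arbitrary A > 0
n, p = len(x), len(x[0])
xc = center(x)
y = doubled_table(xc, a)

# Mean of the PCA eigenvalues = mean of the variances of the columns of X.
mean_variance = sum(v * v for row in xc for v in row) / (n * p)
inertia_ca = ca_total_inertia(y)

# Squared distances between individuals 0 and 1 in PCA and in CA of Y.
d2_pca = sum((u - v) ** 2 for u, v in zip(xc[0], xc[1]))
d2_ca = chi2_row_distance(y, 0, 1)

print(inertia_ca, mean_variance / a ** 2)
print(d2_ca, d2_pca / (p * a ** 2))
```

Since all squared distances between rows are divided by p A², the eigenvalues are divided by p A² and the row coordinates by A√p, which is the scaling stated above.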
II.2 Analysis with respect to a model (Escofier, RSA 1984)
table mIJ:
– Analyze the difference fIJ − mIJ with the weightings given by CA of fIJ.
– If mIJ has the same margins fI and fJ as fIJ, the preceding comparison is equivalent to CA of fIJ − mIJ + fI fJ
– Intraclass analysis :
– Double Intraclass analysis or internal analysis:
– Generalizations when partitions are replaced by graphs (Benali- Escofier, RSA, 1990; Cazes - Moreau 1991)
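The statement that, for a model with the same margins as fIJ, analysing fIJ − mIJ amounts to CA of fIJ − mIJ + fI fJ can be illustrated numerically. In this sketch the contingency table `counts` and the model `m` (a half-and-half mixture of the data and the independence table, chosen only because it has the right margins) are hypothetical.

```python
def margins(f):
    """Row and column margins of a frequency table."""
    return [sum(row) for row in f], [sum(col) for col in zip(*f)]


counts = [[30, 10, 10],
          [10, 20, 20]]                      # hypothetical contingency table
total = sum(map(sum, counts))
f = [[v / total for v in row] for row in counts]
fi, fj = margins(f)
rows, cols = range(len(fi)), range(len(fj))

indep = [[fi[i] * fj[j] for j in cols] for i in rows]

# Hypothetical model with the same margins as f.
m = [[0.5 * f[i][j] + 0.5 * indep[i][j] for j in cols] for i in rows]

# Table whose CA is claimed to be equivalent to the analysis of f - m.
g = [[f[i][j] - m[i][j] + indep[i][j] for j in cols] for i in rows]
gi, gj = margins(g)

# Inertia of f - m with the weights of CA of f ...
weighted = sum((f[i][j] - m[i][j]) ** 2 / (fi[i] * fj[j])
               for i in rows for j in cols)
# ... and total inertia of the ordinary CA of g.
inertia_g = sum((g[i][j] - gi[i] * gj[j]) ** 2 / (gi[i] * gj[j])
                for i in rows for j in cols)

print(weighted, inertia_g)
```

Because g keeps the margins of f, the CA of g uses the same weights and metric as the CA of f, and its deviations from independence are exactly f − m.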
and Y.
– I and J : sets of modalities of X and Y respectively.
– factorial analysis of profile lines and profile columns of kIJ
and WY respectively spanned by the indicator variables
– Indeed, this way of thinking corresponds to the research of the
and Y.
double discriminant analysis:
– In the first analysis, the variable to be explained is the qualitative variable Y and the explanatory variables are the indicator variables of X
– In the second analysis, it is the same, exchanging X and Y.
(1958) of the tables TX and TY respectively associated with the indicator variables of X and Y, with the diagonal weight metrics given by the margins of kIJ (or the row margins of TX and TY).
complete disjunctive table associated with q qualitative variables X1, …, Xq) is a particular case of the Generalized Canonical Analysis of Carroll, where the associated subspaces are respectively spanned by the indicator variables of X1, …, Xq
Escofier – Pagès, 1998) of the complete disjunctive table, each sub-table corresponding to the modalities of one of the variables Xk (1 ≤ k ≤ q), since CA of each sub-table has all its eigenvalues equal to 1 and therefore the weighting of each sub-table by the inverse square root
anything.
questions can be considered in many different ways as multiple co-inertia analysis, Chessel, 1993).
It is this ability of CA to be a particular case of numerous methods that gives it its great importance, from a theoretical as well as a practical point of view.
a modelling technique
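Exact data reconstitution formula of a frequency table $f_{IJ}$ from the margins $f_I$ and $f_J$ and the (unit-variance) factors $\varphi_\alpha^I$, $\varphi_\alpha^J$ associated with the t non-null eigenvalues $\lambda_\alpha$ coming from CA:
\[ f_{ij} = f_{i.}\, f_{.j} \left( 1 + \sum_{\alpha=1}^{t} \lambda_\alpha^{1/2}\, \varphi_\alpha^i\, \varphi_\alpha^j \right) \]
If we keep the first r factors (r < t), we have the least squares approximation of order r, $f_{IJ}^*$, of $f_{IJ}$:
\[ f_{ij}^* = f_{i.}\, f_{.j} \left( 1 + \sum_{\alpha=1}^{r} \lambda_\alpha^{1/2}\, \varphi_\alpha^i\, \varphi_\alpha^j \right) \]
\[ f_{IJ} = f_{IJ}^* + e_{IJ} \]
\[ \| e_{IJ} \|^2 = \sum_{i \in I,\ j \in J} \frac{e_{ij}^2}{f_{i.}\, f_{.j}} \quad \text{minimal} \]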
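As an illustration of the reconstitution formula with r = 1, the following sketch extracts the first CA axis by power iteration on the standardized residuals and rebuilds the table from the margins and that single factor. It is a minimal pure-Python sketch, not the author's code: the table `counts`, the function name `ca_first_axis` and the use of power iteration instead of a full eigen-decomposition are assumptions, and `phi_i`, `phi_j` denote factors normalized to unit variance.

```python
import math


def ca_first_axis(counts, n_iter=500):
    """First eigenvalue and unit-variance factors of the CA of `counts`."""
    total = sum(map(sum, counts))
    f = [[v / total for v in row] for row in counts]
    fi = [sum(row) for row in f]
    fj = [sum(col) for col in zip(*f)]
    rows, cols = range(len(fi)), range(len(fj))
    # Standardized residuals s_ij = (f_ij - f_i f_j) / sqrt(f_i f_j).
    s = [[(f[i][j] - fi[i] * fj[j]) / math.sqrt(fi[i] * fj[j]) for j in cols]
         for i in rows]
    v = [float(j + 1) for j in cols]          # arbitrary starting vector
    for _ in range(n_iter):                   # power iteration on S'S
        u = [sum(s[i][j] * v[j] for j in cols) for i in rows]
        w = [sum(s[i][j] * u[i] for i in rows) for j in cols]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    u = [sum(s[i][j] * v[j] for j in cols) for i in rows]
    sigma = math.sqrt(sum(c * c for c in u))  # square root of the eigenvalue
    phi_i = [u[i] / (sigma * math.sqrt(fi[i])) for i in rows]
    phi_j = [v[j] / math.sqrt(fj[j]) for j in cols]
    total_inertia = sum(c * c for row in s for c in row)
    return f, fi, fj, sigma ** 2, phi_i, phi_j, total_inertia


counts = [[20, 5, 2],
          [6, 15, 6],
          [2, 7, 20]]                         # hypothetical contingency table
f, fi, fj, lam, phi_i, phi_j, total_inertia = ca_first_axis(counts)
rows, cols = range(len(fi)), range(len(fj))

# Order-1 reconstitution: f*_ij = f_i f_j (1 + sqrt(lambda_1) phi_i phi_j).
f_star = [[fi[i] * fj[j] * (1 + math.sqrt(lam) * phi_i[i] * phi_j[j])
           for j in cols] for i in rows]

# Squared norm of the residual e = f - f* in the chi-square metric.
residual = sum((f[i][j] - f_star[i][j]) ** 2 / (fi[i] * fj[j])
               for i in rows for j in cols)

print("share of inertia on axis 1:", lam / total_inertia)
print("residual:", residual, "remaining inertia:", total_inertia - lam)
```

The residual norm equals the inertia left on the discarded axes, and the reconstituted table keeps the margins of the original one.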
Near independence, a first-order expansion ($\log(1 + x) \approx x$) gives approximately:
\[ \log f_{ij}^* = \log f_{i.} + \log f_{.j} + \sum_{\alpha=1}^{r} \lambda_\alpha^{1/2}\, \varphi_\alpha^i\, \varphi_\alpha^j \]
This corresponds to the log-linear model
\[ \log f_{ij}^* = \mu + \mu_i + \mu_j + \mu_{ij} \]
with $\mu = 0$, $\mu_i = \log f_{i.}$, $\mu_j = \log f_{.j}$, and where the interaction term $\mu_{ij}$ is of the form
\[ \mu_{ij} = \sum_{\alpha=1}^{r} \lambda_\alpha^{1/2}\, \varphi_\alpha^i\, \varphi_\alpha^j \]
a term that reduces, if r = 1, to the multiplicative interaction term $\lambda_1^{1/2}\, \varphi_1^i\, \varphi_1^j$.
to the independence, so that the log-linear model is of little interest when we have only 2 variables, unless the interaction term is modelled.
– If only one factor seems sufficient to explain the data, if the modalities of one or both of the variables crossed to obtain the table fIJ are ordered, and if this order is respected (or nearly respected) on the first factorial axis, we can assume constant spacing between adjacent modalities on this axis.
– We can also assume equal values for two neighbouring modalities on this axis (and therefore in space), which is equivalent to pooling these modalities and therefore to summing the two corresponding rows or columns
initial model of CA) where the parameters are estimated from CA under constraints, which is equivalent to a fit using the least squares method.
– An example of such an approach is given by Goodman (1985) to study the link between mental state (4 modalities) and the socio-economic status of the parents (6 modalities), these two variables being measured on a sample of 1600 individuals.
– In this modelling, Goodman estimates the parameters either by the least squares method (which corresponds to CA) or by the maximum likelihood method, and he proposes some tests to validate the proposed model.
– The used table crosses a set J of 9 suicide modes with a set I product
– The first factorial axis (52% of the inertia) opposes men and women, while the second axis (38% of the inertia) seems to show, for each sex, a linear effect of age.
– Taking these observations into account, the log-linear model deduced from the reconstitution formula with r = 2 can be put in the following form:
\[ \log f_{ij}^* = \mu + \mu_i + \mu_j + x_{1i}\, u_{1j} + x_{2i}\, u_{2j} \]
where $x_{1i} = 0$ for a woman and 1 for a man, while $x_{2i} = k$ for a person (man or woman) of age k ($1 \le k \le 17$).
step in a modelling problem
– Like any factorial analysis technique, CA acts as a data compression method before the modelling phase. In particular, this makes it possible to take the multicollinearity of the data into account.
– Furthermore, the simultaneous use of CA and a model gives two complementary views of the data, which can help refine the modelling.
– We will only quote three examples of the joint use of CA with other statistical methods in a modelling problem:
First example: mixture of distributions
\[ Q = \sum_{m=1}^{s} \beta_m P_m \]
– Q: probability law, estimated for instance by a histogram Y
– $P_m$ (m = 1, …, s): known probability laws (for instance, Poisson laws with fixed parameters)
– $\beta_m$: unknown proportions to estimate, with $\beta_m \ge 0$ and $\sum_{m=1}^{s} \beta_m = 1$ (1)
– s known
\[ E(y_i) = \sum_{m=1}^{s} \beta_m P_{mi} \]
– $y_i$: frequency of class i of histogram Y
– $P_{mi}$ = Prob(class i of histogram Y under law $P_m$)
If:
– Y: vector of the $y_i$ (frequency of class i of histogram Y)
– X: matrix of the $P_{mi}$ (Prob(class i of histogram Y under law $P_m$))
– β: vector of the $\beta_m$
\[ Y = X\beta + \varepsilon \]
with the constraint (1) that the coefficients $\beta_m$ are non-negative and sum to 1.
This problem is very ill-conditioned, and regression (with constraints (1)) does not always give good results.
– CA of X (compression)
– Regression (with constraints (1)) on the first factors of the CA of X
– Calculation of β by the transition formula (Cazes, CAD, 1978).
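To make the regression model Y = Xβ + ε concrete, here is a minimal sketch of the direct constrained least squares fit in the simplest case s = 2, where the constraint (1) reduces β to a single number b in [0, 1]. This is the plain constrained regression, not the CA-based procedure of Cazes (1978); the Poisson parameters, the 12 histogram classes and the function names are hypothetical.

```python
import math


def poisson_pmf(k, mu):
    """Probability of the value k under a Poisson law of parameter mu."""
    return math.exp(-mu) * mu ** k / math.factorial(k)


def fit_two_component(y, p1, p2):
    """Least squares estimate of b in y ~ b*p1 + (1 - b)*p2, with 0 <= b <= 1.

    The criterion is a one-dimensional convex quadratic in b, so clipping
    the unconstrained minimizer to [0, 1] gives the constrained minimizer.
    """
    d = [a - c for a, c in zip(p1, p2)]
    num = sum(di * (yi - ci) for di, yi, ci in zip(d, y, p2))
    den = sum(di * di for di in d)
    return min(1.0, max(0.0, num / den))


classes = range(12)                       # histogram classes 0, 1, ..., 11
p1 = [poisson_pmf(k, 2.0) for k in classes]
p2 = [poisson_pmf(k, 6.0) for k in classes]

# Noise-free hypothetical histogram built with proportions 0.3 and 0.7.
y = [0.3 * a + 0.7 * c for a, c in zip(p1, p2)]

b = fit_two_component(y, p1, p2)
beta = [b, 1.0 - b]
print(beta)
```

With more components, or with known laws that are close to one another, the columns of X become nearly collinear, which is the ill-conditioning that the compression by CA is meant to address.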
– Predict the class of an individual (good or bad payer; alive or deceased, etc.) as a function of explanatory variables (qualitative and quantitative).
– Or explain a variable Y with 2 modalities
– DISQUAL method (Saporta, 1977)
– Barycentric Discriminant Analysis (Nakache et al., CAD, 1977)
– Other methods: logistic regression, regression trees, etc.
Second example: Scoring
– Cut the numerical explanatory variables into classes
– Construct the complete disjunctive table kIJ of all the explanatory variables
– CA of kIJ and visualization of the explanatory modalities and of the modalities to be explained
– Factorial discriminant analysis on the factors coming from this CA, selected among the factors associated with a fixed percentage of inertia (80 or 90%) and which are discriminant
– The score of an individual i of I is the abscissa of the projection of i on the discriminant factorial axis.
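The first two steps of the list above (cutting numerical variables into classes and building the complete disjunctive table) can be sketched as follows. The variables `age` and `housing`, the class bounds and the function names are hypothetical; the CA and the discriminant analysis that follow are not shown.

```python
def cut_into_classes(values, bounds):
    """Class index of each value: number of bounds less than or equal to it."""
    return [sum(v >= b for b in bounds) for v in values]


def disjunctive_block(codes, n_modalities):
    """Indicator (0/1) columns of one qualitative variable."""
    return [[1 if c == m else 0 for m in range(n_modalities)] for c in codes]


def complete_disjunctive_table(coded_variables):
    """Juxtapose the indicator blocks of all the variables."""
    blocks = [disjunctive_block(codes, n_mod) for codes, n_mod in coded_variables]
    n = len(blocks[0])
    return [sum((block[i] for block in blocks), []) for i in range(n)]


# Hypothetical data: one numerical and one qualitative explanatory variable.
age = [23, 35, 47, 61, 52]
housing = [0, 1, 1, 0, 2]                     # already coded 0, 1, 2

age_class = cut_into_classes(age, bounds=[30, 50])   # 3 classes
k_ij = complete_disjunctive_table([(age_class, 3), (housing, 3)])

for row in k_ij:
    print(row)
```

Each row of the table sums to the number of variables, since an individual takes exactly one modality per variable.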
– Cut the numerical explanatory variables into classes
– Construct the complete disjunctive table kIJ of all the explanatory variables
– Construct the table kCJ crossing the set C of the two modalities of the variable Y to be explained with the set J of all the explanatory modalities.
– CA of kCJ (with kIJ as supplementary elements) and visualization
– Projection of I on the single axis of the preceding analysis, followed by a classical discrimination
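Because kCJ has only two rows, its CA has a single non-trivial axis, which can be written in closed form without an eigen-solver. The sketch below builds kCJ from a complete disjunctive table, computes that axis and projects the individuals as supplementary rows with the usual transition formulas. The data and the name `bda_scores` are hypothetical, Y is assumed to be coded 0/1, and every explanatory modality is assumed to be observed at least once.

```python
import math


def bda_scores(k_ij, y):
    """Class coordinates and individual scores on the single axis of CA of kCJ."""
    n_j = len(k_ij[0])
    cols = range(n_j)
    # Table kCJ: one row per class of Y, summing the rows of its individuals.
    k_cj = [[sum(row[j] for row, c in zip(k_ij, y) if c == cls) for j in cols]
            for cls in (0, 1)]
    total = sum(map(sum, k_cj))
    w = [sum(row) / total for row in k_cj]              # class weights
    fj = [sum(col) / total for col in zip(*k_cj)]       # column margins
    # The single eigenvalue is the total inertia of kCJ.
    lam = sum((k_cj[c][j] / total - w[c] * fj[j]) ** 2 / (w[c] * fj[j])
              for c in (0, 1) for j in cols)
    # Centered class coordinates with variance lam (the sign is arbitrary).
    f_c = [math.sqrt(lam * w[1] / w[0]), -math.sqrt(lam * w[0] / w[1])]
    # Transition formula: coordinates of the explanatory modalities.
    g = [sum(k_cj[c][j] / total / fj[j] * f_c[c] for c in (0, 1))
         / math.sqrt(lam) for j in cols]
    # Transition formula: individuals projected as supplementary rows.
    scores = [sum(row[j] / sum(row) * g[j] for j in cols) / math.sqrt(lam)
              for row in k_ij]
    return f_c, scores


# Hypothetical disjunctive table (3 + 2 modalities) and class variable Y.
k_ij = [[1, 0, 0, 1, 0],
        [1, 0, 0, 0, 1],
        [0, 1, 0, 1, 0],
        [0, 1, 0, 0, 1],
        [0, 0, 1, 0, 1],
        [0, 0, 1, 0, 1]]
y = [0, 0, 0, 1, 1, 1]

f_c, scores = bda_scores(k_ij, y)
mean_0 = sum(s for s, c in zip(scores, y) if c == 0) / y.count(0)
mean_1 = sum(s for s, c in zip(scores, y) if c == 1) / y.count(1)
print(f_c, mean_0, mean_1)
```

The mean score of each class coincides with the coordinate of that class on the axis, which is why the axis joins the two barycenters; a threshold on the scores then gives the classical discrimination rule.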
– If, in DISQUAL method, we keep all the factors, we
explanatory modalities as in BDA, i.e. the axis joining the barycenters of the two modalities to be predicted.
– On the other hand, the discriminant scores are different,
inertia metric, which corresponds to work (with the usual metric)
CA of kCJ (or kIJ), which is equivalent to work (with the usual metric) on the usual factors (of CA of kIJ), i.e. the factor of eigenvalue variance.
– Finally, instead of doing the factorial discriminant analysis, we can also carry out a logistic regression on the factors coming from the CA of the complete disjunctive table kIJ.
– Barycentric Discriminant Analysis (BDA), like DISQUAL, can easily be generalized to the case where the variable to be explained Y has more than two modalities.
a certain number of wholesalers with a better accuracy than the one provided by the experts.
wholesalers with a set T of 157 weeks, the general term of this table being the sale total k(i, t) of the wholesaler i in the week t.
Third example (Morineau et al., 1994)
Solution
1) CA of the row-profile table of kIT, followed by an ascending hierarchical classification, defines wholesaler classes ($I = \{I_c \mid c \in C\}$).
2) For each class c, PCA of the corresponding sub-table $k_{I_c T}$; the data reconstitution formula gives (2 factors are sufficient), for $i \in I_c$:
\[ k(i, t) = m_c(t) + u_{c1}(t)\, F_{c1}(i) + u_{c2}(t)\, F_{c2}(i) \]
where $m_c(t)$ is the mean of k(i, t) in class c, and $F_{c1}(i)$ and $F_{c2}(i)$ are the coordinates of i on the first 2 factorial axes.
3) Modelling of $m_c(t)$, $u_{c1}(t)$ and $u_{c2}(t)$ by a SARIMA process
data base, complex data, etc.).
Learning communities.
– SLDS symposium (Statistical Learning and Data Sciences) at Paris-Dauphine University in April 2009.
– The proceedings were published in a special issue (number 42, summer 2010) of the MODULAD review.
factorial analysis have been developed in the framework
since few years.
spectroscopy data, in quality control, for instance).
– Factorial methods and data analysis are not used enough in France,
– but they could be of great help in numerous cases to users whose databases grow larger and more complex every day.
(1980) : Pratique de l’analyse des données, Tome 2 : Abrégé théorique. Etude de cas modèles, Dunod, 477 pages.
analyse des différences locales, RSA, Vol. 48, n° 2, pp. 55-76.
taxinomie, 627 pages ; Tome 2 : L’analyse des correspondances, Dunod, 619 pages.
des données, Tome 1 : Analyse des correspondances. Exposé élémentaire, Dunod, 432 pages.
Tome 3 : Linguistique & lexicologie, Dunod, 575 pages.
des données, Dunod, 159 pages.
des données, Tome 5 : Economie, Dunod, 543 pages.
Handbook, Dekker, 678 pages.
de l’analyse des données, Tome 4 : Médecine, pharmacologie physiologie clinique, Statmatic, 542 pages.
multiplication du premier étage d’un photomultiplicateur à dynodes, [Photomultiplicateur], CAD, Vol. 3, n° 4, pp. 393-417.
table in which the rows and columns have a graph structure, in Symbolic-Numeric Data Analysis and Learning, Eds.
qualitatives et quantitatives en analyse factorielle, [Qualitatives et Quantitatives], CAD, Vol. 4 n° 2, pp. 137-146.
modèle. Application à l’analyse de tableaux d’échanges, RSA, Vol. 32 n° 4, pp. 25-36.
simples et multiples. Objectifs, méthodes et interprétation, 3ème éd., Dunod, 300 pages.
models, and log-bilinear models for the analysis of contingency tables, Proceedings of the 45th session of ISI, Amsterdam.
PARDOUX, C. (1994) : Analyse des données et modélisation des séries
: Aspects pronostiques et thérapeutiques de l’infarctus myocardique aigu compliqué d’une défaillance sévère de la pompe cardiaque. Application des méthodes de discrimination, [Aorte], CAD, Vol.2, n° 4, pp. 415-434.
discriminante sur variables qualitatives, Premières Journées Internationales, Analyse des Données et Informatique, INRIA, Versailles.
Psychometrika, Vol. 23, n° 2, pp. 111-136.
linéaire grâce à une analyse des correspondances, RSA, Vol. 35, n° 3, pp. 13-20.
les mêmes ensembles, [Compar. Corr.], CAD, Vol. 2, n° 3, pp. 251-264.
[Che93] CHESSEL, D., MERCIER, P. (1993) : Couplage de triplets statistiques et liaisons espèces – environnement, Eds. LEBRETON, J.D., ASSELAIN, B., Masson, Paris, pp. 15-44.