Singly and doubly ordered cumulative correspondence analysis.
- L. D’Ambra*, E. Beh** and I. Camminatiello*
*University of Naples Federico II (Italy) **University of Newcastle (Australia) dambra@unina.it
Outline
- A short review
- Singly ordered cumulative correspondence analysis: methodology and application in an industrial experiment
- Doubly ordered cumulative correspondence analysis: some developments and an application to Van Rijckevorsel's data
- We also propose a unified approach
In multidimensional data analysis for studying the association between two categorical variables, Correspondence Analysis (CA) is one of the most popular tools. This method is based on the chi-squared statistic.
It does not take into consideration the ordered nature of the categories.
There are some contributions that deal with ordinal categorical variables, including those of Parsa and Smith (1993), Ritov and Gilula (1993) and Schriever (1983)
These procedures involve constraining the output obtained from applying singular value decomposition (SVD) so that the coordinates in the first dimension have an ordered structure.
An alternative approach applies moment decomposition (MD; Beh, 1997) or hybrid decomposition (HD; Beh, 2004), which use orthogonal polynomials in order to detect linear, quadratic and cubic components.
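As a rough sketch (not the authors' code), the orthogonal polynomials used by MD/HD can be generated by a weighted Gram-Schmidt process on powers of the category scores; the scores and marginal weights below are hypothetical.

```python
import numpy as np

def orthogonal_polynomials(scores, weights, degree):
    """Gram-Schmidt orthonormalisation of 1, x, x^2, ... under the
    inner product <f, g> = sum_j w_j f_j g_j."""
    V = np.vander(np.asarray(scores, float), degree + 1, increasing=True)
    w = np.asarray(weights, float)
    basis = []
    for k in range(degree + 1):
        v = V[:, k].copy()
        for q in basis:                      # remove projections on earlier polynomials
            v -= np.sum(w * v * q) * q
        v /= np.sqrt(np.sum(w * v * v))      # normalise
        basis.append(v)
    return np.column_stack(basis)            # columns: constant, linear, quadratic, ...

# hypothetical ordered categories with natural scores and marginal weights
B = orthogonal_polynomials([1, 2, 3, 4, 5], [0.10, 0.20, 0.30, 0.25, 0.15], degree=3)
```

The linear, quadratic and cubic columns of B are then used to isolate the location, dispersion and skewness components of the association.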
In some industrial experiments the output consists of categorical data (a contingency table) with an ordering in the categories. For analyzing such data, Taguchi (1966, 1974) proposed the Accumulation Analysis method as an alternative to Pearson's chi-squared test. His motivation for recommending this technique appears to be its similarity to ANOVA for quantitative variables. Light and Margolin (1971) proposed a method called CATANOVA by defining an appropriate measure of variation for categorical data. Unlike these methods, Taguchi considers situations with ordered categories and performs ANOVA on the cumulative frequencies.
Beh, D'Ambra and Simonetti (CARME 2007; Communications in Statistics, 2011) performed correspondence analysis when the cross-classified variables have an ordered structure by considering Taguchi's statistic.
Taguchi's statistic is an appropriate measure of non-symmetric association for two categorical variables of which one is on an ordinal scale. It takes into account the presence of an ordered variable by considering the cumulative sums of cell frequencies across that variable.
Notation
- N = (n_ij): the absolute two-way contingency table that cross-classifies n units according to I row and J column categories
- P = N/n: the relative two-way contingency table
- r and c: the vectors of the row and column marginals of P
- d_ij = n_i1 + n_i2 + ... + n_ij: the cumulative frequencies; d_j = (n_.1 + ... + n_.j)/n: the cumulative column proportions
- M: a triangular matrix of 1's with the last, J-th, row removed, so that it is of dimension (J-1) x J
- M_J: a triangular matrix of 1's of dimension J x J
- L: a triangular matrix of 1's of dimension I x I
- D_r and D_c: the diagonal matrices with the marginal frequencies of P
Taguchi's (1966, 1974) cumulative-sums statistic is

T = sum_{j=1}^{J-1} w_j sum_{i=1}^{I} n_{i.} ( d_ij / n_{i.} - d_j )^2

obtained by assigning to each column a weight that is inversely proportional to the conditional expectation of the j-th term (conditional on the given marginals),

w_j = [ d_j (1 - d_j) ]^{-1}.

In this paper we use this weighting system. A simpler statistic assigns each column the constant weight w_j = 1/J.
Nair (1987) demonstrated that the link between the Pearson chi-squared statistic and Taguchi's statistic is

T = sum_{j=1}^{J-1} chi^2_j

where chi^2_j is the Pearson chi-squared statistic for the two-column contingency table obtained by aggregating column categories 1 to j and aggregating column categories j+1 to J. For this reason, T is also referred to as the cumulative chi-squared statistic.
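A small numerical sketch (with made-up counts) of this identity, computing T directly from the weighted cumulative sums and via Nair's chi-squared partition:

```python
import numpy as np

def taguchi_T(N):
    """Taguchi's statistic with weights w_j = 1 / (d_j (1 - d_j))."""
    N = np.asarray(N, float)
    n, rows = N.sum(), N.sum(axis=1)
    D = np.cumsum(N, axis=1)              # row-wise cumulative frequencies d_ij
    d = np.cumsum(N.sum(axis=0)) / n      # cumulative column proportions d_j
    T = 0.0
    for j in range(N.shape[1] - 1):
        w = 1.0 / (d[j] * (1.0 - d[j]))
        T += w * np.sum(rows * (D[:, j] / rows - d[j]) ** 2)
    return T

def chi2_partition(N):
    """Sum of Pearson chi-squared statistics over the J-1 pooled two-column tables."""
    N = np.asarray(N, float)
    n, total = N.sum(), 0.0
    for j in range(1, N.shape[1]):
        M = np.column_stack([N[:, :j].sum(axis=1), N[:, j:].sum(axis=1)])
        E = np.outer(M.sum(axis=1), M.sum(axis=0)) / n
        total += np.sum((M - E) ** 2 / E)
    return total

N = [[10, 5, 3, 2], [2, 8, 6, 4], [4, 4, 9, 7]]   # made-up ordered-category counts
```

Both routes give the same value of T, which is exactly Nair's result.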
In matrix form,

T = n trace[ W^{1/2} M (P - r c^T)^T D_r^{-1} (P - r c^T) M^T W^{1/2} ]

where W ((J-1) x (J-1)) is the diagonal matrix of weights and A ((J-1) x J) is the matrix involving the cumulative column proportions,

A =
[ 1-d_1      -d_1      ...   -d_1    ]
[ 1-d_2      1-d_2     -d_2 ... -d_2 ]
[ ...                                ]
[ 1-d_{J-1}  ...  1-d_{J-1} -d_{J-1} ]

with M (P - r c^T)^T = A P^T, so that equivalently T = n trace[ W^{1/2} A P^T D_r^{-1} P A^T W^{1/2} ].
Considering that M c = d, and hence M (P - r c^T)^T = M P^T - d r^T, after some algebra Taguchi's statistic may be rewritten as

T = n trace[ D_r^{-1/2} (P - r c^T) M^T W M (P - r c^T)^T D_r^{-1/2} ]

which parallels the total inertia of classical correspondence analysis (CA),

chi^2 = n trace[ D_r^{-1/2} (P - r c^T) D_c^{-1} (P - r c^T)^T D_r^{-1/2} ].
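The trace form can be checked numerically against the elementwise definition; a sketch with made-up counts:

```python
import numpy as np

def taguchi_T_trace(N):
    """T = n * trace[ W^{1/2} M (P - r c^T)^T D_r^{-1} (P - r c^T) M^T W^{1/2} ]."""
    N = np.asarray(N, float)
    n = N.sum()
    P = N / n
    r, c = P.sum(axis=1), P.sum(axis=0)
    J = N.shape[1]
    M = np.tril(np.ones((J, J)))[:-1]     # (J-1) x J cumulative-sum matrix
    d = M @ c                              # cumulative column proportions
    W = np.diag(1.0 / (d * (1.0 - d)))     # Taguchi weights
    Z = (P - np.outer(r, c)) @ M.T         # I x (J-1)
    return n * np.trace(W @ Z.T @ np.diag(1.0 / r) @ Z)

def taguchi_T_sums(N):
    """Elementwise definition of T for comparison."""
    N = np.asarray(N, float)
    n, rows = N.sum(), N.sum(axis=1)
    D = np.cumsum(N, axis=1)
    d = np.cumsum(N.sum(axis=0)) / n
    return sum(
        (1.0 / (d[j] * (1.0 - d[j]))) * np.sum(rows * (D[:, j] / rows - d[j]) ** 2)
        for j in range(N.shape[1] - 1)
    )

N = [[10, 5, 3, 2], [2, 8, 6, 4], [4, 4, 9, 7]]
```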
Beh, D'Ambra and Simonetti (CARME 2007; Communications in Statistics, 2011) carried out CA when the cross-classified variables have an ordered structure by considering Taguchi's statistic. In terms of Taguchi's statistic, Beh et al. (2010) perform the SVD

D_r^{-1/2} (P - r c^T) M^T W^{1/2} = U S V^T

so that T = n sum_k s_k^2.
Special cases and Properties of Cumulative Correspondence analysis
Special cases:
- For I > 2 and in the case of equiprobable categories, the eigenvectors are given by Chebyshev polynomials.
- For I > 2, in the equiprobable case, the first component (location, or linear) is proportional to the Kruskal-Wallis statistic for contingency tables. Similarly, the second component (dispersion, or quadratic) is a generalization of the grouped-data version of Mood's (1954) statistic. In the general case this is not true.
- In the case of a 2 x J table we have two components: the first (linear) component of Taguchi's statistic is equivalent to the Wilcoxon statistic; the second (quadratic) component is equivalent to Mood's (1954) test (see Nair, 1987).
See Beh, D'Ambra and Simonetti (Communications in Statistics, 2011) for the coordinates, distances and properties of the decomposition of Taguchi's statistic, and for Non Symmetrical Correspondence Analysis (NSCA).
For the cumulative analysis the row coordinates may be written as

O = D_r^{-1} (P - r c^T) M^T W^{1/2} V

while for classical CA the row coordinates are defined by

O~ = D_r^{-1} (P - r c^T) D_c^{-1/2} V~

where V and V~ are the matrices containing the right singular vectors for the cumulative analysis and classical CA, respectively. Since V~ is orthonormal, D_r^{-1} (P - r c^T) = O~ V~^T D_c^{1/2}, and therefore

O = O~ V~^T D_c^{1/2} M^T W^{1/2} V.

This shows that one may pass from the classical CA coordinates to the cumulative coordinates. Using the same argument, the classical coordinates can be obtained from the cumulative coordinates through the relationship

O~ = O V^T W^{-1/2} (M^T)^+ D_c^{-1/2} V~

where (M^T)^+ denotes a generalized inverse of M^T.
To illustrate cumulative correspondence analysis using Taguchi's statistic, D'Ambra, Köksoy and Simonetti (2010) use Phadke's (1989) data.
The control factors (6) and their levels (3) of the polysilicon deposition process

Factor   Level 1   Level 2   Level 3
A        T0-25     T0        T0+25
B        P0-200    P0        P0+200
C        N0        N0-150    N0-75
D        S0-100    S0-50     S0
E        t0        t0+8      t0+16
F        None      CM2       CM3
Categories of product quality

Category                    Description          Cumulative category
I: 0-3 defects              No surface defects   (I) = I (0-3 defects)
II: 4-30 defects            Very few defects     (II) = I+II (0-30 defects)
III: 31-300 defects         Some defects         (III) = I+II+III (0-300 defects)
IV: 301-1000 defects        Many defects         (IV) = I+II+III+IV (0-1000 defects)
V: 1001 and more defects    Too many defects     (V) = I+II+III+IV+V (all defect counts)
Factor effects for the categorized surface defect data (left: cumulative number of observations by category; right: probabilities for the cumulative categories)

Level  (I) (II) (III) (IV) (V)    (I)  (II)  (III) (IV)  (V)
A1     34  40   51    53   54     0.63 0.74  0.94  0.98  1.00
A2     7   22   34    41   54     0.13 0.41  0.63  0.76  1.00
A3     8   14   19    32   54     0.15 0.26  0.35  0.59  1.00
B1     25  40   46    51   54     0.46 0.74  0.85  0.94  1.00
B2     20  28   36    43   54     0.37 0.52  0.67  0.80  1.00
B3     4   8    22    32   54     0.07 0.15  0.41  0.59  1.00
C1     19  30   32    39   54     0.35 0.56  0.59  0.72  1.00
C2     11  20   28    39   54     0.20 0.37  0.52  0.72  1.00
C3     19  26   44    48   54     0.35 0.48  0.81  0.89  1.00
D1     20  25   34    41   54     0.37 0.46  0.63  0.76  1.00
D2     13  31   42    44   54     0.24 0.57  0.78  0.81  1.00
D3     16  20   28    41   54     0.30 0.37  0.52  0.76  1.00
E1     21  27   38    43   54     0.39 0.50  0.70  0.80  1.00
E2     16  29   36    42   54     0.30 0.54  0.67  0.78  1.00
E3     12  20   30    41   54     0.22 0.37  0.56  0.76  1.00
F1     21  23   26    34   54     0.39 0.43  0.48  0.63  1.00
F2     21  30   40    46   54     0.39 0.56  0.74  0.85  1.00
F3     7   23   38    46   54     0.13 0.43  0.70  0.85  1.00
The Pearson chi-squared statistics for the four contingency tables obtained by aggregating column categories 1 to j against column categories j+1 to J:

Aggregation             chi^2_j
I vs (II+III+IV+V)      83.209
(I+II) vs (III+IV+V)    79.265
(I+II+III) vs (IV+V)    95.8786
(I+II+III+IV) vs V      60.2143
The partition of Taguchi's statistic from the contingency table into Pearson chi-squared statistics

Factor  (I) (II+III+IV+V)   (I+II) (III+IV+V)   (I+II+III) (IV+V)   (I+II+III+IV) (V)
A1      34  20              40  14              51  3               53  1
A2      7   47              22  32              34  20              41  13
A3      8   46              14  40              19  35              32  22
B1      25  29              40  14              46  8               51  3
B2      20  34              28  26              36  18              43  11
B3      4   50              8   46              22  32              32  22
C1      19  35              30  24              32  22              39  15
C2      11  43              20  34              28  26              39  15
C3      19  35              26  28              44  10              48  6
D1      20  34              25  29              34  20              41  13
D2      13  41              31  23              42  12              44  10
D3      16  38              20  34              28  26              41  13
E1      21  33              27  27              38  16              43  11
E2      16  38              29  25              36  18              42  12
E3      12  42              20  34              30  24              41  13
F1      21  33              23  31              26  28              34  20
F2      21  33              30  24              40  14              46  8
F3      7   47              23  31              38  16              46  8
chi^2_1 = 83.2, chi^2_2 = 79.3, chi^2_3 = 95.9, chi^2_4 = 60.2, so that

T = chi^2_TOT = 83.2 + 79.3 + 95.9 + 60.2 = 318.6
The table shows the ANOVA results following Taguchi (see Nair). A and B are the two most important factors affecting product quality.

Source  AA
A       14.3
B       10.8
C       2.3
D       1.8
E       1.2
F       2.9
The figure shows the graphical representation.
The table shows the distances from the origin to the column points in the figure above. We note that the point "I vs (II+III+IV+V)" is the most important because it gives the largest contribution (30.618), as measured by its distance from the origin.

Column point            Distance
I vs (II+III+IV+V)      30.618
(I+II) vs (III+IV+V)    4.995
(I+II+III) vs (IV+V)    1.658
(I+II+III+IV) vs V      0.023
Legend: row coordinates of cumulative ordinal correspondence analysis; column coordinates of cumulative ordinal correspondence analysis; categories "I"-"V" from correspondence analysis; supplementary points of the factors A, B, C, D, E, F.
The figure shows the row and column categories of singly ordered cumulative correspondence analysis, the supplementary points of the factors (A, B, C, D, E, F) and the column categories of the classical analysis. A and B are important factors.
First axis
Level   A       B       C       D      E      F
1       5.197   2.262   3.353   3.315  1.963  5.749
2       5.536   2.882   6.519   1.715  2.836  0.787
3       10.233  11.153  0.700   5.542  5.772  4.035

Second axis
Level   A       B       C       D      E      F
1       5.345   4.143   5.777   5.328  4.854  7.438
2       2.192   4.908   4.088   2.476  4.139  4.228
3       5.259   3.745   2.931   3.538  3.803  1.056
The two tables show the distances between the row points and the "I vs (II+III+IV+V)" column point on the first and second factorial axes. The first- and second-axis solutions are based on the minimum distances:

"I vs (II+III+IV+V)"   A    B    C    D    E    F
1st axis               A1   B1   C3   D2   E1   F2
2nd axis               A2   B3   C3   D2   E3   F3

So we choose this optimal combination.
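The axis solution can be read off mechanically as the minimum-distance level per factor; a sketch using the first-axis distances from the table above:

```python
import numpy as np

# distances between each factor level and the "I vs (II+III+IV+V)" point, first axis
first_axis = {
    "A": [5.197, 5.536, 10.233],
    "B": [2.262, 2.882, 11.153],
    "C": [3.353, 6.519, 0.700],
    "D": [3.315, 1.715, 5.542],
    "E": [1.963, 2.836, 5.772],
    "F": [5.749, 0.787, 4.035],
}

# pick, for every factor, the level (1-based) closest to the target column point
best = {factor: int(np.argmin(d)) + 1 for factor, d in first_axis.items()}
solution = "".join(f"{factor}{level}" for factor, level in best.items())
```

This recovers the first-axis optimal combination A1 B1 C3 D2 E1 F2.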
Comparative results for the optimal factor settings (probabilities for the cumulative categories)

Method      Solution        (I)    (II)   (III)  (IV)   (V)
MEL         A1B1C3D2E1F2    0.875  0.959  0.998  0.999  1.000
SCORE       A1B1C3D2E2F2    0.822  0.964  0.998  0.999  1.000
WSNR        A1B1C1D1E2F2    0.896  0.960  0.986  0.996  1.000
AA          A1B2C1D3E2F2    0.814  0.863  0.942  0.983  1.000
MSD         A1B1C3D2E1F3    0.617  0.931  0.998  0.999  1.000
STARTING    A2B2C1D3E1F1    0.363  0.435  0.394  0.554  1.000
PROPOSED:
 1st axis   A1B1C3D2E1F2    0.875  0.959  0.998  0.999  1.000
 2nd axis   A2B3C3D2E3F3    0.090  0.590  0.880  0.820  1.000
 Plane      A2B1C3D2E2F3    0.090  0.800  0.975  0.984  1.000

MEL = Asiabar and Ghomi (2006), SCORE = Nair (1986), WSNR = Wu and Yeh (2006), AA = Phadke's Accumulation Analysis (1989), MSD = Jeng and Guo (1996), STARTING = starting design, PROPOSED = D'Ambra, Köksoy and Simonetti.
The optimum settings recommended by the first factorial axis solution of cumulative correspondence analysis are A1, B1, C3, D2, E1, F2. By the inverse omega transform, the predicted probability for category (I) is 0.875. The second axis solution (A2, B3, C3, D2, E3, F3) does not seem as powerful, and its probabilities for the cumulative categories are not high enough. The plane solution (A2, B1, C3, D2, E2, F3) in particular gives a very low probability for category (I). As a result, we suggest picking the first axis solution as the optimal one for Phadke's polysilicon deposition process. The last table shows the comparative results for the solution methods to optimize factor settings according to their predicted probabilities for the cumulative categories. To calculate the optimal probabilities for the cumulative categories, Taguchi uses the omega transform, which for a probability p is defined by

omega(p) = 10 log_10 ( p / (1 - p) )
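The omega transform and its inverse (used to map predicted omega values back to probabilities) can be sketched as:

```python
import math

def omega(p):
    """Taguchi's omega transform: omega(p) = 10 * log10(p / (1 - p))."""
    return 10.0 * math.log10(p / (1.0 - p))

def omega_inverse(w):
    """Inverse omega transform: p = 1 / (1 + 10^(-w/10))."""
    return 1.0 / (1.0 + 10.0 ** (-w / 10.0))
```

For example, the predicted probability 0.875 for category (I) corresponds to omega(0.875) = 10 log10(7), about 8.45.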
We observe that the first axis solution is equivalent to the MEL (minimization of expected loss) solution proposed by Asiabar and Ghomi (2006). This solution is a nice candidate among the others, since its probabilities for the cumulative categories are higher. Asiabar and Ghomi (2006) suggested the MEL technique, which minimizes the expected loss in the analysis of ordered categorical data: after an experiment and data collection, the authors define a probability distribution function of the data, the expected loss at each factor level is calculated, and the optimum level of a factor is the one where the expected loss is lower than at the other levels of that factor.
Cuadras (2002) proposed an alternative approach, based on the SVD

D_r^{-1/2} (P - r c^T) M^T W_J^{1/2} = U S V^T

where U is the matrix containing the left singular vectors, S is the diagonal matrix containing the singular values, V is the matrix containing the right singular vectors, W_J is the J x J diagonal matrix of weights 1/J, L is a lower triangular matrix and M is an upper triangular matrix.

This approach does not decompose any known index: it lacks the property of being the sum of the Pearson chi-squared statistics for the contingency tables obtained by partitioning and pooling the original data (see Taguchi).
Starting from the proposal of Beh et al. (2007, 2011), we present a more general approach based on doubly cumulative frequencies, which overcomes these problems and presents some interesting properties. Notation:
- R: the 2(I-1) x I matrix obtained by alternating the rows of an (I-1) x I lower triangular matrix of ones (without the row of all ones) and the rows of an (I-1) x I upper triangular matrix of ones (without the row of all ones)
- C: the J x 2(J-1) matrix obtained by alternating the columns of a J x (J-1) upper triangular matrix of ones (without the column of all ones) and the columns of a J x (J-1) lower triangular matrix of ones (without the column of all ones)
- D_R and D_C: the diagonal matrices with the marginal frequencies of the doubly cumulative table
Consider the SVD

G = D_R^{-1/2} R (P - r c^T) C D_C^{-1/2} = U S V^T

with row and column coordinates

G_R = D_R^{-1} R (P - r c^T) C D_C^{-1/2} V
G_C = D_C^{-1} C^T (P - r c^T)^T R^T D_R^{-1/2} U.

The inertia is sum_k s_k^2, and the statistic

Q = n (I-1)(J-1) trace[ D_R^{-1/2} R (P - r c^T) C D_C^{-1} C^T (P - r c^T)^T R^T D_R^{-1/2} ] = n (I-1)(J-1) sum_k s_k^2

can be considered a generalization of Taguchi's statistic, because it takes into account the presence of both ordinal variables.
It is easy to verify that the trace of Q is identical to the doubly cumulative chi-squared statistic defined by Hirotsu (1986) (used for comparing treatments and for change-point analysis). This approach preserves the same properties as Taguchi's statistic:

chi^2* = sum_{i=1}^{I-1} sum_{j=1}^{J-1} chi^2_ij

where chi^2_ij is the Pearson chi-squared statistic for the 2 x 2 contingency table obtained by aggregating rows 1 to i against rows i+1 to I and columns 1 to j against columns j+1 to J.
The CA of the doubly cumulative table, which we call Doubly Cumulative Correspondence Analysis, presents the following properties:
- The approach maximizes the phi-squared statistic of each 2 x 2 table and, apart from the constant 2(I-1) x 2(J-1), of the doubly cumulative table.
- All the weighted row and column coordinates are centred.
- The weighted row and column coordinates are centred for the 2 x 2 tables.
- This approach allows the representation of the variations between row and column categories, rather than the categories themselves, in the space generated by the cumulative frequencies. Subsequently, the row and column categories can be projected onto the same space as supplementary points.
In order to represent the rows and columns of N we can consider the following SVD, depending on the matrices F, B, D, E and the vector a:

F B (P - a c^T) D E = U S V^T

1. Overall approach: the general decomposition above.
2. Correspondence Analysis: F = D_r^{-1/2}, a = r, B = I, D = I, E = D_c^{-1/2}
3. Cuadras approach: F = D_r^{-1/2}, a = r, B = I, D = M_J^T, E = W_J^{1/2}
4. Taguchi decomposition (Beh, D'Ambra, Simonetti, 2007-2011): F = D_r^{-1/2}, a = r, B = I, D = M^T, E = W^{1/2}
5. Doubly Cumulative Correspondence Analysis (Cuadras approach): F = D_r^{-1/2}, a = r, B = L, D = M_J^T, E = W_J^{1/2}
6. Doubly Cumulative Correspondence Analysis (our approach - Hirotsu decomposition): F = D_R^{-1/2}, a = r, B = R, D = C, E = D_C^{-1/2}
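As a sketch of the framework (not the authors' code), the overall SVD can be implemented once and then specialized; for case 2 the squared singular values recover the Pearson chi-squared statistic of classical CA. The counts below are made up.

```python
import numpy as np

def unified_singular_values(N, F, B, a, D, E):
    """Singular values of F B (P - a c^T) D E for an I x J contingency table N."""
    N = np.asarray(N, float)
    P = N / N.sum()
    c = P.sum(axis=0)
    return np.linalg.svd(F @ B @ (P - np.outer(a, c)) @ D @ E, compute_uv=False)

N = np.array([[20, 10, 5], [8, 15, 12], [4, 9, 17]], float)   # made-up counts
n = N.sum()
P = N / n
r, c = P.sum(axis=1), P.sum(axis=0)
I, J = N.shape

# case 2 (classical CA): F = D_r^{-1/2}, a = r, B = I, D = I, E = D_c^{-1/2}
s = unified_singular_values(
    N, np.diag(r ** -0.5), np.eye(I), r, np.eye(J), np.diag(c ** -0.5)
)

# Pearson chi-squared computed directly from observed and expected counts
expected = np.outer(N.sum(axis=1), N.sum(axis=0)) / n
chi2 = np.sum((N - expected) ** 2 / expected)
```

Here n * sum(s^2) equals the Pearson chi-squared statistic, the usual CA total-inertia relation.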
Non Symmetrical Correspondence Analysis
A data matrix that is both RR (row regression dependent) and CR (column regression dependent), following Schriever (1983) and Warrens-Heiser (2009).
The appreciations of five red Bordeaux wines by 200 judges using a four category system: from excellent to boring (Van Rijckevorsel, 1987, p. 60)
The rows and columns of Table have been permuted using the scores of the first CA dimension.
                        C1 excellent  C2 good  C3 mediocre  C4 boring  Total
R1 grand cru classé     87            93       19           1          200
R2 cru Bourgeois        45            126      24           5          200
R3 Bordeaux d'Origine   36            68       74           22         200
R4 vin de marque        0             30       111          59         200
R5 vin de table         0             0        52           148        200
Total                   168           317      280          235        1000
Since the table is both RR and CR, there exists a strong ordinal association between the categories of the two variables, and the five wines can be perfectly ordered.
Doubly cumulative correspondence analysis
N =
87  93  19  1
45  126 24  5
36  68  74  22
0   30  111 59
0   0   52  148

R =
1 0 0 0 0
0 1 1 1 1
1 1 0 0 0
0 0 1 1 1
1 1 1 0 0
0 0 0 1 1
1 1 1 1 0
0 0 0 0 1

C =
1 0 1 0 1 0
0 1 1 0 1 0
0 1 0 1 1 0
0 1 0 1 0 1
The doubly cumulative table is then calculated as the product R N C.
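A sketch of the construction (alternating triangular rows and columns, as defined above) applied to the wine table:

```python
import numpy as np

def row_cumulator(I):
    """2(I-1) x I matrix R: alternate lower- and upper-triangular rows of ones."""
    low = np.tril(np.ones((I, I)))[:-1]    # drop the all-ones last row
    up = np.triu(np.ones((I, I)))[1:]      # drop the all-ones first row
    R = np.empty((2 * (I - 1), I))
    R[0::2], R[1::2] = low, up
    return R

def col_cumulator(J):
    """J x 2(J-1) matrix C: alternate upper- and lower-triangular columns of ones."""
    up = np.triu(np.ones((J, J)))[:, :-1]  # drop the all-ones last column
    low = np.tril(np.ones((J, J)))[:, 1:]  # drop the all-ones first column
    C = np.empty((J, 2 * (J - 1)))
    C[:, 0::2], C[:, 1::2] = up, low
    return C

# Van Rijckevorsel's wine data
N = np.array([[87, 93, 19, 1],
              [45, 126, 24, 5],
              [36, 68, 74, 22],
              [0, 30, 111, 59],
              [0, 0, 52, 148]], float)

RNC = row_cumulator(5) @ N @ col_cumulator(4)   # the doubly cumulative table
```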
The doubly cumulative chi-squared statistic defined by Hirotsu is the sum of the Pearson chi-squared statistics for the 2 x 2 tables:

chi^2* = sum_{i=1}^{4} sum_{j=1}^{3} chi^2_ij

It is easy to verify:
                 C1 vs C2-C4    C1-C2 vs C3-C4   C1-C3 vs C4
R1 vs R2-R5      127.506        125.172          172.380
R1-R2 vs R3-R5   73.564         411.187          179.484
R1-R3 vs R4-R5   134.615        448.670 (max)    295.949
R1-R4 vs R5      50.481 (min)   235.437          354.645
It is easy to verify that, apart from the constant n(I-1)(J-1), the total inertia is identical to the doubly cumulative chi-squared statistic defined by Hirotsu.
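Hirotsu's 2 x 2 partition can be reproduced numerically; a sketch for the wine data:

```python
import numpy as np

N = np.array([[87, 93, 19, 1],
              [45, 126, 24, 5],
              [36, 68, 74, 22],
              [0, 30, 111, 59],
              [0, 0, 52, 148]], float)

def chi2_2x2(a, b, c, d):
    """Pearson chi-squared for a 2 x 2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

I, J = N.shape
chi = np.zeros((I - 1, J - 1))
for i in range(1, I):                     # rows 1..i vs rows i+1..I
    for j in range(1, J):                 # columns 1..j vs columns j+1..J
        chi[i - 1, j - 1] = chi2_2x2(
            N[:i, :j].sum(), N[:i, j:].sum(),
            N[i:, :j].sum(), N[i:, j:].sum(),
        )

hirotsu = chi.sum()                       # doubly cumulative chi-squared
```

This recovers the partition table above, its maximum of about 448.67, and the total of about 2609.089.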
Eigenvalues and percentages of inertia of doubly cumulative correspondence analysis

              F1      F2      F3       Total inertia
Eigenvalue    0.213   0.004   0.001    0.217
Cumulative %  97.749  99.764  100.000
n(I-1)(J-1) = 1000 x 4 x 3 = 12000

n(I-1)(J-1) sum_k s_k^2 = 12000 x 0.217 ≈ 2609.089
Plot of doubly ordered cumulative CA. The maximum distance from the origin is for the pair "C1-C2 vs C3-C4" and "R1-R3 vs R4-R5", corresponding to the maximum chi-squared value 448.67. We note the different variations C1, C1-C2, C1-C3 for the columns and R1, R1-R2, R1-R3 for the rows.
[Figure: Symmetric plot of doubly cumulative CA, axes F1 (97.75%) and F2 (2.01%), 99.76% of the inertia; column points C1 vs C2-C4, C1-C2 vs C3-C4, C1-C3 vs C4 and row points R1 vs R2-R5, R1-R2 vs R3-R5, R1-R3 vs R4-R5, R1-R4 vs R5.]
[Figure: Plot of classical correspondence analysis of the wine data; Principal Axis 1 (80.7%), Principal Axis 2 (15.05%), total two-dimensional association 95.75%; points R1-R5 and C1-C4.]
Let's look first at the positions of C1 and R1. Since they are situated near each other in this plot, this suggests that this row category and column category are associated with each other; if we were to look at their positions in the classical plot, they would be located near each other.

Looking at the positions of C1 and C1-C2: these two points are situated fairly close to one another, indicating that there is a small difference between C1 and C2. Since C1-C2 is slightly closer to the origin than C1, this suggests that C2 is also slightly closer to the origin (in the classical CA plot) than C1. Similar comments can be made by considering the relatively short distance between R1 and R1-R2: such a distance implies that, in the classical CA plot, R1 and R2 are located near each other.

If we consider the relative distances between (C1, C1-C2) and (R1, R1-R2), we can see that these two distances are similar. Since C1 is associated with R1, these similar distances imply that C2 and R2 are also similarly positioned in the classical CA plot. The relatively similar distances between R1, R1-R2 and R1-R3 suggest that the relative distances between R1, R2 and R3 in the classical CA plot are the same.

Now consider the right-hand side of the cumulative plot. C4 and R5 are situated close to each other, implying that in the classical CA plot they will also be situated close to one another. The distance between R5 and R4-R5 tells us that R4 is quite different from R5; since R4-R5 is situated closer to the origin than R5, R4 will be situated closer to the origin than R5. The roughly equal spacing of R2-R5 (closer to the origin), R3-R5 and R4-R5 (further from the origin) tells us that R2, R3 and R4 are roughly the same distance apart from each other in the classical CA plot.

What is interesting is that the distances between the pairs (R1, R2-R5), (R1-R2, R3-R5), (R1-R3, R4-R5) and (R1-R4, R5) are about the same, indicating that the cumulative nature of our analysis preserves the relative differences (or similarities) of R1, R2, R3, R4 and R5 that the classical CA plot would reflect. All of these conclusions regarding the interpretation of the cumulative correspondence plot are reflected in the classical CA plot.
References
Beh, E. J. (2004), Simple correspondence analysis: a bibliographic review, International Statistical Review, 72, 257-284.
Beh, E. J., D'Ambra, L., Simonetti, B. (2011), Cumulative correspondence analysis for ordered categorical data using Taguchi's statistic, Communications in Statistics.
Cuadras, C. M. (2002), Correspondence analysis and diagonal expansions in terms of distribution functions, Journal of Statistical Planning and Inference, 103, 137-150.
D'Ambra, L., Köksoy, O., Simonetti, B. (2009), Cumulative correspondence analysis of ordered categorical data from industrial experiments, Journal of Applied Statistics, 36, 1315-1328.
Hirotsu, C. (1986), Cumulative chi-squared statistic as a tool for testing goodness of fit, Biometrika, 73, 165-173.
Nair, V. N. (1987), Chi-squared type tests for ordered alternatives in contingency tables, Journal of the American Statistical Association, 82, 283-291.
Taguchi, G. (1974), A new statistical analysis for clinical data, the accumulating analysis, in contrast with the chi-square test, Saishin Igaku, 29, 806-813.
Warrens, M. J., Heiser, W. J. (2009), Diagnostics for regression dependence in tables re-ordered by the dominant correspondence analysis solution, Computational Statistics and Data Analysis, 53, 3139-3144.