Singly and doubly ordered cumulative correspondence analysis (PowerPoint PPT Presentation)



SLIDE 1

Singly and doubly ordered cumulative correspondence analysis.

  • L. D’Ambra*, E. Beh** and I. Camminatiello*

*University of Naples Federico II (Italy)  **University of Newcastle (Australia)  dambra@unina.it

SLIDE 2

Outline

 A short review
 Singly ordered cumulative correspondence analysis: methodology and an application to an industrial experiment
 Doubly ordered cumulative correspondence analysis: some developments and an application to Van Rijckevorsel's data
 A unified approach

SLIDE 3

Review (1/3)

 In multidimensional data analysis, Correspondence Analysis (CA) is one of the most popular tools for studying the association between two categorical variables.
 The method is based on the chi-squared statistic.

 Drawback

It does not take into consideration the ordered nature of the categories.

SLIDE 4

Review (2/3)

 There are some contributions that deal with ordinal categorical variables, including those of Parsa and Smith (1993), Ritov and Gilula (1993) and Schriever (1983)

 These procedures involve constraining the output obtained from applying singular value decomposition (SVD) so that the coordinates in the first dimension have an ordered structure.

 An alternative approach applies moment decomposition (MD; Beh, 1997) or hybrid decomposition (HD; Beh, 2004), which use orthogonal polynomials to detect linear, quadratic and cubic components.

SLIDE 5

Review (3/3)

 In some industrial experiments the output consists of categorical data (a contingency table) with an ordering in the categories.
 For analyzing such data, Taguchi (1966, 1974) proposed the Accumulation Analysis method as an alternative to Pearson's chi-squared test.
 His motivation for recommending this technique appears to be its similarity to ANOVA for quantitative variables.
 Light and Margolin (1971) proposed a method called CATANOVA, defining an appropriate measure of variation for categorical data.
 Unlike these methods, Taguchi's approach considers situations with ordered categories and performs an ANOVA on the cumulative frequencies.

SLIDE 6

Aim of our paper

 In this paper we explore the development of correspondence analysis which takes into account the presence of ordered variables by considering the cumulative sum of cell frequencies across the variables.

SLIDE 7

Singly ordered cumulative correspondence analysis

 Beh, D'Ambra, Simonetti (CARME 2007; Communications in Statistics 2011) performed correspondence analysis when the cross-classified variables have an ordered structure by considering Taguchi's statistic.

 Taguchi's statistic is an appropriate measure of non-symmetric association for two categorical variables, one of which is on an ordinal scale.  It takes into account the presence of the ordered variable by considering the cumulative sums of cell frequencies across that variable.

SLIDE 8

Notation (1/3)

 N = (n_ij), the absolute two-way contingency table that cross-classifies n units according to I ordered row categories and J ordered column categories
 P = N / n, the relative two-way contingency table
 n_i. and n_.j, the row and column marginals
 M_J, the (J - 1) × J triangular matrix of 1's obtained by removing the last (J-th) row of a J × J lower triangular matrix of 1's
 M, a triangular matrix of 1's of dimension J × J
 L, a triangular matrix of 1's of dimension I × I

SLIDE 9

Notation (2/3)

 z_ij = n_i1 + n_i2 + ... + n_ij, the cumulative frequencies (so z_iJ = n_i.)
 d_j = (n_.1 + n_.2 + ... + n_.j) / n, the cumulative column proportions, collected in the vector d = (d_1, ..., d_{J-1})^T
 r and c, the vectors with the marginal frequencies of P
 D_r and D_c, the diagonal matrices with the marginal frequencies of P

SLIDE 10

Taguchi’s statistic (1/2).

Taguchi (1966) proposed the following statistic:

T = \sum_{j=1}^{J-1} w_j \sum_{i=1}^{I} n_{i.} ( z_{ij} / n_{i.} - d_j )^2

where w_1, ..., w_{J-1} are weights > 0. Two choices are possible:

w_j = [ d_j (1 - d_j) ]^{-1}    or    w_j = 1 / J
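As a minimal numerical sketch (NumPy, with a made-up 3 × 4 table that is not from the slides), T can be computed directly from this definition under both weighting choices:

```python
import numpy as np

# Hypothetical 3x4 contingency table (illustrative only, not from the slides).
N = np.array([[10, 20, 15,  5],
              [ 5, 10, 20, 15],
              [ 2,  8, 10, 30]], dtype=float)
n = N.sum()
J = N.shape[1]
row = N.sum(axis=1)                      # row marginals n_i.
d = np.cumsum(N.sum(axis=0))[:-1] / n    # cumulative column proportions d_1..d_{J-1}
Z = np.cumsum(N, axis=1)[:, :-1]         # cumulative frequencies z_ij, j = 1..J-1

def taguchi_T(w):
    # T = sum_j w_j sum_i n_i. (z_ij / n_i. - d_j)^2
    return float(np.sum(w * np.sum(row[:, None] * (Z / row[:, None] - d) ** 2, axis=0)))

T_taguchi = taguchi_T(1.0 / (d * (1.0 - d)))      # w_j = [d_j (1 - d_j)]^(-1)
T_constant = taguchi_T(np.full(J - 1, 1.0 / J))   # w_j = 1/J
print(T_taguchi, T_constant)
```

With the first weighting, T coincides with Nair's cumulative chi-squared sum (see the partition two slides below).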

SLIDE 11

Taguchi’s statistic (2/2).

 The statistic T of Taguchi (1966, 1974), a "cumulative sums" statistic, is obtained by assigning to each column a weight inversely proportional to its conditional expectation for the j-th term (conditional on the given marginals):

w_j = [ d_j (1 - d_j) ]^{-1},    j = 1, ..., J - 1

In this paper we use this weighting system.
 A simpler version of T assigns each column the constant weight 1 / J.

SLIDE 12

The Pearson chi-squared statistic and Taguchi’s statistic

Nair (1987) demonstrated that the link between the Pearson chi-squared statistic and Taguchi's statistic is

T = \sum_{j=1}^{J-1} χ_j^2

where χ_j^2 is the Pearson chi-squared statistic for the two-column contingency table obtained by aggregating column categories 1 to j and aggregating column categories j + 1 to J. For this reason T is also referred to as the cumulative chi-squared statistic.
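A small sketch of this partition, on a hypothetical table: the J - 1 dichotomized tables are built explicitly and their Pearson chi-squared statistics are summed:

```python
import numpy as np

# Nair's identity on a hypothetical 3x4 table: with weights w_j = [d_j (1 - d_j)]^(-1),
# Taguchi's T equals the sum of the Pearson chi-squared statistics of the J-1
# dichotomized tables (columns 1..j pooled vs columns j+1..J pooled).
N = np.array([[10, 20, 15,  5],
              [ 5, 10, 20, 15],
              [ 2,  8, 10, 30]], dtype=float)
n = N.sum()
J = N.shape[1]

def pearson_chi2(table):
    # Plain Pearson chi-squared for a two-way table.
    r = table.sum(axis=1, keepdims=True)
    c = table.sum(axis=0, keepdims=True)
    e = r @ c / table.sum()
    return float(np.sum((table - e) ** 2 / e))

# chi2_j for each cut point j, then their sum
chi2_parts = [pearson_chi2(np.column_stack([N[:, :j + 1].sum(axis=1),
                                            N[:, j + 1:].sum(axis=1)]))
              for j in range(J - 1)]
T = sum(chi2_parts)
print(chi2_parts, T)
```

The sum reproduces T computed from the definition with the Taguchi/Nair weights.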

SLIDE 13

Taguchi’s statistic in matrix notation (1/2)

Taguchi's statistic may be expressed in matrix notation as

T = n^{-1} trace( D_r^{-1/2} N A^T W A N^T D_r^{-1/2} )

where
W is the (J - 1) × (J - 1) diagonal matrix of weights;
A is the (J - 1) × J matrix involving the cumulative column proportions, whose j-th row has entries 1 - d_j in the first j positions and -d_j in the remaining J - j positions:

A =
[ 1-d_1      -d_1       -d_1   ...  -d_1     ]
[ 1-d_2      1-d_2      -d_2   ...  -d_2     ]
[  ...                                ...    ]
[ 1-d_{J-1}  ...  1-d_{J-1}         -d_{J-1} ]
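The trace form can be checked numerically against the scalar definition; the table below is hypothetical:

```python
import numpy as np

# Check the matrix form T = n^(-1) trace(Dr^(-1/2) N A' W A N' Dr^(-1/2))
# against the scalar definition, on a hypothetical 3x4 table.
N = np.array([[10, 20, 15,  5],
              [ 5, 10, 20, 15],
              [ 2,  8, 10, 30]], dtype=float)
n = N.sum()
J = N.shape[1]
P = N / n
r = P.sum(axis=1)                        # relative row marginals (diagonal of D_r)
d = np.cumsum(P.sum(axis=0))[:-1]        # d_1..d_{J-1}
W = np.diag(1.0 / (d * (1.0 - d)))       # Taguchi/Nair weights

# A = M_J - d 1', i.e. A[j, k] = 1 - d_j for k <= j and -d_j otherwise
A = np.tril(np.ones((J - 1, J))) - np.outer(d, np.ones(J))

Dr_inv = np.diag(1.0 / r)                # trace is cyclic, so Dr^(-1) may be used once
T_matrix = np.trace(Dr_inv @ N @ A.T @ W @ A @ N.T) / n
print(T_matrix)
```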

SLIDE 14

Taguchi’s statistic in matrix notation (2/2)

Considering that

A = M_J - d 1^T    and    d = M_J c,

after some algebra Taguchi's statistic may be rewritten as

T = n trace( D_r^{-1/2} (P - r c^T) M_J^T W M_J (P - r c^T)^T D_r^{-1/2} )

Replacing M_J^T W M_J by D_c^{-1} gives the Pearson chi-squared statistic of classical correspondence analysis:

X^2 = n trace( D_r^{-1/2} (P - r c^T) D_c^{-1} (P - r c^T)^T D_r^{-1/2} )    (C.A.)
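A quick numerical check of both trace expressions, on a hypothetical table: the first reproduces T and the second reproduces the ordinary Pearson chi-squared:

```python
import numpy as np

# Check T = n trace(Dr^(-1/2) (P - r c') MJ' W MJ (P - r c')' Dr^(-1/2)) and that
# swapping MJ' W MJ for Dc^(-1) yields the Pearson chi-squared of classical CA.
# Hypothetical 3x4 table.
N = np.array([[10, 20, 15,  5],
              [ 5, 10, 20, 15],
              [ 2,  8, 10, 30]], dtype=float)
n = N.sum()
J = N.shape[1]
P = N / n
r = P.sum(axis=1); c = P.sum(axis=0)
d = np.cumsum(c)[:-1]
W = np.diag(1.0 / (d * (1.0 - d)))
MJ = np.tril(np.ones((J - 1, J)))        # (J-1) x J triangular matrix of ones
Pc = P - np.outer(r, c)                  # centred table
Dr_inv = np.diag(1.0 / r)                # trace is cyclic, so Dr^(-1) may be used once

T = n * np.trace(Dr_inv @ Pc @ MJ.T @ W @ MJ @ Pc.T)
chi2 = n * np.trace(Dr_inv @ Pc @ np.diag(1.0 / c) @ Pc.T)
print(T, chi2)
```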

SLIDE 15

Approach proposed

 Beh, D'Ambra, Simonetti (CARME 2007; Communications in Statistics 2011) carried out CA when the cross-classified variables have an ordered structure by considering Taguchi's statistic.
 In terms of Taguchi's statistic, Beh et al. perform the SVD of

X = \sqrt{n} D_r^{-1/2} (P - r c^T) M_J^T W^{1/2}

so that T = trace(X X^T). The matrix X is centred.
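A sketch of this SVD step on a hypothetical table: the squared singular values of X sum to T, and the centring of X can be checked directly:

```python
import numpy as np

# Build X = sqrt(n) Dr^(-1/2) (P - r c') MJ' W^(1/2), whose SVD defines the
# cumulative analysis; check trace(X X') = sum of squared singular values = T,
# and the centring sqrt(r)' X = 0. Hypothetical 3x4 table.
N = np.array([[10, 20, 15,  5],
              [ 5, 10, 20, 15],
              [ 2,  8, 10, 30]], dtype=float)
n = N.sum()
J = N.shape[1]
P = N / n
r = P.sum(axis=1); c = P.sum(axis=0)
d = np.cumsum(c)[:-1]
W_half = np.diag(np.sqrt(1.0 / (d * (1.0 - d))))
MJ = np.tril(np.ones((J - 1, J)))
X = np.sqrt(n) * np.diag(r ** -0.5) @ (P - np.outer(r, c)) @ MJ.T @ W_half
s = np.linalg.svd(X, compute_uv=False)
print(float(np.sum(s ** 2)))             # Taguchi's T
```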

SLIDE 16

Special cases and Properties of Cumulative Correspondence analysis

 For I > 2 and in the case of EQUIPROBABLE categories, the eigenvectors are given by CHEBYSHEV POLYNOMIALS.
 For I > 2, in the equiprobable case, the first component (location, or linear) is proportional to the Kruskal-Wallis statistic for contingency tables.
 Similarly, the second component (dispersion, or quadratic) is a generalization of the grouped-data version of Mood's (1954) statistic.
 In the general case this is not true.
 In the case of a 2 × J table we have two components:
 the first component (linear) of Taguchi's statistic is equivalent to the Wilcoxon statistic;
 the second component (quadratic) is equivalent to Mood's (1954) test (see Nair 1987).
 See Beh, D'Ambra, Simonetti in Communications in Statistics (2011) for:
 Coordinates
 Distances
 Properties of the decomposition of Taguchi's statistic and Non Symmetrical Correspondence Analysis (NSCA)

SLIDE 17

Relationship between the coordinates in the cumulative analysis and in the classical C.A (1/2)

 For the cumulative analysis the row coordinates may be written as

O_r = D_r^{-1} (P - r c^T) M_J^T W^{1/2} V

 For classical CA the row coordinates are defined by

Õ_r = D_r^{-1} (P - r c^T) Ṽ

where V and Ṽ are the matrices containing the right singular vectors of the cumulative analysis and of classical CA, respectively.
 Therefore

O_r = Õ_r Ṽ^T M_J^T W^{1/2} V

 This shows that one may easily go from the classical CA coordinates to the cumulative coordinates.
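This passage can be verified numerically; the sketch below (hypothetical table) takes Ṽ as the full J × J orthogonal matrix of right singular vectors of classical CA, so that Ṽ Ṽ^T = I:

```python
import numpy as np

# Numerical check of O = O~ V~' MJ' W^(1/2) V: mapping classical CA row
# coordinates to the cumulative ones. Hypothetical 3x4 table.
N = np.array([[10, 20, 15,  5],
              [ 5, 10, 20, 15],
              [ 2,  8, 10, 30]], dtype=float)
n = N.sum()
J = N.shape[1]
P = N / n
r = P.sum(axis=1); c = P.sum(axis=0)
d = np.cumsum(c)[:-1]
W_half = np.diag(np.sqrt(1.0 / (d * (1.0 - d))))
MJ = np.tril(np.ones((J - 1, J)))
Pc = P - np.outer(r, c)
Dr_mh = np.diag(r ** -0.5)
Dr_inv = np.diag(1.0 / r)

V_classic = np.linalg.svd(Dr_mh @ Pc, full_matrices=True)[2].T   # J x J, orthogonal
O_classic = Dr_inv @ Pc @ V_classic                              # classical row coordinates

X = np.sqrt(n) * Dr_mh @ Pc @ MJ.T @ W_half
V_cum = np.linalg.svd(X, full_matrices=False)[2].T               # right singular vectors
O_cum = Dr_inv @ Pc @ MJ.T @ W_half @ V_cum                      # cumulative row coordinates

O_mapped = O_classic @ V_classic.T @ MJ.T @ W_half @ V_cum
print(np.allclose(O_cum, O_mapped))
```

The relation holds exactly because Ṽ Ṽ^T = I when the full set of right singular vectors is retained.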

SLIDE 18

Relationship between the coordinates in the cumulative analysis and in the classical analysis (2/2)

 Using the same argument we can obtain the classical coordinates from the cumulative coordinates through the relationship

Õ_r = O_r V^T W^{-1/2} (M_J^T)^- Ṽ

where (M_J^T)^- denotes a generalized inverse of M_J^T (M_J is not square).

SLIDE 19

Example: Phadke’s data (1/12)

 To illustrate cumulative correspondence analysis using Taguchi's statistic, D'Ambra, Köksoy, Simonetti (2009) use Phadke's data (1989).

The control factors (6) and their levels (3) of the polysilicon deposition process:

Factor                          Level 1   Level 2   Level 3
A. Deposition temperature (°C)  T0-25     T0        T0+25
B. Deposition pressure (mtorr)  P0-200    P0        P0+200
C. Nitrogen flow (sccm)         N0        N0-150    N0-75
D. Silane flow (sccm)           S0-100    S0-50     S0
E. Settling time (min)          t0        t0+8      t0+16
F. Cleaning method              None      CM2       CM3

SLIDE 20

Example: Phadke’s data (2/12)

Categories of product quality

Category                   Description         Cumulative category
I : 0-3 defects            No surface defects  (I) = I (0-3 defects)
II : 4-30 defects          Very few defects    (II) = I+II (0-30 defects)
III : 31-300 defects       Some defects        (III) = I+II+III (0-300 defects)
IV : 301-1000 defects      Many defects        (IV) = I+II+III+IV (0-1000 defects)
V : 1001 or more defects   Too many defects    (V) = I+II+III+IV+V (all defect counts)

SLIDE 21

Example: Phadke’s data (3/12)

Factor effects for the categorized surface defect data: number of observations in the cumulative categories (left) and probabilities for the cumulative categories (right).

Level  (I) (II) (III) (IV) (V)    (I)  (II) (III) (IV) (V)
A1     34  40   51    53   54     0.63 0.74 0.94  0.98 1.00
A2      7  22   34    41   54     0.13 0.41 0.63  0.76 1.00
A3      8  14   19    32   54     0.15 0.26 0.35  0.59 1.00
B1     25  40   46    51   54     0.46 0.74 0.85  0.94 1.00
B2     20  28   36    43   54     0.37 0.52 0.67  0.80 1.00
B3      4   8   22    32   54     0.07 0.15 0.41  0.59 1.00
C1     19  30   32    39   54     0.35 0.56 0.59  0.72 1.00
C2     11  20   28    39   54     0.20 0.37 0.52  0.72 1.00
C3     19  26   44    48   54     0.35 0.48 0.81  0.89 1.00
D1     20  25   34    41   54     0.37 0.46 0.63  0.76 1.00
D2     13  31   42    44   54     0.24 0.57 0.78  0.81 1.00
D3     16  20   28    41   54     0.30 0.37 0.52  0.76 1.00
E1     21  27   38    43   54     0.39 0.50 0.70  0.80 1.00
E2     16  29   36    42   54     0.30 0.54 0.67  0.78 1.00
E3     12  20   30    41   54     0.22 0.37 0.56  0.76 1.00
F1     21  23   26    34   54     0.39 0.43 0.48  0.63 1.00
F2     21  30   40    46   54     0.39 0.56 0.74  0.85 1.00
F3      7  23   38    46   54     0.13 0.43 0.70  0.85 1.00

SLIDE 22

Example: Phadke’s data (4/12)

 Taguchi's statistic: T = 318,5669

 The Pearson chi-squared statistics for the four contingency tables obtained by aggregating column categories 1 to j and aggregating column categories j + 1 to J:

I vs (II+III+IV+V)       83,209
(I+II) vs (III+IV+V)     79,265
(I+II+III) vs (IV+V)     95,8786
(I+II+III+IV) vs (V)     60,2143

SLIDE 23

The partition of Taguchi's statistic from the contingency table into Pearson chi-squared statistics

Aggregated column categories:

Factor  (I)|(II+III+IV+V)  (I+II)|(III+IV+V)  (I+II+III)|(IV+V)  (I+II+III+IV)|(V)
A1      34 | 20            40 | 14            51 |  3            53 |  1
A2       7 | 47            22 | 32            34 | 20            41 | 13
A3       8 | 46            14 | 40            19 | 35            32 | 22
B1      25 | 29            40 | 14            46 |  8            51 |  3
B2      20 | 34            28 | 26            36 | 18            43 | 11
B3       4 | 50             8 | 46            22 | 32            32 | 22
C1      19 | 35            30 | 24            32 | 22            39 | 15
C2      11 | 43            20 | 34            28 | 26            39 | 15
C3      19 | 35            26 | 28            44 | 10            48 |  6
D1      20 | 34            25 | 29            34 | 20            41 | 13
D2      13 | 41            31 | 23            42 | 12            44 | 10
D3      16 | 38            20 | 34            28 | 26            41 | 13
E1      21 | 33            27 | 27            38 | 16            43 | 11
E2      16 | 38            29 | 25            36 | 18            42 | 12
E3      12 | 42            20 | 34            30 | 24            41 | 13
F1      21 | 33            23 | 31            26 | 28            34 | 20
F2      21 | 33            30 | 24            40 | 14            46 |  8
F3       7 | 47            23 | 31            38 | 16            46 |  8

χ²_1 = 83,2    χ²_2 = 79,3    χ²_3 = 95,9    χ²_4 = 60,2    T = 318,6

SLIDE 24

Example: Phadke’s data (5/12)

The table shows the ANOVA results; following Taguchi (see Nair), A and B are the two most important factors affecting product quality.

Source   AA
A        14,3
B        10,8
C         2,3
D         1,8
E         1,2
F         2,9

SLIDE 25

Example: Phadke’s data (6/12)

The figure shows the graphical representation of the results. The table below gives the distances from the origin to the column points in the figure. We note that the point "I vs (II+III+IV+V)" is the most important because it makes the largest contribution (30,618), as measured by its distance from the origin.

I vs (II+III+IV+V)       30,618
(I+II) vs (III+IV+V)      4,995
(I+II+III) vs (IV+V)      1,658
(I+II+III+IV) vs (V)      0,023

SLIDE 26

Legend: row coordinates of the cumulative ordinal correspondence analysis; column coordinates of the cumulative ordinal correspondence analysis; categories "I" to "V" from classical correspondence analysis; supplementary points of the factors A, B, C, D, E, F.

Example: Phadke’s data (7/12)

The figure shows the row and column categories of the singly ordered cumulative correspondence analysis, the supplementary points of the factors (A, B, C, D, E, F) and the column categories of the classical analysis. A and B are important factors.

SLIDE 27

Example: Phadke’s data (8/12)

Two tables show the distances between the row points and the "I vs (II+III+IV+V)" column point on the first and second factorial axes.

First axis:
Level   A       B       C       D       E       F
1       5,197   2,262   3,353   3,315   1,963   5,749
2       5,536   2,882   6,519   1,715   2,836   0,787
3      10,233  11,153   0,700   5,542   5,772   4,035

Second axis:
Level   A       B       C       D       E       F
1       5,345   4,143   5,777   5,328   4,854   7,438
2       2,192   4,908   4,088   2,476   4,139   4,228
3       5,259   3,745   2,931   3,538   3,803   1,056

SLIDE 28

Example: Phadke’s data (9/12)

The table shows the first- and second-axis solutions based on the minimum distances to "I vs (II+III+IV+V)". We choose this optimal combination.

Factor   1st axis   2nd axis
A        A1         A2
B        B1         B3
C        C3         C3
D        D2         D2
E        E1         E3
F        F2         F3

SLIDE 29

Example: Phadke’s data (10/12)

Comparative results for the optimal factor settings

Probabilities for the cumulative categories:

Method              Solution        (I)    (II)   (III)  (IV)   (V)
MEL                 A1B1C3D2E1F2    0.875  0.959  0.998  0.999  1.000
SCORE               A1B1C3D2E2F2    0.822  0.964  0.998  0.999  1.000
WSNR                A1B1C1D1E2F2    0.896  0.960  0.986  0.996  1.000
AA                  A1B2C1D3E2F2    0.814  0.863  0.942  0.983  1.000
MSD                 A1B1C3D2E1F3    0.617  0.931  0.998  0.999  1.000
STARTING            A2B2C1D3E1F1    0.363  0.435  0.394  0.554  1.000
PROPOSED: 1st axis  A1B1C3D2E1F2    0.875  0.959  0.998  0.999  1.000
PROPOSED: 2nd axis  A2B3C3D2E3F3    0.090  0.590  0.880  0.820  1.000
PROPOSED: plane     A2B1C3D2E2F3    0.090  0.800  0.975  0.984  1.000

MEL = Asiabar and Ghomi (2006); SCORE = Nair (1986); WSNR = Wu and Yeh (2006); AA = Phadke's Accumulation Analysis (1989); MSD = Jeng and Guo (1996); STARTING = starting settings; PROPOSED = D'Ambra, Köksoy and Simonetti

SLIDE 30

Example: Phadke’s data (11/12)

The optimum settings recommended by the first factorial axis solution of cumulative correspondence analysis are A1, B1, C3, D2, E1, F2. By the inverse omega transform, the predicted probability for category (I) is 0.875. The second axis solution (A2, B3, C3, D2, E3, F3) does not seem as powerful, and the probabilities for the cumulative categories are not high enough. The plane solution (A2, B1, C3, D2, E2, F3) in particular gives a very low probability for category (I). As a result, we suggest picking the first axis solution as the optimal solution for Phadke's polysilicon deposition process. The last table shows the comparative results of the solution methods for optimizing the factor settings, according to their predicted probabilities for the cumulative categories. To calculate the optimal probabilities for the cumulative categories, Taguchi uses the omega transform, also known as the logit transform. The omega transform of a probability p is defined by

ω(p) = 10 log_{10} ( p / (1 - p) )
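A minimal implementation of the omega transform and its inverse (the function names are ours):

```python
import math

def omega(p):
    # Taguchi's omega (logit) transform, in decibels: omega(p) = 10 log10(p / (1 - p))
    return 10.0 * math.log10(p / (1.0 - p))

def omega_inverse(w):
    # Inverse omega transform: p = 1 / (1 + 10^(-w/10))
    return 1.0 / (1.0 + 10.0 ** (-w / 10.0))

# Factor-level effects are combined on the omega scale, then mapped back to a
# probability with the inverse transform.
print(omega(0.875), omega_inverse(omega(0.875)))
```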

SLIDE 31

Example: Phadke’s data (12/12)

We observe that the first axis solution is equivalent to the MEL (minimization of expected loss) solution proposed by Asiabar and Ghomi (2006). This solution is a strong candidate among the others, since its probabilities for the cumulative categories are the highest. Asiabar and Ghomi (2006) suggested a technique, called MEL, that minimizes the expected loss in the analysis of ordered categorical data. After the experiment and data collection, the authors define a probability distribution of the data over the categories. In the final step of the MEL algorithm, the expected loss at each level of every factor is calculated, and the optimum level of a factor is the one where the expected loss is lower than at the other levels of that factor.

SLIDE 32

Doubly ordered cumulative correspondence analysis.

 We now explore a generalization of Taguchi's statistic which takes into account the ordering of both variables, by considering the cumulative sums of cell frequencies across both rows and columns.

SLIDE 33

Approach of Cuadras (1/2)

Cuadras (2002) proposed the following approach based on double cumulative frequencies:

D_r^{-1/2} L (P - r c^T) M W_J^{1/2} = U S V^T

 U is the matrix containing the left singular vectors
 S is the diagonal matrix containing the singular values
 V is the matrix containing the right singular vectors
 W_J is the J × J diagonal matrix of weights 1/J
 L is the I × I lower triangular matrix of 1's
 M is the J × J upper triangular matrix of 1's

SLIDE 34

Approach of Cuadras (2/2)

 Disadvantages:

This approach does not decompose any known index. It lacks the property of being the sum of the Pearson chi-squared statistics for the contingency tables obtained by partitioning and pooling the original data (cf. Taguchi).

SLIDE 35

Doubly Cumulative Correspondence Analysis (1/4)

 Starting from the proposal of Beh et al. (2007, 2011), we present a more general approach based on double cumulative frequencies which overcomes these problems and presents some interesting properties.
 Notation:

 R, the 2(I-1) × I matrix obtained by alternating the rows of an (I-1) × I lower triangular matrix of ones (without the row of all ones) with the rows of an (I-1) × I upper triangular matrix of ones (without the row of all ones).
 C, the J × 2(J-1) matrix obtained by alternating the columns of a J × (J-1) upper triangular matrix of ones (without the column of all ones) with the columns of a J × (J-1) lower triangular matrix of ones (without the column of all ones).
 D_R and D_C, the diagonal matrices with the marginals of the doubly cumulative table.
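A sketch of constructors for R and C under the definitions above (the helper names make_R and make_C are ours):

```python
import numpy as np

# R alternates the rows of a lower and an upper triangular matrix of ones,
# each without its all-ones row; C does the same with columns.
def make_R(I):
    R = np.empty((2 * (I - 1), I))
    R[0::2] = np.tril(np.ones((I, I)))[:-1]   # pools rows 1..i
    R[1::2] = np.triu(np.ones((I, I)))[1:]    # pools rows i+1..I
    return R

def make_C(J):
    C = np.empty((J, 2 * (J - 1)))
    C[:, 0::2] = np.triu(np.ones((J, J)))[:, :-1]  # pools columns 1..j
    C[:, 1::2] = np.tril(np.ones((J, J)))[:, 1:]   # pools columns j+1..J
    return C

print(make_R(3))
print(make_C(3))
```

Consecutive row pairs of R (and column pairs of C) pool complementary sets of categories, one pair per cut point.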

SLIDE 36

Doubly Cumulative Correspondence Analysis (2/4)

 CA can be approached by using cumulative frequencies for rows and columns:

D_R^{-1/2} R (P - r c^T) C D_C^{-1/2} = U S V^T

 The row and column coordinates are, respectively,

G_r = D_R^{-1} R (P - r c^T) C V,    G_c = D_C^{-1} C^T (P - r c^T)^T R^T U

SLIDE 37

Doubly Cumulative Correspondence Analysis (3/4)

 The inertia

Q = n (I-1)(J-1) D_R^{-1/2} R (P - r c^T) C D_C^{-1} C^T (P - r c^T)^T R^T D_R^{-1/2},
trace(Q) = n (I-1)(J-1) \sum_k s_k^2

can be considered a generalization of Taguchi's statistic because it takes into account the ordering of both variables.
 It is easy to verify that the trace of Q is identical to the doubly cumulative chi-squared statistic defined by Hirotsu (1986) (used for comparing treatments and for change-point analysis); this approach thus preserves the same property as Taguchi's statistic:

trace(Q) = \sum_{i=1}^{I-1} \sum_{j=1}^{J-1} χ^2_{ij}

where χ^2_{ij} is the Pearson chi-squared statistic for the 2 × 2 contingency table obtained by partitioning and pooling the original table (rows 1 to i against i + 1 to I, columns 1 to j against j + 1 to J).

SLIDE 38

Doubly Cumulative Correspondence Analysis (4/4)

 The CA of the doubly cumulative table, which we call Doubly Cumulative Correspondence Analysis, has the following properties:
 The approach maximizes the chi-squared statistic of each 2 × 2 table and, apart from the constant 2(I-1) × 2(J-1), of the doubly cumulative table.
 All the weighted row and column coordinates are centred.
 The weighted row and column coordinates are centred for the 2 × 2 tables.
 The approach represents the variations between row and column categories, rather than the categories themselves, in the space generated by the cumulative frequencies. The row and column categories can subsequently be projected onto the same space as supplementary points.

SLIDE 39

A unified approach

In order to represent the rows and columns of N we can consider the following SVD, depending on four matrices E, D, B, F and the vector a:

1. Overall approach:

F B (P - a c^T) D E = U S V^T

2. Correspondence Analysis: F = D_r^{-1/2}, a = r, B = I, D = I, E = D_c^{-1/2}

3. Cuadras approach: F = D_r^{-1/2}, a = r, B = I, D = M^T, E = W_J^{1/2}

4. Taguchi decomposition (Beh, D'Ambra, Simonetti, 2007, 2011): F = D_r^{-1/2}, a = r, B = I, D = M_J^T, E = W^{1/2}

5. Doubly Cumulative Correspondence Analysis (Cuadras approach): F = D_r^{-1/2}, a = r, B = L, D = M^T, E = W_J^{1/2}

6. Doubly Cumulative Correspondence Analysis (our approach, Hirotsu decomposition): F = D_R^{-1/2}, a = r, B = R, D = C, E = D_C^{-1/2}

Non Symmetrical Correspondence Analysis: F = D_r^{-1/2}, a = r, B = I, D = I, E = I

SLIDE 40

Example: Van Rijckevorsel’s data (1/8)

 A data matrix that is both RR (row regression dependent) and CR (column regression dependent), following Schriever (1983) and Warrens-Heiser (2009).

The appreciation of five red Bordeaux wines by 200 judges each, using a four-category system from excellent to boring (Van Rijckevorsel, 1987, p. 60):

                       C1         C2    C3        C4
                       excellent  good  mediocre  boring  Total
R1 grand cru classé    87         93    19        1       200
R2 cru Bourgeois       45         126   24        5       200
R3 Bordeaux d'Origine  36         68    74        22      200
R4 vin de marque       0          30    111       59      200
R5 vin de table        0          0     52        148     200
Total                  168        317   280       235     1000

 The rows and columns of the table have been permuted using the scores of the first CA dimension.
 Since the table is both RR and CR, there exists a strong ordinal association between the categories of the two variables, and the five wines can be perfectly ordered from excellent to boring. We therefore use this table to illustrate doubly cumulative correspondence analysis.

SLIDE 41

Example: Van Rijckevorsel’s data (2/8)

Calculating the doubly cumulative table R N C:

N =
87  93  19   1
45 126  24   5
36  68  74  22
 0  30 111  59
 0   0  52 148

R =
1 0 0 0 0
0 1 1 1 1
1 1 0 0 0
0 0 1 1 1
1 1 1 0 0
0 0 0 1 1
1 1 1 1 0
0 0 0 0 1

C =
1 0 1 0 1 0
0 1 1 0 1 0
0 1 0 1 1 0
0 1 0 1 0 1

R N C is the 8 × 6 doubly cumulative table: its rows pool R1 | R2-R5, R1-R2 | R3-R5, R1-R3 | R4-R5, R1-R4 | R5, and its columns pool C1 | C2-C4, C1-C2 | C3-C4, C1-C3 | C4.
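The product R N C can be reproduced directly (Ncum is our name for the resulting 8 × 6 doubly cumulative table; the zeros in N are restored from the marginal totals):

```python
import numpy as np

# Doubly cumulative table R N C for the wine data.
N = np.array([[87,  93,  19,   1],
              [45, 126,  24,   5],
              [36,  68,  74,  22],
              [ 0,  30, 111,  59],
              [ 0,   0,  52, 148]], dtype=float)
I, J = N.shape
R = np.empty((2 * (I - 1), I))
R[0::2] = np.tril(np.ones((I, I)))[:-1]        # pools R1..Ri
R[1::2] = np.triu(np.ones((I, I)))[1:]         # pools R(i+1)..R5
C = np.empty((J, 2 * (J - 1)))
C[:, 0::2] = np.triu(np.ones((J, J)))[:, :-1]  # pools C1..Cj
C[:, 1::2] = np.tril(np.ones((J, J)))[:, 1:]   # pools C(j+1)..C4
Ncum = R @ N @ C                               # 8 x 6 doubly cumulative table
print(Ncum)
```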

SLIDE 42

Example: Van Rijckevorsel’s data (3/8)

 The doubly cumulative chi-squared statistic defined by Hirotsu:

χ²* = \sum_{i=1}^{I-1} \sum_{j=1}^{J-1} χ^2_{ij} = 2609,089

 It is easy to verify this from the Pearson chi-squared statistics χ^2_{ij} of the 2 × 2 tables:

                C1 | C2-C4       C1-C2 | C3-C4   C1-C3 | C4
R1 | R2-R5      127,5058         125,1717        172,3801
R1-R2 | R3-R5    73,5642         411,1867        179,4836
R1-R3 | R4-R5   134,6154         448,6705 (max)  295,9486
R1-R4 | R5       50,4808 (min)   235,4369        354,6447
SLIDE 43

Example: Van Rijckevorsel’s data (4/8)

 It is easy to verify that, apart from the constant n(I-1)(J-1), the total inertia is identical to the doubly cumulative chi-squared statistic defined by Hirotsu.

Eigenvalues and percentages of inertia of the doubly cumulative correspondence analysis:

              F1       F2       F3        Total inertia
Eigenvalue    0,213    0,004    0,001     0,217
Cumulative %  97,749   99,764   100,000

n (I-1)(J-1) = 1000 × 4 × 3 = 12000

n (I-1)(J-1) \sum_k s_k^2 = 12000 × 0,217 ≈ 2609,089 = χ²*
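The whole verification can be scripted: build the doubly cumulative analysis, compute its total inertia, and compare n(I-1)(J-1) Σ s_k² with Hirotsu's statistic obtained by summing the 2 × 2 Pearson chi-squareds over all cut points:

```python
import numpy as np

# Wine data (zeros restored from the marginal totals).
N = np.array([[87,  93,  19,   1],
              [45, 126,  24,   5],
              [36,  68,  74,  22],
              [ 0,  30, 111,  59],
              [ 0,   0,  52, 148]], dtype=float)
n = N.sum()
I, J = N.shape
P = N / n
r = P.sum(axis=1); c = P.sum(axis=0)

# R (8x5) and C (4x6) as defined on slide 35
R = np.empty((2 * (I - 1), I))
R[0::2] = np.tril(np.ones((I, I)))[:-1]
R[1::2] = np.triu(np.ones((I, I)))[1:]
C = np.empty((J, 2 * (J - 1)))
C[:, 0::2] = np.triu(np.ones((J, J)))[:, :-1]
C[:, 1::2] = np.tril(np.ones((J, J)))[:, 1:]

Ncum = R @ N @ C
DR = Ncum.sum(axis=1) / n              # marginals of the doubly cumulative table
DC = Ncum.sum(axis=0) / n
X = (R @ (P - np.outer(r, c)) @ C) / np.sqrt(np.outer(DR, DC))
inertia = float(np.sum(np.linalg.svd(X, compute_uv=False) ** 2))

# Hirotsu's statistic: 2x2 Pearson chi-squared for every cut
# (rows 1..i vs i+1..I, columns 1..j vs j+1..J)
g = np.cumsum(r)[:-1]; h = np.cumsum(c)[:-1]
Q = np.cumsum(np.cumsum(P - np.outer(r, c), axis=0), axis=1)[:-1, :-1]
hirotsu = float(n * np.sum(Q ** 2 / np.outer(g * (1 - g), h * (1 - h))))
print(inertia, hirotsu)
```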

SLIDE 44

Example: Van Rijckevorsel’s data (5/8)

Plot of the doubly ordered cumulative C.A. The maximum distance from the origin is attained by the pair "C1-C2 | C3-C4", "R1-R3 | R4-R5" (maximum chi-squared: 448,67). We note the different variations of C1, C1-C2, C1-C3 from R1, R1-R2, R1-R3, and the different variations of R5, R4-R5, R3-R5 from C4, C3-C4, C2-C4.

[Figure: symmetric plot (axes F1: 97,75 % and F2: 2,01 %; together 99,76 %) showing the row pools R1, R2-R5, R1-R2, R3-R5, R1-R3, R4-R5, R1-R4, R5 and the column pools C1, C2-C4, C1-C2, C3-C4, C1-C3, C4]

[Figure: plot of classical correspondence analysis (Principal Axis 1: 80,7 %, Principal Axis 2: 15,05 %; total 2-D association 95,75 %) showing R1-R5 and C1-C4]

SLIDE 45

Example: Van Rijckevorsel’s data (6/8)

 Let's look first at the position of C1 and R1. Since they are situated near each other in this plot, this suggests that this row category and column category are associated; if we were to look at their positions in the classical plot, they would be located near each other.
 Looking at the position of C1 and C1-C2: these two points are situated fairly close to one another, indicating that there is a small difference between C1 and C2. Since C1-C2 is slightly closer to the origin than C1, C2 is also slightly closer to the origin (in the classical CA plot) than C1.
 Similar comments can be made for the relatively short distance between R1 and R1-R2. Such a distance implies that, in the classical CA plot, R1 and R2 are located near each other.
 If we consider the relative distances between (C1, C1-C2) and (R1, R1-R2), we can see that these two distances are similar. Since C1 is associated with R1, these similar distances imply that C2 and R2 are also similarly positioned in the classical CA plot.
 The relatively similar distances between R1, R1-R2 and R1-R3 suggest that the relative distances between R1, R2 and R3 in the classical CA plot are the same.
 Let's look at the right-hand side of our cumulative plot. C4 and R5 are situated close to each other, implying that in the classical CA plot they will also be situated close to one another.
 The distance between R5 and R4-R5 tells us that R4 is quite different from R5. Since R4-R5 is situated closer to the origin than R5, R4 will be situated closer to the origin than R5.
 The roughly equal distances between R2-R5 (closer to the origin), R3-R5 and R4-R5 (further from the origin) tell us that R2, R3 and R4 are roughly the same distance apart from each other in the classical CA plot.
 What is interesting is that the distances between the pairs (R1, R2-R5), (R1-R2, R3-R5), (R1-R3, R4-R5) and (R1-R4, R5) are about the same, indicating that the cumulative nature of our analysis preserves the relative differences (or similarities) of R1, R2, R3, R4 and R5 that the classical CA plot would reflect.
 All of these conclusions regarding the interpretation of the cumulative correspondence plot are reflected in the classical CA plot.

SLIDE 46

References

 Beh, E. J. (2004), Simple correspondence analysis: A bibliographic review, International Statistical Review, 72, 257-284.
 Beh, E. J., D'Ambra, L., Simonetti, B. (2011), Cumulative correspondence analysis for ordered categorical data using Taguchi's statistic, Communications in Statistics.
 Cuadras, C. M. (2002), Correspondence analysis and diagonal expansions in terms of distribution functions, Journal of Statistical Planning and Inference, 103, 137-150.
 D'Ambra, L., Köksoy, O., Simonetti, B. (2009), Cumulative correspondence analysis of ordered categorical data from industrial experiments, Journal of Applied Statistics, 36, 1315-1328.
 Hirotsu, C. (1986), Cumulative chi-squared statistic as a tool for testing goodness of fit, Biometrika, 73, 165-173.
 Nair, V. N. (1987), Chi-squared type tests for ordered alternatives in contingency tables, Journal of the American Statistical Association, 82, 283-291.
 Taguchi, G. (1974), A new statistical analysis for clinical data, the accumulating analysis, in contrast with the chi-square test, Saishin Igaku, 29, 806-813.
 Warrens, M. J., Heiser, W. J. (2009), Diagnostics for regression dependence in tables re-ordered by the dominant correspondence analysis solution, Computational Statistics and Data Analysis, 53, 3139-3144.