The Aggregate Prediction Index and Non-Symmetric Correspondence Analysis of Aggregate Data: The 2 x 2 Table



slide-1
SLIDE 1

The Aggregate Prediction Index and Non-Symmetric Correspondence Analysis of Aggregate Data: The 2 x 2 Table

Eric J. Beh

School of Mathematical and Physical Sciences University of Newcastle, Australia

Rosaria Lombardo

Economics Faculty, Second University of Naples, Italy

CARME 2011, Rennes, France – February 8-11

slide-2
SLIDE 2

The 2x2 Contingency Table

Cross-classify a sample of size n according to two dichotomous variables

           Column 1          Column 2          Total
Row 1      $p_{11}$          $p_{12}$          $p_{1\bullet}$
Row 2      $p_{21}$          $p_{22}$          $p_{2\bullet}$
Total      $p_{\bullet 1}$   $p_{\bullet 2}$   1

With only the margins observed, the four cell proportions are unknown (marked ? on the slide).

Define

$$P_1 = \frac{p_{11}}{p_{1\bullet}}, \qquad P_2 = \frac{p_{21}}{p_{2\bullet}}$$

Symmetric association – Pearson chi-squared statistic

$$X^2(P_1 \mid p_{1\bullet}, p_{\bullet 1}) = n \left( \frac{P_1 - p_{\bullet 1}}{p_{2\bullet}} \right)^2 \frac{p_{1\bullet}\, p_{2\bullet}}{p_{\bullet 1}\, p_{\bullet 2}}$$

Aggregate Association Index (Beh, 2010, CS&DA)

“Let us blot out the contents of the table, leaving only the marginal frequencies . . . [they] by themselves supply no information on . . . the proportionality of the frequencies in the body of the table . . . ”

– Fisher (1935)
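The chi-squared expression on this slide treats $P_1$ as the only unknown once the margins are fixed. A minimal sketch of that function follows; the function name, argument names and the margins in the usage lines are illustrative assumptions, not taken from the slides.

```python
def chi_squared_given_p1(P1, n, p1_row, p1_col):
    """Pearson chi-squared of a 2x2 table written as a function of P1.

    P1      : conditional proportion p11 / p1. (unknown if the cells are blotted out)
    n       : sample size
    p1_row  : row 1 marginal proportion (p1.)
    p1_col  : column 1 marginal proportion (p.1)
    """
    p2_row = 1.0 - p1_row
    p2_col = 1.0 - p1_col
    return n * ((P1 - p1_col) / p2_row) ** 2 * (p1_row * p2_row) / (p1_col * p2_col)


# Hypothetical margins (assumption): n = 100, p1. = 0.5, p.1 = 0.4.
for P1 in (0.2, 0.4, 0.6):
    value = chi_squared_given_p1(P1, n=100, p1_row=0.5, p1_col=0.4)
    print(f"P1 = {P1:.1f}  X^2 = {value:.3f}")
```

The statistic is zero when $P_1 = p_{\bullet 1}$ (independence) and grows quadratically as $P_1$ moves away from the column margin.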

slide-3
SLIDE 3


Bounds & Accounting Identity

Duncan & Davis (1953) Bounds

$$L_1 = \max\left(0,\ \frac{n_{\bullet 1} - n_{2\bullet}}{n_{1\bullet}}\right) \le P_1 \le \min\left(\frac{n_{\bullet 1}}{n_{1\bullet}},\ 1\right) = U_1$$

$$L_2 = \max\left(0,\ \frac{n_{\bullet 1} - n_{1\bullet}}{n_{2\bullet}}\right) \le P_2 \le \min\left(\frac{n_{\bullet 1}}{n_{2\bullet}},\ 1\right) = U_2$$

The Accounting Identity (King, 1997; and others)

$$n_{\bullet 1} = n_{1\bullet} P_1 + n_{2\bullet} P_2$$
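The Duncan & Davis bounds and the accounting identity need nothing beyond the marginal counts. The sketch below computes both; the function names and the margins in the usage lines are hypothetical and chosen only for illustration.

```python
def duncan_davis_bounds(n1_row, n2_row, n1_col):
    """Return ((L1, U1), (L2, U2)) for P1 and P2 from the marginal counts alone."""
    L1 = max(0.0, (n1_col - n2_row) / n1_row)
    U1 = min(n1_col / n1_row, 1.0)
    L2 = max(0.0, (n1_col - n1_row) / n2_row)
    U2 = min(n1_col / n2_row, 1.0)
    return (L1, U1), (L2, U2)


def p2_from_accounting_identity(P1, n1_row, n2_row, n1_col):
    """Solve n.1 = n1. * P1 + n2. * P2 for P2."""
    return (n1_col - n1_row * P1) / n2_row


# Hypothetical margins (assumption): row totals 40 and 60, column 1 total 70.
(L1, U1), (L2, U2) = duncan_davis_bounds(40, 60, 70)
print(L1, U1, L2, U2)
# The identity maps one end of the P1 interval to the opposite end of the P2 interval.
print(p2_from_accounting_identity(L1, 40, 60, 70), p2_from_accounting_identity(U1, 40, 60, 70))
```

Because the identity is linear in $P_1$ and $P_2$, fixing $P_1$ anywhere inside its bounds determines $P_2$ exactly.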

slide-4
SLIDE 4

Non-Symmetric Correspondence Analysis

For a 2x2 contingency table: Rows → Predictor Variable, Columns → Response Variable

NSCA (D’Ambra & Lauro, 1989)

$$\pi_{ij} = \frac{p_{ij}}{p_{i\bullet}} - p_{\bullet j}$$

Define $\pi_{ij}$ as the difference between the conditional prediction $p_{ij}/p_{i\bullet}$ (row profile) and the unconditional marginal prediction $p_{\bullet j}$ (column marginal proportion).

Goodman-Kruskal tau index (1954)

$$\tau = \frac{\sum_{i=1}^{2}\sum_{j=1}^{2} p_{i\bullet}\,\pi_{ij}^2}{1 - \sum_{j=1}^{2} p_{\bullet j}^2} = \frac{\tau_{num}}{1 - \sum_{j=1}^{2} p_{\bullet j}^2}$$

Light & Margolin (1971)

$$C = (n-1)\,\tau \sim \chi^2_{1}$$
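As a worked step, the tau index of a 2x2 table can be written in terms of $P_1$ and the margins alone. This is a sketch derived from the definitions $\pi_{ij} = p_{ij}/p_{i\bullet} - p_{\bullet j}$ and $C = (n-1)\tau$, not a formula quoted from the slides, and it assumes the accounting identity in proportion form, $p_{\bullet 1} = p_{1\bullet} P_1 + p_{2\bullet} P_2$.

```latex
\begin{align*}
\pi_{11} &= P_1 - p_{\bullet 1}, & \pi_{12} &= -\pi_{11}, \\
\pi_{21} &= P_2 - p_{\bullet 1}
          = -\frac{p_{1\bullet}}{p_{2\bullet}}\,(P_1 - p_{\bullet 1}), & \pi_{22} &= -\pi_{21}, \\
\tau_{num} &= 2\,p_{1\bullet}\,\pi_{11}^2 + 2\,p_{2\bullet}\,\pi_{21}^2
            = \frac{2\,p_{1\bullet}}{p_{2\bullet}}\,(P_1 - p_{\bullet 1})^2, \\
1 - \sum_{j=1}^{2} p_{\bullet j}^2 &= 2\,p_{\bullet 1}\,p_{\bullet 2}, \\
\tau &= \frac{p_{1\bullet}\,(P_1 - p_{\bullet 1})^2}{p_{2\bullet}\,p_{\bullet 1}\,p_{\bullet 2}}, \\
C(P_1 \mid p_{1\bullet}, p_{\bullet 1}) &= (n-1)\,\frac{p_{1\bullet}\,(P_1 - p_{\bullet 1})^2}{p_{2\bullet}\,p_{\bullet 1}\,p_{\bullet 2}}.
\end{align*}
```

Under these assumptions $C$ is a quadratic in $P_1$ with its minimum of zero at $P_1 = p_{\bullet 1}$, and it differs from the Pearson statistic $X^2(P_1 \mid p_{1\bullet}, p_{\bullet 1})$ only in the factor $n-1$ in place of $n$.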

slide-5
SLIDE 5

Non-Symmetric Correspondence Analysis

Decomposition of $\pi_{ij}$, akin to the SVD and BMD of a general two-way contingency table:

$$\pi_{ij} = \rho\, x_i\, y_j$$

Orthonormality

$$p_{1\bullet}\, x_1^2 + p_{2\bullet}\, x_2^2 = 1, \qquad y_1^2 + y_2^2 = 1$$

Lancaster (1969)

$$x_i = (-1)^{i+1}\left(\frac{p_{(3-i)\bullet}}{p_{i\bullet}}\right)^{1/2}, \qquad y_j = (-1)^{j+1}\,\frac{1}{\sqrt{2}}$$
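The decomposition can be checked numerically. The sketch below assumes the forms $x_i = (-1)^{i+1}\sqrt{p_{(3-i)\bullet}/p_{i\bullet}}$, $y_j = (-1)^{j+1}/\sqrt{2}$ and a signed $\rho = (P_1 - p_{\bullet 1})\sqrt{2 p_{1\bullet}/p_{2\bullet}}$; the function name and the counts are hypothetical.

```python
import math


def nsca_2x2(table):
    """Decompose pi_ij = p_ij / p_i. - p_.j of a 2x2 count table as rho * x_i * y_j."""
    n = sum(sum(row) for row in table)
    p = [[count / n for count in row] for row in table]
    p_row = [sum(row) for row in p]                    # p_1., p_2.
    p_col = [p[0][j] + p[1][j] for j in range(2)]      # p_.1, p_.2
    pi = [[p[i][j] / p_row[i] - p_col[j] for j in range(2)] for i in range(2)]

    # Assumed closed forms for the row and column scores of a 2x2 table.
    x = [math.sqrt(p_row[1] / p_row[0]), -math.sqrt(p_row[0] / p_row[1])]
    y = [1 / math.sqrt(2), -1 / math.sqrt(2)]
    P1 = p[0][0] / p_row[0]
    rho = (P1 - p_col[0]) * math.sqrt(2 * p_row[0] / p_row[1])
    return pi, rho, x, y, p_row


# Hypothetical counts, chosen only to exercise the identities.
pi, rho, x, y, p_row = nsca_2x2([[30, 20], [10, 40]])
for i in range(2):
    for j in range(2):
        assert math.isclose(pi[i][j], rho * x[i] * y[j], abs_tol=1e-12)
assert math.isclose(p_row[0] * x[0] ** 2 + p_row[1] * x[1] ** 2, 1.0)
assert math.isclose(y[0] ** 2 + y[1] ** 2, 1.0)
print("decomposition and orthonormality hold, rho =", round(rho, 4))
```

With a single non-trivial dimension in the 2x2 case, one triple $(\rho, x, y)$ reproduces all four $\pi_{ij}$ exactly.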

slide-6
SLIDE 6


Bounds

Under the hypothesis of independence, $\rho$ is asymptotically a standard normal random variable and can be expressed as a function of $P_1$ and of the marginal information:

$$\rho = \frac{P_1 - p_{\bullet 1}}{x_1\, y_1} = (P_1 - p_{\bullet 1})\sqrt{\frac{2\,p_{1\bullet}}{p_{2\bullet}}}$$

Duncan & Davis (1953) showed that $L_1 \le P_1 \le U_1$, which only requires the marginal information, so that

$$(L_1 - p_{\bullet 1})\sqrt{\frac{2\,p_{1\bullet}}{p_{2\bullet}}} \;\le\; \rho \;\le\; (U_1 - p_{\bullet 1})\sqrt{\frac{2\,p_{1\bullet}}{p_{2\bullet}}}$$

slide-7
SLIDE 7


NSCA and Classical Coordinates

Some insight into the asymmetric association may be gained using NSCA by constructing a classical plot or a biplot. For a classical plot, the coordinates are

$$f_i = x_i\,\rho = (-1)^{i+1}\left(\frac{p_{(3-i)\bullet}}{p_{i\bullet}}\right)^{1/2}\rho, \qquad g_j = y_j\,\rho = (-1)^{j+1}\,\frac{\rho}{\sqrt{2}}$$

These coordinates may be expressed in terms of $P_1$ and the marginal proportions:

$$f_1 = x_1\,\rho = \sqrt{2}\,(P_1 - p_{\bullet 1})$$

$$f_2 = x_2\,\rho = -\sqrt{2}\,(P_1 - p_{\bullet 1})\,\frac{p_{1\bullet}}{p_{2\bullet}}$$

$$g_1 = y_1\,\rho = (P_1 - p_{\bullet 1})\sqrt{\frac{p_{1\bullet}}{p_{2\bullet}}}$$

$$g_2 = y_2\,\rho = -(P_1 - p_{\bullet 1})\sqrt{\frac{p_{1\bullet}}{p_{2\bullet}}}$$
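The classical-plot coordinates depend only on $P_1$ and the margins, so they can be traced out for any candidate value of $P_1$. The sketch below assumes $f_1 = \sqrt{2}(P_1 - p_{\bullet 1})$, $f_2 = -\sqrt{2}(P_1 - p_{\bullet 1})\,p_{1\bullet}/p_{2\bullet}$ and $g_1 = -g_2 = (P_1 - p_{\bullet 1})\sqrt{p_{1\bullet}/p_{2\bullet}}$; the function name and input values are hypothetical.

```python
import math


def classical_coordinates(P1, p1_row, p1_col):
    """Row coordinates (f1, f2) and column coordinates (g1, g2) for a 2x2 table."""
    p2_row = 1.0 - p1_row
    d = P1 - p1_col                      # departure of P1 from the column margin
    f = (math.sqrt(2) * d, -math.sqrt(2) * d * p1_row / p2_row)
    g = (d * math.sqrt(p1_row / p2_row), -d * math.sqrt(p1_row / p2_row))
    return f, g


# Hypothetical values (assumption): P1 = 0.6, p1. = 0.5, p.1 = 0.4.
f, g = classical_coordinates(0.6, 0.5, 0.4)
print("rows:", f, "columns:", g)

# Both clouds carry the same inertia rho^2 = 2 * (P1 - p.1)^2 * p1. / p2.
rho_sq = 2 * (0.6 - 0.4) ** 2 * 0.5 / 0.5
assert math.isclose(0.5 * f[0] ** 2 + 0.5 * f[1] ** 2, rho_sq)
assert math.isclose(g[0] ** 2 + g[1] ** 2, rho_sq)
```

All four coordinates collapse to the origin when $P_1 = p_{\bullet 1}$, which is the independence case.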

slide-8
SLIDE 8


NSCA and the Biplot

To depict the asymmetric relationship between row and column categories, consider a row metric preserving biplot (Kroonenberg & Lombardo, 1999). The biplot coordinates for the ith row are

$$f_i = x_i\,\rho = (-1)^{i+1}\left(\frac{p_{(3-i)\bullet}}{p_{i\bullet}}\right)^{1/2}\rho$$

and for the jth column

$$g_j = y_j = (-1)^{j+1}\,\frac{1}{\sqrt{2}}$$

The row isometric biplot is used to project the column coordinates onto the line defined by the row coordinates: the shorter the distance, the stronger the predictability.

Bounds can be computed for the coordinates. For example,

$$\sqrt{2}\,(L_1 - p_{\bullet 1}) \;\le\; f_1 \;\le\; \sqrt{2}\,(U_1 - p_{\bullet 1})$$

slide-9
SLIDE 9

Bounds of P1

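100(1 – α)% confidence bounds under the null hypothesis of independence:

$$L_\alpha^{*} = p_{\bullet 1} - Z_{\alpha/2}\sqrt{\frac{p_{2\bullet}}{p_{1\bullet}}\cdot\frac{1 - \sum_{j=1}^{2} p_{\bullet j}^2}{2\,(n-1)}} \;<\; P_1 \;<\; p_{\bullet 1} + Z_{\alpha/2}\sqrt{\frac{p_{2\bullet}}{p_{1\bullet}}\cdot\frac{1 - \sum_{j=1}^{2} p_{\bullet j}^2}{2\,(n-1)}} = U_\alpha^{*}$$

$$L_\alpha = \max\left(0,\ L_\alpha^{*}\right) \le P_1 \le \min\left(1,\ U_\alpha^{*}\right) = U_\alpha$$

Given α and the aggregate data, there is a significant asymmetric association between the two dichotomous variables if

$$L_1 < P_1 < L_\alpha \quad \text{or} \quad U_\alpha < P_1 < U_1$$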
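The interval of $P_1$ values for which independence is not rejected can be computed from the margins. This sketch assumes the half-width $Z_{\alpha/2}\sqrt{(p_{2\bullet}/p_{1\bullet})(1 - \sum_j p_{\bullet j}^2)/(2(n-1))}$, which is what solving $(n-1)\tau \le Z_{\alpha/2}^2$ for $P_1$ gives. The function name is hypothetical, and the usage lines take the margins of the twin example used later in the slides; the row totals 13 and 17 are an assumption, since the table itself is not reproduced in the transcript.

```python
import math
from statistics import NormalDist


def independence_bounds(n, p1_row, p1_col, alpha=0.05):
    """Return (L_alpha, U_alpha): the P1 values compatible with independence."""
    p2_row = 1.0 - p1_row
    p2_col = 1.0 - p1_col
    z = NormalDist().inv_cdf(1 - alpha / 2)            # Z_{alpha/2}
    spread = 1.0 - p1_col ** 2 - p2_col ** 2           # 1 - sum_j p_.j^2
    half_width = z * math.sqrt((p2_row / p1_row) * spread / (2 * (n - 1)))
    return max(0.0, p1_col - half_width), min(1.0, p1_col + half_width)


# Margins of the twin example (row totals 13 and 17 are an assumption).
L_a, U_a = independence_bounds(n=30, p1_row=13 / 30, p1_col=12 / 30)
print(round(L_a, 3), round(U_a, 3))   # roughly 0.196 and 0.604
```

These values are in line with the "no prediction" region $0.19 \le P_1 \le 0.60$ quoted for that example.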

 

slide-10
SLIDE 10

Aggregate Prediction Index (API)

Consider a plot of the C-statistic versus $P_1$. If the area under C but above $\chi^2_\alpha$ is large, then there is evidence that the row categories are good predictors of the column categories.

[Figure: C-statistic (vertical axis) against $P_1$ (horizontal axis, 0 to 1), marking $L_1$, $L_\alpha$, $p_{\bullet 1}$, $U_\alpha$, $U_1$ and the critical value $\chi^2_\alpha$; the regions where C exceeds $\chi^2_\alpha$ are labelled "Statistically significant association".]

slide-11
SLIDE 11

Aggregate Prediction Index (API)

This area may be calculated by

$$API_\alpha = 100\left(1 - \frac{\chi^2_\alpha\left[(L_\alpha - L_1) + (U_1 - U_\alpha)\right] + \int_{L_\alpha}^{U_\alpha} C(P_1 \mid p_{1\bullet}, p_{\bullet 1})\, dP_1}{\int_{L_1}^{U_1} C(P_1 \mid p_{1\bullet}, p_{\bullet 1})\, dP_1}\right)$$

[Figure: the C-statistic plotted against $P_1$, with $L_1$, $L_\alpha$, $p_{\bullet 1}$, $U_\alpha$, $U_1$ and $\chi^2_\alpha$ marked, as on the previous slide.]
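The API formula can be evaluated in closed form when $C$ is quadratic in $P_1$. The sketch below assumes $C(P_1 \mid p_{1\bullet}, p_{\bullet 1}) = k\,(P_1 - p_{\bullet 1})^2$ with $k = (n-1)\,p_{1\bullet}/(p_{2\bullet}\,p_{\bullet 1}\,p_{\bullet 2})$, which follows from $C = (n-1)\tau$ for a 2x2 table, and it clips $L_\alpha$ and $U_\alpha$ to the Duncan & Davis interval. The function name and margins are hypothetical, and the result is not claimed to reproduce the numerical values quoted in the slides.

```python
import math
from statistics import NormalDist


def aggregate_prediction_index(n1_row, n2_row, n1_col, alpha=0.05):
    """API (in percent) from the marginal counts of a 2x2 table."""
    n = n1_row + n2_row
    p1_row, p2_row = n1_row / n, n2_row / n
    p1_col = n1_col / n
    p2_col = 1.0 - p1_col

    # Assumed quadratic form: C(P1) = k * (P1 - p.1)^2.
    k = (n - 1) * p1_row / (p2_row * p1_col * p2_col)

    # Duncan & Davis bounds on P1.
    L1 = max(0.0, (n1_col - n2_row) / n1_row)
    U1 = min(n1_col / n1_row, 1.0)

    # Critical value of chi-squared with 1 df, as a squared normal quantile.
    crit = NormalDist().inv_cdf(1 - alpha / 2) ** 2
    half_width = math.sqrt(crit / k)
    # Assumption: clip to [L1, U1] so both interval lengths stay non-negative.
    L_a = max(L1, p1_col - half_width)
    U_a = min(U1, p1_col + half_width)

    def area(a, b):
        """Integral of k * (P1 - p.1)^2 over [a, b]."""
        return k * ((b - p1_col) ** 3 - (a - p1_col) ** 3) / 3

    numerator = crit * ((L_a - L1) + (U1 - U_a)) + area(L_a, U_a)
    return 100 * (1 - numerator / area(L1, U1))


# Hypothetical margins (assumption): row totals 40 and 60, column 1 total 70.
print(round(aggregate_prediction_index(40, 60, 70), 2))
```

A value near 100 means almost all of the area under $C$ lies above the critical value, while a value of 0 means no feasible $P_1$ would give a significant $C$.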

slide-12
SLIDE 12

Example – Fisher’s Twin Data

Fisher's data cover 30 criminal twins, classified according to whether they are monozygotic or dizygotic twins. The table also classifies whether the twins have been convicted of a criminal offence.

The Goodman-Kruskal tau index is 0.434 and the C-statistic is 12.597.

The p-value is 0.0004, so the type of twin is a good predictor of the conviction status of a criminal.
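The reported values can be checked from the full table. The cell counts below (10 and 3 for monozygotic, 2 and 15 for dizygotic twins) are the commonly quoted version of Fisher's twin data and are an assumption here, since the table itself is not reproduced in the transcript; the function name is also hypothetical.

```python
import math
from statistics import NormalDist


def goodman_kruskal_tau(table):
    """Goodman-Kruskal tau with rows as predictor and columns as response."""
    n = sum(sum(row) for row in table)
    p = [[count / n for count in row] for row in table]
    p_row = [sum(row) for row in p]
    p_col = [sum(p[i][j] for i in range(len(p))) for j in range(len(p[0]))]
    numerator = sum(
        p_row[i] * (p[i][j] / p_row[i] - p_col[j]) ** 2
        for i in range(len(p))
        for j in range(len(p[0]))
    )
    return numerator / (1 - sum(q * q for q in p_col))


# Assumed cell counts: rows = monozygotic, dizygotic; columns = convicted, not convicted.
twins = [[10, 3], [2, 15]]
n = sum(sum(row) for row in twins)
tau = goodman_kruskal_tau(twins)
C = (n - 1) * tau
# Upper tail of chi-squared with 1 df, written via the standard normal.
p_value = 2 * (1 - NormalDist().cdf(math.sqrt(C)))
print(round(tau, 3), round(C, 3), round(p_value, 4))   # 0.434 12.597 0.0004
```

With these counts the sketch agrees with the tau index, C-statistic and p-value stated on the slide.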

slide-13
SLIDE 13

Example – Fisher’s Twin Data

But, as Fisher (1935) did, suppose we “blot out” the cells of the table.

Question: What information do the margins provide in understanding the extent to which the variables are associated?

We shall

  • consider the non-symmetric correspondence analysis using only the aggregate data, and
  • calculate the aggregate prediction index.

slide-14
SLIDE 14

Example – Fisher’s Twin Data

$p_{\bullet 1} = 12/30 = 0.4$

[Equation and figure: the C-statistic as a function of $P_1$ for these margins, plotted over $0.00 \le P_1 \le 0.92$.]

No prediction when $0.19 \le P_1 \le 0.60$

$API_{0.05} = 56.85$

If we consider the 5% level of significance, the margins provide strong evidence that there may exist a significant prediction of conviction status given twin type.

slide-15
SLIDE 15

Example – Fisher’s Twin Data

Classical plot proposed in CA of aggregate data (Beh, 2008)

[Figure: classical plot showing the row points (monozygotic, dizygotic) and the column points (convicted, not convicted).]

No prediction if $0.19 \le P_1 \le 0.60$

slide-16
SLIDE 16

Example – Fisher’s Twin Data

Row isometric biplot

[Figure: row isometric biplot showing the row points (monozygotic, dizygotic) and the column points (convicted, not convicted).]

  • No prediction if $0.19 \le P_1 \le 0.60$
  • Inverse prediction if $0.00 \le P_1 \le 0.19$
  • Direct prediction if $0.60 \le P_1 \le 0.92$

slide-17
SLIDE 17

Discussion

  • The API provides an indication of the extent to which two dichotomous variables are significantly asymmetrically related, given only the marginal information.
  • Investigate the positive and negative association of the API.
  • This index is not meant to infer the individual-level association of the variables, but to provide a measure reflecting how likely the two variables may be associated.

Further issues include:

  • Investigating the applicability of this index for G (>1) 2x2 tables, including incorporating covariate information (ecological inference).

slide-18
SLIDE 18

Bibliography

  • Barnard, G. A. (1984), "Comments to Tests of significance for 2x2 contingency tables", Journal of the Royal Statistical Society, Series A, 147, 449-450.
  • Beh, E. J. (2008), "Correspondence analysis of aggregate data: the 2x2 table", Journal of Statistical Planning and Inference, 138, 2941-2952.
  • Beh, E. J. (2010), "The aggregate association index", Computational Statistics & Data Analysis, 6, 58-72.
  • D’Ambra, L. and Lauro, N. C. (1989), "Non-symmetrical correspondence analysis for three-way contingency table", in R. Coppi & S. Bolasco (Eds.), Multiway Data Analysis (pp. 301-315), Amsterdam: Elsevier.
  • Duncan, O. D. and Davis, B. (1953), "An alternative to ecological correlation", American Sociological Review, 18, 665-666.
  • Fisher, R. A. (1935), "The logic of inductive inference" (with discussion), Journal of the Royal Statistical Society, Series A, 98, 39-82.
  • Goodman, L. A. (1959), "Some alternatives to ecological correlation", The American Journal of Sociology, 64, 610-625.
  • Goodman, L. A. and Kruskal, W. H. (1954), "Measures of association for cross classifications", Journal of the American Statistical Association, 49, 732-764.
  • Greenacre, M. J. (1984), The Theory and Application of Correspondence Analysis, London: Academic Press.
  • King, G. (1997), A Solution to the Ecological Inference Problem, Princeton, USA: Princeton University Press.
  • Kroonenberg, P. M. and Lombardo, R. (1999), "Non-symmetric correspondence analysis: A tool for analysing contingency tables with a dependence structure", Multivariate Behavioral Research, 34, 367-396.
  • Plackett, R. L. (1977), "The marginal totals of a 2x2 table", Biometrika, 64, 37-42.
  • Rayner, J. C. W. and Best, D. J. (1996), "Smooth extensions of Pearson's product moment correlation and Spearman's rho", Statistics and Probability Letters, 30, 171-177.
  • Yates, F. (1984), "Tests of significance for 2x2 contingency tables" (with discussion), Journal of the Royal Statistical Society, Series A, 147, 426-463.
