Correspondence Analysis and Moderate Outliers



SLIDE 1

Correspondence Analysis Outliers Confidence regions

Correspondence Analysis and Moderate Outliers

Anna Langovaya, Sonja Kuhnt

TU Dortmund

February 9, 2011

TU Dortmund A.Langovaya, S.Kuhnt CA and Outliers 1 / 23

SLIDE 2

Overview

1 Correspondence Analysis
    Statistical model
    Correspondence Analysis

2 Moderate outliers in contingency tables
    Idea of moderate outliers
    Simulation study design
    Results

3 Spatial confidence regions
    One outlier in the table
    Several outliers in the table
    Outlook

SLIDE 3

Motivation

Behaviour of Correspondence Analysis (CA) with outliers in multidimensional contingency tables
Consider 'outliers' that break independence in the table but are not immediately conspicuous.
Question: How do outliers affect the CA-coordinates?

SLIDE 4

Notation

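$X_1, X_2, X_3$: random variables, $\mathcal{X} = \{1, \dots, I\} \times \{1, \dots, J\} \times \{1, \dots, K\}$
$n_{ijk}$, $i = 1, \dots, I$, $j = 1, \dots, J$, $k = 1, \dots, K$: observed frequency of $(X_1 = i, X_2 = j, X_3 = k)$
$N_{ijk}$: random variables, $n$: sample size
$(N_{111}, \dots, N_{IJK}) \sim \mathrm{Multinomial}(n, (\pi_{111}, \dots, \pi_{IJK}))$
Under the null hypothesis of total independence: $\pi_{ijk} = \pi_{i\cdot\cdot}\,\pi_{\cdot j\cdot}\,\pi_{\cdot\cdot k}$, with

$$\pi_{i\cdot\cdot} = \sum_{j}\sum_{k} \pi_{ijk}, \qquad \pi_{\cdot j\cdot} = \sum_{i}\sum_{k} \pi_{ijk}, \qquad \pi_{\cdot\cdot k} = \sum_{i}\sum_{j} \pi_{ijk}$$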
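
A small numerical check can make the independence model concrete. The sketch below is not part of the slides: the marginal probabilities and the helper names `joint_from_margins` and `margins` are illustrative assumptions. It builds the cell probabilities as a product of three margins and confirms that summing over two indices recovers each margin.

```python
from itertools import product
from math import isclose

def joint_from_margins(p1, p2, p3):
    """Cell probabilities pi[i][j][k] = p1[i] * p2[j] * p3[k] (total independence)."""
    return [[[a * b * c for c in p3] for b in p2] for a in p1]

def margins(pi):
    """One-dimensional margins pi_i.., pi_.j., pi_..k of a 3-way probability array."""
    I, J, K = len(pi), len(pi[0]), len(pi[0][0])
    m1 = [sum(pi[i][j][k] for j in range(J) for k in range(K)) for i in range(I)]
    m2 = [sum(pi[i][j][k] for i in range(I) for k in range(K)) for j in range(J)]
    m3 = [sum(pi[i][j][k] for i in range(I) for j in range(J)) for k in range(K)]
    return m1, m2, m3

# Hypothetical margins for a 2 x 3 x 2 table (illustrative values only).
p1, p2, p3 = [0.4, 0.6], [0.2, 0.3, 0.5], [0.7, 0.3]
pi = joint_from_margins(p1, p2, p3)
m1, m2, m3 = margins(pi)

# Under independence, every cell equals the product of its three margins.
independent = all(
    isclose(pi[i][j][k], m1[i] * m2[j] * m3[k])
    for i, j, k in product(range(2), range(3), range(2))
)
print(independent)
```

An outlier in the sense of these slides is a cell for which this product identity fails.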

SLIDE 5

Example of a 3-dimensional contingency table

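X1    X2    X3 = 1   X3 = 2   ...   X3 = K   Sum
1     1     n111     n112     ...   n11K     n11·
      ...
      j     n1j1     n1j2     ...   n1jK     n1j·
      ...
      J     n1J1     n1J2     ...   n1JK     n1J·
...
i     1     ni11     ni12     ...   ni1K     ni1·
      ...
      j     nij1     nij2     ...   nijK     nij·
      ...
      J     niJ1     niJ2     ...   niJK     niJ·
...
I     1     nI11     nI12     ...   nI1K     nI1·
      ...
      J     nIJ1     nIJ2     ...   nIJK     nIJ·
Sum         n··1     n··2     ...   n··K     n

General entry: nijk in row (i, j) and column k; column sum n··k.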

SLIDE 6

Correspondence analysis

$S$: matrix of standardized residuals, dimension $(I \cdot J) \times K$
Elements of $S$:

$$s_{(ij)k} = \frac{n_{ijk}/n - r_{(ij)}\,c_k}{\sqrt{r_{(ij)}\,c_k}}$$

with $c_k = n_{\cdot\cdot k}/n$ and $r_{(ij)} = n_{ij\cdot}/n$
$D_r = \mathrm{diag}(r_{(11)}, \dots, r_{(IJ)})$, $D_c = \mathrm{diag}(c_1, \dots, c_K)$
Singular value decomposition of $S$: $S = U \Sigma V^T$

Correspondence analysis representation by

$$F = D_r^{-1/2}\, U \Sigma, \qquad G = D_c^{-1/2}\, V \Sigma$$
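
The formulas on this slide translate almost line by line into NumPy. The following is a minimal sketch, not the authors' implementation (the slides use the R package ca); the function name `ca_coordinates` and the small 2 x 2 x 3 count table are hypothetical. It assumes every row sum and every column sum is positive, otherwise the division by the square root of the masses fails.

```python
import numpy as np

def ca_coordinates(table):
    """CA of a 3-way count array with (X1, X2) stacked as rows and X3 as columns.

    Returns (F, G, sv): row coordinates, column coordinates, singular values.
    """
    n_ijk = np.asarray(table, dtype=float)
    I, J, K = n_ijk.shape
    P = n_ijk.reshape(I * J, K) / n_ijk.sum()   # n_ijk / n, rows indexed by (ij)
    r = P.sum(axis=1)                           # r_(ij) = n_ij. / n
    c = P.sum(axis=0)                           # c_k = n_..k / n
    expected = np.outer(r, c)
    S = (P - expected) / np.sqrt(expected)      # standardized residuals
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    F = (U * sv) / np.sqrt(r)[:, None]          # F = D_r^{-1/2} U Sigma
    G = (Vt.T * sv) / np.sqrt(c)[:, None]       # G = D_c^{-1/2} V Sigma
    return F, G, sv

# Hypothetical 2 x 2 x 3 table of counts (illustrative values only).
table = [[[20, 10, 5], [8, 12, 10]],
         [[6, 9, 15], [4, 11, 20]]]
F, G, sv = ca_coordinates(table)
print(F[:, :2])  # first two dimensions, as used in a CA plot
```

The squared singular values sum to the Pearson chi-square statistic divided by n, which is a quick sanity check on any implementation.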

SLIDE 7

Outliers in contingency tables

Idea

Outliers are defined as specific cell frequencies of the underlying contingency table
Outlier: deviation from the null model
Null model: independence model

SLIDE 8

Simulation study design

Procedure 1: Independence

1 Randomly generate marginal probabilities $\pi_{i\cdot\cdot}$, $\pi_{\cdot j\cdot}$, $\pi_{\cdot\cdot k}$
2 Define probabilities $\pi_{ijk} = \pi_{i\cdot\cdot} \cdot \pi_{\cdot j\cdot} \cdot \pi_{\cdot\cdot k}$
3 Simulate $n$ observations from $\mathrm{Multinomial}(n, (\pi_l)_{l=1,\dots,IJK})$, giving the matrix of observations $X(I,J,K)$
4 Apply correspondence analysis (R package: ca)
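
Steps 1 to 3 of Procedure 1 can be sketched with the Python standard library. This is an illustration under stated assumptions, not the authors' code: the slides do not say how the random margins are drawn, so normalised uniform draws are assumed here, and the function names, the seed and n = 1000 are hypothetical. Step 4 (the CA itself) is left out.

```python
import random
from collections import Counter

def random_margin(size, rng):
    """Random probability vector: positive uniform draws normalised to sum to 1."""
    weights = [1.0 - rng.random() for _ in range(size)]  # values in (0, 1]
    total = sum(weights)
    return [w / total for w in weights]

def simulate_independent_table(I, J, K, n, seed=0):
    """Steps 1-3 of Procedure 1; returns (table, probs), both keyed by cell (i, j, k)."""
    rng = random.Random(seed)
    # Step 1: random marginal probabilities.
    p1, p2, p3 = random_margin(I, rng), random_margin(J, rng), random_margin(K, rng)
    cells = [(i, j, k) for i in range(I) for j in range(J) for k in range(K)]
    # Step 2: cell probabilities under total independence.
    probs = {(i, j, k): p1[i] * p2[j] * p3[k] for (i, j, k) in cells}
    # Step 3: n multinomial observations, tallied per cell.
    draws = rng.choices(cells, weights=[probs[c] for c in cells], k=n)
    counts = Counter(draws)
    table = {c: counts[c] for c in cells}  # cells never drawn get an explicit 0
    return table, probs

table, probs = simulate_independent_table(4, 4, 4, n=1000, seed=1)
print(sum(table.values()))
```

Keeping the table as a dictionary keyed by (i, j, k) makes it easy to reshape into the (I·J) × K layout that the CA step expects.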

SLIDE 9

Simulation study design

Procedure 2: with an outlier

1 Randomly generate marginal probabilities $\pi_{i\cdot\cdot}$, $\pi_{\cdot j\cdot}$, $\pi_{\cdot\cdot k}$
2 Define probabilities $\pi_{ijk} = \pi_{i\cdot\cdot} \cdot \pi_{\cdot j\cdot} \cdot \pi_{\cdot\cdot k}$
3 Outlier generation: replace the chosen $\pi_{ijk}$ by $1.2 \cdot \max(\pi_{ijk})$
4 Rescale probabilities so that $\sum_{ijk} \pi_{ijk} = 1$
5 Simulate $n$ observations from $\mathrm{Multinomial}(n, (\pi_l)_{l=1,\dots,IJK})$, giving the matrix of observations $X(I,J,K)$
6 Apply correspondence analysis (R package: ca)
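
The outlier step can be illustrated in a few lines. The sketch below reads step 3 as "set the chosen cell to 1.2 times the largest cell probability", which is an interpretation of the slide; the function name `add_outlier`, the fixed margins, the seed and n = 1000 are hypothetical choices made for the example.

```python
import random
from collections import Counter
from math import isclose

def add_outlier(probs, cell, factor=1.2):
    """Replace probs[cell] by factor * max(probs), then rescale to sum to 1."""
    out = dict(probs)
    out[cell] = factor * max(probs.values())          # step 3: outlier generation
    total = sum(out.values())
    return {c: p / total for c, p in out.items()}     # step 4: rescaling

# Independence probabilities for a hypothetical 2 x 2 x 2 table.
p1, p2, p3 = [0.3, 0.7], [0.4, 0.6], [0.5, 0.5]
probs = {(i, j, k): p1[i] * p2[j] * p3[k]
         for i in range(2) for j in range(2) for k in range(2)}
with_outlier = add_outlier(probs, cell=(0, 0, 0))

# Step 5: simulate n observations from the contaminated probabilities.
rng = random.Random(0)
cells = list(with_outlier)
draws = rng.choices(cells, weights=[with_outlier[c] for c in cells], k=1000)
table = Counter(draws)
print(with_outlier[(0, 0, 0)] / with_outlier[(1, 1, 1)])
```

Because all cells are divided by the same total, the outlier cell stays 1.2 times as likely as the previously largest cell after rescaling, which is what makes the outlier moderate rather than extreme.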

SLIDE 10

Tables with outlier in first cell

[[A]]
X1  X2 |   a    b    c    d   (levels of X3)
 1   1 |  62    0    0    1
 1   2 |   0    0    0    2
 1   3 |   0    0    1    2
 1   4 |   0    0    0    1
 2   1 |   4   22    0   16
 2   2 |  19   45    7   32
 2   3 |   5   42    7   39
 2   4 |  15   50    6   23
 3   1 |   4   25    1   13
 3   2 |  12   73    8   41
 3   3 |  12   60    4   29
 3   4 |  12   58   11   43
 4   1 |   1   11    0    9
 4   2 |  10   29    4   21
 4   3 |   9   31    7   22
 4   4 |   8   19    1   11

[[B]]
X1  X2 |   a    b    c    d   (levels of X3)
 1   1 |  61   20   15    4
 1   2 |   1   64   47   39
 1   3 |   0   21   20   14
 1   4 |   3   53   51   41
 2   1 |   0   11   10    6
 2   2 |   0   46   34   31
 2   3 |   0   20   10   11
 2   4 |   0   27   34   37
 3   1 |   0    3    3    2
 3   2 |   0   11    3    9
 3   3 |   0    4    2    1
 3   4 |   0    5    6    5
 4   1 |   1   11    6    7
 4   2 |   0   32   28   16
 4   3 |   0    7    7    7
 4   4 |   0   31   40   22

[[C]]
X1  X2 |   a    b    c    d   (levels of X3)
 1   1 |  54    5    3    4
 1   2 |   0    4    2   11
 1   3 |   3    2    6    6
 1   4 |   2    1    8    5
 2   1 |   8   15   23   24
 2   2 |  13   29   22   52
 2   3 |  10   40   24   43
 2   4 |   7   18   17   33
 3   1 |  13   22   18   28
 3   2 |  13   45   33   50
 3   3 |  10   42   39   60
 3   4 |   7   33   13   38
 4   1 |   2    2    3    4
 4   2 |   1    5    2    1
 4   3 |   2    2    6    5
 4   4 |   1    0    3    3

[[D]]
X1  X2 |   a    b    c    d   (levels of X3)
 1   1 |  90   13   37   35
 1   2 |  10   21   53   63
 1   3 |   0    2   13    9
 1   4 |   4    7   28   25
 2   1 |   0    1    3    3
 2   2 |   1    3    2    6
 2   3 |   0    0    2    0
 2   4 |   1    1    3    2
 3   1 |   5   18   65   71
 3   2 |   7   19   84   84
 3   3 |   1    5   15   13
 3   4 |   5   17   39   34
 4   1 |   1    3    9   10
 4   2 |   2    1   12   16
 4   3 |   0    0    7    1
 4   4 |   1    3    6    8

SLIDE 11

CA-Plots of 4 simulations with outlier in the first cell

[Figure: CA plots for simulations A, B, C and D; each panel shows row points 1 to 16 and column points a to d]

SLIDE 12

Frequency of the CA-coordinates of the cell 1

[Figure: four histograms of the CA-coordinates of cell 1; axes: Dimension 1, Dimension 2, Frequency. Panels: Independence_Rows, Independence_Columns, Outlier_Rows, Outlier_Columns]

SLIDE 13

Scatter of CA-coordinates (cell 1)

[Figure: four scatter plots of the CA-coordinates of cell 1. Panels: Rows_independence (axes xr, yr), Columns_independence (axes xc, yc), Rows_outlier (axes xr, yr), Columns_outlier (axes xc, yc)]

SLIDE 14

Confidence regions

Based on CA-coordinates define simulated confidence intervals (regions)
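
The slides do not state how the simulated regions are constructed, so the following is only one simple possibility, shown for illustration: a disc around the mean of the simulated coordinates whose radius is the empirical quantile of the distances to that mean. The function names are hypothetical, and the Gaussian points merely stand in for simulated CA-coordinates of one cell.

```python
import math
import random

def disc_region(points, level):
    """Centre and radius of a disc containing at least `level` of the points."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    dists = sorted(math.hypot(x - cx, y - cy) for x, y in points)
    idx = min(n - 1, max(0, math.ceil(level * n) - 1))
    return (cx, cy), dists[idx]

def inside(point, centre, radius):
    """True if the point lies in the closed disc."""
    return math.hypot(point[0] - centre[0], point[1] - centre[1]) <= radius

# Placeholder data: stands in for (Dimension 1, Dimension 2) coordinates
# of cell 1 collected over many simulated tables.
rng = random.Random(0)
coords = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 0.5)) for _ in range(1000)]

centre, radius = disc_region(coords, level=0.8)
coverage = sum(inside(p, centre, radius) for p in coords) / len(coords)
print(coverage)
```

Comparing such a region computed under independence with one computed from tables containing an outlier gives a picture in the spirit of the plots on the following slides, although the shapes used there may differ.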

SLIDE 15

70% confidence regions for table (4x4x4), sample size n = 10000, $10^4$ simulations:

[Figure: simulated confidence regions; axes: Dimension 1, Dimension 2; legend: independence, with an outlier]

SLIDE 16

80% confidence regions for table (5x5x5), n = 10000:

[Figure: simulated confidence regions; axes: Dimension 1, Dimension 2; legend: independence, with an outlier]

SLIDE 17

Adjusted 90% confidence regions: table (5x5x5), n = 10000

[Figure: simulated confidence regions; axes: Dimension 1, Dimension 2; legend: independence, with an outlier]

SLIDE 18

CA-plots with 2 outliers in the table

Row 1 + Columns 1 & 2

(blue circle 1 and red triangles a & b)

[Figure: three CA plots; each panel shows row points 1 to 16 and column points a to d]

SLIDE 19

80% confidence regions of the CA-coordinates (cell 1) for the case of 2 outliers in the table (Row 1 + Columns 1 & 2):

[Figure: simulated confidence regions; axes: Dimension 1, Dimension 2; legend: independence, with outliers]

SLIDE 20

70% confidence regions of the CA-coordinates of cell 1 for the case of 2 outliers in the table, in cells [1,1,1] & [2,2,2]:

[Figure: simulated confidence regions; axes: Dimension 1, Dimension 2; legend: independence, with outliers]

SLIDE 21

Outlook

Analyse distributions of CA-coordinates
Derive theoretical spatial confidence regions
Consider different combinations of Log-Linear Models and CA-Schemes

SLIDE 22

References

Benzecri J.-P. (1992). Correspondence Analysis Handbook. Marcel Dekker, Inc., New York.
Blasius J., Greenacre M. (2006). Multiple Correspondence Analysis and Related Methods. London: Chapman & Hall/CRC.
Kuhnt S. (2004). Outlier Identification Procedures for Contingency Tables using Maximum Likelihood and L1 Estimates. Scandinavian Journal of Statistics, 31, 431-442.
O'Neill M.E. (1978). Distributional Expansions for Canonical Correlations from Contingency Tables. Journal of the Royal Statistical Society, Ser. B, 40(3), 303-312.
Greenacre M., Nenadic O. (2010). ca: Computation and visualization of simple, multiple and joint correspondence analysis. R package version 0.33, URL http://cran.r-project.org/web/packages/ca/

SLIDE 23

Thank you very much!!!
