An evolutionary analysis of association patterns (Alfonso Iodice D’Enza)

SLIDE 1

Introduction Notation Criterion Procedure Example

An evolutionary analysis of association patterns

Alfonso Iodice D’Enza¹, Francesco Palumbo²

Correspondence Analysis and Related MEthods 2011, Rennes, 8-11 February 2011

¹ Università di Cassino
² Università degli Studi di Napoli

SLIDE 2

1. Introduction
2. Notation
3. Criterion
4. Procedure
5. Example

SLIDE 3

Background

A common approach to finding patterns of association in high-dimensional, sparse data is to combine dimension reduction and clustering techniques.

Quantitative data

  • Tandem analysis [Arabie and Hubert(1994)]
  • Factorial K-means [Vichi and Kiers(2001)]

Qualitative data

  • multiple correspondence analysis and clustering [Hwang et al.(2006)]
  • non-symmetric correspondence analysis and clustering [Palumbo and Iodice D’Enza(2010)]

SLIDE 4

Aim and scope

This contribution presents a dynamic clustering procedure for high-dimensional binary data arranged into successive batches: the first data batch is used to determine a ‘starting’ solution, which is updated as further data batches are processed.

The problem is two-fold:

  • clustering very large data sets, or data produced at a high rate (data flows);
  • performing a comparative analysis of data stratified according to time or space.

SLIDE 5

Notation and data structures

  • $n$: number of statistical units;
  • $p$: number of binary attributes;
  • $K$: number of groups of statistical units;
  • $Z_j$, $j = 1, \ldots, p$: Bernoulli-distributed attribute with parameter $\pi_j$, where $z$ indicates success and $\bar{z}$ failure;
  • $X = (X_1, X_2, \ldots, X_K)$: multinomially distributed random vector with parameters $(n; \pi_1, \pi_2, \ldots, \pi_K)$, where the $\pi_k$ ($k = 1, \ldots, K$) are unknown.

SLIDE 6

Criterion

Cross-classification table $F$ of $X$ and a single binary attribute $Z$:

                  Z
             z        z̄        total
  X    1     f_11     f_12     f_1+
       ...   ...      ...      ...
       K     f_K1     f_K2     f_K+
    total    f_+1     f_+2     n
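As an illustration, the table $F$ can be tabulated from a vector of group labels and one binary attribute. The sketch below is not taken from the slides: the function name `cross_classification`, the 0-based integer group labels and the toy data are all assumptions made for the example.

```python
import numpy as np

def cross_classification(groups, z, K):
    """Tabulate the K x 2 table F of group membership against one binary attribute.

    Column 0 counts units with z = 1 (success), column 1 units with z = 0 (failure).
    Group labels are assumed to be integers 0, ..., K-1 (an illustrative convention).
    """
    groups = np.asarray(groups)
    z = np.asarray(z).astype(bool)
    return np.column_stack([
        np.bincount(groups[z], minlength=K),
        np.bincount(groups[~z], minlength=K),
    ])

# Hypothetical toy data: six units assigned to three groups.
groups = [0, 0, 1, 2, 2, 2]
z = [1, 0, 1, 1, 1, 0]
F = cross_classification(groups, z, K=3)
print(F)              # cell counts f_kh
print(F.sum(axis=1))  # row margins f_k+
print(F.sum(axis=0))  # column margins f_+h
```

The row and column sums of the returned array correspond to the margins $f_{k+}$ and $f_{+h}$ of the table.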

SLIDE 7

Criterion

The qualitative variance, or heterogeneity, of $X$ can be defined by the Gini index

$$G(X) = 1 - \sum_{k=1}^{K} \left( \frac{f_{k+}}{n} \right)^2 = 1 - \sum_{k=1}^{K} \frac{f_{k+}^2}{n^2}.$$

The variation of $X$ within the categories of the variable $Z$ is obtained by averaging $G(X \mid z)$ and $G(X \mid \bar{z})$:

$$G(X \mid Z) = \sum_{h=1}^{2} \frac{f_{+h}}{n} \left( 1 - \sum_{k=1}^{K} \frac{f_{kh}^2}{f_{+h}^2} \right) = 1 - \frac{1}{n} \sum_{k=1}^{K} \sum_{h=1}^{2} \frac{f_{kh}^2}{f_{+h}}.$$

SLIDE 8

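Criterion

The variation of $X$ explained by the categories of $Z$ is

$$G(X) - G(X \mid Z) = 1 - \sum_{k=1}^{K} \frac{f_{k+}^2}{n^2} - \left( 1 - \frac{1}{n} \sum_{k=1}^{K} \sum_{h=1}^{2} \frac{f_{kh}^2}{f_{+h}} \right) = \frac{1}{n} \sum_{k=1}^{K} \sum_{h=1}^{2} \frac{f_{kh}^2}{f_{+h}} - \frac{1}{n} \sum_{k=1}^{K} \frac{f_{k+}^2}{n}.$$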

SLIDE 9

Criterion

In the case of $p$ binary attributes, the criterion to be maximized is

$$\sum_{j=1}^{p} \left( G(X) - G(X \mid Z_j) \right),$$

that is, the sum of the variances of $X$ explained by each of the attributes $Z_j$.
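A small numerical sketch may help to check the Gini-based criterion. The function names (`gini`, `explained_heterogeneity`, `criterion`), the 0-based group labels and the toy tables are assumptions introduced here; only the formulas come from the slides.

```python
import numpy as np

def gini(counts):
    """Gini heterogeneity index 1 - sum_k (f_k / n)^2 of a vector of counts."""
    counts = np.asarray(counts, dtype=float)
    return 1.0 - np.sum((counts / counts.sum()) ** 2)

def explained_heterogeneity(F):
    """G(X) - G(X | Z) for a K x 2 table F, using the closed form of the slides."""
    F = np.asarray(F, dtype=float)
    n = F.sum()
    row = F.sum(axis=1)          # f_k+
    col = F.sum(axis=0)          # f_+h
    nz = col > 0                 # an empty category contributes nothing
    within = np.sum(F[:, nz] ** 2 / col[nz]) / n
    return within - np.sum(row ** 2) / n ** 2

def criterion(groups, Z, K):
    """Sum over the p attributes (columns of the 0/1 matrix Z) of G(X) - G(X | Z_j)."""
    groups = np.asarray(groups)
    Z = np.asarray(Z).astype(bool)
    total = 0.0
    for j in range(Z.shape[1]):
        F = np.column_stack([
            np.bincount(groups[Z[:, j]], minlength=K),
            np.bincount(groups[~Z[:, j]], minlength=K),
        ])
        total += explained_heterogeneity(F)
    return total

# Hypothetical 3 x 2 table: compare the closed form with the definition.
F = np.array([[1, 1], [1, 0], [2, 1]])
direct = gini(F.sum(axis=1)) - sum(
    F[:, h].sum() / F.sum() * gini(F[:, h]) for h in range(2)
)
print(explained_heterogeneity(F), direct)
```

For this toy table both computations give $1/36$, which is the value obtained by working through the two formulas by hand.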

SLIDE 10

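Algebraic formalization

Quantity to maximize:

$$\operatorname{tr}\left[ \frac{1}{n} F \Delta^{-1} F^{T} - \frac{1}{n^2} X^{T} \mathbf{1} \mathbf{1}^{T} X \right] \equiv \operatorname{tr}\left[ \frac{1}{n} X^{T} Z \Delta^{-1} Z^{T} X - \frac{1}{n^2} X^{T} \mathbf{1} \mathbf{1}^{T} X \right],$$

where $X$ is an $(n \times K)$ matrix with $x_{ik} = 1$ if unit $i$ is assigned to group $k$ (and 0 otherwise), $F = [F_1 \ldots F_p] = X^{T} Z$, $\Delta = \operatorname{diag}(Z^{T} Z)$ and $\mathbf{1}$ is an $n$-dimensional vector of ones.

Eigenvalue decomposition:

$$\frac{1}{n} \left[ X^{T} Z \Delta^{-1} Z^{T} X - \frac{1}{n} X^{T} \mathbf{1} \mathbf{1}^{T} X \right] U = U \Lambda.$$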
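The matrix form can be checked numerically on a toy example. The sketch below rests on an assumption the slides do not state explicitly: $Z$ is coded with two indicator columns per attribute (one for $z$, one for $\bar{z}$), so that the columns of $F = X^{T} Z$ match the cells $f_{kh}$. The function name `criterion_matrix` and the data are hypothetical.

```python
import numpy as np

def criterion_matrix(X, Z):
    """(1/n) X^T Z D^-1 Z^T X - (1/n^2) X^T 1 1^T X, with D = diag(Z^T Z).

    Assumes every column of the 0/1 matrix Z contains at least one 1.
    """
    n = X.shape[0]
    F = X.T @ Z                    # cross-tabulation of groups by categories
    delta = Z.sum(axis=0)          # diagonal of Z^T Z for 0/1 columns
    size = X.sum(axis=0)           # X^T 1: group sizes
    return (F / delta) @ F.T / n - np.outer(size, size) / n ** 2

# Hypothetical toy data: six units, three groups, one binary attribute.
groups = np.array([0, 0, 1, 2, 2, 2])
z = np.array([1, 0, 1, 1, 1, 0], dtype=float)
X = np.eye(3)[groups]              # indicator matrix of the grouping
Z = np.column_stack([z, 1.0 - z])  # assumed coding: columns for z and z-bar

M = criterion_matrix(X, Z)
lam, U = np.linalg.eigh(M)         # M is symmetric, so eigh applies
print(np.trace(M), lam.sum())
```

With a single attribute the trace coincides with $G(X) - G(X \mid Z)$ computed from the corresponding $3 \times 2$ table, here $1/36$, and it equals the sum of the eigenvalues.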

SLIDE 11

Back to the aim

This contribution presents a dynamic clustering procedure for high-dimensional binary data arranged into successive batches: the first data batch is used to determine a ‘starting’ solution, which is updated as further data batches are processed.

The problem is two-fold:

  • clustering very large data sets, or data produced at a high rate (data flows);
  • performing a comparative analysis of data stratified according to time or space.

SLIDE 13

The overall procedure

The proposed procedure consists of three phases:

  • phase 1, analysis of the starting batch: the i-FCB³ procedure is applied to obtain the starting solution [Palumbo and Iodice D’Enza(2010)];
  • phase 2, new batch processing: incoming statistical units are assigned to the K groups;
  • phase 3, updating process: all the quantities are updated according to the new data.

Phases 2 and 3 are repeated for each new data batch.

³ i-FCB: iterative factorial clustering of binary data

SLIDE 14

phase 1: starting batch

The i-FCB iterative algorithm runs over the following steps:

  • step 0: pseudo-random generation of the matrix $X$;
  • step 1: an eigenvalue decomposition is performed on the matrix resulting from expression (1), obtaining the matrix $\Psi$ such that

$$\Psi = \left( Z \Delta^{-1} Z^{T} - \frac{1}{n} \mathbf{1} \mathbf{1}^{T} \right) X U \Lambda^{\frac{1}{2}}; \qquad (1)$$

  • step 2: the matrix $X$ is updated by running a non-hierarchical clustering algorithm based on squared Euclidean distances (k-means) on the projected statistical units (the $\Psi$ matrix).

Steps 1 and 2 are iterated until the stopping rule is satisfied: the quantity in (1) no longer increases significantly from one iteration to the next.
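The steps above can be sketched as follows. This is an illustrative reading of the slide, not the authors' implementation: it assumes that $Z$ holds two indicator columns per attribute, that the criterion is the trace of the matrix being decomposed, and that k-means is a plain Lloyd iteration started from the current grouping. All function names, defaults and the toy data are hypothetical.

```python
import numpy as np

def one_hot(labels, K):
    """(n x K) indicator matrix X with x_ik = 1 if unit i is in group k."""
    X = np.zeros((labels.size, K))
    X[np.arange(labels.size), labels] = 1.0
    return X

def criterion_matrix(X, Z, delta):
    """(1/n) X^T Z D^-1 Z^T X - (1/n^2) X^T 1 1^T X, with D = diag(delta)."""
    n = X.shape[0]
    F = X.T @ Z
    size = X.sum(axis=0)
    return (F / delta) @ F.T / n - np.outer(size, size) / n ** 2

def kmeans(points, labels, K, rng, n_iter=20):
    """Plain Lloyd iterations started from the current grouping (an assumption)."""
    for _ in range(n_iter):
        centers = np.empty((K, points.shape[1]))
        for k in range(K):
            members = points[labels == k]
            # An empty group is re-seeded with a randomly chosen unit.
            centers[k] = members.mean(axis=0) if len(members) else points[rng.integers(len(points))]
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels

def ifcb(B, K, max_iter=50, tol=1e-8, seed=0):
    """Sketch of the iteration for an (n x p) 0/1 matrix B; returns labels and criterion."""
    rng = np.random.default_rng(seed)
    B = np.asarray(B, dtype=float)
    n = B.shape[0]
    Z = np.hstack([B, 1.0 - B])            # assumed coding: z and z-bar columns
    Z = Z[:, Z.sum(axis=0) > 0]            # drop empty categories
    delta = Z.sum(axis=0)                  # diagonal of Z^T Z
    labels = rng.integers(K, size=n)       # step 0: pseudo-random grouping
    value = -np.inf
    for _ in range(max_iter):
        X = one_hot(labels, K)
        M = criterion_matrix(X, Z, delta)
        new_value = np.trace(M)
        if new_value - value < tol:        # stopping rule: no significant increase
            break
        value = new_value
        lam, U = np.linalg.eigh(M)         # step 1: eigenvalue decomposition
        XU = X @ U
        # Psi = (Z D^-1 Z^T - (1/n) 1 1^T) X U Lambda^(1/2), without forming n x n matrices.
        Psi = ((Z / delta) @ (Z.T @ XU) - XU.sum(axis=0) / n) * np.sqrt(np.clip(lam, 0.0, None))
        labels = kmeans(Psi, labels, K, rng)   # step 2: regroup the projected units
    return labels, np.trace(criterion_matrix(one_hot(labels, K), Z, delta))

# Hypothetical toy run on unstructured data.
B = (np.random.default_rng(1).random((60, 6)) < 0.5).astype(int)
labels, value = ifcb(B, K=3)
print(np.bincount(labels, minlength=3), value)
```

Computing $\Psi$ as `(Z / delta) @ (Z.T @ XU)` avoids building the $n \times n$ matrix $Z \Delta^{-1} Z^{T}$, which matters when $n$ is large.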

SLIDE 15

convergence of the criterion

Figure: number of iterations versus value of the criterion over 1000 repetitions (panels: unstructured data, structured data).

SLIDE 16

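phase 3: updating process

  • update of the number of units: $n^{*} = n + n^{+}$;
  • update of the cross-tabulation block matrix: $F^{*} = F + F^{+}$, with $F^{+} = X^{+T} Z^{+}$;
  • update of the diagonal matrix of margins: $\Delta^{*} = \Delta + \Delta^{+}$, with $\Delta^{+} = \operatorname{diag}(Z^{+T} Z^{+})$;
  • update of the eigenvalue decomposition:

$$\frac{1}{n^{*}} \left[ F^{*} (\Delta^{*})^{-1} F^{*T} - \frac{1}{n^{*}} f^{*} f^{*T} \right] U^{*} = U^{*} \Lambda^{*},$$

where $f^{*}$ is the row-margin vector of the $F^{*}$ matrix.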
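Because $n$, $F$ and $\Delta$ are all sums over units, the update only needs the quantities of the incoming batch. The sketch below illustrates this under stated assumptions: the batch contribution is formed as $X^{+T} Z^{+}$ so that it has the same shape as $F = X^{T} Z$, $\Delta$ is stored as a vector, and $f^{*}$ is taken literally as the row sums of $F^{*}$. The names `batch_quantities`, `update` and `decompose` and the toy data are hypothetical.

```python
import numpy as np

def batch_quantities(X, Z):
    """Number of units, F = X^T Z and the diagonal of Z^T Z for one batch."""
    return X.shape[0], X.T @ Z, Z.sum(axis=0)

def update(state, X_new, Z_new):
    """Add the contribution of a new batch to the stored (n, F, delta)."""
    n, F, delta = state
    n_add, F_add, delta_add = batch_quantities(X_new, Z_new)
    return n + n_add, F + F_add, delta + delta_add

def decompose(state):
    """Eigen-decomposition of (1/n) [F D^-1 F^T - (1/n) f f^T], f = row margins of F."""
    n, F, delta = state
    keep = delta > 0                       # ignore categories never observed so far
    f = F.sum(axis=1)
    M = ((F[:, keep] / delta[keep]) @ F[:, keep].T - np.outer(f, f) / n) / n
    return np.linalg.eigh(M)

# Hypothetical toy data: six units, three groups, two 0/1 columns, split into two batches.
X_all = np.eye(3)[[0, 1, 1, 0, 2, 2]]
Z_all = np.array([[1, 0], [1, 1], [0, 1], [1, 0], [0, 0], [1, 1]], dtype=float)

state = batch_quantities(X_all[:4], Z_all[:4])   # starting batch
state = update(state, X_all[4:], Z_all[4:])      # incoming batch
full = batch_quantities(X_all, Z_all)            # same quantities from all data at once
lam, U = decompose(state)
print(state[0], full[0])
```

The updated state matches what would be obtained by processing all units in one pass, so earlier batches do not need to be kept in memory.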

SLIDE 17

Application: synthetic data

The number of binary attributes is $p = 12$: V1, V2, ..., V12.

  • starting block: 200 statistical units described by uncorrelated items;
  • first block: 100 statistical units with V1, V2, V3 highly correlated, and 100 statistical units with V10, V11, V12 highly correlated;
  • second block: 400 statistical units described by uncorrelated items;
  • third block: 100 statistical units with V4, V5, V6 highly correlated, and 100 statistical units with V7, V8, V9 highly correlated.

Notes

  • The number of clusters is $K = 3$.
  • Synthetic data are obtained using the R package bindata, by Leisch.
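A data set with the same block layout can be mimicked in a few lines. Note that this sketch does not use bindata and does not reproduce its algorithm: correlation is induced here by letting the selected attributes copy a shared latent coin with probability `strength`, which is a substitute mechanism chosen for simplicity. The function name, the value 0.9 and the seed are assumptions.

```python
import numpy as np

def binary_block(n, p, correlated, rng, strength=0.9):
    """n units on p binary attributes; attributes listed in `correlated` share a latent coin."""
    data = rng.integers(0, 2, size=(n, p))     # independent fair coins
    latent = rng.integers(0, 2, size=n)
    for j in correlated:
        copy = rng.random(n) < strength        # copy the latent value with prob. `strength`
        data[copy, j] = latent[copy]
    return data

rng = np.random.default_rng(0)
p = 12
starting = binary_block(200, p, [], rng)
first = np.vstack([binary_block(100, p, [0, 1, 2], rng),      # V1, V2, V3
                   binary_block(100, p, [9, 10, 11], rng)])   # V10, V11, V12
second = binary_block(400, p, [], rng)
third = np.vstack([binary_block(100, p, [3, 4, 5], rng),      # V4, V5, V6
                   binary_block(100, p, [6, 7, 8], rng)])     # V7, V8, V9

print(np.corrcoef(first[:100, 0], first[:100, 1])[0, 1])
```

Under this mechanism the population correlation between two linked attributes is `strength ** 2`, so 0.9 gives a correlation of about 0.81.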

SLIDE 18

Visualization of the results

A common visualization support

The procedure produces a different factorial plane for each update. To visualize the evolving association structure of the attributes as new data come in, a three-way multidimensional scaling (MDS) is used.

MDS visualization

For the starting matrix $F$ and for each of its updates $F^{*}$, a matrix of chi-square distances among attributes is computed. A three-way MDS is then performed on the resulting three-way distance array, using the package smacof by de Leeuw and Mair.
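The slides do not spell out the distance formula, so the sketch below assumes the usual correspondence-analysis definition: the chi-square distance between two columns of $F$ is the Euclidean distance between their column profiles, with each coordinate weighted by the inverse of the corresponding row mass. The function name and the toy table are hypothetical, and the MDS step itself (done with smacof in the talk) is not reproduced.

```python
import numpy as np

def chi_square_distances(F):
    """Chi-square distances among the columns (attribute categories) of a table F.

    Assumes all row and column totals of F are strictly positive.
    """
    F = np.asarray(F, dtype=float)
    P = F / F.sum()
    r = P.sum(axis=1)                      # row masses (groups)
    c = P.sum(axis=0)                      # column masses
    profiles = P / c                       # column profiles, each summing to 1
    diff = profiles[:, :, None] - profiles[:, None, :]
    return np.sqrt((diff ** 2 / r[:, None, None]).sum(axis=0))

# Hypothetical 2 x 3 table: the first two columns have identical profiles.
F = np.array([[2, 4, 1],
              [2, 4, 3]])
D = chi_square_distances(F)
print(D)
```

Stacking one such matrix per update, for instance with `np.stack`, gives a three-way array with one slice for the starting batch and one for each subsequent update.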

SLIDE 19

Application

SLIDE 20

Application: real-world data

The ‘retail’ data set

The retail market basket data set is supplied by an anonymous Belgian retail supermarket store. The data were collected over three non-consecutive periods, covering approximately 5 months in total. The total number of receipts (statistical units) collected is $n = 88163$; the number of products (binary attributes) is $p = 28549$.

SLIDE 21

Application: real-world data

Figure: Statistical units and cluster activity: starting batch (top) versus the four upcoming data batches.

SLIDE 22

Application: real-world data

Figure: Attributes representation: plot of 10% of the longest trajectories.

SLIDE 23

Application: synthetic data

A further toy example

  • the number of binary attributes is $p = 20$: V1, V2, ..., V20;
  • $n = 1000$ statistical units in four blocks of different sizes: 100, 200, 300 and 400, respectively;
  • each block is characterized by a different subset of five highly co-occurring attributes.

SLIDE 24

Static view: comparison with the MCA result

Figure panels: MCA units view; i-FCB units view.

SLIDE 25

Static view: comparison with the MCA result

Figure panels: MCA attributes view; i-FCB units view.

SLIDE 26

Static view: stability of the clustering results

i-FCB vs K-means

  • real data set BMS-Web-view2 ($n = 557$, $p = 74$);
  • 1000 repetitions for each procedure;
  • standard deviations of the $\phi^2$ for each of the clustering solutions.

SLIDE 27

Arabie P. and Hubert L. (1994). ‘Cluster analysis in marketing research’. IEEE Trans. on Automatic Control AC. 19: 716–723.
Gini C. (1912). ‘Variabilità e mutabilità’. Reprinted in Memorie di Metodologia Statistica, E. Pizetti and T. Salvemini, Eds. Libreria Erendi Virgilio Veschi, Rome.
Greenacre M. J. (2007). ‘Correspondence Analysis in Practice, second edition’. Chapman and Hall/CRC.
Hwang H., Dillon W. R. and Takane Y. (2006). ‘An extension of multiple correspondence analysis for identifying heterogeneous subgroups of respondents’. Psychometrika 71: 161–171.
Iodice D’Enza A. and Greenacre M. J. (2010). ‘Multiple correspondence analysis for the quantification and visualization of large categorical data sets’. In Proceedings of SIS09, Statistical Methods for the Analysis of Large Data-sets. Pescara, Italy (submitted).
Mirkin B. (2001). ‘Eleven Ways to Look at the Chi-Squared Coefficient for Contingency Tables’. The American Statistician 55(2): 111–120.
Palumbo F. and Iodice D’Enza A. (2010). ‘A two-step iterative procedure for clustering of binary sequences’. In Data Analysis and Classification. Springer, Heidelberg, 50–60.
Vichi M. and Kiers H. (2001). ‘Factorial k-means analysis for two way data’. Computational Statistics and Data Analysis 37(1): 49–64.