SLIDE 1
Introduction Notation Criterion Procedure Example
An evolutionary analysis of association patterns
Alfonso Iodice D’Enza (1), Francesco Palumbo (2)
Correspondence Analysis and Related MEthods 2011, Rennes, 8-11 February 2011
(1) Università di
SLIDE 2
SLIDE 3
Background
A common approach to finding patterns of association in high-dimensional, sparse data is to combine dimension reduction and clustering techniques.
Quantitative data
- Tandem-analysis [Arabie and Hubert(1994)]
- Factor K-means [Vichi and Kiers(2001)]
Qualitative data
- multiple correspondence analysis and clustering [Hwang et al.(2006)]
- non-symmetric correspondence analysis and clustering [Palumbo and Iodice D’Enza(2010)]
SLIDE 4
Aim and scope
This contribution presents a dynamic clustering procedure for high-dimensional binary data arranged into subsequent batches: the first data batch is used to determine a ‘starting’ solution, which is updated as further data batches are processed.
Two-fold problem
- clustering very large data sets or data produced at a high rate (data flows);
- performing a comparative analysis of data stratified according to time or space.
SLIDE 5
Notation and data structures
- $n$: number of statistical units;
- $p$: number of binary attributes;
- $K$: number of groups of statistical units;
- $Z_j$, $j = 1, \ldots, p$: Bernoulli-distributed attribute with parameter $\pi_j$ (with $z$ indicating success and $\bar{z}$ failure);
- $X = (X_1, X_2, \ldots, X_K)$: multinomially distributed random vector with parameters $(n; \pi_1, \pi_2, \ldots, \pi_K)$, where the $\pi_k$ ($k = 1, \ldots, K$) are unknown.
SLIDE 6
Criterion
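Cross-classification table $F$ of $X$ and a single binary attribute $Z$:
$$
\begin{array}{c|cc|c}
X \backslash Z & z & \bar{z} & \\ \hline
1 & f_{11} & f_{12} & f_{1+} \\
\vdots & \vdots & \vdots & \vdots \\
K & f_{K1} & f_{K2} & f_{K+} \\ \hline
 & f_{+1} & f_{+2} & n
\end{array}
$$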
SLIDE 7
Criterion
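The qualitative variance, or heterogeneity, of $X$ can be defined by the Gini index
$$
G(X) = 1 - \sum_{k=1}^{K} \left(\frac{f_{k+}}{n}\right)^{2} = 1 - \sum_{k=1}^{K} \frac{f_{k+}^{2}}{n^{2}}.
$$
The variation of $X$ within the categories of the variable $Z$ is obtained by averaging $G(X \mid z)$ and $G(X \mid \bar{z})$:
$$
G(X \mid Z) = \sum_{h=1}^{2} \frac{f_{+h}}{n} \left(1 - \sum_{k=1}^{K} \frac{f_{kh}^{2}}{f_{+h}^{2}}\right) = 1 - \frac{1}{n} \sum_{k=1}^{K} \sum_{h=1}^{2} \frac{f_{kh}^{2}}{f_{+h}}.
$$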
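As a quick numerical check of the two definitions above, the following minimal sketch computes $G(X)$ and $G(X \mid Z)$ from a $K \times 2$ table with NumPy. The table values and the function names `gini` and `conditional_gini` are illustrative assumptions, not part of the original slides.

```python
import numpy as np

def gini(counts):
    """Gini heterogeneity 1 - sum_k (f_k / n)^2 of a vector of counts."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()
    return 1.0 - np.sum((counts / n) ** 2)

def conditional_gini(F):
    """G(X | Z) for a K x 2 table F: column-wise Gini weighted by f_{+h} / n."""
    F = np.asarray(F, dtype=float)
    n = F.sum()
    col_totals = F.sum(axis=0)
    return sum((col_totals[h] / n) * gini(F[:, h]) for h in range(F.shape[1]))

# Hypothetical 3 x 2 table: K = 3 groups, columns z (success) and z-bar (failure).
F = np.array([[30, 10],
              [5, 25],
              [5, 25]])

g_x = gini(F.sum(axis=1))          # G(X), from the row margins f_{k+}
g_x_given_z = conditional_gini(F)  # G(X | Z), first form of the formula
# Second (closed) form: 1 - (1/n) sum_k sum_h f_kh^2 / f_+h
closed_form = 1.0 - (F ** 2 / F.sum(axis=0)).sum() / F.sum()

print(g_x, g_x_given_z, closed_form)
```

For this table the two forms of $G(X \mid Z)$ agree, which is the identity stated on the slide.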
SLIDE 8
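Criterion
The variation of $X$ explained by the categories of $Z$ is
$$
\begin{aligned}
G(X) - G(X \mid Z) &= 1 - \sum_{k=1}^{K} \frac{f_{k+}^{2}}{n^{2}} - \left(1 - \frac{1}{n} \sum_{k=1}^{K} \sum_{h=1}^{2} \frac{f_{kh}^{2}}{f_{+h}}\right) \\
&= \frac{1}{n} \sum_{k=1}^{K} \sum_{h=1}^{2} \frac{f_{kh}^{2}}{f_{+h}} - \frac{1}{n} \sum_{k=1}^{K} \frac{f_{k+}^{2}}{n}.
\end{aligned}
$$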
SLIDE 9
Criterion
In the case of $p$ binary attributes, the criterion being maximized is
$$
\sum_{j=1}^{p} \bigl(G(X) - G(X \mid Z_j)\bigr),
$$
that is, the sum of the variances of $X$ explained by each of the attributes $Z_j$.
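The criterion can be evaluated directly for any candidate partition. The sketch below is one possible implementation, assuming the groups are given as integer labels `0..K-1` and the attributes as an `n x p` 0/1 matrix. The function name `explained_heterogeneity` and the toy data are hypothetical.

```python
import numpy as np

def explained_heterogeneity(labels, Z, K):
    """Sum over attributes j of G(X) - G(X | Z_j).

    labels: length-n integer array with values in 0..K-1 (group of each unit)
    Z:      n x p binary (0/1) matrix of attributes
    """
    labels = np.asarray(labels)
    Z = np.asarray(Z, dtype=float)
    n, p = Z.shape
    X = np.eye(K)[labels]                 # n x K indicator matrix
    sizes = X.sum(axis=0)                 # group sizes f_{k+}
    total = 0.0
    for j in range(p):
        # K x 2 table for attribute j: successes and failures per group
        F = np.column_stack([X.T @ Z[:, j], X.T @ (1.0 - Z[:, j])])
        col = F.sum(axis=0)               # f_{+1}, f_{+2}
        keep = col > 0                    # assumption: an empty category contributes nothing
        within = (F[:, keep] ** 2 / col[keep]).sum() / n
        total += within - (sizes ** 2).sum() / n ** 2
    return total

# Toy data: two identical attributes that split six units into two blocks.
Z = np.array([[1, 1], [1, 1], [1, 1], [0, 0], [0, 0], [0, 0]])
aligned = explained_heterogeneity([0, 0, 0, 1, 1, 1], Z, K=2)
mixed = explained_heterogeneity([0, 1, 0, 1, 0, 1], Z, K=2)
print(aligned, mixed)
```

A partition aligned with the attributes yields a larger value than a partition that mixes the two blocks, which is the behaviour the maximization relies on.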
SLIDE 10
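Algebraic formalization
Quantity to maximize
$$
\operatorname{tr}\left[\frac{1}{n} F \Delta^{-1} F^{\mathsf T} - \frac{1}{n^{2}} X^{\mathsf T} \mathbf{1}\mathbf{1}^{\mathsf T} X\right] \equiv \operatorname{tr}\left[\frac{1}{n} X^{\mathsf T} Z \Delta^{-1} Z^{\mathsf T} X - \frac{1}{n^{2}} X^{\mathsf T} \mathbf{1}\mathbf{1}^{\mathsf T} X\right]
$$
where $X$ is an $(n \times K)$ matrix with $x_{ik} = 1$ if unit $i$ is assigned to group $k$ (and 0 otherwise), $F = [F_1 \ldots F_p] = X^{\mathsf T} Z$, $\Delta = \operatorname{diag}(Z^{\mathsf T} Z)$ and $\mathbf{1}$ is an $n$-dimensional vector of ones.
Eigenvalue decomposition
$$
\frac{1}{n}\left[X^{\mathsf T} Z \Delta^{-1} Z^{\mathsf T} X - \frac{1}{n} X^{\mathsf T} \mathbf{1}\mathbf{1}^{\mathsf T} X\right] U = \Lambda U.
$$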
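The matrix expressions above can be checked numerically. The sketch below builds both forms of the matrix inside the trace and its eigendecomposition on random data. It assumes that $Z$ is the $n \times 2p$ indicator coding with one success and one failure column per attribute (so that $F = X^{\mathsf T} Z$ has one $K \times 2$ block per attribute). The column order used here (all success columns, then all failure columns) is a convenience that does not affect the result. Variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, K = 60, 4, 3

B = rng.integers(0, 2, size=(n, p))         # raw 0/1 attributes (synthetic)
Z = np.hstack([B, 1 - B]).astype(float)     # n x 2p: success and failure columns
labels = rng.integers(0, K, size=n)
X = np.eye(K)[labels]                       # n x K indicator matrix of groups

F = X.T @ Z                                 # K x 2p cross-tabulation blocks
delta = Z.sum(axis=0)                       # diagonal of Delta = diag(Z'Z)
ones = np.ones((n, 1))
centering = X.T @ ones @ ones.T @ X / n**2  # (1/n^2) X'11'X

# The two equivalent forms of the matrix inside the trace
A_F = F @ np.diag(1.0 / delta) @ F.T / n - centering
A_X = X.T @ Z @ np.diag(1.0 / delta) @ Z.T @ X / n - centering

# Symmetric eigendecomposition; eigh returns eigenvalues in ascending order
lam, U = np.linalg.eigh(A_X)

print(np.trace(A_X), lam.sum())
```

Since the trace equals the sum of the eigenvalues, the quantity to maximize can be read off the decomposition directly.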
SLIDE 11
Back to the aim
This contribution presents a dynamic clustering procedure for high-dimensional binary data arranged into subsequent batches: the first data batch is used to determine a ‘starting’ solution, which is updated as further data batches are processed.
Two-fold problem
- clustering very large data sets or data produced at a high rate (data flows);
- performing a comparative analysis of data stratified according to time or space.
SLIDE 12
SLIDE 13
The overall procedure
The proposed procedure consists of three phases.
- phase 1, analysis of the starting batch: the i-FCB (iterative factorial clustering of binary data) procedure is applied to obtain the starting solution [Palumbo and Iodice D’Enza(2010)];
- phase 2, new batch processing: incoming statistical units are assigned to the K groups;
- phase 3, updating process: all the quantities are updated according to the new data.
Phases 2 and 3 are repeated for each new data batch.
SLIDE 14
phase 1: starting batch
The i-FCB iterative algorithm runs over the following steps:
- step 0: pseudo-random generation of matrix $X$;
- step 1: an eigenvalue decomposition is performed on the matrix resulting from expression (1), obtaining the matrix $\Psi$ such that
$$
\Psi = \left[Z \Delta^{-1} Z^{\mathsf T} - \frac{1}{n} \mathbf{1}\mathbf{1}^{\mathsf T}\right] X U \Lambda^{\frac{1}{2}}; \qquad (1)
$$
- step 2: matrix $X$ is updated according to a squared Euclidean distance-based non-hierarchical clustering algorithm (k-means) on the projected statistical units (matrix $\Psi$).
Steps 1 and 2 are iterated until the stopping rule is met: the quantity in (1) does not significantly increase from one iteration to the next.
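Steps 0 to 2 can be sketched as a short loop. This is only an illustration under several assumptions that the slides do not settle: $Z$ is the $n \times 2p$ success/failure coding, all $K$ eigenvectors are kept, the stopping rule compares the trace of the criterion matrix with a small tolerance, and k-means is a plain Lloyd iteration written here as a stand-in for any k-means routine. The names `criterion_matrix`, `kmeans_labels` and `ifcb_sketch` are hypothetical, not the authors' implementation.

```python
import numpy as np

def criterion_matrix(X, Z):
    """(1/n) X'Z Delta^-1 Z'X - (1/n^2) X'11'X for indicator matrices X and Z."""
    n = X.shape[0]
    F = X.T @ Z
    delta = Z.sum(axis=0)
    sizes = X.sum(axis=0)
    return (F / delta) @ F.T / n - np.outer(sizes, sizes) / n**2

def kmeans_labels(points, K, rng, n_iter=50):
    """Plain Lloyd iterations with squared Euclidean distances (stand-in k-means)."""
    centers = points[rng.choice(len(points), size=K, replace=False)]
    for _ in range(n_iter):
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for k in range(K):
            if np.any(labels == k):       # an empty cluster keeps its old center
                centers[k] = points[labels == k].mean(axis=0)
    return labels

def ifcb_sketch(Z, K, max_iter=30, tol=1e-8, seed=0):
    rng = np.random.default_rng(seed)
    n = Z.shape[0]
    delta = Z.sum(axis=0)
    labels = rng.integers(0, K, size=n)   # step 0: pseudo-random allocation
    best_value, best_labels = -np.inf, labels
    for _ in range(max_iter):
        X = np.eye(K)[labels]
        A = criterion_matrix(X, Z)
        value = np.trace(A)
        if value - best_value <= tol:     # stopping rule: no significant increase
            break
        best_value, best_labels = value, labels
        lam, U = np.linalg.eigh(A)        # step 1: eigenvalue decomposition
        lam = np.clip(lam, 0.0, None)     # guard against tiny negative round-off
        XU = X @ (U * np.sqrt(lam))       # X U Lambda^{1/2}
        # Psi = [Z Delta^-1 Z' - (1/n) 11'] X U Lambda^{1/2}, without forming n x n matrices
        psi = Z @ ((Z.T @ XU) / delta[:, None]) - XU.mean(axis=0)
        labels = kmeans_labels(psi, K, rng)   # step 2: k-means on the rows of Psi
    return best_labels, best_value

# Usage on synthetic data: two blocks of units with opposite attribute profiles.
rng = np.random.default_rng(3)
probs = np.vstack([np.tile([0.9] * 3 + [0.1] * 3, (20, 1)),
                   np.tile([0.1] * 3 + [0.9] * 3, (20, 1))])
B = (rng.random((40, 6)) < probs).astype(float)
Z = np.hstack([B, 1 - B])
labels, value = ifcb_sketch(Z, K=2)
```

The sketch returns the last partition that improved the criterion, together with the criterion value it attains. Like any k-means-based scheme it can stop at a local optimum, so several random starts may be advisable.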
SLIDE 15
convergence of the criterion
Figure: number of iterations versus value of the criterion, 1000 repetitions. Panels: unstructured data, structured data.
SLIDE 16
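phase 3: updating process
- update of the number of units: $n^{*} = n + n^{+}$;
- update of the cross-tabulation block matrix: $F^{*} = F + F^{+}$, with $F^{+} = (X^{+})^{\mathsf T} Z^{+}$, which has the same shape as $F = X^{\mathsf T} Z$;
- update of the diagonal matrix of margins: $\Delta^{*} = \Delta + \Delta^{+}$, with $\Delta^{+} = \operatorname{diag}\bigl((Z^{+})^{\mathsf T} Z^{+}\bigr)$;
- update of the eigenvalue decomposition:
$$
\frac{1}{n^{*}}\left[F^{*} (\Delta^{*})^{-1} F^{*\mathsf T} - \frac{1}{n^{*}} f^{*} f^{*\mathsf T}\right] U^{*} = \Lambda^{*} U^{*},
$$
where $f^{*}$ is the row-margin vector of the $F^{*}$ matrix.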
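The updating step only needs the stored counts, not the earlier raw data. A minimal sketch follows, assuming the new batch arrives as an indicator matrix `X_new` of group assignments and an `n+ x 2p` success/failure matrix `Z_new`. Here $F^{+}$ is formed as $(X^{+})^{\mathsf T} Z^{+}$ so that it has the same $K \times 2p$ shape as $F = X^{\mathsf T} Z$. The function name `update_state` and the random data are hypothetical.

```python
import numpy as np

def update_state(n, F, delta, X_new, Z_new):
    """One phase-3 update from a new batch (X_new: n+ x K, Z_new: n+ x 2p)."""
    n_star = n + X_new.shape[0]                 # n* = n + n+
    F_star = F + X_new.T @ Z_new                # F* = F + F+
    delta_star = delta + Z_new.sum(axis=0)      # diagonal of Delta* = Delta + Delta+
    f_star = F_star.sum(axis=1)                 # row margins of F*
    A_star = ((F_star / delta_star) @ F_star.T
              - np.outer(f_star, f_star) / n_star) / n_star
    lam_star, U_star = np.linalg.eigh(A_star)   # updated eigendecomposition
    return n_star, F_star, delta_star, lam_star, U_star

# Usage: a starting batch followed by one new batch (synthetic data).
rng = np.random.default_rng(0)
K, p = 3, 4

def random_batch(m):
    B = rng.integers(0, 2, size=(m, p))
    Zb = np.hstack([B, 1 - B]).astype(float)
    Xb = np.eye(K)[rng.integers(0, K, size=m)]
    return Xb, Zb

X0, Z0 = random_batch(50)
X1, Z1 = random_batch(30)
n0, F0, delta0 = X0.shape[0], X0.T @ Z0, Z0.sum(axis=0)
n_star, F_star, delta_star, lam_star, U_star = update_state(n0, F0, delta0, X1, Z1)
```

Because $F$ and $\Delta$ are sums of per-unit contributions, the updated quantities coincide with those computed on the stacked batches, which is what makes the scheme suitable for data flows.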
SLIDE 17
Application: synthetic data
The number of binary attributes is p = 12: V1, V2, ..., V12.
- starting block: 200 statistical units described by uncorrelated items;
- first block: 100 statistical units with V1, V2, V3 highly correlated, and 100 statistical units with V10, V11, V12 highly correlated;
- second block: 400 statistical units described by uncorrelated items;
- third block: 100 statistical units with V4, V5, V6 highly correlated, and 100 statistical units with V7, V8, V9 highly correlated.
Notes
The number of clusters is K = 3. Synthetic data are obtained using the R package bindata, by Leisch.
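The block design above can be mimicked in Python for experimentation. The slides generate the data with the R package bindata. The sketch below does not reproduce that package: it swaps in a simple shared-latent-bit mechanism, and the copy probability of 0.9, the helper name `correlated_block` and the seed are all assumptions made for illustration.

```python
import numpy as np

def correlated_block(rng, n_units, p, linked, copy_prob=0.9):
    """n_units x p binary rows; attributes in `linked` mostly copy one shared bit."""
    data = rng.integers(0, 2, size=(n_units, p))
    if linked:
        shared = rng.integers(0, 2, size=n_units)   # one latent bit per unit
        for j in linked:
            copy = rng.random(n_units) < copy_prob  # units where attribute j copies the bit
            data[copy, j] = shared[copy]
    return data

rng = np.random.default_rng(1)
p = 12  # V1..V12 are stored as columns 0..11
batches = {
    "starting": correlated_block(rng, 200, p, []),
    "first": np.vstack([correlated_block(rng, 100, p, [0, 1, 2]),      # V1, V2, V3
                        correlated_block(rng, 100, p, [9, 10, 11])]),  # V10, V11, V12
    "second": correlated_block(rng, 400, p, []),
    "third": np.vstack([correlated_block(rng, 100, p, [3, 4, 5]),      # V4, V5, V6
                        correlated_block(rng, 100, p, [6, 7, 8])]),    # V7, V8, V9
}
```

Each entry of `batches` is an `n x 12` 0/1 matrix that can be expanded to the success/failure coding before being passed to a clustering routine.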
SLIDE 18
Visualization of the results
A common visualization support
The procedure produces a different factorial plane for each update. In order to visualize the evolving association structure of the considered attributes as new data come in, a three-way multidimensional scaling (MDS) is used.
MDS visualization
For the starting matrix F and for its updates F∗ a matrix of chi-square distances among attributes is computed. A three-way MDS on the resulting three-way distance matrix is performed, using the package smacof by de Leeuw and Mair.
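The distance computation can be sketched as follows, assuming the usual correspondence-analysis definition of the chi-square distance between column profiles, weighted by the inverse row masses. Which columns of $F$ represent an attribute is not stated on the slide, so the function simply takes any table of counts. The name `chi2_distances` and the example table are hypothetical. The three-way MDS itself, done in the slides with the R package smacof, is not reproduced here.

```python
import numpy as np

def chi2_distances(table):
    """Chi-square distances between the columns of a table of counts.

    d(j, j')^2 = sum_k (1 / r_k) * (f_kj / f_+j - f_kj' / f_+j')^2,
    where r_k is the mass of row k. Assumes no empty row or column.
    """
    table = np.asarray(table, dtype=float)
    row_mass = table.sum(axis=1) / table.sum()
    profiles = table / table.sum(axis=0)                  # column profiles
    diff = profiles[:, :, None] - profiles[:, None, :]    # K x m x m differences
    d2 = (diff ** 2 / row_mass[:, None, None]).sum(axis=0)
    return np.sqrt(d2)

# Hypothetical 2 x 3 table: columns 0 and 1 have identical profiles.
table = np.array([[10, 20, 5],
                  [30, 60, 5]])
D = chi2_distances(table)

# One distance matrix per batch would then be stacked into a three-way array,
# e.g. np.stack([D_start, D_update1, ...]), as input to a three-way MDS.
```

Columns with proportional counts sit at distance zero, so attributes that are distributed across the groups in the same way coincide in the MDS configuration.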
SLIDE 19
Application
SLIDE 20
Application: real-world data
The ‘retail’ data set
The retail market basket data set is supplied by an anonymous Belgian retail supermarket store. The data were collected over three non-consecutive periods, spanning approximately 5 months of data. The total number of receipts (statistical units) collected is n = 88163; the number of products (binary attributes) is p = 28549.
SLIDE 21
Application: real-world data
Figure: Statistical units and cluster activity: starting batch (top) versus the four upcoming data batches
SLIDE 22
Application: real-world data
Figure: Attributes representation: plot of 10% of the longest trajectories
SLIDE 23
Application: synthetic data
A further toy example
- the number of binary attributes is p = 20: V1, V2, ..., V20;
- n = 1000 statistical units in four blocks of different sizes: 100, 200, 300 and 400, respectively;
- each block is characterized by a different subset of five highly co-occurring attributes.
SLIDE 24
Static view: comparison with the MCA result
Figure: two panels, MCA units view and i-FCB units view
SLIDE 25
Static view: comparison with the MCA result
Figure: two panels, MCA attributes view and i-FCB units view
SLIDE 26
Static view: stability of the clustering results
i-FCB vs K-means
- real data set BMS-Web-view2 (n = 557, p = 74);
- 1000 repetitions for each procedure;
- standard deviations of the $\phi^{2}$ for each of the clustering solutions.
SLIDE 27