Model-based clustering with mixed/missing data using the new software MixtComp


SLIDE 1

Model-based clustering with mixed/missing data using the new software MixtComp

https://modal-research.lille.inria.fr/BigStat/

Christophe Biernacki

(with Thibault Deregnaucourt and Vincent Kubicki)

CMStatistics 2015 (ERCIM 2015) London (UK), 12-14 December 2015

SLIDE 2

Outline

1 The problem
2 Conditional independent clustering
3 Estimation
4 Clustering with MixtComp
5 Imputation with MixtComp
6 Conclusion

SLIDE 3

Clustering of complex data

Data: n individuals $x = (x_1, \dots, x_n) = (x^O, x^M)$ belonging to a space $\mathcal{X}$

Observed part of the individuals: $x^O$; missing part: $x^M$

Aim: estimation of the partition z and of the number of clusters K
Partition into K clusters $G_1, \dots, G_K$: $z = (z_1, \dots, z_n)$, $z_i = (z_{i1}, \dots, z_{iK})'$, with $x_i \in G_k \Leftrightarrow z_{ih} = \mathbb{I}_{\{h=k\}}$

Mixed, missing, uncertain

Individuals x^O                                        Partition z^O ⇔ Group
?      0.5            red            5                 (?, ?, ?) ⇔ ???
0.3    0.1            green          3                 (?, ?, ?) ⇔ ???
0.3    0.6            {red, green}   3                 (?, ?, ?) ⇔ ???
0.9    [0.25, 0.45]   red            ?                 (?, ?, ?) ⇔ ???
↓      ↓              ↓              ↓
cont.  cont.          categorical    integer

SLIDE 4

Model-based clustering

Cluster k is modelled by a parametric distribution: $X_i \mid Z_{ik} = 1 \overset{\text{i.i.d.}}{\sim} p(\cdot\,; \alpha_k)$

Cluster k has probability $\pi_k$, with $\sum_{k=1}^{K} \pi_k = 1$: $Z_i \overset{\text{i.i.d.}}{\sim} \mathrm{Mult}_K(1, \pi_1, \dots, \pi_K)$

Missing data $x^M$ are produced by a missing-completely-at-random (MCAR) process¹

Observed mixture pdf: with parameter $\theta = (\pi_1, \dots, \pi_K, \alpha_1, \dots, \alpha_K)$, it is written

$p(x_i^O; \theta) = \sum_{k=1}^{K} \pi_k\, p(x_i^O; \alpha_k) = \sum_{k=1}^{K} \pi_k \int_{x_i^M} p(x_i^O, x_i^M; \alpha_k)\, dx_i^M$

Maximum a posteriori (MAP): with $t_k(x_i^O; \theta) = p(Z_{ik} = 1 \mid x_i^O; \theta) = \dfrac{\pi_k\, p(x_i^O; \alpha_k)}{p(x_i^O; \theta)}$,

$\hat{z}_i = \arg\max_{k \in \{1, \dots, K\}} t_k(x_i^O; \theta)$

Seems suitable for missing/uncertain data, but which $p(\cdot\,; \alpha_k)$ for mixed data?

¹ Could be relaxed to missing at random (MAR)
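To fix ideas, here is a minimal R sketch of the MAP rule for a univariate two-component Gaussian mixture; all parameter values are hypothetical and the code is an illustration, not MixtComp's implementation:

  # Posterior probabilities t_k and MAP assignment for a univariate
  # 2-component Gaussian mixture (hypothetical parameters theta).
  pi_k <- c(0.4, 0.6); mu <- c(0, 4); s <- c(1, 1)
  map_assign <- function(x) {            # x: numeric vector of observations
    t_num <- sapply(1:2, function(k) pi_k[k] * dnorm(x, mu[k], s[k]))
    t_k   <- t_num / rowSums(t_num)      # t_k(x; theta): rows sum to 1
    max.col(t_k)                         # z-hat = argmax_k t_k
  }
  map_assign(c(-0.5, 2.1, 5.3))          # cluster labels for three points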

SLIDE 5

Outline

1 The problem
2 Conditional independent clustering
3 Estimation
4 Clustering with MixtComp
5 Imputation with MixtComp
6 Conclusion

SLIDE 6

High-dimensional today's data²

² S. Alelyani, J. Tang and H. Liu (2013). Feature Selection for Clustering: A Review. Data Clustering: Algorithms and Applications, 29

SLIDE 7

HD clustering: blessing (1/2)

A two-component d-variate Gaussian mixture with intra-dependency: $\pi_1 = \pi_2 = \tfrac{1}{2}$, $X_1 \mid z_{11} = 1 \sim \mathcal{N}_d(\mathbf{0}, \Sigma)$, $X_1 \mid z_{12} = 1 \sim \mathcal{N}_d(\mathbf{1}, \Sigma)$
Each variable provides an equal amount of its own separation information
The theoretical error decreases as d grows: $\mathrm{err}_{\mathrm{theo}} = \Phi\!\left(-\|\mu_2 - \mu_1\|_{\Sigma^{-1}}/2\right)$
The empirical error rate with the (true) intra-correlated model worsens as d grows
The empirical error rate with the (false) intra-independent model improves as d grows!

[Figure: a two-dimensional sample of the mixture in (x1, x2), and the error rate err versus dimension d (d = 1, …, 10) for the empirical correlated model, the empirical independent model, and the theoretical error]
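The error formula above can be evaluated directly; a minimal R sketch, assuming for illustration a Σ with unit variances and constant correlation ρ (a hypothetical choice):

  # err_theo = Phi(-||mu2 - mu1||_{Sigma^{-1}} / 2) for mu1 = 0, mu2 = 1.
  err_theo <- function(d, rho = 0.5) {
    Sigma <- matrix(rho, d, d); diag(Sigma) <- 1
    delta <- rep(1, d)                                  # mu2 - mu1
    maha  <- sqrt(drop(t(delta) %*% solve(Sigma) %*% delta))
    pnorm(-maha / 2)
  }
  round(sapply(1:10, err_theo), 3)   # decreases as d grows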

SLIDE 8

HD clustering: blessing (2/2)

[Figure: FDA projections (1st FDA axis vs 2nd FDA axis) for d = 2, d = 20, d = 200 and d = 400]

Neglecting intra-dependency in HD clustering yields a better bias/variance trade-offᵃ

ᵃ When variables convey no redundant cluster information; see conclusion

SLIDE 9

Mixed data: conditional independence everywhere

The aim is to combine continuous, categorical and integer data

$x_1 = (x_1^{\mathrm{cont}}, x_1^{\mathrm{cat}}, x_1^{\mathrm{int}})$

The proposed solution is to mix all types through inter-type conditional independence:

$p(x_1; \alpha_k) = p(x_1^{\mathrm{cont}}; \alpha_k^{\mathrm{cont}}) \times p(x_1^{\mathrm{cat}}; \alpha_k^{\mathrm{cat}}) \times p(x_1^{\mathrm{int}}; \alpha_k^{\mathrm{int}})$

In addition, for symmetry between types, intra-type conditional independence is assumed. One then only needs to define a univariate pdf for each variable type:
Continuous: Gaussian
Categorical: multinomial
Integer: Poisson
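A minimal R sketch of this product form, with hypothetical parameter values for one cluster:

  # log p(x1; alpha_k) as a sum of univariate log-pdfs:
  # Gaussian (continuous) + multinomial (categorical) + Poisson (integer).
  log_dens_cluster <- function(x_cont, x_cat, x_int, alpha) {
    sum(dnorm(x_cont, alpha$mu, alpha$sd, log = TRUE)) +  # continuous part
      sum(log(alpha$prob[x_cat])) +                       # categorical part
      sum(dpois(x_int, alpha$lambda, log = TRUE))         # integer part
  }
  alpha1 <- list(mu = c(0, 1), sd = c(1, 2),              # hypothetical alpha_k
                 prob = c(red = 0.7, green = 0.3), lambda = 4)
  log_dens_cluster(c(0.3, 0.1), "green", 3, alpha1)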

SLIDE 10

Outline

1 The problem
2 Conditional independent clustering
3 Estimation
4 Clustering with MixtComp
5 Imputation with MixtComp
6 Conclusion

SLIDE 11

SEM algorithm

A SEM algorithm estimates θ by maximizing the observed-data log-likelihood $\ell(\theta; x^O) = \ln p(x^O; \theta)$

Initialisation: $\theta^{(0)}$
Iteration no. q:
  E-step: compute the conditional probabilities $p(x^M, z \mid x^O; \theta^{(q)})$
  S-step: draw $(x^{M(q)}, z^{(q)})$ from $p(x^M, z \mid x^O; \theta^{(q)})$
  M-step: maximize $\theta^{(q+1)} = \arg\max_\theta \ln p(x^O, x^{M(q)}, z^{(q)}; \theta)$
Stopping rule: iteration number

Properties
simplicity because of conditional independence
classical M-steps
avoids local maxima
the mean of the sequence $(\theta^{(q)})$ approximates $\hat\theta$
the variance of the sequence $(\theta^{(q)})$ gives confidence intervals
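As a toy illustration of the loop above (not MixtComp's code), a minimal R sketch of SEM for a univariate two-component Gaussian mixture with MCAR missing entries:

  # Toy SEM for a univariate 2-component Gaussian mixture with MCAR
  # missing values (NA). Illustration only, with no numerical safeguards.
  set.seed(1)
  x <- c(rnorm(100, 0), rnorm(100, 4)); x[sample(200, 20)] <- NA
  K <- 2; n <- length(x); miss <- is.na(x); xc <- x
  theta <- list(pi = rep(1/K, K),
                mu = quantile(x, c(.25, .75), na.rm = TRUE),
                sd = rep(sd(x, na.rm = TRUE), K))
  Q <- 200; chain <- matrix(NA, Q, 3 * K)
  for (q in 1:Q) {
    # E/S-step: draw z_i given x_i^O, then the missing x_i^M given z_i
    dens <- sapply(1:K, function(k)
      ifelse(miss, theta$pi[k], theta$pi[k] * dnorm(x, theta$mu[k], theta$sd[k])))
    z <- apply(dens, 1, function(p) sample(K, 1, prob = p))
    xc[miss] <- rnorm(sum(miss), theta$mu[z[miss]], theta$sd[z[miss]])
    # M-step: closed-form updates from the completed data (x^O, x^M(q), z^(q))
    theta$pi <- tabulate(z, K) / n
    theta$mu <- tapply(xc, factor(z, levels = 1:K), mean)
    theta$sd <- tapply(xc, factor(z, levels = 1:K), sd)
    chain[q, ] <- c(theta$pi, theta$mu, theta$sd)
  }
  colMeans(chain[-(1:50), ])      # mean of the SEM sequence ~ theta-hat
  apply(chain[-(1:50), ], 2, sd)  # spread of the sequence -> confidence intervals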

SLIDE 12

SE algorithm

A SE algorithm then estimates $(x^M, z)$

Iteration no. q:
  E-step: compute the conditional probabilities $p(x^M, z \mid x^O; \hat\theta)$
  S-step: draw $(x^{M(q)}, z^{(q)})$ from $p(x^M, z \mid x^O; \hat\theta)$
Stopping rule: iteration number

Properties
simplicity because of conditional independence
the mean/mode of the sequence $(x^{M(q)}, z^{(q)})$ estimates $(x^M, z)$
confidence intervals are also derived
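Continuing the toy example from the previous slide, a minimal R sketch of the SE stage at a fixed $\hat\theta$ (reusing x, miss, theta, K and n from the SEM sketch):

  # Toy SE stage: theta is now held fixed and (x^M, z) is drawn
  # repeatedly, then summarised by its mean/mode over the draws.
  R <- 500
  zdraw <- matrix(NA_integer_, R, n); xdraw <- matrix(NA_real_, R, n)
  dens <- sapply(1:K, function(k)    # fixed, since theta-hat is fixed
    ifelse(miss, theta$pi[k], theta$pi[k] * dnorm(x, theta$mu[k], theta$sd[k])))
  for (r in 1:R) {
    z  <- apply(dens, 1, function(p) sample(K, 1, prob = p))
    xr <- x
    xr[miss] <- rnorm(sum(miss), theta$mu[z[miss]], theta$sd[z[miss]])
    zdraw[r, ] <- z; xdraw[r, ] <- xr
  }
  z_hat <- apply(zdraw, 2, function(v) which.max(tabulate(v, K)))  # mode
  x_hat <- colMeans(xdraw)                                         # mean
  ci <- apply(xdraw[, miss, drop = FALSE], 2,                      # 95% CIs
              quantile, probs = c(.025, .975))                     # for x^M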

SLIDE 13

Outline

1 The problem
2 Conditional independent clustering
3 Estimation
4 Clustering with MixtComp
5 Imputation with MixtComp
6 Conclusion

SLIDE 14

Prostate cancer data³

Individuals: 506 patients with prostatic cancer, grouped on clinical criteria into two stages (3 and 4) of the disease
Variables: d = 12 pre-trial variates measured on each patient: eight continuous (age, weight, systolic blood pressure, diastolic blood pressure, serum haemoglobin, size of primary tumour, index of tumour stage and histologic grade, serum prostatic acid phosphatase) and four categorical with various numbers of levels (performance rating, cardiovascular disease history, electrocardiogram code, bone metastases)
Some missing data: 62 missing values (≈ 1%)
We discard the classes (stages of the disease) in order to perform clustering

Questions

How many clusters? Which partition?

³ Byar DP, Green SB (1980). Bulletin Cancer, Paris 67:477-488

SLIDE 15

Create a free account in MixtComp⁴

https://modal-research.lille.inria.fr/BigStat/

It implements mixed/missing-data clustering as software as a service (SaaS)

⁴ See documentation at https://modal.lille.inria.fr/wikimodal/doku.php?id=mixtcomp

SLIDE 16

Two files to merge into a single zip file

Variable descriptor file: descriptor.csv
Data file: data.csv
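For illustration only, the two files might look as follows for three of the prostate variables; the exact model labels and the "?" missing-value code are assumptions, so refer to the documentation linked on the previous slide:

  data.csv (hypothetical): one row per patient, one column per variable
    Age,SBP,PF
    70,1.4,normal
    68,?,limited

  descriptor.csv (hypothetical): the univariate model attached to each variable
    Age,SBP,PF
    Gaussian,Gaussian,Multinomial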

SLIDE 17

Learn!

Step 1: input the zip file and K
Step 2: it is running!

SLIDE 18

Output

Option 1: output zip file
Option 2: instant viewing of clusters (variable-wise normalized entropy)

SLIDE 19

Output R format

SLIDE 20

Two strategies in competition

Strategy “mice⁵ + MixtComp”: MixtComp on the dataset completed by mice
> library(mice)
> data.imp <- mice(data)
> data.comp.mice <- complete(data.imp)
Strategy “full MixtComp”: MixtComp on the observed (uncompleted) dataset

⁵ http://cran.r-project.org/web/packages/mice/mice.pdf

SLIDE 21

Choosing K with the ICL criterion

[Figure: ICL versus K (K = 1, …, 7) for each of the two strategies]

mice + MixtComp: $\hat K = 7$        full MixtComp: $\hat K = 2$
. . . imputation before clustering may lose some cluster information
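For reference, a minimal R sketch of ICL in its usual BIC-minus-entropy form (assumed here; the server's exact implementation is not shown), computed from the maximised log-likelihood, the posterior probabilities and the number of free parameters:

  # ICL(K) = loglik - (nu/2) log n - classification entropy.
  # t: n x K matrix of posterior probabilities t_k(x_i^O; theta-hat).
  icl <- function(loglik, t, nu) {
    n <- nrow(t)
    entropy <- -sum(t * log(pmax(t, 1e-12)))  # guard against log(0)
    loglik - nu / 2 * log(n) - entropy
  }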

SLIDE 22

Partition quality with K = 2

Strategy           mice + MixtComp   full MixtComp
% misclassified    12.8              8.1

To be compared also with missing-data removal: 475 patients with non-missing data, using MixtComp for clustering, with the possibility to consider continuous, categorical or mixed data

Strategy           continuous only   categorical only   mixed cont/cat
% misclassified    9.46              47.16              8.63

Lessons:
risk of information loss when removing lines/columns with missing data
avoid completing missing data (imputation depends on the purpose)
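Since cluster labels are arbitrary, the % misclassified above must be computed up to a relabelling; a minimal self-contained R sketch:

  # Error rate between partitions, minimised over all K! label permutations.
  perms <- function(v) {
    if (length(v) == 1) return(list(v))
    out <- list()
    for (i in seq_along(v))
      for (p in perms(v[-i])) out <- c(out, list(c(v[i], p)))
    out
  }
  misclass <- function(z_hat, z_true) {
    K <- max(z_hat, z_true)
    100 * min(sapply(perms(1:K), function(p) mean(p[z_hat] != z_true)))
  }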

SLIDE 23

And for supervised classification?

Now use the predict functionality of MixtComp:
descriptor.csv + data.csv + output.RData (from the previous learn step) = NameYouWant.zip
Then the output format is the same as for the learn functionality of MixtComp

SLIDE 24

Outline

1 The problem
2 Conditional independent clustering
3 Estimation
4 Clustering with MixtComp
5 Imputation with MixtComp
6 Conclusion

SLIDE 25

Mixture models as an extremely flexible family of distributions

They allow any distribution to be approximated by increasing the number of components

[Figure: grey-level histogram (left) and estimated mixture density of grey levels (right)]
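A small illustration of this flexibility, as a sketch assuming the mclust package (not the talk's software): a skewed Gamma sample is approximated increasingly well by Gaussian mixtures of growing order:

  library(mclust)                            # assumed available
  set.seed(2)
  y <- rgamma(1000, shape = 2, rate = 0.5)   # a skewed target distribution
  for (K in c(1, 2, 5)) {
    fit <- Mclust(y, G = K, modelNames = "V")  # K-component Gaussian mixture
    cat("K =", K, " log-likelihood =", fit$loglik, "\n")
  }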

SLIDE 26

Cancer dataset with more missing data

Artificially add ≈ 30% missing data with a MCAR design (see the sketch after the figure below)
Then compare two imputation strategies:
Strategy “mice”: dataset completed by mice
> library(mice)
> data.imp <- mice(data)
> data.comp.mice <- complete(data.imp)
Strategy “full MixtComp”: MixtComp on the observed (uncompleted) dataset

[Figure: ICL versus K and BIC versus K (K = 1, …, 6)]

ICL: $\hat K = 2$        BIC: $\hat K = 4$
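A minimal R sketch of the MCAR degradation step (the 30% rate is the slide's; the data-frame name is hypothetical):

  # Remove ~30% of all entries completely at random (MCAR),
  # independently of any observed or unobserved values.
  add_mcar <- function(df, rate = 0.3) {
    for (j in seq_along(df))
      df[[j]][runif(nrow(df)) < rate] <- NA
    df
  }
  # cancer.mcar <- add_mcar(cancer)   # 'cancer' is a hypothetical data frame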

SLIDE 27

Imputation accuracy

Continuous variables: mean absolute difference between $x$ and $\hat x$

var           mice        MixtComp (K = 2)   MixtComp (K = 4)
Age           8.907143    5.546571           5.526861
Wt            13.51656    9.779485           9.731182
SBP           2.103226    1.788152           1.795820
DBP           1.317568    1.165201           1.169672
HG            21.67568    14.83514           14.51291
SZ            1.714899    1.160546           1.158105
SG            1.979866    1.386841           1.416053
AP            1.359299    1.027513           1.009126
Global mean   6.5718      4.5862             4.5400

Categorical variables: mean proportion of disagreement between $x$ and $\hat x$

var           mice        MixtComp (K = 2)   MixtComp (K = 4)
PF            0.1904762   0.0952381          0.0952381
HX            0.4121622   0.4391892          0.4121622
EKG           0.7564103   0.6858974          0.7179487
BM            0.1081081   0.1486486          0.1216216
Global mean   0.3668      0.3422             0.3367
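These two accuracy measures amount to the following minimal R sketch, applied per variable over the artificially removed entries (all arguments are placeholders):

  # Continuous: mean absolute difference; categorical: mismatch proportion.
  # x: true values, x_hat: imputed values, miss: mask of removed entries.
  mae_cont <- function(x, x_hat, miss) mean(abs(x[miss] - x_hat[miss]))
  err_cat  <- function(x, x_hat, miss) mean(x[miss] != x_hat[miss])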

SLIDE 28

Outline

1 The problem
2 Conditional independent clustering
3 Estimation
4 Clustering with MixtComp
5 Imputation with MixtComp
6 Conclusion

SLIDE 29

Present and future of MixtComp

Present

Clustering and/or imputation for mixed/missing/uncertain data
Current variables: continuous, categorical, integer
Greatly limits the preprocessing step: upload data as they are
Software as a Service (SaaS) facility, nothing to install on the laptop
Output: R objects and friendly/interactive graphical displays

Future

Add other kinds of widespread variables: ordinal, ranks, functional, directional
Add a variable selection ability to tackle (very) high dimension: variable clustering?
Gradually improve the server computing performance
