Model-based clustering with mixed/missing data using the new software MixtComp


SLIDE 1

Model-based clustering with mixed/missing data using the new software MixtComp

https://modal-research.lille.inria.fr/BigStat/

Christophe Biernacki

(with Thibault Deregnaucourt and Vincent Kubicki)

CMStatistics 2015 (ERCIM 2015) London (UK), 12-14 December 2015

SLIDE 2

Outline

1 The problem
2 Conditional independent clustering
3 Estimation
4 Clustering with MixtComp
5 Imputation with MixtComp
6 Conclusion

SLIDE 3

Clustering of complex data

Data: n individuals $x = (x_1, \dots, x_n) = (x^O, x^M)$ belonging to a space $\mathcal{X}$

Observed part of the individuals: $x^O$; missing part: $x^M$

Aim: estimation of the partition z and of the number of clusters K
Partition into K clusters $G_1, \dots, G_K$: $z = (z_1, \dots, z_n)$, $z_i = (z_{i1}, \dots, z_{iK})'$, with $x_i \in G_k \Leftrightarrow z_{ih} = \mathbb{I}_{\{h=k\}}$

Mixed, missing, uncertain

Individuals x^O                                        Partition z^O ⇔ Group
?      0.5            red            5                 (?, ?, ?) ⇔ ???
0.3    0.1            green          3                 (?, ?, ?) ⇔ ???
0.3    0.6            {red, green}   3                 (?, ?, ?) ⇔ ???
0.9    [0.25, 0.45]   red            ?                 (?, ?, ?) ⇔ ???
↓      ↓              ↓              ↓
cont.  cont.          categorical    integer

SLIDE 4

Model-based clustering

Cluster k is modelled by a parametric distribution: $X_i \mid Z_{ik} = 1 \overset{\text{i.i.d.}}{\sim} p(\cdot\,; \alpha_k)$

Cluster k has probability $\pi_k$, with $\sum_{k=1}^{K} \pi_k = 1$: $Z_i \overset{\text{i.i.d.}}{\sim} \mathrm{Mult}_K(1, \pi_1, \dots, \pi_K)$

Missing data $x^M$ are produced by a missing-completely-at-random (MCAR) process¹

Observed mixture pdf: with parameter $\theta = (\pi_1, \dots, \pi_K, \alpha_1, \dots, \alpha_K)$, it is written

$p(x_i^O; \theta) = \sum_{k=1}^{K} \pi_k\, p(x_i^O; \alpha_k) = \sum_{k=1}^{K} \pi_k \int_{x_i^M} p(x_i^O, x_i^M; \alpha_k)\, dx_i^M$

Maximum a posteriori (MAP): with $t_k(x_i^O; \theta) = p(Z_{ik} = 1 \mid x_i^O; \theta) = \dfrac{\pi_k\, p(x_i^O; \alpha_k)}{p(x_i^O; \theta)}$,

$\hat{z}_i = \arg\max_{k \in \{1, \dots, K\}} t_k(x_i^O; \theta)$

Seems suitable for missing/uncertain data, but which $p(\cdot\,; \alpha_k)$ for mixed data?

¹ Could be relaxed to missing at random (MAR)
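To fix ideas, here is a minimal R sketch of the MAP rule for a univariate two-component Gaussian mixture; all parameter values are hypothetical and the code is an illustration, not MixtComp's implementation:

  # Posterior probabilities t_k and MAP assignment for a univariate
  # 2-component Gaussian mixture (hypothetical parameters theta).
  pi_k <- c(0.4, 0.6); mu <- c(0, 4); s <- c(1, 1)
  map_assign <- function(x) {            # x: numeric vector of observations
    t_num <- sapply(1:2, function(k) pi_k[k] * dnorm(x, mu[k], s[k]))
    t_k   <- t_num / rowSums(t_num)      # t_k(x; theta): rows sum to 1
    max.col(t_k)                         # z-hat = argmax_k t_k
  }
  map_assign(c(-0.5, 2.1, 5.3))          # cluster labels for three points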

SLIDE 5

Outline

1 The problem
2 Conditional independent clustering
3 Estimation
4 Clustering with MixtComp
5 Imputation with MixtComp
6 Conclusion

SLIDE 6

High-dimensional today's data²

² S. Alelyani, J. Tang and H. Liu (2013). Feature Selection for Clustering: A Review. Data Clustering: Algorithms and Applications, 29

SLIDE 7

HD clustering: blessing (1/2)

A two-component d-variate Gaussian mixture with intra-dependency: $\pi_1 = \pi_2 = \tfrac{1}{2}$, $X_1 \mid z_{11} = 1 \sim \mathcal{N}_d(\mathbf{0}, \Sigma)$, $X_1 \mid z_{12} = 1 \sim \mathcal{N}_d(\mathbf{1}, \Sigma)$
Each variable provides an equal amount of its own separation information
The theoretical error decreases as d grows: $\mathrm{err}_{\mathrm{theo}} = \Phi\!\left(-\|\mu_2 - \mu_1\|_{\Sigma^{-1}}/2\right)$
The empirical error rate with the (true) intra-correlated model worsens as d grows
The empirical error rate with the (false) intra-independent model improves as d grows!

[Figure: a two-dimensional sample of the mixture in (x1, x2), and the error rate err versus dimension d (d = 1, …, 10) for the empirical correlated model, the empirical independent model, and the theoretical error]
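The error formula above can be evaluated directly; a minimal R sketch, assuming for illustration a Σ with unit variances and constant correlation ρ (a hypothetical choice):

  # err_theo = Phi(-||mu2 - mu1||_{Sigma^{-1}} / 2) for mu1 = 0, mu2 = 1.
  err_theo <- function(d, rho = 0.5) {
    Sigma <- matrix(rho, d, d); diag(Sigma) <- 1
    delta <- rep(1, d)                                  # mu2 - mu1
    maha  <- sqrt(drop(t(delta) %*% solve(Sigma) %*% delta))
    pnorm(-maha / 2)
  }
  round(sapply(1:10, err_theo), 3)   # decreases as d grows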

SLIDE 8

HD clustering: blessing (2/2)

[Figure: FDA projections (1st FDA axis vs 2nd FDA axis) for d = 2, d = 20, d = 200 and d = 400]

Neglecting intra-dependency in HD clustering yields a better bias/variance trade-offᵃ

ᵃ When variables convey no redundant cluster information; see conclusion

SLIDE 9

Mixed data: conditional independence everywhere

The aim is to combine continuous, categorical and integer data

$x_1 = (x_1^{\mathrm{cont}}, x_1^{\mathrm{cat}}, x_1^{\mathrm{int}})$

The proposed solution is to mix all types through inter-type conditional independence:

$p(x_1; \alpha_k) = p(x_1^{\mathrm{cont}}; \alpha_k^{\mathrm{cont}}) \times p(x_1^{\mathrm{cat}}; \alpha_k^{\mathrm{cat}}) \times p(x_1^{\mathrm{int}}; \alpha_k^{\mathrm{int}})$

In addition, for symmetry between types, intra-type conditional independence is assumed. One then only needs to define a univariate pdf for each variable type:
Continuous: Gaussian
Categorical: multinomial
Integer: Poisson
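A minimal R sketch of this product form, with hypothetical parameter values for one cluster:

  # log p(x1; alpha_k) as a sum of univariate log-pdfs:
  # Gaussian (continuous) + multinomial (categorical) + Poisson (integer).
  log_dens_cluster <- function(x_cont, x_cat, x_int, alpha) {
    sum(dnorm(x_cont, alpha$mu, alpha$sd, log = TRUE)) +  # continuous part
      sum(log(alpha$prob[x_cat])) +                       # categorical part
      sum(dpois(x_int, alpha$lambda, log = TRUE))         # integer part
  }
  alpha1 <- list(mu = c(0, 1), sd = c(1, 2),              # hypothetical alpha_k
                 prob = c(red = 0.7, green = 0.3), lambda = 4)
  log_dens_cluster(c(0.3, 0.1), "green", 3, alpha1)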

SLIDE 10

Outline

1 The problem
2 Conditional independent clustering
3 Estimation
4 Clustering with MixtComp
5 Imputation with MixtComp
6 Conclusion

SLIDE 11

SEM algorithm

A SEM algorithm estimates θ by maximizing the observed-data log-likelihood $\ell(\theta; x^O) = \ln p(x^O; \theta)$

Initialisation: $\theta^{(0)}$
Iteration no. q:
  E-step: compute the conditional probabilities $p(x^M, z \mid x^O; \theta^{(q)})$
  S-step: draw $(x^{M(q)}, z^{(q)})$ from $p(x^M, z \mid x^O; \theta^{(q)})$
  M-step: maximize $\theta^{(q+1)} = \arg\max_\theta \ln p(x^O, x^{M(q)}, z^{(q)}; \theta)$
Stopping rule: iteration number

Properties
simplicity because of conditional independence
classical M-steps
avoids local maxima
the mean of the sequence $(\theta^{(q)})$ approximates $\hat\theta$
the variance of the sequence $(\theta^{(q)})$ gives confidence intervals
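As a toy illustration of the loop above (not MixtComp's code), a minimal R sketch of SEM for a univariate two-component Gaussian mixture with MCAR missing entries:

  # Toy SEM for a univariate 2-component Gaussian mixture with MCAR
  # missing values (NA). Illustration only, with no numerical safeguards.
  set.seed(1)
  x <- c(rnorm(100, 0), rnorm(100, 4)); x[sample(200, 20)] <- NA
  K <- 2; n <- length(x); miss <- is.na(x); xc <- x
  theta <- list(pi = rep(1/K, K),
                mu = quantile(x, c(.25, .75), na.rm = TRUE),
                sd = rep(sd(x, na.rm = TRUE), K))
  Q <- 200; chain <- matrix(NA, Q, 3 * K)
  for (q in 1:Q) {
    # E/S-step: draw z_i given x_i^O, then the missing x_i^M given z_i
    dens <- sapply(1:K, function(k)
      ifelse(miss, theta$pi[k], theta$pi[k] * dnorm(x, theta$mu[k], theta$sd[k])))
    z <- apply(dens, 1, function(p) sample(K, 1, prob = p))
    xc[miss] <- rnorm(sum(miss), theta$mu[z[miss]], theta$sd[z[miss]])
    # M-step: closed-form updates from the completed data (x^O, x^M(q), z^(q))
    theta$pi <- tabulate(z, K) / n
    theta$mu <- tapply(xc, factor(z, levels = 1:K), mean)
    theta$sd <- tapply(xc, factor(z, levels = 1:K), sd)
    chain[q, ] <- c(theta$pi, theta$mu, theta$sd)
  }
  colMeans(chain[-(1:50), ])      # mean of the SEM sequence ~ theta-hat
  apply(chain[-(1:50), ], 2, sd)  # spread of the sequence -> confidence intervals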

SLIDE 12

SE algorithm

A SE algorithm then estimates $(x^M, z)$

Iteration no. q:
  E-step: compute the conditional probabilities $p(x^M, z \mid x^O; \hat\theta)$
  S-step: draw $(x^{M(q)}, z^{(q)})$ from $p(x^M, z \mid x^O; \hat\theta)$
Stopping rule: iteration number

Properties
simplicity because of conditional independence
the mean/mode of the sequence $(x^{M(q)}, z^{(q)})$ estimates $(x^M, z)$
confidence intervals are also derived
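Continuing the toy example from the previous slide, a minimal R sketch of the SE stage at a fixed $\hat\theta$ (reusing x, miss, theta, K and n from the SEM sketch):

  # Toy SE stage: theta is now held fixed and (x^M, z) is drawn
  # repeatedly, then summarised by its mean/mode over the draws.
  R <- 500
  zdraw <- matrix(NA_integer_, R, n); xdraw <- matrix(NA_real_, R, n)
  dens <- sapply(1:K, function(k)    # fixed, since theta-hat is fixed
    ifelse(miss, theta$pi[k], theta$pi[k] * dnorm(x, theta$mu[k], theta$sd[k])))
  for (r in 1:R) {
    z  <- apply(dens, 1, function(p) sample(K, 1, prob = p))
    xr <- x
    xr[miss] <- rnorm(sum(miss), theta$mu[z[miss]], theta$sd[z[miss]])
    zdraw[r, ] <- z; xdraw[r, ] <- xr
  }
  z_hat <- apply(zdraw, 2, function(v) which.max(tabulate(v, K)))  # mode
  x_hat <- colMeans(xdraw)                                         # mean
  ci <- apply(xdraw[, miss, drop = FALSE], 2,                      # 95% CIs
              quantile, probs = c(.025, .975))                     # for x^M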

SLIDE 13

Outline

1 The problem
2 Conditional independent clustering
3 Estimation
4 Clustering with MixtComp
5 Imputation with MixtComp
6 Conclusion

SLIDE 14

Prostate cancer data³

Individuals: 506 patients with prostatic cancer, grouped on clinical criteria into two stages (3 and 4) of the disease
Variables: d = 12 pre-trial variates measured on each patient: eight continuous (age, weight, systolic blood pressure, diastolic blood pressure, serum haemoglobin, size of primary tumour, index of tumour stage and histologic grade, serum prostatic acid phosphatase) and four categorical with various numbers of levels (performance rating, cardiovascular disease history, electrocardiogram code, bone metastases)
Some missing data: 62 missing values (≈ 1%)
We discard the classes (stages of the disease) in order to perform clustering

Questions

How many clusters? Which partition?

³ Byar DP, Green SB (1980). Bulletin Cancer, Paris 67:477-488

SLIDE 15

Create a free account in MixtComp⁴

https://modal-research.lille.inria.fr/BigStat/

It implements mixed/missing-data clustering as software as a service (SaaS)

⁴ See documentation at https://modal.lille.inria.fr/wikimodal/doku.php?id=mixtcomp

SLIDE 16

Two files to merge into a single zip file

Variable descriptor file: descriptor.csv
Data file: data.csv
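For illustration only, the two files might look as follows for three of the prostate variables; the exact model labels and the "?" missing-value code are assumptions, so refer to the documentation linked on the previous slide:

  data.csv (hypothetical): one row per patient, one column per variable
    Age,SBP,PF
    70,1.4,normal
    68,?,limited

  descriptor.csv (hypothetical): the univariate model attached to each variable
    Age,SBP,PF
    Gaussian,Gaussian,Multinomial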

SLIDE 17

Learn!

Step 1: input the zip file and K
Step 2: it is running!

SLIDE 18

Output

Option 1: output zip file
Option 2: instant viewing of clusters (variable-wise normalized entropy)

SLIDE 19

Output R format

SLIDE 20

Two strategies in competition

Strategy “mice⁵ + MixtComp”: MixtComp on the dataset completed by mice
> library(mice)
> data.imp <- mice(data)
> data.comp.mice <- complete(data.imp)
Strategy “full MixtComp”: MixtComp on the observed (uncompleted) dataset

⁵ http://cran.r-project.org/web/packages/mice/mice.pdf

SLIDE 21

Choosing K with the ICL criterion

[Figure: ICL versus K (K = 1, …, 7) for each of the two strategies]

mice + MixtComp: $\hat K = 7$        full MixtComp: $\hat K = 2$
. . . imputation before clustering may lose some cluster information
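For reference, a minimal R sketch of ICL in its usual BIC-minus-entropy form (assumed here; the server's exact implementation is not shown), computed from the maximised log-likelihood, the posterior probabilities and the number of free parameters:

  # ICL(K) = loglik - (nu/2) log n - classification entropy.
  # t: n x K matrix of posterior probabilities t_k(x_i^O; theta-hat).
  icl <- function(loglik, t, nu) {
    n <- nrow(t)
    entropy <- -sum(t * log(pmax(t, 1e-12)))  # guard against log(0)
    loglik - nu / 2 * log(n) - entropy
  }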

SLIDE 22

Partition quality with K = 2

Strategy           mice + MixtComp   full MixtComp
% misclassified    12.8              8.1

To be compared also with missing-data removal: 475 patients with non-missing data, using MixtComp for clustering, with the possibility to consider continuous, categorical or mixed data

Strategy           continuous only   categorical only   mixed cont/cat
% misclassified    9.46              47.16              8.63

Lessons:
risk of information loss when removing lines/columns with missing data
avoid completing missing data (imputation depends on the purpose)
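Since cluster labels are arbitrary, the % misclassified above must be computed up to a relabelling; a minimal self-contained R sketch:

  # Error rate between partitions, minimised over all K! label permutations.
  perms <- function(v) {
    if (length(v) == 1) return(list(v))
    out <- list()
    for (i in seq_along(v))
      for (p in perms(v[-i])) out <- c(out, list(c(v[i], p)))
    out
  }
  misclass <- function(z_hat, z_true) {
    K <- max(z_hat, z_true)
    100 * min(sapply(perms(1:K), function(p) mean(p[z_hat] != z_true)))
  }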

SLIDE 23

And for supervised classification?

Now use the predict functionality of MixtComp:
descriptor.csv + data.csv + output.RData (from the previous learn step) = NameYouWant.zip
Then the output format is the same as for the learn functionality of MixtComp

SLIDE 24

Outline

1 The problem
2 Conditional independent clustering
3 Estimation
4 Clustering with MixtComp
5 Imputation with MixtComp
6 Conclusion

SLIDE 25

Mixture models as an extremely flexible family of distributions

They allow any distribution to be approximated by increasing the number of components

[Figure: grey-level histogram (left) and estimated mixture density of grey levels (right)]
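A small illustration of this flexibility, as a sketch assuming the mclust package (not the talk's software): a skewed Gamma sample is approximated increasingly well by Gaussian mixtures of growing order:

  library(mclust)                            # assumed available
  set.seed(2)
  y <- rgamma(1000, shape = 2, rate = 0.5)   # a skewed target distribution
  for (K in c(1, 2, 5)) {
    fit <- Mclust(y, G = K, modelNames = "V")  # K-component Gaussian mixture
    cat("K =", K, " log-likelihood =", fit$loglik, "\n")
  }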

SLIDE 26

Cancer dataset with more missing data

Artificially add ≈ 30% missing data with a MCAR design (see the sketch after the figure below)
Then compare two imputation strategies:
Strategy “mice”: dataset completed by mice
> library(mice)
> data.imp <- mice(data)
> data.comp.mice <- complete(data.imp)
Strategy “full MixtComp”: MixtComp on the observed (uncompleted) dataset

[Figure: ICL versus K and BIC versus K (K = 1, …, 6)]

ICL: $\hat K = 2$        BIC: $\hat K = 4$
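A minimal R sketch of the MCAR degradation step (the 30% rate is the slide's; the data-frame name is hypothetical):

  # Remove ~30% of all entries completely at random (MCAR),
  # independently of any observed or unobserved values.
  add_mcar <- function(df, rate = 0.3) {
    for (j in seq_along(df))
      df[[j]][runif(nrow(df)) < rate] <- NA
    df
  }
  # cancer.mcar <- add_mcar(cancer)   # 'cancer' is a hypothetical data frame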

SLIDE 27

Imputation accuracy

Continuous variables: mean absolute difference between $x$ and $\hat x$

var           mice        MixtComp (K = 2)   MixtComp (K = 4)
Age           8.907143    5.546571           5.526861
Wt            13.51656    9.779485           9.731182
SBP           2.103226    1.788152           1.795820
DBP           1.317568    1.165201           1.169672
HG            21.67568    14.83514           14.51291
SZ            1.714899    1.160546           1.158105
SG            1.979866    1.386841           1.416053
AP            1.359299    1.027513           1.009126
Global mean   6.5718      4.5862             4.5400

Categorical variables: mean proportion of disagreement between $x$ and $\hat x$

var           mice        MixtComp (K = 2)   MixtComp (K = 4)
PF            0.1904762   0.0952381          0.0952381
HX            0.4121622   0.4391892          0.4121622
EKG           0.7564103   0.6858974          0.7179487
BM            0.1081081   0.1486486          0.1216216
Global mean   0.3668      0.3422             0.3367
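These two accuracy measures amount to the following minimal R sketch, applied per variable over the artificially removed entries (all arguments are placeholders):

  # Continuous: mean absolute difference; categorical: mismatch proportion.
  # x: true values, x_hat: imputed values, miss: mask of removed entries.
  mae_cont <- function(x, x_hat, miss) mean(abs(x[miss] - x_hat[miss]))
  err_cat  <- function(x, x_hat, miss) mean(x[miss] != x_hat[miss])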

SLIDE 28

Outline

1 The problem
2 Conditional independent clustering
3 Estimation
4 Clustering with MixtComp
5 Imputation with MixtComp
6 Conclusion

SLIDE 29

Present and future of MixtComp

Present

Clustering and/or imputation for mixed/missing/uncertain data
Current variables: continuous, categorical, integer
Greatly limits the preprocessing step: upload data as they are
Software as a Service (SaaS) facility, nothing to install on the laptop
Output: R objects and friendly/interactive graphical displays

Future

Add other kinds of widespread variables: ordinal, ranks, functional, directional
Add a variable selection ability to tackle (very) high dimension: variable clustering?
Gradually improve the server computing performance
