Introduction to cluster analysis and classification: Evaluating clustering


  1. HAL Id: hal-01810377, https://hal.inria.fr/hal-01810377. Submitted on 7 Jun 2018. To cite this version: Christophe Biernacki. Introduction to cluster analysis and classification: Evaluating clustering. Summer School on Clustering, Data Analysis and Visualization of Complex Data, May 2018, Catania, Italy. HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

  2. Introduction to cluster analysis and classification: Evaluating clustering. C. Biernacki. Summer School on Clustering, Data Analysis and Visualization of Complex Data, May 21-25 2018, University of Catania, Italy.

  3. Evaluating clustering. "Technical" evaluation: $\hat{z} = f(x, \delta\,[\,, \Delta, \mathrm{kernel}, \ldots], K, \mathrm{algo})$. "User" evaluation: a good clustering result is a clustering result that is useful to the end user. Both evaluation points of view always need to be combined.

  4. Outline: 1 Data factor; 2 Dissimilarity factor (and co); 3 Algorithm factor; 4 Number of clusters factor; 5 User factor; 6 To go further.

  5. The variable effect. Medicine¹: diseases may be classified by etiology (cause), pathogenesis (mechanism by which the disease is caused), or by symptom(s). Alternatively, diseases may be classified according to the organ system involved, though this is often complicated since many diseases affect more than one organ. And so on... [Figure: scatter plots of the same observations over Variable 1, Variable 2 and Variable 3.] ¹ Nosologie méthodique, dans laquelle les maladies sont rangées par classes, suivant le système de Sydenham, & l'ordre des botanistes, par François Boissier de Sauvages de Lacroix. Paris, Hérissant le fils, 1771.

  6. Need to compare partitions: empirical error rate. Two partitions z and ẑ; τ ranges over all permutations of {1, ..., K}. Empirical error rate: $\mathrm{err}(z, \hat{z}) = \frac{1}{n} \min_{\tau} \sum_{i=1}^{n} \mathbb{I}\{z_i \neq \tau(\hat{z}_i)\} \in \left[0, \frac{K-1}{K}\right]$. Partitions are closer when err is small. Restricted to comparing partitions with the same number of clusters. Example: z: G₁ = {a, b, c}, G₂ = {d, e, f}; ẑ: Ĝ₁ = {e, f}, Ĝ₂ = {a, b, c, d}; err(z, ẑ) = (1/6) min{5, 1} = 1/6.
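
A minimal sketch of this error rate in Python, assuming K is small enough that the K! permutations can be enumerated directly; the label vectors encode the slide's example (objects a, ..., f in order):

```python
# Empirical error rate: smallest misclassification count over all relabelings tau.
from itertools import permutations

def empirical_error(z, z_hat, K):
    """err(z, z_hat) = (1/n) * min over tau of #{i : z_i != tau(z_hat_i)}."""
    n = len(z)
    best = n
    for tau in permutations(range(K)):        # all K! relabelings of z_hat
        mismatches = sum(zi != tau[zhi] for zi, zhi in zip(z, z_hat))
        best = min(best, mismatches)
    return best / n

# Slide example: z = {a,b,c | d,e,f}, z_hat = {e,f | a,b,c,d}
z     = [0, 0, 0, 1, 1, 1]   # a b c d e f
z_hat = [1, 1, 1, 1, 0, 0]
print(empirical_error(z, z_hat, K=2))   # (1/6) * min{5, 1} = 0.1667
```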

  7. Need to compare partitions: Rand index. Two partitions z and ẑ. A measure based on agreement vs. disagreement between object pairs; not limited to partitions with the same number of clusters. Rand index [Rand 1971]: A = #pairs of elements of x in the same subset in z and in the same subset in ẑ; B = #pairs in different subsets in z and in different subsets in ẑ; C = #pairs in the same subset in z and in different subsets in ẑ; D = #pairs in different subsets in z and in the same subset in ẑ. $\mathrm{rand}(z, \hat{z}) = \frac{A + B}{A + B + C + D} = \frac{\text{nb. agree}}{\text{nb. agree} + \text{nb. disagree}} \in [0, 1]$. Partitions are closer when rand is high. Example: z: G₁ = {a, b, c}, G₂ = {d, e, f}; ẑ: Ĝ₁ = {a, b}, Ĝ₂ = {c, d, e}, Ĝ₃ = {f}; A = 2, B = 7, C = 4, D = 2, so rand(z, ẑ) = 0.6 (intermediate). Caution: use the adjusted Rand index [Hubert and Arabie 1985] to compare rand(z, ẑ) and rand(z, z̃) when K̂ ≠ K̃.
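
A matching sketch for the Rand index, counting agreeing pairs directly on the slide's example; for the adjusted version recommended above, scikit-learn provides sklearn.metrics.adjusted_rand_score:

```python
# Rand index: fraction of object pairs on which both partitions agree
# (pair together in both, or pair apart in both).
from itertools import combinations

def rand_index(z, z_hat):
    n = len(z)
    agree = sum((z[i] == z[j]) == (z_hat[i] == z_hat[j])
                for i, j in combinations(range(n), 2))
    return agree / (n * (n - 1) / 2)

# Slide example: z = {a,b,c | d,e,f}, z_hat = {a,b | c,d,e | f}
z     = [0, 0, 0, 1, 1, 1]
z_hat = [0, 0, 1, 1, 1, 2]
print(rand_index(z, z_hat))   # (A + B) / 15 = (2 + 7) / 15 = 0.6
```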

  8. Prostate cancer data: description². 475 patients out of 506 (individuals with missing values have been discarded). 8 quantitative variables, 4 categorical (some ordinal) variables. Two "evident" clusters for medical users: Stage 3 and Stage 4 of cancer. [Figure: two scatter plots, "Continuous data" on the 1st and 2nd PCA axes and "Categorical data" on the 1st and 2nd MCA axes.] ² Byar and Green (1980).

  9. Prostate cancer data: variable detail.

  10. Prostate cancer data: partition according to retained variables.

                               err       Stage 3 (1 / 2)    Stage 4 (1 / 2)
      quantitative             9.46%     247 / 26           19 / 183
      categorical (raw)        47.16%    142 / 131          120 / 82
      mixed quali/quanti       8.63%     252 / 21           20 / 182

      The partition varies with the retained variables, as expected. A general principle: categorical variables are less informative than quantitative ones. Here, however, adding the categorical variables improves on the quantitative-only result (err drops from 9.46% to 8.63%).
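
A hedged sketch of this comparison with k-means; the file name prostate.csv and the column typing are assumptions (the data set itself is not reproduced here), and k-means stands in for whichever algorithm the slides actually use:

```python
# Compare partitions obtained from different retained variables: quantitative
# columns only vs. quantitative + one-hot-coded categorical columns.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("prostate.csv")                           # hypothetical path
quant = df.select_dtypes(include="number")
categ = pd.get_dummies(df.select_dtypes(exclude="number"))
mixed = pd.concat([quant, categ], axis=1)

for name, X in [("quantitative", quant), ("mixed quali/quanti", mixed)]:
    z_hat = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
        StandardScaler().fit_transform(X))
    print(name, z_hat[:10])   # cross-tabulate against Stage 3 / Stage 4 for err
```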

  11. Prostate cancer data: partition according to recoded variables.

                               err       Stage 3 (1 / 2)    Stage 4 (1 / 2)
      categorical (raw)        47.16%    142 / 131          120 / 82
      categorical (MCA)        38.95%    175 / 98           87 / 115

      MCA is equivalent to recoding the categorical variables: raw data and MCA data are in a one-to-one mapping (no information loss). Recoding can nevertheless drastically impact the clustering result, which opens the question of which data units/coding to use. Currently, the user chooses the unit (prior or posterior choice); the next lesson formalizes this in order to go further.
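
A rough sketch of this coding effect under stated assumptions: true MCA is available in, e.g., R's FactoMineR or Python's prince package; as a crude stand-in, the 0/1 indicator columns are reweighted chi-square-style (rare categories up-weighted). This keeps the information in one-to-one correspondence with the raw coding but changes the geometry, and hence possibly the partition:

```python
# Same categorical information, two codings: raw 0/1 dummies vs. dummies
# rescaled by 1/sqrt(category frequency), an MCA-flavoured reweighting.
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

categ = pd.get_dummies(df.select_dtypes(exclude="number"))  # df as in the sketch above
raw = categ.to_numpy(dtype=float)
recoded = raw / np.sqrt(raw.mean(axis=0))                   # chi-square-style weights

for name, X in [("categorical (raw)", raw), ("categorical (recoded)", recoded)]:
    z_hat = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    print(name, z_hat[:10])   # the two codings may yield different partitions
```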

  12. Prostate cancer data: partition according to missing data. Use the reduced data set without the individuals having missing data (n = 475), or the completed data set where missing data are imputed³ (n = 506). In both cases, use all mixed variables (details in the next lesson).

      Data set    completed data    reduced data
      err         12.8              8.1

      It is common to apply a data "pretreatment" such as missing-data imputation, but be careful: it can impact the clustering. Imputation only gives an estimated data set x̂, which is a "deteriorated" data set; as a consequence, it can lead to a "deteriorated" clustering result. See the next lesson for a formalization of this problem. ³ Using the mice package: http://cran.r-project.org/web/packages/mice/mice.pdf
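
A hedged sketch of the two pipelines; the slides use the R package mice, while here sklearn's (much cruder) SimpleImputer stands in, only to show the complete-case vs. imputed contrast (df as in the sketches above):

```python
# Cluster the reduced (complete-case) data set and the completed (imputed) one,
# then compare each resulting partition to the Stage 3 / Stage 4 labels.
from sklearn.cluster import KMeans
from sklearn.impute import SimpleImputer

X = df.select_dtypes(include="number")
reduced   = X.dropna()                                        # drop individuals with NAs
completed = SimpleImputer(strategy="mean").fit_transform(X)   # impute instead

z_reduced   = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(reduced)
z_completed = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(completed)
# err(z, z_reduced) and err(z, z_completed) can then be computed as on slide 6.
```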

  13. Stability of a clustering result. Do not forget that ẑ is just an estimate of a (hypothetical true) z, so the statistical properties of this estimate should be addressed, such as its stability (variance). A simple (but computationally demanding) attempt: use bootstrap samples x⁽ᵇ⁾ (b = 1, ..., B); obtain the bootstrap partitions ẑ⁽ᵇ⁾; deduce, for instance, confidence regions on the centers μ through the related centers μ⁽ᵇ⁾. Be careful with the permutation of the labelling! See the next lesson for more on the statistical properties (needs formalizing)...
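
A minimal sketch of this bootstrap idea with k-means on placeholder data; the label-permutation caveat is handled by matching each bootstrap solution's centers to the reference centers with the Hungarian algorithm (scipy.optimize.linear_sum_assignment):

```python
# Bootstrap the data, re-cluster, align labels, and measure the spread of the
# centers mu^(b) across replicates as a stability diagnostic.
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                    # placeholder data
K, B = 3, 100

ref = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X).cluster_centers_
boot = np.empty((B, K, X.shape[1]))
for b in range(B):
    Xb = X[rng.integers(0, len(X), size=len(X))]               # bootstrap sample x^(b)
    mu = KMeans(n_clusters=K, n_init=10, random_state=0).fit(Xb).cluster_centers_
    _, perm = linear_sum_assignment(cdist(ref, mu))            # undo label permutation
    boot[b] = mu[perm]

print(boot.std(axis=0))   # per-center, per-coordinate spread ~ (in)stability
```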

  14. Outline: 1 Data factor; 2 Dissimilarity factor (and co); 3 Algorithm factor; 4 Number of clusters factor; 5 User factor; 6 To go further.

  15. Effect of the metric M (1/5). $\mathcal{X} = \mathbb{R}^2$, $M = \begin{pmatrix} a & 0 \\ 0 & 1 \end{pmatrix}$, $x_1 = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$, $x_2 = \begin{pmatrix} 3 \\ 0 \end{pmatrix}$, $x_3 = \begin{pmatrix} 0 \\ 1 \end{pmatrix}$. Then $\delta_M(x_1, x_2)^2 = (x_1 - x_2)' M (x_1 - x_2) = a\,(x_{21} - x_{11})^2 = 9a$ and $\delta_M(x_1, x_3)^2 = (x_1 - x_3)' M (x_1 - x_3) = (x_{32} - x_{12})^2 = 1$.

  16. Effect of the metric M (2/5). $\delta_M(x_1, x_2)^2 \leq \delta_M(x_1, x_3)^2 \iff a \leq \frac{1}{9}$. The distance is impacted by the metric, and thus the clustering can be too. In a sense, the metric is also related to variable selection (try a = 0...).
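
A minimal numeric check of this computation; the crossover at a = 1/9 and the a = 0 case (the first coordinate dropped entirely) show why the metric acts like a soft variable selection:

```python
# Squared distance under the diagonal metric M = diag(a, 1).
import numpy as np

def dist2_M(u, v, a):
    d = u - v
    return d @ np.diag([a, 1.0]) @ d

x1, x2, x3 = np.array([0., 0.]), np.array([3., 0.]), np.array([0., 1.])
for a in (1.0, 1/9, 0.0):
    print(f"a={a:.3g}: d(x1,x2)^2={dist2_M(x1, x2, a):.3f}, "
          f"d(x1,x3)^2={dist2_M(x1, x3, a):.3f}")
# a=1:   9.000 vs 1.000  -> x3 is closer to x1
# a=1/9: 1.000 vs 1.000  -> tie
# a=0:   0.000 vs 1.000  -> x2 collapses onto x1: variable 1 is selected away
```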
