CLUSTER ANALYSIS WITH K-MEANS: What about the details?
Maurice ROUX, Ex-Professor, Paul Cezanne University, Marseille, France
mrhroux@yahoo.fr

K-means: what about the details?

Introduction
• k-means type algorithms are very popular
• they are fast
• they can handle huge data sets
• they use a very simple, easily understood scheme

Introduction: practical problems
• the quality of the results depends heavily on the initialization
• k-means requires the number of clusters to be chosen beforehand
How to deal with these issues?

The classical solutions
1. Initialization: repeat many random initializations and retain the solution that maximizes the between-cluster sum of squared distances (BSS).
2. Number K of clusters: try several values of K and retain the one that gives the best value of some given criterion.
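The restart strategy of point 1 can be sketched as follows. This is a minimal illustration in plain Python, not the code used for the study: the function names (kmeans_once, best_of_restarts), the random choice of K data points as initial centers and the default of 50 restarts are assumptions made for the example.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))

def kmeans_once(points, k, rng, max_iter=100):
    """One batch k-means run started from k randomly chosen data points."""
    centers = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:      # assignments are stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:               # an empty cluster keeps its old center
                centers[c] = centroid(members)
    return labels, centers

def bss(points, labels, centers):
    """Between-cluster sum of squares: sum of n_k * ||center_k - grand mean||^2."""
    g = centroid(points)
    return sum(labels.count(c) * sq_dist(centers[c], g)
               for c in range(len(centers)))

def tss(points):
    """Total sum of squared distances to the grand mean."""
    g = centroid(points)
    return sum(sq_dist(p, g) for p in points)

def best_of_restarts(points, k, n_restarts=50, seed=0):
    """Run k-means n_restarts times and keep the partition with the largest BSS."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_restarts):
        labels, centers = kmeans_once(points, k, rng)
        score = bss(points, labels, centers)
        if best is None or score > best[0]:
            best = (score, labels, centers)
    return best
```

Dividing the retained BSS by tss(points) gives the BSS/TSS ratio, which lies between 0 and 1. Repeating best_of_restarts for several values of K is then a direct way to apply point 2, whatever criterion is chosen.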

The details to take care of
1. Initialization: how many random initializations?
2. Number K of clusters: which criterion should be used to evaluate the results?

The present study: methods
• apply to real data sets the usual methods, which have mostly been tested on artificial data sets
• try to settle both the selection of K and the number of random initializations in the classical batch K-means algorithm
• can the processing of a set of partitions (a "cluster ensemble") bring more information about the data set?

Plan of the presentation
1. Some quality indexes of a partition, illustrated with an artificial data set
2. Real life data sets
3. Discussion

PART 1: quality indexes and an artificial data set
Quality indexes: 12 classical formulas for evaluating the fit of a partition to a given distance or dissimilarity.
Artificial data set: a 20-point sample in 2-D from J.P. Nakache and J. Confais (2010) «Approche pragmatique de la classification», Technip, Paris (p. 197).

Quality indexes: parametric

Index                         Type                    Variation
BSS / TSS                     isolation/compactness   [0; 1]
Theta (Guénoche, 2003)        isolation/compactness   [0; ∞]
Davies-Bouldin (1979)         compactness/isolation   [0; ∞]
Dunn (1974)                   isolation/compactness   [0; ∞]
Hubert & Levin (1976)         compactness             [0; 1]
Silhouette (Rousseeuw, 1987)  isolation/compactness   [-1; +1]

Quality indexes: non-parametric

Index                         Type          Variation
Yule (1900)                   correlation   [-1; +1]
Adjusted Rand (1985)          correlation   [0; 1]
Fowlkes & Mallows (1983)      correlation   [0; 1]
Goodman & Kruskal (1954)      correlation   [-1; +1]
Kendall's tau (1938)          correlation   [-1; +1]
Contingency Khi-2             correlation   [0; ∞]

A small 2-D example by J.P. Nakache and J. Confais (2010)
[Figure: scatter plot of the 20 points, labeled A to T; horizontal axis from 0 to 60, vertical axis from 0 to 50]

Small example: optimal criteria values for 50 random restarts in the K-means algorithm

Index     K-m2c   K-m3c   K-m4c   K-m5c   K-m6c   K-m7c   K-m8c   K-m9c   K-m10c  K-m11c
BSS/TSS   0.4673  0.7354  0.8603  0.9001  0.9345  0.9486  0.9575  0.9669  0.9679  0.9798
Theta     1.6828  2.288   2.7846  2.993   3.6454  3.9283  4.1094  4.3701  4.2197  5.0056
DB        1.1781  0.949   0.7766  0.8269  0.9678  0.9063  0.829   0.7653  0.9622  0.6984
Dunn      1.6287  1.7433  2.0693  1.3138  1.4258  1.2447  1.0815  1.2909  0.6573  0.8751
HL        0.8402  0.9378  0.9747  0.9682  0.9793  0.9852  0.9869  0.9903  0.9756  0.9923
Silh      0.3889  0.4765  0.5523  0.5385  0.4866  0.4992  0.5144  0.5169  0.4537  0.5848
GK        0.6471  0.8741  0.9595  0.9535  0.9684  0.9758  0.9772  0.9816  0.9586  0.9897
Tau       0.3245  0.369   0.3487  0.2944  0.2149  0.1929  0.1768  0.1608  0.1315  0.1177
Yule      0.6754  0.9485  0.9815  0.9692  0.9838  0.9887  0.9933  0.9969  0.9816  0.9955
AdRand    0.3884  0.6992  0.7962  0.7258  0.7615  0.7859  0.8246  0.8708  0.6916  0.8221
Fowlkes   0.6813  0.7895  0.8444  0.7778  0.7917  0.8095  0.8421  0.8824  0.7143  0.8333
Khi-2     3808.5  5827.9  6062.7  5120.6  3789.8  3434.8  3154.3  2883.8  2306.4  2097.5

Small example: optimal criteria values for 50 random restarts
$$\mathrm{Theta} = \frac{\operatorname{Mean} D_b}{\operatorname{Mean} D_w} \qquad \mathrm{Dunn} = \frac{\operatorname{Min} D_b}{\operatorname{Max} D_w}$$
[Figure: Theta and Dunn indexes versus number of clusters (2c to 11c)]
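The two ratios can be illustrated with a short Python sketch. It rests on an assumption the slide does not state: D_w is taken as the list of mean within-cluster distances (one value per cluster) and D_b as the list of mean between-cluster distances (one value per pair of clusters), as in the Davies-Bouldin definitions. The function names are hypothetical, and singleton clusters, which have no within-cluster distance, are simply skipped.

```python
from itertools import combinations
from math import dist
from statistics import mean

def cluster_distances(points, labels):
    """Return (D_w, D_b): mean within-cluster distances, one per cluster with
    at least two points, and mean between-cluster distances, one per pair
    of clusters. This reading of D_w and D_b is an assumption."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    d_w = [mean(dist(a, b) for a, b in combinations(members, 2))
           for members in clusters.values() if len(members) > 1]
    d_b = [mean(dist(a, b) for a in clusters[j] for b in clusters[k])
           for j, k in combinations(clusters, 2)]
    return d_w, d_b

def theta_index(points, labels):
    """Theta = Mean D_b / Mean D_w (needs at least one non-singleton cluster)."""
    d_w, d_b = cluster_distances(points, labels)
    return mean(d_b) / mean(d_w)

def dunn_index(points, labels):
    """Dunn = Min D_b / Max D_w (same requirement as theta_index)."""
    d_w, d_b = cluster_distances(points, labels)
    return min(d_b) / max(d_w)
```

Both indexes grow when clusters are far apart relative to their internal spread, so larger values indicate a better partition. Dunn depends only on the worst pair of clusters and the loosest cluster, which makes it more sensitive to a single poor cluster than Theta.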

Small example: optimal criteria values for 50 random restarts
[Figure: Davies-Bouldin index (DB) versus number of clusters (2c to 11c)]

Davies-Bouldin index (1979)
$$DB(k) = \max \left\{ \frac{D_w(k) + D_w(j)}{D_b(j, k)} \;\middle|\; j \neq k \right\}$$
$$D_w(k) = \operatorname{Mean} \{ d_{ii'} \mid i \in k ;\; i' \in k ;\; i \neq i' \}$$
$$D_b(j, k) = \operatorname{Mean} \{ d_{ii'} \mid i \in j ;\; i' \in k \}$$
$$DB = \operatorname{Mean}_{k \in K} DB(k)$$
Type: compactness / isolation
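The definitions above translate almost line by line into code. The following Python sketch is only an illustration: the function names are hypothetical, distances are assumed Euclidean, and D_w of a singleton cluster (undefined in the formulas, since it has no pair of objects) is set to 0 by convention.

```python
from itertools import combinations
from math import dist

def _mean(values):
    """Arithmetic mean; returns 0.0 for an empty sequence (singleton clusters)."""
    values = list(values)
    return sum(values) / len(values) if values else 0.0

def d_within(points, labels, k):
    """D_w(k): mean distance between distinct objects of cluster k."""
    members = [p for p, l in zip(points, labels) if l == k]
    return _mean(dist(a, b) for a, b in combinations(members, 2))

def d_between(points, labels, j, k):
    """D_b(j, k): mean distance between objects of cluster j and cluster k."""
    in_j = [p for p, l in zip(points, labels) if l == j]
    in_k = [p for p, l in zip(points, labels) if l == k]
    return _mean(dist(a, b) for a in in_j for b in in_k)

def davies_bouldin(points, labels):
    """DB: mean over clusters k of max over j != k of (D_w(k) + D_w(j)) / D_b(j, k).
    Requires at least two clusters and no two clusters at zero mean distance."""
    clusters = sorted(set(labels))
    per_cluster = []
    for k in clusters:
        ratios = [(d_within(points, labels, k) + d_within(points, labels, j))
                  / d_between(points, labels, j, k)
                  for j in clusters if j != k]
        per_cluster.append(max(ratios))
    return _mean(per_cluster)
```

Because the within-cluster terms are in the numerator, smaller DB values indicate more compact and better isolated clusters.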

Small example: best partition in 4 clusters by k-means out of 50 random restarts, DB = 0.7766
[Figure: scatter plot of the 20 points (A to T) showing this partition; horizontal axis from 0 to 60, vertical axis from 0 to 50]

Small example: partition in 4 clusters, DB = 0.7539
[Figure: scatter plot of the 20 points (A to T) showing this partition; horizontal axis from 0 to 60, vertical axis from 0 to 50]

Small example: partition in 4 clusters, DB = 0.7514
[Figure: scatter plot of the 20 points (A to T) showing this partition; horizontal axis from 0 to 60, vertical axis from 0 to 50]

Small example: optimal criteria values for 50 random restarts
[Figure: Yule, adjusted Rand and Fowlkes-Mallows indexes versus number of clusters (2c to 11c)]

Goodman-Kruskal's index and Kendall's tau
[Figure: GK and Tau versus number of clusters (2c to 11c)]

Non-parametric indexes based on quadruples of objects

                                          Partition distances
                                    u_ii' < u_jj'     u_ii' > u_jj'
Initial distances  d_ii' < d_jj'    concordant        discordant
                   d_ii' > d_jj'    discordant        concordant

Kendall's Tau (1938) and Goodman and Kruskal index (1954)
S+ = number of concordant quadruples
S- = number of discordant quadruples
N = number of object pairs
$$GK = \frac{S^+ - S^-}{S^+ + S^-} \qquad \mathrm{Tau} = \frac{S^+ - S^-}{N(N-1)/2}$$
Type: correlation coefficient

Contingency Khi-2 over quadruples
[Figure: Khi-2 versus number of clusters (2c to 11c)]

Some indexes should be discarded
A. Monotonic trend (values steadily increasing or decreasing with the number of clusters)
• BSS/TSS
• Theta
• Goodman-Kruskal
• Hubert-Levin
B. Preference for unbalanced partitions
• Davies-Bouldin

Analyzing the results by correspondence analysis
A special data table for analyzing the results of K-means: the confusion or co-association table.
From a set P of partitions (with the same number of clusters), count the number of times two objects, i and i', fall in the same cluster:
$$c_{ii'} = \operatorname{Card} \{ p \in P \mid k_p(i) = k_p(i') \}$$
where k_p(i) is the cluster to which i belongs in partition p.
Submit this table C to correspondence analysis.
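Building the table C is a simple counting exercise, sketched below in Python. The function name and the representation of a partition as a list of cluster labels (one per object) are assumptions made for the example; the correspondence analysis step itself is not shown.

```python
def co_association(partitions):
    """Return the co-association table C as a list of lists, where C[i][j]
    is the number of partitions in which objects i and j share a cluster.
    Each partition is a list of cluster labels, one per object."""
    n = len(partitions[0])
    table = [[0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    table[i][j] += 1
    return table
```

Cluster labels only need to be consistent within a partition, not across partitions, because the count compares labels of two objects inside the same partition. The table is symmetric and its diagonal equals the number of partitions.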

Small example: 15 distinct partitions in 4 clusters after 50 restarts
Correspondence analysis of the co-association matrix
[Figure: plane of the first two factors, F1 (57.7 %) and F2 (35 %). Groups of points: {F, J, R}, {K, M, P}, {A, C, D, L, O, Q, S}, {G, N, T} and {H, I}, with E and B plotted apart. Annotations: "3 or 4 clusters?" and "Intermediate or anomalous objects?"]

Correspondence analysis supports the validity of the 3-cluster partition. It reveals the border position of points B, H and I.
[Figure: scatter plot of the 20 points (A to T); horizontal axis from 0 to 60, vertical axis from 0 to 50]
Cluster { K, M, P, R, F, J } contains 2 sub-clusters.

PART 2: real life examples
• Leukemia38
• Alpes55
• Yeast237

Real life examples: Leukemia38
Golub T.R., Slonim D.K., Tamayo P., Huard C., Gaasenbeek M., Mesirov J.P., Coller H., Loh M.L., Downing J.R., Caligiuri M.A., Bloomfield C.D., Lander E.S. (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, vol. 286, pp. 531-537. //www.sciencemag.org
Handl J., Knowles J. and Kell D.B. (2005) Computational cluster validation in post-genomic data analysis. Bioinformatics, 21(15): 3201-3212.
Data table: 38 tissues x 100 genes, quantitative levels of gene expression. There are 3 groups of tissues, known a priori.

Leukemia38: correspondence analysis of raw data
[Figure: plane of the first two factors, F1 (27.5 %) and F2 (23.4 %); the 38 tissues are plotted with their group labels T, B and M]
