CLUSTER ANALYSIS WITH K-MEANS
What about the details?
Maurice ROUX Ex-Professor Paul Cezanne University Marseille, France mrhroux@yahoo.fr
Introduction
k-means type algorithms are very popular
K-means: what about the details ?
Index                          Type                    Range of variation
BSS / TSS                      isolation/compactness   [ 0; 1]
Theta (Guénoche, 2003)         isolation/compactness   [ 0; ∞ ]
Davies-Bouldin (1979)          compactness/isolation   [ 0; ∞ ]
Dunn (1974)                    isolation/compactness   [ 0; ∞ ]
Hubert & Levin (1976)          compactness             [ 0; 1]
Silhouette (Rousseeuw, 1987)   isolation/compactness   [-1; +1]
Index                          Type          Range of variation
Yule (1900)                    correlation   [-1; +1]
Adjusted Rand (1985)           correlation   [ 0; 1]
Fowlkes & Mallows (1983)       correlation   [ 0; 1]
Goodman & Kruskal (1954)       correlation   [-1; +1]
Kendall's tau (1938)           correlation   [-1; +1]
Contingency Khi-2              correlation   [ 0; ∞ ]
A small 2-D example by J.P. Nakache and J. Confais (2010)
[Figure: scatter plot of the 20 points A to T.]
Small example: optimal criteria values for 50 random restarts in K-means algorithm

         K-m2c   K-m3c   K-m4c   K-m5c   K-m6c   K-m7c   K-m8c   K-m9c   K-m10c  K-m11c
BSS/TSS  0.4673  0.7354  0.8603  0.9001  0.9345  0.9486  0.9575  0.9669  0.9679  0.9798
Theta    1.6828  2.288   2.7846  2.993   3.6454  3.9283  4.1094  4.3701  4.2197  5.0056
DB       1.1781  0.949   0.7766  0.8269  0.9678  0.9063  0.829   0.7653  0.9622  0.6984
Dunn     1.6287  1.7433  2.0693  1.3138  1.4258  1.2447  1.0815  1.2909  0.6573  0.8751
HL       0.8402  0.9378  0.9747  0.9682  0.9793  0.9852  0.9869  0.9903  0.9756  0.9923
Silh     0.3889  0.4765  0.5523  0.5385  0.4866  0.4992  0.5144  0.5169  0.4537  0.5848
GK       0.6471  0.8741  0.9595  0.9535  0.9684  0.9758  0.9772  0.9816  0.9586  0.9897
Tau      0.3245  0.369   0.3487  0.2944  0.2149  0.1929  0.1768  0.1608  0.1315  0.1177
Yule     0.6754  0.9485  0.9815  0.9692  0.9838  0.9887  0.9933  0.9969  0.9816  0.9955
AdRand   0.3884  0.6992  0.7962  0.7258  0.7615  0.7859  0.8246  0.8708  0.6916  0.8221
Fowlkes  0.6813  0.7895  0.8444  0.7778  0.7917  0.8095  0.8421  0.8824  0.7143  0.8333
Khi-2    3808.5  5827.9  6062.7  5120.6  3789.8  3434.8  3154.3  2883.8  2306.4  2097.5
Small example: optimal criteria values for 50 random restarts
Theta = Mean Db / Mean Dw
Dunn = Min Db / Max Dw
[Figure: Theta and Dunn plotted against the number of clusters, 2c to 11c.]
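The two ratios Theta = Mean Db / Mean Dw and Dunn = Min Db / Max Dw can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: Db is taken to be the set of Euclidean distances between points lying in different clusters, and Dw the set of distances between points of the same cluster. The slides do not spell out these definitions, and the function name `theta_and_dunn` is hypothetical.

```python
import numpy as np

def theta_and_dunn(X, labels):
    """Return (Theta, Dunn) for one partition.

    Assumption: Db = distances between points of different clusters,
    Dw = distances between points of the same cluster (Euclidean).
    At least one cluster must contain two or more points.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Full matrix of pairwise Euclidean distances.
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))
    # Keep each unordered pair of points once (upper triangle).
    rows, cols = np.triu_indices(len(X), k=1)
    d = D[rows, cols]
    same_cluster = labels[rows] == labels[cols]
    dw = d[same_cluster]    # within-cluster distances
    db = d[~same_cluster]   # between-cluster distances
    theta = db.mean() / dw.mean()
    dunn = db.min() / dw.max()
    return theta, dunn

# Two tight, well separated pairs of points.
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
labels = np.array([0, 0, 1, 1])
theta, dunn = theta_and_dunn(X, labels)
print(theta, dunn)
```

Both ratios grow when clusters are far apart relative to their internal spread; Dunn relies on extreme values, which makes it more sensitive to single outlying points than Theta.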
Small example: optimal criteria values for 50 random restarts
[Figure: Davies-Bouldin index (DB) plotted against the number of clusters, 2c to 11c.]
Small example: best partition in 4 clusters by k-means
[Figure: scatter plot of the points A to T showing this partition.]
Small example: partition in 4 clusters, DB = 0.7539
[Figure: scatter plot of the points A to T showing this partition.]
Small example: partition in 4 clusters, DB = 0.7514
[Figure: scatter plot of the points A to T showing this partition.]
Small example: optimal criteria values for 50 random restarts
[Figure: Yule, adjusted Rand and Fowlkes-Mallows indexes plotted against the number of clusters, 2c to 11c.]
[Figure: Goodman-Kruskal's index (GK) and Kendall's tau plotted against the number of clusters, 2c to 11c.]
[Figure: initial distances compared with partition distances.]
[Figure: contingency Khi-2 over quadruples plotted against the number of clusters, 2c to 11c.]
A special data table for analysing the results of k-means: the confusion or co-association table.
From a set P of partitions (with the same number of clusters), count the number of times two objects, i and i', fall in the same cluster:
c_ii' = Card { p ∈ P | k_p(i) = k_p(i') }
where k_p(i) is the cluster to which i belongs in partition p.
Submit this table C to Correspondence analysis.
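The counting rule above translates directly into code. The following is a minimal sketch, assuming each partition is given as a vector of cluster labels over the same objects; the function name `co_association` and the toy partitions are illustrative only.

```python
import numpy as np

def co_association(partitions):
    """Count, for every pair of objects, how many partitions put them
    in the same cluster. `partitions` has one row of labels per partition."""
    P = np.asarray(partitions)
    n_objects = P.shape[1]
    C = np.zeros((n_objects, n_objects), dtype=int)
    for labels in P:
        # Boolean matrix: True where objects i and i' share a label.
        C += (labels[:, None] == labels[None, :]).astype(int)
    return C

# Three partitions of four objects; the second is the first with
# its labels permuted, so it contributes the same co-memberships.
partitions = [
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
]
C = co_association(partitions)
print(C)
```

Because only equality of labels within a partition is tested, the table does not depend on how clusters are numbered from one run to the next, which is what makes it usable across random restarts.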
Small example: 15 distinct partitions in 4 clusters after 50 restarts
Correspondence analysis of the co-association matrix
[Figure: points A to T on the first factorial plane, F1: 57.7 %, F2: 35 %. Labelled groups: {H, I}; {A, C, D, L, O, Q, S}; {G, N, T}; {K, M, P}; {F, J, R}. Annotations on the plot: "3 or 4 clusters?", "intermediate", "anomalous".]
Correspondence analysis suggests the validity of the 3-cluster partition.
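For readers without a Correspondence analysis program at hand, the factorial coordinates and percentages of inertia (the F1, F2 figures quoted on the plots) can be obtained from a singular value decomposition. The sketch below uses the standard SVD formulation of correspondence analysis; it is not the author's software, and the function name and the small co-association table are made up for illustration.

```python
import numpy as np

def correspondence_analysis(C):
    """Row coordinates and percentages of inertia of a non-negative table C."""
    C = np.asarray(C, dtype=float)
    P = C / C.sum()              # correspondence matrix
    r = P.sum(axis=1)            # row masses
    c = P.sum(axis=0)            # column masses
    # Standardised residuals from the independence model r c^T.
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sing, _ = np.linalg.svd(S, full_matrices=False)
    inertia = sing ** 2
    keep = inertia > 1e-12       # drop null axes
    # Principal coordinates of the rows.
    coords = (U[:, keep] * sing[keep]) / np.sqrt(r)[:, None]
    pct = 100.0 * inertia[keep] / inertia[keep].sum()
    return coords, pct

# Hypothetical co-association table of four objects (3 partitions).
C = np.array([[3, 3, 1, 0],
              [3, 3, 1, 0],
              [1, 1, 3, 2],
              [0, 0, 2, 3]])
coords, pct = correspondence_analysis(C)
print(pct)      # percentage of inertia carried by each axis
print(coords)   # one row of factorial coordinates per object
```

Objects that are always clustered together have identical rows in the table and therefore coincide on every axis, while objects that change cluster from run to run fall between the stable groups.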
[Figure: scatter plot of the points A to T.]
Cluster { K, M, P, R, F, J } contains 2 sub-clusters.
Golub T.R., Slonim D.K., Tamayo P., Huard C., Gaasenbeek M., Mesirov J.P., Coller H., Loh M.L., Downing J.R., Caligiuri M.A., Bloomfield C.D., Lander E.S. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression
Handl J., Knowles J. and Kell D.B. (2005). Computational cluster validation in post-genomic data analysis. Bioinformatics, 21(15): 3201-3212.
Data table: 38 tissues x 100 genes, quantitative levels of gene expression. There are 3 groups of tissues, known a priori.
[Figure: the 38 tissues, labelled by their a priori group (M, T or B), on the first two axes, F1: 27.5 %, F2: 23.4 %.]
         2 cl    3 cl    4 cl    5 cl    6 cl    7 cl    8 cl    9 cl    10 cl   11 cl   12 cl
Dunn     1.1383  1.2733  1.2051  1.0062  0.937   0.9788  0.9122  0.7284  0.8965  0.8404  0.5551
Silh     0.2033  0.2898  0.2745  0.2371  0.1509  0.1948  0.1436  0.1693  0.1862  0.1937  0.2006
Tau      0.2759  0.4006  0.3915  0.3565  0.2307  0.2363  0.1619  0.1931  0.171   0.1593  0.1564
Yule     0.7269  0.9544  0.9592  0.9394  0.8998  0.9093  0.9182  0.8799  0.9272  0.9478  0.9357
AdRand   0.4309  0.7238  0.7326  0.6745  0.5455  0.5621  0.5204  0.4783  0.5444  0.5822  0.5457
Fowlkes  0.711   0.8197  0.8186  0.7685  0.627   0.64    0.575   0.551   0.5976  0.625   0.5915
Khi-2    36312   76844   75467   64507   43922   43934   30733   32920   31717   30535   35609
Leukemia38: C.A. of the co-association table based on 3 clusters and 50 random restarts (but only 10 distinct partitions)
[Figure: tissues labelled M, T or B on the first factorial plane, F1: 86.6 %, F2: 8.3 %.]
Leukemia38: C.A. of the co-association table based on 4 clusters and 50 random restarts (45 distinct partitions)
[Figure: tissues labelled M, T or B on the first factorial plane, F1: 51.7 %, F2: 46.6 %.]
Tau            2 cl    3 cl    4 cl    5 cl    6 cl    7 cl    8 cl    9 cl    10 cl   11 cl
100 restarts   0.2759  0.4006  0.3915  0.3565  0.2307  0.2363  0.1619  0.1931  0.171   0.1593
1000 restarts  0.2759  0.4006  0.3915  0.3631  0.2445  0.2571  0.2063  0.1595  0.2175  0.1995
5000 restarts  0.2759  0.4006  0.3915  0.3631  0.2442  0.2217  0.2201  0.2008  0.2258  0.2329

BSS/TSS        2 cl    3 cl    4 cl    5 cl    6 cl    7 cl    8 cl    9 cl    10 cl   11 cl
100 restarts   0.2402  0.4607  0.5127  0.5545  0.5902  0.6225  0.6416  0.6641  0.6935  0.7046
1000 restarts  0.2402  0.4607  0.5127  0.5586  0.5948  0.6231  0.6585  0.6739  0.7014  0.7264
5000 restarts  0.2402  0.4607  0.5127  0.5586  0.5949  0.6287  0.6554  0.6829  0.7083  0.7215
Roux G. and Roux M. (1967). A propos de quelques méthodes de classification en phytosociologie. Revue de Statistique appliquée, 15(2): 59-72.
Benzécri J.P. et al. (1973). L'analyse des données, tome 1 : la Taxinomie [T1C no 2], Dunod, Paris, pp. 360-374.
Raw data: 55 relevés x 174 species, presence-absence (coded 1 and 0 respectively).
Data table: 10 principal coordinate axes from Jaccard's distances (60.2 % of inertia).
Alpes55: principal coordinates based on Jaccard similarity index
[Figure: relevés R1 to R55 on the first two principal coordinate axes, F1: 15.8 %, F2: 9.8 %.]
Nb.cl.   2       3       4       5       6       7       8       9       10      11
Dunn     1.2353  1.3235  1.2307  1.3128  1.1296  1.1965  1.2121  1.167   1.1404  1.0815
Silh     0.2132  0.2495  0.2785  0.3118  0.2961  0.2863  0.3031  0.2974  0.3019  0.2959
Tau      0.286   0.3547  0.3328  0.3371  0.2614  0.2204  0.194   0.1677  0.1616  0.1426
Yule     0.7054  0.8955  0.918   0.9281  0.9295  0.9565  0.9637  0.9622  0.9689  0.9733
AdRand   0.4127  0.609   0.6301  0.6462  0.6125  0.6546  0.6586  0.6314  0.6539  0.6568
Fowlkes  0.7089  0.7527  0.7372  0.7439  0.6866  0.7051  0.7     0.6689  0.6875  0.6855
Khi-2    179511  299481  296945  314563  244800  214518  195124  170181  164817  146699
[Figure: Dunn, Yule, AdRand and Fowlkes plotted against the number of clusters, 2 to 11.]
Alpes55: C.A. of the co-association table based on 5 clusters and 100 random restarts (85 distinct partitions)
[Figure: relevés R1 to R55 on the first factorial plane. Labelled groups: {R6, R8}; {R30, R31}; {R18, R19, R25, R26}; {R1, R2, R11, R21, R39, R43, R53}.]
Tau            2 cl    3 cl    4 cl    5 cl    6 cl    7 cl    8 cl    9 cl    10 cl   11 cl   12 cl   13 cl   14 cl
100 restarts   0.286   0.3547  0.3328  0.3371  0.2619  0.2248  0.1891  0.1767  0.158   0.1475  0.1281  0.1279  0.1151
1000 restarts  0.286   0.3547  0.3328  0.3371  0.2614  0.2204  0.1923  0.1735  0.1501  0.1476  0.1304  0.1186  0.1133
5000 restarts  0.286   0.3547  0.3328  0.3371  0.2614  0.2204  0.194   0.1677  0.1616  0.1426  0.1295  0.1139  0.1124
Cho R.J., Campbell M.J., Winzeler E.A., Steinmetz L., Conway A., Wodicka L., Wolfsberg T.G., Gabrielian A.E., Landsman D., Lockhart D.J., Davis R.W. (1998). A Genome-Wide Transcriptional Analysis of the Mitotic Cell
Data table: 237 genes x 17 time points, quantitative levels of gene expression. There are 4 groups known a priori as "functional categories".
[Figure: the 237 genes on the first two axes, F1: 20.2 %, F2: 16.1 %, marked by a priori class (Class 1 to Class 4).]
         2 cl     3 cl     4 cl     5 cl     6 cl     7 cl     8 cl     9 cl     10 cl    11 cl    12 cl    13 cl    14 cl
Dunn     1.03579  1.02831  1.07531  0.99867  0.95781  0.76343  0.78344  0.89028  0.54417  0.53582  0.54398  0.57214  0.53743
Silh     0.27291  0.30406  0.31826  0.31811  0.31178  0.26155  0.25695  0.29178  0.16725  0.15935  0.17297  0.14949  0.17444
Tau      0.29349  0.35412  0.38313  0.38377  0.37581  0.31987  0.32014  0.34515  0.19693  0.18665  0.19996  0.19924  0.18488
AdRand   0.45971  0.62734  0.73506  0.77682  0.80551  0.77156  0.76689  0.82365  0.51078  0.52573  0.52884  0.53271  0.53124
Fowlkes  0.73277  0.76875  0.82103  0.84407  0.86009  0.8242   0.81974  0.86594  0.57951  0.58627  0.59325  0.5953   0.58875
Khi-2    5.9E+07  9.7E+07  1.3E+08  1.4E+08  1.5E+08  1.4E+08  1.4E+08  1.5E+08  1.1E+08  1.2E+08  1.2E+08  1.3E+08  1.3E+08
[Figure: Dunn, Silh, Tau and AdRand plotted against the number of clusters, 2 to 14.]
Yeast237: C.A. of the co-association table based on 4 clusters and 100 random restarts (36 distinct partitions), axes 1 and 2
[Figure: genes plotted on axes 1 and 2, F1: 42.4 %, F2: 36.6 %, each point labelled by its class number (1 to 4).]
Yeast237: C.A. of the co-association table based on 4 clusters and 100 random restarts (36 distinct partitions), axes 1 and 3
[Figure: genes plotted on axes 1 and 3, F1: 42.4 %, F3: 20.5 %, each point labelled by its class number (1 to 4).]
Yeast237: C.A. of the co-association table based on 4 clusters and 100 random restarts (36 distinct partitions), axes 1 and 3, labels from the 4-cluster solution of the k-means
[Figure: genes plotted on axes 1 and 3, F1: 42.4 %, F3: 20.5 %, each point labelled by its k-means cluster number (1 to 4).]
Dunn           2 cl    3 cl    4 cl    5 cl    6 cl    7 cl    8 cl    9 cl    10 cl   11 cl   12 cl   13 cl   14 cl
200 restarts   1.0358  1.0283  1.0753  0.9987  0.9578  0.7634  0.7834  0.8903  0.5442  0.5358  0.544   0.5721  0.5374
2000 restarts  1.0358  1.0283  1.0753  0.9971  1.0076  0.8693  0.8172  0.8446  0.5877  0.5542  0.5438  0.5819  0.5641

BSS/TSS        2 cl    3 cl    4 cl    5 cl    6 cl    7 cl    8 cl    9 cl    10 cl   11 cl   12 cl   13 cl   14 cl
200 restarts   0.229   0.3682  0.4561  0.4923  0.5161  0.5372  0.5563  0.5765  0.5913  0.608   0.613   0.631   0.6381
2000 restarts  0.229   0.3682  0.4561  0.4922  0.5169  0.5396  0.5605  0.5776  0.5942  0.6096  0.6206  0.6315  0.6417
Some indexes show a trend toward increasing (or decreasing) values with respect to the number of clusters.
Davies-Bouldin favours unbalanced partitions.
The remaining indexes are:
Tau and Khi-2 are based upon quadruples, so they are time-consuming to compute.
Therefore the elected index is Dunn (in a modified version that diminishes the influence of outliers):
Dunn = Min { Mean[db(k, k')] } / Max { Mean[dw(k)] }
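The modified Dunn index replaces single extreme distances by means, taken per pair of clusters in the numerator and per cluster in the denominator. A rough sketch follows, assuming db(k, k') denotes the Euclidean distances between points of clusters k and k', and dw(k) the distances between points inside cluster k. The handling of singleton clusters (mean within distance set to 0) is an assumption of this sketch, as is the function name.

```python
import numpy as np

def modified_dunn(X, labels):
    """Min over cluster pairs of the mean between-cluster distance,
    divided by the max over clusters of the mean within-cluster distance."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))
    members = [np.flatnonzero(labels == k) for k in np.unique(labels)]

    mean_within = []
    for idx in members:
        if len(idx) < 2:
            mean_within.append(0.0)   # assumption: singleton has no spread
        else:
            sub = D[np.ix_(idx, idx)]
            upper = np.triu_indices(len(idx), k=1)
            mean_within.append(sub[upper].mean())

    mean_between = []
    for a in range(len(members)):
        for b in range(a + 1, len(members)):
            mean_between.append(D[np.ix_(members[a], members[b])].mean())

    return min(mean_between) / max(mean_within)

# Three clusters on a line; the closest pair of clusters is (0, 1).
X = np.array([[0.0], [1.0], [10.0], [12.0], [30.0], [31.0]])
labels = np.array([0, 0, 1, 1, 2, 2])
value = modified_dunn(X, labels)
print(value)
```

Averaging before taking the minimum and maximum means that a single stray point can no longer decide the value of the index on its own, which is the stated purpose of the modification.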
The number of distinct local optima increases
The values of BSS/TSS increase with the number of random restarts but better values of BSS/TSS do not imply higher values of other criteria. In general, increasing the number of random restarts does not change the choice of the number of clusters.
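The effect of random restarts on BSS/TSS can be explored with a small experiment. The sketch below is a plain Lloyd-type k-means written for illustration, not the program used for the results in these slides. The initialisation (k distinct data points drawn at random), the synthetic data and all function names are assumptions.

```python
import numpy as np

def kmeans_once(X, k, rng, n_iter=100):
    """One k-means run from k randomly chosen data points."""
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d2.argmin(axis=1)
        new_centers = centers.copy()
        for j in range(k):
            pts = X[labels == j]
            if len(pts) > 0:          # keep the old centre if a cluster empties
                new_centers[j] = pts.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

def bss_over_tss(X, labels):
    """Between-cluster sum of squares over total sum of squares."""
    tss = ((X - X.mean(axis=0)) ** 2).sum()
    wss = sum(((X[labels == j] - X[labels == j].mean(axis=0)) ** 2).sum()
              for j in np.unique(labels))
    return 1.0 - wss / tss

def best_of_restarts(X, k, n_restarts, seed=0):
    """Best BSS/TSS seen after each successive restart."""
    rng = np.random.default_rng(seed)
    best, history = -np.inf, []
    for _ in range(n_restarts):
        labels = kmeans_once(X, k, rng)
        best = max(best, bss_over_tss(X, labels))
        history.append(best)
    return history

# Synthetic data: three Gaussian groups in the plane.
data_rng = np.random.default_rng(1)
X = np.vstack([data_rng.normal(loc=c, scale=1.0, size=(20, 2))
               for c in ([0, 0], [6, 0], [3, 5])])
history = best_of_restarts(X, k=3, n_restarts=20)
print(history[0], history[-1])
```

The running best can only stay level or improve as restarts accumulate. Whether another index computed on the best partition also improves has to be checked separately, since a higher BSS/TSS does not guarantee it.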
Having obtained a number of distinct clusterings, most researchers focus on looking for a consensus partition. My suggestion is to submit the co-association matrix to Correspondence analysis, which allows for:
Yeast237)
Several papers deal with different initializations, for instance:
2010)
Krzanowski, 2005)
some consensus method as shown in the examples
useful (see the example Yeast237)
Working with real-life data sets:
Necessarily limited size of the data table:
k-means and index computations are time-consuming
processed by usual Correspondence analysis programs.