CLUSTER ANALYSIS WITH K-MEANS: What about the details? (Maurice Roux)

SLIDE 1

CLUSTER ANALYSIS WITH K-MEANS

What about the details ?

Maurice ROUX
Ex-Professor, Paul Cezanne University, Marseille, France
mrhroux@yahoo.fr

SLIDE 2

K-means: what about the details?

Introduction

  • k-means type algorithms are very popular
  • they are fast
  • they can handle huge data sets
  • they use a very simple, easily understood scheme

SLIDE 3

Introduction: practical problems

  • the quality of the results depends heavily on the initialization
  • k-means requires the number of clusters to be chosen beforehand

How to deal with these issues?

SLIDE 4

The classical solutions

  • 1. Initialization:

Repeat many random initializations and retain the solution which maximizes the between-cluster sum of squared distances (BSS).

  • 2. Number K of clusters:

Try several values of K and retain the one which leads to the best value of some given criterion.
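
A minimal sketch of the restart scheme described on this slide, assuming plain batch K-means with random starting centers drawn from the data and BSS/TSS as the retained criterion. The function names (`kmeans_once`, `bss_over_tss`, `best_of_restarts`) and the default of 50 restarts are illustrative choices, not the author's program.

```python
import numpy as np

def kmeans_once(X, k, rng, n_iter=100):
    """One run of batch K-means from k randomly chosen data points."""
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign every point to its nearest center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Recompute centers; an empty cluster keeps its previous center.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

def bss_over_tss(X, labels):
    """Between-cluster sum of squares divided by the total sum of squares."""
    g = X.mean(axis=0)
    tss = ((X - g) ** 2).sum()
    bss = sum(
        (labels == j).sum() * ((X[labels == j].mean(axis=0) - g) ** 2).sum()
        for j in np.unique(labels)
    )
    return bss / tss

def best_of_restarts(X, k, n_restarts=50, seed=0):
    """Keep the partition with the highest BSS/TSS over all restarts."""
    rng = np.random.default_rng(seed)
    best_score, best_labels = -np.inf, None
    for _ in range(n_restarts):
        labels = kmeans_once(X, k, rng)
        score = bss_over_tss(X, labels)
        if score > best_score:
            best_score, best_labels = score, labels
    return best_score, best_labels
```

For a fixed K, maximizing BSS is the same as maximizing BSS/TSS, since TSS does not depend on the partition.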

SLIDE 5

The details to take care of

  • 1. Initialization:

How many random initializations?

  • 2. Number K of clusters:

Which criterion should be used to evaluate the results?

SLIDE 6

The present study: methods

  • use real data sets to put into practice the usual methods, which are mostly tested on artificial data sets
  • try to solve both the selection of K and the choice of the number of random initializations in the classical batch K-means algorithm
  • can the processing of a set of partitions (a "cluster ensemble") bring more information on the data set?

SLIDE 7

Plan of the presentation

  • 1. Some quality indexes of a partition, illustrated with an artificial data set
  • 2. Real life data sets
  • 3. Discussion

SLIDE 8

PART 1: quality indexes and an artificial data set

Artificial data set: a 20-point sample in 2-D, from J.P. Nakache and J. Confais (2010), «Approche pragmatique de la classification», Technip, Paris (p. 197).

Quality indexes: 12 classical formulas for evaluating the fit of a partition to a given distance or dissimilarity.

SLIDE 9

Quality indexes: parametric

Index                          Type                    Range
BSS / TSS                      isolation/compactness   [0; 1]
Theta (Guénoche, 2003)         isolation/compactness   [0; ∞]
Davies-Bouldin (1979)          compactness/isolation   [0; ∞]
Dunn (1974)                    isolation/compactness   [0; ∞]
Hubert & Levin (1976)          compactness             [0; 1]
Silhouette (Rousseeuw, 1987)   isolation/compactness   [-1; +1]

SLIDE 10

Quality indexes: non-parametric

Index                      Type          Range
Yule (1900)                correlation   [-1; +1]
Adjusted Rand (1985)       correlation   [0; 1]
Fowlkes & Mallows (1983)   correlation   [0; 1]
Goodman & Kruskal (1954)   correlation   [-1; +1]
Kendall's tau (1938)       correlation   [-1; +1]
Contingency Khi-2          correlation   [0; ∞]

SLIDE 11

A small 2-D example by J.P. Nakache and J. Confais (2010)

[Figure: scatter plot of the 20 points, labelled A to T; axis ticks run from 5 to 50 and from 10 to 60.]

SLIDE 12

Small example: optimal criteria values for 50 random restarts in K-means algorithm

          K-m2c    K-m3c    K-m4c    K-m5c    K-m6c    K-m7c    K-m8c    K-m9c    K-m10c   K-m11c
BSS/TSS   0.4673   0.7354   0.8603   0.9001   0.9345   0.9486   0.9575   0.9669   0.9679   0.9798
Theta     1.6828   2.288    2.7846   2.993    3.6454   3.9283   4.1094   4.3701   4.2197   5.0056
DB        1.1781   0.949    0.7766   0.8269   0.9678   0.9063   0.829    0.7653   0.9622   0.6984
Dunn      1.6287   1.7433   2.0693   1.3138   1.4258   1.2447   1.0815   1.2909   0.6573   0.8751
HL        0.8402   0.9378   0.9747   0.9682   0.9793   0.9852   0.9869   0.9903   0.9756   0.9923
Silh      0.3889   0.4765   0.5523   0.5385   0.4866   0.4992   0.5144   0.5169   0.4537   0.5848
GK        0.6471   0.8741   0.9595   0.9535   0.9684   0.9758   0.9772   0.9816   0.9586   0.9897
Tau       0.3245   0.369    0.3487   0.2944   0.2149   0.1929   0.1768   0.1608   0.1315   0.1177
Yule      0.6754   0.9485   0.9815   0.9692   0.9838   0.9887   0.9933   0.9969   0.9816   0.9955
AdRand    0.3884   0.6992   0.7962   0.7258   0.7615   0.7859   0.8246   0.8708   0.6916   0.8221
Fowlkes   0.6813   0.7895   0.8444   0.7778   0.7917   0.8095   0.8421   0.8824   0.7143   0.8333
Khi-2     3808.5   5827.9   6062.7   5120.6   3789.8   3434.8   3154.3   2883.8   2306.4   2097.5

SLIDE 13

Small example: optimal criteria values for 50 random restarts

Theta = Mean Db / Mean Dw
Dunn = Min Db / Max Dw

[Figure: Theta and Dunn plotted against the number of clusters (2c to 11c); vertical axis from 1 to 6.]

SLIDE 14

Small example: optimal criteria values for 50 random restarts

Davies-Bouldin index

[Figure: DB plotted against the number of clusters (2c to 11c); vertical axis from 0.4 to 1.2.]

SLIDE 15

Davies-Bouldin index (1979)

$DB(k) = \max_{j \neq k} \frac{D_w(k) + D_w(j)}{D_b(j, k)}$

$D_w(k) = \mathrm{Mean}\{ d_{ii'} \mid i \in k,\ i' \in k,\ i \neq i' \}$

$D_b(j, k) = \mathrm{Mean}\{ d_{ii'} \mid i \in j,\ i' \in k \}$

$DB = \mathrm{Mean}_{k \in K}\, DB(k)$

Type: compactness / isolation
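
The definition on this slide works directly on pairwise distances, not on centroids as in the most common statement of the index. The sketch below follows the slide's version, assuming a full symmetric distance matrix `D` and at least two clusters. The function names are illustrative, and treating a singleton cluster as having a within-cluster mean of 0 is an assumption the slide does not state.

```python
import numpy as np

def mean_within(D, idx):
    """Dw(k): mean distance over the distinct pairs of one cluster."""
    if len(idx) < 2:
        return 0.0  # assumption: a singleton cluster has zero within-distance
    sub = D[np.ix_(idx, idx)]
    return sub[np.triu_indices(len(idx), k=1)].mean()

def davies_bouldin_pairwise(D, labels):
    """DB = mean over k of max over j != k of (Dw(k) + Dw(j)) / Db(j, k)."""
    labels = np.asarray(labels)
    groups = [np.where(labels == k)[0] for k in np.unique(labels)]
    dw = [mean_within(D, g) for g in groups]
    db_k = []
    for a, ga in enumerate(groups):
        ratios = [
            (dw[a] + dw[b]) / D[np.ix_(ga, gb)].mean()  # Db(j, k) in the denominator
            for b, gb in enumerate(groups) if b != a
        ]
        db_k.append(max(ratios))
    return float(np.mean(db_k))
```

Lower values indicate compact, well-isolated clusters, which is why the slide lists the type as compactness / isolation.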

SLIDE 16

Small example: best partition in 4 clusters by k-means out of 50 random restarts, DB = 0.7766

[Figure: scatter plot of the points A to T showing this partition; axis ticks run from 5 to 50 and from 10 to 60.]

SLIDE 17

Small example: partition in 4 clusters, DB = 0.7539

[Figure: scatter plot of the points A to T showing this partition; axis ticks run from 5 to 50 and from 10 to 60.]

SLIDE 18

Small example: partition in 4 clusters, DB = 0.7514

[Figure: scatter plot of the points A to T showing this partition; axis ticks run from 5 to 50 and from 10 to 60.]

SLIDE 19

Small example: optimal criteria values for 50 random restarts

Yule, adjusted Rand and Fowlkes-Mallows indexes

[Figure: Yule, AdRand and Fowlkes plotted against the number of clusters (2c to 11c); vertical axis from 0.3 to 1.1.]

SLIDE 20

Goodman-Kruskal's index and Kendall's tau

[Figure: GK and Tau plotted against the number of clusters (2c to 11c); vertical axis from 0.2 to 1.2.]

SLIDE 21

Non-parametric indexes based on quadruples of objects

Rows: initial distances d. Columns: partition distances u.

                  u_ii' < u_jj'   u_ii' > u_jj'
d_ii' < d_jj'     concordant      discordant
d_ii' > d_jj'     discordant      concordant

SLIDE 22

Kendall's Tau (1938) and Goodman and Kruskal index (1954)

S+ = number of concordant quadruples
S- = number of discordant quadruples
N = number of object pairs

$GK = \frac{S^+ - S^-}{S^+ + S^-}$

$Tau = \frac{S^+ - S^-}{N(N-1)/2}$

Type: correlation coefficient
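
A small sketch of how S+, S-, GK and Tau could be computed from these definitions. It assumes the partition distance u is 0 for two objects in the same cluster and 1 otherwise (the slides do not spell out u), and it counts a quadruple tied on either distance as neither concordant nor discordant. The function name `gk_and_tau` is illustrative.

```python
import numpy as np

def gk_and_tau(D, labels):
    """Goodman-Kruskal and Kendall's tau between distances and a partition."""
    labels = np.asarray(labels)
    rows, cols = np.triu_indices(len(labels), k=1)
    d = D[rows, cols]                                  # initial distance of each pair
    u = (labels[rows] != labels[cols]).astype(int)     # assumed 0/1 partition distance
    # Each pair of object pairs is one quadruple, visited once.
    a, b = np.triu_indices(len(d), k=1)
    agreement = np.sign(d[a] - d[b]) * np.sign(u[a] - u[b])
    s_plus = int((agreement > 0).sum())
    s_minus = int((agreement < 0).sum())
    n_pairs = len(d)
    gk = (s_plus - s_minus) / (s_plus + s_minus)
    tau = (s_plus - s_minus) / (n_pairs * (n_pairs - 1) / 2)
    return gk, tau
```

For four points at 0, 1, 10 and 11 on a line, split into {0, 1} and {10, 11}, this gives S+ = 8 and S- = 0 out of 15 quadruples, so GK = 1 while Tau = 8/15: ties lower Tau but not GK. The number of quadruples grows with the fourth power of the number of objects, so this direct version only suits small data sets.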

SLIDE 23

Contingency Khi-2 over quadruples

[Figure: Khi-2 plotted against the number of clusters (2c to 11c); vertical axis from 1000 to 7000.]

SLIDE 24

Some indexes should be discarded

  • A. Uniform trend toward increasing or decreasing values
      • BSS/TSS
      • Theta
      • Goodman-Kruskal
      • Hubert-Levin
  • B. Preference for unbalanced partitions
      • Davies-Bouldin

SLIDE 25

Analyzing the results by correspondence analysis

A special data table for analysing the results of K-means: the confusion or co-association table.

From a set P of partitions (with the same number of clusters), count the number of times two objects, i and i', fall in the same cluster.

$c_{ii'} = \mathrm{Card}\{ p \in P \mid k_p(i) = k_p(i') \}$

$k_p(i)$ = cluster to which i belongs in partition p

Submit this table C to Correspondence analysis.
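
The co-association count defined above can be sketched in a few lines, assuming each partition is given as a list of cluster labels, one per object. The function name `co_association` is illustrative.

```python
import numpy as np

def co_association(partitions):
    """c[i, j] = number of partitions in which objects i and j share a cluster."""
    P = np.asarray(partitions)               # shape (n_partitions, n_objects)
    # same[p, i, j] is True when i and j fall in the same cluster of partition p.
    same = P[:, :, None] == P[:, None, :]
    return same.sum(axis=0)
```

Cluster labels only need to be consistent within one partition, so partitions from different restarts can be stacked without matching their label numbering.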

SLIDE 26

Small example: 15 distinct partitions in 4 clusters after 50 restarts

Correspondence analysis of the co-association matrix

[Figure: plane of axes F1 (57.7 %) and F2 (35 %) showing the points A to T, with annotated groups {H, I}, {A, C, D, L, O, Q, S}, {G, N, T}, {K, M, P} and {F, J, R}.]

3 or 4 clusters? Intermediate or anomalous objects?

SLIDE 27

Correspondence analysis suggests the validity of the 3-cluster partition. It reveals the border position of points B and H-I.

[Figure: scatter plot of the points A to T; axis ticks run from 5 to 50 and from 10 to 60.]

Cluster { K, M, P, R, F, J } contains 2 sub-clusters

SLIDE 28

PART 2: real life examples

  • Leukemia38
  • Alpes55
  • Yeast237

SLIDE 29

Real life examples: Leukemia38

Golub T.R., Slonim D.K., Tamayo P., Huard C., Gaasenbeek M., Mesirov J.P., Coller H., Loh M.L., Downing J.R., Caligiuri M.A., Bloomfield C.D., Lander E.S. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, vol. 286, pp. 531-537. //www.sciencemag.org

Handl J., Knowles J. and Kell D.B. (2005). Computational cluster validation in post-genomic data analysis. Bioinformatics, 21(15): 3201-3212.

Data table: 38 tissues x 100 genes, quantitative levels of gene expression. There are 3 groups of tissues, known a priori.

SLIDE 30

Leukemia38: correspondence analysis of raw data

[Figure: plane of axes F1 (27.5 %) and F2 (23.4 %); the tissues are plotted with their group labels M, T and B.]

SLIDE 31

Leukemia38: k-means with 100 random restarts

          2 cl     3 cl     4 cl     5 cl     6 cl     7 cl     8 cl     9 cl     10 cl    11 cl    12 cl
Dunn      1.1383   1.2733   1.2051   1.0062   0.937    0.9788   0.9122   0.7284   0.8965   0.8404   0.5551
Silh      0.2033   0.2898   0.2745   0.2371   0.1509   0.1948   0.1436   0.1693   0.1862   0.1937   0.2006
Tau       0.2759   0.4006   0.3915   0.3565   0.2307   0.2363   0.1619   0.1931   0.171    0.1593   0.1564
Yule      0.7269   0.9544   0.9592   0.9394   0.8998   0.9093   0.9182   0.8799   0.9272   0.9478   0.9357
AdRand    0.4309   0.7238   0.7326   0.6745   0.5455   0.5621   0.5204   0.4783   0.5444   0.5822   0.5457
Fowlkes   0.711    0.8197   0.8186   0.7685   0.627    0.64     0.575    0.551    0.5976   0.625    0.5915
Khi-2     36312    76844    75467    64507    43922    43934    30733    32920    31717    30535    35609
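
As a worked reading of the table above, the snippet below picks, for each index, the number of clusters with the highest value. The numbers are copied from the slide; the variable names are illustrative, and taking the maximum is only appropriate because all seven indexes here are of the "larger is better" kind.

```python
ks = list(range(2, 13))  # columns of the table: 2 to 12 clusters

table = {
    "Dunn":    [1.1383, 1.2733, 1.2051, 1.0062, 0.937, 0.9788, 0.9122, 0.7284, 0.8965, 0.8404, 0.5551],
    "Silh":    [0.2033, 0.2898, 0.2745, 0.2371, 0.1509, 0.1948, 0.1436, 0.1693, 0.1862, 0.1937, 0.2006],
    "Tau":     [0.2759, 0.4006, 0.3915, 0.3565, 0.2307, 0.2363, 0.1619, 0.1931, 0.171, 0.1593, 0.1564],
    "Yule":    [0.7269, 0.9544, 0.9592, 0.9394, 0.8998, 0.9093, 0.9182, 0.8799, 0.9272, 0.9478, 0.9357],
    "AdRand":  [0.4309, 0.7238, 0.7326, 0.6745, 0.5455, 0.5621, 0.5204, 0.4783, 0.5444, 0.5822, 0.5457],
    "Fowlkes": [0.711, 0.8197, 0.8186, 0.7685, 0.627, 0.64, 0.575, 0.551, 0.5976, 0.625, 0.5915],
    "Khi-2":   [36312, 76844, 75467, 64507, 43922, 43934, 30733, 32920, 31717, 30535, 35609],
}

# For each index, the K whose column holds the largest value.
best_k = {
    name: ks[max(range(len(values)), key=values.__getitem__)]
    for name, values in table.items()
}
print(best_k)
```

Dunn, Silh, Tau, Fowlkes and Khi-2 peak at 3 clusters, the number of groups known a priori, while Yule and AdRand peak at 4 by a small margin.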

SLIDE 32

Leukemia38: C.A. of the co-association table based on 3 clusters and 50 random restarts (but only 10 distinct partitions)

[Figure: plane of axes F1 (86.6 %) and F2 (8.3 %); the tissues are plotted with their group labels M, T and B.]

SLIDE 33

Leukemia38: C.A. of the co-association table based on 4 clusters and 50 random restarts (45 distinct partitions)

[Figure: plane of axes F1 (51.7 %) and F2 (46.6 %); the tissues are plotted with their group labels M, T and B.]

SLIDE 34

Leukemia38: K-means, influence of the number of random restarts

Tau             2 cl     3 cl     4 cl     5 cl     6 cl     7 cl     8 cl     9 cl     10 cl    11 cl
100 restarts    0.2759   0.4006   0.3915   0.3565   0.2307   0.2363   0.1619   0.1931   0.171    0.1593
1000 restarts   0.2759   0.4006   0.3915   0.3631   0.2445   0.2571   0.2063   0.1595   0.2175   0.1995
5000 restarts   0.2759   0.4006   0.3915   0.3631   0.2442   0.2217   0.2201   0.2008   0.2258   0.2329

BSS/TSS         2 cl     3 cl     4 cl     5 cl     6 cl     7 cl     8 cl     9 cl     10 cl    11 cl
100 restarts    0.2402   0.4607   0.5127   0.5545   0.5902   0.6225   0.6416   0.6641   0.6935   0.7046
1000 restarts   0.2402   0.4607   0.5127   0.5586   0.5948   0.6231   0.6585   0.6739   0.7014   0.7264
5000 restarts   0.2402   0.4607   0.5127   0.5586   0.5949   0.6287   0.6554   0.6829   0.7083   0.7215

SLIDE 35

Real life examples: Alpes55

Roux G. et Roux M. (1967). A propos de quelques méthodes de classification en phytosociologie. Revue de Statistique appliquée, 15(2): 59-72.

Benzécri J.P. et coll. (1973). L'analyse des données, tome 1 : la Taxinomie [T1C no 2], Dunod, Paris, pp. 360-374.

Raw data: 55 relevés x 174 species, presence-absence (coded 1 and 0 respectively).

Data table: 10 principal coordinate axes from Jaccard's distances (60.2 % inertia).
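
A sketch of the preprocessing named on this slide, under stated assumptions: Jaccard distance is taken as 1 minus the Jaccard similarity, and principal coordinates are obtained by classical double-centring followed by an eigendecomposition. The slides do not say which variant or program was used, and the function names are illustrative.

```python
import numpy as np

def jaccard_distances(B):
    """Pairwise Jaccard distances between the rows of a 0/1 table."""
    B = np.asarray(B, dtype=bool)
    inter = (B[:, None, :] & B[None, :, :]).sum(axis=2)
    union = (B[:, None, :] | B[None, :, :]).sum(axis=2)
    # Assumption: two empty rows are given similarity 1 (distance 0).
    sim = np.divide(inter, union, out=np.ones(inter.shape), where=union > 0)
    return 1.0 - sim

def principal_coordinates(D, n_axes):
    """Classical principal coordinates analysis of a distance matrix."""
    n = len(D)
    J = np.eye(n) - np.ones((n, n)) / n
    G = -0.5 * J @ (D ** 2) @ J                 # double-centred Gram matrix
    vals, vecs = np.linalg.eigh(G)
    order = np.argsort(vals)[::-1][:n_axes]     # largest eigenvalues first
    vals, vecs = vals[order], vecs[:, order]
    # Negative eigenvalues (non-Euclidean distances) are clipped to zero.
    coords = vecs * np.sqrt(np.clip(vals, 0, None))
    return coords, vals
```

The resulting coordinates can then be passed to K-means in place of the raw presence-absence table; with 55 relevés one would keep `n_axes=10` to match the slide.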

SLIDE 36

Alpes55: principal coordinates based on Jaccard similarity index

[Figure: plane of axes F1 (15.8 %) and F2 (9.8 %); the 55 relevés are plotted with labels R1 to R55.]

SLIDE 37

Alpes55: k-means with 5000 random restarts

Nb. cl.   2        3        4        5        6        7        8        9        10       11
Dunn      1.2353   1.3235   1.2307   1.3128   1.1296   1.1965   1.2121   1.167    1.1404   1.0815
Silh      0.2132   0.2495   0.2785   0.3118   0.2961   0.2863   0.3031   0.2974   0.3019   0.2959
Tau       0.286    0.3547   0.3328   0.3371   0.2614   0.2204   0.194    0.1677   0.1616   0.1426
Yule      0.7054   0.8955   0.918    0.9281   0.9295   0.9565   0.9637   0.9622   0.9689   0.9733
AdRand    0.4127   0.609    0.6301   0.6462   0.6125   0.6546   0.6586   0.6314   0.6539   0.6568
Fowlkes   0.7089   0.7527   0.7372   0.7439   0.6866   0.7051   0.7      0.6689   0.6875   0.6855
Khi-2     179511   299481   296945   314563   244800   214518   195124   170181   164817   146699

SLIDE 38

Alpes55: k-means with 100 random restarts

[Figure: Dunn, Yule, AdRand and Fowlkes plotted against the number of clusters (2 to 11); vertical axis from 0.4 to 1.4.]

SLIDE 39

Alpes55: C.A. of the co-association table based on 5 clusters and 100 random restarts (85 distinct partitions)

[Figure: factorial plane showing the relevés R1 to R55, with annotated groups {R6, R8}, {R30, R31}, {R18, R19, R25, R26} and {R1, R2, R11, R21, R39, R43, R53}.]

SLIDE 40

Alpes55: K-means, influence of the number of random restarts

Tau             2 cl     3 cl     4 cl     5 cl     6 cl     7 cl     8 cl     9 cl     10 cl    11 cl    12 cl    13 cl    14 cl
100 restarts    0.286    0.3547   0.3328   0.3371   0.2619   0.2248   0.1891   0.1767   0.158    0.1475   0.1281   0.1279   0.1151
1000 restarts   0.286    0.3547   0.3328   0.3371   0.2614   0.2204   0.1923   0.1735   0.1501   0.1476   0.1304   0.1186   0.1133
5000 restarts   0.286    0.3547   0.3328   0.3371   0.2614   0.2204   0.194    0.1677   0.1616   0.1426   0.1295   0.1139   0.1124

SLIDE 41

Real life examples: Yeast237

Cho R.J., Campbell M.J., Winzeler E.A., Steinmetz L., Conway A., Wodicka L., Wolfsberg T.G., Gabrielian A.E., Landsman D., Lockhart D.J., Davis R.W. (1998). A Genome-Wide Transcriptional Analysis of the Mitotic Cell Cycle. Molecular Cell, Vol. 2, 65-73.

Data table: 237 genes x 17 time points, quantitative levels of gene expression. There are 4 groups known a priori as "functional categories".

SLIDE 42

Yeast237: Principal components analysis

[Figure: plane of axes F1 (20.2 %) and F2 (16.1 %); genes are marked by class (Class 1 to Class 4).]

SLIDE 43

Yeast237: k-means with 200 random restarts

          2 cl      3 cl      4 cl      5 cl      6 cl      7 cl      8 cl      9 cl      10 cl     11 cl     12 cl     13 cl     14 cl
Dunn      1.03579   1.02831   1.07531   0.99867   0.95781   0.76343   0.78344   0.89028   0.54417   0.53582   0.54398   0.57214   0.53743
Silh      0.27291   0.30406   0.31826   0.31811   0.31178   0.26155   0.25695   0.29178   0.16725   0.15935   0.17297   0.14949   0.17444
Tau       0.29349   0.35412   0.38313   0.38377   0.37581   0.31987   0.32014   0.34515   0.19693   0.18665   0.19996   0.19924   0.18488
AdRand    0.45971   0.62734   0.73506   0.77682   0.80551   0.77156   0.76689   0.82365   0.51078   0.52573   0.52884   0.53271   0.53124
Fowlkes   0.73277   0.76875   0.82103   0.84407   0.86009   0.8242    0.81974   0.86594   0.57951   0.58627   0.59325   0.5953    0.58875
Khi-2     5.9E+07   9.7E+07   1.3E+08   1.4E+08   1.5E+08   1.4E+08   1.4E+08   1.5E+08   1.1E+08   1.2E+08   1.2E+08   1.3E+08   1.3E+08

SLIDE 44

Yeast237: k-means with 200 random restarts

[Figure: Dunn, Silh, Tau and AdRand plotted against the number of clusters (2 to 14); vertical axis from 0.2 to 1.2.]

SLIDE 45

Yeast237: C.A. of the co-association table based on 4 clusters and 100 random restarts (36 distinct partitions), axes 1 and 2

[Figure: plane of axes F1 (42.4 %) and F2 (36.6 %); each gene is plotted with a class label from 1 to 4.]

SLIDE 46

Yeast237: C.A. of the co-association table based on 4 clusters and 100 random restarts (36 distinct partitions), axes 1 and 3

[Figure: plane of axes F1 (42.4 %) and F3 (20.5 %); each gene is plotted with a class label from 1 to 4.]

SLIDE 47

Yeast237: C.A. of the co-association table based on 4 clusters and 100 random restarts (36 distinct partitions), axes 1 and 3, labels from the 4-cluster solution of the k-means

[Figure: plane of axes F1 (42.4 %) and F3 (20.5 %); each gene is plotted with its k-means cluster label from 1 to 4.]

SLIDE 48

Yeast237: K-means, influence of the number of random restarts

Dunn            2 cl     3 cl     4 cl     5 cl     6 cl     7 cl     8 cl     9 cl     10 cl    11 cl    12 cl    13 cl    14 cl
200 restarts    1.0358   1.0283   1.0753   0.9987   0.9578   0.7634   0.7834   0.8903   0.5442   0.5358   0.544    0.5721   0.5374
2000 restarts   1.0358   1.0283   1.0753   0.9971   1.0076   0.8693   0.8172   0.8446   0.5877   0.5542   0.5438   0.5819   0.5641

BSS/TSS         2 cl     3 cl     4 cl     5 cl     6 cl     7 cl     8 cl     9 cl     10 cl    11 cl    12 cl    13 cl    14 cl
200 restarts    0.229    0.3682   0.4561   0.4923   0.5161   0.5372   0.5563   0.5765   0.5913   0.608    0.613    0.631    0.6381
2000 restarts   0.229    0.3682   0.4561   0.4922   0.5169   0.5396   0.5605   0.5776   0.5942   0.6096   0.6206   0.6315   0.6417

SLIDE 49

PART 3: discussion

  • Quality indexes of a partition
  • Number of restarts in K-means
  • Detection of "outliers" from the co-association matrix

SLIDE 50

Discussion: quality indexes

Some indexes show a trend toward increasing (or decreasing) values as the number of clusters grows:

  • BSS/TSS
  • Theta
  • Goodman-Kruskal
  • Yule
  • Hubert-Levin
  • Adjusted Rand
  • Silhouette
  • Fowlkes & Mallows

Davies-Bouldin favours unbalanced partitions.

SLIDE 51

Discussion: quality indexes

The remaining indexes are:

  • Dunn
  • Tau
  • Khi-2

Tau and Khi-2 are based upon quadruples, so they are time consuming. The index finally selected is therefore Dunn, in a modified version that reduces the influence of outliers:

$Dunn = \frac{\min_{k \neq k'} \mathrm{Mean}[d_b(k, k')]}{\max_{k} \mathrm{Mean}[d_w(k)]}$
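
The modified Dunn index above replaces the usual minimum and maximum distances by means within each cluster and each pair of clusters. A hedged sketch follows, assuming a full symmetric distance matrix `D`, at least two clusters and at least one cluster with two or more objects. The function name is illustrative, and giving a singleton cluster a within-cluster mean of 0 is an assumption.

```python
import numpy as np

def modified_dunn(D, labels):
    """Min over cluster pairs of mean between-distance / max over clusters of mean within-distance."""
    labels = np.asarray(labels)
    groups = [np.where(labels == k)[0] for k in np.unique(labels)]
    within = []
    for g in groups:
        if len(g) > 1:
            sub = D[np.ix_(g, g)]
            # Mean over the distinct pairs of the cluster (upper triangle).
            within.append(sub[np.triu_indices(len(g), k=1)].mean())
        else:
            within.append(0.0)  # assumption: a singleton has zero within-distance
    between = [
        D[np.ix_(ga, gb)].mean()
        for i, ga in enumerate(groups) for gb in groups[i + 1:]
    ]
    return min(between) / max(within)
```

Unlike Tau and Khi-2, this needs only one pass over the pairs of objects, which is consistent with the computing-time argument made on the slide.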

SLIDE 52

Discussion: on the number of random restarts

The number of distinct local optima increases

  • with the number of objects
  • with the number of clusters

The values of BSS/TSS increase with the number of random restarts but better values of BSS/TSS do not imply higher values of other criteria.

In general, increasing the number of random restarts does not change the choice of the number of clusters.
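
Counts such as "15 distinct partitions after 50 restarts" require comparing partitions regardless of how their clusters happen to be numbered. One simple way to do this, offered as an illustrative sketch and not as the author's procedure, is to relabel clusters by order of first appearance before comparing.

```python
def canonical(labels):
    """Relabel clusters by order of first appearance, e.g. [2, 2, 0] -> (0, 0, 1)."""
    mapping = {}
    return tuple(mapping.setdefault(label, len(mapping)) for label in labels)

def count_distinct(partitions):
    """Number of different partitions, ignoring cluster numbering."""
    return len({canonical(p) for p in partitions})
```

Two restarts that end in the same local optimum then map to the same tuple even if K-means numbered their clusters differently.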

SLIDE 53

Discussion: on the usage of the co-association matrix

Having obtained a number of distinct clusterings, most researchers focus on looking for a consensus partition.

My suggestion is to submit the co-association matrix to Correspondence analysis. This allows for:

  • detecting possible outliers (or "inliers"?)
  • confirming or dismissing the number of clusters
  • discovering a "continuum" situation (see the Yeast237 example)
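
A compact sketch of Correspondence analysis applied to a co-association matrix C, using the standard singular value decomposition of the standardized residuals. This is the textbook formulation and not necessarily the program used for the slides. The function name is illustrative, and every row and column of C is assumed to have a positive sum (true for a co-association matrix, whose diagonal equals the number of partitions).

```python
import numpy as np

def correspondence_analysis(C, n_axes=2):
    """Row coordinates and share of inertia on the first n_axes factors."""
    P = np.asarray(C, dtype=float)
    P = P / P.sum()                          # correspondence matrix
    r = P.sum(axis=1)                        # row masses
    c = P.sum(axis=0)                        # column masses
    expected = np.outer(r, c)
    S = (P - expected) / np.sqrt(expected)   # standardized residuals
    U, sing, _ = np.linalg.svd(S, full_matrices=False)
    coords = (U[:, :n_axes] * sing[:n_axes]) / np.sqrt(r)[:, None]
    inertia_share = sing ** 2 / (sing ** 2).sum()
    return coords, inertia_share[:n_axes]
```

Objects that always fall in the same cluster receive identical coordinates, so points lying between groups on the factorial plane are candidates for the intermediate or anomalous objects discussed above.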

SLIDE 54

Discussion: other initializations

Several papers deal with different initializations, for instance:

  • cutting off a hierarchical clustering (Milligan 1980, 1985)
  • "intelligent initialization" (Mirkin 2005, Chiang and Mirkin, 2010)
  • perturbations of the K-means solution (Hand and Krzanowski, 2005)
  • bootstrap-like procedures (Bradley and Fayyad, 1998)

SLIDE 55

Discussion: the limits of the present approach

  • when there is no known partition, one is constrained to some consensus method, as shown in the examples
  • when there is a known partition, it may not always be useful (see the Yeast237 example)

Working with real life data sets: the size of the data table is necessarily limited:

  • with large numbers of objects, the repetitions of K-means and the index computations are time consuming
  • the co-association matrix is enormous and may not be processed by the usual Correspondence analysis programs.