clusterability in model selection

Clusterability in Model Selection, by Johannes Kiesel

  1. Clusterability in Model Selection. Johannes Kiesel, Bauhaus-Universität Weimar, 28th May, 2014

  2. Cluster Analysis: Motivation. Data categorization: given data (a set of comparable entities or objects), find a categorization of it. [Figure: groups labeled Art and Design, Computer Science, Media Studies]

  3. Cluster Analysis: Motivation. Data categorization: given data (a set of comparable entities or objects), find a categorization of it (without labels). [Figure: groups labeled ?, ?, ?]

  4. Cluster Analysis: Motivation. Data categorization: given data (a set of comparable entities or objects), find a categorization of it (without labels). [Figure labels: D R A W E Y]

  5. Cluster Analysis: In the Beginning Was the Data. [Figure: Data]

  6. Cluster Analysis: Modeling. [Figure: Data and Model, with the features Age, Fashion index, XKCD/week, Library (h/day), Sketches/day]

  8. Cluster Analysis: Clustering. [Figure: Data, Model, Clustering algorithm, Clustering]

  9. Cluster Analysis: Clustering. [Figure: Data, Model, Clustering algorithm, Clustering, Categorization; labels: D R A W E Y]

  12. Cluster Analysis: Modeling II. [Figure: Data and Model, with the features Age, Fashion index, XKCD/week, Library (h/day), Sketches/day]

  13. Cluster Analysis: Modeling II. [Figure: Data and Model, with the features Age, Nose length (cm), Weight (kg), Height (cm), Student ID]

  14. Cluster Analysis: Modeling II. [Figure: Data, Model, Clustering algorithm, Clustering, Categorization]

  16. Cluster Analysis: Cluster Evaluation. [Figure: Model, Clustering algorithm, Clustering]

  17. Cluster Analysis: Cluster Evaluation. [Figure: Model, Clustering algorithm, Clustering, Evaluation index (separation, cohesiveness)]

  18. Cluster Analysis: Cluster Evaluation. [Figure: Model, Clustering algorithm, Clustering, Evaluation index (separation, cohesiveness); Test good (2.0)]

  19. Cluster Analysis: Cluster Evaluation. [Figure: Model, Clustering algorithm, Clustering, Evaluation index (separation, cohesiveness); Test bad (0.0)]

  22. Cluster Analysis: Model Evaluation. [Figure: Model, Clustering, Clusterability index; Test bad (0.0)]

  23. Cluster Analysis: Overview. [Figure: models with scores Test (1.2), Test (1.4), Test (4.2), Test (0.6), Test (2.3), Test (0.8), Test (1.3), Test (2.0), Test (0.9), Test (1.0); stages: Clustering algorithm(s), Clusterability index, Evaluation index]

  24. Clusterability
      ◮ Task: calculate a score for a model
      ◮ Has to be comparable at least among similar models (same number of objects)
      ◮ A clusterable model (high score) has a dominant structure of mutually separated parts that are cohesive groups of objects.
      [Figure: example model, Test (4.2)]

  25. Clusterability I: Salient Clustering
      Idea: model selection by cluster evaluation (“one-step”)
      ◮ Cluster the model with different algorithms and/or parameter settings
      ◮ Evaluate all clusterings
      ◮ Choose the best combination of model & clustering
      [Figure labels: two-step, one-step]

  26. Clusterability I: Dunn Index
      Dunn index evaluation family: min(separation) / max(1 / cohesiveness)
      Minimum spanning tree Dunn index (Dunn MST):
      ◮ 1 / cohesiveness: largest edge length in the minimum spanning tree of the cluster
      ◮ separation: smallest dissimilarity of objects from different clusters
      Optimum clustering is feasibly computable (no other clustering algorithm necessary)
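
The Dunn MST index of slide 26 can be sketched for a clustering that is already given. This is a minimal illustration, not the author's implementation: it assumes Euclidean dissimilarity, at least two clusters, and reads the index as the smallest between-cluster dissimilarity divided by the largest minimum-spanning-tree edge found in any cluster. The function names `largest_mst_edge` and `dunn_mst` are hypothetical.

```python
import math
from itertools import combinations


def largest_mst_edge(points, dissim=math.dist):
    """Largest edge of the minimum spanning tree of `points` (Prim's algorithm).

    Returns 0.0 for clusters with fewer than two objects.
    """
    if len(points) < 2:
        return 0.0
    # best[i]: cheapest known connection of point i to the growing tree
    best = [dissim(points[0], p) for p in points]
    in_tree = [False] * len(points)
    in_tree[0] = True
    largest = 0.0
    for _ in range(len(points) - 1):
        nxt = min(
            (i for i in range(len(points)) if not in_tree[i]),
            key=lambda i: best[i],
        )
        largest = max(largest, best[nxt])
        in_tree[nxt] = True
        for i, p in enumerate(points):
            if not in_tree[i]:
                best[i] = min(best[i], dissim(points[nxt], p))
    return largest


def dunn_mst(clusters, dissim=math.dist):
    """Assumed reading of Dunn MST: min separation / max MST edge.

    `clusters` is a list of at least two lists of coordinate tuples.
    """
    separation = min(
        dissim(a, b)
        for c1, c2 in combinations(clusters, 2)
        for a in c1
        for b in c2
    )
    widest = max(largest_mst_edge(c, dissim) for c in clusters)
    return separation / widest if widest > 0 else math.inf


example = [[(0, 0), (0, 1), (1, 0)], [(10, 0), (10, 1)]]
score = dunn_mst(example)
```

A higher value means the clusters are far apart relative to the widest gap inside any single cluster. Finding the clustering that maximizes the index, which the slide describes as feasible, is not covered by this sketch.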

  27. Clusterability I: Salient Clustering
      + Needs no additional clusterability index
      + Evaluation indices are better understood
      - Most evaluation indices require local optimization
      - Not all evaluation indices can compare clusterings of different models

  28. Clusterability II: Statistical Tests on Structure
      Idea: use a statistical test for unstructured models
      ◮ Null hypothesis: the model was generated from a model distribution that generates non-clusterable models (e.g., the uniform distribution)
      ◮ Calculate a test statistic with a known distribution under the null hypothesis
      ◮ Use the probability that a similarly large value occurs under the null hypothesis for the clusterability assessment

  29. Clusterability II: Hopkins and Skellam Statistic
      Compare the distribution of the original objects ($x$) and $r$ uniformly sampled $x^0$ (null hypothesis)
      [Figure: example models: spaced, uniform, clustered]
      [Hopkins and Skellam. A New Method for Determining the Type of Distribution of Plant Individuals. 1954]

  30. Clusterability II: Hopkins and Skellam Statistic
      Compare the distribution of the original objects ($x$) and $r$ uniformly sampled $x^0$ (null hypothesis)
      $\psi_{nn}(x)$: dissimilarity of $x$ to its nearest neighbor
      [Figure: spaced ($H_r \to 0$), uniform ($H_r \approx 0.5$), clustered ($H_r \to 1$), with $\psi_{nn}(x^0)$ and $\psi_{nn}(x)$ marked]
      [Hopkins and Skellam. A New Method for Determining the Type of Distribution of Plant Individuals. 1954]

  31. Clusterability II: Hopkins and Skellam Statistic
      Compare the distribution of the original objects ($x$) and $r$ uniformly sampled $x^0$ (null hypothesis)
      $$H_r = \frac{\sum_{i=1}^{r} \left(\psi_{nn}(x^0_i)\right)^m}{\sum_{i=1}^{r} \left(\psi_{nn}(x^0_i)\right)^m + \sum_{i=1}^{r} \left(\psi_{nn}(x_{\pi(i)})\right)^m}$$
      $\psi_{nn}(x)$: dissimilarity of $x$ to its nearest neighbor
      $m$: number of dimensions
      [Figure: spaced ($H_r \to 0$), uniform ($H_r \approx 0.5$), clustered ($H_r \to 1$)]
      [Hopkins and Skellam. A New Method for Determining the Type of Distribution of Plant Individuals. 1954]
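
The statistic H_r from slide 31 can be sketched as follows. Several details are assumptions rather than statements from the slides: the uniform points x^0 are drawn from the axis-aligned bounding box of the data, π is taken to select r distinct original objects at random, the dissimilarity is Euclidean, and the nearest neighbor of any point is always searched among the original objects. The function name `hopkins_skellam` is hypothetical.

```python
import math
import random


def nearest_dissimilarity(point, others, dissim=math.dist):
    """Dissimilarity of `point` to its nearest neighbor among `others`."""
    return min(dissim(point, o) for o in others)


def hopkins_skellam(objects, r, seed=None, dissim=math.dist):
    """Sketch of H_r for a list of m-dimensional coordinate tuples.

    Assumes r <= len(objects) and at least two distinct objects.
    """
    rng = random.Random(seed)
    m = len(objects[0])  # number of dimensions
    # Assumption: the sampling window is the bounding box of the data.
    lows = [min(o[d] for o in objects) for d in range(m)]
    highs = [max(o[d] for o in objects) for d in range(m)]
    sampled = [
        tuple(rng.uniform(lo, hi) for lo, hi in zip(lows, highs))
        for _ in range(r)
    ]
    # Assumption: pi picks r distinct original objects at random.
    chosen = rng.sample(range(len(objects)), r)
    u = sum(nearest_dissimilarity(x0, objects, dissim) ** m for x0 in sampled)
    w = sum(
        nearest_dissimilarity(
            objects[i], objects[:i] + objects[i + 1:], dissim
        ) ** m
        for i in chosen
    )
    return u / (u + w)


# Two tight, well-separated groups: the statistic should approach 1.
data_rng = random.Random(0)
clustered = [
    (cx + data_rng.uniform(-0.01, 0.01), cy + data_rng.uniform(-0.01, 0.01))
    for cx, cy in [(0.0, 0.0), (10.0, 10.0)]
    for _ in range(20)
]
h_clustered = hopkins_skellam(clustered, r=10, seed=1)
```

Because the value depends on the random sample, repeating the computation and averaging the results gives a more stable score.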

  32. Clusterability II: Statistical Tests on Structure
      + The distribution under the null hypothesis allows for an interpretation of the score
      + Often requires only a sample
      - Depends heavily on the null hypothesis
      - Adjustment of statistics is not trivial
      [Figure: probability density function of the β(r, r)-distribution of H_r under the uniform distribution]
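
Since slide 32 gives a β(r, r)-distribution for H_r under the uniform null hypothesis, the probability of observing a value at least as large as a measured H_r can be estimated by simulation. The sketch below uses Monte Carlo sampling from the standard library instead of an exact distribution function. The function name `null_tail_probability` and the default number of trials are hypothetical choices.

```python
import random


def null_tail_probability(h, r, trials=20000, seed=0):
    """Estimate P(H_r >= h) under the null, assuming H_r ~ Beta(r, r)."""
    rng = random.Random(seed)
    hits = sum(rng.betavariate(r, r) >= h for _ in range(trials))
    return hits / trials


# A value near 0.5 is unremarkable under the null; a value near 1 is not.
p_typical = null_tail_probability(0.5, r=10)
p_extreme = null_tail_probability(0.9, r=10)
```

A small tail probability suggests that the model is more clustered than a uniform model would plausibly be, which is the interpretability advantage named on the slide.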

  33. Clusterability III: Concentration of Dissimilarities
      Idea: in a clusterable model, most object pairs should be either very dissimilar (different clusters) or very similar (same cluster)
      ◮ Test if relatively few dissimilarities are of average size
      [Figure: similarity histogram over similarity ϕ from 0 to 1, with regions marked separation and cohesiveness]

  34. Clusterability III: Dash et al. score
      [Figure: similarity histograms over similarity ϕ for the spaced, uniform, and clustered models]
      [Dash et al. Dimensionality Reduction for Unsupervised Data. 1997]

  35. Clusterability III: Dash et al. score
      Weighting function: $1 - (\phi \cdot \log_2(\phi) + (1 - \phi) \cdot \log_2(1 - \phi))$
      [Figure: similarity histograms for the spaced, uniform, and clustered models, and the weighting function over similarity ϕ]
      [Dash et al. Dimensionality Reduction for Unsupervised Data. 1997]

  36. Clusterability III: Dash et al. score
      Weighting function: $1 - (\phi \cdot \log_2(\phi) + (1 - \phi) \cdot \log_2(1 - \phi))$
      [Figure: similarity histograms and weighted similarity histograms for the spaced, uniform, and clustered models]
      [Dash et al. Dimensionality Reduction for Unsupervised Data. 1997]

  37. Clusterability III: Dash et al. score
      Weighting function: $1 - (\phi \cdot \log_2(\phi) + (1 - \phi) \cdot \log_2(1 - \phi))$
      [Figure: clusterability scores and weighted similarity histograms for the spaced, uniform, and clustered models]
      [Dash et al. Dimensionality Reduction for Unsupervised Data. 1997]
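
The weighting idea of slides 35 to 37 can be sketched as an entropy-based score over pairwise similarities. This is an illustration under assumptions, not the score from Dash et al.: the weight is taken to be one minus the binary entropy of ϕ, so that pairs with similarity near 0 or 1 count most and pairs near 0.5 count least, and the weights are averaged into a single number. Both the sign convention and the averaging are assumptions, and the names `binary_entropy`, `weight`, and `concentration_score` are hypothetical.

```python
import math


def binary_entropy(phi):
    """Binary entropy in bits; defined as 0 at phi = 0 and phi = 1."""
    if phi <= 0.0 or phi >= 1.0:
        return 0.0
    return -(phi * math.log2(phi) + (1.0 - phi) * math.log2(1.0 - phi))


def weight(phi):
    """Assumed weighting: 1 for extreme similarities, 0 for phi = 0.5."""
    return 1.0 - binary_entropy(phi)


def concentration_score(similarities):
    """Mean weight over all pairwise similarities, each in [0, 1]."""
    sims = list(similarities)
    return sum(weight(s) for s in sims) / len(sims)


# Pairs that are either very similar or very dissimilar score high.
polarized = concentration_score([0.05, 0.95, 0.9, 0.1])
middling = concentration_score([0.4, 0.5, 0.6, 0.55])
```

How dissimilarities are mapped to similarities in [0, 1] is left open here, since the slides do not state it.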

  38. Clusterability III: Concentration of Dissimilarities
      + Very general idea
      + Related to the concept of intrinsic dimensionality
      - Not clear when the used heuristic (see figure) applies
      - Lacks the interpretability of statistical tests
      [Figure: similarity histogram over similarity ϕ from 0 to 1, with regions marked separation and cohesiveness]
      [Dash et al. Dimensionality Reduction for Unsupervised Data. 1997]

  39. Clusterability: Overview
      ◮ A clusterable model has a dominant structure of mutually separated parts that are cohesive groups of objects.
      ◮ Clusterability is related to various other topics in data analysis
          ◮ Evaluation indices (Dunn)
          ◮ Tests on model distributions (Hopkins and Skellam)
          ◮ Methods of unsupervised feature selection (Dash et al.)
          ◮ Estimators of intrinsic dimensionality
          ◮ . . . ?
      [Figure: example model, Test (4.2)]

  40. Experiment: Synthetic Models
      Can the clusterability indices identify clusterable models?
      Experiment setup:
      ◮ 10 model distributions of varying intuitive clusterability; 1 model from the uniform distribution
      ◮ 1 000 models per distribution (results are means)
      ◮ 180 2-dimensional objects per model

  41. Experiment: Synthetic Models. [Figure: example models for s = 0, s = 0.1, s = 0.2, s = 0.3]

  42. Experiment: Synthetic Models. [Figure: example models for s = 0, s = 0.1, s = 0.2, s = 0.3, with symbol]

  43. Experiment: Synthetic Models
      [Figure: mean clusterability over s (0 to 0.3) for Dunn MST [1], Hopkins and Skellam [2], and Dash et al.]
      [1] Limited to clusterings with 13 or fewer clusters
      [2] Mean of 1 000 applications per model
