Clusterability in Model Selection
Johannes Kiesel
Bauhaus-Universität Weimar
28th May, 2014
1
Data Categorization
Art and Design, Computer Science, Media Studies
Given data (a set of comparable entities or objects), find a categorization of it (without labels).
2
Data
3
Data → Model
Model attributes: Age, Fashion index, XKCD/week, Library (h/day), Sketches/day
4
Data → Model → Clustering
A clustering algorithm turns the model into a clustering; the clustering is read as a categorization of the data.
5
Data → Model → Clustering
A different choice of attributes (e.g., Age, Nose length (cm), Weight (kg), Height (cm), Student ID) yields a different model of the same data, and hence a different clustering and categorization.
6
Clustering a Model
Clustering algorithm · Evaluation index (separation, cohesiveness)
A good clustering scores high (Test (2.0): good); a bad clustering scores low (Test (0.0): bad). A concrete example follows below.
7
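As an illustration of such an evaluation index, the following sketch scores two clusterings of a toy model with the silhouette coefficient, a standard index that combines cohesiveness (within-cluster distances) and separation (between-cluster distances). The toy data and the use of scikit-learn are my own choices; the Test (…) values on the slides are not silhouette scores.

```python
# A minimal sketch, assuming scikit-learn; the silhouette coefficient is one
# standard evaluation index that combines cohesiveness and separation.
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Toy model: two well-separated groups of 2-dimensional objects.
model = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])

good = np.repeat([0, 1], 50)            # clustering that matches the groups
bad = rng.integers(0, 2, size=100)      # random clustering of the same model

print(silhouette_score(model, good))    # close to 1: cohesive and well separated
print(silhouette_score(model, bad))     # close to 0: neither cohesive nor separated
```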
Model → Clusterability index → Clustering
Analogous to the evaluation index for a clustering, a clusterability index assigns a score to the model itself (here: Test (0.0), bad).
8
Clusterability index (e.g., built from clustering algorithm(s) and an evaluation index)
Scores for candidate models: Test (1.2), Test (4.2), Test (2.3), Test (1.3), Test (0.9), Test (1.4), Test (0.6), Test (0.8), Test (2.0), Test (1.0)
9
◮ Task: calculate a score for a model
◮ The score has to be comparable at least among similar models (same number of objects)
◮ A clusterable model (high score, e.g., Test (4.2)) has a dominant structure of mutually separated parts that are cohesive groups of objects.
10
Idea: model selection by cluster evaluation ("one-step")
◮ Cluster the model with different algorithms and/or parameter settings
◮ Evaluate all clusterings
◮ Choose the best combination of model and clustering
→ two-step: the best evaluation score can equally serve as the model's clusterability score, so the model is selected first and then clustered (a sketch of the procedure follows below)
11
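A minimal sketch of this procedure, assuming k-means as the clustering algorithm and the silhouette coefficient as the evaluation index; both are stand-ins I chose, the talk does not prescribe them.

```python
# A minimal sketch: cluster each candidate model with several parameter
# settings, evaluate every clustering, and keep the best combination.
# KMeans and silhouette_score are stand-ins for the algorithm and the index.
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def best_clustering(model, ks=range(2, 10)):
    """Return (score, labels) of the best clustering found for one model."""
    candidates = []
    for k in ks:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(model)
        candidates.append((silhouette_score(model, labels), labels))
    return max(candidates, key=lambda c: c[0])

def select_model(models):
    """Choose the model (and its clustering) with the best evaluation score."""
    scored = [(best_clustering(m), m) for m in models]
    (score, labels), model = max(scored, key=lambda s: s[0][0])
    return model, labels, score
```

The score returned by best_clustering is exactly what the two-step reading uses as the model's clusterability score.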
Evaluation index: the Dunn index family
Dunn index = (smallest dissimilarity of objects from different clusters) / (largest cluster diameter)
Dunn MST (minimum spanning tree Dunn index): the diameter of a cluster is the largest edge length in the minimum spanning tree of the cluster (a sketch follows below).
The optimum clustering under Dunn MST is feasibly computable (no other clustering algorithm necessary).
12
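A minimal sketch of the MST-based Dunn index under this definition; the function name and the use of SciPy are my own assumptions, not part of the talk.

```python
# A minimal sketch of the Dunn MST evaluation index: smallest between-cluster
# dissimilarity divided by the largest MST-based cluster diameter (higher is better).
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import cdist

def dunn_mst(X, labels):
    clusters = [X[labels == c] for c in np.unique(labels)]

    def mst_diameter(points):
        # Diameter of a cluster: largest edge in its minimum spanning tree.
        if len(points) < 2:
            return 0.0
        return minimum_spanning_tree(cdist(points, points)).toarray().max()

    max_diameter = max(mst_diameter(c) for c in clusters)

    # Separation: smallest dissimilarity between objects of different clusters.
    min_separation = min(
        cdist(clusters[i], clusters[j]).min()
        for i in range(len(clusters))
        for j in range(i + 1, len(clusters))
    )
    return min_separation / max_diameter if max_diameter > 0 else float("inf")
```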
Using cluster evaluation for clusterability:
◮ Needs no additional clusterability index
◮ Evaluation indices are better understood
◮ Requires local optimization (of the clusterings)
◮ Can compare clusterings
13
Idea: use a statistical test for unstructured models
◮ Null hypothesis: the model was generated from a model distribution that generates non-clusterable models (e.g., a uniform distribution)
◮ Calculate a test statistic with known distribution under the null hypothesis
◮ Use the probability that a similarly large value occurs under the null hypothesis for the clusterability assessment
14
Compare the distribution of the original objects x with r uniformly sampled objects x⁰ (null hypothesis):

Hr = Σᵢ ψnn(x⁰ᵢ)ᵐ / ( Σᵢ ψnn(x⁰ᵢ)ᵐ + Σᵢ ψnn(xπ(i))ᵐ ),   sums over i = 1, …, r

ψnn(x): dissimilarity of x to its nearest neighbor · m: number of dimensions · π: random selection of r of the original objects
Spaced models: Hr → 0 · Uniform models: Hr ≈ 0.5 · Clustered models: Hr → 1 (a sketch follows below)
[Hopkins and Skellam. A New Method for Determining the Type of Distribution of Plant Individuals. 1954]
15
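A minimal sketch of this statistic for objects in Euclidean space; sampling the null-hypothesis points uniformly from the data's bounding box, as well as the function and variable names, are my assumptions.

```python
# A minimal sketch of the Hopkins-Skellam statistic H_r.
import numpy as np
from scipy.spatial import cKDTree

def hopkins(X, r, seed=0):
    rng = np.random.default_rng(seed)
    n, m = X.shape
    tree = cKDTree(X)

    # r uniformly sampled points x0 (null hypothesis) inside the data's bounding box.
    x0 = rng.uniform(X.min(axis=0), X.max(axis=0), size=(r, m))
    d0, _ = tree.query(x0, k=1)          # nearest original object

    # r randomly selected original objects; nearest neighbor excluding the
    # object itself (assumes no duplicate points).
    idx = rng.choice(n, size=r, replace=False)
    d1 = tree.query(X[idx], k=2)[0][:, 1]

    return (d0 ** m).sum() / ((d0 ** m).sum() + (d1 ** m).sum())
```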
◮ The null hypothesis allows for an interpretation of the score
◮ Choosing the sample and the null hypothesis is not trivial
[Figure: probability density function of Hr under the uniform distribution, the βr,r-distribution]
A worked example of this interpretation follows below.
16
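Since the βr,r-distribution is the distribution of Hr under the uniform null hypothesis, an observed score can be turned into the probability of a similarly large value. A minimal sketch with illustrative numbers; SciPy is my choice of tool.

```python
# Interpreting an observed Hopkins statistic via its null distribution Beta(r, r).
from scipy.stats import beta

r = 20          # number of sampled points
h_r = 0.78      # observed Hopkins statistic (illustrative value)
p_value = beta.sf(h_r, r, r)   # probability of a similarly large value under H0
print(f"p = {p_value:.4f}")    # a small p speaks against the non-clusterable null model
```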
Idea: in a clusterable model, most object pairs should be either very dissimilar (different clusters) or very similar (same cluster)
◮ Test whether relatively few dissimilarities are of average size
[Figure: similarity histogram over similarity ϕ, with separation (low ϕ) and cohesiveness (high ϕ) regions]
17
[Figure: similarity histograms over similarity ϕ for spaced, uniform, and clustered models]
Weighting function: 1 − (ϕ · log2(ϕ) + (1 − ϕ) · log2(1 − ϕ))
The weighted similarity histogram is aggregated into a clusterability score (a sketch follows below).
[Dash et al. Dimensionality Reduction for Unsupervised Data. 1997]
18
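A minimal sketch of this weighting, using the formula exactly as given on the slide; how the similarities are obtained and how the weighted values are aggregated are my assumptions. Note that, as written, the weight is largest for similarities near 0.5, so with this aggregation a smaller value indicates that few pairs have average similarity, i.e., a more clusterable model.

```python
# A minimal sketch of the weighted-similarity idea; the similarity definition,
# the clipping, and the aggregation are assumptions, only the weighting
# function is taken from the slide.
import numpy as np
from scipy.spatial.distance import pdist

def weighted_similarity_score(X):
    d = pdist(X)                                 # pairwise dissimilarities
    phi = 1.0 - d / d.max()                      # similarities in [0, 1]
    phi = np.clip(phi, 1e-9, 1.0 - 1e-9)         # keep log2 finite

    # Weighting function from the slide; it peaks at phi = 0.5 (average similarity).
    w = 1.0 - (phi * np.log2(phi) + (1.0 - phi) * np.log2(1.0 - phi))

    # Aggregation (assumption): mean weight over all object pairs; smaller
    # values mean fewer pairs of average similarity.
    return w.mean()
```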
◮ Assumes that the separation/cohesiveness heuristic (see the similarity histogram above) applies
19
◮ A clusterable model has a dominant structure of mutually separated parts that are cohesive groups (high score, e.g., Test (4.2))
◮ Clusterability is related to various other topics in data analysis:
  ◮ Evaluation indices (Dunn)
  ◮ Tests on model distributions (Hopkins and Skellam)
  ◮ Methods of unsupervised feature selection (Dash et al.)
  ◮ Estimators of intrinsic dimensionality
  ◮ …?
20
Can the clusterability indices identify clusterable models? Experiment setup:
◮ 10 model distributions of varying intuitive clusterability
◮ 1 000 models per distribution (results are means)
◮ 180 two-dimensional objects per model
[Figure: one model sampled from the uniform distribution]
21
[Figure: example models for the distribution parameter s = 0, 0.1, 0.2, 0.3, together with the symbol used for each distribution in the result plots]
22
[Figure: mean clusterability as a function of s for Dunn MST [1], Hopkins and Skellam [2], Dash et al., Ostrovsky et al., and Levina and Bickel]
[1] Limited to clusterings with 13 or fewer clusters
[2] Mean of 1 000 applications per model
23
A clusterable model has a dominant structure of mutually separated parts that are cohesive groups of objects.
◮ Clusterability indices can be used for model selection
◮ The indices differ, among others, with respect to their preference for fine or coarse structure
◮ If models are (somewhat) meaningful for a dataset, the more clusterable models are assumed to also be the more meaningful ones
◮ Clusterability can incorporate ideas from various related topics (especially clustering evaluation)
◮ Formal properties of clustering evaluation indices can be converted to properties of clusterability indices
24
◮ Further formalization of clusterability indices
◮ Application to large datasets
◮ Application to high-dimensional problems
◮ Relation to cluster stability
◮ Incorporation of additional knowledge (constraint clustering)
25