Alternative Clusterings: Current Progress and Open Challenges


Apr 04, 2024


SLIDE 1

Alternative Clusterings: Current Progress and Open Challenges

James Bailey

Department of Computer Science and Software Engineering The University of Melbourne, Australia

SLIDE 2

Introduction

Cluster analysis: group “similar” objects into clusters
No single solution

=> Equally important, different views or hypotheses regarding the data

Cluster by pose or individual ?

SLIDE 3

Motivations

Multiple explanations of the data

– user doesn’t initially know what they want, needs options
– different viewpoints of users
– may be aiming to verify that multiple explanations do not exist (hypothesis verification, or for benchmarking clustering algorithms)

Contrast with consensus clustering
“Every clustering should be accompanied by at least one alternative clustering” !?

SLIDE 4

Alternative Clustering: Is it new ?

From one perspective, alternative clustering is not so new
Generation of clusterings often goes like

– Generate and assess a clustering with 2 clusters
– Generate and assess a clustering with 3 clusters
– …
– Generate and assess a clustering with k clusters

We now have k-1 alternative clusterings ….

– But some of them may be very similar
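The sweep over cluster counts can be sketched in a few lines. This is a minimal sketch, assuming a plain Lloyd-style k-means with deterministic farthest-first seeding and the Rand index as the similarity measure; the helper names (`kmeans`, `rand_index`) and the toy points are illustrative and do not come from the talk.

```python
from itertools import combinations


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iters=20):
    """Tiny Lloyd-style k-means with farthest-first seeding (deterministic)."""
    centres = [points[0]]
    while len(centres) < k:
        # Next seed: the point farthest from its nearest existing centre.
        centres.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centres)))
    labels = []
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: sq_dist(p, centres[j])) for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centre if a cluster becomes empty
                centres[j] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels


def rand_index(a, b):
    """Fraction of point pairs on which two labelings agree (same vs. different cluster)."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)


# Three small blobs (illustrative data only).
points = [(0, 0), (0, 1), (1, 0), (5, 0), (5, 1), (6, 0), (0, 5), (1, 5), (0, 6)]

# One clustering per cluster count: k = 2, 3, 4.
sweep = {k: kmeans(points, k) for k in range(2, 5)}

# Pairwise similarity: values close to 1 flag near-duplicate "alternatives".
for k1, k2 in combinations(sweep, 2):
    print(k1, k2, round(rand_index(sweep[k1], sweep[k2]), 3))
```

On this toy data the k=4 result only splits one blob of the k=3 result, so that pair scores much higher than the others, which is the sense in which clusterings from a sweep may be very similar.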

SLIDE 5

Alternative Clustering Algorithms

Growing number of approaches

ADFT, CAMI, COALA, Condens, Convolutional EM, Decorrelated k-means, MAXIMUS, Meta clustering, Multiview orthogonal clustering, NACI, Non redundant clustering,….

Papers have appeared at

– KDD10, ICML10, SDM10, KDD09, SDM09, ICDM08, ICDM07, ICDM06, KDD05, ICDM04, …, DMKD, KAIS, …

SLIDE 6

How do these approaches differ ?

Task formulation:

– Number of alternatives to generate
– Sequential or simultaneous generation

Mathematical basis

– Linear algebra
– Information theory
– Other objective functions

SLIDE 7

Sequential Alternative Clustering Generation

Task: Given input clusterings {C1, …, Cn}, generate an alternative clustering C’, such that C’ is of high quality and C’ is different from {C1, …, Cn}

Important special case: n=1

[Figure: existing clusterings C1, C2, …, Cn → generate → alternative clustering C’]

SLIDE 8

Simultaneous Alternative Clustering Generation

Task: Simultaneously generate n clusterings {C1, …, Cn}, such that each Ci is of high quality and each pair (Ci, Cj) is different from one another

Important special case: n=2

[Figure: generate → alternatives C1, C2, …, Cn]
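Both task statements can be read as optimisation problems. The block below is one possible way to write them, not a formulation taken from the talk: the quality function Q, the dissimilarity function D and the trade-off weight λ are assumed symbols, and individual algorithms choose them differently.

```latex
% Sequential: C_1, ..., C_n are given; find one new clustering C'
C' = \arg\max_{C} \; Q(C) + \lambda \sum_{i=1}^{n} D(C, C_i)

% Simultaneous: find all n clusterings jointly
\{C_1, \dots, C_n\} = \arg\max_{C_1, \dots, C_n} \;
    \sum_{i=1}^{n} Q(C_i) + \lambda \sum_{1 \le i < j \le n} D(C_i, C_j)
```

With n=1 the first problem reduces to finding a single alternative to one given clustering; with n=2 the second asks for one pair of good, mutually different clusterings.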

SLIDE 9

Sequential vs. Simultaneous

Sequential (greedy)

– Semi-supervised
– For i=2 to n {generate the optimal alternative clustering with respect to the previous i−1 clusterings}
– Locally optimal at each step

Simultaneous (non-greedy)

– Unsupervised
– In parallel, generate optimal set of n clusterings
– Globally optimal clustering collection, but might miss some strong clusterings which would be generated by a sequential technique
– More difficult optimisation problem
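The greedy sequential loop can be sketched as follows. This is a minimal sketch under stated assumptions: candidates come from a small hand-enumerated pool rather than a real generator, quality is the negative within-cluster sum of squared errors, similarity is the Rand index, and `lam` is a hypothetical trade-off weight. None of these choices is prescribed by the talk.

```python
from itertools import combinations


def rand_index(a, b):
    """Fraction of point pairs on which two labelings agree (same vs. different cluster)."""
    pairs = list(combinations(range(len(a)), 2))
    return sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs) / len(pairs)


def sse(points, labels):
    """Within-cluster sum of squared distances to the cluster mean (lower is better)."""
    total = 0.0
    for c in set(labels):
        members = [p for p, lab in zip(points, labels) if lab == c]
        mean = [sum(x) / len(members) for x in zip(*members)]
        total += sum((a - m) ** 2 for p in members for a, m in zip(p, mean))
    return total


def sequential_alternatives(points, candidates, first, n, lam):
    """Greedy loop: step i adds the candidate that best trades quality
    against similarity to the i-1 clusterings chosen so far."""
    chosen = [first]
    for _ in range(2, n + 1):
        remaining = [c for c in candidates if c not in chosen]
        best = max(
            remaining,
            key=lambda c: -sse(points, c)
            - lam * max(rand_index(c, prev) for prev in chosen),
        )
        chosen.append(best)
    return chosen


# Four corners of a 3 x 1 rectangle; the pool holds its three balanced 2-way splits.
points = [(0, 0), (0, 1), (3, 0), (3, 1)]
left_right, bottom_top, diagonal = [0, 0, 1, 1], [0, 1, 0, 1], [0, 1, 1, 0]

result = sequential_alternatives(
    points, [left_right, bottom_top, diagonal], first=left_right, n=3, lam=1.0
)
print(result)
```

Each step is only locally optimal: the choice made at step i is never revisited, whereas a simultaneous method would score the whole collection at once.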

SLIDE 10

Style of Algorithm

Projection based

– Project the data into an orthogonal subspace and then re-cluster
– Appealing linear algebra formulation
– Relatively efficient
– Orthogonality may be too strict

More complex objective function

– Generate the alternative clustering, trading off dissimilarity and quality in the objective function
– More flexible
– May require parameter choices
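The projection-based style can be illustrated with a small example. This is a minimal sketch, assuming an existing clustering with exactly two clusters and using the line joining their means as the direction to remove; it is a simplification for illustration and not any specific published algorithm. The helper names are hypothetical.

```python
def mean(vectors):
    """Coordinate-wise mean of a list of points."""
    return [sum(x) / len(vectors) for x in zip(*vectors)]


def project_out(points, direction):
    """Remove each point's component along `direction` (orthogonal projection)."""
    norm_sq = sum(d * d for d in direction)
    projected = []
    for p in points:
        coeff = sum(a * d for a, d in zip(p, direction)) / norm_sq
        projected.append([a - coeff * d for a, d in zip(p, direction)])
    return projected


# Toy data: four corners of a rectangle; the existing clustering splits left vs. right.
points = [(0, 0), (0, 1), (3, 0), (3, 1)]
labels = [0, 0, 1, 1]

centre0 = mean([p for p, lab in zip(points, labels) if lab == 0])
centre1 = mean([p for p, lab in zip(points, labels) if lab == 1])

# Direction along which the existing clustering separates the data.
direction = [a - b for a, b in zip(centre1, centre0)]

# After projection only the vertical structure is left, so re-clustering
# `residual` cannot recover the left/right split again.
residual = project_out(points, direction)
print(residual)
```

Any clustering algorithm can then be run on `residual`. The strictness noted on the slide is visible here: everything correlated with the removed direction is discarded, even if part of it would be useful for the alternative.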

SLIDE 11

Simple Example

Most existing techniques seem to work well (a canonical example)

SLIDE 12

Circle of Gaussians

Techniques which trade off dissimilarity and quality are more likely to produce the second clustering

Orthogonal projection doesn’t work so well here

SLIDE 13

Other issues

Evaluation: Measuring quality/dissimilarity of alternatives
Clustering setting:

– Desired shape of clusters: spherical versus elongated, linear versus non-linear separation
– low versus high dimensionality data
– continuous versus discrete features
– soft versus hard clusters
– EM versus K-means versus hierarchical versus constraint based
– Number of clusters desired in each clustering

SLIDE 14

Alternative Clustering Evaluation

Measuring dissimilarity: Mathematical measures - Rand index, Jaccard index, normalised mutual information …

Measuring quality:

– Internal validation measures: Dunn index, Davies-Bouldin index, silhouette width
– External validation: Synthetic examples

Combine dissimilarity and quality into a single number, or present separately ?

Are these numbers useful ?
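The three dissimilarity measures named on this slide can be computed directly from two label vectors. This is a minimal sketch in plain Python; the normalisation used for mutual information (the geometric mean of the two entropies) is one of several conventions and is an assumption here, as are the function names.

```python
from collections import Counter
from itertools import combinations
from math import log, sqrt


def pair_counts(a, b):
    """Count point pairs by whether each labeling puts them in the same cluster."""
    both = only_a = only_b = neither = 0
    for i, j in combinations(range(len(a)), 2):
        same_a, same_b = a[i] == a[j], b[i] == b[j]
        if same_a and same_b:
            both += 1
        elif same_a:
            only_a += 1
        elif same_b:
            only_b += 1
        else:
            neither += 1
    return both, only_a, only_b, neither


def rand_index(a, b):
    """Share of pairs on which the two labelings agree."""
    both, only_a, only_b, neither = pair_counts(a, b)
    return (both + neither) / (both + only_a + only_b + neither)


def jaccard_index(a, b):
    """Like Rand, but ignores pairs that are separated in both labelings."""
    both, only_a, only_b, _ = pair_counts(a, b)
    denom = both + only_a + only_b
    return both / denom if denom else 1.0  # convention when no pair is ever co-clustered


def entropy(labels):
    """Shannon entropy of the cluster-size distribution."""
    n = len(labels)
    return -sum(c / n * log(c / n) for c in Counter(labels).values())


def nmi(a, b):
    """Mutual information divided by sqrt(H(a) * H(b)) (assumed normalisation)."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    mi = sum(c / n * log(c * n / (pa[x] * pb[y])) for (x, y), c in pab.items())
    denom = sqrt(entropy(a) * entropy(b))
    return mi / denom if denom else 0.0


by_x = [0, 0, 1, 1]   # e.g. split by one attribute
by_y = [0, 1, 0, 1]   # e.g. split by an unrelated attribute

for name, measure in [("Rand", rand_index), ("Jaccard", jaccard_index), ("NMI", nmi)]:
    print(name, round(measure(by_x, by_y), 3), round(measure(by_x, by_x), 3))
```

All three are similarity scores, so a good alternative has a low value against the existing clustering. They do not share a scale: for the two unrelated splits above, Jaccard and NMI are zero while the Rand index is still one third, which is one reason a single combined number can be hard to interpret.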

SLIDE 15

Where are we ?

Good existing algorithms for generation of one or two alternatives

– Sequential generation
– Simultaneous generation

Not yet deployed on very large datasets
Validated using assorted benchmark datasets and internal metrics

SLIDE 16

Open Issues

What’s the killer application ?

– Deployment of alternative clusterings
– Need convincing use cases where consensus clustering is limited

Objective function and performance measures
How many alternatives is enough ?
How many clusters should be in an alternative clustering ?

– the same number as the original clustering ?

SLIDE 17

Open Issues cont.

How to find alternative subspace clusters (rather than clusterings) ?

Visualisation of alternative clusterings
More focused alternatives

– “Give me another clustering which is similar in these respects and different in these other respects to the previous clustering”

SLIDE 18

Moving Forward

Central repository of code and canonical examples (synthetic and real)

Make alternative clustering algorithms accessible
Identify cases in the literature of “missing” alternative clusterings

SLIDE 19

Bibliography

E. Bae, J. Bailey and G. Dong. A Clustering Comparison Measure Using Density Profiles and its Application to the Discovery of Alternate Clusterings. To appear in Data Mining and Knowledge Discovery.
D. Niu, J. G. Dy and M. I. Jordan. Multiple non-redundant spectral clustering views. Proc. of ICML 2010.
X. H. Dang and J. Bailey. A Hierarchical Information Theoretic Technique for the Discovery of Non Linear Alternative Clusterings. Proc. of KDD 2010.
X. H. Dang and J. Bailey. Generation of alternative clusterings using the CAMI approach. Proc. of SDM 2010.
Z. Qi and I. Davidson. A principled and flexible framework for finding alternative clusterings. Proc. of KDD 2009.
P. Jain, R. Meka and I. S. Dhillon. Simultaneous unsupervised learning of disparate clusterings. Proc. of SDM 2008.
I. Davidson and Z. Qi. Finding alternative clusterings using constraints. Proc. of ICDM 2008.
Y. Cui, X. Z. Fern and J. G. Dy. Non-redundant multi-view clustering via orthogonalization. Proc. of ICDM 2007.
E. Bae and J. Bailey. COALA: A novel approach for the extraction of an alternate clustering of high quality and high dissimilarity. Proc. of ICDM 2006.
R. Caruana, M. Elhawary, N. Nguyen and C. Smith. Meta clustering. Proc. of ICDM 2006.
D. Gondek and T. Hofmann. Non-redundant clustering with conditional ensembles. Proc. of KDD 2005.
D. Gondek and T. Hofmann. Non-redundant data clustering. Proc. of ICDM 2004.