

SLIDE 1

Alternative Clusterings: Current Progress and Open Challenges

James Bailey

Department of Computer Science and Software Engineering The University of Melbourne, Australia

SLIDE 2

Introduction

  • Cluster analysis: group “similar” objects into clusters
  • No single solution

=> Equally important, different views or hypotheses regarding the data

Cluster by pose or individual?

SLIDE 3

Motivations

  • Multiple explanations of the data
    – The user doesn’t initially know what they want and needs options
    – Different users have different viewpoints
    – May be aiming to verify that multiple explanations do not exist (hypothesis verification, or benchmarking clustering algorithms)
  • Contrast with consensus clustering
  • “Every clustering should be accompanied by at least one alternative clustering”!?
SLIDE 4

Alternative Clustering: Is it new?

  • From one perspective, alternative clustering is not so new
  • Generation of clusterings often goes like this:
    – Generate and assess a clustering with 2 clusters
    – Generate and assess a clustering with 3 clusters
    – …
    – Generate and assess a clustering with k clusters
  • We now have k-1 alternative clusterings …
    – But some of them may be very similar
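
A minimal Python sketch of this loop, under assumptions that are not in the slides: 1-D toy data, a plain Lloyd's k-means with a deterministic initialisation (the hypothetical helper `kmeans_1d`), and the Rand index as the similarity measure. It builds one clustering per value of k and reports how similar consecutive clusterings are.

```python
import itertools

def kmeans_1d(xs, k, iters=50):
    """Plain Lloyd's k-means on 1-D data (illustrative helper, not from the slides)."""
    s = sorted(xs)
    # Assumption: spread the initial centres evenly over the sorted values.
    centres = [s[(2 * j + 1) * len(s) // (2 * k)] for j in range(k)]
    labels = [0] * len(xs)
    for _ in range(iters):
        # Assign every point to its nearest centre.
        labels = [min(range(k), key=lambda j: abs(x - centres[j])) for x in xs]
        # Move every non-empty centre to the mean of its members.
        for j in range(k):
            members = [x for x, l in zip(xs, labels) if l == j]
            if members:
                centres[j] = sum(members) / len(members)
    return labels

def rand_index(a, b):
    """Fraction of object pairs on which two clusterings agree (same/different)."""
    pairs = list(itertools.combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)

data = [0.0, 0.2, 0.4, 5.0, 5.2, 5.4, 10.0, 10.2, 10.4, 10.6]
max_k = 4

# One clustering per k = 2..max_k, i.e. max_k - 1 clusterings in total.
clusterings = {k: kmeans_1d(data, k) for k in range(2, max_k + 1)}

# A Rand index close to 1 means the two clusterings are nearly the same.
for k in range(2, max_k):
    print(k, k + 1, round(rand_index(clusterings[k], clusterings[k + 1]), 3))
```

A high Rand index between the clusterings for k and k+1 is one way to see the point made above: varying k alone tends to yield refinements of the same grouping rather than a genuinely different view of the data.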

SLIDE 5

Alternative Clustering Algorithms

  • Growing number of approaches

ADFT, CAMI, COALA, Condens, Convolutional EM, Decorrelated k-means, MAXIMUS, Meta clustering, Multiview orthogonal clustering, NACI, Non-redundant clustering, …

  • Papers have appeared at
    – KDD10, ICML10, SDM10, KDD09, SDM09, ICDM08, ICDM07, ICDM06, KDD05, ICDM04, …, DMKD, KAIS, …

SLIDE 6

How do these approaches differ?

  • Task formulation:
    – Number of alternatives to generate
    – Sequential or simultaneous generation
  • Mathematical basis:
    – Linear algebra
    – Information theory
    – Other objective functions

SLIDE 7

Sequential Alternative Clustering Generation

  • Task: Given input clusterings {C1, …, Cn}, generate an alternative clustering C’ such that C’ is of high quality and C’ is different from each of C1, …, Cn
  • Important special case: n=1

[Diagram: Existing (C1, C2, …, Cn) --generate--> Alternative (C’)]

SLIDE 8

Simultaneous Alternative Clustering Generation

  • Task: Simultaneously generate n clusterings {C1, …, Cn} such that each Ci is of high quality and the clusterings in each pair (Ci, Cj), i ≠ j, are different from one another
  • Important special case: n=2

[Diagram: --generate--> Alternatives (C1, C2, …, Cn)]
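
One hedged way to write the sequential and simultaneous task formulations as optimisation problems is sketched below. The symbols are assumptions introduced for illustration and do not come from the slides: Q is some clustering quality measure, Sim is some similarity measure between two clusterings, and λ ≥ 0 weights dissimilarity against quality. Individual algorithms use their own objectives, which may look quite different.

```latex
% Sequential: C_1, ..., C_n are given and one new clustering C' is sought.
\[
  C' = \arg\max_{C} \; Q(C) \;-\; \lambda \sum_{i=1}^{n} \mathrm{Sim}(C, C_i)
\]

% Simultaneous: all n clusterings are sought together.
\[
  \{C_1, \dots, C_n\} = \arg\max_{C_1, \dots, C_n} \;
  \sum_{i=1}^{n} Q(C_i) \;-\; \lambda \sum_{1 \le i < j \le n} \mathrm{Sim}(C_i, C_j)
\]
```

With n=1 the sequential problem reduces to finding a single alternative to one given clustering, and with n=2 the simultaneous problem penalises a single pairwise similarity term.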

SLIDE 9

Sequential vs. Simultaneous

  • Sequential (greedy)
    – Semi-supervised
    – For i = 2 to n: generate the optimal alternative clustering with respect to the previous i-1 clusterings
    – Locally optimal at each step
  • Simultaneous (non-greedy)
    – Unsupervised
    – In parallel, generate an optimal set of n clusterings
    – Globally optimal clustering collection, but might miss some strong clusterings which would be generated by a sequential technique
    – More difficult optimisation problem
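
The greedy, sequential idea can be sketched in Python as follows. This is a toy illustration under stated assumptions, not any published algorithm: it picks from a fixed pool of candidate clusterings instead of searching the space of clusterings, the quality scores are made-up numbers, similarity is the Rand index, and `greedy_alternatives` and `lam` are hypothetical names.

```python
import itertools

def rand_index(a, b):
    """Fraction of object pairs on which two clusterings agree (same/different)."""
    pairs = list(itertools.combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)

def greedy_alternatives(candidates, quality, n, lam=1.0):
    """Pick n candidate indices one at a time (hypothetical helper).

    Each step takes the candidate that maximises
    quality - lam * (highest similarity to anything already chosen),
    so every choice is only locally optimal.
    """
    # Start from the single highest-quality clustering.
    chosen = [max(range(len(candidates)), key=lambda i: quality[i])]
    while len(chosen) < n:
        def score(i):
            sim = max(rand_index(candidates[i], candidates[j]) for j in chosen)
            return quality[i] - lam * sim
        rest = [i for i in range(len(candidates)) if i not in chosen]
        chosen.append(max(rest, key=score))
    return chosen

# Toy pool: three clusterings of six objects, with assumed quality scores.
candidates = [
    [0, 0, 0, 1, 1, 1],  # best quality
    [0, 0, 1, 1, 1, 1],  # good quality, but close to the first
    [0, 1, 0, 1, 0, 1],  # lower quality, but a different view
]
quality = [0.90, 0.85, 0.70]

print(greedy_alternatives(candidates, quality, n=2, lam=1.0))
print(greedy_alternatives(candidates, quality, n=2, lam=0.0))
```

With `lam=0.0` the dissimilarity term is ignored and the second pick is simply the next-best clustering by quality. A simultaneous method would instead score whole sets of n clusterings at once, which is why it is the harder optimisation problem.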
SLIDE 10

Style of Algorithm

  • Projection based
    – Project the data into an orthogonal subspace and then re-cluster
    – Appealing linear algebra formulation
    – Relatively efficient
    – Orthogonality may be too strict
  • More complex objective function
    – Generate the alternative clustering, trading off dissimilarity and quality in the objective function
    – More flexible
    – May require parameter choices
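
The projection-based style can be illustrated with a small Python sketch. It is a simplification under stated assumptions and does not reproduce any specific published method: the data are 2-D toy points, the existing clustering has two clusters, the direction removed is the line joining the two cluster means, and re-clustering uses a basic 2-means with a deterministic start. All function names are hypothetical.

```python
def centroid(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]

def project_out(points, direction):
    """Remove each point's component along `direction` (orthogonal projection)."""
    norm2 = sum(d * d for d in direction)
    projected = []
    for p in points:
        coef = sum(pi * di for pi, di in zip(p, direction)) / norm2
        projected.append([pi - coef * di for pi, di in zip(p, direction)])
    return projected

def dist2(p, q):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def two_means(points, iters=20):
    """Basic 2-means; starts from the first point and the point farthest from it."""
    centres = [points[0], max(points, key=lambda p: dist2(p, points[0]))]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [0 if dist2(p, centres[0]) <= dist2(p, centres[1]) else 1
                  for p in points]
        for c in (0, 1):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centres[c] = centroid(members)
    return labels

# Four small groups at the corners of a rectangle.
data = [[0.0, 0.0], [0.1, 0.2], [0.0, 4.0], [0.2, 4.1],
        [5.0, 0.0], [5.1, 0.1], [5.0, 4.0], [5.2, 4.2]]
existing = [0, 0, 0, 0, 1, 1, 1, 1]  # existing clustering: left vs. right

# Direction that the existing clustering explains: the line between its two means.
m0 = centroid([p for p, l in zip(data, existing) if l == 0])
m1 = centroid([p for p, l in zip(data, existing) if l == 1])
direction = [b - a for a, b in zip(m0, m1)]

# Re-cluster in the subspace orthogonal to that direction.
alternative = two_means(project_out(data, direction))
print(alternative)
```

Once the left/right direction is projected away, only the vertical structure remains, so the re-clustering groups the points by top versus bottom. Removing the direction entirely is the strictness that the objective-function style relaxes by trading dissimilarity against quality.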

SLIDE 11

Simple Example

Most existing techniques seem to work well (a canonical example)

SLIDE 12

Circle of Gaussians

  • Techniques which trade off dissimilarity and quality are more likely to produce the second clustering
  • Orthogonal projection doesn’t work so well here
SLIDE 13

Other issues

  • Evaluation: Measuring quality/dissimilarity of alternatives
  • Clustering setting:
    – Desired shape of clusters: spherical versus elongated, linear versus non-linear separation
    – Low versus high dimensionality data
    – Continuous versus discrete features
    – Soft versus hard clusters
    – EM versus K-means versus hierarchical versus constraint based
    – Number of clusters desired in each clustering

SLIDE 14

Alternative Clustering Evaluation

  • Measuring dissimilarity: mathematical measures such as the Rand index, Jaccard index, normalised mutual information, …
  • Measuring quality:
    – Internal validation measures: Dunn index, Davies-Bouldin index, silhouette width
    – External validation: synthetic examples
  • Combine dissimilarity and quality into a single number, or present them separately?
  • Are these numbers useful?
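
Two of the dissimilarity measures named above, the Rand index and the Jaccard index, can be computed by counting object pairs. The sketch below is a plain-Python illustration for hard clusterings given as label lists. The function names are chosen here for illustration, and returning 1.0 from `jaccard_index` when no pair is co-clustered in either clustering is an assumed convention.

```python
from itertools import combinations

def pair_counts(a, b):
    """Count object pairs by whether each clustering puts them together.

    Returns (ss, sd, ds, dd): same in both, same in a only,
    same in b only, different in both.
    """
    ss = sd = ds = dd = 0
    for i, j in combinations(range(len(a)), 2):
        same_a = a[i] == a[j]
        same_b = b[i] == b[j]
        if same_a and same_b:
            ss += 1
        elif same_a:
            sd += 1
        elif same_b:
            ds += 1
        else:
            dd += 1
    return ss, sd, ds, dd

def rand_index(a, b):
    """Share of pairs on which the two clusterings agree."""
    ss, sd, ds, dd = pair_counts(a, b)
    return (ss + dd) / (ss + sd + ds + dd)

def jaccard_index(a, b):
    """Like the Rand index, but ignoring pairs separated in both clusterings."""
    ss, sd, ds, _ = pair_counts(a, b)
    total = ss + sd + ds
    return ss / total if total else 1.0  # assumed convention for the empty case

c1 = [0, 0, 0, 1, 1, 1]
c2 = [0, 1, 0, 1, 0, 1]
print(pair_counts(c1, c2), rand_index(c1, c2), jaccard_index(c1, c2))

# Both measures depend only on the grouping, not on the label names.
print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]), jaccard_index([0, 0, 1, 1], [1, 1, 0, 0]))
```

Both indices measure similarity, so a dissimilarity score is usually taken as one minus the index. Lower values mean the second clustering is a more distinct alternative to the first.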
SLIDE 15

Where are we?

  • Good existing algorithms for generation of one or two alternatives
    – Sequential generation
    – Simultaneous generation
  • Not yet deployed on very large datasets
  • Validated using assorted benchmark datasets and internal metrics

SLIDE 16

Open Issues

  • What’s the killer application?
    – Deployment of alternative clusterings
    – Need convincing use cases where consensus clustering is limited
  • Objective function and performance measures
  • How many alternatives are enough?
  • How many clusters should be in an alternative clustering?
    – The same number as the original clustering?

SLIDE 17

Open Issues cont.

  • How to find alternative subspace clusters (rather than clusterings)?
  • Visualisation of alternative clusterings
  • More focused alternatives
    – “Give me another clustering which is similar in these respects and different in these other respects to the previous clustering”

SLIDE 18

Moving Forward

  • Central repository of code and canonical examples (synthetic and real)
  • Make alternative clustering algorithms accessible
  • Identify cases in the literature of “missing” alternative clusterings

SLIDE 19

Bibliography

  • E. Bae, J. Bailey, and G. Dong. A Clustering Comparison Measure Using Density Profiles and its Application to the Discovery of Alternate Clusterings. To appear in Data Mining and Knowledge Discovery.
  • D. Niu, J. G. Dy, and M. I. Jordan. Multiple non-redundant spectral clustering views. Proc. of ICML 2010.
  • X. H. Dang and J. Bailey. A Hierarchical Information Theoretic Technique for the Discovery of Non Linear Alternative Clusterings. Proc. of KDD 2010.
  • X. H. Dang and J. Bailey. Generation of alternative clusterings using the CAMI approach. Proc. of SDM 2010.
  • Z. Qi and I. Davidson. A principled and flexible framework for finding alternative clusterings. Proc. of KDD 2009.
  • P. Jain, R. Meka, and I. S. Dhillon. Simultaneous unsupervised learning of disparate clusterings. Proc. of SDM 2008.
  • I. Davidson and Z. Qi. Finding alternative clusterings using constraints. Proc. of ICDM 2008.
  • Y. Cui, X. Z. Fern, and J. G. Dy. Non-redundant multi-view clustering via orthogonalization. Proc. of ICDM 2007.
  • E. Bae and J. Bailey. COALA: A novel approach for the extraction of an alternate clustering of high quality and high dissimilarity. Proc. of ICDM 2006.
  • R. Caruana, M. Elhawary, N. Nguyen, and C. Smith. Meta clustering. Proc. of ICDM 2006.
  • D. Gondek and T. Hofmann. Non-redundant clustering with conditional ensembles. Proc. of KDD 2005.
  • D. Gondek and T. Hofmann. Non-redundant data clustering. Proc. of ICDM 2004.