On Using Class-Labels in Evaluation of Clusterings

SLIDE 1

On Using Class-Labels in Evaluation of Clusterings

Ines Färber •, Stephan Günnemann •, Hans-Peter Kriegel ◦, Peer Kröger ◦, Emmanuel Müller •, Erich Schubert ◦, Thomas Seidl •, Arthur Zimek ◦

  • RWTH Aachen University, Germany
  ◦ LMU Munich, Germany

MultiClust at KDD 2010, July 25, 2010

SLIDE 2

The Dilemma of Evaluation

What would be the optimal clustering solution?

[Figure: View 1 and View 2]

SLIDE 3

Introduction

evaluation of clustering solutions:

evaluation based on internal measures

  + no additional information needed; data independent
  − approaches that optimize the evaluation criterion will always be preferred

evaluation based on an expert's opinion

  + may reveal new insight into the data
  − very expensive; results are not comparable

evaluation based on external measures

  + objective evaluation
  − needs a valid ground truth
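
As an illustration of what an external measure does, the following minimal Python sketch computes purity against class labels. Purity is used here only as a simple example; the slides do not name a specific measure, and the toy labels are made up.

```python
from collections import Counter


def purity(clusters, classes):
    """Fraction of objects that carry the majority class of their cluster."""
    by_cluster = {}
    for cluster_id, class_label in zip(clusters, classes):
        by_cluster.setdefault(cluster_id, Counter())[class_label] += 1
    # For each cluster, count the objects belonging to its most frequent class.
    majority_total = sum(c.most_common(1)[0][1] for c in by_cluster.values())
    return majority_total / len(classes)


# Toy data (assumption): six objects, two classes, two clusters.
classes = ["a", "a", "a", "b", "b", "b"]
clusters = [0, 0, 1, 1, 1, 1]
score = purity(clusters, classes)
print(f"purity = {score:.3f}")
```

The score is only as meaningful as the class labels it is computed against, which is the "valid ground truth" requirement listed above.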

SLIDE 4

History of Cluster Evaluation

clustering broke off from classification
⇒ assumption: classes stand out by inherent similarity

traditional clustering mainly follows the partitioning approach

external evaluation of traditional clustering:
⇒ the original assumption motivated the comparison against class labels

[Figure: UCI iris dataset]

class structure does not necessarily correspond to a clustering structure

⇒ classes may split up into several subgroups
⇒ there might be smooth transitions between two classes

SLIDE 5

Multi-View Context

assumption: data groups differently when seen from different perspectives
⇒ each object might be grouped in multiple clusters
⇒ with each perspective a set of attributes can be associated

[Figure: View 1 and View 2]

clustering goes beyond the structure of class labels

data items potentially belong to many clusters in differing views
⇒ class labels do not meet the assumptions of this scenario

SLIDE 6

Classes vs. Clusters

commonly observed differences between clusterings and class labelings:

  • splitting of classes into multiple clusters
  • merging of classes into a single cluster
  • missing class outliers
  • multiple (overlapping) hidden structures

[Figure labels: X, Y, Z, W; given class label C = shape; alternative labels H = color]
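
To see why such differences matter for evaluation, consider a hypothetical example: a clustering that splits one class into two subgroups is penalized by a pair-counting measure even though every cluster contains objects of a single class. The Rand index and the toy labels below are assumptions chosen for illustration; they are not taken from the slides.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Share of object pairs on which two labelings agree (same/different group)."""
    agree = 0
    total = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        agree += same_a == same_b
        total += 1
    return agree / total


# Toy data (assumption): two classes with four objects each.
classes = [0, 0, 0, 0, 1, 1, 1, 1]
exact = [0, 0, 0, 0, 1, 1, 1, 1]  # reproduces the class labels
split = [0, 0, 2, 2, 1, 1, 1, 1]  # class 0 split into two subgroups

score_exact = rand_index(classes, exact)
score_split = rand_index(classes, split)
print(score_exact, round(score_split, 3))
```

The split clustering scores lower only because the class labels do not record the subgroups, not because the subgroups are necessarily wrong.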

SLIDE 7

Case Study – Pendigits Dataset

different ways of digit notation

different types of digits 9 and 3
⇒ almost 30 different groups of digits in contrast to 10 given classes

SLIDE 8

Case Study – ALOI Dataset

  • object groups that stand out due to similarity based on color, shape, or rotation
  • object types

⇒ feature space influences the clustering result

SLIDE 9

Can Anything Be Learned?

EDSC

QUESTIONABLE

… still widely used!

SLIDE 10

Challenges

1. ground truth should provide multiple labellings
2. measures should be able to deal with multiple labels

⇒ e.g. label layers

challenges:

  • clustering covers only part of the layer (incompleteness?)
  • clusters in one layer vs. multiple layers (purity vs. variety)
  • the clustering intersects layers
  • the clustering contains newly detected clusters

⇒ e.g. label hierarchies
⇒ e.g. label ontologies
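
One possible reading of "label layers" is a ground truth with several alternative labelings, against each of which a clustering is scored separately. The sketch below assumes this reading; the layer names, the toy data, and the pair-counting F1 measure are illustrative assumptions, not a measure proposed in the slides.

```python
from itertools import combinations


def pair_f1(clustering, labels):
    """Pair-counting F1 between a clustering and one label layer."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(labels)), 2):
        same_cluster = clustering[i] == clustering[j]
        same_label = labels[i] == labels[j]
        if same_cluster and same_label:
            tp += 1
        elif same_cluster:
            fp += 1
        elif same_label:
            fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical ground truth with two label layers for the same eight objects.
label_layers = {
    "shape": ["circle", "circle", "square", "square",
              "circle", "circle", "square", "square"],
    "color": ["red", "red", "red", "red",
              "blue", "blue", "blue", "blue"],
}
clustering = [0, 0, 0, 0, 1, 1, 1, 1]  # this clustering groups by color

scores = {name: pair_f1(clustering, labels) for name, labels in label_layers.items()}
best_layer = max(scores, key=scores.get)
print(scores, best_layer)
```

Scoring against the "shape" layer alone would rate this clustering poorly, although it recovers the "color" layer exactly. Taking the best-matching layer is only one way to combine the scores and does not address clusterings that intersect layers or contain newly detected clusters.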

SLIDE 11

Challenges

1. ground truth should provide multiple labellings
2. measures should be able to deal with multiple labels

⇒ e.g. label layers
⇒ e.g. label hierarchies

challenges:

  • might be hard to derive
  • clustering covers one branch (redundancy?)
  • clustering covers one layer (impurity?)
  • clustering covers nodes only partially (incompleteness?)
  • union of nodes
  • newly detected clusters

⇒ e.g. label ontologies
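
For label hierarchies, a cluster could be compared against every node of the hierarchy rather than against a flat class list. The following sketch assumes each node is represented by the set of objects below it and uses Jaccard similarity for the match; the node names and the object ids are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets of object ids."""
    return len(a & b) / len(a | b) if a | b else 0.0


# Hypothetical label hierarchy: each node maps to the object ids below it.
hierarchy = {
    "digit": {0, 1, 2, 3, 4, 5, 6, 7},
    "digit/3": {0, 1, 2, 3},
    "digit/9": {4, 5, 6, 7},
    "digit/3/style-a": {0, 1},
    "digit/3/style-b": {2, 3},
}


def best_node(cluster, hierarchy):
    """Return the best-matching node name and its Jaccard score."""
    scores = {name: jaccard(cluster, members) for name, members in hierarchy.items()}
    name = max(scores, key=scores.get)
    return name, scores[name]


full_match = best_node({0, 1}, hierarchy)        # matches a leaf node exactly
partial_match = best_node({0, 1, 2}, hierarchy)  # covers node "digit/3" only partially
print(full_match, partial_match)
```

A score below 1 for the best node corresponds to the partial-coverage case listed above; a cluster that matches no node well may be either a poor cluster or a newly detected one, which this sketch cannot distinguish.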

SLIDE 12

Conclusion

[Diagram labels: database; clustering; result; evaluation; enhanced evaluation; class label per object (C, classification data); hidden clusters per object (H1, H2, H3, H4, ...)]

  • proceed in the development of new clustering algorithms
  • ensure objective clustering evaluation

  • labeling of data
  • measures for multiple labels
