Less is More: Non-Redundant Subspace Clustering


Less is More: Non-Redundant Subspace Clustering. Ira Assent, Emmanuel Müller, Stephan Günnemann, Ralph Krieger, Thomas Seidl. Aalborg University, Denmark; RWTH Aachen University, Germany. MultiClust Workshop at SIGKDD 2010.


SLIDE 1

Less is More: Non-Redundant Subspace Clustering

Ira Assent ◦, Emmanuel Müller •, Stephan Günnemann •, Ralph Krieger •, Thomas Seidl •

  ◦ Aalborg University, Denmark
  • RWTH Aachen University, Germany

MultiClust Workshop at SIGKDD 2010, July 25, 2010

SLIDE 2

Effective Models | Efficient Computation | Evaluation and Exploration of Results

Detection of Non-Redundant Subspace Clusters I

[Figure: clusters C1 to C6, each hidden in a different attribute set; attributes shown: income, sportive activities, # boats in Miami, internet usage]

  • Hidden clusters are described by different attribute sets
  • Each object might be grouped in multiple clusters

⇒ Novel challenges for subspace clustering

SLIDE 3

Detection of Non-Redundant Subspace Clusters II

[Figure: clusters C1, C3 and C4; attributes shown: income, # boats in Miami, # cars, freq. flyer miles, # horses]

Subspace cluster: (rich; boat owner; car fan; globetrotter; horse fan)

  • Exponentially many projections: (rich), (boat owner), (rich; globetrotter), ...

Huge number of redundant clusters
⇒ Typically number of clusters ≫ number of objects
⇒ Goal: detection of all and only the non-redundant subspace clusters
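The "exponentially many projections" point can be made concrete with a small sketch. It assumes, as the slide suggests, that every lower-dimensional projection of a subspace cluster is itself reported as a cluster; the attribute names and the helper `projections` are illustrative only.

```python
from itertools import combinations

def projections(subspace):
    """Yield every non-empty proper subset of the given attribute set."""
    attrs = sorted(subspace)
    for k in range(1, len(attrs)):          # sizes 1 .. |S| - 1
        for combo in combinations(attrs, k):
            yield frozenset(combo)

# Hypothetical attribute names for the 5-D cluster on this slide.
cluster_subspace = {"income", "boats", "cars", "flyer_miles", "horses"}
lower = list(projections(cluster_subspace))

# A d-dimensional cluster has 2**d - 2 non-empty proper projections.
print(len(lower))  # 30 for d = 5
```

A single 5-dimensional cluster already induces 30 lower-dimensional projections, and the count doubles with each added attribute.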

SLIDE 4

Overview

Main question

How can you use/extend non-redundant clustering ...

In this talk, we present:

  • A survey of our contributions so far
  • The generality of our techniques
  • Our open source initiatives for the community

Research questions arise in the areas of:

  1. Effective Models
  2. Efficient Computation
  3. Evaluation and Exploration of Results

SLIDE 5

Notions and Related Work

Abstract subspace clustering definition

  • Definition of an object set O clustered in subspace S:
    C = (O, S) with O ⊆ DB, S ⊆ DIM
  • Selection of the result set M, a subset of all valid subspace clusters ALL:
    M = {(O1, S1), ..., (On, Sn)} ⊆ ALL

[Figure: subspace lattice over dimensions 1 to 4, from the 1-D subspaces {1}, {2}, {3}, {4} up to {1,2,3,4}]

Related work

  • Subspace clustering: focus on the definition of (O, S)
    ⇒ Outputs all valid subspace clusters, M = ALL (⇒ too many)
  • Projected clustering: focus on the definition of disjoint clusters in M
    ⇒ Unable to detect objects in multiple clusters (⇒ too few)

SLIDE 6

Non-Redundant Subspace Clustering Models

Select M ⊆ ALL: exclude redundant subspace clusters...

[Figure: clusters C1, C2 and C3 with their redundant projections]

Local (pairwise) redundancy elimination [1][2]

(O, S) is non-redundant iff ¬∃(O′, S′) with O′ ⊆ O ∧ S′ ⊃ S ∧ |O′| ≥ R · |O|
⇒ Excludes a large number of redundant subspace clusters
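The pairwise condition above translates almost literally into code. The following is a minimal sketch, not the DUSC or INSCY implementation: the `SubspaceCluster` type, the function name `is_redundant` and the example data are assumptions made for illustration.

```python
from typing import FrozenSet, Iterable, NamedTuple

class SubspaceCluster(NamedTuple):
    objects: FrozenSet[int]   # O, a subset of the database DB
    dims: FrozenSet[int]      # S, a subset of the dimensions DIM

def is_redundant(c: SubspaceCluster,
                 others: Iterable[SubspaceCluster],
                 r: float) -> bool:
    """True iff some (O', S') has O' ⊆ O, S' ⊃ S and |O'| >= r * |O|."""
    return any(
        o.objects <= c.objects            # O' ⊆ O
        and o.dims > c.dims               # S' ⊃ S (proper superset)
        and len(o.objects) >= r * len(c.objects)
        for o in others
    )

# Illustrative data: a 3-D cluster and two lower-dimensional candidates.
high = SubspaceCluster(frozenset(range(8)), frozenset({1, 2, 3}))
low = SubspaceCluster(frozenset(range(10)), frozenset({1, 2}))
wide = SubspaceCluster(frozenset(range(20)), frozenset({1}))

print(is_redundant(low, [high], r=0.5))   # True: 8 >= 0.5 * 10
print(is_redundant(wide, [high], r=0.5))  # False: 8 < 0.5 * 20
```

With R = 0.5, the 2-D cluster is dropped because the 3-D cluster retains most of its objects, while the large 1-D cluster survives because the 3-D cluster covers too small a share of it.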

[1] Assent, Krieger, Müller and Seidl: DUSC: Dimensionality Unbiased Subspace Clustering, in ICDM 2007.
[2] Assent, Krieger, Müller and Seidl: INSCY: Indexing Subspace Clusters with In-Process-Removal of Redundancy, in ICDM 2008.

SLIDE 7

Generalization of Redundancy Elimination

Relevant subspace clustering model [3]

  • Include the most interesting subspace clusters
  • Exclude redundant subspace clusters

⇒ Provide the most relevant subspace clusters in the result set
⇒ Extract novel knowledge with each cluster

[Figure: all possible clusters ALL → relevance model (interestingness of clusters, redundancy of clusters) → relevant clustering M ⊆ ALL]

Given any definition of subspace clusters C = (O, S)
⇒ Choose the optimal subset M = {C1, ..., Cn} ⊆ ALL

Proof: such an optimization is an NP-hard problem
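Because the exact selection is NP-hard, a heuristic is the natural fallback. The sketch below is a generic greedy selection and is not claimed to be the method of [3]: the interestingness score (|O| · |S|), the object-overlap notion of redundancy and the `max_overlap` threshold are all assumptions chosen for illustration.

```python
def greedy_select(all_clusters, interestingness, max_overlap):
    """Pick clusters by descending score, skipping mostly-covered ones.

    Each cluster is a pair (objects, dims) of non-empty frozensets.
    """
    selected = []
    covered = set()                         # objects explained so far
    for cluster in sorted(all_clusters, key=interestingness, reverse=True):
        objects, _dims = cluster
        overlap = len(objects & covered) / len(objects)
        if overlap <= max_overlap:          # adds enough novel objects
            selected.append(cluster)
            covered |= objects
    return selected

def size_times_dims(cluster):
    """Assumed interestingness: number of objects times dimensionality."""
    objects, dims = cluster
    return len(objects) * len(dims)

a = (frozenset(range(0, 10)), frozenset({1, 2, 3}))   # score 30
b = (frozenset(range(0, 9)), frozenset({1, 2}))       # score 18
c = (frozenset(range(8, 20)), frozenset({3, 4}))      # score 24

result = greedy_select([a, b, c], size_times_dims, max_overlap=0.5)
print(result == [a, c])  # True: b is fully covered by a
```

The greedy pass gives no optimality guarantee; it only illustrates how interestingness and redundancy interact when choosing M ⊆ ALL.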

[3] Müller, Assent, Günnemann, Krieger and Seidl: Relevant Subspace Clustering: Mining the Most Interesting Non-Redundant Concepts in High Dimensional Data, in ICDM 2009.

SLIDE 8

Redundancy Pruning by Depth-First Processing

Pruning

  • Applicable for local redundancy (simple pairwise model)
  • Enables in-process pruning of redundant clusters

[Figure: subspace lattice over dimensions 1 to 4, traversed depth-first versus breadth-first]

Step-by-step processing (k-D → (k + 1)-D subspace!)
⇒ Scalability to high dimensional data?
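The difference between the two traversal orders can be seen on the 4-dimensional lattice from the figure. This is a minimal sketch of plain lattice enumeration, without any density test or pruning; the helper names and the convention of extending a subspace only with larger dimension indices are assumptions made to visit each subspace once.

```python
from collections import deque

def children(subspace, n_dims):
    """Extend a k-D subspace by one larger dimension: k-D → (k+1)-D."""
    start = subspace[-1] + 1 if subspace else 1
    return [subspace + (d,) for d in range(start, n_dims + 1)]

def breadth_first(n_dims):
    """Visit all 1-D subspaces, then all 2-D subspaces, and so on."""
    order, queue = [], deque(children((), n_dims))
    while queue:
        s = queue.popleft()
        order.append(s)
        queue.extend(children(s, n_dims))
    return order

def depth_first(n_dims):
    """Follow one branch up to its highest-dimensional subspace first."""
    order, stack = [], list(reversed(children((), n_dims)))
    while stack:
        s = stack.pop()
        order.append(s)
        stack.extend(reversed(children(s, n_dims)))
    return order

dfs, bfs = depth_first(4), breadth_first(4)
print(dfs[:4])               # [(1,), (1, 2), (1, 2, 3), (1, 2, 3, 4)]
print(bfs.index((1, 2, 3, 4)))  # 14: reached last in breadth-first order
```

Depth-first order reaches the 4-D subspace after four steps, so a high-dimensional cluster is known early and its lower-dimensional projections can be checked for redundancy while processing continues; breadth-first order only reaches it at the very end. Both orders still move one dimension at a time, which is the scalability concern raised above.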

SLIDE 9

Scalable Subspace Processing

[Figure: subspace lattice over dimensions 1 to 4 with a direct jump to a higher-dimensional subspace; density grid over Dimension 1, Dimension 2 and Dimension 4 with intervals 1 and 2]

Key idea: density estimation + steered jumps

  • Subspace clusters are represented by many low dimensional projections
  • Use 2-D projections to estimate density in higher subspace regions [4]
  • Use k-D projections to jump directly to (k + x)-D subspaces [x ≫ 1]
  • Best-first search: intelligent steering to promising subspace regions

[4] Müller, Assent, Krieger, Günnemann and Seidl: DensEst: Density Estimation for Data Mining in High Dimensional Spaces, in SDM 2009.

SLIDE 10

Challenges in Evaluation and Exploration

General challenge for clustering

No ground truth is available for clustering
⇒ Subjective evaluation by exploration requires visualization techniques and interactive exploration tools
⇒ Objective evaluations are not comparable when they use different implementations, databases and quality measures
⇒ We provide a broad evaluation study and an interactive exploration framework

Evaluation Study [5]

  • Characterization of major paradigms
  • Comparable baseline implementations
  • Evaluation based on a broad set of data sets, quality measures and parameter settings

[5] Müller, Günnemann, Assent and Seidl: Evaluating Clustering in Subspace Projections of High Dimensional Data, in VLDB 2009.

SLIDE 11

Open Source Framework

OpenSubspace framework

  • Framework for research, education and application [6][7][8][9]
  • Baselines for algorithm and evaluation measure development

[Figure: without a common framework, algorithms A to D are each paired with their own evaluations 1 to 3, which requires re-implementation (a common implementation is the rare case); OpenSubspace provides a unified algorithm repository and a unified evaluation repository]

http://dme.rwth-aachen.de/OpenSubspace/

[6] Müller, Assent, Krieger, Jansen and Seidl: Morpheus: Interactive Exploration of Subspace Clustering, in KDD 2008.
[7] Assent, Müller, Krieger, Jansen and Seidl: Pleiades: Subspace Clustering and Evaluation, in PKDD 2008.
[8] Günnemann, Färber, Kremer and Seidl: CoDA: Interactive Cluster Based Concept Discovery, in VLDB 2010.
[9] Müller, Schiffer, Gerwert, Hannen, Jansen and Seidl: SOREX: Subspace Outlier Ranking Exploration Toolkit, in PKDD 2010.

SLIDE 12

Conclusion and Future Work

Subspace clustering is still an emerging research field...

Is the basis for a lot of further research

  • Alternative subspace clustering
  • Evaluation measures for subspace clustering
  • Benchmark databases for subspace clustering
  • ...

Thank you for your attention. Questions?