Using Local Neighborhoods to Find Subspace Clusters, Emin Aksehirli



SLIDE 1

Using Local Neighborhoods to Find Subspace Clusters

Emin Aksehirli with Bart Goethals, Emmanuel Müller and Jilles Vreeken

SLIDE 2

High Dimensional Data

✓ ✓ ✗ ✗

?

SLIDE 3

High Dimensional Data

SLIDE 4

High Dimensional Data

SLIDE 5

High Dimensional Data

SLIDE 6

High Dimensional Data

SLIDE 7

Problem Setting

  • Preserve local neighborhoods
  • Combine different views on the data
  • Produce explainable results
SLIDE 8

Transformation

High Dimensional DB → (CARTIFICATION) → Itemset (Transaction) DB
High Dimensional DB → (Clustering) → Subspace Clusters
Itemset (Transaction) DB → (FIM) → Frequent Patterns
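
The transformation step can be illustrated with a minimal sketch. It assumes (the slides do not spell this out) that cartification builds, for every object and every dimension, one transaction ("cart") holding the ids of that object's k nearest neighbors in that single dimension. The function name `cartify` and the parameter `k` are hypothetical, chosen only for this illustration.

```python
def cartify(data, k):
    """Turn a list of numeric rows into one transaction DB per dimension.

    Assumption (not confirmed by the slides): a cart is the set of ids of
    the k objects closest to the cart's center object in one dimension,
    with the center itself included.
    """
    n_dims = len(data[0])
    carts_per_dim = []
    for d in range(n_dims):
        values = [row[d] for row in data]
        carts = []
        for center in range(len(data)):
            # Rank all objects by 1-D distance to the center; break ties by id.
            ranked = sorted(
                range(len(data)),
                key=lambda i: (abs(values[i] - values[center]), i),
            )
            carts.append(frozenset(ranked[:k]))
        carts_per_dim.append(carts)
    return carts_per_dim


# Toy data: objects 0-2 are close in dimension 0 but scattered in dimension 1.
data = [[1.0, 9.0], [1.1, 2.0], [1.2, 5.5], [8.0, 2.1], [9.0, 9.2]]
carts = cartify(data, k=3)
print(sorted(carts[0][0]))  # neighborhood of object 0 in dimension 0
```

In this toy data, objects 0, 1 and 2 end up in each other's carts in dimension 0, which is the kind of repeated co-occurrence a frequent pattern miner could pick up.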

SLIDE 9

Cartification

SLIDE 10

Cartification

SLIDE 11

Cartification

SLIDE 12

Cartification

SLIDE 13

Cartification

SLIDE 14

Cartification

SLIDE 15

Frequent Itemset Mining

?

[Figure labels: Original DB, Cartified DB, FIs]

SLIDE 16

Cartification

  • Frequent Itemset Mining solves our problem.
  • It is not scalable.
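
Both points above can be illustrated with a brute-force sketch of frequent itemset mining over a handful of carts. This is not the miner used by the authors; `frequent_itemsets`, `min_support` and `max_size` are hypothetical names, and the carts are made-up toy data. The sketch enumerates every candidate itemset up to a size limit, which hints at one reason such mining can become expensive: the number of candidates grows combinatorially with the number of objects.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size):
    """Brute-force frequent itemset mining (illustration only).

    Counts, for every candidate itemset of up to max_size items, how many
    transactions contain it, and keeps those reaching min_support.
    """
    items = sorted(set().union(*transactions))
    frequent = {}
    for size in range(1, max_size + 1):
        for candidate in combinations(items, size):
            cand = frozenset(candidate)
            support = sum(1 for t in transactions if cand <= t)
            if support >= min_support:
                frequent[cand] = support
    return frequent


# Toy carts: objects 0, 1 and 2 share a neighborhood in three carts.
carts = [
    frozenset({0, 1, 2}),
    frozenset({0, 1, 2}),
    frozenset({0, 1, 2}),
    frozenset({2, 3, 4}),
    frozenset({2, 3, 4}),
]
result = frequent_itemsets(carts, min_support=3, max_size=3)
largest = max(result, key=len)
print(sorted(largest), result[largest])
```

Here the largest frequent itemset is the group of objects that keep appearing together, which is how a frequent pattern can be read as a cluster candidate.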
SLIDE 17

Take 2

SLIDE 18

Take 2

?

SLIDE 19

Uniform vs. Clusters

SLIDE 20

Running example

Dim 1✓

SLIDE 21

Running example

Dim 1 Dim 2✓

?

SLIDE 22

Running example

Dim 1 Dim 2 Dim 3 Dim 4

✓ ✗ ? ? ✗ ? ✓

SLIDE 23

Experiments

[Figure: F1 Score (0.0 to 1.0) for Our Method, CartiClus, FIRES, PROCLUS, STATPC, and SUBCLU; x-axis ticks: 1, 2, 4, 8, 16, 32, 100, 200]
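
For readers unfamiliar with the measure, the sketch below computes the standard F1 score (harmonic mean of precision and recall) between one found cluster and one ground-truth cluster, each given as a set of object ids. How the authors match and aggregate clusters is not stated on the slides, so this covers only the per-pair formula; the function name `f1_score` is illustrative.

```python
def f1_score(found, truth):
    """F1 between a found cluster and a true cluster (sets of object ids)."""
    found, truth = set(found), set(truth)
    overlap = len(found & truth)
    if overlap == 0:
        return 0.0
    precision = overlap / len(found)  # share of found objects that are correct
    recall = overlap / len(truth)     # share of true objects that were found
    return 2 * precision * recall / (precision + recall)


# A found cluster that covers the true cluster but adds one extra object.
score = f1_score({0, 1, 2, 3}, {0, 1, 2})
print(round(score, 3))
```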

SLIDE 24

Experiments

[Figure: Run Time (seconds), axis ticks from 1 to 10000, on S1500, S2500, S3500, S4500, and S5500 for Our Method, CartiClus, FIRES, PROCLUS, STATPC, and SUBCLU]

SLIDE 25

Experiments

[Figure: Run Time (seconds), axis ticks from 1 to 10000, on D5, D10, D15, D25, D50, and D75 for Our Method, CartiClus, FIRES, PROCLUS, STATPC, and SUBCLU]

SLIDE 26

Real World – MovieLens

  • Star Wars: A New Hope (a.k.a. Star Wars) (1977)
  • Star Wars: The Empire Strikes Back (1980)
  • Star Wars: Return of the Jedi (1983)
  • LotR: The Fellowship of the Ring, The (2001)
  • LotR: The Two Towers, The (2002)
  • LotR: The Return of the King, The (2003)
  • Back to the Future (1985)
  • Terminator, The (1984)
  • Terminator 2: Judgment Day (1991)
  • Die Hard (1988)
  • Terminator, The (1984)
  • Terminator 2: Judgment Day (1991)
  • Usual Suspects, The (1995)
  • Pulp Fiction (1994)
  • Silence of the Lambs, The (1991)

SLIDE 27

Real World – MovieLens

  • Star Wars: A New Hope (1977)
  • Star Wars: The Empire Strikes Back (1980)
  • Star Wars: Return of the Jedi (1983)
  • LotR: The Fellowship of the Ring, The (2001)
  • LotR: The Two Towers, The (2002)
  • LotR: The Return of the King, The (2003)
  • Brazil (1985)
  • Dr. Strangelove (1964)
  • Clockwork Orange, A (1971)
  • 2001: A Space Odyssey (1968)
  • Blade Runner (1982)
  • Alien (1979)
  • Chinatown (1974)
  • Rear Window (1954)
  • North by Northwest (1959)
  • Vertigo (1958)
  • Psycho (1960)
  • Silence of the Lambs, The (1991)
  • Third Man, The (1949)
  • Citizen Kane (1941)
  • Godfather: Part II, The (1974)
  • Chinatown (1974)
  • Godfather, The (1972)
  • Taxi Driver (1976)

SLIDE 28

Conclusion

  • Preserves neighborhood information
  • Combines different similarity measures gracefully
  • Finds relevant features and discards noise
  • Fast
  • Produces explainable results

Thank you!

→ The code and data are available at our website.

SLIDE 29

Real World – Gene Expression

                Alon    Nutt
Our method      0.78    0.78
PROCLUS         0.46    0.49
FIRES           0.52    0.55
SUBCLU          0.58    n/a
STATPC          n/a     n/a
CartiClus       n/a     n/a
# of Objects    62      50
# of Dims       2000    1377

SLIDE 30

More Experiments

[Figure: F1 Score (0.1 to 1) on S1500, S2500, S3500, S4500, and S5500 for Our Method, CartiClus, FIRES, PROCLUS, STATPC, and SUBCLU]

SLIDE 31

More Experiments

[Figure: F1 Score (0.1 to 1) on D05, D10, D15, D25, D50, and D75 for Our Method, CartiClus, FIRES, PROCLUS, STATPC, and SUBCLU]

SLIDE 32

Experiments

  • Evaluate:
      • Subspace cluster detection
      • Noise robustness
      • Scalability
  • Competitors:
      • Subspace clustering: PROCLUS
      • Clustering: K-Means
      • Dimensionality reduction: PCA and Random Projection
      • Clustering ensemble: CSPA
SLIDE 33

Results

[Figure: Quality of the found clusters, measured by F1 and E4SC (0.2 to 1), for Our Method, CSPA, PROCLUS, K-Means, PCA+KM, and RP+KM]

Data: 10 clusters in 10 dimensions, 200 irrelevant dimensions

SLIDE 34

Results

  • Very effective at finding relevant dimensions.

[Figure: Quality of the found clusters, measured by F1 and E4SC (0.2 to 1), for Our Method, CSPA, PROCLUS, K-Means, PCA+KM, and RP+KM]

Data: 10 clusters in 10 dimensions, 200 irrelevant dimensions