
Using Local Neighborhoods to Find Subspace Clusters (Emin Aksehirli)



  1. Using Local Neighborhoods to Find Subspace Clusters Emin Aksehirli with Bart Goethals, Emmanuel Müller and Jilles Vreeken

  2. High Dimensional Data [figure annotated ✓ ✓ ✗ ✗ ?]

  3. High Dimensional Data

  4. High Dimensional Data

  5. High Dimensional Data

  6. High Dimensional Data

  7. Problem Setting • Preserve local neighborhoods • Combine different views on the data • Produce explainable results

  8. Transformation [diagram: High Dimensional DB → CARTIFICATION → Itemset (Transaction) DB; Clustering on the high-dimensional DB yields Subspace Clusters; FIM on the itemset DB yields Frequent Patterns]

  9. Cartification

  10. Cartification

  11. Cartification

  12. Cartification

  13. Cartification

  14. Cartification
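
The slides show cartification only as figures. As a rough illustration, and assuming the transformation builds one cart per object and per dimension that holds the object's k nearest neighbors along that dimension, a minimal Python sketch could look like the following. The function name `cartify`, the parameter `k`, and the toy data are hypothetical and not taken from the presentation.

```python
from typing import List, Set


def cartify(data: List[List[float]], k: int) -> List[List[Set[int]]]:
    """Return, per dimension, one cart (set of object ids) per object.

    Assumption: a cart holds the ids of the k objects closest to the
    reference object along a single dimension, the object itself included.
    """
    n = len(data)
    n_dims = len(data[0]) if data else 0
    carts_per_dim: List[List[Set[int]]] = []
    for d in range(n_dims):
        carts: List[Set[int]] = []
        for i in range(n):
            # Rank all objects by 1-D distance to object i; ties broken by id.
            ranked = sorted(
                range(n),
                key=lambda j: (abs(data[j][d] - data[i][d]), j),
            )
            carts.append(set(ranked[:k]))
        carts_per_dim.append(carts)
    return carts_per_dim


# Toy data: two groups that are well separated in dimension 0 only.
data = [
    [0.0, 5.0],
    [0.1, 9.0],
    [0.2, 1.0],
    [5.0, 5.1],
    [5.1, 9.1],
    [5.2, 1.1],
]
carts = cartify(data, k=3)
print(carts[0])  # carts built from dimension 0
```

In this toy data the dimension-0 carts repeat the same two object groups, while the dimension-1 carts mix objects from both groups, which is the kind of local neighborhood signal the transformation is meant to keep.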

  15. Frequent Itemset Mining [diagram labels: Cartified DB, FIs, Original DB, ?]

  16. Cartification • Frequent Itemset Mining solves our problem. • It is not scalable.
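
To make the scalability remark concrete, here is a deliberately naive sketch of frequent itemset mining over carts. It assumes each cart is a set of object ids and that a group of objects counts as frequent when it appears together in at least `min_support` carts. The function `frequent_itemsets`, its parameters, and the example carts are illustrative only; this is not the miner used by the authors.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Set


def frequent_itemsets(
    transactions: List[Set[int]], size: int, min_support: int
) -> Dict[FrozenSet[int], int]:
    """Brute-force search for itemsets of a fixed size.

    Every combination of `size` items is checked against every transaction,
    so the candidate count grows combinatorially with the number of items.
    """
    items = sorted(set().union(*transactions))
    result: Dict[FrozenSet[int], int] = {}
    for candidate in combinations(items, size):
        cand = frozenset(candidate)
        # Support = number of transactions that contain the whole candidate.
        support = sum(1 for t in transactions if cand <= t)
        if support >= min_support:
            result[cand] = support
    return result


# Hypothetical carts for six objects forming two groups of three.
carts = [
    {0, 1, 2}, {0, 1, 2}, {0, 1, 2},
    {3, 4, 5}, {3, 4, 5}, {3, 4, 5},
]
fis = frequent_itemsets(carts, size=3, min_support=3)
print(fis)
```

With n objects the loop visits C(n, size) candidates. Practical miners such as Apriori prune the candidate space, but the worst case still grows exponentially in the number of items.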

  17. Take 2

  18. Take 2 ?

  19. Uniform vs. Clusters

  20. Running example [figure: Dim 1 ✓]

  21. Running example [figure: Dim 2 ✓, Dim 1 ?]

  22. Running example [figure over Dim 1 to Dim 4; marks as extracted: ✓ Dim 1 ✗ ? Dim 2 ✗ ? Dim 3 ✓ ? Dim 4]

  23. Experiments [figure: F1 Score (0.0 to 1.0) for Our Method, CartiClus, FIRES, PROCLUS, STATPC, SUBCLU; x-axis ticks 1, 2, 4, 8, 16, 32, 100, 200]

  24. Experiments [figure: Run Time (seconds), axis ticks 0 to 10000, on S1500, S2500, S3500, S4500, S5500 for Our Method, CartiClus, FIRES, PROCLUS, STATPC, SUBCLU]

  25. Experiments [figure: Run Time (seconds), axis ticks 0 to 10000, on D5, D10, D15, D25, D50, D75 for Our Method, CartiClus, FIRES, PROCLUS, STATPC, SUBCLU]
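
The quality plots report an F1 score, but the slides do not say how it is aggregated over a whole clustering. As a hedged illustration only, the sketch below computes F1 between one found cluster and one ground-truth cluster, both treated as sets of object ids. The function name `f1_score` and the example sets are assumptions made for this illustration.

```python
from typing import Set


def f1_score(found: Set[int], truth: Set[int]) -> float:
    """Harmonic mean of precision and recall between two object sets."""
    overlap = len(found & truth)
    if overlap == 0:
        return 0.0
    precision = overlap / len(found)   # share of found objects that are correct
    recall = overlap / len(truth)      # share of true objects that were found
    return 2 * precision * recall / (precision + recall)


# A found cluster with one extra object compared with the true cluster.
score = f1_score({0, 1, 2, 3}, {0, 1, 2})
print(round(score, 3))
```

Here precision is 3/4 and recall is 1, so the score is 6/7.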

  26. Real World – MovieLens: Star Wars: A New Hope (a.k.a. Star Wars) (1977); Star Wars: The Empire Strikes Back (1980); Star Wars: Return of the Jedi (1983); LotR: The Fellowship of the Ring, The (2001); LotR: The Two Towers, The (2002); LotR: The Return of the King, The (2003); Back to the Future (1985); Terminator, The (1984); Terminator 2: Judgment Day (1991); Die Hard (1988); Terminator, The (1984); Terminator 2: Judgment Day (1991); Usual Suspects, The (1995); Pulp Fiction (1994); Silence of the Lambs, The (1991)

  27. Real World – MovieLens: Star Wars: A New Hope (1977); Brazil (1985); Star Wars: The Empire Strikes Back (1980); Dr. Strangelove (1964); Star Wars: Return of the Jedi (1983); Clockwork Orange, A (1971); LotR: The Fellowship of the Ring, The (2001); 2001: A Space Odyssey (1968); LotR: The Two Towers, The (2002); Blade Runner (1982); LotR: The Return of the King, The (2003); Alien (1979); Chinatown (1974); Third Man, The (1949); Rear Window (1954); Citizen Kane (1941); North by Northwest (1959); Godfather: Part II, The (1974); Vertigo (1958); Chinatown (1974); Psycho (1960); Godfather, The (1972); Silence of the Lambs, The (1991); Taxi Driver (1976)

  28. Conclusion • Preserves neighborhood information • Combines different similarity measures gracefully • Finds relevant features and discards noise • Fast • Produces explainable results → Code and data are available at our website. Thank you!

  29. Real World – Gene Expression (values listed as Alon, Nutt) • Our method: 0.78, 0.78 • PROCLUS: 0.46, 0.49 • FIRES: 0.52, 0.55 • SUBCLU: 0.58, n/a • STATPC: n/a, n/a • CartiClus: n/a, n/a • # of Objects: 62, 50 • # of Dims: 2000, 1377

  30. More Experiments [figure: F1 Score (0 to 1) on S1500, S2500, S3500, S4500, S5500 for Our Method, CartiClus, FIRES, PROCLUS, STATPC, SUBCLU]

  31. More Experiments [figure: F1 Score (0 to 1) on D05, D10, D15, D25, D50, D75 for Our Method, CartiClus, FIRES, PROCLUS, STATPC, SUBCLU]

  32. Experiments • Evaluate: subspace cluster detection, noise robustness, scalability • Competitors: subspace clustering (PROCLUS), clustering (K-Means), dimensionality reduction (PCA and Random Projection), clustering ensemble (CSPA)

  33. Results [figure: Quality of the found clusters, F1 (0 to 1); 10 clusters in 10 dimensions, 200 irrelevant dimensions; labels: E4SC, Our Method, CSPA, Proclus, K-Means, PCA+KM, RP+KM]

  34. Results [same figure as slide 33] • Very effective at finding relevant dimensions.
