
Distance-based Methods: Drawbacks - PowerPoint PPT Presentation



  1. Distance-based Methods: Drawbacks
     • Hard to find clusters with irregular shapes
     • Hard to specify the number of clusters
     • Heuristic: a cluster must be dense
     Jian Pei: CMPT 459/741 Clustering (3)

  2. How to Find Irregular Clusters?
     • Divide the whole space into many small areas
       – The density of an area can be estimated
       – Areas may or may not be exclusive
       – A dense area is likely to be part of a cluster
     • Start from a dense area, traverse connected dense areas, and discover clusters of irregular shape

  3. Directly Density Reachable
     • Parameters
       – Eps: maximum radius of the neighborhood
       – MinPts: minimum number of points in the Eps-neighborhood of a point
       – N_Eps(p) = {q | dist(p, q) ≤ Eps}
     • Core object p: |N_Eps(p)| ≥ MinPts
       – A core object is in a dense area
     • Point q is directly density-reachable from p iff q ∈ N_Eps(p) and p is a core object
     (Figure: points p and q, with MinPts = 3 and Eps = 1 cm)
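
The definitions on this slide can be sketched in a few lines of Python. This is a brute-force illustration only: the function names and the sample points are assumptions made for the example, and the Eps-neighborhood is taken to include p itself, as the set definition above implies.

```python
import math

def eps_neighborhood(points, p, eps):
    """N_Eps(p): indices q with dist(p, q) <= eps (p itself is included)."""
    return [q for q in range(len(points))
            if math.dist(points[p], points[q]) <= eps]

def is_core(points, p, eps, min_pts):
    """p is a core object iff |N_Eps(p)| >= MinPts."""
    return len(eps_neighborhood(points, p, eps)) >= min_pts

def directly_density_reachable(points, q, p, eps, min_pts):
    """q is directly density-reachable from p iff q is in N_Eps(p) and p is core."""
    return is_core(points, p, eps, min_pts) and q in eps_neighborhood(points, p, eps)

# Hypothetical sample data: three close points and one point on the fringe.
points = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (1.4, 0.0)]
eps, min_pts = 1.0, 3

fringe_from_core = directly_density_reachable(points, 3, 1, eps, min_pts)
core_from_fringe = directly_density_reachable(points, 1, 3, eps, min_pts)
```

Point 1 is a core object and point 3 lies in its neighborhood, so point 3 is directly density-reachable from point 1. The reverse does not hold because point 3 is not a core object, which shows that the relation is not symmetric.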

  4. Density-Based Clustering
     • Density-reachable
       – Chain of directly density-reachable steps: p_1 → p_2, p_2 → p_3, …, p_{n-1} → p_n
       – Then p_n is density-reachable from p_1
     • Density-connected
       – If points p and q are both density-reachable from o, then p and q are density-connected

  5. DBSCAN
     • A cluster: a maximal set of density-connected points
       – Discovers clusters of arbitrary shape in spatial databases with noise
     (Figure: core, border, and outlier points; Eps = 1 cm, MinPts = 5)

  6. DBSCAN: the Algorithm
     • Arbitrarily select a point p
     • Retrieve all points density-reachable from p wrt Eps and MinPts
     • If p is a core point, a cluster is formed
     • If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database
     • Continue the process until all of the points have been processed
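
The steps above can be sketched as a small brute-force Python function. It is a minimal illustration, not an optimized implementation: neighborhoods are found by a linear scan rather than a spatial index, and the names `dbscan`, `region_query`, and `NOISE` are chosen for this example. A point first marked as noise is relabeled as a border point if a later cluster reaches it.

```python
import math

NOISE = -1  # label for points that end up in no cluster

def region_query(points, i, eps):
    """Indices of all points within eps of point i (including i)."""
    return [j for j in range(len(points))
            if math.dist(points[i], points[j]) <= eps]

def dbscan(points, eps, min_pts):
    labels = [None] * len(points)  # None means "not yet processed"
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # not core; may become a border point later
            continue
        # i is a core point: start a cluster and expand it.
        labels[i] = cluster_id
        seeds = list(neighbors)
        k = 0
        while k < len(seeds):
            j = seeds[k]
            k += 1
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point of this cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                seeds.extend(j_neighbors)  # j is core, keep expanding
        cluster_id += 1
    return labels

# Hypothetical data: two dense squares and one isolated point.
data = [(0, 0), (0, 1), (1, 0), (1, 1),
        (10, 10), (10, 11), (11, 10), (11, 11),
        (50, 50)]
labels = dbscan(data, eps=1.5, min_pts=3)
```

With these assumed parameters the two squares receive cluster labels 0 and 1, and the isolated point keeps the noise label.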

  7. Challenges for DBSCAN
     • Different clusters may have very different densities
     • Clusters may be in hierarchies

  8. OPTICS: A Cluster-ordering Method
     • Idea: ordering points to identify the clustering structure
     • “Group” points by density connectivity
       – Hierarchies of clusters
     • Visualize clusters and the hierarchy

  9. Ordering Points
     • Points that are strongly density-connected should be close to one another
     • Clusters that are density-connected should be close to one another and form a “cluster” of clusters

  10. OPTICS: An Example
      (Figure: reachability plot; vertical axis: reachability-distance, with undefined values and thresholds ε and ε′ marked; horizontal axis: cluster-order of the objects)
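
The slides do not define reachability-distance, so the sketch below uses the definitions commonly given for OPTICS, stated here as an assumption: the core-distance of p is the distance to its MinPts-th nearest neighbor within ε (counting p itself), and the reachability-distance of o from p is the larger of that core-distance and dist(p, o). Both are undefined (`None` in this sketch) when p is not a core object. Function names and sample points are hypothetical.

```python
import math

def core_distance(points, p, eps, min_pts):
    """Distance from p to its min_pts-th nearest neighbor within eps.

    p counts as its own neighbor at distance 0 (an assumed convention).
    Returns None when p is not a core object.
    """
    within = sorted(d for d in (math.dist(points[p], q) for q in points)
                    if d <= eps)
    if len(within) < min_pts:
        return None
    return within[min_pts - 1]

def reachability_distance(points, o, p, eps, min_pts):
    """Reachability-distance of o with respect to p; None if p is not core."""
    cd = core_distance(points, p, eps, min_pts)
    if cd is None:
        return None
    return max(cd, math.dist(points[p], points[o]))

# Hypothetical 1-D style data embedded in the plane.
pts = [(0, 0), (1, 0), (2, 0), (10, 0)]
cd0 = core_distance(pts, 0, eps=3, min_pts=3)
r10 = reachability_distance(pts, 1, 0, eps=3, min_pts=3)
r03 = reachability_distance(pts, 0, 3, eps=3, min_pts=3)
```

Plotting such reachability values in the order OPTICS visits the points gives the kind of plot shown on this slide: valleys correspond to clusters, and undefined values mark the start of a new density-connected region.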

  11. DENCLUE: Using Density Functions
      • DENsity-based CLUstEring
      • Major features
        – Solid mathematical foundation
        – Good for data sets with large amounts of noise
        – Allows a compact mathematical description of arbitrarily shaped clusters in high-dimensional data sets
        – Significantly faster than existing algorithms (faster than DBSCAN by a factor of up to 45)
        – But needs a large number of parameters

  12. DENCLUE: Techniques
      • Use grid cells
        – Only keep grid cells actually containing data points
        – Manage cells in a tree-based access structure
      • Influence function: describes the impact of a data point on its neighborhood
      • Overall density of the data space is the sum of the influence functions of all data points
      • Clustering by identifying density attractors
        – Density attractor: local maximum of the overall density function
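
The influence-function idea can be illustrated with a short Python sketch. Several things here are assumptions made for the example and are not stated on the slide: a Gaussian influence function with width `sigma`, a fixed-step hill climb, and the sample data. The grid cells and tree-based access structure mentioned above are omitted, so this version sums over all points directly.

```python
import math

def influence(x, y, sigma):
    """Gaussian influence of data point y at location x (assumed kernel)."""
    return math.exp(-math.dist(x, y) ** 2 / (2 * sigma ** 2))

def density(x, data, sigma):
    """Overall density at x: the sum of the influences of all data points."""
    return sum(influence(x, y, sigma) for y in data)

def gradient(x, data, sigma):
    """Direction of the density gradient at x (constant factor 1/sigma^2 dropped)."""
    grad = [0.0] * len(x)
    for y in data:
        w = influence(x, y, sigma)
        for d in range(len(x)):
            grad[d] += w * (y[d] - x[d])
    return grad

def climb(x, data, sigma, step=0.05, max_iter=1000):
    """Hill-climb from x toward a density attractor using fixed-length steps."""
    x = list(x)
    for _ in range(max_iter):
        g = gradient(x, data, sigma)
        norm = math.sqrt(sum(v * v for v in g))
        if norm == 0:
            break
        candidate = [xi + step * gi / norm for xi, gi in zip(x, g)]
        if density(candidate, data, sigma) <= density(x, data, sigma):
            break  # no further improvement: near a local maximum
        x = candidate
    return x

# Hypothetical data: two small groups around (0, 0) and (5, 5).
data = [(0, 0), (0.2, 0), (-0.2, 0), (5, 5), (5.2, 5), (4.8, 5)]
attractor_a = climb((1.0, 0.5), data, sigma=0.5)
attractor_b = climb((4.0, 4.5), data, sigma=0.5)
```

Points whose climbs end at the same attractor would be assigned to the same cluster; here the two starting locations end near the centers of the two groups.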

  13. Density Attractor

  14. Center-defined and Arbitrary Clusters

  15. A Shrinking-based Approach
      • Difficulties of multi-dimensional clustering
        – Noise (outliers)
        – Clusters of various densities
        – Not well-defined shapes
      • A novel preprocessing concept: “shrinking”
      • A shrinking-based clustering approach

  16. Intuition & Purpose
      • For data points in a data set, what if we could make them move towards the centroid of the natural subgroup they belong to?
      • Naturally sparse subgroups become denser and thus easier to detect
        – Noise is further isolated

  17. Inspiration
      • Newton’s Universal Law of Gravitation
        – Any two objects exert a gravitational force of attraction on each other
        – The direction of the force is along the line joining the objects
        – The magnitude of the force is directly proportional to the product of the gravitational masses of the objects, and inversely proportional to the square of the distance between them: $F_g = G \frac{m_1 m_2}{r^2}$
        – G: universal gravitational constant
          • $G = 6.67 \times 10^{-11}\ \mathrm{N\,m^2/kg^2}$

  18. The Concept of Shrinking
      • A data preprocessing technique
        – Aims to optimize the inner structure of real data sets
      • Each data point is “attracted” by other data points and moves in the direction in which the attraction is strongest
      • Can be applied in different fields

  19. Applying Shrinking to Clustering
      • Shrink the naturally sparse clusters to make them much denser, which facilitates the subsequent cluster-detection process
      (Figure: multi-attribute hyperspace)

  20. Data Shrinking
      • Each data point moves along the direction of the density gradient, and the data set shrinks towards the inside of the clusters
      • Points are “attracted” by their neighbors and move to create denser clusters
      • It proceeds iteratively, repeating until the data are stabilized or the number of iterations exceeds a threshold

  21. Approximation & Simplification
      • Problem: computing the mutual attraction of every pair of data points is too time-consuming, O(n^2)
        – Solution: drop Newton's constant G, and set m_1 and m_2 to unit mass
      • Only aggregate the gravitation surrounding each data point
      • Use grids to simplify the computation

  22. Termination Condition
      • Average movement of all points in the current iteration is less than a threshold
      • The number of iterations exceeds a threshold
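
The iteration and its two termination conditions can be sketched as follows. This is a deliberately simplified stand-in for the grid-based method described on the slides: each point simply moves to the centroid of its neighbors within a fixed radius, using a brute-force O(n^2) neighbor search. The function name, the `radius` parameter, the default thresholds, and the sample data are all assumptions made for the example.

```python
import math

def shrink(points, radius, move_threshold=1e-3, max_iter=50):
    """Iteratively move each point to the centroid of its neighbors.

    Stops when the average movement in an iteration falls below
    move_threshold, or after max_iter iterations.
    Returns the shrunk points and the number of iterations performed.
    """
    pts = [tuple(p) for p in points]
    iteration = 0
    for iteration in range(1, max_iter + 1):
        new_pts = []
        for p in pts:
            # Neighbors within the radius; p itself is always included.
            nbrs = [q for q in pts if math.dist(p, q) <= radius]
            centroid = tuple(sum(c) / len(nbrs) for c in zip(*nbrs))
            new_pts.append(centroid)
        avg_move = sum(math.dist(a, b) for a, b in zip(pts, new_pts)) / len(pts)
        pts = new_pts
        if avg_move < move_threshold:
            break
    return pts, iteration

# Hypothetical data: two loose squares far apart.
data = [(0, 0), (1, 0), (0, 1), (1, 1),
        (10, 10), (11, 10), (10, 11), (11, 11)]
shrunk, iterations = shrink(data, radius=2)
```

On this toy input every point reaches the centroid of its square in the first iteration, and the second iteration detects zero average movement and stops.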

  23. OPTICS on Pendigits Data
      (Figure: two panels, “Before data shrinking” and “After data shrinking”)

  24. Biclustering
      • Clustering both objects and attributes simultaneously
      • Four requirements
        – Only a small set of objects in a cluster (bicluster)
        – A bicluster only involves a small number of attributes
        – An object may participate in multiple biclusters or no biclusters
        – An attribute may be involved in multiple biclusters or no biclusters
      Jian Pei: Big Data Analytics -- Clustering

  25. Application Examples
      • Recommender systems
        – Objects: users
        – Attributes: items
        – Values: user ratings
      • Microarray data
        – Objects: genes
        – Attributes: samples
        – Values: expression levels
      (Figure: gene-by-sample/condition matrix $W = (w_{ij})$, with rows for genes $1, \dots, n$ and columns for samples/conditions $1, \dots, m$)

  26. Biclusters with Constant Values
      Constant values over the whole bicluster (rows a_1, a_33, a_86; columns b_6, b_12, b_36, b_99):

               b_6   b_12  b_36  b_99
        a_1     60    60    60    60
        a_33    60    60    60    60
        a_86    60    60    60    60

      Constant values on rows:

        10  10  10  10  10
        20  20  20  20  20
        50  50  50  50  50
         0   0   0   0   0
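
A minimal Python sketch of how the two kinds of constant bicluster on this slide could be checked. The helper names and the small background matrix are hypothetical; only the value 60 and the constant-on-rows example come from the slide.

```python
def submatrix(matrix, rows, cols):
    """Extract the entries at the chosen row and column indices."""
    return [[matrix[r][c] for c in cols] for r in rows]

def is_constant(sub):
    """True if every entry of the submatrix has the same value."""
    return len({v for row in sub for v in row}) == 1

def is_constant_on_rows(sub):
    """True if each row of the submatrix is constant (rows may differ)."""
    return all(len(set(row)) == 1 for row in sub)

# Hypothetical matrix hiding a constant bicluster of 60s
# at rows 0, 2, 3 and columns 1, 3.
matrix = [
    [1, 60, 3, 60, 5],
    [7,  8, 9,  1, 2],
    [4, 60, 6, 60, 0],
    [2, 60, 8, 60, 9],
]
hidden = submatrix(matrix, rows=[0, 2, 3], cols=[1, 3])

# The "constant values on rows" example from the slide.
on_rows = [[10] * 5, [20] * 5, [50] * 5, [0] * 5]
```

Here `hidden` is constant everywhere, while `on_rows` is constant within each row but not across rows. Finding such row and column subsets in the first place is the hard part of biclustering and is not attempted in this sketch.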

  27. Biclusters with Coherent Values
      • Also known as pattern-based clusters

  28. Biclusters with Coherent Evolutions
      • Only up- or down-regulated changes over rows or columns

        10   50  30    70  20
        20  100  50  1000  30
        50  100  90   120  80
         0   80  20   100  10

      Coherent evolutions on rows
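
One way to read the example matrix, sketched in Python: every row rises and falls at the same positions, even though the magnitudes differ widely. The interpretation (comparing the sign of the change between adjacent columns) and the function names are assumptions for this illustration; the numbers are the ones on the slide.

```python
def trend(row):
    """Sign of the change between adjacent entries: 1 up, -1 down, 0 flat."""
    return tuple((b > a) - (b < a) for a, b in zip(row, row[1:]))

def coherent_evolution_on_rows(sub):
    """True if all rows share the same up/down pattern across the columns."""
    first = trend(sub[0])
    return all(trend(row) == first for row in sub)

# The matrix from the slide.
example = [
    [10,  50, 30,   70, 20],
    [20, 100, 50, 1000, 30],
    [50, 100, 90,  120, 80],
    [ 0,  80, 20,  100, 10],
]
pattern = trend(example[0])
coherent = coherent_evolution_on_rows(example)
```

All four rows follow the pattern up, down, up, down, so the check succeeds despite the outlying value 1000 in the second row.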

  29. Differences from Subspace Clustering
      • Subspace clustering uses a global distance/similarity measure
      • Pattern-based clustering looks at patterns
      • A subspace cluster according to a globally defined similarity measure may not follow the same pattern

  30. Objects Follow the Same Pattern?
      • pScore: the smaller the pScore, the more consistent the objects
      (Figure: pScore of object blue and object green over D_1 and D_2)
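
The slide does not give the formula, so the sketch below assumes the definition commonly used in pattern-based clustering: for objects x and y and attributes a and b, pScore = |(x_a − x_b) − (y_a − y_b)|. The function names and the sample objects are hypothetical.

```python
from itertools import combinations

def p_score(x, y, a, b):
    """pScore of objects x and y on attributes a and b (assumed definition)."""
    return abs((x[a] - x[b]) - (y[a] - y[b]))

def max_p_score(objects, attrs):
    """Largest pScore over all object pairs and attribute pairs."""
    return max(p_score(x, y, a, b)
               for x, y in combinations(objects, 2)
               for a, b in combinations(attrs, 2))

# Hypothetical objects: green is blue shifted by a constant,
# while other deviates on the second attribute.
blue = [10, 20, 15]
green = [110, 120, 115]
other = [10, 25, 15]

shifted = max_p_score([blue, green], attrs=[0, 1, 2])
deviating = p_score(blue, other, 0, 1)
```

A pure shift gives a pScore of 0 on every attribute pair, so blue and green follow the same pattern even though they are far apart under a global distance measure; the deviating object yields a positive pScore.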
