
Clustering Community Detection - PowerPoint PPT Presentation



  1. Clustering

  2. Community Detection
     [Figure: community detection in social media, http://image.slidesharecdn.com/communitydetectionitilecture22june2011-110622095259-phpapp02/95/community-detection-in-social-media-1-728.jpg?cb=1308736811]
     Jian Pei: CMPT 741/459 Clustering (1)

  3. Customer Relationship Management
     • Partitioning customers into groups such that customers within a group are similar in some aspects
     • A manager can be assigned to a group
     • Customized products and services can be developed

  4. What Is Clustering?
     • Group data into clusters
       – Similar to one another within the same cluster
       – Dissimilar to the objects in other clusters
       – Unsupervised learning: no predefined classes
     [Figure: two clusters and some outliers]

  5. Requirements of Clustering
     • Scalability
     • Ability to deal with various types of attributes
     • Discovery of clusters with arbitrary shape
     • Minimal requirements for domain knowledge to determine input parameters

  6. Data Matrix
     • For memory-based clustering
       – Also called object-by-variable structure
     • Represents n objects with p variables (attributes, measures)
       – A relational table
     $$\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}$$

  7. Dissimilarity Matrix
     • For memory-based clustering
       – Also called object-by-object structure
       – Proximities of pairs of objects
       – d(i, j): dissimilarity between objects i and j
       – Nonnegative
       – Close to 0: similar
     $$\begin{bmatrix} 0 \\ d(2,1) & 0 \\ d(3,1) & d(3,2) & 0 \\ \vdots & \vdots & \vdots & \ddots \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}$$
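Building such a matrix is mechanical; a minimal Python sketch (the function name and the toy 1-D distance are illustrative, not from the slides):

```python
def dissimilarity_matrix(objects, d):
    """Build the lower-triangular dissimilarity matrix: row i holds
    d(i, 0), ..., d(i, i-1), followed by the zero diagonal entry."""
    return [[d(objects[i], objects[j]) for j in range(i)] + [0.0]
            for i in range(len(objects))]

# Toy example: three 1-D objects with absolute difference as d.
m = dissimilarity_matrix([1, 4, 6], lambda a, b: abs(a - b))
# m == [[0.0], [3, 0.0], [5, 2, 0.0]]
```

Only the lower triangle is stored because d is symmetric and the diagonal is always 0.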

  8. How Good Is Clustering?
     • Dissimilarity/similarity depends on the distance function
       – Different applications have different functions
     • Judgment of clustering quality is typically highly subjective

  9. Types of Data in Clustering
     • Interval-scaled variables
     • Binary variables
     • Nominal, ordinal, and ratio variables
     • Variables of mixed types

  10. Interval-valued Variables
     • Continuous measurements on a roughly linear scale
       – Weight, height, latitude and longitude coordinates, temperature, etc.
     • Effect of measurement units in attributes
       – Smaller unit → larger variable range → larger effect on the result
       – Standardization + background knowledge

  11. Standardization
     • Calculate the mean
       $m_f = \frac{1}{n}(x_{1f} + x_{2f} + \cdots + x_{nf})$
     • Calculate the mean absolute deviation
       $s_f = \frac{1}{n}(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|)$
     • Calculate the standardized measurement (z-score)
       $z_{if} = \frac{x_{if} - m_f}{s_f}$
     • Mean absolute deviation is more robust than standard deviation
       – The effect of outliers is reduced but remains detectable
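The three steps above can be sketched directly in plain Python (the function name is mine):

```python
def standardize(values):
    """z-scores for one variable f, using the mean absolute deviation."""
    n = len(values)
    m = sum(values) / n                          # mean m_f
    s = sum(abs(x - m) for x in values) / n      # mean absolute deviation s_f
    return [(x - m) / s for x in values]

z = standardize([2.0, 4.0, 6.0, 8.0])
# m = 5, s = (3 + 1 + 1 + 3) / 4 = 2, so z = [-1.5, -0.5, 0.5, 1.5]
```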

  12. Similarity and Dissimilarity
     • Distances are normally used as measures
     • Minkowski distance: a generalization
       $d(i, j) = \sqrt[q]{|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q} \quad (q > 0)$
     • If q = 2, d is Euclidean distance
     • If q = 1, d is Manhattan distance
     • If q = ∞, d is Chebyshev distance
     • Weighted distance
       $d(i, j) = \sqrt[q]{w_1 |x_{i1} - x_{j1}|^q + w_2 |x_{i2} - x_{j2}|^q + \cdots + w_p |x_{ip} - x_{jp}|^q} \quad (q > 0)$
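The special cases are easy to check in code (a short sketch; function names are mine):

```python
def minkowski(x, y, q):
    """Minkowski distance of order q > 0 between two p-dimensional points."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1 / q)

def chebyshev(x, y):
    """The q -> infinity limit: the largest coordinate-wise difference."""
    return max(abs(a - b) for a, b in zip(x, y))

p1, p2 = (0, 0), (3, 4)
print(minkowski(p1, p2, 1))  # Manhattan: 7.0
print(minkowski(p1, p2, 2))  # Euclidean: 5.0
print(chebyshev(p1, p2))     # 4
```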

  13. Manhattan and Chebyshev Distance
     • When n = 2, the Chebyshev distance is the chess distance: the number of moves a king needs between two squares
     [Figures: Manhattan distance; Chebyshev distance. Pictures from Wikipedia and http://brainking.com/images/rules/chess/02.gif]

  14. Properties of Minkowski Distance
     • Nonnegative: d(i, j) ≥ 0
     • The distance of an object to itself is 0
       – d(i, i) = 0
     • Symmetric: d(i, j) = d(j, i)
     • Triangle inequality
       – d(i, j) ≤ d(i, k) + d(k, j)

  15. Binary Variables
     • A contingency table for binary data:

                          Object j
                        1     0     Sum
        Object i   1    q     r     q+r
                   0    s     t     s+t
                 Sum   q+s   r+t     p

     • Symmetric variable: each state carries the same weight
       – $d(i, j) = \frac{r + s}{q + r + s + t}$ (invariant similarity)
     • Asymmetric variable: the positive value carries more weight
       – $d(i, j) = \frac{r + s}{q + r + s}$ (noninvariant similarity, Jaccard)
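Counting q, r, s, t and applying the two formulas can be sketched as follows (a toy implementation, not from the slides):

```python
def binary_dissimilarity(x, y, symmetric=True):
    """Dissimilarity between two 0/1 vectors via the contingency counts."""
    pairs = list(zip(x, y))
    q = sum(a == 1 and b == 1 for a, b in pairs)  # 1-1 matches
    r = sum(a == 1 and b == 0 for a, b in pairs)
    s = sum(a == 0 and b == 1 for a, b in pairs)
    t = sum(a == 0 and b == 0 for a, b in pairs)  # 0-0 matches
    if symmetric:
        return (r + s) / (q + r + s + t)   # invariant (symmetric) case
    return (r + s) / (q + r + s)           # Jaccard: 0-0 matches ignored

x, y = [1, 0, 1, 0, 0], [1, 1, 0, 0, 0]
print(binary_dissimilarity(x, y))                   # (1+1)/5 = 0.4
print(binary_dissimilarity(x, y, symmetric=False))  # (1+1)/3
```

Dropping the t term matters when 1 is rare and meaningful (e.g., a positive test result): two objects should not look similar merely because they share many 0s.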

  16. Nominal Variables
     • A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green
     • Method 1: simple matching
       $d(i, j) = \frac{p - m}{p}$
       – m: # of matches, p: total # of variables
     • Method 2: use a large number of binary variables
       – Create a new binary variable for each of the M nominal states
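Method 1 in code (an illustrative sketch):

```python
def simple_matching(x, y):
    """d(i, j) = (p - m) / p: the fraction of mismatched nominal variables."""
    p = len(x)
    m = sum(a == b for a, b in zip(x, y))
    return (p - m) / p

d = simple_matching(["red", "blue", "green", "red"],
                    ["red", "yellow", "green", "blue"])
# 2 of the 4 variables mismatch, so d = 0.5
```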

  17. Ordinal Variables
     • An ordinal variable can be discrete or continuous, with ranks $r_{if} \in \{1, \ldots, M_f\}$
     • Order is important, e.g., rank
     • Can be treated like interval-scaled variables
       – Replace $x_{if}$ by its rank $r_{if}$
       – Map the range of each variable onto [0, 1] by replacing the rank of the i-th object in the f-th variable with
         $z_{if} = \frac{r_{if} - 1}{M_f - 1}$
       – Compute the dissimilarity using methods for interval-scaled variables
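The rank-to-[0, 1] mapping is one line (a hypothetical helper, not from the slides):

```python
def ordinal_to_interval(ranks, M):
    """Map ranks in {1, ..., M} onto [0, 1] via z = (r - 1) / (M - 1)."""
    return [(r - 1) / (M - 1) for r in ranks]

# A 5-state ordinal variable, e.g., ratings from 1 to 5:
print(ordinal_to_interval([1, 3, 5], M=5))  # [0.0, 0.5, 1.0]
```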

  18. Ratio-scaled Variables
     • Ratio-scaled variable: a positive measurement on a nonlinear scale
       – E.g., approximately at exponential scale, such as $Ae^{Bt}$
     • Treat them like interval-scaled variables?
       – Not a good choice: the scale can be distorted!
     • Apply a logarithmic transformation, $y_{if} = \log(x_{if})$
     • Treat them as continuous ordinal data, treating their ranks as interval-scaled
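The log transform in code (a small sketch; the helper name and data are mine):

```python
import math

def ratio_to_interval(values):
    """Logarithmic transformation y_if = log(x_if) for ratio-scaled data."""
    return [math.log(x) for x in values]

# Data on an exponential scale becomes roughly linear after the transform:
growth = [10.0, 100.0, 1000.0]
print(ratio_to_interval(growth))  # approximately [2.30, 4.61, 6.91]
```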

  19. Variables of Mixed Types
     • A database may contain all six types of variables
       – Symmetric binary, asymmetric binary, nominal, ordinal, interval, and ratio
     • One may use a weighted formula to combine their effects
       $d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$
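The weighted combination is a one-liner once the per-variable dissimilarities and indicators are known (an illustrative sketch; names and data are mine):

```python
def mixed_dissimilarity(d_f, delta_f):
    """d(i, j) = sum_f delta^(f) * d^(f) / sum_f delta^(f).

    d_f:     per-variable dissimilarities d_ij^(f), each scaled to [0, 1]
    delta_f: indicators; delta = 0 when variable f is missing for i or j
             (or is an asymmetric-binary 0-0 match), 1 otherwise
    """
    return (sum(delta * d for delta, d in zip(delta_f, d_f))
            / sum(delta_f))

# Three variables; the second is missing for this pair of objects:
print(mixed_dissimilarity([0.5, 0.9, 0.1], [1, 0, 1]))  # (0.5 + 0.1) / 2
```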

  20. Clustering Methods
     • K-means and partitioning methods
     • Hierarchical clustering
     • Density-based clustering
     • Grid-based clustering
     • Pattern-based clustering
     • Other clustering methods

  21. Partitioning Algorithms: Ideas
     • Partition n objects into k clusters
       – Optimize the chosen partitioning criterion
     • Global optimum: examine all possible partitions
       – $(k^n - (k-1)^n - \cdots - 1)$ possible partitions, too expensive!
     • Heuristic methods: k-means and k-medoids
       – K-means: a cluster is represented by its center
       – K-medoids or PAM (partition around medoids): each cluster is represented by one of the objects in the cluster

  22. K-means
     • Arbitrarily choose k objects as the initial cluster centers
     • Until no change, do
       – (Re)assign each object to the cluster to which the object is the most similar, based on the mean value of the objects in the cluster
       – Update the cluster means, i.e., calculate the mean value of the objects for each cluster
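The loop above, sketched in plain Python (a minimal implementation under squared-Euclidean distance; the function name, empty-cluster handling, and toy data are mine):

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """K-means on tuples of numbers: assign to nearest center, update means."""
    centers = random.Random(seed).sample(points, k)  # arbitrary initial centers
    for _ in range(max_iter):
        # (Re)assign each object to the most similar center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sum(
                (a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[nearest].append(p)
        # Update the cluster means (keep the old center if a cluster empties).
        new_centers = [tuple(sum(col) / len(cl) for col in zip(*cl)) if cl
                       else centers[i] for i, cl in enumerate(clusters)]
        if new_centers == centers:  # no change: converged
            break
        centers = new_centers
    return centers, clusters

pts = [(1, 1), (1.5, 2), (8, 8), (9, 9)]
centers, clusters = kmeans(pts, 2)
print(sorted(centers))  # [(1.25, 1.5), (8.5, 8.5)]
```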

  23. K-Means: Example
     [Figure: K-means with K = 2 on a 2-D dataset. Arbitrarily choose K objects as the initial cluster centers; assign each object to the most similar center; update the cluster means; reassign and repeat until no object changes cluster.]

  24. Pros and Cons of K-means
     • Relatively efficient: O(tkn)
       – n: # objects, k: # clusters, t: # iterations; k, t << n
     • Often terminates at a local optimum
     • Applicable only when the mean is defined
       – What about categorical data?
     • Need to specify the number of clusters
     • Unable to handle noisy data and outliers
     • Unsuitable for discovering non-convex clusters

  25. Variations of K-means
     • Aspects of variation
       – Selection of the initial k means
       – Dissimilarity calculations
       – Strategies to calculate cluster means
     • Handling categorical data: k-modes
       – Use the mode instead of the mean
         • Mode: the most frequent item(s)
       – A mixture of categorical and numerical data: the k-prototype method
     • EM (expectation maximization): assign a probability of an object to a cluster (will be discussed later)
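The k-modes representative, the most frequent value per attribute, can be computed like this (a small illustrative helper, not from the slides):

```python
from collections import Counter

def mode_center(cluster):
    """Per-attribute mode of a cluster of categorical tuples (k-modes)."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*cluster))

shirts = [("red", "S"), ("red", "M"), ("blue", "M")]
print(mode_center(shirts))  # ('red', 'M')
```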

  26. A Problem of K-means
     • Sensitive to outliers
       – Outlier: objects with extremely large values
     • Outliers may substantially distort the distribution of the data
     • K-medoids: represent each cluster by its most centrally located object
     [Figure: the same data clustered with a mean dragged by an outlier vs. a medoid]
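The contrast between a mean and a medoid shows up even in one dimension (an illustrative sketch; names and data are mine):

```python
def medoid(cluster, d):
    """The most centrally located object: minimal total distance to the rest."""
    return min(cluster, key=lambda x: sum(d(x, y) for y in cluster))

values = [1.0, 2.0, 3.0, 4.0, 100.0]      # one extreme outlier
mean = sum(values) / len(values)          # 22.0, dragged toward the outlier
med = medoid(values, lambda a, b: abs(a - b))
print(mean, med)  # 22.0 3.0
```

Unlike the mean, the medoid is always one of the cluster's own objects, so a single extreme value cannot pull it outside the dense region.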
