Lecture 10: Clustering (Partitioning, Hierarchical, Model-Based, and Density-Based Methods)

Preview (Lecture 10)
- Introduction
- Partitioning methods
- Hierarchical methods
- Model-based methods
- Density-based methods

What is Clustering?
- Cluster: a collection of data objects that are
  - similar to one another within the same cluster, and
  - dissimilar to the objects in other clusters.
- Cluster analysis: grouping a set of data objects into clusters.
- Clustering is unsupervised classification: there are no predefined classes.
- Typical applications:
  - as a stand-alone tool to get insight into the data distribution;
  - as a preprocessing step for other algorithms.

Examples of Clustering Applications
- Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs.
- Land use: identification of areas of similar land use in an earth observation database.
- Insurance: identifying groups of motor insurance policy holders with a high average claim cost.
- Urban planning: identifying groups of houses according to their house type, value, and geographical location.
- Seismology: observed earthquake epicenters should be clustered along continental faults.

Requirements for Clustering in Data Mining
- Scalability
- Ability to deal with different types of attributes
- Discovery of clusters with arbitrary shape
- Minimal domain knowledge required to determine input parameters
- Ability to deal with noise and outliers
- Insensitivity to the order of input records
- Robustness with respect to high dimensionality
- Incorporation of user-specified constraints
- Interpretability and usability

What Is a Good Clustering?
- A good clustering method will produce clusters with
  - high intra-class similarity, and
  - low inter-class similarity (a small sketch quantifying both follows after this slide group).
- A precise definition of clustering quality is difficult:
  - it is application-dependent;
  - it is ultimately subjective.
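
To make intra- versus inter-class similarity tangible, here is a minimal sketch (assuming NumPy; the function name and the toy data are illustrative, not from the slides) that compares mean pairwise distances within clusters against those between clusters.

```python
import numpy as np

def intra_inter_distances(X, labels):
    """Mean pairwise distance within clusters vs. between clusters.

    X: (n, d) array of points; labels: (n,) array of cluster ids.
    A small intra value and a large inter value indicate a good clustering.
    """
    n = len(X)
    intra, inter = [], []
    for i in range(n):
        for j in range(i + 1, n):
            d = np.linalg.norm(X[i] - X[j])
            (intra if labels[i] == labels[j] else inter).append(d)
    return np.mean(intra), np.mean(inter)

# Toy example: two well-separated blobs.
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.8]])
labels = np.array([0, 0, 1, 1])
print(intra_inter_distances(X, labels))  # small intra, large inter
```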

Similarity and Dissimilarity Between Objects
- The same measures we used for IBL apply (e.g., the Lp norm); a small Lp-distance sketch follows after this slide group.
- Euclidean distance (p = 2):

  d(i, j) = \sqrt{ |x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2 }

- Properties of a metric d(i, j):
  - d(i, j) >= 0
  - d(i, i) = 0
  - d(i, j) = d(j, i)
  - d(i, j) <= d(i, k) + d(k, j)

Major Clustering Approaches
- Partitioning: construct various partitions and then evaluate them by some criterion.
- Hierarchical: create a hierarchical decomposition of the set of objects using some criterion.
- Model-based: hypothesize a model for each cluster and find the best fit of the models to the data.
- Density-based: guided by connectivity and density functions.

Partitioning Algorithms
- Partitioning method: construct a partition of a database D of n objects into a set of k clusters.
- Given a k, find the partition into k clusters that optimizes the chosen partitioning criterion:
  - global optimum: exhaustively enumerate all partitions;
  - heuristic methods: the k-means and k-medoids algorithms.
- k-means (MacQueen, 1967): each cluster is represented by the center of the cluster.
- k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw, 1987): each cluster is represented by one of the objects in the cluster.

K-Means Clustering
- Given k, the k-means algorithm consists of four steps (a runnable sketch follows after this slide group):
  1. Select initial centroids at random.
  2. Assign each object to the cluster with the nearest centroid.
  3. Compute each centroid as the mean of the objects assigned to it.
  4. Repeat the previous two steps until no assignment changes.

Comments on the K-Means Method
- Strengths:
  - Relatively efficient: O(tkn), where n is the number of objects, k is the number of clusters, and t is the number of iterations. Normally, k, t << n.
  - Often terminates at a local optimum. The global optimum may be found using techniques such as simulated annealing and genetic algorithms.
- Weaknesses:
  - Applicable only when the mean is defined (what about categorical data?).
  - The number of clusters k must be specified in advance.
  - Trouble with noisy data and outliers.
  - Not suitable for discovering clusters with non-convex shapes.

K-Means Clustering (contd.)
- Example: [Figure: four 2-D scatter plots (axes 0 to 10) illustrating successive k-means iterations on a small example.]
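
The slide's Lp/Euclidean definition in code: a minimal sketch assuming NumPy; lp_distance is an illustrative name, not from the lecture.

```python
import numpy as np

def lp_distance(x, y, p=2):
    """Minkowski (Lp) distance between two vectors; p = 2 gives the
    Euclidean distance d(i, j) defined on the slide."""
    diff = np.abs(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))
    return float(np.sum(diff ** p) ** (1.0 / p))

print(lp_distance([0, 0], [3, 4]))        # 5.0 (Euclidean, p = 2)
print(lp_distance([0, 0], [3, 4], p=1))   # 7.0 (Manhattan, p = 1)
```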
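
And the four k-means steps as a minimal NumPy sketch; the function signature, the convergence guard, and the empty-cluster check are illustrative choices, not part of the original slides.

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Minimal k-means following the four steps on the slide.

    X: (n, d) array of points. Returns (centroids, labels).
    An illustrative sketch, not a production implementation.
    """
    rng = np.random.default_rng(seed)
    # Step 1: select k initial centroids at random from the data.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each object to the cluster with the nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # Step 4: no assignment changed, so we have converged.
        labels = new_labels
        # Step 3: recompute each centroid as the mean of its members.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # guard against an emptied cluster
                centroids[j] = members.mean(axis=0)
    return centroids, labels

X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
centroids, labels = k_means(X, k=2)
print(labels)  # e.g. [0 0 1 1]; ids depend on the random initialization
```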

Hierarchical Clustering
- Uses a distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but it does need a termination condition.
- [Figure: the objects a, b, c, d, e are merged step by step (steps 0 to 4) in the agglomerative direction (AGNES) and split in the reverse order in the divisive direction (DIANA).]

AGNES (Agglomerative Nesting)
- Produces a tree of clusters (nodes).
- Initially, each object is a cluster (a leaf).
- Recursively merges the nodes that have the least dissimilarity.
- Merge criteria: minimum distance, maximum distance, average distance, center distance.
- Eventually all nodes belong to the same cluster (the root). A sketch using SciPy follows after this slide group.
- [Figure: three scatter plots showing clusters being merged in successive AGNES steps.]

A Dendrogram Shows How the Clusters Are Merged Hierarchically
- Decompose the data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram.
- A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.

DIANA (Divisive Analysis)
- The inverse order of AGNES:
  - start with a root cluster containing all objects;
  - recursively divide it into subclusters;
  - eventually each cluster contains a single object.
- [Figure: three scatter plots showing one cluster being split into successively smaller clusters.]

Other Hierarchical Clustering Methods
- Major weaknesses of agglomerative clustering methods:
  - They do not scale well: time complexity of at least O(n^2), where n is the total number of objects.
  - They can never undo what was done previously.
- Integration of hierarchical with distance-based clustering:
  - BIRCH: uses a CF-tree and incrementally adjusts the quality of sub-clusters.
  - CURE: selects well-scattered points from the cluster and then shrinks them towards the center of the cluster by a specified fraction.

BIRCH
- BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies (Zhang, Ramakrishnan & Livny, 1996).
- Incrementally constructs a CF (Clustering Feature) tree.
- Parameters: maximum diameter, maximum number of children.
- Phase 1: scan the database to build an initial in-memory CF tree (each node stores the number of points, their sum, and their sum of squares).
- Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF-tree.
- Scales linearly: finds a good clustering with a single scan and improves the quality with a few additional scans. A hedged usage example follows after this slide group.
- Weaknesses: handles only numeric data; sensitive to the order of the data records.
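
As a concrete AGNES illustration, here is a minimal sketch assuming SciPy: linkage builds the agglomerative merge tree and fcluster cuts the dendrogram at a desired level. The five toy points are invented stand-ins for the objects a..e on the slide.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Five 2-D points standing in for the objects a..e.
X = np.array([[1.0, 1.0], [1.5, 1.0], [5.0, 5.0], [5.5, 5.5], [9.0, 1.0]])

# Agglomerative merge tree; method='single' is the min-distance criterion
# ('complete' = max distance, 'average' = avg distance, 'centroid' = center).
Z = linkage(X, method='single')

# Cut the dendrogram to obtain a flat clustering with 3 clusters.
labels = fcluster(Z, t=3, criterion='maxclust')
print(labels)
```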
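
The slides describe BIRCH abstractly; as a hedged usage sketch, scikit-learn's Birch estimator exposes parameters that correspond roughly to those above: threshold bounds the sub-cluster radius (playing the role of the max-diameter parameter) and branching_factor caps the number of children per node. The data here are invented.

```python
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)
# Two Gaussian blobs as stand-in data.
X = np.vstack([rng.normal(0, 0.5, (100, 2)), rng.normal(5, 0.5, (100, 2))])

# n_clusters=2 runs a global clustering over the CF-tree leaves (Phase 2).
model = Birch(threshold=0.5, branching_factor=50, n_clusters=2).fit(X)
print(model.predict([[0, 0], [5, 5]]))
```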

Clustering Feature Vector
- Clustering Feature: CF = (N, LS, SS), where
  - N is the number of data points,
  - LS = \sum_{i=1}^{N} X_i is the linear sum of the points,
  - SS = \sum_{i=1}^{N} X_i^2 is the square sum of the points.
- Example: the five points (3,4), (2,6), (4,5), (4,7), (3,8) give CF = (5, (16,30), (54,190)). A sketch that computes and merges CFs follows after this slide group.

CF Tree
- [Figure: a CF-tree with branching factor B = 7 and leaf capacity L = 6. The root holds entries CF1 .. CF6, each pointing to a child node; an inner (non-leaf) node holds entries CF1 .. CF5 with their children; leaf nodes hold up to six CF entries and are chained together by prev/next pointers.]

Drawbacks of Distance-Based Methods
- Drawbacks of the square-error-based clustering methods:
  - They consider only one point as the representative of a cluster.
  - They are good only for convex clusters of similar size and density, and only if k can be reasonably estimated.

CURE (Clustering Using REpresentatives)
- CURE handles non-spherical clusters and is robust with respect to outliers.
- Uses multiple representative points to evaluate the distance between clusters.
- Stops the creation of the cluster hierarchy when a level consists of k clusters.

CURE: The Algorithm
- Draw a random sample of size s.
- Partition the sample into p partitions, each of size s/p.
- Partially cluster each partition into s/pq clusters.
- Cluster the partial clusters, shrinking the representatives towards the centroid (a sketch of this shrinking step follows after this slide group).
- Label the data on disk.

Data Partitioning and Clustering
- Example parameter settings: s = 50, p = 2, s/p = 25, s/pq = 5.
- [Figure: a 2-D sample split into two partitions, each partially clustered; the representative points of the partial clusters are marked.]
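
A minimal sketch, assuming NumPy and 2-D points, that computes a CF vector and uses the additivity of CFs (merging two sub-clusters just adds their components, which is what lets BIRCH maintain the tree incrementally); the helper names are illustrative.

```python
import numpy as np

def clustering_feature(points):
    """CF = (N, LS, SS) for a set of points, as defined on the slide."""
    pts = np.asarray(points, dtype=float)
    return len(pts), pts.sum(axis=0), (pts ** 2).sum(axis=0)

def merge_cf(cf_a, cf_b):
    """CFs are additive: merging two sub-clusters adds their components."""
    return cf_a[0] + cf_b[0], cf_a[1] + cf_b[1], cf_a[2] + cf_b[2]

# Reproduces the slide's example: CF = (5, (16, 30), (54, 190)).
points = [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]
print(clustering_feature(points))

# The centroid is recoverable from the CF alone: LS / N.
n, ls, ss = clustering_feature(points)
print(ls / n)  # [3.2 6.0]
```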
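
The shrinking step of CURE's algorithm, as a minimal sketch: the shrink fraction alpha stands in for the "specified fraction" on the slide, and for self-containment the centroid is approximated by the mean of the representatives; both are illustrative assumptions.

```python
import numpy as np

def shrink_representatives(reps, alpha=0.3):
    """Move each representative a fraction alpha towards the centroid.

    reps: (m, d) array of well-scattered representative points.
    Shrinking dampens the influence of outliers on inter-cluster distances.
    """
    reps = np.asarray(reps, dtype=float)
    centroid = reps.mean(axis=0)  # approximation: mean of the reps
    return reps + alpha * (centroid - reps)

reps = np.array([[0.0, 0.0], [4.0, 0.0], [2.0, 3.0]])
print(shrink_representatives(reps, alpha=0.5))
# Each point moves halfway towards the centroid (2, 1).
```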
