
Data Mining Techniques: Partitioning Methods: K-Means Cluster Analysis - PDF document



Data Mining Techniques: Partitioning Methods: K-Means
Mirek Riedewald
Many slides based on presentations by Han/Kamber, Tan/Steinbach/Kumar, and Andrew Moore

Cluster Analysis Overview
• Introduction
• Foundations: Measuring Distance (Similarity)
• Partitioning Methods: K-Means
• Hierarchical Methods
• Density-Based Methods
• Clustering High-Dimensional Data
• Cluster Evaluation

What is Cluster Analysis?
• Cluster: a collection of data objects
  – Similar to one another within the same cluster
  – Dissimilar to the objects in other clusters
• Unsupervised learning: usually no training set with known "classes"
• Typical applications
  – As a stand-alone tool to get insight into data properties
  – As a preprocessing step for other algorithms
[Figure: example clustering in which intra-cluster distances are minimized and inter-cluster distances are maximized]

Rich Applications, Multidisciplinary Efforts
• Pattern Recognition
• Spatial Data Analysis
• Image Processing
• Data Reduction
• Economic Science
  – Market research
• WWW
  – Document classification
  – Weblogs: discover groups of similar access patterns
[Figure: clustering precipitation in Australia]

Examples of Clustering Applications
• Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
• Land use: identification of areas of similar land use in an earth observation database
• Insurance: identifying groups of motor insurance policy holders with a high average claim cost
• City planning: identifying groups of houses according to their house type, value, and geographical location
• Earthquake studies: observed earthquake epicenters should be clustered along continent faults

Quality: What Is Good Clustering?
• Cluster membership ≈ objects in the same class
• High intra-class similarity, low inter-class similarity
  – Choice of similarity measure is important
• Ability to discover some or all of the hidden patterns
  – Difficult to measure without ground truth

Notion of a Cluster Can Be Ambiguous
[Figure: how many clusters? The same set of points can plausibly be grouped into two, four, or six clusters]

Cluster Analysis Overview
• Introduction
• Foundations: Measuring Distance (Similarity)
• Partitioning Methods: K-Means
• Hierarchical Methods
• Density-Based Methods
• Clustering High-Dimensional Data
• Cluster Evaluation

Distinctions Between Sets of Clusters
• Exclusive versus non-exclusive
  – Non-exclusive clustering: points may belong to multiple clusters
• Fuzzy versus non-fuzzy
  – Fuzzy clustering: a point belongs to every cluster with some weight between 0 and 1
    • Weights must sum to 1
• Partial versus complete
  – Cluster some or all of the data
• Heterogeneous versus homogeneous
  – Clusters of widely different sizes, shapes, densities

Distance
• Clustering is inherently connected to the question of (dis-)similarity of objects
• How can we define similarity between objects?

Similarity Between Objects
• Usually measured by some notion of distance
• Popular choice: Minkowski distance (see the code sketch below):
  $\mathrm{dist}(x(i), x(j)) = \left( |x_1(i) - x_1(j)|^q + |x_2(i) - x_2(j)|^q + \cdots + |x_d(i) - x_d(j)|^q \right)^{1/q}$
  – q is a positive integer
• q = 1: Manhattan distance
  $\mathrm{dist}(x(i), x(j)) = |x_1(i) - x_1(j)| + |x_2(i) - x_2(j)| + \cdots + |x_d(i) - x_d(j)|$
• q = 2: Euclidean distance
  $\mathrm{dist}(x(i), x(j)) = \sqrt{|x_1(i) - x_1(j)|^2 + |x_2(i) - x_2(j)|^2 + \cdots + |x_d(i) - x_d(j)|^2}$
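To make these formulas concrete, here is a minimal Python sketch (the function name minkowski_dist is mine, not from the slides): q = 1 recovers the Manhattan distance and q = 2 the Euclidean distance.

```python
def minkowski_dist(x, y, q=2):
    """Minkowski distance between two d-dimensional points.

    q = 1 gives the Manhattan distance, q = 2 the Euclidean distance.
    """
    if len(x) != len(y):
        raise ValueError("points must have the same dimensionality")
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

# (0, 0) vs. (3, 4): Manhattan = 3 + 4 = 7, Euclidean = sqrt(9 + 16) = 5
print(minkowski_dist((0, 0), (3, 4), q=1))  # 7.0
print(minkowski_dist((0, 0), (3, 4), q=2))  # 5.0
```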

Metrics
• Properties of a metric d:
  – $d(i,j) \ge 0$
  – $d(i,j) = 0$ if and only if $i = j$
  – $d(i,j) = d(j,i)$
  – $d(i,j) \le d(i,k) + d(k,j)$
• Examples: Euclidean distance, Manhattan distance
• Many other non-metric similarity measures exist

Challenges
• How to compute a distance for categorical attributes?
• An attribute with a large domain often dominates the overall distance
  – Weight and scale the attributes, as for k-NN
• After selecting the distance function, is it now clear how to compute similarity between objects?
• Curse of dimensionality

Curse of Dimensionality
• Best solution: remove any attribute that is known to be very noisy or not interesting
• Try different subsets of the attributes and determine where good clusters are found

Nominal Attributes
• Method 1: work with the original values
  – Difference = 0 if same value, difference = 1 otherwise
• Method 2: transform to binary attributes (see the first sketch below)
  – New binary attribute for each domain value
  – Encode a specific domain value by setting the corresponding binary attribute to 1 and all others to 0

Ordinal Attributes
• Method 1: treat as nominal
  – Problem: loses the ordering information
• Method 2: map to [0,1] (see the second sketch below)
  – Problem: to which values should the original values be mapped?
  – Default: equi-distant mapping to [0,1]

Scaling and Transforming Attributes
• Sometimes it might be necessary to transform numerical attributes to [0,1] or to apply another normalizing transformation, maybe even a non-linear one (e.g., logarithm)
• Might need to weight attributes differently
• Often requires expert knowledge or trial-and-error
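The two encodings for nominal attributes can be illustrated with a short sketch (the helper names nominal_diff and one_hot are my own, not from the slides):

```python
def nominal_diff(a, b):
    """Method 1: difference is 0 for identical values, 1 otherwise."""
    return 0 if a == b else 1

def one_hot(value, domain):
    """Method 2: one binary attribute per domain value; the attribute
    for `value` is set to 1 and all others to 0."""
    if value not in domain:
        raise ValueError(f"unknown value: {value}")
    return [1 if v == value else 0 for v in domain]

colors = ["red", "green", "blue"]
print(nominal_diff("red", "blue"))  # 1
print(one_hot("green", colors))     # [0, 1, 0]
```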

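Likewise, a small sketch of the default equi-distant mapping for ordinal attributes and of a simple [0,1] normalization for numerical attributes (helper names are again mine; in practice one might use a preprocessing library instead):

```python
def ordinal_to_unit(value, ordered_domain):
    """Equi-distant mapping of an ordinal value to [0, 1]:
    the i-th of M ordered values maps to i / (M - 1)."""
    i = ordered_domain.index(value)  # raises ValueError for unknown values
    return i / (len(ordered_domain) - 1)

def min_max_scale(x, lo, hi):
    """Normalize a numerical value from the range [lo, hi] to [0, 1]."""
    return (x - lo) / (hi - lo)

sizes = ["small", "medium", "large"]
print(ordinal_to_unit("medium", sizes))  # 0.5
print(min_max_scale(25.0, 0.0, 100.0))   # 0.25
```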
Other Similarity Measures
• Special distance or similarity measures exist for many applications
  – Might be a non-metric function
• Information retrieval
  – Document similarity based on keywords
• Bioinformatics
  – Gene features in micro-arrays

Calculating Cluster Distances (see the code sketch below)
• Single link = smallest distance between an element in one cluster and an element in the other: $\mathrm{dist}(K_i, K_j) = \min_{p,q} \mathrm{dist}(x_{ip}, x_{jq})$
• Complete link = largest distance between an element in one cluster and an element in the other: $\mathrm{dist}(K_i, K_j) = \max_{p,q} \mathrm{dist}(x_{ip}, x_{jq})$
• Average distance between an element in one cluster and an element in the other: $\mathrm{dist}(K_i, K_j) = \mathrm{avg}_{p,q}\, \mathrm{dist}(x_{ip}, x_{jq})$
• Distance between cluster centroids: $\mathrm{dist}(K_i, K_j) = \mathrm{dist}(m_i, m_j)$
• Distance between cluster medoids: $\mathrm{dist}(K_i, K_j) = \mathrm{dist}(x_{m_i}, x_{m_j})$
  – Medoid: one chosen, centrally located object in the cluster

Cluster Analysis Overview
• Introduction
• Foundations: Measuring Distance (Similarity)
• Partitioning Methods: K-Means
• Hierarchical Methods
• Density-Based Methods
• Clustering High-Dimensional Data
• Cluster Evaluation

Cluster Centroid, Radius, and Diameter
• Centroid: the "middle" of a cluster C: $m = \frac{1}{|C|} \sum_{x \in C} x$
• Radius: square root of the average squared distance from any point of the cluster to its centroid: $R = \sqrt{\frac{\sum_{x \in C} (x - m)^2}{|C|}}$
• Diameter: square root of the average squared distance between all pairs of distinct points in the cluster: $D = \sqrt{\frac{\sum_{x \in C} \sum_{y \in C,\, y \neq x} (x - y)^2}{|C|\,(|C| - 1)}}$

Partitioning Algorithms: Basic Concept
• Construct a partition of a database D of n objects into a set of K clusters such that the sum of squared distances to the cluster "representatives" $m_i$ is minimized: $\sum_{i=1}^{K} \sum_{x \in C_i} (x - m_i)^2$
• Given a K, find the partition into K clusters that optimizes the chosen partitioning criterion
  – Globally optimal: enumerate all partitions
  – Heuristic methods
    • K-means ('67): each cluster represented by its centroid
    • K-medoids ('87): each cluster represented by one of the objects in the cluster

K-means Clustering
• Each cluster is associated with a centroid
• Each object is assigned to the cluster with the closest centroid
1. Given K, select K random objects as initial centroids
2. Repeat until the centroids do not change:
   1. Form K clusters by assigning every object to its nearest centroid
   2. Recompute the centroid of each cluster
(A minimal code sketch of this loop follows below.)
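The inter-cluster distances and the centroid, radius, and diameter definitions translate directly into code. The following sketch assumes a Euclidean base distance and tuple-valued points; all function names are mine:

```python
import math
from itertools import product

def euclid(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def single_link(Ki, Kj):
    """Smallest distance between an element of Ki and an element of Kj."""
    return min(euclid(x, y) for x, y in product(Ki, Kj))

def complete_link(Ki, Kj):
    """Largest distance between an element of Ki and an element of Kj."""
    return max(euclid(x, y) for x, y in product(Ki, Kj))

def average_link(Ki, Kj):
    """Average distance over all cross-cluster pairs."""
    return sum(euclid(x, y) for x, y in product(Ki, Kj)) / (len(Ki) * len(Kj))

def centroid(C):
    """The 'middle' of cluster C: the component-wise mean of its points."""
    d = len(C[0])
    return tuple(sum(x[k] for x in C) / len(C) for k in range(d))

def radius(C):
    """Square root of the average squared point-to-centroid distance."""
    m = centroid(C)
    return math.sqrt(sum(euclid(x, m) ** 2 for x in C) / len(C))

def diameter(C):
    """Square root of the average squared distance over all ordered pairs
    of distinct points in the cluster."""
    n = len(C)
    return math.sqrt(
        sum(euclid(x, y) ** 2 for x in C for y in C if y != x) / (n * (n - 1))
    )
```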

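Finally, a minimal sketch of the K-means loop described on the last slide, in pure Python with my own function names (a real application would typically use a library implementation): it alternates nearest-centroid assignment with centroid recomputation until the centroids stop changing.

```python
import random

def squared_dist(x, y):
    """Squared Euclidean distance: the quantity K-means minimizes."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def mean_point(cluster):
    """Centroid of a non-empty cluster of tuple-valued points."""
    d = len(cluster[0])
    return tuple(sum(p[k] for p in cluster) / len(cluster) for k in range(d))

def kmeans(points, k, max_iter=100, seed=0):
    """Given K, select K random objects as initial centroids, then repeat
    (1) form K clusters by nearest-centroid assignment and
    (2) recompute each centroid, until the centroids do not change."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Step 1: assign every object to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Step 2: recompute centroids (keep the old one for an empty cluster).
        new_centroids = [mean_point(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:  # converged: centroids did not change
            break
        centroids = new_centroids
    return centroids, clusters

# Tiny usage example with two well-separated groups:
pts = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (7.8, 8.2), (8.1, 7.9)]
centroids, clusters = kmeans(pts, k=2)
print(centroids)  # two centroids, one near (1, 1) and one near (8, 8)
```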