cs6220 data mining techniques

CS6220: DATA MINING TECHNIQUES Matrix Data: Clustering: Part 1 - PowerPoint PPT Presentation

CS6220: DATA MINING TECHNIQUES Matrix Data: Clustering: Part 1 Instructor: Yizhou Sun yzsun@ccs.neu.edu October 19, 2015 Announcements Homework 1 grades out Re-grading policy: If you have doubts in your grading, please submit a


  1. CS6220: DATA MINING TECHNIQUES Matrix Data: Clustering: Part 1 Instructor: Yizhou Sun yzsun@ccs.neu.edu October 19, 2015

  2. Announcements
     • Homework 1 grades are out
     • Re-grading policy:
       • If you have doubts about your grading, please submit a regrading form (by email to both TAs, CC the instructor) clearly indicating why you think it should be regraded
       • The regrading form must be submitted within one week after you receive your score
       • We will regrade the whole homework/exam
     • Homework 3 out tomorrow

  3. Methods to Learn
     • Data types (table columns): Matrix Data; Text Data; Set Data; Sequence Data; Time Series; Graph & Network; Images
     • Methods by task (table rows, listed left to right):
       • Classification: Decision Tree; Naïve Bayes; Logistic Regression; SVM; kNN; HMM; Label Propagation*; Neural Network
       • Clustering: K-means; hierarchical clustering; DBSCAN; Mixture Models; kernel k-means*; PLSA; SCAN*; Spectral Clustering*
       • Frequent Pattern Mining: Apriori; FP-growth; GSP; PrefixSpan
       • Prediction: Linear Regression; Autoregression
       • Similarity Search: DTW; P-PageRank
       • Ranking: PageRank

  4. Matrix Data: Clustering: Part 1
     • Cluster Analysis: Basic Concepts
     • Partitioning Methods
     • Hierarchical Methods
     • Density-Based Methods
     • Evaluation of Clustering
     • Summary

  5. What is Cluster Analysis?
     • Cluster: A collection of data objects
       • similar (or related) to one another within the same group
       • dissimilar (or unrelated) to the objects in other groups
     • Cluster analysis (or clustering, data segmentation, …)
       • Finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters
     • Unsupervised learning: no predefined classes (i.e., learning by observations vs. learning by examples: supervised)
     • Typical applications
       • As a stand-alone tool to get insight into data distribution
       • As a preprocessing step for other algorithms

  6. Applications of Cluster Analysis
     • Data reduction
       • Summarization: Preprocessing for regression, PCA, classification, and association analysis
       • Compression: Image processing: vector quantization
     • Prediction based on groups
       • Cluster & find characteristics/patterns for each group
     • Finding K-nearest neighbors
       • Localizing search to one or a small number of clusters
     • Outlier detection: Outliers are often viewed as those “far away” from any cluster

  7. Clustering: Application Examples
     • Biology: taxonomy of living things: kingdom, phylum, class, order, family, genus and species
     • Information retrieval: document clustering
     • Land use: Identification of areas of similar land use in an earth observation database
     • Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
     • City planning: Identifying groups of houses according to their house type, value, and geographical location
     • Earthquake studies: Observed earthquake epicenters should be clustered along continent faults
     • Climate: understanding earth climate, finding atmospheric and oceanic patterns

  8. Basic Steps to Develop a Clustering Task
     • Feature selection
       • Select info concerning the task of interest
       • Minimal information redundancy
     • Proximity measure
       • Similarity of two feature vectors
     • Clustering criterion
       • Expressed via a cost function or some rules
     • Clustering algorithms
       • Choice of algorithms
     • Validation of the results
       • Validation test (also, clustering tendency test)
     • Interpretation of the results
       • Integration with applications

  9. Requirements and Challenges
     • Scalability
       • Clustering all the data instead of only on samples
     • Ability to deal with different types of attributes
       • Numerical, binary, categorical, ordinal, linked, and mixture of these
     • Constraint-based clustering
       • User may give inputs on constraints
       • Use domain knowledge to determine input parameters
     • Interpretability and usability
     • Others
       • Discovery of clusters with arbitrary shape
       • Ability to deal with noisy data
       • Incremental clustering and insensitivity to input order
       • High dimensionality

  10. Matrix Data: Clustering: Part 1
      • Cluster Analysis: Basic Concepts
      • Partitioning Methods
      • Hierarchical Methods
      • Density-Based Methods
      • Evaluation of Clustering
      • Summary

  11. Partitioning Algorithms: Basic Concept
      • Partitioning method: Partitioning a dataset D of n objects into a set of k clusters, such that the sum of squared distances is minimized (where $c_i$ is the centroid or medoid of cluster $C_i$)
        $E = \sum_{i=1}^{k} \sum_{p \in C_i} \left( d(p, c_i) \right)^2$
      • Given k, find a partition of k clusters that optimizes the chosen partitioning criterion
        • Global optimal: exhaustively enumerate all partitions
        • Heuristic methods: k-means and k-medoids algorithms
          • k-means (MacQueen’67, Lloyd’57/’82): Each cluster is represented by the center of the cluster
          • k-medoids or PAM (Partition around medoids) (Kaufman & Rousseeuw’87): Each cluster is represented by one of the objects in the cluster
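
A minimal sketch of the sum-of-squared-distances criterion E from the slide above, assuming Euclidean distance for d and points stored as coordinate tuples. The function name `sum_squared_error` and the toy data are illustrative and not part of the original slides.

```python
from math import dist


def sum_squared_error(clusters, representatives):
    """E: sum over clusters i and objects p in C_i of d(p, c_i)^2.

    clusters        -- list of clusters, each a list of coordinate tuples
    representatives -- one centroid or medoid per cluster, in the same order
    """
    return sum(
        dist(p, c) ** 2
        for points, c in zip(clusters, representatives)
        for p in points
    )


# Hypothetical toy data: two clusters, each point at distance 1 from its center.
clusters = [[(1.0, 1.0), (3.0, 1.0)], [(8.0, 8.0), (8.0, 10.0)]]
centroids = [(2.0, 1.0), (8.0, 9.0)]
print(sum_squared_error(clusters, centroids))  # 4.0
```

The same function works for k-medoids if the representatives are actual objects from the clusters.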

  12. The K-Means Clustering Method
      • Given k, the k-means algorithm is implemented in four steps:
        • Step 0: Partition objects into k nonempty subsets
        • Step 1: Compute seed points as the centroids of the clusters of the current partitioning (the centroid is the center, i.e., mean point, of the cluster)
        • Step 2: Assign each object to the cluster with the nearest seed point
        • Step 3: Go back to Step 1, stop when the assignment does not change
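
The four steps above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: Euclidean distance, a shuffled round-robin initial partition for Step 0, and re-seeding an emptied cluster with a random object (the slides do not say how to handle that case). The names `k_means`, `centroid`, `seed`, and `max_iter` are hypothetical.

```python
import random
from math import dist


def centroid(points):
    """Mean point of a non-empty list of equal-length coordinate tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, seed=0, max_iter=100):
    """Return (labels, centers) following Steps 0-3 of the slide."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    rng = random.Random(seed)

    # Step 0: partition the objects into k nonempty subsets
    # (assumption: shuffle, then deal the objects out round-robin).
    order = list(range(len(points)))
    rng.shuffle(order)
    labels = [0] * len(points)
    for rank, idx in enumerate(order):
        labels[idx] = rank % k

    centers = []
    for _ in range(max_iter):
        # Step 1: compute the centroid of each current cluster.
        centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # Assumption: an emptied cluster is re-seeded with a random object.
            centers.append(centroid(members) if members else rng.choice(points))

        # Step 2: assign each object to the cluster with the nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: dist(p, centers[j])) for p in points
        ]

        # Step 3: stop when the assignment no longer changes.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centers


# Hypothetical toy data: two loose groups in the plane.
data = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centers = k_means(data, k=2)
print(labels, centers)
```

Different seeds can lead to different final partitions, since the procedure only reaches a local optimum of the objective.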

  13. An Example of K-Means Clustering
      [Figure, K=2: the initial data set; arbitrarily partition objects into k groups; update the cluster centroids; reassign objects; update the cluster centroids; loop if needed]
      • Partition objects into k nonempty subsets
      • Repeat
        • Compute centroid (i.e., mean point) for each partition
        • Assign each object to the cluster of its nearest centroid
      • Until no change

  14. Theory Behind K-Means
      • Objective function
        • $J = \sum_{j=1}^{k} \sum_{C(i)=j} \|x_i - c_j\|^2$
        • Total within-cluster variance
      • Re-arrange the objective function
        • $J = \sum_{j=1}^{k} \sum_{i} w_{ij} \|x_i - c_j\|^2$
        • $w_{ij} \in \{0, 1\}$
        • $w_{ij} = 1$ if $x_i$ belongs to cluster $j$; $w_{ij} = 0$ otherwise
      • Looking for:
        • The best assignment $w_{ij}$
        • The best center $c_j$

  15. Solution of K-Means
      $J = \sum_{j=1}^{k} \sum_{i} w_{ij} \|x_i - c_j\|^2$
      • Iterations
        • Step 1: Fix centers $c_j$, find assignment $w_{ij}$ that minimizes $J$
          • => $w_{ij} = 1$ if $\|x_i - c_j\|^2$ is the smallest
        • Step 2: Fix assignment $w_{ij}$, find centers that minimize $J$
          • => first derivative of $J$ = 0
          • => $\frac{\partial J}{\partial c_j} = -2 \sum_{i} w_{ij} (x_i - c_j) = 0$
          • => $c_j = \frac{\sum_{i} w_{ij} x_i}{\sum_{i} w_{ij}}$
          • Note $\sum_{i} w_{ij}$ is the total number of objects in cluster $j$
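
As a quick numerical sanity check of Step 2 (with the assignment fixed, the minimizing center is the mean of the assigned objects), the sketch below compares the within-cluster cost at the mean against slightly shifted centers. The data and the helper name `within_cluster_cost` are hypothetical, and squared Euclidean distance is assumed.

```python
def within_cluster_cost(points, center):
    """Sum of squared Euclidean distances from each point to the center."""
    return sum(
        sum((x - c) ** 2 for x, c in zip(p, center))
        for p in points
    )


# Hypothetical cluster of three 2-D objects.
cluster = [(1.0, 2.0), (3.0, 4.0), (5.0, 0.0)]
mean = tuple(sum(coords) / len(cluster) for coords in zip(*cluster))
best = within_cluster_cost(cluster, mean)

# Moving the center away from the mean in any direction increases the cost.
for dx, dy in [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.1), (0.0, -0.1)]:
    shifted = (mean[0] + dx, mean[1] + dy)
    assert within_cluster_cost(cluster, shifted) > best

print(mean, best)
```

This checks only a few nearby points, so it illustrates the derivative argument rather than proving it.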

  16. Comments on the K-Means Method
      • Strength: Efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t << n.
      • Comment: Often terminates at a local optimum
      • Weakness
        • Applicable only to objects in a continuous n-dimensional space
          • Using the k-modes method for categorical data
          • In comparison, k-medoids can be applied to a wide range of data
        • Need to specify k, the number of clusters, in advance (there are ways to automatically determine the best k; see Hastie et al., 2009)
        • Sensitive to noisy data and outliers
        • Not suitable to discover clusters with non-convex shapes

  17. Variations of the K-Means Method
      • Most variants of k-means differ in
        • Selection of the initial k means
        • Dissimilarity calculations
        • Strategies to calculate cluster means
      • Handling categorical data: k-modes
        • Replacing means of clusters with modes
        • Using new dissimilarity measures to deal with categorical objects
        • Using a frequency-based method to update modes of clusters
      • A mixture of categorical and numerical data: k-prototype method
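
To make the k-modes ingredients concrete, here is a small sketch of one possible dissimilarity measure and the frequency-based mode update. It assumes simple matching dissimilarity (the count of mismatched attributes), which is a common choice but is not specified on the slide. The function names and the toy records are hypothetical.

```python
from collections import Counter


def matching_dissimilarity(a, b):
    """Number of attributes on which two categorical objects differ."""
    return sum(x != y for x, y in zip(a, b))


def cluster_mode(objects):
    """Most frequent category per attribute (ties go to the first seen)."""
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*objects))


# Hypothetical categorical records: (color, size, shape).
records = [
    ("red", "small", "round"),
    ("red", "large", "round"),
    ("blue", "small", "square"),
]
mode = cluster_mode(records)
print(mode)                                       # ('red', 'small', 'round')
print(matching_dissimilarity(records[2], mode))   # 2
```

Plugging these two functions into the k-means loop in place of the Euclidean distance and the mean gives the basic shape of k-modes.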

  18. What Is the Problem of the K-Means Method?
      • The k-means algorithm is sensitive to outliers!
        • Since an object with an extremely large value may substantially distort the distribution of the data
      • K-Medoids: Instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in a cluster
      [Figure: two scatter plots, both axes from 0 to 10]
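
The effect of an outlier on the mean versus the medoid can be seen on a tiny example. The sketch below assumes Euclidean distance and defines the medoid as the object with the smallest total distance to all other objects in the cluster. The data and function names are hypothetical.

```python
from math import dist


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def medoid(points):
    """Object with the smallest total distance to all objects in the cluster."""
    return min(points, key=lambda c: sum(dist(c, p) for p in points))


# Four nearby objects plus one extreme outlier.
cluster = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0), (100.0, 100.0)]
print(mean_point(cluster))  # pulled far from the four nearby objects
print(medoid(cluster))      # stays on one of the four nearby objects
```

The mean lands in empty space between the group and the outlier, while the medoid is always an actual object of the cluster.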

  19. PAM: A Typical K-Medoids Algorithm
      [Figure, K=2, scatter plots with both axes from 0 to 10]
      • Arbitrarily choose k objects as initial medoids
      • Assign each remaining object to the nearest medoid (Total Cost = 20)
      • Do loop, until no change
        • Randomly select a non-medoid object, O_random
        • Compute the total cost of swapping (Total Cost = 26)
        • Swap O and O_random if quality is improved
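
A compact sketch of the swap loop described on the PAM slide. One deliberate difference, stated as an assumption: instead of picking a random non-medoid object each round, this version tries every (medoid, non-medoid) swap in a fixed order so that the result is reproducible. It also takes the first k objects as the arbitrary initial medoids, uses Euclidean distance, and assumes the objects are distinct. The names `pam` and `total_cost` are hypothetical.

```python
from math import dist


def total_cost(points, medoids):
    """Sum over all objects of the distance to their nearest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points)


def pam(points, k):
    """Return (medoids, cost) after swapping until no swap lowers the cost."""
    medoids = list(points[:k])  # arbitrary initial choice: the first k objects
    cost = total_cost(points, medoids)
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for candidate in points:
                if candidate in medoids:
                    continue
                # Tentatively swap medoid i with the non-medoid candidate.
                trial = medoids[:i] + [candidate] + medoids[i + 1:]
                trial_cost = total_cost(points, trial)
                # Keep the swap only if the clustering quality improves.
                if trial_cost < cost:
                    medoids, cost, improved = trial, trial_cost, True
    return medoids, cost


# Hypothetical toy data: two loose groups in the plane.
data = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
medoids, cost = pam(data, k=2)
print(medoids, cost)
```

Each accepted swap strictly lowers the total cost, so the loop terminates. Evaluating every swap costs far more per pass than k-means, which is why PAM is usually applied to small data sets.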
