CS6220: DATA MINING TECHNIQUES Matrix Data: Clustering: Part 1 - PowerPoint PPT Presentation

CS6220: DATA MINING TECHNIQUES Matrix Data: Clustering: Part 1 Instructor: Yizhou Sun yzsun@ccs.neu.edu October 19, 2015

Announcements • Homework 1 grades out • Re-grading policy: • If you have doubts in your grading, please submit a regrading form (via emails to both TAs and CC to the Instructor) indicating clearly the reason why you think it should be regraded • The deadline of the regrading form should be submitted within one week after you receive your score • We will regrade the whole homework/exam • Homework 3 out tomorrow 2

Methods to Learn Matrix Data Text Set Data Sequence Time Series Graph & Images Data Data Network Classification Decision Tree; HMM Label Neural Naïve Bayes; Propagation* Network Logistic Regression SVM; kNN Clustering K-means; PLSA SCAN*; hierarchical Spectral clustering; Clustering* DBSCAN ; Mixture Models; kernel k- means* Frequent Apriori; GSP; FP-growth PrefixSpan Pattern Mining Linear Regression Autoregression Prediction Similarity DTW P-PageRank Search Ranking PageRank 3

Matrix Data: Clustering: Part 1 • Cluster Analysis: Basic Concepts • Partitioning Methods • Hierarchical Methods • Density-Based Methods • Evaluation of Clustering • Summary 4

What is Cluster Analysis? • Cluster: A collection of data objects • similar (or related) to one another within the same group • dissimilar (or unrelated) to the objects in other groups • Cluster analysis (or clustering , data segmentation, … ) • Finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters • Unsupervised learning: no predefined classes (i.e., learning by observations vs. learning by examples: supervised) • Typical applications • As a stand-alone tool to get insight into data distribution • As a preprocessing step for other algorithms 5

Applications of Cluster Analysis • Data reduction • Summarization: Preprocessing for regression, PCA, classification, and association analysis • Compression: Image processing: vector quantization • Prediction based on groups • Cluster & find characteristics/patterns for each group • Finding K-nearest Neighbors • Localizing search to one or a small number of clusters • Outlier detection: Outliers are often viewed as those “far away” from any cluster 6

Clustering: Application Examples • Biology : taxonomy of living things: kingdom, phylum, class, order, family, genus and species • Information retrieval : document clustering • Land use : Identification of areas of similar land use in an earth observation database • Marketing : Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs • City-planning : Identifying groups of houses according to their house type, value, and geographical location • Earth-quake studies : Observed earth quake epicenters should be clustered along continent faults • Climate : understanding earth climate, find patterns of atmospheric and ocean 7

Basic Steps to Develop a Clustering Task • Feature selection • Select info concerning the task of interest • Minimal information redundancy • Proximity measure • Similarity of two feature vectors • Clustering criterion • Expressed via a cost function or some rules • Clustering algorithms • Choice of algorithms • Validation of the results • Validation test (also, clustering tendency test) • Interpretation of the results • Integration with applications 8

Requirements and Challenges • Scalability • Clustering all the data instead of only on samples • Ability to deal with different types of attributes • Numerical, binary, categorical, ordinal, linked, and mixture of these • Constraint-based clustering User may give inputs on constraints • Use domain knowledge to determine input parameters • • Interpretability and usability • Others • Discovery of clusters with arbitrary shape • Ability to deal with noisy data • Incremental clustering and insensitivity to input order • High dimensionality 9

Matrix Data: Clustering: Part 1 • Cluster Analysis: Basic Concepts • Partitioning Methods • Hierarchical Methods • Density-Based Methods • Evaluation of Clustering • Summary 10

Partitioning Algorithms: Basic Concept • Partitioning method: Partitioning a dataset D of n objects into a set of k clusters, such that the sum of squared distances is minimized (where c i is the centroid or medoid of cluster C i )     k 2 E ( d ( p , c ))  i 1 p C i i • Given k , find a partition of k clusters that optimizes the chosen partitioning criterion • Global optimal: exhaustively enumerate all partitions • Heuristic methods: k-means and k-medoids algorithms • k-means (MacQueen’67, Lloyd’57/’82): Each cluster is represented by the center of the cluster • k-medoids or PAM (Partition around medoids) (Kaufman & Rousseeuw’87): Each cluster is represented by one of the objects in the cluster 11

The K-Means Clustering Method • Given k , the k-means algorithm is implemented in four steps: • Step 0: Partition objects into k nonempty subsets • Step 1: Compute seed points as the centroids of the clusters of the current partitioning (the centroid is the center, i.e., mean point , of the cluster) • Step 2: Assign each object to the cluster with the nearest seed point • Step 3: Go back to Step 1, stop when the assignment does not change 12

An Example of K-Means Clustering K=2 Arbitrarily Update the partition cluster objects into centroids k groups The initial data set Loop if Reassign objects needed Partition objects into k nonempty  subsets Repeat  Update the Compute centroid (i.e., mean  cluster point) for each partition centroids Assign each object to the  cluster of its nearest centroid Until no change  13

Theory Behind K-Means • Objective function 𝑙 𝑘 || 2 • 𝐾 = 𝑘=1 𝐷 𝑗 =𝑘 ||𝑦 𝑗 − 𝑑 • Total within-cluster variance • Re-arrange the objective function 𝑙 𝑘 || 2 • 𝐾 = 𝑘=1 𝑗 𝑥 𝑗𝑘 ||𝑦 𝑗 − 𝑑 • 𝑥 𝑗𝑘 ∈ {0,1} • 𝑥 𝑗𝑘 = 1, 𝑗𝑔 𝑦 𝑗 𝑐𝑓𝑚𝑝𝑜𝑕𝑡 𝑢𝑝 𝑑𝑚𝑣𝑡𝑢𝑓𝑠 𝑘; 𝑥 𝑗𝑘 = 0, 𝑝𝑢ℎ𝑓𝑠𝑥𝑗𝑡𝑓 • Looking for: • The best assignment 𝑥 𝑗𝑘 • The best center 𝑑 𝑘 14

Solution of K-Means 𝑙 𝑘 || 2 𝐾 = 𝑥 𝑗𝑘 ||𝑦 𝑗 − 𝑑 • Iterations 𝑘=1 𝑗 • Step 1: Fix centers 𝑑 𝑘 , find assignment 𝑥 𝑗𝑘 that minimizes 𝐾 𝑘 || 2 is the smallest • => 𝑥 𝑗𝑘 = 1, 𝑗𝑔 ||𝑦 𝑗 − 𝑑 • Step 2: Fix assignment 𝑥 𝑗𝑘 , find centers that minimize 𝐾 • => first derivative of 𝐾 = 0 𝜖𝐾 • => 𝜖𝑑 𝑘 = −2 𝑗 𝑥 𝑗𝑘 (𝑦 𝑗 − 𝑑 𝑘 ) = 0 𝑗 𝑥 𝑗𝑘 𝑦 𝑗 • => 𝑑 𝑘 = 𝑗 𝑥 𝑗𝑘 • Note 𝑗 𝑥 𝑗𝑘 is the total number of objects in cluster j 15

Comments on the K-Means Method • Strength: Efficient : O ( tkn ), where n is # objects, k is # clusters, and t is # iterations. Normally, k , t << n . • Comment: Often terminates at a local optimal • Weakness • Applicable only to objects in a continuous n-dimensional space • Using the k-modes method for categorical data • In comparison, k-medoids can be applied to a wide range of data • Need to specify k, the number of clusters, in advance (there are ways to automatically determine the best k (see Hastie et al., 2009) • Sensitive to noisy data and outliers • Not suitable to discover clusters with non-convex shapes 16

Variations of the K-Means Method • Most of the variants of the k-means which differ in • Selection of the initial k means • Dissimilarity calculations • Strategies to calculate cluster means • Handling categorical data: k-modes • Replacing means of clusters with modes • Using new dissimilarity measures to deal with categorical objects • Using a frequency-based method to update modes of clusters • A mixture of categorical and numerical data: k-prototype method 17

What Is the Problem of the K-Means Method? • The k-means algorithm is sensitive to outliers ! • Since an object with an extremely large value may substantially distort the distribution of the data • K-Medoids: Instead of taking the mean value of the object in a cluster as a reference point, medoids can be used, which is the most centrally located object in a cluster 10 10 9 9 8 8 7 7 6 6 5 5 4 4 3 3 2 2 1 1 0 0 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 18

PAM: A Typical K-Medoids Algorithm Total Cost = 20 10 10 10 9 9 9 8 8 8 Arbitrary Assign 7 7 7 6 6 6 choose k each 5 5 5 object as remaining 4 4 4 initial object to 3 3 3 medoids nearest 2 2 2 medoids 1 1 1 0 0 0 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 K=2 Randomly select a nonmedoid object,O ramdom Total Cost = 26 10 10 Do loop Compute 9 9 Swapping O 8 8 total cost of Until no 7 7 and O ramdom swapping 6 6 change 5 5 If quality is 4 4 improved. 3 3 2 2 1 1 0 0 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 19

CS6220: DATA MINING TECHNIQUES Matrix Data: Clustering: Part 1 - PowerPoint PPT Presentation

CS6220: DATA MINING TECHNIQUES Matrix Data: Clustering: Part 1 Instructor: Yizhou Sun yzsun@ccs.neu.edu October 19, 2015 Announcements Homework 1 grades out Re-grading policy: If you have doubts in your grading, please submit a

Link Analysis Lecture 7 Link Analysis November 29, 2017 1 CS6220 Data Mining Techniques

Web Mining Web Mining Web Mining Web Mining Web mining is the use of data mining techniques

CS6220: DATA MINING TECHNIQUES Chapter 7: Advanced Pattern Mining Instructor: Yizhou Sun

CS6220: DATA MINING TECHNIQUES Mining Time Series Data Instructor: Yizhou Sun yzsun@ccs.neu.edu

CS6220: DATA MINING TECHNIQUES Mining Graph/Network Data Instructor: Yizhou Sun

CS6220: DATA MINING TECHNIQUES Mining Graph/Network Data Instructor: Yizhou Sun

CS6220: DATA MINING TECHNIQUES Set Data: Frequent Pattern Mining Instructor: Yizhou Sun

CS6220: DATA MINING TECHNIQUES Mining Graph/Network Data: Part I Instructor: Yizhou Sun

CS6220: DATA MINING TECHNIQUES Set Data: Frequent Pattern Mining Instructor: Yizhou Sun

CS6220: DATA MINING TECHNIQUES Mining Sequential and Time Series Data Instructor: Yizhou Sun

CS6220: DATA MINING TECHNIQUES Mining Time Series Data Instructor: Yizhou Sun yzsun@ccs.neu.edu

CS6220: DATA MINING TECHNIQUES Set Data: Frequent Pattern Mining Instructor: Yizhou Sun

CS6220: DATA MINING TECHNIQUES 2: Data Pre-Processing Instructor: Yizhou Sun yzsun@ccs.neu.edu

CS6220: DATA MINING TECHNIQUES 2: Data Pre-Processing Instructor: Yizhou Sun yzsun@ccs.neu.edu

Web Mining Web Mining Web mining is the use of data mining techniques to automatically

CS6220: DATA MINING TECHNIQUES Image Data: Classification via Neural Networks Instructor: Yizhou

2008 Interim Results 21 February 2008 2008 Interim Results Mike Ihlein Chief Executive Officer

Financial Update Mark Mulhern Chief Financial Officer Progress Energy Inc Progress Energy, Inc.

Simulating tokamak edge instabilities: advances and challenges Matthias Hoelzl, GTA Huijsmans, FJ

EPS Funding 4 A look at where we get and spend our money Planning for 2013-14 - Staff Update

Gilbert: Declarative Sparse Linear Algebra on Massively Parallel Dataflow Systems Till Rohrmann 1

MACRA Request for Information MIPS APMs PFPMs October 15, 2015 The MACRA: Background The

Program Year 2018 Webinar Presented by: Janet Sheridan Compliance Specialist Jessi Wilson

Delivering inclusive capitalism Sharing success with investors, customers and society LEGAL &

CS6220: DATA MINING TECHNIQUES Matrix Data: Clustering: Part 1 - PowerPoint PPT Presentation

CS6220: DATA MINING TECHNIQUES Matrix Data: Clustering: Part 1 Instructor: Yizhou Sun yzsun@ccs.neu.edu October 19, 2015 Announcements Homework 1 grades out Re-grading policy: If you have doubts in your grading, please submit a

Link Analysis Lecture 7 Link Analysis November 29, 2017 1 CS6220 Data Mining Techniques

Web Mining Web Mining Web Mining Web Mining Web mining is the use of data mining techniques

CS6220: DATA MINING TECHNIQUES Chapter 7: Advanced Pattern Mining Instructor: Yizhou Sun

CS6220: DATA MINING TECHNIQUES Mining Time Series Data Instructor: Yizhou Sun yzsun@ccs.neu.edu

CS6220: DATA MINING TECHNIQUES Mining Graph/Network Data Instructor: Yizhou Sun

CS6220: DATA MINING TECHNIQUES Mining Graph/Network Data Instructor: Yizhou Sun

CS6220: DATA MINING TECHNIQUES Set Data: Frequent Pattern Mining Instructor: Yizhou Sun

CS6220: DATA MINING TECHNIQUES Mining Graph/Network Data: Part I Instructor: Yizhou Sun

CS6220: DATA MINING TECHNIQUES Set Data: Frequent Pattern Mining Instructor: Yizhou Sun

CS6220: DATA MINING TECHNIQUES Mining Sequential and Time Series Data Instructor: Yizhou Sun

CS6220: DATA MINING TECHNIQUES Mining Time Series Data Instructor: Yizhou Sun yzsun@ccs.neu.edu

CS6220: DATA MINING TECHNIQUES Set Data: Frequent Pattern Mining Instructor: Yizhou Sun

CS6220: DATA MINING TECHNIQUES 2: Data Pre-Processing Instructor: Yizhou Sun yzsun@ccs.neu.edu

CS6220: DATA MINING TECHNIQUES 2: Data Pre-Processing Instructor: Yizhou Sun yzsun@ccs.neu.edu

Web Mining Web Mining Web mining is the use of data mining techniques to automatically

CS6220: DATA MINING TECHNIQUES Image Data: Classification via Neural Networks Instructor: Yizhou

2008 Interim Results 21 February 2008 2008 Interim Results Mike Ihlein Chief Executive Officer

Financial Update Mark Mulhern Chief Financial Officer Progress Energy Inc Progress Energy, Inc.

Simulating tokamak edge instabilities: advances and challenges Matthias Hoelzl, GTA Huijsmans, FJ

EPS Funding 4 A look at where we get and spend our money Planning for 2013-14 - Staff Update

Gilbert: Declarative Sparse Linear Algebra on Massively Parallel Dataflow Systems Till Rohrmann 1

MACRA Request for Information MIPS APMs PFPMs October 15, 2015 The MACRA: Background The

Program Year 2018 Webinar Presented by: Janet Sheridan Compliance Specialist Jessi Wilson

Delivering inclusive capitalism Sharing success with investors, customers and society LEGAL &amp;

Delivering inclusive capitalism Sharing success with investors, customers and society LEGAL &