CS145: INTRODUCTION TO DATA MINING 09: Vector Data: Clustering - PowerPoint PPT Presentation

CS145: INTRODUCTION TO DATA MINING 09: Vector Data: Clustering Basics Instructor: Yizhou Sun yzsun@cs.ucla.edu October 27, 2017

Methods to Learn Vector Data Set Data Sequence Data Text Data Logistic Regression; Naïve Bayes for Text Classification Decision Tree ; KNN SVM ; NN Clustering K-means; hierarchical PLSA clustering; DBSCAN; Mixture Models Linear Regression Prediction GLM* Apriori; FP growth GSP; PrefixSpan Frequent Pattern Mining Similarity Search DTW 2

Vector Data: Clustering Basics • Clustering Analysis: Basic Concepts • Partitioning methods • Hierarchical Methods • Density-Based Methods • Summary 3

What is Cluster Analysis? • Cluster: A collection of data objects • similar (or related) to one another within the same group • dissimilar (or unrelated) to the objects in other groups • Cluster analysis (or clustering , data segmentation, … ) • Finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters • Unsupervised learning: no predefined classes (i.e., learning by observations vs. learning by examples: supervised) • Typical applications • As a stand-alone tool to get insight into data distribution • As a preprocessing step for other algorithms 4

Applications of Cluster Analysis • Data reduction • Summarization: Preprocessing for regression, PCA, classification, and association analysis • Compression: Image processing: vector quantization • Prediction based on groups • Cluster & find characteristics/patterns for each group • Finding K-nearest Neighbors • Localizing search to one or a small number of clusters • Outlier detection: Outliers are often viewed as those “far away” from any cluster 5

Clustering: Application Examples • Biology : taxonomy of living things: kingdom, phylum, class, order, family, genus and species • Information retrieval : document clustering • Land use : Identification of areas of similar land use in an earth observation database • Marketing : Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs • City-planning : Identifying groups of houses according to their house type, value, and geographical location • Earth-quake studies : Observed earth quake epicenters should be clustered along continent faults • Climate : understanding earth climate, find patterns of atmospheric and ocean 6

Partitioning Algorithms: Basic Concept • Partitioning method: Partitioning a dataset D of n objects into a set of k clusters, such that the sum of squared distances is minimized (where c j is the centroid or medoid of cluster C j ) 𝑙 𝑘 ) 2 𝐾 = ෍ ෍ 𝑒(𝑦 𝑗 , 𝑑 𝑘=1 𝐷 𝑗 =𝑘 • Given k , find a partition of k clusters that optimizes the chosen partitioning criterion • Global optimal: exhaustively enumerate all partitions • Heuristic methods: k-means and k-medoids algorithms • k-means (MacQueen’67, Lloyd’57/’82): Each cluster is represented by the center of the cluster • k-medoids or PAM (Partition around medoids) (Kaufman & Rousseeuw’87): Each cluster is represented by one of the objects in the cluster 8

The K-Means Clustering Method • Given k , the k-means algorithm is implemented in four steps: • Step 0: Partition objects into k nonempty subsets • Step 1: Compute seed points as the centroids of the clusters of the current partitioning (the centroid is the center, i.e., mean point , of the cluster) • Step 2: Assign each object to the cluster with the nearest seed point • Step 3: Go back to Step 1, stop when the assignment does not change 9

An Example of K-Means Clustering K=2 Arbitrarily Update the partition cluster objects into centroids k groups The initial data set Loop if Reassign objects needed Partition objects into k nonempty  subsets Repeat  Update the Compute centroid (i.e., mean  cluster point) for each partition centroids Assign each object to the  cluster of its nearest centroid Until no change  10

Theory Behind K-Means • Objective function 𝑙 𝑘 || 2 • 𝐾 = σ 𝑘=1 σ 𝐷 𝑗 =𝑘 ||𝑦 𝑗 − 𝑑 • Re-arrange the objective function 𝑙 𝑘 || 2 • 𝐾 = σ 𝑘=1 σ 𝑗 𝑥 𝑗𝑘 ||𝑦 𝑗 − 𝑑 • 𝑥 𝑗𝑘 ∈ {0,1} • 𝑥 𝑗𝑘 = 1, 𝑗𝑔 𝑦 𝑗 𝑐𝑓𝑚𝑝𝑜𝑕𝑡 𝑢𝑝 𝑑𝑚𝑣𝑡𝑢𝑓𝑠 𝑘; 𝑥 𝑗𝑘 = 0, 𝑝𝑢ℎ𝑓𝑠𝑥𝑗𝑡𝑓 • Looking for: • The best assignment 𝑥 𝑗𝑘 • The best center 𝑑 𝑘 11

Solution of K-Means 𝑙 • Iterations 𝑘 || 2 𝐾 = ෍ ෍ 𝑥 𝑗𝑘 ||𝑦 𝑗 − 𝑑 𝑘=1 𝑗 • Step 1: Fix centers 𝑑 𝑘 , find assignment 𝑥 𝑗𝑘 that minimizes 𝐾 𝑘 || 2 is the smallest • => 𝑥 𝑗𝑘 = 1, 𝑗𝑔 ||𝑦 𝑗 − 𝑑 • Step 2: Fix assignment 𝑥 𝑗𝑘 , find centers that minimize 𝐾 • => first derivative of 𝐾 = 0 𝜖𝐾 • => 𝜖𝑑 𝑘 = −2 σ 𝑗 𝑥 𝑗𝑘 (𝑦 𝑗 − 𝑑 𝑘 ) = 0 σ 𝑗 𝑥 𝑗𝑘 𝑦 𝑗 • => 𝑑 𝑘 = σ 𝑗 𝑥 𝑗𝑘 • Note σ 𝑗 𝑥 𝑗𝑘 is the total number of objects in cluster j 12

Comments on the K-Means Method • Strength: Efficient : O ( tkn ), where n is # objects, k is # clusters, and t is # iterations. Normally, k , t << n . • Comment: Often terminates at a local optimal • Weakness • Applicable only to objects in a continuous n-dimensional space • Using the k-modes method for categorical data • In comparison, k-medoids can be applied to a wide range of data • Need to specify k, the number of clusters, in advance (there are ways to automatically determine the best k (see Hastie et al., 2009) • Sensitive to noisy data and outliers • Not suitable to discover clusters with non-convex shapes 13

Variations of the K-Means Method • Most of the variants of the k-means which differ in • Selection of the initial k means • Dissimilarity calculations • Strategies to calculate cluster means • Handling categorical data: k-modes • Replacing means of clusters with modes • Using new dissimilarity measures to deal with categorical objects • Using a frequency-based method to update modes of clusters • A mixture of categorical and numerical data: k-prototype method 14

What Is the Problem of the K-Means Method? • The k-means algorithm is sensitive to outliers ! • Since an object with an extremely large value may substantially distort the distribution of the data • K-Medoids: Instead of taking the mean value of the object in a cluster as a reference point, medoids can be used, which is the most centrally located object in a cluster 10 10 9 9 8 8 7 7 6 6 5 5 4 4 3 3 2 2 1 1 0 0 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 15

PAM: A Typical K-Medoids Algorithm* Total Cost = 20 10 10 10 9 9 9 8 8 8 Arbitrary Assign 7 7 7 6 6 6 choose k each 5 5 5 object as remaining 4 4 4 initial object to 3 3 3 medoids nearest 2 2 2 medoids 1 1 1 0 0 0 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 K=2 Randomly select a nonmedoid object,O ramdom Total Cost = 26 10 10 Do loop Compute 9 9 Swapping O 8 8 total cost of Until no 7 7 and O ramdom swapping 6 6 change 5 5 If quality is 4 4 improved. 3 3 2 2 1 1 0 0 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 16

The K-Medoid Clustering Method* • K - Medoids Clustering: Find representative objects (medoids) in clusters • PAM (Partitioning Around Medoids, Kaufmann & Rousseeuw 1987) • Starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if it improves the total distance of the resulting clustering • PAM works effectively for small data sets, but does not scale well for large data sets (due to the computational complexity) • Efficiency improvement on PAM • CLARA (Kaufmann & Rousseeuw, 1990): PAM on samples • CLARANS (Ng & Han, 1994): Randomized re-sampling 17

Hierarchical Clustering • Use distance matrix as clustering criteria. This method does not require the number of clusters k as an input, but needs a termination condition Step 1 Step 2 Step 3 Step 4 Step 0 agglomerative (AGNES) a a b b a b c d e c c d e d d e e divisive (DIANA) Step 3 Step 2 Step 1 Step 0 Step 4 19

AGNES (Agglomerative Nesting) • Introduced in Kaufmann and Rousseeuw (1990) • Implemented in statistical packages, e.g., Splus • Use the single-link method and the dissimilarity matrix • Merge nodes that have the least dissimilarity • Go on in a non-descending fashion • Eventually all nodes belong to the same cluster 10 10 10 9 9 9 8 8 8 7 7 7 6 6 6 5 5 5 4 4 4 3 3 3 2 2 2 1 1 1 0 0 0 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 20

CS145: INTRODUCTION TO DATA MINING 09: Vector Data: Clustering - PowerPoint PPT Presentation

CS145: INTRODUCTION TO DATA MINING 09: Vector Data: Clustering Basics Instructor: Yizhou Sun yzsun@cs.ucla.edu October 27, 2017 Methods to Learn Vector Data Set Data Sequence Data Text Data Logistic Regression; Nave Bayes for Text

Web Mining Web Mining Web Mining Web Mining Web mining is the use of data mining techniques

CS145: INTRODUCTION TO DATA MINING Sequence Data: Sequential Pattern Mining Instructor: Yizhou

Introduction What is data mining? to Data Mining: On what kind of data? Data Mining

CS145: INTRODUCTION TO DATA MINING 7: Vector Data: K Nearest Neighbor Instructor: Yizhou Sun

CS145: INTRODUCTION TO DATA MINING Sequence Data: Similarity Search Instructor: Yizhou Sun

CS145: INTRODUCTION TO DATA MINING 6: Vector Data: Neural Network Instructor: Yizhou Sun

CS145: INTRODUCTION TO DATA MINING Text Data: Topic Model Instructor: Yizhou Sun

CS145: INTRODUCTION TO DATA MINING Course Project Overview Instructor: Yizhou Sun

CS145: INTRODUCTION TO DATA MINING 08: Classification Evaluation and Practical Issues

CS145: INTRODUCTION TO DATA MINING Clustering Evaluation and Practical Issues Instructor: Yizhou

Web Mining Web Mining Web mining is the use of data mining techniques to automatically

CS145: INTRODUCTION TO DATA MINING 5: Vector Data: Support Vector Machine Instructor: Yizhou Sun

Introduction What is data mining? to Data mining functionalities Data Mining Major

CS145: INTRODUCTION TO DATA MINING 1: Introduction Instructor: Yizhou Sun yzsun@cs.ucla.edu

CS145: INTRODUCTION TO DATA MINING 1: Introduction Instructor: Yizhou Sun yzsun@cs.ucla.edu

Data mining Machine Intelligence Thomas D. Nielsen September 2008 Data mining September 2008

Graceful Register Clustering by Effective Mean Shift Algorithm for Power and Timing Balancing

Chapter 9. Clustering Analysis Wei Pan Division of Biostatistics, School of Public Health,

Data Mining Techniques: Partitioning Methods: K-Means Cluster Analysis Hierarchical

Cluster-GCN: An Efficient Algorithm for Training Deep and Large Graph Convolutional Networks

Clustering as a Design Problem Alberto Abadie, Susan Athey, Guido Imbens, & Jeffrey

UNSUPERVISED LEARNING AND CLUSTERING Jeff Robble, Brian Renzenbrink, Doug Roberts Unsupervised

Stochastic Blockmodel with Cluster Overlap, Relevance Selection, and Similarity-Based Smoothing

ASCLU Alternative Subspace Clustering Stephan Gnnemann Ines Frber Emmanuel Mller

CS145: INTRODUCTION TO DATA MINING 09: Vector Data: Clustering - PowerPoint PPT Presentation

CS145: INTRODUCTION TO DATA MINING 09: Vector Data: Clustering Basics Instructor: Yizhou Sun yzsun@cs.ucla.edu October 27, 2017 Methods to Learn Vector Data Set Data Sequence Data Text Data Logistic Regression; Nave Bayes for Text

Web Mining Web Mining Web Mining Web Mining Web mining is the use of data mining techniques

CS145: INTRODUCTION TO DATA MINING Sequence Data: Sequential Pattern Mining Instructor: Yizhou

Introduction What is data mining? to Data Mining: On what kind of data? Data Mining

CS145: INTRODUCTION TO DATA MINING 7: Vector Data: K Nearest Neighbor Instructor: Yizhou Sun

CS145: INTRODUCTION TO DATA MINING Sequence Data: Similarity Search Instructor: Yizhou Sun

CS145: INTRODUCTION TO DATA MINING 6: Vector Data: Neural Network Instructor: Yizhou Sun

CS145: INTRODUCTION TO DATA MINING Text Data: Topic Model Instructor: Yizhou Sun

CS145: INTRODUCTION TO DATA MINING Course Project Overview Instructor: Yizhou Sun

CS145: INTRODUCTION TO DATA MINING 08: Classification Evaluation and Practical Issues

CS145: INTRODUCTION TO DATA MINING Clustering Evaluation and Practical Issues Instructor: Yizhou

Web Mining Web Mining Web mining is the use of data mining techniques to automatically

CS145: INTRODUCTION TO DATA MINING 5: Vector Data: Support Vector Machine Instructor: Yizhou Sun

Introduction What is data mining? to Data mining functionalities Data Mining Major

CS145: INTRODUCTION TO DATA MINING 1: Introduction Instructor: Yizhou Sun yzsun@cs.ucla.edu

CS145: INTRODUCTION TO DATA MINING 1: Introduction Instructor: Yizhou Sun yzsun@cs.ucla.edu

Data mining Machine Intelligence Thomas D. Nielsen September 2008 Data mining September 2008

Graceful Register Clustering by Effective Mean Shift Algorithm for Power and Timing Balancing

Chapter 9. Clustering Analysis Wei Pan Division of Biostatistics, School of Public Health,

Data Mining Techniques: Partitioning Methods: K-Means Cluster Analysis Hierarchical

Cluster-GCN: An Efficient Algorithm for Training Deep and Large Graph Convolutional Networks

Clustering as a Design Problem Alberto Abadie, Susan Athey, Guido Imbens, &amp; Jeffrey

UNSUPERVISED LEARNING AND CLUSTERING Jeff Robble, Brian Renzenbrink, Doug Roberts Unsupervised

Stochastic Blockmodel with Cluster Overlap, Relevance Selection, and Similarity-Based Smoothing

ASCLU Alternative Subspace Clustering Stephan Gnnemann Ines Frber Emmanuel Mller

Clustering as a Design Problem Alberto Abadie, Susan Athey, Guido Imbens, & Jeffrey