
Data Mining Techniques: Partitioning Methods: K-Means Cluster Analysis - PDF document



Data Mining Techniques: Partitioning Methods: K-Means
Mirek Riedewald
Many slides based on presentations by Han/Kamber, Tan/Steinbach/Kumar, and Andrew Moore

Cluster Analysis Overview
• Introduction
• Foundations: Measuring Distance (Similarity)
• Partitioning Methods: K-Means
• Hierarchical Methods
• Density-Based Methods
• Clustering High-Dimensional Data
• Cluster Evaluation

What is Cluster Analysis?
• Cluster: a collection of data objects
  – Similar to one another within the same cluster
  – Dissimilar to the objects in other clusters
• Unsupervised learning: usually no training set with known "classes"
• Typical applications
  – As a stand-alone tool to get insight into data properties
  – As a preprocessing step for other algorithms
[Figure: example clustering in which intra-cluster distances are minimized and inter-cluster distances are maximized]

Rich Applications, Multidisciplinary Efforts
• Pattern Recognition
• Spatial Data Analysis
• Image Processing
• Data Reduction
• Economic Science
  – Market research
• WWW
  – Document classification
  – Weblogs: discover groups of similar access patterns
[Figure: clustering precipitation in Australia]

Examples of Clustering Applications
• Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
• Land use: identification of areas of similar land use in an earth observation database
• Insurance: identifying groups of motor insurance policy holders with a high average claim cost
• City planning: identifying groups of houses according to their house type, value, and geographical location
• Earthquake studies: observed earthquake epicenters should be clustered along continent faults

Quality: What Is Good Clustering?
• Cluster membership ≈ objects in the same class
• High intra-class similarity, low inter-class similarity
  – Choice of similarity measure is important
• Ability to discover some or all of the hidden patterns
  – Difficult to measure without ground truth

Notion of a Cluster Can Be Ambiguous
[Figure: how many clusters? The same set of points can plausibly be grouped into two, four, or six clusters]

Cluster Analysis Overview
• Introduction
• Foundations: Measuring Distance (Similarity)
• Partitioning Methods: K-Means
• Hierarchical Methods
• Density-Based Methods
• Clustering High-Dimensional Data
• Cluster Evaluation

Distinctions Between Sets of Clusters
• Exclusive versus non-exclusive
  – Non-exclusive clustering: points may belong to multiple clusters
• Fuzzy versus non-fuzzy
  – Fuzzy clustering: a point belongs to every cluster with some weight between 0 and 1
    • Weights must sum to 1
• Partial versus complete
  – Cluster some or all of the data
• Heterogeneous versus homogeneous
  – Clusters of widely different sizes, shapes, densities

Distance
• Clustering is inherently connected to the question of (dis-)similarity of objects
• How can we define similarity between objects?

Similarity Between Objects
• Usually measured by some notion of distance
• Popular choice: Minkowski distance (see the code sketch below):
  $\mathrm{dist}(x(i), x(j)) = \left( |x_1(i) - x_1(j)|^q + |x_2(i) - x_2(j)|^q + \cdots + |x_d(i) - x_d(j)|^q \right)^{1/q}$
  – q is a positive integer
• q = 1: Manhattan distance
  $\mathrm{dist}(x(i), x(j)) = |x_1(i) - x_1(j)| + |x_2(i) - x_2(j)| + \cdots + |x_d(i) - x_d(j)|$
• q = 2: Euclidean distance
  $\mathrm{dist}(x(i), x(j)) = \sqrt{|x_1(i) - x_1(j)|^2 + |x_2(i) - x_2(j)|^2 + \cdots + |x_d(i) - x_d(j)|^2}$
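To make these formulas concrete, here is a minimal Python sketch (the function name minkowski_dist is mine, not from the slides): q = 1 recovers the Manhattan distance and q = 2 the Euclidean distance.

```python
def minkowski_dist(x, y, q=2):
    """Minkowski distance between two d-dimensional points.

    q = 1 gives the Manhattan distance, q = 2 the Euclidean distance.
    """
    if len(x) != len(y):
        raise ValueError("points must have the same dimensionality")
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

# (0, 0) vs. (3, 4): Manhattan = 3 + 4 = 7, Euclidean = sqrt(9 + 16) = 5
print(minkowski_dist((0, 0), (3, 4), q=1))  # 7.0
print(minkowski_dist((0, 0), (3, 4), q=2))  # 5.0
```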

Metrics
• Properties of a metric d:
  – $d(i,j) \ge 0$
  – $d(i,j) = 0$ if and only if $i = j$
  – $d(i,j) = d(j,i)$
  – $d(i,j) \le d(i,k) + d(k,j)$
• Examples: Euclidean distance, Manhattan distance
• Many other non-metric similarity measures exist

Challenges
• How to compute a distance for categorical attributes?
• An attribute with a large domain often dominates the overall distance
  – Weight and scale the attributes, as for k-NN
• After selecting the distance function, is it now clear how to compute similarity between objects?
• Curse of dimensionality

Curse of Dimensionality
• Best solution: remove any attribute that is known to be very noisy or not interesting
• Try different subsets of the attributes and determine where good clusters are found

Nominal Attributes
• Method 1: work with the original values
  – Difference = 0 if same value, difference = 1 otherwise
• Method 2: transform to binary attributes (see the first sketch below)
  – New binary attribute for each domain value
  – Encode a specific domain value by setting the corresponding binary attribute to 1 and all others to 0

Ordinal Attributes
• Method 1: treat as nominal
  – Problem: loses the ordering information
• Method 2: map to [0,1] (see the second sketch below)
  – Problem: to which values should the original values be mapped?
  – Default: equi-distant mapping to [0,1]

Scaling and Transforming Attributes
• Sometimes it might be necessary to transform numerical attributes to [0,1] or to apply another normalizing transformation, maybe even a non-linear one (e.g., logarithm)
• Might need to weight attributes differently
• Often requires expert knowledge or trial-and-error
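The two encodings for nominal attributes can be illustrated with a short sketch (the helper names nominal_diff and one_hot are my own, not from the slides):

```python
def nominal_diff(a, b):
    """Method 1: difference is 0 for identical values, 1 otherwise."""
    return 0 if a == b else 1

def one_hot(value, domain):
    """Method 2: one binary attribute per domain value; the attribute
    for `value` is set to 1 and all others to 0."""
    if value not in domain:
        raise ValueError(f"unknown value: {value}")
    return [1 if v == value else 0 for v in domain]

colors = ["red", "green", "blue"]
print(nominal_diff("red", "blue"))  # 1
print(one_hot("green", colors))     # [0, 1, 0]
```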

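Likewise, a small sketch of the default equi-distant mapping for ordinal attributes and of a simple [0,1] normalization for numerical attributes (helper names are again mine; in practice one might use a preprocessing library instead):

```python
def ordinal_to_unit(value, ordered_domain):
    """Equi-distant mapping of an ordinal value to [0, 1]:
    the i-th of M ordered values maps to i / (M - 1)."""
    i = ordered_domain.index(value)  # raises ValueError for unknown values
    return i / (len(ordered_domain) - 1)

def min_max_scale(x, lo, hi):
    """Normalize a numerical value from the range [lo, hi] to [0, 1]."""
    return (x - lo) / (hi - lo)

sizes = ["small", "medium", "large"]
print(ordinal_to_unit("medium", sizes))  # 0.5
print(min_max_scale(25.0, 0.0, 100.0))   # 0.25
```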
Other Similarity Measures
• Special distance or similarity measures exist for many applications
  – Might be a non-metric function
• Information retrieval
  – Document similarity based on keywords
• Bioinformatics
  – Gene features in micro-arrays

Calculating Cluster Distances (see the code sketch below)
• Single link = smallest distance between an element in one cluster and an element in the other: $\mathrm{dist}(K_i, K_j) = \min_{p,q} \mathrm{dist}(x_{ip}, x_{jq})$
• Complete link = largest distance between an element in one cluster and an element in the other: $\mathrm{dist}(K_i, K_j) = \max_{p,q} \mathrm{dist}(x_{ip}, x_{jq})$
• Average distance between an element in one cluster and an element in the other: $\mathrm{dist}(K_i, K_j) = \mathrm{avg}_{p,q}\, \mathrm{dist}(x_{ip}, x_{jq})$
• Distance between cluster centroids: $\mathrm{dist}(K_i, K_j) = \mathrm{dist}(m_i, m_j)$
• Distance between cluster medoids: $\mathrm{dist}(K_i, K_j) = \mathrm{dist}(x_{m_i}, x_{m_j})$
  – Medoid: one chosen, centrally located object in the cluster

Cluster Analysis Overview
• Introduction
• Foundations: Measuring Distance (Similarity)
• Partitioning Methods: K-Means
• Hierarchical Methods
• Density-Based Methods
• Clustering High-Dimensional Data
• Cluster Evaluation

Cluster Centroid, Radius, and Diameter
• Centroid: the "middle" of a cluster C: $m = \frac{1}{|C|} \sum_{x \in C} x$
• Radius: square root of the average squared distance from any point of the cluster to its centroid: $R = \sqrt{\frac{\sum_{x \in C} (x - m)^2}{|C|}}$
• Diameter: square root of the average squared distance between all pairs of distinct points in the cluster: $D = \sqrt{\frac{\sum_{x \in C} \sum_{y \in C,\, y \neq x} (x - y)^2}{|C|\,(|C| - 1)}}$

Partitioning Algorithms: Basic Concept
• Construct a partition of a database D of n objects into a set of K clusters such that the sum of squared distances to the cluster "representatives" $m_i$ is minimized: $\sum_{i=1}^{K} \sum_{x \in C_i} (x - m_i)^2$
• Given a K, find the partition into K clusters that optimizes the chosen partitioning criterion
  – Globally optimal: enumerate all partitions
  – Heuristic methods
    • K-means ('67): each cluster represented by its centroid
    • K-medoids ('87): each cluster represented by one of the objects in the cluster

K-means Clustering
• Each cluster is associated with a centroid
• Each object is assigned to the cluster with the closest centroid
1. Given K, select K random objects as initial centroids
2. Repeat until the centroids do not change:
   1. Form K clusters by assigning every object to its nearest centroid
   2. Recompute the centroid of each cluster
(A minimal code sketch of this loop follows below.)
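The inter-cluster distances and the centroid, radius, and diameter definitions translate directly into code. The following sketch assumes a Euclidean base distance and tuple-valued points; all function names are mine:

```python
import math
from itertools import product

def euclid(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def single_link(Ki, Kj):
    """Smallest distance between an element of Ki and an element of Kj."""
    return min(euclid(x, y) for x, y in product(Ki, Kj))

def complete_link(Ki, Kj):
    """Largest distance between an element of Ki and an element of Kj."""
    return max(euclid(x, y) for x, y in product(Ki, Kj))

def average_link(Ki, Kj):
    """Average distance over all cross-cluster pairs."""
    return sum(euclid(x, y) for x, y in product(Ki, Kj)) / (len(Ki) * len(Kj))

def centroid(C):
    """The 'middle' of cluster C: the component-wise mean of its points."""
    d = len(C[0])
    return tuple(sum(x[k] for x in C) / len(C) for k in range(d))

def radius(C):
    """Square root of the average squared point-to-centroid distance."""
    m = centroid(C)
    return math.sqrt(sum(euclid(x, m) ** 2 for x in C) / len(C))

def diameter(C):
    """Square root of the average squared distance over all ordered pairs
    of distinct points in the cluster."""
    n = len(C)
    return math.sqrt(
        sum(euclid(x, y) ** 2 for x in C for y in C if y != x) / (n * (n - 1))
    )
```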

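Finally, a minimal sketch of the K-means loop described on the last slide, in pure Python with my own function names (a real application would typically use a library implementation): it alternates nearest-centroid assignment with centroid recomputation until the centroids stop changing.

```python
import random

def squared_dist(x, y):
    """Squared Euclidean distance: the quantity K-means minimizes."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def mean_point(cluster):
    """Centroid of a non-empty cluster of tuple-valued points."""
    d = len(cluster[0])
    return tuple(sum(p[k] for p in cluster) / len(cluster) for k in range(d))

def kmeans(points, k, max_iter=100, seed=0):
    """Given K, select K random objects as initial centroids, then repeat
    (1) form K clusters by nearest-centroid assignment and
    (2) recompute each centroid, until the centroids do not change."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Step 1: assign every object to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Step 2: recompute centroids (keep the old one for an empty cluster).
        new_centroids = [mean_point(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:  # converged: centroids did not change
            break
        centroids = new_centroids
    return centroids, clusters

# Tiny usage example with two well-separated groups:
pts = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (7.8, 8.2), (8.1, 7.9)]
centroids, clusters = kmeans(pts, k=2)
print(centroids)  # two centroids, one near (1, 1) and one near (8, 8)
```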