SLIDE 1

Data Mining and Machine Learning: Fundamental Concepts and Algorithms

dataminingbook.info

Mohammed J. Zaki¹  Wagner Meira Jr.²

¹Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY, USA

²Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil

Chapter 17: Clustering Validation

SLIDE 2

Clustering Validation and Evaluation

Cluster validation and assessment encompasses three main tasks:

- Clustering evaluation seeks to assess the goodness or quality of the clustering.
- Clustering stability seeks to understand the sensitivity of the clustering result to various algorithmic parameters, for example, the number of clusters.
- Clustering tendency assesses the suitability of applying clustering in the first place, that is, whether the data has any inherent grouping structure.

Validity measures can be divided into three main types:

- External: external validation measures employ criteria that are not inherent to the dataset, e.g., class labels.
- Internal: internal validation measures employ criteria that are derived from the data itself, e.g., intracluster and intercluster distances.
- Relative: relative validation measures aim to directly compare different clusterings, usually those obtained via different parameter settings for the same algorithm.

SLIDE 3

External Measures

External measures assume that the correct or ground-truth clustering is known a priori, which is used to evaluate a given clustering.

Let $D = \{x_i\}_{i=1}^{n}$ be a dataset consisting of $n$ points in a $d$-dimensional space, partitioned into $k$ clusters. Let $y_i \in \{1,2,\dots,k\}$ denote the ground-truth cluster membership or label information for each point. The ground-truth clustering is given as $T = \{T_1, T_2, \dots, T_k\}$, where the cluster $T_j$ consists of all the points with label $j$, i.e., $T_j = \{x_i \in D \mid y_i = j\}$. We refer to $T$ as the ground-truth partitioning, and to each $T_i$ as a partition.

Let $C = \{C_1, \dots, C_r\}$ denote a clustering of the same dataset into $r$ clusters, obtained via some clustering algorithm, and let $\hat{y}_i \in \{1,2,\dots,r\}$ denote the cluster label for $x_i$.

SLIDE 4

External Measures

External evaluation measures try to capture the extent to which points from the same partition appear in the same cluster, and the extent to which points from different partitions are grouped in different clusters.

All of the external measures rely on the $r \times k$ contingency table $N$ that is induced by a clustering $C$ and the ground-truth partitioning $T$, defined as follows:
$$N(i,j) = n_{ij} = |C_i \cap T_j|$$
The count $n_{ij}$ denotes the number of points that are common to cluster $C_i$ and ground-truth partition $T_j$. Let $n_i = |C_i|$ denote the number of points in cluster $C_i$, and let $m_j = |T_j|$ denote the number of points in partition $T_j$. The contingency table can be computed from $T$ and $C$ in $O(n)$ time by examining the partition and cluster labels, $y_i$ and $\hat{y}_i$, for each point $x_i \in D$ and incrementing the corresponding count $n_{\hat{y}_i, y_i}$.
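As a small illustrative sketch (not from the slides; the function name and the use of NumPy are my assumptions), the contingency table can be built in a single O(n) pass over the two label vectors:

```python
import numpy as np

def contingency_table(chat, y, r, k):
    """Build the r x k contingency table with N[i, j] = |C_i ∩ T_j|.

    chat : cluster labels in {0, ..., r-1}, one per point
    y    : ground-truth partition labels in {0, ..., k-1}
    """
    N = np.zeros((r, k), dtype=int)
    np.add.at(N, (chat, y), 1)   # one increment per point: O(n) overall
    return N
```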

SLIDE 5

Matching Based Measures: Purity

Purity quantifies the extent to which a cluster $C_i$ contains entities from only one partition:
$$\text{purity}_i = \frac{1}{n_i} \max_{j=1}^{k} \{n_{ij}\}$$
The purity of clustering $C$ is defined as the weighted sum of the cluster-wise purity values:
$$\text{purity} = \sum_{i=1}^{r} \frac{n_i}{n}\, \text{purity}_i = \frac{1}{n} \sum_{i=1}^{r} \max_{j=1}^{k} \{n_{ij}\}$$
where the ratio $n_i/n$ denotes the fraction of points in cluster $C_i$.
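A minimal sketch of purity computed from the contingency table; the example table is the one shown later for the "good" Iris clustering:

```python
import numpy as np

def purity(N):
    """Purity = (1/n) * sum over clusters of the dominant-partition count."""
    return N.max(axis=1).sum() / N.sum()

N_good = np.array([[0, 47, 14],   # C1 (squares)
                   [50, 0, 0],    # C2 (circles)
                   [0, 3, 36]])   # C3 (triangles)
print(round(purity(N_good), 3))   # 0.887
```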

SLIDE 6

Matching Based Measures: Maximum Matching

The maximum matching measure selects the mapping between clusters and partitions such that the sum of the number of common points ($n_{ij}$) is maximized, provided that only one cluster can match with a given partition.

Let $G$ be a bipartite graph over the vertex set $V = C \cup T$, and let the edge set be $E = \{(C_i, T_j)\}$ with edge weights $w(C_i, T_j) = n_{ij}$. A matching $M$ in $G$ is a subset of $E$ such that the edges in $M$ are pairwise nonadjacent, that is, they do not have a common vertex. The maximum matching measure is given as:
$$\text{match} = \arg\max_{M} \left\{ \frac{w(M)}{n} \right\}$$
where $w(M)$ is the sum of all the edge weights in matching $M$, given as $w(M) = \sum_{e \in M} w(e)$.
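A short sketch of the maximum matching measure; using SciPy's Hungarian-algorithm solver here is my choice of tooling, not something the slides prescribe:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def max_matching(N):
    """Maximum-weight one-to-one matching between clusters and partitions."""
    rows, cols = linear_sum_assignment(N, maximize=True)
    return N[rows, cols].sum() / N.sum()
```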

SLIDE 7

Matching Based Measures: F-measure

Given cluster $C_i$, let $j_i$ denote the partition that contains the maximum number of points from $C_i$, that is, $j_i = \arg\max_{j=1}^{k} \{n_{ij}\}$.

The precision of a cluster $C_i$ is the same as its purity:
$$\text{prec}_i = \frac{1}{n_i} \max_{j=1}^{k} \{n_{ij}\} = \frac{n_{i j_i}}{n_i}$$
The recall of cluster $C_i$ is defined as
$$\text{recall}_i = \frac{n_{i j_i}}{|T_{j_i}|} = \frac{n_{i j_i}}{m_{j_i}}$$
where $m_{j_i} = |T_{j_i}|$.

SLIDE 8

Matching Based Measures: F-measure

The F-measure is the harmonic mean of the precision and recall values for each cluster $C_i$:
$$F_i = \frac{2}{\frac{1}{\text{prec}_i} + \frac{1}{\text{recall}_i}} = \frac{2 \cdot \text{prec}_i \cdot \text{recall}_i}{\text{prec}_i + \text{recall}_i} = \frac{2\, n_{i j_i}}{n_i + m_{j_i}}$$
The F-measure for the clustering $C$ is the mean of the cluster-wise F-measure values:
$$F = \frac{1}{r} \sum_{i=1}^{r} F_i$$
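A sketch of cluster-wise precision, recall, and the clustering F-measure computed from the contingency table (my own helper, assuming NumPy):

```python
import numpy as np

def f_measure(N):
    """Mean cluster-wise F-measure, plus per-cluster precision and recall."""
    n_i = N.sum(axis=1)          # cluster sizes
    m_j = N.sum(axis=0)          # partition sizes
    j_i = N.argmax(axis=1)       # dominant partition of each cluster
    n_iji = N.max(axis=1)
    prec = n_iji / n_i
    recall = n_iji / m_j[j_i]
    F_i = 2 * n_iji / (n_i + m_j[j_i])
    return F_i.mean(), prec, recall
```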

SLIDE 9

K-means: Iris Principal Components Data

Good Case

[Scatter plot of the Iris principal components data ($u_1$ vs. $u_2$); the three symbols mark the three K-means clusters.]

Contingency table:

                 iris-setosa   iris-versicolor   iris-virginica
                     T1              T2               T3           n_i
C1 (squares)          0              47               14            61
C2 (circles)         50               0                0            50
C3 (triangles)        0               3               36            39
m_j                  50              50               50        n = 150

purity = 0.887, match = 0.887, F = 0.885.

SLIDE 10

K-means: Iris Principal Components Data

Bad Case

[Scatter plot of the Iris principal components data ($u_1$ vs. $u_2$); the three symbols mark the three K-means clusters in the bad case.]

Contingency table:

                 iris-setosa   iris-versicolor   iris-virginica
                     T1              T2               T3           n_i
C1 (squares)         30               0                0            30
C2 (circles)         20               4                0            24
C3 (triangles)        0              46               50            96
m_j                  50              50               50        n = 150

purity = 0.667, match = 0.560, F = 0.658

SLIDE 11

Entropy-based Measures: Conditional Entropy

The entropy of a clustering $C$ and partitioning $T$ is given as
$$H(C) = -\sum_{i=1}^{r} p_{C_i} \log p_{C_i} \qquad\qquad H(T) = -\sum_{j=1}^{k} p_{T_j} \log p_{T_j}$$
where $p_{C_i} = \frac{n_i}{n}$ and $p_{T_j} = \frac{m_j}{n}$ are the probabilities of cluster $C_i$ and partition $T_j$.

The cluster-specific entropy of $T$, that is, the conditional entropy of $T$ with respect to cluster $C_i$, is defined as
$$H(T \mid C_i) = -\sum_{j=1}^{k} \frac{n_{ij}}{n_i} \log\!\left(\frac{n_{ij}}{n_i}\right)$$

SLIDE 12

Entropy-based Measures: Conditional Entropy

The conditional entropy of $T$ given clustering $C$ is defined as the weighted sum:
$$H(T \mid C) = \sum_{i=1}^{r} \frac{n_i}{n}\, H(T \mid C_i) = -\sum_{i=1}^{r}\sum_{j=1}^{k} p_{ij} \log\!\left(\frac{p_{ij}}{p_{C_i}}\right) = H(C, T) - H(C)$$
where $p_{ij} = \frac{n_{ij}}{n}$ is the probability that a point in cluster $i$ also belongs to partition $j$, and where $H(C, T) = -\sum_{i=1}^{r}\sum_{j=1}^{k} p_{ij} \log p_{ij}$ is the joint entropy of $C$ and $T$.

H(T |C) = 0 if and only if T is completely determined by C, corresponding to the ideal clustering. If C and T are independent of each other, then H(T |C) = H(T ).
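A minimal sketch of H(T|C) from the contingency table. The base-2 logarithm is an assumption on my part; with it, the helper reproduces the value 0.418 reported for the good Iris clustering on slide 15.

```python
import numpy as np

def conditional_entropy(N):
    """H(T | C) from the r x k contingency table N (base-2 logs)."""
    n = N.sum()
    p_ij = N / n                              # joint probabilities n_ij / n
    p_C = p_ij.sum(axis=1, keepdims=True)     # cluster probabilities n_i / n
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = p_ij * np.log2(p_ij / p_C)
    return -np.nansum(terms)                  # empty cells contribute 0
```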

SLIDE 13

Entropy-based Measures: Normalized Mutual Information

The mutual information tries to quantify the amount of shared information between the clustering $C$ and partitioning $T$, and it is defined as
$$I(C, T) = \sum_{i=1}^{r}\sum_{j=1}^{k} p_{ij} \log\!\left(\frac{p_{ij}}{p_{C_i} \cdot p_{T_j}}\right)$$
When $C$ and $T$ are independent, then $p_{ij} = p_{C_i} \cdot p_{T_j}$, and thus $I(C, T) = 0$. However, there is no upper bound on the mutual information.

The normalized mutual information (NMI) is defined as the geometric mean:
$$NMI(C, T) = \sqrt{\frac{I(C, T)}{H(C)} \cdot \frac{I(C, T)}{H(T)}} = \frac{I(C, T)}{\sqrt{H(C) \cdot H(T)}}$$

The NMI value lies in the range [0,1]. Values close to 1 indicate a good clustering.

SLIDE 14

Entropy-based Measures: Variation of Information

This criterion is based on the mutual information between the clustering $C$ and the ground-truth partitioning $T$, and their entropies; it is defined as
$$VI(C, T) = (H(T) - I(C, T)) + (H(C) - I(C, T)) = H(T) + H(C) - 2\,I(C, T)$$
Variation of information (VI) is zero only when $C$ and $T$ are identical. Thus, the lower the VI value the better the clustering $C$.

VI can also be expressed as:
$$VI(C, T) = H(T \mid C) + H(C \mid T) \qquad\qquad VI(C, T) = 2\,H(T, C) - H(T) - H(C)$$
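A companion sketch (again my own, base-2 logs assumed) for mutual information, NMI, and VI from the contingency table:

```python
import numpy as np

def nmi_and_vi(N):
    """Normalized mutual information and variation of information."""
    p_ij = N / N.sum()
    p_C, p_T = p_ij.sum(axis=1), p_ij.sum(axis=0)
    H = lambda p: -np.sum(p[p > 0] * np.log2(p[p > 0]))
    nz = p_ij > 0
    I = np.sum(p_ij[nz] * np.log2(p_ij[nz] / np.outer(p_C, p_T)[nz]))
    nmi = I / np.sqrt(H(p_C) * H(p_T))
    vi = H(p_C) + H(p_T) - 2 * I
    return nmi, vi
```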

SLIDE 15

K-means: Iris Principal Components Data

Good Case

[Scatter plots of the Iris principal components data ($u_1$ vs. $u_2$): (a) K-means good clustering, (b) K-means bad clustering.]

            purity   match     F     H(T|C)   NMI     VI
(a) Good     0.887   0.887   0.885   0.418   0.742   0.812
(b) Bad      0.667   0.560   0.658   0.743   0.587   1.200

SLIDE 16

Pairwise Measures

Given clustering $C$ and ground-truth partitioning $T$, let $x_i, x_j \in D$ be any two points, with $i \neq j$. Let $y_i$ denote the true partition label and let $\hat{y}_i$ denote the cluster label for point $x_i$.

If both $x_i$ and $x_j$ belong to the same cluster, that is, $\hat{y}_i = \hat{y}_j$, we call it a positive event, and if they do not belong to the same cluster, that is, $\hat{y}_i \neq \hat{y}_j$, we call that a negative event. Depending on whether there is agreement between the cluster labels and partition labels, there are four possibilities to consider:

True Positives: $x_i$ and $x_j$ belong to the same partition in $T$, and they are also in the same cluster in $C$. The number of true positive pairs is given as
$$TP = \left|\left\{(x_i, x_j) : y_i = y_j \text{ and } \hat{y}_i = \hat{y}_j\right\}\right|$$

False Negatives: $x_i$ and $x_j$ belong to the same partition in $T$, but they do not belong to the same cluster in $C$. The number of all false negative pairs is given as
$$FN = \left|\left\{(x_i, x_j) : y_i = y_j \text{ and } \hat{y}_i \neq \hat{y}_j\right\}\right|$$

SLIDE 17

Pairwise Measures

False Positives: $x_i$ and $x_j$ do not belong to the same partition in $T$, but they do belong to the same cluster in $C$. The number of false positive pairs is given as
$$FP = \left|\left\{(x_i, x_j) : y_i \neq y_j \text{ and } \hat{y}_i = \hat{y}_j\right\}\right|$$

True Negatives: $x_i$ and $x_j$ neither belong to the same partition in $T$, nor do they belong to the same cluster in $C$. The number of such true negative pairs is given as
$$TN = \left|\left\{(x_i, x_j) : y_i \neq y_j \text{ and } \hat{y}_i \neq \hat{y}_j\right\}\right|$$

Because there are $N = \binom{n}{2} = \frac{n(n-1)}{2}$ pairs of points, we have the following identity:
$$N = TP + FN + FP + TN$$

SLIDE 18

Pairwise Measures: TP, TN, FP, FN

These counts can be computed efficiently using the contingency table $N = \{n_{ij}\}$. The number of true positives is given as
$$TP = \frac{1}{2}\left(\sum_{i=1}^{r}\sum_{j=1}^{k} n_{ij}^2 \;-\; n\right)$$
The false negatives can be computed as
$$FN = \frac{1}{2}\left(\sum_{j=1}^{k} m_j^2 \;-\; \sum_{i=1}^{r}\sum_{j=1}^{k} n_{ij}^2\right)$$
The number of false positives is
$$FP = \frac{1}{2}\left(\sum_{i=1}^{r} n_i^2 \;-\; \sum_{i=1}^{r}\sum_{j=1}^{k} n_{ij}^2\right)$$
Finally, the number of true negatives can be obtained via
$$TN = N - (TP + FN + FP) = \frac{1}{2}\left(n^2 - \sum_{i=1}^{r} n_i^2 - \sum_{j=1}^{k} m_j^2 + \sum_{i=1}^{r}\sum_{j=1}^{k} n_{ij}^2\right)$$
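These four identities translate directly into code; a small sketch (the helper name is mine, NumPy assumed):

```python
import numpy as np

def pair_counts(N):
    """TP, FN, FP, TN pair counts from the r x k contingency table N."""
    n = int(N.sum())
    s_nij2 = int(np.sum(N.astype(np.int64) ** 2))
    s_ni2 = int(np.sum(N.sum(axis=1, dtype=np.int64) ** 2))
    s_mj2 = int(np.sum(N.sum(axis=0, dtype=np.int64) ** 2))
    TP = (s_nij2 - n) // 2
    FN = (s_mj2 - s_nij2) // 2
    FP = (s_ni2 - s_nij2) // 2
    TN = n * (n - 1) // 2 - (TP + FN + FP)
    return TP, FN, FP, TN
```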

SLIDE 19

Pairwise Measures: Jaccard Coefficient, Rand Statistic

Jaccard Coefficient: measures the fraction of true positive point pairs, ignoring the true negatives:
$$\text{Jaccard} = \frac{TP}{TP + FN + FP}$$
Rand Statistic: measures the fraction of true positives and true negatives over all point pairs:
$$\text{Rand} = \frac{TP + TN}{N}$$

SLIDE 20

Pairwise Measures: FM Measure

Fowlkes–Mallows Measure: Define the overall pairwise precision and pairwise recall values for a clustering $C$ as follows:
$$\text{prec} = \frac{TP}{TP + FP} \qquad\qquad \text{recall} = \frac{TP}{TP + FN}$$
The Fowlkes–Mallows (FM) measure is defined as the geometric mean of the pairwise precision and recall:
$$FM = \sqrt{\text{prec} \cdot \text{recall}} = \frac{TP}{\sqrt{(TP + FN)(TP + FP)}}$$
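A small sketch combining the three pairwise measures; it takes the four pair counts as input (e.g., from a helper like the one sketched after slide 18):

```python
import numpy as np

def jaccard_rand_fm(TP, FN, FP, TN):
    """Jaccard coefficient, Rand statistic, and Fowlkes-Mallows measure."""
    N = TP + FN + FP + TN
    jaccard = TP / (TP + FN + FP)
    rand = (TP + TN) / N
    fm = TP / np.sqrt((TP + FN) * (TP + FP))
    return jaccard, rand, fm
```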

SLIDE 21

K-means: Iris Principal Components Data

Good Case

[Scatter plot of the Iris principal components data ($u_1$ vs. $u_2$) for the good K-means clustering.]

Contingency table:

        setosa   versicolor   virginica
          T1          T2          T3
C1         0          47          14
C2        50           0           0
C3         0           3          36

The number of true positives is:
$$TP = \binom{47}{2} + \binom{14}{2} + \binom{50}{2} + \binom{3}{2} + \binom{36}{2} = 3030$$
Likewise, we have $FN = 645$, $FP = 766$, $TN = 6734$, and $N = \binom{150}{2} = 11175$.

We therefore have: Jaccard = 0.682, Rand = 0.874, FM = 0.811. For the “bad” clustering, we have: Jaccard = 0.477, Rand = 0.717, FM = 0.657.

SLIDE 22

Correlation Measures: Hubert statistic

Let $X$ and $Y$ be two symmetric $n \times n$ matrices, and let $N = \binom{n}{2}$. Let $x, y \in \mathbb{R}^N$ denote the vectors obtained by linearizing the upper triangular elements (excluding the main diagonal) of $X$ and $Y$, respectively.

Let $\mu_X$ denote the element-wise mean of $x$, given as
$$\mu_X = \frac{1}{N}\sum_{i=1}^{n-1}\sum_{j=i+1}^{n} X(i,j)$$
and let $z_x$ denote the centered $x$ vector, defined as $z_x = x - \mathbf{1} \cdot \mu_X$.

The Hubert statistic is defined as
$$\Gamma = \frac{1}{N}\sum_{i=1}^{n-1}\sum_{j=i+1}^{n} X(i,j) \cdot Y(i,j) = \frac{1}{N}\, x^T y$$
The normalized Hubert statistic is defined as the element-wise correlation
$$\Gamma_n = \frac{z_x^T z_y}{\|z_x\| \cdot \|z_y\|} = \cos\theta$$

SLIDE 23

Correlation-based Measure: Discretized Hubert Statistic

Let $T$ and $C$ be the $n \times n$ matrices defined as
$$T(i,j) = \begin{cases} 1 & \text{if } y_i = y_j,\ i \neq j \\ 0 & \text{otherwise} \end{cases} \qquad\qquad C(i,j) = \begin{cases} 1 & \text{if } \hat{y}_i = \hat{y}_j,\ i \neq j \\ 0 & \text{otherwise} \end{cases}$$
Let $t, c \in \mathbb{R}^N$ denote the $N$-dimensional vectors comprising the upper triangular elements (excluding the diagonal) of $T$ and $C$, and let $z_t$ and $z_c$ denote the centered $t$ and $c$ vectors.

The discretized Hubert statistic is computed by setting $x = t$ and $y = c$:
$$\Gamma = \frac{1}{N}\, t^T c = \frac{TP}{N}$$
The normalized version of the discretized Hubert statistic is simply the correlation between $t$ and $c$:
$$\Gamma_n = \frac{z_t^T z_c}{\|z_t\| \cdot \|z_c\|} = \frac{\frac{TP}{N} - \mu_T \mu_C}{\sqrt{\mu_T \mu_C (1 - \mu_T)(1 - \mu_C)}}$$
where $\mu_T = \frac{TP + FN}{N}$ and $\mu_C = \frac{TP + FP}{N}$.
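In this discretized form the statistic needs only the four pair counts; a brief sketch (my own helper):

```python
import numpy as np

def discretized_hubert(TP, FN, FP, TN):
    """Discretized Hubert statistic and its normalized (correlation) version."""
    N = TP + FN + FP + TN
    gamma = TP / N
    mu_T = (TP + FN) / N    # fraction of pairs together in the ground truth
    mu_C = (TP + FP) / N    # fraction of pairs together in the clustering
    gamma_n = (gamma - mu_T * mu_C) / np.sqrt(mu_T * mu_C * (1 - mu_T) * (1 - mu_C))
    return gamma, gamma_n
```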

SLIDE 24

Internal Measures

Internal evaluation measures do not have recourse to the ground-truth partitioning. To evaluate the quality of the clustering, internal measures therefore have to utilize notions of intracluster similarity or compactness, contrasted with notions of intercluster separation, usually with a trade-off in maximizing these two aims.

The internal measures are based on the $n \times n$ distance matrix, also called the proximity matrix, of all pairwise distances among the $n$ points:
$$W = \left\{\delta(x_i, x_j)\right\}_{i,j=1}^{n}$$
where $\delta(x_i, x_j) = \|x_i - x_j\|_2$ is the Euclidean distance between $x_i, x_j \in D$.

The proximity matrix $W$ is the adjacency matrix of the weighted complete graph $G$ over the $n$ points, that is, with nodes $V = \{x_i \mid x_i \in D\}$, edges $E = \{(x_i, x_j) \mid x_i, x_j \in D\}$, and edge weights $w_{ij} = W(i,j)$ for all $x_i, x_j \in D$.
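A one-line sketch of the proximity matrix using SciPy (a tooling assumption on my part):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def proximity_matrix(X):
    """n x n matrix of pairwise Euclidean distances delta(x_i, x_j)."""
    return squareform(pdist(X, metric="euclidean"))
```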

SLIDE 25

Internal Measures

The clustering $C$ can be considered as a $k$-way cut in $G$. Given any subsets $S, R \subset V$, define $W(S, R)$ as the sum of the weights on all edges with one vertex in $S$ and the other in $R$, given as
$$W(S, R) = \sum_{x_i \in S}\sum_{x_j \in R} w_{ij}$$
We denote by $\overline{S} = V - S$ the complementary set of vertices.

The sum of all the intracluster and intercluster weights is given as
$$W_{in} = \frac{1}{2}\sum_{i=1}^{k} W(C_i, C_i) \qquad\qquad W_{out} = \frac{1}{2}\sum_{i=1}^{k} W(C_i, \overline{C_i}) = \sum_{i=1}^{k-1}\sum_{j>i} W(C_i, C_j)$$
The number of distinct intracluster and intercluster edges is given as
$$N_{in} = \sum_{i=1}^{k} \binom{n_i}{2} \qquad\qquad N_{out} = \sum_{i=1}^{k-1}\sum_{j=i+1}^{k} n_i \cdot n_j$$
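A sketch computing W_in, W_out, N_in, and N_out directly from the data and integer cluster labels (my own helper; Euclidean distances assumed):

```python
import numpy as np
from scipy.spatial.distance import pdist

def cut_weights(X, labels):
    """Intracluster/intercluster weight sums and edge counts."""
    labels = np.asarray(labels)
    iu = np.triu_indices(len(labels), k=1)     # each unordered pair once
    d = pdist(X)                               # same pair ordering as triu_indices
    same = labels[iu[0]] == labels[iu[1]]
    return d[same].sum(), d[~same].sum(), int(same.sum()), int((~same).sum())
```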

SLIDE 26

Clusterings as Graphs: Iris (Good Case)

[Iris principal components data ($u_1$, $u_2$): the good clustering viewed as a graph.]

Only intracluster edges shown.

SLIDE 27

Clusterings as Graphs: Iris (Bad Case)

[Iris principal components data ($u_1$, $u_2$): the bad clustering viewed as a graph.]

Only intracluster edges shown.

SLIDE 28

Internal Measures: BetaCV and C-index

BetaCV Measure: The BetaCV measure is the ratio of the mean intracluster distance to the mean intercluster distance:
$$\text{BetaCV} = \frac{W_{in}/N_{in}}{W_{out}/N_{out}} = \frac{N_{out}}{N_{in}} \cdot \frac{W_{in}}{W_{out}} = \frac{N_{out}}{N_{in}} \cdot \frac{\sum_{i=1}^{k} W(C_i, C_i)}{\sum_{i=1}^{k} W(C_i, \overline{C_i})}$$
The smaller the BetaCV ratio, the better the clustering.

C-index: Let $W_{min}(N_{in})$ be the sum of the smallest $N_{in}$ distances in the proximity matrix $W$, where $N_{in}$ is the total number of intracluster edges, or point pairs. Let $W_{max}(N_{in})$ be the sum of the largest $N_{in}$ distances in $W$. The C-index measures to what extent the clustering puts together the $N_{in}$ points that are the closest across the $k$ clusters. It is defined as
$$\text{Cindex} = \frac{W_{in} - W_{min}(N_{in})}{W_{max}(N_{in}) - W_{min}(N_{in})}$$
The C-index lies in the range $[0, 1]$. The smaller the C-index, the better the clustering.
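A sketch of BetaCV and the C-index built on the same condensed pairwise distances (Euclidean distance and NumPy/SciPy are assumptions):

```python
import numpy as np
from scipy.spatial.distance import pdist

def betacv_and_cindex(X, labels):
    """BetaCV and C-index for a labelled dataset."""
    labels = np.asarray(labels)
    iu = np.triu_indices(len(labels), k=1)
    d = pdist(X)
    same = labels[iu[0]] == labels[iu[1]]
    W_in, W_out = d[same].sum(), d[~same].sum()
    N_in, N_out = int(same.sum()), int((~same).sum())
    betacv = (W_in / N_in) / (W_out / N_out)
    d_sorted = np.sort(d)
    W_min, W_max = d_sorted[:N_in].sum(), d_sorted[-N_in:].sum()
    cindex = (W_in - W_min) / (W_max - W_min)
    return betacv, cindex
```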

SLIDE 29

Internal Measures: Normalized Cut and Modularity

Normalized Cut Measure: The normalized cut objective for graph clustering can also be used as an internal clustering evaluation measure:
$$NC = \sum_{i=1}^{k} \frac{W(C_i, \overline{C_i})}{vol(C_i)} = \sum_{i=1}^{k} \frac{W(C_i, \overline{C_i})}{W(C_i, V)}$$
where $vol(C_i) = W(C_i, V)$ is the volume of cluster $C_i$. Because the edge weights here are distances, the higher the normalized cut value the better.

Modularity: The modularity objective is given as
$$Q = \sum_{i=1}^{k} \left(\frac{W(C_i, C_i)}{W(V, V)} - \left(\frac{W(C_i, V)}{W(V, V)}\right)^2\right)$$
The smaller the modularity measure the better the clustering.
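A sketch of both graph-based measures over the distance-weighted complete graph; reading W(S, R) as a sum over ordered pairs is my interpretation of the definition above:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def normalized_cut_and_modularity(X, labels):
    """Normalized cut and modularity with distances as edge weights."""
    labels = np.asarray(labels)
    W = squareform(pdist(X))
    W_VV = W.sum()                         # W(V, V)
    nc = q = 0.0
    for c in np.unique(labels):
        in_c = labels == c
        W_Ci_V = W[in_c].sum()             # vol(C_i) = W(C_i, V)
        W_Ci_Ci = W[np.ix_(in_c, in_c)].sum()
        nc += (W_Ci_V - W_Ci_Ci) / W_Ci_V  # W(C_i, complement) / vol(C_i)
        q += W_Ci_Ci / W_VV - (W_Ci_V / W_VV) ** 2
    return nc, q
```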

SLIDE 30

Internal Measures: Dunn Index

The Dunn index is defined as the ratio between the minimum distance between point pairs from different clusters and the maximum distance between point pairs from the same cluster:
$$\text{Dunn} = \frac{W_{out}^{min}}{W_{in}^{max}}$$
where $W_{out}^{min}$ is the minimum intercluster distance:
$$W_{out}^{min} = \min_{i,\, j>i} \left\{ w_{ab} \mid x_a \in C_i,\ x_b \in C_j \right\}$$
and $W_{in}^{max}$ is the maximum intracluster distance:
$$W_{in}^{max} = \max_{i} \left\{ w_{ab} \mid x_a, x_b \in C_i \right\}$$
The larger the Dunn index the better the clustering, because it means even the closest distance between points in different clusters is much larger than the farthest distance between points in the same cluster.

SLIDE 31

Internal Measures: Davies-Bouldin Index

Let $\mu_i$ denote the cluster mean:
$$\mu_i = \frac{1}{n_i} \sum_{x_j \in C_i} x_j$$
Let $\sigma_{\mu_i}$ denote the dispersion or spread of the points around the cluster mean:
$$\sigma_{\mu_i} = \sqrt{\frac{\sum_{x_j \in C_i} \delta(x_j, \mu_i)^2}{n_i}} = \sqrt{var(C_i)}$$
The Davies–Bouldin measure for a pair of clusters $C_i$ and $C_j$ is defined as the ratio
$$DB_{ij} = \frac{\sigma_{\mu_i} + \sigma_{\mu_j}}{\delta(\mu_i, \mu_j)}$$
$DB_{ij}$ measures how compact the clusters are compared to the distance between the cluster means. The Davies–Bouldin index is then defined as
$$DB = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \{DB_{ij}\}$$
The smaller the DB value the better the clustering.
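Sketches of the Dunn and Davies–Bouldin indexes (Euclidean distances and at least two points per cluster are assumptions of this illustration):

```python
import numpy as np
from scipy.spatial.distance import pdist, cdist

def dunn_index(X, labels):
    """Min intercluster distance divided by max intracluster distance."""
    labels = np.asarray(labels)
    iu = np.triu_indices(len(labels), k=1)
    d = pdist(X)
    same = labels[iu[0]] == labels[iu[1]]
    return d[~same].min() / d[same].max()

def davies_bouldin(X, labels):
    """Average worst-case ratio of cluster spread to cluster separation."""
    labels = np.asarray(labels)
    cs = np.unique(labels)
    mus = np.array([X[labels == c].mean(axis=0) for c in cs])
    sig = np.array([np.sqrt(((X[labels == c] - mu) ** 2).sum(axis=1).mean())
                    for c, mu in zip(cs, mus)])
    dmu = cdist(mus, mus)
    np.fill_diagonal(dmu, np.inf)              # exclude j = i from the max
    return ((sig[:, None] + sig[None, :]) / dmu).max(axis=1).mean()
```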

SLIDE 32

Silhouette Coefficient

Define the silhouette coefficient of a point $x_i$ as
$$s_i = \frac{\mu_{out}^{min}(x_i) - \mu_{in}(x_i)}{\max\left\{\mu_{out}^{min}(x_i),\ \mu_{in}(x_i)\right\}}$$
where $\mu_{in}(x_i)$ is the mean distance from $x_i$ to points in its own cluster $\hat{y}_i$:
$$\mu_{in}(x_i) = \frac{\sum_{x_j \in C_{\hat{y}_i},\, j \neq i} \delta(x_i, x_j)}{n_{\hat{y}_i} - 1}$$
and $\mu_{out}^{min}(x_i)$ is the mean of the distances from $x_i$ to points in the closest cluster:
$$\mu_{out}^{min}(x_i) = \min_{j \neq \hat{y}_i} \left\{ \frac{\sum_{y \in C_j} \delta(x_i, y)}{n_j} \right\}$$
The $s_i$ value lies in the interval $[-1, +1]$. A value close to $+1$ indicates that $x_i$ is much closer to points in its own cluster, a value close to zero indicates that $x_i$ is close to the boundary, and a value close to $-1$ indicates that $x_i$ is much closer to another cluster, and therefore may be mis-clustered.

The silhouette coefficient is the mean $s_i$ value: $SC = \frac{1}{n}\sum_{i=1}^{n} s_i$. A value close to $+1$ indicates a good clustering.
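A direct sketch of the per-point silhouette values and SC (my own helper; it assumes every cluster has at least two points):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def silhouette(X, labels):
    """Per-point silhouette values s_i and their mean SC."""
    labels = np.asarray(labels)
    W = squareform(pdist(X))
    s = np.empty(len(labels))
    for i in range(len(labels)):
        own = labels == labels[i]
        mu_in = W[i, own].sum() / (own.sum() - 1)       # excludes the point itself
        mu_out = min(W[i, labels == c].mean()
                     for c in np.unique(labels) if c != labels[i])
        s[i] = (mu_out - mu_in) / max(mu_out, mu_in)
    return s, s.mean()
```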

SLIDE 33

Iris Data: Good vs. Bad Clustering

[Scatter plots of the Iris principal components data ($u_1$ vs. $u_2$): (a) good clustering, (b) bad clustering.]

                 Lower better                      Higher better
           BetaCV   Cindex     Q      DB      NC     Dunn    SC      Γ      Γn
(a) Good    0.24    0.034   −0.23    0.65    2.67    0.08   0.60   8.19   0.92
(b) Bad     0.33    0.08    −0.20    1.11    2.56    0.03   0.55   7.32   0.83

SLIDE 34

Relative Measures: Silhouette Coefficient

The silhouette coefficient of each point, $s_j$, and the average SC value can be used to estimate the number of clusters in the data. The approach consists of plotting the $s_j$ values in descending order for each cluster, and to note the overall SC value for a particular value of $k$, as well as the cluster-wise SC values:
$$SC_i = \frac{1}{n_i} \sum_{x_j \in C_i} s_j$$
We then pick the value $k$ that yields the best clustering, with many points having high $s_j$ values within each cluster, as well as high values for SC and $SC_i$ ($1 \leq i \leq k$).
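A compact sketch of this selection loop using scikit-learn's KMeans and silhouette_score; the choice of library is an assumption, not something the slides prescribe:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def best_k_by_silhouette(X, k_range=range(2, 10), seed=0):
    """Return the k with the highest mean silhouette, plus all SC values."""
    scores = {}
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return max(scores, key=scores.get), scores
```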

SLIDE 35

Iris K-means: Silhouette Coefficient Plot (k = 2)

[Silhouette plot: per-point $s_j$ values in descending order for each cluster.]

SC1 = 0.706 (n1 = 97),  SC2 = 0.662 (n2 = 53)

(a) k = 2, SC = 0.706

k = 2 yields the highest silhouette coefficient, with the two clusters essentially well separated.

SLIDE 36

Iris K-means: Silhouette Coefficient Plot (k = 3)

[Silhouette plot: per-point $s_j$ values in descending order for each cluster.]

SC1 = 0.466 (n1 = 61),  SC2 = 0.818 (n2 = 50),  SC3 = 0.52 (n3 = 39)

(b) k = 3, SC = 0.598

SLIDE 37

Iris K-means: Silhouette Coefficient Plot (k = 4)

[Silhouette plot: per-point $s_j$ values in descending order for each cluster.]

SC1 = 0.376 (n1 = 49),  SC2 = 0.534 (n2 = 28),  SC3 = 0.787 (n3 = 50),  SC4 = 0.484 (n4 = 23)

(c) k = 4, SC = 0.559

SLIDE 38

Relative Measures: Calinski–Harabasz Index

Given the dataset $D = \{x_i\}_{i=1}^{n}$, the scatter matrix for $D$ is given as
$$S = n\Sigma = \sum_{j=1}^{n} (x_j - \mu)(x_j - \mu)^T$$
where $\mu = \frac{1}{n}\sum_{j=1}^{n} x_j$ is the mean and $\Sigma$ is the covariance matrix.

The scatter matrix can be decomposed into two matrices $S = S_W + S_B$, where $S_W$ is the within-cluster scatter matrix and $S_B$ is the between-cluster scatter matrix, given as
$$S_W = \sum_{i=1}^{k}\sum_{x_j \in C_i} (x_j - \mu_i)(x_j - \mu_i)^T \qquad\qquad S_B = \sum_{i=1}^{k} n_i\, (\mu_i - \mu)(\mu_i - \mu)^T$$
where $\mu_i = \frac{1}{n_i}\sum_{x_j \in C_i} x_j$ is the mean for cluster $C_i$.

SLIDE 39

Relative Measures: Calinski–Harabasz Index

The Calinski–Harabasz (CH) variance ratio criterion for a given value of $k$ is defined as follows:
$$CH(k) = \frac{tr(S_B)/(k-1)}{tr(S_W)/(n-k)} = \frac{n-k}{k-1} \cdot \frac{tr(S_B)}{tr(S_W)}$$
where $tr$ denotes the trace of a matrix.

We plot the CH values and look for a large increase in the value followed by little or no gain. We choose the value $k \geq 3$ that minimizes the term
$$\Delta(k) = \bigl(CH(k+1) - CH(k)\bigr) - \bigl(CH(k) - CH(k-1)\bigr)$$
The intuition is that we want to find the value of $k$ for which $CH(k)$ is much higher than $CH(k-1)$ and there is only a little improvement or a decrease in the $CH(k+1)$ value.
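A sketch of CH(k) and Δ(k); the helper names and the use of NumPy are mine:

```python
import numpy as np

def ch_index(X, labels):
    """Calinski-Harabasz variance ratio for one clustering."""
    labels = np.asarray(labels)
    n, k = len(X), len(np.unique(labels))
    mu = X.mean(axis=0)
    tr_SW = sum(((X[labels == c] - X[labels == c].mean(axis=0)) ** 2).sum()
                for c in np.unique(labels))
    tr_SB = sum((labels == c).sum() * ((X[labels == c].mean(axis=0) - mu) ** 2).sum()
                for c in np.unique(labels))
    return (tr_SB / (k - 1)) / (tr_SW / (n - k))

def delta(ch):
    """Delta(k) from a dict {k: CH(k)}, for k with both neighbours available."""
    return {k: (ch[k + 1] - ch[k]) - (ch[k] - ch[k - 1])
            for k in ch if k - 1 in ch and k + 1 in ch}
```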

SLIDE 40

Calinski–Harabasz Variance Ratio

CH ratio for various values of k on the Iris principal components data, using the K-means algorithm, with the best results chosen from 200 runs.

[Plot of CH(k) versus k for k = 2, ..., 9.]

The successive CH(k) and Δ(k) values are as follows:

k        2        3        4        5        6        7        8        9
CH(k)  570.25   692.40   717.79   683.14   708.26   700.17   738.05   728.63
Δ(k)     –      −96.78   −60.03    59.78   −33.22    45.97   −47.30     –

Δ(k) suggests k = 3 as the best (lowest) value.

SLIDE 41

Relative Measures: Gap Statistic

The gap statistic compares the sum of intracluster weights $W_{in}$ for different values of $k$ with their expected values assuming no apparent clustering structure, which forms the null hypothesis.

Let $C_k$ be the clustering obtained for a specified value of $k$, and let $W_{in}^{k}(D)$ denote the sum of intracluster weights (over all clusters) for $C_k$ on the input dataset $D$. We would like to compute the probability of the observed $W_{in}^{k}$ value under the null hypothesis. To obtain an empirical distribution for $W_{in}$, we resort to Monte Carlo simulations of the sampling process.

SLIDE 42

Relative Measures: Gap Statistic

We generate $t$ random samples comprising $n$ points each. Let $R_i \in \mathbb{R}^{n \times d}$, $1 \leq i \leq t$, denote the $i$th sample, and let $W_{in}^{k}(R_i)$ denote the sum of intracluster weights for a given clustering of $R_i$ into $k$ clusters. From each sample dataset $R_i$, we generate clusterings for different values of $k$ and record the intracluster values $W_{in}^{k}(R_i)$.

Let $\mu_W(k)$ and $\sigma_W(k)$ denote the mean and standard deviation of these intracluster weights for each value of $k$. The gap statistic for a given $k$ is then defined as
$$\text{gap}(k) = \mu_W(k) - \log W_{in}^{k}(D)$$
We choose $k$ as follows:
$$k^{*} = \arg\min_{k} \left\{\, \text{gap}(k) \geq \text{gap}(k+1) - \sigma_W(k+1) \,\right\}$$
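A rough Monte Carlo sketch of this procedure. The uniform null over the bounding box of D, the use of K-means, and taking µ_W(k) and σ_W(k) over log₂-weights (to match the log-scale plot on the next slides) are all assumptions on my part:

```python
import numpy as np
from sklearn.cluster import KMeans

def w_in(X, labels):
    """Sum of pairwise intracluster distances for one clustering."""
    return sum(np.triu(np.linalg.norm(X[labels == c][:, None] - X[labels == c],
                                      axis=2)).sum()
               for c in np.unique(labels))

def gap_statistic(X, k_range=range(1, 10), t=50, seed=0):
    """gap(k) and sigma_W(k) from t uniform reference samples."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    gaps, sigmas = {}, {}
    for k in k_range:
        obs = np.log2(w_in(X, KMeans(n_clusters=k, n_init=10).fit_predict(X)))
        ref = np.array([np.log2(w_in(R, KMeans(n_clusters=k, n_init=10).fit_predict(R)))
                        for R in (rng.uniform(lo, hi, size=X.shape) for _ in range(t))])
        gaps[k], sigmas[k] = ref.mean() - obs, ref.std()
    return gaps, sigmas
```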
SLIDE 43

Gap Statistic: Randomly Generated Data

[(a) Scatter plot of randomly generated data with k = 3 clusters.]

SLIDE 44

Gap Statistic: Intracluster Weights and Gap Values

[(b) Intracluster weights: observed $\log_2 W_{in}^{k}$ versus the expected value $\mu_W(k)$, as a function of $k$. (c) Gap statistic $\text{gap}(k)$ as a function of $k$.]

SLIDE 45

Gap Statistic as a Function of k

k     gap(k)   σ_W(k)   gap(k) − σ_W(k)
1     0.093    0.0456        0.047
2     0.346    0.0486        0.297
3     0.679    0.0529        0.626
4     0.753    0.0701        0.682
5     0.586    0.0711        0.515
6     0.715    0.0654        0.650
7     0.808    0.0611        0.746
8     0.680    0.0597        0.620
9     0.632    0.0606        0.571

The optimal value for the number of clusters is k = 4 because
gap(4) = 0.753 > gap(5) − σ_W(5) = 0.515
However, if we relax the gap test to be within two standard deviations, then the optimal value is k = 3 because
gap(3) = 0.679 > gap(4) − 2σ_W(4) = 0.753 − 2 · 0.0701 = 0.613

SLIDE 46

Cluster Stability

The main idea behind cluster stability is that the clusterings obtained from several datasets sampled from the same underlying distribution as D should be similar or “stable.” Stability can be used to find a good value for k, the correct number of clusters. We generate t samples of size n by sampling from D with replacement. Let Ck(Di) denote the clustering obtained from sample Di, for a given value of k. Next, we compare the distance between all pairs of clusterings Ck(Di) and Ck(Dj) using several of the external cluster evaluation measures. From these values we compute the expected pairwise distance for each value of k. Finally, the value k∗ that exhibits the least deviation between the clusterings obtained from the resampled datasets is the best choice for k because it exhibits the most stability.

SLIDE 47

Clustering Stability Algorithm

ClusteringStability (A, t, kmax, D):
    n ← |D|
    for i = 1, 2, ..., t do
        Di ← sample n points from D with replacement
    for i = 1, 2, ..., t do
        for k = 2, 3, ..., kmax do
            Ck(Di) ← cluster Di into k clusters using algorithm A
    foreach pair Di, Dj with j > i do
        Dij ← Di ∩ Dj                           // create common dataset
        for k = 2, 3, ..., kmax do
            dij(k) ← d(Ck(Di), Ck(Dj), Dij)     // distance between clusterings
    for k = 2, 3, ..., kmax do
        µd(k) ← (2 / (t(t−1))) · Σi Σj>i dij(k)
    k* ← argmin_k { µd(k) }
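A compact Python sketch of the same procedure; choosing K-means as the algorithm A and VI as the distance d between clusterings are my assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

def vi_distance(a, b):
    """Variation of information between two label vectors on the same points."""
    N = np.zeros((a.max() + 1, b.max() + 1))
    np.add.at(N, (a, b), 1)
    p = N / N.sum()
    pa, pb = p.sum(axis=1), p.sum(axis=0)
    H = lambda q: -np.sum(q[q > 0] * np.log2(q[q > 0]))
    nz = p > 0
    I = np.sum(p[nz] * np.log2(p[nz] / np.outer(pa, pb)[nz]))
    return H(pa) + H(pb) - 2 * I

def clustering_stability(X, t=20, kmax=9, seed=0):
    """Pick the k with the smallest expected pairwise VI over bootstrap samples."""
    rng = np.random.default_rng(seed)
    n = len(X)
    samples = [rng.choice(n, size=n, replace=True) for _ in range(t)]
    mu_d = {}
    for k in range(2, kmax + 1):
        maps = []                               # original index -> cluster label
        for idx in samples:
            lab = KMeans(n_clusters=k, n_init=10).fit_predict(X[idx])
            m = {}
            for pos, orig in enumerate(idx):
                m.setdefault(orig, lab[pos])
            maps.append(m)
        dists = []
        for i in range(t):
            for j in range(i + 1, t):
                common = sorted(maps[i].keys() & maps[j].keys())    # D_ij
                a = np.array([maps[i][p] for p in common])
                b = np.array([maps[j][p] for p in common])
                dists.append(vi_distance(a, b))
        mu_d[k] = float(np.mean(dists))
    return min(mu_d, key=mu_d.get), mu_d
```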
SLIDE 48

Clustering Stability: Iris Data

t = 500 bootstrap samples; best K-means from 100 runs.

[Plot of the expected pairwise similarity µ_s(k) (FM measure) and the expected pairwise distance µ_d(k) (VI) as a function of k.]

The best choice is k = 2.

SLIDE 49

Clustering Tendency: Spatial Histogram

Clustering tendency or clusterability aims to determine whether the dataset $D$ has any meaningful groups to begin with.

Let $X_1, X_2, \dots, X_d$ denote the $d$ dimensions. Given $b$, the number of bins for each dimension, we divide each dimension $X_j$ into $b$ equi-width bins, and simply count how many points lie in each of the $b^d$ $d$-dimensional cells. From this spatial histogram, we can obtain the empirical joint probability mass function (EPMF) for the dataset $D$:
$$f(\mathbf{i}) = P(x_j \in \text{cell } \mathbf{i}) = \frac{\left|\{x_j \in \text{cell } \mathbf{i}\}\right|}{n}$$
where $\mathbf{i} = (i_1, i_2, \dots, i_d)$ denotes a cell index, with $i_j$ denoting the bin index along dimension $X_j$.

SLIDE 50

Clustering Tendency: Spatial Histogram

We generate $t$ random samples, each comprising $n$ points within the same $d$-dimensional space as the input dataset $D$. Let $R_j$ denote the $j$th such random sample. We then compute the corresponding EPMF $g_j(\mathbf{i})$ for each $R_j$, $1 \leq j \leq t$.

We next compute how much the distribution $f$ differs from $g_j$ (for $j = 1, \dots, t$), using the Kullback–Leibler (KL) divergence from $f$ to $g_j$, defined as
$$KL(f \mid g_j) = \sum_{\mathbf{i}} f(\mathbf{i}) \log\!\left(\frac{f(\mathbf{i})}{g_j(\mathbf{i})}\right)$$
The KL divergence is zero only when $f$ and $g_j$ are the same distribution. Using these divergence values, we can compute how much the dataset $D$ differs from a random dataset.
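A sketch of the spatial-histogram test; the bin count b, the uniform null over the bounding box of D, and skipping empty cells in the KL sum are simplifications I introduce here:

```python
import numpy as np

def spatial_epmf(X, b, ranges):
    """EPMF over a grid with b equi-width bins per dimension."""
    H, _ = np.histogramdd(X, bins=b, range=ranges)
    return H.ravel() / len(X)

def spatial_kl_divergences(X, b=5, t=500, seed=0):
    """KL(f | g_j) between the data EPMF and t uniform-sample EPMFs."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    ranges = list(zip(lo, hi))
    f = spatial_epmf(X, b, ranges)
    kls = []
    for _ in range(t):
        g = spatial_epmf(rng.uniform(lo, hi, size=X.shape), b, ranges)
        nz = (f > 0) & (g > 0)               # skip cells where either EPMF is zero
        kls.append(np.sum(f[nz] * np.log(f[nz] / g[nz])))
    return np.array(kls)
```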

SLIDE 51

Spatial Histogram: Iris Data versus Uniform

[(a) Iris data: spatial grid cells over the principal components ($u_1$, $u_2$). (b) Uniform sample: spatial grid cells.]

SLIDE 52

Spatial Histogram: Empirical PMF

[(c) Empirical probability mass function over the spatial cells: Iris ($f$) versus uniform ($g_j$).]

SLIDE 53

Spatial Histogram: KL Divergence Distribution

[(d) Distribution of the KL divergence values.]

We generated t = 500 random samples from the null distribution, and computed the KL divergence from f to gj for each 1 ≤ j ≤ t. The mean KL value is µKL = 1.17, with a standard deviation of σKL = 0.18.

SLIDE 54

Clustering Tendency: Distance Distribution

We can compare the pairwise point distances from $D$ with those from the randomly generated samples $R_i$ from the null distribution. We create the EPMF from the proximity matrix $W$ for $D$ by binning the distances into $b$ bins:
$$f(i) = P(w_{pq} \in \text{bin } i \mid x_p, x_q \in D,\ p < q) = \frac{\left|\{w_{pq} \in \text{bin } i\}\right|}{n(n-1)/2}$$
Likewise, for each of the samples $R_j$, we determine the EPMF for the pairwise distances, denoted $g_j$. Finally, we compute the KL divergences between $f$ and $g_j$. The expected divergence indicates the extent to which $D$ differs from the null (random) distribution.

SLIDE 55

Iris Data: Distance Distribution

[(a) EPMF of the pairwise distances: Iris ($f$) versus uniform ($g_j$).]

SLIDE 56

Iris Data: Distance Distribution

[(b) Distribution of the KL divergence values for the pairwise-distance EPMFs.]

SLIDE 57

Clustering Tendency: Hopkins Statistic

Given a dataset $D$ comprising $n$ points, we generate $t$ uniform subsamples $R_i$ of $m$ points each, sampled from the same data space as $D$. We also generate $t$ subsamples of $m$ points directly from $D$, using sampling without replacement. Let $D_i$ denote the $i$th direct subsample.

Next, we compute the minimum distance between each point $x_j \in D_i$ and points in $D$:
$$\delta_{min}(x_j) = \min_{x_i \in D,\ x_i \neq x_j} \left\{ \delta(x_j, x_i) \right\}$$
We also compute the minimum distance $\delta_{min}(y_j)$ between a point $y_j \in R_i$ and points in $D$.

The Hopkins statistic (in $d$ dimensions) for the $i$th pair of samples $R_i$ and $D_i$ is then defined as
$$HS_i = \frac{\sum_{y_j \in R_i} \left(\delta_{min}(y_j)\right)^d}{\sum_{y_j \in R_i} \left(\delta_{min}(y_j)\right)^d + \sum_{x_j \in D_i} \left(\delta_{min}(x_j)\right)^d}$$
If the data is well clustered, we expect the $\delta_{min}(x_j)$ values to be smaller compared to the $\delta_{min}(y_j)$ values, and in this case $HS_i$ tends to 1.
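A sketch of the Hopkins statistic using a k-d tree for the nearest-neighbour queries (the tooling and helper name are my assumptions):

```python
import numpy as np
from scipy.spatial import cKDTree

def hopkins_statistic(X, m=30, t=500, seed=0):
    """Mean Hopkins statistic over t pairs of subsamples."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    lo, hi = X.min(axis=0), X.max(axis=0)
    tree = cKDTree(X)
    hs = []
    for _ in range(t):
        R = rng.uniform(lo, hi, size=(m, d))          # uniform sample over the data space
        Di = X[rng.choice(n, size=m, replace=False)]  # direct subsample from D
        d_R, _ = tree.query(R, k=1)                   # nearest data point to each y_j
        d_D, _ = tree.query(Di, k=2)                  # k=2 skips the point itself
        num = np.sum(d_R ** d)
        hs.append(num / (num + np.sum(d_D[:, 1] ** d)))
    return float(np.mean(hs))
```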

SLIDE 58

Iris Data: Hopkins Statistic Distribution

[Distribution of the Hopkins statistic over the sample pairs.]

Number of sample pairs t = 500, subsample size m = 30. The mean of the Hopkins statistic is µ_HS = 0.935, with a standard deviation of σ_HS = 0.025.

SLIDE 59

Data Mining and Machine Learning: Fundamental Concepts and Algorithms

dataminingbook.info

Mohammed J. Zaki¹  Wagner Meira Jr.²

¹Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY, USA

²Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil

Chapter 17: Clustering Validation
