Introduction NbClust package Examples Conclusion
NbClust Package: finding the relevant number of clusters in a dataset
Malika Charrad, Nadia Ghazzali, Véronique Boiteau, Azam Niknafs
Laval University, Quebec, Canada
June 13th, 2012
UseR! 2012, Nashville

Outline
1. Introduction
2. NbClust package
3. Examples
4. Conclusion

Introduction
Clustering is the task of assigning a set of objects to groups (clusters) so that objects in the same cluster are more similar to each other than to objects in other clusters.
Most clustering algorithms depend on input parameters such as the number of clusters, the minimum number of objects in a cluster, or the diameter of a cluster.
⇒ Selecting different parameter values leads to different clusterings of the data.
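As a small illustration of this point (not taken from the slides), the sketch below cuts one hierarchical clustering of the built-in iris measurements at several numbers of clusters, using base R only. The choice of complete linkage and of k = 2, 3, 4 is an assumption made purely for the example.

```r
# Hierarchical clustering of the four numeric iris measurements (base R only).
x  <- as.matrix(iris[, 1:4])
hc <- hclust(dist(x, method = "euclidean"), method = "complete")

# Cut the same tree at k = 2, 3 and 4 clusters and record the cluster sizes.
ks    <- 2:4
sizes <- lapply(ks, function(k) table(cutree(hc, k = k)))
names(sizes) <- paste0("k=", ks)

# Each element shows how the same 150 observations are split for a given k.
print(sizes)
```

The same observations end up grouped differently for each k, which is why the number of clusters has to be chosen with some criterion.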

How many clusters are there in the dataset?

How to select the best number of clusters in a dataset?
If the parameters of the clustering algorithm are assigned improper values, the method may produce a partitioning scheme that is not optimal ⇒ wrong decisions.
The user faces the dilemma of selecting the number of clusters in the dataset.
The problem of deciding which number of clusters best fits a dataset, together with the evaluation of the clustering results, is known as cluster validity.

Related work
Milligan and Cooper (1985) examined 30 indices with simulated data.
Other criteria were not examined in the Milligan and Cooper study, such as:
- Dunn index (Dunn, 1974)
- Silhouette statistic (Rousseeuw, 1987)
- Gap statistic (Tibshirani, 2001)
- Dindex (Lebart, 2000)
- SD index and SDbw index (Halkidi et al., 2000, 2001)
- Hubert statistic (Hubert and Arabie, 1985)

Related work
⇒ 19 of the existing indices are implemented in SAS and in the R packages cclust, clusterSim, clv and clValid.

NbClust package
1. The NbClust package provides 30 indices to determine the number of clusters.
11 other indices:
- "duda": Duda and Hart (1973)
- "beale": Beale (1969)
- "gplus": Rohlf (1974), Milligan (1981)
- "frey": Frey and Van Groenewoud (1972)
- "tau": Rohlf (1974), Milligan (1981)
- "mcclain": McClain and Rao (1975)
- "gap": Tibshirani (2001)
- "dindex": Lebart (2000)
- "hubert": Hubert and Arabie (1985)
- "sdindex": Halkidi et al. (2000)
- "sdbw": Halkidi et al. (2001)
2. NbClust offers the user the best clustering scheme among the different results.

NbClust function
NbClust(data, diss="NULL", distance="euclidean", min.nc=2, max.nc=15, method="ward", index="all", alphaBeale=0.1)
Arguments:
- data: matrix or data set.
- diss: dissimilarity matrix to be used. By default, diss="NULL"; if a dissimilarity matrix is supplied instead, distance should be "NULL".
- distance: the distance measure used to compute the dissimilarity matrix. This must be one of "euclidean", "maximum", "manhattan", "canberra", "binary", "minkowski" or "NULL".

NbClust function: arguments (continued)
- min.nc: minimum number of clusters, between 2 and (number of objects - 1).
- max.nc: maximum number of clusters, between 2 and (number of objects - 1), greater than or equal to min.nc.
- method: the cluster analysis method to be used. Available methods are "ward", "single", "complete", "average", "mcquitty", "median", "centroid" and "kmeans".

NbClust function: arguments (continued)
- index: the index to be calculated. This should be one of "kl", "ch", "hartigan", "ccc", "scott", "marriot", "trcovw", "tracew", "friedman", "rubin", "cindex", "db", "silhouette", "duda", "pseudot2", "beale", "ratkowsky", "ball", "ptbiserial", "gap", "frey", "mcclain", "gamma", "gplus", "tau", "dunn", "hubert", "sdindex", "dindex", "sdbw", or:
  "alllong": all indices included;
  "all": all indices except Gap, Gamma, Gplus and Tau.
- alphaBeale: significance value for Beale's index.
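A call using the arguments described above might look like the following sketch. It is not taken from the slides: the dataset (the built-in iris measurements), the range 2 to 8, the complete linkage and the single "silhouette" index are all choices made for the example, and the code runs only if the NbClust package is installed. Releases later than the one presented here may accept different argument values (for instance for Ward's method or for diss), so the help page of the installed version should be checked.

```r
# Sketch of a single-index call; runs only if the NbClust package is installed.
res <- NULL
if (requireNamespace("NbClust", quietly = TRUE)) {
  x <- as.matrix(iris[, 1:4])          # numeric measurements only
  res <- NbClust::NbClust(
    data     = x,
    distance = "euclidean",            # dissimilarity computed from the data
    min.nc   = 2,
    max.nc   = 8,
    method   = "complete",             # linkage chosen for this example
    index    = "silhouette"            # one index instead of "all"
  )
  print(res$All.index)                 # index value for each number of clusters
  print(res$Best.nc)                   # proposed number of clusters and its index value
}
```

Requesting one index keeps the run short; index="all" or index="alllong" would compute the full set at a higher cost.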

Example 1: Simulated dataset with 2 variables and 4 clusters
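The simulated data used on the slide are not available, so the sketch below generates a hypothetical dataset of the same shape (2 variables, 4 well-separated Gaussian clusters) and computes one of the listed indices, the Calinski-Harabasz (CH) index, by hand for k-means partitions in base R. The cluster centres, sample sizes, seed and the use of k-means are assumptions made for the illustration, not the authors' setup.

```r
set.seed(123)                                   # reproducible simulation

# Hypothetical data: 4 Gaussian clusters in 2 variables, 50 points each.
centers <- rbind(c(0, 0), c(10, 0), c(0, 10), c(10, 10))
x <- do.call(rbind, lapply(seq_len(nrow(centers)), function(i) {
  cbind(rnorm(50, mean = centers[i, 1]), rnorm(50, mean = centers[i, 2]))
}))

# Calinski-Harabasz index for a k-means partition:
# CH(k) = [B / (k - 1)] / [W / (n - k)],
# with B the between-cluster and W the within-cluster sum of squares.
ch_index <- function(x, k) {
  km <- kmeans(x, centers = k, nstart = 25)
  n  <- nrow(x)
  (km$betweenss / (k - 1)) / (km$tot.withinss / (n - k))
}

ks <- 2:8
ch <- sapply(ks, function(k) ch_index(x, k))
best_k <- ks[which.max(ch)]                     # the k with the largest CH value is proposed
print(data.frame(k = ks, CH = round(ch, 1)))
print(best_k)
```

With clusters this far apart, the CH curve is expected to peak at the number of simulated groups; on real data the peak is usually less clear-cut, which is the motivation for comparing several indices.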

NbClust output: Gap index

NbClust output: "alllong" option
[All.index] Values of the indices for each partition of the dataset obtained with a number of clusters between min.nc and max.nc.

NbClust output: Critical values
[All.CriticalValues] Critical values of some indices for each partition obtained with a number of clusters between min.nc and max.nc.

NbClust output: Best number of clusters
[Best.nc] Best number of clusters proposed by each index and the corresponding index value.

NbClust output: Hubert index and Dindex

Example 2: Iris dataset (Fisher, 1936)
The Iris dataset is composed of 3 species: "Setosa", "Virginica" and "Versicolor".

Clustering of Iris dataset (1)

How to decide on the correct number of clusters?
1. Majority rule: the user can select the number of clusters proposed by the majority of indices, e.g. 4 in the 1st example and 3 in the 2nd example.
2. The user can consider only the indices that performed best in simulation studies. The top 5 indices in the Milligan and Cooper study are: CH index, Duda index, Cindex, Gamma and Beale index.
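The majority rule can be sketched in a few lines of base R. The helper name majority_vote and the proposals below are hypothetical, invented for the illustration; they are not results from the slides.

```r
# Return the number of clusters proposed by the largest number of indices.
# Ties are resolved in favour of the smallest number of clusters.
majority_vote <- function(proposals) {
  counts <- table(proposals)
  as.integer(names(counts)[which.max(counts)])
}

# Made-up proposals from six indices (not results from the slides).
proposals <- c(kl = 4, ch = 4, hartigan = 3, silhouette = 4, db = 2, dunn = 4)
print(table(proposals))
print(majority_vote(proposals))
```

Restricting the vote to a subset of indices, such as the five best performers in the Milligan and Cooper study, only requires subsetting the proposals vector by name before calling the helper.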

Conclusion
The NbClust package provides a large list of indices, many of which are not implemented elsewhere. The current version contains up to 30 indices.
The NbClust package lets the user simultaneously vary the number of clusters, the clustering method and the indices to decide how best to group the observations in a dataset, or to compare all indices or clustering methods.
The NbClust package is available at http://cran.r-project.org/web/packages/NbClust/index.html

Thank you!
