Data Clustering with R
Yanchang Zhao
http://www.RDataMining.com
R and Data Mining Course Beijing University of Posts and Telecommunications, Beijing, China
July 2019
Contents
Introduction
Data Clustering with R
The Iris Dataset
∗ package name::function name()
† Chapter 6 - Clustering, in R and Data Mining: Examples and Case Studies.
‡ https://archive.ics.uci.edu/ml/datasets/Iris
[Figure: scatter plot with axes Sepal.Length (4.5 to 8.0) and Sepal.Width (2.0 to 4.0)]
[Figure: cluster plot on Component 1 and Component 2 (these two components explain 95.81 % of the point variability), next to a silhouette plot with silhouette width s_i on a 0.0 to 1.0 scale]
Silhouette summary: n = 150, 3 clusters, average silhouette width 0.55
Cluster j : size n_j | average s_i over cluster C_j
1 : 50 | 0.80
2 : 62 | 0.42
3 : 38 | 0.45
[Figure: clusplot(pam(x = sdata, k = k, diss = diss)) on Component 1 and Component 2 (these two components explain 95.81 % of the point variability), next to the silhouette plot of pam(x = sdata, k = k, diss = diss) with silhouette width s_i on a 0.0 to 1.0 scale]
Silhouette summary: n = 150, 2 clusters, average silhouette width 0.69
Cluster j : size n_j | average s_i over cluster C_j
1 : 51 | 0.81
2 : 99 | 0.62
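The silhouette summaries above compare a three-cluster and a two-cluster PAM solution. A minimal sketch of such a comparison follows. It assumes the `cluster` package is installed and that clustering is run on the four numeric columns of R's built-in `iris` data; the variable names (`iris2`, `fits`, `fit3`) are illustrative and not taken from the slides.

```r
library(cluster)  # provides pam() and the silhouette information it returns

# Assumption: cluster on the four numeric measurements, with Species dropped.
iris2 <- iris[, 1:4]

# Fit PAM for k = 2 and k = 3 and collect each average silhouette width.
ks <- 2:3
fits <- lapply(ks, function(k) pam(iris2, k = k))
avg.width <- sapply(fits, function(fit) fit$silinfo$avg.width)
names(avg.width) <- paste0("k=", ks)
print(round(avg.width, 2))

# Cluster sizes and per-cluster average silhouette widths for k = 3.
fit3 <- fits[[2]]
print(table(fit3$clustering))
print(round(fit3$silinfo$clus.avg.widths, 2))

# Cross-tabulate clusters against species (labels are used only for checking).
print(table(fit3$clustering, iris$Species))
```

A higher average silhouette width suggests better-separated clusters, which is one common way to choose between candidate values of k.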
[Figure: cluster dendrogram of dist(iris3) from hclust (*, "average"); vertical axis: Height (1 to 4); leaves labelled by species (setosa, versicolor, virginica)]
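A dendrogram like the one above can be sketched with base R alone. The example below assumes that `iris3` is a 40-row sample of the four numeric iris columns (the figure shows 40 leaves); the seed, the sample itself, and the choice of three groups are illustrative assumptions rather than the values used in the slides.

```r
# Assumption: a random sample of 40 flowers; the seed is arbitrary.
set.seed(42)
idx <- sample(nrow(iris), 40)
iris3 <- iris[idx, 1:4]  # numeric measurements only

# Average-linkage hierarchical clustering on Euclidean distances.
hc <- hclust(dist(iris3), method = "average")

# Draw the dendrogram with species names as leaf labels,
# then outline the three-cluster cut.
plot(hc, hang = -1, labels = as.character(iris$Species[idx]))
rect.hclust(hc, k = 3)

# Cut the tree into three groups and compare them with the species labels.
groups <- cutree(hc, k = 3)
print(table(groups, iris$Species[idx]))
```

Sampling keeps the leaf labels readable; on all 150 rows the same calls work, but the labels overlap.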
[Figure: scatter plot with axes Sepal.Length (4.5 to 8.0) and Petal.Width (0.5 to 2.5)]
[Figure: points drawn as their cluster numbers (0 to 3) on axes dc 1 and dc 2]
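In the figure above, points appear as cluster numbers, with a few labelled 0. A density-based clustering of this kind can be sketched as follows, assuming the `fpc` package is installed. The `eps` and `MinPts` values are illustrative assumptions, and in `fpc::dbscan()` output a cluster label of 0 marks noise points.

```r
library(fpc)  # provides dbscan() and plotcluster()

# Assumption: cluster on the four numeric iris measurements.
iris2 <- iris[, 1:4]

# Density-based clustering; eps and MinPts are illustrative choices.
ds <- dbscan(iris2, eps = 0.42, MinPts = 5)

# Compare cluster labels with species; label 0 marks noise points.
print(table(ds$cluster, iris$Species))

# Plot each point as its cluster number on two projection axes.
plotcluster(iris2, ds$cluster)
```

Results are sensitive to both parameters: a larger `eps` tends to merge clusters, while a smaller one leaves more points as noise.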
[Figure: scatter plot with axes Sepal.Length (4.5 to 8.0) and Petal.Width (0.5 to 2.5)]
Alsabti, K., Ranka, S., and Singh, V. (1998). An efficient k-means clustering algorithm. In Proc. the First Workshop on High Performance Data Mining, Orlando, Florida.
Ankerst, M., Breunig, M. M., Kriegel, H.-P., and Sander, J. (1999). OPTICS: ordering points to identify the clustering structure. In SIGMOD '99: Proceedings of the 1999 ACM SIGMOD international conference on Management of data, pages 49–60, New York, NY, USA. ACM Press.
Ester, M., Kriegel, H.-P., Sander, J., and Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD, pages 226–231.
Frank, A. and Asuncion, A. (2010). UCI Machine Learning Repository. University of California, Irvine, School of Information and Computer Sciences. http://archive.ics.uci.edu/ml.
Guha, S., Rastogi, R., and Shim, K. (1998). CURE: an efficient clustering algorithm for large databases. In SIGMOD '98: Proceedings of the 1998 ACM SIGMOD international conference on Management of data, pages 73–84, New York, NY, USA. ACM Press.
Guha, S., Rastogi, R., and Shim, K. (1999). ROCK: A robust clustering algorithm for categorical attributes. In Proceedings of the 15th International Conference on Data Engineering, 23-26 March 1999, Sydney, Australia, pages 512–521. IEEE Computer Society.
Han, J. and Kamber, M. (2000). Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.
Hennig, C. (2014). fpc: Flexible procedures for clustering. R package version 2.1-9.
Hinneburg, A. and Keim, D. A. (1998). An efficient approach to clustering in large multimedia databases with noise. In KDD, pages 58–65. DENCLUE.
Huang, Z. (1998). Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Min. Knowl. Discov., 2(3):283–304.
Karypis, G., Han, E.-H., and Kumar, V. (1999). Chameleon: hierarchical clustering using dynamic modeling. Computer, 32(8):68–75.
Kaufman, L. and Rousseeuw, P. J. (1990). Finding Groups in Data: An Introduction to Cluster Analysis. Wiley Series in Probability and Mathematical Statistics. Applied Probability and Statistics. New York: Wiley.
MacQueen, J. B. (1967). Some methods of classification and analysis of multivariate observations. In the Fifth Berkeley Symposium on Mathematical Statistics and Probability.
Maechler, M., Rousseeuw, P., Struyf, A., Hubert, M., and Hornik, K. (2016). cluster: Cluster Analysis Basics and Extensions. R package version 2.0.4. For new features, see the 'Changelog' file (in the package source).
Ng, R. T. and Han, J. (1994). Efficient and effective clustering methods for spatial data mining. In VLDB '94: Proceedings of the 20th International Conference on Very Large Data Bases, pages 144–155, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.
R Core Team (2015). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
Zhang, T., Ramakrishnan, R., and Livny, M. (1996). BIRCH: an efficient data clustering method for very large databases. In SIGMOD '96: Proceedings of the 1996 ACM SIGMOD international conference on Management of data, pages 103–114, New York, NY, USA. ACM Press.