Clustering methods
R.W. Oldford
Interactive data visualization
An important advantage of data visualization is that much structure (e.g. density, groupings, regular patterns, relationships, outliers, connections across dimensions, etc.) can easily be seen visually, even though it might be more difficult to describe mathematically. Moreover, the structure observed need not have been anticipated. Interaction which allows fairly arbitrary changes to the plot, via direct manipulation (e.g. mouse gestures) and also via the command line (i.e. programmatically), further enables the analyst, providing quick and easy data queries, marking of structure, and, when the visualizations are themselves data structures, quick setting and extraction of observed information. Direct interaction amplifies the advantage of data visualization and creates a powerful tool for uncovering structure. In contrast, we might choose to have some statistical algorithm search for structure in the data. This of course requires specifying in advance how that structure might be described mathematically.
Interactive data visualization
The two approaches naturally complement one another.
◮ Structure searched for algorithmically must be precisely characterised mathematically, and so is necessarily determined prior to the analysis.
◮ Interactive data visualization depends on the human visual system, which has evolved over millions of years to be able to see patterns, both anticipated and not.
In the hands of an experienced analyst, one complements and amplifies the other; the two are worked together to give much greater insight than either approach could alone. We have already seen the value of using both in conjunction with one another in, for example, hypothesis testing, density estimation, and smoothing.
Finding groups in data
Oftentimes we observe that the data have grouped together in patterns. In a scatterplot, for example, we might notice that the observations concentrate more in some areas than they do in others.
The "Old Faithful" ‘geyser‘ data (from the ‘MASS‘ package). A simple scatterplot with larger point sizes with alpha blending shows
◮ 3 regions of concentration,
◮ 3 vertical lines,
◮ and a few outliers.
Contours of constant kernel density estimate show
◮ two modes at right,
◮ a higher mode at left, and
◮ a smooth continuous mathematical function.
Perhaps the points could be automatically grouped by using the contours?
[Figure: scatterplot of the geyser data, duration (x, roughly 1 to 5) against waiting (y, roughly 50 to 110), overlaid with kernel density contours at levels 0.002 to 0.014.]
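A minimal sketch of how such a view might be produced in base R (assuming the MASS package, which supplies both the geyser data and kde2d()):

library(MASS)   # geyser data and kde2d()

# Scatterplot with larger, alpha-blended points
with(geyser, plot(duration, waiting, pch = 19, cex = 1.5,
                  col = adjustcolor("black", alpha.f = 0.3)))

# Overlay contours of a kernel density estimate
dens <- with(geyser, kde2d(duration, waiting, n = 100))
contour(dens, add = TRUE, col = "grey50")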
Finding groups in data - K-means
A great many methods exist (and continue to be developed) to automatically find groups in data. These have historically been called clustering methods by data analysts and more recently are sometimes called unsupervised learning methods (in the sense that we do not know the “classes” of the observations as in “supervised learning”) by many artificial intelligence researchers.

One of the earliest clustering methods is “K-means”. In its simplest form, it begins with the knowledge that there are exactly K clusters to be determined. The idea is to identify K clusters, $C_1, \ldots, C_K$, where every multivariate observation $x_i$ for $i = 1, \ldots, n$ in the data set appears in one and only one cluster $C_k$. The clusters are to be chosen so that the total within cluster spread is as small as possible. For every cluster, the total spread for that cluster is measured by the sum of squared Euclidean distances from the cluster “centroid”, namely
$$SSE_k = \sum_{i \in C_k} d^2(i, k) = \sum_{i \in C_k} \| x_i - c_k \|^2$$
where $c_k$ is the cluster “centroid”. Typically, the cluster average $\bar{x}_k = \sum_{i \in C_k} x_i / n_k$ (where $n_k$ denotes the cardinality of cluster $C_k$) is chosen as the cluster centroid (i.e. choose $c_k = \bar{x}_k$). The K clusters are chosen to minimize $\sum_{k=1}^{K} SSE_k$.
Algorithms typically begin with “seed” centroids c1, . . . , cK, possibly randomly chosen, then assign every observation to its nearest centroid. Each centroid is then recalculated based on the values of xi ∀ i ∈ Ck (e.g. ck = xk), and the data are reassigned to the new centroids. Repeat until there is no change in the clustering.
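A minimal sketch of this iteration (essentially Lloyd's algorithm; the function name and structure here are illustrative, not the kmeans() implementation):

simple_kmeans <- function(X, K, max_iter = 100) {
  X <- as.matrix(X)
  # Seed centroids: K randomly chosen observations
  centroids <- X[sample(nrow(X), K), , drop = FALSE]
  cluster <- rep(0L, nrow(X))
  for (iter in seq_len(max_iter)) {
    # Assign every observation to its nearest centroid (smallest squared distance)
    d2 <- sapply(seq_len(K), function(k) rowSums(sweep(X, 2, centroids[k, ])^2))
    new_cluster <- max.col(-d2)
    if (all(new_cluster == cluster)) break   # no change in the clustering: stop
    cluster <- new_cluster
    # Recompute each centroid as the average of its cluster
    for (k in seq_len(K))
      centroids[k, ] <- colMeans(X[cluster == k, , drop = FALSE])
  }
  list(cluster = cluster, centers = centroids)
}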
Finding groups in data - K-means
There are several implementations of K-means in R; several of the classical algorithms are available through the base R function kmeans().
# First scale the data
data <- as.data.frame(scale(geyser[, c("duration", "waiting")]))
result <- kmeans(data, centers = 3)
str(result)

## List of 9
##  $ cluster     : Named int [1:299] 3 2 1 3 3 2 1 3 2 1 ...
##   ..- attr(*, "names")= chr [1:299] "1" "2" "3" "4" ...
##  $ centers     : num [1:3, 1:2] 0.844 -1.273 0.56 -1.242 0.787 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:3] "1" "2" "3"
##   .. ..$ : chr [1:2] "duration" "waiting"
##  $ totss       : num 596
##  $ withinss    : num [1:3] 25.6 31.8 23.7
##  $ tot.withinss: num 81.2
##  $ betweenss   : num 515
##  $ size        : int [1:3] 101 107 91
##  $ iter        : int 2
##  $ ifault      : int 0
##  - attr(*, "class")= chr "kmeans"
The cluster component identifies the cluster (of the three) to which each observation has been assigned.
Finding groups in data - K-means
Plotting this information in loon
library(loon)
## Loading required package: tcltk
p <- l_plot(data, linkingGroup = "geyser", showGuides = TRUE)
# Add the density contours (kde2d() is from the MASS package)
l_layer_contourLines(p, kde2d(data$duration, data$waiting, n = 100), color = "grey")
## loon layer "lines" of type lines of plot .l0.plot
## [1] "layer0"
# Colour the clusters
p['color'] <- result$cluster
Finding groups in data - K-means
Plotting this information in loon, which looks pretty good.
Finding groups in data - K-means
Had we selected only K = 2: which we might not completely agree with.
Finding groups in data - K-means
How about K = 4?: with which, again, we might or might not agree.
Finding groups in data - K-means
How about K = 5?: with which, again, we might or might not agree.
Finding groups in data - K-means
How about K = 6?: with which, again, we might or might not agree.
Finding groups in data - K-means
Let’s try K = 6 again: which is different!
Finding groups in data - K-means
Some comments and questions:
◮ K-means depends on total squared Euclidean distance to the centroids
◮ K-means implicitly presumes that the clusters will be "globular" or "spherical"
◮ different clusters might arise on different calls (random starting positions for the centroids; see the sketch after this list)
◮ how do we choose K?
◮ should we enforce a hierarchy on the clusters?
◮ with an interactive visualization, we should be able to readjust the clusters by changing their colours
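For instance, the dependence on the random starting centroids can be tamed by fixing the random seed and/or asking kmeans() for several random starts (a sketch; the seed and nstart value are arbitrary):

set.seed(314159)                                   # reproducible starting centroids
result3 <- kmeans(data, centers = 3, nstart = 25)  # keep the best of 25 random starts
result3$tot.withinss                               # total within-cluster sum of squares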
Finding groups in data - model based clustering
A related approach to K-means, but one that is much more general and which comes from a different reasoning base, is that of so-called “model-based clustering”. Here, the main idea is that the data $x_i$ are a sample of independently and identically distributed (iid) multivariate observations from some multivariate mixture distribution. That is,
$$X_1, \ldots, X_n \sim f_p(x\,;\,\Theta)$$
where $f_p(x\,;\,\Theta)$ is a $p$-variate continuous density parameterized by some collection of parameters $\Theta$ that can be expressed as a finite mixture of individual $p$-variate densities $g_p(\cdot)$:
$$f_p(x\,;\,\Theta) = \sum_{k=1}^{K} \alpha_k\, g_p(x\,;\,\theta_k).$$
Here $\alpha_k \geq 0$, $\sum_{k=1}^{K} \alpha_k = 1$, and the individual densities $g_p(x\,;\,\theta_k)$ are of known shape and are identical up to differences given by their individual parameter vectors $\theta_k$. Neither $\alpha_k$ nor $\theta_k$ are known for any $k = 1, \ldots, K$ and must be estimated from the observations (i.e. $\Theta = \{\alpha_1, \ldots, \alpha_K, \theta_1, \ldots, \theta_K\}$).

Typically the $g_p(x\,;\,\theta_k)$ are taken to be multivariate Gaussian densities of the form $g_p(x\,;\,\theta_k) = \phi_p(x\,;\,\mu_k, \Sigma_k)$ with
$$\phi_p(x\,;\,\mu_k, \Sigma_k) = (2\pi)^{-p/2}\, |\Sigma_k|^{-1/2}\, e^{-\frac{1}{2}(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k)}$$
and $\theta_k = (\mu_k, \Sigma_k)$.
Finding groups in data - model based clustering
We can imagine fitting the mixture model to the data for fixed $K$ and $\alpha_k > 0\ \forall\, k$ via, say, maximum likelihood. This can be accomplished by introducing latent variates $z_{ik}$ which are 1 when $x_i$ came from the $k$th mixture component and zero otherwise. Suffice to say that the $z_{ik}$s are treated as “missing” and that an “EM” or “Expectation-Maximization” algorithm is then used to perform maximum likelihood estimation on the finite mixture.

The parameters $\mu_k$ and $\Sigma_k$ can also be constrained (the eigendecomposition $\Sigma_k = O D O^T$, with $O$ orthogonal and $D$ diagonal, is useful in this) to restrict the problem further.

An information criterion like the “Bayesian Information Criterion”, or “BIC”, is used to compare values of $K$ across models. This adds a negative penalty to the log-likelihood that penalizes larger models. Look for models that have high information (as measured by BIC). (Note that some writers (e.g. Wikipedia) use minus this and hence minimize their objective function.)

Note that by using a Gaussian model the clusters are inherently taken to be elliptically shaped. An implementation of model-based clustering can be found in the R package mclust.
Finding groups in data - model based clustering
For the geyser data
library(mclust)
resultmc <- Mclust(data, G = 1:10)   # G = number of mixtures to consider
plot(resultmc, what = "BIC")         # (VVV is the most general)
[Figure: BIC (about −1700 to −1100) plotted against the number of components (1 to 10) for each of the mclust covariance models EII, VII, EEI, VEI, EVI, VVI, EEE, EVE, VEE, VVE, EEV, VEV, EVV, VVV.]
Finding groups in data - model based clustering
For the geyser data
summary(resultmc)

## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust VVI (diagonal, varying volume and shape) model with 4 components:
##
##  log.likelihood   n df       BIC       ICL
##       -502.1488 299 19 -1112.606 -1142.556
##
## Clustering table:
##   1   2   3   4
##  90  17  98  94
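As a quick arithmetic check (assuming mclust's convention that BIC = 2 log L − df × log n, which is maximized), the reported BIC can be reproduced from the summary above:

2 * (-502.1488) - 19 * log(299)   # 2 * log-likelihood - df * log(n)
## [1] -1112.606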
We could check the agreement with the K-means result for K = 4 (stored here in result4):
table(result4$cluster, resultmc$classification)

##
##      1  2  3  4
##   1 82  4  0  0
##   2  8  0 39  0
##   3  0  0 59  0
##   4  0 13  0 94
Finding groups in data - model based clustering
And update the loon plot
p['color'] <- resultmc$classification

which is interesting . . .
Finding groups in data - model based clustering
Plotting the mixtures
plot(resultmc, what = "classification")
[Figure: mclust "Classification" plot, duration against waiting (scaled), showing the four fitted components.]
The model is imposing its structure on the data.
Dissimilarity based methods
There are a number of clustering methods which are based entirely on some measure of the dissimilarity $\delta_{ij} = \delta(x_i, x_j)$ between every pair $(i, j)$ of observations $x_i$ and $x_j$ (for all $i \neq j$).
Usually (but not always) the dissimilarity $\delta_{ij}$ is a distance function in that it obeys the axioms of a metric. Namely, for all vectors $x$, $y$, $z$, a function $d(x, y)$ is a metric or distance if it satisfies
1. $d(x, y) \geq 0$
2. $d(x, y) = 0 \iff x = y$
3. $d(x, y) = d(y, x)$
4. $d(x, z) \leq d(x, y) + d(y, z)$
Dissimilarity measures δ(x, y) typically obey the first three of these; some do not additionally obey the fourth (the triangle inequality). When all four hold, so that the measure is a distance, we will denote the dissimilarity by dij rather than by δij.
Example distances - some common choices
There are numerous functions that could be used, depending on the application. Some examples for x, y ∈ Rp include:
◮ Euclidean distance:
$$d(x, y) = \left((x - y)^T (x - y)\right)^{\frac{1}{2}}$$
◮ Minkowski distance or k-norm distance ($k > 0$):
$$d(x, y) = \left(\sum_{j=1}^{p} |x_j - y_j|^k\right)^{\frac{1}{k}}$$
◮ City block distance (or Manhattan, or Taxicab):
$$d(x, y) = \sum_{j=1}^{p} |x_j - y_j|$$
◮ Infinity norm, or supremum, or maximum, distance:
$$d(x, y) = \lim_{k \to \infty} \left(\sum_{j=1}^{p} |x_j - y_j|^k\right)^{\frac{1}{k}} = \max_{j=1,\ldots,p} |x_j - y_j|$$
The dist() function in R will calculate these and others.
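For example, on the scaled geyser data from earlier (a sketch):

d_euclid  <- dist(data, method = "euclidean")          # Euclidean (2-norm)
d_mink3   <- dist(data, method = "minkowski", p = 3)   # k-norm with k = 3
d_city    <- dist(data, method = "manhattan")          # city block
d_supnorm <- dist(data, method = "maximum")            # infinity norm
round(as.matrix(d_euclid)[1:3, 1:3], 3)                # distances among the first 3 points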
Example distances - x ∉ Rp
There are also numerous other distances/dissimilarities that arise even when $x, y \notin \mathbb{R}^p$. Some examples are
◮ Cosine similarity. Often used for measuring the similarity of text documents where each element of $x$ contains the count of some word for that document, implying $x, y \in \mathbb{R}^p_{+}$. If the vector (i.e. document) length is to be ignored, then $x$ and $y$ are more similar the larger is $\cos\angle(x, y)$. A distance could be $d(x, y) = 1 - \cos\angle(x, y)$.
◮ Jaccard distance. Here $x$ and $y$ are finite sets. The Jaccard index
$$J(x, y) = \frac{|x \cap y|}{|x \cup y|}$$
is a measure of how similar the two sets $x$ and $y$ are. A corresponding dissimilarity measure would be $d(x, y) = 1 - J(x, y)$.
◮ Hamming distance. Here $x$ and $y$ are strings having the same number of characters (the $j$th element of the vector is the $j$th character in the string). The Hamming distance is the minimum number of substitutions required to turn one string into the other. That is, it is the number of positions at which the two strings differ. (See also the Levenshtein distance for strings of unequal length.)
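Small illustrative functions for these, written directly from the definitions above rather than taken from any particular package (a sketch):

cosine_dist <- function(x, y)        # 1 minus the cosine of the angle between x and y
  1 - sum(x * y) / sqrt(sum(x^2) * sum(y^2))

jaccard_dist <- function(x, y)       # x and y are finite sets (vectors of elements)
  1 - length(intersect(x, y)) / length(union(x, y))

hamming_dist <- function(x, y)       # x and y are strings of equal length
  sum(strsplit(x, "")[[1]] != strsplit(y, "")[[1]])

cosine_dist(c(3, 0, 1), c(1, 2, 1))
jaccard_dist(c("a", "b", "c"), c("b", "c", "d"))      # J = 2/4, so distance 0.5
hamming_dist("karolin", "kathrin")                    # 3 positions differ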
Euclidean squared distances and the Gram matrix
When $d_{ij} = \|x_i - x_j\|$ is the Euclidean distance, some very useful relationships can be derived between the observations and their squared distances $d^2_{ij}$:
$$d^2_{ij} = \|x_i - x_j\|^2 = (x_i - x_j)^T (x_i - x_j) = x_i^T x_i - 2\, x_i^T x_j + x_j^T x_j = g_{ii} - 2 g_{ij} + g_{jj}$$
where $g_{ij} = x_i^T x_j$ is the $(i, j)$ entry of the Gram matrix $G = [g_{ij}]$.

Let $D^\star = [d^2_{ij}]$ denote the matrix of squared Euclidean distances. Then the above equation can be written in matrix and vector form as
$$D^\star = \mathbf{1} g^T - 2G + g \mathbf{1}^T$$
where $g = (g_{11}, \ldots, g_{nn})^T$ is the vector of diagonal entries of $G$ and $\mathbf{1}$ is the $n$-vector of ones.
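A quick numeric check of this identity on a small random configuration (a sketch):

set.seed(1)
X <- matrix(rnorm(5 * 2), nrow = 5)      # 5 points in R^2
G <- X %*% t(X)                          # Gram matrix
g <- diag(G)
Dstar <- outer(g, g, "+") - 2 * G        # entry (i, j) is g_ii + g_jj - 2 g_ij
max(abs(Dstar - as.matrix(dist(X))^2))   # essentially zero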
Euclidean squared distances and the Gram matrix
In the other direction, we can show (assuming we have centred the data so that $\sum_{i=1}^{n} x_i = 0$) that the Gram matrix $G$ can be determined from the matrix $D^\star = [d^2_{ij}]$ of squared Euclidean distances as
$$G = -\tfrac{1}{2}\, (I_n - P)\, D^\star\, (I_n - P) \quad \text{where} \quad P = \mathbf{1}(\mathbf{1}^T\mathbf{1})^{-1}\mathbf{1}^T.$$
Now, given a Gram matrix, a point configuration (and hence possibly a lower dimensional embedding) can always be determined from an eigendecomposition of $G = XX^T = U D_\lambda U^T$ (i.e. principal components). This opens up a whole realm of possibilities (including visualization), beginning with any squared distance between points!
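A sketch of this reverse direction, continuing with the Dstar matrix from the previous sketch (this is essentially classical multidimensional scaling, cf. cmdscale() in R):

n <- nrow(Dstar)
P <- matrix(1 / n, n, n)                                 # P = 1 (1^T 1)^{-1} 1^T
G2 <- -0.5 * (diag(n) - P) %*% Dstar %*% (diag(n) - P)   # recovered Gram matrix
e <- eigen(G2, symmetric = TRUE)
Xrec <- e$vectors %*% diag(sqrt(pmax(e$values, 0)))      # a point configuration
max(abs(as.matrix(dist(Xrec))^2 - Dstar))                # original squared distances recovered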
Example distances - Distances from Kernel functions
Since the squared distances can be determined from the Gram matrix, and the Gram matrix is just a matrix of inner products, we could begin with a definition of an inner product and derive the squared distances.
One choice is to define a “Kernel” function $K(x, y)$ to be the inner product of some transformation $\psi(x)$ of the original $x$s. Typically, this transformation maps the original vectors to a much higher dimensional space. That is,
$$\langle \psi(x), \psi(y) \rangle = K(x, y)$$
for some function $K(\cdot)$. If $\mathbf{K} = [K_{ij}]$ is the matrix of inner products $K(x_i, x_j)$, then it can be turned into a distance function by treating it as a Gram matrix. Note that before doing so, the matrix $\mathbf{K}$ must be pre- and post-multiplied by the centring matrix $I_n - P$, since the previous work assumed the vectors in the inner product were themselves centred about their average.
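A sketch of this recipe, using an illustrative polynomial kernel on the scaled geyser data (the kernel choice here is an assumption for illustration only):

X <- as.matrix(data)                           # scaled duration and waiting from earlier
K <- (X %*% t(X) + 1)^2                        # polynomial kernel of degree 2: K_ij = (x_i^T x_j + 1)^2
n <- nrow(K)
C <- diag(n) - matrix(1 / n, n, n)             # centring matrix I_n - P
Kc <- C %*% K %*% C                            # centred kernel, treated as a Gram matrix
D2 <- outer(diag(Kc), diag(Kc), "+") - 2 * Kc  # squared "kernel distances"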
Example distances - Distances from Kernel functions
Some popular choices of K(x, y) are
1. Polynomial of degree $d$, scale parameter $\sigma$, and offset $\theta$:
$$K(x, y) = (\sigma x^T y + \theta)^d$$
For example, suppose $p = 3$, $d = 2$ (with $\sigma = 1$ and $\theta = 0$); then
$$K(x, y) = (x_1 y_1 + x_2 y_2 + x_3 y_3)^2 = x_1^2 y_1^2 + x_2^2 y_2^2 + x_3^2 y_3^2 + 2 x_1 x_2 y_1 y_2 + 2 x_1 x_3 y_1 y_3 + 2 x_2 x_3 y_2 y_3$$
$$= (x_1^2,\, x_2^2,\, x_3^2,\, \sqrt{2}\, x_1 x_2,\, \sqrt{2}\, x_1 x_3,\, \sqrt{2}\, x_2 x_3)\; (y_1^2,\, y_2^2,\, y_3^2,\, \sqrt{2}\, y_1 y_2,\, \sqrt{2}\, y_1 y_3,\, \sqrt{2}\, y_2 y_3)^T$$
So $\psi(x) = (x_1^2,\, x_2^2,\, x_3^2,\, \sqrt{2}\, x_1 x_2,\, \sqrt{2}\, x_1 x_3,\, \sqrt{2}\, x_2 x_3)^T$.
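A quick numeric check that this feature map reproduces the kernel (a sketch):

x <- c(1, 2, 3); y <- c(4, 5, 6)
psi <- function(v) c(v^2, sqrt(2) * c(v[1] * v[2], v[1] * v[3], v[2] * v[3]))
(sum(x * y))^2          # K(x, y) = (x^T y)^2 : 1024
sum(psi(x) * psi(y))    # <psi(x), psi(y)>   : also 1024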
Example distances - Distances from Kernel functions
2. Radial basis function (Gaussian) (with scale parameter $\sigma$):
$$K(x, y) = \exp\left(-\frac{\|x - y\|^2}{2\sigma^2}\right)$$
To see that this is also an inner product, consider a series expansion of $e^t$. The feature space is infinite dimensional.
3. Sigmoid (hyperbolic tangent) with scale $\sigma$ and offset $\theta$:
$$K(x, y) = \tanh\left(\sigma x^T y + \theta\right)$$
There is a theorem from functional analysis (involving reproducing Kernel Hilbert spaces, hence the name) called Mercer’s Theorem which gives conditions under which a function $K(x, y)$ can be expressed as a dot product.
Dissimilarity based methods - hierarchical clustering
Perhaps the most popular way in which distance (or dissimilarity) methods are used is to construct a hierarchical clustering of the observations based on the distances between points. Typically, the hierarchy is produced in one of two ways:
1. Top-down, or divisive:
Begin with all observations in a single cluster $C_1^{(0)}$, then split this into two clusters $C_1^{(1)}$ and $C_2^{(1)}$ such that the distance between clusters, $d(C_1^{(1)}, C_2^{(1)})$, is maximized. Repeat this for each cluster, until every cluster consists of a single observation $x$.
2. Bottom-up, or agglomerative:
Begin with each observation in its own cluster. Join the two clusters $C_i$ and $C_j$ that are closest to one another in that they have the smallest between-cluster distance $d(C_i, C_j)$. Repeat this, joining clusters, until only a single cluster containing all observations is constructed.
In either case, the clustering method is hierarchical in that the clusters can be nested in a tree whose nodes are clusters (called a dendrogram). Note that the clustering which results will depend upon the definition of the between-cluster distances.
Dissimilarity based methods - hierarchical clustering
Some common choices of distance between clusters A and B are
1. Single linkage:
$$d(A, B) = \min\{d(x, y) : x \in A,\ y \in B\}$$
2. Complete linkage:
$$d(A, B) = \max\{d(x, y) : x \in A,\ y \in B\}$$
3. Average linkage:
$$d(A, B) = \frac{1}{|A| \times |B|} \sum_{x \in A} \sum_{y \in B} d(x, y)$$
4. Centroid linkage:
$$d(A, B) = \|c_A - c_B\|$$
where $c_A$ and $c_B$ are the centroids of the clusters A and B, respectively.
The function hclust() in R implements these and other methods. There is also a package called cluster which implements a variety of agglomerative and divisive hierarchical methods, as well as partitioning methods like K-means.
Hierarchical clustering - example
Suppose we try single-linkage clustering on the geyser data.
d <- dist(data, method = "euclidean")
single <- hclust(d, method = "single")
str(single)

## List of 7
##  $ merge      : int [1:298, 1:2] -3 -64 -4 -5 -207 -6 -266 -23 -24 -297 ...
##  $ height     : num [1:298] 0 0 0 0 0 0 0 0 0 0 ...
##  $ order      : int [1:299] 149 61 243 58 12 62 183 155 268 33 ...
##  $ labels     : chr [1:299] "1" "2" "3" "4" ...
##  $ method     : chr "single"
##  $ call       : language hclust(d = d, method = "single")
##  $ dist.method: chr "euclidean"
##  - attr(*, "class")= chr "hclust"

class(single)
## [1] "hclust"
Hierarchical clustering - example
The result can be displayed as a tree, or dendrogram
plot(single, cex = 0.25)
[Figure: "Cluster Dendrogram" from hclust (*, "single") on d; observation labels along the bottom, Height from 0.0 to 0.8.]
Hierarchical clustering - example
Clusters can be determined by cutting across the dendrogram at any specified height
plot(single, cex = 0.25)
abline(h = 0.25, col = "red", lty = 2)
[Figure: the same single-linkage dendrogram, with a dashed red line drawn at height 0.25 marking the cut.]
Hierarchical clustering - example
Clusters can be determined by cutting across the dendrogram at any specified height
p['color'] <- cutree(single, h = 0.25)
Hierarchical clustering - example
Or by specifying the number of clusters
p['color'] <- cutree(single, k = 3)
Hierarchical clustering - example
Or by specifying the number of clusters
p['color'] <- cutree(single, k = 14)
Hierarchical clustering - example
Or by specifying the number of clusters
p['color'] <- cutree(single, k = 15)
Hierarchical clustering - example
Average-linkage:
avelinkage <- hclust(d, method = "average")
plot(avelinkage, cex = 0.25)
abline(h = 1, col = "red", lty = 2)
[Figure: "Cluster Dendrogram" from hclust (*, "average") on d, Height from 0.0 to 2.5, with a dashed red line at height 1.]
Hierarchical clustering - example
Average-linkage:
p['color'] <- cutree(avelinkage, h=1)
Hierarchical clustering - example
Complete-linkage:
completeLinkage <- hclust(d, method = "complete")
plot(completeLinkage, cex = 0.25)
abline(h = 3, col = "red", lty = 3)
[Figure: "Cluster Dendrogram" from hclust (*, "complete") on d, Height from 1 to 5, with a dotted red line at height 3.]
Hierarchical clustering - example
Complete-linkage:
p['color'] <- cutree(completeLinkage, h=3)
Hierarchical clustering - linkage methods
Note that depending on how clusters are joined/split, different linkage methods may favour different shaped clusters. For example, single linkage corresponds to finding the minimal spanning tree of the (complete) geometric graph of the points. It will therefore track “stringy” structure as clusters. In contrast, complete linkage will prefer compact small diameter clusters over stringy ones.
Similarly, average linkage will have a tendency to avoid stringy clusters.
An advantage of any of these hierarchical methods over partitional methods like K-means is that they provide a nested sequence of potential clusterings.
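For example, a single call to hclust() gives the whole nested sequence, and cutree() can extract several clusterings at once (a sketch, using the single-linkage fit from earlier):

memberships <- cutree(single, k = 2:6)   # one column per requested number of clusters
head(memberships)
# the k = 2 clustering is a coarsening of the k = 3 clustering, and so on
table(memberships[, "2"], memberships[, "3"])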
Density based clustering - dbscan
Data clusters might also be defined to be points in those regions of the space where there is high density. This naturally leads to a cluster tree where points separate whenever high density regions separate (by a low-density region appearing between them) and where, at least in principle, there is no reason to favour any shape of cluster over any other.

For example, the contours of our original density estimate on the geyser data actually define (by their level curves) a sequence of clusters, each successive set of clusters containing fewer points. The resulting cluster tree would have only three branches (two binary branchings).

One way of determining density in higher dimensional space is by the average (or maximum, or some other size measure) of the distances from any location to its k nearest neighbours; the smaller the value, the greater the local density.

The package dbscan (“Density Based Clustering of Applications with Noise”) contains the functions kNN() and kNNdist(), which will calculate the distances of the k nearest neighbours to each point in the data set and return these in a matrix:
library(dbscan)
kDists <- kNNdist(data, 5)
head(kDists)

##            1          2          3          4          5
## 1 0.01451925 0.01451925 0.07344207 0.07344207 0.07344207
## 2 0.13067299 0.14471532 0.19444059 0.19444059 0.29376826
## 3 0.00000000 0.00000000 0.07199256 0.07199256 0.07199256
## 4 0.00000000 0.01451925 0.07199256 0.07199256 0.07199256
## 5 0.00000000 0.00000000 0.07199256 0.07199256 0.07199256
## 6 0.00000000 0.00000000 0.02903841 0.04355766 0.07199256
Density based clustering - dbscan
A plot of these distances is produced by kNNdistplot() (to which we have added three horizontal lines).
kNNdistplot(data, 5)
abline(h = c(0.15, 0.2, 0.25), col = c("red", "blue", "forestgreen"), lty = c(2, 1, 3))
[Figure: sorted 5-NN distances (0.0 to 1.0) plotted against points (sample) sorted by distance, with horizontal reference lines at 0.15, 0.2, and 0.25.]
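The elbows in this plot suggest candidate values for the eps neighbourhood radius used by dbscan(); a sketch of one such run (the eps and minPts values here are illustrative choices, not prescribed):

db <- dbscan(data, eps = 0.2, minPts = 5)   # points assigned to no cluster get label 0 ("noise")
table(db$cluster)
# colour the loon plot by the dbscan clusters (shift by 1 so noise points get their own colour)
p['color'] <- db$cluster + 1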