

SLIDE 1

Clustering methods

R.W. Oldford

SLIDE 2

Interactive data visualization

An important advantage of data visualization is that much structure (e.g. density, groupings, regular patterns, relationships, outliers, connections across dimensions, etc.) can be easily seen visually, even though it might be more difficult to describe mathematically. Moreover, the structure observed need not have been anticipated. Interaction, which allows fairly arbitrary changes to the plot via direct manipulation (e.g. mouse gestures) and also via the command line (i.e. programmatically), further enables the analyst by providing quick and easy data queries, marking of structure, and, when the visualizations are themselves data structures, quick setting and extraction of observed information. Direct interaction amplifies the advantage of data visualization and creates a powerful tool for uncovering structure. In contrast, we might choose to have some statistical algorithm search for structure in the data. This would of course require specifying in advance how that structure might be described mathematically.

SLIDE 3

Interactive data visualization

The two approaches naturally complement one another.

◮ Structure searched for algorithmically must be precisely characterised mathematically, and so is necessarily determined prior to the analysis.

◮ Interactive data visualization depends on the human visual system, which has evolved over millions of years to be able to see patterns, both anticipated and not.

In the hands of an experienced analyst, one complements and amplifies the other; the two are worked together to give much greater insight than either approach could alone. We have already seen the value of using both in conjunction with one another in, for example, hypothesis testing, density estimation, and smoothing.

SLIDE 4

Finding groups in data

Consider the “Old Faithful” geyser data (from the MASS package), centred and scaled as follows.

library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
##     select
xrange <- diff(range(geyser$duration))
yrange <- diff(range(geyser$waiting))
data <- as.data.frame(scale(geyser[, c("duration", "waiting")], scale = c(xrange, yrange)))

data is now centred at the average in each direction and scaled so that the ranges of the two directions are identical. We do this so that, when we consider the clustering methods, they will work on data for which visual distances observed on any (square) scatterplot correspond to Euclidean distances in the space of measurements (which any clustering method would use).

SLIDE 5

Finding groups in data

Oftentimes we observe that the data have grouped together in patterns. In a scatterplot, for example, we might notice that the observations concentrate more in some areas than they do in others.

A simple scatterplot with larger point sizes and alpha blending shows
◮ 3 regions of concentration,
◮ 3 vertical lines,
◮ and a few outliers.

Contours of constant kernel density estimate show
◮ two modes at right,
◮ a higher mode at left, and
◮ a smooth continuous mathematical function.

Perhaps the points could be automatically grouped by using the contours?

[Figure: scatterplot of waiting versus duration with contours of the kernel density estimate (levels 0.5 to 4.5).]

SLIDE 6

Finding groups in data - K-means

A great many methods exist (and continue to be developed) to automatically find groups in data. These have historically been called clustering methods by data analysts and more recently are sometimes called unsupervised learning methods (in the sense that we do not know the “classes” of the observations as in “supervised learning”) by many artificial intelligence researchers.

One of the earliest clustering methods is “K-means”. In its simplest form, it begins with the knowledge that there are exactly K clusters to be determined. The idea is to identify K clusters, C_1, ..., C_K, where every multivariate observation x_i for i = 1, ..., n in the data set appears in one and only one cluster C_k. The clusters are to be chosen so that the total within-cluster spread is as small as possible. For every cluster, the total spread for that cluster is measured by the sum of squared Euclidean distances from the cluster “centroid”, namely

$$SSE_k = \sum_{i \in C_k} d^2(i, k) = \sum_{i \in C_k} \|x_i - c_k\|^2$$

where c_k is the cluster “centroid”. Typically, the cluster average

$$\bar{x}_k = \frac{1}{n_k} \sum_{i \in C_k} x_i$$

(where n_k denotes the cardinality of cluster C_k) is chosen as the cluster centroid (i.e. choose c_k = x̄_k). The K clusters are chosen to minimize

$$\sum_{k=1}^{K} SSE_k.$$

Algorithms typically begin with “seed” centroids c1, . . . , cK, possibly randomly chosen, then assign every observation to its nearest centroid. Each centroid is then recalculated based on the values of xi ∀ i ∈ Ck (e.g. ck = xk), and the data are reassigned to the new centroids. Repeat until there is no change in the clustering.
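The iteration just described can be written out directly. Below is a minimal sketch of that simple (Lloyd-style) iteration, for illustration only; it is not the algorithm kmeans() uses by default (Hartigan-Wong), and the function and variable names are our own. It assumes the data form a numeric matrix or data frame and that no cluster becomes empty.

simple_kmeans <- function(x, K, max_iter = 100) {
  x <- as.matrix(x)
  # "seed" centroids: K observations chosen at random
  centroids <- x[sample(nrow(x), K), , drop = FALSE]
  cluster <- rep(0L, nrow(x))
  for (iter in seq_len(max_iter)) {
    # assign every observation to its nearest centroid (squared Euclidean distance)
    d2 <- sapply(seq_len(K), function(k) colSums((t(x) - centroids[k, ])^2))
    new_cluster <- max.col(-d2)
    if (all(new_cluster == cluster)) break   # no change in the clustering; stop
    cluster <- new_cluster
    # recalculate each centroid as its cluster average
    for (k in seq_len(K)) centroids[k, ] <- colMeans(x[cluster == k, , drop = FALSE])
  }
  list(cluster = cluster, centers = centroids)
}

For example, simple_kmeans(data, K = 3)$cluster gives one such clustering of the scaled geyser data.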

SLIDE 7

Finding groups in data - K-means

There are several implementations of K-means in R. A number of them are available through the base R function kmeans() (via its algorithm argument).

result <- kmeans(data, centers = 3)
str(result)
## List of 9
##  $ cluster     : Named int [1:299] 2 3 1 2 2 3 1 2 3 1 ...
##   ..- attr(*, "names")= chr [1:299] "1" "2" "3" "4" ...
##  $ centers     : num [1:3, 1:2] 0.21 0.1392 -0.3166 -0.2655 0.0968 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:3] "1" "2" "3"
##   .. ..$ : chr [1:2] "duration" "waiting"
##  $ totss       : num 32
##  $ withinss    : num [1:3] 1.33 1.19 1.58
##  $ tot.withinss: num 4.09
##  $ betweenss   : num 27.9
##  $ size        : int [1:3] 101 91 107
##  $ iter        : int 2
##  $ ifault      : int 0
##  - attr(*, "class")= chr "kmeans"

The cluster component identifies to which of the three clusters the corresponding observation has been assigned.
SLIDE 8

Finding groups in data - K-means

Plotting this information in loon

library(loon)
p <- l_plot(data, linkingGroup = "geyser",
            showScales = FALSE, showLabels = FALSE,
            showGuides = FALSE)   # the showGuides value was cut off in the original; FALSE assumed
# Add the density contours
l_layer_contourLines(p, kde2d(data$duration, data$waiting, n = 100), color = "grey")
## loon layer "lines" of type lines of plot .l0.plot
## [1] "layer0"
# Colour the clusters
p['color'] <- result$cluster

SLIDE 9

Finding groups in data - K-means

Plotting this information in loon

plot(p)

which looks pretty good.

SLIDE 10

Finding groups in data - K-means

Had we selected only K = 2: which we might not completely agree with.

SLIDE 11

Finding groups in data - K-means

How about K = 4?: with which, again, we might or might not agree.

SLIDE 12

Finding groups in data - K-means

How about K = 5?: with which, again, we might or might not agree.

SLIDE 13

Finding groups in data - K-means

How about K = 6?: with which, again, we might or might not agree.

SLIDE 14

Finding groups in data - K-means

Let’s try K = 6 again: which is different!

SLIDE 15

Finding groups in data - K-means

Some comments and questions:

◮ K-means depends on total squared Euclidean distance to the centroids
◮ K-means implicitly presumes that the clusters will be "globular" or "spherical"
◮ different clusters might arise on different calls (random starting positions for the centroids; see the sketch below)
◮ how do we choose K?
◮ should we enforce a hierarchy on the clusters?
◮ with an interactive visualization, we should be able to readjust the clusters by changing their colours
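Because the seed centroids are random, repeated calls can indeed return different clusterings, as happened with K = 6 above. A minimal sketch of one common mitigation, using the same scaled geyser data: fix the random seed for reproducibility and/or ask kmeans() for several random starts via its nstart argument (the particular seed and nstart value here are illustrative).

set.seed(314)                                            # make the run reproducible
result_a <- kmeans(data, centers = 6)
result_b <- kmeans(data, centers = 6)
table(result_a$cluster, result_b$cluster)                # the two clusterings may well disagree
result_best <- kmeans(data, centers = 6, nstart = 25)    # keep the best of 25 random starts
result_best$tot.withinss                                 # total within-cluster sum of squares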

SLIDE 16

Finding groups in data - model based clustering

A related approach to K-means, but one that is much more general and which comes from a different reasoning base, is that of so-called “model-based clustering”. Here, the main idea is that the data x_i are a sample of independently and identically distributed (iid) multivariate observations from some multivariate mixture distribution. That is,

$$X_1, \ldots, X_n \sim f_p(x\,;\,\Theta)$$

where f_p(x ; Θ) is a p-variate continuous density parameterized by some collection of parameters Θ that can be expressed as a finite mixture of individual p-variate densities g_p():

$$f_p(x\,;\,\Theta) = \sum_{k=1}^{K} \alpha_k\, g_p(x\,;\,\theta_k).$$

Here α_k ≥ 0, Σ_{k=1}^K α_k = 1, and the individual densities g_p(x ; θ_k) are of known shape and are identical up to differences given by their individual parameter vectors θ_k. Neither α_k nor θ_k are known for any k = 1, ..., K and must be estimated from the observations (i.e. Θ = {α_1, ..., α_K, θ_1, ..., θ_K}).

Typically the g_p(x ; θ_k) are taken to be multivariate Gaussian densities of the form g_p(x ; θ_k) = φ_p(x ; μ_k, Σ_k) with

$$\phi_p(x\,;\,\mu_k, \Sigma_k) = (2\pi)^{-p/2}\, |\Sigma_k|^{-1/2}\, e^{-\frac{1}{2}(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k)}$$

and θ_k = (μ_k, Σ_k).

SLIDE 17

Finding groups in data - model based clustering

We can imagine fitting the mixture model to the data for fixed K and α_k > 0 ∀ k via, say, maximum likelihood. This can be accomplished by introducing latent variates z_ik which are 1 when x_i came from the kth mixture component and zero otherwise. Suffice it to say that the z_ik are treated as “missing” and that an “EM” or “Expectation-Maximization” algorithm is then used to perform maximum likelihood estimation on the finite mixture. The parameters μ_k and Σ_k can also be constrained (the eigen-decomposition Σ_k = O D_σ O^T is useful in this) to restrict the problem further. An information criterion like the “Bayesian Information Criterion”, or “BIC”, is used to compare values of K across models. This adds a penalty to the log-likelihood that penalizes larger models. Look for models that have high information (as measured by BIC). (Note that some writers (e.g. Wikipedia) use minus this and hence minimize their objective function.)

Note that by using a Gaussian model, the clusters are inherently taken to be elliptically shaped. An implementation of model-based clustering can be found in the R package mclust.
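To make the role of the latent z_ik concrete, here is a minimal sketch of the E-step that computes the expected z_ik (the “responsibility” of component k for observation x_i) given current parameter values. This is an illustration only, not mclust's implementation, and it assumes the mvtnorm package for the Gaussian density φ_p.

library(mvtnorm)   # assumed here for dmvnorm(); not part of these slides

# E-step: expected z_ik given current mixing weights alpha, a list of mean
# vectors mu, and a list of covariance matrices Sigma
responsibilities <- function(x, alpha, mu, Sigma) {
  x <- as.matrix(x)
  K <- length(alpha)
  num <- sapply(seq_len(K), function(k)
    alpha[k] * dmvnorm(x, mean = mu[[k]], sigma = Sigma[[k]]))
  num / rowSums(num)   # each row sums to 1: P(component k | x_i)
}

The M-step would then re-estimate α_k, μ_k and Σ_k from these weights, and the two steps alternate until the log-likelihood stops increasing.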

SLIDE 18

Finding groups in data - model based clustering

For the geyser data

library(mclust)
resultmc <- Mclust(data, G = 1:10)  # G = number of mixtures to consider
plot(resultmc, what = "BIC")        # (VVV is the most general)

[Figure: BIC (roughly 100 to 600) versus number of components (1 to 10) for the mclust covariance models EII, VII, EEI, VEI, EVI, VVI, EEE, EVE, VEE, VVE, EEV, VEV, EVV, VVV.]

SLIDE 19

Finding groups in data - model based clustering

For the geyser data

summary(resultmc)
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust VVI (diagonal, varying volume and shape) model with 4 components:
##
##  log-likelihood   n df      BIC      ICL
##        375.3983 299 19 642.4882 612.5996
##
## Clustering table:
##  1  2  3  4
## 90 17 98 94

We could check the agreement with K-means

table(result4$cluster, resultmc$classification)   # result4: presumably the earlier kmeans(data, centers = 4) fit (code not shown)
##
##      1  2  3  4
##   1  0  0 60  0
##   2  0 13  0 94
##   3 82  4  0  0
##   4  8  0 38  0

SLIDE 20

Finding groups in data - model based clustering

And update the loon plot

p['color'] <- resultmc$classification

which is interesting . . .

SLIDE 21

Finding groups in data - model based clustering

Plotting the mixtures

plot(resultmc, what = "classification")

[Figure: mclust classification plot of waiting versus duration showing the fitted mixture components.]

The model is imposing its structure on the data.

SLIDE 22

Dissimilarity based methods

There are a number of clustering methods which are based entirely on some measure of the dissimilarity δij = δ(xi, xj) between every pair (i, j) of observations xi and xj (for all i ≠ j).

Usually (but not always) the dissimilarity δij is a distance function in that it obeys the axioms of a metric. Namely, for all vectors x, y, z, a function d(x, y) is a metric or distance if it satisfies

1. d(x, y) ≥ 0
2. d(x, y) = 0 ⟺ x = y
3. d(x, y) = d(y, x)
4. d(x, z) ≤ d(x, y) + d(y, z)

Dissimilarity measures δ(x, y) typically obey the first three of these; some do not additionally obey the fourth (the triangle inequality). When all four hold, so that the measure is a distance, we will denote the dissimilarity by dij rather than by δij.

SLIDE 23

Example distances - some common choices

There are numerous functions that could be used, depending on the application. Some examples for x, y ∈ R^p include:

◮ Euclidean distance:

$$d(x, y) = \left\{ (x - y)^T (x - y) \right\}^{1/2}$$

◮ Minkowski distance or k-norm distance (k > 0):

$$d(x, y) = \left( \sum_{j=1}^{p} |x_j - y_j|^k \right)^{1/k}$$

◮ City block distance (or Manhattan, or Taxicab):

$$d(x, y) = \sum_{j=1}^{p} |x_j - y_j|$$

◮ Infinity norm, or supremum, or maximum, distance:

$$d(x, y) = \lim_{k \to \infty} \left( \sum_{j=1}^{p} |x_j - y_j|^k \right)^{1/k} = \max_{j = 1, \ldots, p} |x_j - y_j|$$

The dist() function in R will calculate these and others.
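For the scaled geyser data, for instance, these can be computed directly with dist(); the method names below are the ones dist() accepts, and the object names are ours.

d_euclid <- dist(data, method = "euclidean")
d_city   <- dist(data, method = "manhattan")          # city block / taxicab
d_sup    <- dist(data, method = "maximum")            # infinity norm
d_mink3  <- dist(data, method = "minkowski", p = 3)   # k-norm distance with k = 3
round(as.matrix(d_euclid)[1:3, 1:3], 4)               # pairwise distances among the first three observations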

SLIDE 24

Example distances - x ∉ R^p

There are also numerous other distances/dissimilarities that arise even when x, y ∉ R^p. Some examples are

◮ Cosine similarity. Often used for measuring similarity of text documents where each element of x contains the count of some word for that document, implying x, y ∈ R^p_+. If the vector (i.e. document) length is to be ignored, then x and y are more similar the larger is cos ∠(x, y). A distance could be d(x, y) = 1 − cos ∠(x, y).

◮ Jaccard distance. Here x and y are finite sets. The Jaccard index J(x, y) = |x ∩ y| / |x ∪ y| is a measure of how similar the two sets x and y are. A corresponding dissimilarity measure would be d(x, y) = 1 − J(x, y).

◮ Hamming distance. Here x and y are strings having the same number of characters (the jth element of the vector is the jth character in the string). The Hamming distance is the minimum number of substitutions required to turn one string into the other. That is, it is the number of positions at which the two strings differ. (See also Levenshtein distance for strings of unequal length.)
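As a small illustration, each of these can be computed in a few lines of R; the helper functions below are our own, not from any package mentioned in these slides.

# cosine dissimilarity for numeric (e.g. word-count) vectors
cosine_dist <- function(x, y) 1 - sum(x * y) / sqrt(sum(x^2) * sum(y^2))
# Jaccard distance for finite sets
jaccard_dist <- function(x, y) 1 - length(intersect(x, y)) / length(union(x, y))
# Hamming distance for equal-length character strings
hamming_dist <- function(x, y) sum(strsplit(x, "")[[1]] != strsplit(y, "")[[1]])

cosine_dist(c(2, 0, 1), c(4, 1, 2))
jaccard_dist(c("a", "b", "c"), c("b", "c", "d"))   # 1 - 2/4 = 0.5
hamming_dist("karolin", "kathrin")                 # 3 positions differ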

SLIDE 25

Euclidean squared distances and the Gram matrix

Recall how point configurations x_1, ..., x_n, their matrix D⋆ of squared Euclidean distances with d²_ij = ||x_i − x_j||², and the Gram matrix G of inner products x_i^T x_j are related. Namely,

$$D^\star = \mathbf{1}g^T - 2G + g\mathbf{1}^T \qquad \text{and} \qquad G = -\tfrac{1}{2}\,(I_n - P)\, D^\star\, (I_n - P)$$

where P = 1(1^T 1)^{-1} 1^T and g is the vector of diagonal elements of G.

Or, since the Gram matrix is a matrix of inner products, we could begin with kernel functions K(x_i, x_j) as the inner product of some transformation ψ(x) of the original xs. The kernel matrix (row and column centred) then acts as the Gram matrix.

Finally, given a Gram matrix, a point configuration (and hence possibly a lower dimensional embedding) can always be determined from an eigen-decomposition of G = XX^T = U D_λ U^T (i.e. principal components).

Together, these relationships open up a lot of possibilities for clustering based on distances.
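A quick numerical check of the double-centring relationship, using the (already centred) scaled geyser data as the point configuration; this is a sketch for illustration only.

X     <- as.matrix(data)                 # centred point configuration (299 x 2)
G     <- X %*% t(X)                      # Gram matrix of inner products
Dstar <- as.matrix(dist(X))^2            # matrix of squared Euclidean distances
n     <- nrow(X)
P     <- matrix(1/n, n, n)               # P = 1 (1^T 1)^{-1} 1^T
G2    <- -0.5 * (diag(n) - P) %*% Dstar %*% (diag(n) - P)
max(abs(G - G2))                         # essentially zero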

SLIDE 26

Dissimilarity based methods - hierarchical clustering

Perhaps the most popular way in which distance (or dissimilarity) methods are used is to construct a hierarchical clustering of the observations based on the distances between points. Typically, the hierarchy is produced in one of two ways:

1. Top-down, or divisive:
Begin with all observations in a single cluster C_1^{(0)}, then split this into two clusters C_1^{(1)} and C_2^{(1)} such that the distance between clusters, d(C_1^{(1)}, C_2^{(1)}), is maximized. Repeat this for each cluster, until every cluster consists of a single location x.

2. Bottom-up, or agglomerative:
Begin with each observation in its own cluster. Join the two clusters Ci and Cj that are closest to one another in that they have the smallest between-cluster distance d(Ci, Cj). Repeat this, joining clusters, until only a single cluster containing all observations is constructed.

In either case, the clustering method is hierarchical in that the clusters can be nested in a tree whose nodes are clusters (called a dendrogram). Note that the clustering which results will depend upon the definition of the between-cluster distances.

SLIDE 27

Dissimilarity based methods - hierarchical clustering

Some common choices of distance between clusters A and B are

1. Single linkage

$$d(A, B) = \min\{\, d(x, y) : x \in A,\ y \in B \,\}$$

2. Complete linkage

$$d(A, B) = \max\{\, d(x, y) : x \in A,\ y \in B \,\}$$

3. Average linkage

$$d(A, B) = \frac{1}{|A| \times |B|} \sum_{x \in A} \sum_{y \in B} d(x, y)$$

4. Centroid linkage

$$d(A, B) = \| c_A - c_B \|$$

where c_A and c_B are the centroids of the clusters A and B, respectively.

The function hclust() in R implements these and other methods. There is also a package called cluster which implements a variety of agglomerative and divisive hierarchical methods, as well as partitioning methods like K-means.
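To see how the four definitions differ, each can be computed directly for two small (arbitrarily chosen) groups of observations. This is an illustration only, not how hclust() computes linkages internally.

A   <- as.matrix(data[1:5, ])                    # two arbitrary "clusters"
B   <- as.matrix(data[6:10, ])
dAB <- as.matrix(dist(rbind(A, B)))[1:5, 6:10]   # all between-cluster distances d(x, y)

min(dAB)                                 # single linkage d(A, B)
max(dAB)                                 # complete linkage d(A, B)
mean(dAB)                                # average linkage d(A, B)
sqrt(sum((colMeans(A) - colMeans(B))^2)) # centroid linkage ||c_A - c_B||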

SLIDE 28

Hierarchical clustering - example

Suppose we try single-linkage clustering on the geyser data.

d <- dist(data, method = "euclidean")
single <- hclust(d, method = "single")
str(single)
## List of 7
##  $ merge      : int [1:298, 1:2] -3 -64 -4 -5 -207 -6 -266 -23 -24 -297 ...
##  $ height     : num [1:298] 0 0 0 0 0 0 0 0 0 0 ...
##  $ order      : int [1:299] 149 61 243 58 12 62 169 265 247 187 ...
##  $ labels     : chr [1:299] "1" "2" "3" "4" ...
##  $ method     : chr "single"
##  $ call       : language hclust(d = d, method = "single")
##  $ dist.method: chr "euclidean"
##  - attr(*, "class")= chr "hclust"

class(single)
## [1] "hclust"

SLIDE 29

Hierarchical clustering - example

The result can be displayed as a tree, or dendrogram

plot(single, cex = 0.25)

[Figure: "Cluster Dendrogram" from hclust (*, "single") on d; observation labels along the bottom, Height from 0.00 to 0.20 on the vertical axis.]

SLIDE 30

Hierarchical clustering - example

Clusters can be determined by cutting across the dendrogram at any specified height

plot(single, cex = 0.25)
abline(h = c(0.04, 0.05, 0.06), col = c("red", "blue", "forestgreen"), lty = c(2, 1, 3))

[Figure: the same single-linkage "Cluster Dendrogram" with horizontal cut lines at heights 0.04, 0.05, and 0.06.]

SLIDE 31

Hierarchical clustering - example

Clusters can be determined by cutting across the dendrogram at any specified height

p['color'] <- cutree(single, h = 0.06)

SLIDE 32

Hierarchical clustering - example

Clusters can be determined by cutting across the dendrogram at any specified height

p['color'] <- cutree(single, h = 0.05)

SLIDE 33

Hierarchical clustering - example

Clusters can be determined by cutting across the dendrogram at any specified height

p['color'] <- cutree(single, h = 0.04)

SLIDE 34

Hierarchical clustering - example

Or by specifying the number of clusters

p['color'] <- cutree(single, k = 3)

SLIDE 35

Hierarchical clustering - example

Or by specifying the number of clusters

p['color'] <- cutree(single, k = 10)

SLIDE 36

Hierarchical clustering - example

Or by specifying the number of clusters

SLIDE 37

Hierarchical clustering - example

Or by specifying the number of clusters

SLIDE 38

Hierarchical clustering - example

Average-linkage:

avelinkage <- hclust(d, method = "average")
plot(avelinkage, cex = 0.25)
abline(h = c(0.2, 0.3), col = c("red", "blue"), lty = c(2, 1))

[Figure: "Cluster Dendrogram" from hclust (*, "average") on d; Height from 0.0 to 0.6, with horizontal cut lines at 0.2 and 0.3.]

SLIDE 39

Hierarchical clustering - example

Average-linkage:

p['color'] <- cutree(avelinkage, h = 0.3)

SLIDE 40

Hierarchical clustering - example

Average-linkage:

p['color'] <- cutree(avelinkage, h = 0.2)

SLIDE 41

Hierarchical clustering - example

Complete-linkage:

completeLinkage <- hclust(d, method = "complete")
plot(completeLinkage, cex = 0.25)
abline(h = c(0.25, 0.5, 0.8), col = c("red", "blue", "forestgreen"), lty = c(2, 1, 3))

[Figure: "Cluster Dendrogram" from hclust (*, "complete") on d; Height from 0.0 to 1.2, with horizontal cut lines at 0.25, 0.5, and 0.8.]

SLIDE 42

Hierarchical clustering - example

Complete-linkage:

p['color'] <- cutree(completeLinkage, h=0.8)

SLIDE 43

Hierarchical clustering - example

Complete-linkage:

p['color'] <- cutree(completeLinkage, h=0.5)

SLIDE 44

Hierarchical clustering - example

Complete-linkage:

p['color'] <- cutree(completeLinkage, h=0.25)

SLIDE 45

Hierarchical clustering - linkage methods

Note that depending on how clusters are joined/split, different linkage methods may favour different shaped clusters. For example, single linkage corresponds to finding the minimal spanning tree of the (complete) geometric graph of the points. It will therefore track "stringy" structure as clusters. In contrast, complete linkage will prefer compact, small-diameter clusters over stringy ones. Similarly, average linkage will have a tendency to avoid stringy clusters.

An advantage of any of these hierarchical methods over partitional methods like K-means is that they provide a nested sequence of potential clusterings.
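That nested sequence is easy to extract: cutree() accepts a vector of cluster counts and returns one column of memberships per count. A small illustration on the single-linkage tree above:

cuts <- cutree(single, k = 2:6)   # one column of cluster memberships for each k
head(cuts)                        # the partitions are nested as k increases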

SLIDE 46

Density based clustering - dbscan

Data clusters might also be defined to be points in those regions of the space where there is high density. This naturally leads to a cluster tree, where points separate whenever high-density regions separate (a low-density region appearing between them) and which, at least in principle, has no reason to favour any cluster shape over any other.

For example, the contours of our original density estimate on the geyser data actually define (by their level curves) a sequence of clusters where each set of clusters has fewer points. The resulting cluster tree would have only three branches (two binary branchings).

One way of determining density in higher dimensional space is by the average (or maximum, or some other size measure) of the distances from any location to its k nearest neighbours: the smaller the value, the greater the local density. The package dbscan ("Density-Based Spatial Clustering of Applications with Noise") contains the functions kNN() and kNNdist() which will calculate the distances of the k nearest neighbours to each point in the data set and return these in a matrix:

library(dbscan)
kDists <- kNNdist(data, 5)
head(kDists)
##          1          2          3          4          5          6
## 0.01580251 0.06321004 0.01538462 0.01538462 0.01538462 0.01538462

SLIDE 47

Density based clustering - dbscan

A plot of these distances is produced by kNNdistplot() (to which we have added three horizontal lines).

kNNdistplot(data, 5)
abline(h = c(0.025, 0.05, 0.075), col = c("red", "blue", "forestgreen"), lty = c(2, 1, 3))

[Figure: 5-NN distance (0.00 to 0.20) against points (sample) sorted by distance (1 to 299), with the three horizontal lines added.]

Note that this is just a quantile plot with the x axis relabelled by the number (rather than the proportion) of distances whose value is less than or equal to the vertical value. The vertical axis records the actual nearest neighbour distances. The horizontal lines were placed after visual inspection of the graph to mark (roughly) the "knee" of the graph beyond which the nearest neighbour distances become relatively huge. This is an important parameter for dbscan(). Small nearest neighbour distances correspond to high density.
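As a check on that description, essentially the same picture can be drawn by sorting the 5-NN distances ourselves; a sketch, assuming kNNdist() returns one 5-NN distance per observation as in the output above.

nn5 <- kNNdist(data, 5)              # 5-NN distance for each observation
plot(sort(nn5), type = "l",
     xlab = "Points (sample) sorted by distance", ylab = "5-NN distance")
abline(h = c(0.025, 0.05, 0.075), col = c("red", "blue", "forestgreen"), lty = c(2, 1, 3))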

SLIDE 48

Density based clustering - dbscan

Density based clustering uses the horizontal cutoff as a parameter ε (argument eps) to determine a local neighbourhood for each point. Density around each point is now determined to be proportional to the number of points within the radius ε of each point. In addition to determining the number of clusters, dbscan() also identifies "noise" points which it does not assign to any cluster. A function to update our plot based on the data and ε is

updatePlot <- function(plot, data, eps){
  results <- dbscan(data, eps = eps)
  clusters <- results$cluster
  plot["color"] <- clusters
  # Noise points
  plot["glyph"] <- "ccircle"
  plot["glyph"][clusters == 0] <- "ocircle"   # Noise points are in cluster 0
}

This can now be used to find density based clusters for different ε.

SLIDE 49

Density based clustering - dbscan

updatePlot(p, data, 0.075)

SLIDE 50

Density based clustering - dbscan

updatePlot(p, data, 0.05)

SLIDE 51

Density based clustering - dbscan

updatePlot(p, data, 0.025)

SLIDE 52

Other clustering methods

There are many other clustering methods and R packages which implement them. For example, the package cluster contains numerous functions and graphical tools for determining variations on traditional partitional, agglomerative, and divisive methods. The package kernlab contains functions for clustering methods based on using kernel functions in traditional methods like K-means. kernlab also contains functions to apply "spectral clustering" methods (also in conjunction with kernels) to data. Any clustering method should be used in conjunction with an interactive data visualization system like loon to determine whether the clustering makes sense for the data in hand.
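For instance, a hedged sketch of spectral clustering on the same scaled geyser data with kernlab's specc(); the choice of three centers and reliance on the default (Gaussian) kernel are illustrative assumptions, not recommendations from these slides.

library(kernlab)
sc <- specc(as.matrix(data), centers = 3)   # spectral clustering with the default rbf kernel
p['color'] <- sc@.Data                      # the integer cluster memberships, used to colour the loon plot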
