Clustering Lecture notes
Clustering is
Exploratory, unsupervised method
Finding structure in data
Understanding and/or summarising data
Data within a cluster are similar to each other and dissimilar to data in other clusters
Hard vs. soft/fuzzy vs. exclusive

Hard vs. Fuzzy/Soft clustering
Hard: each point belongs to exactly one cluster; soft/fuzzy: each point has a degree of membership in every cluster
Types of clusters
Well separated vs. overlapping
Centre-based – based on distance to cluster centres
Contiguity – points are closer to others in their cluster than to any other cluster
Density – high-density areas separated by low-density areas
Conceptual – sharing some general attribute; intersections belong to both
The notion of a cluster is ambiguous
Applications
Biology – compare species in different environments
Medicine – group variants of illnesses or similar genes; PET scans: differentiate tissues
Marketing – consumers with similar shopping habits, grouping shopping items
Social sciences – crime analysis (greatest incidences of different crimes), poll data
Computer science – data compression, anomaly detection
Internet – web searches (categories of results), social media analysis
Other – location of network towers for optimum signal coverage; Netflix (cluster similar viewing habits); grouping skulls from archaeological digs based on measurements
"Distance" metric
Minkowski metric – for n features and points x, y:
d(x, y) = ( Σ_{i=1..n} |x_i − y_i|^q )^(1/q)
Two common cases: Manhattan, q = 1 (cityblock); Euclidean, q = 2 ("as the crow flies") – see the sketch below
Bad for high-dimensional data
Magnitude and units affect the result (e.g. body height vs. toe length)
-> standardise! (mean = 0, std = 1), but this may affect variability

Other metrics
- Mahalanobis distance – absolute distance with redundancies (correlated features) accounted for
- Pearson correlation (unit-independent) – covariance(x, y) / [std(x) std(y)]
- Binary data – Russell, Dice, Yule index, …
- Cosine (documents: keywords)
- Gower's distance (mixed data types)
- Alternatives (squared distances): squared Euclidean, squared Pearson
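A minimal Python sketch of the Minkowski family and of standardisation; it is not from the lecture, and the example points and feature values are made up purely for illustration.

```python
# Minkowski distance: d(x, y) = ( sum_i |x_i - y_i|^q )^(1/q)
import numpy as np

def minkowski(x, y, q):
    """Minkowski distance between two feature vectors."""
    return np.sum(np.abs(x - y) ** q) ** (1.0 / q)

x = np.array([1.0, 2.0, 3.0])   # made-up points
y = np.array([4.0, 0.0, 3.0])

print("Manhattan (q=1):", minkowski(x, y, 1))   # 3 + 2 + 0 = 5
print("Euclidean (q=2):", minkowski(x, y, 2))   # sqrt(9 + 4 + 0) ~ 3.61

# Standardise features (mean = 0, std = 1) so magnitude/units
# (e.g. body height vs. toe length) do not dominate the distance.
X = np.array([[180.0, 2.5],
              [170.0, 2.7],
              [165.0, 2.2]])
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_std)
```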
Linkage criteria
Represent cluster location:
- Nearest (single) neighbour – sensitive to noise and outliers
- Farthest (complete) neighbour – sensitive to noise and outliers, favours globular shapes
- Average – average over all pairwise point distances (a middle ground between single and complete)
- Centroid – virtual "average" point based on each feature
- Medoid – real "median" point
Example calculation
[Figure: two clusters of points, A and B, plotted on a grid (x: 1–6, y: 1–4)]

Distance comparison (a code sketch follows the table)
Linkage     Manhattan      Euclidean
Single      4              sqrt(8)  = 2.83
Complete    7              sqrt(29) = 5.39
Average     48/9 = 5.33    4.00
Centroid    16/3 = 5.33    sqrt(153/9) = 4.12
Medoid      6              sqrt(18) = 4.24
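The coordinates of clusters A and B are not reproduced here, so the sketch below uses two made-up clusters; it only illustrates how each linkage in the table can be computed (with SciPy's cdist), and its numbers will not match the table.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two made-up clusters (the slide's A and B coordinates are not given here).
A = np.array([[1.0, 1.0], [2.0, 1.0], [2.0, 2.0]])
B = np.array([[5.0, 3.0], [6.0, 3.0], [6.0, 4.0]])

for metric in ("cityblock", "euclidean"):        # Manhattan / Euclidean
    D = cdist(A, B, metric=metric)               # all pairwise A-to-B distances
    single   = D.min()                           # nearest (single) neighbour
    complete = D.max()                           # farthest (complete) neighbour
    average  = D.mean()                          # average over all pairs
    centroid = cdist(A.mean(axis=0, keepdims=True),
                     B.mean(axis=0, keepdims=True), metric=metric)[0, 0]
    # medoid = real point minimising total distance to its own cluster
    med_A = A[cdist(A, A, metric=metric).sum(axis=1).argmin()]
    med_B = B[cdist(B, B, metric=metric).sum(axis=1).argmin()]
    medoid = cdist(med_A[None, :], med_B[None, :], metric=metric)[0, 0]
    print(f"{metric}: single={single:.2f} complete={complete:.2f} "
          f"average={average:.2f} centroid={centroid:.2f} medoid={medoid:.2f}")
```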
Hierarchical
(Hierarchical) tree of nested clusters
Levels are steps in the clustering process
Agglomerative vs. divisive – "bottom-up" vs. "top-down"
Time complexity at least n²; divisive is often more computationally demanding
Dendrogram
Stepwise merge or split
[Figure: dendrogram over points a–e, annotated with Root, Branch points, Leaves and a Threshold line]
Root – starting point, containing all points
Branch point – splitting/merging point
Leaf – a single point in its own cluster
Threshold – selected limit giving the best number of clusters (see the sketch below)
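A sketch of agglomerative clustering and a dendrogram with SciPy, on made-up data. Average linkage with the cityblock (Manhattan) metric is used here because SciPy's centroid method assumes Euclidean distances; the threshold value 3.0 is arbitrary.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, size=(5, 2)),     # two made-up groups
               rng.normal(4, 0.5, size=(5, 2))])

# Agglomerative ("bottom-up") clustering; Z encodes the tree of nested clusters.
Z = linkage(X, method="average", metric="cityblock")

# Cut the tree at a chosen threshold to get flat cluster labels.
labels = fcluster(Z, t=3.0, criterion="distance")
print(labels)

dendrogram(Z)                        # leaves at the bottom, root at the top
plt.axhline(3.0, linestyle="--")     # the chosen threshold
plt.show()
```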
Example calculation – Revisited
Manhattan metric with centroid linkage

Initial distance matrix for points a–e:
      a    b    c    d    e
a     -    1    4    5    5
b     1    -    2    6    6
c     4    2    -    7    7
d     5    6    7    -    2
e     5    6    7    2    -
The smallest distance is d(a, b) = 1, so a and b are merged into cluster ab.
[Figure: points a–e on a grid (x: 1–6, y: 1–4)]
Example calculation – Revisited (continued)
Manhattan metric with centroid linkage

After merging a and b:
      ab    c     d     e
ab    -     3.5   5.5   5.5
c     3.5   -     7     7
d     5.5   7     -     2
e     5.5   7     2     -
The smallest distance is d(d, e) = 2, so d and e are merged into cluster de.
[Figure: clusters ab, c, d, e on the grid]
Example calculation – Revisited (continued)
Manhattan metric with centroid linkage

After merging d and e:
      ab    c     de
ab    -     4.5   5
c     4.5   -     7
de    5     7     -
The smallest distance is d(ab, c) = 4.5, so ab and c are merged into cluster abc.
[Figure: clusters ab, c, de on the grid]
Example calculation – Revisited (continued)
Manhattan metric with centroid linkage
After merging ab and c, two clusters remain: abc and de (a code sketch of this stepwise procedure follows).
[Figure: clusters abc and de on the grid]
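A sketch of the stepwise procedure above (Manhattan metric, centroid linkage, merge the closest pair each step). The coordinates of a–e are not given in the slides, so the ones below are made up; the printed distance matrices will therefore not match the numbers above.

```python
import numpy as np

# Made-up coordinates for a-e (the slide's actual points are not reproduced).
points = {"a": np.array([1.0, 1.0]), "b": np.array([2.0, 1.0]),
          "c": np.array([3.0, 2.0]), "d": np.array([5.0, 4.0]),
          "e": np.array([6.0, 3.0])}
clusters = {name: [name] for name in points}     # start: every point is a cluster

def centroid(members):
    """Virtual 'average' point of a cluster, feature by feature."""
    return np.mean([points[m] for m in members], axis=0)

def manhattan(p, q):
    return np.abs(p - q).sum()

while len(clusters) > 2:                         # stop at two clusters, as above
    names = list(clusters)
    best = None
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            d = manhattan(centroid(clusters[names[i]]),
                          centroid(clusters[names[j]]))
            print(f"d({names[i]}, {names[j]}) = {d:.2f}")
            if best is None or d < best[0]:
                best = (d, names[i], names[j])
    d, ci, cj = best
    print(f"-> merge {ci} and {cj} (distance {d:.2f})\n")
    clusters[ci + cj] = clusters.pop(ci) + clusters.pop(cj)

print("final clusters:", clusters)
```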
More detailed example
k-means
Partitional clustering method
Specify the number of clusters k; find the most compact solution
Ordinarily time complexity O(n^(dk+1)) for d features and k clusters
Faster algorithms exist (Lloyd's algorithm is linear in the number of points per iteration!)
But k-means falls into local minima (several trials needed), only handles numeric data, and redundancies are not excluded
k-means – Initialise
Initialise k cluster centroids:
- Randomly – may give different clusters each run, so do multiple runs
- First centroid random (or the average point), then the remaining centroids as the most distant points – may select outliers
- k-means++ – the first centroid is fully random, the others are chosen randomly with probability proportional to D², the squared distance to the nearest existing centroid (see the sketch below)
then …
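A sketch of the k-means++ initialisation described above, on made-up data; the function name kmeans_pp_init is mine, not from any library.

```python
import numpy as np

def kmeans_pp_init(X, k, rng):
    """k-means++: first centroid fully random, the rest drawn with
    probability proportional to D^2 (squared distance to nearest centroid)."""
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centroids], axis=0)
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) + 3 * rng.integers(0, 4, size=(100, 1))  # made-up data
print(kmeans_pp_init(X, k=4, rng=rng))
```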
k-means – Iterate
Iterate:
- Calculate centroids
- Calculate distances to centroids
- Move points to the closest centroid
Repeat until either: no points move clusters, fewer than X% of points move, or the maximum number of iterations is reached (see the sketch below)
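A minimal sketch of the iteration loop (Lloyd's algorithm) on made-up data with a simple random initialisation; the stopping rule here is "no points moved" or a maximum number of iterations.

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.5, size=(50, 2)) for m in (0, 3, 6)])  # made-up data
k, max_iter = 3, 100

centroids = X[rng.choice(len(X), size=k, replace=False)]  # random initialisation
labels = np.full(len(X), -1)

for it in range(max_iter):
    # distances from every point to every centroid
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    new_labels = dists.argmin(axis=1)             # move points to closest centroid
    if np.array_equal(new_labels, labels):
        break                                     # no points moved: converged
    labels = new_labels
    # recompute each centroid as the mean of its points (keep old one if empty)
    centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                          else centroids[j] for j in range(k)])

print("iterations:", it, "\ncentroids:\n", centroids)
```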
k-means – Issues
Empty clusters – replace the centroid with the point farthest from any centroid, or with a point from the cluster with the highest Sum of Squared Errors (SSE)
Outliers – points with a high contribution to the cluster SSE, and more; but in some fields, like finance, outliers are important
Visualisation – k=10, white crosses are centroids
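A sketch reproducing that kind of plot on synthetic data: points coloured by cluster with white crosses at the k = 10 centroids, plus the per-point SSE contributions that flag outlier candidates (the data and parameters are made up).

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = 0.7 * rng.normal(size=(500, 2)) + rng.integers(0, 10, size=(500, 1))  # synthetic

km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X)

# Each point's squared distance to its own centroid = its contribution to the SSE.
sse_contrib = np.sum((X - km.cluster_centers_[km.labels_]) ** 2, axis=1)
print("outlier candidates (largest SSE contributions):",
      np.argsort(sse_contrib)[-5:])

plt.scatter(X[:, 0], X[:, 1], c=km.labels_, s=10)
plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1],
            marker="x", c="white", s=100)
plt.gca().set_facecolor("0.4")   # darker background so the white crosses show
plt.show()
```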
Silhouette
What k should we use? Validates performance based on intra- and inter-cluster distances
a(i) – average dissimilarity of point i to the other data in its cluster
b(i) – lowest average dissimilarity of point i to any cluster it does not belong to
s(i) = (b(i) − a(i)) / max(a(i), b(i))
Values lie in [-1, 1]
Calculated for each point, so very time-demanding!
Calinski-Harabasz Index
Faster, better for large data
Performance based on average intra- and inter-cluster dispersion (traces of the scatter matrices):
CH = [Tr(B_k) / (k − 1)] / [Tr(W_k) / (n − k)]
where B_k and W_k are the between- and within-cluster dispersion matrices, n the number of points and k the number of clusters
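A sketch of choosing k with scikit-learn's silhouette_score and calinski_harabasz_score on made-up data; both are computed for several candidate values of k and the best-scoring k is kept.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, calinski_harabasz_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 0.6, size=(100, 2)) for m in (0, 4, 8)])  # made-up data

for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(f"k={k}  silhouette={silhouette_score(X, labels):.3f}  "
          f"CH={calinski_harabasz_score(X, labels):.1f}")
```

Higher is better for both indices; the silhouette is the slower of the two, since it needs pairwise distances for every point.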
Postprocessing
We may still want to improve the SSE of our results
Increase the number of clusters (decreases SSE):
- Split the cluster with the largest SSE or standard deviation (see the sketch after this list)
- Open a new cluster using the point most distant from any cluster
Decrease the number of clusters (with the smallest SSE increase):
- Disperse a cluster and reassign its points to the clusters whose SSE increases the least
- Merge the two clusters with the closest centroids
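A sketch of the first post-processing step listed above (split the cluster with the largest SSE), done here by re-running 2-means on that cluster's points only; the data, cluster count, and helper logic are made up.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(m, s, size=(100, 2))
               for m, s in ((0, 0.4), (5, 0.4), (10, 1.5))])  # one "wide" cluster

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Per-cluster SSE: sum of squared distances of each cluster's points to its centroid.
sse = np.array([np.sum((X[km.labels_ == j] - km.cluster_centers_[j]) ** 2)
                for j in range(km.n_clusters)])
worst = int(np.argmax(sse))
print("per-cluster SSE:", sse.round(1), "-> splitting cluster", worst)

# Split the worst cluster by running 2-means on its points only.
sub = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X[km.labels_ == worst])
print("new sub-centroids:\n", sub.cluster_centers_)
```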