

SLIDE 1

Clustering

Lecture notes

SLIDE 2

Clustering is

  • an exploratory, unsupervised method
  • a way of finding structure in data
  • used for understanding and/or summarising data

Data within a cluster are similar to each other, and dissimilar to data in other clusters.

Hard vs. Soft/Fuzzy vs. Exclusive

[Figure: hard vs. fuzzy/soft clustering, with example membership values 70%, 15%, 13%, 2%]

SLIDE 3

Types of clusters

  • Well separated vs. overlapping
  • Centre-based – based on distance to cluster centres
  • Contiguity-based – points are closer to points in their own cluster than to any other cluster
  • Density-based – high-density areas separated by low-density areas
  • Conceptual – points share some general attribute; intersections belong to both clusters

SLIDE 4

The notion of a cluster is ambiguous.

SLIDE 5

Applications

  • Biology – compare species in different environments
  • Medicine – group variants of illnesses or similar genes; differentiate tissues in PET scans
  • Marketing – consumers with similar shopping habits; grouping shopping items
  • Social sciences – crime analysis (greatest incidences of different crimes); poll data
  • Computer science – data compression; anomaly detection
  • Internet – web searches (categories of results); social media analysis
  • Other – locating network towers for optimum signal cover; Netflix (clustering similar viewing habits); grouping skulls from archaeological digs based on measurements

SLIDE 6
"Distance" metrics

  • Minkowski metric, for n features and points x, y:
    d(x, y) = ( Σ_{i=1..n} |x_i − y_i|^q )^(1/q)
    – Two common cases: Manhattan, q = 1 ("cityblock"); Euclidean, q = 2 ("as the crow flies")
    – Bad for high-dimensional data
    – Magnitude and units affect the result (e.g. body height vs. toe length)
      → standardise (mean = 0, std = 1), though this may affect variability

Other metrics:

  • Mahalanobis distance – scales by the covariance matrix, removing the effect of redundant (correlated) features
  • Pearson correlation (unit-independent) – covariance(x, y) / [std(x) std(y)]
  • Binary data – Russell, Dice, Yule indices, …
  • Cosine (documents: keywords)
  • Gower's distance (mixed data types)
  • Alternatives (squared distances): squared Euclidean, squared Pearson
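As a rough illustration (not from the slides), here is a minimal Python sketch of these metrics using NumPy and SciPy; the points x and y are hypothetical:

    import numpy as np
    from scipy.spatial import distance

    x = np.array([1.0, 2.0, 3.0])
    y = np.array([4.0, 0.0, 3.0])

    print(distance.cityblock(x, y))        # Manhattan (q = 1): 5.0
    print(distance.euclidean(x, y))        # Euclidean (q = 2): ~3.61
    print(distance.minkowski(x, y, p=3))   # general Minkowski with q = 3
    print(distance.cosine(x, y))           # cosine distance (1 - cosine similarity)
    print(1 - np.corrcoef(x, y)[0, 1])     # Pearson-based dissimilarity (unit-independent)

    # Standardise features (mean = 0, std = 1) so magnitude/units don't dominate
    data = np.array([[180.0, 2.5], [165.0, 2.2], [172.0, 2.8]])  # e.g. height vs. toe length
    standardised = (data - data.mean(axis=0)) / data.std(axis=0)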

SLIDE 7

Linkage criteria

How to represent a cluster's location when measuring distances between clusters:

  • nearest (single) neighbour – sensitive to noise and outliers
  • farthest (complete) neighbour – sensitive to noise and outliers; favours globular cluster shapes
  • average – average over all point-to-point distances (a middle ground between single and complete)
  • centroid – virtual "average" point, computed per feature
  • medoid – a real, "median" point of the cluster

SLIDE 8

Example calculation

[Figure: two clusters, A and B, plotted on a grid (x: 1-6, y: 1-4)]

SLIDE 9

Distance comparison

Linkage     Manhattan        Euclidean
Single      4                sqrt(8)  ≈ 2.83
Complete    7                sqrt(29) ≈ 5.39
Average     48/9 ≈ 5.33      4.00
Centroid    16/3 ≈ 5.33      sqrt(153/9) ≈ 4.12
Medoid      6                sqrt(18) ≈ 4.24
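The original points A and B are not recoverable from these notes, so as a hedged sketch the snippet below computes all five linkage distances for two hypothetical clusters; the printed numbers will therefore differ from the table above.

    import numpy as np
    from scipy.spatial.distance import cdist

    A = np.array([[1.0, 1.0], [2.0, 1.0], [2.0, 2.0]])   # hypothetical cluster A
    B = np.array([[5.0, 3.0], [5.0, 4.0], [6.0, 3.0]])   # hypothetical cluster B

    for metric in ("cityblock", "euclidean"):
        D = cdist(A, B, metric=metric)        # all pairwise A-to-B distances
        single = D.min()                      # nearest (single) neighbour
        complete = D.max()                    # farthest (complete) neighbour
        average = D.mean()                    # average over all pairs
        centroid = cdist(A.mean(axis=0, keepdims=True),
                         B.mean(axis=0, keepdims=True), metric=metric)[0, 0]
        # medoid: the real point in each cluster minimising total distance within it
        mA = A[cdist(A, A, metric=metric).sum(axis=1).argmin()]
        mB = B[cdist(B, B, metric=metric).sum(axis=1).argmin()]
        medoid = cdist(mA[None, :], mB[None, :], metric=metric)[0, 0]
        print(metric, single, complete, average, centroid, medoid)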

SLIDE 10

Hierarchical

  • Builds a (hierarchical) tree of nested clusters; levels are steps in the clustering process
  • Agglomerative vs. divisive: "bottom-up" vs. "top-down"
  • Time complexity at least O(n²); divisive is often more computationally demanding
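A minimal SciPy sketch of agglomerative ("bottom-up") clustering on hypothetical two-blob data, not the slide's example:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, dendrogram
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 0.5, (10, 2)),    # blob 1
                   rng.normal(4, 0.5, (10, 2))])   # blob 2

    # repeatedly merge the two closest clusters
    Z = linkage(X, method="average", metric="cityblock")  # also "single", "complete";
                                                          # "centroid" requires Euclidean
    dendrogram(Z)   # the tree of nested clusters
    plt.show()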

SLIDE 11

Dendrogram

Stepwise merge or split.

[Figure: dendrogram over points a-e, with merge heights on a 1-9 scale]

SLIDE 12

Dendrogram

[Figure: dendrogram with the root, branch points, leaves, and a threshold line marked]

  • Root – starting point containing all points
  • Branch point – a splitting/merging point
  • Leaf – a single point in its own cluster
  • Threshold – the selected cut level giving the best number of clusters (see the sketch below)
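A sketch of cutting the tree at a threshold with SciPy's fcluster, on the same kind of hypothetical data as above:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 0.5, (10, 2)), rng.normal(4, 0.5, (10, 2))])
    Z = linkage(X, method="average")

    labels = fcluster(Z, t=3.0, criterion="distance")   # cut at a merge-distance threshold
    labels_k = fcluster(Z, t=2, criterion="maxclust")   # or request the number of clusters directly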

SLIDE 13

Example calculation – Revisited

Manhattan metric with centroid linkage. Pairwise distance matrix for points a-e:

     a   b   c   d   e
a    -   1   4   5   5
b    1   -   2   6   6
c    4   2   -   7   7
d    5   6   7   -   2
e    5   6   7   2   -

[Figure: points a-e on a grid (x: 1-6, y: 1-4); a and b (distance 1) merge first]
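As a sketch of the procedure these slides walk through, the loop below performs centroid-linkage agglomeration under the Manhattan metric; the coordinates for a-e are hypothetical, since the original points are not recoverable from the text.

    import numpy as np
    from itertools import combinations

    # hypothetical coordinates for a-e (the slide's exact points are unknown)
    clusters = {name: [np.array(p, float)]
                for name, p in {"a": (1, 1), "b": (2, 1), "c": (3, 2),
                                "d": (5, 3), "e": (6, 4)}.items()}

    def centroid(members):
        return np.mean(members, axis=0)   # virtual "average" point, per feature

    while len(clusters) > 1:
        # merge the pair of clusters whose centroids are closest (Manhattan)
        i, j = min(combinations(clusters, 2),
                   key=lambda ij: np.abs(centroid(clusters[ij[0]])
                                         - centroid(clusters[ij[1]])).sum())
        print("merge:", i, "+", j)
        clusters[i + j] = clusters.pop(i) + clusters.pop(j)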

SLIDE 14

Example calculation – Revisited

Manhattan metric with centroid linkage. After merging a and b into ab:

     ab    c    d    e
ab   -     3.5  5.5  5.5
c    3.5   -    7    7
d    5.5   7    -    2
e    5.5   7    2    -

[Figure: cluster ab and points c, d, e on the grid; d and e (distance 2) merge next]

SLIDE 15

Example calculation – Revisited

Manhattan metric with centroid linkage. After merging d and e into de:

     ab    c    de
ab   -     4.5  5
c    4.5   -    7
de   5     7    -

[Figure: clusters ab, de and point c on the grid; ab and c (distance 4.5) merge next]

SLIDE 16

Example calculation – Revisited

Manhattan metric with centroid linkage. After merging ab and c into abc, two clusters remain: abc and de.

[Figure: clusters abc and de on the grid]

SLIDE 17

More detailed example

SLIDE 18

k-means

  • Partitional clustering method: specify the number of clusters k, find the most compact solution
  • Naive time complexity O(n^(dk+1)) for n points, d features, k clusters
  • Faster algorithms exist: Lloyd's algorithm is linear in n per iteration
  • But k-means falls into local minima (several trials needed, as sketched below), handles only numeric data, and does not exclude redundant features
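A minimal scikit-learn sketch on hypothetical data; n_init runs several random restarts to guard against local minima:

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(m, 0.5, (50, 2)) for m in (0, 4, 8)])  # three blobs

    km = KMeans(n_clusters=3, init="k-means++", n_init=10).fit(X)
    print(km.cluster_centers_)   # centroids of the most compact solution found
    print(km.inertia_)           # within-cluster sum of squared errors (SSE)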

SLIDE 19

k-means – Initialise

Initialise the k cluster centroids:

  • fully at random – may land in unhelpful positions, so do multiple runs
  • first centroid random (or the data average), then pick each remaining centroid as the point most distant from the existing ones – may select outliers
  • k-means++ – the first centroid is fully random; each subsequent one is sampled with probability proportional to D², its squared distance to the nearest centroid chosen so far (sketched below)

… then iterate (next slide).
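A sketch of the k-means++ seeding rule described above; the helper name is mine, not from the slides:

    import numpy as np

    def kmeans_pp_init(X, k, rng=None):
        if rng is None:
            rng = np.random.default_rng()
        centroids = [X[rng.integers(len(X))]]        # first centroid: fully random
        for _ in range(k - 1):
            # squared distance from each point to its nearest chosen centroid
            d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centroids], axis=0)
            probs = d2 / d2.sum()                    # probability proportional to D^2
            centroids.append(X[rng.choice(len(X), p=probs)])
        return np.array(centroids)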

SLIDE 20

k-means – Iterate

Iterate:

  1. Calculate centroids
  2. Calculate distances to centroids
  3. Move points to the closest centroid

Repeat until either:
  • no points move clusters
  • fewer than X% of points move
  • the maximum number of iterations is reached
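A NumPy sketch of this loop (Lloyd's algorithm); the stopping rule here is the "no points move" criterion, and empty clusters are not handled (see the issues slide):

    import numpy as np

    def lloyd(X, centroids, max_iter=100):
        labels = None
        for _ in range(max_iter):
            # distances from every point to every centroid
            d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
            new_labels = d2.argmin(axis=1)           # move points to the closest centroid
            if labels is not None and (new_labels == labels).all():
                break                                # no points moved: converged
            labels = new_labels
            # recalculate centroids (assumes no cluster is empty)
            centroids = np.array([X[labels == j].mean(axis=0)
                                  for j in range(len(centroids))])
        return labels, centroids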

SLIDE 21
SLIDE 22

k-means – issues

  • Empty clusters – replace the empty centroid with the point farthest from the existing clusters, or with a point from the cluster with the highest Sum of Squared Errors (SSE); one repair strategy is sketched below
  • Outliers – points with a high contribution to cluster SSE (among other criteria) can be removed; but in some fields, such as finance, outliers are important
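A sketch of the empty-cluster repair mentioned above, reassigning the point that contributes most to SSE; the helper is illustrative, not from the slides:

    import numpy as np

    def repair_empty_clusters(X, labels, centroids):
        for j in range(len(centroids)):
            if not (labels == j).any():                        # cluster j is empty
                # point farthest from its assigned centroid (largest SSE contributor)
                d2 = ((X - centroids[labels]) ** 2).sum(axis=1)
                farthest = int(d2.argmax())
                centroids[j] = X[farthest]
                labels[farthest] = j
        return labels, centroids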

SLIDE 23

Visualisation – k=10, white crosses are centroids

SLIDE 24

Silhouette

What k should we use? The silhouette validates performance based on intra- and inter-cluster distances:

  • a(i) – average dissimilarity of point i to the other data in its cluster
  • b(i) – lowest average dissimilarity of point i to any cluster it is not a member of
  • s(i) = (b(i) − a(i)) / max(a(i), b(i)), with values in [-1, 1]

Calculated for each point, so very time-demanding!
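A scikit-learn sketch for choosing k by the mean silhouette, on hypothetical data:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(m, 0.5, (50, 2)) for m in (0, 4, 8)])

    for k in range(2, 7):
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(X)
        print(k, silhouette_score(X, labels))   # mean s(i) in [-1, 1]; higher is better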

SLIDE 25

Calinski-Harabasz Index

Faster, and better for large data. Measures performance as the ratio of between-cluster to within-cluster dispersion, via the traces (Tr) of the scatter matrices:

CH(k) = [ Tr(B_k) / (k − 1) ] / [ Tr(W_k) / (n − k) ]

where B_k is the between-cluster and W_k the within-cluster dispersion matrix, for n points and k clusters.
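The same k-scan as before with the Calinski-Harabasz index, which needs no pairwise distances and is therefore much cheaper:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import calinski_harabasz_score

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(m, 0.5, (50, 2)) for m in (0, 4, 8)])

    for k in range(2, 7):
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(X)
        print(k, calinski_harabasz_score(X, labels))   # higher is better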

SLIDE 26

Postprocessing

We may still want to improve the SSE of our results.

Increase the number of clusters (decreases SSE):

  • Split the cluster with the largest SSE or standard deviation (sketched below)
  • Open a new cluster using the point most distant from any existing cluster

Decrease the number of clusters (with the smallest SSE increase):

  • Disperse a cluster, reassigning its points to the clusters whose SSE increases the least
  • Merge the two clusters with the closest centroids
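A sketch of the first move, splitting the cluster with the largest SSE by running 2-means on its points; the data is hypothetical:

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(m, 0.5, (50, 2)) for m in (0, 4, 8)])
    km = KMeans(n_clusters=2, n_init=10).fit(X)        # deliberately one cluster too few

    # per-cluster SSE
    sse = [((X[km.labels_ == j] - km.cluster_centers_[j]) ** 2).sum()
           for j in range(km.n_clusters)]
    worst = int(np.argmax(sse))                        # cluster with the largest SSE
    sub = KMeans(n_clusters=2, n_init=10).fit(X[km.labels_ == worst])
    print("split cluster", worst, "into:\n", sub.cluster_centers_)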