

SLIDE 1

Hierarchical Clustering

36-350: Data Mining 25 September 2006

SLIDE 2

Last time...

  • Unsupervised learning problems; finding clusters
  • K means
  • divide into k clusters to minimize within-cluster variance × cluster size
  • local search, local minima
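The k-means procedure recalled above can be sketched in a few lines of Python. This is a minimal illustration of Lloyd's local search, not the course's own code; all names here are my own.

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Lloyd's algorithm: alternate assigning points to the nearest
    center and moving each center to its cluster mean.
    Local search only -- it can get stuck in a local minimum."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # random start (restarts help)
    for _ in range(iters):
        # assignment step: each point goes to its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        # update step: move each center to the mean of its cluster
        new = [tuple(sum(xs) / len(xs) for xs in zip(*cl)) if cl else centers[j]
               for j, cl in enumerate(clusters)]
        if new == centers:  # converged (to a local optimum)
            break
        centers = new
    return centers, clusters

# two well-separated groups: k-means recovers them easily
pts = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.1, 4.9)]
centers, clusters = kmeans(pts, 2)
```

With less well-separated or non-spherical data, different random starts can land in different local minima, which is the limitation the next slide turns to.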
SLIDE 3

Limits of k-Means

  • Local search can get stuck
  • Random starts help
  • Sum-of-squares likes ball-shaped clusters
  • How to pick k?
  • No relations between clusters
SLIDE 4

Hierarchical Clustering

  • Basic idea: cluster the clusters
  • High-level clusters contain multiple low-level clusters

  • Clusters are now related
  • Don’t need to choose k
  • Assumes a hierarchy makes sense...
SLIDE 5

Ward’s Method

  • 1. Start with every point in its own cluster
  • 2. For each pair of clusters, calculate “merging cost” = increase in sum of squares

  • 3. Merge least-costly pair
  • 4. Stop when merging cost takes a big jump
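The four steps above can be sketched in pure Python. This is a hedged illustration, not the lecture's code: it uses the standard identity that the merging cost for clusters A and B equals |A||B|/(|A|+|B|) times the squared distance between their means, and simply records each cost so the "big jump" can be spotted.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(cluster):
    return tuple(sum(xs) / len(xs) for xs in zip(*cluster))

def ward_cost(a, b):
    # increase in total sum of squares if a and b are merged:
    # |A||B| / (|A|+|B|) * squared distance between the cluster means
    na, nb = len(a), len(b)
    return na * nb / (na + nb) * sq_dist(mean(a), mean(b))

def ward(points):
    """Agglomerative clustering; returns the merging-cost history."""
    clusters = [[p] for p in points]          # step 1: all singletons
    history = []  # (number of clusters before the merge, merging cost)
    while len(clusters) > 1:
        # step 2: merging cost for every pair of clusters
        pairs = [(ward_cost(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        cost, i, j = min(pairs)               # step 3: cheapest merge
        history.append((len(clusters), cost))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return history

# two tight groups on a line; the final merge joins the groups
pts = [(0.0,), (0.1,), (0.2,), (5.0,), (5.1,)]
hist = ward(pts)
# step 4: stop where the cost jumps -- here the last merge is far
# costlier than all earlier ones, suggesting 2 clusters
```

The naive all-pairs recomputation is O(n³)-ish; real implementations update pairwise costs incrementally (the Lance-Williams recurrence), but the stopping heuristic is the same.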
SLIDE 6

[Dendrogram over the ocean, tiger, and flower images, alongside a plot of merging cost (0.0–3.0) against number of clusters (2–10)]

Ward’s method applied to the images from lecture 3: ocean, tigers, flowers. The jump in merging cost suggests 3 clusters — almost exactly the right ones, too (but it thinks flower5 is a tiger).

SLIDE 7
  • Don’t have to choose k
  • Sum of squares is generally worse than k-means (for equal k)
  • more constrained search
  • prefers to merge small clusters, all else equal

SLIDE 8

[Figure: the same data clustered by k-Means and by Ward’s] Minimizing the mean distance from the center tends to make spheres, which can be silly; note how Ward’s is less balanced.

SLIDE 9

Single-link clustering

  • 1. Start with every point in its own cluster
  • 2. Calculate gaps between every pair of clusters = distance between the two closest points, one from each cluster

  • 3. Merge clusters with smallest gap
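The three steps above differ from Ward's method only in how pairs are scored: the gap is the minimum point-to-point distance rather than an increase in sum of squares. A minimal pure-Python sketch (my own names, not the lecture's code), merging until a chosen number of clusters remains:

```python
def single_link(points, stop_at):
    """Agglomerate by smallest gap until stop_at clusters remain.
    gap = distance between the two closest points, one from each cluster."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    def gap(a, b):
        # step 2: minimum distance over all cross-cluster point pairs
        return min(sq_dist(p, q) for p in a for q in b)

    clusters = [[p] for p in points]          # step 1: all singletons
    while len(clusters) > stop_at:
        # step 3: merge the pair with the smallest gap
        _, i, j = min((gap(clusters[i], clusters[j]), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters

# evenly spaced points "chain" together: {0, 1, 2} joins before {10}
clusters = single_link([(0.0,), (1.0,), (2.0,), (10.0,)], stop_at=2)
```

The chaining behavior shown in the example is exactly what makes single-link succeed on elongated clusters and fail on the examples coming up on slide 11.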
SLIDE 10

[Figure: the same data clustered by k-Means, Ward’s, and single-link]

SLIDE 11

Examples where single-link doesn’t work so well

[Figure: k-Means, Ward’s, and single-link on these examples]