
SLIDE 1

Clustering

kMeans, Expectation Maximization, Self-Organizing Maps

SLIDE 2

Outline

  • K-means clustering
  • Hierarchical clustering
  • Incremental clustering
  • Probability-based clustering
  • Self-Organising Maps
SLIDE 3

Classification vs. Clustering

Classification: Supervised learning (labels given)

SLIDE 4

Classification vs. Clustering

Clustering: Unsupervised learning. No labels; find a “natural” grouping of instances.

[Figure: scatter plot of instances, labels unknown]

SLIDE 5

Many Applications!

  • Basically, everywhere labels are unknown, uncertain, or too expensive
  • Marketing: find groups of similar customers
  • Astronomy: find groups of similar stars, galaxies
  • Earthquake studies: cluster earthquake epicenters along continent faults
  • Genomics: find groups of genes with similar expressions
SLIDE 6

Clustering Methods: Terminology

Non-overlapping vs. overlapping

SLIDE 7

Clustering Methods: Terminology

Top-down vs. bottom-up (agglomerative)

SLIDE 8

Clustering Methods: Terminology

Hierarchical

SLIDE 9

Clustering Methods: Terminology

Deterministic vs. probabilistic

SLIDE 10

K-Means Clustering

SLIDE 11–12

K-means clustering (k=3)

[Figure: data points in the X–Y plane; three initial centers k1, k2, k3]

Pick k random points: initial cluster centers

SLIDE 13

K-means clustering (k=3)

[Figure: each point assigned to the nearest of k1, k2, k3]

Assign each point to nearest cluster center

SLIDE 14–17

K-means clustering (k=3)

[Figure: centers k1, k2, k3 moving step by step to the mean of their clusters]

Move cluster centers to mean of each cluster

SLIDE 18–22

K-means clustering (k=3)

[Figure: points switching, one by one, to a nearer center k1, k2, k3]

Reassign points to nearest cluster center

SLIDE 23–26

K-means clustering (k=3)

[Figure: assignments and centers stabilizing over the final iterations]

Repeat steps 3–4 until cluster centers converge (they stop moving, or hardly move)

SLIDE 27

K-means

Works with numeric data only

1) Pick K random points: initial cluster centers
2) Assign every item to its nearest cluster center (e.g. using Euclidean distance)
3) Move each cluster center to the mean of its assigned items
4) Repeat steps 2–3 until convergence (change in cluster assignments less than a threshold)
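To make the four steps concrete, here is a minimal NumPy sketch of the loop above; the function name and the exact convergence test (assignments no longer changing) are illustrative choices, not from the slides:

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal k-means; X is an (n, d) array of numeric data."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # 1) pick K random points as the initial cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    assignments = np.full(len(X), -1)
    for _ in range(max_iter):
        # 2) assign every item to its nearest center (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_assignments = dists.argmin(axis=1)
        # 4) converged: cluster assignments no longer change
        if np.array_equal(new_assignments, assignments):
            break
        assignments = new_assignments
        # 3) move each center to the mean of its assigned items
        for j in range(k):
            if np.any(assignments == j):
                centers[j] = X[assignments == j].mean(axis=0)
    return centers, assignments
```

Testing assignment changes rather than center movement is just one of several reasonable stopping criteria.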

SLIDE 28

K-means clustering: another example

http://www.youtube.com/watch?v=zaKjh2N8jN4#!

SLIDE 29

Discussion

  • Result can vary significantly depending on the initial choice of centers
  • Can get trapped in a local minimum
  • Example: [Figure: instances and initial cluster centers leading to a poor local optimum]
  • To increase the chance of finding the global optimum: restart with different random seeds (see the sketch below)
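scikit-learn's KMeans builds in exactly this restart strategy; a one-line sketch (assuming the data sit in an array X):

```python
from sklearn.cluster import KMeans

# n_init=10 restarts k-means from 10 different random initializations
# and keeps the run with the lowest within-cluster sum of squares
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
```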

SLIDE 30

K-means clustering summary

Disadvantages

  • Must pick the number of clusters beforehand
  • All items forced into exactly one cluster
  • Sensitive to outliers

Advantages

  • Simple, understandable
  • Items automatically assigned to clusters

SLIDE 31–33

K-means: variations

  • K-medoids – instead of the mean, use the median of each cluster
  • Mean of 1, 3, 5, 7, 1009 is 205
  • Median of 1, 3, 5, 7, 1009 is 5
  • For large databases, use sampling
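A quick check of those two numbers in plain Python, showing why the median shrugs off the outlier:

```python
import statistics

data = [1, 3, 5, 7, 1009]
print(statistics.mean(data))    # 205 -- dragged away from the bulk by the outlier 1009
print(statistics.median(data))  # 5   -- unaffected by the single extreme value
```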

SLIDE 34

Hierarchical Clustering

SLIDE 35

Bottom-up vs top-down clustering

  • Bottom-up / agglomerative
    • Start with single-instance clusters
    • At each step, join the two “closest” clusters
  • Top-down
    • Start with one universal cluster
    • Split it into two clusters
    • Proceed recursively on each subset

[Figure: dendrogram over instances A–F; B,C merge into BC; D,E into DE; DE,F into DEF; BC,DEF into BCDEF; finally ABCDEF]

SLIDE 36

Hierarchical clustering

  • Hierarchical clustering is represented as a dendrogram
    • a tree structure containing hierarchical clusters
    • clusters in the leaves, unions of child clusters in the inner nodes
SLIDE 37

Distance Between Clusters

  • Centroid: distance between centroids
    • Sometimes hard to compute (e.g. mean of molecules?)
  • Single Link: smallest distance between points
  • Complete Link: largest distance between points
  • Average Link: average distance between points

[Figure: two clusters {A, B} and {C, D}; single link distance = 1, complete link distance = 2, average link distance = 1.5, where the average link is (d(A,C) + d(A,D) + d(B,C) + d(B,D)) / 4]

SLIDE 38

Distance Between Clusters

  • Centroid: distance between centroids
    • Sometimes hard to compute (e.g. mean of molecules?)
  • Single Link: smallest distance between points
  • Complete Link: largest distance between points
  • Average Link: average distance between points
  • Group-average: group two clusters into one, then take the average distance between all points (incl. d(A,B) & d(C,D))
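These criteria correspond directly to the `method` argument of SciPy's agglomerative clustering; a small sketch (the 10×2 random data are just for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

X = np.random.rand(10, 2)                        # ten 2-D instances
Z = linkage(X, method="average")                 # also: "single", "complete", "centroid"
labels = fcluster(Z, t=3, criterion="maxclust")  # cut the dendrogram into 3 clusters
dendrogram(Z)                                    # draw the tree (requires matplotlib)
```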

SLIDE 39

Incremental Clustering

SLIDE 40

Clustering weather data

ID  Outlook   Temp.  Humidity  Windy
A   Sunny     Hot    High      False
B   Sunny     Hot    High      True
C   Overcast  Hot    High      False
D   Rainy     Mild   High      False
E   Rainy     Cool   Normal    False
F   Rainy     Cool   Normal    True
G   Overcast  Cool   Normal    True
H   Sunny     Mild   High      False
I   Sunny     Cool   Normal    False
J   Rainy     Mild   Normal    False
K   Sunny     Mild   Normal    True
L   Overcast  Mild   High      True
M   Overcast  Hot    Normal    False
N   Rainy     Mild   High      True

[Figure: clustering tree after the first instance]

SLIDE 41

Clustering weather data

[Weather data table as on slide 40]

[Figure: clustering tree with clusters 1 and 2]

Start new clusters, up to a point

SLIDE 42

Category Utility

  • Category utility: overall quality of a clustering
  • A quadratic loss function
    • nominal: clusters Cl, attributes ai, values vij:

      CU(C1, ..., Ck) = (1/k) Σl Pr[Cl] Σi Σj ( Pr[ai=vij | Cl]^2 − Pr[ai=vij]^2 )

    • numeric: similar, assume Gaussian distribution
  • Intuitively:
    • good clusters let us predict the values of new data points: Pr[ai=vij | Ci] > Pr[ai=vij]
    • the 1/k factor: a penalty for using many clusters (avoids overfitting)
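A direct transcription of that formula for nominal attributes; the data layout (each cluster a list of dicts mapping attribute to value) is an assumption for illustration:

```python
from collections import Counter

def category_utility(clusters, attributes):
    """CU = (1/k) * sum_l P(Cl) * sum_i sum_j (P(ai=vij|Cl)^2 - P(ai=vij)^2)."""
    n = sum(len(c) for c in clusters)
    k = len(clusters)
    # Pr[ai=vij]^2 summed over all values vij, per attribute (the baseline term)
    baseline = {}
    for a in attributes:
        counts = Counter(x[a] for c in clusters for x in c)
        baseline[a] = sum((c / n) ** 2 for c in counts.values())
    total = 0.0
    for cluster in clusters:
        p_cl = len(cluster) / n                      # Pr[Cl]
        for a in attributes:
            counts = Counter(x[a] for x in cluster)  # value counts within Cl
            within = sum((c / len(cluster)) ** 2 for c in counts.values())
            total += p_cl * (within - baseline[a])
    return total / k                                 # penalty for many clusters
```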

SLIDE 43

Clustering weather data

[Weather data table as on slide 40]

[Figure: clustering tree, cluster 1]

SLIDE 44

Clustering weather data

[Weather data table as on slide 40]

[Figure: clustering tree, clusters 1 and 2]

  • Max. number depends on k

SLIDE 45

Clustering weather data

[Weather data table as on slide 40]

[Figure: clustering tree, clusters 1, 2 and 3]

  • Max. number depends on k
  • Join with most similar leaf: new cluster

SLIDE 46

Clustering weather data

[Weather data table as on slide 40]

[Figure: clustering tree, stage 4]

SLIDE 47

Clustering weather data

[Weather data table as on slide 40]

[Figure: clustering tree, stages 4 and 5]

A and D equally good: merge

SLIDE 48

Clustering weather data

[Weather data table as on slide 40]

[Figure: clustering tree, stages 4 and 5]

A and D equally good: merge
Consider splitting the best host if merging doesn’t help

SLIDE 49

Final hierarchy

[Weather data rows A–D, as on slide 40]

[Figure: final hierarchy]

Note that A and B are actually very similar, but end up in different clusters.

SLIDE 50

Incremental clustering

  • For large, regularly updated databases
  • Start with a tree consisting of an empty root node
  • Add instances one by one, updating the tree appropriately at each stage:
    • form a new leaf, or
    • join the instance with the most similar leaf: a new node (cluster), or
    • merge existing leaves (move down one level), or
    • split a node into leaves (move up one level)
  • Best decision: category utility
SLIDE 51

Probability-based Clustering

SLIDE 52

Probability-based Clustering

  • Given k clusters, each instance belongs to all clusters with a certain probability
    • mixture model: a set of k distributions (one per cluster)
    • also: each cluster has a prior probability
  • If the correct clustering were known, we would know the parameters and P(Ci) of each cluster: calculate P(Ci|x) using Bayes’ rule
  • Here we must estimate the unknown parameters instead
    • How?
SLIDE 53

EM: Expectation Maximization

  • Finds the parameters of the distributions and the cluster memberships
  • (Random) initialization
    • initial parameters and P(Ci) for each cluster
  • Iterative algorithm:
    • Expectation step: with the current parameters, calculate P(C|x)
    • Maximization step: update the parameters using P(C|x): new parameters and P(Ci)
  • Iterate until converged to a local optimum
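A minimal sketch of these two steps for a 1-D mixture of two Gaussians; the initialization and the fixed iteration count are illustrative simplifications:

```python
import numpy as np
from scipy.stats import norm

def em_gmm_1d(x, n_iter=50):
    """x: 1-D numpy array. Returns means, std devs, priors of 2 components."""
    # initialization: means at the extremes, shared spread, equal priors P(Ci)
    mu = np.array([x.min(), x.max()], dtype=float)
    sigma = np.array([x.std(), x.std()])
    prior = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E step: P(Ci | x) for every point, via Bayes' rule
        dens = np.stack([p * norm.pdf(x, m, s)
                         for p, m, s in zip(prior, mu, sigma)])  # shape (2, n)
        resp = dens / dens.sum(axis=0)
        # M step: update parameters using the soft memberships P(Ci | x)
        weight = resp.sum(axis=1)
        mu = (resp * x).sum(axis=1) / weight
        sigma = np.sqrt((resp * (x - mu[:, None]) ** 2).sum(axis=1) / weight)
        prior = weight / len(x)
    return mu, sigma, prior
```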
SLIDE 54

EM vs K-means

http://www.youtube.com/watch?v=1CWDWmF0i2s

SLIDE 55–56

Quiz

SLIDE 57

Clustering Evaluation

  • Manual inspection
  • Benchmarking on existing labels
  • Cluster quality measures
    • distance measures
    • high similarity within a cluster, low across clusters
SLIDE 58

Goodness of Fit

Given a function that defines the cluster, calculate for each point how well it fits the cluster.
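One standard measure of this kind is the silhouette score, which quantifies how well each point fits its own cluster relative to the nearest other cluster; a sketch with scikit-learn (assuming data X and cluster labels are at hand):

```python
from sklearn.metrics import silhouette_score, silhouette_samples

overall = silhouette_score(X, labels)       # one score for the whole clustering
per_point = silhouette_samples(X, labels)   # goodness of fit for each point
```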

SLIDE 59

How to choose k?

  • One important parameter: k. But how to choose it?
    • domain dependent: we may simply want k clusters
  • Alternative: repeat for several values of k and choose the best (see the sketch below)
  • Example:
    • cluster mammal properties into k different clusters
    • use an MDL-based encoding (an alternative to category utility)
    • each additional cluster introduces a penalty
    • optimal for k=6
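In the same select-over-several-k spirit (though using inertia rather than the MDL encoding from the slide), a common quick check sweeps k and looks for the “elbow”; assuming data X:

```python
from sklearn.cluster import KMeans

# Sweep k and record the within-cluster sum of squares (inertia);
# look for the elbow where adding clusters stops paying off.
scores = {}
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    scores[k] = km.inertia_
```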
SLIDE 60

Self-Organizing Maps

SLIDE 61–62

Self Organizing Map

http://www.youtube.com/watch?v=71wmOT4lHWc

SLIDE 63

Self Organizing Map

  • Applications
    • group similar data together
    • dimensionality reduction
    • data visualization technique
  • Similar to neural networks
    • neurons try to mimic the input vectors
    • the winning neuron (and its neighborhood) gets updated
    • topology preserving, using a neighborhood function

SLIDE 64

SOM Learning Algorithm

  • Initialize the SOM (randomly, or such that dissimilar inputs are mapped far apart)
  • For t from 0 to N:
    • randomly select a training instance
    • find the best matching neuron, e.g. by smallest Euclidean distance
    • scale the neighbors: who counts as a neighbor (hexagons, squares, Gaussian) decreases over time
    • update the neighbors towards the training instance
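A compact NumPy sketch of this loop for a 2-D grid; the grid size, linear decay schedules, and Gaussian neighborhood are illustrative choices:

```python
import numpy as np

def train_som(X, grid=(10, 10), n_steps=1000, lr0=0.5, radius0=5.0, seed=0):
    """Train a 2-D SOM on data X of shape (n, d)."""
    rng = np.random.default_rng(seed)
    h, w = grid
    weights = rng.random((h, w, X.shape[1]))  # random initialization
    coords = np.stack(np.meshgrid(np.arange(h), np.arange(w),
                                  indexing="ij"), axis=-1)
    for t in range(n_steps):
        x = X[rng.integers(len(X))]           # randomly select a training instance
        # best matching unit: smallest Euclidean distance to x
        d = np.linalg.norm(weights - x, axis=2)
        bmu = np.unravel_index(d.argmin(), d.shape)
        # learning rate and neighborhood radius both decrease over time
        frac = 1.0 - t / n_steps
        lr, radius = lr0 * frac, max(radius0 * frac, 1e-3)
        # Gaussian neighborhood around the BMU on the grid
        grid_dist = np.linalg.norm(coords - np.array(bmu), axis=2)
        influence = np.exp(-(grid_dist ** 2) / (2 * radius ** 2))
        # update neighbors towards the training instance
        weights += lr * influence[..., None] * (x - weights)
    return weights
```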
SLIDE 65

Self Organizing Map

  • Neighborhood function to preserve topological properties of the input space
  • Neighbors share the prize (but slightly less)

[Figure: n-dimensional input mapped to the output grid; the winner neuron highlighted]

SLIDE 66

Self Organizing Map

  • Input: uniformly randomly distributed points
  • Output: a map of 20² neurons
  • Training: starting with a large learning rate and neighborhood size, both are gradually decreased to facilitate convergence
  • After learning, neurons with similar weights tend to cluster on the map

SLIDE 67

Discussion

  • Clusters can be interpreted by using supervised learning
    • learn a classifier based on the clusters
  • Decrease dependence between attributes?
    • as a pre-processing step
    • e.g. use principal component analysis
  • Can be used to fill in missing values
  • Key advantage of probabilistic clustering:
    • can estimate the likelihood of the data
    • use it to compare different models objectively