BBM406 Fundamentals of Machine Learning
Lecture 21: Clustering - K-Means
Aykut Erdem // Hacettepe University // Fall 2019
Photo by Unsplash user @foodiesfeed
Last time: Boosting

Idea: given a weak learner, run it multiple times on (reweighted) training data, then let the learned classifiers vote, weighted by their strength.
[slide by Aarti Singh & Barnabas Poczos]

Last time: The AdaBoost Algorithm (a sketch follows below)
[slide by Jiri Matas and Jan Šochman]
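As a refresher, here is a minimal AdaBoost-style sketch of that idea: decision stumps trained on reweighted data, then combined by a strength-weighted vote. It assumes binary labels in {-1, +1}; the stump weak learner and the `n_rounds` parameter are illustrative choices, not part of the original slides.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def adaboost_fit(X, y, n_rounds=50):
        """Train weak learners on reweighted data; return (learners, strengths)."""
        y = np.asarray(y)                # labels assumed to be in {-1, +1}
        n = len(y)
        w = np.full(n, 1.0 / n)          # start with uniform example weights
        learners, alphas = [], []
        for _ in range(n_rounds):
            stump = DecisionTreeClassifier(max_depth=1)
            stump.fit(X, y, sample_weight=w)            # weak learner on reweighted data
            pred = stump.predict(X)
            err = np.sum(w[pred != y]) / np.sum(w)      # weighted training error
            if err >= 0.5:               # no better than chance: stop
                break
            alpha = 0.5 * np.log((1.0 - err) / max(err, 1e-12))  # learner "strength"
            learners.append(stump)
            alphas.append(alpha)
            if err == 0.0:               # perfect learner: nothing left to reweight
                break
            w *= np.exp(-alpha * y * pred)               # upweight the mistakes
            w /= w.sum()
        return learners, alphas

    def adaboost_predict(X, learners, alphas):
        """Let the learned classifiers vote, weighted by their strength."""
        votes = sum(a * h.predict(X) for a, h in zip(alphas, learners))
        return np.sign(votes)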
Clustering

Motivating example: an archaeological dig. Each datum is an artifact location, recorded as Distance East and Distance North.
[figures: scatter plots of artifact locations; axes: Distance East, Distance North]
[slides by Tamara Broderick]
Predicting new labels from old labels

If the artifacts carry group labels (Family A, Family B, Family C), the label of a new artifact can be predicted from the labeled points near it, as in classification.
[figure: artifact locations colored by family; axes: Distance East, Distance North]
[slides by Tamara Broderick]
Why use clustering instead of classification?

Labels are often unavailable in practice: they may change too quickly, it may be expensive to label data, etc. And clustering real data takes care, even when the cartoon picture looks so easy.

Example: social networks [Krivitsky & Handcock 2008]
Datum: a person, represented as a binary vector specifying whether that person has each interest
Similarity: the number of interests shared by two people
[slides by Tamara Broderick]
Example: topic analysis [Blei 2003]
Datum: a word, represented as a binary vector indicating the documents in which it appears
Similarity: how many documents exist where two words co-occur
[slides by Tamara Broderick]
Example: document clustering [Carpineto et al. 2009]
Datum: a document, represented as a vector of topic proportions
Dissimilarity: distance between topic distributions
[slides by Tamara Broderick]
Example: image segmentation [Fei-Fei 2011]
Datum: a pixel, represented by its RGB values and its horizontal and vertical location
Dissimilarity: difference in color + difference in location
[slides by Tamara Broderick]
In practice, one generates candidate clusterings and then evaluates them by some criterion; choosing that criterion and the input parameters (such as the number of clusters) typically requires domain knowledge.
[slides by Tamara Broderick & Andrew Moore]
Datum: a vector of D continuous values

In the dig example each datum has D = 2 features, Distance East and Distance North; for instance, $x_3 = (1.5, 6.2)$. Collecting the N data points as rows gives a data matrix:

          Feature 1 (Distance East)   Feature 2 (Distance North)
    x1            1.2                          5.9
    x2            4.3                          2.1
    x3            1.5                          6.2
    ...
    xN            4.1                          2.3

In general, the n-th datum is $x_n = (x_{n,1}, \ldots, x_{n,D})$.
[slides by Tamara Broderick]
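As a concrete sketch, the data matrix can be stored as an N x D NumPy array (the values are just the example numbers from the table above):

    import numpy as np

    # N x D data matrix: one row per datum, one column per feature
    # (Feature 1 = Distance East, Feature 2 = Distance North)
    X = np.array([
        [1.2, 5.9],   # x1
        [4.3, 2.1],   # x2
        [1.5, 6.2],   # x3
        [4.1, 2.3],   # xN (N = 4 in this toy example)
    ])
    N, D = X.shape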
Dissimilarity: distance "as the crow flies"

How dissimilar are two points, say $x_3$ and $x_{17}$? A natural choice is the (squared) Euclidean distance. With D = 2 features,

    $\mathrm{dis}(x_3, x_{17}) = (x_{3,1} - x_{17,1})^2 + (x_{3,2} - x_{17,2})^2$

and in general, summing over each feature,

    $\mathrm{dis}(x_3, x_{17}) = \sum_{d=1}^{D} (x_{3,d} - x_{17,d})^2$

[slides by Tamara Broderick]
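A one-line sketch of this squared Euclidean distance; the function name `dis` follows the slides' notation, and vectorization over features is the only addition:

    import numpy as np

    def dis(x, y):
        """Squared Euclidean distance between two feature vectors."""
        x, y = np.asarray(x), np.asarray(y)
        return float(np.sum((x - y) ** 2))

    print(dis([1.5, 6.2], [4.1, 2.3]))  # compare x3 and xN from the toy matrix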
Cluster summary

$K$ = number of clusters
Cluster centers $\mu_1, \mu_2, \ldots, \mu_K$: each center lives in the same feature space as the data, e.g. $\mu_1 = (\mu_{1,1}, \mu_{1,2})$
$S_k$ = set of points in cluster $k$; the clusters are $S_1, S_2, \ldots, S_K$
[figure: data points grouped into three clusters with centers µ1, µ2, µ3 marked]
[slides by Tamara Broderick]
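In code, a clustering can be summarized by a K x D array of centers plus one cluster label per point (a minimal sketch; the variable names are ours, not from the slides):

    import numpy as np

    K, D, N = 3, 2, 100
    mu = np.zeros((K, D))             # cluster centers mu_1, ..., mu_K (rows)
    labels = np.zeros(N, dtype=int)   # labels[n] = k  means  x_n is in S_k

    # S_k can then be recovered as the points whose label is k:
    # S_k = X[labels == k]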
Dissimilarity (global)

    $\mathrm{dis}_{\mathrm{global}} = \sum_{k=1}^{K} \; \sum_{n:\, x_n \in S_k} \; \sum_{d=1}^{D} (x_{n,d} - \mu_{k,d})^2$

Reading the sums left to right: for each cluster, for each data point in the k-th cluster, for each feature, add the squared difference between the point and its cluster center.
[slides by Tamara Broderick]
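A direct translation of this objective into NumPy (a sketch; `X`, `mu`, and `labels` are as in the earlier snippets):

    import numpy as np

    def dis_global(X, mu, labels):
        """Sum of squared distances from each point to its cluster center."""
        total = 0.0
        for k in range(len(mu)):                 # for each cluster
            S_k = X[labels == k]                 # points in cluster k
            total += np.sum((S_k - mu[k]) ** 2)  # over points and features
        return total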
K-Means Algorithm

The algorithm alternates two assignment steps (a runnable sketch follows the list):

✦ Initialize the centers: for $k = 1, \ldots, K$, randomly draw $n$ from $1, \ldots, N$ without replacement and set $\mu_k \leftarrow x_n$.
✦ Repeat until there is no change in the assignments (equivalently, no change in $\mathrm{dis}_{\mathrm{global}}$):
  ❖ Assign each data point to the cluster with the closest center: for $n = 1, \ldots, N$, find the $k$ with smallest $\mathrm{dis}(x_n, \mu_k)$ and put $x_n \in S_k$.
  ❖ Assign each cluster center to be the mean of its cluster's data points: for $k = 1, \ldots, K$, set $\mu_k \leftarrow |S_k|^{-1} \sum_{n:\, x_n \in S_k} x_n$.
[slides by Tamara Broderick]
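Here is a minimal NumPy sketch of the algorithm exactly as stated above: initialization by drawing points without replacement, then alternating the two steps. The `max_iters` safeguard and the empty-cluster guard are our own assumptions, not part of the slides.

    import numpy as np

    def kmeans(X, K, max_iters=100, seed=0):
        """K-means clustering. Returns (centers mu, labels)."""
        rng = np.random.default_rng(seed)
        X = np.asarray(X, dtype=float)
        N, D = X.shape

        # Initialize: draw K indices from 1,...,N without replacement; mu_k <- x_n
        mu = X[rng.choice(N, size=K, replace=False)].copy()
        labels = np.full(N, -1)

        for _ in range(max_iters):
            # Step 1: assign each point to the cluster with the closest center.
            # dists[n, k] = squared Euclidean distance dis(x_n, mu_k)
            dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
            new_labels = dists.argmin(axis=1)

            # Stop when the assignments no longer change
            if np.array_equal(new_labels, labels):
                break
            labels = new_labels

            # Step 2: move each center to the mean of its cluster's points
            for k in range(K):
                S_k = X[labels == k]
                if len(S_k) > 0:          # guard against empty clusters
                    mu[k] = S_k.mean(axis=0)
        return mu, labels

For example, `mu, labels = kmeans(X, K=3)` clusters the toy matrix from earlier, and `dis_global(X, mu, labels)` evaluates the result.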
K-Means: Evaluation

✦ The global dissimilarity is only useful for comparing clusterings; its raw value says little on its own.
✦ Per iteration: assigning all N points to the closest of K centers takes O(KN) time; recomputing each center as the mean of its assigned points takes O(N) time overall.
[slides by Tamara Broderick]
✦ Objective: minimize $\mathrm{dis}_{\mathrm{global}}$ over centers and assignments. Taking the partial derivative with respect to a center $\mu_k$ and setting it to zero recovers exactly the mean-update step, as derived below.
[slide by David Sontag]
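The derivation (a standard computation, reconstructed here because the slide's formula did not survive extraction): holding the assignments $S_k$ fixed,

    $\frac{\partial}{\partial \mu_k} \sum_{n:\, x_n \in S_k} \|x_n - \mu_k\|^2 \;=\; -2 \sum_{n:\, x_n \in S_k} (x_n - \mu_k) \;=\; 0 \quad\Longrightarrow\quad \mu_k = \frac{1}{|S_k|} \sum_{n:\, x_n \in S_k} x_n .$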
✦ K-Means takes an alternating optimization approach: each step is guaranteed to decrease (or at least not increase) the objective, and thus the algorithm is guaranteed to converge, though only to a local optimum.
[slide by Alan Fern]

Demo time…
K-Means Algorithm: Some Issues
[slide by Kristen Grauman]

Next time: K-Means applications, spectral clustering, hierarchical clustering, and what is a good clustering?