Machine Learning 2
DS 4420 - Spring 2020
Clustering I
Byron C. Wallace
Unsupervised learning

So far we have reviewed some fundamentals, discussed Maximum Likelihood Estimation (MLE) for probabilistic models, and covered neural networks/backprop + SGD. Although the above methods are general, we will shift focus to unsupervised learning for a few weeks. MLE and SGD will still be relevant here — and we will consider the former explicitly for clustering next week.
Unsupervised learning (no labels for training): group data into classes of similar instances.
Example (the Simpsons characters): the same set can be grouped as Simpson’s Family vs. School Employees, or as Females vs. Males. The choice of clustering criterion can be task-dependent.
How alike are two data points? We need proximity measures that work for numeric attributes (e.g., 0.2, 3, 342.7) as well as other types, such as the strings "Peter" and "Piotr":

- Dissimilarity/distance: d(x1, x2)
- Similarity: s(x1, x2)

Collectively, these are proximity measures: p(x1, x2).
A standard family of distance measures is the Minkowski distance of order q between k-dimensional points:

$d(x_1, x_2) = \left( \sum_{i=1}^{k} |x_{1,i} - x_{2,i}|^q \right)^{1/q}$

q = 1 gives the Manhattan distance and q = 2 the Euclidean distance.
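As a concrete illustration, here is a minimal NumPy sketch of this formula (the function name and test points are my own, not from the slides):

```python
import numpy as np

def minkowski(x1, x2, q=2):
    """Minkowski distance of order q between two k-dimensional points."""
    x1, x2 = np.asarray(x1, dtype=float), np.asarray(x2, dtype=float)
    return np.sum(np.abs(x1 - x2) ** q) ** (1.0 / q)

print(minkowski([0, 0], [3, 4], q=2))  # 5.0 (Euclidean)
print(minkowski([0, 0], [3, 4], q=1))  # 7.0 (Manhattan)
```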
Kernels: rather than working in the input space X directly, map each point to a feature representation φ(x) and do classification there. Common kernels: Linear (inner-product), Polynomial, Radial Basis Function (RBF).

[Figure: classification boundaries with a linear vs. an RBF kernel; axes are the first and second features. Figure from MML book.]
“The key insight in kernel-based learning is that you can rewrite many linear models in a way that doesn’t require you to ever explicitly compute φ(x).”
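A hedged sketch of that idea: each kernel below is evaluated purely from input-space operations, even though the RBF kernel corresponds to an inner product in an infinite-dimensional feature space (function names and parameter values are illustrative):

```python
import numpy as np

def linear_kernel(x, y):
    return x @ y  # plain inner product in the input space

def polynomial_kernel(x, y, degree=2, c=1.0):
    return (x @ y + c) ** degree

def rbf_kernel(x, y, gamma=1.0):
    # Inner product in an (implicit) infinite-dimensional feature space,
    # computed entirely from the input-space squared distance.
    return np.exp(-gamma * np.sum((x - y) ** 2))

x, y = np.array([1.0, 2.0]), np.array([2.0, 0.5])
print(linear_kernel(x, y), polynomial_kernel(x, y), rbf_kernel(x, y))
```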
A distance measure must satisfy:
- Symmetry: d(x, y) = d(y, x)
- Reflexivity: d(x, x) = 0
- Positivity (separation): d(x, y) > 0 whenever x ≠ y
- Triangle inequality: d(x, z) ≤ d(x, y) + d(y, z)

For similarity functions, the analogous properties are not necessarily well defined.
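As a quick numerical sanity check (a sketch I am adding, not from the slides), these four properties can be verified for the Euclidean distance on random points:

```python
import numpy as np

rng = np.random.default_rng(0)
d = lambda a, b: np.linalg.norm(a - b)  # Euclidean distance

for _ in range(1000):
    x, y, z = rng.normal(size=(3, 5))  # three random 5-d points
    assert np.isclose(d(x, y), d(y, x))          # symmetry
    assert np.isclose(d(x, x), 0.0)              # reflexivity
    assert d(x, y) > 0                           # positivity (x != y a.s.)
    assert d(x, z) <= d(x, y) + d(y, z) + 1e-12  # triangle inequality
print("All four distance-measure properties hold.")
```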
Notions of clusters:
- Cut off a dendrogram at some depth (hierarchical clustering)
- Connected regions of high density (density-based clustering)
- Distributions on features (model-based clustering)
K-means algorithm

Input: X = {x1, x2, ..., xN}; number of clusters K
Initialize: K random centroids µ1, µ2, ..., µK
Repeat until convergence:
  1. For i = 1, ..., K: Ci = { x ∈ X | i = argmin_{1 ≤ j ≤ K} ‖x − µj‖² }   (assign each point to its closest centroid)
  2. For i = 1, ..., K: µi = argmin_z Σ_{x ∈ Ci} ‖z − x‖²   (update each centroid; the minimizer is the mean of Ci)
Output: C1, C2, ..., CK
Walkthrough (algorithm: K-means; distance metric: Euclidean distance; centroids µ1, µ2, µ3):

[Figure sequence on a small 2D grid:]
1. Randomly initialize K centroids µk.
2. Assign each point to the closest centroid, then update centroids to the average of their points.
3. Repeat until convergence (no points reassigned, means unchanged).
Let's see some examples in Python
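Here is a minimal NumPy implementation of the pseudocode above; the function and variable names are my own, and the three-blob dataset is synthetic:

```python
import numpy as np

def kmeans(X, K, max_iters=100, seed=0):
    """Lloyd's algorithm: alternate the assignment and update steps."""
    rng = np.random.default_rng(seed)
    # Initialize with K distinct points drawn at random from X.
    mu = X[rng.choice(len(X), size=K, replace=False)].copy()
    for _ in range(max_iters):
        # Step 1: assign each point to its closest centroid (squared Euclidean).
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)
        # Step 2: move each centroid to the mean of its assigned points
        # (keep the old centroid if a cluster ends up empty).
        new_mu = np.array([X[assign == i].mean(axis=0) if np.any(assign == i)
                           else mu[i] for i in range(K)])
        if np.allclose(new_mu, mu):
            break  # converged: means unchanged
        mu = new_mu
    return assign, mu

# Example: three well-separated Gaussian blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.3, size=(50, 2)) for c in ([0, 0], [3, 3], [0, 3])])
labels, centroids = kmeans(X, K=3)
print(centroids)
```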
[Figures: two K-means runs on a small 2D dataset (x vs. y), shown iteration by iteration (six and five iterations, respectively) until convergence.]
[Figures: two K-means runs on data with 5 pairs of clusters and two initial points in each pair (x vs. y), shown over iterations 1-4; the final clustering depends on where the initial centroids fall.]
Initialization tricks: e.g., choose initial centroids so as to keep the most widely separated points.
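A sketch of that "most widely separated points" heuristic (sometimes called farthest-first traversal); the function name is mine. A popular randomized variant of the same idea is k-means++.

```python
import numpy as np

def farthest_first_init(X, K, seed=0):
    """Pick one random centroid, then repeatedly add the point
    farthest from all centroids chosen so far."""
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]
    for _ in range(K - 1):
        # Distance from every point to its nearest chosen centroid.
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centroids], axis=0)
        centroids.append(X[np.argmax(d)])  # keep the most separated point
    return np.array(centroids)
```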
How do we choose K? Run K-means for a range of values and plot the cost (SSE) as a function of K.

[Figures: the same 1D dataset clustered with K=1 (SSE=873), K=2 (SSE=173), and K=3 (SSE=134); plot of the cost function for K = 1 to 6.]

“Elbow finding” (a.k.a. “knee finding”): set K to the value just above the “abrupt” increase in cost.
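A sketch of the K sweep with scikit-learn, whose KMeans exposes the SSE as inertia_ (the three-blob dataset is synthetic):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.5, size=(60, 2)) for c in ([0, 0], [5, 5], [0, 5])])

for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(f"K={k}: SSE={km.inertia_:.1f}")
# Look for the elbow: the K beyond which the SSE stops dropping sharply (here, K=3).
```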
[Figures: original points vs. K-means output with 3 clusters, 3 clusters, and 2 clusters; K-means can fail to recover the natural groups when clusters differ in size, density, or shape.]
Intuition: “Combine” smaller clusters into larger clusters
DBSCAN (one of the most-cited clustering methods)

Intuition: a density-based method can recover arbitrarily shaped clusters and tolerate noise.

Naïve approach: require that for each point in a cluster there are at least a minimum number (MinPts) of points of the cluster in its neighborhood.
Definition (Eps-neighborhood of a point p): NEps(p) = { q ∈ D | dist(p, q) ≤ Eps }
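This definition translates directly into code; a minimal sketch assuming Euclidean distance (the function name is mine):

```python
import numpy as np

def eps_neighborhood(D, p_idx, eps):
    """Indices of all points q in D with dist(p, q) <= eps (p itself included)."""
    dists = np.linalg.norm(D - D[p_idx], axis=1)
    return np.where(dists <= eps)[0]
```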
There are two kinds of points in a cluster:
- points inside the cluster (core points)
- points on the border (border points)

An Eps-neighborhood of a border point contains significantly fewer points than an Eps-neighborhood of a core point. So we require instead: for every point p in a cluster C there is a point q ∈ C such that (1) p is inside the Eps-neighborhood of q and (2) NEps(q) contains at least MinPts points.
This yields a better notion of cluster: core points have high-density neighborhoods, and border points are connected to core points.
Definition (directly density-reachable): A point p is directly density-reachable from a point q with regard to the parameters Eps and MinPts if:
1) p ∈ NEps(q)   (reachability)
2) |NEps(q)| ≥ MinPts   (core point condition)

Example (MinPts = 5): p is directly density-reachable from q, since p ∈ NEps(q) and |NEps(q)| = 6 ≥ 5 = MinPts (core point condition). But q is not directly density-reachable from p, since |NEps(p)| = 4 < 5 = MinPts (the core point condition fails).

Note: this is an asymmetric relationship.
Definition (density-reachable): A point p is density-reachable from a point q with regard to the parameters Eps and MinPts if there is a chain of points p1, p2, ..., ps with p1 = q and ps = p such that pi+1 is directly density-reachable from pi for all 1 ≤ i ≤ s-1.

Example (MinPts = 5): |NEps(q)| = 5 ≥ 5 = MinPts and |NEps(p1)| = 6 ≥ 5 = MinPts (core point conditions), so p is density-reachable from q via the intermediate point p1.
Definition (density-connected): A point p is density-connected to a point q with regard to the parameters Eps and MinPts if there is a point v such that both p and q are density-reachable from v.

Example (MinPts = 5): p and q are each density-reachable from a common point v.

Note: this is a symmetric relationship.
Definition (cluster): A cluster with regard to the parameters Eps and MinPts is a non-empty subset C of the database D such that:
1) (Maximality) For all p, q ∈ D: if p ∈ C and q is density-reachable from p with regard to Eps and MinPts, then q ∈ C.
2) (Connectivity) For all p, q ∈ C: p is density-connected to q with regard to Eps and MinPts.
Definition (noise): Let C1, ..., Ck be the clusters of the database D with regard to the parameters Epsi and MinPtsi (i = 1, ..., k). The set of points in D not belonging to any cluster C1, ..., Ck is called noise:

Noise = { p ∈ D | p ∉ Ci for all i = 1, ..., k }
DBSCAN algorithm:
(1) Start with an arbitrary point p from the database and retrieve all points density-reachable from p with regard to Eps and MinPts.
(2) If p is a core point, this procedure yields a cluster with regard to Eps and MinPts, and all points in the cluster are classified.
(3) If p is a border point, no points are density-reachable from p, and DBSCAN visits the next unclassified point in the database.
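In practice one rarely writes this loop by hand. A sketch using scikit-learn's DBSCAN, whose parameters map directly onto the definitions (eps = Eps, min_samples = MinPts); the half-moons dataset is synthetic:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaved half-moons: non-globular clusters plus a little noise.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)
print(np.unique(db.labels_))          # cluster ids; -1 marks noise points
print(len(db.core_sample_indices_))   # number of core points found
```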
[Figures: original points; the same points labeled by type (core, border, noise); the resulting clusters.]
+ Resistant to noise
+ Can handle arbitrarily shaped clusters
- Sensitive to hyperparameters
[Figures: DBSCAN results with MinPts = 4, Eps = 9.92 vs. MinPts = 4, Eps = 9.75; a small change in Eps yields a very different clustering.]
K-means vs. DBSCAN: a side-by-side comparison.
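To make the contrast concrete, a hedged sketch comparing the two methods on the same non-globular data (synthetic; parameter values are illustrative):

```python
from sklearn.cluster import KMeans, DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# K-means imposes a roughly linear split (spherical-cluster bias), while
# DBSCAN recovers each moon as one connected high-density region.
print("K-means labels:", set(km_labels))
print("DBSCAN labels: ", set(db_labels))
```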