CSCI 4520 – Introduction to Machine Learning
Spring 2020
Mehdi Allahyari Georgia Southern University
(slides borrowed from Tom Mitchell, Maria Florina Balcan, Ali Borji, Ke Chen)
Clustering
Clustering, Informal Goals

Goal: Automatically partition unlabeled data into groups of similar data points (useful, e.g., for visualization purposes).

Example applications:
- Cluster genes according to expression profile.
- Cluster users of social networks by interest or profile: e.g., a Facebook network or a Twitter network.
Objective-based clustering: given data points x1, …, xn, a distance d, and a target number of clusters k, choose centers c1, …, ck.

- k-median objective: minimize ∑i minj∈{1,…,k} d(xi, cj)
- k-means objective: minimize ∑i minj∈{1,…,k} d(xi, cj)²
- k-center objective: minimize maxi minj∈{1,…,k} d(xi, cj)
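Common center-based objectives, k-median (sum of distances), k-means (sum of squared distances), and k-center (maximum distance), score the same fixed centers differently. A quick illustration with made-up 1-D data, not from the slides:

```python
points = [0.0, 3.0, 9.0]
centers = [1.0, 9.0]

def nearest_dist(x):
    """Distance from x to its nearest center."""
    return min(abs(x - c) for c in centers)

k_median = sum(nearest_dist(x) for x in points)        # sum of distances
k_means = sum(nearest_dist(x) ** 2 for x in points)    # sum of squared distances
k_center = max(nearest_dist(x) for x in points)        # worst-case distance
print(k_median, k_means, k_center)
```

Here the point at 3.0 is 2 away from its nearest center, so k-means penalizes it quadratically (contributing 4) while k-median counts it linearly.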
Partitioning Clustering Approach
- A typical clustering analysis approach works by iteratively learning a partition of the data set to produce several non-overlapping clusters.
- In principle, the optimal partition is achieved by minimizing the sum of squared distances between objects and their cluster centroids:

  E = ∑k=1 K ∑x∈Ck d(x, mk)²

  where mk is the centroid (mean) of cluster Ck and d is a distance function, e.g., Euclidean distance.
K-means Algorithm
- Given K, find a partition of K clusters that optimizes the chosen partitioning criterion.
- The K-means algorithm is a heuristic method: each cluster is represented by the center of the cluster, and the algorithm converges to stable centroids of clusters.
- It is the simplest partitioning method for clustering analysis and is widely used in data mining applications.
Given the cluster number K, the K-means algorithm is carried out in three steps after initialisation:
- Initialisation: set seed points (randomly).
1) Assign each object to the cluster of the nearest seed point.
2) Compute new seed points as the centroids of the clusters of the current partition.
3) Go back to Step 1); stop when no new assignments occur (i.e., membership in each cluster no longer changes).
Equivalently:
- Choose a number of clusters k.
- Initialize cluster centers µ1, …, µk:
  - could pick k data points and set the cluster centers to these points, or
  - could randomly assign points to clusters and take the cluster means.
- For each data point, compute the cluster center it is closest to and assign the point to that cluster.
- Re-compute the cluster centers (mean of the data points in each cluster).
- Stop when there are no new re-assignments.
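The loop above can be sketched in a few lines of Python. This is a minimal illustration of Lloyd's iteration under the stated assumptions (Euclidean distance, mean update), not the lecture's reference code:

```python
import math
import random

def kmeans(points, k, seeds=None, max_iter=100):
    """K-means: alternate nearest-center assignment and centroid updates."""
    centers = list(seeds) if seeds else random.sample(points, k)
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda j: math.dist(p, centers[j]))
            clusters[j].append(p)
        # Update step: each center becomes the mean of its cluster
        # (an empty cluster keeps its old center).
        new_centers = [
            tuple(sum(v) / len(cl) for v in zip(*cl)) if cl else centers[j]
            for j, cl in enumerate(clusters)
        ]
        if new_centers == centers:  # converged: no center moved
            break
        centers = new_centers
    return centers, clusters

centers, clusters = kmeans([(0, 0), (0, 1), (10, 10), (10, 11)], 2,
                           seeds=[(0, 0), (10, 10)])
print(centers)
```

With the well-separated toy data above, the two centers settle on the means of the two obvious pairs after one update.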
Problem
Suppose we have 4 types of medicine and each has two attributes (pH and weight index). Our goal is to group these objects into K = 2 groups of medicine.

Medicine  Weight  pH-Index
A         1       1
B         2       1
C         4       3
D         5       4
Step 1: Use initial seed points for partitioning
  c1 = A = (1, 1),  c2 = B = (2, 1)

Assign each object to the cluster with the nearest seed point, using Euclidean distance, e.g.:
  d(D, c1) = √((5 − 1)² + (4 − 1)²) = 5
  d(D, c2) = √((5 − 2)² + (4 − 1)²) = 4.24
so D is assigned to cluster 2. The resulting clusters are {A} and {B, C, D}.
Step 2: Compute new centroids of the current partition
Knowing the members of each cluster ({A} and {B, C, D}), we compute the new centroid of each group based on these new memberships:
  c1 = (1, 1)
  c2 = ((2 + 4 + 5)/3, (1 + 3 + 4)/3) ≈ (3.67, 2.67)
Step 2 (continued): Renew membership based on the new centroids
Compute the distance of all objects to the new centroids, and assign each object to the cluster of its nearest centroid. B is now closer to c1 than to c2, so the clusters become {A, B} and {C, D}.
Step 3: Repeat the first two steps until convergence
Knowing the members of each cluster, we compute the new centroid of each group based on these new memberships:
  c1 = ((1 + 2)/2, (1 + 1)/2) = (1.5, 1)
  c2 = ((4 + 5)/2, (3 + 4)/2) = (4.5, 3.5)
Step 3 (continued): Compute the distance of all objects to the new centroids. Stop: there is no new assignment, since membership in each cluster no longer changes.
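As a sanity check, the worked example can be replayed in code. This is a small sketch of my own (Euclidean distance, seeds c1 = A and c2 = B as above; it ignores the empty-cluster case):

```python
import math

points = {"A": (1, 1), "B": (2, 1), "C": (4, 3), "D": (5, 4)}

def assign(centers):
    """Map each medicine to the index of its nearest center (Euclidean)."""
    return {name: min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for name, p in points.items()}

def centroids(membership, k):
    """Coordinate-wise mean of the points assigned to each cluster."""
    out = []
    for j in range(k):
        members = [points[n] for n, c in membership.items() if c == j]
        out.append(tuple(sum(v) / len(members) for v in zip(*members)))
    return out

centers = [points["A"], points["B"]]   # initial seeds c1 = A, c2 = B
while True:
    membership = assign(centers)
    new_centers = centroids(membership, 2)
    if new_centers == centers:         # converged: centroids stopped moving
        break
    centers = new_centers

print(centers)      # final centroids
print(membership)   # final cluster memberships
```

The run reproduces the hand computation: clusters {A, B} and {C, D} with centroids (1.5, 1) and (4.5, 3.5).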
For the medicine data set, use K-means with the Manhattan distance metric for clustering analysis, setting K = 2 and initialising the seeds as c1 = A and c2 = C. Answer three questions:
1. How many steps are required for convergence?
2. What are the memberships of the two clusters after convergence?
3. What are the centroids of the two clusters after convergence?

Medicine  Weight  pH-Index
A         1       1
B         2       1
C         4       3
D         5       4
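To check an answer, the same loop can be run with the Manhattan metric. A sketch of my own; since the exercise does not say to change the update rule, it keeps the plain mean update (only the assignment metric changes):

```python
points = {"A": (1, 1), "B": (2, 1), "C": (4, 3), "D": (5, 4)}

def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

centers = [points["A"], points["C"]]   # seeds c1 = A, c2 = C
steps = 0
while True:
    steps += 1
    # Assignment step under the Manhattan metric.
    member = {n: min((manhattan(p, c), j) for j, c in enumerate(centers))[1]
              for n, p in points.items()}
    # Update step: centroid = coordinate-wise mean of each cluster.
    new = []
    for j in range(2):
        cl = [points[n] for n, c in member.items() if c == j]
        new.append(tuple(sum(v) / len(cl) for v in zip(*cl)))
    if new == centers:                 # converged: no centroid moved
        break
    centers = new

print(steps, centers, member)
```

Running it prints the step count, centroids, and memberships, which can be compared against a hand derivation.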
The K-means objective
Find centers c1, …, ck minimizing

  ∑i minj∈{1,…,k} ‖xi − cj‖²

Key fact behind the centroid update: for fixed points y1, …, yn, the center c minimizing ∑i=1 n ‖yi − c‖² is the mean μ = (1/n) ∑i=1 n yi, since

  ∑i=1 n ‖yi − c‖² = ∑i=1 n ‖yi − μ‖² + n‖c − μ‖²
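The fact that the mean minimizes the sum of squared distances, which is what justifies the centroid update in K-means, can be spot-checked numerically (an illustration of my own; the points are arbitrary):

```python
ys = [1.0, 2.0, 6.0]          # an arbitrary 1-D cluster
mu = sum(ys) / len(ys)        # its mean

def cost(c):
    """Sum of squared distances from the points to a candidate center c."""
    return sum((y - c) ** 2 for y in ys)

# Identity: cost(c) = cost(mu) + n * (c - mu)^2, so c = mu is optimal.
for c in (0.0, 2.5, mu, 10.0):
    assert abs(cost(c) - (cost(mu) + len(ys) * (c - mu) ** 2)) < 1e-9
print(mu, cost(mu))
```

Because the extra term n(c − μ)² is non-negative and zero only at c = μ, no other center can do better.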
Strengths of K-means
- Efficient in computation: O(tKn), where n is the number of objects, K is the number of clusters, and t is the number of iterations; normally K, t << n.

Weaknesses of K-means
- Sensitive to initial seed points; may converge to a local optimum, i.e., an unwanted solution.
- Need to specify K, the number of clusters, in advance.
- Unable to handle noisy data and outliers (cf. the K-Medoids algorithm).
- Not suitable for discovering clusters with non-convex shapes.
- Applicable only when the mean is defined; what about categorical data?
- How do we evaluate K-means performance?
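The sensitivity to seeds is easy to demonstrate. In this toy example of my own, four points sit at the corners of a tall rectangle; both runs below reach a fixed point of the algorithm, but with very different costs:

```python
import math

points = [(0, 0), (0, 10), (2, 0), (2, 10)]

def lloyd(centers):
    """Run K-means to convergence; return final centers and total squared error."""
    while True:
        clusters = [[], []]
        for p in points:
            j = min((math.dist(p, c), j) for j, c in enumerate(centers))[1]
            clusters[j].append(p)
        new = [tuple(sum(v) / len(cl) for v in zip(*cl)) for cl in clusters]
        if new == centers:
            cost = sum(math.dist(p, new[j]) ** 2
                       for j, cl in enumerate(clusters) for p in cl)
            return new, cost
        centers = new

good_centers, good_cost = lloyd([(1.0, 0.0), (1.0, 10.0)])  # splits top/bottom
bad_centers, bad_cost = lloyd([(0.0, 5.0), (2.0, 5.0)])     # splits left/right
print(good_cost, bad_cost)
```

Both seedings are self-consistent, so the algorithm stops at each, yet the left/right split has a much higher sum of squared errors: a local optimum that random restarts are meant to avoid.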
Hierarchical Clustering

[Figure: a topic hierarchy; "All topics" splits into sports (soccer, tennis) and fashion (Gucci, Lacoste).]

Bottom-up (agglomerative): start with every point in its own cluster and repeatedly merge the two "closest" clusters. Different definitions of "closest" give different algorithms.
The base distance between data points can be anything: e.g., # keywords in common, edit distance, etc. Common measures of distance between clusters A and B:
- Single linkage:   dist(A, B) = min x∈A,x′∈B dist(x, x′)
- Complete linkage: dist(A, B) = max x∈A,x′∈B dist(x, x′)
- Average linkage:  dist(A, B) = avg x∈A,x′∈B dist(x, x′)
Single linkage: dist(A, C) = min x∈A,x′∈C dist(x, x′)

[Figure: dendrogram over points A–F, merging clusters bottom-up at increasing distances (1, 2, 2.1, 3.2, …).]

One way to think of it: at any moment, we see the connected components obtained by joining all pairs of points within distance r of each other. Watch as r grows: only n − 1 values of r are relevant, because we merge only at values of r equal to distances between points in different clusters.
Complete linkage: dist(A, B) = max x∈A,x′∈B dist(x, x′)

[Figure: dendrogram over points A–F under complete linkage.]

One way to think of it: keep the maximum cluster diameter as small as possible at any level.
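The merge loop can be written compactly. A minimal bottom-up sketch of my own, not the lecture's code; `linkage` chooses single or complete linkage, and Euclidean distance stands in for the base metric:

```python
import math

def agglomerative(points, linkage="single", until=1):
    """Bottom-up clustering: repeatedly merge the two closest clusters
    (under the chosen linkage) until `until` clusters remain."""
    clusters = [[p] for p in points]
    agg = min if linkage == "single" else max     # complete linkage takes the max
    def cluster_dist(a, b):
        return agg(math.dist(x, y) for x in a for y in b)
    while len(clusters) > until:
        # Find the closest pair of clusters under the linkage.
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: cluster_dist(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] += clusters.pop(j)            # merge cluster j into cluster i
    return clusters

print(agglomerative([(0, 0), (1, 0), (2, 0), (10, 0)], "single", until=2))
```

Stopping at `until=2` returns the flat clustering at that level; recording each merge and its distance instead would yield the full dendrogram.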