Machine Learning
Clustering I
Hamid R. Rabiee
Jafar Muhammadi, Nima Pourdamghani
Spring 2015
http://ce.sharif.edu/courses/93-94/2/ce717-1
Agenda
Unsupervised Learning
Quality Measurement
Similarity Measures
Major Clustering Approaches
Distance Measuring
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Spectral Clustering
Other Methods
Constraint-Based Clustering
Clustering as Optimization
Clustering, or unsupervised classification, is aimed at discovering natural groupings in a set of data.
Note: All samples in the training set are unlabeled.
Applications for clustering:
Spatial data analysis: create thematic maps in GIS by clustering feature space
Image processing: segmentation
Economic science: discover distinct groups in customer bases
Internet: document classification
To gain insight into the structure of the data prior to classifier design
A good clustering has high intra-class similarity and low inter-class similarity.
Ability to discover hidden patterns; quality is ultimately judged by the user.
Purity: suppose we know the labels of the data and assign to each cluster its most frequent class; purity is then the number of correctly assigned points divided by the total number of data points.
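Written as a formula (standard notation, not on the slide: the ω_k are the clusters, the c_j the known classes, and N the number of data points):
$$ \mathrm{purity} = \frac{1}{N} \sum_{k} \max_{j} \lvert \omega_k \cap c_j \rvert $$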
Some popular distances are Minkowski and Mahalanobis.
Distance between binary strings (Hamming distance):
$$ d(S_1, S_2) = \lvert \{ i : s_{1,i} \neq s_{2,i} \} \rvert $$
Similarity between vector objects (cosine similarity):
$$ d(X, Y) = \frac{X^{T} Y}{\lVert X \rVert \, \lVert Y \rVert} $$
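A small Python sketch of these measures (illustrative only; the NumPy-based helper names are my own, not from the slides):

```python
import numpy as np

def minkowski(x, y, p=2):
    """Minkowski distance; p=2 gives Euclidean, p=1 Manhattan."""
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

def mahalanobis(x, y, cov):
    """Mahalanobis distance with covariance matrix `cov`."""
    diff = x - y
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

def hamming(s1, s2):
    """Number of positions where two equal-length binary strings differ."""
    return sum(a != b for a, b in zip(s1, s2))

def cosine_similarity(x, y):
    """Cosine of the angle between two vectors."""
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))
```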
Partitioning approach
Construct various partitions and then evaluate them by some criterion (ex. k-means, c-means, k-medoids)
Hierarchical approach
Create a hierarchical decomposition of the set of data using some criterion (ex. Agnes)
Density-based approach
Based on connectivity and density functions (ex. DBSCAN, OPTICS)
Graph-based approach (Spectral Clustering)
Approximately optimizes the normalized cut criterion
Grid-based approach
based on a multiple-level granularity structure (ex. STING, WaveCluster, CLIQUE)
Model-based
A model is hypothesized for each cluster, and we find the best fit of the data to that model (ex. EM, SOM)
Single link
smallest distance between an element in one cluster and an element in the other
Complete link
largest distance between an element in one cluster and an element in the other
Average
average distance between an element in one cluster and an element in the other
Centroid
distance between the centroids of two clusters (used in k-means)
Medoid
distance between the medoids of two clusters
Medoid: a representative object whose average dissimilarity to all the objects in the cluster is minimal
The criterion: minimize the sum of squared distances
$$ \sum_{m=1}^{k} \sum_{x_j \in \mathrm{Cluster}_m} (x_j - C_m)^2 $$
where the C_m are cluster representatives.
Given k, find a partition into k clusters that optimizes the chosen partitioning criterion:
Global optimum: exhaustively enumerate all partitions
Heuristic methods: k-means, c-means and k-medoids algorithms
k-means: each cluster is represented by the center of the cluster
c-means: the fuzzy version of k-means
k-medoids: each cluster is represented by one of the samples in the cluster
Suppose we know there are K categories and each category is represented by its sample mean. Given a set of unlabeled training samples, how do we estimate the means?
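A minimal k-means (Lloyd's algorithm) sketch in Python that answers this question. It is illustrative only; the random initialization and the convergence check are simple choices, not prescribed by the slides:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Lloyd's algorithm: X is an (N, d) array, k the number of clusters."""
    rng = np.random.default_rng(seed)
    # Initialize the means with k randomly chosen samples.
    means = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each sample goes to its nearest mean.
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each mean becomes the centroid of its assigned samples.
        new_means = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                              else means[j] for j in range(k)])
        if np.allclose(new_means, means):
            break
        means = new_means
    return means, labels
```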
Need to specify k, the number of clusters, in advance
Unable to handle noisy data and outliers (Why?)
Not suitable for discovering clusters with non-convex shapes (Why?)
The algorithm is sensitive to the number of cluster centers, the choice of initial cluster centers, and the sequence in which data are processed (Why?)
Convergence is not guaranteed, but the results are acceptable if there are well-separated clusters
The membership function μ_il expresses to what degree x_l belongs to class C_i.
Crisp clustering: x_l can belong to one class only:
$$ \mu_{il} = \begin{cases} 1 & \text{if } x_l \in C_i \\ 0 & \text{if } x_l \notin C_i \end{cases} $$
Fuzzy clustering: x_l belongs to all classes simultaneously with varying degrees of membership, subject to
$$ \sum_{i=1}^{k} \mu_{il} = 1, \qquad l = 1, 2, \dots, N. $$
Observe that c-means minimizes
$$ J_f = \sum_{i=1}^{k} J_i^{f} = \sum_{i=1}^{k} \sum_{l=1}^{N} (\mu_{il})^{q} \, d\bigl(z_i^{(m)}, x_l\bigr)^{2} $$
where the z_i^{(m)} are cluster means and q is a fuzziness index with 1 < q < 2; the memberships are updated as
$$ \mu_{il} = \frac{\bigl(1 / d(z_i^{(m)}, x_l)^{2}\bigr)^{1/(q-1)}}{\sum_{j=1}^{k} \bigl(1 / d(z_j^{(m)}, x_l)^{2}\bigr)^{1/(q-1)}}. $$
Fuzzy clustering becomes crisp clustering when q → 1.
Instead of taking the mean value of the samples in a cluster as a reference point, medoids can be used. Note that choosing the new medoids is slightly different from choosing the new means in the k-means algorithm:
For each medoid m and each non-medoid sample o, swap m and o and compute the total cost of the resulting configuration; keep the swap if it reduces the cost (see the sketch below).
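A Python sketch of this swap test (a simplified PAM-style pass; the function names and the greedy accept rule are illustrative assumptions, not taken from the slides):

```python
import numpy as np

def total_cost(X, medoids):
    """Sum of distances from every sample to its nearest medoid."""
    dists = np.linalg.norm(X[:, None, :] - X[medoids][None, :, :], axis=2)
    return dists.min(axis=1).sum()

def improve_by_swapping(X, medoids):
    """One pass of swaps: replace a medoid by a non-medoid sample o
    whenever the swap lowers the total cost of the configuration."""
    medoids = list(medoids)
    best = total_cost(X, medoids)
    for i in range(len(medoids)):
        for o in range(len(X)):
            if o in medoids:
                continue
            candidate = medoids.copy()
            candidate[i] = o            # swap medoid i with sample o
            cost = total_cost(X, candidate)
            if cost < best:             # keep the swap if it reduces the cost
                best, medoids = cost, candidate
    return medoids, best
```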
k-medoids is more robust than k-means in the presence of noise and outliers (Why?)
It works effectively for small data sets, but does not scale well to large data sets
For large data sets we can use sampling-based methods (How?)
[Figure: agglomerative clustering (AGNES) merges the singletons a, b, c, d, e step by step into the single cluster {a, b, c, d, e}; divisive clustering (DIANA) performs the same steps in reverse order.]
AGNES (Agglomerative Nesting) Uses the Single-Link method Merge nodes (clusters) that have the maximum similarity
DIANA (Divisive Analysis) Inverse order of AGNES Eventually each node forms a cluster on its own
A dendrogram shows how the clusters are merged: it decomposes the samples into several levels of nested partitioning (a tree of clusters). A clustering of the samples is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.
Discover clusters of arbitrary shapes
Handle noise
Need density parameters as termination condition
Two parameters:
Eps: maximum radius of the neighborhood
MinPts: minimum number of points in an Eps-neighborhood
Sample q is directly density-reachable from sample p if d(p,q) ≤ Eps and p has at least MinPts points in its Eps-neighborhood.
Sample q is density-reachable from sample p if there is a chain of samples p1, ..., pn with p1 = p and pn = q such that each pi+1 is directly density-reachable from pi.
Sample p is density-connected to sample q if there is a sample o such that both p and q are density-reachable from o.
Relies on a density-based notion of cluster: a cluster is defined as a maximal set of density-connected points. Discovers clusters of arbitrary shape in spatial data with noise.
The algorithm (see the sketch below):
Arbitrarily select a sample p.
Retrieve all samples density-reachable from p w.r.t. Eps and MinPts.
If p is a core sample (some samples are density-reachable from p), a cluster is formed.
If p is a border sample (no samples are density-reachable from p), DBSCAN visits the next sample of the database.
Continue the process until all of the samples have been processed.
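A compact DBSCAN sketch in Python (illustrative; the helper names and the label convention of -1 for noise are my own choices):

```python
import numpy as np

def region_query(X, i, eps):
    """Indices of all samples within distance eps of sample i."""
    return np.where(np.linalg.norm(X - X[i], axis=1) <= eps)[0]

def dbscan(X, eps, min_pts):
    """Return an array of cluster labels; -1 denotes noise."""
    labels = np.full(len(X), -1)
    visited = np.zeros(len(X), dtype=bool)
    cluster = 0
    for p in range(len(X)):
        if visited[p]:
            continue
        visited[p] = True
        neighbors = list(region_query(X, p, eps))
        if len(neighbors) < min_pts:
            continue                      # p is not a core sample; move on
        labels[p] = cluster               # start a new cluster from core sample p
        while neighbors:
            q = neighbors.pop()
            if not visited[q]:
                visited[q] = True
                q_neighbors = region_query(X, q, eps)
                if len(q_neighbors) >= min_pts:   # q is itself a core sample
                    neighbors.extend(q_neighbors)
            if labels[q] == -1:
                labels[q] = cluster       # density-reachable from p: same cluster
        cluster += 1
    return labels
```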
Large weights mean that the adjacent vertices are very similar; small weights imply dissimilarity.
Mincut criterion: find a partition A, B that minimizes
$$ \mathrm{cut}(A,B) = \sum_{u \in A,\; v \in B} w(u,v) $$
Normalized cut criterion: find a partition A, B that minimizes
$$ \mathrm{Ncut}(A,B) = \frac{\mathrm{cut}(A,B)}{\sum_{u \in A,\, v \in V} w(u,v)} + \frac{\mathrm{cut}(A,B)}{\sum_{u \in B,\, v \in V} w(u,v)} $$
Looks for a new representation of the original data points such that:
the edge weights are preserved;
convex cluster shapes in the new space represent non-convex clusters in the original space.
Then cluster the points in the new space using any clustering scheme (say k-means).
For more information about derivations, refer to U. Luxburg, “A Tutorial on Spectral Clustering”.
Input: a set of points S = {s_1, ..., s_n} ⊂ R^l and the number of clusters k.
1. Form the edge weight matrix W ∈ R^{n×n}; for example
$$ W_{ij} = \begin{cases} \exp\bigl(-\lVert s_i - s_j \rVert^{2} / 2\sigma^{2}\bigr) & \text{if } i \neq j \\ 0 & \text{otherwise} \end{cases} $$
where σ is a scaling parameter chosen by the user.
2. Define D, a diagonal matrix whose (i,i) element is the sum of row i of W, and form the matrix
$$ L = D^{-1/2} \, W \, D^{-1/2}. $$
3. Find the k largest eigenvectors of L to form the matrix X ∈ R^{n×k}.
4. Normalize the rows of X to form the matrix Y ∈ R^{n×k}:
$$ Y_{ij} = \frac{X_{ij}}{\bigl(\sum_{j} X_{ij}^{2}\bigr)^{1/2}} $$
5. Treat each row of Y as a point in R^k (data dimensionality reduction from n to k).
6. Cluster the new data into k clusters via k-means.
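A sketch of these steps in Python (illustrative; it assumes NumPy for the eigendecomposition and borrows k-means from scikit-learn, which is a tooling choice, not something the slides specify):

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(S, k, sigma=1.0):
    """Spectral clustering of the points in S (an n x l array)."""
    # Step 1: Gaussian edge weights, zero on the diagonal.
    sq_dists = np.sum((S[:, None, :] - S[None, :, :]) ** 2, axis=2)
    W = np.exp(-sq_dists / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # Step 2: L = D^{-1/2} W D^{-1/2}.
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    L = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    # Step 3: k largest eigenvectors of L (eigh returns ascending eigenvalues).
    _, vecs = np.linalg.eigh(L)
    X = vecs[:, -k:]
    # Step 4: normalize each row to unit length.
    Y = X / np.linalg.norm(X, axis=1, keepdims=True)
    # Steps 5-6: cluster the rows of Y with k-means.
    return KMeans(n_clusters=k, n_init=10).fit_predict(Y)
```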
A simple edge weight matrix (d(x_i, x_j) denotes the Euclidean distance between points x_i and x_j, and θ = 1):
$$ W(i,j) = W(j,i) = \begin{cases} 1 & \text{if } d(x_i, x_j) \le \theta \\ 0 & \text{otherwise} \end{cases} $$
For four points a, b, c, d the two largest eigenvectors of the weight matrix act as cluster indicators: e_1 = (0.7, 0.7, 0, 0)^T and e_2 = (0, 0, 0.7, 0.7)^T indicate the clusters {a, b} and {c, d}, while e_1 = (0.7, 0, 0.7, 0)^T and e_2 = (0, 0.7, 0, 0.7)^T, obtained for the reordered weight matrix over a, c, b, d, indicate the clusters {a, c} and {b, d}.
Uses a multi-resolution grid data structure.
Attempt to optimize the fit between the given data and some mathematical model.
Based on the assumption that data are generated by a mixture of underlying probability distributions.
Typical methods:
Statistical approach: EM (Expectation Maximization), discussed later
Neural network approach: SOM (Self-Organizing Feature Map)
Needs user feedback: users know their applications best.
Fewer parameters but more user-desired constraints, e.g., an ATM allocation problem:
Constraints on individual samples (do selection first): cluster on samples which …
Constraints on distance or similarity functions: weighted functions, obstacles
Constraints on the selection of clustering parameters: number of clusters, limitation of each cluster size
User-specified constraints: some samples must be in a given cluster and some others must not
Semi-supervised: giving small training sets as "constraints" or hints
The sum-of-squared-error criterion
Scatter criteria
Partitioning the data set into some number K of clusters.
Cluster: a group of data points whose inter-point distances are small compared with the distances to points outside of the cluster.
Goal: an assignment of data points to clusters such that the sum of the squared distances of each data point to its closest prototype vector (the center of the cluster) is minimized:
$$ J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \, \lVert x_n - \mu_k \rVert^{2} $$
where r_nk ∈ {0, 1} indicates whether data point x_n is assigned to cluster k and μ_k is the prototype (center) of cluster k.
In the 1st stage: minimizing J with respect to the rnk, keeping the μk fixed In the 2nd stage: minimizing J with respect to the μk, keeping rnk fixed
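Carrying out the two minimizations explicitly, in the same notation, gives the familiar closed-form updates (a standard derivation, not shown on the slide):
$$ r_{nk} = \begin{cases} 1 & \text{if } k = \arg\min_{j} \lVert x_n - \mu_j \rVert^{2} \\ 0 & \text{otherwise,} \end{cases} \qquad \frac{\partial J}{\partial \mu_k} = 2 \sum_{n=1}^{N} r_{nk} (\mu_k - x_n) = 0 \;\Rightarrow\; \mu_k = \frac{\sum_n r_{nk} \, x_n}{\sum_n r_{nk}}. $$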
Graphical representation of a mixture model: a binary random variable z having a 1-of-K representation. The marginal distribution of x is a Gaussian mixture of the form
$$ p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k) \qquad (*) $$
(for every observed data point x_n, there is a corresponding latent variable z_n).
To sample from the mixture: generate a value for z (denote it ẑ) from the marginal distribution p(z), and then generate a value for x from the conditional distribution p(x | ẑ).
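A short Python sketch of this ancestral sampling procedure (illustrative; the example parameter values are made up):

```python
import numpy as np

def sample_gmm(pi, means, covs, n_samples, seed=0):
    """Draw samples from a Gaussian mixture: first z ~ p(z), then x ~ p(x | z)."""
    rng = np.random.default_rng(seed)
    K = len(pi)
    # Sample the latent component index z_hat for every point.
    z = rng.choice(K, size=n_samples, p=pi)
    # Sample x from the Gaussian selected by z_hat.
    X = np.array([rng.multivariate_normal(means[k], covs[k]) for k in z])
    return X, z

# Example with K = 3 two-dimensional components (made-up parameters).
pi = [0.5, 0.3, 0.2]
means = [np.zeros(2), np.array([3.0, 3.0]), np.array([-3.0, 3.0])]
covs = [np.eye(2)] * 3
X, z = sample_gmm(pi, means, covs, n_samples=500)
```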
a. The three states of z, corresponding to the three components of the mixture, are depicted in red, green, and blue.
b. The corresponding samples from the marginal distribution p(x).
c. The same samples, in which the colors represent the value of the responsibilities γ(z_nk) associated with each data point x_n; the responsibilities are obtained by evaluating the posterior probability of each component of the mixture distribution from which this data set was generated.
In applying maximum likelihood to Gaussian mixture models, there should be steps to avoid finding such pathological solutions and instead to seek local maxima of the likelihood function that are well behaved.
A K-component mixture will have a total of K! equivalent solutions, corresponding to the K! ways of assigning K sets of parameters to K components.
Difficulty of maximizing the log likelihood function: the summation over k appears inside the logarithm, so there is no closed-form solution as in the single Gaussian case.
I. Assign some initial values for the means, covariances, and mixing coefficients
II. Expectation or E step
III. Maximization or M step
It is common to run the K-means algorithm first in order to find suitable initial values: the covariances can be initialized to the sample covariances of the clusters found by the K-means algorithm, and the mixing coefficients to the fractions of data points assigned to the respective clusters.
1. Initialize the means μ_k, covariances Σ_k and mixing coefficients π_k
2. E step
3. M step
4. Evaluate the log likelihood
The new mean μ_k is a weighted mean of all of the points in the data set, with weights given by the responsibilities (the updates are written out below).
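For reference, the standard E-step, M-step and log-likelihood expressions for a Gaussian mixture in this notation are:
$$ \text{E step:}\quad \gamma(z_{nk}) = \frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)} $$
$$ \text{M step:}\quad N_k = \sum_{n=1}^{N} \gamma(z_{nk}), \qquad \mu_k^{\text{new}} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, x_n, $$
$$ \Sigma_k^{\text{new}} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, (x_n - \mu_k^{\text{new}})(x_n - \mu_k^{\text{new}})^{T}, \qquad \pi_k^{\text{new}} = \frac{N_k}{N} $$
$$ \text{Log likelihood:}\quad \ln p(X \mid \mu, \Sigma, \pi) = \sum_{n=1}^{N} \ln \Bigl\{ \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \Bigr\} $$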
Posterior distribution (ref. (9.10), (9.11)):
$$ p(Z \mid X, \mu, \Sigma, \pi) \;\propto\; \prod_{n=1}^{N} \prod_{k=1}^{K} \bigl[ \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \bigr]^{z_{nk}} $$
The expected value of the indicator variable under this posterior distribution:
$$ E[z_{nk}] = \gamma(z_{nk}) = \frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)} $$
The expected value of the complete-data log likelihood function:
$$ E_{Z}\bigl[\ln p(X, Z \mid \mu, \Sigma, \pi)\bigr] = \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma(z_{nk}) \bigl\{ \ln \pi_k + \ln \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \bigr\} $$
Consider a Gaussian mixture in which every component has covariance εI. Then
$$ p(x \mid \mu_k, \Sigma_k) = \frac{1}{(2\pi\varepsilon)^{D/2}} \exp\Bigl\{ -\frac{\lVert x - \mu_k \rVert^{2}}{2\varepsilon} \Bigr\}, \qquad \gamma(z_{nk}) = \frac{\pi_k \exp\{ -\lVert x_n - \mu_k \rVert^{2} / 2\varepsilon \}}{\sum_{j} \pi_j \exp\{ -\lVert x_n - \mu_j \rVert^{2} / 2\varepsilon \}} $$
In the limit ε → 0 the responsibilities become hard assignments r_nk and
$$ E_{Z}\bigl[\ln p(X, Z \mid \mu, \Sigma, \pi)\bigr] \rightarrow -\frac{1}{2} \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \, \lVert x_n - \mu_k \rVert^{2} + \text{const}, $$
so EM for this mixture reduces to the K-means algorithm.
Bernoulli distribution over D binary variables x = (x_1, ..., x_D) with parameters μ = (μ_1, ..., μ_D) (single component):
$$ p(x \mid \mu) = \prod_{i=1}^{D} \mu_i^{x_i} (1 - \mu_i)^{1 - x_i}, \qquad E[x] = \mu, \qquad \mathrm{Cov}[x] = \mathrm{diag}\{ \mu_i (1 - \mu_i) \}. $$
Mixture of K Bernoulli components with μ = {μ_1, ..., μ_K} and π = {π_1, ..., π_K}:
$$ p(x \mid \mu, \pi) = \sum_{k=1}^{K} \pi_k \, p(x \mid \mu_k), \qquad p(x \mid \mu_k) = \prod_{i=1}^{D} \mu_{ki}^{x_i} (1 - \mu_{ki})^{1 - x_i}, $$
$$ E[x] = \sum_{k=1}^{K} \pi_k \mu_k, \qquad \mathrm{Cov}[x] = \sum_{k=1}^{K} \pi_k \bigl( \Sigma_k + \mu_k \mu_k^{T} \bigr) - E[x]\,E[x]^{T}, \qquad \Sigma_k = \mathrm{diag}\{ \mu_{ki} (1 - \mu_{ki}) \}. $$
* Because the covariance matrix Cov[x] is no longer diagonal, the mixture distribution can capture correlations between the variables, unlike a single Bernoulli distribution.
For a single data point x and component k:
$$ \ln p(x \mid \mu_k) = \sum_{i=1}^{D} \bigl[ x_i \ln \mu_{ki} + (1 - x_i) \ln(1 - \mu_{ki}) \bigr], \qquad p(x \mid z, \mu) = \prod_{k=1}^{K} p(x \mid \mu_k)^{z_k}, \qquad p(z \mid \pi) = \prod_{k=1}^{K} \pi_k^{z_k} $$
(the z_k are binary indicator variables).
Complete-data log likelihood function:
$$ \ln p(X, Z \mid \mu, \pi) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \Bigl\{ \ln \pi_k + \sum_{i=1}^{D} \bigl[ x_{ni} \ln \mu_{ki} + (1 - x_{ni}) \ln(1 - \mu_{ki}) \bigr] \Bigr\} $$
E-step:
$$ \gamma(z_{nk}) = E[z_{nk}] = \frac{\pi_k \, p(x_n \mid \mu_k)}{\sum_{j=1}^{K} \pi_j \, p(x_n \mid \mu_j)}, \qquad E_{Z}[\ln p(X, Z \mid \mu, \pi)] = \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma(z_{nk}) \Bigl\{ \ln \pi_k + \sum_{i=1}^{D} \bigl[ x_{ni} \ln \mu_{ki} + (1 - x_{ni}) \ln(1 - \mu_{ki}) \bigr] \Bigr\} $$
M-step:
$$ N_k = \sum_{n=1}^{N} \gamma(z_{nk}), \qquad \mu_k = \bar{x}_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, x_n, \qquad \pi_k = \frac{N_k}{N} $$
* In contrast to the mixture of Gaussians, there are no singularities in which the likelihood goes to infinity.
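A compact EM sketch for a Bernoulli mixture in Python (illustrative; the random initialization and the small clipping of μ to keep the logarithms finite are my own choices):

```python
import numpy as np

def bernoulli_mixture_em(X, K, n_iter=10, seed=0):
    """EM for a mixture of multivariate Bernoulli distributions.
    X: (N, D) binary array. Returns mixing coefficients pi, means mu, responsibilities."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    pi = np.full(K, 1.0 / K)
    mu = rng.uniform(0.25, 0.75, size=(K, D))       # random initial means
    for _ in range(n_iter):
        # E step: gamma(z_nk) proportional to pi_k * prod_i mu_ki^x_ni (1-mu_ki)^(1-x_ni)
        log_p = X @ np.log(mu.T) + (1 - X) @ np.log(1 - mu.T) + np.log(pi)
        log_p -= log_p.max(axis=1, keepdims=True)   # for numerical stability
        gamma = np.exp(log_p)
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M step: mu_k = weighted mean of the data, pi_k = N_k / N
        Nk = gamma.sum(axis=0)
        mu = (gamma.T @ X) / Nk[:, None]
        mu = np.clip(mu, 1e-6, 1 - 1e-6)            # keep the logs finite
        pi = Nk / N
    return pi, mu, gamma
```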
N = 600 digit images; a mixture of K = 3 Bernoulli distributions fitted by 10 EM iterations.
Parameters are shown for each of the three components and for a single multivariate Bernoulli fit.
The analysis of Bernoulli mixtures can be extended to the case of discrete (multinomial) variables having M > 2 states (Ex. 9.19).