Unsupervised Learning and Clustering
Selim Aksoy
Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr
CS 551, Spring 2008
Introduction
◮ Until now we have assumed that the training examples were labeled by their category membership.
◮ Procedures that use labeled samples are said to be supervised; procedures that use unlabeled samples are said to be unsupervised.
◮ In this chapter, we will study clustering as an unsupervised procedure that uses unlabeled samples.
◮ Unsupervised procedures are used for several reasons:
◮ Collecting and labeling a large set of sample patterns can be costly and time consuming.
◮ One can train with large amounts of unlabeled data, and then use supervision to label the groupings found.
◮ Unsupervised methods can be used for feature extraction.
◮ Exploratory data analysis can provide insight into the nature or structure of the data.
◮ Assume that we have a set of unlabeled multi-dimensional patterns.
◮ One way of describing this set of patterns is to compute their sample mean and sample covariance matrix.
◮ This description uses the assumption that the patterns form a single cloud that is well summarized by these statistics.
◮ However, we must be careful about any assumptions we make about the structure of the data; if the patterns actually form separate groups, such summary statistics can be quite misleading.
◮ A cluster is comprised of a number of similar objects collected or grouped together.
◮ Other definitions of clusters (from Jain and Dubes, 1988):
◮ A cluster is a set of entities which are alike, and entities from different clusters are not alike.
◮ A cluster is an aggregation of points in the test space such that the distance between any two points in the cluster is less than the distance between any point in the cluster and any point not in it.
◮ Clusters may be described as connected regions of a multi-dimensional space containing a relatively high density of points, separated from other such regions by a region containing a relatively low density of points.
◮ Cluster analysis organizes data by abstracting the underlying structure, either as a grouping of individuals or as a hierarchy of groups.
◮ These groupings are based on measured or perceived similarities among the patterns.
◮ Clustering is unsupervised. Category labels and other information about the source of the data influence the interpretation of the groupings, not their formation.
◮ Clustering is a very difficult problem because data can reveal clusters with different shapes and sizes.
◮ Clustering algorithms can be divided into several groups:
◮ Exclusive (each pattern belongs to only one cluster) vs. non-exclusive (each pattern can be assigned to several clusters),
◮ Hierarchical (nested sequence of partitions) vs. partitional (a single partition).
◮ Implementations of clustering algorithms can also be divided into several groups:
◮ Agglomerative (merging atomic clusters into larger clusters) vs. divisive (subdividing large clusters into smaller ones),
◮ Serial (processing patterns one by one) vs. simultaneous (processing all patterns at once),
◮ Graph-theoretic (based on connectedness) vs. algebraic (based on criterion functions).
◮ Hundreds of clustering algorithms have been proposed in the literature.
◮ Most of these algorithms are based on the following two popular techniques:
◮ Iterative squared-error partitioning,
◮ Agglomerative hierarchical clustering.
◮ One of the main challenges is to select an appropriate measure of similarity to define clusters.
◮ The most obvious measure of similarity (or dissimilarity) between two patterns is the distance between them.
◮ If distance is a good measure of dissimilarity, then we would expect the distance between patterns in the same cluster to be significantly smaller than the distance between patterns in different clusters.
◮ Then, a very simple way of doing clustering would be to choose a distance threshold and group the patterns so that patterns in the same cluster are closer than this threshold; the sketch below illustrates this idea.
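◮ As a toy illustration of this idea (not from the original slides), here is a minimal Python sketch assuming Euclidean distance; threshold_cluster and the threshold t are hypothetical names. Patterns closer than the threshold are linked, and the connected components of the resulting graph are reported as clusters.

    import numpy as np

    def threshold_cluster(X, t):
        """Group patterns whose pairwise Euclidean distance is below t.
        Clusters are the connected components of the 'closer than t'
        graph, found with a simple union-find."""
        n = len(X)
        parent = list(range(n))

        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]  # path halving
                i = parent[i]
            return i

        # Link every pair of patterns closer than the threshold.
        for i in range(n):
            for j in range(i + 1, n):
                if np.linalg.norm(X[i] - X[j]) < t:
                    parent[find(i)] = find(j)

        # Relabel the components as consecutive cluster ids.
        roots = {}
        return np.array([roots.setdefault(find(i), len(roots)) for i in range(n)])

    X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
    print(threshold_cluster(X, t=1.0))  # [0 0 1 1]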
◮ The next challenge after selecting the similarity measure is the choice of a criterion function to be optimized.
◮ Suppose that we have a set D = {x1, . . . , xn} of n samples that we want to partition into exactly k disjoint subsets D1, . . . , Dk.
◮ Each subset is to represent a cluster, with samples in the same cluster being more similar to each other than they are to samples in other clusters.
◮ The simplest and most widely used criterion function for clustering is the sum-of-squared-error criterion.
◮ Suppose that the given set of n patterns has somehow been partitioned into k clusters D1, . . . , Dk.
◮ Let $n_i$ be the number of samples in $D_i$ and let $m_i$ be the mean of those samples,
$$m_i = \frac{1}{n_i} \sum_{x \in D_i} x.$$
◮ Then, the sum-of-squared errors is defined by
$$J_e = \sum_{i=1}^{k} \sum_{x \in D_i} \|x - m_i\|^2.$$
◮ For a given cluster Di, the mean vector mi (the centroid) is the best representative of the samples in Di in the sense that it minimizes the sum of the squared distances from the samples to mi.
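◮ The criterion is straightforward to compute; a minimal sketch, assuming the patterns are the rows of a NumPy array X and labels holds integer cluster ids (hypothetical names):

    import numpy as np

    def sum_of_squared_error(X, labels):
        """Je: for each cluster, sum the squared distances of its
        samples to the cluster mean, then add the per-cluster sums."""
        Je = 0.0
        for i in np.unique(labels):
            Di = X[labels == i]
            mi = Di.mean(axis=0)           # centroid m_i of cluster D_i
            Je += ((Di - mi) ** 2).sum()   # sum of ||x - m_i||^2 over D_i
        return Je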
◮ A general algorithm for iterative squared-error partitioning:
1. Select an initial partition with k clusters.
2. Generate a new partition by assigning each pattern to its closest cluster center.
3. Compute new cluster centers as the centroids of the clusters.
4. Repeat steps 2 and 3 until the cluster memberships stabilize.
5. Adjust the number of clusters by merging and splitting existing clusters, or by removing small or outlier clusters.
◮ This algorithm, without step 5, is also known as the k-means algorithm; a sketch follows below.
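◮ A minimal NumPy sketch of steps 1-4 (plain k-means); the function name and the choice of initializing the centers from k randomly selected patterns are illustrative assumptions, and clusters are assumed never to become empty.

    import numpy as np

    def kmeans(X, k, n_iter=100, seed=0):
        """Plain k-means: steps 1-4 of the algorithm above (no merge/split)."""
        rng = np.random.default_rng(seed)
        # Step 1: initialize the centers with k randomly chosen patterns.
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iter):
            # Step 2: assign each pattern to its closest cluster center.
            d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            labels = d.argmin(axis=1)
            # Step 3: recompute each center as the centroid of its cluster.
            new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
            # Step 4: stop when the centers (hence the partition) stabilize.
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        return labels, centers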
◮ k-means is computationally efficient and gives good results if the clusters are compact and well separated in the feature space.
◮ However, choosing k and choosing the initial partition are the main drawbacks of this algorithm.
◮ The value of k is often chosen empirically or by prior knowledge about the data.
◮ The initial partition is often chosen by randomly selecting k patterns from the data, or by generating k random points in the range of the data.
◮ Numerous attempts have been made to improve the performance of the basic k-means algorithm:
◮ incorporating a fuzzy criterion, resulting in fuzzy k-means,
◮ using genetic algorithms, simulated annealing, or deterministic annealing to search for a better partition,
◮ using iterative splitting to find the initial partition.
◮ Another alternative is to use model-based clustering using mixture models.
◮ In model-based clustering, the value of k corresponds to the number of mixture components and can be chosen using model selection criteria; a sketch follows below.
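◮ As one concrete realization (an assumption, not necessarily the variant meant on the original slide), a Gaussian mixture can be fitted by EM with scikit-learn, with the number of components k chosen by the Bayesian information criterion (BIC):

    from sklearn.mixture import GaussianMixture

    def select_k_by_bic(X, k_max=10):
        """Fit Gaussian mixtures with k = 1..k_max components by EM
        and return the model with the lowest BIC."""
        models = [GaussianMixture(n_components=k, random_state=0).fit(X)
                  for k in range(1, k_max + 1)]
        return min(models, key=lambda m: m.bic(X))

    # labels = select_k_by_bic(X).predict(X)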
(Figure: k-means results. (a) Good initialization. (b) Good initialization. (c) Bad initialization. (d) Bad initialization.)
◮ The k-means algorithm produces a flat data description where the clusters are disjoint and all lie at the same level.
◮ In some applications, groups of patterns share common characteristics only when viewed at a particular level of similarity.
◮ Hierarchical clustering tries to capture such multi-level structure by finding a nested sequence of groupings of the patterns.
◮ In hierarchical clustering, for a set of n samples,
◮ the first level consists of n clusters (each cluster containing exactly one sample),
◮ the second level contains n − 1 clusters,
◮ the third level contains n − 2 clusters,
◮ and so on until the last (n'th) level, at which all samples form a single cluster.
◮ Given any two samples, at some level they will be grouped together in the same cluster.
◮ A natural representation of hierarchical clustering is a tree, called a dendrogram, which shows how the samples are grouped and the similarity levels at which the groupings change.
◮ If there is an unusually large gap between the similarity values of two consecutive levels, this argues that a natural clustering exists at that level; the sketch below cuts the dendrogram at such a gap.
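◮ A sketch of this gap heuristic using SciPy (the function name and the midpoint cut are illustrative assumptions):

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage

    def cut_at_largest_gap(X, method="single"):
        """Build the dendrogram and cut it where consecutive merge
        distances show the largest gap."""
        Z = linkage(X, method=method)   # n-1 merges; column 2 holds merge distances
        heights = Z[:, 2]
        i = np.diff(heights).argmax()   # largest jump between successive levels
        cut = (heights[i] + heights[i + 1]) / 2
        return fcluster(Z, t=cut, criterion="distance")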
◮ Agglomerative Hierarchical Clustering:
1. Begin with n singleton clusters: Di = {xi}, i = 1, . . . , n.
2. Find the nearest (most similar) pair of distinct clusters, Di and Dj.
3. Merge Di and Dj into a single cluster.
4. Repeat steps 2 and 3 until the desired number of clusters is reached (or until all samples form a single cluster).
◮ Popular distance measures (for two clusters Di and Dj):
$$d_{min}(D_i, D_j) = \min_{x \in D_i, \, x' \in D_j} \|x - x'\|$$
$$d_{max}(D_i, D_j) = \max_{x \in D_i, \, x' \in D_j} \|x - x'\|$$
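◮ Both measures are simple to evaluate with NumPy broadcasting (a sketch; clusters are assumed to be arrays whose rows are samples):

    import numpy as np

    def pairwise(Di, Dj):
        """All Euclidean distances between the samples of two clusters."""
        return np.linalg.norm(Di[:, None, :] - Dj[None, :, :], axis=2)

    d_min = lambda Di, Dj: pairwise(Di, Dj).min()  # nearest-neighbor distance
    d_max = lambda Di, Dj: pairwise(Di, Dj).max()  # farthest-neighbor distance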
◮ When dmin is used to measure the distance between clusters, the algorithm is called the nearest-neighbor clustering algorithm.
◮ Moreover, if the algorithm is terminated when the distance between the nearest clusters exceeds an arbitrary threshold, it is called the single-linkage algorithm.
◮ In graph-theoretic terms:
◮ patterns represent the nodes of a graph,
◮ edges connect patterns belonging to the same cluster,
◮ merging two clusters corresponds to adding an edge between the nearest pair of nodes in the two clusters; if merging continues until all samples are linked, the result is a spanning tree of the data.
◮ When dmax is used to measure the distance between clusters, the algorithm is called the farthest-neighbor clustering algorithm.
◮ Moreover, if the algorithm is terminated when the distance between the nearest clusters exceeds an arbitrary threshold, it is called the complete-linkage algorithm.
◮ In graph-theoretic terms:
◮ patterns represent the nodes of a graph,
◮ edges connect all patterns belonging to the same cluster, so every cluster constitutes a complete subgraph,
◮ merging two clusters corresponds to adding edges between every pair of nodes in the two clusters.
◮ Stepwise-Optimal Hierarchical Clustering: at each step, merge the pair of distinct clusters whose merger changes the value of the criterion function as little as possible.
◮ When the sum-of-squared-error criterion Je is used, the pair of clusters whose merger increases Je as little as possible is the pair that minimizes
$$d_e(D_i, D_j) = \sqrt{\frac{n_i n_j}{n_i + n_j}} \, \|m_i - m_j\|.$$
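◮ A sketch of this merge distance (assuming clusters stored as NumPy arrays of row vectors):

    import numpy as np

    def d_e(Di, Dj):
        """Distance whose minimizing pair is the stepwise-optimal
        merge under the sum-of-squared-error criterion Je."""
        ni, nj = len(Di), len(Dj)
        mi, mj = Di.mean(axis=0), Dj.mean(axis=0)
        return np.sqrt(ni * nj / (ni + nj)) * np.linalg.norm(mi - mj)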
◮ Graph: (S, R)
◮ S: set of nodes,
◮ R: set of edges, R ⊆ S × S.
◮ Clique: set of nodes that are all connected to each other, i.e., a complete subgraph.
◮ Goal: find clusters in a graph that are not as dense as cliques but still form strongly connected groups of nodes.
(Figure: an example graph with nodes 1-10.)
◮ (X, Y ) ∈ R means that Y is a neighbor of X.
◮ Conditional density D(Y |X) is the number of nodes in the neighborhood of X that are also neighbors of Y.
◮ Given an integer K, a dense region Z around a node X is formed using the nodes with the highest conditional densities given X.
◮ Z(X) = Z(X, J) is a dense region candidate around X for a suitably chosen J ≤ K.
◮ Association of a node X to a subset B of S measures how strongly X is connected to the nodes in B.
◮ Compactness of a subset B of S measures the average association of its members to B.
◮ A dense region B of the graph (S, R) should satisfy both a high association for each of its members and a high overall compactness.
◮ Algorithm for finding a dense region around a node X (one plausible greedy version is sketched below):
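◮ The exact steps could not be recovered here; as one plausible greedy reading of the definitions above (an illustration only, not necessarily the algorithm on the original slide), the region can be grown from X by repeatedly adding the outside node best connected to the current region:

    def dense_region(adj, X, K):
        """Hypothetical greedy sketch: grow a candidate region around X
        by adding, at most K times, the outside node with the most
        edges into the region; stop when even the best candidate is
        only weakly connected. adj maps each node to its neighbor set."""
        region = {X}
        for _ in range(K):
            candidates = {Y: len(adj[Y] & region) for Y in adj if Y not in region}
            if not candidates:
                break
            best = max(candidates, key=candidates.get)
            # Assumed compactness-style stopping rule (not from the slides).
            if candidates[best] < len(region) / 2:
                break
            region.add(best)
        return region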
◮ Given the dense regions, the algorithm for graph-theoretic clustering selects a subset of the dense region candidates as the final clusters.
(Figure: the example graph with nodes 1-10, revisited.)
◮ The procedures we have considered so far either assume that the number of clusters is known or require other user-specified parameters such as similarity thresholds.
◮ These may be reasonable assumptions for some applications, but they are quite restrictive in general.
◮ Furthermore, most of the iterative algorithms that we use are sensitive to initialization and may converge only to local optima of the criterion function.
◮ Methods for validating the results of a clustering algorithm include:
◮ Repeating the clustering procedure for different values of the parameters (e.g., the number of clusters) and comparing the resulting partitions,
◮ Evaluating the goodness-of-fit using measures such as the value of the criterion function as a function of the number of clusters,
◮ Formulating hypothesis tests that check whether multiple clusters are really present in the data.
◮ The groupings produced by unsupervised clustering can also be compared to ground-truth class labels when they are available.
◮ In an optimal result, the patterns with the same class labels would be grouped into the same clusters, and patterns with different labels would be separated into different clusters.
◮ The following measures quantify how well the results of the clustering match the ground truth:
◮ Entropy,
◮ Rand index.
◮ Entropy is an information theoretic criterion that measures the homogeneity of the clusters with respect to the ground-truth class labels.
◮ Given K as the number of clusters resulting from the clustering and C as the number of ground-truth classes, let
◮ $h_{ck}$ denote the number of patterns assigned to cluster $k$ with ground-truth class label $c$,
◮ $h_{c\cdot} = \sum_{k=1}^{K} h_{ck}$ denote the number of patterns with ground-truth class label $c$,
◮ $h_{\cdot k} = \sum_{c=1}^{C} h_{ck}$ denote the number of patterns assigned to cluster $k$.
◮ The quality of individual clusters is measured in terms of the homogeneity of the class labels within each cluster.
◮ For each cluster $k$, the cluster entropy $E_k$ is given by
$$E_k = -\sum_{c=1}^{C} \frac{h_{ck}}{h_{\cdot k}} \log \frac{h_{ck}}{h_{\cdot k}}.$$
◮ Then, the overall cluster entropy $E_{cluster}$ is given by a weighted sum of the individual cluster entropies:
$$E_{cluster} = \frac{1}{\sum_{k=1}^{K} h_{\cdot k}} \sum_{k=1}^{K} h_{\cdot k} E_k.$$
◮ A smaller cluster entropy value indicates a higher homogeneity of the clusters.
◮ However, the cluster entropy continues to decrease as the number of clusters increases; in the extreme case where each pattern forms its own cluster, the cluster entropy becomes zero.
◮ To overcome this problem, another entropy criterion that measures how the patterns of each class are scattered among the clusters can be used.
◮ For each class $c$, the class entropy $E_c$ is given by
$$E_c = -\sum_{k=1}^{K} \frac{h_{ck}}{h_{c\cdot}} \log \frac{h_{ck}}{h_{c\cdot}}.$$
◮ Then, the overall class entropy $E_{class}$ is given by a weighted sum of the individual class entropies:
$$E_{class} = \frac{1}{\sum_{c=1}^{C} h_{c\cdot}} \sum_{c=1}^{C} h_{c\cdot} E_c.$$
◮ Unlike the cluster entropy, the class entropy increases when the patterns of the same class are divided among multiple clusters.
◮ Therefore, the two measures can be combined for an overall evaluation of the clustering result; a sketch that computes both is given below.
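◮ A sketch that evaluates both entropies from the contingency counts $h_{ck}$ (assuming a C x K NumPy array of counts with nonempty rows and columns):

    import numpy as np

    def cluster_and_class_entropy(h):
        """h[c, k] = number of patterns of class c assigned to cluster k.
        Returns (E_cluster, E_class) as defined above."""
        n = h.sum()
        h_cdot = h.sum(axis=1)   # h_c. : per-class totals
        h_dotk = h.sum(axis=0)   # h_.k : per-cluster totals

        def entropy(counts, total):
            p = counts[counts > 0] / total
            return -(p * np.log(p)).sum()

        E_cluster = sum(h_dotk[k] / n * entropy(h[:, k], h_dotk[k])
                        for k in range(h.shape[1]))
        E_class = sum(h_cdot[c] / n * entropy(h[c, :], h_cdot[c])
                      for c in range(h.shape[0]))
        return E_cluster, E_class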
◮ The Rand index can also be used to measure the agreement between a clustering result and the ground-truth class labels.
◮ The agreement occurs if
◮ two patterns that belong to the same class are put into the same cluster, or
◮ two patterns that belong to different classes are put into different clusters.
◮ The Rand index is computed as the proportion of all pattern pairs for which such an agreement holds.
◮ The index has a value between 0 and 1, where 0 indicates no agreement and 1 indicates perfect agreement between the clustering and the ground truth; a sketch follows below.
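◮ A direct O(n^2) sketch of this pair-counting definition (the function name is hypothetical):

    from itertools import combinations

    def rand_index(classes, clusters):
        """Fraction of pattern pairs on which ground truth and
        clustering agree: same class & same cluster, or different
        class & different cluster."""
        pairs = list(combinations(range(len(classes)), 2))
        agree = sum((classes[i] == classes[j]) == (clusters[i] == clusters[j])
                    for i, j in pairs)
        return agree / len(pairs)

    print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0: perfect match up to relabeling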