Clustering
Sriram Sankararaman (Adapted from slides by Junming Yin)
Outline
- Introduction
  - Unsupervised learning
  - What is cluster analysis?
  - Applications of clustering
- Dissimilarity (similarity) of samples
- Clustering
In regression, the training data are represented as pairs (x_i, y_i), and the goal is to learn a function f that predicts y given x (supervised learning).
In unsupervised learning we are given only unlabelled data x_1, ..., x_n. Can we infer some properties of the distribution of X?
Two common unsupervised tasks:
- Dimensionality reduction: find some low-dimensional features that might be sufficient to describe the samples (next lecture).
- Clustering: perform exploratory data analysis and gain some insight into the nature or structure of the data.
Cluster analysis: group the samples such that samples within the same group are more similar to each other than they are to the samples of other clusters.
Application: image segmentation (http://people.cs.uchicago.edu/~pff/segment/)
Application: clustering gene expression data (Eisen et al., PNAS 1998)
Bishop, PRML
How do we measure the dissimilarity between samples?
The result of clustering depends critically on the choice of dissimilarity.
The choice must be driven by the particular application (more on this later).
The most common choice is the Euclidean distance.
Euclidean distance is invariant to translations and rotations in feature space, but not invariant to scaling of the features.
A common practice is to standardize the features so that all of them have zero mean and unit variance.
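Standardization is a per-column shift and scale. A minimal NumPy sketch (the toy data are made up for illustration):

```python
import numpy as np

# Toy data: two features on very different scales (made-up numbers).
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 200.0],
              [4.0, 400.0],
              [5.0, 500.0]])

# Standardize each feature (column) to zero mean and unit variance.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
```

After this transform, every feature contributes on the same scale to the Euclidean distance.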
(Figure: simulated data; 2-means without standardization vs. 2-means with standardization.)
K-means partitions the data into K clusters, each of which is summarized by a prototype μ_k.
Each sample is assigned to the closest prototype: r_i = argmin_k ||x_i − μ_k||² for all data indices i.
K-means minimizes the total squared distance from each data point to its assigned prototype (equivalent to the within-cluster scatter):
J = Σ_i Σ_k r_ik ||x_i − μ_k||²,
where the μ_k are the prototypes, the r_ik ∈ {0, 1} are the responsibilities, and the x_i are the data.
The algorithm alternates two steps: assign each point to its nearest prototype, then recompute each prototype as the mean of the points assigned to it.
Each step decreases the objective J, so this is a coordinate-descent procedure.
An empty cluster can be handled by a merge-split approach.
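The two-step iteration can be sketched in plain NumPy (the helper name `kmeans`, the toy data, and the explicit initial prototypes are illustrative, not from the slides):

```python
import numpy as np

def kmeans(X, mu0, n_iter=100):
    """Plain K-means: alternate hard assignments and prototype updates."""
    mu = mu0.copy()
    for _ in range(n_iter):
        # Assignment step: each point takes the index of its nearest prototype.
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        r = d2.argmin(axis=1)
        # Update step: each prototype becomes the mean of its assigned points.
        new_mu = np.array([X[r == k].mean(axis=0) if (r == k).any() else mu[k]
                           for k in range(len(mu))])
        if np.allclose(new_mu, mu):   # assignments stopped changing
            break
        mu = new_mu
    return mu, r

# Two well-separated blobs and an initial prototype near each of them.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, (50, 2)), rng.normal(5.0, 0.5, (50, 2))])
mu, r = kmeans(X, mu0=np.array([[0.0, 0.0], [5.0, 5.0]]))
```

Passing the initial prototypes explicitly makes the sensitivity to initialization (discussed below) easy to experiment with.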
K-means is guaranteed to converge: the objective decreases at every step, and there is only a finite number of possible settings for the responsibilities.
However, it may converge to a poor local minimum. In practice, run the algorithm with many different initial settings and keep the best result.
20
21
22
23
24
25
26
27
28
29
30
How do we choose K? Sometimes K is given by the problem domain.
Otherwise, K is usually selected by some heuristics in practice.
One heuristic: examine the within-cluster scatter with increasing K. While K is below the true number of groups, each cluster still contains a subset of true underlying groups, so the scatter falls substantially; afterwards, not a lot more.
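The "falls substantially, then not much more" pattern can be checked numerically. A sketch on hypothetical two-blob data, comparing the scatter for prototype sets of size 1, 2, and 3 (the prototypes here come from the known grouping rather than a full K-means run, just to isolate the effect):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(0.0, 0.5, (50, 2))   # first true group
B = rng.normal(5.0, 0.5, (50, 2))   # second true group
X = np.vstack([A, B])

def scatter(X, mu):
    """Within-cluster scatter: squared distance to the nearest prototype."""
    d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()

W1 = scatter(X, X.mean(axis=0, keepdims=True))        # K = 1
W2 = scatter(X, np.vstack([A.mean(0), B.mean(0)]))    # K = 2 (true grouping)
W3 = scatter(X, np.vstack([A[:25].mean(0),            # K = 3: split one group
                           A[25:].mean(0), B.mean(0)]))
```

W1 is far larger than W2, while W3 is only marginally smaller than W2: the elbow sits at the true K = 2.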
(Figure: within-cluster scatter plotted against K as a heuristic for setting K; in this example the curve levels off at K = 2.)
K-means is sensitive to the initialization.
A useful strategy: pick the first prototype at random, then repeatedly choose the next prototype to be the data point farthest from the prototypes chosen so far.
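A minimal sketch of this farthest-point initialization (the function name and the three-blob toy data are illustrative):

```python
import numpy as np

def farthest_point_init(X, K):
    """Pick the first prototype at random, then repeatedly add the
    data point farthest from the prototypes chosen so far."""
    rng = np.random.default_rng(0)
    protos = [X[rng.integers(len(X))]]
    for _ in range(K - 1):
        # Distance from every point to its nearest chosen prototype.
        d2 = ((X[:, None, :] - np.array(protos)[None, :, :]) ** 2).sum(axis=2)
        protos.append(X[d2.min(axis=1).argmax()])
    return np.array(protos)

# Three well-separated blobs along the diagonal.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.3, (30, 2)) for c in (0.0, 5.0, 10.0)])
mu0 = farthest_point_init(X, K=3)
```

For well-separated groups, this places one initial prototype in each group, whereas uniform random initialization often places two in the same group.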
Limitations of K-means: it makes hard assignments; a mixture model instead gives soft, probabilistic assignments (GMM).
The Euclidean distance implicitly assumes a spherical shape of the same size for each cluster.
As K changes, the memberships can change in an arbitrary way; the clusters are not necessarily nested.
A multivariate Gaussian N(x | μ, Σ) is parameterized by a mean μ and a covariance Σ.
A Gaussian mixture model (GMM) is a weighted sum of Gaussians:
p(x) = Σ_k π_k N(x | μ_k, Σ_k),
where the mixing weights satisfy π_k ≥ 0 and Σ_k π_k = 1; the parameters to be estimated are θ = {π_k, μ_k, Σ_k}.
A latent (unobserved) variable z_i indicating the cluster membership is associated with each data point x_i.
(Figure: a synthetic data set shown without colours, i.e. with the latent cluster labels hidden.)
We want to fit the GMM to the observed data.
Maximum likelihood: choose the parameters to maximize the incomplete log likelihood
log p(X | θ) = Σ_i log Σ_k π_k N(x_i | μ_k, Σ_k).
Because of the sum inside the logarithm, there is no closed-form solution.
If we observed the latent variables, fitting would be easy: each Gaussian would be fit to its corresponding set of data points.
With equal, spherical covariances, the (negated) complete log likelihood is exactly the loss function used in K-means.
The EM algorithm maximizes the incomplete log likelihood by working with the (easier) complete log likelihood instead.
E-step: compute the expected values of the latent variables (the responsibilities of data points).
We do not observe the z_i, but we still have the current parameter estimates, so we apply Bayes' rule:
γ_ik = π_k N(x_i | μ_k, Σ_k) / Σ_j π_j N(x_i | μ_j, Σ_j).
M-step: re-estimate the parameters by maximizing the expected complete log likelihood.
Alternate the E- and M-steps until the log likelihood of the data does not increase any more.
EM is sensitive to the initial guess of the parameters (as in K-means).
When the covariances are constrained to εI with ε → 0, the two methods coincide.
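The full loop can be sketched for the spherical-covariance special case (a simplifying assumption, chosen to mirror the K-means connection above; the function names, toy data, and explicit initial means are illustrative):

```python
import numpy as np

def spherical_gaussian(X, mu, var):
    """Density N(x | mu, var * I) evaluated at each row of X."""
    d = X.shape[1]
    norm = (2.0 * np.pi * var) ** (-d / 2.0)
    return norm * np.exp(-((X - mu) ** 2).sum(axis=1) / (2.0 * var))

def em_gmm(X, mu0, n_iter=100):
    """EM for a GMM with spherical covariances var_k * I."""
    n, d = X.shape
    K = len(mu0)
    mu = mu0.copy()
    var = np.full(K, X.var())        # broad initial variances
    pi = np.full(K, 1.0 / K)         # uniform initial mixing weights
    for _ in range(n_iter):
        # E-step: responsibilities via Bayes' rule.
        p = np.stack([pi[k] * spherical_gaussian(X, mu[k], var[k])
                      for k in range(K)], axis=1)
        gamma = p / p.sum(axis=1, keepdims=True)
        # M-step: weighted re-estimation of the parameters.
        Nk = gamma.sum(axis=0)
        mu = (gamma.T @ X) / Nk[:, None]
        var = np.array([(gamma[:, k] * ((X - mu[k]) ** 2).sum(axis=1)).sum()
                        / (d * Nk[k]) for k in range(K)])
        pi = Nk / n
    return pi, mu, var

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0.0, 1.0, (200, 2)), rng.normal(6.0, 1.0, (200, 2))])
pi, mu, var = em_gmm(X, mu0=np.array([[-1.0, -1.0], [7.0, 7.0]]))
```

Replacing the soft γ_ik with hard 0/1 responsibilities and freezing var at a tiny constant recovers exactly the K-means updates.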
K-means:
- based on the Euclidean distance.
- a hard-assignment algorithm: each point belongs to exactly one of the clusters.
- hard responsibilities computed during E-step.
- every cluster is implicitly treated with equal probability.

EM for GMM:
- based on the likelihood.
- a soft-assignment algorithm: each point carries a probability of membership.
- soft responsibilities computed during E-step.
- clusters can occur with different probabilities (mixing weights).
Squared Euclidean distance gives high influence to more distant points, so K-means is not robust to outliers.
K-medoids requires only the pairwise dissimilarities between samples, and not the attributes themselves.
In K-medoids, the prototype of each cluster is restricted to be one of the points assigned to the cluster (the medoid).
As in K-means, the algorithm alternates between assigning points to the nearest medoid and updating the prototypes.
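A minimal sketch of this alternation, working only from a precomputed dissimilarity matrix (the function name, the L1 dissimilarity, and the deterministic first/last-point initialization are illustrative choices):

```python
import numpy as np

def kmedoids(D, medoids0, n_iter=100):
    """Plain K-medoids on a precomputed n x n dissimilarity matrix D.
    Prototypes are restricted to data points (medoids), so only pairwise
    dissimilarities are needed, never the raw attributes."""
    medoids = list(medoids0)
    for _ in range(n_iter):
        # Assignment step: send each point to its nearest medoid.
        r = D[:, medoids].argmin(axis=1)
        # Update step: the new medoid of each cluster is the member
        # minimizing the total dissimilarity to the other members.
        new = []
        for k in range(len(medoids)):
            members = np.flatnonzero(r == k)
            if members.size == 0:           # keep the old medoid if empty
                new.append(medoids[k])
                continue
            within = D[np.ix_(members, members)].sum(axis=1)
            new.append(int(members[within.argmin()]))
        if new == medoids:
            break
        medoids = new
    return medoids, r

# Two blobs; the L1 (Manhattan) dissimilarity stands in for any dissimilarity.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0.0, 0.5, (30, 2)), rng.normal(5.0, 0.5, (30, 2))])
D = np.abs(X[:, None, :] - X[None, :, :]).sum(axis=2)
medoids, r = kmedoids(D, medoids0=[0, len(X) - 1])
```

Once D is built, the attributes X are never touched again, which is the point of the method.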
K-medoids can use an arbitrary dissimilarity, not just the squared Euclidean distance.
Choosing the most centrally located member instead of the mean also makes the procedure more robust, much as the median is more robust than the mean in regression.
(Figure: dendrogram over five points a, b, c, d, e; agglomerative clustering proceeds from Step 0 to Step 4, merging clusters, while divisive clustering proceeds from Step 4 to Step 0, splitting them.)
Agglomerative (bottom-up): start with every sample in its own cluster; at each step, merge the pair of clusters with the smallest between-cluster dissimilarity (defined later on).
Divisive (top-down): start with all samples in one cluster; at each step, split a cluster (e.g. the one with the largest diameter). Splitting a cluster is also a clustering problem (usually done in a greedy way); this is less popular than the bottom-up way.
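The bottom-up procedure is available off the shelf; a short sketch using SciPy on hypothetical two-blob data:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0.0, 0.4, (20, 2)), rng.normal(6.0, 0.4, (20, 2))])

# Agglomerative clustering: repeatedly merge the pair of clusters with the
# smallest between-cluster dissimilarity.
Z = linkage(pdist(X), method="complete")   # also try "single" or "average"

# Cut the dendrogram into a flat clustering with two groups.
labels = fcluster(Z, t=2, criterion="maxclust")
```

The `method` argument selects the between-cluster dissimilarity (linkage) discussed below, and `fcluster` implements the dendrogram cut.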
Cutting the dendrogram at a given height produces a flat clustering.
Clusters that merge only when the dissimilarity exceeds some threshold, i.e. long branches, represent the most natural division into clusters.
The dissimilarity between two groups G and H is computed from the pairwise dissimilarities d(i, i') with i in G and i' in H:
- single linkage: the minimum over pairs.
- complete linkage: the maximum over pairs.
- average linkage: the average over pairs.
Single and complete linkage depend only on the ordering of the dissimilarities, so they are invariant under monotone increasing transform.
Example: Human Tumor Microarray Data
Rows of the expression matrix correspond to genes, columns to tissue samples.
Goals:
- infer the functions of unknown genes from known genes with similar expression profiles.
- identify disease profiles: tissues with similar disease should yield similar expression profiles.
(Figure: gene expression matrix.)
(Figure: heatmap of the gene expression matrix with randomly ordered rows and columns.)
Graph-based view: represent the samples as the vertices of a weighted undirected graph G.
Each pair of vertices is connected by an edge E whose weight encodes similarity: large weights mean the two samples are very similar; small weights imply dissimilarity.
Clustering then corresponds to partitioning the vertices of the graph.
Edges within a group should have large weights; edges across groups should have small weights, i.e. samples in different groups should be dissimilar.
The minimum-cut criterion tends to cut off small sets of isolated vertices, so the normalized cut divides the cut value by the volume of the subgraphs formed.
This balances the sizes of the partitions.
Finding the optimal normalized-cut partition is NP-hard.
A spectral relaxation of the normalized-cut criterion leads to spectral clustering:
embed the samples in the normalized matrix using the largest 2 eigenvectors of the matrix, then run a simple clustering scheme (say 2-means) on the embedding.
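A NumPy sketch of this pipeline on two concentric rings, a shape plain K-means cannot separate (the kernel bandwidth and the ring data are illustrative choices):

```python
import numpy as np

# Two concentric rings: a clustering that plain K-means cannot find.
t = np.linspace(0.0, 2.0 * np.pi, 50, endpoint=False)
ring = np.c_[np.cos(t), np.sin(t)]
X = np.vstack([1.0 * ring, 4.0 * ring])

# Affinity matrix with a Gaussian kernel: large weights = very similar.
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
W = np.exp(-d2 / (2.0 * 0.5 ** 2))
np.fill_diagonal(W, 0.0)

# Normalized affinity D^{-1/2} W D^{-1/2}: its top eigenvectors give the
# relaxation of the normalized-cut criterion.
dinv = 1.0 / np.sqrt(W.sum(axis=1))
M = dinv[:, None] * W * dinv[None, :]
vals, vecs = np.linalg.eigh(M)                     # eigenvalues ascending
U = vecs[:, -2:]                                   # largest 2 eigenvectors
U = U / np.linalg.norm(U, axis=1, keepdims=True)   # row-normalize

# Finish with a simple clustering scheme (2-means) on the embedding,
# initialized by a farthest-point pair.
mu = U[[0, int(((U - U[0]) ** 2).sum(axis=1).argmax())]]
for _ in range(20):
    labels = ((U[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
    mu = np.array([U[labels == k].mean(axis=0) for k in (0, 1)])
```

In the eigenvector embedding the two rings collapse to two tight clusters, so even 2-means separates them.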
Suppose we are told that certain pairs of points are “similar”.
We can learn a Mahalanobis distance d_A(x, y) = sqrt((x − y)ᵀ A (x − y)), which is equivalent to replacing each point x with A^{1/2} x and then applying the standard Euclidean distance.
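The equivalence is easy to verify numerically (the matrix A below is an arbitrary positive-definite example standing in for a learned metric):

```python
import numpy as np

# Example positive-definite A (stands in for a learned metric).
A = np.array([[2.0, 0.5],
              [0.5, 1.0]])

x = np.array([1.0, 2.0])
y = np.array([3.0, 1.0])

# Mahalanobis distance d_A(x, y) = sqrt((x - y)^T A (x - y)).
d_mahalanobis = np.sqrt((x - y) @ A @ (x - y))

# Matrix square root of the symmetric PSD A via its eigendecomposition.
w, V = np.linalg.eigh(A)
A_half = V @ np.diag(np.sqrt(w)) @ V.T

# Same distance, computed as plain Euclidean distance after mapping by A^{1/2}.
d_euclidean = np.linalg.norm(A_half @ x - A_half @ y)
```

So metric learning can be viewed as learning a linear transformation of the feature space, after which all the Euclidean-distance machinery above applies unchanged.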
Learn A so that it induces the desired dissimilarity: minimize the distances between pairs labelled similar while keeping the remaining pairs apart.
The resulting constrained optimization problem can be solved by gradient descent and iterative projection.
Further reading:
Hastie, Tibshirani and Friedman, The Elements of Statistical Learning, Chapter 14.
Bishop, Pattern Recognition and Machine Learning, Chapter 9.