Data Mining in Bioinformatics
Day 2: Clustering
Karsten Borgwardt
February 21 to March 4, 2011
Machine Learning & Computational Biology Research Group, MPIs Tübingen
What is clustering?
Clustering
Class discovery
- Given a set of objects, group them into clusters (classes that are unknown beforehand)
- This is an instance of unsupervised learning (no training dataset)
In practice
- Cluster images to find categories
- Cluster patient data to find disease subtypes
- Cluster persons in social networks to detect communities
What is clustering?
Supervised versus unsupervised learning
- General inference problem: given x_i, predict y_i by learning a function f
- Training set: a set of examples (x_i, y_i) where y_i = f(x_i) (but f is still unknown!)
- Test set: a new set of data points x_i where y_i is unknown
- Supervised: use the training data to infer your model, then apply this model to the test data
- Unsupervised: no training data; learn the model and apply it directly to the test data
K-means
Objective
Partition the dataset into k clusters such that the intra-cluster variance is minimised:

V(D) = \sum_{i=1}^{k} \sum_{x_j \in S_i} (x_j - \mu_i)^2    (1)

where V is the variance, S_i is a cluster, \mu_i is its mean, and D is the dataset of all points x_j.
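Code sketch (Python): one way to evaluate objective (1) for a given partition. The function name and the label-array representation of the partition are assumptions of this sketch, not part of the slides.

import numpy as np

def within_cluster_variance(X, labels, k):
    # V(D): sum of squared distances of each point to the mean of its own cluster
    total = 0.0
    for i in range(k):
        S_i = X[labels == i]
        if len(S_i) > 0:
            total += ((S_i - S_i.mean(axis=0)) ** 2).sum()
    return total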
K-means
Lloyd's algorithm
- 1. Partition the data into k initial clusters
- 2. Compute the mean of each cluster
- 3. Assign each point to the cluster whose mean is closest to the point
- 4. If any point changed its cluster membership: repeat from step 2
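Code sketch (Python): a minimal NumPy version of Lloyd's algorithm. The random balanced initial partition, the iteration cap, and the fact that empty clusters are not handled are assumptions of this sketch, not part of the slides.

import numpy as np

def lloyd_kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Partition the data into k initial clusters (here: a random balanced partition)
    labels = rng.permutation(np.arange(len(X)) % k)
    for _ in range(max_iter):
        # 2. Compute the mean of each cluster
        means = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        # 3. Assign each point to the cluster whose mean is closest
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # 4. If no point changed its cluster membership, stop; otherwise repeat from step 2
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, means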
K-means
Example: before clustering
K-means
Example: after clustering (k=2)
K-means
Things to note
- k-means is still the state-of-the-art method for most clustering tasks.
- When proposing a new clustering method, one should always compare to k-means.
Lloyd's algorithm has several drawbacks:
- It is order-dependent.
- Its result depends on the initialisation of the clusters.
- Its result may be a local optimum, not the globally optimal solution.
K-centroid
‘Brother’ of k-means
- Don't use the mean of each cluster, but the medoid.
- The medoid is the point closest to the mean:

m_i = \operatorname{argmin}_{x_j \in S_i} \|x_j - \mu_i\|^2    (2)

- One thereby restricts the cluster ‘means’ to points that are present in the dataset.
- One only minimises the variance with respect to these points.
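Code sketch (Python): the medoid of one cluster according to Equation (2). The function name is an assumption of this sketch.

import numpy as np

def medoid(S_i):
    # The medoid is the cluster point closest to the cluster mean mu_i
    mu_i = S_i.mean(axis=0)
    sq_dists = ((S_i - mu_i) ** 2).sum(axis=1)
    return S_i[np.argmin(sq_dists)]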
Kernel k-means
Kernelised k-means?
It would be attractive to perform clustering using kernels:
- we can move the clustering problem to different feature spaces
- we can cluster string and graph data
But we have to be able to perform all steps of k-means using kernels!
Kernel k-means
Kernelised k-means
The key step in k-means is to compute the distance between one data point x_1 and the mean of a cluster of points x_2, ..., x_m:

\Big\| \phi(x_1) - \frac{1}{m-1} \sum_{j=2}^{m} \phi(x_j) \Big\|^2 = k(x_1, x_1) - \frac{2}{m-1} \sum_{j=2}^{m} k(x_1, x_j) + \frac{1}{(m-1)^2} \sum_{i=2}^{m} \sum_{j=2}^{m} k(x_i, x_j)    (3)
This result is based on the fact that every kernel k induces a distance d:

d(x_i, x_j)^2 = \|\phi(x_i) - \phi(x_j)\|^2 = k(x_i, x_i) - 2 k(x_i, x_j) + k(x_j, x_j)
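Code sketch (Python): evaluating the distance of Equation (3) purely from a precomputed kernel matrix K, without ever forming phi explicitly. On the slide the cluster consists of the m-1 points x_2, ..., x_m, hence the factors 1/(m-1); here the cluster is passed as the index list of all of its points, so the code divides by that count. The function name and indexing convention are assumptions of this sketch.

import numpy as np

def kernel_dist_to_cluster_mean(K, x, cluster):
    # Squared feature-space distance between point x and the mean of the
    # cluster points, using only kernel evaluations
    cluster = np.asarray(cluster)
    m = len(cluster)
    self_term = K[x, x]                                        # k(x, x)
    cross_term = 2.0 / m * K[x, cluster].sum()                 # (2/m) * sum_j k(x, x_j)
    cluster_term = K[np.ix_(cluster, cluster)].sum() / m**2    # (1/m^2) * sum_{i,j} k(x_i, x_j)
    return self_term - cross_term + cluster_term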
Graph-based clustering I
Data representation
- The dataset D is given in terms of a graph G = (V, E).
- A data object v_i is a node in G.
- An edge e(i, j) from node v_i to node v_j has weight w(i, j).
Graph-based clustering
- Define a threshold θ.
- Remove all edges e(i, j) from G with weight w(i, j) > θ.
- Each connected component of the graph now corresponds to one cluster.
- Two nodes are in the same connected component if there is a path between them.
- Graph components can be found by depth-first search in the graph (O(|V| + |E|)), as in the sketch below.
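Code sketch (Python): the thresholding step followed by a depth-first search for connected components, implementing exactly the removal rule stated above. The edge-list representation and the function name are assumptions of this sketch.

def graph_threshold_clustering(n_nodes, weighted_edges, theta):
    # Remove all edges e(i, j) with weight w(i, j) > theta,
    # then return each connected component as one cluster
    adj = {v: [] for v in range(n_nodes)}
    for i, j, w in weighted_edges:
        if w <= theta:                 # an edge survives only if its weight does not exceed theta
            adj[i].append(j)
            adj[j].append(i)
    clusters, visited = [], set()
    for start in range(n_nodes):
        if start in visited:
            continue
        # depth-first search collects one connected component; O(|V| + |E|) overall
        stack, component = [start], []
        visited.add(start)
        while stack:
            v = stack.pop()
            component.append(v)
            for u in adj[v]:
                if u not in visited:
                    visited.add(u)
                    stack.append(u)
        clusters.append(component)
    return clusters

# Example: graph_threshold_clustering(5, [(0, 1, 0.2), (1, 2, 0.3), (3, 4, 0.1), (2, 3, 0.9)], theta=0.5)
# returns [[0, 1, 2], [3, 4]]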
Graph-based clustering II
Original graph
Graph-based clustering III
Thresholded graph (θ = 0.5)
Graph-based clustering IV
But how to get the graph in the first place?
- Think of the weights as a similarity measure.
- If two nodes are not connected, then their similarity measure is 0.
- Graph-based clustering creates clusters of similar objects: for any object v_i in a cluster, there is a second object v_j such that similarity(v_i, v_j) is larger than θ.
DBScan I
Noise-robust graph-based clustering
- Graph-based clustering can suffer from the fact that a single noisy edge connects two clusters.
- DBScan (Ester et al., 1996) is a noise-robust extension of graph-based clustering.
- DBScan is short for Density-Based Spatial Clustering of Applications with Noise.
Core object
- Two objects v_i and v_j with distance d(v_i, v_j) < ε belong to the same cluster if either v_i or v_j is a core object.
- v_i is a core object iff there are at least MinPts points within a distance of ε from v_i.
- A cluster is defined by iteratively checking this core-object property.
DBScan II
Code: Main
DBSCAN(SetOfPoints, Eps, MinPts)
  // SetOfPoints is UNCLASSIFIED
  ClusterId := nextId(NOISE);
  for i FROM 1 TO SetOfPoints.size do
    Point := SetOfPoints.get(i);
    if Point.ClId = UNCLASSIFIED then
      if ExpandCluster(SetOfPoints, Point, ClusterId, Eps, MinPts) then
        ClusterId := nextId(ClusterId)
      end if
    end if
  end for
DBScan III
Code: ExpandCluster
ExpandCluster(SetOfPoints, Point, ClId, Eps, MinPts) : Boolean;
  seeds := SetOfPoints.regionQuery(Point, Eps);
  if seeds.size < MinPts then
    SetOfPoints.changeClId(Point, NOISE);
    RETURN False;
  else
    SetOfPoints.changeClIds(seeds, ClId);
    seeds.delete(Point);
    while seeds <> Empty
      currentP := seeds.first();
      result := SetOfPoints.regionQuery(currentP, Eps);
DBScan IV
      if result.size >= MinPts then
        for i FROM 1 TO result.size do
          resultP := result.get(i);
          if resultP.ClId IN (UNCLASSIFIED, NOISE) then
            if resultP.ClId = UNCLASSIFIED then
              seeds.append(resultP);
            end if
            SetOfPoints.changeClId(resultP, ClId);
          end if // UNCLASSIFIED or NOISE
        end for;
      end if; // result.size >= MinPts
      seeds.delete(currentP);
    end while; // seeds <> Empty
    RETURN True;
  end if
end // ExpandCluster
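Code sketch (Python): a compact rendering of the DBSCAN pseudocode above. The NumPy-based region query and the encoding of UNCLASSIFIED and NOISE as negative labels are assumptions of this sketch.

import numpy as np

UNCLASSIFIED, NOISE = -2, -1

def region_query(X, i, eps):
    # All points within distance eps of point i (including i itself)
    return np.where(np.linalg.norm(X - X[i], axis=1) < eps)[0].tolist()

def dbscan(X, eps, min_pts):
    labels = np.full(len(X), UNCLASSIFIED)
    cluster_id = 0
    for i in range(len(X)):
        if labels[i] != UNCLASSIFIED:
            continue
        seeds = region_query(X, i, eps)
        if len(seeds) < min_pts:                # i is not a core object
            labels[i] = NOISE
            continue
        labels[seeds] = cluster_id              # ExpandCluster: start a new cluster from i
        seeds = [s for s in seeds if s != i]
        while seeds:
            current = seeds.pop(0)
            result = region_query(X, current, eps)
            if len(result) >= min_pts:          # current is itself a core object
                for r in result:
                    if labels[r] in (UNCLASSIFIED, NOISE):
                        if labels[r] == UNCLASSIFIED:
                            seeds.append(r)
                        labels[r] = cluster_id
        cluster_id += 1
    return labels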
DBScan V
Original graph
DBScan VI
DBScan-clustered graph (MinPts = 2, Eps = 0.5)
DBScan VII
Original graph
DBScan VIII
DBScan-clustered graph (MinPts = 3, Eps = 0.5)
DBScan IX
Properties
- Cluster assignment of border points is order-dependent.
- Unlike k-means, one does not have to specify the number of clusters a priori.
- But one has to set MinPts and Eps.
- Ester et al. report that for 2D examples MinPts = 4 is sufficient for good results.
- They determine Eps by visual inspection of a k-distance plot (see the sketch below).
Transfer question: How to kernelise DBScan?
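Code sketch (Python): the k-distance values that Ester et al. inspect to choose Eps; plotting these sorted values and looking for a "knee" is the heuristic mentioned above. The function name is an assumption of this sketch.

import numpy as np

def k_distances(X, k=4):
    # Distance of every point to its k-th nearest neighbour, sorted in descending
    # order; a knee in this curve suggests a value for Eps
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    kth = np.sort(D, axis=1)[:, k]    # column 0 is the distance of a point to itself
    return np.sort(kth)[::-1]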
Hierarchical Clustering
Extension of the original setting
- What if clusters contain clusters themselves?
- Then we need hierarchical clustering!
Hierarchical Clustering
Join most similar clusters
- Iteratively join the two most similar clusters.
- But how to measure similarity between clusters?
Similarity of clusters

Single Link:   S(C_i, C_j) = \min_{x \in C_i, \, x' \in C_j} d(x, x')

Average Link:  S(C_i, C_j) = \frac{1}{|C_i| \, |C_j|} \sum_{x \in C_i, \, x' \in C_j} d(x, x')

Maximum Link:  S(C_i, C_j) = \max_{x \in C_i, \, x' \in C_j} d(x, x')
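Code sketch (Python): the three linkage functions and a naive agglomerative loop that repeatedly joins the pair of clusters with the smallest linkage value (since S is defined via distances, smaller means more similar). The function names and the stopping criterion are assumptions of this sketch.

import numpy as np

def single_link(Ci, Cj):
    # S(Ci, Cj): smallest pairwise distance between the two clusters
    return min(np.linalg.norm(x - y) for x in Ci for y in Cj)

def average_link(Ci, Cj):
    # S(Ci, Cj): average pairwise distance between the two clusters
    return sum(np.linalg.norm(x - y) for x in Ci for y in Cj) / (len(Ci) * len(Cj))

def maximum_link(Ci, Cj):
    # S(Ci, Cj): largest pairwise distance between the two clusters
    return max(np.linalg.norm(x - y) for x in Ci for y in Cj)

def agglomerative(points, linkage=single_link, n_clusters=2):
    # Start with every point in its own cluster, then iteratively join the
    # two most similar clusters until n_clusters remain
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        a, b = min(pairs, key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters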
The end