

SLIDE 1

.: Data Mining in Bioinformatics, Page 1

Data Mining in Bioinformatics Day 2: Clustering

Karsten Borgwardt, February 21 to March 4, 2011
Machine Learning & Computational Biology Research Group, MPIs Tübingen

SLIDE 2

What is clustering?

Clustering: class discovery
  • Given a set of objects, group them into clusters (classes that are unknown beforehand)
  • An instance of unsupervised learning (no training dataset)

In practice
  • Cluster images to find categories
  • Cluster patient data to find disease subtypes
  • Cluster persons in social networks to detect communities

SLIDE 3

What is clustering?

Supervised versus unsupervised learning
  • General inference problem: given xi, predict yi by learning a function f
  • Training set: set of examples (xi, yi) where yi = f(xi) (but f is still unknown!)
  • Test set: new set of data points xi where yi is unknown
  • Supervised: use training data to infer your model, then apply this model to the test data
  • Unsupervised: no training data; learn the model and apply it directly to the test data

SLIDE 4

K-means

Objective: partition the dataset into k clusters such that intra-cluster variance is minimised:

V(D) = ∑_{i=1}^{k} ∑_{xj∈Si} (xj − µi)²  (1)

where V is the variance, Si is a cluster, µi is its mean, and D is the dataset of all points xj.

SLIDE 5

K-means

Lloyd's algorithm
  1. Partition the data into k initial clusters
  2. Compute the mean of each cluster
  3. Assign each point to the cluster whose mean is closest to the point
  4. If any point changed its cluster membership: repeat from step 2
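The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: it uses a deterministic round-robin initial partition so the example is reproducible, and it does not handle clusters that become empty during the iteration.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100):
    # Step 1: an initial partition (round-robin, so the example is reproducible)
    labels = np.arange(len(X)) % k
    means = None
    for _ in range(n_iter):
        # Step 2: compute the mean of each cluster
        # (a robust version must handle clusters that become empty)
        means = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        # Step 3: assign each point to the cluster whose mean is closest
        dists = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 4: repeat until no point changes its cluster membership
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, means

# two well-separated groups of three points each
X = np.array([[0.1, 0.1], [0.2, 0.2], [0.15, 0.1],
              [0.9, 0.9], [0.8, 0.85], [0.85, 0.9]])
labels, means = lloyd_kmeans(X, k=2)
```

On this data the algorithm converges after one reassignment and recovers the two groups.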

SLIDE 6

K-means

Example: before clustering

[Figure: scatter plot of the raw data points; both axes run from 0.1 to 1]

SLIDE 7

K-means

Example: after clustering (k=2)

[Figure: the same points after clustering with k = 2; both axes run from 0.1 to 1]

SLIDE 8

K-means

Things to note
  • k-means is still the state-of-the-art method for most clustering tasks
  • When proposing a new clustering method, one should always compare to k-means
  • Lloyd's algorithm has several drawbacks:
    it is order-dependent,
    its result depends on the initialisation of the clusters,
    and its result may be a local optimum, not the globally optimal solution.

SLIDE 9

K-centroid

‘Brother’ of k-means: don't use the mean of each cluster but the medoid. The medoid is the cluster point closest to the mean:

mi = argmin_{xj∈Si} ||xj − µi||²  (2)

One thereby restricts the cluster ‘means’ to points that are present in the dataset, and minimises variance only with respect to these points.
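Equation (2) can be sketched directly. Note that this follows the slide's definition (the cluster point closest to the mean), not the total-pairwise-distance medoid used in some other texts.

```python
import numpy as np

def medoid(S):
    """Return the point of S closest to the mean of S (equation (2))."""
    mu = S.mean(axis=0)
    d = ((S - mu) ** 2).sum(axis=1)   # squared distance of each point to the mean
    return S[d.argmin()]

S = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
m = medoid(S)   # mean is (1/3, 1/3); the closest cluster point is (0, 0)
```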

SLIDE 10

Kernel k-means

Kernelised k-means?
  • It would be attractive to perform clustering using kernels:
    we can move the clustering problem to different feature spaces,
    and we can cluster string and graph data
  • But we have to be able to perform all steps of k-means using kernels!

SLIDE 11

Kernel k-means

Kernelised k-means: the key step in k-means is to compute the distance between one data point x1 and the mean of a cluster of points x2, . . . , xm:

||φ(x1) − (1/(m−1)) ∑_{j=2}^{m} φ(xj)||² = k(x1, x1) − (2/(m−1)) ∑_{j=2}^{m} k(x1, xj) + (1/(m−1)²) ∑_{i=2}^{m} ∑_{j=2}^{m} k(xi, xj)  (3)

This result is based on the fact that every kernel k induces a distance d:

d(xi, xj)² = ||φ(xi) − φ(xj)||² = k(xi, xi) − 2 k(xi, xj) + k(xj, xj)
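Equation (3) can be checked numerically. The hypothetical helper below computes the squared feature-space distance from kernel evaluations alone; for the linear kernel k(x, y) = ⟨x, y⟩, where φ is the identity, the result must agree with the explicit computation in input space.

```python
import numpy as np

def dist_to_cluster_mean_sq(x1, cluster, k):
    # squared feature-space distance from x1 to the mean of the cluster
    # points, computed from kernel evaluations only (equation (3));
    # len(cluster) plays the role of m - 1 in the slide's notation
    m1 = len(cluster)
    cross = sum(k(x1, xj) for xj in cluster)
    intra = sum(k(xi, xj) for xi in cluster for xj in cluster)
    return k(x1, x1) - 2.0 / m1 * cross + intra / m1 ** 2

# sanity check with the linear kernel, for which phi is the identity
lin = lambda x, y: float(x @ y)
x1 = np.array([1.0, 0.0])
cluster = [np.array([0.0, 1.0]), np.array([2.0, 1.0])]
d2 = dist_to_cluster_mean_sq(x1, cluster, lin)
mu = np.mean(cluster, axis=0)          # explicit cluster mean in input space
explicit = float(((x1 - mu) ** 2).sum())
```

For string or graph kernels the same helper works unchanged, since it never touches φ explicitly.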

SLIDE 12

Graph-based clustering I

Data representation
  • dataset D is given in terms of a graph G = (V, E)
  • a data object vi is a node in G
  • edge e(i, j) from node vi to node vj has weight w(i, j)

Graph-based clustering
  • Define a threshold θ
  • Remove all edges e(i, j) from G with weight w(i, j) < θ
  • Each connected component of the graph now corresponds to one cluster
  • Two nodes are in the same connected component if there is a path between them
  • Graph components can be found by depth-first search in O(|V| + |E|)
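Treating edge weights as similarities (as on the "But how to get the graph in the first place?" slide), the procedure can be sketched as follows: drop edges below θ, then extract connected components with an iterative depth-first search.

```python
from collections import defaultdict

def graph_cluster(n, weighted_edges, theta):
    """Threshold the graph at theta, then return connected components
    (the clusters) found by iterative depth-first search."""
    adj = defaultdict(list)
    for i, j, w in weighted_edges:
        if w >= theta:              # keep only sufficiently similar pairs
            adj[i].append(j)
            adj[j].append(i)
    seen, clusters = set(), []
    for start in range(n):
        if start in seen:
            continue
        comp, stack = [], [start]
        seen.add(start)
        while stack:                # DFS over one component; O(|V| + |E|) overall
            v = stack.pop()
            comp.append(v)
            for u in adj[v]:
                if u not in seen:
                    seen.add(u)
                    stack.append(u)
        clusters.append(sorted(comp))
    return clusters

edges = [(0, 1, 0.9), (1, 2, 0.8), (2, 3, 0.2), (3, 4, 0.7)]
clusters = graph_cluster(5, edges, theta=0.5)   # edge (2, 3) is removed
```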

SLIDE 13

Graph-based clustering II

Original graph

SLIDE 14

Graph-based clustering III

Thresholded graph (θ = 0.5)

SLIDE 15

Graph-based clustering IV

But how to get the graph in the first place?
  • Think of the weights as a similarity measure.
  • If two nodes are not connected, then their similarity measure is 0.
  • Graph-based clustering creates clusters of similar objects.
  • For any object vi in a cluster, there is a second object vj such that similarity(vi, vj) is larger than θ.

SLIDE 16

DBScan I

Noise-robust graph-based clustering
  • Graph-based clustering can suffer from the fact that one noisy edge connects two clusters
  • DBScan (Ester et al., 1996) is a noise-robust extension of graph-based clustering
  • DBScan is short for Density-Based Spatial Clustering of Applications with Noise

Core object
  • Two objects vi and vj with distance d(vi, vj) < ε belong to the same cluster if vi or vj is a core object.
  • vi is a core object iff there are at least MinPts points within a distance of ε from vi.
  • A cluster is defined by iteratively checking this core-object property.

SLIDE 17

DBScan II

Code: Main

DBSCAN(SetOfPoints, Eps, MinPts)
  // SetOfPoints is UNCLASSIFIED
  ClusterId := nextId(NOISE);
  for i FROM 1 TO SetOfPoints.size do
    Point := SetOfPoints.get(i);
    if Point.ClId = UNCLASSIFIED then
      if ExpandCluster(SetOfPoints, Point, ClusterId, Eps, MinPts) then
        ClusterId := nextId(ClusterId)
      end if
    end if
  end for

SLIDE 18

DBScan III

Code: ExpandCluster

ExpandCluster(SetOfPoints, Point, ClId, Eps, MinPts) : Boolean;
  seeds := SetOfPoints.regionQuery(Point, Eps);
  if seeds.size < MinPts then
    SetOfPoints.changeClId(Point, NOISE);
    RETURN False;
  else
    SetOfPoints.changeClIds(seeds, ClId);
    seeds.delete(Point);
    while seeds <> Empty
      currentP := seeds.first();
      result := SetOfPoints.regionQuery(currentP, Eps);

SLIDE 19

DBScan IV

      if result.size >= MinPts then
        for i FROM 1 TO result.size do
          resultP := result.get(i);
          if resultP.ClId IN (UNCLASSIFIED, NOISE) then
            if resultP.ClId = UNCLASSIFIED then
              seeds.append(resultP);
            end if
            SetOfPoints.changeClId(resultP, ClId);
          end if // UNCLASSIFIED or NOISE
        end for;
      end if; // result.size >= MinPts
      seeds.delete(currentP);
    end while; // seeds <> Empty
    RETURN True;
  end if
end // ExpandCluster
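The pseudocode of the last three slides condenses to the following sketch. It uses a simplified interface rather than Ester et al.'s original one: labels are cluster ids starting at 0, with -1 for NOISE and -2 for UNCLASSIFIED.

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Compact DBSCAN sketch: cluster ids start at 0, noise gets -1."""
    n = len(X)
    labels = np.full(n, -2)              # -2 = UNCLASSIFIED, -1 = NOISE

    def region_query(p):
        # indices of all points within distance eps of point p (incl. p itself)
        return np.flatnonzero(((X - X[p]) ** 2).sum(axis=1) < eps ** 2)

    cluster_id = 0
    for p in range(n):
        if labels[p] != -2:
            continue
        seeds = list(region_query(p))
        if len(seeds) < min_pts:         # p is not a core object
            labels[p] = -1
            continue
        labels[seeds] = cluster_id       # expand a new cluster from p
        while seeds:
            q = seeds.pop()
            result = region_query(q)
            if len(result) >= min_pts:   # q is a core object, too
                for r in result:
                    if labels[r] in (-2, -1):
                        if labels[r] == -2:
                            seeds.append(r)
                        labels[r] = cluster_id
        cluster_id += 1
    return labels

# two dense groups of three points each, plus one far-away noise point
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [1.0, 1.0], [1.1, 1.0], [1.0, 1.1],
              [5.0, 5.0]])
labels = dbscan(X, eps=0.3, min_pts=3)
```

The isolated point has fewer than MinPts neighbours within ε, so it is marked as noise rather than forced into a cluster.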

SLIDE 20

DBScan V

Original graph

SLIDE 21

DBScan VI

DBScan-clustered graph (MinPts = 2, Eps = 0.5)

SLIDE 22

DBScan VII

Original graph

SLIDE 23

DBScan VIII

DBScan-clustered graph (MinPts = 3, Eps = 0.5)

SLIDE 24

DBScan IX

Properties
  • Cluster assignment of border points is order-dependent
  • Unlike k-means, one does not have to specify the number of clusters a priori
  • But one has to set MinPts and Eps
  • Ester et al. report that for 2D examples MinPts = 4 is sufficient for good results
  • They determine Eps by visual inspection of a k-distance plot

Transfer question: How to kernelise DBScan?

SLIDE 25

Hierarchical Clustering

Extension of the original setting
  • What if clusters contain clusters themselves?
  • Then we need hierarchical clustering!

SLIDE 26

Hierarchical Clustering

Join most similar clusters
  • Iteratively join the two most similar clusters
  • But how to measure similarity between clusters?

Similarity of clusters
  • Single Link: S(Ci, Cj) = min_{x∈Ci, x′∈Cj} d(x, x′)
  • Average Link: S(Ci, Cj) = (1 / (|Ci| |Cj|)) ∑_{x∈Ci, x′∈Cj} d(x, x′)
  • Maximum Link: S(Ci, Cj) = max_{x∈Ci, x′∈Cj} d(x, x′)
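The three linkage criteria differ only in how they aggregate the cross-cluster pairwise distances. A minimal sketch, using a hypothetical helper `linkage_similarity` and assuming Euclidean distance for d:

```python
import numpy as np

def linkage_similarity(Ci, Cj, mode="single"):
    """Cluster 'similarity' from the slide: single = min, average = mean,
    maximum = max over all pairwise distances between Ci and Cj."""
    # all pairwise Euclidean distances between points of Ci and points of Cj
    d = np.sqrt(((Ci[:, None, :] - Cj[None, :, :]) ** 2).sum(axis=2))
    return {"single": d.min(), "average": d.mean(), "maximum": d.max()}[mode]

Ci = np.array([[0.0, 0.0], [0.0, 1.0]])
Cj = np.array([[3.0, 0.0], [4.0, 0.0]])
single = linkage_similarity(Ci, Cj, "single")    # closest cross pair
maximum = linkage_similarity(Ci, Cj, "maximum")  # farthest cross pair
```

An agglomerative algorithm would repeatedly evaluate this score for all cluster pairs and merge the pair with the smallest value.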

SLIDE 27

The end

See you tomorrow! Next topic: Feature Selection