Data Mining in Bioinformatics Day 7: Clustering in Bioinformatics

Data Mining in Bioinformatics Day 7: Clustering in Bioinformatics. Karsten Borgwardt, February 21 to March 4, 2011, Machine Learning & Computational Biology Research Group, MPIs Tübingen

SLIDE 1

Karsten Borgwardt: Data Mining in Bioinformatics, Page 1

Data Mining in Bioinformatics Day 7: Clustering in Bioinformatics

Karsten Borgwardt February 21 to March 4, 2011 Machine Learning & Computational Biology Research Group MPIs Tübingen

SLIDE 2

Clustering in bioinformatics

Microarrays
Clustering is a widely used tool in microarray analysis.
Class discovery is an important problem in microarray studies for two reasons:
  either the classes are completely unknown beforehand,
  or it is unknown whether a known class contains interesting subclasses.

SLIDE 3

Clustering in bioinformatics

Examples
Classes unknown:
  Does a disease affect gene expression in a particular tissue?
  Does gene expression differ between two groups in a particular condition?
Subclasses unknown:
  Are there subtypes of a disease?
  Is there even a hierarchy of subclasses within one disease?

SLIDE 4

Clustering in bioinformatics

Popularity
Clustering tools are available in the large microarray database NCBI Gene Expression Omnibus (GEO): http://www.ncbi.nlm.nih.gov/geo/
3002 PubMed hits for 'microarray clustering'
Recent editorial of OUP Bioinformatics

SLIDE 5

Distance metrics

Euclidean distance
Euclidean distance of gene x and y of n samples, or sample x and y of n genes:

$$ d_{xy} = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \tag{1} $$

Pearson's Correlation
Pearson correlation of gene x and y of n samples, or sample x and y of n genes, where $\bar{x}$ is the mean of x and $\bar{y}$ is the mean of y:

$$ r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}} \tag{2} $$
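The two measures in equations (1) and (2) can be sketched in a few lines of plain Python. This is only an illustration: the function names and the two toy expression profiles are assumptions, not part of the lecture.

```python
import math


def euclidean_distance(x, y):
    """Equation (1): square root of the summed squared differences."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def pearson_correlation(x, y):
    """Equation (2): centered cross-product over the product of centered norms."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    numerator = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    denominator = math.sqrt(
        sum((xi - x_bar) ** 2 for xi in x) * sum((yi - y_bar) ** 2 for yi in y)
    )
    return numerator / denominator


# Hypothetical expression profiles of two genes across four samples.
gene_x = [1.0, 2.0, 3.0, 4.0]
gene_y = [2.0, 4.0, 6.0, 8.0]

d = euclidean_distance(gene_x, gene_y)
r = pearson_correlation(gene_x, gene_y)
print(d, r)
```

The two profiles differ only by a scale factor, so they are far apart in Euclidean distance but perfectly correlated. This is one reason correlation-based measures are often preferred when only the shape of an expression profile matters.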

SLIDE 6

Distance metrics

Un-centered correlation coefficient
Un-centered correlation coefficient of gene x and y of n samples, or sample x and y of n genes:

$$ r^{u}_{xy} = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2 \, \sum_{i=1}^{n} y_i^2}} \tag{3} $$
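A minimal sketch of equation (3) follows. The function name and the example vectors are illustrative assumptions. Unlike Pearson's correlation, the un-centered coefficient does not subtract the means, so a constant offset changes its value.

```python
import math


def uncentered_correlation(x, y):
    """Equation (3): cross-product over the product of (un-centered) norms."""
    numerator = sum(xi * yi for xi, yi in zip(x, y))
    denominator = math.sqrt(sum(xi ** 2 for xi in x) * sum(yi ** 2 for yi in y))
    return numerator / denominator


# Hypothetical profiles: 'shifted' is 'profile' plus a constant offset of 10,
# 'scaled' is 'profile' multiplied by 2.
profile = [1.0, 2.0, 3.0]
shifted = [11.0, 12.0, 13.0]
scaled = [2.0, 4.0, 6.0]

r_u_shifted = uncentered_correlation(profile, shifted)
r_u_scaled = uncentered_correlation(profile, scaled)
print(r_u_shifted, r_u_scaled)
```

Scaling leaves the coefficient at 1, while the offset pulls it below 1, even though Pearson's correlation would be 1 in both cases.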

SLIDE 7

Clustering algorithms

Hierarchical Clustering
  Single linkage: The linking distance is the minimum distance between two clusters.
  Complete linkage: The linking distance is the maximum distance between two clusters.
  Average linkage/UPGMA: The linking distance is the average of all pairwise distances between members of the two clusters. Since all genes and samples carry equal weight, the linkage is an Unweighted Pair Group Method with Arithmetic Means (UPGMA).
'Flat' Clustering
  k-means (k from 2 to 15, 3 runs)
  k-median (k-medoid)
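The three linking distances differ only in how they summarize the pairwise distances between two clusters. The sketch below assumes Euclidean distance between points; the function names and the two small clusters are hypothetical.

```python
import math
from itertools import product


def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def linkage_distance(cluster_a, cluster_b, method="average"):
    """Linking distance between two clusters of points."""
    pairwise = [euclidean(a, b) for a, b in product(cluster_a, cluster_b)]
    if method == "single":      # minimum pairwise distance
        return min(pairwise)
    if method == "complete":    # maximum pairwise distance
        return max(pairwise)
    if method == "average":     # unweighted mean of all pairwise distances (UPGMA)
        return sum(pairwise) / len(pairwise)
    raise ValueError("unknown linkage method: " + method)


# Two hypothetical clusters of 2-dimensional points.
cluster_a = [(0.0, 0.0), (0.0, 1.0)]
cluster_b = [(3.0, 0.0), (4.0, 0.0)]

single = linkage_distance(cluster_a, cluster_b, "single")
complete = linkage_distance(cluster_a, cluster_b, "complete")
average = linkage_distance(cluster_a, cluster_b, "average")
print(single, complete, average)
```

An agglomerative procedure would repeatedly merge the pair of clusters with the smallest linking distance. Only the distance computation is shown here.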

SLIDE 8

The two-sample problem

Interpretation of clusters
Clustering introduces 'structure' into microarray datasets.
But is there a statistical or biomedical meaning of these classes?
  Biomedical meaning has to be established in experiments.
  'Statistical meaning' can be measured using a statistical test, a so-called two-sample test.
  A two-sample test decides whether two samples were drawn from the same probability distribution or not.

SLIDE 9

The two-sample problem

Data diversity
Molecular biology produces a wealth of information.
The problem is that these data are generated
  on different platforms and
  by different protocols
  under different levels of noise.
Hence data from different labs show
  different scales
  different ranges
  different distributions
Main problem: Joint data analysis may detect differences in distributions, not biological phenomena!

SLIDE 10

The two-sample problem

The two-sample problem
Given two samples X and Y. Were they generated by the same distribution?
Previous approaches
  two-sample tests exist for univariate and multivariate data

SLIDE 11

The two-sample problem

t-test
A test of the null hypothesis that the means of two normally distributed populations are equal
unpaired/independent (versus paired)
For equal sample sizes and equal variances, the t statistic to test whether the means are different can be calculated as follows:

$$ t = \frac{\bar{x} - \bar{y}}{\sigma_{xy} \cdot \sqrt{2/n}} \tag{4} $$

where $\sigma_{xy} = \sqrt{(\sigma_x^2 + \sigma_y^2)/2}$. The degrees of freedom for this test are 2n − 2, where n is the size of each sample.
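As a sketch of the equal-size, equal-variance t statistic with 2n − 2 degrees of freedom, the code below estimates each variance with the unbiased sample variance (dividing by n − 1). That choice, the function names, and the example data are assumptions, since the slide does not say how the variances are estimated.

```python
import math


def sample_variance(values):
    """Unbiased sample variance (assumed estimator for sigma^2)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)


def equal_size_t_statistic(x, y):
    """t statistic and degrees of freedom for two samples of equal size n."""
    if len(x) != len(y):
        raise ValueError("this form of the test assumes equal sample sizes")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    pooled_sd = math.sqrt((sample_variance(x) + sample_variance(y)) / 2.0)
    t = (mean_x - mean_y) / (pooled_sd * math.sqrt(2.0 / n))
    return t, 2 * n - 2


# Hypothetical expression values of one gene in two groups of four samples.
group_x = [5.0, 6.0, 7.0, 8.0]
group_y = [1.0, 2.0, 3.0, 4.0]

t, dof = equal_size_t_statistic(group_x, group_y)
print(t, dof)
```

The statistic would then be compared against a t distribution with the returned degrees of freedom. That lookup is omitted here to keep the sketch free of external libraries.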

SLIDE 12

The two-sample problem

New challenges in bioinformatics
  high-dimensional
  structured (strings and graphs)
  low sample size
Novel distribution test: Maximum Mean Discrepancy (MMD)

SLIDE 13

MMD key idea


SLIDE 14

MMD key idea

Key Idea
Avoid density estimators; use means in feature spaces.
Maximum Mean Discrepancy (Fortet and Mourier, 1953)

$$ D(p, q, F) := \sup_{f \in F} \left( E_p[f(x)] - E_q[f(y)] \right) $$

Theorem

D(p, q, F) = 0 iff p = q, when F = C0(X).

Follows directly, e.g. from Dudley, 1984.

Theorem

D(p, q, F) = 0 iff p = q, when $F = \{ f \mid \|f\|_H \le 1 \}$,

provided that H is a universal RKHS (follows via Steinwart, 2001; Smola et al., 2006).

SLIDE 15

MMD statistic

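Goal: Estimate D(p, q, F). When F is the unit ball of an RKHS with kernel k, the squared discrepancy can be written in terms of kernel expectations:

$$ D^2(p, q, F) = E_{p,p}\, k(x, x') - 2\, E_{p,q}\, k(x, y) + E_{q,q}\, k(y, y') $$

U-Statistic: Empirical estimate from samples X and Y of size m each

$$ D^2(X, Y, F) = \frac{1}{m(m-1)} \sum_{i \neq j} \left[ k(x_i, x_j) - k(x_i, y_j) - k(y_i, x_j) + k(y_i, y_j) \right] $$

Theorem

$D^2(X, Y, F)$ is an unbiased estimator of $D^2(p, q, F)$.

Test
Estimate σ² from data.
Reject the null hypothesis that p = q if the empirical estimate exceeds the acceptance threshold.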
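The U-statistic on this slide can be evaluated directly with two nested loops, which also makes the quadratic runtime visible. The sketch below assumes a Gaussian kernel with a hypothetical bandwidth parameter and samples of equal size m. The function names and the toy data are illustrative, and the threshold computation of the actual test is not shown.

```python
import math


def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian (RBF) kernel; the kernel choice and bandwidth are assumptions."""
    squared_distance = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-squared_distance / (2.0 * bandwidth ** 2))


def mmd_u_statistic(xs, ys, kernel=gaussian_kernel):
    """Sum over i != j of k(xi,xj) - k(xi,yj) - k(yi,xj) + k(yi,yj), scaled by 1/(m(m-1))."""
    if len(xs) != len(ys):
        raise ValueError("this estimate assumes samples of equal size m")
    m = len(xs)
    total = 0.0
    for i in range(m):
        for j in range(m):
            if i == j:
                continue
            total += (
                kernel(xs[i], xs[j])
                - kernel(xs[i], ys[j])
                - kernel(ys[i], xs[j])
                + kernel(ys[i], ys[j])
            )
    return total / (m * (m - 1))


# Hypothetical one-dimensional samples: 'far' is 'near' shifted by 5.
near = [(0.1 * i,) for i in range(10)]
far = [(5.0 + 0.1 * i,) for i in range(10)]

mmd_same = mmd_u_statistic(near, near)
mmd_shifted = mmd_u_statistic(near, far)
print(mmd_same, mmd_shifted)
```

Identical samples give an estimate of zero, while the shifted sample gives a clearly positive value. Any other positive definite kernel could be passed in place of the Gaussian kernel.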

SLIDE 16

Attractive for bioinformatics

MMD two-sample test in terms of kernels
Computationally attractive
  search infinite space of functions by evaluating one expression
  no optimization problem has to be solved
All thanks to kernels!

SLIDE 17

Attractive for bioinformatics

Wide applicability
for one- and higher-dimensional vectorial data, but also for structured data!
two-sample problems can now be tackled on
  strings: protein and DNA sequences
  graphs: molecules, protein interaction networks
  time series: time series of microarray data
  and sets, trees, ...
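To make the step from vectors to strings concrete, here is a sketch of one well-known string kernel, the k-spectrum kernel, which counts shared substrings of length k. The lecture does not name a specific string kernel, so this choice, the function names, and the toy DNA sequences are all assumptions.

```python
from collections import Counter


def kmer_counts(sequence, k):
    """Count every contiguous substring of length k in the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def spectrum_kernel(seq_a, seq_b, k=3):
    """Inner product of the k-mer count vectors of two sequences."""
    counts_a = kmer_counts(seq_a, k)
    counts_b = kmer_counts(seq_b, k)
    return sum(count * counts_b[kmer] for kmer, count in counts_a.items())


# Hypothetical DNA sequences.
dna_1 = "ACGTACGT"
dna_2 = "ACGTTTTT"

k_self = spectrum_kernel(dna_1, dna_1)
k_cross = spectrum_kernel(dna_1, dna_2)
print(k_self, k_cross)
```

Because such a function only needs to return a similarity value for a pair of objects, it can be plugged into a kernel-based two-sample statistic in place of a vector kernel.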

SLIDE 18

Cross-platform comparability

Data
microarray data from two breast cancer studies
  one on cDNA platform (Gruvberger et al., 2001)
  other on oligonucleotide microarray platform (West et al., 2001)
Task
Can MMD help to find out if two sets of observations were generated by
  the same study (both from Gruvberger or both from West)?
  different studies (one Gruvberger, one West)?

SLIDE 19

Cross-platform comparability

Experiment
  sample size each: 25
  dimension of each datapoint: 2,116
  significance level: α = 0.05
  100 times: 1 sample from Gruvberger, 1 from West
  100 times: both from Gruvberger or both from West
  report percentage of correct decisions
  compare to t-test, Friedman-Rafsky Wald-Wolfowitz and Smirnov
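The evaluation protocol (100 draws from different studies, 100 draws from the same study, percentage of correct decisions) can be sketched as a small harness. Everything concrete below is an assumption made for illustration: the data are synthetic univariate Gaussians rather than the 2,116-dimensional microarray profiles, and the decision rule is a stand-in permutation test on the mean difference, not MMD or any of the tests compared in the lecture.

```python
import random


def mean(values):
    return sum(values) / len(values)


def rejects_same_distribution(a, b, rng, alpha=0.05, n_permutations=200):
    """Stand-in two-sample test: permutation test on the absolute mean difference."""
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        permuted = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if permuted >= observed:
            at_least_as_extreme += 1
    p_value = (at_least_as_extreme + 1) / (n_permutations + 1)
    return p_value <= alpha


def percent_correct(study_a, study_b, rng, sample_size=25, repetitions=100):
    """Return (% correct on different-study pairs, % correct on same-study pairs)."""
    correct_different = 0
    for _ in range(repetitions):
        a = rng.sample(study_a, sample_size)
        b = rng.sample(study_b, sample_size)
        if rejects_same_distribution(a, b, rng):   # correct decision: reject
            correct_different += 1
    correct_same = 0
    for _ in range(repetitions):
        source = rng.choice([study_a, study_b])
        drawn = rng.sample(source, 2 * sample_size)
        first, second = drawn[:sample_size], drawn[sample_size:]
        if not rejects_same_distribution(first, second, rng):  # correct: accept
            correct_same += 1
    return 100.0 * correct_different / repetitions, 100.0 * correct_same / repetitions


rng = random.Random(0)
# Synthetic stand-ins for the two studies (hypothetical data).
study_a = [rng.gauss(0.0, 1.0) for _ in range(200)]
study_b = [rng.gauss(3.0, 1.0) for _ in range(200)]

different_pct, same_pct = percent_correct(study_a, study_b, rng)
print(different_pct, same_pct)
```

Swapping in another two-sample test only requires replacing `rejects_same_distribution` with a function of the same shape.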

SLIDE 20

Cross-platform comparability


SLIDE 21

Kernel-based statistical test

novel statistical test for two-sample problem:
  easy to implement
  non-parametric
  first for structured data
  best on high-dimensional data
  quadratic runtime w.r.t. the number of data points
  impressive accuracy in our experiments
kernel method for two-sample problem:
  all kernels recently defined in molecular biology can be re-used for data integration
  applicable to vectors, strings, sets, trees, graphs and time series

SLIDE 22

Biclustering

Clustering in two dimensions
  alternative names: co-clustering, two-mode clustering
  A bicluster is a subset of genes that show similar activity patterns under a subset of conditions.
Clustering in 2 dimensions
  Cluster patients and conditions
Earliest work by Hartigan, 1972:
  Divide a matrix into submatrices with minimum variance.
  Most interesting cases are NP-complete.
  Many extensions in bioinformatics (e.g. Cheng and Church, 2002)
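The quantity behind the minimum-variance idea is the variance of the entries in a chosen submatrix. The sketch below computes it for a given subset of rows and columns. The function name and the small expression matrix are hypothetical, and the hard part, the search over row and column subsets, is deliberately left out.

```python
def submatrix_variance(matrix, rows, cols):
    """Variance of the entries selected by the given row and column indices."""
    values = [matrix[r][c] for r in rows for c in cols]
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


# Hypothetical expression matrix: rows are genes, columns are conditions.
expression = [
    [1.0, 1.0, 9.0],
    [1.0, 1.0, 2.0],
    [5.0, 7.0, 3.0],
]

# Genes 0 and 1 behave identically under conditions 0 and 1.
bicluster_variance = submatrix_variance(expression, rows=[0, 1], cols=[0, 1])
whole_variance = submatrix_variance(expression, rows=[0, 1, 2], cols=[0, 1, 2])
print(bicluster_variance, whole_variance)
```

A constant block such as the one above has zero variance, which is what makes it a candidate bicluster under this criterion.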

SLIDE 23

References and further reading

References

[1] Gretton, Borgwardt, Rasch, Schölkopf, Smola: A kernel method for the two-sample problem. NIPS 2006.

SLIDE 24

The end

See you tomorrow! Next topic: Feature Selection in Bioinformatics