Data Mining in Bioinformatics Days 6 and 7: The Need for Data Mining in Bioinformatics

Data Mining in Bioinformatics
Days 6 and 7: The Need for Data Mining in Bioinformatics
Karsten Borgwardt
February 10 to February 21, 2014
Machine Learning & Computational Biology Research Group
Max Planck Institute Tübingen and Eberhard Karls Universität Tübingen

The Need for Machine Learning in Computational Biology
High-throughput technologies:
◮ Genome and RNA sequencing
◮ Compound screening
◮ Genotyping chips
◮ Bioimaging
[Image: BGI Hong Kong, Tai Po Industrial Estate, Hong Kong]
Molecular databases are growing much faster than our knowledge of biological processes.
Karsten Borgwardt, Days 6 and 7: Data Mining in the Life Sciences

The Evolution of Bioinformatics
◮ Classic Bioinformatics: Focus on Molecules

Classic Bioinformatics: Focus on Molecules
◮ Large collections of molecular data
  ◮ Gene and protein sequences
  ◮ Genome sequence
  ◮ Protein structures
  ◮ Chemical compounds
◮ Focus: Inferring properties of molecules
  ◮ Predict the function of a gene given its sequence
  ◮ Predict the structure of a protein given its sequence
  ◮ Predict the boundaries of a gene given a genome segment
  ◮ Predict the function of a chemical compound given its molecular structure

Example: Predicting Function from Structure
◮ Structure-Activity Relationship
  Source: Joska TM and Anderson AC, Antimicrob. Agents Chemother. 2006;50:3435-3443
◮ Fundamental idea: Similarity in structure implies similarity in function

Measuring the Similarity of Graphs
◮ How similar are two graphs?
◮ How similar is their structure?
◮ How similar are their node labels and edge labels?

Graph Comparison
1. Graph isomorphism and subgraph isomorphism checking
  ◮ Exact match
  ◮ Exponential runtime
2. Graph edit distances
  ◮ Involves definition of a cost function
  ◮ Typically subgraph isomorphism as intermediate step
3. Topological descriptors
  ◮ Lose some of the structural information represented by the graph or
  ◮ Exponential runtime effort
4. Graph kernels (Gärtner et al., 2003; Kashima et al., 2003)
  ◮ Goal 1: Polynomial runtime in the number of nodes
  ◮ Goal 2: Applicable to large graphs
  ◮ Goal 3: Applicable to graphs with attributes

Graph Kernels I
◮ Kernels
  ◮ Key concept: Move problem to feature space $\mathcal{H}$.
  ◮ Naive explicit approach:
    ◮ Map objects $x$ and $x'$ via mapping $\phi$ to $\mathcal{H}$.
    ◮ Measure their similarity in $\mathcal{H}$ as $\langle \phi(x), \phi(x') \rangle$.
  ◮ Kernel trick: Compute inner product in $\mathcal{H}$ as kernel in input space: $k(x, x') = \langle \phi(x), \phi(x') \rangle$.
[Figure: mapping from $\mathbb{R}^2$ to $\mathcal{H}$]
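The kernel trick can be made concrete with a small sketch. The example below is not from the slides: it assumes the homogeneous polynomial kernel of degree 2 on $\mathbb{R}^2$, whose explicit feature map is $\phi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$, and checks that the inner product in feature space equals the kernel computed in input space. The function names `phi`, `inner` and `poly2_kernel` are illustrative.

```python
import math


def phi(x):
    """Explicit feature map of the degree-2 polynomial kernel on R^2 (assumed example)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)


def inner(u, v):
    """Standard inner product of two equally long vectors."""
    return sum(a * b for a, b in zip(u, v))


def poly2_kernel(x, y):
    """Kernel evaluated in input space: k(x, y) = <x, y>^2."""
    return inner(x, y) ** 2


x, y = (1.0, 2.0), (3.0, -1.0)
explicit = inner(phi(x), phi(y))   # map to feature space first, then compare
implicit = poly2_kernel(x, y)      # never leaves the input space
print(explicit, implicit)
```

Both values agree up to floating-point rounding, but the second never constructs $\phi(x)$. That saving matters once the feature space is very large or infinite-dimensional.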

Graph Kernels II
◮ Graph kernels
  ◮ Kernels on pairs of graphs (not pairs of nodes)
  ◮ Instance of R-Convolution kernels (Haussler, 1999):
    ◮ Decompose objects $x$ and $x'$ into substructures.
    ◮ Pairwise comparison of substructures via kernels to compare $x$ and $x'$.
  ◮ A graph kernel makes the whole family of kernel methods applicable to graphs.
[Figure: two example graphs G and G']
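As a minimal sketch of the decompose-and-compare idea, the code below treats each graph as the bag of its node labels and compares all pairs of parts with a delta kernel. This is only the simplest instance of the scheme (a node-label histogram kernel). The label lists and the names `r_convolution_kernel` and `delta` are hypothetical and not taken from the slides.

```python
def delta(a, b):
    """Base kernel on substructures: 1 if the two parts are identical, else 0."""
    return 1 if a == b else 0


def r_convolution_kernel(parts_x, parts_y, base_kernel):
    """Sum the base kernel over all pairs of parts of the two objects."""
    parts_x, parts_y = list(parts_x), list(parts_y)
    return sum(base_kernel(p, q) for p in parts_x for q in parts_y)


# Hypothetical decomposition: each graph is reduced to the labels of its nodes.
labels_g = [1, 1, 2, 3, 4, 5]
labels_g_prime = [1, 2, 2, 3, 4, 5]

k = r_convolution_kernel(labels_g, labels_g_prime, delta)
print(k)
```

Richer graph kernels keep this outer structure and change only the parts (walks, paths, subtrees, graphlets) and the base kernel.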

Weisfeiler-Lehman Kernel (Shervashidze and Borgwardt, NIPS 2009)
[Figure: 1st iteration of the Weisfeiler-Lehman relabeling on two graphs G and G'. Panels: (a) given labeled graphs G and G'; (b) result of steps 1 and 2: multiset-label determination and sorting; (c) result of step 3: label compression; (d) result of step 4: relabeling.]
Label compression table from panel (c):
1,4 → 6       3,245 → 10
2,3 → 7       4,1135 → 11
2,35 → 8      4,1235 → 12
2,45 → 9      5,234 → 13

Weisfeiler-Lehman Kernel (Shervashidze and Borgwardt, NIPS 2009)
End of the 1st iteration: feature vector representations of G and G'
$\phi^{(1)}_{WLsubtree}(G) = (2, 1, 1, 1, 1, 2, 0, 1, 0, 1, 1, 0, 1)$
$\phi^{(1)}_{WLsubtree}(G') = (1, 2, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1)$
The first five entries are counts of original node labels; the remaining eight are counts of compressed node labels.
$k^{(1)}_{WLsubtree}(G, G') = \langle \phi^{(1)}_{WLsubtree}(G), \phi^{(1)}_{WLsubtree}(G') \rangle = 11.$
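The kernel value can be checked by hand or with a few lines of code. The sketch below copies the two feature vectors from the slide and splits the inner product into the contribution of the original labels (first five entries) and that of the compressed labels (last eight). The variable names are illustrative.

```python
# Feature vectors copied from the slide (labels 1..5 original, 6..13 compressed).
phi_g = [2, 1, 1, 1, 1, 2, 0, 1, 0, 1, 1, 0, 1]
phi_g_prime = [1, 2, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1]

products = [a * b for a, b in zip(phi_g, phi_g_prime)]
k_original = sum(products[:5])     # matches on the original node labels
k_compressed = sum(products[5:])   # matches on the compressed labels
k_total = k_original + k_compressed
print(k_original, k_compressed, k_total)  # 7 4 11
```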

Subtree-like Patterns
[Figure: subtree-like patterns in a graph with numbered nodes]

Weisfeiler-Lehman Kernel: Theoretical Runtime Properties
◮ Fast Weisfeiler-Lehman kernel (NIPS 2009 and JMLR 2011)
◮ Algorithm: Repeat the following steps $h$ times
  1. Sort: Represent each node $v$ as sorted list $L_v$ of its neighbors ($O(m)$)
  2. Compress: Compress this list into a hash value $h(L_v)$ ($O(m)$)
  3. Relabel: Relabel $v$ by the hash value $h(L_v)$ ($O(n)$)
◮ Runtime analysis ($n$: number of nodes, $m$: number of edges, $N$: number of graphs)
  ◮ per graph pair: Runtime $O(mh)$
  ◮ for $N$ graphs: Runtime $O(Nmh + N^2 nh)$ (naively $O(N^2 mh)$)
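The sort, compress and relabel loop can be sketched in a few lines. The code below is an illustrative implementation, not the authors' reference code. It compresses labels with a dictionary shared by all graphs instead of a hash function, and it uses comparison sorting, so it does not claim the runtime bounds stated above. The two example graphs are an assumption: their node names and edge lists were reconstructed from the multiset labels shown in the first-iteration figure.

```python
from collections import Counter


def build_adjacency(nodes, edges):
    """Undirected adjacency lists for the given nodes."""
    adj = {v: [] for v in nodes}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    return adj


def wl_subtree_features(graphs, h):
    """Label counts after h Weisfeiler-Lehman iterations.

    graphs: list of (labels, edges); labels maps node -> integer label.
    Returns one Counter per graph over original and compressed labels.
    """
    adjs = [build_adjacency(labels, edges) for labels, edges in graphs]
    current = [dict(labels) for labels, _ in graphs]
    feats = [Counter(lab.values()) for lab in current]
    next_label = max(l for lab in current for l in lab.values()) + 1
    for _ in range(h):
        # Sort: own label plus the sorted labels of the neighbours.
        sigs = [
            {v: (lab[v], tuple(sorted(lab[u] for u in adj[v]))) for v in adj}
            for lab, adj in zip(current, adjs)
        ]
        # Compress: one lookup table shared by all graphs.
        table = {}
        for sig in sorted({s for g in sigs for s in g.values()}):
            table[sig] = next_label
            next_label += 1
        # Relabel: every node takes its compressed label.
        current = [{v: table[s] for v, s in g.items()} for g in sigs]
        for f, lab in zip(feats, current):
            f.update(lab.values())
    return feats


def wl_subtree_kernel(f1, f2):
    """Inner product of two label-count vectors."""
    return sum(count * f2[label] for label, count in f1.items())


# Example graphs reconstructed from the figure (node names are hypothetical).
G = ({"a": 1, "b": 1, "c": 4, "d": 3, "e": 2, "f": 5},
     [("a", "c"), ("b", "c"), ("c", "d"), ("c", "f"),
      ("d", "e"), ("d", "f"), ("e", "f")])
G_prime = ({"a": 1, "b": 2, "c": 4, "d": 3, "e": 2, "f": 5},
           [("a", "c"), ("b", "d"), ("c", "d"), ("c", "e"),
            ("c", "f"), ("d", "f"), ("e", "f")])

f_g, f_gp = wl_subtree_features([G, G_prime], h=1)
print(wl_subtree_kernel(f_g, f_gp))
```

Computing the features of all graphs in one pass with a shared compression table is what lets the kernel matrix be filled with plain inner products afterwards. Relabeling each pair of graphs separately would repeat the expensive step once per pair.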

Weisfeiler-Lehman Kernel: Empirical Runtime Properties
[Figure: four panels showing runtime in seconds versus graph size n, number of graphs N, subtree height h, and graph density c; legend: pairwise, global.]

Weisfeiler-Lehman Kernel: Runtime and Accuracy
[Figure: runtime (10 sec to 1000 days) and accuracy (50 % to 85 %) of the WL, RG, 3 Graphlet, RW and SP kernels on the MUTAG, NCI1, NCI109 and D&D data sets; axis label: graph size.]

The Evolution of Bioinformatics
◮ Modern Bioinformatics: Focus on Individuals

Modern Bioinformatics: Focus on Individuals
◮ High-throughput technologies now enable the collection of molecular information on individuals
  ◮ Microarrays to measure gene expression levels
  ◮ Chips to determine the genotype of an individual
  ◮ Sequencing to determine the genome sequence of an individual

Phenotype Prediction
◮ Goal: Predict breast cancer outcome from gene expression levels
◮ Current results are not satisfying in terms of stability and prediction performance
Source: Venet et al., PLoS Comp Bio 2011

Phenotype Prediction
Nature News, March 2009
◮ 'Genetic test predicts eye color in Dutch men with 90% accuracy' (Liu et al., Current Biology 2009)
◮ Special setting: Candidate genes were already known beforehand
◮ Other phenotypes: Large genetics consortia try to detect candidate genes (e.g. diabetes, autism, depression, drug response, plant growth)

Genetics: Association Studies
◮ Genome-Wide Association Studies (GWAS)
  [Image © D. Weigel]
◮ One considers genome positions that differ between individuals, that is, Single Nucleotide Polymorphisms (SNPs) (more general: genetic locus or genomic variant).
◮ Problem size: $10^5$ to $10^7$ SNPs per genome, $10^2$ to $10^5$ individuals

Genetics: Manhattan Plots
◮ The standard statistical analysis in Genetics: Generating a Manhattan plot of association signals
  [Figure: Manhattan plot for chromosome Chr2; x-axis: chromosomal position [bp]; y-axis: -log10(p-value); horizontal line: Bonferroni threshold [0.05]. Phenotype: flower color-related trait of Arabidopsis thaliana.]
◮ A plot of genome positions versus p-values of association/correlation.
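The quantities behind a Manhattan plot are simple to compute once per-SNP p-values are available. The sketch below shows the $-\log_{10}$ transform and a Bonferroni-corrected significance level, with the number of tests taken as the number of SNPs tested. The positions and p-values are made up for illustration, and the function names are hypothetical.

```python
import math


def bonferroni_threshold(alpha, num_tests):
    """Per-test significance level under a Bonferroni correction."""
    return alpha / num_tests


def manhattan_points(positions, p_values):
    """Pairs (position, -log10 p) as they would be drawn in a Manhattan plot."""
    return [(pos, -math.log10(p)) for pos, p in zip(positions, p_values)]


# Made-up example data: four SNP positions [bp] and their association p-values.
positions = [4_000_000, 8_000_000, 12_000_000, 16_000_000]
p_values = [0.2, 1e-3, 5e-7, 0.04]

alpha = 0.05
threshold = bonferroni_threshold(alpha, len(p_values))
points = manhattan_points(positions, p_values)
hits = [pos for pos, p in zip(positions, p_values) if p < threshold]

print(threshold, -math.log10(threshold))  # significance level and its height in the plot
print(hits)
```

With $10^5$ to $10^7$ SNPs, the corrected level becomes very small, which is why only strong association signals rise above the threshold line.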

Genetics: Missing Heritability
◮ More than 1200 new disease loci were detected over the last decade.
◮ The phenotypic variance explained by these loci is disappointingly low:
  "Finding the missing heritability of complex diseases" (Review), Nature, Vol 461, 8 October 2009, doi:10.1038/nature08494
  Teri A. Manolio, Francis S. Collins, Nancy J. Cox, David B. Goldstein, Lucia A. Hindorff, David J. Hunter, Mark I. McCarthy, Erin M. Ramos, Lon R. Cardon, Aravinda Chakravarti, Judy H. Cho, Alan E. Guttmacher, Augustine Kong, Leonid Kruglyak, Elaine Mardis, Charles N. Rotimi, Montgomery Slatkin, David Valle, Alice S. Whittemore, Michael Boehnke, Andrew G. Clark, Evan E. Eichler, Greg Gibson, Jonathan L. Haines, Trudy F. C. Mackay, Steven A. McCarroll & Peter M. Visscher
  "Genome-wide association studies have identified hundreds of genetic variants associated with complex human diseases and traits, and have provided valuable insights into their genetic architecture. Most variants identified so far confer relatively [...]"
  Manolio et al., Nature 2009
