Data Mining in Bioinformatics Day 9: Graph Mining in Chemoinformatics


  1. Data Mining in Bioinformatics Day 9: Graph Mining in Chemoinformatics Chloé-Agathe Azencott & Karsten Borgwardt February 10 to February 21, 2014 Machine Learning & Computational Biology Research Group Max Planck Institutes Tübingen and Eberhard Karls Universität Tübingen Karsten Borgwardt: Data Mining in Bioinformatics, Page 1

  2. Drug discovery. Modern therapeutic research: from serendipity to rationalized drug design. Ancient Greeks treated infections with mould. [Figures: chemical structure diagram; biapenem in PBP-1A]

  3. Drug discovery process. (1) Find a target: a protein that we want to inhibit so as to interfere with a biological process. (2) Identify hits: compounds likely to bind to the target. (3) Hit-to-lead: characterize hits. Can they be drugs? (ADME-Tox). (4) Lead optimization and synthesis: bioactivity, pharmacokinetics, synthetic pathway. (5) Assay: in vitro, in vivo, clinical.

  4. Drug discovery process. (1) Find a target; (2) Identify hits; (3) Hit-to-lead: characterize hits; (4) Lead optimization and synthesis; (5) Assay. Durations marked on the pipeline: 52 months and 90 months.

  5. Drug discovery process. (1) Find a target; (2) Identify hits; (3) Hit-to-lead: characterize hits; (4) Lead optimization and synthesis; (5) Assay. Durations marked on the pipeline: 52 months and 90 months. Amount marked on the pipeline: $500,000,000 to $2,000,000,000.

  6. Chemoinformatics. How can computer science help? → Chemoinformatics! “...the mixing of information resources to transform data into information, and information into knowledge, for the intended purpose of making better decisions faster in the arena of drug lead identification and optimisation.” (F. K. Brown) “... the application of informatics methods to solve chemical problems.” (J. Gasteiger and T. Engel)

  7. Chemoinformatics. [Figure: the drug discovery pipeline (1. Find a target; 2. Identify hits; 3. Hit-to-lead: characterize hits; 4. Lead optimization and synthesis; 5. Assay), labeled "Chemoinformatics"]

  8. Chemoinformatics. The chemical space: $10^{60}$ possible small organic molecules; $10^{22}$ stars in the observable universe. (Slide courtesy of Matthew A. Kayala)

  9. Drug discovery process. (1) Find a target; (2) Identify hits; (3) Hit-to-lead: characterize hits; (4) Lead optimization and synthesis; (5) Assay. Pipeline annotations: QSAR, QSPR. QSAR: Qualitative Structure-Activity Relationship, i.e. classification. QSPR: Quantitative Structure-Property Relationship, i.e. regression.

  10. Representing chemicals in silico. Expert knowledge: molecular descriptors → hard, potentially incomplete. Molecules are... [Figure: chemical structure diagram]

  11. Representing chemicals in silico. Similar Property Principle: molecules having similar structures should exhibit similar activities. → Structure-based representations: compare molecules by comparing substructures.

  12. Molecular graph. [Figure: molecular graph whose vertices carry atom labels (C, N, O, S) and whose edges carry bond labels (e.g. d)] Undirected labeled graph.
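As a rough illustration, an undirected labeled graph like the one on this slide could be stored as follows. The class name `MolecularGraph`, its methods, and the toy O=C-N fragment are hypothetical choices for this sketch, not part of the lecture. The bond labels `s` and `d` are assumed to mean single and double bonds, following the notation of the path examples on later slides.

```python
from collections import defaultdict


class MolecularGraph:
    """Undirected graph with labeled vertices (atoms) and labeled edges (bonds)."""

    def __init__(self):
        self.atoms = {}                 # vertex id -> atom label, e.g. "C"
        self.bonds = defaultdict(dict)  # vertex id -> {neighbor id: bond label}

    def add_atom(self, idx, label):
        self.atoms[idx] = label

    def add_bond(self, i, j, label):
        # Undirected: store the bond label in both directions.
        self.bonds[i][j] = label
        self.bonds[j][i] = label

    def neighbors(self, i):
        """Return (neighbor id, bond label) pairs for vertex i."""
        return list(self.bonds[i].items())


# Toy fragment (hypothetical): O=C-N
g = MolecularGraph()
for idx, label in enumerate(["O", "C", "N"]):
    g.add_atom(idx, label)
g.add_bond(0, 1, "d")  # assumed: "d" = double bond
g.add_bond(1, 2, "s")  # assumed: "s" = single bond
```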

  13. Fingerprints. Define feature vectors that record the presence/absence (or number of occurrences) of particular patterns in a given molecular graph: $\phi(A) = (\phi_s(A))_{s\ \text{substructure}}$, where $\phi_s(A) = 1$ if $s$ occurs in $A$ and $\phi_s(A) = 0$ otherwise. Extension of traditional chemical fingerprints.
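A minimal sketch of this feature map, assuming the patterns occurring in a molecule have already been extracted as strings. The function name `fingerprint`, the `vocabulary` argument, and the example pattern strings are illustrative assumptions.

```python
def fingerprint(patterns_in_molecule, vocabulary, counts=False):
    """Map a molecule, given as the list of patterns found in it, to a vector.

    With counts=False the entry for pattern s is 1 if s occurs and 0 otherwise;
    with counts=True it is the number of occurrences of s.
    """
    index = {s: i for i, s in enumerate(vocabulary)}
    phi = [0] * len(vocabulary)
    for s in patterns_in_molecule:
        if s in index:
            if counts:
                phi[index[s]] += 1
            else:
                phi[index[s]] = 1
    return phi


# Hypothetical vocabulary of substructures and patterns found in one molecule.
vocabulary = ["CsC", "CdO", "CsN", "CsS"]
found = ["CsC", "CsC", "CdO"]
binary = fingerprint(found, vocabulary)
counted = fingerprint(found, vocabulary, counts=True)
```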

  14. Fingerprints. Learning from fingerprints: classical machine learning and data mining techniques can be applied to these vectorial feature representations. Any distance / kernel can be used. Classification; feature selection; clustering.

  15. Fingerprints compression. Systematic enumeration → long, sparse vectors, e.g. 50,000 random compounds from ChemDB → 300,000 paths of length up to 8 → 300 non-zeros on average. “Naive” compression: list the positions of the 1s. $2^{19} = 524{,}288$; average encoding: $300 \times 19 = 5{,}700$ bits.
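The arithmetic on this slide can be checked with a short sketch: indexing one of 300,000 positions takes 19 bits (since $2^{18} < 300{,}000 \le 2^{19}$), so listing 300 positions costs $300 \times 19$ bits. The function names are hypothetical.

```python
def naive_compress(bits):
    """Return the positions of the 1s in a binary fingerprint."""
    return [i for i, b in enumerate(bits) if b]


def naive_encoding_bits(n_ones, vector_length):
    """Bits needed to list n_ones positions in a vector of the given length."""
    # Positions range over 0 .. vector_length - 1.
    bits_per_position = (vector_length - 1).bit_length()
    return n_ones * bits_per_position


positions = naive_compress([0, 1, 0, 1, 1])
bits_needed = naive_encoding_bits(300, 300_000)
```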

  16. Fingerprints compression. Modulo compression (lossy).
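The slide only names the technique. One common reading, assumed here, is to fold the long vector into a shorter one of length `m` by setting bit `p % m` for every set position `p`. Positions that collide become indistinguishable, which is why the scheme is lossy. The function name and the example positions are hypothetical.

```python
def modulo_compress(positions, m):
    """Fold a sparse binary fingerprint (positions of its 1s) into m bits."""
    folded = [0] * m
    for p in positions:
        folded[p % m] = 1  # colliding positions are OR-ed together
    return folded


# Positions 5 and 1029 collide modulo 1024, so one bit of information is lost.
folded = modulo_compress([5, 1029, 2000], m=1024)
```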

  17. Frequent patterns fingerprints: MOLFEA [Helma et al., 2004]. P = positive (mutagenic) compounds; N = negative compounds. Features: fragments (= patterns) $f$ such that both $\mathrm{freq}(f, P) \geq t$ and $\mathrm{freq}(f, N) \geq t$. Limited to frequent linear patterns. ML algorithm: SVM with linear or quadratic kernel.
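The selection rule can be sketched as a simple filter, taking the condition exactly as stated on the slide. This is not MOLFEA's actual implementation: the function name, the representation of each compound as a set of fragment strings, and the reading of freq as a fraction of compounds are all assumptions.

```python
def frequent_features(fragments_pos, fragments_neg, t):
    """Keep fragments f with freq(f, P) >= t and freq(f, N) >= t.

    Each dataset is a list of sets of fragments, one set per compound;
    freq is taken to be the fraction of compounds containing the fragment.
    """
    def freq(f, dataset):
        return sum(1 for frags in dataset if f in frags) / len(dataset)

    candidates = set().union(*fragments_pos, *fragments_neg)
    return {
        f for f in candidates
        if freq(f, fragments_pos) >= t and freq(f, fragments_neg) >= t
    }


# Hypothetical toy data: two positive and two negative compounds.
P = [{"CsC", "CdO"}, {"CsC"}]
N = [{"CsC"}, {"CsN"}]
selected = frequent_features(P, N, t=0.5)
```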

  18. Frequent patterns fingerprints: MOLFEA [Helma et al., 2004]. CPDB (Carcinogenic Potency DataBase): 684 compounds classified in 341 mutagens and 343 non-mutagens according to Ames test on Salmonella. [Figure: mutagenicity prediction [Helma et al., 2004]; cross-validated sensitivity (axis from 50 to 100) against frequency threshold (1%, 3%, 5%, 10%), for the linear and quadratic kernels]

  19. Spectrum kernels. $\phi(A) = (\phi_s(A))_{s \in S}$ and $K_{\text{spectrum}}(A, A') = k(\phi(A), \phi(A'))$, where $k \in \mathbb{R}^{\mathbb{R}^{|S|} \times \mathbb{R}^{|S|}}$ can be: the dot product (linear kernel); the RBF kernel; the Tanimoto kernel $k(A, B) = \frac{|A \cap B|}{|A \cup B|}$; the MinMax kernel $k(A, B) = \frac{\sum_{i=1}^{N} \min(A_i, B_i)}{\sum_{i=1}^{N} \max(A_i, B_i)}$.
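A minimal sketch of the last two kernels on plain Python lists, with binary vectors for Tanimoto and count vectors for MinMax. Returning 0.0 when both fingerprints are empty is a convention chosen for this sketch, not something the slide specifies.

```python
def tanimoto(a, b):
    """Tanimoto kernel on binary vectors: |A ∩ B| / |A ∪ B|."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0  # assumed convention for empty inputs


def minmax(a, b):
    """MinMax kernel on count vectors: sum of minima over sum of maxima."""
    num = sum(min(x, y) for x, y in zip(a, b))
    den = sum(max(x, y) for x, y in zip(a, b))
    return num / den if den else 0.0  # assumed convention for empty inputs


t = tanimoto([1, 1, 0, 1], [1, 0, 1, 1])  # 2 shared bits out of 4 set bits
m = minmax([2, 1, 0], [1, 1, 3])          # (1 + 1 + 0) / (2 + 1 + 3)
```

On binary vectors the two functions agree, since min and max reduce to logical AND and OR.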

  20. Spectrum kernels: Tanimoto and MinMax. Both Tanimoto and MinMax are kernels. Proof for Tanimoto: J.C. Gower, A general coefficient of similarity and some of its properties, Biometrics 1971. Proof for MinMax: $\mathrm{MinMax}(x, y) = \frac{\langle \phi(x), \phi(y) \rangle}{\langle \phi(x), \phi(x) \rangle + \langle \phi(y), \phi(y) \rangle - \langle \phi(x), \phi(y) \rangle}$, with $\phi(x)$ of length # patterns × max count (written $q$), and $\phi(x)_i = 1$ iff the pattern indexed by $\lfloor i/q \rfloor$ appears more than $i \bmod q$ times in $x$.
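The construction in the MinMax proof can be checked numerically: each count is expanded into `q` binary entries, after which the dot-product expression equals the ratio of minima to maxima. This sketch assumes `q` is the maximum count; the function names and the example vectors are hypothetical.

```python
def unary_expand(counts, q):
    """Binary vector of length len(counts) * q.

    Entry i is 1 iff the pattern with index i // q occurs more than i % q times,
    so a pattern with count c contributes exactly c ones.
    """
    return [1 if counts[i // q] > i % q else 0 for i in range(len(counts) * q)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def minmax_via_dot(x, y, q):
    """MinMax computed from inner products of the expanded binary vectors."""
    px, py = unary_expand(x, q), unary_expand(y, q)
    xy = dot(px, py)
    return xy / (dot(px, px) + dot(py, py) - xy)


x, y = [2, 1, 0], [1, 1, 3]
direct = sum(map(min, x, y)) / sum(map(max, x, y))
via_dot = minmax_via_dot(x, y, q=3)
```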

  21. All patterns fingerprints: paths fingerprints. Labeled sub-paths (walks). [Figure: molecular graph with the example paths CsCsCdO and NsCsCsS highlighted] Some sub-paths of length 3.
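Path patterns of this form can be enumerated with a depth-first search. The sketch below makes several assumptions: it enumerates simple paths only (no atom visited twice), measures length in bonds, and writes each path as alternating atom and bond labels as in `CsCsCdO`. The function name and the toy O=C-C-C chain are hypothetical.

```python
def labeled_paths(atoms, bonds, max_len):
    """Collect label strings of all simple paths with 1..max_len bonds.

    atoms: {vertex id: atom label}; bonds: {(i, j): bond label}, undirected.
    """
    adj = {i: [] for i in atoms}
    for (i, j), b in bonds.items():
        adj[i].append((j, b))
        adj[j].append((i, b))

    found = set()

    def extend(node, visited, label, length):
        if length > 0:
            found.add(label)
        if length == max_len:
            return
        for nxt, b in adj[node]:
            if nxt not in visited:
                extend(nxt, visited | {nxt}, label + b + atoms[nxt], length + 1)

    for start in atoms:
        extend(start, {start}, atoms[start], 0)
    return found


# Toy chain (hypothetical): O=C-C-C
atoms = {0: "O", 1: "C", 2: "C", 3: "C"}
bonds = {(0, 1): "d", (1, 2): "s", (2, 3): "s"}
paths = labeled_paths(atoms, bonds, max_len=3)
```

Each path is recorded once per reading direction (`CsCsCdO` and `OdCsCsC`); a real implementation would probably keep one canonical direction.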

  22. All patterns fingerprints: circular fingerprints. Labeled sub-trees: Extended-Connectivity (or Circular) features. [Figure: molecular graph with the circular substructure C{sC{sN|sC}|sN{sC}|sS{sC}} highlighted] Example of a circular substructure of depth 2.
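One plausible reading of the notation `C{sC{sN|sC}|sN{sC}|sS{sC}}` is a recursive string: the root atom, then in braces each neighbor prefixed by its bond label, expanded down to the requested depth. The sketch below follows that reading. It sorts children to get a canonical string (an assumption, so the order can differ from the slide's), and it skips only the parent atom, so ring atoms may reappear in the string.

```python
def circular_label(atoms, adj, root, depth, parent=None):
    """String for the labeled sub-tree of the given depth rooted at `root`.

    atoms: {vertex id: atom label}; adj: {vertex id: [(neighbor id, bond label)]}.
    """
    label = atoms[root]
    if depth == 0:
        return label
    children = sorted(
        bond + circular_label(atoms, adj, nbr, depth - 1, parent=root)
        for nbr, bond in adj[root]
        if nbr != parent
    )
    if not children:
        return label
    return label + "{" + "|".join(children) + "}"


# Hypothetical fragment chosen to mirror the slide's example substructure.
atoms = {0: "C", 1: "C", 2: "N", 3: "S", 4: "N", 5: "C", 6: "C", 7: "C"}
edges = [(0, 1), (0, 2), (0, 3), (1, 4), (1, 5), (2, 6), (3, 7)]
adj = {i: [] for i in atoms}
for i, j in edges:
    adj[i].append((j, "s"))
    adj[j].append((i, "s"))

depth2 = circular_label(atoms, adj, root=0, depth=2)
```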

  23. All patterns fingerprints: 2D spectrum kernels [Azencott et al., 2007]. Systematically extract paths / circular fingerprints, for various maximal depths. SVM with Tanimoto / MinMax.

  24. All patterns fingerprints: 2D spectrum kernels [Azencott et al., 2007]. Datasets: Mutagenicity (Mutag): 188 compounds; Benzodiazepine receptor affinity (BZR): 181 + 125 compounds; Cyclooxygenase-2 inhibitors (COX2): 178 + 125 compounds; Estrogen receptor affinity (ER): 166 + 180 compounds. Results:
      Data  | SVM   | Previous best
      Mutag | 90.4% | 85.2% (gBoost)
      BZR   | 79.8% | 76.4%
      COX2  | 70.1% | 73.6%
      ER    | 82.1% | 79.8%

  25. Weisfeiler-Lehman kernel [Shervashidze et al., 2011]. Goal: scalability. Compute a sequence that captures topological and label information of graphs in a runtime linear in the number of edges → sub-tree kernel.
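A simplified sketch of the idea: at each iteration every vertex label is replaced by its old label plus the sorted labels of its neighbors, and the kernel compares how often each label occurs in the two graphs across all iterations. For readability this sketch keeps the augmented labels as strings rather than compressing them to short identifiers, so it does not achieve the linear runtime mentioned on the slide, and it ignores edge labels. The function names and toy graphs are hypothetical.

```python
from collections import Counter


def wl_features(labels, adj, iterations):
    """Counter of all vertex labels seen over the given number of refinements.

    labels: {vertex id: label}; adj: {vertex id: [neighbor ids]}.
    """
    features = Counter(labels.values())
    current = dict(labels)
    for _ in range(iterations):
        # New label = old label + sorted multiset of neighbor labels.
        current = {
            v: current[v] + "(" + ",".join(sorted(current[u] for u in adj[v])) + ")"
            for v in adj
        }
        features.update(current.values())
    return features


def wl_subtree_kernel(g1, g2, iterations=2):
    """Dot product of the label-count vectors of two graphs."""
    f1 = wl_features(*g1, iterations)
    f2 = wl_features(*g2, iterations)
    return sum(f1[k] * f2[k] for k in f1.keys() & f2.keys())


# Toy graphs (hypothetical): the chains C-C-O and C-O.
g1 = ({0: "C", 1: "C", 2: "O"}, {0: [1], 1: [0, 2], 2: [1]})
g2 = ({0: "C", 1: "O"}, {0: [1], 1: [0]})
k12 = wl_subtree_kernel(g1, g2, iterations=1)
```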

  26. Weisfeiler-Lehman kernel [Shervashidze et al., 2011]

  27. Convolution kernels, a.k.a. decomposition kernels. $(x_1, \ldots, x_D)$ is a tuple of parts of $x$, with $x_d \in X_d$ for each part $d = 1, \ldots, D$. $k_d \in \mathbb{R}^{X_d \times X_d}$: a Mercer kernel. $K_{\text{decomposition}}(x, x') = \sum_{x_1 x_2 \ldots x_D = x} \; \sum_{x'_1 x'_2 \ldots x'_D = x'} k_1(x_1, x'_1)\, k_2(x_2, x'_2) \cdots k_D(x_D, x'_D)$. Spectrum kernels are a particular case of convolution kernels.

  28. Convolution kernels: Weighted Decomposition Kernel [Menchetti et al., 2005]. Match atoms and weigh them according to a kernel between sub-graphs that include these atoms: $K_{\mathrm{WDK}}(x, x') = \sum_{(a, \sigma) \in D_r(x)} \; \sum_{(a', \sigma') \in D_r(x')} \delta(a, a')\, K_c(\sigma, \sigma')$, with $r > 0 \in \mathbb{N}$. $D_r(x)$: decompositions of the molecular graph of $x$ in an atom $a$ and a subpath $\sigma$ of $x$ including $a$ and of depth at most $r$.

  29. Convolution kernels: Weighted Decomposition Kernel [Menchetti et al., 2005]. $K_c$: contextual kernel, here the histogram intersection kernel $K_c(\sigma, \sigma') = \sum_{l \in L} \min(f_\sigma(l), f_{\sigma'}(l))$. $L$: possible labels for edges and vertices. $f_\sigma(l)$: frequency of label $l$ in subgraph $\sigma$.
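The two formulas for $K_{\mathrm{WDK}}$ and $K_c$ can be combined in a short sketch. It assumes the decompositions $D_r(x)$ are already available as lists of (atom label, label histogram) pairs and uses raw label counts as frequencies. The function names and the toy decompositions are hypothetical.

```python
from collections import Counter


def histogram_intersection(f_sigma, f_sigma_prime):
    """K_c: sum over labels of the smaller of the two label frequencies."""
    labels = f_sigma.keys() | f_sigma_prime.keys()
    return sum(min(f_sigma[l], f_sigma_prime[l]) for l in labels)


def wdk(decomp_x, decomp_x_prime):
    """K_WDK: sum K_c over all pairs of decompositions whose atoms match."""
    total = 0
    for a, sigma in decomp_x:
        for a_prime, sigma_prime in decomp_x_prime:
            if a == a_prime:  # delta(a, a')
                total += histogram_intersection(sigma, sigma_prime)
    return total


# Hypothetical decompositions: (atom label, histogram of labels in its context).
decomp_x = [
    ("C", Counter({"C": 2, "s": 2, "O": 1})),
    ("O", Counter({"C": 1, "d": 1})),
]
decomp_y = [("C", Counter({"C": 1, "s": 1, "N": 1}))]
k = wdk(decomp_x, decomp_y)
```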
