Data Mining in Bioinformatics Day 9: Graph Mining in Chemoinformatics



SLIDE 1

Karsten Borgwardt: Data Mining in Bioinformatics, Page 1

Data Mining in Bioinformatics Day 9: Graph Mining in Chemoinformatics

Chloé-Agathe Azencott & Karsten Borgwardt
February 10 to February 21, 2014
Machine Learning & Computational Biology Research Group
Max Planck Institutes Tübingen and Eberhard Karls Universität Tübingen

SLIDE 2

Drug discovery

Modern therapeutic research: from serendipity to rationalized drug design
Ancient Greeks treated infections with mould

[Figures: chemical structure diagram; Biapenem in PBP-1A]

SLIDE 3

Drug discovery process

  • 1. Find a target: protein that we want to inhibit so as to interfere with a biological process
  • 2. Identify hits: compounds likely to bind to the target
  • 3. Hit-to-lead: characterize hits. Can they be drugs? (ADME-Tox)
  • 4. Lead optimization and synthesis
      • bioactivity
      • pharmacokinetics
      • synthetic pathway
  • 5. Assay
      • in vitro
      • in vivo
      • clinical

SLIDE 4

Drug discovery process

Durations shown on the pipeline figure: 52 months, 90 months

  • 1. Find a target
  • 2. Identify hits
  • 3. Hit-to-lead: characterize hits
  • 4. Lead optimization and synthesis
  • 5. Assay

SLIDE 5

Drug discovery process

Cost shown on the pipeline figure: $500,000,000 to $2,000,000,000
Durations shown on the pipeline figure: 52 months, 90 months

  • 1. Find a target
  • 2. Identify hits
  • 3. Hit-to-lead: characterize hits
  • 4. Lead optimization and synthesis
  • 5. Assay

SLIDE 6

Chemoinformatics

How can computer science help?

→ Chemoinformatics!

“...the mixing of information resources to transform data into information, and information into knowledge, for the intended purpose of making better decisions faster in the arena of drug lead identification and optimisation.” – F. K. Brown

“... the application of informatics methods to solve chemical problems.” – J. Gasteiger and T. Engel

SLIDE 7

Chemoinformatics

  • 1. Find a target
  • 2. Identify hits
  • 3. Hit-to-lead: characterize hits
  • 4. Lead optimization and synthesis
  • 5. Assay

SLIDE 8

Chemoinformatics

The chemical space

10^60 possible small organic molecules
10^22 stars in the observable universe

(Slide courtesy of Matthew A. Kayala)

SLIDE 9

Drug discovery process

  • 1. Find a target
  • 2. Identify hits
  • 3. Hit-to-lead: characterize hits
  • 4. Lead optimization and synthesis
  • 5. Assay

QSAR: Qualitative Structure-Activity Relationship, i.e. classification
QSPR: Quantitative Structure-Property Relationship, i.e. regression

SLIDE 10

Representing chemicals in silico

Expert knowledge: molecular descriptors
→ hard, potentially incomplete

Molecules are...

[Figure: chemical structure diagram]

SLIDE 11

Representing chemicals in silico

Similar Property Principle: molecules having similar structures should exhibit similar activities.

→ Structure-based representations

Compare molecules by comparing substructures

SLIDE 12

Molecular graph

[Figure: molecular graph; vertices labeled with atom types (C, N, O, S), some edges labeled d]

Undirected labeled graph

SLIDE 13

Fingerprints

Define feature vectors that record the presence/absence (or number of occurrences) of particular patterns in a given molecular graph:

$$\phi(A) = (\phi_s(A))_{s \text{ substructure}} \quad \text{where} \quad \phi_s(A) = \begin{cases} 1 & \text{if } s \text{ occurs in } A \\ 0 & \text{otherwise} \end{cases}$$

Extension of traditional chemical fingerprints
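
The definition of φ can be sketched in a few lines of Python. This is only an illustration: the pattern strings, the `vocabulary` list and the function name `fingerprint` are hypothetical, and the patterns occurring in the molecule are assumed to have been extracted already.

```python
from typing import Iterable, List, Sequence


def fingerprint(patterns_in_molecule: Iterable[str],
                vocabulary: Sequence[str],
                counts: bool = False) -> List[int]:
    """Return phi(A), with one entry per substructure s in `vocabulary`.

    With counts=False the entry is 1 if s occurs in A and 0 otherwise;
    with counts=True it is the number of occurrences of s in A.
    """
    occurrences = {}
    for s in patterns_in_molecule:
        occurrences[s] = occurrences.get(s, 0) + 1
    if counts:
        return [occurrences.get(s, 0) for s in vocabulary]
    return [1 if s in occurrences else 0 for s in vocabulary]


# Hypothetical toy data: pattern strings extracted from one molecule A.
vocabulary = ["CsC", "CsN", "CdO", "CsS"]
molecule_a = ["CsC", "CsC", "CdO"]

binary_fp = fingerprint(molecule_a, vocabulary)
count_fp = fingerprint(molecule_a, vocabulary, counts=True)
```

The binary variant corresponds to the presence/absence definition on the slide; the count variant corresponds to the "number of occurrences" alternative.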

SLIDE 14

Fingerprints

Learning from fingerprints

Classical machine learning and data mining techniques can be applied to these vectorial feature representations.
Any distance / kernel can be used:
  • Classification
  • Feature selection
  • Clustering

SLIDE 15

Fingerprints

Fingerprints compression

Systematic enumeration → long, sparse vectors
e.g. 50,000 random compounds from ChemDB
→ 300,000 paths of length up to 8
→ 300 non-zeros on average

“Naive” compression
List the positions of the 1s
2^19 = 524,288 ≥ 300,000, so 19 bits suffice to encode one position
average encoding: 300 × 19 = 5,700 bits
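
The arithmetic behind the naive encoding can be checked with a short sketch. The function names are hypothetical; the only assumption is that each position is stored as a fixed-width binary index.

```python
def bits_per_position(vector_length: int) -> int:
    """Smallest number of bits that can index every position 0..vector_length-1."""
    return (vector_length - 1).bit_length()


def naive_encoding_bits(vector_length: int, nonzeros: int) -> int:
    """Size in bits of a fingerprint stored as the list of positions of its 1s."""
    return nonzeros * bits_per_position(vector_length)


def positions_of_ones(bits):
    """The naive compressed form: indices of the non-zero entries."""
    return [i for i, b in enumerate(bits) if b]


index_bits = bits_per_position(300_000)           # 19, since 2**18 < 300,000 <= 2**19
average_size = naive_encoding_bits(300_000, 300)  # 300 * 19 = 5,700 bits
```

Compared with storing all 300,000 bits, the position list is roughly fifty times smaller for a fingerprint with 300 non-zeros.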

SLIDE 16

Fingerprints

Fingerprints compression

Modulo compression (lossy)
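
The slide only names the technique. A minimal sketch, assuming that modulo compression means folding a long binary fingerprint onto a shorter one by mapping position i to i mod m and combining colliding bits with a logical OR (the function name and the toy positions are hypothetical):

```python
def modulo_compress(on_positions, compressed_length):
    """Fold a sparse binary fingerprint onto `compressed_length` bits.

    `on_positions` lists the indices of the 1s in the long fingerprint.
    Position i is mapped to i % compressed_length; colliding bits are OR-ed,
    which is why the compression is lossy.
    """
    folded = [0] * compressed_length
    for i in on_positions:
        folded[i % compressed_length] = 1
    return folded


# Positions 3 and 1027 collide when folding onto 8 bits (1027 % 8 == 3).
folded = modulo_compress([3, 10, 1027], 8)
```

Three set bits in the long fingerprint become two set bits after folding, so the original cannot be recovered exactly.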

SLIDE 17

Frequent patterns fingerprints

MOLFEA [Helma et al., 2004]

P = positive (mutagenic) compounds
N = negative compounds
features: fragments (= patterns) f such that both freq(f, P) ≥ t and freq(f, N) ≥ t
Limited to frequent linear patterns
ML algorithm: SVM with linear or quadratic kernel

SLIDE 18

Frequent patterns fingerprints

MOLFEA [Helma et al., 2004]

CPDB – Carcinogenic Potency DataBase
684 compounds classified in 341 mutagens and 343 non-mutagens according to Ames test on Salmonella

[Figure: Mutagenicity prediction [Helma et al., 2004]; cross-validated sensitivity (axis from 50 to 100) versus frequency threshold (1%, 3%, 5%, 10%), for the linear kernel and the quadratic kernel]

SLIDE 19

Spectrum kernels

$\phi(A) = (\phi_s(A))_{s \in S}$
$K_{\mathrm{spectrum}}(A, A') = k(\phi(A), \phi(A'))$

$k : \mathbb{R}^{|S|} \times \mathbb{R}^{|S|} \to \mathbb{R}$ can be:
  • Dot product (linear kernel)
  • RBF kernel
  • Tanimoto kernel: $k(A, B) = \frac{|A \cap B|}{|A \cup B|}$
  • MinMax kernel: $k(A, B) = \frac{\sum_{i=1}^{N} \min(A_i, B_i)}{\sum_{i=1}^{N} \max(A_i, B_i)}$
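
The Tanimoto and MinMax kernels above can be sketched directly from their definitions. The function names are hypothetical, and returning 0.0 for two all-zero fingerprints is a convention chosen here, not something stated on the slide.

```python
def tanimoto(a, b):
    """Tanimoto kernel on binary fingerprints: |A ∩ B| / |A ∪ B|."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0  # assumed convention for empty inputs


def minmax(a, b):
    """MinMax kernel on count fingerprints: sum of minima over sum of maxima."""
    numerator = sum(min(x, y) for x, y in zip(a, b))
    denominator = sum(max(x, y) for x, y in zip(a, b))
    return numerator / denominator if denominator else 0.0  # assumed convention


binary_similarity = tanimoto([1, 0, 1, 1], [1, 1, 0, 1])  # 2 shared bits / 4 set bits
count_similarity = minmax([2, 0, 1], [1, 3, 1])           # (1 + 0 + 1) / (2 + 3 + 1)
```

On binary vectors the two functions agree, since min acts as AND and max as OR.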

SLIDE 20

Spectrum kernels

Tanimoto and MinMax

Both Tanimoto and MinMax are kernels.

Proof for Tanimoto: J. C. Gower, A general coefficient of similarity and some of its properties. Biometrics, 1971.

Proof for MinMax:

$$\mathrm{MinMax}(x, y) = \frac{\langle \phi(x), \phi(y) \rangle}{\langle \phi(x), \phi(x) \rangle + \langle \phi(y), \phi(y) \rangle - \langle \phi(x), \phi(y) \rangle}$$

with $\phi(x)$ of length (# patterns) × (max count), where, writing $q$ for the max count, $\phi(x)_i = 1$ iff the pattern indexed by $\lfloor i/q \rfloor$ appears more than $i \bmod q$ times in $x$.
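
The construction in the MinMax proof can be checked numerically. The sketch below expands a count vector into the binary vector φ described above (taking q as the maximal count, which is how the slide's "max count" is read here) and compares the inner-product form with the direct MinMax formula. All function names are hypothetical.

```python
def unary_expand(counts, q):
    """Binary vector of length len(counts) * q.

    Entry i is 1 iff the pattern with index i // q appears more than i % q times.
    """
    return [1 if counts[i // q] > i % q else 0 for i in range(len(counts) * q)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def tanimoto_inner_product(u, v):
    """<u, v> / (<u, u> + <v, v> - <u, v>)."""
    uv = dot(u, v)
    return uv / (dot(u, u) + dot(v, v) - uv)


def minmax(a, b):
    return sum(min(x, y) for x, y in zip(a, b)) / sum(max(x, y) for x, y in zip(a, b))


x, y = [2, 0, 1], [1, 3, 1]
q = max(x + y)  # maximal count over both vectors
via_expansion = tanimoto_inner_product(unary_expand(x, q), unary_expand(y, q))
direct = minmax(x, y)
```

Both values agree because, after expansion, the inner product counts min(x_i, y_i) ones per pattern and the denominator counts max(x_i, y_i).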

SLIDE 21

All patterns fingerprints

Paths fingerprints
Labeled sub-paths (walks)

[Figure: molecular graph with two sub-paths highlighted: NsCsCsS and CsCsCdO]

Some sub-paths of length 3
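
Path enumeration can be sketched as a depth-first search over a labeled graph. The sketch makes two assumptions that go beyond the slide: paths do not revisit atoms, and a path read from either end counts as the same pattern (the lexicographically smaller reading is kept). The toy fragment and all names are hypothetical; bond labels follow the slide's notation (s = single, d = double).

```python
from typing import Dict, List, Set, Tuple


def labeled_paths(atoms: Dict[int, str],
                  bonds: Dict[Tuple[int, int], str],
                  max_bonds: int) -> Set[str]:
    """Enumerate labeled paths with 1..max_bonds bonds, e.g. 'NsCsCsS'."""
    neighbors: Dict[int, List[Tuple[int, str]]] = {v: [] for v in atoms}
    for (u, v), bond in bonds.items():
        neighbors[u].append((v, bond))
        neighbors[v].append((u, bond))

    found: Set[str] = set()

    def extend(vertex: int, visited: Set[int], tokens: List[str]) -> None:
        n_bonds = len(tokens) // 2
        if n_bonds >= 1:
            # Keep one canonical reading of the path (forward or reversed).
            found.add("".join(min(tokens, tokens[::-1])))
        if n_bonds == max_bonds:
            return
        for nxt, bond in neighbors[vertex]:
            if nxt not in visited:
                extend(nxt, visited | {nxt}, tokens + [bond, atoms[nxt]])

    for start in atoms:
        extend(start, {start}, [atoms[start]])
    return found


# Hypothetical toy fragment: N-C(=O)-C-S
atoms = {0: "N", 1: "C", 2: "C", 3: "S", 4: "O"}
bonds = {(0, 1): "s", (1, 2): "s", (2, 3): "s", (1, 4): "d"}
paths = labeled_paths(atoms, bonds, max_bonds=3)
```

The resulting set can be used as the pattern list from which a path fingerprint is built.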

SLIDE 22

All patterns fingerprints

Circular fingerprints
Labeled sub-trees: Extended-Connectivity (or Circular) features

[Figure: molecular graph with one circular substructure highlighted: C{sC{sN|sC}|sN{sC}|sS{sC}}]

Example of a circular substructure of depth 2

SLIDE 23

All patterns fingerprints

2D spectrum kernels [Azencott et al., 2007]

Systematically extract paths / circular fingerprints, for various maximal depths
SVM with Tanimoto / MinMax

SLIDE 24

All patterns fingerprints

2D spectrum kernels [Azencott et al., 2007]

Mutagenicity (Mutag): 188 compounds
Benzodiazepine receptor affinity (BZR): 181 + 125 compounds
Cyclooxygenase-2 inhibitors (COX2): 178 + 125 compounds
Estrogen receptor affinity (ER): 166 + 180 compounds

Data     SVM      Previous best
Mutag    90.4%    85.2% (gBoost)
BZR      79.8%    76.4%
COX2     70.1%    73.6%
ER       82.1%    79.8%

SLIDE 25

Weisfeiler-Lehman kernel

[Shervashidze et al., 2011]

Goal: scalability
Compute a sequence that captures topological and label information of graphs in a runtime linear in the number of edges

→ sub-tree kernel
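
A minimal sketch of the Weisfeiler-Lehman relabeling idea follows. It is illustrative only: it builds labels as plain strings rather than compressed integer labels, so it does not reach the linear runtime mentioned above, and the function names and toy graphs are hypothetical.

```python
from collections import Counter


def wl_features(labels, adjacency, iterations):
    """Count the vertex labels produced by `iterations` rounds of WL relabeling.

    In each round, a vertex's new label combines its current label with the
    sorted multiset of its neighbors' current labels.
    """
    current = dict(labels)
    features = Counter(current.values())
    for _ in range(iterations):
        current = {
            v: current[v] + "(" + ",".join(sorted(current[u] for u in adjacency[v])) + ")"
            for v in current
        }
        features.update(current.values())
    return features


def wl_subtree_kernel(graph_a, graph_b, iterations):
    """Dot product of the two label-count vectors."""
    fa = wl_features(graph_a[0], graph_a[1], iterations)
    fb = wl_features(graph_b[0], graph_b[1], iterations)
    return sum(fa[label] * fb[label] for label in fa.keys() & fb.keys())


# Hypothetical toy graphs: the chains C-C-O and C-O.
g1 = ({0: "C", 1: "C", 2: "O"}, {0: [1], 1: [0, 2], 2: [1]})
g2 = ({0: "C", 1: "O"}, {0: [1], 1: [0]})
k12 = wl_subtree_kernel(g1, g2, iterations=1)
```

With one iteration, the two chains share the initial labels C and O and the refined label O(C), which gives a kernel value of 2·1 + 1·1 + 1·1 = 4.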

SLIDE 26

Weisfeiler-Lehman kernel

[Shervashidze et al., 2011]

SLIDE 27

Convolution kernels

a.k.a. decomposition kernels

$(x_1, \ldots, x_D)$ is a tuple of parts of $x$, with $x_d \in X_d$ for each part $d = 1, \ldots, D$
$k_d \in \mathbb{R}^{X_d \times X_d}$: a Mercer kernel

$$K_{\mathrm{decomposition}}(x, x') = \sum_{x_1 x_2 \ldots x_D = x} \; \sum_{x'_1 x'_2 \ldots x'_D = x'} k_1(x_1, x'_1) \, k_2(x_2, x'_2) \cdots k_D(x_D, x'_D)$$

Spectrum kernels are a particular case of convolution kernels

SLIDE 28

Convolution kernels

Weighted Decomposition Kernel [Menchetti et al., 2005]

Match atoms and weigh them according to a kernel between subgraphs that include these atoms

$$K_{\mathrm{WDK}}(x, x') = \sum_{(a, \sigma) \in D_r(x)} \; \sum_{(a', \sigma') \in D_r(x')} \delta(a, a') \, K_c(\sigma, \sigma')$$

$r \in \mathbb{N}$, $r > 0$
$D_r(x)$: decompositions of the molecular graph of $x$ into an atom $a$ and a subpath $\sigma$ of $x$ that includes $a$ and has depth at most $r$

SLIDE 29

Convolution kernels

Weighted Decomposition Kernel [Menchetti et al., 2005]

$K_c$: contextual kernel, here the histogram intersection kernel

$$K_c(\sigma, \sigma') = \sum_{l \in L} \min(f_\sigma(l), f_{\sigma'}(l))$$

$L$: possible labels for edges and vertices
$f_\sigma(l)$: frequency of label $l$ in subgraph $\sigma$
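
The histogram intersection kernel is short enough to sketch in full. The sketch assumes that the frequency of a label is its raw count in the subgraph (a normalized frequency would work the same way); the function name and toy label lists are hypothetical.

```python
from collections import Counter


def histogram_intersection(labels_sigma, labels_sigma_prime):
    """K_c(sigma, sigma') = sum over labels l of min(f_sigma(l), f_sigma'(l)).

    Each argument lists the vertex and edge labels of one subgraph;
    f is taken to be the raw count of each label (an assumption).
    """
    f = Counter(labels_sigma)
    g = Counter(labels_sigma_prime)
    # Labels missing from either subgraph contribute min(..., 0) = 0.
    return sum(min(f[label], g[label]) for label in f.keys() & g.keys())


# Hypothetical subgraphs: atoms C, C, N with two single bonds; atoms C, O with one
# single and one double bond.
k_c = histogram_intersection(["C", "C", "N", "s", "s"], ["C", "O", "s", "d"])
```

Here the two subgraphs share one C and one single bond, so the kernel value is 2.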

SLIDE 30

Introducing spatial information

3D Histograms [Azencott et al., 2007]

Groups of k atoms
Associated size:
  • Pairwise distances (k = 2)
  • Diameter of the smallest sphere that contains all k atoms

SLIDE 31

Introducing spatial information

3D Histograms [Azencott et al., 2007]

One histogram per class of k-tuple (e.g. C-C-C, C-C-O)

[Figure: molecular graph annotated with inter-atomic distances, next to a histogram of the frequency of N-O pairs as a function of the N-O distance (Å)]
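
For the k = 2 case, the histograms can be sketched as follows. The bin width, the number of bins, the clipping of long distances into the last bin, and the toy coordinates are all assumptions made for illustration; they are not taken from the slides.

```python
import math
from collections import defaultdict
from itertools import combinations


def pair_distance_histograms(atoms, bin_width, n_bins):
    """One distance histogram per class of atom pair (e.g. 'N-O').

    `atoms` is a list of (label, (x, y, z)) tuples with coordinates in Å.
    Distances beyond the last bin are clipped into it (an assumption).
    """
    histograms = defaultdict(lambda: [0] * n_bins)
    for (label_a, pos_a), (label_b, pos_b) in combinations(atoms, 2):
        pair_class = "-".join(sorted((label_a, label_b)))
        distance = math.dist(pos_a, pos_b)
        bin_index = min(int(distance // bin_width), n_bins - 1)
        histograms[pair_class][bin_index] += 1
    return dict(histograms)


# Hypothetical coordinates: one N and two O atoms.
atoms = [("N", (0.0, 0.0, 0.0)), ("O", (3.0, 0.0, 0.0)), ("O", (0.0, 4.0, 0.0))]
histograms = pair_distance_histograms(atoms, bin_width=1.0, n_bins=10)
```

Each histogram is a fixed-length vector, so the histograms of two molecules can be compared class by class with any vector kernel.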

SLIDE 32

Introducing spatial information

3D Histograms: performance [Azencott et al., 2007]

Data         2D kernel    Hist3D kernel
Mutag        90.4%        88.8%
BZR (loo)    82.0%        79.4%
ER (loo)     87.0%        86.1%
COX2         76.9%        78.6%

SLIDE 33

Introducing spatial information

3D Decomposition Kernels [Ceroni et al., 2007]

Remember:

$$K_{\mathrm{WDK}}(x, x') = \sum_{(a, \sigma) \in D_r(x)} \; \sum_{(a', \sigma') \in D_r(x')} \delta(a, a') \, K_c(\sigma, \sigma')$$

$$K_{\mathrm{3DDK}}(x, x') = \sum_{\sigma \in S_r(x)} \; \sum_{\sigma' \in S_r(x')} K_s(\sigma, \sigma')$$

$S_r(x)$: subgraphs of $x$ composed of $r$ distinct vertices

$$K_s(\sigma, \sigma') = \prod_{i=1}^{r(r-1)/2} \delta(e_i, e'_i) \, e^{-\gamma (l_i - l'_i)}$$

$l_i$: length of edge $e_i$ in $x$ ($e_1, e_2, \ldots, e_{r(r-1)/2}$ lexicographically ordered); $\gamma \in \mathbb{R}$

SLIDE 34

Introducing spatial information

3DDK: Performance [Ceroni et al., 2007]

Data         2D kernel    Hist3D kernel    3DDK     Circ3DDK
Mutag        90.4%        88.8%            86.7%    83.5%
BZR (loo)    82.0%        79.4%            78.4%    81.4%
ER (loo)     87.0%        86.1%            82.3%    82.1%
COX2         76.9%        78.6%            75.6%    75.2%

SLIDE 35

Introducing spatial information

The pharmacophore kernel [Mahé et al., 2006]

pharmacophore $p \in P(x)$: $p = [(x_1, l_1), (x_2, l_2), (x_3, l_3)]$
$x_i$: 3D coordinates of atom $i$ of $x$; $l_i$: label of atom $i$

$$K(x, x') = \sum_{p \in P(x)} \; \sum_{p' \in P(x')} K_P(p, p')$$

$$K_P(p, p') = K_{\mathrm{dist}}(d_1, d'_1) \, K_{\mathrm{dist}}(d_2, d'_2) \, K_{\mathrm{dist}}(d_3, d'_3) \, K_{\mathrm{feat}}(l_1, l'_1) \, K_{\mathrm{feat}}(l_2, l'_2) \, K_{\mathrm{feat}}(l_3, l'_3)$$

$K_{\mathrm{dist}}$: Gaussian RBF, $K_{\mathrm{dist}}(d, d') = \exp\left(-\frac{\|d - d'\|^2}{2\sigma^2}\right)$
$K_{\mathrm{feat}}$: Dirac

SLIDE 36

Introducing spatial information

3D LAP kernel [Hinselmann et al., 2010]

M: pairwise intramolecular matrix of inter-atomic geometric distances

SLIDE 37

Introducing spatial information

Conclusion
  • How relevant is 3D information?
  • How good is 3D information?

SLIDE 38

Drug discovery process

Annotations on the pipeline figure: Docking; Virtual High-Throughput Screening

  • 1. Find a target
  • 2. Identify hits
  • 3. Hit-to-lead: characterize hits
  • 4. Lead optimization and synthesis
  • 5. Assay

SLIDE 39

High-throughput screening

Assay a large library of potential drugs against their target
Very costly

→ docking
→ virtual high-throughput screening (vHTS)

SLIDE 40

Measuring performance

Imbalanced data

Typically, most compounds are inactive ⇒ many more negative than positive examples
E.g. DHFR data set: 99,995 chemicals screened for activity against dihydrofolate reductase; < 0.2% active compounds
Accuracy is not appropriate: predicting all compounds negative ⇒ accuracy = 99.8%

sensitivity = # True Positives / # Positives
specificity = # True Negatives / # Negatives

For many methods, the output is continuous ⇒ accuracy, sensitivity and specificity depend on a threshold θ
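
The accuracy pitfall on imbalanced data can be reproduced with a short sketch. The data are synthetic (1,000 compounds with 0.2% actives, not the DHFR data set), the function name is hypothetical, and both classes are assumed to be present so that no division by zero occurs.

```python
def rates_at_threshold(labels, scores, theta):
    """Accuracy, sensitivity and specificity when predicting positive iff score >= theta.

    `labels` holds 1 for active and 0 for inactive compounds.
    Assumes at least one positive and one negative example.
    """
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= theta)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < theta)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < theta)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= theta)
    accuracy = (tp + tn) / len(labels)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return accuracy, sensitivity, specificity


# Synthetic example: 2 actives among 1,000 compounds, classifier says "inactive" for all.
labels = [1] * 2 + [0] * 998
scores = [0.0] * 1000
accuracy, sensitivity, specificity = rates_at_threshold(labels, scores, theta=0.5)
```

The trivial classifier reaches 99.8% accuracy while finding none of the actives, which is why sensitivity and specificity are reported instead.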

SLIDE 41

Measuring performance

Receiver-Operator Characteristic Curves

For all possible values of θ, report sensitivity and 1 − specificity
AUROC (Area under the ROC Curve) is a numerical measure of performance
AUROC(random) = 0.5 and AUROC(optimal) = 1

label    prediction
+        0.95
−        0.94
+        0.90
+        0.81
−        0.73
−        0.52
−        0.20
+        0.17
−        0.12
−        0.09

[Figure: ROC curve, True Positive Rate versus False Positive Rate, showing the random, perfect and real curves; the points of the real curve are annotated with the thresholds Inf, 0.95, 0.94, 0.9, 0.81, 0.73, 0.52, 0.2, 0.17, 0.12, 0.09]

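
The ROC curve for the ten labeled predictions listed on this slide can be recomputed with a small sketch. The function names are hypothetical, and the code assumes that there are no tied scores (true for this example) and that the bulleted entries in the extracted label column stand for negative labels.

```python
def roc_points(labels, scores):
    """(FPR, TPR) points obtained by lowering the threshold from +infinity.

    Assumes no tied scores and at least one example of each class.
    """
    ranked = sorted(zip(scores, labels), reverse=True)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]  # threshold = Inf: nothing is predicted positive
    for _, label in ranked:
        if label:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auroc(points):
    """Area under the piecewise-linear curve through `points` (trapezoid rule)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# The ten predictions from the slide: 4 positives, 6 negatives.
labels = [1, 0, 1, 1, 0, 0, 0, 1, 0, 0]
scores = [0.95, 0.94, 0.90, 0.81, 0.73, 0.52, 0.20, 0.17, 0.12, 0.09]
points = roc_points(labels, scores)
area = auroc(points)
```

For these data the area works out to 0.75: of the 24 positive/negative pairs, 18 are ranked with the positive above the negative.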
SLIDE 42

Measuring performance

Inhibition of DHFR: ROC Curves [Azencott et al., 2007]

method     AUC
IRV        0.71
SVM        0.59
kNN        0.59
MAX-SIM    0.54

[Figure: ROC curves, TPR versus FPR (both from 0.0 to 1.0), for RANDOM, IRV, SVM and MAXSIM]

SLIDE 43

Measuring performance

Precision-recall curves

Precision = # True Positives / # Predicted Positives
Recall = sensitivity

[Figure: precision-recall curve, Precision versus Recall, showing the perfect and real curves; the points of the real curve are annotated with the thresholds 0.95, 0.94, 0.9, 0.81, 0.73, 0.52, 0.2, 0.17, 0.12, 0.09]
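
Precision and recall at each threshold can be computed from the same ten predictions used for the ROC example (4 positives, 6 negatives). This is a minimal sketch: the function name is hypothetical, and tied scores are assumed not to occur.

```python
def precision_recall_points(labels, scores):
    """(recall, precision) after each prediction, taken in decreasing score order.

    The k-th point corresponds to a threshold equal to the k-th highest score.
    Assumes no tied scores and at least one positive example.
    """
    ranked = sorted(zip(scores, labels), reverse=True)
    n_pos = sum(labels)
    tp = 0
    points = []
    for k, (_, label) in enumerate(ranked, start=1):
        tp += label
        points.append((tp / n_pos, tp / k))
    return points


labels = [1, 0, 1, 1, 0, 0, 0, 1, 0, 0]
scores = [0.95, 0.94, 0.90, 0.81, 0.73, 0.52, 0.20, 0.17, 0.12, 0.09]
pr_points = precision_recall_points(labels, scores)
```

At the threshold 0.81, three of the four predicted positives are true positives, so both precision and recall equal 3/4; recall only reaches 1 at the threshold 0.17, where precision has dropped to 4/8.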

SLIDE 44

Other applications

Other applications of graph mining in chemoinformatics
  • Database indexing and search
  • Prediction of 3D structures of small compounds and proteins
  • Reaction prediction

SLIDE 45

References and further reading

[Azencott et al., 2007] Azencott, C.-A., Ksikes, A., Swamidass, S. J., Chen, J. H., Ralaivola, L. and Baldi, P. (2007). One- to four-dimensional kernels for virtual screening and the prediction of physical, chemical, and biological properties. Journal of chemical information and modeling 47, 965–974.

[Baldi et al., 2007] Baldi, P., Benz, R. W., Hirschberg, D. S. and Swamidass, S. J. (2007). Lossless compression of chemical fingerprints using integer entropy codes improves storage and retrieval. Journal of chemical information and modeling 47, 2098–2109.

[Ceroni et al., 2007] Ceroni, A., Costa, F. and Frasconi, P. (2007). Classification of small molecules by two- and three-dimensional decomposition kernels. Bioinformatics 23, 2038–2045.

[Helma et al., 2004] Helma, C., Cramer, T., Kramer, S. and De Raedt, L. (2004). Data mining and machine learning techniques for the identification of mutagenicity inducing substructures and structure activity relationships of noncongeneric compounds. Journal of chemical information and computer sciences 44, 1402–1411.

[Hinselmann et al., 2010] Hinselmann, G., Fechner, N., Jahn, A., Eckert, M. and Zell, A. (2010). Graph kernels for chemical compounds using topological and three-dimensional local atom pair environments. Neurocomputing 74, 219–229.

[Mahé et al., 2006] Mahé, P., Ralaivola, L., Stoven, V. and Vert, J.-P. (2006). The pharmacophore kernel for virtual screening with support vector machines. Journal of chemical information and modeling 46, 2003–2014.

[Menchetti et al., 2005] Menchetti, S., Costa, F. and Frasconi, P. (2005). Weighted Decomposition Kernels. In Proceedings of the 22nd International Conference on Machine Learning, pp. 585–592, ACM, Bonn, Germany.

[Saigo et al., 2009] Saigo, H., Nowozin, S., Kadowaki, T., Kudo, T. and Tsuda, K. (2009). gBoost: a mathematical programming approach to graph classification and regression. Machine Learning 75, 69–89.

[Shervashidze et al., 2011] Shervashidze, N., Schweitzer, P., van Leeuwen, E. J., Mehlhorn, K. and Borgwardt, K. M. (2011). Weisfeiler-Lehman graph kernels. Journal of Machine Learning Research 12, 2539–2561.