SLIDE 1

Data Mining in Bioinformatics Day 8: Graph Mining for Chemoinformatics and Drug Discovery

Chloé-Agathe Azencott

Machine Learning & Computational Biology Research Group MPIs Tübingen

C.-A. Azencott Graph Mining for Chemoinformatics February 16, 2012 1

SLIDE 2

Drug Discovery

Modern Therapeutic Research

From serendipity to rationalized drug design

Ancient Greeks treat infections with mould

[Figures: structural formula of a small molecule; Biapenem in PBP-1A]

SLIDE 3

Drug Discovery

Drug Discovery Process

SLIDE 4

Drug Discovery

Drug Discovery Process

SLIDE 5

Drug Discovery

Drug Discovery Process

SLIDE 6

Drug Discovery

Chemoinformatics

How can computer science help? → Chemoinformatics!

“...the mixing of information resources to transform data into information, and information into knowledge, for the intended purpose of making better decisions faster in the arena of drug lead identification and optimisation.” – F. K. Brown

“... the application of informatics methods to solve chemical problems.” – J. Gasteiger and T. Engel

SLIDE 7

Drug Discovery

Chemoinformatics

SLIDE 8

Drug Discovery

The Chemical Space

◮ 10^60 possible small molecules
◮ 10^22 stars in the observable universe

(Slide courtesy of Matthew A. Kayala)

SLIDE 9

Structure-Based Approaches

Drug Discovery Process

QSAR: Qualitative Structure-Activity Relationship (classification)

QSPR: Quantitative Structure-Property Relationship (regression)

SLIDE 10

Structure-Based Approaches

Representing Chemicals in silico

◮ Expert knowledge molecular descriptors
  → hard, potentially incomplete
◮ How to get a “complete” enough representation?

[Figure: structural formula of a small molecule]

Similar Property Principle: molecules having similar structures should exhibit similar activities.

◮ → Structure-based representations
◮ Compare molecules by comparing substructures

SLIDE 11

Structure-Based Approaches

Molecular Graph

Undirected labeled graph

SLIDE 12

Structure-Based Approaches

Feature Vectors based on Patterns

◮ Define feature vectors that record the presence/absence (or number of occurrences) of particular patterns in a given molecular graph:

  $\phi(A) = (\phi_s(A))_{s \text{ substructure}}$, where $\phi_s(A) = 1$ if $s$ occurs in $A$, and $0$ otherwise

◮ Extension of traditional chemical fingerprints

SLIDE 13

Structure-Based Approaches

Feature Vectors based on Patterns

Classical machine learning and data mining techniques can be applied to these vectorial feature representations.

◮ Any distance / kernel can be used
◮ Dot product → “How many substructures do two compounds share?”
◮ Classification
◮ Feature selection
◮ Clustering

SLIDE 14

Structure-Based Approaches

Tanimoto and MinMax

◮ Tanimoto (binary setting): $k(A, B) = \frac{|A \cap B|}{|A \cup B|}$
◮ MinMax (counts setting): $k(A, B) = \frac{\sum_{i=1}^{N} \min(A_i, B_i)}{\sum_{i=1}^{N} \max(A_i, B_i)}$

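The two similarity measures can be sketched in a few lines of Python. This is only an illustration: the substructure names and count vectors below are made up, and returning 1.0 for two empty fingerprints is a convention chosen here, not something stated on the slide.

```python
from typing import Sequence, Set


def tanimoto(a: Set[str], b: Set[str]) -> float:
    """Tanimoto similarity between two sets of substructure identifiers."""
    union = len(a | b)
    # Assumption: two empty fingerprints are treated as identical.
    return len(a & b) / union if union else 1.0


def minmax(a: Sequence[int], b: Sequence[int]) -> float:
    """MinMax similarity between two count vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("count vectors must have the same length")
    denom = sum(max(x, y) for x, y in zip(a, b))
    # Same convention as above for two all-zero vectors.
    return sum(min(x, y) for x, y in zip(a, b)) / denom if denom else 1.0


# Hypothetical fingerprints: 2 shared substructures out of 4 distinct ones.
mol_a = {"C-C", "C-N", "C=O"}
mol_b = {"C-C", "C=O", "C-S"}
print(tanimoto(mol_a, mol_b))        # 2 / 4
print(minmax([2, 0, 1], [1, 1, 1]))  # (1 + 0 + 1) / (2 + 1 + 1)
```

On binary vectors MinMax reduces to Tanimoto, since min and max of 0/1 entries are the intersection and union indicators.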
SLIDE 15

Structure-Based Approaches

Tanimoto and MinMax

Both Tanimoto and MinMax are kernels.

◮ Proof for Tanimoto: J. C. Gower. A general coefficient of similarity and some of its properties. Biometrics, 1971.
◮ Proof for MinMax:

  $\mathrm{MinMax}(x, y) = \frac{\langle \phi(x), \phi(y) \rangle}{\langle \phi(x), \phi(x) \rangle + \langle \phi(y), \phi(y) \rangle - \langle \phi(x), \phi(y) \rangle}$

  with $\phi(x)$ of length (# patterns) × (max count), where $q$ denotes the max count, and $\phi(x)_i = 1$ iff the pattern indexed by $\lfloor i/q \rfloor$ appears more than $i \bmod q$ times in $x$

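The MinMax argument can be checked numerically: expanding each count into q binary entries turns MinMax on counts into the Tanimoto formula on dot products. The sketch below assumes q is the maximum count and uses made-up count vectors.

```python
from typing import List


def unary_expand(counts: List[int], q: int) -> List[int]:
    """Binary vector of length len(counts) * q.

    Entry i is 1 iff pattern i // q appears more than i % q times.
    """
    return [1 if counts[i // q] > i % q else 0 for i in range(len(counts) * q)]


def dot(u: List[int], v: List[int]) -> int:
    return sum(a * b for a, b in zip(u, v))


def minmax(x: List[int], y: List[int]) -> float:
    return sum(map(min, x, y)) / sum(map(max, x, y))


def tanimoto_from_dots(u: List[int], v: List[int]) -> float:
    uv = dot(u, v)
    return uv / (dot(u, u) + dot(v, v) - uv)


x, y, q = [2, 0, 3], [1, 1, 3], 3  # hypothetical counts, q = max count
phi_x, phi_y = unary_expand(x, q), unary_expand(y, q)
print(minmax(x, y), tanimoto_from_dots(phi_x, phi_y))  # both equal 4/6
```

The dot product of two expanded vectors sums min(x_i, y_i) over patterns, and each self dot product is the total count, which is why the denominator equals the sum of maxima.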
SLIDE 16

Structure-Based Approaches

Fingerprint Compression

◮ Systematic enumeration → long, sparse vectors
  e.g. 50,000 random compounds from ChemDB → 300,000 paths of length up to 8 → 300 non-zeros on average
◮ “Naive” compression
  ◮ List the positions of the 1s
  ◮ 2^19 = 524,288, so 19 bits per position
  ◮ average encoding: 300 × 19 = 5,700 bits
◮ Modulo compression (lossy)
◮ Elias-Gamma Monotone Encoding (lossless) [Baldi 2007]
  ◮ index j → ⌊log(j)⌋ 0 bits + binary encoding of j
  ◮ since j_i < j_{i+1}: the ⌊log(j_{i+1})⌋ 0 bits can be replaced by ⌊log(j_{i+1})⌋ − ⌊log(j_i)⌋ 0 bits
  ◮ average compressed size = 1,800 bits

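The bit counts above can be reproduced with a short sketch. `elias_gamma` is the standard Elias gamma code. `monotone_elias_gamma` is one reading of the slide's monotone rule (the zero prefix has the length of the difference between consecutive floor-logarithms) and should be taken as an assumption, not as the published scheme in full detail.

```python
from math import ceil, log2
from typing import List


def naive_bits(num_ones: int, vector_length: int) -> int:
    """Bits needed to list the positions of the 1s with fixed-width indices."""
    return num_ones * ceil(log2(vector_length))


def elias_gamma(j: int) -> str:
    """floor(log2(j)) zero bits followed by the binary encoding of j (j >= 1)."""
    if j < 1:
        raise ValueError("Elias gamma is defined for positive integers")
    binary = bin(j)[2:]
    return "0" * (len(binary) - 1) + binary


def monotone_elias_gamma(indices: List[int]) -> str:
    """Encode strictly increasing positive indices (assumed variant)."""
    out, prev_log = [], 0
    for j in indices:
        cur_log = j.bit_length() - 1  # floor(log2(j))
        # Only the increase in bit length is signalled with zero bits.
        out.append("0" * (cur_log - prev_log) + bin(j)[2:])
        prev_log = cur_log
    return "".join(out)


print(naive_bits(300, 2 ** 19))         # 300 * 19 = 5700
print(elias_gamma(9))                   # 0001001
print(monotone_elias_gamma([5, 6, 9]))  # 00101 110 01001, concatenated
```

The saving comes from the prefix: sorted indices rarely change bit length, so most entries need no zero bits at all.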
SLIDE 17

Frequent Patterns MOLFEA

Frequent Pattern Mining: Mutagenicity Inducing Substructures

[Helma et al., 2004]

◮ P = positive (mutagenic) compounds, N = negative compounds
◮ features: fragments (= patterns) f such that both freq(f, P) ≥ t and freq(f, N) ≤ t
◮ Limited to frequent linear patterns
◮ ML algorithm: SVM with linear or quadratic kernel

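The fragment selection rule can be illustrated on toy data. The sketch assumes each compound is already given as a set of fragment strings (the fragment names are invented) and takes freq as the fraction of compounds containing the fragment. It filters an explicit candidate list, whereas the actual system mines fragments.

```python
from typing import List, Set


def freq(fragment: str, compounds: List[Set[str]]) -> float:
    """Fraction of compounds that contain the fragment."""
    if not compounds:
        return 0.0
    return sum(fragment in c for c in compounds) / len(compounds)


def select_fragments(positives: List[Set[str]],
                     negatives: List[Set[str]],
                     t: float) -> List[str]:
    """Fragments frequent among positives and infrequent among negatives."""
    candidates = set().union(*positives, *negatives)
    return sorted(f for f in candidates
                  if freq(f, positives) >= t and freq(f, negatives) <= t)


# Hypothetical fragment sets for three mutagens and three non-mutagens.
positives = [{"N=O", "C-C"}, {"N=O", "C-N"}, {"N=O", "C-C"}]
negatives = [{"C-C"}, {"C-N", "C-C"}, {"C-O"}]
print(select_fragments(positives, negatives, t=0.5))  # ['N=O']
```

Here "C-C" is frequent among positives but also among negatives, so it is rejected; only "N=O" passes both tests.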
SLIDE 18

Frequent Patterns MOLFEA

Frequent Pattern Mining: Mutagenicity Inducing Substructures

[Helma et al., 2004]

◮ CPDB – Carcinogenic Potency DataBase
◮ 684 compounds classified in 341 mutagens and 343 non-mutagens according to Ames test on Salmonella

[Figure: Mutagenicity prediction [Helma04]: cross-validated sensitivity (axis from 50 to 100) against frequency threshold (1%, 3%, 5%, 10%), for the linear and quadratic kernels]

SLIDE 19

Frequent Patterns gBoost

gBoost

[Saigo et al., 2008]

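◮ Train data: $\{(G_n, y_n)\}_{n=1 \ldots l}$
◮ Stump: $h(x : t, w) = w(2x_t - 1)$, with $x_t = 1$ if $t \subseteq G$ (pattern $t$ occurs in the graph $G$ represented by $x$) and $0$ otherwise
◮ Decision function:

  $f(x) = \sum_{t \in \mathcal{T},\, w \in \{-1,+1\}} \alpha_{tw}\, h(x : t, w)$, with $\sum_{t,w} \alpha_{tw} = 1$ and $\alpha_{tw} \ge 0 \ \forall t, w$

◮ LPBoosting:

  $\min_{\alpha, \xi, \rho} \; -\rho + D \sum_{n=1}^{l} \xi_n$ s.t. $\xi_n \ge 0 \ \forall n$, $\sum_{t,w} \alpha_{tw} = 1$ and $\alpha_{tw} \ge 0 \ \forall t, w$

◮ Equivalent to solving

  $\min_{\lambda, \gamma} \; \gamma$ s.t. $\sum_{n=1}^{l} \lambda_n y_n h(x_n : t, w) \le \gamma \ \forall t, w$, $\sum_{n=1}^{l} \lambda_n = 1$ and $0 \le \lambda_n \le D \ \forall n$
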
SLIDE 20

Frequent Patterns gBoost

gBoost

[Saigo et al., 2008]

◮ Solve by “column generation”
  ◮ start with H = ∅ and $\lambda_n = 1/l \ \forall n$
  ◮ Iteratively:
    ◮ find (t∗, w∗) that maximizes $g(t, w) = \sum_{n=1}^{l} \lambda_n y_n h(x_n : t, w)$
    ◮ add (t∗, w∗) to H
    ◮ update $\lambda_n$, $\gamma$
  ◮ Stop when there is no (t∗, w∗) such that g(t∗, w∗) > γ + ε

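The search for the best stump in one column-generation iteration can be sketched by brute force. This is a toy stand-in: each graph is reduced to a set of pattern names it contains (hypothetical data), so "t occurs in G" becomes set membership, and every candidate is scanned instead of searching a DFS code tree.

```python
from typing import List, Sequence, Set, Tuple


def stump(patterns_in_graph: Set[str], t: str, w: int) -> int:
    """h(x : t, w) = w * (2 * x_t - 1), with x_t = 1 iff t occurs in the graph."""
    x_t = 1 if t in patterns_in_graph else 0
    return w * (2 * x_t - 1)


def gain(graphs: Sequence[Set[str]], labels: Sequence[int],
         lambdas: Sequence[float], t: str, w: int) -> float:
    """g(t, w) = sum_n lambda_n * y_n * h(x_n : t, w)."""
    return sum(lam * y * stump(g, t, w)
               for g, y, lam in zip(graphs, labels, lambdas))


def best_stump(graphs, labels, lambdas, candidates: List[str]) -> Tuple[str, int]:
    """Exhaustive argmax of the gain over all candidate (t, w) pairs."""
    pairs = ((t, w) for t in candidates for w in (-1, +1))
    return max(pairs, key=lambda tw: gain(graphs, labels, lambdas, *tw))


# Hypothetical training set: pattern "a" occurs exactly in the positives.
graphs = [{"a", "b"}, {"a"}, {"b"}, {"c"}]
labels = [1, 1, -1, -1]
lambdas = [1 / len(graphs)] * len(graphs)  # uniform initial weights

t_star, w_star = best_stump(graphs, labels, lambdas, ["a", "b", "c"])
print(t_star, w_star, gain(graphs, labels, lambdas, t_star, w_star))
```

A perfectly discriminative pattern reaches the maximal gain of 1, the sum of the weights.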
SLIDE 21

Frequent Patterns gBoost

gBoost

[Saigo et al., 2008]

◮ Finding (t∗, w∗): DFS code tree
◮ Pruning condition (g∗ optimal gain so far): if

  $\max\left\{ 2 \sum_{n : y_n = 1,\ t \subseteq G_n} \lambda_n - \sum_{n=1}^{l} y_n \lambda_n,\ \ 2 \sum_{n : y_n = -1,\ t \subseteq G_n} \lambda_n + \sum_{n=1}^{l} y_n \lambda_n \right\} < g^*$

  then $\forall t' : t \subseteq t',\ \forall w',\ g^* > g(t', w')$

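The pruning bound can be checked on a toy example. As a stand-in for subgraph containment, patterns and graphs are modelled as sets of items (an assumption made only for this sketch), which keeps the key property that a graph containing t′ ⊇ t also contains t.

```python
from itertools import combinations
from typing import FrozenSet, Sequence, Set


def gain(graphs: Sequence[Set[str]], labels: Sequence[int],
         lambdas: Sequence[float], t: FrozenSet[str], w: int) -> float:
    """g(t, w) with the stump w * (2 * [t contained in G_n] - 1)."""
    return sum(lam * y * w * (2 * (t <= g) - 1)
               for g, y, lam in zip(graphs, labels, lambdas))


def gain_bound(graphs, labels, lambdas, t: FrozenSet[str]) -> float:
    """Upper bound on g(t', w') for every t' containing t and every w'."""
    total = sum(lam * y for y, lam in zip(labels, lambdas))
    pos = sum(lam for g, y, lam in zip(graphs, labels, lambdas)
              if y == 1 and t <= g)
    neg = sum(lam for g, y, lam in zip(graphs, labels, lambdas)
              if y == -1 and t <= g)
    return max(2 * pos - total, 2 * neg + total)


# Hypothetical data: four "graphs" as item sets, uniform weights.
graphs = [{"a", "b"}, {"a", "c"}, {"b", "c"}, {"c"}]
labels = [1, 1, -1, -1]
lambdas = [0.25] * 4

t = frozenset({"a"})
bound = gain_bound(graphs, labels, lambdas, t)

# Enumerate every superset of t over the item universe and take the best gain.
items = sorted(set().union(*graphs))
supersets = [t | frozenset(extra)
             for k in range(len(items) + 1)
             for extra in combinations(items, k)]
best = max(gain(graphs, labels, lambdas, s, w)
           for s in supersets for w in (-1, 1))
print(bound, best)  # no superset of t exceeds the bound
```

If the bound at t falls below the best gain found so far, the whole subtree of patterns extending t can be skipped.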
SLIDE 22

Frequent Patterns gBoost

gBoost

[Saigo et al., 2008]

Application to CPDB:

◮ Accuracy similar to Helma et al. (79%)
◮ Most discriminative patterns

SLIDE 23

All Patterns 2D Kernels

Paths-Based Fingerprints

◮ Labeled sub-paths (walks)

Figure: Some sub-paths of length 3
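Path features can be extracted from a labeled molecular graph with a depth-first search. This sketch enumerates simple paths (no atom visited twice); whether the fingerprints in the lecture also allow walks that revisit atoms is not settled here. The three-atom N-C=O fragment is a hypothetical example.

```python
from typing import Dict, List, Set, Tuple

Atoms = Dict[int, str]              # atom index -> element label
Bonds = Dict[Tuple[int, int], str]  # (atom, atom) -> bond label


def labeled_paths(atoms: Atoms, bonds: Bonds, max_bonds: int) -> Set[str]:
    """Label strings of all simple paths with at most max_bonds bonds."""
    adjacency: Dict[int, List[Tuple[int, str]]] = {a: [] for a in atoms}
    for (u, v), bond in bonds.items():  # undirected graph: store both directions
        adjacency[u].append((v, bond))
        adjacency[v].append((u, bond))

    found: Set[str] = set()

    def extend(atom: int, visited: Set[int], label: str, depth: int) -> None:
        found.add(label)
        if depth == max_bonds:
            return
        for nxt, bond in adjacency[atom]:
            if nxt not in visited:
                extend(nxt, visited | {nxt}, label + bond + atoms[nxt], depth + 1)

    for start in atoms:
        extend(start, {start}, atoms[start], 0)
    return found


atoms = {0: "N", 1: "C", 2: "O"}
bonds = {(0, 1): "-", (1, 2): "="}
print(sorted(labeled_paths(atoms, bonds, max_bonds=2)))
```

Each path appears in both reading directions (for example "N-C=O" and "O=C-N"); a real fingerprint would typically keep one canonical form per path.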

SLIDE 24

All Patterns 2D Kernels

Circular Fingerprints

◮ Labeled sub-trees - Extended-Connectivity (or Circular) features

Figure: Example of a circular substructure of depth 2

SLIDE 25

All Patterns 2D Kernels

Two-Dimensional Kernels

[Azencott et al., 2007]

◮ Systematically extract paths / circular fingerprints, for various maximal depths
◮ SVM with Tanimoto / MinMax
◮ Data
  ◮ Mutagenicity (Mutag): 188 compounds
  ◮ Benzodiazepine receptor affinity (BZR): 181 + 125 compounds
  ◮ Cyclooxygenase-2 inhibitors (COX2): 178 + 125 compounds
  ◮ Estrogen receptor affinity (ER): 166 + 180 compounds

Data    SVM     Previous best
Mutag   90.4%   85.2% (gBoost)
BZR     79.8%   76.4%
COX2    70.1%   73.6%
ER      82.1%   79.8%

SLIDE 26

All Patterns 3D Kernels

Introducing Spatial Information

◮ 3D Histograms [Azencott et al., 2007]
  ◮ Groups of k atoms
  ◮ Associated size:
    ◮ Pairwise distances (k = 2)
    ◮ diameter of the smallest sphere that contains all k atoms

SLIDE 27

All Patterns 3D Kernels

Introducing Spatial Information

One histogram per class of k-tuple (e.g. C-C-C, C-C-O)
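For k = 2 the construction amounts to one distance histogram per pair of atom types. The sketch below is a simplified illustration: the coordinates, bin width and number of bins are invented, and distances beyond the last bin are clamped into it.

```python
from collections import defaultdict
from itertools import combinations
from math import dist
from typing import Dict, List, Tuple

Atom = Tuple[str, Tuple[float, float, float]]  # (element label, 3D coordinates)


def pair_histograms(atoms: List[Atom], bin_width: float,
                    n_bins: int) -> Dict[str, List[int]]:
    """One histogram of pairwise distances per class of atom pair (k = 2)."""
    hists: Dict[str, List[int]] = defaultdict(lambda: [0] * n_bins)
    for (label_a, pos_a), (label_b, pos_b) in combinations(atoms, 2):
        key = "-".join(sorted((label_a, label_b)))  # "C-O" and "O-C" share a class
        bin_index = min(int(dist(pos_a, pos_b) / bin_width), n_bins - 1)
        hists[key][bin_index] += 1
    return dict(hists)


# Hypothetical 3-atom fragment with coordinates in angstroms.
molecule = [("C", (0.0, 0.0, 0.0)),
            ("C", (1.5, 0.0, 0.0)),
            ("O", (1.5, 1.2, 0.0))]
print(pair_histograms(molecule, bin_width=1.0, n_bins=3))
```

Concatenating the histograms of all classes gives a fixed-length vector to which the kernels from the 2D setting can be applied.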

SLIDE 28

All Patterns 3D Kernels

Hist3D Performance

Data        2D kernel   Hist3D kernel
Mutag       90.4%       88.8%
BZR (loo)   82.0%       79.4%
ER (loo)    87.0%       86.1%
COX2        76.9%       78.6%

SLIDE 29

All Patterns 3D Kernels

Introducing Spatial Information

◮ 3D Decomposition Kernels [Ceroni et al., 2007]
  ◮ Try to match patterns in 3D space as well
  ◮ $e^{-\gamma (l_i - l'_i)^2}$, with $l_i$ (resp. $l'_i$): length of edge $i$ in molecule $x$ (resp. $x'$)
  ◮ edges are lexicographically ordered
  ◮ kernel between two patterns:

    $K_s(\sigma, \sigma') = \prod_{i=1}^{r} \delta(e_i, e'_i)\, e^{-\gamma (l_i - l'_i)^2}$

  ◮ kernel between two molecules:

    $K(x, x') = \sum_{\sigma \in x} \sum_{\sigma' \in x'} K_s(\sigma, \sigma')$

  ◮ can be reduced to a class of patterns (e.g. circular substructures)

SLIDE 30

All Patterns 3D Kernels

3DDK Performance

Data        2D kernel   Hist3D kernel   3DDK    Circ3DDK
Mutag       90.4%       88.8%           86.7%   83.5%
BZR (loo)   82.0%       79.4%           78.4%   81.4%
ER (loo)    87.0%       86.1%           82.3%   82.1%
COX2        76.9%       78.6%           75.6%   75.2%

◮ How relevant is 3D information?
◮ How good is 3D information?

SLIDE 31

Virtual HTS

Drug Discovery Process

SLIDE 32

Virtual HTS

High-Throughput Screening

◮ Assay a large library of potential drugs against their target
◮ Very costly
◮ → docking
◮ → virtual high-throughput screening (vHTS)

SLIDE 33

Virtual HTS Measuring Performance

Imbalanced Data

◮ Typically, most compounds are inactive ⇒ many more negative than positive examples
◮ E.g. DHFR data set
  ◮ 99,995 chemicals screened for activity against dihydrofolate reductase
  ◮ < 0.2% active compounds
◮ Accuracy is not appropriate: predicting all compounds negative ⇒ accuracy > 99.8%
◮ sensitivity = # True Positives / # Positives
◮ specificity = # True Negatives / # Negatives
◮ For many methods, the output is continuous ⇒ accuracy, sensitivity and specificity depend on a threshold θ

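A small sketch makes the point about accuracy concrete. The data set below is synthetic (1,000 compounds, 2 of them active) and the predictor scores everything at 0, so all compounds are called negative for any positive threshold.

```python
from typing import Dict, Sequence


def metrics(labels: Sequence[int], scores: Sequence[float],
            theta: float) -> Dict[str, float]:
    """Accuracy, sensitivity and specificity when predicting + iff score >= theta."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= theta)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < theta)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return {
        "accuracy": (tp + tn) / len(labels),
        "sensitivity": tp / n_pos,
        "specificity": tn / n_neg,
    }


# Synthetic imbalanced data: 2 actives among 1,000 compounds.
labels = [1, 1] + [0] * 998
all_negative_scores = [0.0] * 1000
print(metrics(labels, all_negative_scores, theta=0.5))
```

Accuracy comes out at 0.998 while sensitivity is 0: the classifier never finds an active compound, which is exactly what screening is for.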
SLIDE 34

Virtual HTS Measuring Performance

Receiver Operating Characteristic (ROC) Curves

◮ For all possible values of θ, report sensitivity and 1 − specificity
◮ AUROC (Area under the ROC Curve) is a numerical measure of performance
◮ AUROC(random) = 0.5 and AUROC(optimal) = 1

[Figure: ROC curves (True Positive Rate, ticks 1/4 to 1, against False Positive Rate, ticks 1/6 to 1) for a random, a perfect and a real classifier; the points of the real curve are annotated with the thresholds Inf, 0.95, 0.94, 0.9, 0.81, 0.73, 0.52, 0.2, 0.17, 0.12, 0.09]

label   prediction
+       0.95
−       0.94
+       0.90
+       0.81
−       0.73
−       0.52
−       0.20
+       0.17
−       0.12
−       0.09

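The ROC construction can be replayed on the ten scored examples of this slide. The labels are read here as four positives (0.95, 0.90, 0.81, 0.17) and six negatives, which matches the 1/4 and 1/6 axis steps of the figure; that reading is an assumption. The sketch also assumes there are no tied scores.

```python
from typing import List, Sequence, Tuple


def roc_points(labels: Sequence[int],
               scores: Sequence[float]) -> List[Tuple[float, float]]:
    """(FPR, TPR) after each example, lowering the threshold step by step."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]  # threshold = +infinity: nothing predicted positive
    for _, y in sorted(zip(scores, labels), reverse=True):
        if y == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auroc(points: Sequence[Tuple[float, float]]) -> float:
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


labels = [1, 0, 1, 1, 0, 0, 0, 1, 0, 0]
scores = [0.95, 0.94, 0.90, 0.81, 0.73, 0.52, 0.20, 0.17, 0.12, 0.09]
points = roc_points(labels, scores)
print(points[-1], auroc(points))  # ends at (1.0, 1.0); area is about 0.75
```

The same value follows from counting: 18 of the 24 (positive, negative) pairs rank the positive higher, and 18/24 = 0.75.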
SLIDE 35

Virtual HTS Measuring Performance

Inhibition of DHFR: ROC Curves

method    AUC
IRV       0.71
SVM       0.59
kNN       0.59
MAX-SIM   0.54

[Figure: ROC curves (TPR against FPR, both from 0.0 to 1.0) for RANDOM, IRV, SVM and MAXSIM]

SLIDE 36

Virtual HTS Measuring Performance

Precision-Recall Curves

◮ Precision = # True Positives / # Predicted Positives
◮ Recall = sensitivity

[Figure: precision-recall curves (Precision, ticks 1/5 to 1, against Recall, ticks 1/4 to 1) for a perfect and a real classifier; the points of the real curve are annotated with the thresholds 0.95, 0.94, 0.9, 0.81, 0.73, 0.52, 0.2, 0.17, 0.12, 0.09]

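Precision and recall can be computed for the ten scored examples used in the ROC illustration earlier in the deck. As there, the labels are read as four positives (0.95, 0.90, 0.81, 0.17) and six negatives, which is an assumption. Returning a precision of 1.0 when nothing is predicted positive is a convention chosen for this sketch.

```python
from typing import Sequence, Tuple


def precision_recall(labels: Sequence[int], scores: Sequence[float],
                     theta: float) -> Tuple[float, float]:
    """Precision and recall when predicting + iff score >= theta."""
    predicted = [s >= theta for s in scores]
    tp = sum(1 for y, p in zip(labels, predicted) if p and y == 1)
    n_predicted = sum(predicted)
    # Convention (assumption): precision is 1.0 if nothing is predicted positive.
    precision = tp / n_predicted if n_predicted else 1.0
    recall = tp / sum(labels)
    return precision, recall


labels = [1, 0, 1, 1, 0, 0, 0, 1, 0, 0]
scores = [0.95, 0.94, 0.90, 0.81, 0.73, 0.52, 0.20, 0.17, 0.12, 0.09]

for theta in (0.95, 0.81, 0.17, 0.09):
    print(theta, precision_recall(labels, scores, theta))
```

Unlike the false positive rate, precision is computed only over predicted positives, so it is not inflated by the large number of true negatives in imbalanced screening data.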
SLIDE 37

Other Graph Mining Applications

Other Applications

◮ Database indexing and search
◮ Prediction of 3D structures of small compounds and proteins
◮ Reaction Prediction

SLIDE 38

Other Graph Mining Applications

References

C.-A. Azencott, A. Ksikes, S. J. Swamidass, J. H. Chen, L. Ralaivola, and P. Baldi. One- to Four-Dimensional Kernels for Virtual Screening and the Prediction of Physical, Chemical, and Biological Properties. J. Chem. Inf. Model., 2007. http://www.igb.uci.edu/~pfbaldi/publications/journals/2006/ci600397p.pdf

P. Baldi, R. Benz, J. S. Swamidass, and D. S. Hirschberg. Lossless Compression of Chemical Fingerprints Using Integer Entropy Codes Improves Storage and Retrieval. J. Chem. Inf. Model., 2007. http://www.ics.uci.edu/~dan/pubs/ci700200n.pdf

A. Ceroni, F. Costa, and P. Frasconi. Classification of Small Molecules by Two- and Three-Dimensional Decomposition Kernels. Bioinformatics, 2007. http://bioinformatics.oxfordjournals.org/content/23/16/2038

T. Fawcett. ROC Graphs: Notes and Practical Considerations for Researchers. HP Labs Tech Report, 2004. http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.10.9777

C. Helma, T. Cramer, S. Kramer, and L. De Raedt. Data Mining and Machine Learning Techniques for the Identification of Mutagenicity Inducing Substructures and Structure Activity Relationships of Noncongeneric Compounds. J. Chem. Inf. Comput. Sci., 2004. http://cbio.ensmp.fr/~jvert/svn/bibli/local/Helma2004Data.pdf

H. Saigo, S. Nowozin, T. Kadowaki, T. Kudo, and K. Tsuda. gBoost: a mathematical programming approach to graph classification and regression. Mach. Learn., 2009. http://www.nowozin.net/sebastian/papers/saigo2008gboost.pdf