Similarity searching using multiple starting points, Peter Willett (PowerPoint presentation)


slide-1
SLIDE 1

Similarity searching using multiple starting points

Peter Willett, University of Sheffield, UK

slide-2
SLIDE 2

Overview of talk

  • Introduction
  • Similarity searching when multiple bioactive reference structures are available
  • Turbo similarity searching, based on using nearest-neighbours
  • Conclusions
slide-3
SLIDE 3

Similarity searching: I

  • Use of a similarity measure to determine the relatedness of an active target, or reference, structure to each structure in a database
  • The similar property principle means that high-ranked structures are likely to have similar activity to that of the target
  • Similarity searching hence provides an obvious way of following up on an initial active
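A minimal sketch of such a search, assuming fingerprints are held as Python sets of "on" bit positions and using the Tanimoto coefficient that the later slides adopt. The function names and the toy data are hypothetical, chosen only for illustration.

```python
def tanimoto(a, b):
    """Tanimoto coefficient for two fingerprints given as sets of on-bits."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    common = len(a & b)
    return common / (len(a) + len(b) - common)


def similarity_search(reference, database):
    """Return (score, name) pairs sorted by decreasing similarity to the reference."""
    scored = [(tanimoto(reference, fp), name) for name, fp in database.items()]
    # Ties are broken by name so that the ranking is deterministic.
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))


# Toy fingerprints (hypothetical bit positions, not real molecules).
database = {
    "mol_a": {1, 2, 3, 4},
    "mol_b": {1, 2, 7},
    "mol_c": {8, 9},
}
reference = {1, 2, 3}
ranking = similarity_search(reference, database)
```

Under the similar property principle, the molecules at the top of `ranking` are the ones most likely to share the activity of the reference.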

slide-4
SLIDE 4

Similarity searching: II

  • Similarity searching using a single target structure is now a common feature in chemoinformatics software systems
  • How to search with multiple, structurally unrelated target structures, e.g.,
    • Diverse hits from HTS
    • Compounds from a public database (e.g., MDL Drug Data Report and the World Drug Index)
    • Competitor compounds
slide-5
SLIDE 5

Comparison of search techniques: I

  • Given a set of active molecules, how can a database be similarity-ranked in order of decreasing probability of activity?
  • Extensive simulated virtual screening experiments on the MDL Drug Data Report (MDDR) database, using
    • Molecules represented by 2D fingerprints (UNITY fingerprints in the initial experiments)
    • Inter-fingerprint similarity calculated using the Tanimoto Coefficient

slide-6
SLIDE 6

Comparison of search techniques: II

  • Several different techniques were tested
    • Hert, J. et al., “Comparison of fingerprint-based methods for virtual screening using multiple bioactive reference structures.” J. Chem. Inf. Comput. Sci., 44, 2004, 1177.
  • Best results obtained by
    • Combining the rankings resulting from separate searches using data fusion
    • Approximation of the binary kernel discrimination method for machine learning

slide-7
SLIDE 7

Comparison of search techniques: III

  • Here, the focus is on data fusion, which combines different rankings of the same set of molecules
  • Two basic approaches
    • Generate rankings from the same reference molecule using different similarity measures (similarity fusion)
    • Generate rankings from different reference molecules using the same similarity measure (group fusion)

slide-8
SLIDE 8

Group fusion

[Diagram: separate ranked lists for Reference 1, Reference 2 and Reference 3]

slide-9
SLIDE 9

After truncation to required rank

[Diagram: truncated ranked lists for Reference 1, Reference 2 and Reference 3]

slide-10
SLIDE 10

[Diagram: fused list, with cut-offs marked at r = 2000 and r = 1000; legend: active found in earlier list, new active; final truncated group fusion list, with length related to diversity of retrieved compounds]

slide-11
SLIDE 11

Group fusion rules

  • Fusion of scores or fusion of ranks (normal in similarity fusion)
  • SUM rule: add the scores (ranks) from the similarity lists for some database molecule and then re-rank the resulting sums
  • MAX rule: re-rank using the maximum score (minimum rank) attained in any of the lists
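The two rules can be sketched as follows for the fusion of scores. This is an illustration only: `fuse_scores` and the toy score lists are hypothetical names and values, and every list is assumed to score the same set of database molecules. Rank fusion would follow the same pattern, using the sum or the minimum of the ranks.

```python
def fuse_scores(score_lists, rule="MAX"):
    """Fuse per-reference similarity scores into a single ranking.

    score_lists: one dict per reference structure, mapping molecule id to
    similarity score. All dicts are assumed to contain the same molecule ids.
    """
    if rule not in ("MAX", "SUM"):
        raise ValueError("rule must be 'MAX' or 'SUM'")
    combine = max if rule == "MAX" else sum
    fused = {
        mol: combine(scores[mol] for scores in score_lists)
        for mol in score_lists[0]
    }
    # Re-rank by decreasing fused score; ties are broken by molecule id.
    return sorted(fused, key=lambda mol: (-fused[mol], mol))


# Toy similarity scores for two reference structures (hypothetical values).
ref1_scores = {"m1": 0.9, "m2": 0.2, "m3": 0.5}
ref2_scores = {"m1": 0.1, "m2": 0.3, "m3": 0.6}

max_ranking = fuse_scores([ref1_scores, ref2_scores], rule="MAX")
sum_ranking = fuse_scores([ref1_scores, ref2_scores], rule="SUM")
```

In this toy case MAX places `m1` first because one reference scores it very highly, whereas SUM prefers `m3`, which is moderately similar to both references.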

slide-12
SLIDE 12

Experimental details

  • MDDR with ca. 102K molecules
  • 11 activity classes
  • 10 sets of 10 randomly chosen compounds from each activity class
  • All similarities calculated using the Tanimoto Coefficient
  • Best group-fusion results obtained using combination of scores and the MAX rule
  • Comparison with average and best single-molecule searches

slide-13
SLIDE 13

Use of multiple reference structures

[Bar chart: recall at 5% (%) for Single Similarity - Average, Single Similarity - Maximum and Data Fusion (scores - Max), for Unity and ECFP_4 fingerprints]
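The results in this talk are reported as recall at 5%. A minimal sketch of that measure, assuming it means the percentage of a class's known actives that appear in the top 5% of the ranked database (the slides do not spell out the definition). The function name and toy data are hypothetical.

```python
import math


def recall_at_fraction(ranking, actives, fraction=0.05):
    """Percentage of the known actives found in the top `fraction` of a ranking."""
    if not actives:
        raise ValueError("at least one known active is required")
    cutoff = math.ceil(len(ranking) * fraction)
    retrieved = sum(1 for mol in ranking[:cutoff] if mol in actives)
    return 100.0 * retrieved / len(actives)


# Toy ranking of 100 hypothetical molecule ids, best-ranked first.
ranking = [f"m{i}" for i in range(100)]
actives = {"m0", "m3", "m50", "m99"}

# The top 5 positions hold two of the four actives (m0 and m3).
recall = recall_at_fraction(ranking, actives)
```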

slide-14
SLIDE 14

Comparison of 2D similarity measures

  • Extensive comparative experiments
  • Scitegic ECFP_4 best of the 14 types of 2D fingerprint
  • Tanimoto best of the 12 types of similarity coefficients
  • Whittle, M. et al., “Enhancing the effectiveness of virtual screening by fusing nearest-neighbour lists: A comparison of similarity coefficients.” J. Chem. Inf. Comput. Sci., 44, 2004, 1840.
  • Hert, J. et al., “Topological descriptors for similarity-based virtual screening using multiple bioactive reference structures.” Org. Biomol. Chem., 2, 2004, 3256.

slide-15
SLIDE 15

Effect of structural diversity

  • Some evidence to suggest that the enhancement was greatest with the most diverse sets of actives.
  • More detailed experiments, in which we chose 10 MDDR activity classes that
    • Contained at least 50 molecules
    • Had the smallest, the largest or the median mean pair-wise Tanimoto similarity (similar results if numbers of scaffolds are used)
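The mean pair-wise similarity (MPS) used to grade the diversity of an activity class can be sketched as below, assuming it is the average Tanimoto coefficient over all distinct pairs of molecules in the class. Fingerprints are again represented as sets of on-bits, and the names and data are hypothetical.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto coefficient for two fingerprints given as sets of on-bits."""
    if not a and not b:
        return 0.0
    common = len(a & b)
    return common / (len(a) + len(b) - common)


def mean_pairwise_similarity(fingerprints):
    """Average Tanimoto coefficient over all distinct pairs in a class."""
    pairs = list(combinations(fingerprints, 2))
    if not pairs:
        raise ValueError("at least two fingerprints are required")
    return sum(tanimoto(a, b) for a, b in pairs) / len(pairs)


# Toy activity class of three fingerprints (hypothetical bit positions).
activity_class = [{1, 2, 3}, {1, 2, 3, 4}, {1, 2, 7}]

# Pair similarities are 0.75, 0.5 and 0.4, so the mean is 0.55.
mps = mean_pairwise_similarity(activity_class)
```

A low MPS indicates a structurally diverse class, which is where the slides report the largest benefit from group fusion.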

slide-16
SLIDE 16

Recall for group fusion and similarity searching

[Chart: recall for similarity searching (SS) and group fusion (GF); series: Low MPS, Medium MPS, High MPS]

slide-17
SLIDE 17

Variation in relative recall with mean pair-wise similarity

[Plot: relative recall (GF/SS) against mean pair-wise similarity (MPS); series: Low MPS, Medium MPS, High MPS]

slide-18
SLIDE 18

Turbo similarity searching: I

  • Similar property principle: nearest neighbours are likely to exhibit the same activity as the reference structure
  • Group fusion improves the identification of active compounds
  • Potential for further enhancements by group fusion of rankings from the reference structure and from its assumed active nearest neighbours
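A rough sketch of this two-stage idea, assuming set-based fingerprints, the Tanimoto coefficient and MAX fusion of scores. The helper names (`score_all`, `rank`, `turbo_search`) and the toy database are hypothetical, not taken from the talk.

```python
def tanimoto(a, b):
    """Tanimoto coefficient for two fingerprints given as sets of on-bits."""
    if not a and not b:
        return 0.0
    common = len(a & b)
    return common / (len(a) + len(b) - common)


def score_all(reference, database):
    """Similarity of every database molecule to one reference fingerprint."""
    return {name: tanimoto(reference, fp) for name, fp in database.items()}


def rank(scores):
    """Molecule names by decreasing score; ties are broken by name."""
    return sorted(scores, key=lambda name: (-scores[name], name))


def turbo_search(reference, database, k):
    """Fuse the reference search with searches from its k nearest neighbours."""
    first_pass = score_all(reference, database)
    neighbours = rank(first_pass)[:k]  # assumed to be active
    fused = dict(first_pass)
    for neighbour in neighbours:
        for name, score in score_all(database[neighbour], database).items():
            fused[name] = max(fused[name], score)  # MAX rule on scores
    return rank(fused)


# Toy database (hypothetical bit positions, not real molecules).
database = {
    "near": {1, 2, 3, 4},
    "mid": {3, 4, 5, 6},
    "far": {5, 6, 7, 8},
    "decoy": {20, 21},
}
reference = {1, 2, 3}

plain = rank(score_all(reference, database))
turbo = turbo_search(reference, database, k=2)
```

In the plain search `far` shares no bits with the reference and ties with `decoy` at zero. In the turbo search it is pulled above `decoy` because it resembles the assumed-active neighbour `mid`. Each neighbour scores 1.0 against itself, so the assumed actives always sit at the top of the fused list.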

slide-19
SLIDE 19

Turbo similarity searching: II

[Diagram: reference structure, nearest neighbours, ranked list]

slide-20
SLIDE 20

Probability of activity for nearest neighbours

[Plot: probability of being active (%) against rank of nearest neighbour]

slide-21
SLIDE 21

Experimental details

  • MDDR data set of 11 activity classes and 102K structures as used previously
  • In all, 8294 actives in the 11 classes, with (turbo) similarity searches being carried out using each of these as the reference structure

  • ECFP_4 fingerprints/Tanimoto coefficient
  • MAX group fusion on similarity scores
  • Increasing numbers of nearest neighbours
slide-22
SLIDE 22

Numbers of nearest neighbours

[Bar chart: recall at 5% (%) for SS, TSS-5, TSS-10, TSS-15, TSS-20, TSS-30, TSS-40, TSS-50, TSS-100 and TSS-200]

slide-23
SLIDE 23

Upper and lower bound experiments

[Bar chart: recall at 5% for SS; TSS-100; upper bound (reference + active NNs among the 100 NNs); lower bound (inactive NNs among the 100 NNs); upper bound (reference + 100 active NNs); lower bound (100 inactive NNs)]

slide-24
SLIDE 24

Rationale for upper bound results

  • The true actives in the set of assumed actives yield significant enhancements in performance
  • The true inactives in the set of assumed actives have little effect on performance
  • Taken together, the two groups of compounds yield the observed net enhancement
  • Hert, J. et al., “Enhancing the effectiveness of similarity-based virtual screening using nearest-neighbour information.” J. Med. Chem., in the press.

slide-25
SLIDE 25

Use of machine-learning methods for similarity searching: I

  • Turbo similarity searching uses group fusion to enhance conventional similarity searching
  • Machine learning is a more powerful virtual screening tool than similarity searching
    • But requires a training-set containing known actives and inactives
  • Given an active reference structure, a training-set can be generated by
    • Using the k nearest neighbours of the reference structure as the actives
    • Using k randomly chosen, low-similarity compounds as the inactives
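One way this training-set construction might look in code is sketched below. The slides do not say how "low-similarity" is defined, so the `low_similarity` threshold of 0.2, the function name and the toy data are all assumptions made for illustration.

```python
import random


def tanimoto(a, b):
    """Tanimoto coefficient for two fingerprints given as sets of on-bits."""
    if not a and not b:
        return 0.0
    common = len(a & b)
    return common / (len(a) + len(b) - common)


def build_training_set(reference, database, k, low_similarity=0.2, seed=0):
    """Return (assumed_actives, assumed_inactives) as lists of molecule names.

    low_similarity is a hypothetical cut-off: only molecules scoring below it
    are eligible to be drawn as assumed inactives.
    """
    scores = {name: tanimoto(reference, fp) for name, fp in database.items()}
    ranked = sorted(scores, key=lambda name: (-scores[name], name))
    assumed_actives = ranked[:k]  # the k nearest neighbours
    pool = [name for name in ranked[k:] if scores[name] < low_similarity]
    # random.Random.sample raises ValueError if the pool holds fewer than k names.
    assumed_inactives = random.Random(seed).sample(pool, k)
    return assumed_actives, assumed_inactives


# Toy database (hypothetical bit positions, not real molecules).
database = {
    "n1": {1, 2, 3, 4},
    "n2": {1, 2, 3, 5, 6},
    "n3": {1, 9},
    "d1": {10, 11},
    "d2": {12, 13},
    "d3": {14, 15},
}
reference = {1, 2, 3}
actives, inactives = build_training_set(reference, database, k=2)
```

The two lists can then be passed to any supervised method as positive and negative examples. Here `n3` is used for neither list, because it is outside the top k but too similar to the reference to count as a low-similarity compound.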

slide-26
SLIDE 26

Use of machine-learning methods for similarity searching: II

[Diagram: reference structure, similarity searching, nearest neighbours, randomly selected compounds, training set, machine learning, ranked list]

slide-27
SLIDE 27

Initial experiments: I

  • Three machine-learning techniques in the second stage
    • Substructural analysis (best results with the R4 probabilistic weight)
    • Binary kernel discrimination
    • Support vector machine
  • MDDR dataset as used previously, with 100-molecule training-sets

slide-28
SLIDE 28

Initial experiments: II

[Bar chart: recall at 5% for SS, TSS-100, TSS-100-SSA-R4, TSS-100-BKD-OPT and TSS-100-SVM]

slide-29
SLIDE 29

Additional experiments: I

  • Initial results rather disappointing, but some improvements noted with the most diverse datasets
  • Further experiments with the set of 10 MDDR activity classes with the lowest mean pair-wise Tanimoto similarity

slide-30
SLIDE 30

Additional experiments: II

[Bar chart: recall at 5% for SS, TSS-100, TSS-100-SSA-R4, TSS-100-BKD-OPT and TSS-100-SVM]

slide-31
SLIDE 31

Conclusions: I

  • Fingerprint-based similarity searching using a known reference structure is long-established in chemoinformatics
  • When small numbers of actives are available, group fusion will enhance performance when the sought actives are structurally heterogeneous

slide-32
SLIDE 32

Conclusions: II

  • Can also enhance conventional similarity search, even if there is just a single active, by assuming that the nearest neighbours are also active
  • Can be effected in two ways
    • Use of group fusion to combine similarity rankings (overall best approach)
    • Use of substructural analysis to compute fragment weights (best with highly heterogeneous sets of actives)

slide-33
SLIDE 33

Acknowledgements

  • Collaborators
    • Jerome Hert, Martin Whittle and David Wilton
    • Pierre Acklin, Kamal Azzaoui, Edgar Jacoby and Ansgar Schuffenhauer
    • Alexander Alex, Jens Loesel and Jonathan Mason
  • Funding, software and data support
    • Barnard Chemical Information, Daylight Chemical Information Systems, MDL Information Systems, Novartis Institutes for BioMedical Research, Pfizer Global Research and Development, Royal Society, Scitegic, Tripos, and the Wolfson Foundation