Similarity searching using multiple starting points Peter Willett, University of Sheffield, UK

Overview of talk • Introduction • Similarity searching when multiple bioactive reference structures are available • Turbo similarity searching, based on the use of nearest neighbours • Conclusions

Similarity searching: I • Use of a similarity measure to determine the relatedness of an active target, or reference, structure to each structure in a database • The similar property principle means that high-ranked structures are likely to have similar activity to that of the target • Similarity searching hence provides an obvious way of following up on an initial active

Similarity searching: II • Similarity searching using a single target structure is now a common feature in chemoinformatics software systems • How to search with multiple, structurally unrelated target structures, e.g., • Diverse hits from HTS • Compounds from a public database (e.g., the MDL Drug Data Report and the World Drug Index) • Competitor compounds

Comparison of search techniques: I • Given a set of active molecules, how can a database be similarity-ranked in order of decreasing probability of activity? • Extensive simulated virtual screening experiments on the MDL Drug Data Report (MDDR) database, using • Molecules represented by 2D fingerprints (UNITY fingerprints in the initial experiments) • Inter-fingerprint similarity calculated using the Tanimoto Coefficient
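As a minimal sketch of the basic operation behind these experiments, the Tanimoto coefficient for two binary fingerprints A and B is the number of bits set in both divided by the number of bits set in either. The code below assumes fingerprints are held as Python sets of on-bit positions; the function names and the toy molecules are hypothetical and are not taken from MDDR or UNITY.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit positions."""
    common = len(a & b)
    union = len(a) + len(b) - common
    return common / union if union else 0.0


def rank_database(reference, database):
    """Return (molecule id, score) pairs sorted by decreasing similarity to the reference."""
    scores = {mol_id: tanimoto(reference, fp) for mol_id, fp in database.items()}
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Toy fingerprints (invented for illustration).
reference = {1, 2, 3, 4}
database = {"mol_a": {1, 2, 3, 4}, "mol_b": {1, 2, 5}, "mol_c": {7, 8}}

ranking = rank_database(reference, database)
```

Here "mol_b" shares two bits with the reference out of five set in either fingerprint, giving a score of 0.4, so it ranks between the identical "mol_a" and the unrelated "mol_c".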

Comparison of search techniques: II • Several different techniques were tested • Hert, J. et al., “Comparison of fingerprint-based methods for virtual screening using multiple bioactive reference structures.” J. Chem. Inf. Comput. Sci., 44, 2004, 1177. • Best results obtained by • Combining the rankings resulting from separate searches using data fusion • Approximation of the binary kernel discrimination method for machine learning

Comparison of search techniques: III • Here, the focus is on data fusion, which combines different rankings of the same set of molecules • Two basic approaches • Generate rankings from the same reference molecule using different similarity measures (similarity fusion) • Generate rankings from different reference molecules using the same similarity measure (group fusion)

Group fusion [Diagram: separate ranked lists for Reference 1, Reference 2 and Reference 3; each list is truncated to the required rank (labelled r = 1000) and the truncated lists are fused into a final list (labelled r = 2000) whose length is related to the diversity of the retrieved compounds. Legend: new active; active found in earlier list.]

Group fusion rules • Fusion of scores or fusion of ranks (ranks are normal in similarity fusion) • SUM rule: for each database molecule, add its scores (ranks) from the similarity lists and then re-rank by the resulting sums • MAX rule: re-rank using the maximum score (minimum rank) attained in any of the lists
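The SUM and MAX rules above can be sketched in a few lines of Python. This is a minimal illustration of score fusion only, not the code used in the experiments; the function name `fuse_scores`, the toy scores, and the choice to treat a molecule missing from a list as scoring 0.0 are all assumptions made for the example.

```python
def fuse_scores(score_lists, rule="max"):
    """Fuse similarity scores from several reference structures.

    score_lists: one dict per reference structure, mapping molecule id -> score.
    rule: "max" keeps the best score seen in any list, "sum" adds the scores.
    Returns molecule ids re-ranked by decreasing fused score.
    """
    combine = {"max": max, "sum": sum}[rule]
    mol_ids = set().union(*score_lists)
    # Assumption: a molecule absent from a (truncated) list contributes 0.0.
    fused = {m: combine([s.get(m, 0.0) for s in score_lists]) for m in mol_ids}
    return sorted(fused, key=lambda m: (-fused[m], m))


# Hypothetical scores for three database molecules against two references.
ref1_scores = {"a": 0.9, "b": 0.2, "c": 0.5}
ref2_scores = {"a": 0.1, "b": 0.3, "c": 0.6}

max_ranking = fuse_scores([ref1_scores, ref2_scores], rule="max")
sum_ranking = fuse_scores([ref1_scores, ref2_scores], rule="sum")
```

With these toy scores the two rules disagree on the top molecule: MAX rewards "a" for its single strong match, whereas SUM rewards "c" for being moderately similar to both references.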

Experimental details • MDDR with ca. 102K molecules • 11 activity classes • 10 sets of 10 randomly chosen compounds from each activity class • All similarities calculated using the Tanimoto Coefficient • Best group-fusion results obtained using combination of scores and the MAX rule • Comparison with average and best single-molecule searches

Use of multiple reference structures [Bar chart: recall at 5% (%) for Unity and ECFP_4 fingerprints, comparing single similarity (average), single similarity (maximum) and data fusion (scores, MAX).]
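The charts in this talk report recall at 5%. The slides do not spell out the calculation, so the sketch below assumes the usual reading: the percentage of a class's known actives that appear in the top 5% of the ranked database. The function name `recall_at_percent` and the toy data are hypothetical, and details such as whether reference structures are excluded from the count are not covered.

```python
import math


def recall_at_percent(ranked_ids, active_ids, percent=5):
    """Percentage of known actives found in the top `percent` of a ranking."""
    cutoff = math.ceil(len(ranked_ids) * percent / 100)
    found = sum(1 for mol_id in ranked_ids[:cutoff] if mol_id in active_ids)
    return 100.0 * found / len(active_ids)


# Toy ranking of 100 molecules, so the top 5% is the first five positions.
ranked = [f"m{i}" for i in range(100)]
actives = {"m0", "m3", "m50", "m99"}

recall = recall_at_percent(ranked, actives)  # two of the four actives are in the top five
```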

Comparison of 2D similarity measures • Extensive comparative experiments • SciTegic ECFP_4 best of the 14 types of 2D fingerprint • Tanimoto best of the 12 types of similarity coefficient • Whittle, M. et al., “Enhancing the effectiveness of virtual screening by fusing nearest-neighbour lists: A comparison of similarity coefficients.” J. Chem. Inf. Comput. Sci., 44, 2004, 1840. • Hert, J. et al., “Topological descriptors for similarity-based virtual screening using multiple bioactive reference structures.” Org. Biomol. Chem., 2, 2004, 3256.

Effect of structural diversity • Some evidence to suggest that the enhancement was greatest with the most diverse sets of actives • More detailed experiments using three groups of 10 MDDR activity classes that • Contained at least 50 molecules • Had the smallest, the median or the largest mean pair-wise Tanimoto similarity (similar results if numbers of scaffolds are used instead)
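The mean pair-wise similarity (MPS) used to grade the diversity of an activity class can be computed as below. This is a sketch under stated assumptions: fingerprints are represented as Python sets of on-bit positions, and the toy fingerprints and function names are invented for the example.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit positions."""
    common = len(a & b)
    union = len(a) + len(b) - common
    return common / union if union else 0.0


def mean_pairwise_similarity(fingerprints):
    """Average Tanimoto similarity over all distinct pairs in an activity class."""
    pairs = list(combinations(fingerprints, 2))
    if not pairs:
        raise ValueError("need at least two fingerprints")
    return sum(tanimoto(a, b) for a, b in pairs) / len(pairs)


# Hypothetical activity class of three molecules.
activity_class = [{1, 2, 3}, {1, 2, 4}, {1, 5, 6}]
mps = mean_pairwise_similarity(activity_class)  # pair similarities are 0.5, 0.2 and 0.2
```

A low MPS indicates a structurally heterogeneous class, which is where the slide suggests group fusion helps most.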

Recall for group fusion and similarity searching [Scatter plot: group fusion (GF) recall against similarity searching (SS) recall, both axes 0 to 100, for the low, medium and high MPS activity classes.]

Variation in relative recall with mean pair-wise similarity [Plot: GF/SS recall ratio (axis 0 to 3) against MPS (axis up to 0.6) for the low, medium and high MPS activity classes.]

Turbo similarity searching: I • Similar property principle: nearest neighbours are likely to exhibit the same activity as the reference structure • Group fusion improves the identification of active compounds • Potential for further enhancements by group fusion of rankings from the reference structure and from its assumed active nearest neighbours

Turbo similarity searching: II [Diagram linking the reference structure, its nearest neighbours and the ranked list.]

Probability of activity for nearest neighbours [Plot: probability of being active (%) against rank, for ranks 0 to 1000, with an inset covering ranks 0 to 20.]

Experimental details • MDDR data set of 11 activity classes and 102K structures as used previously • In all, 8294 actives in the 11 classes, with (turbo) similarity searches being carried out using each of these as the reference structure • ECFP_4 fingerprints/Tanimoto coefficient • MAX group fusion on similarity scores • Increasing numbers of nearest neighbours
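The turbo similarity search set-up described above (Tanimoto similarity, MAX group fusion on scores, k nearest neighbours assumed active) can be sketched as follows. This is an illustrative toy, not the authors' implementation: fingerprints are plain sets of on-bit positions rather than ECFP_4 fingerprints, and the names `turbo_search`, `similarity_scores` and the three-molecule database are assumptions.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit positions."""
    common = len(a & b)
    union = len(a) + len(b) - common
    return common / union if union else 0.0


def similarity_scores(reference, database):
    """Score every database molecule against one reference fingerprint."""
    return {mol_id: tanimoto(reference, fp) for mol_id, fp in database.items()}


def turbo_search(reference, database, k):
    """Rank the database by MAX-fusing scores from the reference and its k nearest neighbours."""
    initial = similarity_scores(reference, database)
    # The k top-ranked molecules are assumed to be active and reused as references.
    neighbours = sorted(initial, key=lambda m: (-initial[m], m))[:k]
    score_lists = [initial] + [similarity_scores(database[m], database) for m in neighbours]
    fused = {m: max(scores[m] for scores in score_lists) for m in database}
    return sorted(fused, key=lambda m: (-fused[m], m))


# Hypothetical data: "far_active" resembles the nearest neighbour more than the reference.
reference = {1, 2, 3, 4}
database = {"n1": {1, 2, 3, 5}, "far_active": {3, 5, 6, 7}, "decoy": {4, 8, 9}}

plain_ranking = turbo_search(reference, database, k=0)  # conventional similarity search
turbo_ranking = turbo_search(reference, database, k=1)
```

In this toy case the conventional search ranks "decoy" above "far_active", while the turbo search promotes "far_active" because it is similar to the nearest neighbour "n1". Note that each nearest neighbour scores 1.0 against itself, so the assumed actives stay at the top of the fused list.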

Numbers of nearest neighbours [Bar chart: recall at 5% (%) for SS and for TSS-5, TSS-10, TSS-15, TSS-20, TSS-30, TSS-40, TSS-50, TSS-100 and TSS-200.]

Upper and lower bound experiments [Bar chart: recall at 5% for SS, TSS-100, upper bound (reference + active NNs among the 100 NNs), lower bound (inactive NNs among the 100 NNs), upper bound (reference + 100 active NNs) and lower bound (100 inactive NNs).]

Rationale for upper bound results • The true actives in the set of assumed actives yield significant enhancements in performance • The true inactives in the set of assumed actives have little effect on performance • Taken together, the two groups of compounds yield the observed net enhancement • Hert, J. et al., “Enhancing the effectiveness of similarity-based virtual screening using nearest-neighbour information.” J. Med. Chem., in the press.

Use of machine-learning methods for similarity searching: I • Turbo similarity searching uses group fusion to enhance conventional similarity searching • Machine learning is a more powerful virtual screening tool than similarity searching • But it requires a training set containing known actives and inactives • Given an active reference structure, a training set can be generated by • Using the k nearest neighbours of the reference structure as the actives • Using k randomly chosen, low-similarity compounds as the inactives
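The training-set construction described above might look like this in outline. The slide does not say how "low-similarity" is decided, so the `max_inactive_similarity` threshold below is purely an assumption, as are the function name `build_training_set`, the set-based fingerprints and the toy database.

```python
import random


def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit positions."""
    common = len(a & b)
    union = len(a) + len(b) - common
    return common / union if union else 0.0


def build_training_set(reference, database, k, max_inactive_similarity=0.2, seed=0):
    """Return (assumed_actives, assumed_inactives) for a single active reference.

    Assumed actives: the k nearest neighbours of the reference.
    Assumed inactives: up to k molecules sampled at random from those whose
    similarity to the reference is at most max_inactive_similarity (assumed cutoff).
    """
    scores = {mol_id: tanimoto(reference, fp) for mol_id, fp in database.items()}
    ranked = sorted(scores, key=lambda m: (-scores[m], m))
    assumed_actives = ranked[:k]
    candidates = [m for m in ranked[k:] if scores[m] <= max_inactive_similarity]
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    assumed_inactives = rng.sample(candidates, min(k, len(candidates)))
    return assumed_actives, assumed_inactives


# Hypothetical six-molecule database.
reference = {1, 2, 3, 4}
database = {
    "a": {1, 2, 3, 4}, "b": {1, 2, 3}, "c": {1, 9},
    "d": {7, 8}, "e": {8, 9}, "f": {4, 7, 8, 9},
}
actives, inactives = build_training_set(reference, database, k=2)
```

The two lists would then be passed, with their fingerprints, to whichever machine-learning method is used to produce the final ranking.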

Use of machine-learning methods for similarity searching: II [Flow diagram: a similarity search with the reference structure gives the nearest neighbours; these and randomly selected compounds form the training set; machine learning on the training set gives the ranked list.]

Initial experiments: I • Three machine-learning techniques in the second stage • Substructural analysis (best results with the R4 probabilistic weight) • Binary kernel discrimination • Support vector machine • MDDR dataset as used previously, with 100-molecule training sets

Initial experiments: II [Bar chart: recall at 5% for SS, TSS-100, TSS-100-SSA-R4, TSS-100-BKD-OPT and TSS-100-SVM.]

Additional experiments: I • Initial results rather disappointing, but some improvements noted with the most diverse datasets • Further experiments with the set of 10 MDDR activity classes with the lowest mean pair-wise Tanimoto similarity

Additional experiments: II [Bar chart: recall at 5% for SS, TSS-100, TSS-100-SSA-R4, TSS-100-BKD-OPT and TSS-100-SVM on the low-similarity activity classes.]

Conclusions: I • Fingerprint-based similarity searching using a known reference structure is long-established in chemoinformatics • When small numbers of actives are available, group fusion will enhance performance when the sought actives are structurally heterogeneous

Conclusions: II • Can also enhance conventional similarity search, even if there is just a single active, by assuming that the nearest neighbours are also active • Can be effected in two ways • Use of group fusion to combine similarity rankings (overall best approach) • Use of substructural analysis to compute fragment weights (best with highly heterogeneous sets of actives)
