Leveraging Open Chemogenomics Data and Tools with KNIME


  1. Leveraging Open Chemogenomics Data and Tools with KNIME
     George Papadatos, ChEMBL Group, georgep@ebi.ac.uk

  2. What is EMBL-EBI?
     • Europe’s home for biological data, services, research and training
     • A trusted data provider for the life sciences
     • Part of the European Molecular Biology Laboratory, an intergovernmental research organisation
     • International: 570 members of staff from 57 nations

  3. Data resources at EMBL-EBI
     • Genes, genomes & variation: European Nucleotide Archive, Ensembl, GWAS Catalog, European Variation Archive, Ensembl Genomes, Metagenomics portal
     • Gene, protein & metabolite expression: RNA Central, ArrayExpress, MetaboLights, Expression Atlas, PRIDE
     • Protein sequences, families & motifs: InterPro, Pfam, UniProt
     • Literature & ontologies: Europe PubMed Central, BioStudies, Gene Ontology, Experimental Factor Ontology
     • Molecular structures: Protein Data Bank in Europe, Electron Microscopy Data Bank
     • Chemical biology: ChEMBL, SureChEMBL, ChEBI
     • Reactions, interactions & pathways: IntAct, Reactome, MetaboLights
     • Systems: BioModels, Enzyme Portal, BioSamples
     • Side label on the figure: Cross domain resources

  4. ChEMBL: Data for drug discovery
     1. Scientific facts
     2. Organization, integration, curation and standardization of pharmacology data
     3. Insight, tools and resources for translational drug discovery
     [Figure: a compound, its assay/target (thrombin, shown as a FASTA protein sequence) and the associated bioactivity data (Ki = 4.5 nM; APTT = 11 min)]

  5. KNIME at the EBI
     • KNIME nodes to access ChEBI and ChEMBL databases
     • Trusted community nodes
     • Workflows on Examples server
     • Method development and use cases
     • Provide KNIME training to scientists and researchers
     • Wellcome Trust drug discovery courses, EMBL courses
     • CDK community nodes support
     https://tech.knime.org/book/embl-ebi-nodes-for-knime-trusted-extension

  6. KNIME and ChEMBL
     • ChEMBL Web Services: 14M bioactivities, 1.5M structures
     • Virtual Machine: local access to ChEMBL data and services
     • UniChem Web Services: access ~110M structures from 27 sources
     knime://EXAMPLES/099_Community/08_ChEMBL_WebServices

  7. KNIME and ChEMBL
     • ChEMBL Web Services: 14M bioactivities, 1.5M structures
     • Virtual Machine: local access to ChEMBL data and services
     • UniChem Web Services: access ~110M structures from 27 sources
     • Patent Annotations: 4M patent documents, 14M structures, 260M annotations

  8. Why look at patent documents?
     • Patent filing and searching
     • Legal, financial and commercial incentives & interests
     • Prior art, novelty, freedom to operate searches
     • Competitive intelligence
     • Unprecedented wealth of knowledge
     • Most of the knowledge will never be disclosed anywhere else
     • Compounds, scaffolds, reactions
     • Biological targets, diseases, indications
     • Average lag of 2-4 years between patent document and journal publication disclosure for chemistry, 4-5 years for biological targets

  9. SureChEMBL data processing
     [Diagram, left to right]
     • Patent offices: WO; EP (applications & granted); US (applications & granted); JP (abstracts)
     • SureChEMBL system (SureChem IP): OCR; entity recognition; name to structure (five methods); image to structure (one method); attachments; processed patents (service); patent PDFs (service); database; chemistry database; application; API
     • Users
     • Example chemical name on the slide: 1-[4-ethoxy-3-(6,7-dihydro-1-methyl-7-oxo-3-propyl-1H-pyrazolo[4,3-d]pyrimidin-5-yl)phenylsulfonyl]-4-methylpiperazine
     www.surechembl.org

  10. SureChEMBL data processing v2
      [Diagram, left to right]
      • Patent offices: WO; EP (applications & granted); US (applications & granted); JP (abstracts)
      • SureChEMBL system (SureChem IP): OCR; entity recognition; name to structure (five methods); image to structure (one method); bio-entity recognition; processed patents (service); patent PDFs (service); database; chemistry database; application server
      • Users
      www.surechembl.org

  11. SureChEMBL bioannotation
      • SciBite’s Termite text-mining engine run on 4M life-science patents from SureChEMBL corpus
      • Genes (identified by HGNC symbols) and diseases (identified by MeSH IDs) annotated
      • Section/frequency information annotated (e.g., in title, abstract, claims, total frequency)
      • Relevance score (0-3) to flag important chemical and biological entities and remove noise

  12. Relevance scoring – genes/diseases
      • Various features used:
        • Term frequency
        • Position (title, abstract, figure, caption, table)
        • Frequency distribution
      • Scores range from 0 – 3
        • 3 – most important entities in the patent
        • 2 – important entities in the patent
        • 1 – mentioned entities in the patent
        • 0 – ambiguous entity/likely annotation error

  13. Relevance scoring – compounds
      • Main assumptions for relevance:
        1. Very frequent compounds are irrelevant (but if drug-like then that’s OK)
        2. Compounds with busy chemical space around them are interesting
      • Use distribution of close analogues (nearest neighbours, NNs) among compounds found in the same patent family
      • Scores range from 0 – 3
        • 3 – highest number of NNs: most important entities in the patent
        • 2 – important entities in the patent
        • 1 – few NNs: mentioned entities in the patent
        • 0 – singletons or trivial entities, most likely errors or reagents, solvents, substituents
      Hattori, K., Wakabayashi, H., & Tamaki, K. (2008). JCIM, 48(1), 135–142. doi:10.1021/ci7002686
      Tyrchan, C., Boström, J., Giordanetto, F., Winter, J., & Muresan, S. (2012). JCIM, 52(6), 1480–1489. doi:10.1021/ci3001293
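The nearest-neighbour idea on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the SureChEMBL implementation: fingerprints are represented as plain sets of on-bits, the 0.7 Tanimoto threshold and the cut-offs that map neighbour counts to scores 0–3 are hypothetical, and the compound IDs are toy data. The first assumption on the slide (discounting very frequent compounds) is not modelled here.

```python
from typing import Dict, FrozenSet

Fingerprint = FrozenSet[int]


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto similarity of two fingerprints stored as sets of on-bits."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def count_neighbours(fps: Dict[str, Fingerprint], threshold: float = 0.7) -> Dict[str, int]:
    """Count, for each compound, the other compounds at or above the threshold.

    The 0.7 threshold is an assumption made for this sketch.
    """
    return {
        cid: sum(
            1
            for other, other_fp in fps.items()
            if other != cid and tanimoto(fp, other_fp) >= threshold
        )
        for cid, fp in fps.items()
    }


def score_from_neighbours(n: int, max_n: int) -> int:
    """Map a neighbour count to a 0-3 score (hypothetical cut-offs)."""
    if n == 0 or max_n == 0:
        return 0  # singleton: most likely a reagent, solvent or error
    fraction = n / max_n
    if fraction >= 2 / 3:
        return 3
    if fraction >= 1 / 3:
        return 2
    return 1


# Toy fingerprints for compounds from one (imaginary) patent family.
fps = {
    "A": frozenset({1, 2, 3, 4}),
    "B": frozenset({1, 2, 3, 5}),
    "C": frozenset({1, 2, 3, 4, 5}),
    "D": frozenset({10, 11}),
}
counts = count_neighbours(fps)
max_count = max(counts.values())
scores = {cid: score_from_neighbours(n, max_count) for cid, n in counts.items()}
print(scores)
```

Compound C sits in the busiest region of this toy chemical space and gets the top score, while the isolated compound D scores 0.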

  14. Gotchas & out of scope
      • No Markush extraction
      • No natural language processing (e.g., ‘compound x is an inhibitor of target y’)
      • No extraction of bioactivities
      • No chemistry search (yet)
      • Patent coverage stops in April 2015
      • Incremental updates TBD
      • Patent calls still in dev
      • Old scripts / workflows may break

  15. Open PHACTS Architecture Drug Discovery Today 2012, 17:21 (doi:10.1016/j.drudis.2012.05.016)

  16. The Open PHACTS node executable API call to KREST nodes

  17. Open PHACTS Patent API https://dev.openphacts.org/docs/develop

  18. Open PHACTS Patent API
      [Diagram: a Patent connected by extracted links to Compound, Disease and Target]

  19. Open PHACTS Patent API
      [Diagram: a Patent connected by extracted links to Compound, Disease and Target, with inferred links among Compound, Disease and Target]

  20. Use case #1: Patent to Entities
      1. From a patent get compounds, genes and diseases
      2. Filter to remove noise
         • Frequency and relevance score
      3. Process and visualise
      [Diagram: Patent linked to Compound (?), Disease (?) and Target (?)]

  21. US-7718693-B2

  22. Use case #1: Patent to Entities
      • Patent URI: http://rdf.ebi.ac.uk/resource/surechembl/patent/US-7718693-B2
      • API call:
      • 586 entities back
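The resource URIs on these slides follow a regular pattern, so they can be built from a patent or SureChEMBL identifier. The sketch below assumes only the two URI prefixes shown in the slides; the helper names are made up for illustration, and the `uri` query-parameter name is an assumption, since the actual API call is not reproduced in this transcript.

```python
from urllib.parse import urlencode

# Prefix taken from the URIs shown on the slides.
SURECHEMBL_RDF_BASE = "http://rdf.ebi.ac.uk/resource/surechembl"


def patent_uri(patent_id: str) -> str:
    """Build the RDF URI for a patent such as US-7718693-B2."""
    return f"{SURECHEMBL_RDF_BASE}/patent/{patent_id}"


def molecule_uri(schembl_id: str) -> str:
    """Build the RDF URI for a SureChEMBL compound such as SCHEMBL12971682."""
    return f"{SURECHEMBL_RDF_BASE}/molecule/{schembl_id}"


uri = patent_uri("US-7718693-B2")
# "uri" as the parameter name is hypothetical; check the API documentation.
query = urlencode({"uri": uri})
print(uri)
print(query)
```

Percent-encoding the URI matters because it is itself passed as a value inside another URL.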

  24. 1) Look at target and disease entities
      • 178 target and disease entities
      • Filter: Relevance score >= 2 → 23 remain
      • Visualise in tag cloud by frequency
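Outside KNIME, the filter-then-count step for the tag cloud could look roughly like the following. The record layout (`label`, `type`, `relevance`, `frequency`) and the entity names are invented for this sketch; the real API response fields may differ.

```python
from collections import Counter
from typing import Iterable, Mapping


def tag_cloud_weights(entities: Iterable[Mapping], min_score: int = 2) -> Counter:
    """Sum mention frequencies per label, keeping entities at or above min_score."""
    weights: Counter = Counter()
    for entity in entities:
        if entity["relevance"] >= min_score:
            weights[entity["label"]] += entity["frequency"]
    return weights


# Toy annotations; labels and numbers are placeholders, not real patent data.
entities = [
    {"label": "GENE_A", "type": "target", "relevance": 3, "frequency": 42},
    {"label": "DISEASE_X", "type": "disease", "relevance": 2, "frequency": 17},
    {"label": "GENE_B", "type": "target", "relevance": 1, "frequency": 5},
    {"label": "GENE_C", "type": "target", "relevance": 0, "frequency": 1},
]
weights = tag_cloud_weights(entities)
print(weights.most_common())
```

The resulting weights can be handed to any tag-cloud renderer, with the font size scaled by frequency.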

  26. Does it make sense?

  27. Does it make sense? US-7718693-B2

  28. 2) Look at compound entities
      • 408 compound entities
      • Filter: Relevance score >= 1 → 201 remain
      • Calculate properties

  30. Does it make sense? • Calculate MCS

  31. Does it make sense? • Calculate MCS US-7718693-B2

  32. Use case #2: Drug targets & indications for compound
      1. Search patents for a compound (approved drug)
      2. Filter to remove noise
         • Frequency, relevance score and classification code
      3. For remaining patents, get disease and target entities
      4. Filter to remove noise
         • Frequency and relevance score
      5. Visualise results
      [Diagram: Compound linked to Patent, which is linked to Disease and Target]
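The five steps above can be sketched as a small in-memory pipeline. Everything concrete here is an assumption made for illustration: the record layout, the patent and entity names, the score thresholds and the choice of the `A61` classification prefix. Fetching the data from the API is deliberately left out.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Set, Tuple

Entity = Tuple[str, str]  # (entity type, label)


def rank_entities(
    patents: List[dict],
    allowed_prefixes: Sequence[str] = ("A61",),
    min_patent_score: int = 1,
    min_entity_score: int = 2,
) -> List[Tuple[Entity, int]]:
    """Rank targets and diseases by the number of supporting patents."""
    support: Dict[Entity, Set[str]] = defaultdict(set)
    for patent in patents:
        # Step 2: drop patents where the compound itself is noise.
        if patent["compound_relevance"] < min_patent_score:
            continue
        # Step 2: keep only patents with a matching classification code.
        if not any(code.startswith(tuple(allowed_prefixes)) for code in patent["classification"]):
            continue
        # Steps 3 and 4: collect sufficiently relevant entities.
        for entity in patent["entities"]:
            if entity["relevance"] >= min_entity_score:
                support[(entity["type"], entity["label"])].add(patent["patent"])
    # Step 5: most widely supported entities first, ties broken alphabetically.
    return sorted(((key, len(ids)) for key, ids in support.items()), key=lambda kv: (-kv[1], kv[0]))


# Toy records; none of these identifiers refer to real patents or genes.
patents = [
    {"patent": "P1", "compound_relevance": 2, "classification": ["A61K"],
     "entities": [{"type": "target", "label": "GENE_A", "relevance": 3},
                  {"type": "disease", "label": "DISEASE_X", "relevance": 2}]},
    {"patent": "P2", "compound_relevance": 1, "classification": ["A61P"],
     "entities": [{"type": "target", "label": "GENE_A", "relevance": 2},
                  {"type": "disease", "label": "DISEASE_Y", "relevance": 1}]},
    {"patent": "P3", "compound_relevance": 0, "classification": ["A61K"],
     "entities": [{"type": "target", "label": "GENE_B", "relevance": 3}]},
    {"patent": "P4", "compound_relevance": 3, "classification": ["C07D"],
     "entities": [{"type": "target", "label": "GENE_C", "relevance": 3}]},
]
ranked = rank_entities(patents)
print(ranked)
```

Counting distinct supporting patents rather than raw mentions keeps a single verbose patent from dominating the ranking.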

  33. Eluxadoline (JNJ-27018966, VIBERZI) CHEMBL2159122 FDA Approval: 2015

  34. 1) Get patents for Eluxadoline
      • UniChem call → SCHEMBL12971682
      • Compound URI: http://rdf.ebi.ac.uk/resource/surechembl/molecule/SCHEMBL12971682
      • API call:
      • Relevance score >= 1 → 17 patents (patentome):
