Leveraging Open Chemogenomics Data and Tools with KNIME George - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Leveraging Open Chemogenomics Data and Tools with KNIME

George Papadatos

ChEMBL Group georgep@ebi.ac.uk

slide-2
SLIDE 2

What is EMBL-EBI?

  • Europe’s home for biological data, services, research and training
  • A trusted data provider for the life sciences
  • Part of the European Molecular Biology Laboratory, an intergovernmental research organisation

  • International: 570 members of staff from 57 nations
slide-3
SLIDE 3

Data resources at EMBL-EBI

Resource categories:

  • Genes, genomes & variation
  • Gene, protein & metabolite expression
  • Protein sequences, families & motifs
  • Molecular structures
  • Chemical biology
  • Reactions, interactions & pathways
  • Systems
  • Literature & ontologies

Resources shown: RNA Central, ArrayExpress, Expression Atlas, MetaboLights, PRIDE, InterPro, Pfam, UniProt, ChEMBL, SureChEMBL, ChEBI, Protein Data Bank in Europe, Electron Microscopy Data Bank, European Nucleotide Archive, European Variation Archive, IntAct, Reactome, BioModels, Enzyme Portal, BioSamples, Ensembl, Ensembl Genomes, GWAS Catalog, Metagenomics portal, Europe PubMed Central, BioStudies, Gene Ontology, Experimental Factor Ontology
slide-4
SLIDE 4

Bioactivity data

Compound

Assay/Target

>Thrombin MAHVRGLQLPGCLALAALCSLVHSQHVFLAPQQARSLLQRVRRANTFLEEVRKGNLE RECVEETCSYEEAFEALESSTATDVFWAKYTACETARTPRDKLAACLEGNCAEGLGT NYRGHVNITRSGIECQLWRSRYPHKPEINSTTHPGADLQENFCRNPDSSTTGPWCYT TDPTVRRQECSIPVCGQDQVTVAMTPRSEGSSVNLSPPLEQCVPDRGQQYQGRLAVT THGLPCLAWASAQAKALSKHQDFNSAVQLVENFCRNPDGDEEGVWCYVAGKPGDFGY CDLNYCEEAVEEETGDGLDEDSDRAIEGRTATSEYQTFFNPRTFGSGEADCGLRPLF EKKSLEDKTERELLESYIDGRIVEGSDAEIGMSPWQVMLFRKSPQELLCGASLISDR WVLTAAHCLLYPPWDKNFTENDLLVRIGKHSRTRYERNIEKISMLEKIYIHPRYNWR ENLDRDIALMKLKKPVAFSDYIHPVCLPDRETAASLLQAGYKGRVTGWGNLKETWTA NVGKGQPSVLQVVNLPIVERPVCKDSTRIRITDNMFCAGYKPDEGKRGDACEGDSGG PFVMKSPFNNRWYQMGIVSWGEGCDRDGKYGFYTHVFRLKKWIQKVIDQFGE

  • 3. Insight, tools and resources for translational drug discovery
  • 2. Organization, integration, curation and standardization of pharmacology data
  • 1. Scientific facts

Ki = 4.5 nM; APTT = 11 min

ChEMBL: Data for drug discovery

slide-5
SLIDE 5

KNIME at the EBI

  • KNIME nodes to access ChEBI and ChEMBL databases
  • Trusted community nodes
  • Workflows on Examples server
  • Method development and use cases
  • Provide KNIME training to scientists and researchers
  • Wellcome Trust drug discovery courses, EMBL courses
  • CDK community nodes support

https://tech.knime.org/book/embl-ebi-nodes-for-knime-trusted-extension

slide-6
SLIDE 6

KNIME and ChEMBL

  • ChEMBL Web Services: 14M bioactivities, 1.5M structures
  • Virtual Machine: local access to ChEMBL data and services
  • UniChem Web Services: access ~110M structures from 27 sources

knime://EXAMPLES/099_Community/08_ChEMBL_WebServices

slide-7
SLIDE 7

KNIME and ChEMBL

  • ChEMBL Web Services: 14M bioactivities, 1.5M structures
  • Virtual Machine: local access to ChEMBL data and services
  • UniChem Web Services: access ~110M structures from 27 sources
  • Patent Annotations: 4M patent documents, 14M structures, 260M annotations

slide-8
SLIDE 8

Why look at patent documents?

  • Patent filing and searching
  • Legal, financial and commercial incentives & interests
  • Prior art, novelty, freedom to operate searches
  • Competitive intelligence
  • Unprecedented wealth of knowledge
  • Most of the knowledge will never be disclosed anywhere else
  • Compounds, scaffolds, reactions
  • Biological targets, diseases, indications
  • Average lag of 2-4 years between patent document and journal publication disclosure for chemistry, 4-5 years for biological targets

slide-9
SLIDE 9

SureChEMBL data processing

  • Patent offices: WO, EP (applications & granted); US (applications & granted); JP (abstracts)
  • SureChem IP: OCR, entity recognition, name to structure (five methods), image to structure (one method), attachments
  • Example name: 1-[4-ethoxy-3-(6,7-dihydro-1-methyl-7-oxo-3-propyl-1H-pyrazolo[4,3-d]pyrimidin-5-yl)phenylsulfonyl]-4-methylpiperazine
  • SureChEMBL system: chemistry database, database, API, application, users
  • Services: patent PDFs, processed patents
  • www.surechembl.org

slide-10
SLIDE 10

SureChEMBL data processing v2

  • Patent offices: WO, EP (applications & granted); US (applications & granted); JP (abstracts)
  • SureChem IP: OCR, entity recognition, name to structure (five methods), image to structure (one method)
  • Bio-entity recognition
  • SureChEMBL system: chemistry database, database, application server, users
  • Services: patent PDFs, processed patents
  • www.surechembl.org

slide-11
SLIDE 11

SureChEMBL bioannotation

  • SciBite’s Termite text-mining engine run on 4M life-science patents from SureChEMBL corpus
  • Genes (identified by HGNC symbols) and diseases (identified by MeSH IDs) annotated
  • Section/frequency information annotated (e.g., in title, abstract, claims, total frequency)
  • Relevance score (0-3) to flag important chemical and biological entities and remove noise

slide-12
SLIDE 12

Relevance scoring – genes/diseases

  • Various features used:
  • Term frequency
  • Position (title, abstract, figure, caption, table)
  • Frequency distribution
  • Scores range from 0 – 3
  • 3 – most important entities in the patent
  • 2 – important entities in the patent
  • 1 – mentioned entities in the patent
  • 0 – ambiguous entity/likely annotation error
slide-13
SLIDE 13

Relevance scoring - compounds

  • Main assumptions for relevance:
  • 1. Very frequent compounds are irrelevant (but if drug-like then that’s OK)
  • 2. Compounds with busy chemical space around them are interesting
  • Use distribution of close analogues (NNs) among compounds found in the same patent family
  • Scores range from 0 – 3
  • 3 – highest number of NNs: most important entities in the patent
  • 2 – important entities in the patent
  • 1 – few NNs: mentioned entities in the patent
  • 0 – singletons or trivial entities, most likely errors or reagents, solvents, substituents

Hattori, K., Wakabayashi, H., & Tamaki, K. (2008). JCIM, 48(1), 135–142. doi:10.1021/ci7002686
Tyrchan, C., Boström, J., Giordanetto, F., Winter, J., & Muresan, S. (2012). JCIM, 52(6), 1480–1489. doi:10.1021/ci3001293
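The nearest-neighbour idea on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the SureChEMBL implementation: fingerprints are plain sets of on-bits, the 0.7 Tanimoto threshold and the score cut-offs are invented for the example, and the function names are hypothetical. It covers only the second assumption (busy chemical space), not the frequency or drug-likeness check.

```python
from typing import Dict, FrozenSet

Fingerprint = FrozenSet[int]


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto similarity between two fingerprints stored as sets of on-bits."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def neighbour_counts(fps: Dict[str, Fingerprint], threshold: float = 0.7) -> Dict[str, int]:
    """Count, for each compound, the other compounds at or above the threshold."""
    counts = {name: 0 for name in fps}
    names = list(fps)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if tanimoto(fps[first], fps[second]) >= threshold:
                counts[first] += 1
                counts[second] += 1
    return counts


def relevance_scores(counts: Dict[str, int]) -> Dict[str, int]:
    """Map neighbour counts to 0-3 scores (cut-offs are illustrative assumptions)."""
    top = max(counts.values(), default=0)
    scores = {}
    for name, n in counts.items():
        if n == 0:
            scores[name] = 0   # singleton: likely reagent, solvent or error
        elif n >= 0.75 * top:
            scores[name] = 3   # densest neighbourhood in the family
        elif n >= 0.4 * top:
            scores[name] = 2
        else:
            scores[name] = 1   # few neighbours
    return scores


# Toy patent family: three close analogues, one looser analogue, one unrelated entity.
family = {
    "cpd_a": frozenset({1, 2, 3, 4, 5, 6, 7, 8}),
    "cpd_b": frozenset({1, 2, 3, 4, 5, 6, 7, 9}),
    "cpd_c": frozenset({1, 2, 3, 4, 5, 6, 8, 9}),
    "cpd_d": frozenset({1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12}),
    "solvent": frozenset({40, 41}),
}
counts = neighbour_counts(family)
scores = relevance_scores(counts)
print(scores)
```

In practice the fingerprints would come from a cheminformatics toolkit, and the cut-offs would be tuned against the distribution of neighbour counts within each patent family.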

slide-14
SLIDE 14

Gotchas & out of scope

  • No Markush extraction
  • No natural language processing (e.g., ‘compound x is an inhibitor of target y’)

  • No extraction of bioactivities
  • No chemistry search (yet)
  • Patent coverage stops in April 2015
  • Incremental updates TBD
  • Patent calls still in dev
  • Old scripts / workflows may break
slide-15
SLIDE 15

Open PHACTS Architecture

Drug Discovery Today 2012, 17:21 (doi:10.1016/j.drudis.2012.05.016)

slide-16
SLIDE 16

The Open PHACTS node

executable API call to KREST nodes

slide-17
SLIDE 17

Open PHACTS Patent API

https://dev.openphacts.org/docs/develop

slide-18
SLIDE 18

Open PHACTS Patent API

Disease Compound Target Patent

extracted links

slide-19
SLIDE 19

Open PHACTS Patent API

Disease Compound Target Patent

inferred links extracted links

slide-20
SLIDE 20

Use case #1: Patent to Entities

  • 1. From a patent get compounds, genes and diseases
  • 2. Filter to remove noise
  • Frequency and relevance score
  • 3. Process and visualise

Disease Compound Target Patent

? ? ?

slide-21
SLIDE 21

US-7718693-B2

slide-22
SLIDE 22

Use case #1: Patent to Entities

  • Patent URI:
  • http://rdf.ebi.ac.uk/resource/surechembl/patent/US-7718693-B2
  • API call:
  • 586 entities back
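As a hedged sketch of how such a request might be assembled outside KNIME: the patent URI pattern below is taken from the slide, while the endpoint path and the `uri`, `app_id` and `app_key` parameter names are assumptions for illustration only. The actual call is documented at the Open PHACTS API site.

```python
from urllib.parse import urlencode

SURECHEMBL_RDF = "http://rdf.ebi.ac.uk/resource/surechembl"


def patent_uri(patent_id: str) -> str:
    """SureChEMBL RDF URI for a patent publication number such as US-7718693-B2."""
    return f"{SURECHEMBL_RDF}/patent/{patent_id}"


def build_request_url(endpoint: str, uri: str, app_id: str, app_key: str) -> str:
    """Percent-encode the resource URI and credentials into a GET query string."""
    query = urlencode({"uri": uri, "app_id": app_id, "app_key": app_key})
    return f"{endpoint}?{query}"


# Hypothetical endpoint and credentials: replace with the documented values.
ENDPOINT = "https://api.example.org/patent/entities"
url = build_request_url(ENDPOINT, patent_uri("US-7718693-B2"), "MY_APP_ID", "MY_APP_KEY")
print(url)
```

The important detail is that the patent URI is itself a URL, so it has to be percent-encoded before it is passed as a query parameter.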
slide-24
SLIDE 24

1) Look at target and disease entities

  • 178 target and disease entities
  • Filter: Relevance score >= 2 → 23 remain
  • Visualise in tag cloud by frequency
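The filter-then-weight step above might look like this in Python. It is a minimal sketch with made-up records; the field names (`type`, `relevance`, `frequency`) are illustrative and not the API's response schema.

```python
from collections import Counter
from typing import Dict, List, Tuple


def tag_cloud_weights(records: List[Dict], min_relevance: int = 2) -> List[Tuple[str, int]]:
    """Sum frequencies of target and disease entities that pass the relevance cut-off."""
    weights: Counter = Counter()
    for record in records:
        is_bio_entity = record["type"] in ("target", "disease")
        if is_bio_entity and record["relevance"] >= min_relevance:
            weights[record["name"]] += record["frequency"]
    return weights.most_common()  # largest weight first


# Made-up annotation records for illustration only.
records = [
    {"name": "GENE_A", "type": "target", "relevance": 3, "frequency": 40},
    {"name": "DISEASE_X", "type": "disease", "relevance": 2, "frequency": 12},
    {"name": "GENE_B", "type": "target", "relevance": 1, "frequency": 3},
    {"name": "DISEASE_Y", "type": "disease", "relevance": 0, "frequency": 1},
    {"name": "CPD_1", "type": "compound", "relevance": 3, "frequency": 9},
]
weights = tag_cloud_weights(records)
print(weights)
```

The resulting (name, weight) pairs are the kind of input a tag cloud view expects, with the weight controlling font size.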
slide-26
SLIDE 26

Does it make sense?

slide-27
SLIDE 27

Does it make sense?

US-7718693-B2

slide-28
SLIDE 28

2) Look at compound entities

  • 408 compound entities
  • Filter: Relevance score >= 1 → 201 remain
  • Calculate properties
slide-30
SLIDE 30

Does it make sense?

  • Calculate MCS
slide-31
SLIDE 31

Does it make sense?

  • Calculate MCS

US-7718693-B2

slide-32
SLIDE 32

Use case #2: Drug targets & indications for compound

  • 1. Search patents for a compound (approved drug)
  • 2. Filter to remove noise
  • Frequency, relevance score and classification code
  • 3. For remaining patents, get disease and target entities
  • 4. Filter to remove noise
  • Frequency and relevance score
  • 5. Visualise results

Disease Compound Target Patent

slide-33
SLIDE 33

Eluxadoline (JNJ-27018966, VIBERZI)

CHEMBL2159122 FDA Approval: 2015

slide-34
SLIDE 34

1) Get patents for Eluxadoline

  • UniChem call → SCHEMBL12971682
  • Compound URI:
  • http://rdf.ebi.ac.uk/resource/surechembl/molecule/SCHEMBL12971682
  • API call:
  • Relevance score >= 1 → 17 patents (patentome):

slide-36
SLIDE 36

2) Get target and disease entities for patents

  • API call:
  • Classification codes for patents
  • Filter: A61 and C07* → 11 patents remaining
  • API call:
  • Filter: Relevance score >= 2, Frequency >= 2 → 2 targets, 21 diseases
  • Visualise with tag clouds by frequency

* http://web2.wipo.int/classifications/ipc/ipcpub/
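The two filters on this slide can be sketched as follows. The assumptions are stated loudly: a patent is kept if at least one of its IPC codes starts with A61 or C07 (the slide does not say whether both are required), and the record layout and identifiers are invented for the example.

```python
from typing import Dict, Iterable, List, Tuple

IPC_PREFIXES: Tuple[str, ...] = ("A61", "C07")


def has_relevant_ipc(codes: Iterable[str], prefixes: Tuple[str, ...] = IPC_PREFIXES) -> bool:
    """True if any classification code starts with one of the prefixes."""
    return any(code.strip().upper().startswith(prefixes) for code in codes)


def keep_entities(entities: List[Dict], min_relevance: int = 2, min_frequency: int = 2) -> List[Dict]:
    """Keep annotations that pass both the relevance and the frequency cut-off."""
    return [
        e for e in entities
        if e["relevance"] >= min_relevance and e["frequency"] >= min_frequency
    ]


# Toy data with made-up identifiers; real codes and annotations come from the API.
patent_codes = {
    "PATENT_1": ["A61K 31/00", "C07D 487/04"],
    "PATENT_2": ["H01L 21/00"],
    "PATENT_3": ["C07D 401/12"],
}
kept_patents = [pid for pid, codes in patent_codes.items() if has_relevant_ipc(codes)]

entities = [
    {"name": "TARGET_A", "relevance": 3, "frequency": 14},
    {"name": "DISEASE_X", "relevance": 2, "frequency": 1},
    {"name": "DISEASE_Y", "relevance": 1, "frequency": 9},
]
kept_entities = keep_entities(entities)
print(kept_patents, [e["name"] for e in kept_entities])
```

Filtering on classification first keeps the second API call small, since entities are only requested for patents that are plausibly pharmaceutical or organic chemistry.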

slide-37
SLIDE 37

Results

Relevant targets: Relevant diseases:

slide-38
SLIDE 38

Does it make sense?

http://www.accessdata.fda.gov/drugsatfda_docs/label/2015/206940s000lbl.pdf

slide-40
SLIDE 40

Take home message

  • It is now possible to extract and interlink the key structures, scaffolds, targets and diseases from the med. chem. patent corpus automatically
  • By high-throughput text-mining only
  • Thanks to simple heuristics (relevance scores and frequency)
  • Using the Open PHACTS API and KNIME
  • For the first time in a free resource at such scale

Disease Compound Target Patent

slide-41
SLIDE 41

What next: Other ideas and use cases

  • Target validation / Druggability
  • For a target, get me all related and relevant diseases
  • Compare with DisGeNET / CTTV, etc.
  • Any known patented scaffolds for my target?
  • Start a pharmacophore hypothesis for patent busting
  • Novelty checking / Due diligence
  • What do we know about this scaffold / compound?
  • Add ChEMBL pharmacology, pathway information
  • Large-scale data mining
  • Annotated patent chemogenomics space and predictive models in KNIME

  • Anyone?
slide-42
SLIDE 42

Availability

  • API calls will be released to production soon – available for testing now:

  • https://dev.openphacts.org/docs/develop
  • KNIME workflows available on request
  • SureChEMBL annotations licensed under CC BY-SA
  • SciBite annotations licensed under CC BY-NC-SA
  • Check out the Open PHACTS workshop on Friday 9am
slide-43
SLIDE 43
slide-44
SLIDE 44
slide-45
SLIDE 45
slide-46
SLIDE 46

Acknowledgements

  • ChEMBL and SureChEMBL
  • Anna Gaulton
  • Mark Davies
  • Nathan Dedman
  • James Siddle
  • Anne Hersey
  • SciBite
  • Lee Harland
  • Open PHACTS consortium
  • Nick Lynch
  • Daniela Digles
  • Antonis Loizou
  • EBI alumni
  • Edmund Duesbury
  • Stephan Beisken
slide-47
SLIDE 47

Technology partners

slide-48
SLIDE 48

The Data Are Out There

slide-49
SLIDE 49

Leveraging Open Chemogenomics Data and Tools with KNIME

George Papadatos

ChEMBL Group georgep@ebi.ac.uk

slide-50
SLIDE 50

Back-up slides

slide-51
SLIDE 51

Example: All bioactivities for hERG

All bioactivities for hERG: activity value, assay description, compound, reference
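As a rough illustration of the same lookup outside KNIME, the sketch below builds a request against the public ChEMBL REST API. It assumes the current `activity.json` resource with a `target_chembl_id` filter and that hERG is target CHEMBL240; the web service version behind the KNIME nodes in these slides may differ, and the helper names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

CHEMBL_API = "https://www.ebi.ac.uk/chembl/api/data"


def activity_url(target_chembl_id: str, limit: int = 20, offset: int = 0) -> str:
    """URL for one page of activity records measured against a target."""
    query = urlencode({"target_chembl_id": target_chembl_id, "limit": limit, "offset": offset})
    return f"{CHEMBL_API}/activity.json?{query}"


def fetch_activities(target_chembl_id: str, limit: int = 20) -> list:
    """Download one page and keep the columns named on the slide."""
    with urlopen(activity_url(target_chembl_id, limit)) as response:
        payload = json.load(response)
    return [
        {
            "compound": row.get("molecule_chembl_id"),
            "type": row.get("standard_type"),
            "value": row.get("standard_value"),
            "units": row.get("standard_units"),
            "assay": row.get("assay_description"),
            "reference": row.get("document_chembl_id"),
        }
        for row in payload.get("activities", [])
    ]


# Network access is needed for the call below, so it is left commented out.
# rows = fetch_activities("CHEMBL240", limit=5)
print(activity_url("CHEMBL240", limit=5))
```

Retrieving all bioactivities would mean looping over pages by increasing `offset` until a page comes back empty.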

slide-52
SLIDE 52

Example: Compound searching in ChEMBL

Query List of NNs

slide-53
SLIDE 53

Example: Polypharmacology profile

Compounds

Query

Find NNs → Retrieve bioactivities → Filter, summarise & pivot
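The last step of this workflow, filter, summarise and pivot, can be sketched in plain Python. The assumptions here are that each retrieved bioactivity is reduced to a (compound, target, pChEMBL) triple, that a pChEMBL cut-off of 5 defines activity, and that the median is the summary statistic; none of these choices is stated on the slide.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, Iterable, Optional, Tuple

Row = Tuple[str, str, Optional[float]]


def pivot_profile(rows: Iterable[Row], min_pchembl: float = 5.0) -> Dict[str, Dict[str, float]]:
    """Return {compound: {target: median pChEMBL}} for values at or above the cut-off."""
    grouped = defaultdict(list)
    for compound, target, pchembl in rows:
        # Skip records without a standardised value or below the activity cut-off.
        if pchembl is not None and pchembl >= min_pchembl:
            grouped[(compound, target)].append(pchembl)
    profile: Dict[str, Dict[str, float]] = {}
    for (compound, target), values in grouped.items():
        profile.setdefault(compound, {})[target] = median(values)
    return profile


# Made-up bioactivities for two nearest neighbours of a query compound.
rows = [
    ("NN_1", "TARGET_A", 6.0),
    ("NN_1", "TARGET_A", 7.0),
    ("NN_1", "TARGET_B", 8.2),
    ("NN_2", "TARGET_A", 4.1),
    ("NN_2", "TARGET_B", None),
    ("NN_2", "TARGET_C", 5.5),
]
profile = pivot_profile(rows)
print(profile)
```

Each inner dictionary is one row of the compound-by-target matrix; missing keys correspond to empty cells in the pivot table.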

slide-54
SLIDE 54
slide-55
SLIDE 55
slide-56
SLIDE 56

Accessing local data and services with myChEMBL

slide-57
SLIDE 57

Using KNIME to connect to myChEMBL

SELECT mr.*, md.chembl_id, cp.full_mwt, cp.alogp
FROM mols_rdkit mr, molecule_dictionary md, compound_properties cp
WHERE mr.m @> '$${SMolecule}$$'::qmol
  AND mr.molregno = md.molregno
  AND md.molregno = cp.molregno;
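Outside KNIME, where the `$${SMolecule}$$` flow variable is not available, the same substructure query could be parameterised from Python. This is a sketch under assumptions: it targets a DB-API driver with `%s` placeholders such as psycopg2, the connection details are placeholders, and the helper name is hypothetical.

```python
from typing import Tuple

# Same tables and columns as the query above, written with explicit joins.
SUBSTRUCTURE_SQL = """
SELECT mr.*, md.chembl_id, cp.full_mwt, cp.alogp
FROM mols_rdkit mr
JOIN molecule_dictionary md ON mr.molregno = md.molregno
JOIN compound_properties cp ON md.molregno = cp.molregno
WHERE mr.m @> %s::qmol
"""


def substructure_query(smarts: str) -> Tuple[str, Tuple[str, ...]]:
    """Return the SQL text and its parameters; the driver handles quoting."""
    if not smarts.strip():
        raise ValueError("query pattern must not be empty")
    return SUBSTRUCTURE_SQL, (smarts.strip(),)


sql, params = substructure_query("c1ccccc1C(=O)N")
print(params)

# Usage with psycopg2 against a local myChEMBL database (placeholders throughout):
# import psycopg2
# with psycopg2.connect(dbname="chembl", user="USER", host="localhost") as conn:
#     with conn.cursor() as cur:
#         cur.execute(sql, params)
#         hits = cur.fetchall()
```

Passing the pattern as a parameter rather than formatting it into the string avoids quoting problems with characters that are common in SMARTS.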

slide-58
SLIDE 58

Cheminformatics utilities

  • Chemical format conversions
  • Dynamic image generation
  • Image processing (via OSRA)
  • Descriptors and property calculations
  • Chemical modifications and standardization

https://www.ebi.ac.uk/chembl/api/utils/docs

slide-59
SLIDE 59

Example: RESTful Image to Structure conversion

image URL

slide-60
SLIDE 60

UniChem – Compound Mapping across Resources

UniChem

slide-61
SLIDE 61

Novelty checking with UniChem

https://www.ebi.ac.uk/unichem/