
SLIDE 1

Big Data in Drug Discovery

David J. Wild

Assistant Professor & Director, Cheminformatics Program Indiana University School of Informatics and Computing djwild@indiana.edu - http://djwild.info

SLIDE 2

Big Data in Drug Discovery David Wild, July 2010. http://djwild.info.

Epochs in drug discovery

Empirical – up until 1960’s
754 – First pharmacy opened in Baghdad
Late 1800’s – major pharmaceutical companies, mass production
1900-1960 – major discoveries (insulin, penicillin, the pill …)

Rational – 1960’s to 1990’s
Designing molecules to target protein active sites – “lock and key”
Computational Drug Discovery
Biggest success: HIV (RT, protease inhibitors)

Big Experiment – 1990’s to 2000’s
High throughput screening
Microarray Assays
Gene Sequencing and Human Genome Project

Big Data – 2010’s onwards
Informatics-driven drug discovery
Accepting the body is amazingly complex and we don’t understand it well
Everything is connected

SLIDE 3


The metabolic pathways of a single cell

David Wild, December 2009. Page 3 http://djwild.info

SLIDE 4


The inner life of the cell


http://video.google.com/videoplay?docid=-2351549868099343381&hl=en#

SLIDE 5


Big Data in the public domain

There is now an incredibly rich resource of public information relating compounds, targets, genes, pathways, and diseases. Just for starters, there is information in the public domain on:

69 million compounds and 449,392 bioassays (PubChem)
4,763 drugs (DrugBank)
9 million protein sequences (SwissProt) and 58,000 3D structures (PDB)
14 million human nucleotide sequences (EMBL)
19 million life science publications - 800,000 new each year (PubMed)
Multitude of other sets (drugs, toxicogenomics, chemogenomics, SAR, …)
Even more important are the relationships between these entities. For example, a chemical compound can be linked to a gene or a protein target in a multitude of ways:
Biological assay with percent inhibition, IC50, etc
Crystal structure of ligand/protein complex
Co-occurrence in a paper abstract
Computational experiment (docking, predictive model)
Statistical relationship
System association (e.g. involved in the same pathways or cellular processes)
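
The kinds of link listed above can be pictured as typed records that share one compound-target pair. The following is a minimal sketch under stated assumptions: the class names, field names, and the sample records are all hypothetical and are not taken from any of the databases mentioned.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class Evidence(Enum):
    """Kinds of compound-target evidence listed on this slide."""
    BIOASSAY = "biological assay"
    CRYSTAL_STRUCTURE = "crystal structure of ligand/protein complex"
    CO_OCCURRENCE = "co-occurrence in a paper abstract"
    COMPUTATIONAL = "computational experiment"
    STATISTICAL = "statistical relationship"
    SYSTEM = "system association"


@dataclass(frozen=True)
class Link:
    compound: str        # hypothetical compound identifier
    target: str          # hypothetical gene or protein identifier
    evidence: Evidence   # which kind of relationship this record asserts
    source: str          # free-text provenance, illustrative only


def evidence_by_pair(links):
    """Group the kinds of evidence available for each (compound, target) pair."""
    grouped = defaultdict(set)
    for link in links:
        grouped[(link.compound, link.target)].add(link.evidence)
    return dict(grouped)


# Illustrative records only; these are not real data.
links = [
    Link("compound_A", "target_X", Evidence.BIOASSAY, "assay record"),
    Link("compound_A", "target_X", Evidence.CO_OCCURRENCE, "abstract"),
    Link("compound_B", "target_X", Evidence.COMPUTATIONAL, "docking run"),
]
pairs = evidence_by_pair(links)
```

Grouping by pair makes it easy to see which compound-target relationships are supported by more than one independent kind of evidence.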

SLIDE 6


PubChem growth since 2005


Chart: PubChem Substance Size 2005-2010 (monthly, 2005-01 to 2010-07). Data labels: 2,824,265; 35,379,748; 56,774,950; 69,088,100. Annotation: Addition of ChemSpider.

Chart: PubChem Bioassays 2005-2010 (logarithmic scale, 2005-01 to 2010-07). Data label: 434,635. Annotation: Addition of ChEMBL.

SLIDE 7


Large amount of data and links for each compound


SLIDE 8


Proteins & Genes


http://www.genome.jp/en/db_growth.html

SLIDE 9


Chem2Bio2RDF: The FaceBook of Drug Discovery


SLIDE 10


You are a big pile of data too!

SLIDE 11


Large-scale predictive modeling adds even more data

Range of ROCV values from different classes of BioAssay data set.

Range of ROCV values from three different classes of BioAssay data set for original models and models built with additional inactive compounds (“improved”).

Chen, B. and Wild, D.J. PubChem BioAssays as a data source for predictive models, Journal of Molecular Graphics and Modelling, 2010; 28, 420-426.

SLIDE 12


Informatics-based drug discovery

Predicting new molecular targets for known drugs. Nature 462, 175-181 (12 November 2009)

SLIDE 13


“Systems chemical biology” and chemogenomics

SLIDE 14


Recent enabling technologies for SCB / Chemogenomics

Cloud computing allows processing and data mining on a vast scale
Integrative cheminformatics & bioinformatics connects compounds, targets, genes, pathways, diseases and side effects
Health informatics (PHRs and EHRs) allows integration of the molecular and patient models (QP)
Semantic technologies and complex systems tools allow seamless integration and human-scale data mining

Analysis

Visualization, projection, data mining, hypothesis generation, network tools

Integration

RDF, XML, Triple Stores, Ontologies, SPARQL, Graph algorithms

Access

Web Services, RPC, Information extraction

SLIDE 15


ChemBioGrid.org: Web service infrastructure for cheminformatics

Dong, X., Gilbert, K.E., Guha, R., Heiland, R., Kim, J., Pierce, M.E., Fox, G.C. and Wild, D.J. Web service infrastructure for chemoinformatics, J. Chem. Inf. Model., 2007; 47(4), pp 1303-1307.

SLIDE 16


The Semantic Web – meaning & relationships

SLIDE 17


Chem2Bio2RDF – RDF integration & SPARQL querying

Chen, B., Dong, X., Jiao, D., Wang, H., Zhu, Q., Ding, Y., Wild, D.J. Chem2Bio2RDF: a semantic framework for linking and data mining chemogenomic and systems chemical biology data. BMC Bioinformatics 2010, 11, 255.

SLIDE 18


Chem2Bio2RDF context

SLIDE 19


Chem2Bio2RDF Relationships

SLIDE 20


Linked Open Data Cloud (linkeddata.org)

SLIDE 21


Converting data into RDF

SLIDE 22


Finding multi-target inhibitors of MAPK pathway with a SPARQL query
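
The query shown on this slide is not preserved in the text. As a rough illustration of the idea only, the sketch below does in plain Python what a graph query of this kind would do: it matches two triple patterns (target belongs to pathway, compound inhibits target) and keeps compounds that hit at least two pathway targets. The predicate names, identifiers, and data are hypothetical and do not reflect the actual Chem2Bio2RDF schema or its SPARQL syntax.

```python
# Toy (subject, predicate, object) triples; all identifiers are hypothetical.
TRIPLES = [
    ("target_1", "in_pathway", "MAPK"),
    ("target_2", "in_pathway", "MAPK"),
    ("target_3", "in_pathway", "other_pathway"),
    ("compound_A", "inhibits", "target_1"),
    ("compound_A", "inhibits", "target_2"),
    ("compound_B", "inhibits", "target_1"),
    ("compound_C", "inhibits", "target_2"),
    ("compound_C", "inhibits", "target_3"),
]


def multi_target_inhibitors(triples, pathway, min_targets=2):
    """Return {compound: sorted targets} for compounds that inhibit at least
    `min_targets` distinct targets belonging to `pathway`."""
    # Pattern 1: ?target in_pathway <pathway>
    pathway_targets = {s for s, p, o in triples
                       if p == "in_pathway" and o == pathway}
    # Pattern 2: ?compound inhibits ?target, joined on ?target
    hits = {}
    for s, p, o in triples:
        if p == "inhibits" and o in pathway_targets:
            hits.setdefault(s, set()).add(o)
    # Filter: keep only compounds with enough distinct pathway targets
    return {c: sorted(t) for c, t in hits.items() if len(t) >= min_targets}


result = multi_target_inhibitors(TRIPLES, "MAPK")
```

In a triple store the same join and count would be expressed declaratively; the Python version is only meant to make the join logic visible.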

SLIDE 23


Finding compounds with similar polypharmacology using SPARQL

SLIDE 24


Projecting queries into chemical space

GTM / MDS projection and embedding of all PubChem using clouds
Plotting and embedding unknown compounds with SCB property labels
Dynamic querying and projection into chemical space

SLIDE 25


Projecting queries into chemical space

Choi, J.Y., Bae, S.H., Qiu, J., Fox, G., Chen, B., Wild, D.J. Browsing Large Scale Cheminformatics Data with Dimension Reduction. Emerging Computational Methods for the Life Sciences Workshop, ACM Symposium for High Performance Distributed Computing, Jun 21-25, 2010, Chicago IL.

SLIDE 26


“Doppler Radar Plot” – Kinase Specificity

Choi, J.Y., Bae, S.H., Qiu, J., Fox, G., Chen, B., Wild, D.J. Browsing Large Scale Cheminformatics Data with Dimension Reduction. Emerging Computational Methods for the Life Sciences Workshop, ACM Symposium for High Performance Distributed Computing, Jun 21-25, 2010, Chicago IL.

SLIDE 27


“Doppler Radar Plot” – Kinase Specificity

SLIDE 28


Chem2Bio2RDF Dashboard: finding paths

SLIDE 29


Pathfinder

Figure labels: Dexamethasone, Triamcinolone, NFKB1, Glucocorticoid Receptor

http://ella.slis.indiana.edu/~yuysun/flex/pathfinder.html
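
Path finding between two entities can be sketched as a breadth-first search over a graph of linked nodes. The example below is a minimal illustration, not the Pathfinder implementation: the node names come from the figure labels, but the edges are illustrative assumptions rather than the contents of Chem2Bio2RDF.

```python
from collections import deque

# Undirected toy graph. The edges are assumptions made for illustration.
EDGES = [
    ("Dexamethasone", "Glucocorticoid Receptor"),
    ("Triamcinolone", "Glucocorticoid Receptor"),
    ("Glucocorticoid Receptor", "NFKB1"),
]


def build_adjacency(edges):
    """Build a node -> sorted neighbour list mapping for an undirected graph."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return {node: sorted(neighbours) for node, neighbours in adjacency.items()}


def shortest_path(adjacency, start, goal):
    """Breadth-first search; returns the node list of a shortest path, or None."""
    if start == goal:
        return [start]
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency.get(node, ()):
            if neighbour in parent:
                continue  # already visited
            parent[neighbour] = node
            if neighbour == goal:
                # Walk back through the parents to rebuild the path.
                path = [neighbour]
                while parent[path[-1]] is not None:
                    path.append(parent[path[-1]])
                return path[::-1]
            queue.append(neighbour)
    return None


adjacency = build_adjacency(EDGES)
path = shortest_path(adjacency, "Dexamethasone", "Triamcinolone")
```

Breadth-first search returns one shortest path; enumerating all paths up to a length limit, as a path-browsing tool would likely need, would require a different traversal.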

SLIDE 30


Dynamic exploration with clouds and Cytoscape

Virtuoso runs Chem2Bio2RDF queries on the cloud
Cytoscape plugins give access to Chem2Bio2RDF, LPG and chemical structure visualization
Dynamic exploration in Cytoscape

SLIDE 31


Hydrocortisone – Dexamethasone links

Fig. Use Case 1. Network diagram of the paths obtained between Hydrocortisone and Dexamethasone using ChemBioScape. The DrugBank interaction data contains information about every drug’s target. In this case, DB00741 and DB01234 share common targets through several different DrugBank interaction IDs.

SLIDE 32


Tolcapone and Entacapone links

Fig. Use Case 2. Tolcapone and Entacapone are connected to each other through DrugBank interactions 2348 and 1962. The two drugs also appear in PubMed articles 8119326 and 8223912 via their CID (Compound ID).

SLIDE 33


Isoniazid and Ethionamide – replicate paper results

Banerjee, A., Dubnau, E., Quemard, A., Balasubramanian, V., Um, K., Wilson, T., et al.: inhA, a gene encoding a target for isoniazid and ethionamide in Mycobacterium tuberculosis. Science, 263(5144), 227-230 (1994).

SLIDE 34


WENDI – exploring compound knowledge space

Doxorubicin (anthracycline antibiotic)

SLIDE 35


WENDI v1.0 - insights from the literature

SLIDE 36


WENDI v2.0 - Automated reasoning with RDF

Simple OWL ontology for relationships
Large RDF network expands out from Query
RDF inference engines applied & results filtered / prioritized

Figure node labels: QUERY, CID 86427, CID 8642, AID 328, PubMed 12856, HER2, Breast Cancer

Figure edge labels: similar_to, active_against, contains_term, contained_in, predicted_ inactive_against
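
The three steps above (a network expanding from the query, rules applied, results prioritized) can be illustrated with a toy example. This is a plain-Python stand-in for an inference engine, not OWL reasoning and not the WENDI implementation: the rule set, the rule names, and every identifier in the triples are hypothetical, and only the predicate names are borrowed from the figure.

```python
# Toy network expanding out from a query compound; identifiers are hypothetical.
TRIPLES = {
    ("query", "similar_to", "compound_1"),
    ("compound_1", "active_against", "assay_1"),
    ("assay_1", "contains_term", "disease_term"),
    ("compound_1", "active_against", "assay_2"),
    ("assay_2", "contains_term", "other_term"),
    ("query", "similar_to", "compound_2"),
    ("compound_2", "contained_in", "paper_1"),
    ("paper_1", "contains_term", "disease_term"),
}

# Hypothetical rules: a chain of predicates that, when matched from the
# query, suggests an association between the query and the final node.
RULES = {
    "via similar active compound": ["similar_to", "active_against", "contains_term"],
    "via similar compound in literature": ["similar_to", "contained_in", "contains_term"],
}


def follow(triples, start, predicates):
    """Return the nodes reached from `start` by following `predicates` in order."""
    frontier = {start}
    for predicate in predicates:
        frontier = {o for s, p, o in triples if s in frontier and p == predicate}
    return frontier


def infer_associations(triples, query, rules):
    """Map each inferred node to the list of rule names that support it."""
    inferred = {}
    for name, chain in rules.items():
        for node in follow(triples, query, chain):
            inferred.setdefault(node, []).append(name)
    return inferred


inferred = infer_associations(TRIPLES, "query", RULES)
# Prioritize: associations supported by more rules come first.
ranked = sorted(inferred.items(), key=lambda item: -len(item[1]))
```

Counting supporting rules is only one possible way to prioritize; the slide does not say how results are filtered or ranked.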

SLIDE 37


WENDI OWL/RDF Network & Inferred Associations

SLIDE 38


Semantic text mining of journal articles

Jiao, D. and Wild, D.J. Extraction of CYP Chemical Interactions from Biomedical Literature Using Natural Language Processing Methods, Journal of Chemical Information and Modeling, 2009; 49(2), pp 263-269.

SLIDE 39


Chemical & Biological Literature Extraction

SLIDE 40


Validating topics by experimental relationships

Topic 26: cell, expression, cancer, tumor, …
Related Disease: DNA Damage, Melanoma, Glioblastoma, …
Figure legend: Un-proved link; link proved by c2b2r_chemogenomics; Target; Drug

SLIDE 41


Bio-LDA III

Entropy
In information theory, entropy is a measure of the uncertainty associated with a random variable.
Here we can compute the bio-term entropies over topics

Kullback-Leibler divergence (KL divergence)
A non-symmetric measure of the difference between two probability distributions.
Here we used the KL divergence as the non-symmetric distance measure for two bio-terms over topics
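
Both quantities are straightforward to compute once each bio-term is represented as a probability distribution over topics. The sketch below uses the standard definitions, H(p) = -Σ p_i log2 p_i and D(p || q) = Σ p_i log2(p_i / q_i); the two topic distributions are made-up values for illustration, and the function names are not taken from the Bio-LDA code.

```python
import math


def entropy(p):
    """Shannon entropy (in bits) of a discrete distribution p."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def kl_divergence(p, q):
    """KL divergence D(p || q) in bits.

    Returns infinity if q assigns zero probability where p does not.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i > 0:
            if q_i <= 0:
                return math.inf
            total += p_i * math.log2(p_i / q_i)
    return total


# Hypothetical topic distributions for two bio-terms over four topics.
term_a = [0.7, 0.1, 0.1, 0.1]      # concentrated on one topic: lower entropy
term_b = [0.25, 0.25, 0.25, 0.25]  # spread evenly: maximum entropy for 4 topics

h_a = entropy(term_a)
h_b = entropy(term_b)
d_ab = kl_divergence(term_a, term_b)
d_ba = kl_divergence(term_b, term_a)
```

Comparing `d_ab` with `d_ba` shows the asymmetry mentioned above: the divergence from one term to the other generally differs from the divergence in the opposite direction.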

SLIDE 42


Combining path finding and Bio-LDA

Detect semantic association
Path finding algorithm
millions of RDF triples from Chem2bio2rdf
Assess semantic association
Bio-LDA model
Entropy and KL divergence
Additional knowledge base: 50, 100 and 200 topics using the recent 336,899 MEDLINE abstracts, which contain 13,338 distinct bio-terms

SLIDE 43


Summary

Drug discovery is entering a new era that is arguably centered on informatics analysis of the vast amount of biological and chemical data now being produced, and which looks at the effect of drugs on biological systems as a whole. This new approach underlies the new fields of systems chemical biology and chemogenomics.

Analyzing this data, and particularly the relationships between compounds, drugs, proteins, genes, diseases, pathways and people, promises to provide important understanding of the nature of disease and treatment.

The Semantic Web provides an effective framework for logically managing the data, and Cloud Computing provides a physical framework for computation and searching.

Early-stage methods developed at Indiana allow integrated access to this data, path finding between any two points, visualization in chemical space and network tools, and advanced handling of the scholarly literature.

Critical next steps include ranking and intelligent filtering of paths and relationships to provide aggregate evidence-based approaches, and integration of NGS and patient data.

SLIDE 44


Cheminformatics group at Indiana University

http://djwild.info