SLIDE 1

Big Data in Drug Discovery

David J. Wild

Assistant Professor & Director, Cheminformatics Program Indiana University School of Informatics and Computing djwild@indiana.edu - http://djwild.info

SLIDE 2

Big Data in Drug Discovery David Wild, July 2010. http://djwild.info.

Epochs in drug discovery

Empirical (up until the 1960s)
  • 754: first pharmacy opened in Baghdad
  • Late 1800s: major pharmaceutical companies, mass production
  • 1900-1960: major discoveries (insulin, penicillin, the pill …)

Rational (1960s to 1990s)
  • Designing molecules to target protein active sites ("lock and key")
  • Computational drug discovery
  • Biggest success: HIV (RT, protease inhibitors)

Big Experiment (1990s to 2000s)
  • High-throughput screening
  • Microarray assays
  • Gene sequencing and the Human Genome Project

Big Data (2010s onwards)
  • Informatics-driven drug discovery
  • Accepting that the body is amazingly complex and we don't understand it well
  • Everything is connected

SLIDE 3

The metabolic pathways of a single cell

David Wild, December 2009. Page 3 http://djwild.info

SLIDE 4

The inner life of the cell

http://video.google.com/videoplay?docid=-2351549868099343381&hl=en#

SLIDE 5

Big Data in the public domain

  • There is now an incredibly rich resource of public information relating compounds, targets, genes, pathways, and diseases. Just for starters, the public domain holds information on:
      • 69 million compounds and 449,392 bioassays (PubChem)
      • 4,763 drugs (DrugBank)
      • 9 million protein sequences (SwissProt) and 58,000 3D structures (PDB)
      • 14 million human nucleotide sequences (EMBL)
      • 19 million life science publications, with 800,000 new each year (PubMed)
      • A multitude of other sets (drugs, toxicogenomics, chemogenomics, SAR, …)
  • Even more important are the relationships between these entities. For example, a chemical compound can be linked to a gene or a protein target in a multitude of ways:
      • Biological assay with percent inhibition, IC50, etc.
      • Crystal structure of ligand/protein complex
      • Co-occurrence in a paper abstract
      • Computational experiment (docking, predictive model)
      • Statistical relationship
      • System association (e.g. involved in the same pathways or cellular processes)

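The relationship types listed above could be recorded as typed links between a compound and a target. The following is only a minimal Python sketch: the `Link` record, the identifiers and the evidence strings are all hypothetical placeholders, not data or a schema from the talk.

```python
from collections import defaultdict
from typing import NamedTuple


class Link(NamedTuple):
    """One typed relationship between a compound and a gene or protein target."""
    compound: str
    target: str
    kind: str      # e.g. "bioassay", "crystal_structure", "co_occurrence"
    evidence: str  # free-text pointer to the supporting record


# Hypothetical example records; identifiers and values are placeholders.
LINKS = [
    Link("compound:A", "target:T1", "bioassay", "IC50 reported in an assay record"),
    Link("compound:A", "target:T1", "co_occurrence", "named together in an abstract"),
    Link("compound:A", "target:T2", "docking", "predicted by a docking run"),
    Link("compound:B", "target:T1", "crystal_structure", "ligand/protein complex"),
]


def evidence_by_pair(links):
    """Group the kinds of relationship found for each (compound, target) pair."""
    grouped = defaultdict(set)
    for link in links:
        grouped[(link.compound, link.target)].add(link.kind)
    return dict(grouped)


pairs = evidence_by_pair(LINKS)
```

Grouping by pair makes it easy to see which compound-target links are supported by more than one independent kind of evidence.
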
SLIDE 6

PubChem growth since 2005

[Figure: PubChem Substance Size 2005-2010. Time axis from 2005-01 to 2010-07, linear count axis up to 80,000,000; labeled values 2,824,265, 35,379,748, 56,774,950 and 69,088,100; annotation: "Addition of ChemSpider".]

[Figure: PubChem Bioassays 2005-2010. Time axis from 2005-01 to 2010-07, logarithmic count axis from 1 to 1,000,000; labeled value 434,635; annotation: "Addition of ChEMBL".]

SLIDE 7

Large amount of data and links for each compound

SLIDE 8

Proteins & Genes

http://www.genome.jp/en/db_growth.html

SLIDE 9

Chem2Bio2RDF: The FaceBook of Drug Discovery

SLIDE 10

You are a big pile of data too!

SLIDE 11

Large-scale predictive modeling adds even more data

Range of ROCV values from different classes of BioAssay data set.

Range of ROCV values from three different classes of BioAssay data set for original models and models built with additional inactive compounds (“improved”).

Chen, B. and Wild, D.J. PubChem BioAssays as a data source for predictive models, Journal of Molecular Graphics and Modelling, 2010; 28, 420-426.

SLIDE 12

Informatics-based drug discovery

Predicting new molecular targets for known drugs. Nature 462, 175-181 (12 November 2009).

SLIDE 13

“Systems chemical biology” and chemogenomics

SLIDE 14

Recent enabling technologies for SCB / Chemogenomics

  • Cloud computing allows processing and data mining on a vast scale
  • Integrative cheminformatics & bioinformatics connects compounds, targets, genes, pathways, diseases and side effects
  • Health informatics (PHRs and EHRs) allows integration of the molecular and patient models (QP)
  • Semantic technologies and complex systems tools allow seamless integration and human-scale data mining

Analysis: visualization, projection, data mining, hypothesis generation, network tools

Integration: RDF, XML, triple stores, ontologies, SPARQL, graph algorithms

Access: web services, RPC, information extraction

SLIDE 15

ChemBioGrid.org: Web service infrastructure for cheminformatics

Dong, X., Gilbert, K.E., Guha, R., Heiland, R., Kim, J., Pierce, M.E., Fox, G.C. and Wild, D.J. Web service infrastructure for chemoinformatics, J. Chem. Inf. Model., 2007; 47(4), pp 1303-1307.

SLIDE 16

The Semantic Web – meaning & relationships

SLIDE 17

Chem2Bio2RDF – RDF integration & SPARQL querying

Chen, B., Dong, X., Jiao, D., Wang, H., Zhu, Q., Ding, Y., Wild, D.J. Chem2Bio2RDF: a semantic framework for linking and data mining chemogenomic and systems chemical biology data. BMC Bioinformatics 2010, 11, 255.

SLIDE 18

Chem2Bio2RDF context

SLIDE 19

Chem2Bio2RDF Relationships

SLIDE 20

Linked Open Data Cloud (linkeddata.org)

SLIDE 21

Converting data into RDF

SLIDE 22

Finding multi-target inhibitors of MAPK pathway with a SPARQL query
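
The query itself is not reproduced in this transcript. As a rough illustration of the idea only, the sketch below expresses a "compounds that inhibit several members of one pathway" lookup in plain Python over an in-memory triple list rather than in SPARQL. The predicate names (`in_pathway`, `inhibits`) and all identifiers are hypothetical and are not the Chem2Bio2RDF vocabulary.

```python
from collections import defaultdict

# Hypothetical (subject, predicate, object) triples; all names are placeholders.
TRIPLES = [
    ("target:T1", "in_pathway", "pathway:MAPK"),
    ("target:T2", "in_pathway", "pathway:MAPK"),
    ("target:T3", "in_pathway", "pathway:other"),
    ("compound:A", "inhibits", "target:T1"),
    ("compound:A", "inhibits", "target:T2"),
    ("compound:B", "inhibits", "target:T1"),
    ("compound:B", "inhibits", "target:T3"),
]


def multi_target_inhibitors(triples, pathway, min_targets=2):
    """Map each compound to the pathway targets it inhibits, keeping only
    compounds that hit at least `min_targets` distinct targets."""
    members = {s for s, p, o in triples if p == "in_pathway" and o == pathway}
    hits = defaultdict(set)
    for s, p, o in triples:
        if p == "inhibits" and o in members:
            hits[s].add(o)
    return {c: t for c, t in hits.items() if len(t) >= min_targets}


result = multi_target_inhibitors(TRIPLES, "pathway:MAPK")
```

In a triple store the same join would typically be written as graph patterns with a grouping and count filter; the Python version just makes the two steps (select pathway members, then count hits per compound) explicit.
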

SLIDE 23

Finding compounds with similar polypharmacology using SPARQL
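
One simple way to compare polypharmacology profiles is to treat each compound as the set of targets it is active against and score the overlap. The sketch below uses Jaccard similarity for this; that choice, the `PROFILES` data and the function names are assumptions made for illustration, and the talk does not say which measure was used.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical target profiles: compound -> set of targets it is active against.
PROFILES = {
    "compound:A": {"T1", "T2", "T3"},
    "compound:B": {"T1", "T2"},
    "compound:C": {"T4"},
}


def most_similar(query, profiles):
    """Rank the other compounds by target-set similarity to `query`, highest first."""
    q = profiles[query]
    scores = [(jaccard(q, t), name) for name, t in profiles.items() if name != query]
    # Sort by descending score, then by name so ties are deterministic.
    return sorted(scores, key=lambda item: (-item[0], item[1]))


ranking = most_similar("compound:A", PROFILES)
```
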

SLIDE 24

Projecting queries into chemical space

  • GTM / MDS projection and embedding of all PubChem using clouds
  • Plotting and embedding unknown compounds with SCB property labels
  • Dynamic querying and projection into chemical space

SLIDE 25

Projecting queries into chemical space

Choi, J.Y., Bae, S.H., Qiu, J., Fox, G., Chen, B., Wild, D.J. Browsing Large Scale Cheminformatics Data with Dimension Reduction. Emerging Computational Methods for the Life Sciences Workshop, ACM Symposium for High Performance Distributed Computing, Jun 21-25, 2010, Chicago IL.

SLIDE 26

“Doppler Radar Plot” – Kinase Specificity

Choi, J.Y., Bae, S.H., Qiu, J., Fox, G., Chen, B., Wild, D.J. Browsing Large Scale Cheminformatics Data with Dimension Reduction. Emerging Computational Methods for the Life Sciences Workshop, ACM Symposium for High Performance Distributed Computing, Jun 21-25, 2010, Chicago IL.

SLIDE 27

“Doppler Radar Plot” – Kinase Specificity

SLIDE 28

Chem2Bio2RDF Dashboard: finding paths

SLIDE 29

Pathfinder

Dexamethasone, Triamcinolone, NFKB1, Glucocorticoid Receptor

http://ella.slis.indiana.edu/~yuysun/flex/pathfinder.html

SLIDE 30

Dynamic exploration with clouds and Cytoscape

  • Virtuoso runs Chem2Bio2RDF queries on the cloud
  • Cytoscape plugins give access to Chem2Bio2RDF, LPG and chemical structure visualization
  • Dynamic exploration in Cytoscape

SLIDE 31

Hydrocortisone – Dexamethasone links

  • Fig. Use Case 1. Network diagram of the paths obtained between Hydrocortisone and Dexamethasone using ChemBioScape. DrugBank interaction records contain information about every drug's target. In this case, DB00741 and DB01234 share common targets through several different DrugBank interaction IDs.

SLIDE 32

Tolcapone and Entacapone links

  • Fig. Use Case 2. Tolcapone and Entacapone are connected to each other through DrugBank interactions 2348 and 1962. The two drugs also appear in PubMed articles 8119326 and 8223912 via their CIDs (Compound IDs).

SLIDE 33

Isoniazid and Ethionamide – replicate paper results

  • Banerjee, A., Dubnau, E., Quemard, A., Balasubramanian, V., Um, K., Wilson, T., et al.: inhA, a gene encoding a target for isoniazid and ethionamide in Mycobacterium tuberculosis. Science, 263(5144), 227-230 (1994).

SLIDE 34

WENDI – exploring compound knowledge space

Doxorubicin (anthracycline antibiotic)

SLIDE 35

WENDI v1.0 - insights from the literature

SLIDE 36

WENDI v2.0 - Automated reasoning with RDF

  • Simple OWL ontology for relationships
  • Large RDF network expands out from Query
  • RDF inference engines applied & results filtered / prioritized

[Diagram: RDF network expanding out from QUERY. Nodes: CID 86427, CID 8642, AID 328, PubMed 12856, HER2, Breast Cancer. Edge labels: similar_to, active_against, contains_term, contained_in, predicted_inactive_against.]
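
The kind of chained reasoning described on this slide can be sketched as a single hand-written rule over triples. This is only an illustration under stated assumptions: the rule (follow `similar_to`, then `active_against`, then `contains_term`), the `possibly_associated_with` label and the example triples are hypothetical, and WENDI itself applies RDF inference engines and an OWL ontology rather than code like this.

```python
# Hypothetical triples; node names are placeholders, not WENDI data.
TRIPLES = {
    ("query", "similar_to", "compound:X"),
    ("compound:X", "active_against", "assay:Y"),
    ("assay:Y", "contains_term", "term:Z"),
    ("query", "similar_to", "compound:W"),  # no assay data, so no inference
}


def infer_terms(triples, start):
    """Follow similar_to -> active_against -> contains_term chains from `start`.

    Returns tuples (start, label, term, supporting_path) so each inferred
    association keeps the evidence it was derived from.
    """
    def objects(subject, predicate):
        return {o for s, p, o in triples if s == subject and p == predicate}

    inferred = set()
    for compound in objects(start, "similar_to"):
        for assay in objects(compound, "active_against"):
            for term in objects(assay, "contains_term"):
                inferred.add(
                    (start, "possibly_associated_with", term, (compound, assay))
                )
    return inferred


inferred = infer_terms(TRIPLES, "query")
```

Keeping the supporting path alongside each inferred link is what allows results to be filtered or prioritized afterwards.
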

SLIDE 37

WENDI OWL/RDF Network & Inferred Associations

SLIDE 38

Semantic text mining of journal articles

Jiao, D. and Wild, D.J. Extraction of CYP Chemical Interactions from Biomedical Literature Using Natural Language Processing Methods, Journal of Chemical Information and Modeling, 2009; 49(2), pp 263-269.

SLIDE 39

Chemical & Biological Literature Extraction

SLIDE 40

Validating topics by experimental relationships

Topic 26: cell, expression, cancer, tumor, …

Related Disease: DNA Damage, Melanoma, Glioblastoma, …

[Figure legend: un-proved link; link proved by c2b2r_chemogenomics; Target; Drug]

SLIDE 41

Bio-LDA III

  • Entropy
      • In information theory, entropy is a measure of the uncertainty associated with a random variable.
      • Here we can compute the bio-term entropies over topics.
  • Kullback-Leibler divergence (KL divergence)
      • A non-symmetric measure of the difference between two probability distributions.
      • Here we used the KL divergence as the non-symmetric distance measure for two bio-terms over topics.
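
Both measures are easy to state in code. The sketch below assumes each bio-term is represented by a probability distribution over topics; the two example distributions are made up for illustration, and how Bio-LDA actually derives these distributions is not covered here.

```python
import math


def entropy(p):
    """Shannon entropy in bits of a discrete distribution p."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def kl_divergence(p, q):
    """D_KL(p || q) in bits.

    Returns infinity if q assigns zero probability to an outcome
    that p considers possible.
    """
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf
            total += pi * math.log2(pi / qi)
    return total


# Hypothetical topic distributions for two bio-terms over four topics.
term_a = [0.5, 0.25, 0.25, 0.0]    # concentrated on a few topics
term_b = [0.25, 0.25, 0.25, 0.25]  # spread evenly over all topics

h_a = entropy(term_a)
h_b = entropy(term_b)
d_ab = kl_divergence(term_a, term_b)
d_ba = kl_divergence(term_b, term_a)
```

A term spread evenly over topics has higher entropy than one concentrated on a few, and `d_ab` differs from `d_ba`, which is the asymmetry the slide refers to.
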

SLIDE 42

Combining path finding and Bio-LDA

  • Detect semantic association
      • Path finding algorithm
      • Millions of RDF triples from Chem2Bio2RDF
  • Assess semantic association
      • Bio-LDA model
      • Entropy and KL divergence
      • Additional knowledge base: 50, 100 and 200 topics using 336,899 recent MEDLINE abstracts, which contain 13,338 distinct bio-terms
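
The detection step can be pictured as a graph search over triples. The sketch below uses a plain breadth-first search on a tiny hypothetical triple list; it is not the path finding algorithm used with Chem2Bio2RDF, whose details are not given on the slide, and all identifiers and predicates are placeholders.

```python
from collections import defaultdict, deque

# Hypothetical (subject, predicate, object) triples.
TRIPLES = [
    ("drug:D1", "binds", "target:T1"),
    ("drug:D2", "binds", "target:T1"),
    ("target:T1", "in_pathway", "pathway:P1"),
    ("drug:D3", "affects", "pathway:P1"),
]


def build_graph(triples):
    """Adjacency list; edges are treated as undirected for exploration."""
    graph = defaultdict(list)
    for s, p, o in triples:
        graph[s].append((p, o))
        graph[o].append((p, s))
    return graph


def shortest_path(graph, start, goal):
    """Breadth-first search; returns the list of nodes on one shortest path,
    or None if the two nodes are not connected."""
    if start == goal:
        return [start]
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for _predicate, neighbour in graph[node]:
            if neighbour not in parent:
                parent[neighbour] = node
                if neighbour == goal:
                    # Walk back through the parents to rebuild the path.
                    path = [neighbour]
                    while parent[path[-1]] is not None:
                        path.append(parent[path[-1]])
                    return path[::-1]
                queue.append(neighbour)
    return None


graph = build_graph(TRIPLES)
path_12 = shortest_path(graph, "drug:D1", "drug:D2")
path_13 = shortest_path(graph, "drug:D1", "drug:D3")
```

Each path found this way is a candidate association; a separate scoring step, such as the entropy and KL divergence measures, would then be needed to judge how meaningful it is.
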

SLIDE 43

Summary

  • Drug discovery is entering a new era that is arguably centered on informatics analysis of the vast amount of biological and chemical data now being produced, and which looks at the effect of drugs on biological systems as a whole. This new approach underlies the new fields of systems chemical biology and chemogenomics.
  • Analyzing this data, and particularly the relationships between compounds, drugs, proteins, genes, diseases, pathways and people, promises to provide important understanding of the nature of disease and treatment.
  • The Semantic Web provides an effective framework for logically managing the data, and Cloud Computing provides a physical framework for computation and searching.
  • Early-stage methods developed at Indiana allow integrated access to this data, path finding between any two points, visualization in chemical space and network tools, and advanced handling of the scholarly literature.
  • Critical next steps include ranking and intelligent filtering of paths and relationships to provide aggregate evidence-based approaches, and integration of NGS and patient data.

SLIDE 44

Cheminformatics group at Indiana University

http://djwild.info