Ontologies and data integration in biomedicine
Success stories and challenging issues
Olivier Bodenreider Lister Hill National Center for Biomedical Communications Bethesda, Maryland - USA
Data Integration in the Life Sciences, Evry, France, June 26, 2008
[Yildirim, 2007]
NIH Roadmap Clinical and Translational Science Awards (CTSA)
Basic research Clinical research
Sources to be integrated
Local schema (of the
Global schema (in
Provide a conceptualization of the domain
Help define the schema
Information model vs. ontology
Provide value sets for data elements
Enable standardization and sharing of data
Annotations to the Gene Ontology
Repositories for translational research (CTSA)
Clinical information systems
Reference for defining the global schema
Map between local and global schemas
TAMBIS, BioMediator, OntoFusion
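The role of a mapping between local and global schemas can be illustrated with a minimal sketch. Everything named here (the two sources, their field names, the global element names) is hypothetical and is not taken from TAMBIS, BioMediator, or OntoFusion.

```python
# Hypothetical mappings: global schema element -> local field name, per source.
MAPPINGS = {
    "source_a": {"gene_symbol": "sym", "disease": "dx_name"},
    "source_b": {"gene_symbol": "GeneName", "disease": "Condition"},
}


def to_global(source, record):
    """Rewrite one local record using global schema element names."""
    mapping = MAPPINGS[source]
    return {g: record[l] for g, l in mapping.items() if l in record}


def query_all(sources, **criteria):
    """Answer a query phrased against the global schema across all sources."""
    hits = []
    for name, records in sources.items():
        for rec in records:
            unified = to_global(name, rec)
            if all(unified.get(k) == v for k, v in criteria.items()):
                hits.append((name, unified))
    return hits


# Hypothetical local records, each using its own field names.
SOURCES = {
    "source_a": [{"sym": "LARGE", "dx_name": "condition 1"}],
    "source_b": [
        {"GeneName": "LARGE", "Condition": "condition 2"},
        {"GeneName": "OTHER", "Condition": "condition 3"},
    ],
}
```

A call such as `query_all(SOURCES, gene_symbol="LARGE")` is written once against the global schema and returns the matching records from both sources, regardless of their local field names.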
Functional annotation of gene products
Assigned manually by curators
Inferred automatically (e.g., from sequence similarity)
http://www.informatics.jax.org/
http://db.yeastgenome.org/
http://www.ebi.ac.uk/GOA/
Enrichment analysis (within/across species)
Clustering (co-clustering with gene expression data)
Closely related annotations
Semantic similarity
[Bodenreider, PSB 2005] [Sahoo, Medinfo 2007] [Lord, PSB 2003]
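Enrichment analysis asks whether a GO term annotates more genes in a gene set than chance would predict. The slide does not name a statistic; the hypergeometric tail probability below is one common choice, shown only as a sketch. The function and parameter names are illustrative.

```python
from math import comb


def enrichment_p_value(population, annotated, sample, annotated_in_sample):
    """P(X >= annotated_in_sample) under the hypergeometric distribution.

    population: number of genes in the background set
    annotated: background genes annotated with the GO term
    sample: size of the gene set under study
    annotated_in_sample: genes in the set annotated with the term
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(annotated_in_sample, upper + 1)
    )
    return tail / total
```

A small p-value suggests the term is over-represented in the set. In practice a correction for testing many GO terms at once would also be needed; that step is omitted here.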
[Figure: from Glycosyltransferase (Gene Ontology) to Congenital muscular dystrophy through Entrez Gene; record fields shown: GO, PubMed, Gene name, OMIM, Sequence, Interactions] [Sahoo, Medinfo 2007]
[Figure: EG:9215 (LARGE) has_molecular_function GO:0008375 (acetylglucosaminyltransferase) and has_associated_phenotype MIM:608840 (Muscular dystrophy, congenital, type 1D); GO:0008375 is connected by isa links to GO:0016757 (glycosyltransferase), with GO:0008194 and GO:0016758 as the other GO nodes shown]
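The query this graph supports (from a general molecular function to the phenotypes of genes annotated with a more specific function) can be sketched as a traversal of isa links. The identifiers come from the figure; the exact shape of the isa chain between the GO nodes is an assumption made for illustration, since the figure text only shows that GO:0008375 reaches GO:0016757.

```python
# Identifiers are from the figure; the layout of the isa edges is assumed.
ISA = {
    "GO:0008375": ["GO:0008194", "GO:0016758"],
    "GO:0008194": ["GO:0016757"],
    "GO:0016758": ["GO:0016757"],
}
HAS_MOLECULAR_FUNCTION = {"EG:9215": ["GO:0008375"]}
HAS_ASSOCIATED_PHENOTYPE = {"EG:9215": ["MIM:608840"]}


def ancestors(term):
    """Return the term itself plus everything reachable through isa links."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(ISA.get(current, []))
    return seen


def phenotypes_for_function(go_term):
    """Phenotypes of genes whose function is go_term or one of its descendants."""
    found = set()
    for gene, functions in HAS_MOLECULAR_FUNCTION.items():
        if any(go_term in ancestors(f) for f in functions):
            found.update(HAS_ASSOCIATED_PHENOTYPE.get(gene, []))
    return found
```

Asking for `phenotypes_for_function("GO:0016757")` finds MIM:608840 even though the gene is annotated only with the more specific term GO:0008375; this is the step that requires the ontology rather than the annotations alone.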
Microarray data repository
Biospecimen repository
Annotations on microarray data
Cancer Translational Research Informatics Platform
Integrates data services
Reference terminology for the cancer domain
~ 60,000 concepts
OWL Lite
Metadata repository Used to bridge across UML models through Common
Links to concepts in ontologies
Data/Information E.g., translational research
[Ruttenberg, 2007]
NeuronDB, BAMS, NC Annotations, Homologene, SWAN, Entrez Gene, Gene Ontology, Mammalian Phenotype, PDSPki, BrainPharm, AlzGene, Antibodies, PubChem, MeSH, Reactome, Allen Brain Atlas, Publications
http://esw.w3.org/topic/HCLS/HCLSIG_DemoHomePage_HCLSIG_Demo
NeuronDB
Protein (channels/receptors) Neurotransmitters Neuroanatomy Cell Compartments Currents
BAMS
Protein Neuroanatomy Cells Metabolites (channels) PubMedID
NC Annotations
Genes/Proteins Processes Cells (maybe) PubMed ID
Allen Brain Atlas
Genes Brain images Gross anatomy -> neuroanatomy
Homologene
Genes Species Orthologies Proofs
SWAN
PubMedID Hypothesis Questions Evidence Genes
Entrez Gene
Genes Protein GO PubMedID Interaction (g/p) Chromosome
GO
Molecular function Cell components Biological process Annotation gene PubMedID
Mammalian Phenotype
Genes Phenotypes Disease PubMedID Proteins Chemicals Neurotransmitters
PDSPki BrainPharm
Drug Drug effect Pathological agent Phenotype Receptors Channels Cell types PubMedID Disease
AlzGene
Gene Polymorphism Population Alz Diagnosis
Antibodies
Genes Antibodies
PubChem
Name Structure Properties MeSH term
MeSH
Drugs Anatomy Phenotypes Compounds Chemicals PubMedID PubChem
Reactome
Genes/proteins Interactions Cellular location Processes (GO)
“Recombinant data” (E. Neumann)
Biomedical literature: MeSH, Addison Disease (D000224)
Clinical repositories: SNOMED CT, Addison's disease (363732003)
ICD 10: Primary adrenocortical insufficiency (E27.1)
Biomedical literature: MeSH
Genome annotations: GO
Model: NCBI Taxonomy
Genetic knowledge bases: OMIM
Clinical repositories: SNOMED CT
Other subdomains: …
Anatomy: FMA
Genome annotations: GO
Model: NCBI Taxonomy
Genetic knowledge bases: OMIM
Other subdomains: …
Anatomy: FMA
Biomedical literature: MeSH, Addison Disease (D000224)
Clinical repositories: SNOMED CT, Addison's disease (363732003)
C0001403
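The slide shows two vocabulary-specific codes for Addison's disease resolving to one shared concept identifier. A minimal sketch of how such a mapping lets data annotated with different vocabularies be brought together follows. The codes and the concept identifier are from the slide; the record payloads are hypothetical.

```python
# (vocabulary, code) -> shared concept identifier, as shown on the slide.
CONCEPT_OF = {
    ("MeSH", "D000224"): "C0001403",
    ("SNOMED CT", "363732003"): "C0001403",
}


def group_by_concept(records):
    """Group records annotated with different vocabularies under one concept."""
    grouped = {}
    for vocabulary, code, payload in records:
        concept = CONCEPT_OF.get((vocabulary, code))
        if concept is not None:  # codes without a mapping are left out
            grouped.setdefault(concept, []).append(payload)
    return grouped


# Hypothetical records: one from the literature, one from a clinical repository.
RECORDS = [
    ("MeSH", "D000224", "article-1"),
    ("SNOMED CT", "363732003", "patient-record-7"),
    ("MeSH", "D999999", "article-2"),  # made-up code with no mapping
]
```

With `group_by_concept(RECORDS)`, the article and the patient record end up under the same key even though neither source uses the other's code.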
Purpose
Directionality
Lexically: ambiguity, normalization
Semantically: lack of / incomplete formal definitions
Data annotated to different ontologies cannot
Need for mappings across ontologies
Multiple possible identifiers for the same entity
Depending on the underlying representational scheme (URI
Depending on who creates the URI
One level of indirection between developers and users
Independence from local constraints at the developer’s
E.g., URI for genes in Entrez Gene
W3C Health Care and Life Sciences Interest Group
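The idea of one level of indirection between developers and users can be sketched as a small resolver: users cite a stable URI, and a lookup maps it to wherever the provider currently serves the entity. The URI prefix and the provider URL template below are made-up placeholders, not real Entrez Gene or W3C addresses.

```python
# Hypothetical stable prefix and provider URL template (both are assumptions).
STABLE_PREFIX = "http://example.org/gene/"
PROVIDER_TEMPLATES = {
    "current": "http://provider.example.org/lookup?id={id}",
}


def resolve(stable_uri, provider="current"):
    """Map a stable, shared URI to the provider's current location."""
    if not stable_uri.startswith(STABLE_PREFIX):
        raise ValueError("not a managed URI: " + stable_uri)
    local_id = stable_uri[len(STABLE_PREFIX):]
    return PROVIDER_TEMPLATES[provider].format(id=local_id)
```

If the provider changes its address scheme, only the template changes; every dataset that cites the stable URI stays valid.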
Cost-free license required
SNOMED CT is freely available in member countries
Is a requirement for the Open Biomedical Ontologies
Is a de facto prerequisite for Semantic Web applications
UMLS: 143 source vocabularies
NCBO BioPortal: ~100 ontologies
Limited overlap between the two repositories
Web Ontology Language (OWL) – NCI Thesaurus
OBO format – most OBO ontologies
UMLS Rich Release Format (RRF) – UMLS, RxNorm
OBO to OWL
LexGrid (import/export to LexGrid internal format)
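To give a feel for what working across formats involves, here is a minimal sketch that reads `[Term]` stanzas from OBO-style text into a plain dictionary, the sort of neutral structure a converter might start from. It handles only the `id`, `name`, and `is_a` tags and is not a full OBO parser or an OBO-to-OWL converter. The sample stanzas use made-up identifiers.

```python
def parse_obo_terms(text):
    """Parse [Term] stanzas of OBO-style text into {id: {"name", "is_a"}}."""
    terms, current = {}, None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A new stanza begins; only [Term] stanzas are collected.
            current = {"name": None, "is_a": []} if line == "[Term]" else None
        elif current is not None and ": " in line:
            tag, value = line.split(": ", 1)
            value = value.split(" ! ")[0].strip()  # drop trailing "! comment"
            if tag == "id":
                terms[value] = current
            elif tag == "name":
                current["name"] = value
            elif tag == "is_a":
                current["is_a"].append(value)
    return terms


# Made-up sample in OBO style; the EX: identifiers are not real ontology terms.
SAMPLE = """\
format-version: 1.2

[Term]
id: EX:0000002
name: child term
is_a: EX:0000001 ! parent term

[Term]
id: EX:0000001
name: parent term

[Typedef]
id: part_of
name: part of
"""
```

Calling `parse_obo_terms(SAMPLE)` yields two terms, with the child pointing to its parent through `is_a`; the header line and the `[Typedef]` stanza are skipped.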
UMLS approach
Integrates ontologies “as is”, including legacy
Facilitates the integration of the corresponding datasets
OBO Foundry approach
Ensures consistency ab initio
Excludes legacy ontologies
Difficult to define outside a use case or application
Collaboratively, by users (Web 2.0 approach)
Marginal notes enabled by BioPortal
Centrally, by experts
OBO Foundry approach
Governance
Installed base / Community of practice
Upregulated genes
Diseases (extracted from text + MeSH terms)
Coded discharge summaries
Laboratory data
Upregulated genes
Diseases (extracted from text + MeSH terms)
Courtesy of David Chen, Butte Lab
No pairing between genomic and clinical data
Text mining
Mapping between SNOMED CT and ICD 9-CM
Reuse of ICD 9-CM codes assigned for billing purposes
Rediscovery more than discovery
Dudley J, Butte AJ. "Enabling integrative genomic analysis of high-impact human diseases through text mining." Pac Symp Biocomput 2008; 580-91
Chen DP, Weber SC, Constantinou PS, Ferris TA, Lowe HJ, Butte AJ. "Novel integration of hospital electronic medical records and gene expression measurements to identify genetic markers of maturation." Pac Symp Biocomput 2008; 243-54
Butte AJ. "Medicine. The ultimate model organism." Science 2008; 320(5874): 325-7
Grass roots effort (GO)
Regulatory context (ICD 9-CM)
Ontology integration resources / strategies