Ontologies and data integration in biomedicine: Success stories and challenging issues

Data Integration in the Life Sciences Evry, France June 26, 2008 Ontologies and data integration in biomedicine Success stories and challenging issues Olivier Bodenreider Lister Hill National Center for Biomedical Communications Bethesda,


SLIDE 1

Ontologies and data integration in biomedicine

Success stories and challenging issues

Olivier Bodenreider Lister Hill National Center for Biomedical Communications Bethesda, Maryland - USA

Data Integration in the Life Sciences Evry, France June 26, 2008

SLIDE 2

Why integrate data?

SLIDE 3

Lister Hill National Center for Biomedical Communications 3

Integration yields nice pictures!

[Yildirim, 2007]

SLIDE 4

Motivation: Translational research

• “Bench to Bedside”
• Integration of clinical and research activities and results
• Supported by research programs
  • NIH Roadmap
  • Clinical and Translational Science Awards (CTSA)
• Requires the effective integration and exchange of information between
  • Basic research
  • Clinical research

SLIDE 5

Translational research NIH Roadmap

SLIDE 6

Motivation Translational research

Basic Research Clinical Research and Practice

SLIDE 7

Why ontologies?

SLIDE 8

Terminology and translational research

[Figure labels: Cancer; Basic Research; EHR; Cancer Patients; NCI Thesaurus; SNOMED CT]

SLIDE 9

Approaches to data integration

• Warehousing
  • Sources to be integrated are transformed into a common format and converted to a common vocabulary
• Mediation
  • Local schema (of the sources)
  • Global schema (in reference to which the queries are made)
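As a rough illustration of mediation, the sketch below rewrites a query expressed against a global schema into each source's local schema and returns the answers in global terms. The source names, field names, and the `query` function are hypothetical and chosen only for this example; the gene and disease values are borrowed from a later slide.

```python
# Hypothetical mediator: queries use global field names; each source keeps its own.
LOCAL_SCHEMAS = {
    "source_a": {"gene_symbol": "symbol", "disease": "phenotype"},
    "source_b": {"gene_symbol": "gene_name", "disease": "dx"},
}

# Hypothetical local records, stored under each source's own field names.
SOURCES = {
    "source_a": [{"symbol": "LARGE", "phenotype": "Congenital muscular dystrophy"}],
    "source_b": [{"gene_name": "LARGE", "dx": "Muscular dystrophy, congenital, type 1D"}],
}


def query(global_field, value):
    """Return matching records from every source, expressed in the global schema."""
    results = []
    for source, mapping in LOCAL_SCHEMAS.items():
        local_field = mapping[global_field]  # global -> local translation
        for record in SOURCES[source]:
            if record.get(local_field) == value:
                # Translate the local record back to global field names.
                results.append({g: record[l] for g, l in mapping.items()})
    return results
```

In a warehousing design the same translation would be applied once, when the sources are loaded, rather than at query time.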

SLIDE 10

Ontologies and warehousing

• Role
  • Provide a conceptualization of the domain
    • Help define the schema
    • Information model vs. ontology
  • Provide value sets for data elements
  • Enable standardization and sharing of data
• Examples
  • Annotations to the Gene Ontology
  • Repositories for translational research (CTSA)
  • Clinical information systems

SLIDE 11

Ontologies and mediation

• Role
  • Reference for defining the global schema
  • Map between local and global schemas
• Examples
  • TAMBIS
  • BioMediator
  • OntoFusion

SLIDE 12

Success stories

Gene Ontology

http://www.geneontology.org/

SLIDE 13

Annotating data

• Gene Ontology
  • Functional annotation of gene products in several dozen model organisms
• Various communities use the same controlled vocabularies
• Enabling comparisons across model organisms
• Annotations
  • Assigned manually by curators
  • Inferred automatically (e.g., from sequence similarity)

SLIDE 14

GO Annotations for Aldh2 (mouse)

http://www.informatics.jax.org/

SLIDE 15

GO Annotations for ALD4 (yeast)

http://db.yeastgenome.org/

SLIDE 16

GO Annotations for ALDH2 (Human)

http://www.ebi.ac.uk/GOA/

SLIDE 17

Integration applications

• Based on shared annotations
  • Enrichment analysis (within/across species)
  • Clustering (co-clustering with gene expression data)
• Based on the structure of GO
  • Closely related annotations
  • Semantic similarity
• Based on associations between gene products and annotations
  • Leveraging reasoning

[Bodenreider, PSB 2005] [Sahoo, Medinfo 2007] [Lord, PSB 2003]
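To make "enrichment analysis" concrete: one common formulation (an assumption here, since the slides do not name a statistical test) asks how likely it is that a gene list contains at least a given number of genes annotated to a GO term by chance alone, using the hypergeometric distribution. The function name and parameters below are illustrative.

```python
from math import comb


def enrichment_p_value(population, annotated, sample, hits):
    """P(X >= hits) when drawing `sample` genes without replacement from
    `population` genes, of which `annotated` carry the GO term."""
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total
```

For example, with 10 genes in total, 3 of them annotated to a term, a 3-gene list in which all 3 carry the term has probability 1/120 under this model. Real analyses would also correct for testing many GO terms at once.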

SLIDE 18

Integration: Entrez Gene + GO

[Figure labels: gene; GO; PubMed; Gene name; OMIM; Sequence; Interactions; Glycosyltransferase; Congenital muscular dystrophy; Entrez Gene; Gene Ontology]

[Sahoo, Medinfo 2007]

SLIDE 19

From glycosyltransferase to congenital muscular dystrophy

[Figure: EG:9215 (LARGE) has_molecular_function GO:0008375 (acetylglucosaminyltransferase) and has_associated_phenotype MIM:608840 (Muscular dystrophy, congenital, type 1D); GO:0008375 is linked by isa to GO:0016757 (glycosyltransferase), with GO:0008194 and GO:0016758 as intermediate nodes]
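The path from glycosyltransferase to congenital muscular dystrophy can be sketched as a traversal over isa links followed by a lookup of gene annotations. The identifiers below are the ones shown on the slide; the exact set of isa edges among them is an assumption made for illustration, and the function names are hypothetical.

```python
# is_a edges (child -> parents); this edge set is an assumption for illustration.
IS_A = {
    "GO:0008375": ["GO:0008194", "GO:0016758"],
    "GO:0008194": ["GO:0016757"],
    "GO:0016758": ["GO:0016757"],
}
HAS_MOLECULAR_FUNCTION = {"EG:9215": ["GO:0008375"]}
HAS_ASSOCIATED_PHENOTYPE = {"EG:9215": ["MIM:608840"]}


def ancestors(term):
    """All terms reachable from `term` by following is_a links upward."""
    seen = set()
    stack = [term]
    while stack:
        for parent in IS_A.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def phenotypes_for_function(go_term):
    """Phenotypes of genes annotated to `go_term` or to any of its descendants."""
    found = set()
    for gene, functions in HAS_MOLECULAR_FUNCTION.items():
        if any(f == go_term or go_term in ancestors(f) for f in functions):
            found.update(HAS_ASSOCIATED_PHENOTYPE.get(gene, []))
    return found
```

A query for the general term GO:0016757 reaches MIM:608840 even though the gene is annotated only to the more specific GO:0008375, which is the point of leveraging the ontology's structure.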

SLIDE 20

Success stories

caBIG

http://cabig.nci.nih.gov/

SLIDE 21

Cancer Biomedical Informatics Grid

• US National Cancer Institute
• Common infrastructure used to share data and applications across institutions to support cancer research efforts in a grid environment
• Data and application services available on the grid
• Supported by ontological resources

SLIDE 22

caBIG services

• caArray
  • Microarray data repository
• caTissue
  • Biospecimen repository
• caFE (Cancer Function Express)
  • Annotations on microarray data
• …
• caTRIP
  • Cancer Translational Research Informatics Platform
  • Integrates data services

SLIDE 23

Ontological resources

• NCI Thesaurus
  • Reference terminology for the cancer domain
  • ~60,000 concepts
  • OWL Lite
• Cancer Data Standards Repository (caDSR)
  • Metadata repository
  • Used to bridge across UML models through Common Data Elements
  • Links to concepts in ontologies

SLIDE 24

Success stories

Semantic Web for Health Care and Life Sciences

http://www.w3.org/2001/sw/hcls/

SLIDE 25

W3C Health Care and Life Sciences IG

SLIDE 26

Biomedical Semantic Web

• Integration
  • Data/Information
  • E.g., translational research
• Hypothesis generation
• Knowledge discovery

[Ruttenberg, 2007]

SLIDE 27

HCLS mashup of biomedical sources

[Figure labels: NeuronDB; BAMS; NC Annotations; Homologene; SWAN; Entrez Gene; Gene Ontology; Mammalian Phenotype; PDSPki; BrainPharm; AlzGene; Antibodies; PubChem; MeSH; Reactome; Allen Brain Atlas; Publications]

http://esw.w3.org/topic/HCLS/HCLSIG_DemoHomePage_HCLSIG_Demo

SLIDE 28

Shared identifiers Example

GO

SLIDE 29

HCLS mashup

• NeuronDB: Protein (channels/receptors), Neurotransmitters, Neuroanatomy, Cell Compartments, Currents
• BAMS: Protein, Neuroanatomy, Cells, Metabolites (channels), PubMedID
• NC Annotations: Genes/Proteins, Processes, Cells (maybe), PubMed ID
• Allen Brain Atlas: Genes, Brain images, Gross anatomy -> neuroanatomy
• Homologene: Genes, Species, Orthologies, Proofs
• SWAN: PubMedID, Hypothesis, Questions, Evidence, Genes
• Entrez Gene: Genes, Protein, GO, PubMedID, Interaction (g/p), Chromosome, C. location
• GO: Molecular function, Cell components, Biological process, Annotation gene, PubMedID
• Mammalian Phenotype: Genes, Phenotypes, Disease, PubMedID, Proteins, Chemicals, Neurotransmitters
• PDSPki, BrainPharm: Drug, Drug effect, Pathological agent, Phenotype, Receptors, Channels, Cell types, PubMedID, Disease
• AlzGene: Gene, Polymorphism, Population, Alz Diagnosis
• Antibodies: Genes, Antibodies
• PubChem: Name, Structure, Properties, MeSH term
• MeSH: Drugs, Anatomy, Phenotypes, Compounds, Chemicals, PubMedID, PubChem
• Reactome: Genes/proteins, Interactions, Cellular location, Processes (GO)

SLIDE 31

HCLS mashups

• Based on RDF/OWL
• Based on shared identifiers
  • “Recombinant data” (E. Neumann)
• Ontologies used in some cases
• Support applications (SWAN, SenseLab, etc.)
• Journal of Biomedical Informatics special issue on Semantic Bio-mashups (forthcoming)

SLIDE 32

Challenging issues

Bridges across ontologies

SLIDE 33

Trans-namespace integration

• Biomedical literature (MeSH): Addison Disease (D000224)
• Clinical repositories (SNOMED CT): Addison's disease (363732003)
• ICD 10: Primary adrenocortical insufficiency (E27.1)

SLIDE 34

(Integrated) concept repositories

• Unified Medical Language System
  • http://umlsks.nlm.nih.gov
• NCBO’s BioPortal
  • http://www.bioontology.org/tools/portal/bioportal.html
• caDSR
  • http://ncicb.nci.nih.gov/NCICB/infrastructure/cacore_overview/cadsr
• Open Biomedical Ontologies (OBO)
  • http://obofoundry.org/

SLIDE 35

Integrating subdomains

[Figure: subdomains and their vocabularies, integrated through the UMLS]

• Biomedical literature: MeSH
• Genome annotations: GO
• Model organisms: NCBI Taxonomy
• Genetic knowledge bases: OMIM
• Clinical repositories: SNOMED CT
• Anatomy: FMA
• Other subdomains

SLIDE 36

Integrating subdomains

[Figure labels: Biomedical literature; Genome annotations; Model organisms; Genetic knowledge bases; Clinical repositories; Other subdomains; Anatomy]

SLIDE 37

Trans-namespace integration

[Figure: subdomains and their vocabularies, integrated through the UMLS]

• Genome annotations: GO
• Model organisms: NCBI Taxonomy
• Genetic knowledge bases: OMIM
• Anatomy: FMA
• Other subdomains
• Biomedical literature (MeSH): Addison Disease (D000224)
• Clinical repositories (SNOMED CT): Addison's disease (363732003)
• UMLS: C0001403
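The role of a shared concept identifier can be sketched as a join: records coded in different vocabularies are linked once each code is mapped to the same UMLS concept. The codes and the concept identifier are those shown on the slide; the record layouts, the example record identifiers, and the function name are hypothetical.

```python
# (vocabulary, code) -> UMLS concept identifier; codes are the ones on the slide.
TO_CUI = {
    ("MeSH", "D000224"): "C0001403",
    ("SNOMED CT", "363732003"): "C0001403",
}

# Hypothetical records from two kinds of sources.
articles = [{"pmid": "example-1", "mesh": "D000224"}]
patients = [{"patient": "example-A", "snomed": "363732003"}]


def link_by_concept(articles, patients):
    """Pair articles and patient records whose codes map to the same concept."""
    by_cui = {}
    for article in articles:
        cui = TO_CUI.get(("MeSH", article["mesh"]))
        if cui is not None:
            by_cui.setdefault(cui, []).append(article["pmid"])
    links = []
    for record in patients:
        cui = TO_CUI.get(("SNOMED CT", record["snomed"]))
        for pmid in by_cui.get(cui, []):
            links.append((record["patient"], pmid, cui))
    return links
```

Without the mapping table, the MeSH-coded article and the SNOMED CT-coded patient record share no identifier and cannot be joined.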

SLIDE 38

Mappings

• Created manually (e.g., UMLS)
  • Purpose
  • Directionality
• Created automatically (e.g., BioPortal)
  • Lexically: ambiguity, normalization
  • Semantically: lack of / incomplete formal definitions
• Key to enabling semantic interoperability
• Enabling resource for the Semantic Web
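Lexical mapping can be sketched as matching terms on a normalized key. The normalization below (lowercasing, dropping possessives and punctuation, sorting words) is a deliberately crude stand-in chosen for illustration; it is not the procedure used by the UMLS or BioPortal, and the function names are hypothetical.

```python
import re


def normalize(term):
    """Crude lexical key: lowercase, drop possessives and punctuation, sort words."""
    term = term.lower()
    term = re.sub(r"['’]s\b", "", term)        # "addison's" -> "addison"
    term = re.sub(r"[^a-z0-9 ]+", " ", term)   # remaining punctuation -> space
    return " ".join(sorted(term.split()))


def lexical_matches(terms_a, terms_b):
    """Pairs of codes from two vocabularies whose names share a normalized key."""
    index = {}
    for code, name in terms_a.items():
        index.setdefault(normalize(name), []).append(code)
    return [
        (code_a, code_b)
        for code_b, name in terms_b.items()
        for code_a in index.get(normalize(name), [])
    ]
```

With this key, "Addison Disease" (D000224) and "Addison's disease" (363732003) match. The same looseness is also the source of the ambiguity the slide mentions: ignoring word order or punctuation can conflate terms that mean different things.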

SLIDE 39

Challenging issues

Permanent identifiers for biomedical entities

SLIDE 40

Identifying biomedical entities

• Multiple identifiers for the same entity in different ontologies
• Barrier to data integration in general
  • Data annotated to different ontologies cannot “recombine”
  • Need for mappings across ontologies
• Barrier to data integration in the Semantic Web
  • Multiple possible identifiers for the same entity
    • Depending on the underlying representational scheme (URI vs. LSID)
    • Depending on who creates the URI

SLIDE 41

Possible solutions

• PURL http://purl.org
  • One level of indirection between developers and users
  • Independence from local constraints at the developer’s end
• The institution creating a resource is also responsible for minting URIs
  • E.g., URI for genes in Entrez Gene
• Guidelines: “URI note”
  • W3C Health Care and Life Sciences Interest Group
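The "one level of indirection" idea behind persistent URLs can be sketched as a redirect table: users cite a stable identifier, and only the table changes when the resource moves. Every URL and function name below is hypothetical; this is not the API of purl.org or of any real resolver.

```python
# Persistent identifier -> current location. All URLs here are hypothetical.
REDIRECTS = {
    "http://purl.example.org/gene/9215": "http://server-a.example.org/genes?id=9215",
}


def resolve(persistent_url):
    """Return the current location for a persistent identifier."""
    return REDIRECTS[persistent_url]


def relocate(persistent_url, new_location):
    """The resource moved: update the redirect, keep the identifier unchanged."""
    REDIRECTS[persistent_url] = new_location
```

Data annotated with the persistent identifier stays valid after a call to `relocate`, because the identifier itself never changes.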

SLIDE 42

Challenging issues

Other issues

SLIDE 43

Availability

• Many ontologies are freely available
• The UMLS is freely available for research purposes
  • Cost-free license required
• Licensing issues can be tricky
  • SNOMED CT is freely available in member countries of the IHTSDO
• Being freely available
  • Is a requirement for the Open Biomedical Ontologies (OBO)
  • Is a de facto prerequisite for Semantic Web applications

SLIDE 44

Discoverability

• Ontology repositories
  • UMLS: 143 source vocabularies (biased towards healthcare applications)
  • NCBO BioPortal: ~100 ontologies (biased towards biological applications)
  • Limited overlap between the two repositories
• Need for discovery services

SLIDE 45

Formalism

• Several major formalisms
  • Web Ontology Language (OWL): NCI Thesaurus
  • OBO format: most OBO ontologies
  • UMLS Rich Release Format (RRF): UMLS, RxNorm
• Conversion mechanisms
  • OBO to OWL
  • LexGrid (import/export to LexGrid internal format)

SLIDE 46

Ontology integration

• Post hoc integration, from the bottom up
  • UMLS approach
  • Integrates ontologies “as is”, including legacy ontologies
  • Facilitates the integration of the corresponding datasets
• Coordinated development of ontologies
  • OBO Foundry approach
  • Ensures consistency ab initio
  • Excludes legacy ontologies

SLIDE 47

Quality

• Quality assurance in ontologies is still imperfectly defined
  • Difficult to define outside a use case or application
• Several approaches to evaluating quality
  • Collaboratively, by users (Web 2.0 approach)
    • Marginal notes enabled by BioPortal
  • Centrally, by experts
    • OBO Foundry approach
• Important factors besides quality
  • Governance
  • Installed base / Community of practice

SLIDE 48

Thinking outside the integration box

The Butte approach

SLIDE 49

Integrating genomic and clinical data

• No genomic data available for most patients
• No precise clinical data available associated with most genomic data (GWAS excepted)

[Figure labels: Genomic data; Clinical data]

SLIDE 50

Integrating genomic and clinical data

Genomic data

SLIDE 51

Integrating genomic and clinical data

Genomic data

Upregulated genes Diseases (extracted from text + MeSH terms)

SLIDE 52

Integrating genomic and clinical data

Clinical data Genomic data

Coded discharge summaries Laboratory data Upregulated genes Diseases (extracted from text + MeSH terms)

SLIDE 53

The Butte approach Methods

Courtesy of David Chen, Butte Lab

SLIDE 54

The Butte approach Results

Courtesy of David Chen, Butte Lab

SLIDE 55

The Butte approach

• Extremely rough methods
  • No pairing between genomic and clinical data
  • Text mining
  • Mapping between SNOMED CT and ICD 9-CM through UMLS
  • Reuse of ICD 9-CM codes assigned for billing purposes
• Extremely preliminary results
  • Rediscovery more than discovery
• Extremely promising nonetheless

SLIDE 56

The Butte approach: References

• Dudley J, Butte AJ. "Enabling integrative genomic analysis of high-impact human diseases through text mining." Pac Symp Biocomput 2008; 580-91
• Chen DP, Weber SC, Constantinou PS, Ferris TA, Lowe HJ, Butte AJ. "Novel integration of hospital electronic medical records and gene expression measurements to identify genetic markers of maturation." Pac Symp Biocomput 2008; 243-54
• Butte AJ. "Medicine. The ultimate model organism." Science 2008; 320(5874): 325-7

SLIDE 57

Conclusions

• Ontologies are enabling resources for data integration
• Standardization works
  • Grass roots effort (GO)
  • Regulatory context (ICD 9-CM)
• Bridging across resources is crucial
  • Ontology integration resources / strategies (UMLS, BioPortal / OBO Foundry)
• Massive amounts of imperfect data integrated with rough methods might still be useful

SLIDE 58

Medical Ontology Research

Olivier Bodenreider Lister Hill National Center for Biomedical Communications Bethesda, Maryland - USA

Contact: olivier@nlm.nih.gov

Web: mor.nlm.nih.gov