Knowledge Modeling and its Application in Life Sciences: A Tale of two ontologies. Satya S. Sahoo, Chris Thomas, Amit P. Sheth, William S. York, Samir Tartir. Paper presented at the 15th International World Wide Web Conference, Edinburgh, Scotland, May 25, 2006


slide-1
SLIDE 1

Knowledge Modeling and its Application in Life Sciences: A Tale of two ontologies

Bioinformatics for Glycan Expression
Integrated Technology Resource for Biomedical Glycomics, NCRR/NIH

Satya S. Sahoo, Chris Thomas, Amit P. Sheth, William S. York, Samir Tartir

Paper presented at the 15th International World Wide Web Conference, Edinburgh, Scotland, May 25, 2006

slide-2
SLIDE 2

Outline

  • Background
  • Ontology Structure
  • Ontology Population: Knowledge base
  • Ontology Size Measures
  • Applications in Semantic Bioinformatics
  • Conclusions
slide-3
SLIDE 3

Background: glycomics

  • Study of structure, function and quantity of ‘complex carbohydrate’ synthesized by an organism
  • Carbohydrates added to basic protein structure: glycosylation

Figure: folded protein structure (schematic)

slide-4
SLIDE 4

Outline

  • Background
  • Ontology Structure
  • Ontology Population: Knowledge base
  • Ontology Size Measures
  • Applications in Semantic Bioinformatics
  • Conclusions
slide-5
SLIDE 5

Requirements from ontologies

  • Storing, sharing of data + reasoning over biological data → logical rigor
  • Expressive as well as decidable language → OWL-DL
  • Incorporation of real world knowledge → ontology population
  • Ensure amenability to alignment with existing bio-medical ontologies

slide-6
SLIDE 6

GlycO ontology

  • Challenge: model hundreds of thousands of complex carbohydrate entities
  • But the differences between the entities are small (e.g. just one component)
  • How to model all the concepts but preclude redundancy → ensure maintainability, scalability

slide-7
SLIDE 7

GlycoTree

  • N. Takahashi and K. Kato, Trends in Glycosciences and Glycotechnology, 15: 235-251

Figure (glycan structure, branches joining at "+"):

β-D-GlcpNAc-(1-2)-α-D-Manp-(1-6)+
                                 |
                                 β-D-Manp-(1-4)-β-D-GlcpNAc-(1-4)-β-D-GlcpNAc
                                 |
β-D-GlcpNAc-(1-4)-α-D-Manp-(1-3)+
                  |
β-D-GlcpNAc-(1-2)+

slide-8
SLIDE 8

ProPreO ontology

  • Two aspects of glycoproteomics:
      • What is it? → identification
      • How much of it is there? → quantification
  • Heterogeneity in data generation process, instrumental parameters, formats
  • Need data and process provenance → ontology-mediated provenance
  • Hence, ProPreO models both the glycoproteomics experimental process and attendant data

slide-9
SLIDE 9

Ontology-mediated provenance

Mass Spectrometry (MS) Data: ms/ms peaklist data

parent ion m/z    parent ion abundance    parent ion charge
830.9570          194.9604                2

fragment ion m/z    fragment ion abundance
580.2985            0.3592
688.3214            0.2526
779.4759            38.4939
784.3607            21.7736
1543.7476           1.3822
1544.7595           2.9977
1562.8113           37.4790
1660.7776           476.5043

slide-10
SLIDE 10

Ontology-mediated provenance

Semantically Annotated MS Data

<ms-ms_peak_list>
  <parameter instrument="micromass_QTOF_2_quadropole_time_of_flight_mass_spectrometer" mode="ms-ms"/>
  <parent_ion m-z="830.9570" abundance="194.9604" z="2"/>
  <fragment_ion m-z="580.2985" abundance="0.3592"/>
  <fragment_ion m-z="688.3214" abundance="0.2526"/>
  <fragment_ion m-z="779.4759" abundance="38.4939"/>
  <fragment_ion m-z="784.3607" abundance="21.7736"/>
  <fragment_ion m-z="1543.7476" abundance="1.3822"/>
  <fragment_ion m-z="1544.7595" abundance="2.9977"/>
  <fragment_ion m-z="1562.8113" abundance="37.4790"/>
  <fragment_ion m-z="1660.7776" abundance="476.5043"/>
</ms-ms_peak_list>

Figure callout on the XML: Ontological Concepts
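
To illustrate how such an annotated peak list could be consumed, here is a minimal Python sketch that reads the XML from this slide with the standard library. The function name `parse_peak_list` and the dictionary layout are assumptions made for illustration; they are not part of ProPreO or of the authors' tooling. The slide's typographic quotes are replaced with straight quotes so the XML parses.

```python
import xml.etree.ElementTree as ET

# Peak list copied from the slide (straight quotes substituted for typographic ones).
PEAK_LIST_XML = """
<ms-ms_peak_list>
  <parameter instrument="micromass_QTOF_2_quadropole_time_of_flight_mass_spectrometer" mode="ms-ms"/>
  <parent_ion m-z="830.9570" abundance="194.9604" z="2"/>
  <fragment_ion m-z="580.2985" abundance="0.3592"/>
  <fragment_ion m-z="688.3214" abundance="0.2526"/>
  <fragment_ion m-z="779.4759" abundance="38.4939"/>
  <fragment_ion m-z="784.3607" abundance="21.7736"/>
  <fragment_ion m-z="1543.7476" abundance="1.3822"/>
  <fragment_ion m-z="1544.7595" abundance="2.9977"/>
  <fragment_ion m-z="1562.8113" abundance="37.4790"/>
  <fragment_ion m-z="1660.7776" abundance="476.5043"/>
</ms-ms_peak_list>
"""


def parse_peak_list(xml_text):
    """Return instrument parameters, parent ion and fragment ions as plain Python data."""
    root = ET.fromstring(xml_text.strip())
    parameter = root.find("parameter")
    parent = root.find("parent_ion")
    fragments = [
        (float(ion.get("m-z")), float(ion.get("abundance")))
        for ion in root.findall("fragment_ion")
    ]
    return {
        "instrument": parameter.get("instrument"),
        "mode": parameter.get("mode"),
        "parent": {
            "mz": float(parent.get("m-z")),
            "abundance": float(parent.get("abundance")),
            "charge": int(parent.get("z")),
        },
        "fragments": fragments,
    }


peaks = parse_peak_list(PEAK_LIST_XML)
# The most abundant fragment ion (the base peak of the fragment list).
base_peak = max(peaks["fragments"], key=lambda ion: ion[1])
print(peaks["parent"], base_peak)
```

Because the instrument and mode travel with the data as attributes, a consumer can filter or group peak lists by provenance without consulting a separate record.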

slide-11
SLIDE 11

Compatibility with existing Biomedical ontologies

  • Top level classes are modeled according to the Basic Formal Ontology (BFO) approach
  • Taxonomy of relationships and multiple restrictions per class → accuracy
  • Hence, both GlycO and ProPreO are compatible with ontologies that follow the BFO approach
  • Exploring alignment with ontologies listed at Open Biomedical Ontologies (OBO)

slide-12
SLIDE 12

Outline

  • Background
  • Ontology Structure
  • Ontology Population: Knowledge base
  • Ontology Size Measures
  • Applications in Semantic Bioinformatics
  • Conclusions
slide-13
SLIDE 13

GlycO population

  • Multiple data sources used in populating the ontology
      • KEGG (Kyoto Encyclopedia of Genes and Genomes)
      • SWEETDB
      • CarbBank database
  • Each data source has a different schema for storing data
  • There is significant overlap of instances in the data sources
  • Hence, entity disambiguation and a common representational format are needed

slide-14
SLIDE 14

GlycO population

Flowchart elements (figure):
  • Semagix Freedom knowledge extractor
  • Instance Data
  • Has CarbBank ID? (YES / NO)
  • IUPAC to LINUCS
  • LINUCS to GLYDE
  • Compare to Knowledge Base
  • Already in KB? (YES: next Instance; NO: Insert into KB)

Example (LINUCS notation):

[][Asn]{[(4+1)][b-D-GlcpNAc]{[(4+1)][b-D-GlcpNAc]{[(4+1)][b-D-Manp]{[(3+1)][a-D-Manp]{[(2+1)][b-D-GlcpNAc]{}[(4+1)][b-D-GlcpNAc]{}}[(6+1)][a-D-Manp]{[(2+1)][b-D-GlcpNAc]{}}}}}}

slide-15
SLIDE 15

GlycO population

Flowchart elements (figure):
  • Semagix Freedom knowledge extractor
  • Instance Data
  • Has CarbBank ID? (YES / NO)
  • IUPAC to LINUCS
  • LINUCS to GLYDE
  • Compare to Knowledge Base
  • Already in KB? (YES: next Instance; NO: Insert into KB)

Example (GLYDE XML):

<Glycan>
  <aglycon name="Asn"/>
  <residue link="4" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc">
    <residue link="4" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc">
      <residue link="4" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="Man">
        <residue link="3" anomeric_carbon="1" anomer="a" chirality="D" monosaccharide="Man">
          <residue link="2" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc">
          </residue>
          <residue link="4" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc">
          </residue>
        </residue>
        <residue link="6" anomeric_carbon="1" anomer="a" chirality="D" monosaccharide="Man">
          <residue link="2" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc">
          </residue>
        </residue>
      </residue>
    </residue>
  </residue>
</Glycan>
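
The "LINUCS to GLYDE" step can be sketched as a small recursive parser. The Python code below is only an illustration under stated assumptions, not the converter used in the project: it assumes every LINUCS node has the form `[link][residue]{children}`, that the root element is named `Glycan`, and that the ring-form letter (`p` or `f`) after the three-letter sugar code is dropped to obtain the `monosaccharide` attribute (`GlcpNAc` becomes `GlcNAc`). The function names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# LINUCS string from the GlycO population slide, with whitespace removed.
LINUCS = (
    "[][Asn]{[(4+1)][b-D-GlcpNAc]{[(4+1)][b-D-GlcpNAc]{[(4+1)][b-D-Manp]"
    "{[(3+1)][a-D-Manp]{[(2+1)][b-D-GlcpNAc]{}[(4+1)][b-D-GlcpNAc]{}}"
    "[(6+1)][a-D-Manp]{[(2+1)][b-D-GlcpNAc]{}}}}}}"
)

# Assumed node header: "[link][name]{".
NODE_HEAD = re.compile(r"\[([^\]]*)\]\[([^\]]*)\]\{")


def _parse_node(text, pos):
    """Parse one node starting at pos; return ((link, name, children), next_pos)."""
    head = NODE_HEAD.match(text, pos)
    if head is None:
        raise ValueError(f"expected '[link][residue]{{' at position {pos}")
    link, name = head.group(1), head.group(2)
    pos = head.end()
    children = []
    while pos < len(text) and text[pos] != "}":
        child, pos = _parse_node(text, pos)
        children.append(child)
    if pos >= len(text):
        raise ValueError("unbalanced braces in LINUCS string")
    return (link, name, children), pos + 1  # skip the closing brace


def parse_linucs(text):
    text = "".join(text.split())  # tolerate whitespace between tokens
    node, pos = _parse_node(text, 0)
    if pos != len(text):
        raise ValueError("trailing characters after LINUCS tree")
    return node


def _add_residue(parent, node):
    link, name, children = node
    link_match = re.fullmatch(r"\((\d+)\+(\d+)\)", link)
    name_match = re.fullmatch(r"([ab])-([DL])-(\w+)", name)
    if link_match is None or name_match is None:
        raise ValueError(f"unsupported residue: [{link}][{name}]")
    sugar = name_match.group(3)
    # Assumption: drop the ring-form letter (p/f) that follows the three-letter code.
    if len(sugar) > 3 and sugar[3] in "pf":
        sugar = sugar[:3] + sugar[4:]
    element = ET.SubElement(
        parent, "residue",
        link=link_match.group(1), anomeric_carbon=link_match.group(2),
        anomer=name_match.group(1), chirality=name_match.group(2),
        monosaccharide=sugar,
    )
    for child in children:
        _add_residue(element, child)


def linucs_to_xml(text):
    """Build a GLYDE-like element tree; the root of the LINUCS tree is the aglycon."""
    _, aglycon, children = parse_linucs(text)
    root = ET.Element("Glycan")
    ET.SubElement(root, "aglycon", name=aglycon)
    for child in children:
        _add_residue(root, child)
    return root


glycan = linucs_to_xml(LINUCS)
print(ET.tostring(glycan, encoding="unicode"))
```

Once every source record is in one nested form like this, comparing a new instance against the knowledge base reduces to comparing trees, which is what makes a common representational format useful for disambiguation.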

slide-16
SLIDE 16

ProPreO population: transformation to RDF

Figure: Scientific Data → Computational Methods → Ontology instances

slide-17
SLIDE 17

ProPreO population: transformation to RDF

Figure columns: Scientific Data, Computational Methods, RDF
Key: Protein Path, Peptide Path

Scientific Data
  • Protein Data: amino-acid sequence

Computational Methods
  • Calculate Chemical Mass
  • Calculate Monoisotopic Mass
  • Determine N-glycosylation Consensus
  • Extract Peptide Amino-acid Sequence from Protein Amino-acid Sequence

RDF
  • Chemical Mass RDF
  • Monoisotopic Mass RDF
  • Amino-acid Sequence RDF
  • “Protein RDF”: chemical mass, monoisotopic mass, amino-acid sequence, n-glycosylation consensus
  • “Peptide RDF”: chemical mass, monoisotopic mass, amino-acid sequence, n-glycosylation consensus, parent protein
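
Two of the computational methods named on this slide can be sketched in a few lines of Python. The code below is a rough illustration, not the project's implementation: it computes a peptide's monoisotopic mass from standard residue masses plus one water, finds N-glycosylation consensus sites using the usual N-X-S/T pattern (X not proline), and emits the results as subject-predicate-object tuples. The peptide `LNGTK`, the identifiers, and the `ex:` predicate names are hypothetical and are not ProPreO terms.

```python
import re

# Standard monoisotopic residue masses (Da) of the 20 common amino acids.
MONOISOTOPIC_RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water per peptide (N-terminal H plus C-terminal OH)

# N-X-S/T with X != P; the lookahead lets overlapping sites be reported.
SEQUON = re.compile(r"(?=N[^P][ST])")


def monoisotopic_mass(sequence):
    """Monoisotopic mass of an unmodified, neutral peptide."""
    return sum(MONOISOTOPIC_RESIDUE_MASS[aa] for aa in sequence) + WATER


def n_glycosylation_sites(sequence):
    """1-based positions of asparagines that start an N-X-S/T consensus."""
    return [match.start() + 1 for match in SEQUON.finditer(sequence)]


def peptide_triples(peptide_id, sequence, parent_protein_id):
    """Describe a peptide as (subject, predicate, object) tuples; predicates are hypothetical."""
    triples = [
        (peptide_id, "ex:amino_acid_sequence", sequence),
        (peptide_id, "ex:monoisotopic_mass", round(monoisotopic_mass(sequence), 5)),
        (peptide_id, "ex:parent_protein", parent_protein_id),
    ]
    for position in n_glycosylation_sites(sequence):
        triples.append((peptide_id, "ex:n_glycosylation_consensus_at", position))
    return triples


triples = peptide_triples("peptide:example_1", "LNGTK", "protein:example")
for triple in triples:
    print(triple)
```

Computing these values once and storing them as instance data is what lets later queries select peptides by mass or by glycosylation site without re-running the calculations.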

slide-18
SLIDE 18

Outline

  • Background
  • Ontology Structure
  • Ontology Population: Knowledge base
  • Ontology Size Measures
  • Applications in Semantic Bioinformatics
  • Conclusions
slide-19
SLIDE 19

Measures of ontology size

                                  GlycO     ProPreO
Classes                           318       390
Properties (datatype & object)    82        32
Property restrictions             333       172
Instances                         737       3.1 million
Assertions                        19,893    18.6 million

slide-20
SLIDE 20

Outline

  • Background
  • Ontology Structure
  • Ontology Population: Knowledge base
  • Ontology Size Measures
  • Applications in Semantic Bioinformatics
  • Conclusions
slide-21
SLIDE 21

Pathways do not need to be explicitly defined in GlycO. The residue-, glycan-, enzyme- and reaction descriptions contain all the knowledge necessary to infer pathways

Glycan structure and function

Biological pathways

slide-22
SLIDE 22

Zooming in a little…

The N-Glycan with KEGG ID 00015 is the substrate to the reaction R05987, which is catalyzed by an enzyme of the class EC 2.4.1.145. The product of this reaction is the Glycan with KEGG ID 00020.

Figure labels: Reaction R05987; catalyzed by enzyme 2.4.1.145; adds_glycosyl_residue N-glycan_b-D-GlcpNAc_13
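
The idea that pathways follow from reaction descriptions can be illustrated by chaining reactions whose product is the substrate of another reaction. The Python sketch below is a plain graph traversal under stated assumptions, not the reasoning mechanism used with GlycO. The first reaction is the one described on this slide; the second reaction and its identifiers are hypothetical, added only so that there is something to chain.

```python
from collections import defaultdict

REACTIONS = [
    # From the slide: glycan 00015 is converted to glycan 00020 by reaction R05987.
    {"id": "R05987", "enzyme": "EC 2.4.1.145", "substrate": "00015", "product": "00020"},
    # Hypothetical follow-on reaction, for illustration only.
    {"id": "R_HYPOTHETICAL", "enzyme": "EC (hypothetical)", "substrate": "00020",
     "product": "G_HYPOTHETICAL"},
]


def infer_pathways(reactions, start_glycan):
    """Enumerate maximal chains of reaction ids reachable from start_glycan."""
    by_substrate = defaultdict(list)
    for reaction in reactions:
        by_substrate[reaction["substrate"]].append(reaction)

    pathways = []

    def extend(glycan, path, seen):
        # Skip reactions that would revisit a glycan, so cycles cannot loop forever.
        candidates = [r for r in by_substrate.get(glycan, []) if r["product"] not in seen]
        if not candidates:
            if path:
                pathways.append(path)
            return
        for reaction in candidates:
            extend(reaction["product"], path + [reaction["id"]], seen | {reaction["product"]})

    extend(start_glycan, [], {start_glycan})
    return pathways


pathways = infer_pathways(REACTIONS, "00015")
print(pathways)
```

No pathway is stored anywhere in the input; the chain emerges only from the substrate and product links, which mirrors the point made about GlycO.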

slide-23
SLIDE 23

Semantic Web Process to incorporate provenance

Process steps (figure); the I and O labels mark each step's input and output, and the computational steps are run by Agents:
  • Biological Sample Analysis by MS/MS → Raw Data
  • Raw Data to Standard Format → Standard Format Data
  • Data Pre-process → Filtered Data
  • DB Search (Mascot/Sequest) → Search Results
  • Results Post-process (ProValt) → Final Output

Other figure labels: Storage, Biological Information, Semantic Annotation Applications
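
One way to picture a process that incorporates provenance is a pipeline that records, for every step, which agent ran it and which named data set went in and came out. The Python sketch below is a toy under stated assumptions: the step and data names follow the slide, but the agent names, the step functions (including the abundance threshold and the stand-in for a Mascot/Sequest search), and the record layout are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProvenanceRecord:
    step: str
    agent: str
    input_name: str
    output_name: str


def to_standard_format(raw_peaks):
    return [{"mz": mz, "abundance": abundance} for mz, abundance in raw_peaks]


def pre_process(peaks, min_abundance=1.0):
    # Hypothetical filter: drop low-abundance fragment ions.
    return [peak for peak in peaks if peak["abundance"] >= min_abundance]


def db_search(peaks):
    # Placeholder for a database search; no real search engine is called.
    return {"peaks_searched": len(peaks), "hits": []}


def post_process(results):
    return dict(results, validated=True)


# (step name, agent, name of the output data, function); agent names are made up.
STEPS = [
    ("Raw Data to Standard Format", "agent_1", "Standard Format Data", to_standard_format),
    ("Data Pre-process", "agent_2", "Filtered Data", pre_process),
    ("DB Search", "agent_3", "Search Results", db_search),
    ("Results Post-process", "agent_4", "Final Output", post_process),
]


def run_with_provenance(raw_data):
    """Run every step in order and log one provenance record per step."""
    data, data_name, log = raw_data, "Raw Data", []
    for step, agent, output_name, func in STEPS:
        data = func(data)
        log.append(ProvenanceRecord(step, agent, data_name, output_name))
        data_name = output_name
    return data, log


# Fragment ions (m/z, abundance) taken from the MS data slide.
RAW = [(580.2985, 0.3592), (688.3214, 0.2526), (779.4759, 38.4939), (784.3607, 21.7736),
       (1543.7476, 1.3822), (1544.7595, 2.9977), (1562.8113, 37.4790), (1660.7776, 476.5043)]

final_output, provenance = run_with_provenance(RAW)
for record in provenance:
    print(record)
```

Because each record names both its input and its output, the log can be walked backwards from the final output to the raw data.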

slide-24
SLIDE 24

Overview - integrated semantic information system

  • Formalized domain knowledge is in ontologies
  • Data is annotated using concepts from the ontologies
  • Semantic annotations enable identification and extraction of relevant information
  • Relationships allow discovery of knowledge that is implicit in the data

slide-25
SLIDE 25

Outline

  • Background
  • Ontology Structure
  • Ontology Population: Knowledge base
  • Ontology Size Measures
  • Applications in Semantic Bioinformatics
  • Conclusions
slide-26
SLIDE 26

Conclusions

  • GlycO uses simple ‘canonical’ entities to build complex structures, thereby avoiding redundancy → ensures maintainability and scalability
  • ProPreO is the first comprehensive ontology for data and process provenance in glycoproteomics
  • Web process for entity disambiguation and common representational format → populated ontology from disparate data sources
  • The two ontologies are among the largest populated ontologies in life sciences

slide-27
SLIDE 27

Data, ontologies, more publications at Biomedical Glycomics project web site: http://lsdis.cs.uga.edu/projects/glycomics/

Thank You