
On the integration of biomedical knowledge bases: problems and solutions



  1. On the integration of biomedical knowledge bases: problems and solutions
     M. Fato, I. Porro, E. Giunchiglia, L. Vassalli
     Luca Vassalli, lucanl@star.dist.unige.it
     Systems and Technologies for Automated Reasoning laboratory, DIST, University of Genoa

  2. Outline
     - A collaboration between:
       - Systems and Technologies for Automated Reasoning laboratory, DIST, University of Genoa
       - Bioengineering and Bioimages laboratory (Biolab), DIST, University of Genoa
     - Brief introduction to the problem
     - Our research goal
     - The different possible solutions
     - BioGIS (Bioinformatic GAV Integration System)
       - Rewriting rules
       - Front end
       - Internal structure
     - Conclusions
     13/06/2007 Luca Vassalli

  3. Data Sources Integration
     “The user should be able to focus on what he is looking for rather than thinking how to obtain it” (A. Levy)
     - Issues:
       - Overlapping and mismatching data
       - Syntactic differences between sources
       - Different layouts of the sources (chart based, text based, etc.)
       - Lack of a common exchange format
       - Unknown internal structure of the data sources
       - The Internet is not a stable environment
       - It is sometimes hard to identify the same element in different systems

  4. BioGIS
     - The goal:
       - Integration of the human metabolic pathways
     - The sources:
       - KEGG (M. Kanehisa et al., 2002)
       - Reactome (G. Joshi-Tope et al., 2005)
     - The user:
       - Biolab portal (http://grid.bio.dist.unige.it)

  5. Modelling the data sources: Global as view (Garcia-Molina et al., 1997)
     - Two data sources:
       - DB1 (Pathway_Name, Pathway_ID1, Description, Molecule)
       - DB2 (Pathway_ID2, Pathway_Name, Organism)
     - Mediated schema relations:
       - Pathway (Pathway_Name, Description, Organism) :-
           DB1 (Pathway_Name, Pathway_ID1, Description, Molecule),
           DB2 (Pathway_ID2, Pathway_Name, Organism)
       - Connection_Molecule (Pathway_Name, Molecule) :-
           DB1 (Pathway_Name, Pathway_ID1, Description, Molecule)
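
The two GAV rules on this slide can be read as ordinary queries over the sources: each mediated relation is computed by joining or projecting source tuples. The following is a minimal sketch of that reading, not the BioGIS implementation. The Python names (`DB1Row`, `DB2Row`, `pathway`, `connection_molecule`) and the toy rows are assumptions made for illustration; the join on `Pathway_Name` follows from the shared variable in the rule.

```python
from typing import NamedTuple


class DB1Row(NamedTuple):
    pathway_name: str
    pathway_id1: str
    description: str
    molecule: str


class DB2Row(NamedTuple):
    pathway_id2: str
    pathway_name: str
    organism: str


def pathway(db1, db2):
    """Pathway(Name, Description, Organism) :- DB1(...), DB2(...).

    The shared variable Pathway_Name in the rule body is the join key.
    """
    db2_by_name = {}
    for row in db2:
        db2_by_name.setdefault(row.pathway_name, []).append(row)
    result = set()
    for r1 in db1:
        for r2 in db2_by_name.get(r1.pathway_name, []):
            result.add((r1.pathway_name, r1.description, r2.organism))
    return result


def connection_molecule(db1):
    """Connection_Molecule(Name, Molecule) :- DB1(...): a projection of DB1."""
    return {(row.pathway_name, row.molecule) for row in db1}


# Toy rows, invented for illustration only.
DB1 = [
    DB1Row("Glycolysis", "p1", "Glucose breakdown", "HK1"),
    DB1Row("Glycolysis", "p1", "Glucose breakdown", "PFKM"),
]
DB2 = [DB2Row("q7", "Glycolysis", "Homo sapiens")]

pathways = pathway(DB1, DB2)
links = connection_molecule(DB1)
```

Because every mediated relation already comes with a query over the sources, answering a user query only requires substituting these definitions, which is why GAV needs no containment checking.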

  6. Modelling the data sources: Local as view (O. Duschka et al., 1997)
     - DB1 (Pathway_Name, Pathway_ID1, Description, Molecule) :-
         Pathway (Pathway_Name, Description, Organism, Pathway_ID1, Pathway_ID2),
         Connection_Molecule (Pathway_Name, Molecule, Class),
         Class = "genes"
     - DB2 (Pathway_ID2, Pathway_Name, Organism) :-
         Pathway (Pathway_Name, Description, Organism, Pathway_ID1, Pathway_ID2),
         Organism = "homo sapiens"
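
In LAV the direction is reversed: each source is described as a view over the mediated schema. The sketch below only illustrates what the two rules on this slide assert about the contents of DB1 and DB2, given a hypothetical instance of the mediated relations. It does not implement query answering, which in LAV requires rewriting the query in terms of the views. All function names and toy tuples are assumptions made for illustration.

```python
# Hypothetical instance of the mediated schema (toy data, invented).
# Pathway(Pathway_Name, Description, Organism, Pathway_ID1, Pathway_ID2)
PATHWAY = [
    ("Glycolysis", "Glucose breakdown", "homo sapiens", "p1", "q7"),
    ("Glycolysis", "Glucose breakdown", "mus musculus", "p2", "q8"),
]
# Connection_Molecule(Pathway_Name, Molecule, Class)
CONNECTION_MOLECULE = [
    ("Glycolysis", "HK1", "genes"),
    ("Glycolysis", "pyruvate", "compounds"),
]


def db1_view(pathway, connection_molecule):
    """DB1 holds pathway/molecule pairs, restricted to Class = "genes"."""
    return {
        (name, id1, description, molecule)
        for (name, description, _organism, id1, _id2) in pathway
        for (linked_name, molecule, cls) in connection_molecule
        if linked_name == name and cls == "genes"
    }


def db2_view(pathway):
    """DB2 holds only pathways with Organism = "homo sapiens"."""
    return {
        (id2, name, organism)
        for (name, _description, organism, _id1, id2) in pathway
        if organism == "homo sapiens"
    }


db1_contents = db1_view(PATHWAY, CONNECTION_MOLECULE)
db2_contents = db2_view(PATHWAY)
```

The selection conditions are part of the source description, so adding a new source means adding one more view without touching the existing ones.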

  7. A Comparison
     - GAV
       - Does not require containment checking (fast and reliable)
       - Modelling the system is somewhat awkward
       - Difficult to extend
     - LAV
       - Easy to extend
       - Unnecessary details in the model of the system
       - Requires containment checking (slow)
       - The algorithm may even be intractable
     - GLAV (M. Friedman et al., 1999)
       - Same complexity as LAV
       - Solves some drawbacks of the modelling phase

  8. BioGIS
     - Front end or ad hoc methods
     - Execution engine which iteratively calls the wrappers
     - A wrapper for each data source
     - Integration engine
     [Figure: architecture diagram with the labels Query in mediated schema, Ad hoc method call, Front end, Execution engine, Reactome wrapper, KEGG wrapper, Reactome WS, KEGG WS, Integration engine, Query, Answer]

  9. The information extracted
     - Two ad hoc families of methods:
       - getMoleculesForPathway
       - getPathwayForMolecules
     - Three global schema relations:
       - Pathway
       - Connection_Molecule
       - Reaction

  10. Front End
     - Queries have to follow a precise grammar
     [Figure: pipeline diagram with the labels Query, Lexer, Tokens, Parser, IR, Error Message, Execution engine]
     - Examples:
       - PATHWAY { GOTerm = " alanine metabolism " } END
       - PATHWAY { ReactomePathwayID = " 109606 " } , CONNECTION_MOLECULE { ReactomePathwayID = " 109606 " } END
       - CONNECTION_MOLECULE { UniqueID = " Q92934 " } END
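
The lexer and parser stages can be sketched in a few lines. The slides do not give the grammar itself, so the one used here is inferred from the three example queries only: one or more `RELATION { Attribute = "value" }` blocks separated by commas and closed by `END`. Treat it as an assumption, along with the `Atom` record, the token names, and the choice to trim whitespace inside the quotes.

```python
import re
from typing import NamedTuple


class Atom(NamedTuple):
    """One block of the query: RELATION { attribute = "value" }."""
    relation: str
    attribute: str
    value: str


# One token per match: a quoted string, a bare word, or a punctuation mark.
TOKEN_RE = re.compile(
    r'\s*(?:(?P<STRING>"[^"]*")|(?P<WORD>[A-Za-z_][A-Za-z_0-9]*)|(?P<PUNCT>[{}=,]))'
)


def tokenize(query):
    """Lexer: turn the query text into (kind, text) tokens."""
    query = query.rstrip()
    tokens, pos = [], 0
    while pos < len(query):
        match = TOKEN_RE.match(query, pos)
        if match is None:
            raise SyntaxError(f"cannot tokenize input at position {pos}")
        kind = match.lastgroup
        text = match.group(kind)
        if kind == "STRING":
            text = text[1:-1].strip()  # drop quotes and padding (assumption)
        tokens.append((kind, text))
        pos = match.end()
    return tokens


def parse(query):
    """Parser: build a list of Atom records or raise SyntaxError."""
    tokens = tokenize(query)
    pos = 0

    def expect(kind, text=None):
        nonlocal pos
        if pos >= len(tokens):
            raise SyntaxError("unexpected end of query")
        found_kind, found_text = tokens[pos]
        if found_kind != kind or (text is not None and found_text != text):
            raise SyntaxError(f"expected {text or kind}, found {found_text!r}")
        pos += 1
        return found_text

    atoms = []
    while True:
        relation = expect("WORD")
        expect("PUNCT", "{")
        attribute = expect("WORD")
        expect("PUNCT", "=")
        value = expect("STRING")
        expect("PUNCT", "}")
        atoms.append(Atom(relation, attribute, value))
        if pos < len(tokens) and tokens[pos] == ("PUNCT", ","):
            pos += 1  # another block follows
            continue
        break
    expect("WORD", "END")
    if pos != len(tokens):
        raise SyntaxError("trailing input after END")
    return atoms


atoms = parse('PATHWAY { GOTerm = " alanine metabolism " } END')
```

The list of `Atom` records plays the role of the intermediate representation handed to the execution engine, and a `SyntaxError` corresponds to the error-message branch of the diagram.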

  11. Internal structure
     - Execution engine:
       - Simple unfolding of the queries according to the GAV methodology
       - Ad hoc methods: concurrent threads which query the wrappers in parallel
     - Wrappers:
       - A class for every different data source relation. The information is retrieved from the sources and structured into objects.
     - Integration engine:
       - Pathways merged using the pathway names and the Gene Ontology terms
       - Molecules merged using the UniProt and COMPOUND ids
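
The combination of parallel wrapper calls and id-based merging can be sketched as follows. This is an illustration under stated assumptions, not the BioGIS code: the two wrapper functions are stand-ins that return canned records instead of calling the KEGG and Reactome web services, the record layout is invented, and only molecule merging on a shared identifier is shown.

```python
from concurrent.futures import ThreadPoolExecutor


def kegg_wrapper(pathway_name):
    # Stand-in for a real web-service call; the records are toy data.
    return [
        {"id": "Q92934", "name": "name-from-kegg", "source": "KEGG"},
        {"id": "ID_A", "name": "only-in-kegg", "source": "KEGG"},
    ]


def reactome_wrapper(pathway_name):
    # Stand-in for a real web-service call; the records are toy data.
    return [{"id": "Q92934", "name": "name-from-reactome", "source": "Reactome"}]


def query_wrappers(pathway_name, wrappers):
    """Call every wrapper in its own thread and collect all records."""
    with ThreadPoolExecutor(max_workers=len(wrappers)) as pool:
        futures = [pool.submit(wrapper, pathway_name) for wrapper in wrappers]
        return [record for future in futures for record in future.result()]


def merge_molecules(records):
    """Merge records that share the same identifier (e.g. a UniProt id)."""
    merged = {}
    for record in records:
        entry = merged.setdefault(
            record["id"], {"id": record["id"], "names": set(), "sources": set()}
        )
        entry["names"].add(record["name"])
        entry["sources"].add(record["source"])
    return merged


records = query_wrappers("some pathway", [kegg_wrapper, reactome_wrapper])
molecules = merge_molecules(records)
```

Threads suit this workload because the wrappers spend most of their time waiting on the network, so the total latency approaches that of the slowest source rather than the sum of all of them.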

  12. Performance
     - Varies according to several factors:
       - The number of hits of the query
         - “Retrieve all the genes that take part in a pathway which matches the keyword ‘pyruvate’”: around 65 hits, 1 minute
         - “Retrieve all the genes that take part in a pathway which matches the keyword ‘metabolism’”: thousands of hits, half an hour
       - The state of the Reactome cache
       - The network latency
     - Better used in a chain of web services than as a standalone service available through a browser
