1
HPC Asia 2004 BioGrid workshop Development of a Database System for - - PowerPoint PPT Presentation
HPC Asia 2004 BioGrid workshop Development of a Database System for - - PowerPoint PPT Presentation
HPC Asia 2004 BioGrid workshop Development of a Database System for Drug Discovery by Employing Grid Technology July 21,2004 Masato Kitajima1,2 Yukako Tohsato 1, Takahiro Kosaka 1, Kazuto Yamazaki 3,Reiji Teramoto 3, Susumu Date 1, Shinji
2
Databases in the Life Sciences The amount of data and the number of databases in life science have dramatically increased in just a few years
Nucleic Acids Research DB Issue
1 2 3 4 5 6 1 9 9 61 9 9 71 9 9 81 9 9 92 02 12 22 32 4
- No. of DB
Year
3
20,000,000 40,000,000 60,000,000 80,000,000 100,000,000 120,000,000 140,000,000
2004/4/21 2004/4/28 2004/5/5 2004/5/12 2004/5/19 2004/5/26 2004/6/2 2004/6/9 2004/6/16
Amount of updates in two months of a DNA database
bases date
4
Common Database Problems in the Life Sciences
・ Increase in the amount of data puts a great load to the administrator who updates the database ・ A slight change in the schema of one of the databases requires a complete rebuild of the whole system ・ A considerable amount of time and resources wasted in just updating the database
5
Different Ways Of Integrating Distributed Databases
- Hyperlinked Database
– Most commonly used for linking databases – Hyperlinks cannot carry special meanings
- Integrated Database(
- ex. NCBI’s Entrez)
– User only needs to access a single database – Changes in the schema of one database will prompt the rebuilding of the whole database system
- Heterogeneous Database(
- ex. Stanford Univ.’s TSIMMIS)
– Builds a “wrapper” on each of the databases to be accessed by a mediator (Changes in the schema of one database, only requires a change in the wrapper for that database) – Databases that use authentications and functionalities specific to life sciences(like homology searching and similarity searching) pose a problem in integration
6
- Unorganized structure of information
- Data in unformatted text
- Inconsistent use of terms on different databases
- Building of relationships between the databases
could only be done manually Common Problems in Linking the Databases
7
Proposal of a New Database System
Use of grid technology and Introduction of the concept of metadata Greatly helped in building mutual data relationships between databases in a distributed system
8
DBMS (RDB, XML DB)
SOAP/HTTP service creation API interactions
Analysis Registry GDSR Factory GDSF Grid Data Service GDS Overview of OGSA-DAI
OGSA( Open Grid Service Architecture Data Access and Integration)
9
Compounds (drugs) are activated by binding to proteins in a cell. Drug Discovery Process is to find chemical compounds that have good effects on their target proteins. The process is time-consuming and expensive.
5~10 million $ (10~15 years)
Application to the drug discovery process
Cell Protein Compound (Drug)
Drug Target Identification Target Validation Lead Identification Lead Optimization Pre Clinical Clinical
- Num. of Compounds10,000
200 1
Genome-based Drug Discovery Process
10
Gene Function Analysis Target Validation Lead Identification Lead Optimization Pre-clinical Clinical Market
Disease DB
Gene/Protein Database
Interaction DB Compound DB
Genome DB ( Gene location, SNP)
Disease Modeling Genome Mapping/ Gene Finding Known Proteins Search
( Sequence・ Structure Similarity Search)
Proteins Compound Interaction Search Compound Search
( Structure Similarity Search)
Databases Needed in Genome-based Drug Discovery
Basic Genomic Research Gene Finding
11
Basic Genomic Research Gene Finding Gene Function Analysis Target Validation Lead Identification Lead Optimization Pre-clinical Clinical Market
Disease DB
Gene/Protein Database
Interaction DB Compound DB
Genome DB ( Gene location, SNP)
Disease Modeling Genome Mapping/ Gene Finding Known Proteins Search
( Sequence・ Structure Similarity Search)
Proteins・ Compound Interaction Search Compound Search
( Structure Similarity Search)
Semantic Gap Exists Between Databases and Their Corresponding Disciplines
Database relationship
Semantic Gap
12
Unification of Different Disciplines Through Metadata → Supports the Drug Discovery Process
Life Science Chemistry Medicine
Disease DB Genome DB Protein DB Compound DB
Linking Databases in Different Disciplines
Lead Identification
Metadata
Gene-Disease Mapping
Metadata
13
Life Science Chemistry Medicine
Disease DB Genome DB Protein DB Compound DB
Lead Identification
Metadata
Gene-Disease Mapping
Metadata
Linking Databases in Different Disciplines Linking Eleven Databases involved in Genomic Drug Discovery
NLM
- Medical
Encyclopedia Protein Research Foundation
- LITDB
DNA Databank
- f Japan
- DDBJ
- SwissProt
- PIR
- PDB
- ENZYME
- GPCR-DB
- NucleaRDB
- LGIC-DB
- MDL Drug
Data Report MDL
- MDL Drug
Data Report
14
Two-Level Implementation of the Metadata PDB PIR
MDDR Compound DB Protein Metadata Compound Metadata
Protein-Compound Interaction Metadata
The relationship between groups in each category level of Protein Metadata and Compound Metadata
15
Metadata as Implemented on the Drug Discovery Workflow
Basic Genomic Research Gene Finding Gene Function Analysis Target Validation Lead Identification Lead Optimization Pre- Clinical Clinical Market
Work Flow
Disease DB Gene-Protein DB Compound DB Drug MetabolismDB
Disease Metadata Protein/Compound Interaction Metadata Drug Metabolism Metadata
DB Server DB Server DB Server DB Server
Compound1 Agonist ReceptorA Ligand Relation Protein EnzymeⅠ Substrate DrugA Enzyme Relation Drug ReceptorA Active DiseaseA Target Relation Disease
16
Compound Metadata Service Protein Metadata Service Protein-Compound Interaction Metadata Service Protein Sequence Homology Search (BLAST) GDSF GDS Compound Structure Similarity Search ( Tanimoto index) Compound DB MDDR GDSF GDS Protein DB SwissProt GDSF GDS
Interaction DB
(Enzyme, GPCR-DB, NucleaRDB, LGIC-DB) Structure Keys ( Compound substructures)
Grid Service (Globus Toolkit 3) Search Portal( Tomcat) Web Browser HTTPS
Factory Factory Factory Factory Factory
search process
GDSF GDS
USER
Database Search Service (Servlet)
Protein DB PIR Protein DB PDB GDSF GDS
BLAST Search D B
Grid Data Service (OGSA-DAI) SOAP
Database System for Protein-Compound Interaction Search
17
Protein Name, Protein Family Class Protein or Protein Family Compound Interaction Extracted data from MDDR(MDL) * Schuffenhauer A, Zimmermann J, Stoop R, van der Vyver JJ, Lecchini S, Jacoby E. “An ontology for pharmaceutical ligands and its application for in silico screening and library design,” J Chem Inf Comput Sci. 2002 Jul-Aug;42(4):947-55.
Strategy Used in Protein-Compound Interaction Search
Ligand Ontology*
- ENZYME
- GPCR-DB
- NucleaRDB
- LGIC-DB
18
New Target Protein
Large Reference Set of Known Ligands
- f Homologous Target Protein
Interactions Search
Homology Search (BLAST,etc)
Candidate Ligands of New Target Protein
Schuffenhauer A, Floersheim P, Acklin P, Jacoby E., “Similarity metrics for ligands reflecting the similarity of the target proteins”, J Chem Inf Comput Sci. 2003 Mar-Apr;43(2):391-405.
Process Flow in Protein-Compound Interaction Search
Structure Similarity Search (ISIS SS, etc)
Compound Descriptors
Homologous Target Protein
with known ligands
ProteinDB Compound Library
19
Protein Compound
agonist
Binding Domain (ex.) rosiglitazone (ex.) PPARgamma Protein Compound
agonist
Binding Domain (ex.) fenofibrate (ex.) PPARalpha
Zf-C4 137-211 Hormone_rec 318-501 Zf-C4 100-174 Hormone_rec 281-464
(ex.) ragaglitazar Compound
dual agonist similarity similarity similarity
Example of Protein-Compound Interaction Search
Activity Homology Activity Homology
20
Protein-Compound Interaction Search System Website
21
Applications Available to the User
- Protein Sequence Search :
Retrieve the target protein’s sequence by specifying its Protein ID.
- Homology Search :
Search for proteins homologous to the target in the Protein DB.
- Protein-Compound Interaction Search :
Extract ligands that bind to the homologous proteins.
- Compound Search :
Search for new compounds that may possibly interact with the target protein, by structural similarity to the extracted ligands.
22
Sequence Search Homology Search
Protein Sequence Homology Search (BLAST) Protein Metadata Service
Interaction Search
Compound Metadata Service
Protein-Compound Interaction Metadata Service
Protein DB Search Portal (Servlet) Compound DB Interaction DB
Protein Sequence Information Homology Search Information Protein-Compound Interaction Search Information
Protein Metadata Service
Structure Similarity Search
Compound Structure Similarity Search (Tanimoto Index)
Compound Structure Search Information
GDS GDS GDS GDS
Grid Service (Globus Toolkit 3)
Grid Data Service (OGSA-DAI)
User Access (Web Browser)
Flow of User Access and Grid Service Execution
23
BioDataGrid
Actual Implementation of the Search System
・ OS:Red Hat Linux 9 ・ CPU: Pentium4( 2.4GHz) ・ Memory Size:4GB ・ Java SDK 1.4.1 ・ Globus Toolkit 3 beta ・ OGSA-DAI :Release 2.5 ・ Jakarta Tomcat 4.1.24 ・ MySQL 3.23.54
24
Sequence Search Homology Search
Protein Sequence Homology Search (BLAST) Protein Metadata Service
Interaction Search
Compound Metadata Service
Protein-Compound Interaction Metadata Service
Search Portal (Servlet) Compound DB
Protein Metadata Service
Structure Similarity Search
Compound Structure Similarity Search (Tanimoto Index) GDS GDS GDS GDS
Grid Service (Globus Toolkit 3)
Grid Data Service (OGSA-DAI) 15KB
Interaction DB Protein DB
1.8KB 12.7KB 2.8KB 23.3KB 11.7KB 15.4KB 3.2KB 12.7KB 2.7KB 39.6KB 3.9KB 30KB (14.1KB) 4.2KB (1.9KB) 9.1KB 2.5KB 3.4KB 1.4KB 7.7KB 2.7KB 20.5KB 2.9KB 41.8KB 2.7KB 1.6KB 4.7KB 1.6KB 4.7KB 1.6KB 4.7KB 1.6KB 4.7KB 1.6KB 4.7KB 3.5KB 1.5KB GDS GDS 30KB (6.6KB) 4.2KB (1.2KB)
+ + + +
4.7KB (4.7KB) 1.6KB (1.6KB)
+ + *values in parenthesis show average for displaying a single compound
Average Amount of Data Flow for Each Grid Services
25
Conclusion
- A data grid system that links together online
databases was proposed
- Actual linking of 11 databases in the Life Sciences
was explained
- An integrated heterogeneous database system
based on the workflow of the genome-based drug discovery process was discussed
- Use of the latest grid technology like Globus
Toolkit 3/OGSA-DAI in linking distributed databases was successfully proven
26
Future Works
・ Implementation of security technologies ・ Implementation of XML DBMS technologies ・ Improvement of the search program
27