HPC Asia 2004 BioGrid workshop: Development of a Database System for Drug Discovery by Employing Grid Technology




SLIDE 1

Development of a Database System for Drug Discovery by Employing Grid Technology

Masato Kitajima 1,2, Yukako Tohsato 1, Takahiro Kosaka 1, Kazuto Yamazaki 3, Reiji Teramoto 3, Susumu Date 1, Shinji Shimojo 4, Hideo Matsuda 1

1 Graduate School of Information Science and Technology, Osaka University
2 Fujitsu Kyushu System Engineering Limited
3 Research Division, Sumitomo Pharmaceuticals Co., Ltd.
4 Cybermedia Center, Osaka University

HPC Asia 2004 BioGrid workshop, July 21, 2004

SLIDE 2

Databases in the Life Sciences

The amount of data and the number of databases in the life sciences have increased dramatically in just a few years.

[Figure: number of databases listed in the Nucleic Acids Research DB Issue by year, 1996 to 2004. Axes: Year, No. of DB.]

SLIDE 3

Amount of updates in two months of a DNA database

[Figure: bases (vertical axis, 20,000,000 to 140,000,000) by date (horizontal axis, weekly from 2004/4/21 to 2004/6/16).]

SLIDE 4

Common Database Problems in the Life Sciences

  • The growing amount of data places a heavy load on the administrator who updates the database
  • A slight change in the schema of one of the databases requires a complete rebuild of the whole system
  • A considerable amount of time and resources is wasted just on updating the database

SLIDE 5

Different Ways of Integrating Distributed Databases

  • Hyperlinked Database

– Most commonly used for linking databases
– Hyperlinks cannot carry special meanings

  • Integrated Database (ex. NCBI’s Entrez)

– User only needs to access a single database
– Changes in the schema of one database prompt the rebuilding of the whole database system

  • Heterogeneous Database (ex. Stanford Univ.’s TSIMMIS)

– Builds a “wrapper” on each of the databases, to be accessed by a mediator (a change in the schema of one database only requires a change in the wrapper for that database)
– Databases that use authentication, or functionality specific to the life sciences (like homology searching and similarity searching), pose a problem in integration
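The wrapper/mediator idea from the heterogeneous approach can be sketched in a few lines. This is an illustration only, not TSIMMIS or the authors' code: the class names (`ProteinRecord`, `SourceAWrapper`, `SourceBWrapper`, `Mediator`) and the two toy source layouts are assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProteinRecord:
    """Common record shape exposed by the mediator (hypothetical)."""
    protein_id: str
    name: str


class SourceAWrapper:
    """Wraps a source whose rows are dicts like {'acc': ..., 'desc': ...}."""

    def __init__(self, rows):
        self._rows = rows

    def find_by_name(self, keyword):
        # Only this method knows source A's field names.
        return [ProteinRecord(r["acc"], r["desc"])
                for r in self._rows if keyword.lower() in r["desc"].lower()]


class SourceBWrapper:
    """Wraps a source whose rows are (identifier, title) tuples."""

    def __init__(self, rows):
        self._rows = rows

    def find_by_name(self, keyword):
        # Only this method knows source B's tuple layout.
        return [ProteinRecord(pid, title)
                for pid, title in self._rows if keyword.lower() in title.lower()]


class Mediator:
    """Sends one query to every wrapper and merges the results."""

    def __init__(self, wrappers):
        self._wrappers = wrappers

    def find_by_name(self, keyword):
        results = []
        for wrapper in self._wrappers:
            results.extend(wrapper.find_by_name(keyword))
        return results


# Toy data, invented for the example.
mediator = Mediator([
    SourceAWrapper([{"acc": "A001", "desc": "Peroxisome proliferator-activated receptor gamma"}]),
    SourceBWrapper([("B042", "Receptor tyrosine kinase"), ("B043", "Lysozyme")]),
])
hits = mediator.find_by_name("receptor")
print([h.protein_id for h in hits])
```

If source A renamed its `desc` field, only `SourceAWrapper` would need to change; the mediator and the other wrapper stay as they are, which is the property the slide attributes to this approach.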

SLIDE 6

Common Problems in Linking the Databases

  • Unorganized structure of information
  • Data in unformatted text
  • Inconsistent use of terms on different databases
  • Building of relationships between the databases could only be done manually

SLIDE 7

Proposal of a New Database System

The use of grid technology and the introduction of the concept of metadata greatly help in building mutual data relationships between databases in a distributed system.

SLIDE 8

Overview of OGSA-DAI

OGSA-DAI (Open Grid Services Architecture Data Access and Integration)

  • Registry (GDSR)
  • Factory (GDSF)
  • Grid Data Service (GDS)
  • DBMS (RDB, XML DB)

[Diagram labels: Analysis; SOAP/HTTP; service creation; API interactions]
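The three service roles named on this slide follow a common pattern: a client looks up a factory in the registry, asks the factory to create a data service, and sends its request to that service. The sketch below imitates only that pattern with plain Python objects and an in-memory SQLite database; it is not the OGSA-DAI API, there is no SOAP transport, and every class and method name here is an assumption for illustration.

```python
import sqlite3


class GridDataService:
    """Stand-in for a GDS: answers requests against one data resource."""

    def __init__(self, connection):
        self._connection = connection

    def perform(self, sql, params=()):
        # Run the request against the wrapped DBMS and return all rows.
        return self._connection.execute(sql, params).fetchall()


class GridDataServiceFactory:
    """Stand-in for a GDSF: creates service instances for one resource."""

    def __init__(self, connection):
        self._connection = connection

    def create_service(self):
        return GridDataService(self._connection)


class Registry:
    """Stand-in for a GDSR: maps resource names to factories."""

    def __init__(self):
        self._factories = {}

    def register(self, resource_name, factory):
        self._factories[resource_name] = factory

    def find_factory(self, resource_name):
        return self._factories[resource_name]


# Toy relational database standing in for the DBMS layer (invented rows).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE protein (id TEXT PRIMARY KEY, name TEXT)")
db.executemany("INSERT INTO protein VALUES (?, ?)",
               [("P0001", "ReceptorA"), ("P0002", "EnzymeI")])

registry = Registry()
registry.register("protein_db", GridDataServiceFactory(db))

# Client side: registry lookup, service creation, then the query itself.
factory = registry.find_factory("protein_db")
service = factory.create_service()
rows = service.perform("SELECT id, name FROM protein WHERE name = ?", ("ReceptorA",))
print(rows)
```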

SLIDE 9

Application to the drug discovery process

Compounds (drugs) take effect by binding to proteins in a cell. The drug discovery process is the search for chemical compounds that have good effects on their target proteins. The process is time-consuming and expensive: 5~10 million $ (10~15 years).

[Diagram: Cell, Protein, Compound (Drug)]

Genome-based Drug Discovery Process

Drug Target Identification → Target Validation → Lead Identification → Lead Optimization → Pre-clinical → Clinical

  • Number of compounds along the process: 10,000 → 200 → 1

SLIDE 10

Databases Needed in Genome-based Drug Discovery

Basic Genomic Research → Gene Finding → Gene Function Analysis → Target Validation → Lead Identification → Lead Optimization → Pre-clinical → Clinical → Market

Databases:

  • Disease DB
  • Genome DB (gene location, SNP)
  • Gene/Protein Database
  • Interaction DB
  • Compound DB

Searches and analyses:

  • Disease Modeling
  • Genome Mapping / Gene Finding
  • Known Proteins Search (sequence and structure similarity search)
  • Protein-Compound Interaction Search
  • Compound Search (structure similarity search)

SLIDE 11

Semantic Gap Exists Between Databases and Their Corresponding Disciplines

Basic Genomic Research → Gene Finding → Gene Function Analysis → Target Validation → Lead Identification → Lead Optimization → Pre-clinical → Clinical → Market

Databases:

  • Disease DB
  • Genome DB (gene location, SNP)
  • Gene/Protein Database
  • Interaction DB
  • Compound DB

Searches and analyses:

  • Disease Modeling
  • Genome Mapping / Gene Finding
  • Known Proteins Search (sequence and structure similarity search)
  • Protein-Compound Interaction Search
  • Compound Search (structure similarity search)

[Diagram labels: Database relationship; Semantic Gap]

SLIDE 12

Unification of Different Disciplines Through Metadata → Supports the Drug Discovery Process

Linking Databases in Different Disciplines

  • Disciplines: Life Science, Chemistry, Medicine
  • Databases: Disease DB, Genome DB, Protein DB, Compound DB
  • Metadata links: Gene-Disease Mapping metadata, Lead Identification metadata

SLIDE 13

Linking Databases in Different Disciplines: Linking Eleven Databases Involved in Genomic Drug Discovery

  • Disciplines: Life Science, Chemistry, Medicine
  • Databases: Disease DB, Genome DB, Protein DB, Compound DB
  • Metadata links: Gene-Disease Mapping metadata, Lead Identification metadata

The eleven databases:

  • Medical Encyclopedia (NLM)
  • LITDB (Protein Research Foundation)
  • DDBJ (DNA Data Bank of Japan)
  • SwissProt
  • PIR
  • PDB
  • ENZYME
  • GPCR-DB
  • NucleaRDB
  • LGIC-DB
  • MDL Drug Data Report (MDL)

SLIDE 14

Two-Level Implementation of the Metadata

  • Protein Metadata (over PDB, PIR)
  • Compound Metadata (over the MDDR Compound DB)
  • Protein-Compound Interaction Metadata: the relationship between groups in each category level of Protein Metadata and Compound Metadata

SLIDE 15

Metadata as Implemented on the Drug Discovery Workflow

Work Flow: Basic Genomic Research → Gene Finding → Gene Function Analysis → Target Validation → Lead Identification → Lead Optimization → Pre-clinical → Clinical → Market

  • Databases (each on its own DB Server): Disease DB, Gene-Protein DB, Compound DB, Drug Metabolism DB
  • Metadata: Disease Metadata, Protein/Compound Interaction Metadata, Drug Metabolism Metadata

Example relations shown in the diagram:

  • Ligand Relation: Compound1, Agonist, ReceptorA
  • Enzyme Relation: EnzymeⅠ, Substrate, DrugA
  • Target Relation: ReceptorA, Active, DiseaseA
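One way to picture how such relation metadata links separate databases is to store each relation as a (subject, relation, object) triple and follow the triples from one entity to the next. The sketch below uses the three example relations from this slide; the triple representation, the function names, and the traversal are assumptions for illustration, not the authors' schema.

```python
from collections import defaultdict

# Example relations from the slide, written as (subject, relation, object).
TRIPLES = [
    ("Compound1", "agonist", "ReceptorA"),   # ligand relation
    ("EnzymeI", "substrate", "DrugA"),       # enzyme relation
    ("ReceptorA", "active", "DiseaseA"),     # target relation
]


def build_index(triples):
    """Index triples by subject so outgoing relations are easy to follow."""
    index = defaultdict(list)
    for subject, relation, obj in triples:
        index[subject].append((relation, obj))
    return index


def paths_from(start, index, max_hops=2):
    """List every relation path that starts at `start`, up to max_hops edges."""
    paths = []
    frontier = [(start, [])]
    for _ in range(max_hops):
        next_frontier = []
        for node, path in frontier:
            for relation, obj in index.get(node, []):
                new_path = path + [(node, relation, obj)]
                paths.append(new_path)
                next_frontier.append((obj, new_path))
        frontier = next_frontier
    return paths


index = build_index(TRIPLES)
for path in paths_from("Compound1", index):
    print(" -> ".join(f"{s} [{r}] {o}" for s, r, o in path))
```

Following two hops from `Compound1` reaches `DiseaseA` through `ReceptorA`, which is the kind of cross-database link (compound to protein to disease) that the metadata is meant to make explicit.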

SLIDE 16

Database System for Protein-Compound Interaction Search

Search process, from the user down to the databases:

  • USER: Web Browser, connected over HTTPS
  • Search Portal (Tomcat): Database Search Service (Servlet)
  • Grid Service (Globus Toolkit 3), each service shown with a Factory:

– Compound Metadata Service
– Protein Metadata Service
– Protein-Compound Interaction Metadata Service
– Protein Sequence Homology Search (BLAST)
– Compound Structure Similarity Search (Tanimoto index)

  • Grid Data Service (OGSA-DAI), reached over SOAP, with a GDSF and GDS in front of each data resource:

– Compound DB: MDDR
– Protein DB: SwissProt, PIR, PDB
– Interaction DB: ENZYME, GPCR-DB, NucleaRDB, LGIC-DB
– BLAST Search DB
– Structure Keys (compound substructures)
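The Tanimoto index named for the compound structure similarity search compares two sets of structure keys A and B as |A ∩ B| / |A ∪ B|. The sketch below is a minimal version that assumes each compound is represented by the set of key numbers it contains; the compound names, key numbers, and the 0.5 threshold are invented for the example and are not taken from MDDR or from the authors' system.

```python
def tanimoto(keys_a, keys_b):
    """Tanimoto index of two key sets: shared keys / keys present in either."""
    shared = len(keys_a & keys_b)
    union = len(keys_a) + len(keys_b) - shared
    # Two empty key sets have nothing in common; treat that as 0.
    return shared / union if union else 0.0


def rank_by_similarity(query_keys, library, threshold=0.5):
    """Return (name, score) pairs at or above threshold, best match first."""
    scored = [(name, tanimoto(query_keys, keys)) for name, keys in library.items()]
    kept = [(name, score) for name, score in scored if score >= threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Hypothetical structure keys: each integer marks one substructure.
query = frozenset({1, 2, 3, 4})
library = {
    "cmpd_x": frozenset({1, 2, 3, 5}),
    "cmpd_y": frozenset({1, 2, 3, 4, 6}),
    "cmpd_z": frozenset({7, 8}),
}
ranking = rank_by_similarity(query, library)
print(ranking)
```

Here `cmpd_y` shares four of five distinct keys with the query and `cmpd_x` three of five, so both pass the threshold, while `cmpd_z` shares none and is dropped.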

SLIDE 17

Strategy Used in Protein-Compound Interaction Search

[Diagram: interactions between a protein or protein family (Protein Name, Protein Family Class) and a compound, using the Ligand Ontology* and data extracted from MDDR (MDL).]

Interaction databases:

  • ENZYME
  • GPCR-DB
  • NucleaRDB
  • LGIC-DB

* Schuffenhauer A, Zimmermann J, Stoop R, van der Vyver JJ, Lecchini S, Jacoby E. “An ontology for pharmaceutical ligands and its application for in silico screening and library design,” J Chem Inf Comput Sci. 2002 Jul-Aug;42(4):947-55.

SLIDE 18

Process Flow in Protein-Compound Interaction Search

Stages shown in the diagram, in order:

  • New Target Protein
  • Homology Search (BLAST, etc.) against the ProteinDB
  • Homologous Target Protein with known ligands
  • Interactions Search
  • Large Reference Set of Known Ligands of Homologous Target Protein
  • Compound Descriptors
  • Structure Similarity Search (ISIS SS, etc.) against the Compound Library
  • Candidate Ligands of New Target Protein

Schuffenhauer A, Floersheim P, Acklin P, Jacoby E. “Similarity metrics for ligands reflecting the similarity of the target proteins,” J Chem Inf Comput Sci. 2003 Mar-Apr;43(2):391-405.

SLIDE 19

Example of Protein-Compound Interaction Search

  • Protein (ex.) PPARgamma, binding domains Zf-C4 137-211 and Hormone_rec 318-501; compound (ex.) rosiglitazone, agonist
  • Protein (ex.) PPARalpha, binding domains Zf-C4 100-174 and Hormone_rec 281-464; compound (ex.) fenofibrate, agonist
  • Compound (ex.) ragaglitazar, dual agonist

[Diagram labels: Activity (protein to compound); Homology (protein to protein); similarity (compound to compound)]

SLIDE 20

Protein-Compound Interaction Search System Website

SLIDE 21

Applications Available to the User

  • Protein Sequence Search: Retrieve the target protein’s sequence by specifying its Protein ID.
  • Homology Search: Search for proteins homologous to the target in the Protein DB.
  • Protein-Compound Interaction Search: Extract ligands that bind to the homologous proteins.
  • Compound Search: Search for new compounds that may interact with the target protein, by structural similarity to the extracted ligands.
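The four applications chain naturally into one pipeline: sequence, then homologs, then known ligands, then candidate compounds. The sketch below runs that chain on toy in-memory data. It is an assumption-laden illustration: the dictionaries and names are invented, `difflib.SequenceMatcher` is only a stand-in for a real homology search such as BLAST, and the thresholds are arbitrary.

```python
from difflib import SequenceMatcher

# Toy stand-ins for the Protein DB, Interaction DB and Compound DB.
PROTEIN_DB = {
    "TARGET": "MKTAYIAKQRQISFVK",
    "HOMOLOG": "MKTAYIAKQRQLSFVK",
    "UNRELATED": "GGGGSGGGGSPPPPWW",
}
INTERACTION_DB = {"HOMOLOG": ["ligand_1"], "UNRELATED": ["ligand_9"]}
COMPOUND_DB = {  # compound name -> hypothetical structure keys
    "ligand_1": frozenset({1, 2, 3, 4}),
    "ligand_9": frozenset({20, 21}),
    "candidate_a": frozenset({1, 2, 3, 5}),
    "candidate_b": frozenset({30, 31}),
}


def sequence_search(protein_id):
    """Step 1: retrieve the target sequence by its protein ID."""
    return PROTEIN_DB[protein_id]


def homology_search(sequence, exclude, min_score=0.7):
    """Step 2: find similar proteins (string similarity stands in for BLAST)."""
    return [pid for pid, seq in PROTEIN_DB.items()
            if pid != exclude
            and SequenceMatcher(None, sequence, seq).ratio() >= min_score]


def interaction_search(protein_ids):
    """Step 3: collect the known ligands of the homologous proteins."""
    ligands = set()
    for pid in protein_ids:
        ligands.update(INTERACTION_DB.get(pid, []))
    return ligands


def tanimoto(keys_a, keys_b):
    shared = len(keys_a & keys_b)
    union = len(keys_a) + len(keys_b) - shared
    return shared / union if union else 0.0


def compound_search(ligands, min_similarity=0.5):
    """Step 4: find other compounds structurally similar to any ligand."""
    hits = []
    for name, keys in COMPOUND_DB.items():
        if name in ligands:
            continue  # already a known ligand, not a new candidate
        if any(tanimoto(keys, COMPOUND_DB[lig]) >= min_similarity for lig in ligands):
            hits.append(name)
    return sorted(hits)


def find_candidates(protein_id):
    sequence = sequence_search(protein_id)
    homologs = homology_search(sequence, exclude=protein_id)
    ligands = interaction_search(homologs)
    return compound_search(ligands)


print(find_candidates("TARGET"))
```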

SLIDE 22

Flow of User Access and Grid Service Execution

  • User Access (Web Browser)
  • Search Portal (Servlet): Sequence Search, Homology Search, Interaction Search, Structure Similarity Search
  • Grid Service (Globus Toolkit 3): Protein Metadata Service, Protein Sequence Homology Search (BLAST), Protein-Compound Interaction Metadata Service, Compound Metadata Service, Compound Structure Similarity Search (Tanimoto Index)
  • Grid Data Service (OGSA-DAI): GDS instances in front of the Protein DB, Interaction DB and Compound DB
  • Information returned to the user: Protein Sequence Information, Homology Search Information, Protein-Compound Interaction Search Information, Compound Structure Search Information

SLIDE 23

Actual Implementation of the Search System

BioDataGrid

  • OS: Red Hat Linux 9
  • CPU: Pentium 4 (2.4GHz)
  • Memory Size: 4GB
  • Java SDK 1.4.1
  • Globus Toolkit 3 beta
  • OGSA-DAI: Release 2.5
  • Jakarta Tomcat 4.1.24
  • MySQL 3.23.54

SLIDE 24

Average Amount of Data Flow for Each Grid Service

[Figure: the flow diagram of Search Portal (Servlet), Grid Service (Globus Toolkit 3) and Grid Data Service (OGSA-DAI) components, annotated with the average message size on each link. The annotated averages range from 1.4KB to 41.8KB. Values in parentheses show the average for displaying a single compound, for example 30KB (14.1KB), 30KB (6.6KB), 4.7KB (4.7KB) and 1.6KB (1.6KB).]

SLIDE 25

Conclusion

  • A data grid system that links together online databases was proposed
  • The actual linking of 11 databases in the life sciences was explained
  • An integrated heterogeneous database system based on the workflow of the genome-based drug discovery process was discussed
  • The use of the latest grid technology, such as Globus Toolkit 3 and OGSA-DAI, in linking distributed databases was demonstrated

SLIDE 26

Future Works

  • Implementation of security technologies
  • Implementation of XML DBMS technologies
  • Improvement of the search program

SLIDE 27

Acknowledgment

This study was conducted under the IT-Program of the Japanese Ministry of Education, Culture, Sports, Science and Technology. The authors thank the Biogrid Project members for all their support.