WinBioinfTools: Bioinformatics Tools for Windows High Performance Computing Server 2008
Mohamed Abouelhoda
Nile University
joint project between Nile University, Microsoft Egypt, and Cairo Microsoft Innovation Center
applications
Transportation systems, and construction management
Distributed Computing Resources & Remote Software Access
Distributed Data Sources (SQL, Web Sources, Images, Text)
Distributed Sensors & Devices
Local Computing Resources
Local Data & Software Tools
Data & Information Integration Tools
Data Analysis, Decision Making & Collaboration Tools
Scientists, Knowledge Workers
Distributed Scientific Information & Resources
Scientific Discovery & Business Insights
HPC Ubiquitous Networking Data Mining Bioinformatics Medical Imaging Data Management
Nile University, Microsoft CMIC, Imperial College London, Bibliotheca Alexandrina, other resources
Shared Middleware: Standardized SOA interfaces, Service Composition, Utility-based Computing, …
Bioinformatics Applications
Bridge-Project
Local CIS resources (first phase):
cores and total 1TB RAM
Extensible resources via partners
Project
http://www.bioinf.nileu.edu.eg
Academic
Industry
corresponding Linux-based versions.
1.
Comparing short sequences Parallel Sequence Alignment
Database search Database search Genome Comparison, Sequence alignment
(Human And Vertebrate Analysis aNd Annotation)
– Most bioinformatics problems can be well solved under this category due to decomposability of data
genome length)
2
S1: TACAATCAA
S2: TCACTCAC
Sequence Alignment:
T_ACAATCAA
TCAC__TCAC
Needleman-Wunsch, 1970 (mismatch, insertion/deletion)
score at cell (i,j) is computed as follows:
\[
\mathit{score}(i,j) = \min
\begin{cases}
\mathit{score}(i-1,j-1) + 1 & \text{if } S_1[i] \neq S_2[j] \\
\mathit{score}(i-1,j-1) & \text{if } S_1[i] = S_2[j] \\
\mathit{score}(i-1,j) + 1 & \text{(character deletion cost)} \\
\mathit{score}(i,j-1) + 1 & \text{(character deletion cost)}
\end{cases}
\]

Figure: node 1, node 2, node 3, node 4; synchronizing line, synchronized by the master node.
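The recurrence for score(i,j) can be sketched as a small single-node dynamic program. This is a minimal illustration, not the project's parallel implementation: the function name `alignment_score` and the boundary values (row 0 and column 0 filled with the number of deleted characters) are assumptions, since the slides show only the inner recurrence.

```python
def alignment_score(s1: str, s2: str) -> int:
    """Unit-cost alignment score of s1 and s2 using the min-recurrence."""
    n, m = len(s1), len(s2)
    # score[i][j] holds the cost of aligning s1[:i] with s2[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]

    # Assumed boundary: aligning a prefix with the empty string costs
    # one deletion per character.
    for i in range(1, n + 1):
        score[i][0] = i
    for j in range(1, m + 1):
        score[0][j] = j

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Diagonal step: free on a match, cost 1 on a mismatch.
            diagonal = score[i - 1][j - 1] + (0 if s1[i - 1] == s2[j - 1] else 1)
            # Vertical and horizontal steps: one character deletion each.
            score[i][j] = min(diagonal, score[i - 1][j] + 1, score[i][j - 1] + 1)
    return score[n][m]


if __name__ == "__main__":
    # The two example sequences from the slide.
    print(alignment_score("TACAATCAA", "TCACTCAC"))
```

Each cell depends only on its left, upper, and upper-left neighbours, so cells on the same anti-diagonal are independent of one another. That independence is what makes it possible to split the table across nodes.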
| Sequence Length | Communication Time (4 nodes) | Processing time (4 nodes) | Total (4 nodes) | Time on one node |
|---|---|---|---|---|
| 100 X 100 | 0.03623 | 0.000665 | 0.001765 | 0.0034 |
| 1000 X 1000 | 0.152653 | 0.005 | 0.014 | 0.04 |
| 5000 X 5000 | 0.142311 | 0.3 | 1 | 3.9 |
| 10000 X 10000 | 1.19 | 1.1 | 2.6 | 8.4 |
| 20000 X 20000 | 3.679 | 2 | 8 | 18 |
| 30000 X 30000 | 4 | 11 | 15 | 40 |
aligned two sequences, each 100 characters long.
Figure: pairwise alignment of RBP and lactoglobulin.

 1 MKWVWALLLLAAWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDPEG  50 RBP
 1 ...MKCLLLALALTCGAQALIVT..QTMKGLDIQKVAGTWYSLAMAASD.  44 lactoglobulin

51 LFLQDNIVAEFSVDETGQMSATAKGRVR.LLNNWD..VCADMVGTFTDTE  97 RBP
45 ISLLDAQSAPLRV.YVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTK  93 lactoglobulin

98 DPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAV...........QYSC 136 RBP
94 IPAVFKIDALNENKVL........VLDTDYKKYLLFCMENSAEPEQSLAC 135 lactoglobulin
Biological database formatting and querying
Figure: formatting, query, results
Internet Institution Enterprise
queries
similar (sub) regions in the database
up BLAST search
Altschul et al. 1997
Database segmentation, where the whole database DB is divided into subsets DB1,…,DB4
DB1 DB2 DB3 DB4
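Database segmentation can be sketched as follows. This is a toy illustration under stated assumptions, not the WinBioinfTools code. The helper names (`segment_database`, `search_segmented`, `toy_search`) are hypothetical, the greedy balancing rule is one possible choice, and `toy_search` is a stand-in that counts shared 3-mers rather than running BLAST.

```python
from typing import Callable, List, Tuple

Record = Tuple[str, str]   # (sequence id, sequence)
Hit = Tuple[float, str]    # (score, sequence id)


def segment_database(records: List[Record], k: int) -> List[List[Record]]:
    """Split records into k segments of roughly equal total sequence length."""
    segments: List[List[Record]] = [[] for _ in range(k)]
    loads = [0] * k
    # Greedy balancing: place the longest remaining record on the lightest segment.
    for rec in sorted(records, key=lambda r: len(r[1]), reverse=True):
        idx = loads.index(min(loads))
        segments[idx].append(rec)
        loads[idx] += len(rec[1])
    return segments


def toy_search(query: str, records: List[Record]) -> List[Hit]:
    """Stand-in for a real search tool: score = number of shared 3-mers."""
    q_kmers = {query[i:i + 3] for i in range(len(query) - 2)}
    hits = []
    for rid, seq in records:
        s_kmers = {seq[i:i + 3] for i in range(len(seq) - 2)}
        hits.append((float(len(q_kmers & s_kmers)), rid))
    return hits


def search_segmented(query: str, segments: List[List[Record]],
                     search: Callable[[str, List[Record]], List[Hit]],
                     top: int = 10) -> List[Hit]:
    """Search every segment independently, then merge and rank the hits."""
    hits: List[Hit] = []
    for seg in segments:  # on a cluster, each segment would be searched on its own node
        hits.extend(search(query, seg))
    return sorted(hits, reverse=True)[:top]


if __name__ == "__main__":
    db = [("a", "ACGTACGT"), ("b", "TTTT"), ("c", "ACGTTT"), ("d", "GGGGGGGGGG")]
    print(search_segmented("ACGTAC", segment_database(db, 4), toy_search, top=3))
```

With a real search tool, statistics that depend on the database size (E-values, for example) would have to be computed against the size of the whole database rather than a single segment before merging.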
Running times in hours (Windows) for biological databases. The first 3 databases are DNA, while the others are proteins. The query sequence is of the same type as the database.

| Database | Communication Time (4 nodes) | Processing time (4 nodes) | Total time (4 nodes) | One Node |
|---|---|---|---|---|
| Drosoph | 0.014522 | 0.023478 | 0.038 | 0.08 |
| Pataa | 0.01835 | 0.116 | 0.13435 | 0.5 |
| est_others | 0.0343 | 0.5456 | 0.5799 | 1 |
| env_nr | 0.53077 | 3.5 | 4.03077 | 18 |
| Nr | 0.4077 | 6.8 | 7.2077 | 27 |
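To read these running times as speedups, the one-node time can be divided by the total time on 4 nodes. The short sketch below does that arithmetic. The figures are copied from the table, while the names `speedup` and `efficiency` and the choice of total time (communication plus processing) as the parallel time are assumptions made for illustration.

```python
# (database, total hours on 4 nodes, hours on one node), copied from the table.
TIMINGS = [
    ("Drosoph", 0.038, 0.08),
    ("Pataa", 0.13435, 0.5),
    ("est_others", 0.5799, 1.0),
    ("env_nr", 4.03077, 18.0),
    ("Nr", 7.2077, 27.0),
]
NODES = 4


def speedup(t_one_node: float, t_parallel: float) -> float:
    """How many times faster the parallel run is than the one-node run."""
    return t_one_node / t_parallel


def efficiency(t_one_node: float, t_parallel: float, nodes: int = NODES) -> float:
    """Speedup per node; 1.0 would mean perfectly linear scaling."""
    return speedup(t_one_node, t_parallel) / nodes


if __name__ == "__main__":
    for name, t4, t1 in TIMINGS:
        print(f"{name:<12} speedup {speedup(t1, t4):5.2f}  "
              f"efficiency {efficiency(t1, t4):4.2f}")
```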
Genome Comparison: Given two genomic sequences, locate the regions of similarity and difference.
Human genome Mouse genome Human chromosomes Mouse chromosomes
Abouelhoda-Kurtz-Ohlebusch, 2008
before
Chromosome comparisons are independent of each other, so the comparisons can be divided among the cluster nodes.
Figure: pairwise comparisons (X) between the chromosomes of Genome 1 and Genome 2, assigned to nodes N1, N2, …, Nm.
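Because each chromosome-to-chromosome comparison is independent, the work list is simply the set of all chromosome pairs, and it can be dealt out to the nodes. The sketch below uses a round-robin assignment, which is an assumption: the slides do not say how comparisons are scheduled, and `assign_comparisons` is a hypothetical helper.

```python
from itertools import product
from typing import List, Tuple


def assign_comparisons(chroms1: List[str], chroms2: List[str],
                       m: int) -> List[List[Tuple[str, str]]]:
    """Deal all chromosome pairs out to m nodes in round-robin order."""
    nodes: List[List[Tuple[str, str]]] = [[] for _ in range(m)]
    for idx, pair in enumerate(product(chroms1, chroms2)):
        nodes[idx % m].append(pair)
    return nodes


if __name__ == "__main__":
    human = [f"Hs{c}" for c in list(range(1, 23)) + ["X", "Y"]]   # 24 chromosomes
    mouse = [f"Mm{c}" for c in list(range(1, 20)) + ["X", "Y"]]   # 21 chromosomes
    plan = assign_comparisons(human, mouse, 4)
    for n, jobs in enumerate(plan, start=1):
        print(f"N{n}: {len(jobs)} comparisons, first = {jobs[0]}")
```

Round-robin balances the number of comparisons but not their cost. Since running time grows with chromosome length, weighting each pair by the lengths involved would likely give a more even load.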
and 19 days on 4 nodes
Human Chr. 13 to Mouse Chr. 14 Human Chr. 18 to Mouse Chr. 18
needed on the algorithmic level
performance computing and application migration from Unix/Linux in a user friendly way
(GenomeTools)
Full configuration (including HPC, security, and networking): 1.5 h per node.
easy to compile and run. It can also run virtually over many cores.