
WinBioinfTools: Bioinformatics Tools for Windows High Performance Computing Server 2008. Joint project between Nile University, Microsoft Egypt, and Cairo Microsoft Innovation Center. Mohamed Abouelhoda, Nile University.


  1. WinBioinfTools: Bioinformatics Tools for Windows High Performance Computing Server 2008
     Joint project between Nile University, Microsoft Egypt, and Cairo Microsoft Innovation Center
     Mohamed Abouelhoda, Nile University

  2. Nile University
     • Established in 2006 as a first non-profit research university
     • Specialized in information and communication technology and related fields and their applications
     • Research centers
       - Center for Informatics Sciences (CIS)
       - Center for Wireless Intelligent Networks (WINC)
       - Center for Innovation & Competitiveness (CIC)
     • Modern Master programs: 9 Master programs in IT, micro-electronics, management, business, transportation systems, and construction management
     • Recent undergraduate program: engineering and management programs

  3. Research Groups
     • Established in June 2008
     • 9 senior scientists, 36 junior scientists
     • Mission: address information-rich problems of importance to the region and Egypt

  4. State of the art
     [Diagram. Labels, top to bottom: Scientific Discovery & Business Insights (Scientists, Knowledge Workers); Bioinformatics, Medical Imaging, Data Mining; Data Analysis, Decision Making & Collaboration Tools; Data Management & Integration Tools; Local Computing Resources, Local Data & Information, HPC, Software Tools; Ubiquitous Networking; Distributed Scientific Information & Resources: Distributed Computing Resources, Distributed Data Sources (SQL, Web Sources, Images, Text), Sensors & Devices, Remote Software Access]

  5. Infrastructure of CIS
     Local CIS resources (first phase):
     • 21 servers with 160 AMD/Intel cores and 1 TB RAM in total
     • 24 TB total storage
     Extensible resources via partners:
     • Microsoft, Imperial College, Bridge Project
     Shared middleware: standardized SOA interfaces, service composition, utility-based computing, …
     [Diagram labels: Bioinformatics Applications; Nile University; Imperial College London; Microsoft CMIC; Bibliotheca Alexandrina; Bridge-Project; Other resources]

  6. Group Leader: Mohamed Abouelhoda
     Co-workers: 7 RAs
     Projects and research:
     • NUBIOS: Nile University Bioinformatics Server
     • Plant, animal, bacterial, and virus computational genomics
     • Cancer bioinformatics
     • High performance computing for bioinformatics applications
     Collaborators (academic):
     • Imperial College, Prof. Hani Gabra
     • National Cancer Institute, Egypt
     • Bielefeld University, Prof. Robert Giegerich
     • Agriculture Research Institute
     Collaborators (industry):
     • Cairo Microsoft Innovation Center (CMIC), Egypt
     • IBM
     http://www.bioinf.nileu.edu.eg

  7. WinBioinfTools: Bioinformatics Tools for Windows High Performance Computing Server 2008

  8. Motivation
     • Bioinformatics tools are essential for recent molecular biology research
     • Obstacles:
       - Open source bioinformatics tools are usually written for Unix/Linux, which are not widely used in the life science community
       - Data sizes have become prohibitively large to analyze on an ordinary PC

  9. Project Objectives
     • Providing WinBioinfTools to the biological community, a toolkit that
       - runs under MS Windows
       - runs on a computer cluster (Windows HPC Server 2008)
     • Primary focus on sequence analysis and comparative genomics
       - Distributed Sequence Alignment
       - Distributed BLAST (Basic Local Alignment Search Tool)
       - CoCoNUT (Computational Comparative GeNomics Utilities Toolkit)
     • Comparing the performance of the Windows-based versions of these tools to the corresponding Linux-based versions

  10. Resources
      • Human resources
        - Mohamed Abouelhoda, Hisham Mohamed (Nile University)
        - Mohamed Zahran (collaborator, New York City University)
        - Tamer Shaalan (CMIC)
      • CMIC Lab
        - Cluster of 4 nodes (2 quad-core 2.6 GHz processors, 16 GB RAM, 250 GB HD)
        - 1 Gigabit Ethernet network
        - Windows HPC Server 2008, with HPC Pack 2008

  11. Why Sequence Analysis First?
      - We focused on sequence analysis tools:
        1. Comparing short sequences → Parallel Sequence Alignment
        2. Comparing large genomic sequences → Parallel CoCoNUT
        3. Database search → Parallel BLAST
      - Sequence analysis helps in elucidating the function and structure of genomic regions
      - An example pipeline used in practice is HAVANA (Human And Vertebrate Analysis aNd Annotation)
      [Diagram labels: Database search; Genome comparison; Sequence alignment]

  12. Cluster Modes of Operation
      1. Load balancing: task-level parallelism
         - Most bioinformatics problems can be solved well in this mode because their data can be decomposed into independent pieces
      2. (High performance) compute cluster: instruction-level parallelism
         - Problems in this category are critical and form a bottleneck
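The load-balancing mode can be illustrated with a minimal sketch. It assumes a toy task (counting mismatches between equal-length sequences) and uses a local thread pool as a stand-in for the cluster's job scheduler; the function names `mismatches` and `run_tasks` are hypothetical and not part of WinBioinfTools.

```python
from concurrent.futures import ThreadPoolExecutor


def mismatches(pair):
    """Toy task: count positions where two equal-length sequences differ."""
    a, b = pair
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def run_tasks(pairs, workers=4):
    """Hand each independent task to a pool worker; results keep input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(mismatches, pairs))


# Each pair is an independent task, so no worker has to wait for another.
pairs = [("ACGT", "ACGA"), ("TTTT", "TTTT"), ("GATTACA", "GACTATA")]
results = run_tasks(pairs)
```

Because the tasks share no data, the only coordination needed is handing out work and collecting results, which is why this mode suits decomposable problems.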

  13. Basic features of the Windows (HPC) Server 2008
      • High performance:
        - 64-bit version, accessing large memory (16, 32, 64, 128 GB RAM)
        - Cluster and multi-core support
        - Cluster management and monitoring tools
        - Load balancing: job scheduler
        - Parallel computing: MS MPI
      • Interoperability: SUA (Subsystem for UNIX-based Applications); Cygwin also works
      • Virtualization: Hyper-V for virtual machine support

  14. Sequence Alignment

  15. Sequence Alignment
      Example alignment of S1 = TACAATCAA and S2 = TCACTCAC, showing a mismatch (last column) and insertions/deletions (gaps, written as _):
        S1: T_ACAATCAA
        S2: TCAC__TCAC
      • Dynamic programming algorithms take $O(n^2)$ time for two sequences and $O(n^k)$ time in general ($k$ = number of genomes, $n$ = average genome length)
      • Needleman and Wunsch, 1970

  16. Dynamic Programming Algorithm
      • Sequence alignment aims at maximizing the similarity between sequences.
      • The optimal sequence alignment can be computed using dynamic programming.
      • For two sequences $S_1$ and $S_2$, the best alignment is computed by filling a 2D matrix, where the score at cell $(i,j)$ is the minimum edit cost, computed as follows:

      $$
      \mathrm{score}(i,j) = \min
      \begin{cases}
      \mathrm{score}(i-1,j-1) + 1, & \text{if } S_1[i] \neq S_2[j] \\
      \mathrm{score}(i-1,j-1), & \text{if } S_1[i] = S_2[j] \\
      \mathrm{score}(i-1,j) + 1 & \text{(character deletion cost)} \\
      \mathrm{score}(i,j-1) + 1 & \text{(character deletion cost)}
      \end{cases}
      $$
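The recurrence on this slide can be sketched in a few lines of Python. The boundary conditions (first row and column filled with gap counts) are an assumption, since the slide does not state them, and the function name `alignment_cost` is hypothetical.

```python
def alignment_cost(s1, s2):
    """Fill the DP matrix row by row and return the cost in the last cell."""
    n, m = len(s1), len(s2)
    score = [[0] * (m + 1) for _ in range(n + 1)]

    # Assumed boundary: aligning a prefix against nothing costs one gap per character.
    for i in range(1, n + 1):
        score[i][0] = i
    for j in range(1, m + 1):
        score[0][j] = j

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Diagonal move: free on a match, cost 1 on a mismatch.
            diagonal = score[i - 1][j - 1] + (0 if s1[i - 1] == s2[j - 1] else 1)
            # Vertical and horizontal moves: one character deleted from either sequence.
            score[i][j] = min(diagonal, score[i - 1][j] + 1, score[i][j - 1] + 1)

    return score[n][m]
```

The two nested loops visit each of the n × m cells once, which is where the quadratic running time for a pair of sequences comes from.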

  17. Parallelization of the DP Algorithm
      • The cluster nodes cooperate in filling the matrix (compute cluster model)
      • The filling proceeds diagonal-wise, and the master node synchronizes the filling
      • The complexity reduces to $O(n^2/k + t k')$, where $t$ is the communication time, $k$ is the number of cores, and $k'$ is the number of cluster nodes

      $$
      \mathrm{score}(i,j) = \min
      \begin{cases}
      \mathrm{score}(i-1,j-1) + 1, & \text{if } S_1[i] \neq S_2[j] \\
      \mathrm{score}(i-1,j-1), & \text{if } S_1[i] = S_2[j] \\
      \mathrm{score}(i-1,j) + 1 & \text{(character deletion cost)} \\
      \mathrm{score}(i,j-1) + 1 & \text{(character deletion cost)}
      \end{cases}
      $$

      [Diagram: the matrix is divided among node 1 to node 4; a synchronizing line across the matrix is synchronized by the master node]
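The diagonal-wise filling can be sketched as follows. This is an illustration of the dependency structure only, not the project's MPI implementation: a local thread pool stands in for the cluster nodes, waiting for each anti-diagonal to finish stands in for the master node's synchronization, and the boundary conditions and the name `wavefront_alignment_cost` are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def wavefront_alignment_cost(s1, s2, workers=4):
    """Fill the DP matrix one anti-diagonal at a time."""
    n, m = len(s1), len(s2)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):          # assumed boundary: gaps only
        score[i][0] = i
    for j in range(m + 1):
        score[0][j] = j

    def fill(cell):
        i, j = cell
        diagonal = score[i - 1][j - 1] + (0 if s1[i - 1] == s2[j - 1] else 1)
        score[i][j] = min(diagonal, score[i - 1][j] + 1, score[i][j - 1] + 1)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Cells with i + j == d depend only on diagonals d - 1 and d - 2,
        # so all cells of one diagonal can be computed independently.
        for d in range(2, n + m + 1):
            cells = [(i, d - i) for i in range(max(1, d - m), min(n, d - 1) + 1)]
            # Consuming the map waits for the whole diagonal (the synchronizing line).
            list(pool.map(fill, cells))

    return score[n][m]
```

In CPython the threads mainly demonstrate which cells are independent rather than deliver a speedup; on a cluster each node would own a block of the matrix and exchange only the boundary cells at each synchronization step.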

  18. Experimental Results
      Running times (in seconds) for pairwise sequence alignment on one node and on 4 nodes.

      | Sequence length | Communication time (4 nodes) | Processing time (4 nodes) | Total (4 nodes) | Time on one node |
      |---|---|---|---|---|
      | 100 X 100 | 0.03623 | 0.000665 | 0.001765 | 0.0034 |
      | 1000 X 1000 | 0.152653 | 0.005 | 0.014 | 0.04 |
      | 5000 X 5000 | 0.142311 | 0.3 | 1 | 3.9 |
      | 10000 X 10000 | 1.19 | 1.1 | 2.6 | 8.4 |
      | 20000 X 20000 | 3.679 | 2 | 8 | 18 |
      | 30000 X 30000 | 4 | 11 | 15 | 40 |

      - The first column lists the sequence sizes; 100 X 100, for example, means that we aligned two sequences, each 100 characters long.
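One way to read these measurements is as speedup and parallel efficiency. The sketch below assumes that the last two values of each row are the total time on 4 nodes and the time on one node, in that order; this column assignment is an interpretation of the slide layout, and the name `speedup_and_efficiency` is hypothetical.

```python
# (sequence length, total time on 4 nodes, time on one node), in seconds.
# Assumption: the last two values of each row on the slide are these two columns.
ROWS = [
    ("100 X 100", 0.001765, 0.0034),
    ("1000 X 1000", 0.014, 0.04),
    ("5000 X 5000", 1, 3.9),
    ("10000 X 10000", 2.6, 8.4),
    ("20000 X 20000", 8, 18),
    ("30000 X 30000", 15, 40),
]
NODES = 4


def speedup_and_efficiency(rows, nodes=NODES):
    """Return {label: (speedup, efficiency)} with speedup = T_one_node / T_parallel."""
    table = {}
    for label, t_parallel, t_single in rows:
        speedup = t_single / t_parallel
        table[label] = (speedup, speedup / nodes)
    return table


table = speedup_and_efficiency(ROWS)
for label, (speedup, efficiency) in table.items():
    print(f"{label:>15}  speedup {speedup:.2f}  efficiency {efficiency:.2f}")
```

Under this reading, an efficiency of 1.0 would mean the four nodes were used perfectly; values below that reflect communication and synchronization overhead.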

  19. Experimental Results
      [Chart of the running times]
      - The x-axis lists the sequence sizes; 100 X 100, for example, means that we aligned two sequences, each 100 characters long.

  20. Database Search

  21. Querying Biological Databases using BLAST
      Biological database formatting and querying: (1) formatting, (2) query, (3) results.
      Example result (alignment of RBP with lactoglobulin):

         1 MKWVWALLLLAAWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDPEG  50 RBP
         1 ...MKCLLLALALTCGAQALIVT..QTMKGLDIQKVAGTWYSLAMAASD.  44 lactoglobulin

        51 LFLQDNIVAEFSVDETGQMSATAKGRVR.LLNNWD..VCADMVGTFTDTE  97 RBP
        45 ISLLDAQSAPLRV.YVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTK  93 lactoglobulin

        98 DPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAV...........QYSC 136 RBP
        94 IPAVFKIDALNENKVL........VLDTDYKKYLLFCMENSAEPEQSLAC 135 lactoglobulin

  22. Large Scale Application of BLAST
      • BLAST (Basic Local Alignment Search Tool): given a biological sequence, it searches for similar (sub)regions in the database (Altschul et al. 1997)
      • The database size is extremely large
      • The search time is proportional to the database length
      • A computer cluster provides an ideal solution for speeding up BLAST searches
      [Diagram labels: Internet queries; Institution; Enterprise]
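Because the search time grows with the database length, one common way to use a cluster is to split the database into partitions, search each partition on its own worker, and merge the hits. The sketch below illustrates only this partition-and-merge idea: the k-mer overlap score is a toy stand-in and is not the BLAST algorithm, a thread pool stands in for the cluster nodes, and all function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def kmers(seq, k=3):
    """Set of all length-k substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def search_partition(query, partition, k=3):
    """Score every database entry in one partition (toy score: shared k-mers)."""
    q = kmers(query, k)
    return [(len(q & kmers(seq, k)), name) for name, seq in partition]


def distributed_search(query, database, workers=4, top=3):
    """Split the database across workers, search in parallel, merge the hits."""
    items = list(database.items())
    partitions = [items[w::workers] for w in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial = list(pool.map(lambda part: search_partition(query, part), partitions))
    hits = [hit for part in partial for hit in part]
    # Best score first; ties broken by name so the result is deterministic.
    return sorted(hits, key=lambda h: (-h[0], h[1]))[:top]


database = {"a": "MKWVWALLLL", "b": "GGGGGGGG", "c": "MKWVAAAA"}
hits = distributed_search("MKWVWA", database)
```

Each worker touches only its own partition, so the per-worker search time shrinks roughly with the number of partitions, while the merge step stays small because it handles only the scored hits.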
