Computing Server 2008 joint project between Nile University, - - PowerPoint PPT Presentation


WinBioinfTools: Bioinformatics Tools for Windows High Performance Computing Server 2008. Joint project between Nile University, Microsoft Egypt, and Cairo Microsoft Innovation Center. Mohamed Abouelhoda, Nile University.


SLIDE 1

WinBioinfTools: Bioinformatics Tools for Windows High Performance Computing Server 2008 Mohamed Abouelhoda

Nile University

joint project between Nile University, Microsoft Egypt, and Cairo Microsoft Innovation Center

SLIDE 2

Nile University

  • Established in 2006 as a first non-profit research university
  • Specialized in Information and Communication Technology, related fields, and their applications

  • Research centers
  • Center for Informatics Sciences (CIS)
  • Center for Wireless Intelligent Networks (WINC)
  • Center for Innovation & Competitiveness (CIC)
  • Modern Master Programs
  • 9 Master programs in IT, Micro-electronics, Management, Business, Transportation Systems, and Construction Management

  • Recent undergraduate program
  • Engineering and management programs


SLIDE 3

Research Groups

  • Established in June 2008
  • 9 senior scientists, 36 junior scientists
  • Mission: address information-rich problems of importance to the region and Egypt

SLIDE 4

State of the art

[Figure: scientists and knowledge workers use data and information integration tools and data analysis, decision making, and collaboration tools to reach scientific discovery and business insights. The tools draw on distributed scientific information and resources (distributed computing resources and remote software access; distributed data sources such as SQL, web sources, images, and text; distributed sensors and devices) as well as local computing resources, local data, and software tools.]

Areas shown: HPC, Ubiquitous Networking, Data Mining, Bioinformatics, Medical Imaging, Data Management

SLIDE 5

[Figure: Nile University, Microsoft CMIC, Imperial College London, Bibliotheca Alexandrina, the Bridge Project, and other resources are connected through shared middleware (standardized SOA interfaces, service composition, utility-based computing, ….) that serves bioinformatics applications.]

Local CIS resources (first phase):

  • 21 servers with 160 AMD/Intel cores and 1 TB RAM in total
  • 24 TB total storage

Extensible resources via partners

  • Microsoft, Imperial College, Bridge Project

Infrastructure of CIS

SLIDE 6

http://www.bioinf.nileu.edu.eg

Group Leader: Mohamed Abouelhoda
Co-Workers: 7 RAs

Projects and Research:

  • NUBIOS: Nile University Bioinformatics Server
  • Plant, animal, bacterial, and virus computational genomics
  • Cancer Bioinformatics
  • High Performance Computing for Bioinformatics Applications

Collaborators:

Academic

  • Imperial College, Prof. Hani Gabra
  • National Cancer Institute, Egypt
  • Bielefeld University, Prof. Robert Giegerich
  • Agriculture Research Institute

Industry

  • Cairo Microsoft Innovation Center (CMIC), Egypt
  • IBM
SLIDE 7

WinBioinfTools: Bioinformatics Tools for Windows High Performance Computing Server 2008

SLIDE 8

Motivation

  • Bioinformatics tools are essential for modern molecular biology research
  • Obstacles:
  • Open source bioinformatics tools are usually written for Unix/Linux, which are not so popular in the life science community
  • Data sizes become prohibitively large to analyze on a typical PC
SLIDE 9

Project Objectives

  • Providing WinBioinfTools to the biological community, a package that
  • runs under MS Windows
  • runs on a computer cluster (Windows HPC Server 2008)
  • Primary focus on sequence analysis and comparative genomics
  • Distributed Sequence Alignment
  • Distributed BLAST (Basic Local Alignment Search Tool)
  • CoCoNUT (Computational Comparative GeNomics Utilities Toolkit)
  • Comparing the performance of the Windows-based versions of these tools to the corresponding Linux-based versions

SLIDE 10

Resources

  • Human Resources
  • Mohamed Abouelhoda, Hisham Mohamed (Nile University)
  • Mohamed Zahran (collaborator, New York City University)
  • Tamer Shaalan (CMIC)
  • CMIC Lab:
  • Cluster of 4 nodes (2 quad-core 2.6 GHz processors, 16 GB RAM, 250 GB HD)
  • 1 Gigabit Ethernet network
  • Windows HPC Server 2008 with HPC Pack 2008
SLIDE 11

Why Sequence Analysis First?

  • We focused on sequence analysis tools
  • 1. Comparing short sequences → Parallel Sequence Alignment
  • 2. Comparing large genomic sequences → Parallel CoCoNUT
  • 3. Database search → Parallel BLAST

[Figure: pipeline stages labeled Database search, Database search, Genome Comparison, Sequence alignment]

  • An example pipeline used in practice is HAVANA (Human And Vertebrate Analysis aNd Annotation)
  • Sequence analysis helps in elucidating the function and structure of genomic regions

SLIDE 12

Cluster Modes of Operation

  • 1. Load balancing: task-level parallelism

– Most bioinformatics problems can be solved well in this mode because their data is decomposable

  • 2. (High performance) compute cluster: instruction-level parallelism
  • Problems that need this mode are critical and form a bottleneck
SLIDE 13

Basic features of the Windows (HPC) Server 2008

  • High performance:
  • 64-bit version with access to large memory (16, 32, 64, 128 GB RAM)
  • Cluster and multi-core support
  • Cluster management and monitoring tools
  • Load balancing: Job scheduler
  • Parallel computing: MS MPI
  • Interoperability: SUA (Subsystem for UNIX-based Applications); Cygwin also works
  • Virtualization: Hyper-V for virtual machine support
SLIDE 14

Sequence Alignment

SLIDE 15

Sequence Alignment

  • Dynamic programming algorithms take $O(n^2)$ time for k = 2 genomes (k = number of genomes, n = average genome length)

Example alignment of S1 = TACAATCAA and S2 = TCACTCAC:

S1: T_ACAATCAA
S2: TCAC__TCAC

The last column is a mismatch; each "_" marks an insertion/deletion.

Needleman-Wunsch, 1970

SLIDE 16

Dynamic Programming Algorithm

  • Sequence alignment aims at maximizing the similarities between sequences.
  • Optimal sequence alignment can be computed using dynamic programming.
  • For two sequences, the best alignment is computed by filling a 2D matrix, where the score at cell (i,j) is computed as follows:

$$
\mathrm{score}(i,j) = \min
\begin{cases}
\mathrm{score}(i-1,j-1) + 1 & \text{if } S_1[i] \neq S_2[j] \\
\mathrm{score}(i-1,j-1) & \text{if } S_1[i] = S_2[j] \\
\mathrm{score}(i-1,j) + 1 & \text{(character deletion cost)} \\
\mathrm{score}(i,j-1) + 1 & \text{(character deletion cost)}
\end{cases}
$$
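The recurrence on this slide can be tried out with a minimal Python sketch. It assumes unit costs as in the formula, and it assumes that the first row and first column are initialized to j and i (the slide does not show the boundary conditions). The function name `alignment_score` is illustrative and is not part of WinBioinfTools.

```python
def alignment_score(s1: str, s2: str) -> int:
    """Fill the (len(s1)+1) x (len(s2)+1) matrix and return the final score.

    Assumption: row 0 and column 0 hold the cost of aligning a prefix
    against the empty string (one unit per character).
    """
    rows, cols = len(s1) + 1, len(s2) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        score[i][0] = i
    for j in range(cols):
        score[0][j] = j

    for i in range(1, rows):
        for j in range(1, cols):
            # Diagonal move: free on a match, one unit on a mismatch.
            diagonal = score[i - 1][j - 1] + (0 if s1[i - 1] == s2[j - 1] else 1)
            # Vertical and horizontal moves: one unit each (gap in one sequence).
            score[i][j] = min(diagonal, score[i - 1][j] + 1, score[i][j - 1] + 1)
    return score[-1][-1]
```

For example, `alignment_score("kitten", "sitting")` returns 3. The two nested loops visit every cell once, which is where the quadratic running time comes from.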

SLIDE 17

Parallelization of the DP Algorithm

  • The cluster nodes cooperate in filling the matrix (compute cluster model)
  • The filling proceeds diagonal-wise, and the master node synchronizes the filling
  • The complexity reduces to $O(n^2/k + t k')$, where t is the communication time, k is the number of cores, and k' is the number of cluster nodes

[Figure: the matrix is divided among node 1, node 2, node 3, and node 4; a synchronizing line is synchronized by the master node. The recurrence for score(i,j) is the one given on slide 16.]
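The diagonal-wise order can be illustrated with a small sequential Python sketch. It is not the MPI implementation from the project: it only shows that all cells with the same i + j depend on earlier diagonals alone, so they could be handed to different nodes, with a synchronization step between diagonals. The helper names `diagonal_cells` and `wavefront_score` and the boundary initialization are assumptions made for this illustration.

```python
def diagonal_cells(d: int, n: int, m: int) -> list:
    """Return the interior cells (i, j) with i + j == d, 1 <= i <= n, 1 <= j <= m."""
    i_lo = max(1, d - m)
    i_hi = min(n, d - 1)
    return [(i, d - i) for i in range(i_lo, i_hi + 1)]


def wavefront_score(s1: str, s2: str) -> int:
    """Fill the score matrix one anti-diagonal at a time."""
    n, m = len(s1), len(s2)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        score[i][0] = i  # assumed boundary condition
    for j in range(m + 1):
        score[0][j] = j  # assumed boundary condition

    for d in range(2, n + m + 1):
        # Cells on one diagonal are independent of each other; on a cluster
        # this inner loop is the part that would be split among the nodes.
        for i, j in diagonal_cells(d, n, m):
            diagonal = score[i - 1][j - 1] + (0 if s1[i - 1] == s2[j - 1] else 1)
            score[i][j] = min(diagonal, score[i - 1][j] + 1, score[i][j - 1] + 1)
        # A real implementation would synchronize here before the next diagonal.
    return score[n][m]
```

For a 3 x 3 matrix, `diagonal_cells(4, 3, 3)` returns `[(1, 3), (2, 2), (3, 1)]`, the longest diagonal. Short diagonals near the corners offer little parallel work, which is one reason the communication term matters for small inputs.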

SLIDE 18

Experimental Results

  • The running times (in seconds) for pairwise sequence alignment on one node and on 4 nodes.

Sequence length    Time on 4 nodes                                        Time on one node
                   Communication time   Processing time   Total
100 x 100          0.03623              0.000665          0.001765       0.0034
1000 x 1000        0.152653             0.005             0.014          0.04
5000 x 5000        0.142311             0.3               1              3.9
10000 x 10000      1.19                 1.1               2.6            8.4
20000 x 20000      3.679                2                 8              18
30000 x 30000      4                    11                15             40

  • In the first column, we list the sequence sizes, where 100 x 100, for example, means that we aligned two sequences, each 100 characters long.

SLIDE 19

Experimental Results

  • On the x-axis, we list the sequence sizes, where 100x100, for example, means that we aligned two sequences, each 100 characters long.

SLIDE 20

Database Search

SLIDE 21

Querying Biological Databases using BLAST

[Figure: three numbered steps: biological database formatting, querying, and formatting query results. The panels show a pairwise protein alignment of RBP (residues 1 to 136) against lactoglobulin (residues 1 to 135).]

SLIDE 22

Large Scale Application of BLAST

[Figure: queries arriving from the Internet, an institution, or an enterprise]

  • BLAST (Basic Local Alignment Search Tool): given a biological sequence, it searches for similar (sub)regions in the database
  • The database size is extremely large
  • The search time is proportional to the database length
  • A computer cluster provides an ideal solution for speeding up BLAST search

Altschul et al. 1997

SLIDE 23

Large Scale Application of BLAST


Database segmentation, where the whole database DB is divided into subsets DB1,…,DB4

[Figure: the database split into DB1, DB2, DB3, and DB4]
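The segmentation idea can be sketched in Python as follows. This is a toy illustration under stated assumptions, not the project's distributed BLAST: `toy_search` is a hypothetical stand-in that counts shared k-mers instead of running BLAST, and balancing the subsets by total sequence length is an assumed strategy that the slide does not specify.

```python
import heapq


def segment_database(records, parts=4):
    """Split (name, sequence) records into `parts` subsets of similar total length."""
    subsets = [[] for _ in range(parts)]
    heap = [(0, index) for index in range(parts)]  # (total length so far, subset index)
    heapq.heapify(heap)
    # Longest records first, each one going to the currently smallest subset.
    for name, seq in sorted(records, key=lambda rec: len(rec[1]), reverse=True):
        total, index = heapq.heappop(heap)
        subsets[index].append((name, seq))
        heapq.heappush(heap, (total + len(seq), index))
    return subsets


def toy_search(query, subset, k=3):
    """Hypothetical stand-in for BLAST: score = number of k-mers shared with the query."""
    query_kmers = {query[i:i + k] for i in range(len(query) - k + 1)}
    hits = []
    for name, seq in subset:
        shared = sum(1 for i in range(len(seq) - k + 1) if seq[i:i + k] in query_kmers)
        if shared:
            hits.append((shared, name))
    return hits


def search_segmented(query, records, parts=4):
    """Search every subset separately and merge the hits, best score first."""
    merged = []
    for subset in segment_database(records, parts):
        # On a cluster, each subset would be searched by a different node.
        merged.extend(toy_search(query, subset))
    return sorted(merged, reverse=True)
```

Because every record lands in exactly one subset, the merged result is the same as searching the whole database at once; only the work is divided.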

SLIDE 24

Running Time for 1000 Queries on BLAST

Running times in hours for biological databases. The first 3 databases are DNA, while the others are proteins. The query sequence is of the same type as the database.

Database      Running on 4 nodes (Windows)                              One node
              Communication time   Processing time   Total time
Drosoph       0.014522             0.023478          0.038             0.08
Pataa         0.01835              0.116             0.13435           0.5
est_others    0.0343               0.5456            0.5799            1
env_nr        0.53077              3.5               4.03077           18
Nr            0.4077               6.8               7.2077            27
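To read the table in terms of speedup, the following Python sketch recomputes the totals and derives the speedup of 4 nodes over one node. The numbers are copied from the table above; the function name `summarize` and the derived quantities (speedup and share of communication time) are additions for illustration and do not appear on the slide.

```python
# (communication, processing, one node) in hours, copied from the table.
TIMINGS = {
    "Drosoph": (0.014522, 0.023478, 0.08),
    "Pataa": (0.01835, 0.116, 0.5),
    "est_others": (0.0343, 0.5456, 1.0),
    "env_nr": (0.53077, 3.5, 18.0),
    "Nr": (0.4077, 6.8, 27.0),
}


def summarize(communication: float, processing: float, one_node: float) -> dict:
    """Return the 4-node total, the speedup, and the communication share."""
    total = communication + processing
    return {
        "total": total,
        "speedup": one_node / total,
        "communication_share": communication / total,
    }


if __name__ == "__main__":
    for database, (comm, proc, single) in TIMINGS.items():
        row = summarize(comm, proc, single)
        print(f"{database:12s} total={row['total']:.4f} h  "
              f"speedup={row['speedup']:.2f}  "
              f"communication={row['communication_share']:.0%}")
```

For Nr, this gives a total of 7.2077 hours on 4 nodes and a speedup of about 3.75 over one node.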

SLIDE 25

Running Time for 1000 Queries on BLAST

Running times in hours for biological databases. The first 3 databases are DNA, while the others are proteins. The query sequence is of the same type as the database.

SLIDE 26

Comparative Genomics

SLIDE 27

Genome Comparison

Genome Comparison: Given two genomic sequences, locate the regions of similarity and difference.

[Figure: human genome and mouse genome; human chromosomes and mouse chromosomes]

SLIDE 28

CoCoNUT

  • CoCoNUT is written in Perl and C/C++ and it was intended to run under Linux/Unix
  • CoCoNUT
  • Compares two or multiple genomes
  • Compares draft genomes
  • Analyzes repeats
  • Maps cDNA to complete genomes

Abouelhoda, Kurtz, Ohlebusch, 2008

SLIDE 29

CoCoNUT

  • CoCoNUT is written in Perl and C/C++ and it was intended to run under Linux/Unix
  • CoCoNUT was ported to run under Windows
  • Parts of the code are compiled and run directly on Windows
  • Third-party packages (GenomeTools) run using SUA and Cygwin
  • The correctness of the port was verified by comparison to results obtained before
  • The large-scale comparison runs
  • using the Job Scheduler
  • using our MPI-based script to save some computations

Abouelhoda, Kurtz, Ohlebusch, 2008

SLIDE 30

Pairwise Comparison of multi-chromosomal

Chromosome comparisons are independent of each other → divide the comparisons among the cluster nodes

[Figure: each pair (X) of a Genome 1 chromosome and a Genome 2 chromosome is assigned to one of the nodes N1, N2, ……, Nm]
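Dividing independent chromosome comparisons among nodes can be sketched in a few lines of Python. The round-robin policy and the function name `assign_comparisons` are assumptions for illustration; the slide does not say how the project assigns pairs, and a scheduler that balances by chromosome length may do better.

```python
from itertools import product


def assign_comparisons(chromosomes1, chromosomes2, num_nodes):
    """Assign every (genome 1 chromosome, genome 2 chromosome) pair to a node.

    Returns a dict mapping node number (1..num_nodes) to its list of pairs.
    Assumption: simple round-robin assignment.
    """
    assignment = {node: [] for node in range(1, num_nodes + 1)}
    for index, pair in enumerate(product(chromosomes1, chromosomes2)):
        assignment[index % num_nodes + 1].append(pair)
    return assignment
```

With two chromosomes in the first genome, three in the second, and 4 nodes, the 6 comparisons are spread so that no node receives more than 2 of them.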

SLIDE 31

Comparison between the Human and Mouse Genome

  • Total time of 20 comparisons on one node: approx. 47 h on Windows
  • Total time of 20 comparisons on 4 nodes: approx. 12 h on Windows
  • Estimated total time for the whole human-mouse comparison is 75 days on 1 node and 19 days on 4 nodes

[Figure: two comparison plots, Human Chr. 13 to Mouse Chr. 14 and Human Chr. 18 to Mouse Chr. 18]

SLIDE 32

Availability of our Tool

  • The outcome of this project is the package WinBioinfTools → open source → available to download from

  • NUBIOS: Nile University Bioinformatics Server

http://www.nubios.nileu.edu.eg/tools/WinBioinfTools

  • CodePlex: Microsoft repository for open source tools

http://winbioinftools.codeplex.com

  • In September: 138 total downloads since its release in May 2009

  • 2 November: 176 downloads
SLIDE 33

Conclusions and Lessons

  • Porting applications to HPC is not always straightforward → research is still needed on the algorithmic level
  • Parallelization is not scalable
  • Mixed model of parallelization is becoming the trend
  • The Windows cluster solution encapsulates the features required to conduct high performance computing and to migrate applications from Unix/Linux in a user-friendly way
  • The Linux version is slightly faster, but this is not attributed to the cluster modules of the Windows HPC Server; rather, it is due to the third-party packages used (GenomeTools)

SLIDE 34

Thanks for your attention

SLIDE 35

Advantages of Windows (HPC) Server 2008

For administrators

  • Easy to download, with rich and informative documentation. Full configuration (including HPC, security, networking) takes 1.5 h per node.
  • Efficient, easy-to-use cluster management tools

For users

  • User-friendly GUI → focus more on the application
  • Job scheduler with an intuitive interface and efficiency in practice
  • No sophisticated command lines

For developers

  • MS-MPI (the Microsoft implementation of MPI) runs smoothly through Visual Studio, i.e., it is easy to compile and run. It can also run virtually over many cores.
  • The debugging features of Visual Studio 2008 support parallel algorithms