15th Symposium on Computer Architecture and High Performance Computing



  1. 15th Symposium on Computer Architecture and High Performance Computing
     November 10 to 12 - São Paulo, SP
     Comparison of Genomes using High-Performance Parallel Computing
     N. F. Almeida Jr, Universidade Federal de Mato Grosso do Sul
     C. E. R. Alves, Universidade São Judas Tadeu
     E. N. Cáceres, Universidade Federal de Mato Grosso do Sul
     S. W. Song, Universidade de São Paulo

  2. Comparison of Entire Genomes
     • Comparing genomes is useful for investigating common functionalities of the corresponding organisms.
     • Our purpose is twofold:
       – Use parallel computing so that more expensive alignment methods (dynamic programming) can be used.
       – Locate and compare not only homologous genes, but also the regions between corresponding homologous genes.
     • As an example, we compare:
       – Xanthomonas axonopodis pv. citri, with 5,175,554 base pairs and 4,313 protein-coding genes
       – Xanthomonas campestris pv. campestris, with 5,076,187 base pairs and 4,182 protein-coding genes

  3. Motivations and Previous Works
     • Homology: two genes share a common evolutionary past.
     • Similarity between two DNA or amino-acid sequences often suggests homology.
     • Homology in turn may determine function.
     Thus: Similarity → homology → function
     Rasera, Setubal, Almeida et al. [2002] compared the whole genomes of Xanthomonas axonopodis pv. citri and Xanthomonas campestris pv. campestris and concluded that the two share more than 80% of their genes.

  4. Comparison Strategy - Main Ideas
     Given two genomes G and H and their gene locations:
     1. Find and label pairs of homologous genes.
        [Figure: genes labelled 1 2 3 4 5 6 along g, each linked to its homolog on h, where the labels appear in the order 1 2 4 3 5 6.]
     2. Find the non-crossing pairs of homologous genes.
        [Figure: the remaining pairs 1 2 3 5 6 on both genomes, with no links crossing and genes g, g′ and h, h′ marked.]
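The example in the figures can be reproduced with a short sketch. It assumes, as in the figures, that the genes of g carry increasing labels, so a set of pairs is non-crossing exactly when the labels read along h are increasing. The function name `non_crossing_labels` is hypothetical, and the longest-increasing-subsequence routine is only an illustration of the idea, not the authors' code.

```python
from bisect import bisect_left

def non_crossing_labels(labels_h):
    """Return one longest strictly increasing subsequence of labels_h."""
    tails = []       # tails[k]: smallest last label of an increasing run of length k + 1
    tails_idx = []   # position in labels_h of each entry of tails
    prev = [-1] * len(labels_h)  # back-pointers used to rebuild the run
    for i, label in enumerate(labels_h):
        k = bisect_left(tails, label)
        if k == len(tails):
            tails.append(label)
            tails_idx.append(i)
        else:
            tails[k] = label
            tails_idx[k] = i
        prev[i] = tails_idx[k - 1] if k > 0 else -1
    # Walk back from the end of the longest run.
    result = []
    i = tails_idx[-1] if tails_idx else -1
    while i != -1:
        result.append(labels_h[i])
        i = prev[i]
    return result[::-1]

print(non_crossing_labels([1, 2, 4, 3, 5, 6]))  # [1, 2, 3, 5, 6]
```

For this input, 1 2 4 5 6 would be an equally long answer; which of the tied solutions is returned depends on the routine used.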

  5. Comparison Strategy (continued)
     [Figure: the non-crossing pairs 1 2 3 5 6 on both genomes, with genes g, g′ and h, h′ marked.]
     3. Align each pair of homologous genes.
     4. Align each pair of intergenic regions (e.g. [g, g′] and [h, h′]).
     5. Join all alignments.

  6. Comparison Strategy - Details
     1: Find pairs of homologous genes:
        For all g of G, obtain h of H such that DP-score(g, h) = max { DP-score(g, w) for all w of H }.
     2: Label the homologous genes of G:
        Label the homologous genes of G as 1, 2, ..., m in the same order as their positions in the genome G. Let LabelG denote the sequence of labels obtained in this step.
     3: Label the corresponding homologous genes of H:
        For all pairs of homologous genes (g, h), g of G, h of H, label gene h with the same label as g. Let LabelH denote the sequence of labels obtained in this step.
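Steps 1 to 3 can be sketched as follows. This is a toy illustration under stated assumptions: `toy_score` is a hypothetical stand-in for DP-score, every gene of G is treated as having a homolog (the slides give no cut-off), and the names `label_homologs`, `label_g` and `label_h` are invented here.

```python
def toy_score(a, b):
    """Stand-in for DP-score: number of positions where a and b agree."""
    return sum(x == y for x, y in zip(a, b))

def label_homologs(genes_g, genes_h, score=toy_score):
    """genes_g, genes_h: gene sequences listed in genome order.

    Returns (label_g, label_h), the label sequences of steps 2 and 3.
    """
    # Step 1: for each g, pick the index of the best-scoring gene of H.
    best = [max(range(len(genes_h)), key=lambda j: score(g, genes_h[j]))
            for g in genes_g]
    # Step 2: label the genes of G as 1..m in genome order.
    label_g = list(range(1, len(genes_g) + 1))
    # Step 3: each h inherits the label of its partner g.
    # Assumption: if two genes of G pick the same h, this toy version keeps
    # the later one; the slides do not say how such cases are resolved.
    label_of_h = {j: lab for lab, j in zip(label_g, best)}
    # Read the labels in the order the genes appear along H.
    label_h = [label_of_h[j] for j in sorted(label_of_h)]
    return label_g, label_h

print(label_homologs(["aaaa", "cccc", "gggg", "tttt"],
                     ["aaat", "gggc", "cccg", "tttt"]))
# ([1, 2, 3, 4], [1, 3, 2, 4])
```

In the toy data the second and third genes appear in swapped order along H, so LabelH comes out as 1 3 2 4.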

  7. Comparison Strategy - Details (continued)
     4: Find the non-crossing pairs of homologous genes:
        Obtain LCS(LabelG, LabelH). The LCS obtained contains only the non-crossing pairs.
     5: Align each pair of homologous genes:
        For each non-crossing homologous pair (g, h), do DP-align(g, h).
     6: Align each pair of intergenic regions:
        For each intergenic region [g, g′], where g and g′ are two consecutive genes of G in the LCS, obtain the corresponding intergenic region [h, h′] in H and do DP-align([g, g′], [h, h′]).
     7: Join all the alignments:
        Concatenate the alignments of the homologous genes and the intergenic regions, in the same order they appear in the genomes.
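A minimal sequential sketch of step 4, followed by the region pairing of step 6, might look like this. The slides use a parallel LCS algorithm; the quadratic textbook version below is a swapped-in simplification, and the names `lcs`, `kept` and `regions` are assumptions made for the example.

```python
def lcs(xs, ys):
    """Return one longest common subsequence of the sequences xs and ys."""
    n, m = len(xs), len(ys)
    # length[i][j] = LCS length of xs[i:] and ys[j:]
    length = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if xs[i] == ys[j]:
                length[i][j] = 1 + length[i + 1][j + 1]
            else:
                length[i][j] = max(length[i + 1][j], length[i][j + 1])
    # Trace forward through the table to recover one optimal subsequence.
    out, i, j = [], 0, 0
    while i < n and j < m:
        if xs[i] == ys[j]:
            out.append(xs[i])
            i += 1
            j += 1
        elif length[i + 1][j] > length[i][j + 1]:
            i += 1
        else:
            j += 1
    return out

label_g = [1, 2, 3, 4, 5, 6]
label_h = [1, 2, 4, 3, 5, 6]
kept = lcs(label_g, label_h)          # labels of the non-crossing pairs
# Consecutive kept genes bound the intergenic regions aligned in step 6.
regions = list(zip(kept, kept[1:]))
print(kept)     # [1, 2, 3, 5, 6]
print(regions)  # [(1, 2), (2, 3), (3, 5), (5, 6)]
```

Note that the region bounded by genes 3 and 5 contains the dropped gene 4, so in this scheme a discarded gene is compared as part of an intergenic region.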

  8. Computing Similarity of Two Strings
     A simple example of string alignment:
       A      a c t t c a - t
       C      a t t c - a c g
       Score  1 0 1 0 0 1 0 0   (total 3)

       A      a c t t c a - t
       C      a - t t c a c g
       Score  1 0 1 1 1 1 0 0   (total 5)
     Using dynamic programming (gives better quality alignments):
     [Figure: dynamic programming grid running from (0, 0) to (8, 10) for the strings baabcbca (rows) and baabcabcab (columns); cell (i, j) depends on cells (i − 1, j − 1), (i − 1, j) and (i, j − 1).]
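The recurrence in the grid can be written out as a small sketch. The scoring values (1 for a match, 0 for a mismatch or a space) are read off the alignment example above; whether the authors used the same values on real genes is not stated, so treat them, and the function name `similarity`, as assumptions.

```python
def similarity(a, c, match=1, mismatch=0, gap=0):
    """Best global alignment score of strings a and c."""
    rows, cols = len(a), len(c)
    # s[i][j] = best score for aligning a[:i] with c[:j]
    s = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(1, rows + 1):
        s[i][0] = i * gap
    for j in range(1, cols + 1):
        s[0][j] = j * gap
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            pair = match if a[i - 1] == c[j - 1] else mismatch
            s[i][j] = max(s[i - 1][j - 1] + pair,  # align a[i-1] with c[j-1]
                          s[i - 1][j] + gap,       # a[i-1] against a space
                          s[i][j - 1] + gap)       # c[j-1] against a space
    return s[rows][cols]

print(similarity("acttcat", "attcacg"))  # 5
```

With these scores the optimum for the two example strings is 5, the total of the second alignment shown on the slide.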

  9. The Parallel Solution
     • Finding homologous pairs (the most time-consuming phase): compare all the genes of one genome with all the genes of the other, which takes more than 18 million alignments by dynamic programming. Two types of parallelism are used:
       – The master distributes the alignment tasks to slave processors.
       – When the lengths of the sequences to be aligned exceed 5,000 base pairs, parallel dynamic programming is used.
     • Finding the non-crossing homologous gene pairs: we used a parallel LCS (longest common subsequence) algorithm. (A LIS, longest increasing subsequence, algorithm could have been used instead.)
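The task distribution can be pictured with a small sketch. It is not the authors' MPI code: a thread pool from the Python standard library stands in for the master and slave processes, `score` and `score_long` are hypothetical placeholders for the sequential and parallel dynamic programming routines, and the reading that the 5,000 base pair rule applies when either sequence is longer than the threshold is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

LONG_SEQUENCE = 5000  # threshold from the slide, in base pairs

def score(a, b):
    """Placeholder for the sequential dynamic programming score."""
    return sum(x == y for x, y in zip(a, b))

def score_long(a, b):
    """Placeholder for the parallel dynamic programming variant."""
    return score(a, b)

def run_task(task):
    """Worker side: score one (i, j) gene pair."""
    i, j, g, h = task
    long_pair = max(len(g), len(h)) > LONG_SEQUENCE  # assumed reading of the rule
    return i, j, (score_long if long_pair else score)(g, h)

def all_pairs_scores(genes_g, genes_h, workers=4):
    """Master side: hand out every (g, h) comparison and collect the scores."""
    tasks = [(i, j, g, h)
             for (i, g), (j, h) in product(enumerate(genes_g), enumerate(genes_h))]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return {(i, j): s for i, j, s in pool.map(run_task, tasks)}

scores = all_pairs_scores(["acgt", "ttga"], ["acga", "ttgc", "gggg"])
print(len(scores))  # 6 tasks: 2 genes x 3 genes
```

For the two genomes in the slides the same all-against-all scheme yields 4,313 × 4,182 = 18,036,966 tasks, which is the "more than 18 million alignments" quoted above.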

  10. The Parallel Platform Used
      • 64-node Beowulf cluster of low-cost microcomputers, each with 256 MB RAM, 256 MB swap memory, an Intel Pentium III CPU at 448.956 MHz, and 512 KB cache.
      • 100 Mb/s Fast Ethernet switch.
      • Code in standard ANSI C and LAM-MPI Version 6.5.6.

  11. Preliminary Implementation Results
      • Finding homologous pairs (most time consuming):
        – Sequential solution using BLAST and EGG: 3 hours.
        – Parallel solution using dynamic programming: 1 hour 15 minutes.
      • Finding non-crossing pairs (clearly not the dominant step):
        – Sequential solution using BLAST and EGG: not available.
        – Parallel solution using dynamic programming: 20 seconds.

  12. Conclusion
      We compared the whole genomes of two organisms:
      • We exploited parallelism in two ways:
        – A standard master-slave approach distributes comparison tasks (sequential dynamic programming) to slave processors.
        – Whenever the sequences to be compared are longer than 5,000 base pairs, their similarity is computed with parallel dynamic programming.
      • The gain does not seem very significant; however, the dynamic programming approach we used gives better quality results.
      • Our comparison strategy also compares the intergenic regions between two consecutive homologous genes in each genome. The relevance of this from a biological viewpoint is yet to be investigated.
