

  1. Efficient Scaling Up of Parallel Graph Algorithms for Genome-Scale Biological Problems on Cray XT
     Kevin Thomas, Cray Inc., May 2008

  2. Outline
     - Biological networks
     - Graph algorithms and terminology
     - Implementation of a parallel graph algorithm
     - Optimization of single-thread performance
     - Lessons learned

  3. Analysis of Biological Networks
     Analysis of biological networks is an increasingly used tool in biology.
     Numerous types of biological networks:
     - Gene expression
     - Protein interaction
     - Metabolic
     - Phylogenetic
     - Signal transduction
     Biological network analysis requires the solution of combinatorial problems:
     - Maximal and maximum clique
     - Vertex cover
     - Dominating set
     - Shortest path

  4. Biological Applications of Maximal Clique Enumeration
     [Diagram: MCE linked to its application areas]
     - Structural alignment (tertiary structure)
     - Genome mapping
     - Functional protein relationships (gene expression)

  5. Graphs and Cliques
     - Graphs are composed of vertices connected by edges.
     - A clique is a set of vertices which are pairwise connected.
     - A maximal clique cannot include any additional vertex and still remain a clique.
     [Figure: five-vertex example graph on a, b, c, d, e; (a,c,d,e) is a maximal clique]
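The pairwise-connectivity definition maps directly onto an adjacency-matrix test. A minimal C sketch (my own illustration, not code from the talk) using the five-vertex example graph:

```c
#include <stdio.h>

/* Adjacency matrix for the example graph (vertices a..e = 0..4).
   Edges: a-b, a-c, a-d, a-e, b-d, c-d, c-e, d-e. */
static const int adj[5][5] = {
    {0,1,1,1,1},
    {1,0,0,1,0},
    {1,0,0,1,1},
    {1,1,1,0,1},
    {1,0,1,1,0},
};

/* A set of vertices is a clique iff every pair is connected. */
static int is_clique(const int *verts, int n)
{
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++)
            if (!adj[verts[i]][verts[j]])
                return 0;
    return 1;
}

int main(void)
{
    int acde[] = {0, 2, 3, 4};  /* (a,c,d,e) */
    int abc[]  = {0, 1, 2};     /* (a,b,c): b and c are not adjacent */
    printf("(a,c,d,e) clique? %d\n", is_clique(acde, 4));  /* 1 */
    printf("(a,b,c)   clique? %d\n", is_clique(abc, 3));   /* 0 */
    return 0;
}
```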

  6. Maximal Clique Enumeration
     Finding all of the maximal cliques of a graph.
     [Figure: the example graph with its two maximal cliques, (a,b,d) and (a,c,d,e)]

  7. Maximal Clique Enumeration: Brute Force Search
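The slide does not show the brute-force code; here is a hypothetical C sketch that enumerates all 2^N vertex subsets and reports those that no outside vertex can extend. Its O(2^N) cost is exactly why brute force fails beyond small graphs:

```c
#include <stdio.h>

#define N 5
static const int adj[N][N] = {  /* example graph from slide 5 */
    {0,1,1,1,1},{1,0,0,1,0},{1,0,0,1,1},{1,1,1,0,1},{1,0,1,1,0},
};

/* Is the vertex subset encoded by bitmask s a clique? */
static int is_clique(unsigned s)
{
    for (int i = 0; i < N; i++)
        for (int j = i + 1; j < N; j++)
            if ((s >> i & 1) && (s >> j & 1) && !adj[i][j])
                return 0;
    return 1;
}

int main(void)
{
    for (unsigned s = 1; s < 1u << N; s++) {
        if (!is_clique(s))
            continue;
        /* Maximal iff no vertex outside s can be added. */
        int maximal = 1;
        for (int v = 0; v < N && maximal; v++)
            if (!(s >> v & 1) && is_clique(s | 1u << v))
                maximal = 0;
        if (maximal) {
            for (int v = 0; v < N; v++)
                if (s >> v & 1)
                    putchar('a' + v);
            putchar('\n');              /* prints abd and acde */
        }
    }
    return 0;
}
```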

  8. Maximal Clique Enumeration
     Applying a backtracking algorithm results in a search tree.
     [Figure: search tree over the example graph, branching on vertices a through e; leaves correspond to the maximal cliques]
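The talk does not name its backtracking algorithm; the classical one for MCE is Bron-Kerbosch, sketched here over 32-bit neighbor bitsets for the example graph. The bitset encoding and the GCC builtin `__builtin_ctz` are my assumptions, not details from the slides:

```c
#include <stdio.h>

#define N 5
/* Neighbor bitsets for the example graph (bit v = vertex 'a' + v). */
static const unsigned nbr[N] = {
    0x1e,  /* a: b,c,d,e */
    0x09,  /* b: a,d     */
    0x19,  /* c: a,d,e   */
    0x17,  /* d: a,b,c,e */
    0x0d,  /* e: a,c,d   */
};

/* Bron-Kerbosch: R is the clique built so far, P the candidates that
   could still extend it, X the vertices already tried. Each call is
   one node of the search tree on the slide. */
static void bk(unsigned R, unsigned P, unsigned X)
{
    if (P == 0 && X == 0) {             /* leaf: R is a maximal clique */
        for (int v = 0; v < N; v++)
            if (R >> v & 1)
                putchar('a' + v);
        putchar('\n');
        return;
    }
    while (P) {
        int v = __builtin_ctz(P);       /* lowest candidate: branch on it */
        bk(R | 1u << v, P & nbr[v], X & nbr[v]);
        P &= ~(1u << v);                /* backtrack: move v from P to X */
        X |= 1u << v;
    }
}

int main(void)
{
    bk(0u, (1u << N) - 1, 0u);          /* prints abd and acde */
    return 0;
}
```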

  9. Parallel Maximal Clique Enumeration
     - The search tree is divided into independent sub-trees.
     - Unexplored sub-trees are represented as candidate paths.
     - The candidate paths are placed into per-thread work pools.
     [Figure: four threads, each holding a pool of candidate paths]
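A hypothetical C sketch of the two structures the slide describes: a candidate path as a compact record of the search state, and a lock-guarded per-thread work pool. All names are mine, and the (R, P, X) encoding assumes the Bron-Kerbosch framing from the previous sketch:

```c
#include <pthread.h>
#include <stdlib.h>

/* One unexplored sub-tree, encoded as the sets of the backtracking
   call that would expand it. Small and cheap to hand to another
   thread, or to ship to another process. */
typedef struct {
    unsigned R, P, X;
} cand_path;

/* Per-thread work pool: a LIFO stack guarded by a lock so that
   idle threads can steal from it safely. */
typedef struct {
    cand_path      *paths;
    int             top, cap;
    pthread_mutex_t lock;
} work_pool;

static void pool_push(work_pool *w, cand_path p)
{
    pthread_mutex_lock(&w->lock);
    if (w->top == w->cap) {             /* grow on demand */
        w->cap = w->cap ? 2 * w->cap : 64;
        w->paths = realloc(w->paths, w->cap * sizeof *w->paths);
    }
    w->paths[w->top++] = p;
    pthread_mutex_unlock(&w->lock);
}

/* Returns 1 and fills *out if work was available, else 0. */
static int pool_pop(work_pool *w, cand_path *out)
{
    pthread_mutex_lock(&w->lock);
    int ok = w->top > 0;
    if (ok)
        *out = w->paths[--w->top];
    pthread_mutex_unlock(&w->lock);
    return ok;
}
```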

  10. Load Balancing
     - The work pools can become unbalanced over time.
     - Dynamic load balancing is done through work stealing.
     [Figure: an idle thread stealing a candidate path from another thread's pool]

  11. Two Levels of Load Balancing
     Thread level:
     - Used when one thread of a process becomes idle
     - Balances work within a single process
     - Each thread acts on its own to steal work from other threads
     - Locks are used to prevent race conditions
     Process level:
     - Used when all threads of a process become idle
     - Local master thread sends a request to another process
     - Remote master thread responds to the request
     - Master thread must poll for incoming requests while performing the main computation
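Continuing the pool sketch above, a possible thread-level steal: the idle thread probes each sibling pool under its lock and takes the oldest entry, which is likely to represent the largest remaining sub-tree. The victim-selection and steal-amount policies are my assumptions; the slides specify only that locks prevent races:

```c
/* Thread-level load balancing (sketch; assumes the cand_path and
   work_pool types from the previous listing, plus <pthread.h>). */
static int steal_work(work_pool *pools, int nthreads, int self,
                      cand_path *out)
{
    for (int i = 1; i < nthreads; i++) {
        work_pool *victim = &pools[(self + i) % nthreads];
        pthread_mutex_lock(&victim->lock);
        if (victim->top > 0) {
            *out = victim->paths[0];    /* steal the oldest entry */
            victim->paths[0] = victim->paths[--victim->top];
            pthread_mutex_unlock(&victim->lock);
            return 1;
        }
        pthread_mutex_unlock(&victim->lock);
    }
    return 0;   /* no thread had work: escalate to the process level */
}
```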

  12. Load Balancing Between Processes
     [Figure: the master thread of an idle process sends a work request to another process, whose master thread responds with a candidate path from its pool]
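A hedged sketch of the process-level exchange, reusing the `cand_path` and `work_pool` types above: the master thread polls with MPI_Iprobe between units of work and answers steal requests. The message tags and the zero-byte "no work to share" convention are my assumptions:

```c
#include <mpi.h>

enum { TAG_REQUEST = 1, TAG_RESPONSE = 2 };

/* Called periodically by the master thread during computation:
   answer at most one pending steal request. An empty response
   (zero bytes) means "no work to share". */
static void poll_requests(work_pool *pool)
{
    int pending;
    MPI_Status st;
    MPI_Iprobe(MPI_ANY_SOURCE, TAG_REQUEST, MPI_COMM_WORLD, &pending, &st);
    if (!pending)
        return;

    MPI_Recv(NULL, 0, MPI_BYTE, st.MPI_SOURCE, TAG_REQUEST,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    cand_path p;
    int have = pool_pop(pool, &p);
    MPI_Send(&p, have ? (int)sizeof p : 0, MPI_BYTE,
             st.MPI_SOURCE, TAG_RESPONSE, MPI_COMM_WORLD);
}

/* Called when every thread in this process is idle:
   ask one remote process for work. */
static int request_work(int victim_rank, cand_path *out)
{
    MPI_Status st;
    int nbytes;
    MPI_Send(NULL, 0, MPI_BYTE, victim_rank, TAG_REQUEST, MPI_COMM_WORLD);
    MPI_Recv(out, (int)sizeof *out, MPI_BYTE, victim_rank, TAG_RESPONSE,
             MPI_COMM_WORLD, &st);
    MPI_Get_count(&st, MPI_BYTE, &nbytes);
    return nbytes > 0;      /* zero bytes: victim had nothing to share */
}
```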

  13. Termination
     - Process-level load balancing attempts are made until all processes have been checked.
     - When no process has work to share, the idle state is entered.
     - To synchronize globally, an idle notification is sent to each process.
     - When all processes are idle, the job can terminate.
     - 2(N-1)^2 messages are required for termination.

  14. Adjacency Test: Linear List
     - An important MCE operation is testing two vertices for adjacency.
     - The graph representation uses a vertex adjacency list: each vertex has a list of adjacent vertices.
     - An adjacency test requires a list traversal.
     - A linked list is easy to build, but slow to search; a linear list (array) is faster to search.
     Adjacency lists for the example graph:
     a: b, c, d, e
     b: a, d
     c: a, d, e
     d: a, b, c, e
     e: a, c, d
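A small C sketch of the array-based test, using CSR-style contiguous storage, which is one common way to lay out "linear list" adjacency; the original layout is not shown in the talk:

```c
#include <stdio.h>

/* Compressed adjacency lists (CSR layout) for the example graph.
   adj_list[off[v] .. off[v+1]-1] holds the neighbors of vertex v. */
static const int off[6]     = {0, 4, 6, 9, 13, 16};
static const int adj_list[] = {1,2,3,4,  0,3,  0,3,4,  0,1,2,4,  0,2,3};

/* Linear scan: O(degree(u)) per adjacency test. */
static int adjacent(int u, int v)
{
    for (int i = off[u]; i < off[u + 1]; i++)
        if (adj_list[i] == v)
            return 1;
    return 0;
}

int main(void)
{
    printf("b~d: %d  b~e: %d\n", adjacent(1, 3), adjacent(1, 4));  /* 1 0 */
    return 0;
}
```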

  15. Adjacency Test: Bit Matrix
     - The adjacency bit matrix has a fast, constant-time lookup.
     - Memory requirement is N^2 bits.
     Bit matrix for the example graph:
         a b c d e
     a   - 1 1 1 1
     b   1 - 0 1 0
     c   1 0 - 1 1
     d   1 1 1 - 1
     e   1 0 1 1 -
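A minimal word-packed bit-matrix sketch; at the talk's 3472 vertices, N^2 bits is roughly 1.5 MB. The packing details are my own, not taken from the talk:

```c
#include <stdio.h>

#define NV    3472                      /* graph size from slide 19 */
#define WORDS ((NV + 63) / 64)

static unsigned long long bits[NV][WORDS];  /* NV*NV bits, ~1.5 MB */

static void set_edge(int u, int v)
{
    bits[u][v / 64] |= 1ULL << (v % 64);
    bits[v][u / 64] |= 1ULL << (u % 64);
}

/* Constant-time adjacency test: one index, one shift, one mask. */
static int adjacent(int u, int v)
{
    return (int)((bits[u][v / 64] >> (v % 64)) & 1);
}

int main(void)
{
    set_edge(0, 1);
    printf("%d %d\n", adjacent(1, 0), adjacent(0, 2));  /* 1 0 */
    return 0;
}
```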

  16. Adjacency Test: Hash Table
     - The adjacency hash table has a fast, constant-time lookup, but not as fast as the bit matrix.
     - Memory requirement is cN (2N in this example).
     - The data structure is a sparse linear list, but access is direct, through key hashing.
     Hash tables for the example graph ('-' marks an empty slot):
     a: - c - b - d - e
     b: - - d a
     c: e d - a - -
     d: - c - b - - a e
     e: - d - a - c
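A sketch of a per-vertex open-addressing table sized at twice the vertex degree, mirroring the slide's sparse layout; the modulo hash and linear probing are my assumptions:

```c
#include <stdio.h>
#include <stdlib.h>

#define EMPTY -1

/* Per-vertex open-addressing hash table, sized 2 * degree. */
typedef struct {
    int *slot;
    int  size;      /* 2 * degree */
} vtx_table;

static void insert(vtx_table *t, int v)
{
    int i = v % t->size;                /* hash: simple modulo (assumed) */
    while (t->slot[i] != EMPTY)
        i = (i + 1) % t->size;          /* linear probing */
    t->slot[i] = v;
}

/* Expected O(1) lookup at load factor 1/2; slower than the bit
   matrix because of the hash and a possible probe chain. */
static int adjacent(const vtx_table *t, int v)
{
    int i = v % t->size;
    while (t->slot[i] != EMPTY) {
        if (t->slot[i] == v)
            return 1;
        i = (i + 1) % t->size;
    }
    return 0;
}

int main(void)
{
    int a_nbrs[] = {1, 2, 3, 4};        /* neighbors of vertex a */
    vtx_table a = { malloc(8 * sizeof(int)), 8 };
    for (int i = 0; i < 8; i++) a.slot[i] = EMPTY;
    for (int i = 0; i < 4; i++) insert(&a, a_nbrs[i]);
    printf("a~d: %d  a~a: %d\n", adjacent(&a, 3), adjacent(&a, 0));  /* 1 0 */
    free(a.slot);
    return 0;
}
```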

  17. Adjacency Test Performance Comparison
     [Chart: cliques/second (0 to 300,000) for the linear list, hash table, and bit matrix implementations]

  18. SMP Versus DMP Programming
     [Chart: time in seconds (64.00 to 64.80) for four configurations of 8 cores: 1 process x 8 threads, 2 processes x 4 threads, 4 processes x 2 threads, 8 processes x 1 thread]

  19. Parallel Scaling on Cray XT4 (quad-core)
     - At 2048 processes, compute time is 2.1 seconds.
     - Overhead due to message passing is 0.43 seconds.
     - The graph contains 3472 vertices; the run found 2.6 billion maximal cliques.
     [Chart: speedup of pDFS versus ideal, 1 to 2048 processes, log-log scale]

  20. Conclusion
     Explicit decomposition at the thread level enabled easier implementation of MPI:
     - Independent work already identified
     - Compact representation of units of work
     Additional work:
     - Improved load balancing by grouping processes
     - Parallel I/O optimization

  21. Conclusion
     Research group members:
     - Nagiza F. Samatova, North Carolina State University and Oak Ridge National Laboratory
     - Matthew Schmidt, North Carolina State University and Oak Ridge National Laboratory
     - Byung-Hoon Park, Oak Ridge National Laboratory
     - Kevin Thomas, Cray Inc.
     Thank you! Questions?
