SLIDE 1

Efficient Scaling Up of Parallel Graph Algorithms for Genome-Scale Biological Problems on Cray XT

Kevin Thomas, Cray Inc.

SLIDE 2

Outline

• Biological networks
• Graph algorithms and terminology
• Implementation of a parallel graph algorithm
• Optimization of single-thread performance
• Lessons learned

May 2008 Cray Inc. Proprietary Slide 2

SLIDE 3

Analysis of Biological Networks

Analysis of biological networks is an increasingly used tool in biology. There are numerous types of biological networks:

• Gene Expression
• Protein Interaction
• Metabolic
• Phylogenetic
• Signal Transduction

Analysis of biological networks requires the solution of combinatorial problems:

• Maximal and maximum clique
• Vertex cover
• Dominating set
• Shortest path

SLIDE 4

Biological Applications of Maximal Clique Enumeration


[Diagram: MCE at the center, linked to its applications]

• Structural Alignment
• Gene Expression
• Functional Protein Relationships
• Tertiary Structure
• Genome Mapping

SLIDE 5

• Graphs are composed of vertices connected by edges
• A clique is a set of vertices which are pair-wise connected
• A maximal clique cannot include any additional vertex and still remain a clique
• (a,c,d,e) is a maximal clique

Graphs and Cliques

[Figure: five-vertex graph with vertices a, b, c, d, e]
SLIDE 6

Finding all of the maximal cliques of a graph: (a,b,d) and (a,c,d,e)

Maximal Clique Enumeration

[Figure: the two maximal cliques, (a,b,d) and (a,c,d,e), highlighted on the graph]
SLIDE 7

Maximal Clique Enumeration

Brute Force Search

SLIDE 8

Applying a backtracking algorithm results in a search tree

Maximal Clique Enumeration

[Figure: backtracking search tree over the graph, branching on a, b, c, d, e]
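The backtracking search can be sketched as a Bron-Kerbosch-style recursion. This is a minimal Python sketch, not the implementation from the talk; the graph is the five-vertex example from the figures.

```python
def enumerate_maximal_cliques(adj):
    """Yield each maximal clique of a graph given as {vertex: set_of_neighbors}."""
    def expand(clique, candidates, excluded):
        if not candidates and not excluded:
            yield sorted(clique)              # nothing can extend it: maximal
            return
        for v in list(candidates):
            # Branch: add v, keep only vertices still adjacent to everything.
            yield from expand(clique | {v},
                              candidates & adj[v],
                              excluded & adj[v])
            candidates.remove(v)              # backtrack: sub-tree for v done
            excluded.add(v)
    yield from expand(set(), set(adj), set())

# Five-vertex graph from the slides.
adj = {'a': {'b', 'c', 'd', 'e'}, 'b': {'a', 'd'}, 'c': {'a', 'd', 'e'},
       'd': {'a', 'b', 'c', 'e'}, 'e': {'a', 'c', 'd'}}
print(sorted(enumerate_maximal_cliques(adj)))
# → [['a', 'b', 'd'], ['a', 'c', 'd', 'e']]
```

The `excluded` set is what makes every reported clique maximal: a clique is emitted only when no already-processed vertex could still extend it.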

SLIDE 9

Parallel Maximal Clique Enumeration

• The search tree is divided into independent sub-trees
• Unexplored sub-trees are represented as candidate paths
• The candidate paths are placed into per-thread work pools

[Figure: four threads, each with a work pool of candidate paths]
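Seeding the pools might look like the following sketch (hypothetical: `seed_work_pools` and the pair encoding of a candidate path are assumptions, not details given in the talk):

```python
from collections import deque

def seed_work_pools(adj, num_threads):
    """Deal one candidate path (an unexplored sub-tree) per root vertex, round-robin."""
    pools = [deque() for _ in range(num_threads)]
    for i, v in enumerate(sorted(adj)):
        # Encode a candidate path as (partial clique, remaining candidates).
        pools[i % num_threads].append(({v}, set(adj[v])))
    return pools

adj = {'a': {'b', 'd'}, 'b': {'a', 'd'}, 'c': {'d'}, 'd': {'a', 'b', 'c'}}
pools = seed_work_pools(adj, num_threads=2)
# pool 0 holds the sub-trees rooted at a and c; pool 1 holds b and d
```

The compact (clique, candidates) pair is what makes a unit of work cheap to move between threads or processes later.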

SLIDE 10

Load Balancing

The work pools can become unbalanced over time

[Figure: unbalanced work pools: Thread 2 holds six candidate paths while Thread 1 holds one]

Dynamic load balancing through work stealing

SLIDE 11

Two levels of load balancing

Thread level

• Used when one thread of a process becomes idle
• Balances work within a single process
• Each thread acts on its own to steal work from other threads
• Locks are used to prevent race conditions

Process level

• Used when all threads of a process become idle
• Local master thread sends a request to another process
• Remote master thread responds to the request
• Master thread must poll for incoming requests while performing the main computation
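Thread-level stealing can be sketched with one lock per pool, roughly as below. This is a hedged Python illustration; the actual code was presumably compiled multithreaded code, and `WorkPool` and `steal` are invented names.

```python
import threading
from collections import deque

class WorkPool:
    """A per-thread pool of candidate paths, guarded by a lock."""
    def __init__(self):
        self.items = deque()
        self.lock = threading.Lock()      # prevents owner/thief races

    def push(self, item):
        with self.lock:
            self.items.append(item)

    def pop(self):
        with self.lock:
            return self.items.pop() if self.items else None

def steal(pools, my_id):
    """Called when pool my_id is empty: take one item from any other pool."""
    for i, pool in enumerate(pools):
        if i != my_id:
            item = pool.pop()
            if item is not None:
                return item
    return None                           # every pool is dry

pools = [WorkPool() for _ in range(4)]
pools[2].push("candidate path")
stolen = steal(pools, my_id=0)            # idle thread 0 steals from thread 2
```

Because each thread acts on its own, the lock around each pool is the only coordination needed at this level; only when `steal` returns `None` does the process escalate to process-level balancing.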

SLIDE 12


Load balancing between processes

[Figure: Process 2, out of work, sends a request to Process 1; Process 1 responds with a candidate path]

SLIDE 13

Termination

• Process-level load balancing attempts are made until all processes have been checked
• When no process has work to share, the idle state is entered
• To synchronize globally, an idle notification is sent to each process
• When all processes are idle, the job can terminate
• 2(N-1)² messages are required for termination

SLIDE 14

Adjacency Test – Linear List

An important MCE operation is testing two vertices for adjacency. The graph representation uses a vertex adjacency list:

• Each vertex has a list of adjacent vertices
• An adjacency test requires a list traversal
• A linked list is easy to build, but slow to search
• A linear list (array) is faster to search
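For a linear list kept sorted, the adjacency test can use binary search rather than a full traversal. A small Python sketch (the slides do not specify sorted arrays or binary search; that is an assumption here):

```python
import bisect

# Sorted neighbor arrays for the five-vertex example graph.
adjacency = {'a': ['b', 'c', 'd', 'e'], 'b': ['a', 'd'], 'c': ['a', 'd', 'e'],
             'd': ['a', 'b', 'c', 'e'], 'e': ['a', 'c', 'd']}

def is_adjacent_array(u, v):
    """Binary search for v in u's sorted neighbor array: O(log degree) per test.
    A linked list would force an O(degree) pointer chase instead."""
    neighbors = adjacency[u]
    i = bisect.bisect_left(neighbors, v)
    return i < len(neighbors) and neighbors[i] == v
```

The array also wins on locality: neighbors sit in contiguous memory, while linked-list nodes scatter across the heap.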

[Figure: per-vertex adjacency lists for the five-vertex graph]
SLIDE 15

Adjacency Test – Bit Matrix

• Adjacency bit matrix has a fast, constant-time lookup
• Memory requirement is N²


    a   b   c   d   e
a   •   1   1   1   1
b   1   •       1
c   1       •   1   1
d   1   1   1   •   1
e   1       1   1   •
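A bit-matrix lookup reduces the test to one bit operation. Here is a Python sketch using one integer per row as a bit vector; a production implementation would more likely pack rows into fixed-width machine words, and the names below are illustrative:

```python
vertices = ['a', 'b', 'c', 'd', 'e']
index = {v: i for i, v in enumerate(vertices)}
edges = [('a', 'b'), ('a', 'c'), ('a', 'd'), ('a', 'e'),
         ('b', 'd'), ('c', 'd'), ('c', 'e'), ('d', 'e')]

rows = [0] * len(vertices)                # rows[i] = matrix row i packed into bits
for u, v in edges:
    rows[index[u]] |= 1 << index[v]       # undirected: set both directions
    rows[index[v]] |= 1 << index[u]

def is_adjacent_bits(u, v):
    """Constant-time test: read bit v of row u."""
    return (rows[index[u]] >> index[v]) & 1 == 1
```

The N² bits of storage buy an O(1) test with no probing at all, which is why this wins the throughput comparison below.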

SLIDE 16

Adjacency Test – Hash Table

• Adjacency hash table has a fast, constant-time lookup (but not as fast as the bit matrix)
• Memory requirement is cN (2N in this example)
• Data structure is a sparse linear list, but access is direct through key hashing


[Figure: adjacency entries stored in a sparse table of size 2N, located by hashing the vertex key]
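The "sparse linear list accessed through key hashing" can be sketched as an open-addressed table of edge keys sized at roughly twice the entry count. Open addressing with linear probing is an assumption here; the slide only says access is direct through key hashing.

```python
TABLE_SIZE = 32                            # roughly 2x the number of entries
table = [None] * TABLE_SIZE

def _slot(u, v):
    return hash((u, v)) % TABLE_SIZE       # home slot from the key hash

def insert_edge(u, v):
    for key in ((u, v), (v, u)):           # undirected: store both directions
        i = _slot(*key)
        while table[i] is not None and table[i] != key:
            i = (i + 1) % TABLE_SIZE       # linear probe past collisions
        table[i] = key

def is_adjacent_hash(u, v):
    i = _slot(u, v)
    while table[i] is not None:            # an empty slot ends the probe chain
        if table[i] == (u, v):
            return True
        i = (i + 1) % TABLE_SIZE
    return False

for u, v in [('a', 'b'), ('a', 'c'), ('a', 'd'), ('a', 'e'),
             ('b', 'd'), ('c', 'd'), ('c', 'e'), ('d', 'e')]:
    insert_edge(u, v)
```

Lookup is expected O(1) at this load factor, but the occasional probe chain is why it trails the bit matrix, while using cN rather than N² memory.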

SLIDE 17

Adjacency Test Performance Comparison

[Chart: adjacency-test throughput in cliques/second (50,000 to 300,000) for Linear List, Hash Table, and Bit Matrix]

SLIDE 18

SMP Versus DMP Programming

[Chart: run time in seconds (64.0 to 64.8) for 1 process with 8 threads; 2 processes with 4 threads each; 4 processes with 2 threads each; 8 processes with 1 thread each]

SLIDE 19

Parallel Scaling on quad-core Cray XT4

• At 2048 processes, compute time is 2.1 seconds
• Overhead due to message passing is 0.43 seconds
• Graph contains 3472 vertices; 2.6 billion maximal cliques found

[Chart: speedup versus processes from 1 to 2048, Ideal and pDFS curves]

SLIDE 20

Conclusion

Explicit decomposition at the thread level enabled easier implementation of MPI

• Independent work already identified
• Compact representation of units of work

Additional work

• Improved load balancing by grouping processes
• Parallel I/O optimization

SLIDE 21

Conclusion

Research group members

• Nagiza F. Samatova, North Carolina State University and Oak Ridge National Laboratory
• Matthew Schmidt, North Carolina State University and Oak Ridge National Laboratory
• Byung-Hoon Park, Oak Ridge National Laboratory
• Kevin Thomas, Cray Inc.

Thank you! Questions?
