SLIDE 1

An Adaptive Parallel Algorithm for Computing Connectivity

Chirag Jain, Patrick Flick, Tony Pan, Oded Green, Srinivas Aluru


SIAM Workshop on Combinatorial Scientific Computing (CSC16), October 10, 2016

SLIDE 2

Connected Components

• Finding connected components is at the heart of many graph applications.
• Sequentially, we have linear-time O(|V| + |E|) solutions:
  • Union-find (a minimal sketch follows below)
  • BFS / DFS

[Figure: example graph G(V, E)]
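A minimal sequential sketch of the union-find option, assuming an edge-list input; the struct and function names here are illustrative, not from the talk.

```cpp
#include <cstdint>
#include <numeric>
#include <utility>
#include <vector>

// Union-find with path halving and union by size; near-linear total
// time O(|E| alpha(|V|)) over all edges.
struct UnionFind {
    std::vector<int64_t> parent, size;
    explicit UnionFind(int64_t n) : parent(n), size(n, 1) {
        std::iota(parent.begin(), parent.end(), 0);  // every vertex is its own root
    }
    int64_t find(int64_t x) {
        while (parent[x] != x) {
            parent[x] = parent[parent[x]];  // path halving
            x = parent[x];
        }
        return x;
    }
    void unite(int64_t a, int64_t b) {
        a = find(a); b = find(b);
        if (a == b) return;
        if (size[a] < size[b]) std::swap(a, b);
        parent[b] = a;  // attach the smaller tree under the larger one
        size[a] += size[b];
    }
};

// Count connected components of G(V, E): one unite() per edge, then count roots.
int64_t countComponents(int64_t n,
                        const std::vector<std::pair<int64_t, int64_t>>& edges) {
    UnionFind uf(n);
    for (const auto& e : edges) uf.unite(e.first, e.second);
    int64_t components = 0;
    for (int64_t v = 0; v < n; ++v)
        if (uf.find(v) == v) ++components;
    return components;
}
```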

SLIDE 3

Scaling to Large Graphs

• Sizes of graph datasets continue to grow in multiple scientific domains.
  • Bioinformatics: metagenomic de Bruijn graphs, e.g., Iowa Prairie (3.3B reads), JGI
  • Social networks, the WWW
• We need a method that scales to graphs with billions or trillions of edges, irrespective of graph topology.

Sequencing machines generate ~10^9 DNA reads in one day; >10^9 content uploads occur in one day.

SLIDE 4

Background

• A. Parallel connectivity algorithms
  • 1. Parallel BFS
  • 2. Shiloach-Vishkin PRAM algorithm (SV)
• B. Recent prior work

Buluç and Madduri, "Parallel breadth-first search …" SC 2011
Beamer et al., "Distributed memory breadth-first search revisited …" IPDPSW 2013

SLIDE 5

Background

• A. Parallel connectivity algorithms
  • 1. Parallel BFS
  • 2. Shiloach-Vishkin PRAM algorithm (SV)
• B. Recent prior work

Shiloach and Vishkin, "An O(log n) parallel connectivity algorithm" 1982

SLIDE 6

Background

• A. Parallel connectivity algorithms
  • 1. Parallel BFS
  • 2. Shiloach-Vishkin PRAM algorithm (SV)
• B. Recent prior work

Label Propagation: O(|V|) iterations → O(|E| · |V|) work
Shiloach-Vishkin: pointer jumping for faster convergence; O(log |V|) iterations → O(|E| log |V|) work (sketch below)

Shiloach and Vishkin, "An O(log n) parallel connectivity algorithm" 1982
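For reference, a minimal sequential sketch of the hooking + pointer-jumping scheme behind SV; in the PRAM formulation the two inner loops run in parallel, which is where the O(log |V|) iteration bound comes from. Names are illustrative.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Sketch of Shiloach-Vishkin-style connectivity: alternate "hooking"
// (pointing the root of the larger label at the smaller label) with
// pointer jumping (shortcutting label chains) until nothing changes.
std::vector<int64_t> shiloachVishkin(
    int64_t n, const std::vector<std::pair<int64_t, int64_t>>& edges) {
    std::vector<int64_t> label(n);
    for (int64_t v = 0; v < n; ++v) label[v] = v;  // one partition per vertex
    bool changed = true;
    while (changed) {
        changed = false;
        // Hooking: for every edge, attach the larger root label to the smaller.
        for (const auto& e : edges) {
            int64_t lu = label[e.first], lv = label[e.second];
            if (lu < lv && lv == label[lv]) { label[lv] = lu; changed = true; }
            else if (lv < lu && lu == label[lu]) { label[lu] = lv; changed = true; }
        }
        // Pointer jumping: shortcut every label chain toward its root.
        for (int64_t v = 0; v < n; ++v)
            while (label[v] != label[label[v]])
                label[v] = label[label[v]];
    }
    return label;  // label[v] = smallest vertex id in v's component
}
```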

SLIDE 7

Background

• A. Parallel connectivity algorithms
  • 1. Parallel BFS
  • 2. Shiloach-Vishkin PRAM algorithm (SV)
• B. Recent prior work: the Multistep algorithm, which runs one parallel BFS iteration followed by parallel label propagation; part of popular graph analysis frameworks (GraphX, PowerLyra, PowerGraph). A sequential sketch follows below.

[Figure: example graph G(V, E)]

Slota et al., "A Case Study of Complex Graph Analysis …" IPDPS 2016
Slota et al., "BFS and coloring-based parallel …" IPDPS 2014
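A rough single-threaded sketch of the Multistep idea, assuming an adjacency-list input; in the actual algorithm both phases run in parallel.

```cpp
#include <cstdint>
#include <queue>
#include <vector>

// Rough sketch of Multistep: one BFS sweep labels the (presumed) giant
// component, then min-label propagation resolves whatever is left.
std::vector<int64_t> multistep(const std::vector<std::vector<int64_t>>& adj) {
    const int64_t n = static_cast<int64_t>(adj.size());
    std::vector<int64_t> label(n);
    for (int64_t v = 0; v < n; ++v) label[v] = v;
    if (n == 0) return label;
    // Phase 1: BFS from vertex 0 (ideally a highest-degree vertex).
    std::vector<bool> visited(n, false);
    std::queue<int64_t> q;
    visited[0] = true;
    q.push(0);
    while (!q.empty()) {
        int64_t u = q.front(); q.pop();
        for (int64_t w : adj[u])
            if (!visited[w]) { visited[w] = true; label[w] = label[0]; q.push(w); }
    }
    // Phase 2: label propagation on the untouched vertices until stable.
    bool changed = true;
    while (changed) {
        changed = false;
        for (int64_t v = 0; v < n; ++v) {
            if (visited[v]) continue;
            for (int64_t w : adj[v])
                if (label[w] < label[v]) { label[v] = label[w]; changed = true; }
        }
    }
    return label;
}
```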

SLIDE 8

Contributions

• 1. A novel edge-based adaptation of the Shiloach-Vishkin algorithm for distributed-memory parallel systems.
• 2. A fast heuristic to guide algorithm selection at runtime.

[Figure: given G(V, E), the method chooses at runtime between (1) Parallel SV and (2) Parallel BFS]

Flick et al., "A parallel connectivity algorithm …" SC 2015

SLIDE 9

Parallel SV Algorithm

• Initialization:
  • We work with an array of tuples (call it A) that keeps the partition id of each vertex (a sketch of one possible layout follows below).
  • O(|V|) partitions at the beginning.
  • Size of A: O(|V| + |E|).

[Figure: tuple array A with a 'current partition id' layer over a 'vertex ids' layer, illustrated for a vertex u and its neighbors v1, v2]
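A rough sketch of how such a tuple array might be initialized; the paper's exact tuple contents may differ, and all names here are illustrative.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical tuple layout for the array A described on this slide:
// one entry per vertex plus one per edge endpoint, so |A| = O(|V| + |E|).
struct Tuple {
    int64_t partition;  // current partition id (initially the vertex's own id)
    int64_t vertex;     // vertex id this tuple belongs to
};

// Initialization sketch: every vertex starts in its own partition, and
// each edge (u, v) contributes tuples linking the two endpoints.
std::vector<Tuple> initTuples(
    int64_t n, const std::vector<std::pair<int64_t, int64_t>>& edges) {
    std::vector<Tuple> A;
    A.reserve(n + 2 * edges.size());
    for (int64_t v = 0; v < n; ++v) A.push_back({v, v});
    for (const auto& e : edges) {
        A.push_back({e.first, e.second});  // u's partition id attached to vertex v
        A.push_back({e.second, e.first});  // v's partition id attached to vertex u
    }
    return A;
}
```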

SLIDE 11

Parallel SV Algorithm

• Which partition ids is vertex u a member of?
  • Sort A by the 'vertex id' layer.

[Figure: after sorting by vertex id, all tuples of vertex u become contiguous]

SLIDE 12

Parallel SV Algorithm

• Which partition ids is vertex u a member of?
  • Sort A by the 'vertex id' layer.
• Which vertices are members of a partition?
  • Sort A by the 'partition id' layer (comparator sketches below).

[Figure: after sorting by partition id, all member vertices u, v, w of a partition become contiguous]
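Both queries reduce to sorting the same array on a different layer. A sequential sketch, with std::sort standing in for the distributed sample sort used in the actual implementation (see SLIDE 14); the Tuple layout is the hypothetical one sketched after SLIDE 9.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct Tuple { int64_t partition, vertex; };  // hypothetical layout from SLIDE 9

// Sorting A by the 'vertex id' layer groups every tuple of a vertex u
// together, exposing all partition ids u currently belongs to.
void sortByVertex(std::vector<Tuple>& A) {
    std::sort(A.begin(), A.end(),
              [](const Tuple& a, const Tuple& b) { return a.vertex < b.vertex; });
}

// Sorting A by the 'partition id' layer groups every member vertex of a
// partition together, so the whole partition can be relabeled at once.
void sortByPartition(std::vector<Tuple>& A) {
    std::sort(A.begin(), A.end(),
              [](const Tuple& a, const Tuple& b) { return a.partition < b.partition; });
}
```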

SLIDE 14

Parallel SV Algorithm

• In our implementation, we use parallel sample sort.
• Custom reduction operations efficiently compute minimums (a sketch of one relabeling pass follows below).
• Additional details:
  • pointer jumping
  • early convergence detection for small components
  • load balancing
• Runtime: see our preprint.
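Putting the pieces together, a sequential sketch of one relabeling pass over A; in the real implementation the sort is a parallel sample sort and the per-vertex minimum is computed across processor boundaries with a custom MPI reduction. This is an illustration, not the paper's code.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct Tuple { int64_t partition, vertex; };  // hypothetical layout from SLIDE 9

// One relabeling pass: after sorting by vertex, each vertex adopts the
// minimum partition id among its tuples, written back to all of them.
// Returns true if any label changed (so the caller keeps iterating,
// with pointer jumping, until convergence).
bool relabelPass(std::vector<Tuple>& A) {
    std::sort(A.begin(), A.end(),
              [](const Tuple& a, const Tuple& b) { return a.vertex < b.vertex; });
    bool changed = false;
    for (size_t i = 0; i < A.size();) {
        size_t j = i;
        int64_t mn = A[i].partition;
        while (j < A.size() && A[j].vertex == A[i].vertex)
            mn = std::min(mn, A[j++].partition);  // min over the vertex's group
        for (size_t k = i; k < j; ++k)
            if (A[k].partition != mn) { A[k].partition = mn; changed = true; }
        i = j;
    }
    return changed;
}
```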

SLIDE 15

Contributions (recap)

• 1. A novel edge-based adaptation of the Shiloach-Vishkin algorithm for distributed-memory parallel systems.
• 2. A fast heuristic to guide algorithm selection at runtime.

[Figure: given G(V, E), the method chooses at runtime between (1) Parallel SV and (2) Parallel BFS]

Flick et al., "A parallel connectivity algorithm …" SC 2015

SLIDE 16

Dynamic Hybrid Method

• Parallel BFS is close to work-efficient for a giant small-world graph component.
• Efficiency is lost when the graph has:
  • a large number of small components
  • a component with a large diameter
• How do we decide which algorithm to choose at runtime?

SLIDE 17

Dynamic Hybrid Method

1. Compute the degree distribution of the input graph.
2. Does the curve fit a power-law distribution?
   • Yes → run 1 parallel BFS iteration (to peel off the giant component).
   • No → skip the BFS step.
3. Run Parallel SV on the remaining graph.

(A control-flow sketch follows below.)
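The flowchart rendered as control flow. All helpers below (degreeDistribution, fitsPowerLaw, runOneBfsIteration, runParallelSV) are hypothetical stand-ins for the corresponding components, not the paper's API.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical building blocks standing in for the paper's components.
struct Graph { int64_t numVertices; std::vector<std::pair<int64_t, int64_t>> edges; };
std::vector<int64_t> degreeDistribution(const Graph& g);    // histogram of degrees
bool fitsPowerLaw(const std::vector<int64_t>& histogram);   // goodness-of-fit check
void runOneBfsIteration(Graph& g, std::vector<int64_t>& labels);  // peels giant component
void runParallelSV(const Graph& g, std::vector<int64_t>& labels); // SV on the rest

// Control-flow sketch of the flowchart on this slide.
std::vector<int64_t> adaptiveConnectivity(Graph g) {
    std::vector<int64_t> labels(g.numVertices);
    if (fitsPowerLaw(degreeDistribution(g))) {
        // Small-world topology suspected: one BFS iteration removes the
        // giant component cheaply before SV handles the remainder.
        runOneBfsIteration(g, labels);
    }
    runParallelSV(g, labels);
    return labels;
}
```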

SLIDE 18

Experimental Setup

• Software: C++14, MPI, CombBLAS library for parallel BFS
• Hardware: Cray XC30 (Edison) at Lawrence Berkeley National Laboratory
  • 5,576 nodes, each with 2 x 12-core Intel Ivy Bridge processors and 64 GB RAM
  • 1 MPI process per physical core
• Timing:
  • Exclude graph construction and I/O time.
  • Profiling starts after the block-distributed edge list is in memory.

Buluç and Gilbert, "The Combinatorial BLAS: Design …" IJHPCA 2011

SLIDE 19

Datasets

[Table: datasets used in the evaluation, grouped into small-world graphs, a large-diameter graph, and graphs with a large number of components]

SLIDE 23

Dynamic Approach

[Figure: time of the dynamic approach vs. the static (opposite) choice at the "Run BFS?" decision, using 2K cores; speedup over the opposite choice per dataset: M1 1.2x, M2 0.9x, M3 1.2x, G1 4.1x, G2 3.7x, G3 4.7x, K1 3.6x, K2 4.0x]

SLIDE 24

Dynamic Approach

[Figure: proportion of time spent in the "Run BFS?" prediction step, using 2K cores]

SLIDE 25

Strong Scalability

• Maximum speedup of ~8x using 4,096 cores relative to 256 cores (ideal: 16x).
• A sorting benchmark with 2B integers achieves an 8.06x speedup as well.

[Figure: time and speedup vs. number of cores (256 to 4,096, log scale) for datasets G1, G2, G3, K1, M1, M2; timings shown for the largest graph, M4]

SLIDE 26

v/s Multistep Method

[Figure: time of our method vs. the Multistep method per dataset, with our speedup and the graph diameter: M1 2.1x (diameter 4K), M2 1.1x (4K), M3 2.7x (2K), G1 24x (25K), G2 0.9x (9), G3 1.1x (9), K1 1.1x (16), K2 1.9x (17)]

SLIDE 27

v/s Best Sequential Method

• Performance comparison against Rem's algorithm (based on union-find; sketch below).
• Using small graphs that fit in a single node (64 GB RAM).

E. W. Dijkstra, A Discipline of Programming, 1976
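For reference, a minimal sketch of Rem's union-find with splicing, the sequential baseline above; this is a textbook rendering (after Dijkstra, 1976), not the exact code used in the comparison.

```cpp
#include <cstdint>
#include <vector>

// Rem's union-find with the splicing technique: parent indices are kept
// non-decreasing along every chain, and each step redirects the smaller
// parent toward the larger one while climbing.
struct RemUnionFind {
    std::vector<int64_t> p;
    explicit RemUnionFind(int64_t n) : p(n) {
        for (int64_t v = 0; v < n; ++v) p[v] = v;  // every vertex is its own root
    }
    // Unites the sets containing u and v (no-op if already connected).
    void unite(int64_t u, int64_t v) {
        while (p[u] != p[v]) {
            if (p[u] < p[v]) {
                if (u == p[u]) { p[u] = p[v]; return; }  // u is a root: hook it
                int64_t z = p[u]; p[u] = p[v]; u = z;    // splice, then climb
            } else {
                if (v == p[v]) { p[v] = p[u]; return; }
                int64_t z = p[v]; p[v] = p[u]; v = z;
            }
        }
    }
};
```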

SLIDE 28

Conclusions

• 1. An efficient distributed-memory parallel connectivity algorithm based on the Shiloach-Vishkin approach.
• 2. A heuristic to guide algorithm selection at runtime.
• 3. Efficient as well as generic; scales on a variety of large graphs.
• 4. Significant performance gains over the previous state-of-the-art, particularly for large-diameter graphs.

SLIDE 29

Thank you!

arxiv.org/abs/1607.06156
cjain@gatech.edu

Reproducibility Initiative Award

github.com/ParBLiSS/parconnect