

SLIDE 1

An Efficient Connected Components Algorithm for Massively-Parallel Devices

Jayadharini Jaiganesh & Martin Burtscher
Department of Computer Science

SLIDE 2

Connected Components

▪ A connected component C is a subset of vertices such that:
  ▪ All vertices in C are reachable from any vertex in C
  ▪ No edges exist between vertices belonging to different components
▪ Applications
  ▪ Navigation
  ▪ Medicine - Cancer and tumor detection
  ▪ Biochemistry
    ▪ Protein study
    ▪ Drug discovery

SLIDE 3

PRIOR WORK

SLIDE 4

Standard CC Algorithm

▪ Label Propagation
  ▪ Mark each vertex with a unique label
  ▪ Propagate vertex labels through edges
  ▪ Repeat until all vertices in the same component have the same label

[Figure: label propagation]
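The label propagation steps above can be sketched sequentially as follows. This is a minimal illustration, not code from the presentation: the function name `label_propagation`, the edge-list input, and the choice to propagate the smaller of the two labels are assumptions made for the example.

```python
def label_propagation(num_vertices, edges):
    """Return one label per vertex; vertices in the same component share a label."""
    labels = list(range(num_vertices))  # mark each vertex with a unique label
    changed = True
    while changed:                      # repeat until no label changes
        changed = False
        for u, v in edges:              # propagate labels through edges
            low = min(labels[u], labels[v])
            if labels[u] != low:
                labels[u] = low
                changed = True
            if labels[v] != low:
                labels[v] = low
                changed = True
    return labels


# Hypothetical 6-vertex graph: {0, 1, 2}, {3, 4}, and the isolated vertex 5.
print(label_propagation(6, [(0, 1), (1, 2), (3, 4)]))  # [0, 0, 0, 3, 3, 5]
```

The number of passes grows with the length of the longest path a label has to travel, which is the cost that the hooking and pointer jumping schemes on the following slides try to reduce.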

SLIDE 5

Parallel CC Algorithm - Shiloach & Vishkin’s

▪ Each vertex is considered a separate tree
  ▪ Component labelled by its own ID
▪ Iterates on two operations
  ▪ Hooking
  ▪ Pointer Jumping

SLIDE 6

Hooking

▪ Works on edges
▪ For each edge (u, v), checks if u and v have the same label
▪ If not, links the higher label to the lower label

[Figure: hooking]

SLIDE 7

Pointer Jumping

▪ Works on vertices
▪ Replaces a vertex’s label with its parent’s label
▪ Reduces depth of tree by one

[Figure: pointer jumping]
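One way to read the hooking and pointer jumping steps as code is sketched below. It is a simplified sequential rendering, not the Shiloach-Vishkin algorithm as published: the names `hook`, `pointer_jump`, and `connected_components`, and the guard that only ever lowers a parent, are assumptions made for this example.

```python
def hook(parent, edges):
    """For each edge whose endpoints have different labels, link the higher label to the lower."""
    changed = False
    for u, v in edges:
        pu, pv = parent[u], parent[v]
        if pu != pv:
            hi, lo = max(pu, pv), min(pu, pv)
            if parent[hi] > lo:        # assumption: a parent is only ever lowered
                parent[hi] = lo
                changed = True
    return changed


def pointer_jump(parent):
    """Replace each vertex's label with the label of its parent."""
    changed = False
    for v in range(len(parent)):
        grandparent = parent[parent[v]]
        if grandparent != parent[v]:
            parent[v] = grandparent
            changed = True
    return changed


def connected_components(num_vertices, edges):
    parent = list(range(num_vertices))  # each vertex starts as its own tree
    while True:
        hooked = hook(parent, edges)
        jumped = pointer_jump(parent)
        if not hooked and not jumped:   # nothing left to link or shorten
            return parent


# Hypothetical graph with components {0, 1, 2}, {3}, {4, 5, 6}.
print(connected_components(7, [(5, 6), (4, 5), (1, 2), (0, 1)]))  # [0, 0, 0, 3, 4, 4, 4]
```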

SLIDE 8

Parallel CC Algorithm - Soman’s

▪ A variant of Shiloach-Vishkin’s algorithm
▪ Uses Multiple Pointer Jumping
  ▪ Iteratively performs Pointer Jumping
  ▪ Converts multi-level tree to a single-level tree (star)
  ▪ Reduces tree’s height to one

[Figure: multiple pointer jumping]
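Multiple pointer jumping can be illustrated by repeating a jump step until every tree is a star. The sketch below is an assumption-laden illustration rather than Soman's code: each round reads from a snapshot of the previous round to mimic all vertices jumping at once, and the function name and the returned round count are invented for the example.

```python
def multiple_pointer_jumping(parent):
    """Repeat pointer jumping until every vertex points directly at its root."""
    rounds = 0
    while True:
        # Every vertex reads the same snapshot, as if all vertices jumped in parallel.
        jumped = [parent[parent[v]] for v in range(len(parent))]
        if jumped == parent:            # all trees have height one
            return parent, rounds
        parent = jumped
        rounds += 1


# A chain 4 -> 3 -> 2 -> 1 -> 0 collapses into a star rooted at 0.
print(multiple_pointer_jumping([0, 0, 1, 2, 3]))  # ([0, 0, 0, 0, 0], 2)
```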

SLIDE 9

Parallel CC Algorithm - Groute

▪ Variant of Soman’s work
▪ Comprises Atomic Hooking and Multiple Pointer Jumping
  ▪ Locks component ID vertex until hooking succeeds
  ▪ No overriding by concurrent hooking operations
▪ Splits graph into (2*|E|)/|V| edge list segments
  ▪ Enables intermediate pointer jumping
  ▪ Reduces operations in the next segment’s hooking

SLIDE 10

ECL-CC: OUR ALGORITHM

SLIDE 11

Our Solution - ECL-CC Algorithm

▪ Like previous work, it chooses the minimum vertex ID in each component as the component ID to guarantee uniqueness
▪ Comprises three main functions
  ▪ Init, Compute, and Flatten
▪ Init function
  ▪ Initializes each vertex’s label with a smaller neighbor ID if possible

SLIDE 12

Our Solution - ECL-CC Algorithm (cont.)

▪ Compute function
  ▪ Processes each edge of a vertex so that both ends of the edge have the same component ID
  ▪ Makes sure that each edge is considered in only one direction
  ▪ Employs Intermediate Pointer jumping
▪ Flatten function
  ▪ A form of Multiple Pointer jumping

[Figure: intermediate pointer jumping]

SLIDE 13

ECL-CC - GPU Implementation

▪ Written in CUDA
▪ Lock-free implementation based on atomic operations
▪ Uses double-sided worklist for load balancing
▪ Uses three compute kernels
  ▪ compute1: |E| ≤ 16, thread-level parallelism
  ▪ compute2: 16 < |E| ≤ 352, warp-level parallelism
  ▪ compute3: |E| > 352, block-level parallelism
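The split across the three kernels can be pictured as sorting vertices into buckets by how many edges each one has. The sketch below is only a sequential illustration under stated assumptions: it reads |E| on the slide as the edge count of a single vertex, and it guesses that "double-sided" means medium vertices are appended from the front of one array while large vertices are appended from the back. The thresholds 16 and 352 come from the slide; every name is hypothetical.

```python
THREAD_MAX_DEGREE = 16   # from the slide: compute1 handles up to 16 edges
WARP_MAX_DEGREE = 352    # from the slide: compute2 handles up to 352 edges


def build_worklist(degrees):
    """Split vertices into thread-, warp-, and block-level work by degree."""
    n = len(degrees)
    worklist = [None] * n        # one array shared by both ends
    front, back = 0, n - 1
    low = []                     # handled directly, never placed on the worklist
    for v, degree in enumerate(degrees):
        if degree <= THREAD_MAX_DEGREE:
            low.append(v)
        elif degree <= WARP_MAX_DEGREE:
            worklist[front] = v  # medium-degree vertices fill from the front
            front += 1
        else:
            worklist[back] = v   # high-degree vertices fill from the back
            back -= 1
    return low, worklist[:front], worklist[back + 1:]


low, medium, high = build_worklist([3, 16, 17, 352, 353, 1000])
print(low, medium, high)  # [0, 1] [2, 3] [5, 4]
```

Because the two ends grow toward each other, one array of |V| entries is enough no matter how the vertices divide between the medium and high buckets.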

SLIDE 14

Our Solution - ECL-CCaf Algorithm

▪ Atomic operations
  ▪ Slower than atomic-free operations
  ▪ Potential bottleneck for future massively parallel devices
▪ ECL-CCaf - Synchronous atomic-free version of ECL-CC
▪ Uses same three functions - Init, Compute, and Flatten
▪ Repeatedly calls Compute to avoid data races

SLIDE 15

EVALUATION METHODOLOGY

SLIDE 16

Machines - GPU

▪ NVIDIA GeForce GTX Titan X
▪ NVIDIA Tesla K40

                   Titan X    K40
Cores              3072       2880
Global Memory      12 GB      12 GB
Clock Frequency    1.1 GHz    745 MHz

SLIDE 17

Machine - CPU

▪ Machine 1
  ▪ Intel Xeon E5-2687W
  ▪ Hyperthreading

                   Machine 1
Sockets            2
Cores              10
Clock Frequency    3.1 GHz

SLIDE 18

Input Graphs

▪ Eighteen graphs
  ▪ 65K to 18M vertices
  ▪ 387K to 523M edges
▪ Graph types
  ▪ Roadmaps
  ▪ Random graphs
  ▪ Synthetic graphs
  ▪ Internet topology graphs
  ▪ Social network graphs
  ▪ Web-links graphs

SLIDE 19

RESULTS: ECL-CCaf

SLIDE 20

Slowdown Relative to ECL-CCaf - Titan X

▪ ECL-CCaf is the fastest on 6 graphs; Groute is 1.04x faster on average

SLIDE 21

Slowdown Relative to ECL-CCaf - K40

▪ ECL-CCaf is the fastest on 8 graphs; Groute is 1.2x faster on average

SLIDE 22

RESULTS: ECL-CC

SLIDE 23

Slowdown Relative to ECL-CC - Titan X

▪ ECL-CC is the fastest on 16 graphs and at least 1.8x faster on average

SLIDE 24

Slowdown Relative to ECL-CC - K40

▪ ECL-CC is the fastest on 14 graphs and at least 1.6x faster on average

SLIDE 25

Geometric-Mean Slowdown Across Systems

▪ ECL-CC is the fastest of all compared codes across the different platforms
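For readers unfamiliar with the metric, a geometric-mean slowdown averages per-graph runtime ratios multiplicatively, so one very slow graph does not dominate the summary. The sketch below shows the arithmetic only; the function name is an assumption and the runtimes are made-up placeholders, not measurements from the presentation.

```python
import math


def geometric_mean_slowdown(runtimes, baseline_runtimes):
    """Geometric mean of runtime / baseline_runtime over all graphs."""
    ratios = [t / b for t, b in zip(runtimes, baseline_runtimes)]
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Placeholder numbers: a code that is 2x and 8x slower than the baseline on two graphs.
slowdown = geometric_mean_slowdown([2.0, 8.0], [1.0, 1.0])
print(round(slowdown, 6))  # 4.0
```

The arithmetic mean of the same two ratios would be 5.0; the geometric mean of 4.0 treats the ratios symmetrically, which is why it is the usual choice for summarizing relative runtimes.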

SLIDE 26

ALGORITHM ANALYSIS

SLIDE 27

Init Versions

▪ Version 1
  ▪ Label is set to the vertex’s own ID
▪ Version 2
  ▪ Label is set to the ID of the vertex’s minimum neighbor
▪ Version 3
  ▪ Label is set to the ID of the first smaller neighbor
  ▪ Avoids traversing all neighbors
  ▪ Label is set to a better value
  ▪ Used in ECL-CC algorithm
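The three initialization variants can be compared side by side on a small graph. The sketch below assumes the graph is stored as two arrays, `neighborindex` and `neighborlist`, with `neighborindex` holding |V|+1 offsets; the function names are hypothetical, and version 2 is read here as "minimum over the vertex and its neighbors", which is an interpretation of the slide.

```python
def init_own_id(neighborindex, neighborlist):
    """Version 1: every vertex starts with its own ID."""
    return list(range(len(neighborindex) - 1))


def init_min_neighbor(neighborindex, neighborlist):
    """Version 2: scan all neighbors and keep the smallest ID seen."""
    labels = []
    for v in range(len(neighborindex) - 1):
        neighbors = neighborlist[neighborindex[v]:neighborindex[v + 1]]
        labels.append(min([v] + neighbors))
    return labels


def init_first_smaller(neighborindex, neighborlist):
    """Version 3: stop at the first neighbor with a smaller ID."""
    labels = list(range(len(neighborindex) - 1))
    for v in range(len(labels)):
        for i in range(neighborindex[v], neighborindex[v + 1]):
            if neighborlist[i] < v:
                labels[v] = neighborlist[i]
                break                      # avoids traversing the remaining neighbors
    return labels


# Hypothetical graph with edges 0-2, 0-3, 1-3, 2-3.
index = [0, 2, 3, 5, 8]
nbrs = [2, 3, 3, 0, 3, 2, 1, 0]
print(init_own_id(index, nbrs))         # [0, 1, 2, 3]
print(init_min_neighbor(index, nbrs))   # [0, 1, 0, 0]
print(init_first_smaller(index, nbrs))  # [0, 1, 0, 2]
```

Vertex 3 shows the trade-off: version 3 settles for neighbor 2 after one look, while version 2 inspects all three neighbors to find 0.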

SLIDE 28

Slowdown Relative to ECL-CC Init

▪ On average, ECL-CC’s Init (version 3) is 1.4x faster than version 2

SLIDE 29

Pointer Jumping Versions

▪ Version 1 - Multiple Pointer Jumping
▪ Version 2 - Single Pointer Jumping
▪ Version 3 - No Pointer Jumping (returns end of list)
▪ Version 4 - Intermediate Pointer Jumping (novel)
  ▪ Links every node to second-to-next node
  ▪ Reduces list length by a factor of two
  ▪ Used in ECL-CC
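The four variants differ only in which links they rewrite while walking to the end of a list. The sketch below is one possible sequential reading: the function names are hypothetical, and treating "single pointer jumping" as "relink only the starting vertex" is an assumption based on the flatten slide's wording, not something this slide states.

```python
def find_no_jumping(parent, v):
    """Version 3: walk to the end of the list and change nothing."""
    while parent[v] != v:
        v = parent[v]
    return v


def find_single_jumping(parent, v):
    """Version 2 (assumed meaning): link only the starting vertex to the end."""
    root = find_no_jumping(parent, v)
    parent[v] = root
    return root


def find_multiple_jumping(parent, v):
    """Version 1: link every vertex on the path to the end of the list."""
    root = find_no_jumping(parent, v)
    while parent[v] != root:
        following = parent[v]
        parent[v] = root
        v = following
    return root


def find_intermediate_jumping(parent, v):
    """Version 4: link every vertex on the path to its second-to-next vertex."""
    curr = parent[v]
    if curr != v:
        prev, following = v, parent[curr]
        while curr > following:
            parent[prev] = following   # skip over one vertex
            prev, curr = curr, following
            following = parent[curr]
    return curr


chain = [0, 0, 1, 2, 3, 4, 5, 6]       # 7 -> 6 -> ... -> 1 -> 0
halved = chain.copy()
print(find_intermediate_jumping(halved, 7), halved)  # 0 [0, 0, 0, 1, 2, 3, 4, 5]
```

After the intermediate version runs, the path from vertex 7 is 7, 5, 3, 1, 0, so a single pass found the representative and roughly halved the list without a second traversal.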

SLIDE 30

Vertex Chain Length

                      Chain length
No  Graph Name         max   avg
 1  2d-2e20              9   1.4
 2  amazon0601           8   1.3
 3  as-skitter          17   1.0
 4  citationCiteseer    11   1.1
 5  cit-Patents          9   1.0
 6  coPapersDBLP         8   1.0
 7  delaunay_n24        13   1.4
 8  europe_osm         122   4.3
 9  in-2004             31   1.1
10  internet            10   1.5
11  kron_g500-logn21     6   1.0
12  r4-2e23             29   1.3
13  rmat16              10   1.3
14  rmat22               8   1.1
15  soc-livejournal      7   1.0
16  uk-2002             91   1.2
17  USA-NY              43   2.6
18  USA-USA             27   1.6

SLIDE 31

Slowdown Relative to ECL-CC Pointer Jumping

▪ On average, intermediate pointer jumping is 1.2x to 3.6x faster than the other versions

SLIDE 32

Flatten Versions

▪ Version 1 - Intermediate Pointer jumping
  ▪ Links every node to second-to-next node
  ▪ Current node is linked to end of list
  ▪ Reduces list length by a factor of two
▪ Version 2 - Multiple Pointer jumping
  ▪ Links every node to end of list
▪ Version 3 - Pointer jumping
  ▪ Only current node is linked to end of list
  ▪ Used in ECL-CC

SLIDE 33

Slowdown Relative to ECL-CC Flatten

▪ Flatten’s runtime is at least 4x faster on larger graphs (|V| > 15M)
▪ On average, 1.2x faster than version 2

SLIDE 34

SUMMARY

SLIDE 35

Summary

▪ ECL-CC - Asynchronous CC algorithm
  ▪ Uses optimized version of initialization
  ▪ Employs a double-sided worklist & three compute kernels
  ▪ Incorporates Intermediate Pointer jumping
  ▪ Considers each edge in only one direction
  ▪ On average, 1.7x faster than fastest GPU algorithm
▪ ECL-CCaf - Atomic-free and synchronous algorithm
  ▪ Iterates over compute kernels to avoid data races
  ▪ Average performance on par with Groute

SLIDE 36

Thank you ☺

Jayadharini Jaiganesh
Texas State University
jayadharini@txstate.edu

Download link: http://cs.txstate.edu/~burtscher/research/ECL-CC/

SLIDE 37

Algorithm - ECL-CC

▪ procedure: ECL-CC (V, E)
    Init (V, nstat)
    Compute (V, E, nstat)
    Flatten (V, nstat)

▪ procedure: Init (V, nstat)
    nstat ← {0, ..., |V|-1}   // holds the vertex labels
    for each vertex v in V
      nstat[v] ← first neighbor smaller than v, if any

SLIDE 38

▪ procedure: Compute (V, E, nstat)
    for each v in V {
      vstat ← representative (v, nstat)
      for each edge (u, v) in E {
        if (v > u) {
          ostat ← representative (u, nstat)
          if (vstat < ostat)
            nstat[ostat] ← vstat
          else
            nstat[vstat] ← ostat
        }
      }
    }

SLIDE 39

▪ procedure: Representative (v, nstat)
    curr ← nstat[v]
    if (curr != v) {
      prev ← v
      next ← nstat[curr]
      while (curr > next) {
        nstat[prev] ← next
        prev ← curr
        curr ← next
        next ← nstat[curr]
      }
    }
    return curr

SLIDE 40

Flatten Function

▪ A form of pointer jumping
▪ Updates the label of every vertex so that it holds the component ID directly

▪ procedure: Flatten (V, nstat)
    for each vertex v in V {
      vstat ← nstat[v]
      while (vstat > nstat[vstat])
        vstat ← nstat[vstat]
      nstat[v] ← vstat
    }
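Putting Init, Compute, Representative, and Flatten together gives the following sequential sketch. It is not the authors' CUDA implementation: there are no atomics, worklists, or kernels here, the function names are chosen for the example, and the line that moves `vstat` to `ostat` after hooking is an added assumption so that `vstat` stays a root in a purely sequential run.

```python
def representative(v, nstat):
    """Find v's current root, linking each visited vertex to its second-to-next vertex."""
    curr = nstat[v]
    if curr != v:
        prev, following = v, nstat[curr]
        while curr > following:
            nstat[prev] = following
            prev, curr = curr, following
            following = nstat[curr]
    return curr


def ecl_cc(neighborindex, neighborlist):
    n = len(neighborindex) - 1
    # Init: label each vertex with its first smaller neighbor, if any.
    nstat = list(range(n))
    for v in range(n):
        for i in range(neighborindex[v], neighborindex[v + 1]):
            if neighborlist[i] < v:
                nstat[v] = neighborlist[i]
                break
    # Compute: hook the two roots of every edge, visiting each edge from its larger endpoint only.
    for v in range(n):
        vstat = representative(v, nstat)
        for i in range(neighborindex[v], neighborindex[v + 1]):
            u = neighborlist[i]
            if v > u:
                ostat = representative(u, nstat)
                if vstat < ostat:
                    nstat[ostat] = vstat
                elif vstat > ostat:
                    nstat[vstat] = ostat
                    vstat = ostat      # assumption: keep vstat pointing at a root
    # Flatten: make every label the component ID directly.
    for v in range(n):
        vstat = nstat[v]
        while vstat > nstat[vstat]:
            vstat = nstat[vstat]
        nstat[v] = vstat
    return nstat


# Hypothetical graph: edges 0-2, 0-3, 1-3, 2-3, 4-5, and the isolated vertex 6.
index = [0, 2, 3, 5, 8, 9, 10, 10]
nbrs = [2, 3, 3, 0, 3, 2, 1, 0, 5, 4]
print(ecl_cc(index, nbrs))  # [0, 0, 0, 0, 4, 4, 6]
```

Every label ends up as the minimum vertex ID of its component, matching the uniqueness rule stated for ECL-CC earlier in the deck.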

SLIDE 41

Algorithm - ECL-CCaf

▪ procedure: ECL-CCaf (V, E)
    Init (V, nstat)
    reiterate ← 1
    do
      if reiterate
        Compute (V, E, nstat, &reiterate)
      end if
    while (reiterate)
    Flatten (V, nstat)
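Why an atomic-free version has to call Compute repeatedly can be seen in a small simulation. The sketch below is not ECL-CCaf itself, whose kernel details are not given on the slides: it assumes that all hooks in one round read the same labels and that, when several hooks write to the same root, only the last plain store survives. All names are hypothetical.

```python
def find_root(nstat, v):
    while nstat[v] != v:
        v = nstat[v]
    return v


def compute_round(edges, nstat):
    """One synchronous round; returns True if another round is needed."""
    pending = {}
    for u, v in edges:
        ru, rv = find_root(nstat, u), find_root(nstat, v)
        if ru != rv:
            # Plain store: an earlier store to the same root is silently overwritten.
            pending[max(ru, rv)] = min(ru, rv)
    for root, target in pending.items():
        nstat[root] = target
    return bool(pending)


def cc_without_atomics(num_vertices, edges):
    nstat = list(range(num_vertices))
    rounds = 0
    reiterate = True
    while reiterate:                       # repeat until a round changes nothing
        reiterate = compute_round(edges, nstat)
        rounds += 1
    for v in range(num_vertices):          # flatten
        nstat[v] = find_root(nstat, v)
    return nstat, rounds


# Vertex 3 is hooked by three edges at once; only one store per round survives.
print(cc_without_atomics(4, [(0, 3), (1, 3), (2, 3)]))  # ([0, 0, 0, 0], 4)
```

The overwritten hooks are not lost for good: their edges still connect different roots in the next round, so they are retried until a round finds nothing left to link.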

SLIDE 42

Graph Representation

▪ Compressed Adjacency List (two arrays)
  ▪ neighborlist - concatenation of all adjacency lists
  ▪ neighborindex - starting point of each adjacency list

[Figure: graph and its compressed adjacency list]
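The two arrays can be built from an undirected edge list as sketched below. The helper name `build_csr` is hypothetical, and storing one extra entry at the end of `neighborindex` (so that the last list has an explicit end) is a common convention assumed here rather than something the slide specifies.

```python
def build_csr(num_vertices, edges):
    """Return (neighborindex, neighborlist) for an undirected edge list."""
    adjacency = [[] for _ in range(num_vertices)]
    for u, v in edges:
        adjacency[u].append(v)   # store each undirected edge in both directions
        adjacency[v].append(u)
    neighborindex = [0]
    neighborlist = []
    for neighbors in adjacency:
        neighborlist.extend(neighbors)            # concatenate the adjacency lists
        neighborindex.append(len(neighborlist))   # start of the next list
    return neighborindex, neighborlist


neighborindex, neighborlist = build_csr(4, [(0, 1), (0, 2), (2, 3)])
print(neighborindex)  # [0, 2, 3, 5, 6]
print(neighborlist)   # [1, 2, 0, 0, 3, 2]

# The neighbors of vertex v are neighborlist[neighborindex[v]:neighborindex[v + 1]].
print(neighborlist[neighborindex[2]:neighborindex[3]])  # [0, 3]
```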

SLIDE 43

Graph Details

                                                  Vertex degree
No  Graph Name        Vertices (M)  Edges (M)     min     max  avg  No. of CC
 1  2d-2e20                    1.0        4.2       2       4    3          1
 2  amazon0601                 0.4        4.9       1    2752   12          7
 3  as-skitter                 1.7        2.2       1   35455    1        756
 4  citationCiteseer           0.3        2.3       1    1318    8          1
 5  cit-Patents                3.8       33.0       1     793    8      3,627
 6  coPapersDBLP               0.5       30.5       1    3299   56          1
 7  delaunay_n24              16.8      100.7       3      26    5          1
 8  europe_osm                50.9      108.1       1      13    2          1
 9  in-2004                    1.4       27.2           21869   19        134
10  internet                   0.1        0.4       1     151    3          1
11  kron_g500-logn21           2.1      182.1          213904   86    553,159
12  r4-2e23                    8.4       67.1       2      26    7          1
13  rmat16                     0.1        1.0             569   14      3,900
14  rmat22                     4.2       65.7            3687   15    428,640
15  soc-livejournal            4.8       85.7           20333   17      1,876
16  uk-2002                   18.5      523.6          194955   28     38,359
17  USA-NY                     0.3        0.7       1       8    2          1
18  USA-USA                   23.9       57.7       1       9    2          1