GPU-Accelerated Network Centrality (Erik Saule, collaborative work)



SLIDE 1

GPU-Accelerated Network Centrality

Erik Saule, collaborative work with: Ahmet Erdem Sarıyüce (OSU), Kamer Kaya (Sabancı), Ümit V. Çatalyürek (OSU)

University of North Carolina at Charlotte (CS)

GTC 2015

Erik Saule (UNCC) GPU Centrality GTC 2015 1 / 24

SLIDE 2

Outline

1 Introduction
2 Decomposition for GPU
3 An SpMM-based approach
4 Conclusion

SLIDE 3

Centralities - Concept

Answer questions such as

Who controls the flow in a network?
Who is more important?
Who has more influence?
Whose contribution is significant for connections?

Different kinds of graph

road networks, social networks, power grids, mechanical meshes

Applications

Covert networks (e.g., terrorist identification)
Contingency analysis (e.g., weakness/robustness of networks)
Viral marketing (e.g., who will spread the word best)
Traffic analysis
Store locations

SLIDE 4

Centrality Formally

Closeness Centrality

Let G = (V, E) be an unweighted graph with vertex set V and edge set E. Then

cc[v] = Σ_{u∈V} 1/d(v, u),

where d(v, u) is the shortest path length between v and u.

Betweenness Centrality

Let G = (V, E) be an unweighted graph. Let σ_st be the number of shortest paths connecting s and t, and let σ_st(v) be the number of such s-t paths passing through v. Then

bc[v] = Σ_{s≠v≠t∈V} δ_st(v), where δ_st(v) = σ_st(v) / σ_st.

Algorithm

In each case, the best algorithm computes the shortest path graph rooted at each vertex of the graph and extracts the relevant information. The complexity is O(E) per source, O(VE) in total, which makes it computationally expensive.
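As a concrete illustration of the O(VE) scheme for closeness, here is a minimal CPU sketch in Python (an illustrative reimplementation, not the talk's GPU code; the name `closeness` and the adjacency-dict input are assumptions):

```python
from collections import deque

def closeness(adj):
    """Closeness centrality via one BFS per source: O(E) per source, O(VE) total.

    adj: dict mapping each vertex to a list of neighbors (unweighted graph).
    """
    cc = {}
    for s in adj:                          # one O(E) traversal per source
        dist = {s: 0}
        frontier = deque([s])
        while frontier:
            u = frontier.popleft()
            for w in adj[u]:
                if w not in dist:          # first visit = shortest distance
                    dist[w] = dist[u] + 1
                    frontier.append(w)
        # cc[s] = sum of 1/d(s, u) over reachable vertices u != s
        cc[s] = sum(1.0 / d for d in dist.values() if d > 0)
    return cc
```

For the path graph 0–1–2 this gives cc[1] = 1/1 + 1/1 = 2; the GPU question addressed next is how to parallelize these V traversals.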

SLIDE 5

Computing Breadth First Traversal (Centrality)

Top-down (scatter writes)

For each element of the frontier, touch its neighbors. Complexity: O(E). Writes are scattered in memory.

Bottom-up (gather reads)

For each vertex, check whether any of its neighbors is in the frontier. Complexity: O(ED), where D is the diameter of the graph. Writes are performed once, linearly.

Direction-optimizing BFS chooses between the two strategies at each level of a level-synchronous BFS.
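A minimal Python sketch of one level-synchronous step in each direction (function names are illustrative, not from the talk):

```python
def top_down_step(adj, frontier, visited):
    """Scatter: each frontier vertex touches its neighbors.
    Work is proportional to the edges leaving the frontier; writes scatter."""
    nxt = set()
    for u in frontier:
        for w in adj[u]:
            if w not in visited:
                visited.add(w)
                nxt.add(w)
    return nxt

def bottom_up_step(adj, frontier, visited):
    """Gather: every unvisited vertex asks whether some neighbor is in the
    frontier. Over a whole BFS this is O(ED) reads, but each vertex is
    written at most once, linearly."""
    nxt = set()
    for v in adj:
        if v not in visited and any(w in frontier for w in adj[v]):
            visited.add(v)
            nxt.add(v)
    return nxt
```

A direction-optimizing BFS picks whichever step is cheaper at each level, e.g. bottom-up once the frontier is large.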

SLIDE 6

Outline

1 Introduction
2 Decomposition for GPU
3 An SpMM-based approach
4 Conclusion

SLIDE 7

Traditionally ...

Vertex Centric

1 thread : 1 vertex
No graph coalescing
Vector read is not coalesced
No atomics
High divergence (high-degree vertices)


SLIDE 8

Traditionally ...

Edge Centric

1 thread : 1 edge
Graph read is coalesced
Vector read is not coalesced
Many atomics
Little divergence (adjacent threads are likely to work on the same vertex)


SLIDE 9

Virtual vertex decomposition

Virtual Vertex

1 thread : 1 virtual vertex
High-degree vertices are split into multiple "virtual vertices"
No graph coalescing
Vector read is not coalesced
Some atomics
Bounded divergence
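The splitting itself is simple; a minimal Python sketch (the function name and chunking scheme are illustrative assumptions, not the talk's code):

```python
def virtual_vertices(adj, max_deg=4):
    """Split each vertex's adjacency list into chunks of at most max_deg
    edges; each chunk is one 'virtual vertex' handled by one thread.
    Divergence is bounded by max_deg, but several threads may now update
    the same real vertex, hence the atomics. Isolated vertices produce
    no virtual vertex."""
    virtuals = []                             # (real vertex, slice of its neighbors)
    for v, nbrs in adj.items():
        for i in range(0, len(nbrs), max_deg):
            virtuals.append((v, nbrs[i:i + max_deg]))
    return virtuals
```

A vertex of degree 10 with max_deg = 4 thus becomes three virtual vertices of degrees 4, 4, and 2.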


SLIDE 10

Strided virtual vertex decomposition

Virtual Vertex

1 thread : 1 virtual vertex
Some graph coalescing
Vector read is not coalesced
Some atomics
Bounded divergence
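The strided variant changes only how edges are assigned to virtual vertices: copy j of a vertex with k copies takes neighbors j, j+k, j+2k, ..., so consecutive threads read consecutive entries of the adjacency list. A minimal Python sketch of the assignment (illustrative, not the talk's code):

```python
def strided_virtual_vertices(adj, max_deg=4):
    """Strided split: vertex v with degree d gets k = ceil(d / max_deg)
    virtual copies, and copy j takes neighbors j, j+k, j+2k, ...
    Consecutive virtual vertices (threads) then touch consecutive edges,
    which is what yields the partial coalescing of the graph reads."""
    virtuals = []
    for v, nbrs in adj.items():
        if not nbrs:
            continue
        k = -(-len(nbrs) // max_deg)          # ceil(degree / max_deg)
        for j in range(k):
            virtuals.append((v, nbrs[j::k]))  # stride-k slice of the adjacency
    return virtuals
```

Compared with the contiguous split above, the chunks interleave instead of abut, trading per-thread locality for cross-thread coalescing.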


SLIDE 11

Experimental Setting

Instances

Graph        |V|      |E|       Avg|Γ(v)|  Max|Γ(v)|  Diam.
Amazon       403K     4,886K    12.1       2,752      19
Gowalla      196K     1,900K    9.6        14,730     12
Google       855K     8,582K    10.0       6,332      18
NotreDame    325K     2,180K    6.6        10,721     27
WikiTalk     2,388K   9,313K    3.8        100,029    10
Orkut        3,072K   234,370K  76.2       33,313     9
LiveJournal  4,843K   85,691K   17.6       20,333     15

Machines

CPU: 2 Intel Sandybridge EP; GPU: NVIDIA K20

Metric

Traversed Edges Per Second: V·E / time.

SLIDE 12

First results

[figure: speedup with respect to a single CPU thread on each instance, comparing the GPU vertex, GPU edge, GPU virtual, and GPU stride decompositions]

SLIDE 13

Outline

1 Introduction
2 Decomposition for GPU
3 An SpMM-based approach
4 Conclusion

SLIDE 14

No vector coalescing

[figure: vector access patterns for the Vertex, Edge, Virtual, and Coalesced Virtual decompositions]

All the representations give vector coalescing only “if you are lucky”.

SLIDE 15

Simultaneous sources traversal

The problem with the previous methods is that BFS leaves the coalescing of the accesses to the vector up to the structure of the graph.

Multiple sources

All threads of a warp should make similar access patterns. Since a centrality computation performs multiple traversals, process B traversals at once. If a vertex is at the same level of the BFS in multiple traversals, those traversals are processed at the same time. Social networks have most vertices in a few levels.


SLIDE 16

An SpMV-based approach of BFS for Closeness Centrality

A simpler definition of level synchronous BFS

Vertex v is at level ℓ if and only if one of the neighbors of v is at level ℓ − 1 and v is not at any level ℓ′ < ℓ.

Let x^ℓ_i = true if vertex i is part of the frontier at level ℓ.

y^{ℓ+1} is the set of neighbors of level ℓ: y^{ℓ+1}_k = OR_{j∈Γ(k)} x^ℓ_j. ((OR, AND)-SpMV)

Compute the next-level frontier: x^{ℓ+1}_i = y^{ℓ+1}_i & ¬(OR_{ℓ′≤ℓ} x^ℓ′_i).

The contribution of the source to cc[i] is x^ℓ_i / ℓ.

This allows computing Closeness Centrality while encoding the state of 32 traversals in a single int.
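The bit-packed formulation above can be sketched on the CPU in a few lines of Python (`cc_spmm` is an illustrative name; this sketch assumes the 32 sources are vertices 0..min(32, n)−1):

```python
def cc_spmm(adj, n):
    """Run up to 32 BFS traversals at once, one bit per source.
    x[i]: bit b is set iff vertex i is in the frontier of source b's BFS
    at the current level; visited[i] ORs together all earlier frontiers.
    adj: dict {vertex: neighbor list} over vertices 0..n-1."""
    B = min(32, n)
    x = [1 << b if b < B else 0 for b in range(n)]   # source b starts at vertex b
    visited = x[:]
    cc = [0.0] * n
    level = 1
    while any(x):
        # y = (OR, AND)-SpMV: OR together the neighbors' frontier bits
        y = [0] * n
        for i in range(n):
            for j in adj[i]:
                y[i] |= x[j]
        # next frontier keeps only bits not seen at any earlier level
        x = [y[i] & ~visited[i] for i in range(n)]
        for i in range(n):
            visited[i] |= x[i]
            cc[i] += bin(x[i]).count("1") / level    # each set bit adds 1/level
        level += 1
    return cc
```

Each vertex carries one machine word instead of 32 separate frontier flags, so the 32 traversals make identical access patterns, which is what lets the warp accesses coalesce.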

SLIDE 17

Impact on working warps

[figure: number of active warps, normalized, as a function of B (1 to 32), for Amazon, Gowalla, Google, NotreDame, WikiTalk, Orkut, LiveJournal]

Number of active warps necessary for 32 traversals. Small increase in the number of warps.

SLIDE 18

Impact on non simultaneous traversal

[figure: fraction of non-simultaneous traversals as a function of B (1 to 32), for Amazon, Gowalla, Google, NotreDame, WikiTalk, Orkut, LiveJournal]

With B = 4, the 32 traversals of one vertex are distributed in about 40% of 32 warps. Good coalescing.

SLIDE 19

Impact on Runtime

[figure: normalized runtime as a function of B (1 to 128), for Amazon, Gowalla, Google, NotreDame, WikiTalk, Orkut, LiveJournal]

SLIDE 20

Outline

1 Introduction
2 Decomposition for GPU
3 An SpMM-based approach
4 Conclusion

SLIDE 21

On other architecture? Betweenness Centrality

[figure: betweenness centrality throughput (MTEPS) per instance for CPU-SNAP, CPU-Ligra, CPU-BC, GPU-VirBC, and GPU-VirBC-Multi]

The O(DE) algorithms (GPU-*) are unsuitable for NotreDame because of its high diameter. CPU: 2 Intel Sandybridge EP (2×8 cores).

SLIDE 22

On other architecture? Closeness Centrality

[figure: closeness centrality throughput (GTEPS) per instance for CPU-DO, CPU-SpMM, PHI-DO, PHI-SpMM, GPU-VirCC, and GPU-SpMM]

CPU: 2 Intel Sandybridge EP (2×8 cores); PHI: Intel Xeon Phi 5120

SLIDE 23

Conclusion

Centrality

Betweenness and Closeness Centrality are computed using multiple Breadth-First Search traversals.

Graph representation for GPU

Vertex Centric, Edge Centric, Virtual Vertex, Coalesced Virtual Vertex. The decomposition determines parallelism, but also memory access patterns and thread divergence.

Multiple traversals

Centrality requires graph traversals from many different sources. Threads of a warp can be set to process different traversals for the same decomposition. Provided a vertex is used at the same level in multiple traversals, all the memory accesses can be coalesced. This improves performance by up to a factor of 70, and the same idea adapts to CPU architectures for similar effects.

SLIDE 24

Thank you

Other centrality works (with Sarıyüce, Kaya and Çatalyürek)

Compression using graph properties (SDM 2013)
GPU optimization (GPGPU 2013)
Incremental algorithm (BigData 2013)
Distributed-memory incremental framework (Cluster 2013, ParCo 2015)
Regularized memory accesses for CPU, GPU, Xeon Phi (MTAAP 2014, JPDC 2015)

More information

Contact: esaule@uncc.edu
Visit: http://webpages.uncc.edu/~esaule
