SLIDE 1

Coordinating More Than 3 Million CUDA Threads for Social Network Analysis

Adam McLaughlin

SLIDE 2

Applications of interest…

  • Computational biology
  • Social network analysis
  • Urban planning
  • Epidemiology
  • Hardware verification


SLIDE 3

Applications of interest…

  • Computational biology
  • Social network analysis
  • Urban planning
  • Epidemiology
  • Hardware verification
  • Common denominator:

Graph Analysis


SLIDE 4

Challenges in Network Analysis

  • Size
    – Networks cannot be manually inspected
  • Varying structural properties
    – Small-world, scale-free, meshes, road networks
  • Not a one-size-fits-all problem
  • Unpredictable
    – Data-dependent memory access patterns


SLIDE 5

Betweenness Centrality

  • Determines the importance of a vertex in a network
    – Requires the solution of the all-pairs shortest paths (APSP) problem
  • Applications are manifold
  • Computationally demanding
    – O(mn) time complexity for a graph with n vertices and m edges


SLIDE 6

Defining Betweenness Centrality

  • Formally, the BC score of a vertex is defined as:

$$BC(v) = \sum_{s \neq v \neq t} \frac{\sigma_{st}(v)}{\sigma_{st}}$$

  • $\sigma_{st}$ is the number of shortest paths from $s$ to $t$
  • $\sigma_{st}(v)$ is the number of those paths passing through $v$

[Figure: example graph in which $\sigma_{st} = 2$ and $\sigma_{st}(v) = 1$]
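Reading the figure's annotation through the definition: of the two shortest paths between $s$ and $t$, exactly one passes through $v$, so this single pair contributes

    \frac{\sigma_{st}(v)}{\sigma_{st}} = \frac{1}{2}

to $BC(v)$; the full score sums this ratio over every pair of distinct vertices.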

SLIDE 7

Brandes’s Algorithm

  1. Shortest path calculation (downward)
  2. Dependency accumulation (upward)
    – Dependency:

$$\delta_s(v) = \sum_{w \in \mathrm{succ}(v)} \frac{\sigma_{sv}}{\sigma_{sw}} \bigl(1 + \delta_s(w)\bigr)$$

    – Redefine BC scores as:

$$BC(v) = \sum_{s \neq v} \delta_s(v)$$
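For reference before the GPU strategies, a minimal sequential sketch of the two phases over a CSR graph (R row offsets, C adjacencies, both directions stored for an undirected graph); names and layout are illustrative, not the talk's code:

    #include <queue>
    #include <vector>

    // Brandes's algorithm for a single source s, accumulating into bc.
    // Phase 1 (downward): BFS computes depths d and path counts sigma.
    // Phase 2 (upward): walk vertices deepest-first, applying the
    // dependency recurrence, then add delta into the BC scores.
    void brandes_one_source(int s, const std::vector<int>& R,
                            const std::vector<int>& C,
                            std::vector<double>& bc)
    {
        int n = (int)R.size() - 1;
        std::vector<int> d(n, -1), sigma(n, 0), order;
        std::vector<double> delta(n, 0.0);
        std::queue<int> q;
        d[s] = 0; sigma[s] = 1; q.push(s);
        while (!q.empty()) {                        // 1. shortest paths
            int v = q.front(); q.pop(); order.push_back(v);
            for (int k = R[v]; k < R[v + 1]; ++k) {
                int w = C[k];
                if (d[w] < 0) { d[w] = d[v] + 1; q.push(w); }
                if (d[w] == d[v] + 1) sigma[w] += sigma[v];
            }
        }
        for (int i = (int)order.size() - 1; i >= 0; --i) {  // 2. dependencies
            int v = order[i];
            for (int k = R[v]; k < R[v + 1]; ++k) {
                int w = C[k];
                if (d[w] == d[v] + 1)               // w is a successor of v
                    delta[v] += (double)sigma[v] / sigma[w] * (1.0 + delta[w]);
            }
            if (v != s) bc[v] += delta[v];          // BC(v) = sum of deltas
        }
    }

Running this once per source gives the O(mn) total; the rest of the deck is about making each of these n traversals parallel.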


SLIDE 8

Prior GPU Implementations

  • Vertex and Edge Parallelism [Jia et al. (2011)]
    – Same coarse-grained strategy
    – Edge-parallel approach better utilizes the GPU
  • GPU-FAN [Shi and Zhang (2011)]
    – Reported 11-19% speedup over Jia et al.
    – Results were limited in scope
    – Devotes the entire GPU to fine-grained parallelism
  • Both use large O(m) or O(n²) predecessor arrays
    – Our approach: eliminate this array
  • Both use O(n² + m) graph traversals
    – Our approach: trade off memory bandwidth and excess work

SLIDE 9

Coarse-grained Parallelization Strategy


SLIDE 10

Fine-grained Parallelization Strategy

  • Edge-parallel downward traversal

[Figure: BFS levels d = 0 through d = 4; frontier (d = 0) highlighted]

  • Threads are assigned to each edge
    – Only a subset is active
  • Balanced amount of work per thread (see the kernel sketch below)
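A minimal CUDA sketch of one edge-parallel level, assuming a COO edge list (F[e] → C[e]), a depth array d initialized to -1 everywhere except d[source] = 0, and a host loop that resets done to true and relaunches with depth + 1 until done stays true. All names are illustrative, not the talk's actual code:

    // One thread per edge; a thread does work only if its edge leaves the
    // current BFS level. Every level touches all m edges, which trades
    // extra work for coalesced, well-balanced memory traffic.
    __global__ void edge_parallel_step(const int *F, const int *C, int m,
                                       int *d, unsigned long long *sigma,
                                       int depth, bool *done)
    {
        int e = blockIdx.x * blockDim.x + threadIdx.x;
        if (e >= m) return;
        int u = F[e], w = C[e];             // edge u -> w
        if (d[u] != depth) return;          // only frontier edges are active
        if (d[w] == -1) {                   // w discovered: benign race,
            d[w] = depth + 1;               // all writers store depth + 1
            *done = false;
        }
        if (d[w] == depth + 1)              // w is a successor of u
            atomicAdd(&sigma[w], sigma[u]); // count shortest paths into w
    }
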
SLIDE 11

Fine-grained Parallelization Strategy

  • Edge-parallel downward traversal

[Figure: as above; frontier now at d = 1]

SLIDE 12

Fine-grained Parallelization Strategy

  • Edge-parallel downward traversal

[Figure: as above; frontier now at d = 2]

SLIDE 13

Fine-grained Parallelization Strategy

  • Edge-parallel downward traversal

[Figure: as above; frontier now at d = 3]

SLIDE 14

Fine-grained Parallelization Strategy

  • Edge-parallel downward traversal

[Figure: as above; frontier now at d = 4]

SLIDE 15

Fine-grained Parallelization Strategy

  • Work-efficient downward traversal

[Figure: BFS levels d = 0 through d = 4; frontier (d = 0) highlighted]

  • Threads are assigned to vertices in the frontier
    – Use an explicit queue
  • Variable number of edges to traverse per thread (see the kernel sketch below)
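A minimal CUDA sketch of the same level in the work-efficient style, assuming a CSR graph (R offsets, C adjacencies) and an explicit queue Q holding the qlen frontier vertices, with Qnext/qnext_len collecting the next frontier; names are illustrative:

    // One thread per frontier vertex; each thread scans its own adjacency
    // list, so only useful edges are touched but per-thread work varies.
    __global__ void work_efficient_step(const int *R, const int *C,
                                        const int *Q, int qlen,
                                        int *d, unsigned long long *sigma,
                                        int depth, int *Qnext, int *qnext_len)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= qlen) return;
        int v = Q[i];                              // a frontier vertex
        for (int k = R[v]; k < R[v + 1]; ++k) {    // variable work per thread
            int w = C[k];
            if (atomicCAS(&d[w], -1, depth + 1) == -1) {
                int slot = atomicAdd(qnext_len, 1); // claim w exactly once
                Qnext[slot] = w;                    // and enqueue it
            }
            if (d[w] == depth + 1)
                atomicAdd(&sigma[w], sigma[v]);
        }
    }

The atomicCAS claims each newly discovered vertex exactly once, so only edges incident to the frontier are ever touched; the price is that a thread's work grows with its vertex's degree.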

SLIDE 16

Fine-grained Parallelization Strategy

  • Work-efficient downward traversal

[Figure: as above; frontier now at d = 1]

SLIDE 17

Fine-grained Parallelization Strategy

  • Work-efficient downward traversal

[Figure: as above; frontier now at d = 2]

SLIDE 18

Fine-grained Parallelization Strategy

  • Work-efficient downward traversal

[Figure: as above; frontier now at d = 3]

SLIDE 19

Fine-grained Parallelization Strategy

  • Work-efficient downward traversal

[Figure: as above; frontier now at d = 4]

SLIDE 20

Motivation for Hybrid Methods

  • No one method of parallelization works best


  • High diameter: only do useful work
  • Low diameter: leverage memory bandwidth (a dispatch sketch follows)
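A host-side sketch of the dispatch this motivates, keyed on the sampled diameter estimate from the next slide; the log-based threshold here is a placeholder, not the talk's tuned value:

    #include <cmath>

    // High diameter => small frontiers => the queue-based method does only
    // useful work. Low diameter => huge frontiers => the edge-parallel
    // method converts spare bandwidth into speed. Threshold illustrative.
    bool use_work_efficient(int est_diameter, int n)
    {
        return est_diameter > 4 * (int)std::log2((double)n);
    }
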
SLIDE 21

Sampling Approach

  • Idea: processing one source vertex takes O(m + n) time
    – Can process a small sample of vertices fast!
  • Estimate the diameter of the graph's connected components
    – Store the maximum BFS distance found from each of the first k vertices
    – diameter ≈ median(distances)
  • Completes useful work rather than preprocessing the graph! (A sketch follows.)
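A host-side sketch of the estimator; bfs_max_depth is a hypothetical helper standing in for a device BFS that returns the deepest level reached from a source:

    #include <algorithm>
    #include <vector>

    int bfs_max_depth(int s);  // hypothetical: device BFS, deepest level

    // Run BFS from the first k sources (each O(m + n) work), keep the
    // maximum distance each one finds, and estimate the diameter as the
    // median of those values.
    int estimate_diameter(int k, int n)
    {
        std::vector<int> depths;
        for (int s = 0; s < k && s < n; ++s)
            depths.push_back(bfs_max_depth(s));
        std::sort(depths.begin(), depths.end());
        return depths[depths.size() / 2];  // diameter ≈ median(distances)
    }

Each sample is itself a full shortest-path pass, so its σ and δ values feed straight into the BC scores rather than being thrown away as preprocessing.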


SLIDE 22

Experimental Setup

  • Single-node
    – CPU (4 cores)
      • Intel Core i7-2600K
      • 3.4 GHz, 8 MB cache
    – GPU
      • NVIDIA GeForce GTX Titan
      • 14 SMs, 837 MHz, 6 GB GDDR5
      • Compute Capability 3.5
  • Multi-node (KIDS)
    – CPUs (2 x 4 cores)
      • Intel Xeon X5560
      • 2.8 GHz, 8 MB cache
    – GPUs (3)
      • NVIDIA Tesla M2090
      • 16 SMs, 1.3 GHz, 6 GB GDDR5
      • Compute Capability 2.0
    – InfiniBand QDR network
  • All times are reported in seconds
SLIDE 23

Benchmark Data Sets

Name             | Vertices  | Edges      | Diam. | Significance
-----------------|-----------|------------|-------|----------------------
af_shell9        | 504,855   | 8,542,010  | 497   | Sheet metal forming
caidaRouterLevel | 192,244   | 609,066    | 25    | Internet router level
cnr-2000         | 325,527   | 2,738,969  | 33    | Web crawl
com-amazon       | 334,863   | 925,872    | 46    | Product co-purchasing
delaunay_n20     | 1,048,576 | 3,145,686  | 444   | Random triangulation
kron_g500-logn20 | 524,288   | 21,780,787 | 6     | Kronecker graph
loc-gowalla      | 196,591   | 1,900,654  | 15    | Geosocial
luxembourg.osm   | 114,599   | 119,666    | 1,336 | Road network
rgg_n_2_20       | 1,048,576 | 6,891,620  | 864   | Random geometric
smallworld       | 100,000   | 499,998    | 9     | Logarithmic diameter

SLIDE 24

Scaling Results (rgg)

  • Random geometric graphs
  • Sampling beats GPU-FAN by 12x at all scales

SLIDE 25

Scaling Results (rgg)

  • Random geometric graphs
  • Sampling beats GPU-FAN by 12x at all scales
  • Similar amount of time to process a graph 4x as large!

SLIDE 26

Scaling Results (Delaunay)

  • Sparse meshes
  • Speedup grows with graph scale

SLIDE 27

Scaling Results (Delaunay)

  • Sparse meshes
  • Speedup grows with graph scale
  • When edge-parallel is best, it's best by a matter of ms

SLIDE 28

Scaling Results (Delaunay)

  • Sparse meshes
  • Speedup grows with graph scale
  • When edge-parallel is best, it's best by a matter of ms
  • When sampling is best, it's best by a matter of days
SLIDE 29

Benchmark Results

  • Road networks and meshes see ~10x improvement
    – af_shell: 2.5 days → 5 hours
  • Modest improvements otherwise
  • 2.71x average speedup

SLIDE 30

Multi-GPU Results

  • Linear speedups when graphs are sufficiently large
  • 10+ GTEPS for 192 GPUs
  • Scaling isn't unique to graph structure
    – Abundant coarse-grained parallelism

SLIDE 31

A Back of the Envelope Calculation…


  • 192 Tesla M2090 GPUs
  • 16 Streaming Multiprocessors per GPU
  • Maximum of 1024 Threads per Block
  • 192 ∗ 16 ∗ 1024 = 3,145,728
  • Over 3 million CUDA Threads!
SLIDE 32

Conclusions

  • Work-efficient approach obtains up to 13x speedup for high-diameter graphs
  • Tradeoff between work-efficiency and DRAM utilization maximizes performance
    – Average speedup is 2.71x across all graphs
  • Our algorithms easily scale to many GPUs
    – Linear scaling on up to 192 GPUs
  • Our results are consistent across network structures


SLIDE 33

Questions?


  • Contact: Adam McLaughlin, Adam27X@gatech.edu
  • Advisor: David A. Bader, bader@cc.gatech.edu
  • Source code:
    – https://github.com/Adam27X/hybrid_BC
    – https://github.com/Adam27X/graph-utils

SLIDE 34

Backup


SLIDE 35

Contributions

  • A work-efficient algorithm for computing Betweenness Centrality on the GPU
    – Works especially well for high-diameter graphs
  • On-line hybrid approaches that coordinate threads based on graph structure
  • An average speedup of 2.71x over the best existing methods
  • A distributed implementation that scales linearly to up to 192 GPUs
  • Results that are performance portable across the gamut of network structures

SLIDE 36

Brandes’s Algorithm

  • Let vertex 1 be the source, $s$

[Figure: example graph with BFS levels d = 0 through d = 4]

  • First, downward traversal from $s$
  • Obtain the number of shortest paths from $s$ to each vertex ($\sigma_{ss} = 1$)

SLIDE 37

Brandes’s Algorithm

  • Downward traversal from $s$

[Figure: example graph with BFS levels d = 0 through d = 4]

  • $\sigma_{17} = \sigma_{15} + \sigma_{16}$
  • $\sigma_{1\cdot} = [1, 1, 1, 1, 1, 1, 2, 2, 2]$

$$\sigma_{sw} = \sum_{v \in \mathrm{pred}(w)} \sigma_{sv}$$

SLIDE 38

Brandes’s Algorithm

  • Upward dependency accumulation toward $s$

[Figure: example graph with BFS levels d = 0 through d = 4]

  • $\delta_{16} = \frac{\sigma_{16}}{\sigma_{17}}(1 + \delta_{17}) + \frac{\sigma_{16}}{\sigma_{18}}(1 + \delta_{18})$
  • $\delta_{1\cdot} = [8, 0, 0, 5, \tfrac{3}{2}, \tfrac{3}{2}, 1, 0, 0]$

$$\delta_s(v) = \sum_{w \in \mathrm{succ}(v)} \frac{\sigma_{sv}}{\sigma_{sw}} \bigl(1 + \delta_s(w)\bigr)$$
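Plugging the downward-pass values ($\sigma_{16} = 1$, $\sigma_{17} = \sigma_{18} = 2$, $\delta_{17} = 1$, $\delta_{18} = 0$) into the recurrence confirms the 3/2 entry:

    \delta_{16} = \frac{\sigma_{16}}{\sigma_{17}}\,(1 + \delta_{17})
                + \frac{\sigma_{16}}{\sigma_{18}}\,(1 + \delta_{18})
                = \frac{1}{2}(1 + 1) + \frac{1}{2}(1 + 0) = \frac{3}{2}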

SLIDE 39

Fine-grained Parallelization Strategy

  • Vertex-parallel downward traversal

[Figure: BFS levels d = 0 through d = 4; frontier (d = 0) highlighted]

  • Threads are assigned to each vertex
    – Only a subset is active
  • Variable number of edges to traverse per thread (see the kernel sketch below)
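A minimal CUDA sketch of one vertex-parallel level over a CSR graph (R offsets, C adjacencies); names are illustrative:

    // One thread per vertex; a thread works only if its vertex sits on the
    // current level, then scans its whole adjacency list, so threads on
    // high-degree vertices straggle (the load imbalance noted above).
    __global__ void vertex_parallel_step(const int *R, const int *C, int n,
                                         int *d, unsigned long long *sigma,
                                         int depth, bool *done)
    {
        int v = blockIdx.x * blockDim.x + threadIdx.x;
        if (v >= n || d[v] != depth) return;   // only frontier vertices act
        for (int k = R[v]; k < R[v + 1]; ++k) {
            int w = C[k];
            if (d[w] == -1) { d[w] = depth + 1; *done = false; }
            if (d[w] == depth + 1)
                atomicAdd(&sigma[w], sigma[v]);
        }
    }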

SLIDE 40

Fine-grained Parallelization Strategy

  • Vertex-parallel downward traversal

[Figure: as above; frontier now at d = 1]

SLIDE 41

Fine-grained Parallelization Strategy

  • Vertex-parallel downward traversal

[Figure: as above; frontier now at d = 2]

SLIDE 42

Fine-grained Parallelization Strategy

  • Vertex-parallel downward traversal

[Figure: as above; frontier now at d = 3]

SLIDE 43

Fine-grained Parallelization Strategy

  • Vertex-parallel downward traversal

[Figure: as above; frontier now at d = 4]