
SLIDE 1

Accelerating Approximate Weighted Matching on GPUs

MD NAIM*, FREDRIK MANNE*, MAHANTESH HALAPPANAVAR+, ANTONINO TUMEO+, JOHANNES LANGGUTH#

April 14, 2016

*University of Bergen – Bergen, Norway

+Pacific Northwest National Laboratory – Richland, WA, USA

#Simula Research Laboratory – Oslo, Norway

SLIDE 2

The Edge Weighted Matching Problem

Given an edge-weighted graph G(V,E).
Select a set M of non-incident edges of maximum weight.
Best known algorithm has running time O(|V||E| + |V|^2 log |V|) [Gabow 90]
Often too expensive for real applications


[Figure: example graph with edge weights 7, 6, 6, 2, 9, 5, 10, 8, 1]

W(opt) = 22

SLIDE 3

Greedy Algorithm

Add most expensive remaining edge (x,y) to the solution
Remove (x,y) and all edges incident on x or y
Running time: O(|E| log |V|) (due to sorting)
Guarantees a 1/2-approximation, but in practice often much better
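The greedy steps can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name and the edge list are assumptions, with the edge list chosen to be consistent with the compressed edge list on the Graph Data Structures slide.

```python
def greedy_matching(edges):
    """Greedy 1/2-approximation. edges: iterable of (u, v, weight)."""
    matched = set()      # vertices already covered by the matching
    matching = []
    # Sorting heaviest-first is the O(|E| log |V|) step.
    for u, v, w in sorted(edges, key=lambda e: e[2], reverse=True):
        # Skipping an edge with a matched endpoint has the same effect as
        # removing all edges incident on x or y after picking (x, y).
        if u not in matched and v not in matched:
            matching.append((u, v, w))
            matched.update((u, v))
    return matching, sum(w for _, _, w in matching)


# Assumed example graph (five vertices, five edges).
EDGES = [("a", "b", 4), ("a", "c", 5), ("c", "d", 7), ("c", "e", 9), ("d", "e", 10)]
matching, total = greedy_matching(EDGES)
```

On this graph Greedy takes (d,e) first, which blocks both (c,e) and (c,d), and then takes (a,c).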


[Figure: example graph with edge weights 7, 6, 6, 2, 9, 5, 10, 8, 1]

SLIDE 4

Suitor Algorithm (Manne and Halappanavar 2015)

P(x) = “v: heaviest neighbor of x that does not already have a better offer than w(x,v)”

Lemma: If P(x) is set correctly for every x, then P() defines the same solution as Greedy.

Outline of the Suitor algorithm:
Process the vertices in a linear fashion to find the best candidate for each vertex
Set the Suitor value of the candidate
Whenever a vertex is dislodged, it must be re-processed before moving to the next vertex


[Figure: vertices x and y; (x,y) in M]
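The outline above can be sketched sequentially in Python. This is an illustrative sketch under stated assumptions: the names `suitor_matching`, `suitor`, `offer` and `ADJ` are hypothetical, edge weights are assumed positive and distinct, and an edge (x,y) is reported as matched when x and y end up as each other's suitor.

```python
def suitor_matching(adj):
    """adj: dict mapping each vertex to a list of (neighbor, weight) pairs."""
    suitor = {v: None for v in adj}   # best proposer each vertex has seen
    offer = {v: 0 for v in adj}       # weight of that best proposal
    for start in adj:                 # process the vertices in a linear fashion
        current = start
        while current is not None:
            # Heaviest neighbor that does not already hold a better offer.
            partner, heaviest = None, 0
            for v, w in adj[current]:
                if w > heaviest and w > offer[v]:
                    partner, heaviest = v, w
            if partner is None:
                break                 # no eligible candidate for this vertex
            dislodged = suitor[partner]
            suitor[partner], offer[partner] = current, heaviest
            current = dislodged       # re-process a dislodged vertex first
    return {frozenset((x, y)) for x, y in suitor.items()
            if y is not None and suitor[y] == x}


# Assumed example graph, consistent with the compressed edge list slide.
ADJ = {
    "a": [("b", 4), ("c", 5)],
    "b": [("a", 4)],
    "c": [("a", 5), ("d", 7), ("e", 9)],
    "d": [("c", 7), ("e", 10)],
    "e": [("c", 9), ("d", 10)],
}
result = suitor_matching(ADJ)
```

On this input the result is the pair of edges (a,c) and (d,e), the same edges Greedy would select, in line with the lemma.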

SLIDE 5

Graph Data Structures

[Figure: example graph; vertex labels e d c a b; edge-weight labels 1 7 9 5 4]

Compressed Edge List

Vertex  a  b  c  d  e  END
Idx     0  2  3  6  8  10

Pos     0  1  2  3  4  5  6  7   8  9
E       b  c  a  a  d  e  c  e   c  d
W       4  5  4  5  7  9  7  10  9  10
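A compressed edge list of this kind can be built as follows. This is a minimal sketch: the function names are hypothetical, indices are assumed to be 0-based, and each neighbor list is sorted by neighbor label only so that the output is deterministic.

```python
def build_compressed_edge_list(vertices, edges):
    """Return (idx, E, W) for an undirected graph given as (u, v, weight) triples."""
    nbrs = {v: [] for v in vertices}
    for u, v, w in edges:
        nbrs[u].append((v, w))    # every undirected edge is stored twice,
        nbrs[v].append((u, w))    # once in each endpoint's neighbor list
    idx, E, W = [0], [], []
    for v in vertices:
        for n, w in sorted(nbrs[v]):
            E.append(n)
            W.append(w)
        idx.append(len(E))        # the last entry plays the role of END
    return idx, E, W


def neighbors(idx, E, W, i):
    """Neighbors and weights of the i-th vertex: positions idx[i] .. idx[i+1]-1."""
    return list(zip(E[idx[i]:idx[i + 1]], W[idx[i]:idx[i + 1]]))


EDGES = [("a", "b", 4), ("a", "c", 5), ("c", "d", 7), ("c", "e", 9), ("d", "e", 10)]
idx, E, W = build_compressed_edge_list("abcde", EDGES)
```

Here `neighbors(idx, E, W, 2)` returns the neighbor list of c, namely a, d and e with weights 5, 7 and 9.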

SLIDE 6

Suitor Algorithm: Example

[Figure: example graph; vertex labels e d c a b; edge-weight labels 1 7 9 5 4; legend: Unprocessed, Current, Candidate, Processed]

Suitor:Offer  __:__  __:__  __:__  __:__  __:__

SLIDE 7

Suitor Algorithm: Example

[Figure: example graph; vertex labels e d c a b; edge-weight labels 1 7 9 5 4; legend: Unprocessed, Current, Candidate, Processed]

Suitor:Offer  __:__  __:__  __:__  __:__  __:__

SLIDE 8

Suitor Algorithm: Example

[Figure: example graph; vertex labels e d c a b; edge-weight labels 1 7 9 5 4; legend: Unprocessed, Current, Candidate, Processed]

Suitor:Offer  __:__  __:__  __:__  a : 5  __:__

SLIDE 9

Suitor Algorithm: Example

[Figure: example graph; vertex labels e d c a b; edge-weight labels 1 7 9 5 4; legend: Unprocessed, Current, Candidate, Processed]

Suitor:Offer  __:__  __:__  __:__  a : 5  __:__

SLIDE 10

Suitor Algorithm: Example

[Figure: example graph; vertex labels e d c a b; edge-weight labels 1 7 9 5 4; legend: Unprocessed, Current, Candidate, Processed]

Suitor:Offer  __:__  __:__  __:__  a : 5  b : 4

SLIDE 11

Suitor Algorithm: Example

[Figure: example graph; vertex labels e d c a b; edge-weight labels 1 7 9 5 4; legend: Unprocessed, Current, Candidate, Processed]

Suitor:Offer  __:__  __:__  __:__  a : 5  b : 4

SLIDE 12

Suitor Algorithm: Example

[Figure: example graph; vertex labels e d c a b; edge-weight labels 1 7 9 5 4; legend: Unprocessed, Current, Candidate, Processed]

Suitor:Offer  c : 9  __:__  __:__  a : 5  b : 4

SLIDE 13

Suitor Algorithm: Example

[Figure: example graph; vertex labels e d c a b; edge-weight labels 1 7 9 5 4; legend: Unprocessed, Current, Candidate, Processed]

Suitor:Offer  __:__  __:__  a : 5  b : 4

c : 9

SLIDE 14

Suitor Algorithm: Example

[Figure: example graph; vertex labels e d c a b; edge-weight labels 1 7 9 5 4; legend: Unprocessed, Current, Candidate, Processed]

Suitor:Offer  __:__  __:__  a : 5  b : 4

d : 9

SLIDE 15

Suitor Algorithm: Example

[Figure: example graph; vertex labels e d c a b; edge-weight labels 1 7 9 5 4; legend: Unprocessed, Current, Candidate, Processed]

Suitor:Offer  c : 7  __:__  a : 5  b : 4

d : 9

SLIDE 16

Suitor Algorithm: Example

[Figure: example graph; vertex labels e d c a b; edge-weight labels 1 7 9 5 4; legend: Unprocessed, Current, Candidate, Processed]

Suitor:Offer  c : 7  __:__  a : 5  b : 4

d : 9

SLIDE 17

Suitor Algorithm: Example

[Figure: example graph; vertex labels e d c a b; edge-weight labels 1 7 9 5 4; legend: Unprocessed, Current, Candidate, Processed]

Suitor:Offer  e : 10  __:__  a : 5  b : 4

d : 9

SLIDE 18

Suitor Algorithm: Example

[Figure: example graph; vertex labels e d c a b; edge-weight labels 1 7 9 5 4; legend: Unprocessed, Current, Candidate, Processed]

Suitor:Offer  e : 10  __:__  a : 5  b : 4

d : 9

SLIDE 19

Suitor Algorithm: Example

[Figure: example graph; vertex labels e d c a b; edge-weight labels 1 7 9 5 4; legend: Unprocessed, Current, Candidate, Processed]

Suitor:Offer  e : 10  __:__  a : 5  b : 4

d : 9

SLIDE 20

Suitor Algorithm: Recap

Search for the best possible candidate for each vertex → O(d_v)
In the worst case, needs to repeat the search d_v times
Set the Suitor and corresponding Offer, O(1)
Running time: O(Σ_{v∈V} d_v^2) ≤ O(∆|E|)
With sorted neighbor lists: O(Σ_{v∈V} d_v log d_v + |V| + |E|) ≤ O(|E| log ∆ + |V| + |E|)
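The two upper bounds follow from bounding one factor of each summand by the maximum degree and then using the handshake identity. The derivation below is a sketch that assumes ∆ denotes the maximum degree and d_v the degree of v; the slide does not define either symbol.

```latex
\sum_{v \in V} d_v^2
  \;\le\; \Delta \sum_{v \in V} d_v
  \;=\; 2\,\Delta\,|E|
  \;=\; O(\Delta |E|),
\qquad
\sum_{v \in V} d_v \log d_v
  \;\le\; \log \Delta \sum_{v \in V} d_v
  \;=\; 2\,|E| \log \Delta
  \;=\; O(|E| \log \Delta).
```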


SLIDE 21

Observations

The order in which vertices are processed has no influence on the final matching
The vertices can be processed independently
Need to protect the Suitor values, as multiple vertices may try to become the suitor of the same vertex at the same time


[Figure: vertices x, v, u; Mutual Exclusion]
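One way to picture the mutual exclusion is a CPU-thread sketch with one lock per Suitor value. This is only an illustration under assumptions: it uses Python threads rather than GPU warps, the names `parallel_suitor`, `locks` and `ADJ` are hypothetical, weights are assumed positive and distinct, and a proposal that loses the race is simply retried.

```python
import threading


def parallel_suitor(adj, num_threads=4):
    suitor = {v: None for v in adj}
    offer = {v: 0 for v in adj}
    locks = {v: threading.Lock() for v in adj}   # protects suitor[v] and offer[v]

    def process(start):
        current = start
        while current is not None:
            partner, heaviest = None, 0
            for v, w in adj[current]:
                # Unlocked read; the decision is re-checked under the lock.
                if w > heaviest and w > offer[v]:
                    partner, heaviest = v, w
            if partner is None:
                return
            with locks[partner]:
                if heaviest > offer[partner]:
                    dislodged = suitor[partner]
                    suitor[partner], offer[partner] = current, heaviest
                    current = dislodged          # continue with the dislodged vertex
                # Otherwise a better offer arrived in the meantime:
                # 'current' is unchanged and the search is repeated.

    def worker(chunk):
        for v in chunk:
            process(v)

    vertices = list(adj)
    threads = [threading.Thread(target=worker, args=(vertices[i::num_threads],))
               for i in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return {frozenset((x, y)) for x, y in suitor.items()
            if y is not None and suitor[y] == x}


# Assumed example graph.
ADJ = {
    "a": [("b", 4), ("c", 5)],
    "b": [("a", 4)],
    "c": [("a", 5), ("d", 7), ("e", 9)],
    "d": [("c", 7), ("e", 10)],
    "e": [("c", 9), ("d", 10)],
}
```

Because the processing order does not affect the final matching, the result should be the same for any thread count and any interleaving.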

SLIDE 22

Suitor Algorithm on GPU

Assign a chunk of vertices to each warp
Each warp processes its chunk independently
Each warp starts with the assigned chunk in shared memory
The warp processes its vertices in a linear fashion
Search for the best possible candidate for each v ∈ chunk
The threads in the warp read the neighbor list of v in an interleaved fashion, O(ceil(|N(v)|/32))


[Figure: Start Indices, E and W arrays split into chunks for warp0, warp1, ..., warpL-1; each chunk is stored in the shared memory of its warp]

SLIDE 23

Suitor Algorithm on GPU


[Figure: for |N(v)| = 64, threads T0 ... T31 of the warp read weights w0 ... w31 in the first step and w32 ... w63 in the second step (time axis)]

best_t = max{ w_t, w_{t+32}, w_{t+64}, ... }
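The interleaved read can be simulated on the CPU to make the per-thread maximum concrete. This is a sketch, not a GPU kernel: `WARP_SIZE`, `lane_best` and `warp_scan` are hypothetical names, and the eligibility test against the current Offer values is left out so that only the access pattern is shown.

```python
WARP_SIZE = 32


def lane_best(weights, lane):
    """best_t = max{w_t, w_{t+32}, w_{t+64}, ...} as a (weight, position) pair."""
    best = (float("-inf"), -1)
    for i in range(lane, len(weights), WARP_SIZE):   # stride of one warp width
        best = max(best, (weights[i], i))
    return best


def warp_scan(weights):
    """Per-lane results of one warp, plus the number of lock-step read steps."""
    steps = -(-len(weights) // WARP_SIZE)            # ceil(|N(v)| / 32)
    return [lane_best(weights, t) for t in range(WARP_SIZE)], steps


weights = list(range(64))                            # a neighbor list with |N(v)| = 64
per_lane, steps = warp_scan(weights)
```

With 64 neighbors each lane touches two entries, so the scan takes two steps, and lane 5, for example, compares w_5 with w_37.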

SLIDE 24


Suitor Algorithm on GPU

Assign a chunk of vertices to each warp
Each warp processes its chunk independently
Each warp starts with the assigned chunk in shared memory
The warp processes its vertices in a linear fashion
Search for the best possible candidate for each v ∈ chunk
The threads in the warp read the neighbor list of v in an interleaved fashion, O(ceil(|N(v)|/32))
Saves the best found so far in a local variable
Reduction at the end, O(log2(32))

SLIDE 25


Suitor Algorithm on GPU

log2(32) Butterfly Reduction

[Figure: butterfly reduction over threads T0 ... T31; partial results T01, T23, ..., then T0123, ..., and so on]

T_best = max{ T_best_0, ..., T_best_31 } after 5 steps
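A butterfly reduction can be simulated on the CPU by letting every lane exchange its value with the lane whose index differs in one bit. This is a sketch with a hypothetical function name; the slides do not say which GPU mechanism performs the exchange.

```python
def butterfly_max(values):
    """All-reduce (max) over a power-of-two number of lanes."""
    n = len(values)
    assert n > 0 and n & (n - 1) == 0, "lane count must be a power of two"
    vals = list(values)
    steps = 0
    offset = 1
    while offset < n:
        # Lane t combines its value with that of lane t XOR offset:
        # pairs first (T01, T23, ...), then groups of four, and so on.
        vals = [max(vals[t], vals[t ^ offset]) for t in range(n)]
        offset *= 2
        steps += 1
    return vals, steps


vals, steps = butterfly_max(list(range(32)))
```

For 32 lanes the loop runs log2(32) = 5 times, after which every lane holds the overall maximum.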

SLIDE 26


Suitor Algorithm on GPU

Assign a chunk of vertices to each warp
Each warp processes its chunk independently
Each warp starts with the assigned chunk in shared memory
The warp processes its vertices in a linear fashion
Search for the best possible candidate for each v ∈ chunk
The threads in the warp read the neighbor list of v in an interleaved fashion, O(ceil(|N(v)|/32))
Saves the best found so far in a local variable
Reduction at the end, O(log2(32))
Set the Suitor and corresponding Offer for the candidate, O(1)
Can be buffered in registers
Save failed and dislodged vertices in shared memory
Processed by the same warp in the next round
Redistributed across the warps of the same block
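The round structure can be imitated sequentially: each simulated warp owns a chunk, and a dislodged vertex is queued for the next round instead of being re-processed at once. This is a CPU sketch under assumptions, not the GPU implementation: the names are hypothetical, dislodged vertices stay with the warp that dislodged them, and failed updates cannot occur because nothing runs concurrently.

```python
def chunked_suitor(adj, num_warps=2):
    suitor = {v: None for v in adj}
    offer = {v: 0 for v in adj}
    vertices = list(adj)
    size = -(-len(vertices) // num_warps)            # ceil(|V| / num_warps)
    queues = [vertices[i * size:(i + 1) * size] for i in range(num_warps)]
    rounds = 0
    while any(queues):
        next_queues = [[] for _ in range(num_warps)]
        for wid, queue in enumerate(queues):         # one simulated warp at a time
            for u in queue:
                partner, heaviest = None, 0
                for v, w in adj[u]:
                    if w > heaviest and w > offer[v]:
                        partner, heaviest = v, w
                if partner is None:
                    continue
                dislodged = suitor[partner]
                suitor[partner], offer[partner] = u, heaviest
                if dislodged is not None:
                    next_queues[wid].append(dislodged)   # handled in the next round
        queues = next_queues
        rounds += 1
    matching = {frozenset((x, y)) for x, y in suitor.items()
                if y is not None and suitor[y] == x}
    return matching, rounds


# Assumed example graph.
ADJ = {
    "a": [("b", 4), ("c", 5)],
    "b": [("a", 4)],
    "c": [("a", 5), ("d", 7), ("e", 9)],
    "d": [("c", 7), ("e", 10)],
    "e": [("c", 9), ("d", 10)],
}
matching, rounds = chunked_suitor(ADJ, num_warps=2)
```

Deferring a dislodged vertex to a later round changes only the processing order, so the matching should be unaffected.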

SLIDE 27

Load Balancing

Each warp stores failed and dislodged vertices in programmable cache

Can be redistributed across the warps of the same block
Need synchronization among the warps of the same block
Synchronization can become a bottleneck
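Redistribution within a block can be pictured as pooling the leftover vertices of all warps and dealing them out evenly. The helper below is a hypothetical sketch; on a GPU the pooling step is where the warps of the block would have to synchronize.

```python
def redistribute(per_warp_queues):
    """Pool the pending vertices of all warps in a block and deal them out round-robin."""
    pooled = [v for queue in per_warp_queues for v in queue]   # needs a block-wide barrier
    k = len(per_warp_queues)
    return [pooled[i::k] for i in range(k)]


balanced = redistribute([[1, 2, 3, 4, 5], [], [6]])
```

In this example one warp holds five pending vertices and another none; after redistribution each of the three warps holds two.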


SLIDE 28

Experimental Evaluation

Dataset

Florida Sparse Matrix Collection: 269 problems, ranging from a million to a billion edges

Alternatives

Locally Dominant (LD): Shared Memory and GPU Implementation
Suitor: Shared Memory


Intel CPUs
2 sockets, 64 GB and BW of 51.2 GB/s
Hyper-threaded 8-core Intel Xeon E5 @ 3.10 GHz
GCC 4.9.2, GOMP_CPU_AFFINITY, numactl

NVIDIA GPU
Tesla K40 with GK110B GPU, 15 SMXes, 2880 cores @ 745 MHz
12 GB of GDDR5 at 3 GHz, BW of 288 GB/s
nvcc (CUDA 7.0)

SLIDE 29

Problem Instances and Runtimes


SLIDE 30

Speed up vs LD and OMP-Suitor

Speedup of GPU-Suitor relative to GPU-LD and OMP-LD (left), and OMP-Suitor (right)


SLIDE 31

Design Space Exploration


SLIDE 32

Conclusion

Presented an implementation of the Suitor matching algorithm for GPUs
Faster than multicore and previous GPU implementations
Future work:

Extension to multiple GPUs


SLIDE 33

Thank you!

Questions?
Md Naim - naim.md@gmail.com
Fredrik Manne - ManneFredrik.Manne@ii.uib.no
Mahantesh Halappanavar - mahantesh.halappanavar@pnnl.gov
Antonino Tumeo - antonino.tumeo@pnnl.gov
Johannes Langguth - langguth@simula.no
