SLIDE 1

GPU accelerated maximum cardinality matching algorithms for bipartite graphs

Bora Uçar

CNRS and LIP, ENS Lyon, France

Euro-Par 2013, 26–30 August 2013, Aachen, Germany

Joint work with: Mehmet Deveci, Ümit V. Çatalyürek, and Kamer Kaya; BMI (and ECE for MD and ÜVÇ), The Ohio State University

SLIDE 2

Bipartite graphs and matchings

G = (R ∪ C, E) is a bipartite graph with the vertex set R ∪ C, where R ∩ C = ∅ and every edge has one endpoint in R and the other in C. A matching M in a graph G is a subset of the edges E such that every vertex in R ∪ C is incident to at most one edge of M. Perfect matching: all vertices in R (or in C) are matched, e.g., (r1, c3), (r2, c1), (r3, c5), (r4, c2), (r5, c4).

[Figure: the example bipartite graph on rows r1–r5 and columns c1–c5.]

Problem: Find a matching of maximum cardinality.
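To make the data layout concrete: the algorithms later in this deck store the graph column-wise in CSR-like arrays (cxadj, cadj) and keep the matching in two arrays rmatch/cmatch, with −1 meaning unmatched. The sketch below (plain C++; the edge set is hypothetical, since the slide's figure is not reproduced here) encodes such a graph together with the perfect matching above and verifies it.

// A minimal sketch of the data layout used by the kernels later in this
// deck: column-wise CSR-like arrays (cxadj, cadj) plus match arrays
// (rmatch, cmatch; -1 = unmatched). The edge set is hypothetical; it merely
// contains the perfect matching (r1,c3),(r2,c1),(r3,c5),(r4,c2),(r5,c4).
#include <cstdio>

int main() {
  const int nc = 5;                          // columns c1..c5 (0-based: 0..4)
  // Rows adjacent to column c are cadj[cxadj[c]] .. cadj[cxadj[c+1] - 1].
  const int cxadj[6] = {0, 2, 4, 6, 8, 9};
  const int cadj[9]  = {1, 0,  3, 4,  0, 1,  4, 3,  2};
  const int cmatch[5] = {1, 3, 0, 4, 2};     // cmatch[c] = row matched to column c
  const int rmatch[5] = {2, 0, 4, 1, 3};     // rmatch[r] = column matched to row r
  // Sanity check: the two arrays agree, and every matched pair is an edge.
  for (int c = 0; c < nc; ++c) {
    const int r = cmatch[c];
    if (r == -1) continue;                   // unmatched column
    bool edge = false;
    for (int j = cxadj[c]; j < cxadj[c + 1]; ++j) edge |= (cadj[j] == r);
    if (rmatch[r] != c || !edge) { std::printf("invalid matching\n"); return 1; }
  }
  std::printf("matching is valid and perfect\n");
  return 0;
}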

SLIDE 3

Outline

SLIDE 4

Matrices, bipartite graphs and matchings

Motivation: Given an n × n sparse matrix A, find a permutation of its columns so that the diagonal of the permuted matrix is zero-free.

1. Build the associated bipartite graph GA = (R ∪ C, E): R corresponds to the set of rows, C to the set of columns, and (ri, cj) ∈ E iff aij ≠ 0.

2. Compute a perfect matching in GA.

3. Permute the columns according to the matching.

[Figure: a 5 × 5 matrix A, its bipartite graph GA on rows r1–r5 and columns c1–c5, and the permuted matrix AP, whose columns appear in the order 3, 1, 5, 2, 4.]

The permuted form can be used to detect whether A is reducible; if so, substantial savings are possible when solving the associated linear system.
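As a small illustration of step 3 (a sketch under our own naming, dense for brevity): once a row-perfect matching rmatch is known, moving column rmatch[j] of A into position j places the nonzero a(j, rmatch[j]) on the diagonal; on the example above, rmatch = (3, 1, 5, 2, 4) yields exactly the column order of AP.

// Sketch: build AP from A by permuting columns according to a row-perfect
// matching. Column rmatch[j] of A becomes column j of AP, so
// AP(j, j) = A(j, rmatch[j]) != 0 and the diagonal of AP is zero-free.
#include <vector>

std::vector<std::vector<double>> permuteColumns(
    const std::vector<std::vector<double>>& A,   // dense here for brevity
    const std::vector<int>& rmatch) {            // rmatch[i] = column matched to row i
  const int n = static_cast<int>(A.size());
  std::vector<std::vector<double>> AP(n, std::vector<double>(n));
  for (int i = 0; i < n; ++i)
    for (int j = 0; j < n; ++j)
      AP[i][j] = A[i][rmatch[j]];                // column rmatch[j] moves to slot j
  return AP;
}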

SLIDE 5

Augmenting paths

Alternating path: A path in G is M-alternating if its edges are alternately in M and not in M. Augmenting path: An M-alternating path P is called M-augmenting if the start and end vertices of P are both unmatched.

[Figure: the example graph with a matching, before and after augmentation; the augmenting path visits r4, c2, r5, c4, r2, c1, r1, c3.]

All exact, deterministic algorithms are based on augmenting paths: start with a (possibly empty) matching and augment it repeatedly (Berge's theorem: a matching is maximum iff it admits no augmenting path).
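For contrast with the GPU algorithms that follow, here is a minimal sequential realization of this idea: a DFS-based augmenting-path search in the spirit of the DFSB row of the next slide (the toy graph is hypothetical).

// Minimal sequential maximum matching by augmenting paths: repeatedly start
// a DFS from an unmatched column and flip the augmenting path it finds.
#include <cstdio>
#include <vector>

// Try to find an augmenting path starting at column c; flip it if found.
bool augment(int c, const std::vector<std::vector<int>>& colAdj,
             std::vector<int>& rmatch, std::vector<int>& cmatch,
             std::vector<char>& visited) {
  for (int r : colAdj[c]) {
    if (visited[r]) continue;
    visited[r] = 1;
    // Row r is free, or the column currently holding r can be re-matched.
    if (rmatch[r] == -1 || augment(rmatch[r], colAdj, rmatch, cmatch, visited)) {
      rmatch[r] = c;
      cmatch[c] = r;
      return true;
    }
  }
  return false;
}

int main() {
  // Hypothetical toy instance: columns c0..c2, rows r0..r2.
  const std::vector<std::vector<int>> colAdj = {{0, 1}, {0}, {1, 2}};
  std::vector<int> rmatch(3, -1), cmatch(3, -1);
  int card = 0;
  for (int c = 0; c < 3; ++c) {
    std::vector<char> visited(3, 0);        // fresh DFS marks per start column
    if (augment(c, colAdj, rmatch, cmatch, visited)) ++card;
  }
  std::printf("maximum matching cardinality = %d\n", card);   // prints 3
  return 0;
}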

SLIDE 6

Algorithms for bipartite matching

(n: number of vertices on one side; τ: number of edges.)

Alg.         Description                                                          Complexity
DFSB         DFS; forms the basis of many algorithms.                             O(nτ)
BFSB         BFS; quite common (the algorithm FF in [Mehlhorn and Näher, '99]).   O(nτ)
MC21A        DFS + lookahead [Duff, '81]; dmperm in Matlab [Davis, '06]:          O(nτ)
             the most widespread?
PF           Phases of disjoint DFSs [Pothen and Fan, '90].                       O(nτ)
HK           Shortest disjoint augmenting paths [Hopcroft and Karp, '73].         O(√n τ)
HKDW         HK + disjoint DFS [Duff and Wiberg, '88].                            O(√n τ)
ABMP         Combined DFS and BFS [Alt, Blum, Mehlhorn, and Paul, '91].           min{O(√n τ), O(n^1.5 √(τ/log n))}
PF+          A simple modification of PF [Duff, Kaya, and Uçar, '10].             O(nτ)
PR           Push-relabel [Cherkassky, Goldberg, Martin, Setubal, Stolfi, '98];   O(√n τ)
             bounds on distances to free vertices.
PseudoFlow   Prefixes and suffixes of augmenting paths [Hochbaum, '98;            O(nτ)
             Chandran and Hochbaum, '11].

SLIDE 7

Some recent parallelization studies

Undirected graphs:

- Weighted, unweighted, approximate; GPU, MPI, external memory: Birn, Osipov, Sanders, Schulz, Sitchinava (Euro-Par'13, Session F2).
- Weighted, unweighted, heuristic; GPU: Fagginger Auer and Bisseling '12.
- Weighted; GPU, multicore: Halappanavar, Feo, Villa, Tumeo, and Pothen '12.
- Weighted, greedy; multicore: Çatalyürek, Deveci, Kaya, Uçar '12.

Bipartite graphs:

- Weighted; GPU: Vasconcelos and Rosenhahn '09.
- Unweighted; multicore: Azad, Halappanavar, Rajamanickam, Boman, Khan, Pothen '12.

We propose: unweighted, bipartite, GPU.

SLIDE 8

Outline

SLIDE 9

Proposed algorithms

Based on HK and HKDW. HK uses BFS to locate a set of shortest augmenting paths and augments along a maximal set of them using DFS; HKDW adds one more DFS step to augment along the remaining (non-shortest) paths. We keep the BFS part; the DFS part does not promise efficiency on the GPU.

Overall description:
HK: find a set of shortest augmenting paths; alternate along all of them (some of them will be realized).
HKDW: find a set of augmenting paths; alternate along all of them (some of them will be realized).

The worst-case running time complexity increases to O(nτ) from O(√n τ); we trade that to achieve fine-grained parallelism.

SLIDE 10

Proposed algorithms: Main one, similar to HKDW

Algorithm 1: Shortest augmenting paths (APsB)

Data: cxadj, cadj, nc, nr, rmatch, cmatch

 1  augmenting_path_found ← true
 2  while augmenting_path_found do
 3      bfs_level ← L0
 4      InitBfsArray(bfs_array, cmatch, L0)
 5      vertex_inserted ← true
 6      while vertex_inserted do
 7          predecessor ← Bfs(bfs_level, bfs_array, cxadj, cadj, nc, rmatch,
 8                            vertex_inserted, augmenting_path_found)
 9          if augmenting_path_found then
10              break
11          bfs_level ← bfs_level + 1
12      ⟨cmatch, rmatch⟩ ← Alternate(cmatch, rmatch, nc, predecessor)
13      ⟨cmatch, rmatch⟩ ← FixMatching(cmatch, rmatch)

In the APFB variant, the BFS runs through ALL levels instead of stopping at the first level that contains an augmenting path. The BFS uses alternating paths: it starts from the unmatched columns and tries to reach unmatched rows. FixMatching (line 13) is what lets Alternate run without atomic operations and locks.

SLIDE 11

Proposed algorithms: BFS kernel function 1

Algorithm 2: Bfs (kernel function)

Data: bfs_level, bfs_array, cxadj, cadj, nc, rmatch, vertex_inserted, augmenting_path_found

 1  process_vcnt ← getProcessCount(nc)
 2  for i from 0 to process_vcnt − 1 do
 3      col_vertex ← i × tot_thread_num + tid
 4      if bfs_array[col_vertex] = bfs_level then            ▹ column visited in the current level
 5          for j from cxadj[col_vertex] to cxadj[col_vertex + 1] do
 6              neighbor_row ← cadj[j]
 7              col_match ← rmatch[neighbor_row]
 8              if col_match > −1 then
 9                  if bfs_array[col_match] = L0 − 1 then    ▹ unvisited (matched) column found
10                      vertex_inserted ← true
11                      bfs_array[col_match] ← bfs_level + 1
12                      predecessor[neighbor_row] ← col_vertex
13              else
14                  if col_match = −1 then                   ▹ unmatched row found
15                      rmatch[neighbor_row] ← −2
16                      predecessor[neighbor_row] ← col_vertex
17                      augmenting_path_found ← true
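A possible CUDA transcription of this kernel is sketched below; the names follow the slides, but the code is our reconstruction, and the host side (Algorithm 1, which launches the kernel once per level and re-reads the two flags) is omitted. It assumes InitBfsArray has set bfs_array to L0 for unmatched columns and to L0 − 1 (unvisited) elsewhere.

// Sketch of the Bfs kernel in CUDA, transcribed from the pseudocode above.
#define L0 0   // assumed value; only L0 and L0 - 1 matter as markers

__global__ void Bfs(int bfs_level, int* bfs_array,
                    const int* cxadj, const int* cadj, int nc,
                    int* rmatch, int* predecessor,
                    int* vertex_inserted, int* augmenting_path_found) {
  const int tid = blockIdx.x * blockDim.x + threadIdx.x;
  const int tot_thread_num = gridDim.x * blockDim.x;
  const int process_vcnt = (nc + tot_thread_num - 1) / tot_thread_num;
  for (int i = 0; i < process_vcnt; ++i) {
    const int col_vertex = i * tot_thread_num + tid;
    if (col_vertex >= nc) return;
    if (bfs_array[col_vertex] != bfs_level) continue;  // not in current frontier
    for (int j = cxadj[col_vertex]; j < cxadj[col_vertex + 1]; ++j) {
      const int neighbor_row = cadj[j];
      const int col_match = rmatch[neighbor_row];
      if (col_match > -1) {
        if (bfs_array[col_match] == L0 - 1) {          // unvisited matched column
          *vertex_inserted = 1;                        // benign race: all writers store 1
          bfs_array[col_match] = bfs_level + 1;
          predecessor[neighbor_row] = col_vertex;
        }
      } else if (col_match == -1) {                    // unmatched row: path found
        rmatch[neighbor_row] = -2;
        predecessor[neighbor_row] = col_vertex;
        *augmenting_path_found = 1;
      }
    }
  }
}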

SLIDE 12

Proposed algorithms: Alternate

Algorithm 3: Alternate

Data: cmatch, rmatch, nc, nr, predecessor

 1  process_vcnt ← getProcessCount(nr)
 2  for i from 0 to process_vcnt − 1 do
 3      row_vertex ← i × tot_thread_num + tid
 4      if rmatch[row_vertex] = −2 then
 5          while row_vertex ≠ −1 do
 6              matched_col ← predecessor[row_vertex]
 7              matched_row ← cmatch[matched_col]
 8              if predecessor[matched_row] = matched_col then
 9                  break
10              cmatch[matched_col] ← row_vertex
11              rmatch[row_vertex] ← matched_col
12              row_vertex ← matched_row

Line 3 gives coalesced accesses to memory (consecutive threads process consecutive rows).
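Below is a possible CUDA transcription of Alternate (again a reconstruction; names follow the slides). The one addition is a matched_row ≠ −1 guard before the test of line 8, protecting the −1 sentinel that marks the unmatched column at the far end of the augmenting path.

// Sketch of the Alternate kernel in CUDA, transcribed from Algorithm 3.
__global__ void Alternate(int* cmatch, int* rmatch, int nr,
                          const int* predecessor) {
  const int tid = blockIdx.x * blockDim.x + threadIdx.x;
  const int tot_thread_num = gridDim.x * blockDim.x;
  const int process_vcnt = (nr + tot_thread_num - 1) / tot_thread_num;
  for (int i = 0; i < process_vcnt; ++i) {
    int row_vertex = i * tot_thread_num + tid;  // line 3: consecutive tids, coalesced
    if (row_vertex >= nr) return;
    if (rmatch[row_vertex] != -2) continue;     // only rows ending an augmenting path
    while (row_vertex != -1) {
      const int matched_col = predecessor[row_vertex];
      const int matched_row = cmatch[matched_col];
      if (matched_row != -1 && predecessor[matched_row] == matched_col)
        break;                                  // lines 8-9: conflicting walk; stop here
      cmatch[matched_col] = row_vertex;         // flip the path one edge pair at a time
      rmatch[row_vertex] = matched_col;
      row_vertex = matched_row;
    }
  }
}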

SLIDE 13

Proposed algorithms: Fix Matching

[Figure: two race scenarios (Problem 1 and Problem 2) between threads t and t′ on a small graph with rows r1–r3 and columns c1–c2. The concurrent writes cmatch[c2] = r2, rmatch[r2] = c2, and rmatch[r3] = c2 leave both r2 and r3 claiming column c2. This is why we need FixMatching.]

FixMatching: rmatch[r] ← −1 for any r satisfying cmatch[rmatch[r]] ≠ r
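This rule maps to a one-line kernel; a minimal sketch (our CUDA rendering, one thread per row) follows.

// Sketch of FixMatching in CUDA: a row whose claimed column does not point
// back at it lost the race in Alternate; reset it to unmatched (-1) so it is
// considered again in the next round.
__global__ void FixMatching(const int* cmatch, int* rmatch, int nr) {
  const int r = blockIdx.x * blockDim.x + threadIdx.x;
  if (r >= nr) return;
  const int c = rmatch[r];
  if (c >= 0 && cmatch[c] != r) rmatch[r] = -1;  // cmatch[rmatch[r]] != r
}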

SLIDE 14

Proposed algorithms: BFS kernel modified

...so that it exits early: once an augmenting path is found for a column, no further BFS continues for that column. The modification also helps Alternate: it marks the start and the end of each augmenting path, so that Alternate walks along correct augmenting paths.

SLIDE 15

Outline

SLIDE 16

Experiments

- Sequential HK and PFP implementations from Duff, Kaya, and Uçar '11.
- Multicore implementations P-PFP, P-DBFS, and P-HK from Azad et al. '12, run with 8 threads.
- CPU: two quad-core 2.27 GHz Intel Xeon CPUs with 2-way hyper-threading and 48 GB main memory (C++ and OpenMP).
- GPU: NVIDIA Tesla C2050 with 2.6 GB of usable global memory (14 multiprocessors, each containing 32 CUDA cores).
- gcc 4.4.4, CUDA 4.2.9, and the -O2 optimization flag.
- A standard heuristic is used to initialize all algorithms.
- The execution times of the GPU algorithms exclude memory-copy time; including it decreases the reported mean speedups across the whole data set by at most 6%.

SLIDE 17

Experiments: Data set and GPU algorithms

Data: 70 large matrices from the UFL collection. "O" denotes the original set; "RCP" applies random row/column permutations. We report on the matrices for which at least one sequential algorithm took more than one second (O_S1: 28 matrices; RCP_S1: 50 matrices). O_Hardest20 and RCP_Hardest20 contain the 20 matrices on which the sequential algorithms required the longest runtimes.

GPU algorithms: geometric means of the runtimes (in seconds) on the different sets of instances:

                          APFB                       APsB
                  GPUBFS     GPUBFS-WR       GPUBFS      GPUBFS-WR
                  MT    CT    MT    CT       MT     CT    MT    CT
O_S1            2.96  1.89  2.12  1.34     3.68   2.88  2.98  2.27
O_Hardest20     4.28  2.70  3.21  1.93     5.23   4.14  4.20  3.13
RCP_S1          3.66  3.24  1.13  1.05     3.52   3.33  2.22  2.14
RCP_Hardest20   7.27  5.79  3.37  2.85    12.06  10.75  8.17  7.41

SLIDE 18

Log-scaled speedup profiles

Best identified GPU algorithm and the multicore ones:

[Figure: log-scaled speedup profiles, y = P(speedup ≥ 2^x). (a) Original graphs: GPU, P-PF, P-DBFS, P-HK, PFP. (b) Permuted graphs: GPU, P-PF, P-DBFS, P-HK, HK.]

A point (x, y) reads: the probability of obtaining at least 2^x speedup is y. Speedups are with respect to the faster of the sequential algorithms (the minimum of PFP and HK). The GPU algorithm has the best overall speedup: it is faster than HK on 86% of the original graphs and faster than PFP on 76% of the permuted graphs.

SLIDE 19

Performance profiles

[Figure: performance profiles, fraction of test cases (y) vs. x. (c) Original graphs and (d) permuted graphs; curves for GPU, P-PF, P-DBFS, P-HK.]

A point (x, y) reads: with probability y, the corresponding algorithm obtains a performance at most x times worse than the best runtime. The plots clearly mark the GPU algorithm as the fastest in most cases: it obtains the best performance on 61% of the original graphs and on 74% of the permuted ones.

SLIDE 20

Overall speedup

Speedup of the GPU algorithm w.r.t. the PFP and HK algorithms:

            O_S1   O_Hardest20   RCP_S1   RCP_Hardest20
vs. PFP     4.26      5.58        3.54        9.29
vs. HK      3.61      3.96        7.40        9.62

Relative to the faster sequential algorithm, the average speedup is 3.61 on the original graphs and 3.54 on the permuted graphs; on the hardest instances, 3.96 and 9.29, respectively. The running times are also robust across repetitions: for O_S1, the ratio of the standard deviation to the average time is below 10% for 20 graphs, below 18% for 5 more, and below 47% for the remaining 3.

SLIDE 21

Concluding remarks

We presented a GPU implementation of a BFS-based maximum cardinality matching algorithm for bipartite graphs. The experiments showed that the GPU implementation is faster than the existing multicore implementations. The speedups with respect to well-known sequential implementations ranged from 0.03 to 629.19, averaging 9.29 w.r.t. the fastest sequential algorithm on the set of 20 hardest problems. Everything runs on the GPU, a device with limited memory; we are now thinking about what can be done for graphs that do not fit into a GPU.

SLIDE 22

Further information

Thank you for your attention. http://perso.ens-lyon.fr/bora.ucar

SLIDE 23

Actual running time

Running times in seconds:

                      -------- Original graphs --------   -------- Permuted graphs --------
Matrix name             GPU  P-DBFS      PFP        HK      GPU  P-DBFS      PFP        HK
roadNet-CA             0.34    0.53     0.95      2.48     0.39    1.88     3.05      4.89
delaunay_n23           0.96    1.26     2.68      1.11     0.90    5.56     3.27     14.34
coPapersDBLP           0.42    6.27     3.11      1.62     0.38    1.25     0.29      1.26
kron_g500-logn21       0.99    1.50     5.37      4.73     3.71    4.01    64.29     16.08
amazon-2008            0.11    0.18     6.11      1.85     0.41    1.37    61.32      4.69
delaunay_n24           1.98    2.41     6.43      2.22     1.86   12.84     6.92     35.24
as-Skitter             0.49    1.89     7.79      3.56     3.27    5.74   472.63     29.63
amazon0505             0.18   22.70     9.05      1.87     0.24   15.23    17.59      2.23
wikipedia-20070206     1.09    5.24    11.98      6.52     1.05    5.99     9.74      5.73
Hamrle3                1.36    2.70     0.04     12.61     3.85    7.39    37.71     57.00
hugetrace-00020        7.90  393.13    15.95     15.02     1.52    9.97     8.68     38.27
hugebubbles-00000     13.16    3.55    19.81      5.56     1.80   10.91    10.03     38.97
wb-edu                33.82    8.61     3.38     20.35    17.43   20.10     9.49     51.14
rgg_n_2_24_s0          3.68    2.25    25.40      0.12     2.20   12.50     5.72     31.78
patents                0.88    0.84    92.03     16.18     0.91    0.97   101.76     18.30
italy_osm              5.86    1.20     1.02    122.00     0.70    3.97     6.24     18.34
soc-LiveJournal1       3.32   14.35   243.91     21.16     3.73    7.14   343.94     20.71
ljournal-2008          2.37   10.30   360.31     17.66     6.90    7.58   176.69     23.45
europe_osm            57.53   11.21    14.15   1911.56     7.21   37.93    68.18    197.03
com-livejournal        4.58   22.46  2879.36     34.28     5.88   17.19   165.32     29.40

Except for six of the original graphs and another two of the permuted graphs, the GPU algorithm is faster than the best sequential algorithm. It is also faster than the multicore implementations on all but five of the original graphs.

SLIDE 24

References

Deveci, M., Kaya, K., Çatalyürek, Ü. V., Uçar, B.: A push-relabel-based maximum cardinality matching algorithm on GPUs. ICPP 2013.

Azad, A., Halappanavar, M., Rajamanickam, S., Boman, E. G., Khan, A., Pothen, A.: Multithreaded algorithms for maximum matching in bipartite graphs. In: 26th IPDPS, pp. 860–872. IEEE (2012).

Duff, I. S., Kaya, K., Uçar, B.: Design, implementation, and analysis of maximum transversal algorithms. ACM TOMS 38(2), 13 (2011).

Fagginger Auer, B., Bisseling, R.: A GPU algorithm for greedy graph matching. In: Facing the Multicore-Challenge II, pp. 108–119 (2012).

Halappanavar, M., Feo, J., Villa, O., Tumeo, A., Pothen, A.: Approximate weighted matching on emerging manycore and multithreaded architectures. Int. J. High Perform. Comput. Appl. 26(4), 413–430 (2012).

Kaya, K., Langguth, J., Manne, F., Uçar, B.: Push-relabel based algorithms for the maximum transversal problem. Comput. Oper. Res. 40(5), 1266–1275 (2012).

Vasconcelos, C., Rosenhahn, B.: Bipartite graph matching computation on GPU. In: Energy Minimization Methods in Computer Vision and Pattern Recognition, pp. 42–55 (2009).