A Parallel Solver for Laplacian Matrices
Tristan Konolige and Jed Brown
Graph Laplacian Matrices
- Covered by other speakers (hopefully)
- Useful in a variety of areas
- Graphs are getting very big
- Facebook now has ~couple billion users
- Computer networks for cyber security
- Interested in network graphs
- Undirected
- Weighted
- We will need faster ways to solve these systems
- Note: graph Laplacians have the constant vector in their nullspace (definition below)
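For reference, the standard definition assumed throughout (stated here, not on the slides): with weighted adjacency matrix $A$ ($w_{ij} \ge 0$) and diagonal weighted-degree matrix $D$, the graph Laplacian is
$L = D - A$, i.e. $L_{ii} = \sum_{j \ne i} w_{ij}$ and $L_{ij} = -w_{ij}$ for $i \ne j$,
so every row sums to zero and $L \mathbf{1} = 0$: the constant vector lies in the nullspace.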
Why Parallelism
- Graphs are growing but single processor speed is not
- Want to process existing graphs faster or do larger network analysis
- Clock speed has stagnated
- Bandwidth increasing slowly
- Processor count/machine count growing
- Xeon Phi, etc.
- Going to look at distributed memory systems
- Most supercomputers and commodity clusters
Goals
- Parallel scalability out to large numbers of processors/nodes
- Convergence factors close to LAMG
- Interested mostly in scale-free graphs for now
Existing Solvers
- Spielman and Teng’s theoretical nearly-linear time solver
- No viable practical implementations
- Many other theoretical solvers
- Kelner solver (previous talk w/ Kevin)
- Combinatorial Multigrid from [Koutis and Miller]
- Lean Algebraic Multigrid from [Livne and Brandt]
- Degree Aware Aggregation from [Napov and Notay]
- CG with a variety of preconditioners
- Direct solvers
Multigrid
- Both CMG and LAMG are multigrid solvers
- Multilevel method for solving linear systems
- O(N) (ideally)
- Originally intended for geometric problems, now used on arbitrary matrices (see the V-cycle sketch below)
[Figure: a V-cycle: smoothing and restriction down the hierarchy, a direct solve at the coarsest level, then interpolation and smoothing back up]
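As a concrete reference, here is a minimal serial V-cycle sketch in C++. It assumes a simple CSR matrix type, weighted Jacobi smoothing, and per-level interpolation matrices P built during setup; the names Csr, Level, and vcycle are illustrative, not the solver's actual API, and a few extra Jacobi sweeps stand in for the coarsest-level direct solve.

#include <cstddef>
#include <vector>

struct Csr {                       // compressed sparse row matrix
    std::size_t n;                 // number of rows
    std::vector<std::size_t> ptr;  // row pointers (n + 1 entries)
    std::vector<std::size_t> col;  // column indices
    std::vector<double> val;       // nonzero values
};

// y = A * x
std::vector<double> spmv(const Csr& A, const std::vector<double>& x) {
    std::vector<double> y(A.n, 0.0);
    for (std::size_t i = 0; i < A.n; ++i)
        for (std::size_t k = A.ptr[i]; k < A.ptr[i + 1]; ++k)
            y[i] += A.val[k] * x[A.col[k]];
    return y;
}

// One sweep of weighted Jacobi: x <- x + omega * D^{-1} (b - A x)
void jacobi(const Csr& A, const std::vector<double>& b,
            std::vector<double>& x, double omega = 2.0 / 3.0) {
    std::vector<double> Ax = spmv(A, x);
    for (std::size_t i = 0; i < A.n; ++i) {
        double diag = 0.0;
        for (std::size_t k = A.ptr[i]; k < A.ptr[i + 1]; ++k)
            if (A.col[k] == i) diag = A.val[k];
        if (diag != 0.0) x[i] += omega * (b[i] - Ax[i]) / diag;
    }
}

struct Level {
    Csr A;  // operator on this level
    Csr P;  // interpolation from the next coarser level (unused on the coarsest)
};

// Recursive V-cycle over a prebuilt hierarchy; levels[0] is the finest.
void vcycle(const std::vector<Level>& levels, std::size_t l,
            const std::vector<double>& b, std::vector<double>& x) {
    const Level& L = levels[l];
    if (l + 1 == levels.size()) {                       // coarsest level
        for (int s = 0; s < 50; ++s) jacobi(L.A, b, x); // stand-in for a direct solve
        return;
    }
    jacobi(L.A, b, x);                                  // pre-smoothing
    std::vector<double> Ax = spmv(L.A, x);
    std::vector<double> r(L.A.n);
    for (std::size_t i = 0; i < L.A.n; ++i) r[i] = b[i] - Ax[i];
    // Restrict the residual with P^T (each row of P maps a fine node to its aggregate).
    const Csr& P = L.P;
    std::vector<double> rc(levels[l + 1].A.n, 0.0);
    for (std::size_t i = 0; i < P.n; ++i)
        for (std::size_t k = P.ptr[i]; k < P.ptr[i + 1]; ++k)
            rc[P.col[k]] += P.val[k] * r[i];
    std::vector<double> ec(levels[l + 1].A.n, 0.0);
    vcycle(levels, l + 1, rc, ec);                      // coarse-grid correction
    for (std::size_t i = 0; i < P.n; ++i)               // interpolate and correct
        for (std::size_t k = P.ptr[i]; k < P.ptr[i + 1]; ++k)
            x[i] += P.val[k] * ec[P.col[k]];
    jacobi(L.A, b, x);                                  // post-smoothing
}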
Lean Algebraic Multigrid
- Low degree elimination (the step is written out below)
- Eliminate up to degree 4
- Reduces cycle complexity
- Incredibly useful on network graphs
- Aggregation based Multigrid
- Restriction/interpolation from fine grid aggregates
- Avoids aggregating high-degree nodes
- Based on strength of connection + energy ratio
- Typically smoothed restriction/interpolation
[Livne and Brandt 2011]
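To make the elimination step concrete (standard Gaussian elimination, stated here for reference rather than taken from the slides): let $F$ be an independent set of low-degree nodes to eliminate and $C$ the remaining nodes. Partitioning the Laplacian and eliminating the $F$ block gives the operator the next level works with:
$$ L = \begin{pmatrix} L_{FF} & L_{FC} \\ L_{CF} & L_{CC} \end{pmatrix}, \qquad L_{\text{next}} = L_{CC} - L_{CF}\, L_{FF}^{-1}\, L_{FC} . $$
Because the eliminated nodes have small degree and no eliminated neighbors, $L_{FF}$ is diagonal, so its inverse is trivial and fill-in stays small, which is why this step reduces cycle complexity on network graphs.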
LAMG
- Caliber 1 interpolation (unsmoothed restriction/interpolation), sketched below
- Avoids complexity from fill-in
- Gauss-Seidel Smoothing
- Multilevel iterant recombination – adaptive energy correction
- Similar to Krylov method at every level
- O(N) empirically
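In matrix terms, caliber-1 interpolation, as I read it, is standard unsmoothed aggregation: each fine node $i$ interpolates from exactly one aggregate $\mathrm{agg}(i)$, so $P$ has a single nonzero per row (unit weight in the piecewise-constant case), and the coarse operator is the Galerkin product:
$$ P_{i,\mathrm{agg}(i)} = 1, \quad P_{ij} = 0 \ \text{otherwise}, \qquad A_{\text{coarse}} = P^{\mathsf T} A P . $$
With one nonzero per row of $P$, the product $P^{\mathsf T} A P$ only merges rows and columns of $A$, which is the "avoids complexity from fill-in" point above.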
LAMG
- Hierarchy alternates between elimination and aggregation
- First level elimination only applied once during solve
Level  Size     NNZ        Type  Time (s)  Comm Size  Imb
0      1069126  113682432  Elim  0.1180    64         1.10
1      1019470  113385358  Reg   0.7480    64         1.11
2      75493    18442801   Elim  0.0090    64         1.46
3      62072    18374722   Reg   0.0687    64         1.23
4      8447     1265927    Elim  0.0016    64         2.87
5      5153     1250659    Reg   0.0052    64         1.49
6      466      20188      Elim  0.0004    1          1.00
7      173      19125      Reg   0.0019    1          1.00
8      18       56         Elim  0.0001    1          1.00
9      3        7          Reg   0.0001    1          1.00
Implementation
- C++ and MPI
- No OpenMP for now
- CombBLAS for 2D matrix decomposition [Buluç and Gilbert 2011]
- Needed for scaling
- Helps distribute high-degree hubs
- Randomized matrix ordering
- Worse locality
- Greatly improves load balance
- Jacobi smoothing
- V-cycles
- No iterant recombination: it requires multiple dot products, which are slow in parallel
- Instead use constant correction
- Used as a CG preconditioner
- Worse than energy correction
- Orthogonalize every cycle (see the sketch below)
- Manually redistribute work if the problem gets too small
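The "orthogonalize every cycle" step is presumably the usual guard for a singular Laplacian system: keep the iterate orthogonal to the constant nullspace vector. A minimal serial sketch of that projection (the distributed version would need an MPI all-reduce for the sum):

#include <numeric>
#include <vector>

// Remove the component of x along the constant vector (the Laplacian
// nullspace), so the iterate stays in the range of the operator.
void remove_constant_component(std::vector<double>& x) {
    double mean = std::accumulate(x.begin(), x.end(), 0.0) /
                  static_cast<double>(x.size());
    for (double& xi : x) xi -= mean;
}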
Parallel Low-Degree Elimination
- The difficult case is two adjacent low-degree nodes
- Can’t eliminate both at once
- Use an SpMV to choose which low-degree nodes to eliminate (sketch below)
- Boolean vector indicating degree < 4
- Semiring is {min(hash(x), hash(y)), id}
- Can use multiple iterations to eliminate all low-degree nodes
- In practice, one iteration eliminates most low-degree nodes
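A serial illustration of the selection idea, under my reading of the slide: every node gets a hash, the "SpMV" takes the minimum hash over low-degree neighbors, and a low-degree node is eliminated this round only if its own hash beats that minimum, so two adjacent low-degree nodes are never eliminated together. The adjacency-list loop and helper names are stand-ins for the CombBLAS SpMV, not the actual implementation.

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// graph[i] lists the neighbors of node i.
using Graph = std::vector<std::vector<std::size_t>>;

// Any fixed hash works; this is a simple splitmix64-style integer mix.
std::uint64_t hash_node(std::uint64_t x) {
    x += 0x9e3779b97f4a7c15ull;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ull;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebull;
    return x ^ (x >> 31);
}

// One round of selection: marks low-degree nodes (degree <= max_degree) that
// can be eliminated together, i.e. no two marked nodes are adjacent.
std::vector<bool> select_for_elimination(const Graph& graph,
                                         std::size_t max_degree = 4) {
    std::size_t n = graph.size();
    std::vector<bool> low(n), selected(n, false);
    for (std::size_t i = 0; i < n; ++i)
        low[i] = graph[i].size() <= max_degree;

    for (std::size_t i = 0; i < n; ++i) {
        if (!low[i]) continue;
        // "SpMV": minimum hash over low-degree neighbors.
        std::uint64_t min_nbr = std::numeric_limits<std::uint64_t>::max();
        for (std::size_t j : graph[i])
            if (low[j]) min_nbr = std::min(min_nbr, hash_node(j));
        // Eliminate i only if it wins against every low-degree neighbor.
        selected[i] = hash_node(i) < min_nbr;
    }
    return selected;
}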
Parallel Aggregation
for each undecided node n:
    let s = undecided or seed neighbor with strongest connection and not full
    if s is a seed:
        aggregate n with s
    if s is undecided:
        s becomes a seed
        aggregate n with s
end
- Aggregates depend on order
Parallel Aggregation
- SpMV iterations on the strength-of-connection matrix to form aggregates
- Vector holds the status of each node: {Undecided, Aggregated, Seed, FullSeed}
- Semiring + is max (i.e. strongest connection)
- x * y is y if x == Undecided or Seed, otherwise 0
- In the resulting vector, if x found an Aggregated vertex, we aggregate. Otherwise x votes for its best connection
- Undecided nodes with enough votes are converted to seeds
- <10 iterations before every node is decided
- Cluster size is somewhat constrained
- As long as clusters have a reasonable size bound, results are fine
- We do not use energy ratios in aggregation (yet)
- Will have worse aggregates than LAMG
- LAMG uses a strength-of-connection metric (affinity) for aggregation
- Relax on Ax=0 for random x
- In our tests, algebraic distance [Safro, Sanders, Schulz 2012] performs slightly better than affinity (see the sketch below)
- 58.49% of fastest solves used algebraic distance vs. 41.51% with affinity
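A minimal sketch of the algebraic-distance idea as I understand it from [Safro, Sanders, Schulz 2012]: run a few relaxation sweeps on Ax = 0 (weighted Jacobi here) from several random starting vectors, then call two edge endpoints strongly connected when their smoothed values stay close across all test vectors. The parameter choices (sweep count, test vectors, omega) are illustrative only.

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Weighted adjacency list: adj[i] holds (neighbor, weight) pairs.
using WGraph = std::vector<std::vector<std::pair<std::size_t, double>>>;

// Per-edge algebraic distance: small value = strong connection.
std::vector<std::vector<double>>
algebraic_distance(const WGraph& adj, int num_test_vectors = 8,
                   int sweeps = 10, double omega = 0.5) {
    std::size_t n = adj.size();
    std::vector<std::vector<double>> dist(n);
    for (std::size_t i = 0; i < n; ++i)
        dist[i].assign(adj[i].size(), 0.0);

    std::mt19937 gen(0);
    std::uniform_real_distribution<double> uni(-0.5, 0.5);

    for (int t = 0; t < num_test_vectors; ++t) {
        std::vector<double> x(n);
        for (double& xi : x) xi = uni(gen);
        // Weighted-Jacobi sweeps on L x = 0: move each value toward the
        // weighted average of its neighbors.
        for (int s = 0; s < sweeps; ++s) {
            std::vector<double> next(x);
            for (std::size_t i = 0; i < n; ++i) {
                double wsum = 0.0, avg = 0.0;
                for (auto [j, w] : adj[i]) { wsum += w; avg += w * x[j]; }
                if (wsum > 0.0)
                    next[i] = (1.0 - omega) * x[i] + omega * (avg / wsum);
            }
            x.swap(next);
        }
        // Per-edge distance; taking the max over test vectors is one choice.
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t k = 0; k < adj[i].size(); ++k)
                dist[i][k] = std::max(dist[i][k],
                                      std::abs(x[i] - x[adj[i][k].first]));
    }
    return dist;
}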
Strength of Connection
[Figure: comparison of affinity and algebraic distance]
Matrix Randomization
Results
- All tests run on NERSC’s Edison
- 2x 2.4 GHz 12-core Intel "Ivy Bridge" processors per node
- Cray Aries interconnect
- 4 MPI tasks per node
- LAMG Serial implementation by [Livne and Brandt]
- In MATLAB with C mex extensions
- Solve to 1e-8 relative residual norm
- Code is not well optimized
- Interested in scaling
Convergence Factors
- Cycle complexity: nnz(all ops)/nnz(finest matrix)
- Effective Convergence Factor (ECF) ≜ (per-cycle residual reduction)^(1 / cycle complexity)
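For intuition (my own worked example, not from the slides): if one cycle reduces the residual norm by a factor of 0.25 and the cycle complexity is 2, then ECF = 0.25^(1/2) = 0.5, i.e. the residual drops by half per fine-grid-equivalent unit of work, which lets solvers with different cycle costs be compared fairly.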
Matrix            ECF Serial LAMG   ECF Our Solver   ECF Jacobi PCG
hollywood-2009    0.540             0.856            0.992
citationCiteseer  0.816             0.919            0.938
astro-ph          0.695             0.800            0.846
as-22july06       0.282             0.501            0.784
delaunay_n16      0.812             0.896            0.980
- No GS-smoothing
- No iterant recombination
- Poorer aggregates
[Figure: hollywood-2009 (1,139,905 nodes, 113,891,327 nnz): solve time (s) vs. number of nodes (4 cores per node) for the regular solve, the random-permutation solve, and serial LAMG*; 45x and 3.7x speedups annotated]
[Figure: hollywood-2009: random-permutation setup and solve times vs. number of nodes (4 cores per node), compared with serial LAMG* setup]
[Figure: europe_osm (50,912,018 rows, 108,109,320 nnz): random-permutation setup and solve times vs. number of nodes (4 cores per node)]
Conclusion & Future Work
- Distributed memory solver shows significant speedups
- Even without complex aggregation strategies
- Matrix randomization provides large benefit
- Improve aggregation with energy ratios
- Convergence still falls well short of LAMG
- Some graphs have very poor convergence rates