Efficient Parallel Sparse Matrix–Vector Multiplication Using Graph and Hypergraph Partitioning

William Knottenbelt
Imperial College London
wjk@doc.ic.ac.uk
February 2015

Recommended Reading
- U.V. Çatalyürek and C. Aykanat: “Hypergraph-Partitioning-Based Decomposition for Parallel Sparse Matrix–Vector Multiplication”. IEEE Trans. on Parallel and Distributed Systems, 10(7), July 1999, pp. 673–693.
- A. Trifunovic: “Parallel Algorithms for Hypergraph Partitioning”. PhD thesis, Imperial College London, November 2005.
- J.T. Bradley, D.V. de Jager, W.J. Knottenbelt, A. Trifunovic: “Hypergraph Partitioning for Faster PageRank Computation”. Proc. EPEW 2005, pp. 155–171.
- A. Trifunovic and W.J. Knottenbelt: “A General Graph Model for Representing Exact Communication Volume in Parallel Sparse Matrix–Vector Multiplication”. Proc. ISCIS 2006, pp. 813–824.

Recommended Software Tools
- CHACO graph partitioning software: http://www.cs.sandia.gov/~bahendr/chaco.html
- PaToH hypergraph partitioning software: http://bmi.osu.edu/~umit/software.html
- METIS/ParMETIS graph partitioners and hMETIS hypergraph partitioner: http://glaros.dtc.umn.edu/gkhome/views/metis
- Parkway parallel hypergraph partitioner: http://www.doc.ic.ac.uk/~at701/parkway/

Outline
- Parallel Sparse Matrix–Vector Products
- Partitioning Objectives and Strategies
- Naïve Row-Striping
- 1D Graph Partitioning
- 1D Hypergraph Partitioning
- 2D Hypergraph Partitioning
- Comparison of Graph and Hypergraph Partitioning Techniques

Parallel Sparse Matrix–Vector Products
- Parallel sparse matrix–vector product (and similar) operations form the kernel of many parallel numerical algorithms.
- Particularly widely used in iterative algorithms for solving very large sparse systems of linear equations (e.g. Jacobi and Conjugate-Gradient Squared methods).
- The data partitioning strategy adopted (i.e. the assignment of matrix and vector elements to processors) has a major impact on performance, especially in distributed memory environments.

Partitioning Objectives and Strategies
Aim is to allocate matrix and vector elements across processors such that:
- computational load is balanced
- communication is minimised
Candidate partitioning strategies:
- random permutation applied to rows and columns with 2D checkerboard processor layout
- naïve row (or column) striping
- coarse-grained mapping of rows (or columns) and corresponding vector elements to processors using 1D graph or hypergraph-based data partitioning
- fine-grained mapping of individual non-zero matrix elements and vector elements to processors using 2D hypergraph-based partitioning

Naïve Row-Striping: Definition
- Assume an $n \times n$ sparse matrix $A$, an $n$-vector $x$ and $p$ processors.
- Simply allocate $n/p$ matrix rows and $n/p$ vector elements to each processor (assuming $p$ divides $n$ exactly).
- If $p$ does not divide $n$ exactly, allocate one extra row and one extra vector element to those processors with rank less than $n \bmod p$.
- What are the advantages and disadvantages of this scheme?
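The row-striping rule can be sketched in a few lines of Python. This is a minimal illustration, not code from the slides: the function name `row_stripe` is hypothetical, and it assumes processor ranks and row indices are 0-based and that each processor receives a contiguous block of rows.

```python
from typing import List


def row_stripe(n: int, p: int) -> List[range]:
    """Return the contiguous block of 0-based row indices owned by each rank.

    Every rank gets n // p rows; ranks below n % p get one extra row,
    matching the rule "rank less than n mod p" (assuming 0-based ranks).
    """
    base, extra = divmod(n, p)
    blocks = []
    start = 0
    for rank in range(p):
        size = base + (1 if rank < extra else 0)
        blocks.append(range(start, start + size))
        start += size
    return blocks


# 16 rows over 4 processors: four rows each.
print([list(b) for b in row_stripe(16, 4)])
# 10 rows over 4 processors: ranks 0 and 1 take the two leftover rows.
print([len(b) for b in row_stripe(10, 4)])
```

The same blocks would be used for the vector elements, so processor `k` holds `x[i]` for exactly the rows `i` it owns.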

Naïve Row-Striping: Example
[Figure: $16 \times 16$ sparse matrix $A$ (rows and columns 1–16) and vector $x$, row-striped across four processors: rows 1–4 on P1, rows 5–8 on P2, rows 9–12 on P3, rows 13–16 on P4.]

Naïve Row-Striping: Example (cont.)
Consider the layout of a $16 \times 16$ non-symmetric sparse matrix $A$ and vector $x$ onto 4 processors under a naïve row-striping scheme on the previous slide.
What is:
(a) the computational load per processor?
(b) the total comms volume per matrix–vector product?

1D Graph Partitioning: Definition
- An $n \times n$ sparse matrix $A$ can be represented as an undirected graph $G = (V, E)$.
- Each row $i$ ($1 \le i \le n$) in $A$ corresponds to vertex $v_i \in V$ in the graph.
- The (vertex) weight $w_i$ of vertex $v_i$ is the total number of non-zeros in row $i$.
- For the edge-set $E$, edge $e_{ij}$ connects vertices $v_i$ and $v_j$ with (edge) weight:
  1 if exactly one of $|a_{ij}| > 0$ or $|a_{ji}| > 0$,
  2 if both $|a_{ij}| > 0$ and $|a_{ji}| > 0$.
- Aim to partition the vertices into $p$ mutually exclusive subsets (parts) $\{P_1, P_2, \ldots, P_p\}$ such that edge-cut is minimised and load is balanced.

1D Graph Partitioning: Definition (cont.)
- An edge $e_{ij}$ is cut if the vertices which it contains are assigned to two different processors, i.e. if $v_i \in P_m$ and $v_j \in P_n$ where $m \ne n$.
- The edge-cut is the sum of the edge weights of cut edges and is an approximation for the amount of interprocessor communication. Why is it not exact?
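The graph model and the edge-cut metric defined above can be sketched as follows. This is an illustrative sketch under stated assumptions: the names `build_graph` and `edge_cut` are hypothetical, indices are 0-based, the matrix is given as a set of non-zero positions, and diagonal entries are assumed to contribute to vertex weights but not to any edge. The small $4 \times 4$ pattern is made up for the example and is not the matrix from the slides.

```python
def build_graph(nonzeros, n):
    """Build vertex weights and weighted undirected edges from a sparsity pattern.

    nonzeros: set of (i, j) pairs with a_ij != 0, 0-based.
    Returns (weights, edges) where weights[i] is the non-zero count of row i
    and edges maps (min(i, j), max(i, j)) to weight 1 or 2.
    """
    weights = [0] * n
    edges = {}
    for i, j in nonzeros:
        weights[i] += 1
        if i != j:  # assumption: diagonal entries create no edge
            key = (min(i, j), max(i, j))
            # a_ij and a_ji each add 1, so a symmetric pair gives weight 2
            edges[key] = edges.get(key, 0) + 1
    return weights, edges


def edge_cut(edges, part):
    """Sum the weights of edges whose endpoints lie in different parts."""
    return sum(w for (i, j), w in edges.items() if part[i] != part[j])


# Made-up 4 x 4 non-symmetric pattern.
nonzeros = {(0, 0), (0, 1), (1, 0), (1, 1), (1, 2), (2, 2), (2, 3), (3, 0), (3, 3)}
weights, edges = build_graph(nonzeros, 4)
part = [0, 0, 1, 1]  # rows 0, 1 on one processor; rows 2, 3 on the other
print(weights)               # non-zeros per row
print(edge_cut(edges, part))  # total weight of cut edges
```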

1D Graph Partitioning: Definition (cont.)
Let
$$W_k = \sum_{i \in P_k} w_i \quad (\text{for } 1 \le k \le p)$$
denote the weight of part $P_k$, and $\overline{W}$ denote the average part weight.
A partition is said to be balanced if:
$$(1 - \varepsilon)\,\overline{W} \le W_k \le (1 + \varepsilon)\,\overline{W}$$
for $k = 1, 2, \ldots, p$.

1D Graph Partitioning: Definition (cont.)
- Problem of finding a balanced $p$-way partition that minimises edge cut is NP-complete.
- But heuristics can often be applied to obtain good sub-optimal solutions.
- Software tools:
  - CHACO
  - METIS
  - ParMETIS
- Once partition has been computed, assign matrix row $i$ to processor $k$ if $v_i \in P_k$.

1D Graph Partitioning: Example
Consider the graph corresponding to the sparse matrix $A$ of the previous example.
Assume the graph is partitioned into four parts as follows:
$P_1 = \{v_{13}, v_7, v_{16}, v_{11}\}$
$P_2 = \{v_{15}, v_9, v_2, v_5\}$
$P_3 = \{v_{14}, v_8, v_{10}, v_4\}$
$P_4 = \{v_3, v_{12}, v_1, v_6\}$
Draw the graph representation and compute the edge cut.
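The balance criterion can be checked directly from the vertex weights and a part assignment. The sketch below is illustrative only: the names `part_weights` and `is_balanced` are hypothetical, parts are numbered from 0, and the weights and assignment are made-up values rather than data from the slides.

```python
def part_weights(vertex_weights, part, p):
    """Return W_k for each part k: the sum of the weights of its vertices."""
    totals = [0] * p
    for w, k in zip(vertex_weights, part):
        totals[k] += w
    return totals


def is_balanced(vertex_weights, part, p, eps):
    """Check (1 - eps) * W_avg <= W_k <= (1 + eps) * W_avg for every part."""
    totals = part_weights(vertex_weights, part, p)
    avg = sum(totals) / p
    return all((1 - eps) * avg <= w_k <= (1 + eps) * avg for w_k in totals)


vertex_weights = [2, 3, 2, 2]  # non-zeros per row (made-up)
part = [0, 0, 1, 1]            # vertex i belongs to part part[i]
print(part_weights(vertex_weights, part, 2))      # W_0 and W_1
print(is_balanced(vertex_weights, part, 2, 0.1))  # tight tolerance
print(is_balanced(vertex_weights, part, 2, 0.2))  # looser tolerance
```

Here the part weights are 5 and 4 with an average of 4.5, so the partition fails the criterion at $\varepsilon = 0.1$ (upper bound 4.95) and passes at $\varepsilon = 0.2$ (bounds 3.6 and 5.4).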

1D Graph Partitioning: Example (cont.)
[Figure: matrix $A$ with rows and columns permuted into the order 13, 7, 16, 11, 15, 9, 2, 5, 14, 8, 10, 4, 3, 12, 1, 6, and vector $x$ in the same order; rows 13, 7, 16, 11 on P1, rows 15, 9, 2, 5 on P2, rows 14, 8, 10, 4 on P3, rows 3, 12, 1, 6 on P4.]

1D Graph Partitioning: Example (cont.)
The row-striped layout of the sparse matrix $A$ and vector $x$ onto 4 processors under this graph-partitioning scheme is given on the previous slide.
What is:
(a) the computational load per processor?
(b) the total comms vol. per matrix–vector product? How does the comms vol. compare to the edge cut?

1D Hypergraph Partitioning: Definition
- An $n \times n$ sparse matrix $A$ can be represented as a hypergraph $H = (V, \mathcal{N})$.
- $V$ is a set of vertices and $\mathcal{N}$ is a set of nets or hyperedges. Each $n \in \mathcal{N}$ is a subset of the vertex set $V$.
- Each row $i$ ($1 \le i \le n$) in $A$ corresponds to vertex $v_i \in V$.
- Each column $j$ ($1 \le j \le n$) in $A$ corresponds to net $N_j \in \mathcal{N}$. In particular $v_i \in N_j$ iff $a_{ij} \ne 0$.
- The (vertex) weight $w_i$ of vertex $v_i$ is the total number of non-zeros in row $i$.
- Given a partition $\{P_1, P_2, \ldots, P_p\}$, the connectivity $\lambda_j$ of net $N_j$ denotes the number of different parts spanned by $N_j$. Net $N_j$ is cut iff $\lambda_j > 1$.

1D Hypergraph Partitioning: Definition (cont.)
- The cutsize or hyperedge cut of a partition is defined as:
$$\sum_{N_j \in \mathcal{N}} (\lambda_j - 1)$$
- Aim is to minimise the hyperedge cut while maintaining the balance criterion (which is same as for graphs).
- Again, problem of finding a balanced $p$-way partition that minimises the hyperedge cut is NP-complete, but heuristics can be used to find sub-optimal solutions.
- Software tools:
  - hMETIS
  - PaToH
  - Parkway
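The hypergraph model, net connectivity and hyperedge cut defined above can be sketched as follows. This is a minimal illustration with hypothetical names (`column_nets`, `connectivity`, `hyperedge_cut`), 0-based indices, and a made-up $4 \times 4$ sparsity pattern rather than the matrix from the slides. Empty nets (all-zero columns) are assumed to contribute nothing to the cut.

```python
def column_nets(nonzeros, n):
    """Build one net per column: nets[j] is the set of rows i with a_ij != 0."""
    nets = [set() for _ in range(n)]
    for i, j in nonzeros:
        nets[j].add(i)
    return nets


def connectivity(net, part):
    """lambda_j: the number of distinct parts spanned by the vertices of a net."""
    return len({part[v] for v in net})


def hyperedge_cut(nets, part):
    """Sum of (lambda_j - 1) over all non-empty nets."""
    return sum(connectivity(net, part) - 1 for net in nets if net)


# Made-up 4 x 4 non-symmetric pattern.
nonzeros = {(0, 0), (0, 1), (1, 0), (1, 1), (1, 2), (2, 2), (2, 3), (3, 0), (3, 3)}
nets = column_nets(nonzeros, 4)
part = [0, 0, 1, 1]  # vertex (row) i belongs to part part[i]
print([connectivity(net, part) for net in nets])  # lambda_j per column
print(hyperedge_cut(nets, part))                  # cutsize of this partition
```

In this example the nets for columns 0 and 2 each span both parts, so they are cut and each contributes 1 to the cutsize.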
