A Novel Heterogeneous Algorithm for Multiplying Scale-Free Sparse - PowerPoint PPT Presentation

A Novel Heterogeneous Algorithm for Multiplying Scale-Free Sparse Matrices
Kiran Raj Ramamoorthy, Dip Sankar Banerjee, Kannan Srinathan and Kishore Kothapalli
C-STAR, IIIT Hyderabad

Outline
• Inspiration :: Heterogeneous Platform & Challenges
• Introduction :: Sparse Matrix-Matrix Multiplication (SPMM)
• Earlier Work :: Row-Row (K. Matam et al.)
• Our Approach :: HH-CPU
• Implementation :: Notes
• Results :: Datasets (SNAP, Synthetic …), Experiments & Discussion
• Other Approaches :: Work Queue & its variations
• Conclusion :: Future Work & References

Heterogeneous Platform
[Figure: CPU and GPU exchanging code, data, and results (Send Code, Send Data, Send Results), with data transfers between the two devices.]

Challenges
• Which portion of the input is processed by which device?
• Static partitioning of the input is a good solution for obtaining high performance on heterogeneous platforms.
• However, the compute capability of each device is different, and the performance of a device depends on the nature of the input.
• Simple/static partitioning is therefore not optimal.
• Is it possible to come up with partitioning techniques for heterogeneous platforms and applications?

Our Goal
• To propose a novel heterogeneous algorithm for sparse matrix-matrix multiplication that
• not only balances load across the heterogeneous devices of the computing platform,
• but also assigns the "right" work to the "right" processor.

Sparse Matrix
• A matrix in which most of the elements are zero.
• i.e. nnz = k * n, where nnz is the number of nonzero elements.
• Example: [Figure: example sparse matrix]

Real-World Matrices
Datasets in data mining, social network analysis, and communication networks are usually very large.

Dense Row Nature of Real-World Matrices
Viewed as graphs, these matrices are highly irregular and scale-free, with a power-law degree distribution.

Sparse Matrix-Matrix Multiplication
• Compute C = A x B, where A and B are two sparse matrices.
• Why is it hard in a heterogeneous setting?
• The sparse nature of the matrices makes it hard for programmers to exploit the CPU's cache hierarchy (tiling) to achieve performance.
• Irregular computation implies thread load imbalance, which makes the problem a poor fit for GPUs.

Row-Row Formulation
• K. Matam et al. showed that the row-row formulation of matrix multiplication outperforms the usual row-column formulation for SPMM on GPUs.
C(i, :) = ∑_{j ∈ I_i(A)} A(i, j) * B(j, :)
where I_i(A) is the set of column indices of the nonzero entries in row i of A.
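
The row-row formula can be sketched in a few lines of Python. This is a minimal illustration, not the GPU implementation referenced in the slides: the dict-of-rows storage layout, the function name spmm_row_row, and the 2 x 2 demo matrices are all assumptions made for the example.

```python
from collections import defaultdict

def spmm_row_row(a_rows, b_rows):
    """Multiply sparse matrices stored as {row: {col: value}} (assumed layout)."""
    c_rows = {}
    for i, a_row in a_rows.items():
        acc = defaultdict(int)
        # C(i, :) = sum over the nonzero columns j of row i of A(i, j) * B(j, :)
        for j, a_ij in a_row.items():
            for k, b_jk in b_rows.get(j, {}).items():
                acc[k] += a_ij * b_jk
        c_rows[i] = dict(acc)
    return c_rows

# Hypothetical 2 x 2 demo (0-based indices): A = [0 2; 1 3], B = [5 0; 1 4]
a = {0: {1: 2}, 1: {0: 1, 1: 3}}
b = {0: {0: 5}, 1: {0: 1, 1: 4}}
c = spmm_row_row(a, b)
print(c)
```

Each output row only touches the rows of B selected by the nonzeros of the corresponding row of A, so zero entries are never multiplied.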

Row-Row Formulation Example
A =
0 2 1 0
0 0 1 1
1 0 1 0
2 0 0 4
B =
2 3 4
8 0 0
0 0 6
0 7 0
A x B =
16 0 6
0 7 6
2 3 10
4 34 8
C(1, :) = 2 * [8 0 0] + 1 * [0 0 6] = [16 0 6]
C(2, :) = 1 * [0 0 6] + 1 * [0 7 0] = [0 7 6]
C(3, :) = 1 * [2 3 4] + 1 * [0 0 6] = [2 3 10]
C(4, :) = 2 * [2 3 4] + 4 * [0 7 0] = [4 34 8]

Thread Load Imbalance
[Figure: thread load imbalance]
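
To make the imbalance concrete, the sketch below counts the scalar multiplications needed for each row of C, using the matrices of the row-row example. It assumes one worker per output row, which is an assumption for illustration only; the slides do not state the thread mapping. The helper names nnz and row_work are hypothetical.

```python
# Matrices of the row-row example, stored densely for readability.
A = [[0, 2, 1, 0],
     [0, 0, 1, 1],
     [1, 0, 1, 0],
     [2, 0, 0, 4]]
B = [[2, 3, 4],
     [8, 0, 0],
     [0, 0, 6],
     [0, 7, 0]]

def nnz(row):
    """Number of nonzero entries in a dense row."""
    return sum(1 for v in row if v != 0)

def row_work(a, b):
    """Scalar multiplications per row of C in the row-row formulation."""
    b_nnz = [nnz(r) for r in b]
    # Row i of C costs nnz(B(j, :)) multiplications for every nonzero A(i, j).
    return [sum(b_nnz[j] for j, v in enumerate(a_row) if v != 0) for a_row in a]

work = row_work(A, B)
# Ratio of the busiest row to the average row (assumed one worker per row).
imbalance = max(work) / (sum(work) / len(work))
print(work, imbalance)
```

Even in this small case the rows that reference the dense first row of B cost twice as much as the others; a power-law row distribution would widen that gap.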

HH-CPU
• Classify each row of a sparse matrix as high-density or low-density. SPMM can then be written as
C = A x B
=> C = (A_H + A_L) x (B_H + B_L)
=> C = A_H x B_H + A_L x B_L + A_H x B_L + A_L x B_H
• Each multiplication above has certain properties that help us map it to the device that performs it better.

Example
A = B =
0 2 1 0
0 1 0 0
3 2 2 1
0 0 0 5
C = A x B =
3 4 2 1
0 1 0 0
6 12 7 7
0 0 0 25
A_H = B_H =
0 2 1 0
0 0 0 0
3 2 2 1
0 0 0 0
A_L = B_L =
0 0 0 0
0 1 0 0
0 0 0 0
0 0 0 5
A_H x B_H =
3 2 2 1
0 0 0 0
6 10 7 2
0 0 0 0
A_L x B_L =
0 0 0 0
0 1 0 0
0 0 0 0
0 0 0 25
A_H x B_L =
0 2 0 0
0 0 0 0
0 2 0 5
0 0 0 0
A_L x B_H =
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
A_H x B_H + A_L x B_L + A_H x B_L + A_L x B_H =
3 4 2 1
0 1 0 0
6 12 7 7
0 0 0 25
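
The decomposition in this example can be checked mechanically. The sketch below uses dense lists for readability; the helper names and the threshold of one nonzero per row are assumptions chosen so that the split matches the A_H and A_L of the example, and are not taken from the slides.

```python
A = [[0, 2, 1, 0],
     [0, 1, 0, 0],
     [3, 2, 2, 1],
     [0, 0, 0, 5]]
B = [row[:] for row in A]  # B equals A in this example

def split_rows(m, threshold):
    """Return (high, low): rows with more than `threshold` nonzeros go to `high`."""
    zero = [0] * len(m[0])
    high, low = [], []
    for row in m:
        dense = sum(1 for v in row if v != 0) > threshold
        high.append(row[:] if dense else zero[:])
        low.append(zero[:] if dense else row[:])
    return high, low

def matmul(x, y):
    """Plain dense matrix product, used only to check the identity."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]

def add(*ms):
    """Element-wise sum of equally sized matrices."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*ms)]

# Assumed threshold: rows with more than one nonzero count as high-density.
A_H, A_L = split_rows(A, threshold=1)
B_H, B_L = split_rows(B, threshold=1)
parts = [matmul(A_H, B_H), matmul(A_L, B_L), matmul(A_H, B_L), matmul(A_L, B_H)]
C = add(*parts)
print(C == matmul(A, B))
```

Because every row lands in exactly one of the two parts, A = A_H + A_L holds by construction, and the four partial products sum to A x B by distributivity.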

Phase I
• CPU, GPU :: Identify the thresholds t_A, t_B and the matrices A_H, A_L, B_H, B_L.
[Figure: A is split at threshold t_A into A_H and A_L.]
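
Once a threshold is known, classifying the rows is a single pass over the row lengths. The sketch below assumes the matrix is stored in CSR form, where row i holds row_ptr[i + 1] - row_ptr[i] nonzeros. The CSR layout, the function name classify_rows, and the threshold value are assumptions for illustration; the slides do not say how t_A and t_B are chosen.

```python
def classify_rows(row_ptr, threshold):
    """Split the row indices of a CSR matrix into (high, low) by nonzero count."""
    high, low = [], []
    for i in range(len(row_ptr) - 1):
        row_nnz = row_ptr[i + 1] - row_ptr[i]
        # Rows with more than `threshold` nonzeros are treated as high-density.
        (high if row_nnz > threshold else low).append(i)
    return high, low

# CSR row pointers of a 4-row matrix with 2, 1, 4, and 1 nonzeros per row.
row_ptr = [0, 2, 3, 7, 8]
high_rows, low_rows = classify_rows(row_ptr, threshold=1)
print(high_rows, low_rows)
```

Only the row pointer array is read, so the classification cost is linear in the number of rows and independent of the number of nonzeros.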

Phase II
• In parallel:
CPU :: Compute A_H * B_H.
GPU :: Compute A_L * B_L.
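
The structure of this phase can be sketched with two concurrent tasks. In the sketch below two Python worker threads merely stand in for the CPU and the GPU, so it shows the control flow and not the performance of a real heterogeneous run; the dense matmul helper and the variable names are assumptions, and the matrices are the A_H and A_L of the HH-CPU example.

```python
from concurrent.futures import ThreadPoolExecutor

def matmul(x, y):
    """Plain dense matrix product; a stand-in for each device's SPMM kernel."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]

A_H = [[0, 2, 1, 0], [0, 0, 0, 0], [3, 2, 2, 1], [0, 0, 0, 0]]
A_L = [[0, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 5]]
B_H, B_L = A_H, A_L  # B equals A in the example

with ThreadPoolExecutor(max_workers=2) as pool:
    cpu_task = pool.submit(matmul, A_H, B_H)  # stand-in for the CPU's share
    gpu_task = pool.submit(matmul, A_L, B_L)  # stand-in for the GPU's share
    hh = cpu_task.result()
    ll = gpu_task.result()

print(hh)
print(ll)
```

The two products share no output state, which is what allows them to be issued to different devices without synchronization until the results are combined.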
