SLIDE 1

A Novel Heterogeneous Algorithm for Multiplying Scale-Free Sparse Matrices

Kiran Raj Ramamoorthy, Dip Sankar Banerjee, Kannan Srinathan and Kishore Kothapalli. C-STAR, IIIT Hyderabad.

SLIDE 2

Outline

  • Inspiration :: Heterogeneous Platform & Challenges
  • Introduction :: Sparse Matrix-Matrix Multiplication (SPMM)
  • Earlier Work :: Row-Row Formulation (K. Matam et al.)
  • Our Approach :: HH-CPU
  • Implementation :: Notes
  • Results :: Datasets (SNAP, Synthetic …), Experiments & Discussion
  • Other Approaches :: Work Queue & its Variations
  • Conclusion :: Future Work & References
SLIDE 3-4

Heterogeneous Platform

[Diagram: a CPU and a GPU connected by data-transfer links; the CPU sends code and data to the GPU and receives results back.]

SLIDE 5

Challenges

  • Which portion of the input is processed by which device?
  • Statically partitioning the input is a good way to obtain high performance on heterogeneous platforms.
  • However, the compute capability of each device is different, and the performance of a device depends on the nature of the input.
  • Simple/static partitioning is therefore not optimal.
  • Is it possible to come up with partitioning techniques for heterogeneous platforms and applications?

SLIDE 6

Our Goal

  • To propose a novel heterogeneous algorithm for sparse matrix-matrix multiplication that
  • not only balances load across the heterogeneous devices in the computing platform,
  • but also assigns the "right" work to the "right" processor.
SLIDE 7

Sparse Matrix

  • A matrix in which most of the elements are zero,
  • i.e., nnz = k * n for a small constant k, where n is the number of rows.
  • Example: a minimal CSR sketch follows.
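To make this concrete, here is a minimal sketch of a sparse matrix held in CSR form (the storage format used later in the implementation), using SciPy; the 4 x 4 example matrix is ours, not the slide's:

```python
import numpy as np
from scipy.sparse import csr_matrix

# A 4x4 matrix in which most elements are zero: nnz = 6 = k * n with k = 1.5.
A = csr_matrix(np.array([[0, 2, 0, 0],
                         [1, 0, 0, 3],
                         [0, 0, 0, 0],
                         [0, 4, 5, 6]]))

print(A.nnz)      # 6
print(A.data)     # nonzero values, row by row: [2 1 3 4 5 6]
print(A.indices)  # column index of each nonzero: [1 0 3 1 2 3]
print(A.indptr)   # row i's nonzeros sit in data[indptr[i]:indptr[i+1]]
```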
SLIDE 8

Real-World Matrices

Datasets in data mining, social network analysis & communication networks are usually very large.

SLIDE 9

Nature of Real-world Matrices

These graphs are highly irregular & scale-free with a power-law degree distribution.

[Figure: power-law degree distribution with a dense row marked.]

SLIDE 10

Sparse Matrix-Matrix Multiplication

  • Compute C = A x B, where A and B are two sparse matrices.
  • Why is it hard in a heterogeneous setting?
  • The sparse nature of the matrix makes it hard for programmers to exploit the CPU's cache hierarchy (tiling) to achieve performance.
  • The computation is irregular, which implies thread load imbalance and hence suits GPUs poorly.

SLIDE 11

Row-Row Formulation

  • K. Matam et al. proved that the row-row formulation of matrix multiplication outperforms the usual row-column formulation for SPMM on GPUs.

C(i, :) = Σ_{j ∈ I_i(A)} A(i, j) * B(j, :)

where I_i(A) is the set of column indices of the nonzeros in row i of A.

slide-12
SLIDE 12-20

Row-Row Formulation

Example

        [0 2 1 0]         [2 3 4]
    A = [0 0 1 1]     B = [8 0 0]
        [1 0 1 0]         [0 0 6]
        [2 0 0 4]         [0 7 0]

C(1, :) = 2 * [8 0 0] + 1 * [0 0 6] = [16 0 6]
C(2, :) = 1 * [0 0 6] + 1 * [0 7 0] = [0 7 6]
C(3, :) = 1 * [2 3 4] + 1 * [0 0 6] = [2 3 10]
C(4, :) = 2 * [2 3 4] + 4 * [0 7 0] = [4 34 8]

            [16  0  6]
    A x B = [ 0  7  6]
            [ 2  3 10]
            [ 4 34  8]
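Below is a minimal Python sketch of the row-row formulation over raw CSR arrays, checked against the worked example above. It illustrates the formulation only; it is not the Row_Row_SPMM GPU kernel of K. Matam et al.:

```python
import numpy as np
from scipy.sparse import csr_matrix

def row_row_spmm(A, B):
    """C(i,:) = sum over j in I_i(A) of A(i,j) * B(j,:), with A, B in CSR."""
    C = np.zeros((A.shape[0], B.shape[1]))
    for i in range(A.shape[0]):
        for p in range(A.indptr[i], A.indptr[i + 1]):
            j, a_ij = A.indices[p], A.data[p]
            for q in range(B.indptr[j], B.indptr[j + 1]):
                # scale row j of B by A(i,j) and accumulate into C(i,:)
                C[i, B.indices[q]] += a_ij * B.data[q]
    return C

A = csr_matrix(np.array([[0, 2, 1, 0],
                         [0, 0, 1, 1],
                         [1, 0, 1, 0],
                         [2, 0, 0, 4]]))
B = csr_matrix(np.array([[2, 3, 4],
                         [8, 0, 0],
                         [0, 0, 6],
                         [0, 7, 0]]))
print(row_row_spmm(A, B))  # rows [16 0 6], [0 7 6], [2 3 10], [4 34 8]
```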

SLIDE 21

Thread Load Imbalance

[Figure: A x B with one thread per row; a dense row makes its thread do far more work than the others.]

SLIDE 22

HH-CPU

  • Classify each row of the sparse matrix as high density or low density. We can then write SPMM as
    C = A x B
      => C = (AH + AL) x (BH + BL)
      => C = AH x BH + AL x BL + AH x BL + AL x BH
  • Each of the four products above has properties that help us map it to the device that performs it better (a sketch of the split and the identity follows this list).
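A small sketch (ours, not the paper's code) of this classification and the identity above, using SciPy; the threshold t = 1 and the random test matrices are hypothetical stand-ins:

```python
import numpy as np
from scipy.sparse import csr_matrix, diags

def split_by_row_density(M, t):
    """Split M into MH (rows with more than t nonzeros) and ML (the rest).
    Both keep M's shape and MH + ML == M, so the four products line up."""
    high = (np.diff(M.indptr) > t).astype(M.data.dtype)
    MH = diags(high) @ M       # zero out the low-density rows
    ML = diags(1 - high) @ M   # zero out the high-density rows
    return MH.tocsr(), ML.tocsr()

rng = np.random.default_rng(0)
A = csr_matrix(rng.binomial(1, 0.3, (6, 6)).astype(float))
B = csr_matrix(rng.binomial(1, 0.3, (6, 6)).astype(float))
AH, AL = split_by_row_density(A, t=1)
BH, BL = split_by_row_density(B, t=1)

C = AH @ BH + AL @ BL + AH @ BL + AL @ BH
assert np.allclose(C.toarray(), (A @ B).toarray())
```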

slide-23
SLIDE 23-32

Example

[Figure sequence: a small matrix A (with B = A) is split by row density into AH, AL and BH, BL. The partial products AH x BH, AL x BL, AH x BL, and AL x BH are formed one at a time, and their sum reproduces C = A x B.]
slide-33
SLIDE 33-37

Phase I

  • CPU, GPU :: Identify thresholds tA, tB and the matrices AH, AL, BH, BL.

[Figure sequence: the rows of A are compared against threshold tA; rows with more than tA nonzeros form AH and the remaining rows form AL. B is split into BH and BL analogously using tB.]

slide-38
SLIDE 38-45

Phase II

  • In parallel,
    CPU :: Compute AH x BH.
    GPU :: Compute AL x BL.

[Figure sequence: the CPU walks through the rows of AH, multiplying each with the corresponding rows of BH, while the GPU does the same for AL and BL; a sketch of this concurrency follows.]
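A toy sketch of the Phase II concurrency. Two assumptions: a second host thread stands in for the GPU (the paper launches a CUDA kernel), and random matrices stand in for Phase I's output:

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np
from scipy.sparse import random as sprand, diags

# Stand-in inputs; in the algorithm these come out of Phase I.
A = sprand(1000, 1000, density=0.01, format="csr")
high = (np.diff(A.indptr) > 10).astype(A.data.dtype)
AH, AL = (diags(high) @ A).tocsr(), (diags(1 - high) @ A).tocsr()
BH, BL = AH, AL  # B = A here, as in the experiments, so B splits the same way

with ThreadPoolExecutor(max_workers=2) as pool:
    fut_cpu = pool.submit(lambda: AH @ BH)  # CPU side: the dense rows
    fut_gpu = pool.submit(lambda: AL @ BL)  # GPU side in the paper; a thread here
    C_hh, C_ll = fut_cpu.result(), fut_gpu.result()
```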

slide-46
SLIDE 46-51

Phase III

  • In parallel,
    CPU :: Compute AL x BH. [Work-Queue Mode]
    GPU :: Compute AH x BL.

[Figure sequence: the GPU streams through AH x BL, while the rows of AL x BH are handed out as work units; markers show the CPU's and the GPU's start and end positions in the queue.]

slide-52
SLIDE 52

Phase IV

  • CPU, GPU :: Combine the results of Phases II & III.
  • GPU to CPU :: Transfer the partial results from the GPU to the CPU.

A x B = AH x BH + AL x BL + AH x BL + AL x BH

slide-53
SLIDE 53

Timeline Diagram

[Timeline, CPU and GPU lanes:
  Phase I   :: CPU & GPU: Mark
  Phase II  :: CPU: AH x BH; GPU: AL x BL
  Phase III :: CPU: AL x BH; GPU: AH x BL
  Phase IV  :: CPU & GPU: Merge]

slide-54
SLIDE 54

Implementation Details

  • The sparse matrices A and B are stored in CSR format.
  • We multiply A x A instead of A x B due to dataset unavailability; however, we show results for A x B on synthetically generated data.
  • For simplicity, we consider only a CPU and a GPU in the heterogeneous system.
  • Phase 1 :: The thresholds (tA, tB) are obtained empirically (a sketch follows this list).
  • Phases 2, 3 :: We use a modified version of Row_Row_SPMM, developed by K. Matam et al., for the SPMM of the partial matrices.
  • Phase 3 :: The work units of the CPU & GPU in the work-queue model are determined empirically, a priori.
  • Phase 4 :: Standard primitives such as Mark, Scan & Merge are used.
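A minimal sketch of what the empirical threshold search could look like, in the spirit of the Phase 2 / Phase 3 trade-off plots on slide 59; the single shared candidate threshold for A and B and the timing proxy are our assumptions, not the paper's procedure:

```python
import time
import numpy as np
from scipy.sparse import diags

def split(M, t):
    high = (np.diff(M.indptr) > t).astype(M.data.dtype)
    return (diags(high) @ M).tocsr(), (diags(1 - high) @ M).tocsr()

def empirical_threshold(A, B, candidates):
    """Time the Phase II + Phase III work for each candidate threshold
    and keep the one that minimizes the total."""
    best_t, best_time = None, float("inf")
    for t in candidates:
        AH, AL = split(A, t)
        BH, BL = split(B, t)
        start = time.perf_counter()
        _ = AH @ BH + AL @ BL  # Phase II products
        _ = AH @ BL + AL @ BH  # Phase III products
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best_t, best_time = t, elapsed
    return best_t

# e.g. t_best = empirical_threshold(A, B, candidates=range(5, 100, 5))
```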
slide-55
SLIDE 55

Experimental Setup

  • CPU :: Intel i7 980
  • GPU :: Nvidia Tesla K20c (Kepler)
  • CPU-GPU link :: PCI Express 2.0, supporting a data-transfer bandwidth of 8 GB/s
  • CUDA API version 4.1 for programming
slide-56
SLIDE 56

Dataset (Scale-free & α)

Matrix        Rows       NNZ         α
scircuit      170,998    958,936     3.55
webbase-1M    1,000,005  3,105,536   2.1
cop20kA       121,192    2,624,331   143.8
web-Google    916,428    5,105,039   3.75
p2p-Gnutella  62,586     147,892     48.9
ca-CondMat    23,133     186,936     3.58
roadNet-CA    1,971,281  5,533,214   133.80
internet      124,651    207,214     4.63
dblp2010      326,186    1,615,400   5.79
email-Enron   36,692     367,662     2.1
wiki-Vote     8,297      103,689     3.88
cit-Patents   3,774,768  16,518,948  3.90

[Plots: row-nnz distributions (log scale). web-Google: HD = 6834, threshold = 25. roadNet-CA: HD = 468149, threshold = 3.]

slide-57
SLIDE 57

Results :: Overall Improvement

[Chart: speedup (1.00 to 1.60) across the matrices webbase-1M, email-Enron, web-Google, scircuit, wiki-Vote, cit-Patents, ca-CondMat, internet, dblp2010, p2p-Gnutella, roadNet-CA, and cop20kA; legend buckets: < 2, 2 to 4, 4 to 5, > 5.]

Average: 25% faster

slide-58
SLIDE 58

Results :: Profiling

[Chart: time (ms, log scale, 0.0625 to 4096) spent in Phases I-IV for each matrix instance: scircuit, webbase-1M, cop20kA, web-Google, p2p-Gnutella31, ca-CondMat, roadNet-CA, internet, dblp-2010, email-Enron, wiki-Vote, cit-Patents.]

slide-59
SLIDE 59

Results :: Trade-Off

[Plots: Phase 2 time, Phase 3 time, and total time (msec) versus threshold for ca-CondMat, cop20kA, roadNet-CA, and web-Google.]

slide-60
SLIDE 60

Experiments with Synthetic Datasets

[Plot: speedup (0.9 to 1.6) versus the power-law parameter alpha (3 to 6.5) for synthetic matrices with N = 100K, 500K, and 1M rows.]

slide-61
SLIDE 61-64

Work Queue

  • Why not apply the work-queue model to the whole computation?
  • The CPU and the GPU repeatedly pick up bunches of rows (work units) until all rows are finished.
  • The time taken by the CPU and the GPU is then (almost) always equal.
  • Is it optimal?
  • No: since the rows processed by the CPU and the GPU are arbitrary, a row may land on a device it is not suited to, so the schedule is not optimal. (A sketch of this baseline follows.)

[Figure: CPU and GPU timelines of (almost) equal length; markers show the CPU's and the GPU's start and end positions in the queue.]
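A sketch of this baseline in a simplified host-only model (our assumptions: a second thread plays the GPU, and process_rows is a hypothetical callback that would multiply the claimed rows):

```python
import threading

def run_work_queue(num_rows, unit, process_rows):
    """Both devices repeatedly claim the next bunch of `unit` rows
    from a shared counter until every row is finished."""
    state = {"next": 0}
    lock = threading.Lock()

    def worker(device):
        while True:
            with lock:  # atomically claim the next work unit
                start = state["next"]
                state["next"] = min(start + unit, num_rows)
            if start >= num_rows:
                return
            process_rows(device, range(start, min(start + unit, num_rows)))

    threads = [threading.Thread(target=worker, args=(d,)) for d in ("CPU", "GPU")]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

# e.g. run_work_queue(10_000, 256, lambda device, rows: None)
```

Both workers finish at (almost) the same time, but which rows a device receives is arbitrary, which is exactly the criticism above.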

slide-65
SLIDE 65-68

Sorted Work Queue

  • Sort the rows so that nnz decreases. The CPU is better suited to the top portion of the matrix, whose dense rows can exploit its cache hierarchy, while the bottom portion is almost regular and therefore suits the GPU.
  • Again, the time taken by the CPU and the GPU is (almost) equal.
  • Is it optimal?
  • No: the amount of computation done by each thread depends primarily on the nonzeros of the corresponding rows of B, so this partitioning still leads to thread divergence inside a warp/block. (A sketch follows.)

[Figure: CPU and GPU timelines of (almost) equal length; markers show the CPU's and the GPU's start and end positions in the sorted queue.]
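A sketch of the sorted variant under the same host-only simplification (ours): row ids are sorted by decreasing nnz, the CPU claims work units from the dense front, and the GPU claims from the almost-regular back.

```python
import threading
import numpy as np

def run_sorted_work_queue(row_nnz, unit, process_rows):
    order = np.argsort(row_nnz)[::-1]  # row ids, densest rows first
    ends = {"front": 0, "back": len(order)}
    lock = threading.Lock()

    def worker(device, from_front):
        while True:
            with lock:
                if ends["front"] >= ends["back"]:
                    return
                if from_front:  # CPU: dense rows from the front
                    lo = ends["front"]
                    hi = ends["front"] = min(lo + unit, ends["back"])
                else:           # GPU: sparse rows from the back
                    hi = ends["back"]
                    lo = ends["back"] = max(hi - unit, ends["front"])
                rows = order[lo:hi]
            process_rows(device, rows)

    cpu = threading.Thread(target=worker, args=("CPU", True))
    gpu = threading.Thread(target=worker, args=("GPU", False))
    cpu.start(); gpu.start(); cpu.join(); gpu.join()
```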

slide-69
SLIDE 69

Work Queue Vs HH-CPU

[Chart: speedup (1.0 to 1.6) of HH-CPU over the unsorted and sorted work-queue variants for scircuit, webbase-1M, web-Google, ca-CondMat, internet, dblp-2010, email-Enron, wiki-Vote, cit-Patents, and the average.]

slide-70
SLIDE 70

Future Work

  • Study analytical techniques to identify the thresholds in Phase I of Algorithm HH-CPU.
  • A similar algorithm can be designed for CSRMM, which multiplies a sparse matrix A with a dense matrix B.

slide-71
SLIDE 71

References

  • D. A. Bader and K. Madduri. GTgraph: A suite of synthetic graph generators. Available at https://sdm.lbl.gov/~kamesh/software/GTgraph/
  • A. Buluc and J. R. Gilbert. Challenges and advances in parallel sparse matrix-matrix multiplication. In Proc. ICPP, pp. 503–510, 2008.
  • S. Indarapu, M. Maramreddy, and K. Kothapalli. Architecture- and workload-aware algorithms for sparse matrix-vector multiplication. Under submission, 2014.
  • K. Matam, S. Indarapu, and K. Kothapalli. Sparse matrix-matrix multiplication on modern architectures. In Proc. HiPC, 2012.
  • NVIDIA cuSPARSE Library. https://developer.nvidia.com/cusparse
  • Stanford Network Analysis Platform dataset. http://www.cise.ufl.edu/research/sparse/matrices/SNAP/

slide-72
SLIDE 72

Thank You