Sparse Matrix Partitioning, Reordering and Vector Multiplication


SLIDE 1

Sparse Matrix Partitioning, Reordering and Vector Multiplication

Albert-Jan Yzelman, Utrecht University (NL), May 2010

SLIDE 2

Sparse Matrix Partitioning, Reordering and Vector Multiplication > Bulk Synchronous Parallel

Bulk Synchronous Parallel

1. Bulk Synchronous Parallel
2. Matrix partitioning for SpMV
3. Reordering for Sequential SpMV
4. In relation to PSPIKE

SLIDE 3

Sparse Matrix Partitioning, Reordering and Vector Multiplication > Bulk Synchronous Parallel

Supercomputers...

Many different processor types: Reduced Instruction Set chips (RISC), e.g. IBM Power; Intel Itanium; x86 (your average home PC/laptop); vector (co-)processors; GPUs; stream processors; ...

SLIDE 4

Sparse Matrix Partitioning, Reordering and Vector Multiplication > Bulk Synchronous Parallel

Supercomputers...

Many different interconnects: ring; all-to-all; Ethernet; InfiniBand; cube; hierarchical; Internet; ...

SLIDE 5

Sparse Matrix Partitioning, Reordering and Vector Multiplication > Bulk Synchronous Parallel

Programming Model

One unifying view: bridging models, such as the Message Passing Interface (MPI) and Bulk Synchronous Parallel (BSP).

Leslie G. Valiant, A bridging model for parallel computation, Communications of the ACM, Volume 33 (1990), pp. 103–111

SLIDE 6

Sparse Matrix Partitioning, Reordering and Vector Multiplication > Bulk Synchronous Parallel

Bulk Synchronous Parallel

A BSP computer: consists of P processors, each with local memory; executes a Single Program on Multiple Data (SPMD); performs no communication during calculation; communicates only during barrier synchronisation.

SLIDE 7

Sparse Matrix Partitioning, Reordering and Vector Multiplication > Bulk Synchronous Parallel

Superstep 0 Sync Superstep 1

SLIDE 8

Sparse Matrix Partitioning, Reordering and Vector Multiplication > Bulk Synchronous Parallel

Bulk Synchronous Parallel

A BSP computer furthermore: has homogeneous processors, each able to do r flops per second; takes l time to synchronise; has a communication speed of g. The model thus uses only four parameters (P, r, l, g).

SLIDE 9

Sparse Matrix Partitioning, Reordering and Vector Multiplication > Bulk Synchronous Parallel

Bulk Synchronous Parallel

A BSP algorithm can (using BSPlib, BSPonMPI):
- ask for some environment variables: bsp_nprocs(), bsp_pid()
- synchronise: bsp_sync()
- perform "direct" remote memory access (DRMA): bsp_put(source, dest, dest_PID), bsp_get(source, source_PID, dest)
- send messages, bulk-synchronously (BSMP): bsp_send(data, dest_PID), bsp_move()

SLIDE 10

Sparse Matrix Partitioning, Reordering and Vector Multiplication > Bulk Synchronous Parallel

Exercise

Design a parallel inner-product calculation:

double spmd_ip( double *x, double *y, int length ) {
    double sum = ...;
    ...
    return sum;
}

Using: bsp_nprocs(), bsp_pid(), bsp_put(...)

SLIDE 11

Sparse Matrix Partitioning, Reordering and Vector Multiplication > Bulk Synchronous Parallel

Exercise

Design a parallel inner-product calculation:

double spmd_ip( double *x, double *y, int length ) {
    int i;
    double sum = 0.0;
    double res[ bsp_nprocs() ]; /* assumed registered for DRMA */
    for( i = 0; i < length; i++ )
        sum += x[i] * y[i];
    for( i = 0; i < bsp_nprocs(); i++ )
        bsp_put( &sum, &res[ bsp_pid() ], i );
    bsp_sync();
    sum = 0.0; /* reset, or the local part would be counted twice */
    for( i = 0; i < bsp_nprocs(); i++ )
        sum += res[i];
    return sum;
}
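The pattern above can be checked without a BSP runtime. The sketch below simulates the P "processors" sequentially in Python; the function name and the block distribution are illustrative, not part of BSPlib:

```python
def spmd_ip_simulated(x, y, P):
    """Split x, y over P simulated processors, compute local partial sums,
    emulate the all-to-all put plus sync, then reduce locally on each one."""
    n = len(x)
    # block distribution of the vectors over the P processors
    blocks = [range(s * n // P, (s + 1) * n // P) for s in range(P)]
    local = [sum(x[i] * y[i] for i in blocks[s]) for s in range(P)]
    # superstep boundary: every processor now holds all partial sums
    res = list(local)
    # every processor reduces the same array, so all return the same value
    return [sum(res) for _ in range(P)]

print(spmd_ip_simulated([1.0, 2.0, 3.0, 4.0], [1.0, 1.0, 1.0, 1.0], P=2))
# [10.0, 10.0]
```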

SLIDE 12

Sparse Matrix Partitioning, Reordering and Vector Multiplication > Bulk Synchronous Parallel

BSP cost model: let $w_i^{(s)}$ be the work to be done by processor s in superstep i, let $r_i^{(s)}$ be the communication received by processor s between supersteps i and i + 1, and let $t_i^{(s)}$ be the communication transmitted by processor s. Furthermore, let T be the total number of supersteps, and let the communication bound of superstep i be given by

$$c_i = \max\left( \max_{s \in [0,P-1]} r_i^{(s)},\ \max_{s \in [0,P-1]} t_i^{(s)} \right).$$

Similarly, the upper bound for the amount of work is $w_i = \max_{s \in [0,P-1]} w_i^{(s)}$. Then the cost of a BSP algorithm is given by:

$$\sum_{i=0}^{T} w_i + \sum_{i=0}^{T-1} (l + g \cdot c_i)$$
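The cost formula can be evaluated mechanically. A small Python sketch, under the slide's convention that a synchronisation follows every superstep except the last:

```python
def bsp_cost(work, recv, sent, l, g):
    """BSP cost: sum over supersteps of max-work w_i, plus (l + g * c_i)
    at every superstep boundary. work/recv/sent are lists with one entry
    per superstep, each a per-processor list."""
    total = 0
    for i in range(len(work)):
        total += max(work[i])                    # w_i: slowest processor
        if i < len(work) - 1:                    # sync after each superstep
            c = max(max(recv[i]), max(sent[i]))  # but the last one
            total += l + g * c
    return total

# Two supersteps on P = 2: 10 resp. 8 flops, at most 3 words exchanged.
print(bsp_cost([[10, 8], [5, 5]], [[3, 2], [0, 0]], [[2, 3], [0, 0]],
               l=20, g=4))
# 10 + (20 + 4*3) + 5 = 47
```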

SLIDE 13

Sparse Matrix Partitioning, Reordering and Vector Multiplication > Bulk Synchronous Parallel

Exercise

Now think about how to do sparse matrix–vector multiplication (SpMV) using BSP.

SLIDE 14

Sparse Matrix Partitioning, Reordering and Vector Multiplication > Matrix partitioning for SpMV

Matrix partitioning for SpMV

1. Bulk Synchronous Parallel
2. Matrix partitioning for SpMV
3. Reordering for Sequential SpMV
4. In relation to PSPIKE

SLIDE 15

Sparse Matrix Partitioning, Reordering and Vector Multiplication > Matrix partitioning for SpMV

Sparse matrix, dense vector multiplication

y = Ax: for each nonzero k of A, add k.value · x[k.column] to y[k.row]
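In code, with A given as (row, column, value) triplets; a minimal Python sketch of the kernel:

```python
def spmv(nonzeros, x, m):
    """y = A x with A stored as (row, col, value) triplets; m = #rows."""
    y = [0.0] * m
    for i, j, v in nonzeros:
        y[i] += v * x[j]
    return y

A = [(0, 0, 4.0), (0, 1, 1.0), (1, 1, 2.0)]   # a small example matrix
print(spmv(A, [1.0, 2.0], m=2))               # [6.0, 4.0]
```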

SLIDE 16

Sparse Matrix Partitioning, Reordering and Vector Multiplication > Matrix partitioning for SpMV

Sparse matrix, dense vector multiplication (parallel)

Step 1 (fan-out): Not all processors have the elements from x they need; processors need to get the missing items.

SLIDE 17

Sparse Matrix Partitioning, Reordering and Vector Multiplication > Matrix partitioning for SpMV

Sparse matrix, dense vector multiplication (parallel)

Step 2: use received elements from x for multiplication. Step 3 (fan-in): write local results to the correct processors; here, y is distributed cyclically, obviously a bad choice

SLIDE 18

Sparse Matrix Partitioning, Reordering and Vector Multiplication > Matrix partitioning for SpMV

Sparse matrix, dense vector multiplication

The algorithm:

for all nonzeroes k of A:
    if x[k.j] and y[k.i] are local:
        add k.v · x[k.j] to y[k.i]
    if x[k.j] is not local:
        get x[k.j] from the responsible processor
    if y[k.i] is not local:
        set locally y[k.i] = 0
synchronise
for all nonzeroes k of A not yet processed:
    add k.v · x[k.j] to y[k.i]
send all local copies y[i] to the responsible processor
synchronise
for all incoming pairs (i, v) (such that v = y[i]):
    add v to y[i]
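The three supersteps (fan-out, local multiply, fan-in) can be sketched sequentially. All names below are illustrative; a real BSP code would use bsp_get/bsp_put with bsp_sync at the superstep boundaries:

```python
def parallel_spmv(nonzeros_per_proc, x, m):
    """nonzeros_per_proc[p] holds processor p's (i, j, value) triplets."""
    P = len(nonzeros_per_proc)
    # superstep 1 (fan-out): each processor fetches the x entries it needs
    local_x = [{j: x[j] for (_, j, _) in nonzeros_per_proc[p]}
               for p in range(P)]
    # superstep 2: multiply locally, accumulating partial row sums
    partial = [{} for _ in range(P)]
    for p in range(P):
        for (i, j, v) in nonzeros_per_proc[p]:
            partial[p][i] = partial[p].get(i, 0.0) + v * local_x[p][j]
    # superstep 3 (fan-in): partial sums are sent to the owner of y[i]
    y = [0.0] * m
    for p in range(P):
        for i, contribution in partial[p].items():
            y[i] += contribution
    return y

# 2x2 example, one row per processor: A = [[1, 2], [3, 4]], x = [1, 1].
print(parallel_spmv([[(0, 0, 1.0), (0, 1, 2.0)],
                     [(1, 0, 3.0), (1, 1, 4.0)]], [1.0, 1.0], m=2))
# [3.0, 7.0]
```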

SLIDE 19

Sparse Matrix Partitioning, Reordering and Vector Multiplication > Matrix partitioning for SpMV

In the case of sparse matrix–vector multiplication,

what causes the communication?

nonzeroes on the same row distributed to different processors: fan-out communication nonzeroes on the same column distributed to different processors: fan-in communication

(Assuming the vector distribution of x/y is such that x[i] is distributed to the same processor as at least one entry of the ith column/row of A.)

SLIDE 20

Sparse Matrix Partitioning, Reordering and Vector Multiplication > Matrix partitioning for SpMV

In the case of sparse matrix–vector multiplication,

what causes the communication?

nonzeroes on the same row distributed to different processors: fan-out communication nonzeroes on the same column distributed to different processors: fan-in communication

(Assuming the vector distribution of x/y is such that x[i] is distributed to the same processor as at least one entry of the ith column/row of A.)

Easy solution: distribute all nonzeroes to the same processor

SLIDE 21

Sparse Matrix Partitioning, Reordering and Vector Multiplication > Matrix partitioning for SpMV

Load balancing

For each superstep i, let

$$\bar{w}_i = \frac{1}{P} \sum_{s \in [0,P-1]} w_i^{(s)}$$

be the average workload. We demand that $w_i^{(s)} \le (1 + \epsilon)\,\bar{w}_i$, where $\epsilon$ is the maximum load-imbalance parameter.

SLIDE 22

Sparse Matrix Partitioning, Reordering and Vector Multiplication > Matrix partitioning for SpMV

Communication balancing

Recall that the communication part is assumed to depend on the maximum incoming or outgoing volume over all processors; for each superstep, we minimise

$$c_i = \max\left( \max_{s \in [0,P-1]} r_i^{(s)},\ \max_{s \in [0,P-1]} t_i^{(s)} \right),$$

not the total communication volume $\sum_s \left( r_i^{(s)} + t_i^{(s)} \right)$.

SLIDE 23

Sparse Matrix Partitioning, Reordering and Vector Multiplication > Matrix partitioning for SpMV

“Shared” columns: communication during fan-out

(figure: matrix partitioned over two processors; orderings 1 2 4 8 3 5 7 6 and 1 2 3 4 5 6 7 8)

Column-net model; a cut net means a shared column

SLIDE 24

Sparse Matrix Partitioning, Reordering and Vector Multiplication > Matrix partitioning for SpMV

“Shared” columns: communication during fan-out

(figure: partitioned matrix; orderings 1 2 3 4 5 6 7 8 and 2 4 1 6 8 3 7 5)

Column-net model; a cut net means a shared column

SLIDE 25

Sparse Matrix Partitioning, Reordering and Vector Multiplication > Matrix partitioning for SpMV

“Shared” columns: communication during fan-out

(figure: partitioned matrix; orderings 1 2 3 4 5 6 7 8 and 1 6 3 7 5 8 2 4)

Column-net model; a cut net means a shared column

SLIDE 26

Sparse Matrix Partitioning, Reordering and Vector Multiplication > Matrix partitioning for SpMV

“Shared” rows: communication during fan-in

(figure: partitioned matrix; orderings 3 1 8 2 5 7 6 4 and 1 2 3 4 5 6 7 8)

Row-net model; a cut net means a shared row

SLIDE 27

Sparse Matrix Partitioning, Reordering and Vector Multiplication > Matrix partitioning for SpMV

“Catch” all communication:

(figure: partitioned matrix; orderings 2 3 4 8 7 5 6 1 and 2 1 4 3 5 6 7 8)

Fine-grain model; a cut net means either a shared row or column

SLIDE 30

Sparse Matrix Partitioning, Reordering and Vector Multiplication > Matrix partitioning for SpMV

Emerging partitioning strategy: Model the sparse matrix using a hypergraph. Partition the vertices of that hypergraph (in two). Criteria: load balance (max. $\epsilon$ imbalance) plus minimising

$$\sum_i (\lambda_i - 1),$$

where $\lambda_i$ equals the number of parts spanned by the ith net: the connectivity of the ith net. Alternative (cut-net) metric:

$$\sum_i \begin{cases} 1, & \text{if } \lambda_i > 1,\\ 0, & \text{otherwise.} \end{cases}$$
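The connectivity metric is straightforward to compute for a given partitioning; a small Python sketch (the net and part encodings are illustrative):

```python
def connectivity_volume(nets, part):
    """Sum of (lambda_i - 1) over all nets. nets maps a net id to its
    vertex list; part maps each vertex to its part id. For column nets
    this is exactly the fan-out communication volume."""
    total = 0
    for verts in nets.values():
        lam = len({part[v] for v in verts})   # parts spanned by this net
        total += lam - 1
    return total

nets = {0: ['a', 'b'], 1: ['b', 'c', 'd'], 2: ['d']}
part = {'a': 0, 'b': 0, 'c': 1, 'd': 1}
print(connectivity_volume(nets, part))   # only net 1 is cut: volume 1
```

The cut-net alternative would instead count the nets with lambda greater than one.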

SLIDE 31

Sparse Matrix Partitioning, Reordering and Vector Multiplication > Matrix partitioning for SpMV

Emerging partitioning strategy: Model the sparse matrix using a hypergraph Partition the vertices of that hypergraph (in two)

Kernighan & Lin, An efficient heuristic procedure for partitioning graphs, Bell Systems Technical Journal 49 (1970), pp. 291–307.
Fiduccia & Mattheyses, A linear-time heuristic for improving network partitions, Proceedings of the 19th IEEE Design Automation Conference (1982), pp. 175–181.
Çatalyürek & Aykanat, Hypergraph-partitioning-based decomposition for parallel sparse-matrix vector multiplication, IEEE Transactions on Parallel and Distributed Systems 10 (1999), pp. 673–693.

SLIDE 32

Sparse Matrix Partitioning, Reordering and Vector Multiplication > Matrix partitioning for SpMV

Emerging partitioning strategy: Model the sparse matrix using a hypergraph Partition the vertices of the hypergraph (in two) Original Mondriaan: both row- and column-net, and choose best

SLIDE 33

Sparse Matrix Partitioning, Reordering and Vector Multiplication > Matrix partitioning for SpMV

Emerging partitioning strategy: Model the sparse matrix using a hypergraph Partition the vertices of the hypergraph (in two) Recursively keep partitioning the vertex parts

SLIDE 36

Sparse Matrix Partitioning, Reordering and Vector Multiplication > Matrix partitioning for SpMV

Mondriaan:

Model the sparse matrix using a hypergraph Partition the vertices of the hypergraph (in two) Recursively keep partitioning the vertex parts

Brendan Vastenhouw and Rob H. Bisseling, A Two-Dimensional Data Distribution Method for Parallel Sparse Matrix-Vector Multiplication, SIAM Review, Vol. 47, No. 1 (2005), pp. 67–95.

SLIDE 37

Sparse Matrix Partitioning, Reordering and Vector Multiplication > Matrix partitioning for SpMV

Mondriaan:

Model the sparse matrix using a hypergraph Partition the vertices of the hypergraph (in two) Recursively keep partitioning the vertex parts Partition the vector elements

(figure: doubly-partitioned matrix with distributed input and output vectors)

Rob H. Bisseling and Wouter Meesen, Communication balancing in parallel sparse matrix-vector multiplication, Electronic Transactions on Numerical Analysis, Vol. 21 (2005) pp. 47-65

SLIDE 38

Sparse Matrix Partitioning, Reordering and Vector Multiplication > Reordering for Sequential SpMV

Reordering for Sequential SpMV

1. Bulk Synchronous Parallel
2. Matrix partitioning for SpMV
3. Reordering for Sequential SpMV
4. In relation to PSPIKE

SLIDE 39

Sparse Matrix Partitioning, Reordering and Vector Multiplication > Reordering for Sequential SpMV

Naive cache

k = 1, modulo-mapped cache: a memory line from RAM with start address x is stored in cache line number x mod L.

(figure: main memory (RAM) mapped modulo onto the cache lines)

SLIDE 40

Sparse Matrix Partitioning, Reordering and Vector Multiplication > Reordering for Sequential SpMV

’Ideal’ cache

Instead of using a naive modulo mapping, we use a smarter policy. Take k = L = 4 with a 'Least Recently Used' (LRU) policy; the cache stack (most recent first) after each request:

Req. x1, ..., x4: x4 x3 x2 x1
Req. x2: x2 x4 x3 x1
Req. x5: x5 x2 x4 x3
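The LRU traces above can be reproduced with a few lines of Python (a fully associative cache of k lines; names are illustrative):

```python
def lru_trace(requests, k):
    """Fully associative cache of k lines with LRU replacement;
    returns the stack (most recently used first) after all requests."""
    stack = []
    for x in requests:
        if x in stack:
            stack.remove(x)     # hit: move to the front
        elif len(stack) == k:
            stack.pop()         # miss on a full cache: evict the LRU line
        stack.insert(0, x)
    return stack

print(lru_trace(['x1', 'x2', 'x3', 'x4'], k=4))                # ['x4', 'x3', 'x2', 'x1']
print(lru_trace(['x1', 'x2', 'x3', 'x4', 'x2'], k=4))          # ['x2', 'x4', 'x3', 'x1']
print(lru_trace(['x1', 'x2', 'x3', 'x4', 'x2', 'x5'], k=4))    # ['x5', 'x2', 'x4', 'x3']
```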

SLIDE 41

Sparse Matrix Partitioning, Reordering and Vector Multiplication > Reordering for Sequential SpMV

Realistic cache

1 < k < L, combining modulo-mapping and the LRU policy

(figure: main memory (RAM) mapped modulo onto subcaches; each subcache is an LRU stack)

SLIDE 42

Sparse Matrix Partitioning, Reordering and Vector Multiplication > Reordering for Sequential SpMV

Multilevel caches

(figure: CPU connected to main memory (RAM) through the L1 and L2 caches)

Cache sizes and associativities (S = size, k = associativity):

        Intel Core2 (Q6600)     AMD Phenom II (945e)
L1:     S = 32kB,  k = 8        S = 64kB,   k = 2
L2:     S = 4MB,   k = 16       S = 512kB,  k = 8
L3:     (none)                  S = 6MB,    k = 48

(Slides 43–47: animation frames building up the cache state; the final state is on slide 48.)
SLIDE 48

Sparse Matrix Partitioning, Reordering and Vector Multiplication > Reordering for Sequential SpMV

The dense case

Dense matrix–vector multiplication:

$$\begin{pmatrix} a_{00} & a_{01} & a_{02} & a_{03}\\ a_{10} & a_{11} & a_{12} & a_{13}\\ a_{20} & a_{21} & a_{22} & a_{23}\\ a_{30} & a_{31} & a_{32} & a_{33} \end{pmatrix} \cdot \begin{pmatrix} x_0\\ x_1\\ x_2\\ x_3 \end{pmatrix} = \begin{pmatrix} y_0\\ y_1\\ y_2\\ y_3 \end{pmatrix}$$

Example with k = L = 2; the cache stack (most recent first) after each access:

x0 ⇒ a00 x0 ⇒ y0 a00 x0 ⇒ x1 y0 a00 x0 ⇒ a01 x1 y0 a00 x0 ⇒ y0 a01 x1 a00 x0

SLIDE 49

Sparse Matrix Partitioning, Reordering and Vector Multiplication > Reordering for Sequential SpMV

The sparse case

Standard data structure: Compressed Row Storage (CRS)

(Slides 50–53: animation frames; the final state is on slide 54.)
SLIDE 54

Sparse Matrix Partitioning, Reordering and Vector Multiplication > Reordering for Sequential SpMV

The sparse case

Sparse matrix–vector multiplication (SpMV); the cache stack after each access:

x? ⇒ a0? x? ⇒ y0 a0? x? ⇒ x? y0 a0? x? ⇒ a?? x? y0 a0? x? ⇒ y? a?? x? y? a0? x?

We cannot predict memory accesses in the sparse case!

SLIDE 55

Sparse Matrix Partitioning, Reordering and Vector Multiplication > Reordering for Sequential SpMV

Other storage schemes

Starting point, plain CRS:

$$A = \begin{pmatrix} 4 & 1 & 3 & 0\\ 0 & 0 & 2 & 3\\ 1 & 0 & 0 & 2\\ 7 & 0 & 1 & 1 \end{pmatrix}$$

Stored as:

nzs: [4 1 3 2 3 1 2 7 1 1]
col: [0 1 2 2 3 0 3 0 2 3]
row: [0 3 5 7 10]

2·nnz + (m + 1) accesses.
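The CRS triple-array layout can be exercised directly; a minimal Python sketch of the kernel, using the slide's arrays:

```python
def crs_spmv(nzs, col, row, x):
    """y = A x for A in Compressed Row Storage: row[i] points to the
    start of row i's nonzeros in nzs/col."""
    m = len(row) - 1
    y = [0.0] * m
    for i in range(m):
        for k in range(row[i], row[i + 1]):
            y[i] += nzs[k] * x[col[k]]
    return y

# The matrix from the slide.
nzs = [4, 1, 3, 2, 3, 1, 2, 7, 1, 1]
col = [0, 1, 2, 2, 3, 0, 3, 0, 2, 3]
row = [0, 3, 5, 7, 10]
print(crs_spmv(nzs, col, row, [1, 1, 1, 1]))   # [8.0, 5.0, 3.0, 9.0]
```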

SLIDE 56

Sparse Matrix Partitioning, Reordering and Vector Multiplication > Reordering for Sequential SpMV

Incremental CRS

$$A = \begin{pmatrix} 4 & 1 & 3 & 0\\ 0 & 0 & 2 & 3\\ 1 & 0 & 0 & 2\\ 7 & 0 & 1 & 1 \end{pmatrix}$$

Stored as:

nzs: [4 1 3 2 3 1 2 7 1 1]
col increment: [0 1 1 4 1 1 3 1 2 1]
row increment: [0 1 1 1]

2·nnz + m accesses.

Note: the same number of accesses as plain CRS, but the SpMV requires fewer instructions.

Reference: Joris Koster, Parallel templates for numerical linear algebra, a high-performance computation library (Master’s Thesis), 2002.
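One possible reading of the increment convention (column increments accumulate; an overflow past the n columns consumes the next row increment) reproduces the arrays on this slide. This interpretation is a reconstruction for illustration, not taken verbatim from the thesis:

```python
def icrs_spmv(nzs, col_inc, row_inc, n, x, m):
    """y = A x for Incremental CRS. Column indices are stored as
    increments; an overflow past n columns signals a row jump."""
    y = [0.0] * m
    i = row_inc[0]          # first row holding a nonzero
    r = 1                   # next row increment to consume
    j = 0
    for k in range(len(nzs)):
        j += col_inc[k]
        if j >= n:          # column overflow: advance to the next row
            j -= n
            i += row_inc[r]
            r += 1
        y[i] += nzs[k] * x[j]
    return y

# Same matrix as on the slide; same result as plain CRS.
print(icrs_spmv([4, 1, 3, 2, 3, 1, 2, 7, 1, 1],
                [0, 1, 1, 4, 1, 1, 3, 1, 2, 1],
                [0, 1, 1, 1], 4, [1, 1, 1, 1], 4))
# [8.0, 5.0, 3.0, 9.0]
```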

SLIDE 57

Sparse Matrix Partitioning, Reordering and Vector Multiplication > Reordering for Sequential SpMV

Blocked CRS

$$A = \begin{pmatrix} 4 & 1 & 3 & 0\\ 0 & 0 & 2 & 3\\ 1 & 0 & 0 & 2\\ 7 & 0 & 1 & 1 \end{pmatrix},$$

with dense blocks: 4, 1, 3 / 2, 3 / 1 / 2 / 7, 0, 1, 1. Stored as:

nzs: [4 1 3 2 3 1 2 7 0 1 1]
blk: [0 3 5 6 7 11]
col: [0 2 0 3 0]
row: [0 1 2 4 5]

nnz + (2·nblk + 1) + (m + 1) accesses.

Reference: Pinar and Heath, Improving Performance of Sparse Matrix-Vector Multiplication, 1999.
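A sketch of the blocked kernel, interpreting the arrays above as: blk brackets each dense run in nzs, col gives each block's first column, and row points to each row's blocks (this reading is inferred from the example, not quoted from the paper):

```python
def bcrs_spmv(nzs, blk, col, row, x):
    """y = A x for Blocked CRS: block b covers nzs[blk[b]:blk[b+1]],
    a dense run starting at column col[b]; row[i] indexes row i's blocks."""
    m = len(row) - 1
    y = [0.0] * m
    for i in range(m):
        for b in range(row[i], row[i + 1]):
            for t in range(blk[b], blk[b + 1]):
                y[i] += nzs[t] * x[col[b] + (t - blk[b])]
    return y

nzs = [4, 1, 3, 2, 3, 1, 2, 7, 0, 1, 1]
blk = [0, 3, 5, 6, 7, 11]
col = [0, 2, 0, 3, 0]
row = [0, 1, 2, 4, 5]
print(bcrs_spmv(nzs, blk, col, row, [1, 1, 1, 1]))   # [8.0, 5.0, 3.0, 9.0]
```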

SLIDE 58

Sparse Matrix Partitioning, Reordering and Vector Multiplication > Reordering for Sequential SpMV

Fractal datastructures (triplets)

$$A = \begin{pmatrix} 4 & 1 & 0 & 2\\ 0 & 2 & 0 & 3\\ 1 & 0 & 0 & 2\\ 7 & 0 & 1 & 0 \end{pmatrix}$$

Stored as (triplets in Hilbert order):

nzs: [7 1 4 1 2 2 3 2 1]
i: [3 2 0 0 1 0 1 2 3]
j: [0 0 0 1 1 3 3 3 2]

3·nnz accesses (3 per nonzero).

Reference: Haase, Liebmann and Plank, A Hilbert-Order Multiplication Scheme for Unstructured Sparse Matrices, 2005.

SLIDE 59

Sparse Matrix Partitioning, Reordering and Vector Multiplication > Reordering for Sequential SpMV

Improving cache–use

We adapt two things: the sparse matrix storage and the sparse matrix structure. First, storage: from CRS to "zig-zag" CRS.

SLIDE 60

Sparse Matrix Partitioning, Reordering and Vector Multiplication > Reordering for Sequential SpMV

Zig-zag CRS

$$A = \begin{pmatrix} 4 & 1 & 3 & 0\\ 0 & 0 & 2 & 3\\ 1 & 0 & 0 & 2\\ 7 & 0 & 1 & 1 \end{pmatrix}$$

Stored as:

nzs: [4 1 3 3 2 1 2 1 1 7]
col: [0 1 2 3 2 0 3 3 2 0]
row: [0 3 5 7 10]

2·nnz + (m + 1) accesses.

Reference: Yzelman and Bisseling, Cache-oblivious sparse matrix-vector multiplication by using sparse matrix partitioning methods, SISC, 2009.
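Building the zig-zag arrays is a small variation on CRS construction: even rows are stored left-to-right, odd rows right-to-left, so the multiply kernel walks x back and forth. A Python sketch that reproduces the arrays above:

```python
def to_zigzag_crs(A):
    """Build zig-zag CRS arrays from a dense matrix given as a list of
    rows: even rows left-to-right, odd rows right-to-left. The SpMV
    kernel itself is identical to plain CRS."""
    nzs, col, row = [], [], [0]
    for i, r in enumerate(A):
        js = [j for j, v in enumerate(r) if v != 0]
        for j in (js if i % 2 == 0 else reversed(js)):
            nzs.append(r[j])
            col.append(j)
        row.append(len(nzs))
    return nzs, col, row

A = [[4, 1, 3, 0], [0, 0, 2, 3], [1, 0, 0, 2], [7, 0, 1, 1]]
nzs, col, row = to_zigzag_crs(A)
print(nzs)   # [4, 1, 3, 3, 2, 1, 2, 1, 1, 7]
print(col)   # [0, 1, 2, 3, 2, 0, 3, 3, 2, 0]
print(row)   # [0, 3, 5, 7, 10]
```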

SLIDE 61

Sparse Matrix Partitioning, Reordering and Vector Multiplication > Reordering for Sequential SpMV

Improving cache–use

We adapt two things: the sparse matrix storage and the sparse matrix structure.

SLIDE 62

Sparse Matrix Partitioning, Reordering and Vector Multiplication > Reordering for Sequential SpMV

Now assume the sparse matrix is in “Separated Block Diagonal” (SBD) form:

(Slides 63–64: animation frames counting cache misses per block; totals on slide 65.)
SLIDE 65

Sparse Matrix Partitioning, Reordering and Vector Multiplication > Reordering for Sequential SpMV

(figure: SBD-reordered matrix, annotated per block: no cache misses, 1 cache miss per row, 3 cache misses, 1 cache miss, 7 cache misses, 15 cache misses)

SLIDE 66

Sparse Matrix Partitioning, Reordering and Vector Multiplication > Reordering for Sequential SpMV

Summarising... If

1. the matrix is stored in zig-zag CRS format,
2. each SBD block corresponds to one vertex,
3. each row corresponds to one net (containing a vertex if the corresponding submatrix contains a nonzero),

then the number of cache misses is given by $\sum_i (\lambda_i - 1)$, with $\lambda_i$ the connectivity of the ith row.

SLIDE 67

Sparse Matrix Partitioning, Reordering and Vector Multiplication > Reordering for Sequential SpMV

Using Mondriaan

Input

SLIDE 68

Sparse Matrix Partitioning, Reordering and Vector Multiplication > Reordering for Sequential SpMV

Using Mondriaan

Column partitioning (force row-net model)

SLIDE 69

Sparse Matrix Partitioning, Reordering and Vector Multiplication > Reordering for Sequential SpMV

Using Mondriaan

Permute columns

SLIDE 70

Sparse Matrix Partitioning, Reordering and Vector Multiplication > Reordering for Sequential SpMV

Using Mondriaan

Identify conflicting rows

SLIDE 71

Sparse Matrix Partitioning, Reordering and Vector Multiplication > Reordering for Sequential SpMV

Using Mondriaan

Permute rows

SLIDE 72

Sparse Matrix Partitioning, Reordering and Vector Multiplication > Reordering for Sequential SpMV

Permuting to SBD form

Column subpartitioning

SLIDE 73

Sparse Matrix Partitioning, Reordering and Vector Multiplication > Reordering for Sequential SpMV

Permuting to SBD form

Column permutation

SLIDE 74

Sparse Matrix Partitioning, Reordering and Vector Multiplication > Reordering for Sequential SpMV

Permuting to SBD form

No mixed rows - row permutation

(Slides 75–85: animation frames; the column and row permutations repeat on each part.)
SLIDE 86

Sparse Matrix Partitioning, Reordering and Vector Multiplication > Reordering for Sequential SpMV

Permuting to SBD form

...and recurse (until n partitions reached)

SLIDE 87

Sparse Matrix Partitioning, Reordering and Vector Multiplication > Reordering for Sequential SpMV

Two-dimensional SBD

(figure: 1D versus 2D SBD form; separator blocks labelled N_col, N_col^c, N_col^+ and N_row, N_row^+, N_row^c)

SLIDE 88

Sparse Matrix Partitioning, Reordering and Vector Multiplication > Reordering for Sequential SpMV

Two-dimensional SBD

Using a fine-grain model of the input sparse matrix, each individual nonzero corresponds to a vertex; each row and each column has a corresponding net. The quantity to minimise remains $\sum_i (\lambda_i - 1)$.

(figure: 2D SBD form with the separator blocks as before)

SLIDE 89

Sparse Matrix Partitioning, Reordering and Vector Multiplication > Reordering for Sequential SpMV

Two-dimensional SBD

Zig-zag CRS is not suitable for handling 2D SBD

SLIDE 90

Sparse Matrix Partitioning, Reordering and Vector Multiplication > Reordering for Sequential SpMV

Two-dimensional SBD

SLIDE 91

Sparse Matrix Partitioning, Reordering and Vector Multiplication > Reordering for Sequential SpMV

Two-dimensional SBD

SLIDE 92

Sparse Matrix Partitioning, Reordering and Vector Multiplication > Reordering for Sequential SpMV

Bi-directional Incremental CRS (BICRS)

$$A = \begin{pmatrix} 4 & 1 & 3 & 0\\ 0 & 0 & 2 & 3\\ 1 & 0 & 0 & 2\\ 7 & 0 & 1 & 1 \end{pmatrix}$$

Stored as:

nzs: [3 2 3 1 1 2 1 7 4 1]
col increment: [2 4 1 4 -1 5 -3 4 4 1]
row increment: [0 1 2 -1 1 -3]

2·nnz + (row jumps + 1) accesses.
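BICRS generalises ICRS with signed increments, so nonzeros may be stored in any order (e.g. following the 2D SBD blocks). The sketch below is a reconstruction of that convention from the arrays above, not a verbatim implementation: a column overflow in either direction triggers a signed row jump.

```python
def bicrs_spmv(nzs, col_inc, row_inc, n, x, m):
    """y = A x for Bi-directional Incremental CRS: signed column
    increments; an overflow past column n (or below 0) consumes the
    next signed row increment."""
    y = [0.0] * m
    i = row_inc[0]
    r = 1
    j = 0
    for k in range(len(nzs)):
        j += col_inc[k]
        if j >= n:          # forward overflow: jump rows
            j -= n
            i += row_inc[r]
            r += 1
        elif j < 0:         # backward overflow: jump rows
            j += n
            i += row_inc[r]
            r += 1
        y[i] += nzs[k] * x[j]
    return y

# The slide's arrays recover the same matrix as the CRS slides.
print(bicrs_spmv([3, 2, 3, 1, 1, 2, 1, 7, 4, 1],
                 [2, 4, 1, 4, -1, 5, -3, 4, 4, 1],
                 [0, 1, 2, -1, 1, -3], 4, [1, 1, 1, 1], 4))
# [8.0, 5.0, 3.0, 9.0]
```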

SLIDE 93

Sparse Matrix Partitioning, Reordering and Vector Multiplication > Reordering for Sequential SpMV

Cache-oblivious SpMV – Stanford link matrix

(figure: Stanford link matrix, original versus fine-grain ordering; hybrid, p = 20, ε = 0.1)

p = 20, ε = 0.1: 8 minutes reordering time, 0.44 gain (BICRS/ICRS)

A.N. Yzelman & Rob H. Bisseling, Cache-oblivious sparse matrix–vector multiplication by using sparse matrix partitioning methods, SIAM Journal on Scientific Computing, Vol. 31, No. 4 (2009), pp. 3128–3154.

SLIDE 94

Sparse Matrix Partitioning, Reordering and Vector Multiplication > In relation to PSPIKE

In relation to PSPIKE

1. Bulk Synchronous Parallel
2. Matrix partitioning for SpMV
3. Reordering for Sequential SpMV
4. In relation to PSPIKE

SLIDE 95

Sparse Matrix Partitioning, Reordering and Vector Multiplication > In relation to PSPIKE

Partitioning

First, make use of Mondriaan as originally intended: integrate the partitioning scheme, including the vector distribution, for use within parallel Bi-CGStab. Another use is to increase the speed of the SpMV kernel by reordering the partitions themselves.

SLIDE 96

Sparse Matrix Partitioning, Reordering and Vector Multiplication > In relation to PSPIKE

Another reordering scheme

(figure: SBD form with the separator blocks as before)

But also, and perhaps more interestingly, use the reordering facilities to permute (symmetric) input matrices to band form, replacing the current Sloan-based algorithm, and then reliably identify the coupling blocks.

SLIDE 97

Sparse Matrix Partitioning, Reordering and Vector Multiplication > In relation to PSPIKE

Value-dependent reordering

(figure: SBD form with the separator blocks as before)

A third item is to integrate moving "heavy" nonzeroes onto or towards the diagonal, either as post-processing (possibly disadvantageous for the matrix structure) or implemented in Mondriaan itself (possibly disadvantageous for the partitioning).
