

SLIDE 1

Parallel Computing – the Why and the How

Albert-Jan Yzelman, February 2010

SLIDE 2

Parallel Computing

SLIDE 3

Some problems are too large to be solved by one processor

SLIDE 4

Google

SLIDE 5

Climate Modeling

SLIDE 6

Computational Materials Science

SLIDE 7

N-body simulations

SLIDE 8

Financial market simulation

SLIDE 9

Movie rendering

SLIDE 10

The How

  • Many different architectures
  • One parallel model
  • Load balancing

SLIDE 11

Architectures

1. Architectures
2. Parallel model
3. Balancing
4. Sequential sparse matrix–vector multiplication
5. Future

SLIDE 12

Vector machines (1970s until the early 1990s).

ILLIAC IV (early 1960s–1976): 64 processors, 200 Mflop/s, 256 KB RAM, 1 TB laser recording device

SLIDE 13

Vector machines (1970s until the early 1990s).

Cray-1 (1972–1976): a single vector processor with 12 functional units, 136–250 Mflop/s, 8 MB RAM

SLIDE 14

Vector machines (1970s until the early 1990s). Cray Y-MP (1988):

  • max. 8 vector processors, 2.6 Gflop/s, 512 MB RAM, 4 GB SSD

A single vector instruction computes not just one a · x + y, but 64 of them at once!
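For illustration, a minimal C sketch of the operation behind such an instruction; on a vector machine each batch of 64 iterations of this loop corresponds to a single vector multiply-add (the axpy name is illustrative, not from the slides):

```c
#include <stddef.h>

/* y := a*x + y over n elements ("AXPY"). A vector machine executes this
   loop in chunks of 64 elements, each chunk being one vector instruction;
   a modern vectorising compiler performs a similar transformation. */
void axpy(size_t n, double a, const double *x, double *y)
{
    for (size_t i = 0; i < n; ++i)
        y[i] += a * x[i];
}
```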

SLIDE 15

Beowulf clusters (around 1994; named after the first such cluster, built at NASA). Few details found, but it was described as a “Gigaflops machine”.

SLIDE 16

Supercomputers at SARA:

  • Teras (2001), SGI Origin 3800: 1024 MIPS R14000 processors, 1 Tflop/s
  • Aster, SGI Altix 3700: 416 Intel Itanium 2 processors, 2.2 Tflop/s
  • Huygens (2007), IBM System p5-575: 3328 IBM POWER6 processors, 62.5 Tflop/s

SLIDE 17

Grid computing: DAS-3 (2007), 44 Tflop/s(?)

SLIDE 18

Stream processing: Cell / Roadrunner (Nov. 2008).

  • Cell: 1 PPE and 8 SPEs; roughly 100 Gflop/s
  • Roadrunner: 6912 Opteron processors and 12960 Cell processors; 1.456 Pflop/s

SLIDE 19

Upcoming architectures

  • Multicore (OpenMP, PThreads, MPI, OpenCL)
  • GPU (OpenCL, CUDA)
  • Cloud computing (Amazon)
  • Manycore (hundreds or thousands of cores)

SLIDE 20

Parallel model

1. Architectures
2. Parallel model
3. Balancing
4. Sequential sparse matrix–vector multiplication
5. Future

SLIDE 21

In summary

Different processor types:
  • Reduced Instruction Set chips (RISC) (e.g., IBM POWER)
  • Intel Itanium
  • x86-type (your average home PC/laptop)
  • Vector (co-)processors
  • GPUs
  • Stream processors
  • ...

SLIDE 22

In summary

Different connectivity:
  • Ring
  • All-to-all ethernet
  • InfiniBand
  • Cube
  • Hierarchical
  • Internet
  • ...

SLIDE 23

Parallel Models

Solution: bridging models
  • Message Passing Interface (MPI)
  • Bulk Synchronous Parallel (BSP)

Leslie G. Valiant, A bridging model for parallel computation, Communications of the ACM, Volume 33 (1990), pp. 103–111

SLIDE 24

Bulk Synchronous Parallel

A BSP computer:
  • consists of P processors, each with local memory
  • executes a Single Program on Multiple Data (SPMD)
  • performs no communication during calculation
  • communicates only during barrier synchronisation

SLIDE 25

[Diagram: Superstep 0, Sync, Superstep 1: computation phases separated by a barrier synchronisation]

SLIDE 26

Bulk Synchronous Parallel

A BSP computer furthermore:
  • has homogeneous processors, each able to do r flops per second
  • takes l time to synchronise
  • has a communication cost of g time per data word

The model thus uses only four parameters: (P, r, l, g).

SLIDE 27

Bulk Synchronous Parallel

A BSP algorithm can:
  • ask for some environment variables: bsp_nprocs(), bsp_pid()
  • synchronise: bsp_sync()
  • perform “direct” remote memory access (DRMA): bsp_put(source, dest, dest_PID), bsp_get(source, source_PID, dest)
  • send messages, delivered at the next synchronisation (BSMP): bsp_send(data, dest_PID), bsp_move()
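For illustration, a minimal BSP program in C against the classic BSPlib interface, in which every processor puts its id into its right neighbour's memory via DRMA. A sketch assuming a BSPlib implementation is installed; the cyclic neighbour exchange and the variable names are illustrative:

```c
#include <bsp.h>
#include <stdio.h>

int main(void)
{
    bsp_begin(bsp_nprocs());                 /* start the SPMD section        */
    int p = bsp_nprocs();                    /* number of processors          */
    int s = bsp_pid();                       /* my processor id               */

    int neighbour = -1;
    bsp_push_reg(&neighbour, sizeof(int));   /* register memory for DRMA      */
    bsp_sync();                              /* registration takes effect     */

    /* Superstep 0: put my id into the next processor's 'neighbour'. */
    bsp_put((s + 1) % p, &s, &neighbour, 0, sizeof(int));
    bsp_sync();                              /* all communication happens here */

    /* Superstep 1: the value is now available locally. */
    printf("Processor %d of %d received %d\n", s, p, neighbour);

    bsp_pop_reg(&neighbour);
    bsp_end();
    return 0;
}
```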

SLIDE 28

Example: sparse matrix, dense vector multiplication

y = Ax:

for each nonzero k of A:
    add x[k.column] · k.value to y[k.row]
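A minimal sequential sketch of this loop in C, assuming the nonzeroes are stored as coordinate triplets (the nonzero type and the spmv_triplet name are illustrative, not from the slides):

```c
#include <stddef.h>

/* One nonzero of a sparse matrix, stored as a coordinate triplet. */
typedef struct { size_t row, col; double val; } nonzero;

/* y += A*x, with A given as an array of nnz triplets;
   the caller must zero-initialise y. */
void spmv_triplet(const nonzero *A, size_t nnz,
                  const double *x, double *y)
{
    for (size_t k = 0; k < nnz; ++k)
        y[A[k].row] += A[k].val * x[A[k].col];
}
```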

SLIDE 29

Example: sparse matrix, dense vector multiplication

To do this in parallel: Distribute the nonzeroes of A, but also distribute x and y; each processor should have about 1/Pth of the total data.

SLIDE 30

Example: sparse matrix, dense vector multiplication

Step 1 (fan-out): Not all processors have the elements from x they need; processors need to get the missing items.

SLIDE 31

Example: sparse matrix, dense vector multiplication

Step 2: use the received elements of x for the local multiplication.

Step 3 (fan-in): send the local contributions to y to the responsible processors; here, y is distributed cyclically, obviously a bad choice.

SLIDE 32

Example: sparse matrix, dense vector multiplication

The algorithm:

for all nonzeroes k of A:
    if k has a local row and a local column:
        add k.v · x[k.j] to y[k.i]
    if x[k.j] is not local:
        get x[k.j] from the responsible processor
    if y[k.i] is not local:
        set y[k.i] = 0 locally
synchronise
for all nonzeroes k of A not yet processed:
    add k.v · x[k.j] to y[k.i]
send all local copies y[i] to the responsible processor
synchronise
for all incoming pairs (i, v) (such that v = y[i]):
    add v to y[i]

SLIDE 33

Balancing

1. Architectures
2. Parallel model
3. Balancing
4. Sequential sparse matrix–vector multiplication
5. Future

SLIDE 34

BSP cost model: let $w_i^{(s)}$ be the work done by processor s in superstep i, let $r_i^{(s)}$ be the communication received by processor s between supersteps i and i + 1, and let $t_i^{(s)}$ be the communication transmitted by processor s. Furthermore, let T be the total number of supersteps, and let the communication bound of superstep i be given by

\[ c_i = \max\left( \max_{s \in [0,P-1]} r_i^{(s)},\ \max_{s \in [0,P-1]} t_i^{(s)} \right). \]

Similarly, the upper bound for the amount of work is $w_i = \max_{s \in [0,P-1]} w_i^{(s)}$. Then the cost of a BSP algorithm is given by

\[ \sum_{i=0}^{T-1} w_i + \sum_{i=0}^{T-1} \left( l + g \cdot c_i \right). \]
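As a sanity check, the formula translates directly into code; a hypothetical helper sketch (the function name and the row-major layout of the inputs are illustrative):

```c
#include <stddef.h>

/* BSP cost: for superstep i and processor s, w[i*P+s] is the work,
   r[i*P+s] the received words, t[i*P+s] the transmitted words. */
double bsp_cost(size_t T, size_t P, double l, double g,
                const double *w, const double *r, const double *t)
{
    double cost = 0.0;
    for (size_t i = 0; i < T; ++i) {
        double wi = 0.0, ci = 0.0;       /* max work, max comm volume */
        for (size_t s = 0; s < P; ++s) {
            const size_t idx = i * P + s;
            if (w[idx] > wi) wi = w[idx];
            if (r[idx] > ci) ci = r[idx];
            if (t[idx] > ci) ci = t[idx];
        }
        cost += wi + l + g * ci;          /* one superstep's contribution */
    }
    return cost;
}
```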

SLIDE 35

In the case of sparse matrix–vector multiplication,

what causes the communication?

  • nonzeroes on the same column distributed to different processors: fan-out communication
  • nonzeroes on the same row distributed to different processors: fan-in communication

(Assuming the vector distribution of x/y is such that x[i]/y[i] is distributed to the same processor as at least one entry of the ith column/row of A.)

SLIDE 36

In the case of sparse matrix–vector multiplication,

what causes the communication?

  • nonzeroes on the same column distributed to different processors: fan-out communication
  • nonzeroes on the same row distributed to different processors: fan-in communication

(Assuming the vector distribution of x/y is such that x[i]/y[i] is distributed to the same processor as at least one entry of the ith column/row of A.)

Easy solution: distribute all nonzeroes to the same processor

SLIDE 37

Load balancing

For each superstep i, let

\[ \bar{w}_i = \frac{1}{P} \sum_{s \in [0,P-1]} w_i^{(s)} \]

be the average workload. We demand that

\[ \max_{s \in [0,P-1]} \left| \bar{w}_i - w_i^{(s)} \right| \leq \epsilon \, \bar{w}_i, \]

where $\epsilon$ is the maximum load-imbalance parameter.
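This criterion is easy to check in code; a minimal sketch (the balanced name is illustrative; w holds one workload per processor for a single superstep):

```c
#include <math.h>
#include <stddef.h>

/* Returns 1 if max_s |avg - w[s]| <= eps * avg, 0 otherwise. */
int balanced(size_t P, const double *w, double eps)
{
    double avg = 0.0;
    for (size_t s = 0; s < P; ++s) avg += w[s];
    avg /= (double)P;                       /* average workload */
    for (size_t s = 0; s < P; ++s)
        if (fabs(avg - w[s]) > eps * avg)   /* imbalance too large */
            return 0;
    return 1;
}
```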

SLIDE 38

Communication balancing

Recall that the communication part of the BSP cost depends on the maximum incoming or outgoing volume over all processors. For each superstep, we therefore minimise

\[ c_i = \max\left( \max_{s \in [0,P-1]} r_i^{(s)},\ \max_{s \in [0,P-1]} t_i^{(s)} \right), \]

not the total communication volume $\sum_s r_i^{(s)} = \sum_s t_i^{(s)}$.

SLIDE 39

“Shared” columns: communication during fan-out

[Figure: 8 × 8 sparse matrix distributed over two processors]

Column-net model; a cut net means a shared column

SLIDE 40

“Shared” columns: communication during fan-out

[Figure: 8 × 8 sparse matrix distributed over two processors, after column permutation]

Column-net model; a cut net means a shared column

SLIDE 41

“Shared” columns: communication during fan-out

[Figure: 8 × 8 sparse matrix distributed over two processors, after another column permutation]

Column-net model; a cut net means a shared column

SLIDE 42

“Shared” rows: communication during fan-in

[Figure: 8 × 8 sparse matrix distributed over two processors, rows permuted]

Row-net model; a cut net means a shared row

SLIDE 43

“Catch” all communication:

[Figure: 8 × 8 sparse matrix in the fine-grain model, distributed over two processors]

Fine-grain model; a cut net means either a shared row or column

SLIDE 44

Emerging partitioning strategy:
  • Model the sparse matrix using a hypergraph

SLIDE 45

Emerging partitioning strategy:
  • Model the sparse matrix using a hypergraph
  • Partition the vertices of that hypergraph (in two)

SLIDE 46

Emerging partitioning strategy:
  • Model the sparse matrix using a hypergraph
  • Partition the vertices of that hypergraph (in two)

Criteria: load balance (at most $\epsilon$ imbalance), plus minimising

\[ \sum_i (\lambda_i - 1), \]

where $\lambda_i$ equals the number of parts within the ith net: the connectivity of the ith net. An alternative metric counts the cut nets:

\[ \sum_i \begin{cases} 1 & \text{if } \lambda_i > 1, \\ 0 & \text{otherwise.} \end{cases} \]
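The $(\lambda_i - 1)$ metric is straightforward to evaluate once every nonzero has been assigned to a part; a small self-contained C sketch with made-up example data (net, part and the array sizes are illustrative):

```c
#include <stdio.h>

#define NNETS  3
#define NPARTS 2

int main(void)
{
    /* Six nonzeroes; net[k] is the net (e.g. row) of nonzero k,
       part[k] is the part its vertex was assigned to. */
    int net[]  = { 0, 0, 1, 1, 2, 2 };
    int part[] = { 0, 1, 0, 0, 1, 1 };
    int n = 6;

    int seen[NNETS][NPARTS] = { { 0 } };
    int lambda[NNETS] = { 0 };            /* connectivity per net */

    for (int k = 0; k < n; ++k)
        if (!seen[net[k]][part[k]]) {
            seen[net[k]][part[k]] = 1;
            ++lambda[net[k]];             /* net touches a new part */
        }

    int cost = 0;
    for (int i = 0; i < NNETS; ++i)
        cost += lambda[i] - 1;            /* the (lambda_i - 1) metric */
    printf("communication volume metric: %d\n", cost);  /* prints 1 */
    return 0;
}
```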

SLIDE 47

Emerging partitioning strategy:
  • Model the sparse matrix using a hypergraph
  • Partition the vertices of that hypergraph (in two)

Kernighan & Lin, An efficient heuristic procedure for partitioning graphs, Bell System Technical Journal 49 (1970), pp. 291–307
Fiduccia & Mattheyses, A linear-time heuristic for improving network partitions, Proceedings of the 19th IEEE Design Automation Conference (1982), pp. 175–181
Çatalyürek & Aykanat, Hypergraph-partitioning-based decomposition for parallel sparse-matrix vector multiplication, IEEE Transactions on Parallel and Distributed Systems 10 (1999), pp. 673–693

SLIDE 48

Emerging partitioning strategy:
  • Model the sparse matrix using a hypergraph
  • Partition the vertices of the hypergraph (in two)

Original Mondriaan: try both the row-net and the column-net model, and choose the best.

SLIDE 49

Emerging partitioning strategy:
  • Model the sparse matrix using a hypergraph
  • Partition the vertices of the hypergraph (in two)
  • Recursively keep partitioning the vertex parts

SLIDE 50

Emerging partitioning strategy:
  • Model the sparse matrix using a hypergraph
  • Partition the vertices of the hypergraph (in two)
  • Recursively keep partitioning the vertex parts

SLIDE 51

Emerging partitioning strategy:
  • Model the sparse matrix using a hypergraph
  • Partition the vertices of the hypergraph (in two)
  • Recursively keep partitioning the vertex parts

SLIDE 52

Mondriaan:

  • Model the sparse matrix using a hypergraph
  • Partition the vertices of the hypergraph (in two)
  • Recursively keep partitioning the vertex parts

Brendan Vastenhouw and Rob H. Bisseling, A Two-Dimensional Data Distribution Method for Parallel Sparse Matrix-Vector Multiplication, SIAM Review, Vol. 47, No. 1 (2005), pp. 67–95

SLIDE 53

Mondriaan:

  • Model the sparse matrix using a hypergraph
  • Partition the vertices of the hypergraph (in two)
  • Recursively keep partitioning the vertex parts
  • Partition the vector elements

[Figure: Mondriaan partitioning of a sparse matrix]

Rob H. Bisseling and Wouter Meesen, Communication balancing in parallel sparse matrix-vector multiplication, Electronic Transactions on Numerical Analysis, Vol. 21 (2005) pp. 47-65

SLIDE 54

Sequential sparse matrix–vector multiplication

1. Architectures
2. Parallel model
3. Balancing
4. Sequential sparse matrix–vector multiplication
5. Future

SLIDE 55

Naive cache

k = 1: a modulo-mapped (direct-mapped) cache. A chunk of memory of line size $L_S$, starting at RAM address x, is stored in cache line number x mod L:

[Figure: main memory (RAM) blocks mapped modulo L onto the cache lines]

SLIDE 56

“Ideal” cache

Instead of using a naive modulo mapping, we use a smarter policy. Take k = L = 4 with a “Least Recently Used” (LRU) policy; cache contents, most recently used first:

Req. x1, ..., x4:  x4 x3 x2 x1  ⇒  Req. x2:  x2 x4 x3 x1  ⇒  Req. x5:  x5 x2 x4 x3
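This LRU behaviour is easy to simulate; a small self-contained C sketch reproducing the request sequence above (all names are illustrative):

```c
#include <stdio.h>

#define K 4  /* cache capacity (k = L = 4 in the example) */

/* Fully-associative LRU cache: lines[0] is the most recently used entry.
   Returns 1 on a hit, 0 on a miss. */
static int access_cache(int lines[K], int x)
{
    int hit = 0, i, j;
    for (i = 0; i < K; ++i)
        if (lines[i] == x) { hit = 1; break; }
    if (!hit) i = K - 1;            /* evict the least recently used */
    for (j = i; j > 0; --j)         /* shift to make room at the front */
        lines[j] = lines[j - 1];
    lines[0] = x;
    return hit;
}

int main(void)
{
    int lines[K] = { -1, -1, -1, -1 };        /* empty cache          */
    int requests[] = { 1, 2, 3, 4, 2, 5 };    /* x1..x4, x2, x5       */
    for (int r = 0; r < 6; ++r)
        printf("request x%d: %s\n", requests[r],
               access_cache(lines, requests[r]) ? "hit" : "miss");
    return 0;                                  /* ends with x5 x2 x4 x3 */
}
```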

SLIDE 57

Realistic cache

1 < k < L: combining the modulo mapping with the LRU policy.

[Figure: main memory (RAM) modulo-mapped onto subcaches, each maintained as an LRU stack]

SLIDE 58

The dense case

Dense matrix–vector multiplication:

\[ \begin{pmatrix} a_{00} & a_{01} & a_{02} & a_{03} \\ a_{10} & a_{11} & a_{12} & a_{13} \\ a_{20} & a_{21} & a_{22} & a_{23} \\ a_{30} & a_{31} & a_{32} & a_{33} \end{pmatrix} \cdot \begin{pmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} y_0 \\ y_1 \\ y_2 \\ y_3 \end{pmatrix} \]

Example with k = L = 2; cache contents after each access, most recently used first:

x0 ⇒

SLIDE 59

The dense case

Dense matrix–vector multiplication, continued; cache contents after each access:

x0 ⇒ a00 x0 ⇒

SLIDE 60

The dense case

Dense matrix–vector multiplication, continued; cache contents after each access:

x0 ⇒ a00 x0 ⇒ y0 a00 x0 ⇒

SLIDE 61

The dense case

Dense matrix–vector multiplication, continued; cache contents after each access:

x0 ⇒ a00 x0 ⇒ y0 a00 x0 ⇒ x1 y0 a00 x0 ⇒

SLIDE 62

The dense case

Dense matrix–vector multiplication, continued; cache contents after each access:

x0 ⇒ a00 x0 ⇒ y0 a00 x0 ⇒ x1 y0 a00 x0 ⇒ a01 x1 y0 a00 x0 ⇒

SLIDE 63

The dense case

Dense matrix–vector multiplication, continued; cache contents after each access:

x0 ⇒ a00 x0 ⇒ y0 a00 x0 ⇒ x1 y0 a00 x0 ⇒ a01 x1 y0 a00 x0 ⇒ y0 a01 x1 a00 x0

SLIDE 64

The sparse case

Standard data structure: Compressed Row Storage (CRS)
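For reference, a minimal CRS-based SpMV in C; the crs_matrix type and the field names are an illustrative sketch, not code from the talk:

```c
#include <stddef.h>

/* CRS (a.k.a. CSR): all nonzeroes stored row by row. */
typedef struct {
    size_t n;       /* number of rows                                      */
    size_t *start;  /* start[i] .. start[i+1]-1 index the nonzeroes of row i */
    size_t *col;    /* column index per nonzero                            */
    double *val;    /* value per nonzero                                   */
} crs_matrix;

/* y = A x. The accesses to val, col and y are sequential, but the
   accesses to x jump around: the cache problem discussed on these slides. */
void crs_spmv(const crs_matrix *A, const double *x, double *y)
{
    for (size_t i = 0; i < A->n; ++i) {
        double sum = 0.0;
        for (size_t k = A->start[i]; k < A->start[i + 1]; ++k)
            sum += A->val[k] * x[A->col[k]];
        y[i] = sum;
    }
}
```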

SLIDE 65

The sparse case

Sparse matrix–vector multiplication (SpMV); cache contents after each access, most recently used first:

x? ⇒

SLIDE 66

The sparse case

Sparse matrix–vector multiplication (SpMV), continued; cache contents after each access:

x? ⇒ a0? x? ⇒

SLIDE 67

The sparse case

Sparse matrix–vector multiplication (SpMV), continued; cache contents after each access:

x? ⇒ a0? x? ⇒ y0 a0? x? ⇒

SLIDE 68

The sparse case

Sparse matrix–vector multiplication (SpMV), continued; cache contents after each access:

x? ⇒ a0? x? ⇒ y0 a0? x? ⇒ x? y0 a0? x? ⇒

SLIDE 69

The sparse case

Sparse matrix–vector multiplication (SpMV), continued; cache contents after each access:

x? ⇒ a0? x? ⇒ y0 a0? x? ⇒ x? y0 a0? x? ⇒ a?? x? y0 a0? x? ⇒ y? a?? x? y? a0? x?

We cannot predict memory accesses in the sparse case!

SLIDE 70

Improving cache–use

We adapt two things:
  • the sparse matrix storage: from CRS to “zig-zag” CRS (see the sketch below)
  • the sparse matrix structure: by reordering rows and columns
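A minimal sketch of the zig-zag idea in C, reusing the crs_matrix type from the earlier sketch. True zig-zag CRS stores the nonzeroes of odd rows in reversed order; here, only the traversal direction alternates, which illustrates the access pattern (even rows left to right, odd rows right to left):

```c
/* Zig-zag SpMV over a CRS matrix: consecutive rows end and start in
   nearby columns, improving reuse of cached entries of x.
   (Assumes the crs_matrix type from the previous sketch.) */
void zigzag_spmv(const crs_matrix *A, const double *x, double *y)
{
    for (size_t i = 0; i < A->n; ++i) {
        double sum = 0.0;
        if (i % 2 == 0) {                       /* even row: left to right */
            for (size_t k = A->start[i]; k < A->start[i + 1]; ++k)
                sum += A->val[k] * x[A->col[k]];
        } else {                                 /* odd row: right to left  */
            for (size_t k = A->start[i + 1]; k-- > A->start[i]; )
                sum += A->val[k] * x[A->col[k]];
        }
        y[i] = sum;
    }
}
```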

SLIDE 71

Now assume the sparse matrix is in “Separated Block Diagonal” (SBD) form:

SLIDE 72

[Figure: SBD form; per-block cache misses: no cache misses, 1 cache miss per row, 3 cache misses per row, 1 cache miss per row]

SLIDE 73

[Figure: one more level of SBD recursion; per-block cache misses: no cache misses, 1 cache miss per row, 3 cache misses, 1 cache miss per row, 7 cache misses per row, 1 cache miss per row, 3 cache misses per row, 1 cache miss per row]

SLIDE 74

[Figure: further SBD recursion; per-block cache misses: no cache misses, 1 cache miss per row, 3 cache misses, 1 cache miss, 7 cache misses, 15 cache misses]

SLIDE 75

Summarising... If

1. the matrix is stored in zig-zag CRS format,
2. each SBD block corresponds to one vertex,
3. each row corresponds to one net (containing a vertex if the corresponding submatrix contains a nonzero),

then the number of cache misses is given by

\[ \sum_i (\lambda_i - 1), \]

with $\lambda_i$ the connectivity of the ith row.

SLIDE 76

Using Mondriaan

Input


SLIDE 77

Using Mondriaan

Column partitioning (force row-net model)


SLIDE 78

Using Mondriaan

Permute


SLIDE 79

Using Mondriaan

Identify conflicting rows


SLIDE 80

Using Mondriaan

Permute rows


SLIDE 81

Using Mondriaan

...and recurse (until n partitions reached)


SLIDE 82

Cache-oblivious SpMV

The “rhpentium new” matrix, reordered with p = 100, $\epsilon$ = 0.1: 34 percent faster SpMV.

A. N. Yzelman and Rob H. Bisseling, Cache-oblivious sparse matrix–vector multiplication by using sparse matrix partitioning methods, SIAM Journal on Scientific Computing, Vol. 31, No. 4 (2009), pp. 3128–3154

SLIDE 83

Future

1. Architectures
2. Parallel model
3. Balancing
4. Sequential sparse matrix–vector multiplication
5. Future

SLIDE 84

  • Multicore processing (MulticoreBSP, shared caches)
  • Further enhancing sequential SpMV (two-dimensional SBD)
  • Faster/better Mondriaan (better matching, other improvements)
  • Other uses for Mondriaan (LU decomposition, ...)
  • Parallel Mondriaan
