SLIDE 1

The Input/Output Complexity of Sparse Matrix Multiplication

Rasmus Pagh¹, Morten Stöckel²

¹IT University of Copenhagen, ²University of Copenhagen

SIAM LA, October 26 2015

SLIDE 2

• Sparse matrix multiplication
  - Problem description
• Upper bound
  - Size estimation
  - Partitioning
  - Outputting from partitions
  - Summary
• Lower bound
  - Technique used
  - Bounding #phases

SLIDE 3

Sparse matrix multiplication Problem description

Overview

• Let A and C be matrices over a semiring R with N nonzero entries in total.

• The problem: compute the matrix product $[AC]_{i,j} = \sum_k A_{i,k} C_{k,j}$, which has Z nonzero entries.

• Central result: this can be done (for most of the parameter space) in optimal $\tilde{O}\!\left(\frac{N\sqrt{Z}}{B\sqrt{M}}\right)$ I/Os.

SLIDE 4

Sparse matrix multiplication Problem description

Cancellation of elementary products

We say that we have cancellation when two or more summands of $[AC]_{i,j} = \sum_k A_{i,k} C_{k,j}$ are nonzero but the sum is zero. Our algorithm handles such cases.

[Figure: A (n × p) multiplied by C (p × q) gives AC = A × C (n × q); the highlighted output entry is the sum $a_{2,1}c_{1,2} + a_{2,2}c_{2,2} + \dots + a_{2,p}c_{p,2}$.]
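As a concrete illustration (a minimal example of my own, over the integers, which form a semiring): two nonzero elementary products can sum to a zero output entry, so nnz(AC) cannot be read off from the sparsity patterns of A and C alone.

```python
import numpy as np

# Row [1, -1] of A against column [1, 1] of C: the elementary products are
# 1*1 = 1 and (-1)*1 = -1, both nonzero, but their sum is 0. Hence (AC)[0, 0]
# is a zero output entry even though every input entry involved is nonzero.
A = np.array([[1, -1]])
C = np.array([[1],
              [1]])
print(A @ C)   # [[0]]
```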

SLIDE 5

Sparse matrix multiplication Problem description

Motivation

Lots of applications. Some of them:

• Computing determinants and inverses of matrices.

• Bioinformatics.

• Graphs: counting cycles, computing matchings.

SLIDE 6

Sparse matrix multiplication Problem description

The semiring I/O model, 1

• A word is big enough to hold a matrix element plus its coordinates.

• Internal memory holds M words; the disk has infinite size.

• One I/O: transfer B words from disk to internal memory.

• Cost of an algorithm: the number of I/Os used.

• Operations allowed: semiring operations, copy, and equality check.

SLIDE 7

Sparse matrix multiplication Problem description

The semiring I/O model, 2

• We make no assumptions about cancellation.

• To produce output: emit(·) must be invoked on every nonzero output entry exactly once.

• Matrices are of size U × U.

• $\tilde{O}$ suppresses polylog factors in U and N.

SLIDE 8

Sparse matrix multiplication Problem description

Our results, 1

• Let A and C be U × U matrices over a semiring R with N nonzero input and Z nonzero output entries. There exist algorithms 1 and 2 such that:

  1. Algorithm 1 emits the set of nonzero entries of AC with probability at least $1 - 1/U$, using $\tilde{O}\!\left(N\sqrt{Z}/(B\sqrt{M})\right)$ I/Os.
  2. Algorithm 2 emits the set of nonzero entries of AC, and uses $O\!\left(N^2/(MB)\right)$ I/Os.

• Previous best [Amossen-Pagh '09]: $\tilde{O}\!\left(N\sqrt{Z}/(B M^{1/8})\right)$ I/Os (Boolean matrices, hence no cancellation).
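To get a feel for the bounds, here is a back-of-the-envelope comparison with illustrative parameter values of my own choosing (constants and polylog factors ignored; these numbers are not from the paper):

```python
from math import sqrt

# Illustrative values: N input nonzeros, Z output nonzeros,
# internal memory of M words, block size of B words.
N, Z, M, B = 10**8, 10**7, 10**6, 10**3

this_paper   = N * sqrt(Z) / (B * sqrt(M))     # ~3.2e5 block transfers
amossen_pagh = N * sqrt(Z) / (B * M**0.125)    # ~5.6e7 block transfers
dense_style  = N**2 / (M * B)                  # ~1.0e7 block transfers (algorithm 2)

# The gain over [Amossen-Pagh '09] is the factor M**(3/8), about 178 here.
print(this_paper, amossen_pagh, dense_style)
```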

SLIDE 9

Sparse matrix multiplication Problem description

Our results, 2

• Let A and C be U × U matrices over a semiring R with N nonzero input and Z nonzero output entries. There exist algorithms 1 and 2 such that:

  1. Algorithm 1 emits the set of nonzero entries of AC with probability at least $1 - 1/U$, using $\tilde{O}\!\left(N\sqrt{Z}/(B\sqrt{M})\right)$ I/Os.
  2. Algorithm 2 emits the set of nonzero entries of AC, and uses $O\!\left(N^2/(MB)\right)$ I/Os.

• There exist matrices that require $\Omega\!\left(\min\!\left(\frac{N^2}{MB},\, \frac{N\sqrt{Z}}{B\sqrt{M}}\right)\right)$ I/Os to compute all nonzero entries of AC.

SLIDE 10

Upper bound Size estimation

Output size estimation

Size estimation tool: given matrices A and C with N nonzero entries, compute an ε-estimate of the number of nonzeros of each column of AC using $\tilde{O}(\varepsilon^{-3} N/B)$ I/Os.

Fact (Bender et al. '07)

For a dense 1 × U vector y and a sparse U × U matrix S we can compute yS in $\tilde{O}(\mathrm{nnz}(S)/B)$ I/Os.

SLIDE 11

Upper bound Size estimation

Distinct elements and matrix size

• Distinct elements: given a frequency vector x of size n, where $x_i$ denotes the number of times element i occurs, $F_0 = \sum_i |x_i|^0$.

• Fundamental problem in streaming: estimate $F_0$ without materializing x.

• Observation: the number of distinct elements of AC is nnz(AC).

• Good news: we can use existing machinery. A matrix F of size $O(\varepsilon^{-3} \log n \log \delta^{-1}) \times n$ exists s.t. Fx gives $F_0$ w.h.p. [Flajolet-Martin '85].

SLIDE 12

Upper bound Size estimation

Output estimation

F is $(\varepsilon^{-3} \log \delta^{-1} \log U) \times U$. A and C are U × U. To get the size estimate we must compute:

$$F \times A \times C$$

SLIDE 13

Upper bound Size estimation

Output estimation

F is $(\varepsilon^{-3} \log \delta^{-1} \log U) \times U$. A and C are U × U. To get the size estimate we must compute:

$$(F \times A) \times C$$

Due to associativity: pick the cheap order. Analysis: $\varepsilon^{-3} \log \delta^{-1} \log U$ invocations of the dense-vector sparse-matrix black box, i.e. $\tilde{O}(\varepsilon^{-3} N/B)$ I/Os. Note: this works with cancellation, contrary to previous size estimation.
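A small illustration of why the parenthesization matters (my own sketch with generic random matrices, not the paper's $F_0$ sketch matrix F): multiplying by the short, dense F first keeps every intermediate result at a handful of rows, while the other order would first materialize A × C, the very object whose size we are trying to estimate.

```python
import numpy as np
import scipy.sparse as sp

U, rows_F = 2000, 32
A = sp.random(U, U, density=0.05, format="csr", random_state=0)
C = sp.random(U, U, density=0.05, format="csr", random_state=1)
F = np.random.default_rng(2).standard_normal((rows_F, U))  # short, dense stand-in for a sketch

FA  = (A.T @ F.T).T       # rows_F dense-vector x sparse-matrix products
FAC = (C.T @ FA.T).T      # rows_F more such products; A @ C is never formed
print("dense entries handled by the cheap order:", FA.size + FAC.size)
print("nonzeros of A @ C (the expensive order would build this first):", (A @ C).nnz)
```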

SLIDE 14

Upper bound Partitioning

Matrix mult partitioning, 1

[Figure: the sparse matrices A and C to be multiplied.]

SLIDE 16

Upper bound Partitioning

Matrix mult partitioning, 2

[Figure: A × C rewritten as a sum of four smaller matrix products.]

SLIDE 17

Upper bound Partitioning

Partitioning the matrices

• What we want: split the matrices into disjoint colored groups s.t. every color combination has at most M nonzero output entries.

• Problem: this can't be done directly.

• Instead: color the rows of A using c colors. For each of the c groups of rows, do an independent coloring of the columns of C with c colors.

SLIDE 18

Upper bound Partitioning

Partitioning the matrices, 2

Overview of how to partition matrices A and C:

  1. Pick the number of colors $c = \sqrt{\mathrm{nnz}(AC)\log U / M} + O(1)$.
  2. Recurse: split A into A1 and A2 such that $\mathrm{nnz}(A_1 C) \approx \mathrm{nnz}(AC)/2$ and $\mathrm{nnz}(A_2 C) \approx \mathrm{nnz}(AC)/2$.
  3. After $\log c + O(1)$ recursive levels we have O(c) disjoint colored groups of rows of A.
  4. For each of those groups: repeat the procedure for the columns of C.
  5. The key point: $O(c^2)$ subproblems, each of output size $\mathrm{nnz}(AC)/c^2 = O(M/\log U)$.

SLIDE 19

Upper bound Partitioning

Getting the correct subproblem size

Say we can split A into A1, A2 s.t.

  1. $\mathrm{nnz}(A_1 C) \in \left[(1 - \log^{-1} U)\,\mathrm{nnz}(AC)/2,\ (1 + \log^{-1} U)\,\mathrm{nnz}(AC)/2\right]$.
  2. $\mathrm{nnz}(A_2 C) \in \left[(1 - \log^{-1} U)\,\mathrm{nnz}(AC)/2,\ (1 + \log^{-1} U)\,\mathrm{nnz}(AC)/2\right]$.

Assume the biggest possible positive error: after q recursions a subproblem has output size $\mathrm{nnz}(AC)\,(1/2 + 1/(2\log U))^{q}$. Then after $\log c^2 + O(1)$ recursions (using $1 + x \le e^x$):

$$\mathrm{nnz}(AC)\left(\frac{1}{2} + \frac{1}{2\log U}\right)^{\log c^2} \le \mathrm{nnz}(AC)\, 2^{-\log c^2}\, e^{\frac{\log c^2}{\log U}} \le \mathrm{nnz}(AC)\cdot O(1)/c^2 = O(M/\log U).$$

SLIDE 20

Upper bound Partitioning

How to compute the split

How to do relative-error $1/\log U$ splits: use the size estimation tool. For any set r of rows we have access to estimates $\hat{z}_i$ s.t.

$$(1 - \log^{-1} U)\,\mathrm{nnz}\!\left(\sum_{i \in r} [AC]_{i*}\right) \le \sum_{i \in r} \hat{z}_i \le (1 + \log^{-1} U)\,\mathrm{nnz}\!\left(\sum_{i \in r} [AC]_{i*}\right).$$

Splitting A into A1 and A2:

  1. Let $\hat{Z} = \sum_i \hat{z}_i$.
  2. Add rows from A to A1 until $\sum_{i \in A_1} \hat{z}_i \ge \hat{Z}/2$.
  3. The row y that overflows A1: compute y × C directly.
  4. Add the remaining rows to A2.
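A minimal Python sketch of this splitting step (my own illustration; `zhat` stands in for the per-row estimates produced by the size-estimation tool, and `row_groups` only hints at how repeated splits yield the O(c) row groups of the previous slide):

```python
def split_rows(rows, zhat):
    """Split rows of A into two halves of roughly equal estimated output size.

    rows : list of row indices of A
    zhat : dict mapping row index -> estimated nnz of that row of AC
           (assumed to come from the size-estimation tool)
    Returns (A1, overflow_row, A2). The overflow row is the one whose addition
    would push A1 past half of the estimated total; as in step 3 above, it is
    handled by multiplying it with C directly.
    """
    Z_hat = sum(zhat[i] for i in rows)
    A1, acc = [], 0.0
    for pos, i in enumerate(rows):
        if acc + zhat[i] >= Z_hat / 2:
            return A1, i, rows[pos + 1:]
        A1.append(i)
        acc += zhat[i]
    return A1, None, []                    # degenerate case: no overflow row


def row_groups(rows, zhat, levels):
    """Apply split_rows for `levels` recursion levels, giving about 2**levels groups."""
    if levels == 0 or len(rows) <= 1:
        return [rows]
    A1, overflow, A2 = split_rows(rows, zhat)
    groups = row_groups(A1, zhat, levels - 1) + row_groups(A2, zhat, levels - 1)
    if overflow is not None:
        groups.append([overflow])          # its output row is computed directly instead
    return groups
```

The same routine, applied inside each row group to the columns of C, yields the $O(c^2)$ subproblems.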

SLIDE 21

Upper bound Partitioning

I/O cost of splitting

I/O cost:

• Initial size estimation: $\tilde{O}(N/B)$.

• Partitioning A: c dense-vector sparse-matrix (DVSM) products: $\tilde{O}(cN/B)$.

• For the c A-partitions: one size estimation of total $\tilde{O}(N/B)$ and c DVSM products of total $\tilde{O}(cN/B)$.

• Total: $\tilde{O}(cN/B) = \tilde{O}\!\left(\frac{N\sqrt{\mathrm{nnz}(AC)}}{B\sqrt{M}}\right)$ since $c = \sqrt{\mathrm{nnz}(AC)\log U / M}$.
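Plugging in the choice of c makes the last equality a one-line check (the $\sqrt{\log U}$ factor is absorbed into $\tilde{O}$):

$$\tilde{O}\!\left(\frac{cN}{B}\right) = \tilde{O}\!\left(\frac{N}{B}\sqrt{\frac{\mathrm{nnz}(AC)\,\log U}{M}}\right) = \tilde{O}\!\left(\frac{N\sqrt{\mathrm{nnz}(AC)}}{B\sqrt{M}}\right).$$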

SLIDE 22

Upper bound Outputting from partitions

Are we done?

SLIDE 23

Upper bound Outputting from partitions

Status

• Where we are: we have $c^2 = \mathrm{nnz}(AC)\log U / M$ subproblems, each with output $\le M/\log U$.

• Central cancellation difficulty: intermediate results can be much larger than M.

• Our I/O aim: $\tilde{O}(cN/B)$, hence we cannot pay for those cancelling inner products.

• Solution: compute a particular polynomial and allow polynomially small error probability.

SLIDE 24

Upper bound Outputting from partitions

Compressed matrix mult intuition

[Figure: the output block $A_i C_j$ of one subproblem.]

SLIDE 27

Upper bound Outputting from partitions

Compressed matrix mult

• Let $r = M/\log U$ be the number of output entries in a subproblem.

• We can perform compressed matrix multiplication in 4r space by computing an O(r)-degree polynomial [Pagh '12].

• We need $O(\log U)$ repetitions to get high probability.

[Figure: the output block $A_i C_j$ of one subproblem.]
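A minimal in-memory sketch of the compressed matrix multiplication idea of [Pagh '12] (my own illustration with dense numpy arrays and real arithmetic; the actual subroutine runs per subproblem inside the I/O model and uses the $O(\log U)$ repetitions to recover the at most r nonzero output entries w.h.p.). Each outer product $A_{:,k}\,C_{k,:}$ is encoded as the product of two polynomials of degree < b, and their sum, reduced mod $x^b - 1$, is a count-sketch of AC:

```python
import numpy as np

def compressed_matmul_sketch(A, C, b, reps=7, seed=0):
    """Count-sketch of A @ C of width b, built without materializing A @ C."""
    n = A.shape[0]
    rng = np.random.default_rng(seed)
    h1 = rng.integers(0, b, size=(reps, n))         # hash of output row index
    h2 = rng.integers(0, b, size=(reps, n))         # hash of output column index
    s1 = rng.choice([-1.0, 1.0], size=(reps, n))    # random signs for rows
    s2 = rng.choice([-1.0, 1.0], size=(reps, n))    # random signs for columns
    sketches = np.zeros((reps, b))
    for t in range(reps):
        for k in range(n):                          # outer product A[:, k] * C[k, :]
            pa = np.zeros(b)
            pc = np.zeros(b)
            np.add.at(pa, h1[t], s1[t] * A[:, k])   # polynomial for the column of A
            np.add.at(pc, h2[t], s2[t] * C[k, :])   # polynomial for the row of C
            # product of the two polynomials mod x^b - 1 = circular convolution
            sketches[t] += np.fft.ifft(np.fft.fft(pa) * np.fft.fft(pc)).real
    return h1, h2, s1, s2, sketches

def estimate_entry(i, j, h1, h2, s1, s2, sketches):
    """Median estimate of (A @ C)[i, j] from the repetitions."""
    reps, b = sketches.shape
    return np.median([s1[t, i] * s2[t, j] * sketches[t, (h1[t, i] + h2[t, j]) % b]
                      for t in range(reps)])
```

Entries of AC that cancel to zero contribute nothing to the sketch, which is why the approach is unaffected by cancellation; the median over repetitions suppresses hash collisions.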

SLIDE 28

Upper bound Summary

Algorithm summary

I/O cost of the steps taken:

• Initial size estimation: $\tilde{O}(N/B)$.

• Partition into $c^2$ subproblems with output $M/\log U$: $\tilde{O}(cN/B)$.

• Compute and emit(·) all subproblems: $\tilde{O}(cN/B)$.

• Total: $\tilde{O}(cN/B) = \tilde{O}\!\left(\frac{N\sqrt{\mathrm{nnz}(AC)}}{B\sqrt{M}}\right)$ since $c = \sqrt{\mathrm{nnz}(AC)\log U / M}$.

SLIDE 29

Lower bound Technique used

Lower bound, technique used

• We will show: $\Omega\!\left(\frac{N}{B}\min\!\left(\sqrt{\frac{Z}{M}},\, \frac{N}{M}\right)\right)$ I/Os are needed.

• The argument follows the "phase argument" of [Hong and Kung '81]: divide the execution into phases of M/B I/Os each.

• Double the memory to 2M: there now exists an equivalent execution in which reads and writes are ordered.

• This allows us to argue, for a specific computation, how good the best possible execution can be.

SLIDE 30

Lower bound Technique used

Technique, continued

• Our hard instance: dense matrices, A of size $\sqrt{Z} \times \frac{N}{\sqrt{Z}}$ and C of size $\frac{N}{\sqrt{Z}} \times \sqrt{Z}$.

• Notice: $\mathrm{nnz}(A) + \mathrm{nnz}(C) = \Theta(N)$ and $\mathrm{nnz}(AC) = \Theta(Z)$.

• Crucial, due to the semiring operations: every stored element is always either
  1. an input entry, or
  2. an entry of a partial sum.

• We are now ready to argue about the number of phases needed to create two types of output.
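A quick check of the stated sizes, just expanding the dimensions (the entries can be chosen so that no cancellation occurs, making AC a fully dense $\sqrt{Z} \times \sqrt{Z}$ matrix):

$$\mathrm{nnz}(A) = \sqrt{Z}\cdot\frac{N}{\sqrt{Z}} = N,\qquad \mathrm{nnz}(C) = \frac{N}{\sqrt{Z}}\cdot\sqrt{Z} = N,\qquad \mathrm{nnz}(AC) = \sqrt{Z}\cdot\sqrt{Z} = Z.$$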

SLIDE 31

Lower bound Bounding #phases

Bounding direct outputs

• Direct outputs: all needed entries are stored, which requires two vectors of size $N/\sqrt{Z}$ to be stored.

• At most $\frac{2M\sqrt{Z}}{N}$ such vectors fit in memory, thus at most $\frac{M^2 Z}{N^2}$ direct outputs are possible per phase.

• To output Z/2 entries of this type: $(Z/2)\big/\frac{M^2 Z}{N^2} = \tfrac{1}{2}(N/M)^2$ phases are needed, hence $\Omega\!\left(\frac{N^2}{BM}\right)$ I/Os.
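Spelling out the counting behind the last bullet (a phase consists of M/B I/Os and, with memory doubled, touches at most 2M words; if a rows of A and b columns of C are resident with $a + b \le 2M\sqrt{Z}/N$, then $ab \le (M\sqrt{Z}/N)^2$):

$$\#\text{direct outputs per phase} \le \frac{M^2 Z}{N^2},\qquad \#\text{phases} \ge \frac{Z/2}{M^2 Z / N^2} = \frac{N^2}{2M^2},\qquad \#\text{I/Os} \ge \frac{N^2}{2M^2}\cdot\frac{M}{B} = \Omega\!\left(\frac{N^2}{BM}\right).$$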

SLIDE 32

Lower bound Bounding #phases

Bounding indirect outputs

• Indirect outputs: output entries for which an elementary product is written in some phase.

• In space 2M, the number of elementary products stored and computed is at most $(2M)^{3/2}$ [Irony et al. '04].

• To output Z/2 entries of this type: $\frac{Z}{2}\cdot\frac{N}{\sqrt{Z}} = \frac{N\sqrt{Z}}{2}$ elementary products have to be computed.

• Number of phases needed: $\frac{N\sqrt{Z}/2}{(2M)^{3/2}}$, thus $\Omega\!\left(\frac{N\sqrt{Z}}{B\sqrt{M}}\right)$ I/Os.
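Making the final step explicit (each output entry of the hard instance is an inner product of length $N/\sqrt{Z}$, and each phase again costs M/B I/Os):

$$\#\text{phases} \ge \frac{N\sqrt{Z}/2}{(2M)^{3/2}},\qquad \#\text{I/Os} \ge \frac{N\sqrt{Z}/2}{(2M)^{3/2}}\cdot\frac{M}{B} = \Omega\!\left(\frac{N\sqrt{Z}}{B\sqrt{M}}\right).$$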

SLIDE 33

Lower bound Bounding #phases

Lower bound summary

• To produce Z/2 direct outputs: $\Omega\!\left(\frac{N^2}{BM}\right)$ I/Os.

• To produce Z/2 indirect outputs: $\Omega\!\left(\frac{N\sqrt{Z}}{B\sqrt{M}}\right)$ I/Os.

• Since at least Z/2 outputs of one of the two kinds are needed, the lower bound becomes the minimum of the two.

SLIDE 34

Lower bound Bounding #phases

Concluding remarks

• Size estimation: supports cancellation and uses $\tilde{O}(\varepsilon^{-3} N/B)$ I/Os.

• Algorithm 1: $\tilde{O}\!\left(N\sqrt{Z}/(B\sqrt{M})\right)$ I/Os.

• Algorithm 2: $O\!\left(N^2/(MB)\right)$ I/Os.

• Lower bound: $\Omega\!\left(\min\!\left(\frac{N^2}{MB},\, \frac{N\sqrt{Z}}{\sqrt{M}\,B}\right)\right)$ I/Os.

Open problem: remove Monte Carlo (and the log factors).

SLIDE 35

The Input/Output Complexity of Sparse Matrix Multiplication

Rasmus Pagh¹, Morten Stöckel²

¹IT University of Copenhagen, ²University of Copenhagen

SIAM LA, October 26 2015
