The Input/Output Complexity of Sparse Matrix Multiplication
Rasmus Pagh¹, Morten Stöckel²
1IT University of Copenhagen, 2 University of Copenhagen
SIAM LA, October 26 2015
Pagh, Stöckel (ITU, DIKU), October 26 2015
Outline
- Sparse matrix multiplication: Problem description
- Upper bound: Size estimation; Partitioning; Outputting from partitions; Summary
- Lower bound: Technique used; Bounding #phases
Sparse matrix multiplication Problem description
- Let A and C be matrices over a semiring R with N nonzero entries in total.
- The problem: compute the matrix product [AC]_{i,j} = Σ_k A_{i,k} C_{k,j}, which has Z nonzero entries.
- Central result: this can be done (for most of the parameter space) in optimal Õ(N√Z / (B√M)) I/Os.
Sparse matrix multiplication Problem description
We say that we have cancellation when two or more summands of [AC]_{i,j} = Σ_k A_{i,k} C_{k,j} are nonzero but the sum is zero. Our algorithm handles such cases.

[Figure: A (n × p) times C (p × q) gives AC (n × q); entry ac_{22} is formed as a_{21} × c_{12} + a_{22} × c_{22} + … + a_{2p} × c_{p2}.]
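As a toy illustration (our own sketch, not the paper's external-memory algorithm), here is a sparse semiring product over the integers in which a cancelling output entry disappears:

```python
# Sparse matrix product with cancellation handled.
# Matrices are stored as {(row, col): value} dicts over the integers.
from collections import defaultdict

def sparse_matmul(A, C):
    """Compute [AC]_{i,j} = sum_k A[i,k] * C[k,j] for sparse dict matrices."""
    C_by_row = defaultdict(list)
    for (k, j), v in C.items():
        C_by_row[k].append((j, v))
    out = defaultdict(int)
    for (i, k), a in A.items():
        for j, c in C_by_row[k]:
            out[(i, j)] += a * c  # one elementary product
    # Cancellation: summands may be nonzero while their sum is zero.
    return {pos: v for pos, v in out.items() if v != 0}

# A[0,0]*C[0,0] + A[0,1]*C[1,0] = 1 - 1 = 0, so entry (0,0) cancels.
A = {(0, 0): 1, (0, 1): 1}
C = {(0, 0): 1, (0, 1): 2, (1, 0): -1}
print(sparse_matmul(A, C))  # {(0, 1): 2}
```

Note that the cancelled entry (0, 0) is never emitted, even though two of its summands are nonzero.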
Sparse matrix multiplication Problem description
Lots of applications. Some of them:
- Computing determinants and inverses of matrices.
- Bioinformatics.
- Graphs: counting cycles, computing matchings.
Sparse matrix multiplication Problem description
- A word is big enough to hold a matrix element plus its coordinates.
- Internal memory holds M words; the disk is of infinite size.
- One I/O: transfer B words between disk and internal memory.
- Cost of an algorithm: number of I/Os used.
- Operations allowed: semiring operations, copy and equality check.
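The cost accounting in this model can be sketched in a few lines (our own illustration; the class and its names are assumptions, not from the talk):

```python
# Toy accounting of the I/O model: memory of M words, transfers of B words.
import math

class IOModel:
    def __init__(self, M, B):
        self.M = M      # words of internal memory
        self.B = B      # words per block transfer
        self.ios = 0    # number of I/Os charged

    def scan(self, n_words):
        """Reading or writing n contiguous words costs ceil(n/B) I/Os."""
        self.ios += math.ceil(n_words / self.B)
        return self.ios

model = IOModel(M=1024, B=8)
model.scan(100)   # one pass over 100 words
print(model.ios)  # 13, i.e. ceil(100/8)
```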
Sparse matrix multiplication Problem description
- We make no assumptions about cancellation.
- To produce output: must invoke emit(.) on every nonzero output entry once.
- Matrices are of size U × U.
- Õ suppresses polylog factors in U and N.
Sparse matrix multiplication Problem description
- Let A and C be U × U matrices over semiring R with N nonzero input and Z nonzero output entries. There exist algorithms 1 and 2 that compute AC with probability at least 1 − 1/U, using Õ(N√Z / (B√M)) I/Os.
- Previous best [Amossen-Pagh, '09]: Õ(N√Z / (B M^{1/8})) I/Os (Boolean matrices ⇒ no cancellation).
Sparse matrix multiplication Problem description
- Let A and C be U × U matrices over semiring R with N nonzero input and Z nonzero output entries. There exist algorithms that compute AC with probability at least 1 − 1/U, using Õ(N√Z / (B√M)) I/Os.
- There exist matrices that require Ω(min(N²/(MB), N√Z / (B√M))) I/Os to compute all nonzero entries of AC.
Upper bound Size estimation
Size estimation tool: given matrices A and C with N nonzero entries, compute an ε-estimate of the number of nonzeroes of each column of AC using Õ(ε⁻³ N/B) I/Os.

Fact (Bender et al., '07): for a dense 1 × U vector y and a sparse U × U matrix S, we can compute yS in Õ(nnz(S)/B) I/Os.
Upper bound Size estimation
- Distinct elements: given a frequency vector x of size n, where x_i denotes the number of times element i occurs, F0 = Σ_i |x_i|⁰.
- Fundamental problem in streaming: estimate F0 without materializing x.
- Observation: the number of distinct elements of AC is nnz(AC).
- Good news: use existing machinery. A matrix F of size O(ε⁻³ log n log δ⁻¹) × n exists s.t. Fx gives F0 whp [Flajolet-Martin, '85].
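To make the F0 primitive concrete, here is a small stand-in estimator (a k-minimum-values sketch of our own choosing, not the linear sketch used in the talk):

```python
# F0 = sum_i |x_i|^0, plus a stand-in streaming estimator (k minimum values).
import random

def f0_exact(x):
    """Number of nonzero frequencies in the frequency vector x."""
    return sum(1 for v in x if v != 0)

def f0_estimate(stream, k=64, seed=0):
    """Estimate the number of distinct items from the k smallest hash values."""
    salt = random.Random(seed).getrandbits(64)
    vals = sorted({hash((salt, item)) % 2**32 / 2**32 for item in stream})
    if len(vals) < k:
        return len(vals)               # fewer than k distinct items: exact
    return int((k - 1) / vals[k - 1])  # classic KMV estimate

print(f0_exact([0, 2, 0, -1, 3]))               # 3
est = f0_estimate(i % 300 for i in range(10_000))
print(est)  # roughly 300 (300 distinct items in the stream)
```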
Upper bound Size estimation
F is (ε⁻³ log δ⁻¹ log U) × U. A and C are U × U. To get the size estimate we must compute F(AC).

Due to associativity: pick the cheap order (FA)C. Analysis: ε⁻³ log δ⁻¹ log U invocations of the dense-vector sparse-matrix black box: Õ(ε⁻³ N/B) I/Os. Note: works with cancellation, contrary to previous size estimation.
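A toy demonstration of the cheap-order trick, with a stand-in dense matrix F (our own illustration; the real F is an F0 sketch matrix):

```python
# F(AC) = (FA)C: the sketch can be applied before the product is formed.
def matmul(X, Y):
    """Dense product of list-of-lists matrices."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

F = [[1, 0, 1], [0, 1, 1]]                 # stand-in 2 x 3 "sketch" matrix
A = [[1, 0, 0], [0, 2, 0], [0, 0, 3]]
C = [[0, 1], [1, 0], [1, 1]]

cheap = matmul(matmul(F, A), C)            # never materializes AC
costly = matmul(F, matmul(A, C))           # materializes AC first
print(cheap == costly)  # True
```

The left-to-right order only ever multiplies a few dense rows against a sparse matrix, which is exactly the black box above.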
Upper bound Partitioning
[Figure: the product A × C written as a sum of four smaller block products.]
Upper bound Partitioning
- What we want: split the matrices into disjoint colored groups s.t. every color combination has at most M nonzero output entries.
- Problem: this can't be done directly.
- Instead: color the rows of A using c colors. For each of the c groups of rows, do an independent coloring of the columns of C with c colors.

[Figure: two of the resulting row-group by column-group products.]
Upper bound Partitioning
Overview of how to partition matrices A and C:
- Recursively split A into halves A₁, A₂ by estimated output size, so that nnz(A₁C) ≈ nnz(A₂C) ≈ nnz(AC)/2.
- Stop when there are c = √(nnz(AC) log U / M) + O(1) groups of rows of A.
Upper bound Partitioning
Say we can do splits of A into A₁, A₂ s.t. nnz(A₁C), nnz(A₂C) ∈ [(1 − 1/log U) nnz(AC)/2, (1 + 1/log U) nnz(AC)/2]. Assume the biggest possible positive error: after log c² recursions each subproblem has size

nnz(AC) (1/2 + 1/(2 log U))^{log c²} ≤ nnz(AC) 2^{−log c²} e^{log c² / log U} ≤ nnz(AC) · O(1)/c² = O(M / log U).
Upper bound Partitioning
How to do relative error 1/log U splits: use the size estimation tool. For any set r of rows we have access to estimates ẑ_i s.t.

(1 − 1/log U) nnz(Σ_{i∈r} [AC]_{i*}) ≤ Σ_{i∈r} ẑ_i ≤ (1 + 1/log U) nnz(Σ_{i∈r} [AC]_{i*}).

Splitting A into A₁ and A₂:
- Let Ẑ = Σ_i ẑ_i.
- Add rows to A₁ until Σ_{i∈A₁} ẑ_i ≥ Ẑ/2; the remaining rows form A₂.
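The split rule can be sketched as a greedy scan over the row estimates (our own reading of the slide; the names are hypothetical):

```python
# Greedy split: move rows into A1 until their estimated output mass
# first reaches half of the total estimate Z_hat.
def split_rows(z_hat):
    """z_hat maps row index -> estimated nonzeros of [AC]_{i*}."""
    total = sum(z_hat.values())
    A1, mass = [], 0
    for row, z in z_hat.items():
        A1.append(row)
        mass += z
        if mass >= total / 2:   # Z_hat / 2 reached: stop
            break
    A2 = [r for r in z_hat if r not in set(A1)]
    return A1, A2

z_hat = {0: 10, 1: 30, 2: 25, 3: 35}   # hypothetical row estimates
A1, A2 = split_rows(z_hat)
print(A1, A2)  # [0, 1, 2] [3]
```

Because each ẑ_i is within a (1 ± 1/log U) factor of the truth, both halves carry close to half of nnz(AC).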
Upper bound Partitioning
I/O cost:
- Initial size estimation: Õ(N/B).
- Partitioning A: c dense-vector sparse-matrix products: Õ(cN/B).
- For the c A-partitions: one size estimation of total Õ(N/B) and c DVSM products of total Õ(cN/B).
- Total: Õ(cN/B) = Õ(N √nnz(AC) / (B√M)), since c = √(nnz(AC) log U / M).
Upper bound Outputting from partitions
- Where we are: we have c² = nnz(AC) log U / M subproblems, each with output ≤ M/log U.
- Central cancellation difficulty: intermediate results can be much larger than M.
- Our I/O aim is Õ(cN/B), hence we can't pay for those cancelling inner products.
- Solution: compute a particular polynomial and allow polynomially small error probability.
Upper bound Outputting from partitions
- Let r = M/log U be the number of output entries in a subproblem.
- We can perform compressed matrix multiplication in 4r space by computing an O(r)-degree polynomial [Pagh, '12].
- We need O(log U) repetitions to get high probability.

[Figure: one subproblem A_i C_j.]
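The count-sketch idea underlying compressed matrix multiplication can be illustrated as follows (our own simplified version: it hashes each elementary product directly, whereas [Pagh, '12] obtains the same sketch faster via FFT polynomial multiplication):

```python
# Count-sketch of the output AC in small space: each elementary product
# A[i,k]*C[k,j] is added, with a random sign, into bucket h(i, j).
import random

def sketch_product(A, C, m, seed=0):
    """Size-m signed sketch of AC; A, C given as {(i, j): value} dicts."""
    salt = random.Random(seed).getrandbits(64)
    h = lambda i, j: hash((salt, i, j)) % m                  # bucket hash
    s = lambda i, j: 1 - 2 * (hash((salt, 's', i, j)) % 2)   # sign in {-1, +1}
    sk = [0] * m
    for (i, k), a in A.items():
        for (k2, j), c in C.items():
            if k == k2:                      # matching inner index
                sk[h(i, j)] += s(i, j) * a * c
    return sk, h, s

A = {(0, 0): 1, (0, 1): 1}
C = {(0, 0): 1, (1, 0): -1, (1, 1): 5}
sk, h, s = sketch_product(A, C, m=16)
# Entry (0,0) cancels (1 - 1 = 0) and contributes nothing to its bucket;
# entry (0,1) = 5 can be read back (up to collisions with other entries).
print(s(0, 1) * sk[h(0, 1)])  # 5
```

Cancelling inner products add zero to their bucket, so the sketch size depends only on the number of surviving output entries, not on the intermediate results.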
Upper bound Summary
I/O cost of the steps taken:
- Initial size estimation: Õ(N/B).
- Partition into c² problems with output M/log U: Õ(cN/B).
- Compute and emit(.) all subproblems: Õ(cN/B).
- Total: Õ(cN/B) = Õ(N √nnz(AC) / (B√M)), since c = √(nnz(AC) log U / M).
Lower bound Technique used
- We will show: Ω((N/B) · min(√(Z/M), N/M)) I/Os are needed.
- The argument follows the "phase argument" of [Hong and Kung, '81]: divide the execution into phases of M/B I/Os each.
- Double the memory to 2M: there now exists an equivalent execution in which reads and writes are ordered.
- This allows us to argue, for a specific computation, how good the best possible execution can be.
Lower bound Technique used
- Our hard instance: dense matrices, A of size √Z × N/√Z and C of size N/√Z × √Z.
- Notice: nnz(A) + nnz(C) = Θ(N) and nnz(AC) = Θ(Z).
- Crucial due to the semiring operations: every stored element is always either a copy of an input entry or a partial sum of elementary products belonging to a single output entry.
- We are now ready to argue about the number of phases needed to create two types of output.
Lower bound Bounding #phases
- Direct outputs: all needed entries are stored in memory; this requires two vectors of size N/√Z (a row of A and a column of C) to be stored.
- At most 2M√Z/N such vectors fit in memory, thus at most M²Z/N² direct outputs per phase.
- To output Z/2 entries of this type: (Z/2)/(M²Z/N²) ≈ (N/M)² phases are needed; at M/B I/Os per phase this gives Ω(N²/(BM)) I/Os.
Lower bound Bounding #phases
- Indirect outputs: output entries for which an elementary product is written to disk in some phase.
- In space 2M, the number of elementary products stored and computed is at most (2M)^{3/2} [Irony et al., '04].
- To output Z/2 entries of this type: Z/2 · N/√Z = N√Z/2 elementary products must be computed.
- Number of phases needed: (N√Z/2)/(2M)^{3/2}; at M/B I/Os per phase this gives Ω(N√Z/(B√M)) I/Os.
Lower bound Bounding #phases
- To do Z/2 direct outputs: Ω(N²/(BM)) I/Os.
- To do Z/2 indirect outputs: Ω(N√Z/(B√M)) I/Os.
- Since at least Z/2 outputs of one of the two types are needed, the lower bound becomes the minimum: Ω(min(N²/(BM), N√Z/(B√M))) I/Os.
Lower bound Bounding #phases
- Size estimation: supports cancellation and uses Õ(ε⁻³ N/B) I/Os.
- Algorithm 1: Õ(N√Z / (B√M)) I/Os.
- Algorithm 2: O(…) I/Os.
- Lower bound: Ω(min(N²/(MB), N√Z/(√M B))) I/Os.

Open: remove Monte Carlo (and the log factors).