Lower Bounds for Communication in Linear Algebra
Grey Ballard, UC Berkeley
January 9, 2012
Research supported by Microsoft (Award #024263) and Intel (Award #024894) funding and by matching funding by U.C. Discovery (Award #DIG07-10227)
1
Summary
Motivation:
- Communication is costly
- We'd like to reduce communication at the algorithm level
- How much communication can we possibly avoid?

Outline:
- Communication model
- Methods of proving lower bounds
- New algorithms developed to match lower bounds
2
Memory Model
By communication we mean:
- moving data within the memory hierarchy on a sequential computer
- moving data between processors on a parallel computer

[Figure: sequential model with FAST and SLOW memory; parallel model with a local memory per processor]
3
Communication Cost Model
Measure communication in terms of messages and words:
- flop cost: γ
- cost of a message of w words: α + βw
- total running time of an algorithm (ignoring overlap): α · (# messages) + β · (# words) + γ · (# flops)
- think of α as latency + overhead cost and β as inverse bandwidth

As flop rates continue to improve more quickly than data transfer rates, the relative cost of communication (the first two terms) grows larger.
4
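The cost model above can be written as a small helper (a minimal sketch; the machine parameters and message counts below are illustrative values, not measurements of any real machine):

```python
# Simple alpha-beta-gamma cost model for an algorithm's running time
# (ignoring overlap of communication and computation).

def run_time(alpha, beta, gamma, messages, words, flops):
    """alpha: per-message latency + overhead, beta: inverse bandwidth
    (seconds per word), gamma: seconds per flop."""
    return alpha * messages + beta * words + gamma * flops

# Illustrative parameters: latency dominates for many small messages,
# bandwidth dominates for large transfers.
t = run_time(alpha=1e-6, beta=1e-9, gamma=1e-11,
             messages=1000, words=10**6, flops=10**9)
assert abs(t - 0.012) < 1e-9
```

As α and β shrink more slowly than γ, the first two terms come to dominate this sum, which is the motivation for the lower bounds that follow.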
Prior Work: Matrix Multiplication Lower Bounds
Assume an O(n³) algorithm (i.e., not Strassen-like).

Sequential case with fast memory of size M:
- lower bound on words moved between fast/slow memory: Ω(n³/√M) [HK81]
- attained by the blocked algorithm and the recursive algorithm
Parallel case with P processors (local memory of size M):
- lower bound on words communicated (along the critical path): Ω(n³/(P√M)) [ITT04]
- attained by "2D" and "3D" algorithms:

        M                Words lower bound (attained)   Reference
  2D    O(n²/P)          Ω(n²/√P)                       [Can69]
  3D    O(n²/P^(2/3))    Ω(n²/P^(2/3))                  [DNS81]

(more on these upper bounds later)
5
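The blocked algorithm that attains the sequential Ω(n³/√M) bound can be sketched as follows (a minimal numpy version; in the real algorithm the block size b is chosen near √(M/3) so that one block each of A, B, and C fits in fast memory, giving O(n³/√M) words moved — the sizes below are illustrative):

```python
import numpy as np

def blocked_matmul(A, B, b):
    """Multiply n x n matrices in b x b blocks; each triple of blocks
    (one each from A, B, C) is the working set held in fast memory."""
    n = A.shape[0]
    C = np.zeros((n, n))
    for i in range(0, n, b):
        for j in range(0, n, b):
            for k in range(0, n, b):
                C[i:i+b, j:j+b] += A[i:i+b, k:k+b] @ B[k:k+b, j:j+b]
    return C

rng = np.random.default_rng(0)
A, B = rng.random((8, 8)), rng.random((8, 8))
assert np.allclose(blocked_matmul(A, B, 4), A @ B)
```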
Proving lower bounds
We've used three approaches to proving communication lower bounds:
1. Reduction argument [BDHS10], [GDX11]
2. Geometric embedding [ITT04], [BDHS11b]
3. Computation graph analysis [HK81], [BDHS11a]

Blue = work in which our group at Berkeley was involved
6
Reduction Example: LU
It's easy to reduce matrix multiplication to LU:

  T ≡ [ I        -B  ]   [ I         ] [ I        -B  ]
      [ A    I       ] = [ A    I    ] [     I    A·B ] ≡ L · U
      [            I ]   [          I] [           I  ]

- LU factorization can be used to perform matrix multiplication
- the communication lower bound for matrix multiplication therefore applies to LU
- the reduction to Cholesky is a little trickier, but uses the same idea [BDHS10]
7
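A quick numerical check of this reduction (a sketch using numpy; the block size n = 3 is illustrative):

```python
import numpy as np

# Verify that the block LU factorization of T encodes the product A*B.
rng = np.random.default_rng(0)
n = 3
A, B = rng.random((n, n)), rng.random((n, n))
I, Z = np.eye(n), np.zeros((n, n))

T = np.block([[I, Z, -B],
              [A, I, Z],
              [Z, Z, I]])
L = np.block([[I, Z, Z],
              [A, I, Z],
              [Z, Z, I]])
U = np.block([[I, Z, -B],
              [Z, I, A @ B],
              [Z, Z, I]])

assert np.allclose(L @ U, T)               # T = L * U
assert np.allclose(U[n:2*n, 2*n:], A @ B)  # the (2,3) block of U is A*B
```

So any algorithm that computes the LU factors of T cheaply would also compute A·B cheaply, which is what lets the matmul lower bound transfer to LU.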
Geometric Embedding Example: Matmul
The crux of the proof is a geometric inequality from [LW49], used to prove the matrix multiplication lower bound in [ITT04].
[Figure: a box with side lengths x, y, z and its three shadows: the A shadow (area xz), the B shadow (area yz), and the C shadow (area xy)]

Volume of a box: V = xyz = √(xz) · √(yz) · √(xy)

[Figure: an arbitrary 3D set V and its three shadows on the coordinate planes]

Volume of a 3D set: V ≤ √(area(A shadow)) · √(area(B shadow)) · √(area(C shadow))
Given a limited set of data, how much useful computation can be done?
8
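The inequality can be sanity-checked on a random discrete set (a sketch; the discrete analogue of the [LW49] inequality bounds a finite voxel set's size by its three coordinate-plane projections, and the grid size and density below are illustrative):

```python
import numpy as np

# Random 3D voxel set; check |V| <= sqrt(area(A) * area(B) * area(C)),
# where A, B, C are the set's shadows on the three coordinate planes.
rng = np.random.default_rng(0)
S = rng.random((8, 8, 8)) < 0.3   # boolean occupancy grid

volume = S.sum()
shadow_areas = [S.any(axis=ax).sum() for ax in range(3)]

assert volume <= np.sqrt(np.prod(shadow_areas))
```

In the lower-bound proof, the "set" is the collection of flops performed in a segment of execution, and its shadows are the entries of A, B, and C touched by those flops.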
Extensions to the rest of O(n3) linear algebra
We extended the geometric embedding approach of [ITT04]
to other algorithms that “smell” like 3 nested loops:
- the rest of the BLAS
- Cholesky, LU, and QR factorizations
- eigenvalue and SVD reductions
- sequences of algorithms (e.g., repeated matrix squaring)
- graph algorithms (e.g., all-pairs shortest paths)

and to dense or sparse problems, in sequential or parallel cases.
see [BDHS11b] for details and proofs
9
Extensions to the rest of O(n3) linear algebra
General lower bound for O(n³) linear algebra (3 nested loops):

  # words = Ω( # flops / √(fast/local memory size) )

- this bound can be applied processor by processor, even to heterogeneous platforms
- # flops refers to the work done by that processor
- the memory size may depend on the hardware or the algorithm
- a lower bound on # messages can be derived by dividing by the memory size (the largest possible message size); this corresponds to required synchronization
10
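The bound translates directly into a back-of-the-envelope calculator (a sketch; the problem size n and memory size M below are illustrative, and the 2n³ flop count is for classical matmul):

```python
import math

def min_words(flops, M):
    """Lower bound on words moved: flops / sqrt(fast memory size)."""
    return flops / math.sqrt(M)

def min_messages(flops, M):
    """Divide by M, the largest possible message size."""
    return min_words(flops, M) / M

# Classical matmul does about 2n^3 flops.
n, M = 4096, 2**20
assert min_words(2 * n**3, M) == 2**27    # 134,217,728 words
assert min_messages(2 * n**3, M) == 128.0
```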
Computation graph analysis
- Red-blue pebble game introduced in [HK81]
- lower bounds proved for matrix multiplication, FFT, and others
- pebbling game extended in [Sav95] and later papers
[Figure: a computation graph with input/output vertices, intermediate values, and dependency edges; a subset S of vertices with its read set R_S and write set W_S]

We've connected graph expansion to communication [BDHS11a]:
- expansion describes the relationship between a subset and its neighbors in the complement
- larger expansion implies that more communication is necessary
11
Computation graph analysis example: Strassen
Strassen’s original algorithm uses 7 multiplies and 18 adds for n = 2
  M1 = (A11 + A22) · (B11 + B22)
  M2 = (A21 + A22) · B11
  M3 = A11 · (B12 − B22)
  M4 = A22 · (B21 − B11)
  M5 = (A11 + A12) · B22
  M6 = (A21 − A11) · (B11 + B12)
  M7 = (A12 − A22) · (B21 + B22)

  C11 = M1 + M4 − M5 + M7
  C12 = M3 + M5
  C21 = M2 + M4
  C22 = M1 − M2 + M3 + M6

[Figure: computation graph of one Strassen step: "Enc A" and "Enc B" encode the blocks of A and B into the seven products M1, ..., M7, and "Dec C" decodes them into the blocks of C]
12
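These formulas can be checked directly (a numpy sketch of one recursion level; a full implementation would recurse on the seven half-sized products instead of using classical multiplication for them):

```python
import numpy as np

def strassen_step(A, B):
    """One level of Strassen's recursion on even-sized square matrices,
    using classical multiplication (@) for the seven products."""
    h = A.shape[0] // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    M1 = (A11 + A22) @ (B11 + B22)
    M2 = (A21 + A22) @ B11
    M3 = A11 @ (B12 - B22)
    M4 = A22 @ (B21 - B11)
    M5 = (A11 + A12) @ B22
    M6 = (A21 - A11) @ (B11 + B12)
    M7 = (A12 - A22) @ (B21 + B22)
    return np.block([[M1 + M4 - M5 + M7, M3 + M5],
                     [M2 + M4, M1 - M2 + M3 + M6]])

rng = np.random.default_rng(0)
A, B = rng.random((4, 4)), rng.random((4, 4))
assert np.allclose(strassen_step(A, B), A @ B)
```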
Computation graph analysis example: Strassen
Strassen works recursively, so its graph has recursive structure
[Figure: recursive construction of Strassen's computation graph; at each level, the Enc A, Enc B, and Dec C components of the previous level are refined, yielding seven copies of the smaller graph connected through the encoding and decoding subgraphs]
Expansion properties of this graph lead to communication lower bounds
13
Lower Bounds for Strassen
The communication lower bounds are similar in form to those for classical matmul.
             Classical                   Strassen                    Strassen-like
Sequential   Ω((n/√M)^(log2 8) · M)      Ω((n/√M)^(log2 7) · M)      Ω((n/√M)^ω0 · M)
Parallel     Ω((n/√M)^(log2 8) · M/P)    Ω((n/√M)^(log2 7) · M/P)    Ω((n/√M)^ω0 · M/P)

- these lower bounds imply that Strassen and faster algorithms require less communication
- the sequential lower bound is attained by the recursive algorithm
- the parallel lower bound is attainable with a new algorithm [BDH+12]

14
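Since n/√M > 1 whenever the problem exceeds fast memory, a smaller exponent gives a smaller lower bound; a quick numeric comparison (a sketch; the n and M values are illustrative):

```python
import math

def bw_lower_bound(n, M, w):
    """Bandwidth lower bound of the form (n / sqrt(M))^w * M."""
    return (n / math.sqrt(M)) ** w * M

n, M = 2**14, 2**20
classical = bw_lower_bound(n, M, math.log2(8))  # exponent log2 8 = 3
strassen = bw_lower_bound(n, M, math.log2(7))   # exponent ~ 2.81

# A lower exponent means a smaller lower bound: Strassen is permitted
# to communicate less than any classical O(n^3) algorithm.
assert strassen < classical
```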
Lower bounds → algorithms
Proving lower bounds allows us to:
- evaluate existing algorithms
- identify possibilities for algorithmic innovation
- obtain asymptotic improvements (speedups increase with n and/or M)
15
Communication Upper Bounds - Sequential O(n3)
Algorithm             Minimizes # words                           Minimizes # messages
BLAS3                 usual blocked or recursive algorithms [Gus97, FLPR99]
Cholesky              LAPACK, [Gus97, AP00], [BDHS10]             [Gus97, AP00], [BDHS10]
Symmetric Indefinite  LAPACK (rarely), [BDD+12]                   [BDD+12]
LU with pivoting      LAPACK (rarely), [Gus97, Tol97], [GDX11]    [GDX11]
QR                    LAPACK (rarely), [FW03], [EG98], [DGHL11]   [FW03], [DGHL11]
Eig, SVD              [BDK12, BDD11]                              [BDK12, BDD11]

Blue = work in which our group at Berkeley was involved
16
Communication Upper Bounds - Parallel O(n3)
Algorithm         Reference    Factor exceeding lower     Factor exceeding lower
                               bound for # words          bound for # messages
Matrix Multiply   [Can69]      1                          1
Cholesky          ScaLAPACK    log P                      log P
LU with pivoting  [GDX11]      log P                      log P
                  ScaLAPACK    log P                      (N/P^(1/2)) log P
QR                [DGHL11]     log P                      log^3 P
                  ScaLAPACK    log P                      (N/P^(1/2)) log P
SymEig, SVD       [BDK12]      log P                      log^3 P
                  ScaLAPACK    log P                      N/P^(1/2)
NonsymEig         [BDD11]      log P                      log^3 P
                  ScaLAPACK    P^(1/2) log P              N log P

Blue = work in which our group at Berkeley was involved
Red = not optimal
Lower bounds: # words = Ω(n²/√P), # messages = Ω(√P)
*This slide assumes that one copy of the data is distributed evenly across processors

17
“2.5D” Algorithms
The lower bound for O(n³) linear algebra is Ω(n³/(P√M)) words:
- communication requirements decrease as M increases

Standard algorithms use a 2D grid of processors:
- store one copy of the data across all processors
- local memory requirements: M = Θ(n²/P)
- # words moved = Θ(n²/P^(1/2))

For matrix multiplication, can visualize processors as a 3D grid:
- requires space for P^(1/3) copies of the data
- local memory requirements: M = Θ(n²/P^(2/3))
- # words moved = Θ(n²/P^(2/3))
18
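The 2D/3D trade-off can be tabulated (a sketch of the asymptotic leading terms only; constants are dropped and the n and P values are illustrative):

```python
# Memory footprint and words moved per processor for 2D vs. 3D
# matmul layouts (leading asymptotic terms, constants omitted).

def layout_2d(n, P):
    return {"memory": n * n / P, "words": n * n / P**0.5}

def layout_3d(n, P):
    return {"memory": n * n / P**(2/3), "words": n * n / P**(2/3)}

n, P = 2**13, 64
two_d, three_d = layout_2d(n, P), layout_3d(n, P)

# 3D replicates the data (P^(1/3) copies) but moves fewer words.
assert three_d["memory"] > two_d["memory"]
assert three_d["words"] < two_d["words"]
```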
“2.5D” Algorithms
2.5D algorithms interpolate between 2D and 3D algorithms [SD11]:
- use as much local memory as is physically available, up to M = Θ(n²/P^(2/3)) as in the 3D case
- enable strong scaling (of both computation and communication)

New algorithms developed for matrix multiplication (and LU):
- perfect strong scaling achieved on BG/P ("Intrepid" at Argonne)
- speedup of 12× over 2D matmul with n = 8192 and P = 16384
19
[Figure: matrix multiplication on BG/P (n = 65,536): percentage of machine peak vs. number of nodes (256 to 2048) for 2.5D MM and 2D MM]
Parallel Strassen
The lower bound for parallel Strassen is Ω((n/√M)^(log2 7) · M/P) words.

Previous attempts to parallelize Strassen are not optimal. Our new algorithm [BDH+12]:
- uses as much local memory as is physically available, up to M = O(n²/P^(2/log2 7))
- is asymptotically optimal (up to a log P factor)
- is faster in practice than classical and previous Strassen algorithms (in fact, faster than classical peak performance)
- measured on a Cray XT4 ("Franklin" at NERSC)
20
Performance of Parallel Strassen
[Figure: effective GFlop/s per node at P = 49, 343, and 2401, comparing the new parallel Strassen algorithm, a combined algorithm, two previous parallel Strassen implementations, CA classical (2.5D), and ScaLAPACK against the absolute maximum for all classical algorithms]

fixed problem size n = 94080
21
Ongoing Work
- Extending lower bounds to arbitrary nested loops (beyond linear algebra)
- Symmetric indefinite factorization: algorithms and implementation
- Avoiding communication in sparse iterative methods
- Tuning the parallel Strassen implementation
- Sparse matrix-matrix multiplication: lower bounds and algorithms
- Autotuning implementations of new algorithms
22
Thank You!
Please contact me with questions! ballard@cs.berkeley.edu http://www.eecs.berkeley.edu/~ballard Find links to papers and other resources at the BEBOP webpage: http://bebop.cs.berkeley.edu/
23
References I
- N. Ahmed and K. Pingali.
Automatic generation of block-recursive codes. In Euro-Par ’00: Proceedings from the 6th International Euro-Par Conference on Parallel Processing, pages 368–378, London, UK, 2000. Springer-Verlag.
- G. Ballard, J. Demmel, and I. Dumitriu.
Communication-optimal parallel and sequential eigenvalue and singular value algorithms. Technical Report No. UCB/EECS-2011-14, 2011.
- G. Ballard, J. Demmel, A. Druinsky, I. Peled, O. Schwartz, and S. Toledo.
Communication avoiding symmetric indefinite factorization, 2012. In preparation.
- G. Ballard, J. Demmel, O. Holtz, B. Lipshitz, E. Rom, and O. Schwartz.
Communication-optimal parallel algorithm for Strassen’s matrix multiplication, 2012. In preparation.
- G. Ballard, J. Demmel, O. Holtz, and O. Schwartz.
Communication-optimal parallel and sequential Cholesky decomposition. SIAM Journal on Scientific Computing, 32(6):3495–3523, 2010.
- G. Ballard, J. Demmel, O. Holtz, and O. Schwartz.
Graph expansion and communication costs of fast matrix multiplication. In Proceedings of the 23rd ACM symposium on Parallelism in algorithms and architectures, SPAA ’11, pages 1–12, New York, NY, USA, 2011. ACM.
- G. Ballard, J. Demmel, O. Holtz, and O. Schwartz.
Minimizing communication in numerical linear algebra. SIAM Journal on Matrix Analysis and Applications, 32(3):866–901, 2011.
24
References II
- G. Ballard, J. Demmel, and N. Knight.
Communication avoiding successive band reduction. In Proceedings of the 17th ACM Symposium on Principles and Practice of Parallel Programming, PPoPP ’12, New York, NY, USA, 2012. ACM. To appear.
- L. Cannon.
A cellular computer to implement the Kalman filter algorithm. PhD thesis, Montana State University, Bozeman, MT, 1969.
- J. Demmel, L. Grigori, M. Hoemmen, and J. Langou.
Communication-optimal parallel and sequential QR and LU factorizations. UC Berkeley Technical Report EECS-2008-89, August 2008; to appear in SIAM J. Sci. Comput., 2011.
- E. Dekel, D. Nassimi, and S. Sahni.
Parallel matrix and graph algorithms. SIAM Journal on Computing, 10(4):657–675, 1981.
- E. Elmroth and F. Gustavson.
New serial and parallel recursive QR factorization algorithms for SMP systems. In B. Kågström et al., editor, Applied Parallel Computing. Large Scale Scientific and Industrial Problems., volume 1541 of Lecture Notes in Computer Science, pages 120–128. Springer, 1998.
- M. Frigo, C. E. Leiserson, H. Prokop, and S. Ramachandran.
Cache-oblivious algorithms. In FOCS ’99: Proceedings of the 40th Annual Symposium on Foundations of Computer Science, page 285, Washington, DC, USA, 1999. IEEE Computer Society.
25
References III
- J. D. Frens and D. S. Wise.
QR factorization with Morton-ordered quadtree matrices for memory re-use and parallelism. SIGPLAN Not., 38(10):144–154, 2003.
- L. Grigori, J. Demmel, and H. Xiang.
CALU: A communication optimal LU factorization algorithm. SIAM Journal on Matrix Analysis and Applications, 32(4):1317–1350, 2011.
- F. G. Gustavson.
Recursion leads to automatic variable blocking for dense linear-algebra algorithms. IBM J. Res. Dev., 41(6):737–756, 1997.
- J. W. Hong and H. T. Kung.
I/O complexity: The red-blue pebble game. In STOC ’81: Proceedings of the thirteenth annual ACM symposium on Theory of computing, pages 326–333, New York, NY, USA, 1981. ACM.
- D. Irony, S. Toledo, and A. Tiskin.
Communication lower bounds for distributed-memory matrix multiplication. J. Parallel Distrib. Comput., 64(9):1017–1026, 2004.
- L. H. Loomis and H. Whitney.
An inequality related to the isoperimetric inequality. Bulletin of the AMS, 55:961–962, 1949.
- J. E. Savage.
Extending the Hong-Kung model to memory hierarchies. In Computing and Combinatorics, volume 959, pages 270–281. Springer Verlag, 1995.
26
References IV
- E. Solomonik and J. Demmel.
Communication-optimal parallel 2.5D matrix multiplication and LU factorization algorithms. In Proceedings of the 17th International Conference on Parallel Processing - Volume Part II, Euro-Par'11, pages 90–109, Berlin, Heidelberg, 2011. Springer-Verlag.
- S. Toledo.
Locality of reference in LU decomposition with partial pivoting. SIAM J. Matrix Anal. Appl., 18(4):1065–1081, 1997.