SLIDE 1

Lower Bounds for Communication in Linear Algebra

Grey Ballard

UC Berkeley

January 9, 2012

Research supported by funding from Microsoft (Award #024263) and Intel (Award #024894) and by matching funding from U.C. Discovery (Award #DIG07-10227)

SLIDE 2

Summary

Motivation
- Communication is costly
- We’d like to reduce communication at the algorithm level
- How much communication can we possibly avoid?

Outline
- Communication model
- Methods of proving lower bounds
- New algorithms developed to match lower bounds

SLIDE 3

Memory Model

By communication we mean:
- moving data within the memory hierarchy on a sequential computer
- moving data between processors on a parallel computer

[Figure: sequential model with fast and slow memory; parallel model with a local memory per processor]

SLIDE 4

Communication Cost Model

Measure communication in terms of messages and words:
- Flop cost: γ
- Cost of a message of w words: α + βw
- Total running time of an algorithm (ignoring overlap): α · (# messages) + β · (# words) + γ · (# flops)
- Think of α as the latency + overhead cost and β as inverse bandwidth

As flop rates continue to improve more quickly than data transfer rates, the relative cost of communication (the first two terms) grows larger.
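The model above can be written as a one-line cost function. The parameter values below are hypothetical, chosen only to show the shape of the model, not measured constants:

```python
def running_time(n_messages, n_words, n_flops,
                 alpha=1e-6, beta=1e-9, gamma=1e-11):
    """Total running time under the alpha-beta-gamma model, ignoring
    overlap: latency term + bandwidth term + compute term."""
    return alpha * n_messages + beta * n_words + gamma * n_flops

# With these (illustrative) constants, one message costs as much as
# ~1000 words or ~100,000 flops, so avoiding messages and words matters
# more and more as the flop cost gamma shrinks.
```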

SLIDE 5

Prior Work: Matrix Multiplication Lower Bounds

Assume an O(n³) algorithm (i.e., not Strassen-like).

Sequential case with fast memory of size M:
- lower bound on words moved between fast/slow memory: Ω(n³ / √M) [HK81]
- attained by the blocked algorithm and the recursive algorithm
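A sketch of why the blocked algorithm attains this bound. The word-count accounting below is my own illustration; the block size b is chosen so that three b×b blocks (one each of A, B, C) fit in fast memory:

```python
def blocked_matmul_words(n, M):
    """Words moved between slow and fast memory by blocked matmul
    (an accounting sketch, not a tuned implementation)."""
    b = int((M / 3) ** 0.5)   # three b-by-b blocks fit in fast memory
    t = -(-n // b)            # ceil(n / b): blocks per dimension
    # each C block is read once and written once: 2 * t^2 * b^2 words;
    # for each (i, j, k) block triple, one block of A and one block of B
    # are read: 2 * t^3 * b^2 words
    return 2 * t * t * b * b + 2 * t ** 3 * b * b

# The total is Theta(n^3 / b) = Theta(n^3 / sqrt(M)), matching [HK81].
```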

SLIDE 6

Prior Work: Matrix Multiplication Lower Bounds

Assume an O(n³) algorithm (i.e., not Strassen-like).

Sequential case with fast memory of size M:
- lower bound on words moved between fast/slow memory: Ω(n³ / √M) [HK81]
- attained by the blocked algorithm and the recursive algorithm

Parallel case with P processors (local memory of size M):
- lower bound on words communicated (along the critical path): Ω(n³ / (P√M)) [ITT04]
- attained by “2D” and “3D” algorithms:

          M               lower bound attained
    2D    O(n²/P)         Ω(n²/P^(1/2))          [Can69]
    3D    O(n²/P^(2/3))   Ω(n²/P^(2/3))          [DNS81]

(more on these upper bounds later)

SLIDE 7

Proving lower bounds

We’ve used three approaches to proving communication lower bounds:

1. Reduction argument [BDHS10], [GDX11]
2. Geometric embedding [ITT04], [BDHS11b]
3. Computation graph analysis [HK81], [BDHS11a]

Blue = work in which our group at Berkeley was involved

SLIDE 8

Reduction Example: LU

It’s easy to reduce matrix multiplication to LU:

        ⎡ I       −B ⎤   ⎡ I         ⎤   ⎡ I       −B  ⎤
    T ≡ ⎢ A   I      ⎥ = ⎢ A   I     ⎥ · ⎢     I   A·B ⎥ ≡ L · U
        ⎣          I ⎦   ⎣          I⎦   ⎣           I ⎦

- LU factorization can thus be used to perform matrix multiplication
- the communication lower bound for matrix multiplication therefore applies to LU
- the reduction to Cholesky is a little trickier, but uses the same idea [BDHS10]
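The identity can be checked directly for small blocks. A pure-Python sketch (illustration only; `matmul` and `block3` are helper names introduced here, not part of any library):

```python
def matmul(X, Y):
    """Dense matrix product of nested lists."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

n = 2
A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
I = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
Z = [[0] * n for _ in range(n)]
AB = matmul(A, B)
negB = [[-x for x in row] for row in B]

def block3(r1, r2, r3):
    """Assemble a 3x3 block matrix from n-by-n blocks."""
    rows = []
    for blk_row in (r1, r2, r3):
        for i in range(n):
            rows.append(sum((blk[i] for blk in blk_row), []))
    return rows

T = block3((I, Z, negB), (A, I, Z), (Z, Z, I))
L = block3((I, Z, Z),    (A, I, Z), (Z, Z, I))
U = block3((I, Z, negB), (Z, I, AB), (Z, Z, I))
assert matmul(L, U) == T  # so an LU factorization of T exposes A·B in U
```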

SLIDE 9

Geometric Embedding Example: Matmul

Crux of the proof is a geometric inequality from [LW49], used to prove the matrix multiplication lower bound in [ITT04].

[Figure: a box with side lengths x, y, z and its shadows A, B, C on the three coordinate planes]

Volume of a box: V = xyz = √(xz) · √(yz) · √(xy)

Volume of a 3D set: V ≤ √area(A shadow) · √area(B shadow) · √area(C shadow)

Given a limited set of data, how much useful computation can be done?
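The discrete analogue of the Loomis–Whitney inequality is easy to check by counting: for a finite set V of lattice points with shadows on the coordinate planes, |V| ≤ √(|shadow_xy| · |shadow_yz| · |shadow_xz|). A sketch (the box example attains equality, just as in the continuous picture):

```python
from itertools import product
from math import sqrt

def shadows(V):
    """Projections of a finite point set V ⊂ Z^3 onto the three planes."""
    xy = {(x, y) for x, y, z in V}
    yz = {(y, z) for x, y, z in V}
    xz = {(x, z) for x, y, z in V}
    return xy, yz, xz

# A 4 x 3 x 5 box of lattice points: |V| = 60, shadows of sizes 12, 15, 20
V = set(product(range(4), range(3), range(5)))
xy, yz, xz = shadows(V)
assert len(V) <= sqrt(len(xy) * len(yz) * len(xz))  # equality for a box
```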

SLIDE 10

Extensions to the rest of O(n3) linear algebra

We extended the geometric embedding approach of [ITT04]:
- to other algorithms that “smell” like 3 nested loops:
  - the rest of the BLAS
  - Cholesky, LU, QR factorizations
  - eigenvalue and SVD reductions
  - sequences of algorithms (e.g., repeated matrix squaring)
  - graph algorithms (e.g., all-pairs shortest paths)
- to dense or sparse problems
- to sequential or parallel cases

See [BDHS11b] for details and proofs.

SLIDE 11

Extensions to the rest of O(n3) linear algebra

General lower bound for O(n³) linear algebra (3 nested loops):

    # words = Ω( # flops / √(fast/local memory size) )

This bound can be applied processor by processor, even to heterogeneous platforms:
- # flops refers to the work done by that processor
- the memory size may depend on the hardware or the algorithm
- a lower bound on # messages can be derived by dividing by the memory size (the largest possible message size); this corresponds to the synchronization required
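The two bounds can be wrapped in a small calculator. This is a sketch of the accounting only; the function names and example sizes are illustrative:

```python
from math import sqrt

def words_lower_bound(flops, M):
    """#words = Omega(#flops / sqrt(M)), applied per processor."""
    return flops / sqrt(M)

def messages_lower_bound(flops, M):
    """Dividing by the largest possible message size M bounds #messages."""
    return words_lower_bound(flops, M) / M

# Example: ~2n^3 flops of classical matmul on one processor with fast
# memory M recovers the familiar Omega(n^3 / sqrt(M)) word bound.
n, M = 1024, 2 ** 15
w = words_lower_bound(2 * n ** 3, M)
m = messages_lower_bound(2 * n ** 3, M)
assert m == w / M
```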

SLIDE 12

Computation graph analysis

- Red-blue pebble game introduced in [HK81]
- lower bounds proved for matrix multiplication, FFT, and others
- pebbling game extended in [Sav95] and later papers

SLIDE 13

Computation graph analysis

- Red-blue pebble game introduced in [HK81]
- lower bounds proved for matrix multiplication, FFT, and others
- pebbling game extended in [Sav95] and later papers

[Figure: computation graph with input/output vertices, intermediate values, and dependency edges; a vertex subset S of V with its read set R_S and write set W_S]

We’ve connected graph expansion to communication [BDHS11a]:
- expansion describes the relationship between a subset and its neighbors in the complement
- larger expansion implies more communication is necessary

SLIDE 14

Computation graph analysis example: Strassen

Strassen’s original algorithm uses 7 multiplies and 18 adds for n = 2

    M1 = (A11 + A22) · (B11 + B22)
    M2 = (A21 + A22) · B11
    M3 = A11 · (B12 − B22)
    M4 = A22 · (B21 − B11)
    M5 = (A11 + A12) · B22
    M6 = (A21 − A11) · (B11 + B12)
    M7 = (A12 − A22) · (B21 + B22)

    C11 = M1 + M4 − M5 + M7
    C12 = M3 + M5
    C21 = M2 + M4
    C22 = M1 − M2 + M3 + M6

[Figure: computation graph of one level of Strassen, with an encoding of A (Enc A), an encoding of B (Enc B), and a decoding of C (Dec C)]
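The formulas above can be exercised directly at the scalar level. This sketch applies one level only; a real implementation applies the same formulas recursively to n/2 × n/2 blocks:

```python
def strassen_2x2(A, B):
    """One level of Strassen's algorithm on 2x2 operands."""
    (a11, a12), (a21, a22) = A
    (b11, b12), (b21, b22) = B
    m1 = (a11 + a22) * (b11 + b22)
    m2 = (a21 + a22) * b11
    m3 = a11 * (b12 - b22)
    m4 = a22 * (b21 - b11)
    m5 = (a11 + a12) * b22
    m6 = (a21 - a11) * (b11 + b12)
    m7 = (a12 - a22) * (b21 + b22)
    return ((m1 + m4 - m5 + m7, m3 + m5),
            (m2 + m4, m1 - m2 + m3 + m6))

# Check against the direct 8-multiply product
A, B = ((1, 2), (3, 4)), ((5, 6), (7, 8))
direct = ((1*5 + 2*7, 1*6 + 2*8), (3*5 + 4*7, 3*6 + 4*8))
assert strassen_2x2(A, B) == direct
```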

SLIDE 15

Computation graph analysis example: Strassen

Strassen works recursively, so its graph has recursive structure

[Figure: recursive structure of the Strassen computation graph, built from Enc A, Enc B, and Dec C blocks]

SLIDE 16

Computation graph analysis example: Strassen

Strassen works recursively, so its graph has recursive structure

[Figure: the recursive graph, expanded one more level]

SLIDE 17

Computation graph analysis example: Strassen

Strassen works recursively, so its graph has recursive structure

[Figure: the recursive graph, expanded another level]

SLIDE 18

Computation graph analysis example: Strassen

Strassen works recursively, so its graph has recursive structure

[Figure: the fully expanded recursive graph]

SLIDE 19

Computation graph analysis example: Strassen

Strassen works recursively, so its graph has recursive structure

[Figure: the fully expanded recursive graph]

Expansion properties of this graph lead to communication lower bounds

SLIDE 20

Lower Bounds for Strassen

The communication lower bounds are similar to those for classical matmul:

                 Classical                   Strassen
    Sequential   Ω((n/√M)^(log2 8) · M)     Ω((n/√M)^(log2 7) · M)
    Parallel     Ω((n/√M)^(log2 8) · M/P)   Ω((n/√M)^(log2 7) · M/P)
SLIDE 21

Lower Bounds for Strassen

The communication lower bounds are similar to those for classical matmul:

                 Classical                   Strassen                    Strassen-like
    Sequential   Ω((n/√M)^(log2 8) · M)     Ω((n/√M)^(log2 7) · M)     Ω((n/√M)^ω0 · M)
    Parallel     Ω((n/√M)^(log2 8) · M/P)   Ω((n/√M)^(log2 7) · M/P)   Ω((n/√M)^ω0 · M/P)
SLIDE 22

Lower Bounds for Strassen

The communication lower bounds are similar to those for classical matmul:

                 Classical                   Strassen                    Strassen-like
    Sequential   Ω((n/√M)^(log2 8) · M)     Ω((n/√M)^(log2 7) · M)     Ω((n/√M)^ω0 · M)
    Parallel     Ω((n/√M)^(log2 8) · M/P)   Ω((n/√M)^(log2 7) · M/P)   Ω((n/√M)^ω0 · M/P)

- these lower bounds imply that Strassen and faster algorithms require less communication
- the sequential lower bound is attained by the recursive algorithm
- the parallel lower bound is attainable with a new algorithm [BDH+12]
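To see the gap between the classical and Strassen bounds concretely, evaluate (n/√M)^e · M for the two exponents. The helper and the problem sizes below are illustrative, not from the slides:

```python
from math import log2, sqrt

def seq_words_bound(n, M, exponent):
    """Sequential bandwidth lower bound (n / sqrt(M))^exponent * M."""
    return (n / sqrt(M)) ** exponent * M

n, M = 2 ** 14, 2 ** 10
classical = seq_words_bound(n, M, log2(8))  # log2(8) = 3: Omega(n^3 / sqrt(M))
strassen = seq_words_bound(n, M, log2(7))   # smaller exponent, smaller bound
assert strassen < classical  # whenever n / sqrt(M) > 1
```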

SLIDE 23

Lower bounds → algorithms

Proving lower bounds allows us to:
- evaluate existing algorithms
- identify possibilities for algorithmic innovation
- obtain asymptotic improvements (speedups increase with n and/or M)

SLIDE 24

Communication Upper Bounds - Sequential O(n3)

    Algorithm              Minimizes # words                            Minimizes # messages
    BLAS 3                 usual blocked or recursive algorithms [Gus97, FLPR99]
    Cholesky               LAPACK, [Gus97, AP00], [BDHS10]              [Gus97, AP00], [BDHS10]
    Symmetric indefinite   LAPACK (rarely), [BDD+12]                    [BDD+12]
    LU with pivoting       LAPACK (rarely), [Gus97, Tol97], [GDX11]     [GDX11]
    QR                     LAPACK (rarely), [FW03], [EG98], [DGHL11]    [FW03], [DGHL11]
    Eig, SVD               [BDK12, BDD11]                               [BDK12, BDD11]

Blue = work in which our group at Berkeley was involved

SLIDE 25

Communication Upper Bounds - Parallel O(n3)

    Algorithm          Reference    Factor exceeding          Factor exceeding
                                    lower bound, # words      lower bound, # messages
    Matrix multiply    [Can69]      1                         1
    Cholesky           ScaLAPACK    log P                     log P
    LU with pivoting   [GDX11]      log P                     log P
                       ScaLAPACK    log P                     (n/P^(1/2)) · log P
    QR                 [DGHL11]     log P                     log³ P
                       ScaLAPACK    log P                     (n/P^(1/2)) · log P
    SymEig, SVD        [BDK12]      log P                     log³ P
                       ScaLAPACK    log P                     n/P^(1/2)
    NonsymEig          [BDD11]      log P                     log³ P
                       ScaLAPACK    P^(1/2) · log P           n · log P

Blue = work in which our group at Berkeley was involved
Red = not optimal

Lower bounds: Ω(n²/√P) words, Ω(√P) messages

SLIDE 26

Communication Upper Bounds - Parallel O(n3)

    Algorithm          Reference    Factor exceeding          Factor exceeding
                                    lower bound, # words      lower bound, # messages
    Matrix multiply    [Can69]      1                         1
    Cholesky           ScaLAPACK    log P                     log P
    LU with pivoting   [GDX11]      log P                     log P
                       ScaLAPACK    log P                     (n/P^(1/2)) · log P
    QR                 [DGHL11]     log P                     log³ P
                       ScaLAPACK    log P                     (n/P^(1/2)) · log P
    SymEig, SVD        [BDK12]      log P                     log³ P
                       ScaLAPACK    log P                     n/P^(1/2)
    NonsymEig          [BDD11]      log P                     log³ P
                       ScaLAPACK    P^(1/2) · log P           n · log P

Blue = work in which our group at Berkeley was involved
Red = not optimal

Lower bounds: Ω(n²/√P) words, Ω(√P) messages

*This slide assumes that one copy of the data is distributed evenly across the processors

SLIDE 27

“2.5D” Algorithms

The lower bound for O(n³) linear algebra is Ω(n³ / (P√M)) words:
- communication requirements decrease as M increases

Standard algorithms use a 2D grid of processors:
- one copy of the data is stored across all processors
- local memory requirements: M = Θ(n²/P)
- # words moved = Θ(n²/P^(1/2))

For matrix multiplication, the processors can instead be visualized as a 3D grid:
- requires space for P^(1/3) copies of the data
- local memory requirements: M = Θ(n²/P^(2/3))
- # words moved = Θ(n²/P^(2/3))
SLIDE 28

“2.5D” Algorithms

2.5D algorithms interpolate between 2D and 3D algorithms [SD11]:
- use as much local memory as is physically available, up to M = Θ(n²/P^(2/3)) as in the 3D case
- enable strong scaling (of both computation and communication)

New algorithms were developed for matrix multiplication (and LU):
- perfect strong scaling achieved on BG/P (“Intrepid” at Argonne)
- speedup of 12× over 2D matmul with n = 8192 and P = 16384
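The interpolation can be sketched with the standard bandwidth accounting for c replicated copies of the data: M = c·n²/P and words moved ≈ n²/√(cP), which matches the 2D cost at c = 1 and the 3D cost at c = P^(1/3). The helper below is my illustration of that scaling, not code from [SD11]:

```python
from math import sqrt

def words_moved_25d(n, P, c):
    """Approximate words moved per processor with c data replicas."""
    return n * n / sqrt(c * P)

n, P = 2 ** 13, 64
two_d = words_moved_25d(n, P, 1)                      # ~ n^2 / P^(1/2)
three_d = words_moved_25d(n, P, round(P ** (1 / 3)))  # ~ n^2 / P^(2/3)
assert three_d < two_d  # extra memory buys less communication
```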

SLIDE 29

“2.5D” Algorithms

2.5D algorithms interpolate between 2D and 3D algorithms [SD11]:
- use as much local memory as is physically available, up to M = Θ(n²/P^(2/3)) as in the 3D case
- enable strong scaling (of both computation and communication)

New algorithms were developed for matrix multiplication (and LU):
- perfect strong scaling achieved on BG/P (“Intrepid” at Argonne)
- speedup of 12× over 2D matmul with n = 8192 and P = 16384

[Figure: percentage of machine peak vs. number of nodes (256 to 2048) for 2.5D MM and 2D MM, matrix multiplication on BG/P (n = 65,536)]

SLIDE 30

Parallel Strassen

The lower bound for parallel Strassen is Ω((n/√M)^(log2 7) · M/P) words.

Previous attempts to parallelize Strassen are not optimal. Our new algorithm [BDH+12]:
- uses as much local memory as is physically available, up to M = O(n²/P^(2/log2 7))
- is asymptotically optimal (up to a log P factor)
- is faster in practice than classical algorithms and previous Strassen-based algorithms
  - in fact, faster than classical peak performance, as measured on a Cray XT4 (“Franklin” at NERSC)

SLIDE 31

Performance of Parallel Strassen

[Figure: effective GFlop/s per node at P = 49, 343, and 2401 for the new parallel algorithm, a combined algorithm, two previous parallel Strassen implementations, CA classical (2.5D), and ScaLAPACK, compared against the absolute maximum for all classical algorithms]

fixed problem size n = 94080

SLIDE 32

Ongoing Work

- Extending lower bounds to arbitrary nested loops (beyond linear algebra)
- Symmetric indefinite factorization: algorithms and implementation
- Avoiding communication in sparse iterative methods
- Tuning the parallel Strassen implementation
- Sparse matrix-matrix multiplication: lower bounds and algorithms
- Autotuning implementations of new algorithms

SLIDE 33

Thank You!

Please contact me with questions!
ballard@cs.berkeley.edu
http://www.eecs.berkeley.edu/~ballard

Find links to papers and other resources at the BEBOP webpage: http://bebop.cs.berkeley.edu/

SLIDE 34

References I

  • N. Ahmed and K. Pingali.

Automatic generation of block-recursive codes. In Euro-Par ’00: Proceedings from the 6th International Euro-Par Conference on Parallel Processing, pages 368–378, London, UK, 2000. Springer-Verlag.

  • G. Ballard, J. Demmel, and I. Dumitriu.

Communication-optimal parallel and sequential eigenvalue and singular value algorithms. Technical Report No. UCB/EECS-2011-14, 2011.

  • G. Ballard, J. Demmel, A. Druinsky, I. Peled, O. Schwartz, and S. Toledo.

Communication avoiding symmetric indefinite factorization, 2012. In preparation.

  • G. Ballard, J. Demmel, O. Holtz, B. Lipshitz, E. Rom, and O. Schwartz.

Communication-optimal parallel algorithm for Strassen’s matrix multiplication, 2012. In preparation.

  • G. Ballard, J. Demmel, O. Holtz, and O. Schwartz.

Communication-optimal parallel and sequential Cholesky decomposition. SIAM Journal on Scientific Computing, 32(6):3495–3523, 2010.

  • G. Ballard, J. Demmel, O. Holtz, and O. Schwartz.

Graph expansion and communication costs of fast matrix multiplication. In Proceedings of the 23rd ACM symposium on Parallelism in algorithms and architectures, SPAA ’11, pages 1–12, New York, NY, USA, 2011. ACM.

  • G. Ballard, J. Demmel, O. Holtz, and O. Schwartz.

Minimizing communication in numerical linear algebra. SIAM Journal on Matrix Analysis and Applications, 32(3):866–901, 2011.

SLIDE 35

References II

  • G. Ballard, J. Demmel, and N. Knight.

Communication avoiding successive band reduction. In Proceedings of the 17th ACM Symposium on Principles and Practice of Parallel Programming, PPoPP ’12, New York, NY, USA, 2012. ACM. To appear.

  • L. Cannon.

A cellular computer to implement the Kalman filter algorithm. PhD thesis, Montana State University, Bozeman, MT, 1969.

  • J. Demmel, L. Grigori, M. Hoemmen, and J. Langou.

Communication-optimal parallel and sequential QR and LU factorizations. UC Berkeley Technical Report EECS-2008-89, Aug 1, 2008; to appear in SIAM J. Sci. Comp., 2011.

  • E. Dekel, D. Nassimi, and S. Sahni.

Parallel matrix and graph algorithms. SIAM Journal on Computing, 10(4):657–675, 1981.

  • E. Elmroth and F. Gustavson.

New serial and parallel recursive QR factorization algorithms for SMP systems. In B. Kågström et al., editor, Applied Parallel Computing. Large Scale Scientific and Industrial Problems., volume 1541 of Lecture Notes in Computer Science, pages 120–128. Springer, 1998.

  • M. Frigo, C. E. Leiserson, H. Prokop, and S. Ramachandran.

Cache-oblivious algorithms. In FOCS ’99: Proceedings of the 40th Annual Symposium on Foundations of Computer Science, page 285, Washington, DC, USA, 1999. IEEE Computer Society.

SLIDE 36

References III

  • J. D. Frens and D. S. Wise.

QR factorization with Morton-ordered quadtree matrices for memory re-use and parallelism. SIGPLAN Not., 38(10):144–154, 2003.

  • L. Grigori, J. Demmel, and H. Xiang.

CALU: A communication optimal LU factorization algorithm. SIAM Journal on Matrix Analysis and Applications, 32(4):1317–1350, 2011.

  • F. G. Gustavson.

Recursion leads to automatic variable blocking for dense linear-algebra algorithms. IBM J. Res. Dev., 41(6):737–756, 1997.

  • J. W. Hong and H. T. Kung.

I/O complexity: The red-blue pebble game. In STOC ’81: Proceedings of the thirteenth annual ACM symposium on Theory of computing, pages 326–333, New York, NY, USA, 1981. ACM.

  • D. Irony, S. Toledo, and A. Tiskin.

Communication lower bounds for distributed-memory matrix multiplication. J. Parallel Distrib. Comput., 64(9):1017–1026, 2004.
  • L. H. Loomis and H. Whitney.

An inequality related to the isoperimetric inequality. Bulletin of the AMS, 55:961–962, 1949.

  • J. E. Savage.

Extending the Hong-Kung model to memory hierarchies. In Computing and Combinatorics, volume 959, pages 270–281. Springer Verlag, 1995.

SLIDE 37

References IV

  • E. Solomonik and J. Demmel.

Communication-optimal parallel 2.5D matrix multiplication and LU factorization algorithms. In Proceedings of the 17th International Conference on Parallel Processing - Volume Part II, Euro-Par ’11, pages 90–109, Berlin, Heidelberg, 2011. Springer-Verlag.

  • S. Toledo.

Locality of reference in LU decomposition with partial pivoting. SIAM J. Matrix Anal. Appl., 18(4):1065–1081, 1997.
