

  1. I/O Lower Bounds and Algorithms for Matrix-Matrix Multiplication. Tyler M. Smith. July 5, 2017.

  2. Introduction
     ◮ Dense matrix-matrix multiplication (MMM)
     ◮ Goal: reduce I/O cost for machines with hierarchical memory
     ◮ Novel contributions:
        ◮ I/O lower bounds with a tight constant: 2mnk/√S
        ◮ A family of algorithms for machines with any number of levels of memory hierarchy
        ◮ Outperform the state-of-the-art Goto’s Algorithm by 38% when there is low bandwidth to main memory

  3. Problem definition
     ◮ Classical MMM: C += AB
     ◮ C is m × n, A is m × k, and B is k × n
     ◮ Goal: reduce the I/O cost of MMM algorithms
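For reference, a minimal unblocked implementation of C += AB. This sketch is not from the slides; column-major storage with leading dimensions, as in the BLAS, is an assumption of the example:

    #include <stddef.h>

    /* Naive C += A*B for column-major matrices: C is m x n, A is m x k,
     * B is k x n; ldX is the leading dimension (column stride) of X. */
    void mmm_naive(size_t m, size_t n, size_t k,
                   const double *A, size_t lda,
                   const double *B, size_t ldb,
                   double *C, size_t ldc)
    {
        for (size_t j = 0; j < n; j++)
            for (size_t p = 0; p < k; p++)
                for (size_t i = 0; i < m; i++)
                    C[i + j * ldc] += A[i + p * lda] * B[p + j * ldb];
    }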

  4. Hierarchical memory

  5. Blocked algorithms
     ◮ MMM is an operation with many opportunities for reuse:
        ◮ Each element of A is used n times
        ◮ Each element of B is used m times
        ◮ Each element of C is used k times
     ◮ With O(n²) elements, one can perform O(n³) flops
     ◮ If all matrices fit into fast memory, the O(n²) memops are amortized over O(n³) flops
     ◮ Work with blocks of the matrices at a time, where the blocks fit into fast memory (see the sketch below)
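A minimal sketch of the blocked idea, reusing mmm_naive above as the in-fast-memory computation. The block sizes are illustrative, not values from the slides:

    /* Blocked C += A*B: each iteration touches an mb x kb block of A,
     * a kb x nb block of B, and an mb x nb block of C, chosen to fit
     * together in fast memory, so every loaded element is reused. */
    void mmm_blocked(size_t m, size_t n, size_t k,
                     const double *A, size_t lda,
                     const double *B, size_t ldb,
                     double *C, size_t ldc)
    {
        const size_t mb = 64, nb = 64, kb = 64; /* illustrative block sizes */
        for (size_t jb = 0; jb < n; jb += nb)
            for (size_t pb = 0; pb < k; pb += kb)
                for (size_t ib = 0; ib < m; ib += mb) {
                    size_t mc = m - ib < mb ? m - ib : mb;
                    size_t nc = n - jb < nb ? n - jb : nb;
                    size_t kc = k - pb < kb ? k - pb : kb;
                    mmm_naive(mc, nc, kc,
                              &A[ib + pb * lda], lda,
                              &B[pb + jb * ldb], ldb,
                              &C[ib + jb * ldc], ldc);
                }
    }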

  6. Building blocks of dense linear algebra
     ◮ MMM is the bottom of the food chain:
        ◮ Level-3 BLAS
        ◮ LAPACK/FLAME
        ◮ ScaLAPACK/Elemental

  7. Outline
     ◮ Introduction
     ◮ State-of-the-art MMM: Goto’s Algorithm
     ◮ Lower bounds
     ◮ Algorithms
     ◮ Experiments

  8. Goto’s Algorithm. [Figure: the five loops around the micro-kernel, with each block annotated by the level of the memory hierarchy it occupies: main memory, L3 cache, L2 cache, L1 cache, registers.]

  9. Goto’s Algorithm. 5th loop around the micro-kernel: partition the n dimension by n_c, giving C_j += A B_j for n_c-wide panels C_j and B_j. (All data resides in main memory.)

  10. Goto’s Algorithm. 4th loop around the micro-kernel: partition the k dimension by k_c, giving C_j += A_p B_p; pack B_p → B̃_p, so that B̃_p occupies the L3 cache.

  11. Goto’s Algorithm. 3rd loop around the micro-kernel: partition the m dimension by m_c, giving C_i += A_i B̃_p; pack A_i → Ã_i, so that Ã_i occupies the L2 cache.

  12. Goto’s Algorithm. 2nd loop around the micro-kernel: partition C_i and B̃_p by n_r; an n_r-wide micro-panel of B̃_p occupies the L1 cache.

  13. Goto’s Algorithm. 1st loop around the micro-kernel: partition by m_r, selecting an m_r-tall micro-panel of Ã_i.

  14. Goto’s Algorithm. Micro-kernel: an m_r × n_r block of C, held in registers, is updated by a sequence of width-1 (rank-1) updates. [Figure: the complete diagram, combining slides 9 through 13.]
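A schematic of the five loops in C. The packing routines and micro-kernel are declared as stubs, this is an illustrative skeleton rather than BLIS’s actual interface, fringe cases are ignored, and the block sizes are typical values rather than ones taken from the slides:

    #include <stddef.h>

    enum { NC = 4096, KC = 256, MC = 96, NR = 8, MR = 8 }; /* illustrative */

    /* Stubs: packing copies a block into a contiguous buffer shaped for its
     * cache level; the micro-kernel updates an MR x NR block of C in registers. */
    void pack_B(size_t kc, size_t nc, const double *B, size_t ldb, double *Bt);
    void pack_A(size_t mc, size_t kc, const double *A, size_t lda, double *At);
    void microkernel(size_t kc, const double *At, const double *Bt,
                     double *C, size_t ldc);

    /* Goto's Algorithm, loops numbered from the micro-kernel outward.
     * Assumes m, n, k are multiples of the block sizes. */
    void goto_mmm(size_t m, size_t n, size_t k,
                  const double *A, size_t lda,
                  const double *B, size_t ldb,
                  double *C, size_t ldc)
    {
        static double Bt[KC * NC]; /* B~_p, sized for the L3 cache */
        static double At[MC * KC]; /* A~_i, sized for the L2 cache */

        for (size_t jc = 0; jc < n; jc += NC)           /* 5th loop: C_j += A B_j    */
            for (size_t pc = 0; pc < k; pc += KC) {     /* 4th loop: C_j += A_p B_p  */
                pack_B(KC, NC, &B[pc + jc * ldb], ldb, Bt);
                for (size_t ic = 0; ic < m; ic += MC) { /* 3rd loop: C_i += A_i B~_p */
                    pack_A(MC, KC, &A[ic + pc * lda], lda, At);
                    for (size_t jr = 0; jr < NC; jr += NR)      /* 2nd loop (L1) */
                        for (size_t ir = 0; ir < MC; ir += MR)  /* 1st loop      */
                            microkernel(KC,
                                        &At[ir * KC],   /* MR x KC micro-panel */
                                        &Bt[jr * KC],   /* KC x NR micro-panel */
                                        &C[(ic + ir) + (jc + jr) * ldc], ldc);
                }
            }
    }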

  15. I/O cost of Goto’s Algorithm
     ◮ Reuse dictates the I/O cost for Goto’s Algorithm
     ◮ Each time an element is read from main memory:
        ◮ An element of A is reused n_c times
        ◮ An element of B is reused m times
        ◮ An element of C is reused k_c times
     ◮ Overall I/O costs of:
        ◮ A: mnk/n_c
        ◮ B: mnk/m
        ◮ C: mnk/k_c
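Each operand’s total is the mnk multiply-accumulates divided by its reuse factor; summed, the main-memory traffic is

    \mathrm{I/O}_{\mathrm{Goto}} = \frac{mnk}{n_c} + \frac{mnk}{m} + \frac{mnk}{k_c}
                                 = \frac{mnk}{n_c} + nk + \frac{mnk}{k_c}.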

  16. Roofline model. [Figure: roofline plot for a 4-core Intel i7-7700K; x-axis: flops per byte (1 to 512, log scale), y-axis: GFLOPS; Goto’s Algorithm is plotted against the roofline.]

  17. Roofline model. [Figure: two roofline plots, with bandwidth to main memory of 51.2 GB/s and 6.4 GB/s; same axes as slide 16, with Goto’s Algorithm plotted on each.]
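The roofline itself is a one-line model: attainable performance is the lesser of peak compute and bandwidth times arithmetic intensity. A minimal sketch; the 450 GFLOPS peak is an assumed figure for illustration, not taken from the slides:

    #include <stdio.h>

    /* Roofline: attainable GFLOPS = min(peak, bandwidth x flops-per-byte). */
    static double roofline(double peak_gflops, double bw_gb_s, double flops_per_byte)
    {
        double bw_bound = bw_gb_s * flops_per_byte;
        return bw_bound < peak_gflops ? bw_bound : peak_gflops;
    }

    int main(void)
    {
        for (double fpb = 1.0; fpb <= 512.0; fpb *= 2.0)
            printf("%6.0f flops/byte: %7.1f GFLOPS at 51.2 GB/s, %7.1f GFLOPS at 6.4 GB/s\n",
                   fpb, roofline(450.0, 51.2, fpb), roofline(450.0, 6.4, fpb));
        return 0;
    }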

  18. Outline
     ◮ Introduction
     ◮ State-of-the-art MMM
     ◮ Lower bounds
     ◮ Algorithms
     ◮ Experiments

  19. I/O lower bounds
     ◮ Theoretical minimum I/O cost for an operation
     ◮ We want to find the greatest I/O lower bound
     ◮ Model of computation:
        ◮ 2 layers of memory: slow and fast
        ◮ Slow memory has unlimited capacity
        ◮ Fast memory has capacity S
        ◮ Data must be in fast memory before computing with it

  20. Related work
     ◮ Hong and Kung (1981)
        ◮ I/O lower bound: Ω(mnk/√S)
     ◮ Irony, Toledo, and Tiskin (2004)
        ◮ I/O lower bound: mnk/(2√2·√S)
        ◮ With a little calculus this can be improved to mnk/√S
     ◮ Tyler Smith, Robert van de Geijn, Bradley Lowery, and Julien Langou (2017)
        ◮ I/O lower bound: 2mnk/√S
        ◮ Under submission at ACM TOMS

  21. Lower bound strategy
     ◮ Consider any algorithm for MMM
     ◮ Break the algorithm into phases, where each phase has an I/O cost of exactly S (except possibly the last)
     ◮ If there must be at least h phases, and each phase has an I/O cost of S, the overall I/O cost must be at least Sh
     ◮ Determine the minimum number of phases:
        ◮ Let F be an upper bound on the multiplications during a phase
        ◮ There are mnk total multiplications during MMM
        ◮ So there must be at least mnk/F phases
     ◮ Determine F based on the number of elements available:
        ◮ Each phase: 2S elements available as inputs and 2S elements available as outputs
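In symbols, the phase argument gives (the last phase may cost less than S, hence the −1):

    \mathrm{I/O} \ge S\left(\frac{mnk}{F} - 1\right) \approx S\,\frac{mnk}{F}.

The 2S counts arise, roughly, because a phase can use the at most S words already in fast memory when it begins plus the at most S words it moves, and likewise for outputs.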

  22. Upper bound on elementary multiplications in a phase (Irony, Toledo, and Tiskin, 2004)
     ◮ Inequality from Loomis and Whitney (1949):
        ◮ Using N_A, N_B, and N_C elements of A, B, and C, one can perform at most √(N_A N_B N_C) multiplications
     ◮ At most 2S elements available as inputs, and 2S elements available as outputs:
        ◮ N_A ≤ 2S, N_B ≤ 2S, and N_C ≤ 2S
     ◮ At most √(8S³) = 2√2·S√S multiplications in a phase
     ◮ Gives an overall lower bound of mnk/(2√2·√S)
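Carrying the constants through explicitly:

    F \le \sqrt{N_A N_B N_C} \le \sqrt{(2S)^3} = 2\sqrt{2}\,S\sqrt{S},
    \qquad
    \mathrm{I/O} \ge S\,\frac{mnk}{F} = \frac{mnk}{2\sqrt{2}\,\sqrt{S}}.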

  23. Improving the lower bound
     ◮ Assume we perform FMAs instead of elementary multiplications
        ◮ In an FMA, elements of A, B, and C are all inputs
        ◮ This lets us reason about the input cost of C
     ◮ What if we generalize the I/O cost of each phase?
        ◮ Each phase can have S + M inputs and S + M outputs
        ◮ This adds a degree of freedom to our lower bound

  24. Upper bound on FMAs during a phase
     ◮ There are at most S + M inputs:
        ◮ N_A + N_B + N_C ≤ S + M
     ◮ We again use the Loomis-Whitney inequality:
        ◮ Maximize √(N_A N_B N_C) subject to N_A + N_B + N_C = S + M
        ◮ Maximized when N_A = N_B = N_C
     ◮ Then our lower bound is 3√3·M·mnk / ((S + M)√(S + M))
     ◮ Finding the greatest lower bound:
        ◮ Maximizing over M, this occurs when M = 2S
        ◮ The greatest lower bound is 2mnk/√S
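Written out, with each phase now costing M words of I/O (the slide’s own argument, with the steps made explicit):

    F \le \left(\frac{S+M}{3}\right)^{3/2}
    \quad\Longrightarrow\quad
    \mathrm{I/O} \ge M\,\frac{mnk}{F} = \frac{3\sqrt{3}\,M\,mnk}{(S+M)^{3/2}},
    \qquad
    \frac{d}{dM}\,\frac{M}{(S+M)^{3/2}} = 0 \;\Longrightarrow\; M = 2S,

and substituting M = 2S gives 3√3·2S·mnk/(3S)^{3/2} = 2mnk/√S.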

  25. Roofline model. [Figure: the two roofline plots of slide 17 (51.2 GB/s and 6.4 GB/s bandwidth to main memory), now with the I/O lower bound plotted alongside Goto’s Algorithm.]

  26. Outline
     ◮ Introduction
     ◮ State-of-the-art MMM
     ◮ Lower bounds
     ◮ Algorithms
        ◮ Single level of cache
        ◮ Multiple levels of cache
     ◮ Experiments

  27. Resident C. [Figure: C += A B.]

  28. Resident C. Partition the m dimension by m_c.

  29. Resident C. Partition the n dimension by n_c.

  30. Resident C. Move an m_c × n_c block of C into fast memory.

  31. Resident C. Stream panels of A and B from slow memory.

  32. Resident C. Partition the k dimension with width 1.

  33. Resident C. Move vectors into fast memory.

  34. I/O cost for Resident C
     ◮ I/O cost per block dot product:
        ◮ C_ij: m_c·n_c reads and m_c·n_c writes
        ◮ A_i: m_c·k reads
        ◮ B_j: k·n_c reads
     ◮ Total I/O cost:
        ◮ C: mn reads and mn writes
        ◮ A: mnk/n_c reads
        ◮ B: mnk/m_c reads
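The totals follow by multiplying the per-block costs by the (m/m_c)(n/n_c) blocks of C:

    \frac{m}{m_c}\cdot\frac{n}{n_c}\cdot m_c k = \frac{mnk}{n_c}
    \quad\text{(reads of } A\text{)},
    \qquad
    \frac{m}{m_c}\cdot\frac{n}{n_c}\cdot k n_c = \frac{mnk}{m_c}
    \quad\text{(reads of } B\text{)}.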

  35. Choosing blocksizes for Resident C
     ◮ If m_c ≈ n_c ≈ √S:
     ◮ Total I/O cost:
        ◮ C: mn reads and mn writes
        ◮ A: mnk/√S reads
        ◮ B: mnk/√S reads
     ◮ If m, n, k are large and we can ignore lower-order terms:
        ◮ The I/O cost is 2mnk/√S
        ◮ The same as the lower bound (see the sketch below)
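A minimal sketch of the Resident C algorithm, reusing mmm_naive from the earlier sketch as the in-fast-memory kernel. No explicit packing is done: keeping the C block resident is left to the cache here, which is an assumption of this sketch rather than something the slides prescribe:

    #include <math.h>

    /* Resident C: keep an mc x nc block of C in fast memory and stream
     * panels of A and B past it. With mc ~ nc ~ sqrt(S), the I/O cost
     * approaches 2mnk/sqrt(S). */
    void mmm_resident_c(size_t m, size_t n, size_t k,
                        const double *A, size_t lda,
                        const double *B, size_t ldb,
                        double *C, size_t ldc,
                        size_t S /* fast-memory capacity in elements */)
    {
        size_t bc = (size_t)sqrt((double)S); /* mc = nc ~ sqrt(S) */
        for (size_t jb = 0; jb < n; jb += bc)
            for (size_t ib = 0; ib < m; ib += bc) {
                size_t mc = m - ib < bc ? m - ib : bc;
                size_t nc = n - jb < bc ? n - jb : bc;
                /* Block dot product C_ij += A_i B_j; the k loop inside
                 * mmm_naive plays the role of the width-1 partitions. */
                mmm_naive(mc, nc, k,
                          &A[ib], lda,
                          &B[jb * ldb], ldb,
                          &C[ib + jb * ldc], ldc);
            }
    }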

  36. Three algorithms. [Figure: Resident C, Resident B, and Resident A, each drawn as C += A B with the resident operand’s block in cache and the remaining data in main memory.]

  37. Resident A, B, and C algorithms in Goto’s Algorithm

  38. Algorithms for multiple levels of cache
     ◮ Suppose we have 2 levels of cache: L2 and L1
     ◮ We have 3 algorithms: Resident A, Resident B, and Resident C
        ◮ Each is associated with a shape of MMM
     ◮ Suppose we encounter one of those shapes at the L2 level
        ◮ How do we also encounter one at the L1 level?
        ◮ We can do it with two loops (sketched after slide 41)

  39. Resident C at the L2 cache. [Figure: C += A B, with the resident block of C in the L2 cache.]

  40. L1 outer loop: partition the k dimension. [Figure: the block of C remains resident in the L2 cache.]

  41. L1 outer loop: partition the k dimension. [Figure: each k-partition yields a smaller MMM sub-problem; the block of C remains resident in the L2 cache.]
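A hedged sketch of the two-loop idea for two cache levels: Resident C at L2, with the k dimension partitioned so that each sub-problem presents a shape a resident algorithm can exploit at L1. The loop structure, block sizes, and the use of mmm_naive as a stand-in L1 kernel are illustrative assumptions, not the slides’ exact formulation:

    /* Two levels of cache: an MC2 x NC2 block of C is resident in the L2
     * cache while an outer loop partitions k by KC1; each resulting
     * sub-problem is small enough for a resident algorithm at the L1 level
     * (mmm_naive stands in for that L1 kernel). Assumes m, n, k are
     * multiples of the block sizes. */
    enum { MC2 = 256, NC2 = 256, KC1 = 64 }; /* illustrative block sizes */

    void mmm_two_level(size_t m, size_t n, size_t k,
                       const double *A, size_t lda,
                       const double *B, size_t ldb,
                       double *C, size_t ldc)
    {
        for (size_t jb = 0; jb < n; jb += NC2)
            for (size_t ib = 0; ib < m; ib += MC2)      /* C block resident in L2 */
                for (size_t pb = 0; pb < k; pb += KC1)  /* L1 outer loop over k   */
                    mmm_naive(MC2, NC2, KC1,
                              &A[ib + pb * lda], lda,
                              &B[pb + jb * ldb], ldb,
                              &C[ib + jb * ldc], ldc);
    }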
