LOW-COMMUNICATION FFT WITH FAST MULTIPOLE METHOD, Cris Cecka, Senior Research Scientist


  1. May 8-11, 2017 | Silicon Valley LOW-COMMUNICATION FFT WITH FAST MULTIPOLE METHOD Cris Cecka, Senior Research Scientist. May 11, 2017

  2. THE FAST FOURIER TRANSFORM Operation Count: $4N\log_2 N - 6N + 8$
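
As a quick check of the operation count on this slide, the sketch below evaluates $4N\log_2 N - 6N + 8$ for a few power-of-two sizes. The helper name `split_radix_flops` is hypothetical and not from the talk.

```python
def split_radix_flops(n: int) -> int:
    """Operation count 4 N log2(N) - 6 N + 8 quoted on the slide (N a power of two)."""
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two >= 2")
    log2_n = n.bit_length() - 1
    return 4 * n * log2_n - 6 * n + 8

# Print the count for a few sizes, including the N = 2^27 used later in the deck.
for n in (2, 4, 8, 1 << 27):
    print(n, split_radix_flops(n))
```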

  3. SPLIT-RADIX FFT Algorithm

  4. SPLIT-RADIX FFT Profile

  5. FMM-FFT (Edelman et al. 1999)

  6. STRUCTURED DENSE MATRICES AND FMM • SVD: $A = U D V^*$ • Low-Rank: $K = U \tilde K_{r \times r} V^*$ • Hierarchically LR: $K_{IJ} = U_I \tilde K_{IJ} V_J^*$ • H-Semi-Separable: $K_{IJ} = U_I U_{\tilde I} \tilde K_{\tilde I \tilde J} V_{\tilde J}^* V_J^*$ • $H^2$-Matrix/FMM

  7. FMM-FFT Algorithm $M_{M,P} = \mathrm{diag}(I_M, C_1, \ldots, C_{P-1})$, $[C_p]_{mn} = \rho_p \left[\cot\left(\frac{\pi}{M}\left(n - m + \frac{p}{P}\right)\right) + i\right]$ • 2D $M \times P$ FFT
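
The cotangent form of $C_p$ can be checked numerically on a small case. The slide does not define $\rho_p$, so the sketch below assumes $\rho_p = e^{-i\pi p/P}\sin(\pi p/P)/M$, which is what results from taking $C_p = F_M^{-1}\,\mathrm{diag}(\omega^{pk})\,F_M$ with $\omega = e^{-2\pi i/N}$ and $N = MP$. That sign convention and all function names are assumptions made for illustration.

```python
import cmath
import math

def cot_matrix(M, P, p):
    """[C_p]_{mn} = rho_p * (cot(pi/M * (n - m + p/P)) + i), for 0 < p < P."""
    # Assumption: rho_p is not given on the slide; this value follows from
    # C_p = F_M^{-1} diag(w^(p k)) F_M with w = exp(-2 pi i / N), N = M * P.
    rho = cmath.exp(-1j * math.pi * p / P) * math.sin(math.pi * p / P) / M
    return [[rho * (1.0 / math.tan(math.pi / M * (n - m + p / P)) + 1j)
             for n in range(M)] for m in range(M)]

def reference_matrix(M, P, p):
    """The same matrix summed directly from the DFT factors (O(M^3), small M only)."""
    def entry(m, n):
        return sum(cmath.exp(-2j * math.pi * k * (n - m + p / P) / M)
                   for k in range(M)) / M
    return [[entry(m, n) for n in range(M)] for m in range(M)]

def max_abs_diff(A, B):
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))

M, P = 8, 4
errors = [max_abs_diff(cot_matrix(M, P, p), reference_matrix(M, P, p))
          for p in range(1, P)]
print(errors)  # each entry should be at rounding-error level
```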

  8. COT FMM $[C_p]_{mn} = \rho_p \left[\cot\left(\frac{\pi}{M}\left(n - m + \frac{p}{P}\right)\right) + i\right]$ • One dimensional • Uniform: integers are source/target • Periodic • Distributed • Size M-by-M • P of them! • Interleaved

  9. FMM OPERATORS [Figure: operator tree from base level B=2 to leaf level L=4, with S2M, M2M, M2L, L2L, L2T, and S2T; Q coefficients per box, $M/2^L$ points per leaf box] • Each operator is an (implicit) matrix. • S: “Source” • T: “Target” • M: “Multipole” • L: “Local”

  10. PARAMETERS OF THE FMM-FFT • FFT: $N = MP$ • FMM: $(N, P, M_L, Q, B)$ • Rank $Q$ • Base level $B$ • Leaf box size $M_L$ • Leaf level $L = \log_2(M/M_L)$
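
The relations $N = MP$ and $L = \log_2(M/M_L)$ can be bundled into a small parameter record. This is a minimal sketch: the class name `FmmFftParams` is hypothetical, and the example values are the configuration quoted on the parameter-dependence slides ($N = 2^{27}$, $P = 256$, $M_L = 64$, $Q = 16$, $B = 3$).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FmmFftParams:
    N: int    # FFT size, N = M * P
    P: int    # number of FMMs
    M_L: int  # leaf box size (points per box per FMM)
    Q: int    # rank / quadrature order
    B: int    # base level

    @property
    def M(self) -> int:
        """Size of each FMM, M = N / P."""
        if self.N % self.P:
            raise ValueError("P must divide N")
        return self.N // self.P

    @property
    def L(self) -> int:
        """Leaf level L = log2(M / M_L); M / M_L must be a power of two."""
        boxes, rem = divmod(self.M, self.M_L)
        if rem or boxes == 0 or boxes & (boxes - 1):
            raise ValueError("M / M_L must be a power of two")
        return boxes.bit_length() - 1

params = FmmFftParams(N=1 << 27, P=256, M_L=64, Q=16, B=3)
print(params.M, params.L)
```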

  11. DISTRIBUTED FMM [Figure: communication steps on each device: All2All, Gather, Halo 2b, Halo 2b, Halo 1b]

  12. INTERPOLATIVE FMM $C_{ij} \approx \ell_m(\tilde t_i)\,\ell_q(\tilde z_m)\,C(z_q, z_r)\,\ell_r(\tilde z_n)\,\ell_n(\tilde s_j)$, with factors L2T, L2L, M2L, M2M, S2M from left to right; $\ell_i(z) = \prod_{0 \le k < Q,\ k \ne i} \frac{z - z_k}{z_i - z_k}$, $z_j = \cos\left(\frac{(2j+1)\pi}{2Q}\right)$ • Same operators across all boxes • Same operators across all levels • Almost same operators across all FMMs
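
The interpolation basis on this slide is the Lagrange basis on Chebyshev nodes. The sketch below builds both and interpolates a cotangent sampled away from its singularity, as a stand-in for a well-separated (far-field) interaction. The test function, its scaling, and all names are assumptions chosen for illustration.

```python
import math

def chebyshev_nodes(Q):
    """z_j = cos((2j + 1) pi / (2Q)), j = 0 .. Q-1."""
    return [math.cos((2 * j + 1) * math.pi / (2 * Q)) for j in range(Q)]

def lagrange(i, z, nodes):
    """l_i(z) = prod_{k != i} (z - z_k) / (z_i - z_k)."""
    value = 1.0
    for k, zk in enumerate(nodes):
        if k != i:
            value *= (z - zk) / (nodes[i] - zk)
    return value

def interpolate(f, z, nodes):
    """Degree-(Q-1) interpolant of f evaluated at z."""
    return sum(f(zi) * lagrange(i, z, nodes) for i, zi in enumerate(nodes))

Q = 16
nodes = chebyshev_nodes(Q)

# Hypothetical smooth "far-field" function: its singularity sits at z = -3,
# outside the interpolation interval [-1, 1].
far_field = lambda z: 1.0 / math.tan(0.25 * (z + 3.0))

samples = [-1.0 + 0.01 * t for t in range(201)]
max_err = max(abs(interpolate(far_field, z, nodes) - far_field(z)) for z in samples)
print(max_err)
```

Raising or lowering `Q` in this sketch changes the interpolation error, which is the accuracy knob the later slide on quadrature order refers to.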

  13. TENSOR REPRESENTATIONS $A_{ijk\ell}$ := A[i + j*ldA<1> + k*ldA<2> + l*ldA<3>] • Input: $S_n \equiv S_{pm} \equiv S_{pmb}$ • Output: $T_n \equiv T_{pm} \equiv T_{pmb}$
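
The tensor views on this slide are different stride sets over one flat array. The sketch below assumes the interleaved layout $n = p + Pm$ with $m = m_{\text{local}} + M_L b$. That layout is an assumption suggested by the "Interleaved" bullet earlier, and the names and sizes are hypothetical.

```python
def flat_offset(index, strides):
    """Offset i*ld<0> + j*ld<1> + k*ld<2> + ...; ld<0> is 1 for the fastest index."""
    return sum(i * s for i, s in zip(index, strides))

P, M_L, n_boxes = 4, 2, 3        # small hypothetical sizes
M = M_L * n_boxes
S = list(range(P * M))           # flat input S_n, here S_n = n for easy checking

# Assumed interleaving: n = p + P * m, with m = m_local + M_L * b.
strides_pm = (1, P)              # view S_pm
strides_pmb = (1, P, P * M_L)    # view S_pmb

p, m_local, b = 3, 1, 2
m = m_local + M_L * b
print(S[flat_offset((p, m), strides_pm)],
      S[flat_offset((p, m_local, b), strides_pmb)])
```

Both views address the same element without copying, which is what lets one buffer feed the 2D FFT and the batched FMM operators.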

  14. S2M/L2T $s_m = -1 + \frac{2m+1}{M_L}$, $\mathrm{S2M}_{qm} = \ell_q(s_m)$, $M^{L}_{(p-1)qb} = \mathrm{S2M}_{qm}\, S_{pmb}$ • Computed with single BatchedGEMM
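
A plain-Python sketch of the S2M step follows, assuming the sources are stored as nested lists `S[p][b][m]` and that the $p = 0$ block is skipped because it is the identity $I_M$. The loops stand in for the single batched GEMM named on the slide, and the function names are hypothetical.

```python
import math

def chebyshev_nodes(Q):
    return [math.cos((2 * j + 1) * math.pi / (2 * Q)) for j in range(Q)]

def lagrange(q, z, nodes):
    value = 1.0
    for k, zk in enumerate(nodes):
        if k != q:
            value *= (z - zk) / (nodes[q] - zk)
    return value

def s2m_operator(Q, M_L):
    """S2M[q][m] = l_q(s_m) with s_m = -1 + (2m + 1) / M_L."""
    nodes = chebyshev_nodes(Q)
    return [[lagrange(q, -1.0 + (2 * m + 1) / M_L, nodes) for m in range(M_L)]
            for q in range(Q)]

def apply_s2m(S2M, S):
    """Leaf multipoles out[p-1][b][q] = sum_m S2M[q][m] * S[p][b][m] for p >= 1."""
    # p = 0 is skipped: its block of diag(I_M, C_1, ..., C_{P-1}) is the identity.
    return [[[sum(w * s for w, s in zip(row, box)) for row in S2M]
             for box in S_p] for S_p in S[1:]]

Q, M_L = 4, 8
S2M = s2m_operator(Q, M_L)
# Hypothetical complex input: 3 FMM indices p, 2 boxes b, M_L points per box.
S = [[[complex(p + b, m) for m in range(M_L)] for b in range(2)] for p in range(3)]
moments = apply_s2m(S2M, S)
print(len(moments), len(moments[0]), len(moments[0][0]))
```

Because the Lagrange basis sums to one at every point, the multipole coefficients of a box add up to the total of its sources, which is a convenient sanity check.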

  15. BATCHED MATRIX-MATRIX MULTIPLY cublas<T>gemmStridedBatched in cuBLAS 8.0
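
To make the strided-batched semantics concrete, here is a CPU reference in plain Python. It is a sketch of the operation $C_i = \alpha A_i B_i + \beta C_i$ on column-major data with a fixed stride between consecutive matrices. It is not the cuBLAS call itself, it omits transposes and the library handle, and the function name and argument order are illustrative.

```python
def gemm_strided_batched_ref(m, n, k, alpha, A, lda, stride_a,
                             B, ldb, stride_b, beta, C, ldc, stride_c,
                             batch_count):
    """C[i] = alpha * A[i] @ B[i] + beta * C[i] for each batch i (column-major)."""
    for i in range(batch_count):
        a0, b0, c0 = i * stride_a, i * stride_b, i * stride_c
        for col in range(n):
            for row in range(m):
                acc = sum(A[a0 + row + l * lda] * B[b0 + l + col * ldb]
                          for l in range(k))
                idx = c0 + row + col * ldc
                C[idx] = alpha * acc + beta * C[idx]

# One 2x2 operator applied to two 2x1 right-hand sides.
A = [1, 3, 2, 4]   # column-major [[1, 2], [3, 4]]
B = [1, 0, 0, 1]   # batch 0 = (1, 0), batch 1 = (0, 1)
C = [0, 0, 0, 0]
# In this emulation a stride of 0 for A reuses the same matrix in every batch,
# mirroring an operator that is shared across all boxes.
gemm_strided_batched_ref(2, 1, 2, 1, A, 2, 0, B, 2, 2, 0, C, 2, 2, 2)
print(C)
```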

  16. S2M/L2T $M_{pqb} = \mathrm{S2M}_{qm}\, S_{pmb} \Longrightarrow M_{pq}[b] = S_{pm}[b]\,\mathrm{S2M}^{T}_{qm}$; $T_{pmb} = \mathrm{L2T}_{mq}\, L_{pqb} \Longrightarrow T_{pm}[b] = L_{pq}[b]\,\mathrm{S2M}_{qm}$

  17. M2M/L2L $\mathrm{M2M}^{\pm}_{qk} = \ell_q\left(\frac{z_k \pm 1}{2}\right)$, $M^{\ell}_{pqb} = \mathrm{M2M}_{qk}\, M^{\ell+1}_{pk(2b)}$, $L^{\ell+1}_{pk(2b)} \mathrel{+}= \mathrm{L2L}_{qk}\, L^{\ell}_{pqb}$ • Computed with single BatchedGEMM
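
The upward (M2M) pass can be sketched for one parent box as follows. The sketch assumes the parent coefficients are the sum of both children's contributions, with the child covering the lower half of the parent mapped through $\mathrm{M2M}^{-}$ and the other through $\mathrm{M2M}^{+}$. Which child is box $2b$ is an assumption, and all names are hypothetical.

```python
import math

def chebyshev_nodes(Q):
    return [math.cos((2 * j + 1) * math.pi / (2 * Q)) for j in range(Q)]

def lagrange(q, z, nodes):
    value = 1.0
    for k, zk in enumerate(nodes):
        if k != q:
            value *= (z - zk) / (nodes[q] - zk)
    return value

def m2m_operator(Q, sign):
    """M2M^{+/-}[q][k] = l_q((z_k +/- 1) / 2): child node k in parent coordinates."""
    nodes = chebyshev_nodes(Q)
    return [[lagrange(q, (zk + sign) / 2.0, nodes) for zk in nodes]
            for q in range(Q)]

def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def m2m(lower_child, upper_child):
    """Parent multipole coefficients from the two children's coefficients."""
    Q = len(lower_child)
    lo = matvec(m2m_operator(Q, -1), lower_child)
    hi = matvec(m2m_operator(Q, +1), upper_child)
    return [a + b for a, b in zip(lo, hi)]

parent = m2m([1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 2.0, 0.0])
print(parent)
```

The L2L pass uses the transposed operators to push local coefficients from parent to children, so the same two small matrices serve both directions.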

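  18. S2T/M2L $T_{pib} = \mathrm{S2T}_{p(j-i)}\, S_{pjb}$, $\mathrm{S2T}_{pk} = \begin{cases} \cot\left(\frac{\pi}{N}(p + Pk)\right) & p > 0 \\ \delta_{k0} & p = 0 \end{cases}$; $L^{\ell}_{pib} = \mathrm{M2L}^{\ell}_{pijs}\, M^{\ell}_{pj(b+s)}$, $\mathrm{M2L}^{\ell}_{pijs} = \cot\left(\frac{\pi}{2^{\ell}}\left(\frac{z_j - z_i}{2} + s\right) + \frac{\pi}{N}(p+1)\right)$ • Also Level-3 linear algebra computations, but no BLAS primitives • Custom kernels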
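
The near-field kernel depends only on the offset $k = j - i$, so it is Toeplitz. The sketch below checks that $\cot\left(\frac{\pi}{N}(p + Pk)\right)$ matches the cotangent in the $C_p$ definition, $\cot\left(\frac{\pi}{M}(n - m + p/P)\right)$ with $k = n - m$ and $N = MP$. The function names are hypothetical, and the scale factor $\rho_p$ and the $+i$ term are left out.

```python
import math

def s2t(N, P, p, k):
    """S2T_{pk}: cot(pi/N * (p + P*k)) for p > 0, Kronecker delta in k for p = 0."""
    if p == 0:
        return 1.0 if k == 0 else 0.0
    return 1.0 / math.tan(math.pi / N * (p + P * k))

def cot_entry(M, P, p, m, n):
    """Cotangent part of [C_p]_{mn}: cot(pi/M * (n - m + p/P))."""
    return 1.0 / math.tan(math.pi / M * (n - m + p / P))

M, P = 16, 4
N = M * P
mismatches = [(p, m, n)
              for p in range(1, P) for m in range(M) for n in range(M)
              if not math.isclose(s2t(N, P, p, n - m), cot_entry(M, P, p, m, n),
                                  rel_tol=1e-9)]
print(len(mismatches))
```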

  19. INTERPOLATIVE FMM Operator: Storage, Compute • S2T: $P(4M_L - 1)$, $3P\,2^{L} M_L^2$ • S2M: $Q M_L$, $2PMQ$ • M2M: $2Q^2$, $4(2^L - 2^B)PQ^2$ • M2L: $4(L - B)PQ^2$, $3(2^L - 2^B)PQ^2$ • L2L: $2Q^2$, $4(2^L - 2^B)PQ^2$ • L2T: $Q M_L$, $2PMQ$

  20. ALGORITHM

  21. PROFILE

  22. FMM-FFT PROFILE [Figure: profile timeline with phases S2M, M2M, S2T, M2L, L2L, L2T, Halo, and 2D FFT]

  23. 2xK40c FMM-FFT

  24. 2xP100 FMM-FFT

  25. 8xP100 FMM-FFT

  26. FMM BREAKDOWN Components • T=ComplexDouble, A=2xP100 • B-GEMM and S2T dominate • Small N: latency; use 1 level • Large N: compute

  27. EFFICIENCY • >95% BatchedGEMM • 60% S2T/M2L • >90% FMM-FFT

  28. PARAMETER DEPENDENCE: $M_L$, points per box per FMM • Trade #levels for S2T compute • Flop count not enough • Increase the intensity • Tune performance for $M_L = 64$ • T=Z, A=2xP100, $N = 2^{27}$, P=256, B=3, Q=16

  29. PARAMETER DEPENDENCE: $P$, number of FMMs • Flops/intensity approx. constant • Trade #levels for #FMMs • Large P good • Fill up B-GEMM • More square 2D FFT • T=Z, A=2xP100, $N = 2^{27}$, $M_L = 64$, B=3, Q=16

  30. PARAMETER DEPENDENCE: $B$, base level • Not very significant • Scale to 128 GPUs w/o complications • T=Z, A=2xP100, $N = 2^{27}$, P=256, $M_L = 64$, Q=16

  31. PARAMETER DEPENDENCE: $Q$, quadrature order • Weak performance dependence • Accuracy tuning • T=Z, A=2xP100, $N = 2^{27}$, P=256, $M_L = 64$, B=3

  32. FUTURE • Integration into CUFFT • Application to 2D/3D FFTs? • Convolutions • NUFFT, Sparse FFT • Volta predictions and measurements • Mixed precision (e.g. FP16 far-field) to use Tensor Core? • Persistent Matrix Batched GEMM (cuBLAS optimization) • Staged Persistent Matrix Batched GEMM (cooperative groups, RNNs)

  33. CONCLUSION • FMM-FFT trades 2/3 communication in 1D FFT for P FMMs • Viable on highest comp:comm architecture available • Detailed implementation that relies heavily on existing primitives • Primitives >95% efficient • Two custom dense kernels >60% efficient • Entire FMM-FFT >90% efficient • Tunable accuracy-performance tradeoff • Compute model accurately predicts performance

  34. May 8-11, 2017 | Silicon Valley THANK YOU
