
Extending the BLIS Analytical Model for GPUs



  1. Carnegie Mellon. Extending the BLIS Analytical Model for GPUs. Elliot Binder, Claudia Kho, Doru Thom Popovici, Tze Meng Low. 18 September 2018, BLIS Retreat.

  2. Many problems are “MMM”: Population Genomics, k-Nearest Neighbours, DNA Fingerprinting, All-Pairs Shortest Path. [Figure: matrix of DNA sequences over the letters A, C, G, T, with axes labelled "# Samples" and "~Length of DNA Sequence".]

  3. Leveraging BLIS. • Small micro-kernel • 5 parameters: $m_r$, $n_r$, $k_c$, $m_c$, $n_c$. [Figure: BLIS loop structure. 4th loop around the micro-kernel: $C_j$ += $A_p B_p$ in steps of $k_c$, pack $B_p \to \tilde{B}_p$. 3rd loop: $C_i$ += $A_i \tilde{B}_p$ in steps of $m_c$, pack $A_i \to \tilde{A}_i$. 2nd loop: steps of $n_r$. 1st loop: steps of $m_r$. Micro-kernel: $m_r \times n_r$ update over $k_c$. Legend: main memory, L3 cache, L2 cache, L1 cache, registers.]

  4. Population Genomics. Nikolaos Alachiotis, Thom Popovici, Tze Meng Low. 2016. Efficient Computation of Linkage Disequilibria as Dense Linear Algebra Operations. HiCOMB 2016.

  5. Population Genomics. $m_r n_r \ge N_{\mathrm{Popcnt}} L_{\mathrm{Popcnt}} N_{\mathrm{vec}}$. Tze Meng Low, Francisco D. Igual, Tyler M. Smith, and Enrique S. Quintana-Ortí. 2016. Analytical Modeling Is Enough for High-Performance BLIS. ACM Trans. Math. Softw. 43, 2, Article 12. Nikolaos Alachiotis, Thom Popovici, Tze Meng Low. 2016. Efficient Computation of Linkage Disequilibria as Dense Linear Algebra Operations. HiCOMB 2016.

  6. Application of (Partial) Model: Convolution Neural Nets, Large m-D FFTs, Finite Field Linear Algebra, Microcontrollers. [Figure: "Performance on Intel Haswell"; % of peak (0% to 100%) per layer of AlexNet (1 to 5); series: OpenBLAS + Layout Change, OpenBLAS GEMM, Customed Convolution.] [Figure: two cores (Core0, Core1) with FPUs, threads t0 to t3, per-core L1 and L2 caches, and an L3 cache.] [Figure: "Performance of different FF Algorithms"; Bit Ops / Cycle (0 to 2000) against N = M = K (1024, 4096, 16384); series: 4R - BLIS, m4ri (4 bit tables), O(n^3) Naive, Peak.]

  7. “Can we do it on a GPU?”

  8. Our initial attempt. [Figure: "Linkage Disequilibrium on GTX 980"; % of peak (0% to 100%) against K (64 to 1024 in steps of 64); series: 2k-64, 2k-1024, 4k-64, 4k-1024.]

  9. GTX 980 in a nutshell (1 of 16 SMs). • 1 warp = 32 threads • 4 clusters of 32 (SP) FMA cores • Each cluster with 8 SFU cores (popcnt) • 64k registers per SM (255/thread) • 48K/96K shared memory

  10. GTX 980 in a nutshell (1 of 16 SMs). • 1 warp = 32 threads • 4 clusters of 32 (SP) FMA cores • Each cluster with 8 SFU cores (popcnt) • 64k registers per SM (255/thread) • 48K/96K shared memory • Latency of FMA ≈ 8 cycles • Latency of Popcnt ≈ 12-13 cycles • Popcnt seems to be pipelined

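  11. Applying the model. • Minimum size of kernel: $m_r n_r \ge N_{\mathrm{Popcnt}} L_{\mathrm{Popcnt}} N_{\mathrm{vec}}$, annotated with 256 for $m_r n_r$, 4 clusters for $N_{\mathrm{Popcnt}}$, 8 cycles for $L_{\mathrm{Popcnt}}$, and 8 threads for $N_{\mathrm{vec}}$. • Maximum size of kernel: $\frac{64\mathrm{k}}{256} = 256$.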

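  12. Applying the model. • Minimum size of kernel: $m_r n_r \ge N_{\mathrm{Popcnt}} L_{\mathrm{Popcnt}} N_{\mathrm{vec}}$, annotated with 256 for $m_r n_r$, 4 clusters for $N_{\mathrm{Popcnt}}$, 8 cycles for $L_{\mathrm{Popcnt}}$, and 8 threads for $N_{\mathrm{vec}}$. • Maximum size of kernel: $\frac{64\mathrm{k}}{256} = 256$ (annotation: >255 registers/thread).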

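  13. Applying the model. • Minimum size of kernel: $m_r n_r \ge N_{\mathrm{Popcnt}} L_{\mathrm{Popcnt}} N_{\mathrm{vec}}$, annotated with 256 for $m_r n_r$, 4 clusters for $N_{\mathrm{Popcnt}}$, 8 cycles for $L_{\mathrm{Popcnt}}$, and 8 threads for $N_{\mathrm{vec}}$. • Maximum size of kernel: $\frac{64\mathrm{k}}{256} = 256$ (annotations: 1024 threads, 64 registers).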

  14. Our initial attempt. [Figure: "Linkage Disequilibrium on GTX 980"; % of peak (0% to 100%) against K (64 to 1024 in steps of 64); series: 2k-64, 2k-1024, 4k-64, 4k-1024.]

  15. With Shared Memory. [Figure: "Linkage Disequilibrium on GTX 980"; % of peak (0% to 100%) against K; series: 1k, 2k, 4k.]
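
The arithmetic on the "Applying the model" slides can be checked with a few lines of Python. This is a minimal sketch, not the authors' code: the function and constant names are hypothetical, and the reading of the "maximum size" fraction as registers per SM divided by resident threads is an assumption based on the slide annotations (">255 registers/thread", "1024 threads, 64 registers").

```python
# Numbers taken from the GTX 980 slides; all names here are hypothetical.
REGISTERS_PER_SM = 64 * 1024      # "64k registers per SM"
MAX_REGS_PER_THREAD = 255         # "(255/thread)"


def min_kernel_elements(n_units, latency_cycles, n_vec):
    """Lower bound on m_r * n_r: N_Popcnt * L_Popcnt * N_vec."""
    return n_units * latency_cycles * n_vec


def regs_per_thread(threads_per_sm):
    """Registers available to each thread if the SM's file is split evenly."""
    return REGISTERS_PER_SM // threads_per_sm


def fits(threads_per_sm):
    """True if the even split respects the per-thread register limit."""
    return regs_per_thread(threads_per_sm) <= MAX_REGS_PER_THREAD


# Slide values: 4 clusters, 8 cycles, 8 threads.
min_mr_nr = min_kernel_elements(4, 8, 8)

print(min_mr_nr)                           # 256
print(regs_per_thread(256), fits(256))     # 256 False
print(regs_per_thread(1024), fits(1024))   # 64 True
```

Under this reading, 256 threads would each need 256 registers, one more than the per-thread limit, while 1024 threads get 64 registers each. The bound is also sensitive to the latency that is plugged in: `min_kernel_elements(4, 13, 8)` gives 416 for the upper end of the measured 12-13 cycle Popcnt latency.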

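
Slides 2, 4 and 5 treat linkage disequilibrium as an "MMM" whose inner update uses Popcnt rather than FMA. The sketch below illustrates that idea on a toy input; it assumes a binary encoding (one 0/1 vector per row) that the slides do not spell out, and every name in it is hypothetical. It shows that AND followed by a population count over packed words produces the same counts as an ordinary matrix product of the 0/1 matrix with its transpose.

```python
def pack_bits(row):
    """Pack a list of 0/1 values into one integer (bit i holds row[i])."""
    word = 0
    for i, bit in enumerate(row):
        word |= (bit & 1) << i
    return word


def popcount(x):
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def cooccurrence_counts(rows):
    """counts[i][j] = popcount(row_i AND row_j), computed on packed words."""
    packed = [pack_bits(r) for r in rows]
    n = len(packed)
    return [[popcount(packed[i] & packed[j]) for j in range(n)]
            for i in range(n)]


def matmul_counts(rows):
    """Reference: the same counts as a plain product of the 0/1 matrix with its transpose."""
    n, k = len(rows), len(rows[0])
    return [[sum(rows[i][p] * rows[j][p] for p in range(k)) for j in range(n)]
            for i in range(n)]


# Toy input: 3 binary rows of length 8 (made-up data for illustration only).
rows = [
    [1, 0, 1, 1, 0, 0, 1, 0],
    [1, 1, 0, 1, 0, 1, 1, 0],
    [0, 0, 1, 0, 1, 0, 0, 1],
]

counts = cooccurrence_counts(rows)
print(counts)  # [[4, 3, 1], [3, 5, 0], [1, 0, 3]]
```

Because the two functions agree, the blocking and packing structure of a GEMM implementation can in principle be reused, with the multiply-add in the micro-kernel swapped for AND plus popcount.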