Extending the BLIS Analytical Model for GPUs

Extending the BLIS Analytical Model for GPUs. Elliot Binder, Claudia Kho, Doru Thom Popovici, Tze Meng Low. Carnegie Mellon. BLIS Retreat, 18 September 2018.

Many problems are “MMM” (matrix-matrix multiplication): Population Genomics, k-Nearest Neighbours, DNA Fingerprinting, All-Pairs Shortest Path.
[Figure: a matrix of DNA sequences over the letters A, C, T, G; one row per sample (# Samples), columns spanning ~Length of DNA Sequence.]

Leveraging BLIS
• Small microkernel
• 5 parameters: m_r, n_r, k_c, m_c, n_c
[Figure: the BLIS loops around the micro-kernel. 4th loop: C_j += A_p B_p, with B_p packed into B̃_p. 3rd loop: C_i += A_i B̃_p, with A_i packed into Ã_i. 2nd loop: steps of n_r. 1st loop: steps of m_r. Micro-kernel: an m_r × n_r block of C updated over k_c. Memory levels shown: main memory, L3 cache, L2 cache, L1 cache, registers.]
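The loop structure in the figure can be sketched in a few lines of Python. This is only an illustration of where the five parameters enter: the function names, the default block sizes, and the list-of-lists matrices are assumptions made for the example, and the packing steps shown in the figure are left out.

```python
def micro_kernel(A, B, C, i0, i1, j0, j1, p0, p1):
    """Update the block C[i0:i1, j0:j1] with a sequence of rank-1 updates."""
    for p in range(p0, p1):
        for i in range(i0, i1):
            a = A[i][p]
            for j in range(j0, j1):
                C[i][j] += a * B[p][j]


def blocked_gemm(A, B, C, m_c=4, n_c=4, k_c=3, m_r=2, n_r=2):
    """Compute C += A * B with a BLIS-style loop nest (no packing)."""
    m, k, n = len(A), len(B), len(B[0])
    for jc in range(0, n, n_c):                    # outermost loop: n_c columns
        j_hi = min(jc + n_c, n)
        for pc in range(0, k, k_c):                # 4th loop: k_c-deep panel B_p
            p_hi = min(pc + k_c, k)
            for ic in range(0, m, m_c):            # 3rd loop: m_c-tall block A_i
                i_hi = min(ic + m_c, m)
                for jr in range(jc, j_hi, n_r):    # 2nd loop: n_r columns
                    for ir in range(ic, i_hi, m_r):  # 1st loop: m_r rows
                        micro_kernel(A, B, C,
                                     ir, min(ir + m_r, i_hi),
                                     jr, min(jr + n_r, j_hi),
                                     pc, p_hi)
    return C
```

In a real implementation the block sizes would be chosen so that each packed block fits the cache level the figure assigns to it; here they are arbitrary small values.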

Population Genomics
Nikolaos Alachiotis, Thom Popovici, Tze Meng Low. 2016. Efficient Computation of Linkage Disequilibria as Dense Linear Algebra Operations. HiCOMB 2016.

Population Genomics
$m_r n_r \geq N_{\mathrm{Popcnt}} \, L_{\mathrm{Popcnt}} \, N_{\mathrm{vec}}$
Tze Meng Low, Francisco D. Igual, Tyler M. Smith, and Enrique S. Quintana-Ortí. 2016. Analytical Modeling Is Enough for High-Performance BLIS. ACM Trans. Math. Softw. 43, 2, Article 12.
Nikolaos Alachiotis, Thom Popovici, Tze Meng Low. 2016. Efficient Computation of Linkage Disequilibria as Dense Linear Algebra Operations. HiCOMB 2016.
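To see why a popcount instruction takes the place of the FMA in this bound, it may help to look at a GEMM-like operation over bit-packed data. The sketch below is an assumption about the general shape of such a kernel, not the authors' code: each SNP is a bit vector with one bit per sample, and the "inner product" of two SNPs is the popcount of their bitwise AND. The helper names are made up for the example.

```python
def popcount(x):
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def pack_bits(bits):
    """Pack a list of 0/1 values into one integer, first bit most significant."""
    word = 0
    for b in bits:
        word = (word << 1) | b
    return word


def cooccurrence_counts(snps):
    """Return H with H[i][j] = number of samples carrying a 1 at both SNP i and SNP j.

    This has the structure of G * G^T, with multiply replaced by AND
    and the sum over samples replaced by popcount.
    """
    packed = [pack_bits(s) for s in snps]
    return [[popcount(a & b) for b in packed] for a in packed]
```

For example, `cooccurrence_counts([[1, 0, 1, 1], [1, 1, 0, 1]])` counts, for each pair of SNPs, the samples in which both bits are set.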

Application of (Partial) Model: Convolution Neural Nets, Large m-D FFTs, Finite Field Linear Algebra, Microcontrollers.
[Figure: two cores (Core0, Core1) running threads t0 to t3, each core with an FPU and L1 and L2 caches, sharing an L3.]
[Figure: Performance on Intel Haswell, % of peak (0% to 100%) for layers 1 to 5 of AlexNet; series: OpenBLAS + Layout Change, OpenBLAS GEMM, Custom Convolution.]
[Figure: Performance of different FF algorithms, bit ops/cycle (0 to 2000) vs. N = M = K (1024, 4096, 16384); series: 4R - BLIS, m4ri (4 bit tables), O(n^3) Naive, Peak.]

“Can we do it on a GPU?”

Our initial attempt: Linkage Disequilibrium on GTX 980.
[Figure: % of peak (0% to 100%) vs. K (64 to 1024 in steps of 64); series: 2k-64, 2k-1024, 4k-64, 4k-1024.]

GTX 980 in a nutshell (figure: 1 of 16 SMs)
• 1 warp = 32 threads
• 4 clusters of 32 (SP) FMA cores
• Each cluster with 8 SFU cores (popcnt)
• 64k registers per SM (255/thread)
• 48K/96K shared memory

GTX 980 in a nutshell (figure: 1 of 16 SMs)
• 1 warp = 32 threads
• 4 clusters of 32 (SP) FMA cores
• Each cluster with 8 SFU cores (popcnt)
• 64k registers per SM (255/thread)
• 48K/96K shared memory
• Latency of FMA ≈ 8 cycles
• Latency of Popcnt ≈ 12-13 cycles
• Popcnt seems to be pipelined

Applying the model
• Minimum size of kernel: $m_r n_r \geq N_{\mathrm{Popcnt}} \, L_{\mathrm{Popcnt}} \, N_{\mathrm{vec}}$, annotated 256 = 4 clusters × 8 cycles × 8 threads
• Maximum size of kernel: 64k / 256 = 256

Applying the model
• Minimum size of kernel: $m_r n_r \geq N_{\mathrm{Popcnt}} \, L_{\mathrm{Popcnt}} \, N_{\mathrm{vec}}$, annotated 256 = 4 clusters × 8 cycles × 8 threads
• Maximum size of kernel: 64k / 256 = 256 (annotation: >255 registers/thread)

Applying the model
• Minimum size of kernel: $m_r n_r \geq N_{\mathrm{Popcnt}} \, L_{\mathrm{Popcnt}} \, N_{\mathrm{vec}}$, annotated 256 = 4 clusters × 8 cycles × 8 threads
• Maximum size of kernel: 64k / 256 = 256 (annotation: 1024 threads, 64 registers)
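The arithmetic on these slides can be checked directly. The sketch below uses only numbers that appear on the slides; pairing each number with a symbol, and reading the last annotation as "64k registers shared by 1024 threads leaves 64 registers each", are one interpretation of the annotations rather than something the slides state outright. The variable and function names are made up for the example.

```python
# Values read off the slides (the symbol-to-value pairing is an assumption).
N_POPCNT = 4   # clusters
L_POPCNT = 8   # cycles
N_VEC = 8      # threads

# Lower bound on m_r * n_r from the model.
min_kernel = N_POPCNT * L_POPCNT * N_VEC

REGISTERS_PER_SM = 64 * 1024     # "64k registers per SM"
MAX_REGISTERS_PER_THREAD = 255   # "(255/thread)"


def registers_per_thread(threads):
    """Registers available to each thread if the SM's register file is split evenly."""
    return REGISTERS_PER_SM // threads


print(min_kernel)                  # 256
print(registers_per_thread(256))   # 256, which exceeds the 255/thread limit
print(registers_per_thread(1024))  # 64
```

Under this reading, 256 threads would each need 256 registers, one more than the per-thread limit, while 1024 threads get 64 registers each.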

Our initial attempt: Linkage Disequilibrium on GTX 980.
[Figure: % of peak (0% to 100%) vs. K (64 to 1024 in steps of 64); series: 2k-64, 2k-1024, 4k-64, 4k-1024.]

With Shared Memory: Linkage Disequilibrium on GTX 980.
[Figure: % of peak (0% to 100%) vs. K; series: 1k, 2k, 4k.]
