Carnegie Mellon
Extending the BLIS Analytical Model for GPUs
Elliot Binder, Claudia Kho, Doru Thom Popovici, Tze Meng Low
18 September 2018
BLIS Retreat
Many problems are MMM
Population Genomics
k-Nearest Neighbours
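As a rough illustration of why k-nearest neighbours can be cast as MMM: the squared Euclidean distance expands as ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x·y, so all pairwise distances follow from the product X Yᵀ. The sketch below is not taken from the slides; the function names `pairwise_sq_dists` and `k_nearest` are hypothetical, and plain Python lists stand in for an optimized GEMM.

```python
def pairwise_sq_dists(X, Y):
    """Squared Euclidean distances between rows of X and rows of Y."""
    # The Gram matrix X * Y^T is the matrix-matrix multiplication.
    gram = [[sum(a * b for a, b in zip(x, y)) for y in Y] for x in X]
    x_norms = [sum(a * a for a in x) for x in X]
    y_norms = [sum(b * b for b in y) for y in Y]
    return [[x_norms[i] + y_norms[j] - 2 * gram[i][j]
             for j in range(len(Y))]
            for i in range(len(X))]


def k_nearest(X, Y, k):
    """For each row of X, indices of the k closest rows of Y."""
    dists = pairwise_sq_dists(X, Y)
    return [sorted(range(len(Y)), key=lambda j: row[j])[:k] for row in dists]
```

Only the Gram-matrix step is cubic in the problem size; the norms and the final selection are lower-order work, which is why the overall cost is dominated by the MMM.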
[Figure: matrix of DNA sequences. Rows: number of samples. Columns: approximately the length of the DNA sequence.]
A C T G G T G A C T C A G A T G C A C G G A …
A C A C G T A A C T C C C T T A G A G A C A …
A C A C G T G T G A T C C A A A C A T T A C …
C T T G A C A A C T T C C A T A C C G T A A …
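One way to see how a samples-by-sites matrix like this leads to MMM is sketched below. It assumes, purely for illustration, that each site has been encoded as a bit per sample (1 if the sample carries a chosen allele, 0 otherwise); the encoding and the name `ld_matrix` are assumptions, not details from the slides. Under that assumption the pairwise co-occurrence counts are GᵀG, and the standard linkage disequilibrium coefficient is D_ij = P_ij - p_i p_j.

```python
def ld_matrix(G):
    """Pairwise LD coefficients D[i][j] = P_ij - p_i * p_j.

    G[s][i] is 1 if sample s carries the chosen allele at site i, else 0
    (a hypothetical binary encoding used only for this sketch).
    """
    n_samples, n_sites = len(G), len(G[0])
    # Co-occurrence counts H = G^T * G: this is the matrix-matrix product.
    H = [[sum(G[s][i] * G[s][j] for s in range(n_samples))
          for j in range(n_sites)]
         for i in range(n_sites)]
    # For binary data the diagonal of H holds the per-site allele counts.
    p = [H[i][i] / n_samples for i in range(n_sites)]
    return [[H[i][j] / n_samples - p[i] * p[j] for j in range(n_sites)]
            for i in range(n_sites)]
```

With bit-encoded data the multiply-accumulate in H can in principle be replaced by bitwise AND plus population count, but the loop structure stays that of a matrix product.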
[Figure: the BLIS GEMM algorithm as loops around the micro-kernel. Shown: the 4th, 3rd, 2nd, and 1st loops around the micro-kernel and the micro-kernel itself; packing of Ai into Ãi and Bp into B̃p; block sizes mC, kC, mR, nR; and the level of the memory hierarchy at which each operand resides (main memory, L3 cache, L2 cache, L1 cache, registers).]
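The loop structure in the figure can be sketched in a few lines. This is a minimal, unoptimized illustration rather than the BLIS implementation: the block-size defaults are arbitrary, the function name `gemm_blocked` is hypothetical, and the outermost (5th) loop over column panels of width `nc` is added from the standard BLIS formulation since it does not appear in the labels above.

```python
def gemm_blocked(A, B, C, mc=4, kc=4, nc=8, mr=2, nr=2):
    """Compute C += A * B with BLIS-style loops; matrices are lists of rows."""
    m, k, n = len(A), len(B), len(B[0])
    for jc in range(0, n, nc):              # 5th loop: column panels of C and B
        nb = min(nc, n - jc)
        for pc in range(0, k, kc):          # 4th loop: partition the k dimension
            kb = min(kc, k - pc)
            # Pack the kb x nb panel of B into a contiguous buffer.
            Bt = [[B[pc + p][jc + j] for j in range(nb)] for p in range(kb)]
            for ic in range(0, m, mc):      # 3rd loop: row blocks of A and C
                mb = min(mc, m - ic)
                # Pack the mb x kb block of A into a contiguous buffer.
                At = [[A[ic + i][pc + p] for p in range(kb)] for i in range(mb)]
                for jr in range(0, nb, nr):         # 2nd loop: nr-wide slivers
                    for ir in range(0, mb, mr):     # 1st loop: mr-tall slivers
                        # Micro-kernel: rank-1 updates of an mr x nr tile of C.
                        for p in range(kb):
                            for i in range(ir, min(ir + mr, mb)):
                                a = At[i][p]
                                for j in range(jr, min(jr + nr, nb)):
                                    C[ic + i][jc + j] += a * Bt[p][j]
    return C
```

In a tuned implementation the packed block of A is sized to stay in the L2 cache, the packed panel of B in the L3 cache, and the mr x nr tile of C in registers; here the buffers only mirror that structure.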
Nikolaos Alachiotis, Thom Popovici, Tze Meng Low. 2016. Efficient Computation of Linkage Disequilibria as Dense Linear Algebra Operations. HiCOMB 2016.
Tze Meng Low, Francisco D. Igual, Tyler M. Smith, and Enrique S. Quintana-Ortí. 2016. Analytical Modeling Is Enough for High-Performance BLIS. ACM Trans. Math. Softw. 43, 2, Article 12.
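[Figure: memory hierarchy diagram. Core0 and Core1 each have L1 caches, an FPU, and an L2 cache, together with an L3 cache; threads are labelled t0, t1, t2, t3.]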
[Plot: Performance on Intel Haswell. x-axis: layers of AlexNet (1 to 5). y-axis: % of peak (0% to 100%). Series: OpenBLAS + Layout Change, OpenBLAS GEMM, Customed Convolution.]
[Plot: Performance of different FF Algorithms. x-axis: N = M = K (1024, 4096, 16384). y-axis: bit ops / cycle (500 to 2000). Series: 4R - BLIS, m4ri (4 bit tables), O(n^3) Naive, Peak.]
[Plot: Linkage Disequilibrium on GTX 980. x-axis: 64 to 1024 in steps of 64. y-axis: 0% to 100%. Series: 2k-64, 2k-1024, 4k-64, 4k-1024.]
[Plot: Linkage Disequilibrium on GTX 980. x-axis: K. y-axis: % of peak (0% to 100%). Series: 1k, 2k, 4k.]