Extending the BLIS Analytical Model for GPUs

SLIDE 1

Carnegie Mellon

Extending the BLIS Analytical Model for GPUs

Elliot Binder, Claudia Kho, Doru Thom Popovici, Tze Meng Low

18 September 2018, BLIS Retreat

SLIDE 2

Many problems are “MMM”

  • k-Nearest Neighbours
  • DNA Fingerprinting
  • All-Pairs Shortest Path
  • Population Genomics

[Figure: a matrix of DNA sequences. Rows: # samples; columns: ~length of DNA sequence. Example rows: A C T G G T G A C T …, A C A C G T A A C T …]

SLIDE 3

Leveraging BLIS

  • Small microkernel
  • 5 parameters: mr, nr, kc, mc, nc

[Figure: the BLIS GEMM loop nest, showing the loops around the micro-kernel, the packing steps Ai → Ãi and Bp → B̃p, and the memory levels the operands occupy (main memory, L3 cache, L2 cache, L1 cache, registers), annotated with the blocking parameters mC, nC, kC, mR and nR.]

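The loop structure named on this slide can be sketched in plain Python. This is only an illustration under stated assumptions: the function names `blocked_gemm` and `micro_kernel` are hypothetical, the block sizes are tiny placeholder values rather than tuned ones, and the packing steps are marked by comments instead of being implemented.

```python
def micro_kernel(A, B, C, rows, cols, depth):
    """Update one small tile of C with a sequence of rank-1 updates."""
    for p in depth:
        for i in rows:
            a_ip = A[i][p]
            for j in cols:
                C[i][j] += a_ip * B[p][j]


def blocked_gemm(A, B, C, mc=4, nc=8, kc=4, mr=2, nr=2):
    """Compute C += A * B on lists of lists with BLIS-style loop blocking.

    The five block sizes are illustrative placeholders, not tuned values.
    """
    m, k, n = len(A), len(B), len(B[0])
    for jc in range(0, n, nc):                  # 5th loop: nc-wide panels of B and C
        jc_end = min(jc + nc, n)
        for pc in range(0, k, kc):              # 4th loop: kc-deep slices (BLIS packs Bp here)
            depth = range(pc, min(pc + kc, k))
            for ic in range(0, m, mc):          # 3rd loop: mc-tall blocks of A (BLIS packs Ai here)
                ic_end = min(ic + mc, m)
                for jr in range(jc, jc_end, nr):        # 2nd loop: nr-wide micro-panels
                    cols = range(jr, min(jr + nr, jc_end))
                    for ir in range(ic, ic_end, mr):    # 1st loop: mr-tall micro-panels
                        rows = range(ir, min(ir + mr, ic_end))
                        micro_kernel(A, B, C, rows, cols, depth)
    return C
```

Only `micro_kernel` would need to change to target a different operation or architecture, which is the sense in which the microkernel is "small".
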
SLIDE 4

Population Genomics

Nikolaos Alachiotis, Thom Popovici, Tze Meng Low. 2016. Efficient Computation of Linkage Disequilibria as Dense Linear Algebra Operations. HiCOMB 2016.

SLIDE 5

Population Genomics

$m_r n_r \ge N_{\mathrm{Popcnt}} L_{\mathrm{Popcnt}} N_{\mathrm{vec}}$

Nikolaos Alachiotis, Thom Popovici, Tze Meng Low. 2016. Efficient Computation of Linkage Disequilibria as Dense Linear Algebra Operations. HiCOMB 2016.

Tze Meng Low, Francisco D. Igual, Tyler M. Smith, and Enrique S. Quintana-Ortí. 2016. Analytical Modeling Is Enough for High-Performance BLIS. ACM Trans. Math. Softw. 43, 2, Article 12.

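To see why popcount takes the place of the multiply-add in this setting, here is a minimal sketch of a pairwise linkage disequilibrium computation on bit-packed data. It assumes each SNP is stored as a Python integer whose bit s is 1 when sample s carries the derived allele, and it uses the textbook definition D = P(ab) − P(a)P(b). The encoding and the names `popcount` and `ld_matrix` are assumptions made for illustration, not the authors' implementation.

```python
def popcount(x):
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def ld_matrix(snps, n_samples):
    """Pairwise D = P(ab) - P(a)P(b) for bit-packed SNP vectors.

    snps[i] is an int; bit s is 1 if sample s carries the derived allele
    at SNP i (an assumed encoding).
    """
    freqs = [popcount(s) / n_samples for s in snps]
    n = len(snps)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # AND followed by popcount is the dot product of two 0/1 vectors,
            # so the whole table has the shape of a matrix-matrix product.
            joint = popcount(snps[i] & snps[j]) / n_samples
            D[i][j] = joint - freqs[i] * freqs[j]
    return D
```

The double loop over i and j is the part that the BLIS loop structure would block, with the AND-plus-popcount step as the inner update.
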
SLIDE 6

Application of (Partial) Model

  • Large m-D FFTs
    [Figure: Core0 and Core1, each with an FPU, L1 caches and an L2 cache, plus an L3 cache; threads t0 to t3.]
  • Convolution Neural Nets
    [Figure: "Performance on Intel Haswell". x-axis: layers of AlexNet (1 to 5); y-axis: % of peak (0% to 100%). Series: OpenBLAS + Layout Change, OpenBLAS GEMM, Custom Convolution.]
  • Finite Field Linear Algebra
    [Figure: "Performance of different FF Algorithms". x-axis: N = M = K (1024, 4096, 16384); y-axis: bit ops / cycle (500 to 2000). Series: 4R - BLIS, m4ri (4 bit tables), O(n^3) Naive, Peak.]
  • Microcontrollers

SLIDE 7

“Can we do it on a GPU?”

SLIDE 8

Our initial attempt

[Figure: "Linkage Disequilibrium on GTX 980". x-axis: K, from 64 to 1024 in steps of 64; y-axis: % of peak (0% to 100%). Series: 2k-64, 2k-1024, 4k-64, 4k-1024.]

SLIDE 9

GTX 980 in a nutshell

  • 1 warp = 32 threads
  • 4 clusters of 32 (SP) FMA cores
  • Each cluster with 8 SFU cores (popcnt)
  • 64k registers per SM (255/thread)
  • 48K/96K shared memory

[Figure: 1 of 16 SMs]

SLIDE 10

GTX 980 in a nutshell

  • 1 warp = 32 threads
  • 4 clusters of 32 (SP) FMA cores
  • Each cluster with 8 SFU cores (popcnt)
  • 64k registers per SM (255/thread)
  • 48K/96K shared memory
  • Latency of FMA ≈ 8 cycles
  • Latency of Popcnt ≈ 12-13 cycles
  • Popcnt seems to be pipelined

[Figure: 1 of 16 SMs]

SLIDE 11

Applying the model

  • Minimum size of kernel

    $m_r n_r \ge N_{\mathrm{Popcnt}} L_{\mathrm{Popcnt}} N_{\mathrm{vec}} = 8\ \text{threads} \times 8\ \text{cycles} \times 4\ \text{clusters} = 256$

  • Maximum size of kernel

    $64\text{k} / 256 = 256$

SLIDE 12

Applying the model

  • Minimum size of kernel

    $m_r n_r \ge N_{\mathrm{Popcnt}} L_{\mathrm{Popcnt}} N_{\mathrm{vec}} = 8\ \text{threads} \times 8\ \text{cycles} \times 4\ \text{clusters} = 256$

  • Maximum size of kernel

    $64\text{k} / 256 = 256$ (>255 registers/thread)

SLIDE 13

Applying the model

  • Minimum size of kernel

    $m_r n_r \ge N_{\mathrm{Popcnt}} L_{\mathrm{Popcnt}} N_{\mathrm{vec}} = 8\ \text{threads} \times 8\ \text{cycles} \times 4\ \text{clusters} = 256$

  • Maximum size of kernel

    $64\text{k} / 256 = 256$

    Threads: 1024, Registers: 64

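The arithmetic on the three "Applying the model" slides can be checked with a few lines of Python. The constants are the figures quoted on the slides. Reading the 256 in the denominator as a thread count, and 64 as registers per thread at 1024 threads, is an interpretation of the slide layout rather than something the slides state outright. The function names are hypothetical.

```python
# Figures as quoted on the slides for one GTX 980 SM.
N_POPCNT = 8                  # popcnt-capable SFU cores per cluster ("8 threads")
L_POPCNT = 8                  # cycles, as used on the "Applying the model" slides
N_VEC = 4                     # clusters per SM
REGISTERS_PER_SM = 64 * 1024  # "64k registers per SM"
MAX_REGS_PER_THREAD = 255     # "255/thread"


def min_kernel_size(n_units, latency, n_vec):
    """Lower bound on mr * nr: enough independent updates to cover the latency."""
    return n_units * latency * n_vec


def registers_per_thread(threads, registers_per_sm=REGISTERS_PER_SM):
    """Registers available to each thread if the register file is split evenly."""
    return registers_per_sm // threads


min_tile = min_kernel_size(N_POPCNT, L_POPCNT, N_VEC)   # 8 * 8 * 4
regs_at_256_threads = registers_per_thread(256)         # exceeds the per-thread limit
regs_at_1024_threads = registers_per_thread(1024)       # fits under the limit

print(min_tile, regs_at_256_threads, regs_at_1024_threads)
print(regs_at_256_threads > MAX_REGS_PER_THREAD)
```

Under this reading, 256 threads would each need one register more than the 255-per-thread limit allows, which is consistent with the last slide moving to 1024 threads with 64 registers each.
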
SLIDE 14

Our initial attempt

[Figure: "Linkage Disequilibrium on GTX 980". x-axis: K, from 64 to 1024 in steps of 64; y-axis: % of peak (0% to 100%). Series: 2k-64, 2k-1024, 4k-64, 4k-1024.]

SLIDE 15

With Shared Memory

[Figure: "Linkage Disequilibrium on GTX 980". x-axis: K; y-axis: % of peak (0% to 100%). Series: 1k, 2k, 4k.]