BLIS Performs: Devangi N. Parikh, Science of High Performance Computing



SLIDE 1

BLIS Performs

Devangi N. Parikh, Science of High Performance Computing, The University of Texas at Austin

SLIDE 2

ThunderX2

  • Architecture: Armv8.1
  • Base frequency: 2.0 GHz
  • Sockets/node: 2
  • Cores/socket: 28

The armv8a kernels in BLIS were written by Francisco D. Igual for cortexa57 architectures.
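As a rough cross-check for the plot axes on the following slides, theoretical peak can be estimated as clock rate times floating-point operations per cycle times core count. The sketch below uses the ThunderX2 figures from this slide. The flops-per-cycle values (8 in double precision, 16 in single precision) are an assumption not stated in the slides, and `peak_gflops` is a hypothetical helper name.

```python
def peak_gflops(freq_ghz, flops_per_cycle, cores=1):
    """Theoretical peak in GFLOPS: clock rate x flops per cycle x cores."""
    return freq_ghz * flops_per_cycle * cores


# ThunderX2 values taken from the slide.
FREQ_GHZ = 2.0
CORES_PER_SOCKET = 28
SOCKETS = 2

# Assumption (not from the slides): flops per cycle per core.
DP_FLOPS_PER_CYCLE = 8
SP_FLOPS_PER_CYCLE = 16

for label, fpc in (("double", DP_FLOPS_PER_CYCLE), ("single", SP_FLOPS_PER_CYCLE)):
    one_core = peak_gflops(FREQ_GHZ, fpc)
    one_socket = peak_gflops(FREQ_GHZ, fpc, CORES_PER_SOCKET)
    one_node = peak_gflops(FREQ_GHZ, fpc, CORES_PER_SOCKET * SOCKETS)
    print(f"{label}: {one_core:.0f} / {one_socket:.0f} / {one_node:.0f} GFLOPS")
```

Under these assumptions, the double-precision estimates are 16, 448, and 896 GFLOPS for one core, one socket, and the full node.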

SLIDE 3

DGEMM (armv8a)

[Figure: single-threaded DGEMM performance of BLIS. x-axis: matrix dimension m=n=k (200 to 2000); y-axis: GFLOPS (ticks from 2 to 16).]

SLIDE 4

DGEMM – Other Libraries

[Figure: single-threaded DGEMM performance of BLIS, OpenBLAS, and ARMPL. x-axis: matrix dimension m=n=k (200 to 2000); y-axis: GFLOPS (ticks from 2 to 16).]

SLIDE 5

GEMM – Other Datatypes

[Figure: single-threaded SGEMM performance of BLIS, OpenBLAS, and ARMPL. x-axis: matrix dimension m=n=k (200 to 2000); y-axis: GFLOPS (ticks from 5 to 30).]

SLIDE 6

GEMM – Other Datatypes

[Figure: single-threaded CGEMM performance of BLIS, OpenBLAS, and ARMPL. x-axis: matrix dimension m=n=k (200 to 2000); y-axis: GFLOPS (ticks from 5 to 30).]

SLIDE 7

GEMM – Other Datatypes

[Figure: single-threaded ZGEMM performance of BLIS, OpenBLAS, and ARMPL. x-axis: matrix dimension m=n=k (200 to 2000); y-axis: GFLOPS (ticks from 2 to 16).]

SLIDE 8

Level 3

[Figure: sixteen single-threaded performance plots comparing BLIS, OpenBLAS, and ARMPL. Panels: SGEMM, DGEMM, CGEMM, ZGEMM; SSYRK, DSYRK, CSYRK, ZSYRK; SSYMM, DSYMM, CHEMM, ZHEMM; STRMM, DTRMM, CTRMM, ZTRMM. x-axis: matrix dimension m=n=k (500 to 2000); y-axis: GFLOPS (ticks up to 30 for S and C panels, up to 15 for D and Z panels).]
SLIDE 9

Multi-threaded BLIS (28 cores)

[Figure: sixteen multi-threaded performance plots comparing BLIS, OpenBLAS, and ARMPL. Panels: SGEMM, DGEMM, CGEMM, ZGEMM; SSYRK, DSYRK, CSYRK, ZSYRK; SSYMM, DSYMM, CHEMM, ZHEMM; STRMM, DTRMM, CTRMM, ZTRMM. x-axis: matrix dimension m=n=k (1000 to 5000); y-axis: GFLOPS (ticks up to 800 for S and C panels, up to 400 for D and Z panels).]
SLIDE 10

Multi-threaded BLIS (56 cores)

[Figure: sixteen multi-threaded performance plots comparing BLIS, OpenBLAS, and ARMPL. Panels: SGEMM, DGEMM, CGEMM, ZGEMM; SSYRK, DSYRK, CSYRK, ZSYRK; SSYMM, DSYMM, CHEMM, ZHEMM; STRMM, DTRMM, CTRMM, ZTRMM. x-axis: matrix dimension m=n=k (1000 to 5000); y-axis: GFLOPS (ticks up to 1500 for S and C panels, up to 800 for D and Z panels).]
SLIDE 11

Other Architectures: SkylakeX (single core)

[Figure: sixteen single-threaded performance plots comparing BLIS, OpenBLAS, and MKL. Panels: SGEMM, DGEMM, CGEMM, ZGEMM; SSYRK, DSYRK, CSYRK, ZSYRK; SSYMM, DSYMM, CHEMM, ZHEMM; STRMM, DTRMM, CTRMM, ZTRMM. x-axis: matrix dimension m=n=k (500 to 2000); y-axis: GFLOPS (ticks up to 100 for S and C panels, up to 50 for D and Z panels).]
SLIDE 12

Other Architectures: SkylakeX (20 cores)

[Figure: sixteen multi-threaded performance plots comparing BLIS, OpenBLAS, and MKL. Panels: SGEMM, DGEMM, CGEMM, ZGEMM; SSYRK, DSYRK, CSYRK, ZSYRK; SSYMM, DSYMM, CHEMM, ZHEMM; STRMM, DTRMM, CTRMM, ZTRMM. x-axis: matrix dimension m=n=k (1000 to 5000); y-axis: GFLOPS (ticks up to 2000 for S and C panels, up to 1000 for D and Z panels).]
SLIDE 13

Other Architectures: SkylakeX (40 cores)

[Figure: sixteen multi-threaded performance plots comparing BLIS, OpenBLAS, and MKL. Panels: SGEMM, DGEMM, CGEMM, ZGEMM; SSYRK, DSYRK, CSYRK, ZSYRK; SSYMM, DSYMM, CHEMM, ZHEMM; STRMM, DTRMM, CTRMM, ZTRMM. x-axis: matrix dimension m=n=k (1000 to 5000); y-axis: GFLOPS (ticks up to 4000 for S and C panels, up to 2000 for D and Z panels).]
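
The multi-threaded plots use 28 and 56 ThunderX2 cores and 20 and 40 SkylakeX cores, and the Details slide lists a JC_NT and IC_NT value for each configuration. The sketch below checks that the product of the two factors matches the core count in every listed case. It assumes the remaining BLIS loop factors are left at 1, which the slides do not state. The `BLIS_JC_NT` and `BLIS_IC_NT` environment variable names follow the BLIS multithreading documentation and should be verified against the BLIS version in use. `total_threads` and `env_settings` are hypothetical helper names.

```python
# (JC_NT, IC_NT) pairs and core counts copied from the Details slide.
CONFIGS = {
    ("ThunderX2", 1): {"jc_nt": 2, "ic_nt": 14, "cores": 28},
    ("ThunderX2", 2): {"jc_nt": 4, "ic_nt": 14, "cores": 56},
    ("SkylakeX", 1): {"jc_nt": 1, "ic_nt": 20, "cores": 20},
    ("SkylakeX", 2): {"jc_nt": 2, "ic_nt": 20, "cores": 40},
    ("Haswell", 1): {"jc_nt": 2, "ic_nt": 6, "cores": 12},
    ("Haswell", 2): {"jc_nt": 4, "ic_nt": 6, "cores": 24},
}


def total_threads(jc_nt, ic_nt):
    """Thread count when only the JC and IC loops are parallelized (assumption)."""
    return jc_nt * ic_nt


def env_settings(jc_nt, ic_nt):
    """Environment variables one would export before running a BLIS program."""
    return {"BLIS_JC_NT": str(jc_nt), "BLIS_IC_NT": str(ic_nt)}


for (machine, sockets), cfg in CONFIGS.items():
    threads = total_threads(cfg["jc_nt"], cfg["ic_nt"])
    status = "matches" if threads == cfg["cores"] else "differs from"
    print(f"{machine}, {sockets} socket(s): {threads} threads, {status} {cfg['cores']} cores")
```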
SLIDE 14

Conclusion

BLIS instantiates high-performance implementations across virtually all level-3 operations, parameter cases, and datatypes with just two microkernels:

  • sgemm
  • dgemm
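
The Details slide notes that BLIS uses the 1m induced method for complex operations, which is how two real-domain microkernels can also serve the complex datatypes. The sketch below illustrates only the algebraic idea: a complex matrix product is rewritten as one real matrix product after expanding the operands. It is not BLIS's packing or kernel code, it uses plain Python lists for clarity rather than speed, and all function names are hypothetical.

```python
def real_matmul(a, b):
    """Plain real matrix product, standing in for a real-domain gemm."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(inner)) for j in range(cols)]
            for i in range(rows)]


def expand_a(a):
    """Replace each complex entry x+iy by the 2x2 real block [[x, -y], [y, x]]."""
    out = []
    for row in a:
        top, bottom = [], []
        for z in row:
            top += [z.real, -z.imag]
            bottom += [z.imag, z.real]
        out += [top, bottom]
    return out


def expand_b(b):
    """Replace each complex entry x+iy by the 2x1 real column [x, y]."""
    out = []
    for row in b:
        out.append([z.real for z in row])
        out.append([z.imag for z in row])
    return out


def complex_matmul_via_real(a, b):
    """Complex product computed with a single real matrix product."""
    c_real = real_matmul(expand_a(a), expand_b(b))
    # Row 2i holds the real parts and row 2i+1 the imaginary parts of row i.
    return [[complex(c_real[2 * i][j], c_real[2 * i + 1][j])
             for j in range(len(b[0]))]
            for i in range(len(a))]


a = [[1 + 2j, 3 - 1j], [0 + 1j, 2 + 0j]]
b = [[2 - 1j, 1 + 1j], [1 + 0j, 0 - 2j]]
direct = [[sum(a[i][p] * b[p][j] for p in range(2)) for j in range(2)]
          for i in range(2)]
print(complex_matmul_via_real(a, b) == direct)
```

The expanded real operands are twice as large in each affected dimension, so the real product performs the same four real multiplications per complex multiplication as the direct formula.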
SLIDE 15

Thank you!

  • Thanks to Francisco D. Igual for running experiments on a frequency-stable SkylakeX.
  • Thanks to Cavium and ARM for access to ThunderX2.

(More plots on poster)

SLIDE 16
SLIDE 17

Details

  • ThunderX2
      • 2.0 GHz
      • 2 sockets
      • 28 cores/socket
      • Hardware threads turned off
      • OpenBLAS (commit 52d3f7a), built for target THUNDERX2T99
      • ARMPL: armpl-18.4.0_ThunderX2CN99
      • 1 socket: JC_NT = 2; IC_NT = 14
      • 2 sockets: JC_NT = 4; IC_NT = 14
  • SkylakeX
      • 1.7 GHz
      • 2 sockets
      • 20 cores/socket
      • Frequency throttling turned off
      • OpenBLAS (commit 4dd70d9)
      • MKL 18
      • 1 socket: JC_NT = 1; IC_NT = 20
      • 2 sockets: JC_NT = 2; IC_NT = 20
  • Haswell
      • 3.2 GHz/3.0 GHz
      • 2 sockets
      • 12 cores/socket
      • MKL 11.3
      • 1 socket: JC_NT = 2; IC_NT = 6
      • 2 sockets: JC_NT = 4; IC_NT = 6
  • For single-threaded experiments, each graph plots GFLOPS as a function of problem size (varied from 40 to 2000 in increments of 40), where all matrix operands' dimensions (m, n, k) are bound to the problem size. In other words, all matrices are square.
  • For multi-threaded experiments, each graph plots GFLOPS as a function of problem size (varied from 200 to 5000 in increments of 200), where all matrix operands' dimensions (m, n, k) are bound to the problem size. In other words, all matrices are square.
  • The y-axis is scaled so that the top of each graph corresponds to the theoretical peak for the clock rate.
  • All experiments are performed on random data, with each point of each graph representing the best (shortest runtime) of three trials.
  • I've omitted legends from all graphs except the one in the bottom-right, only to minimize clutter.
  • BLIS uses the 1m induced method for the complex (C, Z) operations on ThunderX2 and SkylakeX.
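
The measurement procedure described in the bullets above can be sketched as follows: sweep square problem sizes, run each case three times, keep the shortest runtime, and convert it to GFLOPS. The 2·m·n·k flop count for a real GEMM is the conventional figure and is an assumption here, since the slides do not state the count they used. `naive_gemm` is a pure-Python stand-in for the library call being measured, and every helper name is hypothetical.

```python
import random
import time


def problem_sizes(first, last, step):
    """Problem sizes from first to last inclusive, e.g. 40, 80, ..., 2000."""
    return list(range(first, last + 1, step))


def gemm_flops(m, n, k):
    """Conventional flop count for a real GEMM (assumption, not from the slides)."""
    return 2 * m * n * k


def best_time(run, trials=3):
    """Shortest wall-clock time over `trials` calls to run()."""
    best = float("inf")
    for _ in range(trials):
        start = time.perf_counter()
        run()
        best = min(best, time.perf_counter() - start)
    return best


def gflops(flops, seconds):
    """Convert a flop count and a runtime in seconds to GFLOPS."""
    return flops / seconds / 1e9


def naive_gemm(a, b):
    """Stand-in for the measured library call; far slower than any BLAS."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


if __name__ == "__main__":
    # Tiny sizes so the pure-Python stand-in finishes quickly.
    for n in problem_sizes(20, 60, 20):
        a = [[random.random() for _ in range(n)] for _ in range(n)]
        b = [[random.random() for _ in range(n)] for _ in range(n)]
        seconds = best_time(lambda: naive_gemm(a, b))
        print(n, f"{gflops(gemm_flops(n, n, n), seconds):.4f} GFLOPS")
```

With the ranges from the slides, `problem_sizes(40, 2000, 40)` yields 50 single-threaded data points per curve and `problem_sizes(200, 5000, 200)` yields 25 multi-threaded ones.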