I Don’t Care about BLAS: Devin Matthews, Institute for Computational Engineering and Sciences, UT Austin

SLIDE 1

I Don’t Care about BLAS

Devin Matthews Institute for Computational Engineering and Sciences, UT Austin

#smallmoleculesmatter

BLIS Retreat, Sept. 18, 2017

SLIDE 2

From QC to DGEMM

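  • “Simple” eigenproblem… $\bar{H}\hat{R}|\Phi\rangle = E\hat{R}|\Phi\rangle$
  • In terms of tensors… $\bar{H}^{abij}_{cdkl}\ r^{abef}_{ijkl}$
  • In terms of other tensors… $r^{abef}_{ijkl},\ W^{ai}_{ck},\ F^{i}_{k},\ t^{ab}_{ij},\ \ldots$
  • With structured sparsity… $r^{ab\bar{e}\bar{f}}_{ij\bar{k}\bar{l}}$ or $\check{r}^{abef}_{ijkl}$
  • With symmetry… $r^{a<b\,\bar{e}<\bar{f}}_{i<j\,\bar{k}<\bar{l}}$ or $\check{r}^{abef}_{i\le j\le k\le l}$
  • With slicing (or blocking etc.)… $r^{abef}_{0000},\ r^{abef}_{0001},\ r^{abef}_{0002},\ \ldots$
  • With more sparsity… $r^{(ab)\gamma_{ab}}_{(ef)\gamma_{ef}}$
  • In terms of dense tensors… $r^{ab}_{ef} \in \mathbb{R}^{n_a \otimes n_b \otimes n_e \otimes n_f}$
  • In terms of matrices. $A = B \cdot C$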

SLIDE 3

Tensor Contraction Today

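$W^{ak}_{ic} = \sum_{em} t^{ae}_{im} V^{mk}_{ec}$

“TTDT” (transpose-transpose-DGEMM-transpose): $A = B \cdot C$ with $A_{(ai),(ck)} = W^{ak}_{ic}$, $B_{(ai),(em)} = t^{ae}_{im}$, $C_{(em),(ck)} = V^{mk}_{ec}$

“LoG” (loops over GEMM): $\sum_m [t_{am}]_{ei} [V_{mk}]_{ec} \rightarrow [W_{ak}]_{ic}\ \forall\, i, e, c$, i.e. for i, for e, for c: DGEMM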
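The two strategies on this slide can be compared on a tiny example. The sketch below is illustrative only: it assumes the contraction $W^{ak}_{ic} = \sum_{em} t^{ae}_{im} V^{mk}_{ec}$, stores tensors as nested Python lists indexed `t[a][e][i][m]`, `V[m][k][e][c]`, and `W[a][k][i][c]` (a layout chosen here, not taken from the slides), and uses a plain triple loop (`matmul`, a hypothetical helper) in place of DGEMM.

```python
import itertools

def zeros(*dims):
    """Nested-list tensor of zeros with the given dimensions."""
    if len(dims) == 1:
        return [0.0] * dims[0]
    return [zeros(*dims[1:]) for _ in range(dims[0])]

def matmul(B, C):
    """Triple-loop stand-in for DGEMM: returns the product B @ C."""
    return [[sum(B[r][p] * C[p][c] for p in range(len(C)))
             for c in range(len(C[0]))] for r in range(len(B))]

def contract_ttdt(t, V, na, ne, ni, nm, nk, nc):
    """Transpose both inputs to matrices, do one large GEMM, transpose back."""
    # B has rows (a,i) and columns (e,m); C has rows (e,m) and columns (c,k).
    B = [[t[a][e][i][m] for e in range(ne) for m in range(nm)]
         for a in range(na) for i in range(ni)]
    C = [[V[m][k][e][c] for c in range(nc) for k in range(nk)]
         for e in range(ne) for m in range(nm)]
    A = matmul(B, C)
    W = zeros(na, nk, ni, nc)
    for a, i, c, k in itertools.product(range(na), range(ni), range(nc), range(nk)):
        W[a][k][i][c] = A[a * ni + i][c * nk + k]
    return W

def contract_log(t, V, na, ne, ni, nm, nk, nc):
    """Loop over i, e, c and do one small (a x m) by (m x k) GEMM per iteration."""
    W = zeros(na, nk, ni, nc)
    for i in range(ni):
        for e in range(ne):
            for c in range(nc):
                Bs = [[t[a][e][i][m] for m in range(nm)] for a in range(na)]
                Cs = [[V[m][k][e][c] for k in range(nk)] for m in range(nm)]
                As = matmul(Bs, Cs)
                for a in range(na):
                    for k in range(nk):
                        W[a][k][i][c] += As[a][k]  # contributions from each e accumulate
    return W

# Small integer-valued test tensors so that both orderings agree exactly.
na, ne, ni, nm, nk, nc = 2, 3, 2, 3, 2, 2
t = [[[[float((a + 2 * e + 3 * i + 5 * m) % 7) for m in range(nm)]
       for i in range(ni)] for e in range(ne)] for a in range(na)]
V = [[[[float((m + 3 * k + 2 * e + c) % 5) for c in range(nc)]
       for e in range(ne)] for k in range(nk)] for m in range(nm)]
W_ttdt = contract_ttdt(t, V, na, ne, ni, nm, nk, nc)
W_log = contract_log(t, V, na, ne, ni, nm, nk, nc)
print(W_ttdt == W_log)
```

Both routes give the same W. TTDT pays for three explicit data rearrangements around one large product, while LoG avoids them at the price of many small products.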

SLIDE 4

SLIDE 5

DGEMM Considered Harmful

  • Tensors have to be transposed in order to use DGEMM.
  • DGEMM needs dense matrices. If our tensors have structure (permutational symmetry, point group symmetry, sparsity, etc.) we have to expand or block them.
  • Point group symmetry is efficiently handled with the Direct Product Decomposition (DPD), but we want to automate and optimize it.
  • Blocking reduces the size of individual DGEMM calls. Can we aggregate these into more efficient operations?
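To make the trade-off behind blocking concrete, the sketch below counts floating-point operations for a dense GEMM and for the same product split into g independent diagonal blocks, as happens when a symmetry-blocked tensor is laid out as a block-diagonal matrix. It assumes, purely for illustration, that all g blocks have equal size; the function names are hypothetical.

```python
def dense_flops(m, n, k):
    """Flop count (one multiply and one add per term) of a dense m x n x k GEMM."""
    return 2 * m * n * k

def blocked_flops(m, n, k, g):
    """Flop count when the GEMM splits into g independent diagonal blocks.

    Assumption for this sketch: every block has size (m/g) x (n/g) x (k/g).
    """
    assert m % g == 0 and n % g == 0 and k % g == 0
    return g * dense_flops(m // g, n // g, k // g)

for g in (1, 2, 4, 8):
    ratio = dense_flops(4000, 4000, 4000) / blocked_flops(4000, 4000, 4000, g)
    print(g, 4000 // g, ratio)  # number of blocks, block edge, flop savings
```

Under this assumption the work drops by a factor of g², but each individual call shrinks from a 4000-sized product to a (4000/g)-sized one, which is the small-DGEMM problem the last bullet raises.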

DPD: Stanton, J.F.; Gauss, J.; Watts, J.D.; Bartlett, R.J. J. Chem. Phys. 1991, 94, 4334.
SLIDE 6

How Much Does Transpose Cost?

SLIDE 7

BLIS → TBLIS

[Figure: the BLIS GEMM algorithm, five loops around the micro-kernel with blocking parameters nC, kC, mC, nR, and mR, packing Ai → Ãi and Bp → B̃p, with data placed across main memory, L3 cache, L2 cache, L1 cache, and registers. The TBLIS version maps the tensor physical layout to a matrix-like logical layout and then to the packed internal layout, packing “Ai” → Ãi and “Bp” → B̃p.]

Tensor to matrix packing kernel

(Possibly) scattered update
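One way to picture a tensor-to-matrix packing kernel is with scatter vectors: precomputed memory offsets for every logical row and every logical column, so that element (r, c) of the matrix view lives at `rscat[r] + cscat[c]` in the tensor's storage. The sketch below is an illustration of that idea under assumptions made here (column-major storage, hypothetical helper names); it is not a description of the actual TBLIS kernels.

```python
import itertools

def strides_of(lengths):
    """Column-major strides for a dense tensor with the given index lengths."""
    strides, s = [], 1
    for n in lengths:
        strides.append(s)
        s *= n
    return strides

def scatter_vector(lengths, strides, group):
    """Storage offsets for every combination of the tensor indices in `group`.

    The first index in `group` varies fastest.
    """
    offsets = []
    ranges = [range(lengths[d]) for d in reversed(group)]
    for combo in itertools.product(*ranges):
        offsets.append(sum(x * strides[d] for x, d in zip(combo, reversed(group))))
    return offsets

def pack_as_matrix(data, rscat, cscat):
    """Gather tensor elements into a dense matrix without transposing the tensor."""
    return [[data[r + c] for c in cscat] for r in rscat]

lengths = (2, 3, 2, 3)                 # index lengths of t(a, e, i, m)
strides = strides_of(lengths)          # [1, 2, 6, 12]
data = list(range(2 * 3 * 2 * 3))      # each element's value equals its offset
rscat = scatter_vector(lengths, strides, (0, 2))   # matrix rows: (a, i)
cscat = scatter_vector(lengths, strides, (1, 3))   # matrix columns: (e, m)
M = pack_as_matrix(data, rscat, cscat)
print(rscat, cscat)
```

The same offsets can be used in the other direction to write results back, which corresponds to the "(possibly) scattered update" of the output.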

SLIDE 8

Results for Dense Tensors

[Figure: four panels of GFLOPS/s comparing MKL, BLIS, TBLIS, and TTDT on a Xeon E5-2690 v3 (1 core and 12 cores). Panels: approx. M=N=K; approx. K (M=N=4000); approx. M=N=K; approx. K (M=N=16000).]

SLIDE 9

Works Great on Xeon Phi Too

[Figure: two panels of GFLOPS/s comparing MKL, BLIS, TBLIS, and TTDT on Xeon Phi 7210. “Square” MM and TC: approx. M=N=K. “Rank-k” MM and TC: approx. K (M=N=16000).]

SLIDE 10

Quasi-Sparse Tensor Contractions

  • Entire quantity: laid out on disk
  • Hunk: sized to fit in memory
  • Chunk: fixed irreducible representations
  • Virtual block: fixed values of ijkl

Option #1: Batch within TBLIS framework
Option #2: Batch outside of TBLIS framework

[Figure: micro-kernel update Ci += Ãi B̃p with blocking parameters mR, nR, and kC; blocks are marked "zero, not computed" or "non-zero".]

for k'
  pfor n'
    pfor m'
      if ∃A(m',k') ∧ ∃B(k',n') ∧ ∃C(m',n')
        contract(…)
      endif
    done
  done
done

SLIDE 11

Quasi-Sparse Tensor Contractions

for k'
  pfor n'
    pfor m'
      if ∃A(m',k') ∧ ∃B(k',n') ∧ ∃C(m',n')
        contract(…)
      endif
    done
  done
done

Split communicator into c_in & c_out
pfor n' over c_out
  pfor m' over c_out
    ks = {}
    for k'
      if ∃A(m',k') ∧ ∃B(k',n') ∧ ∃C(m',n')
        append k' to ks
      endif
    done
    pcontract(ks,…) over c_in
  done
done

Use hierarchical dynamic+static parallelism and aggregate blocks when possible.
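The difference between the two loop nests can be sketched serially. In the illustration below, blocks are stored in dictionaries keyed by block indices (a block "exists" if its key is present), `matmul_acc` is a hypothetical stand-in for `contract(…)`, and the aggregated version concatenates all valid k' blocks before issuing a single larger product. The parallel aspects (`pfor`, communicators) are deliberately left out, and all names are assumptions of this sketch.

```python
def matmul_acc(C, A, B):
    """C += A @ B on nested lists; stands in for one contraction call."""
    for r in range(len(A)):
        for c in range(len(B[0])):
            C[r][c] += sum(A[r][p] * B[p][c] for p in range(len(B)))

def contract_per_block(A, B, C, nk):
    """First loop nest: one small contraction per existing (m', n', k') triple."""
    calls = 0
    for k in range(nk):
        for (m, n) in C:
            if (m, k) in A and (k, n) in B:
                matmul_acc(C[(m, n)], A[(m, k)], B[(k, n)])
                calls += 1
    return calls

def contract_aggregated(A, B, C, nk):
    """Second loop nest: collect the valid k' values, then contract them at once."""
    calls = 0
    for (m, n) in C:
        ks = [k for k in range(nk) if (m, k) in A and (k, n) in B]
        if not ks:
            continue
        rows = len(C[(m, n)])
        # A blocks are placed side by side, B blocks are stacked vertically.
        A_cat = [sum((A[(m, k)][r] for k in ks), []) for r in range(rows)]
        B_cat = [row for k in ks for row in B[(k, n)]]
        matmul_acc(C[(m, n)], A_cat, B_cat)
        calls += 1
    return calls

def block(v):
    """A 2 x 2 block with small integer entries starting at v."""
    return [[float(v), float(v + 1)], [float(v + 2), float(v + 3)]]

def zero_blocks():
    return {(0, 0): [[0.0, 0.0], [0.0, 0.0]], (1, 1): [[0.0, 0.0], [0.0, 0.0]]}

A = {(0, 0): block(1), (0, 1): block(2), (1, 1): block(3)}
B = {(0, 0): block(4), (1, 0): block(5), (1, 1): block(6)}
C_base, C_agg = zero_blocks(), zero_blocks()
calls_base = contract_per_block(A, B, C_base, nk=2)
calls_agg = contract_aggregated(A, B, C_agg, nk=2)
print(calls_base, calls_agg, C_base == C_agg)
```

Aggregation yields the same result with fewer, larger contractions, which is where a tuned kernel tends to be more efficient.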

SLIDE 12

Quasi-Sparse Tensor Contractions

[Figure: four bar-chart panels of GFLOPS/s for three problem sizes. Series: baseline, thread within blocks; baseline, thread outside blocks; TBLIS. Legend notes: (uses TBLIS for inner tensor contraction); (adds hierarchical multithreading and block aggregation). Bar labels and ↑ annotations per panel:]

Panel 1          within blocks   outside blocks   TBLIS   ↑
nv=40, no=10     140             766              1121    46%
nv=60, no=5      486             612              1130    85%
nv=20, no=20     5               230              357     55%

Panel 2          within blocks   outside blocks   TBLIS   ↑
nv=40, no=10     53              321              551     72%
nv=60, no=5      362             205              558     54%
nv=20, no=20     3               157              282     80%

Panel 3 (only six bar labels present)
                 baseline        TBLIS            ↑
nv=40, no=10     8               59               638%
nv=60, no=5      6               40               567%
nv=20, no=20     1               19               1257%

Panel 4          within blocks   outside blocks   TBLIS   ↑
nv=40, no=10     140             842              845     0%
nv=60, no=5      543             784              825     5%
nv=20, no=20     5               287              311     8%
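The ↑ percentages appear to be the TBLIS gain over the better of the two baselines. The short check below reproduces the first panel's annotations under that reading, assuming the nine bar labels (140 486 5, 766 612 230, 1121 1130 357) are listed series by series; the grouping is an inference from the numbers, not something the slide states.

```python
def gain_percent(tblis, baselines):
    """Rounded percentage gain of `tblis` over the largest baseline value."""
    best = max(baselines)
    return round(100.0 * (tblis / best - 1.0))

# (within blocks, outside blocks, TBLIS) in GFLOPS/s, first panel bar labels.
panel = {
    "nv=40, no=10": (140, 766, 1121),
    "nv=60, no=5": (486, 612, 1130),
    "nv=20, no=20": (5, 230, 357),
}
gains = {size: gain_percent(t, (w, o)) for size, (w, o, t) in panel.items()}
print(gains)
```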

SLIDE 13

Taking Advantage of Structure

[Figure: tensor with axes i, j, k.]

Point Group Symmetry: cost savings proportional to $g^2$ (g = number of irreducible representations/blocks).

Permutational Symmetry: factorial cost savings for increasing dimensionality.

$A_{ijk} = -A_{jik} = A_{jki} = -A_{kji} = A_{kij} = -A_{ikj}$, stored as $A_{i<j<k}$
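The permutational-symmetry relations can be sketched as a packed container that stores only the elements with i<j<k and recovers every other element from the sign of the permutation that sorts its indices. The class and helper names are hypothetical and the dictionary storage is chosen for clarity, not performance.

```python
from itertools import combinations
from math import comb

def sort_with_sign(idx):
    """Sort an index tuple; return (sorted tuple, sign of the sorting permutation).

    The sign is 0 when two indices coincide, since such elements vanish.
    """
    idx = list(idx)
    sign = 1
    for end in range(len(idx) - 1, 0, -1):      # bubble sort, one sign flip per swap
        for p in range(end):
            if idx[p] > idx[p + 1]:
                idx[p], idx[p + 1] = idx[p + 1], idx[p]
                sign = -sign
    if any(idx[p] == idx[p + 1] for p in range(len(idx) - 1)):
        sign = 0
    return tuple(idx), sign

class PackedAntisymmetric3:
    """Fully antisymmetric three-index tensor storing only i<j<k."""

    def __init__(self, n):
        self.n = n
        self.values = {ijk: 0.0 for ijk in combinations(range(n), 3)}

    def __getitem__(self, ijk):
        key, sign = sort_with_sign(ijk)
        return sign * self.values[key] if sign else 0.0

A = PackedAntisymmetric3(10)
A.values[(1, 2, 3)] = 5.0
print(A[(1, 2, 3)], A[(2, 1, 3)], A[(2, 3, 1)], A[(1, 1, 3)])
print(len(A.values), 10 ** 3)   # stored elements vs. dense elements
```

For n = 10 the packed form holds C(10,3) = 120 values instead of 1000; as n grows the ratio approaches 3! = 6, and for d antisymmetric indices it approaches d!, which is the factorial saving the slide refers to.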

SLIDE 14

Taking Advantage of Structure

[Figure: block layout of a tensor with axes i, j, k and combined index ik.]

Stride of j within block depends on block of i!

Figure labels: Zero pad; Full scatter.
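A simplified two-index illustration of why the stride varies, and of the two remedies named on the slide, is sketched below. It assumes diagonal blocks of different sizes stored back to back, each column-major with i fastest; this layout and the function names are assumptions for illustration, since the slide's figure cannot be recovered here.

```python
def block_layout(sizes):
    """Blocks stored back to back; the stride of j equals the size of i's block."""
    layout, offset = [], 0
    for n in sizes:
        layout.append({"offset": offset, "stride_i": 1, "stride_j": n})
        offset += n * n
    return layout, offset

def padded_layout(sizes):
    """Zero pad: every block grows to the largest size, so stride_j is uniform."""
    nmax = max(sizes)
    layout = [{"offset": b * nmax * nmax, "stride_i": 1, "stride_j": nmax}
              for b in range(len(sizes))]
    return layout, len(sizes) * nmax * nmax

def scatter_j(sizes):
    """Full scatter: one explicit storage offset per column j over all blocks."""
    layout, _ = block_layout(sizes)
    return [blk["offset"] + j * blk["stride_j"]
            for blk, n in zip(layout, sizes) for j in range(n)]

sizes = [3, 1, 2]                       # hypothetical block sizes
packed, packed_total = block_layout(sizes)
padded, padded_total = padded_layout(sizes)
print([blk["stride_j"] for blk in packed], packed_total)
print([blk["stride_j"] for blk in padded], padded_total)
print(scatter_j(sizes))
```

Zero padding buys a single stride at the cost of extra storage and wasted work on the padding, while a full scatter vector keeps the compact storage but requires an offset lookup for every column.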

SLIDE 15

Speedup in computation of coupled cluster singles and doubles (CCSD) ground state energy when using TBLIS, on 2x Xeon E5-2698 v3 (32 cores).

[Figure: bar chart of speedup (axis from 1 to 1.8) for GC, DPF, DBDPE, and Au2, ordered from less symmetry to more symmetry.]

  • GC: guanine-cytosine dimer, no symmetry (Krepl et al., J. Phys. Chem. B 2013, 117, 1872)
  • DPF: 2,4-diphenylfuran, Cs symmetry
  • DBDPE: (E)-1,2-dibromo-1,2-diphenylethene, planar, C2h symmetry
  • Au2: gold dimer, all-electron, D2h symmetry

SLIDE 16

Summary

  • Novel algorithms leveraging the BLIS methodology can significantly outperform DGEMM-based algorithms for tensor contraction.
  • Breaking through the DGEMM barrier allows new algorithms to be implemented with high efficiency.

SLIDE 17

#smallmoleculesmatter

Thanks! Robert van de Geijn, Field Van Zee, Tyler Smith, Jianyu Huang, Devangi Parikh

TBLIS on

www.cfour.de