I Don’t Care about BLAS
Devin Matthews, Institute for Computational Engineering and Sciences, UT Austin
#smallmoleculesmatter
BLIS Retreat, Sept. 18, 2017
From QC to DGEMM
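“Simple” eigenproblem… $\bar{H} \hat{R} |\Phi\rangle = E \hat{R} |\Phi\rangle$
In terms of tensors… $\bar{H}^{abij}_{cdkl}$, $r^{abef}_{ijkl}$
In terms of other tensors… $r^{abef}_{ijkl}$, $W^{ai}_{ck}$, $F^{i}_{k}$, $t^{ab}_{ij}$, $\ldots$
With structured sparsity… $r^{ab\bar{e}\bar{f}}_{ij\bar{k}\bar{l}}$ or $\check{r}^{abef}_{ijkl}$
With symmetry… $r^{a<b\,\bar{e}<\bar{f}}_{i<j\,\bar{k}<\bar{l}}$ or $\check{r}^{abef}_{i \le j \le k \le l}$
With slicing (or blocking etc.)… $r^{abef}_{0000}$, $r^{abef}_{0001}$, $r^{abef}_{0002}$, $\ldots$
With more sparsity… $r^{(ab)\gamma_{ab}}_{(ef)\gamma_{ef}}$
In terms of dense tensors… $r^{ab}_{ef} \in \mathbb{R}^{n_a \otimes n_b \otimes n_e \otimes n_f}$
In terms of matrices. $A = B \cdot C$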
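Tensor contraction: $W^{ak}_{ic} = \sum_{em} t^{ae}_{im} V^{mk}_{ec}$

As a single matrix multiplication: $A = B \cdot C$ with $A_{(ai),(ck)} = W^{ak}_{ic}$, $B_{(ai),(em)} = t^{ae}_{im}$, $C_{(em),(ck)} = V^{mk}_{ec}$ (one DGEMM).

As a loop over slices: $[W_{ak}]_{ic} \mathrel{+}= \sum_{m} [t_{am}]_{ei} [V_{mk}]_{ec} \quad \forall\, i, e, c$ (for i, for e, for c: DGEMM).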
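The two routes to DGEMM above can be compared on a small example. The following is a minimal sketch in plain Python, assuming the hypothetical index order `t[a, e, i, m]`, `V[m, k, e, c]`, `W[a, k, i, c]` and using a triple-loop `matmul` as a stand-in for DGEMM; none of these names come from the slides.

```python
import itertools
import random

def matmul(A, B):
    """Plain triple-loop matrix product standing in for DGEMM."""
    inner, cols = len(B), len(B[0])
    return [[sum(row[p] * B[p][q] for p in range(inner)) for q in range(cols)]
            for row in A]

def contract_reference(t, V, na, ne, ni, nm, nk, nc):
    """W[a,k,i,c] = sum over e,m of t[a,e,i,m] * V[m,k,e,c], element by element."""
    return {(a, k, i, c): sum(t[a, e, i, m] * V[m, k, e, c]
                              for e in range(ne) for m in range(nm))
            for a in range(na) for k in range(nk)
            for i in range(ni) for c in range(nc)}

def contract_one_gemm(t, V, na, ne, ni, nm, nk, nc):
    """Copy t and V into matrices B[(a,i),(e,m)] and C[(e,m),(c,k)], multiply once."""
    ai = list(itertools.product(range(na), range(ni)))
    em = list(itertools.product(range(ne), range(nm)))
    ck = list(itertools.product(range(nc), range(nk)))
    B = [[t[a, e, i, m] for (e, m) in em] for (a, i) in ai]
    C = [[V[m, k, e, c] for (c, k) in ck] for (e, m) in em]
    A = matmul(B, C)
    return {(a, k, i, c): A[r][q]
            for r, (a, i) in enumerate(ai) for q, (c, k) in enumerate(ck)}

def contract_sliced(t, V, na, ne, ni, nm, nk, nc):
    """One small product per fixed (i, e, c), accumulated into W over e."""
    W = {(a, k, i, c): 0.0 for a in range(na) for k in range(nk)
         for i in range(ni) for c in range(nc)}
    for i, e, c in itertools.product(range(ni), range(ne), range(nc)):
        t_am = [[t[a, e, i, m] for m in range(nm)] for a in range(na)]
        V_mk = [[V[m, k, e, c] for k in range(nk)] for m in range(nm)]
        W_ak = matmul(t_am, V_mk)
        for a in range(na):
            for k in range(nk):
                W[a, k, i, c] += W_ak[a][k]
    return W

def max_abs_diff(X, Y):
    """Largest elementwise deviation between two tensors stored as dicts."""
    return max(abs(X[key] - Y[key]) for key in X)

random.seed(0)
dims = dict(na=3, ne=2, ni=2, nm=4, nk=3, nc=2)  # arbitrary small extents
t = {idx: random.random()
     for idx in itertools.product(range(3), range(2), range(2), range(4))}
V = {idx: random.random()
     for idx in itertools.product(range(4), range(3), range(2), range(2))}
W_ref = contract_reference(t, V, **dims)
W_gemm = contract_one_gemm(t, V, **dims)
W_sliced = contract_sliced(t, V, **dims)
```

The single-multiplication route pays for explicit copies of both operands into matrix form, while the sliced route avoids the copies but issues many small products.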
… structure (permutational symmetry, point group symmetry, sparsity, etc.) we have to expand or block them.
… Direct Product Decomposition (DPD), but we want to automate and optimize it.
… we aggregate these into more efficient operations?
DPD: Stanton, J. F.; Gauss, J.; Watts, J. D.; Bartlett, R. J. J. Chem. Phys. 1991, 94, 4334.
[Figure: the BLIS GEMM algorithm as five loops around the micro-kernel. The 5th, 4th, and 3rd loops partition the operands with block sizes nC, kC, and mC; the 2nd and 1st loops step through the packed blocks with register block sizes nR and mR. $B_p$ is packed into $\tilde{B}_p$ and $A_i$ into $\tilde{A}_i$, and the diagram marks which level of the memory hierarchy (main memory, L3 cache, L2 cache, L1 cache, registers) holds each block.]

[Figure: the same loop structure in TBLIS. The operands A, B, and C appear in three forms: tensor physical layout, matrix-like logical layout, and packed internal layout. Packing “$A_i$” → $\tilde{A}_i$ and “$B_p$” → $\tilde{B}_p$ uses a tensor-to-matrix packing kernel, and the micro-kernel result is written to C with a (possibly) scattered update.]
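One way to picture a tensor-to-matrix packing kernel and a scattered update is with per-row and per-column offset ("scatter") vectors. The sketch below is an illustration under stated assumptions, not the TBLIS kernels themselves: the tensor is dense and row-major, the index order `t[a, e, i, m]` is hypothetical, and all function names are invented for this example.

```python
import itertools

def strides_for(shape):
    """Row-major strides (in elements) for a dense tensor of the given shape."""
    strides, step = [], 1
    for extent in reversed(shape):
        strides.append(step)
        step *= extent
    return list(reversed(strides))

def scatter_vector(shape, strides, modes):
    """Offsets contributed by the chosen tensor modes, one per matrix row or column."""
    ranges = [range(shape[m]) for m in modes]
    return [sum(i * strides[m] for i, m in zip(idx, modes))
            for idx in itertools.product(*ranges)]

def pack_block(flat, rscat, cscat):
    """Gather a logical matrix block into a contiguous row-major buffer."""
    return [flat[r + c] for r in rscat for c in cscat]

def scatter_update(flat, rscat, cscat, block):
    """Accumulate a contiguous row-major block back into scattered tensor storage."""
    values = iter(block)
    for r in rscat:
        for c in cscat:
            flat[r + c] += next(values)

shape = (3, 2, 2, 4)                       # hypothetical t[a, e, i, m]
strides = strides_for(shape)               # one stride per tensor mode
flat = list(range(3 * 2 * 2 * 4))          # element value equals its offset
rscat = scatter_vector(shape, strides, modes=(0, 2))  # matrix rows (a, i)
cscat = scatter_vector(shape, strides, modes=(1, 3))  # matrix columns (e, m)
packed = pack_block(flat, rscat[:2], cscat[:3])       # a 2 x 3 block

C = [0] * len(flat)
scatter_update(C, rscat[:2], cscat[:3], [1] * 6)
```

Because the matrix element at row `r`, column `c` lives at `rscat[r] + cscat[c]`, the tensor never has to be transposed in memory before it is packed.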
[Figure: performance in GFLOPS/s of MKL, BLIS, TBLIS, and TTDT on a Xeon E5-2690 v3, in four panels covering 1 core and 12 cores.]
[Figure: performance in GFLOPS/s of MKL, BLIS, TBLIS, and TTDT in two panels: “Square” MM and TC, and “Rank-k” MM and TC.]
Entire quantity: laid out on disk
Hunk: sized to fit in memory
Chunk: fixed irreducible representations
Virtual block: fixed values of ijkl

Option #1: Batch within TBLIS framework
Option #2: Batch outside of TBLIS framework
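[Figure: micro-kernel update $C_i \mathrel{+}= \tilde{A}_i \tilde{B}_p$ on blocked operands, with block sizes mR, kC, and nR. Legend: zero (not computed), non-zero.]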
for k’
  pfor n’
    pfor m’
      if ∃A(m’,k’) ∧ ∃B(k’,n’) ∧ ∃C(m’,n’)
        contract(…)
      endif
    done
  done
done
Split communicator into c_in & c_out
pfor n’ over c_out
  pfor m’ over c_out
    ks = {}
    for k’
      if ∃A(m’,k’) ∧ ∃B(k’,n’) ∧ ∃C(m’,n’)
        append k’ to ks
      endif
    done
    pcontract(ks,…) over c_in
  done
done
Use hierarchical dynamic+static parallelism and aggregate blocks when possible.
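The effect of aggregating blocks can be sketched by counting contraction calls for the two loop nests. This is a minimal serial illustration, assuming block-sparse operands stored as dictionaries keyed by block indices, with scalar values standing in for dense blocks; the `pfor` loops and the `c_in`/`c_out` communicators are not modeled, and skipping empty `ks` lists is a choice made for this sketch.

```python
def make_contract(A, B, C, calls):
    """Return a contract routine that logs each call and updates block C[m, n]."""
    def contract(m, n, ks):
        calls.append((m, n, tuple(ks)))
        C[m, n] += sum(A[m, k] * B[k, n] for k in ks)
    return contract

def per_block(A, B, C, n_m, n_n, n_k, contract):
    """First loop nest: one call per existing (m, n, k) triple of blocks."""
    for k in range(n_k):
        for n in range(n_n):          # pfor on the slide, serial here
            for m in range(n_m):      # pfor on the slide, serial here
                if (m, k) in A and (k, n) in B and (m, n) in C:
                    contract(m, n, [k])

def aggregated(A, B, C, n_m, n_n, n_k, pcontract):
    """Second loop nest: gather all valid k for each (m, n), then call once."""
    for n in range(n_n):              # pfor over c_out on the slide
        for m in range(n_m):          # pfor over c_out on the slide
            ks = [k for k in range(n_k)
                  if (m, k) in A and (k, n) in B and (m, n) in C]
            if ks:                    # assumption: skip empty work lists
                pcontract(m, n, ks)   # would run over c_in on the slide

# Hypothetical 2 x 2 grid of blocks; missing keys are zero blocks.
A = {(0, 0): 1.0, (0, 1): 2.0, (1, 1): 3.0}
B = {(0, 0): 4.0, (1, 0): 5.0, (1, 1): 6.0}

C1 = {(0, 0): 0.0, (1, 0): 0.0, (1, 1): 0.0}
calls1 = []
per_block(A, B, C1, 2, 2, 2, make_contract(A, B, C1, calls1))

C2 = {(0, 0): 0.0, (1, 0): 0.0, (1, 1): 0.0}
calls2 = []
aggregated(A, B, C2, 2, 2, 2, make_contract(A, B, C2, calls2))
```

Both variants produce the same result, but the aggregated one hands each output block a single, larger unit of work that an inner group of threads could share.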
[Figure: bar charts of GFLOPS/s for three problem sizes (nv = 40, no = 10; nv = 60, no = 5; nv = 20, no = 20). Legend: baseline, thread within blocks; baseline, thread… (uses TBLIS for inner tensor contraction); and a version that adds hierarchical multithreading and block aggregation. Annotated improvements per panel, listed in the order of the problem sizes: ↑46%, ↑85%, ↑55%; ↑72%, ↑54%, ↑80%; ↑638%, ↑567%, ↑1257%; ↑0%, ↑5%, ↑8%.]
Point Group Symmetry: cost savings proportional to $g^2$ (g = number of irreducible representations/blocks).
Permutational Symmetry: factorial cost savings for increasing dimensionality.
$A_{ijk} = -A_{jik} = A_{jki} = \ldots$
$A_{i<j<k}$
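The factorial saving from permutational symmetry can be made concrete with a small sketch of packed $i<j<k$ storage for a fully antisymmetric three-index tensor. The indexing scheme (the combinatorial number system) and the function names are assumptions chosen for this illustration; the slides do not specify a storage order.

```python
from itertools import combinations
from math import comb

def packed_index(i, j, k):
    """Position of the triple (i < j < k) in packed storage."""
    return comb(k, 3) + comb(j, 2) + i

def get(packed, i, j, k):
    """Read A[i, j, k] of a fully antisymmetric tensor from packed storage."""
    idx = [i, j, k]
    if len(set(idx)) < 3:
        return 0.0  # antisymmetry forces elements with a repeated index to vanish
    sign = 1
    for p in range(3):  # count inversions to get the permutation sign
        for q in range(p + 1, 3):
            if idx[p] > idx[q]:
                sign = -sign
    lo, mid, hi = sorted(idx)
    return sign * packed[packed_index(lo, mid, hi)]

n = 5
packed = [float(v + 1) for v in range(comb(n, 3))]      # arbitrary unique elements
positions = sorted(packed_index(*c) for c in combinations(range(n), 3))

full_size = 10 ** 3            # dense storage for n = 10
packed_size = comb(10, 3)      # unique elements for n = 10
```

For n = 10 the dense tensor has 1000 elements but only 120 unique ones, and the ratio approaches 3! = 6 as n grows.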
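[Figure: a three-index tensor with indices i, j, k divided into blocks, viewed as a matrix with combined index ik and index j.]
Stride of j within block depends on block of i!
Zero pad
Full scatter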
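The stride problem and the two remedies can be illustrated with a two-index example. This sketch assumes a hypothetical layout in which every (block of i, block of j) pair is stored contiguously with i running fastest, and it keeps all blocks for simplicity; the block sizes and function names are invented for the illustration.

```python
def block_starts(sizes):
    """Start offset of each (block_i, block_j) block, stored one after another."""
    starts, offset = {}, 0
    for bj, nj in enumerate(sizes):
        for bi, ni in enumerate(sizes):
            starts[bi, bj] = offset
            offset += ni * nj
    return starts

def blocked_offset(sizes, starts, bi, li, bj, lj):
    """Offset of local element (li, lj) of block (bi, bj); i runs fastest."""
    return starts[bi, bj] + li + lj * sizes[bi]

def padded_offset(sizes, bi, li, bj, lj):
    """Offset of the same element after zero padding every block to max(sizes)."""
    pad = max(sizes)
    leading_dim = pad * len(sizes)   # constant stride along j
    return (bi * pad + li) + (bj * pad + lj) * leading_dim

def full_scatter(sizes, starts):
    """Explicit offset table with one entry per stored element."""
    return {(bi, li, bj, lj): blocked_offset(sizes, starts, bi, li, bj, lj)
            for bj, nj in enumerate(sizes) for lj in range(nj)
            for bi, ni in enumerate(sizes) for li in range(ni)}

sizes = [3, 1, 2]   # hypothetical number of index values per block
starts = block_starts(sizes)

# Stride along j inside block column 2, for each block of i.
stride_j = [blocked_offset(sizes, starts, bi, 0, 2, 1)
            - blocked_offset(sizes, starts, bi, 0, 2, 0) for bi in range(3)]
padded_stride_j = [padded_offset(sizes, bi, 0, 2, 1)
                   - padded_offset(sizes, bi, 0, 2, 0) for bi in range(3)]
table = full_scatter(sizes, starts)
```

Zero padding buys a constant stride at the cost of storing and multiplying zeros, while the full scatter table keeps storage tight at the cost of one stored offset per element.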
[Figure: speedup (axis from 1 to 1.8) in computation of the coupled cluster singles and doubles (CCSD) ground state energy when using TBLIS, on 2x Xeon E5-2698 v3 (32 cores). Systems are ordered from less symmetry to more symmetry:]
Guanine-cytosine dimer (GC), no symmetry (Krepl et al., J. Phys. Chem. B 2013, 117, 1872)
2,4-diphenylfuran (DPF), Cs symmetry
(E)-1,2-dibromo-1,2-diphenylethene (DBDPE), planar, C2h symmetry
Gold dimer (Au2), all-electron, D2h symmetry
Robert van de Geijn, Field Van Zee, Tyler Smith, Jianyu Huang
TBLIS on
Devangi Parikh
www.cfour.de