Computational Challenges of Coupled Cluster Theory, Jeff Hammond

SLIDE 1

Computational Challenges of Coupled Cluster Theory

Jeff Hammond

Leadership Computing Facility Argonne National Laboratory

11 January 2012

Jeff Hammond ICERM

SLIDE 2

Atomistic simulation in chemistry

1. Classical molecular dynamics (MD) with empirical potentials
2. Quantum molecular dynamics based upon density-functional theory (DFT)
3. Quantum chemistry with wavefunctions, e.g. perturbation theory (PT), coupled-cluster (CC) or quantum Monte Carlo (QMC).

SLIDE 3

Classical molecular dynamics

Image courtesy of Benoît Roux via ALCF.

Solves Newton's equations of motion with empirical terms and classical electrostatics.
Size: 100K-10M atoms
Time: 1-10 ns/day
Scaling: ∼ N_atoms
Math: N-body

Data from K. Schulten, et al. "Biomolecular modeling in the era of petascale computing." In D. Bader, ed., Petascale Computing: Algorithms and Applications.

SLIDE 4

Car-Parrinello molecular dynamics

Image courtesy of Giulia Galli via ALCF.

Forces obtained from solving an approximate single-particle Schrödinger equation.
Size: 100-1000 atoms
Time: 0.01-1 ps/day
Scaling: ∼ N_el^x (x=1-3)
Math: FFT, eigensolve.

F. Gygi, IBM J. Res. Dev. 52, 137 (2008); E. J. Bylaska et al., J. Phys.: Conf. Ser. 180, 012028 (2009).

SLIDE 5

Wavefunction theory

MP2 is second-order PT and is accurate via magical cancellation of error.
CC is an infinite-order solution to the many-body Schrödinger equation, truncated via clusters.
QMC is Monte Carlo integration applied to the Schrödinger equation.

Size: 10-100 atoms, maybe 100-1000 atoms with MP2.
Time: N/A (LOL)
Scaling: ∼ N_bf^x (x=4-7)
Math: DLA (tensors)

Image courtesy of Karol Kowalski and Niri Govind.
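
To make the polynomial scaling concrete, here is a small sketch of how the cost grows when the number of basis functions doubles. The exponents assigned to each method are the conventional textbook values (an assumption added here, not read off the slide), and `cost_growth` is a hypothetical helper name.

```python
# Conventional formal scaling exponents (textbook values, assumed here;
# the slide only quotes the range x = 4-7).
SCALING_EXPONENT = {"MP2": 5, "CCSD": 6, "CCSD(T)": 7}


def cost_growth(size_factor, exponent):
    """Factor by which the cost grows when the basis grows by size_factor."""
    return size_factor ** exponent


# Cost multiplier for doubling the number of basis functions.
growth = {method: cost_growth(2, x) for method, x in SCALING_EXPONENT.items()}

for method, factor in growth.items():
    x = SCALING_EXPONENT[method]
    print(f"{method:8s} ~N^{x}: doubling N_bf costs {factor}x more")
```

Under these assumptions, doubling the system size costs between one and two orders of magnitude more work, which is why wavefunction methods stay in the 10-100 atom range.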

SLIDE 6

The Standard Model (of Quantum Chemistry)

SLIDE 7

Quantum chemistry

1. Separate molecule(s) from environment (closed to both matter and energy)
2. Boundary conditions:
   ψ(x → ∞) = 0 (finite system)
   ψ(x) = ψ(x + g) (infinite, periodic system)
3. Ignore relativity, QED, spin-orbit coupling
4. Separate electronic and nuclear degrees of freedom

→ non-relativistic electronic Schrödinger equation in a vacuum at zero temperature.

SLIDE 8

Quantum chemistry

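\[
\hat{H} = \hat{T}_{el} + \hat{V}_{el-nuc} + \hat{V}_{el-el}
\]

\[
\hat{H} = -\frac{1}{2}\sum_{i=1}^{M}\nabla_i^2 - \sum_{n=1}^{N}\sum_{i=1}^{M}\frac{Z_n}{R_{ni}} + \sum_{i<j}^{M}\frac{1}{r_{ij}}
\]

\[
\Psi(x_1,\ldots,x_i,x_{i+1},\ldots,x_M) = -\Psi(x_1,\ldots,x_{i+1},x_i,\ldots,x_M)
\]

Here M is the number of electrons and N the number of nuclei. The electron coordinates (x_i) include both space (r) and spin (σ). We will integrate out spin wherever possible.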

SLIDE 9

Quantum chemistry

Wavefunction antisymmetry is enforced by expanding in determinants, which we now capture in second quantization.

1. Project physical operators (e.g. Coulomb) into a one-electron basis, usually atom-centered Gaussians
2. Generate a mean-field reference and expand the many-body wavefunction in terms of excitations out of that reference

→ Full configuration-interaction (FCI) ansatz.

1. Truncate the exponentially-growing FCI ansatz (CI = linear generator, CC = exponential generator)
2. Solve CC (or CI) iteratively
3. Add more correlation via perturbation theory

→ CCSD(T), as one example.
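
To illustrate why the FCI ansatz has to be truncated, the following sketch counts determinants with the standard binomial formulas. The function names and the 20-electron, 20-orbital example are assumptions chosen for illustration. The doubles count ignores spin and spatial symmetry, so it is only a rough upper bound.

```python
from math import comb


def fci_determinants(n_orbitals, n_alpha, n_beta):
    """Number of Slater determinants with fixed numbers of alpha/beta electrons."""
    return comb(n_orbitals, n_alpha) * comb(n_orbitals, n_beta)


def double_excitations(n_occ_spin, n_virt_spin):
    """Number of distinct double excitations out of one reference determinant
    (spin-orbital count, no symmetry screening)."""
    return comb(n_occ_spin, 2) * comb(n_virt_spin, 2)


# Hypothetical example: 20 electrons in 20 spatial orbitals
# (20 occupied and 20 virtual spin orbitals).
n_fci = fci_determinants(20, 10, 10)
n_doubles = double_excitations(20, 20)

print(f"FCI determinants:   {n_fci:,}")
print(f"double excitations: {n_doubles:,}")
```

The full expansion grows combinatorially with system size, while a doubles-only truncation grows polynomially.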

SLIDE 10

Quantum chemistry

Correct for missing physics using perturbation theory (a posteriori error correction) or a mixed (e.g. QM/MM) formalism:

1. relativistic corrections
2. non-adiabatic corrections
3. solvent corrections
4. open BC corrections (less common)

SLIDE 11

Coupled-cluster theory

SLIDE 12

Coupled-cluster theory

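\[
|\Psi_{CC}\rangle = \exp(T)\,|\Psi_{HF}\rangle
\]

\[
T = T_1 + T_2 + \cdots + T_n \quad (n \ll N)
\]

\[
T_1 = \sum_{ia} t^{a}_{i}\,\hat{a}^{\dagger}_{a}\hat{a}_{i}
\qquad
T_2 = \sum_{ijab} t^{ab}_{ij}\,\hat{a}^{\dagger}_{a}\hat{a}^{\dagger}_{b}\hat{a}_{j}\hat{a}_{i}
\]

\[
|\Psi_{CCD}\rangle = \exp(T_2)\,|\Psi_{HF}\rangle = (1 + T_2 + T_2^2)\,|\Psi_{HF}\rangle
\]

\[
|\Psi_{CCSD}\rangle = \exp(T_1 + T_2)\,|\Psi_{HF}\rangle = (1 + T_1 + \cdots + T_1^4 + T_2 + T_2^2 + T_1 T_2 + T_1^2 T_2)\,|\Psi_{HF}\rangle
\]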
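
One way to read the finite expansions above (an interpretation added here, not stated on the slide) is that they list the products of cluster operators up to quadruple excitations, which are the only ones that can contribute when the equations are projected onto singles and doubles. The standard argument uses the similarity-transformed Hamiltonian, whose commutator expansion stops after the fourfold commutator because the Hamiltonian contains at most two-body operators. The symbols \bar{H} and \Phi^{ab}_{ij} are notation introduced for this sketch.

```latex
\begin{aligned}
\bar{H} &= e^{-T} \hat{H} e^{T}
         = \hat{H} + [\hat{H},T] + \tfrac{1}{2!}[[\hat{H},T],T]
           + \tfrac{1}{3!}[[[\hat{H},T],T],T]
           + \tfrac{1}{4!}[[[[\hat{H},T],T],T],T] \\
E &= \langle \Psi_{HF} | \bar{H} | \Psi_{HF} \rangle \\
0 &= \langle \Phi^{a}_{i} | \bar{H} | \Psi_{HF} \rangle , \qquad
0 = \langle \Phi^{ab}_{ij} | \bar{H} | \Psi_{HF} \rangle
\end{aligned}
```

The amplitude equations are therefore at most quartic in the amplitudes, no matter how many electrons the system has.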

SLIDE 13

Coupled cluster (CCD) implementation

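exp(T_2)|Ψ_HF⟩ turns into:

\[
\begin{aligned}
R^{ab}_{ij} ={}& V^{ab}_{ij} + P(ia,jb)\Big[\, T^{ae}_{ij} I^{b}_{e} - T^{ab}_{im} I^{m}_{j}
 + \tfrac{1}{2} V^{ab}_{ef} T^{ef}_{ij} + \tfrac{1}{2} T^{ab}_{mn} I^{mn}_{ij} \\
& \qquad - T^{ae}_{mj} I^{mb}_{ie} - I^{ma}_{ie} T^{eb}_{mj}
 + (2T^{ea}_{mi} - T^{ea}_{im})\, I^{mb}_{ej} \Big] \\
I^{a}_{b} ={}& (-2V^{mn}_{eb} + V^{mn}_{be})\, T^{ea}_{mn} \\
I^{i}_{j} ={}& (2V^{mi}_{ef} - V^{im}_{ef})\, T^{ef}_{mj} \\
I^{ij}_{kl} ={}& V^{ij}_{kl} + V^{ij}_{ef} T^{ef}_{kl} \\
I^{ia}_{jb} ={}& V^{ia}_{jb} - \tfrac{1}{2} V^{im}_{eb} T^{ea}_{jm} \\
I^{ia}_{bj} ={}& V^{ia}_{bj} + V^{im}_{be}\,(T^{ea}_{mj} - \tfrac{1}{2} T^{ae}_{mj}) - \tfrac{1}{2} V^{mi}_{be} T^{ae}_{mj}
\end{aligned}
\]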

SLIDE 14

Tensor Contraction Engine

SLIDE 15

Tensor Contraction Engine

What does it do?

1. GUI input of a quantum many-body theory, e.g. CCSD.
2. Operator specification of the theory (as in a theory paper).
3. Apply Wick's theorem to transform operator expressions into array expressions (as in a computational paper).
4. Transform the input array expression into an operation tree using many types of optimization (i.e. compile).
5. Generate an F77/GA/NXTVAL implementation for NWChem or C++/MemoryGrp for MPQC or F90/.. for UTChem.

The developer can intercept at various stages to modify theory, algorithm or implementation (may be painful).

SLIDE 16

TCE Input

We get 73 lines of serial F90 or 604 lines of parallel F77 from this:

1/1 Sum(g1 g2 p3 h4) f(g1 g2) t(p3 h4) {g1+ g2}{p3+ h4}
1/4 Sum(g1 g2 g3 g4 p5 h6) v(g1 g2 g3 g4) t(p5 h6) {g1+ g2+ g4 g3}{p5+ h6}
1/16 Sum(g1 g2 g3 g4 p5 p6 h7 h8) v(g1 g2 g3 g4) t(p5 p6 h7 h8) {g1+ g2+ g4 g3}{p5+ p6+ h8 h7}
1/8 Sum(g1 g2 g3 g4 p5 h6 p7 h8) v(g1 g2 g3 g4) t(p5 h6) t(p7 h8) {g1+ g2+ g4 g3}{p5+ h6}{p7+ h8}

LaTeX equivalent of the first term:

\[
\sum_{g_1,g_2,p_3,h_4} f_{g_1,g_2}\, t_{p_3,h_4}\, \{g_1^{\dagger} g_2\}\{p_3^{\dagger} h_4\}
\]
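
To show how little structure each input line carries, here is a minimal sketch of a parser for one term. It is not the TCE's own parser. The function name, the regular expressions and the returned dictionary layout are all assumptions made for illustration, based only on the four lines shown above.

```python
import re
from fractions import Fraction

# Hypothetical grammar inferred from the lines above:
#   <num>/<den> Sum(<indices>) <tensor>(<indices>)... {<operators>}...
TERM = re.compile(r"^\s*(\d+)/(\d+)\s+Sum\(([^)]*)\)\s*(.*)$")
TENSOR = re.compile(r"([A-Za-z]\w*)\(([^)]*)\)")
NORMAL_ORDERED = re.compile(r"\{([^}]*)\}")


def parse_term(line):
    """Split one input line into prefactor, summed indices, tensors, operators."""
    match = TERM.match(line)
    if match is None:
        raise ValueError(f"not a recognised term: {line!r}")
    num, den, summed, rest = match.groups()
    tensor_part = rest.split("{", 1)[0]  # tensors come before the first brace
    return {
        "factor": Fraction(int(num), int(den)),
        "sum": summed.split(),
        "tensors": [(name, idx.split()) for name, idx in TENSOR.findall(tensor_part)],
        # a trailing '+' marks a creation operator, e.g. 'g1+'
        "operators": [group.split() for group in NORMAL_ORDERED.findall(rest)],
    }


term = parse_term(
    "1/4 Sum(g1 g2 g3 g4 p5 h6) v(g1 g2 g3 g4) t(p5 h6) {g1+ g2+ g4 g3}{p5+ h6}"
)
print(term["factor"], term["tensors"], term["operators"])
```

Everything after this step (Wick contraction, factorization, code generation) is where the expansion to hundreds of lines of Fortran happens.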

SLIDE 17

Summary of TCE module

http://cloc.sourceforge.net v 1.53 T=30.0 s

Language      files    blank    comment    code
Fortran 77    11451    1004     115129     2824724
SUM:          11451    1004     115129     2824724

Perhaps <25 KLOC are hand-written; ∼100 KLOC is utility code following the TCE data-parallel template.
Expansion from TCE input to massively-parallel F77 is ∼200 (drops with language abstractions).

SLIDE 18

TCE template

Pseudocode for R^{a,b}_{i,j} = T^{c,d}_{i,j} ∗ V^{c,d}_{a,b}:

    for i,j in occupied blocks:
        for a,b in virtual blocks:
            for c,d in virtual blocks:
                if symmetry_criteria(i,j,a,b,c,d):
                    if dynamic_load_balancer(me):
                        Get block t(i,j,c,d) from T
                        Permute t(i,j,c,d)
                        Get block v(a,b,c,d) from V
                        Permute v(a,b,c,d)
                        r(i,j,a,b) += t(i,j,c,d) * v(a,b,c,d)
            Permute r(i,j,a,b)
            Accumulate r(i,j,a,b) block to R

SLIDE 19

TCE profile

ccsd_t2_8 (DGEMM-like):

timer           min       max       avg
dgemm           68.605    91.296    81.282
ga_acc          0.042     0.070     0.050
ga_get          5.845     7.779     6.679
nxtask          0.012     28.710    13.638
tce_sort4       6.184     8.174     7.347
tce_sortacc4    7.892     11.042    9.290
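
The table is easier to read as shares of the total. The sketch below adds up the average column, assuming (this is an assumption, not something the slide states) that the six timers are disjoint and together cover the whole kernel.

```python
# Average times copied from the profile above.
profile_avg = {
    "dgemm": 81.282,
    "ga_acc": 0.050,
    "ga_get": 6.679,
    "nxtask": 13.638,
    "tce_sort4": 7.347,
    "tce_sortacc4": 9.290,
}

total = sum(profile_avg.values())
shares = {name: t / total for name, t in profile_avg.items()}

# Print timers from largest to smallest share of the summed averages.
for name, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{name:13s} {100 * share:5.1f}%")

# Time that is neither DGEMM nor communication: permutes plus the load balancer.
overhead = shares["nxtask"] + shares["tce_sort4"] + shares["tce_sortacc4"]
print(f"permute + load-balancer share: {100 * overhead:.1f}%")
```

Under that assumption roughly two thirds of the time is DGEMM and about a quarter goes to permutes and the shared counter. The wide min/max spread of nxtask (0.012 to 28.710) also points to load imbalance.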

SLIDE 20

Observations about the TCE template

1. Blocking get means no overlap
2. Dynamic load balancing is global (shared counter)
3. Get+Permute of t(i,j,c,d) happens for all (a,b)
4. Get+Permute of v(a,b,c,d) happens for all (i,j)
5. Permute is a nasty operation (desire fused contraction).

We could apply well-known techniques to fix everything. . .
(There are an uncountable number of good programming techniques not being used in any scientific code.)

SLIDE 21

TCE Template for MMM

Pseudocode for C^i_j = A^i_k ∗ B^k_j:

    for i in I blocks:
        for j in J blocks:
            for k in K blocks:
                if dynamic_load_balancer(me):
                    Get block a(i,k) from A
                    Get block b(k,j) from B
                    c(i,j) += a(i,k) * b(k,j)
            Accumulate c(i,j) block to C

Algorithms trump tuned runtimes and libraries every time.
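
A serial sketch of this template makes the redundant communication visible. It is not NWChem code: `get_block` merely stands in for a one-sided Get, the final loop stands in for Accumulate, the load balancer is omitted, and all names are hypothetical.

```python
def split_blocks(n, bs):
    """Index ranges of consecutive blocks of size bs covering range(n)."""
    return [range(s, min(s + bs, n)) for s in range(0, n, bs)]


def get_block(M, rows, cols):
    """Stand-in for a one-sided Get: copy a block out of the 'global' array."""
    return [[M[r][c] for c in cols] for r in rows]


def template_matmul(A, B, bs):
    """Blocked C = A * B following the template; also counts block Gets."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0] * m for _ in range(n)]
    gets = 0
    for I in split_blocks(n, bs):
        for J in split_blocks(m, bs):
            c = [[0] * len(J) for _ in I]  # local output block
            for K in split_blocks(k, bs):
                a = get_block(A, I, K)
                b = get_block(B, K, J)
                gets += 2
                for x in range(len(I)):
                    for y in range(len(J)):
                        c[x][y] += sum(a[x][z] * b[z][y] for z in range(len(K)))
            for x, row in enumerate(I):  # stand-in for Accumulate
                for y, col in enumerate(J):
                    C[row][col] += c[x][y]
    return C, gets


A = [[i + j for j in range(4)] for i in range(4)]
B = [[i * j for j in range(4)] for i in range(4)]
C, gets = template_matmul(A, B, bs=2)
unique_blocks = 2 * len(split_blocks(4, 2)) ** 2  # blocks in A plus blocks in B
print(f"block Gets issued: {gets}, distinct input blocks: {unique_blocks}")
```

Every input block is fetched once per block of the index it does not carry, so the volume of Gets grows faster than the data itself. A distributed matmul algorithm is designed to avoid exactly this.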

SLIDE 22

A better way

TCE has it right, but only serially: tensor contractions are permute + matmul.
Parallel permute = parallel sorting = well-understood.
Parallel matmul = well-understood.
Therefore, parallel tensor contractions are solved, up to the implementation details and future algorithm developments in sorting and matmul.
All existing TCE technology for operation trees is still valid.
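
The "permute + matmul" claim can be checked on a toy case. The sketch below contracts R(i,j,a,b) = Σ_cd T(i,j,c,d) V(a,b,c,d) twice: once with explicit loops, and once by laying the tensors out as matrices and calling a plain matrix multiply. It is pure Python on dictionaries for clarity, with hypothetical function names. A real code would use a tuned sort and DGEMM.

```python
import itertools
import random


def matmul(A, B):
    """Plain dense matrix multiply on lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*B)] for row in A]


def contract_direct(T, V, o, v):
    """Reference: explicit six-index loop nest."""
    R = {}
    for i, j, a, b in itertools.product(range(o), range(o), range(v), range(v)):
        R[i, j, a, b] = sum(
            T[i, j, c, d] * V[a, b, c, d] for c in range(v) for d in range(v)
        )
    return R


def contract_permute_matmul(T, V, o, v):
    """Same contraction as permute (re-layout) followed by one matmul."""
    pairs_o = list(itertools.product(range(o), repeat=2))
    pairs_v = list(itertools.product(range(v), repeat=2))
    # Permute: T becomes an (ij) x (cd) matrix, V a (cd) x (ab) matrix.
    Tm = [[T[i, j, c, d] for (c, d) in pairs_v] for (i, j) in pairs_o]
    Vm = [[V[a, b, c, d] for (a, b) in pairs_v] for (c, d) in pairs_v]
    Rm = matmul(Tm, Vm)
    # Unpack the (ij) x (ab) result back into a four-index tensor.
    return {
        (i, j, a, b): Rm[row][col]
        for row, (i, j) in enumerate(pairs_o)
        for col, (a, b) in enumerate(pairs_v)
    }


random.seed(0)
o, v = 2, 3  # toy numbers of occupied and virtual orbitals
T = {idx: random.randint(-5, 5)
     for idx in itertools.product(range(o), range(o), range(v), range(v))}
V = {idx: random.randint(-5, 5) for idx in itertools.product(range(v), repeat=4)}

same = contract_direct(T, V, o, v) == contract_permute_matmul(T, V, o, v)
print("permute + matmul matches the loop nest:", same)
```

Integer test data keeps the comparison exact. With floating-point data the two orderings would agree only to rounding error.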

SLIDE 23

Cyclops Tensor Framework

Written by Edgar Solomonik (I am just a cheerleader). Very preliminary (Summer 2011) strong-scaling results:

SLIDE 24

Communication

But where's the one-sided communication?!?
Like parallel matmul and sorting, CTF does fine with MPI-1.
There are good uses of one-sided but TCE isn't one*.
* Unless matmul or sorting benefits from it.

SLIDE 25

Summary

Dense tensor contractions are dense linear algebra plus some lower-order bookkeeping.
Permutation symmetry is folded into the cyclic/elemental distribution in a load-balanced way.
Parallel dense linear algebra is a well-understood problem that is continuously studied by smart people; parallel libraries exist.
Parallel dense tensor contractions are best implemented in terms of parallel dense linear algebra and not as serial dense linear algebra directed by a locality-oblivious dynamic runtime, especially if flops are "free."
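
The load-balance point can be illustrated with the simplest symmetric object, a matrix of which only the i < j elements are stored. The sketch below counts how many unique elements each process owns on a p × p process grid under a blocked and under a cyclic distribution. It is a two-index analogy with hypothetical function names, not the distribution code of any particular library.

```python
def owner_block(i, j, n, p):
    """Blocked distribution: contiguous n/p x n/p tiles (assumes p divides n)."""
    b = n // p
    return (i // b, j // b)


def owner_cyclic(i, j, n, p):
    """Cyclic (elemental) distribution: element (i, j) goes to (i mod p, j mod p)."""
    return (i % p, j % p)


def unique_elements_per_process(owner, n, p):
    """Count stored elements (i < j only) owned by each process of the grid."""
    counts = {(r, c): 0 for r in range(p) for c in range(p)}
    for i in range(n):
        for j in range(i + 1, n):
            counts[owner(i, j, n, p)] += 1
    return counts


n, p = 16, 2
block = unique_elements_per_process(owner_block, n, p)
cyclic = unique_elements_per_process(owner_cyclic, n, p)
print("blocked:", block)
print("cyclic: ", cyclic)
```

With the blocked layout one process owns no unique data while another owns more than half. With the cyclic layout every process holds nearly the same amount.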

SLIDE 26

The future

Sparse formalisms are much more involved, but there is every reason to believe that representing them in terms of sparse sorting and sparse matmul will be reasonable, whereas the one-sided approach would only move farther and farther away from its sweet spot (bandwidth-limited bulk operations and infrequent active messages).
TCE treated the matrix-chain multiplication problem in a straightforward way, ignoring possibilities for exotic multi-level fusion.
PNNL and Argonne have efforts using task parallelism and DAG scheduling for MPPs and heterogeneous nodes, respectively, but much parallelism remains. We can learn from DAG efforts in DLA. . .
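
For readers unfamiliar with the matrix-chain problem mentioned above, this is the textbook dynamic-programming solution for ordinary matrices. It is offered only as background. It is not TCE's operation-minimization code, which works on tensor contractions, and the function name is hypothetical.

```python
from functools import lru_cache


def matrix_chain_cost(dims):
    """Minimum scalar multiplications to evaluate A_0 A_1 ... A_{n-1},
    where A_i has shape dims[i] x dims[i + 1]."""
    n = len(dims) - 1

    @lru_cache(maxsize=None)
    def best(i, j):
        # Cheapest way to multiply the sub-chain A_i ... A_j.
        if i == j:
            return 0
        return min(
            best(i, k) + best(k + 1, j) + dims[i] * dims[k + 1] * dims[j + 1]
            for k in range(i, j)
        )

    return best(0, n - 1)


# Three matrices of shapes 10x30, 30x5 and 5x60.
cheapest = matrix_chain_cost((10, 30, 5, 60))
print("cheapest parenthesization costs", cheapest, "multiplications")
```

For this example the two possible orders cost 4500 and 27000 multiplications, so the choice of evaluation order alone is worth a factor of six.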

SLIDE 27

The end of CC

SLIDE 28

A theory for hard science problems. . .

"It is our belief that CCSDTQ represents the (practical) end of advancement in coupled-cluster theory. We are unaware of any problems in quantum chemistry that are not treated satisfactorily by this level of theory." – John Stanton

\[
|\Psi_{CC}\rangle = \exp(T)\,|\Psi_{HF}\rangle, \qquad T = T_1 + T_2 + T_3 + T_4
\]

\[
|\Psi_{CCSDTQ}\rangle = \exp(T_1 + T_2 + T_3 + T_4)\,|\Psi_{HF}\rangle
\]

Writing this theory down in full detail takes longer than some simulations. . .

SLIDE 29

CCSDTQ singles

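\[
\begin{aligned}
r1^{p2}_{h1} ={}& f^{p2}_{h1} + f^{h3}_{h1} t^{p2}_{h3} + f^{p2}_{p3} t^{p3}_{h1} - t^{p3}_{h4} v^{h4p2}_{h1p3} + f^{h3}_{p4} t^{p4p2}_{h3h1}
 + \tfrac{1}{2} t^{p3p2}_{h4h5} v^{h4h5}_{h1p3} + \tfrac{1}{2} t^{p3p4}_{h5h1} v^{h5p2}_{p3p4} + \tfrac{1}{4} t^{p3p4p2}_{h5h6h1} v^{h5h6}_{p3p4} \\
& - t^{p3}_{h1} t^{p2}_{h4} f^{h4}_{p3} - t^{p2}_{h3} t^{p4}_{h5} v^{h3h5}_{h1p4} - t^{p3}_{h1} t^{p4}_{h5} v^{h5p2}_{p3p4}
 - \tfrac{1}{2} t^{p3p2}_{h4h5} t^{p6}_{h1} v^{h4h5}_{p3p6} - \tfrac{1}{2} t^{p3p4}_{h5h1} t^{p2}_{h6} v^{h5h6}_{p3p4} + t^{p3p2}_{h4h1} t \cdots \\
& - t^{p3}_{h1} t^{p2}_{h4} t^{p5}_{h6} v^{h4h6}_{p3p5}
\end{aligned}
\]

(\cdots marks a term cut off at the slide edge.)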

SLIDE 30

CCSDTQ doubles

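\[
\begin{aligned}
r2^{p3p4}_{h1h2} ={}& +v^{p3p4}_{h1h2}
 - \left(1 - P^{p4p3h1h2}_{p3p4h1h2}\right) t^{p3}_{h5} v^{h5p4}_{h1h2}
 + \left(1 - P^{p3p4h2h1}_{p3p4h1h2}\right) t^{p5}_{h2} v^{p3p4}_{h1p5}
 - \left(1 - P^{p3p4h1h2}_{p3p4h2h1}\right) f^{h5}_{h1} t^{p3p4}_{h5h2} \\
& \pm \left(1 - P^{p3p4h1h2}_{p4p3h1h2}\right) f^{p4}_{p5} t^{p5p3}_{h1h2}
 + \tfrac{1}{2} t^{p3p4}_{h5h6} v^{h5h6}_{h1h2}
 + \left(1 - P^{p4p3h2h1}_{p3p4h2h1} - P^{p3p4h2h1}_{p3p4h1h2} + P^{p4p3h2h1}_{p3p4h1h2}\right) t^{p5p3}_{h6h2} v \cdots \\
& + \tfrac{1}{2} t^{p5p6}_{h1h2} v^{p3p4}_{p5p6}
 + f^{h5}_{p6} t^{p6p3p4}_{h5h1h2}
 + \tfrac{1}{2} \left(1 - P^{p3p4h2h1}_{p3p4h1h2}\right) t^{p5p3p4}_{h6h7h2} v^{h6h7}_{h1p5}
 - \tfrac{1}{2} \left(1 - P^{p4p3h1h2}_{p3p4h1h2}\right) t^{p5p6p3}_{h7h1h2} v \cdots \\
& + \tfrac{1}{4} t^{p5p6p3p4}_{h7h8h1h2} v^{h7h8}_{p5p6}
 + t^{p3}_{h5} t^{p4}_{h6} v^{h5h6}_{h1h2}
 - \left(1 - P^{p4p3h2h1}_{p3p4h2h1} - P^{p3p4h2h1}_{p3p4h1h2} + P^{p4p3h2h1}_{p3p4h1h2}\right) t^{p5}_{h2} t^{p3}_{h6} v^{h6p4}_{h1p5} \\
& + \left(1 - P^{p3p4h2h1}_{p3p4h1h2}\right) f^{h5}_{p6} t^{p3p4}_{h5h2} t^{p6}_{h1}
 + \left(1 - P^{p4p3h1h2}_{p3p4h1h2}\right) f^{h5}_{p6} t^{p6p3}_{h1h2} t^{p4}_{h5}
 + \tfrac{1}{2} \left(1 - P^{p3p4h2h1}_{p3p4h1h2}\right) t^{p3p4}_{h5h6} t^{p7}_{h2} \cdots \\
& \pm \left(1 - P^{p4p3h2h1}_{p3p4h2h1} - P^{p3p4h2h1}_{p3p4h1h2} + P^{p4p3h2h1}_{p3p4h1h2}\right) t^{p5p3}_{h6h2} t^{p4}_{h7} v^{h6h7}_{h1p5}
 - \left(1 - P^{p3p4h2h1}_{p3p4h1h2}\right) t^{p3p4}_{h5h2} t^{p6}_{h7} v^{h5h7}_{h1p6} \\
& - \tfrac{1}{2} \left(1 - P^{p4p3h1h2}_{p3p4h1h2}\right) t^{p5p6}_{h1h2} t^{p3}_{h7} v^{h7p4}_{p5p6}
 + \left(1 - P^{p4p3h1h2}_{p3p4h1h2}\right) t^{p5p3}_{h1h2} t^{p6}_{h7} v^{h7p4}_{p5p6}
 - \tfrac{1}{2} \left(1 - P^{p3p4h2h1}_{p3p4h1h2}\right) t^{p5}_{h6} \cdots \\
& + \tfrac{1}{2} \left(1 - P^{p4p3h1h2}_{p3p4h1h2}\right) t^{p5p6p3}_{h7h1h2} t^{p4}_{h8} v^{h7h8}_{p5p6}
 + t^{p5p3p4}_{h6h1h2} t^{p7}_{h8} v^{h6h8}_{p5p7}
 + \tfrac{1}{2} \left(1 - P^{p3p4h1h2}_{p4p3h1h2}\right) t^{p5p4}_{h1h2} t^{p6p3}_{h7h8} v \cdots \\
& - \tfrac{1}{2} \left(1 - P^{p3p4h1h2}_{p3p4h2h1}\right) t^{p3p4}_{h5h1} t^{p6p7}_{h8h2} v^{h5h8}_{p6p7}
 - \left(1 - P^{p3p4h1h2}_{p4p3h1h2}\right) t^{p5p4}_{h6h1} t^{p7p3}_{h8h2} v^{h6h8}_{p5p7}
 + \left(1 - P^{p3p4h2h1}_{p3p4h1h2}\right) \cdots \\
& + \left(1 - P^{p3p4h1h2}_{p4p3h1h2} - P^{p4p3h1h2}_{p4p3h2h1} + P^{p3p4h1h2}_{p4p3h2h1}\right) t^{p5}_{h1} t^{p4}_{h6} t^{p7p3}_{h8h2} v^{h6h8}_{p5p7} \\
& + \left(1 - P^{p3p4h1h2}_{p3p4h2h1}\right) t^{p5}_{h1} t^{p6}_{h7} t^{p3p4}_{h8h2} v^{h7h8}_{p5p6}
 + \tfrac{1}{2} t^{p3}_{h5} t^{p4}_{h6} t^{p7p8}_{h1h2} v^{h5h6}_{p7p8}
 - \left(1 - P^{p3p4h1h2}_{p4p3h1h2}\right) t^{p4}_{h5} t^{p6}_{h7} t^{p8p3}_{h1h2} \cdots
\end{aligned}
\]

(\cdots marks terms cut off at the slide edge; \pm marks a sign that is not legible.)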

SLIDE 31

CCSDTQ triples

\[
\begin{aligned}
r3^{p3p4p5}_{h1h2h3} ={}& (1 - P^{p4p6p5h3h1h2}_{p4p5p6h3h1h2} - P^{p5p6p4h3h1h2}_{p5p4p6h3h1h2} - P^{p4p5p6h3h1h2}_{p4p5p6h2h1h3} + P^{p4p6p5h3h1h2}_{p4p5p6h2h1h3} + P^{p5p6p4h3h1h2}_{p5p4p6h2h1h3} - \cdots \\
& + (1 - P^{p5p4p6h2h3h1}_{p4p5p6h2h3h1} - P^{p6p4p5h2h3h1}_{p4p6p5h2h3h1} + P^{p4p5p6h3h2h1}_{p4p5p6h1h3h2} - P^{p5p4p6h3h2h1}_{p4p5p6h1h3h2} - P^{p6p4p5h3h2h1}_{p4p6p5h1h3h2} + \cdots \\
& \pm \left(1 - P^{p4p5p6h1h2h3}_{p4p5p6h2h1h3} - P^{p4p5p6h1h3h2}_{p4p5p6h3h1h2}\right) f^{h7}_{h1} t^{p4p5p6}_{h7h2h3}
 + \left(1 - P^{p5p4p6h1h2h3}_{p6p4p5h1h2h3} - P^{p4p5p6h1h2h3}_{p6p5p4h1h2h3}\right) f^{p6}_{p7} \cdots \\
& + \tfrac{1}{2} \left(1 - P^{p4p5p6h3h1h2}_{p4p5p6h2h1h3} - P^{p4p5p6h3h2h1}_{p4p5p6h1h2h3}\right) t^{p4p5p6}_{h7h8h3} v^{h7h8}_{h1h2}
 - (1 - P^{p4p6p5h2h3h1}_{p4p5p6h2h3h1} - P^{p5p6p4h2h3h1}_{p5p4p6h2h3h1} + \cdots \\
& + \tfrac{1}{2} \left(1 - P^{p5p4p6h1h2h3}_{p4p5p6h1h2h3} - P^{p6p4p5h1h2h3}_{p4p6p5h1h2h3}\right) t^{p7p8p4}_{h1h2h3} v^{p5p6}_{p7p8}
 + f^{h7}_{p8} t^{p8p4p5p6}_{h7h1h2h3} \\
& + \tfrac{1}{2} \left(1 + P^{p4p5p6h3h2h1}_{p4p5p6h1h3h2} + P^{p4p5p6h2h3h1}_{p4p5p6h1h2h3}\right) t^{p7p4p5p6}_{h8h9h2h3} v^{h8h9}_{h1p7}
 + \tfrac{1}{2} (1 - P^{p4p6p5h1h2h3}_{p4p5p6h1h2h3} - \cdots \\
& + (1 - P^{p4p6p5h3h2h1}_{p4p5p6h3h2h1} - \cdots \\
& \pm (1 - P^{p5p4p6h2h3h1}_{p4p5p6h2h3h1} - \cdots \\
& \pm \left(1 + P^{p4p5p6h3h2h1}_{p4p5p6h1h3h2} + P^{p4p5p6h2h3h1}_{p4p5p6h1h2h3}\right) f^{h7}_{p8} t^{p4p5p6}_{h7h2h3} t^{p8}_{h1}
 - \left(1 - P^{p4p6p5h1h2h3}_{p4p5p6h1h2h3} - P^{p5p6p4h1h2h3}_{p5p4p6h1h2h3}\right) \cdots \\
& + (1 - P^{p4p6p5h2h3h1}_{p4p5p6h2h3h1} - \cdots \\
& \pm \left(1 + P^{p4p5p6h3h2h1}_{p4p5p6h1h3h2} + P^{p4p5p6h2h3h1}_{p4p5p6h1h2h3}\right) t^{p4p5p6}_{h7h2h3} t^{p8}_{h9} v^{h7h9}_{h1p8}
 + (1 - P^{p4p6p5h2h3h1}_{p4p5p6h2h3h1} - P^{p5p6p4h2h3h1}_{p5p4p6h2h3h1} \cdots \\
& - \tfrac{1}{2} \left(1 - P^{p4p6p5h1h2h3}_{p4p5p6h1h2h3} - P^{p5p6p4h1h2h3}_{p5p4p6h1h2h3}\right) t^{p7p8p4p5}_{h9h1h2h3} t^{p6}_{h10} v^{h9h10}_{p7p8}
 + t^{p7p4p5p6}_{h8h1h2h3} t^{p9}_{h10} v^{h8h10}_{p7p9} + (1 - \cdots \\
& + \tfrac{1}{2} (1 - P^{p5p4p6h2h3h1}_{p6p4p5h2h3h1} - \cdots \\
& \pm (1 + P^{p4p6p5h2h3h1}_{p6p5p4h2h3h1} - \cdots
\end{aligned}
\]

(\cdots marks permutation operators and terms cut off at the slide edge; \pm marks a sign that is not legible. Most rows of this slide run off the page.)

SLIDE 32

CCSDTQ-LR quadruples

Summary:

· · · /ccsdtq_lr> wc -l ccsdtq_t[1234].out
  15 ccsdtq_t1.out
  38 ccsdtq_t2.out
  53 ccsdtq_t3.out
  74 ccsdtq_t4.out
 180 total

These 180 equations, each of which may include dozens of permutations of one contraction, are implemented as 50 KLOC of Fortran 77.

Side note: The operator algebra and code generation for CCSDTQ takes close to an hour. CCSDTQP exhausted the memory on all known machines capable of running Python.

SLIDE 33

Summary

Know:
Don't send a runtime to do an algorithm's job.
Elemental/cyclic distribution solves critical data distribution problem of symmetric tensors.
Communication-reducing algorithms for sorting and matmul can be reused for tensors.
Binary contractions are just one piece of the puzzle.

Want:
Generalized N-body requires continuous FMM for r^{−α}
Getting to science is hopeless without automation.

SLIDE 34

Acknowledgments

Berkeley: Edgar Solomonik
