
SLIDE 1

Scalable Tensor Computations with Cyclops and Faster Algorithms for Alternating Least Squares

Edgar Solomonik

Department of Computer Science
University of Illinois at Urbana-Champaign

Invited Workshop on Compiler Techniques for Sparse Tensor Algebra
Cambridge, MA

Jan 26, 2019
Compiler Techniques for Sparse Tensor Algebra Cyclops 1/9

SLIDE 2

A library for parallel tensor computations

Cyclops Tensor Framework (CTF) [1], C++ (MPI/OpenMP) ⇒ Python

distributed-memory symmetric/sparse/dense tensor objects

    Matrix<int> A(n, n, AS|SP, World(MPI_COMM_WORLD));
    Tensor<float> T(order, is_sparse, dims, syms, ring, world);
    T.read(...); T.write(...); T.slice(...); T.permute(...);

parallel contraction/summation of tensors

    Z["abij"] += V["ijab"];                       // C++
    Z.i("abij") << V.i("ijab")                    # Python
    W["mnij"] += 0.5*W["mnef"]*T["efij"];         // C++
    W.i("mnij") << 0.5*W.i("mnef")*T.i("efij")    # Python
    einsum("mnef,efij->mnij",W,T)                 # numpy-style Python
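To make the index-string notation concrete, here is a minimal pure-Python sketch of what a contraction such as `"mnef,efij->mnij"` computes. It does not use CTF: the `contract` function and the dictionary-of-tuples sparse storage are assumptions made for illustration, and the naive loop over all pairs of nonzeros is nothing like the distributed algorithms the library uses.

```python
from collections import defaultdict


def contract(spec, a, b, scale=1.0):
    """Contract two sparse tensors stored as {index_tuple: value} dicts.

    spec is an einsum-style string such as "mnef,efij->mnij": labels that
    do not appear in the output are summed over.
    """
    inputs, out = spec.replace(" ", "").split("->")
    labels_a, labels_b = inputs.split(",")
    result = defaultdict(float)
    for idx_a, val_a in a.items():
        bind_a = dict(zip(labels_a, idx_a))
        for idx_b, val_b in b.items():
            bind_b = dict(zip(labels_b, idx_b))
            # A pair of nonzeros contributes only if shared labels agree.
            if any(bind_a[l] != bind_b[l] for l in bind_b if l in bind_a):
                continue
            bind = {**bind_a, **bind_b}
            result[tuple(bind[l] for l in out)] += scale * val_a * val_b
    return dict(result)


# Hypothetical tiny inputs: W is indexed by (m, n, e, f), T by (e, f, i, j).
W = {(0, 0, 0, 1): 2.0, (0, 0, 1, 0): 4.0}
T = {(0, 1, 0, 0): 3.0, (1, 0, 0, 0): 1.0, (1, 1, 0, 0): 5.0}
Z = contract("mnef,efij->mnij", W, T, scale=0.5)
```

Note that the slide's `+=` accumulates the product into an existing tensor, whereas this sketch only returns the scaled product; adding it into the destination is left to the caller.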

∼2000 commits since 2011, open source since 2013

[1] E.S., D. Matthews, J.R. Hammond, J. Demmel, JPDC 2014

SLIDE 3

Electronic structure calculations with Cyclops

Coupled cluster engine in Aquarius (Devin Matthews)

[Figure: weak scaling on BlueGene/Q; Teraflops vs. #nodes; series: Aquarius-CTF CCSD, Aquarius-CTF CCSDT]

Cyclops works with QChem, VASP, CC4S, Psi4, and PySCF

It is also being used for other applications, e.g. by an IBM+LLNL collaboration to perform 49-qubit quantum circuit simulation [2]

[2] E. Pednault et al. arXiv:1710.05867

SLIDE 4

Sparse MP3 code

Strong and weak scaling of sparse MP3 code, with (1) dense V and T, (2) sparse V and dense T, (3) sparse V and T

[Figure: strong scaling of MP3 with no=40, nv=160; seconds/iteration vs. #cores]

[Figure: weak scaling of MP3 with no=40, nv=160; seconds/iteration vs. #cores]

Series in both plots: dense; 10%, 1%, and .1% sparse*dense; 10%, 1%, and .1% sparse*sparse

SLIDE 5

Special operator application: betweenness centrality

Betweenness centrality code snippet, for k of n nodes

    void btw_central(Matrix<int> A, Matrix<path> P, int n, int k){
      // monoid addition: keep the lighter path, sum multiplicities on ties
      Monoid<path> mon(...,
        [](path a, path b){
          if (a.w<b.w) return a;
          else if (b.w<a.w) return b;
          else return path(a.w, a.m+b.m);
        }, ...);
      Matrix<path> Q(n,k,mon); // shortest path matrix
      Q["ij"] = P["ij"];
      // extend path p by an edge of weight w
      Function<int,path> append([](int w, path p){
        return path(w+p.w, p.m);
      });
      for (int i=0; i<n; i++)
        Q["ij"] = append(A["ik"],Q["kj"]);
      ...
    }
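The path monoid in the snippet can be exercised without CTF. The sketch below is a plain-Python, single-process illustration of the same idea: `combine` mirrors the monoid addition, `append` mirrors the edge-extension function, and repeated relaxation yields shortest-path weights together with the number of shortest paths. The names `Path`, `combine`, and `shortest_paths`, the edge-dictionary input, the assumption of positive edge weights, and the choice to re-combine with the initial matrix `P` in every round are all assumptions of this sketch; the slide elides those details with `...`.

```python
from dataclasses import dataclass

INF = float("inf")


@dataclass(frozen=True)
class Path:
    w: float  # weight of the lightest path found so far
    m: int    # number of distinct paths attaining that weight


def combine(a, b):
    """Monoid addition: keep the lighter path, sum multiplicities on ties."""
    if a.w < b.w:
        return a
    if b.w < a.w:
        return b
    return Path(a.w, a.m + b.m)


def append(w, p):
    """Extend path p by one edge of weight w."""
    return Path(w + p.w, p.m)


def shortest_paths(n, edges, sources):
    """Return Q where Q[i][j] describes paths from vertex i to sources[j].

    edges maps (i, k) to the positive weight of the edge i -> k.
    """
    identity = Path(INF, 0)
    P = [[Path(0, 1) if i == s else identity for s in sources]
         for i in range(n)]
    Q = [row[:] for row in P]
    for _ in range(n - 1):
        # Q_new[i][j] = P[i][j] (+) sum over k of append(A[i][k], Q[k][j])
        new_Q = [row[:] for row in P]
        for (i, k), w in edges.items():
            for j in range(len(sources)):
                new_Q[i][j] = combine(new_Q[i][j], append(w, Q[k][j]))
        Q = new_Q
    return Q


# Hypothetical diamond graph: two equally short routes from 0 to 3.
edges = {(0, 1): 1, (0, 2): 1, (1, 3): 1, (2, 3): 1, (0, 3): 3}
Q = shortest_paths(4, edges, sources=[3])
```

Here `Q[0][0]` ends up with weight 2 and multiplicity 2, since the routes through vertices 1 and 2 tie and the direct edge of weight 3 loses. Betweenness centrality then needs a second, backward pass over these multiplicities, which this sketch does not attempt.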

SLIDE 6

Betweenness Centrality on R-MAT Graphs

[Figure, left: strong scaling of MFBC and CombBLAS for R-MAT S=22; MTEPS/node vs. #nodes; series: CTF-MFBC, CA-MFBC, and CombBLAS, each for E=128 and E=8]

[Figure, right: strong scaling of three versions of MFBC for R-MAT S=22; MTEPS/node vs. #nodes; series: adapt=sparse*sparse, dense=sparse*sparse, and dense=sparse*dense, each for E=128 and E=8]

Left plot compares different algorithms
  • with CombBLAS
  • with CA-MFBC (statically-mapped comm-efficient matrix distribution)

Right plot compares matrix representations (including push/pull)
  • adjacency matrix: sparse for all versions
  • frontier: sparse or dense rectangular matrix
  • vertices adjacent to frontier (output): sparse or dense rectangular matrix

SLIDE 7

Tensor Decomposition Algorithms

Tensor decomposition algorithms generally use a variant of gradient descent or alternating least squares (ALS)

ALS is effective for CP and Tucker as well as MPS/PEPS/DMRG
  • update each site/factor in network individually by quadratic optimization [3]
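As a small illustration of "update each factor individually by quadratic optimization", the sketch below runs ALS for the simplest possible case: a rank-1 CP model of an order-3 tensor, where each factor update has a closed form. It is plain Python on nested lists, not Cyclops code. The names `als_rank1` and `max_residual` are hypothetical, and the sketch assumes the starting vectors of ones are not orthogonal to the true factors (otherwise a factor collapses to zero and the next update divides by zero). A general rank-R update would instead solve an R × R system of normal equations per factor.

```python
def als_rank1(T, sweeps=5):
    """Fit T[i][j][k] ~ a[i] * b[j] * c[k] by alternating least squares."""
    I, J, K = len(T), len(T[0]), len(T[0][0])
    a, b, c = [1.0] * I, [1.0] * J, [1.0] * K

    def sq(v):
        return sum(x * x for x in v)

    for _ in range(sweeps):
        # Each update is the exact minimizer of the squared error in one
        # factor with the other two held fixed (a quadratic subproblem).
        a = [sum(T[i][j][k] * b[j] * c[k] for j in range(J) for k in range(K))
             / (sq(b) * sq(c)) for i in range(I)]
        b = [sum(T[i][j][k] * a[i] * c[k] for i in range(I) for k in range(K))
             / (sq(a) * sq(c)) for j in range(J)]
        c = [sum(T[i][j][k] * a[i] * b[j] for i in range(I) for j in range(J))
             / (sq(a) * sq(b)) for k in range(K)]
    return a, b, c


def max_residual(T, a, b, c):
    """Largest absolute entrywise error of the rank-1 reconstruction."""
    return max(abs(T[i][j][k] - a[i] * b[j] * c[k])
               for i in range(len(a))
               for j in range(len(b))
               for k in range(len(c)))


# Hypothetical exactly rank-1 test tensor built from known factors.
a0, b0, c0 = [1.0, 2.0], [3.0, 1.0, 2.0], [2.0, 1.0]
T = [[[x * y * z for z in c0] for y in b0] for x in a0]
a, b, c = als_rank1(T)
err = max_residual(T, a, b, c)
```

The recovered factors match `a0`, `b0`, `c0` only up to scaling, since a rank-1 term is unchanged when one factor is multiplied by a constant and another is divided by it; the reconstruction error is the meaningful check.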

[3] Holtz, Rohwedder, and Schneider SISC 2012

SLIDE 8

Accelerating Alternating Least Squares

Dimension trees amortize cost across quadratic subproblems

Pairwise perturbation (PP) approximates ALS with less cost [4], specifically for a rank-$R$ decomposition of an order-$N$ tensor of size $s \times \cdots \times s$:

            dimension tree ALS sweep    PP setup     PP approximate sweep
  CP        $4 s^N R$                   $6 s^N R$    $2 N s^2 R$
  Tucker    $4 s^N R$                   $6 s^N R$    $2 N s^2 R^{N-1}$
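To get a feel for the gap between the columns, the tabulated leading-order costs can be evaluated for concrete sizes. The helper below is a hypothetical convenience, not part of Cyclops, and it assumes the flattened slide entries read as $4 s^N R$, $6 s^N R$, $2 N s^2 R$ (CP) and $2 N s^2 R^{N-1}$ (Tucker).

```python
def als_costs(s, N, R, method="CP"):
    """Leading-order costs from the table, for an order-N tensor with all
    dimensions equal to s and decomposition rank R."""
    costs = {
        "dimension tree ALS sweep": 4 * s**N * R,
        "PP setup": 6 * s**N * R,
    }
    if method == "CP":
        costs["PP approximate sweep"] = 2 * N * s**2 * R
    elif method == "Tucker":
        costs["PP approximate sweep"] = 2 * N * s**2 * R**(N - 1)
    else:
        raise ValueError("method must be 'CP' or 'Tucker'")
    return costs


# Example sizes chosen only for illustration.
cp = als_costs(s=100, N=3, R=10, method="CP")
tucker = als_costs(s=100, N=3, R=10, method="Tucker")
```

For these sizes the CP entries evaluate to 40,000,000 for a dimension tree sweep, 60,000,000 for PP setup, and 600,000 for a PP approximate sweep. The setup step costs more than one regular sweep, so it pays off only when it is followed by several approximate sweeps.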

Cyclops-based implementation of PP shows improvements over regular dimension tree ALS for both synthetic and real-world tensors

[4] Linjian Ma and E.S. arXiv:1811.10573

SLIDE 9

Conclusion

Summary

Cyclops is a distributed-memory sparse/dense tensor library
  • has seen adoption in quantum chemistry and quantum circuit simulation
  • supports general semirings, efficient parallel graph algorithms

Pairwise perturbation is a first-order-accurate approximation to ALS
  • it is asymptotically faster in theory and 2-3X faster in practice

In-progress/future work
  • Sparse tensor completion with Cyclops using ALS/CCD/SGD
  • Perturbative ALS with low-rank updates

Acknowledgements
  • Devin Matthews (UT Austin), Jeff Hammond (Intel Corp.), Maciej Besta, Flavio Vella, Torsten Hoefler (ETH Zurich), Zecheng Zhang (UIUC), Linjian Ma, James Demmel (UC Berkeley)
  • Computational resources at NERSC, CSCS, ALCF, NCSA, and TACC
