
Scalable Tensor Computations with Cyclops and Faster Algorithms for Alternating Least Squares

Edgar Solomonik
Department of Computer Science
University of Illinois at Urbana-Champaign

Invited Workshop on Compiler Techniques for Sparse Tensor Algebra
Cambridge, MA
Jan 26, 2019

Compiler Techniques for Sparse Tensor Algebra Cyclops 1/9

A library for parallel tensor computations

Cyclops Tensor Framework (CTF) [1], C++ (MPI/OpenMP) ⇒ Python
  distributed-memory symmetric/sparse/dense tensor objects
      Matrix<int> A(n, n, AS|SP, World(MPI_COMM_WORLD));
      Tensor<float> T(order, is_sparse, dims, syms, ring, world);
      T.read(...); T.write(...); T.slice(...); T.permute(...);
  parallel contraction/summation of tensors
      Z["abij"] += V["ijab"];                       // C++
      Z.i("abij") << V.i("ijab")                    // Python
      W["mnij"] += 0.5*W["mnef"]*T["efij"];         // C++
      W.i("mnij") << 0.5*W.i("mnef")*T.i("efij")    // Python
      einsum("mnef,efij->mnij",W,T)                 // numpy-style Python
  ∼2000 commits since 2011, open source since 2013

[1] E.S., D. Matthews, J.R. Hammond, J. Demmel, JPDC 2014
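The index-string notation on this slide can be tried out on a single node with plain NumPy. The following is a minimal sketch, not CTF itself: it uses only `numpy.einsum`, and the tensor sizes (`d = 3`) and random data are assumptions chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3  # hypothetical extent of every index, chosen only for illustration

# Z["abij"] += V["ijab"]  (a summation with permuted indices)
V = rng.standard_normal((d, d, d, d))
Z = np.zeros((d, d, d, d))
Z += np.einsum("ijab->abij", V)

# W["mnij"] += 0.5*W["mnef"]*T["efij"]  (a contraction over e and f)
W = rng.standard_normal((d, d, d, d))
T = rng.standard_normal((d, d, d, d))
W_old = W.copy()
# The right-hand side is fully evaluated before the in-place addition.
W += 0.5 * np.einsum("mnef,efij->mnij", W, T)

# The same contraction written as a matrix product over the grouped index (ef).
W_check = W_old + 0.5 * (W_old.reshape(d * d, d * d) @ T.reshape(d * d, d * d)).reshape(d, d, d, d)
```

Grouping `(m,n)`, `(e,f)`, and `(i,j)` into single indices turns the contraction into an ordinary matrix product, which is why `W_check` agrees with the einsum result.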

Electronic structure calculations with Cyclops

Coupled cluster engine in Aquarius (Devin Matthews)

[Figure: Weak scaling on BlueGene/Q. Teraflops (4 to 1024) vs. #nodes (512 to 32768); series: Aquarius-CTF CCSD, Aquarius-CTF CCSDT.]

Cyclops works with QChem, VASP, CC4S, Psi4, and PySCF
It is also being used for other applications, e.g. by an IBM+LLNL collaboration to perform 49-qubit quantum circuit simulation [2]

[2] E. Pednault et al. arXiv:1710.05867

Sparse MP3 code

Strong and weak scaling of sparse MP3 code, with
  (1) dense V and T
  (2) sparse V and dense T
  (3) sparse V and T

[Figure, left: Strong scaling of MP3 with no=40, nv=160. seconds/iteration vs. #cores (24 to 768).]
[Figure, right: Weak scaling of MP3 with no=40, nv=160. seconds/iteration vs. #cores (24 to 6144).]
[Series in both panels: dense; 10%, 1%, and .1% sparse*dense; 10%, 1%, and .1% sparse*sparse.]

Special operator application: betweenness centrality

Betweenness centrality code snippet, for k of n nodes

    void btw_central(Matrix<int> A, Matrix<path> P, int n, int k){
      // Monoid on paths: keep the lighter path; on equal weight, add multiplicities
      Monoid<path> mon(...,
        [](path a, path b){
          if (a.w<b.w) return a;
          else if (b.w<a.w) return b;
          else return path(a.w, a.m+b.m);
        }, ...);
      Matrix<path> Q(n,k,mon); // shortest path matrix
      Q["ij"] = P["ij"];
      // Extend path p by an edge of weight w
      Function<int,path> append(
        [](int w, path p){
          return path(w+p.w, p.m);
        });
      for (int i=0; i<n; i++)
        Q["ij"] = append(A["ik"],Q["kj"]);
      ...
    }
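To see what the path monoid and the `append` function compute, here is a single-node Python sketch of the same idea using plain lists. It is not the CTF implementation: the names `combine` and `shortest_path_counts`, the use of `math.inf` for a missing edge, and re-seeding each step with the trivial paths `P` are all assumptions made so that the example is self-contained (the slide elides those details with `...`). Edge weights are assumed positive.

```python
import math

INF = math.inf  # assumption: a missing edge is stored as infinity

def combine(a, b):
    """Monoid on (weight, multiplicity): keep the lighter path, add counts on ties."""
    if a[0] < b[0]:
        return a
    if b[0] < a[0]:
        return b
    return (a[0], a[1] + b[1])

def append(w, p):
    """Extend path p by one edge of weight w; the multiplicity is unchanged."""
    return (w + p[0], p[1])

def shortest_path_counts(A, sources):
    """Return Q with Q[i][j] = (shortest distance, number of shortest paths)
    from vertex i to sources[j], via Bellman-Ford-style relaxation."""
    n = len(A)
    # P holds the trivial zero-weight path from each source to itself.
    P = [[(0, 1) if i == s else (INF, 0) for s in sources] for i in range(n)]
    Q = [row[:] for row in P]
    for _ in range(n - 1):
        new_Q = []
        for i in range(n):
            row = []
            for j in range(len(sources)):
                acc = P[i][j]
                for k in range(n):
                    if A[i][k] != INF:  # skip missing edges
                        acc = combine(acc, append(A[i][k], Q[k][j]))
                row.append(acc)
            new_Q.append(row)
        Q = new_Q
    return Q

# Diamond graph with unit weights: 0-1, 0-2, 1-3, 2-3.
A = [[INF, 1, 1, INF],
     [1, INF, INF, 1],
     [1, INF, INF, 1],
     [INF, 1, 1, INF]]
Q = shortest_path_counts(A, sources=[0])
```

In the diamond graph, vertex 3 reaches vertex 0 by two equally short routes, so its entry carries multiplicity 2. These path counts are the quantity that betweenness centrality scores are built from.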

Betweenness Centrality on R-MAT Graphs

[Figure, left: Strong scaling of MFBC and CombBLAS for R-MAT S=22. MTEPS/node vs. #nodes (1 to 256); series: CTF-MFBC, CA-MFBC, and CombBLAS, each for E=128 and E=8.]
[Figure, right: Strong scaling of three versions of MFBC for R-MAT S=22. MTEPS/node vs. #nodes (1 to 64); series: adapt=sparse*sparse, dense=sparse*sparse, and dense=sparse*dense, each for E=128 and E=8.]

Left plot compares CTF-MFBC and CA-MFBC (statically-mapped comm-efficient matrix distribution) with CombBLAS
Right plot compares matrix representations (including push/pull)
  adjacency matrix: sparse for all versions
  frontier: sparse or dense rectangular matrix
  vertices adjacent to frontier (output): sparse or dense rectangular matrix

Tensor Decomposition Algorithms

Tensor decomposition algorithms generally use a variant of gradient descent or alternating least squares (ALS)
ALS is effective for CP and Tucker as well as MPS/PEPS/DMRG
  update each site/factor in network individually by quadratic optimization [3]

[3] Holtz, Rohwedder, and Schneider SISC 2012
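As an illustration of updating each factor individually by quadratic optimization, the following is a minimal NumPy sketch of ALS for an order-3 CP decomposition. It is a textbook formulation under stated assumptions, not the Cyclops implementation: the function names, the sizes `s = 6` and `R = 2`, and the synthetic exact-rank tensor are all hypothetical.

```python
import numpy as np

def cp_als_sweep(T, A, B, C):
    """One ALS sweep for T[i,j,k] ~ sum_r A[i,r] * B[j,r] * C[k,r].

    Each update solves the normal equations of a linear least-squares
    problem in one factor while the other two factors are held fixed.
    """
    A = np.einsum("ijk,jr,kr->ir", T, B, C) @ np.linalg.pinv((B.T @ B) * (C.T @ C))
    B = np.einsum("ijk,ir,kr->jr", T, A, C) @ np.linalg.pinv((A.T @ A) * (C.T @ C))
    C = np.einsum("ijk,ir,jr->kr", T, A, B) @ np.linalg.pinv((A.T @ A) * (B.T @ B))
    return A, B, C

def residual(T, A, B, C):
    """Frobenius norm of T minus its CP reconstruction."""
    return np.linalg.norm(T - np.einsum("ir,jr,kr->ijk", A, B, C))

# Synthetic tensor of exact CP rank R (sizes are arbitrary illustration values).
rng = np.random.default_rng(0)
s, R = 6, 2
T = np.einsum("ir,jr,kr->ijk", *(rng.standard_normal((s, R)) for _ in range(3)))

A, B, C = (rng.standard_normal((s, R)) for _ in range(3))
initial = residual(T, A, B, C)
for _ in range(50):
    A, B, C = cp_als_sweep(T, A, B, C)
final = residual(T, A, B, C)
print(initial, final)
```

Because each factor update is an exact minimizer of its own subproblem, the residual cannot increase from one sweep to the next. The three `einsum` calls (the MTTKRP kernels) dominate the cost of a sweep.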

Accelerating Alternating Least Squares

Dimension trees amortize cost across quadratic subproblems
Pairwise perturbation (PP) approximates ALS with less cost [4], specifically for a rank R decomposition of an order N tensor with dimensions s × · · · × s

            dimension tree ALS sweep    PP setup       PP approximate sweep
  CP        $4 s^N R$                   $6 s^N R$      $2 N s^2 R$
  Tucker    $4 s^N R$                   $6 s^N R$      $2 N s^2 R^{N-1}$

Cyclops-based implementation of PP shows improvements over regular dimension tree ALS for both synthetic and real-world tensors

[4] Linjian Ma and E.S. arXiv:1811.10573
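To get a feel for the gap between the columns of the cost table, the leading-order CP costs can be evaluated for a concrete problem size. This is a back-of-the-envelope sketch only: the formulas are the CP row as read from the slide, lower-order terms are ignored, and the sizes `s = 100`, `N = 4`, `R = 10` are hypothetical.

```python
def als_sweep_cost(s, N, R):
    """Leading-order cost of one dimension tree ALS sweep (CP row): 4 s^N R."""
    return 4 * s**N * R

def pp_setup_cost(s, N, R):
    """Leading-order cost of the pairwise perturbation setup step: 6 s^N R."""
    return 6 * s**N * R

def pp_sweep_cost_cp(s, N, R):
    """Leading-order cost of one approximate PP sweep for CP: 2 N s^2 R."""
    return 2 * N * s**2 * R

s, N, R = 100, 4, 10  # hypothetical problem size
als = als_sweep_cost(s, N, R)
setup = pp_setup_cost(s, N, R)
pp = pp_sweep_cost_cp(s, N, R)
print(f"ALS sweep: {als:.1e}, PP setup: {setup:.1e}, PP sweep: {pp:.1e}")
```

Under these assumptions the setup step costs 1.5 ALS sweeps, while each approximate sweep is cheaper by several orders of magnitude, so the setup cost is recovered once a few approximate sweeps replace exact ones.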

Conclusion

Summary
  Cyclops is a distributed-memory sparse/dense tensor library
    has seen adoption in quantum chemistry and quantum circuit simulation
    supports general semirings, efficient parallel graph algorithms
  Pairwise perturbation is a first-order-accurate approximation to ALS
    it is asymptotically faster in theory and 2-3X faster in practice
In-progress/future work
  Sparse tensor completion with Cyclops using ALS/CCD/SGD
  Perturbative ALS with low-rank updates
Acknowledgements
  Devin Matthews (UT Austin), Jeff Hammond (Intel Corp.), Maciej Besta, Flavio Vella, Torsten Hoefler (ETH Zurich), Zecheng Zhang (UIUC), Linjian Ma, James Demmel (UC Berkeley)
  Computational resources at NERSC, CSCS, ALCF, NCSA, and TACC
