SPLATT: Efficient and Parallel Sparse Tensor-Matrix Multiplication - PowerPoint PPT Presentation



SLIDE 1

SPLATT Efficient and Parallel Sparse Tensor-Matrix Multiplication

Shaden Smith¹, Niranjay Ravindran, Nicholas D. Sidiropoulos, George Karypis

University of Minnesota

¹shaden@cs.umn.edu

Shaden Smith, shaden@cs.umn.edu (U. Minnesota) SPLATT 1 / 24

SLIDE 2

Tensor Introduction

Tensors are matrices extended to higher dimensions.

[Figure: a third-order tensor with user, item, and tag modes]

Example

We can model an item tagging system with a user × item × tag tensor. Very sparse!

SLIDE 3

Canonical Polyadic Decomposition (CPD)

An extension of the singular value decomposition to tensors.
Rank-F decomposition, typically with F ∼ 10.
Compute A ∈ ℝ^(I×F), B ∈ ℝ^(J×F), and C ∈ ℝ^(K×F).

[Figure: the tensor approximated by factor matrices A, B, and C]
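As a quick illustration (not from the slides), a rank-F CPD models each tensor entry as a sum of F products of factor-matrix entries; the shapes and variable names below are assumptions for the sketch:

```python
import numpy as np

# Rank-F CPD model: X_hat[i,j,k] = sum_f A[i,f] * B[j,f] * C[k,f],
# i.e. a sum of F outer products of the factors' columns.
I, J, K, F = 4, 5, 6, 3
rng = np.random.default_rng(0)
A = rng.random((I, F))
B = rng.random((J, F))
C = rng.random((K, F))

# einsum spells out the triple product over the shared rank index f.
X_hat = np.einsum('if,jf,kf->ijk', A, B, C)
```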

SLIDE 4

Khatri-Rao Product

Column-wise Kronecker product: (I×F) ⊙ (J×F) = (IJ×F)

A ⊙ B = [a1 ⊗ b1, a2 ⊗ b2, . . . , aF ⊗ bF]
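The definition above translates directly into a few lines of NumPy; this is a minimal sketch (the function name and broadcasting trick are ours, not SPLATT's API):

```python
import numpy as np

def khatri_rao(A, B):
    """Column-wise Kronecker product: (I x F) and (J x F) -> (IJ x F).

    Column f of the result is kron(A[:, f], B[:, f]).
    """
    I, F = A.shape
    J, F2 = B.shape
    assert F == F2, "A and B must have the same number of columns"
    # Broadcast A over B's rows, then flatten the (I, J) block per column.
    return (A[:, None, :] * B[None, :, :]).reshape(I * J, F)

A = np.arange(6.0).reshape(3, 2)
B = np.arange(8.0).reshape(4, 2)
KR = khatri_rao(A, B)   # shape (12, 2)
```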

SLIDE 5

CPD with Alternating Least Squares

Computing the CPD

We use alternating least squares, operating on X(1), the tensor flattened to a matrix along the first mode.

A = X(1)(C ⊙ B)(C⊺C ∗ B⊺B)⁻¹
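A dense sketch of this update, for intuition only: a real sparse code never forms X(1) or C ⊙ B explicitly (that is the point of the next slides), and the matricization convention below (column k·J + j of X(1) holds X[i,j,k]) is our assumption:

```python
import numpy as np

I, J, K, F = 4, 5, 6, 3
rng = np.random.default_rng(0)
X = rng.random((I, J, K))
B = rng.random((J, F))
C = rng.random((K, F))

# X(1): column k*J + j holds X[i, j, k]
X1 = X.transpose(0, 2, 1).reshape(I, K * J)
# C ⊙ B: row k*J + j holds C[k, :] * B[j, :]
CkB = (C[:, None, :] * B[None, :, :]).reshape(K * J, F)
gram = (C.T @ C) * (B.T @ B)            # C^T C ∗ B^T B (Hadamard product)
A = X1 @ CkB @ np.linalg.pinv(gram)     # the ALS update for A
```

Note that X1 @ CkB is exactly the MTTKRP of the next slide; the (F × F) Gram matrix is cheap by comparison.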

SLIDE 6

Matricized Tensor times Khatri-Rao Product (MTTKRP)

[Figure: X(1), an I × JK matrix, multiplied by C ⊙ B, a JK × F matrix]

MTTKRP is the bottleneck of the CPD. Explicitly forming C ⊙ B is infeasible, so we compute the product in place.

SLIDE 7

Related Work

SLIDE 8

Sparse Tensor-Vector Products

M(i, f) = Σ_j Σ_k X(i, j, k) B(j, f) C(k, f)

Tensor Toolbox

The most popular Matlab code for sparse tensor work today
MTTKRP uses nnz(X) extra space and 3F · nnz(X) FLOPs
Parallelism is difficult during the “shrinking” stage
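The element-wise formula above can be sketched as a loop over nonzeros in COO format; each nonzero costs roughly 3F FLOPs (two length-F multiplies and one length-F add). The layout and function name here are illustrative, not the Tensor Toolbox API:

```python
import numpy as np

def mttkrp_coo(inds, vals, B, C, I):
    """Element-wise sparse MTTKRP: M[i,:] += v * (B[j,:] * C[k,:])."""
    F = B.shape[1]
    M = np.zeros((I, F))
    for (i, j, k), v in zip(inds, vals):
        M[i, :] += v * (B[j, :] * C[k, :])   # ~3F FLOPs per nonzero
    return M

rng = np.random.default_rng(0)
inds = [(0, 1, 2), (1, 0, 0), (0, 1, 0)]
vals = [1.0, 2.0, 3.0]
B = rng.random((2, 3))
C = rng.random((3, 3))
M = mttkrp_coo(inds, vals, B, C, I=2)
```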

SLIDE 9

GigaTensor

[Figure: X(1) combined element-wise with stretched copies of B and C, then summed]

GigaTensor is a recent algorithm developed for Hadoop
Uses O(nnz(X)) space but 5F · nnz(X) FLOPs
Computes one column at a time

SLIDE 10

DFacTo

[Figure: an (IK × J) sparse matrix times a length-J vector, reshaped to I × K and multiplied by a length-K vector]

Two sparse matrix-vector multiplications per column
Requires an auxiliary sparse matrix with as many nonzeros as there are non-empty fibers
2F(nnz(X) + P) FLOPs, with P non-empty fibers

SLIDE 11

SPLATT The Surprisingly ParalleL spArse Tensor Toolkit

Contributions

Fast algorithm and data structure for MTTKRP
Cache-friendly tensor reordering
Cache blocking for temporal locality

SLIDE 12

SPLATT – Optimized Algorithm

M(i, f) = Σ_{k=1..K} C(k, f) · Σ_{j=1..J} X(i, j, k) B(j, f)

Equivalently, one row at a time:

M(i, :) = Σ_{k=1..K} C(k, :) ∗ Σ_{j=1..J} X(i, j, k) B(j, :)

[Figure: the tensor X with factor matrices B and C]
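The factored, row-wise formula above can be sketched as follows. We group nonzeros into mode-1 fibers X(i, :, k) on the fly with a dictionary; SPLATT itself uses a compressed fiber data structure, so this layout and the function name are assumptions for illustration:

```python
import numpy as np
from collections import defaultdict

def mttkrp_factored(inds, vals, B, C, I):
    """Row-wise MTTKRP: accumulate each fiber's sum over j, then
    scale by C(k, :) once per fiber instead of once per nonzero."""
    F = B.shape[1]
    # Group nonzeros into mode-1 fibers, keyed by (i, k).
    fibers = defaultdict(list)
    for (i, j, k), v in zip(inds, vals):
        fibers[(i, k)].append((j, v))
    M = np.zeros((I, F))
    accum = np.empty(F)                  # only F extra memory
    for (i, k), nz in fibers.items():
        accum[:] = 0.0
        for j, v in nz:                  # inner sum: sum_j X(i,j,k) B(j,:)
            accum += v * B[j, :]
        M[i, :] += C[k, :] * accum       # one Hadamard product per fiber
    return M

rng = np.random.default_rng(0)
inds = [(0, 1, 2), (1, 0, 0), (0, 1, 0), (0, 0, 2)]
vals = [1.0, 2.0, 3.0, 4.0]
B = rng.random((2, 3))
C = rng.random((3, 3))
M = mttkrp_factored(inds, vals, B, C, I=2)
```

Compared to the element-wise loop, the multiply by C moves out of the innermost loop, which is where the FLOP savings over the 3F·nnz(X) approach come from when fibers hold multiple nonzeros.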

SLIDE 13

SPLATT – Brief Analysis

We compute rows at a time instead of columns
Much better access patterns
Same complexity as DFacTo!
Only F extra memory for MTTKRP

SLIDE 14

Tensor Reordering

    3 3 2 2 1 1 1 1 2 2 3 3         3 3 3 3 2 2 2 2 1 1 1 1     We reorder the tensor to improve the access patterns of B and C

SLIDE 15

Tensor Reordering – Mode Independent

[Figure: a tripartite graph with slice vertices i1, i2, j1, j2, k1, k2 partitioned into regions α, β, γ, δ]

Graph Partitioning

We model the sparsity structure of X with a tripartite graph

◮ Slices are vertices, nonzeros connect slices with a triangle

Partitioning the graph finds regions with shared indices
We reorder the tensor to group indices in the same partition

SLIDE 16

Tensor Reordering – Mode Dependent

[Figure: a hypergraph with fiber vertices and slice hyperedges, partitioned into regions α, β, γ, δ]

Hypergraph Partitioning

Instead, create a new reordering for each mode of computation
Fibers are now the vertices and slices are hyperedges
Overheads?

SLIDE 17

Cache Blocking over Tensors

Sparsity is Hard

Tiling lets us schedule nonzeros to reuse indices already in cache
Cost: more fibers
Tensor sparsity forces us to grow tiles

SLIDE 18

Experimental Evaluation

SLIDE 19

Summary of Datasets

Dataset    I     J     K     nnz   density
NELL-2     15K   15K   30K   77M   1.3e-05
Netflix    480K  18K   2K    100M  5.4e-06
Delicious  532K  17M   2.5M  140M  6.1e-12
NELL-1     4M    4M    25M   144M  3.1e-13

SLIDE 20

Effects of Tensor Reordering

Time in seconds (speedup vs. random ordering):

Dataset    Random  Mode-Independent  Mode-Dependent
NELL-2     2.78    2.61 (1.06×)      2.60 (1.06×)
Netflix    6.02    5.26 (1.14×)      5.43 (1.10×)
Delicious  15.61   13.10 (1.19×)     12.51 (1.24×)
NELL-1     19.83   17.83 (1.11×)     17.55 (1.12×)

Small effect on serial performance
Without cache blocking, a dense fiber can hurt cache reuse

SLIDE 21

Effects of Cache Blocking

Time in seconds (speedup vs. one thread):

Thds  SPLATT        tiled         MI+tiled      MD+tiled
1     8.14 (1.0×)   8.90 (0.9×)   8.70 (1.0×)   9.18 (0.9×)
2     4.73 (1.7×)   4.88 (1.7×)   4.37 (1.9×)   4.52 (1.8×)
4     2.54 (3.2×)   2.58 (3.2×)   2.29 (3.6×)   2.35 (3.5×)
8     1.42 (5.7×)   1.41 (5.8×)   1.26 (6.5×)   1.26 (6.4×)
16    0.90 (9.0×)   0.85 (9.5×)   0.74 (11.0×)  0.75 (10.8×)

MI and MD are mode-independent and mode-dependent reorderings, respectively.
Cache blocking on its own is also not enough
MI and MD are very competitive with tiling enabled

SLIDE 22

Scaling: Average Speedup vs TVec

[Figure: average speedup over TVec (5–40×) vs. number of threads (1–16) for SPLATT, SPLATT+mem, GigaTensor, and DFacTo]

SLIDE 23

Scaling: NELL-2, Speedup vs TVec

[Figure: speedup over TVec on NELL-2 (10–90×) vs. number of threads (1–16) for SPLATT, SPLATT+mem, GigaTensor, and DFacTo]

SLIDE 24

Conclusions

Results

SPLATT uses less memory than the state of the art
Compared to DFacTo, we average 2.8× faster serially and 4.8× faster with 16 threads
How?

◮ Fast algorithm
◮ Tensor reordering
◮ Cache blocking

SPLATT

Released as a C library: cs.umn.edu/~shaden/software/
