SPLATT: Efficient and Parallel Sparse Tensor-Matrix Multiplication - PowerPoint PPT Presentation



SLIDE 1

SPLATT Efficient and Parallel Sparse Tensor-Matrix Multiplication

Shaden Smith¹, Niranjay Ravindran, Nicholas D. Sidiropoulos, George Karypis

University of Minnesota

¹shaden@cs.umn.edu

Shaden Smith, shaden@cs.umn.edu (U. Minnesota) SPLATT 1 / 24

SLIDE 2

Tensor Introduction

Tensors are matrices extended to higher dimensions.

[Figure: a third-order tensor with user, item, and tag modes]

Example

We can model an item tagging system with a user × item × tag tensor. Very sparse!

SLIDE 3

Canonical Polyadic Decomposition (CPD)

An extension of the singular value decomposition to tensors.
Rank-F decomposition, typically with F ∼ 10.
Compute A ∈ ℝ^(I×F), B ∈ ℝ^(J×F), and C ∈ ℝ^(K×F).

[Figure: the tensor approximated by factor matrices A, B, and C]
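As a quick illustration (not from the slides), a rank-F CPD models each tensor entry as a sum of F products of factor-matrix entries; the shapes and variable names below are assumptions for the sketch:

```python
import numpy as np

# Rank-F CPD model: X_hat[i,j,k] = sum_f A[i,f] * B[j,f] * C[k,f],
# i.e. a sum of F outer products of the factors' columns.
I, J, K, F = 4, 5, 6, 3
rng = np.random.default_rng(0)
A = rng.random((I, F))
B = rng.random((J, F))
C = rng.random((K, F))

# einsum spells out the triple product over the shared rank index f.
X_hat = np.einsum('if,jf,kf->ijk', A, B, C)
```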

SLIDE 4

Khatri-Rao Product

Column-wise Kronecker product: (I×F) ⊙ (J×F) = (IJ×F)

A ⊙ B = [a1 ⊗ b1, a2 ⊗ b2, . . . , aF ⊗ bF]
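The definition above translates directly into a few lines of NumPy; this is a minimal sketch (the function name and broadcasting trick are ours, not SPLATT's API):

```python
import numpy as np

def khatri_rao(A, B):
    """Column-wise Kronecker product: (I x F) and (J x F) -> (IJ x F).

    Column f of the result is kron(A[:, f], B[:, f]).
    """
    I, F = A.shape
    J, F2 = B.shape
    assert F == F2, "A and B must have the same number of columns"
    # Broadcast A over B's rows, then flatten the (I, J) block per column.
    return (A[:, None, :] * B[None, :, :]).reshape(I * J, F)

A = np.arange(6.0).reshape(3, 2)
B = np.arange(8.0).reshape(4, 2)
KR = khatri_rao(A, B)   # shape (12, 2)
```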

SLIDE 5

CPD with Alternating Least Squares

Computing the CPD

We use alternating least squares, operating on X(1), the tensor flattened to a matrix along the first mode.

A = X(1)(C ⊙ B)(C⊺C ∗ B⊺B)⁻¹
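A dense sketch of this update, for intuition only: a real sparse code never forms X(1) or C ⊙ B explicitly (that is the point of the next slides), and the matricization convention below (column k·J + j of X(1) holds X[i,j,k]) is our assumption:

```python
import numpy as np

I, J, K, F = 4, 5, 6, 3
rng = np.random.default_rng(0)
X = rng.random((I, J, K))
B = rng.random((J, F))
C = rng.random((K, F))

# X(1): column k*J + j holds X[i, j, k]
X1 = X.transpose(0, 2, 1).reshape(I, K * J)
# C ⊙ B: row k*J + j holds C[k, :] * B[j, :]
CkB = (C[:, None, :] * B[None, :, :]).reshape(K * J, F)
gram = (C.T @ C) * (B.T @ B)            # C^T C ∗ B^T B (Hadamard product)
A = X1 @ CkB @ np.linalg.pinv(gram)     # the ALS update for A
```

Note that X1 @ CkB is exactly the MTTKRP of the next slide; the (F × F) Gram matrix is cheap by comparison.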

SLIDE 6

Matricized Tensor times Khatri-Rao Product (MTTKRP)

[Figure: X(1), an I × JK matrix, multiplied by C ⊙ B, a JK × F matrix]

MTTKRP is the bottleneck of the CPD. Explicitly forming C ⊙ B is infeasible, so we compute the product in place.

SLIDE 7

Related Work

SLIDE 8

Sparse Tensor-Vector Products

M(i, f) = Σ_j Σ_k X(i, j, k) B(j, f) C(k, f)

Tensor Toolbox

The most popular Matlab code for sparse tensor work today
MTTKRP uses nnz(X) extra space and 3F · nnz(X) FLOPs
Parallelism is difficult during the “shrinking” stage
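The element-wise formula above can be sketched as a loop over nonzeros in COO format; each nonzero costs roughly 3F FLOPs (two length-F multiplies and one length-F add). The layout and function name here are illustrative, not the Tensor Toolbox API:

```python
import numpy as np

def mttkrp_coo(inds, vals, B, C, I):
    """Element-wise sparse MTTKRP: M[i,:] += v * (B[j,:] * C[k,:])."""
    F = B.shape[1]
    M = np.zeros((I, F))
    for (i, j, k), v in zip(inds, vals):
        M[i, :] += v * (B[j, :] * C[k, :])   # ~3F FLOPs per nonzero
    return M

rng = np.random.default_rng(0)
inds = [(0, 1, 2), (1, 0, 0), (0, 1, 0)]
vals = [1.0, 2.0, 3.0]
B = rng.random((2, 3))
C = rng.random((3, 3))
M = mttkrp_coo(inds, vals, B, C, I=2)
```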

SLIDE 9

GigaTensor

[Figure: X(1) combined element-wise with stretched copies of B and C, then summed]

GigaTensor is a recent algorithm developed for Hadoop
Uses O(nnz(X)) space but 5F · nnz(X) FLOPs
Computes one column at a time

SLIDE 10

DFacTo

[Figure: an (IK × J) sparse matrix times a length-J vector, reshaped to I × K and multiplied by a length-K vector]

Two sparse matrix-vector multiplications per column
Requires an auxiliary sparse matrix with as many nonzeros as there are non-empty fibers
2F(nnz(X) + P) FLOPs, with P non-empty fibers

SLIDE 11

SPLATT The Surprisingly ParalleL spArse Tensor Toolkit

Contributions

Fast algorithm and data structure for MTTKRP
Cache-friendly tensor reordering
Cache blocking for temporal locality

SLIDE 12

SPLATT – Optimized Algorithm

M(i, f) = Σ_{k=1..K} C(k, f) · Σ_{j=1..J} X(i, j, k) B(j, f)

Equivalently, one row at a time:

M(i, :) = Σ_{k=1..K} C(k, :) ∗ Σ_{j=1..J} X(i, j, k) B(j, :)

[Figure: the tensor X with factor matrices B and C]
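The factored, row-wise formula above can be sketched as follows. We group nonzeros into mode-1 fibers X(i, :, k) on the fly with a dictionary; SPLATT itself uses a compressed fiber data structure, so this layout and the function name are assumptions for illustration:

```python
import numpy as np
from collections import defaultdict

def mttkrp_factored(inds, vals, B, C, I):
    """Row-wise MTTKRP: accumulate each fiber's sum over j, then
    scale by C(k, :) once per fiber instead of once per nonzero."""
    F = B.shape[1]
    # Group nonzeros into mode-1 fibers, keyed by (i, k).
    fibers = defaultdict(list)
    for (i, j, k), v in zip(inds, vals):
        fibers[(i, k)].append((j, v))
    M = np.zeros((I, F))
    accum = np.empty(F)                  # only F extra memory
    for (i, k), nz in fibers.items():
        accum[:] = 0.0
        for j, v in nz:                  # inner sum: sum_j X(i,j,k) B(j,:)
            accum += v * B[j, :]
        M[i, :] += C[k, :] * accum       # one Hadamard product per fiber
    return M

rng = np.random.default_rng(0)
inds = [(0, 1, 2), (1, 0, 0), (0, 1, 0), (0, 0, 2)]
vals = [1.0, 2.0, 3.0, 4.0]
B = rng.random((2, 3))
C = rng.random((3, 3))
M = mttkrp_factored(inds, vals, B, C, I=2)
```

Compared to the element-wise loop, the multiply by C moves out of the innermost loop, which is where the FLOP savings over the 3F·nnz(X) approach come from when fibers hold multiple nonzeros.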

SLIDE 13

SPLATT – Brief Analysis

We compute rows at a time instead of columns
Much better access patterns
Same complexity as DFacTo!
Only F extra memory for MTTKRP

SLIDE 14

Tensor Reordering

    3 3 2 2 1 1 1 1 2 2 3 3         3 3 3 3 2 2 2 2 1 1 1 1     We reorder the tensor to improve the access patterns of B and C

SLIDE 15

Tensor Reordering – Mode Independent

[Figure: a tripartite graph with slice vertices i1, i2, j1, j2, k1, k2 partitioned into regions α, β, γ, δ]

Graph Partitioning

We model the sparsity structure of X with a tripartite graph

◮ Slices are vertices, nonzeros connect slices with a triangle

Partitioning the graph finds regions with shared indices
We reorder the tensor to group indices in the same partition

SLIDE 16

Tensor Reordering – Mode Dependent

[Figure: a hypergraph with fiber vertices and slice hyperedges, partitioned into regions α, β, γ, δ]

Hypergraph Partitioning

Instead, create a new reordering for each mode of computation
Fibers are now the vertices and slices are hyperedges
Overheads?

SLIDE 17

Cache Blocking over Tensors

Sparsity is Hard

Tiling lets us schedule nonzeros to reuse indices already in cache
Cost: more fibers
Tensor sparsity forces us to grow tiles

SLIDE 18

Experimental Evaluation

SLIDE 19

Summary of Datasets

Dataset    I     J     K     nnz   density
NELL-2     15K   15K   30K   77M   1.3e-05
Netflix    480K  18K   2K    100M  5.4e-06
Delicious  532K  17M   2.5M  140M  6.1e-12
NELL-1     4M    4M    25M   144M  3.1e-13

SLIDE 20

Effects of Tensor Reordering

Time in seconds (speedup vs. random ordering):

Dataset    Random  Mode-Independent  Mode-Dependent
NELL-2     2.78    2.61 (1.06×)      2.60 (1.06×)
Netflix    6.02    5.26 (1.14×)      5.43 (1.10×)
Delicious  15.61   13.10 (1.19×)     12.51 (1.24×)
NELL-1     19.83   17.83 (1.11×)     17.55 (1.12×)

Small effect on serial performance
Without cache blocking, a dense fiber can hurt cache reuse

SLIDE 21

Effects of Cache Blocking

Time in seconds (speedup vs. one thread):

Thds  SPLATT        tiled         MI+tiled      MD+tiled
1     8.14 (1.0×)   8.90 (0.9×)   8.70 (1.0×)   9.18 (0.9×)
2     4.73 (1.7×)   4.88 (1.7×)   4.37 (1.9×)   4.52 (1.8×)
4     2.54 (3.2×)   2.58 (3.2×)   2.29 (3.6×)   2.35 (3.5×)
8     1.42 (5.7×)   1.41 (5.8×)   1.26 (6.5×)   1.26 (6.4×)
16    0.90 (9.0×)   0.85 (9.5×)   0.74 (11.0×)  0.75 (10.8×)

MI and MD are mode-independent and mode-dependent reorderings, respectively.
Cache blocking on its own is also not enough
MI and MD are very competitive with tiling enabled

SLIDE 22

Scaling: Average Speedup vs TVec

[Figure: average speedup over TVec (5–40×) vs. number of threads (1–16) for SPLATT, SPLATT+mem, GigaTensor, and DFacTo]

SLIDE 23

Scaling: NELL-2, Speedup vs TVec

[Figure: speedup over TVec on NELL-2 (10–90×) vs. number of threads (1–16) for SPLATT, SPLATT+mem, GigaTensor, and DFacTo]

SLIDE 24

Conclusions

Results

SPLATT uses less memory than the state of the art
Compared to DFacTo, we average 2.8× faster serially and 4.8× faster with 16 threads
How?

◮ Fast algorithm
◮ Tensor reordering
◮ Cache blocking

SPLATT

Released as a C library: cs.umn.edu/~shaden/software/
