SPLATT: Efficient and Parallel Sparse Tensor-Matrix Multiplication


  1. SPLATT: Efficient and Parallel Sparse Tensor-Matrix Multiplication. Shaden Smith¹, Niranjay Ravindran, Nicholas D. Sidiropoulos, George Karypis. University of Minnesota. ¹ shaden@cs.umn.edu

  2. Tensor Introduction. Tensors are matrices extended to higher dimensions. Example: an item tagging system can be modeled as a user × item × tag tensor. Such tensors are very sparse!

  3. Canonical Polyadic Decomposition (CPD). An extension of the singular value decomposition to tensors: a rank-F decomposition with F ∼ 10. We compute factor matrices A ∈ R^(I×F), B ∈ R^(J×F), and C ∈ R^(K×F).
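As a concrete sketch of the CPD model described above (the sizes here are hypothetical, chosen only for illustration), each tensor entry is a sum over F rank-one contributions:

```python
import numpy as np

# Hypothetical small sizes; the slides assume F ~ 10 in practice.
I, J, K, F = 4, 5, 6, 3
rng = np.random.default_rng(0)
A = rng.random((I, F))   # A in R^{I x F}
B = rng.random((J, F))   # B in R^{J x F}
C = rng.random((K, F))   # C in R^{K x F}

# Rank-F CPD model: X(i,j,k) = sum_f A(i,f) * B(j,f) * C(k,f).
X = np.einsum('if,jf,kf->ijk', A, B, C)

# Spot-check one entry against the scalar definition.
i, j, k = 1, 2, 3
expected = sum(A[i, f] * B[j, f] * C[k, f] for f in range(F))
assert np.isclose(X[i, j, k], expected)
```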

  4. Khatri-Rao Product. The column-wise Kronecker product: (I × F) ⊙ (J × F) = (IJ × F), with A ⊙ B = [a_1 ⊗ b_1, a_2 ⊗ b_2, . . . , a_F ⊗ b_F].
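A minimal sketch of the column-wise Kronecker product defined above (the helper name `khatri_rao` and the matrix sizes are illustrative, not from the slides):

```python
import numpy as np

def khatri_rao(A, B):
    """Column-wise Kronecker product: (I x F) (.) (J x F) -> (IJ x F)."""
    F = A.shape[1]
    assert B.shape[1] == F, "both factors need the same number of columns"
    return np.column_stack([np.kron(A[:, f], B[:, f]) for f in range(F)])

rng = np.random.default_rng(1)
A = rng.random((3, 2))
B = rng.random((4, 2))
KR = khatri_rao(A, B)

assert KR.shape == (12, 2)
# Row i*J + j of column f holds A(i,f) * B(j,f).
i, j, f = 1, 2, 0
assert np.isclose(KR[i * 4 + j, f], A[i, f] * B[j, f])
```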

  5. CPD with Alternating Least Squares. We compute the CPD with alternating least squares, operating on X_(1), the tensor flattened to a matrix along the first mode: A = X_(1) (C ⊙ B) (CᵀC ∗ BᵀB)⁻¹, where ∗ is the Hadamard (elementwise) product. B and C are updated analogously by cycling through the modes.
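The ALS update above can be sketched on a synthetic tensor. This is an illustrative toy, not SPLATT's implementation: since X below is built to be exactly rank F, one update recovers A up to floating point. The matricization convention (column index k·J + j, matching the row order of C ⊙ B) is an assumption of this sketch:

```python
import numpy as np

rng = np.random.default_rng(2)
I, J, K, F = 4, 5, 6, 3
A = rng.random((I, F))
B = rng.random((J, F))
C = rng.random((K, F))
X = np.einsum('if,jf,kf->ijk', A, B, C)   # exactly rank F by construction

# Mode-1 matricization: X1[i, k*J + j] = X(i, j, k).
X1 = X.transpose(0, 2, 1).reshape(I, K * J)
# Khatri-Rao product C (.) B, row k*J + j holds C(k,:) * B(j,:).
CkB = np.column_stack([np.kron(C[:, f], B[:, f]) for f in range(F)])

# ALS update: A = X_(1) (C (.) B) (C^T C * B^T B)^{-1}, with * elementwise.
A_new = X1 @ CkB @ np.linalg.inv((C.T @ C) * (B.T @ B))
assert np.allclose(A_new, A)
```

The small F × F Gram matrix (CᵀC ∗ BᵀB) is why only the Khatri-Rao side is expensive; inverting an F × F matrix with F ∼ 10 is negligible.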

  6. Matricized Tensor times Khatri-Rao Product (MTTKRP). [Figure: X_(1) is I × JK and is multiplied by the JK × F matrix C ⊙ B.] MTTKRP is the bottleneck of the CPD. Explicitly forming C ⊙ B is infeasible, so we compute the product in place.
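The in-place idea can be sketched over a coordinate-format sparse tensor: the JK × F Khatri-Rao matrix is never formed, and each nonzero updates one row of the output. The tensor data below is hypothetical toy data:

```python
import numpy as np

# Toy sparse tensor as (i, j, k, value) coordinates.
nonzeros = [(0, 1, 2, 3.0), (0, 4, 2, 1.5), (1, 3, 1, 4.0), (2, 1, 0, 2.0)]
I, J, K, F = 3, 5, 3, 2
rng = np.random.default_rng(3)
B = rng.random((J, F))
C = rng.random((K, F))

# In-place MTTKRP: M(i,:) += X(i,j,k) * B(j,:) * C(k,:).
# This is roughly 3F flops per nonzero, the Tensor Toolbox cost
# quoted on a later slide.
M = np.zeros((I, F))
for i, j, k, v in nonzeros:
    M[i, :] += v * B[j, :] * C[k, :]
```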

  7. Related Work

  8. Sparse Tensor-Vector Products. The Tensor Toolbox is the popular Matlab code for sparse tensor work today; each nonzero contributes a product B(j, f) · C(k, f) · X(i, j, k). Its MTTKRP uses nnz(X) space and 3F · nnz(X) FLOPs. Parallelism is difficult during the "shrinking" stage.

  9. GigaTensor. [Figure: X_(1) combined elementwise with stretch(B) and stretch(C).] GigaTensor is a recent algorithm developed for Hadoop. It uses O(nnz(X)) space but 5F · nnz(X) FLOPs, and computes one column at a time.

  10. DFacTo. DFacTo performs two sparse matrix-vector multiplications per column. It requires an auxiliary sparse matrix with as many nonzeros as there are non-empty fibers, and uses 2F (nnz(X) + P) FLOPs, where P is the number of non-empty fibers.

  11. SPLATT: The Surprisingly ParalleL spArse Tensor Toolkit. Contributions: a fast algorithm and data structure for MTTKRP; cache-friendly tensor reordering; cache blocking for temporal locality.

  12. SPLATT - Optimized Algorithm. Entry-wise form:
    M(i, f) = Σ_{k=1}^{K} Σ_{j=1}^{J} C(k, f) X(i, j, k) B(j, f)
  Factored row-wise form:
    M(i, :) = Σ_{j=1}^{J} B(j, :) ∗ ( Σ_{k=1}^{K} X(i, j, k) C(k, :) )
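The factored row-wise form above can be sketched on a simplified slice/fiber structure. This nested-dict layout is a stand-in for illustration, not SPLATT's actual compressed data structure: slice i maps to fibers j, each holding its (k, value) nonzeros. The toy data is hypothetical:

```python
import numpy as np

# Simplified slice/fiber view: slice i -> { fiber j -> [(k, value), ...] }.
slices = {
    0: {1: [(2, 3.0)], 4: [(2, 1.5)]},
    1: {3: [(1, 4.0)]},
    2: {1: [(0, 2.0)]},
}
I, J, K, F = 3, 5, 3, 2
rng = np.random.default_rng(3)
B = rng.random((J, F))
C = rng.random((K, F))

# M(i,:) = sum_j B(j,:) * ( sum_k X(i,j,k) C(k,:) )
# The inner sum lives in a single length-F buffer -- the "only F extra
# memory" claimed on the next slide.
M = np.zeros((I, F))
acc = np.empty(F)
for i, fibers in slices.items():
    for j, nz in fibers.items():
        acc[:] = 0.0
        for k, v in nz:               # F flops per nonzero in the inner sum
            acc += v * C[k, :]
        M[i, :] += B[j, :] * acc      # plus F flops once per fiber
```

Hoisting B(j, :) out of the inner sum is what cuts the per-nonzero cost relative to the 3F-flop entry-wise loop, while still touching each nonzero exactly once.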

  13. SPLATT - Brief Analysis. We compute rows at a time instead of columns, giving much better access patterns. Same complexity as DFacTo! Only F extra memory for MTTKRP.

  14. Tensor Reordering. [Figure: an 8 × 12 matricized tensor with scattered nonzeros:
    0 3 0 3 0 0 0 0 2 0 0 2
    0 0 0 0 1 0 1 0 0 0 0 0
    0 0 0 0 1 0 1 0 2 0 0 2
    0 3 0 3 0 0 0 0 0 0 0 0
    3 3 0 0 0 0 0 0 0 0 0 0
    3 3 0 0 0 2 2 0 0 0 0 0
    0 0 0 0 0 2 2 0 0 0 1 1
    0 0 0 0 0 0 0 0 0 0 1 1 ]
  We reorder the tensor to improve the access patterns of B and C.

  15. Tensor Reordering - Mode Independent. [Figure: a small tensor with nonzeros α, β, γ, δ and its tripartite graph over slice vertices i1, i2, j1, j2, k1, k2.] Graph partitioning: we model the sparsity structure of X with a tripartite graph, where slices are vertices and each nonzero connects three slices with a triangle. Partitioning the graph finds regions with shared indices, and we reorder the tensor to group indices in the same partition.

  16. Tensor Reordering - Mode Dependent. [Figure: the same tensor modeled as a hypergraph.] Hypergraph partitioning: instead, create a new reordering for each mode of computation. Fibers are now vertices and slices are hyperedges. Overheads?

  17. Cache Blocking over Tensors. Sparsity is hard: tiling lets us schedule nonzeros to reuse indices already in cache, at the cost of more fibers. Tensor sparsity forces us to grow tiles.

  18. Experimental Evaluation

  19. Summary of Datasets

    Dataset     I      J      K      nnz    density
    NELL-2      15K    15K    30K    77M    1.3e-05
    Netflix     480K   18K    2K     100M   5.4e-06
    Delicious   532K   17M    2.5M   140M   6.1e-12
    NELL-1      4M     4M     25M    144M   3.1e-13

  20. Effects of Tensor Reordering. Time in seconds (speedup vs. random):

    Dataset     Random   Mode-Independent   Mode-Dependent
    NELL-2      2.78     2.61 (1.06x)       2.60 (1.06x)
    Netflix     6.02     5.26 (1.14x)       5.43 (1.10x)
    Delicious   15.61    13.10 (1.19x)      12.51 (1.24x)
    NELL-1      19.83    17.83 (1.11x)      17.55 (1.12x)

  Reordering has only a small effect on serial performance. Without cache blocking, a dense fiber can hurt cache reuse.

  21. Effects of Cache Blocking. Time in seconds (speedup over 1 thread):

    Thds   SPLATT         tiled          MI+tiled       MD+tiled
    1      8.14 (1.0x)    8.90 (0.9x)    8.70 (1.0x)    9.18 (0.9x)
    2      4.73 (1.7x)    4.88 (1.7x)    4.37 (1.9x)    4.52 (1.8x)
    4      2.54 (3.2x)    2.58 (3.2x)    2.29 (3.6x)    2.35 (3.5x)
    8      1.42 (5.7x)    1.41 (5.8x)    1.26 (6.5x)    1.26 (6.4x)
    16     0.90 (9.0x)    0.85 (9.5x)    0.74 (11.0x)   0.75 (10.8x)

  MI and MD are the mode-independent and mode-dependent reorderings, respectively. Cache blocking on its own is also not enough, but MI and MD are very competitive with tiling enabled.

  22. Scaling: Average Speedup vs TVec. [Figure: average speedup over TVec vs. number of threads (1-16) for SPLATT, SPLATT+mem, GigaTensor, and DFacTo; y-axis 0-40x.]

  23. Scaling: NELL-2, Speedup vs TVec. [Figure: speedup over TVec on NELL-2 vs. number of threads (1-16) for SPLATT, SPLATT+mem, GigaTensor, and DFacTo; y-axis 0-90x.]

  24. Conclusions. Results: SPLATT uses less memory than the state of the art. Compared to DFacTo, we average 2.8x faster serially and 4.8x faster with 16 threads. How? A fast algorithm, tensor reordering, and cache blocking. SPLATT is released as a C library: cs.umn.edu/~shaden/software/
