Accelerating the Tucker Decomposition with Compressed Sparse Tensors
Shaden Smith and George Karypis
Department of Computer Science & Engineering, University of Minnesota {shaden, karypis}@cs.umn.edu
Euro-Par 2017
1 / 40
Outline
◮ Tensor Background
◮ Computing the Tucker Decomposition
◮ TTMc with a Compressed Sparse Tensor
◮ Utilizing Multiple Compressed Tensors
◮ Experiments
◮ Conclusions
1 / 40
Tensors are the generalization of matrices to higher dimensions.
◮ Allow us to represent and analyze multi-dimensional data.
◮ Applications in precision healthcare, cybersecurity, recommender systems, . . .
[Figure: a third-order tensor with modes patients × procedures × diagnoses]
2 / 40
Tensor-matrix multiplication (TTM; also called the n-way product)
◮ Given: tensor X ∈ R^(I×J×K) and matrix M ∈ R^(F×K).
◮ Operation: X ×3 M.
◮ Output: Y ∈ R^(I×J×F).

Elementwise:
Y(i, j, f) = Σ_{k=1}^{K} X(i, j, k) M(f, k).
3 / 40
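To make the elementwise definition concrete, the following is a minimal C sketch of a mode-3 TTM over a tensor stored as a list of coordinates. The coo_entry struct and the dense row-major layouts are illustrative assumptions, not SPLATT's actual data structures.

```c
#include <stddef.h>

/* One non-zero of a sparse 3-mode tensor in coordinate format. */
typedef struct { size_t i, j, k; double val; } coo_entry;

/* Mode-3 TTM: Y(i, j, f) = sum_k X(i, j, k) * M(f, k).
 * M is F x K and Y is I x J x F, both dense and row-major.
 * Y must be zero-initialized by the caller. */
void ttm_mode3(const coo_entry *X, size_t nnz,
               const double *M, size_t F, size_t K,
               double *Y, size_t J)
{
  for (size_t n = 0; n < nnz; ++n) {
    const coo_entry e = X[n];
    double *yrow = Y + (e.i * J + e.j) * F;   /* points at Y(i, j, :) */
    for (size_t f = 0; f < F; ++f) {
      yrow[f] += e.val * M[f * K + e.k];      /* M(f, k) */
    }
  }
}
```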
Tensor-matrix multiplications are often performed in sequence (chained):
Y1 ← X ×2 B^T ×3 C^T

Notation: a tensor can be unfolded along one mode into matrix form, written Y(n).
◮ Mode n forms the rows and the remaining modes become the columns.
4 / 40
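For a three-mode I × J × K tensor, the unfolding is just an index map. A small C sketch, assuming one common ordering of the column modes (conventions differ between libraries, so treat this layout as an assumption):

```c
#include <stddef.h>

/* Mode-1 unfolding Y(1) of an I x J x K tensor: entry (i, j, k)
 * lands at row i, column j + J*k of the I x (J*K) matrix. */
static inline size_t unfold1_row(size_t i)                     { return i; }
static inline size_t unfold1_col(size_t j, size_t k, size_t J) { return j + J * k; }
```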
The Tucker decomposition models a tensor X as a set of orthogonal factor matrices and a core tensor.
Notation: A ∈ R^(I×F1), B ∈ R^(J×F2), and C ∈ R^(K×F3) denote the factor matrices. G ∈ R^(F1×F2×F3) denotes the core tensor.
5 / 40
The core tensor, G, can be viewed as weights for the interactions between the low-rank factor matrices. Elementwise:

X(i, j, k) ≈ Σ_{f1=1}^{F1} Σ_{f2=1}^{F2} Σ_{f3=1}^{F3} G(f1, f2, f3) A(i, f1) B(j, f2) C(k, f3)
6 / 40
Dense: data compression
◮ The Tucker decomposition has long been used to compress (dense) tensor data (think truncated SVD).
◮ Folks at Sandia have had huge successes in compressing large simulation outputs¹.

Sparse: unstructured data analysis
◮ More recently, used to discover relationships in unstructured data.
◮ The resulting tensors are sparse and high-dimensional.
◮ These large, sparse tensors are the focus of this talk.

¹ Woody Austin, Grey Ballard, and Tamara G. Kolda. "Parallel tensor compression for large-scale scientific data". In: International Parallel & Distributed Processing Symposium (IPDPS'16). IEEE, 2016, pp. 912–922.
7 / 40
Factor interpretation:
◮ Each row of a factor matrix represents an object from the corresponding mode.
◮ The ith object is a point in low-dimensional space: A(i, :).
◮ These points can be clustered, etc.

Application: citation network analysis [Kolda & Sun, ICDM '08]
◮ A citation network forms an author × conference × keyword sparse tensor.
◮ The rows of the resulting factors are clustered with k-means to reveal relationships.
  Authors: Jiawei Han, Christos Faloutsos, . . .
  Conferences: KDD, ICDM, PAKDD, . . .
  Keywords: knowledge, learning, reasoning
8 / 40
Outline: Tensor Background · Computing the Tucker Decomposition · TTMc with a Compressed Sparse Tensor · Utilizing Multiple Compressed Tensors · Experiments · Conclusions
8 / 40
The resulting optimization problem is non-convex:

minimize_{A,B,C,G}  (1/2) ‖ X − G ×1 A ×2 B ×3 C ‖_F²
subject to  A^T A = I,  B^T B = I,  C^T C = I
9 / 40
HOOI is an alternating optimization algorithm.

Tucker Decomposition with HOOI
1: while not converged do
2:   Y1 ← X ×2 B^T ×3 C^T
3:   A ← F1 leading left singular vectors of Y(1)
4:   Y2 ← X ×1 A^T ×3 C^T
5:   B ← F2 leading left singular vectors of Y(2)
6:   Y3 ← X ×1 A^T ×2 B^T
7:   C ← F3 leading left singular vectors of Y(3)
8:   G ← X ×1 A^T ×2 B^T ×3 C^T
9: end while
10 / 40
TTMc is the most expensive kernel in the HOOI algorithm: lines 2, 4, 6, and 8 of the algorithm above are all chained tensor-matrix multiplications (TTMc).
11 / 40
A first step is to optimize a single TTM kernel and apply it in sequence:
Y1 ← (X ×2 B^T) ×3 C^T

Challenge:
◮ Intermediate results become more dense after each TTM.
◮ Memory overheads depend on the sparsity pattern and factorization rank, but can be several orders of magnitude.

Tamara Kolda and Jimeng Sun. "Scalable tensor decompositions for multi-aspect data mining". In: International Conference on Data Mining (ICDM). 2008.
12 / 40
Y1 ← X ×2 B^T ×3 C^T

Solutions:
◮ Perform the TTMs one at a time, materializing each intermediate result².
  ◮ Requires multiple passes over the input tensor and many FLOPs.
◮ Compute the chained kernel elementwise, processing individual non-zeros³.
  ◮ Only a single pass over the tensor!

² Tamara Kolda and Jimeng Sun. "Scalable tensor decompositions for multi-aspect data mining". In: International Conference on Data Mining (ICDM). 2008.
³ Oguz Kaya and Bora Uçar. High-performance parallel algorithms for the Tucker decomposition of higher order sparse tensors. Tech. rep. RR-8801. Inria Research Centre Grenoble Rhône-Alpes, 2016.
13 / 40
Processing each non-zero individually has cost O(nnz(X) F2 F3) and O(F2 F3) memory overhead:

Y1(i, :, :) += X(i, j, k) [B(j, :) ◦ C(k, :)]

[Figure: the value X(i, j, k) scales the outer product of rows B(j, :) and C(k, :), updating the slice Y1(i, :, :)]

Oguz Kaya and Bora Uçar. High-performance parallel algorithms for the Tucker decomposition of higher order sparse tensors. Tech. rep. RR-8801. Inria Research Centre Grenoble Rhône-Alpes, 2016.
14 / 40
The elementwise formulation of TTMc naturally lends itself to a coordinate storage format.
15 / 40
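Here is a hedged C sketch of that coordinate-format TTMc kernel for mode 1; the doubly-nested inner loop makes the O(nnz(X) F2 F3) cost visible. The struct and layout choices are assumptions for illustration, not SPLATT's internals.

```c
#include <stddef.h>

/* One non-zero in coordinate format (as in the earlier sketch). */
typedef struct { size_t i, j, k; double val; } coo_entry;

/* Mode-1 TTMc: Y1(i, :, :) += X(i, j, k) * [B(j, :) outer C(k, :)].
 * B is J x F2, C is K x F3, Y1 is I x F2 x F3; all row-major and Y1
 * zero-initialized. The two inner loops cost F2*F3 multiply-adds per
 * non-zero, giving the O(nnz(X) F2 F3) total quoted above. */
void ttmc_mode1_coo(const coo_entry *X, size_t nnz,
                    const double *B, size_t F2,
                    const double *C, size_t F3,
                    double *Y1)
{
  for (size_t n = 0; n < nnz; ++n) {
    const coo_entry e = X[n];
    const double *brow = B + e.j * F2;
    const double *crow = C + e.k * F3;
    double *yslice = Y1 + e.i * (F2 * F3);    /* Y1(i, :, :) */
    for (size_t f2 = 0; f2 < F2; ++f2) {
      const double bv = e.val * brow[f2];
      for (size_t f3 = 0; f3 < F3; ++f3)
        yslice[f2 * F3 + f3] += bv * crow[f3];
    }
  }
}
```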
Some of the intermediate results across TTMc kernels can be reused:
Y1 ← X ×2 B^T ×3 C^T ×4 D^T
Y2 ← X ×1 A^T ×3 C^T ×4 D^T
becomes:
Z ← X ×3 C^T ×4 D^T
Y1 ← Z ×2 B^T
Y2 ← Z ×1 A^T

Muthu Baskaran et al. "Efficient and scalable computations with sparse tensors". In: High Performance Extreme Computing (HPEC). 2012.
16 / 40
State-of-the-art TTMc uses dimension trees: each node in the tree stores intermediate results from a set of modes.

Oguz Kaya and Bora Uçar. High-performance parallel algorithms for the Tucker decomposition of higher order sparse tensors. Tech. rep. RR-8801. Inria Research Centre Grenoble Rhône-Alpes, 2016.
17 / 40
Parallelism:
◮ Independent units of work within each node are identified.
◮ For flat dimension trees, this equates to parallelizing over Y1(i, :, :) slices.

Oguz Kaya and Bora Uçar. High-performance parallel algorithms for the Tucker decomposition of higher order sparse tensors. Tech. rep. RR-8801. Inria Research Centre Grenoble Rhône-Alpes, 2016.
18 / 40
Outline: Tensor Background · Computing the Tucker Decomposition · TTMc with a Compressed Sparse Tensor · Utilizing Multiple Compressed Tensors · Experiments · Conclusions
18 / 40
Existing algorithms either:
◮ have intermediate data blowup,
◮ perform many operations, or
◮ trade memory for performance (i.e., memoization).
◮ Overheads depend on the sparsity pattern and the factorization rank.
Can we accelerate TTMc without memory overheads?
19 / 40
CSF encodes a sparse tensor as a forest.
◮ Each path from root to leaf encodes a non-zero.
◮ CSF can be viewed as a generalization of CSR.
Shaden Smith and George Karypis. “Tensor-Matrix Products with a Compressed Sparse Tensor”. In: 5th Workshop on Irregular Applications: Architectures and Algorithms. 2015.
20 / 40
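As a rough illustration of the forest layout, here is a simplified CSF structure in C together with a depth-first walk in which every root-to-leaf path visits one non-zero. Field names are illustrative; SPLATT's actual layout differs in its details.

```c
#include <stddef.h>

/* A simplified CSF tensor with one level per mode (root = level 0).
 * fptr plays the role of CSR's row pointer at every level. */
typedef struct {
  size_t   nmodes;  /* N: number of modes                              */
  size_t  *nfibs;   /* nfibs[d]: number of nodes at level d            */
  size_t **fptr;    /* fptr[d][m] .. fptr[d][m+1]: children of node m  */
  size_t **fids;    /* fids[d][m]: coordinate stored at node m         */
  double  *vals;    /* one value per leaf (level N-1) node             */
} csf_tensor;

/* Depth-first walk: each root-to-leaf path is one non-zero. */
static void csf_walk(const csf_tensor *T, size_t depth, size_t node)
{
  if (depth == T->nmodes - 1) {
    /* Leaf: last coordinate T->fids[depth][node], value T->vals[node]. */
    return;
  }
  for (size_t c = T->fptr[depth][node]; c < T->fptr[depth][node + 1]; ++c)
    csf_walk(T, depth + 1, c);
}
```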
Going back to the non-zero formulation:
Y1(i, :, :) += X(i, j, k) [B(j, :) ◦ C(k, :)]
There are two arithmetic redundancies we can exploit:
21 / 40
Consider two non-zeros in the same fiber X(i, j, :):
Y1(i, :, :) += X(i, j, k1) [B(j, :) ◦ C(k1, :)]
Y1(i, :, :) += X(i, j, k2) [B(j, :) ◦ C(k2, :)]

We can factor out B(j, :):
Y1(i, :, :) += B(j, :) ◦ [X(i, j, k1) C(k1, :) + X(i, j, k2) C(k2, :)]
22 / 40
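In code, the factoring turns per-non-zero outer products into one accumulation plus one expansion per fiber. A minimal sketch, assuming row-major factors, illustrative names, and a small bound on F3 for the stack buffer:

```c
#include <stddef.h>

/* Factored update for one fiber X(i, j, :) with fiber_nnz entries,
 * column indices ks[] and values vs[]:
 *   Y1(i, :, :) += B(j, :) outer ( sum_k X(i, j, k) * C(k, :) ).
 * The accumulation costs O(fiber_nnz * F3); the outer-product
 * expansion is paid once per fiber instead of once per non-zero.
 * C is K x F3 row-major, B_j points at B(j, :), yslice at Y1(i, :, :).
 * The sketch assumes F3 <= 64. */
void fiber_update_mode1(const size_t *ks, const double *vs, size_t fiber_nnz,
                        const double *B_j, size_t F2,
                        const double *C, size_t F3,
                        double *yslice)
{
  double acc[64] = { 0.0 };                  /* sum_k X(i,j,k) C(k,:) */

  for (size_t n = 0; n < fiber_nnz; ++n)     /* pull values up */
    for (size_t f3 = 0; f3 < F3; ++f3)
      acc[f3] += vs[n] * C[ks[n] * F3 + f3];

  for (size_t f2 = 0; f2 < F2; ++f2)         /* expand once per fiber */
    for (size_t f3 = 0; f3 < F3; ++f3)
      yslice[f2 * F3 + f3] += B_j[f2] * acc[f3];
}
```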
◮ Child nodes are summed and then expanded with an outer product.
◮ Intermediate memory is kept negligible with a post-order depth-first traversal.
◮ We only need to keep a single stack of intermediate results.
23 / 40
Compared to computing with the coordinate format, the cost of each non-zero (leaf) is now linear in the rank.
24 / 40
Suppose we're now computing Y3:
Y3(:, :, k1) += X(i, j, k1) [A(i, :) ◦ B(j, :)]
Y3(:, :, k2) += X(i, j, k2) [A(i, :) ◦ B(j, :)]

The outer product can be reused:
S ← A(i, :) ◦ B(j, :)
Y3(:, :, k1) += X(i, j, k1) · S
Y3(:, :, k2) += X(i, j, k2) · S
25 / 40
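The dual optimization in code: build S = A(i, :) ◦ B(j, :) once per fiber and reuse it for every non-zero. Again a sketch with illustrative names; Y3 here is assumed to store each mode-3 slice as F1·F2 contiguous entries:

```c
#include <stddef.h>

/* Update for one fiber when computing mode 3. A_i points at A(i, :),
 * B_j at B(j, :); ks[]/vs[] hold the fiber's indices and values; S is
 * caller-provided scratch of size F1*F2. */
void fiber_update_mode3(const double *A_i, size_t F1,
                        const double *B_j, size_t F2,
                        const size_t *ks, const double *vs,
                        size_t fiber_nnz,
                        double *Y3, /* K slices of F1*F2 doubles each */
                        double *S)
{
  const size_t slice = F1 * F2;

  /* Build the shared outer product once per fiber. */
  for (size_t f1 = 0; f1 < F1; ++f1)
    for (size_t f2 = 0; f2 < F2; ++f2)
      S[f1 * F2 + f2] = A_i[f1] * B_j[f2];

  /* Each non-zero only scales and accumulates the cached S. */
  for (size_t n = 0; n < fiber_nnz; ++n) {
    double *yk = Y3 + ks[n] * slice;          /* Y3(:, :, k) flattened */
    for (size_t q = 0; q < slice; ++q)
      yk[q] += vs[n] * S[q];
  }
}
```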
◮ Each child node uses its parent in an outer product.
◮ Use a pre-order depth-first traversal to manage memory.
26 / 40
Compared to the coordinate format, outer products are constructed less often, but each non-zero still has the same asymptotic cost as in coordinate form.
27 / 40
These two optimizations can be combined within the same kernel.
◮ Traversal is still depth-first.
◮ Use a pre-order traversal for levels above the mode of interest.
◮ Use a post-order traversal for levels below the mode of interest.
28 / 40
We distribute trees to threads and use dynamic load balancing. Race conditions depend on the mode of interest:
◮ Root nodes are unique, so there are no race conditions when computing for the root mode.
◮ Otherwise, use a mutex to lock the updated slice of Y (a sketch follows below).
29 / 40
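A minimal OpenMP sketch of the mutex-per-slice scheme; SPLATT may pool a fixed number of locks rather than allocate one per slice, so treat the structure as an assumption for clarity:

```c
#include <omp.h>
#include <stdlib.h>

/* One lock per output slice of Y; a thread updating slice id takes
 * only that slice's lock, mirroring construct() in the appendix. */
omp_lock_t *make_slice_locks(size_t nslices)
{
  omp_lock_t *locks = malloc(nslices * sizeof(*locks));
  for (size_t s = 0; s < nslices; ++s)
    omp_init_lock(&locks[s]);
  return locks;
}

/* Guarded update: Yrow(:) += contrib(:). */
void update_slice(double *Yrow, const double *contrib, size_t len,
                  omp_lock_t *locks, size_t id)
{
  omp_set_lock(&locks[id]);
  for (size_t f = 0; f < len; ++f)
    Yrow[f] += contrib[f];
  omp_unset_lock(&locks[id]);
}
```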
Outline: Tensor Background · Computing the Tucker Decomposition · TTMc with a Compressed Sparse Tensor · Utilizing Multiple Compressed Tensors · Experiments · Conclusions
29 / 40
TTMc is significantly more expensive when computing for the lower-level modes.
◮ This is due to FLOPs and synchronization costs.
30 / 40
We can reorder the modes of X and store additional copies of the tensor.
◮ Selectively place modes near the top which were previously expensive.

[Figure: a single CSF is replaced by two or three CSF copies rooted at different modes]
31 / 40
Given storage for K copies of the tensor, how do we select them from the N! possible mode orderings?

Greedy, heuristic algorithm (sketched in code below):
◮ The first CSF always sorts the modes based on their lengths.
◮ For each of the K − 1 remaining CSF representations:
  ◮ Select the mode which is estimated to be the most expensive based on FLOPs.
  ◮ Place that mode at the top of the next CSF, with the remaining modes sorted by length.
  ◮ Examine the new CSF and update the best cost estimates.
32 / 40
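A structural C sketch of the greedy selection. How the per-mode FLOP estimates are computed and refreshed after each new copy is elided (marked in the comments); dims, cost, and the function names are illustrative, not SPLATT's API.

```c
#include <stddef.h>

#define MAX_MODES 8

/* Sort mode ids by increasing dimension length (insertion sort). */
static void sort_by_length(size_t *modes, size_t n, const size_t *dims)
{
  for (size_t a = 1; a < n; ++a) {
    size_t m = modes[a], b = a;
    while (b > 0 && dims[modes[b - 1]] > dims[m]) {
      modes[b] = modes[b - 1];
      --b;
    }
    modes[b] = m;
  }
}

/* Greedy selection of K mode orderings. cost[m] is a per-mode FLOP
 * estimate given the copies chosen so far. Each row of order[] is one
 * CSF permutation: a root mode followed by the remaining modes sorted
 * by length. */
void choose_csf_orderings(size_t N, size_t K, const size_t *dims,
                          double *cost, size_t order[][MAX_MODES])
{
  /* First CSF: all modes sorted by length. */
  for (size_t m = 0; m < N; ++m) order[0][m] = m;
  sort_by_length(order[0], N, dims);

  for (size_t c = 1; c < K; ++c) {
    /* Pick the mode currently estimated to be the most expensive. */
    size_t worst = 0;
    for (size_t m = 1; m < N; ++m)
      if (cost[m] > cost[worst]) worst = m;

    /* Root the new CSF at that mode; sort the rest by length. */
    order[c][0] = worst;
    size_t pos = 1;
    for (size_t m = 0; m < N; ++m)
      if (m != worst) order[c][pos++] = m;
    sort_by_length(&order[c][1], N - 1, dims);

    cost[worst] = 0.0;  /* placeholder for "update the best cost estimates" */
  }
}
```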
Outline: Tensor Background · Computing the Tucker Decomposition · TTMc with a Compressed Sparse Tensor · Utilizing Multiple Compressed Tensors · Experiments · Conclusions
32 / 40
Source code:
◮ Part of the Surprisingly ParalleL spArse Tensor Toolkit (SPLATT)⁴.
◮ Written in C and parallelized with OpenMP.
◮ Compiled with icc v16.0.3 and linked with Intel MKL.

HyperTensor:
◮ Implements dimension tree-based methods.
◮ Written in C++ and parallelized with OpenMP.

Machine specifications:
◮ 2× 12-core Intel E5-2680v3 processors (Haswell).
◮ Double-precision floats and 32-bit integers.

⁴ SPLATT: https://github.com/ShadenSmith/splatt
33 / 40
Experiments
◮ All measurements are for the sequence of TTMc kernels forming one iteration of HOOI.
◮ We fix F1 = F2 = · · · = 20.
34 / 40
Dataset        Non-zeros   Modes   Dimensions
NELL-2         77M         3       12K, 9K, 29K
Netflix        100M        3       480K, 18K, 2K
Enron          54M         4       6K, 6K, 244K, 1K
Alzheimer      6.27M       5       5, 1K, 156, 1K, 396
Poisson3D/4D   100M        3, 4    3K, . . . , 3K
K and M stand for thousand and million, respectively.
35 / 40
The greedy algorithm usually matches or gets close to the best possible CSF configuration.
CSF benefits more as dimensionality increases.

[Figure: required number of FLOPs relative to HT-FLAT (log scale) for HT-BTREE, CSF-1, CSF-2, CSF-3, and CSF-BEST on Netflix, NELL-2, Enron, Alzheimer, Poisson3D, and Poisson4D]

36 / 40
Adding CSF representations improves scalability due to fewer and smaller critical regions.
[Figure: strong-scaling speedup on 1–24 cores for HT-FLAT, HT-BTREE, CSF-1, CSF-2, and CSF-A against ideal speedup, on Netflix, NELL-2, Enron, Alzheimer, Poisson3D, and Poisson4D]
37 / 40
Selecting the number of CSFs provides tuning for memory vs. speed.
◮ CSF always provides the options for the smallest and fastest executions.
[Figure: memory (GB) vs. time (s, log scale) for HT-FLAT, HT-BTREE, CSF-1, CSF-2, and CSF-A. Annotated trade-offs per dataset: Netflix 1.8× larger, 8.8× faster; NELL-2 1.6× smaller, 14.9× faster; Enron 2.2× larger, 4.0× faster; Alzheimer 28.6× smaller, 20.7× faster; Poisson3D 1.3× smaller, 17.4× faster; Poisson4D 2.1× smaller, 1.5× faster]

38 / 40
Outline: Tensor Background · Computing the Tucker Decomposition · TTMc with a Compressed Sparse Tensor · Utilizing Multiple Compressed Tensors · Experiments · Conclusions
38 / 40
Contributions:
◮ We optimized TTMc kernels via a compressed tensor representation.
◮ CSF naturally exposes arithmetic redundancies in TTMc.
◮ Multiple CSF tensors can further accelerate computation.
◮ Up to 20× speedup over the state-of-the-art while using 28× less memory.
◮ Choosing the number of data copies offers a tunable computation/memory tradeoff.
39 / 40
Future work:
◮ Alternative decompositions to reduce synchronization costs.
◮ Memoization is also applicable to the CSF formulation! (Li et al., IPDPS '17)
39 / 40
All of our work is open source (to be updated soon): https://github.com/ShadenSmith/splatt

Datasets are freely available: http://frostt.io/
40 / 40
The CPD models a tensor as the summation of F rank-one tensors:

X ≈ Σ_{f=1}^{F} A(:, f) ◦ B(:, f) ◦ C(:, f)

fit by minimizing the Frobenius-norm loss:

minimize_{A,B,C}  L(X, A, B, C) = ‖ X − Σ_{f=1}^{F} A(:, f) ◦ B(:, f) ◦ C(:, f) ‖_F

Notation: A ∈ R^(I×F), B ∈ R^(J×F), and C ∈ R^(K×F) denote the factor matrices for a 3D tensor.
40 / 40
Algorithm 1 TTMc()
1: function TTMc(X, mode)
2:   for i1 = 1, . . . , I1 in parallel do
3:     construct(X(i1, :, . . . , :), mode, 1)
4:   end for
5: end function
40 / 40
Algorithm 2 construct()
⊲ Construct Kronecker products and push them down to level mode − 1.
1: function construct(node, mode, above)
2:   d ← level(node)          ⊲ The level in the tree (i.e., distance from the root).
3:   id ← node_id(node)       ⊲ The partial coordinate of a non-zero.
4:   if d < mode then
5:     above ← above ⊗ A(d)(id, :)
6:     for c ∈ children(node) do
7:       construct(c, mode, above)
8:     end for
9:   else if d = mode then
10:    below ← Σ_{c ∈ children(node)} accumulate(c)
11:    Lock mutex id.
12:    Y(d)(id, :) ← Y(d)(id, :) + (above ⊗ below)   ⊲ Update Y(d).
13:    Unlock mutex id.
14:  end if
15: end function
40 / 40
Algorithm 3 accumulate()
⊲ Pull Kronecker products up from the leaf nodes.
1: function accumulate(node)
2:   id ← node_id(node)
3:   if level(node) = N then
4:     return X(i1, . . . , id) · A(N)(id, :)
5:   else
6:     return A(d)(id, :) ⊗ Σ_{c ∈ children(node)} accumulate(c)   ⊲ d = level(node)
7:   end if
8: end function
40 / 40