SLIDE 1
On Optimizing Distributed Tucker Decomposition for Sparse Tensors
Venkatesan Chakaravarthy, Jee W. Choi, Douglas Joseph, Prakash Murali, Shivmaran S. Pandian, Yogish Sabharwal, and Dheeraj Sreedhar
IBM Research
SLIDE 2
SLIDE 3
Singular Value Decomposition
Tucker Decomposition = Higher Order SVD
SVD of a matrix M: M ≈ U Σ V^T, with Σ = diag(σ1, σ2, ..., σs)
- Columns of U: left singular vectors
- Columns of V: right singular vectors
- σ1, ..., σs: singular values
Applications
- PCA: Analyze different dimensions
- Text analytics
- Computer vision
- Signal processing
[Figure: Tucker decomposition of an L1 × L2 × L3 tensor into a K1 × K2 × K3 core tensor multiplied by factor matrices, one per mode]
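A minimal numpy sketch of the truncated SVD building block described on this slide (the matrix sizes and the target rank s are arbitrary illustration values, not from the talk):

```python
import numpy as np

# Truncated SVD: keep only the s leading singular triplets.
M = np.random.rand(50, 30)
U, sigma, Vt = np.linalg.svd(M, full_matrices=False)

s = 5
M_approx = U[:, :s] @ np.diag(sigma[:s]) @ Vt[:s, :]   # M ≈ U Σ V^T
# U[:, :s]: leading left singular vectors; Vt[:s, :]: leading right singular vectors
```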
SLIDE 4
Higher Order Orthogonal Iteration (HOOI)
[Figure: the tensor T is decomposed with factor matrices A, B, C; HOOI improves them to A-new, B-new, C-new]
HOOI
- Refinement: improves the accuracy of the current decomposition
- Applied repeatedly to obtain increasing accuracy
- Initial decomposition: obtained via HOSVD, STHOSVD, or random matrices
SLIDE 5
Prior Work & Objective
- Dense tensors [BK’07, ABK’16, CCJ+17, JPK’15]
- Sequential, shared memory, and distributed implementations
- Sparse tensors [KS’08, BMV+12, SK’16, KU’16]
- Sequential, shared memory, and distributed implementations
- [KU’16]: first distributed implementation for sparse tensors
Our objective
- Efficient distributed implementation for sparse tensors
- Builds on work of [KU’16]
[KS’08] T. Kolda and J. Sun. Scalable tensor decompositions for multi-aspect data mining. In ICDM, 2008.
[BMV+12] M. Baskaran, B. Meister, N. Vasilache, and R. Lethin. Efficient and scalable computations with sparse tensors. In HPEC, 2012.
[SK’16] S. Smith and G. Karypis. Accelerating the Tucker Decomposition with Compressed Sparse Tensors. In Euro-Par, 2017.
[KU’16] O. Kaya and B. Uçar. High performance parallel algorithms for the Tucker decomposition of sparse tensors. In ICPP, 2016.
SLIDE 6
HOOI – Outline
Alternating least squares
- Fix B and C. Find the best A.
- Fix A and C. Find the best B.
- Fix A and B. Find the best C.
[Figure: one HOOI sweep]
- Mode 1: TTM of T with B^T and C^T, matricize to M1; SVD of M1 gives A-new (K1 leading left singular vectors)
- Mode 2: TTM of T with A^T and C^T, matricize to M2; SVD of M2 gives B-new (K2 leading left singular vectors)
- Mode 3: TTM of T with A^T and B^T, matricize to M3; SVD of M3 gives C-new (K3 leading left singular vectors)
TTM – Tensor times matrix multiplications
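A hedged numpy reference for one ALS sweep on a small dense 3-mode tensor, following the outline above (the talk's setting is sparse and distributed; `hooi_sweep` is an illustrative name, not the paper's code):

```python
import numpy as np

# One HOOI/ALS sweep for a dense 3-mode tensor T with factor matrices A, B, C.
def hooi_sweep(T, A, B, C, K1, K2, K3):
    # Mode 1: fix B and C, recompute A
    M1 = np.einsum('ijk,jp,kq->ipq', T, B, C).reshape(T.shape[0], -1)
    A = np.linalg.svd(M1, full_matrices=False)[0][:, :K1]   # K1 leading left singular vectors
    # Mode 2: fix A and C, recompute B
    M2 = np.einsum('ijk,ip,kq->jpq', T, A, C).reshape(T.shape[1], -1)
    B = np.linalg.svd(M2, full_matrices=False)[0][:, :K2]
    # Mode 3: fix A and B, recompute C
    M3 = np.einsum('ijk,ip,jq->kpq', T, A, B).reshape(T.shape[2], -1)
    C = np.linalg.svd(M3, full_matrices=False)[0][:, :K3]
    return A, B, C
```

The refinement loop from the earlier slide would call this sweep repeatedly, starting from HOSVD, STHOSVD, or random orthonormal factor matrices.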
SLIDE 7
Sparse HOOI : Distribution Schemes and Performance Parameters
Coordinate representation (8 nonzero elements, distributed over Proc 1, Proc 2, Proc 3):
e1: (1, 2, 1), 0.1   e2: (2, 3, 2), 1.2   e3: (1, 1, 2), 3.1   e4: (3, 2, 1), 0.4
e5: (3, 1, 2), 1.1   e6: (1, 3, 2), 1.4   e7: (2, 2, 2), 0.5   e8: (3, 3, 1), 0.7
- TTM Component
- Computation only
- All schemes have same computational load (FLOPs)
- Load balance
- SVD Component
- Both computation and communication
- Computational load
- Load balance
- Communication volume
- Factor Matrix transfer (FM)
- Communication only
- At the end of each HOOI invocation, factor matrix rows need to be communicated among processors for the next invocation
- Communication volume
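A small sketch of the coordinate representation above, with one possible assignment of the elements to three processors (the grouping shown is the one used in the later TTM slides; here it is only illustrative):

```python
# Coordinate (COO) representation: each nonzero is ((i, j, k), value).
tensor_elements = {
    "e1": ((1, 2, 1), 0.1), "e2": ((2, 3, 2), 1.2), "e3": ((1, 1, 2), 3.1),
    "e4": ((3, 2, 1), 0.4), "e5": ((3, 1, 2), 1.1), "e6": ((1, 3, 2), 1.4),
    "e7": ((2, 2, 2), 0.5), "e8": ((3, 3, 1), 0.7),
}

# A distribution scheme assigns each element to a processor; the TTM, SVD, and
# factor-matrix-transfer costs all depend on this assignment.
distribution = {
    "Proc 1": ["e1", "e2", "e3"],
    "Proc 2": ["e4", "e5", "e6"],
    "Proc 3": ["e7", "e8"],
}
```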
SLIDE 8
Prior Schemes
- CoarseG - coarse-grained scheme [KU’16]
- Allocate entire “slices” to processors
- MediumG - medium-grained scheme [SK’16]
- Grid-based partitioning, similar to block partitioning of matrices
- FineG - fine-grained scheme [KU’16]
- Allocate individual elements using hypergraph partitioning methods
Scheme    TTM          SVD          FM           Dist. time
CoarseG   Inefficient  Efficient    Inefficient  Fast
MediumG   Efficient    Inefficient  Efficient    Fast
FineG     Efficient    Inefficient  Efficient    Slow
- Distribution time
- CoarseG, MediumG – greedy, fast procedures
- FineG – Complex, slow procedure based on sophisticated hypergraph partitioning methods
SLIDE 9
Our Contributions
- We identify certain fundamental metrics
- Determine TTM load balance, SVD load and load balance, SVD volume
- Design a new distribution scheme denoted Lite
Lite
- Near-optimal on the fundamental metrics:
- TTM load balance
- SVD load and load balance
- SVD communication volume
- Lightweight procedure with fast distribution time
- Performance gain up to 3x
- Only parameter not optimized is FM volume
- Computation time dominates
- So, Lite performs better overall
Scheme    TTM           SVD           FM           Dist. time
CoarseG   Inefficient   Efficient     Inefficient  Fast
MediumG   Efficient     Inefficient   Efficient    Fast
FineG     Efficient     Inefficient   Efficient    Slow
Lite      Near-optimal  Near-optimal  Inefficient  Fast
SLIDE 10
Mode 1 – Sequential TTM
[Figure: mode-1 step. The TTM of T with B^T and C^T, followed by matricization, yields the penultimate matrix M of size L1 × (K2·K3); the SVD of M gives A-new (K1 leading left singular vectors). Elements are grouped by their mode-1 coordinate, each group mapping to one row of M.]
- Each element's contribution is a Kronecker product involving rows of the factor matrices
- Slice: elements with the same mode-1 coordinate (same color)
- Each slice updates the corresponding row of the penultimate matrix
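A sketch of the per-element update described above, with small made-up sizes and 0-based coordinates (this mirrors the slide's picture, not the paper's optimized kernel):

```python
import numpy as np

# Mode-1 TTM on a COO tensor: each nonzero (i, j, k) with value v adds
# v * kron(B[j, :], C[k, :]) to row i of the penultimate matrix M, so every
# element of slice i (same mode-1 coordinate) accumulates into the same row.
L1, L2, L3 = 3, 3, 3          # illustrative mode lengths
K2, K3 = 2, 2                 # illustrative ranks for modes 2 and 3
B = np.random.rand(L2, K2)    # mode-2 factor matrix
C = np.random.rand(L3, K3)    # mode-3 factor matrix

elements = [((0, 1, 0), 0.1), ((1, 2, 1), 1.2), ((0, 0, 1), 3.1)]  # 0-based COO

M = np.zeros((L1, K2 * K3))   # penultimate matrix for mode 1
for (i, j, k), v in elements:
    M[i, :] += v * np.kron(B[j, :], C[k, :])
# The K1 leading left singular vectors of M form the new mode-1 factor A.
```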
SLIDE 11
Distributed TTM
[Figure: Proc 1 holds e1, e2, e3; Proc 2 holds e4, e5, e6; Proc 3 holds e7, e8. Each processor accumulates its elements' contributions into a local copy of the penultimate matrix M.]
- The penultimate matrix is sum-distributed: the logical M is the sum of the processors' local copies
SLIDE 12
SVD via Lanczos Method
- Lanczos method provides a vector xin. Return xout = Z xin.
- Done implicitly over the sum-distributed matrix
[Figure: the product xout = Z xin is evaluated implicitly. Each processor applies its local copy of the sum-distributed penultimate matrix to xin; the partial results are then summed at the owner processors to form xout. The TTM computation and the SVD computation reuse the same distribution of elements.]
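A simplified mpi4py sketch of the implicit product, assuming Z = M Mᵀ (the Gram matrix whose leading eigenvectors give the left singular vectors) and a plain allreduce in place of the owner-based combining shown in the figure:

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD

def apply_Z(M_local, x_in):
    """Implicit x_out = M M^T x_in, where M = sum over ranks of M_local."""
    y_local = M_local.T @ x_in                    # local partial M_p^T x_in
    y = np.empty_like(y_local)
    comm.Allreduce(y_local, y, op=MPI.SUM)        # y = M^T x_in
    out_local = M_local @ y                       # local partial M_p y
    x_out = np.empty_like(out_local)
    comm.Allreduce(out_local, x_out, op=MPI.SUM)  # x_out = M M^T x_in
    return x_out
```

Each call is a TTM-style local computation followed by a reduction; the owner-based scheme in the figure avoids the full allreduce by summing only at the rows' owners.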
SLIDE 13
Performance Metrics Along Each Mode
TTM
- TTM-Limb
- Maximum number of elements assigned to any processor
- Optimal value: E / P (number of elements / number of processors)
SVD
- SVD-Redundancy
- Total number of times slices are shared
- Measures SVD computational load and communication volume
- Optimal value = L (length along the mode)
- SVD-Limb
- Maximum number of slices shared by any processor
- Optimal value = L / P
Factor Matrix Transfer
- Communication volume at each iteration
[Figure: same distribution as before: Proc 1 holds e1, e2, e3; Proc 2 holds e4, e5, e6; Proc 3 holds e7, e8; each processor keeps a local copy of the factor matrix rows it needs.]
- For this example (mode 1): TTM-Limb = 3 (optimal), SVD-Red = 6 (optimal = 3), SVD-Limb = 2 (optimal = 1)
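A sketch of how the three per-mode metrics could be computed from an element-to-processor assignment, under the definitions above (the helper name and data layout are made up for illustration):

```python
from collections import defaultdict

def mode_metrics(elements, assignment, mode):
    """elements: list of coordinate tuples; assignment: processor of each element."""
    elems_per_proc = defaultdict(int)     # for TTM-Limb
    slice_procs = defaultdict(set)        # slice coordinate -> processors touching it
    for coords, proc in zip(elements, assignment):
        elems_per_proc[proc] += 1
        slice_procs[coords[mode]].add(proc)

    ttm_limb = max(elems_per_proc.values())                     # optimal: E / P
    svd_redundancy = sum(len(p) for p in slice_procs.values())  # optimal: L
    slices_per_proc = defaultdict(int)
    for procs in slice_procs.values():
        for proc in procs:
            slices_per_proc[proc] += 1
    svd_limb = max(slices_per_proc.values())                    # optimal: L / P
    return ttm_limb, svd_redundancy, svd_limb

# The example distribution from the figure (coordinates of e1..e8, mode 1):
elements = [(1, 2, 1), (2, 3, 2), (1, 1, 2), (3, 2, 1),
            (3, 1, 2), (1, 3, 2), (2, 2, 2), (3, 3, 1)]
assignment = [1, 1, 1, 2, 2, 2, 3, 3]               # processor of e1..e8
print(mode_metrics(elements, assignment, mode=0))   # -> (3, 6, 2)
```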
SLIDE 14
Prior Schemes
- Uni-policy Schemes
- A single policy for computation along all modes
- Only a single copy of the tensor needs to be stored
- Smaller memory footprint, but fewer optimization opportunities
- Multi-policy Schemes
- Independent distribution for each mode
- N copies need to be stored, one along each mode
- Larger memory footprint, but more optimization opportunities
- Coarse-grained Scheme [Kaya-Ucar’16]
- Allocate each slice in entirety to a processor. No slice-sharing
- Medium-grained Scheme [Smith et al. ’16]
- Grid partitioning, originally proposed in the context of CP decomposition
- Hyper-graph Scheme (Fine-grained) [Kaya-Ucar ‘16]
- All distribution costs captured via a hypergraph
- Find min-cut partitioning
SLIDE 15
Our Scheme - Lite
SLIDE 16
Lite - Results
- TTM-Limb = E / P (optimal)
- SVD-Redundancy = L + P (optimal = L)
- SVD-Limb = L/P + 2 (optimal = L/P)
- Near-optimal
- TTM-Limb
- SVD computational load, load balance
- SVD communication volume
- Only issue is high factor matrix transfer volume
- Computation dominates, so this is not a significant issue
                   CoarseG         MediumG         FineG           Lite
Type               Multi-policy    Uni-policy      Uni-policy      Multi-policy
Distribution time  Greedy (fast)   Greedy (fast)   Complex (slow)  Greedy (fast)
TTM-Limb           High            Reasonable      Reasonable      Optimal
SVD-Redundancy     Optimal         High            Reasonable      Optimal
SVD-Limb           Reasonable      Reasonable      High            Optimal
SVD-Volume         Optimal         Reasonable      Reasonable      Optimal
FM-Volume          High            Reasonable      Reasonable      High
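A quick numeric illustration of why the L + P and L/P + 2 bounds are near-optimal in practice (E, L, and P here are assumed, plausible values, not figures from the evaluation):

```python
# Hypothetical sizes: E nonzeros, mode length L, P processors.
E, L, P = 100_000_000, 1_000_000, 512

print(E // P)              # TTM-Limb achieved by Lite: the optimum E / P
print(L + P, L)            # SVD-Redundancy: L + P vs. the optimum L (extra P is ~0.05% here)
print(L // P + 2, L // P)  # SVD-Limb: L/P + 2 vs. the optimum L/P
```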
SLIDE 17
Experimental Evaluation
- R92 cluster – 2 to 32 nodes.
- 16 MPI ranks per node, each mapped to a core. So, 32 to 512 MPI ranks
- Datasets: FROSTT repository
SLIDE 18
HOOI Time – 512 ranks, K = 10
Gain of Lite over prior schemes:
- CoarseG – 12x
- MediumG – 4.5x
- HyperG – 4x
- Best Prior – 3x
SLIDE 19
Flickr Time Breakdown - 512 ranks, K = 10
SLIDE 20
Performance Metrics
[Charts: TTM load imbalance, SVD load, SVD load imbalance]
SLIDE 21
Strong Scaling: Speedup from 32 to 512 ranks
SLIDE 22
Tensor Distribution Time – 512 ranks
SLIDE 23
Conclusions and Future Work
Lite
- Near-optimal on TTM load imbalance, SVD load, load balance, and communication volume
- Fast distribution time
- Good scalability
- Higher communication volume, but computation time dominates.
Future Work
- Apply ideas from our work to shared memory setting
- Apply ideas from shared memory setting to our implementation
- Compressed sparse fiber representation [SK’16]