Accelerating the Tucker Decomposition with Compressed Sparse Tensors


  1. Accelerating the Tucker Decomposition with Compressed Sparse Tensors. Shaden Smith and George Karypis, Department of Computer Science & Engineering, University of Minnesota, {shaden, karypis}@cs.umn.edu. Euro-Par 2017.

  2. Outline: Tensor Background; Computing the Tucker Decomposition; TTMc with a Compressed Sparse Tensor; Utilizing Multiple Compressed Tensors; Experiments; Conclusions.

  3. Table of Contents: Tensor Background; Computing the Tucker Decomposition; TTMc with a Compressed Sparse Tensor; Utilizing Multiple Compressed Tensors; Experiments; Conclusions.

  4. Tensors: Tensors are the generalization of matrices to higher dimensions. ◮ They allow us to represent and analyze multi-dimensional data. ◮ Applications in precision healthcare, cybersecurity, recommender systems, . . . [Figure: a patients × diagnoses × procedures tensor.]

  5. Essential operation: tensor-matrix multiplication (TTM; also called the n-way product). ◮ Given: tensor X ∈ R^{I×J×K} and matrix M ∈ R^{F×K}. ◮ Operation: X ×_3 M. ◮ Output: Y ∈ R^{I×J×F}. Elementwise: Y(i, j, f) = Σ_{k=1}^{K} X(i, j, k) M(f, k).
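To make the elementwise definition concrete, here is a minimal dense NumPy sketch of the mode-3 product; the function name ttm_mode3 and the einsum spelling are illustrative, not code from the talk.

```python
import numpy as np

def ttm_mode3(X, M):
    """Mode-3 TTM: Y(i, j, f) = sum_k X(i, j, k) * M(f, k).

    X has shape (I, J, K), M has shape (F, K), and Y has shape (I, J, F).
    """
    return np.einsum('ijk,fk->ijf', X, M)

# Tiny usage example with random data.
I, J, K, F = 4, 5, 6, 3
X = np.random.rand(I, J, K)
M = np.random.rand(F, K)
Y = ttm_mode3(X, M)
assert Y.shape == (I, J, F)
```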

  6. Chained tensor-matrix multiplication (TTMc): Tensor-matrix multiplications are often performed in sequence (chained): Y1 ← X ×_2 B^T ×_3 C^T. Notation: tensors can be unfolded along one mode to matrix form, written Y_(n). ◮ Mode n forms the rows and the remaining modes become the columns.
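A dense sketch of the chained product and the mode-1 unfolding, assuming B ∈ R^{J×F2} and C ∈ R^{K×F3} as on the following slides; the helper names are made up for illustration.

```python
import numpy as np

def ttmc_mode1(X, B, C):
    """Chained TTM for mode 1: Y1 = X x_2 B^T x_3 C^T.

    X: (I, J, K), B: (J, F2), C: (K, F3)  ->  Y1: (I, F2, F3).
    """
    return np.einsum('ijk,jb,kc->ibc', X, B, C)

def unfold(Y, mode):
    """Matricize Y along `mode`: that mode forms the rows and the
    remaining modes (in order) form the columns."""
    return np.moveaxis(Y, mode, 0).reshape(Y.shape[mode], -1)

I, J, K, F2, F3 = 4, 5, 6, 2, 3
X = np.random.rand(I, J, K)
B = np.random.rand(J, F2)
C = np.random.rand(K, F3)
Y1 = ttmc_mode1(X, B, C)   # shape (I, F2, F3)
Y1_mat = unfold(Y1, 0)     # Y_(1), shape (I, F2 * F3)
```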

  7. Tucker decomposition: The Tucker decomposition models a tensor X as a set of orthogonal factor matrices and a core tensor. Notation: A ∈ R^{I×F1}, B ∈ R^{J×F2}, and C ∈ R^{K×F3} denote the factor matrices; G ∈ R^{F1×F2×F3} denotes the core tensor.

  8. Tucker decomposition: The core tensor, G, can be viewed as weights for the interactions between the low-rank factor matrices. Elementwise: X(i, j, k) ≈ Σ_{f1=1}^{F1} Σ_{f2=1}^{F2} Σ_{f3=1}^{F3} G(f1, f2, f3) A(i, f1) B(j, f2) C(k, f3).
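The elementwise model is just a three-way contraction, so a small hedged NumPy check (illustrative names, not code from the talk) is:

```python
import numpy as np

def tucker_reconstruct(G, A, B, C):
    """X(i, j, k) ~= sum over (f1, f2, f3) of
    G(f1, f2, f3) * A(i, f1) * B(j, f2) * C(k, f3)."""
    return np.einsum('abc,ia,jb,kc->ijk', G, A, B, C)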

  9. Example Tucker applications. Dense: data compression. ◮ The Tucker decomposition has long been used to compress (dense) tensor data (think truncated SVD). ◮ Folks at Sandia have had huge successes in compressing large simulation outputs¹. Sparse: unstructured data analysis. ◮ More recently, used to discover relationships in unstructured data. ◮ The resulting tensors are sparse and high-dimensional. ◮ These large, sparse tensors are the focus of this talk. ¹ Woody Austin, Grey Ballard, and Tamara G. Kolda. "Parallel tensor compression for large-scale scientific data". In: International Parallel & Distributed Processing Symposium (IPDPS'16). IEEE, 2016, pp. 912–922.

  10. Example: dimensionality reduction for clustering. Factor interpretation: ◮ Each row of a factor matrix represents an object from the original data. ◮ The i-th object is a point in low-dimensional space: A(i, :). ◮ These points can be clustered, etc.

  11. Example: dimensionality reduction for clustering. Factor interpretation: ◮ Each row of a factor matrix represents an object from the original data. ◮ The i-th object is a point in low-dimensional space: A(i, :). ◮ These points can be clustered, etc. Application: citation network analysis [Kolda & Sun, ICDM '08]. ◮ A citation network forms an author × conference × keyword sparse tensor. ◮ The rows of the resulting factors are clustered with k-means to reveal relationships. Authors: Jiawei Han, Christos Faloutsos, . . . Conferences: KDD, ICDM, PAKDD, . . . Keywords: knowledge, learning, reasoning.
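As a rough illustration of the clustering step (not the authors' pipeline), the rows of a factor matrix can be fed to k-means; the matrix below is a random placeholder and n_clusters is arbitrary.

```python
import numpy as np
from sklearn.cluster import KMeans  # assumes scikit-learn is installed

# Each row A[i, :] embeds object i (e.g., an author) in F1-dimensional space.
A = np.random.rand(1000, 16)                      # placeholder factor matrix
labels = KMeans(n_clusters=10, n_init=10).fit_predict(A)
# Objects with the same label fall in the same discovered group.
```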

  12. Table of Contents: Tensor Background; Computing the Tucker Decomposition; TTMc with a Compressed Sparse Tensor; Utilizing Multiple Compressed Tensors; Experiments; Conclusions.

  13. Optimization problem: The resulting optimization problem is non-convex: minimize over A, B, C, and G the objective (1/2) ‖X − G ×_1 A ×_2 B ×_3 C‖²_F, subject to A^T A = I, B^T B = I, and C^T C = I.

  14. Higher-Order Orthogonal Iterations (HOOI): HOOI is an alternating optimization algorithm. Tucker decomposition with HOOI:
      1: while not converged do
      2:   Y1 ← X ×_2 B^T ×_3 C^T
      3:   A ← F1 leading left singular vectors of Y_(1)
      4:   Y2 ← X ×_1 A^T ×_3 C^T
      5:   B ← F2 leading left singular vectors of Y_(2)
      6:   Y3 ← X ×_1 A^T ×_2 B^T
      7:   C ← F3 leading left singular vectors of Y_(3)
      8:   G ← X ×_1 A^T ×_2 B^T ×_3 C^T
      9: end while
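A minimal dense sketch of the HOOI loop above, assuming a small 3-mode NumPy tensor; a fixed iteration count stands in for a real convergence test, and the names are illustrative.

```python
import numpy as np

def hooi(X, F1, F2, F3, iters=20):
    """Dense HOOI sketch for X of shape (I, J, K)."""
    I, J, K = X.shape
    # Random orthonormal starting factors for B and C (A is computed first).
    B = np.linalg.qr(np.random.rand(J, F2))[0]
    C = np.linalg.qr(np.random.rand(K, F3))[0]
    for _ in range(iters):
        # Y1 = X x_2 B^T x_3 C^T, unfolded along mode 1.
        Y1 = np.einsum('ijk,jb,kc->ibc', X, B, C).reshape(I, -1)
        A = np.linalg.svd(Y1, full_matrices=False)[0][:, :F1]
        Y2 = np.einsum('ijk,ia,kc->jac', X, A, C).reshape(J, -1)
        B = np.linalg.svd(Y2, full_matrices=False)[0][:, :F2]
        Y3 = np.einsum('ijk,ia,jb->kab', X, A, B).reshape(K, -1)
        C = np.linalg.svd(Y3, full_matrices=False)[0][:, :F3]
    # Core tensor: G = X x_1 A^T x_2 B^T x_3 C^T.
    G = np.einsum('ijk,ia,jb,kc->abc', X, A, B, C)
    return G, A, B, C
```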

  15. Higher-Order Orthogonal Iterations (HOOI): TTMc is the most expensive kernel in the HOOI algorithm (the chained products that compute Y1, Y2, and Y3 in the listing above).

  16. Intermediate memory blowup: A first step is to optimize a single TTM kernel and apply the TTMs in sequence: Y1 ← (X ×_2 B^T) ×_3 C^T. Challenge: ◮ Intermediate results become more dense after each TTM. ◮ Memory overheads depend on the sparsity pattern and factorization rank, but can be several orders of magnitude. Tamara Kolda and Jimeng Sun. "Scalable tensor decompositions for multi-aspect data mining". In: International Conference on Data Mining (ICDM). 2008.
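A back-of-the-envelope sketch of the blowup (illustration only; the sizes, rank, and random pattern are made up): after Y ← X ×_2 B^T, every (i, k) fiber of X that contains at least one nonzero becomes a fully dense length-F2 fiber of Y.

```python
import numpy as np

# Made-up sizes and rank for illustration.
nnz, I, K, F2 = 1_000_000, 10_000, 10_000, 50

# Random mode-1 and mode-3 indices of the nonzeros (coordinate form).
ii = np.random.randint(I, size=nnz)
kk = np.random.randint(K, size=nnz)

# Each distinct (i, k) pair with a nonzero yields a dense length-F2 fiber.
dense_fibers = len(set(zip(ii.tolist(), kk.tolist())))
blowup = dense_fibers * F2 / nnz
print(f"intermediate holds ~{blowup:.0f}x as many values as the input nonzeros")
```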

  17–19. Intermediate memory blowup: Y1 ← X ×_2 B^T ×_3 C^T. Solutions: 1. Tile over Y1 to constrain the blowup². ◮ Requires multiple passes over the input tensor and many FLOPs. 2. Instead, fuse the TTMs and use a formulation based on non-zeros³. ◮ Only a single pass over the tensor! ² Tamara Kolda and Jimeng Sun. "Scalable tensor decompositions for multi-aspect data mining". In: International Conference on Data Mining (ICDM). 2008. ³ Oguz Kaya and Bora Uçar. High-performance parallel algorithms for the Tucker decomposition of higher order sparse tensors. Tech. rep. RR-8801. Inria Research Centre Grenoble–Rhône-Alpes, 2015.

  20. Elementwise formulation: Processing each non-zero individually has cost O(nnz(X) F2 F3) and O(F2 F3) memory overhead: Y1(i, :, :) += X(i, j, k) [B(j, :) ◦ C(k, :)]. [Figure: the nonzero X(i, j, k) scales the outer product of rows B(j, :) and C(k, :), which is accumulated into slice Y1(i, :, :).] Oguz Kaya and Bora Uçar. High-performance parallel algorithms for the Tucker decomposition of higher order sparse tensors. Tech. rep. RR-8801. Inria Research Centre Grenoble–Rhône-Alpes, 2015.
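A hedged sketch of this nonzero-at-a-time TTMc with the tensor in coordinate form (the function name and array layout are illustrative, not the paper's code):

```python
import numpy as np

def ttmc_coo_mode1(ii, jj, kk, vals, B, C, I):
    """For every nonzero X(i, j, k):
        Y1(i, :, :) += X(i, j, k) * outer(B(j, :), C(k, :))
    Cost is O(nnz * F2 * F3); the only scratch space is one F2 x F3 block.
    """
    F2, F3 = B.shape[1], C.shape[1]
    Y1 = np.zeros((I, F2, F3))
    for i, j, k, v in zip(ii, jj, kk, vals):
        Y1[i] += v * np.outer(B[j], C[k])
    return Y1
```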

  21. TTMc with coordinate form: The elementwise formulation of TTMc naturally lends itself to a coordinate storage format. [Figure: the sparse tensor stored as lists of per-mode indices and values.]
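For concreteness, a tiny coordinate-form tensor and a call to the nonzero-wise kernel sketched above (all values are arbitrary):

```python
import numpy as np

# Coordinate (COO) storage: one index array per mode plus a value array.
# A 3 x 4 x 5 tensor with four nonzeros.
ii   = np.array([0, 0, 1, 2])
jj   = np.array([1, 3, 0, 2])
kk   = np.array([4, 0, 2, 2])
vals = np.array([1.0, 2.0, 3.0, 4.0])

B = np.random.rand(4, 2)   # J x F2
C = np.random.rand(5, 3)   # K x F3
Y1 = ttmc_coo_mode1(ii, jj, kk, vals, B, C, I=3)   # kernel from the sketch above
```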

  22. Memoization: Some of the intermediate results across TTMc kernels can be reused: Y1 ← X ×_2 B^T ×_3 C^T ×_4 D^T and Y2 ← X ×_1 A^T ×_3 C^T ×_4 D^T become: Z ← X ×_3 C^T ×_4 D^T; Y1 ← Z ×_2 B^T; Y2 ← Z ×_1 A^T. Muthu Baskaran et al. "Efficient and scalable computations with sparse tensors". In: High Performance Extreme Computing (HPEC). 2012.
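A dense 4-mode sketch of this reuse (assumed shapes, einsum spelling illustrative): the partial product Z is computed once and shared by the mode-1 and mode-2 TTMc results.

```python
import numpy as np

I, J, K, L = 6, 7, 8, 9
F1, F2, F3, F4 = 2, 3, 4, 5
X = np.random.rand(I, J, K, L)
A, B = np.random.rand(I, F1), np.random.rand(J, F2)
C, D = np.random.rand(K, F3), np.random.rand(L, F4)

# Shared partial result: Z = X x_3 C^T x_4 D^T.
Z = np.einsum('ijkl,kc,ld->ijcd', X, C, D)

# Both TTMc outputs start from Z instead of from X.
Y1 = np.einsum('ijcd,jb->ibcd', Z, B)   # Y1 = Z x_2 B^T
Y2 = np.einsum('ijcd,ia->jacd', Z, A)   # Y2 = Z x_1 A^T
```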

  23. TTMc with dimension trees: State-of-the-art TTMc. Each node in the tree stores intermediate results from a set of modes. Oguz Kaya and Bora Uçar. High-performance parallel algorithms for the Tucker decomposition of higher order sparse tensors. Tech. rep. RR-8801. Inria Research Centre Grenoble–Rhône-Alpes, 2015.

  24. TTMc with dimension trees. Parallelism: ◮ Independent units of work within each node are identified. ◮ For flat dimension trees, this equates to parallelizing over the Y1(i, :, :) slices. Oguz Kaya and Bora Uçar. High-performance parallel algorithms for the Tucker decomposition of higher order sparse tensors. Tech. rep. RR-8801. Inria Research Centre Grenoble–Rhône-Alpes, 2015.

  25. Table of Contents: Tensor Background; Computing the Tucker Decomposition; TTMc with a Compressed Sparse Tensor; Utilizing Multiple Compressed Tensors; Experiments; Conclusions.

  26. Motivation: Existing algorithms either: ◮ have intermediate data blowup, ◮ perform many operations, or ◮ trade memory for performance (i.e., memoization). ◮ Overheads depend on the sparsity pattern and factorization rank. Can we accelerate TTMc without memory overheads?
