SLIDE 1

Accelerating the Tucker Decomposition with Compressed Sparse Tensors

Shaden Smith and George Karypis

Department of Computer Science & Engineering, University of Minnesota {shaden, karypis}@cs.umn.edu

Euro-Par 2017

SLIDE 2

Outline

Tensor Background
Computing the Tucker Decomposition
TTMc with a Compressed Sparse Tensor
Utilizing Multiple Compressed Tensors
Experiments
Conclusions

SLIDE 3

Table of Contents

→ Tensor Background
Computing the Tucker Decomposition
TTMc with a Compressed Sparse Tensor
Utilizing Multiple Compressed Tensors
Experiments
Conclusions

SLIDE 4

Tensors

Tensors are the generalization of matrices to higher dimensions.

◮ Allow us to represent and analyze multi-dimensional data.
◮ Applications in precision healthcare, cybersecurity, recommender systems, ...

[Figure: a third-order tensor with modes patients × procedures × diagnoses.]
SLIDE 5

Essential operation: tensor-matrix multiplication

Tensor-matrix multiplication (TTM; also called the n-way product)

◮ Given: tensor X ∈ R^{I×J×K} and matrix M ∈ R^{F×K}.
◮ Operation: X ×3 M
◮ Output: Y ∈ R^{I×J×F}

Elementwise: Y(i, j, f) = Σ_{k=1}^{K} X(i, j, k) M(f, k).
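For concreteness, a small NumPy sketch of the dense mode-3 product (an illustration added for this transcript, not from the slides):

import numpy as np

I, J, K, F = 4, 5, 6, 3
X = np.random.rand(I, J, K)
M = np.random.rand(F, K)

# Y = X ×3 M: contract mode 3 (axis 2) of X with the columns of M.
Y = np.tensordot(X, M, axes=(2, 1))          # shape (I, J, F)

# Spot-check one entry against Y(i, j, f) = Σ_k X(i, j, k) M(f, k).
assert np.isclose(Y[1, 2, 0], sum(X[1, 2, k] * M[0, k] for k in range(K)))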

SLIDE 6

Chained tensor-matrix multiplication (TTMc)

Tensor-matrix multiplications are often performed in sequence (chained):

Y1 ← X ×2 B^T ×3 C^T

Notation: Tensors can be unfolded along one mode to matrix form: Y(n).

◮ Mode n forms the rows and the remaining modes become columns.

SLIDE 7

Tucker decomposition

The Tucker decomposition models a tensor X as a set of orthogonal factor matrices and a core tensor.

Notation: A ∈ R^{I×F1}, B ∈ R^{J×F2}, and C ∈ R^{K×F3} denote the factor matrices. G ∈ R^{F1×F2×F3} denotes the core tensor.

SLIDE 8

Tucker decomposition

The core tensor, G, can be viewed as weights for the interactions between the low-rank factor matrices. Elementwise:

X(i, j, k) ≈ Σ_{f1=1}^{F1} Σ_{f2=1}^{F2} Σ_{f3=1}^{F3} G(f1, f2, f3) A(i, f1) B(j, f2) C(k, f3)
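As a quick NumPy illustration of this triple sum (added for this transcript), the reconstruction is a single einsum:

import numpy as np

I, J, K, F1, F2, F3 = 4, 5, 6, 2, 3, 2
G = np.random.rand(F1, F2, F3)
A = np.random.rand(I, F1)
B = np.random.rand(J, F2)
C = np.random.rand(K, F3)

# X(i,j,k) ≈ Σ_{f1,f2,f3} G(f1,f2,f3) A(i,f1) B(j,f2) C(k,f3)
Xhat = np.einsum('pqr,ip,jq,kr->ijk', G, A, B, C)   # shape (I, J, K)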

SLIDE 9

Example Tucker applications

Dense: data compression
◮ The Tucker decomposition has long been used to compress (dense) tensor data (think truncated SVD).
◮ Folks at Sandia have had huge successes in compressing large simulation outputs¹.

Sparse: unstructured data analysis
◮ More recently, used to discover relationships in unstructured data.
◮ The resulting tensors are sparse and high-dimensional.
◮ These large, sparse tensors are the focus of this talk.

1. Woody Austin, Grey Ballard, and Tamara G. Kolda. "Parallel tensor compression for large-scale scientific data". In: International Parallel & Distributed Processing Symposium (IPDPS'16). IEEE. 2016, pp. 912–922.

SLIDE 11

Example: dimensionality reduction for clustering

Factor interpretation:

◮ Each row of a factor matrix represents an object from the original data.
◮ The ith object is a point in low-dimensional space: A(i, :).
◮ These points can be clustered, etc.

Application: citation network analysis [Kolda & Sun, ICDM ’08]

◮ A citation network forms an author × conference × keyword sparse tensor.
◮ The rows of the resulting factors are clustered with k-means to reveal relationships.

Authors: Jiawei Han, Christos Faloutsos, ...
Conferences: KDD, ICDM, PAKDD, ...
Keywords: knowledge, learning, reasoning

SLIDE 12

Table of Contents

Tensor Background
→ Computing the Tucker Decomposition
TTMc with a Compressed Sparse Tensor
Utilizing Multiple Compressed Tensors
Experiments
Conclusions

SLIDE 13

Optimization problem

The resulting optimization problem is non-convex:

minimize_{A, B, C, G}   (1/2) ‖ X − G ×1 A ×2 B ×3 C ‖_F^2

subject to   A^T A = I,   B^T B = I,   C^T C = I

SLIDE 14

Higher-Order Orthogonal Iterations (HOOI)

HOOI is an alternating optimization algorithm.

Tucker Decomposition with HOOI
1:  while not converged do
2:    Y1 ← X ×2 B^T ×3 C^T
3:    A ← F1 leading left singular vectors of Y(1)
4:
5:    Y2 ← X ×1 A^T ×3 C^T
6:    B ← F2 leading left singular vectors of Y(2)
7:
8:    Y3 ← X ×1 A^T ×2 B^T
9:    C ← F3 leading left singular vectors of Y(3)
10:
11:   G ← X ×1 A^T ×2 B^T ×3 C^T
12: end while
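As a rough rendering of this loop, a dense NumPy sketch (added for this transcript; the talk's setting is sparse, where forming Y densely is exactly the problem discussed later):

import numpy as np

def unfold(X, n):
    # Mode-n unfolding: mode n forms the rows, remaining modes the columns.
    return np.moveaxis(X, n, 0).reshape(X.shape[n], -1)

def ttm(X, M, n):
    # Mode-n product: contract mode n of X with the columns of M (F × I_n).
    return np.moveaxis(np.tensordot(M, X, axes=(1, n)), 0, n)

def hooi(X, ranks, iters=10):
    N = X.ndim
    # Initialize with the leading left singular vectors of each unfolding.
    U = [np.linalg.svd(unfold(X, n), full_matrices=False)[0][:, :ranks[n]]
         for n in range(N)]
    for _ in range(iters):
        for n in range(N):
            Y = X
            for m in range(N):
                if m != n:
                    Y = ttm(Y, U[m].T, m)   # the chained TTMs (TTMc)
            U[n] = np.linalg.svd(unfold(Y, n), full_matrices=False)[0][:, :ranks[n]]
    G = X
    for m in range(N):
        G = ttm(G, U[m].T, m)               # core tensor
    return G, U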

SLIDE 15

Higher-Order Orthogonal Iterations (HOOI)

TTMc is the most expensive kernel in the HOOI algorithm (steps 2, 5, 8, and 11 of the listing above).

SLIDE 16

Intermediate memory blowup

A first step is to optimize a single TTM kernel and apply them in sequence:

Y1 ← (X ×2 B^T) ×3 C^T

Challenge:
◮ Intermediate results become more dense after each TTM.
◮ Memory overheads are dependent on the sparsity pattern and factorization rank, but can be several orders of magnitude.

Tamara Kolda and Jimeng Sun. "Scalable tensor decompositions for multi-aspect data mining". In: International Conference on Data Mining (ICDM). 2008.

SLIDE 19

Intermediate memory blowup

Y1 ← X ×2 B^T ×3 C^T

Solutions:
1. Tile over Y1 to constrain blowup².
   ◮ Requires multiple passes over the input tensor and many FLOPs.
2. Instead, fuse the TTMs and use a formulation based on non-zeros³.
   ◮ Only a single pass over the tensor!

2. Tamara Kolda and Jimeng Sun. "Scalable tensor decompositions for multi-aspect data mining". In: International Conference on Data Mining (ICDM). 2008.
3. Oguz Kaya and Bora Uçar. High-performance parallel algorithms for the Tucker decomposition of higher order sparse tensors. Tech. rep. RR-8801. Inria Research Centre Grenoble–Rhône-Alpes, 2015.

SLIDE 20

Elementwise formulation

Processing each non-zero individually has cost O(nnz(X) · F2 F3) and O(F2 F3) memory overhead:

Y1(i, :, :) += X(i, j, k) [B(j, :) ◦ C(k, :)]

[Figure: the non-zero X(i, j, k) scales the outer product of rows B(j, :) and C(k, :), which is added to slice Y1(i, :, :).]
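A minimal sketch of this non-zero-at-a-time kernel over coordinate format (illustrative Python, not SPLATT's C code):

import numpy as np

def ttmc_coo(nnz, B, C, I):
    # nnz: list of (i, j, k, value); returns Y unfolded along mode 1.
    F2, F3 = B.shape[1], C.shape[1]
    Y1 = np.zeros((I, F2 * F3))
    for i, j, k, val in nnz:
        # Each non-zero contributes a full F2 × F3 outer product.
        Y1[i] += val * np.outer(B[j], C[k]).ravel()
    return Y1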

Oguz Kaya and Bora Uçar. High-performance parallel algorithms for the Tucker decomposition of higher order sparse tensors. Tech. rep. RR-8801. Inria Research Centre Grenoble–Rhône-Alpes, 2015.

SLIDE 21

TTMc with coordinate form

The elementwise formulation of TTMc naturally lends itself to a coordinate storage format.

[Figure: the tensor converted to a list of (i, j, k, value) tuples.]

SLIDE 22

Memoization

Some of the intermediate results across TTMc kernels can be reused:

Y1 ← X ×2 B^T ×3 C^T ×4 D^T
Y2 ← X ×1 A^T ×3 C^T ×4 D^T

becomes:

Z ← X ×3 C^T ×4 D^T
Y1 ← Z ×2 B^T
Y2 ← Z ×1 A^T

Muthu Baskaran et al. “Efficient and scalable computations with sparse tensors”. In: High Performance Extreme Computing (HPEC). 2012.
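A dense einsum sketch of this reuse (shapes are hypothetical; added for illustration):

import numpy as np

I, J, K, L, F = 3, 4, 5, 6, 2
X = np.random.rand(I, J, K, L)
A, B = np.random.rand(I, F), np.random.rand(J, F)
C, D = np.random.rand(K, F), np.random.rand(L, F)

Z  = np.einsum('ijkl,kp,lq->ijpq', X, C, D)   # Z = X ×3 C^T ×4 D^T, computed once
Y1 = np.einsum('ijpq,jr->irpq', Z, B)         # Y1 = Z ×2 B^T
Y2 = np.einsum('ijpq,ir->rjpq', Z, A)         # Y2 = Z ×1 A^T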

SLIDE 23

TTMc with dimension trees

State-of-the-art TTMc: Each node in the tree stores intermediate results from a set of modes.

Oguz Kaya and Bora Uçar. High-performance parallel algorithms for the Tucker decomposition of higher order sparse tensors. Tech. rep. RR-8801. Inria Research Centre Grenoble–Rhône-Alpes, 2015.

SLIDE 24

TTMc with dimension trees

Parallelism:

◮ Independent units of work within each node are identified.
◮ For flat dimension trees, this equates to parallelizing over Y1(i, :, :) slices.

Oguz Kaya and Bora Uçar. High-performance parallel algorithms for the Tucker decomposition of higher order sparse tensors. Tech. rep. RR-8801. Inria Research Centre Grenoble–Rhône-Alpes, 2015.

SLIDE 25

Table of Contents

Tensor Background
Computing the Tucker Decomposition
→ TTMc with a Compressed Sparse Tensor
Utilizing Multiple Compressed Tensors
Experiments
Conclusions

SLIDE 26

Motivation

Existing algorithms either:

◮ have intermediate data blowup,
◮ perform many operations, or
◮ trade memory for performance (i.e., memoization).
  ◮ Overheads depend on the sparsity pattern and factorization rank.

Can we accelerate TTMc without memory overheads?

SLIDE 27

Compressed Sparse Fiber (CSF)

CSF encodes a sparse tensor as a forest.

◮ Each path from root to leaf encodes a non-zero.
◮ CSF can be viewed as a generalization of CSR.
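As a rough picture of the layout (an illustration in the spirit of CSR's rowptr/colind arrays; SPLATT's actual data structures differ in detail):

# Non-zeros of a 3-mode tensor, sorted by (i, j, k):
#   (0,0,1)=1.0  (0,0,3)=2.0  (0,2,1)=3.0  (1,0,0)=4.0
fids = [[0, 1],            # level 1: distinct i values (tree roots)
        [0, 2, 0],         # level 2: j values of each fiber
        [1, 3, 1, 0]]      # level 3: k values, aligned with vals
fptr = [[0, 2, 3],         # root r owns fids[1][fptr[0][r]:fptr[0][r+1]]
        [0, 2, 3, 4]]      # fiber f owns fids[2][fptr[1][f]:fptr[1][f+1]]
vals = [1.0, 2.0, 3.0, 4.0]
# Walking root i=0 -> fiber j=0 -> leaves k=1 and k=3 recovers two non-zeros.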

Shaden Smith and George Karypis. “Tensor-Matrix Products with a Compressed Sparse Tensor”. In: 5th Workshop on Irregular Applications: Architectures and Algorithms. 2015.

SLIDE 28

Arithmetic redundancies in TTMc

Going back to the non-zero formulation:

Y1(i, :, :) += X(i, j, k) [B(j, :) ◦ C(k, :)]

There are two arithmetic redundancies we can exploit:
1. Distributive outer products
2. Redundant outer products

SLIDE 29

Distributive outer products

Consider two non-zeros in the same fiber X(i, j, :):

Y1(i, :, :) += X(i, j, k1) [B(j, :) ◦ C(k1, :)]
Y1(i, :, :) += X(i, j, k2) [B(j, :) ◦ C(k2, :)]

We can factor out B(j, :):

Y1(i, :, :) += B(j, :) ◦ [X(i, j, k1) C(k1, :) + X(i, j, k2) C(k2, :)]
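A quick numerical check of the identity (illustrative):

import numpy as np

F2, F3 = 3, 4
B_j = np.random.rand(F2)
C_k1, C_k2 = np.random.rand(F3), np.random.rand(F3)
x1, x2 = 0.5, -2.0

# Two F2 × F3 outer products ...
naive = x1 * np.outer(B_j, C_k1) + x2 * np.outer(B_j, C_k2)
# ... become one, after an O(F3) accumulation inside the fiber.
factored = np.outer(B_j, x1 * C_k1 + x2 * C_k2)
assert np.allclose(naive, factored)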

SLIDE 30

Distributive outer products with CSF

◮ Children nodes are summed and then expanded with an outer product.
◮ Intermediate memory is kept negligible with a post-order depth-first traversal.
◮ We only need to keep a single stack of intermediate results.

SLIDE 31

Distributive outer products with CSF

Compare to computing with coordinate format.

Savings: the cost of each non-zero (leaf) is now linear in the rank.

SLIDE 32

Redundant outer products

Suppose we're now computing Y3:

Y3(:, :, k1) += X(i, j, k1) [A(i, :) ◦ B(j, :)]
Y3(:, :, k2) += X(i, j, k2) [A(i, :) ◦ B(j, :)]

The outer product can be reused:

S ← A(i, :) ◦ B(j, :)
Y3(:, :, k1) += X(i, j, k1) · S
Y3(:, :, k2) += X(i, j, k2) · S
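Sketched numerically (illustrative):

import numpy as np

F1, F2 = 3, 4
A_i, B_j = np.random.rand(F1), np.random.rand(F2)
x1, x2 = 1.5, 0.25

S = np.outer(A_i, B_j)   # built once per (i, j) prefix
upd_k1 = x1 * S          # each non-zero now costs a scalar-matrix product
upd_k2 = x2 * S
assert np.allclose(upd_k1, x1 * np.outer(A_i, B_j))   # same result as recomputing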

SLIDE 33

Redundant outer products with CSF

◮ Each child node uses its parent in an outer product.
◮ Use a pre-order depth-first traversal to manage memory.

SLIDE 34

Redundant outer products with CSF

Compare to computing with coordinate format.

Savings: outer products are constructed less often, but non-zeros still have the same asymptotic cost as coordinate form.

SLIDE 35

Putting it all together

These two optimizations can be combined within the same kernel.

◮ Traversal is still depth-first.
◮ Use a pre-order traversal for levels above the mode of interest.
◮ Use a post-order traversal for levels below the mode of interest.

SLIDE 36

Parallelism with CSF

We distribute trees to threads and use dynamic load balancing. Race conditions depend on the mode of interest:

◮ Root nodes are unique, so no race conditions.
◮ Otherwise, use a mutex to lock the slice of Y.

SLIDE 37

Table of Contents

Tensor Background
Computing the Tucker Decomposition
TTMc with a Compressed Sparse Tensor
→ Utilizing Multiple Compressed Tensors
Experiments
Conclusions

SLIDE 38

Motivation

TTMc is significantly more expensive when computing for the lower-level modes.

◮ This is due to FLOPs and synchronization costs.

SLIDE 39

Multiple CSF representations

We can reorder the modes of X and store additional copies of the tensor.

◮ Selectively place modes near the top which were previously expensive.

SLIDE 40

CSF selection

Given storage for K copies of the tensor, how do we select them from the N! possible mode orderings? Greedy, heuristic algorithm (sketched in code below):

◮ The first CSF always sorts the modes based on their lengths.
◮ For each of the K − 1 remaining CSF representations:
  ◮ Select the mode which is estimated to be the most expensive based on FLOPs.
  ◮ Place that mode at the top of the next CSF, with the remaining modes sorted by length.
  ◮ Examine the new CSF and update the best cost estimates.
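A toy sketch of the heuristic (added for illustration; mode_cost stands in for SPLATT's FLOP estimator, which we do not reproduce here):

def greedy_csf_orderings(dim_lengths, mode_cost, K):
    # dim_lengths[m]: length of mode m.
    # mode_cost(orderings, m): estimated TTMc FLOPs for mode m given the
    # CSF orderings chosen so far (hypothetical callback).
    N = len(dim_lengths)
    by_length = sorted(range(N), key=lambda m: dim_lengths[m])
    orderings = [by_length]                    # first CSF: modes sorted by length
    for _ in range(K - 1):
        # Pick the mode currently estimated to be the most expensive.
        worst = max(range(N), key=lambda m: mode_cost(orderings, m))
        rest = [m for m in by_length if m != worst]
        orderings.append([worst] + rest)       # place it at the top of the next CSF
    return orderings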

SLIDE 41

Table of Contents

Tensor Background
Computing the Tucker Decomposition
TTMc with a Compressed Sparse Tensor
Utilizing Multiple Compressed Tensors
→ Experiments
Conclusions

SLIDE 42

Experimental Setup

Source code:
◮ Part of the Surprisingly ParalleL spArse Tensor Toolkit⁴
◮ Written in C and parallelized with OpenMP
◮ Compiled with icc v16.0.3 and linked with Intel MKL

HyperTensor:
◮ Implements dimension tree-based methods
◮ Written in C++ and parallelized with OpenMP

Machine specifications:
◮ 2× 12-core Intel E5-2680v3 processors (Haswell)
◮ Double-precision floats and 32-bit integers

4. SPLATT: https://github.com/ShadenSmith/splatt

SLIDE 43

Experimental Setup

Experiments

◮ All measurements are for a sequence of TTMc kernels forming one iteration.
◮ We fix F1 = F2 = · · · = 20.

SLIDE 44

Datasets

Dataset               Non-zeros   Modes   Dimensions
NELL-2                77M         3       12K, 9K, 29K
Netflix               100M        3       480K, 18K, 2K
Enron                 54M         4       6K, 6K, 244K, 1K
Alzheimer             6.27M       5       5, 1K, 156, 1K, 396
Poisson3D, Poisson4D  100M        3, 4    3K, ..., 3K

K and M stand for thousand and million, respectively.

SLIDE 45

Relative FLOP cost of TTMc

The greedy algorithm usually matches or gets close to the best possible CSF configuration.

[Figure: FLOPs required for TTMc relative to HT-FLAT (log scale, 10^-3 to 10^0) for HT-BTREE, CSF-1, CSF-2, CSF-3, and CSF-BEST on Netflix, NELL-2, Enron, Alzheimer, Poisson3D, and Poisson4D.]

SLIDE 46

Relative FLOP cost of TTMc

CSF benefits more as dimensionality increases.

[Same figure as the previous slide.]

SLIDE 47

Parallel Scalability

Adding CSF representations improves scalability due to fewer and smaller critical regions.

[Figure: speedup vs. cores (1–24) on Netflix, NELL-2, Enron, Alzheimer, Poisson3D, and Poisson4D, comparing ideal, HT-FLAT, HT-BTREE, CSF-1, CSF-2, and CSF-A.]

SLIDE 48

Performance tradeoffs

Selecting the number of CSFs provides tuning for memory vs. speed.

◮ CSF always provides the options for the smallest and fastest executions.

[Figure: time (s) vs. memory (GB) for HT-FLAT, HT-BTREE, CSF-1, CSF-2, and CSF-A on each dataset. Annotations: Netflix 1.8× larger, 8.8× faster; NELL-2 1.6× smaller, 14.9× faster; Enron 2.2× larger, 4.0× faster; Alzheimer 28.6× smaller, 20.7× faster; Poisson3D 1.3× smaller, 17.4× faster; Poisson4D 2.1× smaller, 1.5× faster.]
SLIDE 49

Table of Contents

Tensor Background
Computing the Tucker Decomposition
TTMc with a Compressed Sparse Tensor
Utilizing Multiple Compressed Tensors
Experiments
→ Conclusions

SLIDE 51

Wrapping up

Contributions:

◮ We optimized TTMc kernels via a compressed tensor representation.
◮ CSF naturally exposes arithmetic redundancies in TTMc.
◮ Multiple CSF tensors can further accelerate computation.
◮ Up to 20× speedup over the state-of-the-art while using 28× less memory.
◮ Choosing the number of data copies offers a tunable computation/memory tradeoff.

Future work:

◮ Alternative decompositions to reduce synchronization costs.
◮ Memoization is also applicable to the CSF formulation! (Li et al., IPDPS '17)

SLIDE 52

Reproducibility

All of our work is open source (to be updated soon): https://github.com/ShadenSmith/splatt

Datasets are freely available: http://frostt.io/

SLIDE 53

Backup

SLIDE 54

Canonical polyadic decomposition (CPD)

The CPD models a tensor as the summation of rank-1 tensors:

minimize_{A, B, C}   L(X, A, B, C) = ‖ X − Σ_{f=1}^{F} A(:, f) ◦ B(:, f) ◦ C(:, f) ‖_F^2

Notation: A ∈ R^{I×F}, B ∈ R^{J×F}, and C ∈ R^{K×F} denote the factor matrices for a 3D tensor.

SLIDE 55

TTMc with a CSF tensor (1)

Algorithm 1 TTMc()
1: function TTMc(X, mode)
2:   for i1 = 1, . . . , I1 in parallel do
3:     construct(X(i1, :, . . . , :), mode, 1)
4:   end for
5: end function

SLIDE 56

TTMc with a CSF tensor (2)

Algorithm 2 construct()
⊲ Construct Kronecker products and push them down to level mode − 1.
1: function construct(node, mode, above)
2:   d ← level(node)            ⊲ The level in the tree (i.e., distance from the root).
3:   id ← node_id(node)         ⊲ The partial coordinate of a non-zero.
4:   if d < mode then
5:     above ← above ⊗ A(d)(id, :)
6:     for c ∈ children(node) do
7:       construct(c, mode, above)
8:     end for
9:   else if d = mode then
10:    below ← Σ_{c ∈ children(node)} accumulate(c)
11:    Lock mutex id.
12:    Y(d)(id, :) ← Y(d)(id, :) + (above ⊗ below)    ⊲ Update Y(d).
13:    Unlock mutex id.
14:  end if
15: end function

SLIDE 57

TTMc with a CSF tensor (3)

Algorithm 3 accumulate()
⊲ Pull Kronecker products up from the leaf nodes.
1: function accumulate(node)
2:   d ← level(node)
3:   id ← node_id(node)
4:   if d = N then
5:     return X(i1, . . . , id) · A(N)(id, :)
6:   else
7:     return A(d)(id, :) ⊗ Σ_{c ∈ children(node)} accumulate(c)
8:   end if
9: end function
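For concreteness, a compact Python rendering of the same recursion for a 3-mode tensor (nested dicts stand in for the CSF index arrays; an illustration, not SPLATT's implementation):

import numpy as np

def ttmc_csf3(csf, A, B, C, dims, mode):
    # csf: {i: {j: {k: val}}}, modes ordered (1, 2, 3); `mode` is the mode
    # left un-multiplied. Returns the unfolded output Y(mode).
    FA, FB, FC = A.shape[1], B.shape[1], C.shape[1]
    if mode == 1:
        Y = np.zeros((dims[0], FB * FC))
        for i, fibers in csf.items():
            for j, fiber in fibers.items():
                # Post-order: sum the leaves, then one outer product per fiber.
                below = sum(v * C[k] for k, v in fiber.items())
                Y[i] += np.outer(B[j], below).ravel()
    elif mode == 2:
        Y = np.zeros((dims[1], FA * FC))
        for i, fibers in csf.items():
            for j, fiber in fibers.items():
                below = sum(v * C[k] for k, v in fiber.items())
                Y[j] += np.outer(A[i], below).ravel()   # above ⊗ below
    else:
        Y = np.zeros((dims[2], FA * FB))
        for i, fibers in csf.items():
            for j, fiber in fibers.items():
                S = np.outer(A[i], B[j]).ravel()   # pre-order: reuse per (i, j) prefix
                for k, v in fiber.items():
                    Y[k] += v * S
    return Y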
