
SLIDE 1

Memory-Efficient Parallel Computation of Tensor and Matrix Products for Big Tensor Decomposition

  • N. Ravindran∗, N.D. Sidiropoulos∗, S. Smith†, and G. Karypis†

∗ Dept. of ECE & DTC, † Dept. of CSci & DTC

University of Minnesota, Minneapolis

Asilomar Conf. on Signals, Systems, and Computers, Nov. 3-5, 2014

Ravindran et al Memory-Efficient Parallel Tensor-Matrix Products Asilomar CSSC 2014 1 / 1

SLIDE 2

Outline

  • Rank decomposition for tensors - PARAFAC/CANDECOMP (CP), ALS
  • Computational bottleneck for big tensors: unfolded tensor data times Khatri-Rao matrix product
  • Prior work
  • Proposed memory- and computation-efficient algorithms
  • Review of recent randomized tensor compression results: identifiability, PARACOMP
  • Memory- and computation-efficient algorithms for multi-way tensor compression
  • Parallelization and high-performance computing optimization (underway)

SLIDE 3

Rank decomposition for tensors

Sum of outer products:

X = Σ_{f=1}^{F} a_f ◦ b_f ◦ c_f   (†)

with A, I × F, holding {a_f}_{f=1}^{F}; B, J × F, holding {b_f}_{f=1}^{F}; and C, K × F, holding {c_f}_{f=1}^{F}

X(:, :, k) := k-th I × J matrix “slice” of X
X(3)^T := IJ × K matrix whose k-th column is vec(X(:, :, k))
Similarly, X(1)^T is the JK × I unfolding, and X(2)^T the IK × J unfolding

Equivalent ways to write (†):

X(1)^T = (C ⊙ B)A^T ⟺ X(2)^T = (C ⊙ A)B^T ⟺ X(3)^T = (B ⊙ A)C^T ⟺ vec(X(3)^T) = (C ⊙ B ⊙ A)1
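As a sanity check on the unfolding identities, here is a small numpy sketch (variable names chosen to mirror the slide, not the authors' code) that builds a rank-F tensor from (†) and verifies X(3)^T = (B ⊙ A)C^T:

```python
import numpy as np

rng = np.random.default_rng(0)
I, J, K, F = 4, 5, 6, 3
A = rng.standard_normal((I, F))
B = rng.standard_normal((J, F))
C = rng.standard_normal((K, F))

# X = sum_f a_f ◦ b_f ◦ c_f  -- the rank decomposition (†)
X = np.einsum('if,jf,kf->ijk', A, B, C)

def khatri_rao(P, Q):
    """Column-wise Kronecker product: column f is kron(P[:, f], Q[:, f])."""
    return np.einsum('pf,qf->pqf', P, Q).reshape(-1, P.shape[1])

# X(3)^T: IJ x K matrix whose k-th column is vec(X(:, :, k)) (column-major vec)
X3T = np.stack([X[:, :, k].reshape(-1, order='F') for k in range(K)], axis=1)

# Verify the matricized form X(3)^T = (B ⊙ A) C^T
assert np.allclose(X3T, khatri_rao(B, A) @ C.T)
```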

SLIDE 4

Alternating Least Squares (ALS)

Multilinear least squares:

min_{A,B,C} ||X(1)^T − (C ⊙ B)A^T||_F^2

Nonconvex, in fact NP-hard even for F = 1. Alternating least squares, using

X(1)^T = (C ⊙ B)A^T ⟺ X(2)^T = (C ⊙ A)B^T ⟺ X(3)^T = (B ⊙ A)C^T

reduces to

C = X(3)(B ⊙ A)(B^T B ∗ A^T A)†
A = X(1)(C ⊙ B)(C^T C ∗ B^T B)†
B = X(2)(C ⊙ A)(C^T C ∗ A^T A)†
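A minimal numpy sketch of the three ALS updates (illustrative only; `khatri_rao` and the unfolding orderings are my own conventions, not code from the paper). With exact rank-F data and the other two factors held at their true values, each pseudo-inverse update recovers the remaining factor exactly:

```python
import numpy as np

rng = np.random.default_rng(1)
I, J, K, F = 4, 5, 6, 3
A0 = rng.standard_normal((I, F))
B0 = rng.standard_normal((J, F))
C0 = rng.standard_normal((K, F))
X = np.einsum('if,jf,kf->ijk', A0, B0, C0)   # exact rank-F tensor

def khatri_rao(P, Q):
    """Column-wise Kronecker product: column f is kron(P[:, f], Q[:, f])."""
    return np.einsum('pf,qf->pqf', P, Q).reshape(-1, P.shape[1])

# Unfoldings, with columns ordered so that X1 = A (C ⊙ B)^T, etc.
X1 = X.transpose(0, 2, 1).reshape(I, K * J)
X2 = X.transpose(1, 2, 0).reshape(J, K * I)
X3 = X.transpose(2, 1, 0).reshape(K, J * I)

# The three updates; with noiseless data and true factors, each is exact
C = X3 @ khatri_rao(B0, A0) @ np.linalg.pinv((B0.T @ B0) * (A0.T @ A0))
A = X1 @ khatri_rao(C0, B0) @ np.linalg.pinv((C0.T @ C0) * (B0.T @ B0))
B = X2 @ khatri_rao(C0, A0) @ np.linalg.pinv((C0.T @ C0) * (A0.T @ A0))
assert np.allclose(A, A0) and np.allclose(B, B0) and np.allclose(C, C0)
```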

SLIDE 5

Core computation

  • Computation and inversion of (C^T C ∗ B^T B) is relatively easy: the K × F and J × F factors are relatively small, the matrix to invert is F × F, and F is usually small
  • Direct computation of, say, X(1)(C ⊙ B) requires O(JKF) memory to store C ⊙ B, in addition to O(NNZ) memory to store the tensor data, where NNZ is the number of non-zero elements of the tensor X
  • Further, JKF flops are required to compute C ⊙ B, and JKF + 2F·NNZ flops to compute its product with X(1)
  • Bottleneck is computing X(1)(C ⊙ B); likewise X(2)(C ⊙ A) and X(3)(B ⊙ A)
  • The entire X must be accessed for each of these computations in every ALS iteration, incurring large data-transport costs
  • The memory access pattern of the tensor data differs across the three computations, making efficient block caching very difficult
  • ‘Solution’: replicate the data three times in main (fast) memory :-(
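To make the memory issue concrete, this toy snippet (sizes invented for illustration) materializes C ⊙ B explicitly; its footprint grows as J·K·F regardless of how sparse X is:

```python
import numpy as np

J, K, F = 200, 300, 10
B = np.ones((J, F))
C = np.ones((K, F))

# Explicit Khatri-Rao product: a dense JK x F intermediate
KR = np.einsum('kf,jf->kjf', C, B).reshape(K * J, F)
assert KR.shape == (K * J, F)

# At scale, the intermediate alone dominates: e.g. J = K = 10**4, F = 50
# needs 8 * J * K * F bytes = 40 GB in float64, before touching X at all
bytes_needed = 8 * 10**4 * 10**4 * 50
assert bytes_needed == 40 * 10**9
```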

SLIDE 6

Prior work

  • Tensor Toolbox [Kolda et al., 2008-] has explicit support for sparse tensors and avoids intermediate data explosion by ‘accumulating’ tensor-matrix operations. It computes X(1)(C ⊙ B) with 3F·NNZ flops using NNZ intermediate memory (on top of that required to store the tensor), but does not provision for efficient parallelization (the accumulation step must be performed serially)
  • Kang et al., 2012, computes X(1)(C ⊙ B) with 5F·NNZ flops using O(max(J + NNZ, K + NNZ)) intermediate memory; in return, it admits a parallel MapReduce implementation
  • Room for considerable improvement in memory- and computation-efficiency, esp. for high-performance parallel computing architectures

SLIDE 7

Our first contribution: Suite of three algorithms

Algorithm 1: Output: M1 ← X(1)(C ⊙ B) ∈ R^{I×F}
1: M1 ← 0
2: for k = 1, . . . , K do
3:   M1 ← M1 + X(:, :, k) B diag(C(k, :))
4: end for

Algorithm 2: Output: M2 ← X(2)(C ⊙ A) ∈ R^{J×F}
1: M2 ← 0
2: for k = 1, . . . , K do
3:   M2 ← M2 + X(:, :, k)^T A diag(C(k, :))
4: end for

Algorithm 3: Output: M3 ← X(3)(B ⊙ A) ∈ R^{K×F}
1: for k = 1, . . . , K do
2:   M3(k, :) ← 1^T (A ∗ (X(:, :, k) B))
3: end for
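A direct numpy transcription of the three slice-wise algorithms (my own sketch of the pseudocode, with diag(C(k, :)) realized as column-wise scaling), checked against the unfolded products computed by brute force:

```python
import numpy as np

rng = np.random.default_rng(2)
I, J, K, F = 4, 5, 6, 3
X = rng.standard_normal((I, J, K))
A = rng.standard_normal((I, F))
B = rng.standard_normal((J, F))
C = rng.standard_normal((K, F))

# Algorithm 1: M1 = X(1)(C ⊙ B), accumulated slice by slice
M1 = np.zeros((I, F))
for k in range(K):
    M1 += (X[:, :, k] @ B) * C[k]          # X(:,:,k) B diag(C(k,:))

# Algorithm 2: M2 = X(2)(C ⊙ A), same data access pattern, transposed slice
M2 = np.zeros((J, F))
for k in range(K):
    M2 += (X[:, :, k].T @ A) * C[k]

# Algorithm 3: M3 = X(3)(B ⊙ A), one row of the result per slice
M3 = np.zeros((K, F))
for k in range(K):
    M3[k] = (A * (X[:, :, k] @ B)).sum(axis=0)   # 1^T (A ∗ X(:,:,k)B)

# Brute-force references
assert np.allclose(M1, np.einsum('ijk,jf,kf->if', X, B, C))
assert np.allclose(M2, np.einsum('ijk,if,kf->jf', X, A, C))
assert np.allclose(M3, np.einsum('ijk,if,jf->kf', X, A, B))
```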

SLIDE 8

Features

  • Essentially no additional intermediate memory needed - the updates of A, B, and C can effectively be performed in place
  • Computational complexity savings relative to Kang et al.; similar or better (depending on the pattern of nonzeros) than Kolda et al.
  • Algorithms 1, 2, and 3 share the same tensor data access pattern, enabling efficient orderly block caching / pre-fetching if the tensor is stored in slower / serially read memory, without the need for three-fold replication (→ asymmetry between Algorithms 1, 2 and Algorithm 3)
  • The loops can be parallelized across K threads, where each thread only requires access to an I × J slice of the tensor. This favors parallel computation and distributed storage

SLIDE 9

Computational complexity of Algorithm 1

Algorithm 1: Output: M1 ← X(1)(C ⊙ B) ∈ R^{I×F}
1: M1 ← 0
2: for k = 1, . . . , K do
3:   M1 ← M1 + X(:, :, k) B diag(C(k, :))
4: end for

Let I_k be the number of non-empty rows and J_k the number of non-empty columns of X(:, :, k), and define

NNZ1 := Σ_{k=1}^{K} I_k,  NNZ2 := Σ_{k=1}^{K} J_k

Assume that empty rows and columns of X(:, :, k) can be identified offline and skipped during the matrix multiplication and the update of M1.
Note: only the rows of B corresponding to non-empty columns of X(:, :, k) need to be scaled by diag(C(k, :)); this takes F·J_k flops per slice, for a total of F·NNZ2.

SLIDE 10

Computational complexity of Algorithm 1, continued

Algorithm 1: Output: M1 ← X(1)(C ⊙ B) ∈ R^{I×F}
1: M1 ← 0
2: for k = 1, . . . , K do
3:   M1 ← M1 + X(:, :, k) B diag(C(k, :))
4: end for

Next, the multiplications X(:, :, k) B diag(C(k, :)) can be carried out for all k at 2F·NNZ flops (counting additions and multiplications).
Finally, only the rows of M1 corresponding to non-zero rows of X(:, :, k) need to be updated; each row update costs F flops, since X(:, :, k) B diag(C(k, :)) has F columns, so the M1 row updates cost F·NNZ1 flops in total.
Overall: F·NNZ1 + F·NNZ2 + 2F·NNZ flops. Kang: 5F·NNZ; Kolda: 3F·NNZ. Note NNZ > NNZ1, NNZ2.
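Plugging in some illustrative numbers (hypothetical values, chosen only to make the comparison tangible) shows the ordering of the three flop counts whenever NNZ1, NNZ2 < NNZ:

```python
# Flop counts for computing X(1)(C ⊙ B) on a sparse tensor
# (NNZ, NNZ1, NNZ2 are made-up illustrative values with NNZ1, NNZ2 < NNZ)
F = 10
NNZ = 10**6
NNZ1 = NNZ2 = 10**5

proposed = F * NNZ1 + F * NNZ2 + 2 * F * NNZ   # Algorithm 1
kolda = 3 * F * NNZ                            # Tensor Toolbox
kang = 5 * F * NNZ                             # Kang et al.

assert proposed < kolda < kang
```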

SLIDE 11

Multi-way tensor compression

  • Multiply (every slab of) X from the I-mode with U^T, from the J-mode with V^T, and from the K-mode with W^T, where U is I × L, V is J × M, and W is K × N, with L ≤ I, M ≤ J, N ≤ K and LMN ≪ IJK
  • Sidiropoulos et al., IEEE SPL ’12: if the columns of A, B, C are sparse, the LRT of the big tensor can be recovered from the LRT of the small tensor, under certain conditions

SLIDE 12

PARACOMP: PArallel RAndomly COMPressed Cubes

  • Sidiropoulos et al., IEEE SPM Sep. ’14 (SI on SP for Big Data)
  • Guaranteed identifiability of the big tensor's LRT from the small tensors' LRTs, for sparse or dense factors and data
  • Distributed storage, naturally parallel; overall complexity/storage gains scale as O(IJ/F), for F ≤ I ≤ J ≤ K

SLIDE 13

Multi-way tensor compression: Computational aspects

The compressed tensor Yp ∈ R^{Lp×Mp×Np} can be computed as

Yp(l, m, n) = Σ_{i=1}^{I} Σ_{j=1}^{J} Σ_{k=1}^{K} Up(l, i) Vp(m, j) Wp(n, k) X(i, j, k),
∀ l ∈ {1, . . . , Lp}, m ∈ {1, . . . , Mp}, n ∈ {1, . . . , Np}

  • On the bright side, this can be performed ‘in place’, and can exploit sparsity by summing only over the non-zero elements of X
  • On the other hand, complexity is O(LMN·IJK) for a dense tensor, and O(LMN·NNZ) for a sparse tensor
  • A bad Up, Vp, Wp memory access pattern (esp. for sparse X) can bog down the computation
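The ‘scalar computation’ can be sketched as a loop over the non-zero entries of X, each contributing a rank-one L × M × N update (a toy sketch; variable names and sizes are mine):

```python
import numpy as np

rng = np.random.default_rng(3)
I, J, K = 6, 7, 8
L, M, N = 3, 4, 2
X = rng.standard_normal((I, J, K))
X[rng.random((I, J, K)) < 0.7] = 0.0          # make X sparse
U = rng.standard_normal((I, L))                # compression matrices
V = rng.standard_normal((J, M))
W = rng.standard_normal((K, N))

# Scalar computation: each nonzero X(i,j,k) adds an outer product, O(LMN·NNZ)
Y = np.zeros((L, M, N))
for i, j, k in zip(*np.nonzero(X)):
    Y += X[i, j, k] * np.einsum('l,m,n->lmn', U[i], V[j], W[k])

# Reference: the full triple-sum contraction
assert np.allclose(Y, np.einsum('il,jm,kn,ijk->lmn', U, V, W, X))
```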

SLIDE 14

Multi-way tensor compression: Computational aspects

Alternative computation schedule:

T1(l, j, k) = Σ_{i=1}^{I} U(l, i) X(i, j, k), ∀ l ∈ {1, . . . , Lp}, j ∈ {1, . . . , J}, k ∈ {1, . . . , K}   (1)

T2(l, m, k) = Σ_{j=1}^{J} V(m, j) T1(l, j, k), ∀ l ∈ {1, . . . , Lp}, m ∈ {1, . . . , Mp}, k ∈ {1, . . . , K}   (2)

Y(l, m, n) = Σ_{k=1}^{K} W(n, k) T2(l, m, k), ∀ l ∈ {1, . . . , Lp}, m ∈ {1, . . . , Mp}, n ∈ {1, . . . , Np}   (3)
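The schedule (1)-(3) is three successive mode products; in numpy it is three einsum calls (a sketch; the subscript convention is mine), and the result matches the direct triple-sum contraction:

```python
import numpy as np

rng = np.random.default_rng(4)
I, J, K = 6, 7, 8
L, M, N = 3, 4, 2
X = rng.standard_normal((I, J, K))
U = rng.standard_normal((L, I))    # mode-1 compression, as U(l, i) in (1)
V = rng.standard_normal((M, J))
W = rng.standard_normal((N, K))

T1 = np.einsum('li,ijk->ljk', U, X)    # (1): O(L·IJK) flops
T2 = np.einsum('mj,ljk->lmk', V, T1)   # (2): O(M·LJK) flops
Y = np.einsum('nk,lmk->lmn', W, T2)    # (3): O(N·LMK) flops

# Same result as the one-shot triple sum, at far lower cost
assert np.allclose(Y, np.einsum('li,mj,nk,ijk->lmn', U, V, W, X))
```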

SLIDE 15

Multi-way tensor compression: Computational aspects

  • Complexity O(L·IJK + M·LJK + N·LMK) for a dense tensor, as opposed to O(LMN·IJK)
  • Compress the I mode first, followed by the J mode, and finally the K mode
  • Lower complexity comes from storing intermediate results in the tensors T1 and T2, instead of recomputing them multiple times
  • For I ≤ J ≤ K with I/L = J/M = K/N, it is advantageous to perform the multiplications in the order shown in (1)-(3), i.e., compress the smallest uncompressed mode first
  • But ... very large intermediate memory, and sparsity is lost after the first multiplication - T1 and T2 are, in general, dense, potentially requiring even more memory than the original tensor!
  • Tradeoff between complexity and memory savings, with or without sparsity

SLIDE 16

Our second contribution: Algorithm 4

Algorithm 4: Compression of X into Yp
1: for k = 1, . . . , K do
2:   T′2 ← 0
3:   for b = 1, B + 1, 2B + 1, 3B + 1, . . . , J do
4:     T′1 ← Up^T X(:, b : (b + B − 1), k)
5:     T′2 ← T′2 + T′1 Vp(b : (b + B − 1), :)
6:   end for
7:   for n = 1, . . . , Np do
8:     Yp(:, :, n) ← Yp(:, :, n) + Wp(n, k) T′2
9:   end for
10: end for

Same flop count as the three-step sequential mode-product approach for dense tensors, and a better flop count than the ‘scalar computation’ for sparse tensors, but requires only limited intermediate memory, in the form of T′1 ∈ R^{L×B} and T′2 ∈ R^{L×M}, for any choice of B ≤ J.

The choice of B does not affect the computational complexity; choose B > 1 for better cache utilization.
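A numpy transcription of Algorithm 4 (my sketch; 0-based indexing, `Bblk` standing in for the block size B, and the remainder block handled by slicing), verified against a direct contraction:

```python
import numpy as np

rng = np.random.default_rng(5)
I, J, K = 6, 9, 5
L, M, N = 3, 4, 2
Bblk = 4                                  # block size B <= J; need not divide J
X = rng.standard_normal((I, J, K))
Up = rng.standard_normal((I, L))
Vp = rng.standard_normal((J, M))
Wp = rng.standard_normal((K, N))

Yp = np.zeros((L, M, N))
for k in range(K):                        # Step 1
    T2 = np.zeros((L, M))                 # Step 2
    for b in range(0, J, Bblk):           # Step 3
        T1 = Up.T @ X[:, b:b + Bblk, k]   # Step 4: L x B block
        T2 += T1 @ Vp[b:b + Bblk, :]      # Step 5: accumulate L x M
    for n in range(N):                    # Step 7
        Yp[:, :, n] += Wp[k, n] * T2      # Step 8

# Reference: full multi-way compression via one contraction
assert np.allclose(Yp, np.einsum('il,jm,kn,ijk->lmn', Up, Vp, Wp, X))
```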

SLIDE 17

How Algorithm 4 exploits sparsity

  • The matrix multiplication in Step 4 can fully exploit sparsity in the tensor data
  • If the empty columns of T1 can be identified, those columns need not participate in the update of T2 in Step 5. Hence, the overall complexity for sparse data is O(L·NNZ + LM·NNZ2 + LMNK). Note that the ‘scalar computation’ has higher complexity, O(LMN·NNZ), since NNZ > K, NNZ2
  • Hence, Algorithm 4 exploits sparsity better than the method of looping over all the non-zero elements of X
  • Despite this, sparsity is only fully exploited while compressing the first mode; only fully empty columns remaining after compressing the first mode can be exploited while compressing the second mode
  • Tempting to believe that further optimizations of Algorithm 4 for sparse data are possible

SLIDE 18

Algorithm 4: Parallelization

  • ∃ more than one way of parallelizing Algorithm 4. Parallelizing the P replicas over P threads requires that each thread access the entire tensor once
  • However, if the parallelization is done over the for loop in Step 1 of Algorithm 4, i.e., over as many as K threads, with each thread handling p = 1, . . . , P, only a “slice” of the tensor data, X(:, :, k), is required at each thread
  • As with Algorithms 1, 2, and 3, this architecture favors situations where the tensor data is stored in a distributed fashion, with only a portion of the data locally available to each thread
  • Algorithm 4 generalizes to tensors with more than 3 modes: the computations in Steps 4 and 5 can compress a pair of modes at a time, while the computation in Step 8 handles the last mode when the number of modes is odd. The main result - that the intermediate memory required is typically less than the final result - continues to hold
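The slice-level parallelization can be sketched with a thread pool: each worker touches only X(:, :, k) and returns a partial compressed tensor, and the partials are then summed (reduced). This is my illustrative sketch, not the authors' HPC implementation:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

rng = np.random.default_rng(6)
I, J, K = 6, 7, 8
L, M, N = 3, 4, 2
X = rng.standard_normal((I, J, K))
U = rng.standard_normal((I, L))
V = rng.standard_normal((J, M))
W = rng.standard_normal((K, N))

def compress_slice(k):
    """Worker: needs only slice X(:, :, k); returns an L x M x N partial result."""
    T2 = U.T @ X[:, :, k] @ V                 # compress modes 1 and 2
    return np.einsum('lm,n->lmn', T2, W[k])   # scale by W(k, :) for mode 3

with ThreadPoolExecutor(max_workers=4) as pool:
    Y = sum(pool.map(compress_slice, range(K)))  # reduction over slices

assert np.allclose(Y, np.einsum('il,jm,kn,ijk->lmn', U, V, W, X))
```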

SLIDE 19

Web, papers, software, credits

Nikos Sidiropoulos, UMN http://www.ece.umn.edu/~nikos/

Signal and Tensor Analytics Research (STAR) group https://sites.google.com/a/umn.edu/nikosgroup/home

George Karypis, UMN http://glaros.dtc.umn.edu/gkhome/index.php

Sponsor

NSF-NIH/BIGDATA: Big Tensor Mining: Theory, Scalable Algorithms and Applications, Nikos Sidiropoulos, George Karypis (UMN), Christos Faloutsos, Tom Mitchell (CMU), NSF IIS-1247489/1247632.
