
SLIDE 1

Memory-Efficient Parallel Computation of Tensor and Matrix Products for Big Tensor Decomposition

  • N. Ravindran∗, N.D. Sidiropoulos∗, S. Smith†, and G. Karypis†

∗ Dept. of ECE & DTC, † Dept. of CSci & DTC

University of Minnesota, Minneapolis

Asilomar Conf. on Signals, Systems, and Computers, Nov. 3-5, 2014

Ravindran et al Memory-Efficient Parallel Tensor-Matrix Products Asilomar CSSC 2014 1 / 1

SLIDE 2

Outline

  • Rank decomposition for tensors - PARAFAC/CANDECOMP (CP), ALS
  • Computational bottleneck for big tensors: unfolded tensor data times Khatri-Rao matrix product
  • Prior work
  • Proposed memory- and computation-efficient algorithms
  • Review of recent randomized tensor compression results: identifiability, PARACOMP
  • Memory- and computation-efficient algorithms for multi-way tensor compression
  • Parallelization and high-performance computing optimization (underway)

SLIDE 3

Rank decomposition for tensors

Sum of outer products:

X = Σ_{f=1}^{F} a_f ◦ b_f ◦ c_f   (†)

with A, I × F, holding {a_f}_{f=1}^{F}; B, J × F, holding {b_f}_{f=1}^{F}; and C, K × F, holding {c_f}_{f=1}^{F}

X(:, :, k) := k-th I × J matrix “slice” of X
X(3)^T := IJ × K matrix whose k-th column is vec(X(:, :, k))
Similarly, X(1)^T is the JK × I unfolding, and X(2)^T the IK × J unfolding

Equivalent ways to write (†):

X(1)^T = (C ⊙ B)A^T ⟺ X(2)^T = (C ⊙ A)B^T ⟺ X(3)^T = (B ⊙ A)C^T ⟺ vec(X(3)^T) = (C ⊙ B ⊙ A)1
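As a sanity check on the unfolding identities, here is a small numpy sketch (variable names chosen to mirror the slide, not the authors' code) that builds a rank-F tensor from (†) and verifies X(3)^T = (B ⊙ A)C^T:

```python
import numpy as np

rng = np.random.default_rng(0)
I, J, K, F = 4, 5, 6, 3
A = rng.standard_normal((I, F))
B = rng.standard_normal((J, F))
C = rng.standard_normal((K, F))

# X = sum_f a_f ◦ b_f ◦ c_f  -- the rank decomposition (†)
X = np.einsum('if,jf,kf->ijk', A, B, C)

def khatri_rao(P, Q):
    """Column-wise Kronecker product: column f is kron(P[:, f], Q[:, f])."""
    return np.einsum('pf,qf->pqf', P, Q).reshape(-1, P.shape[1])

# X(3)^T: IJ x K matrix whose k-th column is vec(X(:, :, k)) (column-major vec)
X3T = np.stack([X[:, :, k].reshape(-1, order='F') for k in range(K)], axis=1)

# Verify the matricized form X(3)^T = (B ⊙ A) C^T
assert np.allclose(X3T, khatri_rao(B, A) @ C.T)
```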

SLIDE 4

Alternating Least Squares (ALS)

Multilinear least squares:

min_{A,B,C} ||X(1)^T − (C ⊙ B)A^T||_F^2

Nonconvex, in fact NP-hard even for F = 1. Alternating least squares, using

X(1)^T = (C ⊙ B)A^T ⟺ X(2)^T = (C ⊙ A)B^T ⟺ X(3)^T = (B ⊙ A)C^T

reduces to

C = X(3)(B ⊙ A)(B^T B ∗ A^T A)†
A = X(1)(C ⊙ B)(C^T C ∗ B^T B)†
B = X(2)(C ⊙ A)(C^T C ∗ A^T A)†
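A minimal numpy sketch of the three ALS updates (illustrative only; `khatri_rao` and the unfolding orderings are my own conventions, not code from the paper). With exact rank-F data and the other two factors held at their true values, each pseudo-inverse update recovers the remaining factor exactly:

```python
import numpy as np

rng = np.random.default_rng(1)
I, J, K, F = 4, 5, 6, 3
A0 = rng.standard_normal((I, F))
B0 = rng.standard_normal((J, F))
C0 = rng.standard_normal((K, F))
X = np.einsum('if,jf,kf->ijk', A0, B0, C0)   # exact rank-F tensor

def khatri_rao(P, Q):
    """Column-wise Kronecker product: column f is kron(P[:, f], Q[:, f])."""
    return np.einsum('pf,qf->pqf', P, Q).reshape(-1, P.shape[1])

# Unfoldings, with columns ordered so that X1 = A (C ⊙ B)^T, etc.
X1 = X.transpose(0, 2, 1).reshape(I, K * J)
X2 = X.transpose(1, 2, 0).reshape(J, K * I)
X3 = X.transpose(2, 1, 0).reshape(K, J * I)

# The three updates; with noiseless data and true factors, each is exact
C = X3 @ khatri_rao(B0, A0) @ np.linalg.pinv((B0.T @ B0) * (A0.T @ A0))
A = X1 @ khatri_rao(C0, B0) @ np.linalg.pinv((C0.T @ C0) * (B0.T @ B0))
B = X2 @ khatri_rao(C0, A0) @ np.linalg.pinv((C0.T @ C0) * (A0.T @ A0))
assert np.allclose(A, A0) and np.allclose(B, B0) and np.allclose(C, C0)
```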

SLIDE 5

Core computation

  • Computation and inversion of (C^T C ∗ B^T B) is relatively easy: the K × F and J × F factors are relatively small, the matrix to invert is F × F, and F is usually small
  • Direct computation of, say, X(1)(C ⊙ B) requires O(JKF) memory to store C ⊙ B, in addition to O(NNZ) memory to store the tensor data, where NNZ is the number of non-zero elements of the tensor X
  • Further, JKF flops are required to compute C ⊙ B, and JKF + 2F·NNZ flops to compute its product with X(1)
  • Bottleneck is computing X(1)(C ⊙ B); likewise X(2)(C ⊙ A) and X(3)(B ⊙ A)
  • The entire X must be accessed for each of these computations in every ALS iteration, incurring large data-transport costs
  • The memory access pattern of the tensor data differs across the three computations, making efficient block caching very difficult
  • ‘Solution’: replicate the data three times in main (fast) memory :-(
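To make the memory issue concrete, this toy snippet (sizes invented for illustration) materializes C ⊙ B explicitly; its footprint grows as J·K·F regardless of how sparse X is:

```python
import numpy as np

J, K, F = 200, 300, 10
B = np.ones((J, F))
C = np.ones((K, F))

# Explicit Khatri-Rao product: a dense JK x F intermediate
KR = np.einsum('kf,jf->kjf', C, B).reshape(K * J, F)
assert KR.shape == (K * J, F)

# At scale, the intermediate alone dominates: e.g. J = K = 10**4, F = 50
# needs 8 * J * K * F bytes = 40 GB in float64, before touching X at all
bytes_needed = 8 * 10**4 * 10**4 * 50
assert bytes_needed == 40 * 10**9
```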

SLIDE 6

Prior work

  • Tensor Toolbox [Kolda et al., 2008-] has explicit support for sparse tensors and avoids intermediate data explosion by ‘accumulating’ tensor-matrix operations. It computes X(1)(C ⊙ B) with 3F·NNZ flops using NNZ intermediate memory (on top of that required to store the tensor), but does not provision for efficient parallelization (the accumulation step must be performed serially)
  • Kang et al., 2012, computes X(1)(C ⊙ B) with 5F·NNZ flops using O(max(J + NNZ, K + NNZ)) intermediate memory; in return, it admits a parallel MapReduce implementation
  • Room for considerable improvement in memory- and computation-efficiency, esp. for high-performance parallel computing architectures

SLIDE 7

Our first contribution: Suite of three algorithms

Algorithm 1: Output: M1 ← X(1)(C ⊙ B) ∈ R^{I×F}
1: M1 ← 0
2: for k = 1, . . . , K do
3:   M1 ← M1 + X(:, :, k) B diag(C(k, :))
4: end for

Algorithm 2: Output: M2 ← X(2)(C ⊙ A) ∈ R^{J×F}
1: M2 ← 0
2: for k = 1, . . . , K do
3:   M2 ← M2 + X(:, :, k)^T A diag(C(k, :))
4: end for

Algorithm 3: Output: M3 ← X(3)(B ⊙ A) ∈ R^{K×F}
1: for k = 1, . . . , K do
2:   M3(k, :) ← 1^T (A ∗ (X(:, :, k) B))
3: end for
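A direct numpy transcription of the three slice-wise algorithms (my own sketch of the pseudocode, with diag(C(k, :)) realized as column-wise scaling), checked against the unfolded products computed by brute force:

```python
import numpy as np

rng = np.random.default_rng(2)
I, J, K, F = 4, 5, 6, 3
X = rng.standard_normal((I, J, K))
A = rng.standard_normal((I, F))
B = rng.standard_normal((J, F))
C = rng.standard_normal((K, F))

# Algorithm 1: M1 = X(1)(C ⊙ B), accumulated slice by slice
M1 = np.zeros((I, F))
for k in range(K):
    M1 += (X[:, :, k] @ B) * C[k]          # X(:,:,k) B diag(C(k,:))

# Algorithm 2: M2 = X(2)(C ⊙ A), same data access pattern, transposed slice
M2 = np.zeros((J, F))
for k in range(K):
    M2 += (X[:, :, k].T @ A) * C[k]

# Algorithm 3: M3 = X(3)(B ⊙ A), one row of the result per slice
M3 = np.zeros((K, F))
for k in range(K):
    M3[k] = (A * (X[:, :, k] @ B)).sum(axis=0)   # 1^T (A ∗ X(:,:,k)B)

# Brute-force references
assert np.allclose(M1, np.einsum('ijk,jf,kf->if', X, B, C))
assert np.allclose(M2, np.einsum('ijk,if,kf->jf', X, A, C))
assert np.allclose(M3, np.einsum('ijk,if,jf->kf', X, A, B))
```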

SLIDE 8

Features

  • Essentially no additional intermediate memory needed - the updates of A, B, and C can effectively be performed in place
  • Computational complexity savings relative to Kang et al.; similar or better (depending on the pattern of nonzeros) than Kolda et al.
  • Algorithms 1, 2, and 3 share the same tensor data access pattern, enabling efficient orderly block caching / pre-fetching if the tensor is stored in slower / serially read memory, without the need for three-fold replication (→ asymmetry between Algorithms 1, 2 and Algorithm 3)
  • The loops can be parallelized across K threads, where each thread only requires access to an I × J slice of the tensor. This favors parallel computation and distributed storage

SLIDE 9

Computational complexity of Algorithm 1

Algorithm 1: Output: M1 ← X(1)(C ⊙ B) ∈ R^{I×F}
1: M1 ← 0
2: for k = 1, . . . , K do
3:   M1 ← M1 + X(:, :, k) B diag(C(k, :))
4: end for

Let I_k be the number of non-empty rows and J_k the number of non-empty columns of X(:, :, k), and define

NNZ1 := Σ_{k=1}^{K} I_k,  NNZ2 := Σ_{k=1}^{K} J_k

Assume that empty rows and columns of X(:, :, k) can be identified offline and skipped during the matrix multiplication and the update of M1.
Note: only the rows of B corresponding to non-empty columns of X(:, :, k) need to be scaled by diag(C(k, :)); this takes F·J_k flops per slice, for a total of F·NNZ2.

SLIDE 10

Computational complexity of Algorithm 1, continued

Algorithm 1: Output: M1 ← X(1)(C ⊙ B) ∈ R^{I×F}
1: M1 ← 0
2: for k = 1, . . . , K do
3:   M1 ← M1 + X(:, :, k) B diag(C(k, :))
4: end for

Next, the multiplications X(:, :, k) B diag(C(k, :)) can be carried out for all k at 2F·NNZ flops (counting additions and multiplications).
Finally, only the rows of M1 corresponding to non-zero rows of X(:, :, k) need to be updated; each row update costs F flops, since X(:, :, k) B diag(C(k, :)) has F columns, so the M1 row updates cost F·NNZ1 flops in total.
Overall: F·NNZ1 + F·NNZ2 + 2F·NNZ flops. Kang: 5F·NNZ; Kolda: 3F·NNZ. Note NNZ > NNZ1, NNZ2.
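Plugging in some illustrative numbers (hypothetical values, chosen only to make the comparison tangible) shows the ordering of the three flop counts whenever NNZ1, NNZ2 < NNZ:

```python
# Flop counts for computing X(1)(C ⊙ B) on a sparse tensor
# (NNZ, NNZ1, NNZ2 are made-up illustrative values with NNZ1, NNZ2 < NNZ)
F = 10
NNZ = 10**6
NNZ1 = NNZ2 = 10**5

proposed = F * NNZ1 + F * NNZ2 + 2 * F * NNZ   # Algorithm 1
kolda = 3 * F * NNZ                            # Tensor Toolbox
kang = 5 * F * NNZ                             # Kang et al.

assert proposed < kolda < kang
```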

SLIDE 11

Multi-way tensor compression

  • Multiply (every slab of) X from the I-mode with U^T, from the J-mode with V^T, and from the K-mode with W^T, where U is I × L, V is J × M, and W is K × N, with L ≤ I, M ≤ J, N ≤ K and LMN ≪ IJK
  • Sidiropoulos et al., IEEE SPL ’12: if the columns of A, B, C are sparse, the LRT of the big tensor can be recovered from the LRT of the small tensor, under certain conditions

SLIDE 12

PARACOMP: PArallel RAndomly COMPressed Cubes

  • Sidiropoulos et al., IEEE SPM Sep. ’14 (SI on SP for Big Data)
  • Guaranteed identifiability of the big tensor's LRT from the small tensors' LRTs, for sparse or dense factors and data
  • Distributed storage, naturally parallel; overall complexity/storage gains scale as O(IJ/F), for F ≤ I ≤ J ≤ K

SLIDE 13

Multi-way tensor compression: Computational aspects

The compressed tensor Yp ∈ R^{Lp×Mp×Np} can be computed as

Yp(l, m, n) = Σ_{i=1}^{I} Σ_{j=1}^{J} Σ_{k=1}^{K} Up(l, i) Vp(m, j) Wp(n, k) X(i, j, k),
∀ l ∈ {1, . . . , Lp}, m ∈ {1, . . . , Mp}, n ∈ {1, . . . , Np}

  • On the bright side, this can be performed ‘in place’, and can exploit sparsity by summing only over the non-zero elements of X
  • On the other hand, complexity is O(LMN·IJK) for a dense tensor, and O(LMN·NNZ) for a sparse tensor
  • A bad Up, Vp, Wp memory access pattern (esp. for sparse X) can bog down the computation
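The ‘scalar computation’ can be sketched as a loop over the non-zero entries of X, each contributing a rank-one L × M × N update (a toy sketch; variable names and sizes are mine):

```python
import numpy as np

rng = np.random.default_rng(3)
I, J, K = 6, 7, 8
L, M, N = 3, 4, 2
X = rng.standard_normal((I, J, K))
X[rng.random((I, J, K)) < 0.7] = 0.0          # make X sparse
U = rng.standard_normal((I, L))                # compression matrices
V = rng.standard_normal((J, M))
W = rng.standard_normal((K, N))

# Scalar computation: each nonzero X(i,j,k) adds an outer product, O(LMN·NNZ)
Y = np.zeros((L, M, N))
for i, j, k in zip(*np.nonzero(X)):
    Y += X[i, j, k] * np.einsum('l,m,n->lmn', U[i], V[j], W[k])

# Reference: the full triple-sum contraction
assert np.allclose(Y, np.einsum('il,jm,kn,ijk->lmn', U, V, W, X))
```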

SLIDE 14

Multi-way tensor compression: Computational aspects

Alternative computation schedule:

T1(l, j, k) = Σ_{i=1}^{I} U(l, i) X(i, j, k), ∀ l ∈ {1, . . . , Lp}, j ∈ {1, . . . , J}, k ∈ {1, . . . , K}   (1)

T2(l, m, k) = Σ_{j=1}^{J} V(m, j) T1(l, j, k), ∀ l ∈ {1, . . . , Lp}, m ∈ {1, . . . , Mp}, k ∈ {1, . . . , K}   (2)

Y(l, m, n) = Σ_{k=1}^{K} W(n, k) T2(l, m, k), ∀ l ∈ {1, . . . , Lp}, m ∈ {1, . . . , Mp}, n ∈ {1, . . . , Np}   (3)
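The schedule (1)-(3) is three successive mode products; in numpy it is three einsum calls (a sketch; the subscript convention is mine), and the result matches the direct triple-sum contraction:

```python
import numpy as np

rng = np.random.default_rng(4)
I, J, K = 6, 7, 8
L, M, N = 3, 4, 2
X = rng.standard_normal((I, J, K))
U = rng.standard_normal((L, I))    # mode-1 compression, as U(l, i) in (1)
V = rng.standard_normal((M, J))
W = rng.standard_normal((N, K))

T1 = np.einsum('li,ijk->ljk', U, X)    # (1): O(L·IJK) flops
T2 = np.einsum('mj,ljk->lmk', V, T1)   # (2): O(M·LJK) flops
Y = np.einsum('nk,lmk->lmn', W, T2)    # (3): O(N·LMK) flops

# Same result as the one-shot triple sum, at far lower cost
assert np.allclose(Y, np.einsum('li,mj,nk,ijk->lmn', U, V, W, X))
```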

SLIDE 15

Multi-way tensor compression: Computational aspects

  • Complexity O(L·IJK + M·LJK + N·LMK) for a dense tensor, as opposed to O(LMN·IJK)
  • Compress the I mode first, followed by the J mode, and finally the K mode
  • Lower complexity comes from storing intermediate results in the tensors T1 and T2, instead of recomputing them multiple times
  • For I ≤ J ≤ K with I/L = J/M = K/N, it is advantageous to perform the multiplications in the order shown in (1)-(3), i.e., compress the smallest uncompressed mode first
  • But ... very large intermediate memory, and sparsity is lost after the first multiplication - T1 and T2 are, in general, dense, potentially requiring even more memory than the original tensor!
  • Tradeoff between complexity and memory savings, with or without sparsity

SLIDE 16

Our second contribution: Algorithm 4

Algorithm 4: Compression of X into Yp
1: for k = 1, . . . , K do
2:   T′2 ← 0
3:   for b = 1, B + 1, 2B + 1, 3B + 1, . . . , J do
4:     T′1 ← Up^T X(:, b : (b + B − 1), k)
5:     T′2 ← T′2 + T′1 Vp(b : (b + B − 1), :)
6:   end for
7:   for n = 1, . . . , Np do
8:     Yp(:, :, n) ← Yp(:, :, n) + Wp(n, k) T′2
9:   end for
10: end for

Same flop count as the three-step sequential mode-product approach for dense tensors, and a better flop count than the ‘scalar computation’ for sparse tensors, but requires only limited intermediate memory, in the form of T′1 ∈ R^{L×B} and T′2 ∈ R^{L×M}, for any choice of B ≤ J.

The choice of B does not affect the computational complexity; choose B > 1 for better cache utilization.
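A numpy transcription of Algorithm 4 (my sketch; 0-based indexing, `Bblk` standing in for the block size B, and the remainder block handled by slicing), verified against a direct contraction:

```python
import numpy as np

rng = np.random.default_rng(5)
I, J, K = 6, 9, 5
L, M, N = 3, 4, 2
Bblk = 4                                  # block size B <= J; need not divide J
X = rng.standard_normal((I, J, K))
Up = rng.standard_normal((I, L))
Vp = rng.standard_normal((J, M))
Wp = rng.standard_normal((K, N))

Yp = np.zeros((L, M, N))
for k in range(K):                        # Step 1
    T2 = np.zeros((L, M))                 # Step 2
    for b in range(0, J, Bblk):           # Step 3
        T1 = Up.T @ X[:, b:b + Bblk, k]   # Step 4: L x B block
        T2 += T1 @ Vp[b:b + Bblk, :]      # Step 5: accumulate L x M
    for n in range(N):                    # Step 7
        Yp[:, :, n] += Wp[k, n] * T2      # Step 8

# Reference: full multi-way compression via one contraction
assert np.allclose(Yp, np.einsum('il,jm,kn,ijk->lmn', Up, Vp, Wp, X))
```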

SLIDE 17

How Algorithm 4 exploits sparsity

  • The matrix multiplication in Step 4 can fully exploit sparsity in the tensor data
  • If the empty columns of T1 can be identified, those columns need not participate in the update of T2 in Step 5. Hence, the overall complexity for sparse data is O(L·NNZ + LM·NNZ2 + LMNK). Note that the ‘scalar computation’ has higher complexity, O(LMN·NNZ), since NNZ > K, NNZ2
  • Hence, Algorithm 4 exploits sparsity better than the method of looping over all the non-zero elements of X
  • Despite this, sparsity is only fully exploited while compressing the first mode; only fully empty columns remaining after compressing the first mode can be exploited while compressing the second mode
  • Tempting to believe that further optimizations of Algorithm 4 for sparse data are possible

SLIDE 18

Algorithm 4: Parallelization

  • ∃ more than one way of parallelizing Algorithm 4. Parallelizing the P replicas over P threads requires that each thread access the entire tensor once
  • However, if the parallelization is done over the for loop in Step 1 of Algorithm 4, i.e., over as many as K threads, with each thread handling p = 1, . . . , P, only a “slice” of the tensor data, X(:, :, k), is required at each thread
  • As with Algorithms 1, 2, and 3, this architecture favors situations where the tensor data is stored in a distributed fashion, with only a portion of the data locally available to each thread
  • Algorithm 4 generalizes to tensors with more than 3 modes: the computations in Steps 4 and 5 can compress a pair of modes at a time, while the computation in Step 8 handles the last mode when the number of modes is odd. The main result - that the intermediate memory required is typically less than the final result - continues to hold
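The slice-level parallelization can be sketched with a thread pool: each worker touches only X(:, :, k) and returns a partial compressed tensor, and the partials are then summed (reduced). This is my illustrative sketch, not the authors' HPC implementation:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

rng = np.random.default_rng(6)
I, J, K = 6, 7, 8
L, M, N = 3, 4, 2
X = rng.standard_normal((I, J, K))
U = rng.standard_normal((I, L))
V = rng.standard_normal((J, M))
W = rng.standard_normal((K, N))

def compress_slice(k):
    """Worker: needs only slice X(:, :, k); returns an L x M x N partial result."""
    T2 = U.T @ X[:, :, k] @ V                 # compress modes 1 and 2
    return np.einsum('lm,n->lmn', T2, W[k])   # scale by W(k, :) for mode 3

with ThreadPoolExecutor(max_workers=4) as pool:
    Y = sum(pool.map(compress_slice, range(K)))  # reduction over slices

assert np.allclose(Y, np.einsum('il,jm,kn,ijk->lmn', U, V, W, X))
```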

SLIDE 19

Web, papers, software, credits

Nikos Sidiropoulos, UMN http://www.ece.umn.edu/~nikos/

Signal and Tensor Analytics Research (STAR) group https://sites.google.com/a/umn.edu/nikosgroup/home

George Karypis, UMN http://glaros.dtc.umn.edu/gkhome/index.php

Sponsor

NSF-NIH/BIGDATA: Big Tensor Mining: Theory, Scalable Algorithms and Applications, Nikos Sidiropoulos, George Karypis (UMN), Christos Faloutsos, Tom Mitchell (CMU), NSF IIS-1247489/1247632.
