SLIDE 1

On Optimizing Distributed Tucker Decomposition for Sparse Tensors

Venkatesan Chakaravarthy, Jee W. Choi, Douglas Joseph, Prakash Murali, Shivmaran S Pandian, Yogish Sabharwal, and Dheeraj Sreedhar IBM Research

SLIDE 2

Tensors

Matrix vs. tensor: a matrix indexed by Authors × Conference becomes a tensor when a third mode is added, e.g. Authors × Conference × Year.

Other examples: images and video; Amazon reviews (Users × Reviews × Keywords) [LM'13].

More examples:
  • NELL knowledge base: Entity × Relation × Entity [CBK+ 13]
  • Enron emails: Sender × Receiver × Word × Date [AS '04]
  • Flickr: User × Image × Tag × Date [GSS '08]
  • Delicious: User × Page × Tag × Date [GSS '08]

SLIDE 3

Singular Value Decomposition and Tucker Decomposition (Higher Order SVD)

SVD of a matrix: A ≈ U Σ Vᵀ, where U contains the left singular vectors, V the right singular vectors, and Σ = diag(σ1, σ2, …, σr) the singular values.

Tucker decomposition is the higher-order analogue of the SVD for tensors.

Applications

  • PCA: Analyze different dimensions
  • Text analytics
  • Computer vision
  • Signal processing

Tucker decomposition: an L1 × L2 × L3 tensor is approximated by a small K1 × K2 × K3 core tensor multiplied along each mode by a factor matrix.
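To make the analogy concrete, here is a minimal truncated-HOSVD sketch in Python/NumPy; the shapes, ranks, and helper names (`unfold`, `ttm`, `hosvd`) are illustrative choices, not taken from the paper.

```python
import numpy as np

def unfold(T, mode):
    """Matricize a 3-way tensor along `mode` (rows indexed by that mode)."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def ttm(T, U, mode):
    """Tensor-times-matrix: multiply the matrix U along the given mode of T."""
    return np.moveaxis(np.tensordot(U, T, axes=(1, mode)), 0, mode)

def hosvd(T, ranks):
    """Truncated HOSVD: leading left singular vectors of each unfolding, then project."""
    factors = []
    for mode, k in enumerate(ranks):
        u, _, _ = np.linalg.svd(unfold(T, mode), full_matrices=False)
        factors.append(u[:, :k])          # L_mode x K_mode factor matrix
    core = T
    for mode, u in enumerate(factors):
        core = ttm(core, u.T, mode)       # project down to the K1 x K2 x K3 core
    return core, factors

# Toy example: compress an L1 x L2 x L3 tensor to a K1 x K2 x K3 core.
T = np.random.rand(8, 9, 10)
core, factors = hosvd(T, ranks=(3, 4, 5))
approx = core
for mode, u in enumerate(factors):
    approx = ttm(approx, u, mode)         # expand back to the original shape
print(core.shape, np.linalg.norm(T - approx) / np.linalg.norm(T))
```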

SLIDE 4

Higher Order Orthogonal Iteration (HOOI)

One HOOI refinement step takes the tensor T and the current factor matrices A, B, C and produces improved factors A-new, B-new, C-new.

  • Refinement: improves accuracy
  • Applied multiple times to obtain increasing accuracy
  • Initial decomposition – obtained via HOSVD, STHOSVD, or random matrices
SLIDE 5

Prior Work & Objective

  • Dense tensors [BK’07, ABK’16, CCJ+ 17, JPK’ 15]
  • Sequential, shared memory and distributed implementations
  • Sparse tensors [KS’08, BMV+ 12, SK’16, KU’16]
  • Sequential, shared memory and distributed implementations
  • [KU’16]: First distributed implementation for sparse tensors

Our objective

  • Efficient distributed implementation for sparse tensors
  • Builds on work of [KU’16]

[KS’08] T. Kolda and J. Sun. 2008. Scalable tensor decompositions for multi-aspect data mining. In ICDM.
[BMV’12] M. Baskaran, B. Meister, N. Vasilache, and R. Lethin. 2012. Efficient and scalable computations with sparse tensors. In HPEC.
[SK’16] S. Smith and G. Karypis. 2017. Accelerating the Tucker Decomposition with Compressed Sparse Tensors. In Euro-Par.
[KU’16] O. Kaya and B. Uçar. 2016. High performance parallel algorithms for the Tucker decomposition of sparse tensors. In ICPP.

SLIDE 6

HOOI – Outline

Alternating least squares

  • Fix B and C. Find best A.
  • Fix A and C. Find best B.
  • Fix A and B. Find best C.

Per-mode update (mode 1 shown; modes 2 and 3 are analogous): multiply T by Bᵀ and Cᵀ via TTM, matricize the result along mode 1, and compute its SVD; the leading K1 left singular vectors form A-new. Similarly, TTM with Aᵀ and Cᵀ followed by an SVD gives B-new (K2 vectors), and TTM with Aᵀ and Bᵀ gives C-new (K3 vectors).


TTM – tensor-times-matrix multiplication
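A dense-tensor sketch of this ALS outline, for illustration only (the paper's setting is distributed and sparse); `unfold`, `ttm`, and `hooi_sweep` are names chosen for the sketch.

```python
import numpy as np

def unfold(T, mode):
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def ttm(T, U, mode):
    return np.moveaxis(np.tensordot(U, T, axes=(1, mode)), 0, mode)

def hooi_sweep(T, factors, ranks):
    """One ALS sweep: for each mode, fix the other factors and refit this one."""
    for mode in range(T.ndim):
        Y = T
        for other in range(T.ndim):
            if other != mode:
                Y = ttm(Y, factors[other].T, other)   # TTM with the fixed factors
        # SVD of the mode-wise matricization; keep the leading K_mode vectors
        u, _, _ = np.linalg.svd(unfold(Y, mode), full_matrices=False)
        factors[mode] = u[:, :ranks[mode]]            # A-new, then B-new, then C-new
    return factors

# Toy run: random orthonormal initial factors, a few refinement sweeps.
T = np.random.rand(8, 9, 10)
ranks = (3, 3, 3)
factors = [np.linalg.qr(np.random.rand(n, k))[0] for n, k in zip(T.shape, ranks)]
for _ in range(5):
    factors = hooi_sweep(T, factors, ranks)
core = T
for mode, u in enumerate(factors):
    core = ttm(core, u.T, mode)
print("captured energy:", np.linalg.norm(core) / np.linalg.norm(T))
```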

SLIDE 7

Sparse HOOI : Distribution Schemes and Performance Parameters

Coordinate representation: each nonzero element is stored as its coordinates plus its value, and the elements are distributed among the processors (Proc 1, Proc 2, Proc 3). Running example:
  • e1 = (1, 2, 1), 0.1
  • e2 = (2, 3, 2), 1.2
  • e3 = (1, 1, 2), 3.1
  • e4 = (3, 2, 1), 0.4
  • e5 = (3, 1, 2), 1.1
  • e6 = (1, 3, 2), 1.4
  • e7 = (2, 2, 2), 0.5
  • e8 = (3, 3, 1), 0.7

  • TTM Component
    • Computation only
    • All schemes have the same computational load (FLOPs)
    • Parameter: load balance
  • SVD Component
    • Both computation and communication
    • Parameters: computational load, load balance, communication volume
  • Factor Matrix transfer (FM)
    • Communication only
    • At the end of each HOOI invocation, factor matrix rows need to be communicated among processors for the next invocation
    • Parameter: communication volume

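A small sketch of the coordinate (COO) representation using the e1–e8 running example; the assignment of elements to Proc 1–3 is one illustrative choice.

```python
# Coordinate (COO) representation: each nonzero is (coordinates, value).
# The e1..e8 running example, with an illustrative 3-processor assignment.
elements = {
    "e1": ((1, 2, 1), 0.1), "e2": ((2, 3, 2), 1.2), "e3": ((1, 1, 2), 3.1),
    "e4": ((3, 2, 1), 0.4), "e5": ((3, 1, 2), 1.1), "e6": ((1, 3, 2), 1.4),
    "e7": ((2, 2, 2), 0.5), "e8": ((3, 3, 1), 0.7),
}
partition = {  # processor -> names of locally stored nonzeros
    "Proc 1": ["e1", "e2", "e3"],
    "Proc 2": ["e4", "e5", "e6"],
    "Proc 3": ["e7", "e8"],
}

# Mode-1 slices = nonzeros sharing the same first coordinate. The slices a
# processor touches drive its SVD-side load and the factor-matrix rows it needs.
for proc, names in partition.items():
    slices = sorted({elements[n][0][0] for n in names})
    print(proc, "holds", len(names), "elements, touching mode-1 slices", slices)
```
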
SLIDE 8

Prior Schemes

  • CoarseG - Coarse grained schemes [KU’16]
  • Allocate entire “slices” to processors
  • MediumG - Medium grained scheme [SK ‘16]
  • Grid based partitioning – similar to block partitioning of matrices
  • FineG - Fine grained scheme [KU’16]
  • Allocate individual elements using hypergraph partitioning methods

Scheme     TTM          SVD          FM           Dist. time
CoarseG    Inefficient  Efficient    Inefficient  Fast
MediumG    Efficient    Inefficient  Efficient    Fast
FineG      Efficient    Inefficient  Efficient    Slow

  • Distribution time
  • CoarseG, MediumG – greedy, fast procedures
  • FineG – Complex, slow procedure based on sophisticated hypergraph partitioning methods
SLIDE 9

Our Contributions

  • We identify certain fundamental metrics
  • These determine TTM load balance, SVD load and load balance, and SVD communication volume
  • We design a new distribution scheme, denoted Lite

Lite

  • Near-optimal on the fundamental metrics
  • Near-optimal on
  • TTM load balance
  • SVD load and load balance
  • SVD communication volume
  • Lightweight procedure with fast distribution time
  • Performance gain up to 3x
  • The only parameter not optimized is FM volume
  • Computation time dominates, so Lite still performs better overall

Scheme     TTM           SVD           FM           Dist. time
CoarseG    Inefficient   Efficient     Inefficient  Fast
MediumG    Efficient     Inefficient   Efficient    Fast
FineG      Efficient     Inefficient   Efficient    Slow
Lite       Near-optimal  Near-optimal  Inefficient  Fast

SLIDE 10

Mode 1 – Sequential TTM

Recap of the mode-1 update: TTM of T with Bᵀ and Cᵀ, followed by an SVD, yields A-new.

The TTM step produces the penultimate matrix M, with L1 rows. Grouping the nonzeros by their mode-1 coordinate gives slices, e.g. coordinate 1: {e1, e3, e6}, coordinate 2: {e2, e7}, coordinate 3: {e4, e5, e8}; each slice accumulates into the corresponding row of M.

Kronecker product: each element's contribution to its row of M involves a Kronecker product of rows of the other factor matrices (B and C for mode 1).

  • Slice – elements with the same mode-1 coordinate (same color in the slide's figure)
  • Each slice updates the corresponding row in the penultimate matrix
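A sequential sketch of this mode-1 sparse TTM, following my reading of the Kronecker-product remark: a nonzero at (i, j, k) with value v adds v · (row j of B ⊗ row k of C) to row i of M. Sizes and ranks are toy values.

```python
import numpy as np

L1, L2, L3 = 3, 3, 2          # tensor dimensions (toy values)
K2, K3 = 2, 2                 # ranks of the fixed factor matrices
B = np.random.rand(L2, K2)    # mode-2 factor matrix
C = np.random.rand(L3, K3)    # mode-3 factor matrix

# Sparse tensor in coordinate form ((i, j, k), value), 1-based as on the slides.
nonzeros = [((1, 2, 1), 0.1), ((2, 3, 2), 1.2), ((1, 1, 2), 3.1), ((3, 2, 1), 0.4),
            ((3, 1, 2), 1.1), ((1, 3, 2), 1.4), ((2, 2, 2), 0.5), ((3, 3, 1), 0.7)]

# Mode-1 penultimate matrix: L1 rows, K2*K3 columns.
M = np.zeros((L1, K2 * K3))
for (i, j, k), val in nonzeros:
    # Every element of mode-1 slice i accumulates into row i of M.
    M[i - 1] += val * np.kron(B[j - 1], C[k - 1])

# SVD step of HOOI: A-new = leading K1 left singular vectors of M.
K1 = 2
u, _, _ = np.linalg.svd(M, full_matrices=False)
A_new = u[:, :K1]
print(M.shape, A_new.shape)
```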

SLIDE 11

Distributed TTM

Example distribution: Proc 1 holds e1, e2, e3; Proc 2 holds e4, e5, e6; Proc 3 holds e7, e8. Each processor keeps a local copy of the penultimate matrix and accumulates the contributions of its local nonzeros into it.

The penultimate matrix M is therefore sum-distributed: the true M is the element-wise sum of the processors' local copies.
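A sketch of the sum-distributed invariant: each processor accumulates only its own nonzeros into a local copy of M, and summing the local copies recovers the sequential result. The explicit sum appears here only for verification; the distributed algorithm never materializes it. Factor-matrix sizes and the partition are toy values.

```python
import numpy as np

L1, K2, K3 = 3, 2, 2
rng = np.random.default_rng(0)
B, C = rng.random((3, K2)), rng.random((2, K3))   # fixed factor matrices (toy sizes)

nonzeros = {"e1": ((1, 2, 1), 0.1), "e2": ((2, 3, 2), 1.2), "e3": ((1, 1, 2), 3.1),
            "e4": ((3, 2, 1), 0.4), "e5": ((3, 1, 2), 1.1), "e6": ((1, 3, 2), 1.4),
            "e7": ((2, 2, 2), 0.5), "e8": ((3, 3, 1), 0.7)}
partition = {1: ["e1", "e2", "e3"], 2: ["e4", "e5", "e6"], 3: ["e7", "e8"]}

def contribution(name):
    (i, j, k), val = nonzeros[name]
    return i - 1, val * np.kron(B[j - 1], C[k - 1])

# Each processor accumulates only its local nonzeros into a local copy of M.
local_M = {}
for proc, names in partition.items():
    Mp = np.zeros((L1, K2 * K3))
    for name in names:
        row, contrib = contribution(name)
        Mp[row] += contrib
    local_M[proc] = Mp

# Sequential reference: all nonzeros accumulated in one place.
M_seq = np.zeros((L1, K2 * K3))
for name in nonzeros:
    row, contrib = contribution(name)
    M_seq[row] += contrib

# Sum-distributed invariant: summing the local copies recovers M.
print(np.allclose(sum(local_M.values()), M_seq))   # True
```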

SLIDE 12

SVD via Lanczos Method

  • The Lanczos method repeatedly supplies a vector xin and requires the product xout = Z · xin in return.
  • This product is computed implicitly over the sum-distributed penultimate matrix.

Figure: each processor applies its local copy of the penultimate matrix to xin; the partial results are combined into xout by the designated owner processors (P1, P2, P3), linking the TTM computation and the SVD computation.
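A sketch of the implicit matrix-vector product the Lanczos solver needs. Assuming Z = M Mᵀ (an assumption on my part; the slide only names Z) and M sum-distributed as M = M_1 + … + M_P, xout = Z · xin can be computed without assembling M: first y = Σ_p Mpᵀ xin, then xout = Σ_p Mp y; each sum corresponds to an all-reduce in the distributed code.

```python
import numpy as np

rng = np.random.default_rng(1)
L1, cols, P = 3, 4, 3
# Per-processor local copies of the penultimate matrix (toy random values);
# the true M is their sum and is never assembled in the distributed code.
local_M = [rng.random((L1, cols)) for _ in range(P)]

def implicit_Z_matvec(xin):
    """Return xout = (M @ M.T) @ xin using only the local copies M_p."""
    y = sum(Mp.T @ xin for Mp in local_M)    # all-reduce 1: y = M.T @ xin
    return sum(Mp @ y for Mp in local_M)     # all-reduce 2: xout = M @ y

# Check against the explicit computation with the assembled M.
M = sum(local_M)
xin = rng.random(L1)
print(np.allclose(implicit_Z_matvec(xin), M @ (M.T @ xin)))   # True
```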

SLIDE 13

Performance Metrics Along Each Mode

TTM

  • TTM-LImb (load imbalance): max number of elements assigned to any processor
  • Optimal value – E / P (number of elements / number of processors)

SVD

  • SVD-Redundancy: total number of times slices are shared across processors
  • Measures computational load and communication volume
  • Optimal value = L (the length along the mode)
  • SVD-LImb: max number of slices shared by any processor
  • Optimal value = L / P

Factor Matrix Transfer

  • Communication volume at each iteration

For the running example distribution (Proc 1: e1–e3, Proc 2: e4–e6, Proc 3: e7–e8), the mode-1 values are: TTM-LImb = 3 (optimal), SVD-Red = 6 (optimal = 3), SVD-LImb = 2 (optimal = 1).
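A sketch that computes the three mode-1 metrics for the running e1–e8 example, following my reading of the definitions above; it reproduces the values quoted on this slide.

```python
from math import ceil

# Mode-1 coordinates of e1..e8 and the 3-processor distribution from the example.
mode1_coord = {"e1": 1, "e2": 2, "e3": 1, "e4": 3, "e5": 3, "e6": 1, "e7": 2, "e8": 3}
partition = {1: ["e1", "e2", "e3"], 2: ["e4", "e5", "e6"], 3: ["e7", "e8"]}

E, P, L = len(mode1_coord), len(partition), 3

# TTM-LImb: max number of nonzeros held by any processor (optimal ~ E / P).
ttm_limb = max(len(names) for names in partition.values())

# Slices touched by each processor; a slice is "shared" by every processor
# that holds at least one of its nonzeros.
touched = {p: {mode1_coord[n] for n in names} for p, names in partition.items()}

# SVD-Redundancy: total number of (processor, slice) pairs (optimal = L).
svd_red = sum(len(s) for s in touched.values())

# SVD-LImb: max number of slices touched by any processor (optimal = L / P).
svd_limb = max(len(s) for s in touched.values())

print("TTM-LImb", ttm_limb, "optimal", ceil(E / P))   # 3, 3
print("SVD-Red ", svd_red, "optimal", L)              # 6, 3
print("SVD-LImb", svd_limb, "optimal", ceil(L / P))   # 2, 1
```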

SLIDE 14

Prior Schemes

  • Uni-policy Schemes
  • A single policy for computation along all modes
  • Only a single copy of the tensor needs to be stored
  • Smaller memory footprint, but fewer optimization opportunities
  • Multi-policy Schemes
  • Independent distribution for each mode
  • N copies need to be stored, one along each mode
  • Larger memory footprint, but more optimization opportunities
  • Coarse-grained Scheme [Kaya-Uçar '16]
  • Allocate each slice in its entirety to a processor. No slice-sharing
  • Medium-grained Scheme [Smith et al '16]
  • Grid partitioning proposed in the CP context
  • Hyper-graph Scheme (Fine-grained) [Kaya-Uçar '16]
  • All issues captured via a hypergraph
  • Find a min-cut partitioning
SLIDE 15

Our Scheme - Lite

SLIDE 16

Lite - Results

  • TTM-Limb = E / P (optimal)
  • SVD-Redundancy = L + P (optimal = L)
  • SVD-Limb = L/P + 2 (optimal = L/P)
  • Near-optimal
  • TTM-Limb
  • SVD computational load, load balance
  • SVD communication volume
  • Only issue is high factor matrix transfer volume
  • Computation dominates. So, not an issue.

Metric             CoarseG        MediumG        FineG           Lite
Type               Multi-policy   Uni-policy     Uni-policy      Multi-policy
Distribution time  Greedy (fast)  Greedy (fast)  Complex (slow)  Greedy (fast)
TTM-Limb           High           Reasonable     Reasonable      Optimal
SVD-Redundancy     Optimal        High           Reasonable      Optimal
SVD-Limb           Reasonable     Reasonable     High            Optimal
SVD-Volume         Optimal        Reasonable     Reasonable      Optimal
FM-Volume          High           Reasonable     Reasonable      High

SLIDE 17

Experimental Evaluation

  • R92 cluster – 2 to 32 nodes.
  • 16 MPI ranks per node, each mapped to a core. So, 32 to 512 MPI ranks
  • Dataset: FROSTT repository
SLIDE 18

HOOI Time – 512 ranks, K = 10

Gain (speedup of Lite over each prior scheme)

  • CoarseG – 12x
  • MediumG – 4.5x
  • HyperG – 4x
  • Best Prior – 3x
SLIDE 19

Flickr Breakup - 512 ranks, K = 10

SLIDE 20

Performance Metrics

Plots: TTM load imbalance, SVD load, and SVD load imbalance.

SLIDE 21

Strong Scaling: Speedup from 32 to 512 ranks

SLIDE 22

Tensor Distribution Time – 512 ranks

SLIDE 23

Conclusions and Future Work

LITE

  • Optimal on TTM load imbalance, SVD load, load balance and communication volume
  • Fast distribution time
  • Good scalability
  • Higher communication volume, but computation time dominates.

Future Work

  • Apply ideas from our work to the shared-memory setting
  • Apply ideas from the shared-memory setting to our implementation
  • Compressed sparse fiber representation [SK’16]