

SLIDE 1

PLANC: Parallel Low Rank Approximations with Non-negativity Constraints

Ramakrishnan Kannan, Michael Matheson, Grey Ballard, Srinivas Eswar, Koby Hayashi, Haesun Park

Ph.D. student, School of CSE, Georgia Institute of Technology. Advisors: Rich Vuduc and Haesun Park.

January 26, 2019 Workshop on Compiler Techniques for Sparse Tensor Algebra

Acknowledgement: This work was partly sponsored by NSF, Sandia and ORNL


SLIDE 2

Summary

- PLANC is an open-source, scalable, and flexible software package to compute Non-negative Tensor Factorisation (NTF).
- It implements a state-of-the-art communication-avoiding algorithm for the matricised-tensor times Khatri-Rao product (MTTKRP).
- Popular optimisation methods for Non-negative Least Squares are included, such as Block Principal Pivoting, the Alternating Direction Method of Multipliers, and first-order Nesterov methods.
- NTF is an important contributor towards explainable AI, with a wide range of applications such as spectral unmixing, scientific visualization, healthcare analytics, and topic modelling.


SLIDE 3

CP Decomposition

Matrix: $M \approx \sum_{r=1}^{R} u_r (\sigma_r v_r^T)$

Tensor: $\mathcal{X} \approx \sum_{r=1}^{R} (\lambda_r u_r) \circ v_r \circ w_r$

This is known as the CANDECOMP, PARAFAC, canonical polyadic, or CP decomposition. It approximates a tensor as a sum of outer products, i.e. rank-1 tensors.

NNCP imposes non-negativity constraints on the factor matrices to aid interpretability.
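For concreteness, here is a minimal numpy sketch (illustrative only, not PLANC code) that rebuilds a dense 3-way tensor from its CP factors:

```python
import numpy as np

def cp_reconstruct(lam, U, V, W):
    """Rebuild an I x J x K tensor from CP factors:
    the sum over r of lam[r] * U[:, r] (outer) V[:, r] (outer) W[:, r]."""
    return np.einsum('r,ir,jr,kr->ijk', lam, U, V, W)
```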


SLIDE 4

Computational Bottlenecks

The MTTKRP is the major bottleneck for NNCP:

$$M^{(1)} = X_{(1)}(W \odot V), \qquad m_{ir} = \sum_{j=1}^{J} \sum_{k=1}^{K} x_{ijk}\, v_{jr}\, w_{kr}$$

The standard approach is to explicitly matricise the tensor and form the full Khatri-Rao product before calling DGEMM. Can we do better? Avoid matricisation of the tensor and full Khatri-Rao products ...
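To make the contrast concrete, here is a numpy sketch (illustrative, not PLANC's kernel) of the standard explicit approach versus a fused one that avoids the intermediates:

```python
import numpy as np

def mttkrp_mode1_explicit(X, V, W):
    """Mode-1 MTTKRP, M(1) = X(1)(W ⊙ V), the standard way: materialise
    the matricised tensor and the full Khatri-Rao product, then one GEMM.
    Row ordering of KR matches numpy's C-order reshape of X."""
    I, J, K = X.shape
    R = V.shape[1]
    X1 = X.reshape(I, J * K)                                # X(1)
    KR = (V[:, None, :] * W[None, :, :]).reshape(J * K, R)  # Khatri-Rao
    return X1 @ KR                                          # DGEMM

def mttkrp_mode1_fused(X, V, W):
    """Same result without materialising X(1) or the Khatri-Rao product."""
    return np.einsum('ijk,jr,kr->ir', X, V, W)
```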


SLIDE 5

Communication Lower Bounds

Following the nested arrays lower bounds of [BKR18]:

Theorem. Any parallel MTTKRP algorithm involving a tensor with $I_k = I^{1/N}$ for all $k$ that evenly distributes one copy of the input and output performs at least

$$\Omega\!\left(\left(\frac{NIR}{P}\right)^{\frac{N}{2N-1}} + NR\left(\frac{I}{P}\right)^{\frac{1}{N}}\right)$$

sends and receives. (Either term can dominate.)

Key assumption: the algorithm is not allowed to pre-compute and re-use temporary values. $\Omega\big(NR\,(I/P)^{1/N}\big)$ is the most frequently occurring case for relatively small $P$ or $R$.


SLIDE 6

Shared Memory Optimisation - Dimension Trees

Reuse computations across MTTKRPs:

$$M^{(1)} = X_{(1)}(U^{(3)} \odot U^{(2)}), \qquad M^{(2)} = X_{(2)}(U^{(3)} \odot U^{(1)})$$

Utilise a "dimension tree" to store and reuse partial products [PTC13, LKL+17, HBJT18]; the numpy sketch after the diagram shows the reuse for a 3-way tensor.

            {1,2,3}
          PM/     \PM
        {1,2}     M(3)
      mTTV/ \mTTV
      M(1)   M(2)

(PM = partial MTTKRP, mTTV = multi-Tensor-Times-Vector)
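In numpy terms, the reuse looks like this (a sketch; PLANC's implementation is blocked and parallel):

```python
import numpy as np

def all_mttkrps_via_tree(X, U1, U2, U3):
    """Compute M(1), M(2), M(3) for a 3-way tensor, reusing one
    partial MTTKRP (PM) for the {1,2} branch of the dimension tree."""
    T = np.einsum('ijk,kr->ijr', X, U3)         # PM: contract mode 3 once
    M1 = np.einsum('ijr,jr->ir', T, U2)         # mTTV for mode 1
    M2 = np.einsum('ijr,ir->jr', T, U1)         # mTTV for mode 2, reuses T
    M3 = np.einsum('ijk,ir,jr->kr', X, U1, U2)  # PM for the {3} branch
    return M1, M2, M3
```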


SLIDE 7

Distributed Memory Optimisation - Communication Avoiding

(Figure: a 3D processor grid holding one subtensor per processor, with the needed rows of U(1) and U(3) and the output M(2) highlighted.)

Each processor (steps sketched in the mpi4py example below):

1. Starts with one subtensor and a subset of rows of each input factor matrix.
2. All-Gathers all the rows needed from U(1).
3. All-Gathers all the rows needed from U(3).
4. Computes its contribution to rows of M(2) (local MTTKRP).
5. Reduce-Scatters to compute and distribute M(2) evenly.
6. Solves the local NLS problem using M(2) and U(2).
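A minimal mpi4py sketch of steps 2-5. The function and communicator names are hypothetical, and it assumes the per-mode sub-communicators of the processor grid are already set up and that rows divide evenly:

```python
from mpi4py import MPI
import numpy as np

def distributed_mttkrp_mode2(local_X, local_U1, local_U3,
                             comm_mode1, comm_mode3, comm_fiber2):
    """One communication-avoiding MTTKRP producing M(2) on a 3D grid.
    local_X is this processor's subtensor; local_U1/local_U3 are its
    owned factor rows; the comm_* sub-communicators are assumed given."""
    # Steps 2-3: All-Gather the factor rows this subtensor touches.
    U1_rows = np.vstack(comm_mode1.allgather(local_U1))
    U3_rows = np.vstack(comm_mode3.allgather(local_U3))
    # Step 4: local MTTKRP contribution to rows of M(2).
    contrib = np.einsum('ijk,ir,kr->jr', local_X, U1_rows, U3_rows)
    # Step 5: Reduce-Scatter so each processor ends with its even share.
    rows_per_proc = contrib.shape[0] // comm_fiber2.size
    my_M2 = np.empty((rows_per_proc, contrib.shape[1]))
    comm_fiber2.Reduce_scatter_block(contrib, my_M2, op=MPI.SUM)
    return my_M2  # step 6 (the local NLS solve) consumes this
```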


SLIDE 8

Performance Plots - Strong Scaling

(Figure: 4D strong scaling on a synthetic tensor of dimensions 256 × 256 × 256 × 256 on 8, 16, 32, 64, and 128 nodes of Titan; running time in seconds, broken down into gram, nnls, mttkrp, multittv, reducescatter, allgather, and allreduce.)

Can achieve nearly linear scaling since NNCP is compute bound.


SLIDE 9

Performance Plots - CPU vs GPU

(Figures: CPU vs. GPU total time in seconds for low rank k from 20 to 100, comparing MU, HALS, ANLS/BPP, AO-ADMM, Nesterov, and CP/ALS.)

4D synthetic tensor of dimensions 384 × 384 × 384 × 384 on 81 Titan nodes as a 3 × 3 × 3 × 3 grid with varying low rank.

Offloading DGEMM calls to the GPU can provide a 7X speedup.


SLIDE 10

Compiler Challenges and Extensions to the Sparse Setting

1. Dimension tree ordering.
   - Combinatorial explosion in the sparse case (contrasted with the single split choice for the dense case).
   - The sparse case involves growth in intermediate values.

2. Communication pattern establishment and load balancing.
   - Automatic communicator setup given a processor grid and tensor operation.
   - Automatic data distribution using communication-avoiding loop optimisations [Kni15, DR16].

3. Block parallelism in least squares solvers.
   - Active Set orderings can be grouped in an embarrassingly parallel call.
   - The sparse case with a masking matrix has a similar RHS pattern.

4. Binary bloat.
   - Separate binaries for GPU/CPU and Sparse/Dense.


SLIDE 11

Summary

- PLANC is an open-source, scalable, and flexible software package to compute Non-negative Tensor Factorisation (NTF).
- It implements a state-of-the-art communication-avoiding algorithm for the matricised-tensor times Khatri-Rao product (MTTKRP).
- Popular optimisation methods for Non-negative Least Squares are included, such as Block Principal Pivoting, the Alternating Direction Method of Multipliers, and first-order Nesterov methods.
- NTF is an important contributor towards explainable AI, with a wide range of applications such as spectral unmixing, scientific visualization, healthcare analytics, and topic modelling.
- Coming soon as a miniapp on OLCF machines: https://github.com/ramkikannan/planc


SLIDE 12

References I

[BKR18] Grey Ballard, Nicholas Knight, and Kathryn Rouse. Communication lower bounds for matricized tensor times Khatri-Rao product. CoRR, abs/1708.07401, 2018.

[DR16] James Demmel and Alex Rusciano. Parallelepipeds obtaining HBL lower bounds. CoRR, abs/1611.05944, 2016.

[HBJT18] Koby Hayashi, Grey Ballard, Yujie Jiang, and Michael J. Tobia. Shared-memory parallelization of MTTKRP for dense tensors. In Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '18), pages 393-394, New York, NY, USA, 2018. ACM.

[Kni15] Nicholas Sullender Knight. Communication-Optimal Loop Nests. PhD thesis, UC Berkeley, 2015.

[LKL+17] Athanasios P. Liavas, Georgios Kostoulas, Georgios Lourakis, Kejun Huang, and Nicholas D. Sidiropoulos. Nesterov-based parallel algorithm for large-scale nonnegative tensor factorization. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5895-5899. IEEE, 2017.

[PTC13] Anh-Huy Phan, Petr Tichavský, and Andrzej Cichocki. Fast alternating LS algorithms for high order CANDECOMP/PARAFAC tensor factorizations. IEEE Transactions on Signal Processing, 61(19):4834-4846, Oct 2013.

SLIDE 13

NNCP Algorithm

For a given rank R we formulate NNCP as the following optimisation problem:

$$\min_{U, V, W \geq 0} \; \left\| \mathcal{X} - \sum_{r=1}^{R} \lambda_r\, u_r \circ v_r \circ w_r \right\|$$

This is a non-linear and non-convex problem. Solve via Alternating Non-negative Least Squares (ANLS) in an iterative manner using Block Coordinate Descent; a runnable sketch of this loop follows below.
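A deliberately naive numpy/scipy sketch of the ANLS/BCD loop for a 3-way tensor. PLANC replaces the per-row scipy nnls call with its BPP/ADMM/Nesterov solvers and the einsum with a distributed, dimension-tree MTTKRP; the weights λ are absorbed into the factors here:

```python
import numpy as np
from scipy.optimize import nnls

def mttkrp(X, U, mode):
    """Dense mode-n MTTKRP of a 3-way tensor."""
    specs = ['ijk,jr,kr->ir', 'ijk,ir,kr->jr', 'ijk,ir,jr->kr']
    others = [U[m] for m in range(3) if m != mode]
    return np.einsum(specs[mode], X, *others)

def nncp_anls(X, R, n_iters=20, seed=0):
    """NNCP via ANLS, updating one factor at a time (BCD)."""
    rng = np.random.default_rng(seed)
    U = [rng.random((d, R)) for d in X.shape]   # non-negative init
    for _ in range(n_iters):
        for n in range(3):
            M = mttkrp(X, U, n)
            G = np.ones((R, R))                 # Gram matrix, built via the
            for m in range(3):                  # Khatri-Rao identity (next slide)
                if m != n:
                    G *= U[m].T @ U[m]
            # Non-negative solve of U[n] G ≈ M, one row at a time.
            U[n] = np.vstack([nnls(G, M[i])[0] for i in range(M.shape[0])])
    return U
```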


SLIDE 14

Alternating Non-negative Least Squares (ANLS)

Fixing all factor matrices but one results in a linear NLS problem:

$$\min_{U \geq 0} \; \left\| \mathcal{X} - \sum_{r=1}^{R} u_r \circ \hat{v}_r \circ \hat{w}_r \right\|$$

or equivalently,

$$\min_{U \geq 0} \; \left\| X_{(1)} - U (\hat{W} \odot \hat{V})^T \right\|_F$$

where $\odot$ is the Khatri-Rao product (column-wise Kronecker product) of the factor matrices. Utilising the identity $(A \odot B)^T (A \odot B) = A^T A * B^T B$, we can cast the above problem as NLS solves via the normal equations:

$$\min_{U \geq 0} \; \left\| X_{(1)} (\hat{W} \odot \hat{V}) - U \left( \hat{W}^T \hat{W} * \hat{V}^T \hat{V} \right) \right\|_F$$

where $*$ is the Hadamard product.
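The Khatri-Rao Gram identity above is easy to sanity-check numerically (an illustrative snippet, with the Khatri-Rao product formed by broadcasting):

```python
import numpy as np

rng = np.random.default_rng(0)
A, B = rng.random((5, 3)), rng.random((4, 3))
KR = (A[:, None, :] * B[None, :, :]).reshape(-1, 3)   # A ⊙ B, shape (20, 3)
assert np.allclose(KR.T @ KR, (A.T @ A) * (B.T @ B))  # Gram identity holds
```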
