

SLIDE 1

PLANC: Parallel Low Rank Approximations with Non-negativity Constraints

Ramakrishnan Kannan, Michael Matheson, Grey Ballard, Srinivas Eswar, Koby Hayashi, Haesun Park

Ph.D. student, School of CSE, Georgia Institute of Technology. Advisors: Rich Vuduc and Haesun Park.

January 26, 2019 Workshop on Compiler Techniques for Sparse Tensor Algebra

Acknowledgement: This work was partly sponsored by NSF, Sandia and ORNL


SLIDE 2

Summary

- PLANC is an open-source, scalable, and flexible software package to compute Non-negative Tensor Factorisation (NTF).
- It implements a state-of-the-art communication-avoiding algorithm for the matricised-tensor times Khatri-Rao product (MTTKRP).
- Popular optimisation methods for Non-negative Least Squares are included, such as Block Principal Pivoting, the Alternating Direction Method of Multipliers, and first-order Nesterov methods.
- NTF is an important contributor towards explainable AI, with a wide range of applications such as spectral unmixing, scientific visualization, healthcare analytics, and topic modelling.


SLIDE 3

CP Decomposition

Matrix: $M \approx \sum_{r=1}^{R} u_r (\sigma_r v_r^T)$

Tensor: $\mathcal{X} \approx \sum_{r=1}^{R} (\lambda_r u_r) \circ v_r \circ w_r$

This is known as the CANDECOMP, PARAFAC, canonical polyadic, or CP decomposition. It approximates a tensor as a sum of outer products, i.e. rank-1 tensors.

NNCP imposes non-negativity constraints on the factor matrices to aid interpretability.
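For concreteness, here is a minimal numpy sketch (illustrative only, not PLANC code) that rebuilds a dense 3-way tensor from its CP factors:

```python
import numpy as np

def cp_reconstruct(lam, U, V, W):
    """Rebuild an I x J x K tensor from CP factors:
    the sum over r of lam[r] * U[:, r] (outer) V[:, r] (outer) W[:, r]."""
    return np.einsum('r,ir,jr,kr->ijk', lam, U, V, W)
```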


SLIDE 4

Computational Bottlenecks

The MTTKRP is the major bottleneck for NNCP:

$$M^{(1)} = X_{(1)}(W \odot V), \qquad m_{ir} = \sum_{j=1}^{J} \sum_{k=1}^{K} x_{ijk}\, v_{jr}\, w_{kr}$$

The standard approach is to explicitly matricise the tensor and form the full Khatri-Rao product before calling DGEMM. Can we do better? Avoid matricisation of the tensor and full Khatri-Rao products ...
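To make the contrast concrete, here is a numpy sketch (illustrative, not PLANC's kernel) of the standard explicit approach versus a fused one that avoids the intermediates:

```python
import numpy as np

def mttkrp_mode1_explicit(X, V, W):
    """Mode-1 MTTKRP, M(1) = X(1)(W ⊙ V), the standard way: materialise
    the matricised tensor and the full Khatri-Rao product, then one GEMM.
    Row ordering of KR matches numpy's C-order reshape of X."""
    I, J, K = X.shape
    R = V.shape[1]
    X1 = X.reshape(I, J * K)                                # X(1)
    KR = (V[:, None, :] * W[None, :, :]).reshape(J * K, R)  # Khatri-Rao
    return X1 @ KR                                          # DGEMM

def mttkrp_mode1_fused(X, V, W):
    """Same result without materialising X(1) or the Khatri-Rao product."""
    return np.einsum('ijk,jr,kr->ir', X, V, W)
```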


SLIDE 5

Communication Lower Bounds

Following the nested arrays lower bounds of [BKR18]:

Theorem. Any parallel MTTKRP algorithm involving a tensor with $I_k = I^{1/N}$ for all $k$ that evenly distributes one copy of the input and output performs at least

$$\Omega\!\left(\left(\frac{NIR}{P}\right)^{\frac{N}{2N-1}} + NR\left(\frac{I}{P}\right)^{\frac{1}{N}}\right)$$

sends and receives. (Either term can dominate.)

Key assumption: the algorithm is not allowed to pre-compute and re-use temporary values. $\Omega\big(NR\,(I/P)^{1/N}\big)$ is the most frequently occurring case for relatively small $P$ or $R$.


SLIDE 6

Shared Memory Optimisation - Dimension Trees

Reuse computations across MTTKRPs:

$$M^{(1)} = X_{(1)}(U^{(3)} \odot U^{(2)}), \qquad M^{(2)} = X_{(2)}(U^{(3)} \odot U^{(1)})$$

Utilise a "dimension tree" to store and reuse partial products [PTC13, LKL+17, HBJT18]; the numpy sketch after the diagram shows the reuse for a 3-way tensor.

            {1,2,3}
          PM/     \PM
        {1,2}     M(3)
      mTTV/ \mTTV
      M(1)   M(2)

(PM = partial MTTKRP, mTTV = multi-Tensor-Times-Vector)
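In numpy terms, the reuse looks like this (a sketch; PLANC's implementation is blocked and parallel):

```python
import numpy as np

def all_mttkrps_via_tree(X, U1, U2, U3):
    """Compute M(1), M(2), M(3) for a 3-way tensor, reusing one
    partial MTTKRP (PM) for the {1,2} branch of the dimension tree."""
    T = np.einsum('ijk,kr->ijr', X, U3)         # PM: contract mode 3 once
    M1 = np.einsum('ijr,jr->ir', T, U2)         # mTTV for mode 1
    M2 = np.einsum('ijr,ir->jr', T, U1)         # mTTV for mode 2, reuses T
    M3 = np.einsum('ijk,ir,jr->kr', X, U1, U2)  # PM for the {3} branch
    return M1, M2, M3
```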


SLIDE 7

Distributed Memory Optimisation - Communication Avoiding

(Figure: a 3D processor grid holding one subtensor per processor, with the needed rows of U(1) and U(3) and the output M(2) highlighted.)

Each processor (steps sketched in the mpi4py example below):

1. Starts with one subtensor and a subset of rows of each input factor matrix.
2. All-Gathers all the rows needed from U(1).
3. All-Gathers all the rows needed from U(3).
4. Computes its contribution to rows of M(2) (local MTTKRP).
5. Reduce-Scatters to compute and distribute M(2) evenly.
6. Solves the local NLS problem using M(2) and U(2).
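A minimal mpi4py sketch of steps 2-5. The function and communicator names are hypothetical, and it assumes the per-mode sub-communicators of the processor grid are already set up and that rows divide evenly:

```python
from mpi4py import MPI
import numpy as np

def distributed_mttkrp_mode2(local_X, local_U1, local_U3,
                             comm_mode1, comm_mode3, comm_fiber2):
    """One communication-avoiding MTTKRP producing M(2) on a 3D grid.
    local_X is this processor's subtensor; local_U1/local_U3 are its
    owned factor rows; the comm_* sub-communicators are assumed given."""
    # Steps 2-3: All-Gather the factor rows this subtensor touches.
    U1_rows = np.vstack(comm_mode1.allgather(local_U1))
    U3_rows = np.vstack(comm_mode3.allgather(local_U3))
    # Step 4: local MTTKRP contribution to rows of M(2).
    contrib = np.einsum('ijk,ir,kr->jr', local_X, U1_rows, U3_rows)
    # Step 5: Reduce-Scatter so each processor ends with its even share.
    rows_per_proc = contrib.shape[0] // comm_fiber2.size
    my_M2 = np.empty((rows_per_proc, contrib.shape[1]))
    comm_fiber2.Reduce_scatter_block(contrib, my_M2, op=MPI.SUM)
    return my_M2  # step 6 (the local NLS solve) consumes this
```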


SLIDE 8

Performance Plots - Strong Scaling

(Figure: 4D strong scaling on a synthetic tensor of dimensions 256 × 256 × 256 × 256 on 8, 16, 32, 64, and 128 nodes of Titan; running time in seconds, broken down into gram, nnls, mttkrp, multittv, reducescatter, allgather, and allreduce.)

Can achieve nearly linear scaling since NNCP is compute bound.


SLIDE 9

Performance Plots - CPU vs GPU

(Figures: CPU vs. GPU total time in seconds for low rank k from 20 to 100, comparing MU, HALS, ANLS/BPP, AO-ADMM, Nesterov, and CP/ALS.)

4D synthetic tensor of dimensions 384 × 384 × 384 × 384 on 81 Titan nodes as a 3 × 3 × 3 × 3 grid with varying low rank.

Offloading DGEMM calls to the GPU can provide a 7X speedup.


SLIDE 10

Compiler Challenges and Extensions to the Sparse Setting

1. Dimension tree ordering.
   - Combinatorial explosion in the sparse case (contrasted with the single split choice for the dense case).
   - The sparse case involves growth in intermediate values.

2. Communication pattern establishment and load balancing.
   - Automatic communicator setup given a processor grid and tensor operation.
   - Automatic data distribution using communication-avoiding loop optimisations [Kni15, DR16].

3. Block parallelism in least squares solvers.
   - Active Set orderings can be grouped in an embarrassingly parallel call.
   - The sparse case with a masking matrix has a similar RHS pattern.

4. Binary bloat.
   - Separate binaries for GPU/CPU and Sparse/Dense.


SLIDE 11

Summary

- PLANC is an open-source, scalable, and flexible software package to compute Non-negative Tensor Factorisation (NTF).
- It implements a state-of-the-art communication-avoiding algorithm for the matricised-tensor times Khatri-Rao product (MTTKRP).
- Popular optimisation methods for Non-negative Least Squares are included, such as Block Principal Pivoting, the Alternating Direction Method of Multipliers, and first-order Nesterov methods.
- NTF is an important contributor towards explainable AI, with a wide range of applications such as spectral unmixing, scientific visualization, healthcare analytics, and topic modelling.
- Coming soon as a miniapp on OLCF machines: https://github.com/ramkikannan/planc


SLIDE 12

References I

[BKR18] Grey Ballard, Nicholas Knight, and Kathryn Rouse. Communication lower bounds for matricized tensor times Khatri-Rao product. CoRR, abs/1708.07401, 2018.

[DR16] James Demmel and Alex Rusciano. Parallelepipeds obtaining HBL lower bounds. CoRR, abs/1611.05944, 2016.

[HBJT18] Koby Hayashi, Grey Ballard, Yujie Jiang, and Michael J. Tobia. Shared-memory parallelization of MTTKRP for dense tensors. In Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '18), pages 393-394, New York, NY, USA, 2018. ACM.

[Kni15] Nicholas Sullender Knight. Communication-Optimal Loop Nests. PhD thesis, UC Berkeley, 2015.

[LKL+17] Athanasios P. Liavas, Georgios Kostoulas, Georgios Lourakis, Kejun Huang, and Nicholas D. Sidiropoulos. Nesterov-based parallel algorithm for large-scale nonnegative tensor factorization. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5895-5899. IEEE, 2017.

[PTC13] Anh-Huy Phan, Petr Tichavský, and Andrzej Cichocki. Fast alternating LS algorithms for high order CANDECOMP/PARAFAC tensor factorizations. IEEE Transactions on Signal Processing, 61(19):4834-4846, Oct 2013.

SLIDE 13

NNCP Algorithm

For a given rank R we formulate NNCP as the following optimisation problem:

$$\min_{U, V, W \geq 0} \; \left\| \mathcal{X} - \sum_{r=1}^{R} \lambda_r\, u_r \circ v_r \circ w_r \right\|$$

This is a non-linear and non-convex problem. Solve via Alternating Non-negative Least Squares (ANLS) in an iterative manner using Block Coordinate Descent; a runnable sketch of this loop follows below.
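A deliberately naive numpy/scipy sketch of the ANLS/BCD loop for a 3-way tensor. PLANC replaces the per-row scipy nnls call with its BPP/ADMM/Nesterov solvers and the einsum with a distributed, dimension-tree MTTKRP; the weights λ are absorbed into the factors here:

```python
import numpy as np
from scipy.optimize import nnls

def mttkrp(X, U, mode):
    """Dense mode-n MTTKRP of a 3-way tensor."""
    specs = ['ijk,jr,kr->ir', 'ijk,ir,kr->jr', 'ijk,ir,jr->kr']
    others = [U[m] for m in range(3) if m != mode]
    return np.einsum(specs[mode], X, *others)

def nncp_anls(X, R, n_iters=20, seed=0):
    """NNCP via ANLS, updating one factor at a time (BCD)."""
    rng = np.random.default_rng(seed)
    U = [rng.random((d, R)) for d in X.shape]   # non-negative init
    for _ in range(n_iters):
        for n in range(3):
            M = mttkrp(X, U, n)
            G = np.ones((R, R))                 # Gram matrix, built via the
            for m in range(3):                  # Khatri-Rao identity (next slide)
                if m != n:
                    G *= U[m].T @ U[m]
            # Non-negative solve of U[n] G ≈ M, one row at a time.
            U[n] = np.vstack([nnls(G, M[i])[0] for i in range(M.shape[0])])
    return U
```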


SLIDE 14

Alternating Non-negative Least Squares (ANLS)

Fixing all factor matrices but one results in a linear NLS problem:

$$\min_{U \geq 0} \; \left\| \mathcal{X} - \sum_{r=1}^{R} u_r \circ \hat{v}_r \circ \hat{w}_r \right\|$$

or equivalently,

$$\min_{U \geq 0} \; \left\| X_{(1)} - U (\hat{W} \odot \hat{V})^T \right\|_F$$

where $\odot$ is the Khatri-Rao product (column-wise Kronecker product) of the factor matrices. Utilising the identity $(A \odot B)^T (A \odot B) = A^T A * B^T B$, we can cast the above problem as NLS solves via the normal equations:

$$\min_{U \geq 0} \; \left\| X_{(1)} (\hat{W} \odot \hat{V}) - U \left( \hat{W}^T \hat{W} * \hat{V}^T \hat{V} \right) \right\|_F$$

where $*$ is the Hadamard product.
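The Khatri-Rao Gram identity above is easy to sanity-check numerically (an illustrative snippet, with the Khatri-Rao product formed by broadcasting):

```python
import numpy as np

rng = np.random.default_rng(0)
A, B = rng.random((5, 3)), rng.random((4, 3))
KR = (A[:, None, :] * B[None, :, :]).reshape(-1, 3)   # A ⊙ B, shape (20, 3)
assert np.allclose(KR.T @ KR, (A.T @ A) * (B.T @ B))  # Gram identity holds
```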
