

SLIDE 1

Constrained Tensor Factorization with Accelerated AO-ADMM

Shaden Smith¹*, Alec Beri², and George Karypis¹

¹Department of Computer Science & Engineering, University of Minnesota
²Department of Computer Science, University of Maryland
*shaden@cs.umn.edu

SLIDE 2

Table of Contents

◮ Introduction
◮ Accelerated AO-ADMM
◮ Experiments
◮ Conclusions

SLIDE 3

Tensor Introduction

◮ Tensors are the generalization of matrices to higher dimensions.
◮ Allow us to represent and analyze multi-dimensional data.
◮ Applications in precision healthcare, cybersecurity, recommender systems, . . .

[Figure: a third-order tensor with modes patients × procedures × diagnoses.]

SLIDE 4

Canonical polyadic decomposition (CPD)

The CPD models a tensor as the summation of F rank-1 tensors:

[Figure: X ≈ a sum of F rank-1 (outer-product) tensors.]

$$\underset{A,B,C}{\text{minimize}} \quad L(\mathcal{X}, A, B, C) = \left\| \mathcal{X} - \sum_{f=1}^{F} A(:,f) \circ B(:,f) \circ C(:,f) \right\|_F^2$$

Notation: A ∈ ℝ^{I×F}, B ∈ ℝ^{J×F}, and C ∈ ℝ^{K×F} denote the factor matrices for a 3D tensor; ∘ denotes the vector outer product.

SLIDE 5

Alternating least squares (ALS)

The CPD is most commonly computed with ALS:

Algorithm 1 CPD-ALS
1: while not converged do
2:   Aᵀ ← (CᵀC ∗ BᵀB)⁻¹ (X_(1)(C ⊙ B))ᵀ
3:   Bᵀ ← (CᵀC ∗ AᵀA)⁻¹ (X_(2)(C ⊙ A))ᵀ
4:   Cᵀ ← (BᵀB ∗ AᵀA)⁻¹ (X_(3)(B ⊙ A))ᵀ
5: end while

Each update solves the normal equations for one factor; the product X_(n)(· ⊙ ·) is the MTTKRP kernel.

Notation: ∗ denotes the Hadamard (elementwise) product; ⊙ denotes the Khatri-Rao product.
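The small F × F Gram matrices in these solves come from a standard Khatri-Rao identity, written out here since the algorithm relies on it:

$$(C \odot B)^{\mathsf{T}}(C \odot B) = (C^{\mathsf{T}}C) \ast (B^{\mathsf{T}}B),$$

so each least-squares update needs only an F × F system rather than anything of size JK × F.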

SLIDE 6

Constrained factorization

We often want to impose some constraints or regularizations on the factorization:

$$\underset{A,B,C}{\text{minimize}} \quad \underbrace{L(\mathcal{X}, A, B, C)}_{\text{loss}} \;+\; \underbrace{r(A) + r(B) + r(C)}_{\text{constraints/regularizations}}$$

Example: non-negative factorizations use an indicator function for ℝ₊:

$$r(A) = \begin{cases} 0 & \text{if } A \ge 0, \\ \infty & \text{otherwise.} \end{cases}$$
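Non-smooth terms like this indicator are handled through their proximal operators; for ℝ₊ the prox is the elementwise projection (a standard fact, noted here because it is exactly the operation the ADMM inner loop on Slide 10 applies each iteration):

$$\operatorname{prox}_{r}(V) = \underset{A \ge 0}{\arg\min}\; \tfrac{1}{2}\|A - V\|_F^2 = \max(0, V) \quad \text{(elementwise)}.$$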

SLIDE 7

AO-ADMM [Huang & Sidiropoulos ’15]

AO-ADMM combines alternating optimization (AO) with alternating direction method of multipliers (ADMM).

◮ A, B, and C are updated in sequence using ADMM.

SLIDE 8

AO-ADMM [Huang & Sidiropoulos ’15]


ADMM formulation for the update of A:

$$\underset{A,\tilde{A}}{\text{minimize}} \quad \frac{1}{2}\left\| X_{(1)} - \tilde{A}^{\mathsf{T}}(C \odot B)^{\mathsf{T}} \right\|_F^2 + r(A) \quad \text{subject to} \quad A = \tilde{A}^{\mathsf{T}}.$$
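In scaled form, the augmented Lagrangian behind this update is the standard one (written out as a reading aid; Â is the scaled dual variable, called U in the generic routine on Slide 10, and ρ > 0 the penalty):

$$\mathcal{L}_{\rho}(A, \tilde{A}, \hat{A}) = \frac{1}{2}\left\| X_{(1)} - \tilde{A}^{\mathsf{T}}(C \odot B)^{\mathsf{T}} \right\|_F^2 + r(A) + \frac{\rho}{2}\left\| A - \tilde{A}^{\mathsf{T}} + \hat{A} \right\|_F^2 - \frac{\rho}{2}\left\| \hat{A} \right\|_F^2.$$

Alternately minimizing over Ã (a regularized least-squares solve) and A (a proximal step), then ascending in Â, yields the inner iteration on Slide 10.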

SLIDE 9

Alternating optimization step (outer iterations)

1: Initialize primal variables A, B, and C randomly.
2: Initialize dual variables Â, B̂, and Ĉ with 0.
3: repeat
4:   G ← BᵀB ∗ CᵀC
5:   K ← X_(1)(C ⊙ B)
6:   A, Â ← ADMM(A, Â, K, G)
7:
8:   G ← AᵀA ∗ CᵀC
9:   K ← X_(2)(C ⊙ A)
10:  B, B̂ ← ADMM(B, B̂, K, G)
11:
12:  G ← AᵀA ∗ BᵀB
13:  K ← X_(3)(B ⊙ A)
14:  C, Ĉ ← ADMM(C, Ĉ, K, G)
15: until L(X, A, B, C) ceases to improve.

SLIDE 10

ADMM step (inner iterations)

ADMM to update one factor matrix:

1: Input: H, U, K, G
2: Output: H, U
3: ρ ← trace(G)/F
4: L ← Cholesky(G + ρI)
5: repeat
6:   H₀ ← H
7:   H̃ ← L⁻ᵀL⁻¹(K + ρ(H + U))ᵀ
8:   H ← argmin_H r(H) + (ρ/2)‖H − H̃ᵀ + U‖²_F
9:   U ← U + H − H̃ᵀ
10:  r ← ‖H − H̃ᵀ‖²_F / ‖H‖²_F
11:  s ← ‖H − H₀‖²_F / ‖U‖²_F
12: until r < ε and s < ε
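For the non-negativity indicator, the argmin on Line 8 reduces to the projection max(0, H̃ᵀ − U). A minimal C sketch under that assumption, with illustrative names and row-major nrows × F storage (HtT holds H̃ᵀ):

#include <stddef.h>

/* Line 8 for r(.) = indicator of the non-negative orthant:
 * H <- max(0, Htilde^T - U), applied elementwise. */
static void admm_prox_nonneg(double *H, const double *HtT, const double *U,
                             size_t nrows, size_t F)
{
    for (size_t i = 0; i < nrows * F; ++i) {
        const double v = HtT[i] - U[i];
        H[i] = (v > 0.0) ? v : 0.0;
    }
}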

SLIDE 11

Table of Contents

◮ Introduction
◮ Accelerated AO-ADMM
◮ Experiments
◮ Conclusions

SLIDE 12

Parallelization opportunities

All steps but Line 8 of the ADMM routine (Slide 10) are either element-wise or row-wise independent.

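That independence maps directly onto threads. A sketch of the row-parallel dual update (Line 9) with OpenMP, again with illustrative names and row-major nrows × F matrices (not SPLATT's actual internals):

/* Dual update U <- U + H - Htilde^T: rows are independent, so a
 * static partition of rows across threads suffices.
 * Compile with OpenMP enabled (e.g., icc/gcc -fopenmp). */
static void dual_update(double *U, const double *H, const double *HtT,
                        long nrows, long F)
{
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < nrows; ++i) {
        for (long f = 0; f < F; ++f) {
            U[i * F + f] += H[i * F + f] - HtT[i * F + f];
        }
    }
}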

SLIDE 13

Performance opportunities

1. The factor matrices are tall and skinny (e.g., 10⁶ × 50).
   ◮ The ADMM step will be bound by memory bandwidth.

2. Real-world tensors have non-uniform distributions of non-zeros.
   ◮ This may lead to non-uniform convergence of the factor rows during ADMM.

3. Many constraints and regularizations naturally induce sparsity in the factors.
   ◮ We can exploit this sparsity during MTTKRP (in paper).

SLIDE 14

Blocked ADMM

If the proximity operator coming from r(·) is row-separable, reformulate the ADMM problem to work on B blocks of rows:

$$\underset{(A_1,\tilde{A}_1),\dots,(A_B,\tilde{A}_B)}{\text{minimize}} \quad \sum_{b=1}^{B} \frac{1}{2}\left\| (X_{(1)})_b - \tilde{A}_b^{\mathsf{T}}(C \odot B)^{\mathsf{T}} \right\|_F^2 + r(A_b) \quad \text{subject to} \quad A_b = \tilde{A}_b^{\mathsf{T}}, \; b = 1, \dots, B.$$

Optimizing each block separately allows the blocks to converge at different rates, while also acting as a form of cache tiling.
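A sketch of how the blocked inner step might be driven in C; run_admm_block is a hypothetical helper that runs the Slide 10 loop on one row block until that block's residuals pass the tolerance (all names illustrative, not SPLATT's actual API):

/* Blocked ADMM driver sketch: each row block runs its own inner ADMM
 * loop, so "easy" blocks stop early while "hard" blocks keep
 * iterating, and a block's H, Htilde, and U tiles can stay in cache. */
static void admm_blocked(double *H, double *U, const double *K,
                         const double *L, /* Cholesky factor of G + rho*I */
                         long nrows, long F, long block_rows, double eps)
{
    #pragma omp parallel for schedule(dynamic)
    for (long r0 = 0; r0 < nrows; r0 += block_rows) {
        const long rows = (r0 + block_rows <= nrows) ? block_rows
                                                     : (nrows - r0);
        /* Hypothetical helper: Slide 10's loop restricted to one block. */
        run_admm_block(&H[r0 * F], &U[r0 * F], &K[r0 * F], L, rows, F, eps);
    }
}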

SLIDE 15

Blocked ADMM

More simply:

[Figure: schematic of the blocked formulation, with each row block of the factor updated by its own ADMM loop.]

SLIDE 16

Effects of block size

The block size affects both convergence rate and computational efficiency:

◮ A block size of 1 optimizes each row of H independently.
◮ Larger block sizes better utilize hardware resources, but should be chosen to fit in cache.

Our evaluation uses F = 50, and we experimentally found a block size of 50 rows to be a good balance between convergence rate and performance.
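A rough back-of-the-envelope check (our arithmetic, assuming double-precision factors) shows why 50 rows is cache-friendly at F = 50:

$$50 \times 50 \times 8\,\text{B} = 20\,\text{KB per matrix tile},$$

so a block's H, H̃, and U tiles together occupy roughly 60 KB, comfortably within the 256 KB per-core L2 cache of the Haswell machine used here.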

SLIDE 17

Table of Contents

◮ Introduction
◮ Accelerated AO-ADMM
◮ Experiments
◮ Conclusions

SLIDE 18

Experimental Setup

Source code:

◮ Modified from SPLATT¹
◮ Written in C and parallelized with OpenMP
◮ Compiled with icc v17.0.1 and linked with Intel MKL

Machine specifications:

◮ 2× 10-core Intel Xeon E5-2650v3 (Haswell)
◮ 396 GB RAM

¹ https://github.com/ShadenSmith/splatt

SLIDE 19

Convergence measurement

We measure convergence based on the relative reconstruction error:

$$\text{relative error} = \frac{L(\mathcal{X}, A, B, C)}{\|\mathcal{X}\|_F^2}.$$

Termination:

◮ Convergence is detected when the relative error improves by less than 10⁻⁶, or when we exceed 200 outer iterations.
◮ ADMM is limited to 50 inner iterations with ε = 10⁻².
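As a small illustration (names are ours, not SPLATT's), the outer termination test amounts to:

/* Stop once the relative error improves by less than 1e-6 between
 * outer iterations, or after 200 outer iterations. */
static int outer_converged(double prev_err, double rel_err, int iteration)
{
    return (prev_err - rel_err) < 1e-6 || iteration >= 200;
}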

SLIDE 20

Datasets

We selected four tensors from the FROSTT² collection based on non-negative factorization performance:

◮ require a non-trivial number of iterations
◮ have a factorization quality that suggests a non-negative CPD is appropriate

Dataset    NNZ     I      J      K
Reddit     95M     310K   6K     510K
NELL       143M    3M     2M     25M
Amazon     1.7B    5M     18M    2M
Patents    3.5B    46     240K   240K

² http://frostt.io/

SLIDE 21

Relative Factorization Costs

Fraction of time spent in MTTKRP and ADMM during a rank-50 non-negative factorization:

[Figure: stacked bars showing the fraction of factorization time spent in MTTKRP, ADMM, and other work for Reddit, NELL, Amazon, and Patents.]

SLIDE 22

Parallel Scalability

Blocked ADMM improves speedup when the factorization is dominated by ADMM:

[Figure: speedup versus thread count (1 to 20) for the baseline (left) and blocked (right) formulations on Reddit, NELL, Amazon, and Patents, with ideal speedup shown for reference.]

SLIDE 23

Convergence: Reddit

Blocking results in faster per-iteration runtimes and also converges in fewer iterations.

[Figure: relative error versus time (left) and versus outer iteration (right) for the base and blocked methods on Reddit.]

SLIDE 24

Convergence: NELL

Convergence is 3.7× faster with blocking, despite using additional iterations to achieve a lower error.

[Figure: relative error versus time (left) and versus outer iteration (right) for the base and blocked methods on NELL.]

SLIDE 25

Convergence: Amazon

Both formulations exceed the maximum of 200 outer iterations, but the blocked formulation achieves a lower error in less time.

[Figure: relative error versus time (left) and versus outer iteration (right) for the base and blocked methods on Amazon.]

SLIDE 26

Convergence: Patents

Per-iteration runtimes are largely unaffected, as Patents is dominated by MTTKRP time. However, fewer iterations are required.

[Figure: relative error versus time (left) and versus outer iteration (right) for the base and blocked methods on Patents.]

SLIDE 27

Table of Contents

◮ Introduction
◮ Accelerated AO-ADMM
◮ Experiments
◮ Conclusions

SLIDE 28

Wrapping up

Blocked ADMM accelerates constrained tensor factorization in two ways:

◮ Optimizing blocks independently saves computation on the “simple” rows and better optimizes “hard” rows.
◮ Blocks can be kept in cache during ADMM, saving memory bandwidth.

Also in the paper:

◮ MTTKRP can be accelerated by exploiting the sparsity that dynamically evolves in the factors.
◮ An additional ∼2× speedup is achieved.

Future work:

◮ Analytical model for selecting block sizes.
◮ Automatic runtime selection of data structure for sparse factors.

SLIDE 29

Reproducibility

All of our work is open source (in the wip/ao-admm branch for now):
https://github.com/ShadenSmith/splatt

Datasets are freely available:
http://frostt.io/

SLIDE 30

Backup Slides

SLIDE 31

Matricized tensor times Khatri-Rao product

MTTKRP is a key kernel for computing the CPD:

$$K = X_{(1)}(C \odot B), \qquad X_{(1)} \in \mathbb{R}^{I \times JK}, \quad (C \odot B) \in \mathbb{R}^{JK \times F}, \quad K \in \mathbb{R}^{I \times F}.$$

Notation: X_(1) unfolds (matricizes) a tensor; (C ⊙ B) is the Khatri-Rao (columnwise Kronecker) product.
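Column-wise, the Khatri-Rao product pairs corresponding columns with a Kronecker product (standard definition, included for reference):

$$(C \odot B)(:,f) = C(:,f) \otimes B(:,f) \in \mathbb{R}^{JK},$$

which is why MTTKRP can be computed directly from the non-zeros of X without ever forming the JK × F operand.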

SLIDE 32

Sparse MTTKRP

Convergence on Reddit with F = 100 and r(·) = 10⁻¹‖·‖₁.

[Figure: relative error versus time on Reddit for the base, dense, CSR, and CSR-H implementations.]

SLIDE 33

Compressed sparse fiber (CSF)

◮ Modes are recursively compressed.
◮ Paths from roots to leaves encode non-zeros.
◮ The tree structure encodes opportunities for savings.
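For a 3-mode tensor, the CSF consumed by the MTTKRP kernel on the next slide boils down to a few index arrays. A hypothetical sketch (field names chosen to match that kernel, not SPLATT's actual structs):

/* Sketch of a 3-mode CSF tensor: slices point to fibers, fibers
 * point to non-zeros, and the id arrays store compressed indices. */
typedef struct {
    int     nslices;  /* number of mode-1 slices (I)               */
    int    *sptr;     /* slice -> fiber range, length nslices+1    */
    int    *sids;     /* mode-2 index of each fiber                */
    int    *fptr;     /* fiber -> non-zero range, length nfibers+1 */
    int    *fids;     /* mode-3 index of each non-zero             */
    double *vals;     /* non-zero values                           */
} csf_t;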

SLIDE 34

MTTKRP with CSF

/* MTTKRP: A = X_(1) (C ⊙ B) for a 3-mode CSF tensor with rank r. */
for (int i = 0; i < I; ++i) {                         /* for each outer slice */
  for (int s = sptr[i]; s < sptr[i+1]; ++s) {         /* for each fiber in slice */
    double accum[r];
    for (int f = 0; f < r; ++f) accum[f] = 0.0;
    for (int nnz = fptr[s]; nnz < fptr[s+1]; ++nnz) { /* for each nnz in fiber */
      int k = fids[nnz];
      for (int f = 0; f < r; ++f)
        accum[f] += vals[nnz] * C[k][f];
    }
    int j = sids[s];                                  /* fiber's mode-2 index */
    for (int f = 0; f < r; ++f)
      A[i][f] += accum[f] * B[j][f];                  /* B[j]: the fiber's row of B */
  }
}
