
Scaling the Hierarchical Topic Modeling Mountain: Neural NMF and Iterative Projection Methods

Jamie Haddock, Computational and Applied Mathematics, UCLA
Harvey Mudd College, January 28, 2020


1. Our method: Neural NMF

Goal: Develop true forward and back propagation algorithms for hNMF.

⊲ Regard the $A$ matrices as independent variables and determine the $S$ matrices from the $A$ matrices.
⊲ Define the nonnegative least-squares map $q(X, A) := \operatorname{argmin}_{S \ge 0} \|X - AS\|_F^2$.
⊲ Pin the values of $S$ to those of $A$ by recursively setting $S^{(\ell)} := q(S^{(\ell-1)}, A^{(\ell)})$.

[Diagram: $X \to q(\cdot, A^{(0)}) \to S^{(0)} \to q(\cdot, A^{(1)}) \to S^{(1)}$]

Training:
⊲ Forward propagation: $S^{(0)} = q(X, A^{(0)})$, $S^{(1)} = q(S^{(0)}, A^{(1)})$, ..., $S^{(L)} = q(S^{(L-1)}, A^{(L)})$.
⊲ Back propagation: update $\{A^{(i)}\}$ with $\nabla E(\{A^{(i)}\})$.
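A minimal sketch of the forward pass, with hypothetical helper names q and forward (this is not the authors' released implementation). The column-by-column reduction works because $\|X - AS\|_F^2$ separates over the columns of $S$:

```python
import numpy as np
from scipy.optimize import nnls

def q(X, A):
    # argmin_{S >= 0} ||X - A S||_F^2, solved one column at a time since the
    # Frobenius objective separates over the columns of S.
    S = np.zeros((A.shape[1], X.shape[1]))
    for j in range(X.shape[1]):
        S[:, j], _ = nnls(A, X[:, j])
    return S

def forward(X, A_list):
    # Forward propagation: S(0) = q(X, A(0)), then S(l) = q(S(l-1), A(l)).
    S_list, S = [], X
    for A in A_list:
        S = q(S, A)
        S_list.append(S)
    return S_list
```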

2. Least-squares Subroutine

⊲ Least squares is a fundamental subroutine in forward propagation.
⊲ Iterative projection methods can solve these problems.

3. Iterative Projection Methods

4. General Setup

We are interested in solving highly overdetermined systems of equations, $Ax = b$, where $A \in \mathbb{R}^{m \times n}$, $b \in \mathbb{R}^m$, and $m \gg n$. Rows are denoted $a_i^T$.

5. Iterative Projection Methods

If $\{x \in \mathbb{R}^n : Ax = b\}$ is nonempty, these methods construct an approximation to a solution:
1. Randomized Kaczmarz Method
2. Motzkin's Method
3. Sampling Kaczmarz-Motzkin Methods (SKM)

Applications:
1. Tomography (Algebraic Reconstruction Technique)
2. Linear programming
3. Average consensus (greedy gossip with eavesdropping)

6. Kaczmarz Method

Given $x_0 \in \mathbb{R}^n$:
1. Choose $i_k \in [m]$ with probability $\frac{\|a_{i_k}\|_2^2}{\|A\|_F^2}$.
2. Define $x_k := x_{k-1} + \frac{b_{i_k} - a_{i_k}^T x_{k-1}}{\|a_{i_k}\|_2^2}\, a_{i_k}$.
3. Repeat.

[Figure: iterates $x_0, x_1, x_2, x_3$ produced by successive projections onto the solution hyperplanes]

[Kaczmarz 1937], [Strohmer, Vershynin 2009]
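A minimal sketch of randomized Kaczmarz under the setup above; the function name and defaults are illustrative:

```python
import numpy as np

def randomized_kaczmarz(A, b, x0, iters=1000, seed=None):
    # Each iteration projects the iterate onto the hyperplane {x : a_i^T x = b_i}
    # of a row sampled with probability ||a_i||^2 / ||A||_F^2.
    rng = np.random.default_rng(seed)
    probs = np.sum(A**2, axis=1) / np.linalg.norm(A, "fro")**2
    x = np.array(x0, dtype=float)
    for _ in range(iters):
        i = rng.choice(A.shape[0], p=probs)
        a = A[i]
        x += (b[i] - a @ x) / (a @ a) * a
    return x
```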

7. Motzkin's Method

Given $x_0 \in \mathbb{R}^n$:
1. Choose $i_k := \operatorname{argmax}_{i \in [m]} |a_i^T x_{k-1} - b_i|$.
2. Define $x_k := x_{k-1} + \frac{b_{i_k} - a_{i_k}^T x_{k-1}}{\|a_{i_k}\|_2^2}\, a_{i_k}$.
3. Repeat.

[Motzkin, Schoenberg 1954]
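A corresponding sketch of Motzkin's method (names illustrative); the greedy row selection requires the full residual, which costs O(mn) per iteration versus O(n) for a Kaczmarz step:

```python
import numpy as np

def motzkin(A, b, x0, iters=1000):
    # Greedy variant: project onto the hyperplane of the most-violated equation.
    x = np.array(x0, dtype=float)
    for _ in range(iters):
        i = np.argmax(np.abs(A @ x - b))
        a = A[i]
        x += (b[i] - a @ x) / (a @ a) * a
    return x
```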

8. Our Hybrid Method (SKM)

Given $x_0 \in \mathbb{R}^n$:
1. Choose $\tau_k \subset [m]$ to be a sample of $\beta$ constraints chosen uniformly at random among the rows of $A$.
2. From the $\beta$ rows, choose $i_k := \operatorname{argmax}_{i \in \tau_k} |a_i^T x_{k-1} - b_i|$.
3. Define $x_k := x_{k-1} + \frac{b_{i_k} - a_{i_k}^T x_{k-1}}{\|a_{i_k}\|_2^2}\, a_{i_k}$.
4. Repeat.

[De Loera, H., Needell '17]
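A minimal sketch of SKM (names illustrative). Note that β = 1 reduces to uniform-sampling Kaczmarz and β = m reduces to Motzkin's method:

```python
import numpy as np

def skm(A, b, x0, beta, iters=1000, seed=None):
    # Sampling Kaczmarz-Motzkin: draw beta rows uniformly at random, then take
    # a Motzkin (greediest-row) projection step within that sample.
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for _ in range(iters):
        tau = rng.choice(A.shape[0], size=beta, replace=False)
        i = tau[np.argmax(np.abs(A[tau] @ x - b[tau]))]
        a = A[i]
        x += (b[i] - a @ x) / (a @ a) * a
    return x
```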

9. Experimental Convergence

⊲ $\beta$: sample size
⊲ $A$ is a $50000 \times 100$ Gaussian matrix; the system is consistent
⊲ 'faster' convergence per iteration for larger sample sizes (though each iteration also costs more)

[Figure: convergence curves for several sample sizes $\beta$]
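A sketch reproducing this setup as a usage example of the skm function above; the matrix dimensions come from the slide, while the iteration count and the grid of β values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50000, 100))  # Gaussian system from the slide
x_star = rng.standard_normal(100)
b = A @ x_star                         # consistent by construction

for beta in (1, 10, 100, 1000):        # illustrative sample sizes
    x = skm(A, b, np.zeros(100), beta=beta, iters=500, seed=1)
    print(beta, np.linalg.norm(x - x_star))
```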

10. Convergence Rates

Below are the convergence rates for the methods on a system $Ax = b$ that is consistent with unique solution $x$ and whose rows have been normalized to unit norm.

⊲ RK (Strohmer, Vershynin '09): $\mathbb{E}\,\|x_k - x\|_2^2 \le \left(1 - \frac{\sigma_{\min}^2(A)}{m}\right)^k \|x_0 - x\|_2^2$
⊲ MM (Agmon '54): $\|x_k - x\|_2^2 \le \left(1 - \frac{\sigma_{\min}^2(A)}{m}\right)^k \|x_0 - x\|_2^2$
⊲ SKM (De Loera, H., Needell '17): $\mathbb{E}\,\|x_k - x\|_2^2 \le \left(1 - \frac{\sigma_{\min}^2(A)}{m}\right)^k \|x_0 - x\|_2^2$

Why are these all the same?

11. A Pathological Example

[Figure: starting point $x_0$ in a worst-case system]

12. Structure of the Residual

Several works have used sparsity of the residual to improve the convergence rate of greedy methods. [De Loera, H., Needell '17], [Bai, Wu '18], [Du, Gao '19]

However, not much sparsity can be expected in most cases. Instead, we'd like to use the dynamic range of the residual to guarantee faster convergence:

$$\gamma_k := \frac{\sum_{\tau \in \binom{[m]}{\beta}} \|A_\tau x_k - b_\tau\|_2^2}{\sum_{\tau \in \binom{[m]}{\beta}} \|A_\tau x_k - b_\tau\|_\infty^2}$$

13. Accelerated Convergence Rate

Theorem (H., Ma 2019). Let $A$ be normalized so $\|a_i\|_2 = 1$ for all rows $i = 1, \ldots, m$. If the system $Ax = b$ is consistent with unique solution $x^*$, then the SKM method converges at least linearly in expectation, with a rate that depends on the dynamic range of the random sample of rows of $A$, $\tau_j$. Precisely, in the $(j+1)$st iteration of SKM, we have

$$\mathbb{E}_{\tau_j}\|x_{j+1} - x^*\|_2^2 \le \left(1 - \frac{\beta\,\sigma_{\min}^2(A)}{\gamma_j\, m}\right)\|x_j - x^*\|_2^2, \quad \text{where } \gamma_j := \frac{\sum_{\tau \in \binom{[m]}{\beta}} \|A_\tau x_j - b_\tau\|_2^2}{\sum_{\tau \in \binom{[m]}{\beta}} \|A_\tau x_j - b_\tau\|_\infty^2}.$$

14. Accelerated Convergence Rate

⊲ $A$ is a $50000 \times 100$ Gaussian matrix; consistent system
⊲ the bound uses the dynamic range of the sample of $\beta$ rows

[Figure: empirical convergence compared with the dynamic-range bound]

15. What can we say about $\gamma_j$?

Recall $\gamma_j := \frac{\sum_{\tau \in \binom{[m]}{\beta}} \|A_\tau x_j - b_\tau\|_2^2}{\sum_{\tau \in \binom{[m]}{\beta}} \|A_\tau x_j - b_\tau\|_\infty^2}$, and note that $1 \le \gamma_j \le \beta$.

Writing the per-iteration contraction as $\mathbb{E}_{\tau_k}\|x_k - x^*\|_2^2 \le \alpha\, \|x_{k-1} - x^*\|_2^2$:

Previous:
⊲ RK: $\alpha = 1 - \frac{\sigma_{\min}^2(A)}{m}$
⊲ SKM: $\alpha = 1 - \frac{\sigma_{\min}^2(A)}{m}$
⊲ MM: $1 - \frac{\sigma_{\min}^2(A)}{4} \le \alpha \le 1 - \frac{\sigma_{\min}^2(A)}{m}$

Current:
⊲ RK: $\alpha = 1 - \frac{\sigma_{\min}^2(A)}{m}$
⊲ SKM: $1 - \frac{\beta\,\sigma_{\min}^2(A)}{m} \le \alpha \le 1 - \frac{\sigma_{\min}^2(A)}{m}$
⊲ MM: $1 - \sigma_{\min}^2(A) \le \alpha \le 1 - \frac{\sigma_{\min}^2(A)}{m}$

[H., Needell 2019], [H., Ma 2019]

⊲ Nontrivial bounds on $\gamma_k$ are available for Gaussian and average consensus systems.
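The sums in $\gamma_j$ range over all $\binom{m}{\beta}$ subsets, which is intractable for large $m$. The sketch below (a hypothetical helper, not from the paper) estimates the ratio by Monte Carlo sampling of subsets; the bound $1 \le \gamma_j \le \beta$ holds because $\|r_\tau\|_\infty^2 \le \|r_\tau\|_2^2 \le \beta\,\|r_\tau\|_\infty^2$ for each $\tau$:

```python
import numpy as np

def dynamic_range(A, b, x, beta, n_samples=10_000, seed=None):
    # Monte Carlo estimate of gamma = (sum over beta-subsets tau of ||r_tau||_2^2)
    # / (sum of ||r_tau||_inf^2), sampling subsets uniformly at random.
    rng = np.random.default_rng(seed)
    r = A @ x - b
    num = den = 0.0
    for _ in range(n_samples):
        tau = rng.choice(len(r), size=beta, replace=False)
        num += np.sum(r[tau] ** 2)           # ||r_tau||_2^2
        den += np.max(np.abs(r[tau])) ** 2   # ||r_tau||_inf^2
    return num / den                         # always lies in [1, beta]
```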

16. Now can we determine the optimal $\beta$?

Roughly, if we know the value of $\gamma_j$, we can (just) do it.

17. Back to Hierarchical NMF

Compare:
⊲ hNMF (sequential NMF)
⊲ Deep NMF [Flenner, Hunter '18]
⊲ Neural NMF

18. Applications

19. Experimental results: synthetic data

⊲ Unsupervised reconstruction with two-layer structure ($k^{(0)} = 9$, $k^{(1)} = 4$)
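A sketch of synthetic data with this two-layer structure, reusing the q/forward sketch from earlier; only $k^{(0)} = 9$ and $k^{(1)} = 4$ come from the slide, and all other dimensions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
A0 = np.abs(rng.standard_normal((100, 9)))  # layer-0 dictionary, k(0) = 9 topics
A1 = np.abs(rng.standard_normal((9, 4)))    # layer-1 dictionary, k(1) = 4 supertopics
S1 = np.abs(rng.standard_normal((4, 500)))  # ground-truth deepest representation
X = A0 @ A1 @ S1                            # data with exact two-layer structure

# With the true dictionaries, the forward pass recovers the hierarchy exactly
# (up to NNLS tolerance), since each layer's residual can be driven to zero.
S0, S1_hat = forward(X, [A0, A1])
print(np.linalg.norm(X - A0 @ S0), np.linalg.norm(S1_hat - S1))
```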
