Investigation into a Parallel Singular Value Decomposition


  1. Investigation into a Parallel Singular Value Decomposition. Travis Askham, Steven Delong, Michael Lewis. Courant Institute, New York. December 18, 2012.

  2. The Singular Value Decomposition
     Given an m × n matrix A, its singular value decomposition is A = U Σ V^T, where U ∈ R^(m×m) and V ∈ R^(n×n) are orthogonal matrices and Σ ∈ R^(m×n) is diagonal with nonnegative entries. The SVD is ubiquitous in numerical computing.
     Features of the decomposition: the SVD can provide an optimal low-rank approximation to a matrix, and given an SVD you can form a pseudoinverse of your matrix.
     Example application areas: data compression, signal processing, pattern recognition, certain least squares problems.
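     For reference, the two standard facts behind those features: the pseudoinverse is A⁺ = V Σ⁺ U^T, where Σ⁺ inverts each nonzero σ_i and transposes the shape, so the minimum-norm least squares solution of Ax ≈ b is x = A⁺ b; and by the Eckart-Young theorem, truncating to the k largest singular values, A_k = σ_1 u_1 v_1^T + · · · + σ_k u_k v_k^T, gives the best rank-k approximation to A in the 2-norm.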

  3. The Bidiagonalization Step
     Given a matrix A, a bidiagonalization of A is a decomposition A = U B V^T where, again, U ∈ R^(m×m) and V ∈ R^(n×n) are orthogonal matrices, but the matrix B ∈ R^(m×n) is bidiagonal.
     What's the point: all further calculations can then be done on the sparse matrix B. The bidiagonalization can be done stably using Householder reflectors, and it is a finite-time algorithm (old fashioned).

  4. Householder Reflectors and the Bidiagonalization Algorithm
     Left Householder: applied from the left, a reflector zeros the current column below the diagonal (x̃ marks modified entries):

     [ x x x x ]      [ x̃ x̃ x̃ x̃ ]
     [ x x x x ]  →   [ 0 x̃ x̃ x̃ ]
     [ x x x x ]      [ 0 x̃ x̃ x̃ ]
     [ x x x x ]      [ 0 x̃ x̃ x̃ ]

     Right Householder: applied from the right, a reflector zeros the current row to the right of the superdiagonal:

     [ x x̃ x̃ x̃ ]      [ x x̃ 0 0 ]
     [ 0 x̃ x̃ x̃ ]  →   [ 0 x̃ x̃ x̃ ]
     [ 0 x̃ x̃ x̃ ]      [ 0 x̃ x̃ x̃ ]
     [ 0 x̃ x̃ x̃ ]      [ 0 x̃ x̃ x̃ ]

     Alternating left and right reflectors drives A to the bidiagonal matrix B.
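     A minimal C sketch of one left-reflector step (not the authors' code: row-major storage is assumed, and the helper name house_left is invented here):

#include <math.h>

/* Apply one left Householder reflector that zeros A[j+1..m-1][j],
 * updating columns j..n-1 of the m x n row-major matrix A.
 * A sketch only: real code would save v so U can be accumulated. */
static void house_left(double *A, int m, int n, int j)
{
    double v[m];                 /* C99 variable-length array */
    double norm = 0.0;
    for (int i = j; i < m; i++) norm += A[i*n + j] * A[i*n + j];
    norm = sqrt(norm);
    if (norm == 0.0) return;

    /* v = x + sign(x_j) ||x|| e_1 avoids cancellation */
    double alpha = (A[j*n + j] >= 0.0) ? norm : -norm;
    double vtv = 0.0;
    for (int i = j; i < m; i++) {
        v[i] = A[i*n + j] + ((i == j) ? alpha : 0.0);
        vtv += v[i] * v[i];
    }

    /* A <- (I - 2 v v^T / v^T v) A, one independent column at a time */
    for (int k = j; k < n; k++) {
        double vta = 0.0;
        for (int i = j; i < m; i++) vta += v[i] * A[i*n + k];
        double s = 2.0 * vta / vtv;
        for (int i = j; i < m; i++) A[i*n + k] -= s * v[i];
    }
}

     The per-column independence of the update loop is exactly what the next slide exploits.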

  5. Computational Considerations
     Features of the algorithm: the calculations are dense, the application of a Householder reflector is independent for each column, and the size of the working set reduces with each step.
     Coding the algorithm for a GPU: a reduction is needed to calculate the vector norm for each reflector, and there is both vector-level and matrix-level parallelism in applying the reflector (to do). A CPU analogue of the norm reduction is sketched below.
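     The norm reduction in OpenMP (my sketch; the slide has a GPU tree reduction in mind, but the pattern is the same):

#include <math.h>

/* 2-norm of column j of an m x n row-major matrix: the reduction
 * needed before each Householder reflector can be formed. */
double column_norm(const double *A, int m, int n, int j)
{
    double sum = 0.0;
    #pragma omp parallel for reduction(+ : sum)
    for (int i = j; i < m; i++)
        sum += A[i*n + j] * A[i*n + j];
    return sqrt(sum);
}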

  6. Formulation of Problem
     We consider the following equation: B = U Σ V^T for B ∈ R^(n×(n+1)), where

     B = [ b_11  b_21                    ]
         [       b_12  b_22              ]
         [             ...   ...         ]
         [                   b_1n  b_2n  ]

     is a bidiagonal matrix (b_1i on the diagonal, b_2i on the superdiagonal, zeros elsewhere).
     Our goal: to obtain the singular values (1) quickly, with (2) low memory usage.

  7. Reduction of Problem
     We notice that this system can be reduced to solving similar subproblems, namely

     B = [ B_1          0           ]
         [ b_1k e_k^T   b_2k e_1^T  ]
         [ 0            B_2         ]

     where (1) the B_i are themselves bidiagonal matrices, and (2) e_j denotes the standard unit vector with 1 in the j-th component.
     This is the fundamental design of any divide-and-conquer method. What makes this process different is that we maintain low memory usage by not retaining the singular vectors. A sketch of the recursive driver follows.
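     A hypothetical skeleton of that recursion in C (the names SubResult, base_case, and merge are mine, not the authors'):

/* Result for one bidiagonal block: its singular values plus the first
 * and last rows of its right singular vector matrix -- O(n) storage. */
typedef struct {
    double *sigma;   /* singular values of the block           */
    double *f, *l;   /* first and last rows of V for the block */
} SubResult;

SubResult base_case(const double *b1, const double *b2, int lo, int hi);
SubResult merge(const SubResult *L, const SubResult *R,
                double b1k, double b2k);

SubResult dc_svd(const double *b1, const double *b2, int lo, int hi)
{
    if (hi - lo < 32)                  /* small block: solve directly */
        return base_case(b1, b2, lo, hi);

    int k = (lo + hi) / 2;             /* split at row k */
    SubResult left  = dc_svd(b1, b2, lo, k - 1);
    SubResult right = dc_svd(b1, b2, k + 1, hi);

    /* Combine: build the arrowhead matrix M from the children's
     * results and the entries b1[k], b2[k] (next slides). */
    return merge(&left, &right, b1[k], b2[k]);
}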

  8. Formulation of Recursion
     Given that B_i = U_i (Σ_i 0) (V_i v_i)^T are the decompositions of the submatrices, it can be shown that B = Ũ (M 0) (Ṽ ṽ)^T, where

     M = [ r_0   b_1k l_1   b_2k f_2 ]
         [ 0     Σ_1        0        ]
         [ 0     0          Σ_2      ]

     and (1) l_1 is the last row of V_1, (2) f_2 is the first row of V_2, and (3) r_0 can be similarly derived from the results of the lower levels.

  9. Formulation of Recursion (cont'd)
     For simplicity, let

     z = ( r_0   b_1k l_1   b_2k f_2 ),   D = diag(d_i),   M = U_M (Σ 0) V_M^T,

     the d_i being 0 followed by the entries of Σ_1 and Σ_2. Observe that

     B = Ũ U_M (Σ 0) (Ṽ V_M  ṽ)^T = U (Σ 0) (V v)^T

     implies that the singular values of M are the singular values of B, the first row of V is f = f̃ V_M, and the last row of V is l = l̃ V_M (f̃ and l̃ being the first and last rows of Ṽ). These rows therefore require only O(n) space at any given time.
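     Assembling z and d from the children's results might look like this in C (a sketch under the same hypothetical SubResult layout as above; deflation and the sorting permutation from slide 11 are omitted):

/* Build the secular-equation inputs z and d for one merge node.
 * n1, n2 are the children's sizes; r0 comes from the lower levels. */
void build_secular_inputs(double r0, double b1k, double b2k,
                          const SubResult *L, int n1,
                          const SubResult *R, int n2,
                          double *z, double *d)
{
    z[0] = r0;
    d[0] = 0.0;                        /* d_1 = 0 by convention */
    for (int j = 0; j < n1; j++) {     /* b_1k times the last row of V_1 */
        z[1 + j] = b1k * L->l[j];
        d[1 + j] = L->sigma[j];
    }
    for (int j = 0; j < n2; j++) {     /* b_2k times the first row of V_2 */
        z[1 + n1 + j] = b2k * R->f[j];
        d[1 + n1 + j] = R->sigma[j];
    }
}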

  10. Conclusion of Recursion
     This implies that, by holding only (1) the singular values and (2) the first and last rows of V from the lower levels, we can backsolve for our desired singular values. (Assuming we can solve for Σ and V_M for a given M *cough*.)

  11. Solving for Σ, V_M
     Given our z and d = (d_1, d_2, . . . , d_n), with d_1 ≡ 0, we observe

     f(σ_i) = 1 + Σ_j  z_j² / (d_j² − σ_i²) = 0.

     This is known as the secular equation. Observe (assuming we permute the system so that the d_i are ordered) the interlacing

     0 < σ_1 < d_2 < σ_2 < . . . < σ_n.

     Given the σ_i, there is a closed-form solution for the columns v_i of V_M.
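     A minimal bisection solver for the secular equation (my sketch; a production solver would use rational interpolation for speed, but bisection is safe because f increases between consecutive poles):

#include <math.h>

/* f(s) = 1 + sum_j z_j^2 / (d_j^2 - s^2) */
static double secular(const double *z, const double *d, int n, double s)
{
    double f = 1.0;
    for (int j = 0; j < n; j++)
        f += z[j] * z[j] / (d[j] * d[j] - s * s);
    return f;
}

/* i-th singular value (0-based): the root of f in (d_i, d_{i+1}).
 * For the last root the interval is capped by sqrt(d_n^2 + ||z||^2),
 * a crude but safe upper bound on sigma_max. */
double secular_root(const double *z, const double *d, int n, int i)
{
    double lo = d[i], hi;
    if (i + 1 < n) {
        hi = d[i + 1];
    } else {
        double zz = 0.0;
        for (int j = 0; j < n; j++) zz += z[j] * z[j];
        hi = sqrt(d[n - 1] * d[n - 1] + zz);
    }
    for (int it = 0; it < 100; it++) {
        double mid = 0.5 * (lo + hi);
        /* f rises from -inf to +inf across the interval, so
         * f(mid) < 0 means the root lies to the right of mid */
        if (secular(z, d, n, mid) < 0.0) lo = mid;
        else                             hi = mid;
    }
    return 0.5 * (lo + hi);
}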

  12. Observation on Computation
     The singular values can be computed independently, and the individual components of f = f̃ V_M and l = l̃ V_M can be computed independently by calculating each column of V_M and taking the dot product. This suggests the above calculations are amenable to OpenMP (and they are); see the loop below.
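     The whole step is one parallel loop (a sketch: secular_root is defined above, while compute_fl_component stands in for the column-of-V_M dot products and is a made-up name):

/* Each i is independent: one root of the secular equation and one
 * column of V_M, which contributes one component to f and to l. */
#pragma omp parallel for schedule(dynamic)
for (int i = 0; i < n; i++) {
    sigma[i] = secular_root(z, d, n, i);
    compute_fl_component(z, d, n, i, sigma[i], &f[i], &l[i]);
}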

  13. Finding Singular Vectors
     Given the singular values σ_i, the right singular vectors are found using a "twisted factorization", which is essentially an inverse power iteration. The left singular vectors are then found from the right singular vectors with a matrix multiply. Once we have the singular vectors for B, we apply the Householder reflections to get U and V for A.
     Recall the inverse power iteration to find the eigenvector v of a matrix A corresponding to its smallest eigenvalue: pick an initial x_0, then repeatedly solve A x̃_{n+1} = x_n and normalize, x_{n+1} = x̃_{n+1} / ||x̃_{n+1}||.
     With a shift we can find eigenvectors of the eigenvalue closest to some μ, in our case σ_i². We solve

     (B^T B − σ_i² I) x̃ = γ_k e_k.

  14. Twisted Factorization
     The twisted factorization allows us to do an inverse power iteration accurately and quickly. First we compute the "forward" and "backward" factorizations for each singular value:

     B^T B − σ_i² I = P D_1 P^T = Q D_2 Q^T,

     where P is unit lower bidiagonal (ones on the diagonal, p_1, . . . , p_{m−1} on the subdiagonal), Q is unit upper bidiagonal (ones on the diagonal, q_1, . . . , q_{m−1} on the superdiagonal), and

     D_1 = diag(D1_1, D1_2, . . . , D1_m),   D_2 = diag(D2_1, D2_2, . . . , D2_m).

  15. Twisted Factorization 2
     Given the forward and backward factorizations of B^T B − σ_i² I, the twisted factorization at k is

     B^T B − σ_i² I = N_k D_k N_k^T,

     where N_k has ones on the diagonal, the subdiagonal entries p_1, . . . , p_{k−1} in rows 2 through k, and the superdiagonal entries q_k, . . . , q_{m−1} in rows k through m−1, and

     D_k = diag(D1_1, . . . , D1_{k−1}, γ_k, D2_{k+1}, . . . , D2_m),
     γ_k = D1_k + D2_k − (α_k² + β_{k−1}² − σ_i²),

     where α_k² + β_{k−1}² − σ_i² is the k-th diagonal entry of B^T B − σ_i² I (α and β being the diagonal and superdiagonal of B).
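     In code, the two factorizations are short recurrences (a sketch, with a and c denoting the diagonal and off-diagonal of the tridiagonal matrix B^T B − σ_i² I):

/* Forward LDL^t and backward UDU^t factorizations of the tridiagonal
 * T with diagonal a[0..m-1] and off-diagonal c[0..m-2], followed by
 * gamma_k for every twist position k.  The multipliers, if needed,
 * are p[i] = c[i]/D1[i] and q[i] = c[i]/D2[i+1]. */
void twisted_gammas(const double *a, const double *c, int m,
                    double *D1, double *D2, double *gamma)
{
    D1[0] = a[0];
    for (int i = 1; i < m; i++)
        D1[i] = a[i] - c[i-1] * c[i-1] / D1[i-1];

    D2[m-1] = a[m-1];
    for (int i = m - 2; i >= 0; i--)
        D2[i] = a[i] - c[i] * c[i] / D2[i+1];

    /* gamma_k = D1_k + D2_k - a_k: the slide's formula, since
     * a_k = alpha_k^2 + beta_{k-1}^2 - sigma_i^2 */
    for (int k = 0; k < m; k++)
        gamma[k] = D1[k] + D2[k] - a[k];
    /* the best twist index is argmin_k |gamma[k]| */
}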

  16. Solving the Twisted Factorization for Singular Vectors
     We solve the system N_k D_k N_k^T x̃ = γ_k e_k first, which, through the black magic of linear algebra, is equivalent to N_k^T x̃ = e_k. For additional accuracy (needed specifically with clustered singular values), we do one more backsolve, N_k D_k N_k^T x = x̃, to get our final right singular vector x for the reduced bidiagonal matrix. Then y = Bx / σ_i gives our left singular vector for the reduced bidiagonal matrix.
     We still need to include the effects of the Householder reflections from the bidiagonalization:

     A = H_1 B H_2^T = H_1 (Y Σ X^T) H_2^T.

     We apply the reflectors stored from the bidiagonalization part to the outputs X and Y.
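     Solving N_k^T x̃ = e_k needs no elimination at all, just two substitution sweeps outward from the twist (a sketch, with 0-based indices and the multipliers p, q from the previous slide):

/* x solves N_k^T x = e_k: start at the twist index k and sweep out. */
void solve_twisted(const double *p, const double *q, int m, int k,
                   double *x)
{
    x[k] = 1.0;
    for (int i = k - 1; i >= 0; i--)   /* upper sweep: x_i = -p_i x_{i+1} */
        x[i] = -p[i] * x[i+1];
    for (int i = k + 1; i < m; i++)    /* lower sweep: x_i = -q_{i-1} x_{i-1} */
        x[i] = -q[i-1] * x[i-1];
    /* normalize x afterwards; y = Bx / sigma_i then gives the
     * corresponding left singular vector */
}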

  17. Parallelization Ideas
     This process is completely independent for each σ.
     Idea 1: do the whole thing on the GPU. ABANDONED IMMEDIATELY: lots of O(n) sequential calculations per work item, and things have to be stored in global memory, with lots of global access.
     Idea 2: can we at least do the gamma calculation on the GPU? There are O(N²) independent entries. ABANDONED EVENTUALLY: way too much data transfer from the CPU to the device, O(N²), and not THAT many flops.
     Idea 3: OK, let's use OpenMP. This seemed to work.
