
Low-Rank Tensor Techniques for High-Dimensional Problems

Daniel Kressner
CADMOS Chair for Numerical Algorithms and HPC, MATHICSE, EPFL

Contents: What is a tensor? Applications. Matrices and low rank. CP and Tucker.


1. High-dimensional elliptic PDEs: 3D model problem

- Discretization of the 1D Laplace operator:
  $-\partial_{xx} \approx \begin{bmatrix} 2 & -1 & & \\ -1 & 2 & \ddots & \\ & \ddots & \ddots & -1 \\ & & -1 & 2 \end{bmatrix} =: A.$
- Application in each coordinate direction:
  $-\partial_{\xi_1\xi_1} u(\xi_1,\xi_2,\xi_3) \approx A \circ_1 U,\quad -\partial_{\xi_2\xi_2} u(\xi_1,\xi_2,\xi_3) \approx A \circ_2 U,\quad -\partial_{\xi_3\xi_3} u(\xi_1,\xi_2,\xi_3) \approx A \circ_3 U.$
- Hence $-\Delta u \approx A\circ_1 U + A\circ_2 U + A\circ_3 U$, or in vectorized form with $u = \mathrm{vec}(U)$:
  $-\Delta u \approx (I\otimes I\otimes A + I\otimes A\otimes I + A\otimes I\otimes I)\,u.$
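A minimal MATLAB sketch of this construction (assumed setup: n grid points per direction, the unscaled tridiagonal A from the slide; an actual finite-difference discretization would include a 1/h^2 factor):

    n = 10;
    e = ones(n, 1);
    A = spdiags([-e 2*e -e], -1:1, n, n);   % 1D discretization of -d^2/dx^2 (up to scaling)
    I = speye(n);
    % 3D Laplacian as a sum of three Kronecker products
    L3 = kron(I, kron(I, A)) + kron(I, kron(A, I)) + kron(A, kron(I, I));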

2. High-dimensional elliptic PDEs: 3D model problem

Finite difference discretization of the model problem
  $-\Delta u = f$ in $\Omega$, $u|_{\partial\Omega} = 0$, for $\Omega = [0,1]^3$,
takes the form
  $(I\otimes I\otimes A + I\otimes A\otimes I + A\otimes I\otimes I)\, u = f.$
Similar structure for finite element discretization with tensorized finite elements:
  $V\otimes W\otimes Z = \Big\{ \sum_{i,j,k} \alpha_{ijk}\, v_i(\xi_1)\, w_j(\xi_2)\, z_k(\xi_3) : \alpha_{ijk}\in\mathbb{R} \Big\}$
with $V = \{v_1(\xi_1),\dots,v_n(\xi_1)\}$, $W = \{w_1(\xi_2),\dots,w_n(\xi_2)\}$, $Z = \{z_1(\xi_3),\dots,z_n(\xi_3)\}$.
Galerkin discretization:
  $(K_V\otimes M_W\otimes M_Z + M_V\otimes K_W\otimes M_Z + M_V\otimes M_W\otimes K_Z)\, u = f,$
with 1D mass/stiffness matrices $M_V, M_W, M_Z$, $K_V, K_W, K_Z$.

3. High-dimensional elliptic PDEs: Arbitrary dimensions

Finite difference discretization of the model problem
  $-\Delta u = f$ in $\Omega$, $u|_{\partial\Omega} = 0$, for $\Omega = [0,1]^d$,
takes the form
  $\Big( \sum_{j=1}^{d} I\otimes\cdots\otimes I\otimes A\otimes I\otimes\cdots\otimes I \Big)\, u = f.$
To obtain such Kronecker structure in general, one needs:
- a tensorized domain;
- a highly structured grid;
- coefficients that can be written/approximated as a sum of separable functions.
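For general d, the Kronecker sum above can be assembled in a loop; a small sketch (assumptions: the same matrix A in every direction, and mode 1 taken as the rightmost Kronecker factor, matching the vectorization used in the 3D case):

    n = 6; d = 4;
    e = ones(n, 1);
    A = spdiags([-e 2*e -e], -1:1, n, n);
    I = speye(n);
    L = sparse(n^d, n^d);
    for j = 1:d
        T = 1;
        for k = d:-1:1                       % builds I x ... x A (mode j) x ... x I
            if k == j, T = kron(T, A); else, T = kron(T, I); end
        end
        L = L + T;
    end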

4. High-dimensional PDE eigenvalue problems

PDE eigenvalue problem in $\Omega = [0,1]^d$:
  $-\Delta u(\xi) + V(\xi)\, u(\xi) = \lambda\, u(\xi), \qquad u(\xi) = 0$ on $\partial\Omega.$
Assumption: the potential is represented as a sum of separable terms,
  $V(\xi) = \sum_{j=1}^{s} V_j^{(1)}(\xi_1)\, V_j^{(2)}(\xi_2) \cdots V_j^{(d)}(\xi_d).$
⇒ finite difference discretization
  $\mathcal{A} u = (A_L + A_V)\, u = \lambda u,$
with the 1D Laplace matrix $A$ from before and
  $A_L = \sum_{j=1}^{d} \underbrace{I\otimes\cdots\otimes I}_{d-j\ \text{times}} \otimes\, A \,\otimes \underbrace{I\otimes\cdots\otimes I}_{j-1\ \text{times}}, \qquad
   A_V = \sum_{j=1}^{s} A_{V,j}^{(d)} \otimes\cdots\otimes A_{V,j}^{(2)} \otimes A_{V,j}^{(1)}.$

5. Quantum many-body problems

- Spin-1/2 particles: proton, neutron, electron, and quark.
- Two states: spin-up, spin-down.
- Quantum state of a single spin represented by a vector in $\mathbb{C}^2$ (spinor).
- Quantum state of a system of $d$ spins represented by a vector in $\mathbb{C}^{2^d}$.
- Quantum mechanical operators expressed in terms of the Pauli matrices
  $P_x = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix},\quad P_y = \begin{bmatrix} 0 & -i \\ i & 0 \end{bmatrix},\quad P_z = \begin{bmatrix} 1 & 0 \\ 0 & -1 \end{bmatrix}.$
- Spin Hamiltonian: sum of Kronecker products of Pauli matrices and identities ⇒ each term describes a physical (inter)action of spins.
- Interaction of spins described by a graph.
- Goal: compute the ground state of the spin Hamiltonian.

6. Quantum many-body problems

Example: 1D chain of 5 spins (sites 1–5) with periodic boundary conditions.
Hamiltonian describing pairwise interaction between nearest neighbors:
  $H = P_z\otimes P_z\otimes I\otimes I\otimes I + I\otimes P_z\otimes P_z\otimes I\otimes I + I\otimes I\otimes P_z\otimes P_z\otimes I + I\otimes I\otimes I\otimes P_z\otimes P_z + P_z\otimes I\otimes I\otimes I\otimes P_z.$
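A small MATLAB sketch assembling this 5-spin Hamiltonian from Kronecker products (assumptions: the leftmost factor acts on site 1; the ground-state call uses a newer eigs option name, which may vary between MATLAB versions):

    Pz = sparse([1 0; 0 -1]);                % Pauli Z
    I2 = speye(2);
    d  = 5;
    pairs = [1 2; 2 3; 3 4; 4 5; 5 1];       % nearest neighbours, periodic closure 5-1
    H = sparse(2^d, 2^d);
    for p = 1:size(pairs, 1)
        factors = {I2, I2, I2, I2, I2};
        factors{pairs(p, 1)} = Pz;
        factors{pairs(p, 2)} = Pz;
        term = 1;
        for k = 1:d
            term = kron(term, factors{k});   % leftmost factor = site 1
        end
        H = H + term;
    end
    [v0, e0] = eigs(H, 1, 'smallestreal');   % ground state and ground-state energy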

7. Quantum many-body problems

- Ising (ZZ) model for a 1D chain of $d$ spins with open boundary conditions:
  $H = \sum_{k=1}^{d-1} I\otimes\cdots\otimes I\otimes P_z\otimes P_z\otimes I\otimes\cdots\otimes I + \lambda \sum_{k=1}^{d} I\otimes\cdots\otimes I\otimes P_x\otimes I\otimes\cdots\otimes I,$
  where $\lambda$ is the ratio between the strength of the magnetic field and the pairwise interactions.
- 1D Heisenberg (XY) model.
- Current research: 2D models.
- More details in:
  Huckle/Waldherr/Schulte-Herbrüggen: Computations in Quantum Tensor Networks.
  Schollwöck: The density-matrix renormalization group in the age of matrix product states.

8. Stochastic Automata Networks (SANs)

- 3 stochastic automata $A_1, A_2, A_3$, each with 3 states.
- Vector $x_t^{(i)} \in \mathbb{R}^3$ describes the probabilities of states (1), (2), (3) of $A_i$ at time $t$.
- No coupling between the automata ⇒ local transition $x_t^{(i)} \mapsto x_{t+1}^{(i)}$ described by a Markov chain:
  $x_{t+1}^{(i)} = E_i\, x_t^{(i)},$ with a stochastic matrix $E_i$.
- Stationary distribution of $A_i$ = Perron vector of $E_i$ (eigenvector for eigenvalue 1).

9. Stochastic Automata Networks (SANs)

- 3 stochastic automata $A_1, A_2, A_3$, each with 3 states.
- Coupling between the automata ⇒ the local transition $x_t^{(i)} \mapsto x_{t+1}^{(i)}$ is no longer described by a Markov chain.
- Need to consider all possible combinations of states of $(A_1, A_2, A_3)$: $(1,1,1), (1,1,2), (1,1,3), (1,2,1), (1,2,2), \dots$
- Vector $x_t \in \mathbb{R}^{3^3}$ (or tensor $\mathcal{X}(t)\in\mathbb{R}^{3\times 3\times 3}$) describes the probabilities of the combined states.

10. Stochastic Automata Networks (SANs)

- Transition $x_t \mapsto x_{t+1}$ described by a Markov chain: $x_{t+1} = E\, x_t$, with a large stochastic matrix $E$.
- Oversimplified example:
  $E = \underbrace{I\otimes I\otimes \widetilde{E}_1 + I\otimes \widetilde{E}_2\otimes I + \widetilde{E}_3\otimes I\otimes I}_{\text{local transitions}} + \underbrace{I\otimes E_{21}\otimes E_{12}}_{\text{interaction between } A_1, A_2} + \underbrace{E_{32}\otimes E_{23}\otimes I}_{\text{interaction between } A_2, A_3}.$
- Goal: compute the stationary distribution = Perron vector of $E$.
- More details in:
  Stewart: Introduction to the Numerical Solution of Markov Chains, Chapter 9.
  Buchholz: Product Form Approximations for Communicating Markov Processes.

11. Further applications

Other applications in scientific computing featuring low-rank tensor concepts:
- Boltzmann equation [Ibragimov/Rjasanow'2009].
- Dynamical systems [Koch/Lubich'2009].
- Parabolic PDEs [Andreev/Tobler'2011], [Khoromskij'2009].
- Stochastic PDEs [Khoromskij/Schwab'2010], [Matthies/Zander'2011], [Kressner/Tobler'2011], [Ballani/Grasedyck/Kluge'2011], ...
- Electronic structure calculation [Chinnamsetty et al.'2007], [Flad et al.'2009], [Khoromskij/Khoromskaja'2009], [Limpanuparb/Gill'2009], [Benedikt et al.'2011], [Mohlenkamp'2011], ...
- Evaluation of boundary integrals (in BEM): [Grasedyck], [Khoromskij/Sauter/Veit'2011].
- ...

12. Summary

- Large diversity of applications leading to linear systems / eigenvalue problems with Kronecker product structure.
- For many problems of practical interest: explicit storage / computation of the solution is infeasible.
- Increasing use of low-rank tensor techniques. Heaviest use currently: DMRG for quantum many-body problems.
- Remark: for PDE-related applications, high dimensionality can also be addressed during the discretization phase (sparse grids, adaptive sparse discretization, ...). This has advantages and disadvantages.

13. Approximate low-rank matrices

- Singular value decomposition
- Separability and low rank
- Separability by polynomial interpolation
- Separability by exponential sums
- Low rank of snapshot matrices

14. Low-rank approximation

Setting: matrix $X \in \mathbb{R}^{n\times m}$, with $m$ and $n$ too large to compute/store $X$ explicitly.
Idea: replace $X$ by $RS^T$ with $R\in\mathbb{R}^{n\times r}$, $S\in\mathbb{R}^{m\times r}$ and $r \ll m, n$.

            X              RS^T
  Memory    nm             nr + rm
  Cost      ops(m,n)       ops(m,n) × r / min{m,n}  (?)

Best approximation error:
  $\min\big\{ \|X - RS^T\|_2 : R\in\mathbb{R}^{n\times r},\, S\in\mathbb{R}^{m\times r} \big\} = \sigma_{r+1},$
with the singular values $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_{\min\{m,n\}}$ of $X$.

15. Construction from the singular value decomposition

SVD: let $X\in\mathbb{R}^{n\times m}$ and $k = \min\{m,n\}$. Then there exist orthonormal matrices
  $U = [u_1, u_2, \dots, u_k] \in \mathbb{R}^{n\times k}, \qquad V = [v_1, v_2, \dots, v_k] \in \mathbb{R}^{m\times k}$
such that
  $X = U \Sigma V^T, \qquad \Sigma = \mathrm{diag}(\sigma_1, \sigma_2, \dots, \sigma_k).$
Choose $r \le k$ and partition
  $X = [U_1, U_2] \begin{bmatrix} \Sigma_1 & 0 \\ 0 & \Sigma_2 \end{bmatrix} [V_1, V_2]^T = \underbrace{U_1 \Sigma_1}_{=:R}\, \underbrace{V_1^T}_{=:S^T} + U_2 \Sigma_2 V_2^T.$
Then $\|X - RS^T\|_2 = \|\Sigma_2\|_2 = \sigma_{r+1}$.
Good low-rank approximation if the singular values decay sufficiently fast.
Also: $\mathrm{span}(X) \approx \mathrm{span}(R)$, $\mathrm{span}(X^T) \approx \mathrm{span}(S)$.
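A minimal MATLAB sketch of this truncation (the test matrix is a hypothetical nearly-rank-15 example, not taken from the slides):

    n = 200; m = 150; r = 10;
    X = randn(n, 15) * randn(15, m) + 1e-8 * randn(n, m);   % nearly rank-15 test matrix
    [U, S, V] = svd(X, 'econ');
    R  = U(:, 1:r) * S(1:r, 1:r);            % R  = U_1 * Sigma_1
    St = V(:, 1:r)';                         % S' = V_1'
    err = norm(X - R * St);                  % spectral-norm error
    sigma_rp1 = S(r+1, r+1);                 % equals err up to roundoff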

16. Discretization of a bivariate function

- Bivariate function $f(x,y): [x_{\min}, x_{\max}] \times [y_{\min}, y_{\max}] \to \mathbb{R}$.
- Function values on the tensor grid $[x_1,\dots,x_m] \times [y_1,\dots,y_n]$:
  $F = \begin{bmatrix} f(x_1,y_1) & f(x_1,y_2) & \cdots & f(x_1,y_n) \\ f(x_2,y_1) & f(x_2,y_2) & \cdots & f(x_2,y_n) \\ \vdots & & & \vdots \\ f(x_m,y_1) & f(x_m,y_2) & \cdots & f(x_m,y_n) \end{bmatrix}.$
Basic but crucial observation: if $f(x,y) = g(x)\, h(y)$, then
  $F = \begin{bmatrix} g(x_1) h(y_1) & \cdots & g(x_1) h(y_n) \\ \vdots & & \vdots \\ g(x_m) h(y_1) & \cdots & g(x_m) h(y_n) \end{bmatrix}
     = \begin{bmatrix} g(x_1) \\ \vdots \\ g(x_m) \end{bmatrix} [\, h(y_1) \cdots h(y_n) \,].$
Separability implies rank 1.

17. Separability and low rank

Approximation by a sum of separable functions:
  $f(x,y) = \underbrace{g_1(x) h_1(y) + \cdots + g_r(x) h_r(y)}_{=: f_r(x,y)} + \text{error}.$
Define
  $F_r = \begin{bmatrix} f_r(x_1,y_1) & \cdots & f_r(x_1,y_n) \\ \vdots & & \vdots \\ f_r(x_m,y_1) & \cdots & f_r(x_m,y_n) \end{bmatrix}.$
Then $F_r$ has rank $\le r$ and $\|F - F_r\|_F \le \sqrt{mn}\times\text{error}$. Hence
  $\sigma_{r+1}(F) \le \|F - F_r\|_2 \le \|F - F_r\|_F \le \sqrt{mn}\times\text{error}.$
Semi-separable approximation implies low-rank approximation.
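As a quick numerical check of the rank statement above, a sketch with hypothetical separable factors g_j, h_j:

    m = 80; n = 60; r = 3;
    x = linspace(0, 1, m)';  y = linspace(0, 1, n)';
    Fr = zeros(m, n);
    for j = 1:r
        g = x.^j;                  % hypothetical g_j(x) = x^j
        h = cos(j * pi * y);       % hypothetical h_j(y) = cos(j*pi*y)
        Fr = Fr + g * h';          % rank-1 update g_j(x) * h_j(y)^T
    end
    rank(Fr)                       % at most r (here 3)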

18. Semi-separable approximation by polynomials

The approximation problem
  $f(x,y) = g_1(x) h_1(y) + \cdots + g_r(x) h_r(y) + \text{error}$
is not trivial to solve; $g_j, h_j$ can be chosen arbitrarily!
General construction by polynomial interpolation:
1. Lagrange interpolation of $f(x,y)$ in the $y$-coordinate:
   $I_y[f](x,y) = \sum_{j=1}^{r} f(x, \theta_j)\, L_j(y)$
   with Lagrange polynomials $L_j$ of degree $r-1$ on $[y_{\min}, y_{\max}]$.
2. Interpolation of $I_y[f]$ in the $x$-coordinate:
   $I_x[I_y[f]](x,y) = \sum_{i,j=1}^{r} f(\xi_i, \theta_j)\, L_i(x)\, L_j(y) \;\hat{=}\; \sum_{i=1}^{r} \hat L_{i,x}(x)\, \hat L_{i,y}(y),$
   where the matrix $[f(\xi_i, \theta_j)]_{i,j}$ is "diagonalized" by an SVD.

19. Semi-separable approximation by polynomials

Error bound:
  $\text{error} \le \|f - I_x[I_y[f]]\|_\infty = \|f - I_x[f] + I_x[f] - I_x[I_y[f]]\|_\infty \le \|f - I_x[f]\|_\infty + \|I_x\|_\infty\, \|f - I_y[f]\|_\infty,$
with Lebesgue constant $\|I_x\|_\infty \sim \log r$ when using Chebyshev interpolation nodes.
Polynomial interpolation error estimates are typically much too pessimistic.
- Lebesgue constants hit hard in high dimensions: $(\log r)^{d-1}$.
- Severe theoretical barriers for general smooth multivariate functions:
  E. Novak and H. Woźniakowski: Tractability of Multivariate Problems, Volumes I and II. EMS.

20. Semi-separable approximation of 1/(x+y)

Consider
  $f(x,y) = \frac{1}{x+y}, \qquad x, y \in [\alpha, \beta], \quad 0 < \alpha < \beta.$
Apply numerical quadrature to the integral representation
  $\frac{1}{z} = \int_0^\infty e^{-tz}\, \mathrm{d}t = \sum_{j=1}^{r} \omega_j\, e^{-\gamma_j z} + \text{error}.$
Inserting $z = x + y$:
  $\frac{1}{x+y} = \sum_{j=1}^{r} \omega_j\, e^{-\gamma_j (x+y)} + \text{error} = \sum_{j=1}^{r} \omega_j\, e^{-\gamma_j x}\, e^{-\gamma_j y} + \text{error}.$
Choosing nodes $\gamma_j > 0$ and weights $\omega_j > 0$ as in [Stenger'93, Braess'86, Braess/Hackbusch'05] yields
  $\text{error} \le \frac{8}{|\alpha|} \exp\Big( -\frac{r \pi^2}{\log(8\beta/\alpha)} \Big).$
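The exponential-sum bound predicts rapid singular value decay of the sampled matrix [1/(x_i + y_j)]; a short sketch to observe this numerically (interval and grid chosen arbitrarily):

    alpha = 1; beta = 10;
    x = linspace(alpha, beta, 200)';
    y = linspace(alpha, beta, 200)';
    F = 1 ./ (x + y');              % F(i,j) = 1/(x_i + y_j)
    s = svd(F);
    semilogy(s(1:30) / s(1), 'o-')  % exponential decay of the singular values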

21. Semi-separable approximation by exponential sums

- Consider the more general case of a function $f(x,y) := g(x+y)$.
- Approximation of $g(z)$ with $z := x+y$ by an exponential sum
    $g(z) \approx \sum_{j=1}^{r} \omega_j \exp(\gamma_j z)$   (1)
  for some coefficients $\gamma_j, \omega_j \in \mathbb{R}$.
- (1) gives a semi-separable approximation for $f$:
    $f(x,y) = g(x+y) \approx \sum_{j=1}^{r} \omega_j \exp(\gamma_j (x+y)) = \sum_{j=1}^{r} \omega_j \exp(\gamma_j x)\, \exp(\gamma_j y).$
- Naturally extends to arbitrarily many variables.
- Problem: (1) is a nontrivial approximation problem [Braess'1986], [Hackbusch'2006], ...

22. Low-rank approximation of snapshot matrices

Vector-valued function $x(\alpha): [\alpha_{\min}, \alpha_{\max}] \to \mathbb{R}^n$.
Sampling at $\alpha_1, \dots, \alpha_m \in [\alpha_{\min}, \alpha_{\max}]$ gives the snapshot matrix
  $X = [\, x(\alpha_1), x(\alpha_2), \dots, x(\alpha_m) \,].$

23. Example: Baking 1 cookie

Stationary heat equation with piecewise constant heat conductivity $\sigma(x,\alpha)$:
  $-\nabla\cdot(\sigma(x,\alpha)\, \nabla u) = f$ in $\Omega = [-1,1]^2$, $u = 0$ on $\partial\Omega$,
with
- $\sigma(\text{baking tray}) = 1$,
- $\sigma(\text{cookie}) = 1 + \alpha$,
- undetermined parameter $\alpha \in [\alpha_{\min}, \alpha_{\max}]$.
[Figure: FE mesh of the cookie domain; 455 vertices, 825 elements, 1279 edges.]
Standard FE discretization results in the linearly parameter-dependent linear system
  $(A_0 + \alpha A_1)\, x(\alpha) = b.$
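A sketch of the snapshot computation and its singular value decay; generic SPD stand-ins replace the actual FE matrices A_0, A_1 of the cookie problem, so only the qualitative behaviour is reproduced:

    n = 371; m = 101;                            % sizes as on the next slide
    rng(0);
    B  = randn(n); A0 = B * B' + n * eye(n);     % SPD stand-in for A_0
    C  = randn(n, 5); A1 = C * C';               % SPD stand-in for A_1
    b  = randn(n, 1);
    alphas = linspace(0, 10, m);
    X = zeros(n, m);
    for k = 1:m
        X(:, k) = (A0 + alphas(k) * A1) \ b;     % snapshot x(alpha_k)
    end
    s = svd(X);
    semilogy(s / s(1), 'o-')                     % rapid singular value decay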

24. Singular value decay – observation

- 1 cookie: $n = 371$, $m = 101$.
  [Figure: log10 of the singular values of the snapshot matrix; rapid decay over roughly 20 orders of magnitude within the first 100 indices.]
- Foundation of Proper Orthogonal Decomposition and Reduced Basis Methods.

25. Singular value decay – explanation

Polynomial approximation:
  $x(\alpha) = x_0 + \alpha x_1 + \alpha^2 x_2 + \cdots + \alpha^{k-1} x_{k-1} + \text{error}.$
Approximation error:
- Assume $b(\cdot)$, $A(\cdot)$ analytic ⇒ $x(\cdot)$ analytic.
- Then error $\lesssim \rho^{-k}$, where $\rho > 1$ depends on the domain of analyticity of $A$, $b$.
  (Proof: direct extension of the classical result for scalar-valued functions.)

26. Singular value decay – explanation

Polynomial approximation:
  $x(\alpha) = x_0 + \alpha x_1 + \alpha^2 x_2 + \cdots + \alpha^{k-1} x_{k-1} + \text{error}.$
Snapshot matrix:
  $X = [\, x(\alpha_1), x(\alpha_2), \dots, x(\alpha_m) \,]
     = [\, x_0, x_1, \dots, x_{k-1} \,]
       \begin{bmatrix} 1 & 1 & \cdots & 1 \\ \alpha_1 & \alpha_2 & \cdots & \alpha_m \\ \vdots & \vdots & & \vdots \\ \alpha_1^{k-1} & \alpha_2^{k-1} & \cdots & \alpha_m^{k-1} \end{bmatrix} + \text{error}$
     = matrix of rank $\le k$ + error.
Hence $\sigma_{k+1}(X) \le \text{error} \lesssim \rho^{-k}$.
Remark: trivially extends to the piecewise analytic case.

27. Singular value decay – piecewise analytic case

Example: consider the smallest singular value $\sigma(z)$ and the corresponding right singular vector $v(z)$ of $B(z) = A - \mathrm{i} z I$ for $z \in [-1, 1]$.
- $A$ = 2 × 2 block diagonal randn matrix, $n = 400$.
- $\sigma(z)$ is only Lipschitz continuous, but piecewise analytic.
- $v(z)$ is discontinuous, but piecewise analytic.
- Snapshot matrix of singular vectors: $X = [\, v(z_1), v(z_2), \dots, v(z_{100}) \,]$ for equidistant samples $z_j \in [-1, 1]$.
[Figure: left, $\sigma(z)$ over $z\in[-1,1]$ with a kink; right, singular values of $X$ decaying rapidly from about $10^0$ to $10^{-15}$.]

28. Summary

Need strong singular value decay for good low-rank approximations. For function-related matrices/tensors there is a strong link to semi-separable approximations. Smoothness seems to be important... at least somehow:
- Fortunately, smoothness is not necessary. Piecewise smoothness can be enough.
- Unfortunately, smoothness is not sufficient for higher-order tensors.
- Need to impose stronger regularity as the dimension/order $d$ increases, based, e.g., on mixed weak derivatives [Yserentant: Regularity and approximability of electronic wave functions, 2010].

29. Low-rank tensors: CP and Tucker

- CP
- Tucker
- Higher-order SVD
- Tensor networks

30. CP decomposition

- Aim: generalize the concept of low rank from matrices to tensors.
- One possibility, motivated by
    $X = [a_1, a_2, \dots, a_R]\, [b_1, b_2, \dots, b_R]^T = a_1 b_1^T + a_2 b_2^T + \cdots + a_R b_R^T,$
  or in vectorized form
    $\mathrm{vec}(X) = b_1\otimes a_1 + b_2\otimes a_2 + \cdots + b_R\otimes a_R.$
The Canonical Polyadic (CP) decomposition of a tensor $\mathcal{X}\in\mathbb{R}^{n_1\times n_2\times n_3}$ is defined via
  $\mathrm{vec}(\mathcal{X}) = c_1\otimes b_1\otimes a_1 + c_2\otimes b_2\otimes a_2 + \cdots + c_R\otimes b_R\otimes a_R$
for vectors $a_j\in\mathbb{R}^{n_1}$, $b_j\in\mathbb{R}^{n_2}$, $c_j\in\mathbb{R}^{n_3}$.
CP directly corresponds to semi-separable approximation.
Tensor rank of $\mathcal{X}$ = minimal possible $R$.
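A minimal sketch forming a third-order CP tensor from random factor vectors, following the vectorized definition above:

    n1 = 4; n2 = 5; n3 = 6; R = 3;
    A = randn(n1, R); B = randn(n2, R); C = randn(n3, R);   % a_j, b_j, c_j as columns
    x = zeros(n1 * n2 * n3, 1);
    for j = 1:R
        x = x + kron(C(:, j), kron(B(:, j), A(:, j)));      % c_j (x) b_j (x) a_j
    end
    X = reshape(x, [n1, n2, n3]);   % X(i1,i2,i3) = sum_j a_j(i1) * b_j(i2) * c_j(i3)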

31. CP decomposition

Illustration of the CP decomposition
  $\mathrm{vec}(\mathcal{X}) = c_1\otimes b_1\otimes a_1 + c_2\otimes b_2\otimes a_2 + \cdots + c_R\otimes b_R\otimes a_R.$
[Figure: the tensor X drawn as a sum of R rank-1 terms with factor vectors a_j, b_j, c_j.]

32. CP decomposition

- The CP decomposition offers low data complexity; for constant $R$: linear complexity in $d$.
- For matrices:
  - rank $r$ is upper semi-continuous ⇒ closedness property: a sequence of rank-$r$ matrices can only converge to a matrix of rank $\le r$;
  - the best low-rank approximation can be obtained by successive rank-1 approximations;
  - robust black-box algorithms/software available (svd, Lanczos).
- For tensors of order $d \ge 3$:
  - tensor rank $R$ is not upper semi-continuous ⇒ lack of closedness;
  - successive rank-1 approximations fail;
  - all algorithms are based on optimization techniques (ALS, Gauss-Newton).
[Picture taken from Kolda/Bader'2009.]

33. Tucker decomposition

- Aim: generalize the concept of low rank from matrices to tensors.
- Alternative possibility, motivated by
    $A = U \cdot \Sigma \cdot V^T, \quad U\in\mathbb{R}^{n_1\times r},\ V\in\mathbb{R}^{n_2\times r},\ \Sigma\in\mathbb{R}^{r\times r},$
  or in vectorized form
    $\mathrm{vec}(A) = (V\otimes U)\cdot \mathrm{vec}(\Sigma).$
  Ignore the diagonal structure of $\Sigma$ and call it $C$.
The Tucker decomposition of a tensor $\mathcal{X}\in\mathbb{R}^{n_1\times n_2\times n_3}$ is defined via
  $\mathrm{vec}(\mathcal{X}) = (W\otimes V\otimes U)\cdot \mathrm{vec}(\mathcal{C})$
with $U\in\mathbb{R}^{n_1\times r_1}$, $V\in\mathbb{R}^{n_2\times r_2}$, $W\in\mathbb{R}^{n_3\times r_3}$, and core tensor $\mathcal{C}\in\mathbb{R}^{r_1\times r_2\times r_3}$.
In terms of $\mu$-mode matrix products:
  $\mathcal{X} = U\circ_1 V\circ_2 W\circ_3\, \mathcal{C} =: (U, V, W)\circ \mathcal{C}.$

34. Tucker decomposition

Illustration of the Tucker decomposition $\mathcal{X} = (U, V, W)\circ \mathcal{C}$.
[Figure: the tensor X represented by the core tensor C multiplied by the factor matrices U, V, W.]

35. Tucker decomposition

Consider all three matricizations:
  $X^{(1)} = U\cdot C^{(1)}\cdot (W\otimes V)^T,\quad
   X^{(2)} = V\cdot C^{(2)}\cdot (W\otimes U)^T,\quad
   X^{(3)} = W\cdot C^{(3)}\cdot (V\otimes U)^T.$
These are low-rank decompositions:
  $\mathrm{rank}(X^{(1)}) \le r_1,\quad \mathrm{rank}(X^{(2)}) \le r_2,\quad \mathrm{rank}(X^{(3)}) \le r_3.$
The multilinear rank of a tensor $\mathcal{X}\in\mathbb{R}^{n_1\times n_2\times n_3}$ is the tuple $(r_1, r_2, r_3)$ with $r_i = \mathrm{rank}(X^{(i)})$.

36. Higher-order SVD (HOSVD)

Goal: approximate a given tensor $\mathcal{X}$ by a Tucker decomposition with prescribed multilinear rank $(r_1, r_2, r_3)$.
1. Calculate SVDs of the matricizations: $X^{(\mu)} = \widetilde U_\mu \widetilde\Sigma_\mu \widetilde V_\mu^T$ for $\mu = 1, 2, 3$.
2. Truncate the basis matrices: $U_\mu := \widetilde U_\mu(:, 1{:}r_\mu)$ for $\mu = 1, 2, 3$.
3. Form the core tensor: $\mathrm{vec}(\mathcal{C}) := (U_3^T\otimes U_2^T\otimes U_1^T)\cdot \mathrm{vec}(\mathcal{X}).$
Truncated tensor produced by the HOSVD [De Lathauwer/De Moor/Vandewalle'2000]:
  $\mathrm{vec}(\widetilde{\mathcal{X}}) := (U_3\otimes U_2\otimes U_1)\cdot \mathrm{vec}(\mathcal{C}).$
Remark: $\widetilde{\mathcal{X}} = \pi_1\circ\pi_2\circ\pi_3\, \mathcal{X}$ is an orthogonal projection, with $\pi_\mu \mathcal{X} := U_\mu U_\mu^T \circ_\mu \mathcal{X}$.
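A compact MATLAB sketch of the truncated HOSVD for a third-order tensor (plain arrays, no toolbox; forming the explicit Kronecker products is only feasible for small sizes):

    n = [20 22 24]; r = [5 6 7];
    X = randn(n);
    U = cell(1, 3);
    for mu = 1:3
        Xmu = reshape(permute(X, [mu, setdiff(1:3, mu)]), n(mu), []);  % mode-mu matricization
        [Utmp, ~, ~] = svd(Xmu, 'econ');
        U{mu} = Utmp(:, 1:r(mu));                                      % truncated basis
    end
    c  = kron(U{3}', kron(U{2}', U{1}')) * X(:);   % vec(C)  = (U3' x U2' x U1') vec(X)
    Xt = kron(U{3},  kron(U{2},  U{1} )) * c;      % vec(Xt) = (U3  x U2  x U1 ) vec(C)
    err = norm(X(:) - Xt);                         % quasi-optimal error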

37. Higher-order SVD (HOSVD)

The tensor $\widetilde{\mathcal{X}}$ resulting from the HOSVD satisfies the quasi-optimality condition
  $\|\mathcal{X} - \widetilde{\mathcal{X}}\| \le \sqrt{d}\, \|\mathcal{X} - \mathcal{X}_{\mathrm{best}}\|,$
where $\mathcal{X}_{\mathrm{best}}$ is the best approximation of $\mathcal{X}$ with multilinear ranks $(r_1, \dots, r_d)$.
Proof (for $d = 3$):
  $\|\mathcal{X} - \widetilde{\mathcal{X}}\|^2 = \|\mathcal{X} - (\pi_1\circ\pi_2\circ\pi_3)\mathcal{X}\|^2
   = \|\mathcal{X} - \pi_1\mathcal{X}\|^2 + \|\pi_1\mathcal{X} - (\pi_1\circ\pi_2)\mathcal{X}\|^2 + \|(\pi_1\circ\pi_2)\mathcal{X} - (\pi_1\circ\pi_2\circ\pi_3)\mathcal{X}\|^2
   \le \|\mathcal{X} - \pi_1\mathcal{X}\|^2 + \|\mathcal{X} - \pi_2\mathcal{X}\|^2 + \|\mathcal{X} - \pi_3\mathcal{X}\|^2.$
Using $\|\mathcal{X} - \pi_\mu\mathcal{X}\| \le \|\mathcal{X} - \mathcal{X}_{\mathrm{best}}\|$ for $\mu = 1, 2, 3$ leads to
  $\|\mathcal{X} - \widetilde{\mathcal{X}}\|^2 \le 3\cdot \|\mathcal{X} - \mathcal{X}_{\mathrm{best}}\|^2.$
Best approximation: see [Kolda/Bader'09].

38. Tucker decomposition – Summary

For general tensors:
- the multilinear rank is upper semi-continuous ⇒ closedness property;
- the HOSVD is a simple and robust algorithm to obtain a quasi-optimal low-rank approximation;
- quasi-optimality is good enough for most applications in scientific computing;
- robust black-box algorithms/software available (e.g., the Tensor Toolbox).
Drawback: storage of the core tensor scales like $r^d$ ⇒ curse of dimensionality.

39. Tensor network diagrams

A tensor network is an undirected graph in which:
- each node is a tensor;
- each outgoing (free) edge is a mode;
- each connecting edge represents a contraction; example:
    $\mathcal{Z}_{i_1, i_2, i_3, i_4} = \sum_{j=1}^{r} \mathcal{X}_{i_1, i_2, j}\, \mathcal{Y}_{j, i_3, i_4};$
- the number of free edges equals the order of the tensor represented by the entire network.
Researchers on quantum many-body problems think (and dream) in terms of tensor networks!

40. Tensor network diagrams

Examples of tensor network diagrams:
(i) vector; (ii) matrix; (iii) matrix-matrix multiplication; (iv) Tucker decomposition; (v) hierarchical Tucker decomposition.
[Figure: the corresponding five diagrams.]

41. Low-rank tensors: Hierarchical Tucker

- Introduction to the Hierarchical Tucker Decomposition (HTD)
- MATLAB toolbox htucker
- Basic operations: µ-mode matrix multiplication, addition, ...
- Advanced operations: inner product, elementwise multiplication, ...

42. Introduction

- CP offers low data complexity but difficult truncation.
- Tucker offers simple truncation but high data complexity.
Recently developed formats, aiming to offer a compromise between CP and Tucker:
- Matrix Product States (MPS),
- TT decomposition,
- Hierarchical Tucker decomposition (HTD).
Focus in this lecture: HTD.
- L. Grasedyck. Hierarchical singular value decomposition of tensors. SIAM J. Matrix Anal. Appl., 31(4):2029–2054, 2010.
- W. Hackbusch and S. Kühn. A new scheme for the tensor representation. J. Fourier Anal. Appl., 15(5):706–722, 2009.
- D. Kressner and C. Tobler. htucker – A MATLAB toolbox for the hierarchical Tucker decomposition. In preparation. See http://www.math.ethz.ch/~ctobler.

43. More general matricizations

Recall: the µ-mode matricization of a tensor $\mathcal{X}$ is
  $X^{(\mu)} \in \mathbb{R}^{n_\mu \times (n_1\cdots n_{\mu-1} n_{\mu+1}\cdots n_d)}, \quad \mu = 1, \dots, d.$
The notation is getting ugly... General matricization for a mode decomposition $\{1,\dots,d\} = t \cup s$:
  $X^{(t)} \in \mathbb{R}^{(n_{t_1}\cdots n_{t_k}) \times (n_{s_1}\cdots n_{s_{d-k}})}$
with
  $\big( X^{(t)} \big)_{(i_{t_1},\dots,i_{t_k}),\,(i_{s_1},\dots,i_{s_{d-k}})} := \mathcal{X}_{i_1,\dots,i_d}.$
[Figure: illustrations of the matricizations $X^{(1)}$ and $X^{(1,2)}$.]
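A small helper sketch for the general matricization (assumption: MATLAB's column-major ordering fixes how the row and column multi-indices are linearized, which may differ in detail from the ordering used in the slides):

    function Xt = matricize(X, t)
        % X^(t): modes in t form the row index, the remaining modes the column index.
        d = ndims(X);
        s = setdiff(1:d, t);
        n = size(X);
        Xt = reshape(permute(X, [t, s]), prod(n(t)), prod(n(s)));
    end
    % Example: for X of size 3x4x5x6 and t = [1 2], matricize(X, t) has size 12 x 30.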

44. Hierarchical construction

Singular value decomposition: $X^{(t)} = U_t\, \Sigma_t\, U_s^T$.
Column spaces are nested:
  $t = t_1 \cup t_2 \;\Rightarrow\; \mathrm{span}(U_t) \subset \mathrm{span}(U_{t_2}\otimes U_{t_1}) \;\Rightarrow\; \exists\, B_t:\; U_t = (U_{t_2}\otimes U_{t_1})\, B_t.$
Size of $U_t$: $U_t \in \mathbb{R}^{n_{t_1}\cdots n_{t_k}\times r_t}$ with $r_t = \mathrm{rank}(X^{(t)})$.
For $d = 4$:
  $U_{12} = (U_2\otimes U_1)\, B_{12}, \quad U_{34} = (U_4\otimes U_3)\, B_{34}, \quad \mathrm{vec}(\mathcal{X}) = X^{(1234)} = (U_{34}\otimes U_{12})\, B_{1234}$
  $\;\Rightarrow\; \mathrm{vec}(\mathcal{X}) = (U_4\otimes U_3\otimes U_2\otimes U_1)(B_{34}\otimes B_{12})\, B_{1234}.$

45. Dimension tree

Tree structure for $d = 4$:
  leaves $U_1\ (n_1\times r_1)$, $U_2\ (n_2\times r_2)$, $U_3\ (n_3\times r_3)$, $U_4\ (n_4\times r_4)$;
  transfer matrices $B_{12}\ (r_1 r_2\times r_{12})$, $B_{34}\ (r_3 r_4\times r_{34})$; root $B_{1234}\ (r_{12} r_{34}\times 1)$.
Reshape the transfer matrices into transfer tensors:
  $B_{12}\in\mathbb{R}^{r_1 r_2\times r_{12}} \Rightarrow \mathcal{B}_{12}\in\mathbb{R}^{r_1\times r_2\times r_{12}},\quad
   B_{34}\in\mathbb{R}^{r_3 r_4\times r_{34}} \Rightarrow \mathcal{B}_{34}\in\mathbb{R}^{r_3\times r_4\times r_{34}},\quad
   B_{1234}\in\mathbb{R}^{r_{12} r_{34}\times 1} \Rightarrow \mathcal{B}_{1234}\in\mathbb{R}^{r_{12}\times r_{34}}.$

46. Dimension tree

[Figure: dimension tree with leaves U_1, U_2, U_3, U_4, transfer tensors B_12, B_34, and root B_1234.]
- Often, $U_1, U_2, U_3, U_4$ are chosen orthonormal. This is advantageous but not required.
- Storage requirements for general $d$: $O(dnr) + O(dr^3)$, where $r = \max\{r_t\}$, $n = \max\{n_\mu\}$.

47. Constructors for the MATLAB class htensor

x = htensor([4 5 6 7]) constructs a zero htensor of size 4 × 5 × 6 × 7, with a balanced dimension tree.
x = htensor([4 5 6 7], 'TT') constructs a zero htensor of size 4 × 5 × 6 × 7, with a TT-style dimension tree.
x = htensor({U1, U2, U3}) constructs an htensor from a tensor in CP decomposition, $X(i_1,i_2,i_3) = \sum_j U_1(i_1,j)\, U_2(i_2,j)\, U_3(i_3,j)$.
x = htenrandn([4 5 6 7]) constructs an htensor of size 4 × 5 × 6 × 7, with random ranks and random entries.
x = htenones([4 5 6 7]) constructs an htensor of size 4 × 5 × 6 × 7, with all entries equal to one.
...

48. Basic functionality of the MATLAB class htensor

Example: x is an htensor of order 4.
x(1, 3, 4, 2) returns an entry of X.
x(1, 3, :, :) returns a slice of X as an htensor.
full(x) returns the full tensor represented by X (use with care).
disp_tree(htenrandn([5 4 6 3])) displays the dimension tree structure and the ranks: a root node {1,2,3,4}, interior nodes {1,2} and {3,4}, and leaf nodes 1–4, each annotated with its rank.
spy(x) displays spy plots of the $U_t$, $B_t$ on the dimension tree.
change_root(x, i) switches the root node.

49. Singular value tree

plot_sv(x) plots the singular values of the corresponding matricizations in the dimension tree of a tensor X.
Example: singular value tree of the solution to an elliptic PDE with 4 parameters.
[Figure: singular value plots arranged on the dimension tree, with nodes for dimensions {1,2}, {3,4,5}, {1}, {2}, {3}, {4,5}, {4}, {5}.]
Remark: the singular values are computed from Gramians.

50. Basic operations: µ-mode matrix multiplication

Application of a matrix $A\in\mathbb{R}^{m\times n_\mu}$ to mode $\mu$ of $\mathcal{X}\in\mathbb{R}^{n_1\times\cdots\times n_d}$:
  $\mathcal{Y} = A\circ_\mu \mathcal{X} \;\Leftrightarrow\; Y^{(\mu)} = A\, X^{(\mu)}.$
Nearly trivial if $\mathcal{X}$ is in H-Tucker format:
  $A\circ_\mu \mathcal{X} = A\circ_\mu \big( (U_1,\dots,U_d)\circ\mathcal{C} \big) = (U_1,\dots,U_{\mu-1}, A U_\mu, U_{\mu+1},\dots,U_d)\circ\mathcal{C}.$
- Almost no operations required.
- Ranks stay the same.
- Orthogonality is destroyed.
ttm(x, A, 2) applies the matrix A to the htensor X in mode 2.
y = ttm(x, {A, B, C}, [2, 3, 4]) applies A, B, C in modes 2, 3, 4.
y = ttm(x, @(x)(fft(x)), 2) applies the FFT in mode 2.
y = ttm(x, {A, B, C}, [2, 3, 4], 'h') successively applies the matrices A^T, B^T, C^T in modes 2, 3, 4.

51. Addition of low-rank matrices

Addition of two matrices in low-rank format, $A = U_1 \Sigma_A U_2^T$ and $B = V_1 \Sigma_B V_2^T$:
  $A + B = [\, U_1\ V_1 \,] \begin{bmatrix} \Sigma_A & 0 \\ 0 & \Sigma_B \end{bmatrix} [\, U_2\ V_2 \,]^T.$
- No operations required.
- The rank increases.
- Orthogonality is destroyed.
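A short sketch of this factored addition, checked against the explicit sum on a probe vector:

    n = 500; m = 400; rA = 5; rB = 7;
    U1 = randn(n, rA); U2 = randn(m, rA); SA = diag(rand(rA, 1));
    V1 = randn(n, rB); V2 = randn(m, rB); SB = diag(rand(rB, 1));
    W1 = [U1, V1];  S = blkdiag(SA, SB);  W2 = [U2, V2];   % factors of A + B
    v = randn(m, 1);
    err = norm(W1 * (S * (W2' * v)) - (U1 * (SA * (U2' * v)) + V1 * (SB * (V2' * v))));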

52. Addition of low-rank tensors

Addition of four tensors $\mathcal{X}_1, \mathcal{X}_2, \mathcal{X}_3, \mathcal{X}_4$ in H-Tucker format: $\mathcal{X}_1 + \mathcal{X}_2 + \mathcal{X}_3 + \mathcal{X}_4$.
Proceed as in the matrix case by embedding the factors in larger matrices.
- No operations required.
- The H-Tucker rank increases.
- Orthogonality is destroyed.
Command in htucker: x1 + x2 + x3 + x4

53. Addition of low-rank tensors (continued)

[Figure: dimension trees of the four summands, with leaf matrices U_µ^[k] and transfer tensors B_t^[k] for k = 1, ..., 4 placed block-wise into the factors of the sum.]

54. Orthogonalization

Any tensor $\mathcal{X}$ in H-Tucker format can be orthogonalized, in the sense that all factors in the dimension tree, except for the root node, have orthonormal columns.
Example:
  $\mathrm{vec}(\mathcal{X}) = (U_4\otimes U_3\otimes U_2\otimes U_1)(B_{34}\otimes B_{12})\, B_{1234}.$
Step 1: QR decompositions $U_t = Q_t R_t$ ⇒
  $\mathrm{vec}(\mathcal{X}) = (Q_4\otimes Q_3\otimes Q_2\otimes Q_1)(\widetilde B_{34}\otimes \widetilde B_{12})\, B_{1234}$
  with $\widetilde B_{34} := (R_4\otimes R_3) B_{34}$, $\widetilde B_{12} := (R_2\otimes R_1) B_{12}$.
Step 2: QR decompositions $\widetilde B_{34} = Q_{34} R_{34}$, $\widetilde B_{12} = Q_{12} R_{12}$ ⇒
  $\mathrm{vec}(\mathcal{X}) = (Q_4\otimes Q_3\otimes Q_2\otimes Q_1)(Q_{34}\otimes Q_{12})\, \widetilde B_{1234}$
  with $\widetilde B_{1234} := (R_{34}\otimes R_{12}) B_{1234}$.
Computational requirements for general $d$: $O(dnr^2) + O(dr^4)$.
Command in htucker: x = orthog(x)

55. Norms and inner products

Inner product of two tensors $\mathcal{X}, \mathcal{Y}\in\mathbb{R}^{n_1\times\cdots\times n_d}$:
  $\langle \mathcal{X}, \mathcal{Y}\rangle = \langle \mathrm{vec}(\mathcal{X}), \mathrm{vec}(\mathcal{Y})\rangle = \sum_{i_1=1}^{n_1}\cdots\sum_{i_d=1}^{n_d} x_{i_1,\dots,i_d}\, y_{i_1,\dots,i_d}.$
Can be performed efficiently in H-Tucker, provided that $\mathcal{X}$, $\mathcal{Y}$ have compatible dimension trees.
Example for two tensors of order 4:
  $\langle \mathcal{X}, \mathcal{Y}\rangle = (B^x_{1234})^T (B^x_{34}\otimes B^x_{12})^T (U^x_4\otimes U^x_3\otimes U^x_2\otimes U^x_1)^T (U^y_4\otimes U^y_3\otimes U^y_2\otimes U^y_1)(B^y_{34}\otimes B^y_{12})\, B^y_{1234}.$
Norm: after $\mathcal{X}$ has been orthogonalized,
  $\|\mathcal{X}\| = \sqrt{\langle\mathcal{X},\mathcal{X}\rangle} = \|B^x_{12\cdots d}\|_F.$
Possibly the most accurate way to compute the norm. Used in norm(x).

56. Computation of inner products

  $\langle \mathcal{X}, \mathcal{Y}\rangle = \sum_{i_1=1}^{n_1}\cdots\sum_{i_d=1}^{n_d} x_{i_1,\dots,i_d}\, y_{i_1,\dots,i_d}.$
[Figure: the two dimension trees joined at their leaves, ready to be contracted.]

57.–60. Computation of inner products

[Figures only: step-by-step graphical illustration of the successive tree contractions used to evaluate the inner product.]

61. Computation of inner products – contraction step

At an interior node $t$ with children $t_1, t_2$, the contraction
  $(U^x_t)^T U^y_t = (B^x_t)^T \big( (U^x_{t_2})^T U^y_{t_2} \otimes (U^x_{t_1})^T U^y_{t_1} \big)\, B^y_t$
propagates the small matrices $(U^x_{t_i})^T U^y_{t_i}$ up the tree.
- htucker command: innerprod(x, y)
- Overall cost: $O(dnr^2) + O(dr^4)$.

62. Reduced Gramians in H-Tucker

  $X^{(t)} = U_t V_t^T \;\Rightarrow\; X^{(t)} \big(X^{(t)}\big)^T = U_t \underbrace{V_t^T V_t}_{=: G_t} U_t^T.$
If $U_t$ is orthonormal ⇒ the singular values of $X^{(t)}$ are the square roots of the eigenvalues of the reduced Gramian $G_t$ (used in plot_sv).

63.–67. Reduced Gramians in H-Tucker

[Figures only: step-by-step graphical illustration of the recursive computation of the reduced Gramians in the dimension tree.]

68. Reduced Gramians in H-Tucker

Implemented in the htucker command gramians(x).

69. Advanced operations

- Truncation
- Combined addition + truncation
- Elementwise multiplication
- Elementwise reciprocal

70. Truncation of an explicit tensor

Let $\mathcal{X}\in\mathbb{R}^{n_1\times n_2\times\cdots\times n_d}$ be explicitly given.
- For each tree node $t$, let $W_t$ contain the $r_t$ dominant left singular vectors of $X^{(t)}$, and define the projection
    $\pi_t \mathcal{X} = W_t W_t^T \circ_t \mathcal{X} \;\Leftrightarrow\; \pi_t X^{(t)} = W_t W_t^T X^{(t)}.$
- Truncated tensor:
    $\widetilde{\mathcal{X}} := \Big( \prod_{t\in\mathcal{T}_L} \pi_t \Big) \cdots \Big( \prod_{t\in\mathcal{T}_1} \pi_t \Big)\, \mathcal{X},$
  where $\mathcal{T}_\ell$ contains all nodes on level $\ell$.
- [Grasedyck'2010]: $\|\mathcal{X} - \widetilde{\mathcal{X}}\| \le \sqrt{2d-3}\, \|\mathcal{X} - \mathcal{X}_{\mathrm{best}}\|$. Proof similar to the HOSVD case.

71. Truncation of an explicit tensor

Example:
  $\mathrm{vec}(\widetilde{\mathcal{X}}) = (W_4 W_4^T\otimes W_3 W_3^T\otimes W_2 W_2^T\otimes W_1 W_1^T)(W_{34} W_{34}^T\otimes W_{12} W_{12}^T)\, \mathrm{vec}(\mathcal{X})$
  $= (W_4\otimes W_3\otimes W_2\otimes W_1)\big( \underbrace{[W_4^T\otimes W_3^T]\, W_{34}}_{=: B_{34}} \otimes \underbrace{[W_2^T\otimes W_1^T]\, W_{12}}_{=: B_{12}} \big)\, \underbrace{[W_{34}^T\otimes W_{12}^T]\, \mathrm{vec}(\mathcal{X})}_{=: B_{1234}}.$
Truncation options:
  opts.max_rank = 10    maximal rank at truncation.
  opts.rel_eps = 1e-6   maximal relative truncation error.
  opts.abs_eps = 1e-6   maximal absolute truncation error.
The condition max_rank takes precedence over rel_eps and abs_eps.
xt = htensor.truncate_rtl(x, opts) returns the truncated tensor $\widetilde{\mathcal{X}}$ of a multidimensional array.
Remark: there is also a significantly faster htensor.truncate_ltr (which proceeds successively from the leaves to the root), for which the same error bound holds [Tobler'10].

72. Truncation of an H-Tucker tensor

Let $\mathcal{X}\in\mathbb{R}^{n_1\times n_2\times\cdots\times n_d}$ be in H-Tucker format and orthogonalized.
- Compute the left singular vectors of $X^{(t)} = U_t V_t^T$ from the eigenvectors of
    $X^{(t)}\big(X^{(t)}\big)^T = U_t \underbrace{V_t^T V_t}_{= G_t} U_t^T,$
  with the reduced Gramian $G_t$. If $S_t$ contains the $r_t$ dominant eigenvectors of $G_t$ ⇒ $W_t = U_t S_t$.
- Traverse the tree from the root to the leaves. In each step, insert $S_t S_t^T$ at node $t$: the parent transfer tensor is multiplied by $S_t^T$ in the mode corresponding to $t$, and $B_t$ is multiplied by $S_t$ in its parent mode ($B_{t_p} \leftarrow S_t^T \circ B_{t_p}$, $B_t \leftarrow S_t \circ B_t$).
- In htucker: truncate(x, opts). Complexity $O(dnr^2 + dr^4)$.
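A usage sketch combining the htucker commands listed in these slides (assuming the toolbox is installed and the calls behave as documented here; sizes and tolerances are arbitrary):

    x = htenrandn([30 30 30 30]);    % random htensor of order 4
    y = htenones([30 30 30 30]);
    z = x + y + x;                   % addition: H-Tucker ranks add up
    opts.max_rank = 15;              % truncation options as on slide 71
    opts.rel_eps  = 1e-8;
    z = truncate(z, opts);           % recompress
    nrm = norm(z);                   % norm (computed after orthogonalization)
    ip  = innerprod(x, y);           % inner product of two htensors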

73. Combined addition + truncation

Sum of more than two tensors: $\mathcal{Y} = \mathcal{X}_1 + \mathcal{X}_2 + \cdots + \mathcal{X}_s$.
Two possibilities to incorporate the truncation operator $\mathcal{T}$:
1. $\mathcal{Y} \approx \mathcal{T}(\mathcal{X}_1 + \mathcal{X}_2 + \mathcal{X}_3 + \cdots + \mathcal{X}_s)$
2. $\mathcal{Y} \approx \mathcal{T}(\cdots(\mathcal{T}(\mathcal{T}(\mathcal{X}_1 + \mathcal{X}_2) + \mathcal{X}_3) + \cdots + \mathcal{X}_s)$
Option 2 is usually significantly cheaper but may suffer from severe cancellation.
Artificial example: $\mathcal{X}_1, \mathcal{X}_2, \mathcal{X}_3 \in \mathbb{R}^{101\times 101\times 101}$, truncated tensor grid discretizations of the summands of
  $f(x_1, x_2, x_3) = \tan(x_1 + x_2 + x_3) + (x_1 + x_2 + x_3)^{-1} - \tan(x_1 + x_2 + x_3).$
Error(Option 1) ≈ $10^{-7}$. Error(Option 2) ≈ 1.3.
What is wrong with Option 1?

74. Combined addition + truncation

[Figure: block structure of the summed H-Tucker factors, with leaf matrices U_µ^[k] and transfer tensors B_t^[k] of the summands arranged block-wise.]
- Orthogonalization (needed before truncation) destroys the block diagonal structure.
- Complexity $O(dns^2r^2 + ds^4r^4)$ for $s$ summands.

75. Combined addition + truncation

Idea: a new variant delays orthogonalization to keep the block diagonal structure in the transfer tensors as long as possible.
Reduces $O(dns^2r^2 + ds^4r^4)$ to $O(dns^2r^2 + ds^2r^4 + ds^3r^3)$.
[Figure: runtime vs. number of summands for standard truncation, truncation of the sum, and successive truncation, with reference slopes O(t), O(t^2), O(t^4).]
- htucker command: add_truncate(x1, x2, x3, x4, opts).

76. Elementwise multiplication

Elementwise multiplication (also called Hadamard or Schur product) of two low-rank matrices $A = U_1 \Sigma_A U_2^T$, $B = V_1 \Sigma_B V_2^T$:
  $A \star B = (U_1 \mathbin{\tilde\odot} V_1)(\Sigma_A\otimes\Sigma_B)(U_2 \mathbin{\tilde\odot} V_2)^T,$
with the row-wise Khatri-Rao product
  $C \mathbin{\tilde\odot} D = \begin{bmatrix} c_1^T \\ \vdots \\ c_n^T \end{bmatrix} \mathbin{\tilde\odot} \begin{bmatrix} d_1^T \\ \vdots \\ d_n^T \end{bmatrix} = \begin{bmatrix} c_1^T\otimes d_1^T \\ \vdots \\ c_n^T\otimes d_n^T \end{bmatrix}.$
- Orthogonality is destroyed.
- The rank increases significantly. But: the singular value decay of $\Sigma_A\otimes\Sigma_B$ may become significantly stronger ⇒ additional opportunities for truncation.
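A sketch verifying this formula with a hypothetical row-wise Khatri-Rao helper (krt is not an htucker function; it is defined inline here):

    n = 50; m = 40; rA = 3; rB = 4;
    U1 = randn(n, rA); U2 = randn(m, rA); SA = diag(rand(rA, 1));
    V1 = randn(n, rB); V2 = randn(m, rB); SB = diag(rand(rB, 1));
    A = U1 * SA * U2';  B = V1 * SB * V2';
    % Row-wise Khatri-Rao: row i of krt(C, D) equals kron(C(i,:), D(i,:)).
    krt = @(C, D) repelem(C, 1, size(D, 2)) .* repmat(D, 1, size(C, 2));
    H = krt(U1, V1) * kron(SA, SB) * krt(U2, V2)';   % equals A .* B
    err = norm(H - A .* B, 'fro');                   % ~ machine precision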

77. Elementwise multiplication

Elementwise multiplication of two tensors $\mathcal{X}$, $\mathcal{Y}$ in H-Tucker format:
- row-wise Khatri-Rao products of the leaf matrices;
- "Kronecker products" of the non-leaf transfer tensors;
- optional: the products are only formed after suitable truncation, to avoid excessive memory requirements.
Commands in htucker:
  x.*y (without truncation)
  x.^2 (without truncation)
  elem_mult(x, y, opt) (with truncation)
