Low Rank Approximation Lecture 8
Daniel Kressner Chair for Numerical Algorithms and HPC Institute of Mathematics, EPFL daniel.kressner@epfl.ch
Linear operators in TT decomposition
Consider a linear operator L : R^{n1×···×nd} → R^{n1×···×nd} and the corresponding matrix representation L((i1, . . . , id), (j1, . . . , jd)). L is a TT operator if it takes the form

L((i1, . . . , id), (j1, . . . , jd)) = ∑_{k1=1}^{R1} · · · ∑_{kd−1=1}^{Rd−1} A1(1, i1, j1, k1) A2(k1, i2, j2, k2) · · · Ad(kd−1, id, jd, 1)

with Rµ−1 × nµ × nµ × Rµ tensors Aµ. The operator TT rank of L is the smallest possible tuple (R1, . . . , Rd−1). Matrix Product Operator (MPO):

L((i1, . . . , id), (j1, . . . , jd)) = A1(i1, j1) A2(i2, j2) · · · Ad(id, jd),  with Aµ(iµ, jµ) = Aµ(:, iµ, jµ, :).
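As a quick illustration (not part of the original slides; function and variable names are made up), a minimal NumPy sketch of evaluating a single entry of an MPO from its cores Aµ:

```python
import numpy as np

def mpo_entry(cores, i, j):
    """Evaluate L((i1,...,id),(j1,...,jd)) for an MPO given as a list of cores
    A_mu of shape (R_{mu-1}, n_mu, n_mu, R_mu) with R_0 = R_d = 1."""
    M = cores[0][:, i[0], j[0], :]            # 1 x R_1 matrix A_1(i_1, j_1)
    for A, i_mu, j_mu in zip(cores[1:], i[1:], j[1:]):
        M = M @ A[:, i_mu, j_mu, :]           # multiply by A_mu(i_mu, j_mu)
    return M[0, 0]

# random MPO with d = 3, n_mu = 4, operator TT ranks (2, 3)
rng = np.random.default_rng(0)
R, n = [1, 2, 3, 1], 4
cores = [rng.standard_normal((R[m], n, n, R[m + 1])) for m in range(3)]
print(mpo_entry(cores, (0, 1, 2), (3, 0, 1)))
```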
Alternative view: Reinterpret the matrix representation of L as a tensor ten(L) ∈ R^{n1^2×···×nd^2} by merging the indices (iµ, jµ), applying the TT format, and separating the indices (iµ, jµ) again. In practice: perfect shuffle permutation of modes. Tensor network diagram for operator TT decomposition (figure omitted).
EFY. Show that the identity matrix has operator TT ranks 1.
Example: A discrete Laplace-like operator takes the form L : X → L1 ◦1 X + · · · + Ld ◦d X for matrices Lµ ∈ R^{nµ×nµ}. Matrix representation:

L = Ind ⊗ · · · ⊗ In2 ⊗ L1 + · · · + Ld ⊗ Ind−1 ⊗ · · · ⊗ In1.

TT representation with cores

A1(i1, j1) = [ L1(i1, j1)  In1(i1, j1) ],
Aµ(iµ, jµ) = [ Inµ(iµ, jµ)  0 ; Lµ(iµ, jµ)  Inµ(iµ, jµ) ]  for µ = 2, . . . , d − 1,
Ad(id, jd) = [ Ind(id, jd) ; Ld(id, jd) ],

i.e., operator TT ranks (2, . . . , 2).
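To make the rank-2 structure concrete, here is a small NumPy sketch (illustrative code, not from the lecture) that builds these cores for given Lµ and checks the resulting MPO against the Kronecker-sum matrix for d = 3:

```python
import numpy as np
from functools import reduce

def laplace_like_mpo(Ls):
    """MPO cores of L = I(x)...(x)L1 + ... + Ld(x)...(x)I (operator TT ranks 2)."""
    d = len(Ls)
    cores = []
    for mu, L in enumerate(Ls):
        n = L.shape[0]
        I = np.eye(n)
        if mu == 0:                      # A1(i,j) = [ L1(i,j)  I(i,j) ]
            A = np.zeros((1, n, n, 2))
            A[0, :, :, 0], A[0, :, :, 1] = L, I
        elif mu == d - 1:                # Ad(i,j) = [ I(i,j) ; Ld(i,j) ]
            A = np.zeros((2, n, n, 1))
            A[0, :, :, 0], A[1, :, :, 0] = I, L
        else:                            # Amu(i,j) = [[ I, 0 ], [ Lmu, I ]]
            A = np.zeros((2, n, n, 2))
            A[0, :, :, 0], A[1, :, :, 0], A[1, :, :, 1] = I, L, I
        cores.append(A)
    return cores

# check against the explicit Kronecker-sum matrix for d = 3, n = 3
rng = np.random.default_rng(0)
Ls = [rng.standard_normal((3, 3)) for _ in range(3)]
I3 = np.eye(3)
L_full = (np.kron(I3, np.kron(I3, Ls[0])) + np.kron(I3, np.kron(Ls[1], I3))
          + np.kron(Ls[2], np.kron(I3, I3)))
cores = laplace_like_mpo(Ls)
L_mpo = np.zeros_like(L_full)
for idx in np.ndindex(3, 3, 3):
    for jdx in np.ndindex(3, 3, 3):
        M = reduce(np.matmul, [cores[m][:, idx[m], jdx[m], :] for m in range(3)])
        row = idx[0] + 3 * idx[1] + 9 * idx[2]   # i1 varies fastest, as in kron above
        col = jdx[0] + 3 * jdx[1] + 9 * jdx[2]
        L_mpo[row, col] = M[0, 0]
print(np.allclose(L_full, L_mpo))                # True
```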
Consider the matrix-vector product x̃ = Lx or, equivalently, X̃ = L(X). Assume that L is given in operator TT decomposition with cores Aµ and X in TT decomposition with cores Uµ. Then

X̃(i1, . . . , id) = ∑_{j1,...,jd} L((i1, . . . , id), (j1, . . . , jd)) X(j1, . . . , jd)
= ∑_{j1,...,jd} A1(i1, j1) · · · Ad(id, jd) · U1(j1) · · · Ud(jd)
= ∑_{j1,...,jd} (A1(i1, j1) ⊗ U1(j1)) · · · (Ad(id, jd) ⊗ Ud(jd))
= ( ∑_{j1} A1(i1, j1) ⊗ U1(j1) ) · · · ( ∑_{jd} Ad(id, jd) ⊗ Ud(jd) )
=: Ũ1(i1) · · · Ũd(id),

that is, the TT ranks of X̃ are bounded by the TT ranks of X multiplied with the operator TT ranks of L.
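A minimal sketch (illustrative helper, with cores stored as NumPy arrays) of how the cores of X̃ arise as Kronecker products of operator cores and TT cores:

```python
import numpy as np

def apply_mpo_to_tt(op_cores, tt_cores):
    """Cores of X~ = L(X): the mu-th core has slices
    X~_mu(i) = sum_j A_mu(i, j) (x) U_mu(j)."""
    new_cores = []
    for A, U in zip(op_cores, tt_cores):
        R0, n, _, R1 = A.shape            # operator core: (R_{mu-1}, n, n, R_mu)
        r0, _, r1 = U.shape               # TT core:       (r_{mu-1}, n, r_mu)
        W = np.zeros((R0 * r0, n, R1 * r1))
        for i in range(n):
            W[:, i, :] = sum(np.kron(A[:, i, j, :], U[:, j, :]) for j in range(n))
        new_cores.append(W)
    return new_cores
```

Since the ranks multiply (to Rµ rµ, with the merged index ordered as in np.kron), the result is typically recompressed afterwards.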
Illustration by tensor network diagrams (figure omitted).
◮ Carrying out the contractions requires O(dnr^4) operations for tensors of order d.
Easy:
◮ (partial) contractions
◮ multiplication with TT operators
◮ recompression/truncation
Medium:
◮ entrywise products
Hard:
◮ almost everything else
Software:
◮ TT-Toolbox (Matlab, Python), TTeMPS (Matlab), . . .
General setting: Solve the optimization problem

min_X f(X),

where X is a (large) matrix or tensor and f is “simple” (e.g., convex). Constrain X to Mr, the set of rank-r matrices or tensors, and aim at solving

min_{X ∈ Mr} f(X).

Set X = i(U1, U2, . . . , Ud) (e.g., X = U1 U2^T). Low-rank formats are multilinear, so there is hope that optimizing for each component is simple:

min_{Uµ} f(i(U1, U2, . . . , Ud)).
Set f(U1, . . . , Ud) := f(i(U1, . . . , Ud)). ALS:
1: while not converged do
2:   U1 ← arg min_{U1} f(i(U1, U2, . . . , Ud))
3:   U2 ← arg min_{U2} f(i(U1, U2, . . . , Ud))
4:   . . .
5:   Ud ← arg min_{Ud} f(i(U1, U2, . . . , Ud))
6: end while
Examples:
◮ ALS for fitting the CP decomposition
◮ subspace iteration
Closely related: Block Gauss-Seidel, Block Coordinate Descent. Difficulties:
◮ The representation (U1, U2, . . . , Ud) is often non-unique; parameters may become unbounded.
◮ Mr is not closed.
◮ Convergence (analysis).
Given A ∈ R^{m×n}, consider the computation of a best rank-r approximation:

min_{U ∈ R^{m×r}, V ∈ R^{n×r}} f(U, V),   f(U, V) := ‖A − UV^T‖_F^2.

◮ For fixed V (resp. U) of rank r, the factor U (resp. V) in the representation UV^T is uniquely determined.
◮ f is convex with respect to U and V individually.

⟨∇U f(U, V), H⟩ = f(U + H, V) − f(U, V) + O(‖H‖_2^2) = ⟨−2(AV − UV^T V), H⟩.

Hence, 0 = ∇U f(U, V) = −2(AV − UV^T V)  ⇔  U = AV(V^T V)^{−1}. For stability it is advisable to choose V such that it has orthonormal columns.
ALS for low-rank matrix approximation:
1: while not converged do
2:   Compute economy-size QR factorization V = QR and update V ← Q.
3:   U ← AV
4:   Compute economy-size QR factorization U = QR and update U ← Q.
5:   V ← A^T U
6: end while
Returns an approximation A ≈ UV^T. This is the subspace iteration from Lecture 1!
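A minimal NumPy sketch of this ALS (equivalently, subspace iteration); the fixed number of sweeps and the comparison with the SVD error are illustrative choices, not from the lecture:

```python
import numpy as np

def als_lowrank(A, r, n_sweeps=50):
    """ALS / subspace iteration for A ~ U V^T with rank r."""
    m, n = A.shape
    rng = np.random.default_rng(0)
    V = rng.standard_normal((n, r))
    for _ in range(n_sweeps):
        V, _ = np.linalg.qr(V)      # orthonormalize V (economy-size QR)
        U = A @ V                   # U <- arg min_U ||A - U V^T||_F for this V
        U, _ = np.linalg.qr(U)      # orthonormalize U
        V = A.T @ U                 # V <- arg min_V ||A - U V^T||_F for this U
    return U, V                     # A ~ U V^T

# compare against the best rank-r error from the SVD
A = np.random.default_rng(1).standard_normal((100, 80))
U, V = als_lowrank(A, r=5)
err = np.linalg.norm(A - U @ V.T)
best = np.linalg.norm(np.linalg.svd(A, compute_uv=False)[5:])
print(err, best)
```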
EFY. Develop an ALS method for solving the weighted low-rank approximation problem min_{U,V} ‖WL(A − UV^T)WR‖_F with square and invertible matrices WL, WR.
For a linear operator L : R^{m×n} → R^{m×n}, consider the linear system L(X) = C with C, X ∈ R^{m×n}. Examples:¹
◮ Sylvester matrix equation:
  AX + XB = C,  A ∈ R^{m×m}, B ∈ R^{n×n}, C, X ∈ R^{m×n}.
  Applications: discretized 2D Laplace on a rectangle, stability analysis, optimal control, model reduction of linear control systems. Special case Lyapunov equation: m = n, A = B^T, C symmetric (and often negative semi-definite).
◮ Stochastic Galerkin methods in uncertainty quantification.
◮ Stochastic control.
¹See [V. Simoncini, Computational methods for linear matrix equations, SIAM Rev., 58 (2016), pp. 377–441] for details and references.
Using the matrix ML representing L in canonical bases, we can rewrite L(X) = C as the linear system ML vec(X) = vec(C). Assumption: ML has low Kronecker rank:

ML = B1 ⊗ A1 + · · · + BR ⊗ AR,  R ≪ m, n.

Equivalently, L(X) = A1 X B1^T + · · · + AR X BR^T.
EFY. Develop a variant of ACA (from Lecture 3) that aims at approximating a given sparse matrix A by a matrix of low Kronecker rank for given m, n.
EFY. Show that if m = n, ML is symmetric and has Kronecker rank R, one can find symmetric matrices A1, . . . , AR, B1, . . . , BR such that L(X) = A1 X B1 + · · · + AR X BR. Is it always possible to choose all Ak, Bk positive semi-definite if ML is positive definite?
Two ways of turning L(X) = C into an optimization problem:

min_X 1/2 ⟨L(X), X⟩ − ⟨X, C⟩,

min_X ‖L(X) − C‖_F^2.

We will focus on spd ML in the following.
A low-rank approximation to the solution of L(X) = C is obtained by solving

min_{U,V} f(U, V)

for f(U, V) = 1/2 ⟨L(UV^T), UV^T⟩ − ⟨UV^T, C⟩. Let L have Kronecker rank R, with Ak, Bk symmetric (see the EFY above). Then

⟨L(UV^T), UV^T⟩ = ∑_{k=1}^{R} ⟨Ak U V^T Bk, U V^T⟩ = ∑_{k=1}^{R} ⟨Ak U V^T Bk V, U⟩.

This shows that arg min_U f(U, V) is the solution of the linear matrix equation

A1 U (V^T B1 V) + · · · + AR U (V^T BR V) = CV.

EFY. Show that this linear matrix equation always has a unique solution under the assumption that L is symmetric positive definite.
◮ For R = 2, this can be reduced to r linear systems of size m × m.
◮ For R > 2, an mr × mr linear system needs to be solved.
ALS for linear matrix equations:
1: while not converged do
2:   Compute economy-size QR factorization V = QR and update V ← Q.
3:   Solve A1 U (V^T B1 V) + · · · + AR U (V^T BR V) = CV for U.
4:   Compute economy-size QR factorization U = QR and update U ← Q.
5:   Solve (U^T A1 U) V^T B1 + · · · + (U^T AR U) V^T BR = U^T C for V.
6: end while
Returns an approximation X ≈ UV^T. For R = 2, there are better alternatives: ADI, Krylov subspace methods, ... [Simoncini’2016].
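For moderate m and r, the reduced equation in step 3 can be solved by brute-force vectorization. A sketch, assuming the Kronecker factors Ak, Bk and the matrices C, V are given as NumPy arrays (illustrative, not the recommended approach for large problems):

```python
import numpy as np

def solve_reduced_for_U(As, Bs, C, V):
    """Solve sum_k A_k U (V^T B_k V) = C V for U by vectorization,
    using vec(A U M) = (M^T kron A) vec(U). Cost O((m r)^3)."""
    m = As[0].shape[0]
    r = V.shape[1]
    M = sum(np.kron((V.T @ Bk @ V).T, Ak) for Ak, Bk in zip(As, Bs))
    rhs = (C @ V).reshape(-1, order="F")          # vec(CV), column-major
    return np.linalg.solve(M, rhs).reshape((m, r), order="F")
```

For R = 2, the reduction to r shifted linear systems noted above is preferable.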
◮ −Δu(x) + V(x)u(x) = λu(x) in Ω = [0, 1] × [0, 1] with Dirichlet b.c. and Henon-Heiles potential V
◮ Regular discretization
◮ Ground state reshaped into a matrix
Figure: ground state and its singular values. Excellent rank-10 approximation possible.
Consider a symmetric n^2 × n^2 matrix A. Then

λmin(A) = min_{x ≠ 0} ⟨x, Ax⟩ / ⟨x, x⟩.

We now...
◮ reshape the vector x into an n × n matrix X;
◮ reinterpret Ax as a linear operator A : X → A(X).
Consider a symmetric n^2 × n^2 matrix A. Then

λmin(A) = min_{X ≠ 0} ⟨X, A(X)⟩ / ⟨X, X⟩

with the matrix inner product ⟨·, ·⟩. We now...
◮ restrict X to low-rank matrices.
Consider a symmetric n^2 × n^2 matrix A. Then

λmin(A) ≈ min_{X = UV^T ≠ 0} ⟨X, A(X)⟩ / ⟨X, X⟩.

◮ The approximation error is governed by the low-rank approximability of X.
◮ Solved by Riemannian optimization techniques or ALS.
ALS for solving λmin(A) ≈ min_{X = UV^T ≠ 0} ⟨X, A(X)⟩ / ⟨X, X⟩. Initially:
◮ fix target rank r
◮ choose U, V ∈ R^{n×r} randomly, such that V has orthonormal columns
λ̃ − λ = 6 × 10^3, residual = 3 × 10^3
ALS for solving λmin(A) ≈ min_{X = UV^T ≠ 0} ⟨X, A(X)⟩ / ⟨X, X⟩. Fix V, optimize for U:

⟨X, A(X)⟩ = vec(UV^T)^T A vec(UV^T) = vec(U)^T (V ⊗ I)^T A (V ⊗ I) vec(U).

Compute the smallest eigenvalue of the reduced rn × rn matrix (V ⊗ I)^T A (V ⊗ I). Note: The computation of the reduced matrix benefits from the Kronecker structure of A.
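A sketch of this microstep, assuming A is symmetric and given in Kronecker form A = B1 ⊗ A1 + · · · + BR ⊗ AR, so that the reduced matrix can be formed as ∑_k (V^T Bk V) ⊗ Ak (function names are made up for illustration):

```python
import numpy as np

def reduced_matrix(As, Bs, V):
    """(V kron I)^T A (V kron I) for A = sum_k B_k kron A_k, using the
    mixed-product property: the result is sum_k (V^T B_k V) kron A_k."""
    return sum(np.kron(V.T @ Bk @ V, Ak) for Ak, Bk in zip(As, Bs))

def als_microstep_U(As, Bs, V):
    """Fix V (orthonormal columns), optimize U: smallest eigenpair of the
    reduced rn x rn matrix; U is recovered by reshaping the eigenvector."""
    n = As[0].shape[0]
    r = V.shape[1]
    H = reduced_matrix(As, Bs, V)
    w, Z = np.linalg.eigh(H)                      # H is symmetric
    return Z[:, 0].reshape((n, r), order="F")     # vec(U) -> U; lambda = w[0]
```

The analogous step for V uses (I ⊗ U)^T A (I ⊗ U) = ∑_k Bk ⊗ (U^T Ak U).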
ALS for solving λmin(A) ≈ min_{X = UV^T ≠ 0} ⟨X, A(X)⟩ / ⟨X, X⟩. Fix V, optimize for U.
λ̃ − λ = 2 × 10^3, residual = 2 × 10^3
ALS for solving λmin(A) ≈ min_{X = UV^T ≠ 0} ⟨X, A(X)⟩ / ⟨X, X⟩. Orthonormalize U, fix U, optimize for V:

⟨X, A(X)⟩ = vec(UV^T)^T A vec(UV^T) = vec(V^T)^T (I ⊗ U)^T A (I ⊗ U) vec(V^T).

Compute the smallest eigenvalue of the reduced rn × rn matrix (I ⊗ U)^T A (I ⊗ U). Note: The computation of the reduced matrix benefits from the Kronecker structure of A.
ALS for solving λmin(A) ≈ min_{X = UV^T ≠ 0} ⟨X, A(X)⟩ / ⟨X, X⟩. Orthonormalize U, fix U, optimize for V.
λ̃ − λ = 1.5 × 10^{-7}, residual = 7.7 × 10^{-3}
ALS for solving λmin(A) ≈ min_{X = UV^T ≠ 0} ⟨X, A(X)⟩ / ⟨X, X⟩. Orthonormalize V, fix V, optimize for U.
λ̃ − λ = 1 × 10^{-12}, residual = 6 × 10^{-7}
ALS for solving λmin(A) ≈ min_{X = UV^T ≠ 0} ⟨X, A(X)⟩ / ⟨X, X⟩. Orthonormalize U, fix U, optimize for V.
λ̃ − λ = 7.6 × 10^{-13}, residual = 7.2 × 10^{-8}
Recall the interface matrices X≤µ ∈ R^{n1 n2 ··· nµ × rµ} and X≥µ+1 ∈ R^{nµ+1 nµ+2 ··· nd × rµ}, yielding the factorization

X^{<µ>} = X≤µ X≥µ+1^T,  µ = 1, . . . , d − 1.

Combined with the recursion

X≥µ^T = Uµ^R (X≥µ+1^T ⊗ Inµ),

this yields

X^{<µ−1>} = X≤µ−1 X≥µ^T = X≤µ−1 Uµ^R (X≥µ+1^T ⊗ Inµ).

Hence,

vec(X) = (X≥µ+1 ⊗ Inµ ⊗ X≤µ−1) vec(Uµ).

This formula allows us to pull out the µth core!
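A small NumPy check of this formula on a random TT tensor (illustrative; vec is taken column-major so that i1 varies fastest, and the boundary cases µ = 1, d would use 1 × 1 interface matrices):

```python
import numpy as np

def contract(cores):
    """Contract a list of TT cores (r_left, n, r_right) into one array
    of shape (r_left, n_i, ..., n_j, r_right)."""
    T = cores[0]
    for G in cores[1:]:
        T = np.tensordot(T, G, axes=([-1], [0]))
    return T

rng = np.random.default_rng(0)
n, r = [2, 3, 4, 3], [1, 2, 3, 2, 1]
cores = [rng.standard_normal((r[m], n[m], r[m + 1])) for m in range(4)]
mu = 2                                    # pull out the third core (0-based index)

# interface matrices: rows ordered with the earliest physical index fastest
X_leq = contract(cores[:mu])[0].reshape(-1, r[mu], order="F")
X_geq = np.moveaxis(contract(cores[mu + 1:])[..., 0], 0, -1)
X_geq = X_geq.reshape(-1, r[mu + 1], order="F")

vecX = contract(cores)[0, ..., 0].reshape(-1, order="F")       # i_1 fastest
P = np.kron(X_geq, np.kron(np.eye(n[mu]), X_leq))              # X_{>=mu+1} (x) I (x) X_{<=mu-1}
vecU = cores[mu].reshape(-1, order="F")
print(np.allclose(vecX, P @ vecU))                             # True
```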
A TT decomposition is called µ-orthogonal if

(Uν^L)^T Uν^L = Irν,  X≤ν^T X≤ν = Irν  for ν = 1, . . . , µ − 1,

and

Uν^R (Uν^R)^T = Irν−1,  X≥ν^T X≥ν = Irν−1  for ν = µ + 1, . . . , d.

This implies that X≥µ+1 ⊗ Inµ ⊗ X≤µ−1 has orthonormal columns! Consider the eigenvalue problem

λmin(A) = min_{X ≠ 0} ⟨X, A(X)⟩ / ⟨X, X⟩.

Optimizing for the µth core:

min_{Uµ ≠ 0} ⟨X, A(X)⟩ / ⟨X, X⟩ = min_{Uµ ≠ 0} ⟨vec Uµ, Aµ vec Uµ⟩ / ⟨vec Uµ, vec Uµ⟩

with the rµ−1 nµ rµ × rµ−1 nµ rµ matrix

Aµ = (X≥µ+1 ⊗ Inµ ⊗ X≤µ−1)^T A (X≥µ+1 ⊗ Inµ ⊗ X≤µ−1).
◮ Uµ is obtained as an eigenvector belonging to the smallest eigenvalue of Aµ.
◮ The computation of Aµ for large d is only feasible if A has low operator TT ranks (and is given in operator TT decomposition).
◮ One microstep of ALS optimizes Uµ and prepares for the next core by adjusting the orthogonalization.
◮ One sweep of ALS consists of processing the cores twice: once from left to right and once from right to left.
Input: X in right-orthogonal TT decomposition.
1: for µ = 1, 2, . . . , d − 1 do
2:   Compute Aµ and replace core Uµ by an eigenvector belonging to the smallest eigenvalue of Aµ.
3:   Compute QR decomposition Uµ^L = QR.
4:   Set Uµ^L ← Q.
5:   Update Uµ+1 ← R ◦1 Uµ+1.
6: end for
7: for µ = d, d − 1, . . . , 2 do
8:   Compute Aµ and replace core Uµ by an eigenvector belonging to the smallest eigenvalue of Aµ.
9:   Compute QR decomposition (Uµ^R)^T = QR.
10:  Set Uµ^R ← Q^T.
11:  Update Uµ−1 ← R ◦3 Uµ−1.
12: end for
Remarks:
◮ The “small” matrix Aµ quickly gets large as the TT ranks increase ⇒ iterative methods (e.g., Lanczos, LOBPCG), possibly combined with preconditioning [Kressner/Tobler’2011], are needed for solving the eigenvalue problems.
◮ In ALS the TT ranks of X need to be chosen a priori. Adaptive rank choice: merge two neighboring cores, optimize the merged core, and split the optimized merged core again ⇒ DMRG, modified ALS.
◮ The principles of ALS easily extend to other optimization problems, e.g., linear systems [Holtz/Rohwedder/Schneider’2012].
ALS convergence (figure: err_lambda, res, and nr_iter versus execution time [s]). Size = 128^10 ≈ 10^21. Maximal TT rank 40. See [Kressner/Steinlechner/Uschmajew’2014] for details.
ALS convergence (figure: err_lambda, res, and nr_iter versus execution time [s]). Size = 128^20 ≈ 10^42. Maximal TT rank 40.
ALS convergence (figure: err_lambda, res, and nr_iter versus execution time [s]). Size = 128^20 ≈ 10^42. Maximal TT rank 30.