Low Rank Approximation Lecture 4
Daniel Kressner Chair for Numerical Algorithms and HPC Institute of Mathematics, EPFL daniel.kressner@epfl.ch
Sampling based approximation

Aim: Obtain rank-r approximation of m × n matrix A from selected entries of A. Two different situations:
◮ Unstructured sampling: Let Ω ⊂ {1, . . . , m} × {1, . . . , n}. Solve

  min_{B,C} ‖A − BC^T‖_Ω,  where ‖M‖_Ω² := Σ_{(i,j)∈Ω} m_ij².
Matrix completion problem solved by general optimization techniques (ALS, Riemannian optimization, convex relaxation). Will discuss later.
◮ Column/row sampling:
Focus of this lecture.
Selecting "good" rows of a matrix is a classical problem already considered by Knuth.¹ Quantification of "good": smallest singular value of the selected submatrix not too small. Some notation:
◮ Given an m × n matrix A and index sets

  I = {i1, . . . , ik}, 1 ≤ i1 < i2 < · · · < ik ≤ m,
  J = {j1, . . . , jℓ}, 1 ≤ j1 < j2 < · · · < jℓ ≤ n,

  we let

  A(I, J) = [a_{i1,j1} · · · a_{i1,jℓ}; . . . ; a_{ik,j1} · · · a_{ik,jℓ}] ∈ R^{k×ℓ}.

  The full index set is denoted by :, e.g., A(I, :).
◮ | det A| denotes the volume of a square matrix A.
¹Knuth, Donald E. Semioptimal bases for linear dependencies. Linear and Multilinear Algebra 17 (1985), no. 1, 1–4.
Lemma (maximal volume). Let U ∈ R^{n×r} have orthonormal columns and let the index set I, #I = r, be chosen such that |det(U(I, :))| is maximized among all r × r submatrices of U. Then

  1/σmin(U(I, :)) = ‖U(I, :)^{-1}‖₂ ≤ √(1 + r(n − r)).
Proof.² W.l.o.g. I = {1, . . . , r}. Consider

  ˜U = UU(I, :)^{-1} = [I_r; B].

Because of det ˜U(J, :) = det U(J, :)/det U(I, :) for any submatrix with #J = r, ˜U(I, :) = I_r has maximal volume among all r × r submatrices of ˜U.
²Following Lemma 2.1 in [Goreinov, S. A.; Tyrtyshnikov, E. E.; Zamarashkin, N. L. A theory of pseudoskeleton approximations. Linear Algebra Appl. 261 (1997), 1–21].
Maximality of ˜U(I, :) implies max |b_ij| ≤ 1. Argument: if there were b_ij with |b_ij| > 1, then interchanging rows r + i and j of ˜U would increase the volume of ˜U(I, :). We have ‖B‖₂ ≤ ‖B‖_F ≤ √(r(n − r)). This yields the result:

  ‖U(I, :)^{-1}‖₂ = ‖UU(I, :)^{-1}‖₂ = ‖˜U‖₂ ≤ √(1 + ‖B‖₂²) ≤ √(1 + r(n − r)).
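As a sanity check (not part of the original slides), the bound of the lemma can be verified numerically by brute-force search for the submatrix of maximal volume; the matrix size and random seed below are arbitrary choices.

```python
import numpy as np
from itertools import combinations

# Numerical check of the maximal-volume lemma: for an orthonormal U, the
# row set I of maximal volume satisfies 1/sigma_min(U(I,:)) <= sqrt(1+r(n-r)).
rng = np.random.default_rng(0)
n, r = 8, 3
U, _ = np.linalg.qr(rng.standard_normal((n, r)))  # orthonormal columns

# brute-force search over all r x r submatrices
best_I, best_vol = None, -1.0
for I in combinations(range(n), r):
    vol = abs(np.linalg.det(U[list(I), :]))
    if vol > best_vol:
        best_I, best_vol = list(I), vol

smin = np.linalg.svd(U[best_I, :], compute_uv=False)[-1]
bound = np.sqrt(1 + r * (n - r))
assert 1.0 / smin <= bound
```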
Finding submatrix of maximal volume is NP hard.3 Greedy algorithm (column-by-column):4
◮ First step is easy: choose i such that |u_i1| is maximal.
◮ Now assume that k < r steps have been performed and the first k columns have been processed. Task: choose the optimal index in column k + 1.

There is a one-to-one connection between greedy row selection and Gaussian elimination with column pivoting!
³Çivril, A., Magdon-Ismail, M.: On selecting a maximum volume sub-matrix of a matrix and related problems. Theoret. Comput. Sci. 410(47–49), 4801–4811 (2009)
4Reinvented multiple times in the literature.
Gaussian elimination without pivoting applied to U ∈ R^{n×r}:

for k = 1, . . . , r do
  L(:, k) ← (1/u_kk) U(:, k)
  R(k, :) ← U(k, :)
  U ← U − L(:, k)R(k, :)
end for

Let ˜U denote the updated matrix U obtained after k steps. Then:
◮ ˜U = U − LR with L = [L11; L21], R = [R11, R12], where L11 ∈ R^{k×k} is unit lower triangular and R11 ∈ R^{k×k} is upper triangular.
◮ ˜U is zero in the first k rows and columns:

  ˜U = [0, 0; 0, ˜U22],  ˜U22 ∈ R^{(n−k)×(r−k)}.
Combining both relations gives

  U = LR + ˜U = [L11, 0; L21, I_{n−k}] · [R11, R12; 0, ˜U22].

Assume that the first k indices are given by Ik = {1, . . . , k}. Then

  det(U(Ik ∪ {k + i}, Ik ∪ {k + 1})) = det(U(Ik, Ik)) · ˜U22(i, 1).

Greedily maximizing the determinant: choose i such that |˜U22(i, 1)| is maximal. This is Gaussian elimination with column pivoting! r steps of Gaussian elimination with column pivoting yield a factorization of the form PU = LR, where
◮ P is a permutation matrix,
◮ L = [L1; L2] ∈ R^{n×r} with L1 ∈ R^{r×r} unit lower triangular and all entries of L bounded by 1 in absolute value,
◮ R ∈ R^{r×r} is upper triangular.
Simplified form of Gaussian elimination with column pivoting:

Input: n × r matrix U
Output: "Good" index set I ⊂ {1, . . . , n}, #I = r.

Set I = ∅.
for k = 1, . . . , r do
  Choose i∗ = argmax_{i=1,...,n} |u_ik|.
  Set I ← I ∪ {i∗}.
  U ← U − (1/u_{i∗,k}) U(:, k)U(i∗, :)
end for

Performance of the greedy algorithm in practice is often quite good, but there are counterexamples (see later).
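The simplified algorithm above can be sketched in Python (the slides' exercises use MATLAB; NumPy is an assumption here). The test matrix is a random orthonormal basis, so the corollary bound √(nr)·2^{r−1} proved later can be checked as well.

```python
import numpy as np

def greedy_rows(U):
    """Greedy row selection via Gaussian elimination with pivoting,
    following the simplified algorithm on the slide."""
    W = np.array(U, dtype=float, copy=True)
    n, r = W.shape
    I = []
    for k in range(r):
        i_star = int(np.argmax(np.abs(W[:, k])))   # pivot in column k
        I.append(i_star)
        # rank-one update: W <- W - (1/w_{i*,k}) W(:,k) W(i*,:)
        W -= np.outer(W[:, k] / W[i_star, k], W[i_star, :])
    return I

rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((50, 5)))  # orthonormal columns
I = greedy_rows(U)
growth = np.linalg.norm(np.linalg.inv(U[I, :]), 2)
# the corollary below guarantees growth <= sqrt(n*r) * 2^(r-1)
assert growth <= np.sqrt(50 * 5) * 2 ** 4
```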
Lemma. Let T ∈ R^{n×n} be an upper triangular matrix satisfying |t_ii| ≥ |t_ij| for j > i. Then

  1 ≤ min_i |t_ii| · ‖T^{-1}‖₂ ≤ (1/3)√(4^n + 6n − 1) ≤ 2^{n−1}.

Proof. W.l.o.g. we may assume that t_ii = 1. Let Z_n ∈ R^{n×n} be the upper triangular matrix with z_ii = 1 and z_ij = −1 for j > i. By induction, one shows that |T^{-1}| ≤ Z_n^{-1} (where the absolute value and the inequality are understood elementwise).
By the monotonicity of the spectral norm,

  ‖T^{-1}‖₂ ≤ ‖Z_n^{-1}‖₂ ≤ ‖Z_n^{-1}‖_F.

Because of (Z_n^{-1})_ij = 2^{j−i−1} for j > i (see exercises), we obtain

  ‖Z_n^{-1}‖_F² = Σ_{j=1}^{n} (1 + Σ_{i<j} 4^{j−i−1}) = (1/3) Σ_{j=1}^{n} (4^{j−1} + 2) = (1/9)(4^n + 6n − 1),

completing the proof.
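The two identities used in the proof can be checked numerically (a small Python sketch; the size n = 8 is an arbitrary choice):

```python
import numpy as np

# Z_n: ones on the diagonal, -1 in every entry above it.
n = 8
Z = np.eye(n) - np.triu(np.ones((n, n)), 1)
Zinv = np.linalg.inv(Z)

# (Z_n^{-1})_{ij} = 2^{j-i-1} for j > i (0-based indices, same difference j-i)
for i in range(n):
    for j in range(i + 1, n):
        assert abs(Zinv[i, j] - 2 ** (j - i - 1)) < 1e-10

# ||Z_n^{-1}||_F^2 = (4^n + 6n - 1)/9
fro2 = np.linalg.norm(Zinv, 'fro') ** 2
assert abs(fro2 - (4 ** n + 6 * n - 1) / 9) < 1e-8
```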
Corollary. For the index set I returned by the greedy algorithm applied to a matrix U ∈ R^{n×r} with orthonormal columns, it holds that

  ‖U(I, :)^{-1}‖₂ ≤ √(nr) · 2^{r−1}.
Proof. The greedy algorithm computes a factorization

  PU = LR. (1)

Partitioning L = [L1; L2] with L1 ∈ R^{r×r}, we have U(I, :) = L1R. Because PU has orthonormal columns, (1) also implies ‖R^{-1}‖₂ = ‖PUR^{-1}‖₂ = ‖L‖₂. Hence,

  ‖U(I, :)^{-1}‖₂ ≤ ‖L1^{-1}‖₂ ‖R^{-1}‖₂ = ‖L1^{-1}‖₂ ‖L‖₂.

Because the magnitudes of the entries of L are bounded by 1, we have ‖L‖₂ ≤ ‖L‖_F ≤ √(nr) · max |ℓ_ij| = √(nr). Applying the lemma to L1^T in order to bound ‖L1^{-1}‖₂ ≤ 2^{r−1} completes the proof.
Goal: Want to approximate a vector f in the subspace range(U), where U ∈ R^{n×k} has orthonormal columns. For I = {i1, . . . , ik} define the selection operator

  S_I = [e_{i1}, e_{i2}, · · · , e_{ik}] ∈ R^{n×k}.

The minimal error is attained by the orthogonal projection UU^T f. When it is replaced by the oblique projection U(S_I^T U)^{-1}S_I^T f, the increase of the error is bounded by the result of the following lemma.

Lemma.

  ‖f − U(S_I^T U)^{-1}S_I^T f‖₂ ≤ ‖(S_I^T U)^{-1}‖₂ · ‖f − UU^T f‖₂.

Proof. Let Π = U(S_I^T U)^{-1}S_I^T. Then Π is a projection satisfying ΠU = U, and hence

  ‖(I − Π)f‖₂ = ‖(I − Π)(f − UU^T f)‖₂ ≤ ‖I − Π‖₂ · ‖f − UU^T f‖₂.

The proof is completed by noting (and using the exercises)

  ‖I − Π‖₂ = ‖Π‖₂ ≤ ‖(S_I^T U)^{-1}S_I^T‖₂ = ‖(S_I^T U)^{-1}‖₂.
We have S_I^T (I − U(S_I^T U)^{-1}S_I^T) = 0 and hence

  S_I^T (f − U(S_I^T U)^{-1}S_I^T f) = 0.

Interpretation: f is "interpolated" exactly at the selected indices. Example: let f contain a discretization of exp(x) on [−1, 1] and let U contain an orthonormal basis of discretized monomials {1, x, x², . . .}.
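This example can be sketched in Python (the grid size, number of basis functions, and the greedy selection are assumptions, not taken from the slides); it checks both the error bound of the lemma and the exact interpolation at the selected indices.

```python
import numpy as np

# Discretize exp(x) on [-1,1] and build an orthonormal monomial basis.
n, r = 200, 10
x = np.linspace(-1, 1, n)
f = np.exp(x)
V = np.vander(x, r, increasing=True)   # monomials 1, x, ..., x^{r-1}
U, _ = np.linalg.qr(V)                 # orthonormal basis of range(V)

# Greedy index selection (Gaussian elimination with pivoting).
W = U.copy()
I = []
for k in range(r):
    i_star = int(np.argmax(np.abs(W[:, k])))
    I.append(i_star)
    W -= np.outer(W[:, k] / W[i_star, k], W[i_star, :])

UI = U[I, :]
f_deim = U @ np.linalg.solve(UI, f[I])   # oblique projection
f_best = U @ (U.T @ f)                   # orthogonal projection
err_deim = np.linalg.norm(f - f_deim)
err_best = np.linalg.norm(f - f_best)
growth = np.linalg.norm(np.linalg.inv(UI), 2)

assert err_deim <= growth * err_best + 1e-12   # lemma
assert np.allclose(f_deim[I], f[I])            # exact interpolation at I
```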
[Figure: Iteration 1, Err ≈ 14.8; Iteration 2, Err ≈ 5.7; Iteration 3, Err ≈ 0.7; Iteration 4, Err ≈ 0.14]
Comparison between best approximation, greedy approximation, approximation obtained by simply selecting first r indices.
[Figure: approximation errors vs. r on a logarithmic scale]
Terminology:
◮ Continuous setting: EIM (Empirical Interpolation Method), [M. Barrault, Y. Maday, N. C. Nguyen, and A. T. Patera. An "empirical interpolation" method: application to efficient reduced-basis discretization of partial differential equations. C. R. Math. Acad. Sci. Paris 339 (2004), no. 9, 667–672].
◮ Discrete setting: DEIM (Discrete EIM), [S. Chaturantabut and D. C. Sorensen. Nonlinear model reduction via discrete empirical interpolation. SIAM J. Sci. Comput. 32 (2010), no. 5, 2737–2764].
Consider a LARGE ODE of the form ẋ(t) = Ax(t) + F(x(t)), where A is an n × n matrix. Idea of POD⁵:
◮ Collect trajectories at discrete time points into the snapshot matrix X = [x(t1), · · · , x(tm)].
◮ Compute an orthonormal basis U of (an approximation of) range(X), e.g., by SVD.
◮ Project the system onto range(U): ẏ(t) = U^T AUy(t) + U^T F(Uy(t)).
⁵See [S. Volkwein. Proper Orthogonal Decomposition: Theory and Reduced-Order Modelling. Lecture notes].
Problem: U^T F(Uy(t)) still involves the (large) dimension of the original system. Using DEIM:

  U^T F(Uy(t)) ≈ (S_I^T U)^{-1}S_I^T F(Uy(t)),

leading to the reduced system

  ẏ(t) = U^T AUy(t) + (S_I^T U)^{-1}S_I^T F(Uy(t)).

Only #I = r instead of n entries of the function F need to be evaluated. Particularly efficient for componentwise functions:

  F(x) = [f1(x1); . . . ; fn(xn)]  ⇒  S_I^T F(Uy(t)) = [f_{i1}(x_{i1}); . . . ; f_{ir}(x_{ir})],  x = Uy(t).

Example from [Chaturantabut/Sorensen'2010]: the discretized FitzHugh–Nagumo equations involve F(x) = x ⊙ (x − 0.1) ⊙ (1 − x).
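For componentwise F, the identity above can be verified directly: the DEIM term only requires the r selected entries of Uy. In this Python sketch, F is the FitzHugh–Nagumo nonlinearity from the cited example, while the basis and the index set are random stand-ins (assumptions, not a DEIM selection):

```python
import numpy as np

def F(x):
    # FitzHugh-Nagumo nonlinearity, applied componentwise
    return x * (x - 0.1) * (1 - x)

rng = np.random.default_rng(1)
n, r = 1000, 8
U, _ = np.linalg.qr(rng.standard_normal((n, r)))
I = rng.choice(n, size=r, replace=False)    # stand-in for a DEIM index set
y = rng.standard_normal(r)

# DEIM term (S_I^T U)^{-1} S_I^T F(Uy): evaluates F at r entries only
deim_term = np.linalg.solve(U[I, :], F(U[I, :] @ y))
# reference: same quantity computed with the full n-dimensional F(Uy)
full_term = np.linalg.solve(U[I, :], F(U @ y)[I])
assert np.allclose(deim_term, full_term)
```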
QR-based variant of index selection, inspired by Orthogonal Matching Pursuit. In step k + 1: choose the index whose corresponding column of U^T has the largest norm after orthogonalization against the k columns selected so far. This is QR with column pivoting applied to U^T.
For the index set IQ returned by QR with column pivoting, it holds that

  ‖(S_{IQ}^T U)^{-1}‖₂ = ‖U(IQ, :)^{-1}‖₂ ≤ √(n − r + 1) · (1/3)√(4^r + 6r − 1).
[Drmač, Z. and Gugercin, S. A New Selection Operator for the Discrete Empirical Interpolation Method—Improved A Priori Error Bound and Extensions. SIAM Journal on Scientific Computing, 2016, 38:2, A631–A648].
EFY. Using MATLAB's command qr, implement the method above. Apply it to the exponential function, as above, and compare with the standard greedy method.
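A Python/SciPy sketch of this EFY (substituting `scipy.linalg.qr` with `pivoting=True` for MATLAB's `qr`; the test problem is the exponential-function example from the previous slides):

```python
import numpy as np
from scipy.linalg import qr

# Orthonormal monomial basis on [-1,1], as in the exp(x) example.
n, r = 200, 10
x = np.linspace(-1, 1, n)
U, _ = np.linalg.qr(np.vander(x, r, increasing=True))

# QR with column pivoting applied to U^T; the first r pivots form IQ.
_, _, piv = qr(U.T, pivoting=True)
IQ = piv[:r]

growth = np.linalg.norm(np.linalg.inv(U[IQ, :]), 2)
bound = np.sqrt(n - r + 1) * np.sqrt(4.0 ** r + 6 * r - 1) / 3
assert growth <= bound   # bound of the theorem above
```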
Let U be the Q-factor of an economy-sized QR factorization of the n × r matrix A with

  a_ii = 1,  a_ij = −1 for i > j,  a_ij = 0 for i < j,

i.e., unit lower triangular with −1's below the diagonal, the all-(−1) pattern continued in rows r + 1, . . . , n. Variation of a famous example by Wilkinson: (QR-based) greedy does no pivoting, at least in exact arithmetic.

[Figure: ‖U(I, :)^{-1}‖₂ vs. r for n = 100 and the index set I returned by greedy; the growth is exponential in r.]
Improve upon maxvol-based greedy (in a deterministic framework) via Knuth's iterative exchange of rows. Given an index set I = {i1, . . . , ir} and µ ≥ 1, µ ≈ 1, form ˜U = UU(I, :)^{-1} and search for the largest element

  (i∗, j∗) = argmax_{i,j} |˜u_ij|.

If

  |˜u_{i∗,j∗}| ≤ µ,  (2)

terminate the algorithm. Otherwise, replace the j∗-th element of I by i∗ (i.e., I ← (I \ {i_{j∗}}) ∪ {i∗}) and repeat.
EFY. Show that this row exchange increases the volume of ˜U(I, :) and U(I, :) by a factor of at least µ.
EFY. Implement this strategy and apply it to Wilkinson's example from the previous slide. How many iterations are needed for r = 30 to achieve µ = 1.01?
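A Python sketch of the exchange strategy (the index bookkeeping is one possible convention; the slides leave it implicit): while the largest entry of ˜U = UU(I, :)^{-1} exceeds µ, swap the offending row into I.

```python
import numpy as np

def maxvol_exchange(U, I, mu=1.01, max_iter=10_000):
    I = list(I)
    for _ in range(max_iter):
        Ut = U @ np.linalg.inv(U[I, :])
        i_star, j_star = np.unravel_index(np.argmax(np.abs(Ut)), Ut.shape)
        if abs(Ut[i_star, j_star]) <= mu:
            break                      # termination criterion (2)
        I[j_star] = i_star             # volume grows by a factor >= mu
    return I

rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((60, 6)))
I0 = list(range(6))                    # arbitrary nonsingular starting set
I = maxvol_exchange(U, I0)
vol0 = abs(np.linalg.det(U[I0, :]))
vol1 = abs(np.linalg.det(U[I, :]))
Ut = U @ np.linalg.inv(U[I, :])
assert vol1 >= vol0                    # volume never decreases
assert np.max(np.abs(Ut)) <= 1.01 + 1e-8
```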
QR-based greedy can be improved by existing methods for rank-revealing QR [Golub/Van Loan’2013].
Theorem. Let U(IK, :) be the submatrix obtained upon termination of Knuth's exchange algorithm. Then

  | det U(IK, :)| ≥ (µ√r)^{−r} · max_I | det U(I, :)|.

Proof. Upon termination, all entries of ˜U = UU(IK, :)^{-1} are bounded by µ in absolute value. In particular, this holds for the entries of X = U(I, :)U(IK, :)^{-1} for any index set I ⊂ {1, . . . , n} with #I = r. Hence, by Hadamard's inequality,

  | det U(I, :)| / | det U(IK, :)| = | det X| ≤ ∏_{j=1}^{r} ‖X(:, j)‖₂ ≤ (√r)^r ∏_{j=1}^{r} ‖X(:, j)‖_∞ ≤ (µ√r)^r.
A ≈ CUR, where C contains columns of A, R contains rows of A, and U is chosen "wisely".

Theorem (Goreinov/Tyrtyshnikov/Zamarashkin'1997). Let ε := σ_{k+1}(A). Then there exist row indices I ⊂ {1, . . . , m}, column indices J ⊂ {1, . . . , n}, and a matrix S ∈ R^{k×k} such that

  ‖A − A(:, J)SA(I, :)‖₂ ≤ ε(1 + 2√k(√m + √n)).

Proof. Let Uk, Vk contain the k leading left/right singular vectors of A and Σk the corresponding singular values. Choose I, J by selecting rows from Uk, Vk according to the maximal volume lemma. The square matrices Û = Uk(I, :), V̂ = Vk(J, :) then satisfy

  ‖Û^{-1}‖₂ ≤ √(1 + k(m − k)) ≤ √(km),  ‖V̂^{-1}‖₂ ≤ √(1 + k(n − k)) ≤ √(kn).

It remains to choose S.
Form Φ = ÛΣkV̂^T and choose k̃ such that

  ‖Φ − T_k̃(Φ)‖₂ ≤ ε / (‖Û^{-1}‖₂‖V̂^{-1}‖₂),

where T_k̃ truncates to the k̃ leading singular values. We set S = T_k̃(Φ)^+. We now estimate the four components of A − A(:, J)SA(I, :) wrt Uk, Vk and their complements.

1. UkUk^T(· · ·)VkVk^T: Using Uk^T A(:, J) = ΣkV̂^T, A(I, :)Vk = ÛΣk, Σk = Û^{-1}ΦV̂^{-T}, and ΦT_k̃(Φ)^+Φ = T_k̃(Φ),

  ‖Σk − Uk^T A(:, J)SA(I, :)Vk‖₂ = ‖Σk − ΣkV̂^T T_k̃(Φ)^+ ÛΣk‖₂ = ‖Û^{-1}(Φ − T_k̃(Φ))V̂^{-T}‖₂ ≤ ‖Û^{-1}‖₂ ‖V̂^{-1}‖₂ ‖Φ − T_k̃(Φ)‖₂ ≤ ε.

2. (I − UkUk^T)(· · ·)VkVk^T:

  ‖(I − UkUk^T)A(:, J)SA(I, :)Vk‖₂ ≤ ε‖SA(I, :)Vk‖₂ = ε‖T_k̃(Φ)^+ΦV̂^{-T}‖₂ ≤ ε‖V̂^{-1}‖₂.

3. UkUk^T(· · ·)(I − VkVk^T): As in 2, bounded by ε‖Û^{-1}‖₂.
4. (I − UkUk^T)(· · ·)(I − VkVk^T):

  ‖(I − UkUk^T)A(:, J)SA(I, :)(I − VkVk^T)‖₂ ≤ ε²‖S‖₂ ≤ ε‖Û^{-1}‖₂‖V̂^{-1}‖₂,

using ‖S‖₂ = ‖T_k̃(Φ)^+‖₂ ≤ ‖Û^{-1}‖₂‖V̂^{-1}‖₂/ε. In summary, we obtain

  ‖A − A(:, J)SA(I, :)‖₂ ≤ ε(1 + ‖Û^{-1}‖₂)(1 + ‖V̂^{-1}‖₂),

which implies the result.
Choice of S = (A(I, J))^{-1} in CUR. The remainder term

  R := A − A(:, J)(A(I, J))^{-1}A(I, :)

has zero rows at I and zero columns at J. Cross approximation: A is reproduced exactly on the selected cross of rows I and columns J.

[Figure: the cross formed by the selected rows and columns of A]
A more direct attempt to find a good cross.

Theorem (Goreinov/Tyrtyshnikov'2001). Suppose that

  A = [A11, A12; A21, A22]

is partitioned such that A11 ∈ R^{r×r} has maximal volume among all r × r submatrices of A. Then

  ‖A22 − A21A11^{-1}A12‖_C ≤ (r + 1)σ_{r+1}(A),

where ‖M‖_C := max_ij |m_ij|. As we already know, finding A11 is NP hard [Çivril/Magdon-Ismail'2013].
Proof of theorem for (r + 1) × (r + 1) matrices. Consider

  A = [A11, a12; a21^T, a22],

with A11 ∈ R^{r×r}, a12, a21 ∈ R^{r×1}, a22 ∈ R, and invertible A11. We recall the definition of the adjugate of a matrix: adj(A) = C^T with c_ij = (−1)^{i+j}m_ij, where m_ij is the determinant of the matrix obtained by deleting row i and column j of A. By the max-volume assumption, |m_{r+1,r+1}| is maximal among all |m_ij|. On the other hand, A^{-1} = adj(A)/det A. This implies that the element of A^{-1} of maximum absolute value is at position (r + 1, r + 1):

  |(A^{-1})_{r+1,r+1}| = ‖A^{-1}‖_C := max_{i,j} |(A^{-1})_{ij}|.
Proof of theorem, continued. On the other hand, we have for any k × ℓ matrix B that ‖B‖₂ ≤ ‖B‖_F ≤ √(kℓ)‖B‖_C. Thus,

  σ_{r+1}(A)^{-1} = ‖A^{-1}‖₂ ≤ (r + 1)‖A^{-1}‖_C = (r + 1)|(A^{-1})_{r+1,r+1}|.

This completes the proof, using (e.g., via the Schur complement)

  (A^{-1})_{r+1,r+1} = 1 / (a22 − a21^T A11^{-1} a12).
ACA with full pivoting [Bebendorf/Tyrtyshnikov'2000]

1: Set R0 := A, I := {}, J := {}, k := 0
2: repeat
3:   k := k + 1
4:   (ik, jk) := argmax_{i,j} |R_{k−1}(i, j)|
5:   I ← I ∪ {ik}, J ← J ∪ {jk}
6:   δk := R_{k−1}(ik, jk)
7:   uk := R_{k−1}(:, jk), vk := R_{k−1}(ik, :)^T/δk
8:   Rk := R_{k−1} − uk vk^T
9: until ‖Rk‖_F ≤ ε‖A‖_F

◮ This is greedy for maxvol. (Proof on next slide.)
◮ Still too expensive for general matrices.
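The listing above translates directly into a Python sketch (only for small matrices, since the full residual Rk is formed explicitly); the test problem is the Hilbert matrix from a later slide.

```python
import numpy as np

def aca_full(A, eps=1e-10, max_rank=None):
    """ACA with full pivoting; returns factors with A ~ U @ V.T."""
    R = np.array(A, dtype=float, copy=True)
    m, n = R.shape
    max_rank = max_rank or min(m, n)
    normA = np.linalg.norm(A, 'fro')
    U_cols, V_cols = [], []
    while np.linalg.norm(R, 'fro') > eps * normA and len(U_cols) < max_rank:
        i, j = np.unravel_index(np.argmax(np.abs(R)), R.shape)  # full pivot
        delta = R[i, j]
        u, v = R[:, j].copy(), R[i, :].copy() / delta
        R -= np.outer(u, v)                                     # rank-1 update
        U_cols.append(u)
        V_cols.append(v)
    return np.column_stack(U_cols), np.column_stack(V_cols)

# Hilbert matrix, 100 x 100
n = 100
A = 1.0 / (np.arange(n)[:, None] + np.arange(n)[None, :] + 1)
Uf, Vf = aca_full(A, eps=1e-8)
err = np.linalg.norm(A - Uf @ Vf.T, 'fro') / np.linalg.norm(A, 'fro')
assert err <= 1e-8
```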
Lemma (Bebendorf'2000). Let Ik = {i1, . . . , ik} and Jk = {j1, . . . , jk} be the row/column index sets constructed in step k of the algorithm. Then

  det(A(Ik, Jk)) = R0(i1, j1) · · · R_{k−1}(ik, jk).

Proof. The last column of the matrix

  Ãk := [A(Ik, J_{k−1}), R_{k−1}(Ik, jk)]

is obtained from A(Ik, jk) by subtracting scalar multiples of the columns j1, . . . , j_{k−1}, i.e.,

  Ãk = A(Ik, Jk) · [I_{k−1}, y; 0, 1]

for some vector y. This implies det(Ãk) = det(A(Ik, Jk)). However, R_{k−1}(i, jk) = 0 for all i = i1, . . . , i_{k−1} and hence

  det A(Ik, Jk) = det(Ãk) = R_{k−1}(ik, jk) · det(A(I_{k−1}, J_{k−1})).

Since det A(I1, J1) = A(i1, j1) = R0(i1, j1), the result follows by induction.
ACA with partial pivoting

1: Set R0 := A, I := {}, J := {}, k := 1, i∗ := 1
2: repeat
3:   j∗ := argmax_j |R_{k−1}(i∗, j)|
4:   δk := R_{k−1}(i∗, j∗)
5:   if δk = 0 then
6:     if #I = min{m, n} − 1 then
7:       Stop
8:     end if
9:   else
10:    uk := R_{k−1}(:, j∗), vk := R_{k−1}(i∗, :)^T/δk
11:    Rk := R_{k−1} − uk vk^T
12:    k := k + 1
13:  end if
14:  I ← I ∪ {i∗}, J ← J ∪ {j∗}
15:  i∗ := argmax_{i ∉ I} |uk(i)|
16: until stopping criterion is satisfied
ACA with partial pivoting. Remarks:
◮ Rk is never formed explicitly. Entries of Rk are computed from

  Rk(i, j) = A(i, j) − Σ_{ℓ=1}^{k} uℓ(i)vℓ(j).

◮ The ideal stopping criterion ‖uk‖₂‖vk‖₂ ≤ ε‖A‖_F is elusive, because ‖A‖_F is not available. Replace ‖A‖_F by ‖Ak‖_F, with Ak := Σ_{ℓ=1}^{k} uℓvℓ^T, recursively computed via

  ‖Ak‖_F² = ‖A_{k−1}‖_F² + 2 Σ_{j=1}^{k−1} uk^T uj vj^T vk + ‖uk‖₂²‖vk‖₂².
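These remarks can be combined into a Python sketch of partial pivoting in which Rk is never formed: entries are evaluated on demand via the first formula, and the stopping criterion uses the recursively updated ‖Ak‖_F (the row-pivot bookkeeping is one possible convention).

```python
import numpy as np

def aca_partial(entry, m, n, eps=1e-8, max_rank=60):
    """ACA with partial pivoting; `entry(i, j)` returns A(i, j) on demand."""
    us, vs = [], []
    i_star, normAk2, used = 0, 0.0, set()
    while len(us) < max_rank:
        # residual row i_star of R_{k-1}, from the stored rank-one terms
        row = np.array([entry(i_star, j) for j in range(n)])
        for u, v in zip(us, vs):
            row -= u[i_star] * v
        j_star = int(np.argmax(np.abs(row)))
        delta = row[j_star]
        if abs(delta) < 1e-14:
            break
        col = np.array([entry(i, j_star) for i in range(m)])
        for u, v in zip(us, vs):
            col -= v[j_star] * u          # residual column of R_{k-1}
        u, v = col, row / delta
        # ||A_k||_F^2 = ||A_{k-1}||_F^2 + 2 sum_j u_k^T u_j v_j^T v_k + ...
        normAk2 += 2 * sum((u @ uj) * (vj @ v) for uj, vj in zip(us, vs)) \
                   + (u @ u) * (v @ v)
        us.append(u)
        vs.append(v)
        used.add(i_star)
        if np.linalg.norm(u) * np.linalg.norm(v) <= eps * np.sqrt(normAk2):
            break
        cand = np.abs(u)
        cand[list(used)] = -1.0           # next pivot row: i not in I
        i_star = int(np.argmax(cand))
    return np.column_stack(us), np.column_stack(vs)

n = 100
A = 1.0 / (np.arange(n)[:, None] + np.arange(n)[None, :] + 1)  # Hilbert
Up, Vp = aca_partial(lambda i, j: A[i, j], n, n, eps=1e-8)
err = np.linalg.norm(A - Up @ Vp.T, 'fro') / np.linalg.norm(A, 'fro')
assert err < 1e-5
```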
Two 100 × 100 matrices: (a) The Hilbert matrix A defined by A(i, j) = 1/(i + j − 1). (b) The matrix A defined by A(i, j) = exp(−γ|i − j|/n) with γ = 0.1.
[Figure: ‖A − Ak‖₂/‖A‖₂ vs. k for the Hilbert matrix (left) and the exponential matrix (right), comparing full pivoting, partial pivoting, and the SVD]
We have

  Rk = R_{k−1} − δk^{-1} R_{k−1}e_k e_k^T R_{k−1} = (I − δk^{-1} R_{k−1}e_k e_k^T)R_{k−1} = Lk R_{k−1},  δk := e_k^T R_{k−1}e_k,

where Lk ∈ R^{m×m} agrees with the identity except in column k:

  (Lk)_{kk} = 0,  (Lk)_{ik} = ℓ_{i,k} = −(e_i^T R_{k−1}e_k) / (e_k^T R_{k−1}e_k)  for i = k + 1, . . . , m.

The matrix Lk differs only in position (k, k) from the usual lower triangular factor in Gaussian elimination.
For a symmetric positive semi-definite matrix A ∈ R^{n×n}:
◮ SVD becomes the spectral decomposition.
◮ Can use the trace instead of the Frobenius norm to control the error.
◮ Rk stays symmetric positive semi-definite.
◮ Choice of rows/columns, e.g., by the largest diagonal element of Rk.
◮ ACA becomes
  = Cholesky with diagonal pivoting; analysis in [Higham'1990];
  = the Nyström method [Williams/Seeger'2001].
See [Harbrecht/Peters/Schneider'2012] for more details.
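The bullets above can be combined into a short Python sketch of pivoted Cholesky: the pivot is the largest diagonal entry of the residual, and the trace controls the error (test matrix and tolerances are arbitrary assumptions).

```python
import numpy as np

def pivoted_cholesky(A, eps=1e-10):
    """Low-rank factor L with A ~ L @ L.T for SPSD A (ACA with
    diagonal pivoting = Cholesky with diagonal pivoting)."""
    n = A.shape[0]
    d = np.diag(A).astype(float).copy()    # diagonal of the residual R_k
    trace0 = d.sum()
    cols = []
    while d.sum() > eps * trace0 and len(cols) < n:
        i = int(np.argmax(d))              # largest diagonal element of R_k
        col = A[:, i].astype(float).copy()
        for l in cols:
            col -= l[i] * l                # residual column of R_{k-1}
        l = col / np.sqrt(d[i])
        cols.append(l)
        d -= l * l                         # update residual diagonal
    return np.column_stack(cols)

rng = np.random.default_rng(0)
B = rng.standard_normal((80, 8))
A = B @ B.T + 1e-8 * np.eye(80)            # SPSD, numerical rank ~ 8
L = pivoted_cholesky(A, eps=1e-6)
err = np.linalg.norm(A - L @ L.T, 'fro')
assert err <= 1e-4 * np.linalg.norm(A, 'fro')
```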