SLIDE 1

Low Rank Approximation Lecture 4

Daniel Kressner, Chair for Numerical Algorithms and HPC, Institute of Mathematics, EPFL. daniel.kressner@epfl.ch

SLIDE 2

Sampling-based approximation

Aim: Obtain rank-r approximation of m × n matrix A from selected entries of A. Two different situations:

◮ Unstructured sampling: Let Ω ⊂ {1, . . . , m} × {1, . . . , n}. Solve

  min_{B,C} ‖A − BC^T‖_Ω,   with ‖M‖_Ω^2 = Σ_{(i,j)∈Ω} m_ij^2.

Matrix completion problem, solved by general optimization techniques (ALS, Riemannian optimization, convex relaxation). Will be discussed later.

◮ Column/row sampling: Focus of this lecture.

SLIDE 3

Row selection from orthonormal basis

Task. Given an orthonormal basis U ∈ R^{n×r}, find a "good" r × r submatrix of U.

Classical problem, already considered by Knuth.¹ Quantification of "good": smallest singular value not too small. Some notation:

◮ Given an m × n matrix A and index sets

  I = {i_1, . . . , i_k},  1 ≤ i_1 < i_2 < · · · < i_k ≤ m,
  J = {j_1, . . . , j_ℓ},  1 ≤ j_1 < j_2 < · · · < j_ℓ ≤ n,

we let

  A(I, J) = [ a_{i_1,j_1}  · · ·  a_{i_1,j_ℓ} ]
            [     ⋮          ⋱        ⋮      ]
            [ a_{i_k,j_1}  · · ·  a_{i_k,j_ℓ} ]  ∈ R^{k×ℓ}.

The full index set is denoted by :, e.g., A(I, :).

◮ | det A| denotes the volume of a square matrix A.

¹ Knuth, Donald E. Semioptimal bases for linear dependencies. Linear and Multilinear Algebra 17 (1985), no. 1, 1–4.

SLIDE 4

Row selection from orthonormal basis

Lemma (Maximal volume yields good submatrix)

Let the index set I, #I = r, be chosen such that |det(U(I, :))| is maximal among all r × r submatrices of U. Then

  1/σ_min(U(I, :)) = ‖U(I, :)^{-1}‖_2 ≤ √(r(n − r) + 1).

Proof.² W.l.o.g. I = {1, . . . , r}. Consider

  Ũ = U U(I, :)^{-1} = [ I_r ]
                       [ B   ].

Because det Ũ(J, :) = det U(J, :)/det U(I, :) for every index set J with #J = r, Ũ(I, :) = I_r has maximal volume among all r × r submatrices of Ũ.

² Following Lemma 2.1 in [Goreinov, S. A.; Tyrtyshnikov, E. E.; Zamarashkin, N. L. A theory of pseudoskeleton approximations. Linear Algebra Appl. 261 (1997), 1–21].

SLIDE 5

Maximality of Ũ(I, :) implies max |b_ij| ≤ 1. Argument: If there were b_ij with |b_ij| > 1, then interchanging rows r + i and j of Ũ would increase the volume of Ũ(I, :). We have

  ‖B‖_2 ≤ ‖B‖_F ≤ √((n − r)r) · max |b_ij| ≤ √((n − r)r).

This yields the result:

  ‖U(I, :)^{-1}‖_2 = ‖U U(I, :)^{-1}‖_2 = ‖Ũ‖_2 = √(1 + ‖B‖_2^2) ≤ √(1 + (n − r)r).

SLIDE 6

Greedy row selection from orthonormal basis

Finding a submatrix of maximal volume is NP-hard.³ Greedy algorithm (column-by-column):⁴

◮ First step is easy: Choose i such that |u_i1| is maximal.
◮ Now assume that k < r steps have been performed and the first k columns have been processed. Task: Choose the optimal index in column k + 1.

There is a one-to-one connection between greedy row selection and Gaussian elimination with column pivoting!

³ Çivril, A., Magdon-Ismail, M.: On selecting a maximum volume sub-matrix of a matrix and related problems. Theoret. Comput. Sci. 410(47–49), 4801–4811 (2009).

⁴ Reinvented multiple times in the literature.

SLIDE 7

Greedy row selection from orthonormal basis

Gaussian elimination without pivoting applied to U ∈ R^{n×r}:

  for k = 1, . . . , r do
    L(:, k) ← U(:, k)/u_kk
    R(k, :) ← U(k, :)
    U ← U − L(:, k)R(k, :)
  end for

Let Ũ denote the updated matrix U obtained after k steps. Then:

◮ Ũ = U − LR with L = [L_11; L_21] ∈ R^{n×k} and R = [R_11, R_12] ∈ R^{k×r}, where L_11 is unit lower triangular and R_11 is upper triangular.

◮ Ũ is zero in the first k rows and columns:

  Ũ = [ 0   0    ]
      [ 0   Ũ_22 ],   Ũ_22 ∈ R^{(n−k)×(r−k)}.

SLIDE 8

Combining both relations gives

  U = LR + Ũ = [ L_11   0       ] [ R_11   R_12 ]
               [ L_21   I_{n−k} ] [ 0      Ũ_22 ].

Back to the greedy algorithm: By a suitable permutation, suppose that the first k indices are given by I_k = {1, . . . , k}. Then

  det(U(I_k ∪ {k + i}, I_k ∪ {k + 1})) = det(U(I_k, I_k)) · Ũ_22(i, 1).

Greedily maximizing the determinant: Choose i such that |Ũ_22(i, 1)| is maximal. This is Gaussian elimination with column pivoting! r steps of Gaussian elimination with column pivoting yield a factorization of the form

  PU = LR,

where

◮ P is a permutation matrix,
◮ L = [L_11; L_21] ∈ R^{n×r} with L_11 ∈ R^{r×r} unit lower triangular and max |ℓ_ij| ≤ 1,
◮ R ∈ R^{r×r} is upper triangular.

SLIDE 9

Greedy row selection from orthonormal basis

Simplified form of Gaussian elimination with column pivoting:

  Input: n × r matrix U
  Output: "good" index set I ⊂ {1, . . . , n}, #I = r

  Set I = ∅.
  for k = 1, . . . , r do
    Choose i* = argmax_{i=1,...,n} |u_ik|.
    Set I ← I ∪ {i*}.
    U ← U − U(:, k)U(i*, :)/u_{i*,k}
  end for

Performance of the greedy algorithm is often quite good in practice, but there are counterexamples (see later).
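The pseudocode translates directly into MATLAB; the following sketch (function name ours) returns the selected index set:

  function I = greedy_rows(U)
  % Greedy row selection from an n x r basis U: Gaussian elimination
  % with column pivoting, as in the pseudocode above.
      r = size(U, 2);
      I = zeros(r, 1);
      for k = 1:r
          [~, istar] = max(abs(U(:, k)));              % largest entry in column k
          I(k) = istar;
          U = U - U(:, k) * U(istar, :) / U(istar, k); % rank-1 elimination step
      end
  end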

SLIDE 10

Analysis of greedy row selection

Lemma (Theorem 8.15 in [Higham’2002])

Let T ∈ R^{n×n} be an upper triangular matrix satisfying |t_ii| ≥ |t_ij| for j > i. Then

  1 ≤ min_i |t_ii| · ‖T^{-1}‖_2 ≤ (1/3)√(4^n + 6n − 1) ≤ 2^{n−1}.

Proof. By diagonal scaling, we may assume without loss of generality that t_ii = 1. Let

  Z_n = [ 1  −1  · · ·  −1 ]
        [     1   ⋱     ⋮  ]
        [          ⋱   −1  ]
        [               1  ]  ∈ R^{n×n}.

By induction, one shows that |T^{-1}| ≤ Z_n^{-1} (where the absolute value and the inequality are understood elementwise).

SLIDE 11

By the monotonicity of the spectral norm,

  ‖T^{-1}‖_2 ≤ ‖Z_n^{-1}‖_2 ≤ ‖Z_n^{-1}‖_F.

Because (Z_n^{-1})_ij = 2^{j−i−1} for j > i (see exercises), we obtain

  ‖Z_n^{-1}‖_F^2 = Σ_{j=1}^{n} (1 + Σ_{i=1}^{j−1} 4^{j−i−1}) = (1/3) Σ_{j=1}^{n} (4^{j−1} + 2) = (1/9)(4^n + 6n − 1),

completing the proof.

Theorem

For the index set I returned by the greedy algorithm applied to orthonormal U ∈ R^{n×r}, it holds that

  ‖U(I, :)^{-1}‖_2 ≤ √(nr) · 2^{r−1}.

SLIDE 12
Proof. We start from

  PU = LR.  (1)

Partitioning L = [L_1; L_2] with L_1 ∈ R^{r×r}, factorization (1) implies U(I, :) = L_1R. Because PU is orthonormal, (1) also implies ‖R^{-1}‖_2 = ‖L‖_2, and hence

  ‖U(I, :)^{-1}‖_2 ≤ ‖L_1^{-1}‖_2 ‖R^{-1}‖_2 = ‖L_1^{-1}‖_2 ‖L‖_2.

Because the magnitudes of the entries of L are bounded by 1, we have ‖L‖_2 ≤ ‖L‖_F ≤ √(nr) · max |ℓ_ij| = √(nr). Applying the lemma to L_1^T in order to bound ‖L_1^{-1}‖_2 completes the proof.

SLIDE 13

Vector approximation

Goal: Approximate a vector f in the subspace range(U). For I = {i_1, . . . , i_k}, define the selection operator

  S_I = [e_{i_1}  e_{i_2}  · · ·  e_{i_k}].

The minimal error is attained by the orthogonal projection UU^T f. When it is replaced by the oblique projection

  U(S_I^T U)^{-1}S_I^T f,

the increase of the error is bounded by the result of the following lemma.

Lemma

  ‖f − U(S_I^T U)^{-1}S_I^T f‖_2 ≤ ‖(S_I^T U)^{-1}‖_2 · ‖f − UU^T f‖_2.

Proof. Let Π = U(S_I^T U)^{-1}S_I^T. Then

  ‖(I − Π)f‖_2 = ‖(I − Π)(f − UU^T f)‖_2 ≤ ‖I − Π‖_2 ‖f − UU^T f‖_2.

The proof is completed by noting (and using the exercises)

  ‖I − Π‖_2 = ‖Π‖_2 ≤ ‖(S_I^T U)^{-1}S_I^T‖_2 = ‖(S_I^T U)^{-1}‖_2.
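As a quick numerical sanity check, one can compare both projections in MATLAB (a sketch with random data; names ours, greedy_rows from above):

  n = 200; r = 10;
  [U, ~] = qr(randn(n, r), 0);         % orthonormal basis
  f = randn(n, 1);
  I = greedy_rows(U);                  % greedy index selection
  f_orth = U * (U' * f);               % orthogonal projection U*U'*f
  f_obl  = U * (U(I, :) \ f(I));       % oblique projection U*(S_I'*U)^(-1)*S_I'*f
  bound  = norm(inv(U(I, :))) * norm(f - f_orth);
  fprintf('oblique error %.2e <= bound %.2e\n', norm(f - f_obl), bound)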

SLIDE 14

Connection to interpolation

We have

  S_I^T (I − U(S_I^T U)^{-1}S_I^T) = 0

and hence

  ‖S_I^T (f − U(S_I^T U)^{-1}S_I^T f)‖_2 = 0.

Interpretation: f is "interpolated" exactly at the selected indices.

Example: Let f contain a discretization of exp(x) on [−1, 1] and let U contain an orthonormal basis of discretized monomials {1, x, x^2, . . .}.

[Figure: error of the oblique projection of the discretized exp(x), plotted over 200 grid points (vertical scale roughly ±0.2).]
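The example can be reproduced with a few lines of MATLAB (grid size and rank are our choices):

  n = 200; r = 10;
  x = linspace(-1, 1, n)';
  f = exp(x);
  [U, ~] = qr(x.^(0:r-1), 0);       % ONB of discretized monomials 1, x, ..., x^(r-1)
  I = greedy_rows(U);
  err = f - U * (U(I, :) \ f(I));   % interpolation error; err(I) = 0
  fprintf('max error %.2e, error at selected indices %.1e\n', ...
          norm(err, inf), norm(err(I), inf))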

SLIDE 15

Connection to interpolation

[Figure: interpolants after the first four greedy steps. Iteration 1, Err ≈ 14.8; Iteration 2, Err ≈ 5.7; Iteration 3, Err ≈ 0.7; Iteration 4, Err ≈ 0.14.]

SLIDE 16

Connection to interpolation

Comparison between the best approximation, the greedy approximation, and the approximation obtained by simply selecting the first r indices.

[Figure: approximation error vs. r = 2, . . . , 10 on a logarithmic scale (10^0 down to 10^{-10}) for the three strategies.]

Terminology:

◮ Continuous setting: EIM (Empirical Interpolation Method), [M. Barrault, Y. Maday, N. C. Nguyen, and A. T. Patera. An "empirical interpolation" method: Application to efficient reduced-basis discretization of partial differential equations. C. R. Math. Acad. Sci. Paris, 339 (2004), pp. 667–672].

◮ Discrete setting: DEIM (Discrete EIM), [S. Chaturantabut and D. C. Sorensen. Nonlinear model reduction via discrete empirical interpolation. SIAM Journal on Scientific Computing, 32(5), 2737–2764, 2010].

SLIDE 17

POD+DEIM

Consider a LARGE ODE of the form

  ẋ(t) = Ax(t) + F(x(t)),

where A is an n × n matrix. Idea of POD⁵:

1. Simulate the ODE for one or more initial conditions and collect the trajectories at discrete time points into a snapshot matrix: X = [x(t_1) · · · x(t_m)].

2. Compute an ONB U ∈ R^{n×r}, r ≪ n, of the dominant left singular subspace of X (e.g., by SVD).

3. Assume the approximation x ≈ UU^T x = Uy and project the dynamical system onto range(U):

  ẏ(t) = U^T AUy(t) + U^T F(Uy(t)).

⁵ See [S. Volkwein. Proper Orthogonal Decomposition: Theory and Reduced-Order Modelling. Lecture Notes, 2013] for a comprehensive introduction.

SLIDE 18

POD+DEIM

Problem: U^T F(Uy(t)) still involves the (large) dimension n of the original system. Using DEIM (and U^T U = I_r):

  U^T F(Uy(t)) ≈ (S_I^T U)^{-1}S_I^T F(Uy(t)),

leading to the reduced system

  ẏ(t) = U^T AUy(t) + (S_I^T U)^{-1}S_I^T F(Uy(t)).

Only #I = r instead of n entries of the function F need to be evaluated. Particularly efficient for componentwise functions:

  F(x) = [f_1(x_1); . . . ; f_n(x_n)]  ⇒  S_I^T F(Uy(t)) = [f_{i_1}(x_{i_1}); . . . ; f_{i_r}(x_{i_r})],  where x = Uy(t).

Example from [Chaturantabut/Sorensen'2010]: The discretized FitzHugh–Nagumo equations involve F(x) = x ⊙ (x − 0.1) ⊙ (1 − x).
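In code, the small factors can be precomputed once ("offline"), so that each right-hand side evaluation only touches r entries. A MATLAB sketch for a componentwise nonlinearity, assuming A, U, I, and an initial value y0 are given as above (variable names ours):

  Ar = U' * A * U;                      % r x r reduced linear part
  M  = inv(U(I, :));                    % (S_I' * U)^(-1)
  UI = U(I, :);                         % rows needed for x(I) = U(I,:) * y
  f  = @(x) x .* (x - 0.1) .* (1 - x);  % FitzHugh-Nagumo-type nonlinearity
  rhs = @(t, y) Ar * y + M * f(UI * y); % reduced ODE; f evaluated at r points only
  % e.g., [t, Y] = ode45(rhs, [0, 10], y0);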

SLIDE 19

Greedy row selection from orthonormal basis

QR-based variant of index selection, inspired by Orthogonal Matching Pursuit. In step k + 1:

1. Select the row i_{k+1} of maximal 2-norm from U.
2. Set x = U(i_{k+1}, :)^T/‖U(i_{k+1}, :)‖_2.
3. Update U ← U(I_r − xx^T).

This is QR with column pivoting applied to U^T.

Theorem

For the index set I_Q returned by QR with pivoting, it holds that

  ‖(S_{I_Q}^T U)^{-1}‖_2 = ‖U(I_Q, :)^{-1}‖_2 ≤ √(n − r + 1) · (1/3)√(4^r + 6r − 1).

Proof. See Theorem 2.1 in [Drmač, Z. and Gugercin, S. A New Selection Operator for the Discrete Empirical Interpolation Method: Improved A Priori Error Bound and Extensions. SIAM Journal on Scientific Computing, 2016, 38:2, A631–A648].

EFY. Using MATLAB's command qr, implement the method above. Apply it to the exponential function, as above, and compare with the standard greedy method.
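One possible two-line solution of this exercise via pivoted QR of U^T (a sketch, not necessarily the intended one):

  [~, ~, p] = qr(U', 'vector');   % QR with column pivoting applied to U^T
  IQ = p(1:size(U, 2));           % the first r pivots are the selected rows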

SLIDE 20

Counterexample for greedy

Let U be the Q-factor of an economy-sized QR factorization of the n × r matrix

  A = [  1                  ]
      [ −1    1             ]
      [  ⋮        ⋱         ]
      [ −1  · · ·  −1    1  ]
      [ −1  · · ·  −1   −1  ]
      [  ⋮          ⋮    ⋮  ]
      [ −1  · · ·  −1   −1  ],

a variation of a famous example by Wilkinson. (QR-based) greedy does no pivoting, at least in exact arithmetic.

[Figure: ‖U(I, :)^{-1}‖_2 vs. r for n = 100 and I returned by greedy; exponential growth up to about 10^{10} at r = 30.]
SLIDE 21

Row selection beyond greedy

Improve upon maxvol-based greedy (in a deterministic framework) via Knuth's iterative exchange of rows. Given an index set I, #I = r, and µ ≥ 1, µ ≈ 1, form Ũ = UU(I, :)^{-1}. Search for the largest element

  (i*, j*) = argmax |ũ_ij|.

If

  |ũ_{i*j*}| ≤ µ,  (2)

terminate the algorithm. Otherwise, set I ← I \ {j*} ∪ {i*} and repeat.

EFY. Show that this row exchange increases the volume of Ũ(I, :) and U(I, :) by at least a factor of µ. EFY. Implement this strategy (a sketch follows below) and apply it to Wilkinson's example from the previous slide. How many iterations are needed for r = 30 to achieve µ = 1.01?
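A minimal MATLAB sketch of this exchange strategy (function name ours; I is an initial index set, e.g., from greedy_rows):

  function [I, iter] = maxvol_exchange(U, I, mu)
  % Knuth-style row exchange: iterate until all entries of U*U(I,:)^(-1)
  % are bounded by mu in absolute value.
      iter = 0;
      while true
          Ut = U / U(I, :);                    % Ũ = U * U(I,:)^(-1)
          [val, idx] = max(abs(Ut(:)));        % largest element of Ũ
          if val <= mu, break; end             % stopping criterion (2)
          [istar, jstar] = ind2sub(size(Ut), idx);
          I(jstar) = istar;                    % exchange j*-th index by i*
          iter = iter + 1;
      end
  end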

QR-based greedy can be improved by existing methods for rank-revealing QR [Golub/Van Loan’2013].

SLIDE 22

Theorem

Let U(I_K, :) be the submatrix obtained upon termination of Knuth's procedure. Then

  |det U(I_K, :)| ≥ (µ√r)^{-r} · max_I |det U(I, :)|.

Proof. By construction, all entries of UU(I_K, :)^{-1} are bounded by µ in absolute value. In particular, this holds for X = U(I, :)U(I_K, :)^{-1} for any index set I ⊂ {1, . . . , n} with #I = r. Hence, by Hadamard's inequality,

  |det U(I, :)| / |det U(I_K, :)| = |det X| ≤ Π_{j=1}^{r} ‖X(:, j)‖_2 ≤ (√r)^r · Π_{j=1}^{r} ‖X(:, j)‖_∞ ≤ (µ√r)^r.
SLIDE 23

The CUR decomposition: Existence results

A ≈ CUR, where C contains columns of A, R contains rows of A, and U is chosen "wisely".

Theorem (Goreinov/Tyrtyshnikov/Zamarashkin'1997). Let ε := σ_{k+1}(A). Then there exist row indices I ⊂ {1, . . . , m}, column indices J ⊂ {1, . . . , n}, and a matrix S ∈ R^{k×k} such that

  ‖A − A(:, J)SA(I, :)‖_2 ≤ ε(1 + 2√k(√m + √n)).

Proof. Let U_k, V_k contain the k dominant left/right singular vectors of A. Choose I, J by selecting rows from U_k and V_k, respectively. According to the maximal volume lemma, the square matrices Û = U_k(I, :), V̂ = V_k(J, :) satisfy

  ‖Û^{-1}‖_2 ≤ √(k(m − k) + 1),   ‖V̂^{-1}‖_2 ≤ √(k(n − k) + 1).

It remains to choose S.

SLIDE 24

The CUR decomposition: Existence results

Form Φ = ÛΣ_kV̂^T and choose k̃ minimal such that

  ‖Φ − T_{k̃}(Φ)‖_2 ≤ ε / (‖Û^{-1}‖_2 ‖V̂^{-1}‖_2),

where T_{k̃}(Φ) denotes the best rank-k̃ approximation of Φ. We set S = T_{k̃}(Φ)^+. We now estimate the four different components of U_kΣ_kV_k^T − A(:, J)SA(I, :) with respect to U_k, V_k and their complements:

1. U_kU_k^T (· · ·) V_kV_k^T:

  ‖Σ_k − U_k^T A(:, J)SA(I, :)V_k‖_2 = ‖Σ_k − Σ_kV̂^T S ÛΣ_k‖_2
    = ‖Û^{-1}(Φ − Φ T_{k̃}(Φ)^+ Φ)V̂^{-T}‖_2
    = ‖Û^{-1}(Φ − T_{k̃}(Φ))V̂^{-T}‖_2 ≤ ε.

2. (I_m − U_kU_k^T) (· · ·) V_kV_k^T:

  ‖(I − U_kU_k^T)A(:, J)SA(I, :)V_k‖_2 ≤ ε‖SA(I, :)V_k‖_2 = ε‖T_{k̃}(Φ)^+ Φ V̂^{-T}‖_2 ≤ ε‖V̂^{-1}‖_2,

using ‖T_{k̃}(Φ)^+ Φ‖_2 ≤ 1.

3. U_kU_k^T (· · ·) (I − V_kV_k^T): As in 2, bounded by ε‖Û^{-1}‖_2.
SLIDE 25

The CUR decomposition: Existence results

4. (I − U_kU_k^T) (· · ·) (I − V_kV_k^T):

  ‖(I − U_kU_k^T)A(:, J)SA(I, :)(I − V_kV_k^T)‖_2 ≤ ε^2 ‖S‖_2 ≤ ε‖Û^{-1}‖_2 ‖V̂^{-1}‖_2,

where the last step uses that the minimal choice of k̃ guarantees ‖S‖_2 = σ_{k̃}(Φ)^{-1} ≤ ‖Û^{-1}‖_2 ‖V̂^{-1}‖_2 / ε. In summary, we obtain

  ‖A − A(:, J)SA(I, :)‖_2 ≤ ε(1 + ‖Û^{-1}‖_2 + ‖V̂^{-1}‖_2)^2,

which implies the result.

SLIDE 26

The CUR decomposition: Existence results

Choice of S = (A(I, J))^{-1} in CUR: The remainder term R := A − A(:, J)(A(I, J))^{-1}A(I, :) has zero rows at I and zero columns at J.

[Figure: cross approximation pattern; the selected rows (1, 3, 6) and columns (2, 4, 7) form a cross in A.]

SLIDE 27

Adaptive Cross Approximation (ACA)

A more direct attempt to find a good cross. Theorem (Goreinov/Tyrtyshnikov'2001). Suppose that

  A = [ A_11   A_12 ]
      [ A_21   A_22 ],

where A_11 ∈ R^{r×r} has maximal volume among all r × r submatrices of A. Then

  ‖A_22 − A_21A_11^{-1}A_12‖_C ≤ (r + 1)σ_{r+1}(A),

where ‖M‖_C := max_{ij} |m_ij|. As we already know, finding A_11 is NP-hard [Çivril/Magdon-Ismail'2013].

SLIDE 28

Adaptive Cross Approximation (ACA)

Proof of the theorem for (r + 1) × (r + 1) matrices. Consider

  A = [ A_11     a_12 ]
      [ a_21^T   a_22 ],   A_11 ∈ R^{r×r}, a_12, a_21 ∈ R^{r×1}, a_22 ∈ R,

with invertible A_11. We recall the definition of the adjugate of a matrix: adj(A) = C^T, c_ij = (−1)^{i+j}m_ij, where m_ij is the determinant of the matrix obtained by deleting row i and column j of A. By the max-volume assumption, |m_{r+1,r+1}| is maximal among all |m_ij|. On the other hand, A^{-1} = adj(A)/det A. This implies that an element of A^{-1} of maximum absolute value is at position (r + 1, r + 1):

  |(A^{-1})_{r+1,r+1}| = ‖A^{-1}‖_C := max_{i,j} |(A^{-1})_ij|.
SLIDE 29

Adaptive Cross Approximation (ACA)

Proof of the theorem, continued. On the other hand, we have for any k × ℓ matrix B that ‖B‖_2 ≤ ‖B‖_F ≤ √(kℓ)‖B‖_C. Thus,

  σ_{r+1}(A)^{-1} = ‖A^{-1}‖_2 ≤ (r + 1)‖A^{-1}‖_C = (r + 1)|(A^{-1})_{r+1,r+1}|.

This completes the proof, using (e.g., via the Schur complement)

  (A^{-1})_{r+1,r+1} = 1/(a_22 − a_21^T A_11^{-1} a_12).

SLIDE 30

Adaptive Cross Approximation (ACA)

ACA with full pivoting [Bebendorf/Tyrtyshnikov'2000]

  1: Set R_0 := A, I := {}, J := {}, k := 0
  2: repeat
  3:   k := k + 1
  4:   (i_k, j_k) := argmax_{i,j} |R_{k−1}(i, j)|
  5:   I ← I ∪ {i_k}, J ← J ∪ {j_k}
  6:   δ_k := R_{k−1}(i_k, j_k)
  7:   u_k := R_{k−1}(:, j_k), v_k := R_{k−1}(i_k, :)^T/δ_k
  8:   R_k := R_{k−1} − u_kv_k^T
  9: until ‖R_k‖_F ≤ ε‖A‖_F

◮ This is greedy for maxvol. (Proof on next slide.)
◮ Still too expensive for general matrices.
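For small dense test matrices the algorithm can be transcribed literally; a MATLAB sketch (function name ours) returning A ≈ U*V':

  function [U, V, I, J] = aca_full(A, epsilon)
  % ACA with full pivoting; forms the remainders R_k explicitly and is
  % therefore only meant for small dense matrices.
      R = A; I = []; J = []; U = []; V = [];
      nrmA = norm(A, 'fro');
      while norm(R, 'fro') > epsilon * nrmA
          [~, idx] = max(abs(R(:)));              % full pivot search
          [ik, jk] = ind2sub(size(R), idx);
          I = [I, ik]; J = [J, jk];
          u = R(:, jk); v = R(ik, :)' / R(ik, jk);
          U = [U, u]; V = [V, v];                 % A ≈ U * V'
          R = R - u * v';                         % rank-1 update
      end
  end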

SLIDE 31

Adaptive Cross Approximation (ACA)

Lemma (Bebendorf'2000). Let I_k = {i_1, . . . , i_k} and J_k = {j_1, . . . , j_k} be the row/column index sets constructed in step k of the algorithm. Then

  det(A(I_k, J_k)) = R_0(i_1, j_1) · · · R_{k−1}(i_k, j_k).

Proof. From lines 7 and 8 of the algorithm, R_{k−1}(I_k, j_k) is obtained from A(I_k, j_k) by subtracting scalar multiples of columns j_1, . . . , j_{k−1} of A. Hence, there is a vector y such that

  Ã_k := [A(I_k, J_{k−1}), R_{k−1}(I_k, j_k)] = A(I_k, J_k) [ I_{k−1}   y ]
                                                            [ 0         1 ].

This implies det(Ã_k) = det(A(I_k, J_k)). However, R_{k−1}(i, j_k) = 0 for all i ∈ {i_1, . . . , i_{k−1}}, and hence

  det(Ã_k) = R_{k−1}(i_k, j_k) · det(A(I_{k−1}, J_{k−1})).

Since det A(I_1, J_1) = A(i_1, j_1) = R_0(i_1, j_1), the result follows by induction.
SLIDE 32

Adaptive Cross Approximation (ACA)

ACA with partial pivoting

  1: Set R_0 := A, I := {}, J := {}, k := 1, i* := 1
  2: repeat
  3:   j* := argmax_j |R_{k−1}(i*, j)|
  4:   δ_k := R_{k−1}(i*, j*)
  5:   if δ_k = 0 then
  6:     if #I = min{m, n} − 1 then
  7:       Stop
  8:     end if
  9:   else
  10:    u_k := R_{k−1}(:, j*), v_k := R_{k−1}(i*, :)^T/δ_k
  11:    R_k := R_{k−1} − u_kv_k^T
  12:    k := k + 1
  13:  end if
  14:  I ← I ∪ {i*}, J ← J ∪ {j*}
  15:  i* := argmax_{i∉I} |u_k(i)|
  16: until stopping criterion is satisfied
SLIDE 33

Adaptive Cross Approximation (ACA)

ACA with partial pivoting. Remarks:

◮ R_k is never formed explicitly. Entries of R_k are computed from

  R_k(i, j) = A(i, j) − Σ_{ℓ=1}^{k} u_ℓ(i)v_ℓ(j).

◮ The ideal stopping criterion ‖u_k‖_2 ‖v_k‖_2 ≤ ε‖A‖_F is elusive, since ‖A‖_F is not available. Replace ‖A‖_F by ‖A_k‖_F, where A_k = Σ_{ℓ=1}^{k} u_ℓv_ℓ^T is the current approximation, recursively computed via

  ‖A_k‖_F^2 = ‖A_{k−1}‖_F^2 + 2 Σ_{j=1}^{k−1} (u_k^T u_j)(v_j^T v_k) + ‖u_k‖_2^2 ‖v_k‖_2^2.
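A MATLAB sketch of ACA with partial pivoting along these lines, for a matrix given only through an entry evaluation function Afun (names and simplifications ours; the zero-pivot branch of the pseudocode is replaced by an early stop):

  function [U, V] = aca_partial(Afun, m, n, epsilon)
  % Afun(i, j) must return the submatrix A(i, j) for index vectors i, j,
  % e.g., Afun = @(i, j) 1 ./ (i(:) + j(:)' - 1) for the Hilbert matrix.
      U = zeros(m, 0); V = zeros(n, 0);
      istar = 1; I = []; nrm2 = 0;
      while true
          v = Afun(istar, 1:n)' - V * U(istar, :)';  % row i* of R_{k-1}
          [~, jstar] = max(abs(v));
          if v(jstar) == 0, break; end               % simplification: stop
          v = v / v(jstar);
          u = Afun(1:m, jstar) - U * V(jstar, :)';   % column j* of R_{k-1}
          nrm2 = nrm2 + 2 * (u' * U) * (V' * v) + (u' * u) * (v' * v);
          U = [U, u]; V = [V, v]; I = [I, istar];    % A ≈ U * V'
          if norm(u) * norm(v) <= epsilon * sqrt(nrm2), break; end
          w = abs(u); w(I) = 0;                      % exclude already used rows
          [~, istar] = max(w);
      end
  end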

SLIDE 34

Adaptive Cross Approximation (ACA)

Two 100 × 100 matrices: (a) the Hilbert matrix A defined by A(i, j) = 1/(i + j − 1); (b) the matrix A defined by A(i, j) = exp(−γ|i − j|/n) with γ = 0.1.

[Figure: relative error ‖A − A_k‖_2/‖A‖_2 vs. k for full pivoting, partial pivoting, and the SVD; left panel: Hilbert matrix, right panel: exponential matrix.]

1. Excellent convergence for the Hilbert matrix.
2. Slow singular value decay impedes partial pivoting.

SLIDE 35

ACA is Gaussian elimination

We have

  R_k = R_{k−1} − (1/δ_k)R_{k−1}e_ke_k^T R_{k−1} = (I − (1/δ_k)R_{k−1}e_ke_k^T)R_{k−1} = L_kR_{k−1},

assuming (after permutation) that the pivot in step k sits at position (k, k), i.e., δ_k = e_k^T R_{k−1}e_k. The matrix L_k ∈ R^{m×m} agrees with the identity matrix except for its k-th column, which has ℓ_{k,k} = 0 and

  ℓ_{i,k} = −(e_i^T R_{k−1}e_k)/(e_k^T R_{k−1}e_k)   for i = k + 1, . . . , m.

Thus L_k differs only in position (k, k) from the usual lower triangular factor in Gaussian elimination.
SLIDE 36

ACA for SPSD matrices

For a symmetric positive semi-definite matrix A ∈ R^{n×n}:

◮ The SVD becomes the spectral decomposition.
◮ Can use the trace instead of the Frobenius norm to control the error.
◮ R_k stays SPSD.
◮ Choice of rows/columns, e.g., by the largest diagonal element of R_k.
◮ ACA becomes
  = Cholesky with diagonal pivoting; analysis in [Higham'1990];
  = the Nyström method [Williams/Seeger'2001].

See [Harbrecht/Peters/Schneider'2012] for more details.
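In this SPSD setting, ACA collapses to a few lines; a MATLAB sketch of Cholesky with diagonal pivoting (function name ours), returning A ≈ L*L' and using the trace of R_k for error control:

  function L = pivoted_cholesky(A, epsilon)
  % ACA for an SPSD matrix = Cholesky with diagonal pivoting.
      n = size(A, 1); d = diag(A); L = zeros(n, 0);
      while sum(d) > epsilon                          % trace of remainder R_k
          [~, i] = max(d);                            % pivot: largest diagonal entry
          c = (A(:, i) - L * L(i, :)') / sqrt(d(i));  % next column of L
          L = [L, c];
          d = d - c.^2;                               % diagonal of new remainder
      end
  end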
