SLIDE 1

Low Rank Approximation Lecture 4

Daniel Kressner, Chair for Numerical Algorithms and HPC, Institute of Mathematics, EPFL. daniel.kressner@epfl.ch

SLIDE 2

Sampling-based approximation

Aim: Obtain rank-r approximation of m × n matrix A from selected entries of A. Two different situations:

◮ Unstructured sampling: Let Ω ⊂ {1, . . . , m} × {1, . . . , n}. Solve

  min_{B,C} ‖A − BC^T‖_Ω,   with ‖M‖_Ω^2 = Σ_{(i,j)∈Ω} m_ij^2.

Matrix completion problem, solved by general optimization techniques (ALS, Riemannian optimization, convex relaxation). Will be discussed later.

◮ Column/row sampling: Focus of this lecture.

SLIDE 3

Row selection from orthonormal basis

Task. Given an orthonormal basis U ∈ R^{n×r}, find a "good" r × r submatrix of U.

Classical problem, already considered by Knuth.¹ Quantification of "good": smallest singular value not too small. Some notation:

◮ Given an m × n matrix A and index sets

  I = {i_1, . . . , i_k},  1 ≤ i_1 < i_2 < · · · < i_k ≤ m,
  J = {j_1, . . . , j_ℓ},  1 ≤ j_1 < j_2 < · · · < j_ℓ ≤ n,

we let

  A(I, J) = [ a_{i_1,j_1}  · · ·  a_{i_1,j_ℓ} ]
            [     ⋮          ⋱        ⋮      ]
            [ a_{i_k,j_1}  · · ·  a_{i_k,j_ℓ} ]  ∈ R^{k×ℓ}.

The full index set is denoted by :, e.g., A(I, :).

◮ | det A| denotes the volume of a square matrix A.

¹ Knuth, Donald E. Semioptimal bases for linear dependencies. Linear and Multilinear Algebra 17 (1985), no. 1, 1–4.

SLIDE 4

Row selection from orthonormal basis

Lemma (Maximal volume yields good submatrix)

Let the index set I, #I = r, be chosen such that |det(U(I, :))| is maximal among all r × r submatrices of U. Then

  1/σ_min(U(I, :)) = ‖U(I, :)^{-1}‖_2 ≤ √(r(n − r) + 1).

Proof.² W.l.o.g. I = {1, . . . , r}. Consider

  Ũ = U U(I, :)^{-1} = [ I_r ]
                       [ B   ].

Because det Ũ(J, :) = det U(J, :)/det U(I, :) for every index set J with #J = r, Ũ(I, :) = I_r has maximal volume among all r × r submatrices of Ũ.

² Following Lemma 2.1 in [Goreinov, S. A.; Tyrtyshnikov, E. E.; Zamarashkin, N. L. A theory of pseudoskeleton approximations. Linear Algebra Appl. 261 (1997), 1–21].

SLIDE 5

Maximality of Ũ(I, :) implies max |b_ij| ≤ 1. Argument: If there were b_ij with |b_ij| > 1, then interchanging rows r + i and j of Ũ would increase the volume of Ũ(I, :). We have

  ‖B‖_2 ≤ ‖B‖_F ≤ √((n − r)r) · max |b_ij| ≤ √((n − r)r).

This yields the result:

  ‖U(I, :)^{-1}‖_2 = ‖U U(I, :)^{-1}‖_2 = ‖Ũ‖_2 = √(1 + ‖B‖_2^2) ≤ √(1 + (n − r)r).

SLIDE 6

Greedy row selection from orthonormal basis

Finding a submatrix of maximal volume is NP-hard.³ Greedy algorithm (column-by-column):⁴

◮ First step is easy: Choose i such that |u_i1| is maximal.
◮ Now assume that k < r steps have been performed and the first k columns have been processed. Task: Choose the optimal index in column k + 1.

There is a one-to-one connection between greedy row selection and Gaussian elimination with column pivoting!

³ Çivril, A., Magdon-Ismail, M.: On selecting a maximum volume sub-matrix of a matrix and related problems. Theoret. Comput. Sci. 410(47–49), 4801–4811 (2009).

⁴ Reinvented multiple times in the literature.

SLIDE 7

Greedy row selection from orthonormal basis

Gaussian elimination without pivoting applied to U ∈ R^{n×r}:

  for k = 1, . . . , r do
    L(:, k) ← U(:, k)/u_kk
    R(k, :) ← U(k, :)
    U ← U − L(:, k)R(k, :)
  end for

Let Ũ denote the updated matrix U obtained after k steps. Then:

◮ Ũ = U − LR with L = [L_11; L_21] ∈ R^{n×k} and R = [R_11, R_12] ∈ R^{k×r}, where L_11 is unit lower triangular and R_11 is upper triangular.

◮ Ũ is zero in the first k rows and columns:

  Ũ = [ 0   0    ]
      [ 0   Ũ_22 ],   Ũ_22 ∈ R^{(n−k)×(r−k)}.

SLIDE 8

Combining both relations gives

  U = LR + Ũ = [ L_11   0       ] [ R_11   R_12 ]
               [ L_21   I_{n−k} ] [ 0      Ũ_22 ].

Back to the greedy algorithm: By a suitable permutation, suppose that the first k indices are given by I_k = {1, . . . , k}. Then

  det(U(I_k ∪ {k + i}, I_k ∪ {k + 1})) = det(U(I_k, I_k)) · Ũ_22(i, 1).

Greedily maximizing the determinant: Choose i such that |Ũ_22(i, 1)| is maximal. This is Gaussian elimination with column pivoting! r steps of Gaussian elimination with column pivoting yield a factorization of the form

  PU = LR,

where

◮ P is a permutation matrix,
◮ L = [L_11; L_21] ∈ R^{n×r} with L_11 ∈ R^{r×r} unit lower triangular and max |ℓ_ij| ≤ 1,
◮ R ∈ R^{r×r} is upper triangular.

SLIDE 9

Greedy row selection from orthonormal basis

Simplified form of Gaussian elimination with column pivoting:

  Input: n × r matrix U
  Output: "good" index set I ⊂ {1, . . . , n}, #I = r

  Set I = ∅.
  for k = 1, . . . , r do
    Choose i* = argmax_{i=1,...,n} |u_ik|.
    Set I ← I ∪ {i*}.
    U ← U − U(:, k)U(i*, :)/u_{i*,k}
  end for

Performance of the greedy algorithm is often quite good in practice, but there are counterexamples (see later).
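The pseudocode translates directly into MATLAB; the following sketch (function name ours) returns the selected index set:

  function I = greedy_rows(U)
  % Greedy row selection from an n x r basis U: Gaussian elimination
  % with column pivoting, as in the pseudocode above.
      r = size(U, 2);
      I = zeros(r, 1);
      for k = 1:r
          [~, istar] = max(abs(U(:, k)));              % largest entry in column k
          I(k) = istar;
          U = U - U(:, k) * U(istar, :) / U(istar, k); % rank-1 elimination step
      end
  end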

SLIDE 10

Analysis of greedy row selection

Lemma (Theorem 8.15 in [Higham’2002])

Let T ∈ R^{n×n} be an upper triangular matrix satisfying |t_ii| ≥ |t_ij| for j > i. Then

  1 ≤ min_i |t_ii| · ‖T^{-1}‖_2 ≤ (1/3)√(4^n + 6n − 1) ≤ 2^{n−1}.

Proof. By diagonal scaling, we may assume without loss of generality that t_ii = 1. Let

  Z_n = [ 1  −1  · · ·  −1 ]
        [     1   ⋱     ⋮  ]
        [          ⋱   −1  ]
        [               1  ]  ∈ R^{n×n}.

By induction, one shows that |T^{-1}| ≤ Z_n^{-1} (where the absolute value and the inequality are understood elementwise).

SLIDE 11

By the monotonicity of the spectral norm,

  ‖T^{-1}‖_2 ≤ ‖Z_n^{-1}‖_2 ≤ ‖Z_n^{-1}‖_F.

Because (Z_n^{-1})_ij = 2^{j−i−1} for j > i (see exercises), we obtain

  ‖Z_n^{-1}‖_F^2 = Σ_{j=1}^{n} (1 + Σ_{i=1}^{j−1} 4^{j−i−1}) = (1/3) Σ_{j=1}^{n} (4^{j−1} + 2) = (1/9)(4^n + 6n − 1),

completing the proof.

Theorem

For the index set I returned by the greedy algorithm applied to orthonormal U ∈ R^{n×r}, it holds that

  ‖U(I, :)^{-1}‖_2 ≤ √(nr) · 2^{r−1}.

SLIDE 12
Proof. We start from

  PU = LR.  (1)

Partitioning L = [L_1; L_2] with L_1 ∈ R^{r×r}, factorization (1) implies U(I, :) = L_1R. Because PU is orthonormal, (1) also implies ‖R^{-1}‖_2 = ‖L‖_2, and hence

  ‖U(I, :)^{-1}‖_2 ≤ ‖L_1^{-1}‖_2 ‖R^{-1}‖_2 = ‖L_1^{-1}‖_2 ‖L‖_2.

Because the magnitudes of the entries of L are bounded by 1, we have ‖L‖_2 ≤ ‖L‖_F ≤ √(nr) · max |ℓ_ij| = √(nr). Applying the lemma to L_1^T in order to bound ‖L_1^{-1}‖_2 completes the proof.

SLIDE 13

Vector approximation

Goal: Approximate a vector f in the subspace range(U). For I = {i_1, . . . , i_k}, define the selection operator

  S_I = [e_{i_1}  e_{i_2}  · · ·  e_{i_k}].

The minimal error is attained by the orthogonal projection UU^T f. When it is replaced by the oblique projection

  U(S_I^T U)^{-1}S_I^T f,

the increase of the error is bounded by the result of the following lemma.

Lemma

  ‖f − U(S_I^T U)^{-1}S_I^T f‖_2 ≤ ‖(S_I^T U)^{-1}‖_2 · ‖f − UU^T f‖_2.

Proof. Let Π = U(S_I^T U)^{-1}S_I^T. Then

  ‖(I − Π)f‖_2 = ‖(I − Π)(f − UU^T f)‖_2 ≤ ‖I − Π‖_2 ‖f − UU^T f‖_2.

The proof is completed by noting (and using the exercises)

  ‖I − Π‖_2 = ‖Π‖_2 ≤ ‖(S_I^T U)^{-1}S_I^T‖_2 = ‖(S_I^T U)^{-1}‖_2.
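As a quick numerical sanity check, one can compare both projections in MATLAB (a sketch with random data; names ours, greedy_rows from above):

  n = 200; r = 10;
  [U, ~] = qr(randn(n, r), 0);         % orthonormal basis
  f = randn(n, 1);
  I = greedy_rows(U);                  % greedy index selection
  f_orth = U * (U' * f);               % orthogonal projection U*U'*f
  f_obl  = U * (U(I, :) \ f(I));       % oblique projection U*(S_I'*U)^(-1)*S_I'*f
  bound  = norm(inv(U(I, :))) * norm(f - f_orth);
  fprintf('oblique error %.2e <= bound %.2e\n', norm(f - f_obl), bound)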

SLIDE 14

Connection to interpolation

We have

  S_I^T (I − U(S_I^T U)^{-1}S_I^T) = 0

and hence

  ‖S_I^T (f − U(S_I^T U)^{-1}S_I^T f)‖_2 = 0.

Interpretation: f is "interpolated" exactly at the selected indices.

Example: Let f contain a discretization of exp(x) on [−1, 1] and let U contain an orthonormal basis of discretized monomials {1, x, x^2, . . .}.

[Figure: error of the oblique projection of the discretized exp(x), plotted over 200 grid points (vertical scale roughly ±0.2).]
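The example can be reproduced with a few lines of MATLAB (grid size and rank are our choices):

  n = 200; r = 10;
  x = linspace(-1, 1, n)';
  f = exp(x);
  [U, ~] = qr(x.^(0:r-1), 0);       % ONB of discretized monomials 1, x, ..., x^(r-1)
  I = greedy_rows(U);
  err = f - U * (U(I, :) \ f(I));   % interpolation error; err(I) = 0
  fprintf('max error %.2e, error at selected indices %.1e\n', ...
          norm(err, inf), norm(err(I), inf))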

SLIDE 15

Connection to interpolation

[Figure: interpolants after the first four greedy steps. Iteration 1, Err ≈ 14.8; Iteration 2, Err ≈ 5.7; Iteration 3, Err ≈ 0.7; Iteration 4, Err ≈ 0.14.]

SLIDE 16

Connection to interpolation

Comparison between the best approximation, the greedy approximation, and the approximation obtained by simply selecting the first r indices.

[Figure: approximation error vs. r = 2, . . . , 10 on a logarithmic scale (10^0 down to 10^{-10}) for the three strategies.]

Terminology:

◮ Continuous setting: EIM (Empirical Interpolation Method), [M. Barrault, Y. Maday, N. C. Nguyen, and A. T. Patera. An "empirical interpolation" method: Application to efficient reduced-basis discretization of partial differential equations. C. R. Math. Acad. Sci. Paris, 339 (2004), pp. 667–672].

◮ Discrete setting: DEIM (Discrete EIM), [S. Chaturantabut and D. C. Sorensen. Nonlinear model reduction via discrete empirical interpolation. SIAM Journal on Scientific Computing, 32(5), 2737–2764, 2010].

SLIDE 17

POD+DEIM

Consider a LARGE ODE of the form

  ẋ(t) = Ax(t) + F(x(t)),

where A is an n × n matrix. Idea of POD⁵:

1. Simulate the ODE for one or more initial conditions and collect the trajectories at discrete time points into a snapshot matrix: X = [x(t_1) · · · x(t_m)].

2. Compute an ONB U ∈ R^{n×r}, r ≪ n, of the dominant left singular subspace of X (e.g., by SVD).

3. Assume the approximation x ≈ UU^T x = Uy and project the dynamical system onto range(U):

  ẏ(t) = U^T AUy(t) + U^T F(Uy(t)).

⁵ See [S. Volkwein. Proper Orthogonal Decomposition: Theory and Reduced-Order Modelling. Lecture Notes, 2013] for a comprehensive introduction.

SLIDE 18

POD+DEIM

Problem: U^T F(Uy(t)) still involves the (large) dimension n of the original system. Using DEIM (and U^T U = I_r):

  U^T F(Uy(t)) ≈ (S_I^T U)^{-1}S_I^T F(Uy(t)),

leading to the reduced system

  ẏ(t) = U^T AUy(t) + (S_I^T U)^{-1}S_I^T F(Uy(t)).

Only #I = r instead of n entries of the function F need to be evaluated. Particularly efficient for componentwise functions:

  F(x) = [f_1(x_1); . . . ; f_n(x_n)]  ⇒  S_I^T F(Uy(t)) = [f_{i_1}(x_{i_1}); . . . ; f_{i_r}(x_{i_r})],  where x = Uy(t).

Example from [Chaturantabut/Sorensen'2010]: The discretized FitzHugh–Nagumo equations involve F(x) = x ⊙ (x − 0.1) ⊙ (1 − x).
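In code, the small factors can be precomputed once ("offline"), so that each right-hand side evaluation only touches r entries. A MATLAB sketch for a componentwise nonlinearity, assuming A, U, I, and an initial value y0 are given as above (variable names ours):

  Ar = U' * A * U;                      % r x r reduced linear part
  M  = inv(U(I, :));                    % (S_I' * U)^(-1)
  UI = U(I, :);                         % rows needed for x(I) = U(I,:) * y
  f  = @(x) x .* (x - 0.1) .* (1 - x);  % FitzHugh-Nagumo-type nonlinearity
  rhs = @(t, y) Ar * y + M * f(UI * y); % reduced ODE; f evaluated at r points only
  % e.g., [t, Y] = ode45(rhs, [0, 10], y0);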

SLIDE 19

Greedy row selection from orthonormal basis

QR-based variant of index selection, inspired by Orthogonal Matching Pursuit. In step k + 1:

1. Select the row i_{k+1} of maximal 2-norm from U.
2. Set x = U(i_{k+1}, :)^T/‖U(i_{k+1}, :)‖_2.
3. Update U ← U(I_r − xx^T).

This is QR with column pivoting applied to U^T.

Theorem

For the index set I_Q returned by QR with pivoting, it holds that

  ‖(S_{I_Q}^T U)^{-1}‖_2 = ‖U(I_Q, :)^{-1}‖_2 ≤ √(n − r + 1) · (1/3)√(4^r + 6r − 1).

Proof. See Theorem 2.1 in [Drmač, Z. and Gugercin, S. A New Selection Operator for the Discrete Empirical Interpolation Method: Improved A Priori Error Bound and Extensions. SIAM Journal on Scientific Computing, 2016, 38:2, A631–A648].

EFY. Using MATLAB's command qr, implement the method above. Apply it to the exponential function, as above, and compare with the standard greedy method.
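One possible two-line solution of this exercise via pivoted QR of U^T (a sketch, not necessarily the intended one):

  [~, ~, p] = qr(U', 'vector');   % QR with column pivoting applied to U^T
  IQ = p(1:size(U, 2));           % the first r pivots are the selected rows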

SLIDE 20

Counterexample for greedy

Let U be the Q-factor of an economy-sized QR factorization of the n × r matrix

  A = [  1                  ]
      [ −1    1             ]
      [  ⋮        ⋱         ]
      [ −1  · · ·  −1    1  ]
      [ −1  · · ·  −1   −1  ]
      [  ⋮          ⋮    ⋮  ]
      [ −1  · · ·  −1   −1  ],

a variation of a famous example by Wilkinson. (QR-based) greedy does no pivoting, at least in exact arithmetic.

[Figure: ‖U(I, :)^{-1}‖_2 vs. r for n = 100 and I returned by greedy; exponential growth up to about 10^{10} at r = 30.]
SLIDE 21

Row selection beyond greedy

Improve upon maxvol-based greedy (in a deterministic framework) via Knuth's iterative exchange of rows. Given an index set I, #I = r, and µ ≥ 1, µ ≈ 1, form Ũ = UU(I, :)^{-1}. Search for the largest element

  (i*, j*) = argmax |ũ_ij|.

If

  |ũ_{i*j*}| ≤ µ,  (2)

terminate the algorithm. Otherwise, set I ← I \ {j*} ∪ {i*} and repeat.

EFY. Show that this row exchange increases the volume of Ũ(I, :) and U(I, :) by at least a factor of µ. EFY. Implement this strategy (a sketch follows below) and apply it to Wilkinson's example from the previous slide. How many iterations are needed for r = 30 to achieve µ = 1.01?
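A minimal MATLAB sketch of this exchange strategy (function name ours; I is an initial index set, e.g., from greedy_rows):

  function [I, iter] = maxvol_exchange(U, I, mu)
  % Knuth-style row exchange: iterate until all entries of U*U(I,:)^(-1)
  % are bounded by mu in absolute value.
      iter = 0;
      while true
          Ut = U / U(I, :);                    % Ũ = U * U(I,:)^(-1)
          [val, idx] = max(abs(Ut(:)));        % largest element of Ũ
          if val <= mu, break; end             % stopping criterion (2)
          [istar, jstar] = ind2sub(size(Ut), idx);
          I(jstar) = istar;                    % exchange j*-th index by i*
          iter = iter + 1;
      end
  end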

QR-based greedy can be improved by existing methods for rank-revealing QR [Golub/Van Loan’2013].

SLIDE 22

Theorem

Let U(I_K, :) be the submatrix obtained upon termination of Knuth's procedure. Then

  |det U(I_K, :)| ≥ (µ√r)^{-r} · max_I |det U(I, :)|.

Proof. By construction, all entries of UU(I_K, :)^{-1} are bounded by µ in absolute value. In particular, this holds for X = U(I, :)U(I_K, :)^{-1} for any index set I ⊂ {1, . . . , n} with #I = r. Hence, by Hadamard's inequality,

  |det U(I, :)| / |det U(I_K, :)| = |det X| ≤ Π_{j=1}^{r} ‖X(:, j)‖_2 ≤ (√r)^r · Π_{j=1}^{r} ‖X(:, j)‖_∞ ≤ (µ√r)^r.
SLIDE 23

The CUR decomposition: Existence results

A ≈ CUR, where C contains columns of A, R contains rows of A, and U is chosen "wisely".

Theorem (Goreinov/Tyrtyshnikov/Zamarashkin'1997). Let ε := σ_{k+1}(A). Then there exist row indices I ⊂ {1, . . . , m}, column indices J ⊂ {1, . . . , n}, and a matrix S ∈ R^{k×k} such that

  ‖A − A(:, J)SA(I, :)‖_2 ≤ ε(1 + 2√k(√m + √n)).

Proof. Let U_k, V_k contain the k dominant left/right singular vectors of A. Choose I, J by selecting rows from U_k and V_k, respectively. According to the maximal volume lemma, the square matrices Û = U_k(I, :), V̂ = V_k(J, :) satisfy

  ‖Û^{-1}‖_2 ≤ √(k(m − k) + 1),   ‖V̂^{-1}‖_2 ≤ √(k(n − k) + 1).

It remains to choose S.

SLIDE 24

The CUR decomposition: Existence results

Form Φ = ÛΣ_kV̂^T and choose k̃ minimal such that

  ‖Φ − T_{k̃}(Φ)‖_2 ≤ ε / (‖Û^{-1}‖_2 ‖V̂^{-1}‖_2),

where T_{k̃}(Φ) denotes the best rank-k̃ approximation of Φ. We set S = T_{k̃}(Φ)^+. We now estimate the four different components of U_kΣ_kV_k^T − A(:, J)SA(I, :) with respect to U_k, V_k and their complements:

1. U_kU_k^T (· · ·) V_kV_k^T:

  ‖Σ_k − U_k^T A(:, J)SA(I, :)V_k‖_2 = ‖Σ_k − Σ_kV̂^T S ÛΣ_k‖_2
    = ‖Û^{-1}(Φ − Φ T_{k̃}(Φ)^+ Φ)V̂^{-T}‖_2
    = ‖Û^{-1}(Φ − T_{k̃}(Φ))V̂^{-T}‖_2 ≤ ε.

2. (I_m − U_kU_k^T) (· · ·) V_kV_k^T:

  ‖(I − U_kU_k^T)A(:, J)SA(I, :)V_k‖_2 ≤ ε‖SA(I, :)V_k‖_2 = ε‖T_{k̃}(Φ)^+ Φ V̂^{-T}‖_2 ≤ ε‖V̂^{-1}‖_2,

using ‖T_{k̃}(Φ)^+ Φ‖_2 ≤ 1.

3. U_kU_k^T (· · ·) (I − V_kV_k^T): As in 2, bounded by ε‖Û^{-1}‖_2.
SLIDE 25

The CUR decomposition: Existence results

4. (I − U_kU_k^T) (· · ·) (I − V_kV_k^T):

  ‖(I − U_kU_k^T)A(:, J)SA(I, :)(I − V_kV_k^T)‖_2 ≤ ε^2 ‖S‖_2 ≤ ε‖Û^{-1}‖_2 ‖V̂^{-1}‖_2,

where the last step uses that the minimal choice of k̃ guarantees ‖S‖_2 = σ_{k̃}(Φ)^{-1} ≤ ‖Û^{-1}‖_2 ‖V̂^{-1}‖_2 / ε. In summary, we obtain

  ‖A − A(:, J)SA(I, :)‖_2 ≤ ε(1 + ‖Û^{-1}‖_2 + ‖V̂^{-1}‖_2)^2,

which implies the result.

SLIDE 26

The CUR decomposition: Existence results

Choice of S = (A(I, J))^{-1} in CUR: The remainder term R := A − A(:, J)(A(I, J))^{-1}A(I, :) has zero rows at I and zero columns at J.

[Figure: cross approximation pattern; the selected rows (1, 3, 6) and columns (2, 4, 7) form a cross in A.]

SLIDE 27

Adaptive Cross Approximation (ACA)

A more direct attempt to find a good cross. Theorem (Goreinov/Tyrtyshnikov'2001). Suppose that

  A = [ A_11   A_12 ]
      [ A_21   A_22 ],

where A_11 ∈ R^{r×r} has maximal volume among all r × r submatrices of A. Then

  ‖A_22 − A_21A_11^{-1}A_12‖_C ≤ (r + 1)σ_{r+1}(A),

where ‖M‖_C := max_{ij} |m_ij|. As we already know, finding A_11 is NP-hard [Çivril/Magdon-Ismail'2013].

SLIDE 28

Adaptive Cross Approximation (ACA)

Proof of the theorem for (r + 1) × (r + 1) matrices. Consider

  A = [ A_11     a_12 ]
      [ a_21^T   a_22 ],   A_11 ∈ R^{r×r}, a_12, a_21 ∈ R^{r×1}, a_22 ∈ R,

with invertible A_11. We recall the definition of the adjugate of a matrix: adj(A) = C^T, c_ij = (−1)^{i+j}m_ij, where m_ij is the determinant of the matrix obtained by deleting row i and column j of A. By the max-volume assumption, |m_{r+1,r+1}| is maximal among all |m_ij|. On the other hand, A^{-1} = adj(A)/det A. This implies that an element of A^{-1} of maximum absolute value is at position (r + 1, r + 1):

  |(A^{-1})_{r+1,r+1}| = ‖A^{-1}‖_C := max_{i,j} |(A^{-1})_ij|.
SLIDE 29

Adaptive Cross Approximation (ACA)

Proof of the theorem, continued. On the other hand, we have for any k × ℓ matrix B that ‖B‖_2 ≤ ‖B‖_F ≤ √(kℓ)‖B‖_C. Thus,

  σ_{r+1}(A)^{-1} = ‖A^{-1}‖_2 ≤ (r + 1)‖A^{-1}‖_C = (r + 1)|(A^{-1})_{r+1,r+1}|.

This completes the proof, using (e.g., via the Schur complement)

  (A^{-1})_{r+1,r+1} = 1/(a_22 − a_21^T A_11^{-1} a_12).

SLIDE 30

Adaptive Cross Approximation (ACA)

ACA with full pivoting [Bebendorf/Tyrtyshnikov'2000]

  1: Set R_0 := A, I := {}, J := {}, k := 0
  2: repeat
  3:   k := k + 1
  4:   (i_k, j_k) := argmax_{i,j} |R_{k−1}(i, j)|
  5:   I ← I ∪ {i_k}, J ← J ∪ {j_k}
  6:   δ_k := R_{k−1}(i_k, j_k)
  7:   u_k := R_{k−1}(:, j_k), v_k := R_{k−1}(i_k, :)^T/δ_k
  8:   R_k := R_{k−1} − u_kv_k^T
  9: until ‖R_k‖_F ≤ ε‖A‖_F

◮ This is greedy for maxvol. (Proof on next slide.)
◮ Still too expensive for general matrices.
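For small dense test matrices the algorithm can be transcribed literally; a MATLAB sketch (function name ours) returning A ≈ U*V':

  function [U, V, I, J] = aca_full(A, epsilon)
  % ACA with full pivoting; forms the remainders R_k explicitly and is
  % therefore only meant for small dense matrices.
      R = A; I = []; J = []; U = []; V = [];
      nrmA = norm(A, 'fro');
      while norm(R, 'fro') > epsilon * nrmA
          [~, idx] = max(abs(R(:)));              % full pivot search
          [ik, jk] = ind2sub(size(R), idx);
          I = [I, ik]; J = [J, jk];
          u = R(:, jk); v = R(ik, :)' / R(ik, jk);
          U = [U, u]; V = [V, v];                 % A ≈ U * V'
          R = R - u * v';                         % rank-1 update
      end
  end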

SLIDE 31

Adaptive Cross Approximation (ACA)

Lemma (Bebendorf'2000). Let I_k = {i_1, . . . , i_k} and J_k = {j_1, . . . , j_k} be the row/column index sets constructed in step k of the algorithm. Then

  det(A(I_k, J_k)) = R_0(i_1, j_1) · · · R_{k−1}(i_k, j_k).

Proof. From lines 7 and 8 of the algorithm, R_{k−1}(I_k, j_k) is obtained from A(I_k, j_k) by subtracting scalar multiples of columns j_1, . . . , j_{k−1} of A. Hence, there is a vector y such that

  Ã_k := [A(I_k, J_{k−1}), R_{k−1}(I_k, j_k)] = A(I_k, J_k) [ I_{k−1}   y ]
                                                            [ 0         1 ].

This implies det(Ã_k) = det(A(I_k, J_k)). However, R_{k−1}(i, j_k) = 0 for all i ∈ {i_1, . . . , i_{k−1}}, and hence

  det(Ã_k) = R_{k−1}(i_k, j_k) · det(A(I_{k−1}, J_{k−1})).

Since det A(I_1, J_1) = A(i_1, j_1) = R_0(i_1, j_1), the result follows by induction.
SLIDE 32

Adaptive Cross Approximation (ACA)

ACA with partial pivoting

  1: Set R_0 := A, I := {}, J := {}, k := 1, i* := 1
  2: repeat
  3:   j* := argmax_j |R_{k−1}(i*, j)|
  4:   δ_k := R_{k−1}(i*, j*)
  5:   if δ_k = 0 then
  6:     if #I = min{m, n} − 1 then
  7:       Stop
  8:     end if
  9:   else
  10:    u_k := R_{k−1}(:, j*), v_k := R_{k−1}(i*, :)^T/δ_k
  11:    R_k := R_{k−1} − u_kv_k^T
  12:    k := k + 1
  13:  end if
  14:  I ← I ∪ {i*}, J ← J ∪ {j*}
  15:  i* := argmax_{i∉I} |u_k(i)|
  16: until stopping criterion is satisfied
SLIDE 33

Adaptive Cross Approximation (ACA)

ACA with partial pivoting. Remarks:

◮ R_k is never formed explicitly. Entries of R_k are computed from

  R_k(i, j) = A(i, j) − Σ_{ℓ=1}^{k} u_ℓ(i)v_ℓ(j).

◮ The ideal stopping criterion ‖u_k‖_2 ‖v_k‖_2 ≤ ε‖A‖_F is elusive, since ‖A‖_F is not available. Replace ‖A‖_F by ‖A_k‖_F, where A_k = Σ_{ℓ=1}^{k} u_ℓv_ℓ^T is the current approximation, recursively computed via

  ‖A_k‖_F^2 = ‖A_{k−1}‖_F^2 + 2 Σ_{j=1}^{k−1} (u_k^T u_j)(v_j^T v_k) + ‖u_k‖_2^2 ‖v_k‖_2^2.
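A MATLAB sketch of ACA with partial pivoting along these lines, for a matrix given only through an entry evaluation function Afun (names and simplifications ours; the zero-pivot branch of the pseudocode is replaced by an early stop):

  function [U, V] = aca_partial(Afun, m, n, epsilon)
  % Afun(i, j) must return the submatrix A(i, j) for index vectors i, j,
  % e.g., Afun = @(i, j) 1 ./ (i(:) + j(:)' - 1) for the Hilbert matrix.
      U = zeros(m, 0); V = zeros(n, 0);
      istar = 1; I = []; nrm2 = 0;
      while true
          v = Afun(istar, 1:n)' - V * U(istar, :)';  % row i* of R_{k-1}
          [~, jstar] = max(abs(v));
          if v(jstar) == 0, break; end               % simplification: stop
          v = v / v(jstar);
          u = Afun(1:m, jstar) - U * V(jstar, :)';   % column j* of R_{k-1}
          nrm2 = nrm2 + 2 * (u' * U) * (V' * v) + (u' * u) * (v' * v);
          U = [U, u]; V = [V, v]; I = [I, istar];    % A ≈ U * V'
          if norm(u) * norm(v) <= epsilon * sqrt(nrm2), break; end
          w = abs(u); w(I) = 0;                      % exclude already used rows
          [~, istar] = max(w);
      end
  end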

SLIDE 34

Adaptive Cross Approximation (ACA)

Two 100 × 100 matrices: (a) the Hilbert matrix A defined by A(i, j) = 1/(i + j − 1); (b) the matrix A defined by A(i, j) = exp(−γ|i − j|/n) with γ = 0.1.

[Figure: relative error ‖A − A_k‖_2/‖A‖_2 vs. k for full pivoting, partial pivoting, and the SVD; left panel: Hilbert matrix, right panel: exponential matrix.]

1. Excellent convergence for the Hilbert matrix.
2. Slow singular value decay impedes partial pivoting.

SLIDE 35

ACA is Gaussian elimination

We have

  R_k = R_{k−1} − (1/δ_k)R_{k−1}e_ke_k^T R_{k−1} = (I − (1/δ_k)R_{k−1}e_ke_k^T)R_{k−1} = L_kR_{k−1},

assuming (after permutation) that the pivot in step k sits at position (k, k), i.e., δ_k = e_k^T R_{k−1}e_k. The matrix L_k ∈ R^{m×m} agrees with the identity matrix except for its k-th column, which has ℓ_{k,k} = 0 and

  ℓ_{i,k} = −(e_i^T R_{k−1}e_k)/(e_k^T R_{k−1}e_k)   for i = k + 1, . . . , m.

Thus L_k differs only in position (k, k) from the usual lower triangular factor in Gaussian elimination.
SLIDE 36

ACA for SPSD matrices

For a symmetric positive semi-definite matrix A ∈ R^{n×n}:

◮ The SVD becomes the spectral decomposition.
◮ Can use the trace instead of the Frobenius norm to control the error.
◮ R_k stays SPSD.
◮ Choice of rows/columns, e.g., by the largest diagonal element of R_k.
◮ ACA becomes
  = Cholesky with diagonal pivoting; analysis in [Higham'1990];
  = the Nyström method [Williams/Seeger'2001].

See [Harbrecht/Peters/Schneider'2012] for more details.
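In this SPSD setting, ACA collapses to a few lines; a MATLAB sketch of Cholesky with diagonal pivoting (function name ours), returning A ≈ L*L' and using the trace of R_k for error control:

  function L = pivoted_cholesky(A, epsilon)
  % ACA for an SPSD matrix = Cholesky with diagonal pivoting.
      n = size(A, 1); d = diag(A); L = zeros(n, 0);
      while sum(d) > epsilon                          % trace of remainder R_k
          [~, i] = max(d);                            % pivot: largest diagonal entry
          c = (A(:, i) - L * L(i, :)') / sqrt(d(i));  % next column of L
          L = [L, c];
          d = d - c.^2;                               % diagonal of new remainder
      end
  end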
