Low Rank Approximation Lecture 1
Daniel Kressner
Chair for Numerical Algorithms and HPC
Institute of Mathematics, EPFL
daniel.kressner@epfl.ch
1
Organizational aspects
◮ Lectures: Tuesday 8-10, MA A110. First: September 25, Last: December 18.
◮ Exercises: Tuesday 8-10, MA A110. First: September 25, Last: December 18.
◮ Exam: Miniproject + oral exam.
◮ Webpage: https://anchp.epfl.ch/lowrank.
◮ Contact: daniel.kressner@epfl.ch, lana.perisa@epfl.ch
2
... his [Aleksandr Kogan's] message went on to confirm that his approach was indeed similar to SVD or other matrix factorization methods, like those used in the Netflix Prize competition, and the Kosinski-Stillwell-Graepel Facebook model. Dimensionality reduction of Facebook data was the core of his model.
3
For a field F, let A ∈ F^{m×n}. Then rank(A) := dim(range(A)). For simplicity, F = R throughout the lecture and often m ≥ n.
Lemma
Let A ∈ R^{m×n}. Then:
◮ rank(PAQ) = rank(A) for invertible matrices P ∈ R^{m×m}, Q ∈ R^{n×n};
◮ rank([A11, A12; 0, A22]) ≥ rank(A11) + rank(A22) for any partition with A11 ∈ R^{m1×n1}, A12 ∈ R^{m1×n2}, A22 ∈ R^{m2×n2}.
Proof: See Linear Algebra 1 / Exercises.
4
Let {b1, ..., br} ⊂ R^m with r = rank(A) be a basis of range(A). Then each of the columns of A = [a1, a2, ..., an] can be written as a linear combination of this basis:
ai = b1 ci1 + b2 ci2 + ··· + br cir = [b1, ..., br] [ci1; ...; cir],
for some coefficients cij ∈ R with i = 1, ..., n, j = 1, ..., r. Stacking these relations column by column gives
A = [b1, ..., br] [c11 ··· cn1; ... ; c1r ··· cnr].
5
This is a factorization of the form A = BC^T with B ∈ R^{m×r}, C ∈ R^{n×r}. We say that A has low rank if rank(A) ≪ m, n.
Illustration of low-rank factorization:
            A     BC^T
#entries    mn    mr + nr
◮ Generically (and in most applications), A has full rank, that is, rank(A) = min{m, n}.
◮ Aim instead at approximating A by a low-rank matrix; see the sketch below.
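To make the savings concrete, a minimal MATLAB sketch (the sizes m, n, r are illustrative) comparing a matrix-vector product via the factors with one via the full matrix:

% Compare storing/applying a rank-r factorization A = B*C' with the full A.
m = 4000; n = 3000; r = 20;
B = randn(m, r); C = randn(n, r);   % factors: (m + n)*r entries
A = B*C';                           % full matrix: m*n entries
x = randn(n, 1);
y1 = A*x;                           % costs O(m*n) operations
y2 = B*(C'*x);                      % costs O((m + n)*r) operations
norm(y1 - y2)/norm(y1)              % agreement up to roundoff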
6
What? Theoretical foundations of low-rank approximation.
When? A priori and a posteriori estimates for low-rank approximation techniques.
Why? Applications in engineering, scientific computing, data analysis, ... where low-rank approximation plays a central role.
How? State-of-the-art algorithms for performing and working with low-rank approximations.
Will cover both matrices and tensors.
7
Golub/Van Loan’2013 Golub, Gene H.; Van Loan, Charles F. Matrix Computations. Fourth edition. Johns Hopkins University Press, Baltimore, MD, 2013.
Horn/Johnson’2013 Horn, Roger A.; Johnson, Charles R. Matrix Analysis. Second edition. Cambridge University Press, Cambridge, 2013.
+ References on slides.
8
◮ SVD
◮ Relation to eigenvalues
◮ Norms
◮ Best low-rank approximation
9
Theorem (SVD). Let A ∈ R^{m×n} with m ≥ n. Then there are orthogonal matrices U = [u1, ..., um] ∈ R^{m×m} and V = [v1, ..., vn] ∈ R^{n×n} such that
A = UΣV^T, with Σ = [diag(σ1, ..., σn); 0] ∈ R^{m×n} and σ1 ≥ σ2 ≥ ··· ≥ σn ≥ 0.
◮ σ1, ..., σn are called singular values
◮ u1, ..., un are called left singular vectors
◮ v1, ..., vn are called right singular vectors
◮ Avi = σi ui, A^T ui = σi vi for i = 1, ..., n; see the check below.
◮ Singular values are always uniquely defined by A.
◮ Singular vectors are never unique. If σ1 > σ2 > ··· > σn > 0, then they are unique up to signs: ui ← ±ui, vi ← ±vi.
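These relations are easy to verify numerically; a small MATLAB check on random data (any index 1 ≤ i ≤ n works):

% Check A*v_i = sigma_i*u_i and A'*u_i = sigma_i*v_i.
A = randn(6, 4);
[U, S, V] = svd(A);
i = 2;
norm(A*V(:,i)  - S(i,i)*U(:,i))     % ~1e-15
norm(A'*U(:,i) - S(i,i)*V(:,i))     % ~1e-15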
10
Proof: Induction over n. n = 1 trivial. For general n, let v1 solve max{‖Av‖2 : ‖v‖2 = 1} =: ‖A‖2. Set σ1 := ‖A‖2 and u1 := Av1/σ1.¹ By definition, Av1 = σ1 u1. After completion to orthogonal matrices U1 = [u1, U⊥], V1 = [v1, V⊥]:
U1^T A V1 = [u1^T A v1, u1^T A V⊥; U⊥^T A v1, U⊥^T A V⊥] = [σ1, w^T; 0, A1]
with w := V⊥^T A^T u1 and A1 := U⊥^T A V⊥. Because ‖·‖2 is invariant under orthogonal transformations,
σ1 = ‖A‖2 = ‖U1^T A V1‖2 ≥ ‖[σ1, w^T; 0, A1] [σ1; w]‖2 / ‖[σ1; w]‖2 ≥ √(σ1² + ‖w‖2²).
Hence, w = 0. Proof completed by applying induction to A1.

¹If σ1 = 0, choose an arbitrary u1 with ‖u1‖2 = 1.
11
◮ r = rank(A) is the number of nonzero singular values of A.
◮ kernel(A) = span{v_{r+1}, ..., vn}
◮ range(A) = span{u1, ..., ur}
12
Computation of the SVD proceeds in two steps:
1. Using Householder reflectors applied from the left and n − 1 Householder reflectors applied from the right, compute orthogonal matrices U1, V1 such that
U1^T A V1 = B = [B1; 0], that is, B1 ∈ R^{n×n} is an upper bidiagonal matrix.
2. Compute orthogonal matrices U2, V2 such that U2^T B1 V2 is diagonal.
Set U = U1U2 and V = V1V2. Step 1 is usually the most expensive. Remarks on Step 1:
◮ If m is significantly larger than n, say m ≥ 3n/2, first computing a QR decomposition of A reduces the cost; see the check below.
◮ Most modern implementations reduce A successively via banded form to bidiagonal form.²

²Bischof, C. H.; Lang, B.; Sun, X. A framework for symmetric band reduction. ACM Trans. Math. Software 26 (2000), no. 4, 581–601.
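The first remark rests on the fact that A = QR with orthonormal Q implies that A and R have the same singular values; a quick MATLAB check:

% For tall matrices, the singular values of A and of its QR factor agree.
A = randn(3000, 200);
[~, R] = qr(A, 0);                  % economy size QR decomposition
norm(svd(A) - svd(R))               % ~1e-12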
13
In most applications, the vectors u_{n+1}, ..., um are not of interest. By discarding them, one obtains:
Theorem (Economy size SVD). Let A ∈ R^{m×n} with m ≥ n. Then there is a matrix U ∈ R^{m×n} with orthonormal columns and an orthogonal matrix V ∈ R^{n×n} such that
A = UΣV^T, with Σ = diag(σ1, ..., σn) ∈ R^{n×n} and σ1 ≥ σ2 ≥ ··· ≥ σn ≥ 0.
Computed by MATLAB's [U,S,V] = svd(A,'econ').
Complexity:
                       memory         operations
singular values only   O(mn)          O(mn²)
economy size SVD       O(mn)          O(mn²)
(full) SVD             O(m² + mn)     O(m²n + mn²)
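A small MATLAB sketch (illustrative sizes) of the three variants and their output dimensions:

% Compare outputs of full and economy size SVD.
A = randn(1000, 50);
[U , S , V ] = svd(A);          % U: 1000x1000, S: 1000x50, V: 50x50
[Ue, Se, Ve] = svd(A, 'econ');  % Ue: 1000x50,  Se: 50x50,  Ve: 50x50
s = svd(A);                     % singular values only
norm(A - Ue*Se*Ve')             % economy size SVD still reproduces A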
14
Beware of roundoff error when interpreting singular value plots. Example: semilogy(svd(hilb(100)))
[Plot: computed singular values of hilb(100) on a semilog scale, from 10^0 down to about 10^{-20}; the decay flattens into a kink at roundoff level.]
◮ The kink is caused by roundoff error and does not reflect the true behavior of the singular values.
◮ The exact singular values are known to decay exponentially.³
◮ Sometimes more accuracy is possible.⁴
³Beckermann, B. The condition number of real Vandermonde, Krylov and positive definite Hankel matrices. Numer. Math. 85 (2000), no. 4, 553–577.
⁴Drmač, Z.; Veselić, K. New fast and accurate Jacobi SVD algorithm. I. SIAM J. Matrix Anal. Appl. 29 (2007), no. 4, 1322–1342.
15
Symmetric A = A^T ∈ R^{n×n} admits a spectral decomposition A = U diag(λ1, λ2, ..., λn) U^T with an orthogonal matrix U. After reordering, we may assume |λ1| ≥ |λ2| ≥ ··· ≥ |λn|. The spectral decomposition can be turned into an SVD A = UΣV^T by defining Σ = diag(|λ1|, ..., |λn|) and V = U diag(sign(λ1), ..., sign(λn)).
Remark: This extends to the more general case of normal matrices (e.g., orthogonal or skew-symmetric) via complex spectral or real Schur decompositions.
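A minimal MATLAB sketch of this construction (random symmetric test matrix; assumes all eigenvalues are nonzero, so that V is orthogonal):

% Turn a spectral decomposition of a symmetric matrix into an SVD.
n = 5;
A = randn(n); A = A + A';                 % symmetric test matrix
[U, D] = eig(A);
[~, p] = sort(abs(diag(D)), 'descend');   % reorder by |lambda|
U = U(:, p); lambda = diag(D); lambda = lambda(p);
Sigma = diag(abs(lambda));
V = U*diag(sign(lambda));                 % assumes all lambda ~= 0
norm(A - U*Sigma*V')                      % ~1e-15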
16
Consider the SVD A = UΣV^T of A ∈ R^{m×n} with m ≥ n. We then have:
A^T A = V Σ^T Σ V^T = V diag(σ1², ..., σn²) V^T
⟹ A^T A has eigenvalues σ1², ..., σn²; the right singular vectors of A are eigenvectors of A^T A.
A A^T = U Σ Σ^T U^T = U diag(σ1², ..., σn², 0, ..., 0) U^T
⟹ A A^T has eigenvalues σ1², ..., σn² and, additionally, m − n zero eigenvalues; the first n left singular vectors of A are eigenvectors of A A^T.
The symmetric embedding
[0, A; A^T, 0] ∈ R^{(m+n)×(m+n)}
has eigenvalues ±σj, j = 1, ..., n, and zero (m − n times); see the check below.
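A quick numerical check of these relations in MATLAB (illustrative sizes m = 7, n = 4):

% Eigenvalue/singular value relations.
m = 7; n = 4;
A = randn(m, n);
s = svd(A);
norm(sort(eig(A'*A), 'descend') - s.^2)          % ~1e-13
% Symmetric embedding: eigenvalues +-sigma_j and m-n zeros.
sort(eig([zeros(m) A; A' zeros(n)]), 'descend')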
17
Given the SVD A = UΣV^T, one defines:
◮ Spectral norm: ‖A‖2 = σ1.
◮ Frobenius norm: ‖A‖F = √(σ1² + ··· + σn²).
Basic properties:
◮ ‖A‖2 = max{‖Av‖2 : ‖v‖2 = 1} (see proof of SVD).
◮ ‖·‖2 and ‖·‖F are both (submultiplicative) matrix norms.
◮ ‖·‖2 and ‖·‖F are both unitarily invariant, that is,
‖QAZ‖2 = ‖A‖2, ‖QAZ‖F = ‖A‖F for any orthogonal matrices Q, Z.
◮ ‖A‖2 ≤ ‖A‖F ≤ √r ‖A‖2 with r = rank(A).
◮ ‖AB‖F ≤ min{‖A‖2 ‖B‖F, ‖A‖F ‖B‖2}; see the checks below.
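The identities and the norm inequality can be checked directly in MATLAB:

% Norm identities and ||A||_2 <= ||A||_F <= sqrt(r)*||A||_2.
A = randn(6, 5);
s = svd(A);
abs(norm(A) - s(1))                 % spectral norm equals sigma_1
abs(norm(A, 'fro') - norm(s))       % Frobenius norm equals ||s(A)||_2
r = rank(A);
norm(A, 'fro') <= sqrt(r)*norm(A)   % logical 1 (true)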
18
Let B ∈ R^{n×n} have eigenvalues λ1, ..., λn ∈ C. Then trace(B) := b11 + ··· + bnn = λ1 + ··· + λn. In turn,
‖A‖F² = trace(A^T A) = trace(A A^T) = Σ_{i,j} a_{ij}².
Two simple consequences:
◮ ‖·‖F is the norm induced by the matrix inner product ⟨A, B⟩ := trace(AB^T), A, B ∈ R^{m×n}.
◮ Partition A = [a1, a2, ..., an] column-wise and set vec(A) = [a1; ...; an] ∈ R^{mn}. Then ⟨A, B⟩ = ⟨vec(A), vec(B)⟩ and ‖A‖F = ‖vec(A)‖2; see the check below.
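In MATLAB, vec(A) is A(:), so both consequences reduce to one-liners:

% Frobenius inner product via trace and via vec.
A = randn(4, 3); B = randn(4, 3);
abs(trace(A*B') - A(:)'*B(:))       % ~1e-15
abs(norm(A, 'fro') - norm(A(:)))    % ~0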
19
Theorem
For m ≥ n, let A, B ∈ R^{m×n} have singular values σ1(A) ≥ ··· ≥ σn(A) and σ1(B) ≥ ··· ≥ σn(B), respectively. Then
|⟨A, B⟩| ≤ σ1(A)σ1(B) + ··· + σn(A)σn(B).
Proof of von Neumann's trace inequality in lecture notes.⁵
Consequence:
‖A − B‖F² = ⟨A − B, A − B⟩ = ‖A‖F² − 2⟨A, B⟩ + ‖B‖F²
≥ ‖A‖F² − 2 Σ_{i=1}^n σi(A)σi(B) + ‖B‖F²
= Σ_{i=1}^n (σi(A) − σi(B))².
⁵This proof follows [Grigorieff, R. D. Note on von Neumann's trace inequality. Math. Nachr. 151 (1991), 327–328]. The inequality also holds for complex matrices; see Theorem 8.7.6 in [Horn/Johnson’2013].
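Both the inequality and its consequence are easy to test numerically; a quick MATLAB check with random matrices:

% Von Neumann's trace inequality and the resulting Frobenius bound.
A = randn(6, 4); B = randn(6, 4);
abs(trace(A*B')) <= sum(svd(A).*svd(B))            % logical 1 (true)
norm(A - B, 'fro')^2 >= sum((svd(A) - svd(B)).^2)  % logical 1 (true)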
20
There are other unitarily invariant matrix norms.⁶ Let s(A) = (σ1, ..., σn). The p-Schatten norm defined by ‖A‖(p) := ‖s(A)‖p is a matrix norm for any 1 ≤ p ≤ ∞.
p = ∞: spectral norm, p = 2: Frobenius norm, p = 1: nuclear norm.
Definition
The dual of a matrix norm ‖·‖ on R^{m×n} is defined by ‖A‖^D := max{⟨A, B⟩ : ‖B‖ = 1}.
Lemma
Let p, q ∈ [1, ∞] such that p⁻¹ + q⁻¹ = 1. Then ‖A‖(p)^D = ‖A‖(q).
⁶Complete characterization via symmetric gauge functions in [Horn/Johnson’2013].
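In MATLAB, the Schatten norms can be evaluated as vector norms of the singular values; a small sketch:

% Schatten norms from the vector of singular values.
A = randn(5, 4);
s = svd(A);
[norm(s, Inf), norm(A)]         % p = Inf: spectral norm
[norm(s, 2),   norm(A, 'fro')]  % p = 2:   Frobenius norm
norm(s, 1)                      % p = 1:   nuclear norm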
21
Consider k < n and let
Uk := [u1, ..., uk], Σk := diag(σ1, ..., σk), Vk := [v1, ..., vk].
Then Tk(A) := Uk Σk Vk^T has rank at most k. For any unitarily invariant norm ‖·‖:
‖Tk(A) − A‖ = ‖diag(0, ..., 0, σ_{k+1}, ..., σn)‖.
In particular,
‖A − Tk(A)‖2 = σ_{k+1},  ‖A − Tk(A)‖F = √(σ_{k+1}² + ··· + σn²);
see the check below. A and Tk(A) are nearly equal if and only if the singular values decay sufficiently quickly.
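A MATLAB check of both error formulas on a matrix with rapidly decaying singular values:

% Truncated SVD T_k(A) and its approximation errors.
A = hilb(50); k = 5;
[U, S, V] = svd(A);
Tk = U(:,1:k)*S(1:k,1:k)*V(:,1:k)';
s = diag(S);
abs(norm(A - Tk) - s(k+1))                    % spectral norm error
abs(norm(A - Tk, 'fro') - norm(s(k+1:end)))   % Frobenius norm error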
22
Theorem (Schmidt-Mirsky). Let A ∈ R^{m×n}. Then
‖A − Tk(A)‖ = min{‖A − B‖ : B ∈ R^{m×n}, rank(B) ≤ k}
for any unitarily invariant norm ‖·‖.
Proof⁷ for ‖·‖F: Follows directly from the consequence of von Neumann's trace inequality.
Proof for ‖·‖2: For any B ∈ R^{m×n} of rank ≤ k, kernel(B) has dimension ≥ n − k. Hence, there exists w ∈ kernel(B) ∩ range(V_{k+1}) with ‖w‖2 = 1. Then
‖A − B‖2² ≥ ‖(A − B)w‖2² = ‖Aw‖2² = ‖A V_{k+1} V_{k+1}^T w‖2² = ‖U_{k+1} Σ_{k+1} V_{k+1}^T w‖2²
= Σ_{j=1}^{k+1} σj² |vj^T w|² ≥ σ_{k+1}² Σ_{j=1}^{k+1} |vj^T w|² = σ_{k+1}².
Hence ‖A − B‖2 ≥ σ_{k+1} = ‖A − Tk(A)‖2.

⁷See Section 7.4.9 in [Horn/Johnson’2013] for the general case.
23
Uniqueness:
◮ If σk > σ_{k+1}, the best rank-k approximation with respect to the Frobenius norm is unique.
◮ If σk = σ_{k+1}, the best rank-k approximation is never unique. For example, I3 has several best rank-two approximations:
diag(1, 1, 0), diag(1, 0, 1), diag(0, 1, 1).
◮ With respect to the spectral norm, the best rank-k approximation is only unique if σ_{k+1} = 0. For example, diag(2, 1, ε) with 0 < ε < 1 has infinitely many best rank-two approximations:
diag(2, 1, 0), diag(2 − ε/2, 1 − ε/2, 0), diag(2 − ε/3, 1 − ε/3, 0), ...
24
Aim at finding a matrix Q ∈ R^{m×k} with orthonormal columns such that range(Q) ≈ range(A). I − QQ^T is the orthogonal projector onto range(Q)^⊥. Aim at minimizing ‖(I − QQ^T)A‖ = ‖A − QQ^T A‖ for a unitarily invariant norm ‖·‖. Because rank(QQ^T A) ≤ k,
‖A − QQ^T A‖ ≥ ‖A − Tk(A)‖.
Setting Q = Uk one obtains
Uk Uk^T A = Uk Uk^T U Σ V^T = Uk Σk Vk^T = Tk(A).
⟹ Q = Uk is optimal.
25
Variation: max{‖Q^T A‖F : Q^T Q = Ik}. Equivalent to max{|⟨AA^T, QQ^T⟩| : Q^T Q = Ik}. By von Neumann's trace inequality and the correspondence between eigenvectors of AA^T and left singular vectors of A, the optimal Q is again given by Uk; see the check below.
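A MATLAB sketch comparing the optimal choice Q = Uk with an arbitrary orthonormal basis (sizes are illustrative):

% Q = U_k minimizes ||A - Q*Q'*A||_F over Q with k orthonormal columns.
A = randn(100, 40); k = 10;
[U, S, ~] = svd(A, 'econ');
Qopt = U(:, 1:k);
[Qrnd, ~] = qr(randn(100, k), 0);   % some other orthonormal basis
norm(A - Qopt*(Qopt'*A), 'fro')     % equals sqrt(sum_{j>k} sigma_j^2)
norm(A - Qrnd*(Qrnd'*A), 'fro')     % always at least as large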
26