SLIDE 1

Low Rank Approximation Lecture 1

Daniel Kressner
Chair for Numerical Algorithms and HPC
Institute of Mathematics, EPFL
daniel.kressner@epfl.ch


SLIDE 2

Organizational aspects

◮ Lectures: Tuesday 8-10, MA A110. First: September 25, Last: December 18.
◮ Exercises: Tuesday 8-10, MA A110. First: September 25, Last: December 18.
◮ Exam: Miniproject + oral exam.
◮ Webpage: https://anchp.epfl.ch/lowrank
◮ Contact: daniel.kressner@epfl.ch, lana.perisa@epfl.ch


SLIDE 3

From http://www.niemanlab.org:

... his [Aleksandr Kogan's] message went on to confirm that his approach was indeed similar to SVD or other matrix factorization methods, like in the Netflix Prize competition, and the Kosinski-Stillwell-Graepel Facebook model. Dimensionality reduction of Facebook data was the core of his model.


SLIDE 4

Rank and basic properties

For a field F, let A ∈ F^{m×n}. Then rank(A) := dim(range(A)). For simplicity, F = R throughout the lecture and often m ≥ n.

Lemma

Let A ∈ R^{m×n}. Then

1. rank(A^T) = rank(A);
2. rank(PAQ) = rank(A) for invertible matrices P ∈ R^{m×m}, Q ∈ R^{n×n};
3. rank(AB) ≤ min{rank(A), rank(B)} for any matrix B ∈ R^{n×p};
4. rank([A11, A12; 0, A22]) ≥ rank(A11) + rank(A22) for A11 ∈ R^{m1×n1}, A12 ∈ R^{m1×n2}, A22 ∈ R^{m2×n2}.

Proof: See Linear Algebra 1 / Exercises.

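A quick numerical sanity check of properties 1-3 (an illustrative MATLAB sketch added here, not part of the original slides; all sizes are arbitrary):

    % Check rank properties on a random rank-2 matrix.
    m = 8; n = 6; p = 5; r = 2;
    A = randn(m, r) * randn(r, n);   % rank(A) = 2 by construction
    P = randn(m); Q = randn(n);      % random square matrices are generically invertible
    B = randn(n, p);
    rank(A')                         % property 1: equals rank(A) = 2
    rank(P * A * Q)                  % property 2: equals rank(A) = 2
    rank(A * B)                      % property 3: at most min{rank(A), rank(B)} = 2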

SLIDE 5

Rank and matrix factorizations

Let B = {b1, . . . , br} ⊂ R^m with r = rank(A) be a basis of range(A). Then each of the columns of A = [a1, a2, . . . , an] can be expressed as a linear combination of B:

ai = b1ci1 + b2ci2 + · · · + brcir = [b1, . . . , br] (ci1, . . . , cir)^T,

for some coefficients cij ∈ R with i = 1, . . . , n, j = 1, . . . , r. Stacking these relations column by column:

[a1, . . . , an] = [b1, . . . , br] C^T, where C^T = [c11, · · ·, cn1; ⋮ ⋱ ⋮; c1r, · · ·, cnr] ∈ R^{r×n}.


SLIDE 6

Rank and matrix factorizations

Lemma. A matrix A ∈ R^{m×n} of rank r admits a factorization of the form

A = BC^T, B ∈ R^{m×r}, C ∈ R^{n×r}.

We say that A has low rank if rank(A) ≪ m, n.

Illustration of low-rank factorization: storing the factors B and C requires mr + nr entries instead of the mn entries of A.

◮ Generically (and in most applications), A has full rank, that is, rank(A) = min{m, n}.
◮ Aim instead at approximating A by a low-rank matrix.
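As a concrete illustration (a minimal MATLAB sketch, not from the slides), one can form a rank-r matrix from its factors and compare the storage requirements:

    % Build a rank-r matrix from its factors and compare storage costs.
    m = 2000; n = 1000; r = 10;
    B = randn(m, r); C = randn(n, r);
    A = B * C';                           % full m-by-n matrix of rank r
    full_storage     = m * n;             % 2,000,000 entries
    factored_storage = m * r + n * r;     % 30,000 entries
    fprintf('compression factor: %.1f\n', full_storage / factored_storage);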

SLIDE 7

Questions addressed in lecture series

What? Theoretical foundations of low-rank approximation.
When? A priori and a posteriori estimates for low-rank approximation. Situations that allow for low-rank approximation techniques.
Why? Applications in engineering, scientific computing, data analysis, ... where low-rank approximation plays a central role.
How? State-of-the-art algorithms for performing and working with low-rank approximations.

Will cover both matrices and tensors.


SLIDE 8

Literature for Lecture 1

Golub/Van Loan’2013: Golub, Gene H.; Van Loan, Charles F. Matrix Computations. Fourth edition. Johns Hopkins University Press, Baltimore, MD, 2013.

Horn/Johnson’2013: Horn, Roger A.; Johnson, Charles R. Matrix Analysis. Second edition. Cambridge University Press, 2013.

+ References on slides.


SLIDE 9
1. Fundamental tools

◮ SVD
◮ Relation to eigenvalues
◮ Norms
◮ Best low-rank approximation


SLIDE 10

The singular value decomposition

Theorem (SVD). Let A ∈ R^{m×n} with m ≥ n. Then there are orthogonal matrices U ∈ R^{m×m} and V ∈ R^{n×n} such that

A = UΣV^T, where Σ ∈ R^{m×n} is diagonal with diagonal entries σ1 ≥ σ2 ≥ · · · ≥ σn ≥ 0.

◮ σ1, . . . , σn are called singular values.
◮ u1, . . . , un are called left singular vectors.
◮ v1, . . . , vn are called right singular vectors.
◮ Avi = σiui, A^Tui = σivi for i = 1, . . . , n.
◮ Singular values are always uniquely defined by A.
◮ Singular vectors are never unique. If σ1 > σ2 > · · · > σn > 0, then they are unique up to ui ← ±ui, vi ← ±vi (with both signs flipped together).

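The stated relations are easy to verify numerically (an illustrative MATLAB sketch, not from the slides):

    % Verify A*v_i = sigma_i*u_i and A'*u_i = sigma_i*v_i.
    A = randn(7, 4);
    [U, S, V] = svd(A);
    i = 2;
    norm(A  * V(:, i) - S(i, i) * U(:, i))   % ~ machine precision
    norm(A' * U(:, i) - S(i, i) * V(:, i))   % ~ machine precision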

SLIDE 11

SVD: Sketch of proof

Induction over n. n = 1 trivial. For general n, let v1 solve max{‖Av‖2 : ‖v‖2 = 1} =: ‖A‖2. Set σ1 := ‖A‖2 and u1 := Av1/σ1.¹ By definition, Av1 = σ1u1. After completion to orthogonal matrices U1 = [u1, U⊥] ∈ R^{m×m} and V1 = [v1, V⊥] ∈ R^{n×n}:

U1^T A V1 = [u1^T Av1, u1^T AV⊥; U⊥^T Av1, U⊥^T AV⊥] = [σ1, w^T; 0, A1],

with w := V⊥^T A^T u1 and A1 := U⊥^T A V⊥. Because ‖·‖2 is invariant under orthogonal transformations,

σ1 = ‖A‖2 = ‖U1^T A V1‖2 = ‖[σ1, w^T; 0, A1]‖2 ≥ √(σ1² + ‖w‖2²),

where the inequality follows from applying the matrix to the unit vector (σ1, w^T)^T / √(σ1² + ‖w‖2²). Hence, w = 0. Proof completed by applying induction to A1.

¹If σ1 = 0, choose an arbitrary u1.


SLIDE 12

Very basic properties of the SVD

◮ r = rank(A) is the number of nonzero singular values of A.
◮ kernel(A) = span{vr+1, . . . , vn}
◮ range(A) = span{u1, . . . , ur}

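In floating-point arithmetic, the first property is applied with a tolerance: one counts the singular values above a threshold. A sketch (the tolerance below mimics the choice made by MATLAB's rank; the example matrix is arbitrary):

    % Numerical rank: count singular values above a scaled tolerance.
    A = randn(50, 3) * randn(3, 40);   % exact rank 3
    s = svd(A);
    tol = max(size(A)) * eps(s(1));    % same flavor of tolerance as MATLAB's rank
    r = sum(s > tol)                   % returns 3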

SLIDE 13

SVD: Computation (for small dense matrices)

Computation of the SVD proceeds in two steps:

1. Reduction to bidiagonal form: By applying n Householder reflectors from the left and n − 2 Householder reflectors from the right, compute orthogonal matrices U1, V1 such that

U1^T A V1 = B = [B1; 0],

where B1 ∈ R^{n×n} is an upper bidiagonal matrix.

2. Reduction to diagonal form: Use divide-and-conquer to compute orthogonal matrices U2, V2 such that Σ = U2^T B1 V2 is diagonal.

Set U = U1U2 and V = V1V2. Step 1 is usually the most expensive. Remarks on Step 1:

◮ If m is significantly larger than n, say m ≥ 3n/2, first computing a QR decomposition of A reduces the cost (see the sketch below).
◮ Most modern implementations reduce A successively via banded form to bidiagonal form.²

²Bischof, C. H.; Lang, B.; Sun, X. A framework for symmetric band reduction. ACM Trans. Math. Software 26 (2000), no. 4, 581–601.

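The QR-first variant for tall matrices can be sketched as follows (illustrative MATLAB; optimized library routines perform this internally):

    % SVD of a tall matrix via an initial QR decomposition.
    m = 10000; n = 50;
    A = randn(m, n);
    [Q, R] = qr(A, 0);            % economy QR: Q is m-by-n, R is n-by-n
    [UR, S, V] = svd(R);          % SVD of the small n-by-n factor
    U = Q * UR;                   % left singular vectors of A
    norm(A - U * S * V', 'fro')   % ~ machine precision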

SLIDE 14

SVD: Computation (for small dense matrices)

In most applications, the vectors un+1, . . . , um are not of interest. By omitting these vectors one obtains the following variant of the SVD.

Theorem (Economy size SVD). Let A ∈ R^{m×n} with m ≥ n. Then there are a matrix U ∈ R^{m×n} with orthonormal columns and an orthogonal matrix V ∈ R^{n×n} such that

A = UΣV^T, with Σ = diag(σ1, . . . , σn) ∈ R^{n×n} and σ1 ≥ σ2 ≥ · · · ≥ σn ≥ 0.

Computed by MATLAB's [U,S,V] = svd(A,'econ').

Complexity:

                           memory        operations
    singular values only   O(mn)         O(mn²)
    economy size SVD       O(mn)         O(mn²)
    (full) SVD             O(m² + mn)    O(m²n + mn²)

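The three variants in the table correspond to the following calls (sizes annotated for a tall example; the matrix is arbitrary):

    % Full versus economy size SVD for a tall matrix.
    A = randn(1000, 50);
    s = svd(A);                   % singular values only: a 50-vector
    [U, S, V] = svd(A, 'econ');   % U: 1000x50, S: 50x50, V: 50x50
    [Uf, Sf, Vf] = svd(A);        % Uf: 1000x1000, Sf: 1000x50, Vf: 50x50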

SLIDE 15

SVD: Computation (for small dense matrices)

Beware of roundoff error when interpreting singular value plots. Example: semilogy(svd(hilb(100)))

[Figure: semilogy plot of the singular values of hilb(100); they decay rapidly and then flatten out at roundoff level.]

◮ The kink is caused by roundoff error and does not reflect the true behavior of the singular values.
◮ The exact singular values are known to decay exponentially.³
◮ Sometimes more accuracy is possible.⁴

³Beckermann, B. The condition number of real Vandermonde, Krylov and positive definite Hankel matrices. Numer. Math. 85 (2000), no. 4, 553–577.
⁴Drmač, Z.; Veselić, K. New fast and accurate Jacobi SVD algorithm. I. SIAM J. Matrix Anal. Appl. 29 (2007), no. 4, 1322–1342.


SLIDE 16

Singular/eigenvalue relations: symmetric matrices

Symmetric A = A^T ∈ R^{n×n} admits a spectral decomposition A = U diag(λ1, λ2, . . . , λn) U^T with an orthogonal matrix U. After reordering we may assume |λ1| ≥ |λ2| ≥ · · · ≥ |λn|. The spectral decomposition can be turned into an SVD A = UΣV^T by defining

Σ = diag(|λ1|, . . . , |λn|), V = U diag(sign(λ1), . . . , sign(λn)).

Remark: This extends to the more general case of normal matrices (e.g., orthogonal or skew-symmetric) via complex spectral or real Schur decompositions.

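The construction can be carried out directly (an illustrative MATLAB sketch; it assumes all eigenvalues are nonzero, since sign(0) = 0 would leave a zero column in V):

    % Turn a spectral decomposition of a symmetric matrix into an SVD.
    n = 5;
    A = randn(n); A = (A + A') / 2;           % random symmetric matrix
    [U, L] = eig(A);
    [~, idx] = sort(abs(diag(L)), 'descend'); % reorder by |lambda|
    U = U(:, idx); lam = diag(L); lam = lam(idx);
    Sigma = diag(abs(lam));
    V = U * diag(sign(lam));
    norm(A - U * Sigma * V')                  % ~ machine precision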

SLIDE 17

Singular/eigenvalue relations: general matrices

Consider an SVD A = UΣV^T of A ∈ R^{m×n} with m ≥ n. We then have:

1. Spectral decomposition of the Gramian

A^T A = VΣ^TΣV^T = V diag(σ1², . . . , σn²) V^T.

A^T A has eigenvalues σ1², . . . , σn²; the right singular vectors of A are eigenvectors of A^T A.

2. Spectral decomposition of the Gramian

AA^T = UΣΣ^TU^T = U diag(σ1², . . . , σn², 0, . . . , 0) U^T.

AA^T has eigenvalues σ1², . . . , σn² and, additionally, m − n zero eigenvalues; the first n left singular vectors of A are eigenvectors of AA^T.

3. Decomposition of the Golub-Kahan matrix

𝒜 := [0, A; A^T, 0] = [U, 0; 0, V] [0, Σ; Σ^T, 0] [U, 0; 0, V]^T.

The eigenvalues of 𝒜 are ±σj, j = 1, . . . , n, and zero (m − n times).

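A numerical spot check of relations 1 and 3 (illustrative sketch; tiny negative eigenvalues of A^T A may occur due to roundoff):

    % Eigenvalues of the Gramian and the Golub-Kahan matrix vs. singular values.
    A = randn(6, 4);
    s = svd(A);
    sqrt(sort(eig(A' * A), 'descend'))   % equals s up to roundoff
    GK = [zeros(6), A; A', zeros(4)];    % Golub-Kahan matrix
    sort(eig(GK), 'descend')             % +s, two zeros (m - n = 2), then -s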

SLIDE 18

Norms: Spectral and Frobenius norm

Given an SVD A = UΣV^T, one defines:

◮ Spectral norm: ‖A‖2 = σ1.
◮ Frobenius norm: ‖A‖F = √(σ1² + · · · + σn²).

Basic properties:

◮ ‖A‖2 = max{‖Av‖2 : ‖v‖2 = 1} (see proof of SVD).
◮ ‖·‖2 and ‖·‖F are both (submultiplicative) matrix norms.
◮ ‖·‖2 and ‖·‖F are both unitarily invariant, that is, ‖QAZ‖2 = ‖A‖2 and ‖QAZ‖F = ‖A‖F for any orthogonal matrices Q, Z.
◮ ‖A‖2 ≤ ‖A‖F ≤ √r ‖A‖2, where r = rank(A).
◮ ‖AB‖F ≤ min{‖A‖2 ‖B‖F, ‖A‖F ‖B‖2}.

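The two-sided bound ‖A‖2 ≤ ‖A‖F ≤ √r ‖A‖2 can be checked numerically (illustrative sketch):

    % Norm comparison on a random rank-3 matrix.
    A = randn(8, 3) * randn(3, 6);                 % rank r = 3
    r = rank(A);
    [norm(A), norm(A, 'fro'), sqrt(r) * norm(A)]   % a nondecreasing triple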

SLIDE 19

Euclidean geometry on matrices

Let B ∈ R^{n×n} have eigenvalues λ1, . . . , λn ∈ C. Then trace(B) := b11 + · · · + bnn = λ1 + · · · + λn. In turn,

‖A‖F² = trace(A^T A) = trace(AA^T) = ∑_{i,j} aij².

Two simple consequences:

◮ ‖·‖F is the norm induced by the matrix inner product ⟨A, B⟩ := trace(AB^T), A, B ∈ R^{m×n}.
◮ Partition A = [a1, a2, . . . , an] and define the vectorization vec(A) = [a1; a2; . . . ; an] ∈ R^{mn}. Then ⟨A, B⟩ = ⟨vec(A), vec(B)⟩ and ‖A‖F = ‖vec(A)‖2.

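Both consequences are easy to verify numerically (illustrative sketch):

    % Frobenius inner product via trace and via vectorization.
    A = randn(4, 3); B = randn(4, 3);
    ip_trace = trace(A * B');        % <A, B>
    ip_vec   = A(:)' * B(:);         % <vec(A), vec(B)>
    abs(ip_trace - ip_vec)           % ~ machine precision
    norm(A, 'fro') - norm(A(:))      % ~ 0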

SLIDE 20

Von Neumann’s trace inequality

Theorem

For m ≥ n, let A, B ∈ R^{m×n} have singular values σ1(A) ≥ · · · ≥ σn(A) and σ1(B) ≥ · · · ≥ σn(B), respectively. Then

|⟨A, B⟩| ≤ σ1(A)σ1(B) + · · · + σn(A)σn(B).

Proof of von Neumann's trace inequality in the lecture notes.⁵

Consequence:

‖A − B‖F² = ⟨A − B, A − B⟩ = ‖A‖F² − 2⟨A, B⟩ + ‖B‖F²
          ≥ ‖A‖F² − 2 ∑_{i=1}^{n} σi(A)σi(B) + ‖B‖F²
          = ∑_{i=1}^{n} (σi(A) − σi(B))².

⁵This proof follows [Grigorieff, R. D. Note on von Neumann’s trace inequality. Math. Nachr. 151 (1991), 327–328]. For Mirsky’s ingenious proof based on doubly stochastic matrices, see Theorem 8.7.6 in [Horn/Johnson’2013].

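A numerical spot check of the inequality (illustrative only, of course not a proof):

    % Von Neumann's trace inequality on random matrices.
    A = randn(6, 4); B = randn(6, 4);
    lhs = abs(trace(A * B'));
    rhs = sum(svd(A) .* svd(B));
    lhs <= rhs                       % returns true (logical 1)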

SLIDE 21

Schatten norms

There are other unitarily invariant matrix norms.⁶ Let s(A) = (σ1, . . . , σn). The p-Schatten norm, defined by ‖A‖(p) := ‖s(A)‖p, is a matrix norm for any 1 ≤ p ≤ ∞. p = ∞: spectral norm; p = 2: Frobenius norm; p = 1: nuclear norm.

Definition

The dual of a matrix norm ‖·‖ on R^{m×n} is defined by ‖A‖^D := max{⟨A, B⟩ : ‖B‖ = 1}.

Lemma

Let p, q ∈ [1, ∞] such that p⁻¹ + q⁻¹ = 1. Then ‖·‖(p)^D = ‖·‖(q).

⁶Complete characterization via symmetric gauge functions in [Horn/Johnson’2013].

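Schatten norms are computed directly from the vector of singular values (illustrative sketch):

    % Schatten norms via the vector p-norm of the singular values.
    A = randn(6, 4);
    s = svd(A);
    nuclear  = norm(s, 1);     % p = 1: nuclear norm
    frob     = norm(s, 2);     % p = 2: equals norm(A, 'fro')
    spectral = norm(s, Inf);   % p = Inf: equals norm(A)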

SLIDE 22

Best low-rank approximation

Consider k < n and let

Uk := [u1 · · · uk], Σk := diag(σ1, . . . , σk), Vk := [v1 · · · vk].

Then Tk(A) := UkΣkVk^T has rank at most k. For any unitarily invariant norm ‖·‖:

‖Tk(A) − A‖ = ‖diag(0, . . . , 0, σk+1, . . . , σn)‖.

In particular, for the spectral norm and the Frobenius norm:

‖A − Tk(A)‖2 = σk+1, ‖A − Tk(A)‖F = √(σk+1² + · · · + σn²).

The two are nearly equal if and only if the singular values decay sufficiently quickly.

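Computing Tk(A) and verifying both error formulas (illustrative MATLAB sketch):

    % Best rank-k approximation via the truncated SVD.
    A = randn(8, 6); k = 2;
    [U, S, V] = svd(A);
    Tk = U(:, 1:k) * S(1:k, 1:k) * V(:, 1:k)';
    s = diag(S);
    norm(A - Tk) - s(k+1)                     % ~ 0 (spectral norm error)
    norm(A - Tk, 'fro') - norm(s(k+1:end))    % ~ 0 (Frobenius norm error)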

SLIDE 23

Best low-rank approximation

Theorem (Schmidt-Mirsky). Let A ∈ R^{m×n}. Then

‖A − Tk(A)‖ = min{‖A − B‖ : B ∈ R^{m×n} has rank at most k}

holds for any unitarily invariant norm ‖·‖.

Proof⁷ for ‖·‖F: Follows directly from the consequence of von Neumann's trace inequality. Proof for ‖·‖2: For any B ∈ R^{m×n} of rank ≤ k, kernel(B) has dimension ≥ n − k. Hence, there exists w ∈ kernel(B) ∩ range(Vk+1) with ‖w‖2 = 1. Using Bw = 0 and w ∈ range(Vk+1),

‖A − B‖2² ≥ ‖(A − B)w‖2² = ‖Aw‖2² = ‖AVk+1Vk+1^T w‖2² = ‖Uk+1Σk+1Vk+1^T w‖2²
= ∑_{j=1}^{k+1} σj² |vj^T w|² ≥ σk+1² ∑_{j=1}^{k+1} |vj^T w|² = σk+1²,

and hence ‖A − B‖2 ≥ σk+1 = ‖A − Tk(A)‖2.

⁷See Section 7.4.9 in [Horn/Johnson’2013] for the general case.


SLIDE 24

Best low-rank approximation

Uniqueness:

◮ If σk > σk+1, the best rank-k approximation with respect to the Frobenius norm is unique.
◮ If σk = σk+1, the best rank-k approximation is never unique. For example, I3 has several best rank-two approximations:

diag(1, 1, 0), diag(1, 0, 1), diag(0, 1, 1).

◮ With respect to the spectral norm, the best rank-k approximation is only unique if σk+1 = 0. For example, diag(2, 1, ε) with 0 < ε < 1 has infinitely many best rank-two approximations:

diag(2, 1, 0), diag(2 − ε/2, 1 − ε/2, 0), diag(2 − ε/3, 1 − ε/3, 0), . . . .


SLIDE 25

Approximating the range of a matrix

Aim at finding a matrix Q ∈ R^{m×k} with orthonormal columns such that range(Q) ≈ range(A). I − QQ^T is the orthogonal projector onto range(Q)⊥, so we aim at minimizing

‖(I − QQ^T)A‖ = ‖A − QQ^T A‖

for a unitarily invariant norm ‖·‖. Because rank(QQ^T A) ≤ k,

‖A − QQ^T A‖ ≥ ‖A − Tk(A)‖.

Setting Q = Uk one obtains

UkUk^T A = UkUk^T UΣV^T = UkΣkVk^T = Tk(A).

⇒ Q = Uk is optimal.

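Verifying numerically that Q = Uk attains the lower bound (illustrative sketch):

    % Optimal range approximation: Q = U_k minimizes ||A - Q*Q'*A||.
    A = randn(100, 40); k = 5;
    [U, S, ~] = svd(A, 'econ');
    Q = U(:, 1:k);
    s = diag(S);
    norm(A - Q * (Q' * A)) - s(k+1)   % ~ 0: projection error equals sigma_{k+1}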

SLIDE 26

Approximating the range of a matrix

Variation: max{‖Q^T A‖F : Q^T Q = Ik}. This is equivalent to max{⟨AA^T, QQ^T⟩ : Q^T Q = Ik} because ‖Q^T A‖F² = trace(Q^T AA^T Q) = ⟨AA^T, QQ^T⟩. By von Neumann's trace inequality and the correspondence between eigenvectors of AA^T and left singular vectors of A, an optimal Q is given by Uk.
