

SLIDE 1

Low Rank Approximation Lecture 3

Daniel Kressner Chair for Numerical Algorithms and HPC Institute of Mathematics, EPFL daniel.kressner@epfl.ch


SLIDE 2

Basic randomized algorithm for low-rank approx

Must read: [Halko/Martinsson/Tropp’2010], Finding Structure with Randomness...

Randomized Algorithm:

  • 1. Draw Gaussian random matrix Ω ∈ Rn×r.
  • 2. Perform block mat-vec Y = AΩ.
  • 3. Compute (economic) QR decomposition Y = QR.
  • 4. Form B = QTA.
  • 5. Return Â = QB (in factorized form).

Exact recovery: If A has rank r, we recover Â = A with probability 1.
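A minimal NumPy sketch of these five steps (the function name and the rank-3 test below are my own, not from the slides):

```python
import numpy as np

def randomized_lowrank(A, r, rng=None):
    """Basic randomized low-rank approximation: returns Q, B with A ~ Q @ B."""
    rng = np.random.default_rng() if rng is None else rng
    Omega = rng.standard_normal((A.shape[1], r))  # 1. Gaussian random matrix
    Y = A @ Omega                                 # 2. block mat-vec
    Q, _ = np.linalg.qr(Y)                        # 3. economic QR decomposition
    B = Q.T @ A                                   # 4. small r x n factor
    return Q, B                                   # 5. A_hat = Q @ B (factorized)

# Exact recovery: if rank(A) = r, then A = Q @ B with probability 1.
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 3)) @ rng.standard_normal((3, 80))  # rank 3
Q, B = randomized_lowrank(A, 3, rng)
print(np.linalg.norm(A - Q @ B))  # ~ machine precision
```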


SLIDE 3

Three test matrices

(a) The 100 × 100 Hilbert matrix A defined by A(i, j) = 1/(i + j − 1).
(b) The matrix A defined by A(i, j) = exp(−γ|i − j|/n) with γ = 0.1.
(c) The 30 × 30 diagonal matrix with diagonal entries 1, 0.99, 0.98, 1/10, 0.99/10, 0.98/10, 1/100, 0.99/100, 0.98/100, . . .
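For reference, the three matrices can be generated as follows (a sketch; variable names are mine):

```python
import numpy as np

n = 100
i, j = np.indices((n, n)) + 1                 # 1-based row/column indices
A_hilbert = 1.0 / (i + j - 1)                 # (a) Hilbert matrix

gamma = 0.1
A_exp = np.exp(-gamma * np.abs(i - j) / n)    # (b) A(i,j) = exp(-gamma|i-j|/n)

# (c) 30 x 30 diagonal matrix with "staircase" singular values
d = np.array([c / 10**k for k in range(10) for c in (1.0, 0.99, 0.98)])
A_stair = np.diag(d)
```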

[Figure: singular values of the three test matrices, on a semilogarithmic scale decaying from 10^0 down to below 10^-20.]


SLIDE 4

Randomized algorithm applied to test matrices

Errors measured in the spectral norm:

                                       Exact    mean     std
(a) Hilbert matrix, r = 5:            0.0019   0.0092   0.0099
(b) Matrix with slower decay, r = 25: 0.0034   0.012    0.002
(c) Matrix with staircase sv, r = 7:  0.010    0.038    0.025


SLIDE 5

Randomized algorithm applied to test matrices

Errors measured in the Frobenius norm:

                                       Exact    mean     std
(a) Hilbert matrix, r = 5:            0.0019   0.0093   0.0099
(b) Matrix with slower decay, r = 25: 0.011    0.024    0.001
(c) Matrix with staircase sv, r = 7:  0.014    0.041    0.024


SLIDE 6

Basic randomized algorithms for low-rank approx

Add oversampling by a (usually small) integer p. Randomized Algorithm:

  • 1. Draw standard Gaussian random matrix Ω ∈ Rn×(r+p).
  • 2. Perform block mat-vec Y = AΩ.
  • 3. Compute (economic) QR decomposition Y = QR.
  • 4. Form B = QTA.
  • 5. Return Â = QB (in factorized form).

Problem: Â has rank r + p > r. Solution: compress B to its best rank-r approximation $T_r(B)$; then $QT_r(B)$ has rank r. Error:

$\|QT_r(B) - A\| = \|QT_r(B) - QB + QB - A\| \le \|T_r(B) - B\| + \|(I - QQ^T)A\|$
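The compression step B ≈ T_r(B) can be realized by an SVD of the small (r + p) × n matrix B, e.g. (a sketch; function name and default p are mine):

```python
import numpy as np

def randomized_lowrank_truncated(A, r, p=10, rng=None):
    """Oversampled randomized approximation, truncated back to rank r."""
    rng = np.random.default_rng() if rng is None else rng
    Omega = rng.standard_normal((A.shape[1], r + p))
    Q, _ = np.linalg.qr(A @ Omega)                 # ONB of an (r+p)-dim. range
    B = Q.T @ A                                    # (r+p) x n
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    # T_r(B) = U[:, :r] @ diag(s[:r]) @ Vt[:r, :];  A_hat = Q @ T_r(B)
    return Q @ U[:, :r], s[:r], Vt[:r, :]
```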


SLIDE 7

Basic randomized algorithms for low-rank approx

Add oversampling by a (usually small) integer p. Randomized Algorithm:

  • 1. Draw standard Gaussian random matrix Ω ∈ Rn×(r+p).
  • 2. Perform block mat-vec Y = AΩ.
  • 3. Compute (economic) QR decomposition Y = QR.
  • 4. Form B = QTA.
  • 5. Return Â = $QT_r(B)$ (in factorized form).

Gold standard, the best rank-r approximation error:

◮ spectral norm: $\sigma_{r+1}$
◮ Frobenius norm: $\sqrt{\sigma_{r+1}^2 + \cdots + \sigma_n^2}$
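Both quantities follow directly from the singular values; a minimal sketch (function name mine):

```python
import numpy as np

def best_rank_r_errors(A, r):
    """Best rank-r approximation errors: (spectral norm, Frobenius norm)."""
    s = np.linalg.svd(A, compute_uv=False)     # singular values, descending
    return s[r], np.sqrt(np.sum(s[r:] ** 2))   # sigma_{r+1}, sqrt(sum_{j>r} sigma_j^2)
```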


SLIDE 8

Randomized algorithm applied to test matrices

Errors measured in the spectral norm:

(a) Hilbert matrix, r = 5:
          Exact    mean     std
  p = 0   0.0019   0.0092   0.0099
  p = 1   0.0019   0.0026   0.0019
  p = 2   0.0019   0.0019   0.0001

(b) Matrix with slower decay, r = 25:
          Exact    mean     std
  p = 0   0.0034   0.012    0.002
  p = 1   0.0034   0.011    0.0017
  p = 2   0.0034   0.010    0.0015
  p = 10  0.0034   0.0064   0.0008
  p = 25  0.0034   0.0037   0.0002

(c) Matrix with staircase sv, r = 7:
          Exact    mean     std
  p = 0   0.010    0.038    0.025
  p = 1   0.010    0.021    0.012
  p = 2   0.010    0.012    0.005
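Experiments of this kind can be reproduced along the following lines (a sketch; mean and std are over independent draws of Ω, so the exact numbers will vary):

```python
import numpy as np

def error_stats(A, r, p, trials=100, seed=0):
    """Mean/std of the spectral-norm error of the truncated randomized algorithm."""
    rng = np.random.default_rng(seed)
    errs = np.empty(trials)
    for t in range(trials):
        Omega = rng.standard_normal((A.shape[1], r + p))
        Q, _ = np.linalg.qr(A @ Omega)
        U, s, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)
        A_hat = (Q @ U[:, :r]) * s[:r] @ Vt[:r, :]   # rank-r approximation
        errs[t] = np.linalg.norm(A - A_hat, 2)
    return errs.mean(), errs.std()
```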


SLIDE 9

Analysis: basic setting

Goal: Say something sensible about $\|(I - QQ^T)A\|$: expected value, tail bounds, ...

The multivariate normal distribution X ∼ N(µ, Σ) with mean µ ∈ Rn and (positive definite) covariance matrix Σ ∈ Rn×n has density

$f_X(x) = \frac{1}{\sqrt{(2\pi)^n \det(\Sigma)}} \exp\left(-\tfrac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right).$

X ∼ N(0, I_m) is called a standard Gaussian random vector.

Lemma

Let V ∈ Rm×n be orthonormal. If X ∼ N(0, I_m) then $V^TX \sim N(0, I_n)$.

This invariance of Gaussian random vectors allows us to assume w.l.o.g. in the analysis that A = Σ is diagonal (and square).
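A quick numerical sanity check of the lemma (a sketch): the sample covariance of V^T X should be close to I_n.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, N = 8, 3, 200_000
V, _ = np.linalg.qr(rng.standard_normal((m, n)))  # orthonormal V in R^{m x n}
X = rng.standard_normal((m, N))                   # N samples of X ~ N(0, I_m)
print(np.cov(V.T @ X))                            # ~ I_n, i.e. V^T X ~ N(0, I_n)
```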


SLIDE 10

Analysis: r = 1, p = 0

Partition $\Omega = \begin{bmatrix} \omega_1 \\ \omega_2 \end{bmatrix}$, $A = \begin{bmatrix} \sigma_1 & \\ & \Sigma_2 \end{bmatrix}$. Then

$Y = A\Omega = \begin{bmatrix} \sigma_1\omega_1 \\ \Sigma_2\omega_2 \end{bmatrix} = \sigma_1\omega_1 \begin{bmatrix} 1 \\ \frac{1}{\sigma_1\omega_1}\Sigma_2\omega_2 \end{bmatrix}.$

Problem: we need to control the norm of $\frac{1}{\sigma_1\omega_1}\Sigma_2\omega_2$, but $\omega_1^{-1}$ (the reciprocal of a standard normal random variable) has Cauchy-like heavy tails; its mean and variance are undefined. Need to consider p ≥ 2.


SLIDE 11

Analysis: r = 1, p ≥ 2

Lemma

Let $\Pi_{\tilde{\mathcal Y}}, \Pi_{\mathcal Y}$ be orthogonal projectors onto subspaces $\tilde{\mathcal Y} \subset \mathcal Y$. Then, for any matrix A,

$\|\Pi_{\tilde{\mathcal Y}}A\|_F \le \|\Pi_{\mathcal Y}A\|_F, \qquad \|(I - \Pi_{\tilde{\mathcal Y}})A\|_F \ge \|(I - \Pi_{\mathcal Y})A\|_F.$

Proof. EFY.

Set

$Y = A\Omega = \begin{bmatrix} \sigma_1\Omega_1 \\ \Sigma_2\Omega_2 \end{bmatrix}, \qquad \tilde Y = \begin{bmatrix} 1 \\ \sigma_1^{-1}\Sigma_2\Omega_2\Omega_1^\dagger \end{bmatrix},$

where $\Omega_1^\dagger$ is the pseudoinverse of $\Omega_1$. Because $\Omega_1$ is a row vector and nonzero (with probability one), we obtain $\Omega_1^\dagger = \Omega_1^T/\|\Omega_1\|_2^2$. Thus, $\operatorname{range}(\tilde Y) \subset \operatorname{range}(Y)$. Hence, with $f = \sigma_1^{-1}\Sigma_2\Omega_2\Omega_1^\dagger$, we get

$\|(I - QQ^T)A\|_F \le \|(I - \tilde Q\tilde Q^T)A\|_F, \qquad \tilde Q = \frac{1}{\sqrt{1 + \|f\|_2^2}}\begin{bmatrix} 1 \\ f \end{bmatrix}.$

SLIDE 12

Analysis: r = 1, p ≥ 2

$\|(I - \tilde Q\tilde Q^T)A\|_F^2 = \|A\|_F^2 - \|\tilde Q^TA\|_F^2,$

and

$\|\tilde Q^TA\|_F^2 = \frac{1}{1 + \|f\|_2^2}\left(\sigma_1^2 + \|\Sigma_2 f\|_2^2\right) \ge \frac{\sigma_1^2}{1 + \|f\|_2^2} \ge \sigma_1^2\left(1 - \|f\|_2^2\right) = \sigma_1^2 - \|\Sigma_2\Omega_2\Omega_1^\dagger\|_2^2.$

In summary, we have:

Lemma

$\|(I - \tilde Q\tilde Q^T)A\|_F^2 \le \|\Sigma_2\|_F^2 + \|\Sigma_2\Omega_2\Omega_1^\dagger\|_F^2.$

To analyze the second term, we use

$\mathbb{E}\|\Sigma_2\Omega_2\Omega_1^\dagger\|_F^2 = \mathbb{E}\left[\mathbb{E}\left[\|\Sigma_2\Omega_2\Omega_1^\dagger\|_F^2 \mid \Omega_1\right]\right] = \|\Sigma_2\|_F^2 \cdot \mathbb{E}\|\Omega_1^\dagger\|_F^2.$

(See the exercises for a proof that $\mathbb{E}\|A\Omega B\|_F^2 = \|A\|_F^2\|B\|_F^2$ for a Gaussian matrix Ω and constant matrices A, B.)


SLIDE 13

Analysis: r = 1, p ≥ 2

It remains to control $\|\Omega_1\|_F^2$ and $\|\Omega_1^\dagger\|_F^2 = 1/\|\Omega_1\|_F^2$.

$\|\Omega_1\|_F^2$ is a sum of p + 1 squared independent standard normal random variables. The pdf of X ∼ N(0, 1) is $f_X(x) = \frac{1}{\sqrt{2\pi}}e^{-x^2/2}$. The pdf of Y = X² is zero for nonpositive values. For y > 0, we obtain

$\Pr(0 \le Y \le y) = \Pr(-\sqrt{y} \le X \le \sqrt{y}) = \frac{2}{\sqrt{2\pi}}\int_0^{\sqrt{y}} e^{-x^2/2}\,dx = \frac{1}{\sqrt{2\pi}}\int_0^{y} t^{-1/2}e^{-t/2}\,dt,$

so Y has a chi-squared distribution with 1 degree of freedom: Y ∼ χ²₁.

$\|\Omega_1\|_F^2 \sim \chi^2_{p+1}$, the chi-squared distribution with p + 1 degrees of freedom, with pdf

$f(x) = \frac{2^{-(p+1)/2}}{\Gamma((p+1)/2)}\,x^{(p+1)/2-1}\exp(-x/2), \qquad x > 0.$


SLIDE 14

Analysis: r = 1, p ≥ 2

$\|\Omega_1^\dagger\|_F^2 = \frac{1}{\|\Omega_1\|_F^2} = \left(\sum_{i=1}^{p+1}\Omega_{1,i}^2\right)^{-1} \sim \text{Inv-}\chi^2(p+1),$

the inverse-chi-squared distribution with p + 1 degrees of freedom, with pdf

$\frac{2^{-(p+1)/2}}{\Gamma((p+1)/2)}\,x^{-(p+1)/2-1}\exp(-1/(2x)).$

[Figure: pdf of the inverse-chi-squared distribution for p = 1, p = 3, p = 9.]


SLIDE 15

Analysis: r = 1, p ≥ 2

Textbook results:

◮ $\mathbb{E}\|\Omega_1\|_F^2 = p + 1$, $\mathbb{E}\|\Omega_1^\dagger\|_F^2 = (p-1)^{-1}$

Tail bound by [Laurent/Massart’2000]:

◮ $\Pr\left(\|\Omega_1\|_F^2 \le p + 1 - t\right) \le \exp\left(-\frac{t^2}{4(p+1)}\right)$

Theorem

For r = 1, p ≥ 2, we have

$\mathbb{E}\|(I - QQ^T)A\|_F \le \sqrt{1 + \frac{1}{p-1}}\,\|\Sigma_2\|_F.$

The probability of deviating from this upper bound decays exponentially, as indicated by the tail bound for $\chi^2_{p+1}$.
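Both expectations are easy to confirm by simulation (a sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
p, N = 5, 1_000_000
sq = (rng.standard_normal((N, p + 1)) ** 2).sum(axis=1)  # samples of ||Omega_1||_F^2
print(sq.mean())          # ~ p + 1 = 6
print((1.0 / sq).mean())  # ~ 1/(p - 1) = 0.25
```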


SLIDE 16

Analysis: general r, p ≥ 2

Lemma

$\|(I - QQ^T)A\|_F^2 \le \|\Sigma_2\|_F^2 + \|\Sigma_2\Omega_2\Omega_1^\dagger\|_F^2.$

Proof. Set

$Y = A\Omega = \begin{bmatrix} \Sigma_1\Omega_1 \\ \Sigma_2\Omega_2 \end{bmatrix}, \qquad \tilde Y = \begin{bmatrix} I_r \\ F \end{bmatrix}, \qquad F = \Sigma_2\Omega_2\Omega_1^\dagger\Sigma_1^{-1},$

with ONB $\tilde Q = \begin{bmatrix} I_r \\ F \end{bmatrix}(I_r + F^TF)^{-1/2}$. Then

$\|(I - QQ^T)A\|_F^2 \le \|(I - \tilde Q\tilde Q^T)A\|_F^2 = \|A\|_F^2 - \|\tilde Q^TA\|_F^2$

and

$\|\tilde Q^TA\|_F^2 = \|(I_r + F^TF)^{-1/2}\Sigma_1\|_F^2 + \|(I_r + F^TF)^{-1/2}F^T\Sigma_2\|_F^2 \ge \|(I_r + F^TF)^{-1/2}\Sigma_1\|_F^2 = \operatorname{trace}(\Sigma_1(I_r + F^TF)^{-1}\Sigma_1) \ge \operatorname{trace}(\Sigma_1(I_r - F^TF)\Sigma_1) = \|\Sigma_1\|_F^2 - \|\Sigma_2\Omega_2\Omega_1^\dagger\|_F^2.$

(Proof of $\operatorname{trace}(\Sigma_1(I_r + F^TF)^{-1}\Sigma_1) \ge \operatorname{trace}(\Sigma_1(I_r - F^TF)\Sigma_1)$ on the next slide.)


SLIDE 17

Let $G = F^TF$. Then we have

$(I + G)\left[(I + G)^{-1} - (I - G)\right] = G^2$

and hence $(I + G)^{-1} - (I - G) = (I + G)^{-1}G^2 = G(I + G)^{-1}G$ is symmetric positive semidefinite. This implies that

$\Sigma_1\left[(I + G)^{-1} - (I - G)\right]\Sigma_1$

is symmetric positive semidefinite as well and hence

$0 \le \operatorname{trace}\left(\Sigma_1\left[(I + G)^{-1} - (I - G)\right]\Sigma_1\right),$

equivalently, $\operatorname{trace}(\Sigma_1(I_r + F^TF)^{-1}\Sigma_1) \ge \operatorname{trace}(\Sigma_1(I_r - F^TF)\Sigma_1)$.


SLIDE 18

Analysis: general r, p ≥ 2

Again use $\mathbb{E}\|\Sigma_2\Omega_2\Omega_1^\dagger\|_F^2 = \|\Sigma_2\|_F^2 \cdot \mathbb{E}\|\Omega_1^\dagger\|_F^2$.

By standard results in multivariate statistics, we have

$\mathbb{E}\|\Omega_1^\dagger\|_F^2 = \frac{r}{p-1}.$

Sketch of argument:

◮ $\Omega_1\Omega_1^T \sim W_r(r+p)$ (Wishart distribution with r + p degrees of freedom)
◮ $(\Omega_1\Omega_1^T)^{-1} \sim W_r^{-1}(r+p)$ (inverse Wishart distribution with r + p degrees of freedom)
◮ $\mathbb{E}(\Omega_1\Omega_1^T)^{-1} = \frac{1}{p-1}I_r$; see page 96 in [Muirhead’1982]¹
◮ The result follows from $\|\Omega_1^\dagger\|_F^2 = \|\Omega_1^T(\Omega_1\Omega_1^T)^{-1}\|_F^2 = \operatorname{trace}\left((\Omega_1\Omega_1^T)^{-1}\right)$

¹R. J. Muirhead, Aspects of Multivariate Statistical Theory, Wiley, New York, NY, 1982.


SLIDE 19

Analysis: general r, p ≥ 2

Theorem

For p ≥ 2, we have

$\mathbb{E}\|(I - QQ^T)A\|_F \le \sqrt{1 + \frac{r}{p-1}}\,\|\Sigma_2\|_F,$

$\mathbb{E}\|(I - QQ^T)A\|_2 \le \left(1 + \sqrt{\frac{r}{p-1}}\right)\|\Sigma_2\|_2 + \frac{e\sqrt{r+p}}{p}\,\|\Sigma_2\|_F.$

For the proof of the spectral norm bound and tail bounds, see [Halko/Martinsson/Tropp’2010].


SLIDE 20

Randomized subspace iteration

  • 1. Draw standard Gaussian random matrix Ω ∈ Rn×(r+p).
  • 2. Perform block mat-vec Y0 = AΩ.
  • 3. Perform q steps of subspace iteration: Y = (AAT)qY0.
  • 4. Compute (economic) QR decomposition Y = QR.
  • 5. Form B = QTA.
  • 6. Return Â = $QT_r(B)$ (in factorized form).

The algorithm is essentially equivalent to the basic randomized algorithm applied to $(AA^T)^qA$. Using Hölder’s inequality and a result from the exercises, we have

$\mathbb{E}\|(I - QQ^T)A\|_2 \le \left(\mathbb{E}\|(I - QQ^T)A\|_2^{2q+1}\right)^{1/(2q+1)} \le \left(\mathbb{E}\|(I - QQ^T)(AA^T)^qA\|_2\right)^{1/(2q+1)}.$

The singular values of $(AA^T)^qA$ are $\sigma_1^{2q+1}, \ldots, \sigma_n^{2q+1}$.
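A sketch of the scheme in NumPy. For numerical stability, the product $(AA^T)^q$ is applied with re-orthonormalization after each factor, which is equivalent in exact arithmetic (for generic full-rank iterates); defaults for p and q are mine:

```python
import numpy as np

def randomized_subspace_iteration(A, r, p=10, q=2, rng=None):
    """Randomized range finder with q steps of subspace iteration."""
    rng = np.random.default_rng() if rng is None else rng
    Omega = rng.standard_normal((A.shape[1], r + p))
    Q, _ = np.linalg.qr(A @ Omega)          # Y0 = A @ Omega
    for _ in range(q):                      # Y = (A A^T)^q Y0, orthonormalized
        W, _ = np.linalg.qr(A.T @ Q)
        Q, _ = np.linalg.qr(A @ W)
    U, s, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)
    return Q @ U[:, :r], s[:r], Vt[:r, :]   # truncated rank-r factorization
```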


SLIDE 21

Randomized subspace iteration

Combining this with the previous result for the randomized algorithm gives:

Theorem

For p ≥ 2, $\mathbb{E}\|(I - QQ^T)A\|_2$ is bounded by

$\left[\left(1 + \sqrt{\tfrac{r}{p-1}}\right)\sigma_{r+1}^{2q+1} + \frac{e\sqrt{r+p}}{p}\left(\sigma_{r+1}^{2(2q+1)} + \cdots + \sigma_n^{2(2q+1)}\right)^{1/2}\right]^{1/(2q+1)} \le \sigma_{r+1}\left[1 + \sqrt{\tfrac{r}{p-1}} + \frac{e\sqrt{(r+p)(n-r)}}{p}\right]^{1/(2q+1)}.$


SLIDE 22

A posteriori error estimate and adaptive choice of r

Lemma

Let $\omega^{(i)}$, $i = 1, \ldots, s$, be n-dimensional standard Gaussian random vectors. Then for any m × n matrix C the inequality

$\|C\|_2 \le 10\sqrt{2/\pi}\,\max_{i=1,\ldots,s}\|C\omega^{(i)}\|_2$

holds with probability at least $1 - 10^{-s}$.

Proof. A misleading proof idea can be found in [Halko/Martinsson/Tropp’2010]. See the exercises for the full proof.

Given the ONB Q returned by the randomized algorithm, use this result for $C = (I - QQ^T)A$. This can be combined into an adaptive algorithm for choosing r. Assuming that $\|A\|_F$ is known, the Frobenius norm error can be estimated from

$\|A\|_F^2 = \|QQ^TA\|_F^2 + \|(I - QQ^T)A\|_F^2.$
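A sketch of the resulting a posteriori estimator; the constant 10√(2/π) and the failure probability 10⁻ˢ come from the lemma above (function name and default s are mine):

```python
import numpy as np

def estimate_residual(A, Q, s=10, rng=None):
    """Probabilistic upper bound on ||(I - Q Q^T) A||_2, valid w.p. >= 1 - 10^-s."""
    rng = np.random.default_rng() if rng is None else rng
    omega = rng.standard_normal((A.shape[1], s))  # s Gaussian test vectors
    Y = A @ omega
    R = Y - Q @ (Q.T @ Y)                         # columns are (I - QQ^T) A omega^(i)
    return 10 * np.sqrt(2 / np.pi) * np.linalg.norm(R, axis=0).max()
```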
