Low Rank Approximation Lecture 3
Daniel Kressner Chair for Numerical Algorithms and HPC Institute of Mathematics, EPFL daniel.kressner@epfl.ch
Basic randomized algorithm for low-rank approximation
Must read: [Halko/Martinsson/Tropp’2010], Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions.

Randomized algorithm: draw an n × r Gaussian random matrix Ω, compute Y = AΩ, compute an ONB Q of range(Y), and set B = Q^T A. Then

    Â = QB (in factorized form).

Exact recovery: If A has rank r, we recover Â = A with probability 1.
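As a concrete illustration, the scheme above can be sketched in a few lines of NumPy (a minimal sketch; function and variable names are mine, not from the lecture):

```python
import numpy as np

def randomized_lowrank(A, r, rng=None):
    """Basic randomized low-rank approximation: A ~= Q @ B, where Q has r
    orthonormal columns (no oversampling yet)."""
    rng = np.random.default_rng(rng)
    n = A.shape[1]
    Omega = rng.standard_normal((n, r))   # Gaussian sketch matrix
    Y = A @ Omega                         # sample the range of A
    Q, _ = np.linalg.qr(Y)                # ONB of range(Y)
    B = Q.T @ A
    return Q, B

# Exact recovery: if rank(A) = r, then A = Q @ B with probability 1.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 3)) @ rng.standard_normal((3, 40))  # rank 3
Q, B = randomized_lowrank(A, 3, rng=1)
print(np.linalg.norm(A - Q @ B) / np.linalg.norm(A))  # ~ machine precision
```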
Test matrices:

(a) The 100 × 100 Hilbert matrix A defined by A(i, j) = 1/(i + j − 1).
(b) The matrix A defined by A(i, j) = exp(−γ|i − j|/n) with γ = 0.1.
(c) The 30 × 30 diagonal matrix with diagonal entries 1, 0.99, 0.98, 1/10, 0.99/10, 0.98/10, 1/100, 0.99/100, 0.98/100, . . .

[Figure: Singular values of test matrices, ranging from 10^0 down to 10^−20.]
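The three test matrices are easy to reproduce; the helper functions below are my own reconstruction of the setup:

```python
import numpy as np

def hilbert(n=100):
    # (a) Hilbert matrix A(i, j) = 1/(i + j - 1), with indices starting at 1
    i, j = np.ogrid[1:n + 1, 1:n + 1]
    return 1.0 / (i + j - 1)

def exp_decay(n=100, gamma=0.1):
    # (b) A(i, j) = exp(-gamma * |i - j| / n)
    i, j = np.ogrid[0:n, 0:n]
    return np.exp(-gamma * np.abs(i - j) / n)

def staircase(n=30):
    # (c) diagonal entries 1, 0.99, 0.98, 1/10, 0.99/10, 0.98/10, 1/100, ...
    d = [np.array([1.0, 0.99, 0.98]) / 10**k for k in range(n // 3)]
    return np.diag(np.concatenate(d))

sv = np.linalg.svd(hilbert(), compute_uv=False)
print(sv[:6])  # rapid singular value decay of the Hilbert matrix
```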
Errors measured in spectral norm:

                                       Exact    mean     std
(a) Hilbert matrix, r = 5              0.0019   0.0092   0.0099
(b) Matrix with slower decay, r = 25   0.0034   0.012    0.002
(c) Matrix with staircase sv, r = 7    0.010    0.038    0.025

Errors measured in Frobenius norm:

                                       Exact    mean     std
(a) Hilbert matrix, r = 5              0.0019   0.0093   0.0099
(b) Matrix with slower decay, r = 25   0.011    0.024    0.001
(c) Matrix with staircase sv, r = 7    0.014    0.041    0.024
Add oversampling: choose a (usually small) integer p and draw Ω with r + p columns.

Randomized algorithm: draw an n × (r + p) Gaussian random matrix Ω, compute Y = AΩ, compute an ONB Q of range(Y), and set B = Q^T A. Then

    Â = QB (in factorized form).

Problem: Â has rank r + p > r.
Solution: Compress B ≈ T_r(B), the best rank-r approximation of B; then Q T_r(B) has rank r.
Error:

    ‖Q T_r(B) − A‖ = ‖Q T_r(B) − QB + QB − A‖ ≤ ‖T_r(B) − B‖ + ‖(I − QQ^T)A‖.
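A minimal sketch of the oversampled algorithm with truncation, assuming the compression of B is done via an SVD (names are mine):

```python
import numpy as np

def randomized_lowrank_trunc(A, r, p=5, rng=None):
    """Randomized approximation with oversampling p, followed by truncation
    of B back to its best rank-r approximation T_r(B)."""
    rng = np.random.default_rng(rng)
    Omega = rng.standard_normal((A.shape[1], r + p))
    Q, _ = np.linalg.qr(A @ Omega)            # ONB with r + p columns
    B = Q.T @ A
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    U, s, Vt = U[:, :r], s[:r], Vt[:r, :]     # compress B to rank r
    return Q @ U, s, Vt                       # A ~= (Q U) diag(s) Vt, rank r

rng = np.random.default_rng(0)
A = rng.standard_normal((60, 6)) @ rng.standard_normal((6, 50))  # rank 6
Qr, s, Vt = randomized_lowrank_trunc(A, r=6, p=4, rng=1)
err = np.linalg.norm(A - (Qr * s) @ Vt) / np.linalg.norm(A)
print(err)  # ~ machine precision, since rank(A) = 6
```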
Randomized algorithm with oversampling and truncation:

    Â = Q T_r(B) (in factorized form).

Gold standard, the best rank-r approximation error:
◮ spectral norm: σ_{r+1}
◮ Frobenius norm: (σ²_{r+1} + · · · + σ²_n)^{1/2}
Errors measured in spectral norm, with oversampling:

(a) Hilbert matrix, r = 5:
         Exact    mean     std
p = 0    0.0019   0.0092   0.0099
p = 1    0.0019   0.0026   0.0019
p = 2    0.0019   0.0019   0.0001

(b) Matrix with slower decay, r = 25:
         Exact    mean     std
p = 0    0.0034   0.012    0.002
p = 1    0.0034   0.011    0.0017
p = 2    0.0034   0.010    0.0015
p = 10   0.0034   0.0064   0.0008
p = 25   0.0034   0.0037   0.0002

(c) Matrix with staircase sv, r = 7:
         Exact    mean     std
p = 0    0.010    0.038    0.025
p = 1    0.010    0.021    0.012
p = 2    0.010    0.012    0.005
Goal: Say something sensible about ‖(I − QQ^T)A‖: expected value, tail bounds, ...

The multivariate normal distribution X ∼ N(µ, Σ) with mean µ ∈ R^n and (positive definite) covariance matrix Σ ∈ R^{n×n} has density

    f_X(x) = (2π)^{−n/2} det(Σ)^{−1/2} exp(−(x − µ)^T Σ^{−1} (x − µ)/2).

Let V ∈ R^{m×n} have orthonormal columns. If X ∼ N(0, I_m), then V^T X ∼ N(0, I_n). This invariance of Gaussian random vectors allows us to assume w.l.o.g. in the analysis that A = Σ is diagonal (and square).
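The invariance is easy to check empirically; the following sketch estimates the covariance of V^T X from samples (a small verification, not part of the lecture):

```python
import numpy as np

# For V with orthonormal columns and X ~ N(0, I_m), the vector V^T X is
# again standard Gaussian, N(0, I_n): its empirical covariance is ~ I_n.
rng = np.random.default_rng(0)
m, n, nsamples = 8, 3, 200_000
V, _ = np.linalg.qr(rng.standard_normal((m, n)))  # orthonormal columns
X = rng.standard_normal((m, nsamples))            # samples of N(0, I_m)
Z = V.T @ X                                       # samples of V^T X
cov = Z @ Z.T / nsamples                          # empirical covariance
print(np.round(cov, 2))                           # close to the identity I_3
```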
Case r = 1, p = 0. Partition

    Ω = [ω1; ω2],   A = [σ1 0; 0 Σ2],

with scalar ω1. Then

    Y = AΩ = [σ1ω1; Σ2ω2].

Problem: We need to control the norm of σ1^{−1}ω1^{−1}Σ2ω2. But each entry of ω1^{−1}ω2 is a ratio of independent standard normal random variables and hence Cauchy distributed, with undefined mean and variance. Need to consider p ≥ 2.
Lemma. Let Π_Ỹ, Π_Y be orthogonal projectors onto subspaces Ỹ ⊂ Y. Then

    ‖Π_Ỹ A‖_F ≤ ‖Π_Y A‖_F,
    ‖(I − Π_Ỹ)A‖_F ≥ ‖(I − Π_Y)A‖_F

for any matrix A.
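The lemma can be sanity-checked numerically on random nested subspaces (a small verification script, not part of the lecture):

```python
import numpy as np

def proj(Y):
    # Orthogonal projector onto range(Y)
    Q, _ = np.linalg.qr(Y)
    return Q @ Q.T

# For nested subspaces range(Yt) in range(Y), projecting onto the smaller
# space keeps less of A (in Frobenius norm) and leaves more of A behind.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 15))
Y = rng.standard_normal((20, 6))
Yt = Y @ rng.standard_normal((6, 3))      # range(Yt) is a subspace of range(Y)
P, Pt = proj(Y), proj(Yt)
I = np.eye(20)
small = np.linalg.norm(Pt @ A)            # default norm is Frobenius
large = np.linalg.norm(P @ A)
print(small, "<=", large)
print(np.linalg.norm((I - Pt) @ A), ">=", np.linalg.norm((I - P) @ A))
```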
Now take general p ≥ 0, so that Ω1 is a 1 × (1 + p) row vector. With

    Y = AΩ = [σ1Ω1; Σ2Ω2],   set   Ỹ = Y Ω1† σ1^{−1} = [1; σ1^{−1}Σ2Ω2Ω1†],

where Ω1† is the pseudoinverse of Ω1. Because Ω1 is a row vector and nonzero (with probability one), we obtain Ω1† = Ω1^T/‖Ω1‖². By construction, range(Ỹ) ⊂ range(Y). Thus, with f = σ1^{−1}Σ2Ω2Ω1†, we get

    ‖(I − QQ^T)A‖_F ≤ ‖(I − Q̃Q̃^T)A‖_F,   Q̃ = (1 + ‖f‖²_2)^{−1/2} [1; f].
Now

    ‖(I − Q̃Q̃^T)A‖²_F = ‖A‖²_F − ‖Q̃^T A‖²_F

and

    ‖Q̃^T A‖²_F = (1 + ‖f‖²_2)^{−1} (σ1² + ‖Σ2 f‖²_2)
               ≥ σ1² / (1 + ‖f‖²_2)
               ≥ σ1² (1 − ‖f‖²_2) = σ1² − ‖Σ2Ω2Ω1†‖²_2.

In summary, we have:

    ‖(I − Q̃Q̃^T)A‖²_F ≤ ‖Σ2‖²_F + ‖Σ2Ω2Ω1†‖²_F.

To analyze the second term, we use

    E‖Σ2Ω2Ω1†‖²_F = E[ E[ ‖Σ2Ω2Ω1†‖²_F | Ω1 ] ] = ‖Σ2‖²_F · E‖Ω1†‖²_F.

(See exercises for a proof that E‖AΩB‖²_F = ‖A‖²_F ‖B‖²_F for a Gaussian matrix Ω and constant matrices A, B.)
It remains to control ‖Ω1‖²_F and ‖Ω1†‖²_F = 1/‖Ω1‖²_F.

‖Ω1‖²_F is a sum of p + 1 squared independent standard normal random variables. The pdf of X ∼ N(0, 1) is given by f_X(x) = (2π)^{−1/2} e^{−x²/2}. The pdf of Y = X² is zero for nonpositive values. For y > 0, we obtain

    Pr(0 ≤ Y ≤ y) = Pr(−√y ≤ X ≤ √y) = 2(2π)^{−1/2} ∫_0^{√y} e^{−x²/2} dx = (2π)^{−1/2} ∫_0^y t^{−1/2} e^{−t/2} dt.

The distribution of Y is called the chi-squared distribution with 1 degree of freedom: Y ∼ χ²_1.

More generally, ‖Ω1‖²_F ∼ χ²_{p+1}, the chi-squared distribution with p + 1 degrees of freedom, with pdf

    f(x) = 2^{−(p+1)/2} / Γ((p + 1)/2) · x^{(p+1)/2−1} exp(−x/2),   x > 0.
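A quick Monte Carlo check of the chi-squared claim (sketch; the parameters are arbitrary):

```python
import numpy as np

# ||Omega1||_F^2 for a standard Gaussian vector of length p + 1 follows
# chi^2_{p+1}, whose mean is p + 1 and whose variance is 2(p + 1).
rng = np.random.default_rng(0)
p, nsamples = 5, 400_000
omega1 = rng.standard_normal((nsamples, p + 1))
y = np.sum(omega1**2, axis=1)             # samples of chi^2_{p+1}
print(y.mean())                           # ~ p + 1 = 6
print(y.var())                            # ~ 2(p + 1) = 12
```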
Similarly,

    ‖Ω1†‖²_F = 1/‖Ω1‖²_F = (Σ_i ω²_{1,i})^{−1} ∼ Inv-χ²(p + 1),

the inverse-chi-squared distribution with p + 1 degrees of freedom, with pdf

    2^{−(p+1)/2} / Γ((p + 1)/2) · x^{−(p+1)/2−1} exp(−1/(2x)),   x > 0.

[Figure: pdf of the inverse-chi-squared distribution for p = 1, p = 3, p = 9.]
Textbook results:
◮ E‖Ω1‖²_F = p + 1,
◮ E‖Ω1†‖²_F = (p − 1)^{−1} (for p ≥ 2).

Tail bound by [Laurent/Massart’2000]:
◮ P(‖Ω1‖²_F ≤ p + 1 − t) ≤ exp(−t²/(4(p + 1))).

For r = 1, p ≥ 2, we have

    E‖(I − QQ^T)A‖_F ≤ (1 + 1/(p − 1))^{1/2} ‖Σ2‖_F.

The probability of deviating from this upper bound decays exponentially, as indicated by the tail bound for χ²_{p+1}.
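Both the expectation of the inverse and the lower tail bound can be checked by simulation (a rough sketch; the parameter choices are mine):

```python
import numpy as np

# Monte Carlo check, r = 1 case:
#   E[1/||Omega1||_F^2] = 1/(p - 1)  and the Laurent/Massart-type lower tail.
rng = np.random.default_rng(0)
p, nsamples = 4, 400_000
y = np.sum(rng.standard_normal((nsamples, p + 1))**2, axis=1)  # chi^2_{p+1}
print(np.mean(1.0 / y))            # ~ 1/(p - 1) = 1/3

t = 3.0
emp = np.mean(y <= p + 1 - t)      # empirical lower-tail probability
bound = np.exp(-t**2 / (4 * (p + 1)))
print(emp, "<=", bound)            # tail bound holds with room to spare
```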
The same argument yields the general rank-r bound:

    ‖(I − QQ^T)A‖²_F ≤ ‖Σ2‖²_F + ‖Σ2Ω2Ω1†‖²_F.

Proof. Partition

    Y = AΩ = [Σ1Ω1; Σ2Ω2],   Ỹ = Y Ω1† Σ1^{−1} = [I_r; F],   F = Σ2Ω2Ω1†Σ1^{−1}.

An ONB of range(Ỹ) is given by Q̃ = [I_r; F](I_r + F^T F)^{−1/2}. Hence,

    ‖(I − QQ^T)A‖²_F ≤ ‖(I − Q̃Q̃^T)A‖²_F = ‖A‖²_F − ‖Q̃^T A‖²_F

and

    ‖Q̃^T A‖²_F = ‖(I_r + F^T F)^{−1/2}Σ1‖²_F + ‖(I_r + F^T F)^{−1/2}F^T Σ2‖²_F
               ≥ ‖(I_r + F^T F)^{−1/2}Σ1‖²_F = trace(Σ1(I_r + F^T F)^{−1}Σ1)
               ≥ trace(Σ1(I_r − F^T F)Σ1) = ‖Σ1‖²_F − ‖Σ2Ω2Ω1†‖²_F.

(Proof of trace(Σ1(I_r + F^T F)^{−1}Σ1) ≥ trace(Σ1(I_r − F^T F)Σ1) on next slide.)
Let G = F^T F. Then we have (I + G)(I − G) = I − G², and hence

    (I + G)^{−1} − (I − G) = (I + G)^{−1}G² = G(I + G)^{−1}G

is symmetric positive semidefinite. This implies that Σ1((I + G)^{−1} − (I − G))Σ1 is symmetric positive semidefinite as well, and hence 0 ≤ trace(Σ1((I + G)^{−1} − (I − G))Σ1), that is,

    trace(Σ1(I_r + F^T F)^{−1}Σ1) ≥ trace(Σ1(I_r − F^T F)Σ1).
Again use E‖Σ2Ω2Ω1†‖²_F = ‖Σ2‖²_F · E‖Ω1†‖²_F. By standard results in multivariate statistics, we have

    E‖Ω1†‖²_F = r / (p − 1).

Sketch of argument:
◮ Ω1Ω1^T ∼ W_r(r + p) (Wishart distribution with r + p degrees of freedom)
◮ (Ω1Ω1^T)^{−1} ∼ W_r^{−1}(r + p) (inverse Wishart distribution with r + p degrees of freedom)
◮ E(Ω1Ω1^T)^{−1} = (p − 1)^{−1} I_r; see page 96 in [Muirhead’1982]
◮ The result follows from ‖Ω1†‖²_F = ‖Ω1^T(Ω1Ω1^T)^{−1}‖²_F = trace((Ω1Ω1^T)^{−1}).

[Muirhead’1982] R. J. Muirhead, Aspects of Multivariate Statistical Theory. Wiley, 1982.
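The Wishart-based formula can likewise be verified by simulation (sketch; parameters are mine):

```python
import numpy as np

# Monte Carlo check: for an r x (r + p) Gaussian matrix Omega1,
#   E ||pinv(Omega1)||_F^2 = E trace((Omega1 Omega1^T)^{-1}) = r/(p - 1).
rng = np.random.default_rng(0)
r, p, nsamples = 3, 5, 50_000
vals = np.empty(nsamples)
for i in range(nsamples):
    Omega1 = rng.standard_normal((r, r + p))
    vals[i] = np.trace(np.linalg.inv(Omega1 @ Omega1.T))
print(vals.mean())  # ~ r/(p - 1) = 0.75
```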
Theorem [Halko/Martinsson/Tropp’2010]. For p ≥ 2, we have

    E‖(I − QQ^T)A‖_F ≤ (1 + r/(p − 1))^{1/2} ‖Σ2‖_F,
    E‖(I − QQ^T)A‖_2 ≤ (1 + (r/(p − 1))^{1/2}) σ_{r+1} + (e (r + p)^{1/2} / p) ‖Σ2‖_F.

For the proof of the spectral norm bound and for tail bounds, see [Halko/Martinsson/Tropp’2010].
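The Frobenius-norm bound can be compared against the observed error on a synthetic diagonal matrix (sketch; the matrix choice is mine):

```python
import numpy as np

# Compare the average of ||(I - QQ^T)A||_F over random trials with the bound
# sqrt(1 + r/(p-1)) * ||Sigma_2||_F, for A = diag(2^0, 2^-1, 2^-2, ...).
rng = np.random.default_rng(0)
n, r, p, trials = 50, 5, 4, 300
sv = 2.0 ** -np.arange(n)                 # exponentially decaying singular values
A = np.diag(sv)
tail = np.sqrt(np.sum(sv[r:]**2))         # ||Sigma_2||_F
errs = []
for _ in range(trials):
    Omega = rng.standard_normal((n, r + p))
    Q, _ = np.linalg.qr(A @ Omega)        # ONB with r + p columns
    errs.append(np.linalg.norm(A - Q @ (Q.T @ A)))
bound = np.sqrt(1 + r / (p - 1)) * tail
print(np.mean(errs), "<=", bound)
```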
Randomized algorithm with power scheme: compute Y = (AA^T)^q AΩ instead of Y = AΩ, take an ONB Q of range(Y), and set

    Â = Q T_r(Q^T A) (in factorized form).

The algorithm is essentially equivalent to the randomized algorithm applied to (AA^T)^q A. Using Hölder’s inequality and a result from the exercises, we have

    E‖(I − QQ^T)A‖_2 ≤ (E‖(I − QQ^T)A‖_2^{2q+1})^{1/(2q+1)} ≤ (E‖(I − QQ^T)(AA^T)^q A‖_2)^{1/(2q+1)}.

The singular values of (AA^T)^q A are σ_1^{2q+1}, . . . , σ_n^{2q+1}.
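A stabilized sketch of the power scheme, with a small experiment illustrating the sharpened singular value decay (my own implementation sketch, not from the lecture; in floating point one orthonormalizes between multiplications rather than forming (AA^T)^q AΩ directly):

```python
import numpy as np

def randomized_power(A, r, p=5, q=2, rng=None):
    """Randomized rangefinder with q steps of the power scheme, with
    re-orthonormalization after every multiplication for stability."""
    rng = np.random.default_rng(rng)
    Q, _ = np.linalg.qr(A @ rng.standard_normal((A.shape[1], r + p)))
    for _ in range(q):                    # stabilized Y = (A A^T)^q A Omega
        Q, _ = np.linalg.qr(A.T @ Q)
        Q, _ = np.linalg.qr(A @ Q)
    return Q

# Since the singular values of (A A^T)^q A are sigma_i^(2q+1), slow decay is
# sharpened. Compare q = 0 and q = 2 on a matrix with sigma_i = 1/i.
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((80, 80)))
A = U @ np.diag(1.0 / np.arange(1.0, 81.0)) @ U.T
errs = {}
for q in (0, 2):
    Q = randomized_power(A, r=10, p=5, q=q, rng=1)
    errs[q] = np.linalg.norm(A - Q @ (Q.T @ A), 2)
print(errs)  # spectral error for q = 2 approaches sigma_16 = 1/16
```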
Combination with the previous result for the randomized algorithm gives: for p ≥ 2, E‖(I − QQ^T)A‖_2 is bounded by

    [ (1 + (r/(p − 1))^{1/2}) σ_{r+1}^{2q+1} + (e (r + p)^{1/2} / p) (σ_{r+1}^{2(2q+1)} + · · · + σ_n^{2(2q+1)})^{1/2} ]^{1/(2q+1)}
    ≤ σ_{r+1} [ 1 + (r/(p − 1))^{1/2} + (e (r + p)^{1/2} / p) (n − r)^{1/2} ]^{1/(2q+1)}.
Lemma. Let ω^(i), i = 1, . . . , s, be n-dimensional Gaussian random vectors. Then, for any m × n matrix C, the inequality

    ‖C‖_2 ≤ 10 (2/π)^{1/2} max_{i=1,...,s} ‖Cω^(i)‖_2

holds with probability at least 1 − 10^{−s}. [Halko/Martinsson/Tropp’2010]; see exercises for the full proof.

Given the ONB Q returned by the randomized algorithm, use this result for C = (I − QQ^T)A to estimate the spectral norm error. This can be combined into an adaptive algorithm for choosing the rank. Assuming that one knows ‖A‖_F, the Frobenius norm error can be estimated from

    ‖A‖²_F = ‖QQ^T A‖²_F + ‖(I − QQ^T)A‖²_F.
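The norm estimator suggested by the lemma can be sketched as follows (the function name is mine); applying it to C = (I − QQ^T)A gives the a posteriori error estimate used in the adaptive algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)

def norm_estimate(C, s=5, rng=rng):
    """Probabilistic upper bound on ||C||_2 from s Gaussian probes:
    holds with probability at least 1 - 10^{-s}."""
    W = rng.standard_normal((C.shape[1], s))
    return 10 * np.sqrt(2 / np.pi) * np.max(np.linalg.norm(C @ W, axis=0))

C = rng.standard_normal((40, 30))
est = norm_estimate(C)
true = np.linalg.norm(C, 2)
print(true, "<=", est)   # fails only with probability 10^{-5}
```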