SLIDE 1
Randomness
DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis
http://www.cims.nyu.edu/~cfgranda/pages/OBDA_fall17/index.html
Carlos Fernandez-Granda
Gaussian random variables · Gaussian random vectors · Randomized projections · SVD of a random matrix · Randomized SVD
SLIDE 2
SLIDE 3
Gaussian random variables
The pdf of a Gaussian or normal random variable with mean $\mu$ and standard deviation $\sigma$ is given by
$$f_X(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
SLIDE 4
Gaussian random variables
[Plot of Gaussian pdfs $f_X(x)$ for $(\mu, \sigma) = (2, 1)$, $(0, 2)$, and $(0, 4)$]
SLIDE 5
Linear transformation of Gaussian
If $x$ is a Gaussian random variable with mean $\mu$ and standard deviation $\sigma$, then for any $a, b \in \mathbb{R}$, $y := ax + b$ is a Gaussian random variable with mean $a\mu + b$ and standard deviation $|a|\,\sigma$
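A quick numerical check of this property (not part of the original slides; the parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 2.0, 1.5      # parameters of x (arbitrary choice)
a, b = -3.0, 4.0          # affine transformation y = a x + b

x = rng.normal(mu, sigma, size=100_000)
y = a * x + b

# Empirical moments should match the theoretical ones
print(y.mean(), a * mu + b)        # mean of y vs. a*mu + b
print(y.std(), abs(a) * sigma)     # std of y vs. |a|*sigma
```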
SLIDE 6
Proof
Let $a > 0$ (the proof for $a < 0$ is very similar). The cdf of $y$ is
$$F_y(y) = P(y \leq y) = P(ax + b \leq y) = P\left(x \leq \frac{y-b}{a}\right) = \int_{-\infty}^{\frac{y-b}{a}} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}\, dx = \int_{-\infty}^{y} \frac{1}{\sqrt{2\pi}\,a\sigma}\, e^{-\frac{(w-a\mu-b)^2}{2a^2\sigma^2}}\, dw$$
by the change of variables $w = ax + b$. Differentiating with respect to $y$:
$$f_y(y) = \frac{1}{\sqrt{2\pi}\,a\sigma}\, e^{-\frac{(y-a\mu-b)^2}{2a^2\sigma^2}}$$
SLIDE 13
Central limit theorem
Let $x_1, x_2, x_3, \ldots$ be a sequence of iid random variables with mean $\mu$ and bounded variance $\sigma^2$. The sequence of averages $a_1, a_2, a_3, \ldots$ is defined as
$$a_i := \frac{1}{i}\sum_{j=1}^{i} x_j$$
SLIDE 14
Central limit theorem
The sequence $b_1, b_2, b_3, \ldots$ defined by $b_i := \sqrt{i}\,(a_i - \mu)$ converges in distribution to a Gaussian random variable with mean 0 and variance $\sigma^2$: for any $x \in \mathbb{R}$,
$$\lim_{i \to \infty} f_{b_i}(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{x^2}{2\sigma^2}}$$
For large $i$ the theorem suggests that the average $a_i$ is approximately Gaussian with mean $\mu$ and standard deviation $\sigma/\sqrt{i}$.
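Histograms like the ones on the following slides can be reproduced with a simulation along these lines (an illustrative sketch; the sample counts are arbitrary and matplotlib is assumed to be available):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
lam, i, n_trials = 2.0, 10**4, 5000   # exponential rate, samples per average, repetitions

# Each trial: average of i iid exponential(lambda) samples (mean 1/lambda, variance 1/lambda^2)
averages = rng.exponential(scale=1 / lam, size=(n_trials, i)).mean(axis=1)

# Compare the histogram of the averages with the Gaussian suggested by the CLT
grid = np.linspace(averages.min(), averages.max(), 200)
clt_pdf = np.exp(-(grid - 1 / lam) ** 2 / (2 / (lam**2 * i))) / np.sqrt(2 * np.pi / (lam**2 * i))
plt.hist(averages, bins=50, density=True)
plt.plot(grid, clt_pdf)
plt.show()
```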
SLIDE 15
iid exponential, $\lambda = 2$, $i = 10^2$
[Histogram of the average $a_i$ over many realizations, compared with the Gaussian approximation]
SLIDE 16
iid exponential, $\lambda = 2$, $i = 10^3$
[Histogram of the average $a_i$ over many realizations, compared with the Gaussian approximation]
SLIDE 17
iid exponential, $\lambda = 2$, $i = 10^4$
[Histogram of the average $a_i$ over many realizations, compared with the Gaussian approximation]
SLIDE 18
iid geometric, $p = 0.4$, $i = 10^2$
[Histogram of the average $a_i$ over many realizations, compared with the Gaussian approximation]
SLIDE 19
iid geometric, $p = 0.4$, $i = 10^3$
[Histogram of the average $a_i$ over many realizations, compared with the Gaussian approximation]
SLIDE 20
iid geometric, $p = 0.4$, $i = 10^4$
[Histogram of the average $a_i$ over many realizations, compared with the Gaussian approximation]
SLIDE 21
Histogram of heights
[Histogram of heights in inches for real data, compared with a fitted Gaussian distribution]
SLIDE 22
Gaussian random variables Gaussian random vectors Randomized projections SVD of a random matrix Randomized SVD
SLIDE 23
Gaussian random vector
A Gaussian random vector $\vec{x}$ is a random vector with joint pdf
$$f_{\vec{x}}(\vec{x}) = \frac{1}{\sqrt{(2\pi)^n |\Sigma|}} \exp\left(-\frac{1}{2}(\vec{x} - \vec{\mu})^T \Sigma^{-1} (\vec{x} - \vec{\mu})\right)$$
where $\vec{\mu} \in \mathbb{R}^n$ is the mean and $\Sigma \in \mathbb{R}^{n \times n}$ the covariance matrix
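A small sketch (illustrative values only) that samples from this density via the linear-transformation property, $\vec{x} = \vec{\mu} + A\vec{z}$ with $AA^T = \Sigma$ and $\vec{z}$ iid standard Gaussian:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])                     # mean vector (arbitrary)
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])                 # covariance matrix (arbitrary, positive definite)

A = np.linalg.cholesky(Sigma)                  # A such that A A^T = Sigma
z = rng.standard_normal(size=(2, 100_000))     # iid standard Gaussian entries
x = mu[:, None] + A @ z                        # samples of the Gaussian random vector

print(x.mean(axis=1))                          # ~ mu
print(np.cov(x))                               # ~ Sigma
```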
SLIDE 24
Uncorrelation implies independence
If the covariance matrix is diagonal,
$$\Sigma_{\vec{x}} = \begin{bmatrix} \sigma_1^2 & & \\ & \ddots & \\ & & \sigma_n^2 \end{bmatrix},$$
the entries are independent
SLIDE 25
Proof
If $\Sigma_{\vec{x}}$ is diagonal, then
$$\Sigma_{\vec{x}}^{-1} = \begin{bmatrix} \frac{1}{\sigma_1^2} & & \\ & \ddots & \\ & & \frac{1}{\sigma_n^2} \end{bmatrix}, \qquad |\Sigma| = \prod_{i=1}^{n} \sigma_i^2$$
so that
$$f_{\vec{x}}(\vec{x}) = \frac{1}{\sqrt{(2\pi)^n |\Sigma|}} \exp\left(-\frac{1}{2}(\vec{x}-\vec{\mu})^T \Sigma^{-1} (\vec{x}-\vec{\mu})\right) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\left(-\frac{(x_i - \mu_i)^2}{2\sigma_i^2}\right) = \prod_{i=1}^{n} f_{x_i}(x_i)$$
SLIDE 30
Linear transformations
Let $\vec{x}$ be a Gaussian random vector of dimension $n$ with mean $\vec{\mu}$ and covariance matrix $\Sigma$. For any matrix $A \in \mathbb{R}^{m \times n}$ and $\vec{b} \in \mathbb{R}^m$, $\vec{y} := A\vec{x} + \vec{b}$ is Gaussian with mean $A\vec{\mu} + \vec{b}$ and covariance matrix $A\Sigma A^T$
SLIDE 31
Subvectors are also Gaussian
[Plot of a two-dimensional Gaussian joint pdf $f_{\vec{x}}(x, y)$ together with its marginals $f_{\vec{x}[1]}(x)$ and $f_{\vec{x}[2]}(y)$]
SLIDE 32
Direction of iid standard Gaussian vectors
If the covariance matrix of a Gaussian vector $\vec{x}$ is $I$, then $\vec{x}$ is isotropic: it does not favor any direction. For any orthogonal matrix $U$, $U\vec{x}$ has the same distribution as $\vec{x}$ (Gaussian with mean $U\vec{0} = \vec{0}$ and covariance matrix $UIU^T = UU^T = I$)
SLIDE 33
Magnitude of iid standard Gaussian vectors
In low dimensions the joint pdf is mostly concentrated around the origin. What happens in high dimensions?
$$||\vec{x}||_2^2 = \sum_{i=1}^{k} \vec{x}[i]^2$$
is a $\chi^2$ (chi-squared) random variable with $k$ degrees of freedom
SLIDE 34
Magnitude of iid standard Gaussian vectors
[Plot of $f_{y/k}(x)$ for $k = 10, 20, 50, 100$, where $y := ||\vec{x}||_2^2$]
SLIDE 35
Mean
$$E\left(||\vec{x}||_2^2\right) = E\left(\sum_{i=1}^{k} \vec{x}[i]^2\right) = \sum_{i=1}^{k} E\left(\vec{x}[i]^2\right) = k$$
SLIDE 39
Variance
$$E\left(\left(||\vec{x}||_2^2\right)^2\right) = E\left(\left(\sum_{i=1}^{k} \vec{x}[i]^2\right)^2\right) = E\left(\sum_{i=1}^{k}\sum_{j=1}^{k} \vec{x}[i]^2\,\vec{x}[j]^2\right) = \sum_{i=1}^{k}\sum_{j=1}^{k} E\left(\vec{x}[i]^2\,\vec{x}[j]^2\right)$$
$$= \sum_{i=1}^{k} E\left(\vec{x}[i]^4\right) + 2\sum_{i=1}^{k-1}\sum_{j=i+1}^{k} E\left(\vec{x}[i]^2\right) E\left(\vec{x}[j]^2\right) = 3k + k(k-1) = k(k+2)$$
where we use that the 4th moment of a standard Gaussian equals 3
SLIDE 46
Variance
$$\text{Var}\left(||\vec{x}||_2^2\right) = E\left(\left(||\vec{x}||_2^2\right)^2\right) - E\left(||\vec{x}||_2^2\right)^2 = k(k+2) - k^2 = 2k$$
The relative standard deviation around the mean scales as $\sqrt{2/k}$
SLIDE 47
Non-asymptotic tail bound
Let $\vec{x}$ be an iid standard Gaussian random vector of dimension $k$. For any $\epsilon > 0$
$$P\left(k(1-\epsilon) < ||\vec{x}||_2^2 < k(1+\epsilon)\right) \geq 1 - \frac{2}{k\epsilon^2}$$
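A quick simulation (an illustrative sketch with arbitrary parameters) of how $||\vec{x}||_2^2$ concentrates around $k$ as the dimension grows:

```python
import numpy as np

rng = np.random.default_rng(0)
eps, n_trials = 0.1, 10_000

for k in (10, 100, 1000):
    x = rng.standard_normal(size=(n_trials, k))
    sq_norms = (x**2).sum(axis=1)
    # Fraction of trials with k(1 - eps) < ||x||^2 < k(1 + eps)
    inside = np.mean((sq_norms > k * (1 - eps)) & (sq_norms < k * (1 + eps)))
    print(k, inside)   # approaches 1 as k grows
```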
SLIDE 48
Markov’s inequality
Let $x$ be a nonnegative random variable. For any positive constant $a > 0$,
$$P(x \geq a) \leq \frac{E(x)}{a}$$
SLIDE 49
Proof
Define the indicator variable $1_{x \geq a}$. Since $x - a\,1_{x \geq a} \geq 0$,
$$E(x) \geq a\, E\left(1_{x \geq a}\right) = a\, P(x \geq a)$$
SLIDE 51
Chebyshev bound
Let $y := ||\vec{x}||_2^2$. Then
$$P(|y - k| \geq k\epsilon) = P\left((y - E(y))^2 \geq k^2\epsilon^2\right) \leq \frac{E\left((y - E(y))^2\right)}{k^2\epsilon^2} = \frac{\text{Var}(y)}{k^2\epsilon^2} = \frac{2}{k\epsilon^2}$$
where the inequality follows from Markov's inequality
SLIDE 56
Non-asymptotic Chernoff tail bound
Let $\vec{x}$ be an iid standard Gaussian random vector of dimension $k$. For any $\epsilon > 0$
$$P\left(k(1-\epsilon) < ||\vec{x}||_2^2 < k(1+\epsilon)\right) \geq 1 - 2\exp\left(-\frac{k\epsilon^2}{8}\right)$$
SLIDE 57
Proof
Let $y := ||\vec{x}||_2^2$. The result is implied by
$$P\left(y > k(1+\epsilon)\right) \leq \exp\left(-\frac{k\epsilon^2}{8}\right), \qquad P\left(y < k(1-\epsilon)\right) \leq \exp\left(-\frac{k\epsilon^2}{8}\right)$$
SLIDE 58
Proof
Fix $t > 0$. Then
$$P(y > a) = P\left(\exp(ty) > \exp(at)\right) \leq \exp(-at)\, E\left(\exp(ty)\right) \quad \text{by Markov's inequality}$$
$$= \exp(-at)\, E\left(\exp\left(\sum_{i=1}^{k} t x_i^2\right)\right) = \exp(-at) \prod_{i=1}^{k} E\left(\exp\left(t x_i^2\right)\right) \quad \text{by independence of } x_1, \ldots, x_k$$
SLIDE 63
Proof
Lemma (by direct integration)
$$E\left(\exp\left(tx^2\right)\right) = \frac{1}{\sqrt{1-2t}}$$
Equivalent to controlling higher-order moments, since
$$E\left(\exp\left(tx^2\right)\right) = E\left(\sum_{i=0}^{\infty} \frac{(tx^2)^i}{i!}\right) = \sum_{i=0}^{\infty} \frac{t^i\, E\left(x^{2i}\right)}{i!}$$
SLIDE 64
Proof
Fix $t > 0$:
$$P(y > a) \leq \exp(-at) \prod_{i=1}^{k} E\left(\exp\left(t x_i^2\right)\right) = \exp(-at)\,(1-2t)^{-\frac{k}{2}}$$
SLIDE 65
Proof
Setting $a := k(1+\epsilon)$ and $t := \frac{1}{2} - \frac{1}{2(1+\epsilon)}$, we conclude
$$P\left(y > k(1+\epsilon)\right) \leq (1+\epsilon)^{\frac{k}{2}} \exp\left(-\frac{k\epsilon}{2}\right) \leq \exp\left(-\frac{k\epsilon^2}{8}\right)$$
SLIDE 66
Projection onto a fixed subspace
[Illustration: projections $P_{S_1}\vec{z}$ and $P_{S_2}\vec{z}$ of the same noise vector $\vec{z}$ onto two subspaces]
$$0.007 = \frac{||P_{S_1}\vec{z}||_2}{||\vec{z}||_2} < \frac{||P_{S_2}\vec{z}||_2}{||\vec{z}||_2} = 0.043, \qquad \frac{0.043}{0.007} = 6.14 \approx \sqrt{\frac{\dim(S_2)}{\dim(S_1)}} \quad \text{(not a coincidence)}$$
SLIDE 67
Projection onto a fixed subspace
Let $S$ be a $k$-dimensional subspace of $\mathbb{R}^n$ and $\vec{z} \in \mathbb{R}^n$ a vector of iid standard Gaussian noise. Then $||P_S\vec{z}||_2^2$ is a $\chi^2$ random variable with $k$ degrees of freedom: it has the same distribution as
$$y := \sum_{i=1}^{k} x_i^2$$
where $x_1, \ldots, x_k$ are iid standard Gaussians.
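A numerical check of this claim (a sketch with arbitrary dimensions): project Gaussian noise onto a fixed $k$-dimensional subspace and compare the squared norm of the projection with direct $\chi^2_k$ samples.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, n_trials = 200, 10, 50_000

# Fixed k-dimensional subspace: orthonormal basis U from a QR factorization
U, _ = np.linalg.qr(rng.standard_normal((n, k)))

z = rng.standard_normal((n_trials, n))          # iid standard Gaussian noise vectors
proj_sq_norms = ((z @ U) ** 2).sum(axis=1)      # ||P_S z||^2 = ||U^T z||^2

chi2_samples = (rng.standard_normal((n_trials, k)) ** 2).sum(axis=1)
print(proj_sq_norms.mean(), chi2_samples.mean())  # both ~ k
print(proj_sq_norms.var(), chi2_samples.var())    # both ~ 2k
```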
SLIDE 68
Proof
Let $UU^T$ be a projection matrix for $S$, where the columns of $U \in \mathbb{R}^{n \times k}$ are orthonormal:
$$||P_S\vec{z}||_2^2 = \left|\left|UU^T\vec{z}\right|\right|_2^2 = \vec{z}^T UU^T UU^T\vec{z} = \vec{z}^T UU^T\vec{z} = \vec{w}^T\vec{w} = \sum_{i=1}^{k} \vec{w}[i]^2$$
where $\vec{w} := U^T\vec{z}$ is Gaussian with mean zero and covariance matrix
$$\Sigma_{\vec{w}} = U^T \Sigma_{\vec{z}} U = U^T U = I$$
SLIDE 76
Non-asymptotic Chernoff tail bound
Let $\vec{x}$ be an iid standard Gaussian random vector of dimension $k$. For any $\epsilon > 0$
$$P\left(k(1-\epsilon) < ||\vec{x}||_2^2 < k(1+\epsilon)\right) \geq 1 - 2\exp\left(-\frac{k\epsilon^2}{8}\right)$$
SLIDE 77
Projection onto a fixed subspace
Let $S$ be a $k$-dimensional subspace of $\mathbb{R}^n$ and $\vec{z} \in \mathbb{R}^n$ a vector of iid standard Gaussian noise. For any $\epsilon > 0$
$$P\left(k(1-\epsilon) < ||P_S\vec{z}||_2^2 < k(1+\epsilon)\right) \geq 1 - 2\exp\left(-\frac{k\epsilon^2}{8}\right)$$
SLIDE 78
Gaussian random variables Gaussian random vectors Randomized projections SVD of a random matrix Randomized SVD
SLIDE 79
Dimensionality reduction
◮ PCA preserves the most energy ($\ell_2$ norm)
◮ Problem 1: computationally expensive
◮ Problem 2: depends on all of the data
◮ (Possible) solution: just project randomly!
◮ For a data set $\vec{x}_1, \vec{x}_2, \ldots \in \mathbb{R}^n$ compute $A\vec{x}_1, A\vec{x}_2, \ldots \in \mathbb{R}^k$, where $A \in \mathbb{R}^{k \times n}$ ($k < n$) has iid standard Gaussian entries
SLIDE 80
Fixed vector
Let $A$ be an $a \times b$ matrix with iid standard Gaussian entries. If $\vec{v} \in \mathbb{R}^b$ is a deterministic vector with unit $\ell_2$ norm, then $A\vec{v}$ is an $a$-dimensional iid standard Gaussian vector.
Proof: $(A\vec{v})[i]$, $1 \leq i \leq a$, is Gaussian with mean zero and variance
$$\text{Var}\left(A_{i,:}^T\vec{v}\right) = \vec{v}^T \Sigma_{A_{i,:}} \vec{v} = \vec{v}^T I \vec{v} = ||\vec{v}||_2^2 = 1$$
SLIDE 82
Non-asymptotic Chernoff tail bound
Let $\vec{x}$ be an iid standard Gaussian random vector of dimension $k$. For any $\epsilon > 0$
$$P\left(k(1-\epsilon) < ||\vec{x}||_2^2 < k(1+\epsilon)\right) \geq 1 - 2\exp\left(-\frac{k\epsilon^2}{8}\right)$$
SLIDE 83
Fixed vector
Let $A$ be an $a \times b$ matrix with iid standard Gaussian entries. For any $\vec{v} \in \mathbb{R}^b$ with unit norm and any $\epsilon \in (0,1)$
$$\sqrt{a(1-\epsilon)} \leq ||A\vec{v}||_2 \leq \sqrt{a(1+\epsilon)}$$
with probability at least $1 - 2\exp\left(-a\epsilon^2/8\right)$
SLIDE 84
Johnson-Lindenstrauss lemma
Let $A$ be a $k \times n$ matrix with iid standard Gaussian entries, and let $\vec{x}_1, \ldots, \vec{x}_p \in \mathbb{R}^n$ be any fixed set of $p$ deterministic vectors. For any pair $\vec{x}_i, \vec{x}_j$ and any $\epsilon \in (0,1)$
$$(1-\epsilon)\,||\vec{x}_i - \vec{x}_j||_2^2 \leq \left|\left|\frac{1}{\sqrt{k}}A\vec{x}_i - \frac{1}{\sqrt{k}}A\vec{x}_j\right|\right|_2^2 \leq (1+\epsilon)\,||\vec{x}_i - \vec{x}_j||_2^2$$
with probability at least $\frac{1}{p}$, as long as
$$k \geq \frac{16\log(p)}{\epsilon^2}$$
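An illustrative sketch of the lemma in action (sizes are arbitrary and not from the slides): project a fixed set of points with a scaled iid Gaussian matrix and inspect the distortion of pairwise squared distances.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 5000, 50, 1000             # ambient dimension, number of points, projection dimension

X = rng.uniform(-1, 1, size=(p, n))  # fixed set of p deterministic vectors (rows)
A = rng.standard_normal((k, n))
Y = X @ A.T / np.sqrt(k)             # projected points (1/sqrt(k)) A x_i

ratios = []
for i in range(p):
    for j in range(i + 1, p):
        orig = np.sum((X[i] - X[j]) ** 2)
        proj = np.sum((Y[i] - Y[j]) ** 2)
        ratios.append(proj / orig)

# The ratios should lie roughly in [1 - eps, 1 + eps] for eps ~ sqrt(16 log(p) / k)
print(min(ratios), max(ratios))
```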
SLIDE 85
Proof
Aim: control the action of $A$ on the normalized differences
$$\vec{v}_{ij} := \frac{\vec{x}_i - \vec{x}_j}{||\vec{x}_i - \vec{x}_j||_2}$$
Our event of interest is the intersection of the events
$$E_{ij} = \left\{ k(1-\epsilon) < ||A\vec{v}_{ij}||_2^2 < k(1+\epsilon) \right\}, \qquad 1 \leq i < j \leq p$$
SLIDE 86
Fixed vector
Fixed vector
Let $A$ be an $a \times b$ matrix with iid standard Gaussian entries. For any $\vec{v} \in \mathbb{R}^b$ with unit norm and any $\epsilon \in (0,1)$
$$\sqrt{a(1-\epsilon)} \leq ||A\vec{v}||_2 \leq \sqrt{a(1+\epsilon)}$$
with probability at least $1 - 2\exp\left(-a\epsilon^2/8\right)$. This implies
$$P\left(E_{ij}^c\right) \leq \frac{2}{p^2} \qquad \text{if } k \geq \frac{16\log(p)}{\epsilon^2}$$
SLIDE 87
Union bound
For any events $S_1, S_2, \ldots, S_n$ in a probability space
$$P\left(\cup_i S_i\right) \leq \sum_{i=1}^{n} P(S_i)$$
SLIDE 88
Proof
The number of events $E_{ij}$ equals $\binom{p}{2} = p(p-1)/2$. By the union bound
$$P\left(\cap_{i,j}\, E_{ij}\right) = 1 - P\left(\cup_{i,j}\, E_{ij}^c\right) \geq 1 - \sum_{i,j} P\left(E_{ij}^c\right) \geq 1 - \frac{p(p-1)}{2}\cdot\frac{2}{p^2} \geq \frac{1}{p}$$
SLIDE 93
Dimensionality reduction for visualization
Motivation: visualize high-dimensional features projected onto 2D or 3D. Example: seeds from three different varieties of wheat (Kama, Rosa, and Canadian). Features:
◮ Area
◮ Perimeter
◮ Compactness
◮ Length of kernel
◮ Width of kernel
◮ Asymmetry coefficient
◮ Length of kernel groove
SLIDE 94
Dimensionality reduction for visualization
[2D scatter plots of the projected wheat-seed features: randomized projection vs. PCA]
SLIDE 95
Nearest neighbors in random subspace
Nearest-neighbor classification (Algorithm 4.2 in Lecture Notes 1) computes $n$ distances in $\mathbb{R}^m$ for each new example. Cost: $O(nmp)$ for $p$ examples.
Idea: use a $k \times m$ iid standard Gaussian matrix to project onto a $k$-dimensional space beforehand. Cost:
◮ $kmn$ operations to project the training set
◮ $kmp$ operations to project the test set
◮ $knp$ to perform nearest-neighbor classification
Much faster!
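A sketch of this idea in plain NumPy (the data, labels, and sizes below are placeholders, not the face-recognition data described on the next slide):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, p, k = 4096, 360, 40, 50               # feature dim, training size, test size, projection dim

X_train = rng.standard_normal((n, m))        # placeholder training examples
labels = rng.integers(0, 40, size=n)         # placeholder labels
X_test = rng.standard_normal((p, m))         # placeholder test examples

A = rng.standard_normal((k, m))              # random projection matrix
P_train = X_train @ A.T                      # project training set: k*m*n operations
P_test = X_test @ A.T                        # project test set: k*m*p operations

# Nearest-neighbor classification in the projected space (k*n*p distance computations)
dists = ((P_test[:, None, :] - P_train[None, :, :]) ** 2).sum(axis=2)
predictions = labels[dists.argmin(axis=1)]
print(predictions[:10])
```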
SLIDE 96
Face recognition
Training set: 360 images of size 64 × 64 from 40 different subjects (9 each). Test set: 1 new image from each subject. We model each image as a vector in $\mathbb{R}^{4096}$ ($m = 4096$). To classify we:
1. Project onto a random $k$-dimensional subspace
2. Apply nearest-neighbor classification using the $\ell_2$-norm distance in $\mathbb{R}^k$
SLIDE 97
Performance
[Plot of the number of classification errors (average, maximum, and minimum) as a function of the projection dimension]
SLIDE 98
Nearest neighbor in $\mathbb{R}^{50}$
[Figure: test image, its projection, the closest projection in the training set, and the corresponding training image]
SLIDE 99
Gaussian random variables Gaussian random vectors Randomized projections SVD of a random matrix Randomized SVD
SLIDE 100
Singular values of n × k matrix, k = 100
[Plot of $\sigma_i/\sqrt{n}$ against $i$ for aspect ratios $n/k \in \{2, 5, 10, 20, 50, 100, 200\}$]
SLIDE 101
Singular values of n × k matrix, k = 1000
[Plot of $\sigma_i/\sqrt{n}$ against $i$ for aspect ratios $n/k \in \{2, 5, 10, 20, 50, 100, 200\}$]
SLIDE 102
Singular values of a Gaussian matrix
Intuitively, as $n$ grows,
$$A \approx U\left(\sqrt{n}\,I\right)V^T = \sqrt{n}\,UV^T,$$
because iid Gaussian vectors in high dimensions are almost orthogonal
SLIDE 103
Singular values of a Gaussian matrix
Let $A$ be an $n \times k$ matrix with iid standard Gaussian entries such that $n > k$. For any fixed $\epsilon > 0$, the singular values of $A$ satisfy
$$\sqrt{n}\,(1-\epsilon) \leq \sigma_k \leq \sigma_1 \leq \sqrt{n}\,(1+\epsilon)$$
with probability at least $1 - 1/k$ as long as
$$n > \frac{64k}{\epsilon^2}\log\frac{12}{\epsilon}$$
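An empirical look at this concentration (an illustrative sketch, arbitrary sizes):

```python
import numpy as np

rng = np.random.default_rng(0)
k = 100
for ratio in (2, 10, 50):
    n = ratio * k
    A = rng.standard_normal((n, k))
    s = np.linalg.svd(A, compute_uv=False)
    # All singular values approach sqrt(n) as n/k grows
    print(n, k, s.max() / np.sqrt(n), s.min() / np.sqrt(n))
```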
SLIDE 104
Proof
Recall that
$$\sigma_1 = \max_{\{||\vec{x}||_2 = 1 \,:\, \vec{x} \in \mathbb{R}^k\}} ||A\vec{x}||_2, \qquad \sigma_k = \min_{\{||\vec{x}||_2 = 1 \,:\, \vec{x} \in \mathbb{R}^k\}} ||A\vec{x}||_2,$$
so the bounds are equivalent to
$$n(1-\epsilon) < ||A\vec{v}||_2^2 < n(1+\epsilon) \qquad \text{for every unit-norm } \vec{v} \in \mathbb{R}^k$$
SLIDE 105
Proof
Idea: use a union bound over all unit-norm vectors. Problem: there are infinitely many! Solution: use a union bound on a finite set, then show that this is enough
SLIDE 106
ǫ-net
An $\epsilon$-net of a set $\mathcal{X} \subseteq \mathbb{R}^k$ is a subset $\mathcal{N}_\epsilon \subseteq \mathcal{X}$ such that for every vector $\vec{x} \in \mathcal{X}$ there exists $\vec{y} \in \mathcal{N}_\epsilon$ for which $||\vec{x} - \vec{y}||_2 \leq \epsilon$. The covering number $N(\mathcal{X}, \epsilon)$ of a set $\mathcal{X}$ at scale $\epsilon$ is the minimal cardinality of an $\epsilon$-net of $\mathcal{X}$
SLIDE 107
ǫ-net
[Illustration of an $\epsilon$-net of a set]
SLIDE 108
Covering number of a sphere
The covering number of the sphere $\mathcal{S}^{k-1}$ at scale $\epsilon$ satisfies
$$N\left(\mathcal{S}^{k-1}, \epsilon\right) \leq \left(\frac{2+\epsilon}{\epsilon}\right)^k \leq \left(\frac{3}{\epsilon}\right)^k$$
SLIDE 109
Covering number of a sphere
◮ Initialize $\mathcal{N}_\epsilon$ to the empty set
◮ Choose a point $\vec{x} \in \mathcal{S}^{k-1}$ such that $||\vec{x} - \vec{y}||_2 > \epsilon$ for every $\vec{y} \in \mathcal{N}_\epsilon$
◮ Add $\vec{x}$ to $\mathcal{N}_\epsilon$ and repeat until every point in $\mathcal{S}^{k-1}$ is within $\epsilon$ of some point in $\mathcal{N}_\epsilon$
SLIDE 110
Covering number of a sphere
[Illustration: the balls of radius $\epsilon/2$ centered at the net points are disjoint and contained in a ball of radius $1 + \epsilon/2$]
SLIDE 111
Covering number of a sphere
$$\text{Vol}\left(\mathcal{B}^k_{1+\epsilon/2}\right) \geq \text{Vol}\left(\cup_{\vec{x} \in \mathcal{N}_\epsilon}\, \mathcal{B}^k_{\epsilon/2}(\vec{x})\right) = |\mathcal{N}_\epsilon|\, \text{Vol}\left(\mathcal{B}^k_{\epsilon/2}\right)$$
By multivariable calculus
$$\text{Vol}\left(\mathcal{B}^k_r\right) = r^k\, \text{Vol}\left(\mathcal{B}^k_1\right),$$
so we conclude
$$(1+\epsilon/2)^k \geq |\mathcal{N}_\epsilon|\,(\epsilon/2)^k$$
SLIDE 115
Proof
1. We prove the bounds
$$n(1-\epsilon_2) < ||A\vec{v}||_2^2 < n(1+\epsilon_2)$$
where $\epsilon_2 := \epsilon/2$, on an $\epsilon_1 := \epsilon/4$ net of the sphere
2. We show that, by the triangle inequality, this implies that the bounds hold on all of the sphere
SLIDE 116
Fixed vector
Let $A$ be an $a \times b$ matrix with iid standard Gaussian entries. For any $\vec{v} \in \mathbb{R}^b$ with unit norm and any $\epsilon \in (0,1)$
$$\sqrt{a(1-\epsilon)} \leq ||A\vec{v}||_2 \leq \sqrt{a(1+\epsilon)}$$
with probability at least $1 - 2\exp\left(-a\epsilon^2/8\right)$
SLIDE 117
Bound on the $\epsilon_1$-net
We define the event
$$E_{\vec{v},\epsilon_2} := \left\{ n(1-\epsilon_2)\,||\vec{v}||_2^2 \leq ||A\vec{v}||_2^2 \leq n(1+\epsilon_2)\,||\vec{v}||_2^2 \right\}$$
Then
$$P\left(\cup_{\vec{v} \in \mathcal{N}_{\epsilon_1}} E^c_{\vec{v},\epsilon_2}\right) \leq \sum_{\vec{v} \in \mathcal{N}_{\epsilon_1}} P\left(E^c_{\vec{v},\epsilon_2}\right) \leq |\mathcal{N}_{\epsilon_1}|\, P\left(E^c_{\vec{v},\epsilon_2}\right) \leq 2\left(\frac{12}{\epsilon}\right)^k \exp\left(-\frac{n\epsilon^2}{32}\right) \leq \frac{1}{k} \quad \text{if } n > \frac{64k}{\epsilon^2}\log\frac{12}{\epsilon}$$
SLIDE 122
Upper bound on the sphere
Let $\vec{x} \in \mathcal{S}^{k-1}$. There exists $\vec{v} \in \mathcal{N}_{\epsilon_1}$ such that $||\vec{x} - \vec{v}||_2 \leq \epsilon/4$, so
$$||A\vec{x}||_2 \leq ||A\vec{v}||_2 + ||A(\vec{x} - \vec{v})||_2 \leq \sqrt{n}\left(1 + \frac{\epsilon}{2}\right) + ||A(\vec{x} - \vec{v})||_2 \quad \text{assuming } E_{\vec{v},\epsilon_2} \text{ holds for every } \vec{v} \in \mathcal{N}_{\epsilon_1}$$
$$\leq \sqrt{n}\left(1 + \frac{\epsilon}{2}\right) + \sigma_1\,||\vec{x} - \vec{v}||_2 \leq \sqrt{n}\left(1 + \frac{\epsilon}{2}\right) + \frac{\sigma_1\epsilon}{4}$$
SLIDE 127
Upper bound on the sphere
Taking the maximum over $\vec{x} \in \mathcal{S}^{k-1}$,
$$\sigma_1 \leq \sqrt{n}\left(1 + \frac{\epsilon}{2}\right) + \frac{\sigma_1\epsilon}{4},$$
so
$$\sigma_1 \leq \sqrt{n}\,\frac{1 + \epsilon/2}{1 - \epsilon/4} = \sqrt{n}\left(1 + \epsilon - \frac{\epsilon(1-\epsilon)}{4-\epsilon}\right) \leq \sqrt{n}\,(1+\epsilon)$$
SLIDE 128
Lower bound on the sphere
$$||A\vec{x}||_2 \geq ||A\vec{v}||_2 - ||A(\vec{x} - \vec{v})||_2 \geq \sqrt{n}\left(1 - \frac{\epsilon}{2}\right) - ||A(\vec{x} - \vec{v})||_2 \quad \text{assuming } E_{\vec{v},\epsilon_2} \text{ holds for every } \vec{v} \in \mathcal{N}_{\epsilon_1}$$
$$\geq \sqrt{n}\left(1 - \frac{\epsilon}{2}\right) - \sigma_1\,||\vec{x} - \vec{v}||_2 \geq \sqrt{n}\left(1 - \frac{\epsilon}{2}\right) - \frac{\epsilon}{4}\sqrt{n}\,(1+\epsilon) \geq \sqrt{n}\,(1-\epsilon)$$
SLIDE 134
Gaussian random variables Gaussian random vectors Randomized projections SVD of a random matrix Randomized SVD
SLIDE 135
Fast SVD
For a matrix $M \in \mathbb{R}^{m \times n}$ which is approximately rank $k$:
1. Choose a small oversampling parameter $p$ (usually 5 or slightly larger)
2. Find a matrix $\tilde{U} \in \mathbb{R}^{m \times (k+p)}$ with $k+p$ orthonormal columns that approximately span the column space of $M$
3. Compute $W \in \mathbb{R}^{(k+p) \times n}$ defined by $W := \tilde{U}^T M$
4. Compute the SVD of $W = U_W S_W V_W^T$
5. Output $U := (\tilde{U}U_W)_{:,1:k}$, $S := (S_W)_{1:k,1:k}$, and $V := (V_W)_{:,1:k}$ as the SVD of $M$
SLIDE 136
Fast SVD
For a matrix $M \in \mathbb{R}^{m \times n}$ which is approximately rank $k$:
1. Choose a small oversampling parameter $p$ (usually 5 or slightly larger)
2. Find a matrix $\tilde{U} \in \mathbb{R}^{m \times (k+p)}$ with $k+p$ orthonormal columns that approximately span the column space of $M$
3. Compute $W \in \mathbb{R}^{(k+p) \times n}$ defined by $W := \tilde{U}^T M$ ($O(kmn)$ operations)
4. Compute the SVD of $W = U_W S_W V_W^T$ ($O(k^2 n)$ operations)
5. Output $U := (\tilde{U}U_W)_{:,1:k}$, $S := (S_W)_{1:k,1:k}$, and $V := (V_W)_{:,1:k}$ as the SVD of $M$
Complexity of the regular SVD is $O(mn\min\{m,n\})$
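A minimal NumPy sketch of steps 3 to 5, assuming an orthonormal basis $\tilde{U}$ for the approximate column space is already available (function and variable names are illustrative; obtaining $\tilde{U}$ with a randomized projection is described on the randomized column-space approximation slide below):

```python
import numpy as np

def fast_svd_given_basis(M, U_tilde, k):
    """Steps 3-5 of the fast SVD: M is m x n, U_tilde is m x (k+p) with orthonormal columns."""
    W = U_tilde.T @ M                                        # (k+p) x n, costs O(kmn)
    U_W, s_W, Vt_W = np.linalg.svd(W, full_matrices=False)   # SVD of the small matrix, O(k^2 n)
    U = U_tilde @ U_W[:, :k]                                 # m x k
    return U, s_W[:k], Vt_W[:k, :].T                         # U, singular values, V
```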
SLIDE 137
Fast SVD
[Schematic: $M$ ($m \times n$) is approximated by the rank-$k$ factorization $U_M S_M V_M^T$]
SLIDE 138
Fast SVD
[Schematic of $W = \tilde{U}^T M$ and the dimensions of the factors]
SLIDE 139
Fast SVD
The method works if (1) $M$ is rank $k$ and (2) $\tilde{U}$ spans the column space of $M$:
$$M = \tilde{U}\tilde{U}^T M = \tilde{U}W = \tilde{U}U_W S_W V_W^T$$
where $U := \tilde{U}U_W$ is an $m \times k$ matrix with orthonormal columns:
$$U^T U = U_W^T \tilde{U}^T \tilde{U} U_W = U_W^T U_W = I$$
SLIDE 146
Power iterations
For approximately low-rank matrices performance depends on gap between σk and σk+1 The gap can be increased by power iterations This method is only used when computing U The input is
- M :=
- MMTq
M
SLIDE 147
Power iterations
For approximately low-rank matrices performance depends on gap between σk and σk+1 The gap can be increased by power iterations This method is only used when computing U The input is
- M :=
- MMTq
M =
- UMS2
MUT M
q UMSMV T
M
SLIDE 148
Power iterations
For approximately low-rank matrices performance depends on gap between σk and σk+1 The gap can be increased by power iterations This method is only used when computing U The input is
- M :=
- MMTq
M =
- UMS2
MUT M
q UMSMV T
M
= UMS2
MUT MUMS2 MUT M · · · UMS2 MUT MUMV T M
= UMS2q+1
M
V T
M
SLIDE 149
Problem
How do we estimate the column space of a low-rank matrix?
◮ Project onto a random subspace with slightly larger dimension
◮ Select random columns
SLIDE 150
Randomized column-space approximation
For a matrix $M \in \mathbb{R}^{m \times n}$ which is approximately rank $k$:
1. Create an $n \times (k+p)$ iid standard Gaussian matrix $A$, where $p$ is a small integer (e.g. 5)
2. Compute the $m \times (k+p)$ matrix $B = MA$
3. Orthonormalize the columns of $B$ and output them as a matrix $\tilde{U} \in \mathbb{R}^{m \times (k+p)}$
4. Apply power iterations if necessary
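A self-contained sketch of this procedure (names and the synthetic test matrix are illustrative), including optional power iterations and the remaining fast-SVD steps:

```python
import numpy as np

def randomized_column_space(M, k, p=5, q=0, rng=None):
    """Randomized estimate of an orthonormal basis for the column space of M."""
    rng = np.random.default_rng() if rng is None else rng
    A = rng.standard_normal((M.shape[1], k + p))   # step 1: Gaussian test matrix
    B = M @ A                                      # step 2: m x (k+p) sample of the column space
    for _ in range(q):                             # step 4: optional power iterations, (M M^T)^q M A
        B = M @ (M.T @ B)
    U_tilde, _ = np.linalg.qr(B)                   # step 3: orthonormalize the columns
    return U_tilde

# Illustrative usage on a synthetic, approximately rank-10 matrix
rng = np.random.default_rng(0)
M = rng.standard_normal((500, 10)) @ rng.standard_normal((10, 400)) \
    + 0.01 * rng.standard_normal((500, 400))
U_tilde = randomized_column_space(M, k=10, q=2, rng=rng)

# Steps 3-5 of the fast SVD (as in the earlier sketch), using this basis
W = U_tilde.T @ M
U_W, s_W, Vt_W = np.linalg.svd(W, full_matrices=False)
print(s_W[:5])    # compare with np.linalg.svd(M, compute_uv=False)[:5]
```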
SLIDE 151
Randomized column-space approximation
$$B = MA = U_M S_M V_M^T A = U_M S_M C$$
◮ If $M$ is low rank, $C$ is a $k \times (k+p)$ iid standard Gaussian matrix
◮ Otherwise, $C$ is a $\min\{m,n\} \times (k+p)$ iid standard Gaussian matrix
SLIDE 154
Randomized SVD of a video
◮ Video with 160 frames of size 1080 × 1920
◮ We interpret each frame as a vector in $\mathbb{R}^{2{,}073{,}600}$
◮ The matrix formed by these vectors is approximately low rank
◮ Regular SVD takes 12 seconds (281.1 seconds if we take 691 frames)
◮ Fast SVD with the randomized-column-space estimate takes 5.8 seconds (10.4 seconds for 691 frames) to obtain a rank-10 approximation ($q = 2$, $p = 7$)
SLIDE 155
True singular values
[Plot of the true singular values of the video matrix]
SLIDE 156
Left singular vector approximation
[Figure: true vs. estimated left singular vectors]
SLIDE 157
Random column selection
For a matrix $M \in \mathbb{R}^{m \times n}$ which is approximately rank $k$:
1. Select a random subset of column indices $I := \{i_1, i_2, \ldots, i_{k'}\}$ with $k' \geq k$
2. Orthonormalize the columns of the submatrix corresponding to $I$,
$$M_I := \begin{bmatrix} M_{:,i_1} & M_{:,i_2} & \cdots & M_{:,i_{k'}} \end{bmatrix},$$
and output them as a matrix $\tilde{U} \in \mathbb{R}^{m \times k'}$
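A sketch of this alternative (illustrative only), producing a basis that can be plugged into the same fast-SVD steps as before:

```python
import numpy as np

def column_selection_basis(M, k_prime, rng=None):
    """Orthonormal basis from a random subset of k' columns of M."""
    rng = np.random.default_rng() if rng is None else rng
    idx = rng.choice(M.shape[1], size=k_prime, replace=False)  # random column indices
    U_tilde, _ = np.linalg.qr(M[:, idx])                       # orthonormalize the submatrix
    return U_tilde
```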
SLIDE 158
Random column selection
(Possible) Problem: if the right singular vectors are sparse, this will not work, since
$$M_I = U_M S_M \left((V_M)_I\right)^T$$
where $(V_M)_I$ contains the rows of $V_M$ indexed by $I$
SLIDE 159
Example
$$M := \begin{bmatrix} -3 & 2 & 2 & 2 \\ 3 & 2 & 2 & 2 \\ -3 & 2 & 2 & 2 \\ 3 & 2 & 2 & 2 \end{bmatrix}$$
SLIDE 160
Example
$$M = U_M S_M V_M^T = \begin{bmatrix} 0.5 & -0.5 \\ 0.5 & 0.5 \\ 0.5 & -0.5 \\ 0.5 & 0.5 \end{bmatrix} \begin{bmatrix} 6.9282 & 0 \\ 0 & 6 \end{bmatrix} \begin{bmatrix} 0 & 0.577 & 0.577 & 0.577 \\ 1 & 0 & 0 & 0 \end{bmatrix}$$
SLIDE 161
Example, I = {2, 3}
$$M_I = \begin{bmatrix} 2 & 2 \\ 2 & 2 \\ 2 & 2 \\ 2 & 2 \end{bmatrix} = \begin{bmatrix} 0.5 \\ 0.5 \\ 0.5 \\ 0.5 \end{bmatrix} \, 6.9282 \, \begin{bmatrix} 0.577 & 0.577 \end{bmatrix}$$
The selected columns only span the first left singular vector; the second (sparse) component is missed.
SLIDE 162
Randomized SVD of a video
◮ Video with 160 frames of size 1080 × 1920
◮ We interpret each frame as a vector in $\mathbb{R}^{2{,}073{,}600}$
◮ The matrix formed by these vectors is approximately low rank
◮ Regular SVD takes 12 seconds (281.1 seconds if we take 691 frames)
◮ Fast SVD with the random-column-selection estimate takes 5.2 seconds to obtain a rank-10 approximation ($k' = 17$)
SLIDE 163
Left singular vector approximation
[Figure: true vs. estimated left singular vectors]
SLIDE 164
Singular value approximation
[Plot of the first singular values of the video matrix and their approximations]