Randomization
DS-GA 1013 / MATH-GA 2824 Mathematical Tools for Data Science
https://cims.nyu.edu/~cfgranda/pages/MTDS_spring19/index.html
Carlos Fernandez-Granda

Motivating applications · Gaussian random variables · Randomized dimensionality reduction · Compressed sensing
Dimensionality reduction

Data with a large number of features can be difficult to analyze. We model the data as vectors in R^p (p very large). Aim: reduce the dimensionality of the representation. The SVD provides the optimal subspace for dimensionality reduction, but it is computationally expensive and requires seeing the whole dataset beforehand. What if we instead compute inner products with a few random vectors?
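As a sketch of the idea (synthetic placeholder data; the variable names are ours, not the slides'), reducing dimensionality with random vectors is a single matrix multiply, with no SVD and no need to see the whole dataset in advance:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, k = 1000, 200, 2            # ambient dimension, number of points, target dimension
X = rng.normal(size=(n, d))       # placeholder dataset, one point per row

A = rng.normal(size=(k, d))       # random Gaussian map from R^d to R^k
Y = X @ A.T                       # k-dimensional representation of each point
print(Y.shape)                    # (200, 2)
```

Each row of `Y` is the vector of inner products of the corresponding data point with the k random vectors.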
Dimensionality reduction for visualization

Motivation: visualize high-dimensional features projected onto two or three dimensions.
Example: seeds from three different varieties of wheat: Kama, Rosa, and Canadian. Features:
◮ Area
◮ Perimeter
◮ Compactness
◮ Length of kernel
◮ Width of kernel
◮ Asymmetry coefficient
◮ Length of kernel groove
Projection onto the first two PDs

[Figure: seed data projected onto the first principal component (x-axis) and second principal component (y-axis).]

Projection onto two random vectors

[Figures: three independent projections of the seed data onto pairs of random vectors.]
Compressed sensing in MRI
Important goal in MRI: reduce scan time Can be achieved by measuring less frequency coefficients What happens if we undersample in the Fourier domain?
SLIDE 10
MR image

[Figure: MR image over t1, t2 ∈ [0, 25] cm, intensity scale 0 to 4.]

Fourier coefficients

[Figure: magnitude of the 2D Fourier coefficients over k1, k2 ∈ [−3, 3] 1/cm, logarithmic color scale.]
x2 regular undersampling

[Figure: Fourier coefficients retained under x2 regular undersampling, k1, k2 ∈ [−3, 3] 1/cm, logarithmic color scale.]

Recovered image

[Figure: image recovered from the regularly undersampled coefficients, t1, t2 ∈ [0, 25] cm.]

x2 random undersampling

[Figure: Fourier coefficients retained under x2 random undersampling, k1, k2 ∈ [−3, 3] 1/cm, logarithmic color scale.]

Recovered image

[Figure: image recovered from the randomly undersampled coefficients, t1, t2 ∈ [0, 25] cm.]
Motivating applications · Gaussian random variables · Randomized dimensionality reduction · Compressed sensing

Gaussian random variables
The pdf of a Gaussian or normal random variable with mean µ and standard deviation σ is given by

f_X(x) = (1/(√(2π) σ)) exp(−(x − µ)²/(2σ²))

A standard Gaussian has µ := 0 and σ := 1
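As a quick sanity check (ours, not part of the slides; scipy assumed available only as a reference), a direct implementation of the pdf formula matches `scipy.stats.norm`:

```python
import numpy as np
from scipy.stats import norm

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    # Direct implementation of f_X(x) = exp(-(x - mu)^2 / (2 sigma^2)) / (sqrt(2 pi) sigma)
    return np.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

x = np.linspace(-4.0, 8.0, 101)
print(np.allclose(gaussian_pdf(x, mu=2.0, sigma=1.0), norm.pdf(x, loc=2.0, scale=1.0)))  # True
```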
Gaussian random variables

[Figure: Gaussian pdfs f_X(x) for (µ = 2, σ = 1), (µ = 0, σ = 2), and (µ = 0, σ = 4) over x ∈ [−10, 10].]
Linear transformation of Gaussian

If x is a Gaussian random variable with mean µ and standard deviation σ, then for any a, b ∈ R

y := a x + b

is a Gaussian random variable with mean a µ + b and standard deviation |a| σ
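A simulation check (ours, not from the slides) of this fact, with hypothetical values of µ, σ, a, b:

```python
import numpy as np

# y = a x + b of a Gaussian x should have mean a*mu + b and standard deviation |a|*sigma.
rng = np.random.default_rng(1)
mu, sigma, a, b = 2.0, 3.0, -0.5, 1.0
y = a * rng.normal(mu, sigma, size=1_000_000) + b
print(y.mean())   # close to a*mu + b = 0.0
print(y.std())    # close to |a|*sigma = 1.5
```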
Proof

Let a > 0 (the proof for a < 0 is very similar). The cdf of y is

F_y(y) = P(y ≤ y)
       = P(a x + b ≤ y)
       = P(x ≤ (y − b)/a)
       = ∫_{−∞}^{(y−b)/a} (1/(√(2π) σ)) e^{−(x−µ)²/(2σ²)} dx
       = ∫_{−∞}^{y} (1/(√(2π) a σ)) e^{−(w−aµ−b)²/(2a²σ²)} dw   (change of variables w = a x + b)

Differentiating with respect to y:

f_y(y) = (1/(√(2π) a σ)) e^{−(y−aµ−b)²/(2a²σ²)}
Gaussian random vector

A Gaussian random vector x is a random vector with joint pdf

f_x(x) = (1/√((2π)^d |Σ|)) exp(−(1/2) (x − µ)^T Σ^{−1} (x − µ))

where µ ∈ R^d is the mean and Σ ∈ R^{d×d} the covariance matrix. A standard Gaussian vector has µ := 0 and Σ := I
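A direct implementation of this joint pdf (ours; scipy assumed available only as a reference) matches `scipy.stats.multivariate_normal` on a hypothetical example:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_vector_pdf(x, mu, Sigma):
    # Direct implementation of the joint pdf stated above.
    d = len(mu)
    diff = x - mu
    norm_const = np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm_const

mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
x = np.array([0.3, 0.7])
print(gaussian_vector_pdf(x, mu, Sigma))
print(multivariate_normal.pdf(x, mean=mu, cov=Sigma))   # same value
```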
Uncorrelation implies independence

If the covariance matrix is diagonal,

Σ_x = diag(σ_1², σ_2², ..., σ_d²),

then the entries of x are independent
Proof

For a diagonal covariance matrix,

Σ_x^{−1} = diag(1/σ_1², ..., 1/σ_d²),   |Σ| = Π_{i=1}^d σ_i²

so the joint pdf factors:

f_x(x) = (1/√((2π)^d |Σ|)) exp(−(1/2) (x − µ)^T Σ^{−1} (x − µ))
       = Π_{i=1}^d (1/(√(2π) σ_i)) exp(−(x_i − µ_i)²/(2σ_i²))
       = Π_{i=1}^d f_{x_i}(x_i)
Linear transformations

Let x be a Gaussian random vector of dimension d with mean µ and covariance matrix Σ. For any matrix A ∈ R^{m×d} and b ∈ R^m,

y = A x + b

is Gaussian with mean A µ + b and covariance matrix A Σ A^T (as long as A Σ A^T is full rank). This is why the Fourier and wavelet coefficients of Gaussian noise are also Gaussian noise
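A simulation check (ours) that the covariance of y = A x + b matches A Σ A^T, using a standard Gaussian input (Σ = I) for simplicity:

```python
import numpy as np

rng = np.random.default_rng(2)
d, m, n = 4, 2, 500_000
A = rng.normal(size=(m, d))
b = rng.normal(size=m)
x = rng.normal(size=(n, d))             # rows are iid standard Gaussian samples, Sigma = I
y = x @ A.T + b
emp_cov = np.cov(y, rowvar=False)       # empirical covariance of y
print(np.abs(emp_cov - A @ A.T).max())  # small: emp_cov approximates A Sigma A^T = A A^T
```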
Subvectors are also Gaussian

[Figure: joint pdf f_x(x, y) of a two-dimensional Gaussian vector together with its marginals f_{x[1]}(x) and f_{x[2]}(y).]
Audio data

[Figure: audio waveform between 4.0 s and 4.5 s, amplitude roughly ±7500.]

DFT

[Figure: magnitude of the DFT coefficients over frequencies up to 4000 Hz, logarithmic scale.]

Noisy image

[Figure: image corrupted by Gaussian noise.]

Wavelet coefficients

[Figure: wavelet coefficients of the noisy image.]
Direction of iid standard Gaussian vectors

If the covariance matrix of a Gaussian vector x is I, then x is isotropic: it does not favor any direction. For any orthogonal matrix U, U x has the same distribution as x, Gaussian with mean U 0 = 0 and covariance matrix U I U^T = U U^T = I
Magnitude of iid standard Gaussian vectors

In low dimensions the joint pdf is mostly concentrated around the origin. What about in high dimensions?

ℓ2 norm of samples

[Figure: ℓ2 norm of standard Gaussian samples against dimension, log-log axes; the norm grows with the dimension.]
χ2 random variable

A χ2 (chi-squared) random variable with d degrees of freedom is defined as

y := Σ_{i=1}^d x_i²

where x_1, ..., x_d are independent standard Gaussians. It equals the squared ℓ2 norm of a d-dimensional standard Gaussian vector
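A simulation sketch (ours; scipy assumed available only as a reference) confirming that the squared norm of a d-dimensional standard Gaussian vector behaves like a χ2 variable with d degrees of freedom:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(3)
d, n = 10, 200_000
y = (rng.normal(size=(n, d)) ** 2).sum(axis=1)   # squared norms of n Gaussian vectors
print(y.mean(), chi2.mean(d))    # both close to d = 10
print(y.var(), chi2.var(d))      # both close to 2d = 20
```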
Squared ℓ2 norm divided by d

[Figure: pdf of y/d for d = 10, 20, 50, 100; the distribution concentrates around 1 as d grows.]
Mean

E(||x||_2²) = E(Σ_{i=1}^d x[i]²) = Σ_{i=1}^d E(x[i]²) = d
Variance

E((||x||_2²)²) = E((Σ_{i=1}^d x[i]²)²)
             = E(Σ_{i=1}^d Σ_{j=1}^d x[i]² x[j]²)
             = Σ_{i=1}^d Σ_{j=1}^d E(x[i]² x[j]²)
             = Σ_{i=1}^d E(x[i]⁴) + 2 Σ_{i=1}^{d−1} Σ_{j=i+1}^d E(x[i]²) E(x[j]²)
             = 3d + d(d − 1)   (the 4th moment of a standard Gaussian equals 3)
             = d(d + 2)

Therefore

Var(||x||_2²) = E((||x||_2²)²) − E(||x||_2²)² = d(d + 2) − d² = 2d

The relative standard deviation around the mean scales as √(2/d). Geometrically, the probability density concentrates close to the surface of a sphere with radius √d
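A simulation sketch (ours) of this concentration: the norm of a standard Gaussian vector hugs √d more and more tightly as d grows.

```python
import numpy as np

rng = np.random.default_rng(4)
for d in (10, 100, 1000, 10_000):
    # Ratio of the norm to sqrt(d) over 2000 independent samples.
    ratios = np.linalg.norm(rng.normal(size=(2000, d)), axis=1) / np.sqrt(d)
    print(d, ratios.mean(), ratios.std())   # mean tends to 1, spread shrinks with d
```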
Non-asymptotic tail bound

Let x be an iid standard Gaussian random vector of dimension d. For any ǫ > 0,

P(d(1 − ǫ) < ||x||_2² < d(1 + ǫ)) ≥ 1 − 2/(dǫ²)
Markov's inequality

Let x be a nonnegative random variable. For any constant a > 0,

P(x ≥ a) ≤ E(x)/a

Proof

Define the indicator variable 1_{x ≥ a}. Since x is nonnegative, x − a 1_{x ≥ a} ≥ 0, so

E(x) ≥ a E(1_{x ≥ a}) = a P(x ≥ a)
Chebyshev bound

Let y := ||x||_2². Then

P(|y − d| ≥ dǫ) = P((y − E(y))² ≥ d²ǫ²)
               ≤ E((y − E(y))²)/(d²ǫ²)   (by Markov's inequality)
               = Var(y)/(d²ǫ²)
               = 2/(dǫ²)
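A simulation sketch (ours) comparing the empirical tail probability with the Chebyshev bound 2/(dǫ²), which is loose but fully general:

```python
import numpy as np

rng = np.random.default_rng(5)
d, eps, n = 100, 0.3, 100_000
y = (rng.normal(size=(n, d)) ** 2).sum(axis=1)     # squared norms
empirical = np.mean(np.abs(y - d) >= d * eps)      # empirical tail probability
bound = 2 / (d * eps ** 2)                         # Chebyshev bound
print(empirical, bound)                            # empirical is far below the bound
```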
Non-asymptotic Chernoff tail bound

Let x be an iid standard Gaussian random vector of dimension d. For any ǫ > 0,

P(d(1 − ǫ) < ||x||_2² < d(1 + ǫ)) ≥ 1 − 2 exp(−dǫ²/8)
Proof

Let y := ||x||_2². The result is implied by

P(y > d(1 + ǫ)) ≤ exp(−dǫ²/8),   P(y < d(1 − ǫ)) ≤ exp(−dǫ²/8)
Proof

Fix t > 0. Then

P(y > a) = P(exp(ty) > exp(at))
         ≤ exp(−at) E(exp(ty))   (by Markov's inequality)
         = exp(−at) E(exp(Σ_{i=1}^d t x_i²))
         = exp(−at) Π_{i=1}^d E(exp(t x_i²))   (by independence of x_1, ..., x_d)
Proof

Lemma (by direct integration): for t < 1/2,

E(exp(t x²)) = 1/√(1 − 2t)

This is equivalent to controlling all higher-order moments, since

E(exp(t x²)) = E(Σ_{i=0}^∞ (t x²)^i / i!) = Σ_{i=0}^∞ t^i E(x^{2i}) / i!
Proof

For t ∈ (0, 1/2), applying the lemma,

P(y > a) ≤ exp(−at) Π_{i=1}^d E(exp(t x_i²)) = exp(−at) (1 − 2t)^{−d/2}
Proof

Setting a := d(1 + ǫ) and t := 1/2 − 1/(2(1 + ǫ)), we conclude

P(y > d(1 + ǫ)) ≤ (1 + ǫ)^{d/2} exp(−dǫ/2) ≤ exp(−dǫ²/8)
Projection onto a fixed subspace

The probability density of x is isotropic and has total variance d. The projection onto a fixed k-dimensional subspace should capture a fraction k/d of the variance, so the variance of the projection should be k
Projection onto a fixed subspace

Let S be a k-dimensional subspace of R^d spanned by the orthonormal columns of U, and let x be a d-dimensional standard Gaussian vector. P_S(x) = U U^T x is not a Gaussian vector under the definition above: its covariance

Σ_{P_S(x)} = U U^T Σ_x U U^T = U U^T

is not full rank, so the joint pdf is undefined
Projection onto a fixed subspace

The coefficients U^T x, however, are a k-dimensional Gaussian vector with covariance

Σ_{U^T x} = U^T Σ_x U = U^T U = I

We have

||P_S(x)||_2² = (U U^T x)^T U U^T x = ||U^T x||_2²

so for any ǫ > 0

P(k(1 − ǫ) < ||P_S(x)||_2² < k(1 + ǫ)) ≥ 1 − 2 exp(−kǫ²/8)
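A simulation sketch (ours): the squared norm of the projection of a standard Gaussian vector onto a fixed k-dimensional subspace indeed concentrates around k.

```python
import numpy as np

rng = np.random.default_rng(6)
d, k, n = 200, 20, 50_000
U, _ = np.linalg.qr(rng.normal(size=(d, k)))   # orthonormal basis of a fixed subspace
x = rng.normal(size=(n, d))
proj_sq = ((x @ U) ** 2).sum(axis=1)           # ||U^T x||^2 = ||P_S(x)||^2 per sample
print(proj_sq.mean())                          # close to k = 20
```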
Linear regression

To analyze the performance of the least-squares estimator we assume a linear model with additive iid Gaussian noise:

y_train := X_train β_true + z_train

The LS estimator equals

β_LS := arg min_β ||y_train − X_train β||_2
Training error

The training error is the projection of the noise onto the orthogonal complement of the column space of X_train:

y_train − y_LS = P_{col(X_train)^⊥} z_train

The dimension of the orthogonal complement of col(X_train) equals n − p, so

Training RMSE := √(||y_train − y_LS||_2² / n) ≈ σ √(1 − p/n)
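A simulation sketch (ours, with hypothetical n, p, σ): under the linear model with iid Gaussian noise, the training RMSE of least squares is close to σ √(1 − p/n).

```python
import numpy as np

rng = np.random.default_rng(7)
n, p, sigma = 2000, 200, 1.0
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + sigma * rng.normal(size=n)   # y = X beta_true + noise

beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)           # least-squares fit
rmse = np.sqrt(np.mean((y - X @ beta_ls) ** 2))
print(rmse, sigma * np.sqrt(1 - p / n))                   # the two values are close
```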
Temperature prediction via linear regression

[Figure: average error in degrees Celsius against number of training data (200 to 5000), comparing the training and test errors with the curves σ√(1 − p/n) and σ√(1 + p/n).]
Motivating applications · Gaussian random variables · Randomized dimensionality reduction · Compressed sensing

Randomized linear maps

We use Gaussian matrices as randomized linear maps from R^d to R^k, k < d. Each entry is sampled independently from a standard Gaussian. Question: do we preserve the distances between the points in a set? Equivalently, are any of the fixed vectors in the null space?
Fixed vector

Let A be a k × d matrix with iid standard Gaussian entries. If v ∈ R^d is a deterministic vector with unit ℓ2 norm, then A v is a k-dimensional standard Gaussian vector.

Proof: (A v)[i], 1 ≤ i ≤ k, is Gaussian with mean zero and variance

Var(A_{i,:}^T v) = v^T Σ_{A_{i,:}} v = v^T I v = ||v||_2² = 1

The rows A_{i,:}, 1 ≤ i ≤ k, are all independent, so the entries of A v are independent
Non-asymptotic Chernoff tail bound (recall)

Let x be an iid standard Gaussian random vector of dimension k. For any ǫ > 0,

P(k(1 − ǫ) < ||x||_2² < k(1 + ǫ)) ≥ 1 − 2 exp(−kǫ²/8)

Fixed vector

Let A be a k × d matrix with iid standard Gaussian entries. For any v ∈ R^d with unit norm and any ǫ ∈ (0, 1),

√(1 − ǫ) ≤ ||(1/√k) A v||_2 ≤ √(1 + ǫ)

with probability at least 1 − 2 exp(−kǫ²/8)
Distance between two vectors

The result implies that if we fix two vectors x_1 and x_2 and define y := x_2 − x_1, then

√(1 − ǫ) ||y||_2 ≤ ||(1/√k) A y||_2 ≤ √(1 + ǫ) ||y||_2

with high probability (just set v := y / ||y||_2). What about the distances between a whole set of vectors?
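A simulation sketch (ours; `scipy.spatial.distance.pdist` assumed available) of distance preservation for a whole set of points under a scaled Gaussian map:

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(8)
d, k, p = 10_000, 1000, 50
X = rng.normal(size=(p, d))                  # p fixed points in R^d
A = rng.normal(size=(k, d)) / np.sqrt(k)     # scaled k x d Gaussian map
ratios = pdist(X @ A.T) / pdist(X)           # projected / original pairwise distances
print(ratios.min(), ratios.max())            # all ratios close to 1
```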
Johnson-Lindenstrauss lemma

Let A be a k × d matrix with iid standard Gaussian entries, and let x_1, ..., x_p ∈ R^d be any fixed set of p deterministic vectors. For every pair x_i, x_j and any ǫ ∈ (0, 1),

(1 − ǫ) ||x_i − x_j||_2² ≤ ||(1/√k) A x_i − (1/√k) A x_j||_2² ≤ (1 + ǫ) ||x_i − x_j||_2²

with probability at least 1/p as long as

k ≥ 16 log(p) / ǫ²

No dependence on d!
Proof

Aim: control the action of A on the normalized differences

v_ij := (x_i − x_j) / ||x_i − x_j||_2

Our event of interest is the intersection of the events

E_ij := { k(1 − ǫ) < ||A v_ij||_2² < k(1 + ǫ) },   1 ≤ i < j ≤ p

By the fixed-vector result, each complement satisfies

P(E_ij^c) ≤ 2/p²   if k ≥ 16 log(p) / ǫ²
Union bound

For any events S_1, S_2, ..., S_n in a probability space,

P(∪_i S_i) ≤ Σ_{i=1}^n P(S_i)
Proof

The number of events E_ij is C(p, 2) = p(p − 1)/2. By the union bound,

P(∩_{i,j} E_ij) = 1 − P(∪_{i,j} E_ij^c)
               ≥ 1 − Σ_{i,j} P(E_ij^c)
               ≥ 1 − (p(p − 1)/2) · (2/p²)
               ≥ 1/p
Nearest-neighbor classification

Training set of points and labels {x_1, l_1}, ..., {x_n, l_n}. To classify a new data point y ∈ R^d, find

i* := arg min_{1 ≤ i ≤ n} ||y − x_i||_2

and assign l_{i*} to y. Cost: O(dnp) to classify p new points
Nearest neighbors in random subspace

Use a k × d iid standard Gaussian matrix to project onto a k-dimensional space. Cost:
◮ dkn operations to project the training set
◮ dkp operations to project the test set
◮ knp operations to perform nearest-neighbor classification
Much faster!
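A toy sketch of the pipeline (synthetic stand-in data; all names and parameters are ours, not the actual face dataset): project both training and test points with a random Gaussian map, then classify each test point by its nearest projected training point.

```python
import numpy as np

rng = np.random.default_rng(9)
d, k, n_train, n_test = 4096, 50, 360, 40
X_train = rng.normal(size=(n_train, d))
labels = np.arange(n_train) % 40                                  # 40 synthetic "subjects"
X_test = X_train[:n_test] + 0.1 * rng.normal(size=(n_test, d))    # perturbed copies

A = rng.normal(size=(k, d))                      # random Gaussian map
P_train, P_test = X_train @ A.T, X_test @ A.T    # dkn + dkp operations
dists = np.linalg.norm(P_test[:, None, :] - P_train[None, :, :], axis=2)  # knp distances
predicted = labels[dists.argmin(axis=1)]
print((predicted == labels[:n_test]).mean())     # high accuracy on this toy task
```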
Face recognition

Training set: 360 images of size 64 × 64 from 40 different subjects (9 each). Test set: 1 new image from each subject. We model each image as a vector in R^4096 (d = 4096). To classify we:
1. Project onto a random k-dimensional subspace
2. Apply nearest-neighbor classification using the ℓ2-norm distance in R^k

Performance

[Figure: number of classification errors (average, maximum, and minimum) against projection dimension k from 20 to 200.]

Nearest neighbor in R^50

[Figure: test image, its projection, the closest projection, and the corresponding training image.]
Motivating applications · Gaussian random variables · Randomized dimensionality reduction · Compressed sensing

Compressed sensing

Goal: recover signals from a small number of measurements. An arbitrary vector of dimension d cannot be recovered from m < d linear measurements. However, signals of interest are highly structured; for example, images are sparse in a wavelet basis. If the signal is parametrized by s < m parameters, recovery may be possible. We focus on a simplified problem: recovering sparse vectors
MR image and undersampling experiments (recall)

[Figures, repeated from the motivating application: the MR image, its Fourier coefficients, the coefficients kept under x2 regular and x2 random undersampling, and the corresponding recovered images.]
DFT regular undersampling

[Figures: two length-1000 sparse signals ("Signal 1", "Signal 2") and the corresponding vectors recovered from regularly undersampled DFT coefficients.]

DFT random undersampling

[Figures: the vectors recovered from randomly undersampled DFT coefficients.]

Gaussian measurements

[Figures: the vectors recovered from iid Gaussian measurements.]
Restricted-isometry property

Different sparse vectors should never produce similar data: if two s-sparse vectors x_1, x_2 are far apart, then A x_1, A x_2 should be far apart. The measurement operator should preserve distances (be an isometry) when restricted to act upon sparse vectors.

A satisfies the restricted isometry property (RIP) with constant ǫ if

(1 − ǫ) ||x||_2 ≤ ||A x||_2 ≤ (1 + ǫ) ||x||_2

for any s-sparse vector x. If A satisfies the RIP for sparsity level 2s, then for any s-sparse x_1, x_2 (whose difference is 2s-sparse)

||A x_2 − A x_1||_2 = ||A (x_2 − x_1)||_2 ≥ (1 − ǫ) ||x_2 − x_1||_2
Restricted-isometry property

Deterministic matrices tend not to satisfy the RIP, and it is NP-hard to check whether the spark or RIP conditions hold. Random matrices satisfy the RIP with high probability. We prove it for Gaussian iid matrices; the ideas in the proof for random Fourier matrices are similar
Restricted-isometry property for Gaussian matrices

Let A ∈ R^{m×d} be a random matrix with iid standard Gaussian entries. Then (1/√m) A satisfies the RIP for a constant ǫ with probability 1 − C_2/d as long as

m ≥ (C_1 s / ǫ²) log(d/s)

for two fixed constants C_1, C_2 > 0. The number of measurements is proportional to the sparsity (up to a log factor)
Singular values of submatrix

Fix a subset of s indices T ⊂ {1, ..., d}. Any matrix A ∈ R^{m×d}, m < d, satisfies

σ_s(A_T) ≤ ||A x||_2 ≤ σ_1(A_T)

for all unit-norm vectors x ∈ R^d with support restricted to T, where A_T is the m × s submatrix of A containing the columns indexed by T, and σ_1(A_T) and σ_s(A_T) are the largest and smallest singular values of A_T.

Proof

For any vector x ∈ R^d with support restricted to T, A x = A_T x_T, where x_T ∈ R^s is the subvector of x containing its nonzero entries.

Proof strategy

Control the singular values of a fixed submatrix, then apply the union bound to extend the bounds to all submatrices
Singular values of m × s Gaussian matrices

[Figures: normalized singular values σ_i/√m for s = 100 and s = 1000, for aspect ratios m/s ∈ {2, 5, 10, 20, 50, 100, 200}; the singular values cluster around 1 as m/s grows.]
Singular values of a Gaussian matrix

For large enough m,

M ≈ U (√m I) V^T = √m U V^T,

i.e. standard Gaussian vectors in high dimensions are almost orthogonal.

Let M be an m × s matrix with iid standard Gaussian entries such that m > s. For any fixed ǫ > 0, the singular values of M satisfy

√m (1 − ǫ) ≤ σ_s ≤ σ_1 ≤ √m (1 + ǫ)

with probability at least 1 − 2 (12/ǫ)^s exp(−mǫ²/32)
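A simulation sketch (ours): the singular values of a tall iid Gaussian matrix cluster around √m once m is much larger than s.

```python
import numpy as np

rng = np.random.default_rng(10)
m, s = 20_000, 100
M = rng.normal(size=(m, s))
sv = np.linalg.svd(M, compute_uv=False) / np.sqrt(m)   # normalized singular values
print(sv.min(), sv.max())                              # both close to 1
```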
Union bound (recall)

For any events S_1, ..., S_n in a probability space, P(∪_i S_i) ≤ Σ_{i=1}^n P(S_i)
Proof

The number of different supports of size s is

C(d, s) ≤ (ed/s)^s

By the union bound,

√(1 − ǫ) ||x||_2 ≤ (1/√m) ||A x||_2 ≤ √(1 + ǫ) ||x||_2

holds for every s-sparse vector x with probability at least

1 − 2 (ed/s)^s (12/ǫ)^s exp(−mǫ²/32)
  = 1 − exp(log 2 + s + s log(d/s) + s log(12/ǫ) − mǫ²/32)
  ≥ 1 − C_2/d

as long as m ≥ (C_1 s / ǫ²) log(d/s)
Singular values of a Gaussian matrix

How do we prove the bound

√m (1 − ǫ) ≤ σ_s ≤ σ_1 ≤ √m (1 + ǫ)

with probability at least 1 − 2 (12/ǫ)^s exp(−mǫ²/32)?

More of the same?

We need to prove that for every vector v on the sphere S^{s−1} ⊂ R^s,

√m (1 − ǫ) < ||M v||_2 < √m (1 + ǫ)

Can we prove it for a fixed vector and use the union bound? Not directly: the sphere contains infinitely many vectors
Proof strategy

1. Consider a spread-out finite subset N_ǫ ⊂ S^{s−1} such that any point in S^{s−1} is close to a point in N_ǫ
2. Prove the bound on N_ǫ
3. Show that the bounds hold for all points that are close to N_ǫ
ǫ-net

An ǫ-net of a set X ⊆ R^s is a subset N_ǫ ⊆ X such that for every vector x ∈ X there exists y ∈ N_ǫ for which ||x − y||_2 ≤ ǫ. The covering number N(X, ǫ) of a set X at scale ǫ is the minimal cardinality of an ǫ-net of X

[Figure: an ǫ-net of a two-dimensional set.]
Covering number of a sphere

The covering number of the s-dimensional sphere S^{s−1} at scale ǫ satisfies

N(S^{s−1}, ǫ) ≤ ((2 + ǫ)/ǫ)^s ≤ (3/ǫ)^s

Construction of the net:
◮ Initialize N_ǫ to the empty set
◮ Choose a point x ∈ S^{s−1} such that ||x − y||_2 > ǫ for every y ∈ N_ǫ
◮ Add x to N_ǫ, and repeat until no point of S^{s−1} is more than ǫ away from every point in N_ǫ
Covering number of a sphere

[Figure: balls of radius ǫ/2 centered at the net points, contained in the ball of radius 1 + ǫ/2.]

The balls of radius ǫ/2 centered at the points of N_ǫ are disjoint and contained in the ball of radius 1 + ǫ/2, so

Vol(B^s_{1+ǫ/2}) ≥ Vol(∪_{x ∈ N_ǫ} B^s_{ǫ/2}(x)) = |N_ǫ| Vol(B^s_{ǫ/2})

By multivariable calculus, Vol(B^s_r) = r^s Vol(B^s_1), so we conclude

(1 + ǫ/2)^s ≥ |N_ǫ| (ǫ/2)^s
Proof

1. We prove the bounds

m (1 − ǫ_2) < ||M v||_2² < m (1 + ǫ_2),

where ǫ_2 := ǫ/2, on an ǫ_1 := ǫ/4 net of the sphere
2. We show that, by the triangle inequality, this implies that the bounds hold on all of the sphere
Fixed vector

Let M be an a × b matrix with iid standard Gaussian entries. For any v ∈ R^b with unit norm and any ǫ ∈ (0, 1),

√(a(1 − ǫ)) ≤ ||M v||_2 ≤ √(a(1 + ǫ))

with probability at least 1 − 2 exp(−aǫ²/8)
Bound on the ǫ_1-net

We define the event (with ǫ_1 := ǫ/4 and ǫ_2 := ǫ/2)

E_{v,ǫ_2} := { m (1 − ǫ_2) ||v||_2² ≤ ||M v||_2² ≤ m (1 + ǫ_2) ||v||_2² }

By the union bound and the fixed-vector result,

P(∪_{v ∈ N_{ǫ_1}} E^c_{v,ǫ_2}) ≤ Σ_{v ∈ N_{ǫ_1}} P(E^c_{v,ǫ_2})
                             ≤ |N_{ǫ_1}| max_v P(E^c_{v,ǫ_2})
                             ≤ 2 (12/ǫ)^s exp(−mǫ²/32)
Upper bound on the sphere

Let x ∈ S^{s−1}. There exists v ∈ N_{ǫ_1} such that ||x − v||_2 ≤ ǫ/4. Assuming the events E_{v,ǫ_2} hold for every v ∈ N_{ǫ_1},

||M x||_2 ≤ ||M v||_2 + ||M (x − v)||_2
          ≤ √m (1 + ǫ/2) + ||M (x − v)||_2
          ≤ √m (1 + ǫ/2) + σ_1 ||x − v||_2
          ≤ √m (1 + ǫ/2) + σ_1 ǫ/4

Taking the supremum over x ∈ S^{s−1},

σ_1 ≤ √m (1 + ǫ/2) + σ_1 ǫ/4

so

σ_1 ≤ √m (1 + ǫ/2)/(1 − ǫ/4) = √m (1 + ǫ − ǫ(1 − ǫ)/(4 − ǫ)) ≤ √m (1 + ǫ)
Lower bound on the sphere

Again assuming the events E_{v,ǫ_2} hold for every v ∈ N_{ǫ_1},

||M x||_2 ≥ ||M v||_2 − ||M (x − v)||_2
          ≥ √m (1 − ǫ/2) − ||M (x − v)||_2
          ≥ √m (1 − ǫ/2) − σ_1 ||x − v||_2
          ≥ √m (1 − ǫ/2) − (ǫ/4) √m (1 + ǫ)
          ≥ √m (1 − ǫ)

since ǫ/2 + ǫ(1 + ǫ)/4 ≤ ǫ for ǫ ∈ (0, 1). This holds for every x ∈ S^{s−1}, so σ_s ≥ √m (1 − ǫ)