Randomization: DS-GA 1013 / MATH-GA 2824 Mathematical Tools for Data Science


SLIDE 1

Randomization

DS-GA 1013 / MATH-GA 2824 Mathematical Tools for Data Science

https://cims.nyu.edu/~cfgranda/pages/MTDS_spring19/index.html
Carlos Fernandez-Granda

SLIDE 2

Motivating applications
Gaussian random variables
Randomized dimensionality reduction
Compressed sensing

SLIDE 3

Dimensionality reduction

Data with a large number of features can be difficult to analyze.
Data are modeled as vectors in $\mathbb{R}^p$ ($p$ very large).
Aim: reduce the dimensionality of the representation.
The SVD provides the optimal subspace for dimensionality reduction.
Problem: it is computationally expensive, and we must see the whole dataset beforehand.
What if we instead compute inner products with some random vectors? (A sketch of this idea follows below.)
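A minimal NumPy sketch of this idea, not part of the slides; the data matrix X below is a random placeholder standing in for a real dataset:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 500, 10_000, 2           # n points in R^p, reduced to R^k
X = rng.standard_normal((n, p))    # placeholder dataset (rows are data vectors)

# Inner products with k random Gaussian directions give the reduced representation
W = rng.standard_normal((p, k))
X_reduced = X @ W / np.sqrt(k)     # 1/sqrt(k) scaling preserves norms on average
print(X_reduced.shape)             # (500, 2)
```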

SLIDE 4

Dimensionality reduction for visualization

Motivation: visualize high-dimensional features projected onto 2D or 3D.
Example: seeds from three different varieties of wheat: Kama, Rosa and Canadian.
Features:

◮ Area
◮ Perimeter
◮ Compactness
◮ Length of kernel
◮ Width of kernel
◮ Asymmetry coefficient
◮ Length of kernel groove

SLIDE 5

Projection onto the first two PCs

[Scatter plot: seed data projected onto the first and second principal components]

SLIDE 6

Projection onto two random vectors

[Scatter plot: seed data projected onto two random vectors]

SLIDE 7

Projection onto two random vectors

[Scatter plot: seed data projected onto a different draw of two random vectors]

SLIDE 8

Projection onto two random vectors

[Scatter plot: seed data projected onto another draw of two random vectors]

SLIDE 9

Compressed sensing in MRI

An important goal in MRI is to reduce scan time.
This can be achieved by measuring fewer frequency coefficients.
What happens if we undersample in the Fourier domain?

SLIDE 10

MR image

[MR image; axes t1, t2 (cm)]

SLIDE 11

Fourier coefficients

[Magnitude of the 2D Fourier coefficients; axes k1, k2 (1/cm); logarithmic color scale]

SLIDE 12

2× regular undersampling

[Fourier coefficients after 2× regular undersampling; axes k1, k2 (1/cm); logarithmic color scale]

SLIDE 13

Recovered image

[Image recovered from the regularly undersampled coefficients; axes t1, t2 (cm)]

SLIDE 14

2× random undersampling

[Fourier coefficients after 2× random undersampling; axes k1, k2 (1/cm); logarithmic color scale]

SLIDE 15

Recovered image

[Image recovered from the randomly undersampled coefficients; axes t1, t2 (cm)]

SLIDE 16

Motivating applications
Gaussian random variables
Randomized dimensionality reduction
Compressed sensing

SLIDE 17

Gaussian random variables

The pdf of a Gaussian or normal random variable with mean $\mu$ and standard deviation $\sigma$ is given by

$$f_X(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

A standard Gaussian has $\mu := 0$ and $\sigma := 1$

SLIDE 18

Gaussian random variables

[Plot of the Gaussian pdf $f_X(x)$ for $\mu = 2, \sigma = 1$; $\mu = 0, \sigma = 2$; and $\mu = 0, \sigma = 4$]

SLIDE 19

Linear transformation of Gaussian

If $x$ is a Gaussian random variable with mean $\mu$ and standard deviation $\sigma$, then for any $a, b \in \mathbb{R}$

$$y := ax + b$$

is a Gaussian random variable with mean $a\mu + b$ and standard deviation $|a|\,\sigma$
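A quick numerical sanity check of this fact, not part of the slides (a NumPy sketch; the parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, a, b = 2.0, 1.5, -3.0, 4.0

x = rng.normal(mu, sigma, size=1_000_000)
y = a * x + b

print(y.mean(), a * mu + b)        # both close to -2.0
print(y.std(), abs(a) * sigma)     # both close to 4.5
```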

SLIDES 20-26

Proof

Let $a > 0$ (the proof for $a < 0$ is very similar). Then

$$F_y(y) = P(y \le y) = P(ax + b \le y) = P\left(x \le \frac{y - b}{a}\right) = \int_{-\infty}^{\frac{y-b}{a}} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}\, dx = \int_{-\infty}^{y} \frac{1}{\sqrt{2\pi}\,a\sigma}\, e^{-\frac{(w-a\mu-b)^2}{2a^2\sigma^2}}\, dw$$

by the change of variables $w = ax + b$. Differentiating with respect to $y$:

$$f_y(y) = \frac{1}{\sqrt{2\pi}\,a\sigma}\, e^{-\frac{(y-a\mu-b)^2}{2a^2\sigma^2}}$$

SLIDE 27

Gaussian random vector

A Gaussian random vector $\vec{x}$ is a random vector with joint pdf

$$f_{\vec{x}}(\vec{x}) = \frac{1}{\sqrt{(2\pi)^d\,|\Sigma|}} \exp\left(-\frac{1}{2}\,(\vec{x}-\vec{\mu})^T \Sigma^{-1} (\vec{x}-\vec{\mu})\right)$$

where $\vec{\mu} \in \mathbb{R}^d$ is the mean and $\Sigma \in \mathbb{R}^{d\times d}$ the covariance matrix.

A standard Gaussian vector has $\vec{\mu} := \vec{0}$ and $\Sigma := I$

SLIDE 28

Uncorrelation implies independence

If the covariance matrix is diagonal,

$$\Sigma_{\vec{x}} = \begin{pmatrix} \sigma_1^2 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \sigma_d^2 \end{pmatrix},$$

the entries are independent

SLIDE 29

Proof

$$\Sigma_{\vec{x}}^{-1} = \begin{pmatrix} \frac{1}{\sigma_1^2} & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \frac{1}{\sigma_d^2} \end{pmatrix}, \qquad |\Sigma| = \prod_{i=1}^{d} \sigma_i^2$$

SLIDES 30-33

Proof

$$f_{\vec{x}}(\vec{x}) = \frac{1}{\sqrt{(2\pi)^d\,|\Sigma|}} \exp\left(-\frac{1}{2}\,(\vec{x}-\vec{\mu})^T \Sigma^{-1} (\vec{x}-\vec{\mu})\right) = \prod_{i=1}^{d} \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\left(-\frac{(\vec{x}_i - \mu_i)^2}{2\sigma_i^2}\right) = \prod_{i=1}^{d} f_{\vec{x}_i}(\vec{x}_i)$$

SLIDE 34

Linear transformations

Let $\vec{x}$ be a Gaussian random vector of dimension $d$ with mean $\vec{\mu}$ and covariance matrix $\Sigma$. For any matrix $A \in \mathbb{R}^{m\times d}$ and $\vec{b} \in \mathbb{R}^m$,

$$\vec{y} = A\vec{x} + \vec{b}$$

is Gaussian with mean $A\vec{\mu} + \vec{b}$ and covariance matrix $A\Sigma A^T$ (as long as $A\Sigma A^T$ is full rank).

This is why the Fourier and wavelet coefficients of Gaussian noise are also Gaussian noise.
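A short simulation sketch, not from the slides, comparing the empirical covariance of $A\vec{x} + \vec{b}$ with $A\Sigma A^T$ (NumPy; A, b, and Σ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
d, m = 4, 2
A = rng.standard_normal((m, d))
b = rng.standard_normal(m)

# Random positive-definite covariance matrix for x
L = rng.standard_normal((d, d))
Sigma = L @ L.T

x = rng.multivariate_normal(np.zeros(d), Sigma, size=200_000)
y = x @ A.T + b

print(np.cov(y.T))        # empirical covariance of y
print(A @ Sigma @ A.T)    # theoretical covariance, approximately equal
```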

SLIDE 35

Subvectors are also Gaussian

[Plot of a bivariate Gaussian pdf $f_{\vec{x}}(x, y)$ together with the marginal pdfs $f_{\vec{x}[1]}(x)$ and $f_{\vec{x}[2]}(y)$]

SLIDE 36

Audio data

[Audio waveform between 4.0 s and 4.5 s; amplitude roughly between -7500 and 7500]

SLIDE 37

DFT

[Magnitude of the DFT of the audio segment; frequency axis from -4000 to 4000 Hz; logarithmic magnitude scale]

SLIDE 38

Noisy image

SLIDE 39

Wavelet coefficients

SLIDES 40-42

Direction of iid standard Gaussian vectors

If the covariance matrix of a Gaussian vector $\vec{x}$ is $I$, then $\vec{x}$ is isotropic: it does not favor any direction.
For any orthogonal matrix $U$, $U\vec{x}$ has the same distribution: Gaussian with mean $U\vec{0} = \vec{0}$ and covariance matrix $UIU^T = UU^T = I$

SLIDE 43

Magnitude of iid standard Gaussian vectors

In low dimensions the joint pdf is mostly concentrated around the origin. What about in high dimensions?

SLIDE 44

ℓ2 norm of samples

[Log-log plot: ℓ2 norm of standard Gaussian samples versus dimension, for dimensions between 10^1 and 10^3]

SLIDE 45

χ2 random variable

A χ2 (chi-squared) random variable with $d$ degrees of freedom is defined as

$$y := \sum_{i=1}^{d} x_i^2$$

where $x_1, \ldots, x_d$ are standard Gaussians.
It is equal to the squared ℓ2 norm of a $d$-dimensional standard Gaussian vector.
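A small simulation sketch, not from the slides, illustrating that the squared norm of a $d$-dimensional standard Gaussian vector concentrates around $d$ (NumPy):

```python
import numpy as np

rng = np.random.default_rng(0)
for d in (10, 100, 1_000):
    x = rng.standard_normal((2_000, d))       # 2000 samples in dimension d
    sq_norms = (x ** 2).sum(axis=1)           # chi-squared with d degrees of freedom
    print(d, sq_norms.mean() / d, sq_norms.std() / d)  # mean/d -> 1, std/d -> sqrt(2/d)
```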

SLIDE 46

Squared ℓ2 norm divided by d

[Plot of the pdf $f_{y/d}(x)$ for $d = 10, 20, 50, 100$]

SLIDES 47-50

Mean

$$E\left(||\vec{x}||_2^2\right) = E\left(\sum_{i=1}^{d} \vec{x}[i]^2\right) = \sum_{i=1}^{d} E\left(\vec{x}[i]^2\right) = d$$

SLIDES 51-57

Variance

$$E\left(\left(||\vec{x}||_2^2\right)^2\right) = E\left(\left(\sum_{i=1}^{d} \vec{x}[i]^2\right)^2\right) = E\left(\sum_{i=1}^{d}\sum_{j=1}^{d} \vec{x}[i]^2\,\vec{x}[j]^2\right) = \sum_{i=1}^{d}\sum_{j=1}^{d} E\left(\vec{x}[i]^2\,\vec{x}[j]^2\right)$$

$$= \sum_{i=1}^{d} E\left(\vec{x}[i]^4\right) + 2\sum_{i=1}^{d-1}\sum_{j=i+1}^{d} E\left(\vec{x}[i]^2\right)\,E\left(\vec{x}[j]^2\right) = 3d + d(d-1) = d(d+2)$$

where we use that the 4th moment of a standard Gaussian equals 3.

SLIDE 58

Variance

$$\mathrm{Var}\left(||\vec{x}||_2^2\right) = E\left(\left(||\vec{x}||_2^2\right)^2\right) - E\left(||\vec{x}||_2^2\right)^2 = d(d+2) - d^2 = 2d$$

The relative standard deviation around the mean scales as $\sqrt{2/d}$.
Geometrically, the probability density concentrates close to the surface of a sphere with radius $\sqrt{d}$.

SLIDE 59

Non-asymptotic tail bound

Let $\vec{x}$ be an iid standard Gaussian random vector of dimension $d$. For any $\epsilon > 0$

$$P\left(d\,(1-\epsilon) < ||\vec{x}||_2^2 < d\,(1+\epsilon)\right) \ge 1 - \frac{2}{d\epsilon^2}$$

SLIDE 60

Markov's inequality

Let $x$ be a nonnegative random variable. For any positive constant $a > 0$,

$$P(x \ge a) \le \frac{E(x)}{a}$$

SLIDES 61-62

Proof

Define the indicator variable $1_{x \ge a}$. Since $x - a\,1_{x \ge a} \ge 0$,

$$E(x) \ge a\,E(1_{x \ge a}) = a\,P(x \ge a)$$

SLIDES 63-67

Chebyshev bound

Let $y := ||\vec{x}||_2^2$. Then

$$P(|y - d| \ge d\epsilon) = P\left((y - E(y))^2 \ge d^2\epsilon^2\right) \le \frac{E\left((y - E(y))^2\right)}{d^2\epsilon^2} = \frac{\mathrm{Var}(y)}{d^2\epsilon^2} = \frac{2}{d\epsilon^2}$$

where the inequality is Markov's inequality.

SLIDE 68

Non-asymptotic Chernoff tail bound

Let $\vec{x}$ be an iid standard Gaussian random vector of dimension $d$. For any $\epsilon > 0$

$$P\left(d\,(1-\epsilon) < ||\vec{x}||_2^2 < d\,(1+\epsilon)\right) \ge 1 - 2\exp\left(-\frac{d\epsilon^2}{8}\right)$$
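A sketch, not from the slides, comparing the empirical tail probability with the Chernoff bound above (NumPy; d and ε are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
d, eps, trials = 200, 0.5, 50_000

sq_norms = (rng.standard_normal((trials, d)) ** 2).sum(axis=1)
outside = np.mean((sq_norms <= d * (1 - eps)) | (sq_norms >= d * (1 + eps)))
bound = 2 * np.exp(-d * eps**2 / 8)
print(outside, bound)   # empirical failure probability vs. the upper bound
```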

SLIDE 69

Proof

Let $y := ||\vec{x}||_2^2$. The result is implied by

$$P(y > d\,(1+\epsilon)) \le \exp\left(-\frac{d\epsilon^2}{8}\right), \qquad P(y < d\,(1-\epsilon)) \le \exp\left(-\frac{d\epsilon^2}{8}\right)$$

SLIDES 70-74

Proof

Fix $t > 0$. Then

$$P(y > a) = P(\exp(ty) > \exp(at)) \le \exp(-at)\,E(\exp(ty)) \quad \text{(Markov's inequality)}$$

$$= \exp(-at)\,E\left(\exp\left(\sum_{i=1}^{d} t x_i^2\right)\right) = \exp(-at)\prod_{i=1}^{d} E\left(\exp\left(t x_i^2\right)\right) \quad \text{(independence of } x_1, \ldots, x_d\text{)}$$

SLIDE 75

Proof

Lemma (by direct integration): for $t < 1/2$

$$E\left(\exp\left(t x^2\right)\right) = \frac{1}{\sqrt{1 - 2t}}$$

This is equivalent to controlling higher-order moments, since

$$E\left(\exp\left(t x^2\right)\right) = E\left(\sum_{i=0}^{\infty} \frac{t^i x^{2i}}{i!}\right) = \sum_{i=0}^{\infty} \frac{t^i\, E\left(x^{2i}\right)}{i!}$$

SLIDE 76

Proof

Fix $t > 0$:

$$P(y > a) \le \exp(-at)\prod_{i=1}^{d} E\left(\exp\left(t x_i^2\right)\right) = \exp(-at)\,(1 - 2t)^{-\frac{d}{2}}$$

SLIDE 77

Proof

Setting $a := d\,(1+\epsilon)$ and $t := \frac{1}{2} - \frac{1}{2(1+\epsilon)}$, we conclude

$$P(y > d\,(1+\epsilon)) \le (1+\epsilon)^{\frac{d}{2}} \exp\left(-\frac{d\epsilon}{2}\right) \le \exp\left(-\frac{d\epsilon^2}{8}\right)$$

SLIDE 78

Projection onto a fixed subspace

The probability density is isotropic and has variance $d$.
A projection onto a fixed $k$-dimensional subspace should capture a fraction of the variance equal to $k/d$.
The variance of the projection should therefore be $k$.

SLIDES 79-81

Projection onto a fixed subspace

Let $S$ be a $k$-dimensional subspace of $\mathbb{R}^d$, with the columns of $U$ forming an orthonormal basis of $S$, and let $\vec{x}$ be a $d$-dimensional standard Gaussian vector.

$P_S(\vec{x}) = UU^T\vec{x}$ is not a Gaussian vector. Its covariance matrix

$$\Sigma_{P_S(\vec{x})} = UU^T \Sigma_{\vec{x}}\, UU^T = UU^T$$

is not full rank.

SLIDES 82-84

Projection onto a fixed subspace

The coefficients $U^T\vec{x}$ are a Gaussian vector with covariance matrix

$$\Sigma_{U^T\vec{x}} = U^T \Sigma_{\vec{x}}\, U = U^T U = I$$

We have

$$||P_S(\vec{x})||_2^2 = (UU^T\vec{x})^T UU^T\vec{x} = \left\|U^T\vec{x}\right\|_2^2$$

For any $\epsilon > 0$

$$P\left(k\,(1-\epsilon) < ||P_S(\vec{x})||_2^2 < k\,(1+\epsilon)\right) \ge 1 - 2\exp\left(-\frac{k\epsilon^2}{8}\right)$$
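A quick simulation sketch of this concentration result, not from the slides; the fixed subspace is spanned by an orthonormal basis obtained via QR (an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, trials = 1_000, 50, 2_000

# Orthonormal basis U of a fixed k-dimensional subspace of R^d
U, _ = np.linalg.qr(rng.standard_normal((d, k)))
x = rng.standard_normal((trials, d))
proj_sq_norms = ((x @ U) ** 2).sum(axis=1)   # ||U^T x||^2 = ||P_S(x)||^2

print(proj_sq_norms.mean())   # close to k = 50
print(proj_sq_norms.std())    # close to sqrt(2k) = 10
```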

SLIDE 85

Linear regression

To analyze the performance of the least-squares estimator we assume a linear model with additive iid Gaussian noise:

$$\vec{y}_{\mathrm{train}} := X_{\mathrm{train}}\,\vec{\beta}_{\mathrm{true}} + \vec{z}_{\mathrm{train}}$$

The LS estimator equals

$$\vec{\beta}_{\mathrm{LS}} := \arg\min_{\vec{\beta}}\, \left\|\vec{y}_{\mathrm{train}} - X_{\mathrm{train}}\,\vec{\beta}\right\|_2$$

SLIDE 86

Training error

The training error is the projection of the noise onto the orthogonal complement of the column space of $X_{\mathrm{train}}$:

$$\vec{y}_{\mathrm{train}} - \vec{y}_{\mathrm{LS}} = P_{\mathrm{col}(X_{\mathrm{train}})^\perp}\, \vec{z}_{\mathrm{train}}$$

The dimension of the orthogonal complement of $\mathrm{col}(X_{\mathrm{train}})$ equals $n - p$, so

$$\mathrm{Training\ RMSE} := \sqrt{\frac{||\vec{y}_{\mathrm{train}} - \vec{y}_{\mathrm{LS}}||_2^2}{n}} \approx \sigma\sqrt{1 - \frac{p}{n}}$$
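A sketch, not from the slides, checking this approximation on synthetic data (NumPy; n, p, σ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 1_000, 200, 2.0

X = rng.standard_normal((n, p))
beta_true = rng.standard_normal(p)
y = X @ beta_true + sigma * rng.standard_normal(n)

beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
rmse = np.sqrt(np.mean((y - X @ beta_ls) ** 2))
print(rmse, sigma * np.sqrt(1 - p / n))   # both close to 2 * sqrt(0.8)
```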

SLIDE 87

Temperature prediction via linear regression

[Plot: average error (deg Celsius) versus number of training data (200 to 5000); training and test errors compared with the curves $\sqrt{1 - p/n}$ and $\sqrt{1 + p/n}$]

SLIDE 88

Motivating applications
Gaussian random variables
Randomized dimensionality reduction
Compressed sensing

SLIDE 89

Randomized linear maps

We use Gaussian matrices as randomized linear maps from $\mathbb{R}^d$ to $\mathbb{R}^k$, $k < d$. Each entry is sampled independently from a standard Gaussian.
Question: do we preserve the distances between the points in a set? Equivalently, are any fixed vectors in the null space?

SLIDES 90-93

Fixed vector

Let $A$ be a $k \times d$ matrix with iid standard Gaussian entries. If $\vec{v} \in \mathbb{R}^d$ is a deterministic vector with unit $\ell_2$ norm, then $A\vec{v}$ is a $k$-dimensional standard Gaussian vector.

Proof: $(A\vec{v})[i]$, $1 \le i \le k$, is Gaussian with mean zero and variance

$$\mathrm{Var}\left(A_{i,:}^T\,\vec{v}\right) = \vec{v}^T \Sigma_{A_{i,:}}\, \vec{v} = \vec{v}^T I\, \vec{v} = ||\vec{v}||_2^2 = 1$$

The rows $A_{i,:}$, $1 \le i \le k$, are all independent.

SLIDE 94

Non-asymptotic Chernoff tail bound

Let $\vec{x}$ be an iid standard Gaussian random vector of dimension $k$. For any $\epsilon > 0$

$$P\left(k\,(1-\epsilon) < ||\vec{x}||_2^2 < k\,(1+\epsilon)\right) \ge 1 - 2\exp\left(-\frac{k\epsilon^2}{8}\right)$$

SLIDE 95

Fixed vector

Let $A$ be a $k \times d$ matrix with iid standard Gaussian entries. For any $\vec{v} \in \mathbb{R}^d$ with unit norm and any $\epsilon \in (0,1)$

$$\sqrt{1-\epsilon} \le \left\|\frac{1}{\sqrt{k}}\,A\vec{v}\right\|_2 \le \sqrt{1+\epsilon}$$

with probability at least $1 - 2\exp\left(-k\epsilon^2/8\right)$
SLIDE 96

Distance between two vectors

The result implies that if we fix two vectors $\vec{x}_1$ and $\vec{x}_2$ and define $\vec{y} := \vec{x}_2 - \vec{x}_1$, then

$$\sqrt{1-\epsilon}\,||\vec{y}||_2 \le \left\|\frac{1}{\sqrt{k}}\,A\vec{y}\right\|_2 \le \sqrt{1+\epsilon}\,||\vec{y}||_2$$

with high probability (just set $\vec{v} := \vec{y}/||\vec{y}||_2$). What about the distances between a set of vectors?

SLIDES 97-98

Johnson-Lindenstrauss lemma

Let $A$ be a $k \times d$ matrix with iid standard Gaussian entries, and let $\vec{x}_1, \ldots, \vec{x}_p \in \mathbb{R}^d$ be any fixed set of $p$ deterministic vectors. For any pair $\vec{x}_i$, $\vec{x}_j$ and any $\epsilon \in (0,1)$

$$(1-\epsilon)\,||\vec{x}_i - \vec{x}_j||_2^2 \le \left\|\frac{1}{\sqrt{k}}\,A\vec{x}_i - \frac{1}{\sqrt{k}}\,A\vec{x}_j\right\|_2^2 \le (1+\epsilon)\,||\vec{x}_i - \vec{x}_j||_2^2$$

with probability at least $\frac{1}{p}$ as long as

$$k \ge \frac{16\log(p)}{\epsilon^2}$$

No dependence on d!
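An illustration sketch of the lemma, not from the slides: project p points from a high dimension down to k = 16 log(p)/ε² dimensions and inspect the distortion of all pairwise squared distances (NumPy; SciPy's pdist is assumed available):

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
d, p, eps = 10_000, 100, 0.5
k = int(np.ceil(16 * np.log(p) / eps**2))   # ~295, independent of d

X = rng.standard_normal((p, d))
A = rng.standard_normal((k, d))
Y = X @ A.T / np.sqrt(k)

ratios = pdist(Y) ** 2 / pdist(X) ** 2      # squared-distance distortions
print(k, ratios.min(), ratios.max())        # typically within (1 - eps, 1 + eps)
```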

SLIDES 99-100

Proof

Aim: control the action of $A$ on the normalized differences

$$\vec{v}_{ij} := \frac{\vec{x}_i - \vec{x}_j}{||\vec{x}_i - \vec{x}_j||_2}$$

Our event of interest is the intersection of the events

$$E_{ij} = \left\{ k\,(1-\epsilon) < ||A\vec{v}_{ij}||_2^2 < k\,(1+\epsilon) \right\}, \qquad 1 \le i < j \le p$$

Is it equal to $\bigcap_{i,j} E_{ij}$?

SLIDE 101

Fixed vector

Let $A$ be a $k \times d$ matrix with iid standard Gaussian entries. For any $\vec{v} \in \mathbb{R}^d$ with unit norm and any $\epsilon \in (0,1)$

$$\sqrt{1-\epsilon} \le \left\|\frac{1}{\sqrt{k}}\,A\vec{v}\right\|_2 \le \sqrt{1+\epsilon}$$

with probability at least $1 - 2\exp\left(-k\epsilon^2/8\right)$. This implies

$$P\left(E_{ij}^c\right) \le \frac{2}{p^2} \qquad \text{if } k \ge \frac{16\log(p)}{\epsilon^2}$$

SLIDE 102

Union bound

For any events $S_1, S_2, \ldots, S_n$ in a probability space

$$P\left(\cup_i S_i\right) \le \sum_{i=1}^{n} P(S_i)$$

SLIDES 103-108

Proof

The number of events $E_{ij}$ is $\binom{p}{2} = p(p-1)/2$. By the union bound

$$P\left(\bigcap_{i,j} E_{ij}\right) = 1 - P\left(\bigcup_{i,j} E_{ij}^c\right) \ge 1 - \sum_{i,j} P\left(E_{ij}^c\right) \ge 1 - \frac{p(p-1)}{2}\cdot\frac{2}{p^2} \ge \frac{1}{p}$$
SLIDE 109

Nearest-neighbor classification

Training set of points and labels $\{\vec{x}_1, l_1\}, \ldots, \{\vec{x}_n, l_n\}$. To classify a new data point $\vec{y} \in \mathbb{R}^d$, find

$$i^* := \arg\min_{1 \le i \le n} ||\vec{y} - \vec{x}_i||_2$$

and assign $l_{i^*}$ to $\vec{y}$.

Cost: $O(dnp)$ to classify $p$ new points

SLIDE 110

Nearest neighbors in random subspace

Use a $k \times d$ iid standard Gaussian matrix to project onto a $k$-dimensional space. Cost:

◮ $dkn$ operations to project the training set
◮ $dkp$ operations to project the test set
◮ $knp$ operations to perform nearest-neighbor classification

Much faster!
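A compact sketch of nearest-neighbor classification after random projection, not from the slides (NumPy; the data and labels are synthetic placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n, p = 4_096, 50, 360, 40

X_train = rng.standard_normal((n, d))          # placeholder training vectors
labels = rng.integers(0, 40, size=n)           # placeholder labels
X_test = rng.standard_normal((p, d))

A = rng.standard_normal((k, d)) / np.sqrt(k)   # random projection
P_train, P_test = X_train @ A.T, X_test @ A.T

# Nearest neighbor in the projected k-dimensional space
dists = ((P_test[:, None, :] - P_train[None, :, :]) ** 2).sum(axis=2)
predictions = labels[dists.argmin(axis=1)]
print(predictions[:10])
```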

SLIDE 111

Face recognition

Training set: 360 images of size 64 × 64 from 40 different subjects (9 each). Test set: 1 new image from each subject. We model each image as a vector in $\mathbb{R}^{4096}$ ($d = 4096$). To classify we:

1. Project onto a random $k$-dimensional subspace
2. Apply nearest-neighbor classification using the $\ell_2$-norm distance in $\mathbb{R}^k$
SLIDE 112

Performance

[Plot: number of errors (average, maximum and minimum) versus projection dimension, 20 to 200]

SLIDE 113

Nearest neighbor in $\mathbb{R}^{50}$

[Figure: test image, its projection, the closest projection, and the corresponding training image]

SLIDE 114

Motivating applications
Gaussian random variables
Randomized dimensionality reduction
Compressed sensing

SLIDE 115

Compressed sensing

Goal: recover signals from a small number of measurements.
An arbitrary vector of dimension $d$ cannot be recovered from $m < d$ linear measurements.
However, signals of interest are highly structured; for example, images are sparse in a wavelet basis.
If the signal is parametrized by $s < m$ parameters, recovery may be possible.
We focus on a simplified problem: recovering sparse vectors (a recovery sketch follows below).
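A minimal sparse-recovery sketch, not from the slides: recover an s-sparse vector from m < d Gaussian measurements via ℓ1 minimization (basis pursuit), cast as a linear program; solving it with scipy.optimize.linprog is an arbitrary implementation choice:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
d, m, s = 200, 60, 5

x_true = np.zeros(d)
x_true[rng.choice(d, s, replace=False)] = rng.standard_normal(s)
A = rng.standard_normal((m, d)) / np.sqrt(m)
y = A @ x_true

# min ||x||_1 subject to Ax = y, written with x = u - v, u, v >= 0
c = np.ones(2 * d)
A_eq = np.hstack([A, -A])
res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None))
x_hat = res.x[:d] - res.x[d:]
print(np.max(np.abs(x_hat - x_true)))   # near zero: exact recovery
```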

SLIDE 116

MR image

[MR image; axes t1, t2 (cm)]

SLIDE 117

Fourier coefficients

[Magnitude of the 2D Fourier coefficients; axes k1, k2 (1/cm); logarithmic color scale]

SLIDE 118

2× regular undersampling

[Fourier coefficients after 2× regular undersampling; axes k1, k2 (1/cm); logarithmic color scale]

SLIDE 119

Recovered image

[Image recovered from the regularly undersampled coefficients; axes t1, t2 (cm)]

SLIDE 120

2× random undersampling

[Fourier coefficients after 2× random undersampling; axes k1, k2 (1/cm); logarithmic color scale]

SLIDE 121

Recovered image

[Image recovered from the randomly undersampled coefficients; axes t1, t2 (cm)]

SLIDE 122

DFT regular undersampling

SLIDE 123

DFT regular undersampling

[Plot: Signal 1 and the signal recovered from regularly undersampled DFT coefficients]

SLIDE 124

DFT regular undersampling

[Plot: Signal 2 and the signal recovered from regularly undersampled DFT coefficients]

SLIDE 125

DFT random undersampling

SLIDE 126

DFT random undersampling

[Plot: signal and its recovery from randomly undersampled DFT coefficients]

SLIDE 127

DFT random undersampling

[Plot: second signal and its recovery from randomly undersampled DFT coefficients]

SLIDE 128

Gaussian measurements

SLIDE 129

Gaussian measurements

[Plot: signal and its recovery from Gaussian measurements]

SLIDE 130

Gaussian measurements

[Plot: second signal and its recovery from Gaussian measurements]

SLIDE 131

Restricted-isometry property

Different sparse vectors should never produce similar data.
If two $s$-sparse vectors $\vec{x}_1$, $\vec{x}_2$ are far apart, then $A\vec{x}_1$, $A\vec{x}_2$ should be far apart.
The measurement operator should preserve distances (be an isometry) when restricted to act upon sparse vectors.

SLIDE 132

Restricted-isometry property

$A$ satisfies the restricted isometry property (RIP) with constant $\epsilon$ if

$$(1-\epsilon)\,||\vec{x}||_2 \le ||A\vec{x}||_2 \le (1+\epsilon)\,||\vec{x}||_2$$

for any $s$-sparse vector $\vec{x}$

SLIDES 133-134

Restricted-isometry property

If $A$ satisfies the RIP for a sparsity level $2s$, then for any $s$-sparse $\vec{x}_1$, $\vec{x}_2$ (so that $\vec{x}_2 - \vec{x}_1$ is $2s$-sparse)

$$||A\vec{x}_2 - A\vec{x}_1||_2 = ||A(\vec{x}_2 - \vec{x}_1)||_2 \ge (1-\epsilon)\,||\vec{x}_2 - \vec{x}_1||_2$$

SLIDE 135

Restricted-isometry property

Deterministic matrices tend not to satisfy the RIP.
It is NP-hard to check whether the spark or RIP conditions hold.
Random matrices satisfy the RIP with high probability.
We prove it for iid Gaussian matrices; the ideas in the proof for random Fourier matrices are similar.

SLIDES 136-137

Restricted-isometry property for Gaussian matrices

Let $A \in \mathbb{R}^{m\times d}$ be a random matrix with iid standard Gaussian entries. $\frac{1}{\sqrt{m}}A$ satisfies the RIP for a constant $\epsilon$ with probability $1 - \frac{C_2}{d}$ as long as

$$m \ge \frac{C_1 s}{\epsilon^2}\,\log\left(\frac{d}{s}\right)$$

for two fixed constants $C_1, C_2 > 0$.

The number of measurements is proportional to the sparsity (up to a log factor).

SLIDE 138

Singular values of a submatrix

Fix a subset of $s$ indices $T \subset \{1, \ldots, d\}$. Any matrix $A \in \mathbb{R}^{m\times d}$, $m < d$, satisfies

$$\sigma_s(A_T)\,||\vec{x}||_2 \le ||A\vec{x}||_2 \le \sigma_1(A_T)\,||\vec{x}||_2$$

for all vectors $\vec{x} \in \mathbb{R}^d$ with support restricted to $T$. $A_T$ is the $m \times s$ submatrix of $A$ containing the columns indexed by $T$; $\sigma_1(A_T)$ and $\sigma_s(A_T)$ are the largest and smallest singular values of $A_T$.

SLIDE 139

Proof

For any vector $\vec{x} \in \mathbb{R}^d$ with support restricted to $T$,

$$A\vec{x} = A_T\,\vec{x}_T$$

where $\vec{x}_T \in \mathbb{R}^s$ is the subvector of $\vec{x}$ that contains its nonzero entries.

SLIDE 140

Proof strategy

Control the singular values of a fixed submatrix.
Apply the union bound to extend the bounds to all submatrices.

SLIDE 141

Singular values of an m × s matrix, s = 100

[Plot of $\sigma_i/\sqrt{m}$ versus $i$ for $m/s \in \{2, 5, 10, 20, 50, 100, 200\}$]

SLIDE 142

Singular values of an m × s matrix, s = 1000

[Plot of $\sigma_i/\sqrt{m}$ versus $i$ for $m/s \in \{2, 5, 10, 20, 50, 100, 200\}$]

SLIDE 143

Singular values of a Gaussian matrix

For large enough $m$,

$$M \approx U\left(\sqrt{m}\,I\right)V^T = \sqrt{m}\,UV^T$$

Standard Gaussian vectors in high dimensions are almost orthogonal.

SLIDE 144

Singular values of a Gaussian matrix

Let $M$ be an $m \times s$ matrix with iid standard Gaussian entries such that $m > s$. For any fixed $\epsilon > 0$, the singular values of $M$ satisfy

$$\sqrt{m}\,(1-\epsilon) \le \sigma_s \le \sigma_1 \le \sqrt{m}\,(1+\epsilon)$$

with probability at least

$$1 - 2\left(\frac{12}{\epsilon}\right)^s \exp\left(-\frac{m\epsilon^2}{32}\right)$$
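A quick simulation sketch of this statement, not from the slides: the extreme singular values of $M/\sqrt{m}$ approach 1 as $m/s$ grows (NumPy):

```python
import numpy as np

rng = np.random.default_rng(0)
s = 100
for ratio in (2, 10, 100):
    m = ratio * s
    M = rng.standard_normal((m, s))
    sv = np.linalg.svd(M, compute_uv=False) / np.sqrt(m)
    print(ratio, sv.min(), sv.max())   # both tend to 1 as m/s grows
```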

SLIDE 145

Union bound

For any events $S_1, S_2, \ldots, S_n$ in a probability space

$$P\left(\cup_i S_i\right) \le \sum_{i=1}^{n} P(S_i)$$

SLIDES 146-148

Proof

The number of different supports of size $s$ is

$$\binom{d}{s} \le \left(\frac{ed}{s}\right)^s$$

By the union bound,

$$\sqrt{1-\epsilon}\,||\vec{x}||_2 \le \frac{1}{\sqrt{m}}\,||A\vec{x}||_2 \le \sqrt{1+\epsilon}\,||\vec{x}||_2$$

holds for any $s$-sparse vector $\vec{x}$ with probability at least

$$1 - 2\left(\frac{ed}{s}\right)^s \left(\frac{12}{\epsilon}\right)^s \exp\left(-\frac{m\epsilon^2}{32}\right) = 1 - \exp\left(\log 2 + s + s\log\left(\frac{d}{s}\right) + s\log\left(\frac{12}{\epsilon}\right) - \frac{m\epsilon^2}{32}\right) \ge 1 - \frac{C_2}{d}$$

as long as $m \ge \frac{C_1 s}{\epsilon^2}\log\left(\frac{d}{s}\right)$

SLIDE 149

Singular values of a Gaussian matrix

Let $M$ be an $m \times s$ matrix with iid standard Gaussian entries such that $m > s$. For any fixed $\epsilon > 0$, the singular values of $M$ satisfy

$$\sqrt{m}\,(1-\epsilon) \le \sigma_s \le \sigma_1 \le \sqrt{m}\,(1+\epsilon)$$

with probability at least

$$1 - 2\left(\frac{12}{\epsilon}\right)^s \exp\left(-\frac{m\epsilon^2}{32}\right)$$

How do we prove this?
SLIDE 150

More of the same?

We need to prove that, for any vector $\vec{v}$ on the sphere $S^{s-1}$ in $\mathbb{R}^s$,

$$\sqrt{m}\,(1-\epsilon) < ||M\vec{v}||_2 < \sqrt{m}\,(1+\epsilon)$$

Can we prove it for a fixed vector and use the union bound?

SLIDE 151

Proof strategy

1. Consider a spread-out finite subset $N_\epsilon \subset S^{s-1}$ such that any point in $S^{s-1}$ is close to a point in $N_\epsilon$
2. Prove the bound on $N_\epsilon$
3. Show that the bounds hold for all points that are close to $N_\epsilon$
SLIDE 152

ε-net

An $\epsilon$-net of a set $X \subseteq \mathbb{R}^s$ is a subset $N_\epsilon \subseteq X$ such that for every vector $\vec{x} \in X$ there exists $\vec{y} \in N_\epsilon$ for which $||\vec{x} - \vec{y}||_2 \le \epsilon$. The covering number $N(X, \epsilon)$ of a set $X$ at scale $\epsilon$ is the minimal cardinality of an $\epsilon$-net of $X$.

SLIDE 153

ε-net

[Illustration of an ε-net]

SLIDE 154

Covering number of a sphere

The covering number of the sphere $S^{s-1}$ at scale $\epsilon$ satisfies

$$N\left(S^{s-1}, \epsilon\right) \le \left(\frac{2+\epsilon}{\epsilon}\right)^s \le \left(\frac{3}{\epsilon}\right)^s$$

SLIDE 155

Covering number of a sphere

◮ Initialize $N_\epsilon$ to the empty set
◮ Choose a point $\vec{x} \in S^{s-1}$ such that $||\vec{x} - \vec{y}||_2 > \epsilon$ for every $\vec{y} \in N_\epsilon$
◮ Add $\vec{x}$ to $N_\epsilon$; repeat until no point in $S^{s-1}$ is more than $\epsilon$ away from every point in $N_\epsilon$

SLIDE 156

Covering number of a sphere

[Illustration: balls of radius ε/2 centered at the net points, contained in a ball of radius 1 + ε/2]

SLIDES 157-160

Covering number of a sphere

$$\mathrm{Vol}\left(B^s_{1+\epsilon/2}\right) \ge \mathrm{Vol}\left(\cup_{\vec{x}\in N_\epsilon} B^s_{\epsilon/2}(\vec{x})\right) = |N_\epsilon|\,\mathrm{Vol}\left(B^s_{\epsilon/2}\right)$$

By multivariable calculus

$$\mathrm{Vol}\left(B^s_r\right) = r^s\,\mathrm{Vol}\left(B^s_1\right),$$

so we conclude

$$(1+\epsilon/2)^s \ge |N_\epsilon|\,(\epsilon/2)^s$$

SLIDE 161

Proof

1. We prove the bounds

$$m\,(1-\epsilon_2) < ||M\vec{v}||_2^2 < m\,(1+\epsilon_2)$$

where $\epsilon_2 := \epsilon/2$, on an $\epsilon_1 := \epsilon/4$ net of the sphere

2. We show that, by the triangle inequality, this implies that the bounds hold on all of the sphere

SLIDE 162

Fixed vector

Let $M$ be an $a \times b$ matrix with iid standard Gaussian entries. For any $\vec{v} \in \mathbb{R}^b$ with unit norm and any $\epsilon \in (0,1)$

$$\sqrt{a\,(1-\epsilon)} \le ||M\vec{v}||_2 \le \sqrt{a\,(1+\epsilon)}$$

with probability at least $1 - 2\exp\left(-a\epsilon^2/8\right)$
SLIDES 163-166

Bound on the ε1-net

We define the event

$$E_{\vec{v},\epsilon_2} := \left\{ m\,(1-\epsilon_2)\,||\vec{v}||_2^2 \le ||M\vec{v}||_2^2 \le m\,(1+\epsilon_2)\,||\vec{v}||_2^2 \right\}$$

Then

$$P\left(\cup_{\vec{v}\in N_{\epsilon_1}} E^c_{\vec{v},\epsilon_2}\right) \le \sum_{\vec{v}\in N_{\epsilon_1}} P\left(E^c_{\vec{v},\epsilon_2}\right) \le |N_{\epsilon_1}|\,P\left(E^c_{\vec{v},\epsilon_2}\right) \le 2\left(\frac{12}{\epsilon}\right)^s \exp\left(-\frac{m\epsilon^2}{32}\right)$$

SLIDES 167-171

Upper bound on the sphere

Let $\vec{x} \in S^{s-1}$. There exists $\vec{v} \in N_{\epsilon_1}$ such that $||\vec{x} - \vec{v}||_2 \le \epsilon/4$. On the event that $E_{\vec{v},\epsilon_2}$ holds for every $\vec{v} \in N_{\epsilon_1}$,

$$||M\vec{x}||_2 \le ||M\vec{v}||_2 + ||M(\vec{x} - \vec{v})||_2 \le \sqrt{m}\left(1 + \frac{\epsilon}{2}\right) + ||M(\vec{x} - \vec{v})||_2 \le \sqrt{m}\left(1 + \frac{\epsilon}{2}\right) + \sigma_1\,||\vec{x} - \vec{v}||_2 \le \sqrt{m}\left(1 + \frac{\epsilon}{2}\right) + \frac{\sigma_1\,\epsilon}{4}$$

SLIDE 172

Upper bound on the sphere

Taking the supremum over $\vec{x} \in S^{s-1}$ gives

$$\sigma_1 \le \sqrt{m}\left(1 + \frac{\epsilon}{2}\right) + \frac{\sigma_1\,\epsilon}{4}$$

so

$$\sigma_1 \le \sqrt{m}\,\frac{1+\epsilon/2}{1-\epsilon/4} = \sqrt{m}\left(1 + \epsilon - \frac{\epsilon\,(1-\epsilon)}{4-\epsilon}\right) \le \sqrt{m}\,(1+\epsilon)$$
SLIDES 173-178

Lower bound on the sphere

On the same event,

$$||M\vec{x}||_2 \ge ||M\vec{v}||_2 - ||M(\vec{x} - \vec{v})||_2 \ge \sqrt{m}\left(1 - \frac{\epsilon}{2}\right) - ||M(\vec{x} - \vec{v})||_2 \ge \sqrt{m}\left(1 - \frac{\epsilon}{2}\right) - \sigma_1\,||\vec{x} - \vec{v}||_2$$

$$\ge \sqrt{m}\left(1 - \frac{\epsilon}{2}\right) - \frac{\epsilon}{4}\,\sqrt{m}\,(1+\epsilon) \ge \sqrt{m}\,(1-\epsilon)$$

where the last inequality uses $\epsilon \le 1$.