Randomness DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis - PowerPoint PPT Presentation

SLIDE 1

Randomness

DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis

http://www.cims.nyu.edu/~cfgranda/pages/OBDA_fall17/index.html

Carlos Fernandez-Granda

slide-2
SLIDE 2

Gaussian random variables Gaussian random vectors Randomized projections SVD of a random matrix Randomized SVD

slide-3
SLIDE 3

Gaussian random variables

The pdf of a Gaussian or normal random variable with mean $\mu$ and standard deviation $\sigma$ is given by

$$f_X(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

slide-4
SLIDE 4

Gaussian random variables

[Plot of Gaussian pdfs f_X(x) over x from -10 to 10 for (µ = 2, σ = 1), (µ = 0, σ = 2), (µ = 0, σ = 4)]

slide-5
SLIDE 5

Linear transformation of Gaussian

If $x$ is a Gaussian random variable with mean $\mu$ and standard deviation $\sigma$, then for any $a, b \in \mathbb{R}$

$$y := ax + b$$

is a Gaussian random variable with mean $a\mu + b$ and standard deviation $|a|\,\sigma$
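This property is easy to check numerically. The following sketch (NumPy; the constants and seed are arbitrary choices) compares the sample mean and standard deviation of y = ax + b against aµ + b and |a|σ:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, a, b = 2.0, 1.5, 3.0, -1.0

x = rng.normal(mu, sigma, size=200_000)
y = a * x + b  # linear transformation of a Gaussian sample

# Sample mean and standard deviation should match a*mu + b and |a|*sigma
print(y.mean(), y.std())
```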

SLIDES 6-12

Proof

Let a > 0 (the proof for a < 0 is very similar). The cdf of y is

$$F_y(y) = P(y \le y) = P(ax + b \le y) = P\left(x \le \frac{y-b}{a}\right)$$

$$= \int_{-\infty}^{\frac{y-b}{a}} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) \mathrm{d}x$$

$$= \int_{-\infty}^{y} \frac{1}{\sqrt{2\pi}\,a\sigma} \exp\left(-\frac{(w-a\mu-b)^2}{2a^2\sigma^2}\right) \mathrm{d}w \quad \text{(change of variables } w = ax + b\text{)}$$

Differentiating with respect to y:

$$f_y(y) = \frac{1}{\sqrt{2\pi}\,a\sigma} \exp\left(-\frac{(y-a\mu-b)^2}{2a^2\sigma^2}\right)$$

slide-13
SLIDE 13

Central limit theorem

Let $x_1, x_2, x_3, \ldots$ be a sequence of iid random variables with mean $\mu$ and bounded variance $\sigma^2$. The sequence of averages $a_1, a_2, a_3, \ldots$ is defined as

$$a_i := \frac{1}{i} \sum_{j=1}^{i} x_j$$

slide-14
SLIDE 14

Central limit theorem

The sequence $b_1, b_2, b_3, \ldots$ defined by $b_i := \sqrt{i}\,(a_i - \mu)$ converges in distribution to a Gaussian random variable with mean 0 and variance $\sigma^2$: for any $x \in \mathbb{R}$

$$\lim_{i \to \infty} f_{b_i}(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{x^2}{2\sigma^2}\right)$$

For large $i$ the theorem suggests that the average $a_i$ is approximately Gaussian with mean $\mu$ and standard deviation $\sigma/\sqrt{i}$
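The exponential example on the following slides can be reproduced with a short simulation (a sketch; λ, i, and the number of trials are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
lam, i, trials = 2.0, 2000, 2000

# Averages of i iid exponential(lambda) variables; each has mean 1/lam, std 1/lam
samples = rng.exponential(scale=1.0 / lam, size=(trials, i))
averages = samples.mean(axis=1)

# CLT: the averages are approximately Gaussian with mean 1/lam, std 1/(lam*sqrt(i))
print(averages.mean(), averages.std())
```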

SLIDES 15-17

iid exponential λ = 2, i = 10², 10³, 10⁴

[Histograms of the average a_i: as i grows, the distribution concentrates around the mean 1/λ = 0.5 and looks increasingly Gaussian]

SLIDES 18-20

iid geometric p = 0.4, i = 10², 10³, 10⁴

[Histograms of the average a_i: as i grows, the distribution concentrates around the mean 1/p = 2.5 and looks increasingly Gaussian]

slide-21
SLIDE 21

Histogram of heights

[Histogram of real height data (inches, roughly 60 to 76) overlaid with a fitted Gaussian distribution]

slide-22
SLIDE 22

Gaussian random variables Gaussian random vectors Randomized projections SVD of a random matrix Randomized SVD

slide-23
SLIDE 23

Gaussian random vector

A Gaussian random vector $\vec{x}$ is a random vector with joint pdf

$$f_{\vec{x}}(\vec{x}) = \frac{1}{\sqrt{(2\pi)^n |\Sigma|}} \exp\left(-\frac{1}{2} (\vec{x} - \vec{\mu})^T \Sigma^{-1} (\vec{x} - \vec{\mu})\right)$$

where $\vec{\mu} \in \mathbb{R}^n$ is the mean and $\Sigma \in \mathbb{R}^{n \times n}$ the covariance matrix
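The joint pdf can be evaluated directly from this formula. The sketch below (NumPy; the mean and covariance are made up) checks the value at the mean against the closed form $1/\sqrt{(2\pi)^n |\Sigma|}$ for n = 2:

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Joint pdf of a Gaussian random vector, computed directly from the formula."""
    n = len(mu)
    diff = x - mu
    norm = np.sqrt((2 * np.pi) ** n * np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
print(gaussian_pdf(mu, mu, Sigma))  # value at the mean
```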

slide-24
SLIDE 24

Uncorrelation implies independence

If the covariance matrix is diagonal,

$$\Sigma_{\vec{x}} = \begin{pmatrix} \sigma_1^2 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \sigma_n^2 \end{pmatrix},$$

the entries are independent

slide-25
SLIDE 25

Proof

$$\Sigma_{\vec{x}}^{-1} = \begin{pmatrix} \frac{1}{\sigma_1^2} & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \frac{1}{\sigma_n^2} \end{pmatrix} \qquad |\Sigma| = \prod_{i=1}^{n} \sigma_i^2$$

SLIDES 26-29

Proof

$$f_{\vec{x}}(\vec{x}) = \frac{1}{\sqrt{(2\pi)^n |\Sigma|}} \exp\left(-\frac{1}{2} (\vec{x} - \vec{\mu})^T \Sigma^{-1} (\vec{x} - \vec{\mu})\right)$$

$$= \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\left(-\frac{(x_i - \mu_i)^2}{2\sigma_i^2}\right)$$

$$= \prod_{i=1}^{n} f_{x_i}(x_i)$$

slide-30
SLIDE 30

Linear transformations

Let $\vec{x}$ be a Gaussian random vector of dimension n with mean $\vec{\mu}$ and covariance matrix $\Sigma$. For any matrix $A \in \mathbb{R}^{m \times n}$ and $\vec{b} \in \mathbb{R}^m$

$$\vec{y} = A\vec{x} + \vec{b}$$

is Gaussian with mean $A\vec{\mu} + \vec{b}$ and covariance matrix $A \Sigma A^T$
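A numerical check of this property (a sketch; A, b, µ and Σ are made-up values):

```python
import numpy as np

rng = np.random.default_rng(2)
mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[2.0, 0.3, 0.0],
                  [0.3, 1.0, 0.2],
                  [0.0, 0.2, 1.5]])
A = np.array([[1.0, 2.0, -1.0],
              [0.0, 1.0, 1.0]])
b = np.array([0.5, -1.0])

# Sample x ~ N(mu, Sigma) and form y = A x + b
x = rng.multivariate_normal(mu, Sigma, size=300_000)
y = x @ A.T + b

# Empirical mean and covariance of y should match A mu + b and A Sigma A^T
print(y.mean(axis=0), A @ mu + b)
print(np.cov(y.T), A @ Sigma @ A.T)
```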

slide-31
SLIDE 31

Subvectors are also Gaussian

[Surface plot of a bivariate Gaussian joint pdf f_x(x, y), together with the marginal pdfs f_x[1](x) and f_x[2](y), which are also Gaussian]

slide-32
SLIDE 32

Direction of iid standard Gaussian vectors

If the covariance matrix of a Gaussian vector $\vec{x}$ is $I$, then $\vec{x}$ is isotropic: it does not favor any direction. For any orthogonal matrix $U$, $U\vec{x}$ has the same distribution as $\vec{x}$ (Gaussian with mean $U\vec{0} = \vec{0}$ and covariance matrix $U I U^T = U U^T = I$)

slide-33
SLIDE 33

Magnitude of iid standard Gaussian vectors

In low dimensions the joint pdf is mostly concentrated around the origin. What happens in high dimensions?

$$\|\vec{x}\|_2^2 = \sum_{i=1}^{k} \vec{x}[i]^2$$

is a χ² (chi-squared) random variable with k degrees of freedom
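A quick simulation of the squared norm (a sketch; k and the number of trials are arbitrary). Its sample mean and variance can be compared with the values k and 2k derived on the following slides:

```python
import numpy as np

rng = np.random.default_rng(3)
k, trials = 100, 50_000

# Squared norms of iid standard Gaussian vectors of dimension k
sq_norms = (rng.standard_normal((trials, k)) ** 2).sum(axis=1)

# chi-squared with k degrees of freedom: mean k, variance 2k
print(sq_norms.mean(), sq_norms.var())
```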
slide-34
SLIDE 34

Magnitude of iid standard Gaussian vectors

[Plot of the pdf of $\|\vec{x}\|_2^2 / k$ for k = 10, 20, 50, 100: the distribution concentrates around 1 as k grows]

SLIDES 35-38

Mean

$$E\left(\|\vec{x}\|_2^2\right) = E\left(\sum_{i=1}^{k} \vec{x}[i]^2\right) = \sum_{i=1}^{k} E\left(\vec{x}[i]^2\right) = k$$

SLIDES 39-45

Variance

$$E\left(\left(\|\vec{x}\|_2^2\right)^2\right) = E\left(\left(\sum_{i=1}^{k} \vec{x}[i]^2\right)^2\right) = E\left(\sum_{i=1}^{k} \sum_{j=1}^{k} \vec{x}[i]^2\,\vec{x}[j]^2\right)$$

$$= \sum_{i=1}^{k} \sum_{j=1}^{k} E\left(\vec{x}[i]^2\,\vec{x}[j]^2\right) = \sum_{i=1}^{k} E\left(\vec{x}[i]^4\right) + 2 \sum_{i=1}^{k-1} \sum_{j=i+1}^{k} E\left(\vec{x}[i]^2\right) E\left(\vec{x}[j]^2\right)$$

$$= 3k + k(k-1) \qquad \text{(the 4th moment of a standard Gaussian equals 3)}$$

$$= k(k+2)$$

SLIDE 46

Variance

$$\operatorname{Var}\left(\|\vec{x}\|_2^2\right) = E\left(\left(\|\vec{x}\|_2^2\right)^2\right) - E\left(\|\vec{x}\|_2^2\right)^2 = k(k+2) - k^2 = 2k$$

The relative standard deviation around the mean scales as $\sqrt{2/k}$
slide-47
SLIDE 47

Non-asymptotic tail bound

Let $\vec{x}$ be an iid standard Gaussian random vector of dimension k. For any $\epsilon > 0$

$$P\left(k(1-\epsilon) < \|\vec{x}\|_2^2 < k(1+\epsilon)\right) \ge 1 - \frac{2}{k\epsilon^2}$$
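The bound can be compared with the empirical probability (a sketch; k, ǫ and the number of trials are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
k, eps, trials = 50, 0.3, 100_000

sq_norms = (rng.standard_normal((trials, k)) ** 2).sum(axis=1)
inside = np.mean((sq_norms > k * (1 - eps)) & (sq_norms < k * (1 + eps)))
bound = 1 - 2 / (k * eps**2)

# The empirical probability should be at least the Chebyshev-type bound
print(inside, bound)
```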

slide-48
SLIDE 48

Markov’s inequality

Let x be a nonnegative random variable. For any positive constant a > 0,

$$P(x \ge a) \le \frac{E(x)}{a}$$

SLIDES 49-50

Proof

Define the indicator variable $1_{x \ge a}$. Since $x - a\,1_{x \ge a} \ge 0$,

$$E(x) \ge a\,E\left(1_{x \ge a}\right) = a\,P(x \ge a)$$

SLIDES 51-55

Chebyshev bound

Let $y := \|\vec{x}\|_2^2$. Then

$$P(|y - k| \ge k\epsilon) = P\left((y - E(y))^2 \ge k^2\epsilon^2\right)$$

$$\le \frac{E\left((y - E(y))^2\right)}{k^2\epsilon^2} \quad \text{by Markov's inequality}$$

$$= \frac{\operatorname{Var}(y)}{k^2\epsilon^2} = \frac{2}{k\epsilon^2}$$

slide-56
SLIDE 56

Non-asymptotic Chernoff tail bound

Let $\vec{x}$ be an iid standard Gaussian random vector of dimension k. For any $\epsilon > 0$

$$P\left(k(1-\epsilon) < \|\vec{x}\|_2^2 < k(1+\epsilon)\right) \ge 1 - 2\exp\left(-\frac{k\epsilon^2}{8}\right)$$

slide-57
SLIDE 57

Proof

Let $y := \|\vec{x}\|_2^2$. The result is implied by

$$P(y > k(1+\epsilon)) \le \exp\left(-\frac{k\epsilon^2}{8}\right) \qquad P(y < k(1-\epsilon)) \le \exp\left(-\frac{k\epsilon^2}{8}\right)$$

SLIDES 58-62

Proof

Fix t > 0:

$$P(y > a) = P(\exp(ty) > \exp(at))$$

$$\le \exp(-at)\,E(\exp(ty)) \quad \text{by Markov's inequality}$$

$$= \exp(-at)\,E\left(\exp\left(\sum_{i=1}^{k} t x_i^2\right)\right)$$

$$= \exp(-at) \prod_{i=1}^{k} E\left(\exp\left(t x_i^2\right)\right) \quad \text{by independence of } x_1, \ldots, x_k$$

slide-63
SLIDE 63

Proof

Lemma (by direct integration): for $t < 1/2$

$$E\left(\exp\left(t x^2\right)\right) = \frac{1}{\sqrt{1 - 2t}}$$

Equivalent to controlling higher-order moments, since

$$E\left(\exp\left(t x^2\right)\right) = E\left(\sum_{i=0}^{\infty} \frac{t^i x^{2i}}{i!}\right) = \sum_{i=0}^{\infty} \frac{t^i\,E\left(x^{2i}\right)}{i!}$$

slide-64
SLIDE 64

Proof

Fix t > 0:

$$P(y > a) \le \exp(-at) \prod_{i=1}^{k} E\left(\exp\left(t x_i^2\right)\right) = \exp(-at)\,(1 - 2t)^{-k/2}$$

slide-65
SLIDE 65

Proof

Setting $a := k(1+\epsilon)$ and $t := \frac{1}{2} - \frac{1}{2(1+\epsilon)}$, we conclude

$$P(y > k(1+\epsilon)) \le (1+\epsilon)^{k/2} \exp\left(-\frac{k\epsilon}{2}\right) \le \exp\left(-\frac{k\epsilon^2}{8}\right)$$

slide-66
SLIDE 66

Projection onto a fixed subspace

[Illustration of the projections $P_{S_1}\vec{z}$ and $P_{S_2}\vec{z}$ of the same noise vector]

$$0.007 = \frac{\|P_{S_1}\vec{z}\|_2}{\|\vec{z}\|_2} < \frac{\|P_{S_2}\vec{z}\|_2}{\|\vec{z}\|_2} = 0.043 \qquad \frac{0.043}{0.007} = 6.14 \approx \sqrt{\frac{\dim(S_2)}{\dim(S_1)}} \quad \text{(not a coincidence)}$$

slide-67
SLIDE 67

Projection onto a fixed subspace

Let S be a k-dimensional subspace of $\mathbb{R}^n$ and $\vec{z} \in \mathbb{R}^n$ a vector of iid standard Gaussian noise. $\|P_S \vec{z}\|_2^2$ is a χ² random variable with k degrees of freedom, i.e. it has the same distribution as

$$y := \sum_{i=1}^{k} x_i^2$$

where $x_1, \ldots, x_k$ are iid standard Gaussians.
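A simulation of this result (a sketch; the subspace here is drawn at random for illustration, while the theorem holds for any fixed subspace):

```python
import numpy as np

rng = np.random.default_rng(5)
n, k, trials = 200, 20, 20_000

# Orthonormal basis U for a k-dimensional subspace of R^n
U, _ = np.linalg.qr(rng.standard_normal((n, k)))

z = rng.standard_normal((trials, n))
proj_sq = ((z @ U) ** 2).sum(axis=1)  # ||P_S z||^2 = ||U^T z||^2

# chi-squared with k degrees of freedom: mean k, variance 2k
print(proj_sq.mean(), proj_sq.var())
```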

SLIDES 68-75

Proof

Let $U U^T$ be a projection matrix for S, where the columns of $U \in \mathbb{R}^{n \times k}$ are orthonormal:

$$\|P_S \vec{z}\|_2^2 = \left\|U U^T \vec{z}\right\|_2^2 = \vec{z}^T U U^T U U^T \vec{z} = \vec{z}^T U U^T \vec{z} = \vec{w}^T \vec{w} = \sum_{i=1}^{k} \vec{w}[i]^2$$

$\vec{w} := U^T \vec{z}$ is Gaussian with mean zero and covariance matrix

$$\Sigma_{\vec{w}} = U^T \Sigma_{\vec{z}} U = U^T U = I$$

slide-76
SLIDE 76

Non-asymptotic Chernoff tail bound

Let $\vec{x}$ be an iid standard Gaussian random vector of dimension k. For any $\epsilon > 0$

$$P\left(k(1-\epsilon) < \|\vec{x}\|_2^2 < k(1+\epsilon)\right) \ge 1 - 2\exp\left(-\frac{k\epsilon^2}{8}\right)$$

slide-77
SLIDE 77

Projection onto a fixed subspace

Let S be a k-dimensional subspace of $\mathbb{R}^n$ and $\vec{z} \in \mathbb{R}^n$ a vector of iid standard Gaussian noise. For any $\epsilon > 0$

$$P\left(k(1-\epsilon) < \|P_S \vec{z}\|_2^2 < k(1+\epsilon)\right) \ge 1 - 2\exp\left(-\frac{k\epsilon^2}{8}\right)$$

slide-78
SLIDE 78

Gaussian random variables Gaussian random vectors Randomized projections SVD of a random matrix Randomized SVD

slide-79
SLIDE 79

Dimensionality reduction

◮ PCA preserves the most energy (ℓ2 norm)
◮ Problem 1: computationally expensive
◮ Problem 2: depends on all of the data
◮ (Possible) Solution: just project randomly!
◮ For a data set $\vec{x}_1, \vec{x}_2, \ldots \in \mathbb{R}^n$ compute $A\vec{x}_1, A\vec{x}_2, \ldots \in \mathbb{R}^k$, where $A \in \mathbb{R}^{k \times n}$ (k < n) has iid standard Gaussian entries

SLIDES 80-81

Fixed vector

Let A be an a × b matrix with iid standard Gaussian entries. If $\vec{v} \in \mathbb{R}^b$ is a deterministic vector with unit ℓ2 norm, then $A\vec{v}$ is an a-dimensional iid standard Gaussian vector.

Proof: $(A\vec{v})[i]$, $1 \le i \le a$, is Gaussian with mean zero and variance

$$\operatorname{Var}\left(A_{i,:}^T \vec{v}\right) = \vec{v}^T \Sigma_{A_{i,:}} \vec{v} = \vec{v}^T I \vec{v} = \|\vec{v}\|_2^2 = 1$$

slide-82
SLIDE 82

Non-asymptotic Chernoff tail bound

Let $\vec{x}$ be an iid standard Gaussian random vector of dimension k. For any $\epsilon > 0$

$$P\left(k(1-\epsilon) < \|\vec{x}\|_2^2 < k(1+\epsilon)\right) \ge 1 - 2\exp\left(-\frac{k\epsilon^2}{8}\right)$$

slide-83
SLIDE 83

Fixed vector

Let A be an a × b matrix with iid standard Gaussian entries. For any $\vec{v} \in \mathbb{R}^b$ with unit norm and any $\epsilon \in (0,1)$

$$\sqrt{a(1-\epsilon)} \le \|A\vec{v}\|_2 \le \sqrt{a(1+\epsilon)}$$

with probability at least $1 - 2\exp\left(-a\epsilon^2/8\right)$
slide-84
SLIDE 84

Johnson-Lindenstrauss lemma

Let A be a k × n matrix with iid standard Gaussian entries and let $\vec{x}_1, \ldots, \vec{x}_p \in \mathbb{R}^n$ be any fixed set of p deterministic vectors. For any pair $\vec{x}_i, \vec{x}_j$ and any $\epsilon \in (0,1)$

$$(1-\epsilon)\,\|\vec{x}_i - \vec{x}_j\|_2^2 \le \left\|\frac{1}{\sqrt{k}} A\vec{x}_i - \frac{1}{\sqrt{k}} A\vec{x}_j\right\|_2^2 \le (1+\epsilon)\,\|\vec{x}_i - \vec{x}_j\|_2^2$$

with probability at least $\frac{1}{p}$, as long as

$$k \ge \frac{16 \log(p)}{\epsilon^2}$$
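A direct numerical check of the lemma (a sketch; n, p, ǫ are arbitrary, and the fixed points are drawn at random here only to have something concrete to project):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, eps = 1000, 50, 0.5
k = int(np.ceil(16 * np.log(p) / eps**2))  # k >= 16 log(p) / eps^2

X = rng.standard_normal((p, n))            # p fixed data points in R^n
A = rng.standard_normal((k, n))            # iid standard Gaussian projection
Y = X @ A.T / np.sqrt(k)                   # projected points (1/sqrt(k)) A x_i

# Worst relative distortion of squared distances over all pairs
worst = 0.0
for i in range(p):
    for j in range(i + 1, p):
        d_orig = np.sum((X[i] - X[j]) ** 2)
        d_proj = np.sum((Y[i] - Y[j]) ** 2)
        worst = max(worst, abs(d_proj / d_orig - 1))
print(k, worst)
```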

slide-85
SLIDE 85

Proof

Aim: control the action of A on the normalized differences

$$\vec{v}_{ij} := \frac{\vec{x}_i - \vec{x}_j}{\|\vec{x}_i - \vec{x}_j\|_2}$$

Our event of interest is the intersection of the events

$$E_{ij} = \left\{ k(1-\epsilon) < \|A\vec{v}_{ij}\|_2^2 < k(1+\epsilon) \right\}, \qquad 1 \le i < j \le p$$
slide-86
SLIDE 86

Fixed vector

Let A be an a × b matrix with iid standard Gaussian entries. For any $\vec{v} \in \mathbb{R}^b$ with unit norm and any $\epsilon \in (0,1)$

$$\sqrt{a(1-\epsilon)} \le \|A\vec{v}\|_2 \le \sqrt{a(1+\epsilon)}$$

with probability at least $1 - 2\exp\left(-a\epsilon^2/8\right)$. This implies

$$P\left(E_{ij}^c\right) \le \frac{2}{p^2} \quad \text{if } k \ge \frac{16 \log(p)}{\epsilon^2}$$

slide-87
SLIDE 87

Union bound

For any events $S_1, S_2, \ldots, S_n$ in a probability space

$$P\left(\cup_i S_i\right) \le \sum_{i=1}^{n} P(S_i)$$

SLIDES 88-92

Proof

The number of events $E_{ij}$ equals $\binom{p}{2} = p(p-1)/2$. By the union bound

$$P\left(\cap_{i,j} E_{ij}\right) = 1 - P\left(\cup_{i,j} E_{ij}^c\right) \ge 1 - \sum_{i,j} P\left(E_{ij}^c\right) \ge 1 - \frac{p(p-1)}{2} \cdot \frac{2}{p^2} = \frac{1}{p}$$

slide-93
SLIDE 93

Dimensionality reduction for visualization

Motivation: visualize high-dimensional features projected onto 2D or 3D. Example: seeds from three different varieties of wheat (Kama, Rosa, and Canadian). Features:

◮ Area
◮ Perimeter
◮ Compactness
◮ Length of kernel
◮ Width of kernel
◮ Asymmetry coefficient
◮ Length of kernel groove

slide-94
SLIDE 94

Dimensionality reduction for visualization

[2D scatter plots of the wheat-seed features: randomized projection vs. PCA]

slide-95
SLIDE 95

Nearest neighbors in random subspace

Nearest-neighbors classification (Algorithm 4.2 in Lecture Notes 1) computes n distances in $\mathbb{R}^m$ for each new example. Cost: O(nmp) for p examples. Idea: use a k × m iid standard Gaussian matrix to project onto a k-dimensional space beforehand. Cost:

◮ kmn operations to project the training set
◮ kmp operations to project the test set
◮ knp operations to perform nearest-neighbor classification

Much faster!
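A minimal sketch of this pipeline on synthetic data (all names, dimensions, and labels below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(7)
m, n, k = 4096, 360, 50   # feature dimension, training size, projected dimension

# Hypothetical training set with labels, and one test vector near example 123
train = rng.standard_normal((n, m))
labels = rng.integers(0, 40, size=n)
test = train[123] + 0.01 * rng.standard_normal(m)

A = rng.standard_normal((k, m))        # random projection matrix
train_p = train @ A.T                  # project the training set once
test_p = A @ test                      # project each test example

# Nearest neighbor in the k-dimensional projected space
nearest = np.argmin(np.linalg.norm(train_p - test_p, axis=1))
print(nearest, labels[nearest])
```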

slide-96
SLIDE 96

Face recognition

Training set: 360 images of size 64 × 64 from 40 different subjects (9 each). Test set: 1 new image from each subject. We model each image as a vector in $\mathbb{R}^{4096}$ (m = 4096). To classify we:

1. Project onto a random k-dimensional subspace
2. Apply nearest-neighbor classification using the ℓ2-norm distance in $\mathbb{R}^k$
slide-97
SLIDE 97

Performance

[Plot of classification errors (average, maximum, minimum) as a function of the projection dimension, for dimensions up to 200]

slide-98
SLIDE 98

Nearest neighbor in R50

[Images: test image, its projection, the closest projection, and the corresponding training image]

slide-99
SLIDE 99

Gaussian random variables Gaussian random vectors Randomized projections SVD of a random matrix Randomized SVD

slide-100
SLIDE 100

Singular values of n × k matrix, k = 100

[Plot of the normalized singular values $\sigma_i/\sqrt{n}$, i = 1, ..., 100, for aspect ratios n/k = 2, 5, 10, 20, 50, 100, 200: they cluster around 1 as n/k grows]

slide-101
SLIDE 101

Singular values of n × k matrix, k = 1000

[Same plot for k = 1000: the normalized singular values $\sigma_i/\sqrt{n}$ again cluster around 1 as n/k grows]

slide-102
SLIDE 102

Singular values of a Gaussian matrix

Intuitively, as n grows,

$$A \approx U\left(\sqrt{n}\,I\right)V^T = \sqrt{n}\,U V^T,$$

since iid Gaussian vectors in high dimensions are almost orthogonal

slide-103
SLIDE 103

Singular values of a Gaussian matrix

Let A be an n × k matrix with iid standard Gaussian entries such that n > k. For any fixed $\epsilon > 0$, the singular values of A satisfy

$$\sqrt{n}\,(1-\epsilon) \le \sigma_k \le \sigma_1 \le \sqrt{n}\,(1+\epsilon)$$

with probability at least $1 - 1/k$, as long as

$$n > \frac{64k}{\epsilon^2} \log\frac{12}{\epsilon}$$

slide-104
SLIDE 104

Proof

Recall that

$$\sigma_1 = \max_{\{\|\vec{x}\|_2 = 1 \,:\, \vec{x} \in \mathbb{R}^k\}} \|A\vec{x}\|_2 \qquad \sigma_k = \min_{\{\|\vec{x}\|_2 = 1 \,:\, \vec{x} \in \mathbb{R}^k\}} \|A\vec{x}\|_2$$

so the bounds hold if, for every unit-norm $\vec{v} \in \mathbb{R}^k$,

$$\sqrt{n}\,(1-\epsilon) \le \|A\vec{v}\|_2 \le \sqrt{n}\,(1+\epsilon)$$

slide-105
SLIDE 105

Proof

Idea: use a union bound over all unit-norm vectors. Problem: there are infinitely many! Solution: apply the union bound on a finite set, then show that this is enough

slide-106
SLIDE 106

ǫ-net

An ǫ-net of a set $X \subseteq \mathbb{R}^k$ is a subset $N_\epsilon \subseteq X$ such that for every vector $\vec{x} \in X$ there exists $\vec{y} \in N_\epsilon$ for which $\|\vec{x} - \vec{y}\|_2 \le \epsilon$. The covering number $N(X, \epsilon)$ of a set X at scale ǫ is the minimal cardinality of an ǫ-net of X

slide-107
SLIDE 107

ǫ-net

[Illustration of an ǫ-net: every point of the set is within distance ǫ of a net point]

slide-108
SLIDE 108

Covering number of a sphere

The covering number of the sphere $S^{k-1}$ at scale ǫ satisfies

$$N\left(S^{k-1}, \epsilon\right) \le \left(\frac{2+\epsilon}{\epsilon}\right)^k \le \left(\frac{3}{\epsilon}\right)^k$$

slide-109
SLIDE 109

Covering number of a sphere

◮ Initialize $N_\epsilon$ to the empty set
◮ Choose a point $\vec{x} \in S^{k-1}$ such that $\|\vec{x} - \vec{y}\|_2 > \epsilon$ for every $\vec{y} \in N_\epsilon$, and add $\vec{x}$ to $N_\epsilon$
◮ Repeat until every point in $S^{k-1}$ is within ǫ of some point in $N_\epsilon$

slide-110
SLIDE 110

Covering number of a sphere

[Illustration: disjoint balls of radius ǫ/2 around the net points, all contained in a ball of radius 1 + ǫ/2]

SLIDES 111-114

Covering number of a sphere

$$\operatorname{Vol}\left(B^k_{1+\epsilon/2}\right) \ge \operatorname{Vol}\left(\cup_{\vec{x} \in N_\epsilon} B^k_{\epsilon/2}(\vec{x})\right) = |N_\epsilon| \operatorname{Vol}\left(B^k_{\epsilon/2}\right)$$

By multivariable calculus

$$\operatorname{Vol}\left(B^k_r\right) = r^k \operatorname{Vol}\left(B^k_1\right)$$

so we conclude

$$(1+\epsilon/2)^k \ge |N_\epsilon| \,(\epsilon/2)^k$$

slide-115
SLIDE 115

Proof

1. We prove the bounds

$$n(1-\epsilon_2) < \|A\vec{v}\|_2^2 < n(1+\epsilon_2),$$

where $\epsilon_2 := \epsilon/2$, on an $\epsilon_1 := \epsilon/4$ net of the sphere

2. We show that, by the triangle inequality, this implies that the bounds hold on the whole sphere

slide-116
SLIDE 116

Fixed vector

Let A be an a × b matrix with iid standard Gaussian entries. For any $\vec{v} \in \mathbb{R}^b$ with unit norm and any $\epsilon \in (0,1)$

$$\sqrt{a(1-\epsilon)} \le \|A\vec{v}\|_2 \le \sqrt{a(1+\epsilon)}$$

with probability at least $1 - 2\exp\left(-a\epsilon^2/8\right)$
SLIDES 117-121

Bound on the ǫ1-net

We define the event

$$E_{\vec{v},\epsilon_2} := \left\{ n(1-\epsilon_2)\,\|\vec{v}\|_2^2 \le \|A\vec{v}\|_2^2 \le n(1+\epsilon_2)\,\|\vec{v}\|_2^2 \right\}$$

Then

$$P\left(\cup_{\vec{v} \in N_{\epsilon_1}} E^c_{\vec{v},\epsilon_2}\right) \le \sum_{\vec{v} \in N_{\epsilon_1}} P\left(E^c_{\vec{v},\epsilon_2}\right) \le |N_{\epsilon_1}|\, P\left(E^c_{\vec{v},\epsilon_2}\right) \le 2\left(\frac{12}{\epsilon}\right)^k \exp\left(-\frac{n\epsilon^2}{32}\right) \le \frac{1}{k} \quad \text{if } n > \frac{64k}{\epsilon^2}\log\frac{12}{\epsilon}$$

SLIDES 122-126

Upper bound on the sphere

Let $\vec{x} \in S^{k-1}$. There exists $\vec{v} \in N_{\epsilon_1}$ such that $\|\vec{x} - \vec{v}\|_2 \le \epsilon/4$. Then

$$\|A\vec{x}\|_2 \le \|A\vec{v}\|_2 + \|A(\vec{x} - \vec{v})\|_2$$

$$\le \sqrt{n}\left(1 + \frac{\epsilon}{2}\right) + \|A(\vec{x} - \vec{v})\|_2 \quad \text{assuming the events } E_{\vec{v},\epsilon_2} \text{ hold for all } \vec{v} \in N_{\epsilon_1}$$

$$\le \sqrt{n}\left(1 + \frac{\epsilon}{2}\right) + \sigma_1 \|\vec{x} - \vec{v}\|_2 \le \sqrt{n}\left(1 + \frac{\epsilon}{2}\right) + \frac{\sigma_1 \epsilon}{4}$$

slide-127
SLIDE 127

Upper bound on the sphere

$$\sigma_1 \le \sqrt{n}\left(1 + \frac{\epsilon}{2}\right) + \frac{\sigma_1 \epsilon}{4}$$

$$\sigma_1 \le \sqrt{n}\,\frac{1 + \epsilon/2}{1 - \epsilon/4} = \sqrt{n}\left(1 + \epsilon - \frac{\epsilon(1-\epsilon)}{4-\epsilon}\right) \le \sqrt{n}\,(1+\epsilon)$$
SLIDES 128-133

Lower bound on the sphere

$$\|A\vec{x}\|_2 \ge \|A\vec{v}\|_2 - \|A(\vec{x} - \vec{v})\|_2$$

$$\ge \sqrt{n}\left(1 - \frac{\epsilon}{2}\right) - \|A(\vec{x} - \vec{v})\|_2 \quad \text{assuming the events } E_{\vec{v},\epsilon_2} \text{ hold for all } \vec{v} \in N_{\epsilon_1}$$

$$\ge \sqrt{n}\left(1 - \frac{\epsilon}{2}\right) - \sigma_1 \|\vec{x} - \vec{v}\|_2 \ge \sqrt{n}\left(1 - \frac{\epsilon}{2}\right) - \frac{\epsilon}{4}\,\sqrt{n}\,(1+\epsilon) \ge \sqrt{n}\,(1-\epsilon)$$

slide-134
SLIDE 134

Gaussian random variables Gaussian random vectors Randomized projections SVD of a random matrix Randomized SVD

slide-135
SLIDE 135

Fast SVD

For a matrix $M \in \mathbb{R}^{m \times n}$ which is approximately rank k:

1. Choose a small oversampling parameter p (usually 5 or slightly larger)
2. Find a matrix $\tilde{U} \in \mathbb{R}^{m \times (k+p)}$ with k + p orthonormal columns that approximately span the column space of M
3. Compute $W \in \mathbb{R}^{(k+p) \times n}$ defined by $W := \tilde{U}^T M$
4. Compute the SVD of $W = U_W S_W V_W^T$
5. Output $U := (\tilde{U} U_W)_{:,1:k}$, $S := (S_W)_{1:k,1:k}$ and $V := (V_W)_{:,1:k}$ as the SVD of M
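The five steps above can be sketched in NumPy as follows (an illustrative implementation without power iterations; the function name and the synthetic test matrix are choices made here, not from the deck):

```python
import numpy as np

def randomized_svd(M, k, p=5):
    """Sketch of the fast SVD: Gaussian sketch of the column space, then SVD of a small matrix."""
    m, n = M.shape
    rng = np.random.default_rng(0)
    A = rng.standard_normal((n, k + p))   # random Gaussian test matrix
    B = M @ A                             # samples the column space of M
    U_tilde, _ = np.linalg.qr(B)          # orthonormal basis, m x (k+p)
    W = U_tilde.T @ M                     # small (k+p) x n matrix
    U_W, S_W, Vt_W = np.linalg.svd(W, full_matrices=False)
    return (U_tilde @ U_W)[:, :k], S_W[:k], Vt_W[:k, :]

# On an exactly low-rank matrix the approximation should be essentially exact
rng = np.random.default_rng(1)
M = rng.standard_normal((200, 8)) @ rng.standard_normal((8, 300))  # rank 8
U, S, Vt = randomized_svd(M, k=10)
print(np.linalg.norm(M - U @ np.diag(S) @ Vt))
```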

slide-136
SLIDE 136

Fast SVD

For a matrix $M \in \mathbb{R}^{m \times n}$ which is approximately rank k:

1. Choose a small oversampling parameter p (usually 5 or slightly larger)
2. Find a matrix $\tilde{U} \in \mathbb{R}^{m \times (k+p)}$ with k + p orthonormal columns that approximately span the column space of M
3. Compute $W := \tilde{U}^T M \in \mathbb{R}^{(k+p) \times n}$ (cost O(kmn))
4. Compute the SVD of $W = U_W S_W V_W^T$ (cost O(k²n))
5. Output $U := (\tilde{U} U_W)_{:,1:k}$, $S := (S_W)_{1:k,1:k}$ and $V := (V_W)_{:,1:k}$ as the SVD of M

Complexity of a regular SVD is O(mn min{m, n})

SLIDES 137-138

Fast SVD

[Diagrams of the factor dimensions: $M \approx U_M S_M V_M^T$ with factors of sizes m × k, k × k and k × n, and $W = \tilde{U}^T M$ of size (k+p) × n]

SLIDES 139-145

Fast SVD

The method works exactly if (1) M is rank k and (2) $\tilde{U}$ spans the column space of M:

$$M = \tilde{U}\tilde{U}^T M = \tilde{U} W = \tilde{U} U_W S_W V_W^T$$

where $U := \tilde{U} U_W$ is an m × k matrix with orthonormal columns:

$$U^T U = U_W^T \tilde{U}^T \tilde{U} U_W = U_W^T U_W = I$$

SLIDES 146-148

Power iterations

For approximately low-rank matrices, performance depends on the gap between $\sigma_k$ and $\sigma_{k+1}$. The gap can be increased by power iterations. This method is only used when computing $\tilde{U}$; the input is

$$\tilde{M} := \left(M M^T\right)^q M = \left(U_M S_M^2 U_M^T\right)^q U_M S_M V_M^T = U_M S_M^{2q+1} V_M^T$$
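The column-space estimate with power iterations can be sketched as follows (a hypothetical implementation; the intermediate QR re-orthonormalization is a standard numerical-stability step not spelled out on the slide):

```python
import numpy as np

def column_space_estimate(M, k, p=5, q=2, seed=0):
    """Randomized column-space estimate with q power iterations (a sketch)."""
    rng = np.random.default_rng(seed)
    B = M @ rng.standard_normal((M.shape[1], k + p))
    for _ in range(q):
        B = M @ (M.T @ B)        # multiply by (M M^T) to sharpen the spectrum
        B, _ = np.linalg.qr(B)   # re-orthonormalize for numerical stability
    U_tilde, _ = np.linalg.qr(B)
    return U_tilde

# On an exactly rank-3 matrix the estimate captures the column space almost exactly
rng = np.random.default_rng(2)
M = rng.standard_normal((100, 3)) @ rng.standard_normal((3, 80))
U_tilde = column_space_estimate(M, k=3)
residual = M - U_tilde @ (U_tilde.T @ M)
print(np.linalg.norm(residual))
```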

slide-149
SLIDE 149

Problem

How do we estimate the column space of a low-rank matrix?

◮ Project onto a random subspace with slightly larger dimension
◮ Select random columns

slide-150
SLIDE 150

Randomized column-space approximation

For a matrix $M \in \mathbb{R}^{m \times n}$ which is approximately rank k:

1. Create an n × (k + p) iid standard Gaussian matrix A, where p is a small integer (e.g. 5)
2. Compute the m × (k + p) matrix B = MA
3. Orthonormalize the columns of B and output them as a matrix $\tilde{U} \in \mathbb{R}^{m \times (k+p)}$
4. Apply power iterations if necessary
SLIDES 151-153

Randomized column-space approximation

$$B = MA = U_M S_M V_M^T A = U_M S_M C$$

◮ If M is low rank, C is a k × (k + p) iid standard Gaussian matrix
◮ Otherwise, C is a min{m, n} × (k + p) iid standard Gaussian matrix

slide-154
SLIDE 154

Randomized SVD of a video

◮ Video with 160 frames of size 1080 × 1920
◮ We interpret each frame as a vector in $\mathbb{R}^{2{,}073{,}600}$
◮ The matrix formed by these vectors is approximately low rank
◮ Regular SVD takes 12 seconds (281.1 seconds if we take 691 frames)
◮ Fast SVD with the randomized-column-space estimate takes 5.8 seconds (10.4 seconds for 691 frames) to obtain a rank-10 approximation (q = 2, p = 7)

slide-155
SLIDE 155

True singular values

[Plot of the 160 true singular values of the video matrix, rapidly decaying]

slide-156
SLIDE 156

Left singular vector approximation

[Images: true and estimated left singular vectors]

slide-157
SLIDE 157

Random column selection

For a matrix $M \in \mathbb{R}^{m \times n}$ which is approximately rank k:

1. Select a random subset of column indices $I := \{i_1, i_2, \ldots, i_{k'}\}$ with $k' \ge k$
2. Orthonormalize the columns of the submatrix corresponding to I,

$$M_I := \begin{pmatrix} M_{:,i_1} & M_{:,i_2} & \cdots & M_{:,i_{k'}} \end{pmatrix},$$

and output them as a matrix $\tilde{U} \in \mathbb{R}^{m \times k'}$

slide-158
SLIDE 158

Random column selection

(Possible) Problem: if the right singular vectors are sparse, this will not work, since

$$M_I = U_M S_M \left(V_M^T\right)_I$$

slide-159
SLIDE 159

Example

$$M := \begin{pmatrix} -3 & 2 & 2 & 2 \\ 3 & 2 & 2 & 2 \\ -3 & 2 & 2 & 2 \\ 3 & 2 & 2 & 2 \end{pmatrix}$$

slide-160
SLIDE 160

Example

$$M = U_M S_M V_M^T = \begin{pmatrix} 0.5 & -0.5 \\ 0.5 & 0.5 \\ 0.5 & -0.5 \\ 0.5 & 0.5 \end{pmatrix} \begin{pmatrix} 6.9282 & 0 \\ 0 & 6 \end{pmatrix} \begin{pmatrix} 0 & 0.577 & 0.577 & 0.577 \\ 1 & 0 & 0 & 0 \end{pmatrix}$$

slide-161
SLIDE 161

Example, I = {2, 3}

$$M_I = \begin{pmatrix} 2 & 2 \\ 2 & 2 \\ 2 & 2 \\ 2 & 2 \end{pmatrix} = \begin{pmatrix} 0.5 \\ 0.5 \\ 0.5 \\ 0.5 \end{pmatrix} \; 5.6569 \; \begin{pmatrix} 0.7071 & 0.7071 \end{pmatrix}$$

The column space of $M_I$ misses the second left singular vector of M entirely.
slide-162
SLIDE 162

Randomized SVD of a video

◮ Video with 160 frames of size 1080 × 1920
◮ We interpret each frame as a vector in $\mathbb{R}^{2{,}073{,}600}$
◮ The matrix formed by these vectors is approximately low rank
◮ Regular SVD takes 12 seconds (281.1 seconds if we take 691 frames)
◮ Fast SVD with the random-column-selection estimate takes 5.2 seconds to obtain a rank-10 approximation (k′ = 17)
slide-163
SLIDE 163

Left singular vector approximation

[Images: true and estimated left singular vectors]

slide-164
SLIDE 164

Singular value approximation

[Plot of the first 10 singular values: true values vs. the Gaussian-projection and column-subsampling estimates]