SLIDE 1

Reconstruction from Anisotropic Random Measurements

Mark Rudelson and Shuheng Zhou

The University of Michigan, Ann Arbor

Coding, Complexity, and Sparsity Workshop, 2013, Ann Arbor, Michigan

August 7, 2013

SLIDE 2

Want to estimate a parameter β ∈ R^p.

Example: How is a response y ∈ R, related to Parkinson's disease, affected by a set of genes among the Chinese population? Construct a linear model: y = βᵀx + ε, where E(y | x) = βᵀx.

◮ Parameter: the non-zero entries of β (the sparsity of β) identify a subset of genes and indicate how much they influence y.

Take a random sample (X, Y), and use the sample to estimate β; that is, we have Y = Xβ + ε.

SLIDE 3

Model selection and parameter estimation

When can we approximately recover β from n noisy observations Y?

Questions:
◮ How many measurements n do we need in order to recover the non-zero positions in β?
◮ How does n scale with p or s, where s is the number of non-zero entries of β?
◮ What assumptions about the data matrix X are reasonable?

SLIDE 4

Sparse recovery

When β is known to be s-sparse for some 1 ≤ s ≤ n, which means that at most s of the coefficients of β can be non-zero:

Assume every 2s columns of X are linearly independent (an identifiability condition, reasonable once n ≥ 2s):

Λmin(2s) := min_{υ≠0, υ 2s-sparse} ‖Xυ‖₂² / (n ‖υ‖₂²) > 0.

Proposition (Candès-Tao 05). Suppose that any 2s columns of the n × p matrix X are linearly independent. Then any s-sparse signal β ∈ R^p can be reconstructed uniquely from Xβ.
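As a sanity check (not from the slides), Λmin(2s) can be computed by brute force for a toy design matrix: over each fixed support T of size 2s the minimum is the smallest eigenvalue of X_Tᵀ X_T / n. The enumeration is exponential in 2s and only meant for tiny p; all sizes below are illustrative.

```python
import itertools
import numpy as np

def lambda_min_2s(X, s):
    """Brute-force Lambda_min(2s) = min over 2s-sparse v != 0 of
    ||X v||_2^2 / (n ||v||_2^2).  Over a fixed support T this equals the
    smallest eigenvalue of X_T^T X_T / n, so enumerate supports (tiny p only).
    """
    n, p = X.shape
    best = np.inf
    for T in itertools.combinations(range(p), 2 * s):
        lam = np.linalg.eigvalsh(X[:, T].T @ X[:, T] / n).min()
        best = min(best, lam)
    return best

rng = np.random.default_rng(0)
n, p, s = 8, 12, 2
X = rng.standard_normal((n, p))   # Gaussian design: any 2s columns are a.s. independent
print(lambda_min_2s(X, s))        # strictly positive: beta is identifiable
```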

SLIDE 5

ℓ0-minimization

How to reconstruct an s-sparse signal β ∈ R^p from the measurements Y = Xβ, given Λmin(2s) > 0? Let β̂ be the unique sparsest solution to Xβ = Y:

β̂ = arg min_{β : Xβ=Y} ‖β‖₀, where ‖β‖₀ := #{1 ≤ i ≤ p : βi ≠ 0} is the sparsity of β.

Unfortunately, ℓ0-minimization is computationally intractable (in fact, it is an NP-complete problem).
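The intractability is visible in the only general-purpose algorithm available: exhaustive search over supports. A small illustrative sketch (toy sizes, not from the slides):

```python
import itertools
import numpy as np

def l0_min(X, Y, s_max):
    """Exhaustive l0-minimization: scan supports of growing size and return
    the sparsest beta (as a dense vector in R^p) with X beta = Y."""
    n, p = X.shape
    if np.allclose(Y, 0):
        return np.zeros(p)                  # the 0-sparse solution
    for s in range(1, s_max + 1):
        for T in itertools.combinations(range(p), s):
            cols = list(T)
            beta_T, *_ = np.linalg.lstsq(X[:, cols], Y, rcond=None)
            if np.allclose(X[:, cols] @ beta_T, Y):   # exact fit found
                beta = np.zeros(p)
                beta[cols] = beta_T
                return beta
    return None

rng = np.random.default_rng(1)
n, p = 6, 10
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[[2, 7]] = [1.5, -2.0]           # 2-sparse signal, n = 6 >= 2s
beta_hat = l0_min(X, X @ beta_true, 2)
print(np.allclose(beta_hat, beta_true))   # unique recovery, a.s. for Gaussian X
```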

SLIDE 6

Basis pursuit

Consider the following convex optimization problem:

β* := arg min_{β : Xβ=Y} ‖β‖₁.

Basis pursuit works whenever the n × p measurement matrix X is sufficiently incoherent. RIP (Candès-Tao 05) requires that for all T ⊂ {1, ..., p} with |T| ≤ s and for all coefficient sequences (cj)_{j∈T},

(1 − δs) ‖c‖₂² ≤ ‖X_T c‖₂² / n ≤ (1 + δs) ‖c‖₂²

holds for some 0 < δs < 1 (the s-restricted isometry constant). The "good" matrices for compressed sensing should satisfy these inequalities for the largest possible s.
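Basis pursuit is a linear program, so a minimal sketch can be built on `scipy.optimize.linprog` via the standard split β = u − v with u, v ≥ 0. This is only one possible implementation (the slides do not prescribe one), and it assumes SciPy is available; all sizes are illustrative.

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(X, Y):
    """Solve min ||b||_1 s.t. X b = Y as an LP: write b = u - v with
    u, v >= 0 and minimize sum(u) + sum(v) = ||b||_1."""
    n, p = X.shape
    c = np.ones(2 * p)
    A_eq = np.hstack([X, -X])            # X u - X v = Y
    res = linprog(c, A_eq=A_eq, b_eq=Y, bounds=(0, None))
    return res.x[:p] - res.x[p:]

rng = np.random.default_rng(2)
n, p = 20, 40
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[[3, 17, 31]] = [1.0, -2.0, 0.5]     # 3-sparse signal, n = 20 measurements
beta_hat = basis_pursuit(X, X @ beta)
print(np.max(np.abs(beta_hat - beta)))   # tiny: exact recovery by l1-minimization
```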

SLIDE 7

Restricted Isometry Property (RIP): examples

◮ For a Gaussian random matrix, or any sub-Gaussian ensemble, RIP holds with s ≍ n / log(p/n).

◮ For the random Fourier ensemble, or randomly sampled rows of orthonormal matrices, RIP holds for s = O(n / log⁴ p).

◮ For a random matrix composed of columns that are independent isotropic vectors with log-concave densities, RIP holds for s = O(n / log²(p/n)).

References: Candès-Tao 05, 06, Rudelson-Vershynin 05, Donoho 06, Baraniuk et al. 08, Mendelson et al. 08, Adamczak et al. 09.

SLIDE 8

Basis pursuit for high dimensional data

These algorithms are also robust to noise, and RIP will be replaced by more relaxed conditions. In particular, the isotropicity condition, which has been assumed in all the literature cited above, needs to be dropped.

Let Xi ∈ R^p, i = 1, ..., n, be the i.i.d. random row vectors of the design matrix X.

Covariance matrix: Σ(Xi) = E Xi ⊗ Xi = E Xi Xiᵀ.

Sample covariance matrix: Σ̂n = (1/n) ∑_{i=1}^n Xi ⊗ Xi = (1/n) ∑_{i=1}^n Xi Xiᵀ.

Xi is isotropic if Σ(Xi) = I, in which case E ‖Xi‖₂² = p.

SLIDE 9

Sparse recovery for Y = Xβ + ε

Lasso (Tibshirani 96), a.k.a. Basis Pursuit (Chen, Donoho and Saunders 98, and others):

β̂ = arg min_β ‖Y − Xβ‖₂² / (2n) + λn ‖β‖₁,

where the scaling factor 1/(2n) is chosen for convenience.

Dantzig selector (Candès-Tao 07):

(DS) arg min_{β̂∈R^p} ‖β̂‖₁ subject to ‖Xᵀ(Y − Xβ̂)/n‖∞ ≤ λn.

References: Greenshtein-Ritov 04, Meinshausen-Bühlmann 06, Zhao-Yu 06, Bunea et al. 07, Candès-Tao 07, van de Geer 08, Zhang-Huang 08, Wainwright 09, Koltchinskii 09, Meinshausen-Yu 09, Bickel et al. 09, and others.
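As an illustration (not part of the slides), the Lasso objective above can be minimized with plain proximal gradient descent (ISTA). The step size 1/L and the choice of λn below are illustrative; λn is taken of order σ√(2 log p / n) as in the theory.

```python
import numpy as np

def lasso_ista(X, Y, lam, n_iter=500):
    """Minimize ||Y - X b||_2^2 / (2n) + lam * ||b||_1 by proximal gradient
    (ISTA): a gradient step on the quadratic part, then soft-thresholding."""
    n, p = X.shape
    L = np.linalg.eigvalsh(X.T @ X / n).max()    # Lipschitz constant of the gradient
    beta = np.zeros(p)
    for _ in range(n_iter):
        z = beta - (X.T @ (X @ beta - Y) / n) / L
        beta = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft-threshold
    return beta

rng = np.random.default_rng(3)
n, p = 100, 200
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[[5, 50, 150]] = [2.0, -1.5, 1.0]
Y = X @ beta_true + 0.1 * rng.standard_normal(n)     # sigma = 0.1 noise
lam = 0.1 * np.sqrt(2 * np.log(p) / n)               # ~ sigma * sqrt(2 log p / n)
beta_hat = lasso_ista(X, Y, lam)
print(np.flatnonzero(np.abs(beta_hat) > 0.5))        # recovers the support {5, 50, 150}
```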
SLIDE 10

The Cone Constraint

For an appropriately chosen λn, the solution of the Lasso or the Dantzig selector satisfies (under i.i.d. Gaussian noise), with high probability,

υ := β̂ − β ∈ C(s, k0),

with k0 = 1 for the Dantzig selector and k0 = 3 for the Lasso.

Object of interest: for 1 ≤ s0 ≤ p and a positive number k0,

C(s0, k0) = { x ∈ R^p | ∃ J ⊂ {1, ..., p}, |J| = s0, s.t. ‖x_{J^c}‖₁ ≤ k0 ‖x_J‖₁ }.

This object has appeared in earlier work in the noiseless setting.

References: Donoho-Huo 01, Elad-Bruckstein 02, Feuer-Nemirovski 03, Candès-Tao 07, Bickel-Ritov-Tsybakov 09, Cohen-Dahmen-DeVore 09.

SLIDE 11

The Lasso solution

SLIDE 12

Restricted Eigenvalue (RE) condition

Object of interest:

C(s0, k0) = { x ∈ R^p | ∃ J ⊂ {1, ..., p}, |J| = s0, s.t. ‖x_{J^c}‖₁ ≤ k0 ‖x_J‖₁ }.

Definition

A matrix A_{q×p} satisfies the RE(s0, k0, A) condition with parameter K(s0, k0, A) if for any υ ≠ 0,

1/K(s0, k0, A) := min_{J⊆{1,...,p}, |J|≤s0}  min_{‖υ_{J^c}‖₁ ≤ k0 ‖υ_J‖₁}  ‖Aυ‖₂ / ‖υ_J‖₂ > 0.

References: van de Geer 07, Bickel-Ritov-Tsybakov 09, van de Geer-Bühlmann 09.

SLIDE 13

An elementary estimate

Lemma

For each vector υ ∈ C(s0, k0), let T0 denote the locations of the s0 largest coefficients of υ in absolute value. Then

‖υ_{T0^c}‖₁ ≤ k0 ‖υ_{T0}‖₁,  and  ‖υ_{T0}‖₂ ≥ ‖υ‖₂ / √(1 + k0).

Implication: let A be a q × p matrix such that the RE(s0, 3k0, A) condition holds with 0 < K(s0, 3k0, A) < ∞. Then for all υ ∈ C(s0, k0) ∩ S^{p−1},

‖Aυ‖₂ ≥ ‖υ_{T0}‖₂ / K(s0, k0, A) ≥ 1 / (K(s0, k0, A) · √(1 + k0)) > 0.
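Both claims of the lemma can be checked numerically on a randomly generated vector of the cone C(s0, k0); the construction below (support J, rescaling of the off-support part) is a hypothetical example, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(4)
p, s0, k0 = 50, 5, 3.0

# Construct v in the cone C(s0, k0): choose support J = {0,...,s0-1} and shrink
# the off-support part so that ||v_{J^c}||_1 <= k0 ||v_J||_1 holds strictly.
v = rng.standard_normal(p)
J = np.arange(s0)
v[s0:] *= 0.9 * k0 * np.abs(v[J]).sum() / np.abs(v[s0:]).sum()

T0 = np.argsort(-np.abs(v))[:s0]     # locations of the s0 largest |coordinates|
mask = np.zeros(p, dtype=bool)
mask[T0] = True

claim1 = np.abs(v[~mask]).sum() <= k0 * np.abs(v[mask]).sum()
claim2 = np.linalg.norm(v[mask]) >= np.linalg.norm(v) / np.sqrt(1 + k0)
print(claim1, claim2)                # both inequalities of the lemma hold
```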

SLIDE 14

Sparse eigenvalues

Definition

For m ≤ p, we define the largest and smallest m-sparse eigenvalues of a q × p matrix A to be

ρmax(m, A) := max_{t∈R^p, t≠0, m-sparse} ‖At‖₂² / ‖t‖₂²,
ρmin(m, A) := min_{t∈R^p, t≠0, m-sparse} ‖At‖₂² / ‖t‖₂².

If RE(s0, k0, A) is satisfied with k0 ≥ 1, then the square submatrices of size 2s0 of AᵀA are necessarily positive definite; that is, ρmin(2s0, A) > 0.
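For tiny p the m-sparse eigenvalues can be computed exactly by enumerating supports, since over a fixed support T of size m they are the extreme eigenvalues of A_Tᵀ A_T. A brute-force sketch (not from the slides, illustrative sizes):

```python
import itertools
import numpy as np

def sparse_eigs(A, m):
    """Exact m-sparse eigenvalues of A by support enumeration: over a fixed
    support T of size m they are the extreme eigenvalues of A_T^T A_T."""
    q, p = A.shape
    lo, hi = np.inf, -np.inf
    for T in itertools.combinations(range(p), m):
        w = np.linalg.eigvalsh(A[:, T].T @ A[:, T])
        lo, hi = min(lo, w[0]), max(hi, w[-1])
    return lo, hi

rng = np.random.default_rng(5)
A = rng.standard_normal((8, 10))
rho_min, rho_max = sparse_eigs(A, 3)
print(rho_min, rho_max)   # rho_min > 0: every 3-column submatrix has full rank
```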

SLIDE 15

Examples of matrices A which satisfy the Restricted Eigenvalue condition, but not RIP (Raskutti, Wainwright, and Yu 10):

Spiked identity matrix: for a ∈ [0, 1),

Σ_{p×p} = (1 − a) I_{p×p} + a 1 1ᵀ, where 1 ∈ R^p is the vector of all ones; ρmin(Σ) > 0.

Then for every s0 × s0 principal submatrix Σ_SS we have

ρmax(Σ_SS) / ρmin(Σ_SS) = (1 + a(s0 − 1)) / (1 − a).

The largest sparse eigenvalue → ∞ as s0 → ∞, but ‖Σ^{1/2} ej‖₂ = 1 remains bounded.
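The ratio above is easy to confirm numerically; the sizes p, a, s0 below are arbitrary illustrative choices.

```python
import numpy as np

p, a, s0 = 100, 0.5, 10
Sigma = (1 - a) * np.eye(p) + a * np.ones((p, p))   # spiked identity matrix

S = Sigma[:s0, :s0]                  # any s0 x s0 principal submatrix
w = np.linalg.eigvalsh(S)
ratio = w[-1] / w[0]                 # rho_max(S) / rho_min(S)
predicted = (1 + a * (s0 - 1)) / (1 - a)
print(ratio, predicted)              # both equal 11.0 for these parameters
```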

SLIDE 16

Motivation: to construct classes of design matrices for which the Restricted Eigenvalue condition is satisfied.

◮ Design matrix X with merely independent rows, rather than independent entries: e.g., for some matrix A_{q×p}, consider X = ΨA, where the rows of the matrix Ψ_{n×q} are independent isotropic vectors with subgaussian marginals, and RE(s0, (1 + ε)k0, A) holds for some ε > 0, p > s0 ≥ 0, and k0 > 0.

◮ Design matrix X consisting of independent identically distributed rows with bounded entries, whose covariance matrix Σ(Xi) = E Xi Xiᵀ satisfies RE(s0, (1 + ε)k0, Σ^{1/2}).

The rows of X will be sampled from some distribution in R^p; the distribution may be highly non-Gaussian and perhaps discrete.

SLIDE 17

Outline

Introduction

The main results

◮ The reduction principle
◮ Applications of the reduction principle

Ingredients of the proof

Conclusion

SLIDE 18

Notation

Let e1, ..., ep be the canonical basis of R^p. For a set J ⊂ {1, ..., p}, denote EJ = span{ej : j ∈ J}. For a matrix A, we use ‖A‖₂ to denote its operator norm. For a set V ⊂ R^p, we let conv V denote the convex hull of V. For a finite set Y, the cardinality is denoted by |Y|. Let B₂^p and S^{p−1} be the unit Euclidean ball and the unit sphere, respectively.

SLIDE 19

The reduction principle:

Theorem

Let E = ∪_{|J|=d} EJ for d = d(3k0) < p, where

d(3k0) = s0 + s0 · max_j ‖Aej‖₂² · 16 K²(s0, 3k0, A) (3k0)² (3k0 + 1) / δ²,

and E denotes R^p otherwise. Let Ψ be a matrix such that for all x ∈ AE,

(1 − δ) ‖x‖₂ ≤ ‖Ψx‖₂ ≤ (1 + δ) ‖x‖₂.

Then RE(s0, k0, ΨA) holds with

0 < K(s0, k0, ΨA) ≤ K(s0, k0, A) / (1 − 5δ).

In words: if the matrix Ψ acts as an almost isometry on the images of the d-sparse vectors under A, then the product ΨA satisfies the RE condition with the smaller parameter k0.

SLIDE 20

Reformulation of the reduction principle:

Theorem (restricted isometry)

Let Ψ be a matrix such that for all x ∈ AE,

(1 − δ) ‖x‖₂ ≤ ‖Ψx‖₂ ≤ (1 + δ) ‖x‖₂.

Then for any x ∈ A(C(s0, k0)) ∩ S^{q−1},

(1 − 5δ) ≤ ‖Ψx‖₂ ≤ (1 + 3δ).

If the matrix Ψ acts as an almost isometry on the images of the d-sparse vectors under A, then it acts the same way on the images of C(s0, k0). The problem is thus reduced to checking that the almost isometry property holds for all vectors in some low-dimensional subspaces, which is easier than checking the RE property directly.

SLIDE 21

Definition: subgaussian random vectors

Let Y be a random vector in R^p.

1. Y is called isotropic if for every y ∈ R^p, E |⟨Y, y⟩|² = ‖y‖₂².

2. Y is ψ2 with constant α if for every y ∈ R^p,

‖⟨Y, y⟩‖_{ψ2} := inf{ t : E exp(⟨Y, y⟩² / t²) ≤ 2 } ≤ α ‖y‖₂.

The ψ2 condition on a scalar random variable V is equivalent to the subgaussian tail decay of V, which means that for some constant c,

P(|V| > t) ≤ 2 exp(−t²/c²), for all t > 0.

A random vector Y in R^p is subgaussian if the one-dimensional marginals ⟨Y, y⟩ are subgaussian random variables for all y ∈ R^p.

SLIDE 22

The first application of the reduction principle

Let A be a q × p matrix satisfying the RE(s0, 3k0, A) condition. Let m = min(d, p), where

d = s0 + s0 · max_j ‖Aej‖₂² · 16 K²(s0, 3k0, A) (3k0)² (3k0 + 1) / δ².

Theorem

Let Ψ be an n × q matrix whose rows are independent isotropic ψ2 random vectors in R^q with constant α. Suppose

n ≥ (2000 m α⁴ / δ²) log(60 e p / (m δ)).

Then with probability at least 1 − 2 exp(−δ² n / (2000 α⁴)), the RE(s0, k0, (1/√n)ΨA) condition holds with

0 < K(s0, k0, (1/√n)ΨA) ≤ K(s0, k0, A) / (1 − δ).

SLIDE 23

Examples of subgaussian vectors

◮ The random vector Y with i.i.d. N(0, 1) coordinates.

◮ The discrete Gaussian vector: a random vector taking values on the integer lattice Z^p with distribution P(X = m) = C exp(−‖m‖₂²/2) for m ∈ Z^p.

◮ A vector with independent centered bounded random coordinates. In particular, vectors with random symmetric Bernoulli coordinates, in other words, random vertices of the discrete cube.
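A quick empirical check (not from the slides) that the third example is isotropic: for a random vertex Y of the discrete cube, E YYᵀ = I, so the sample covariance of many draws should be close to the identity.

```python
import numpy as np

rng = np.random.default_rng(6)
p, n = 5, 200_000

# Random vertices of the discrete cube: +/-1 coordinates with equal probability.
Y = rng.choice([-1.0, 1.0], size=(n, p))
Sigma_hat = Y.T @ Y / n              # sample covariance; E Y Y^T = I exactly
print(np.max(np.abs(Sigma_hat - np.eye(p))))   # O(1/sqrt(n)), i.e. small
```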

SLIDE 24

Previous results on (sub)Gaussian random vectors

Raskutti, Wainwright, and Yu 10: RE(s0, k0, X) holds for a random Gaussian measurement/design matrix X which consists of n = O(s0 log p) independent copies of a Gaussian random vector Y ∼ N_p(0, Σ), assuming that the RE condition holds for Σ^{1/2}. Their proof relies on a deep result from the theory of Gaussian random processes, Gordon's Minimax Lemma.

To establish the RE condition for more general classes of random matrices, we introduce a new approach based on geometric functional analysis, namely the reduction principle.

The bound n = O(s0 log p) can be improved to the optimal one, n = O(s0 log(p/s0)), when RE(s0, k0, Σ^{1/2}) is replaced with RE(s0, (1 + ε)k0, Σ^{1/2}) for any ε > 0.

SLIDE 25

In Zhou 09, subgaussian random matrices of the form X = ΨΣ^{1/2} were considered, where Σ is a p × p positive semidefinite matrix: X satisfies the RE(s0, k0) condition with overwhelming probability if, for K := K(s0, k0, Σ^{1/2}),

n > (9c′α⁴/δ²) (2 + k0)² K² ( 4 ρmax(s0, Σ^{1/2}) s0 log(5ep/s0) ∧ s0 log p ).

The analysis there used a result in Mendelson et al. 07, 08. The current result involves neither ρmax(s0, A) nor the global parameters of the matrices A and Ψ, such as the norm or the smallest singular value.

Recall the spiked identity matrix: for a ∈ [0, 1), Σ_{p×p} = (1 − a) I_{p×p} + a 1 1ᵀ satisfies the RE condition, while ρmax(s0, Σ^{1/2}) grows linearly with s0 and the maximum of ‖Σ^{1/2} ej‖₂ = 1 stays bounded.
SLIDE 26

Design matrices with uniformly bounded entries

Let Y ∈ R^p be a random vector such that ‖Y‖∞ ≤ M a.s., and denote Σ = E YYᵀ. Let X be an n × p matrix whose rows X1, ..., Xn are independent copies of Y. Set

d = s0 + s0 · max_j ‖Σ^{1/2} ej‖₂² · 16 K²(s0, 3k0, Σ^{1/2}) (3k0)² (3k0 + 1) / δ².

Theorem

Assume that d ≤ p and ρ = ρmin(d, Σ^{1/2}) > 0. Let Σ satisfy the RE(s0, 3k0, Σ^{1/2}) condition. Suppose

n ≥ C (M² d log p / (ρ δ²)) · log³( C M² d log p / (ρ δ²) ).

Then with probability at least 1 − exp(−δρn/(6M²d)), RE(s0, k0, X) holds for the matrix X/√n with

0 < K(s0, k0, X/√n) ≤ K(s0, k0, Σ^{1/2}) / (1 − δ).

SLIDE 27

Remarks on applying the reduction principle

To analyze different classes of random design matrices:

◮ Unlike the case of a random matrix with subgaussian marginals, the estimate in the second example contains the minimal sparse singular value ρmin(d, Σ^{1/2}).

◮ The reconstruction of sparse signals by subgaussian design matrices or by the random Fourier ensemble was analyzed in the literature before, however only under RIP assumptions.

◮ The reduction principle can be applied to other types of random variables: e.g., random vectors with heavy-tailed marginals, or random vectors with log-concave densities.

References: Rudelson-Vershynin 05, Baraniuk et al. 08, Mendelson et al. 08, Vershynin 11a, b, Adamczak et al. 09.

SLIDE 28

Maurey’s empirical approximation argument (Pisier 81)

Let u1, ..., uM ∈ R^q and let y ∈ conv(u1, ..., uM):

y = ∑_{j∈{1,...,M}} αj uj, where αj ≥ 0 and ∑_j αj = 1.

Then there exists a set L ⊂ {1, 2, ..., M} with

|L| ≤ m = 4 max_{j∈{1,...,M}} ‖uj‖₂² / ε²,

and a vector y′ ∈ conv(uj : j ∈ L) such that ‖y′ − y‖₂ ≤ ε.

Proof: an application of the probabilistic method. If we only want to approximate y, rather than exactly represent it as a convex combination of u1, ..., uM, this is possible with much fewer points, namely u1, ..., u_{|L|}.
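Maurey's argument is constructive and can be run directly: sample m indices with probabilities αj and average. The sketch below (not from the slides) uses unit vectors so that max_j ‖uj‖₂² = 1; all sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)
q, M = 30, 1000

U = rng.standard_normal((M, q))
U /= np.linalg.norm(U, axis=1, keepdims=True)    # unit vectors: max_j ||u_j||_2 = 1

alpha = rng.random(M)
alpha /= alpha.sum()                              # convex weights
y = alpha @ U                                     # y in conv(u_1, ..., u_M)

eps = 0.3
m = int(np.ceil(4 * 1.0 / eps**2))                # m = 4 max_j ||u_j||_2^2 / eps^2

idx = rng.choice(M, size=m, p=alpha)              # sample Y_1, ..., Y_m i.i.d. ~ alpha
y_prime = U[idx].mean(axis=0)                     # in conv(u_l : l in L), |L| <= m
print(np.linalg.norm(y_prime - y))                # <= eps with high probability
```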

SLIDE 29

Let y = ∑_{j∈{1,...,M}} αj uj, where αj ≥ 0 and ∑_j αj = 1.

(Figure: the points u1, ..., uM with y inside their convex hull.)

Goal: to find a vector y′ ∈ conv(uj : j ∈ L) such that ‖y′ − y‖₂ ≤ ε.
SLIDE 30

Let Y be a random vector in R^q such that P(Y = uℓ) = αℓ, ℓ ∈ {1, ..., M}. Then

E(Y) = ∑_{ℓ∈{1,...,M}} αℓ uℓ = y.

(Figure: y as the expectation of the random vertex Y.)
SLIDE 31

Let Y1, ..., Ym be independent copies of Y, and let ε1, ..., εm be i.i.d. mean-zero ±1 Bernoulli random variables, chosen independently of Y1, ..., Ym. By the standard symmetrization argument we have

E ‖ y − (1/m) ∑_{j=1}^m Yj ‖₂²  ≤  4 E ‖ (1/m) ∑_{j=1}^m εj Yj ‖₂²  =  (4/m²) ∑_{j=1}^m E ‖Yj‖₂²  ≤  4 max_{ℓ∈{1,...,M}} ‖uℓ‖₂² / m  ≤  ε²,   (1)

where E ‖Yj‖₂² ≤ sup ‖Yj‖₂² ≤ max_{ℓ∈{1,...,M}} ‖uℓ‖₂², and the last inequality in (1) follows from the definition of m.

SLIDE 32

Fix a realization Yj = u_{kj}, j = 1, ..., m, for which

‖ y − (1/m) ∑_{j=1}^m Yj ‖₂ ≤ ε.

(Figure: the sampled points u_{k1}, u_{k2}, ..., u_{km} among u1, ..., uM.)

The vector (1/m) ∑_{j=1}^m Yj belongs to the convex hull of {uℓ : ℓ ∈ L}, where L is the set of distinct elements of the sequence k1, ..., km. Obviously |L| ≤ m, and the lemma is proved. Q.E.D.

SLIDE 33

The Inclusion Lemma

To prove the restricted isometry of Ψ over the set of vectors in A(C(s0, k0)) ∩ S^{q−1}, we show that this set is contained in the convex hull of the images of the sparse vectors with norms not exceeding (1 − δ)^{−1}.

Lemma

Let 1 > δ > 0. Suppose the RE(s0, k0, A) condition holds for the matrix A_{q×p}. For a set J ⊂ {1, ..., p}, let EJ = span{ej : j ∈ J}. Set

d = d(k0, A) = s0 + s0 · max_j ‖Aej‖₂² · 16 K²(s0, k0, A) k0² (k0 + 1) / δ².

Then

A(C(s0, k0)) ∩ S^{q−1} ⊂ (1 − δ)^{−1} conv( ∪_{|J|≤d} A EJ ∩ S^{q−1} ),

where for d ≥ p, EJ is understood to be R^p.

SLIDE 34

Conclusion

We prove a general reduction principle showing that if the matrix Ψ acts as an almost isometry on the images of the sparse vectors under A, then the product ΨA satisfies the RE condition with a smaller parameter k0.

We apply the reduction principle to analyze different classes of random design matrices. This analysis is reduced to checking that the almost isometry property holds for all vectors in some low-dimensional subspaces, which is easier than checking the RE property directly.

SLIDE 35

Thank you!