Compressed sensing off-the-grid: The Fisher metric, support stability and optimal sampling bounds — PowerPoint PPT Presentation


SLIDE 1

Compressed sensing off-the-grid: The Fisher metric, support stability and optimal sampling bounds

Clarice Poon
University of Bath

Joint work with: Nicolas Keriven and Gabriel Peyré (École Normale Supérieure)

February 6, 2019

SLIDE 2

Outline

1. Compressed sensing off-the-grid
2. The Fisher metric and the minimum separation condition
3. Support stability for the subsampled problem
4. Ideas behind the proofs – Dual certificates
5. Removal of random signs assumption

SLIDE 3–4

Compressed sensing [Candès, Romberg & Tao '06; Donoho '06]

Task: Recover a ∈ C^N from y = Φa, where Φ ∈ C^{m×N} with m ≪ N and a is s-sparse.

Typical compressed sensing statement:

For certain random matrices Φ ∈ C^{m×N}, with high probability, a can be uniquely recovered from m = O(s log(N)) measurements by solving

    min_{z∈C^N} ‖z‖_1  subject to  Φz = y.

In the noisy case y = Φa + w, the minimizer â of

    min_{z∈C^N} λ‖z‖_1 + (1/2)‖Φz − y‖_2^2

with λ ∼ δ/√s and ‖w‖ ≤ δ satisfies ‖a − â‖_1 ≲ σ_s(a)_1 + √s δ.

In the case where U is unitary, the above statement holds with Φ = P_Ω U, where Ω consists of m = O(N · µ(U)^2 · s · log(N)) uniformly drawn indices and µ(U) = max_{i,j} |U_{ij}| is the so-called coherence. In the case of U being the DFT, we have µ(U)^2 = 1/N.
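As an illustration of the statement above, here is a minimal numerical sketch (assuming NumPy is available; the ISTA solver and all parameter choices are illustrative, not the constants from the theorem):

```python
import numpy as np

rng = np.random.default_rng(0)
N, m, s = 256, 64, 5

# s-sparse signal and a subsampled DFT measurement matrix Phi = P_Omega U
a = np.zeros(N, dtype=complex)
support = rng.choice(N, s, replace=False)
a[support] = rng.standard_normal(s) + 1j * rng.standard_normal(s)
U = np.fft.fft(np.eye(N)) / np.sqrt(N)        # unitary DFT, coherence mu(U)^2 = 1/N
Omega = rng.choice(N, m, replace=False)
Phi = U[Omega, :]
y = Phi @ a

# ISTA (proximal gradient) for min_z lam*||z||_1 + 0.5*||Phi z - y||^2
def ista(Phi, y, lam, iters=2000):
    L = np.linalg.norm(Phi, 2) ** 2            # Lipschitz constant of the gradient
    z = np.zeros(Phi.shape[1], dtype=complex)
    for _ in range(iters):
        g = z - Phi.conj().T @ (Phi @ z - y) / L
        z = np.exp(1j * np.angle(g)) * np.maximum(np.abs(g) - lam / L, 0)  # complex soft-threshold
    return z

a_hat = ista(Phi, y, lam=1e-4)
print("relative error:", np.linalg.norm(a_hat - a) / np.linalg.norm(a))
```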

SLIDE 5

Compressed sensing off the grid

Aim: Recover µ0 ∈ M(X), X ⊆ R^d, from m observations y = Φµ0 + w.

Let (Ω, Λ) be a probability space. For ω ∈ Ω, we have random features φ_ω ∈ C(X). For k = 1, …, m, let ω_k ~ Λ iid. The measurement operator is

    Φ : M(X) → C^m,   Φµ := (1/√m) ( ∫ φ_{ω_k}(x) dµ(x) )_{k=1}^m.

Typically, the measure of interest is µ0 = Σ_{j=1}^s a_j δ_{x_j}, where a δ_x denotes the Dirac at x ∈ X with amplitude a ∈ C (also called a “spike”).
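A minimal sketch of this measurement operator applied to a spike train, with the feature family left generic (assuming NumPy; the Fourier example below is illustrative):

```python
import numpy as np

def measure(features, omegas, amplitudes, positions):
    """Phi mu0 = (1/sqrt(m)) * (sum_j a_j * phi_{omega_k}(x_j))_{k=1}^m
    applied to a discrete measure mu0 = sum_j a_j delta_{x_j}."""
    m = len(omegas)
    # features(omega, x) evaluates phi_omega at a batch of positions x: (s, d) -> (s,)
    return np.array([features(w, positions) @ amplitudes for w in omegas]) / np.sqrt(m)

# Example: Fourier features phi_omega(x) = exp(-i 2 pi <x, omega>)
rng = np.random.default_rng(1)
d, s, m, fc = 2, 4, 50, 10
positions = rng.random((s, d))                     # spikes on the torus [0,1)^2
amplitudes = rng.standard_normal(s)
omegas = rng.integers(-fc, fc + 1, size=(m, d))    # frequencies drawn from [-fc, fc]^d
fourier = lambda w, x: np.exp(-2j * np.pi * (x @ w))
y = measure(fourier, omegas, amplitudes, positions)
print(y.shape)   # (50,)
```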

SLIDE 6

Imaging

Sampling the Fourier transform (e.g. astronomy):
Recover µ ∈ M(T^d) from (Fµ(ω_k))_{k=1}^m, where F is the Fourier transform and the ω_k are drawn iid from ([−f_c, f_c]^d, Unif). Here φ_ω(x) = exp(−i2π x⊤ω) and

    Φµ0 = (1/√m) ( Σ_{j=1}^s a_j exp(−i2π x_j⊤ ω_k) )_{k=1}^m.

Sampling the Laplace transform (e.g. fluorescence microscopy):
Recover µ ∈ M(R_+^d) from (Lµ(ω_k))_{k=1}^m, where L is the Laplace transform and the ω_k are drawn iid from (R_+^d, Λ_α), where Λ_α(ω) ∝ exp(−2α⊤ω). Here φ_ω(x) = exp(−x⊤ω) and

    Φµ0 = (1/√m) ( Σ_{j=1}^s a_j exp(−x_j⊤ ω_k) )_{k=1}^m.

SLIDE 7–8

Two layer neural network [Bach, 2015]

Let Ω ⊆ R^d, and let ω_1, …, ω_m be the training samples drawn from (Ω, Λ), with corresponding values y_1, …, y_m ∈ R. Find a function of the form

    f(ω) = Σ_{j=1}^s a_j max(⟨x_j, ω⟩, 0),

where a_j ∈ R and x_j ∈ R^d, such that f(ω_j) ≈ y_j for j = 1, …, m. We can then use the function f to predict y given ω ∈ Ω.

This is precisely our sparse spikes problem where we let φ_ω(x) = max(⟨x, ω⟩, 0) and

    Φµ0 = ( Σ_{j=1}^s a_j max(⟨x_j, ω_k⟩, 0) )_{k=1}^m,

where µ0 = Σ_{j=1}^s a_j δ_{x_j}.
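A minimal sketch showing that the two-layer network above is exactly the spike-measurement operator with ReLU features (assuming NumPy; sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
d, s, m = 3, 4, 100
neurons = rng.standard_normal((s, d))       # hidden units x_j (spike positions)
weights = rng.standard_normal(s)            # output weights a_j (spike amplitudes)
samples = rng.standard_normal((m, d))       # training inputs omega_k

# Network output f(omega_k) = sum_j a_j * relu(<x_j, omega_k>) ...
f = np.maximum(samples @ neurons.T, 0) @ weights
# ... equals Phi mu0 with features phi_omega(x) = relu(<x, omega>) applied to
# mu0 = sum_j a_j delta_{x_j} (up to the 1/sqrt(m) normalisation used earlier).
phi = lambda w, x: np.maximum(x @ w, 0)
y = np.array([phi(w, neurons) @ weights for w in samples])
print(np.allclose(f, y))   # True
```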

SLIDE 9–10

Density estimation

Task: Given data on T, estimate parameters (a_i)_{i=1}^s ∈ R_+^s and (x_i)_{i=1}^s ∈ X^s of a mixture

    ξ(t) = Σ_{j=1}^s a_j ξ_{x_j}(t) = ∫_X ξ_x(t) dµ0(x),   where µ0 = Σ_j a_j δ_{x_j}

and (ξ_x)_{x∈X} is a family of template distributions. E.g. x = (m, σ) ∈ X = R × R_+ and ξ_x = N(m, σ^2).

Sketching [Gribonval, Blanchard, Keriven & Traonmilin, 2017]

We have no direct access to ξ, but n iid samples (t_1, …, t_n) ∈ T^n drawn from ξ. You do not record this (possibly huge) set of data, but compute online a small set y ∈ C^m of m sketches against sketching functions θ_ω(t):

    y_k := (1/n) Σ_{j=1}^n θ_{ω_k}(t_j) ≈ ∫_T θ_{ω_k}(t) ξ(t) dt = ∫_X ∫_T θ_{ω_k}(t) ξ_x(t) dt dµ0(x).

So φ_ω(x) := ∫_T θ_ω(t) ξ_x(t) dt. E.g. θ_ω(t) = e^{i⟨ω, t⟩} and φ_·(x) is the characteristic function of ξ_x.
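A minimal sketch comparing empirical sketches to their population values for a 1D Gaussian mixture with Fourier sketching functions (assuming NumPy; all parameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 50_000, 20
means, sigmas, weights = np.array([-2.0, 1.0]), np.array([0.5, 1.0]), np.array([0.3, 0.7])

# n iid samples from the mixture xi = sum_j a_j N(m_j, sigma_j^2)
comp = rng.choice(len(weights), size=n, p=weights)
t = rng.normal(means[comp], sigmas[comp])

# Empirical sketches y_k = (1/n) sum_j exp(i w_k t_j)
omega = rng.normal(0, 1.0, size=m)
y = np.exp(1j * np.outer(omega, t)).mean(axis=1)

# Population value: phi_omega(m, sigma) is the characteristic function of N(m, sigma^2)
phi = np.exp(1j * np.outer(omega, means) - 0.5 * np.outer(omega**2, sigmas**2))
print(np.max(np.abs(y - phi @ weights)))   # small, O(1/sqrt(n))
```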

SLIDE 11–12

The Beurling LASSO

The BLASSO was initially proposed by [De Castro & Gamboa, 2012] and [Bredies & Pikkarainen, 2013]. Solve

    min_{µ∈M(X)} (1/2)‖Φµ − y‖^2 + λ|µ|(X),                    (P̂_λ(y))

where |µ|(X) := sup { Re⟨f, µ⟩ ; f ∈ C(X), ‖f‖_∞ ≤ 1 }.

Noiseless problem: for y0 = Φµ0,

    min_{µ∈M(X)} |µ|(X)  subject to  Φµ = y0.                   (P̂_0(y0))

NB: If µ = Σ_j a_j δ_{x_j}, then |µ|(X) = ‖a‖_1.

Goal: A CS-type theory. Under what conditions can we recover µ0 = Σ_{j=1}^s a_j δ_{x_j} exactly (stably) from m = O(s × log factors) (noisy) randomised linear measurements?

SLIDE 13

Remarks

Other approaches include Prony-type methods (1795): MUSIC [Schmidt, 1986], ESPRIT [Roy, 1987], Finite Rate of Innovation [Vetterli, 2002], …

◮ Nonvariational approaches which encode the spike positions as the zeros of some polynomial, whose coefficients are derived from the measurements.
◮ Generally restricted to Fourier-type measurements.
◮ Extension to the multivariate setting is nontrivial.

There are efficient algorithms for solving this infinite dimensional problem, e.g. SDP approaches [Candès & Fernandez-Granda, 2012; De Castro, Gamboa, Henrion & Lasserre 2015] and Frank-Wolfe approaches [Bredies & Pikkarainen 2013; Boyd, Schiebinger & Recht '15; Denoyelle, Duval & Peyré '18].

SLIDE 14–19

Background on the BLASSO

Recovery of spikes of arbitrary signs requires a minimum separation condition:
[Candès & Fernandez-Granda '12]: Given {Fµ0(k) ; k ∈ Z^d, ‖k‖_∞ ≤ f_c}, µ0 can be recovered uniquely if ∆ = min_{i≠j} ‖x_i − x_j‖_∞ ≥ C_d/f_c.

Many extensions to other measurement operators; minimum separation is fundamental (for BLASSO) and often imposed via ad hoc metrics [Bendory et al '15, Tang '15].

Stability for the recovered measure µ̂:
◮ Integral-type stability estimates [Candès & Fernandez-Granda '13]: bounds on ‖K_hi ⋆ (µ̂ − µ0)‖_{L1}.
◮ Support concentration [Fernandez-Granda '13; Azaïs, De Castro & Gamboa '12]: bounds on |µ̂(X_j^near) − a_j| and |µ̂|(X^far).
◮ Support stability [Duval and Peyré '15]: in the small noise regime where ‖w‖ and λ are sufficiently small, µ̂ consists of exactly s spikes, and the recovered amplitudes and positions vary continuously with respect to λ and w.

Subsampling in the Fourier setting:
[Tang et al '13]: If (sign(a_j))_{j=1}^s is a Steinhaus sequence and ∆ ≥ C/f_c, then exact recovery is guaranteed with O(s log(f_c) log(s)) noiseless random Fourier coefficients. Extended to the two-dimensional setting by [Chi & Chen '15]. So far, removal of the random signs assumption results in O(s^2) measurements [Li & Chi '17].

SLIDE 20

Outline

1. Compressed sensing off-the-grid
2. The Fisher metric and the minimum separation condition
3. Support stability for the subsampled problem
4. Ideas behind the proofs – Dual certificates
5. Removal of random signs assumption

SLIDE 21–23

The covariance kernel

Define the covariance kernel K̂(x, x′) := (1/m) Σ_{k=1}^m φ_{ω_k}(x) conj(φ_{ω_k}(x′)), and the limit covariance kernel K(x, x′) := E[K̂(x, x′)] = ∫ φ_ω(x) conj(φ_ω(x′)) dΛ(ω).

Denote f̂ := Φ*y = ∫ K̂(·, x′) dµ0(x′) + Φ*w ∈ C(X). The BLASSO can be rewritten as

    min_{µ∈M(X)} (1/2) ∬ K̂(x, x′) dµ(x) dµ(x′) − Re⟨f̂, µ⟩ + λ|µ|(X)          (P̂_λ(y))

and

    min_{µ∈M(X)} |µ|(X)  subject to  ∬ K̂(x, x′) d(µ − µ0)(x) d(µ − µ0)(x′) = 0.   (P̂_0(y0))

The corresponding limit problems (P_λ(f)) and (P_0(f0)) are obtained by replacing K̂ with K and f̂ with f := ∫ K(·, x′) dµ0(x′) + ε.

Before discussing the role of subsampling, let's look at the limit problem associated to K. What separation conditions should we impose to guarantee recovery of µ0 = Σ_{j=1}^s a_j δ_{x_j}?
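A quick numerical check of these definitions (a minimal sketch assuming NumPy; for φ_ω(x) = e^{iωx} with ω ~ N(0, 1), the limit kernel is the Gaussian K(x, x′) = e^{−(x−x′)²/2}, matching the Continuous Fourier example later):

```python
import numpy as np

rng = np.random.default_rng(4)
m = 5000
omega = rng.standard_normal(m)                     # omega_k ~ N(0, 1), iid

x = np.linspace(-2, 2, 9)
phi = np.exp(1j * np.outer(omega, x))              # phi[k, i] = phi_{omega_k}(x_i)

# Empirical covariance kernel K_hat(x, x') = (1/m) sum_k phi(x) conj(phi(x'))
K_hat = (phi.T @ phi.conj()) / m
# Limit kernel: E[exp(i omega (x - x'))] = exp(-(x - x')^2 / 2) for omega ~ N(0, 1)
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)
print(np.max(np.abs(K_hat - K)))                   # O(1/sqrt(m)) deviation
```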

SLIDE 24–27

The Fisher metric

Assume that for all x ∈ X, E_ω[|φ_ω(x)|^2] = 1. Let H_x := ∇_1∇_2 K(x, x) ∈ C^{d×d} and assume that H_x is positive definite for all x ∈ X.

f(x, ω) := |φ_ω(x)|^2 can be interpreted as a probability density function for the random variable ω conditional on the parameter x ∈ X, and its Fisher information matrix is

    ∫ ∇(log f(x, ω)) ∇(log f(x, ω))⊤ f(x, ω) dΛ(ω) = 4 H_x.

H naturally induces a distance between points on X. Given a curve γ : [0, 1] → X,

    ℓ_H[γ] := ∫_0^1 √(⟨H_{γ(t)} γ′(t), γ′(t)⟩) dt,

and given x, x′ ∈ X,

    d_H(x, x′) := inf { ℓ_H[γ] ; γ : [0, 1] → X, γ(0) = x, γ(1) = x′ }.

Also called the “Fisher-Rao” geodesic distance, this is used extensively in information geometry for estimation and learning problems on parametric families of distributions (Amari and Nagaoka, 2007).

Theorem
Under some generic conditions on K and ∆, if min_{j≠k} d_H(x_j, x_k) ≥ ∆ and s ≤ s_max, then µ0 can be exactly (stably) recovered as a solution to P_0(f) (to P_λ(f)).
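For intuition, d_H can be evaluated numerically by integrating √H along a curve. A minimal 1D sketch (assuming NumPy; the metric H_x = 1/(4(x+α)²) is the one-dimensional Laplace-transform example from the upcoming table, and the closed form below is simply its antiderivative, used here only as a cross-check):

```python
import numpy as np

alpha = 1.0
H = lambda x: 1.0 / (4.0 * (x + alpha) ** 2)   # 1D Fisher metric, Laplace example

def d_H(x0, x1, n=10001):
    # In 1D the geodesic is the segment itself; arc length = int sqrt(H(u)) du
    u = np.linspace(x0, x1, n)
    f = np.sqrt(H(u))
    return np.sum((f[1:] + f[:-1]) / 2 * np.diff(u))   # trapezoid rule

x0, x1 = 0.5, 3.0
print(d_H(x0, x1))                                      # numerical arc length
print(0.5 * abs(np.log((x1 + alpha) / (x0 + alpha))))   # analytic antiderivative
```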

SLIDE 28

Notation for derivatives

We can interpret the r-th derivative as a multilinear map ∇^r f : (C^d)^r → C: given Q = {q_ℓ}_{ℓ=1}^r ∈ (C^d)^r,

    ∇^r f[Q] = Σ_{i_1, …, i_r} ∂_{i_1} ⋯ ∂_{i_r} f(x) q_{i_1} ⋯ q_{i_r}.

The normalised r-th derivative is

    D^r[f](x)[Q] = ∇^r f(x)[ {H_x^{−1/2} q_i}_{i=1}^r ],

and K^{(ij)}(x, x′) : (C^d)^i × (C^d)^j → C is defined by

    K^{(ij)}[Q, V] := E[ D^i[φ_ω][Q] · conj(D^j[φ_ω][V]) ].

SLIDE 29–33

Admissible kernels*

A kernel K is said to be admissible with respect to {r_near, ∆, ε_i, B_ij, s_max} if:

Sufficient peak: For d_H(x, x′) ≤ r_near, |K(x, x′)| ≥ 1 − ε_0 and K^{(02)}(x, x′) ≼ −ε_2 Id.

Sufficient decay: For d_H(x, x′) ≥ ∆/4, |K^{(ij)}(x, x′)| ≤ h/s_max for i, j ∈ {0, …, 2} with i + j ≤ 3, where h := min_{i∈{0,2}} ( ε_i / (32 B_{1i} + 32) ).

Uniform bounds: sup_{x,x′∈X} |K^{(ij)}(x, x′)| ≤ B_ij for i, j ∈ {0, 1, 2}. Additionally, for d_H(x, x′) ≤ r_near,

    ‖Id − H_{x′}^{−1/2} H_x^{1/2}‖ ≤ C_H d_H(x, x′).

*For simplicity, assume that K is real-valued.

Theorem
Suppose that K is admissible, and µ0 = Σ_{j=1}^s a_j δ_{x_j} with min_{j≠k} d_H(x_j, x_k) ≥ ∆ and s ≤ s_max. Then µ0 can be exactly (stably) recovered as a solution to P_0(f) (to P_λ(f)).

NB: in general, ε_i, r_near, B_ij, C_H are just constants (possibly dependent on d), but independent of problem parameters.

SLIDE 34–36

Examples

Discrete Fourier:
  Random features: φ_ω(x) = e^{i2πω⊤x}, Λ ∝ ∏_j g(ω_j) on [−f_c, f_c]^d
  Kernel: Jackson, ∏_i κ(x_i − x′_i)
  Metric: H_x = C_{f_c} Id, where ‡ C_{f_c} = (π²/3) f_c(f_c + 4) ∼ f_c²
  Distance: d_H(x, x′) = C_{f_c}^{1/2} ‖x − x′‖_2
  Separation: ∆ ≍ √(d√s)

Continuous Fourier:
  Random features: φ_ω(x) = e^{iω⊤x}, Λ = N(0, Σ) on R^d
  Kernel: Gaussian †, e^{−‖x−x′‖_Σ²/2}
  Metric: H_x = Σ
  Distance: d_H(x, x′) = ‖x − x′‖_Σ
  Separation: ∆ ≍ √(log(s))

Microscopy (Laplace trans.):
  Random features: φ_ω(x) = ∏_{i=1}^d √(2(x_i+α_i)/α_i) e^{−ω⊤x}, Λ ∝ e^{−α⊤ω} on R_+^d
  Kernel: ∏_i κ(x_i + α_i, x′_i + α_i), where κ(x, x′) = √(4xx′)/(x + x′)
  Metric: H_x = diag( 1/(4(x_i + α_i)^2) )
  Distance: d_H(x, x′) = √( Σ_i log²((x_i + α_i)/(x′_i + α_i)) )
  Separation: ∆ ≍ d + log(ds)

† ‖x‖_Σ = √⟨Σx, x⟩.

Our result: ‖x_i − x_j‖ ≳ √d · s^{1/4} / f_c. Candès & Fernandez-Granda: ‖x_i − x_j‖ ≳ C_d / f_c.

SLIDE 37

Outline

1. Compressed sensing off-the-grid
2. The Fisher metric and the minimum separation condition
3. Support stability for the subsampled problem
4. Ideas behind the proofs – Dual certificates
5. Removal of random signs assumption

SLIDE 38–41

The subsampled setting

Assumption 1
K is admissible, µ0 = Σ_{j=1}^s a_j δ_{x_j} with min_{j≠k} d_H(x_j, x_k) ≥ ∆ and s ≤ s_max. X is a compact domain with R_X := sup_{x,x′∈X} d_H(x, x′).

To analyse the subsampled case, we need to control the deviation of K̂ from K. Ideally, L_r(ω) := sup_{x∈X} ‖D^r[φ_ω](x)‖ would be uniformly bounded. But φ_ω(x) = exp(iω⊤x) implies ‖D^r[φ_ω](x)‖ ∝ ‖ω‖_{Σ^{−1}}^r, while on the other hand P(‖ω‖_{Σ^{−1}} > x) ≤ 2^{d/2} e^{−x/4}. So let F_r be such that P_ω(L_r(ω) > t) ≤ F_r(t).

Assumption 2
For ρ > 0 (probability of failure), choose m ∈ N (number of measurements) and {L̄_i}_{i=0}^3 such that

    Σ_{j=0}^3 F_j(L̄_j) ≤ ρ/m   and   L̄_j^2 Σ_{i=0}^3 F_i(L̄_i) + 2 ∫_{L̄_j}^∞ t F_j(t) dt ≤ ε/m,

and either one of the following holds:
◮ sign(a) is a Steinhaus sequence and m ≳ C · s · log(Nd/ρ) · log(s/ρ),
◮ sign(a) is an arbitrary sign sequence and m ≳ C · s^{3/2} · log(Nd/ρ),

where C := ε^{−2} (L̄_2^2 B_{11} + L̄_1^2 B_{22} + L̄_1^2 B), N := d · R_X · L̄_3 / (r_near ε), B = B_{00} + B_{02} + B_{10} + B_{12}, ε = min{ε_0, ε_2}, L̄_r = max_{i≤r} L̄_i.
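A quick illustration of the deviation K̂ − K that Assumption 2 controls (a minimal sketch assuming NumPy; Gaussian Fourier features as in the Continuous Fourier example):

```python
import numpy as np

rng = np.random.default_rng(5)
x = np.linspace(-1, 1, 201)
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)   # limit kernel (Gaussian)

for m in [100, 1000, 10000]:
    omega = rng.standard_normal(m)
    phi = np.exp(1j * np.outer(omega, x))
    K_hat = (phi.T @ phi.conj()).real / m
    print(m, np.max(np.abs(K_hat - K)))             # sup-norm deviation ~ 1/sqrt(m)
```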

SLIDE 42

Support stability statement

Theorem
Let D_{λ0,c0} := { (λ, w) ∈ R_+ × C^m ; λ ≤ D/s, ‖w‖ ≤ c_0 λ }, where

    D ∼ ā · min { r_near/√s, ε√s/(L̄_2^2 ‖a‖), ε/(C_H(B + L̄_2^2)) }   and   c_0 ∼ min { ε_0/L̄_0, ε_2/L̄_2 },   (3.1)

and ā = min_i min{ |a_i|, |a_i|^{−1} }. Then, with probability at least 1 − ρ:

(i) for all v := (λ, w) ∈ D_{λ0,c0}, (P̂_λ(y)) has a unique solution, which consists of exactly s spikes;

(ii) the mapping v ∈ D_{λ0,c0} ↦ (â_v, {x̂_j^v}_{j=1}^s) is continuously differentiable, and we have the error bound

    ‖â_v − a‖ + √( Σ_j d_H^2(x̂_j^v, x_{0,j}) ) ≲ √s (λ + ‖w‖) / min_i |a_i|.   (3.2)

SLIDE 43–44

Examples

(Same measurement models as SLIDE 34–36: Discrete Fourier, Continuous Fourier, Laplace transform.)

No. samples (up to log factors; s^p with p = 1 for random signs, p = 3/2 in general):
◮ Discrete Fourier: random signs O(s d^3); general signs O(s^{3/2} d^3)
◮ Continuous Fourier: random signs O(s d^3); general signs O(s^{3/2} d^3)
◮ Laplace transform: random signs O(s d^7); general signs O(s^{3/2} d^7)

Stability regions:
◮ Discrete Fourier: λ = O(s^{−1} d^{−2}), ‖w‖ = O(s^{−1} d^{−3})
◮ Continuous Fourier: λ = O(s^{−1} d^{−2}), ‖w‖ = O(s^{−1} d^{−3})
◮ Laplace transform: λ = O(s^{−1} d^{−3}), ‖w‖ = O(s^{−1} d^{−5})

Linear in sparsity when we have random signs. Improvement from s^2 to s^{3/2} in the arbitrary signs case. Dependency on d is still work in progress.

SLIDE 45–47

Gaussian mixture estimation (1D)

Task: Suppose we have data {t_1, …, t_n} drawn from ξ = Σ_{j=1}^s a_j N(m_j, s_j^2), where a_j > 0 and Σ_j a_j = 1. Find a_j ∈ R_+ and x_j := (m_j, s_j) ∈ R × R_+, j = 1, …, s.

Observe: y ∈ C^m of m sketches against sketching functions θ_ω(t):

    y_k := (1/n) Σ_{j=1}^n θ_{ω_k}(t_j) ≈ ∫_T θ_{ω_k}(t) ξ(t) dt = ∫_X ∫_T θ_{ω_k}(t) ξ_x(t) dt dµ0(x),

where ξ_x = N(m, s^2). I.e. our sparse spikes problem with µ0 := Σ_{i=1}^s a_i δ_{(m_i, s_i)} and φ_ω(x) := ∫_T θ_ω(t) ξ_x(t) dt.

Fourier sketching
Suppose that θ_{ω_k}(t) = exp(−iω_k t), where ω_k ∼ N(0, σ^2). Then:
◮ Random features: φ_ω(m, s) = (2s^2σ^2 + 1)^{1/4} exp( −imω − (sω)^2/2 )
◮ Noise: ‖w‖^2 = O( log(ρ^{−1}) / n ) w.p. 1 − ρ.
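A numerical check of the closed-form random feature (a minimal sketch assuming NumPy; the (2s²σ²+1)^{1/4} factor is exactly the normalisation that makes E_ω|φ_ω(x)|² = 1):

```python
import numpy as np

rng = np.random.default_rng(6)
sigma = 0.8
m_par, s_par = 0.3, 1.2          # mixture component parameters x = (m, s)

# phi_omega(m, s) = (2 s^2 sigma^2 + 1)^{1/4} * exp(-i m w - (s w)^2 / 2)
phi = lambda w: (2 * s_par**2 * sigma**2 + 1) ** 0.25 * np.exp(
    -1j * m_par * w - 0.5 * (s_par * w) ** 2
)

# Check E_omega |phi_omega(x)|^2 = 1 by Monte Carlo over omega ~ N(0, sigma^2)
w = rng.normal(0, sigma, 200_000)
print(np.mean(np.abs(phi(w)) ** 2))   # ~ 1

# And exp(-i m w - (s w)^2 / 2) equals E_{t ~ N(m, s^2)}[exp(-i w t)]:
t = rng.normal(m_par, s_par, 200_000)
w0 = 1.3
print(np.mean(np.exp(-1j * w0 * t)), np.exp(-1j * m_par * w0 - 0.5 * (s_par * w0) ** 2))
```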

SLIDE 48–49

Support stability for Gaussian mixture estimation (1D)

Kernel:

    K((m, s), (n, t)) = √( 2 s_σ t_σ / (s_σ^2 + t_σ^2) ) exp( −(m−n)^2 / (2(s_σ^2 + t_σ^2)) ),  where  s_σ^2 = 1/(2σ^2) + s^2.

Metric and separation: H_{(m,s)} = diag( 1/(2s_σ^2), 1/(2s_σ^2) ),

    d_H((m, s), (n, t)) = 2 arcsinh( (1/2) √( ((m−n)^2 + (s_σ − t_σ)^2) / (s_σ t_σ) ) ),

and ∆ = O(log(s_max)).

No. samples: Suppose X ⊂ R × (0, A] and σ ∝ 1/(A√(log(m/ρ) + 1)). Then m = O(s^{3/2}) (up to log factors).

Stability region: λ = O( min_i |a_i| / (√s ‖a‖_2) ), n = O( s^2 / min_i |a_i|^2 ).

There is no closed form expression for d_H in higher dimensions. If µ0 = Σ_i a_i N(x_{0,i}, Σ) and Σ ∈ R^{d×d} is known, then ω_k ∼ N(0, Σ^{−1}/d) implies the associated kernel is exp( −‖x − x′‖_{Σ^{−1}} / (2 + d) ), and support stability is guaranteed if
◮ ‖x_j − x_ℓ‖_{Σ^{−1}} ≳ √(d log(s)),
◮ m = O(s^{3/2} d^3), n = O(s^2 d^6 / min_i |a_i|^2) and λ = O( min_i |a_i| / (√s d^2 ‖a‖_2) ).
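A Monte Carlo check of the closed-form kernel above for the Fourier sketch (a minimal sketch assuming NumPy; the prefactor follows from the normalised feature on the previous slide):

```python
import numpy as np

rng = np.random.default_rng(7)
sigma = 0.8

def phi(w, m, s):
    # normalised feature: (2 s^2 sigma^2 + 1)^{1/4} exp(-i m w - (s w)^2 / 2)
    return (2 * s**2 * sigma**2 + 1) ** 0.25 * np.exp(-1j * m * w - 0.5 * (s * w) ** 2)

def K(m, s, n, t):
    s2 = 1 / (2 * sigma**2) + s**2          # s_sigma^2
    t2 = 1 / (2 * sigma**2) + t**2          # t_sigma^2
    pref = np.sqrt(2 * np.sqrt(s2 * t2) / (s2 + t2))
    return pref * np.exp(-((m - n) ** 2) / (2 * (s2 + t2)))

w = rng.normal(0, sigma, 500_000)
x, y = (0.2, 0.9), (1.0, 1.4)
mc = np.mean(phi(w, *x) * np.conj(phi(w, *y))).real   # E_omega[phi(x) conj(phi(y))]
print(mc, K(*x, *y))                                  # should agree to a few decimals
```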

SLIDE 50

Outline

1. Compressed sensing off-the-grid
2. The Fisher metric and the minimum separation condition
3. Support stability for the subsampled problem
4. Ideas behind the proofs – Dual certificates
5. Removal of random signs assumption

SLIDE 51–53

Fenchel duals

The Fenchel dual of P_λ(y) is

    sup_{p∈C^m, ‖Φ*p‖_∞ ≤ 1} Re⟨p, y⟩ − (λ/2)‖p‖_2^2.   (4.1)

Note that for λ > 0 there is a unique dual solution p_λ, since (4.1) is equivalent to min_{‖Φ*p‖_∞ ≤ 1} ‖p − y/λ‖, which is a projection of y/λ onto a closed convex set.

Primal-dual relations: the dual solution p_λ is related to any primal solution µ_λ by

    Φ*p_λ ∈ ∂|µ_λ|  and  p_λ = (1/λ)(y − Φµ_λ),

and for λ = 0, Φ*p_0 ∈ ∂|µ_0| and y = Φµ_0.

We have ∂|µ| = { f ∈ C(X) ; ‖f‖_∞ ≤ 1, ⟨f, µ⟩ = |µ|(X) }, and

    Supp(µ_λ) ⊆ { x ; |η_λ(x)| = 1 },  where η_λ = Φ*p_λ.

Such η_λ are often called dual certificates.
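The primal-dual relation can be checked numerically on a discretised (finite grid) version of the problem. A minimal sketch assuming NumPy; the ISTA solver and all sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(8)
N, m = 200, 60
Phi = (rng.standard_normal((m, N)) + 1j * rng.standard_normal((m, N))) / np.sqrt(2 * m)
a = np.zeros(N, dtype=complex); a[[20, 90, 150]] = [1.0, -2.0, 1.5]
y = Phi @ a
lam = 0.01

# Solve the (discretised) primal by ISTA
L = np.linalg.norm(Phi, 2) ** 2
z = np.zeros(N, dtype=complex)
for _ in range(5000):
    g = z - Phi.conj().T @ (Phi @ z - y) / L
    z = np.exp(1j * np.angle(g)) * np.maximum(np.abs(g) - lam / L, 0)

# Dual certificate from the primal-dual relation p = (y - Phi z) / lam
eta = Phi.conj().T @ ((y - Phi @ z) / lam)
print(np.max(np.abs(eta)))                    # <= 1 (up to solver tolerance)
print(np.abs(eta[np.abs(z) > 1e-6]))          # ~ 1 on the recovered support
```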

SLIDE 54–55

Dual certificate guarantees for sparse measures

Let µ0 = Σ_j a_j δ_{x_j}. Then ∂|µ0| = { f ∈ C(X) ; ‖f‖_∞ ≤ 1, f(x_j) = sign(a_j) }.

[Figure: a discrete measure µ and a certificate η touching ±1 at the spike locations.]

Uniqueness: µ0 is the unique solution if there exists η such that η(x_j) = sign(a_j), |η(x)| < 1 for all x ∉ {x_j}, and Φ_X : b ∈ C^s ↦ Σ_j b_j φ(x_j) is injective.

Stability is guaranteed if η is nondegenerate:

    ∀j, sign(a_j) ∇^2 η(x_j) ≺ 0   and   ∀x ∉ {x_j}_{j=1}^s, |η(x)| < 1.

SLIDE 56–58

Stability

Clustering stability [Candès & Fernandez-Granda '14 and Azaïs, De Castro & Gamboa '13]
Suppose η is nondegenerate with ε_0, ε_2 > 0 and neighbourhoods X_j^near ∋ x_j such that |η(x)| ≤ 1 − ε_0 for all x ∈ X^far := X \ ∪_{j=1}^s X_j^near, and ∀i, ∀x ∈ X_i^near, |η(x)| ≤ 1 − ε_2 d_H(x, x_i)^2. Then, for λ ∼ δ/‖p‖,

    ε_0 |µ̂|(X^far) + ε_2 Σ_{j=1}^s ∫_{X_j^near} d_H(x, x_j)^2 d|µ̂|(x) ≲ δ(1 + ‖p‖).

Equivalently, defining P_X(|µ̂|) := Σ_{j=1}^s |µ̂|(X_j^near) δ_{x_j}, we have

    T_H^2(|µ̂|, P_X(|µ̂|)) ≲ δ‖p‖ / min{ε_0, ε_2},

where T_H^2(µ, ν) := inf_{µ̂,ν̂} W_H^2(µ̂, ν̂) + |µ − µ̂|(X) + |ν − ν̂|(X).

Support stability [Duval & Peyré '15]
We have p_λ → p_0, where p_0 := argmin { ‖p‖ ; p ∈ argmax(D_0(y)) }. If the minimal norm certificate η_0 := Φ*p_0 is nondegenerate and µ0 is identifiable, then for λ and ‖w‖/λ sufficiently small, P_λ(Φµ0 + w) has a unique solution µ_{λ,w}, which consists of exactly s spikes, and the recovered positions and amplitudes follow a C^1 path as λ and w converge to 0.

SLIDE 59–65

What is a natural candidate for a nondegenerate solution to D_0(y)?

In CS, for Φ ∈ C^{m×N}, we need to find v ∈ Im(Φ*) such that |v_j| < 1 for j ∉ T and v_j = sign(a_j) for j ∈ T. In the case E[Φ*Φ] = Id, the Fuchs certificate is an appropriate choice:

    v = Φ*Φ_T (Φ_T*Φ_T)^{−1} sign(a_T).

By definition, v_T = sign(a_T).

Vanishing derivatives precertificate [Duval & Peyré '15]
In our case, for α ∈ C^s and β ∈ C^{sd}, define Γ_X : C^{s(d+1)} → C^m by

    Γ_X (α, β) = Φ_X α + Φ_X^{(1)} β,  where  Φ_X α = Σ_j α_j φ(x_j),  Φ_X^{(1)} β = Σ_{j=1}^s β_j⊤ ∇φ(x_j).

Consider

    η_V = Φ*Γ_X (Γ_X*Γ_X)^{−1} ( sign(a), 0_{sd} );

by definition η_V(x_j) = sign(a_j) and ∇η_V(x_j) = 0. In fact, η_V = Φ*p_V where

    p_V = argmin { ‖p‖_2 ; (Φ*p)(x_j) = sign(a_j), ∇(Φ*p)(x_j) = 0 }.

If ‖η_V‖_∞ ≤ 1, then we have η_V = η_0, and nondegeneracy guarantees support stability.
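A minimal sketch of η_V for one-dimensional Fourier features (assuming NumPy; frequencies, positions and grid are illustrative, and the normal equations implement the least-norm interpolation above):

```python
import numpy as np

rng = np.random.default_rng(9)
m, s, fc = 80, 3, 15
omega = rng.integers(-fc, fc + 1, m)            # Fourier frequencies
x0 = np.array([0.15, 0.5, 0.8])                 # spike positions on [0, 1)
signs = np.array([1.0, -1.0, 1.0])

phi = lambda x: np.exp(-2j * np.pi * np.outer(omega, x)) / np.sqrt(m)   # (m, len(x))
dphi = lambda x: -2j * np.pi * omega[:, None] * phi(x)                  # d/dx

# Gamma_X = [Phi_X, Phi_X^{(1)}]; eta_V = Phi^* Gamma_X (Gamma_X^* Gamma_X)^{-1} (sign(a); 0)
Gamma = np.hstack([phi(x0), dphi(x0)])          # (m, 2s)
rhs = np.concatenate([signs, np.zeros(s)])
p_V = Gamma @ np.linalg.solve(Gamma.conj().T @ Gamma, rhs)

grid = np.linspace(0, 1, 2001)
eta_V = phi(grid).conj().T @ p_V                # eta_V(x) = (Phi^* p_V)(x)
print(np.round(np.abs(eta_V[[300, 1000, 1600]]), 4))   # ~1 at the spike locations
print(np.max(np.abs(eta_V)))                    # nondegenerate if this stays <= 1
```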

SLIDE 66–69

Key ideas of proof

We can also write

    η_V(x) = Σ_{i=1}^s α_i K(x_i, x) + Σ_{i=1}^s β_i⊤ K^{(10)}(x_i, x),   (α, β) = D_{K,X}^{−1} ( sign(a), 0_{sd} ),

with covariance kernel K(x, x′) = ⟨φ(x), φ(x′)⟩ and

    D_{K,X} := [ M_0  M_1 ; M_1⊤  M_2 ],

where M_0 = (K(x_i, x_j))_{i,j}, M_1 = (K^{(01)}(x_i, x_j))_{i,j}, M_2 = (K^{(11)}(x_i, x_j))_{i,j}.

The idea of the proof: η_V associated to the limit kernel K = E[K̂] is nondegenerate. We therefore simply need to show that η̂_V associated to K̂ is close to η_V:

◮ On X^far, η̂_V ≈ η_V is bounded away from 1 in absolute value.
◮ On X_j^near, sign(a_j)∇²η̂_V ≈ sign(a_j)∇²η_V is negative definite.

Our proof still requires random signs and is a direct extension of the work of Tang et al (to the higher dimensional and general operator setting); the key difference is the incorporation of the Fisher metric.
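The same construction written through the kernel, for the limit Gaussian kernel K(x, x′) = e^{−(x−x′)²/2} in 1D (a sketch assuming NumPy; the derivative formulas are worked out by hand for this kernel):

```python
import numpy as np

K = lambda u, v: np.exp(-0.5 * (u - v) ** 2)          # limit kernel, 1D
K01 = lambda u, v: (u - v) * K(u, v)                  # d/dv K
K10 = lambda u, v: (v - u) * K(u, v)                  # d/du K
K11 = lambda u, v: (1 - (u - v) ** 2) * K(u, v)       # d2/(du dv) K

x0 = np.array([-2.0, 0.0, 2.5])
signs = np.array([1.0, 1.0, -1.0])
s = len(x0)

# D_{K,X} = [[M0, M1], [M1^T, M2]] and (alpha, beta) = D^{-1} (sign(a); 0)
U, V = np.meshgrid(x0, x0, indexing="ij")
D = np.block([[K(U, V), K01(U, V)], [K10(U, V), K11(U, V)]])
ab = np.linalg.solve(D, np.concatenate([signs, np.zeros(s)]))
alpha, beta = ab[:s], ab[s:]

eta_V = lambda x: sum(alpha[i] * K(x0[i], x) + beta[i] * K10(x0[i], x) for i in range(s))
grid = np.linspace(-5, 5, 2001)
print(np.round(eta_V(x0), 6))        # = sign(a) at the spikes
print(np.max(np.abs(eta_V(grid))))   # nondegenerate if this stays <= 1
```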

SLIDE 70–73

Comment on our s^{1.5} bound

To explain the random signs requirement, consider the Fuchs certificate in the finite dimensional case,

    v = Φ*Φ_T (Φ_T*Φ_T)^{−1} sign(a_T) = ( ⟨sign(a_T), u_j⟩ )_{j=1}^N,  where  u_j = (Φ_T*Φ_T)^{−1} Φ_T*Φ_{{j}},

and we need to show |v_j| < 1 for j ∉ T:

◮ Naively, |v_j| ≤ √s ‖u_j‖_2, so we need ‖Φ_T*Φ_{{j}}‖_2 ≲ 1/√s for j ∉ T, which forces m ≳ s^2 measurements.

◮ If sign(a_T) is made of random signs, then by Hoeffding's inequality, with high probability, |v_j| = |⟨u_j, sign(a_T)⟩| ≲ ‖u_j‖_2, which yields m ≳ s (up to log factors).

◮ But we can also write

    v_j = ⟨((Φ_T*Φ_T)^{−1} − Id)Φ_T*Φ_{{j}}, sign(a_T)⟩ + ⟨Φ_T*Φ_{{j}}, sign(a_T)⟩.

So we simply need to ensure that ‖Φ_T*Φ_T − Id_T‖_{2→2} ≲ s^{−1/4} and ‖Φ_T*Φ_{{j}}‖_2 ≲ s^{−1/4}, which is true w.h.p. when m ≳ s^{1.5} (up to log factors).
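A small experiment illustrating the gap between random and adversarial signs (a minimal sketch assuming NumPy; a Gaussian Φ stands in for any model with E[Φ*Φ] = Id):

```python
import numpy as np

rng = np.random.default_rng(10)
N, m, s = 2000, 300, 50
Phi = rng.standard_normal((m, N)) / np.sqrt(m)     # E[Phi^T Phi] = Id
T = np.arange(s)
G_inv = np.linalg.inv(Phi[:, T].T @ Phi[:, T])

# u_j = (Phi_T^* Phi_T)^{-1} Phi_T^* Phi_j for every j off the support
U = G_inv @ (Phi[:, T].T @ Phi[:, s:])             # (s, N - s), column j = u_j

rand_signs = rng.choice([-1.0, 1.0], s)
v_rand = rand_signs @ U                            # v_j = <sign(a_T), u_j>
# Adversarial signs aligned with one particular u_j push |v_j| towards sqrt(s)*||u_j||
worst = np.sign(U[:, 0]); worst[worst == 0] = 1.0
v_adv = worst @ U

print(np.abs(v_rand[0]), np.linalg.norm(U[:, 0]))              # Hoeffding: ~ ||u_0||
print(np.abs(v_adv[0]), np.sqrt(s) * np.linalg.norm(U[:, 0]))  # ~ sqrt(s) factor worse
```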

SLIDE 74

Outline

1. Compressed sensing off-the-grid
2. The Fisher metric and the minimum separation condition
3. Support stability for the subsampled problem
4. Ideas behind the proofs – Dual certificates
5. Removal of random signs assumption

SLIDE 75–76

Ideas from (finite dimensional) compressed sensing

Instead of requiring that v_j = sign(a_j), it is enough that this holds approximately.

Theorem (Gross (2011); Candès and Plan (2011))
Let T index the largest s entries of |a|. Suppose that there exists v = Φ*p such that

    ‖v_T − sign(a_T)‖_2 ≤ 1/4  and  ‖v_{T^c}‖_∞ ≤ 1/4,

and

    ‖(Φ_T*Φ_T)^{−1}‖_{2→2} ≤ 2  and  max_{i∈T^c} ‖Φ_T*Φ_{{i}}‖_2 ≤ 1.

Then one can guarantee that ‖â − a‖_2 ≲ ‖p‖_2 δ + σ_s(a)_1, provided that λ ∼ δ.

SLIDE 77

Alternative proof: ∃ inexact certificate ⟹ ∃ dual certificate

Theorem (Gross (2011); Candès and Plan (2011))
Let T index the largest s entries of |a|. Suppose that there exists v = Φ*p such that ‖v_T − sign(a_T)‖_2 ≤ 1/4, ‖v_{T^c}‖_∞ ≤ 1/4, ‖(Φ_T*Φ_T)^{−1}‖_{2→2} ≤ 2 and max_{i∈T^c} ‖Φ_T*Φ_{{i}}‖_2 ≤ 1. Then one can guarantee that ‖â − a‖_2 ≲ (1 + ‖p‖_2)δ + σ_s(a)_1, provided that λ ∼ δ/‖p‖.

Proof:
1. Define u := v + ṽ, where ṽ := Φ*Φ_T (Φ_T*Φ_T)^{−1} e and e = sign(a_T) − v_T.
2. By definition, u_T = v_T + e = sign(a_T).
3. Note that ‖ṽ_{T^c}‖_∞ ≤ ‖Φ_{T^c}*Φ_T‖_{2→∞} ‖(Φ_T*Φ_T)^{−1}‖_{2→2} ‖e‖_2 ≤ 1/2, so ‖u_{T^c}‖_∞ ≤ ‖v_{T^c}‖_∞ + ‖ṽ_{T^c}‖_∞ ≤ 3/4.
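A numerical sketch of the correction step in the proof (assuming NumPy; the perturbation size is illustrative, and an exact certificate is perturbed to play the role of the inexact v):

```python
import numpy as np

rng = np.random.default_rng(11)
N, m, s = 500, 200, 10
Phi = rng.standard_normal((m, N)) / np.sqrt(m)
T = np.arange(s)
sgn = rng.choice([-1.0, 1.0], s)

# An inexact certificate: an exact one plus a small perturbation
p = Phi[:, T] @ np.linalg.solve(Phi[:, T].T @ Phi[:, T], sgn) + 0.05 * rng.standard_normal(m)
v = Phi.T @ p
print(np.linalg.norm(v[T] - sgn))        # small but nonzero

# Correction: u = v + v_tilde, with v_tilde = Phi^* Phi_T (Phi_T^* Phi_T)^{-1} e
e = sgn - v[T]
u = v + Phi.T @ (Phi[:, T] @ np.linalg.solve(Phi[:, T].T @ Phi[:, T], e))
print(np.linalg.norm(u[T] - sgn))        # ~ 0: u now interpolates the signs exactly
print(np.max(np.abs(u[s:])))             # off-support values stay controlled
```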

SLIDE 78–81

Key steps of our proof

Apply the golfing scheme [Gross '09, Candès & Plan '11] to construct η̃ ∈ Im(Φ*) which is approximately nondegenerate on a finite grid:

◮ The vector V = (η̃(x_j), D^1[η̃](x_j))_{j=1}^s satisfies ‖V − (sign(a), 0_{sd})‖ ≤ δ.
◮ For all x ∈ X_{grid,j}^near, sign(a_j) · D^2[η̃](x) ≺ −ε_2.
◮ For all x ∈ X_{grid}^far, |η̃(x)| < 1 − ε_0.

Show that, provided the grid is sufficiently dense, this holds on the entire domain X (depends on L̄_1 and L̄_3).

Add a small perturbation to η̃ to obtain a true certificate. We still construct a dual certificate, but it is not of minimal norm.

SLIDE 82

The subsampled setting

Assumption 1
K is admissible, µ0 = Σ_{j=1}^s a_j δ_{x_j} with min_{j≠k} d_H(x_j, x_k) ≥ ∆ and s ≤ s_max. X is a compact domain with R_X := sup_{x,x′∈X} d_H(x, x′). Let L_r(ω) := sup_{x∈X} ‖D^r[φ_ω](x)‖ and let F_r be such that P_ω(L_r(ω) > t) ≤ F_r(t).

Assumption 2
For ρ > 0 (probability of failure), choose m ∈ N (number of measurements) and {L̄_i}_{i=0}^3 such that

    Σ_{j=0}^3 F_j(L̄_j) ≤ ρ/m   and   L̄_j^2 Σ_{i=0}^3 F_i(L̄_i) + 2 ∫_{L̄_j}^∞ t F_j(t) dt ≤ ε/m,

and

    m ≳ C · s · (log^2(s) + log(Nd)),  where  N := (1/ε) L̄_3 R_X d√s  and  C := (1/ε^2) ( log(L̄_2/(ερ)) / log(s) + 1 ) ( L̄_1^2 B + L̄_2^2 ),

B = B_{00} + B_{02} + B_{10} + B_{12}, ε = min{ε_0, ε_2}, L̄_r = max_{i≤r} L̄_i.

SLIDE 83–85

Stability without the random signs assumption

Theorem
Let

    X_j^near := { x ∈ X ; d_H(x, x_j) ≤ r_near }  and  X^far := X \ ∪_{j=1}^s X_j^near.   (5.1)

Suppose µ0 = Σ_{j=1}^s a_j δ_{x_j} + ν0, where ν0 ⊥ Σ_j a_j δ_{x_j}. Suppose that ‖w‖ ≤ δ and λ ∼ δ/√s (ignoring log factors). Then any solution µ̂ to P_λ(y) is approximately s-sparse: defining the “projection” of |µ̂| onto {x_j} by P_X(|µ̂|) := Σ_{j=1}^s |µ̂|(X_j^near) δ_{x_j}, we have

    T_H^2(|µ̂|, P_X(|µ̂|)) ≲ ( δ√s + |ν0|(X) ) / ε,

where T_H^2(µ, ν) := inf_{µ̂,ν̂} W_H^2(µ̂, ν̂) + |µ − µ̂|(X) + |ν − ν̂|(X).

Moreover, we have

    √( Σ_j | a_j − µ̂(X_j^near) |^2 ) ≲ ( L̄_1/ε ) (1 + |µ0|(X)) δ√s.

SLIDE 86–87

Summary

Papers:
◮ Support Localization and the Fisher Metric for off-the-grid Sparse Regularization, arXiv 1810.03340, AISTATS 2019.
◮ A Dual Certificates Analysis of Compressive Off-the-Grid Recovery, arXiv 1802.08464.

Summary:
◮ Extended existing results to general measurement operators and the multivariate setting.
◮ Introduced the Fisher metric, which offers a natural way of imposing the separation condition and allows a unified way of approaching non-translation-invariant problems.
◮ Quantitative support stability under a random signs assumption.
◮ Removal of the random signs condition (with support concentration guarantees).

Outlook:
◮ Algorithmic implications of the Fisher metric / natural gradient?
◮ Our results are optimal w.r.t. s, but what about d?
◮ How should we quantify noise stability in general?

Thanks for listening!