SLIDE 1

Joint Blind Deconvolution and Blind Demixing via Nonconvex Optimization

Shuyang Ling

Department of Mathematics, UC Davis

July 19, 2017

Shuyang Ling (UC Davis) FOCM, Barcelona, 2017 July 19, 2017 1 / 28

SLIDE 2

Acknowledgements

Research in collaboration with: Prof. Xiaodong Li (UC Davis), Prof. Thomas Strohmer (UC Davis), and Dr. Ke Wei (UC Davis). This work is sponsored by NSF-DMS and DARPA.

SLIDE 3

Outline

(a) Blind deconvolution meets blind demixing: applications in image processing and wireless communication
(b) Mathematical models and convex approach
(c) A nonconvex optimization approach towards joint blind deconvolution and blind demixing

SLIDE 4

What is blind deconvolution?

Suppose we observe a function y consisting of the convolution of two unknown functions, the blurring function f and the signal of interest g, plus noise w:

y = f ∗ g + w.

How can we reconstruct f and g from y? This is obviously a highly ill-posed bilinear inverse problem, much more difficult than ordinary deconvolution, yet with important applications in various fields.

Solvability: what conditions on f and g make this problem solvable?
How: what algorithms shall we use to recover f and g?
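The ill-posedness is easy to see concretely. Below is a minimal numpy sketch (the helper `cconv` and the toy sizes are illustrative, not from the talk) showing two classical ambiguities of y = f ∗ g under circular convolution: a circular shift traded between the factors, and a scalar traded between the factors.

```python
import numpy as np

rng = np.random.default_rng(0)
L = 32
f = rng.standard_normal(L)
g = rng.standard_normal(L)

def cconv(a, b):
    # Circular convolution computed via the FFT.
    return np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)))

y = cconv(f, g)

# Shifting f forward and g backward by the same amount leaves y unchanged:
assert np.allclose(y, cconv(np.roll(f, 3), np.roll(g, -3)))

# So does trading a scalar between the two factors:
assert np.allclose(y, cconv(2.0 * f, g / 2.0))
```

Without side information on f and g, the data y cannot distinguish any of these pairs, which is why structural assumptions (subspace, randomness) appear later in the talk.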

SLIDE 5

Why do we care about blind deconvolution?

Image deblurring

Let f be the blurring kernel and g the original image; then y = f ∗ g is the blurred image. Question: how to reconstruct f and g from y?

[Figure: blurred image y = (blurring kernel f) ∗ (original image g) + (noise w)]

SLIDE 6

Blind deconvolution meets blind demixing

Suppose there are s users and each of them sends a message x_i, which is encoded by C_i, to a common receiver. Each encoded message g_i = C_i x_i is convolved with an unknown impulse response function f_i.

[Figure: users 1, …, s each encode a message x_i as the signal g_i = C_i x_i and transmit it through an unknown channel f_i; the receiver observes the superposition y = Σ_{i=1}^s f_i ∗ g_i + w, and the decoder estimates (f_i, x_i) for every user.]

SLIDE 7

Blind deconvolution and blind demixing

Consider the model

y = Σ_{i=1}^s f_i ∗ g_i + w.

This is even more difficult than blind deconvolution (s = 1), since it is a “mixture” of blind deconvolution problems. It also includes phase retrieval as a special case if s = 1 and ḡ_i = f_i.

More assumptions

Each impulse response f_i has maximum delay spread K (compact support): f_i(n) = 0 for n > K, i.e., f_i = [h_i; 0] with h_i ∈ C^K.
Let g_i := C_i x_i be the signal x_i ∈ C^N encoded by C_i ∈ C^{L×N} with L > N. We also require the C_i to be mutually incoherent by imposing randomness.

SLIDE 8

Mathematical model

Subspace assumption on the frequency domain

Denote by F the L × L DFT matrix. Let h_i ∈ C^K be the first K nonzero entries of f_i and let B ∈ C^{L×K} be a low-frequency DFT matrix (the first K columns of F). There holds

f̂_i = F f_i = B h_i.

Let ĝ_i := A_i x_i, where A_i := F C_i and x_i ∈ C^N.

Mathematical model

y = Σ_{i=1}^s diag(B h_i) A_i x_i + w.

Goal: we want to recover (h_i, x_i)_{i=1}^s from (y, B, A_i)_{i=1}^s.

Remark: the degrees of freedom of the unknowns are s(K + N); the number of constraints is L.
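The frequency-domain model can be checked numerically: under the DFT, circular convolution becomes entrywise multiplication, so each user's contribution takes the diag(Bh)Ax form. A small numpy sketch for one user (toy sizes and the random encoder C are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
L, K, N = 64, 8, 16

# Channel with delay spread K: f = [h; 0] with h in C^K.
h = rng.standard_normal(K) + 1j * rng.standard_normal(K)
f = np.concatenate([h, np.zeros(L - K)])

# Encoded signal g = C x with a tall random encoding matrix C in C^{L x N}.
C = rng.standard_normal((L, N)) + 1j * rng.standard_normal((L, N))
x = rng.standard_normal(N) + 1j * rng.standard_normal(N)
g = C @ x

# Time domain: circular convolution y_n = sum_m f_m g_{(n-m) mod L}.
y_time = np.array([(f * g[(n - np.arange(L)) % L]).sum() for n in range(L)])

# Frequency domain: F f = B h (B = first K columns of the DFT matrix F),
# and F(f * g) = diag(F f) F g = diag(B h) A x with A = F C.
F = np.fft.fft(np.eye(L))
B = F[:, :K]
A = F @ C
y_freq = (B @ h) * (A @ x)

assert np.allclose(np.fft.fft(y_time), y_freq)
```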

SLIDE 10

Naive approach

Nonlinear least squares

We may want to try a nonlinear least squares approach:

min_{(h_i, x_i)} F(h_i, x_i) := ‖ Σ_{i=1}^s diag(B h_i) A_i x_i − y ‖².

The objective function is highly nonconvex and more complicated than blind deconvolution (s = 1). Gradient descent might get stuck at local minima. No guarantees for recoverability.

SLIDE 12

Convex relaxation and low-rank matrix recovery

Lifting

Let a_{i,l} be the l-th column of A_i^* and b_l the l-th column of B^*. Then

y_l = Σ_{i=1}^s (B h_i)_l · (A_i x_i)_l = Σ_{i=1}^s b_l^* (h_i x_i^*) a_{i,l},

where each h_i x_i^* is rank-1. Let X_i := h_i x_i^* and define the linear operator A_i : C^{K×N} → C^L as

A_i(Z) := { b_l^* Z a_{i,l} }_{l=1}^L = { ⟨Z, b_l a_{i,l}^*⟩ }_{l=1}^L.

Then there holds y = Σ_{i=1}^s A_i(X_i) + w.

See [Candès-Strohmer-Voroninski 13], [Ahmed-Recht-Romberg 14].
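A quick numpy check of the lifting identity for one user, with generic random B and A in place of the structured matrices (the algebra does not depend on their distribution); the name `lift_op` is illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
L, K, N = 32, 4, 6
B = rng.standard_normal((L, K)) + 1j * rng.standard_normal((L, K))
A = rng.standard_normal((L, N)) + 1j * rng.standard_normal((L, N))
h = rng.standard_normal(K) + 1j * rng.standard_normal(K)
x = rng.standard_normal(N) + 1j * rng.standard_normal(N)

def lift_op(Z):
    # A(Z) = { b_l^* Z a_l }_{l=1}^L, where b_l^* is the l-th row of B
    # and a_l is the conjugate of the l-th row of A.
    return np.einsum("lk,kn,ln->l", B, Z, A.conj())

X = np.outer(h, x.conj())               # rank-1 lifted variable X = h x^*
y_lift = lift_op(X)                      # linear in X
y_bilinear = (B @ h) * np.conj(A @ x)    # b_l^* h * x^* a_l, entry by entry
assert np.allclose(y_lift, y_bilinear)

# Linearity in Z is what makes the lifted problem amenable to convex methods:
Z1 = rng.standard_normal((K, N))
Z2 = rng.standard_normal((K, N))
assert np.allclose(lift_op(Z1 + 2 * Z2), lift_op(Z1) + 2 * lift_op(Z2))
```

The bilinear measurement in (h, x) becomes a linear measurement of the rank-1 matrix X, at the price of squaring the number of unknowns.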

SLIDE 13

Convex relaxation and low-rank matrix recovery

Rank-s matrix recovery

We rewrite y = Σ_{i=1}^s diag(B h_i) A_i x_i as

y_l = ⟨ blkdiag(h_1 x_1^*, h_2 x_2^*, …, h_s x_s^*), blkdiag(b_l a_{1,l}^*, b_l a_{2,l}^*, …, b_l a_{s,l}^*) ⟩,

where the first factor is a rank-s block-diagonal matrix. In other words: recover a rank-s block-diagonal matrix satisfying convex constraints.

Finding such a rank-s matrix is generally an NP-hard problem.

SLIDE 14

Low-rank matrix recovery

Nuclear norm minimization

The ground truth is a rank-s block-diagonal matrix, so it is natural to recover the solution by solving

min Σ_{i=1}^s ‖Z_i‖_*  subject to  Σ_{i=1}^s A_i(Z_i) = y,

where Σ_{i=1}^s ‖Z_i‖_* is the nuclear norm of blkdiag(Z_1, …, Z_s).

Question: can we recover {h_{i0} x_{i0}^*}_{i=1}^s exactly?
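Two facts behind this formulation, the rank of the block-diagonal ground truth and the decoupling of the nuclear norm across blocks, can be verified in a few lines of numpy (the dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
K, N, s = 4, 5, 3

def nuc(M):
    # Nuclear norm = sum of singular values.
    return np.linalg.svd(M, compute_uv=False).sum()

# Rank-1 blocks h_i x_i^* placed on the diagonal of an (sK) x (sN) matrix.
blocks = [np.outer(rng.standard_normal(K), rng.standard_normal(N)) for _ in range(s)]
big = np.zeros((s * K, s * N))
for i, Zi in enumerate(blocks):
    big[i * K:(i + 1) * K, i * N:(i + 1) * N] = Zi

# The block-diagonal ground truth has rank s, and its nuclear norm
# is exactly the sum of the blocks' nuclear norms, as used above.
assert np.linalg.matrix_rank(big) == s
assert np.isclose(nuc(big), sum(nuc(Zi) for Zi in blocks))
```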

SLIDE 15

Convex approach

Theorem

Assume that B ∈ C^{L×K} is a partial DFT matrix with B^*B = I_K and that each A_i is a Gaussian random matrix. Then the SDP relaxation recovers {(h_{i0}, x_{i0})}_{i=1}^s exactly with probability at least 1 − O(L^{−γ}), provided the number of measurements L satisfies

[Ling-Strohmer 15]: L ≥ C_γ s² (K + μ_h² N) log³ L;
[Jung-Krahmer-Stöger 17]: L ≥ C_γ s (K + μ_h² N) log³ L,

where μ_h² = L max_{1≤i≤s} ‖B h_{i0}‖_∞² / ‖h_{i0}‖².

We can jointly estimate the channels and signals for s users with one simple convex program. The SDP is able to recover {(h_i, x_i)}_{i=1}^s, but it is computationally expensive. Can we instead solve this problem simply with gradient descent, which also comes with provable guarantees?

SLIDE 17

A nonconvex optimization approach?

An increasing list of nonconvex approaches to various problems in machine learning and signal processing:

Phase retrieval: Candès, Li, Soltanolkotabi, Chen, Wright, Sun, etc.
Matrix completion: Sun, Luo, Montanari, etc.
Various problems: Recht, Wainwright, Constantine, etc.

Two-step philosophy for provable nonconvex optimization

(a) Use a spectral method to construct a starting point inside “the basin of attraction”;
(b) Run gradient descent.

The key is to build up “the basin of attraction”.

SLIDE 18

Building “the basin of attraction”

The basin of attraction relies on the following three observations.

Observation 1: Unboundedness of solution

If the pair (h_{i0}, x_{i0}) is a solution to y = Σ_{i=1}^s diag(B h_{i0}) A_i x_{i0}, then so is the pair (α_i h_{i0}, α_i^{−1} x_{i0}) for any α_i ≠ 0. Thus the blind deconvolution problem always has infinitely many solutions of this type, and we can recover (h_{i0}, x_{i0}) only up to a scalar. It is possible that ‖h_i‖ ≫ ‖x_i‖ (or vice versa) while ‖h_i‖ · ‖x_i‖ is fixed. Hence we define N_{d0} to balance ‖h_i‖ and ‖x_i‖:

N_{d0} := { {(h_i, x_i)}_{i=1}^s : ‖h_i‖ ≤ 2√d_{i0}, ‖x_i‖ ≤ 2√d_{i0} },

where d_{i0} = ‖h_{i0}‖ ‖x_{i0}‖.
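A numpy sketch of Observation 1 for s = 1 (sizes and names are illustrative): the measurements are invariant under the scaling, and any pair can be rebalanced so that ‖h‖ = ‖x‖ = √d_0:

```python
import numpy as np

rng = np.random.default_rng(3)
L, K, N = 50, 4, 6
B = rng.standard_normal((L, K)) + 1j * rng.standard_normal((L, K))
A = rng.standard_normal((L, N)) + 1j * rng.standard_normal((L, N))
h = rng.standard_normal(K) + 1j * rng.standard_normal(K)
x = rng.standard_normal(N) + 1j * rng.standard_normal(N)

y = (B @ h) * (A @ x)   # noiseless measurements diag(Bh) A x

# (h, x) -> (alpha h, x / alpha) leaves every measurement unchanged:
alpha = 1.5 * np.exp(0.7j)
y_scaled = (B @ (alpha * h)) * (A @ (x / alpha))
assert np.allclose(y, y_scaled)

# The balanced representative targeted by N_d0: rescale the pair so that
# ||h|| = ||x|| = sqrt(d0), where d0 = ||h|| * ||x|| is scale-invariant.
d0 = np.linalg.norm(h) * np.linalg.norm(x)
t = np.sqrt(np.linalg.norm(x) / np.linalg.norm(h))
h_bal, x_bal = t * h, x / t
assert np.isclose(np.linalg.norm(h_bal), np.sqrt(d0))
assert np.isclose(np.linalg.norm(x_bal), np.sqrt(d0))
```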

SLIDE 19

Building “the basin of attraction”

Observation 2: Incoherence

Our numerical experiments show that the algorithm's performance depends on how much the b_l (the rows of B) and h_{i0} are correlated:

μ_h² := max_{1≤i≤s} L ‖B h_{i0}‖_∞² / ‖h_{i0}‖²;

the smaller μ_h, the better. Therefore, we introduce N_μ to control the incoherence:

N_μ := { {h_i}_{i=1}^s : √L ‖B h_i‖_∞ ≤ 4μ √d_{i0} }.

“Incoherence” is not a new idea. In matrix completion one also requires that the left and right singular vectors of the ground truth not be too “aligned” with the measurement matrices {b_l a_{i,l}^*}_{1≤l≤L}.

SLIDE 20

Building “the basin of attraction”

Observation 3: “Close” to the ground truth

We define N_ε to quantify the closeness of {(h_i, x_i)}_{i=1}^s to the true solution:

N_ε := { {(h_i, x_i)}_{i=1}^s : ‖h_i x_i^* − h_{i0} x_{i0}^*‖_F ≤ ε d_{i0} }.

We want to find an initial guess close to {(h_{i0}, x_{i0})}_{i=1}^s.

SLIDE 21

Building “the basin of attraction”

Based on the three observations above, we define three neighborhoods.

The basin of attraction

The basin of attraction is the intersection N_{d0} ∩ N_μ ∩ N_ε of the following three sets:

N_{d0} := { {(h_i, x_i)}_{i=1}^s : ‖h_i‖ ≤ 2√d_{i0}, ‖x_i‖ ≤ 2√d_{i0}, 1 ≤ i ≤ s },
N_μ := { {h_i}_{i=1}^s : √L ‖B h_i‖_∞ ≤ 4μ √d_{i0}, 1 ≤ i ≤ s },
N_ε := { {(h_i, x_i)}_{i=1}^s : ‖h_i x_i^* − h_{i0} x_{i0}^*‖_F / d_{i0} ≤ ε, 1 ≤ i ≤ s },

where d_{i0} = ‖h_{i0}‖ ‖x_{i0}‖, μ is a parameter with μ ≥ μ_h, and ε is a predetermined parameter in (0, 1/15].

SLIDE 22

Objective function: a variant of projected gradient descent

The objective function F̃ consists of two parts, F and G:

min_{(h,x)} F̃(h, x) := F(h, x) + G(h, x),

where F(h, x) := ‖ Σ_{i=1}^s A_i(h_i x_i^*) − y ‖² is the least-squares term and G is the regularization term

G(h, x) := ρ Σ_{i=1}^s [ G_0(‖h_i‖² / (2d_i)) + G_0(‖x_i‖² / (2d_i)) + Σ_{l=1}^L G_0( L |b_l^* h_i|² / (8 d_i μ²) ) ].

The first two summands of G enforce N_{d0} (balance ‖h_i‖ and ‖x_i‖); the last one enforces N_μ (impose incoherence). Here G_0(z) = max{z − 1, 0}², ρ ≈ d², d ≈ d_0, d_i ≈ d_{i0}, and μ ≥ μ_h.
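A sketch of the regularizer for s = 1 (the function name `G_reg` and the concrete parameter values are illustrative): it vanishes on the neighborhoods N_{d0} ∩ N_μ and activates smoothly outside them, so it never perturbs iterates that stay inside the basin.

```python
import numpy as np

def G0(z):
    # G0(z) = max(z - 1, 0)^2: zero for z <= 1, quadratic growth beyond.
    return np.maximum(z - 1.0, 0.0) ** 2

def G_reg(h, x, B, d, mu, rho):
    # Single-user (s = 1) version of the regularizer G(h, x).
    L = B.shape[0]
    pen = G0(np.linalg.norm(h) ** 2 / (2 * d)) + G0(np.linalg.norm(x) ** 2 / (2 * d))
    pen += np.sum(G0(L * np.abs(B @ h) ** 2 / (8 * d * mu ** 2)))
    return rho * pen

L, K, N = 64, 8, 8
B = np.fft.fft(np.eye(L))[:, :K] / np.sqrt(L)   # partial DFT with B^* B = I_K
h, x = np.ones(K), np.ones(N)                    # balanced pair, ||h|| = ||x||
d = np.linalg.norm(h) * np.linalg.norm(x)
mu, rho = 10.0, d ** 2

inside = G_reg(h, x, B, d, mu, rho)              # balanced and incoherent -> 0
outside = G_reg(10 * h, x / 10, B, d, mu, rho)   # wildly unbalanced -> penalized
assert inside == 0 and outside > 0
```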

SLIDE 23

Algorithm: Initialization via spectral method

Note that

A_i^*(y) = A_i^* A_i(h_{i0} x_{i0}^*) + A_i^*( Σ_{j≠i} A_j(h_{j0} x_{j0}^*) ),

where E[A_i^* A_i(h_{i0} x_{i0}^*)] = h_{i0} x_{i0}^* and the second term has mean 0. The leading singular vectors of A_i^*(y) can therefore approximate (h_{i0}, x_{i0}).

Step 1: Initialization via spectral method and projection:

1: for i = 1, 2, …, s do
2:   compute A_i^*(y) (since E[A_i^*(y)] = h_{i0} x_{i0}^*);
3:   (d_i, ĥ_{i0}, x̂_{i0}) = svds(A_i^*(y));
4:   u_i^(0) := P_{N_μ}(√d_i ĥ_{i0}) and v_i^(0) := √d_i x̂_{i0};
5: end for
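The spectral initializer can be reproduced for s = 1 in a few lines of numpy; here an orthonormal random B stands in for the partial DFT matrix, and all sizes are illustrative. The matrix A^*(y) concentrates around h_{i0} x_{i0}^*, so its leading singular vectors align with the ground truth up to the unavoidable scaling/phase ambiguity:

```python
import numpy as np

rng = np.random.default_rng(4)
L, K, N = 4000, 8, 8

h0 = rng.standard_normal(K) + 1j * rng.standard_normal(K)
x0 = rng.standard_normal(N) + 1j * rng.standard_normal(N)

# B with orthonormal columns (stand-in for the partial DFT matrix);
# A complex Gaussian with E[a_l a_l^*] = I.
B, _ = np.linalg.qr(rng.standard_normal((L, K)) + 1j * rng.standard_normal((L, K)))
A = (rng.standard_normal((L, N)) + 1j * rng.standard_normal((L, N))) / np.sqrt(2)

X0 = np.outer(h0, x0.conj())
y = np.einsum("lk,kn,ln->l", B, X0, A.conj())   # y_l = b_l^* X0 a_l

# A^*(y) = sum_l y_l b_l a_l^*, which concentrates around h0 x0^*.
M = B.conj().T @ (y[:, None] * A)

U, svals, Vh = np.linalg.svd(M)
u, v = U[:, 0], Vh[0].conj()
# Alignment of the leading singular vectors with the ground truth:
corr_h = np.abs(np.vdot(u, h0)) / np.linalg.norm(h0)
corr_x = np.abs(np.vdot(v, x0)) / np.linalg.norm(x0)
assert corr_h > 0.8 and corr_x > 0.8
```

With L much larger than K + N the correlations are close to 1, which is exactly what places the initial guess inside the basin.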

SLIDE 24

Algorithm: Wirtinger gradient descent

Step 2: Gradient descent with constant stepsize η:

1: Initialization: obtain (u_i^(0), v_i^(0)) via Algorithm 1.
2: for t = 1, 2, … do
3:   for i = 1, 2, …, s do
4:     u_i^(t) = u_i^(t−1) − η ∇F̃_{h_i}(u^(t−1), v^(t−1))
5:     v_i^(t) = v_i^(t−1) − η ∇F̃_{x_i}(u^(t−1), v^(t−1))
6:   end for
7: end for
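A minimal sketch of the gradient step for s = 1, using only the least-squares term F (no regularizer G and no spectral initialization; the stepsize and the toy sizes are illustrative assumptions). The Wirtinger gradients of F(h, x) = ‖diag(Bh)Ax − y‖² with residual e = (Bh) ⊙ (Ax) − y are ∇_h F = B^*(e ⊙ conj(Ax)) and ∇_x F = A^*(e ⊙ conj(Bh)):

```python
import numpy as np

rng = np.random.default_rng(5)
L, K, N = 400, 5, 5
eta = 5e-5   # small constant stepsize, a hypothetical choice

B = (rng.standard_normal((L, K)) + 1j * rng.standard_normal((L, K))) / np.sqrt(2)
A = (rng.standard_normal((L, N)) + 1j * rng.standard_normal((L, N))) / np.sqrt(2)
h0 = rng.standard_normal(K) + 1j * rng.standard_normal(K)
x0 = rng.standard_normal(N) + 1j * rng.standard_normal(N)
y = (B @ h0) * (A @ x0)   # noiseless measurements, s = 1

# Start near the truth (in practice, from the spectral initializer).
h = h0 + 0.1 * rng.standard_normal(K)
x = x0 + 0.1 * rng.standard_normal(N)

def loss(h, x):
    return np.linalg.norm((B @ h) * (A @ x) - y) ** 2

f_start = loss(h, x)
for _ in range(300):
    e = (B @ h) * (A @ x) - y
    # Wirtinger gradients of the least-squares term:
    grad_h = B.conj().T @ (e * np.conj(A @ x))
    grad_x = A.conj().T @ (e * np.conj(B @ h))
    h, x = h - eta * grad_h, x - eta * grad_x
f_end = loss(h, x)
assert f_end < 1e-4 * f_start   # the residual shrinks from inside the basin
```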

SLIDE 25

Main results

Theorem [Ling-Strohmer 17]

Assume w ∼ CN(0, σ²d_0²/L) and that each A_i is a complex Gaussian matrix. If

L ≥ C_γ (μ_h² + σ²) s² κ⁴ (K + N) log² L log s / ε²,

where κ = max_i d_{i0} / min_i d_{i0}, then with probability at least 1 − L^{−γ+1} there hold:

the initial guess satisfies (u^(0), v^(0)) ∈ (1/√3) N_{d0} ∩ (1/√3) N_μ ∩ N_{2ε/(5√s κ)};

Σ_{i=1}^s ‖u_i^(t)(v_i^(t))^* − h_{i0} x_{i0}^*‖_F² ≤ (1 − α)^t ε d_0 + c_0 √s ‖A^*(w)‖,

where the first term exhibits linear convergence, the second is the error term, and α = O((s(K + N) log² L)^{−1}).

SLIDE 27

Remark

The iterates (u_i^(t), v_i^(t)) converge linearly to (h_{i0}, x_{i0}):

‖u_i^(∞)(v_i^(∞))^* − h_{i0} x_{i0}^*‖_F ≤ c_0 √s ‖A^*(w)‖.

‖A^*(w)‖ converges to 0 at rate O(L^{−1/2}):

‖A^*(w)‖ ≤ C_0 σ d_0 √( s(K + N) log² L / L ).

Therefore, (u_i^(∞), v_i^(∞)) is a consistent estimator of (h_{i0}, x_{i0}).

Challenges: the s² factor is not optimal. The optimal scaling should be L = O(s(K + N)) instead of L = O(s²(K + N)).

SLIDE 29

Numerics: Does L scale linearly with s?

Let each A_i be a complex Gaussian matrix. The number of measurements scales linearly with the number of sources s if K and N are fixed. Approximately, L ≈ 1.5 s (K + N) yields exact recovery.

[Figure: phase-transition plot of regularized gradient descent with Gaussian encoding, K = N = 50; s ranges from 1 to 8 and L from 100 to 1250. Black: failure; white: success.]

SLIDE 30

Back to the communication example

A more practical and useful choice of encoding matrix C i: C i = DiH (i.e., Ai = FDiH) where Di is a diagonal random binary ±1 matrix and H is an L × N deterministic partial Hadamard matrix. With this setting, our approach can demix many users without performing channel estimation.
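The construction C_i = D_i H is easy to reproduce; below is a numpy sketch using the Sylvester recursion for the Hadamard matrix (the sizes are illustrative, and L must be a power of two for this construction):

```python
import numpy as np

rng = np.random.default_rng(6)
L, N = 64, 16

# Sylvester construction of the L x L Hadamard matrix.
H = np.array([[1.0]])
while H.shape[0] < L:
    H = np.block([[H, H], [H, -H]])
H = H[:, :N]                             # deterministic partial Hadamard, L x N

d = rng.choice([-1.0, 1.0], size=L)      # random binary +-1 diagonal D_i
C = d[:, None] * H                       # encoding matrix C_i = D_i H

# The +-1 mask preserves Hadamard orthogonality: C^T C = H^T D^2 H = L I_N.
assert np.allclose(C.T @ C, L * np.eye(N))
```

Such encoders are far cheaper to apply and store than Gaussian matrices, which is what makes the setting practical for multi-user communication.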

[Figure: empirical probability of success versus number of sources s (1 to 12), K = N = 50, partial Hadamard encoding, for L = 512, 786, 1024, 1280.]

L ≈ 1.5 s (K + N) yields exact recovery.

SLIDE 31

Numerics: robustness

We see that the relative error is linearly correlated with the noise in dB. Approximately, 10 units of increase in SNR leads to the same amount of decrease in relative error (in dB).

[Figure: average relative error over 10 samples (dB) versus SNR (dB, 5 to 50) for K = N = 64, s = 6, with Gaussian (left) and Hadamard (right) encoding matrices, at L = 2s(K + N) and L = 4s(K + N).]

SLIDE 32

Outlook and Conclusion

Conclusion: the proposed algorithm is arguably the first blind deconvolution/blind demixing algorithm that is numerically efficient, robust against noise, and comes with rigorous recovery guarantees under subspace conditions.
Open problem: does a similar result hold for other types of A_i?
Open problem: what if either h_i or x_i is sparse?
Major open problem in nonconvex optimization: how to remove the s² dependence for rank-s matrix recovery?
