SLIDE 1

Nonconvex Phase Retrieval with Random Gaussian Measurements

Yuejie Chi
Department of Electrical and Computer Engineering
December 2017, CSA, Berlin

SLIDE 2

Acknowledgements

  • Primary Collaborators: Yingbin Liang (OSU), Yuanxin Li (OSU), Yi Zhou (OSU), Huishuai Zhang (MSRA), Yuxin Chen (Princeton), Cong Ma (Princeton), and Kaizheng Wang (Princeton).

  • This research is supported by AFOSR, ONR and NSF.

SLIDE 3

Phase Retrieval: The Missing Phase Problem

  • In high-frequency (e.g., optical) applications, the detection devices (e.g., CCD cameras, photosensitive films, and the human eye) cannot measure the phase of a light wave.
  • Optical devices measure the photon flux (no. of photons per second per unit area), which is proportional to the squared magnitude of the wave.
  • This leads to the so-called phase retrieval problem: inference with only intensity measurements.

SLIDE 4

Computational Imaging

Phase retrieval is the foundation for modern computational imaging.

Figure: applications including ankylography, ptychography, space telescopes, and terahertz imaging.

SLIDE 5

Mathematical Setup

  • Phase retrieval: estimate $x^\natural \in \mathbb{R}^n/\mathbb{C}^n$ from $m$ phaseless measurements
$$y_i = |\langle a_i, x^\natural \rangle|, \quad i = 1, \ldots, m,$$
where $a_i$ corresponds to the $i$th measurement vector.
  • $a_i$'s are (coded or oversampled) Fourier transform vectors;
  • $a_i$'s are short-time Fourier transform vectors;
  • $a_i$'s are "generic" vectors such as random Gaussian vectors.
  • In a vectorized notation, we write
$$y = |Ax^\natural| \in \mathbb{R}_+^m, \qquad \text{where } A = \begin{bmatrix} -\, a_1^* \, - \\ -\, a_2^* \, - \\ \vdots \\ -\, a_m^* \, - \end{bmatrix} \in \mathbb{R}^{m \times n}/\mathbb{C}^{m \times n}.$$
  • Phase retrieval solves a quadratic nonlinear system, since
$$y_i^2 = |\langle a_i, x^\natural \rangle|^2 = (x^\natural)^* a_i a_i^* x^\natural, \quad i = 1, \ldots, m.$$
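To make the setup concrete, here is a minimal NumPy sketch of the Gaussian measurement model (real-valued case; the dimensions and random seed are illustrative choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 100, 600                  # signal dimension and number of measurements (illustrative)

x_true = rng.standard_normal(n)  # ground truth x^natural
A = rng.standard_normal((m, n))  # rows a_i are i.i.d. N(0, I) measurement vectors
y = np.abs(A @ x_true)           # phaseless (amplitude) measurements y = |A x^natural|
```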

SLIDE 6

Nonconvex Procedure

$$\hat{x} = \mathop{\mathrm{argmin}}_{x \in \mathbb{R}^n/\mathbb{C}^n} \; \frac{1}{m} \sum_{i=1}^m \ell(y_i; x)$$

  • Initialize $x_0$ via spectral methods to land in the neighborhood of the ground truth;
  • Iteratively update using simple methods such as gradient descent and alternating minimization.

Figure credit: Yuxin Chen.
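For reference, a minimal sketch of one common spectral initializer (the variant built from the intensities $y_i^2$ with a standard norm estimate; the exact truncation and normalization vary across papers, so treat this as an assumption-laden sketch):

```python
import numpy as np

def spectral_init(A, y):
    # Leading eigenvector of (1/m) * sum_i y_i^2 a_i a_i^T,
    # scaled by the estimated signal norm (E[y_i^2] = ||x||^2 for Gaussian a_i).
    m, n = A.shape
    Y = (A.T * y**2) @ A / m              # (1/m) sum_i y_i^2 a_i a_i^T
    eigvals, eigvecs = np.linalg.eigh(Y)  # symmetric eigendecomposition
    v = eigvecs[:, -1]                    # eigenvector of the largest eigenvalue
    return np.sqrt(np.mean(y**2)) * v
```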

SLIDE 7–9

Quadratic Loss of Amplitudes

Wirtinger Flow (WF) employs the intensity-based loss surface [Candès et al.]:
$$\ell_{WF}(x) = \frac{1}{m} \sum_{i=1}^m \left( |\langle a_i, x^\natural \rangle|^2 - |\langle a_i, x \rangle|^2 \right)^2,$$
which is nonconvex and smooth.

Reshaped Wirtinger Flow (RWF): In contrast, we propose to minimize the quadratic loss of the amplitude measurements:
$$\ell(x) := \frac{1}{m} \big\| y - |Ax| \big\|_2^2 = \frac{1}{m} \sum_{i=1}^m \ell(y_i; x) = \frac{1}{m} \sum_{i=1}^m \left( |\langle a_i, x^\natural \rangle| - |\langle a_i, x \rangle| \right)^2,$$
which is nonconvex and nonsmooth.

Which one is better?

SLIDE 10

The Choice of Loss Function is Important

The quadratic loss of amplitudes enjoys better curvature in expectation!

Figure: Surface of the expected loss function of (a) least-squares (mirrored symmetrically), (b) the quadratic loss of amplitudes $\ell(x)$, and (c) the quadratic loss of intensity $\ell_{WF}(x)$, when $x^\natural = [1, -1]^\top$.

In fact, Error Reduction (ER), proposed by Gerchberg and Saxton in 1972, can be interpreted as alternating minimization using ℓ(x).

SLIDE 11

Gradient Descent

Reshaped Wirtinger Flow (RWF):

  • The generalized gradient of $\ell(x)$ can be calculated as
$$\nabla \ell(x) = \frac{1}{m} \sum_{i=1}^m \left( \langle a_i, x \rangle - y_i \cdot \mathrm{sign}(\langle a_i, x \rangle) \right) a_i.$$
  • Start with an initialization $x_0$. At iteration $t = 0, 1, \ldots$,
$$x_{t+1} = x_t - \mu \nabla \ell(x_t) = \left( I - \frac{\mu}{m} A^* A \right) x_t + \frac{\mu}{m} A^* \mathrm{diag}(y)\, \mathrm{sign}(A x_t),$$
where $\mu$ is the step size.

  • Stochastic versions are even faster.
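A minimal sketch of the batch RWF iteration under these definitions (the step size and iteration count are illustrative rather than the tuned constants from the theory; `spectral_init` refers to the sketch after Slide 6):

```python
import numpy as np

def rwf(A, y, x0, mu=0.8, n_iter=500):
    # Gradient descent on the amplitude-based loss l(x):
    # grad = (1/m) * sum_i (<a_i, x> - y_i * sign(<a_i, x>)) a_i
    m = A.shape[0]
    x = x0.copy()
    for _ in range(n_iter):
        Ax = A @ x
        x = x - mu * (A.T @ (Ax - y * np.sign(Ax))) / m
    return x
```

On the Gaussian model sketched after Slide 5, `x_hat = rwf(A, y, spectral_init(A, y))` should recover `x_true` up to the global sign ambiguity.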

SLIDE 12

Statistical Measurement Model

Strong performance guarantees are possible by leveraging statistical properties of the measurement ensemble.

  • Gaussian measurement model:
$$a_i \sim \mathcal{N}(0, I) \ \text{i.i.d. if real-valued}, \qquad a_i \sim \mathcal{CN}(0, I) \ \text{i.i.d. if complex-valued}.$$
  • Distance measure:
$$\mathrm{dist}(x, z) = \min_{\phi \in [0, 2\pi)} \left\| x - e^{j\phi} z \right\|_2.$$
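The minimization over the global phase has a closed form, $\mathrm{dist}(x,z)^2 = \|x\|_2^2 + \|z\|_2^2 - 2|\langle x, z \rangle|$, since the optimal $\phi$ aligns $z$ with $x$. A small sketch covering both the real and complex cases:

```python
import numpy as np

def dist(x, z):
    # dist(x, z) = min over phi of ||x - exp(j*phi) * z||_2;
    # the optimal phi aligns z with x, giving the closed form below.
    d2 = np.linalg.norm(x)**2 + np.linalg.norm(z)**2 - 2 * abs(np.vdot(z, x))
    return np.sqrt(max(d2, 0.0))   # clip tiny negative values from roundoff
```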

SLIDE 13–14

Linear Convergence of RWF

Theorem (Zhang, Zhou, Liang, C., 2016)
Under i.i.d. Gaussian design, RWF achieves
$$\| x_t - x^\natural \|_2 \lesssim \left( 1 - \frac{\mu}{2} \right)^t \| x^\natural \|_2 \quad (\text{linear convergence})$$
provided that the step size $\mu = O(1)$ is small enough and the sample size $m \gtrsim n$.

            loss function     regularization   step size   sample size
  WF        intensity-based   no               O(1/n)      O(n log n)
  RWF       amplitude-based   no               O(1)        O(n)
  TWF       intensity-based   truncation       O(1)        O(n)

WF can be improved by designing a better loss function or introducing proper regularization. But is it really that bad?

Zhang, Zhou, Liang and C., "Reshaped Wirtinger Flow and Incremental Algorithms for Solving Quadratic Systems of Equations", Journal of Machine Learning Research, to appear.

SLIDE 15–16

Another look at WF

  • The local Hessian of the WF loss satisfies w.h.p. when $m = O(n \log n)$:
$$\frac{1}{2} I \preceq \nabla^2 \ell_{WF}(x) \preceq n I.$$
  • This implies a step size of $\mu = O(1/n)$, and hence $O(n \log(1/\epsilon))$ iterations to reach $\epsilon$-accuracy.

Numerically, WF can run much more aggressively! ($\mu = 0.1$)

SLIDE 17–21

Gradient descent theory

Which region enjoys both strong convexity and smoothness? The Hessian of the WF loss is
$$\nabla^2 \ell_{WF}(x) = \frac{1}{m} \sum_{k=1}^m \left[ 3 \left( a_k^\top x \right)^2 - \left( a_k^\top x^\natural \right)^2 \right] a_k a_k^\top,$$
which is not smooth if $x$ and $a_k$ are too close (coherent). Both properties hold when:

  • $x$ is not far away from $x^\natural$;
  • $x$ is incoherent w.r.t. the sampling vectors (incoherence region):
$$\frac{\left| a_k^\top (x - x^\natural) \right|}{\left\| x - x^\natural \right\|_2} \lesssim \sqrt{\log n} \quad \text{for all } k.$$

SLIDE 22–30

A second look at gradient descent theory

(Figure: region of local strong convexity + smoothness.)

  • Prior theory only ensures that iterates remain in the ℓ2 ball, but not in the incoherence region.
  • Prior theory enforces regularization to promote incoherence.

SLIDE 31–35

WF is implicitly regularized

(Figure: region of local strong convexity + smoothness.)

WF implicitly forces iterates to remain incoherent.

SLIDE 36–38

Theoretical guarantees

Theorem (Ma, Wang, C., Chen, 2017)
Under i.i.d. Gaussian design, WF achieves
$$\max_{1 \le k \le m} \left| a_k^\top (x_t - x^\natural) \right| \lesssim \sqrt{\log n} \cdot \| x^\natural \|_2 \quad (\text{incoherence})$$
$$\| x_t - x^\natural \|_2 \lesssim \left( 1 - \frac{\mu}{2} \right)^t \| x^\natural \|_2 \quad (\text{near-linear convergence})$$
provided that the step size $\mu \asymp \frac{1}{\log n}$ and the sample size $m \gtrsim n \log n$.

  WF       step size      sample size
  Prior    O(1/n)         O(n log n)
  Ours     O(1/log n)     O(n log n)

  • The step size is $\mu = O(1)$ if $m = O(n \log^2 n)$.
  • This theory of "implicit regularization" is much more general and can be extended to analyze matrix completion and blind deconvolution.

Ma, Wang, C. and Chen, "Implicit Regularization in Nonconvex Statistical Estimation: Gradient Descent Converges Linearly for Phase Retrieval, Matrix Completion and Blind Deconvolution", arXiv:1711.10467.

SLIDE 39

Key ingredient: leave-one-out analysis

For each $1 \le l \le m$, introduce leave-one-out iterates $x_{t,(l)}$ by dropping the $l$th measurement.

SLIDE 40–41

Key ingredient: leave-one-out analysis

  • Leave-one-out iterates $x_{t,(l)}$ are independent of $a_l$, and are hence incoherent w.r.t. $a_l$ with high probability.
  • Leave-one-out iterates $x_{t,(l)}$ and the true iterates $x_t$ are very close.

SLIDE 42–43

Robust Phase Retrieval with Outliers

Assume the measurements are corrupted by both sparse outliers and dense noise:
$$y_i = |\langle a_i, x^\natural \rangle| + \eta_i + w_i, \quad i = 1, \ldots, m,$$
where $\eta$ with $\|\eta\|_0 \le s \cdot m$ is the sparse outlier vector with arbitrary values, $0 \le s < 1$ (a simulation sketch of this model follows below).

  • Goal: develop algorithms that are oblivious to outliers, and statistically and computationally efficient.
  • Existing approaches are not robust, since they rely on the sample mean:
  • Spectral initialization: the eigenvector of
$$Y = \frac{1}{m} \sum_{i=1}^m y_i a_i a_i^*$$
can be arbitrarily perturbed;
  • Gradient descent: the search direction of
$$x_{t+1} = x_t - \frac{\mu}{m} \sum_{i=1}^m \nabla \ell(y_i; x_t)$$
can be arbitrarily perturbed.
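A quick sketch of this corruption model, reusing `A`, `x_true`, and `m` from the Slide 5 sketch (the outlier fraction, magnitudes, and noise level below are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
s = 0.1                                            # fraction of outliers (illustrative)
idx = rng.choice(m, size=int(s * m), replace=False)
eta = np.zeros(m)
eta[idx] = rng.uniform(0.0, 10.0, size=idx.size)   # arbitrary outlier values
w = 0.01 * rng.standard_normal(m)                  # dense noise (illustrative level)
y_corrupt = np.abs(A @ x_true) + eta + w
```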

SLIDE 44–46

Median-Truncated Wirtinger Flow

Key approach: "median-truncation": we rule out measurements adaptively at each iteration, based on how much the sample gradient/value deviates from the median.

  • Median-truncated spectral initialization: the leading eigenvector of
$$Y = \frac{1}{m} \sum_{i=1}^m y_i a_i a_i^* \, \mathbb{1}\{ |y_i| \lesssim \mathrm{median}(\{y_i\}) \}.$$
  • Median-truncated gradient descent:
$$x_{t+1} = x_t - \frac{\mu}{m} \sum_{i \in \mathcal{T}_t} \nabla \ell(y_i; x_t),$$
where the set $\mathcal{T}_t$ contains the samples whose residuals do not deviate too much from the sample median:
$$\mathcal{T}_t = \left\{ i : r_i^{(t)} \lesssim \mathrm{median}(\{ r_i^{(t)} \}) \right\}, \quad \text{where } r_i^{(t)} = \left| y_i - |a_i^* x_t| \right|.$$
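A minimal sketch of the median-truncated gradient iteration, using the amplitude-based per-sample gradient from Slide 11 (the threshold multiplier `alpha` is a hypothetical stand-in for the truncation constant in the paper):

```python
import numpy as np

def median_truncated_gd(A, y, x0, mu=0.6, n_iter=500, alpha=3.0):
    # Keep only samples whose residual r_i = |y_i - |<a_i, x>|| is within
    # alpha times the sample median, then take a gradient step on those.
    m = A.shape[0]
    x = x0.copy()
    for _ in range(n_iter):
        Ax = A @ x
        r = np.abs(y - np.abs(Ax))            # residuals r_i^(t)
        keep = r <= alpha * np.median(r)      # truncation set T_t
        x = x - mu * (A[keep].T @ (Ax[keep] - y[keep] * np.sign(Ax[keep]))) / m
    return x
```

On the corrupted measurements `y_corrupt` above, the truncation discards most outlier-contaminated samples at each step, which a plain gradient step does not.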

SLIDE 47–48

Performance guarantees

Theorem (Zhang, C. and Liang, 2016)
Under i.i.d. Gaussian design, median-TWF achieves
$$\| x_t - x^\natural \|_2 \lesssim \left( 1 - \frac{\mu}{2} \right)^t \| x^\natural \|_2 \quad (\text{linear convergence})$$
provided that the step size $\mu = O(1)$ is small enough, the sample size $m = O(n \log n)$, and the fraction of corruption $s = O(1)$.

Figure: success rate vs. the fraction of outliers $s$ for TWF, trimean-TWF, median-TWF, and median-RWF, with outlier magnitudes (c) $\|\eta\|_\infty = 0.1 \|x^\natural\|^2$ and (d) $\|\eta\|_\infty = \|x^\natural\|^2$.

Similar strategies can be used to robustify other low-rank estimation problems.

SLIDE 49

Conclusions

  • Simple iterative gradient descent methods are demonstrated to perform remarkably well for phase retrieval with Gaussian measurements, provided a good initialization.
  • All the results can be extended to low-rank models, where
$$y_i = \| a_i^* U \|^2 = a_i^* X a_i, \qquad U \in \mathbb{R}^{n \times r}/\mathbb{C}^{n \times r} \text{ with rank } r \ll n, \; X = U U^*$$
(ongoing work).
  • "Implicit regularization" for other algorithms, such as alternating minimization and SGD, also warrants further study.

SLIDE 50

References

  • 1. H. Zhang, Y. Liang, Y. Zhou and Y. Chi, "Reshaped Wirtinger Flow and Incremental Algorithms for Solving Quadratic Systems of Equations", Journal of Machine Learning Research, 2017, to appear.
  • 2. C. Ma, K. Wang, Y. Chi and Y. Chen, "Implicit Regularization for Nonconvex Statistical Estimation", arXiv:1711.10467.
  • 3. H. Zhang, Y. Chi and Y. Liang, "Median-Truncated Nonconvex Approach for Phase Retrieval with Outliers", submitted to IEEE Transactions on Information Theory, 2017. Short version appeared at ICML 2016.

Thank you!
