Nonconvex Phase Retrieval with Random Gaussian Measurements

Yuejie Chi
Department of Electrical and Computer Engineering
December 2017, CSA, Berlin
Acknowledgements
- Primary Collaborators: Yingbin Liang (OSU), Yuanxin Li (OSU), Yi Zhou (OSU), Huishuai Zhang (MSRA), Yuxin Chen (Princeton), Cong Ma (Princeton), and Kaizheng Wang (Princeton).
- This research is supported by AFOSR, ONR, and NSF.
Phase Retrieval: The Missing Phase Problem
- In high-frequency (e.g., optical) applications, the detection devices (e.g., CCD cameras, photosensitive films, and the human eye) cannot measure the phase of a light wave.
- Optical devices measure the photon flux (the number of photons per second per unit area), which is proportional to the magnitude.
- This leads to the so-called phase retrieval problem: inference with only intensity measurements.
Computational Imaging
Phase retrieval is the foundation for modern computational imaging.
Examples: ankylography, ptychography, space telescopes, terahertz imaging.
Mathematical Setup
- Phase retrieval: estimate $x^\natural \in \mathbb{R}^n/\mathbb{C}^n$ from $m$ phaseless measurements
  $$y_i = |\langle a_i, x^\natural \rangle|, \quad i = 1, \ldots, m,$$
  where $a_i$ corresponds to the $i$th measurement vector:
  - $a_i$'s are (coded or oversampled) Fourier transform vectors;
  - $a_i$'s are short-time Fourier transform vectors;
  - $a_i$'s are "generic" vectors such as random Gaussian vectors.
- In a vectorized notation, we write
  $$y = |Ax^\natural| \in \mathbb{R}^m_+, \quad \text{where } A = [a_1, a_2, \ldots, a_m]^* \in \mathbb{R}^{m \times n}/\mathbb{C}^{m \times n}.$$
- Phase retrieval solves a quadratic system of equations, since
  $$y_i^2 = |\langle a_i, x^\natural \rangle|^2 = (x^\natural)^* a_i a_i^* x^\natural, \quad i = 1, \ldots, m.$$
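To make the model concrete, here is a minimal NumPy sketch of the Gaussian measurement model (real-valued case; the dimensions and variable names are illustrative choices, not values from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 100, 800                    # signal dimension and number of measurements

x_true = rng.standard_normal(n)    # ground truth x^natural
A = rng.standard_normal((m, n))    # rows are i.i.d. Gaussian measurement vectors a_i
y = np.abs(A @ x_true)             # phaseless measurements y_i = |<a_i, x_true>|
```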
Nonconvex Procedure
$$\hat{x} = \operatorname*{argmin}_{x \in \mathbb{R}^n/\mathbb{C}^n} \; \frac{1}{m} \sum_{i=1}^m \ell(y_i; x)$$
- Initialize $x_0$ via spectral methods to land in the neighborhood of the ground truth;
- Iteratively update using simple methods such as gradient descent and alternating minimization; a sketch of this two-stage recipe follows below.

Figure credit: Yuxin Chen.
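As a rough sketch of the two-stage recipe, continuing the snippet above (assumptions: the intensity-weighted spectral matrix of Candès et al.'s WF and the norm rescaling are one common choice; exact constants vary by paper):

```python
# Stage 1: spectral initialization.
# Leading eigenvector of Y = (1/m) sum_i y_i^2 a_i a_i^T, rescaled by the
# estimated signal norm (for Gaussian a_i, E[y_i^2] = ||x_true||_2^2).
Y = (A.T * y**2) @ A / m
_, eigvecs = np.linalg.eigh(Y)               # eigh returns eigenvalues in ascending order
x0 = np.sqrt(np.mean(y**2)) * eigvecs[:, -1]

# Stage 2: iterative refinement via gradient descent on a nonconvex loss;
# a concrete update (RWF) is sketched after the gradient-descent slide below.
```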
Quadratic Loss of Amplitudes

Wirtinger Flow (WF) employs the intensity-based loss [Candès et al.]:
$$\ell_{WF}(x) = \frac{1}{m} \sum_{i=1}^m \left( |\langle a_i, x^\natural \rangle|^2 - |\langle a_i, x \rangle|^2 \right)^2,$$
which is nonconvex and smooth.

Reshaped Wirtinger Flow (RWF): In contrast, we propose to minimize the quadratic loss of the amplitude measurements:
$$\ell(x) := \frac{1}{m} \left\| y - |Ax| \right\|_2^2 = \frac{1}{m} \sum_{i=1}^m \ell(y_i; x) = \frac{1}{m} \sum_{i=1}^m \left( |\langle a_i, x^\natural \rangle| - |\langle a_i, x \rangle| \right)^2,$$
which is nonconvex and nonsmooth.

Which one is better?
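For concreteness, both loss surfaces are one line each in NumPy (a sketch reusing the names from the measurement-model snippet above):

```python
def loss_wf(x, A, y):
    # intensity-based WF loss: mean of (y_i^2 - <a_i, x>^2)^2, nonconvex and smooth
    return np.mean((y**2 - (A @ x)**2) ** 2)

def loss_rwf(x, A, y):
    # amplitude-based RWF loss: mean of (y_i - |<a_i, x>|)^2, nonconvex and nonsmooth
    return np.mean((y - np.abs(A @ x)) ** 2)
```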
The Choice of Loss Function is Important
The quadratic loss of amplitudes enjoys better curvature in expectation!

Figure: Surface of the expected loss function of (a) least-squares (mirrored symmetrically), (b) the quadratic loss of amplitudes ℓ(x), and (c) the quadratic loss of intensities ℓ_WF(x), when x♮ = [1, −1]^⊤.

In fact, Error Reduction (ER), proposed by Gerchberg and Saxton in 1972, can be interpreted as alternating minimization using ℓ(x).
Gradient Descent
Reshaped Wirtinger Flow (RWF):
- The generalized gradient of ℓ(x) can be calculated as
  $$\nabla \ell(x) = \frac{1}{m} \sum_{i=1}^m \left( \langle a_i, x \rangle - y_i \cdot \mathrm{sign}(\langle a_i, x \rangle) \right) a_i.$$
- Start with an initialization $x_0$. At iteration $t = 0, 1, \ldots$,
  $$x_{t+1} = x_t - \mu \nabla \ell(x_t) = \left( I - \frac{\mu}{m} A^* A \right) x_t + \frac{\mu}{m} A^* \mathrm{diag}(y)\, \mathrm{sign}(A x_t),$$
  where $\mu$ is the step size.
- Stochastic versions are even faster.
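A minimal sketch of the RWF iterations (real-valued case, continuing the snippets above; the step size 0.8 and the iteration count are illustrative placeholders, not values quoted on the slide):

```python
def rwf_gradient(x, A, y):
    # generalized gradient: (1/m) sum_i (<a_i,x> - y_i * sign(<a_i,x>)) a_i
    Ax = A @ x
    return A.T @ (Ax - y * np.sign(Ax)) / len(y)

mu = 0.8                       # constant step size, mu = O(1)
x = x0.copy()                  # spectral initialization from the earlier sketch
for t in range(500):
    x = x - mu * rwf_gradient(x, A, y)

print(min(np.linalg.norm(x - x_true), np.linalg.norm(x + x_true)))  # error up to global sign
```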
Statistical Measurement Model
Strong performance guarantees are possible by leveraging statistical properties of the measurement ensemble.
- Gaussian measurement model: $a_i \sim \mathcal{N}(0, I)$ i.i.d. if real-valued; $a_i \sim \mathcal{CN}(0, I)$ i.i.d. if complex-valued.
- Distance measure (up to a global phase):
  $$\mathrm{dist}(x, z) = \min_{\phi \in [0, 2\pi)} \left\| x - e^{j\phi} z \right\|_2.$$
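The minimization over the global phase has a closed form: the optimal phase aligns z with x, i.e., $e^{j\phi} = \langle z, x \rangle / |\langle z, x \rangle|$ (a standard fact, not spelled out on the slide). A small sketch:

```python
def dist(x, z):
    # min over phi of ||x - e^{j phi} z||_2, for real or complex vectors
    inner = np.vdot(z, x)                         # <z, x> = z^* x
    phase = inner / abs(inner) if abs(inner) > 1e-12 else 1.0
    return np.linalg.norm(x - phase * z)
```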
Linear Convergence of RWF

Theorem (Zhang, Zhou, Liang, C., 2016)
Under i.i.d. Gaussian design, RWF achieves
$$\|x_t - x^\natural\|_2 \lesssim \left(1 - \frac{\mu}{2}\right)^t \|x^\natural\|_2 \quad \text{(linear convergence)}$$
provided that the step size $\mu = O(1)$ is small enough and the sample size $m \gtrsim n$.

        | loss function   | regularization | step size | sample size
  WF    | intensity-based | no             | O(1/n)    | O(n log n)
  RWF   | amplitude-based | no             | O(1)      | O(n)
  TWF   | intensity-based | truncation     | O(1)      | O(n)

WF can be improved by designing a better loss function or introducing proper regularization. But is it really that bad?

Zhang, Zhou, Liang and C., "Reshaped Wirtinger Flow and Incremental Algorithms for Solving Quadratic Systems of Equations", Journal of Machine Learning Research, to appear.
Another look at WF

- The local Hessian of the WF loss satisfies w.h.p. when $m = O(n \log n)$:
  $$\frac{1}{2} I \preceq \nabla^2 \ell_{WF}(x) \preceq n I.$$
- This implies a step size of $\mu = O(1/n)$, i.e., $O(n \log(1/\epsilon))$ iterations to reach $\epsilon$-accuracy.

Numerically, WF can run much more aggressively! ($\mu = 0.1$)
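For comparison with RWF above, a sketch of the plain WF update (real-valued case; constant factors are absorbed into the step size, and the $1/\|x_0\|_2^2$ normalization is a common convention in WF implementations, assumed here rather than stated on the slide):

```python
def wf_gradient(x, A, y):
    # gradient of the intensity loss, up to constants:
    # (1/m) sum_i ((a_i^T x)^2 - y_i^2) (a_i^T x) a_i
    Ax = A @ x
    return A.T @ ((Ax**2 - y**2) * Ax) / len(y)

mu = 0.1                       # the aggressive empirical step size from the slide
x = x0.copy()
for t in range(500):
    x = x - (mu / np.linalg.norm(x0)**2) * wf_gradient(x, A, y)
```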
Gradient descent theory

Which region enjoys both strong convexity and smoothness? The local Hessian of the WF loss is
$$\nabla^2 \ell_{WF}(x) = \frac{1}{m} \sum_{k=1}^m \left[ 3\,(a_k^\top x)^2 - (a_k^\top x^\natural)^2 \right] a_k a_k^\top.$$
- Not smooth if $x$ and $a_k$ are too close (coherent).
Both properties hold if:
- $x$ is not far away from $x^\natural$;
- $x$ is incoherent w.r.t. the sampling vectors (incoherence region), i.e.,
  $$\frac{|a_k^\top (x - x^\natural)|}{\|x - x^\natural\|_2} \lesssim \sqrt{\log n} \quad \text{for all } k.$$
A second look at gradient descent theory

(Figure: region of local strong convexity + smoothness.)
- Prior theory only ensures that the iterates remain in an ℓ2 ball, but not in the incoherence region.
- Prior theory enforces regularization to promote incoherence.
WF is implicitly regularized

(Figure: region of local strong convexity + smoothness.)
WF implicitly forces the iterates to remain incoherent.
Theoretical guarantees

Theorem (Ma, Wang, C., Chen, 2017)
Under i.i.d. Gaussian design, WF achieves
$$\max_k \left| a_k^\top (x_t - x^\natural) \right| \lesssim \sqrt{\log n}\, \|x^\natural\|_2 \quad \text{(incoherence)}$$
$$\|x_t - x^\natural\|_2 \lesssim \left(1 - \frac{\mu}{2}\right)^t \|x^\natural\|_2 \quad \text{(near-linear convergence)}$$
provided that the step size $\mu \asymp \frac{1}{\log n}$ and the sample size $m \gtrsim n \log n$.

  WF    | step size    | sample size
  Prior | O(1/n)       | O(n log n)
  Ours  | O(1/ log n)  | O(n log n)

- The step size is O(1) if $m = O(n \log^2 n)$.
- This theory of "implicit regularization" is much more general and can be extended to analyze matrix completion and blind deconvolution.

Ma, Wang, C. and Chen, "Implicit Regularization in Nonconvex Statistical Estimation: Gradient Descent Converges Linearly for Phase Retrieval, Matrix Completion and Blind Deconvolution", arXiv:1711.10467.
Key ingredient: leave-one-out analysis

For each $1 \le l \le m$, introduce leave-one-out iterates $x^{t,(l)}$ by dropping the $l$th measurement.

- Leave-one-out iterates $x^{t,(l)}$ are independent of $a_l$, and are hence incoherent w.r.t. $a_l$ with high probability.
- Leave-one-out iterates $x^{t,(l)}$ and the true iterates $x_t$ are very close.
Robust Phase Retrieval with Outliers

Assume the measurements are corrupted by both sparse outliers and bounded noise:
$$y_i = |\langle a_i, x^\natural \rangle| + \eta_i + w_i, \quad i = 1, \ldots, m,$$
where $\|\eta\|_0 \le s \cdot m$ is the sparse outlier vector with arbitrary values, $0 \le s < 1$.
- Goal: develop algorithms that are oblivious to outliers, and statistically and computationally efficient.
- Existing approaches are not robust, since they rely on the sample mean:
  - Spectral initialization: the leading eigenvector of
    $$Y = \frac{1}{m} \sum_{i=1}^m y_i a_i a_i^*$$
    can be arbitrarily perturbed;
  - Gradient descent: the search direction of
    $$x_{t+1} = x_t - \frac{\mu}{m} \sum_{i=1}^m \nabla \ell(y_i; x_t)$$
    can be arbitrarily perturbed.
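To illustrate the corruption model (a sketch; the 10% fraction and the outlier magnitudes are arbitrary illustrative choices, continuing the snippets above):

```python
s = 0.1                                       # fraction of outliers
outlier_idx = rng.choice(m, size=int(s * m), replace=False)

y_corrupt = y.copy()
y_corrupt[outlier_idx] += rng.uniform(0, 10, size=len(outlier_idx))  # arbitrary sparse outliers

# A few large y_i can arbitrarily perturb the sample mean (1/m) sum_i y_i a_i a_i^T,
# which is why mean-based spectral initialization and gradients break down.
```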
Median-Truncated Wirtinger Flow

Key approach, "median-truncation": rule out measurements adaptively in each iteration, based on how much the sample gradient/value deviates from the median.
- Median-truncated spectral initialization: the leading eigenvector of
  $$Y = \frac{1}{m} \sum_{i=1}^m y_i a_i a_i^* \, \mathbb{1}\{|y_i| \lesssim \mathrm{median}(\{y_i\})\}.$$
- Median-truncated gradient descent:
  $$x_{t+1} = x_t - \frac{\mu}{m} \sum_{i \in \mathcal{T}_t} \nabla \ell(y_i; x_t),$$
  where the set $\mathcal{T}_t$ contains the samples whose residuals do not deviate too much from the sample median:
  $$\mathcal{T}_t = \left\{ i : r_i^{(t)} \lesssim \mathrm{median}(\{r_i^{(t)}\}) \right\}, \quad \text{where } r_i^{(t)} = \left| y_i - |a_i^* x_t| \right|.$$
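A sketch of one median-truncated gradient step, instantiated with the amplitude-based gradient used earlier (assumption: the truncation rule is simplified to "keep residuals within a constant multiple of the median", with the constant 3 chosen arbitrarily for illustration):

```python
def median_truncated_step(x, A, y, mu, c=3.0):
    # keep only samples whose residual r_i = |y_i - |<a_i, x>|| is within
    # c times the sample median, then take a gradient step over that set T_t
    Ax = A @ x
    r = np.abs(y - np.abs(Ax))
    keep = r <= c * np.median(r)
    g = A[keep].T @ (Ax[keep] - y[keep] * np.sign(Ax[keep])) / len(y)
    return x - mu * g
```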
Performance guarantees

Theorem (Zhang, C. and Liang, 2016)
Under i.i.d. Gaussian design, median-TWF achieves
$$\|x_t - x^\natural\|_2 \lesssim \left(1 - \frac{\mu}{2}\right)^t \|x^\natural\|_2 \quad \text{(linear convergence)}$$
provided that the step size $\mu = O(1)$ is small enough, the sample size $m = O(n \log n)$, and the fraction of corruption $s = O(1)$.

Figure: Success rates of TWF, trimean-TWF, median-TWF and median-RWF versus the outlier fraction s, for outlier magnitudes (c) ‖η‖∞ = 0.1‖x‖² and (d) ‖η‖∞ = ‖x‖².

Similar strategies can be used to robustify other low-rank estimation problems.
Conclusions
- Simple iterative gradient descent methods are demonstrated to perform remarkably well, given a good initialization, for phase retrieval with Gaussian measurements.
- All of the results can be extended to low-rank models, where
  $$y_i = \|a_i^* U\|_2^2 = a_i^* X a_i, \quad U \in \mathbb{R}^{n \times r}/\mathbb{C}^{n \times r} \text{ with rank } r \ll n \text{ and } X = UU^*$$
  (ongoing work).
- "Implicit regularization" for other algorithms, such as alternating minimization and SGD, also warrants further study.
References
1. H. Zhang, Y. Zhou, Y. Liang and Y. Chi, "Reshaped Wirtinger Flow and Incremental Algorithms for Solving Quadratic Systems of Equations", Journal of Machine Learning Research, 2017, to appear.
2. C. Ma, K. Wang, Y. Chi and Y. Chen, "Implicit Regularization in Nonconvex Statistical Estimation", arXiv:1711.10467.
3. H. Zhang, Y. Chi and Y. Liang, "Median-Truncated Nonconvex Approach for Phase Retrieval with Outliers", submitted to IEEE Transactions on Information Theory, 2017. Short version appeared at ICML 2016.

Thank you!