Modern Optimization Meets Physics: Recent Progress on Phase Retrieval
Yuxin Chen (on behalf of Emmanuel Candès)
CSA 2015, Technical University Berlin, Dec. 2015
Missing phase problem

Detectors record intensities of diffracted rays:
electric field $x(t_1, t_2)$ → Fraunhofer diffraction (Fourier transform) → $\hat{x}(f_1, f_2)$

Fig credit: Stanford SLAC

Intensity of the electric field:
$$\big|\hat{x}(f_1, f_2)\big|^2 = \Big|\int\!\!\int x(t_1, t_2)\, e^{-i2\pi(f_1 t_1 + f_2 t_2)}\, \mathrm{d}t_1\, \mathrm{d}t_2\Big|^2$$

Phase retrieval: recover the signal $x(t_1, t_2)$ from the intensity measurements $|\hat{x}(f_1, f_2)|^2$.
Origin in X-ray crystallography
Method for determining atomic structure within a crystal
Survey: Shechtman, Eldar, Cohen, Chapman, Miao and Segev (’15)
Phase retrieval in X-ray crystallography
- Knowledge of the phase is crucial to build the electron density map
- Algorithmic means of recovering the phase structure without sophisticated setups: Sayre ('52), Fienup ('78)
- Initial success in certain cases by using very specific prior knowledge
- H. Hauptman and J. Karle: 1985 Nobel Prize
Phase retrieval is a feasibility problem
(diagram: a signal $x$ passes through the sensing matrix $A$; only the entrywise squared magnitudes are observed, e.g. $Ax = (1, -3, 2, -1, 4, 2, -2, -1, 3, 4)^\top$ yields $y = |Ax|^2 = (1, 9, 4, 1, 16, 4, 4, 1, 9, 16)^\top$)

Solve for $x \in \mathbb{C}^n$ from $m$ quadratic equations:
$$y_k \approx |\langle a_k, x \rangle|^2, \qquad k = 1, \ldots, m$$
Equivalently, $y \approx |Ax|^2$, where $|z|^2 := [|z_1|^2, \cdots, |z_m|^2]^\top$.
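As a concrete reference point, here is a minimal numpy sketch of this measurement model (the sizes and the real-valued design are illustrative choices, not from the talk):

```python
import numpy as np

# Draw a random Gaussian design and record intensity-only measurements
# y = |Ax|^2, the data model considered throughout this talk.
rng = np.random.default_rng(0)
n, m = 100, 600                    # n unknowns, m = 6n quadratic equations
x = rng.standard_normal(n)         # ground-truth signal (real case for simplicity)
A = rng.standard_normal((m, n))    # rows are the sampling vectors a_k
y = np.abs(A @ x) ** 2             # y_k = |<a_k, x>|^2 -- the phases are lost
```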
An equivalent view: low-rank factorization

Introduce $X = xx^*$ to linearize the constraints:
$$y_k \approx |a_k^* x|^2 = a_k^* (xx^*) a_k \;\Longrightarrow\; y_k \approx a_k^* X a_k$$

find $X$ s.t. $y_k \approx a_k^* X a_k$, $k = 1, \cdots, m$, $\operatorname{rank}(X) = 1$

Solving quadratic systems is essentially low-rank matrix completion.
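A quick numerical check of the lifting step, on made-up data: the measurements are quadratic in $x$ but linear in $X = xx^*$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 20, 120
x = rng.standard_normal(n)
A = rng.standard_normal((m, n))
y = (A @ x) ** 2                              # quadratic in x

X = np.outer(x, x)                            # lifted variable, rank one
y_lin = np.einsum('ki,ij,kj->k', A, X, A)     # a_k^T X a_k for every k
print(np.allclose(y_lin, y))                  # True: same data, now linear in X
```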
Solving quadratic systems is NP-complete in general ...
“I can’t find an efficient algorithm, but neither can all these people.”
Fig credit: coding horror
NP-complete stone problem

Given weights $w_i \in \mathbb{R}$, $i = 1, \ldots, n$, is there an assignment $x_i = \pm 1$ s.t. $\sum_{i=1}^n w_i x_i = 0$?

Formulation as a quadratic system:
$$|x_i|^2 = 1, \quad i = 1, \ldots, n, \qquad \Big|\sum\nolimits_i w_i x_i\Big|^2 = 0$$

Many combinatorial problems can be reduced to solving quadratic systems.
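A toy illustration of the reduction (the weights below are made up, and the brute-force search is there only to exhibit the equivalence, not as an algorithm):

```python
import numpy as np
from itertools import product

# Stone problem: find signs x_i = +-1 with sum_i w_i x_i = 0, i.e. a solution
# of the quadratic system |x_i|^2 = 1 for all i and |sum_i w_i x_i|^2 = 0.
w = np.array([3.0, 1.0, 4.0, 2.0, 2.0])
solutions = [s for s in product((-1, 1), repeat=len(w))
             if np.isclose(np.dot(w, s), 0.0)]
print(solutions)    # e.g. (1, 1, -1, 1, -1): 3 + 1 - 4 + 2 - 2 = 0
```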
This talk: random quadratic systems are solvable in linear time!
A first impulse: maximum likelihood estimate

Assume (pretend) a noise model, and minimize the negative log-likelihood:
$$\text{minimize}_z \quad \ell(z) = \sum_{k=1}^m \ell_k(z)$$

Gaussian data: $y_k \sim |a_k^* x|^2 + \mathcal{N}(0, \sigma^2)$, giving $\ell_k(z) = \big(y_k - |a_k^* z|^2\big)^2$

Poisson data: $y_k \sim \text{Poisson}\big(|a_k^* x|^2\big)$, giving $\ell_k(z) = |a_k^* z|^2 - y_k \log |a_k^* z|^2$

Problem: $\ell$ is nonconvex, with many local stationary points.
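In code, both losses are one-liners; a sketch, up to additive constants and assuming the data model above:

```python
import numpy as np

def loss_gaussian(z, A, y):
    """Negative log-likelihood (up to constants) for y_k ~ |a_k^* z|^2 + N(0, s^2)."""
    return np.sum((y - np.abs(A @ z) ** 2) ** 2)

def loss_poisson(z, A, y):
    """Negative log-likelihood (up to constants) for y_k ~ Poisson(|a_k^* z|^2)."""
    v = np.abs(A @ z) ** 2
    return np.sum(v - y * np.log(v))
```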
Prior art: solving random quadratic systems

n: # unknowns; m: sample size (# eqns); $y = |Ax|^2$, $A \in \mathbb{R}^{m \times n}$

(chart: computational cost vs. sample complexity; legible entries include cvx relaxation, alt-min (fresh samples at each iter), the quantities $n$, $mn$, $mn^2$, and $n \log^3 n$, with some regimes marked infeasible)

This talk: solving random quadratic systems

(the same chart with "This talk" added to the comparison)
Nonconvex paradigm: optimize vector variables directly

$$\text{minimize}_z \quad \ell(z) = \sum_{k=1}^m \ell_k(z)$$

(figure: an initial guess $z_0$ landing in the basin of attraction around $x$, followed by iterates $z_1, z_2, \ldots$)

Start from an appropriate initial point, then proceed via some iterative updates¹.

¹Fienup, Candès et al., Netrapalli et al., Shechtman et al., Schniter et al., ...
Wirtinger flow: Candès, Li and Soltanolkotabi

Initialization via the spectral method: $z_0 \leftarrow$ leading eigenvector of $\frac{1}{m}\sum_{k=1}^m y_k a_k a_k^*$

Iterations: for $t = 0, 1, \ldots$
$$z_{t+1} = z_t - \frac{\mu_t}{m} \nabla \ell(z_t)$$

Make use of the Wirtinger derivative:
$$\ell(z) = \sum_k \big(y_k - |a_k^* z|^2\big)^2, \qquad \nabla \ell(z) := 4 \sum_k \big(|a_k^* z|^2 - y_k\big)(a_k a_k^*)\, z$$

Already a rich theory (see Soltanolkotabi's Ph.D. thesis).
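A minimal sketch of the two WF steps for the real-valued Gaussian model; the step-size ramp mimics the heuristic schedule $\mu_t = \min(1 - e^{-t/t_0},\, 0.2)$ reported in the WF paper, and the constants are illustrative:

```python
import numpy as np

def wirtinger_flow(A, y, n_iter=500):
    """Spectral initialization followed by (Wirtinger) gradient steps."""
    m, n = A.shape
    Y = (A.T * y) @ A / m                        # (1/m) sum_k y_k a_k a_k^T
    z = np.linalg.eigh(Y)[1][:, -1]              # leading eigenvector
    z = z * np.sqrt(y.mean())                    # scale: E[y_k] = ||x||^2 under Gaussian design
    z0_norm2 = np.linalg.norm(z) ** 2
    for t in range(n_iter):
        Az = A @ z
        grad = A.T @ ((Az ** 2 - y) * Az) / m    # gradient of (1/4m) sum (y_k - |a_k^T z|^2)^2
        mu = min(1.0 - np.exp(-(t + 1) / 330.0), 0.2)
        z = z - (mu / z0_norm2) * grad
    return z
```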
Interpretation of spectral initialization

The initialization is the leading eigenvector of $Y := \frac{1}{m}\sum_{k=1}^m y_k a_k a_k^*$.

Random Gaussian or coded diffraction models: $\mathbb{E}[Y] = \|x\|^2 I + 2xx^*$, with eigenvalues $(3, 1, 1, \ldots, 1) \cdot \|x\|^2$.
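The mean formula follows from elementary Gaussian moments; a short worked check (real case, the complex case is analogous):

```latex
% Take a ~ N(0, I_n) and, by rotation invariance, x = ||x|| e_1.
% With y = (a^T x)^2 = ||x||^2 a_1^2:
%   E[y a_1^2] = ||x||^2 E[a_1^4] = 3||x||^2,
%   E[y a_j^2] = ||x||^2 E[a_1^2] E[a_j^2] = ||x||^2   for j > 1,
% and the off-diagonal entries vanish by symmetry, so
\mathbb{E}\!\left[(a^{\top}x)^{2}\, a a^{\top}\right] = \|x\|^{2} I + 2\,x x^{\top},
% whose eigenvalues are (3, 1, \dots, 1)\cdot\|x\|^{2}, with x as the top eigenvector.
```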
Wirtinger flow as a stochastic gradient scheme

$$L(z) = \underbrace{z^* (I - xx^*)\, z}_{\text{penalizes orientation mismatch}} \;+\; \underbrace{\tfrac{3}{4}\big(\|z\|^2 - \|x\|^2\big)^2}_{\text{penalizes norm mismatch}}$$

Ideal gradient scheme: $z_{t+1} = z_t - \frac{\mu_t}{\|z_0\|^2} \nabla L(z_t)$. But we don't have access to $L(\cdot)$!

What we can compute: $\ell(z) = \frac{1}{4m}\sum_{k=1}^m \big(y_k - |a_k^* z|^2\big)^2$

For fixed $z$: if $z_t$ were independent of the sampling vectors (a false assumption), then
$$\mathbb{E}[z_{t+1}] = z_t - \frac{\mu_t}{\|z_0\|^2} \nabla L(z_t)$$

A stochastic optimization scheme for maximizing the population likelihood.
Computational complexity

Write $Ax := \{a_k^* x\}$.

Initialization: the leading eigenvector takes a few (e.g. 20) applications of $A$ and $A^*$, since
$$\frac{1}{m}\sum_{k \in T_0} y_k a_k a_k^* = \frac{1}{m} A^* \operatorname{diag}\{y_k\}\, A$$

Iterations: one application of $A$ and $A^*$ per iteration:
$$z_{t+1} = z_t - \frac{\mu_t}{m} \nabla \ell(z_t)$$

Approximate runtime: a couple hundred applications of $A$ and $A^*$. No SVDs, no matrix inversions.
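A matrix-free sketch of the initialization: only black-box applications of $A$ and $A^*$ are assumed (the operator arguments `apply_A` and `apply_At` are placeholder names, e.g. FFTs for coded diffraction patterns):

```python
import numpy as np

def spectral_init_power(apply_A, apply_At, y, n, n_power_iter=20, seed=0):
    """Leading eigenvector of (1/m) A^* diag{y_k} A via power iterations:
    one application of A and one of A^* per iteration, no SVD, no inversion."""
    m = len(y)
    z = np.random.default_rng(seed).standard_normal(n)
    z /= np.linalg.norm(z)
    for _ in range(n_power_iter):
        z = apply_At(y * apply_A(z)) / m    # apply Y = (1/m) A^* diag{y} A
        z /= np.linalg.norm(z)
    return z * np.sqrt(y.mean())            # rescale to the estimated signal energy

# Usage with an explicit matrix (matrix-free operators work the same way):
# z0 = spectral_init_power(lambda v: A @ v, lambda v: A.T @ v, y, n)
```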
Connection

(figure: a partially observed matrix, missing entries marked "?")

Nonconvex approach to matrix completion: Keshavan, Montanari and Oh ('09)
(1) Initial estimate by truncated SVD
(2) Refinement by minimizing a nonconvex functional
Performance guarantees of WF

Theorem (Candès, Li, Soltanolkotabi). Under i.i.d. Gaussian design, WF achieves
$$\operatorname{dist}(z_t, x) \lesssim \Big(1 - \frac{\mu}{4}\Big)^{t/2} \|x\|,$$
provided that $\mu \asymp \frac{1}{n}$ and the sample size obeys $m \gtrsim n \log n$.

Comput. cost: $mn^2 \log \frac{1}{\epsilon}$; sample complexity: $n \log n$.

Can we further improve the comput. cost and sample complexity?
Stage 1: improving initialization

Spectral initialization: $z_0 \leftarrow$ principal component of $Y := \frac{1}{m}\sum_k y_k a_k a_k^*$
$$\Longrightarrow \quad \mathbb{E}[Y] = \|x\|^2 I + 2xx^*$$

Would succeed if $x^* Y x \approx \max_{\|u\| = 1} f(u) := u^* Y u$.

(figure: the ideal shape of $f(u)$, peaked at $x$)
Discard high-leverage data

$Y \approx \frac{1}{m}\sum_k |a_k^* x|^2 a_k a_k^*$ is heavy-tailed.

But what really is the value of $f(u) := u^* Y u$ if $u \propto a_k$?

(plot: $f(a_k / \|a_k\|)$ vs. the sample index $k$ for $m = 6n$, compared against $f(x)$; most values follow a typical shape, while a few outliers stand far above it)

Problem: large outliers $y_k = (a_k^* x)^2$ bear too much influence.

Solution: throw away the large samples and compute the leading eigenvector of
$$\frac{1}{m}\sum_k y_k a_k a_k^* \cdot \mathbb{1}\big\{|y_k| \lesssim \operatorname{Avg}\{|y_l|\}\big\}$$
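A sketch of the truncated initialization under this rule; the threshold constant `alpha` is an illustrative placeholder rather than the paper's exact choice:

```python
import numpy as np

def truncated_spectral_init(A, y, alpha=3.0, n_power_iter=20, seed=0):
    """Regularized spectral initialization: drop samples with |y_k| far above
    the average, then take the leading eigenvector of the truncated data matrix."""
    m, n = A.shape
    keep = np.abs(y) <= alpha * np.mean(np.abs(y))   # 1{|y_k| <~ Avg{|y_l|}}
    Ak, yk = A[keep], y[keep]
    z = np.random.default_rng(seed).standard_normal(n)
    z /= np.linalg.norm(z)
    for _ in range(n_power_iter):
        z = Ak.T @ (yk * (Ak @ z)) / m               # still normalized by the full m
        z /= np.linalg.norm(z)
    return z * np.sqrt(y.mean())
```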
Importance of regularization

(plots: relative error of the spectral vs. truncated spectral method as the signal dimension $n$ grows; left: real Gaussian, $m = 6n$; right: complex CDP, $m = 12n$; the truncated spectral method attains a visibly smaller relative error)
Empirical performance (m = 12n)

(images: the ground truth $x$; the spectral initialization; the regularized spectral initialization)
Stage 2: improving search directions

WF: $z_{t+1} = z_t - \frac{\mu}{m}\sum_k \nabla \ell_k(z)$

(figure: the locus of $\{\nabla \ell_k(z)\}$ around $z$ and $x$)

Problem: the descent direction might have large variability.
Solution: variance reduction via trimming

A more adaptive rule: $z_{t+1} = z_t - \frac{\mu}{m}\sum_{k \in T} \nabla \ell_k(z)$

$T$ trims away excessively large gradient components:
$$T := \Big\{k : \|\nabla \ell_k(z)\| \lesssim \text{typical size of } \big\{\|\nabla \ell_l(z)\|\big\}_{1 \le l \le m}\Big\}$$

Slight bias + much reduced variance.
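A sketch of one such trimmed update; the median-based rule below is a simplified stand-in for the paper's exact truncation sets, but it conveys the variance-reduction idea:

```python
import numpy as np

def trimmed_gradient_step(z, A, y, mu=0.2, gamma=5.0):
    """One update that averages only per-sample gradients of typical size."""
    m = len(y)
    Az = A @ z
    r = Az ** 2 - y                                       # per-sample residuals
    gnorm = np.abs(r * Az) * np.linalg.norm(A, axis=1)    # ~ ||grad l_k(z)||
    T = gnorm <= gamma * np.median(gnorm)                 # trim outsized components
    return z - (mu / m) * (A[T].T @ (r * Az)[T])
```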
Larger step size µt is feasible

WF: $\mu_t = O(1/n)$. WF with trimming: $\mu_t = O(1)$.

With better-controlled descent directions, one can proceed more aggressively.
Summary: truncated Wirtinger flows

(1) Regularized spectral initialization: $z_0 \leftarrow$ principal component of $\frac{1}{m}\sum_{k \in T_0} y_k a_k a_k^*$

(2) Follow adaptive gradient descent: $z_{t+1} = z_t - \frac{\mu_t}{m}\sum_{k \in T_t} \nabla \ell_k(z)$

Adaptive and iteration-varying rules: discard the high-leverage data $\{y_k : k \notin T_t\}$. A sketch combining the two stages follows.
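Putting the two stages together, a compact end-to-end sketch (real-valued case; the truncation constants `alpha` and `gamma` and the constant step size are illustrative placeholders):

```python
import numpy as np

def twf(A, y, n_iter=150, mu=0.2, alpha=3.0, gamma=5.0):
    """Truncated Wirtinger flow: regularized spectral initialization,
    then trimmed gradient iterations with a constant step size."""
    m, n = A.shape
    # Stage 1: regularized spectral initialization
    keep = np.abs(y) <= alpha * np.mean(np.abs(y))
    Y = (A[keep].T * y[keep]) @ A[keep] / m
    z = np.linalg.eigh(Y)[1][:, -1] * np.sqrt(y.mean())
    # Stage 2: adaptive (trimmed) gradient descent
    row_norms = np.linalg.norm(A, axis=1)
    for _ in range(n_iter):
        Az = A @ z
        r = Az ** 2 - y
        gnorm = np.abs(r * Az) * row_norms       # per-sample gradient size
        T = gnorm <= gamma * np.median(gnorm)    # iteration-varying trimming set
        z = z - (mu / m) * (A[T].T @ (r * Az)[T])
    return z
```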
Theoretical guarantees (noiseless data)

Theorem (Chen, Candès). When $a_k \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, I)$ and the sample size obeys $m \gtrsim n$:
1. TWF starts within the basin of attraction;
2. TWF enjoys linear convergence.

(figures: the initial guess $z_0$ lands in the basin of attraction around $x$; the relative error, on a log scale, decays linearly over the first 60 iterations)
Numerical surprise

CG: solve $y = Ax$. TWF: solve $y = |Ax|^2$.

(plot: relative error on a log scale vs. iteration count, for least squares (CG) and the proposed algorithm)

For random $A$ ($m = 8n$): comput. cost of TWF ≈ 4 × comput. cost of least squares (CG).
Empirical success rate (noiseless data)

(plot: empirical success rate vs. sample size $m = 2n, \ldots, 6n$ for TWF and WF; $n = 1000$, # iters = 1000)
Empirical performance

(images: the reconstruction after regularized spectral initialization, and after 50 gradient iterations)
Key convergence condition for gradient stage

$\forall z$ s.t. $\operatorname{dist}(z, x) \le \varepsilon \|x\|$, the search direction $p$ obeys
$$\langle p, \, x - z \rangle \gtrsim \|z - x\|^2 + \|p\|^2$$

(figure: iterates $z_0, z_1, z_2, z_3$ moving toward $x$ along $-\nabla f(z)$)

When specialized to gradient descent (i.e. $p = -\nabla f(z)$), this says that the negative gradient points in the direction of the minimizer $x$. It does not say that $f$ is convex!
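A short calculation shows why this condition forces linear convergence; the constants $\alpha, \beta$ below stand in for those hidden in the $\gtrsim$:

```latex
% Suppose <p, x - z> >= \alpha ||z - x||^2 + \beta ||p||^2 and take one step
% z^+ = z + \mu p (for gradient descent, p = -\nabla f(z)). Then
\|z^{+}-x\|^{2} = \|z-x\|^{2} - 2\mu\,\langle p,\, x-z\rangle + \mu^{2}\|p\|^{2}
  \le (1-2\mu\alpha)\,\|z-x\|^{2} - \mu\,(2\beta-\mu)\,\|p\|^{2}
  \le (1-2\mu\alpha)\,\|z-x\|^{2}
% whenever 0 < \mu \le 2\beta: each step contracts dist(z, x) by a fixed
% factor, without any convexity of f.
```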
Lack of local convexity

Consider a real univariate function $f$ with minimizer at $z = x$ obeying $f'(z) \cdot (z - x) \gtrsim |z - x|^2 + |f'(z)|^2$.

(figure: such an $f(z)$, nonconvex in the neighborhood of $x$)

The condition says that the unique stationary point in the neighborhood is $x$.
Step size µ

(figure: the update $\mu p$ compared with the error $z - x$ for step sizes that are too small, well-matched, and too large)

To ensure sufficient progress at each iteration, pick the step size s.t. $\mu \|p\| \asymp \|z - x\|$.

For TWF, $\|p\| = \big\|\frac{1}{m}\nabla \ell_{\mathrm{tr}}(z)\big\| \asymp \|z - x\|$, and hence we pick $\mu \asymp 1$.
Stability under noisy data

Noisy data: $y_k = |a_k^* x|^2 + \eta_k$

Signal-to-noise ratio: $\mathrm{SNR} := \frac{\sum_k |a_k^* x|^4}{\sum_k \eta_k^2} \approx \frac{3m\|x\|^4}{\|\eta\|^2}$ under i.i.d. Gaussian design

Theorem (Soltanolkotabi). WF converges to the MLE.
Theorem (Chen, Candès). The relative error of TWF converges to $O\big(\tfrac{1}{\sqrt{\mathrm{SNR}}}\big)$.
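A one-line moment computation explains the factor $3m\|x\|^4$ (Gaussian design, real case):

```latex
% For a_k ~ N(0, I_n), a_k^T x ~ N(0, ||x||^2), and a centered Gaussian has
% fourth moment 3\sigma^4, so
\mathbb{E}\,|a_k^{\top}x|^{4} = 3\|x\|^{4}
\quad\Longrightarrow\quad
\sum_{k=1}^{m}|a_k^{\top}x|^{4} \approx 3m\|x\|^{4}.
```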
Relative MSE vs. SNR (Poisson data)

(plot: relative MSE (dB) vs. SNR (dB) for $m = 6n, 8n, 10n$ with $n = 1000$; the curves are straight lines of slope ≈ −1)

Empirical evidence: the relative MSE scales inversely with the SNR.
Comparison with genie-aided MLE (with sign info. revealed) yk ∼ Poisson( |a∗
kx|2 )
and εk = sign (a∗
kx)
(revealed by a genie)
This accuracy is nearly un-improvable (empirically)
Comparison with genie-aided MLE (with sign info. revealed) yk ∼ Poisson( |a∗
kx|2 )
and εk = sign (a∗
kx)
(revealed by a genie)
SNR (dB) (n =100)
15 20 25 30 35 40 45 50 55
Relative MSE (dB)
- 65
- 60
- 55
- 50
- 45
- 40
- 35
- 30
- 25
- 20
truncated WF genie-aided MLE
little empirical loss due to missing signs
This accuracy is nearly un-improvable (theoretically)

Poisson data: $y_k \overset{\text{ind.}}{\sim} \text{Poisson}\big(|a_k^* x|^2\big)$

Signal-to-noise ratio: $\mathrm{SNR} \approx \frac{\sum_k |a_k^* x|^4}{\sum_k \operatorname{Var}(y_k)} \approx 3\|x\|^2$

Theorem (Chen, Candès). Under i.i.d. Gaussian design, for any estimator $\hat{x}$,
$$\inf_{\hat{x}} \; \sup_{x:\, \|x\| \ge \log^{1.5} m} \; \mathbb{E}\big[\operatorname{dist}(\hat{x}, x) \,\big|\, \{a_k\}\big] \,\big/\, \|x\| \;\gtrsim\; \frac{1}{\sqrt{\mathrm{SNR}}},$$
provided that the sample size obeys $m \asymp n$.
Alternating minimization: Netrapalli, Jain, Sanghavi ('13)

Provable guarantees for a (resampling) variant of a non-convex iterative algorithm.

(diagram: starting from an initial estimate, alternate between obeying the spatial constraints (support, real-valuedness, nonnegativity) and obeying the frequency constraints (the magnitude measurements))

Given the current guess $z_t$, ask for fresh data $y^{(t+1)} = |A^{(t+1)} x|^2$ and update:
$$\hat{b}^{(t+1)} = \sqrt{y^{(t+1)}} \odot \operatorname{phase}\big(A^{(t+1)} z_t\big), \qquad z^{(t+1)} = \arg\min_z \big\|\hat{b}^{(t+1)} - A^{(t+1)} z\big\|$$
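A sketch of one such update in numpy (complex case; the resampling used in the analysis, a fresh $A^{(t+1)}$ at every step, is left to the caller):

```python
import numpy as np

def alt_min_phase_step(z, A, y):
    """One alternating-minimization update: impose the measured magnitudes
    sqrt(y) with the phases of the current guess, then solve least squares."""
    Az = A @ z
    b = np.sqrt(y) * np.exp(1j * np.angle(Az))      # b^ = sqrt(y) ⊙ phase(A z)
    z_new, *_ = np.linalg.lstsq(A, b, rcond=None)   # z <- argmin ||b^ - A z||
    return z_new
```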