SLIDE 1 The Power of Nonconvex Optimization in Solving Random Quadratic Systems of Equations
Yuxin Chen (Princeton), Emmanuel Candès (Stanford)
Communications on Pure and Applied Mathematics, vol. 70, no. 5, pp. 822-883, May 2017
SLIDE 2 Agenda
- 1. The power of nonconvex optimization in solving random quadratic systems of
equations (Aug. 28)
- 2. Random initialization and implicit regularization in nonconvex statistical
estimation (Aug. 29)
- 3. The projected power method: an efficient nonconvex algorithm for joint
discrete assignment from pairwise data (Sep. 3)
- 4. Spectral methods meet asymmetry: two recent stories (Sep. 4)
- 5. Inference and uncertainty quantification for noisy matrix completion (Sep. 5)
SLIDE 3
- (high-dimensional) statistics
- nonconvex optimization
SLIDE 4 Nonconvex problems are everywhere
Maximum likelihood estimation is usually nonconvex:
maximize_x  ℓ(x; data)  → may be nonconcave
subj. to  x ∈ S  → may be nonconvex
SLIDE 5 Nonconvex problems are everywhere
Maximum likelihood estimation is usually nonconvex:
maximize_x  ℓ(x; data)  → may be nonconcave
subj. to  x ∈ S  → may be nonconvex
- low-rank matrix completion
- robust principal component analysis
- graph clustering
- dictionary learning
- blind deconvolution
- learning neural nets
- ...
SLIDE 6
Nonconvex optimization may be super scary
There may be bumps everywhere and exponentially many local optima, e.g. 1-layer neural nets (Auer, Herbster, Warmuth ’96; Vu ’98)
SLIDE 7 Example: solving quadratic programs is hard
Finding the maximum cut in a graph:
maximize_x  x^⊤ W x
subj. to  x_i^2 = 1,  i = 1, · · · , n
SLIDE 8
Example: solving quadratic programs is hard
Fig credit: coding horror
SLIDE 9 One strategy: convex relaxation
Can relax into convex problems by
- finding convex surrogates (e.g. compressed sensing, matrix completion)
- lifting into higher dimensions (e.g. Max-Cut)
SLIDE 10 Example of convex surrogate: low-rank matrix completion
— Fazel ’02, Recht, Parrilo, Fazel ’10, Candès, Recht ’09
minimize_M  rank(M)   subj. to data constraints
cvx surrogate:  minimize_M  ‖M‖_nuc   subj. to data constraints
SLIDE 11 Example of convex surrogate: low-rank matrix completion
— Fazel ’02, Recht, Parrilo, Fazel ’10, Candès, Recht ’09
minimize_M  rank(M)   subj. to data constraints
cvx surrogate:  minimize_M  ‖M‖_nuc   subj. to data constraints
Robust variation used every day by Netflix
SLIDE 12 Example of convex surrogate: low-rank matrix completion
— Fazel ’02, Recht, Parrilo, Fazel ’10, Candès, Recht ’09
minimize_M  rank(M)   subj. to data constraints
cvx surrogate:  minimize_M  ‖M‖_nuc   subj. to data constraints
Robust variation used every day by Netflix
Problem: operates in the full matrix space even though X is low-rank
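Not part of the original slides: a small cvxpy sketch of the nuclear-norm surrogate under sampled-entry constraints, to make the relaxation concrete. The matrix size, rank, and sampling pattern are arbitrary illustrative choices; note that the decision variable M lives in the full n × n space, which is exactly the dimensionality issue flagged above.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, r = 20, 2
M_true = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))  # rank-r ground truth
mask = (rng.random((n, n)) < 0.5).astype(float)                      # observed-entry pattern

M = cp.Variable((n, n))
objective = cp.Minimize(cp.normNuc(M))                 # nuclear norm as convex surrogate for rank
constraints = [cp.multiply(mask, M) == mask * M_true]  # agree with the data on observed entries
cp.Problem(objective, constraints).solve()
print(np.linalg.matrix_rank(M.value, tol=1e-4))        # typically low-rank when enough entries are seen
```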
SLIDE 13 Example of lifting: Max-Cut
— Goemans, Williamson ’95
maximize_x  x^⊤ W x
subj. to  x_i^2 = 1,  i = 1, · · · , n
SLIDE 14 Example of lifting: Max-Cut
— Goemans, Williamson ’95
maximize_x  x^⊤ W x   subj. to  x_i^2 = 1, i = 1, · · · , n
Let X = xx^⊤:
maximize_X  ⟨X, W⟩
subj. to  X_{i,i} = 1, i = 1, · · · , n;  X ⪰ 0;  rank(X) = 1
SLIDE 15 Example of lifting: Max-Cut
— Goemans, Williamson ’95
maximize_x  x^⊤ W x   subj. to  x_i^2 = 1, i = 1, · · · , n
Let X = xx^⊤:
maximize_X  ⟨X, W⟩
subj. to  X_{i,i} = 1, i = 1, · · · , n;  X ⪰ 0;  rank(X) = 1
SLIDE 16 Example of lifting: Max-Cut
— Goemans, Williamson ’95
maximize_x  x^⊤ W x   subj. to  x_i^2 = 1, i = 1, · · · , n
Let X = xx^⊤:
maximize_X  ⟨X, W⟩
subj. to  X_{i,i} = 1, i = 1, · · · , n;  X ⪰ 0;  rank(X) = 1
Problem: explosion in dimensions (R^n → R^{n×n})
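For concreteness (not from the slides): dropping the rank-1 constraint in the lifted program gives the Goemans-Williamson SDP relaxation, which can be written in a few lines of cvxpy. The weight matrix below is a random illustrative example.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n = 8
W = rng.random((n, n))
W = (W + W.T) / 2                                  # symmetric edge-weight matrix (illustrative)

X = cp.Variable((n, n), symmetric=True)            # lifted variable X = xx^T lives in R^{n x n}
constraints = [X >> 0, cp.diag(X) == 1]            # X PSD with unit diagonal; rank(X) = 1 is dropped
cp.Problem(cp.Maximize(cp.trace(W @ X)), constraints).solve()
print(X.value.shape)                               # (n, n): the dimension blow-up noted above
```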
SLIDE 17
How about optimizing nonconvex problems directly without lifting?
SLIDE 18
A case study: solving random quadratic systems of equations
SLIDE 19 Solving quadratic systems of equations
(diagram: x → Ax → y = |Ax|^2, illustrated with a small numerical example)
Solve for x ∈ C^n from m quadratic equations  y_k ≈ |⟨a_k, x⟩|^2,  k = 1, . . . , m
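A minimal numpy sketch (real-valued case, illustrative sizes; not from the slides) of how such a random quadratic system is generated under an i.i.d. Gaussian design:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 100, 800                     # number of unknowns and number of equations (m = 8n)
x = rng.standard_normal(n)          # ground-truth signal
A = rng.standard_normal((m, n))     # i.i.d. Gaussian design; row k is a_k^T
y = (A @ x) ** 2                    # quadratic measurements y_k = |<a_k, x>|^2
```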
SLIDE 20 Motivation: a missing phase problem in imaging science
Detectors record intensities of diffracted rays
→ Fourier transform x̂(f_1, f_2)
intensity of electrical field:  |x̂(f_1, f_2)|^2 = |∫∫ x(t_1, t_2) e^{−i2π(f_1 t_1 + f_2 t_2)} dt_1 dt_2|^2
Phase retrieval: recover true signal x(t1, t2) from intensity measurements
SLIDE 21 Motivation: latent variable models
Example: mixture of regression
y ≈ ⟨x, β⟩   or   y ≈ ⟨x, −β⟩
- Samples {(y_k, x_k)}: drawn from one of two unknown regressors β and −β
y_k ≈ ⟨x_k, β⟩ with prob. 0.5;  y_k ≈ ⟨x_k, −β⟩, else   (labels: latent variables)
SLIDE 22 Motivation: latent variable models
Example: mixture of regression
y ≈ ⟨x, β⟩   or   y ≈ ⟨x, −β⟩
- Samples {(y_k, x_k)}: drawn from one of two unknown regressors β and −β
y_k ≈ ⟨x_k, β⟩ with prob. 0.5;  y_k ≈ ⟨x_k, −β⟩, else   (labels: latent variables)
— equivalent to observing y_k^2 ≈ |⟨x_k, β⟩|^2
SLIDE 23 Motivation: learning neural nets with quadratic activation
— Soltanolkotabi, Javanmard, Lee ’17, Li, Ma, Zhang ’17
(diagram: input layer a → hidden layer with activations σ → output layer y)
input features: a;  weights: X = [x_1, · · · , x_r]
y = Σ_{i=1}^r σ(a^⊤ x_i) = Σ_{i=1}^r (a^⊤ x_i)^2,  where σ(z) = z^2
SLIDE 24 An equivalent view: low-rank factorization
Lifting: introduce X = xx^∗ to linearize constraints
y_k = |a_k^∗ x|^2 = a_k^∗ (xx^∗) a_k  ⟹  y_k = a_k^∗ X a_k
SLIDE 25 An equivalent view: low-rank factorization
Lifting: introduce X = xx^∗ to linearize constraints
y_k = |a_k^∗ x|^2 = a_k^∗ (xx^∗) a_k  ⟹  y_k = a_k^∗ X a_k
find  X ⪰ 0
s.t.  y_k = a_k^∗ X a_k,  k = 1, · · · , m
      rank(X) = 1
SLIDE 26 An equivalent view: low-rank factorization
Lifting: introduce X = xx^∗ to linearize constraints
y_k = |a_k^∗ x|^2 = a_k^∗ (xx^∗) a_k  ⟹  y_k = a_k^∗ X a_k
find  X ⪰ 0
s.t.  y_k = a_k^∗ X a_k,  k = 1, · · · , m
      rank(X) = 1
SLIDE 27 An equivalent view: low-rank factorization
Lifting: introduce X = xx^∗ to linearize constraints
y_k = |a_k^∗ x|^2 = a_k^∗ (xx^∗) a_k  ⟹  y_k = a_k^∗ X a_k
find  X ⪰ 0
s.t.  y_k = a_k^∗ X a_k,  k = 1, · · · , m
      rank(X) = 1
Works well if {a_k} are random, but huge increase in dimensions
SLIDE 28 Prior art (before our work)
n: # unknowns; m: sample size (# eqns); y = |Ax|^2, A ∈ R^{m×n}
(chart: sample complexity vs. computational cost, with axis ticks n, n log n, n log^3 n and mn, mn^2; methods plotted: convex relaxation, Wirtinger flow, and alt-min (fresh samples at each iter); infeasible regions marked)
SLIDE 33 A glimpse of our results
n: # unknowns; m: sample size (# eqns); y = |Ax|^2, A ∈ R^{m×n}
(chart: same axes as above, with our algorithm added alongside convex relaxation, Wirtinger flow, and alt-min)
This work: random quadratic systems are solvable in linear time, with minimal sample size and optimal statistical accuracy!
SLIDE 35 A first impulse: maximum likelihood estimate
maximize_z  ℓ(z) = (1/m) Σ_{k=1}^m ℓ_k(z)
SLIDE 36 A first impulse: maximum likelihood estimate
maximize_z  ℓ(z) = (1/m) Σ_{k=1}^m ℓ_k(z)
y_k ∼ |a_k^∗ x|^2 + N(0, σ^2):   ℓ_k(z) = −(y_k − |a_k^∗ z|^2)^2 / 2
SLIDE 37 A first impulse: maximum likelihood estimate
maximize_z  ℓ(z) = (1/m) Σ_{k=1}^m ℓ_k(z)
y_k ∼ |a_k^∗ x|^2 + N(0, σ^2):   ℓ_k(z) = −(y_k − |a_k^∗ z|^2)^2 / 2
y_k ∼ Poisson(|a_k^∗ x|^2):   ℓ_k(z) = −|a_k^∗ z|^2 + y_k log |a_k^∗ z|^2
SLIDE 38 A first impulse: maximum likelihood estimate
maximize_z  ℓ(z) = (1/m) Σ_{k=1}^m ℓ_k(z)
y_k ∼ |a_k^∗ x|^2 + N(0, σ^2):   ℓ_k(z) = −(y_k − |a_k^∗ z|^2)^2 / 2
y_k ∼ Poisson(|a_k^∗ x|^2):   ℓ_k(z) = −|a_k^∗ z|^2 + y_k log |a_k^∗ z|^2
Problem: −ℓ nonconvex, many local stationary points
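For reference, a short numpy sketch (real-valued case, names illustrative; not from the slides) of the average Poisson log-likelihood written above; the small eps only guards the logarithm at points where a_k^T z = 0.

```python
import numpy as np

def poisson_loglik(z, A, y, eps=1e-12):
    """(1/m) * sum_k [ -|a_k^T z|^2 + y_k * log|a_k^T z|^2 ]."""
    vals = (A @ z) ** 2          # |a_k^T z|^2 for all k
    return np.mean(-vals + y * np.log(vals + eps))
```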
SLIDE 39 Wirtinger flow: Candès, Li, Soltanolkotabi ’14
- Spectral initialization: z^0 ← leading eigenvector of (1/m) Σ_{k=1}^m y_k a_k a_k^∗
- Gradient iterations: for t = 0, 1, . . . :  z^{t+1} = z^t + µ_t ∇ℓ(z^t)
Already a rich theory (see also Soltanolkotabi ’14, Ma, Wang, Chi, Chen ’17)
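A compact real-valued sketch of the two WF stages, assuming the quadratic loss f(z) = (1/4m) Σ_k (|a_k^T z|^2 − y_k)^2 so that ascent on ℓ becomes descent on f; the step size, iteration counts, and power-method initialization below are illustrative choices, not the constants from the paper.

```python
import numpy as np

def wirtinger_flow(A, y, iters=200, mu=0.1, power_iters=50, seed=0):
    m, n = A.shape
    # spectral initialization: leading eigenvector of (1/m) sum_k y_k a_k a_k^T via power method
    z = np.random.default_rng(seed).standard_normal(n)
    for _ in range(power_iters):
        z = A.T @ (y * (A @ z)) / m
        z /= np.linalg.norm(z)
    z *= np.sqrt(np.mean(y))              # rescale to the estimated signal strength ||x||
    # gradient iterations on f(z) = (1/4m) sum_k (|a_k^T z|^2 - y_k)^2
    for _ in range(iters):
        Az = A @ z
        grad = A.T @ ((Az ** 2 - y) * Az) / m
        z = z - (mu / np.mean(y)) * grad  # step normalized by ||z^0||^2 ≈ mean(y)
    return z
```

Since x is identifiable only up to a global sign (or phase), the output is compared with the truth via min(‖z − x‖, ‖z + x‖).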
SLIDE 40 Interpretation of spectral initialization
Spectral initialization: z^0 ← leading eigenvector of  Y := (1/m) Σ_{k=1}^m y_k a_k a_k^∗
SLIDE 41 Interpretation of spectral initialization
Spectral initialization: z^0 ← leading eigenvector of  Y := (1/m) Σ_{k=1}^m y_k a_k a_k^∗
- Rationale: E[Y] = I + 2xx^∗ (when ‖x‖_2 = 1) under Gaussian design
SLIDE 42 Interpretation of spectral initialization
Spectral initialization: z^0 ← leading eigenvector of  Y := (1/m) Σ_{k=1}^m y_k a_k a_k^∗
- Rationale: E[Y] = I + 2xx^∗ (when ‖x‖_2 = 1) under Gaussian design
- Would succeed if Y → E[Y ]
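A quick numerical check of this rationale (purely illustrative, real-valued case): with many Gaussian samples and a unit-norm x, the matrix Y concentrates around I + 2xx^T.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 20, 200_000
x = rng.standard_normal(n)
x /= np.linalg.norm(x)                     # unit-norm ground truth
A = rng.standard_normal((m, n))
y = (A @ x) ** 2
Y = (A.T * y) @ A / m                      # (1/m) sum_k y_k a_k a_k^T
print(np.linalg.norm(Y - (np.eye(n) + 2 * np.outer(x, x)), 2))   # small when m >> n
```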
SLIDE 43
Empirical performance of initialization (m = 12n)
Ground truth x ∈ R^{409600}
SLIDE 44
Empirical performance of initialization (m = 12n)
Ground truth x ∈ R^{409600} | Spectral initialization
SLIDE 45 Improving initialization
Y = (1/m) Σ_k y_k a_k a_k^∗ is heavy-tailed unless m ≫ n
SLIDE 46 Improving initialization
Y = (1/m) Σ_k y_k a_k a_k^∗ is heavy-tailed unless m ≫ n
(plot: a_k^∗ Y a_k / ‖a_k‖^2 vs. k, compared with x^∗ Y x; m = 6n)
SLIDE 47 Improving initialization
Y = (1/m) Σ_k y_k a_k a_k^∗ is heavy-tailed unless m ≫ n
(plot: a_k^∗ Y a_k / ‖a_k‖^2 vs. k, compared with x^∗ Y x; m = 6n)
Problem: large outliers y_k = |a_k^∗ x|^2 bear too much influence
SLIDE 48 Improving initialization
Y = (1/m) Σ_k y_k a_k a_k^∗ is heavy-tailed unless m ≫ n
(plot: a_k^∗ Y a_k / ‖a_k‖^2 vs. k, compared with x^∗ Y x; m = 6n)
Problem: large outliers y_k = |a_k^∗ x|^2 bear too much influence
Solution: discard large samples and run PCA on  (1/m) Σ_k y_k a_k a_k^∗ 1{|y_k| ≲ Avg{|y_l|}}
SLIDE 49 Improving initialization
Y = (1/m) Σ_k y_k a_k a_k^∗ is heavy-tailed unless m ≫ n
(plot: a_k^∗ Y a_k / ‖a_k‖^2 vs. k, compared with x^∗ Y x; m = 6n)
Problem: large outliers y_k = |a_k^∗ x|^2 bear too much influence
Solution: discard large samples and run PCA on  (1/m) Σ_k y_k a_k a_k^∗ 1{|y_k| ≲ Avg{|y_l|}}
— improvable via more refined pre-processing (Wang, Giannakis, Eldar ’16, Lu, Li ’17, Mondelli, Montanari ’17):  (1/m) Σ_k ρ(y_k) a_k a_k^∗,  e.g. ρ(y_k) = max{y_k, a}
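A numpy sketch of the truncated spectral initialization just described (real-valued case); the truncation threshold alpha_y and the plain power method are illustrative choices rather than the exact recipe and constants of the paper.

```python
import numpy as np

def truncated_spectral_init(A, y, alpha_y=3.0, power_iters=100, seed=0):
    m, n = A.shape
    keep = y <= alpha_y * np.mean(np.abs(y))        # discard samples with excessively large y_k
    z = np.random.default_rng(seed).standard_normal(n)
    for _ in range(power_iters):                    # power method on (1/m) sum_{kept} y_k a_k a_k^T
        z = A[keep].T @ (y[keep] * (A[keep] @ z)) / m
        z /= np.linalg.norm(z)
    return np.sqrt(np.mean(y)) * z                  # rescale to the estimated signal norm
```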
SLIDE 50
Empirical performance of initialization (m = 12n)
Ground truth x ∈ R^{409600} | Regularized spectral initialization
SLIDE 51 Iterative refinement stage: search directions
Wirtinger flow:  z^{t+1} = z^t − (µ_t/m) Σ_{k=1}^m (|a_k^⊤ z^t|^2 − y_k) (a_k^⊤ z^t) a_k
SLIDE 52 Iterative refinement stage: search directions
Wirtinger flow:  z^{t+1} = z^t − (µ_t/m) Σ_{k=1}^m (|a_k^⊤ z^t|^2 − y_k) (a_k^⊤ z^t) a_k
Even in a local region around x (e.g. {z : ‖z − x‖_2 ≤ 0.1 ‖x‖_2}):
- f(·) is NOT strongly convex unless m ≫ n
- f(·) has a huge smoothness parameter
SLIDE 53 Iterative refinement stage: search directions
Wirtinger flow:  z^{t+1} = z^t − (µ_t/m) Σ_{k=1}^m (|a_k^⊤ z^t|^2 − y_k) (a_k^⊤ z^t) a_k
(figure: locus of {−∇ℓ_k(z)} around z and x)
Problem: descent direction has large variability
SLIDE 54 Our solution: variance reduction via proper trimming
More adaptive rule:
z^{t+1} = z^t + (µ_t/m) Σ_{i=1}^m [(y_i − |a_i^⊤ z^t|^2) / (a_i^⊤ z^t)] a_i 1_{E_1^i(z^t) ∩ E_2^i(z^t)}
where  E_1^i(z) := { α_lb ≤ |a_i^⊤ z| / ‖z‖_2 ≤ α_ub },   E_2^i(z) := { |y_i − |a_i^⊤ z|^2| ≤ (α_h/m) ‖y − |Az|^2‖_1 · |a_i^⊤ z| / ‖z‖_2 }
SLIDE 55 Our solution: variance reduction via proper trimming
More adaptive rule:
z^{t+1} = z^t + (µ_t/m) Σ_{i=1}^m [(y_i − |a_i^⊤ z^t|^2) / (a_i^⊤ z^t)] a_i 1_{E_1^i(z^t) ∩ E_2^i(z^t)}
where  E_1^i(z) := { α_lb ≤ |a_i^⊤ z| / ‖z‖_2 ≤ α_ub },   E_2^i(z) := { |y_i − |a_i^⊤ z|^2| ≤ (α_h/m) ‖y − |Az|^2‖_1 · |a_i^⊤ z| / ‖z‖_2 }
SLIDE 56 Our solution: variance reduction via proper trimming
More adaptive rule:
z^{t+1} = z^t + (µ_t/m) Σ_{i=1}^m [(y_i − |a_i^⊤ z^t|^2) / (a_i^⊤ z^t)] a_i 1_{E_1^i(z^t) ∩ E_2^i(z^t)}
where  E_1^i(z) := { α_lb ≤ |a_i^⊤ z| / ‖z‖_2 ≤ α_ub },   E_2^i(z) := { |y_i − |a_i^⊤ z|^2| ≤ (α_h/m) ‖y − |Az|^2‖_1 · |a_i^⊤ z| / ‖z‖_2 }
informally,  z^{t+1} = z^t + (µ_t/m) Σ_{k∈T_t} ∇ℓ_k(z^t)
- T_t trims away excessively large gradient components
SLIDE 57 Our solution: variance reduction via proper trimming
More adaptive rule:
z^{t+1} = z^t + (µ_t/m) Σ_{i=1}^m [(y_i − |a_i^⊤ z^t|^2) / (a_i^⊤ z^t)] a_i 1_{E_1^i(z^t) ∩ E_2^i(z^t)}
where  E_1^i(z) := { α_lb ≤ |a_i^⊤ z| / ‖z‖_2 ≤ α_ub },   E_2^i(z) := { |y_i − |a_i^⊤ z|^2| ≤ (α_h/m) ‖y − |Az|^2‖_1 · |a_i^⊤ z| / ‖z‖_2 }
informally,  z^{t+1} = z^t + (µ_t/m) Σ_{k∈T_t} ∇ℓ_k(z^t)
- T_t trims away excessively large gradient components
Slight bias + much reduced variance
SLIDE 58 Larger step size µt is feasible
(figures: iterate paths from z^1 toward x, without and with trimming)
without trimming: µ_t = O(1/n)
with trimming: µ_t = O(1)
With better-controlled descent directions, one can proceed far more aggressively
SLIDE 59 Summary: truncated Wirtinger flows (TWF)
- 1. Regularized spectral initialization: z^0 ← leading eigenvector of (1/m) Σ_{k∈T_0} y_k a_k a_k^∗
- 2. Regularized gradient descent:  z^{t+1} = z^t + (µ_t/m) Σ_{k∈T_t} ∇ℓ_k(z^t)
Key idea: adaptively discard high-leverage data
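A compact real-valued sketch of the trimmed gradient stage, combining the truncation events E_1, E_2 from the previous slides with the Poisson-type search direction; the constants alpha_lb, alpha_ub, alpha_h and the step size mu are illustrative, not necessarily the exact values recommended in the paper.

```python
import numpy as np

def twf_iterations(A, y, z0, iters=100, mu=0.2, alpha_lb=0.3, alpha_ub=5.0, alpha_h=5.0):
    m, n = A.shape
    z = z0.copy()
    for _ in range(iters):
        Az = A @ z
        resid = y - Az ** 2
        norm_z = np.linalg.norm(z)
        # E_1: |a_i^T z| / ||z|| lies within [alpha_lb, alpha_ub]
        E1 = (np.abs(Az) >= alpha_lb * norm_z) & (np.abs(Az) <= alpha_ub * norm_z)
        # E_2: residual not too large relative to the average residual size
        E2 = np.abs(resid) <= (alpha_h / m) * np.sum(np.abs(resid)) * np.abs(Az) / norm_z
        keep = E1 & E2                       # trim high-leverage samples
        nu = np.zeros(m)
        nu[keep] = resid[keep] / Az[keep]    # trimmed Poisson-type search-direction weights
        z = z + (mu / m) * (A.T @ nu)
    return z
```

Used together with the truncated spectral initializer sketched earlier, this gives the whole two-stage pipeline.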
SLIDE 60
Performance guarantees of TWF (noiseless data)
dist(z, x) := min ‖z ± x‖_2
Theorem (Chen & Candès ’15). Under i.i.d. Gaussian design, TWF achieves  dist(z^t, x) ≲ (1 − ρ)^t ‖x‖_2,  t = 0, 1, · · ·  with high prob., provided that the sample size m ≳ n. Here, 0 < ρ < 1 is a constant.
SLIDE 61
Performance guarantees of TWF (noiseless data)
dist(z, x) := min ‖z ± x‖_2
Theorem (Chen & Candès ’15). Under i.i.d. Gaussian design, TWF achieves  dist(z^t, x) ≲ (1 − ρ)^t ‖x‖_2,  t = 0, 1, · · ·  with high prob., provided that the sample size m ≳ n. Here, 0 < ρ < 1 is a constant.
(figure: initial guess z^0 within the basin of attraction around x)
start within basin of attraction → linear convergence
SLIDE 62 Computational complexity
A := {a_k^∗}_{1≤k≤m}
- Initialization: leading eigenvector → a few applications of A and A^∗
Σ_{k∈T_0} y_k a_k a_k^∗ = A^∗ diag{y_k · 1_{k∈T_0}} A
SLIDE 63 Computational complexity
A := {a_k^∗}_{1≤k≤m}
- Initialization: leading eigenvector → a few applications of A and A^∗
Σ_{k∈T_0} y_k a_k a_k^∗ = A^∗ diag{y_k · 1_{k∈T_0}} A
- Iterations: one application of A and A^∗ per iteration
z^{t+1} = z^t + (µ_t/m) ∇ℓ_tr(z^t),   −∇ℓ_tr(z^t) = A^∗ ν,   ν = 2 (|Az^t|^2 − y) / (Az^t) · 1_{T_t}
SLIDE 64 Computational complexity
A := {a_k^∗}_{1≤k≤m}
- Initialization: leading eigenvector → a few applications of A and A^∗
Σ_{k∈T_0} y_k a_k a_k^∗ = A^∗ diag{y_k · 1_{k∈T_0}} A
- Iterations: one application of A and A^∗ per iteration
z^{t+1} = z^t + (µ_t/m) ∇ℓ_tr(z^t),   −∇ℓ_tr(z^t) = A^∗ ν,   ν = 2 (|Az^t|^2 − y) / (Az^t) · 1_{T_t}
Approximate runtime: several tens of applications of A and A∗
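In code, one trimmed-gradient evaluation indeed amounts to one multiplication by A and one by A^T, mirroring the display above (real-valued sketch; the boolean index set `keep` plays the role of T_t and is assumed to be computed elsewhere):

```python
import numpy as np

def trimmed_gradient(A, z, y, keep):
    """Return -grad = A^T nu with nu = 2(|Az|^2 - y)/(Az), zeroed outside the trimmed set."""
    Az = A @ z                                   # one application of A
    nu = np.zeros_like(y, dtype=float)
    nu[keep] = 2 * (Az[keep] ** 2 - y[keep]) / Az[keep]
    return A.T @ nu                              # one application of A^T
```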
SLIDE 65 Numerical surprise
- CG: solve y = Ax
- Our algorithm: solve y = |Ax|^2
For random quadratic systems (m = 8n):
- computational cost of our algorithm ≈ 4 × computational cost of least squares
SLIDE 66
Empirical performance
After regularized spectral initialization
SLIDE 67
Empirical performance
After regularized spectral initialization After 50 TWF iterations
SLIDE 68 Key convergence condition for gradient stage
If there are many samples: ∀z s.t. dist(z, x) ≤ ε‖x‖_2:
⟨∇ℓ(z), x − z⟩ ≳ ‖z − x‖_2^2 + ‖∇ℓ(z)‖_2^2
(figure: directions ∇ℓ(z) at points z around x)
SLIDE 69 Key convergence condition for gradient stage
If there are NOT many samples, i.e. m ≍ n: ∀z s.t. dist(z, x) ≤ ε‖x‖_2:
⟨∇ℓ(z), x − z⟩ ≳ ‖z − x‖_2^2 + ‖∇ℓ(z)‖_2^2
(figure: directions ∇ℓ(z) at points z around x)
SLIDE 70 Key convergence condition for gradient stage
If there are NOT many samples, i.e. m ≍ n: ∀z s.t. dist(z, x) ≤ ε‖x‖_2:
⟨∇ℓ_tr(z), x − z⟩ ≳ ‖z − x‖_2^2 + ‖∇ℓ_tr(z)‖_2^2
(figure: untrimmed ∇ℓ(z) vs. trimmed ∇ℓ_tr(z) at points z around x)
SLIDE 71 Stability under noisy data
y_k = |a_k^∗ x|^2 + η_k
SNR := Σ_k |a_k^∗ x|^4 / ‖η‖_2^2 ≈ 3m ‖x‖_2^4 / ‖η‖_2^2
SLIDE 72 Stability under noisy data
y_k = |a_k^∗ x|^2 + η_k
SNR := Σ_k |a_k^∗ x|^4 / ‖η‖_2^2 ≈ 3m ‖x‖_2^4 / ‖η‖_2^2
Theorem (Soltanolkotabi). WF converges to the MLE.
Theorem (Chen, Candès). Relative error of TWF converges to O(1/√SNR).
SLIDE 73 Relative MSE vs. SNR (Poisson data)
(plot: relative MSE (dB) vs. SNR (dB), n = 1000, for m = 6n, 8n, 10n; slope ≈ −1)
Empirical evidence: relative MSE scales inversely with SNR
SLIDE 74 This accuracy is nearly un-improvable (empirically)
Comparison with genie-aided MLE (with sign info. revealed):  y_k ∼ Poisson(|a_k^∗ x|^2)  and  ε_k = sign(a_k^∗ x)  (revealed by a genie)
SLIDE 75 This accuracy is nearly un-improvable (empirically)
Comparison with genie-aided MLE (with sign info. revealed):  y_k ∼ Poisson(|a_k^∗ x|^2)  and  ε_k = sign(a_k^∗ x)  (revealed by a genie)
(plot: relative MSE (dB) vs. SNR (dB), n = 100; truncated WF vs. genie-aided MLE)
little empirical loss due to missing signs
SLIDE 76 This accuracy is nearly un-improvable (theoretically)
y_k ∼ Poisson(|a_k^∗ x|^2), independently across k
SNR ≈ Σ_k |a_k^∗ x|^4 / Σ_k |a_k^∗ x|^2   (Poisson: noise variance = mean)
SLIDE 77 This accuracy is nearly un-improvable (theoretically)
y_k ∼ Poisson(|a_k^∗ x|^2), independently across k
SNR ≈ Σ_k |a_k^∗ x|^4 / Σ_k |a_k^∗ x|^2   (Poisson: noise variance = mean)
Theorem (Chen, Candès). Under i.i.d. Gaussian design, for any estimator x̂,
inf_x̂  sup_{x: ‖x‖_2 ≥ log^{1.5} m}  E[ dist(x̂, x) | {a_k} ]  ≳  ‖x‖_2 / √SNR,
provided that the sample size m ≍ n.
SLIDE 78 Phaseless 3D computational imaging
Fromenteze, Liu, Boyarsky, Gollub, & Smith ’16
(diagram: field source → metasurface → intensity measurement, with quantities ρ(ν), φ(r_r, ν), f(r), g(r, r_r, ν))
Measure intensities (with radiating metasurfaces) rather than complex signals for sub-centimeter wavelengths
f̂: computational imaging;  f̂_I: phaseless computational imaging
SLIDE 79 Phaseless 3D computational imaging
Fromenteze, Liu, Boyarsky, Gollub, & Smith ’16
f̂: computational imaging;  f̂_I: phaseless computational imaging
(plots: normalized magnitude (a.u.) vs. x, y, z (m); (red) phaseless reconstruction, (blue) reconstruction w/ phase¹)
¹This demonstration is proposed in the microwave range as a proof of concept
SLIDE 80 No need of sample splitting
- Several prior works use sample splitting: they require fresh samples at each iteration; not practical, but much easier to analyze
(diagram: z^0 → z^1 → z^2 → z^3 → z^4 → z^5, using fresh samples at each step)
SLIDE 81 No need of sample splitting
- Several prior works use sample splitting: they require fresh samples at each iteration; not practical, but much easier to analyze
(diagram: z^0 → z^1 → z^2 → z^3 → z^4 → z^5, using fresh samples at each step)
- Our work: reuse all samples in all iterations
(diagram: z^0 → z^1 → z^2 → z^3 → z^4 → z^5, same samples throughout)
SLIDE 82 A small sample of more recent works
- other optimal algorithms
- reshaped WF (Zhang et al.), truncated AF (Wang et al.), median-TWF (Zhang et al.)
- alt-min w/o resampling (Waldspurger)
- composite optimization (Duchi et al., Charisopoulos et al.)
- approximate message passing (Ma et al.)
- block coordinate descent (Barmherzig et al.)
- PhaseMax (Goldstein et al., Bahmani et al., Salehi et al., Dhifallah et al., Hand et al.)
- stochastic algorithms (Kolte et al., Zhang et al., Lu et al., Tan et al., Jeong et al.)
- improved WF theory: iteration complexity → O(log n · log(1/ε)) (Ma et al.)
- improved initialization (Lu et al., Wang et al., Mondelli et al.)
- random initialization (Chen et al.)
- structured quadratic systems (Cai et al., Soltanolkotabi, Wang et al., Yang et al.,
Qu et al.)
- geometric analysis (Sun et al., Davis et al.)
- low-rank generalization (White et al., Li et al., Vaswani et al.)
SLIDE 83 Central message
- Simple nonconvex paradigms are surprisingly effective for computing MLE
- Importance of statistical thinking (initialization)
(figure: statistical accuracy of convex relaxation vs. nonconvex procedure)
— Y. Chen, E. J. Candès, “Solving random quadratic systems of equations is nearly as easy as solving linear systems,” Comm. Pure and Applied Math., 2017