SLIDE 1

The Power of Nonconvex Optimization in Solving Random Quadratic Systems of Equations

Yuxin Chen (Princeton)   Emmanuel Candès (Stanford)

  • Y. Chen, E. J. Candès, Communications on Pure and Applied Mathematics, vol. 70, no. 5, pp. 822–883, May 2017
SLIDE 2

Agenda

  • 1. The power of nonconvex optimization in solving random quadratic systems of equations (Aug. 28)
  • 2. Random initialization and implicit regularization in nonconvex statistical estimation (Aug. 29)
  • 3. The projected power method: an efficient nonconvex algorithm for joint discrete assignment from pairwise data (Sep. 3)
  • 4. Spectral methods meet asymmetry: two recent stories (Sep. 4)
  • 5. Inference and uncertainty quantification for noisy matrix completion (Sep. 5)
SLIDE 3
[Figure: the interplay of (high-dimensional) statistics and nonconvex optimization]
SLIDE 4

Nonconvex problems are everywhere

Maximum likelihood estimation is usually nonconvex:

    maximize_x   ℓ(x; data)   → may be nonconcave
    subj. to     x ∈ S        → may be nonconvex
SLIDE 5

Nonconvex problems are everywhere

Maximum likelihood estimation is usually nonconvex:

    maximize_x   ℓ(x; data)   → may be nonconcave
    subj. to     x ∈ S        → may be nonconvex

  • low-rank matrix completion
  • robust principal component analysis
  • graph clustering
  • dictionary learning
  • blind deconvolution
  • learning neural nets
  • ...
SLIDE 6

Nonconvex optimization may be super scary

There may be bumps everywhere and exponentially many local optima, e.g. for a 1-layer neural net (Auer, Herbster, Warmuth ’96; Vu ’98).
SLIDE 7

Example: solving quadratic programs is hard

Finding a maximum cut in a graph:

    maximize_x   x⊤ W x
    subj. to     xᵢ² = 1,   i = 1, · · · , n
SLIDE 8

Example: solving quadratic programs is hard

[Figure omitted — fig credit: Coding Horror]
SLIDE 9

One strategy: convex relaxation

Can relax into convex problems by

  • finding convex surrogates (e.g. compressed sensing, matrix completion)
  • lifting into higher dimensions (e.g. Max-Cut)
SLIDE 10

Example of convex surrogate: low-rank matrix completion

— Fazel ’02; Recht, Parrilo, Fazel ’10; Candès, Recht ’09

    minimize_M   rank(M)          subj. to   data constraints
        ↓ cvx surrogate
    minimize_M   nuc-norm(M)      subj. to   data constraints
SLIDE 11

Example of convex surrogate: low-rank matrix completion

— Fazel ’02; Recht, Parrilo, Fazel ’10; Candès, Recht ’09

    minimize_M   rank(M)          subj. to   data constraints
        ↓ cvx surrogate
    minimize_M   nuc-norm(M)      subj. to   data constraints

Robust variation used every day by Netflix
SLIDE 12

Example of convex surrogate: low-rank matrix completion

— Fazel ’02; Recht, Parrilo, Fazel ’10; Candès, Recht ’09

    minimize_M   rank(M)          subj. to   data constraints
        ↓ cvx surrogate
    minimize_M   nuc-norm(M)      subj. to   data constraints

Robust variation used every day by Netflix

Problem: operates in the full matrix space even though M is low-rank
SLIDE 13

Example of lifting: Max-Cut

— Goemans, Williamson ’95

    maximize_x   x⊤ W x
    subj. to     xᵢ² = 1,   i = 1, · · · , n
SLIDE 14

Example of lifting: Max-Cut

— Goemans, Williamson ’95

    maximize_x   x⊤ W x
    subj. to     xᵢ² = 1,   i = 1, · · · , n

let X = xx⊤:

    maximize_X   ⟨X, W⟩
    subj. to     X_{i,i} = 1,   i = 1, · · · , n
                 X ⪰ 0
                 rank(X) = 1
SLIDE 15

Example of lifting: Max-Cut

— Goemans, Williamson ’95

    maximize_x   x⊤ W x
    subj. to     xᵢ² = 1,   i = 1, · · · , n

let X = xx⊤:

    maximize_X   ⟨X, W⟩
    subj. to     X_{i,i} = 1,   i = 1, · · · , n
                 X ⪰ 0
                 rank(X) = 1
SLIDE 16

Example of lifting: Max-Cut

— Goemans, Williamson ’95

    maximize_x   x⊤ W x
    subj. to     xᵢ² = 1,   i = 1, · · · , n

let X = xx⊤:

    maximize_X   ⟨X, W⟩
    subj. to     X_{i,i} = 1,   i = 1, · · · , n
                 X ⪰ 0
                 rank(X) = 1

Problem: explosion in dimensions (ℝⁿ → ℝⁿˣⁿ)
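A quick numerical check of the lifting identity (a hedged sketch, not from the talk; sizes and seed are arbitrary): for any feasible x, the lifted variable X = xx⊤ satisfies ⟨X, W⟩ = x⊤Wx, has unit diagonal, and is rank-one.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 6
    W = rng.standard_normal((n, n))
    W = (W + W.T) / 2                      # symmetric weight matrix
    x = rng.choice([-1.0, 1.0], size=n)    # feasible cut vector: x_i^2 = 1
    X = np.outer(x, x)                     # lifted variable X = x x^T

    assert np.isclose(x @ W @ x, np.sum(W * X))   # <X, W> = x^T W x
    assert np.allclose(np.diag(X), 1.0)           # X_{i,i} = 1
    assert np.linalg.matrix_rank(X) == 1          # rank-one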
SLIDE 17

How about optimizing nonconvex problems directly without lifting?

SLIDE 18

A case study: solving random quadratic systems of equations

SLIDE 19

Solving quadratic systems of equations

[Figure: x → A → Ax → y = |Ax|², with a small worked numeric example]

Solve for x ∈ ℂⁿ from m quadratic equations:   y_k ≈ |⟨a_k, x⟩|²,   k = 1, . . . , m
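A minimal NumPy simulation of this (real-valued) measurement model under Gaussian design; sizes and seed are illustrative assumptions. The variables A, x, y defined here are reused by the sketches later in the deck.

    import numpy as np

    rng = np.random.default_rng(0)
    n, m = 100, 800                   # n unknowns, m = 8n equations (illustrative)
    A = rng.standard_normal((m, n))   # Gaussian design: row k is a_k^T
    x = rng.standard_normal(n)        # ground-truth signal (real case)
    y = np.abs(A @ x) ** 2            # quadratic measurements y_k = |<a_k, x>|^2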
SLIDE 20

Motivation: a missing phase problem in imaging science

Detectors record intensities of diffracted rays:

    x(t₁, t₂)  → Fourier transform →  x̂(f₁, f₂)

intensity of electrical field:

    |x̂(f₁, f₂)|² = |∫ x(t₁, t₂) e^{−i2π(f₁t₁ + f₂t₂)} dt₁ dt₂|²

Phase retrieval: recover the true signal x(t₁, t₂) from intensity measurements
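A discrete analogue of this measurement model (a hedged sketch; `img` is a hypothetical stand-in for the true image):

    import numpy as np

    rng = np.random.default_rng(0)
    img = rng.random((64, 64))                  # stand-in for the true image x(t1, t2)
    intensity = np.abs(np.fft.fft2(img)) ** 2   # detector records |x_hat(f1, f2)|^2 only
    # phase retrieval: recover `img` from `intensity`; the Fourier phase is lost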
SLIDE 21

Motivation: latent variable models

Example: mixture of regression

    y ≈ ⟨x, β⟩   or   y ≈ ⟨x, −β⟩

  • Samples {(y_k, x_k)}: drawn from one of two unknown regressors β and −β

    y_k ≈ ⟨x_k, β⟩ with prob. 0.5;   y_k ≈ ⟨x_k, −β⟩ else     (labels: latent variables)
SLIDE 22

Motivation: latent variable models

Example: mixture of regression

    y ≈ ⟨x, β⟩   or   y ≈ ⟨x, −β⟩

  • Samples {(y_k, x_k)}: drawn from one of two unknown regressors β and −β

    y_k ≈ ⟨x_k, β⟩ with prob. 0.5;   y_k ≈ ⟨x_k, −β⟩ else     (labels: latent variables)

— equivalent to observing y_k² ≈ |⟨x_k, β⟩|²

  • Goal: estimate β
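A small simulation of this reduction (hedged sketch; sizes, noise level, and seed are arbitrary): squaring the responses removes the latent label, leaving a random quadratic system in β.

    import numpy as np

    rng = np.random.default_rng(0)
    n, m = 20, 5000
    beta = rng.standard_normal(n)
    X = rng.standard_normal((m, n))             # covariates x_k
    labels = rng.choice([-1.0, 1.0], size=m)    # latent labels: beta vs. -beta
    y = labels * (X @ beta) + 0.01 * rng.standard_normal(m)

    y_sq = y ** 2    # y_k^2 ≈ |<x_k, beta>|^2: the label is gone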
SLIDE 23

Motivation: learning neural nets with quadratic activation

— Soltanolkotabi, Javanmard, Lee ’17; Li, Ma, Zhang ’17

[Figure: one-hidden-layer network — input layer a, hidden layer with activation σ, output layer y]

input features: a;   weights: X = [x₁, · · · , x_r]

output:   y = Σ_{i=1}^r σ(a⊤xᵢ)  with  σ(z) = z²,   i.e.   y := Σ_{i=1}^r (a⊤xᵢ)²
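A one-line forward pass confirming the reduction (hedged sketch; dimensions are arbitrary): with σ(z) = z², the network output is a quadratic measurement of the rank-r matrix XX⊤.

    import numpy as np

    rng = np.random.default_rng(0)
    n, r = 10, 3
    Xw = rng.standard_normal((n, r))    # weights X = [x_1, ..., x_r]
    a = rng.standard_normal(n)          # input features

    y = np.sum((a @ Xw) ** 2)           # y = sum_i sigma(a^T x_i) with sigma(z) = z^2
    assert np.isclose(y, a @ (Xw @ Xw.T) @ a)   # = a^T (X X^T) a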
SLIDE 24

An equivalent view: low-rank factorization

Lifting: introduce X = xx∗ to linearize constraints:

    y_k = |a_k∗ x|² = a_k∗ (xx∗) a_k   ⟹   y_k = a_k∗ X a_k
SLIDE 25

An equivalent view: low-rank factorization

Lifting: introduce X = xx∗ to linearize constraints:

    y_k = |a_k∗ x|² = a_k∗ (xx∗) a_k   ⟹   y_k = a_k∗ X a_k

    find    X ⪰ 0
    s.t.    y_k = a_k∗ X a_k,   k = 1, · · · , m
            rank(X) = 1
SLIDE 26

An equivalent view: low-rank factorization

Lifting: introduce X = xx∗ to linearize constraints:

    y_k = |a_k∗ x|² = a_k∗ (xx∗) a_k   ⟹   y_k = a_k∗ X a_k

    find    X ⪰ 0
    s.t.    y_k = a_k∗ X a_k,   k = 1, · · · , m
            rank(X) = 1
SLIDE 27

An equivalent view: low-rank factorization

Lifting: introduce X = xx∗ to linearize constraints:

    y_k = |a_k∗ x|² = a_k∗ (xx∗) a_k   ⟹   y_k = a_k∗ X a_k

    find    X ⪰ 0
    s.t.    y_k = a_k∗ X a_k,   k = 1, · · · , m
            rank(X) = 1

Works well if {a_k} are random, but huge increase in dimensions
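A numerical check of the linearization (hedged sketch for one real-valued measurement vector):

    import numpy as np

    rng = np.random.default_rng(1)
    n = 8
    x = rng.standard_normal(n)
    a = rng.standard_normal(n)    # one measurement vector a_k

    X = np.outer(x, x)            # lifted variable X = x x*
    # y_k = |a_k* x|^2 = a_k* (x x*) a_k = a_k* X a_k: quadratic in x, linear in X
    assert np.isclose(np.abs(a @ x) ** 2, a @ X @ a)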
SLIDE 28

Prior art (before our work)

n: # unknowns;   m: sample size (# eqns);   y = |Ax|²,  A ∈ ℝ^{m×n}

[Chart: computational cost vs. sample complexity; sample sizes below ≈ n marked infeasible; cvx relaxation placed on the chart]
SLIDE 29

Prior art (before our work)

n: # unknowns;   m: sample size (# eqns);   y = |Ax|²,  A ∈ ℝ^{m×n}

[Chart build: computational cost vs. sample complexity; infeasible regions marked; cvx relaxation placed on the chart]
SLIDE 30

Prior art (before our work)

n: # unknowns;   m: sample size (# eqns);   y = |Ax|²,  A ∈ ℝ^{m×n}

[Chart build: computational cost axis (mn, mn², …) vs. sample complexity; cvx relaxation placed on the chart]
SLIDE 31

Prior art (before our work)

n: # unknowns;   m: sample size (# eqns);   y = |Ax|²,  A ∈ ℝ^{m×n}

[Chart build: adds Wirtinger flow — computational cost ≈ mn², sample complexity ≈ n log n]
SLIDE 32

Prior art (before our work)

n: # unknowns;   m: sample size (# eqns);   y = |Ax|²,  A ∈ ℝ^{m×n}

[Chart build: adds alt-min (fresh samples at each iter) — sample complexity ≈ n log³ n]
SLIDE 33

A glimpse of our results

n: # unknowns;   m: sample size (# eqns);   y = |Ax|²,  A ∈ ℝ^{m×n}

[Chart: comput. cost vs. sample complexity — cvx relaxation (infeasible at scale); Wirtinger flow (cost ≈ mn², samples ≈ n log n); alt-min with fresh samples at each iter (samples ≈ n log³ n); our algorithm at cost ≈ mn, samples ≈ n]

This work: random quadratic systems are solvable in linear time!
SLIDE 34

A glimpse of our results

n: # unknowns;   m: sample size (# eqns);   y = |Ax|²,  A ∈ ℝ^{m×n}

[Chart: comput. cost vs. sample complexity — cvx relaxation (infeasible at scale); Wirtinger flow (cost ≈ mn², samples ≈ n log n); alt-min with fresh samples at each iter (samples ≈ n log³ n); our algorithm at cost ≈ mn, samples ≈ n]

This work: random quadratic systems are solvable in linear time!

  • minimal sample size
  • optimal statistical accuracy
SLIDE 35

A first impulse: maximum likelihood estimate

    maximize_z   ℓ(z) = (1/m) Σ_{k=1}^m ℓ_k(z)
SLIDE 36

A first impulse: maximum likelihood estimate

    maximize_z   ℓ(z) = (1/m) Σ_{k=1}^m ℓ_k(z)

  • Gaussian data:   y_k ∼ |a_k∗ x|² + N(0, σ²),      ℓ_k(z) = −(y_k − |a_k∗ z|²)²
SLIDE 37

A first impulse: maximum likelihood estimate

    maximize_z   ℓ(z) = (1/m) Σ_{k=1}^m ℓ_k(z)

  • Gaussian data:   y_k ∼ |a_k∗ x|² + N(0, σ²),      ℓ_k(z) = −(y_k − |a_k∗ z|²)²

  • Poisson data:    y_k ∼ Poisson(|a_k∗ x|²),        ℓ_k(z) = −|a_k∗ z|² + y_k log |a_k∗ z|²
SLIDE 38

A first impulse: maximum likelihood estimate

    maximize_z   ℓ(z) = (1/m) Σ_{k=1}^m ℓ_k(z)

  • Gaussian data:   y_k ∼ |a_k∗ x|² + N(0, σ²),      ℓ_k(z) = −(y_k − |a_k∗ z|²)²

  • Poisson data:    y_k ∼ Poisson(|a_k∗ x|²),        ℓ_k(z) = −|a_k∗ z|² + y_k log |a_k∗ z|²

Problem: −ℓ is nonconvex, with many local stationary points
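Hedged sketches of the two (real-case) log-likelihoods above; the small `eps` guard against log 0 is my addition, not part of the model.

    import numpy as np

    def loss_gaussian(z, A, y):
        # l(z) = -(1/m) sum_k (y_k - |a_k^T z|^2)^2   (Gaussian model)
        r = y - (A @ z) ** 2
        return -np.mean(r ** 2)

    def loss_poisson(z, A, y, eps=1e-12):
        # l(z) = (1/m) sum_k [-|a_k^T z|^2 + y_k log |a_k^T z|^2]   (Poisson model)
        s = (A @ z) ** 2
        return np.mean(-s + y * np.log(s + eps))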
SLIDE 39

Wirtinger flow: Candès, Li, Soltanolkotabi ’14

  • Spectral initialization:  z^0 ← leading eigenvector of  (1/m) Σ_{k=1}^m y_k a_k a_k∗

  • Iterative refinement:  for t = 0, 1, . . . :

        z^{t+1} = z^t + μ_t ∇ℓ(z^t)

Already a rich theory (see also Soltanolkotabi ’14; Ma, Wang, Chi, Chen ’17)
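A minimal real-valued WF sketch under the Gaussian loss. The scaling of z^0 by √(mean(y)) ≈ ‖x‖₂ and the step-size normalization are common heuristics, not the paper's exact choices.

    import numpy as np

    def wirtinger_flow(A, y, iters=200, mu=0.2):
        # Spectral initialization + gradient refinement (real case).
        m, n = A.shape
        Y = (A.T * y) @ A / m                  # Y = (1/m) sum_k y_k a_k a_k^T
        _, V = np.linalg.eigh(Y)
        z = V[:, -1] * np.sqrt(np.mean(y))     # leading eigenvector, scaled to signal size
        for _ in range(iters):
            Az = A @ z
            grad = A.T @ ((Az ** 2 - y) * Az) / m   # descent direction for the Gaussian loss
            z = z - (mu / np.mean(y)) * grad        # step normalized by signal power
        return z

    # z_hat = wirtinger_flow(A, y)   # error is measured up to the global sign:
    # min(np.linalg.norm(z_hat - x), np.linalg.norm(z_hat + x))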
SLIDE 40

Interpretation of spectral initialization

Spectral initialization:  z^0 ← leading eigenvector of  Y := (1/m) Σ_{k=1}^m y_k a_k a_k∗
SLIDE 41

Interpretation of spectral initialization

Spectral initialization:  z^0 ← leading eigenvector of  Y := (1/m) Σ_{k=1}^m y_k a_k a_k∗

  • Rationale: E[Y] = I + 2xx∗ (for ‖x‖₂ = 1) under Gaussian design
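A Monte Carlo check of the rationale (hedged sketch; real Gaussian case, sizes arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    n, m = 5, 1_000_000
    x = rng.standard_normal(n)
    x /= np.linalg.norm(x)          # ||x||_2 = 1
    A = rng.standard_normal((m, n))
    y = (A @ x) ** 2

    Y = (A.T * y) @ A / m           # (1/m) sum_k y_k a_k a_k^T
    print(np.round(Y - (np.eye(n) + 2 * np.outer(x, x)), 2))   # ≈ 0 up to sampling error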
SLIDE 42

Interpretation of spectral initialization

Spectral initialization:  z^0 ← leading eigenvector of  Y := (1/m) Σ_{k=1}^m y_k a_k a_k∗

  • Rationale: E[Y] = I + 2xx∗ (for ‖x‖₂ = 1) under Gaussian design
  • Would succeed if Y → E[Y]
SLIDE 43

Empirical performance of initialization (m = 12n)

Ground truth x ∈ ℝ^409600   [image]
SLIDE 44

Empirical performance of initialization (m = 12n)

Ground truth x ∈ ℝ^409600   [image]        Spectral initialization   [image]
SLIDE 45

Improving initialization

    Y = (1/m) Σ_k y_k a_k a_k∗  has heavy-tailed summands   ⟹   Y does not concentrate around E[Y] unless m ≫ n
SLIDE 46

Improving initialization

    Y = (1/m) Σ_k y_k a_k a_k∗  has heavy-tailed summands   ⟹   Y does not concentrate around E[Y] unless m ≫ n

[Plot: a_k∗ Y a_k / ‖a_k‖² against index k = 1, …, 12000, compared with x∗ Y x   (m = 6n)]
SLIDE 47

Improving initialization

    Y = (1/m) Σ_k y_k a_k a_k∗  has heavy-tailed summands   ⟹   Y does not concentrate around E[Y] unless m ≫ n

[Plot: a_k∗ Y a_k / ‖a_k‖² against index k = 1, …, 12000, compared with x∗ Y x   (m = 6n)]

Problem: large outliers y_k = |a_k∗ x|² bear too much influence
SLIDE 48

Improving initialization

    Y = (1/m) Σ_k y_k a_k a_k∗  has heavy-tailed summands   ⟹   Y does not concentrate around E[Y] unless m ≫ n

[Plot: a_k∗ Y a_k / ‖a_k‖² against index k = 1, …, 12000, compared with x∗ Y x   (m = 6n)]

Problem: large outliers y_k = |a_k∗ x|² bear too much influence

Solution: discard large samples and run PCA on

    (1/m) Σ_k y_k a_k a_k∗ · 1{|y_k| ≲ Avg{|y_l|}}
SLIDE 49

Improving initialization

    Y = (1/m) Σ_k y_k a_k a_k∗  has heavy-tailed summands   ⟹   Y does not concentrate around E[Y] unless m ≫ n

[Plot: a_k∗ Y a_k / ‖a_k‖² against index k = 1, …, 12000, compared with x∗ Y x   (m = 6n)]

Problem: large outliers y_k = |a_k∗ x|² bear too much influence

Solution: discard large samples and run PCA on

    (1/m) Σ_k y_k a_k a_k∗ · 1{|y_k| ≲ Avg{|y_l|}}

— improvable via more refined pre-processing (Wang, Giannakis, Eldar ’16; Lu, Li ’17; Mondelli, Montanari ’17):

    (1/m) Σ_k ρ(y_k) a_k a_k∗,    e.g. ρ(y_k) = max{y_k, a}
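A hedged sketch of the regularized (truncated) spectral initialization; the threshold constant alpha_y = 3 follows the order suggested in the paper, but treat the exact constants as assumptions.

    import numpy as np

    def truncated_spectral_init(A, y, alpha_y=3.0):
        # Discard samples with abnormally large y_k, then take the leading
        # eigenvector of the truncated data matrix (real case).
        m, n = A.shape
        keep = np.abs(y) <= alpha_y * np.mean(np.abs(y))   # T_0
        Y = (A[keep].T * y[keep]) @ A[keep] / m
        _, V = np.linalg.eigh(Y)
        return V[:, -1] * np.sqrt(np.mean(y))              # scale to signal energy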
SLIDE 50

Empirical performance of initialization (m = 12n)

Ground truth x ∈ ℝ^409600   [image]        Regularized spectral initialization   [image]
SLIDE 51

Iterative refinement stage: search directions

Wirtinger flow:

    z^{t+1} = z^t + (μ_t/m) Σ_{k=1}^m (y_k − |a_k⊤ z^t|²) a_k a_k⊤ z^t,

where the k-th summand is ∇ℓ_k(z^t) (up to constants)
SLIDE 52

Iterative refinement stage: search directions

Wirtinger flow:

    z^{t+1} = z^t + (μ_t/m) Σ_{k=1}^m (y_k − |a_k⊤ z^t|²) a_k a_k⊤ z^t,

where the k-th summand is ∇ℓ_k(z^t) (up to constants)

Even in a local region around x (e.g. {z : ‖z − x‖₂ ≤ 0.1 ‖x‖₂}):

  • f(·) := −ℓ(·) is NOT strongly convex unless m ≫ n
  • f(·) has a huge smoothness parameter
SLIDE 53

Iterative refinement stage: search directions

Wirtinger flow:

    z^{t+1} = z^t + (μ_t/m) Σ_{k=1}^m (y_k − |a_k⊤ z^t|²) a_k a_k⊤ z^t,

where the k-th summand is ∇ℓ_k(z^t) (up to constants)

[Figure: locus of {−∇ℓ_k(z)} at a point z near x]

Problem: the descent direction has large variability
SLIDE 54

Our solution: variance reduction via proper trimming

More adaptive rule:

    z^{t+1} = z^t + (μ_t/m) Σ_{i=1}^m [(y_i − |a_i⊤ z^t|²) / (a_i⊤ z^t)] a_i · 1_{E_1^i(z^t) ∩ E_2^i(z^t)}

where

    E_1^i(z) = { α_lb ≤ |a_i⊤ z| / ‖z‖₂ ≤ α_ub }
    E_2^i(z) = { |y_i − |a_i⊤ z|²|  ≤  (α_h/m) ‖y − A(zz⊤)‖₁ · |a_i⊤ z| / ‖z‖₂ }
SLIDE 55

Our solution: variance reduction via proper trimming

More adaptive rule:

    z^{t+1} = z^t + (μ_t/m) Σ_{i=1}^m [(y_i − |a_i⊤ z^t|²) / (a_i⊤ z^t)] a_i · 1_{E_1^i(z^t) ∩ E_2^i(z^t)}

where

    E_1^i(z) = { α_lb ≤ |a_i⊤ z| / ‖z‖₂ ≤ α_ub }
    E_2^i(z) = { |y_i − |a_i⊤ z|²|  ≤  (α_h/m) ‖y − A(zz⊤)‖₁ · |a_i⊤ z| / ‖z‖₂ }

[Figure: locus of trimmed search directions around z and x]
SLIDE 56

Our solution: variance reduction via proper trimming

More adaptive rule:

    z^{t+1} = z^t + (μ_t/m) Σ_{i=1}^m [(y_i − |a_i⊤ z^t|²) / (a_i⊤ z^t)] a_i · 1_{E_1^i(z^t) ∩ E_2^i(z^t)}

where

    E_1^i(z) = { α_lb ≤ |a_i⊤ z| / ‖z‖₂ ≤ α_ub }
    E_2^i(z) = { |y_i − |a_i⊤ z|²|  ≤  (α_h/m) ‖y − A(zz⊤)‖₁ · |a_i⊤ z| / ‖z‖₂ }

[Figure: locus of trimmed search directions around z and x]

informally,   z^{t+1} = z^t + (μ/m) Σ_{k∈T_t} ∇ℓ_k(z^t),   where T_t trims away excessively large gradient components
SLIDE 57

Our solution: variance reduction via proper trimming

More adaptive rule:

    z^{t+1} = z^t + (μ_t/m) Σ_{i=1}^m [(y_i − |a_i⊤ z^t|²) / (a_i⊤ z^t)] a_i · 1_{E_1^i(z^t) ∩ E_2^i(z^t)}

where

    E_1^i(z) = { α_lb ≤ |a_i⊤ z| / ‖z‖₂ ≤ α_ub }
    E_2^i(z) = { |y_i − |a_i⊤ z|²|  ≤  (α_h/m) ‖y − A(zz⊤)‖₁ · |a_i⊤ z| / ‖z‖₂ }

[Figure: locus of trimmed search directions around z and x]

informally,   z^{t+1} = z^t + (μ/m) Σ_{k∈T_t} ∇ℓ_k(z^t),   where T_t trims away excessively large gradient components

Slight bias + much reduced variance
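A sketch of one trimmed gradient step implementing E_1^i and E_2^i above (real case). The constants a_lb, a_ub, a_h and the step size are representative values; absolute constants in the gradient are absorbed into mu.

    import numpy as np

    def twf_step(z, A, y, mu=0.2, a_lb=0.3, a_ub=5.0, a_h=5.0):
        # One truncated-WF update with the trimming events E_1^i, E_2^i.
        m = len(y)
        Az = A @ z
        nz = np.linalg.norm(z)
        res = y - Az ** 2                                  # y_i - |a_i^T z|^2
        E1 = (np.abs(Az) >= a_lb * nz) & (np.abs(Az) <= a_ub * nz)
        cap = (a_h / m) * np.sum(np.abs(res)) / nz         # (a_h/m) ||y - A(zz^T)||_1 / ||z||_2
        E2 = np.abs(res) <= cap * np.abs(Az)
        keep = E1 & E2
        grad = A[keep].T @ (res[keep] / Az[keep]) / m      # (1/m) sum over T_t of grad l_i(z)
        return z + mu * grad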
SLIDE 58

Larger step size µt is feasible

[Figure: iterates around x — without trimming: μ_t = O(1/n); with trimming: μ_t = O(1)]

With better-controlled descent directions, one proceeds far more aggressively.
SLIDE 59

Summary: truncated Wirtinger flows (TWF)

  • 1. Regularized spectral initialization:  z^0 ← leading eigenvector of  (1/m) Σ_{k∈T_0} y_k a_k a_k∗

  • 2. Regularized gradient descent:

        z^{t+1} = z^t + μ_t · (1/m) Σ_{k∈T_t} ∇ℓ_k(z^t)  =:  z^t + μ_t ∇ℓ_tr(z^t)

Key idea: adaptively discard high-leverage data
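Combining the two stages into one driver (a hypothetical sketch; it relies on truncated_spectral_init and twf_step from the earlier sketches, and on the simulated A, y, x):

    def twf(A, y, iters=50):
        # Truncated Wirtinger flow: regularized init + trimmed gradient steps.
        z = truncated_spectral_init(A, y)
        for _ in range(iters):
            z = twf_step(z, A, y)
        return z

    # z_hat = twf(A, y)
    # rel_err = min(np.linalg.norm(z_hat - x), np.linalg.norm(z_hat + x)) / np.linalg.norm(x)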
SLIDE 60

Performance guarantees of TWF (noiseless data)

dist(z, x) := min ‖z ± x‖₂

Theorem (Chen & Candès ’15). Under i.i.d. Gaussian design, TWF achieves

    dist(z^t, x) ≲ (1 − ρ)^t ‖x‖₂,   t = 0, 1, · · ·

with high prob., provided that the sample size m ≳ n. Here, 0 < ρ < 1 is a constant.
SLIDE 61

Performance guarantees of TWF (noiseless data)

dist(z, x) := min ‖z ± x‖₂

Theorem (Chen & Candès ’15). Under i.i.d. Gaussian design, TWF achieves

    dist(z^t, x) ≲ (1 − ρ)^t ‖x‖₂,   t = 0, 1, · · ·

with high prob., provided that the sample size m ≳ n. Here, 0 < ρ < 1 is a constant.

[Figure: initial guess z^0 lands within a basin of attraction around x]

start within the basin of attraction  →  linear convergence
SLIDE 62

Computational complexity

A: the m × n matrix with rows {a_k∗}_{1≤k≤m}

  • Initialization: leading eigenvector → a few applications of A and A∗ (power iterations), since

        Σ_{k∈T_0} y_k a_k a_k∗ = A∗ diag{y_k · 1_{k∈T_0}} A
SLIDE 63

Computational complexity

A: the m × n matrix with rows {a_k∗}_{1≤k≤m}

  • Initialization: leading eigenvector → a few applications of A and A∗ (power iterations), since

        Σ_{k∈T_0} y_k a_k a_k∗ = A∗ diag{y_k · 1_{k∈T_0}} A

  • Iterations: one application of A and A∗ per iteration:

        z^{t+1} = z^t + μ_t ∇ℓ_tr(z^t),    −∇ℓ_tr(z^t) = A∗ν,   ν = 2 (|Az^t|² − y) / (Az^t) ⊙ 1_{T_t}  (entrywise)
SLIDE 64

Computational complexity

A: the m × n matrix with rows {a_k∗}_{1≤k≤m}

  • Initialization: leading eigenvector → a few applications of A and A∗ (power iterations), since

        Σ_{k∈T_0} y_k a_k a_k∗ = A∗ diag{y_k · 1_{k∈T_0}} A

  • Iterations: one application of A and A∗ per iteration:

        z^{t+1} = z^t + μ_t ∇ℓ_tr(z^t),    −∇ℓ_tr(z^t) = A∗ν,   ν = 2 (|Az^t|² − y) / (Az^t) ⊙ 1_{T_t}  (entrywise)

Approximate runtime: several tens of applications of A and A∗
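A sketch of why only matrix-vector products are needed: the initialization never forms Y explicitly, and each power iteration costs one product with A and one with A⊤ (real case; the iteration count is illustrative).

    import numpy as np

    def leading_eigvec_matfree(A, w, iters=50, seed=0):
        # Power method for Y = (1/m) A^T diag(w) A using only matvecs with A, A^T.
        m, n = A.shape
        v = np.random.default_rng(seed).standard_normal(n)
        for _ in range(iters):
            v = A.T @ (w * (A @ v)) / m    # one application of A, one of A^T
            v /= np.linalg.norm(v)
        return v

    # Gradient stage: likewise one application each of A and A^T per iteration,
    #   -grad_tr(z) = A.T @ nu,  nu = 2 * ((A @ z)**2 - y) / (A @ z) * keep_mask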
SLIDE 65

Numerical surprise

  • CG (conjugate gradients): solve y = Ax
  • Our algorithm: solve y = |Ax|²

For random quadratic systems (m = 8n):

    comput. cost of our algo. ≈ 4 × comput. cost of least squares
SLIDE 66

Empirical performance

After regularized spectral initialization   [image]
SLIDE 67

Empirical performance

After regularized spectral initialization   [image]        After 50 TWF iterations   [image]
SLIDE 68

Key convergence condition for gradient stage

If there are many samples: ∀z s.t. dist(z, x) ≤ ε‖x‖₂,

    ⟨∇ℓ(z), x − z⟩ ≳ ‖z − x‖₂² + ‖∇ℓ(z)‖₂²

[Figure: gradient directions ∇ℓ(z) at points z near x, well aligned with x − z]
SLIDE 69

Key convergence condition for gradient stage

If there are NOT many samples, i.e. m ≍ n: for z s.t. dist(z, x) ≤ ε‖x‖₂, the condition

    ⟨∇ℓ(z), x − z⟩ ≳ ‖z − x‖₂² + ‖∇ℓ(z)‖₂²

can fail.

[Figure: some gradient directions ∇ℓ(z) near x are poorly aligned with x − z]
SLIDE 70

Key convergence condition for gradient stage

If there are NOT many samples, i.e. m ≍ n, the trimmed gradient still satisfies: ∀z s.t. dist(z, x) ≤ ε‖x‖₂,

    ⟨∇ℓ_tr(z), x − z⟩ ≳ ‖z − x‖₂² + ‖∇ℓ_tr(z)‖₂²

[Figure: trimmed gradient directions ∇ℓ_tr(z) align with x − z]
SLIDE 71

Stability under noisy data

  • Noisy data:  y_k = |a_k∗ x|² + η_k

  • Signal-to-noise ratio:

        SNR := Σ_k |a_k∗ x|⁴ / Σ_k η_k²  ≈  3m ‖x‖₂⁴ / ‖η‖₂²    under i.i.d. Gaussian design
SLIDE 72

Stability under noisy data

  • Noisy data:  y_k = |a_k∗ x|² + η_k

  • Signal-to-noise ratio:

        SNR := Σ_k |a_k∗ x|⁴ / Σ_k η_k²  ≈  3m ‖x‖₂⁴ / ‖η‖₂²    under i.i.d. Gaussian design

Theorem (Soltanolkotabi). WF converges to the MLE.

Theorem (Chen, Candès). The relative error of TWF converges to O(1/√SNR).
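A Monte Carlo check of the approximation SNR ≈ 3m‖x‖₂⁴/‖η‖₂² (hedged sketch; it uses E|a⊤x|⁴ = 3‖x‖₂⁴ for real Gaussian a; sizes and noise level are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    n, m, sigma = 50, 50_000, 0.5
    x = rng.standard_normal(n)
    A = rng.standard_normal((m, n))
    eta = sigma * rng.standard_normal(m)             # noise: y_k = |a_k^T x|^2 + eta_k

    snr = np.sum((A @ x) ** 4) / np.sum(eta ** 2)    # definition above
    print(snr, 3 * m * np.linalg.norm(x) ** 4 / np.sum(eta ** 2))   # should be close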
SLIDE 73

Relative MSE vs. SNR (Poisson data)

[Plot: Relative MSE (dB, −20 to −65) vs. SNR (dB, 15 to 55), n = 1000, for m = 6n, 8n, 10n; slope ≈ −1]

Empirical evidence: relative MSE scales inversely with SNR
SLIDE 74

This accuracy is nearly un-improvable (empirically)

Comparison with genie-aided MLE (with sign info revealed):

    y_k ∼ Poisson(|a_k∗ x|²)   and   ε_k = sign(a_k∗ x)   (revealed by a genie)
SLIDE 75

This accuracy is nearly un-improvable (empirically)

Comparison with genie-aided MLE (with sign info revealed):

    y_k ∼ Poisson(|a_k∗ x|²)   and   ε_k = sign(a_k∗ x)   (revealed by a genie)

[Plot: Relative MSE (dB) vs. SNR (dB), n = 100 — truncated WF and genie-aided MLE curves nearly coincide]

little empirical loss due to missing signs
SLIDE 76

This accuracy is nearly un-improvable (theoretically)

  • Poisson data:  y_k ∼(ind.) Poisson(|a_k∗ x|²)

  • Signal-to-noise ratio:

        SNR ≈ Σ_k |a_k∗ x|⁴ / Σ_k Var(y_k) ≈ 3 ‖x‖₂²
SLIDE 77

This accuracy is nearly un-improvable (theoretically)

  • Poisson data:  y_k ∼(ind.) Poisson(|a_k∗ x|²)

  • Signal-to-noise ratio:

        SNR ≈ Σ_k |a_k∗ x|⁴ / Σ_k Var(y_k) ≈ 3 ‖x‖₂²

Theorem (Chen, Candès). Under i.i.d. Gaussian design, for any estimator x̂,

    inf_{x̂}  sup_{x: ‖x‖₂ ≥ log^{1.5} m}  E[ dist(x̂, x) | {a_k} ] / ‖x‖₂  ≳  1/√SNR,

provided that the sample size m ≍ n.

Phaseless 3D computational imaging

Fromenteze, Liu, Boyarsky, Gollub, & Smith ’16

[Figure: field source ρ(ν) → metasurface φ(r, ν) → scene f(r) → intensity measurement g(r, r′, ν)]

Measure intensities (with radiating metasurfaces) rather than complex signals for sub-centimeter wavelengths

    f̂ — computational imaging;    f̂_I — phaseless computational imaging
SLIDE 79

Phaseless 3D computational imaging

Fromenteze, Liu, Boyarsky, Gollub, & Smith ’16

[Figure: normalized magnitude (a.u.) vs. x, y, z (m) — (red) phaseless reconstruction, (blue) reconstruction w/ phase¹]

¹This demonstration is proposed in the microwave range as a proof of concept
SLIDE 80

No need of sample splitting

  • Several prior works use sample splitting: they require fresh samples at each iteration; not practical, but much easier to analyze

[Diagram: z^0 → z^1 → z^2 → z^3 → z^4 → z^5, using fresh samples at each step]
SLIDE 81

No need of sample splitting

  • Several prior works use sample splitting: they require fresh samples at each iteration; not practical, but much easier to analyze

[Diagram: z^0 → z^1 → z^2 → z^3 → z^4 → z^5, using fresh samples at each step]

  • Our work: reuse all samples in all iterations

[Diagram: z^0 → z^1 → z^2 → z^3 → z^4 → z^5, using the same samples throughout]
SLIDE 82

A small sample of more recent works

  • other optimal algorithms: reshaped WF (Zhang et al.), truncated AF (Wang et al.), median-TWF (Zhang et al.)
  • alt-min w/o resampling (Waldspurger)
  • composite optimization (Duchi et al., Charisopoulos et al.)
  • approximate message passing (Ma et al.)
  • block coordinate descent (Barmherzig et al.)
  • PhaseMax (Goldstein et al., Bahmani et al., Salehi et al., Dhifallah et al., Hand et al.)
  • stochastic algorithms (Kolte et al., Zhang et al., Lu et al., Tan et al., Jeong et al.)
  • improved WF theory: iteration complexity → O(log n log(1/ε)) (Ma et al.)
  • improved initialization (Lu et al., Wang et al., Mondelli et al.)
  • random initialization (Chen et al.)
  • structured quadratic systems (Cai et al., Soltanolkotabi, Wang et al., Yang et al., Qu et al.)
  • geometric analysis (Sun et al., Davis et al.)
  • low-rank generalization (White et al., Li et al., Vaswani et al.)
SLIDE 83

Central message

  • Simple nonconvex paradigms are surprisingly effective for computing the MLE
  • Importance of statistical thinking (initialization)

[Figure: statistical accuracy vs. comput. cost — convex relaxation vs. nonconvex procedure]

  • Y. Chen, E. Candès, “Solving random quadratic systems of equations is nearly as easy as solving linear systems,” Comm. Pure and Applied Math., 2017