

slide-1
SLIDE 1

Solving Random Quadratic Systems of Equations Is Nearly as Easy as Solving Linear Systems

Yuxin Chen (Princeton), Emmanuel Candès (Stanford)

  • Y. Chen and E. J. Candès, Communications on Pure and Applied Mathematics, vol. 70, no. 5, pp. 822-883, May 2017
slide-2
SLIDE 2
(high-dimensional) statistics  ∩  nonconvex optimization

slide-3
SLIDE 3

Solving quadratic systems of equations

[Figure: toy example of the measurement model y = |Ax|²: a signal x, a sensing matrix A, and intensity measurements y]

Solve for x ∈ Cⁿ given m quadratic equations:  yk ≈ |⟨ak, x⟩|²,  k = 1, . . . , m
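To make the setup concrete, here is a minimal NumPy sketch of this measurement model (an illustration added here; real-valued case with i.i.d. Gaussian sensing vectors, sizes arbitrary):

```python
import numpy as np

# Minimal sketch of the measurement model y_k = |<a_k, x>|^2 with
# i.i.d. Gaussian sensing vectors (real-valued case for simplicity).
rng = np.random.default_rng(0)
n, m = 100, 800                      # n unknowns, m quadratic equations
x = rng.standard_normal(n)           # ground-truth signal
A = rng.standard_normal((m, n))      # rows of A are the sensing vectors a_k
y = (A @ x) ** 2                     # y = |Ax|^2, taken entrywise
```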

slide-4
SLIDE 4

Motivation: a missing phase problem in imaging science

Detectors record intensities of diffracted rays

x(t1, t2)  → (Fourier transform) →  x̂(f1, f2)

intensity of electric field:   |x̂(f1, f2)|² = | ∫∫ x(t1, t2) e^(−i2π(f1 t1 + f2 t2)) dt1 dt2 |²

Phase retrieval: recover true signal x(t1, t2) from intensity measurements
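As a toy illustration (not from the slides), a discrete 2-D analogue of these intensity-only measurements:

```python
import numpy as np

# Discrete 2-D analogue: detectors record |FFT(x)|^2, so the Fourier phase is lost.
rng = np.random.default_rng(0)
x = rng.standard_normal((64, 64))          # unknown image x(t1, t2)
intensity = np.abs(np.fft.fft2(x)) ** 2    # recorded data |x_hat(f1, f2)|^2
# Phase retrieval: recover x from `intensity` alone (up to inherent ambiguities).
```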

slide-5
SLIDE 5

Motivation: learning neural nets with quadratic activation

— Soltanolkotabi, Javanmard, Lee ’17, Li, Ma, Zhang ’17

[Figure: one-hidden-layer network: input features a, hidden-layer weights X, quadratic activations σ, summed into output y]

input features: a;   weights: X = [x1, · · · , xr]

output:   y = ∑_{i=1}^r σ(a⊤xi),   with σ(z) = z²,   i.e.   y = ∑_{i=1}^r (a⊤xi)²
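A small sketch of this forward map (illustrative shapes; it also checks that the output equals the quadratic form a⊤(XX⊤)a):

```python
import numpy as np

# One-hidden-layer net with quadratic activation sigma(z) = z^2:
# y = sum_i (a^T x_i)^2, which equals the quadratic form a^T (X X^T) a.
rng = np.random.default_rng(0)
n, r = 20, 5
a = rng.standard_normal(n)            # input features
X = rng.standard_normal((n, r))       # weight matrix with columns x_1, ..., x_r
y = np.sum((a @ X) ** 2)              # network output
assert np.isclose(y, a @ (X @ X.T) @ a)
```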

slide-6
SLIDE 6

Solving quadratic systems is NP-complete in general ...

“I can’t find an efficient algorithm, but neither can all these people.” Fig credit: coding horror

slide-7
SLIDE 7

Statistical models come to rescue

[Figure: statistical models → tractable algorithms, benign landscape]

When data are generated by certain statistical / randomized models, e.g. ak ∼ N(0, In), problems are often much nicer than worst-case instances
slide-8
SLIDE 8

Convex relaxation

Lifting: introduce X = xx∗ to linearize the constraints:

yk = |a∗k x|² = a∗k (xx∗) ak   ⟹   yk = a∗k X ak

slide-12
SLIDE 12

Convex relaxation

find X ⪰ 0   s.t.   yk = a∗k X ak,   k = 1, · · · , m,   rank(X) = 1

Works well if {ak} are random, but huge increase in dimensions
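For a concrete feel, a tiny sketch of the lifted convex program; here the rank constraint is replaced by trace minimization, the standard PhaseLift-style surrogate, and cvxpy plus the toy sizes are assumptions rather than part of the slides:

```python
import cvxpy as cp
import numpy as np

# Lifted convex relaxation (PhaseLift-style sketch): drop rank(X) = 1 and
# minimize trace(X) subject to the linearized measurement constraints.
rng = np.random.default_rng(0)
n, m = 8, 40
x = rng.standard_normal(n)
A = rng.standard_normal((m, n))
y = (A @ x) ** 2

X = cp.Variable((n, n), symmetric=True)
constraints = [X >> 0] + [A[k] @ X @ A[k] == y[k] for k in range(m)]
cp.Problem(cp.Minimize(cp.trace(X)), constraints).solve()

# Recover x (up to global sign) from the leading eigenvector of X.
w, V = np.linalg.eigh(X.value)
x_hat = np.sqrt(max(w[-1], 0.0)) * V[:, -1]
```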

slide-13
SLIDE 13

Prior art (before our work)

n: # unknowns; m: sample size (# eqns); y = |Ax|2, A ∈ Rm×n

                                         comput. cost    sample complexity
  cvx relaxation                         infeasible       n
  Wirtinger flow                         mn²              n log n
  alt-min (fresh samples at each iter)   mn²              n log³ n

slide-18
SLIDE 18

A glimpse of our results

n: # unknowns; m: sample size (# eqns); y = |Ax|2, A ∈ Rm×n

Our algorithm:   comput. cost mn,   sample complexity n   (vs. the prior art above)

This work: random quadratic systems are solvable in linear time!

  • minimal sample size
  • optimal statistical accuracy

slide-20
SLIDE 20

A first impulse: maximum likelihood estimate

minimize_z   f(z) = (1/m) ∑_{k=1}^m fk(z)

  • Gaussian data:   yk ∼ |a∗k x|² + N(0, σ²),    fk(z) = ( yk − |a∗k z|² )²
  • Poisson data:   yk ∼ Poisson( |a∗k x|² ),    fk(z) = |a∗k z|² − yk log |a∗k z|²

Problem: f(·) is nonconvex, with many local stationary points
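A minimal sketch of the two per-sample losses above (real-valued case; the small eps is a numerical safeguard added for the log):

```python
import numpy as np

# Per-sample losses for the objective f(z) = (1/m) * sum_k f_k(z), real-valued case.
def gaussian_loss(z, A, y):
    """f_k(z) = (y_k - |a_k^T z|^2)^2, averaged over k."""
    r = y - (A @ z) ** 2
    return np.mean(r ** 2)

def poisson_loss(z, A, y, eps=1e-12):
    """f_k(z) = |a_k^T z|^2 - y_k * log|a_k^T z|^2, averaged over k (eps guards the log)."""
    q = (A @ z) ** 2
    return np.mean(q - y * np.log(q + eps))
```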

slide-24
SLIDE 24

A plausible nonconvex paradigm

minimize_z   f(z) = ∑_{k=1}^m fk(z)

[Figure: initial guess z0 inside a basin of attraction around x]

  • 1. initialize within a local basin sufficiently close to x ((hopefully) nicer landscape)
  • 2. iterative refinement
slide-26
SLIDE 26

Wirtinger flow (Candès, Li, Soltanolkotabi ’14)

minimize_z   f(z) = (1/m) ∑_{k=1}^m ( (a⊤k z)² − yk )²

  • spectral initialization: z0 ← leading eigenvector of a certain data matrix
  • (Wirtinger) gradient descent:   zt+1 = zt − µt ∇f(zt),   t = 0, 1, · · ·
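A minimal sketch of plain Wirtinger flow under these assumptions (real-valued Gaussian design; the step size, iteration count, and scaling are illustrative choices, not the paper's tuned values):

```python
import numpy as np

def wirtinger_flow(A, y, iters=500, mu=0.1):
    """Plain (real-valued) Wirtinger flow sketch: spectral init + gradient descent."""
    m, n = A.shape
    # Spectral initialization: leading eigenvector of Y = (1/m) sum_k y_k a_k a_k^T,
    # scaled by an estimate of ||x|| (mean(y) is roughly ||x||^2 under Gaussian design).
    Y = (A.T * y) @ A / m
    _, V = np.linalg.eigh(Y)
    norm_sq = np.mean(y)
    z = V[:, -1] * np.sqrt(norm_sq)
    for _ in range(iters):
        Az = A @ z
        grad = A.T @ ((Az ** 2 - y) * Az) / m   # gradient of the quartic loss, up to a constant
        z = z - (mu / norm_sq) * grad           # step scaled by ||z0||^2 as in the WF paper
    return z
```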

slide-27
SLIDE 27

Performance guarantees for WF


  • suboptimal computational cost?

— n times more expensive than linear-time algorithms

  • suboptimal sample complexity?
slide-28
SLIDE 28

Iterative refinement stage: search directions

Wirtinger flow:   zt+1 = zt − (µt/m) ∑_{k=1}^m ∇fk(zt),   where ∇fk(zt) ∝ ( |a⊤k zt|² − yk ) (a⊤k zt) ak

Even in a local region around x (e.g. {z : ‖z − x‖₂ ≤ 0.1 ‖x‖₂}):

  • f(·) is NOT strongly convex unless m ≫ n
  • f(·) has a huge smoothness parameter

[Figure: the locus of {∇fk(z)} around z and x is widely spread]

Problem: the descent direction has large variability

slide-31
SLIDE 31

Our solution: variance reduction via proper trimming

More adaptive rule:

zt+1 = zt + (µt/m) ∑_{i=1}^m ( (yi − |a⊤i zt|²) / (a⊤i zt) ) ai · 1{ E1i(zt) ∩ E2i(zt) }

where   E1i(z) = { αlb ≤ |a⊤i z| / ‖z‖₂ ≤ αub },
        E2i(z) = { |yi − |a⊤i z|²| ≤ (αh/m) ‖y − A(zz⊤)‖₁ · |a⊤i z| / ‖z‖₂ }

Informally,   zt+1 = zt − (µ/m) ∑_{k∈T} ∇fk(zt),   where T trims away excessively large gradient components

Slight bias + much reduced variance
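A sketch of one truncated gradient step implementing the two events above (real-valued case; the threshold constants are illustrative placeholders rather than the paper's recommended values):

```python
import numpy as np

def truncated_gradient_step(z, A, y, mu, alpha_lb=0.3, alpha_ub=5.0, alpha_h=5.0):
    """One trimmed gradient step; the alpha thresholds are illustrative placeholders."""
    m = A.shape[0]
    Az = A @ z
    resid = y - Az ** 2
    znorm = np.linalg.norm(z)
    # E1: keep samples whose |a_i^T z| / ||z|| is neither too small nor too large.
    E1 = (np.abs(Az) >= alpha_lb * znorm) & (np.abs(Az) <= alpha_ub * znorm)
    # E2: keep samples whose residual is not excessive relative to a robust l1 scale.
    E2 = np.abs(resid) <= (alpha_h / m) * np.sum(np.abs(resid)) * np.abs(Az) / znorm
    keep = E1 & E2
    coeffs = np.zeros(m)
    coeffs[keep] = resid[keep] / Az[keep]      # (y_i - |a_i^T z|^2) / (a_i^T z), trimmed
    return z + (mu / m) * (A.T @ coeffs)
```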


slide-35
SLIDE 35

Larger step size µt is feasible

[Figure: gradient steps from z1 toward x, with and without trimming]

  • without trimming: µt = O(1/n)
  • with trimming: µt = O(1)

With better-controlled descent directions, one proceeds far more aggressively

slide-36
SLIDE 36

Initialization stage

Spectral initialization (e.g. alt-min, WF):   z0 ← leading eigenvector of   Y := (1/m) ∑_{k=1}^m yk ak a∗k

  • Rationale: E[Y] = ‖x‖²₂ I + 2 xx∗ under i.i.d. Gaussian design
  • Would succeed if Y → E[Y]
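A quick numerical check of this rationale (illustrative sizes):

```python
import numpy as np

# Empirical check: Y = (1/m) sum_k y_k a_k a_k^T is close to
# E[Y] = ||x||^2 I + 2 x x^T under i.i.d. Gaussian design when m is large.
rng = np.random.default_rng(0)
n, m = 50, 200_000
x = rng.standard_normal(n)
A = rng.standard_normal((m, n))
y = (A @ x) ** 2
Y = (A.T * y) @ A / m
EY = np.linalg.norm(x) ** 2 * np.eye(n) + 2 * np.outer(x, x)
print(np.linalg.norm(Y - EY) / np.linalg.norm(EY))   # small relative deviation for large m
```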

slide-39
SLIDE 39

Improving initialization

Y = (1/m) ∑_k yk ak a∗k  is heavy-tailed and does not concentrate around E[Y] unless m ≫ n

[Figure: (1/‖ak‖²₂) a∗k Y ak across k, compared with x∗Y x (m = 6n): a few samples stand out]

Problem: large outliers yk = |a∗k x|² bear too much influence

Solution: discard large samples and run PCA on   (1/m) ∑_k yk ak a∗k 1{ |yk| ≲ Avg{|yl|} }
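A sketch of this truncated spectral initialization (the truncation constant alpha_y and the norm estimate are illustrative choices):

```python
import numpy as np

def truncated_spectral_init(A, y, alpha_y=3.0):
    """Leading eigenvector of (1/m) sum_k y_k a_k a_k^T restricted to samples whose
    |y_k| is not much larger than average; alpha_y is an illustrative constant."""
    m = A.shape[0]
    keep = np.abs(y) <= alpha_y * np.mean(np.abs(y))     # discard large samples
    Y = (A[keep].T * y[keep]) @ A[keep] / m
    _, V = np.linalg.eigh(Y)
    return V[:, -1] * np.sqrt(np.mean(y))                # scale by an estimate of ||x||
```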

slide-43
SLIDE 43

Summary of proposed algorithm

  • 1. Regularized spectral initialization:   z0 ← principal component of   (1/m) ∑_{k∈T0} yk ak a∗k
  • 2. Regularized gradient descent:   zt+1 = zt − (µt/m) ∑_{k∈Tt} ∇fk(zt)

Adaptive and iteration-varying rules: discard high-leverage data {yk : k ∉ Tt}
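Putting the two stages together, a minimal end-to-end sketch; it assumes the truncated_spectral_init and truncated_gradient_step helpers sketched on the earlier slides, and the sizes, step size, and iteration count are illustrative:

```python
import numpy as np

# End-to-end sketch reusing the truncated_spectral_init and truncated_gradient_step
# helpers sketched above; sizes, step size, and iteration count are illustrative.
rng = np.random.default_rng(0)
n, m = 100, 800
x = rng.standard_normal(n)
A = rng.standard_normal((m, n))
y = (A @ x) ** 2

z = truncated_spectral_init(A, y)
for _ in range(150):
    z = truncated_gradient_step(z, A, y, mu=0.2)

rel_err = min(np.linalg.norm(z - x), np.linalg.norm(z + x)) / np.linalg.norm(x)
print(rel_err)   # small: recovery up to a global sign
```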

slide-44
SLIDE 44

Theoretical guarantees (noiseless data)

[Figure: basin of attraction around x; relative error vs. iteration count, decaying linearly on a log scale]

Theorem (Chen & Candès). When ak are i.i.d. ∼ N(0, In) and m ≳ n, with high probability our algorithm attains ε accuracy in O(log(1/ε)) iterations

  • dimension-free linear convergence
slide-45
SLIDE 45

Computational complexity

A := matrix with rows a∗k,  1 ≤ k ≤ m

  • Initialization: leading eigenvector → a few applications of A and A∗, since
    ∑_{k∈T0} yk ak a∗k = A∗ diag{ yk · 1{k∈T0} } A

  • Iterations: one application of A and A∗ per iteration:
    zt+1 = zt − (µt/m) ∇ftr(zt),    ∇ftr(zt) = A∗ν,    ν = 2 ( |Azt|² − y ) / (Azt) · 1T

Approximate runtime: several tens of applications of A and A∗
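For instance, the initialization can be run matrix-free with a few power iterations, touching the data only through products with A and A∗; a sketch:

```python
import numpy as np

def power_init(A, y, keep, iters=30, seed=0):
    """Leading eigenvector of A^T diag(y * 1{k in T0}) A / m via power iterations,
    touching the data only through products with A and A^T."""
    m, n = A.shape
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(n)
    w = y * keep                        # y_k * 1{k in T0}
    for _ in range(iters):
        z = A.T @ (w * (A @ z)) / m     # one application of A and one of A^T
        z /= np.linalg.norm(z)
    return z
```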


slide-48
SLIDE 48

Numerical performance

  • CG: solve y = Ax
  • Our algorithm: solve y = |Ax|²

[Figure: relative error (log scale) vs. iteration count for least squares (CG) and the proposed algorithm]

For random quadratic systems (m = 8n), comput. cost of our algo. ≈ 4 × comput. cost of least squares

slide-50
SLIDE 50

Empirical performance (m = 12n)

Ground truth x ∈ R^409600

slide-51
SLIDE 51

Empirical performance (m = 12n)

Spectral initialization

slide-52
SLIDE 52

Empirical performance (m = 12n)

Spectral initialization vs. proposed regularized spectral initialization

slide-53
SLIDE 53

Empirical performance (m = 12n)

After regularized spectral initialization

slide-54
SLIDE 54

Empirical performance (m = 12n)

After regularized spectral initialization, and after 50 proposed iterations

slide-55
SLIDE 55

Stability under noisy data

Comparison with genie-aided MLE (with phase information revealed):

yk ∼ Poisson( |a∗k x|² )   and   εk = sign(a∗k x)   (revealed by a genie)

[Figure: relative MSE (dB) vs. SNR (dB), n = 100: truncated WF vs. genie-aided MLE]

little empirical loss due to missing signs

Theorem (Chen & Candès). Our algorithm achieves optimal statistical accuracy!

slide-58
SLIDE 58

Deal with complicated dependencies across iterations

Several prior approaches require fresh samples at each iteration

[Figure: iterates z0 → z1 → · · · → z5, each step using fresh samples]

This approach: reuse all samples in all iterations

[Figure: iterates z0 → z1 → · · · → z5, all steps using the same samples]

slide-60
SLIDE 60

A small sample of more recent works

  • other optimal algorithms
  • reshaped WF (Zhang et al.), truncated AF (Wang et al.), median-TWF (Zhang et al.)
  • alt-min w/o resampling (Waldspurger)
  • composite optimization (Duchi et al., Charisopoulos et al.)
  • approximate message passing (Ma et al.)
  • block coordinate descent (Barmherzig et al.)
  • PhaseMax (Goldstein et al., Bahmani et al., Salehi et al., Dhifallah et al., Hand et al.)
  • stochastic algorithms (Kolte et al., Zhang et al., Lu et al., Tan et al., Jeong et al.)
  • improved WF theory: iteration complexity → O(log n · log(1/ε)) (Ma et al.)

  • improved initialization (Lu et al., Wang et al., Mondelli et al.)
  • random initialization (Chen et al.)
  • structured quadratic systems (Cai et al., Soltanolkotabi, Wang et al., Yang et al., Qu et al.)

  • geometric analysis (Sun et al., Davis et al.)
  • low-rank generalization (White et al., Li et al., Vaswani et al.)
slide-61
SLIDE 61

Concluding remarks

Achieves optimal bias-variance tradeoff by adaptively discarding high-leverage data

[Table: comput. cost, sample size, and statistical accuracy of cvx relaxation vs. our non-cvx algo.]

(high-dimensional) statistics  ∩  nonconvex optimization
slide-62
SLIDE 62

Concluding remarks

Achieves optimal bias-variance tradeoff by adaptively discarding high-leverage data

  • comput. cost

sample size statistical accuracy cvx relaxation

  • ur non-cvx algo.
  • n (high-dimensional) statistics

nonconvex optimization