Modern Optimization Meets Physics: Recent Progress on Phase Retrieval
Yuxin Chen (on behalf of Emmanuel Candès)
CSA 2015, Technical University Berlin, Dec. 2015
Missing phase problem

Detectors record intensities of diffracted rays:
electric field $x(t_1, t_2)$ → Fraunhofer diffraction (Fourier transform) → $\hat{x}(f_1, f_2)$

Fig credit: Stanford SLAC

Intensity of the electric field:
$$\big|\hat{x}(f_1, f_2)\big|^2 = \Big|\int\!\!\int x(t_1, t_2)\, e^{-i2\pi(f_1 t_1 + f_2 t_2)}\, \mathrm{d}t_1\, \mathrm{d}t_2\Big|^2$$

Phase retrieval: recover the signal $x(t_1, t_2)$ from the intensity measurements $|\hat{x}(f_1, f_2)|^2$.
Origin in X-ray crystallography
Method for determining atomic structure within a crystal
Survey: Shechtman, Eldar, Cohen, Chapman, Miao and Segev (’15)
Phase retrieval in X-ray crystallography
- Knowledge of the phase is crucial to build the electron density map
- Algorithmic means of recovering the phase structure without sophisticated setups: Sayre ('52), Fienup ('78)
- Initial success in certain cases by using very specific prior knowledge
- H. Hauptman and J. Karle: 1985 Nobel Prize
Phase retrieval is a feasibility problem
(diagram: a signal $x$ passes through the sensing matrix $A$; only the entrywise squared magnitudes are observed, e.g. $Ax = (1, -3, 2, -1, 4, 2, -2, -1, 3, 4)^\top$ yields $y = |Ax|^2 = (1, 9, 4, 1, 16, 4, 4, 1, 9, 16)^\top$)

Solve for $x \in \mathbb{C}^n$ from $m$ quadratic equations:
$$y_k \approx |\langle a_k, x \rangle|^2, \qquad k = 1, \ldots, m$$
Equivalently, $y \approx |Ax|^2$, where $|z|^2 := [|z_1|^2, \cdots, |z_m|^2]^\top$.
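As a concrete reference point, here is a minimal numpy sketch of this measurement model (the sizes and the real-valued design are illustrative choices, not from the talk):

```python
import numpy as np

# Draw a random Gaussian design and record intensity-only measurements
# y = |Ax|^2, the data model considered throughout this talk.
rng = np.random.default_rng(0)
n, m = 100, 600                    # n unknowns, m = 6n quadratic equations
x = rng.standard_normal(n)         # ground-truth signal (real case for simplicity)
A = rng.standard_normal((m, n))    # rows are the sampling vectors a_k
y = np.abs(A @ x) ** 2             # y_k = |<a_k, x>|^2 -- the phases are lost
```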
An equivalent view: low-rank factorization

Introduce $X = xx^*$ to linearize the constraints:
$$y_k \approx |a_k^* x|^2 = a_k^* (xx^*) a_k \;\Longrightarrow\; y_k \approx a_k^* X a_k$$

find $X$ s.t. $y_k \approx a_k^* X a_k$, $k = 1, \cdots, m$, $\operatorname{rank}(X) = 1$

Solving quadratic systems is essentially low-rank matrix completion.
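A quick numerical check of the lifting step, on made-up data: the measurements are quadratic in $x$ but linear in $X = xx^*$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 20, 120
x = rng.standard_normal(n)
A = rng.standard_normal((m, n))
y = (A @ x) ** 2                              # quadratic in x

X = np.outer(x, x)                            # lifted variable, rank one
y_lin = np.einsum('ki,ij,kj->k', A, X, A)     # a_k^T X a_k for every k
print(np.allclose(y_lin, y))                  # True: same data, now linear in X
```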
Solving quadratic systems is NP-complete in general ...
“I can’t find an efficient algorithm, but neither can all these people.”
Fig credit: coding horror
NP-complete stone problem

Given weights $w_i \in \mathbb{R}$, $i = 1, \ldots, n$, is there an assignment $x_i = \pm 1$ s.t. $\sum_{i=1}^n w_i x_i = 0$?

Formulation as a quadratic system:
$$|x_i|^2 = 1, \quad i = 1, \ldots, n, \qquad \Big|\sum\nolimits_i w_i x_i\Big|^2 = 0$$

Many combinatorial problems can be reduced to solving quadratic systems.
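A toy illustration of the reduction (the weights below are made up, and the brute-force search is there only to exhibit the equivalence, not as an algorithm):

```python
import numpy as np
from itertools import product

# Stone problem: find signs x_i = +-1 with sum_i w_i x_i = 0, i.e. a solution
# of the quadratic system |x_i|^2 = 1 for all i and |sum_i w_i x_i|^2 = 0.
w = np.array([3.0, 1.0, 4.0, 2.0, 2.0])
solutions = [s for s in product((-1, 1), repeat=len(w))
             if np.isclose(np.dot(w, s), 0.0)]
print(solutions)    # e.g. (1, 1, -1, 1, -1): 3 + 1 - 4 + 2 - 2 = 0
```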
This talk: random quadratic systems are solvable in linear time!
A first impulse: maximum likelihood estimate

Assume (pretend) a noise model, and minimize the negative log-likelihood:
$$\text{minimize}_z \quad \ell(z) = \sum_{k=1}^m \ell_k(z)$$

Gaussian data: $y_k \sim |a_k^* x|^2 + \mathcal{N}(0, \sigma^2)$, giving $\ell_k(z) = \big(y_k - |a_k^* z|^2\big)^2$

Poisson data: $y_k \sim \text{Poisson}\big(|a_k^* x|^2\big)$, giving $\ell_k(z) = |a_k^* z|^2 - y_k \log |a_k^* z|^2$

Problem: $\ell$ is nonconvex, with many local stationary points.
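In code, both losses are one-liners; a sketch, up to additive constants and assuming the data model above:

```python
import numpy as np

def loss_gaussian(z, A, y):
    """Negative log-likelihood (up to constants) for y_k ~ |a_k^* z|^2 + N(0, s^2)."""
    return np.sum((y - np.abs(A @ z) ** 2) ** 2)

def loss_poisson(z, A, y):
    """Negative log-likelihood (up to constants) for y_k ~ Poisson(|a_k^* z|^2)."""
    v = np.abs(A @ z) ** 2
    return np.sum(v - y * np.log(v))
```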
Prior art: solving random quadratic systems

n: # unknowns; m: sample size (# eqns); $y = |Ax|^2$, $A \in \mathbb{R}^{m \times n}$

(chart: computational cost vs. sample complexity; legible entries include cvx relaxation, alt-min (fresh samples at each iter), the quantities $n$, $mn$, $mn^2$, and $n \log^3 n$, with some regimes marked infeasible)

This talk: solving random quadratic systems

(the same chart with "This talk" added to the comparison)
Nonconvex paradigm: optimize vector variables directly

$$\text{minimize}_z \quad \ell(z) = \sum_{k=1}^m \ell_k(z)$$

(figure: an initial guess $z_0$ landing in the basin of attraction around $x$, followed by iterates $z_1, z_2, \ldots$)

Start from an appropriate initial point, then proceed via some iterative updates¹.

¹Fienup, Candès et al., Netrapalli et al., Shechtman et al., Schniter et al., ...
Wirtinger flow: Candès, Li and Soltanolkotabi

Initialization via the spectral method: $z_0 \leftarrow$ leading eigenvector of $\frac{1}{m}\sum_{k=1}^m y_k a_k a_k^*$

Iterations: for $t = 0, 1, \ldots$
$$z_{t+1} = z_t - \frac{\mu_t}{m} \nabla \ell(z_t)$$

Make use of the Wirtinger derivative:
$$\ell(z) = \sum_k \big(y_k - |a_k^* z|^2\big)^2, \qquad \nabla \ell(z) := 4 \sum_k \big(|a_k^* z|^2 - y_k\big)(a_k a_k^*)\, z$$

Already a rich theory (see Soltanolkotabi's Ph.D. thesis).
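A minimal sketch of the two WF steps for the real-valued Gaussian model; the step-size ramp mimics the heuristic schedule $\mu_t = \min(1 - e^{-t/t_0},\, 0.2)$ reported in the WF paper, and the constants are illustrative:

```python
import numpy as np

def wirtinger_flow(A, y, n_iter=500):
    """Spectral initialization followed by (Wirtinger) gradient steps."""
    m, n = A.shape
    Y = (A.T * y) @ A / m                        # (1/m) sum_k y_k a_k a_k^T
    z = np.linalg.eigh(Y)[1][:, -1]              # leading eigenvector
    z = z * np.sqrt(y.mean())                    # scale: E[y_k] = ||x||^2 under Gaussian design
    z0_norm2 = np.linalg.norm(z) ** 2
    for t in range(n_iter):
        Az = A @ z
        grad = A.T @ ((Az ** 2 - y) * Az) / m    # gradient of (1/4m) sum (y_k - |a_k^T z|^2)^2
        mu = min(1.0 - np.exp(-(t + 1) / 330.0), 0.2)
        z = z - (mu / z0_norm2) * grad
    return z
```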
Interpretation of spectral initialization

The initialization is the leading eigenvector of $Y := \frac{1}{m}\sum_{k=1}^m y_k a_k a_k^*$.

Random Gaussian or coded diffraction models: $\mathbb{E}[Y] = \|x\|^2 I + 2xx^*$, with eigenvalues $(3, 1, 1, \ldots, 1) \cdot \|x\|^2$.
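The mean formula follows from elementary Gaussian moments; a short worked check (real case, the complex case is analogous):

```latex
% Take a ~ N(0, I_n) and, by rotation invariance, x = ||x|| e_1.
% With y = (a^T x)^2 = ||x||^2 a_1^2:
%   E[y a_1^2] = ||x||^2 E[a_1^4] = 3||x||^2,
%   E[y a_j^2] = ||x||^2 E[a_1^2] E[a_j^2] = ||x||^2   for j > 1,
% and the off-diagonal entries vanish by symmetry, so
\mathbb{E}\!\left[(a^{\top}x)^{2}\, a a^{\top}\right] = \|x\|^{2} I + 2\,x x^{\top},
% whose eigenvalues are (3, 1, \dots, 1)\cdot\|x\|^{2}, with x as the top eigenvector.
```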
Wirtinger flow as a stochastic gradient scheme

$$L(z) = \underbrace{z^* (I - xx^*)\, z}_{\text{penalizes orientation mismatch}} \;+\; \underbrace{\tfrac{3}{4}\big(\|z\|^2 - \|x\|^2\big)^2}_{\text{penalizes norm mismatch}}$$

Ideal gradient scheme: $z_{t+1} = z_t - \frac{\mu_t}{\|z_0\|^2} \nabla L(z_t)$. But we don't have access to $L(\cdot)$!

What we can compute: $\ell(z) = \frac{1}{4m}\sum_{k=1}^m \big(y_k - |a_k^* z|^2\big)^2$

For fixed $z$: if $z_t$ were independent of the sampling vectors (a false assumption), then
$$\mathbb{E}[z_{t+1}] = z_t - \frac{\mu_t}{\|z_0\|^2} \nabla L(z_t)$$

A stochastic optimization scheme for maximizing the population likelihood.
Computational complexity

Write $Ax := \{a_k^* x\}$.

Initialization: the leading eigenvector takes a few (e.g. 20) applications of $A$ and $A^*$, since
$$\frac{1}{m}\sum_{k \in T_0} y_k a_k a_k^* = \frac{1}{m} A^* \operatorname{diag}\{y_k\}\, A$$

Iterations: one application of $A$ and $A^*$ per iteration:
$$z_{t+1} = z_t - \frac{\mu_t}{m} \nabla \ell(z_t)$$

Approximate runtime: a couple hundred applications of $A$ and $A^*$. No SVDs, no matrix inversions.
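A matrix-free sketch of the initialization: only black-box applications of $A$ and $A^*$ are assumed (the operator arguments `apply_A` and `apply_At` are placeholder names, e.g. FFTs for coded diffraction patterns):

```python
import numpy as np

def spectral_init_power(apply_A, apply_At, y, n, n_power_iter=20, seed=0):
    """Leading eigenvector of (1/m) A^* diag{y_k} A via power iterations:
    one application of A and one of A^* per iteration, no SVD, no inversion."""
    m = len(y)
    z = np.random.default_rng(seed).standard_normal(n)
    z /= np.linalg.norm(z)
    for _ in range(n_power_iter):
        z = apply_At(y * apply_A(z)) / m    # apply Y = (1/m) A^* diag{y} A
        z /= np.linalg.norm(z)
    return z * np.sqrt(y.mean())            # rescale to the estimated signal energy

# Usage with an explicit matrix (matrix-free operators work the same way):
# z0 = spectral_init_power(lambda v: A @ v, lambda v: A.T @ v, y, n)
```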
Connection

(figure: a partially observed matrix, missing entries marked "?")

Nonconvex approach to matrix completion: Keshavan, Montanari and Oh ('09)
(1) Initial estimate by truncated SVD
(2) Refinement by minimizing a nonconvex functional
Performance guarantees of WF

Theorem (Candès, Li, Soltanolkotabi). Under i.i.d. Gaussian design, WF achieves
$$\operatorname{dist}(z_t, x) \lesssim \Big(1 - \frac{\mu}{4}\Big)^{t/2} \|x\|,$$
provided that $\mu \asymp \frac{1}{n}$ and the sample size obeys $m \gtrsim n \log n$.

Comput. cost: $mn^2 \log \frac{1}{\epsilon}$; sample complexity: $n \log n$.

Can we further improve the comput. cost and sample complexity?
Stage 1: improving initialization

Spectral initialization: $z_0 \leftarrow$ principal component of $Y := \frac{1}{m}\sum_k y_k a_k a_k^*$
$$\Longrightarrow \quad \mathbb{E}[Y] = \|x\|^2 I + 2xx^*$$

Would succeed if $x^* Y x \approx \max_{\|u\| = 1} f(u) := u^* Y u$.

(figure: the ideal shape of $f(u)$, peaked at $x$)
Discard high-leverage data

$Y \approx \frac{1}{m}\sum_k |a_k^* x|^2 a_k a_k^*$ is heavy-tailed.

But what really is the value of $f(u) := u^* Y u$ if $u \propto a_k$?

(plot: $f(a_k / \|a_k\|)$ vs. the sample index $k$ for $m = 6n$, compared against $f(x)$; most values follow a typical shape, while a few outliers stand far above it)

Problem: large outliers $y_k = (a_k^* x)^2$ bear too much influence.

Solution: throw away the large samples and compute the leading eigenvector of
$$\frac{1}{m}\sum_k y_k a_k a_k^* \cdot \mathbb{1}\big\{|y_k| \lesssim \operatorname{Avg}\{|y_l|\}\big\}$$
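A sketch of the truncated initialization under this rule; the threshold constant `alpha` is an illustrative placeholder rather than the paper's exact choice:

```python
import numpy as np

def truncated_spectral_init(A, y, alpha=3.0, n_power_iter=20, seed=0):
    """Regularized spectral initialization: drop samples with |y_k| far above
    the average, then take the leading eigenvector of the truncated data matrix."""
    m, n = A.shape
    keep = np.abs(y) <= alpha * np.mean(np.abs(y))   # 1{|y_k| <~ Avg{|y_l|}}
    Ak, yk = A[keep], y[keep]
    z = np.random.default_rng(seed).standard_normal(n)
    z /= np.linalg.norm(z)
    for _ in range(n_power_iter):
        z = Ak.T @ (yk * (Ak @ z)) / m               # still normalized by the full m
        z /= np.linalg.norm(z)
    return z * np.sqrt(y.mean())
```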
Importance of regularization

(plots: relative error of the spectral vs. truncated spectral method as the signal dimension $n$ grows; left: real Gaussian, $m = 6n$; right: complex CDP, $m = 12n$; the truncated spectral method attains a visibly smaller relative error)
Empirical performance (m = 12n)

(images: the ground truth $x$; the spectral initialization; the regularized spectral initialization)
Stage 2: improving search directions

WF: $z_{t+1} = z_t - \frac{\mu}{m}\sum_k \nabla \ell_k(z)$

(figure: the locus of $\{\nabla \ell_k(z)\}$ around $z$ and $x$)

Problem: the descent direction might have large variability.
Solution: variance reduction via trimming

A more adaptive rule: $z_{t+1} = z_t - \frac{\mu}{m}\sum_{k \in T} \nabla \ell_k(z)$

$T$ trims away excessively large gradient components:
$$T := \Big\{k : \|\nabla \ell_k(z)\| \lesssim \text{typical size of } \big\{\|\nabla \ell_l(z)\|\big\}_{1 \le l \le m}\Big\}$$

Slight bias + much reduced variance.
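A sketch of one such trimmed update; the median-based rule below is a simplified stand-in for the paper's exact truncation sets, but it conveys the variance-reduction idea:

```python
import numpy as np

def trimmed_gradient_step(z, A, y, mu=0.2, gamma=5.0):
    """One update that averages only per-sample gradients of typical size."""
    m = len(y)
    Az = A @ z
    r = Az ** 2 - y                                       # per-sample residuals
    gnorm = np.abs(r * Az) * np.linalg.norm(A, axis=1)    # ~ ||grad l_k(z)||
    T = gnorm <= gamma * np.median(gnorm)                 # trim outsized components
    return z - (mu / m) * (A[T].T @ (r * Az)[T])
```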
Larger step size µt is feasible

WF: $\mu_t = O(1/n)$. WF with trimming: $\mu_t = O(1)$.

With better-controlled descent directions, one can proceed more aggressively.
Summary: truncated Wirtinger flows

(1) Regularized spectral initialization: $z_0 \leftarrow$ principal component of $\frac{1}{m}\sum_{k \in T_0} y_k a_k a_k^*$

(2) Follow adaptive gradient descent: $z_{t+1} = z_t - \frac{\mu_t}{m}\sum_{k \in T_t} \nabla \ell_k(z)$

Adaptive and iteration-varying rules: discard the high-leverage data $\{y_k : k \notin T_t\}$. A sketch combining the two stages follows.
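Putting the two stages together, a compact end-to-end sketch (real-valued case; the truncation constants `alpha` and `gamma` and the constant step size are illustrative placeholders):

```python
import numpy as np

def twf(A, y, n_iter=150, mu=0.2, alpha=3.0, gamma=5.0):
    """Truncated Wirtinger flow: regularized spectral initialization,
    then trimmed gradient iterations with a constant step size."""
    m, n = A.shape
    # Stage 1: regularized spectral initialization
    keep = np.abs(y) <= alpha * np.mean(np.abs(y))
    Y = (A[keep].T * y[keep]) @ A[keep] / m
    z = np.linalg.eigh(Y)[1][:, -1] * np.sqrt(y.mean())
    # Stage 2: adaptive (trimmed) gradient descent
    row_norms = np.linalg.norm(A, axis=1)
    for _ in range(n_iter):
        Az = A @ z
        r = Az ** 2 - y
        gnorm = np.abs(r * Az) * row_norms       # per-sample gradient size
        T = gnorm <= gamma * np.median(gnorm)    # iteration-varying trimming set
        z = z - (mu / m) * (A[T].T @ (r * Az)[T])
    return z
```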
Theoretical guarantees (noiseless data)

Theorem (Chen, Candès). When $a_k \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, I)$ and the sample size obeys $m \gtrsim n$:
1. TWF starts within the basin of attraction;
2. TWF enjoys linear convergence.

(figures: the initial guess $z_0$ lands in the basin of attraction around $x$; the relative error, on a log scale, decays linearly over the first 60 iterations)
Numerical surprise

CG: solve $y = Ax$. TWF: solve $y = |Ax|^2$.

(plot: relative error on a log scale vs. iteration count, for least squares (CG) and the proposed algorithm)

For random $A$ ($m = 8n$): comput. cost of TWF ≈ 4 × comput. cost of least squares (CG).
Empirical success rate (noiseless data)

(plot: empirical success rate vs. sample size $m = 2n, \ldots, 6n$ for TWF and WF; $n = 1000$, # iters = 1000)
Empirical performance

(images: the reconstruction after regularized spectral initialization, and after 50 gradient iterations)
Key convergence condition for gradient stage

$\forall z$ s.t. $\operatorname{dist}(z, x) \le \varepsilon \|x\|$, the search direction $p$ obeys
$$\langle p, \, x - z \rangle \gtrsim \|z - x\|^2 + \|p\|^2$$

(figure: iterates $z_0, z_1, z_2, z_3$ moving toward $x$ along $-\nabla f(z)$)

When specialized to gradient descent (i.e. $p = -\nabla f(z)$), this says that the negative gradient points in the direction of the minimizer $x$. It does not say that $f$ is convex!
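A short calculation shows why this condition forces linear convergence; the constants $\alpha, \beta$ below stand in for those hidden in the $\gtrsim$:

```latex
% Suppose <p, x - z> >= \alpha ||z - x||^2 + \beta ||p||^2 and take one step
% z^+ = z + \mu p (for gradient descent, p = -\nabla f(z)). Then
\|z^{+}-x\|^{2} = \|z-x\|^{2} - 2\mu\,\langle p,\, x-z\rangle + \mu^{2}\|p\|^{2}
  \le (1-2\mu\alpha)\,\|z-x\|^{2} - \mu\,(2\beta-\mu)\,\|p\|^{2}
  \le (1-2\mu\alpha)\,\|z-x\|^{2}
% whenever 0 < \mu \le 2\beta: each step contracts dist(z, x) by a fixed
% factor, without any convexity of f.
```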
Lack of local convexity

Consider a real univariate function $f$ with minimizer at $z = x$ obeying $f'(z) \cdot (z - x) \gtrsim |z - x|^2 + |f'(z)|^2$.

(figure: such an $f(z)$, nonconvex in the neighborhood of $x$)

The condition says that the unique stationary point in the neighborhood is $x$.
Step size µ

(figure: the update $\mu p$ compared with the error $z - x$ for step sizes that are too small, well-matched, and too large)

To ensure sufficient progress at each iteration, pick the step size s.t. $\mu \|p\| \asymp \|z - x\|$.

For TWF, $\|p\| = \big\|\frac{1}{m}\nabla \ell_{\mathrm{tr}}(z)\big\| \asymp \|z - x\|$, and hence we pick $\mu \asymp 1$.
Stability under noisy data

Noisy data: $y_k = |a_k^* x|^2 + \eta_k$

Signal-to-noise ratio: $\mathrm{SNR} := \frac{\sum_k |a_k^* x|^4}{\sum_k \eta_k^2} \approx \frac{3m\|x\|^4}{\|\eta\|^2}$ under i.i.d. Gaussian design

Theorem (Soltanolkotabi). WF converges to the MLE.
Theorem (Chen, Candès). The relative error of TWF converges to $O\big(\tfrac{1}{\sqrt{\mathrm{SNR}}}\big)$.
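A one-line moment computation explains the factor $3m\|x\|^4$ (Gaussian design, real case):

```latex
% For a_k ~ N(0, I_n), a_k^T x ~ N(0, ||x||^2), and a centered Gaussian has
% fourth moment 3\sigma^4, so
\mathbb{E}\,|a_k^{\top}x|^{4} = 3\|x\|^{4}
\quad\Longrightarrow\quad
\sum_{k=1}^{m}|a_k^{\top}x|^{4} \approx 3m\|x\|^{4}.
```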
Relative MSE vs. SNR (Poisson data)

(plot: relative MSE (dB) vs. SNR (dB) for $m = 6n, 8n, 10n$ with $n = 1000$; the curves are straight lines of slope ≈ −1)

Empirical evidence: the relative MSE scales inversely with the SNR.
Comparison with genie-aided MLE (with sign info. revealed) yk ∼ Poisson( |a∗
kx|2 )
and εk = sign (a∗
kx)
(revealed by a genie)
This accuracy is nearly un-improvable (empirically)
Comparison with genie-aided MLE (with sign info. revealed) yk ∼ Poisson( |a∗
kx|2 )
and εk = sign (a∗
kx)
(revealed by a genie)
SNR (dB) (n =100)
15 20 25 30 35 40 45 50 55
Relative MSE (dB)
- 65
- 60
- 55
- 50
- 45
- 40
- 35
- 30
- 25
- 20
truncated WF genie-aided MLE
little empirical loss due to missing signs
This accuracy is nearly un-improvable (theoretically)

Poisson data: $y_k \overset{\text{ind.}}{\sim} \text{Poisson}\big(|a_k^* x|^2\big)$

Signal-to-noise ratio: $\mathrm{SNR} \approx \frac{\sum_k |a_k^* x|^4}{\sum_k \operatorname{Var}(y_k)} \approx 3\|x\|^2$

Theorem (Chen, Candès). Under i.i.d. Gaussian design, for any estimator $\hat{x}$,
$$\inf_{\hat{x}} \; \sup_{x:\, \|x\| \ge \log^{1.5} m} \; \mathbb{E}\big[\operatorname{dist}(\hat{x}, x) \,\big|\, \{a_k\}\big] \,\big/\, \|x\| \;\gtrsim\; \frac{1}{\sqrt{\mathrm{SNR}}},$$
provided that the sample size obeys $m \asymp n$.
Alternating minimization: Netrapalli, Jain, Sanghavi ('13)

Provable guarantees for a (resampling) variant of a non-convex iterative algorithm.

(diagram: starting from an initial estimate, alternate between obeying the spatial constraints (support, real-valuedness, nonnegativity) and obeying the frequency constraints (the magnitude measurements))

Given the current guess $z_t$, ask for fresh data $y^{(t+1)} = |A^{(t+1)} x|^2$ and update:
$$\hat{b}^{(t+1)} = \sqrt{y^{(t+1)}} \odot \operatorname{phase}\big(A^{(t+1)} z_t\big), \qquad z^{(t+1)} = \arg\min_z \big\|\hat{b}^{(t+1)} - A^{(t+1)} z\big\|$$
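A sketch of one such update in numpy (complex case; the resampling used in the analysis, a fresh $A^{(t+1)}$ at every step, is left to the caller):

```python
import numpy as np

def alt_min_phase_step(z, A, y):
    """One alternating-minimization update: impose the measured magnitudes
    sqrt(y) with the phases of the current guess, then solve least squares."""
    Az = A @ z
    b = np.sqrt(y) * np.exp(1j * np.angle(Az))      # b^ = sqrt(y) ⊙ phase(A z)
    z_new, *_ = np.linalg.lstsq(A, b, rcond=None)   # z <- argmin ||b^ - A z||
    return z_new
```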