Implicit Regularization in Nonconvex Statistical Estimation

SLIDE 1

Implicit Regularization in Nonconvex Statistical Estimation

Yuxin Chen, Electrical Engineering, Princeton University

SLIDE 2

Cong Ma (Princeton ORFE), Kaizheng Wang (Princeton ORFE), Yuejie Chi (CMU ECE)

SLIDES 3-4

Nonconvex estimation problems are everywhere

Empirical risk minimization is usually nonconvex:

    minimize_x   ℓ(x; y)   → loss may be nonconvex
    subj. to     x ∈ S     → constraint set may be nonconvex

  • low-rank matrix completion
  • graph clustering
  • dictionary learning
  • mixture models
  • deep learning
  • ...

SLIDES 5-6

Nonconvex optimization may be super scary

There may be bumps everywhere and exponentially many local optima, e.g. 1-layer neural nets (Auer, Herbster, Warmuth '96; Vu '98)

SLIDE 7

... but is sometimes much nicer than we think

Under certain statistical models, we see benign global geometry: no spurious local optima

Fig. credit: Sun, Qu & Wright

SLIDE 8

... but is sometimes much nicer than we think

statistical models → benign landscape → efficient algorithms that exploit geometry

SLIDES 9-10

Optimization-based methods: two-stage approach

[Figure: an initial guess x⁰ computed from the data lands in the basin of attraction around the truth x♮; subsequent iterates x¹, x², ... stay inside it]

  • Start from an appropriate initial point
  • Proceed via some iterative optimization algorithm

SLIDES 11-13

Roles of regularization

  • Prevents overfitting and improves generalization
    e.g. ℓ1 penalization, SCAD, nuclear norm penalization, ...
  • Improves computation by stabilizing search directions   ⟸ focus of this talk
    e.g. trimming, projection, regularized loss

SLIDE 14

3 representative nonconvex problems

phase retrieval | matrix completion | blind deconvolution

SLIDE 15

Regularized methods

                  phase retrieval   matrix completion               blind deconvolution
  regularized     trimming          regularized cost + projection   regularized cost + projection

SLIDES 16-17

Regularized vs. unregularized methods

                  phase retrieval           matrix completion               blind deconvolution
  regularized     trimming                  regularized cost + projection   regularized cost + projection
  unregularized   suboptimal comput. cost   ?                               ?

Are unregularized methods suboptimal for nonconvex estimation?

SLIDES 18-19

Missing phase problem

Detectors record intensities of diffracted rays

  • electric field x(t₁, t₂)  →  Fourier transform x̂(f₁, f₂)
  • intensity of the electric field:  |x̂(f₁, f₂)|² = |∫∫ x(t₁, t₂) e^{−i2π(f₁t₁ + f₂t₂)} dt₁ dt₂|²

Fig. credit: Stanford SLAC

Phase retrieval: recover signal x(t₁, t₂) from intensity |x̂(f₁, f₂)|²

SLIDE 20

Solving quadratic systems of equations

[Figure: y = |Ax|², the entrywise squared magnitudes of m linear measurements of x]

Recover x♮ ∈ Rⁿ from m random quadratic measurements

    yₖ = |aₖ⊤x♮|²,   k = 1, ..., m

Assume w.l.o.g. ‖x♮‖₂ = 1
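For readers who want to experiment, here is a minimal NumPy sketch of this Gaussian measurement model (the helper name generate_measurements and the default seed are our own, purely for illustration):

    import numpy as np

    def generate_measurements(n, m, seed=0):
        """Unit-norm planted signal plus m Gaussian quadratic measurements y_k = (a_k^T x)^2."""
        rng = np.random.default_rng(seed)
        x_star = rng.standard_normal(n)
        x_star /= np.linalg.norm(x_star)      # w.l.o.g. ||x_star||_2 = 1
        A = rng.standard_normal((m, n))       # rows a_k ~ N(0, I_n)
        y = (A @ x_star) ** 2                 # y_k = |a_k^T x_star|^2
        return A, y, x_star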

SLIDES 21-22

Wirtinger flow (Candès, Li, Soltanolkotabi '14)

Empirical risk minimization

    minimize_x   f(x) = (1/4m) Σₖ₌₁ᵐ [ (aₖ⊤x)² − yₖ ]²

  • Initialization by spectral method
  • Gradient iterations: for t = 0, 1, ...

        xᵗ⁺¹ = xᵗ − η ∇f(xᵗ)
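The following Python sketch makes the two steps concrete. It is not the authors' reference code: the spectral initialization uses the standard recipe (top eigenvector of (1/m) Σₖ yₖ aₖaₖ⊤, rescaled by the norm estimate √(mean(y))), and the step size and iteration count are illustrative assumptions.

    import numpy as np

    def wirtinger_flow(A, y, eta=0.1, n_iters=500):
        """Vanilla GD on f(x) = (1/4m) sum_k ((a_k^T x)^2 - y_k)^2."""
        m, n = A.shape
        # Spectral initialization: top eigenvector of (1/m) sum_k y_k a_k a_k^T
        Y = (A * y[:, None]).T @ A / m
        eigvals, eigvecs = np.linalg.eigh(Y)
        x = eigvecs[:, -1] * np.sqrt(np.mean(y))   # norm estimate: E[y_k] = ||x*||_2^2
        for _ in range(n_iters):
            z = A @ x
            grad = (A.T @ ((z ** 2 - y) * z)) / m  # grad f(x) = (1/m) sum_k ((a_k^T x)^2 - y_k)(a_k^T x) a_k
            x = x - eta * grad
        return x

Recovery is only defined up to a global sign, so accuracy should be measured as min(‖x − x♮‖₂, ‖x + x♮‖₂).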

SLIDES 23-25

Gradient descent theory revisited

Two standard conditions that enable geometric convergence of GD

  • (local) restricted strong convexity (or regularity condition)
  • (local) smoothness

i.e. ∇²f(x) ≻ 0 and is well-conditioned

SLIDE 26

Gradient descent theory revisited

f is said to be α-strongly convex and β-smooth if

    0 ≺ αI ⪯ ∇²f(x) ⪯ βI,   ∀x

ℓ2 error contraction: GD with η = 1/β obeys

    ‖xᵗ⁺¹ − x♮‖₂ ≤ (1 − α/β) ‖xᵗ − x♮‖₂
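For completeness, the contraction follows from a one-line argument (a sketch, assuming ∇f(x♮) = 0 and the Hessian bounds above along the segment between xᵗ and x♮):

    % GD contraction under alpha-strong convexity and beta-smoothness, with eta = 1/beta
    \| x^{t+1} - x^\natural \|_2
      = \Big\| \Big( I - \tfrac{1}{\beta} \int_0^1 \nabla^2 f\big( x^\natural + \tau (x^t - x^\natural) \big)\, d\tau \Big) (x^t - x^\natural) \Big\|_2
      \le \Big( 1 - \frac{\alpha}{\beta} \Big) \| x^t - x^\natural \|_2 ,

since every Hessian along the segment lies between αI and βI.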

SLIDES 27-30

Gradient descent theory revisited

    ‖xᵗ⁺¹ − x♮‖₂ ≤ (1 − α/β) ‖xᵗ − x♮‖₂

[Figure: the contraction holds as long as the iterates stay inside the region of local strong convexity + smoothness]

SLIDES 31-32

Gradient descent theory revisited

    0 ≺ αI ⪯ ∇²f(x) ⪯ βI,   ∀x

ℓ2 error contraction: GD with η = 1/β obeys  ‖xᵗ⁺¹ − x♮‖₂ ≤ (1 − α/β) ‖xᵗ − x♮‖₂

  • Condition number β/α determines rate of convergence
  • Attains ε-accuracy within O( (β/α) log(1/ε) ) iterations

SLIDES 33-34

What does this optimization theory say about WF?

Gaussian designs: aₖ i.i.d. ∼ N(0, Iₙ), 1 ≤ k ≤ m

Population level (infinite samples):

    E[∇²f(x)] = 3( ‖x‖₂² I + 2xx⊤ ) − ( ‖x♮‖₂² I + 2x♮x♮⊤ )

  • locally positive definite and well-conditioned

Consequence: WF converges within O(log(1/ε)) iterations if m → ∞
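The population Hessian above follows from a standard Gaussian moment identity (a quick derivation, assuming a ∼ N(0, Iₙ); the finite-sample Hessian appears on slides 46-47):

    % Gaussian fourth-moment identity:  E[(a^T x)^2 a a^T] = ||x||_2^2 I_n + 2 x x^T
    % Applying it to  \nabla^2 f(x) = (1/m) \sum_k [ 3 (a_k^T x)^2 - (a_k^T x^\natural)^2 ] a_k a_k^T  gives
    \mathbb{E}\big[ \nabla^2 f(x) \big]
      = 3\big( \|x\|_2^2 I_n + 2 x x^\top \big) - \big( \|x^\natural\|_2^2 I_n + 2 x^\natural x^{\natural\top} \big).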

SLIDES 35-38

What does this optimization theory say about WF?

Gaussian designs: aₖ i.i.d. ∼ N(0, Iₙ), 1 ≤ k ≤ m

Finite-sample level (m ≍ n log n): ∇²f(x) ≻ 0 but ill-conditioned

  • condition number ≍ n (even locally)

Consequence (Candès et al '14): WF attains ε-accuracy within O(n log(1/ε)) iterations if m ≍ n log n

Too slow ... can we accelerate it?

SLIDE 39

One solution: truncated WF (Chen, Candès '15)

Regularize / trim gradient components to accelerate convergence
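To make "trimming" concrete, here is a schematic sketch of the idea: drop gradient terms whose sampling vector is overly coherent with the current iterate. The threshold rule below is illustrative only; the exact truncation rules of truncated WF are spelled out in Chen, Candès '15.

    import numpy as np

    def trimmed_gradient(A, y, x, tau=5.0):
        """WF gradient with overly coherent terms discarded (schematic trimming rule)."""
        m = A.shape[0]
        z = A @ x
        keep = np.abs(z) <= tau * np.linalg.norm(x) * np.sqrt(np.log(m))  # keep only incoherent terms
        r = (z ** 2 - y) * z * keep
        return (A.T @ r) / m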

SLIDES 40-44

But wait a minute ...

  • WF converges in O(n) iterations
  • Step size taken to be ηₜ = O(1/n)
  • This choice is suggested by worst-case optimization theory
  • Does it capture what really happens?

SLIDE 45

Numerical surprise with ηₜ = 0.1

[Figure: estimation error (log scale, from 10⁰ down to about 10⁻¹⁵) vs. iteration count (0-500)]

Vanilla GD (WF) can proceed much more aggressively!
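This kind of experiment is easy to reproduce with the sketches above (generate_measurements and wirtinger_flow are the hypothetical helpers introduced earlier; the problem size is an arbitrary choice):

    import numpy as np
    # m = 10 n is on the order of n log n for n = 1000
    A, y, x_star = generate_measurements(n=1000, m=10000)
    x = wirtinger_flow(A, y, eta=0.1, n_iters=500)
    err = min(np.linalg.norm(x - x_star), np.linalg.norm(x + x_star))  # modulo global sign
    print(err)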

SLIDES 46-47

A second look at gradient descent theory

Which region enjoys both strong convexity and smoothness?

    ∇²f(x) = (1/m) Σₖ₌₁ᵐ [ 3(aₖ⊤x)² − (aₖ⊤x♮)² ] aₖaₖ⊤

  • Not smooth if x and aₖ are too close (coherent)
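A back-of-the-envelope calculation (heuristic, using ‖a₁‖₂² ≈ n for a Gaussian a₁ and ignoring the lower-order (a₁⊤x♮)² term) shows how coherence destroys smoothness:

    % take x = x^\natural + \delta a_1 / \|a_1\|_2 for a constant \delta, so (a_1^\top x)^2 \approx \delta^2 n;
    % then the k = 1 term of the Hessian alone has spectral norm
    \frac{1}{m} \Big\| \big[ 3 (a_1^\top x)^2 - (a_1^\top x^\natural)^2 \big] a_1 a_1^\top \Big\|
      \;\approx\; \frac{3 \delta^2 n \, \|a_1\|_2^2}{m}
      \;\asymp\; \frac{\delta^2 n}{\log n}
      \qquad \text{when } m \asymp n \log n,

which blows up with n even though ‖x − x♮‖₂ = δ is small.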

SLIDES 48-50

A second look at gradient descent theory

Which region enjoys both strong convexity and smoothness?

  • x is not far away from x♮
  • x is incoherent w.r.t. the sampling vectors (incoherence region):

        |aₖ⊤(x − x♮)| ≲ √(log n)   for all k

SLIDES 51-59

A second look at gradient descent theory

[Figure: the region of local strong convexity + smoothness is the intersection of an ℓ2 ball around x♮ with the incoherence region]

  • Prior theory only ensures that iterates remain in the ℓ2 ball, but not in the incoherence region
  • Prior theory enforces regularization to promote incoherence

SLIDES 60-64

Our findings: GD is implicitly regularized

[Figure: the GD trajectory stays inside the region of local strong convexity + smoothness, i.e. inside the incoherence region]

GD implicitly forces iterates to remain incoherent

SLIDES 65-68

Theoretical guarantees

Theorem 1 (Phase retrieval). Under i.i.d. Gaussian design, WF achieves

  • maxₖ |aₖ⊤(xᵗ − x♮)| ≲ √(log n) ‖x♮‖₂            (incoherence)
  • ‖xᵗ − x♮‖₂ ≲ (1 − η/2)ᵗ ‖x♮‖₂                   (near-linear convergence)

provided that step size η ≍ 1/log n and sample size m ≳ n log n.

  • Step size: 1/log n (vs. 1/n)
  • Computational complexity: about n / log n times faster than existing theory
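The iteration-count comparison is a one-line consequence of the contraction factor (a small worked step under the bounds of Theorem 1):

    % (1 - eta/2)^t <= eps  requires
    t \;\gtrsim\; \frac{2}{\eta} \log\frac{1}{\varepsilon}
      \;\asymp\; \log n \cdot \log\frac{1}{\varepsilon}
      \qquad (\eta \asymp 1/\log n),

versus t ≍ n log(1/ε) when η ≍ 1/n, i.e. roughly a factor n / log n fewer iterations.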

SLIDE 69

Key ingredient: leave-one-out analysis

For each 1 ≤ l ≤ m, introduce leave-one-out iterates x^{t,(l)} by dropping the l-th measurement

[Figure: the quadratic measurement model with the l-th measurement removed]
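The leave-one-out sequence is an analysis device rather than an algorithm one runs, but it is easy to picture in code (using the hypothetical helpers from the earlier sketches):

    import numpy as np
    # A, y from generate_measurements; wirtinger_flow as sketched above
    l = 0
    A_loo = np.delete(A, l, axis=0)        # drop the l-th row a_l
    y_loo = np.delete(y, l)                # drop the l-th measurement y_l
    x_loo = wirtinger_flow(A_loo, y_loo)   # this run never sees a_l, hence is independent of it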

SLIDES 70-72

Key ingredient: leave-one-out analysis

[Figure: both the true iterates {xᵗ} and the leave-one-out iterates {x^{t,(l)}} lie in the region incoherent w.r.t. aₗ]

  • Leave-one-out iterates {x^{t,(l)}} are independent of aₗ, and are hence incoherent w.r.t. aₗ with high prob.
  • Leave-one-out iterates x^{t,(l)} ≈ true iterates xᵗ
  • Combining the two via the triangle inequality:

        |aₗ⊤(xᵗ − x♮)| ≤ |aₗ⊤(x^{t,(l)} − x♮)| + |aₗ⊤(xᵗ − x^{t,(l)})|
SLIDE 73

This recipe is quite general

SLIDE 74

Low-rank matrix completion

[Figure: a matrix with only a few entries revealed and all remaining entries marked "?". Fig. credit: Candès]

Given partial samples Ω of a low-rank matrix M, fill in missing entries

SLIDES 75-78

Prior art

    minimize_X   f(X) = Σ_{(j,k)∈Ω} ( eⱼ⊤XX⊤eₖ − Mⱼ,ₖ )²

Existing theory on gradient descent requires

  • regularized loss (solve min_X f(X) + R(X) instead)
    e.g. Keshavan, Montanari, Oh '10; Sun, Luo '14; Ge, Lee, Ma '16
  • projection onto the set of incoherent matrices
    e.g. Chen, Wainwright '15; Zheng, Lafferty '16
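For contrast with the regularized methods above, here is a minimal sketch of the unregularized (vanilla) gradient descent analyzed on the next slide, written for a symmetric positive semidefinite M; the spectral initialization, step size, and iteration count are illustrative assumptions rather than the paper's exact prescriptions.

    import numpy as np

    def mc_gradient_descent(M_obs, mask, r, n_iters=500):
        """Vanilla GD on f(X) = sum_{(j,k) in Omega} ((X X^T)_{jk} - M_{jk})^2.

        M_obs holds the observed entries (zeros elsewhere); mask is the 0/1 indicator of Omega.
        """
        p = mask.mean()                              # estimated sampling rate
        vals, vecs = np.linalg.eigh(M_obs / p)       # spectral initialization from the rescaled data
        X = vecs[:, -r:] * np.sqrt(np.maximum(vals[-r:], 0.0))
        eta = 0.2 / max(vals[-1], 1e-12)             # step size on the order of 1/sigma_max(M)
        for _ in range(n_iters):
            R = mask * (X @ X.T) - M_obs             # residual on observed entries
            X = X - eta * 2.0 * (R + R.T) @ X        # gradient of the entrywise squared loss
        return X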

SLIDES 79-81

Theoretical guarantees

Theorem 2 (Matrix completion). Suppose M is rank-r, incoherent and well-conditioned. Vanilla gradient descent (with spectral initialization) achieves ε-accuracy

  • in O(log(1/ε)) iterations
  • w.r.t. ‖·‖_F, ‖·‖, and ‖·‖_{2,∞} (the last norm captures incoherence)

provided that step size η ≍ 1/σ_max(M) and sample size ≳ n r³ log³ n.

  • Byproduct: vanilla GD controls entrywise error; errors are spread out across all entries

SLIDE 82

Blind deconvolution

[Fig. credit: Romberg; EngineeringsALL]

Reconstruct two signals from their convolution; equivalently, find h, x ∈ Cⁿ s.t.

    bₖ* (h x*) aₖ = yₖ,   1 ≤ k ≤ m
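In code, the bilinear measurement model reads as follows (a small NumPy sketch; the helper name and the row-wise storage of the aₖ, bₖ are our own conventions):

    import numpy as np

    def bilinear_measurements(h, x, B, A):
        """Forward model y_k = b_k^* (h x^*) a_k, with b_k and a_k stored as the rows of B and A."""
        return (B.conj() @ h) * (A.conj() @ x).conj()   # (b_k^* h) * (x^* a_k), k = 1..m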

SLIDES 83-85

Prior art

    minimize_{x,h}   f(x, h) = Σₖ₌₁ᵐ | bₖ*( h x* − h♮x♮* ) aₖ |²

where aₖ i.i.d. ∼ N(0, I) and {bₖ}: partial Fourier basis

Existing theory on gradient descent requires

  • regularized loss + projection
    e.g. Li, Ling, Strohmer, Wei '16; Huang, Hand '17; Ling, Strohmer '17
  • about m iterations even with regularization

SLIDE 86

Theoretical guarantees

Theorem 3 (Blind deconvolution). Suppose h♮ is incoherent w.r.t. {bₖ}. Vanilla gradient descent (with spectral initialization) achieves ε-accuracy in O(log(1/ε)) iterations, provided that step size η ≍ 1 and sample size m ≳ n poly log(m).

  • Regularization-free
  • Converges in O(log(1/ε)) iterations (vs. O(m log(1/ε)) iterations in prior theory)

SLIDE 87

Incoherence region in high dimensions

[Figure: the incoherence region drawn in 2 dimensions vs. a high-dimensional mental representation]

  • incoherence region is vanishingly small

SLIDES 88-89

Complicated dependencies across iterations

  • Several prior approaches rely on sample splitting: they require fresh samples at each iteration, which is not what we actually run in practice
    [Figure: iterates z⁰ → z¹ → ... → z⁵, each update drawing fresh samples]
  • This work: reuses all samples in all iterations
    [Figure: iterates z⁰ → z¹ → ... → z⁵, all updates using the same samples]

SLIDES 90-91

Summary

  • Implicit regularization: vanilla gradient descent automatically forces iterates to stay incoherent
  • Enables error control in a much stronger sense (e.g. entrywise error control)

Paper: "Implicit regularization in nonconvex statistical estimation: Gradient descent converges linearly for phase retrieval, matrix completion, and blind deconvolution", Cong Ma, Kaizheng Wang, Yuejie Chi, Yuxin Chen, arXiv:1711.10467