slide-1
SLIDE 1

Noisy matrix completion: Understanding statistical guarantees for convex relaxation via nonconvex optimization

Cong Ma ORFE, Princeton University

slide-2
SLIDE 2

Yuling Yan (Princeton ORFE), Yuejie Chi (CMU ECE), Jianqing Fan (Princeton ORFE), Yuxin Chen (Princeton EE)

slide-3
SLIDE 3

Convex relaxation for low-rank structure

minimize_Z  ‖Z‖_*   subject to   noiseless data constraints

[figure: low-rank matrix; credit: Piet Mondrian]

(a semidefinite relaxation)

3/ 39

slide-4
SLIDE 4

Convex relaxation for low-rank structure

minimize_Z  ‖Z‖_*   subject to   noiseless data constraints

• matrix sensing (Recht, Fazel, Parrilo '07)
• phase retrieval (Candès, Strohmer, Voroninski '11; Candès, Li '12)
• matrix completion (Candès, Recht '08; Candès, Tao '08; Gross '09)
• robust PCA (Chandrasekaran et al. '09; Candès et al. '09)
• Hankel matrix completion (Fazel et al. '13; Chen, Chi '13; Cai et al. '15)
• blind deconvolution (Ahmed, Recht, Romberg '12; Ling, Strohmer '15)
• joint alignment / matching (Chen, Huang, Guibas '14)
• . . .

3/ 39

slide-5
SLIDE 5

Stability of convex relaxation against noise

minimize_Z  ‖Z‖_*   subject to   noisy data constraints

[figure: low-rank matrix; credit: Piet Mondrian]

(a semidefinite relaxation)

4/ 39

slide-6
SLIDE 6

Stability of convex relaxation against noise

minimize_Z  f(Z; noisy data) + λ‖Z‖_*      (empirical loss + nuclear-norm penalty)

[figure: low-rank matrix; credit: Piet Mondrian]

(a semidefinite relaxation)

4/ 39

slide-7
SLIDE 7

Stability of convex relaxation against noise

minimize_Z  f(Z; noisy data) + λ‖Z‖_*      (empirical loss + nuclear-norm penalty)

• matrix sensing (RIP measurements) (Candès, Plan '10)
• phase retrieval (Gaussian measurements) (Candès et al. '11)
? matrix completion (Candès, Plan '09; Negahban, Wainwright '10; Koltchinskii et al. '10)
? robust PCA (Zhou, Li, Wright, Candès, Ma '10)
? Hankel matrix completion (Chen, Chi '13)
? blind deconvolution (Ahmed, Recht, Romberg '12; Ling, Strohmer '15)
? joint alignment / matching . . .

4/ 39

slide-8
SLIDE 8

Stability of convex relaxation against noise

minimize_Z  f(Z; noisy data) + λ‖Z‖_*      (empirical loss + nuclear-norm penalty)

• matrix sensing (RIP measurements) (Candès, Plan '10)
• phase retrieval (Gaussian measurements) (Candès et al. '11)
? this talk: matrix completion (Candès, Plan '09; Negahban, Wainwright '10; Koltchinskii et al. '10)
? robust PCA (Zhou, Li, Wright, Candès, Ma '10)
? Hankel matrix completion (Chen, Chi '13)
? blind deconvolution (Ahmed, Recht, Romberg '12; Ling, Strohmer '15)
? joint alignment / matching . . .

4/ 39

slide-9
SLIDE 9

Low-rank matrix completion

           

[figure: a matrix with most entries missing ("?") and only a few revealed; credit: E. J. Candès]

Given partial samples of a low-rank matrix M ⋆, fill in missing entries

5/ 39

slide-10
SLIDE 10

Noisy low-rank matrix completion

• observations: M_{i,j} = M⋆_{i,j} + noise,  (i, j) ∈ Ω
• goal: estimate the unknown rank-r matrix M⋆ ∈ ℝ^{n×n}

           

[figure: the partially observed matrix, with observed entries indexed by the sampling set Ω]

6/ 39

slide-11
SLIDE 11

Noisy low-rank matrix completion

• observations: M_{i,j} = M⋆_{i,j} + noise,  (i, j) ∈ Ω
• goal: estimate M⋆

convex relaxation:

    minimize_{Z ∈ ℝ^{n×n}}  Σ_{(i,j)∈Ω} (Z_{i,j} − M_{i,j})² + λ‖Z‖_*      (squared loss + nuclear-norm penalty)

6/ 39
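The convex program above can be prototyped in a few lines. Below is a minimal sketch (mine, not from the talk) using cvxpy; the problem sizes, sampling rate p, noise level σ, and the choice λ = 5σ√(np) (borrowed from a later slide) are illustrative assumptions.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, r, p, sigma = 50, 3, 0.3, 0.01

# ground-truth rank-r matrix M*, Bernoulli(p) sampling mask, noisy observations
M_star = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))
mask = (rng.random((n, n)) < p).astype(float)
M_obs = M_star + sigma * rng.standard_normal((n, n))

# squared loss on observed entries + nuclear-norm penalty
lam = 5 * sigma * np.sqrt(n * p)            # lambda on the order of sigma * sqrt(n p)
Z = cp.Variable((n, n))
loss = cp.sum_squares(cp.multiply(mask, Z - M_obs))
prob = cp.Problem(cp.Minimize(loss + lam * cp.normNuc(Z)))
prob.solve()

rel_err = np.linalg.norm(Z.value - M_star, "fro") / np.linalg.norm(M_star, "fro")
print(f"relative Frobenius error: {rel_err:.3e}")
```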

slide-12
SLIDE 12

Prior statistical guarantees for convex relaxation

  • random sampling: each (i, j) ∈ Ω with prob. p
  • random noise: i.i.d. sub-Gaussian noise with variance σ2
  • true matrix M ⋆ ∈ Rn×n: rank r = O(1), incoherent, . . .

7/ 39

slide-13
SLIDE 13

• Candès, Plan '09: ‖M − M⋆‖_F ≲ σ n^{1.5}

[plot: estimation error ‖M − M⋆‖_F vs. noise standard deviation σ]

slide-14
SLIDE 14

• minimax limit: σ √(n/p)
• Candès, Plan '09: σ n^{1.5}

[plot: estimation error ‖M − M⋆‖_F vs. noise standard deviation σ; reference level ‖M⋆‖_∞]

slide-15
SLIDE 15

• minimax limit: σ √(n/p)
• Candès, Plan '09: σ n^{1.5}
• Negahban, Wainwright '10: max{σ, ‖M⋆‖_∞} √(n/p)

[plot: estimation error ‖M − M⋆‖_F vs. noise standard deviation σ]

slide-16
SLIDE 16

• minimax limit: σ √(n/p)
• Candès, Plan '09: σ n^{1.5}
• Negahban, Wainwright '10: max{σ, ‖M⋆‖_∞} √(n/p)
• Koltchinskii, Tsybakov, Lounici '10: max{σ, ‖M⋆‖_∞} √(n/p)

[plot: estimation error ‖M − M⋆‖_F vs. noise standard deviation σ]

slide-17
SLIDE 17

[plot: RMS error vs. matrix size n, comparing "recovery error using SDP" with 1.68 × oracle error = 1.68 · [(2nr − r²)/(pn²)]^{1/2}]

• empirically, convex relaxation ≈ 1.68 × oracle bound

Existing theory for convex relaxation does not match practice . . .

slide-18
SLIDE 18

[excerpt from Candès, Plan '09, around eq. (III.9): ". . . with adversarial noise. Consequently, our analysis loses a pn factor vis-à-vis an optimal bound that is achievable via the help of an oracle. The diligent reader may argue that the least-squares . . ."]

Existing theory for convex relaxation does not match practice . . .

slide-19
SLIDE 19

What are the roadblocks?

Strategy: M_cvx is the optimizer if there exists a dual certificate W s.t. (M_cvx, W) obeys the KKT optimality condition

10/ 39

slide-20
SLIDE 20

What are the roadblocks?

Strategy: M_cvx is the optimizer if there exists a dual certificate W s.t. (M_cvx, W) obeys the KKT optimality condition

• noiseless case: M_cvx ← M⋆ (exact recovery); W ← golfing scheme (David Gross)

10/ 39

slide-21
SLIDE 21

What are the roadblocks?

Strategy: M_cvx is the optimizer if there exists a dual certificate W s.t. (M_cvx, W) obeys the KKT optimality condition

• noiseless case: M_cvx ← M⋆ (exact recovery); W ← golfing scheme (David Gross)
• noisy case: M_cvx is very complicated; hard to construct W . . .

10/ 39

slide-22
SLIDE 22

dual certification (golfing scheme)

slide-23
SLIDE 23

nonconvex optimization dual certification (golfing scheme)

slide-24
SLIDE 24

A detour: nonconvex optimization

Burer–Monteiro: represent Z by XY⊤ with low-rank factors X, Y ∈ ℝ^{n×r}

12/ 39

slide-25
SLIDE 25

A detour: nonconvex optimization

Burer–Monteiro: represent Z by XY⊤ with low-rank factors X, Y ∈ ℝ^{n×r}

nonconvex approach:

    minimize_{X, Y ∈ ℝ^{n×r}}  f(X, Y) = Σ_{(i,j)∈Ω} ((XY⊤)_{i,j} − M_{i,j})² + reg(X, Y)      (squared loss + regularizer)

12/ 39

slide-26
SLIDE 26

A detour: nonconvex optimization

  • Burer, Monteiro ’03
  • Rennie, Srebro ’05
  • Keshavan, Montanari, Oh ’09 ’10
  • Jain, Netrapalli, Sanghavi ’12
  • Hardt ’13
  • Sun, Luo ’14
  • Chen, Wainwright ’15
  • Tu, Boczar, Simchowitz, Soltanolkotabi, Recht ’15
  • Zhao, Wang, Liu ’15
  • Zheng, Lafferty ’16
  • Yi, Park, Chen, Caramanis ’16
  • Ge, Lee, Ma ’16
  • Ge, Jin, Zheng ’17
  • Ma, Wang, Chi, Chen ’17
  • Chen, Li ’18
  • Chen, Liu, Li ’19
  • ...

13/ 39

slide-27
SLIDE 27

A detour: nonconvex optimization

minimize_{X, Y ∈ ℝ^{n×r}}  f(X, Y) = Σ_{(i,j)∈Ω} ((XY⊤)_{i,j} − M_{i,j})² + reg(X, Y)

• suitable initialization: (X⁰, Y⁰)
• gradient descent: for t = 0, 1, . . .
      Xᵗ⁺¹ = Xᵗ − ηₜ ∇_X f(Xᵗ, Yᵗ)
      Yᵗ⁺¹ = Yᵗ − ηₜ ∇_Y f(Xᵗ, Yᵗ)

14/ 39
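A minimal sketch of this two-stage recipe (spectral initialization followed by vanilla gradient descent, with reg ≡ 0); the step size η, iteration count T, and problem sizes are my own illustrative choices, not values from the talk.

```python
import numpy as np

def grad_descent_mc(M_obs, mask, r, eta=0.005, T=1000):
    """Vanilla GD on f(X, Y) = sum over observed (i,j) of ((X Y^T)_ij - M_ij)^2."""
    p = mask.mean()                                  # estimated sampling rate
    # spectral initialization: top-r SVD of the rescaled zero-filled observations
    U, s, Vt = np.linalg.svd(mask * M_obs / p, full_matrices=False)
    X = U[:, :r] * np.sqrt(s[:r])
    Y = Vt[:r, :].T * np.sqrt(s[:r])
    for _ in range(T):
        R = mask * (X @ Y.T - M_obs)                 # residual on observed entries
        gX, gY = 2 * R @ Y, 2 * R.T @ X              # gradients of the squared loss
        X, Y = X - eta * gX, Y - eta * gY
    return X, Y

# usage on synthetic data
rng = np.random.default_rng(1)
n, r, p, sigma = 200, 3, 0.2, 0.01
M_star = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))
mask = (rng.random((n, n)) < p).astype(float)
M_obs = M_star + sigma * rng.standard_normal((n, n))
X, Y = grad_descent_mc(M_obs, mask, r)
print(np.linalg.norm(X @ Y.T - M_star, "fro") / np.linalg.norm(M_star, "fro"))
```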

slide-28
SLIDE 28

A detour: nonconvex optimization

  • random sampling: each (i, j) ∈ Ω with prob. p
  • random noise: i.i.d. sub-Gaussian noise with variance σ2
  • true matrix M ⋆ ∈ Rn×n: r = O(1), incoherent, . . .

15/ 39

slide-29
SLIDE 29

• minimax limit: σ √(n/p)
• nonconvex algorithms: σ √(n/p) (optimal!)

[plot: estimation error ‖M − M⋆‖_F vs. noise standard deviation σ]

slide-30
SLIDE 30

[timeline figure, 2008 to 2019: statistical guarantees for the convex program (Candès, Recht '08; Candès, Plan '09; Negahban, Wainwright '10; Koltchinskii et al. '10; Gross '09)]

slide-31
SLIDE 31

[timeline figure, 2008 to 2019: convex relaxation (Candès, Recht '08; Candès, Plan '09; Negahban, Wainwright '10; Koltchinskii, Tsybakov, Lounici '10; Gross '09) and nonconvex optimization (Keshavan, Montanari, Oh '09; Sun, Luo '15; Chen, Wainwright '15; Zheng, Lafferty '16; Ma, Wang, Chi, Chen '17; Chen, Liu, Li '19)]

(X, Y) is a critical point of nonconvex optimization


slide-33
SLIDE 33

An interesting experiment

convex:     minimize_{Z ∈ ℝ^{n×n}}   Σ_{(i,j)∈Ω} (Z_{i,j} − M_{i,j})² + λ‖Z‖_*

nonconvex:  minimize_{X, Y ∈ ℝ^{n×r}}   Σ_{(i,j)∈Ω} ((XY⊤)_{i,j} − M_{i,j})² + (λ/2)‖X‖_F² + (λ/2)‖Y‖_F²      (the last two terms are reg(X, Y))

recall: ‖Z‖_* = min_{Z = XY⊤} ½‖X‖_F² + ½‖Y‖_F²

18/ 39
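A quick numerical sanity check (illustrative, not from the talk) of the variational identity above: the minimum is attained at the balanced factorization X = U√Σ, Y = V√Σ from the SVD Z = UΣV⊤.

```python
import numpy as np

rng = np.random.default_rng(2)
Z = rng.standard_normal((6, 4)) @ rng.standard_normal((4, 6))   # a rank-4 matrix
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
X = U * np.sqrt(s)            # balanced factors: columns scaled by sqrt of singular values
Y = Vt.T * np.sqrt(s)

print(np.allclose(X @ Y.T, Z))                                   # the factorization is exact
print(np.linalg.norm(Z, "nuc"))                                  # nuclear norm of Z
print(0.5 * np.linalg.norm(X, "fro")**2 + 0.5 * np.linalg.norm(Y, "fro")**2)  # matches
```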

slide-34
SLIDE 34

A motivating experiment

n = 1000, r = 5, p = 0.2, λ = 5σ√np

[log-log plot vs. noise standard deviation σ: estimation error of convex, estimation error of nonconvex, and distance between the two solutions]

19/ 39

slide-35
SLIDE 35

A motivating experiment

n = 1000, r = 5, p = 0.2, λ = 5σ√np

[log-log plot vs. noise standard deviation σ: estimation error of convex, estimation error of nonconvex, and distance between the two solutions]

Convex and nonconvex solutions are exceedingly close!

19/ 39

slide-36
SLIDE 36

[diagram: optimizer stability, relating the convex solution and nonconvex optimization]

slide-37
SLIDE 37

Main results: r = O(1)

• random sampling: each (i, j) ∈ Ω with prob. p ≳ log³n / n
• random noise: i.i.d. sub-Gaussian noise with variance σ²
• true matrix M⋆ ∈ ℝ^{n×n}: r = O(1), incoherent, well-conditioned

21/ 39

slide-38
SLIDE 38

Main results: r = O(1)

• random sampling: each (i, j) ∈ Ω with prob. p ≳ log³n / n
• random noise: i.i.d. sub-Gaussian noise with variance σ²
• true matrix M⋆ ∈ ℝ^{n×n}: r = O(1), incoherent, well-conditioned

minimize_{Z ∈ ℝ^{n×n}}  Σ_{(i,j)∈Ω} (Z_{i,j} − M_{i,j})² + λ‖Z‖_*      (λ ≍ σ√(np))

21/ 39

slide-39
SLIDE 39

Main results: r = O(1)

• random sampling: each (i, j) ∈ Ω with prob. p ≳ log³n / n
• random noise: i.i.d. sub-Gaussian noise with variance σ²
• true matrix M⋆ ∈ ℝ^{n×n}: r = O(1), incoherent, well-conditioned

minimize_{Z ∈ ℝ^{n×n}}  Σ_{(i,j)∈Ω} (Z_{i,j} − M_{i,j})² + λ‖Z‖_*      (λ ≍ σ√(np))

Theorem 1 (Chen, Chi, Fan, Ma, Yan '19). With high prob., any minimizer M_cvx of the convex program obeys
1. M_cvx is nearly rank-r:  ‖M_cvx − proj_r(M_cvx)‖_F ≪ (1/n⁵) · σ√(n/p)
21/ 39

slide-40
SLIDE 40

Main results: r = O(1)

• random sampling: each (i, j) ∈ Ω with prob. p ≳ log³n / n
• random noise: i.i.d. sub-Gaussian noise with variance σ²
• true matrix M⋆ ∈ ℝ^{n×n}: r = O(1), incoherent, well-conditioned

minimize_{Z ∈ ℝ^{n×n}}  Σ_{(i,j)∈Ω} (Z_{i,j} − M_{i,j})² + λ‖Z‖_*      (λ ≍ σ√(np))

Theorem 1 (Chen, Chi, Fan, Ma, Yan '19). With high prob., any minimizer M_cvx of the convex program obeys
1. M_cvx is nearly rank-r
2. ‖M_cvx − M⋆‖_F ≲ σ√(n/p)

21/ 39

slide-41
SLIDE 41

Main results: r = O(1)

• random sampling: each (i, j) ∈ Ω with prob. p ≳ log³n / n
• random noise: i.i.d. sub-Gaussian noise with variance σ²
• true matrix M⋆ ∈ ℝ^{n×n}: r = O(1), incoherent, well-conditioned

minimize_{Z ∈ ℝ^{n×n}}  Σ_{(i,j)∈Ω} (Z_{i,j} − M_{i,j})² + λ‖Z‖_*      (λ ≍ σ√(np))

Theorem 1 (Chen, Chi, Fan, Ma, Yan '19). With high prob., any minimizer M_cvx of the convex program obeys
1. M_cvx is nearly rank-r
2. ‖M_cvx − M⋆‖_F ≲ σ√(n/p)
3. ‖M_cvx − M⋆‖_∞ ≲ σ√((n log n)/p) · (1/n)

21/ 39

slide-42
SLIDE 42
‖M_cvx − M⋆‖_F ≲ σ√(n/p)      (Chen, Chi, Fan, Ma, Yan '19)

[plot: estimation error ‖M − M⋆‖_F vs. noise standard deviation σ, matching the minimax limit]

• minimax optimal estimation error

22/ 39

slide-43
SLIDE 43
‖M_cvx − M⋆‖_F ≲ σ√(n/p),    ‖M_cvx − M⋆‖_∞ ≲ σ√((n log n)/p) · (1/n)      (Chen, Chi, Fan, Ma, Yan '19)

[plot: estimation error ‖M − M⋆‖_F vs. noise standard deviation σ, matching the minimax limit]

• minimax optimal estimation error
• estimation errors are spread out across all entries

22/ 39

slide-44
SLIDE 44

Implicit regularization

No need to enforce the spikiness constraint used by Negahban & Wainwright:

    minimize_{Z: ‖Z‖_∞ ≤ α}  Σ_{(i,j)∈Ω} (Z_{i,j} − M_{i,j})² + λ‖Z‖_*

• convex programming automatically controls the spikiness of its solutions

23/ 39

slide-45
SLIDE 45

Statistical guarantees for iterative algorithms

minimize_Z  g(Z) = Σ_{(i,j)∈Ω} (Z_{i,j} − M_{i,j})² + λ‖Z‖_*

Many algorithms (e.g. SVT, SOFT-IMPUTE, FPC, FISTA) have been proposed to minimize g(Z), typically without statistical guarantees

24/ 39

slide-46
SLIDE 46

Statistical guarantees for iterative algorithms

minimize_Z  g(Z) = Σ_{(i,j)∈Ω} (Z_{i,j} − M_{i,j})² + λ‖Z‖_*

Many algorithms (e.g. SVT, SOFT-IMPUTE, FPC, FISTA) have been proposed to minimize g(Z), typically without statistical guarantees.

We provide statistical guarantees for any Z with g(Z) ≤ g(Z_opt) + ε, for some sufficiently small ε > 0

24/ 39
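A small helper (illustrative sketch, names are my own) for evaluating the condition above on any candidate Z returned by an iterative solver:

```python
import numpy as np

def g(Z, M_obs, mask, lam):
    """Objective of the convex program: squared loss on observed entries + nuclear-norm penalty."""
    return np.sum((mask * (Z - M_obs))**2) + lam * np.linalg.norm(Z, "nuc")

# A candidate Z (e.g. the output of SVT or SOFT-IMPUTE) is covered by the guarantee
# whenever g(Z, ...) <= g(Z_opt, ...) + eps for a sufficiently small eps > 0.
```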

slide-47
SLIDE 47

Main results: general case

• random sampling: each (i, j) ∈ Ω with prob. p ≳ r² log³n / n
• random noise: i.i.d. sub-Gaussian noise with variance σ²
• true matrix M⋆ ∈ ℝ^{n×n}: incoherent, well-conditioned

25/ 39

slide-48
SLIDE 48

Main results: general case

• random sampling: each (i, j) ∈ Ω with prob. p ≳ r² log³n / n
• random noise: i.i.d. sub-Gaussian noise with variance σ²
• true matrix M⋆ ∈ ℝ^{n×n}: incoherent, well-conditioned

Theorem 2 (Chen, Chi, Fan, Ma, Yan '19). With high prob., any minimizer M_cvx of the convex program obeys
1. M_cvx is nearly rank-r
2. ‖M_cvx − M⋆‖_F ≲ (σ/σ_min(M⋆)) √(n/p) · ‖M⋆‖_F
   ‖M_cvx − M⋆‖_∞ ≲ √r · (σ/σ_min(M⋆)) √((n log n)/p) · ‖M⋆‖_∞
   ‖M_cvx − M⋆‖ ≲ (σ/σ_min(M⋆)) √(n/p) · ‖M⋆‖

25/ 39

slide-49
SLIDE 49

Main results: general case

• random sampling: each (i, j) ∈ Ω with prob. p ≳ r² log³n / n
• random noise: i.i.d. sub-Gaussian noise with variance σ²
• true matrix M⋆ ∈ ℝ^{n×n}: incoherent, well-conditioned

The sample complexity bound O(nr² log³ n) is suboptimal in r!

25/ 39

slide-50
SLIDE 50

A little analysis: connection between convex and nonconvex solutions

slide-51
SLIDE 51

Link between convex and nonconvex optimizers

(X, Y ) is nonconvex optimizer

27/ 39

slide-52
SLIDE 52

Link between convex and nonconvex optimizers

(X, Y) is nonconvex optimizer   =?⇒   XY⊤ is the convex solution

27/ 39

slide-53
SLIDE 53

Link between convex and nonconvex optimizers

Suppose that
• λ is properly chosen
• (X, Y) is close to the truth (in the ℓ_{2,∞} sense)

Then: (X, Y) is nonconvex optimizer   ⇒   XY⊤ is the convex solution,

i.e. dist(convex solution, nonconvex solution) = 0

27/ 39

slide-54
SLIDE 54

Approximate nonconvex optimizers

[diagram: a sequence of nonconvex iterates (X⁰, Y⁰), (X¹, Y¹), . . . and an optimizer (X, Y)]

Issue: we do NOT know the properties of nonconvex optimizers
• it is unclear whether nonconvex algorithms converge to the optimizers (due to lack of strong convexity)

28/ 39

slide-55
SLIDE 55

Approximate nonconvex optimizers

Strategy: resort to "approximate stationary points" (∇f(X, Y) ≈ 0) instead

29/ 39

slide-56
SLIDE 56

Approximate nonconvex optimizers

Strategy: resort to "approximate stationary points" (∇f(X, Y) ≈ 0) instead

Suppose that
• λ is properly chosen
• (X, Y) is close to the truth (in the ℓ_{2,∞} sense)

Then: ∇f(X, Y) ≈ 0   ⇒   dist(XY⊤, convex solutions) ≈ 0

29/ 39

slide-57
SLIDE 57

Construct approximate nonconvex optimizers via GD

starting from (X⁰, Y⁰) = truth or spectral initialization:

    Xᵗ⁺¹ = Xᵗ − η ∇_X f(Xᵗ, Yᵗ)
    Yᵗ⁺¹ = Yᵗ − η ∇_Y f(Xᵗ, Yᵗ),      t = 0, 1, · · · , T

30/ 39

slide-58
SLIDE 58

Construct approximate nonconvex optimizers via GD

starting from (X⁰, Y⁰) = truth or spectral initialization:

    Xᵗ⁺¹ = Xᵗ − η ∇_X f(Xᵗ, Yᵗ)
    Yᵗ⁺¹ = Yᵗ − η ∇_Y f(Xᵗ, Yᵗ),      t = 0, 1, · · · , T

• when T is large: there exists an iterate with very small gradient, ‖∇f(X, Y)‖_F ≲ 1/(ηT)

30/ 39

slide-59
SLIDE 59

Construct approximate nonconvex optimizers via GD

starting from (X⁰, Y⁰) = truth or spectral initialization:

    Xᵗ⁺¹ = Xᵗ − η ∇_X f(Xᵗ, Yᵗ)
    Yᵗ⁺¹ = Yᵗ − η ∇_Y f(Xᵗ, Yᵗ),      t = 0, 1, · · · , T

• when T is large: there exists an iterate with very small gradient, ‖∇f(X, Y)‖_F ≲ 1/(ηT)
• hopefully not far from (X⋆, Y⋆) (in the ℓ_{2,∞} sense in particular)

30/ 39
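A minimal sketch (illustrative; the step size and iteration count are my own) of how one would extract such an approximate stationary point in practice: run vanilla GD and keep the iterate whose gradient has the smallest Frobenius norm.

```python
import numpy as np

def smallest_gradient_iterate(M_obs, mask, X0, Y0, eta=0.005, T=1000):
    """Run T GD steps on the factorized squared loss; return the iterate with min ||grad f||_F."""
    X, Y = X0.copy(), Y0.copy()
    best, best_norm = (X.copy(), Y.copy()), np.inf
    for _ in range(T):
        R = mask * (X @ Y.T - M_obs)                     # residual on observed entries
        gX, gY = 2 * R @ Y, 2 * R.T @ X
        g_norm = np.sqrt(np.sum(gX**2) + np.sum(gY**2))  # ||grad f(X, Y)||_F
        if g_norm < best_norm:
            best, best_norm = (X.copy(), Y.copy()), g_norm
        X, Y = X - eta * gX, Y - eta * gY
    return best, best_norm                               # best_norm is roughly O(1/(eta*T))
```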

slide-60
SLIDE 60

Gradient descent for nonconvex matrix completion

slide-61
SLIDE 61

Gradient descent for nonconvex matrix completion

Xᵗ⁺¹ = Xᵗ − η ∇_X f(Xᵗ, Yᵗ),    Yᵗ⁺¹ = Yᵗ − η ∇_Y f(Xᵗ, Yᵗ)

Prior works analyze regularized GD

  • not guaranteed to return small-gradient solutions
  • no ℓ2,∞ error control

— Keshavan et al. ’09, Sun, Luo ’15, Chen, Wainwright ’15, Zheng, Lafferty ’16

32/ 39

slide-62
SLIDE 62

Gradient descent for nonconvex matrix completion

Xᵗ⁺¹ = Xᵗ − η ∇_X f(Xᵗ, Yᵗ),    Yᵗ⁺¹ = Yᵗ − η ∇_Y f(Xᵗ, Yᵗ)

Our work and Chen et al. analyze vanilla GD

  • regularization-free
  • optimal ℓ2,∞ error control

— Ma, Wang, Chi, Chen ’17, Chen, Liu, Li ’19

32/ 39

slide-63
SLIDE 63

Gradient descent theory revisited

Two standard conditions that enable geometric convergence of GD

33/ 39

slide-64
SLIDE 64

Gradient descent theory revisited

Two standard conditions that enable geometric convergence of GD

  • (local) restricted strong convexity

33/ 39

slide-65
SLIDE 65

Gradient descent theory revisited

Two standard conditions that enable geometric convergence of GD

  • (local) restricted strong convexity
  • (local) smoothness

33/ 39

slide-66
SLIDE 66

Gradient descent theory revisited

f is said to be α-strongly convex and β-smooth if

    0 ≺ αI ⪯ ∇²f(X) ⪯ βI,    ∀X

ℓ₂ error contraction: GD with η = 1/β obeys

    ‖Xᵗ⁺¹ − X⋆‖_F ≤ (1 − α/β) ‖Xᵗ − X⋆‖_F

34/ 39
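A tiny numerical illustration (mine, not from the slides) of this contraction on a strongly convex quadratic f(x) = ½ xᵀAx, whose Hessian A satisfies αI ⪯ A ⪯ βI; each GD step with η = 1/β shrinks the distance to the minimizer by at least a factor 1 − α/β.

```python
import numpy as np

rng = np.random.default_rng(3)
Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))
eigs = np.array([1.0, 2.0, 3.0, 4.0, 10.0])       # alpha = 1, beta = 10
A = Q @ np.diag(eigs) @ Q.T                        # Hessian of f(x) = 0.5 x^T A x
alpha, beta = eigs.min(), eigs.max()

x = rng.standard_normal(5)                         # minimizer is x_star = 0
eta = 1.0 / beta
for _ in range(5):
    prev = np.linalg.norm(x)
    x = x - eta * (A @ x)                          # gradient step: grad f(x) = A x
    print(f"ratio {np.linalg.norm(x) / prev:.3f} <= 1 - alpha/beta = {1 - alpha/beta:.3f}")
```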

slide-67
SLIDE 67

Incoherence region

Which region enjoys both restricted strong convexity and smoothness?

35/ 39

slide-68
SLIDE 68

Incoherence region

Which region enjoys both restricted strong convexity and smoothness?

[diagram: a ball around X⋆]

• X is not far away from X⋆

35/ 39

slide-69
SLIDE 69

Incoherence region

Which region enjoys both restricted strong convexity and smoothness?

[diagram: a ball around X⋆ intersected with incoherence constraints along e₁]

    ‖e₁⊤(X − X⋆)‖₂ ≤ ‖X⋆‖_{2,∞}

• X is not far away from X⋆
• X is incoherent w.r.t. standard basis vectors (incoherence region)

35/ 39

slide-70
SLIDE 70

Incoherence region

Which region enjoys both restricted strong convexity and smoothness?

[diagram: a ball around X⋆ intersected with incoherence constraints along e₁ and e₂]

    ‖e₁⊤(X − X⋆)‖₂ ≤ ‖X⋆‖_{2,∞},    ‖e₂⊤(X − X⋆)‖₂ ≤ ‖X⋆‖_{2,∞}

• X is not far away from X⋆
• X is incoherent w.r.t. standard basis vectors (incoherence region)

35/ 39

slide-71
SLIDE 71

Inadequacy of generic gradient descent theory

[diagram: region of local strong convexity + smoothness vs. the incoherence region]

• Generic optimization theory does NOT ensure GD stays in the incoherence region

36/ 39


slide-74
SLIDE 74

Inadequacy of generic gradient descent theory

[diagram: region of local strong convexity + smoothness vs. the incoherence region]

• Generic optimization theory does NOT ensure GD stays in the incoherence region
• Demonstrating incoherence calls for new analysis tools

36/ 39

slide-75
SLIDE 75

Key proof idea: leave-one-out analysis

For each 1 ≤ l ≤ n, introduce leave-one-out iterates X^{t,(l)} obtained by replacing the l-th row and column of the data with their true values

[diagram: the leave-one-out data matrix M^{(l)} and its iterates X^{t,(l)}]

37/ 39
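A minimal sketch (my own illustration of the construction described above) of the leave-one-out data: it agrees with the observed data off the l-th row/column and uses the noiseless truth there, so the resulting iterates are independent of the randomness in that row/column.

```python
import numpy as np

def leave_one_out_data(M_obs, mask, M_star, l):
    """Build (M^(l), mask^(l)): row/column l replaced by true, fully observed values."""
    M_l, mask_l = M_obs.copy(), mask.copy()
    M_l[l, :], M_l[:, l] = M_star[l, :], M_star[:, l]    # plug in the truth on row/col l
    mask_l[l, :], mask_l[:, l] = 1.0, 1.0                # treat row/col l as fully observed
    return M_l, mask_l

# Running the same GD routine on (M^(l), mask^(l)) produces the auxiliary iterates X^{t,(l)}
# used in the analysis; they depend on the data only outside row/column l.
```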

slide-76
SLIDE 76

Key proof idea: leave-one-out analysis

[diagram: incoherence region w.r.t. e_l, containing the leave-one-out iterates {X^{t,(l)}}]

• leave-one-out iterates {X^{t,(l)}} contain more information about the l-th row of the truth; independent of the randomness in the l-th row

38/ 39

slide-77
SLIDE 77

Key proof idea: leave-one-out analysis

[diagram: incoherence region w.r.t. e_l, containing both the true iterates {Xᵗ} and the leave-one-out iterates {X^{t,(l)}}]

• leave-one-out iterates {X^{t,(l)}} contain more information about the l-th row of the truth; independent of the randomness in the l-th row
• leave-one-out iterates {X^{t,(l)}} ≈ true iterates {Xᵗ}

38/ 39

slide-78
SLIDE 78

[timeline figure, 2008 to 2019 (recap): convex relaxation (Candès, Recht '08; Candès, Plan '09; Negahban, Wainwright '10; Koltchinskii, Tsybakov, Lounici '10; Gross '09; Chen, Chi, Fan, Ma, Yan '19) and nonconvex optimization (Keshavan, Montanari, Oh '09; Sun, Luo '15; Chen, Wainwright '15; Zheng, Lafferty '16; Ma, Wang, Chi, Chen '17; Chen, Liu, Li '19); "(X, Y) is critical point of nonconvex optimization"]

“Noisy matrix completion: understanding statistical guarantees for convex relaxation via nonconvex optimization”, Y. Chen, Y. Chi, J. Fan, C. Ma, Y. Yan, 2019