Noisy matrix completion: Understanding statistical guarantees for convex relaxation via nonconvex optimization

Cong Ma (ORFE, Princeton University)
Yuejie Chi (ECE, CMU) · Jianqing Fan (ORFE, Princeton) · Yuxin Chen (EE, Princeton) · Yuling Yan (ORFE, Princeton)
Convex relaxation for low-rank structure

minimize_Z ‖Z‖_* subject to (noiseless) data constraints

(semidefinite relaxation for recovering a low-rank matrix; figure credit: Piet Mondrian)

- matrix sensing (Recht, Fazel, Parrilo ’07)
- phase retrieval (Candès, Strohmer, Voroninski ’11; Candès, Li ’12)
- matrix completion (Candès, Recht ’08; Candès, Tao ’08; Gross ’09)
- robust PCA (Chandrasekaran et al. ’09; Candès et al. ’09)
- Hankel matrix completion (Fazel et al. ’13; Chen, Chi ’13; Cai et al. ’15)
- blind deconvolution (Ahmed, Recht, Romberg ’12; Ling, Strohmer ’15)
- joint alignment / matching (Chen, Huang, Guibas ’14)
- . . .

3/ 39
Stability of convex relaxation against noise

constrained form: minimize_Z ‖Z‖_* subject to noisy data constraints
regularized form: minimize_Z f(Z; noisy data) + λ‖Z‖_*    (empirical loss + regularizer)

(semidefinite relaxation for recovering a low-rank matrix; figure credit: Piet Mondrian)

- matrix sensing (RIP measurements) (Candès, Plan ’10)
- phase retrieval (Gaussian measurements) (Candès et al. ’11)
- ? this talk: matrix completion (Candès, Plan ’09; Negahban, Wainwright ’10; Koltchinskii et al. ’10)
- ? robust PCA (Zhou, Li, Wright, Candès, Ma ’10)
- ? Hankel matrix completion (Chen, Chi ’13)
- ? blind deconvolution (Ahmed, Recht, Romberg ’12; Ling, Strohmer ’15)
- ? joint alignment / matching
- . . .

(? marks settings where stability against noise was not yet well understood)

4/ 39
Low-rank matrix completion

[figure: matrix with most entries missing, shown as “?”; figure credit: E. J. Candès]

Given partial samples of a low-rank matrix M⋆, fill in the missing entries

5/ 39
Noisy low-rank matrix completion

observations: M_{i,j} = M⋆_{i,j} + noise, (i, j) ∈ Ω
goal: estimate the unknown rank-r matrix M⋆ ∈ ℝ^{n×n}

[figure: partially observed matrix; sampling set Ω]
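To make the observation model concrete, here is a minimal NumPy sketch of a synthetic instance (the function name is ours, and Gaussian factors/noise are just one convenient choice; the talk only assumes an incoherent rank-r M⋆ and sub-Gaussian noise):

```python
import numpy as np

def make_noisy_completion_data(n=1000, r=5, p=0.2, sigma=1e-4, seed=0):
    """Rank-r truth, Bernoulli(p) sampling set, i.i.d. Gaussian noise."""
    rng = np.random.default_rng(seed)
    X_star = rng.standard_normal((n, r))
    Y_star = rng.standard_normal((n, r))
    M_star = X_star @ Y_star.T                   # unknown rank-r matrix M*
    mask = rng.random((n, n)) < p                # sampling set Omega
    noise = sigma * rng.standard_normal((n, n))
    M_obs = np.where(mask, M_star + noise, 0.0)  # M_ij = M*_ij + noise on Omega
    return M_star, M_obs, mask
```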
convex relaxation:

minimize_{Z∈ℝ^{n×n}} Σ_{(i,j)∈Ω} (Z_{i,j} − M_{i,j})² + λ‖Z‖_*    (squared loss + nuclear-norm regularization)

6/ 39
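A minimal sketch of this convex program, assuming cvxpy (our choice of tool; a generic conic solver only scales to modest n, which is one reason the first-order algorithms on slide 24 matter in practice):

```python
import cvxpy as cp

def solve_convex(M_obs, mask, lam):
    """Nuclear-norm-regularized least squares over the observed entries."""
    n = M_obs.shape[0]
    Z = cp.Variable((n, n))
    loss = cp.sum_squares(cp.multiply(mask.astype(float), Z - M_obs))
    cp.Problem(cp.Minimize(loss + lam * cp.norm(Z, "nuc"))).solve()
    return Z.value
```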
Prior statistical guarantees for convex relaxation

- random sampling: each (i, j) ∈ Ω with prob. p
- random noise: i.i.d. sub-Gaussian noise with variance σ²
- true matrix M⋆ ∈ ℝ^{n×n}: rank r = O(1), incoherent, . . .

7/ 39
Prior bounds on the estimation error ‖M̂ − M⋆‖_F:

- Candès, Plan ’09: σ n^{1.5}
- Negahban, Wainwright ’10: max{σ, ‖M⋆‖_∞} · √(n/p)
- Koltchinskii, Tsybakov, Lounici ’10: max{σ, ‖M⋆‖_∞} · √(n/p)
- minimax limit: σ √(n/p)

[figure: estimation error ‖M̂ − M⋆‖_F vs. noise standard deviation σ, comparing the three bounds above with the minimax limit]

8/ 39

Yet in numerical experiments (Candès, Plan ’09), convex relaxation is nearly oracle-optimal:

[figure: RMS error vs. n (100 to 1000); the recovery error using SDP tracks 1.68 × (oracle error) = 1.68 · [(2nr − r²)/(pn²)]^{1/2}]

convex relaxation ≈ 1.68 × oracle bound

From Candès, Plan ’09 (around (III.9)): “. . . with adversarial noise. Consequently, our analysis loses a pn factor vis-à-vis an optimal bound that is achievable via the help of an oracle. The diligent reader may argue that the least-squares . . .”

Existing theory for convex relaxation does not match practice . . .

9/ 39
What are the roadblocks?

Strategy: M_cvx is the optimizer if there exists a dual certificate W s.t. (M_cvx, W) obeys the KKT optimality condition

[photo: David Gross]

- noiseless case: M_cvx ← M⋆ (exact recovery); W ← golfing scheme
- noisy case: M_cvx is very complicated; hard to construct W . . .

10/ 39
[transition slide: from dual certification (golfing scheme) to nonconvex optimization]

11/ 39
A detour: nonconvex optimization

Burer–Monteiro: represent Z by XY⊤ with X, Y ∈ ℝ^{n×r} (low-rank factors)

nonconvex approach:

minimize_{X,Y∈ℝ^{n×r}} f(X, Y) = Σ_{(i,j)∈Ω} ((XY⊤)_{i,j} − M_{i,j})² + reg(X, Y)    (squared loss + regularization)

12/ 39
A detour: nonconvex optimization
- Burer, Monteiro ’03
- Rennie, Srebro ’05
- Keshavan, Montanari, Oh ’09 ’10
- Jain, Netrapalli, Sanghavi ’12
- Hardt ’13
- Sun, Luo ’14
- Chen, Wainwright ’15
- Tu, Boczar, Simchowitz, Soltanolkotabi, Recht ’15
- Zhao, Wang, Liu ’15
- Zheng, Lafferty ’16
- Yi, Park, Chen, Caramanis ’16
- Ge, Lee, Ma ’16
- Ge, Jin, Zheng ’17
- Ma, Wang, Chi, Chen ’17
- Chen, Li ’18
- Chen, Liu, Li ’19
- ...
13/ 39
A detour: nonconvex optimization

minimize_{X,Y∈ℝ^{n×r}} f(X, Y) = Σ_{(i,j)∈Ω} ((XY⊤)_{i,j} − M_{i,j})² + reg(X, Y)

- suitable initialization: (X^0, Y^0)
- gradient descent: for t = 0, 1, . . .
  X^{t+1} = X^t − η_t ∇_X f(X^t, Y^t)
  Y^{t+1} = Y^t − η_t ∇_Y f(X^t, Y^t)

14/ 39
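A minimal NumPy sketch of this scheme (our own: it uses a spectral initialization, a rough heuristic step size rather than a tuned choice, and, following the vanilla-GD discussion on slide 32, omits reg(X, Y)):

```python
import numpy as np

def nonconvex_gd(M_obs, mask, r, eta=0.2, T=500):
    """Vanilla gradient descent on the factored squared loss over Omega."""
    p = mask.mean()
    # spectral initialization: best rank-r factors of the rescaled observations
    U, s, Vt = np.linalg.svd(M_obs / p)
    X = U[:, :r] * np.sqrt(s[:r])
    Y = Vt[:r].T * np.sqrt(s[:r])
    step = eta / (p * s[0])           # heuristic step size
    for _ in range(T):
        R = mask * (X @ Y.T - M_obs)  # residual on the observed entries only
        X, Y = X - step * 2 * R @ Y, Y - step * 2 * R.T @ X
    return X, Y
```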
A detour: nonconvex optimization
- random sampling: each (i, j) ∈ Ω with prob. p
- random noise: i.i.d. sub-Gaussian noise with variance σ2
- true matrix M ⋆ ∈ Rn×n: r = O(1), incoherent, . . .
15/ 39
- minimax limit: σ √(n/p)
- nonconvex algorithms: σ √(n/p) (optimal!)

[figure: estimation error ‖M̂ − M⋆‖_F vs. noise standard deviation σ; nonconvex algorithms attain the minimax limit, unlike the prior convex bounds (Candès, Plan ’09; Negahban, Wainwright ’10; Koltchinskii et al. ’10)]

16/ 39
[timeline figure, 2008 → 2019: theory for convex relaxation (Candès, Recht ’08; Gross ’09; Candès, Plan ’09; Negahban, Wainwright ’10; Koltchinskii, Tsybakov, Lounici ’10) alongside theory for nonconvex optimization (Keshavan, Montanari, Oh ’09; Sun, Luo ’15; Chen, Wainwright ’15; Zheng, Lafferty ’16; Ma, Wang, Chi, Chen ’17; Chen, Liu, Li ’19)]

(X, Y) is a critical point of the nonconvex optimization

17/ 39
An interesting experiment

convex: minimize_{Z∈ℝ^{n×n}} Σ_{(i,j)∈Ω} (Z_{i,j} − M_{i,j})² + λ‖Z‖_*

nonconvex: minimize_{X,Y∈ℝ^{n×r}} Σ_{(i,j)∈Ω} ((XY⊤)_{i,j} − M_{i,j})² + (λ/2)‖X‖²_F + (λ/2)‖Y‖²_F    (the last two terms are reg(X, Y))

— recall that ‖Z‖_* = min_{Z=XY⊤} { (1/2)‖X‖²_F + (1/2)‖Y‖²_F }

18/ 39
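The variational identity above is easy to check numerically; a small sketch of ours, using the balanced factorization X = UΣ^{1/2}, Y = VΣ^{1/2}, which attains the minimum:

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.standard_normal((50, 8)) @ rng.standard_normal((8, 50))  # rank-8 matrix
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
X = U * np.sqrt(s)      # U Sigma^{1/2}
Y = Vt.T * np.sqrt(s)   # V Sigma^{1/2}
assert np.allclose(X @ Y.T, Z)
# both print the nuclear norm ||Z||_* = sum of singular values
print(s.sum(), 0.5 * (X ** 2).sum() + 0.5 * (Y ** 2).sum())
```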
A motivating experiment

n = 1000, r = 5, p = 0.2, λ = 5σ√(np)

[figure: estimation error of the convex and nonconvex solutions, and the distance between the two solutions, vs. noise standard deviation σ (log-log axes, σ ranging over 10⁻⁶ to 10⁻³)]

Convex and nonconvex solutions are exceedingly close!

19/ 39
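A sketch of this comparison, reusing the hypothetical helpers defined earlier (we shrink n so the generic convex solver stays tractable; the slide's experiment uses n = 1000):

```python
import numpy as np

n, r, p, sigma = 200, 5, 0.2, 1e-4   # slide uses n = 1000; shrunk for the SDP
M_star, M_obs, mask = make_noisy_completion_data(n, r, p, sigma)
lam = 5 * sigma * np.sqrt(n * p)     # lambda = 5*sigma*sqrt(np), as on the slide
Z_cvx = solve_convex(M_obs, mask, lam)
X, Y = nonconvex_gd(M_obs, mask, r)
print("convex error:   ", np.linalg.norm(Z_cvx - M_star, "fro"))
print("nonconvex error:", np.linalg.norm(X @ Y.T - M_star, "fro"))
print("distance:       ", np.linalg.norm(Z_cvx - X @ Y.T, "fro"))
```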
[diagram: roadmap connecting the convex solution to nonconvex optimization via (approximate) optimizer stability]

20/ 39
Main results: r = O(1)

- random sampling: each (i, j) ∈ Ω with prob. p ≳ log³n / n
- random noise: i.i.d. sub-Gaussian noise with variance σ²
- true matrix M⋆ ∈ ℝ^{n×n}: r = O(1), incoherent, well-conditioned

minimize_{Z∈ℝ^{n×n}} Σ_{(i,j)∈Ω} (Z_{i,j} − M_{i,j})² + λ‖Z‖_*    (λ ≍ σ√(np))

Theorem 1 (Chen, Chi, Fan, Ma, Yan ’19). With high prob., any minimizer M_cvx of the convex program obeys
1. M_cvx is nearly rank-r: ‖M_cvx − proj_r(M_cvx)‖_F ≪ (1/n⁵) · σ√(n/p)
2. ‖M_cvx − M⋆‖_F ≲ σ√(n/p) and ‖M_cvx − M⋆‖_∞ ≲ σ√((n log n)/p) · (1/n)

21/ 39
‖M_cvx − M⋆‖_F ≲ σ√(n/p),   ‖M_cvx − M⋆‖_∞ ≲ σ√((n log n)/p) · (1/n)

[figure: estimation error ‖M̂ − M⋆‖_F vs. noise standard deviation σ; the bound of Chen, Chi, Fan, Ma, Yan ’19 matches the minimax limit]

- minimax optimal estimation error
- estimation errors are spread out across all entries

22/ 39
Implicit regularization

No need to enforce a spikiness constraint as in Negahban & Wainwright:

minimize_{Z: ‖Z‖_∞ ≤ α} Σ_{(i,j)∈Ω} (Z_{i,j} − M_{i,j})² + λ‖Z‖_*

- convex programming automatically controls the spikiness of its solutions

23/ 39
Statistical guarantees for iterative algorithms

minimize_Z g(Z) = Σ_{(i,j)∈Ω} (Z_{i,j} − M_{i,j})² + λ‖Z‖_*

Many algorithms (e.g. SVT, SOFT-IMPUTE, FPC, FISTA) have been proposed to minimize g(Z), typically without statistical guarantees.

We provide statistical guarantees for any Z with g(Z) ≤ g(Z_opt) + ε for some sufficiently small ε > 0.

24/ 39
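The workhorse behind several of these methods is singular-value soft-thresholding, the prox of λ‖·‖_*. Here is a minimal proximal-gradient sketch of ours (not any of the cited algorithms verbatim; step size and iteration count are illustrative):

```python
import numpy as np

def svt(Z, tau):
    """Prox of tau * nuclear norm: soft-threshold the singular values."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def prox_grad(M_obs, mask, lam, eta=0.5, T=300):
    """Proximal gradient descent on g(Z)."""
    Z = np.zeros_like(M_obs, dtype=float)
    for _ in range(T):
        grad = 2 * mask * (Z - M_obs)       # gradient of the squared loss
        Z = svt(Z - eta * grad, eta * lam)  # proximal (soft-thresholding) step
    return Z
```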
Main results: general case

- random sampling: each (i, j) ∈ Ω with prob. p ≳ r² log³n / n
- random noise: i.i.d. sub-Gaussian noise with variance σ²
- true matrix M⋆ ∈ ℝ^{n×n}: incoherent, well-conditioned

Theorem 2 (Chen, Chi, Fan, Ma, Yan ’19). With high prob., any minimizer M_cvx of the convex program obeys
1. M_cvx is nearly rank-r
2. ‖M_cvx − M⋆‖_F ≲ (σ/σ_min(M⋆)) √(n/p) · ‖M⋆‖_F
   ‖M_cvx − M⋆‖_∞ ≲ √r · (σ/σ_min(M⋆)) √((n log n)/p) · ‖M⋆‖_∞
   ‖M_cvx − M⋆‖ ≲ (σ/σ_min(M⋆)) √(n/p) · ‖M⋆‖

The sample complexity bound O(nr² log³n) is suboptimal in r!

25/ 39
A little analysis: connection between convex and nonconvex solutions
Link between convex and nonconvex optimizers

(X, Y) is nonconvex optimizer  ⟹(?)  XY⊤ is convex solution

If
- λ is properly chosen, and
- (X, Y) is close to the truth (in the ℓ_{2,∞} sense),
then: (X, Y) is nonconvex optimizer ⟹ XY⊤ is convex solution,
i.e. dist(convex solution, nonconvex solution) = 0

27/ 39
Approximate nonconvex optimizers

Issue: we do NOT know the properties of nonconvex optimizers
- it is unclear whether nonconvex algorithms converge to optimizers (due to lack of strong convexity)

28/ 39
Approximate nonconvex optimizers

Strategy: resort to “approximate stationary points” (∇f(X, Y) ≈ 0) instead

If
- λ is properly chosen, and
- (X, Y) is close to the truth (in the ℓ_{2,∞} sense),
then: ∇f(X, Y) ≈ 0 ⟹ dist(XY⊤, convex solutions) ≈ 0

29/ 39
Construct approximate nonconvex optimizers via GD

Starting from (X^0, Y^0) = truth or spectral initialization:
X^{t+1} = X^t − η ∇_X f(X^t, Y^t),  Y^{t+1} = Y^t − η ∇_Y f(X^t, Y^t),  t = 0, 1, · · · , T

- when T is large, there exists an iterate with very small gradient: min_{0≤t≤T} ‖∇f(X^t, Y^t)‖_F ≲ 1/√(ηT)
- hopefully it is not far from (X⋆, Y⋆) (in the ℓ_{2,∞} sense in particular)

30/ 39
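A sketch of ours for extracting that small-gradient iterate (the helper name is hypothetical; (X0, Y0) would be the truth or a spectral initialization as above):

```python
import numpy as np

def smallest_gradient_point(M_obs, mask, X0, Y0, eta, T):
    """Run GD and keep the iterate whose gradient norm is smallest."""
    X, Y = X0, Y0
    best, best_norm = (X0, Y0), np.inf
    for _ in range(T):
        R = mask * (X @ Y.T - M_obs)
        gX, gY = 2 * R @ Y, 2 * R.T @ X
        gnorm = np.sqrt((gX ** 2).sum() + (gY ** 2).sum())  # ||grad f||_F
        if gnorm < best_norm:
            best, best_norm = (X, Y), gnorm
        X, Y = X - eta * gX, Y - eta * gY
    return best, best_norm  # best_norm decays like 1/sqrt(eta*T)
```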
Gradient descent for nonconvex matrix completion

X^{t+1} = X^t − η ∇_X f(X^t, Y^t),  Y^{t+1} = Y^t − η ∇_Y f(X^t, Y^t)

Prior works analyze regularized GD:
- not guaranteed to return small-gradient solutions
- no ℓ_{2,∞} error control
— Keshavan et al. ’09, Sun, Luo ’15, Chen, Wainwright ’15, Zheng, Lafferty ’16

Our work and Chen et al. analyze vanilla GD:
- regularization-free
- optimal ℓ_{2,∞} error control
— Ma, Wang, Chi, Chen ’17, Chen, Liu, Li ’19

32/ 39
Gradient descent theory revisited

Two standard conditions that enable geometric convergence of GD:
- (local) restricted strong convexity
- (local) smoothness

33/ 39
Gradient descent theory revisited

f is said to be α-strongly convex and β-smooth if 0 ≺ αI ⪯ ∇²f(X) ⪯ βI, ∀X

ℓ₂ error contraction: GD with η = 1/β obeys
‖X^{t+1} − X⋆‖_F ≤ (1 − α/β) ‖X^t − X⋆‖_F

34/ 39
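For completeness, here is the standard one-step argument behind this contraction (our own filling-in: it vectorizes X and uses ∇f(X⋆) = 0 together with the Hessian bounds along the segment between X^t and X⋆):

```latex
\begin{align*}
X^{t+1} - X^\star
  &= X^t - \tfrac{1}{\beta}\,\nabla f(X^t) - X^\star
   = \bigl(I - \tfrac{1}{\beta} H_t\bigr)\,(X^t - X^\star),
  \qquad H_t := \int_0^1 \nabla^2 f\bigl(X^\star + \tau (X^t - X^\star)\bigr)\,\mathrm{d}\tau, \\[2pt]
\bigl\|X^{t+1} - X^\star\bigr\|_{\mathrm F}
  &\le \bigl\|I - \tfrac{1}{\beta} H_t\bigr\| \cdot \bigl\|X^t - X^\star\bigr\|_{\mathrm F}
   \le \Bigl(1 - \frac{\alpha}{\beta}\Bigr) \bigl\|X^t - X^\star\bigr\|_{\mathrm F},
\end{align*}
% since \alpha I \preceq H_t \preceq \beta I forces all eigenvalues of
% I - H_t/\beta to lie in [0, 1 - \alpha/\beta].
```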
Incoherence region

Which region enjoys both restricted strong convexity and smoothness?

[figure: neighborhood of X⋆ cut out by the constraints ‖e₁⊤(X − X⋆)‖₂ ≤ ‖X⋆‖_{2,∞} and ‖e₂⊤(X − X⋆)‖₂ ≤ ‖X⋆‖_{2,∞}]

- X is not far away from X⋆
- X is incoherent w.r.t. the standard basis vectors (incoherence region)

35/ 39
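As a small illustration (our own sketch; we read the row-wise constraint as ‖e_l⊤(X − X⋆)‖₂ ≤ ‖X⋆‖_{2,∞} for every l), membership in the incoherence region is just a check on the rows of X − X⋆:

```python
import numpy as np

def in_incoherence_region(X, X_star):
    """True iff every row of X - X* is short relative to ||X*||_{2,inf}."""
    row_dev = np.linalg.norm(X - X_star, axis=1)          # ||e_l^T (X - X*)||_2
    return row_dev.max() <= np.linalg.norm(X_star, axis=1).max()
```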
Inadequacy of generic gradient descent theory

[figure: region of local strong convexity + smoothness, with a GD trajectory that may exit the incoherence region]

- Generic optimization theory does NOT ensure that GD stays in the incoherence region
- Demonstrating incoherence calls for new analysis tools

36/ 39
Key proof idea: leave-one-out analysis

For each 1 ≤ l ≤ n, introduce leave-one-out iterates X^{t,(l)} driven by the auxiliary data M^{(l)}, obtained by replacing the lth row and column with true values

[figure: the matrix M^{(l)}, with the lth row and column filled in by true entries]

37/ 39
Key proof idea: leave-one-out analysis

[figure: both the true iterates {X^t} and the leave-one-out iterates {X^{t,(l)}} stay in the incoherence region w.r.t. e_l]

- Leave-one-out iterates {X^{t,(l)}} contain more information about the lth row of the truth and are indep. of the randomness in the lth row
- Leave-one-out iterates {X^{t,(l)}} ≈ true iterates {X^t}

38/ 39
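An illustrative sketch of the leave-one-out construction (our own; it is an analysis device, never run by the algorithm, since it presumes access to the truth M⋆, and the actual proofs also reweight the replaced terms by p, a detail this sketch drops):

```python
import numpy as np

def leave_one_out_data(M_obs, mask, M_star, l):
    """Replace the l-th row and column with fully observed, noiseless values."""
    M_l = M_obs.astype(float).copy()
    mask_l = mask.astype(bool).copy()
    M_l[l, :], M_l[:, l] = M_star[l, :], M_star[:, l]  # true values in row/col l
    mask_l[l, :], mask_l[:, l] = True, True            # treated as fully observed
    return M_l, mask_l
```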
[concluding timeline figure, 2008 → 2019: theory for convex relaxation (Candès, Recht ’08; Gross ’09; Candès, Plan ’09; Negahban, Wainwright ’10; Koltchinskii, Tsybakov, Lounici ’10) and for nonconvex optimization (Keshavan, Montanari, Oh ’09; Sun, Luo ’15; Chen, Wainwright ’15; Zheng, Lafferty ’16; Ma, Wang, Chi, Chen ’17; Chen, Liu, Li ’19), now bridged by Chen, Chi, Fan, Ma, Yan ’19: the convex solution is XY⊤, where (X, Y) is a critical point of the nonconvex optimization]

39/ 39