Non-convex Optimization for Machine Learning. Prateek Jain, Microsoft Research India
Outline
- Optimization for Machine Learning
- Non-convex Optimization
- Convergence to Stationary Points
- First order stationary points
- Second order stationary points
- Non-convex Optimization in ML
- Neural Networks
- Learning with Structure
- Alternating Minimization
- Projected Gradient Descent
Relevant Monograph (Shameless Ad)
Optimization in ML
Supervised Learning
- Given points (x_i, y_i)
- Prediction function: ŷ_i = f(x_i, w)
- Minimize loss: min_w Σ_i ℓ(f(x_i, w), y_i)
Unsupervised Learning
- Given points (x_1, x_2, …, x_n)
- Find cluster centers or train GANs
- Represent x̂_i = f(x_i, w)
- Minimize loss: min_w Σ_i ℓ(f(x_i, w), x_i)
Optimization Problems
- Unconstrained optimization: min_{w ∈ R^d} f(w)
- Deep networks
- Regression
- Gradient Boosted Decision Trees
- Constrained optimization: min_w f(w) s.t. w ∈ C
- Support Vector Machines
- Sparse regression
- Recommendation systems
- …
Convex Optimization
- Convex function: f(λ·w_1 + (1 − λ)·w_2) ≤ λ·f(w_1) + (1 − λ)·f(w_2) for all 0 ≤ λ ≤ 1
- Convex set: for all w_1, w_2 ∈ C and 0 ≤ λ ≤ 1, λ·w_1 + (1 − λ)·w_2 ∈ C
- Problem: min_w f(w) s.t. w ∈ C
Slide credit: Purushottam Kar
Examples
- Linear Programming
- Quadratic Programming
- Semidefinite Programming
Slide credit: Purushottam Kar
Convex Optimization
- Unconstrained optimization: min_{w ∈ R^d} f(w)
- Optima: just ensure ∇f(w) = 0
- Constrained optimization: min_w f(w) s.t. w ∈ C
- Optima: KKT conditions
- In this talk, assume f is L-smooth, i.e., f is differentiable and f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + (L/2)·||y − x||², or ||∇f(y) − ∇f(x)|| ≤ L·||y − x||
Gradient Descent Methods
- Projected gradient descent method:
- For t = 1, 2, … (until convergence)
- w_{t+1} = P_C(w_t − η·∇f(w_t))
- η: step-size (a minimal sketch follows below)
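As a concrete illustration of the update above, here is a minimal NumPy sketch of projected gradient descent; the toy least-squares objective, the unit-ball constraint set C, and the 1/L step size are illustrative assumptions, not part of the slides.

```python
import numpy as np

def project_onto_ball(w, radius=1.0):
    """Euclidean projection P_C onto the ball C = {w : ||w|| <= radius}."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else (radius / norm) * w

def projected_gradient_descent(grad_f, project, w0, eta, T=200):
    """w_{t+1} = P_C(w_t - eta * grad f(w_t))."""
    w = w0.copy()
    for _ in range(T):
        w = project(w - eta * grad_f(w))
    return w

# Toy instance: f(w) = 0.5 * ||X w - y||^2, constrained to the unit ball.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
w_true = np.array([2.0, -1.5, 0.5, 1.0, -2.0])   # lies outside the unit ball
y = X @ w_true
grad_f = lambda w: X.T @ (X @ w - y)
eta = 1.0 / np.linalg.eigvalsh(X.T @ X).max()    # step size 1/L
w_hat = projected_gradient_descent(grad_f, project_onto_ball, np.zeros(5), eta)
print(np.linalg.norm(w_hat))                     # ~1.0: the constraint is active
```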
Convergence Proof
- (Convex case, with step size η = 1/L.) By L-smoothness, f(w_{t+1}) ≤ f(w_t) + ⟨∇f(w_t), w_{t+1} − w_t⟩ + (L/2)·||w_{t+1} − w_t||²
- So f(w_{t+1}) ≤ f(w_t) − (1 − ηL/2)·η·||∇f(w_t)||² ≤ f(w_t) − (η/2)·||∇f(w_t)||²
- By convexity, f(w_{t+1}) ≤ f(w*) + ⟨∇f(w_t), w_t − w*⟩ − (1/(2η))·||w_{t+1} − w_t||²
- Hence f(w_{t+1}) ≤ f(w*) + (1/(2η))·(||w_t − w*||² − ||w_{t+1} − w*||²)
- Telescoping: f(w_T) ≤ f(w*) + (1/(2ηT))·||w_0 − w*||², so f(w_T) ≤ f(w*) + ε for T = O(L·||w_0 − w*||²/ε)
Non-convexity?
- Critical points: ∇f(w) = 0
- But ∇f(w) = 0 does not imply optimality for min_{w ∈ R^d} f(w)
Local Optima
- f(w) ≤ f(w′) for all w′ with ||w − w′|| ≤ ε
Local Minima
image credit: academo.org
First Order Stationary Points
- Defined by ∇f(w) = 0
- But ∇²f(w) need not be positive semidefinite
First Order Stationary Point (FOSP)
image credit: academo.org
First Order Stationary Points
- E.g., f(w) = 0.5·(w_1² − w_2²)
- ∇f(w) = (w_1, −w_2), so ∇f(0) = 0
- But ∇²f(w) = diag(1, −1) is indefinite
- f(δ/2, δ) = −(3/8)·δ² < f(0, 0), so (0, 0) is not a local minimum
First Order Stationary Point (FOSP)
image credit: academo.org
Second Order Stationary Points
Second Order Stationary Point (SOSP) if:
- ∇f(w) = 0
- ∇²f(w) ⪰ 0
Does it imply local optimality?
Second Order Stationary Point (SOSP)
image credit: academo.org
Second Order Stationary Points
- f(w) = (1/3)·(w_1³ − 3·w_1·w_2²)
- ∇f(w) = (w_1² − w_2², −2·w_1·w_2)
- ∇²f(w) = [[2·w_1, −2·w_2], [−2·w_2, −2·w_1]]
- ∇f(0) = 0 and ∇²f(0) = 0 ⪰ 0, so 0 is an SOSP
- But f(δ, δ) = −(2/3)·δ³ < f(0), so 0 is not a local minimum
Second Order Stationary Point (SOSP)
image credit: academo.org
Stationarity and local optima
- w is a local optimum: f(w) ≤ f(w′) for all w′ with ||w − w′|| ≤ ε
- w is a FOSP implies: f(w) ≤ f(w′) + O(||w − w′||²)
- w is an SOSP implies: f(w) ≤ f(w′) + O(||w − w′||³)
- w is a p-th order stationary point implies: f(w) ≤ f(w′) + O(||w − w′||^{p+1})
- That is, a local optimum corresponds to p = ∞
Computability?
- First-, second-, and third-order stationary points: computable (in polynomial time)
- p-th order stationary points, p ≥ 4: NP-hard
- Local optima: NP-hard
- (Recall: a p-th order stationary point w satisfies f(w) ≤ f(w′) + O(||w − w′||^{p+1}).)
Anandkumar and Ge-2016
Does Gradient Descent Work for Local Optimality?
- Yes!
- In fact, with high probability, it converges to a local minimizer
- If initialized randomly!!!
- But no rates are known
- NP-hard in general!!
- Big open problem
image credit: academo.org
Finding First Order Stationary Points
- Defined by ∇f(w) = 0
- But ∇²f(w) need not be positive semidefinite
First Order Stationary Point (FOSP)
image credit: academo.org
Gradient Descent Methods
- Gradient descent:
- For t = 1, 2, … (until convergence)
- w_{t+1} = w_t − η·∇f(w_t)
- η: step-size
- Assume: ||∇f(y) − ∇f(x)|| ≤ L·||y − x||
Convergence to FOSP
- f(w_{t+1}) ≤ f(w_t) + ⟨∇f(w_t), w_{t+1} − w_t⟩ + (L/2)·||w_{t+1} − w_t||²
- f(w_{t+1}) ≤ f(w_t) − (1 − ηL/2)·η·||∇f(w_t)||² ≤ f(w_t) − (1/(2L))·||∇f(w_t)||²
- So ||∇f(w_t)||² ≤ 2L·(f(w_t) − f(w_{t+1}))
- Summing: (1/(2L))·Σ_t ||∇f(w_t)||² ≤ f(w_0) − f(w*)
- min_t ||∇f(w_t)|| ≤ √(2L·(f(w_0) − f(w*))/T) ≤ ε for T = O(L·(f(w_0) − f(w*))/ε²)
Accelerated Gradient Descent for FOSP?
- For t = 1, 2, …, T:
- w_{t+1}^{md} = (1 − β_t)·w_t^{ag} + β_t·w_t
- w_{t+1} = w_t − λ_t·∇f(w_{t+1}^{md})
- w_{t+1}^{ag} = w_{t+1}^{md} − γ_t·∇f(w_{t+1}^{md})
- Convergence: min_t ||∇f(w_t)|| ≤ ε
- For T = O(L·(f(w_0) − f(w*))/ε)
- If convex: T = O((L·(f(w_0) − f(w*)))^{1/4}/ε)
Ghadimi and Lan - 2013
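A minimal sketch of the three-sequence accelerated update above, with constant β, λ, γ chosen purely for illustration (the parameter schedule analyzed by Ghadimi and Lan is more refined), applied to a toy smooth non-convex objective:

```python
import numpy as np

def agd_nonconvex(grad_f, w0, beta=0.5, lam=0.05, gamma=0.05, T=500):
    """Accelerated scheme: maintain w_t (aggressive) and w_t^ag (conservative),
    and evaluate the gradient at the 'middle' point w^md."""
    w, w_ag = w0.copy(), w0.copy()
    for _ in range(T):
        w_md = (1 - beta) * w_ag + beta * w
        g = grad_f(w_md)
        w = w - lam * g           # aggressive step
        w_ag = w_md - gamma * g   # conservative step
    return w_ag

# Toy non-convex objective: f(w) = 0.25 * sum((w_i^2 - 1)^2), a double well.
grad_f = lambda w: w * (w ** 2 - 1)
w = agd_nonconvex(grad_f, np.array([0.3, -0.2, 1.7]))
print(np.round(w, 3), np.linalg.norm(grad_f(w)))  # near a FOSP: tiny gradient norm
```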
Non-convex Optimization: Sum of Functions
- What if the function has more structure?
- min_w f(w) = (1/n)·Σ_{i=1}^{n} f_i(w)
- ∇f(w) = (1/n)·Σ_{i=1}^{n} ∇f_i(w)
- I.e., computing the full gradient requires O(n) computation
Does Stochastic Gradient Descent Work?
- For t = 1, 2, … (until convergence)
- Sample i_t ~ Unif[1, n]
- w_{t+1} = w_t − η·∇f_{i_t}(w_t) (a sketch follows below)
- Proof? In expectation the step follows the full gradient: E_{i_t}[w_{t+1} − w_t] = −η·∇f(w_t)
- f(w_{t+1}) ≤ f(w_t) + ⟨∇f(w_t), w_{t+1} − w_t⟩ + (L/2)·||w_{t+1} − w_t||²
- E[f(w_{t+1})] ≤ E[f(w_t)] − (η/2)·||∇f(w_t)||² + (L/2)·η²·σ², where σ² bounds the variance of the stochastic gradients
- min_t ||∇f(w_t)|| ≤ (L·σ²·(f(w_0) − f(w*)))^{1/4}/T^{1/4} ≤ ε for T = O(L·σ²·(f(w_0) − f(w*))/ε⁴)
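A minimal NumPy sketch of the stochastic update above on a toy finite sum; the realizable least-squares instance and the constant step size are illustrative assumptions.

```python
import numpy as np

def sgd(grad_fi, n, w0, eta=0.01, T=5000, seed=0):
    """SGD on f(w) = (1/n) * sum_i f_i(w): sample i_t ~ Unif[1, n] and
    step along -grad f_{i_t}(w_t)."""
    rng = np.random.default_rng(seed)
    w = w0.copy()
    for _ in range(T):
        i = rng.integers(n)
        w = w - eta * grad_fi(w, i)
    return w

# Toy finite sum: f_i(w) = 0.5 * (a_i^T w - b_i)^2 with a planted w_true plus noise.
rng = np.random.default_rng(1)
n, d = 200, 5
A = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
b = A @ w_true + 0.1 * rng.standard_normal(n)
grad_fi = lambda w, i: (A[i] @ w - b[i]) * A[i]
w_hat = sgd(grad_fi, n, np.zeros(d))
print(np.linalg.norm(A.T @ (A @ w_hat - b) / n))  # full-gradient norm, should be small
```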
Summary: Convergence to FOSP
For f(w) = (1/n)·Σ_{i=1}^{n} f_i(w); entries are the number of gradient (∇f_i) calls needed to reach min_t ||∇f(w_t)|| ≤ ε.
- GD [Folklore; Nesterov]: O(n/ε²) non-convex, O(n/ε) convex
- AGD [Ghadimi & Lan 2013]: O(n/ε) non-convex, O(n/√ε) convex
- SGD [Ghadimi & Lan 2013]: O(1/ε⁴) non-convex, O(1/ε²) convex
- SVRG [Reddi et al. 2016, Allen-Zhu & Hazan 2016]: O(n + n^{2/3}/ε²) non-convex, O(n + √n/ε²) convex
- MSVRG [Reddi et al. 2016]: O(min(1/ε⁴, n^{2/3}/ε²)) non-convex, O(n + √n/ε²) convex
Finding Second Order Stationary Points (SOSP)
Second Order Stationary Point (SOSP) if:
- ∇f(w) = 0
- ∇²f(w) ⪰ 0
Approximate SOSP:
- ||∇f(w)|| ≤ ε
- λ_min(∇²f(w)) ≥ −√ε
Second Order Stationary Point (SOSP)
image credit: academo.org
Cubic Regularization (Nesterov and Polyak-2006)
- For t = 1, 2, … (until convergence):
- w_{t+1} = argmin_w f(w_t) + ⟨w − w_t, ∇f(w_t)⟩ + (1/2)·(w − w_t)ᵀ·∇²f(w_t)·(w − w_t) + (M/6)·||w − w_t||³
- Assumption: Hessian continuity, i.e., ||∇²f(y) − ∇²f(x)|| ≤ M·||y − x||
- Convergence to an SOSP in T = O(1/ε^{1.5}) iterations
- But it requires Hessian computation (even storing the Hessian takes O(d²) space); a one-step sketch follows below
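For intuition, here is a sketch of a single cubic-regularization step at a saddle point; solving the cubic subproblem by plain gradient descent on the model, as done here, is only for illustration (practical implementations use specialized subproblem solvers), and the saddle instance f(w) = 0.5·(w_1² − w_2²) is borrowed from the earlier FOSP example.

```python
import numpy as np

def cubic_step(g, H, M, lr=0.05, inner_iters=500):
    """Approximately minimize the cubic model
    m(s) = g^T s + 0.5 * s^T H s + (M/6) * ||s||^3
    over the displacement s, by gradient descent on m (illustration only)."""
    s = np.zeros_like(g)
    for _ in range(inner_iters):
        grad_m = g + H @ s + 0.5 * M * np.linalg.norm(s) * s
        s = s - lr * grad_m
    return s

# Near the saddle of f(w) = 0.5 * (w1^2 - w2^2): take w_t = (0, 0.05).
g = np.array([0.0, -0.05])                # grad f(w_t)
H = np.array([[1.0, 0.0], [0.0, -1.0]])   # Hessian (indefinite)
s = cubic_step(g, H, M=1.0)
print(np.round(s, 2))  # moves along the negative-curvature direction (the w2 axis)
```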
- Can we find SOSP using only gradients?
Noisy Gradient Descent for SOSP
- For t = 1, 2, … (until convergence)
- If ||∇f(w_t)|| ≥ ε:
- w_{t+1} = w_t − η·∇f(w_t)
- Else:
- w_{t+1} = w_t + ζ, ζ ~ δ·N(0, I)
- Then update w_{t+1} = w_t − η·∇f(w_t) for the next Γ iterations
- Claim: the above algorithm converges to an SOSP in O(1/ε²) iterations
Ge et al-2015, Jin et al-2017
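A minimal sketch of the perturbed gradient descent scheme above; the double-well test function, the perturbation scale δ, and the escape-phase length Γ are illustrative choices rather than the constants from the analysis.

```python
import numpy as np

def perturbed_gd(grad_f, w0, eta=0.05, eps=1e-3, delta=0.1,
                 escape_steps=100, T=500, seed=0):
    """Plain GD while the gradient is large; once ||grad f|| < eps
    (a potential saddle), add a Gaussian perturbation and keep descending."""
    rng = np.random.default_rng(seed)
    w, t = w0.copy(), 0
    while t < T:
        g = grad_f(w)
        if np.linalg.norm(g) >= eps:
            w = w - eta * g
            t += 1
        else:
            w = w + delta * rng.standard_normal(w.shape)  # random kick
            for _ in range(escape_steps):                 # descend for Gamma steps
                w = w - eta * grad_f(w)
            t += escape_steps
    return w

# Toy objective with a saddle at 0 and minima at w = (0, +/-1):
# f(w) = 0.5 * w1^2 + 0.25 * w2^4 - 0.5 * w2^2,  grad f = (w1, w2^3 - w2).
grad_f = lambda w: np.array([w[0], w[1] ** 3 - w[1]])
print(np.round(perturbed_gd(grad_f, np.zeros(2)), 3))  # escapes 0, lands near (0, +/-1)
```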
Proof
- FOSP analysis: convergence in O(1/ε²) iterations
- But ∇²f(w_t) ⋡ 0 is possible
- That is, λ_min(∇²f(w_t)) < −√ε
image credit: academo.org
Proof
- A random perturbation followed by gradient descent leads to a decrease in the objective function
- Hessian continuity ⇒ the function is nearly quadratic in a small neighborhood:
- f(w) ≈ f(w_t) + ⟨∇f(w_t), w − w_t⟩ + (1/2)·(w − w_t)ᵀ·∇²f(w_t)·(w − w_t)
- Near the saddle ∇f(w_t) ≈ 0, so the perturbed iterates follow a power iteration: w_{s+t} = w_{s−1+t} − η·∇²f(w_t)·(w_{s−1+t} − w_t), i.e., w_{s+t} − w_t = (I − η·∇²f(w_t))^s·(w_{t+1} − w_t)
- Hence w_{s+t} − w_t converges to the top eigenvector of I − η·∇²f(w_t)
- Which is the smallest (most negative) eigenvector of ∇²f(w_t)
- Hence (w_{s+t} − w_t)ᵀ·∇²f(w_t)·(w_{s+t} − w_t) ≤ −δ²·√ε
- So f(w_{s+t}) ≤ f(w_t) − δ²·√ε
Proof
- Entrapment near SOSP
Final result: convergence to an SOSP in O(1/ε²) iterations
Ge et al-2015, Jin et al-2017 image credit: academo.org
Summary: Convergence to SOSP
For f(w) = (1/n)·Σ_{i=1}^{n} f_i(w); entries are the number of gradient (∇f_i) calls needed to reach an approximate SOSP.
- Noisy GD [Jin et al. 2017, Ge et al. 2015]: O(n/ε²) non-convex, O(n/ε) convex
- Noisy AGD [Jin et al. 2017]: O(n/ε^{1.75}) non-convex, O(n/√ε) convex
- Noisy SGD [Jin et al. 2017, Ge et al. 2015]: O(1/ε⁴) non-convex, O(1/ε²) convex
- SVRG [Allen-Zhu 2018]: O(n + n^{3/4}/ε²) non-convex, O(n + √n/ε²) convex
- Cubic Regularization [Nesterov & Polyak 2006]: O(1/ε^{1.5}) non-convex (Hessian-based), N/A for the convex column
Convergence to Global Optima?
- FOSP/SOSP methods cannot even guarantee convergence to a local optimum
- Can we guarantee global optimality for some "nicer" non-convex problems?
- Yes!!!
- Use statistics
image credit: academo.org
Can Statistics Help: Realizable models!
- Data points: (x_i, y_i) ~ D
- D: a nice distribution
- E[y_i] = f(x_i, w*)
- ŵ = argmin_w Σ_i loss(y_i, f(x_i, w))
- That is, w* is the optimal solution!
- Parameter learning
Learning Neural Networks: Provably
- y_i = 1ᵀ·σ(W*·x_i)
- x_i ~ N(0, I)
- min_W Σ_i (y_i − 1ᵀ·σ(W·x_i))²
- Does gradient descent converge to the global optimum W*?
- NO!!!
- The objective function has poor local minima [Shamir et al-2017, Lee et al-2017]
Learning Neural Networks: Provably
- But there are no local minima within a constant distance of W*
- If ||W_0 − W*|| ≤ γ, then gradient descent (W_{t+1} = W_t − η·∇f(W_t)) converges to W*
- No. of iterations: log(1/ε)
- Can we get rid of the initialization condition? Yes, but by changing the network [Liang-Lee-Srikant 2018]
Zhong-Song-J-Bartlett-Dhillon 2017
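A small simulation of this local-convergence phenomenon in the realizable one-hidden-layer model; the ReLU activation, the problem sizes, and the initialization radius are illustrative assumptions rather than the exact setting of the cited work.

```python
import numpy as np

# Realizable model: y_i = 1^T sigma(W* x_i), x_i ~ N(0, I), sigma = ReLU (assumed).
rng = np.random.default_rng(0)
d, k, n = 10, 4, 2000
W_star = rng.standard_normal((k, d))
X = rng.standard_normal((n, d))
y = np.maximum(X @ W_star.T, 0.0).sum(axis=1)

def loss_and_grad(W):
    Z = X @ W.T                                    # pre-activations, shape (n, k)
    resid = np.maximum(Z, 0.0).sum(axis=1) - y     # prediction error per sample
    # d/dW_j of the mean squared loss: 2 * mean_i resid_i * 1{Z_ij > 0} * x_i
    G = 2.0 * ((resid[:, None] * (Z > 0)).T @ X) / n
    return np.mean(resid ** 2), G

W = W_star + 0.1 * rng.standard_normal((k, d))     # start close to W*, as required
for _ in range(200):
    loss, G = loss_and_grad(W)
    W -= 0.1 * G
print(loss, np.linalg.norm(W - W_star))            # both should be close to zero
```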
Learning with Structure
- y_i = f(x_i, w*), x_i ~ D ⊆ R^d, 1 ≤ i ≤ n
- But the number of samples is limited!
- For example, n ≤ d?
- Can we still recover w*? In general, no!
- But what if w* has some structure?
Sparse Linear Regression
- y = X·w, X ∈ R^{n×d}
- But: n ≪ d
- w: s-sparse (s non-zeros)
- Information-theoretically, n = O(s·log d) samples should suffice
Learning with structure
- Linear classification/regression: C = {w : ||w||_0 ≤ s}, s ≪ d
- Matrix completion: C = {W : rank(W) ≤ r}, r ≪ min(d_1, d_2)
- Problem: min_w f(w) s.t. w ∈ C
Other Examples
- Low-rank tensor completion: C = {T : tensor-rank(T) ≤ r}, r ≪ min(d_1, d_2, d_3)
- Robust PCA: C = {M : M = L + S, rank(L) ≤ r, ||S||_0 ≤ s}, r ≪ min(d_1, d_2), s ≪ d_1·d_2
Non-convex Structures
- Linear classification/regression: C = {w : ||w||_0 ≤ s}, s ≪ d. NP-hard; ||w||_0 is non-convex
- Matrix completion: C = {W : rank(W) ≤ r}, r ≪ min(d_1, d_2). NP-hard; rank(W) is non-convex
Non-convex Structures
- Low-rank tensor completion: C = {T : tensor-rank(T) ≤ r}, r ≪ min(d_1, d_2, d_3). Hardness: indeterminate; tensor rank is non-convex
- Robust PCA: C = {M : M = L + S, rank(L) ≤ r, ||S||_0 ≤ s}, r ≪ min(d_1, d_2), s ≪ d_1·d_2. NP-hard; rank(L) and ||S||_0 are non-convex
Technique: Projected Gradient Descent
- Problem: min_w f(w) s.t. w ∈ C
- Gradient step: w_{t+1} = w_t − η·∇f(w_t)
- Projection step: w_{t+1} = P_C(w_{t+1}), where P_C(z) = argmin_{w ∈ C} ||w − z||²
Results for Several Problems
- Sparse regression [Jain et al. '14, Garg and Khandekar '09]
- Sparsity
- Robust Regression [Bhatia et al. '15]
- Sparsity + output sparsity
- Vector-valued Regression [Jain & Tewari '15]
- Sparsity + positive definite matrix
- Dictionary Learning [Agarwal et al. '14]
- Matrix Factorization + Sparsity
- Phase Sensing [Netrapalli et al. '13]
- System of Quadratic Equations
Results Contdโฆ
- Low-rank Matrix Regression [Jain et al. '10, Jain et al. '13]
- Low-rank structure
- Low-rank Matrix Completion [Jain & Netrapalli '15, Jain et al. '13]
- Low-rank structure
- Robust PCA [Netrapalli et al. '14]
- Low-rank + Sparse Matrices
- Tensor Completion [Jain and Oh '14]
- Low tensor rank
- Low-rank matrix approximation [Bhojanapalli et al. '15]
- Low-rank structure
Sparse Linear Regression
- y = X·w, X ∈ R^{n×d}
- But: n ≪ d
- w: s-sparse (s non-zeros)
Sparse Linear Regression
- min_w ||y − X·w||² s.t. ||w||_0 ≤ s
- ||y − X·w||² = Σ_i (y_i − ⟨x_i, w⟩)²
- ||w||_0: number of non-zeros
- NP-hard problem in general
- The ℓ_0 constraint is non-convex
Technique: Projected Gradient Descent
- min_w f(w) = ||y − X·w||² s.t. ||w||_0 ≤ s
- Gradient step: w_{t+1} = w_t − η·∇f(w_t)
- Projection step: w_{t+1} = P_s(w_{t+1}), where P_s(z) = argmin_{||w||_0 ≤ s} ||w − z||² keeps the s largest-magnitude entries of z (a sketch follows below) [Jain, Tewari, Kar 2014]
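A minimal sketch of this iterative-hard-thresholding style projected gradient descent on a toy sparse-recovery instance; the dimensions, the noiseless measurements, and the 1/L step size are illustrative assumptions.

```python
import numpy as np

def hard_threshold(w, s):
    """P_s(w): keep the s largest-magnitude entries, zero out the rest."""
    out = np.zeros_like(w)
    keep = np.argsort(np.abs(w))[-s:]
    out[keep] = w[keep]
    return out

def iht(X, y, s, T=500):
    """w_{t+1} = P_s(w_t - eta * X^T (X w_t - y)), with eta = 1/L."""
    eta = 1.0 / np.linalg.norm(X, 2) ** 2
    w = np.zeros(X.shape[1])
    for _ in range(T):
        w = hard_threshold(w - eta * X.T @ (X @ w - y), s)
    return w

# Toy sparse recovery: d = 200, s = 5, n = 100 Gaussian measurements, no noise.
rng = np.random.default_rng(0)
d, s, n = 200, 5, 100
w_star = np.zeros(d)
w_star[rng.choice(d, s, replace=False)] = rng.choice([-1.0, 1.0], s)
X = rng.standard_normal((n, d)) / np.sqrt(n)
y = X @ w_star
print(np.linalg.norm(iht(X, y, s) - w_star))  # near-exact recovery
```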
Statistical Guarantees
- y_i = ⟨x_i, w*⟩ + η_i
- x_i ~ N(0, Σ), η_i ~ N(0, σ²)
- w*: s-sparse
- Guarantee: ||ŵ − w*|| ≤ σ·κ³·√(s·log d / n), where κ = λ_1(Σ)/λ_d(Σ)
[Jain, Tewari, Kar 2014]
Low-rank Matrix Completion
- min_W Σ_{(i,j) ∈ Ω} (M_{ij} − W_{ij})² s.t. rank(W) ≤ r, where Ω is the set of known entries
- Special case of low-rank matrix regression
- However, assumptions required by the regression analysis not satisfied
Technique: Projected Gradient Descent
- W_0 = 0
- For t = 0, 1, …, T−1: W_{t+1} = P_r(W_t − η·∇f(W_t))
- P_r(W): projection onto the set of rank-r matrices (top-r SVD)
- "Singular Value Projection" (SVP)
- Pros: fast (always a rank-r SVD); for matrix completion, O(n·r³)!
- Cons: in general, it might not even converge
- Our result: convergence under "certain" assumptions
[Jain, Tewari, Kar 2014], [Netrapalli, Jain 2014], [Jain, Meka, Dhillon 2009]
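A sketch of the SVP iteration for matrix completion on an easy synthetic instance; the step size and sampling rate are illustrative, and np.linalg.svd stands in for a faster truncated rank-r SVD.

```python
import numpy as np

def svp_completion(M_obs, mask, r, eta=1.0, T=200):
    """W_{t+1} = P_r(W_t - eta * grad f(W_t)), where grad f is the residual on the
    observed entries and P_r is the best rank-r approximation (truncated SVD)."""
    W = np.zeros_like(M_obs)
    for _ in range(T):
        grad = mask * (W - M_obs)        # gradient of 0.5 * sum_obs (W_ij - M_ij)^2
        U, S, Vt = np.linalg.svd(W - eta * grad, full_matrices=False)
        W = (U[:, :r] * S[:r]) @ Vt[:r]  # rank-r projection
    return W

# Toy instance: rank-2 ground truth, ~40% of entries observed.
rng = np.random.default_rng(0)
d1, d2, r = 60, 50, 2
M_star = rng.standard_normal((d1, r)) @ rng.standard_normal((r, d2))
mask = rng.random((d1, d2)) < 0.4
M_hat = svp_completion(mask * M_star, mask, r)
print(np.linalg.norm(M_hat - M_star) / np.linalg.norm(M_star))  # small relative error
```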
Guarantees
- Projected gradient descent: W_{t+1} = P_r(W_t − η·∇f(W_t)) for all t
- Shows ε-approximate recovery in log(1/ε) iterations
- Assuming:
- M: incoherent
- Ω: sampled uniformly at random
- |Ω| ≥ n·r⁵·log³ n
- First near-linear-time algorithm for exact matrix completion with finite samples
[J., Netrapalli 2015]
General Result for Any Function
- f: R^d → R
- f satisfies RSC/RSS (restricted strong convexity / restricted smoothness), i.e., α·I ⪯ ∇²f(w)|_{S×S} ⪯ L·I for every relevant index set S and every w ∈ C
- Problem: min_w f(w) s.t. w ∈ C
- PGD guarantee: f(w_T) ≤ f(w*) + ε after T = O(log(f(w_0)/ε)) steps
- Provided the restricted condition number satisfies L/α ≤ 1.5
[J., Tewari, Kar 2014]
Learning with Latent Variables
- min_{w,z} f(w, z)
- Typically, z are latent variables
- E.g., clustering: w are the cluster means, z the cluster assignments
- f: non-convex
- NP-hard to solve in general
Alternating Minimization
- z_{t+1} = argmin_z f(w_t, z)
- w_{t+1} = argmin_w f(w, z_{t+1})
- For example, suppose f(w_t, z) is convex in z and f(w, z_t) is convex in w
- Does that imply f(w, z) is jointly convex? No!!!
- E.g., f(w, z) = w·z is linear in each of w and z individually, yet not jointly convex
- So can alternating minimization converge to the global optimum? (A clustering-style sketch follows below.)
image credit: academo.org
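To make the alternating scheme concrete, here is the clustering example as code: Lloyd's-style alternating minimization of the k-means objective, with a simple farthest-point initialization added so the toy run does not start from an obviously bad point. All of this is an illustrative sketch, not the algorithm analyzed on the following slides.

```python
import numpy as np

def kmeans_altmin(X, k, T=50, seed=0):
    """Alternate: z-step assigns each point to its nearest mean,
    w-step recomputes each mean as the average of its assigned points."""
    rng = np.random.default_rng(seed)
    means = [X[rng.integers(len(X))]]
    for _ in range(k - 1):                       # farthest-point initialization
        dist = np.min([((X - m) ** 2).sum(-1) for m in means], axis=0)
        means.append(X[dist.argmax()])
    means = np.array(means)
    for _ in range(T):
        d2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(-1)
        assign = d2.argmin(axis=1)               # z-step
        for j in range(k):                       # w-step
            if np.any(assign == j):
                means[j] = X[assign == j].mean(axis=0)
    return means, assign

# Toy data: three well-separated Gaussian blobs.
rng = np.random.default_rng(1)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [-5.0, 5.0]])
X = np.vstack([c + rng.standard_normal((100, 2)) for c in centers])
means, _ = kmeans_altmin(X, 3)
print(np.round(means, 2))  # close to the true blob centers (up to permutation)
```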
Low-rank Matrix Completion
- min_W Σ_{(i,j) ∈ Ω} (M_{ij} − W_{ij})² s.t. rank(W) ≤ r, where Ω is the set of known entries
- Special case of low-rank matrix regression
- However, assumptions required by the regression analysis not satisfied
Matrix Completion: Alternating Minimization
- Factorize W = U·Vᵀ and minimize ||P_Ω(M − U·Vᵀ)||_F² over the factors
- U_{t+1} = argmin_U ||P_Ω(M − U·V_tᵀ)||_F²
- V_{t+1} = argmin_V ||P_Ω(M − U_{t+1}·Vᵀ)||_F²
- Each update is a least-squares problem in one factor (a sketch follows below)
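A sketch of this alternating least squares update for matrix completion: with V fixed, each row of U solves a small least-squares problem over that row's observed entries, and symmetrically for V. The random initialization and the tiny ridge term are illustrative simplifications; the analysis typically requires a more careful initialization (e.g., via SVD of the observed matrix).

```python
import numpy as np

def als_completion(M_obs, mask, r, T=30, ridge=1e-6, seed=0):
    """Alternating minimization over the factors of W = U V^T."""
    rng = np.random.default_rng(seed)
    d1, d2 = M_obs.shape
    U = rng.standard_normal((d1, r))
    V = rng.standard_normal((d2, r))
    for _ in range(T):
        for i in range(d1):                       # U-step, row by row
            cols = np.where(mask[i])[0]
            Vi = V[cols]
            U[i] = np.linalg.solve(Vi.T @ Vi + ridge * np.eye(r), Vi.T @ M_obs[i, cols])
        for j in range(d2):                       # V-step, column by column
            rows = np.where(mask[:, j])[0]
            Uj = U[rows]
            V[j] = np.linalg.solve(Uj.T @ Uj + ridge * np.eye(r), Uj.T @ M_obs[rows, j])
    return U, V

# Toy instance: rank-2 ground truth, ~50% of entries observed.
rng = np.random.default_rng(0)
d1, d2, r = 50, 40, 2
M_star = rng.standard_normal((d1, r)) @ rng.standard_normal((r, d2))
mask = rng.random((d1, d2)) < 0.5
U, V = als_completion(mask * M_star, mask, r)
print(np.linalg.norm(U @ V.T - M_star) / np.linalg.norm(M_star))  # small relative error
```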
Results: Alternating Minimization
- Provable global convergence [J., Netrapalli, Sanghavi '13]
- Rate of convergence: geometric, ||W_T − W*|| ≤ 2^{−T}
- Assumptions:
- Matrix regression: RIP
- Matrix completion: uniform sampling and number of samples |Ω| ≥ O(n·r⁶)
[Jain, Netrapalli, Sanghavi '13]
General Results
- min_{w,z} f(w, z)
- Alternating minimization: optimal?
- If:
- Joint restricted strong convexity (strong convexity close to the optimum)
- Restricted smoothness (smoothness near the optimum)
- Cross-product bound: ⟨w − w*, ∇_w f(w, z) − ∇_w f(w, z*)⟩ − ⟨z − z*, ∇_z f(w, z) − ∇_z f(w*, z)⟩ ≤ c·(||w − w*||² + ||z − z*||²)
Ha and Barber-2017, Jain and Kar-2018
Summary I
Non-convex Optimization: two approaches
- 1. General non-convex functions
- a. First Order Stationary Point
- b. Second Order Stationary Point
- 2. Statistical non-convex functions: learning with structure
- a. Projected Gradient Descent (RSC/RSS)
- b. Alternating minimization/EM algorithms (RSC/RSS)
Summary II
- First-order stationary point: f(w) ≤ f(w′) + O(||w − w′||²)
- Tools: gradient descent, acceleration, stochastic gd, variance reduction
- Key quantity: iteration complexity
- Several questions: for example, can we do better? Especially in finite sum setting
- Second-order stationary point: f(w) ≤ f(w′) + O(||w − w′||³)
- Tools: noise+gd, noise+acceleration, noise+sgd, noise+variance reduction
- Several questions: better rates? Can we remove Lipschitz condition on Hessian?
Summary III
- Projected Gradient Descent
- Works under statistical conditions like RSC/RSS
- Still several open questions for most problems
- E.g., tight guarantees for support recovery in sparse linear regression?
- Alternating minimization
- Works under some assumptions on f
- What is the weakest condition on ๐ for Alt. Min. to work?