Non-convex Optimization for Machine Learning (Prateek Jain)



slide-1
SLIDE 1

Non-convex Optimization for Machine Learning

Prateek Jain

Microsoft Research India

slide-2
SLIDE 2

Outline

  • Optimization for Machine Learning
  • Non-convex Optimization
  • Convergence to Stationary Points
  • First order stationary points
  • Second order stationary points
  • Non-convex Optimization in ML
  • Neural Networks
  • Learning with Structure
  • Alternating Minimization
  • Projected Gradient Descent
slide-3
SLIDE 3

Relevant Monograph (Shameless Ad)

slide-4
SLIDE 4

Optimization in ML

Supervised Learning

  • Given points (x_i, y_i)
  • Prediction function: ŷ_i = φ(x_i, w)
  • Minimize loss: min_w Σ_i ℓ(φ(x_i, w), y_i)

Unsupervised Learning

  • Given points (x_1, x_2, …, x_n)
  • Find cluster centers or train GANs
  • Represent x̂_i = φ(x_i, w)
  • Minimize loss: min_w Σ_i ℓ(φ(x_i, w), x_i)

slide-5
SLIDE 5

Optimization Problems

  • Unconstrained optimization: min_{w ∈ R^d} f(w)
  • Deep networks
  • Regression
  • Gradient Boosted Decision Trees

  • Constrained optimization: min_w f(w)  s.t.  w ∈ C

  • Support Vector Machines
  • Sparse regression
  • Recommendation system
  • โ€ฆ
slide-6
SLIDE 6

Convex Optimization

Convex function:  f(λw_1 + (1−λ)w_2) ≤ λf(w_1) + (1−λ)f(w_2),  0 ≤ λ ≤ 1
Convex set:  ∀w_1, w_2 ∈ C,  λw_1 + (1−λ)w_2 ∈ C,  0 ≤ λ ≤ 1

min_w f(w)  s.t.  w ∈ C

Slide credit: Purushottam Kar

slide-7
SLIDE 7

Examples

Linear Programming Quadratic Programming Semidefinite Programming

Slide credit: Purushottam Kar

slide-8
SLIDE 8

Convex Optimization

  • Unconstrained optimization: min_{w ∈ R^d} f(w)
  • Optima: just ensure ∇_w f(w) = 0

  • Constrained optimization: min_w f(w)  s.t.  w ∈ C
  • Optima: KKT conditions

In this talk, let's assume f is L-smooth (so f is differentiable):
f(x) ≤ f(y) + ⟨∇f(y), x − y⟩ + (L/2)||x − y||²,  OR  ||∇f(x) − ∇f(y)|| ≤ L||x − y||

slide-9
SLIDE 9

Gradient Descent Methods

  • Projected gradient descent method:
  • For t = 1, 2, … (until convergence)
  • w_{t+1} = P_C(w_t − η∇f(w_t))
  • η: step size
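Below is a minimal sketch of this update in Python, assuming a generic gradient oracle grad_f and projection project_C; the least-squares objective and unit-ball constraint in the usage example are illustrative choices, not from the slides.

```python
import numpy as np

def projected_gradient_descent(grad_f, project_C, w0, eta=0.1, T=1000):
    """Minimal sketch of projected gradient descent: w_{t+1} = P_C(w_t - eta * grad f(w_t))."""
    w = w0.copy()
    for _ in range(T):
        w = project_C(w - eta * grad_f(w))
    return w

# Illustrative choices (assumptions, not from the slides):
# f(w) = ||A w - b||^2 with C = {w : ||w||_2 <= 1}.
rng = np.random.default_rng(0)
A, b = rng.normal(size=(50, 10)), rng.normal(size=50)
grad_f = lambda w: 2 * A.T @ (A @ w - b)
project_C = lambda w: w if np.linalg.norm(w) <= 1 else w / np.linalg.norm(w)
w_hat = projected_gradient_descent(grad_f, project_C, np.zeros(10), eta=1e-3, T=2000)
```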
slide-10
SLIDE 10

Convergence Proof

๐‘” ๐‘ฅ๐‘ข+1 โ‰ค ๐‘” ๐‘ฅ๐‘ข + โˆ‡๐‘” ๐‘ฅ๐‘ข , ๐‘ฅ๐‘ข+1 โˆ’ ๐‘ฅ๐‘ข + ๐‘€ 2 ||๐‘ฅ๐‘ข+1 โˆ’ ๐‘ฅ๐‘ข||2 ๐‘” ๐‘ฅ๐‘ข+1 โ‰ค ๐‘” ๐‘ฅ๐‘ข โˆ’ 1 โˆ’ ๐‘€๐œƒ 2 ๐œƒ||โˆ‡๐‘” ๐‘ฅ๐‘ข ||2 โ‰ค ๐‘” ๐‘ฅ๐‘ข โˆ’ ๐œƒ 2 ||โˆ‡๐‘” ๐‘ฅ๐‘ข ||2 ๐‘” ๐‘ฅ๐‘ข+1 โ‰ค ๐‘” ๐‘ฅโˆ— + โˆ‡๐‘” ๐‘ฅ๐‘ข , ๐‘ฅ๐‘ข โˆ’ ๐‘ฅโˆ— โˆ’ 1 2๐œƒ ||๐‘ฅ๐‘ข+1 โˆ’ ๐‘ฅ๐‘ข||2 ๐‘” ๐‘ฅ๐‘ข+1 โ‰ค ๐‘” ๐‘ฅโˆ— + 1 2๐œƒ ||๐‘ฅ๐‘ข โˆ’ ๐‘ฅโˆ—||2 โˆ’ ||๐‘ฅ๐‘ข+1 โˆ’ ๐‘ฅโˆ—||2 ๐‘” ๐‘ฅ๐‘ˆ โ‰ค ๐‘” ๐‘ฅโˆ— + 1 ๐‘ˆ โ‹… 2๐œƒ ||๐‘ฅ0 โˆ’ ๐‘ฅโˆ—||2 โ‡’ ๐‘” ๐‘ฅ๐‘ˆ โ‰ค ๐‘” ๐‘ฅโˆ— + ๐œ— ๐‘ˆ = ๐‘ƒ ๐‘€ โ‹… ||๐‘ฅ0 โˆ’ ๐‘ฅโˆ—||2 ๐œ—

๐‘ฅ๐‘ข

slide-11
SLIDE 11

Non-convexity?

  • Critical points: ∇f(w) = 0
  • But: ∇f(w) = 0 does not imply optimality

min_{w ∈ R^d} f(w)

slide-12
SLIDE 12

Local Optima

  • f(w) ≤ f(w′),  ∀ ||w − w′|| ≤ ε

Local Minima

image credit: academo.org

slide-13
SLIDE 13

First Order Stationary Points

  • Defined by: ∇f(w) = 0
  • But ∇²f(w) need not be positive semi-definite

First Order Stationary Point (FOSP)

image credit: academo.org

slide-14
SLIDE 14

First Order Stationary Points

  • E.g., f(w) = 0.5(w_1² − w_2²)
  • ∇f(w) = (w_1, −w_2)
  • ∇f(0) = 0
  • But ∇²f(w) = diag(1, −1) ⇒ indefinite
  • f(ε/2, ε) = −(3/8)ε² ⇒ (0, 0) is not a local minimum

First Order Stationary Point (FOSP)

image credit: academo.org

slide-15
SLIDE 15

Second Order Stationary Points

Second Order Stationary Point (SOSP) if:

  • โˆ‡๐‘” ๐‘ฅ = 0
  • โˆ‡2๐‘” ๐‘ฅ โ‰ฝ 0

Does it imply local optimality?

Second Order Stationary Point (SOSP)

image credit: academo.org

slide-16
SLIDE 16

Second Order Stationary Points

  • f(w) = (1/3)(w_1³ − 3w_1w_2²)
  • ∇f(w) = (w_1² − w_2², −2w_1w_2)
  • ∇²f(w) = [[2w_1, −2w_2], [−2w_2, −2w_1]]
  • ∇f(0) = 0, ∇²f(0) = 0 ⇒ 0 is a SOSP
  • f(ε, ε) = −(2/3)ε³ < f(0)

Second Order Stationary Point (SOSP)

image credit: academo.org

slide-17
SLIDE 17

Stationarity and local optima

  • w is a local optimum implies: f(w) ≤ f(w′), ∀ ||w − w′|| ≤ ε
  • w is a FOSP implies: f(w) ≤ f(w′) + O(||w − w′||²)
  • w is a SOSP implies: f(w) ≤ f(w′) + O(||w − w′||³)
  • w is a p-th order SP implies: f(w) ≤ f(w′) + O(||w − w′||^{p+1})
  • That is, local optima correspond to p = ∞
slide-18
SLIDE 18

Computability?

  • First, Second, and Third Order Stationary Points: efficiently computable
  • p ≥ 4 Stationary Points: NP-Hard
  • Local Optima: NP-Hard

f(w) ≤ f(w′) + O(||w − w′||^{p+1})

Anandkumar and Ge-2016

slide-19
SLIDE 19

Does Gradient Descent Work for Local Optimality?

  • Yes!
  • In fact, with high probability it converges to a "local minimizer"
  • If initialized randomly!!!
  • But no convergence rates are known
  • NP-hard in general!!
  • Big open problem โ˜บ

image credit: academo.org

slide-20
SLIDE 20

Finding First Order Stationary Points

  • Defined by: ∇f(w) = 0
  • But ∇²f(w) need not be positive semi-definite

First Order Stationary Point (FOSP)

image credit: academo.org

slide-21
SLIDE 21

Gradient Descent Methods

  • Gradient descent:
  • For t=1, 2, โ€ฆ (until convergence)
  • ๐‘ฅ๐‘ข+1 = ๐‘ฅ๐‘ข โˆ’ ๐œƒโˆ‡๐‘” ๐‘ฅ๐‘ข
  • ๐œƒ: step-size
  • Assume:

||โˆ‡๐‘” ๐‘ฆ โˆ’ โˆ‡๐‘” ๐‘ง || โ‰ค ๐‘€||๐‘ฆ โˆ’ ๐‘ง||

slide-22
SLIDE 22

Convergence to FOSP

๐‘” ๐‘ฅ๐‘ข+1 โ‰ค ๐‘” ๐‘ฅ๐‘ข + โˆ‡๐‘” ๐‘ฅ๐‘ข , ๐‘ฅ๐‘ข+1 โˆ’ ๐‘ฅ๐‘ข + ๐‘€ 2 ||๐‘ฅ๐‘ข+1 โˆ’ ๐‘ฅ๐‘ข||2 ๐‘” ๐‘ฅ๐‘ข+1 โ‰ค ๐‘” ๐‘ฅ๐‘ข โˆ’ 1 โˆ’ ๐‘€๐œƒ 2 ๐œƒ||โˆ‡๐‘” ๐‘ฅ๐‘ข ||2 โ‰ค ๐‘” ๐‘ฅ๐‘ข โˆ’ 1 2๐‘€ ||โˆ‡๐‘” ๐‘ฅ๐‘ข ||2 ||โˆ‡๐‘” ๐‘ฅ๐‘ข ||2 โ‰ค ๐‘” ๐‘ฅ๐‘ข โˆ’ ๐‘” ๐‘ฅ๐‘ข+1 1 2๐‘€ เท

๐‘ข

||โˆ‡๐‘” ๐‘ฅ๐‘ข ||2 โ‰ค ๐‘” ๐‘ฅ0 โˆ’ ๐‘”(๐‘ฅโˆ—) min

๐‘ข

||โˆ‡๐‘” ๐‘ฅ๐‘ข || โ‰ค 2๐‘€ (๐‘” ๐‘ฅ0 โˆ’ ๐‘”(๐‘ฅโˆ—)) ๐‘ˆ โ‰ค ๐œ—

๐‘ˆ = ๐‘ƒ ๐‘€ โ‹… (๐‘” ๐‘ฅ0 โˆ’ ๐‘” ๐‘ฅโˆ— ) ๐œ—2

slide-23
SLIDE 23

Accelerated Gradient Descent for FOSP?

  • For t = 1, 2, …, T
  • w_{t+1}^{md} = (1 − α_t)·w_t^{ag} + α_t·w_t
  • w_{t+1} = w_t − η_t·∇f(w_{t+1}^{md})
  • w_{t+1}^{ag} = w_{t+1}^{md} − β_t·∇f(w_{t+1}^{md})
  • Convergence? min_t ||∇f(w_t)|| ≤ ε
  • For T = O(L·(f(w_0) − f(w*))/ε)
  • If convex: T = O((L·(f(w_0) − f(w*)))^{1/4}/ε)

Ghadimi and Lan - 2013

๐‘ฅ๐‘ข ๐‘ฅ๐‘ข

๐‘๐‘•

๐‘ฅ๐‘ข

๐‘›๐‘’

๐‘ฅ๐‘ข+1 ๐‘ฅ๐‘ข+1

๐‘๐‘•

slide-24
SLIDE 24

Non-convex Optimization: Sum of Functions

  • What if the function has more structure?

min_w f(w) = (1/n) Σ_{i=1}^{n} f_i(w)

  • ∇f(w) = (1/n) Σ_{i=1}^{n} ∇f_i(w)
  • I.e., computing the gradient requires O(n) computation
slide-25
SLIDE 25

Does Stochastic Gradient Descent Work?

  • For t = 1, 2, … (until convergence)
  • Sample i_t ∼ Unif[1, n]
  • w_{t+1} = w_t − η∇f_{i_t}(w_t)

Proof?
E_{i_t}[w_{t+1} − w_t] = −η∇f(w_t)
f(w_{t+1}) ≤ f(w_t) + ⟨∇f(w_t), w_{t+1} − w_t⟩ + (L/2)||w_{t+1} − w_t||²
E[f(w_{t+1})] ≤ E[f(w_t)] − (η/2)||∇f(w_t)||² + (L/2)·η²·Var
min_t ||∇f(w_t)|| ≤ (L·Var·(f(w_0) − f(w*)))^{1/4} / T^{1/4} ≤ ε  ⇒  T = O(L·Var·(f(w_0) − f(w*))/ε⁴)

slide-26
SLIDE 26

Summary: Convergence to FOSP

Algorithm (No. of Gradient Calls): Non-convex / Convex

  • GD [Folklore; Nesterov]: O(1/ε²) / O(1/ε)
  • AGD [Ghadimi & Lan-2013]: O(1/ε) / O(1/ε)

Finite-sum case, f(w) = (1/n) Σ_{i=1}^{n} f_i(w), No. of gradient calls: Non-convex / Convex

  • GD [Folklore]: O(n/ε²) / O(n/ε)
  • AGD [Ghadimi & Lan'2013]: O(n/ε) / O(n/ε)
  • SGD [Ghadimi & Lan'2013]: O(1/ε⁴) / O(1/ε²)
  • SVRG [Reddi et al-2016, Allen-Zhu & Hazan-2016]: O(n + n^{2/3}/ε²) / O(n + n/ε²)
  • MSVRG [Reddi et al-2016]: O(min(1/ε⁴, n^{2/3}/ε²)) / O(n + n/ε²)

slide-27
SLIDE 27

Finding Second Order Stationary Points (SOSP)

Second Order Stationary Point (SOSP) if:

  • โˆ‡๐‘” ๐‘ฅ = 0
  • โˆ‡2๐‘” ๐‘ฅ โ‰ฝ 0

Approximate SOSP:

  • ||โˆ‡๐‘” ๐‘ฅ || โ‰ค ๐œ—
  • ๐œ‡๐‘›๐‘—๐‘œ โˆ‡2๐‘” ๐‘ฅ

โ‰ฅ โˆ’ ๐œ๐œ—

Second Order Stationary Point (SOSP)

image credit: academo.org

slide-28
SLIDE 28

Cubic Regularization (Nesterov and Polyak-2006)

  • For t = 1, 2, … (until convergence)

w_{t+1} = argmin_w  f(w_t) + ⟨w − w_t, ∇f(w_t)⟩ + (1/2)(w − w_t)ᵀ∇²f(w_t)(w − w_t) + (ρ/6)||w − w_t||³

  • Assumption: Hessian continuity, i.e., ||∇²f(x) − ∇²f(y)|| ≤ ρ||x − y||
  • Convergence to SOSP? T = O(1/ε^{1.5})
  • But requires Hessian computation! (even storing the Hessian is O(d²))
  • Can we find an SOSP using only gradients?
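The sketch below implements one cubic-regularization step, solving the inner argmin only approximately with plain gradient descent on the cubic model; this crude inner solver and the toy function f are assumptions for illustration, not the exact method analyzed by Nesterov and Polyak.

```python
import numpy as np

def cubic_step(w_t, grad, hess, rho, inner_iters=200, lr=0.05):
    """One cubic-regularization step: approximately minimize the cubic model
    m(s) = <grad, s> + 0.5 * s^T hess s + (rho/6)*||s||^3 over s = w - w_t,
    using plain gradient descent on m (a crude stand-in for exact subproblem solvers)."""
    s = np.zeros_like(w_t)
    for _ in range(inner_iters):
        gm = grad + hess @ s + 0.5 * rho * np.linalg.norm(s) * s  # gradient of the cubic model
        s = s - lr * gm
    return w_t + s

# Illustrative function (an assumption): f(w) = 0.25*||w||^4 - 0.5*||w||^2 (non-convex, saddle at 0).
f_grad = lambda w: (np.dot(w, w) - 1.0) * w
f_hess = lambda w: (np.dot(w, w) - 1.0) * np.eye(len(w)) + 2.0 * np.outer(w, w)
w = np.array([0.1, -0.2])
for _ in range(10):
    w = cubic_step(w, f_grad(w), f_hess(w), rho=2.0)
```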
slide-29
SLIDE 29

Noisy Gradient Descent for SOSP

  • For t = 1, 2, … (until convergence)
  • If ||∇f(w_t)|| ≥ ε:
  • w_{t+1} = w_t − η∇f(w_t)
  • Else:
  • w_{t+1} = w_t + ζ,  ζ ∼ γ·N(0, I)
  • Run w_{t+1} = w_t − η∇f(w_t) for the next r iterations
  • Claim: the above algorithm converges to an SOSP in O(1/ε²) iterations

Ge et al-2015, Jin et al-2017
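A minimal sketch of this noisy (perturbed) gradient descent loop; the perturbation radius, escape window, and saddle-shaped test function are illustrative assumptions.

```python
import numpy as np

def perturbed_gd(grad_f, w0, eta=0.1, eps=1e-3, radius=1e-2, escape_iters=50, T=2000, seed=0):
    """Sketch of noisy gradient descent: take gradient steps while the gradient is large;
    when it is small (near a possible saddle), add a small random perturbation and keep descending."""
    rng = np.random.default_rng(seed)
    w = w0.copy()
    t = 0
    while t < T:
        g = grad_f(w)
        if np.linalg.norm(g) >= eps:
            w = w - eta * g
            t += 1
        else:
            w = w + radius * rng.normal(size=w.shape)   # perturbation zeta ~ gamma * N(0, I)
            for _ in range(escape_iters):                # descend for the next r iterations
                w = w - eta * grad_f(w)
            t += escape_iters + 1
    return w

# Illustrative saddle (an assumption): f(w) = 0.5*w1^2 - 0.5*w2^2 + 0.25*w2^4, saddle at the origin.
grad_f = lambda w: np.array([w[0], -w[1] + w[1]**3])
print(perturbed_gd(grad_f, np.array([1.0, 0.0])))  # escapes the w2 = 0 saddle direction
```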

slide-30
SLIDE 30

Proof

  • FOSP analysis: convergence in O(1/ε²) iterations
  • But ∇²f(w_t) ⋡ 0
  • That is, λ_min(∇²f(w_t)) < −√(ρε)

(The noisy gradient descent algorithm from Slide 29 is shown alongside.)

image credit: academo.org

slide-31
SLIDE 31

Proof

  • Random perturbation followed by gradient descent leads to a decrease in the objective function

(The noisy gradient descent algorithm from Slide 29 is shown alongside.)

image credit: academo.org

slide-32
SLIDE 32

Proof?

  • Random perturbation followed by gradient descent leads to a decrease in the objective function
  • Hessian continuity ⇒ the function is nearly quadratic in a small neighborhood
  • f(w) ≈ f(w_t) + ⟨∇f(w_t), w − w_t⟩ + (1/2)(w − w_t)ᵀ∇²f(w_t)(w − w_t)

w_{r+t} = w_{r−1+t} − η∇²f(w_t)(w_{r−1+t} − w_t)  ⇒  w_{r+t} − w_t = (I − η∇²f(w_t))^r (w_{t+1} − w_t)

(The noisy gradient descent algorithm from Slide 29 is shown alongside.)

slide-33
SLIDE 33

Proof?

  • Random perturbation followed by gradient descent leads to a decrease in the objective function
  • Hessian continuity ⇒ the function is nearly quadratic in a small neighborhood
  • f(w) ≈ f(w_t) + ⟨∇f(w_t), w − w_t⟩ + (1/2)(w − w_t)ᵀ∇²f(w_t)(w − w_t)

w_{r+t} = w_{r−1+t} − η∇²f(w_t)(w_{r−1+t} − w_t)  ⇒  w_{r+t} − w_t = (I − η∇²f(w_t))^r (w_{t+1} − w_t)

  • w_{r+t} − w_t converges to the largest eigenvector of I − η∇²f(w_t)
  • which is the smallest (most negative) eigenvector of ∇²f(w_t)
  • Hence, (w_{r+t} − w_t)ᵀ∇²f(w_t)(w_{r+t} − w_t) ≤ −γ²√(ρε)
  • f(w_{r+t}) ≤ f(w_t) − γ²√(ρε)

(The noisy gradient descent algorithm from Slide 29 is shown alongside.)

slide-34
SLIDE 34

Proof

  • Entrapment near SOSP

(The noisy gradient descent algorithm from Slide 29 is shown alongside.)

Final result: convergence to an SOSP in O(1/ε²) iterations

Ge et al-2015, Jin et al-2017 image credit: academo.org

slide-35
SLIDE 35

Summary: Convergence to SOSP

Algorithm (No. of Gradient Calls): Non-convex / Convex

  • Noisy GD [Jin et al-2017, Ge et al-2015]: O(1/ε²) / O(1/ε)
  • Noisy Accelerated GD [Jin et al-2017]: O(1/ε^{1.75}) / O(1/ε)
  • Cubic Regularization [Nesterov & Polyak-2006]: O(1/ε^{1.5}) / N/A

Finite-sum case, f(w) = (1/n) Σ_{i=1}^{n} f_i(w), No. of gradient calls: Non-convex / Convex

  • Noisy GD [Jin et al-2017, Ge et al-2015]: O(n/ε²) / O(n/ε)
  • Noisy AGD [Jin et al-2017]: O(n/ε^{1.75}) / O(n/ε)
  • Noisy SGD [Jin et al-2017, Ge et al-2015]: O(1/ε⁴) / O(1/ε²)
  • SVRG [Allen-Zhu-2018]: O(n + n^{3/4}/ε²) / O(n + n/ε²)

slide-36
SLIDE 36

Convergence to Global Optima?

  • FOSP/SOSP methods can't even guarantee convergence to a local optimum
  • Can we guarantee global optimality for some "nicer" non-convex problems?
  • Yes!!!
  • Use statistics ☺

image credit: academo.org

slide-37
SLIDE 37

Can Statistics Help: Realizable models!

  • Data points: (x_i, y_i) ∼ D
  • D: nice distribution
  • E[y_i] = φ(x_i, w*)

ŵ = argmin_w Σ_i loss(y_i, φ(x_i, w))

  • That is, w* is the optimal solution!
  • Parameter learning
slide-38
SLIDE 38

Learning Neural Networks: Provably

  • ๐‘ง๐‘— = 1 โ‹… ๐œ ๐‘‹

โˆ—๐‘ฆ๐‘—

  • ๐‘ฆ๐‘— โˆผ ๐‘‚(0, ๐ฝ)

min

๐‘‹ เท ๐‘—

๐‘ง๐‘—โˆ’1 โ‹… ๐œ ๐‘‹๐‘ฆ๐‘—

2

  • Does gradient descent converge to global optima: ๐‘‹

โˆ—?

  • NO!!!
  • The objective function has poor local minima [Shamir et al-2017, Lee et al-2017]

๐‘ฆ๐‘—

1 1 1 1 1

๐‘ง๐‘— ๐‘‹

โˆ—

slide-39
SLIDE 39

Learning Neural Networks: Provably

  • But there are no local minima within a constant distance of W*
  • If ||W_0 − W*|| ≤ c for some constant c, then gradient descent (W_{t+1} = W_t − η∇f(W_t)) converges to W*
  • No. of iterations: log(1/ε)

Can we get rid of the initialization condition? Yes, but by changing the network [Liang-Lee-Srikant'2018]

Zhong-Song-J-Bartlett-Dhillon'2017

slide-40
SLIDE 40

Learning with Structure

  • ๐‘ง๐‘— = ๐œš ๐‘ฆ๐‘—, ๐‘ฅโˆ— , ๐‘ฆ๐‘— โˆผ ๐ธ โˆˆ ๐‘†๐‘’ ,

1 โ‰ค ๐‘— โ‰ค ๐‘œ

  • But no. of samples are limited!
  • For example, ๐‘—๐‘” ๐‘œ โ‰ค ๐‘’?
  • Can we still recover ๐‘ฅโˆ—? In general, no!
  • But, what if ๐‘ฅโˆ— has some structure?
slide-41
SLIDE 41

Sparse Linear Regression

  • But: ๐‘œ โ‰ช ๐‘’
  • ๐‘ฅ: ๐‘ก โˆ’sparse (๐‘ก non-zeros)
  • Information theoretically: ๐‘œ = ๐‘ก log ๐‘’ samples should suffice

0.1 1 โ‹ฎ 0.9

๐‘Œ ๐‘ฅ = =

n

โ‹ฎ

d

๐‘ง

41

slide-42
SLIDE 42

Learning with structure

  • Linear classification/regression
  • C = {w : ||w||_0 ≤ s}
  • s ≪ d
  • Matrix completion
  • C = {W : rank(W) ≤ r}
  • r ≪ d_1, d_2

min_w f(w)  s.t.  w ∈ C

slide-43
SLIDE 43

Other Examples

  • Low-rank tensor completion
  • C = {W : tensor-rank(W) ≤ r}
  • r ≪ d_1, d_2, d_3
  • Robust PCA
  • C = {W : W = L + S, rank(L) ≤ r, ||S||_0 ≤ s}
  • r ≪ d_1, d_2;  s ≪ d_1 × d_2

slide-44
SLIDE 44

Non-convex Structures

  • Linear classification/regression
  • C = {w : ||w||_0 ≤ s},  s ≪ d
  • NP-Hard: ||w||_0 is non-convex
  • Matrix completion
  • C = {W : rank(W) ≤ r},  r ≪ d_1, d_2
  • NP-Hard: rank(W) is non-convex
slide-45
SLIDE 45

Non-convex Structures

  • Low-rank tensor completion
  • C = {W : tensor-rank(W) ≤ r},  r ≪ d_1, d_2, d_3
  • Indeterminate: tensor-rank(W) is non-convex
  • Robust PCA
  • C = {W : W = L + S, rank(L) ≤ r, ||S||_0 ≤ s},  r ≪ d_1, d_2,  s ≪ d_1 × d_2
  • NP-Hard: rank(L), ||S||_0 are non-convex
slide-46
SLIDE 46

Technique: Projected Gradient Descent

min_w f(w)  s.t.  w ∈ C

  • w_{t+1} = w_t − ∇_w f(w_t)
  • w_{t+1} = P_C(w_{t+1})

where P_C(w_{t+1}) = argmin_w ||w − w_{t+1}||²  s.t.  w ∈ C
slide-47
SLIDE 47

Results for Several Problems

  • Sparse regression [Jain et al.'14, Garg and Khandekar'09]
  • Sparsity
  • Robust Regression [Bhatia et al.'15]
  • Sparsity + output sparsity
  • Vector-valued Regression [Jain & Tewari'15]
  • Sparsity + positive definite matrix
  • Dictionary Learning [Agarwal et al.'14]
  • Matrix Factorization + Sparsity
  • Phase Sensing [Netrapalli et al.'13]
  • System of Quadratic Equations

slide-48
SLIDE 48

Results Contdโ€ฆ

  • Low-rank Matrix Regression [Jain et al.'10, Jain et al.'13]
  • Low-rank structure
  • Low-rank Matrix Completion [Jain & Netrapalli'15, Jain et al.'13]
  • Low-rank structure
  • Robust PCA [Netrapalli et al.'14]
  • Low-rank + Sparse Matrices
  • Tensor Completion [Jain and Oh'14]
  • Low tensor rank
  • Low-rank matrix approximation [Bhojanapalli et al.'15]
  • Low-rank structure

slide-49
SLIDE 49

Sparse Linear Regression

  • But: ๐‘œ โ‰ช ๐‘’
  • ๐‘ฅ: ๐‘ก โˆ’sparse (๐‘ก non-zeros)

0.1 1 โ‹ฎ 0.9

๐‘Œ ๐‘ฅ = =

n

โ‹ฎ

d

๐‘ง

49

slide-50
SLIDE 50

Sparse Linear Regression

min_w ||y − Xw||²  s.t.  ||w||_0 ≤ s

  • ||y − Xw||² = Σ_i (y_i − ⟨x_i, w⟩)²
  • ||w||_0: number of non-zeros
  • NP-hard problem in general
  • ℓ_0: non-convex function

slide-51
SLIDE 51

Technique: Projected Gradient Descent

min_w f(w) = ||y − Xw||²  s.t.  ||w||_0 ≤ s

  • w_{t+1} = w_t − ∇_w f(w_t)
  • w_{t+1} = P_s(w_{t+1})   [Jain, Tewari, Kar'2014]

where P_s(w_{t+1}) = argmin_w ||w − w_{t+1}||²  s.t.  ||w||_0 ≤ s  (hard thresholding: keep the s largest-magnitude entries)

slide-52
SLIDE 52

Statistical Guarantees

๐‘ง๐‘— = โŒฉ๐‘ฆ๐‘—, ๐‘ฅโˆ—โŒช + ๐œƒ๐‘—

  • ๐‘ฆ๐‘— โˆผ ๐‘‚(0, ฮฃ)
  • ๐œƒ๐‘— โˆผ ๐‘‚(0, ๐œ‚2)
  • ๐‘ฅโˆ—: ๐‘ก โˆ’sparse

|| เท ๐‘ฅ โˆ’ ๐‘ฅโˆ—|| โ‰ค ๐œ‚๐œ†3 ๐‘ก log ๐‘’ ๐‘œ

  • ๐œ† = ๐œ‡1(ฮฃ)/๐œ‡๐‘’(ฮฃ)

[Jain, Tewari, Karโ€™2014]

52

slide-53
SLIDE 53

Low-rank Matrix Completion

min_W Σ_{(i,j)∈Ω} (W_ij − M_ij)²   s.t.  rank(W) ≤ r,   Ω: set of known entries

  • Special case of low-rank matrix regression
  • However, the assumptions required by the regression analysis are not satisfied
slide-54
SLIDE 54

Technique: Projected Gradient Descent

  • ๐‘‹

0 = 0

  • For t=0:T-1

๐‘‹

๐‘ข+1 = ๐‘„ ๐‘  ๐‘‹ ๐‘ข โˆ’ ๐œƒ๐›‚๐‘”(๐‘‹ ๐‘ข)

  • ๐‘„

๐‘™(๐‘Ž): projection onto set of rank-r projection

  • Singular Value Projection
  • Pros:
  • Fast (always, rank-r SVD)
  • Matrix completion: ๐‘ƒ(๐‘’ โ‹… ๐‘ 3)!
  • Cons: In general, might not even converge
  • Our Result: Convergence under โ€œcertainโ€ assumptions

[Jain, Tewari, Karโ€™2014], [Netrapalli, Jainโ€™2014], [Jain, Meka, Dhillonโ€™2009]

54
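A minimal sketch of Singular Value Projection for matrix completion: a gradient step on the observed entries followed by a truncated SVD; the step size, iteration count, and random rank-2 instance are illustrative assumptions.

```python
import numpy as np

def svp_matrix_completion(M_obs, mask, r, eta=1.0, T=200):
    """Sketch of Singular Value Projection: gradient step on the squared error over
    observed entries, then project back to rank r via a truncated SVD."""
    W = np.zeros_like(M_obs)
    for _ in range(T):
        grad = mask * (W - M_obs)              # gradient of 0.5*sum_{(i,j) in Omega} (W_ij - M_ij)^2
        U, S, Vt = np.linalg.svd(W - eta * grad, full_matrices=False)
        W = (U[:, :r] * S[:r]) @ Vt[:r]        # P_r: keep the top-r singular triplets
    return W

# Illustrative rank-2 instance with roughly half of the entries observed (an assumption).
rng = np.random.default_rng(0)
M = rng.normal(size=(30, 2)) @ rng.normal(size=(2, 30))
mask = (rng.random(M.shape) < 0.5).astype(float)
W_hat = svp_matrix_completion(mask * M, mask, r=2)
print(np.linalg.norm(mask * (W_hat - M)) / np.linalg.norm(mask * M))
```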

slide-55
SLIDE 55

Guarantees

  • Projected Gradient Descent:
  • W_{t+1} = P_r(W_t − η∇_W f(W_t)),  ∀t
  • Show ε-approximate recovery in log(1/ε) iterations
  • Assuming:
  • M: incoherent
  • Ω: uniformly sampled
  • |Ω| ≥ n · r⁵ · log³ n
  • First near-linear-time algorithm for exact Matrix Completion with finite samples

[J., Netrapalli'2015]

slide-56
SLIDE 56

General Result for Any Function

  • ๐‘”: ๐‘†๐‘’ โ†’ ๐‘†
  • ๐‘”: satisfies RSC/RSS, i.e.,

๐›ฝ โ‹… ๐ฝ๐‘’ร—๐‘’ โ‰ผ ๐ผ ๐‘ฅ โ‰ผ ๐‘€ โ‹… ๐ฝ๐‘’ร—๐‘’, ๐‘—๐‘”, ๐‘ฅ โˆˆ ๐ท

  • PGD guarantee:

๐‘” ๐‘ฅ๐‘ˆ โ‰ค ๐‘” ๐‘ฅโˆ— + ๐œ—

After ๐‘ˆ = ๐‘ƒ(log

๐‘” ๐‘ฅ0 ๐œ—

) steps

  • If

๐‘€ ๐›ฝ โ‰ค 1.5

[J., Tewari, Karโ€™2014]

min

๐‘ฅ ๐‘”(๐‘ฅ)

๐‘ก. ๐‘ข. ๐‘ฅ โˆˆ ๐ท

slide-57
SLIDE 57

Learning with Latent Variables

min_{w, z} f(w, z)

  • Typically, z are latent variables
  • E.g., clustering: w: cluster means, z: cluster assignments
  • f: non-convex
  • NP-hard to solve in general
slide-58
SLIDE 58

Alternating Minimization

๐‘จ๐‘ข+1 = arg min

๐‘จ

๐‘”(๐‘ฅ๐‘ข, ๐‘จ) ๐‘ฅ๐‘ข+1 = arg min

๐‘ฅ ๐‘”(๐‘ฅ, ๐‘จ๐‘ข+1)

  • For example, if ๐‘”(๐‘ฅ๐‘ข, ๐‘จ) is convex and ๐‘”(๐‘ฅ, ๐‘จ๐‘ข) is convex
  • Does that imply ๐‘”(๐‘ฅ, ๐‘จ) is convex?
  • No!!!
  • ๐‘” ๐‘ฅ, ๐‘จ = ๐‘ฅ โ‹… ๐‘จ
  • Linear in both ๐‘ฅ, ๐‘จ individually
  • So can Alt. Min. converge to global optima?

image credit: academo.org

slide-59
SLIDE 59

Low-rank Matrix Completion

min_W Σ_{(i,j)∈Ω} (W_ij − M_ij)²   s.t.  rank(W) ≤ r,   Ω: set of known entries

  • Special case of low-rank matrix regression
  • However, the assumptions required by the regression analysis are not satisfied
slide-60
SLIDE 60

Matrix Completion: Alternating Minimization

W ๐‘‰ ๐‘Š๐‘ˆ โ‰… ร—

ร— F 2

y โˆ’ ๐˜ โ‹…

) (

๐‘Š๐‘ข+1 = min

๐‘Š ||๐‘ง โˆ’ ๐‘Œ โ‹… (๐‘‰๐‘ข๐‘Š๐‘ˆ)||2 2

๐‘‰๐‘ข+1 = min

๐‘‰ ||๐‘ง โˆ’ ๐‘Œ โ‹… ๐‘‰ ๐‘Š๐‘ข+1 ๐‘ˆ ||2 2
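A minimal sketch of these alternating least-squares updates for matrix completion, solving each row-wise least-squares problem in closed form; the random initialization and the rank-2 instance are illustrative assumptions (the analysis of Jain, Netrapalli, and Sanghavi uses a spectral initialization).

```python
import numpy as np

def altmin_completion(M_obs, mask, r, T=50):
    """Sketch of alternating minimization for matrix completion:
    fix U and solve a least-squares problem for each row of V, then swap roles."""
    n1, n2 = M_obs.shape
    rng = np.random.default_rng(0)
    U = rng.normal(size=(n1, r))
    V = rng.normal(size=(n2, r))
    for _ in range(T):
        # Update each row of V: every observed entry (i, j) gives the equation U[i] @ V[j] = M[i, j].
        for j in range(n2):
            rows = np.where(mask[:, j] > 0)[0]
            V[j], *_ = np.linalg.lstsq(U[rows], M_obs[rows, j], rcond=None)
        # Update each row of U symmetrically.
        for i in range(n1):
            cols = np.where(mask[i, :] > 0)[0]
            U[i], *_ = np.linalg.lstsq(V[cols], M_obs[i, cols], rcond=None)
    return U @ V.T

# Illustrative rank-2 instance with roughly 60% of entries observed (an assumption).
rng = np.random.default_rng(1)
M = rng.normal(size=(25, 2)) @ rng.normal(size=(2, 25))
mask = (rng.random(M.shape) < 0.6).astype(float)
W_hat = altmin_completion(mask * M, mask, r=2)
print(np.linalg.norm(W_hat - M) / np.linalg.norm(M))
```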

slide-61
SLIDE 61

Results: Alternating Minimization

  • Provable global convergence [J., Netrapalli, Sanghavi'13]
  • Rate of convergence: geometric, ||W_T − W*|| ≤ 2^{−T}
  • Assumptions:
  • Matrix regression: RIP
  • Matrix completion: uniform sampling and number of samples |Ω| ≥ O(d·r⁶)

[Jain, Netrapalli, Sanghavi'13]

slide-62
SLIDE 62

General Results

min_{w, z} f(w, z)

  • Alternating minimization: optimal?
  • If:
  • Joint Restricted Strong Convexity (strong convexity close to the optimum)
  • Restricted Smoothness (smoothness near the optimum)
  • Cross-product bound:

⟨w − w*, ∇_w f(w, z) − ∇_w f(w, z*)⟩ − ⟨z − z*, ∇_z f(w, z) − ∇_z f(w*, z)⟩ ≤ O(||w − w*||² + ||z − z*||²)

Ha and Barber-2017, Jain and Kar-2018

slide-63
SLIDE 63

Summary I

Non-convex Optimization: two approaches

  • 1. General non-convex functions
  • a. First Order Stationary Point
  • b. Second Order Stationary Point
  • 2. Statistical non-convex functions: learning with structure
  • a. Projected Gradient Descent (RSC/RSS)
  • b. Alternating minimization/EM algorithms (RSC/RSS)
slide-64
SLIDE 64

Summary II

  • First Order Stationary Point: f(w) ≤ f(w′) + O(||w − w′||²)
  • Tools: gradient descent, acceleration, stochastic gd, variance reduction
  • Key quantity: iteration complexity
  • Several questions: for example, can we do better? Especially in finite sum setting
  • Second order stationary point: f(w) ≤ f(w′) + O(||w − w′||³)
  • Tools: noise+gd, noise+acceleration, noise+sgd, noise+variance reduction
  • Several questions: better rates? Can we remove Lipschitz condition on Hessian?
slide-65
SLIDE 65

Summary III

  • Projected Gradient Descent
  • Works under statistical conditions like RSC/RSS
  • Still several open questions for most problems
  • E.g., tight guarantees for support recovery in sparse linear regression?
  • Alternating minimization
  • Works under some assumptions on f
  • What is the weakest condition on f for Alt. Min. to work?