

SLIDE 1

The landscape of non-convex losses for statistical learning problems

Song Mei

Stanford University

October 19, 2017

Song Mei (Stanford University) The landscape of non-convex optimization October 19, 2017 1 / 32

SLIDE 2

Deep learning


SLIDE 4

Convolutional Neural Network

SLIDE 5

Non-convex optimization

SLIDE 6

Why does the non-convex neural network perform well?


SLIDE 8

Why does some non-convex optimization perform well?

- Stochastic gradient descent escapes bad local minima.
- Good initialization escapes bad local minima.
- Globally, there are fewer bad local minima.
- ...

SLIDE 9

Non-convex optimization: analysis of global geometry

Number and locations of saddle points and local minima.

SLIDE 10

Let's do it!

The objective function

$$\min_{W_i} \; \frac{1}{n} \sum_{i=1}^{n} \big\{ y_i - \sigma(W_k \cdots \sigma(W_2\, \sigma(W_1 x_i))) \big\}^2$$

SLIDE 11

Let's do it!

The objective function

$$\min_{W_i} \; \frac{1}{n} \sum_{i=1}^{n} \big\{ y_i - \sigma(W_2\, \sigma(W_1 x_i)) \big\}^2$$

SLIDE 12

Let's do it!

The objective function

$$\min_{\theta} \; \frac{1}{n} \sum_{i=1}^{n} \big\{ y_i - \sigma(\langle \theta, x_i \rangle) \big\}^2$$

SLIDE 13

Binary linear classification

The model

$z_i = (x_i, y_i)$, with $x_i \in \mathbb{R}^d$, $y_i \in \{0, 1\}$.

SLIDE 14

One node neural network

The model

$z_i = (x_i, y_i)$, with $x_i \in \mathbb{R}^d$, $y_i \in \{0, 1\}$.

- Convex logit loss ($\ell_c$ is convex in $\theta$):
  $$\ell_c(\theta; z) = -y \langle x, \theta \rangle + \log\{1 + \exp(\langle x, \theta \rangle)\}.$$
- Non-convex loss ($\ell$ is not convex in $\theta$):
  $$\ell(\theta; z) = \{y - \sigma(\langle x, \theta \rangle)\}^2, \quad \text{where } \sigma(t) = 1/(1 + e^{-t}).$$
- Empirical risk:
  $$\widehat{R}_n(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell(\theta; z_i) = \frac{1}{n} \sum_{i=1}^{n} \{y_i - \sigma(\langle \theta, x_i \rangle)\}^2.$$
- Empirical risk minimizer:
  $$\widehat{\theta}_n = \arg\min_{\theta \in \mathsf{B}^d(R)} \widehat{R}_n(\theta).$$

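The two losses on slide 14 are easy to compare numerically. Below is a minimal sketch (not from the talk; the one-dimensional data point and grid are arbitrary choices) that checks convexity along a line via discrete second differences: they stay nonnegative for the logit loss but go negative for the squared loss, confirming that $\ell$ is non-convex in $\theta$.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def logit_loss(theta, x, y):
    # Convex logit loss: -y<x,theta> + log(1 + exp(<x,theta>))
    s = x @ theta
    return -y * s + np.log1p(np.exp(s))

def sq_loss(theta, x, y):
    # Non-convex squared loss: (y - sigma(<x,theta>))^2
    return (y - sigmoid(x @ theta)) ** 2

# One data point, d = 1: scan each loss along theta to compare curvature.
x, y = np.array([1.0]), 1.0
ts = np.linspace(-6.0, 6.0, 241)
logit_vals = np.array([logit_loss(np.array([t]), x, y) for t in ts])
sq_vals = np.array([sq_loss(np.array([t]), x, y) for t in ts])

# Discrete second differences: nonnegative everywhere iff convex on the line.
d2_logit = np.diff(logit_vals, 2)
d2_sq = np.diff(sq_vals, 2)
print(d2_logit.min(), d2_sq.min())
```

The same second-difference check along random directions is a quick way to probe the full empirical risk $\widehat{R}_n$ for non-convexity.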


SLIDE 20

A negative theoretical result

Theorem (Auer et al., 1996)

For the one-node neural network, for all $n, d > 0$, there exists a dataset $(x_i, y_i)_{i=1}^{n}$ such that the empirical risk $\widehat{R}_n(\theta)$ has $\lfloor n/d \rfloor^d$ distinct local minima.

The landscape of $\widehat{R}_n(\theta)$ is very rough. Is this the end of the world for deep learning?


SLIDE 25

Real data experiment

- The "Australian" data set from Statlog: $d = 11$, $n = 683$.
- Random initialization $\theta^{(0)} \sim \mathsf{N}(0, I_d)$.
- Run gradient descent and track the path $\theta^{(k)}$.
- Generate multiple paths with independent initializations.
- Plot the standard deviation over paths, $\mathrm{std}(\theta^{(k)})$, versus $k$.

Figure: $\mathrm{std}(\theta^{(k)})$ versus the number of iterations (y-axis on a log scale from $10^{-7}$ to $10^{-1}$).
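The experiment on slide 25 can be sketched in a few lines. The Statlog data is not bundled here, so the sketch below substitutes synthetic data drawn from the logistic model with the slide's dimensions ($d = 11$, $n = 683$); the learning rate, iteration count, and number of paths are arbitrary assumptions, not the talk's settings.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# Synthetic stand-in for the Statlog data: logistic model with true theta0.
d, n = 11, 683
theta0 = rng.normal(size=d) / np.sqrt(d)
X = rng.normal(size=(n, d))
Y = (rng.random(n) < sigmoid(X @ theta0)).astype(float)

def grad_Rn(theta):
    # Gradient of R_n(theta) = mean_i (y_i - sigma(<theta, x_i>))^2.
    p = sigmoid(X @ theta)
    return (-2.0 / n) * (X.T @ ((Y - p) * p * (1 - p)))

# Several gradient-descent paths from independent N(0, I_d) initializations.
n_paths, n_iters, lr = 10, 2000, 2.0
paths = []
for _ in range(n_paths):
    theta = rng.normal(size=d)
    for _ in range(n_iters):
        theta = theta - lr * grad_Rn(theta)
    paths.append(theta)
paths = np.stack(paths)

# Standard deviation across paths: small if every run hits the same minimizer.
spread = paths.std(axis=0).max()
print(spread)
```

A small `spread` after many iterations is the numerical signature of the unique minimizer observed on the slide.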


SLIDE 28

One node neural network

On real data, we "always" observe a unique minimum! Why? Data generated by nature is not against us!

SLIDE 29

A negative theoretical result (recap)

Theorem (Auer et al., 1996)

For the one-node neural network, for all $n, d > 0$, there exists a dataset $(x_i, y_i)_{i=1}^{n}$ such that the empirical risk $\widehat{R}_n(\theta)$ has $\lfloor n/d \rfloor^d$ distinct local minima.

The landscape of $\widehat{R}_n(\theta)$ is very rough.

SLIDE 30

A positive result

Theorem (Mei, Bai, Montanari, 2016)

Assume the $Y_i$ are generated via $\mathbb{P}(Y_i = 1 \mid X_i) = \sigma(\langle X_i, \theta_0 \rangle)$. Under mild assumptions on the $X_i$, as soon as $n = \Omega(d \log d)$, with high probability:
(a) $\widehat{R}_n(\theta)$ has a unique local minimizer $\widehat{\theta}_n$ in $\mathsf{B}^d(0, R)$.
(b) $\widehat{\theta}_n$ satisfies $\|\widehat{\theta}_n - \theta_0\|_2 = O(\sqrt{(d \log n)/n})$.
(c) Gradient descent converges exponentially fast to $\widehat{\theta}_n$.

The landscape of $\widehat{R}_n(\theta)$ is actually smooth!


SLIDE 33

Why does assuming a statistical model make the landscape of the empirical risk smooth?

1. Assuming a statistical model $(X_i, Y_i) \overset{\text{i.i.d.}}{\sim} \mathbb{P}$, $i = 1, \dots, n$, we can define the population risk
   $$R(\theta) = \mathbb{E}\big[\widehat{R}_n(\theta)\big] = \mathbb{E}_{X,Y}\big[(Y - \sigma(\langle \theta, X \rangle))^2\big].$$
   The population risk is usually very smooth.

2. We can transfer the good properties of the population risk to the empirical risk using a uniform convergence argument, so the empirical risk will also be smooth.
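The transfer from population to empirical risk can be illustrated with a quick Monte Carlo check (a sketch under the logistic model of slide 30; the dimension, query point, and sample sizes are arbitrary assumptions). The population risk is approximated by a very large sample, and $|\widehat{R}_n(\theta) - R(\theta)|$ at a fixed $\theta$ typically shrinks at the usual $O(1/\sqrt{n})$ rate; the theorem on slide 38 upgrades this pointwise statement to a supremum over $\theta$, and to gradients and Hessians.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

d = 5
theta0 = np.ones(d) / np.sqrt(d)
theta = np.zeros(d)
theta[0] = 1.0                      # a fixed query point

def emp_risk(theta_q, n):
    # Empirical risk R_n at theta_q on a fresh sample from the model.
    X = rng.normal(size=(n, d))
    Y = (rng.random(n) < sigmoid(X @ theta0)).astype(float)
    return np.mean((Y - sigmoid(X @ theta_q)) ** 2)

# Population risk R(theta) approximated with a very large sample.
R_pop = emp_risk(theta, 2_000_000)

# |R_n - R| at the query point for increasing n.
devs = {n: abs(emp_risk(theta, n) - R_pop) for n in (100, 10_000, 1_000_000)}
print(devs)
```

Uniform convergence is a stronger statement than this pointwise check: the bound holds simultaneously over the whole ball $\mathsf{B}^d(0, R)$.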


SLIDE 37

Population risk and empirical risk

The population risk has good properties under mild assumptions.

Figure: Population risk (contour plot over $(\theta_1, \theta_2)$, with $\theta_0$ marked).

Figure: An instance of the empirical risk ($\theta_0 = [1, 0]$, $\hat\theta_n = [0.816, -0.268]$).

How can we relate the properties of the empirical risk to the population risk? Uniform convergence!

SLIDE 38

Uniform convergence of gradients and Hessians

Theorem (Uniform convergence; informal)

Under these settings, for any $\delta > 0$, there exists a positive constant $C$, depending on $(R, \delta)$ but independent of $n$ and $d$, such that as long as $n \ge C d \log d$, we have

1. $\mathbb{P}\Big( \sup_{\theta \in \mathsf{B}^d(0,R)} \big\| \nabla \widehat{R}_n(\theta) - \nabla R(\theta) \big\|_2 \le \sqrt{\tfrac{C d \log n}{n}} \Big) \ge 1 - \delta.$

2. $\mathbb{P}\Big( \sup_{\theta \in \mathsf{B}^d(0,R)} \big\| \nabla^2 \widehat{R}_n(\theta) - \nabla^2 R(\theta) \big\|_{\mathrm{op}} \le \sqrt{\tfrac{C d \log n}{n}} \Big) \ge 1 - \delta.$

The proof is based on concentration inequalities and covering numbers.

SLIDE 39

Uniform convergence implies a unique minimum of the empirical risk

The landscape of the non-convex empirical risk:
1. What we thought: barriers and many local minima.
2. What hopefully is true: only good local minima, smooth far from the minima.
3. What we will prove: a uniformly smooth surface.

Figure: Landscape of the empirical risk (population risk vs. empirical risk, risk global minimizer vs. ERM, in each of the three scenarios).

SLIDE 40

Numerical experiment

Figure: Probability of finding a unique local minimum (success rate versus $n/(d \log d)$, for $d = 20, 40, 80, 160, 320$).

SLIDE 41

Extensions

- Robust regression, Gaussian mixture models, etc.; high-dimensional settings $d \gg n$. [Mei et al., 2017]
- ReLU activation. [Tian, 2017]
- Two-layer neural networks. [Soltanolkotabi et al., 2017], [Zhong et al., 2017]
- Deep neural networks. [Choromanska et al., 2015]

SLIDE 42

Interlude

Before studying complex neural networks, maybe we can first study some simpler non-convex optimization problems.

SLIDE 43

MaxCut Problem

- $G$: a positively weighted graph; $A_G$: its adjacency matrix.
- MaxCut of $G$, known to be NP-hard:
  $$\max_{x \in \{\pm 1\}^n} \; \frac{1}{4} \sum_{i,j=1}^{n} A_{G,ij} (1 - x_i x_j). \tag{MaxCut}$$
- SDP relaxation, with a $0.878$-approximation guarantee [Goemans and Williamson, 1995]:
  $$\max_{X \in \mathbb{R}^{n \times n}} \; \frac{1}{4} \sum_{i,j=1}^{n} A_{G,ij} (1 - X_{ij}), \quad \text{subject to } X_{ii} = 1, \; X \succeq 0. \tag{SDPCut}$$


SLIDE 46

The MaxCut SDP problem

- $A \in \mathbb{R}^{n \times n}$ symmetric.
- MaxCut SDP:
  $$\max_{X \in \mathbb{R}^{n \times n}} \; \langle A, X \rangle, \quad \text{subject to } X_{ii} = 1, \; i \in [n], \; X \succeq 0. \tag{SDP}$$
- Applications: the MaxCut problem, $\mathbb{Z}_2$ synchronization, the stochastic block model, ...


SLIDE 49

Burer-Monteiro approach

- Convex formulation ($n$ up to $10^3$ using interior-point methods):
  $$\max_{X \in \mathbb{R}^{n \times n}} \; \langle A, X \rangle, \quad \text{subject to } X_{ii} = 1, \; i \in [n], \; X \succeq 0. \tag{SDP}$$
- Change of variables: $X = \sigma \sigma^{\mathsf{T}}$, $\sigma \in \mathbb{R}^{n \times k}$, $k \ll n$.
- Non-convex formulation ($n$ up to $10^5$):
  $$\max_{\sigma \in \mathbb{R}^{n \times k}} \; \langle \sigma, A \sigma \rangle, \quad \text{subject to } \sigma = [\sigma_1, \dots, \sigma_n]^{\mathsf{T}}, \; \|\sigma_i\|_2 = 1, \; i \in [n]. \tag{$k$-Ncvx-SDP}$$

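The non-convex formulation above lends itself to a simple first-order method: ascend the Euclidean gradient $2A\sigma$, project onto the tangent space of the product of spheres, and renormalize the rows (a retraction). The sketch below is a minimal illustration, not the solver the talk has in mind (Riemannian trust-region methods are common in practice); the random symmetric $A$, the step size, and the iteration count are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Random symmetric cost matrix standing in for A (an assumption).
n, k = 60, 5
B = rng.normal(size=(n, n))
A = (B + B.T) / 2

def normalize_rows(S):
    return S / np.linalg.norm(S, axis=1, keepdims=True)

def f(S):
    # Objective <sigma, A sigma> = trace(sigma^T A sigma).
    return float(np.trace(S.T @ A @ S))

# Projected gradient ascent on the product of unit spheres (rows of sigma).
S = normalize_rows(rng.normal(size=(n, k)))
val0 = f(S)
lr = 0.3 / np.linalg.norm(A, 2)       # conservative step size
for _ in range(3000):
    G = 2 * A @ S                                      # Euclidean gradient
    G = G - np.sum(G * S, axis=1, keepdims=True) * S   # tangent projection
    S = normalize_rows(S + lr * G)                     # step + retraction

# Rows are unit vectors, so ||S||_F^2 = n and f(S) <= n * lambda_max(A),
# an upper bound that also dominates SDP(A).
upper = n * np.linalg.eigvalsh(A)[-1]
print(val0, f(S), upper)
```

By the landscape theorem on slide 61, any local maximizer this ascent finds has value within a gap of order $O(1/k)$ of $\mathrm{SDP}(A)$.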

SLIDE 52

Related literature

- For $k \ge \sqrt{2n}$, the global maxima of the non-convex formulation coincide with the global maximizer of the convex formulation [Pataki, 1998], [Barvinok, 2001], [Burer and Monteiro, 2003].
- For $k \ge \sqrt{2n}$, the non-convex formulation has no spurious local maxima [Boumal et al., 2016].
- What if $k$ remains of order $1$ as $n \to \infty$? Are there spurious local maxima? Sadly, yes.
- How good are these local maxima? Empirically, they are good!


SLIDE 58

Geometry

$$\max_{\sigma \in \mathbb{R}^{n \times k}} \; \langle \sigma, A \sigma \rangle =: f(\sigma), \quad \text{subject to } \|\sigma_i\|_2 = 1.$$

$$\mathcal{M}_k = \{ \sigma \in \mathbb{R}^{n \times k} : \|\sigma_i\|_2 = 1 \}.$$

Definition ($\varepsilon$-approximate concave point)

We call $\sigma \in \mathcal{M}_k$ an $\varepsilon$-approximate concave point of $f$ on $\mathcal{M}_k$ if, for any tangent vector $u \in T_\sigma \mathcal{M}_k$, we have
$$\langle u, \mathrm{Hess}\, f(\sigma)[u] \rangle \le \varepsilon \langle u, u \rangle. \tag{1}$$

Remark

A local maximizer is a $0$-approximate concave point. An $\varepsilon$-approximate concave point is nearly locally optimal.


SLIDE 61

Landscape Theorem

Theorem (A Grothendieck-type inequality)

For any $\varepsilon$-approximate concave point $\sigma \in \mathcal{M}_k$ of the rank-$k$ non-convex problem, we have
$$f(\sigma) \ge \mathrm{SDP}(A) - \frac{1}{k-1}\big(\mathrm{SDP}(A) + \mathrm{SDP}(-A)\big) - \frac{n}{2}\,\varepsilon. \tag{2}$$
$\mathrm{SDP}(A)$: the maximum value of the SDP with input matrix $A$.

Geometric interpretation: the function values of all local maxima are within a gap of order $O(1/k)$ of the global maximum.

SLIDE 62

Landscape of non-convex SDP

- $f(\sigma) \ge \mathrm{SDP}(A) - \frac{1}{k-1}\big(\mathrm{SDP}(A) + \mathrm{SDP}(-A)\big) - \frac{n}{2}\,\varepsilon$.

Figure: Attainable values range from $-\mathrm{SDP}(-A)$ to $\mathrm{SDP}(A)$; every saddle point with $\varepsilon$ curvature and every local optimizer lies within $\mathrm{Gap} = \frac{1}{k-1}\big(\mathrm{SDP}(A) + \mathrm{SDP}(-A)\big) + n\varepsilon/2$ of the global optimizer.

SLIDE 63

Approximate MaxCut Guarantee

Theorem (Approximate MaxCut Guarantee)

For any $k \ge 3$, if $\sigma^\star$ is a local maximizer of the corresponding rank-$k$ non-convex problem, then we can use $\sigma^\star$ to find a $0.878 \times (1 - 1/(k-1))$-approximate MaxCut.

The global maximizer yields a $0.878$-approximate MaxCut. Any local maximizer yields a $0.878 \times (1 - 1/(k-1))$-approximate MaxCut.

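The slide does not spell out how a cut is extracted from $\sigma^\star$; the standard route is Goemans-Williamson hyperplane rounding applied to the rows $\sigma_i$. The sketch below illustrates that rounding on a random weighted graph, using row-normalized top eigenvectors as a crude stand-in for a local maximizer of the rank-$k$ problem (an assumption for illustration, not the talk's procedure).

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy positively weighted graph (an assumption).
n, k = 40, 4
W = np.triu(rng.random((n, n)), 1)
A = W + W.T

def cut_value(x):
    # (1/4) * sum_ij A_ij (1 - x_i x_j): total weight crossing the cut.
    return 0.25 * float(np.sum(A * (1 - np.outer(x, x))))

# Crude stand-in for a rank-k local maximizer: row-normalized top eigenvectors.
vecs = np.linalg.eigh(A)[1][:, -k:]
sigma = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

# Hyperplane rounding: sign of each row's projection onto a random direction.
best = 0.0
for _ in range(50):
    g = rng.normal(size=k)
    x = np.sign(sigma @ g)
    x[x == 0] = 1.0
    best = max(best, cut_value(x))

total = A.sum() / 2.0   # total edge weight; any cut is at most this
print(best, total)
```

With a genuine local maximizer $\sigma^\star$ in place of the eigenvector stand-in, the theorem guarantees the rounded cut is $0.878 \times (1 - 1/(k-1))$-approximate in expectation.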

SLIDE 65

Group Synchronization

- $\mathrm{SO}(d)$ synchronization, Orthogonal-Cut SDP:
  $$\max_{X \in \mathbb{R}^{nk \times nk}} \; \langle A, X \rangle, \quad \text{subject to } X_{ii} = I_k, \; X \succeq 0. \tag{3}$$
- A similar guarantee holds.

SLIDE 66

Conclusion

- Studied the global geometry of some non-convex optimization problems.
- Empirical risk minimization: uniform convergence excludes spurious local minima.
- Non-convex MaxCut SDP: all local maxima are near the global maximum.

What I did not emphasize: the Kac-Rice formula.