SLIDE 1

No Spurious Local Minima in Training Deep Quadratic Networks

Abbas Kazemipour

Conference on Mathematical Theory of Deep Neural Networks, October 31, 2019, New York City, NY

SLIDE 2


Need for New Optimization Theory

§ The mystery of deep neural networks and gradient descent

› Good solutions despite highly nonlinear and nonconvex landscapes

§ Roles of overparameterization, regularization, normalization and side information

SLIDE 3


Shallow Quadratic Networks

§ Quadratic NNs: a sweet spot between theory and practice
› Higher-order polynomials, analytical and continuous activation functions
§ Minimum number of hidden units k needed if xᵢ ∈ ℝᵈ?

"# '(

)# *( *+

', Activations Linear Layer Quadratic

.

.

.

.

ℒ 0, 2 = 4

#

)# − 6 )# . 6 )# = 4

78( ,

'7 *7

9"# . = 2029 : ;#

Quadratc features "#x#

9
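Below is a minimal sketch of this shallow quadratic model, assuming NumPy and made-up sizes (d, k, n are placeholders, not values from the talk); it evaluates ŷ(x) = xᵀWΛWᵀx both as a sum over hidden units and as a quadratic form, to check that the two forms agree:

```python
import numpy as np

# Hypothetical sizes: input dimension d, hidden units k, samples n.
d, k, n = 5, 3, 100
rng = np.random.default_rng(0)

W = rng.standard_normal((d, k))      # first-layer weights w_1, ..., w_k (columns of W)
lam = rng.standard_normal(k)         # linear-layer weights lambda_j
X = rng.standard_normal((n, d))      # inputs x_i (rows)

# y_hat(x_i) = sum_j lambda_j (w_j^T x_i)^2 ...
y_hat_sum = ((X @ W) ** 2) @ lam

# ... equals the quadratic form x_i^T W diag(lambda) W^T x_i.
M = W @ np.diag(lam) @ W.T
y_hat_quad = np.einsum('ni,ij,nj->n', X, M, X)
assert np.allclose(y_hat_sum, y_hat_quad)

# Squared-error loss L(Lambda, W) for some targets y.
y = rng.standard_normal(n)
loss = np.sum((y - y_hat_sum) ** 2)
```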

SLIDE 4

Simple vs. Complex Cells

§ Primary visual cortex: sensitive vs insensitive to contrast

[Rust et al., 2005]

SLIDE 5


Low-Rank Matrix Recovery

ℒ(Λ, W) = ∑ᵢ (yᵢ − ŷᵢ)²,   ŷᵢ = ⟨M, Aᵢ⟩ = ⟨W Λ Wᵀ, Aᵢ⟩

  • M = W Λ Wᵀ is low-rank
  • Aᵢ ∈ ℝᵈˣᵈ are random measurement matrices
  • yᵢ ∈ ℝ
  • W ∈ ℝᵈˣᵏ
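Choosing Aᵢ = xᵢxᵢᵀ gives ⟨WΛWᵀ, Aᵢ⟩ = xᵢᵀWΛWᵀxᵢ, so the shallow quadratic network of the previous slide is a special case of this measurement model. A quick numerical check of that identity (NumPy, with made-up sizes):

```python
import numpy as np

d, k = 4, 2
rng = np.random.default_rng(1)
W = rng.standard_normal((d, k))
lam = rng.standard_normal(k)
x = rng.standard_normal(d)

M = W @ np.diag(lam) @ W.T          # low-rank symmetric matrix, rank <= k
A = np.outer(x, x)                  # measurement matrix A_i = x_i x_i^T

inner_product = np.sum(M * A)       # <M, A_i> = trace(M^T A_i)
quadratic_form = x @ M @ x          # x_i^T M x_i
assert np.isclose(inner_product, quadratic_form)
```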

SLIDE 6


Low-Rank Matrix Recovery

§ Convexification via nuclear-norm minimization (e.g. under RIP of the Aᵢ)
§ SDP: computationally challenging
§ Can we solve for (Λ, W) instead? (Burer-Monteiro 02, 05)

minimize over M:   ℒ(M) = ∑ᵢ (yᵢ − ⟨M, Aᵢ⟩)²
subject to   rank(M) ≤ k

  • factorization W Λ Wᵀ = M (low-rank)
  • Aᵢ ∈ ℝᵈˣᵈ, yᵢ ∈ ℝ, W ∈ ℝᵈˣᵏ
  • the rank constraint makes the problem nonconvex
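As a sketch of the Burer-Monteiro idea, the snippet below runs plain gradient descent directly on the factored objective with Aᵢ = xᵢxᵢᵀ (NumPy; the planted data, step size, and iteration count are made up, the loss is averaged rather than summed for numerical stability, and the gradients are hand-derived for this special case rather than taken from the talk):

```python
import numpy as np

rng = np.random.default_rng(2)
d, k, n, lr = 4, 4, 300, 5e-4

# Hypothetical planted model with measurements A_i = x_i x_i^T.
X = rng.standard_normal((n, d))
W_true = rng.standard_normal((d, k))
lam_true = rng.choice([-1.0, 1.0], size=k)
y = ((X @ W_true) ** 2) @ lam_true

# Gradient descent on the factors (Lambda, W) instead of the d x d matrix M.
W, lam = rng.standard_normal((d, k)), rng.standard_normal(k)
for _ in range(10000):
    Z = X @ W                                        # projections w_j^T x_i
    e = y - (Z ** 2) @ lam                           # residuals e_i
    grad_W = (-4.0 / n) * X.T @ (e[:, None] * Z * lam[None, :])
    grad_lam = (-2.0 / n) * (Z ** 2).T @ e
    W, lam = W - lr * grad_W, lam - lr * grad_lam

print("mean squared error after descent:", np.mean((y - ((X @ W) ** 2) @ lam) ** 2))
```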

SLIDE 7

Global Optimality Conditions

Nonconvex (factored) problem:

minimize over (Λ, W):   ℒ(Λ, W) = ∑ᵢ (yᵢ − ⟨W Λ Wᵀ, Aᵢ⟩)²

› Computationally efficient (local search methods, e.g. SGD)
› Possible local minima

Convex problem (reparameterization W Λ Wᵀ ≡ M):

minimize over M:   ℒ(M) = ∑ᵢ (yᵢ − ⟨M, Aᵢ⟩)²

› No local minima; solved by least squares + SVD
› k ≥ d neurons sufficient

A solution is globally optimal iff ∑ᵢ eᵢ Aᵢ = 0, where eᵢ := yᵢ − ⟨W Λ Wᵀ, Aᵢ⟩.
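A sketch of the convex route under this measurement model (NumPy; the planted matrix and sample sizes are illustrative): solve an ordinary least-squares problem for M, then eigendecompose to read off Λ (eigenvalues) and W (eigenvectors), so k = d neurons suffice:

```python
import numpy as np

d, n = 4, 200                              # n well above the d(d+1)/2 unknowns in M
rng = np.random.default_rng(3)

# Hypothetical planted model: y_i = <M*, A_i> with A_i = x_i x_i^T.
X = rng.standard_normal((n, d))
M_star = np.diag(rng.choice([-1.0, 1.0], size=d))   # planted symmetric matrix
y = np.einsum('ni,ij,nj->n', X, M_star, X)

# Convex step: linear least squares in M over the features vec(A_i).
A_feats = np.einsum('ni,nj->nij', X, X).reshape(n, d * d)
m_hat, *_ = np.linalg.lstsq(A_feats, y, rcond=None)
M_hat = 0.5 * (m_hat.reshape(d, d) + m_hat.reshape(d, d).T)   # keep the symmetric part

# "SVD" step: eigendecomposition gives Lambda and W.
eigvals, W = np.linalg.eigh(M_hat)
Lam = np.diag(eigvals)
assert np.allclose(W @ Lam @ W.T, M_star, atol=1e-6)

# Global-optimality certificate from the slide: sum_i e_i A_i = 0 at the solution.
e = y - A_feats @ M_hat.reshape(-1)
S = (A_feats * e[:, None]).sum(axis=0).reshape(d, d)
assert np.allclose(S, 0, atol=1e-6)
```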

SLIDE 8

Properties of Stationary Points

§ First-order optimality
§ Second-order optimality
§ Can we force W to be full rank, or use semidefiniteness?

minimize over (Λ, W):   ℒ(Λ, W) = ∑ᵢ (yᵢ − xᵢᵀ W Λ Wᵀ xᵢ)²

First-order optimality:   Wᵀ (∑ᵢ eᵢ xᵢxᵢᵀ) W = 0
If W is full rank, then ∑ᵢ eᵢ xᵢxᵢᵀ = 0.

Second-order optimality:   ∑ᵢ (xᵢᵀ W Λ Qᵀ xᵢ)² − ∑ᵢ eᵢ xᵢᵀ Q Λ Qᵀ xᵢ ≥ 0 for every direction Q

If W is low-rank, then ∑ᵢ eᵢ xᵢxᵢᵀ ⪰ 0 or ⪯ 0.
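To make the conditions concrete, the sketch below (NumPy; the data, candidate points, and function name are all hypothetical) computes the residuals eᵢ and the certificate matrix S = ∑ᵢ eᵢ xᵢxᵢᵀ at a candidate (Λ, W); the first-order condition asks that WᵀSW = 0, and when W has full rank this forces the global-optimality certificate S = 0 itself:

```python
import numpy as np

def certificate(W, lam, X, y):
    """Residuals e_i and the certificate matrix S = sum_i e_i x_i x_i^T."""
    e = y - ((X @ W) ** 2) @ lam           # e_i = y_i - x_i^T W diag(lam) W^T x_i
    S = (X * e[:, None]).T @ X             # sum_i e_i x_i x_i^T
    return e, S

# Hypothetical planted data.
d, k, n = 4, 4, 300
rng = np.random.default_rng(4)
X = rng.standard_normal((n, d))
W_true, lam_true = rng.standard_normal((d, k)), rng.choice([-1.0, 1.0], size=k)
y = ((X @ W_true) ** 2) @ lam_true

# At the planted (globally optimal) point both quantities vanish.
_, S = certificate(W_true, lam_true, X, y)
print("||W^T S W|| at optimum:", np.linalg.norm(W_true.T @ S @ W_true))
print("||S||       at optimum:", np.linalg.norm(S))

# At a random point neither does.
_, S0 = certificate(rng.standard_normal((d, k)), rng.standard_normal(k), X, y)
print("||S||       at a random point:", np.linalg.norm(S0))
```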

SLIDE 9

Escaping spurious critical points

§ Theorem 1: The global minimum is achieved
› The solution is an eigenvalue decomposition ⇒ complexity O(d³)
§ Theorem 2: All stationary points are global minima with probability 1
› Advantage of data normalization

minimize over (Λ, W):   ℒ_N(Λ, W) = ∑ᵢ (yᵢ − xᵢᵀ W Λ Wᵀ xᵢ)² + β ||W Wᵀ − I||²_F
› nonconvex penalty; for β large enough, W is full rank and orthonormal

minimize over (α, Λ, W):   ℒ(α, Λ, W) = ∑ᵢ (yᵢ − xᵢᵀ (W Λ Wᵀ + α I) xᵢ)²
› since xᵢᵀ I xᵢ = ||xᵢ||², this adds the norm of the input as a regressor (side information)
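A sketch of the two regularized objectives as plain loss functions (NumPy; β, α, and all shapes are illustrative placeholders rather than values from the talk):

```python
import numpy as np

def loss_norm_penalty(W, lam, X, y, beta):
    """L_N(Lambda, W): squared error plus the orthonormality penalty beta * ||W W^T - I||_F^2."""
    d = W.shape[0]
    resid = y - ((X @ W) ** 2) @ lam
    penalty = np.linalg.norm(W @ W.T - np.eye(d), 'fro') ** 2
    return np.sum(resid ** 2) + beta * penalty

def loss_side_info(alpha, W, lam, X, y):
    """L(alpha, Lambda, W): the input norm ||x_i||^2 enters as an extra regressor with weight alpha."""
    resid = y - (((X @ W) ** 2) @ lam + alpha * np.sum(X ** 2, axis=1))
    return np.sum(resid ** 2)

# Tiny usage example with made-up data.
rng = np.random.default_rng(5)
d, k, n = 4, 4, 50
X, y = rng.standard_normal((n, d)), rng.standard_normal(n)
W, lam = rng.standard_normal((d, k)), rng.standard_normal(k)
print(loss_norm_penalty(W, lam, X, y, beta=1.0), loss_side_info(0.5, W, lam, X, y))
```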

SLIDE 10

Deep Quadratic Networks: Induction

ŷᵢ = ∑ⱼₖ aⱼₖ (uⱼᵀ xᵢ)(uₖᵀ xᵢ) = ∑ⱼₖ aⱼₖ (uⱼ ⊗ uₖ)ᵀ (xᵢ ⊗ xᵢ)

§ Overparameterization: how big should the hidden layer be?

[Network diagram: input xᵢ ∈ ℝᵈ, quadratic hidden units (·)², linear readout, hidden activations zᵢ ∈ ℝʰ, output ŷᵢ]

Quadratic for h ≥ d² hidden units, with weights [vec(U₁), ⋯, vec(U_h)]
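A sketch of the induction in code (NumPy; the depths and widths are made up, and the convention that each layer is quadratic activations followed by a linear map is assumed from the diagrams): each layer acts on quadratic features of its input, and the Kronecker identity (uⱼ ⊗ uₖ)ᵀ(z ⊗ z) = (uⱼᵀz)(uₖᵀz) is why a width-d² quadratic layer can represent any quadratic form of a d-dimensional input:

```python
import numpy as np

def quadratic_layer(Z, W, V):
    """Quadratic activations (w_j^T z)^2 followed by a linear layer V."""
    return ((Z @ W) ** 2) @ V

def deep_quadratic_net(X, layers):
    """Compose quadratic layers; the output is a polynomial of degree 2^L in the input."""
    Z = X
    for W, V in layers:
        Z = quadratic_layer(Z, W, V)
    return Z

# Kronecker identity behind the induction:
# (u kron v)^T (z kron z) = (u^T z)(v^T z).
rng = np.random.default_rng(6)
d = 3
z, u, v = rng.standard_normal(d), rng.standard_normal(d), rng.standard_normal(d)
assert np.isclose(np.kron(u, v) @ np.kron(z, z), (u @ z) * (v @ z))

# Tiny two-layer usage example (hypothetical widths).
X = rng.standard_normal((10, d))
layers = [(rng.standard_normal((d, 9)), rng.standard_normal((9, 4))),
          (rng.standard_normal((4, 16)), rng.standard_normal((16, 1)))]
print(deep_quadratic_net(X, layers).shape)   # (10, 1)
```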

SLIDE 11

Deep Quadratic Networks: Induction

§ Overparameterization: how big should the hidden layer be?

[Network diagram: input xᵢ ∈ ℝᵈ, quadratic hidden units (·)², linear readout, hidden activations zᵢ ∈ ℝʰ, output ŷᵢ ∈ ℝ]

ŷᵢ = ∑ⱼₖ aⱼₖ (uⱼᵀ xᵢ)(uₖᵀ xᵢ) = ∑ⱼₖ aⱼₖ (uⱼ ⊗ uₖ)ᵀ (xᵢ ⊗ xᵢ)

Quadratic for h ≥ d² hidden units, with weights [vec(U₁), ⋯, vec(U_h)]

SLIDE 12

Deep Quadratic Networks

§ Theorem 3: All stationary points of ℒ are global minima
› Can form a similar objective by adding norms

minimize over (Λ, W):   ℒ(Λ, W) = ∑ᵢ (yᵢ − ŷᵢ)² + β ∑_ℓ ||W⁽ℓ⁾ (W⁽ℓ⁾)ᵀ − I||²_F

[Network diagram: deep quadratic network with layer weights W⁽¹⁾, W⁽²⁾, ⋯, W⁽ᴸ⁻¹⁾, W⁽ᴸ⁾]

Number of neurons superexponential in depth
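A sketch of this layer-wise regularized objective (NumPy; β, the widths, and the convention that each layer is quadratic units followed by a linear map are illustrative assumptions):

```python
import numpy as np

def deep_quadratic_loss(X, y, layers, beta):
    """Squared error of a deep quadratic net plus beta * sum_l ||W_l W_l^T - I||_F^2."""
    Z, penalty = X, 0.0
    for W, V in layers:                      # each layer: quadratic units, then a linear map
        penalty += np.linalg.norm(W @ W.T - np.eye(W.shape[0]), 'fro') ** 2
        Z = ((Z @ W) ** 2) @ V
    resid = y - Z.ravel()
    return np.sum(resid ** 2) + beta * penalty

# Made-up usage.
rng = np.random.default_rng(7)
X, y = rng.standard_normal((20, 3)), rng.standard_normal(20)
layers = [(rng.standard_normal((3, 9)), rng.standard_normal((9, 4))),
          (rng.standard_normal((4, 16)), rng.standard_normal((16, 1)))]
print(deep_quadratic_loss(X, y, layers, beta=0.1))
```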

SLIDE 13

How well does gradient descent perform?

§ Experiment setup: xᵢ ∼ N(0, I),  yᵢ = ∑ⱼ λⱼ (wⱼᵀ xᵢ)²,  λⱼ = ±1 w.p. ½

[Figure: fraction of runs achieving the global minimizer and average normalized error vs. number of hidden units, for the objectives ℒ(Λ, W), ℒ_N(Λ, W), and ℒ(α, Λ, W) (legend: Regular, Norm, Orthogonal)]

Most bad critical points are close to a global solution!
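For reference, a sketch of the planted data-generating process described in the setup above (NumPy; the dimension d, width k, and sample count n are placeholders, since the slide's exact values are not recoverable here):

```python
import numpy as np

def planted_random_signs(n, d, k, rng):
    """x_i ~ N(0, I);  y_i = sum_j lambda_j (w_j^T x_i)^2 with lambda_j = +/-1 w.p. 1/2."""
    X = rng.standard_normal((n, d))
    W = rng.standard_normal((d, k))
    lam = rng.choice([-1.0, 1.0], size=k)
    return X, ((X @ W) ** 2) @ lam, W, lam

X, y, W_true, lam_true = planted_random_signs(n=500, d=8, k=4, rng=np.random.default_rng(8))
```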

SLIDE 14

Power of Gradient Descent

§ How well does gradient descent work in practice?

Input Distribution: Gaussian

[Figure: average normalized error vs. number of hidden units for the planted Gaussian, planted identity, and random-signs data models (legend: Regular, Norm, Orthogonal)]

SLIDE 15

Power of Gradient Descent

§ How well does gradient descent work in practice?

Input Distribution: Gaussian

Network Setup

[Figure: fraction of runs achieving the global minimizer vs. number of hidden units for four setups (regular quadratic, added norm, orthogonality penalty, least squares) under planted Gaussian, non-planted (random), planted identity, and random-signs data]

SLIDE 16

Power of Gradient Descent

§ How well does gradient descent work in practice?

[Figure: fraction achieving the global minimizer and average normalized error vs. number of hidden units, for input dimensions 8, 9, and 10]

SLIDE 17

Summary

§ Quadratic neural networks are a sweet spot between theory and practice

› Local minima can be easily escaped via

  • Overparameterization
  • Normalization
  • Regularization

› Next steps: higher-order polynomials, analytical and continuous activation functions

Brett Larsen, Shaul Druckmann