No Spurious Local Minima in Training Deep Quadratic Networks
Abbas Kazemipour
Conference on Mathematical Theory of Deep Neural Networks
October 31, 2019, New York City, NY
Need for New Optimization Theory
§ The mystery of deep neural networks and gradient descent
› Good solutions despite highly nonlinear and nonconvex landscapes
§ Roles of overparameterization, regularization, normalization and side information
Shallow Quadratic Networks
§ Quadratic NNs: a sweet spot between theory and practice
› Higher-order polynomials, analytic and continuous activation functions
§ Minimum number of hidden units h needed when xᵢ ∈ ℝ^d?

[Figure: shallow network diagram; input xᵢ feeds quadratic activations, followed by a linear layer producing ŷᵢ.]

ℒ(Λ, W) = Σᵢ (yᵢ − ŷ(xᵢ))²,   ŷ(xᵢ) = Σⱼ λⱼ (wⱼᵀxᵢ)² = ⟨WΛWᵀ, xᵢxᵢᵀ⟩

Quadratic features xᵢxᵢᵀ.
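For concreteness, here is a minimal Python/NumPy sketch of this shallow objective; the function name, the planted example, and all dimensions are illustrative assumptions rather than details from the talk.

import numpy as np

def shallow_quadratic_loss(lam, W, X, y):
    # L(Lam, W) = sum_i (y_i - sum_j lam_j (w_j^T x_i)^2)^2
    #           = sum_i (y_i - <W diag(lam) W^T, x_i x_i^T>)^2
    pre = X @ W                 # (n, h): inner products w_j^T x_i
    y_hat = (pre ** 2) @ lam    # quadratic activation, then linear read-out
    return np.sum((y - y_hat) ** 2)

# Planted example: the loss is exactly zero at the planted weights.
rng = np.random.default_rng(0)
d, h, n = 5, 5, 200
W0 = rng.standard_normal((d, h))
lam0 = rng.choice([-1.0, 1.0], size=h)
X = rng.standard_normal((n, d))
y = ((X @ W0) ** 2) @ lam0
print(shallow_quadratic_loss(lam0, W0, X, y))   # 0.0 up to round-off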
Simple vs. Complex Cells
§ Primary visual cortex: sensitive vs insensitive to contrast
[Rust et al., 2005]
Low-Rank Matrix Recovery
ℒ(Λ, W) = Σᵢ (yᵢ − ŷᵢ)²,   ŷᵢ = ⟨Q, Xᵢ⟩ = ⟨WΛWᵀ, Xᵢ⟩

Q = WΛWᵀ low-rank,   Xᵢ ∈ ℝ^{d×d} random measurements,   yᵢ ∈ ℝ,   W ∈ ℝ^{d×h}

[Figure: the low-rank factorization Q = WΛWᵀ, with Λ diagonal and carrying ± entries.]
Low-Rank Matrix Recovery
§ Convexification via nuclear-norm minimization (e.g., under RIP of the measurements Xᵢ)
§ SDP: computationally challenging
§ Can we solve for (Λ, W) instead? (Burer-Monteiro '02, '05)

minimize over Q:   ℒ(Q) = Σᵢ (yᵢ − ⟨Q, Xᵢ⟩)²   subject to   rank(Q) ≤ h   (nonconvex)

Q = WΛWᵀ low-rank,   Xᵢ ∈ ℝ^{d×d},   yᵢ ∈ ℝ,   W ∈ ℝ^{d×h}
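A hedged sketch of the factored route in the Burer-Monteiro spirit: rather than solving the SDP in Q, run plain gradient descent on the factors (Λ, W). The step size, initialization, symmetry assumption on the Xᵢ, and function name are my choices, not the talk's.

import numpy as np

def factored_gd(X_meas, y, h, lr=1e-2, iters=2000, seed=0):
    # Gradient descent on the factored objective
    #   L(Lam, W) = (1/n) * sum_i (y_i - <W Lam W^T, X_i>)^2,
    # with W in R^{d x h} and symmetric Lam in R^{h x h}; assumes each X_i is symmetric.
    n, d, _ = X_meas.shape
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((d, h)) / np.sqrt(d)
    Lam = np.eye(h)
    for _ in range(iters):
        r = y - np.einsum('nij,ij->n', X_meas, W @ Lam @ W.T)   # residuals r_i
        G = np.einsum('n,nij->ij', r, X_meas) / n               # (1/n) sum_i r_i X_i
        dW, dLam = 4 * (G @ W @ Lam), 2 * (W.T @ G @ W)         # negative gradients
        W, Lam = W + lr * dW, Lam + lr * dLam
    return Lam, W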
Global Optimality Conditions
Nonconvex:   minimize over (Λ, W):   ℒ(Λ, W) = Σᵢ (yᵢ − ⟨WΛWᵀ, Xᵢ⟩)²
› Computationally efficient (local search methods, e.g., SGD); possible local minima

Convex:   minimize over Q:   ℒ(Q) = Σᵢ (yᵢ − ⟨Q, Xᵢ⟩)²
› No local minima; Least Squares + SVD; h ≥ d neurons sufficient

The reparameterization Q = WΛWᵀ makes the two objectives agree: ℒ(Λ, W) ≡ ℒ(Q).

A solution is globally optimal iff Σᵢ rᵢXᵢ = 0, where rᵢ = yᵢ − ⟨WΛWᵀ, Xᵢ⟩.
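The convex column ("Least Squares + SVD") can be made concrete in a few lines: solve for Q by ordinary least squares over vectorized measurements, then eigendecompose the symmetric Q (for symmetric matrices this coincides with the SVD up to signs) to read off (Λ, W). A rough sketch under the assumption that n ≥ d² so the least-squares system determines Q; otherwise lstsq returns the minimum-norm solution. Names are illustrative.

import numpy as np

def convex_route(X_meas, y, h):
    # Solve the convex problem by ordinary least squares over vectorized measurements:
    #   y_i = <Q, X_i> = vec(X_i)^T vec(Q),
    # then eigendecompose Q to read off (Lam, W). With h >= d nothing is discarded.
    n, d, _ = X_meas.shape
    A = X_meas.reshape(n, d * d)                 # row i is vec(X_i)
    q, *_ = np.linalg.lstsq(A, y, rcond=None)
    Q = q.reshape(d, d)
    Q = (Q + Q.T) / 2                            # symmetrize
    evals, evecs = np.linalg.eigh(Q)             # Q = W Lam W^T
    idx = np.argsort(-np.abs(evals))[:h]         # keep h largest-magnitude eigenpairs
    Lam, W = np.diag(evals[idx]), evecs[:, idx]
    # Global-optimality certificate: sum_i r_i X_i should (numerically) vanish.
    r = y - np.einsum('nij,ij->n', X_meas, W @ Lam @ W.T)
    print('||sum_i r_i X_i|| =', np.linalg.norm(np.einsum('n,nij->ij', r, X_meas)))
    return Lam, W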
Properties of Stationary Points
§ First-order optimality
§ Second-order optimality
§ Can we force W to be full-rank, or use semidefiniteness?

minimize over (Λ, W):   ℒ(Λ, W) = Σᵢ (yᵢ − ⟨WΛWᵀ, Xᵢ⟩)²

First order:   Wᵀ(Σᵢ rᵢXᵢ)W = 0.   If W is full-rank, then Σᵢ rᵢXᵢ = 0.

Second order (along a rank-one perturbation u qᵀ of W, up to constants):
Σᵢ ⟨Xᵢ, WΛq uᵀ⟩² − (qᵀΛq) Σᵢ rᵢ ⟨Xᵢ, uuᵀ⟩ ≥ 0.
If W is low-rank, then Σᵢ rᵢXᵢ ⪰ 0 or ⪯ 0.
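One way to make the role of overparameterization concrete: whenever M = Σᵢ rᵢXᵢ ≠ 0, appending a neuron along the top eigenvector of M strictly decreases the loss to first order, so the only points where no such step exists are exactly those satisfying the certificate M = 0. This sketch is my own illustration of that argument; the function name, arguments, and step size are hypothetical, not from the talk.

import numpy as np

def escape_step(X_meas, y, Lam, W, step=1e-2):
    # M = sum_i r_i X_i at the current point (Lam, W).
    r = y - np.einsum('nij,ij->n', X_meas, W @ Lam @ W.T)
    M = np.einsum('n,nij->ij', r, X_meas)
    evals, evecs = np.linalg.eigh(M)
    k = np.argmax(np.abs(evals))                       # largest-magnitude eigenvalue
    u, s = evecs[:, k], np.sign(evals[k])
    # Append one neuron with weight u and output coefficient step*s. The new prediction
    # is <W Lam W^T + step*s*u u^T, X_i>, so the loss drops by about 2*step*|evals[k]|
    # to first order. No decrease is possible only when M = 0, the optimality certificate.
    W_new = np.hstack([W, u[:, None]])
    Lam_new = np.block([[Lam, np.zeros((Lam.shape[0], 1))],
                        [np.zeros((1, Lam.shape[0])), step * s * np.eye(1)]])
    return Lam_new, W_new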
Escaping Spurious Critical Points
§ Theorem 1: The global minimum is achieved
› The solution is an eigenvalue decomposition ⇒ 𝒪(d³) complexity
§ Theorem 2: All stationary points are global minima with probability 1
› Advantage of data normalization
minimize over (Λ, W):   ℒ_β(Λ, W) = Σᵢ (yᵢ − ⟨WΛWᵀ, Xᵢ⟩)² + β‖WWᵀ − I‖²
› Nonconvex penalty; for β large enough, W becomes full-rank and orthonormal

minimize over (α, Λ, W):   ℒ(α, Λ, W) = Σᵢ (yᵢ − ⟨WΛWᵀ + αI, Xᵢ⟩)²
› Adds the norm of the input as a regressor (side information), since ⟨I, Xᵢ⟩ = ‖xᵢ‖²
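Both modifications are small changes to the shallow objective. A minimal sketch of the two, with quadratic features Xᵢ = xᵢxᵢᵀ and a diagonal Λ; the function names and signatures are my own illustration.

import numpy as np

def loss_orth_penalty(lam, W, X, y, beta):
    # L_beta(Lam, W) = sum_i (y_i - <W diag(lam) W^T, x_i x_i^T>)^2 + beta*||W W^T - I||_F^2
    y_hat = ((X @ W) ** 2) @ lam
    d = W.shape[0]
    return np.sum((y - y_hat) ** 2) + beta * np.linalg.norm(W @ W.T - np.eye(d)) ** 2

def loss_norm_regressor(alpha, lam, W, X, y):
    # L(alpha, Lam, W) = sum_i (y_i - <W diag(lam) W^T + alpha*I, x_i x_i^T>)^2;
    # the extra term is alpha*||x_i||^2, the squared input norm used as a regressor.
    y_hat = ((X @ W) ** 2) @ lam + alpha * np.sum(X ** 2, axis=1)
    return np.sum((y - y_hat) ** 2)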
Deep Quadratic Networks: Induction

§ Each quadratic unit is a linear function of the Kronecker-squared input (numerical check below):
zᵢ = ⟨WΛWᵀ, xᵢxᵢᵀ⟩ = Σⱼₖ Λⱼₖ (wⱼ ⊗ wₖ)ᵀ (xᵢ ⊗ xᵢ),   xᵢ ∈ ℝ^d
§ Overparameterization: how big should the hidden layer be?
› The layer realizes an arbitrary quadratic form for h ≥ d² units, with W = [vec(A₁), ⋯, vec(A_h)]
› By induction, a depth-L quadratic network is linear in the 2^L-fold Kronecker power of xᵢ

[Figure: network schematic with input xᵢ ∈ ℝ^d, a hidden layer of quadratic units zᵢ, and output ŷᵢ.]
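The induction step rests on the identity (wᵀx)² = (w ⊗ w)ᵀ(x ⊗ x). A tiny numerical check of that identity, purely illustrative:

import numpy as np

rng = np.random.default_rng(1)
d = 4
x, w = rng.standard_normal(d), rng.standard_normal(d)
lhs = (w @ x) ** 2                       # a single quadratic unit applied to x
rhs = np.kron(w, w) @ np.kron(x, x)      # a linear function of the Kronecker square of x
print(np.isclose(lhs, rhs))              # True

Stacking h ≥ d² such units lets a layer realize any linear map of x ⊗ x, which is what drives the induction over depth.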
Deep Quadratic Networks
§ Theorem 3: All stationary points of ℒ are global minima › Can form a similar objective by adding norms
minimize over (Λ, W):   ℒ(Λ, W) = Σᵢ (yᵢ − ŷᵢ)² + β Σ_ℓ ‖W^(ℓ)(W^(ℓ))ᵀ − I‖²

Layer weights W^(1), W^(2), ⋯, W^(L−1), W^(L); the number of neurons needed is superexponential in depth.

[Figure: deep quadratic network schematic with per-layer outputs ŷᵢ^(1), ŷᵢ^(2), ⋯ and final output ŷᵢ^(L).]
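A hedged sketch of an objective of this shape: a stack of elementwise-squaring layers with a linear read-out and a per-layer orthogonality penalty. The simplified architecture (no per-layer Λ) and all names are my assumptions, not the talk's.

import numpy as np

def deep_quadratic_forward(Ws, v, X):
    # Each layer squares the pre-activations elementwise: z <- (z @ W_k)**2.
    # The prediction is a linear read-out v of the last layer's features.
    Z = X
    for Wk in Ws:
        Z = (Z @ Wk) ** 2
    return Z @ v

def deep_loss(Ws, v, X, y, beta):
    # sum_i (y_i - y_hat_i)^2 + beta * sum_k ||W_k W_k^T - I||_F^2
    y_hat = deep_quadratic_forward(Ws, v, X)
    penalty = sum(np.linalg.norm(Wk @ Wk.T - np.eye(Wk.shape[0])) ** 2 for Wk in Ws)
    return np.sum((y - y_hat) ** 2) + beta * penalty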
How Well Does Gradient Descent Perform?
§ Experiment setup: xᵢ ∈ ℝ^d ∼ 𝒩(0, I),   yᵢ = Σⱼ λⱼ (wⱼᵀxᵢ)²,   λⱼ = ±1 w.p. ½
[Figure: fraction achieving the global minimizer and average normalized error vs. number of hidden units (5 to 20), comparing the regular objective ℒ(Λ, W), the norm-regressor objective ℒ(α, Λ, W), and the orthogonality-penalized objective ℒ_β(Λ, W).]

Most bad critical points are close to a global solution (there, rank(Σᵢ rᵢXᵢ) = 1)!
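A rough sketch of how such an experiment might be run with plain gradient descent on the regular objective; the success threshold, learning rate, trial counts, and default dimensions are arbitrary choices of mine, not the talk's settings.

import numpy as np

def fraction_global(d=10, n=300, h_grid=range(5, 21), trials=10,
                    lr=5e-3, iters=2000, tol=1e-3, seed=0):
    # Planted model: x_i ~ N(0, I), y_i = sum_j lam_j (w_j^T x_i)^2, lam_j = +/-1 w.p. 1/2.
    # For each student width h, report the fraction of random initializations whose
    # final normalized training loss falls below tol ("achieving the global minimizer").
    rng = np.random.default_rng(seed)
    out = {}
    for h in h_grid:
        hits = 0
        for _ in range(trials):
            W0 = rng.standard_normal((d, d))
            lam0 = rng.choice([-1.0, 1.0], size=d)
            X = rng.standard_normal((n, d))
            y = ((X @ W0) ** 2) @ lam0
            W = 0.1 * rng.standard_normal((d, h))
            lam = 0.1 * rng.standard_normal(h)
            for _ in range(iters):                         # plain gradient descent
                pre = X @ W                                # (n, h): w_j^T x_i
                r = y - (pre ** 2) @ lam                   # residuals r_i
                g_lam = -2.0 / n * (pre ** 2).T @ r
                g_W = -4.0 / n * (X.T @ (pre * r[:, None])) * lam[None, :]
                lam -= lr * g_lam
                W -= lr * g_W
            loss = np.mean((y - ((X @ W) ** 2) @ lam) ** 2)
            hits += loss / np.mean(y ** 2) < tol
        out[h] = hits / trials
    return out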
Power of Gradient Descent
§ How well does gradient descent work in practice?
Input distribution: Gaussian. Data blocks: Planted Gaussian, Planted Identity, Random Signs.

[Figure: average normalized error vs. number of hidden units (5 to 20) for the Regular, Norm, and Orthogonal objectives, one panel per data block.]
Power of Gradient Descent
§ How well does gradient descent work in practice?
Input distribution: Gaussian. Network setups: Regular Quadratic, Added Norm, Orthogonality Penalty, Least Squares.

[Figure: fraction achieving the global minimizer vs. number of hidden units (5 to 20), one panel per data block: Planted Gaussian, Non-planted (Random), Planted Identity, Random Signs.]
Power of Gradient Descent
§ How well does gradient descent work in practice?
Input dimensions: 8, 9, 10.

[Figure: fraction achieving the global minimizer and average normalized error vs. number of hidden units (20 to 140), one panel per input dimension.]
Summary
§ Quadratic neural networks are a sweet spot between theory and practice
› Local minima can be easily escaped via
- Overparameterization
- Normalization
- Regularization