No Spurious Local Minima in Training Deep Quadratic Networks
Abbas Kazemipour
Conference on Mathematical Theory of Deep Neural Networks
October 31, 2019, New York City, NY
Need for New Optimization Theory
§ The mystery of deep neural networks and gradient descent
› Good solutions despite highly nonlinear and nonconvex landscapes
§ Roles of overparameterization, regularization, normalization and side information
Shallow Quadratic Networks
§ Quadratic NNs: a sweet spot between theory and practice
› Higher-order polynomials, analytic and continuous activation functions
§ Minimum number of hidden units h needed when xᵢ ∈ ℝ^d?

[Figure: shallow network diagram; input xᵢ feeds quadratic activations, followed by a linear layer producing ŷᵢ.]

ℒ(Λ, W) = Σᵢ (yᵢ − ŷ(xᵢ))²,   ŷ(xᵢ) = Σⱼ λⱼ (wⱼᵀxᵢ)² = ⟨WΛWᵀ, xᵢxᵢᵀ⟩

Quadratic features xᵢxᵢᵀ.
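For concreteness, here is a minimal Python/NumPy sketch of this shallow objective; the function name, the planted example, and all dimensions are illustrative assumptions rather than details from the talk.

import numpy as np

def shallow_quadratic_loss(lam, W, X, y):
    # L(Lam, W) = sum_i (y_i - sum_j lam_j (w_j^T x_i)^2)^2
    #           = sum_i (y_i - <W diag(lam) W^T, x_i x_i^T>)^2
    pre = X @ W                 # (n, h): inner products w_j^T x_i
    y_hat = (pre ** 2) @ lam    # quadratic activation, then linear read-out
    return np.sum((y - y_hat) ** 2)

# Planted example: the loss is exactly zero at the planted weights.
rng = np.random.default_rng(0)
d, h, n = 5, 5, 200
W0 = rng.standard_normal((d, h))
lam0 = rng.choice([-1.0, 1.0], size=h)
X = rng.standard_normal((n, d))
y = ((X @ W0) ** 2) @ lam0
print(shallow_quadratic_loss(lam0, W0, X, y))   # 0.0 up to round-off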
Simple vs. Complex Cells
§ Primary visual cortex: sensitive vs insensitive to contrast
[Rust et al., 2005]
Low-Rank Matrix Recovery
ℒ(Λ, W) = Σᵢ (yᵢ − ŷᵢ)²,   ŷᵢ = ⟨Q, Xᵢ⟩ = ⟨WΛWᵀ, Xᵢ⟩

Q = WΛWᵀ low-rank,   Xᵢ ∈ ℝ^{d×d} random measurements,   yᵢ ∈ ℝ,   W ∈ ℝ^{d×h}

[Figure: the low-rank factorization Q = WΛWᵀ, with Λ diagonal and carrying ± entries.]
Low-Rank Matrix Recovery
§ Convexification via nuclear-norm minimization (e.g., under RIP of the measurements Xᵢ)
§ SDP: computationally challenging
§ Can we solve for (Λ, W) instead? (Burer-Monteiro '02, '05)

minimize over Q:   ℒ(Q) = Σᵢ (yᵢ − ⟨Q, Xᵢ⟩)²   subject to   rank(Q) ≤ h   (nonconvex)

Q = WΛWᵀ low-rank,   Xᵢ ∈ ℝ^{d×d},   yᵢ ∈ ℝ,   W ∈ ℝ^{d×h}
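A hedged sketch of the factored route in the Burer-Monteiro spirit: rather than solving the SDP in Q, run plain gradient descent on the factors (Λ, W). The step size, initialization, symmetry assumption on the Xᵢ, and function name are my choices, not the talk's.

import numpy as np

def factored_gd(X_meas, y, h, lr=1e-2, iters=2000, seed=0):
    # Gradient descent on the factored objective
    #   L(Lam, W) = (1/n) * sum_i (y_i - <W Lam W^T, X_i>)^2,
    # with W in R^{d x h} and symmetric Lam in R^{h x h}; assumes each X_i is symmetric.
    n, d, _ = X_meas.shape
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((d, h)) / np.sqrt(d)
    Lam = np.eye(h)
    for _ in range(iters):
        r = y - np.einsum('nij,ij->n', X_meas, W @ Lam @ W.T)   # residuals r_i
        G = np.einsum('n,nij->ij', r, X_meas) / n               # (1/n) sum_i r_i X_i
        dW, dLam = 4 * (G @ W @ Lam), 2 * (W.T @ G @ W)         # negative gradients
        W, Lam = W + lr * dW, Lam + lr * dLam
    return Lam, W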
Global Optimality Conditions
Nonconvex:   minimize over (Λ, W):   ℒ(Λ, W) = Σᵢ (yᵢ − ⟨WΛWᵀ, Xᵢ⟩)²
› Computationally efficient (local search methods, e.g., SGD); possible local minima

Convex:   minimize over Q:   ℒ(Q) = Σᵢ (yᵢ − ⟨Q, Xᵢ⟩)²
› No local minima; Least Squares + SVD; h ≥ d neurons sufficient

The reparameterization Q = WΛWᵀ makes the two objectives agree: ℒ(Λ, W) ≡ ℒ(Q).

A solution is globally optimal iff Σᵢ rᵢXᵢ = 0, where rᵢ = yᵢ − ⟨WΛWᵀ, Xᵢ⟩.
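The convex column ("Least Squares + SVD") can be made concrete in a few lines: solve for Q by ordinary least squares over vectorized measurements, then eigendecompose the symmetric Q (for symmetric matrices this coincides with the SVD up to signs) to read off (Λ, W). A rough sketch under the assumption that n ≥ d² so the least-squares system determines Q; otherwise lstsq returns the minimum-norm solution. Names are illustrative.

import numpy as np

def convex_route(X_meas, y, h):
    # Solve the convex problem by ordinary least squares over vectorized measurements:
    #   y_i = <Q, X_i> = vec(X_i)^T vec(Q),
    # then eigendecompose Q to read off (Lam, W). With h >= d nothing is discarded.
    n, d, _ = X_meas.shape
    A = X_meas.reshape(n, d * d)                 # row i is vec(X_i)
    q, *_ = np.linalg.lstsq(A, y, rcond=None)
    Q = q.reshape(d, d)
    Q = (Q + Q.T) / 2                            # symmetrize
    evals, evecs = np.linalg.eigh(Q)             # Q = W Lam W^T
    idx = np.argsort(-np.abs(evals))[:h]         # keep h largest-magnitude eigenpairs
    Lam, W = np.diag(evals[idx]), evecs[:, idx]
    # Global-optimality certificate: sum_i r_i X_i should (numerically) vanish.
    r = y - np.einsum('nij,ij->n', X_meas, W @ Lam @ W.T)
    print('||sum_i r_i X_i|| =', np.linalg.norm(np.einsum('n,nij->ij', r, X_meas)))
    return Lam, W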
Properties of Stationary Points
§ First-order optimality
§ Second-order optimality
§ Can we force W to be full-rank, or use semidefiniteness?

minimize over (Λ, W):   ℒ(Λ, W) = Σᵢ (yᵢ − ⟨WΛWᵀ, Xᵢ⟩)²

First order:   Wᵀ(Σᵢ rᵢXᵢ)W = 0.   If W is full-rank, then Σᵢ rᵢXᵢ = 0.

Second order (along a rank-one perturbation u qᵀ of W, up to constants):
Σᵢ ⟨Xᵢ, WΛq uᵀ⟩² − (qᵀΛq) Σᵢ rᵢ ⟨Xᵢ, uuᵀ⟩ ≥ 0.
If W is low-rank, then Σᵢ rᵢXᵢ ⪰ 0 or ⪯ 0.
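One way to make the role of overparameterization concrete: whenever M = Σᵢ rᵢXᵢ ≠ 0, appending a neuron along the top eigenvector of M strictly decreases the loss to first order, so the only points where no such step exists are exactly those satisfying the certificate M = 0. This sketch is my own illustration of that argument; the function name, arguments, and step size are hypothetical, not from the talk.

import numpy as np

def escape_step(X_meas, y, Lam, W, step=1e-2):
    # M = sum_i r_i X_i at the current point (Lam, W).
    r = y - np.einsum('nij,ij->n', X_meas, W @ Lam @ W.T)
    M = np.einsum('n,nij->ij', r, X_meas)
    evals, evecs = np.linalg.eigh(M)
    k = np.argmax(np.abs(evals))                       # largest-magnitude eigenvalue
    u, s = evecs[:, k], np.sign(evals[k])
    # Append one neuron with weight u and output coefficient step*s. The new prediction
    # is <W Lam W^T + step*s*u u^T, X_i>, so the loss drops by about 2*step*|evals[k]|
    # to first order. No decrease is possible only when M = 0, the optimality certificate.
    W_new = np.hstack([W, u[:, None]])
    Lam_new = np.block([[Lam, np.zeros((Lam.shape[0], 1))],
                        [np.zeros((1, Lam.shape[0])), step * s * np.eye(1)]])
    return Lam_new, W_new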
Escaping Spurious Critical Points
§ Theorem 1: The global minimum is achieved
› The solution is an eigenvalue decomposition ⇒ 𝒪(d³) complexity
§ Theorem 2: All stationary points are global minima with probability 1
› Advantage of data normalization
minimize over (Λ, W):   ℒ_β(Λ, W) = Σᵢ (yᵢ − ⟨WΛWᵀ, Xᵢ⟩)² + β‖WWᵀ − I‖²
› Nonconvex penalty; for β large enough, W becomes full-rank and orthonormal

minimize over (α, Λ, W):   ℒ(α, Λ, W) = Σᵢ (yᵢ − ⟨WΛWᵀ + αI, Xᵢ⟩)²
› Adds the norm of the input as a regressor (side information), since ⟨I, Xᵢ⟩ = ‖xᵢ‖²
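Both modifications are small changes to the shallow objective. A minimal sketch of the two, with quadratic features Xᵢ = xᵢxᵢᵀ and a diagonal Λ; the function names and signatures are my own illustration.

import numpy as np

def loss_orth_penalty(lam, W, X, y, beta):
    # L_beta(Lam, W) = sum_i (y_i - <W diag(lam) W^T, x_i x_i^T>)^2 + beta*||W W^T - I||_F^2
    y_hat = ((X @ W) ** 2) @ lam
    d = W.shape[0]
    return np.sum((y - y_hat) ** 2) + beta * np.linalg.norm(W @ W.T - np.eye(d)) ** 2

def loss_norm_regressor(alpha, lam, W, X, y):
    # L(alpha, Lam, W) = sum_i (y_i - <W diag(lam) W^T + alpha*I, x_i x_i^T>)^2;
    # the extra term is alpha*||x_i||^2, the squared input norm used as a regressor.
    y_hat = ((X @ W) ** 2) @ lam + alpha * np.sum(X ** 2, axis=1)
    return np.sum((y - y_hat) ** 2)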
Deep Quadratic Networks: Induction

§ Each quadratic unit is a linear function of the Kronecker-squared input (numerical check below):
zᵢ = ⟨WΛWᵀ, xᵢxᵢᵀ⟩ = Σⱼₖ Λⱼₖ (wⱼ ⊗ wₖ)ᵀ (xᵢ ⊗ xᵢ),   xᵢ ∈ ℝ^d
§ Overparameterization: how big should the hidden layer be?
› The layer realizes an arbitrary quadratic form for h ≥ d² units, with W = [vec(A₁), ⋯, vec(A_h)]
› By induction, a depth-L quadratic network is linear in the 2^L-fold Kronecker power of xᵢ

[Figure: network schematic with input xᵢ ∈ ℝ^d, a hidden layer of quadratic units zᵢ, and output ŷᵢ.]
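The induction step rests on the identity (wᵀx)² = (w ⊗ w)ᵀ(x ⊗ x). A tiny numerical check of that identity, purely illustrative:

import numpy as np

rng = np.random.default_rng(1)
d = 4
x, w = rng.standard_normal(d), rng.standard_normal(d)
lhs = (w @ x) ** 2                       # a single quadratic unit applied to x
rhs = np.kron(w, w) @ np.kron(x, x)      # a linear function of the Kronecker square of x
print(np.isclose(lhs, rhs))              # True

Stacking h ≥ d² such units lets a layer realize any linear map of x ⊗ x, which is what drives the induction over depth.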
Deep Quadratic Networks
§ Theorem 3: All stationary points of ℒ are global minima › Can form a similar objective by adding norms
minimize over (Λ, W):   ℒ(Λ, W) = Σᵢ (yᵢ − ŷᵢ)² + β Σ_ℓ ‖W^(ℓ)(W^(ℓ))ᵀ − I‖²

Layer weights W^(1), W^(2), ⋯, W^(L−1), W^(L); the number of neurons needed is superexponential in depth.

[Figure: deep quadratic network schematic with per-layer outputs ŷᵢ^(1), ŷᵢ^(2), ⋯ and final output ŷᵢ^(L).]
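A hedged sketch of an objective of this shape: a stack of elementwise-squaring layers with a linear read-out and a per-layer orthogonality penalty. The simplified architecture (no per-layer Λ) and all names are my assumptions, not the talk's.

import numpy as np

def deep_quadratic_forward(Ws, v, X):
    # Each layer squares the pre-activations elementwise: z <- (z @ W_k)**2.
    # The prediction is a linear read-out v of the last layer's features.
    Z = X
    for Wk in Ws:
        Z = (Z @ Wk) ** 2
    return Z @ v

def deep_loss(Ws, v, X, y, beta):
    # sum_i (y_i - y_hat_i)^2 + beta * sum_k ||W_k W_k^T - I||_F^2
    y_hat = deep_quadratic_forward(Ws, v, X)
    penalty = sum(np.linalg.norm(Wk @ Wk.T - np.eye(Wk.shape[0])) ** 2 for Wk in Ws)
    return np.sum((y - y_hat) ** 2) + beta * penalty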
How Well Does Gradient Descent Perform?
§ Experiment setup: xᵢ ∈ ℝ^d ∼ 𝒩(0, I),   yᵢ = Σⱼ λⱼ (wⱼᵀxᵢ)²,   λⱼ = ±1 w.p. ½
[Figure: fraction achieving the global minimizer and average normalized error vs. number of hidden units (5 to 20), comparing the regular objective ℒ(Λ, W), the norm-regressor objective ℒ(α, Λ, W), and the orthogonality-penalized objective ℒ_β(Λ, W).]

Most bad critical points are close to a global solution (there, rank(Σᵢ rᵢXᵢ) = 1)!
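A rough sketch of how such an experiment might be run with plain gradient descent on the regular objective; the success threshold, learning rate, trial counts, and default dimensions are arbitrary choices of mine, not the talk's settings.

import numpy as np

def fraction_global(d=10, n=300, h_grid=range(5, 21), trials=10,
                    lr=5e-3, iters=2000, tol=1e-3, seed=0):
    # Planted model: x_i ~ N(0, I), y_i = sum_j lam_j (w_j^T x_i)^2, lam_j = +/-1 w.p. 1/2.
    # For each student width h, report the fraction of random initializations whose
    # final normalized training loss falls below tol ("achieving the global minimizer").
    rng = np.random.default_rng(seed)
    out = {}
    for h in h_grid:
        hits = 0
        for _ in range(trials):
            W0 = rng.standard_normal((d, d))
            lam0 = rng.choice([-1.0, 1.0], size=d)
            X = rng.standard_normal((n, d))
            y = ((X @ W0) ** 2) @ lam0
            W = 0.1 * rng.standard_normal((d, h))
            lam = 0.1 * rng.standard_normal(h)
            for _ in range(iters):                         # plain gradient descent
                pre = X @ W                                # (n, h): w_j^T x_i
                r = y - (pre ** 2) @ lam                   # residuals r_i
                g_lam = -2.0 / n * (pre ** 2).T @ r
                g_W = -4.0 / n * (X.T @ (pre * r[:, None])) * lam[None, :]
                lam -= lr * g_lam
                W -= lr * g_W
            loss = np.mean((y - ((X @ W) ** 2) @ lam) ** 2)
            hits += loss / np.mean(y ** 2) < tol
        out[h] = hits / trials
    return out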
Power of Gradient Descent
§ How well does gradient descent work in practice?
Input distribution: Gaussian. Data blocks: Planted Gaussian, Planted Identity, Random Signs.

[Figure: average normalized error vs. number of hidden units (5 to 20) for the Regular, Norm, and Orthogonal objectives, one panel per data block.]
Power of Gradient Descent
§ How well does gradient descent work in practice?
Input distribution: Gaussian. Network setups: Regular Quadratic, Added Norm, Orthogonality Penalty, Least Squares.

[Figure: fraction achieving the global minimizer vs. number of hidden units (5 to 20), one panel per data block: Planted Gaussian, Non-planted (Random), Planted Identity, Random Signs.]
Power of Gradient Descent
§ How well does gradient descent work in practice?
Input dimensions: 8, 9, 10.

[Figure: fraction achieving the global minimizer and average normalized error vs. number of hidden units (20 to 140), one panel per input dimension.]
Summary
§ Quadratic neural networks are a sweet spot between theory and practice
› Local minima can be easily escaped via
- Overparameterization
- Normalization
- Regularization