

SLIDE 1

An Investigation into Neural Net Optimization via Hessian Eigenvalue Density

Behrooz Ghorbani

Department of Electrical Engineering, Stanford University (joint work with Shankar Krishnan & Ying Xiao from Google Research)

June 2019

SLIDE 2

Overview

  • Gradient descent and its variants are the most popular methods for optimizing neural networks.
  • The performance of these optimizers is highly dependent on the local curvature of the loss surface → it is important to study the loss curvature.
  • We present a scalable algorithm for computing the full eigenvalue density of the Hessian for deep neural networks.
  • We leverage this algorithm to study the effect of architecture / hyper-parameter choices on the optimization landscape.

SLIDE 3

Basic Definitions

  • θ ∈ R^n is the model parameter; the loss is L(θ) ≡ (1/N) Σ_{i=1}^N L(θ, (x_i, y_i)).
  • The Hessian matrix, H, is an n × n symmetric matrix of second derivatives: H(θ_t)_{i,j} = ∂²L/(∂θ_i ∂θ_j) |_{θ=θ_t}.
  • H(θ) represents the (local) loss curvature at the point θ.
  • H(θ) has eigenvalue-eigenvector pairs (λ_i, q_i)_{i=1}^n with λ_1 ≥ λ_2 ≥ ⋯ ≥ λ_n.
  • λ_i is the curvature of the loss in the direction of q_i in the neighborhood of θ.
  • We focus on estimating the empirical distribution of the λ_i as a concrete way to study the loss curvature.
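For intuition only, here is a minimal numpy sketch (not from the talk) of these definitions on a model small enough to handle exactly: for a linear model with squared loss L(θ) = (1/N) Σ_i (x_iᵀθ − y_i)², the Hessian is the constant matrix H = (2/N) XᵀX, and its eigenvalues λ_1 ≥ ⋯ ≥ λ_n can be computed directly. The names X, y, and H are illustrative; for a real network the Hessian depends on θ and is far too large to form explicitly, which is what the rest of the talk addresses.

```python
import numpy as np

# Minimal sketch (assumption: tiny least-squares model, not the talk's networks).
# Loss: L(theta) = (1/N) * sum_i (x_i^T theta - y_i)^2, so H = (2/N) X^T X exactly.
rng = np.random.default_rng(0)
N, n = 200, 10                      # N examples, n parameters
X = rng.normal(size=(N, n))
y = rng.normal(size=N)

H = (2.0 / N) * X.T @ X             # exact Hessian of the quadratic loss
lam = np.linalg.eigvalsh(H)[::-1]   # eigenvalues sorted so lambda_1 >= ... >= lambda_n

print("largest curvature  lambda_1 =", lam[0])
print("smallest curvature lambda_n =", lam[-1])
```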

SLIDE 4

Hessian Computation in Deep Networks

  • The eigenvalue distribution function of H is defined as φ(t) = (1/n) Σ_{i=1}^n δ(t − λ_i).
  • Let f_σ(x) = (1/(σ√(2π))) exp(−x²/(2σ²)) be the Gaussian density.
  • Convolution with the Gaussian gives the smoothed density: φ_σ(t) = (φ ∗ f_σ)(t) = (1/n) Σ_{i=1}^n f_σ(t − λ_i).

Figure: The spike density (1/n) Σ_i δ(t − λ_i) and its Gaussian-smoothed counterpart.
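A small numpy sketch of the smoothing step (my own illustration, not the authors' code): given a list of eigenvalues, evaluate φ_σ on a grid by averaging Gaussian bumps centered at each λ_i. The function name smoothed_density and the toy spectrum are assumptions made for the example.

```python
import numpy as np

def smoothed_density(eigs, grid, sigma=0.1):
    """phi_sigma(t) = (1/n) * sum_i f_sigma(t - lambda_i), evaluated at each point of `grid`."""
    eigs = np.asarray(eigs)
    diffs = grid[:, None] - eigs[None, :]                                # shape (len(grid), n)
    bumps = np.exp(-diffs**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
    return bumps.mean(axis=1)                                            # average over the n eigenvalues

# Toy usage: a bulk of eigenvalues near 0 plus one outlier at 4.
eigs = np.concatenate([np.random.default_rng(0).normal(0.0, 0.3, size=500), [4.0]])
grid = np.linspace(-2.0, 5.0, 400)
phi = smoothed_density(eigs, grid, sigma=0.1)
print("density peaks near t =", grid[np.argmax(phi)])
```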

SLIDE 5

Estimating the Smoothed Density

  • This goes back to Gene Golub and his students [Golub and Welsch (1969); Bai et al. (1996)].
  • Their machinery constructs weights and nodes (w_i, ℓ_i)_{i=1}^m such that, for all "nice" functions g,
    (1/n) Σ_{i=1}^n g(λ_i) ≈ Σ_{i=1}^m w_i g(ℓ_i).
  • Using g(x) = f_σ(t − x):
    φ_σ(t) = (1/n) Σ_{i=1}^n f_σ(t − λ_i) ≈ φ̂(t) = Σ_{i=1}^m w_i f_σ(t − ℓ_i).
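Once the nodes ℓ_i and weights w_i are available (they come out of the Lanczos procedure sketched on the next slide), the approximation above is just a weighted Gaussian mixture. A minimal sketch, assuming the hypothetical helper name quadrature_density:

```python
import numpy as np

def quadrature_density(nodes, weights, grid, sigma=0.1):
    """phi_hat(t) = sum_i w_i * f_sigma(t - l_i): one Gaussian bump per node l_i, weighted by w_i."""
    diffs = grid[:, None] - np.asarray(nodes)[None, :]
    bumps = np.exp(-diffs**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
    return bumps @ np.asarray(weights)
```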

SLIDE 6

Algorithm Sketch

  • Stochastic: draw v ∼ N(0, (1/n) I_n).
  • Lanczos:
    1. Compute an orthonormal basis for {v, Hv, ⋯, H^{m−1}v}. Call this basis V.
    2. Let T = Vᵀ H V.
  • Quadrature:
    1. Diagonalize T = U D Uᵀ.
    2. Estimate φ_σ(t) = (1/n) Σ_{i=1}^n f_σ(t − λ_i) with φ̂_v(t) = Σ_{i=1}^m U_{1,i}² f_σ(t − D_{i,i}).

Computational Complexity

  • Calculating (w_i, ℓ_i)_{i=1}^m takes O(m × model size × dataset size). In practice, m ≈ 100 is more than enough.
  • Explicitly calculating the eigenvalues takes O(model size² × dataset size).
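The numpy sketch below puts the two boxes together for a single probe vector v (my own illustration under assumed names, not the authors' implementation): run m Lanczos steps using only Hessian-vector products, form the tridiagonal T = VᵀHV, diagonalize it, and read off nodes ℓ_i = D_{i,i} and weights w_i = U_{1,i}². A small dense symmetric matrix stands in for H; for a real network, hvp(v) would be computed with automatic differentiation (Pearlmutter's trick) rather than by materializing H.

```python
import numpy as np

def lanczos_quadrature(hvp, n, m=100, seed=0):
    """One-probe sketch of the slide's recipe: Lanczos + Gaussian quadrature.

    hvp(v) must return H @ v for an (implicit) n x n symmetric matrix H.
    Returns quadrature nodes l_i (Ritz values) and weights w_i = U_{1,i}^2."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=n) / np.sqrt(n)                 # v ~ N(0, (1/n) I_n)
    v /= np.linalg.norm(v)                              # Lanczos starts from a unit vector

    V = np.zeros((n, m))
    alpha, beta = np.zeros(m), np.zeros(m - 1)
    V[:, 0] = v
    for j in range(m):
        w = hvp(V[:, j])
        alpha[j] = V[:, j] @ w
        w = w - alpha[j] * V[:, j]
        if j > 0:
            w = w - beta[j - 1] * V[:, j - 1]
        w = w - V[:, : j + 1] @ (V[:, : j + 1].T @ w)   # full reorthogonalization (simple, not cheap)
        if j < m - 1:
            beta[j] = np.linalg.norm(w)
            if beta[j] < 1e-12:                          # Krylov subspace exhausted: truncate
                alpha, beta = alpha[: j + 1], beta[:j]
                break
            V[:, j + 1] = w / beta[j]

    T = np.diag(alpha) + np.diag(beta, 1) + np.diag(beta, -1)   # T = V^T H V is tridiagonal
    evals, evecs = np.linalg.eigh(T)                            # T = U D U^T
    return evals, evecs[0, :] ** 2                              # nodes l_i, weights w_i

# Toy usage: a dense random symmetric matrix plays the role of H.
n = 500
A = np.random.default_rng(1).normal(size=(n, n)) / np.sqrt(n)
H = (A + A.T) / 2.0
nodes, weights = lanczos_quadrature(lambda u: H @ u, n, m=80)
print("weights sum to", weights.sum())                  # ~1 by construction
```

Feeding the returned nodes and weights into the quadrature_density sketch from the previous slide gives the estimated density; the full method additionally averages such estimates over several independent draws of v.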

SLIDE 7

Accuracy

  • The algorithm enjoys strong theoretical guarantees. We present some such guarantees in our paper; Ubaru et al. (2017) provide additional details.

Figure: Comparison of a degree-90 quadrature approximation with the actual Hessian density. The Hessian is calculated from a 2-layer network with 15,910 parameters trained on MNIST.

SLIDE 8

Let’s Train a ResNet-32

460K parameters. Trained on CIFAR-10. The network has Batch-Normalization (Ioffe and Szegedy (2015)).

SLIDE 9

Experiments: Initialization

  • At initialization time, the Hessian has significant negative eigenvalues.
  • This points to significant local non-convexity of the loss surface at step 0.
  • There is a significant difference between the initialization landscape and the training landscape.
  • For small datasets such as CIFAR-10 / MNIST, the negative directions disappear extremely fast.

Figure: Hessian eigenvalue density (log scale) at step 0 and at step 515.

SLIDE 10

Experiments: Further Training

  • After the first epoch, the Hessian spectrum stabilizes.
  • The Hessian contains information about the non-local geometry of the loss.
  • The eigenvalues of the Hessian at this stage determine whether the network can be trained effectively.

Figure: The spectrum of the network stabilizes after the first epoch.

SLIDE 11

Experiments: Reducing Learning Rate

  • Prevalent view: smaller learning rates allow you to converge to sharper local minima.
  • Under this view, reducing the learning rate should bring about an increase in the top eigenvalue.

Figure: Hessian eigenvalue densities with step size 0.10 and step size 0.02.

Figure: The learning rate is reduced by a factor of 10 at step 40k. Surprisingly, the top eigenvalue also decreases.

SLIDE 12

Experiments: End of Training

  • The Hessian spectrum at the end of training:

Figure: Spectrum of the Hessian after 100k steps of training (end of training). The smallest eigenvalue is ≈ −0.0006.

SLIDE 13

Examining the Role of Architecture

  • Let’s remove Batch-Normalization from the network and reexamine the spectrum!

Figure: Spectrum of the Hessian after 7k steps of training. Outlier eigenvalues appear when BN is removed from the network.

SLIDE 14

Experiments: Batch-Normalization

  • This observation is consistent across different architectures and datasets:

Figure: Eigenvalue comparison for the Hessian of a ResNet-18 trained on ImageNet. The model with BN is shown in blue and the model without BN in red. The Hessians are computed at the end of training.

SLIDE 15

Experiments: Batch-Normalization

  • Our intuition from convex optimization suggests that first-order methods slow down significantly when the λ_i are highly spread out [see Bottou and Bousquet (2008) for explicit bounds].
  • Quantities such as the condition number, κ ≡ λ_1 / λ_n, are often used to measure this spread.

Conjecture: Batch-Normalization helps optimization by removing large outlier eigenvalues.

Let’s test this assertion!
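To make the condition-number intuition concrete, here is a small numpy experiment (mine, not from the talk): gradient descent on a quadratic loss ½ Σ_i λ_i θ_i², with the step size set by the largest eigenvalue. The two spectra below are assumptions chosen for illustration; adding a single large outlier eigenvalue raises κ from 3 to 200 and visibly stalls progress along the low-curvature directions.

```python
import numpy as np

def gd_on_quadratic(eigs, steps=500):
    """Run gradient descent on f(theta) = 0.5 * sum_i eigs[i] * theta_i^2,
    using step size 1 / lambda_max, and return the final loss."""
    eigs = np.asarray(eigs, dtype=float)
    theta = np.ones_like(eigs)                 # start at all-ones
    lr = 1.0 / eigs.max()                      # stable step size dictated by the top eigenvalue
    for _ in range(steps):
        theta -= lr * eigs * theta             # gradient of f is eigs * theta
    return 0.5 * np.sum(eigs * theta**2)

bulk = np.linspace(0.5, 1.5, 50)               # well-conditioned spectrum, kappa = 3
with_outlier = np.concatenate([bulk, [100.0]]) # same bulk plus one outlier, kappa = 200

print("final loss, no outlier  :", gd_on_quadratic(bulk))
print("final loss, with outlier:", gd_on_quadratic(with_outlier))
```

With the outlier present, the stable step size shrinks by roughly the ratio of the top eigenvalues, so the well-conditioned directions barely move; this is the slowdown the conjecture attributes to networks without BN.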

SLIDE 16

BN with Population Statistics

  • Our observations suggest that BN is effective because it removes the outlier eigenvalues of the Hessian.
  • We predict that in scenarios where BN is not effective, outlier eigenvalues are still present.
  • Example: when the statistics of the BN layer are computed from the full dataset (population statistics).

Figure: Optimization progress (in terms of loss) of batch normalization with mini-batch statistics and with population statistics.

SLIDE 17

BN with Population Statistics

Figure: The Hessian spectrum for a ResNet-32 after 15k steps. On the left, BN uses mini-batch statistics; on the right, population statistics.

SLIDE 18

Any Questions?

Hope to see you at our poster session today (06:30 to 09:00, Pacific Ballroom #51).

References

Zhaojun Bai, Gark Fahey, and Gene Golub. Some large-scale matrix computation problems. Journal of Computational and Applied Mathematics, 74(1-2):71–89, 1996.

Léon Bottou and Olivier Bousquet. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems, pages 161–168, 2008.

Gene H. Golub and John H. Welsch. Calculation of Gauss quadrature rules. Mathematics of Computation, 23(106):221–230, 1969.

Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.

Shashanka Ubaru, Jie Chen, and Yousef Saad. Fast estimation of tr(f(A)) via stochastic Lanczos quadrature. SIAM Journal on Matrix Analysis and Applications, 38(4):1075–1099, 2017.