An Investigation into Neural Net Optimization via Hessian Eigenvalue Density
Behrooz Ghorbani, Department of Electrical Engineering, Stanford University (joint work with Shankar Krishnan and Ying Xiao, Google Research), June 2019


  1. An Investigation into Neural Net Optimization via Hessian Eigenvalue Density. Behrooz Ghorbani, Department of Electrical Engineering, Stanford University (joint work with Shankar Krishnan and Ying Xiao, Google Research). Behrooz Ghorbani, Hessian Spectral Density, June 2019, slide 1 / 18.


  6. Overview. Gradient descent and its variants are the most popular methods for optimizing neural networks. The performance of these optimizers depends heavily on the local curvature of the loss surface, so it is important to study that curvature. We present a scalable algorithm for computing the full eigenvalue density of the Hessian for deep neural networks, and we leverage this algorithm to study the effect of architecture and hyperparameter choices on the optimization landscape.


  12. Basic Definitions. θ ∈ ℝⁿ is the model parameter, and L(θ) ≡ (1/N) ∑_{i=1}^{N} L(θ, (x_i, y_i)). The Hessian matrix H is an n × n symmetric matrix of second derivatives: H(θ_t)_{i,j} = ∂²L / (∂θ_i ∂θ_j) |_{θ = θ_t}. H(θ) represents the (local) loss curvature at the point θ. H(θ) has eigenvalue-eigenvector pairs (λ_i, q_i)_{i=1}^{n} with λ_1 ≥ λ_2 ≥ ··· ≥ λ_n; λ_i is the curvature of the loss in the direction of q_i in the neighborhood of θ. We focus on estimating the empirical distribution of the λ_i as a concrete way to study the loss curvature.
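To make these definitions concrete, here is a small sketch (mine, not from the talk) that forms the Hessian of a toy two-parameter loss by finite differences and reads off the eigenpairs; the `loss` and `hessian` functions are hypothetical stand-ins for a real network loss and an automatic second-derivative routine.

```python
import numpy as np

def loss(theta):
    # Toy two-parameter quadratic loss standing in for a network training loss.
    return theta[0] ** 2 + 3.0 * theta[1] ** 2 + theta[0] * theta[1]

def hessian(f, theta, eps=1e-5):
    """Finite-difference Hessian H_ij = d^2 f / (dtheta_i dtheta_j) at theta."""
    n = theta.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e_i, e_j = np.zeros(n), np.zeros(n)
            e_i[i], e_j[j] = eps, eps
            H[i, j] = (f(theta + e_i + e_j) - f(theta + e_i)
                       - f(theta + e_j) + f(theta)) / eps ** 2
    return (H + H.T) / 2  # symmetrize against round-off

theta = np.zeros(2)
H = hessian(loss, theta)            # exact Hessian here is [[2, 1], [1, 6]]
lams, Q = np.linalg.eigh(H)         # eigh returns ascending eigenvalues
lams, Q = lams[::-1], Q[:, ::-1]    # reorder so lambda_1 >= ... >= lambda_n
```

Each column `Q[:, i]` is a direction q_i, and `lams[i]` is the curvature of the loss along it.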


  19. Hessian Computation in Deep Networks. The eigenvalue distribution function of H is defined as φ(t) = (1/n) ∑_{i=1}^{n} δ(t − λ_i). Let f_σ(x) = (1 / (σ√(2π))) exp(−x² / (2σ²)) be the Gaussian density. Convolution with the Gaussian turns the spike measure into a smooth density: φ(t) = (1/n) ∑_{i=1}^{n} δ(t − λ_i) → φ_σ(t) = (1/n) ∑_{i=1}^{n} f_σ(t − λ_i). [Figure: the spike measure (1/n) ∑ δ(t − λ_i) on the left and the smoothed density φ_σ on the right, both plotted over t ∈ [−2, 2].]
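The smoothing step can be sketched directly (the function name `smoothed_density` and the sample eigenvalues are mine, for illustration only):

```python
import numpy as np

def smoothed_density(eigenvalues, grid, sigma=0.1):
    """phi_sigma(t) = (1/n) * sum_i f_sigma(t - lambda_i),
    where f_sigma is the Gaussian density with standard deviation sigma."""
    lam = np.asarray(eigenvalues)[None, :]   # shape (1, n)
    t = np.asarray(grid)[:, None]            # shape (T, 1)
    f = np.exp(-(t - lam) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))
    return f.mean(axis=1)                    # average over the n eigenvalues

lams = np.array([-1.0, 0.0, 0.0, 0.5, 1.5])  # made-up spectrum
grid = np.linspace(-2, 2, 401)
phi = smoothed_density(lams, grid)
# phi is a probability density: it integrates to ~1 over the grid,
# with its largest peak at the repeated eigenvalue 0.
```

Each Dirac spike becomes a Gaussian bump of width σ, so repeated or clustered eigenvalues show up as taller peaks.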


  21. Estimating the Smoothed Density. Gene Golub and his students [Golub and Welsch (1969); Bai et al. (1996)] developed Gaussian quadrature methods that construct pairs (w_i, ℓ_i)_{i=1}^{m} such that, for all "nice" functions g, (1/n) ∑_{i=1}^{n} g(λ_i) ≈ ∑_{i=1}^{m} w_i g(ℓ_i).
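A minimal sketch of the Lanczos quadrature idea underlying such estimators (my toy illustration on a dense random matrix, not the authors' implementation): for a unit vector v, m Lanczos steps produce nodes ℓ_i and weights w_i with ∑ w_i g(ℓ_i) ≈ vᵀ g(A) v, and averaging over random unit vectors v recovers (1/n) ∑ g(λ_i).

```python
import numpy as np

def lanczos_quadrature(A, v, m):
    """Run m Lanczos steps on symmetric A from unit vector v. Returns
    quadrature nodes ell_i (Ritz values) and weights w_i such that
    sum_i w_i * g(ell_i) ~ v^T g(A) v for smooth g (exact for
    polynomials of degree <= 2m - 1)."""
    n = v.size
    V = np.zeros((n, m))
    alpha, beta = np.zeros(m), np.zeros(m - 1)
    V[:, 0] = v / np.linalg.norm(v)
    for j in range(m):
        w = A @ V[:, j]
        alpha[j] = V[:, j] @ w
        w -= alpha[j] * V[:, j]
        if j > 0:
            w -= beta[j - 1] * V[:, j - 1]
        # Full reorthogonalization: cheap at this toy scale, keeps V orthonormal.
        w -= V[:, :j + 1] @ (V[:, :j + 1].T @ w)
        if j < m - 1:
            beta[j] = np.linalg.norm(w)
            V[:, j + 1] = w / beta[j]
    T = np.diag(alpha) + np.diag(beta, 1) + np.diag(beta, -1)
    nodes, U = np.linalg.eigh(T)
    weights = U[0, :] ** 2          # squared first components of T's eigenvectors
    return nodes, weights

rng = np.random.default_rng(0)
n = 50
M = rng.standard_normal((n, n))
A = (M + M.T) / 2                   # symmetric test matrix
v = rng.standard_normal(n)
v /= np.linalg.norm(v)
nodes, weights = lanczos_quadrature(A, v, m=12)
est = weights @ nodes ** 2          # quadrature estimate of v^T A^2 v
```

With g(x) = x² and m = 12, the quadrature is exact up to round-off, so `est` matches vᵀA²v; in practice one averages such estimates over a handful of random v to approximate (1/n) tr g(H).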
