

SLIDE 1

An Investigation into Neural Net Optimization via Hessian Eigenvalue Density

Behrooz Ghorbani

Department of Electrical Engineering, Stanford University (joint work with Shankar Krishnan & Ying Xiao from Google Research)

June 2019

SLIDE 2

Overview

  • Gradient descent and its variants are the most popular methods for optimizing neural networks.
  • The performance of these optimizers is highly dependent on the local curvature of the loss surface → it is important to study the loss curvature.
  • We present a scalable algorithm for computing the full eigenvalue density of the Hessian for deep neural networks.
  • We leverage this algorithm to study the effect of architecture / hyper-parameter choices on the optimization landscape.

SLIDE 3

Basic Definitions

  • θ ∈ R^n is the model parameter; the loss is L(θ) ≡ (1/N) Σ_{i=1}^N L(θ, (x_i, y_i)).
  • The Hessian matrix, H, is an n × n symmetric matrix of second derivatives: H(θ_t)_{i,j} = ∂²L/(∂θ_i ∂θ_j) |_{θ=θ_t}.
  • H(θ) represents the (local) loss curvature at the point θ.
  • H(θ) has eigenvalue-eigenvector pairs (λ_i, q_i)_{i=1}^n with λ_1 ≥ λ_2 ≥ ⋯ ≥ λ_n.
  • λ_i is the curvature of the loss in the direction of q_i in the neighborhood of θ.
  • We focus on estimating the empirical distribution of the λ_i as a concrete way to study the loss curvature.
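For intuition only, here is a minimal numpy sketch (not from the talk) of these definitions on a model small enough to handle exactly: for a linear model with squared loss L(θ) = (1/N) Σ_i (x_iᵀθ − y_i)², the Hessian is the constant matrix H = (2/N) XᵀX, and its eigenvalues λ_1 ≥ ⋯ ≥ λ_n can be computed directly. The names X, y, and H are illustrative; for a real network the Hessian depends on θ and is far too large to form explicitly, which is what the rest of the talk addresses.

```python
import numpy as np

# Minimal sketch (assumption: tiny least-squares model, not the talk's networks).
# Loss: L(theta) = (1/N) * sum_i (x_i^T theta - y_i)^2, so H = (2/N) X^T X exactly.
rng = np.random.default_rng(0)
N, n = 200, 10                      # N examples, n parameters
X = rng.normal(size=(N, n))
y = rng.normal(size=N)

H = (2.0 / N) * X.T @ X             # exact Hessian of the quadratic loss
lam = np.linalg.eigvalsh(H)[::-1]   # eigenvalues sorted so lambda_1 >= ... >= lambda_n

print("largest curvature  lambda_1 =", lam[0])
print("smallest curvature lambda_n =", lam[-1])
```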

SLIDE 4

Hessian Computation in Deep Networks

  • The eigenvalue distribution function of H is defined as φ(t) = (1/n) Σ_{i=1}^n δ(t − λ_i).
  • Let f_σ(x) = (1/(σ√(2π))) exp(−x²/(2σ²)) be the Gaussian density.
  • Convolution with the Gaussian gives the smoothed density: φ_σ(t) = (φ ∗ f_σ)(t) = (1/n) Σ_{i=1}^n f_σ(t − λ_i).

Figure: The spike density (1/n) Σ_i δ(t − λ_i) and its Gaussian-smoothed counterpart.
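A small numpy sketch of the smoothing step (my own illustration, not the authors' code): given a list of eigenvalues, evaluate φ_σ on a grid by averaging Gaussian bumps centered at each λ_i. The function name smoothed_density and the toy spectrum are assumptions made for the example.

```python
import numpy as np

def smoothed_density(eigs, grid, sigma=0.1):
    """phi_sigma(t) = (1/n) * sum_i f_sigma(t - lambda_i), evaluated at each point of `grid`."""
    eigs = np.asarray(eigs)
    diffs = grid[:, None] - eigs[None, :]                                # shape (len(grid), n)
    bumps = np.exp(-diffs**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
    return bumps.mean(axis=1)                                            # average over the n eigenvalues

# Toy usage: a bulk of eigenvalues near 0 plus one outlier at 4.
eigs = np.concatenate([np.random.default_rng(0).normal(0.0, 0.3, size=500), [4.0]])
grid = np.linspace(-2.0, 5.0, 400)
phi = smoothed_density(eigs, grid, sigma=0.1)
print("density peaks near t =", grid[np.argmax(phi)])
```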

SLIDE 5

Estimating the Smoothed Density

  • This goes back to Gene Golub and his students [Golub and Welsch (1969); Bai et al. (1996)].
  • Their machinery constructs weights and nodes (w_i, ℓ_i)_{i=1}^m such that, for all "nice" functions g,
    (1/n) Σ_{i=1}^n g(λ_i) ≈ Σ_{i=1}^m w_i g(ℓ_i).
  • Using g(x) = f_σ(t − x):
    φ_σ(t) = (1/n) Σ_{i=1}^n f_σ(t − λ_i) ≈ φ̂(t) = Σ_{i=1}^m w_i f_σ(t − ℓ_i).
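Once the nodes ℓ_i and weights w_i are available (they come out of the Lanczos procedure sketched on the next slide), the approximation above is just a weighted Gaussian mixture. A minimal sketch, assuming the hypothetical helper name quadrature_density:

```python
import numpy as np

def quadrature_density(nodes, weights, grid, sigma=0.1):
    """phi_hat(t) = sum_i w_i * f_sigma(t - l_i): one Gaussian bump per node l_i, weighted by w_i."""
    diffs = grid[:, None] - np.asarray(nodes)[None, :]
    bumps = np.exp(-diffs**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
    return bumps @ np.asarray(weights)
```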

SLIDE 6

Algorithm Sketch

  • Stochastic: draw v ∼ N(0, (1/n) I_n).
  • Lanczos:
    1. Compute an orthonormal basis for {v, Hv, ⋯, H^{m−1}v}. Call this basis V.
    2. Let T = Vᵀ H V.
  • Quadrature:
    1. Diagonalize T = U D Uᵀ.
    2. Estimate φ_σ(t) = (1/n) Σ_{i=1}^n f_σ(t − λ_i) with φ̂_v(t) = Σ_{i=1}^m U_{1,i}² f_σ(t − D_{i,i}).

Computational Complexity

  • Calculating (w_i, ℓ_i)_{i=1}^m takes O(m × model size × dataset size). In practice, m ≈ 100 is more than enough.
  • Explicitly calculating the eigenvalues takes O(model size² × dataset size).
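The numpy sketch below puts the two boxes together for a single probe vector v (my own illustration under assumed names, not the authors' implementation): run m Lanczos steps using only Hessian-vector products, form the tridiagonal T = VᵀHV, diagonalize it, and read off nodes ℓ_i = D_{i,i} and weights w_i = U_{1,i}². A small dense symmetric matrix stands in for H; for a real network, hvp(v) would be computed with automatic differentiation (Pearlmutter's trick) rather than by materializing H.

```python
import numpy as np

def lanczos_quadrature(hvp, n, m=100, seed=0):
    """One-probe sketch of the slide's recipe: Lanczos + Gaussian quadrature.

    hvp(v) must return H @ v for an (implicit) n x n symmetric matrix H.
    Returns quadrature nodes l_i (Ritz values) and weights w_i = U_{1,i}^2."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=n) / np.sqrt(n)                 # v ~ N(0, (1/n) I_n)
    v /= np.linalg.norm(v)                              # Lanczos starts from a unit vector

    V = np.zeros((n, m))
    alpha, beta = np.zeros(m), np.zeros(m - 1)
    V[:, 0] = v
    for j in range(m):
        w = hvp(V[:, j])
        alpha[j] = V[:, j] @ w
        w = w - alpha[j] * V[:, j]
        if j > 0:
            w = w - beta[j - 1] * V[:, j - 1]
        w = w - V[:, : j + 1] @ (V[:, : j + 1].T @ w)   # full reorthogonalization (simple, not cheap)
        if j < m - 1:
            beta[j] = np.linalg.norm(w)
            if beta[j] < 1e-12:                          # Krylov subspace exhausted: truncate
                alpha, beta = alpha[: j + 1], beta[:j]
                break
            V[:, j + 1] = w / beta[j]

    T = np.diag(alpha) + np.diag(beta, 1) + np.diag(beta, -1)   # T = V^T H V is tridiagonal
    evals, evecs = np.linalg.eigh(T)                            # T = U D U^T
    return evals, evecs[0, :] ** 2                              # nodes l_i, weights w_i

# Toy usage: a dense random symmetric matrix plays the role of H.
n = 500
A = np.random.default_rng(1).normal(size=(n, n)) / np.sqrt(n)
H = (A + A.T) / 2.0
nodes, weights = lanczos_quadrature(lambda u: H @ u, n, m=80)
print("weights sum to", weights.sum())                  # ~1 by construction
```

Feeding the returned nodes and weights into the quadrature_density sketch from the previous slide gives the estimated density; the full method additionally averages such estimates over several independent draws of v.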

SLIDE 7

Accuracy

  • The algorithm enjoys strong theoretical guarantees. We present some such guarantees in our paper; Ubaru et al. (2017) provide additional details.

Figure: Comparison of a degree-90 quadrature approximation with the actual Hessian density. The Hessian is calculated from a 2-layer network with 15,910 parameters trained on MNIST.

SLIDE 8

Let’s Train a ResNet-32

460K parameters. Trained on CIFAR-10. The network has Batch-Normalization (Ioffe and Szegedy (2015)).

SLIDE 9

Experiments: Initialization

  • At initialization time, the Hessian has significant negative eigenvalues.
  • This points to significant local non-convexity of the loss surface at step 0.
  • There is a significant difference between the initialization landscape and the training landscape.
  • For small datasets such as CIFAR-10 / MNIST, the negative directions disappear extremely fast.

Figure: Hessian eigenvalue density (log scale) at step 0 and at step 515.

SLIDE 10

Experiments: Further Training

  • After the first epoch, the Hessian spectrum stabilizes.
  • The Hessian contains information about the non-local geometry of the loss.
  • The eigenvalues of the Hessian at this stage determine whether the network can be trained effectively.

Figure: The spectrum of the network stabilizes after the first epoch.

SLIDE 11

Experiments: Reducing Learning Rate

  • Prevalent view: smaller learning rates allow you to converge to sharper local minima.
  • Under this view, reducing the learning rate should bring about an increase in the top eigenvalue.

Figure: Hessian eigenvalue densities with step size 0.10 and step size 0.02.

Figure: The learning rate is reduced by a factor of 10 at step 40k. Surprisingly, the top eigenvalue also decreases.

SLIDE 12

Experiments: End of Training

  • The Hessian spectrum at the end of training:

Figure: Spectrum of the Hessian after 100k steps of training (end of training). The smallest eigenvalue is ≈ −0.0006.

SLIDE 13

Examining the Role of Architecture

  • Let’s remove Batch-Normalization from the network and reexamine the spectrum!

Figure: Spectrum of the Hessian after 7k steps of training. Outlier eigenvalues appear when BN is removed from the network.

SLIDE 14

Experiments: Batch-Normalization

  • This observation is consistent across different architectures and datasets:

Figure: Eigenvalue comparison for the Hessian of a ResNet-18 trained on ImageNet. The model with BN is shown in blue and the model without BN in red. The Hessians are computed at the end of training.

SLIDE 15

Experiments: Batch-Normalization

  • Our intuition from convex optimization suggests that first-order methods slow down significantly when the λ_i are highly spread out [see Bottou and Bousquet (2008) for explicit bounds].
  • Quantities such as the condition number, κ ≡ λ_1 / λ_n, are often used to measure this spread.

Conjecture: Batch-Normalization helps optimization by removing large outlier eigenvalues.

Let’s test this assertion!
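To make the condition-number intuition concrete, here is a small numpy experiment (mine, not from the talk): gradient descent on a quadratic loss ½ Σ_i λ_i θ_i², with the step size set by the largest eigenvalue. The two spectra below are assumptions chosen for illustration; adding a single large outlier eigenvalue raises κ from 3 to 200 and visibly stalls progress along the low-curvature directions.

```python
import numpy as np

def gd_on_quadratic(eigs, steps=500):
    """Run gradient descent on f(theta) = 0.5 * sum_i eigs[i] * theta_i^2,
    using step size 1 / lambda_max, and return the final loss."""
    eigs = np.asarray(eigs, dtype=float)
    theta = np.ones_like(eigs)                 # start at all-ones
    lr = 1.0 / eigs.max()                      # stable step size dictated by the top eigenvalue
    for _ in range(steps):
        theta -= lr * eigs * theta             # gradient of f is eigs * theta
    return 0.5 * np.sum(eigs * theta**2)

bulk = np.linspace(0.5, 1.5, 50)               # well-conditioned spectrum, kappa = 3
with_outlier = np.concatenate([bulk, [100.0]]) # same bulk plus one outlier, kappa = 200

print("final loss, no outlier  :", gd_on_quadratic(bulk))
print("final loss, with outlier:", gd_on_quadratic(with_outlier))
```

With the outlier present, the stable step size shrinks by roughly the ratio of the top eigenvalues, so the well-conditioned directions barely move; this is the slowdown the conjecture attributes to networks without BN.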

SLIDE 16

BN with Population Statistics

  • Our observations suggest that BN is effective because it removes the outlier eigenvalues of the Hessian.
  • We predict that in scenarios where BN is not effective, outlier eigenvalues are still present.
  • Example: when the statistics of the BN layer are computed from the full dataset (population statistics).

Figure: Optimization progress (in terms of loss) of batch normalization with mini-batch statistics and with population statistics.

SLIDE 17

BN with Population Statistics

Figure: The Hessian spectrum for a ResNet-32 after 15k steps. On the left, BN uses mini-batch statistics; on the right, population statistics.

SLIDE 18

Any Questions?

Hope to see you at our poster session today (06:30 to 09:00, Pacific Ballroom #51).

References

Zhaojun Bai, Gark Fahey, and Gene Golub. Some large-scale matrix computation problems. Journal of Computational and Applied Mathematics, 74(1-2):71–89, 1996.

Léon Bottou and Olivier Bousquet. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems, pages 161–168, 2008.

Gene H. Golub and John H. Welsch. Calculation of Gauss quadrature rules. Mathematics of Computation, 23(106):221–230, 1969.

Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.

Shashanka Ubaru, Jie Chen, and Yousef Saad. Fast estimation of tr(f(A)) via stochastic Lanczos quadrature. SIAM Journal on Matrix Analysis and Applications, 38(4):1075–1099, 2017.