SLIDE 1

The Anisotropic Noise in Stochastic Gradient Descent: Its Behavior of Escaping from Sharp Minima and Regularization Effects

Zhanxing Zhu∗, Jingfeng Wu∗, Bing Yu, Lei Wu, Jinwen Ma.

Peking University & Beijing Institute of Big Data Research

June, 2019

SLIDE 2

The implicit bias of stochastic gradient descent

◮ Compared with gradient descent (GD), stochastic gradient descent (SGD) tends to generalize better.

◮ This is attributed to the noise in SGD.

◮ In this work we study the anisotropic structure of SGD noise and its importance for escaping and regularization.

SLIDE 3

Stochastic gradient descent and its variants

Loss function
$$L(\theta) := \frac{1}{N}\sum_{i=1}^{N} \ell(x_i; \theta).$$

Gradient Langevin dynamics (GLD)
$$\theta_{t+1} = \theta_t - \eta\nabla_\theta L(\theta_t) + \eta\epsilon_t,\qquad \epsilon_t \sim \mathcal{N}(0, \sigma_t^2 I).$$

Stochastic gradient descent (SGD)
$$\theta_{t+1} = \theta_t - \eta\,\tilde{g}(\theta_t),\qquad \tilde{g}(\theta_t) = \frac{1}{m}\sum_{x\in B_t}\nabla_\theta \ell(x; \theta_t).$$

The structure of SGD noise
$$\tilde{g}(\theta_t) \sim \mathcal{N}\big(\nabla L(\theta_t),\, \Sigma^{\mathrm{sgd}}(\theta_t)\big),\qquad \Sigma^{\mathrm{sgd}}(\theta_t) \approx \frac{1}{m}\Big(\frac{1}{N}\sum_{i=1}^{N}\nabla\ell(x_i;\theta_t)\nabla\ell(x_i;\theta_t)^T - \nabla L(\theta_t)\nabla L(\theta_t)^T\Big).$$

SGD reformulation
$$\theta_{t+1} = \theta_t - \eta\nabla L(\theta_t) + \eta\epsilon_t,\qquad \epsilon_t \sim \mathcal{N}\big(0, \Sigma^{\mathrm{sgd}}(\theta_t)\big).$$
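To make the reformulation concrete, here is a minimal NumPy sketch (not part of the slides): it uses an assumed toy loss ℓ(x_i; θ) = ½(x_iᵀθ)², estimates Σ^sgd from per-sample gradients following the formula above, and runs the noisy update θ_{t+1} = θ_t − η∇L(θ_t) + ηε_t. All names and constants are illustrative.

```python
# Minimal sketch of the SGD reformulation on an assumed toy quadratic loss.
import numpy as np

rng = np.random.default_rng(0)
N, D, m, eta = 1000, 5, 32, 0.1            # dataset size, dimension, batch size, step size
X = rng.normal(size=(N, D))                 # toy data

def per_sample_grads(theta):
    # ∇ℓ(x_i; θ) = (x_i^T θ) x_i for the toy loss ℓ(x_i; θ) = 0.5 (x_i^T θ)^2
    return (X @ theta)[:, None] * X          # shape (N, D)

def sgd_noise_cov(theta):
    g = per_sample_grads(theta)
    full_grad = g.mean(axis=0)
    # Σ^sgd ≈ (1/m) [ (1/N) Σ_i ∇ℓ_i ∇ℓ_i^T − ∇L ∇L^T ]
    second_moment = g.T @ g / N
    return (second_moment - np.outer(full_grad, full_grad)) / m, full_grad

theta = rng.normal(size=D)
for t in range(100):
    Sigma, full_grad = sgd_noise_cov(theta)
    eps = rng.multivariate_normal(np.zeros(D), Sigma)   # ε_t ~ N(0, Σ^sgd(θ_t))
    theta = theta - eta * full_grad + eta * eps          # SGD reformulation
```

Sampling ε_t from N(0, Σ^sgd) instead of drawing a fresh mini-batch is what makes the anisotropic structure of the noise explicit.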
SLIDE 4

GD with unbiased noise

$$\theta_{t+1} = \theta_t - \eta\nabla_\theta L(\theta_t) + \epsilon_t,\qquad \epsilon_t \sim \mathcal{N}(0, \Sigma_t). \tag{1}$$

Iteration (1) can be viewed as a discretization of the following continuous stochastic differential equation (SDE):
$$d\theta_t = -\nabla_\theta L(\theta_t)\,dt + \Sigma_t^{1/2}\,dW_t. \tag{2}$$

Next we study the role of the noise structure Σ_t by analyzing the continuous SDE (2).
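As a rough illustration of SDE (2), the following sketch applies a plain Euler–Maruyama discretization to an assumed quadratic loss L(θ) = ½θᵀHθ with a fixed covariance Σ; the matrices and step size are arbitrary illustrative choices, not values from the paper.

```python
# Euler–Maruyama discretization of dθ_t = −∇L(θ_t) dt + Σ^{1/2} dW_t (toy setting).
import numpy as np

rng = np.random.default_rng(0)
D, dt, T = 2, 0.01, 500
H = np.diag([10.0, 0.1])                    # curvature of the assumed quadratic loss
Sigma = np.array([[1.0, 0.0], [0.0, 0.1]])  # noise covariance Σ_t, kept constant here
Sigma_sqrt = np.linalg.cholesky(Sigma)      # Σ^{1/2} driving the Brownian term

theta = np.zeros(D)                          # start at the minimum θ_0 = 0
for t in range(T):
    grad = H @ theta                         # ∇L(θ) for L(θ) = 0.5 θ^T H θ
    dW = np.sqrt(dt) * rng.normal(size=D)    # Brownian increment
    theta = theta - grad * dt + Sigma_sqrt @ dW
```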

SLIDE 5

Escaping efficiency

Definition (Escaping efficiency)

Suppose the SDE (2) is initialized at a minimum θ_0. Then for a fixed, small enough time t, the escaping efficiency is defined as the increase of the loss potential:
$$\mathbb{E}_{\theta_t}\big[L(\theta_t) - L(\theta_0)\big]. \tag{3}$$

Under suitable approximations, we can compute the escaping efficiency for SDE (2):
$$\mathbb{E}\big[L(\theta_t) - L(\theta_0)\big] = -\int_0^t \mathbb{E}\big[\nabla L^T \nabla L\big]\,dt + \frac{1}{2}\int_0^t \mathbb{E}\,\mathrm{Tr}(H_t\Sigma_t)\,dt \tag{4}$$
$$\approx \frac{1}{4}\mathrm{Tr}\big((I - e^{-2Ht})\Sigma\big) \approx \frac{t}{2}\mathrm{Tr}(H\Sigma). \tag{5}$$

Thus Tr(HΣ) serves as an important indicator for measuring the escaping behavior of noises with different structures.
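A quick way to sanity-check approximation (5) is to simulate the SDE around a quadratic minimum and compare the observed loss increase with (t/2)·Tr(HΣ); the sketch below does this under assumed toy values of H, Σ, and t.

```python
# Numerical check of (5): for L(θ) = 0.5 θ^T H θ initialized at the minimum,
# the simulated E[L(θ_t) − L(θ_0)] should track ¼·Tr((I − e^{−2Ht})Σ), and
# (t/2)·Tr(HΣ) is its small-t limit. All constants are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
H = np.diag([5.0, 0.1])
Sigma = np.diag([0.9, 0.1])
dt, steps, runs = 1e-3, 20, 2000
t = steps * dt
Sigma_sqrt = np.linalg.cholesky(Sigma)

losses = []
for _ in range(runs):
    theta = np.zeros(2)                       # θ_0 at the minimum
    for _ in range(steps):
        dW = np.sqrt(dt) * rng.normal(size=2)
        theta = theta - (H @ theta) * dt + Sigma_sqrt @ dW
    losses.append(0.5 * theta @ H @ theta)

I_minus = np.eye(2) - np.diag(np.exp(-2.0 * np.diag(H) * t))   # e^{−2Ht} for diagonal H
print("simulated E[L(θ_t) − L(θ_0)]:   ", np.mean(losses))
print("prediction ¼·Tr((I − e^{−2Ht})Σ):", 0.25 * np.trace(I_minus @ Sigma))
print("small-t limit (t/2)·Tr(HΣ):     ", 0.5 * t * np.trace(H @ Sigma))
```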

SLIDE 6

Factors affecting the escaping behavior

The noise scale. For Gaussian noise ε_t ∼ N(0, Σ_t), its scale can be measured by ‖ε_t‖_trace := E[ε_tᵀ ε_t] = Tr(Σ_t). Thus, based on Tr(HΣ), the larger the noise scale, the faster the escaping. To eliminate the impact of the noise scale, assume that at a given time t,
Tr(Σ_t) is constant. (6)

The ill-conditioning of minima. For minima whose Hessian is a scalar matrix H_t = λI, noises of the same magnitude make no difference, since Tr(H_tΣ_t) = λ Tr(Σ_t).

The structure of noise. For ill-conditioned minima, the structure of the noise plays an important role in the escaping!
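The two observations above can be checked numerically in a couple of lines; the matrices below are illustrative choices, not values from the paper.

```python
# With a scalar Hessian H = λI, any Σ with the same trace gives the same Tr(HΣ);
# with an ill-conditioned H, the same two Σ's give very different Tr(HΣ).
import numpy as np

Sigma_aniso = np.diag([0.99, 0.01])          # anisotropic, TrΣ = 1
Sigma_iso   = np.diag([0.50, 0.50])          # isotropic,  TrΣ = 1

H_scalar  = 2.0 * np.eye(2)                  # H = λI
H_illcond = np.diag([10.0, 0.001])           # ill-conditioned H

print(np.trace(H_scalar @ Sigma_aniso), np.trace(H_scalar @ Sigma_iso))    # 2.0 vs 2.0
print(np.trace(H_illcond @ Sigma_aniso), np.trace(H_illcond @ Sigma_iso))  # ≈9.9 vs ≈5.0
```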

SLIDE 7

The impact of noise structure

Proposition 1

Let H ∈ ℝ^{D×D} and Σ ∈ ℝ^{D×D} be positive semi-definite. Suppose that:

1. H is ill-conditioned. Let λ_1, λ_2, …, λ_D be the eigenvalues of H in descending order; for some constant k ≪ D and d > 1/2, the eigenvalues satisfy
$$\lambda_1 > 0,\qquad \lambda_{k+1}, \lambda_{k+2}, \dots, \lambda_D < \lambda_1 D^{-d}; \tag{7}$$

2. Σ is "aligned" with H. Let u_i be the unit eigenvector corresponding to eigenvalue λ_i; for some projection coefficient a > 0, we have
$$u_1^T \Sigma u_1 \ge a\lambda_1 \frac{\mathrm{Tr}\,\Sigma}{\mathrm{Tr}\,H}. \tag{8}$$

Then for such an anisotropic Σ and its isotropic equivalence Σ̄ = (TrΣ/D) I under constraint (6), the following ratio describes their difference in terms of escaping efficiency:
$$\frac{\mathrm{Tr}(H\Sigma)}{\mathrm{Tr}(H\bar{\Sigma})} = \mathcal{O}\big(aD^{2d-1}\big),\qquad d > \tfrac{1}{2}. \tag{9}$$
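The sketch below illustrates the ratio in (9) on a hand-built example (my own construction, not the paper's): H is diagonal with one dominant eigenvalue and the rest below λ₁D^{−d}, and Σ concentrates most of its trace on u₁ so that condition (8) holds for some a > 0. With d = 1 the ratio should grow roughly linearly in D.

```python
# Illustrative check of Proposition 1's escaping ratio Tr(HΣ)/Tr(HΣ̄) = O(a·D^{2d−1}).
import numpy as np

def escaping_ratio(D, d=1.0):
    lam = np.full(D, 0.5 * D ** (-d))        # small eigenvalues λ2..λD < λ1·D^{−d}
    lam[0] = 1.0                              # λ1
    H = np.diag(lam)                          # eigenvectors = standard basis, u1 = e1
    sigma = np.zeros(D)
    sigma[0] = 0.9                            # Σ puts 90% of its trace on u1 (condition (8))
    sigma[1:] = 0.1 / (D - 1)
    Sigma = np.diag(sigma)
    Sigma_iso = np.eye(D) * (sigma.sum() / D)   # isotropic equivalence Σ̄ = (TrΣ/D)·I
    return np.trace(H @ Sigma) / np.trace(H @ Sigma_iso)

for D in (10, 100, 1000):
    print(D, escaping_ratio(D))               # grows roughly like D^{2d−1} = D for d = 1
```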

SLIDE 8

Analyze the noise of SGD via Proposition 1

By Proposition 1, anisotropic noises satisfying the two conditions indeed help escape from ill-conditioned minima. Thus, to see the importance of SGD noise, we only need to show that it meets the two conditions.

◮ Condition 1 naturally holds for neural networks, thanks to their over-parameterization!

◮ See the following Proposition 2 for the second condition.

SLIDE 9

SGD noise and Hessian

Proposition 2

Consider a binary classification problem with data {(x_i, y_i)}_{i∈I}, y ∈ {0, 1}, and mean square loss
$$L(\theta) = \mathbb{E}_{(x,y)}\big(\varphi \circ f(x; \theta) - y\big)^2,$$
where f denotes the network and φ is a threshold activation function,
$$\varphi(f) = \min\{\max\{f, \delta\}, 1 - \delta\}, \tag{10}$$
with δ a small positive constant. Suppose the network f satisfies:

1. it has one hidden layer and piecewise-linear activation;
2. the parameters of its output layer are fixed during training.

Then there is a constant a > 0 such that, for θ close enough to a minimum θ*,
$$u(\theta)^T \Sigma(\theta)\, u(\theta) \ge a\lambda(\theta)\frac{\mathrm{Tr}\,\Sigma(\theta)}{\mathrm{Tr}\,H(\theta)} \tag{11}$$
holds almost everywhere, where λ(θ) and u(θ) are the maximal eigenvalue of the Hessian H(θ) and its corresponding eigenvector.

SLIDE 10

Examples of different noise structures

Table: Compared dynamics defined in Eq. (1).

Dynamics           | Noise ε_t                      | Remarks
SGD                | ε_t ∼ N(0, Σ_t^sgd)            | Σ_t^sgd is the gradient covariance matrix.
GLD constant       | ε_t ∼ N(0, ϱ_t² I)             | ϱ_t is a tunable constant.
GLD dynamic        | ε_t ∼ N(0, σ_t² I)             | σ_t is adjusted to force the noise to share the same magnitude with the SGD noise, similarly hereinafter.
GLD diagonal       | ε_t ∼ N(0, diag(Σ_t^sgd))      | diag(Σ_t^sgd) is the diagonal of the covariance of the SGD noise Σ_t^sgd.
GLD leading        | ε_t ∼ N(0, σ_t Σ̃_t)            | Σ̃_t is the best low-rank approximation of Σ_t^sgd.
GLD Hessian        | ε_t ∼ N(0, σ_t H̃_t)            | H̃_t is the best low-rank approximation of the Hessian.
GLD 1st eigvec(H)  | ε_t ∼ N(0, σ_t λ_1 u_1 u_1^T)  | λ_1, u_1 are the maximal eigenvalue and its corresponding unit eigenvector of the Hessian.
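The following sketch shows one plausible way to build the compared covariances from an estimated SGD-noise covariance and Hessian; the function names, the rank parameter k, and the trace-matching rescaling used to mimic the "same magnitude" adjustment are assumptions, not the authors' implementation.

```python
# One possible construction of the compared noise covariances from Σ^sgd and H.
import numpy as np

def low_rank(M, k):
    # best rank-k approximation of a symmetric PSD matrix via its top-k eigenpairs
    vals, vecs = np.linalg.eigh(M)
    return (vecs[:, -k:] * vals[-k:]) @ vecs[:, -k:].T

def compared_covariances(Sigma_sgd, Hess, rho=0.01, k=5):
    D = Sigma_sgd.shape[0]
    tr = np.trace(Sigma_sgd)                    # target magnitude, as in constraint (6)
    scale = lambda M: M * (tr / np.trace(M))    # σ_t chosen to match the SGD noise scale
    return {
        "SGD":               Sigma_sgd,
        "GLD constant":      rho ** 2 * np.eye(D),
        "GLD dynamic":       (tr / D) * np.eye(D),
        "GLD diagonal":      np.diag(np.diag(Sigma_sgd)),
        "GLD leading":       scale(low_rank(Sigma_sgd, k)),
        "GLD Hessian":       scale(low_rank(Hess, k)),
        "GLD 1st eigvec(H)": scale(low_rank(Hess, 1)),   # λ1·u1·u1^T, rescaled
    }

# example: random PSD stand-ins for Σ^sgd and H; the scaled entries share the same trace
rng = np.random.default_rng(0)
A, B = rng.normal(size=(20, 20)), rng.normal(size=(20, 20))
covs = compared_covariances(A @ A.T / 20, B @ B.T / 20)
print({name: round(float(np.trace(C)), 3) for name, C in covs.items()})
```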

SLIDE 11

2-D toy example

Figure: 2-D toy example. Compared dynamics (GD, GLD const, GLD diag, GLD leading, GLD Hessian, GLD 1st eigvec(H)) are initialized at the sharp minimum. Left: the escaping trajectory of each compared dynamic on the loss contours over (w1, w2) in one run. Right: success rate (%) of arriving at the flat solution over 100 repeated runs.

SLIDE 12

One hidden layer network

Figure: One-hidden-layer neural networks. Tr(H_tΣ_t) (solid lines) and Tr(H_tΣ̄_t) (dotted lines) versus iteration, on a logarithmic vertical axis. The number of hidden nodes varies in {32, 128, 512}.

SLIDE 13

FashionMNIST experiments

Figure: FashionMNIST experiments. Left: the first 400 eigenvalues of the Hessian at θ*_GD, the sharp minimum found by GD after 3000 iterations (eigenvalue spectrum at iteration 3000). Middle: estimation of the projection coefficient in Proposition 1, â = u_1ᵀΣu_1 · Tr H / (λ_1 Tr Σ), versus iteration. Right: Tr(H_tΣ_t) versus Tr(H_tΣ̄_t) during SGD optimization initialized from θ*_GD, where Σ̄_t = (Tr Σ_t / D) I denotes the isotropic equivalence of the SGD noise.

SLIDE 14

FashionMNIST experiments

Figure: FashionMNIST experiments. Compared dynamics are initialized at θ*_GD found by GD, marked by the vertical dashed line at iteration 3000. Left: test accuracy (%) versus iteration; final accuracies: GD 66.22, GLD const 66.33, GLD dynamic 66.17, GLD diag 66.41, GLD leading 68.83, GLD Hessian 68.74, GLD 1st eigvec(H) 68.90, SGD 68.79. Right: expected sharpness versus iteration. Expected sharpness (the higher, the sharper) is measured as E_{ν∼N(0,δ²I)}[L(θ + ν)] − L(θ) with δ = 0.01; the expectation is computed by averaging over 1000 samples.
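For reference, the expected-sharpness measure from the caption above can be estimated by simple Monte Carlo sampling; the sketch below assumes only a generic `loss` callable and illustrates it on a toy quadratic, not the FashionMNIST model.

```python
# Monte Carlo estimate of expected sharpness E_{ν∼N(0,δ²I)}[L(θ + ν)] − L(θ).
import numpy as np

def expected_sharpness(loss, theta, delta=0.01, n_samples=1000, seed=0):
    rng = np.random.default_rng(seed)
    base = loss(theta)
    perturbed = [loss(theta + delta * rng.normal(size=theta.shape))
                 for _ in range(n_samples)]
    return np.mean(perturbed) - base          # higher value = sharper minimum

# example on a toy quadratic: larger curvature gives a larger sharpness value
H = np.diag([10.0, 0.1])
quad = lambda th: 0.5 * th @ H @ th
print(expected_sharpness(quad, np.zeros(2)))
```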

SLIDE 15

Conclusion

◮ We explore the escaping behavior of SGD-like processes by analyzing their continuous approximation.

◮ We show that, thanks to the anisotropic noise, SGD can escape from sharp minima efficiently, which leads to implicit regularization effects.

◮ Our work highlights the importance of studying the structure of SGD noise and its effects.

◮ Experiments support our understanding.

Poster: Wed, Jun 12th, 06:30–09:00 PM @ Pacific Ballroom #97