

SLIDE 1

Step Size Matters in Deep Learning

Kamil Nar and Shankar Sastry
Neural Information Processing Systems, December 4, 2018

SLIDE 2

Gradient Descent: Effect of Step Size

Example

min_{x ∈ ℝ} (x² + 1)(x − 1)²(x − 2)²

with the two global minima x*_1 = 1 and x*_2 = 2.

[Plot of f(x) over x, with both minima marked.]

SLIDE 3

Gradient Descent: Effect of Step Size

Example

min_{x ∈ ℝ} (x² + 1)(x − 1)²(x − 2)²

with the two global minima x*_1 = 1 and x*_2 = 2.

[Plot of f(x) over x, with both minima marked.]

From random initialization, gradient descent
  • converges to x*_1 only if δ ≤ 0.5
  • converges to x*_2 only if δ ≤ 0.2

SLIDE 4

Gradient Descent: Effect of Step Size

Example

min_{x ∈ ℝ} (x² + 1)(x − 1)²(x − 2)²

with the two global minima x*_1 = 1 and x*_2 = 2.

[Plot of f(x) over x, with both minima marked.]

From random initialization, gradient descent
  • converges to x*_1 only if δ ≤ 0.5
  • converges to x*_2 only if δ ≤ 0.2

If the algorithm converges with δ = 0.3, the solution is x*_1.
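A minimal numerical sketch of this example (not from the slides; Python with NumPy, and the starting interval, iteration budget, and the step sizes 0.45, 0.3, 0.15 are illustrative choices sitting on either side of the thresholds 0.5 and 0.2 above):

    import numpy as np

    # f(x) = (x^2 + 1) (x - 1)^2 (x - 2)^2, the example from the slide
    def grad_f(x):
        return (2 * x * (x - 1) ** 2 * (x - 2) ** 2
                + (x ** 2 + 1) * 2 * (x - 1) * (x - 2) ** 2
                + (x ** 2 + 1) * (x - 1) ** 2 * 2 * (x - 2))

    def gradient_descent(step, x0, iters=5000):
        x = x0
        for _ in range(iters):
            x = x - step * grad_f(x)
            if abs(x) > 1e3:                 # treat blow-up as divergence
                return None
        return x if abs(grad_f(x)) < 1e-8 else None   # report only converged runs

    rng = np.random.default_rng(0)
    starts = rng.uniform(0.5, 2.5, size=50)
    # 0.45 and 0.3 respect the threshold 0.5 for x*_1 but violate 0.2 for x*_2;
    # 0.15 respects both thresholds, so both minima can appear as limits.
    for step in (0.45, 0.3, 0.15):
        limits = {round(x, 2) for x in (gradient_descent(step, s) for s in starts)
                  if x is not None}
        print(f"step size {step}: converged limit points {sorted(limits)}")

For the two larger step sizes only x*_1 should appear among the converged limit points; with step size 0.15 both minima should appear.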

SLIDE 5

Deep Linear Networks

x → W_L W_{L−1} ⋯ W_2 W_1 x

SLIDE 6

Deep Linear Networks

x → W_L W_{L−1} ⋯ W_2 W_1 x

  • The cost function has infinitely many local minima
  • Different dynamic characteristics at different optima (worked example below)
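For intuition, a worked two-layer scalar case (my own illustration, in the spirit of the next slides): fitting a scalar λ ≠ 0 with two weights,

min_{w_1, w_2} ½ (w_2 w_1 − λ)²,

every pair (c, λ/c) with c ≠ 0 is a global minimum, so the minima form a continuum. The Hessian at (c, λ/c) has eigenvalues 0 and c² + λ²/c², so the curvature seen by gradient descent depends on which minimum is approached: it is smallest, 2|λ|, at balanced factorizations with c² = |λ|, and it grows without bound as the factorization becomes disproportionate.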

SLIDE 7

Lyapunov Stability of Gradient Descent

Deep Linear Networks

Proposition

  • λ ∈ ℝ and λ ≠ 0
  • λ is estimated as the product of scalar parameters {w_i} by

min_{w_i} ½ (w_L ⋯ w_2 w_1 − λ)²

SLIDE 8

Lyapunov Stability of Gradient Descent

Deep Linear Networks

Proposition

  • λ ∈ ℝ and λ ≠ 0
  • λ is estimated as the product of scalar parameters {w_i} by

min_{w_i} ½ (w_L ⋯ w_2 w_1 − λ)²

For convergence to {w*_i} with w*_L ⋯ w*_2 w*_1 = λ, the step size must satisfy

δ ≤ 2 / Σ_{i=1}^{L} (λ / w*_i)²
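A quick numerical check of the proposition for L = 2 and λ = 1 (toy values of my own): at the disproportionate optimum (w*_1, w*_2) = (4, 0.25) the bound gives δ ≤ 2 / ((1/4)² + (1/0.25)²) ≈ 0.124, so gradient descent started right next to this optimum should stay near it only for step sizes below roughly 0.124; with a larger step it leaves that optimum (settling elsewhere or diverging).

    import numpy as np

    lam = 1.0
    w_star = np.array([4.0, 0.25])                 # disproportionate optimum, w_2 * w_1 = lam
    bound = 2.0 / np.sum((lam / w_star) ** 2)      # proposition's bound, about 0.124

    def gd_from(delta, iters=50_000):
        w = w_star + 1e-3                          # start just next to the optimum
        for _ in range(iters):
            err = w[0] * w[1] - lam
            grad = err * np.array([w[1], w[0]])    # gradient of 0.5 * (w_2 w_1 - lam)^2
            w = w - delta * grad
            if not np.all(np.isfinite(w)) or np.abs(w).max() > 1e6:
                return None                        # diverged
        return w

    print(f"step size bound for this optimum: {bound:.3f}")
    for delta in (0.1, 0.2):
        w = gd_from(delta)
        if w is None:
            print(f"delta = {delta}: diverged")
        else:
            print(f"delta = {delta}: final (w_1, w_2) = ({w[0]:.3f}, {w[1]:.3f})")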

SLIDE 9

Lyapunov Stability of Gradient Descent

Deep Linear Networks

  • δ needs to be very small for equilibria with disproportionate {w*_i}
  • For each δ, the algorithm can converge only to a subset of optima

SLIDE 10

Lyapunov Stability of Gradient Descent

Deep Linear Networks

  • δ needs to be very small for equilibria with disproportionate {w*_i}
  • For each δ, the algorithm can converge only to a subset of optima
  • No finite Lipschitz constant for the gradient on the whole parameter space (see the note below)
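To see the last bullet concretely (continuing the two-layer scalar illustration above, not from the slides): the gradient of ½(w_2 w_1 − λ)² has components (w_2 w_1 − λ) w_2 and (w_2 w_1 − λ) w_1, which are cubic in the parameters, so their derivatives grow without bound over the parameter space. No single Lipschitz constant for the gradient exists, and the usual smoothness-based step size rule cannot be applied globally.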

SLIDE 11

Deep Linear Networks

Theorem

  • {x_i}_{i∈[N]} satisfies (1/N) Σ_{i=1}^{N} x_i x_i⊤ = I
  • R is estimated as the product of the {W_j} by

min_{W_j} (1/(2N)) Σ_{i=1}^{N} ‖R x_i − W_L W_{L−1} ⋯ W_2 W_1 x_i‖₂²

SLIDE 12

Deep Linear Networks

Theorem

  • {x_i}_{i∈[N]} satisfies (1/N) Σ_{i=1}^{N} x_i x_i⊤ = I
  • R is estimated as the product of the {W_j} by

min_{W_j} (1/(2N)) Σ_{i=1}^{N} ‖R x_i − W_L W_{L−1} ⋯ W_2 W_1 x_i‖₂²

Assume the gradient descent algorithm with random initialization has converged to R̂. Then

ρ(R̂) ≤ (2 / (Lδ))^(L/(2L−2)) almost surely.

SLIDE 13

Deep Linear Networks

Theorem

  • {x_i}_{i∈[N]} satisfies (1/N) Σ_{i=1}^{N} x_i x_i⊤ = I
  • R is estimated as the product of the {W_j} by

min_{W_j} (1/(2N)) Σ_{i=1}^{N} ‖R x_i − W_L W_{L−1} ⋯ W_2 W_1 x_i‖₂²

Assume the gradient descent algorithm with random initialization has converged to R̂. Then

ρ(R̂) ≤ (2 / (Lδ))^(L/(2L−2)) almost surely.

  • Step size bounds the Lipschitz constant of the estimated function

SLIDE 14

Deep Linear Networks

Theorem

  • {x_i}_{i∈[N]} satisfies (1/N) Σ_{i=1}^{N} x_i x_i⊤ = I
  • R is estimated as the product of the {W_j} by

min_{W_j} (1/(2N)) Σ_{i=1}^{N} ‖R x_i − W_L W_{L−1} ⋯ W_2 W_1 x_i‖₂²

Assume the gradient descent algorithm with random initialization has converged to R̂. Then

ρ(R̂) ≤ (2 / (Lδ))^(L/(2L−2)) almost surely.

  • Step size bounds the Lipschitz constant of the estimated function
  • In contrast to ordinary least squares, where the step size does not constrain the converged estimate (numerical sketch below)
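To get a feel for the numbers (an illustrative sketch, not from the slides): the bound (2/(Lδ))^(L/(2L−2)) reduces to 1/δ at L = 2, and its exponent tends to 1/2 as L grows.

    # rho(R_hat) <= (2 / (L * delta)) ** (L / (2 * L - 2)) for a deep linear network of depth L.
    # For ordinary least squares (a single linear layer), the step size only affects whether
    # gradient descent converges, not which least-squares estimate it converges to, so no
    # analogous bound on the estimate appears.
    def rho_bound(L, delta):
        return (2.0 / (L * delta)) ** (L / (2 * L - 2))

    for L in (2, 3, 5, 10):
        for delta in (0.01, 0.1):
            print(f"L = {L:2d}, delta = {delta:5.2f}: rho(R_hat) <= {rho_bound(L, delta):8.2f}")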

SLIDE 15

Deep Linear Networks

Symmetric PSD matrices:

  • The bound is tight with identity initialization
  • Identity initialization allows convergence with the largest step size (scalar check below)
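A scalar sanity check of both bullets (my own sketch: R is taken as a 1×1 PSD matrix, i.e., a positive scalar λ, and every layer is a scalar weight). With identity initialization all layers stay equal along the trajectory, and convergence should occur precisely for λ below the bound (2/(Lδ))^(L/(2L−2)) from the previous theorem, i.e., the bound is attained:

    import numpy as np

    def rho_bound(L, delta):
        return (2.0 / (L * delta)) ** (L / (2 * L - 2))

    def identity_init_gd(lam, L, delta, iters=20_000):
        w = np.ones(L)                        # identity initialization: every scalar weight = 1
        for _ in range(iters):
            prod = np.prod(w)
            grad = (prod - lam) * prod / w    # d/dw_i of 0.5 * (prod - lam)^2, valid while w_i != 0
            w = w - delta * grad
            if not np.all(np.isfinite(w)) or np.abs(w).max() > 1e6:
                return False                  # diverged
        return abs(np.prod(w) - lam) < 1e-6

    L, delta = 3, 0.01
    print(f"bound: {rho_bound(L, delta):.2f}")     # about 23.4 for these values
    for lam in (20.0, 27.0):                       # one value below the bound, one above
        print(f"lambda = {lam}: converges = {identity_init_gd(lam, L, delta)}")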

SLIDE 16

Nonlinear Networks (Poster #8)

Two-layer ReLU network: x → W (V x − b)_+

SLIDE 17

Nonlinear Networks (Poster #8)

Two-layer ReLU network: x → W (V x − b)_+

Theorem

Let f : ℝⁿ → ℝᵐ be estimated by

min_{W,V} ½ Σ_{i=1}^{N} ‖W (V x_i − b)_+ − f(x_i)‖₂²

SLIDE 18

Nonlinear Networks (Poster #8)

Two-layer ReLU network: x → W (V x − b)_+

Theorem

Let f : ℝⁿ → ℝᵐ be estimated by

min_{W,V} ½ Σ_{i=1}^{N} ‖W (V x_i − b)_+ − f(x_i)‖₂²

If the algorithm converges, then the estimate f̂(x_i) satisfies

max_{i∈[N]} ‖x_i‖₂ ‖f̂(x_i)‖₂ ≤ 1/δ almost surely.
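One way to read this bound (an illustrative sketch; the data and the target f(x) = 3x below are made up): if gradient descent with step size δ converges, the fitted values cannot have max_i ‖x_i‖·‖f̂(x_i)‖ above 1/δ, so for a given data set an exact fit is ruled out once the step size is too large.

    import numpy as np

    # Made-up 1-D data and target f(x) = 3x, used only to evaluate the constraint.
    X = np.array([[0.5], [1.0], [2.0]])
    Y = 3.0 * X
    needed = max(np.linalg.norm(x) * np.linalg.norm(y) for x, y in zip(X, Y))

    for delta in (0.05, 0.1, 0.5):
        allowed = 1.0 / delta
        verdict = "exact fit possible" if needed <= allowed else "exact fit ruled out at this step size"
        print(f"delta = {delta}: max ||x_i|| ||f_hat(x_i)|| <= {allowed:.1f}, "
              f"exact fit needs {needed:.1f} -> {verdict}")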
