Introduction to Advanced Parameter Gradient Descent Algorithms


  1. Introduction

     So far:
     • What is a neural network?
     • Basic training algorithm:
       • Gradient descent
       • Backpropagation
     Next: advanced training algorithms.

     Gradient descent algorithm
     1. Choose an initial weight vector w_1 and let d_1 = -g_1.
     2. Let w_{j+1} = w_j + η d_j.
     3. Evaluate g_{j+1}.
     4. Let d_{j+1} = -g_{j+1}.
     5. Let j = j + 1 and go to step 2.

     Gradient descent review
     Gradient descent: w(t+1) = w(t) + Δw(t), where Δw(t) = -η ∇E[w(t)].
     Two main problems:
     • Slow convergence
     • Trial-and-error selection of η
     Goal: cut the number of epochs (training cycles) by orders of magnitude ... how?

     How to improve over gradient descent?
     • Must understand convergence properties
     • Use second-order information...
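     The five-step loop above translates almost line for line into code. Below is a minimal Python/NumPy sketch, run on the quadratic case-study surface E = 20·ω1² + ω2² used throughout the deck; the starting point, stopping tolerance, and step cap are illustrative assumptions, not part of the slides.

```python
import numpy as np

def grad_E(w):
    """Gradient of the case-study surface E(w) = 20*w1^2 + w2^2 used later in the deck."""
    return np.array([40.0 * w[0], 2.0 * w[1]])

def gradient_descent(w1, eta, tol=1e-6, max_steps=10000):
    w = np.asarray(w1, dtype=float)      # 1. choose an initial weight vector w_1 ...
    d = -grad_E(w)                       #    ... and let d_1 = -g_1
    for j in range(1, max_steps + 1):    # 5. let j = j + 1 and go to step 2
        w = w + eta * d                  # 2. w_{j+1} = w_j + eta * d_j
        g = grad_E(w)                    # 3. evaluate g_{j+1}
        d = -g                           # 4. d_{j+1} = -g_{j+1}
        if np.linalg.norm(g) < tol:      # stop once the gradient is (nearly) zero
            return w, j
    return w, max_steps

w_final, steps = gradient_descent([-1.0, 1.5], eta=0.01)
print(w_final, steps)
```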

  2. First case study

     E = 20·ω1² + ω2²   (same as last time)
     [Surface plot of E over the (ω1, ω2) plane]

     Why a quadratic error surface?
     Disadvantages:
     • Too simple / too few parameters
     • NN error surfaces are not quadratic globally
     Advantages:
     • Easy to visualize
     • NN error surfaces will be quadratic locally, near a local minimum
     Will also look at simple nonquadratic error surfaces...

     Taylor series expansion
     Single dimension (from calculus):
     f(x) ≈ f(x0) + f′(x0)(x − x0) + (1/2) f″(x0)(x − x0)²
     Multi-dimensional error surface:
     E(w) ≈ E(w0) + (w − w0)ᵀ b + (1/2)(w − w0)ᵀ H (w − w0)
     about some vector w0, where
     b = ∇E(w0)   and   H = ∇[∇E(w0)]
     (the Hessian — not just a German mercenary)

     Hessian matrix
     Definition: the Hessian matrix of a W-dimensional function E(w), with
     w = [ω1 ω2 … ωW]ᵀ, is the W × W matrix H = ∇[∇E(w)] with entries
     H(i, j) = ∂²E / (∂ωi ∂ωj)
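     The Hessian definition is easy to check numerically. The sketch below estimates H by central finite differences and applies it to the case-study surface; the helper name `hessian` and the step size h are assumptions for illustration, and the result should match the constant Hessian derived on the next slides up to rounding.

```python
import numpy as np

def E(w):
    return 20.0 * w[0]**2 + w[1]**2

def hessian(f, w, h=1e-4):
    """Central-difference estimate of H[i, j] = d2 f / (dw_i dw_j)."""
    w = np.asarray(w, dtype=float)
    W = w.size
    H = np.zeros((W, W))
    I = np.eye(W)
    for i in range(W):
        for j in range(W):
            ei, ej = h * I[i], h * I[j]
            H[i, j] = (f(w + ei + ej) - f(w + ei - ej)
                       - f(w - ei + ej) + f(w - ei - ej)) / (4.0 * h * h)
    return H

print(hessian(E, [0.3, -0.7]))   # ~[[40, 0], [0, 2]] at any point: the surface is quadratic
```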

  3. Some linear algebra

     Definition: for a W × W square matrix H, the eigenvalues λ are the solutions of
     |λI_W − H| = 0.
     Definition: a square matrix H is positive definite if and only if all its
     eigenvalues are greater than zero. If a matrix is positive definite, then
     vᵀHv > 0 for all v ≠ 0.

     Gradient descent convergence rate
     • Quadratic error surface: H > 0
     • Arbitrary error surface: H > 0 near a local minimum
     Near a local minimum: λmin > 0 (why?)
     Convergence is governed by the eigenvalues λ_i — in particular by the ratio λmin / λmax.
     Learning rate bound: 0 < η < 2 / λmax.

     Simple Hessian example
     E = 20·ω1² + ω2²
     First partial derivatives:
     ∂E/∂ω1 = 40ω1        ∂E/∂ω2 = 2ω2
     Second partial derivatives:
     ∂²E/∂ω1² = 40        ∂²E/∂ω2² = 2        ∂²E/(∂ω1∂ω2) = ∂²E/(∂ω2∂ω1) = 0

     Simple Hessian example (continued)
     Hessian:
     H = | 40  0 |
         |  0  2 |
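     A quick numerical illustration of these definitions for the case-study Hessian, using NumPy: the eigenvalues, a positive-definiteness check (both via the eigenvalue test and by spot-checking vᵀHv > 0 on random vectors), and the resulting learning-rate bound 2/λmax. The random-vector check is an illustrative addition, not something the slides prescribe.

```python
import numpy as np

H = np.array([[40.0, 0.0],
              [0.0,  2.0]])            # Hessian of E = 20*w1^2 + w2^2

eigvals = np.linalg.eigvalsh(H)        # solutions of det(lambda*I - H) = 0
print("eigenvalues:", eigvals)         # [ 2. 40.]
print("positive definite:", np.all(eigvals > 0))

# Spot-check v^T H v > 0 for a few random nonzero vectors.
rng = np.random.default_rng(0)
v = rng.normal(size=(5, 2))
print("v^T H v > 0:", np.all(np.einsum('ni,ij,nj->n', v, H, v) > 0))

print("learning rate bound: 0 < eta <", 2.0 / eigvals.max())   # 0.05
```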

  4. Simple Hessian example (continued)

     H = | 40  0 |
         |  0  2 |
     What are the eigenvalues?

     Computation of eigenvalues
     |λI₂ − H| = | λ − 40     0    | = (λ − 40)(λ − 2) = 0
                 |    0     λ − 2  |
     λmin = 2,   λmax = 40

     Learning rate bounds
     0 < η < 2 / λmax   →   0 < η < 2/40 = 0.05   (same as the fixed-point derivation)

     Convergence examples
     η = 0.01: converges in 719 steps
     [Contour plot of the trajectory in the (ω1, ω2) plane]
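     To connect the bound to the convergence plots, the sketch below runs plain gradient descent on the case-study surface for a few learning rates. The start point (-1, 1.5) and the stopping tolerance are guesses read off the plots, so the step counts should land near, but not exactly on, the slides' 719 and 175, while η = 0.05 should fail to converge.

```python
import numpy as np

def steps_to_converge(eta, w0=(-1.0, 1.5), tol=1e-6, max_steps=5000):
    """Plain gradient descent on E = 20*w1^2 + w2^2; steps until the weights are ~0."""
    w = np.array(w0, dtype=float)
    for t in range(1, max_steps + 1):
        w = w - eta * np.array([40.0 * w[0], 2.0 * w[1]])   # w(t+1) = w(t) - eta * grad E[w(t)]
        if np.linalg.norm(w) < tol:
            return t
    return None                                              # did not converge

for eta in (0.01, 0.04, 0.05):
    print(eta, steps_to_converge(eta))   # roughly 700, roughly 170, and None (no convergence)
```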

  5. Convergence examples

     η = 0.04: converges in 175 steps
     η = 0.05: no convergence
     [Contour plots of the trajectories in the (ω1, ω2) plane]

     Basic problem: "long valley with steep sides"
     What characterizes a "long valley with steep sides"?
     • The lengths of the contour axes are proportional to 1/√λ1 and 1/√λ2.
     • A small ratio λmin / λmax.
     So what can we do about this?

     Solution
     • The fixed learning rate is the problem.
     • Answer: different learning rates for each weight (a one-step preview on the
       case study is sketched below).
     Key question: how to achieve this automatically?
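     As a preview of why per-weight learning rates help, the illustrative choice η_i = 1/λ_i below (not something the slides prescribe) scales each weight's step by the inverse of its own curvature, and on this diagonal quadratic that reaches the minimum in a single step.

```python
import numpy as np

w = np.array([-1.0, 1.5])
lam = np.array([40.0, 2.0])      # per-axis curvatures (the Hessian eigenvalues)
grad = lam * w                   # gradient of E = 20*w1^2 + w2^2 at w
eta_i = 1.0 / lam                # hypothetical per-weight learning rates
print(w - eta_i * grad)          # [0. 0.]: each axis is solved in one step
```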

  6. Heuristic extension: momentum

     Gradient descent algorithm (as before):
     1. Choose an initial weight vector w_1 and let d_1 = -g_1.
     2. Let w_{j+1} = w_j + η d_j.
     3. Evaluate g_{j+1}.
     4. Let d_{j+1} = -g_{j+1}.
     5. Let j = j + 1 and go to step 2.

     Gradient descent with momentum:
     w(t+1) = w(t) + Δw(t)
     Δw(0) = -η ∇E[w(0)]
     Δw(t) = -η ∇E[w(t)] + µ Δw(t−1),   t > 0,   0 ≤ µ < 1
     Notes:
     • Δw(t) depends on both w(t) and w(t−1)
     • Ideally, high effective learning rate in shallow dimensions
     • Little effect along steep dimensions

     Analyzing momentum term
     Shallow regions: assume ∇E[w(t)] ≈ ∇E[w(0)] = g for t ∈ {1, 2, …}.
     Then:
     Δw(0) = -η g
     Δw(1) ≈ -η g + µ Δw(0) = -η g (1 + µ)
     Δw(2) ≈ -η g + µ Δw(1) = -η g (1 + µ + µ²)

     Analyzing momentum term (continued)
     In general,
     Δw(t) ≈ -η g Σ_{s=0}^{t} µ^s = -η g (1 − µ^{t+1}) / (1 − µ)
     In the limit:
     lim_{t→∞} Δw(t) ≈ -η g / (1 − µ)
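     A minimal sketch of the momentum update on the same case-study surface; as before, the start point and stopping tolerance are assumptions, and the returned value is just the number of updates until the weights are essentially zero.

```python
import numpy as np

def momentum_descent(eta, mu, w0=(-1.0, 1.5), tol=1e-6, max_steps=5000):
    """Gradient descent with momentum on the case-study surface E = 20*w1^2 + w2^2."""
    w = np.array(w0, dtype=float)
    dw = np.zeros_like(w)                        # so that delta-w(0) = -eta * grad E[w(0)]
    for t in range(1, max_steps + 1):
        g = np.array([40.0 * w[0], 2.0 * w[1]])  # grad E[w(t)]
        dw = -eta * g + mu * dw                  # delta-w(t) = -eta*grad + mu*delta-w(t-1)
        w = w + dw                               # w(t+1) = w(t) + delta-w(t)
        if np.linalg.norm(w) < tol:
            return t
    return None

print(momentum_descent(eta=0.01, mu=0.9))        # far fewer steps than plain eta = 0.01
```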

  7. Analyzing momentum term (continued)

     Steep regions: oscillations,
     ∇E[w(t+1)] ≈ -∇E[w(t)]
     Net effect of momentum (ideally): little.

     Effective learning rate (shallow regions): η / (1 − µ).

     Momentum
     Advantage:
     • Increases the effective learning rate in shallow regions
     Disadvantages:
     • Yet another parameter (µ) to hand-tune
     • If µ is not carefully chosen, momentum can do more harm than good

     Convergence examples
     η = 0.01, µ = 0.0: converges in 719 steps
     [Contour plot of the trajectory in the (ω1, ω2) plane]
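     The shallow-region analysis is easy to verify numerically: iterating Δ = -ηg + µΔ with a frozen gradient g approaches -ηg/(1 − µ), i.e. an effective learning rate of η/(1 − µ). The particular values of η, µ, and g below are arbitrary.

```python
eta, mu, g = 0.01, 0.9, 3.0          # arbitrary illustrative values (g = "frozen" gradient)
dw = 0.0
for t in range(200):
    dw = -eta * g + mu * dw          # delta-w(t) = -eta*g + mu*delta-w(t-1)
print(dw, -eta * g / (1.0 - mu))     # both ~ -0.3: effective learning rate eta/(1-mu) = 10*eta
```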

  8. Convergence examples

     η = 0.01, µ = 0.5: converges in 341 steps
     η = 0.01, µ = 0.9: converges in 266 steps
     η = 0.04, µ = 0.0: converges in 175 steps
     η = 0.04, µ = 0.5: converges in 60 steps
     [Contour plots of the trajectories in the (ω1, ω2) plane]
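     The grid of (η, µ) settings from these and the next slide can be swept in a few lines; with the assumed start point and tolerance the counts come out close to, but not exactly equal to, the slide numbers.

```python
import numpy as np

def run(eta, mu, w0=(-1.0, 1.5), tol=1e-6, max_steps=5000):
    """Gradient descent with momentum on E = 20*w1^2 + w2^2; steps until the weights are ~0."""
    w, dw = np.array(w0, dtype=float), np.zeros(2)
    for t in range(1, max_steps + 1):
        dw = -eta * np.array([40.0 * w[0], 2.0 * w[1]]) + mu * dw
        w = w + dw
        if np.linalg.norm(w) < tol:
            return t
    return None

for eta in (0.01, 0.04):
    print(eta, [run(eta, mu) for mu in (0.0, 0.5, 0.9)])   # compare with the summary table
```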

  9. Convergence examples

     η = 0.04, µ = 0.9: converges in 272 steps
     [Contour plot of the trajectory in the (ω1, ω2) plane]

     Convergence examples: summary (steps to converge)

                 µ = 0.0   µ = 0.5   µ = 0.9
     η = 0.01       719       341       266
     η = 0.04       175        60       272

     Heuristic extensions to gradient descent
     Momentum is popular in the neural network community.
     Many other heuristic attempts (some examples):
     • Individual learning rates:
       Δη_i(t) = γ g_i(t) g_i(t−1)    (problems?)
     • Adaptive learning rate (what should ρ and σ be?):
       η_new = ρ η_old   if ΔE < 0
       η_new = σ η_old   if ΔE > 0
     • Quickprop — local, independent quadratic assumption (see the sketch below):
       Δω_i(t+1) = [ g_i(t) / (g_i(t−1) − g_i(t)) ] · Δω_i(t)    (problems?)
     • η_max = 2 / λ_max    (what's the problem here?)
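     Of the heuristics listed, Quickprop is the easiest to see in action here, because its local, independent quadratic assumption holds exactly on the diagonal case-study surface: after one ordinary bootstrap gradient step (its size is an arbitrary assumption), a single Quickprop update per weight lands essentially on the minimum.

```python
import numpy as np

def grad(w):                                   # gradient of E = 20*w1^2 + w2^2
    return np.array([40.0 * w[0], 2.0 * w[1]])

w_prev = np.array([-1.0, 1.5])                 # starting weights
dw = -0.01 * grad(w_prev)                      # bootstrap: one plain gradient step (eta = 0.01 assumed)
w = w_prev + dw

g_prev, g = grad(w_prev), grad(w)
dw = g / (g_prev - g) * dw                     # Quickprop: dw_i(t+1) = g_i(t)/(g_i(t-1)-g_i(t)) * dw_i(t)
w = w + dw
print(w)                                       # ~[0, 0] up to rounding: the quadratic assumption is exact here
```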

  10. Heuristic extensions to gradient descent

      Problems:
      • Additional hand-tuned parameters
      • Independence-of-weights assumptions
      A more principled approach is desirable.

      Steepest descent
      Gradient descent: w(t+1) = w(t) + Δw(t), where Δw(t) = -η ∇E[w(t)].
      Question: why take all those little tiny steps?

      Steepest descent: gradient descent with line minimization
      1. Define the search direction d(t):  d(t) = -∇E[w(t)].
      2. Minimize E(η) ≡ E[w(t) + η d(t)], i.e. find η* such that E(η*) ≤ E(η) for all η.
      3. New update: w(t+1) = w(t) + η* d(t).    (problems?)

      Line search
      Question: do we need to compute ∂E/∂η?
      Answer: no. Use a one-dimensional line search, which requires only evaluation of E(η).
      Line search: two steps
      1. Bracket the minimum.
      2. Line minimization.
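      A sketch of steepest descent with a derivative-free line search, using SciPy's scalar minimizer for the bracketing and 1-D minimization steps (the slides do not prescribe a particular routine; the bracket, tolerance, and step cap here are assumptions). The point is that no learning rate η has to be hand-tuned — each iteration finds its own η*.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def E(w):    return 20.0 * w[0]**2 + w[1]**2
def grad(w): return np.array([40.0 * w[0], 2.0 * w[1]])

def steepest_descent(w0=(-1.0, 1.5), tol=1e-6, max_steps=1000):
    w = np.array(w0, dtype=float)
    for t in range(1, max_steps + 1):
        d = -grad(w)                                    # 1. search direction d(t) = -grad E[w(t)]
        if np.linalg.norm(d) < tol:                     # stop once the gradient is (nearly) zero
            return w, t
        # 2. 1-D line search over eta: only evaluations of E(eta) are needed
        eta_star = minimize_scalar(lambda eta: E(w + eta * d), bracket=(0.0, 1.0)).x
        w = w + eta_star * d                            # 3. w(t+1) = w(t) + eta* d(t)
    return w, max_steps

w_final, steps = steepest_descent()
print(w_final, steps)                                   # no hand-tuned eta anywhere
```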
