

  1. Improving L-BFGS Initialization for Trust-Region Methods in Deep Learning
     Jacob Rafati, http://rafati.net, jrafatiheravi@ucmerced.edu
     Ph.D. Candidate, Electrical Engineering and Computer Science, University of California, Merced

  2. Agenda
     • Introduction, Problem Statement and Motivations
     • Overview of Quasi-Newton Optimization Methods
     • L-BFGS Trust-Region Optimization Method
     • Proposed Methods for Initialization of L-BFGS
     • Application in Deep Learning (Image Classification Task)

  3. Introduction, Problem Statement and Motivations

  4. Unconstrained Optimization Problem
     \min_{w \in \mathbb{R}^n} L(w) \triangleq \frac{1}{N} \sum_{i=1}^{N} \ell_i(w), \qquad L : \mathbb{R}^n \to \mathbb{R}
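As a concrete illustration of this finite-sum (empirical risk) structure, here is a minimal NumPy sketch; the least-squares per-sample loss and the synthetic data are assumptions for illustration, not part of the talk.

```python
import numpy as np

def empirical_risk(w, X, y):
    """L(w) = (1/N) * sum_i ell_i(w), with ell_i a per-sample least-squares loss (assumed)."""
    residuals = X @ w - y              # one residual per sample
    return np.mean(0.5 * residuals ** 2)

# Synthetic problem: N samples, n parameters.
rng = np.random.default_rng(0)
N, n = 1000, 20
X, y = rng.standard_normal((N, n)), rng.standard_normal(N)
print(empirical_risk(np.zeros(n), X, y))
```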

  5. Optimization Algorithms
     Bottou et al. (2016). Optimization Methods for Large-Scale Machine Learning. Preprint, arXiv:1606.04838.

  6. Optimization Algorithms
     1. Start from a random point w_0.
     2. Repeat for each iteration k = 0, 1, 2, ...
     3. Choose a search direction p_k.
     4. Choose a step size \alpha_k.
     5. Update parameters: w_{k+1} \leftarrow w_k + \alpha_k p_k.
     6. Until \|\nabla L_k\| < \epsilon.
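A minimal sketch of this generic loop, assuming steepest-descent choices for the direction and a fixed step size (the toy quadratic objective and all parameter values are illustrative assumptions):

```python
import numpy as np

def optimize(grad, w0, step_size=0.1, eps=1e-6, max_iter=1000):
    """Generic iterative loop: choose a direction, choose a step, update, stop on a small gradient."""
    w = w0.copy()
    for k in range(max_iter):
        g = grad(w)
        if np.linalg.norm(g) < eps:      # stopping test ||grad L(w_k)|| < eps
            break
        p = -g                           # search direction p_k (steepest descent here)
        w = w + step_size * p            # w_{k+1} = w_k + alpha_k p_k
    return w

# Toy quadratic: L(w) = 0.5 * ||w||^2, so grad L(w) = w.
w_star = optimize(lambda w: w, np.ones(5))
```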

  7. Properties of the Objective Function
     \min_{w \in \mathbb{R}^n} L(w) \triangleq \frac{1}{N} \sum_{i=1}^{N} \ell_i(w)
     • n and N are both large in modern applications.
     • L(w) is a non-convex and nonlinear function.
     • \nabla^2 L(w) is ill-conditioned.
     • Computing the full gradient \nabla L is expensive.
     • Computing the Hessian \nabla^2 L is not practical.

  8. Stochastic Gradient Descent
     1. Sample indices S_k \subset \{1, 2, \ldots, N\}.
     2. Compute the stochastic (subsampled) gradient:
        \nabla L(w_k) \approx \nabla L^{(S_k)}(w_k) \triangleq \frac{1}{|S_k|} \sum_{i \in S_k} \nabla \ell_i(w_k)
     3. Assign a learning rate \alpha_k and set p_k = -\nabla L^{(S_k)}(w_k).
     4. Update parameters: w_{k+1} \leftarrow w_k - \alpha_k \nabla L^{(S_k)}(w_k).
     H. Robbins and D. Siegmund (1971). "A convergence theorem for non-negative almost supermartingales and some applications." Optimizing Methods in Statistics.
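A minimal sketch of this subsampled-gradient update, assuming a least-squares loss and a constant learning rate (the function name, batch size, and learning rate are illustrative assumptions):

```python
import numpy as np

def sgd(X, y, w0, lr=0.01, batch_size=32, num_iters=500, seed=0):
    """SGD with a subsampled gradient over a random index set S_k."""
    rng = np.random.default_rng(seed)
    w, N = w0.copy(), X.shape[0]
    for k in range(num_iters):
        S = rng.choice(N, size=batch_size, replace=False)    # sample indices S_k
        residuals = X[S] @ w - y[S]
        g = X[S].T @ residuals / batch_size                   # (1/|S_k|) * sum of grad ell_i(w_k)
        w = w - lr * g                                        # w_{k+1} = w_k - alpha_k * g
    return w
```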

  9. Advantages of SGD
     • SGD algorithms are very easy to implement.
     • SGD requires only computing the gradient.
     • SGD has a low cost per iteration.
     Bottou et al. (2016). Optimization Methods for Large-Scale Machine Learning. Preprint, arXiv:1606.04838.

  10. Disadvantages of SGD
     • Very sensitive to ill-conditioning and to the scaling of the problem.
     • Requires fine-tuning of many hyper-parameters.
     • Unlikely to exhibit acceptable performance on the first try; typically requires many trials and errors.
     • Can get stuck at a saddle point instead of a local minimum.
     • Sublinear and slow rate of convergence.
     Bottou et al. (2016). Optimization Methods for Large-Scale Machine Learning. Preprint, arXiv:1606.04838.
     J. Nocedal and S. J. Wright (2006). Numerical Optimization, 2nd ed. New York: Springer.

  11. Second-Order Methods
     1. Sample indices S_k \subset \{1, 2, \ldots, N\}.
     2. Compute the stochastic (subsampled) gradient:
        \nabla L(w_k) \approx \nabla L^{(S_k)}(w_k) \triangleq \frac{1}{|S_k|} \sum_{i \in S_k} \nabla \ell_i(w_k)
     3. Compute the (subsampled) Hessian:
        \nabla^2 L(w_k) \approx \nabla^2 L^{(S_k)}(w_k) \triangleq \frac{1}{|S_k|} \sum_{i \in S_k} \nabla^2 \ell_i(w_k)

  12. Second-Order Methods
     4. Compute Newton's direction: p_k = -\nabla^2 L(w_k)^{-1} \nabla L(w_k).
     5. Find a proper step length: \alpha_k = \arg\min_{\alpha} L(w_k + \alpha p_k).
     6. Update parameters: w_{k+1} \leftarrow w_k + \alpha_k p_k.
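A minimal sketch of one subsampled Newton iteration for a least-squares loss, where the gradient and Hessian are formed on the same mini-batch; the unit step, the small damping term, and all names are illustrative assumptions (the line search over alpha is omitted):

```python
import numpy as np

def subsampled_newton_step(X, y, w, batch_size=256, seed=0):
    """One iteration: subsampled gradient and Hessian, then Newton's direction p = -H^{-1} g."""
    rng = np.random.default_rng(seed)
    S = rng.choice(X.shape[0], size=batch_size, replace=False)
    XS, yS = X[S], y[S]
    g = XS.T @ (XS @ w - yS) / batch_size            # subsampled gradient
    H = XS.T @ XS / batch_size                       # subsampled Hessian (least-squares case)
    H = H + 1e-8 * np.eye(w.size)                    # tiny damping as a safeguard against singularity
    p = np.linalg.solve(H, -g)                       # Newton direction
    alpha = 1.0                                      # unit step; a line search would tune this
    return w + alpha * p
```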

  13. Second-Order Methods: Advantages
     • The rate of convergence is superlinear (quadratic for Newton's method).
     • They are resilient to problem ill-conditioning.
     • They involve less parameter tuning.
     • They are less sensitive to the choice of hyper-parameters.
     J. Nocedal and S. J. Wright (2006). Numerical Optimization, 2nd ed. New York: Springer.

  14. Second-Order Methods: Disadvantages
     • Computing the Hessian matrix is very expensive and requires massive storage.
     • Computing the inverse of the Hessian is not practical.
     J. Nocedal and S. J. Wright (2006). Numerical Optimization, 2nd ed. New York: Springer.
     Bottou et al. (2016). Optimization Methods for Large-Scale Machine Learning. Preprint, arXiv:1606.04838.

  15. Quasi-Newton Methods
     1. Construct a low-rank approximation of the Hessian: B_k \approx \nabla^2 L(w_k).
     2. Find the search direction by minimizing the quadratic model of the objective function:
        p_k = \arg\min_{p \in \mathbb{R}^n} Q_k(p) \triangleq g_k^T p + \frac{1}{2} p^T B_k p

  16. Quasi-Newton Matrices
     • Symmetric.
     • Easy and fast to compute.
     • Satisfy the secant condition: B_{k+1} s_k = y_k, where
       s_k \triangleq w_{k+1} - w_k, \qquad y_k \triangleq \nabla L(w_{k+1}) - \nabla L(w_k)

  17. Broyden-Fletcher-Goldfarb-Shanno (BFGS)
     J. Nocedal and S. J. Wright (2006). Numerical Optimization, 2nd ed. New York: Springer.

  18. Broyden-Fletcher-Goldfarb-Shanno (BFGS)
     B_{k+1} = B_k - \frac{B_k s_k s_k^T B_k}{s_k^T B_k s_k} + \frac{y_k y_k^T}{y_k^T s_k}
     s_k \triangleq w_{k+1} - w_k, \qquad y_k \triangleq \nabla L(w_{k+1}) - \nabla L(w_k), \qquad B_0 = \gamma_k I
     J. Nocedal and S. J. Wright (2006). Numerical Optimization, 2nd ed. New York: Springer.
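A minimal NumPy sketch of this rank-two update; the curvature check that skips the update when s^T y is not sufficiently positive is a standard safeguard added here as an assumption, not something stated on the slide:

```python
import numpy as np

def bfgs_update(B, s, y, eps=1e-10):
    """B_{k+1} = B_k - (B s s^T B)/(s^T B s) + (y y^T)/(y^T s)."""
    ys = y @ s
    if ys <= eps:                       # curvature condition fails: keep B unchanged
        return B
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / ys
```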

  19. Quasi-Newton Methods: Advantages
     • The rate of convergence is superlinear.
     • They are resilient to problem ill-conditioning.
     • The second derivative is not required.
     • They use only gradient information to construct the quasi-Newton matrices.
     J. Nocedal and S. J. Wright (2006). Numerical Optimization, 2nd ed. New York: Springer.

  20. Quasi-Newton Methods: Disadvantages
     • The cost of storing the gradient information can be expensive.
     • The quasi-Newton matrix can be dense.
     • The quasi-Newton matrix grows in size and rank in large-scale problems.
     J. Nocedal and S. J. Wright (2006). Numerical Optimization, 2nd ed. New York: Springer.

  21. Limited-Memory BFGS (L-BFGS)
     Limited-memory storage:
       S_k = [\, s_{k-m} \; \ldots \; s_{k-1} \,], \qquad Y_k = [\, y_{k-m} \; \ldots \; y_{k-1} \,]
     L-BFGS compact representation:
       B_k = B_0 + \Psi_k M_k \Psi_k^T, \qquad B_0 = \gamma_k I
     where
       \Psi_k = [\, B_0 S_k \;\; Y_k \,], \qquad
       M_k = \begin{bmatrix} -S_k^T B_0 S_k & -L_k \\ -L_k^T & D_k \end{bmatrix}^{-1}, \qquad
       S_k^T Y_k = L_k + D_k + U_k
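A minimal sketch of assembling this compact representation from the stored pairs, taking L_k and D_k as the strictly lower-triangular and diagonal parts of S_k^T Y_k. Forming B_k densely is purely for illustration; in practice the representation is only applied implicitly. The function name is an assumption.

```python
import numpy as np

def compact_lbfgs_matrix(S, Y, gamma):
    """Dense B_k = gamma*I + Psi M Psi^T from stored pairs S = [s_{k-m},...,s_{k-1}],
    Y = [y_{k-m},...,y_{k-1}] (as columns), with B_0 = gamma*I."""
    n, m = S.shape
    SY = S.T @ Y                                   # S_k^T Y_k = L + D + U
    L = np.tril(SY, k=-1)                          # strictly lower-triangular part
    D = np.diag(np.diag(SY))                       # diagonal part
    Psi = np.hstack([gamma * S, Y])                # Psi_k = [B_0 S_k, Y_k]
    M_inv = np.block([[-gamma * (S.T @ S), -L],
                      [-L.T,                D]])   # M_k is the inverse of this block matrix
    return gamma * np.eye(n) + Psi @ np.linalg.inv(M_inv) @ Psi.T
```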

  22. Limited-Memory Quasi-Newton Methods
     • Low-rank approximation.
     • Small memory of recent gradients.
     • Low cost of computing the search direction.
     • A linear or superlinear convergence rate can be achieved.
     J. Nocedal and S. J. Wright (2006). Numerical Optimization, 2nd ed. New York: Springer.

  23. Objectives
     B_k = B_0 + \Psi_k M_k \Psi_k^T, \qquad B_0 = \gamma_k I
     What is the best choice of \gamma_k for the initialization B_0?

  24. Overview of Quasi-Newton Optimization Strategies

  25. Line-Search Method
     [Figure: the current iterate w_k with the search direction p_k and the steepest-descent direction -\nabla L(w_k).]
     Quadratic model:
       p_k = \arg\min_{p \in \mathbb{R}^n} Q_k(p) \triangleq g_k^T p + \frac{1}{2} p^T B_k p
     If B_k is positive definite: p_k = -B_k^{-1} g_k.
     J. Nocedal and S. J. Wright (2006). Numerical Optimization, 2nd ed. New York: Springer.

  26. Line-Search Method
     [Figure: the step from w_k to w_{k+1} = w_k + \alpha_k p_k along the search direction p_k.]
     Wolfe conditions:
       L(w_k + \alpha_k p_k) \le L(w_k) + c_1 \alpha_k \nabla L(w_k)^T p_k
       \nabla L(w_k + \alpha_k p_k)^T p_k \ge c_2 \nabla L(w_k)^T p_k
     J. Nocedal and S. J. Wright (2006). Numerical Optimization, 2nd ed. New York: Springer.
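A minimal sketch of checking these two Wolfe conditions for a candidate step; the default constants c1 and c2 are common choices assumed here, not values given on the slide:

```python
import numpy as np

def wolfe_conditions_hold(L, grad, w, p, alpha, c1=1e-4, c2=0.9):
    """Sufficient-decrease (Armijo) and curvature conditions for step alpha along direction p."""
    g = grad(w)
    w_new = w + alpha * p
    sufficient_decrease = L(w_new) <= L(w) + c1 * alpha * (g @ p)
    curvature = grad(w_new) @ p >= c2 * (g @ p)
    return sufficient_decrease and curvature
```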

  27. Trust-Region Method
     [Figure: the trust region of radius \delta_k around the current iterate w_k and the step p_k.]
     p_k = \arg\min_{p \in \mathbb{R}^n} Q_k(p) \quad \text{s.t.} \quad \|p\|_2 \le \delta_k
     Conn, Gould, and Toint (2000). Trust-Region Methods. SIAM.
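A minimal sketch of the outer trust-region loop around this subproblem: compare the actual to the predicted reduction and grow or shrink the radius accordingly. The thresholds and update factors are conventional defaults assumed here, and `solve_subproblem` is a hypothetical placeholder for a subproblem solver such as the L-BFGS one described later.

```python
import numpy as np

def trust_region_step(L, g, B, w, delta, solve_subproblem, eta=0.1):
    """One trust-region iteration: solve min Q(p) s.t. ||p|| <= delta, then accept/adjust."""
    p = solve_subproblem(g, B, delta)                     # approximate subproblem solution
    predicted = -(g @ p + 0.5 * p @ B @ p)                # Q(0) - Q(p)
    actual = L(w) - L(w + p)
    rho = actual / predicted if predicted > 0 else 0.0
    if rho < 0.25:
        delta *= 0.25                                     # poor model fit: shrink the region
    elif rho > 0.75 and np.isclose(np.linalg.norm(p), delta):
        delta *= 2.0                                      # good fit at the boundary: expand
    w_new = w + p if rho > eta else w                     # accept the step only on enough decrease
    return w_new, delta
```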

  28. Trust-Region Method
     [Figure: contour plot of a nonconvex objective showing the global minimum, a local minimum, and the Newton step.]
     J. J. Moré and D. C. Sorensen (1984). "Newton's method," in Studies in Numerical Analysis, Studies in Mathematics, Vol. 24, Mathematical Association of America, pp. 29-82.

  29. L-BFGS Trust Region Optimization Method

  30. L-BFGS in Trust Region
     B_k = B_0 + \Psi_k M_k \Psi_k^T
     Eigendecomposition:
       B_k = P \begin{bmatrix} \Lambda + \gamma_k I & 0 \\ 0 & \gamma_k I \end{bmatrix} P^T
     Sherman-Morrison-Woodbury formula:
       p^* = -\frac{1}{\tau^*} \left[ I - \Psi_k \left( \tau^* M_k^{-1} + \Psi_k^T \Psi_k \right)^{-1} \Psi_k^T \right] g_k, \qquad \tau^* = \gamma_k + \sigma^*
     L. Adhikari et al. (2017). "Limited-memory trust-region methods for sparse relaxation." Proc. SPIE, vol. 10394.
     Brust et al. (2017). "On solving L-SR1 trust-region subproblems." Computational Optimization and Applications, vol. 66, pp. 245-266.
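A minimal sketch of evaluating this closed-form step once sigma* (and hence tau* = gamma_k + sigma*) is known; finding sigma* itself (e.g., by solving the secular equation) is omitted, and the function name is an assumption:

```python
import numpy as np

def smw_step(g, Psi, M_inv, gamma, sigma_star):
    """p* = -(1/tau) [ I - Psi (tau*M^{-1} + Psi^T Psi)^{-1} Psi^T ] g, with tau = gamma + sigma*.
    M_inv is the inverse of the compact-representation middle matrix M_k."""
    tau = gamma + sigma_star
    inner = tau * M_inv + Psi.T @ Psi                   # small (2m x 2m) system
    correction = Psi @ np.linalg.solve(inner, Psi.T @ g)
    return -(g - correction) / tau
```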

  31. L-BFGS in Trust Region vs. Line Search
     26th European Signal Processing Conference, Rome, Italy, September 2018.

  32. Proposed Methods for Initialization of L-BFGS

  33. Initialization Method I
     B_k = B_0 + \Psi_k M_k \Psi_k^T, \qquad B_0 = \gamma_k I
     Spectral estimate of the Hessian:
       \gamma_k = \frac{y_{k-1}^T y_{k-1}}{s_{k-1}^T y_{k-1}} = \arg\min_{\gamma} \| B_0^{-1} y_{k-1} - s_{k-1} \|_2^2, \qquad B_0 = \gamma I
     J. Nocedal and S. J. Wright (2006). Numerical Optimization, 2nd ed. New York: Springer.
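A minimal sketch of this spectral initial scaling; the fallback value used when the curvature pair has non-positive s^T y is a safeguard assumed here, not part of the slide:

```python
import numpy as np

def spectral_gamma(s_prev, y_prev, fallback=1.0, eps=1e-12):
    """gamma_k = (y^T y) / (s^T y), the minimizer of ||(1/gamma) y - s||_2^2 over gamma."""
    sy = s_prev @ y_prev
    if sy <= eps:                  # negative or tiny curvature: keep a safe default scaling
        return fallback
    return (y_prev @ y_prev) / sy
```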

  34. Initialization Method II
     Consider a quadratic function L(w) = \frac{1}{2} w^T H w + g^T w, so that \nabla^2 L(w) = H.
     We have H S_k = Y_k, and therefore S_k^T H S_k = S_k^T Y_k.
     Erway et al. (2018). "Trust-Region Algorithms for Training Responses: Machine Learning Methods Using Indefinite Hessian Approximations." arXiv e-prints.

  35. Initialization Method II
     Since
       B_k = B_0 + \Psi_k M_k \Psi_k^T, \qquad B_0 = \gamma_k I, \qquad
       \Psi_k = [\, B_0 S_k \;\; Y_k \,], \qquad
       M_k = \begin{bmatrix} -S_k^T B_0 S_k & -L_k \\ -L_k^T & D_k \end{bmatrix}^{-1}
     and the secant condition B_k S_k = Y_k holds, we have
       S_k^T H S_k - \gamma_k S_k^T S_k = S_k^T \Psi_k M_k \Psi_k^T S_k

  36. Initialization Method II
     S_k^T H S_k - \gamma_k S_k^T S_k = S_k^T \Psi_k M_k \Psi_k^T S_k
     General eigenvalue problem:
       (L_k + D_k + L_k^T)\, z = \lambda\, S_k^T S_k\, z
     Upper bound on the initial value to avoid false curvature information:
       \gamma_k \in (0, \lambda_{\min})
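A minimal sketch of computing lambda_min for this generalized eigenvalue problem with SciPy and then picking gamma_k strictly inside (0, lambda_min); the shrink factor and the fallback for the case lambda_min <= 0 are assumptions, and S is assumed to have full column rank so that S^T S is positive definite:

```python
import numpy as np
from scipy.linalg import eigh

def gamma_upper_bounded(S, Y, shrink=0.9, fallback=1.0):
    """Choose gamma_k in (0, lambda_min) of (L + D + L^T) z = lambda (S^T S) z."""
    SY = S.T @ Y
    A = np.tril(SY) + np.tril(SY, k=-1).T        # L + D + L^T (symmetrized part of S^T Y)
    B = S.T @ S                                  # assumed positive definite
    lam_min = eigh(A, B, eigvals_only=True).min()  # generalized symmetric eigenproblem
    if lam_min <= 0:
        return fallback                          # no admissible interval: keep a default scaling
    return shrink * lam_min
```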

  37. Initialization Method III
     B_k = B_0 + \Psi_k M_k \Psi_k^T, \qquad B_0 = \gamma_k I
     S_k^T H S_k - \gamma_k S_k^T S_k = S_k^T \Psi_k M_k \Psi_k^T S_k
     Note that the compact-representation matrices contain \gamma_k:
       \Psi_k = [\, B_0 S_k \;\; Y_k \,], \qquad
       M_k = \begin{bmatrix} -S_k^T B_0 S_k & -L_k \\ -L_k^T & D_k \end{bmatrix}^{-1}
