Improving L-BFGS Initialization for Trust-Region Methods in Deep Learning

SLIDE 1

Improving L-BFGS Initialization for Trust-Region Methods in Deep Learning

Jacob Rafati
Ph.D. Candidate, Electrical Engineering and Computer Science, University of California, Merced
http://rafati.net | jrafatiheravi@ucmerced.edu

SLIDE 2

Agenda

  • Introduction, Problem Statement and Motivations
  • Overview of Quasi-Newton Optimization Methods
  • L-BFGS Trust Region Optimization Method
  • Proposed Methods for Initialization of L-BFGS
  • Application in Deep Learning (Image Classification Task)
SLIDE 3

Introduction, Problem Statement and Motivations

SLIDE 4

Unconstrained Optimization Problem

$$\min_{w \in \mathbb{R}^n} \mathcal{L}(w) \triangleq \frac{1}{N} \sum_{i=1}^{N} \ell_i(w), \qquad \mathcal{L} : \mathbb{R}^n \to \mathbb{R}$$

SLIDE 5

Optimization Algorithms

Bottou et al. (2016). Optimization Methods for Large-Scale Machine Learning. arXiv:1606.04838.

SLIDE 6

Optimization Algorithms

  • 1. Start from a random point $w_0$.
  • 2. Repeat for each iteration $k = 0, 1, 2, \dots$:
  • 3. Choose a search direction $p_k$.
  • 4. Choose a step size $\alpha_k$.
  • 5. Update parameters: $w_{k+1} \leftarrow w_k + \alpha_k p_k$.
  • 6. Until $\|\nabla \mathcal{L}\| < \epsilon$ (see the sketch after this list).
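As a minimal sketch, this template might look like the following in Python; `search_direction` and `step_size` are hypothetical callables standing in for the strategies discussed in the rest of the talk (steepest descent, Newton, quasi-Newton; fixed rates, line search, trust region).

```python
import numpy as np

def minimize(grad, w0, search_direction, step_size, eps=1e-6, max_iter=1000):
    """Generic iterative minimizer: w_{k+1} <- w_k + alpha_k * p_k."""
    w = w0
    for _ in range(max_iter):
        g = grad(w)
        if np.linalg.norm(g) < eps:    # stop when ||grad L|| < eps
            break
        p = search_direction(w, g)     # e.g., p = -g for steepest descent
        alpha = step_size(w, p)        # e.g., a fixed learning rate
        w = w + alpha * p
    return w

# Example: minimize 0.5*||w||^2 with steepest descent and a fixed step.
w_star = minimize(lambda w: w, np.array([3.0, -2.0]),
                  search_direction=lambda w, g: -g,
                  step_size=lambda w, p: 0.5)
```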

SLIDE 7

Properties of Objective Function

  • $n$ and $N$ are both large in modern applications.
  • $\mathcal{L}(w)$ is a non-convex and nonlinear function.
  • $\nabla^2 \mathcal{L}(w)$ is ill-conditioned.
  • Computing the full gradient, $\nabla \mathcal{L}$, is expensive.
  • Computing the Hessian, $\nabla^2 \mathcal{L}$, is not practical.

$$\min_{w \in \mathbb{R}^n} \mathcal{L}(w) \triangleq \frac{1}{N} \sum_{i=1}^{N} \ell_i(w)$$

SLIDE 8

Stochastic Gradient Descent

  • 1. Sample indices $S_k \subset \{1, 2, \dots, N\}$.
  • 2. Compute the stochastic (subsampled) gradient $\nabla \mathcal{L}^{(S_k)}(w_k) \triangleq \frac{1}{|S_k|} \sum_{i \in S_k} \nabla \ell_i(w_k) \approx \nabla \mathcal{L}(w_k)$.
  • 3. Assign a learning rate $\alpha_k$.
  • 4. Update parameters using $p_k = -\nabla \mathcal{L}^{(S_k)}(w_k)$: $w_{k+1} \leftarrow w_k - \alpha_k \nabla \mathcal{L}^{(S_k)}(w_k)$ (see the sketch after this list).
  • H. Robbins and D. Siegmund (1971). "A convergence theorem for non-negative almost supermartingales and some applications." Optimizing Methods in Statistics.
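A minimal NumPy sketch of these four steps, assuming a per-example gradient function `grad_i(i, w)` (a hypothetical helper, not from the slides):

```python
import numpy as np

def sgd_step(w, grad_i, N, batch_size, lr, rng):
    """One SGD iteration: sample a mini-batch, average its gradients, step."""
    S_k = rng.choice(N, size=batch_size, replace=False)  # 1. sample indices
    g = np.mean([grad_i(i, w) for i in S_k], axis=0)     # 2. subsampled gradient
    return w - lr * g                                    # 3.-4. update with rate lr
```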

SLIDE 9

Advantages of SGD

  • SGD algorithms are very easy to implement.
  • SGD requires only computing the gradient.
  • SGD has a low cost-per-iteration.

Bottou et al. (2016). Optimization Methods for Large-Scale Machine Learning. arXiv:1606.04838.

SLIDE 10

Disadvantages of SGD

  • Very sensitive to the ill-conditioning problem and to scaling.
  • Requires fine-tuning many hyper-parameters.
  • Unlikely to exhibit acceptable performance on the first try; requires many trials and errors.
  • Can get stuck at a saddle point instead of a local minimum.
  • Sublinear and slow rate of convergence.

Bottou et al. (2016). Optimization Methods for Large-Scale Machine Learning. arXiv:1606.04838.

  • J. Nocedal and S. J. Wright. (2006). Numerical Optimization. 2nd ed. New York. Springer.
SLIDE 11

Second-Order Methods

  • 1. Sample indices $S_k \subset \{1, 2, \dots, N\}$.
  • 2. Compute the stochastic (subsampled) gradient $\nabla \mathcal{L}^{(S_k)}(w_k) \triangleq \frac{1}{|S_k|} \sum_{i \in S_k} \nabla \ell_i(w_k) \approx \nabla \mathcal{L}(w_k)$.
  • 3. Compute the subsampled Hessian $\nabla^2 \mathcal{L}^{(S_k)}(w_k) \triangleq \frac{1}{|S_k|} \sum_{i \in S_k} \nabla^2 \ell_i(w_k) \approx \nabla^2 \mathcal{L}(w_k)$.

SLIDE 12

Second-Order Methods

  • 4. Compute Newton’s direction $p_k = -\nabla^2 \mathcal{L}(w_k)^{-1} \nabla \mathcal{L}(w_k)$.
  • 5. Find a proper step length $\alpha_k = \arg\min_{\alpha} \mathcal{L}(w_k + \alpha p_k)$.
  • 6. Update parameters: $w_{k+1} \leftarrow w_k + \alpha_k p_k$ (see the sketch after this list).
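A sketch of one such iteration in NumPy, assuming callables for the loss, gradient, and Hessian (which, as the next slides note, is rarely practical at scale); the backtracking loop is a crude stand-in for step 5:

```python
import numpy as np

def newton_step(w, loss, grad, hess):
    """One Newton iteration with simple backtracking on the step length."""
    g = grad(w)
    p = np.linalg.solve(hess(w), -g)          # Newton direction: solve H p = -g
    alpha = 1.0
    while loss(w + alpha * p) > loss(w) and alpha > 1e-8:
        alpha *= 0.5                          # halve the step until L decreases
    return w + alpha * p
```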

SLIDE 13

Second-Order Methods Advantages

  • The rate of convergence is super-linear (quadratic for Newton's method).
  • They are resilient to problem ill-conditioning.
  • They involve less parameter tuning.
  • They are less sensitive to the choice of hyper-parameters.

  • J. Nocedal and S. J. Wright. (2006). Numerical Optimization. 2nd ed. New York. Springer.
SLIDE 14

Second-Order Methods Disadvantages

  • Computing the Hessian matrix is very expensive and requires massive storage.
  • Computing the inverse of the Hessian is not practical.

  • J. Nocedal and S. J. Wright. (2006). Numerical Optimization. 2nd ed. New York. Springer.

Bottou et al. (2016). Optimization Methods for Large-Scale Machine Learning. arXiv:1606.04838.

SLIDE 15

Quasi-Newton Methods

  • 1. Construct a low-rank approximation of the Hessian: $B_k \approx \nabla^2 \mathcal{L}(w_k)$.
  • 2. Find the search direction by minimizing the quadratic model of the objective function:

$$p_k = \arg\min_{p \in \mathbb{R}^n} Q_k(p) \triangleq g_k^T p + \frac{1}{2} p^T B_k p, \qquad g_k = \nabla \mathcal{L}(w_k)$$

SLIDE 16

Quasi-Newton Matrices

  • Symmetric
  • Easy and fast computation
  • Satisfies the secant condition: $B_{k+1} s_k = y_k$

$$s_k \triangleq w_{k+1} - w_k, \qquad y_k \triangleq \nabla \mathcal{L}(w_{k+1}) - \nabla \mathcal{L}(w_k)$$

SLIDE 17

Broyden-Fletcher-Goldfarb-Shanno (BFGS)

  • J. Nocedal and S. J. Wright. (2006). Numerical Optimization. 2nd ed. New York. Springer.
SLIDE 18

Broyden-Fletcher-Goldfarb-Shanno (BFGS)

$$B_{k+1} = B_k - \frac{B_k s_k s_k^T B_k}{s_k^T B_k s_k} + \frac{y_k y_k^T}{y_k^T s_k}$$

$$s_k \triangleq w_{k+1} - w_k, \qquad y_k \triangleq \nabla \mathcal{L}(w_{k+1}) - \nabla \mathcal{L}(w_k), \qquad B_0 = \gamma_k I$$

  • J. Nocedal and S. J. Wright. (2006). Numerical Optimization. 2nd ed. New York. Springer.
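A direct NumPy transcription of the update above; forming the dense $B_k$ is for illustration only (practical codes use the limited-memory form on the next slides), and the curvature guard is an added assumption:

```python
import numpy as np

def bfgs_update(B, s, y):
    """BFGS update of the Hessian approximation B from a curvature pair (s, y)."""
    if s @ y <= 1e-12:        # skip the update if curvature is not positive
        return B
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)
```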
SLIDE 19

Quasi-Newton Methods Advantages

  • The rate of convergence is super-linear.
  • They are resilient to problem ill-conditioning.
  • The second derivative is not required.
  • They only use gradient information to construct the quasi-Newton matrices.

  • J. Nocedal and S. J. Wright. (2006). Numerical Optimization. 2nd ed. New York. Springer.
SLIDE 20

Quasi-Newton Methods Disadvantages

  • The cost of storing the gradient information can be expensive.
  • The quasi-Newton matrix can be dense.
  • The quasi-Newton matrix grows in size and rank in large-scale problems.

  • J. Nocedal and S. J. Wright. (2006). Numerical Optimization. 2nd ed. New York. Springer.
SLIDE 21

Limited-Memory BFGS

Limited-memory storage: $S_k = [\, s_{k-m} \ \dots \ s_{k-1} \,]$, $Y_k = [\, y_{k-m} \ \dots \ y_{k-1} \,]$.

L-BFGS compact representation:

$$B_k = B_0 + \Psi_k M_k \Psi_k^T, \qquad B_0 = \gamma_k I,$$

where

$$\Psi_k = [\, B_0 S_k \ \ Y_k \,], \qquad M_k = \begin{bmatrix} -S_k^T B_0 S_k & -L_k \\ -L_k^T & D_k \end{bmatrix}^{-1}, \qquad S_k^T Y_k = L_k + D_k + U_k,$$

with $L_k$, $D_k$, and $U_k$ the strictly lower triangular, diagonal, and strictly upper triangular parts of $S_k^T Y_k$.
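A sketch of assembling $B_k$ from the stored pairs, assuming the columns of `S` and `Y` hold the $m$ most recent $(s_j, y_j)$; materializing the $n \times n$ matrix is for illustration only, since real implementations work with $\Psi_k$ and $M_k$ implicitly:

```python
import numpy as np

def compact_bfgs(S, Y, gamma):
    """Form B = gamma*I + Psi M Psi^T from limited-memory pairs (columns of S, Y)."""
    n = S.shape[0]
    SY = S.T @ Y
    L = np.tril(SY, k=-1)                    # strictly lower triangular part of S^T Y
    D = np.diag(np.diag(SY))                 # diagonal part of S^T Y
    Psi = np.hstack([gamma * S, Y])          # [B0*S  Y], with B0 = gamma*I
    M_inv = np.block([[-gamma * S.T @ S, -L],
                      [-L.T,              D]])  # small 2m x 2m middle matrix (M^{-1})
    return gamma * np.eye(n) + Psi @ np.linalg.solve(M_inv, Psi.T)
```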

SLIDE 22

Limited-Memory Quasi-Newton Methods

  • Low-rank approximation.
  • Small memory of recent gradients.
  • Low cost of computing the search direction.
  • Linear or superlinear convergence rate can be achieved.

  • J. Nocedal and S. J. Wright. (2006). Numerical Optimization. 2nd ed. New York. Springer.
SLIDE 23

Objectives

$$B_k = B_0 + \Psi_k M_k \Psi_k^T, \qquad B_0 = \gamma_k I$$

What is the best choice for the initialization $\gamma_k$?

SLIDE 24

Overview of Quasi-Newton Optimization Strategies

SLIDE 25

Line Search Method

[Figure: iterate $w_k$ and search direction $p_k$.]

Quadratic model, if $B_k$ is positive definite:

$$p_k = \arg\min_{p \in \mathbb{R}^n} Q_k(p) \triangleq g_k^T p + \frac{1}{2} p^T B_k p \;\Longrightarrow\; p_k = -B_k^{-1} g_k, \qquad g_k = \nabla \mathcal{L}(w_k)$$

  • J. Nocedal and S. J. Wright. (2006). Numerical Optimization. 2nd ed. New York. Springer.
SLIDE 26

Line Search Method

[Figure: step $\alpha_k p_k$ from $w_k$ to $w_{k+1}$.]

Wolfe conditions (see the sketch below):

$$\mathcal{L}(w_k + \alpha_k p_k) \le \mathcal{L}(w_k) + c_1 \alpha_k \nabla \mathcal{L}(w_k)^T p_k$$

$$\nabla \mathcal{L}(w_k + \alpha_k p_k)^T p_k \ge c_2 \nabla \mathcal{L}(w_k)^T p_k$$

  • J. Nocedal and S. J. Wright. (2006). Numerical Optimization. 2nd ed. New York. Springer.
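A backtracking sketch that enforces the first (sufficient-decrease) condition; a full Wolfe search would also check the curvature condition, as done by, e.g., `scipy.optimize.line_search`. The constants are conventional defaults, not taken from the slides:

```python
def backtracking(loss, grad, w, p, c1=1e-4, rho=0.5, alpha=1.0):
    """Shrink alpha until L(w + alpha*p) <= L(w) + c1*alpha*grad(w)'p."""
    g_dot_p = grad(w) @ p      # should be negative for a descent direction
    while alpha > 1e-10 and loss(w + alpha * p) > loss(w) + c1 * alpha * g_dot_p:
        alpha *= rho
    return alpha
```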
SLIDE 27

Trust Region Method

[Figure: trust region of radius $\delta_k$ around $w_k$ with step $p_k$.]

$$p_k = \arg\min_{p \in \mathbb{R}^n} Q(p) \quad \text{s.t.} \quad \|p\|_2 \le \delta_k$$

Conn, Gould, and Toint (2000). Trust-Region Methods. SIAM.

SLIDE 28

Trust Region Method

[Figure: contour plots of the quadratic model comparing the Newton step with the global and local minima of the trust-region subproblem.]

  • J. J. Moré and D. C. Sorensen (1984). "Newton's Method." In Studies in Numerical Analysis, MAA Studies in Mathematics, Vol. 24, pp. 29-82.
SLIDE 29

L-BFGS Trust Region Optimization Method

SLIDE 30

L-BFGS in Trust Region

  • L. Adhikari et al. (2017). "Limited-memory trust-region methods for sparse relaxation." In Proc. SPIE, vol. 10394.
  • Brust et al. (2017). "On solving L-SR1 trust-region subproblems." Computational Optimization and Applications, vol. 66, pp. 245-266.

$$B_k = B_0 + \Psi_k M_k \Psi_k^T$$

Eigendecomposition:

$$B_k = P \begin{bmatrix} \Lambda + \gamma_k I & 0 \\ 0 & \gamma_k I \end{bmatrix} P^T$$

Sherman-Morrison-Woodbury formula:

$$p_k^* = -\frac{1}{\tau^*} \left[ I - \Psi_k \left( \tau^* M_k^{-1} + \Psi_k^T \Psi_k \right)^{-1} \Psi_k^T \right] g_k, \qquad \tau^* = \gamma_k + \sigma^*$$
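A sketch of this solve in NumPy, assuming `Psi` ($n \times 2m$), the small un-inverted middle matrix `M_inv` ($= M_k^{-1}$, $2m \times 2m$), the gradient `g`, and the scalar `tau` are already in hand:

```python
import numpy as np

def smw_direction(Psi, M_inv, g, tau):
    """p* = -(1/tau) [I - Psi (tau*M_inv + Psi'Psi)^{-1} Psi'] g."""
    inner = tau * M_inv + Psi.T @ Psi                    # small 2m x 2m system
    return -(g - Psi @ np.linalg.solve(inner, Psi.T @ g)) / tau
```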

SLIDE 31

L-BFGS in Trust Region vs. Line-Search

26th European Signal Processing Conference, Rome, Italy. September 2018.

SLIDE 32

Proposed Methods for Initialization of L-BFGS

SLIDE 33

Initialization Method I

$$B_k = B_0 + \Psi_k M_k \Psi_k^T, \qquad B_0 = \gamma_k I$$

Spectral estimate of the Hessian:

$$\gamma_k = \frac{y_{k-1}^T y_{k-1}}{s_{k-1}^T y_{k-1}}$$

which solves

$$\gamma_k = \arg\min_{\gamma} \| B_0^{-1} y_{k-1} - s_{k-1} \|_2^2, \qquad B_0 = \gamma I.$$
  • J. Nocedal and S. J. Wright. (2006). Numerical Optimization. 2nd ed. New York. Springer.
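In code this is a one-liner; the guard against non-positive curvature and the fallback value are added assumptions, not from the slides:

```python
def gamma_spectral(s, y, fallback=1.0):
    """Spectral initial scaling gamma_k = (y'y)/(s'y) from the latest pair."""
    sy = s @ y
    return (y @ y) / sy if sy > 1e-12 else fallback
```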
SLIDE 34

Initialization Method II

Consider a quadratic function

$$\mathcal{L}(w) = \frac{1}{2} w^T H w + g^T w, \qquad \nabla^2 \mathcal{L}(w) = H.$$

We have $H S_k = Y_k$, and therefore

$$S_k^T H S_k = S_k^T Y_k.$$

Erway et al. (2018). “Trust-Region Algorithms for Training Responses: Machine Learning Methods Using Indefinite Hessian Approximations,” ArXiv e-prints.

SLIDE 35

Initialization Method II

Since

$$B_k = B_0 + \Psi_k M_k \Psi_k^T, \qquad B_0 = \gamma_k I, \qquad \Psi_k = [\, B_0 S_k \ \ Y_k \,], \qquad M_k = \begin{bmatrix} -S_k^T B_0 S_k & -L_k \\ -L_k^T & D_k \end{bmatrix}^{-1},$$

and the secant condition gives $B_k S_k = Y_k$, we have

$$S_k^T H S_k - \gamma_k S_k^T S_k = S_k^T \Psi_k M_k \Psi_k^T S_k.$$

SLIDE 36

Initialization Method II

$$S_k^T H S_k - \gamma_k S_k^T S_k = S_k^T \Psi_k M_k \Psi_k^T S_k$$

General eigenvalue problem:

$$(L_k + D_k + L_k^T)\, z = \lambda\, S_k^T S_k\, z$$

Upper bound on the initial value to avoid false curvature information (see the sketch below):

$$\gamma_k \in (0, \lambda_{\min})$$
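A sketch using SciPy's generalized symmetric eigensolver to find $\lambda_{\min}$ and then place $\gamma_k$ just below it; the 0.9 safety factor, the fallback, and the positive-definiteness of $S_k^T S_k$ are assumptions for illustration:

```python
import numpy as np
from scipy.linalg import eigh

def gamma_method2(S, Y, safety=0.9):
    """Pick gamma_k in (0, lambda_min) of (L + D + L^T) z = lambda (S^T S) z."""
    SY = S.T @ Y
    L = np.tril(SY, k=-1)
    A = L + np.diag(np.diag(SY)) + L.T                  # L_k + D_k + L_k^T
    lam_min = eigh(A, S.T @ S, eigvals_only=True)[0]    # smallest generalized eigenvalue
    return safety * lam_min if lam_min > 0 else 1.0     # fallback is an assumption
```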

SLIDE 37

Initialization Method III

$$B_k = B_0 + \Psi_k M_k \Psi_k^T, \qquad B_0 = \gamma_k I$$

$$S_k^T H S_k - \gamma_k S_k^T S_k = S_k^T \Psi_k M_k \Psi_k^T S_k, \qquad \Psi_k = [\, B_0 S_k \ \ Y_k \,], \qquad M_k = \begin{bmatrix} -S_k^T B_0 S_k & -L_k \\ -L_k^T & D_k \end{bmatrix}^{-1}$$

Note that the compact-representation matrices contain $\gamma_k$.

SLIDE 38

Initialization Method III

General eigenvalue problem:

$$A^* z = \lambda B^* z,$$

$$A^* = L_k + D_k + L_k^T - S_k^T Y_k \tilde{D}\, Y_k^T S_k - \gamma_{k-1}^2 \left( S_k^T S_k\, \tilde{A}\, S_k^T S_k \right),$$

$$B^* = S_k^T S_k + S_k^T S_k\, \tilde{B}\, Y_k^T S_k + S_k^T Y_k\, \tilde{B}^T S_k^T S_k.$$

Upper bound on the initial values to avoid false curvature information:

$$\gamma_k \in (0, \lambda_{\min})$$

SLIDE 39

Applications in Deep Learning

SLIDE 40

Supervised Learning

Features: $X = \{x_1, x_2, \dots, x_i, \dots, x_N\}$

Labels: $T = \{t_1, t_2, \dots, t_i, \dots, t_N\}$

$$\Phi : X \to T$$

SLIDE 41

Supervised Learning

Input $x$ → Model (Predictor) $\phi(x; w)$ → Output $y$

[Figure: a handwritten digit classified over labels 1, 2, 3, ..., 9, with output scores such as 0.97 and 0.03.]

SLIDE 42

Convolutional Neural Network

[Figure: network architecture $\phi(x; w)$: Input → conv1 → pool1 → conv2 → pool2 → hidden4 → Output.]

Number of parameters: $n = 413{,}080$.

LeCun et al. (1998). "Gradient-based learning applied to document recognition." Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324.

SLIDE 43

Loss Function

Target: $t = [\, 0, 0, 1, 0, \dots, 0 \,]$

Output: $y = [\, 0, 0, 0.97, 0.03, \dots \,]$

Cross-entropy:

$$\ell(t, y) = -t \cdot \log(y) - (1 - t) \cdot \log(1 - y)$$

Empirical risk:

$$\mathcal{L}(w) = \frac{1}{N} \sum_{i=1}^{N} \ell(t_i, \phi(x_i; w))$$
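The two formulas above as a NumPy sketch; the clipping epsilon is an added numerical-stability assumption, not on the slide:

```python
import numpy as np

def cross_entropy(t, y, eps=1e-12):
    """Elementwise cross-entropy between target t and prediction y."""
    y = np.clip(y, eps, 1.0 - eps)            # avoid log(0)
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

def empirical_risk(targets, outputs):
    """Average cross-entropy over the dataset."""
    return np.mean([cross_entropy(t, y) for t, y in zip(targets, outputs)])
```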

SLIDE 44

Multi-Batch L-BFGS

[Figure: successive overlapping batches $S_k$, $S_{k+1}$, $S_{k+2}$ drawn from the shuffled data, with overlaps $O_{k-1}$, $O_k$, $O_{k+1}$, $O_{k+2}$.]

$$O_k = S_k \cap S_{k+1}$$

Berahas et al. (2016). "A multi-batch L-BFGS method for machine learning." In Advances in Neural Information Processing Systems 29, pp. 1055-1063.

SLIDE 45

Computing gradients

[Figure: batches $S_k$, $S_{k+1}$ with overlap $O_k = S_k \cap S_{k+1}$.]

$$g_k = \nabla \mathcal{L}^{(S_k)}(w_k) = \frac{1}{|S_k|} \sum_{i \in S_k} \nabla \ell_i(w_k)$$

The curvature pair is computed on the overlap (see the sketch below):

$$y_k = \nabla \mathcal{L}^{(O_k)}(w_{k+1}) - \nabla \mathcal{L}^{(O_k)}(w_k)$$

Berahas et al. (2016). "A multi-batch L-BFGS method for machine learning." In Advances in Neural Information Processing Systems 29, pp. 1055-1063.
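A sketch of forming the multi-batch curvature pair, assuming a `subsampled_grad(w, indices)` helper (hypothetical name):

```python
import numpy as np

def curvature_pair(w_old, w_new, S_old, S_new, subsampled_grad):
    """(s_k, y_k) with y_k evaluated on the overlap O_k = S_k ∩ S_{k+1}."""
    O_k = np.intersect1d(S_old, S_new)        # overlapping sample indices
    s_k = w_new - w_old
    y_k = subsampled_grad(w_new, O_k) - subsampled_grad(w_old, O_k)
    return s_k, y_k
```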

SLIDE 46

Experiment

SLIDE 47

Trust-Region Algorithm
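The algorithm on this slide was shown as a figure; below is a hedged sketch of a standard trust-region iteration in the same spirit. The acceptance threshold and radius factors (0.75, 0.25, etc.) are conventional textbook choices, not necessarily the values used in the talk, and `solve_subproblem` stands in for the SMW-based L-BFGS subproblem solver described earlier:

```python
import numpy as np

def trust_region(loss, grad, solve_subproblem, w0, delta=1.0, eps=1e-6):
    """Generic trust-region loop; solve_subproblem returns (p, predicted decrease)."""
    w = w0
    while np.linalg.norm(grad(w)) > eps:
        g = grad(w)
        p, pred = solve_subproblem(w, g, delta)            # step with ||p|| <= delta
        rho = (loss(w) - loss(w + p)) / max(pred, 1e-16)   # actual/predicted ratio
        if rho < 0.25:
            delta *= 0.25                  # poor model: shrink the region
        elif rho > 0.75 and np.isclose(np.linalg.norm(p), delta):
            delta *= 2.0                   # good model at the boundary: expand
        if rho > 1e-4:                     # accept only on sufficient decrease
            w = w + p
    return w
```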

SLIDE 48

Results - Loss

[Figure: training loss curves, full-batch and stochastic settings.]

SLIDE 49

Results - Accuracy

[Figure: classification accuracy curves, full-batch and stochastic settings.]

SLIDE 50

Results - Training Time

[Figure: training time, full-batch and stochastic settings.]

SLIDE 51

Acknowledgement

This research is supported by NSF grants CMMI 1333326, IIS 1741490, and ACI 1429783.

SLIDE 52

Paper, Code and Slides: http://rafati.net

jrafatiheravi@ucmerced.edu