Sub-Sampled Newton Methods for Machine Learning - Jorge Nocedal - PowerPoint PPT Presentation


SLIDE 1

Sub-Sampled Newton Methods for Machine Learning Jorge Nocedal

Northwestern University
Goldman Lecture, Sept 2016

SLIDE 2

Collaborators

Raghu Bollapragada (Northwestern University), Richard Byrd (University of Colorado)

SLIDE 3

Optimization, Statistics, Machine Learning

Dramatic breakdown of barriers between optimization and statistics, partly stimulated by the rise of machine learning.

  • Learning processes and learning measures are defined using continuous optimization models, both convex and non-convex
  • Nonlinear optimization algorithms are used to train statistical models, a great challenge with the advent of high-dimensional models and huge data sets
  • The stochastic gradient method currently plays a central role

SLIDE 4

Stochastic Gradient Method (Robbins-Monro, 1951)

  • 1. Why has it risen to such prominence?
  • 2. What is the main mechanism that drives it?
  • 3. What can we say about its behavior in convex and non-convex cases?
  • 4. What ideas have been proposed to improve upon SG?

"Optimization Methods for Large-Scale Machine Learning", Bottou, Curtis, Nocedal (2016)

SLIDE 5

Problem statement

Given a training set $\{(x_1,y_1),\dots,(x_n,y_n)\}$
Given a loss function $\ell(z,y)$ (hinge loss, logistic, ...)
Find a prediction function $h(x;w)$ (linear, DNN, ...)

$$\min_w \ \frac{1}{n}\sum_{i=1}^{n} \ell(h(x_i;w),\,y_i)$$

Notation: $f_i(w) = \ell(h(x_i;w),\,y_i)$

$$R_n(w) = \frac{1}{n}\sum_{i=1}^{n} f_i(w) \qquad \text{(empirical risk)}$$

Random variable $\xi = (x,y)$:

$$F(w) = \mathbb{E}[\,f(w;\xi)\,] \qquad \text{(expected risk)}$$

SLIDE 6

Stochastic Gradient Method

$$w_{k+1} = w_k - \alpha_k \nabla f_i(w_k), \qquad i \in \{1,\dots,n\} \text{ chosen at random}$$

  • Very cheap, noisy iteration; gradient w.r.t. just 1 data point
  • Not a gradient descent method
  • Stochastic process dependent on the choice of i
  • Descent in expectation

We first present algorithms for empirical risk minimization:

$$R_n(w) = \frac{1}{n}\sum_{i=1}^{n} f_i(w)$$
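For concreteness, here is a minimal sketch of the SG iteration, assuming a generic per-example gradient callback; the least-squares data, diminishing stepsize schedule, and all names below are illustrative choices, not taken from the talk.

```python
import numpy as np

def sgd(grad_fi, w0, n, alpha0=0.5, num_iters=2000, seed=0):
    """Stochastic gradient: w_{k+1} = w_k - alpha_k * grad f_i(w_k),
    with i drawn uniformly at random at each iteration."""
    rng = np.random.default_rng(seed)
    w = w0.copy()
    for k in range(num_iters):
        i = rng.integers(n)              # one data point, chosen at random
        alpha_k = alpha0 / (1 + k)       # diminishing stepsize, alpha_k -> 0
        w -= alpha_k * grad_fi(w, i)     # cheap, noisy step
    return w

# Toy usage: least-squares components f_i(w) = 0.5*(x_i^T w - y_i)^2
X = np.random.randn(200, 5)
y = X @ np.ones(5)
grad_fi = lambda w, i: (X[i] @ w - y[i]) * X[i]
w_hat = sgd(grad_fi, w0=np.zeros(5), n=len(y))
```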

SLIDE 7

Batch Optimization Methods

$$w_{k+1} = w_k - \alpha_k \nabla R_n(w_k) = w_k - \frac{\alpha_k}{n}\sum_{i=1}^{n}\nabla f_i(w_k) \qquad \text{(batch gradient method)}$$

Why has SG emerged as the preeminent method?

  • More expensive, accurate step
  • Can choose among a wide range of optimization algorithms
  • Opportunities for parallelism

Computational trade-offs between stochastic and batch methods.
Ability to minimize F (the generalization error).

SLIDE 8

Practical Experience

Fast initial progress of SG, followed by a drastic slowdown after about 10 epochs. Can we explain this?

[Figure: empirical risk $R_n$ vs. accessed data points for SGD and batch L-BFGS; logistic regression on speech data.]

SLIDE 9

Intuition

SG employs information more efficiently than batch methods.
Argument 1: Suppose the data consists of 10 copies of a set S. An iteration of the batch method is then 10 times more expensive, while SG performs the same computations as it would on S alone.

SLIDE 10

Computational complexity

Total work to obtain $R_n(w_k) \le R_n(w_*) + \epsilon$:

  • Batch gradient method: $n\log(1/\epsilon)$
  • Stochastic gradient method: $1/\epsilon$

Think of $\epsilon = 10^{-3}$. Which one is better?

More precisely: Batch: $n\,d\,\kappa\,\log(1/\epsilon)$; SG: $d\,\nu\,\kappa^2/\epsilon$
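As a rough back-of-the-envelope check of these counts (the value $n = 10^6$ below is an assumption chosen only for illustration):

```latex
\[
\text{With } \epsilon = 10^{-3},\ n = 10^{6}:\qquad
\underbrace{n\log(1/\epsilon) \approx 10^{6}\cdot\ln(10^{3}) \approx 7\times 10^{6}}_{\text{batch}}
\quad\text{vs.}\quad
\underbrace{1/\epsilon = 10^{3}}_{\text{SG}}
\ \text{component-gradient evaluations.}
\]
```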

SLIDE 11

Example by Bertsekas

Region of confusion. Note that this is a geographical argument.

Analysis: given $w_k$, what is the expected decrease in the objective function $R_n$ as we choose one of the quadratics at random?

$$R_n(w) = \frac{1}{n}\sum_{i=1}^{n} f_i(w)$$
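A tiny numerical illustration of the region of confusion (not from the talk; the one-dimensional quadratics $f_i(w) = \tfrac{1}{2}(w - c_i)^2$, their centers, and the fixed stepsize are made-up choices):

```python
import numpy as np

# Ten 1-D quadratics f_i(w) = 0.5*(w - c_i)^2 with scattered minimizers c_i.
# R_n is minimized at mean(c); SG with a fixed stepsize makes fast progress
# from far away, then keeps bouncing inside the region [min(c), max(c)].
rng = np.random.default_rng(0)
c = rng.uniform(-1.0, 1.0, size=10)      # individual minimizers
w, alpha = 3.0, 0.5                      # start outside the region, fixed step
for k in range(200):
    i = rng.integers(len(c))
    w -= alpha * (w - c[i])              # SG step on the sampled quadratic
print("final iterate:", w)
print("region of confusion:", (c.min(), c.max()), "minimizer of R_n:", c.mean())
```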

SLIDE 12

A fundamental observation

To ensure convergence, $\alpha_k \to 0$ in the SG method, in order to control the variance:

$$\mathbb{E}[R_n(w_{k+1}) - R_n(w_k)] \;\le\; -\alpha_k\,\|\nabla R_n(w_k)\|_2^2 \;+\; \alpha_k^2\, \mathbb{E}\|\nabla f_{i_k}(w_k)\|^2$$

Initially the gradient-decrease term dominates; later the variance in the gradient hinders progress. Sub-sampled Newton methods directly control the noise in the last term. What can we say when $\alpha_k = \alpha$ is constant?

SLIDE 13

  • With a constant stepsize, SG only converges to a neighborhood of the optimal value.
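For reference, the standard constant-stepsize result has the following flavor (a simplified form of the bound in Bottou, Curtis, Nocedal (2016), assuming strong convexity with constant $c$, an $L$-Lipschitz gradient, a bounded second moment $\mathbb{E}\|\nabla f_{i_k}(w_k)\|^2 \le M$, and a sufficiently small $\alpha$; the exact constants are not from these slides):

```latex
\[
\mathbb{E}\big[R_n(w_k) - R_n(w_*)\big] \;\le\;
(1-\alpha c)^{\,k-1}\Big(R_n(w_1) - R_n(w_*) - \tfrac{\alpha L M}{2c}\Big)
\;+\; \tfrac{\alpha L M}{2c}.
\]
```

The first term decays linearly; the second is the residual "neighborhood", proportional to the stepsize $\alpha$.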
SLIDE 14

SLIDE 15

Deep Neural Networks

Although much is understood about the SG method, some great mysteries remain: why is it so much better than batch methods in DNNs?
SLIDE 16

Sharp and flat minima (Keskar et al., 2016)

[Figure: $R$ observed along the line from the SG solution to the batch solution, following Goodfellow et al.]

Setup: deep convolutional neural net on CIFAR-10; SG with a mini-batch of size 256; batch method using 10% of the training set; ADAM optimizer.

SLIDE 17

Testing accuracy and sharpness (Keskar et al., 2016)

[Figure: testing accuracy vs. batch size, and sharpness of the minimizer vs. batch size. Sharpness: maximum of $R$ in a small box around the minimizer.]

SLIDE 18

Drawback of SG method: distributed computing

SLIDE 19

Sub-sampled Newton Methods

SLIDE 20

Iteration

$$\nabla^2 F_S(w_k)\, p = -\nabla F_X(w_k), \qquad w_{k+1} = w_k + \alpha_k\, p$$

Choose $S \subset \{1,\dots,n\}$ and $X \subset \{1,\dots,n\}$ uniformly and independently:

$$\nabla F_X(w_k) = \frac{1}{|X|}\sum_{i \in X}\nabla f_i(w_k), \qquad
\nabla^2 F_S(w_k) = \frac{1}{|S|}\sum_{i \in S}\nabla^2 f_i(w_k)$$

Sub-sampled gradient and Hessian. The true Newton method is impractical in large-scale machine learning, and we will not achieve scale invariance or quadratic convergence. But the stochastic nature of the objective creates opportunities: coordinate the Hessian sample S and the gradient sample X for optimal complexity.

$$R_n(w) = \frac{1}{n}\sum_{i=1}^{n} f_i(w)$$
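A minimal sketch of one such iteration, assuming per-example gradient and Hessian callbacks and a dense linear solve (all names, sample sizes, and the exact solve are illustrative assumptions, not choices from the talk):

```python
import numpy as np

def subsampled_newton_step(w, grad_fi, hess_fi, n, size_X, size_S, alpha=1.0, rng=None):
    """One step: solve  H_S(w) p = -g_X(w),  then  w <- w + alpha * p,
    where g_X and H_S average gradients/Hessians over independent samples X, S."""
    rng = rng or np.random.default_rng()
    X = rng.choice(n, size=size_X, replace=False)    # gradient sample
    S = rng.choice(n, size=size_S, replace=False)    # Hessian sample, independent of X
    g = np.mean([grad_fi(w, i) for i in X], axis=0)  # sub-sampled gradient
    H = np.mean([hess_fi(w, i) for i in S], axis=0)  # sub-sampled Hessian
    p = np.linalg.solve(H, -g)                       # exact solve; fine for small d
    return w + alpha * p
```

For large d one would avoid forming H and instead use Hessian-vector products with an iterative solver, as in the inexact methods discussed later.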

SLIDE 21

Active research area

  • Friedlander and Schmidt (2011)
  • Byrd, Chin, Neveitt, N. (2011)
  • Royset (2012)
  • Erdogdu and Montanari (2015)
  • Roosta-Khorasani and Mahoney (2016)
  • Agarwal, Bullins and Hazan (2016)
  • Pilanci and Wainwright (2015)
  • Pasupathy, Glynn, Ghosh, Hashemi (2015)
  • Xu, Yang, Roosta-Khorasani, Ré, Mahoney (2016)
SLIDE 22

Linear convergence

$$\nabla^2 F_{S_k}(w_k)\, p = -\nabla F_{X_k}(w_k), \qquad w_{k+1} = w_k + \alpha\, p$$

The following result is well known for strongly convex objectives.

Theorem. Under standard assumptions, if
(a) $\alpha = \mu/L$,
(b) $|S_k| =$ constant,
(c) $|X_k| = \eta^k$ with $\eta > 1$ (geometric growth),
then $\mathbb{E}[\,\|w_k - w_*\|\,] \to 0$ at a linear rate, and the work complexity matches that of the stochastic gradient method.

Here $\mu$ is the smallest eigenvalue of any subsampled Hessian and $L$ the largest eigenvalue of the Hessian of $F$.

SLIDE 23

Local superlinear convergence. Objective: the expected risk $F$.

We can show the linear-quadratic result

$$\mathbb{E}_k[\,\|w_{k+1} - w_*\|\,] \;\le\; C_1\,\|w_k - w_*\|^2 \;+\; \frac{\sigma}{\mu\sqrt{|S_k|}}\,\|w_k - w_*\| \;+\; \frac{\nu}{\mu\sqrt{|X_k|}}$$

To obtain superlinear convergence: (i) $|S_k| \to \infty$; (ii) $|X_k|$ must increase faster than geometrically.
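A small sketch of the sample-size schedules implied by the last two slides (the constants, the growth exponent, and the cap at n are illustrative assumptions):

```python
# Gradient-sample schedules |X_k| for sub-sampled Newton.
# Geometric growth eta**k gives linear convergence (previous slide); growing
# faster than geometrically, e.g. eta**(k**1.2), is needed for superlinear
# convergence, together with |S_k| -> infinity for the Hessian sample.
eta, n = 2.0, 10**6
X_geometric = [min(n, int(eta ** k)) for k in range(15)]
X_supergeom = [min(n, int(eta ** (k ** 1.2))) for k in range(15)]
S_linear    = [64] * 15                           # constant |S_k|: linear rate
S_superlin  = [64 * (k + 1) for k in range(15)]   # growing |S_k|: superlinear rate
print(X_geometric)
print(X_supergeom)
```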

SLIDE 24

Closer look at the constant

We can show the linear-quadratic result

$$\mathbb{E}_k[\,\|w_{k+1} - w_*\|\,] \;\le\; C_1\,\|w_k - w_*\|^2 \;+\; \frac{\sigma}{\mu\sqrt{|S_k|}}\,\|w_k - w_*\| \;+\; \frac{\nu}{\mu\sqrt{|X_k|}}$$

where the constants are defined by

$$\big\|\,\mathbb{E}\big[(\nabla^2 F_i(w) - \nabla^2 F(w))^2\big]\big\| \;\le\; \sigma^2, \qquad \mathrm{tr}\big(\mathrm{Cov}(\nabla f_i(w))\big) \;\le\; \nu^2$$

SLIDE 25

Achieving faster convergence

We can show the linear-quadratic result

$$\mathbb{E}_k[\,\|w_{k+1} - w_*\|\,] \;\le\; C_1\,\|w_k - w_*\|^2 \;+\; \frac{\sigma}{\mu\sqrt{|S_k|}}\,\|w_k - w_*\| \;+\; \frac{\nu}{\mu\sqrt{|X_k|}}$$

To obtain it we used the bound

$$\mathbb{E}_k\big[\,\|(\nabla^2 F_{S_k}(w_k) - \nabla^2 F(w_k))(w_k - w_*)\|\,\big] \;\le\; \frac{\sigma}{\sqrt{|S_k|}}\,\|w_k - w_*\|,$$

not matrix concentration inequalities.

SLIDE 26

Formal superlinear convergence

Theorem. Under the conditions just stated, there is a neighborhood of $w_*$ such that, for the sub-sampled Newton method with $\alpha_k = 1$,

$$\mathbb{E}[\,\|w_k - w_*\|\,] \to 0 \quad \text{superlinearly.}$$

SLIDE 27

Observations on quasi-Newton methods

$$B_k\, p = -\nabla F_X(w_k), \qquad w_{k+1} = w_k + \alpha_k\, p, \qquad \text{and suppose } w_k \to w_*$$

Dennis-Moré: if

$$\frac{\|[B_k - \nabla^2 f(w_*)]\,p_k\|}{\|p_k\|} \to 0$$

then the convergence rate is superlinear.

  • 1. $B_k$ need not converge to the true Hessian at $w_*$; it only needs to be a good approximation along the search directions.

SLIDE 28

Hacker’s definition of second-order method

1. One for which a unit steplength is acceptable (yields sufficient reduction) most of the time.
2. Why? How can one learn the scale of the search directions without having learned the curvature of the function in some relevant spaces?
3. This "definition" is problem dependent.

SLIDE 29

Inexact Methods

$$\nabla^2 F_S(w_k)\, p = -\nabla F_X(w_k), \qquad w_{k+1} = w_k + \alpha_k\, p$$

1. Exact method: the Hessian approximation is inverted (e.g. Newton sketch)
2. Inexact method: solve the linear system inexactly by an iterative solver
3. Conjugate gradient
4. Stochastic gradient
5. Both require only Hessian-vector products

Newton-CG chooses a fixed sample S and applies CG to

$$q_k(p) = F(w_k) + \nabla F(w_k)^T p + \tfrac{1}{2}\, p^T \nabla^2 F_S(w_k)\, p$$
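A sketch of the Hessian-free Newton-CG idea, where the sub-sampled Hessian is accessed only through Hessian-vector products (the CG tolerance, iteration cap, and callback names are assumptions made for this illustration):

```python
import numpy as np

def cg(hvp, b, tol=1e-2, max_iter=50):
    """Conjugate gradient for H p = b, with H (assumed positive definite)
    accessed only through Hessian-vector products hvp(v) = H @ v."""
    p = np.zeros_like(b)
    r = b - hvp(p)                          # residual
    d = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        if np.sqrt(rs) <= tol * np.linalg.norm(b):
            break                           # inexact solve: stop early
        Hd = hvp(d)
        a = rs / (d @ Hd)
        p += a * d
        r -= a * Hd
        rs_new = r @ r
        d = r + (rs_new / rs) * d
        rs = rs_new
    return p

def newton_cg_step(w, grad_X, hvp_S, alpha=1.0):
    """Inexact sub-sampled Newton step: approximately minimize q_k with CG."""
    p = cg(hvp_S, -grad_X(w))               # solve H_S(w) p = -g_X(w) inexactly
    return w + alpha * p
```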

SLIDE 30

Newton-SGI (stochastic gradient iteration)

If we apply the standard gradient method to

$$q_k(p) = F(w_k) + \nabla F(w_k)^T p + \tfrac{1}{2}\, p^T \nabla^2 F(w_k)\, p$$

we obtain the iteration

$$p_k^{i+1} = p_k^i - \nabla q_k(p_k^i) = \big(I - \nabla^2 F(w_k)\big)\, p_k^i - \nabla F(w_k),$$

a gradient method that uses the exact gradient and an estimate of the Hessian.

Consider instead the semi-stochastic gradient iteration, which changes the sampled Hessian at each inner iteration:

  • 1. Choose an index $j$ at random;
  • 2. $p_k^{i+1} = \big(I - \nabla^2 F_j(w_k)\big)\, p_k^i - \nabla F(w_k)$

This method is implicit in Agarwal, Bullins, Hazan (2016).
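A sketch of the Newton-SGI inner loop based on these formulas (the inner iteration count and callback names are illustrative; the recursion contracts only if the eigenvalues of each component Hessian lie in (0, 2), which is assumed here, e.g. after scaling):

```python
import numpy as np

def newton_sgi_step(w, grad_F, hess_fj, n, inner_iters=100, alpha=1.0, rng=None):
    """Semi-stochastic gradient iteration on the quadratic model q_k:
    p <- (I - H_j(w)) p - grad F(w), resampling the component Hessian H_j
    at every inner iteration, followed by the outer step w <- w + alpha * p."""
    rng = rng or np.random.default_rng()
    g = grad_F(w)                            # exact (or sub-sampled) gradient
    p = np.zeros_like(g)
    for _ in range(inner_iters):
        j = rng.integers(n)                  # fresh component Hessian each time
        p = p - hess_fj(w, j) @ p - g        # p^{i+1} = (I - H_j) p^i - g
    return w + alpha * p
```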

SLIDE 31

Comparing Newton-CG and Newton-SGI

Number of Hessian-vector products to achieve $\|w_{k+1} - w_*\| \le \tfrac{1}{2}\|w_k - w_*\|$ (*):

  • Newton-SGI: $O\big((\hat\kappa_l^{\max})^2\, \hat\kappa_l \log(\hat\kappa_l)\log(d)\big)$
  • Newton-CG: $O\big((\hat\kappa_l^{\max})^2\, \sqrt{\hat\kappa_l^{\max}} \log(\hat\kappa_l^{\max})\big)$

Results in Agarwal, Bullins and Hazan (2016) and Xu, Yang, Ré, Roosta-Khorasani, Mahoney (2016) obtain the decrease (*) at each step with probability $1-p$; our results give convergence of the whole sequence in expectation. The complexity bounds are very pessimistic, particularly for CG.

SLIDE 32

Final Remarks

  • The search for effective optimization algorithms for machine learning is ongoing, in spite of the total dominance of the SG method at present on very large-scale applications
  • SG does not parallelize well
  • SG is a first-order method affected by ill conditioning
  • A method that is noisy at the start and gradually becomes more accurate seems attractive