

1. An Evolving Gradient Resampling Method for Machine Learning — Jorge Nocedal, Northwestern University. NIPS, Montreal, 2015.

2. Collaborators: Figen Oztoprak, Stefan Solntsev, Richard Byrd.

3. Outline
• How to improve upon the stochastic gradient method for risk minimization
• Noise reduction methods: dynamic sampling (batching); aggregated gradient methods (SAG, SVRG, etc.)
• Second-order methods
• Propose a noise reduction method that re-uses old gradients and also employs dynamic sampling

4. Organization of optimization methods
[Diagram: a spectrum from the stochastic gradient method to the batch gradient method; the "noise reduction" and "condition number" axes both lead toward the stochastic Newton method.]

5. Second-order methods — between the stochastic gradient method and the batch gradient method:
• Averaging (Polyak-Ruppert)
• Momentum
• Natural gradient (Fisher)
• Quasi-Newton
• Inexact Newton (Hessian-free)
• Stochastic Newton

6. Noise-reducing methods — between the stochastic gradient method and the batch gradient method:
• Dynamic sampling methods
• Aggregated gradient methods
• Stochastic Newton method
This talk: combine both ideas. Why?

7. Objective function
$$\min_w F(w) = \mathbb{E}[f(w;\xi)]$$
where $\xi = (x, y)$ is a random variable with distribution $P$, and $f(\cdot\,;\xi)$ is the composition of a loss $\ell$ and a prediction function $h$.
Stochastic gradient (SG) method: $w_{k+1} = w_k - \alpha_k \nabla f(w_k; \xi_k)$.
Sampled gradient approximation over a batch (or mini-batch) $S$: $F_S(w) = \frac{1}{|S|}\sum_{i\in S} f(w; \xi_i)$.
Batch (mini-batch) method: $w_{k+1} = w_k - \alpha_k \nabla F_S(w_k)$.
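To make the two updates concrete, here is a minimal sketch in Python (illustrative only; `grad_f` is a hypothetical per-example gradient oracle, not code from the talk):

```python
import numpy as np

def sg_step(w, alpha, grad_f, xi):
    """One stochastic gradient step: w_{k+1} = w_k - alpha_k * grad f(w_k; xi_k)."""
    return w - alpha * grad_f(w, xi)

def minibatch_step(w, alpha, grad_f, batch):
    """One mini-batch step using the sample-average gradient over S."""
    g = np.mean([grad_f(w, xi) for xi in batch], axis=0)
    return w - alpha * g
```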

8. Transient behavior of SG
Expected function decrease:
$$\mathbb{E}[F(w_{k+1}) - F(w_k)] \le -\alpha_k \|\nabla F(w_k)\|^2 + \alpha_k^2\, \mathbb{E}\|\nabla f(w_k, \xi_k)\|^2$$
Initially, the gradient-decrease term dominates; then variance in the gradient hinders progress (the "area of confusion").
To ensure convergence, the SG method requires $\alpha_k \to 0$ to control the variance. The steplength is selected to achieve fast initial progress, but this slows progress in later stages.
Dynamic sampling methods reduce gradient variance by increasing the batch size. What is the right rate?

9. Proposal: gradient accuracy conditions — geometric noise reduction
Consider the stochastic gradient method with fixed steplength: $w_{k+1} = w_k - \alpha\, g(w_k, \xi_k)$.
If the variance of the stochastic gradient decreases geometrically, the method yields linear convergence.
Lemma (Schmidt et al.; Pasupathy et al.): If there exist $M > 0$ and $\zeta \in (0,1)$ such that
$$\mathbb{E}[\|g(w_k, \xi_k)\|^2] - \|\nabla F(w_k)\|^2 \le M \zeta^{k-1},$$
then
$$\mathbb{E}[F(w_k) - F^*] \le \nu \rho^{k-1}.$$
This extends the classical convergence result for the gradient method, in which the error in gradient estimates decreases sufficiently rapidly to preserve linear convergence.

10. Proposal: gradient accuracy conditions — optimal work complexity
We can ensure the variance condition
$$\mathbb{E}[\|g(w_k, \xi_k)\|^2] - \|\nabla F(w_k)\|^2 \le M \zeta^{k-1}$$
for $\nabla F_S(w) = \frac{1}{|S|}\sum_{i\in S} \nabla f(w; \xi_i)$ by letting $|S_k| = a^{k-1}$, $a > 1$.
Moreover, we obtain optimal complexity bounds: the total number of stochastic gradient evaluations needed to achieve $\mathbb{E}[F(w_k) - F^*] \le \varepsilon$ is $O(1/\varepsilon)$, with favorable constants for $a \in \bigl[1, (1-\beta c\mu^2)^{-1}\bigr]$.
Pasupathy, Glynn et al. 2014; Friedlander, Schmidt 2012; Homem-de-Mello, Shapiro 2012; Byrd, Chin, Nocedal, Wu 2013.
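A minimal sketch of this geometric batching schedule (the growth factor `a`, the oracles `grad_f` and `sample_xi`, and the fixed steplength are illustrative assumptions, not the talk's code):

```python
import numpy as np

def dynamic_sampling_sgd(w0, alpha, grad_f, sample_xi, a=1.1, max_iters=100):
    """Gradient method with geometrically growing sample sizes |S_k| = ceil(a^(k-1))."""
    w = w0
    for k in range(1, max_iters + 1):
        batch_size = int(np.ceil(a ** (k - 1)))
        batch = [sample_xi() for _ in range(batch_size)]      # draw a fresh S_k
        g = np.mean([grad_f(w, xi) for xi in batch], axis=0)  # sample-average gradient
        w = w - alpha * g  # fixed steplength; variance shrinks as 1/|S_k|
    return w
```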

11. Theorem: Suppose $F$ is strongly convex. Consider $w_{k+1} = w_k - (1/L)\, g_k$, where $S_k$ is chosen so that the variance condition holds and $|S_k| \ge \gamma^k$ for $\gamma > 1$. Then
$$\mathbb{E}[F(w_k) - F(w^*)] \le C \rho^k, \qquad \rho < 1,$$
and the number of gradient samples needed to achieve $\varepsilon$ accuracy is
$$O\!\left(\frac{\kappa\, \omega\, d}{\varepsilon\, \lambda}\right),$$
where $d$ = number of variables, $\kappa$ = condition number, $\lambda$ = smallest eigenvalue of the Hessian, and $\|\mathrm{Var}\,\nabla\ell(w_k; i)\|_1 \le \omega$ (population).

12. Dynamic sampling (batching)
At every iteration, choose a subset $S$ of $\{1,\dots,n\}$ and apply one step of an optimization algorithm to the function
$$F_S(w) = \frac{1}{|S|}\sum_{i\in S} f(w; \xi_i).$$
At the start, a small sample size $|S|$ is chosen.
• If the optimization step is likely to reduce $F(w)$, the sample size is kept unchanged; a new sample $S$ is chosen and the next optimization step is taken.
• Otherwise, a larger sample size is chosen, a new random sample $S$ is selected, and a new iterate is computed.
Many optimization methods can be used. This approach creates the opportunity to employ second-order methods.

13. How to implement this in practice?
1. Predetermine a geometric increase (with tuning parameter $a$): $|S_k| = a^{k-1}$, $a > 1$.
2. Use an angle (i.e., variance) test, ensuring the bound is satisfied in expectation:
$$\|g(w_k) - \nabla F(w_k)\| \le \theta \|g_k\|, \qquad \theta < 1.$$
A popular choice is a combination of these two strategies.
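A minimal sketch of the variance test (an assumption on my part: the empirical per-example gradient variance serves as a proxy for the unobservable $\|g(w_k) - \nabla F(w_k)\|$, as is common in implementations of such tests):

```python
import numpy as np

def variance_test_passes(sample_grads, theta=0.9):
    """Check (in expectation) that ||g - grad F|| <= theta * ||g||,
    using the empirical variance of the per-example gradients as a proxy."""
    g = np.mean(sample_grads, axis=0)  # sample-average gradient g_k
    n = len(sample_grads)
    # Empirical estimate of E||g - grad F||^2 = tr(Var) / |S|
    var_est = np.sum(np.var(sample_grads, axis=0, ddof=1)) / n
    return var_est <= (theta ** 2) * np.dot(g, g)

# If the test fails, grow the batch, e.g. |S_{k+1}| = ceil(a * |S_k|).
```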

14. Numerical tests: Newton-CG method with dynamic sampling and an Armijo line search,
$$w_{k+1} = w_k - \alpha_k \nabla^2 F_{H_k}(w_k)^{-1} g_k, \qquad \alpha_k \approx 1.$$
Test problem:
• From Google VoiceSearch
• 191,607 training points
• 129 classes; 235 features
• 30,315 parameters (variables)
• Small version of a production problem
• Multi-class logistic regression
• Initial batch size: 1%; Hessian sample: 10%
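A minimal sketch of a Newton-CG step of this form (illustrative; the oracles `grad` and `hess_vec`, where `hess_vec` would use a smaller Hessian subsample, are assumptions, not the experiment's code):

```python
import numpy as np

def cg_solve(hv, b, max_iters=10, tol=1e-6):
    """Approximately solve H x = b by conjugate gradients,
    using only Hessian-vector products hv(v)."""
    x = np.zeros_like(b)
    r = b - hv(x)
    p = r.copy()
    rs = r @ r
    for _ in range(max_iters):
        Hp = hv(p)
        a = rs / (p @ Hp)
        x += a * p
        r -= a * Hp
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def newton_cg_step(w, grad, hess_vec, alpha=1.0):
    """One subsampled Newton-CG step: solve for the Newton direction with CG."""
    g = grad(w)
    d = cg_solve(lambda v: hess_vec(w, v), g)
    return w - alpha * d  # alpha from an Armijo line search; often ~1
```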

15. [Figure: objective function value vs. time for the new dynamic-sampling Newton-CG method, classical Newton-CG, L-BFGS (m = 2, 20), batch L-BFGS, and stochastic gradient descent; the sample size is changed dynamically based on a variance estimate.]

16. However, this is not completely satisfactory; more investigation is needed, particularly:
• The transition between stochastic and batch regimes
• Coordination between step size and batch size
• Use of second-order information (once the stochastic gradient is not too noisy)
• Can the idea of re-using gradients in a gradient aggregation approach help?

17. Transition from stochastic to batch regimes
[Diagram: sample size $|S_k|$ ranging from 1 (SGD, $\alpha_k = 1/k$) to $m$ (batch gradient, $\alpha_k = 1$), with a "twilight zone" in between.]
Gradient aggregation could smooth the transition.

18. Randomized aggregated gradient methods (for empirical risk minimization)
Expected risk: $F(w) = \mathbb{E}[f(w;\xi)]$.
Empirical risk: $F_m(w) = \frac{1}{m}\sum_{i=1}^m f(w; \xi_i) = \frac{1}{m}\sum_{i=1}^m f_i(w)$.
SAG, SAGA, SVRG, etc. focus on minimizing the empirical risk.
Iteration: $w_{k+1} = w_k - \alpha y_k$, where $y_k$ is a combination of gradients $\nabla f_i$ evaluated at previous iterates $\phi^j$.
SAG: choose $j$ at random and set
$$y_k = \frac{1}{m}\left[\nabla f_j(w_k) - \nabla f_j(\phi_{k-1}^j)\right] + \frac{1}{m}\sum_{i=1}^m \nabla f_i(\phi_{k-1}^i).$$
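A minimal sketch of the SAG update above for a finite sum (the oracle `grad_fi(w, i)` and the steplength are assumptions; this is not the cited authors' code):

```python
import numpy as np

def sag(w0, alpha, grad_fi, m, n_iters=1000, seed=0):
    """SAG: keep a table of the last gradient seen for each f_i and step
    along the running average, refreshing one entry per iteration."""
    rng = np.random.default_rng(seed)
    w = w0.copy()
    table = np.array([grad_fi(w, i) for i in range(m)])  # init pass: grad f_i(phi_0^i)
    avg = table.mean(axis=0)
    for _ in range(n_iters):
        j = rng.integers(m)
        new_g = grad_fi(w, j)
        avg += (new_g - table[j]) / m  # y_k = (1/m)[new - old] + old average
        table[j] = new_g
        w -= alpha * avg
    return w
```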

19. Example of gradient aggregation methods
SAG, applied to $F_m(w) = \frac{1}{m}\sum_{i=1}^m f_i(w)$:
$$y_k = \frac{1}{m}\left[\nabla f_j(w_k) - \nabla f_j(\phi_{k-1}^j)\right] + \frac{1}{m}\sum_{i=1}^m \nabla f_i(\phi_{k-1}^i).$$
SAG, SAGA, and SVRG achieve a linear rate of convergence in expectation (after a full initialization pass).

20. EGR Method — the Evolving Gradient Resampling method for expected risk minimization.

21. Proposed algorithm
1. Minimizes the expected risk (not the training error)
2. Stores previous gradients and updates several ($s_k$) at each iteration
3. Computes additional ($u_k$) gradients at the current iterate
4. Total number of stored gradients increases monotonically
5. Shares properties with dynamic sampling and gradient aggregation methods
Goal: analyze an algorithm of this generality (interesting in its own right). Finding the right balance between re-using old information and batching can result in an efficient method.

22. The EGR method
$$y_k = \frac{1}{t_k + u_k}\left[\sum_{j\in S_k}\left(\nabla f_j(w_k) - \nabla f_j(\phi_{k-1}^j)\right) + \sum_{i=1}^{t_k} \nabla f_i(\phi_{k-1}^i) + \sum_{j\in U_k} \nabla f_j(w_k)\right]$$
• $t_k$: number of gradients in storage at the start of iteration $k$
• $U_k$: indices of new gradients sampled at $w_k$; $u_k = |U_k|$
• $S_k$: indices of previously computed gradients that are updated (evaluated at $w_k$); $s_k = |S_k|$
How should $s_k$ and $u_k$ be controlled?
Related work: Frostig et al. 2014; Babanezhad et al. 2015.
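One way to read this update as code, a minimal sketch (the schedule arguments `s_k` and `u_k`, the oracle `grad_fi`, and the dict-based storage are my assumptions, not the talk's implementation):

```python
import numpy as np

def egr_step(w, alpha, grad_fi, store, rng, s_k, u_k, next_index):
    """One EGR step: refresh s_k stored gradients at w, sample u_k new
    gradients at w, and step along the average of all stored gradients.
    `store` maps example index -> last gradient computed for it."""
    # Update s_k previously stored gradients at the current iterate.
    refresh = rng.choice(list(store.keys()), size=min(s_k, len(store)), replace=False)
    for j in refresh:
        store[j] = grad_fi(w, j)
    # Sample u_k brand-new gradients at the current iterate.
    for _ in range(u_k):
        store[next_index] = grad_fi(w, next_index)
        next_index += 1
    # y_k: average over everything in storage (t_k old + u_k new entries).
    y = np.mean(list(store.values()), axis=0)
    return w - alpha * y, next_index
```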

23. Algorithms included in the framework
• Stochastic gradient method: $s_k = 0$, $u_k = 1$
• Dynamic sampling method: $s_k = 0$, $u_k = \mathrm{function}(k)$
• Aggregated gradient: $s_k = \mathrm{constant}$, $u_k = 0$
• EGR lin: $s_k = 0$, $u_k = r$
• EGR quad: $s_k = rk$, $u_k = r$
• EGR exp: $s_k = u_k \approx a^k$
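These variants differ only in their $(s_k, u_k)$ schedules; a small sketch of those schedules (the growth constants `r` and `a` are illustrative defaults):

```python
import math

# Each schedule returns (s_k, u_k) for iteration k >= 1.
schedules = {
    "sgd":              lambda k, r=1, a=1.1: (0, 1),
    "dynamic_sampling": lambda k, r=1, a=1.1: (0, math.ceil(a ** (k - 1))),
    "aggregated":       lambda k, r=1, a=1.1: (r, 0),      # s_k constant
    "egr_lin":          lambda k, r=2, a=1.1: (0, r),      # t_k grows linearly
    "egr_quad":         lambda k, r=2, a=1.1: (r * k, r),  # t_k grows quadratically
    "egr_exp":          lambda k, r=2, a=1.1: (math.ceil(a ** k),) * 2,
}
```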

24. Assumptions:
1) $s_k = u_k = a^k$, $a \in \mathbb{R}_+$ (geometric growth)
2) $F$ strongly convex; the $f_i$ have Lipschitz continuous gradients
3) $\mathrm{tr}(\mathrm{Var}[\nabla f_i(w)]) \le v^2$ for all $w$
Lemma (Lyapunov function):
$$\begin{pmatrix} \mathbb{E}[E_k[e_k]] \\ \mathbb{E}[\|w_k - w^*\|] \\ \sigma_k \end{pmatrix} \le M \begin{pmatrix} \mathbb{E}[E_k[e_{k-1}]] \\ \mathbb{E}[\|w_{k-1} - w^*\|] \\ \sigma_{k-1} \end{pmatrix},$$
where
$$e_k = \frac{1}{t_{k+1}}\sum_{i=1}^{t_{k+1}} \|\nabla f_i(\phi_k^i) - \nabla f_i(w_k)\|, \qquad \sigma_k = v^2 / t_{k+1}.$$

25.
$$M = \begin{pmatrix} \frac{1-\eta}{1+\eta}(1+\alpha L) & \frac{1-\eta}{1+\eta}\,\alpha L & \frac{1-\eta}{1+\eta}\,\alpha \\ \alpha L & 1 - \alpha\mu & \alpha \\ 0 & 0 & \frac{1}{1+\eta} \end{pmatrix}$$
• $\eta$: probability that an old gradient is recomputed
• $\alpha$: steplength
• $L$: Lipschitz constant
• $\mu$: strong convexity parameter
Lemma: For sufficiently small $\alpha$, the spectral radius of $M$ satisfies $\rho(M) < 1$.
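A quick numerical check of this lemma under the matrix as reconstructed above (the parameter values are mine, purely illustrative):

```python
import numpy as np

def spectral_radius_M(alpha, L=10.0, mu=1.0, eta=0.5):
    """Spectral radius of the Lyapunov matrix M for a given steplength alpha."""
    c = (1 - eta) / (1 + eta)
    M = np.array([
        [c * (1 + alpha * L), c * alpha * L,  c * alpha],
        [alpha * L,           1 - alpha * mu, alpha],
        [0.0,                 0.0,            1 / (1 + eta)],
    ])
    return max(abs(np.linalg.eigvals(M)))

# Shrinking alpha should push the spectral radius below 1.
for alpha in (0.1, 0.01, 0.001):
    print(alpha, spectral_radius_M(alpha))
```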

26. Theorem: If $\alpha_k$ is chosen small enough, then
$$\mathbb{E}\|w_k - w^*\| \le c\, \beta^k$$
(R-linear convergence).
SAG special case: $t_k = m$, $u_k = 0$, $s_k = \mathrm{constant}$. This gives a simple proof of R-linear convergence of SAG, but with a larger constant.
