SLIDE 1

A Stochastic Gradient Method with an Exponential Convergence Rate for Finite Training Sets

Nicolas Le Roux (1,2), Mark Schmidt (1) and Francis Bach (1)

(1) Sierra project-team, INRIA - École Normale Supérieure, Paris

(2) Now at Criteo

4/12/12

SLIDE 2

Context : Machine Learning for “Big Data”

Large-scale machine learning : large n, large p

n : number of observations (inputs)
p : number of parameters in the model

SLIDE 3

Context : Machine Learning for “Big Data”

Large-scale machine learning : large n, large p

n : number of observations (inputs)
p : number of parameters in the model

Examples : vision, bioinformatics, speech, language, etc.

Pascal large-scale datasets : n = 5 · 10^5, p = 10^3
ImageNet : n = 10^7
Industrial datasets : n > 10^8, p > 10^7

SLIDE 4

Context : Machine Learning for “Big Data”

Large-scale machine learning : large n, large p

n : number of observations (inputs)
p : number of parameters in the model

Examples : vision, bioinformatics, speech, language, etc.

Pascal large-scale datasets : n = 5 · 10^5, p = 10^3
ImageNet : n = 10^7
Industrial datasets : n > 10^8, p > 10^7

Main computational challenge :

Design algorithms for very large n and p.

SLIDE 5

A standard machine learning optimization problem

We want to minimize the sum of a finite set of smooth functions :

    min_{θ ∈ R^p} g(θ) = (1/n) ∑_{i=1}^n f_i(θ)

SLIDE 6

A standard machine learning optimization problem

We want to minimize the sum of a finite set of smooth functions :

    min_{θ ∈ R^p} g(θ) = (1/n) ∑_{i=1}^n f_i(θ)

For instance, we may have

    f_i(θ) = log(1 + exp(−y_i x_i^⊤ θ)) + (λ/2) ‖θ‖²

SLIDE 7

A standard machine learning optimization problem

We want to minimize the sum of a finite set of smooth functions :

    min_{θ ∈ R^p} g(θ) = (1/n) ∑_{i=1}^n f_i(θ)

For instance, we may have

    f_i(θ) = log(1 + exp(−y_i x_i^⊤ θ)) + (λ/2) ‖θ‖²

We will focus on strongly-convex functions g.
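As a concrete illustration of this objective (not part of the original slides), here is a minimal NumPy sketch of the ℓ2-regularized logistic loss and its average over the training set; the names X, y, lam and theta are illustrative:

```python
import numpy as np

def f_i(theta, x_i, y_i, lam):
    """Loss of one example: log(1 + exp(-y_i * x_i^T theta)) + (lam/2) * ||theta||^2."""
    return np.log1p(np.exp(-y_i * x_i.dot(theta))) + 0.5 * lam * theta.dot(theta)

def g(theta, X, y, lam):
    """g(theta) = (1/n) * sum_i f_i(theta), the finite sum we want to minimize."""
    n = X.shape[0]
    return np.mean([f_i(theta, X[i], y[i], lam) for i in range(n)])
```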

SLIDE 8

Deterministic methods

    min_{θ ∈ R^p} g(θ) = (1/n) ∑_{i=1}^n f_i(θ)

Gradient descent updates :

    θ_{k+1} = θ_k − α_k g′(θ_k) = θ_k − (α_k/n) ∑_{i=1}^n f_i′(θ_k)

Iteration cost in O(n)
Linear convergence rate O(C^k)
Fancier methods exist but still in O(n)
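A minimal sketch (not from the slides) of one full-gradient step for the logistic objective above; it touches all n examples, hence the O(n) iteration cost:

```python
import numpy as np

def gradient_descent_step(theta, X, y, lam, alpha):
    """One deterministic step: theta <- theta - alpha * g'(theta), cost O(n p)."""
    n = X.shape[0]
    residual = -y / (1.0 + np.exp(y * X.dot(theta)))   # derivative of log(1 + exp(-y z)) w.r.t. z
    grad = X.T.dot(residual) / n + lam * theta          # average loss gradient + regularizer
    return theta - alpha * grad
```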

SLIDE 9

Stochastic methods

    min_{θ ∈ R^p} g(θ) = (1/n) ∑_{i=1}^n f_i(θ)

Stochastic gradient descent updates :

    i(k) ∼ U{1, . . . , n}
    θ_{k+1} = θ_k − α_k f_{i(k)}′(θ_k)

Iteration cost in O(1)
Sublinear convergence rate O(1/k)
Bound on the test error valid for one pass
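For comparison, a sketch of one stochastic gradient step (illustrative, not from the slides); only one randomly chosen example is touched, so the per-iteration cost does not depend on n:

```python
import numpy as np

def sgd_step(theta, X, y, lam, alpha, rng):
    """One stochastic step: theta <- theta - alpha * f'_{i(k)}(theta), cost O(p)."""
    i = rng.integers(X.shape[0])                                  # i(k) ~ U{1, ..., n}
    residual = -y[i] / (1.0 + np.exp(y[i] * X[i].dot(theta)))
    grad_i = residual * X[i] + lam * theta
    return theta - alpha * grad_i
```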

SLIDE 10

Hybrid methods

[Figure : log(excess cost) versus time for the stochastic and deterministic methods]

SLIDE 11

Hybrid methods

Goal = linear rate and O(1) iteration cost.

[Figure : log(excess cost) versus time for the stochastic, deterministic and hybrid methods]

SLIDE 12

Related work - Sublinear convergence rate

Stochastic version of full gradient methods

Schraudolph (1999), Sunehag et al. (2009), Ghadimi and Lan (2010), Martens (2010), Xiao (2010)

Momentum, gradient/iterate averaging

Polyak and Juditsky (1992), Tseng (1998), Nesterov (2009), Xiao (2010), Kushner and Yin (2003), Hazan and Kale (2011), Rakhlin et al. (2012)

None of these methods improve on the O(1/k) rate

SLIDE 13

Related work - Linear convergence rate

Constant step-size SG, accelerated SG

Kesten (1958), Delyon and Juditsky (1993), Nedic and Bertsekas (2000)

Linear convergence but only up to a fixed tolerance

Hybrid methods, incremental average gradient

Bertsekas (1997), Blatt et al. (2007), Friedlander and Schmidt (2012)

Linear rate but iterations make full passes through the data

Stochastic methods in the dual

Shalev-Shwartz and Zhang (2012)

Linear rate but limited choice for the f_i’s

SLIDE 14

Stochastic Average Gradient Method

Full gradient update :

    θ_{k+1} = θ_k − (α_k/n) ∑_{i=1}^n f_i′(θ_k)

SLIDE 15

Stochastic Average Gradient Method

Full gradient update :

    θ_{k+1} = θ_k − (α_k/n) ∑_{i=1}^n f_i′(θ_k)

SLIDE 16

Stochastic Average Gradient Method

Stochastic average gradient update :

    θ_{k+1} = θ_k − (α_k/n) ∑_{i=1}^n y_i^k

Memory : y_i^k = f_i′(θ_{k′}) from the last k′ where i was selected.

Random selection of i(k) from {1, 2, . . . , n}.
Only evaluates f_{i(k)}′(θ_k) on each iteration.

SLIDE 17

Stochastic Average Gradient Method

Stochastic average gradient update :

    θ_{k+1} = θ_k − (α_k/n) ∑_{i=1}^n y_i^k

Memory : y_i^k = f_i′(θ_{k′}) from the last k′ where i was selected.

Random selection of i(k) from {1, 2, . . . , n}.
Only evaluates f_{i(k)}′(θ_k) on each iteration.

Stochastic variant of incremental average gradient [Blatt et al., 2007]
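A minimal sketch of the SAG update described above (illustrative code, not from the slides; it reuses the logistic-loss gradient from the earlier sketches). The per-example gradients y_i are kept in memory and their running sum is maintained incrementally, so each iteration evaluates only one new gradient:

```python
import numpy as np

def sag(X, y, lam, alpha, n_iters, rng):
    """Stochastic Average Gradient on the l2-regularized logistic loss (sketch)."""
    n, p = X.shape
    theta = np.zeros(p)
    memory = np.zeros((n, p))   # y_i^k : last gradient computed for each example (O(np) memory)
    grad_sum = np.zeros(p)      # running value of sum_i y_i^k, so each update stays O(p)
    for _ in range(n_iters):
        i = rng.integers(n)                                           # i(k) ~ U{1, ..., n}
        residual = -y[i] / (1.0 + np.exp(y[i] * X[i].dot(theta)))
        grad_i = residual * X[i] + lam * theta                        # f'_{i(k)}(theta_k)
        grad_sum += grad_i - memory[i]                                # swap the old y_i for the new one
        memory[i] = grad_i
        theta = theta - (alpha / n) * grad_sum
    return theta
```

Maintaining grad_sum incrementally is what keeps the iteration cost at O(p) rather than O(np), even though the update formally averages all n stored gradients.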

SLIDE 18

SAG convergence analysis

Assume each f_i′ is Lipschitz-continuous with constant L and the average g is µ-strongly convex.

With step size α_k = 1/(2nL), SAG has a linear convergence rate.

Linear convergence with iteration cost independent of n.

SLIDE 19

SAG convergence analysis

Assume each f_i′ is Lipschitz-continuous with constant L and the average g is µ-strongly convex.

With step size α_k = 1/(2nL), SAG has a linear convergence rate.

Linear convergence with iteration cost independent of n.

With step size α_k = 1/(2nµ), if n ≥ 8 L/µ then

    E[g(θ_k) − g(θ*)] ≤ C (1 − 1/(8n))^k

Rate is “independent” of the condition number.

Constant error reduction after each pass :

    (1 − 1/(8n))^n ≤ exp(−1/8) ≈ 0.8825
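A short illustration (assumptions noted in the comments, not from the slides) of how these two step sizes could be instantiated for the ℓ2-regularized logistic loss used in the earlier sketches:

```python
import numpy as np

# For that logistic loss, each f_i' is Lipschitz with constant L_i <= ||x_i||^2 / 4 + lam,
# and g is mu-strongly convex with mu >= lam; X and lam are the illustrative names used above.
L_const = 0.25 * np.max(np.sum(X ** 2, axis=1)) + lam
mu_const = lam
n = X.shape[0]

alpha_lipschitz = 1.0 / (2 * n * L_const)   # alpha_k = 1/(2nL): linear rate in all cases
alpha_strong = 1.0 / (2 * n * mu_const)     # alpha_k = 1/(2nmu): used when n >= 8 L/mu
```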

SLIDE 20

Comparison with full gradient methods

Assume L = 100, µ = 0.01 and n = 80000 :

Full gradient has rate (1 − µ/L)^2 = 0.9998
Accelerated gradient has rate (1 − √(µ/L)) = 0.9900
SAG (n iterations) multiplies the error by (1 − 1/(8n))^n = 0.8825
Fastest possible first-order method has rate ((√L − √µ)/(√L + √µ))^2 = 0.9608

SLIDE 21

Comparison with full gradient methods

Assume L = 100, µ = 0.01 and n = 80000 :

Full gradient has rate (1 − µ/L)^2 = 0.9998
Accelerated gradient has rate (1 − √(µ/L)) = 0.9900
SAG (n iterations) multiplies the error by (1 − 1/(8n))^n = 0.8825
Fastest possible first-order method has rate ((√L − √µ)/(√L + √µ))^2 = 0.9608

We beat two lower bounds (with additional assumptions) :

Stochastic gradient bound
Full gradient bound
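A quick numerical check of these constants (illustrative, not from the slides):

```python
import numpy as np

L, mu, n = 100.0, 0.01, 80000
print((1 - mu / L) ** 2)                    # full gradient           ~ 0.9998
print(1 - np.sqrt(mu / L))                  # accelerated gradient    = 0.9900
print((1 - 1 / (8 * n)) ** n)               # SAG, one pass           ~ 0.8825
print(((np.sqrt(L) - np.sqrt(mu)) /
       (np.sqrt(L) + np.sqrt(mu))) ** 2)    # first-order lower bound ~ 0.9608
```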

SLIDE 22

Experiments - Training cost

Quantum dataset (n = 50000, p = 78)
ℓ2-regularized logistic regression

[Figure : objective minus optimum (log scale) versus effective passes, comparing AFG, L-BFGS, pegasos, SAG-C and SAG-LS]

SLIDE 23

Experiments - Training cost

RCV1 dataset (n = 20242, p = 47236)
ℓ2-regularized logistic regression

[Figure : objective minus optimum (log scale) versus effective passes, comparing AFG, L-BFGS, pegasos, SAG-C and SAG-LS]

SLIDE 24

Experiments - Testing cost

Quantum dataset (n = 50000, p = 78)
ℓ2-regularized logistic regression

[Figure : test logistic loss versus effective passes, comparing AFG, L-BFGS, pegasos, SAG-C and SAG-LS]

SLIDE 25

Experiments - Testing cost

RCV1 dataset (n = 20242, p = 47236)
ℓ2-regularized logistic regression

[Figure : test logistic loss versus effective passes, comparing AFG, L-BFGS, pegasos, SAG-C and SAG-LS]

SLIDE 26

Reducing memory requirements

    θ_{k+1} = θ_k − (α_k/n) ∑_{i=1}^n y_i^k

y_i^k is the last gradient computed on datapoint i

Memory requirement : O(np)

Smaller for structured models, e.g., linear models :

If f_i(θ) = ℓ(y_i, x_i^⊤ θ), then f_i′(θ) = ℓ′(y_i, x_i^⊤ θ) x_i

Memory requirement : O(n)

We can also use mini-batches
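A sketch of the O(n)-memory variant for linear models (illustrative, building on the SAG sketch above): since the stored gradient of f_i(θ) = ℓ(y_i, x_i^⊤θ) is always a scalar times x_i, only the scalar ℓ′(y_i, x_i^⊤θ) needs to be kept per example:

```python
import numpy as np

def sag_linear(X, y, lam, alpha, n_iters, rng):
    """SAG for a linear model: store one scalar per example (O(n)) instead of one gradient (O(np))."""
    n, p = X.shape
    theta = np.zeros(p)
    scalars = np.zeros(n)        # stored l'(y_i, x_i^T theta) for each example
    weighted_sum = np.zeros(p)   # running value of sum_i scalars[i] * X[i]
    for _ in range(n_iters):
        i = rng.integers(n)
        new_scalar = -y[i] / (1.0 + np.exp(y[i] * X[i].dot(theta)))  # l' for the logistic loss
        weighted_sum += (new_scalar - scalars[i]) * X[i]
        scalars[i] = new_scalar
        # design choice in this sketch: the regularizer uses the current theta rather than being stored
        theta = theta - alpha * (weighted_sum / n + lam * theta)
    return theta
```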

SLIDE 27

Conclusion and Open Problems

Fast theoretical convergence using the ‘sum’ structure common in applications.

SLIDE 28

Conclusion and Open Problems

Fast theoretical convergence using the ‘sum’ structure common in applications.
Simple algorithm, empirically better than theory predicts.

SLIDE 29

Conclusion and Open Problems

Fast theoretical convergence using the ‘sum’ structure common in applications.
Simple algorithm, empirically better than theory predicts.
Allows line-search and approximate optimality measures.

SLIDE 30

Conclusion and Open Problems

Fast theoretical convergence using the ‘sum’ structure common in applications.
Simple algorithm, empirically better than theory predicts.
Allows line-search and approximate optimality measures.

Open problems :

Large-scale distributed implementation.
Determine a tight convergence rate in all cases.
Apply the method to constrained and non-smooth problems.
Speed up the method using non-uniform sampling and non-Euclidean metric.
