
SLIDE 1

Stochastic gradient methods for machine learning

Francis Bach, INRIA - École Normale Supérieure, Paris, France. Joint work with Eric Moulines, Nicolas Le Roux and Mark Schmidt - January 2013


SLIDE 3

Context - Machine learning for “big data”

  • Large-scale machine learning: large p, large n, large k
    – p: dimension of each observation (input)
    – k: number of tasks (dimension of outputs)
    – n: number of observations
  • Examples: computer vision, bioinformatics, signal processing
  • Ideal running-time complexity: O(pn + kn)
  • Going back to simple methods
    – Stochastic gradient methods (Robbins and Monro, 1951)
    – Mixing statistics and optimization
    – Is it possible to improve on the sublinear convergence rate?

SLIDE 4

Outline

  • Introduction
    – Supervised machine learning and convex optimization
    – Beyond the separation of statistics and optimization
  • Stochastic approximation algorithms (Bach and Moulines, 2011)
    – Stochastic gradient and averaging
    – Strongly convex vs. non-strongly convex
  • Going beyond stochastic gradient (Le Roux, Schmidt, and Bach, 2012)
    – More than a single pass through the data
    – Linear (exponential) convergence rate for strongly convex functions


SLIDE 7

Supervised machine learning

  • Data: n observations (xᵢ, yᵢ) ∈ X × Y, i = 1, …, n, i.i.d.
  • Prediction as a linear function θ⊤Φ(x) of features Φ(x) ∈ F = ℝᵖ
  • (Regularized) empirical risk minimization: find θ̂ solution of
    min_{θ∈F} (1/n) Σᵢ₌₁ⁿ ℓ(yᵢ, θ⊤Φ(xᵢ)) + µΩ(θ)   (convex data-fitting term + regularizer)
  • Empirical risk: f̂(θ) = (1/n) Σᵢ₌₁ⁿ ℓ(yᵢ, θ⊤Φ(xᵢ)) (training cost)
  • Expected risk: f(θ) = E₍ₓ,ᵧ₎ ℓ(y, θ⊤Φ(x)) (testing cost)
  • Two fundamental questions: (1) computing θ̂ and (2) analyzing θ̂
    – May be tackled simultaneously
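The training cost above can be sketched in a few lines of code: a minimal illustration with the logistic loss ℓ(y, θ⊤Φ(x)) = log(1 + exp(−y θ⊤Φ(x))), identity features Φ(x) = x, and the ℓ2 regularizer µΩ(θ) = (µ/2)‖θ‖². The synthetic dataset and the value µ = 0.1 are illustrative assumptions, not from the talk.

```python
import math
import random

def logistic_loss(y, score):
    return math.log(1.0 + math.exp(-y * score))

def empirical_risk(theta, data, mu):
    """Training cost: (1/n) Σ ℓ(yᵢ, θᵀxᵢ) + (µ/2)‖θ‖²."""
    n = len(data)
    fit = sum(logistic_loss(y, sum(t * xi for t, xi in zip(theta, x)))
              for x, y in data) / n
    reg = 0.5 * mu * sum(t * t for t in theta)
    return fit + reg

random.seed(0)
data = []
for _ in range(200):
    x = (random.uniform(-1, 1), random.uniform(-1, 1))
    data.append((x, 1.0 if x[0] >= 0 else -1.0))  # labels y = sign(x1)

print(empirical_risk((0.0, 0.0), data, mu=0.1))  # exactly log 2 at θ = 0
print(empirical_risk((2.0, 0.0), data, mu=0.1))  # lower: θ aligned with the labels
```

Its empirical value on a fresh sample from the same distribution would estimate the expected risk (testing cost) instead.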


SLIDE 9

Smoothness and strong convexity

  • A function g : ℝᵖ → ℝ is L-smooth if and only if it is differentiable and its gradient is L-Lipschitz-continuous:
    ∀θ₁, θ₂ ∈ ℝᵖ, ‖g′(θ₁) − g′(θ₂)‖ ≤ L‖θ₁ − θ₂‖
  • If g is twice differentiable: ∀θ ∈ ℝᵖ, g′′(θ) ≼ L · Id
  • Machine learning
    – with g(θ) = (1/n) Σᵢ₌₁ⁿ ℓ(yᵢ, θ⊤Φ(xᵢ))
    – Hessian ≈ covariance matrix (1/n) Σᵢ₌₁ⁿ Φ(xᵢ)Φ(xᵢ)⊤
    – Bounded data
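The "Hessian ≈ covariance matrix" point can be checked numerically: for least-squares, the largest eigenvalue of the empirical covariance gives a smoothness constant L, bounded by maxᵢ ‖Φ(xᵢ)‖² under bounded data. This 2-D sketch with a hand-rolled symmetric 2×2 eigenvalue formula uses synthetic data; all constants are illustrative.

```python
import random

random.seed(1)
xs = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(500)]

n = len(xs)
a = sum(x[0] * x[0] for x in xs) / n   # H[0,0]
b = sum(x[0] * x[1] for x in xs) / n   # H[0,1] = H[1,0]
d = sum(x[1] * x[1] for x in xs) / n   # H[1,1]

# eigenvalues of the symmetric 2x2 matrix [[a, b], [b, d]]
disc = ((a - d) / 2) ** 2 + b * b
lam_max = (a + d) / 2 + disc ** 0.5
lam_min = (a + d) / 2 - disc ** 0.5

bound = max(x[0] ** 2 + x[1] ** 2 for x in xs)
print(lam_max <= bound)  # smoothness constant bounded by the data norm
print(lam_min > 0)       # invertible covariance, relevant for strong convexity
```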


SLIDE 11

Smoothness and strong convexity

  • A function g : ℝᵖ → ℝ is µ-strongly convex if and only if
    ∀θ₁, θ₂ ∈ ℝᵖ, g(θ₁) ≥ g(θ₂) + ⟨g′(θ₂), θ₁ − θ₂⟩ + (µ/2)‖θ₁ − θ₂‖²
  • Equivalent definition: θ ↦ g(θ) − (µ/2)‖θ‖² is convex
  • If g is twice differentiable: ∀θ ∈ ℝᵖ, g′′(θ) ≽ µ · Id
  • Machine learning
    – with g(θ) = (1/n) Σᵢ₌₁ⁿ ℓ(yᵢ, θ⊤Φ(xᵢ))
    – Hessian ≈ covariance matrix (1/n) Σᵢ₌₁ⁿ Φ(xᵢ)Φ(xᵢ)⊤
    – Data with invertible covariance matrix (low correlation/dimension)
    – … or with added regularization by (µ/2)‖θ‖²


SLIDE 13

Stochastic approximation

  • Goal: minimizing a function f defined on a Hilbert space H, given only unbiased estimates f′ₙ(θₙ) of its gradients f′(θₙ) at certain points θₙ ∈ H
  • Stochastic approximation
    – Observation of f′ₙ(θₙ) = f′(θₙ) + εₙ, with εₙ = i.i.d. noise
    – Non-convex problems
  • Machine learning - statistics
    – Loss for a single pair of observations: fₙ(θ) = ℓ(yₙ, θ⊤Φ(xₙ))
    – f(θ) = E fₙ(θ) = E ℓ(yₙ, θ⊤Φ(xₙ)) = generalization error
    – Expected gradient: f′(θ) = E f′ₙ(θ) = E[ℓ′(yₙ, θ⊤Φ(xₙ)) Φ(xₙ)]
SLIDE 16

Convex smooth stochastic approximation

  • Key properties of f and/or fₙ
    – Smoothness: fₙ L-smooth
    – Strong convexity: f µ-strongly convex
  • Key algorithm: stochastic gradient descent (a.k.a. Robbins-Monro)
    θₙ = θₙ₋₁ − γₙ f′ₙ(θₙ₋₁)
    – Polyak-Ruppert averaging: θ̄ₙ = (1/n) Σₖ₌₀ⁿ⁻¹ θₖ
    – Which learning-rate sequence γₙ? Classical setting: γₙ = C n^(−α)
  • Desirable practical behavior
    – Applicable (at least) to least-squares and logistic regression
    – Robustness to (potentially unknown) constants (L, µ)
    – Adaptivity to difficulty of the problem (e.g., strong convexity)
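The recursion θₙ = θₙ₋₁ − γₙ f′ₙ(θₙ₋₁) with Polyak-Ruppert averaging can be sketched on a 1-D least-squares stream yₙ = θ* xₙ + noise, with γₙ = C n^(−1/2). The constants (C, the noise level, θ* = 2) are illustrative choices, not from the talk.

```python
import random

random.seed(42)
theta_star = 2.0
C = 0.5

theta = 0.0
avg = 0.0
for n in range(1, 20001):
    x = random.uniform(-1.0, 1.0)
    y = theta_star * x + 0.1 * random.gauss(0.0, 1.0)
    grad = (theta * x - y) * x        # f'ₙ(θ) for the loss ½(y − θx)²
    theta -= C * n ** -0.5 * grad     # SGD step, γₙ = C n^(−1/2)
    avg += (theta - avg) / n          # running Polyak-Ruppert average

print(round(avg, 2))  # close to θ* = 2
```

The averaged iterate smooths out the oscillations of the raw iterate, which is the behavior the analysis below quantifies.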

SLIDE 17

Convex stochastic approximation - Related work

  • Machine learning/optimization
    – Known minimax rates of convergence (Nemirovski and Yudin, 1983; Agarwal et al., 2010)
    – Strongly convex: O(n⁻¹)
    – Non-strongly convex: O(n^(−1/2))
    – Achieved with and/or without averaging (up to log terms)
    – Non-asymptotic analysis (high-probability bounds)
    – Online setting and regret bounds
    – Bottou and Le Cun (2005); Bottou and Bousquet (2008); Hazan et al. (2007); Shalev-Shwartz and Srebro (2008); Shalev-Shwartz et al. (2007, 2009); Xiao (2010); Duchi and Singer (2009)
    – Nesterov and Vial (2008); Nemirovski et al. (2009)

SLIDE 18

Convex stochastic approximation - Related work

  • Stochastic approximation
    – Asymptotic analysis
    – Non-convex case with strong convexity around the optimum
    – γₙ = C n^(−α) with α = 1 is not robust to the choice of C
    – α ∈ (1/2, 1) is robust with averaging
    – Broadie et al. (2009); Kushner and Yin (2003); Kul′chitskiĭ and Mozgovoĭ (1991); Fabian (1968)
    – Polyak and Juditsky (1992); Ruppert (1988)

SLIDE 19

Problem set-up - General assumptions

  • Unbiased gradient estimates:
    – fₙ(θ) is of the form h(zₙ, θ), where zₙ is an i.i.d. sequence
    – e.g., fₙ(θ) = h(zₙ, θ) = ℓ(yₙ, θ⊤Φ(xₙ)) with zₙ = (xₙ, yₙ)
    – NB: can be generalized
  • Variance of estimates: there exists σ² ≥ 0 such that for all n ≥ 1,
    E(‖f′ₙ(θ∗) − f′(θ∗)‖²) ≤ σ², where θ∗ is a global minimizer of f


SLIDE 21

Problem set-up - Smoothness/convexity assumptions

  • Smoothness of fₙ: for each n ≥ 1, the function fₙ is a.s. convex and differentiable with L-Lipschitz-continuous gradient f′ₙ
    – Bounded data
  • Strong convexity of f: the function f is strongly convex with respect to the norm ‖·‖, with convexity constant µ > 0
    – Invertible population covariance matrix
    – or regularization by (µ/2)‖θ‖²


SLIDE 23

Summary of new results (Bach and Moulines, 2011)

  • Stochastic gradient descent with learning rate γₙ = C n^(−α)
  • Strongly convex smooth objective functions
    – Old: O(n⁻¹) rate achieved without averaging for α = 1
    – New: O(n⁻¹) rate achieved with averaging for α ∈ [1/2, 1]
    – Non-asymptotic analysis with explicit constants
    – Forgetting of initial conditions
    – Robustness to the choice of C
  • Proof technique
    – Derive a deterministic recursion for δₙ = E‖θₙ − θ∗‖²:
      δₙ ≤ (1 − 2µγₙ + 2L²γₙ²) δₙ₋₁ + 2σ²γₙ²
    – Mimic SA proof techniques in a non-asymptotic way
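The deterministic recursion above can be iterated numerically to see its qualitative behavior: the bound first forgets the initial condition δ₀, then tracks a shrinking noise floor of order σ²γₙ/µ. All constants below (µ, L, σ, C, α, δ₀) are arbitrary illustrative values, not taken from the paper.

```python
# δₙ ≤ (1 − 2µγₙ + 2L²γₙ²) δₙ₋₁ + 2σ²γₙ²,  with γₙ = C n^(−α)
mu, L, sigma = 1.0, 1.0, 1.0
C, alpha = 0.1, 0.5          # γₙ = C n^(−1/2)
delta = 10.0                 # δ₀, a bound on ‖θ₀ − θ∗‖²

history = [delta]
for n in range(1, 10001):
    gamma = C * n ** -alpha
    delta = (1 - 2 * mu * gamma + 2 * L ** 2 * gamma ** 2) * delta \
            + 2 * sigma ** 2 * gamma ** 2
    history.append(delta)

print(history[10], history[100], history[10000])  # a steadily shrinking bound
```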

SLIDE 24

Summary of new results (Bach and Moulines, 2011)

  • Stochastic gradient descent with learning rate γₙ = C n^(−α)
  • Strongly convex smooth objective functions
    – Old: O(n⁻¹) rate achieved without averaging for α = 1
    – New: O(n⁻¹) rate achieved with averaging for α ∈ [1/2, 1]
    – Non-asymptotic analysis with explicit constants
    – Forgetting of initial conditions
    – Robustness to the choice of C
  • Convergence rates for E‖θₙ − θ∗‖² and E‖θ̄ₙ − θ∗‖²
    – No averaging: O(σ²γₙ/µ) + O(e^(−µnγₙ)) ‖θ₀ − θ∗‖²
    – Averaging: tr H(θ∗)⁻¹/n + µ⁻¹ O(n^(−2α) + n^(−2+α)) + O(‖θ₀ − θ∗‖²/(µ²n²))


SLIDE 26

Summary of new results (Bach and Moulines, 2011)

  • Stochastic gradient descent with learning rate γₙ = C n^(−α)
  • Strongly convex smooth objective functions
    – Old: O(n⁻¹) rate achieved without averaging for α = 1
    – New: O(n⁻¹) rate achieved with averaging for α ∈ [1/2, 1]
    – Non-asymptotic analysis with explicit constants
  • Non-strongly convex smooth objective functions
    – Old: O(n^(−1/2)) rate achieved with averaging for α = 1/2
    – New: O(max{n^(1/2−3α/2), n^(−α/2), n^(α−1)}) rate achieved without averaging for α ∈ [1/3, 1]
  • Take-home message
    – Use α = 1/2 with averaging to be adaptive to strong convexity
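The take-home message can be sketched on a 1-D least-squares stream yₙ = θ* xₙ + noise: γₙ = C/n without averaging is fragile when C is too small, while γₙ = C n^(−1/2) with averaging stays accurate. All constants here are illustrative choices, not from the talk.

```python
import random

def run(alpha, C, average, steps=5000, theta_star=2.0):
    random.seed(7)
    theta, avg = 0.0, 0.0
    for n in range(1, steps + 1):
        x = random.uniform(-1.0, 1.0)
        y = theta_star * x + 0.1 * random.gauss(0.0, 1.0)
        theta -= C * n ** -alpha * (theta * x - y) * x   # SGD step
        avg += (theta - avg) / n                         # running average
    return avg if average else theta

err_bad = abs(run(alpha=1.0, C=0.1, average=False) - 2.0)  # α = 1, C too small
err_good = abs(run(alpha=0.5, C=0.5, average=True) - 2.0)  # α = 1/2 + averaging
print(err_bad > err_good)  # the averaged α = 1/2 run is far more accurate here
```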


SLIDE 28

Conclusions / Extensions - Stochastic approximation for machine learning

  • Mixing convex optimization and statistics
    – Non-asymptotic analysis through moment computations
    – Averaging with longer steps is (more) robust and adaptive
  • Future/current work - open problems
    – High-probability bounds through all moments E‖θₙ − θ∗‖^(2d)
    – Analysis for logistic regression using self-concordance (Bach, 2010)
    – Including a non-differentiable term (Xiao, 2010; Lan, 2010)
    – Non-random errors (Schmidt, Le Roux, and Bach, 2011)
    – Line search for stochastic gradient
    – Non-parametric stochastic approximation
    – Online estimation of uncertainty
    – Going beyond a single pass through the data


SLIDE 30

Going beyond a single pass over the data

  • Stochastic approximation
    – Assumes an infinite data stream
    – Observations are used only once
    – Directly minimizes the testing cost E_z h(θ, z) = E₍ₓ,ᵧ₎ ℓ(y, θ⊤Φ(x))
  • Machine learning practice
    – Finite data set (z₁, …, zₙ)
    – Multiple passes
    – Minimizes the training cost (1/n) Σᵢ₌₁ⁿ h(θ, zᵢ) = (1/n) Σᵢ₌₁ⁿ ℓ(yᵢ, θ⊤Φ(xᵢ))
    – Need to regularize (e.g., by the ℓ2-norm) to avoid overfitting



SLIDE 33

Stochastic vs. deterministic methods

  • Minimizing g(θ) = (1/n) Σᵢ₌₁ⁿ fᵢ(θ) with fᵢ(θ) = ℓ(yᵢ, θ⊤Φ(xᵢ)) + µΩ(θ)
  • Batch gradient descent: θₜ = θₜ₋₁ − γₜ g′(θₜ₋₁) = θₜ₋₁ − (γₜ/n) Σᵢ₌₁ⁿ f′ᵢ(θₜ₋₁)
    – Linear (e.g., exponential) convergence rate
    – Iteration complexity is linear in n
  • Stochastic gradient descent: θₜ = θₜ₋₁ − γₜ f′ᵢ₍ₜ₎(θₜ₋₁)
    – Sampling with replacement: i(t) random element of {1, …, n}
    – Convergence rate in O(1/t)
    – Iteration complexity is independent of n
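The two iterations above can be contrasted on a tiny instance of regularized 1-D least squares, g(θ) = (1/n) Σ ½(yᵢ − θxᵢ)² + (µ/2)θ². The dataset and all constants are illustrative.

```python
import random

random.seed(3)
mu = 0.1
data = []
for _ in range(100):
    x = random.uniform(-1, 1)
    data.append((x, 2.0 * x + 0.1 * random.gauss(0, 1)))

def full_grad(theta):
    # g'(θ): one full pass over the n data points per evaluation
    return sum((theta * x - y) * x for x, y in data) / len(data) + mu * theta

# Batch gradient descent: O(n) cost per iteration, linear rate
theta_b = 0.0
for t in range(200):
    theta_b -= 0.5 * full_grad(theta_b)

# Stochastic gradient descent: O(1) cost per iteration, O(1/t) rate
theta_s = 0.0
for t in range(1, 201):
    x, y = random.choice(data)                 # i(t): with replacement
    theta_s -= (1.0 / t) * ((theta_s * x - y) * x + mu * theta_s)

print(abs(full_grad(theta_b)))  # essentially zero after 200 full passes
print(abs(full_grad(theta_s)))  # still rough after 200 single-sample steps
```

For the same number of iterations the batch method is far more accurate, but each of its iterations costs n gradient evaluations rather than one.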



SLIDE 36

Stochastic vs. deterministic methods

  • Goal = best of both worlds: linear rate with O(1) iteration cost

[Figure: log(excess cost) vs. time for stochastic, deterministic, and hybrid methods]

SLIDE 37

Accelerating gradient methods - Related work

  • Nesterov acceleration
    – Nesterov (1983, 2004)
    – Better linear rate, but still O(n) iteration cost
  • Hybrid methods, incremental average gradient, increasing batch size
    – Bertsekas (1997); Blatt et al. (2008); Friedlander and Schmidt (2011)
    – Linear rate, but iterations make full passes through the data

SLIDE 38

Accelerating gradient methods - Related work

  • Momentum, gradient/iterate averaging, stochastic versions of accelerated batch gradient methods
    – Polyak and Juditsky (1992); Tseng (1998); Sunehag et al. (2009); Ghadimi and Lan (2010); Xiao (2010)
    – Can improve constants, but still have a sublinear O(1/t) rate
  • Constant step-size stochastic gradient (SG), accelerated SG
    – Kesten (1958); Delyon and Juditsky (1993); Solodov (1998); Nedic and Bertsekas (2000)
    – Linear convergence, but only up to a fixed tolerance
  • Stochastic methods in the dual
    – Shalev-Shwartz and Zhang (2012)
    – Linear rate, but limited choice for the fᵢ's

SLIDE 40

Stochastic average gradient (Le Roux, Schmidt, and Bach, 2012)

  • Stochastic average gradient (SAG) iteration
    – Keep in memory the gradients of all functions fᵢ, i = 1, …, n
    – Random selection i(t) ∈ {1, …, n} with replacement
    – Iteration: θₜ = θₜ₋₁ − (γₜ/n) Σᵢ₌₁ⁿ yᵢᵗ with yᵢᵗ = f′ᵢ(θₜ₋₁) if i = i(t), yᵢᵗ = yᵢᵗ⁻¹ otherwise
  • Stochastic version of the incremental average gradient (Blatt et al., 2008)
  • Extra memory requirement
    – Supervised machine learning
    – If fᵢ(θ) = ℓᵢ(yᵢ, Φ(xᵢ)⊤θ), then f′ᵢ(θ) = ℓ′ᵢ(yᵢ, Φ(xᵢ)⊤θ) Φ(xᵢ)
    – Only need to store n real numbers
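The SAG iteration above can be sketched directly: keep the latest gradient yᵢᵗ of every fᵢ in memory, refresh only the sampled one, and step along the average, which can be maintained incrementally in O(1) per iteration. Here fᵢ(θ) = ½(yᵢ − θxᵢ)² + (µ/2)θ² on 1-D illustrative data; the constant step size below is a heuristic choice, not the 1/(2nL) of the analysis.

```python
import random

random.seed(5)
mu = 0.1
data = []
for _ in range(100):
    x = random.uniform(-1, 1)
    data.append((x, 2.0 * x + 0.1 * random.gauss(0, 1)))

n = len(data)
theta = 0.0
mem = [0.0] * n        # yᵢᵗ: last seen gradient of each fᵢ
grad_sum = 0.0         # Σᵢ yᵢᵗ, maintained incrementally in O(1)
gamma = 0.05

for t in range(50 * n):                       # 50 effective passes
    i = random.randrange(n)                   # i(t), with replacement
    x, y = data[i]
    g = (theta * x - y) * x + mu * theta      # f'ᵢ(θₜ₋₁)
    grad_sum += g - mem[i]                    # swap old gradient for new
    mem[i] = g
    theta -= gamma * grad_sum / n             # θₜ = θₜ₋₁ − (γ/n) Σᵢ yᵢᵗ

def full_grad(th):
    return sum((th * x - y) * x + mu * th for x, y in data) / n

print(abs(full_grad(theta)))  # small: near the regularized optimum
```

For a linear model, storing the scalar ℓ′ᵢ per example recovers each yᵢᵗ from Φ(xᵢ), which is the "n real numbers" remark above.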

SLIDE 41

Stochastic average gradient - Convergence analysis I

  • Assume each fᵢ is L-smooth and f̂ = (1/n) Σᵢ₌₁ⁿ fᵢ is µ-strongly convex
  • Constant step size γₜ = 1/(2nL):
    E‖θₜ − θ∗‖² ≤ (1 − µ/(8Ln))ᵗ [3‖θ₀ − θ∗‖² + 9σ²/(4L²)]
    – Linear rate with iteration cost independent of n …
    – … but the same behavior as batch gradient and IAG (cyclic version)
  • Proof technique
    – Designing a quadratic Lyapunov function for an n-th order non-linear stochastic dynamical system

SLIDE 42

Stochastic average gradient - Convergence analysis II

  • Assume each fᵢ is L-smooth and f̂ = (1/n) Σᵢ₌₁ⁿ fᵢ is µ-strongly convex
  • Constant step size γₜ = 1/(2nµ), if µ ≥ 8L/n:
    E[f̂(θₜ) − f̂(θ∗)] ≤ C (1 − 1/(8n))ᵗ with C = (16L/(3n)) ‖θ₀ − θ∗‖² + (4σ²/(3nµ)) (8 log(1 + µn/(4L)) + 1)
    – Linear rate with iteration cost independent of n
    – Linear convergence rate “independent” of the condition number
    – After each pass through the data, constant error reduction

SLIDE 43

Rate of convergence comparison

  • Assume that L = 100, µ = 0.01, and n = 80000
    – Full gradient method has rate (1 − µ/L) = 0.9999
    – Accelerated gradient method has rate (1 − √(µ/L)) = 0.9900
    – Running n iterations of SAG for the same cost has rate (1 − 1/(8n))ⁿ = 0.8825
    – Fastest possible first-order method has rate ((√L − √µ)/(√L + √µ))² = 0.9608
  • Beating two lower bounds (with additional assumptions)
    – (1) stochastic gradient and (2) full gradient
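The four numbers on this slide are direct arithmetic and can be recomputed:

```python
# Error-reduction factors for L = 100, µ = 0.01, n = 80000
L, mu, n = 100.0, 0.01, 80000

full = 1 - mu / L                                    # full gradient
accel = 1 - (mu / L) ** 0.5                          # accelerated gradient
sag_pass = (1 - 1 / (8 * n)) ** n                    # n SAG iterations
lower = ((L ** 0.5 - mu ** 0.5) / (L ** 0.5 + mu ** 0.5)) ** 2

print(round(full, 4), round(accel, 4), round(sag_pass, 4), round(lower, 4))
# → 0.9999 0.99 0.8825 0.9608
```

Note that (1 − 1/(8n))ⁿ ≈ e^(−1/8) ≈ 0.8825 regardless of n, which is why one pass of SAG gives a constant error reduction.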

SLIDE 44

Stochastic average gradient - Implementation details and extensions

  • The algorithm can use sparsity in the features to reduce the storage and iteration cost
  • Grouping functions together can further reduce the memory requirement
  • We have obtained good performance when L is not known, with a heuristic line search
  • The algorithm allows non-uniform sampling
  • Possibility of making proximal, coordinate-wise, and Newton-like variants

SLIDE 45

Stochastic average gradient - Simulation experiments

  • protein dataset (n = 145751, p = 74)
  • Dataset split in two (training/testing)

[Figure: objective minus optimum (training cost) and test logistic loss (testing cost) vs. effective passes, for Steepest, AFG, L-BFGS, pegasos, RDA, SAG (2/(L+nµ)), and SAG-LS]

SLIDE 46

Stochastic average gradient - Simulation experiments

  • cover type dataset (n = 581012, p = 54)
  • Dataset split in two (training/testing)

[Figure: objective minus optimum (training cost) and test logistic loss (testing cost) vs. effective passes, for Steepest, AFG, L-BFGS, pegasos, RDA, SAG (2/(L+nµ)), and SAG-LS]


SLIDE 48

Conclusions / Extensions - Stochastic average gradient

  • Going beyond a single pass through the data
    – Keep a memory of all gradients for finite training sets
    – Linear convergence rate with O(1) iteration complexity
    – Randomization leads to easier analysis and faster rates
    – Beyond machine learning
  • Future/current work - open problems
    – Including a non-differentiable term
    – Line search
    – Using second-order information or non-uniform sampling
    – Going beyond finite training sets (bound on testing cost)
    – Non-strongly convex case

SLIDES 49-52

References

  • A. Agarwal, P. L. Bartlett, P. Ravikumar, and M. J. Wainwright. Information-theoretic lower bounds on the oracle complexity of convex optimization. Technical report, arXiv:1009.0571, 2010.
  • F. Bach. Self-concordant analysis for logistic regression. Electronic Journal of Statistics, 4:384–414, 2010.
  • F. Bach and E. Moulines. Non-asymptotic analysis of stochastic approximation algorithms for machine learning, 2011.
  • D. P. Bertsekas. A new class of incremental gradient methods for least squares problems. SIAM Journal on Optimization, 7(4):913–926, 1997.
  • D. Blatt, A. O. Hero, and H. Gauchman. A convergent incremental gradient method with a constant step size. 18(1):29–51, 2008.
  • L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems (NIPS), 20, 2008.
  • L. Bottou and Y. Le Cun. On-line learning for very large data sets. Applied Stochastic Models in Business and Industry, 21(2):137–151, 2005.
  • M. N. Broadie, D. M. Cicek, and A. Zeevi. General bounds and finite-time improvement for stochastic approximation algorithms. Technical report, Columbia University, 2009.
  • B. Delyon and A. Juditsky. Accelerated stochastic approximation. SIAM Journal on Optimization, 3:868–881, 1993.
  • J. Duchi and Y. Singer. Efficient online and batch learning using forward backward splitting. Journal of Machine Learning Research, 10:2899–2934, 2009.
  • V. Fabian. On asymptotic normality in stochastic approximation. The Annals of Mathematical Statistics, 39(4):1327–1332, 1968.
  • M. P. Friedlander and M. Schmidt. Hybrid deterministic-stochastic methods for data fitting. arXiv preprint arXiv:1104.2373, 2011.
  • S. Ghadimi and G. Lan. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization. Optimization Online, July 2010.
  • E. Hazan, A. Agarwal, and S. Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2):169–192, 2007.
  • H. Kesten. Accelerated stochastic approximation. Annals of Mathematical Statistics, 29(1):41–59, 1958.
  • O. Yu. Kul′chitskiĭ and A. È. Mozgovoĭ. An estimate for the rate of convergence of recurrent robust identification algorithms. Kibernet. i Vychisl. Tekhn., 89:36–39, 1991.
  • H. J. Kushner and G. G. Yin. Stochastic Approximation and Recursive Algorithms and Applications. Springer-Verlag, second edition, 2003.
  • G. Lan. An optimal method for stochastic composite optimization. Mathematical Programming, pages 1–33, 2010.
  • N. Le Roux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. Technical report, HAL, 2012.
  • A. Nedic and D. Bertsekas. Convergence rate of incremental subgradient algorithms. In Stochastic Optimization: Algorithms and Applications, pages 263–304, 2000.
  • A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.
  • A. S. Nemirovski and D. B. Yudin. Problem Complexity and Method Efficiency in Optimization. 1983.
  • Y. Nesterov. A method of solving a convex programming problem with convergence rate O(1/k²). In Soviet Mathematics Doklady, volume 27, pages 372–376, 1983.
  • Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers, 2004.
  • Y. Nesterov and J. P. Vial. Confidence level solutions for stochastic programming. Automatica, 44(6):1559–1568, 2008.
  • B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855, 1992.
  • H. Robbins and S. Monro. A stochastic approximation method. Annals of Mathematical Statistics, 22:400–407, 1951.
  • D. Ruppert. Efficient estimations from a slowly convergent Robbins-Monro process. Technical Report 781, Cornell University Operations Research and Industrial Engineering, 1988.
  • M. Schmidt, N. Le Roux, and F. Bach. Optimization with approximate gradients. Technical report, HAL, 2011.
  • S. Shalev-Shwartz and N. Srebro. SVM optimization: inverse dependence on training set size. In Proc. ICML, 2008.
  • S. Shalev-Shwartz and T. Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. Technical Report 1209.1873, arXiv, 2012.
  • S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: primal estimated sub-gradient solver for SVM. In Proc. ICML, 2007.
  • S. Shalev-Shwartz, O. Shamir, N. Srebro, and K. Sridharan. Stochastic convex optimization. In Conference on Learning Theory (COLT), 2009.
  • M. V. Solodov. Incremental gradient algorithms with stepsizes bounded away from zero. Computational Optimization and Applications, 11(1):23–35, 1998.
  • P. Sunehag, J. Trumpf, S. V. N. Vishwanathan, and N. Schraudolph. Variable metric stochastic approximation theory. In International Conference on Artificial Intelligence and Statistics, 2009.
  • P. Tseng. An incremental gradient(-projection) method with momentum term and adaptive stepsize rule. SIAM Journal on Optimization, 8(2):506–531, 1998.
  • L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research, 9:2543–2596, 2010.