Stochastic gradient methods for machine learning Francis Bach - PowerPoint PPT Presentation

Stochastic gradient methods for machine learning Francis Bach INRIA - Ecole Normale Sup´ erieure, Paris, France Joint work with Nicolas Le Roux, Mark Schmidt and Eric Moulines - November 2013

Context Machine learning for “big data” • Large-scale machine learning : large p , large n , large k – p : dimension of each observation (input) – n : number of observations – k : number of tasks (dimension of outputs) • Examples : computer vision, bioinformatics, text processing – Ideal running-time complexity : O ( pn + kn ) – Going back to simple methods – Stochastic gradient methods (Robbins and Monro, 1951) – Mixing statistics and optimization – Using smoothness to go beyond stochastic gradient descent

Search engines - advertising

Advertising - recommendation

Object recognition

Learning for bioinformatics - Proteins • Crucial components of cell life • Predicting multiple functions and interactions • Massive data : up to 1 millions for humans! • Complex data – Amino-acid sequence – Link with DNA – Tri-dimensional molecule

Context Machine learning for “big data” • Large-scale machine learning : large p , large n , large k – p : dimension of each observation (input) – n : number of observations – k : number of tasks (dimension of outputs) • Examples : computer vision, bioinformatics, text processing • Ideal running-time complexity : O ( pn + kn ) – Going back to simple methods – Stochastic gradient methods (Robbins and Monro, 1951) – Mixing statistics and optimization – Using smoothness to go beyond stochastic gradient descent

Context Machine learning for “big data” • Large-scale machine learning : large p , large n , large k – p : dimension of each observation (input) – n : number of observations – k : number of tasks (dimension of outputs) • Examples : computer vision, bioinformatics, text processing • Ideal running-time complexity : O ( pn + kn ) • Going back to simple methods – Stochastic gradient methods (Robbins and Monro, 1951) – Mixing statistics and optimization – Using smoothness to go beyond stochastic gradient descent

Outline • Introduction: stochastic approximation algorithms – Supervised machine learning and convex optimization – Stochastic gradient and averaging – Strongly convex vs. non-strongly convex • Fast convergence through smoothness and constant step-sizes – Online Newton steps (Bach and Moulines, 2013) – O (1 /n ) convergence rate for all convex functions • More than a single pass through the data – Stochastic average gradient (Le Roux, Schmidt, and Bach, 2012) – Linear (exponential) convergence rate for strongly convex functions

Supervised machine learning • Data : n observations ( x i , y i ) ∈ X × Y , i = 1 , . . . , n , i.i.d. • Prediction as a linear function � θ, Φ( x ) � of features Φ( x ) ∈ R p • (regularized) empirical risk minimization : find ˆ θ solution of n 1 � � � min ℓ y i , � θ, Φ( x i ) � + µ Ω( θ ) n θ ∈ R p i =1 convex data fitting term + regularizer

Supervised machine learning • Data : n observations ( x i , y i ) ∈ X × Y , i = 1 , . . . , n , i.i.d. • Prediction as a linear function � θ, Φ( x ) � of features Φ( x ) ∈ R p • (regularized) empirical risk minimization : find ˆ θ solution of n 1 � � � min ℓ y i , � θ, Φ( x i ) � + µ Ω( θ ) n θ ∈ R p i =1 convex data fitting term + regularizer � n • Empirical risk: ˆ f ( θ ) = 1 i =1 ℓ ( y i , � θ, Φ( x i ) � ) training cost n • Expected risk: f ( θ ) = E ( x,y ) ℓ ( y, � θ, Φ( x ) � ) testing cost • Two fundamental questions : (1) computing ˆ θ and (2) analyzing ˆ θ – May be tackled simultaneously

Smoothness and strong convexity • A function g : R p → R is L -smooth if and only if it is twice differentiable and ∀ θ ∈ R p , g ′′ ( θ ) � L · Id smooth non−smooth

Smoothness and strong convexity • A function g : R p → R is L -smooth if and only if it is twice differentiable and ∀ θ ∈ R p , g ′′ ( θ ) � L · Id • Machine learning � n – with g ( θ ) = 1 i =1 ℓ ( y i , � θ, Φ( x i ) � ) n � n – Hessian ≈ covariance matrix 1 i =1 Φ( x i ) ⊗ Φ( x i ) n – Bounded data

Smoothness and strong convexity • A function g : R p → R is µ -strongly convex if and only if ∀ θ 1 , θ 2 ∈ R p , g ( θ 1 ) � g ( θ 2 ) + � g ′ ( θ 2 ) , θ 1 − θ 2 � + µ 2 � θ 1 − θ 2 � 2 • If g is twice differentiable: ∀ θ ∈ R p , g ′′ ( θ ) � µ · Id strongly convex convex

Smoothness and strong convexity • A function g : R p → R is µ -strongly convex if and only if ∀ θ 1 , θ 2 ∈ R p , g ( θ 1 ) � g ( θ 2 ) + � g ′ ( θ 2 ) , θ 1 − θ 2 � + µ 2 � θ 1 − θ 2 � 2 • If g is twice differentiable: ∀ θ ∈ R p , g ′′ ( θ ) � µ · Id • Machine learning � n – with g ( θ ) = 1 i =1 ℓ ( y i , � θ, Φ( x i ) � ) n � n – Hessian ≈ covariance matrix 1 i =1 Φ( x i ) ⊗ Φ( x i ) n – Data with invertible covariance matrix (low correlation/dimension)

Smoothness and strong convexity • A function g : R p → R is µ -strongly convex if and only if ∀ θ 1 , θ 2 ∈ R p , g ( θ 1 ) � g ( θ 2 ) + � g ′ ( θ 2 ) , θ 1 − θ 2 � + µ 2 � θ 1 − θ 2 � 2 • If g is twice differentiable: ∀ θ ∈ R p , g ′′ ( θ ) � µ · Id • Machine learning � n – with g ( θ ) = 1 i =1 ℓ ( y i , � θ, Φ( x i ) � ) n � n – Hessian ≈ covariance matrix 1 i =1 Φ( x i ) ⊗ Φ( x i ) n – Data with invertible covariance matrix (low correlation/dimension) • Adding regularization by µ 2 � θ � 2 – creates additional bias unless µ is small

Iterative methods for minimizing smooth functions • Assumption : g convex and smooth on R p • Gradient descent : θ t = θ t − 1 − γ t g ′ ( θ t − 1 ) – O (1 /t ) convergence rate for convex functions – O ( e − ρt ) convergence rate for strongly convex functions • Newton method : θ t = θ t − 1 − g ′′ ( θ t − 1 ) − 1 g ′ ( θ t − 1 ) e − ρ 2 t � � – O convergence rate

Iterative methods for minimizing smooth functions • Assumption : g convex and smooth on R p • Gradient descent : θ t = θ t − 1 − γ t g ′ ( θ t − 1 ) – O (1 /t ) convergence rate for convex functions – O ( e − ρt ) convergence rate for strongly convex functions • Newton method : θ t = θ t − 1 − g ′′ ( θ t − 1 ) − 1 g ′ ( θ t − 1 ) e − ρ 2 t � � – O convergence rate • Key insights from Bottou and Bousquet (2008) 1. In machine learning, no need to optimize below statistical error 2. In machine learning, cost functions are averages ⇒ Stochastic approximation

Stochastic approximation • Goal : Minimizing a function f defined on R p – given only unbiased estimates f ′ n ( θ n ) of its gradients f ′ ( θ n ) at certain points θ n ∈ R p • Stochastic approximation – Observation of f ′ n ( θ n ) = f ′ ( θ n ) + ε n , with ε n = i.i.d. noise – Non-convex problems

Stochastic approximation • Goal : Minimizing a function f defined on R p – given only unbiased estimates f ′ n ( θ n ) of its gradients f ′ ( θ n ) at certain points θ n ∈ R p • Stochastic approximation – Observation of f ′ n ( θ n ) = f ′ ( θ n ) + ε n , with ε n = i.i.d. noise – Non-convex problems • Machine learning - statistics – loss for a single pair of observations : f n ( θ ) = ℓ ( y n , � θ, Φ( x n ) � ) – f ( θ ) = E f n ( θ ) = E ℓ ( y n , � θ, Φ( x n ) � ) = generalization error – Expected gradient: f ′ ( θ ) = E f ′ � ℓ ′ ( y n , � θ, Φ( x n ) � ) Φ( x n ) � n ( θ ) = E

Convex stochastic approximation • Key assumption : smoothness and/or strongly convexity

Convex stochastic approximation • Key assumption : smoothness and/or strongly convexity • Key algorithm: stochastic gradient descent (a.k.a. Robbins-Monro) θ n = θ n − 1 − γ n f ′ n ( θ n − 1 ) � n – Polyak-Ruppert averaging: ¯ 1 θ n = k =0 θ k n +1 γ n = Cn − α – Which learning rate sequence γ n ? Classical setting: - Desirable practical behavior - Applicable (at least) to least-squares and logistic regression - Robustness to (potentially unknown) constants ( L , µ ) - Adaptivity to difficulty of the problem (e.g., strong convexity)

Stochastic gradient methods for machine learning Francis Bach - PowerPoint PPT Presentation

Stochastic gradient methods for machine learning Francis Bach INRIA - Ecole Normale Sup erieure, Paris, France Joint work with Nicolas Le Roux, Mark Schmidt and Eric Moulines - November 2013 Context Machine learning for big data

Applied Machine Learning Gradient Descent Methods Siamak Ravanbakhsh COMP 551 (Fall 2020)

CS 6316 Machine Learning Gradient Descent Yangfeng Ji Department of Computer Science University

Stochastic Gradient Descent (SGD) Todays Class Stochastic Gradient Descent (SGD) SGD Recap

Large Scale Machine Learning with Stochastic Gradient Descent L eon Bottou leon@bottou.org

Painless Stochastic Gradient Descent : Interpolation, Line-Search, and Convergence Rates. MLSS

Painless Stochastic Gradient Descent : Interpolation, Line-Search, and Convergence Rates. NeurIPS

Applied Machine Learning Applied Machine Learning Gradient Descent Methods Siamak Ravanbakhsh

Applied Machine Learning Applied Machine Learning Gradient Descent Methods Siamak Ravanbakhsh

Adaptive primal-dual stochastic gradient methods Yangyang Xu Mathematical Sciences, Rensselaer

Machine Learning (CSE 446): Gradient Descent and Stochastic Gradient Descent Sham M Kakade

Overview of the Stochastic Gradient Method December 02, 2020 P. Carpentier Master Optimization

Stochastic Perturbations of Proximal-Gradient methods for nonsmooth convex optimization: the

Conjugate gradient methods for stochastic Galerkin finite element saddle point matrices B T A

On the steplength selection in Stochastic Gradient Methods Giorgia Franchini

Reinforcement Learning for Continuous State and Action Spaces Gradient Methods 1 MACHINE LEARNING

Stochastic gradient methods for machine learning Francis Bach INRIA - Ecole Normale Sup

Partial match queries: a limit process Nicolas Broutin Ralph Neininger Henning Sulzbach Partial

The hyperbolic Brownian plane Thomas Budzinski ENS Paris July 7th, 2016 Thomas Budzinski The

Understanding MCMC Marcel Lthi, University of Basel Slides based on presentation by Sandro

Metropolis-Hastings Algorithm for Mixture Model and its Weak Convergence Kengo, KAMATANI

Clustering and K-means Root Mean Square Error (RMS) Data: ! x 1 , ! x 2 , , ! x N R d

Chapter 3 Asymptotic Equipartition Property Peng-Hua Wang Graduate Inst. of Comm. Engineering

Clustering 2 Clustering 2 Nov 3 2008 HAC Algorithm HAC Algorithm St t Start with all objects in

Stochastic solution of large least squares systems in variational data assimilation Parallel