Stochastic gradient methods for machine learning

  1. Stochastic gradient methods for machine learning
Francis Bach, INRIA - École Normale Supérieure, Paris, France
Joint work with Nicolas Le Roux, Mark Schmidt and Eric Moulines - November 2013

  2. Context: machine learning for “big data”
• Large-scale machine learning: large p, large n, large k
  – p: dimension of each observation (input)
  – n: number of observations
  – k: number of tasks (dimension of outputs)
• Examples: computer vision, bioinformatics, text processing
• Ideal running-time complexity: O(pn + kn)
• Going back to simple methods
  – Stochastic gradient methods (Robbins and Monro, 1951)
  – Mixing statistics and optimization
  – Using smoothness to go beyond stochastic gradient descent

  3. Search engines - advertising

  4. Advertising - recommendation

  5. Object recognition

  6. Learning for bioinformatics - proteins
• Crucial components of cell life
• Predicting multiple functions and interactions
• Massive data: up to 1 million for humans!
• Complex data
  – Amino-acid sequence
  – Link with DNA
  – Three-dimensional molecule

  9. Outline
• Introduction: stochastic approximation algorithms
  – Supervised machine learning and convex optimization
  – Stochastic gradient and averaging
  – Strongly convex vs. non-strongly convex
• Fast convergence through smoothness and constant step-sizes
  – Online Newton steps (Bach and Moulines, 2013)
  – O(1/n) convergence rate for all convex functions
• More than a single pass through the data
  – Stochastic average gradient (Le Roux, Schmidt, and Bach, 2012)
  – Linear (exponential) convergence rate for strongly convex functions

  11. Supervised machine learning
• Data: n observations (x_i, y_i) ∈ X × Y, i = 1, …, n, i.i.d.
• Prediction as a linear function ⟨θ, Φ(x)⟩ of features Φ(x) ∈ ℝ^p
• (Regularized) empirical risk minimization: find θ̂ solution of
      min_{θ ∈ ℝ^p} (1/n) Σ_{i=1}^n ℓ(y_i, ⟨θ, Φ(x_i)⟩) + µ Ω(θ)
  (convex data-fitting term + regularizer)
• Empirical risk: f̂(θ) = (1/n) Σ_{i=1}^n ℓ(y_i, ⟨θ, Φ(x_i)⟩) (training cost)
• Expected risk: f(θ) = E_{(x,y)} ℓ(y, ⟨θ, Φ(x)⟩) (testing cost)
• Two fundamental questions: (1) computing θ̂ and (2) analyzing θ̂
  – May be tackled simultaneously

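The regularized empirical risk minimization on the slide can be made concrete with a minimal NumPy sketch, here instantiated with the logistic loss and Ω(θ) = (1/2)‖θ‖². The synthetic data, the value of µ, the step size, and the iteration count are illustrative assumptions, not values from the slides.

```python
import numpy as np

# Sketch of min_theta (1/n) sum_i loss(y_i, <theta, x_i>) + (mu/2)||theta||^2
# with the logistic loss, solved by plain full-gradient descent.

rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.standard_normal((n, p))          # rows play the role of Phi(x_i)
y = np.sign(X @ rng.standard_normal(p) + 0.1 * rng.standard_normal(n))  # labels in {-1, +1}
mu = 0.1                                 # regularization strength (illustrative)

def risk(theta):
    # empirical risk: average logistic loss plus the squared-norm regularizer
    return np.mean(np.log1p(np.exp(-y * (X @ theta)))) + 0.5 * mu * theta @ theta

def grad(theta):
    s = -y / (1.0 + np.exp(y * (X @ theta)))   # per-sample loss derivatives
    return X.T @ s / n + mu * theta

theta_hat = np.zeros(p)
for _ in range(500):
    theta_hat -= 0.5 * grad(theta_hat)         # hand-picked constant step size
```

After the loop, `theta_hat` approximates the minimizer θ̂; this is the deterministic baseline the stochastic methods of the later slides aim to improve on for large n.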
  14. Smoothness and strong convexity
• A function g: ℝ^p → ℝ is L-smooth if and only if it is twice differentiable and
      ∀θ ∈ ℝ^p, g″(θ) ≼ L · Id
• Machine learning
  – with g(θ) = (1/n) Σ_{i=1}^n ℓ(y_i, ⟨θ, Φ(x_i)⟩)
  – Hessian ≈ covariance matrix (1/n) Σ_{i=1}^n Φ(x_i) ⊗ Φ(x_i)
  – Bounded data

  17. Smoothness and strong convexity
• A function g: ℝ^p → ℝ is µ-strongly convex if and only if
      ∀θ₁, θ₂ ∈ ℝ^p, g(θ₁) ≥ g(θ₂) + ⟨g′(θ₂), θ₁ − θ₂⟩ + (µ/2)‖θ₁ − θ₂‖²
• If g is twice differentiable: ∀θ ∈ ℝ^p, g″(θ) ≽ µ · Id
• Machine learning
  – with g(θ) = (1/n) Σ_{i=1}^n ℓ(y_i, ⟨θ, Φ(x_i)⟩)
  – Hessian ≈ covariance matrix (1/n) Σ_{i=1}^n Φ(x_i) ⊗ Φ(x_i)
  – Data with invertible covariance matrix (low correlation/dimension)
• Adding regularization by (µ/2)‖θ‖²
  – creates additional bias unless µ is small
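For least-squares, the connection between the Hessian and the covariance matrix is exact: the Hessian of g(θ) = (1/2n) Σ_i (y_i − ⟨θ, Φ(x_i)⟩)² equals (1/n) Σ_i Φ(x_i) ⊗ Φ(x_i) at every θ, so L and µ are just its extreme eigenvalues. A small numerical sketch, with synthetic data as an illustrative assumption:

```python
import numpy as np

# Smoothness constant L = lambda_max(H) and strong-convexity constant
# mu = lambda_min(H) of the least-squares objective, where H is the
# empirical (uncentered) covariance matrix.

rng = np.random.default_rng(1)
n, p = 1000, 4
X = rng.standard_normal((n, p))         # rows play the role of Phi(x_i)
H = X.T @ X / n                         # Hessian = covariance matrix
eigs = np.linalg.eigvalsh(H)            # eigenvalues in ascending order
L, mu = eigs[-1], eigs[0]

# a ridge term (mu0/2)||theta||^2 shifts every eigenvalue up by mu0,
# guaranteeing strong convexity even for a singular covariance matrix
mu0 = 0.05
eigs_reg = np.linalg.eigvalsh(H + mu0 * np.eye(p))
```

This also makes the slide's last point concrete: regularization buys a strong-convexity constant of at least µ₀, at the price of biasing the minimizer.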

  19. Iterative methods for minimizing smooth functions
• Assumption: g convex and smooth on ℝ^p
• Gradient descent: θ_t = θ_{t−1} − γ_t g′(θ_{t−1})
  – O(1/t) convergence rate for convex functions
  – O(e^{−ρt}) convergence rate for strongly convex functions
• Newton method: θ_t = θ_{t−1} − g″(θ_{t−1})^{−1} g′(θ_{t−1})
  – O(e^{−ρ2^t}) convergence rate
• Key insights from Bottou and Bousquet (2008)
  1. In machine learning, no need to optimize below statistical error
  2. In machine learning, cost functions are averages
  ⇒ Stochastic approximation
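The two iterations above can be sketched side by side on a small ridge-regularized logistic regression; the synthetic data, the step size, and the iteration counts are hand-picked illustrations, not prescriptions from the slides.

```python
import numpy as np

# gradient descent: theta <- theta - gamma * g'(theta)        (linear rate)
# Newton method:    theta <- theta - g''(theta)^{-1} g'(theta) (quadratic rate)

rng = np.random.default_rng(2)
n, p = 100, 3
X = rng.standard_normal((n, p))
y = np.sign(X @ np.ones(p) + rng.standard_normal(n))  # labels in {-1, +1}
mu = 0.1                                              # ridge term keeps g strongly convex

def grad(theta):
    s = -y / (1.0 + np.exp(y * (X @ theta)))          # per-sample loss derivatives
    return X.T @ s / n + mu * theta

def hess(theta):
    q = 1.0 / (1.0 + np.exp(-y * (X @ theta)))        # sigmoid of the signed margins
    w = q * (1.0 - q)                                 # second derivative of the loss
    return X.T @ (w[:, None] * X) / n + mu * np.eye(p)

theta_gd = np.zeros(p)
for _ in range(200):                                  # many cheap first-order steps
    theta_gd -= 1.0 * grad(theta_gd)

theta_nt = np.zeros(p)
for _ in range(8):                                    # a handful of costly Newton steps
    theta_nt -= np.linalg.solve(hess(theta_nt), grad(theta_nt))
```

Both reach (essentially) the same minimizer; the trade-off is per-iteration cost (O(np) vs. O(np² + p³)) against iteration count, which is what motivates cheaper stochastic steps next.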

  21. Stochastic approximation
• Goal: minimizing a function f defined on ℝ^p
  – given only unbiased estimates f′_n(θ_n) of its gradients f′(θ_n) at certain points θ_n ∈ ℝ^p
• Stochastic approximation
  – Observation of f′_n(θ_n) = f′(θ_n) + ε_n, with ε_n = i.i.d. noise
  – Non-convex problems
• Machine learning - statistics
  – Loss for a single pair of observations: f_n(θ) = ℓ(y_n, ⟨θ, Φ(x_n)⟩)
  – f(θ) = E f_n(θ) = E ℓ(y_n, ⟨θ, Φ(x_n)⟩) = generalization error
  – Expected gradient: f′(θ) = E f′_n(θ) = E[ℓ′(y_n, ⟨θ, Φ(x_n)⟩) Φ(x_n)]
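The unbiasedness claim above can be checked numerically: averaging many single-observation gradients at uniformly random indices should recover the full gradient. The logistic loss and synthetic data here are illustrative assumptions.

```python
import numpy as np

# Monte-Carlo check that f_n'(theta) = loss'(y_n, <theta, x_n>) * x_n
# is an unbiased estimate of the full gradient f'(theta).

rng = np.random.default_rng(3)
n, p = 500, 4
X = rng.standard_normal((n, p))                       # rows play the role of Phi(x_i)
y = np.sign(X @ np.ones(p) + rng.standard_normal(n))  # labels in {-1, +1}
theta = rng.standard_normal(p)

def single_grad(i, theta):
    # gradient of the logistic loss at observation i only
    s = -y[i] / (1.0 + np.exp(y[i] * (X[i] @ theta)))
    return s * X[i]

# the full gradient is the average of the n per-observation gradients
full_grad = np.mean([single_grad(i, theta) for i in range(n)], axis=0)

# drawing the index uniformly at random recovers f'(theta) in expectation
sampled = np.mean([single_grad(rng.integers(n), theta)
                   for _ in range(20000)], axis=0)
```

The Monte-Carlo average matches the full gradient up to sampling noise of order 1/√(number of draws), which is exactly the ε_n noise model of the slide.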

  23. Convex stochastic approximation
• Key assumption: smoothness and/or strong convexity
• Key algorithm: stochastic gradient descent (a.k.a. Robbins-Monro)
      θ_n = θ_{n−1} − γ_n f′_n(θ_{n−1})
  – Polyak-Ruppert averaging: θ̄_n = (1/(n+1)) Σ_{k=0}^n θ_k
  – Which learning-rate sequence γ_n? Classical setting: γ_n = C n^{−α}
• Desirable practical behavior
  – Applicable (at least) to least-squares and logistic regression
  – Robustness to (potentially unknown) constants (L, µ)
  – Adaptivity to difficulty of the problem (e.g., strong convexity)
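The Robbins-Monro recursion with Polyak-Ruppert averaging can be sketched on streaming least-squares data; the target θ*, the noise level, and the choices C and α below are illustrative assumptions, not tuned values from the slides.

```python
import numpy as np

# theta_n    = theta_{n-1} - gamma_n * f_n'(theta_{n-1}),  gamma_n = C * n^(-alpha)
# thetabar_n = (1/(n+1)) * sum_{k=0}^{n} theta_k           (Polyak-Ruppert average)

rng = np.random.default_rng(4)
p = 3
theta_star = np.array([1.0, -2.0, 0.5])   # minimizer of the expected risk (assumed)
C, alpha = 0.2, 0.5

theta = np.zeros(p)
theta_bar = theta.copy()
for n in range(1, 20001):
    x = rng.standard_normal(p)                         # one streaming observation
    y = x @ theta_star + 0.1 * rng.standard_normal()   # noisy linear response
    g = (x @ theta - y) * x                            # stochastic gradient of (1/2)(<theta,x> - y)^2
    theta -= C * n ** (-alpha) * g
    theta_bar += (theta - theta_bar) / (n + 1)         # running average of the iterates
```

Each update touches a single observation (cost O(p)), and the averaged iterate θ̄_n smooths out the oscillations of θ_n caused by the gradient noise.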
