Stochastic gradient methods for machine learning - Francis Bach
PowerPoint PPT Presentation


  1. Stochastic gradient methods for machine learning
     Francis Bach
     INRIA - École Normale Supérieure, Paris, France
     Joint work with Eric Moulines, Nicolas Le Roux and Mark Schmidt - January 2013

  2. Context: Machine learning for “big data”
     • Large-scale machine learning: large p, large n, large k
       – p: dimension of each observation (input)
       – k: number of tasks (dimension of outputs)
       – n: number of observations
     • Examples: computer vision, bioinformatics, signal processing
     • Ideal running-time complexity: O(pn + kn)
       – Going back to simple methods
       – Stochastic gradient methods (Robbins and Monro, 1951)
       – Mixing statistics and optimization
       – Is it possible to improve on the sublinear convergence rate?

  3. Context: Machine learning for “big data”
     • Large-scale machine learning: large p, large n, large k
       – p: dimension of each observation (input)
       – k: number of tasks (dimension of outputs)
       – n: number of observations
     • Examples: computer vision, bioinformatics, signal processing
     • Ideal running-time complexity: O(pn + kn)
     • Going back to simple methods
       – Stochastic gradient methods (Robbins and Monro, 1951)
       – Mixing statistics and optimization
       – Is it possible to improve on the sublinear convergence rate?

  4. Outline
     • Introduction
       – Supervised machine learning and convex optimization
       – Beyond the separation of statistics and optimization
     • Stochastic approximation algorithms (Bach and Moulines, 2011)
       – Stochastic gradient and averaging
       – Strongly convex vs. non-strongly convex
     • Going beyond stochastic gradient (Le Roux, Schmidt, and Bach, 2012)
       – More than a single pass through the data
       – Linear (exponential) convergence rate for strongly convex functions

  5. Supervised machine learning
     • Data: n observations (x_i, y_i) ∈ X × Y, i = 1, ..., n, i.i.d.
     • Prediction as a linear function θ⊤Φ(x) of features Φ(x) ∈ F = R^p
     • (Regularized) empirical risk minimization: find θ̂ solution of
         min_{θ∈F}  (1/n) ∑_{i=1}^{n} ℓ(y_i, θ⊤Φ(x_i)) + µ Ω(θ)
       (convex data-fitting term + regularizer)

  6. Supervised machine learning
     • Data: n observations (x_i, y_i) ∈ X × Y, i = 1, ..., n, i.i.d.
     • Prediction as a linear function θ⊤Φ(x) of features Φ(x) ∈ F = R^p
     • (Regularized) empirical risk minimization: find θ̂ solution of
         min_{θ∈F}  (1/n) ∑_{i=1}^{n} ℓ(y_i, θ⊤Φ(x_i)) + µ Ω(θ)
       (convex data-fitting term + regularizer)
     • Empirical risk: f̂(θ) = (1/n) ∑_{i=1}^{n} ℓ(y_i, θ⊤Φ(x_i))   (training cost)
     • Expected risk: f(θ) = E_{(x,y)} ℓ(y, θ⊤Φ(x))   (testing cost)
     • Two fundamental questions: (1) computing θ̂ and (2) analyzing θ̂
       – May be tackled simultaneously
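To make the objective concrete, here is a minimal sketch, assuming the logistic loss ℓ(y, a) = log(1 + e^(−ya)) with labels y_i ∈ {−1, +1}, the squared-norm regularizer Ω(θ) = ‖θ‖²/2, and the identity feature map Φ(x) = x; the NumPy code and function names are illustrative choices, not part of the original slides.

```python
import numpy as np

def regularized_empirical_risk(theta, X, y, mu):
    """(1/n) sum_i log(1 + exp(-y_i theta^T x_i)) + (mu/2) ||theta||^2."""
    margins = y * (X @ theta)                          # y_i * theta^T Phi(x_i), with Phi(x) = x
    data_fit = np.mean(np.logaddexp(0.0, -margins))    # convex data-fitting term (logistic loss)
    return data_fit + 0.5 * mu * np.dot(theta, theta)  # + (mu/2) ||theta||^2 regularizer

def full_gradient(theta, X, y, mu):
    """Gradient of the regularized empirical risk above."""
    margins = y * (X @ theta)
    coeffs = -y / (1.0 + np.exp(margins))              # ell'(y_i, theta^T x_i) for the logistic loss
    return X.T @ coeffs / X.shape[0] + mu * theta
```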


  8. Smoothness and strong convexity
     • A function g: R^p → R is L-smooth if and only if it is differentiable and its gradient is L-Lipschitz-continuous:
         ∀ θ_1, θ_2 ∈ R^p,  ‖g′(θ_1) − g′(θ_2)‖ ≤ L ‖θ_1 − θ_2‖
     • If g is twice differentiable: ∀ θ ∈ R^p, g″(θ) ≼ L · Id
     (Figure: a smooth function vs. a non-smooth function)

  9. Smoothness and strong convexity
     • A function g: R^p → R is L-smooth if and only if it is differentiable and its gradient is L-Lipschitz-continuous:
         ∀ θ_1, θ_2 ∈ R^p,  ‖g′(θ_1) − g′(θ_2)‖ ≤ L ‖θ_1 − θ_2‖
     • If g is twice differentiable: ∀ θ ∈ R^p, g″(θ) ≼ L · Id
     • Machine learning
       – with g(θ) = (1/n) ∑_{i=1}^{n} ℓ(y_i, θ⊤Φ(x_i))
       – Hessian ≈ covariance matrix (1/n) ∑_{i=1}^{n} Φ(x_i)Φ(x_i)⊤
       – Bounded data
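As a rough numerical illustration of the Hessian/covariance link, the sketch below bounds the smoothness constant from the feature covariance; it assumes `Phi_X` stacks the feature vectors Φ(x_i) as rows, and the function name is made up here.

```python
import numpy as np

def smoothness_constant(Phi_X):
    """Largest eigenvalue of the empirical covariance (1/n) sum_i Phi(x_i) Phi(x_i)^T.

    For least-squares this eigenvalue is exactly the Lipschitz constant L of the
    gradient; for logistic regression the scalar loss curvature is at most 1/4,
    so L <= lambda_max / 4.  Bounded data keeps lambda_max finite.
    """
    n = Phi_X.shape[0]
    cov = Phi_X.T @ Phi_X / n           # Hessian ~ covariance matrix of the features
    return np.linalg.eigvalsh(cov)[-1]  # eigvalsh returns eigenvalues in ascending order
```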

  10. Smoothness and strong convexity
     • A function g: R^p → R is µ-strongly convex if and only if
         ∀ θ_1, θ_2 ∈ R^p,  g(θ_1) ≥ g(θ_2) + ⟨g′(θ_2), θ_1 − θ_2⟩ + (µ/2) ‖θ_1 − θ_2‖²
     • Equivalent definition: θ ↦ g(θ) − (µ/2) ‖θ‖² is convex
     • If g is twice differentiable: ∀ θ ∈ R^p, g″(θ) ≽ µ · Id
     (Figure: a strongly convex function vs. a convex function)

  11. Smoothness and strong convexity
     • A function g: R^p → R is µ-strongly convex if and only if
         ∀ θ_1, θ_2 ∈ R^p,  g(θ_1) ≥ g(θ_2) + ⟨g′(θ_2), θ_1 − θ_2⟩ + (µ/2) ‖θ_1 − θ_2‖²
     • Equivalent definition: θ ↦ g(θ) − (µ/2) ‖θ‖² is convex
     • If g is twice differentiable: ∀ θ ∈ R^p, g″(θ) ≽ µ · Id
     • Machine learning
       – with g(θ) = (1/n) ∑_{i=1}^{n} ℓ(y_i, θ⊤Φ(x_i))
       – Hessian ≈ covariance matrix (1/n) ∑_{i=1}^{n} Φ(x_i)Φ(x_i)⊤
       – Data with invertible covariance matrix (low correlation/dimension)
       – ... or with added regularization by (µ/2) ‖θ‖²
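In the same spirit, here is a sketch of how the strong-convexity constant could be checked numerically in the least-squares case, with `mu_reg` standing for the added regularization; the function and argument names are illustrative assumptions.

```python
import numpy as np

def strong_convexity_constant(Phi_X, mu_reg=0.0):
    """Smallest eigenvalue of the empirical covariance, shifted by the regularization.

    With an invertible covariance matrix the smallest eigenvalue is positive;
    otherwise the added (mu_reg/2)||theta||^2 term guarantees strong convexity
    with constant lambda_min + mu_reg (exactly for least-squares).
    """
    n = Phi_X.shape[0]
    cov = Phi_X.T @ Phi_X / n
    return np.linalg.eigvalsh(cov)[0] + mu_reg  # smallest eigenvalue + added regularization
```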

  12. Stochastic approximation
     • Goal: minimizing a function f defined on a Hilbert space H
       – given only unbiased estimates f′_n(θ_n) of its gradients f′(θ_n) at certain points θ_n ∈ H
     • Stochastic approximation
       – Observation of f′_n(θ_n) = f′(θ_n) + ε_n, with ε_n i.i.d. noise
       – Non-convex problems

  13. Stochastic approximation
     • Goal: minimizing a function f defined on a Hilbert space H
       – given only unbiased estimates f′_n(θ_n) of its gradients f′(θ_n) at certain points θ_n ∈ H
     • Stochastic approximation
       – Observation of f′_n(θ_n) = f′(θ_n) + ε_n, with ε_n i.i.d. noise
       – Non-convex problems
     • Machine learning - statistics
       – loss for a single pair of observations: f_n(θ) = ℓ(y_n, θ⊤Φ(x_n))
       – f(θ) = E f_n(θ) = E ℓ(y_n, θ⊤Φ(x_n)) = generalization error
       – Expected gradient: f′(θ) = E f′_n(θ) = E[ℓ′(y_n, θ⊤Φ(x_n)) Φ(x_n)]
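For instance, with the logistic loss and Φ(x) = x (assumptions carried over from the earlier sketches), the single-observation gradient used as an unbiased estimate of f′(θ) could look like this:

```python
import numpy as np

def stochastic_gradient(theta, x_n, y_n):
    """f'_n(theta) = ell'(y_n, theta^T Phi(x_n)) Phi(x_n) for the logistic loss.

    With (x_n, y_n) drawn i.i.d. from the data distribution, this single-pair
    gradient is an unbiased estimate of f'(theta), the gradient of the
    generalization error.  Phi(x) = x and labels in {-1, +1} are assumed.
    """
    margin = y_n * np.dot(theta, x_n)
    return (-y_n / (1.0 + np.exp(margin))) * x_n  # ell'(y_n, theta^T x_n) * Phi(x_n)
```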

  14. Convex smooth stochastic approximation
     • Key properties of f and/or f_n
       – Smoothness: f_n L-smooth
       – Strong convexity: f µ-strongly convex

  15. Convex smooth stochastic approximation
     • Key properties of f and/or f_n
       – Smoothness: f_n L-smooth
       – Strong convexity: f µ-strongly convex
     • Key algorithm: stochastic gradient descent (a.k.a. Robbins-Monro)
         θ_n = θ_{n−1} − γ_n f′_n(θ_{n−1})
       – Polyak-Ruppert averaging: θ̄_n = (1/n) ∑_{k=0}^{n−1} θ_k
       – Which learning rate sequence γ_n? Classical setting: γ_n = C n^{−α}
       - Desirable practical behavior
       - Applicable (at least) to least-squares and logistic regression
       - Robustness to (potentially unknown) constants (L, µ)
       - Adaptivity to difficulty of the problem (e.g., strong convexity)

  16. Convex smooth stochastic approximation
     • Key properties of f and/or f_n
       – Smoothness: f_n L-smooth
       – Strong convexity: f µ-strongly convex
     • Key algorithm: stochastic gradient descent (a.k.a. Robbins-Monro)
         θ_n = θ_{n−1} − γ_n f′_n(θ_{n−1})
       – Polyak-Ruppert averaging: θ̄_n = (1/n) ∑_{k=0}^{n−1} θ_k
       – Which learning rate sequence γ_n? Classical setting: γ_n = C n^{−α}
     • Desirable practical behavior
       – Applicable (at least) to least-squares and logistic regression
       – Robustness to (potentially unknown) constants (L, µ)
       – Adaptivity to difficulty of the problem (e.g., strong convexity)
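A minimal sketch of the Robbins-Monro recursion with Polyak-Ruppert averaging and the classical step size γ_n = C n^(−α); the oracle interface `grad(theta, rng)` (returning an unbiased gradient estimate) and the default constants are assumptions made here for illustration, not prescribed by the slides.

```python
import numpy as np

def sgd_with_averaging(grad, theta0, n_iters, C=1.0, alpha=0.5, rng=None):
    """Stochastic gradient descent with Polyak-Ruppert averaging.

    Recursion: theta_n = theta_{n-1} - gamma_n * f'_n(theta_{n-1}), with the
    classical step size gamma_n = C * n**(-alpha).  Returns the last iterate
    and the average (1/n) * sum_{k=0}^{n-1} theta_k.
    """
    rng = np.random.default_rng() if rng is None else rng
    theta = np.array(theta0, dtype=float)
    theta_bar = np.zeros_like(theta)
    for n in range(1, n_iters + 1):
        theta_bar += (theta - theta_bar) / n        # running average of theta_0, ..., theta_{n-1}
        gamma_n = C * n ** (-alpha)                 # gamma_n = C n^(-alpha)
        theta = theta - gamma_n * grad(theta, rng)  # unbiased gradient oracle f'_n
    return theta, theta_bar
```

In a single-pass use, `grad` would draw a fresh pair (x_n, y_n) at each call, for example by wrapping the `stochastic_gradient` sketch above.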

  17. Convex stochastic approximation: related work
     • Machine learning/optimization
       – Known minimax rates of convergence (Nemirovski and Yudin, 1983; Agarwal et al., 2010)
       – Strongly convex: O(n^{−1})
       – Non-strongly convex: O(n^{−1/2})
       – Achieved with and/or without averaging (up to log terms)
       – Non-asymptotic analysis (high-probability bounds)
       – Online setting and regret bounds
       – Bottou and Le Cun (2005); Bottou and Bousquet (2008); Hazan et al. (2007); Shalev-Shwartz and Srebro (2008); Shalev-Shwartz et al. (2007, 2009); Xiao (2010); Duchi and Singer (2009)
       – Nesterov and Vial (2008); Nemirovski et al. (2009)

  18. Convex stochastic approximation: related work
     • Stochastic approximation
       – Asymptotic analysis
       – Non-convex case with strong convexity around the optimum
       – γ_n = C n^{−α} with α = 1 is not robust to the choice of C
       – α ∈ (1/2, 1) is robust with averaging
       – Broadie et al. (2009); Kushner and Yin (2003); Kul'chitskiĭ and Mozgovoĭ (1991); Fabian (1968)
       – Polyak and Juditsky (1992); Ruppert (1988)

  19. Problem set-up - General assumptions
     • Unbiased gradient estimates:
       – f_n(θ) is of the form h(z_n, θ), where z_n is an i.i.d. sequence
       – e.g., f_n(θ) = h(z_n, θ) = ℓ(y_n, θ⊤Φ(x_n)) with z_n = (x_n, y_n)
       – NB: can be generalized
     • Variance of estimates: there exists σ² ≥ 0 such that for all n ≥ 1,
         E(‖f′_n(θ*) − f′(θ*)‖²) ≤ σ²,  where θ* is a global minimizer of f
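If one wants a rough sanity check of the variance assumption on a finite sample, a sketch like the following could be used; it reuses the logistic-loss setting assumed in the earlier sketches and treats the empirical mean gradient as a stand-in for f′(θ*), so it is only an illustrative proxy for σ².

```python
import numpy as np

def noise_variance_at_optimum(theta_star, X, y):
    """Empirical proxy for sigma^2 = E ||f'_n(theta*) - f'(theta*)||^2.

    Uses per-observation logistic-loss gradients and centers them around their
    empirical mean, which stands in for the expected gradient f'(theta*).
    """
    margins = y * (X @ theta_star)
    coeffs = -y / (1.0 + np.exp(margins))  # ell'(y_i, theta*^T x_i)
    grads = coeffs[:, None] * X            # one gradient f'_i(theta*) per row
    mean_grad = grads.mean(axis=0)         # proxy for f'(theta*)
    return np.mean(np.sum((grads - mean_grad) ** 2, axis=1))
```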

  20. Problem set-up - Smoothness/convexity assumptions
     • Smoothness of f_n: for each n ≥ 1, the function f_n is a.s. convex, differentiable with L-Lipschitz-continuous gradient f′_n
       – Bounded data
