
Stochastic Algorithms in Machine Learning, Aymeric DIEULEVEUT, EPFL



  1. Stochastic Algorithms in Machine Learning
  Aymeric DIEULEVEUT, EPFL, Lausanne
  December 1st, 2017, Journée Algorithmes Stochastiques, Paris Dauphine

  2. Outline
  1. Machine learning context.
  2. Stochastic algorithms to minimize the empirical risk.
  3. Stochastic approximation: using stochastic gradient descent (SGD) to minimize the generalization risk.
  4. Markov chain: an insightful point of view on constant step-size stochastic approximation.

  3. Supervised Machine Learning
  Goal: predict a phenomenon from "explanatory variables", given a set of observations.
  ◮ Bio-informatics. Input: DNA/RNA sequence. Output: disease predisposition / drug responsiveness. $n$: 10 to $10^4$; $d$ (e.g., number of bases): up to $10^6$.
  ◮ Image classification. Input: handwritten digits / images. Output: digit. $n$: up to $10^9$; $d$ (e.g., number of pixels): up to $10^6$.
  "Large scale" learning framework: both the number of examples $n$ and the number of explanatory variables $d$ are large.

  4. Supervised Machine Learning
  ◮ Consider an input/output pair $(X, Y) \in \mathcal{X} \times \mathcal{Y}$, following some unknown distribution $\rho$.
  ◮ $\mathcal{Y} = \mathbb{R}$ (regression) or $\{-1, 1\}$ (classification).
  ◮ Goal: find a function $\theta : \mathcal{X} \to \mathbb{R}$ such that $\theta(X)$ is a good prediction for $Y$.
  ◮ Prediction as a linear function $\langle \theta, \Phi(X) \rangle$ of features $\Phi(X) \in \mathbb{R}^d$.
  ◮ Consider a loss function $\ell : \mathcal{Y} \times \mathbb{R} \to \mathbb{R}_+$: squared loss, logistic loss, 0-1 loss, etc.
  ◮ Define the generalization risk (a.k.a. generalization error, "true risk") as
  $R(\theta) := \mathbb{E}_\rho[\ell(Y, \langle \theta, \Phi(X) \rangle)]$.
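To make the definition concrete, here is a minimal Monte-Carlo sketch of $R(\theta)$ for one hypothetical choice of $\rho$ (Gaussian features, noisy linear labels) and the squared loss; both are illustrative assumptions, not taken from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

def generalization_risk_mc(theta, n_samples=100_000):
    # Monte-Carlo estimate of R(theta) = E_rho[ loss(Y, <theta, Phi(X)>) ].
    # The distribution rho below and the squared loss are illustrative
    # assumptions chosen for this sketch.
    d = theta.shape[0]
    X = rng.normal(size=(n_samples, d))          # here Phi(X) = X for simplicity
    theta_star = np.ones(d)                      # hypothetical "true" parameter
    Y = X @ theta_star + rng.normal(size=n_samples)
    return np.mean(0.5 * (Y - X @ theta) ** 2)   # squared loss, averaged over rho
```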

  5. Empirical Risk Minimization (I)
  ◮ Data: $n$ observations $(x_i, y_i) \in \mathcal{X} \times \mathcal{Y}$, $i = 1, \dots, n$, i.i.d.
  ◮ $n$ very large, up to $10^9$.
  ◮ Computer vision: $d = 10^4$ to $10^6$.
  ◮ Empirical risk (or training error): $\hat{R}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, \langle \theta, \Phi(x_i) \rangle)$.
  ◮ Empirical risk minimization (ERM), regularized: find $\hat{\theta}$ solution of
  $\min_{\theta \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, \langle \theta, \Phi(x_i) \rangle) + \mu \Omega(\theta)$
  (convex data-fitting term + regularizer).

  6. Empirical Risk Minimization (II)
  For example, least-squares regression:
  $\min_{\theta \in \mathbb{R}^d} \frac{1}{2n} \sum_{i=1}^{n} (y_i - \langle \theta, \Phi(x_i) \rangle)^2 + \mu \Omega(\theta)$,
  and logistic regression:
  $\min_{\theta \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^{n} \log(1 + \exp(-y_i \langle \theta, \Phi(x_i) \rangle)) + \mu \Omega(\theta)$.
  Two fundamental questions: (1) computing $\hat{\theta}$; (2) analyzing $\hat{\theta}$.
  Take home:
  ◮ The problem is formalized as a (convex) optimization problem.
  ◮ In the large-scale setting, the problem is high-dimensional and has many examples.
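The two objectives translate directly into code. A small numpy sketch, assuming $\Omega(\theta) = \|\theta\|^2/2$ (a ridge penalty, one common choice; the slides leave $\Omega$ generic):

```python
import numpy as np

def least_squares_risk(theta, X, y, mu):
    # (1/2n) * sum_i (y_i - <theta, x_i>)^2 + mu * ||theta||^2 / 2.
    residuals = y - X @ theta
    return 0.5 * np.mean(residuals ** 2) + 0.5 * mu * theta @ theta

def logistic_risk(theta, X, y, mu):
    # (1/n) * sum_i log(1 + exp(-y_i * <theta, x_i>)) + mu * ||theta||^2 / 2,
    # for labels y_i in {-1, +1}; logaddexp keeps the log numerically stable.
    margins = y * (X @ theta)
    return np.mean(np.logaddexp(0.0, -margins)) + 0.5 * mu * theta @ theta
```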

  7. Stochastic Algorithms for ERM
  $\min_{\theta \in \mathbb{R}^d} \hat{R}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, \langle \theta, \Phi(x_i) \rangle)$.
  1. High dimension $d$ ⇒ first-order algorithms.
  Gradient descent (GD): $\theta_k = \theta_{k-1} - \gamma_k \hat{R}'(\theta_{k-1})$.
  Problem: computing the gradient costs $O(dn)$ per iteration.
  2. Large $n$ ⇒ stochastic algorithms: stochastic gradient descent (SGD).
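A minimal sketch of the GD update, showing where the $O(dn)$ cost comes from; the least-squares gradient is one concrete instance, and the helper names are ours:

```python
import numpy as np

def gradient_descent(grad, theta0, gamma, n_iters):
    # theta_k = theta_{k-1} - gamma_k * hat_R'(theta_{k-1}); each call to
    # grad touches all n examples, hence the O(dn) cost per iteration.
    theta = theta0.copy()
    for k in range(1, n_iters + 1):
        theta = theta - gamma(k) * grad(theta)
    return theta

def grad_least_squares(theta, X, y, mu):
    # Full gradient of the regularized least-squares risk.
    return -X.T @ (y - X @ theta) / X.shape[0] + mu * theta
```

For example, `gradient_descent(lambda t: grad_least_squares(t, X, y, mu), np.zeros(d), lambda k: 0.1, 100)` runs 100 full-gradient steps with a constant step size.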

  8. Stochastic Gradient Descent
  ◮ Goal: $\min_{\theta \in \mathbb{R}^d} f(\theta)$, given unbiased gradient estimates $f'_k$.
  ◮ $\theta_* := \arg\min_{\theta \in \mathbb{R}^d} f(\theta)$.
  [Figure: SGD iterates $\theta_0, \theta_1, \dots$ moving noisily towards $\theta_*$.]

  9. SGD for ERM: $f = \hat{R}$
  Loss for a single pair of observations, for any $j \le n$: $f_j(\theta) := \ell(y_j, \langle \theta, \Phi(x_j) \rangle)$.
  One observation at each step ⇒ complexity $O(d)$ per iteration.
  For the empirical risk $\hat{R}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, \langle \theta, \Phi(x_i) \rangle)$:
  ◮ At each step $k \in \mathbb{N}^*$, sample $I_k \sim \mathcal{U}\{1, \dots, n\}$ and use
  $f'_{I_k}(\theta_{k-1}) = \ell'(y_{I_k}, \langle \theta_{k-1}, \Phi(x_{I_k}) \rangle)\, \Phi(x_{I_k})$;
  ◮ with $\mathcal{F}_k = \sigma((x_i, y_i)_{1 \le i \le n}, (I_i)_{1 \le i \le k})$,
  $\mathbb{E}[f'_{I_k}(\theta_{k-1}) \mid \mathcal{F}_{k-1}] = \frac{1}{n} \sum_{i=1}^{n} \ell'(y_i, \langle \theta_{k-1}, \Phi(x_i) \rangle)\, \Phi(x_i) = \hat{R}'(\theta_{k-1})$.
  Mathematical framework: smoothness and/or strong convexity.
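A sketch of this sampling scheme, again for the squared loss; each iteration touches a single row of the data, hence the $O(d)$ cost:

```python
import numpy as np

def sgd_erm(X, y, point_grad, gamma, n_iters, seed=0):
    # At step k, sample I_k ~ U{1,...,n} and follow the single-example
    # gradient, an unbiased estimate of hat_R'(theta_{k-1}).
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    for k in range(1, n_iters + 1):
        i = rng.integers(n)                      # I_k, uniform over the data
        theta = theta - gamma(k) * point_grad(theta, X[i], y[i])
    return theta

def ls_point_grad(theta, x_i, y_i):
    # ell'(y_i, <theta, x_i>) * x_i for the squared loss.
    return -(y_i - x_i @ theta) * x_i
```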

  10. Mathematical Framework: Smoothness
  ◮ A function $g : \mathbb{R}^d \to \mathbb{R}$ is $L$-smooth if and only if it is twice differentiable and, for all $\theta \in \mathbb{R}^d$, the eigenvalues of $g''(\theta)$ are upper-bounded by $L$.
  Equivalently, for all $\theta, \theta' \in \mathbb{R}^d$:
  $g(\theta) \le g(\theta') + \langle g'(\theta'), \theta - \theta' \rangle + \frac{L}{2} \|\theta - \theta'\|^2$.

  11. Mathematical Framework: Strong Convexity
  ◮ A twice differentiable function $g : \mathbb{R}^d \to \mathbb{R}$ is $\mu$-strongly convex if and only if, for all $\theta \in \mathbb{R}^d$, the eigenvalues of $g''(\theta)$ are lower-bounded by $\mu$.
  Equivalently, for all $\theta, \theta' \in \mathbb{R}^d$:
  $g(\theta) \ge g(\theta') + \langle g'(\theta'), \theta - \theta' \rangle + \frac{\mu}{2} \|\theta - \theta'\|^2$.
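For a quadratic objective both constants can be read off the (constant) Hessian. A sketch for the regularized least-squares risk, where $L$ and $\mu$ are the extreme eigenvalues of $X^\top X / n + \mu_{\mathrm{reg}} I$:

```python
import numpy as np

def smoothness_strong_convexity(X, mu_reg):
    # Hessian of the regularized least-squares risk: H = X^T X / n + mu_reg * I.
    # Smoothness constant L = largest eigenvalue; strong convexity mu = smallest.
    n, d = X.shape
    H = X.T @ X / n + mu_reg * np.eye(d)
    eigvals = np.linalg.eigvalsh(H)              # sorted ascending
    return eigvals[-1], eigvals[0]               # (L, mu)
```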

  12. Application to Machine Learning
  ◮ We consider a loss that is almost surely convex in $\theta$; thus $\hat{R}$ and $R$ are convex.
  ◮ The Hessian of $\hat{R}$ is approximately a covariance matrix:
  $\hat{R}''(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell''(y_i, \langle \theta, \Phi(x_i) \rangle)\, \Phi(x_i) \Phi(x_i)^\top$
  (compare $\frac{1}{n} \sum_{i=1}^{n} \Phi(x_i) \Phi(x_i)^\top \simeq \mathbb{E}[\Phi(X) \Phi(X)^\top]$).
  ◮ If $\ell$ is smooth and $\mathbb{E}[\|\Phi(X)\|^2] \le r^2$, then $R$ is smooth.
  ◮ If $\ell$ is $\mu$-strongly convex and the data has an invertible covariance matrix (low correlation/dimension), then $R$ is strongly convex.
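A sketch of this Hessian-as-weighted-covariance identity for the logistic loss, for which $\ell''(y, z) = \sigma(yz)(1 - \sigma(yz))$ with $\sigma$ the sigmoid:

```python
import numpy as np

def logistic_hessian(theta, X, y):
    # hat_R''(theta) = (1/n) sum_i ell''(y_i, <theta, x_i>) x_i x_i^T:
    # a weighted covariance matrix of the features.
    s = 1.0 / (1.0 + np.exp(-y * (X @ theta)))   # sigmoid of the margins
    w = s * (1.0 - s)                            # per-example curvature weights
    return (X * w[:, None]).T @ X / X.shape[0]
```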

  13. Analysis: Behaviour of $(\theta_k)_{k \ge 0}$
  $\theta_k = \theta_{k-1} - \gamma_k f'_k(\theta_{k-1})$.
  Importance of the learning rate (the sequence of step sizes) $(\gamma_k)_{k \ge 0}$.
  For smooth and strongly convex problems, traditional analysis (Fabian, 1968; Robbins and Siegmund, 1985) shows that $\theta_k \to \theta_*$ almost surely if
  $\sum_{k=1}^{\infty} \gamma_k = \infty$ and $\sum_{k=1}^{\infty} \gamma_k^2 < \infty$,
  together with asymptotic normality $\sqrt{k}\,(\theta_k - \theta_*) \xrightarrow{d} \mathcal{N}(0, V)$ for $\gamma_k = \gamma_0 / k$, $\gamma_0 \ge 1/\mu$.
  ◮ The limit variance scales as $1/\mu^2$.
  ◮ Very sensitive to ill-conditioned problems.
  ◮ $\mu$ is generally unknown, so the step size is hard to choose.
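A sketch of the classical schedule family $\gamma_k = \gamma_0 / k^\alpha$; the comments record how the two summability conditions above constrain $\alpha$:

```python
def robbins_monro_steps(gamma0, alpha):
    # gamma_k = gamma0 / k**alpha.  For 1/2 < alpha <= 1, sum gamma_k
    # diverges while sum gamma_k**2 converges: the conditions for
    # almost-sure convergence quoted above.  alpha = 1 gives the CLT rate
    # but requires gamma0 >= 1/mu, with mu usually unknown in practice.
    def gamma(k):
        return gamma0 / k ** alpha
    return gamma
```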

  14. Polyak-Ruppert Averaging
  Introduced by Polyak and Juditsky (1992) and Ruppert (1988):
  $\bar{\theta}_k = \frac{1}{k+1} \sum_{i=0}^{k} \theta_i$.
  ◮ Offline, averaging reduces the effect of the noise.
  ◮ Online, the average can be updated recursively: $\bar{\theta}_{k+1} = \frac{1}{k+2}\, \theta_{k+1} + \frac{k+1}{k+2}\, \bar{\theta}_k$.
  ◮ One could also consider other averaging schemes (e.g., Lacoste-Julien et al. (2012)).
  [Figure: raw iterates $\theta_k$ oscillating around $\theta_*$, while the averaged iterates $\bar{\theta}_k$ converge more smoothly.]
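A sketch of SGD with online Polyak-Ruppert averaging, using the recursive update above so that the history of iterates never needs to be stored:

```python
import numpy as np

def averaged_sgd(X, y, point_grad, gamma, n_iters, seed=0):
    # SGD with Polyak-Ruppert averaging; the running mean is kept online:
    # theta_bar <- theta_bar + (theta_new - theta_bar) / (number of iterates).
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    theta_bar = theta.copy()                     # average of theta_0, ..., theta_k
    for k in range(n_iters):
        i = rng.integers(n)
        theta = theta - gamma(k + 1) * point_grad(theta, X[i], y[i])
        theta_bar += (theta - theta_bar) / (k + 2)
    return theta_bar
```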

  15. Convex Stochastic Approximation: Convergence
  ◮ Known global minimax rates of convergence for non-smooth problems (Nemirovsky and Yudin, 1983; Agarwal et al., 2012):
  ◮ Strongly convex: $O((\mu k)^{-1})$, attained by averaged stochastic gradient descent with $\gamma_k \propto (\mu k)^{-1}$.
  ◮ Non-strongly convex: $O(k^{-1/2})$, attained by averaged stochastic gradient descent with $\gamma_k \propto k^{-1/2}$.
  Smooth strongly convex problems:
  ◮ Rate $\frac{1}{\mu k}$ for $\gamma_k \propto k^{-1/2}$: adapts to strong convexity.

  16. Convergence rate for $f(\tilde{\theta}_k) - f(\theta_*)$, smooth $f$

               | min $\hat{R}$: SGD | min $\hat{R}$: GD | min $\hat{R}$: SAG                    | min $R$: SGD
    Convex     | $O(1/\sqrt{k})$    | $O(1/k)$          | $O(1/k)$                              | $O(1/\sqrt{k})$
    Stgly-Cvx  | $O(1/(\mu k))$     | $O(e^{-\mu k})$   | $O((1 - (\mu \wedge \frac{1}{n}))^k)$ | $O(1/(\mu k))$

  ⊖ A gradient descent update costs $n$ times as much as an SGD update.
  Can we get the best of both worlds?


  18. Methods for Finite Sum Minimization
  ◮ GD: at step $k$, use $\frac{1}{n} \sum_{i=1}^{n} f'_i(\theta_k)$.
  ◮ SGD: at step $k$, sample $i_k \sim \mathcal{U}\{1, \dots, n\}$, use $f'_{i_k}(\theta_k)$.
  ◮ SAG: at step $k$,
  ◮ keep a "full gradient" $\frac{1}{n} \sum_{i=1}^{n} f'_i(\theta_{k_i})$, with each $\theta_{k_i} \in \{\theta_1, \dots, \theta_k\}$;
  ◮ sample $i_k \sim \mathcal{U}\{1, \dots, n\}$, use $\frac{1}{n} \left( \sum_{i=1}^{n} f'_i(\theta_{k_i}) - f'_{i_k}(\theta_{k_{i_k}}) + f'_{i_k}(\theta_k) \right)$.
  ⊕ An update costs the same as an SGD update.
  ⊖ Needs to store all gradients $f'_i(\theta_{k_i})$ at "points in the past".
  Some references:
  ◮ SAG: Schmidt et al. (2013); SAGA: Defazio et al. (2014a).
  ◮ SVRG: Johnson and Zhang (2013) (reduces the memory cost, but two epochs...).
  ◮ FINITO: Defazio et al. (2014b).
  ◮ S2GD: Konečný and Richtárik (2013)... and many others.
  See for example Niao He's lecture notes for a nice overview.
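A sketch of the SAG update, maintaining the gradient table and its running sum so that each step costs as much as an SGD step (constant step size $\gamma$, as in Schmidt et al. (2013)):

```python
import numpy as np

def sag(X, y, point_grad, gamma, n_iters, seed=0):
    # Store the most recent gradient of each f_i and their running sum;
    # each step refreshes one table entry and moves along the average.
    # O(d) work per step, O(nd) memory for the stored gradients.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    grad_table = np.zeros((n, d))    # f_i'(theta_{k_i}), "points in the past"
    grad_sum = np.zeros(d)           # sum over i of grad_table[i]
    for _ in range(n_iters):
        i = rng.integers(n)
        g_new = point_grad(theta, X[i], y[i])
        grad_sum += g_new - grad_table[i]        # swap in the fresh gradient
        grad_table[i] = g_new
        theta = theta - gamma * grad_sum / n     # step along the averaged gradient
    return theta
```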

  19. Convergence rate for $f(\tilde{\theta}_k) - f(\theta_*)$, smooth objective $f$
  (Same rate table as on slide 16.)
  [Figure: objective versus iteration for GD, SGD, and SAG; from Schmidt et al. (2013).]

  20. Convergence rate for $f(\tilde{\theta}_k) - f(\theta_*)$, smooth objective $f$
  (Same rate table as on slide 16, annotated with the lower bounds α, β, γ.)
  Lower bounds:
  ◮ α: information-theoretic lower bounds for stochastic optimization, Agarwal et al. (2012).
  ◮ β: black-box first-order optimization, Nesterov (2004).
  ◮ γ: lower bounds for optimizing finite sums, Agarwal and Bottou (2014); Arjevani and Shamir (2016).

  21. Convergence rate for $f(\tilde{\theta}_k) - f(\theta_*)$, smooth objective $f$
  Replacing GD by accelerated gradient descent (AGD):

               | min $\hat{R}$: SGD | min $\hat{R}$: AGD     | min $\hat{R}$: SAG                    | min $R$: SGD
    Convex     | $O(1/\sqrt{k})$    | $O(1/k^2)$             | $O(1/k)$                              | $O(1/\sqrt{k})$
    Stgly-Cvx  | $O(1/(\mu k))$     | $O(e^{-\sqrt{\mu} k})$ | $O((1 - (\mu \wedge \frac{1}{n}))^k)$ | $O(1/(\mu k))$

  Lower bounds:
  ◮ α: information-theoretic lower bounds for stochastic optimization, Agarwal et al. (2012).
  ◮ β: black-box first-order optimization, Nesterov (2004).
  ◮ γ: lower bounds for optimizing finite sums, Agarwal and Bottou (2014).
