 
Bridging the gap between Stochastic Approximation and Markov chains
Aymeric DIEULEVEUT
ENS Paris, INRIA
17 November 2017
Joint work with Francis Bach and Alain Durmus.
Outline
◮ Introduction to Stochastic Approximation for Machine Learning.
◮ Markov chain: a simple yet insightful point of view on constant step size Stochastic Approximation.
Supervised Machine Learning
◮ Consider an input/output pair $(X, Y) \in \mathcal{X} \times \mathcal{Y}$, following some unknown distribution $\rho$.
◮ $\mathcal{Y} = \mathbb{R}$ (regression) or $\{-1, 1\}$ (classification).
◮ We want to find a function $\theta : \mathcal{X} \to \mathbb{R}$, such that $\theta(X)$ is a good prediction for $Y$.
◮ Prediction as a linear function $\langle \theta, \Phi(X) \rangle$ of features $\Phi(X) \in \mathbb{R}^d$.
◮ Consider a loss function $\ell : \mathcal{Y} \times \mathbb{R} \to \mathbb{R}_+$: squared loss, logistic loss, 0-1 loss, etc.
◮ We define the risk (generalization error) as $R(\theta) := \mathbb{E}_\rho[\ell(Y, \langle \theta, \Phi(X) \rangle)]$.
Empirical Risk minimization (I)
◮ Data: $n$ observations $(x_i, y_i) \in \mathcal{X} \times \mathcal{Y}$, $i = 1, \dots, n$, i.i.d.
◮ $n$ very large, up to $10^9$
◮ Computer vision: $d = 10^4$ to $10^6$
◮ Empirical risk (or training error): $\hat{R}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, \langle \theta, \Phi(x_i) \rangle)$.
◮ Empirical risk minimization (regularized): find $\hat{\theta}$ solution of
  $\min_{\theta \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, \langle \theta, \Phi(x_i) \rangle) + \mu \Omega(\theta)$
  (convex data fitting term + regularizer).
Empirical Risk minimization (II)
◮ For example, least-squares regression:
  $\min_{\theta \in \mathbb{R}^d} \frac{1}{2n} \sum_{i=1}^{n} (y_i - \langle \theta, \Phi(x_i) \rangle)^2 + \mu \Omega(\theta)$,
◮ and logistic regression:
  $\min_{\theta \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^{n} \log(1 + \exp(-y_i \langle \theta, \Phi(x_i) \rangle)) + \mu \Omega(\theta)$.
◮ Two fundamental questions: (1) computing $\hat{\theta}$ and (2) analyzing $\hat{\theta}$.
2 important insights for ML, Bottou and Bousquet (2008):
1. No need to optimize below statistical error,
2. Testing error is more important than training error.
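The two objectives above can be written out directly. The following is a minimal sketch, not taken from the slides: the helper names, the toy data, and the choice $\Omega(\theta) = \frac{1}{2}\|\theta\|^2$ for the regularizer are assumptions made for illustration.

```python
import math

def dot(u, v):
    """Inner product <u, v> of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def least_squares_risk(theta, features, labels):
    """(1 / 2n) * sum_i (y_i - <theta, Phi(x_i)>)^2, without the regularizer."""
    n = len(labels)
    return sum((y - dot(theta, phi)) ** 2 for phi, y in zip(features, labels)) / (2 * n)

def logistic_risk(theta, features, labels):
    """(1 / n) * sum_i log(1 + exp(-y_i <theta, Phi(x_i)>)), without the regularizer."""
    n = len(labels)
    return sum(math.log1p(math.exp(-y * dot(theta, phi)))
               for phi, y in zip(features, labels)) / n

def regularized(risk, theta, features, labels, mu):
    """Data fitting term plus mu * Omega(theta); Omega = ||theta||^2 / 2 is an assumed choice."""
    return risk(theta, features, labels) + mu * 0.5 * dot(theta, theta)

# Hypothetical toy data: three observations with features already computed.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
labels = [1.0, -1.0, 1.0]
theta0 = [0.0, 0.0]

ls_at_zero = least_squares_risk(theta0, features, labels)
logistic_at_zero = logistic_risk(theta0, features, labels)
```

At $\theta = 0$ every logistic term equals $\log 2$, which gives a quick sanity check on the implementation.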
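Stochastic Approximation
◮ Goal: $\min_{\theta \in \mathbb{R}^d} f(\theta)$ given unbiased gradient estimates $f'_n$.
◮ $\theta^* := \operatorname{argmin}_{\mathbb{R}^d} f(\theta)$.
[Figure: iterates $\theta_0, \theta_1$ and the minimizer $\theta^*$.]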
Stochastic Approximation in Machine learning
Loss for a single pair of observations, for any $k \le n$:
$f_k(\theta) = \ell(y_k, \langle \theta, \Phi(x_k) \rangle)$.
◮ Use one observation at each step!
◮ Complexity: $O(d)$ per iteration.
◮ Can be used for both true risk and empirical risk.
Stochastic Approximation in Machine learning
◮ For the empirical error $\hat{R}(\theta) = \frac{1}{n} \sum_{k=1}^{n} \ell(y_k, \langle \theta, \Phi(x_k) \rangle)$:
  ◮ At each step $k \in \mathbb{N}^*$, sample $I_k \sim \mathcal{U}\{1, \dots, n\}$.
  ◮ $\mathcal{F}_k = \sigma((x_i, y_i)_{1 \le i \le n}, (I_i)_{1 \le i \le k})$.
  ◮ At step $k \in \mathbb{N}^*$, use:
    $f'_{I_k}(\theta_{k-1}) = \ell'(y_{I_k}, \langle \theta_{k-1}, \Phi(x_{I_k}) \rangle)$,
    $\mathbb{E}[f'_{I_k}(\theta_{k-1}) \mid \mathcal{F}_{k-1}] = \hat{R}'(\theta_{k-1})$.
◮ For the risk $R(\theta) = \mathbb{E} f_k(\theta) = \mathbb{E}\, \ell(y_k, \langle \theta, \Phi(x_k) \rangle)$:
  ◮ For $0 \le k \le n$, $\mathcal{F}_k = \sigma((x_i, y_i)_{1 \le i \le k})$.
  ◮ At step $0 < k \le n$, use a new point independent of $\theta_{k-1}$:
    $f'_k(\theta_{k-1}) = \ell'(y_k, \langle \theta_{k-1}, \Phi(x_k) \rangle)$,
    $\mathbb{E}[f'_k(\theta_{k-1}) \mid \mathcal{F}_{k-1}] = R'(\theta_{k-1})$.
  ◮ Single pass through the data, running time $O(nd)$,
  ◮ "Automatic" regularization.
Analysis: key assumptions are smoothness and/or strong convexity.
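The two sampling schemes above differ only in which observation feeds the gradient at step $k$. The sketch below illustrates this with the logistic loss on synthetic data; the data generator, the function names, and the step size schedule $\gamma_k = \gamma_0 / \sqrt{k}$ are assumptions for illustration, not choices stated on this slide.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def logistic_grad(theta, phi, y):
    """Gradient in theta of f(theta) = log(1 + exp(-y * <theta, phi>))."""
    margin = y * dot(theta, phi)
    coeff = -y / (1.0 + math.exp(margin))
    return [coeff * p for p in phi]

def empirical_risk(theta, data):
    """Average logistic loss over the data set."""
    return sum(math.log1p(math.exp(-y * dot(theta, phi))) for phi, y in data) / len(data)

def sgd(data, gamma0, n_steps, rng, single_pass):
    """SGD from theta_0 = 0 with step size gamma0 / sqrt(k) (an assumed schedule)."""
    if single_pass and n_steps > len(data):
        raise ValueError("a single pass cannot use more steps than observations")
    theta = [0.0] * len(data[0][0])
    for k in range(1, n_steps + 1):
        if single_pass:
            phi, y = data[k - 1]                     # fresh observation: targets the risk R
        else:
            phi, y = data[rng.randrange(len(data))]  # I_k uniform: targets the empirical risk
        grad = logistic_grad(theta, phi, y)
        step = gamma0 / math.sqrt(k)
        theta = [t - step * g for t, g in zip(theta, grad)]
    return theta

# Hypothetical synthetic data: 2-d features, noisy linear labels in {-1, 1}.
rng = random.Random(0)
theta_true = [2.0, -1.0]
data = []
for _ in range(2000):
    phi = [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)]
    y = 1.0 if dot(theta_true, phi) + rng.gauss(0.0, 0.5) > 0 else -1.0
    data.append((phi, y))

theta_erm = sgd(data, 1.0, 5000, rng, single_pass=False)
theta_one_pass = sgd(data, 1.0, len(data), rng, single_pass=True)
risk_at_zero = empirical_risk([0.0, 0.0], data)
```

Each iteration touches a single observation, so its cost is $O(d)$ in both variants; only the single-pass variant is limited to $n$ steps.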
Mathematical framework: Smoothness
◮ A function $g : \mathbb{R}^d \to \mathbb{R}$ is $L$-smooth if and only if it is twice differentiable and
  $\forall \theta \in \mathbb{R}^d$, eigenvalues$[g''(\theta)] \le L$.
For all $\theta, \theta' \in \mathbb{R}^d$:
$g(\theta) \le g(\theta') + \langle g'(\theta'), \theta - \theta' \rangle + \frac{L}{2} \|\theta - \theta'\|^2$
Mathematical framework: Strong Convexity
◮ A twice differentiable function $g : \mathbb{R}^d \to \mathbb{R}$ is $\mu$-strongly convex if and only if
  $\forall \theta \in \mathbb{R}^d$, eigenvalues$[g''(\theta)] \ge \mu$.
For all $\theta, \theta' \in \mathbb{R}^d$:
$g(\theta) \ge g(\theta') + \langle g'(\theta'), \theta - \theta' \rangle + \frac{\mu}{2} \|\theta - \theta'\|^2$
Application to machine learning
◮ We consider an a.s. convex loss in $\theta$. Thus $\hat{R}$ and $R$ are convex.
◮ Hessian of $\hat{R}$ (resp. $R$) $\approx$ covariance matrix $\frac{1}{n} \sum_{i=1}^{n} \Phi(x_i) \Phi(x_i)^\top$ (resp. $\mathbb{E}[\Phi(X) \Phi(X)^\top]$):
  $R''(\theta) = \mathbb{E}[\ell''(Y, \langle \theta, \Phi(X) \rangle)\, \Phi(X) \Phi(X)^\top]$
◮ If $\ell$ is smooth, and $\mathbb{E}[\|\Phi(X)\|^2] \le r^2$, $R$ is smooth.
◮ If $\ell$ is $\mu$-strongly convex, and data has an invertible covariance matrix (low correlation/dimension), $R$ is strongly convex.
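As a worked illustration of the Hessian formula, the two losses used earlier can be plugged in. This is a sketch under the slide's assumption $\mathbb{E}[\|\Phi(X)\|^2] \le r^2$; the notation $\ell''$ for the second derivative in the prediction argument $z$ and $s$ for the sigmoid are introduced here and do not come from the slides.

```latex
% Least squares: \ell(y, z) = \tfrac{1}{2}(y - z)^2, so \ell''(y, z) = 1 and
R''(\theta) = \mathbb{E}\big[\Phi(X)\Phi(X)^\top\big].

% Logistic loss: \ell(y, z) = \log(1 + e^{-yz}) with y \in \{-1, 1\}.
% Writing s(u) = 1 / (1 + e^{-u}):
\ell''(y, z) = s(yz)\, s(-yz) \le \tfrac{1}{4},
\qquad
R''(\theta) \preccurlyeq \tfrac{1}{4}\, \mathbb{E}\big[\Phi(X)\Phi(X)^\top\big].

% The largest eigenvalue of a positive semi-definite matrix is at most its trace:
\lambda_{\max}\big(\mathbb{E}[\Phi(X)\Phi(X)^\top]\big)
  \le \mathbb{E}\big[\|\Phi(X)\|^2\big] \le r^2 .

% Hence R is r^2-smooth for least squares and (r^2 / 4)-smooth for logistic regression.
```

Note that the logistic loss has no positive lower bound on $\ell''$, so this argument gives smoothness but not strong convexity of $R$ in that case.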
Analysis: behaviour of $(\theta_n)_{n \ge 0}$
$\theta_n = \theta_{n-1} - \gamma_n f'_n(\theta_{n-1})$
Importance of the learning rate (or sequence of step sizes) $(\gamma_n)_{n \ge 0}$.
For smooth and strongly convex problems, traditional analysis (Fabian (1968); Robbins and Siegmund (1985)) shows that $\theta_n \to \theta^*$ almost surely if
$\sum_{n=1}^{\infty} \gamma_n = \infty$ and $\sum_{n=1}^{\infty} \gamma_n^2 < \infty$.
Asymptotic normality also holds: $\sqrt{n}(\theta_n - \theta^*) \xrightarrow{d} \mathcal{N}(0, V)$, for $\gamma_n = \gamma_0 / n$, $\gamma_0 \ge 1/\mu$.
◮ Limit variance scales as $1/\mu^2$
◮ Very sensitive to ill-conditioned problems.
◮ $\mu$ generally unknown, so hard to choose the step size...
Polyak Ruppert averaging
Introduced by Polyak and Juditsky (1992) and Ruppert (1988):
$\bar{\theta}_n = \frac{1}{n+1} \sum_{k=0}^{n} \theta_k$.
[Figure: iterates $\theta_0, \theta_1, \dots, \theta_n$ and the optimum $\theta^*$.]
◮ off line averaging reduces the noise effect.
◮ on line computing: $\bar{\theta}_{n+1} = \frac{1}{n+2} \theta_{n+1} + \frac{n+1}{n+2} \bar{\theta}_n$.
◮ one could also consider other averaging schemes (e.g., Lacoste-Julien et al. (2012)).
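The on line form of the average can be sketched as follows, using the definition $\bar{\theta}_n = \frac{1}{n+1} \sum_{k=0}^{n} \theta_k$ with the average started at $\theta_0$. The function name and the example iterates are hypothetical.

```python
def running_average(iterates):
    """Yield bar_theta_n = (theta_0 + ... + theta_n) / (n + 1) without storing past iterates."""
    avg = None
    for n, theta in enumerate(iterates):
        if avg is None:
            avg = list(theta)  # bar_theta_0 = theta_0
        else:
            # bar_theta_n = bar_theta_{n-1} + (theta_n - bar_theta_{n-1}) / (n + 1)
            avg = [a + (t - a) / (n + 1) for a, t in zip(avg, theta)]
        yield list(avg)

# Hypothetical iterates, only to compare with the off line average.
iterates = [[1.0, 2.0], [3.0, 4.0], [5.0, 9.0]]
online = list(running_average(iterates))
offline = [sum(col) / len(iterates) for col in zip(*iterates)]
```

The update costs $O(d)$ per step and needs only the current average in memory, which is why averaging adds essentially nothing to the cost of SGD.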
Convex stochastic approximation: convergence results
◮ Known global minimax rates of convergence for non-smooth problems Nemirovsky and Yudin (1983); Agarwal et al. (2012)
  ◮ Strongly convex: $O((\mu n)^{-1})$. Attained by averaged stochastic gradient descent with $\gamma_n \propto (\mu n)^{-1}$
  ◮ Non-strongly convex: $O(n^{-1/2})$. Attained by averaged stochastic gradient descent with $\gamma_n \propto n^{-1/2}$
◮ Smooth strongly convex problems
  ◮ All step sizes $\gamma_n = C n^{-\alpha}$ with $\alpha \in (1/2, 1)$, with averaging, lead to $O(n^{-1})$:
    ◮ asymptotic normality Polyak and Juditsky (1992), with variance independent of $\mu$!
    ◮ non asymptotic analysis Bach and Moulines (2011).
  ◮ Rate $\frac{1}{\mu n}$ for $\gamma_n \propto n^{-1/2}$: adapts to strong convexity.
Stochastic Approximation: take home message
◮ Powerful algorithm:
  ◮ Simple to implement
  ◮ Cheap
  ◮ No regularization needed
◮ Convergence guarantees:
  ◮ $\gamma_n = \frac{1}{\sqrt{n}}$ good choice in most situations
Problems:
◮ Initial conditions can be forgotten slowly: could we use even larger step sizes?
Motivation 1/2. Large step sizes!
[Figure: $\log_{10}[R(\bar{\theta}_n) - R(\theta^*)]$ against $\log_{10}(n)$.]
Logistic regression. Final iterate (dashed), and averaged recursion (solid).
Motivation 1/2. Large step sizes, real data
[Figure: $\log_{10}[R(\bar{\theta}_n) - R(\theta^*)]$ against $\log_{10}(n)$.]
Logistic regression, Covertype dataset, $n = 581012$, $d = 54$. Comparison between a constant learning rate and a learning rate decaying as $\frac{1}{\sqrt{n}}$.
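Motivation 2/2. Difference between quadratic and logistic loss
[Figure: two panels, Logistic Regression and Least-Squares Regression.]
◮ Logistic Regression: $\mathbb{E} R(\bar{\theta}_n) - R(\theta^*) = O(\gamma^2)$ with $\gamma = 1/(2R^2)$
◮ Least-Squares Regression: $\mathbb{E} R(\bar{\theta}_n) - R(\theta^*) = O\left(\frac{1}{n}\right)$ with $\gamma = 1/(2R^2)$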
Larger step sizes: Least-mean-square algorithm
◮ Least-squares: $R(\theta) = \frac{1}{2} \mathbb{E}[(Y - \langle \Phi(X), \theta \rangle)^2]$ with $\theta \in \mathbb{R}^d$
  ◮ SGD = least-mean-square algorithm
  ◮ Usually studied without averaging and decreasing step-sizes.
◮ New analysis for averaging and constant step-size $\gamma = 1/(4R^2)$ Bach and Moulines (2013)
  ◮ Assume $\|\Phi(x_n)\| \le r$ and $|y_n - \langle \Phi(x_n), \theta^* \rangle| \le \sigma$ almost surely
  ◮ No assumption regarding lowest eigenvalues of the Hessian
  ◮ Main result: $\mathbb{E} R(\bar{\theta}_n) - R(\theta^*) \le \frac{4 \sigma^2 d}{n} + \frac{\|\theta_0 - \theta^*\|^2}{\gamma n}$
◮ Matches statistical lower bound Tsybakov (2003).
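The constant step size, averaged least-mean-square recursion can be sketched on synthetic data as follows. Everything about the data is an assumption for illustration: features uniform on $[-1, 1]^2$ (so $\|\Phi\|^2 \le 2$, taken here as the $R^2$ of the step size, and $\mathbb{E}[\Phi \Phi^\top] = I/3$), bounded uniform noise, and a hypothetical $\theta^*$. The slide's bound holds in expectation, so comparing it with a single run is only indicative.

```python
import random

def averaged_lms(samples, gamma, theta0):
    """Constant step size LMS; returns the final iterate and the Polyak-Ruppert average."""
    theta = list(theta0)
    avg = list(theta0)  # bar_theta_0 = theta_0
    for n, (phi, y) in enumerate(samples, start=1):
        residual = sum(t * p for t, p in zip(theta, phi)) - y
        theta = [t - gamma * residual * p for t, p in zip(theta, phi)]
        avg = [a + (t - a) / (n + 1) for a, t in zip(avg, theta)]
    return theta, avg

# Hypothetical problem: d = 2, features uniform on [-1, 1]^2, noise uniform on [-sigma, sigma].
rng = random.Random(0)
d, n, sigma = 2, 20000, 0.1
theta_star = [1.0, -2.0]
samples = []
for _ in range(n):
    phi = [rng.uniform(-1.0, 1.0) for _ in range(d)]
    y = sum(t * p for t, p in zip(theta_star, phi)) + rng.uniform(-sigma, sigma)
    samples.append((phi, y))

r2 = float(d)            # ||Phi||^2 <= d for these features
gamma = 1.0 / (4.0 * r2)
theta0 = [0.0] * d
theta_last, theta_bar = averaged_lms(samples, gamma, theta0)

def excess_risk(theta):
    """R(theta) - R(theta_star) = 0.5 * (theta - theta_star)^T H (theta - theta_star), H = I / 3."""
    return 0.5 * sum((t - s) ** 2 for t, s in zip(theta, theta_star)) / 3.0

dist0_sq = sum((t - s) ** 2 for t, s in zip(theta0, theta_star))
bound = 4.0 * sigma ** 2 * d / n + dist0_sq / (gamma * n)
excess_bar = excess_risk(theta_bar)
```

No strong convexity constant enters the choice of `gamma`: only the bound on the feature norm is used, which mirrors the point made on the slide.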
Related work in Sierra
Led to numerous (non trivial) extensions, at least in our lab!
◮ Beyond parametric models: Non Parametric Stochastic Approximation with Large step sizes. Dieuleveut and Bach (2015)
◮ Improved Sampling: Averaged least-mean-squares: bias-variance trade-offs and optimal sampling distributions. Défossez and Bach (2015)
◮ Acceleration: Harder, Better, Faster, Stronger Convergence Rates for Least-Squares Regression. Dieuleveut et al. (2016)
◮ Beyond smoothness and euclidean geometry: Stochastic Composite Least-Squares Regression with convergence rate $O(1/n)$. Flammarion and Bach (2017)
SGD: a homogeneous Markov chain
Consider an $L$-smooth and $\mu$-strongly convex function $R$.
SGD with a step-size $\gamma > 0$ is a homogeneous Markov chain:
$\theta^\gamma_{k+1} = \theta^\gamma_k - \gamma \left[ R'(\theta^\gamma_k) + \varepsilon_{k+1}(\theta^\gamma_k) \right]$,
◮ satisfies Markov property
◮ is homogeneous, for $\gamma$ constant, $(\varepsilon_k)_{k \in \mathbb{N}}$ i.i.d.
Also assume:
◮ $R'_k = R' + \varepsilon_{k+1}$ is almost surely $L$-co-coercive.
◮ Bounded moments $\mathbb{E}[\|\varepsilon_k(\theta^*)\|^4] < \infty$.
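A one-dimensional toy case makes the Markov chain point of view concrete. The sketch below assumes $R(\theta) = \theta^2 / 2$ (so $L = \mu = 1$ and $\theta^* = 0$) and additive standard Gaussian noise that does not depend on $\theta$; both are illustrative choices, not the general setting of the slide. The recursion becomes $\theta_{k+1} = (1 - \gamma)\theta_k - \gamma \varepsilon_{k+1}$, whose stationary variance $v$ solves $v = (1 - \gamma)^2 v + \gamma^2$, that is $v = \gamma / (2 - \gamma)$.

```python
import random

def constant_step_sgd_chain(gamma, n_steps, rng, theta0=0.0):
    """Simulate theta_{k+1} = theta_k - gamma * (R'(theta_k) + eps_{k+1}) for R(theta) = theta^2 / 2."""
    theta = theta0
    path = []
    for _ in range(n_steps):
        eps = rng.gauss(0.0, 1.0)  # assumed noise: i.i.d. N(0, 1), independent of theta
        theta = theta - gamma * (theta + eps)
        path.append(theta)
    return path

def stationary_variance(gamma):
    """Fixed point of v = (1 - gamma)^2 * v + gamma^2 for this toy chain."""
    return gamma / (2.0 - gamma)

rng = random.Random(0)
gamma = 0.1
path = constant_step_sgd_chain(gamma, 200000, rng, theta0=1.0)
tail = path[1000:]  # discard a burn-in so the initial condition is forgotten
mean_tail = sum(tail) / len(tail)
var_tail = sum((t - mean_tail) ** 2 for t in tail) / len(tail)
```

With a constant step size the iterates do not converge to $\theta^*$: they keep fluctuating around it with a variance of order $\gamma$, while their time average stays close to $\theta^*$ in this quadratic case.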
Stochastic gradient descent as a Markov Chain: Analysis framework †
◮ Existence of a limit distribution $\pi_\gamma$, and linear convergence to this distribution: $\theta^\gamma_n \xrightarrow{d} \pi_\gamma$.
◮ Convergence of second order moments of the chain: $\bar{\theta}^\gamma_n \xrightarrow[n \to \infty]{L^2} \bar{\theta}_\gamma := \mathbb{E}_{\pi_\gamma}[\theta]$.
◮ Behavior under the limit distribution ($\gamma \to 0$): $\bar{\theta}_\gamma = \theta^* + {?}$.
⇒ Provable convergence improvement with extrapolation tricks.
† Dieuleveut, Durmus, Bach [2017].