  1. MIT 9.520/6.860, Fall 2018. Statistical Learning Theory and Applications. Class 06: Learning with Stochastic Gradients. Sasha Rakhlin.

  2. Why Optimization? Much (but not all) of machine learning: write down an objective function involving data and parameters, then find good (or optimal) parameters through optimization. Key idea: find a near-optimal solution by iteratively using only local information about the objective (e.g. gradient, Hessian).

  3. Motivating example: Newton's Method. Newton's method in 1d:
  $$w_{t+1} = w_t - \big(f''(w_t)\big)^{-1} f'(w_t)$$
  Example (parabola): $f(w) = aw^2 + bw + c$. Start with any $w_1$. Then Newton's method gives
  $$w_2 = w_1 - (2a)^{-1}(2aw_1 + b), \quad \text{which means} \quad w_2 = -b/(2a).$$
  It finds the minimum of $f$ in one step, no matter where you start!
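To make this concrete, here is a minimal Python sketch (not from the slides) of the 1d Newton update applied to a parabola; the coefficients `a`, `b`, `c` and the starting point are arbitrary illustrative values.

```python
# Newton's method in 1d: w_{t+1} = w_t - f'(w_t) / f''(w_t).
# Applied to a parabola f(w) = a*w^2 + b*w + c, one step reaches -b/(2a).

def newton_step(w, f_prime, f_double_prime):
    """One Newton update in 1d."""
    return w - f_prime(w) / f_double_prime(w)

a, b, c = 3.0, -12.0, 5.0            # illustrative coefficients
f_prime = lambda w: 2 * a * w + b    # f'(w)
f_double_prime = lambda w: 2 * a     # f''(w), constant for a parabola

w1 = 100.0                           # any starting point
w2 = newton_step(w1, f_prime, f_double_prime)
print(w2, -b / (2 * a))              # both print 2.0: minimum found in one step
```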

  4. Newton's method in multiple dimensions:
  $$w_{t+1} = w_t - [\nabla^2 f(w_t)]^{-1} \nabla f(w_t)$$
  (here $\nabla^2 f(w_t)$ is the Hessian, assumed invertible).

  5. Recalling Least Squares. The least-squares objective (without $1/n$ normalization) is
  $$f(w) = \sum_{i=1}^{n} (y_i - x_i^T w)^2 = \|Y - Xw\|^2.$$
  Calculate: $\nabla^2 f(w) = 2X^T X$ and $\nabla f(w) = -2X^T(Y - Xw)$. Taking $w_1 = 0$, Newton's method gives
  $$w_2 = 0 + (2X^T X)^{-1}\, 2X^T(Y - X \cdot 0) = (X^T X)^{-1} X^T Y,$$
  which is the least-squares solution (global minimum). Again, one step is enough. Verify: if $f(w) = \|Y - Xw\|^2 + \lambda \|w\|^2$, then $(X^T X)$ becomes $(X^T X + \lambda I)$.
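A small numerical check (not from the slides): one Newton step from $w_1 = 0$ on a synthetic least-squares problem recovers the closed-form solution. The data-generating choices below are illustrative.

```python
import numpy as np

# One Newton step on f(w) = ||Y - Xw||^2 from w1 = 0:
# w2 = w1 + (2 X^T X)^{-1} 2 X^T (Y - X w1) = (X^T X)^{-1} X^T Y.
rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.standard_normal((n, d))           # synthetic design matrix
Y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

w1 = np.zeros(d)
hessian = 2 * X.T @ X                     # Hessian of f (constant: f is quadratic)
grad = -2 * X.T @ (Y - X @ w1)            # gradient at w1
w2 = w1 - np.linalg.solve(hessian, grad)  # Newton step

w_ls = np.linalg.solve(X.T @ X, X.T @ Y)  # closed-form least-squares solution
print(np.allclose(w2, w_ls))              # True: one Newton step suffices
```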

  6. What do we do if data $(x_1, y_1), \ldots, (x_n, y_n), \ldots$ are streaming? Can we incorporate data on the fly without having to re-compute the inverse $(X^T X)^{-1}$ at every step? $\rightarrow$ Online Learning.

  7. Let $w_1 = 0$. Let $w_t$ be the least-squares solution after seeing $t - 1$ data points. Can we get $w_t$ from $w_{t-1}$ cheaply? Newton's method will do it in one step (since the objective is quadratic). Let $C_t = \sum_{i=1}^{t} x_i x_i^T$ (or $+\lambda I$) and $X_t = [x_1, \ldots, x_t]^T$, $Y_t = [y_1, \ldots, y_t]^T$. Newton's method gives
  $$w_{t+1} = w_t + C_t^{-1} X_t^T (Y_t - X_t w_t).$$
  This can be simplified to
  $$w_{t+1} = w_t + C_t^{-1} x_t (y_t - x_t^T w_t),$$
  since residuals up to $t - 1$ are orthogonal to the columns of $X_{t-1}$. The bottleneck is computing $C_t^{-1}$. Can we update it quickly from $C_{t-1}^{-1}$?

  8. Sherman-Morrison formula: for invertible square $A$ and any $u, v$,
  $$(A + uv^T)^{-1} = A^{-1} - \frac{A^{-1} u v^T A^{-1}}{1 + v^T A^{-1} u}.$$
  Hence
  $$C_t^{-1} = C_{t-1}^{-1} - \frac{C_{t-1}^{-1} x_t x_t^T C_{t-1}^{-1}}{1 + x_t^T C_{t-1}^{-1} x_t}$$
  and (do the calculation)
  $$C_t^{-1} x_t = C_{t-1}^{-1} x_t \cdot \frac{1}{1 + x_t^T C_{t-1}^{-1} x_t}.$$
  Computation required: a $d \times d$ matrix $C_t^{-1}$ times a $d \times 1$ vector, i.e. $O(d^2)$ time to incorporate a new datapoint. Memory: $O(d^2)$. Unlike full regression from scratch, this does not depend on the amount of data $t$.
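The recursion above fits in a few lines of code. Below is a minimal recursive least squares sketch, assuming a ridge term $\lambda I$ so that every $C_t$ is invertible from the start; the dimensions, noise level, and $\lambda$ are illustrative choices.

```python
import numpy as np

# Recursive least squares in O(d^2) per datapoint: keep C_inv = C_t^{-1}
# up to date with the Sherman-Morrison rank-one formula.
def rls_update(w, C_inv, x, y):
    """Incorporate one point (x, y) into the least-squares solution."""
    Cx = C_inv @ x
    denom = 1.0 + x @ Cx
    C_inv -= np.outer(Cx, Cx) / denom   # Sherman-Morrison update of C_t^{-1}
    w = w + Cx * (y - x @ w) / denom    # uses C_t^{-1} x_t = C_{t-1}^{-1} x_t / denom
    return w, C_inv

rng = np.random.default_rng(1)
d, lam = 5, 1e-3
w_true = rng.standard_normal(d)
w, C_inv = np.zeros(d), np.eye(d) / lam   # start from w_1 = 0, C_0 = lam * I

for _ in range(1000):                     # streaming datapoints
    x = rng.standard_normal(d)
    y = x @ w_true + 0.01 * rng.standard_normal()
    w, C_inv = rls_update(w, C_inv, x, y)

print(np.round(w - w_true, 3))            # entries near zero
```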

  9. Recursive Least Squares (cont.) Recap: recursive least squares is
  $$w_{t+1} = w_t + C_t^{-1} x_t (y_t - x_t^T w_t),$$
  with a rank-one update of $C_{t-1}^{-1}$ to get $C_t^{-1}$. Consider throwing away the second-derivative information, replacing it with a scalar:
  $$w_{t+1} = w_t + \eta_t x_t (y_t - x_t^T w_t),$$
  where $\eta_t$ is a decreasing sequence.

  10. Online Least Squares. The algorithm
  $$w_{t+1} = w_t + \eta_t x_t (y_t - x_t^T w_t)$$
  - is recursive;
  - does not require storing the matrix $C_t^{-1}$;
  - does not require updating the inverse, only vector/vector multiplication.
  However, we are not guaranteed convergence in one step. How many steps are needed? How do we choose $\eta_t$?
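A minimal sketch of this update on streaming data; the schedule $\eta_t = 0.1/\sqrt{t}$ is an illustrative choice (how to choose it is exactly the question above).

```python
import numpy as np

# Online least squares with a scalar stepsize in place of C_t^{-1}.
rng = np.random.default_rng(2)
d = 5
w_true = rng.standard_normal(d)
w = np.zeros(d)

for t in range(1, 10001):
    x = rng.standard_normal(d)
    y = x @ w_true + 0.1 * rng.standard_normal()
    eta_t = 0.1 / np.sqrt(t)            # illustrative decreasing schedule
    w = w + eta_t * x * (y - x @ w)     # w_{t+1} = w_t + eta_t x_t (y_t - x_t^T w_t)

print(np.round(w - w_true, 2))          # approaches zero, but not in one step
```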

  11. First, recognize that
  $$-\nabla_w (y_t - x_t^T w)^2 = 2 x_t \left[ y_t - x_t^T w \right].$$
  Hence, the proposed method is gradient descent. Let us study it abstractly and then come back to least squares.

  12. Lemma: Let $f$ be convex and $G$-Lipschitz. Let $w^* \in \arg\min_w f(w)$ with $\|w^*\| \le B$. Then gradient descent
  $$w_{t+1} = w_t - \eta \nabla f(w_t)$$
  with $\eta = \frac{B}{G\sqrt{T}}$ and $w_1 = 0$ yields a sequence of iterates such that the average $\bar{w}_T = \frac{1}{T}\sum_{t=1}^{T} w_t$ of the trajectory satisfies
  $$f(\bar{w}_T) - f(w^*) \le \frac{BG}{\sqrt{T}}.$$
  Proof:
  $$\|w_{t+1} - w^*\|^2 = \|w_t - \eta \nabla f(w_t) - w^*\|^2 = \|w_t - w^*\|^2 + \eta^2 \|\nabla f(w_t)\|^2 - 2\eta \nabla f(w_t)^T (w_t - w^*).$$
  Rearrange:
  $$2\eta \nabla f(w_t)^T (w_t - w^*) = \|w_t - w^*\|^2 - \|w_{t+1} - w^*\|^2 + \eta^2 \|\nabla f(w_t)\|^2.$$
  Note: Lipschitzness of $f$ is equivalent to $\|\nabla f(w)\| \le G$.

  13. Summing over $t = 1, \ldots, T$, telescoping, dropping the negative term, using $w_1 = 0$, and dividing both sides by $2\eta$,
  $$\sum_{t=1}^{T} \nabla f(w_t)^T (w_t - w^*) \le \frac{1}{2\eta}\|w^*\|^2 + \frac{\eta T G^2}{2} \le BG\sqrt{T}.$$
  Convexity of $f$ means $f(w_t) - f(w^*) \le \nabla f(w_t)^T (w_t - w^*)$, and so
  $$\frac{1}{T}\sum_{t=1}^{T} f(w_t) - f(w^*) \le \frac{1}{T}\sum_{t=1}^{T} \nabla f(w_t)^T (w_t - w^*) \le \frac{BG}{\sqrt{T}}.$$
  The lemma follows by convexity of $f$ and Jensen's inequality. (end of proof)
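A numerical illustration of the lemma (not from the slides): gradient descent with the averaged iterate on the non-smooth, 1-Lipschitz function $f(w) = \|w - c\|_2$, whose minimizer is $w^* = c$; the target $c$ is an arbitrary choice.

```python
import numpy as np

# GD with averaging on f(w) = ||w - c||_2 (convex, G = 1 Lipschitz, w* = c),
# stepsize eta = B / (G * sqrt(T)) as in the lemma.
c = np.array([1.0, -2.0, 0.5])     # illustrative target; w* = c
B = np.linalg.norm(c)              # ||w*|| <= B
G, T = 1.0, 10000
eta = B / (G * np.sqrt(T))

def subgrad(w):
    r = w - c
    n = np.linalg.norm(r)
    return r / n if n > 0 else np.zeros_like(r)  # valid subgradient at w = c

w = np.zeros_like(c)               # w_1 = 0
avg = np.zeros_like(c)
for t in range(1, T + 1):
    avg += (w - avg) / t           # running average of the iterates
    w = w - eta * subgrad(w)

gap = np.linalg.norm(avg - c)      # f(avg) - f(w*), since f(w*) = 0
print(gap, B * G / np.sqrt(T))     # gap stays below the BG/sqrt(T) bound
```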

  14. Gradient descent can be written as
  $$w_{t+1} = \arg\min_{w} \; \eta \left\{ f(w_t) + \nabla f(w_t)^T (w - w_t) \right\} + \frac{1}{2}\|w - w_t\|^2,$$
  which can be interpreted as minimizing a linear approximation while staying close to the previous solution. Alternatively, it can be interpreted as building a second-order model locally (since we cannot fully trust the local information, unlike in our first parabola example).
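To see that this surrogate objective indeed recovers the gradient step, set its gradient in $w$ to zero (a one-line check, not spelled out on the slide):

```latex
% Setting the gradient of the surrogate objective to zero:
\nabla_w \left[ \eta\, \nabla f(w_t)^T (w - w_t) + \tfrac{1}{2}\|w - w_t\|^2 \right]
  = \eta \nabla f(w_t) + (w - w_t) = 0
\quad\Longrightarrow\quad
w_{t+1} = w_t - \eta \nabla f(w_t).
```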

  15. Remarks:
  - Gradient descent for non-smooth functions does not guarantee actual descent of the iterates $w_t$ (only of their average).
  - For constrained optimization problems over a set $K$, take the projected gradient step $w_{t+1} = \mathrm{Proj}_K(w_t - \eta \nabla f(w_t))$; the proof is essentially the same. (A projection sketch appears after this list.)
  - Can take stepsize $\eta_t = \frac{B}{G\sqrt{t}}$ to make the method horizon-independent.
  - Knowledge of $G$ and $B$ is not necessary (with appropriate changes).
  - Faster convergence holds under additional assumptions on $f$ (smoothness, strong convexity).
  - Last class: for smooth functions (gradient is $L$-Lipschitz), constant step size $1/L$ gives faster $O(1/T)$ convergence.
  - Gradients can be replaced with stochastic gradients (unbiased estimates).
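For the projected-step remark, here is a minimal sketch of projection onto an $\ell_2$ ball; the set $K$, its radius, and the stepsize are illustrative choices.

```python
import numpy as np

# Projected gradient step onto K = {w : ||w||_2 <= R}.
def proj_l2_ball(w, R):
    """Euclidean projection onto the L2 ball of radius R."""
    n = np.linalg.norm(w)
    return w if n <= R else w * (R / n)

# One projected step for f(w) = ||w - c||^2 with c outside the ball:
c, R, eta = np.array([3.0, 4.0]), 1.0, 0.1
w = np.zeros(2)
grad = 2 * (w - c)
w = proj_l2_ball(w - eta * grad, R)
print(w, np.linalg.norm(w))        # iterate stays inside K: norm <= 1
```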

  16. Stochastic Gradients. Suppose we only have access to an unbiased estimate $\nabla_t$ of $\nabla f(w_t)$ at step $t$; that is, $E[\nabla_t \mid w_t] = \nabla f(w_t)$. Then Stochastic Gradient Descent (SGD),
  $$w_{t+1} = w_t - \eta \nabla_t,$$
  enjoys the guarantee
  $$E[f(\bar{w}_T)] - f(w^*) \le \frac{BG}{\sqrt{T}},$$
  where $G$ is such that $E[\|\nabla_t\|^2] \le G^2$ for all $t$. Kind of amazing: at each step we go in a direction that is wrong (but correct on average) and still converge.
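A sketch of this phenomenon under an illustrative noise model (an assumption, not from the slides): the true gradient of a simple quadratic plus zero-mean Gaussian noise, so each step direction is wrong but correct on average.

```python
import numpy as np

# SGD where each gradient is the true gradient plus zero-mean Gaussian
# noise, hence unbiased: E[grad_t | w_t] = grad f(w_t).
rng = np.random.default_rng(3)
c = np.array([1.0, -1.0])                  # f(w) = 0.5 * ||w - c||^2, w* = c
T, eta = 10000, 0.01

w = np.zeros_like(c)
avg = np.zeros_like(c)
for t in range(1, T + 1):
    avg += (w - avg) / t
    noisy_grad = (w - c) + rng.standard_normal(2)  # wrong step, right on average
    w = w - eta * noisy_grad

print(np.round(avg - c, 2))                # averaged iterate lands close to w*
```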

  17. Stochastic Gradients. Setting #1: the empirical loss can be written as
  $$f(w) = \frac{1}{n}\sum_{i=1}^{n} \ell(y_i, w^T x_i) = E_{I \sim \mathrm{unif}[1:n]}\, \ell(y_I, w^T x_I).$$
  Then $\nabla_t = \nabla \ell(y_I, w_t^T x_I)$ is an unbiased gradient:
  $$E[\nabla_t \mid w_t] = E[\nabla \ell(y_I, w_t^T x_I) \mid w_t] = \nabla E[\ell(y_I, w_t^T x_I) \mid w_t] = \nabla f(w_t).$$
  Conclusion: if we pick an index $I$ uniformly at random from the dataset and make the gradient step with $\nabla \ell(y_I, w_t^T x_I)$, then we are performing SGD on the empirical loss objective.
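A minimal sketch of Setting #1 with the squared loss: sample an index uniformly, step along that single example's gradient. The dataset size, noise level, and stepsize schedule are illustrative.

```python
import numpy as np

# SGD on the empirical squared loss: one uniformly sampled example per step.
rng = np.random.default_rng(4)
n, d = 1000, 5
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

w = np.zeros(d)
for t in range(1, 50001):
    i = rng.integers(n)                    # I ~ unif[1:n]
    grad = -2 * X[i] * (y[i] - X[i] @ w)   # gradient of (y_i - x_i^T w)^2
    w -= (0.01 / np.sqrt(t)) * grad        # illustrative decreasing stepsize

print(np.mean((y - X @ w) ** 2))           # empirical loss near the noise level
```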

  18. Stochastic Gradients. Setting #2: the expected loss can be written as
  $$f(w) = E\, \ell(Y, w^T X),$$
  where $(X, Y)$ is drawn i.i.d. from the population $P_{X \times Y}$. Then $\nabla_t = \nabla \ell(Y, w_t^T X)$ is an unbiased gradient:
  $$E[\nabla_t \mid w_t] = E[\nabla \ell(Y, w_t^T X) \mid w_t] = \nabla E[\ell(Y, w_t^T X) \mid w_t] = \nabla f(w_t).$$
  Conclusion: if we pick an example $(X, Y)$ from the distribution $P_{X \times Y}$ and make the gradient step with $\nabla \ell(Y, w_t^T X)$, then we are performing SGD on the expected loss objective. This is equivalent to going through a dataset once.

  19. Stochastic Gradients. Say we are in Setting #2 and we go through the dataset once. The guarantee is
  $$E[f(\bar{w})] - f(w^*) \le \frac{BG}{\sqrt{T}}$$
  after $T$ iterations. So, the time complexity to find an $\epsilon$-minimizer of the expected objective $E\, \ell(Y, w^T X)$ is independent of the dataset size $n$! Suitable for large-scale problems.
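A sketch of this single-pass regime: every example is drawn fresh and used exactly once, so the iteration count, not any fixed dataset size, controls the cost. The Gaussian linear model below is an illustrative assumption, under which the excess expected risk has the closed form used in the last line.

```python
import numpy as np

# Single-pass SGD (Setting #2): T steps = T fresh samples from the population.
rng = np.random.default_rng(5)
d, T = 5, 20000
w_true = rng.standard_normal(d)

w = np.zeros(d)
avg = np.zeros(d)
for t in range(1, T + 1):
    X = rng.standard_normal(d)             # (X, Y) ~ P, seen only once
    Y = X @ w_true + 0.1 * rng.standard_normal()
    avg += (w - avg) / t
    grad = -2 * X * (Y - X @ w)            # gradient of (Y - w^T X)^2
    w -= (0.01 / np.sqrt(t)) * grad

# For standard normal X, E f(w) - f(w*) = ||w - w_true||^2 (assumption above).
print(np.sum((avg - w_true) ** 2))         # small excess expected risk
```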

  20. Stochastic Gradients. In practice, we cycle through the dataset several times (which is somewhere between Setting #1 and Setting #2).
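A sketch of this practical variant: multiple reshuffled passes (epochs) over a fixed dataset. The epoch count and stepsize are illustrative choices.

```python
import numpy as np

# Multi-epoch SGD: several passes over a fixed dataset, reshuffled each pass.
rng = np.random.default_rng(6)
n, d = 500, 5
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

w = np.zeros(d)
step = 0
for epoch in range(10):                    # cycle through the data 10 times
    for i in rng.permutation(n):           # fresh shuffle each epoch
        step += 1
        grad = -2 * X[i] * (y[i] - X[i] @ w)
        w -= (0.01 / np.sqrt(step)) * grad

print(np.mean((y - X @ w) ** 2))           # empirical loss near the noise level
```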

  21. Appendix. A function $f: \mathbb{R}^d \to \mathbb{R}$ is convex if
  $$f(\alpha u + (1 - \alpha) v) \le \alpha f(u) + (1 - \alpha) f(v)$$
  for any $\alpha \in [0, 1]$ and $u, v \in \mathbb{R}^d$ (or restricted to a convex set). For a differentiable function, convexity is equivalent to monotonicity:
  $$\langle \nabla f(u) - \nabla f(v), u - v \rangle \ge 0, \tag{1}$$
  where
  $$\nabla f(u) = \left( \frac{\partial f(u)}{\partial u_1}, \ldots, \frac{\partial f(u)}{\partial u_d} \right).$$

  22. Appendix. For a convex differentiable function, it holds that
  $$f(u) \ge f(v) + \langle \nabla f(v), u - v \rangle. \tag{2}$$
  The subdifferential set is defined (for a given $v$) precisely as the set of all vectors $\nabla$ such that
  $$f(u) \ge f(v) + \langle \nabla, u - v \rangle \tag{3}$$
  for all $u$. The subdifferential set is denoted by $\partial f(v)$. A subgradient will often substitute for the gradient, even if we don't specify it.
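A standard example (not on the slide) may help: the subdifferential of the absolute value on the real line.

```latex
% Subdifferential of f(w) = |w|:
\partial f(v) =
\begin{cases}
  \{-1\}   & v < 0, \\
  [-1, 1]  & v = 0, \\
  \{+1\}   & v > 0.
\end{cases}
% Any subgradient g \in [-1, 1] at v = 0 satisfies
% f(u) \ge f(0) + g\,(u - 0) for all u.
```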
