  1. MIT 9.520/6.860, Fall 2018. Statistical Learning Theory and Applications. Class 06: Learning with Stochastic Gradients. Sasha Rakhlin.

  2. Why Optimization? Much (but not all) of machine learning: write down an objective function involving data and parameters, then find good (or optimal) parameters through optimization. Key idea: find a near-optimal solution by iteratively using only local information about the objective (e.g. gradient, Hessian).

  3. Motivating example: Newton's Method. Newton's method in 1d:
  $$w_{t+1} = w_t - \big(f''(w_t)\big)^{-1} f'(w_t)$$
  Example (parabola): $f(w) = aw^2 + bw + c$. Start with any $w_1$. Then Newton's method gives
  $$w_2 = w_1 - (2a)^{-1}(2aw_1 + b), \quad \text{which means} \quad w_2 = -b/(2a).$$
  It finds the minimum of $f$ in one step, no matter where you start!
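To make this concrete, here is a minimal Python sketch (not from the slides) of the 1d Newton update applied to a parabola; the coefficients `a`, `b`, `c` and the starting point are arbitrary illustrative values.

```python
# Newton's method in 1d: w_{t+1} = w_t - f'(w_t) / f''(w_t).
# Applied to a parabola f(w) = a*w^2 + b*w + c, one step reaches -b/(2a).

def newton_step(w, f_prime, f_double_prime):
    """One Newton update in 1d."""
    return w - f_prime(w) / f_double_prime(w)

a, b, c = 3.0, -12.0, 5.0            # illustrative coefficients
f_prime = lambda w: 2 * a * w + b    # f'(w)
f_double_prime = lambda w: 2 * a     # f''(w), constant for a parabola

w1 = 100.0                           # any starting point
w2 = newton_step(w1, f_prime, f_double_prime)
print(w2, -b / (2 * a))              # both print 2.0: minimum found in one step
```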

  4. Newton's method in multiple dimensions:
  $$w_{t+1} = w_t - [\nabla^2 f(w_t)]^{-1} \nabla f(w_t)$$
  (here $\nabla^2 f(w_t)$ is the Hessian, assumed invertible).

  5. Recalling Least Squares. The least-squares objective (without $1/n$ normalization) is
  $$f(w) = \sum_{i=1}^{n} (y_i - x_i^T w)^2 = \|Y - Xw\|^2.$$
  Calculate: $\nabla^2 f(w) = 2X^T X$ and $\nabla f(w) = -2X^T(Y - Xw)$. Taking $w_1 = 0$, Newton's method gives
  $$w_2 = 0 + (2X^T X)^{-1}\, 2X^T(Y - X \cdot 0) = (X^T X)^{-1} X^T Y,$$
  which is the least-squares solution (global minimum). Again, one step is enough. Verify: if $f(w) = \|Y - Xw\|^2 + \lambda \|w\|^2$, then $(X^T X)$ becomes $(X^T X + \lambda I)$.
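A small numerical check (not from the slides): one Newton step from $w_1 = 0$ on a synthetic least-squares problem recovers the closed-form solution. The data-generating choices below are illustrative.

```python
import numpy as np

# One Newton step on f(w) = ||Y - Xw||^2 from w1 = 0:
# w2 = w1 + (2 X^T X)^{-1} 2 X^T (Y - X w1) = (X^T X)^{-1} X^T Y.
rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.standard_normal((n, d))           # synthetic design matrix
Y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

w1 = np.zeros(d)
hessian = 2 * X.T @ X                     # Hessian of f (constant: f is quadratic)
grad = -2 * X.T @ (Y - X @ w1)            # gradient at w1
w2 = w1 - np.linalg.solve(hessian, grad)  # Newton step

w_ls = np.linalg.solve(X.T @ X, X.T @ Y)  # closed-form least-squares solution
print(np.allclose(w2, w_ls))              # True: one Newton step suffices
```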

  6. What do we do if data $(x_1, y_1), \ldots, (x_n, y_n), \ldots$ are streaming? Can we incorporate data on the fly without having to re-compute the inverse $(X^T X)^{-1}$ at every step? $\rightarrow$ Online Learning.

  7. Let $w_1 = 0$. Let $w_t$ be the least-squares solution after seeing $t - 1$ data points. Can we get $w_t$ from $w_{t-1}$ cheaply? Newton's method will do it in one step (since the objective is quadratic). Let $C_t = \sum_{i=1}^{t} x_i x_i^T$ (or $+\lambda I$) and $X_t = [x_1, \ldots, x_t]^T$, $Y_t = [y_1, \ldots, y_t]^T$. Newton's method gives
  $$w_{t+1} = w_t + C_t^{-1} X_t^T (Y_t - X_t w_t).$$
  This can be simplified to
  $$w_{t+1} = w_t + C_t^{-1} x_t (y_t - x_t^T w_t),$$
  since residuals up to $t - 1$ are orthogonal to the columns of $X_{t-1}$. The bottleneck is computing $C_t^{-1}$. Can we update it quickly from $C_{t-1}^{-1}$?

  8. Sherman-Morrison formula: for invertible square $A$ and any $u, v$,
  $$(A + uv^T)^{-1} = A^{-1} - \frac{A^{-1} u v^T A^{-1}}{1 + v^T A^{-1} u}.$$
  Hence
  $$C_t^{-1} = C_{t-1}^{-1} - \frac{C_{t-1}^{-1} x_t x_t^T C_{t-1}^{-1}}{1 + x_t^T C_{t-1}^{-1} x_t}$$
  and (do the calculation)
  $$C_t^{-1} x_t = C_{t-1}^{-1} x_t \cdot \frac{1}{1 + x_t^T C_{t-1}^{-1} x_t}.$$
  Computation required: a $d \times d$ matrix $C_t^{-1}$ times a $d \times 1$ vector, i.e. $O(d^2)$ time to incorporate a new datapoint. Memory: $O(d^2)$. Unlike full regression from scratch, this does not depend on the amount of data $t$.
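The recursion above fits in a few lines of code. Below is a minimal recursive least squares sketch, assuming a ridge term $\lambda I$ so that every $C_t$ is invertible from the start; the dimensions, noise level, and $\lambda$ are illustrative choices.

```python
import numpy as np

# Recursive least squares in O(d^2) per datapoint: keep C_inv = C_t^{-1}
# up to date with the Sherman-Morrison rank-one formula.
def rls_update(w, C_inv, x, y):
    """Incorporate one point (x, y) into the least-squares solution."""
    Cx = C_inv @ x
    denom = 1.0 + x @ Cx
    C_inv -= np.outer(Cx, Cx) / denom   # Sherman-Morrison update of C_t^{-1}
    w = w + Cx * (y - x @ w) / denom    # uses C_t^{-1} x_t = C_{t-1}^{-1} x_t / denom
    return w, C_inv

rng = np.random.default_rng(1)
d, lam = 5, 1e-3
w_true = rng.standard_normal(d)
w, C_inv = np.zeros(d), np.eye(d) / lam   # start from w_1 = 0, C_0 = lam * I

for _ in range(1000):                     # streaming datapoints
    x = rng.standard_normal(d)
    y = x @ w_true + 0.01 * rng.standard_normal()
    w, C_inv = rls_update(w, C_inv, x, y)

print(np.round(w - w_true, 3))            # entries near zero
```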

  9. Recursive Least Squares (cont.) Recap: recursive least squares is
  $$w_{t+1} = w_t + C_t^{-1} x_t (y_t - x_t^T w_t),$$
  with a rank-one update of $C_{t-1}^{-1}$ to get $C_t^{-1}$. Consider throwing away the second-derivative information, replacing it with a scalar:
  $$w_{t+1} = w_t + \eta_t x_t (y_t - x_t^T w_t),$$
  where $\eta_t$ is a decreasing sequence.

  10. Online Least Squares. The algorithm
  $$w_{t+1} = w_t + \eta_t x_t (y_t - x_t^T w_t)$$
  - is recursive;
  - does not require storing the matrix $C_t^{-1}$;
  - does not require updating the inverse, only vector/vector multiplication.
  However, we are not guaranteed convergence in one step. How many steps are needed? How do we choose $\eta_t$?
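A minimal sketch of this update on streaming data; the schedule $\eta_t = 0.1/\sqrt{t}$ is an illustrative choice (how to choose it is exactly the question above).

```python
import numpy as np

# Online least squares with a scalar stepsize in place of C_t^{-1}.
rng = np.random.default_rng(2)
d = 5
w_true = rng.standard_normal(d)
w = np.zeros(d)

for t in range(1, 10001):
    x = rng.standard_normal(d)
    y = x @ w_true + 0.1 * rng.standard_normal()
    eta_t = 0.1 / np.sqrt(t)            # illustrative decreasing schedule
    w = w + eta_t * x * (y - x @ w)     # w_{t+1} = w_t + eta_t x_t (y_t - x_t^T w_t)

print(np.round(w - w_true, 2))          # approaches zero, but not in one step
```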

  11. First, recognize that
  $$-\nabla_w (y_t - x_t^T w)^2 = 2 x_t \left[ y_t - x_t^T w \right].$$
  Hence, the proposed method is gradient descent. Let us study it abstractly and then come back to least squares.

  12. Lemma: Let $f$ be convex and $G$-Lipschitz. Let $w^* \in \arg\min_w f(w)$ with $\|w^*\| \le B$. Then gradient descent
  $$w_{t+1} = w_t - \eta \nabla f(w_t)$$
  with $\eta = \frac{B}{G\sqrt{T}}$ and $w_1 = 0$ yields a sequence of iterates such that the average $\bar{w}_T = \frac{1}{T}\sum_{t=1}^{T} w_t$ of the trajectory satisfies
  $$f(\bar{w}_T) - f(w^*) \le \frac{BG}{\sqrt{T}}.$$
  Proof:
  $$\|w_{t+1} - w^*\|^2 = \|w_t - \eta \nabla f(w_t) - w^*\|^2 = \|w_t - w^*\|^2 + \eta^2 \|\nabla f(w_t)\|^2 - 2\eta \nabla f(w_t)^T (w_t - w^*).$$
  Rearrange:
  $$2\eta \nabla f(w_t)^T (w_t - w^*) = \|w_t - w^*\|^2 - \|w_{t+1} - w^*\|^2 + \eta^2 \|\nabla f(w_t)\|^2.$$
  Note: Lipschitzness of $f$ is equivalent to $\|\nabla f(w)\| \le G$.

  13. Summing over $t = 1, \ldots, T$, telescoping, dropping the negative term, using $w_1 = 0$, and dividing both sides by $2\eta$,
  $$\sum_{t=1}^{T} \nabla f(w_t)^T (w_t - w^*) \le \frac{1}{2\eta}\|w^*\|^2 + \frac{\eta T G^2}{2} \le BG\sqrt{T}.$$
  Convexity of $f$ means $f(w_t) - f(w^*) \le \nabla f(w_t)^T (w_t - w^*)$, and so
  $$\frac{1}{T}\sum_{t=1}^{T} f(w_t) - f(w^*) \le \frac{1}{T}\sum_{t=1}^{T} \nabla f(w_t)^T (w_t - w^*) \le \frac{BG}{\sqrt{T}}.$$
  The lemma follows by convexity of $f$ and Jensen's inequality. (end of proof)
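A numerical illustration of the lemma (not from the slides): gradient descent with the averaged iterate on the non-smooth, 1-Lipschitz function $f(w) = \|w - c\|_2$, whose minimizer is $w^* = c$; the target $c$ is an arbitrary choice.

```python
import numpy as np

# GD with averaging on f(w) = ||w - c||_2 (convex, G = 1 Lipschitz, w* = c),
# stepsize eta = B / (G * sqrt(T)) as in the lemma.
c = np.array([1.0, -2.0, 0.5])     # illustrative target; w* = c
B = np.linalg.norm(c)              # ||w*|| <= B
G, T = 1.0, 10000
eta = B / (G * np.sqrt(T))

def subgrad(w):
    r = w - c
    n = np.linalg.norm(r)
    return r / n if n > 0 else np.zeros_like(r)  # valid subgradient at w = c

w = np.zeros_like(c)               # w_1 = 0
avg = np.zeros_like(c)
for t in range(1, T + 1):
    avg += (w - avg) / t           # running average of the iterates
    w = w - eta * subgrad(w)

gap = np.linalg.norm(avg - c)      # f(avg) - f(w*), since f(w*) = 0
print(gap, B * G / np.sqrt(T))     # gap stays below the BG/sqrt(T) bound
```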

  14. Gradient descent can be written as
  $$w_{t+1} = \arg\min_{w} \; \eta \left\{ f(w_t) + \nabla f(w_t)^T (w - w_t) \right\} + \frac{1}{2}\|w - w_t\|^2,$$
  which can be interpreted as minimizing a linear approximation while staying close to the previous solution. Alternatively, it can be interpreted as building a second-order model locally (since we cannot fully trust the local information, unlike in our first parabola example).
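To see that this surrogate objective indeed recovers the gradient step, set its gradient in $w$ to zero (a one-line check, not spelled out on the slide):

```latex
% Setting the gradient of the surrogate objective to zero:
\nabla_w \left[ \eta\, \nabla f(w_t)^T (w - w_t) + \tfrac{1}{2}\|w - w_t\|^2 \right]
  = \eta \nabla f(w_t) + (w - w_t) = 0
\quad\Longrightarrow\quad
w_{t+1} = w_t - \eta \nabla f(w_t).
```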

  15. Remarks:
  - Gradient descent for non-smooth functions does not guarantee actual descent of the iterates $w_t$ (only of their average).
  - For constrained optimization problems over a set $K$, take the projected gradient step $w_{t+1} = \mathrm{Proj}_K(w_t - \eta \nabla f(w_t))$; the proof is essentially the same. (A projection sketch appears after this list.)
  - Can take stepsize $\eta_t = \frac{B}{G\sqrt{t}}$ to make the method horizon-independent.
  - Knowledge of $G$ and $B$ is not necessary (with appropriate changes).
  - Faster convergence holds under additional assumptions on $f$ (smoothness, strong convexity).
  - Last class: for smooth functions (gradient is $L$-Lipschitz), constant step size $1/L$ gives faster $O(1/T)$ convergence.
  - Gradients can be replaced with stochastic gradients (unbiased estimates).
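For the projected-step remark, here is a minimal sketch of projection onto an $\ell_2$ ball; the set $K$, its radius, and the stepsize are illustrative choices.

```python
import numpy as np

# Projected gradient step onto K = {w : ||w||_2 <= R}.
def proj_l2_ball(w, R):
    """Euclidean projection onto the L2 ball of radius R."""
    n = np.linalg.norm(w)
    return w if n <= R else w * (R / n)

# One projected step for f(w) = ||w - c||^2 with c outside the ball:
c, R, eta = np.array([3.0, 4.0]), 1.0, 0.1
w = np.zeros(2)
grad = 2 * (w - c)
w = proj_l2_ball(w - eta * grad, R)
print(w, np.linalg.norm(w))        # iterate stays inside K: norm <= 1
```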

  16. Stochastic Gradients. Suppose we only have access to an unbiased estimate $\nabla_t$ of $\nabla f(w_t)$ at step $t$; that is, $E[\nabla_t \mid w_t] = \nabla f(w_t)$. Then Stochastic Gradient Descent (SGD),
  $$w_{t+1} = w_t - \eta \nabla_t,$$
  enjoys the guarantee
  $$E[f(\bar{w}_T)] - f(w^*) \le \frac{BG}{\sqrt{T}},$$
  where $G$ is such that $E[\|\nabla_t\|^2] \le G^2$ for all $t$. Kind of amazing: at each step we go in a direction that is wrong (but correct on average) and still converge.
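A sketch of this phenomenon under an illustrative noise model (an assumption, not from the slides): the true gradient of a simple quadratic plus zero-mean Gaussian noise, so each step direction is wrong but correct on average.

```python
import numpy as np

# SGD where each gradient is the true gradient plus zero-mean Gaussian
# noise, hence unbiased: E[grad_t | w_t] = grad f(w_t).
rng = np.random.default_rng(3)
c = np.array([1.0, -1.0])                  # f(w) = 0.5 * ||w - c||^2, w* = c
T, eta = 10000, 0.01

w = np.zeros_like(c)
avg = np.zeros_like(c)
for t in range(1, T + 1):
    avg += (w - avg) / t
    noisy_grad = (w - c) + rng.standard_normal(2)  # wrong step, right on average
    w = w - eta * noisy_grad

print(np.round(avg - c, 2))                # averaged iterate lands close to w*
```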

  17. Stochastic Gradients. Setting #1: the empirical loss can be written as
  $$f(w) = \frac{1}{n}\sum_{i=1}^{n} \ell(y_i, w^T x_i) = E_{I \sim \mathrm{unif}[1:n]}\, \ell(y_I, w^T x_I).$$
  Then $\nabla_t = \nabla \ell(y_I, w_t^T x_I)$ is an unbiased gradient:
  $$E[\nabla_t \mid w_t] = E[\nabla \ell(y_I, w_t^T x_I) \mid w_t] = \nabla E[\ell(y_I, w_t^T x_I) \mid w_t] = \nabla f(w_t).$$
  Conclusion: if we pick an index $I$ uniformly at random from the dataset and make the gradient step with $\nabla \ell(y_I, w_t^T x_I)$, then we are performing SGD on the empirical loss objective.
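A minimal sketch of Setting #1 with the squared loss: sample an index uniformly, step along that single example's gradient. The dataset size, noise level, and stepsize schedule are illustrative.

```python
import numpy as np

# SGD on the empirical squared loss: one uniformly sampled example per step.
rng = np.random.default_rng(4)
n, d = 1000, 5
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

w = np.zeros(d)
for t in range(1, 50001):
    i = rng.integers(n)                    # I ~ unif[1:n]
    grad = -2 * X[i] * (y[i] - X[i] @ w)   # gradient of (y_i - x_i^T w)^2
    w -= (0.01 / np.sqrt(t)) * grad        # illustrative decreasing stepsize

print(np.mean((y - X @ w) ** 2))           # empirical loss near the noise level
```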

  18. Stochastic Gradients. Setting #2: the expected loss can be written as
  $$f(w) = E\, \ell(Y, w^T X),$$
  where $(X, Y)$ is drawn i.i.d. from the population $P_{X \times Y}$. Then $\nabla_t = \nabla \ell(Y, w_t^T X)$ is an unbiased gradient:
  $$E[\nabla_t \mid w_t] = E[\nabla \ell(Y, w_t^T X) \mid w_t] = \nabla E[\ell(Y, w_t^T X) \mid w_t] = \nabla f(w_t).$$
  Conclusion: if we pick an example $(X, Y)$ from the distribution $P_{X \times Y}$ and make the gradient step with $\nabla \ell(Y, w_t^T X)$, then we are performing SGD on the expected loss objective. This is equivalent to going through a dataset once.

  19. Stochastic Gradients. Say we are in Setting #2 and we go through the dataset once. The guarantee is
  $$E[f(\bar{w})] - f(w^*) \le \frac{BG}{\sqrt{T}}$$
  after $T$ iterations. So, the time complexity to find an $\epsilon$-minimizer of the expected objective $E\, \ell(Y, w^T X)$ is independent of the dataset size $n$! Suitable for large-scale problems.
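A sketch of this single-pass regime: every example is drawn fresh and used exactly once, so the iteration count, not any fixed dataset size, controls the cost. The Gaussian linear model below is an illustrative assumption, under which the excess expected risk has the closed form used in the last line.

```python
import numpy as np

# Single-pass SGD (Setting #2): T steps = T fresh samples from the population.
rng = np.random.default_rng(5)
d, T = 5, 20000
w_true = rng.standard_normal(d)

w = np.zeros(d)
avg = np.zeros(d)
for t in range(1, T + 1):
    X = rng.standard_normal(d)             # (X, Y) ~ P, seen only once
    Y = X @ w_true + 0.1 * rng.standard_normal()
    avg += (w - avg) / t
    grad = -2 * X * (Y - X @ w)            # gradient of (Y - w^T X)^2
    w -= (0.01 / np.sqrt(t)) * grad

# For standard normal X, E f(w) - f(w*) = ||w - w_true||^2 (assumption above).
print(np.sum((avg - w_true) ** 2))         # small excess expected risk
```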

  20. Stochastic Gradients. In practice, we cycle through the dataset several times (which is somewhere between Setting #1 and Setting #2).
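A sketch of this practical variant: multiple reshuffled passes (epochs) over a fixed dataset. The epoch count and stepsize are illustrative choices.

```python
import numpy as np

# Multi-epoch SGD: several passes over a fixed dataset, reshuffled each pass.
rng = np.random.default_rng(6)
n, d = 500, 5
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

w = np.zeros(d)
step = 0
for epoch in range(10):                    # cycle through the data 10 times
    for i in rng.permutation(n):           # fresh shuffle each epoch
        step += 1
        grad = -2 * X[i] * (y[i] - X[i] @ w)
        w -= (0.01 / np.sqrt(step)) * grad

print(np.mean((y - X @ w) ** 2))           # empirical loss near the noise level
```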

  21. Appendix. A function $f: \mathbb{R}^d \to \mathbb{R}$ is convex if
  $$f(\alpha u + (1 - \alpha) v) \le \alpha f(u) + (1 - \alpha) f(v)$$
  for any $\alpha \in [0, 1]$ and $u, v \in \mathbb{R}^d$ (or restricted to a convex set). For a differentiable function, convexity is equivalent to monotonicity:
  $$\langle \nabla f(u) - \nabla f(v), u - v \rangle \ge 0, \tag{1}$$
  where
  $$\nabla f(u) = \left( \frac{\partial f(u)}{\partial u_1}, \ldots, \frac{\partial f(u)}{\partial u_d} \right).$$

  22. Appendix. For a convex differentiable function, it holds that
  $$f(u) \ge f(v) + \langle \nabla f(v), u - v \rangle. \tag{2}$$
  The subdifferential set is defined (for a given $v$) precisely as the set of all vectors $\nabla$ such that
  $$f(u) \ge f(v) + \langle \nabla, u - v \rangle \tag{3}$$
  for all $u$. The subdifferential set is denoted by $\partial f(v)$. A subgradient will often substitute for the gradient, even if we don't specify it.
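A standard example (not on the slide) may help: the subdifferential of the absolute value on the real line.

```latex
% Subdifferential of f(w) = |w|:
\partial f(v) =
\begin{cases}
  \{-1\}   & v < 0, \\
  [-1, 1]  & v = 0, \\
  \{+1\}   & v > 0.
\end{cases}
% Any subgradient g \in [-1, 1] at v = 0 satisfies
% f(u) \ge f(0) + g\,(u - 0) for all u.
```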
