

  1. Fine-Grained Analysis of Stability and Generalization for SGD. Yunwen Lei (University of Kaiserslautern, yunwen.lei@hotmail.com) and Yiming Ying (University at Albany, State University of New York (SUNY), yying@albany.edu). June 2020.

  2. Overview

  3. Population and Empirical Risks
     Training dataset: $S = \{z_1 = (x_1, y_1), \ldots, z_n = (x_n, y_n)\}$, with each example $z_i \in \mathcal{Z} = \mathcal{X} \times \mathcal{Y}$.
     Parametric model $w \in \Omega \subseteq \mathbb{R}^d$ for prediction.
     Loss function: $f(w; z)$ measures the performance of $w$ on an example $z$.
     Population risk: $F(w) = \mathbb{E}_z[f(w; z)]$, with best model $w^* = \arg\min_{w \in \Omega} F(w)$.
     Empirical risk: $F_S(w) = \frac{1}{n} \sum_{i=1}^n f(w; z_i)$.
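     As a minimal sketch of these definitions (not from the slides), the snippet below computes the empirical risk $F_S(w)$ for a linear model under the squared loss; the squared loss is an illustrative assumption, since the slides leave $f$ generic. The population risk $F(w)$ would replace the average over $S$ by an expectation over the data distribution.

        import numpy as np

        def sq_loss(w, z):
            """Squared loss f(w; z) = (y - <w, x>)^2 of parameter vector w on one example z = (x, y)."""
            x, y = z
            return (y - np.dot(w, x)) ** 2

        def empirical_risk(w, S):
            """F_S(w) = (1/n) * sum_{i=1}^n f(w; z_i) over the training set S = [(x_1, y_1), ..., (x_n, y_n)]."""
            return np.mean([sq_loss(w, z) for z in S])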

  4. Excess Generalization Error
     Based on the training data $S$, a randomized algorithm $A$ (e.g., SGD) outputs a model $A(S) \in \Omega$.
     Target of analysis: the excess generalization error
     $\mathbb{E}\big[F(A(S)) - F(w^*)\big] = \underbrace{\mathbb{E}\big[F(A(S)) - F_S(A(S))\big]}_{\text{estimation error}} + \underbrace{\mathbb{E}\big[F_S(A(S)) - F_S(w^*)\big]}_{\text{optimization error}}$,
     using that $\mathbb{E}_S[F_S(w^*)] = F(w^*)$.
     Vast literature on the optimization error: (Duchi et al., 2011; Bach and Moulines, 2011; Rakhlin et al., 2012; Shamir and Zhang, 2013; Orabona, 2014; Ying and Zhou, 2017; Lin and Rosasco, 2017; Pillaud-Vivien et al., 2018; Bassily et al., 2018; Vaswani et al., 2019; Mücke et al., 2019) and many others.
     Algorithmic stability for studying the estimation error: (Bousquet and Elisseeff, 2002; Elisseeff et al., 2005; Rakhlin et al., 2005; Shalev-Shwartz et al., 2010; Hardt et al., 2016; Kuzborskij and Lampert, 2018; Charles and Papailiopoulos, 2018; Feldman and Vondrak, 2018), etc.

  5. Uniform Stability Approach
     Uniform stability (Bousquet and Elisseeff, 2002; Elisseeff et al., 2005): a randomized algorithm $A$ is $\epsilon$-uniformly stable if, for any two datasets $S$ and $S'$ that differ by one example,
     $\sup_z \mathbb{E}_A\big[f(A(S); z) - f(A(S'); z)\big] \le \epsilon_{\text{uniform}}$.  (1)
     For $G$-Lipschitz, strongly smooth $f$ and SGD with step sizes $\eta_t$, informally we have
     Generalization $\le$ Uniform stability $\le \frac{G^2}{n} \sum_{t=1}^T \eta_t$.
     These assumptions are restrictive: they fail for the $q$-norm loss $f(w; z) = |y - \langle w, x \rangle|^q$ ($q \in [1, 2]$) and the hinge loss $(1 - y \langle w, x \rangle)_+$ with $w \in \mathbb{R}^d$.
     Can we remove these assumptions and explain the real power of SGD?

  6. Our Results

  7. On-Average Model Stability
     To handle the general setting, we propose a new concept of stability. Let $S = \{z_i : i = 1, \ldots, n\}$ and $\tilde{S} = \{\tilde{z}_i : i = 1, \ldots, n\}$, and for each $i$ let $S^{(i)} = \{z_1, \ldots, z_{i-1}, \tilde{z}_i, z_{i+1}, \ldots, z_n\}$.
     On-average model stability: we say a randomized algorithm $A : \mathcal{Z}^n \mapsto \Omega$ is on-average model $\epsilon$-stable if
     $\mathbb{E}_{S, \tilde{S}, A}\Big[\frac{1}{n} \sum_{i=1}^n \|A(S) - A(S^{(i)})\|_2^2\Big] \le \epsilon^2$.  (2)
     $\alpha$-Hölder continuous gradients ($\alpha \in [0, 1]$):
     $\|\partial f(w, z) - \partial f(w', z)\|_2 \le \|w - w'\|_2^{\alpha}$.  (3)
     Here $\alpha = 0$ means that $f$ is Lipschitz and $\alpha = 1$ means that $f$ is strongly smooth.
     If $A$ is on-average model $\epsilon$-stable, then
     $\mathbb{E}\big[F(A(S)) - F_S(A(S))\big] = O\Big(\epsilon^{1+\alpha} + \epsilon \big(\mathbb{E}[F_S(A(S))]\big)^{\frac{\alpha}{1+\alpha}}\Big)$.  (4)
     This can handle both Lipschitz functions and unbounded gradients!
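     As an illustration of definition (2), the sketch below (not from the paper) estimates the on-average model stability of a learning routine by retraining on the $n$ perturbed datasets $S^{(i)}$. The `train` callable and the handling of randomness are assumptions made for the example; for a randomized algorithm one would also average over its internal randomness and over draws of $S$ and $\tilde{S}$.

        import numpy as np

        def on_average_model_stability(train, S, S_tilde):
            """Estimate (1/n) * sum_i ||A(S) - A(S^(i))||_2^2 for a training routine A.

            train: callable mapping a list of examples to a parameter vector.
            S, S_tilde: two independent samples of n examples each (Python lists).
            """
            n = len(S)
            w = train(S)
            total = 0.0
            for i in range(n):
                # S^(i): S with its i-th example replaced by the i-th example of S_tilde.
                S_i = S[:i] + [S_tilde[i]] + S[i + 1:]
                total += np.sum((w - train(S_i)) ** 2)
            return total / n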

  8. Case Study: Stochastic Gradient Descent
     We study the on-average model stability $\epsilon_{T+1}$ of $w_{T+1}$ produced by SGD.
     SGD: for $t = 1, 2, \ldots, T$: draw a random index $i_t$ from $\{1, 2, \ldots, n\}$ and update $w_{t+1} \leftarrow w_t - \eta_t \partial f(w_t; z_{i_t})$ for some step sizes $\eta_t > 0$; return $w_{T+1}$.
     On-average model stability for SGD: if $\partial f$ is $\alpha$-Hölder continuous with $\alpha \in [0, 1]$, then
     $\epsilon_{T+1}^2 = O\Big( \big(1 + T/n\big) \Big( \sum_{t=1}^T \eta_t^{\frac{2}{1-\alpha}} + \frac{1}{n} \sum_{t=1}^T \eta_t^2 \big(\mathbb{E}[F_S(w_t)]\big)^{\frac{2\alpha}{1+\alpha}} + \frac{1}{n} \sum_{t=1}^T \eta_t^2 \Big) \Big)$.
     The weighted sum of risks (i.e., $\sum_{t=1}^T \eta_t^2 \mathbb{E}[F_S(w_t)]$) can be estimated using tools for analyzing optimization errors.
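     For concreteness, here is a minimal runnable sketch of the SGD scheme on this slide. The squared loss, the toy data, and the schedule $\eta_t = 1/\sqrt{T}$ with $T$ of order $n$ (borrowed from the later slides) are illustrative assumptions, not fixed by the slide itself.

        import numpy as np

        def sgd(grad, data, eta, T, d, seed=0):
            """Run T steps of SGD: w_{t+1} = w_t - eta(t) * grad(w_t, z_{i_t})."""
            rng = np.random.default_rng(seed)
            n = len(data)
            w = np.zeros(d)
            for t in range(1, T + 1):
                i_t = rng.integers(n)              # uniformly random index in {0, ..., n-1}
                w = w - eta(t) * grad(w, data[i_t])
            return w

        # Illustrative loss: squared loss f(w; (x, y)) = (y - <w, x>)^2, with gradient 2(<w, x> - y)x.
        def sq_grad(w, z):
            x, y = z
            return 2.0 * (np.dot(w, x) - y) * x

        # Toy data from a noisy linear model.
        rng = np.random.default_rng(1)
        n, d = 200, 5
        w_true = rng.normal(size=d)
        data = [(x, float(np.dot(w_true, x)) + 0.1 * rng.normal()) for x in rng.normal(size=(n, d))]

        T = n                                       # T of order n, step size eta_t = 1/sqrt(T)
        w_last = sgd(sq_grad, data, eta=lambda t: 1.0 / np.sqrt(T), T=T, d=d)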

  9. Main Results for SGD
     Our Key Message (informal): Generalization $\le$ On-average model stability $\le$ Weighted sum of risks.
     Recall, for uniform stability with Lipschitz and smooth $f$, that
     Generalization $\le$ Uniform stability $\le \frac{G^2}{n} \sum_{t=1}^T \eta_t$.
     Specifically, we have the following excess generalization bounds.

  10. SGD with Smooth Functions
     Let $f$ be convex and strongly smooth, and let $\bar{w}_T = \sum_{t=1}^T \eta_t w_t / \sum_{t=1}^T \eta_t$.
     Theorem (Minimax optimal generalization bounds): choosing $\eta_t = 1/\sqrt{T}$ and $T \asymp n$ implies that $\mathbb{E}[F(\bar{w}_T)] - F(w^*) = O(1/\sqrt{n})$.
     Theorem (Fast generalization bounds under low noise): in the low-noise case $F(w^*) = O(1/n)$, we can take $\eta_t = 1$, $T \asymp n$ and get $\mathbb{E}[F(\bar{w}_T)] = O(1/n)$.
     We remove bounded-gradient assumptions. We get the first fast generalization bound $O(1/n)$ by stability analysis.
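     The averaged iterate $\bar{w}_T$ used in these theorems can be computed alongside the SGD sketch given after slide 8. The variant below (again only an illustration, reusing `sq_grad`, `data`, `n` and `d` from that sketch) returns the step-size-weighted average of the iterates and applies the schedule $\eta_t = 1/\sqrt{T}$, $T \asymp n$ from the first theorem.

        def sgd_averaged(grad, data, eta, T, d, seed=0):
            """SGD that returns bar{w}_T = sum_t eta_t * w_t / sum_t eta_t."""
            rng = np.random.default_rng(seed)
            n = len(data)
            w = np.zeros(d)
            w_sum, eta_sum = np.zeros(d), 0.0
            for t in range(1, T + 1):
                w_sum += eta(t) * w                 # accumulate eta_t * w_t before the update
                eta_sum += eta(t)
                i_t = rng.integers(n)
                w = w - eta(t) * grad(w, data[i_t])
            return w_sum / eta_sum

        T = n
        w_bar = sgd_averaged(sq_grad, data, eta=lambda t: 1.0 / np.sqrt(T), T=T, d=d)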

  11. SGD with Lipschitz Functions
     Let $f$ be convex and $G$-Lipschitz (not necessarily smooth, e.g., the hinge loss).
     Our on-average model stability bound can be simplified to
     $\epsilon_{T+1}^2 = O\Big( \big(1 + T/n\big) \sum_{t=1}^T \eta_t^2 \Big)$.  (5)
     Key idea: the gradient update is approximately contractive:
     $\|w - \eta \partial f(w; z) - w' + \eta \partial f(w'; z)\|_2^2 \le \|w - w'\|_2^2 + O(\eta^2)$.  (6)
     Theorem (Generalization bounds): we can take $\eta_t = T^{-3/4}$ and $T \asymp n^2$ and get $\mathbb{E}[F(\bar{w}_T)] - F(w^*) = O(n^{-1/2})$.
     We get the first generalization bound $O(1/\sqrt{n})$ for SGD with non-differentiable functions based on stability analysis.
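     The hinge loss mentioned here is non-differentiable at the margin, so SGD uses a subgradient. A drop-in choice for the `grad` argument of the earlier sketches might look as follows; the binary labels $y \in \{-1, +1\}$, the zero subgradient at the kink, and the reuse of the toy data from slide 8 are illustrative choices.

        def hinge_subgrad(w, z):
            """A subgradient of the hinge loss f(w; (x, y)) = max(0, 1 - y * <w, x>)."""
            x, y = z
            return -y * x if y * np.dot(w, x) < 1.0 else np.zeros_like(w)

        # Binary labels for a toy classification problem (hinge loss expects y in {-1, +1}).
        cls_data = [(x, 1.0 if np.dot(w_true, x) > 0 else -1.0) for x, _ in data]

        # Schedule from the theorem on this slide: eta_t = T^(-3/4) with T of order n^2.
        T = n ** 2
        w_bar_hinge = sgd_averaged(hinge_subgrad, cls_data, eta=lambda t: T ** (-0.75), T=T, d=d)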

  12. SGD with $\alpha$-Hölder Continuous Gradients
     Let $f$ be convex and have $\alpha$-Hölder continuous gradients with $\alpha \in (0, 1)$.
     Key idea: the gradient update is approximately contractive:
     $\|w - \eta \partial f(w; z) - w' + \eta \partial f(w'; z)\|_2^2 \le \|w - w'\|_2^2 + O\big(\eta^{\frac{2}{1-\alpha}}\big)$.
     Theorem: if $\alpha \ge 1/2$, we take $\eta_t = 1/\sqrt{T}$, $T \asymp n$ and get $\mathbb{E}[F(\bar{w}_T)] - F(w^*) = O(n^{-1/2})$. If $\alpha < 1/2$, we take $\eta_t = T^{\frac{3\alpha - 3}{2(2-\alpha)}}$, $T \asymp n^{\frac{2-\alpha}{1+\alpha}}$ and get $\mathbb{E}[F(\bar{w}_T)] - F(w^*) = O(n^{-1/2})$.
     Theorem (Fast generalization bounds): if $F(w^*) = O(1/n)$, then with a suitable polynomial choice of $\eta_t$ in terms of $T$ and of $T$ in terms of $n$ (depending on $\alpha$), we get $\mathbb{E}[F(\bar{w}_T)] = O\big(n^{-\frac{1+\alpha}{2}}\big)$.
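     A standard example in this regime is the $q$-norm loss from slide 5: for $q \in (1, 2)$ its gradient is $(q-1)$-Hölder continuous (up to constants depending on $\|x\|$), so $\alpha = q - 1$. A sketch of its gradient, usable with the `grad` argument of the SGD sketches above, is given below; the default $q = 1.5$ is an arbitrary illustrative value.

        def qnorm_grad(w, z, q=1.5):
            """Gradient of the q-norm loss f(w; (x, y)) = |y - <w, x>|^q for q in (1, 2)."""
            x, y = z
            r = y - np.dot(w, x)
            return -q * np.sign(r) * np.abs(r) ** (q - 1) * x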

  13. SGD with Relaxed Convexity
     We assume $f$ is $G$-Lipschitz continuous.
     Non-convex $f$ but convex $F_S$:
     stability bound: $\epsilon^2 \le \frac{1}{n^2}\big(\sum_{t=1}^T \eta_t\big)^2 + \frac{1}{n}\sum_{t=1}^T \eta_t^2$;
     generalization bound: if $\eta_t = 1/\sqrt{T}$ and $T \asymp n$, then $\mathbb{E}[F(\bar{w}_T)] - F(w^*) = O(1/\sqrt{n})$.
     Non-convex $f$ but strongly convex $F_S$ (with $\eta_t = 1/t$):
     stability bound: $\epsilon^2 \le \frac{1}{nT} + \frac{1}{n^2}$;
     generalization bound: if $T \asymp n$, then $\mathbb{E}[F(\bar{w}_T)] - F(w^*) = O(1/n)$.
     Example: least squares regression.
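     As a rough illustration of the second regime, the earlier sketches can be reused with the $\eta_t = 1/t$ schedule and $T \asymp n$. For the squared loss in that toy setup, $F_S$ is strongly convex whenever the empirical covariance of the inputs is non-degenerate, even though the individual losses are not strongly convex.

        # eta_t = 1/t and T of order n, reusing sgd_averaged, sq_grad and data from the sketches above.
        w_bar_sc = sgd_averaged(sq_grad, data, eta=lambda t: 1.0 / t, T=n, d=d)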

  14. References I
     F. Bach and E. Moulines. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems, pages 451–459, 2011.
     R. Bassily, M. Belkin, and S. Ma. On exponential convergence of SGD in non-convex over-parametrized learning. arXiv preprint arXiv:1811.02564, 2018.
     O. Bousquet and A. Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2(Mar):499–526, 2002.
     Z. Charles and D. Papailiopoulos. Stability and generalization of learning algorithms that converge to global optima. In International Conference on Machine Learning, pages 744–753, 2018.
     J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011.
     A. Elisseeff, T. Evgeniou, and M. Pontil. Stability of randomized learning algorithms. Journal of Machine Learning Research, 6(Jan):55–79, 2005.
     V. Feldman and J. Vondrak. Generalization bounds for uniformly stable algorithms. In Advances in Neural Information Processing Systems, pages 9747–9757, 2018.
     M. Hardt, B. Recht, and Y. Singer. Train faster, generalize better: Stability of stochastic gradient descent. In International Conference on Machine Learning, pages 1225–1234, 2016.
     I. Kuzborskij and C. Lampert. Data-dependent stability of stochastic gradient descent. In International Conference on Machine Learning, pages 2820–2829, 2018.
     J. Lin and L. Rosasco. Optimal rates for multi-pass stochastic gradient methods. Journal of Machine Learning Research, 18(1):3375–3421, 2017.
     N. Mücke, G. Neu, and L. Rosasco. Beating SGD saturation with tail-averaging and minibatching. In Advances in Neural Information Processing Systems, pages 12568–12577, 2019.
     F. Orabona. Simultaneous model selection and optimization through parameter-free stochastic learning. In Advances in Neural Information Processing Systems, pages 1116–1124, 2014.
     L. Pillaud-Vivien, A. Rudi, and F. Bach. Statistical optimality of stochastic gradient descent on hard learning problems through multiple passes. In Advances in Neural Information Processing Systems, pages 8114–8124, 2018.
     A. Rakhlin, S. Mukherjee, and T. Poggio. Stability results in learning theory. Analysis and Applications, 3(04):397–417, 2005.
