
Painless Stochastic Gradient Descent: Interpolation, Line-Search, and Convergence Rates - PowerPoint PPT Presentation



  1. Painless Stochastic Gradient Descent: Interpolation, Line-Search, and Convergence Rates. NeurIPS 2019. Aaron Mishkin.

  2. Stochastic Gradient Descent: Workhorse of ML? “Stochastic gradient descent (SGD) is today one of the main workhorses for solving large-scale supervised learning and optimization problems.” —Drori and Shamir [8]

  3. Consensus Says… …and also Agarwal et al. [1], Assran and Rabbat [2], Assran et al. [3], Bernstein et al. [6], Damaskinos et al. [7], Geffner and Domke [9], Gower et al. [10], Grosse and Salakhudinov [11], Hofmann et al. [12], Kawaguchi and Lu [13], Li et al. [14], Patterson and Gibson [17], Pillaud-Vivien et al. [18], Xu et al. [21], Zhang et al. [22]

  4. Motivation: Challenges in Optimization for ML
     Stochastic gradient methods are the most popular algorithms for fitting ML models. SGD:
     $w_{k+1} = w_k - \eta_k \nabla \tilde{f}(w_k)$.
     But practitioners face major challenges with:
     • Speed: step-size/averaging controls convergence rate.
     • Stability: hyper-parameters must be tuned carefully.
     • Generalization: optimizers encode statistical trade-offs.
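
To make the update rule concrete, here is a minimal sketch of the iteration $w_{k+1} = w_k - \eta_k \nabla \tilde{f}(w_k)$ in NumPy, where $\nabla \tilde{f}$ is the gradient of a single sampled loss $f_i$. The synthetic least-squares problem, the constant step-size, and the iteration count are assumptions chosen only for illustration.

```python
import numpy as np

# Minimal sketch of SGD: w_{k+1} = w_k - eta * grad f_i(w_k), with f_i the
# squared loss on a single sampled example (problem setup is illustrative).
rng = np.random.default_rng(0)
n, d = 100, 10
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d)      # targets come from a linear model

def grad_f_i(w, i):
    """Gradient of f_i(w) = 0.5 * (x_i^T w - y_i)^2."""
    return (X[i] @ w - y[i]) * X[i]

w = np.zeros(d)
eta = 0.01                          # constant step-size (a tuned hyper-parameter)
for k in range(5000):
    i = rng.integers(n)             # sample one example uniformly at random
    w -= eta * grad_f_i(w, i)       # stochastic gradient step

print("final average loss:", 0.5 * np.mean((X @ w - y) ** 2))
```

Because each example is sampled uniformly, $\nabla f_i(w_k)$ is an unbiased estimate of the full gradient $\nabla f(w_k)$.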

  5. Better Optimization via Better Models. Idea: exploit model properties for better optimization.

  6. Interpolation
     Loss: $f(w) = \frac{1}{n} \sum_{i=1}^{n} f_i(w)$.
     Interpolation is satisfied for $f$ if $\forall w$, $f(w^*) \leq f(w) \implies f_i(w^*) \leq f_i(w)$.
     [Figure: example classification datasets, separable vs. not separable.]
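
To make the definition concrete, the sketch below builds an over-parameterized least-squares problem (more parameters than examples, a setup assumed purely for illustration) and checks that a minimizer of the average loss also minimizes every individual $f_i$, i.e. drives each per-example loss to zero.

```python
import numpy as np

# Interpolation check on an over-parameterized least-squares problem
# (d > n, so a single weight vector can fit every example exactly).
rng = np.random.default_rng(1)
n, d = 20, 50
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# Minimum-norm solution of X w = y; it minimizes the average squared loss.
w_star, *_ = np.linalg.lstsq(X, y, rcond=None)

per_example_losses = 0.5 * (X @ w_star - y) ** 2
print("max_i f_i(w*):", per_example_losses.max())  # ~0: each f_i is minimized too
```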

  7. Constant Step-size SGD
     Interpolation and smoothness imply a noise bound,
     $\mathbb{E}\,\|\nabla f_i(w)\|^2 \leq \rho\,(f(w) - f(w^*))$.
     • SGD converges with a constant step-size [4, 19].
     • SGD is (nearly) as fast as gradient descent.
     • SGD converges to the
       ▶ minimum $L_2$-norm solution for linear regression [20],
       ▶ max-margin solution for logistic regression [16],
       ▶ ??? for deep neural networks.
     Takeaway: optimization speed and (some) statistical trade-offs.
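
A short sketch of where such a noise bound comes from, assuming each $f_i$ is $L_i$-smooth with $L_{\max} = \max_i L_i$; this is the standard smoothness-plus-interpolation argument, included here for intuition rather than as the slide's own proof.

```latex
% Smoothness of each f_i bounds its gradient norm by its suboptimality:
%   \|\nabla f_i(w)\|^2 \le 2 L_i \big( f_i(w) - \min_v f_i(v) \big).
% Under interpolation, a minimizer w^* of f also minimizes every f_i,
% so \min_v f_i(v) = f_i(w^*). Taking the expectation over the sampled index i:
\begin{align*}
\mathbb{E}_i \|\nabla f_i(w)\|^2
  &\le 2 L_{\max}\, \mathbb{E}_i \big[ f_i(w) - f_i(w^*) \big] \\
  &= 2 L_{\max} \big( f(w) - f(w^*) \big),
\end{align*}
% which is the noise bound with \rho = 2 L_{\max}.
```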

  8. Painless SGD. What about stability and hyper-parameter tuning? Is grid-search the best we can do?

  9. Painless SGD: Tuning-free SGD via Line-Searches
     Stochastic Armijo condition: $f_i(w_{k+1}) \leq f_i(w_k) - c\,\eta_k\,\|\nabla f_i(w_k)\|^2$.
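
The condition suggests a simple backtracking scheme: propose a large trial step-size and shrink it until the sampled loss satisfies the stochastic Armijo inequality. Below is a minimal NumPy sketch of this idea; the grow-then-backtrack reset rule, the constants $c = 0.1$, $\beta = 0.5$, $\eta_{\max} = 10$, and the synthetic interpolating problem are assumptions for illustration, not the paper's exact implementation.

```python
import numpy as np

# Sketch of SGD with a stochastic Armijo backtracking line-search:
# accept a step-size eta only if f_i(w - eta*g) <= f_i(w) - c*eta*||g||^2.
rng = np.random.default_rng(0)
n, d = 100, 10
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d)           # interpolation holds by construction

def f_i(w, i):
    return 0.5 * (X[i] @ w - y[i]) ** 2

def grad_f_i(w, i):
    return (X[i] @ w - y[i]) * X[i]

w = np.zeros(d)
c, beta, eta_max = 0.1, 0.5, 10.0        # illustrative line-search constants
eta = eta_max
for k in range(2000):
    i = rng.integers(n)
    g = grad_f_i(w, i)
    eta = min(2.0 * eta, eta_max)        # grow the trial step, then backtrack
    while f_i(w - eta * g, i) > f_i(w, i) - c * eta * (g @ g) and eta > 1e-12:
        eta *= beta                      # halve until the Armijo condition holds
    w -= eta * g

print("final average loss:", np.mean([f_i(w, j) for j in range(n)]))
```

In this sketch the step-size is chosen per iteration by the line-search from the sampled loss alone, rather than tuned by hand.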

  10. Painless SGD: Stochastic Armijo in Theory

  11. Painless SGD: Stochastic Armijo in Practice
     [Figure: classification accuracy for ResNet-34 models trained on MNIST, CIFAR-10, and CIFAR-100.]

  12. Painless SGD: Added Cost
     Backtracking is low-cost and averages once per iteration.
     [Figure: time per iteration (s) on the MNIST, CIFAR10, CIFAR100, mushrooms, ijcnn, and matrix-factorization (MF: 1, MF: 10) experiments, for Adam, SGD + Armijo, Tuned SGD, Coin-Betting, Polyak + Armijo, Nesterov + Armijo, SGD + Goldstein, AdaBound, and SEG + Lipschitz.]

  13. Painless SGD: Sensitivity to Assumptions
     SGD with line-search is robust, but can still fail catastrophically.
     [Figure: distance to the optimum vs. number of epochs on a bilinear problem with and without interpolation, for Adam, Extra-Adam, SEG + Lipschitz, and SVRE + Restarts.]

  14. Questions.

  15. Bonus: Robust Acceleration for SGD
     Stochastic acceleration is possible [15, 19], but
     • it's unstable with the backtracking Armijo line-search; and
     • the "momentum" parameter must be fine-tuned.
     Potential solutions:
     • more sophisticated line-search (e.g. FISTA [5]);
     • stochastic restarts for oscillations.
     [Figure: training loss vs. iterations on synthetic matrix factorization for Adam, SGD + Armijo, and Nesterov + Armijo.]

  16. References I
     [1] Naman Agarwal, Brian Bullins, and Elad Hazan. Second-order stochastic optimization for machine learning in linear time. The Journal of Machine Learning Research, 18(1):4148–4187, 2017.
     [2] Mahmoud Assran and Michael Rabbat. On the convergence of Nesterov's accelerated gradient method in stochastic settings. arXiv preprint arXiv:2002.12414, 2020.
     [3] Mahmoud Assran, Nicolas Loizou, Nicolas Ballas, and Michael Rabbat. Stochastic gradient push for distributed deep learning. arXiv preprint arXiv:1811.10792, 2018.
     [4] Raef Bassily, Mikhail Belkin, and Siyuan Ma. On exponential convergence of SGD in non-convex over-parametrized learning. arXiv preprint arXiv:1811.02564, 2018.

  17. References II
     [5] Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sciences, 2(1):183–202, 2009.
     [6] Jeremy Bernstein, Jiawei Zhao, Kamyar Azizzadenesheli, and Anima Anandkumar. signSGD with majority vote is communication efficient and fault tolerant. arXiv preprint arXiv:1810.05291, 2018.
     [7] Georgios Damaskinos, El Mahdi El Mhamdi, Rachid Guerraoui, Arsany Hany Abdelmessih Guirguis, and Sébastien Louis Alexandre Rouault. Aggregathor: Byzantine machine learning via robust gradient aggregation. In The Conference on Systems and Machine Learning (SysML), 2019.

  18. References III
     [8] Yoel Drori and Ohad Shamir. The complexity of finding stationary points with stochastic gradient descent. arXiv preprint arXiv:1910.01845, 2019.
     [9] Tomas Geffner and Justin Domke. A rule for gradient estimator selection, with an application to variational inference. arXiv preprint arXiv:1911.01894, 2019.
     [10] Robert Mansel Gower, Nicolas Loizou, Xun Qian, Alibek Sailanbayev, Egor Shulgin, and Peter Richtárik. SGD: General analysis and improved rates. arXiv preprint arXiv:1901.09401, 2019.
     [11] Roger Grosse and Ruslan Salakhudinov. Scaling up natural gradient by sparsely factorizing the inverse Fisher matrix. In International Conference on Machine Learning, pages 2304–2313, 2015.

  19. References IV
     [12] Thomas Hofmann, Aurelien Lucchi, Simon Lacoste-Julien, and Brian McWilliams. Variance reduced stochastic gradient descent with neighbors. In Advances in Neural Information Processing Systems, pages 2305–2313, 2015.
     [13] Kenji Kawaguchi and Haihao Lu. Ordered SGD: A new stochastic optimization framework for empirical risk minimization.
     [14] Liping Li, Wei Xu, Tianyi Chen, Georgios B Giannakis, and Qing Ling. RSA: Byzantine-robust stochastic aggregation methods for distributed learning from heterogeneous datasets. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 1544–1551, 2019.
     [15] Chaoyue Liu and Mikhail Belkin. Accelerating SGD with momentum for over-parameterized learning. In ICLR, 2020.

  20. References V
     [16] Mor Shpigel Nacson, Nathan Srebro, and Daniel Soudry. Stochastic gradient descent on separable data: Exact convergence with a fixed learning rate. arXiv preprint arXiv:1806.01796, 2018.
     [17] Josh Patterson and Adam Gibson. Deep learning: A practitioner's approach. O'Reilly Media, Inc., 2017.
     [18] Loucas Pillaud-Vivien, Alessandro Rudi, and Francis Bach. Statistical optimality of stochastic gradient descent on hard learning problems through multiple passes. In Advances in Neural Information Processing Systems, pages 8114–8124, 2018.

  21. References VI
     [19] Sharan Vaswani, Francis Bach, and Mark Schmidt. Fast and faster convergence of SGD for over-parameterized models and an accelerated perceptron. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1195–1204, 2019.
     [20] Ashia C Wilson, Rebecca Roelofs, Mitchell Stern, Nati Srebro, and Benjamin Recht. The marginal value of adaptive gradient methods in machine learning. In NeurIPS, pages 4148–4158, 2017.
     [21] Peng Xu, Farbod Roosta-Khorasani, and Michael W Mahoney. Second-order optimization for non-convex machine learning: An empirical study. arXiv preprint arXiv:1708.07827, 2017.
     [22] Jian Zhang, Christopher De Sa, Ioannis Mitliagkas, and Christopher Ré. Parallel SGD: When does averaging help? arXiv preprint arXiv:1606.07365, 2016.

