

1. Towards Understanding the Importance of Noise in Training Neural Networks
Mo Zhou ♯, Tianyi Liu †, Yan Li †, Dachao Lin ♯, Enlu Zhou † and Tuo Zhao †
† Georgia Tech and ♯ Peking University
June 12, 2019
International Conference on Machine Learning (ICML), 2019

2. Background: Deep Neural Networks
Great Success:
- Speech and image recognition
- Natural language processing
- Recommendation systems
Training Challenges:
- Highly nonconvex optimization landscape: saddle points, spurious optima
- Computationally intractable
- Serious overfitting and curse of dimensionality


3. Efficient Training by First-Order Algorithms
Existing Results: escape strict saddle points and converge to optima.
- Gradient Descent (GD): Lee et al., 2016; Jin et al., 2017; Panageas et al., 2017; Lee et al., 2017
- Stochastic Gradient Descent (SGD): Dauphin et al., 2014; Ge et al., 2015; Kawaguchi, 2016; Hardt and Ma, 2016; Jin et al., 2017; Jin et al., 2019
Still far from being well understood!
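For reference, the generic first-order updates this slide refers to are the standard ones sketched below in Python. This is only an illustration; the function names (full_grad_fn, sample_grad_fn) are placeholders and not code from the paper.

import numpy as np

# Standard first-order updates (illustrative; not taken from the paper).
def gd_step(w, full_grad_fn, eta):
    # Gradient descent: w_{t+1} = w_t - eta * grad F(w_t), using the full gradient.
    return w - eta * full_grad_fn(w)

def sgd_step(w, sample_grad_fn, eta, batch):
    # Stochastic gradient descent: replace the full gradient by the average
    # gradient over a randomly sampled mini-batch of examples.
    g = np.mean([sample_grad_fn(w, i) for i in batch], axis=0)
    return w - eta * g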


4. Practitioners’ Choice: Step Size Annealing
Remark:
- The variance of the noise scales with the step size.
- Noise level: large ⇒ small.
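The slide only states the remark above; as a rough illustration of stagewise step size annealing, a minimal Python sketch follows. The schedule values, stage lengths, and function names are hypothetical and not taken from the paper.

import numpy as np

# Illustrative sketch of SGD with a stagewise (piecewise-constant) step size.
# Because the stochastic gradient's noise enters the update scaled by eta,
# annealing eta drives the effective noise level from large to small.
def sgd_stagewise_annealing(stoch_grad_fn, w0, schedule=((0.1, 5000), (0.01, 5000), (0.001, 5000))):
    w = np.asarray(w0, dtype=float)
    for eta, num_iters in schedule:   # each stage uses a smaller step size
        for _ in range(num_iters):
            w = w - eta * stoch_grad_fn(w)
    return w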


5. Our Empirical Observations

                                         Generalization   Optima   Noise Level
GD                                       Bad              Sharp    No
SGD w./ very small step size             Bad              Sharp    Very small
SGD w./ stagewise step size annealing    Good             Flat     Decreasing

What We Know:
- Not all optima generalize.
- Noise helps select optima that generalize.


6. A Natural Question
How does noise help train neural networks in the presence of bad optima?

7. Challenges
General Neural Networks (NNs):
- Complex nonconvex landscape
- Beyond our technical limit
We Study: Two-Layer Nonoverlapping Convolutional NNs
- Non-trivial spurious local optimum (does not generalize)
- GD with random initialization gets trapped with constant probability (at least 1/4, and can be 3/4 in the worst case)
- Simple structure that is technically manageable
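The slide does not spell out the architecture; the Python sketch below shows what a two-layer network with a single non-overlapping convolutional filter typically looks like (one shared filter applied to disjoint input patches, a ReLU activation, and a linear output layer). The names and dimensions are illustrative, and the paper's exact formulation may differ.

import numpy as np

# Illustrative sketch only (assumed form, not necessarily the paper's exact model):
# a shared filter w is applied to k disjoint patches of the input, followed by
# a ReLU activation and second-layer weights a.
def two_layer_nonoverlapping_cnn(x, w, a):
    p = w.shape[0]
    patches = x.reshape(-1, p)              # k non-overlapping patches of size p
    hidden = np.maximum(patches @ w, 0.0)   # ReLU(w^T x_j) for each patch j
    return a @ hidden                       # linear second layer

# Example with made-up sizes: input of length 8, patch size 4, so k = 2 patches.
x, w, a = np.random.randn(8), np.random.randn(4), np.random.randn(2)
y = two_layer_nonoverlapping_cnn(x, w, a)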


8. Challenges
Stochastic Gradient Descent:
- Complex distribution of noise
- Dependency on iterates
We Study: Perturbed Gradient Descent with Noise Annealing
- Independent injected noise
- Uniform distribution
- Imitates the behavior of SGD
A non-trivial example provides new insights!
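As a rough illustration of the algorithm class described above (perturbed gradient descent with annealed, independent, uniformly distributed noise), the sketch below injects the perturbation into the gradient step. The noise radii, stage lengths, and names are hypothetical, and the paper's exact update rule may differ.

import numpy as np

# Illustrative sketch only: perturbed gradient descent where an independent
# uniform perturbation is injected into every update, and the noise radius r
# is annealed stagewise (large -> small) to imitate SGD with step size annealing.
def perturbed_gd_noise_annealing(grad_fn, w0, eta=0.01, stages=((1.0, 3000), (0.1, 3000), (0.01, 3000)), seed=0):
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float)
    for r, num_iters in stages:
        for _ in range(num_iters):
            xi = rng.uniform(-r, r, size=w.shape)   # independent Uniform([-r, r]^d) noise
            w = w - eta * (grad_fn(w) + xi)         # noise injected into the gradient step
    return w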

