A Quantitative Analysis of the Effect of Batch Normalization on Gradient Descent (PowerPoint presentation)

  1. A Quantitative Analysis of the Effect of Batch Normalization on Gradient Descent
     Yongqiang Cai¹, Qianxiao Li¹,², Zuowei Shen¹
     ICML, 9-15 June 2019, Long Beach, CA, USA
     ¹ Department of Mathematics, National University of Singapore, Singapore
     ² Institute of High Performance Computing, A*STAR, Singapore

  2. Batch Normalization
     A vanilla fully-connected layer: $z = \sigma(Wu + b)$.
     With batch normalization (Ioffe & Szegedy, 2015; see the sketch after this slide):
     $z = \sigma(\gamma\, N(Wu) + \beta)$, where $N(\xi) := \frac{\xi - \mathbb{E}[\xi]}{\sqrt{\mathrm{Var}[\xi]}}$.
     Batch normalization works well in practice: e.g. it allows stable training with large learning rates, and it works well in high dimensions and on ill-conditioned problems.
     Related work on BN: Ma & Klabjan (2017); Kohler et al. (2018); Arora et al. (2019).
     Question: can we quantify the precise effect of BN on gradient descent (GD)?
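A minimal NumPy sketch of the batch-normalized layer above may help fix notation. The function name bn_layer, the tanh activation, and the small eps added inside the square root for numerical stability are our own illustrative choices, not from the slides; the mini-batch mean and variance stand in for E[ξ] and Var[ξ]:

```python
import numpy as np

def bn_layer(U, W, gamma, beta, eps=1e-5, sigma=np.tanh):
    """z = sigma(gamma * N(W u) + beta), with N applied per output unit.

    U: (batch, d_in) mini-batch of inputs u
    W: (d_in, d_out) weight matrix
    gamma, beta: (d_out,) learnable scale and shift
    """
    xi = U @ W                             # pre-activations W u
    mean = xi.mean(axis=0)                 # mini-batch estimate of E[xi]
    var = xi.var(axis=0)                   # mini-batch estimate of Var[xi]
    N = (xi - mean) / np.sqrt(var + eps)   # N(xi) = (xi - E[xi]) / sqrt(Var[xi])
    return sigma(gamma * N + beta)

# Example: a mini-batch of 32 inputs through an 8 -> 4 layer.
rng = np.random.default_rng(0)
z = bn_layer(rng.standard_normal((32, 8)), rng.standard_normal((8, 4)),
             gamma=np.ones(4), beta=np.zeros(4))
```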

  3. Batch Normalization on Ordinary Least Squares
     Linear regression model: $y = x^T w^* + \text{noise}$, with input $x \in \mathbb{R}^d$ and label $y \in \mathbb{R}$.
     OLS regression without BN. Optimization problem:
     $\min_w J_0(w) := \mathbb{E}_{x,y}\left[\tfrac{1}{2}(y - x^T w)^2\right]$.
     Gradient descent dynamics:
     $w_{k+1} = w_k - \varepsilon \nabla_w J_0(w_k) = w_k + \varepsilon (g - H w_k)$,
     where $H := \mathbb{E}[xx^T]$, $g := \mathbb{E}[xy]$, $c := \mathbb{E}[y^2]$.
     OLS regression with BN. Optimization problem:
     $\min_{a,w} J(a, w) := \mathbb{E}_{x,y}\left[\tfrac{1}{2}\left(y - a\, N(x^T w)\right)^2\right]$.
     Gradient descent dynamics (see the NumPy sketch after this slide):
     $a_{k+1} = a_k - \varepsilon_a \nabla_a J(a_k, w_k) = a_k + \varepsilon_a \left(\frac{w_k^T g}{\sqrt{w_k^T H w_k}} - a_k\right)$,
     $w_{k+1} = w_k - \varepsilon \nabla_w J(a_k, w_k) = w_k + \frac{\varepsilon a_k}{\sqrt{w_k^T H w_k}} \left(g - \frac{w_k^T g}{w_k^T H w_k} H w_k\right)$.
     How does this compare with the GD case $w_{k+1} = w_k + \varepsilon (g - H w_k)$?
     Properties of interest: convergence, robustness.
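The two iterations translate line-for-line into NumPy. In the sketch below, the specific H, g, and w* are our own assumptions, chosen only to make the snippet runnable (a noiseless model with a fixed positive-definite H); the step functions implement the update formulas above exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 100
A = rng.standard_normal((d, d))
H = A @ A.T / d + 0.01 * np.eye(d)   # stands in for H := E[x x^T] (assumed positive definite)
w_star = rng.standard_normal(d)
g = H @ w_star                       # g := E[x y] for the noiseless model y = x^T w*

def gd_step(w, eps):
    """Plain GD on J_0: w_{k+1} = w_k + eps (g - H w_k)."""
    return w + eps * (g - H @ w)

def bngd_step(a, w, eps, eps_a):
    """BNGD on J: joint update of the rescaling a and the weights w."""
    s = np.sqrt(w @ H @ w)
    a_new = a + eps_a * ((w @ g) / s - a)
    w_new = w + (eps * a / s) * (g - (w @ g) / (w @ H @ w) * (H @ w))
    return a_new, w_new
```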

  4. Summary of Theoretical Results

     Property                     | Gradient Descent | Gradient Descent with BN
     -----------------------------|------------------|-------------------------------
     Convergence                  | only for small ε | arbitrary ε, provided ε_a ≤ 1
     Convergence rate             | linear           | linear (can be faster)
     Robustness to learning rates | small range of ε | wide range of ε
     Robustness to dimensions     | no effect        | the higher the better

     These properties are also observed in neural network experiments; a quick numerical check of the convergence row follows this slide.
     [Figures: (a) loss of GD and BNGD (d = 100); (b) effect of dimension on BNGD; (c) accuracy of BNGD on MNIST.]
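The convergence row of the table can be probed with the step functions from the previous sketch; this is a quick illustrative check, not the paper's experiment. The learning rate is deliberately set above GD's classical stability threshold $2/\lambda_{\max}(H)$, and the particular values (factor 2.5, ε_a = 0.9, 500 steps) are our own choices:

```python
# Continues the previous sketch (reuses rng, d, H, g, w_star, gd_step, bngd_step).
lam_max = np.linalg.eigvalsh(H).max()
eps = 2.5 / lam_max                   # above GD's stability threshold 2 / lambda_max

w_gd = rng.standard_normal(d)
a, w_bn = 1.0, w_gd.copy()
for _ in range(500):
    w_gd = gd_step(w_gd, eps)
    a, w_bn = bngd_step(a, w_bn, eps, eps_a=0.9)

print("GD error:", np.linalg.norm(w_gd - w_star))   # grows without bound at this eps
# BNGD converges in direction: the H-weighted cosine between w and w*
# approaches 1 in magnitude.
cos = (w_bn @ H @ w_star) / np.sqrt((w_bn @ H @ w_bn) * (w_star @ H @ w_star))
print("BNGD alignment with w*:", abs(cos))
```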

  5. Poster: Pacific Ballroom #54
