SLIDE 1

Hiroki Naganuma¹, Rio Yokota²

¹School of Computing, Tokyo Institute of Technology  ²Global Scientific Information and Computing Center, Tokyo Institute of Technology

A Performance Improvement Approach for Second-Order Optimization in Large Mini-batch Training

May 14th, 2019, Cyprus. 2nd High Performance Machine Learning Workshop (HPML2019)

SLIDE 2

Overview

Our Work Position
  • Data-Parallel Distributed Deep Learning
  • Second-Order Optimization
  • Improving Generalization

Key Takeaways
  • Second-order optimization can converge faster than first-order optimization, but with lower generalization performance
  • Smoothing the loss function can improve the generalization performance of second-order optimization
SLIDE 3

Agenda

Introduction / Motivation
  • Accuracy ↗ Model Size and Data Size ↗
  • Need to Accelerate

Background / Problem
  • Three Parallelisms of Distributed Deep Learning
  • The Large Mini-Batch Training Problem
  • Two Strategies

Second-Order Optimization Approach
  • Natural Gradient Descent
  • K-FAC (Approximation Method)
  • Experimental Methodology and Results

Proposal to Improve Generalization / Conclusion
  • Sharp Minima and Flat Minima
  • Mixup Data Augmentation
  • Smoothing the Loss Function
  • Experimental Methodology and Results
SLIDE 4

Agenda

Introduction / Motivation
  • Accuracy ↗ Model Size and Data Size ↗
  • Need to Accelerate

Background / Problem
  • Three Parallelisms of Distributed Deep Learning
  • The Large Mini-Batch Training Problem
  • Two Strategies

Second-Order Optimization Approach
  • Natural Gradient Descent
  • K-FAC (Approximation Method)
  • Experimental Methodology and Results

Proposal to Improve Generalization / Conclusion
  • Sharp Minima and Flat Minima
  • Mixup Data Augmentation
  • Smoothing the Loss Function
  • Experimental Methodology and Results
SLIDE 5
  • 1. Introduction / Motivation

Recognition accuracy improves, and training time increases, as the number of parameters of convolutional neural networks (CNNs) grows

DNNs with many parameters tend to achieve high recognition accuracy

Fig1: [Y. Huang et al, 2018]

The figure shows the relationship between recognition accuracy on ImageNet-1K 1000-class classification and the number of parameters of the DNN model.

Fig2: [K. He et al, 2015] ResNet-50 architecture

It takes 256 GPU-hours※ to train ResNet-50 with 25M parameters (convergence to 75.9% Top-1 accuracy)

※When using NVIDIA Tesla P100+ GPUs


+ https://www.nvidia.com/en-us/data-center/tesla-p100/

SLIDE 6
  • 1. Introduction / Motivation

Importance and time cost of hyperparameter tuning in deep learning

Fig3: Pruning method in hyperparameter tuning (https://optuna.org)

In deep learning, tuning of hyperparameters is essential. Hyperparameters include:

  • Learning rate
  • Batch size
  • Number of training iterations
  • Number of layers of the neural network
  • Number of channels

Even with a pruning strategy, many trials trained to the end are necessary [J. Bergstra et al. 2011].
Time taken for hyperparameter tuning = a single training run (256 GPU-hours※ to train ResNet-50 with 25M parameters to 75.9% Top-1 accuracy) × multiple evaluations

※When using NVIDIA Tesla P100+ GPUs


+ https://www.nvidia.com/en-us/data-center/tesla-p100/

SLIDE 7

Necessity of distributed deep learning

Hyperparameter tuning is necessary, and it requires a lot of time to obtain a DNN with high recognition accuracy.

Speeding up on a single GPU is important, but there is a limit to that speedup.
=> Distributed deep learning is needed for further acceleration.
However, in the large mini-batch training used for acceleration, the finally obtained recognition accuracy is degraded.

  • 1. Introduction / Motivation
SLIDE 8

Agenda

Introduction / Motivation
  • Accuracy ↗ Model Size and Data Size ↗
  • Need to Accelerate

Background / Problem
  • Three Parallelisms of Distributed Deep Learning
  • The Large Mini-Batch Training Problem
  • Two Strategies

Second-Order Optimization Approach
  • Natural Gradient Descent
  • K-FAC (Approximation Method)
  • Experimental Methodology and Results

Proposal to Improve Generalization / Conclusion
  • Sharp Minima and Flat Minima
  • Mixup Data Augmentation
  • Smoothing the Loss Function
  • Experimental Methodology and Results
SLIDE 9

  • 2. Background / Problem

Three Parallelisms of Distributed Deep Learning

The parallelism of distributed deep learning mainly takes the following three forms:
  • 「A. What」 Model parallel / Data parallel
  • 「B. How」 Parameter server / Collective communication
  • 「C. When」 Synchronous / Asynchronous

  • 「A. What」 Data parallelism is essential for speedup
  • 「B. How」 We adopt the collective-communication method for speedup
  • 「C. When」 Each has pros and cons, and which to adopt is an unsolved problem. In this research, we use the synchronous type, as in previous work [J. Chen et al, 2018]

Fig4: Model/data parallelism  Fig5: How to communicate  Fig6: When parameters are updated

SLIDE 10

Three Parallelisms of Distributed Deep Learning

  • 2. Background / Problem

Synchronous data-parallel distributed deep learning expects speedup by increasing the batch size => large mini-batch training

Fig7: Difference between distributed deep learning and ordinary deep learning (e.g. batch size = 1 vs. batch size = 3)
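To make the data-parallel scheme concrete, here is a minimal sketch in NumPy (an illustration under simplified assumptions, not the authors' implementation): four simulated workers each compute a gradient on a local shard, the gradients are averaged as an all-reduce would do, and every worker applies the same update, which is equivalent to one large mini-batch of size 4 × 32 = 128.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy least-squares problem: loss(w) = mean((X @ w - y)**2)
X, y = rng.normal(size=(512, 10)), rng.normal(size=512)
w = np.zeros(10)

def grad(w, Xb, yb):
    # Gradient of the mean squared error on one local shard
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)

n_workers, local_batch, lr = 4, 32, 0.05
for step in range(100):
    idx = rng.choice(len(X), n_workers * local_batch, replace=False)
    shards = np.split(idx, n_workers)          # each worker draws a local mini-batch
    grads = [grad(w, X[s], y[s]) for s in shards]
    g = np.mean(grads, axis=0)                 # "all-reduce": average worker gradients
    w -= lr * g                                # synchronous update, identical on all workers
```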

SLIDE 11

Three Parallelisms of Distributed Deep Learning

  • 2. Background / Problem

Fig8: Convergence accuracy and training time for SB/LB using SGD (axes: validation accuracy vs. training time)

In synchronous data-parallel distributed deep learning, we expect speedup by increasing the batch size.

Synchronous data-parallel distributed deep learning = large mini-batch training, where the mini-batch size = |input data used for one update|.

Training with a large mini-batch (LB) in SGD is generally faster in training time than with a small mini-batch (SB), but the achievable recognition accuracy is degraded [Y. You et al. 2017]. LB training is fast, but its final accuracy is low; SB training takes time to converge, but its final accuracy is high.

SLIDE 12

Difference between Large Mini-Batch and Small Mini-Batch Training

  • 2. Background / Problem

Large mini-batch training is not the same optimization as small mini-batch training; problems arise from two differences.

Fig9: Left: parameter updates in LB training. Right: parameter updates in SB training.

By increasing the batch size, convergence in more accurate directions with fewer iterations is expected.

Supervised learning (optimization problem): minimize the objective function L(θ) = (1/|D|) Σ_{(x,y)∈D} ℓ(f(x; θ), y) over the training data D.

SLIDE 13

Differences between Large and Small Mini-Batch Training, and the Resulting Problems

  • 2. Background / Problem

Fig10: [E. Hoffer et al. 2017]

Problem 1: Decreased number of iterations (number of updates).
Problem 2: The gradient of the objective function is more accurate and its variance is reduced.

  • In LB training, the noise is not appropriate and generalization performance is degraded [S. Smith et al. 2018]
  • In SB training, the noise can be adjusted, so good generalization is expected [S. Mandt et al. 2017]
  • Recognition accuracy does not deteriorate if the number of iterations is increased [E. Hoffer et al. 2017] => but that forfeits the speedup from distributed deep learning => we must prevent the accuracy degradation that is a side effect of speedup

Fig11: Conceptual sketch of a sharp minimum and a flat minimum

SLIDE 14

Two Strategies to Deal with the Problems

  • 2. Background / Problem

Problem 1: Decreased number of iterations (number of updates)
  => Must converge in fewer iterations
Strategy 1: Use natural gradient descent (NGD)
  In large mini-batch training, the data in each batch is statistically stable, so NGD gains a large benefit from considering the curvature of the parameter space, and the direction of each update vector can be computed more correctly [S. Amari 1998]. Convergence in fewer iterations can be expected.

Problem 2: The gradient of the objective function is more accurate and its variance is reduced
  => Must avoid sharp minima
Strategy 2: Smooth the objective function
  By linearly interpolating the input data in large mini-batch training, convergence to a flat minimum is promoted during optimization of the loss function, aiming to improve generalization performance.

SLIDE 15

Agenda

Introduction / Motivation
  • Accuracy ↗ Model Size and Data Size ↗
  • Need to Accelerate

Background / Problem
  • Three Parallelisms of Distributed Deep Learning
  • The Large Mini-Batch Training Problem
  • Two Strategies

Second-Order Optimization Approach
  • Natural Gradient Descent
  • K-FAC (Approximation Method)
  • Experimental Methodology and Results

Proposal to Improve Generalization / Conclusion
  • Sharp Minima and Flat Minima
  • Mixup Data Augmentation
  • Smoothing the Loss Function
  • Experimental Methodology and Results
SLIDE 16
  • Gradient Descent
    DNN has a large number of parameters
    => Gradient methods that use the easily computed gradient of the loss function are mainstream

  • SGD: Stochastic Gradient Descent
    Large-scale training data
    => Randomly extract a small number of training cases (online stochastic optimization)
    => Process multiple training samples in parallel (mini-batch)

Mini-Batch Training: stochastic gradient descent using mini-batches

  θ_{t+1} = θ_t − η ∇L(θ_t; B_t)

where θ_t is the parameter after t updates, ∇L is the gradient of the loss function, η is the learning rate, and B_t is a randomly extracted mini-batch.
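A minimal runnable sketch of this update rule (NumPy, a toy least-squares loss; variable names such as `lr` and `batch_size` are illustrative, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(1)
X, y = rng.normal(size=(1000, 20)), rng.normal(size=1000)
theta = np.zeros(20)          # θ_0
lr, batch_size = 0.1, 64      # η and |B_t|

for t in range(200):
    batch = rng.choice(len(X), batch_size, replace=False)   # random mini-batch B_t
    Xb, yb = X[batch], y[batch]
    g = 2.0 * Xb.T @ (Xb @ theta - yb) / batch_size          # ∇L(θ_t; B_t)
    theta -= lr * g                                          # θ_{t+1} = θ_t − η ∇L(θ_t; B_t)
```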

  • 3. Second Order Optimization Approach
SLIDE 17

Two Strategies to Deal with the Problems

  • 2. Background / Problem

Problem 1: Decreased number of iterations (number of updates)
  => Must converge in fewer iterations
Strategy 1: Use natural gradient descent (NGD)
  In large mini-batch training, the data in each batch is statistically stable, so NGD gains a large benefit from considering the curvature of the parameter space, and the direction of each update vector can be computed more correctly [S. Amari 1998]. Convergence in fewer iterations can be expected.

Problem 2: The gradient of the objective function is more accurate and its variance is reduced
  => Must avoid sharp minima
Strategy 2: Smooth the objective function
  By linearly interpolating the input data in large mini-batch training, convergence to a flat minimum is promoted during optimization of the loss function, aiming to improve generalization performance.

SLIDE 18

Natural Gradient Descent

  • An optimization method proposed by [S. Amari 1998] based on information geometry
  • Uses the Fisher information matrix as the Riemannian metric (= curvature matrix)
  • Sets the update direction well, so faster convergence is expected

Stochastic Gradient Descent (SGD):  θ_{t+1} = θ_t − η ∇L(θ_t)   (∇L: gradient of the loss function)
Natural Gradient Descent (NGD):  θ_{t+1} = θ_t − η F⁻¹ ∇L(θ_t)   (F: Fisher information matrix)

Gradient Descent and Natural Gradient Method

  • 3. Second Order Optimization Approach

Stochastic Gradient Descent

  • It is difficult to escape from local solutions and plateaus
  • When the learning rate is increased, the parameter values oscillate and diverge at saddle points
SLIDE 19

NGD update: θ_{t+1} = θ_t − η F⁻¹ ∇L(θ_t), with F the Fisher information matrix

Pros and cons of the natural gradient method in deep learning

  • 3. Second Order Optimization Approach

Pros

  • Expected to converge in fewer iterations than SGD (and its improved variants)
  • When the batch size is large, the loss is statistically stable, so its curvature can be used to update model parameters in a more correct direction

Cons

  • Inverting the huge N × N Fisher information matrix is required for a huge number of parameters N
  • For example, N ≈ 3.5 × 10⁷ for ResNet-50, giving about 12 PB of memory consumption for the full matrix
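A back-of-the-envelope check of that memory claim (a sketch; it assumes 8-byte double precision, which lands at the same order of magnitude as the figure quoted above):

```python
N = 3.5e7                       # number of parameters, order of ResNet-50
bytes_per_entry = 8             # double precision
fim_bytes = N * N * bytes_per_entry
print(f"{fim_bytes / 1e15:.1f} PB")   # ≈ 9.8 PB for the dense N × N matrix
```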

Fig12: [J. Matt et al. 2017]

=> A natural gradient approximation method is needed (Fig12 compares SGD and NGD)

SLIDE 20

Three Strategies to Approximate the Natural Gradient (K-FAC)

  • 3. Second Order Optimization Approach

NGD update: θ_{t+1} = θ_t − η F⁻¹ ∇L(θ_t), with F the Fisher information matrix

Considering the number of parameters of recent DNNs, it is difficult to calculate the inverse of the Fisher information matrix (FIM) in the update equation. Three strategies to approximate it:

① Approximate the FIM (and its inverse)
  • N. Roux et al., 2008; D. Kingma et al., 2015; R. Grosse et al., 2015; [J. Martens et al., 2015]; P. Luo, 2016; [R. Grosse et al., 2016]; A. Botev et al., 2017; [J. Ba et al., 2017]

② Bring the FIM closer to the identity matrix
  • K. Cho et al., 2013; G. Desjardins et al., 2015; B. Neyshabur et al., 2015; T. Salimans et al., 2016

③ Approximate the update vector
  • S. Krishnan et al., 2017

The approximation method targeted by this work is K-FAC:
  • Block-diagonal approximation of the Fisher information matrix (one block per layer)
  • Expectation approximation of each block using Kronecker factorization
  • The inverse of the FIM is then obtained from the inverses of the small Kronecker factors

Fig13: [J. Martens et al., 2015]
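The following sketch illustrates the Kronecker-factored step for a single fully connected layer (illustrative NumPy under simplified assumptions, not the Chainer_K-FAC implementation): the layer's Fisher block is approximated as A ⊗ G, where A is the covariance of the layer inputs and G that of the back-propagated gradients, so inverting it reduces to two small inverses via (A ⊗ G)⁻¹ vec(∇W) = vec(G⁻¹ ∇W A⁻¹).

```python
import numpy as np

rng = np.random.default_rng(3)
d_in, d_out, n = 4, 3, 1000

a = rng.normal(size=(n, d_in))           # layer input activations
g = rng.normal(size=(n, d_out))          # back-propagated output gradients
A = a.T @ a / n + 1e-3 * np.eye(d_in)    # damped activation covariance, E[a aᵀ]
G = g.T @ g / n + 1e-3 * np.eye(d_out)   # damped gradient covariance, E[g gᵀ]

W_grad = rng.normal(size=(d_out, d_in))  # gradient of the loss w.r.t. the weights

# K-FAC preconditioning: two small solves instead of one (d_in*d_out)^2 inverse
nat_grad = np.linalg.solve(G, W_grad) @ np.linalg.inv(A)

# Check against the explicit Kronecker-product inverse
F = np.kron(A, G)
ref = np.linalg.solve(F, W_grad.flatten(order="F")).reshape((d_out, d_in), order="F")
assert np.allclose(nat_grad, ref)
```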

SLIDE 21
  • 3. Second Order Optimization Approach

Experimental Methodology

Dataset: CIFAR-10

The CIFAR-10 dataset consists of 32 × 32 pixel RGB color images labeled with the 10 classes {airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck}.

DNN Model: LeNet-5

LeNet-5, a simple multilayer neural network with the structure proposed by LeCun et al., was used as the DNN model.

Fig14: Categories and sample images of the CIFAR-10 training set used in the experiment. Table1: Network configuration of LeNet-5. Fig15: Network architecture of LeNet-5.

SLIDE 22

  • 3. Second Order Optimization Approach

Experimental Methodology

Training framework: Chainer, PyTorch

Using Chainer, an open-source machine learning library, we constructed the DNN model and implemented its training in Python. We used Chainer_K-FAC to implement distributed deep learning with K-FAC. For visualization of the loss function, PyTorch, another open-source machine learning library, was used, with reference to [H. Li et al. 2018].

Computational environment: ABCI (AI Bridging Cloud Infrastructure)

All experiments were performed on the ABCI supercomputer at AIST. One compute node was used, consisting of 4× NVIDIA Tesla V100 GPUs and 2× Intel Xeon Gold 6148 CPUs (2.4 GHz, 20 cores each). The software environment was CentOS 7.4, Python 3.6.5, cuDNN 7.4, and CUDA 9.2.

Training strategy

The network model is trained using mini-batches extracted randomly from the training data, with SGD / K-FAC as the optimization method. Learning rate decay is used to stabilize convergence, weight decay to suppress overfitting of the parameter values during training, and momentum to adjust the steepest-descent vector computed during training. The hyperparameters used in this experiment are shown in the table on the right.

Table2: Hyperparameters used in the experiment

SLIDE 23

Experiment 1: Training with SGD/K-FAC without smoothing (K-FAC advantage)

  • 3. Second Order Optimization Approach

Experimental Results

Fig16: Training on CIFAR-10 with LeNet-5 using SGD/K-FAC. SB denotes batch size 128, LB batch size 2K.

K-FAC converges faster than SGD and achieves better accuracy.

SLIDE 24

  • 3. Second Order Optimization Approach

Experimental Results

Experiment 1: Training with SGD/K-FAC without smoothing (K-FAC advantage)

Fig17: Zoom: training on CIFAR-10 with LeNet-5 using SGD/K-FAC (same epochs). SB denotes batch size 128, LB batch size 2K.

At the same number of iterations, K-FAC training achieves better accuracy than SGD.

SLIDE 25

Experiment 1: Training with SGD/K-FAC without smoothing (K-FAC disadvantage)

  • 3. Second Order Optimization Approach

Experimental Results

LB SGD training can reach almost the same accuracy as SB by increasing the number of iterations, but LB K-FAC training cannot. The accuracy degradation of K-FAC is 1.47%; that of SGD is 0.03%.

Fig18: Zoom: training on CIFAR-10 with LeNet-5 using SGD/K-FAC. SB denotes batch size 128, LB batch size 2K.

SLIDE 26

Agenda

Introduction / Motivation
  • Accuracy ↗ Model Size and Data Size ↗
  • Need to Accelerate

Background / Problem
  • Three Parallelisms of Distributed Deep Learning
  • The Large Mini-Batch Training Problem
  • Two Strategies

Second-Order Optimization Approach
  • Natural Gradient Descent
  • K-FAC (Approximation Method)
  • Experimental Methodology and Results

Proposal to Improve Generalization / Conclusion
  • Sharp Minima and Flat Minima
  • Mixup Data Augmentation
  • Smoothing the Loss Function
  • Experimental Methodology and Results
SLIDE 27

Two Strategies to Deal with the Problems

  • 2. Background / Problem

Problem 1: Decreased number of iterations (number of updates)
  => Must converge in fewer iterations
Strategy 1: Use natural gradient descent (NGD)
  In large mini-batch training, the data in each batch is statistically stable, so NGD gains a large benefit from considering the curvature of the parameter space, and the direction of each update vector can be computed more correctly [S. Amari 1998]. Convergence in fewer iterations can be expected.

Problem 2: The gradient of the objective function is more accurate and its variance is reduced
  => Must avoid sharp minima
Strategy 2: Smooth the objective function
  By linearly interpolating the input data in large mini-batch training, convergence to a flat minimum is promoted during optimization of the loss function, aiming to improve generalization performance.

SLIDE 28
  • 4. Proposal to improve generalization

Sharp Minima and Flat Minima

Q: Why does the generalization performance of large mini-batch training deteriorate?
=> Because LB (large batch size) training converges to a sharp minimum, while SB (small batch size) training converges to a flat minimum [N. Keskar et al, 2017]

Sharp minimum:
  Characterized by a significant number of large positive eigenvalues of ∇²f(x), and tends to generalize less well.
  The solution LB training converges to.

Flat minimum:
  Characterized by numerous small eigenvalues of ∇²f(x).
  The solution SB training converges to.

Fig19: A conceptual sketch of flat and sharp minima (axes: loss vs. variables/parameters)

Aim to converge to a flat minimum, not a sharp minimum
=> Our strategy: use data augmentation
SLIDE 29

Data Augmentation

  • Generate training samples by adding artificial noise to the training data
  • Especially in image recognition, clipping, flipping, deformation, noise addition, RGB-value manipulation, etc. are common [P. Y. Simard, et al., 2004]
  • Performance improvement is expected by adding the generated images to the original data for training

Fig20: Data augmentation (flip / cutout examples)

  • 4. Proposal to improve generalization
SLIDE 30

Mixup: Data Augmentation by Linear Interpolation of Training Data

Mixup [H. Zhang et al. 2018]

  • Linearly interpolates both the labels and the data of two randomly selected training samples (x_i, y_i) and (x_j, y_j) to generate a new training sample:
      x̃ = λ x_i + (1 − λ) x_j,  ỹ = λ y_i + (1 − λ) y_j,  λ ~ Beta(α, α)
  • Data augmentation methods such as Mixup were not developed to improve generalization performance in large mini-batch training.
  • However, as a remedy for the reduced noise and variance that are the problem in large mini-batch training, we verified whether generalization performance can be improved by letting Mixup play the role of objective-function smoothing.

Try smoothing the loss function by linearly interpolating the input data x (a sketch of the procedure follows below).

  • 4. Proposal to improve generalization
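A minimal sketch of the Mixup recipe above (NumPy; the function and variable names are illustrative assumptions, not from the talk):

```python
import numpy as np

def mixup_batch(x, y, alpha=0.5, rng=None):
    # Mixup [H. Zhang et al. 2018]: linearly interpolate inputs and one-hot labels
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)           # λ ~ Beta(α, α)
    perm = rng.permutation(len(x))         # random pairing within the batch
    x_mix = lam * x + (1 - lam) * x[perm]  # x̃ = λ x_i + (1 − λ) x_j
    y_mix = lam * y + (1 - lam) * y[perm]  # ỹ = λ y_i + (1 − λ) y_j
    return x_mix, y_mix, lam

# Example: a batch of 8 CIFAR-10-sized images with one-hot labels
rng = np.random.default_rng(4)
x = rng.random((8, 3, 32, 32), dtype=np.float32)
y = np.eye(10, dtype=np.float32)[rng.integers(0, 10, size=8)]
x_mix, y_mix, lam = mixup_batch(x, y, alpha=0.5, rng=rng)
```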
SLIDE 31

Fig21: Examples of the Beta distribution (histograms over 10000 draws)

Mixup: Data Augmentation by Linear Interpolation of Training Data

  • 4. Proposal to improve generalization

Mixup [H. Zhang et al. 2018]

  • Linear interpolation of both labels and data from two randomly selected training samples, as above

Thus, by using the Beta distribution, the interpolation of training data can be tuned finely through α.

SLIDE 32

Mixup: Data Augmentation by Linear Interpolation of Training Data

  • 4. Proposal to improve generalization

Fig22: Examples of training images generated by Mixup for α = 0.3, 0.5, 0.7, and 1.0, and their relationship to the Beta distribution; each mixed pair (ship/frog, cat/bird, truck/horse) is shown with its sampled λ.

SLIDE 33

How the training loss is computed with Mixup

Mixup (ship and frog) loss = λ × (ship loss) + (1 − λ) × (frog loss); the mixed image is used to evaluate the loss.

Since λ = 0.242 in this example, the loss is larger when the image cannot be inferred as a frog than when it cannot be inferred as a ship.

Fig23: Example of a training image generated by Mixup and its λ (a sketch of the loss computation follows below)

  • 4. Proposal to improve generalization
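As a minimal sketch of this computation (plain NumPy softmax cross-entropy; ship = class 8 and frog = class 6 in the standard CIFAR-10 ordering, and the logits here are random placeholders for a model output):

```python
import numpy as np

def cross_entropy(logits, label):
    # Softmax cross-entropy against an integer class label
    z = logits - logits.max()
    return -(z[label] - np.log(np.exp(z).sum()))

logits = np.random.default_rng(5).normal(size=10)  # placeholder output for the mixed image
lam = 0.242                                        # λ sampled from Beta(α, α)
loss = lam * cross_entropy(logits, 8) + (1 - lam) * cross_entropy(logits, 6)
# With λ = 0.242 the frog term carries most of the weight, matching the slide.
```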
SLIDE 34

Experimental Methodology

  • 4. Proposal to improve generalization

Dataset: CIFAR-10

The CIFAR-10 dataset consists of 32 × 32 pixel RGB color images labeled with the 10 classes {airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck}.

DNN Model: LeNet-5

LeNet-5, a simple multilayer neural network with the structure proposed by LeCun et al., was used as the DNN model.

Fig14: Categories and sample images of the CIFAR-10 training set used in the experiment. Table1: Network configuration of LeNet-5. Fig15: Network architecture of LeNet-5.

SLIDE 35

Experimental Methodology

  • 4. Proposal to improve generalization

Training framework: Chainer, PyTorch

Using Chainer, an open-source machine learning library, we constructed the DNN model and implemented its training in Python. We used Chainer_K-FAC to implement distributed deep learning with K-FAC. For visualization of the loss function, PyTorch, another open-source machine learning library, was used, with reference to [H. Li et al. 2018].

Computational environment: ABCI (AI Bridging Cloud Infrastructure)

All experiments were performed on the ABCI supercomputer at AIST. One compute node was used, consisting of 4× NVIDIA Tesla V100 GPUs and 2× Intel Xeon Gold 6148 CPUs (2.4 GHz, 20 cores each). The software environment was CentOS 7.4, Python 3.6.5, cuDNN 7.4, and CUDA 9.2.

Training strategy

The network model is trained using mini-batches extracted randomly from the training data, with SGD / K-FAC as the optimization method. Learning rate decay is used to stabilize convergence, weight decay to suppress overfitting of the parameter values during training, and momentum to adjust the steepest-descent vector computed during training. The hyperparameters used in this experiment are shown in the table on the right.

Table2: Hyperparameters used in the experiment

SLIDE 36

Experiment 2: Visualization of the loss function in K-FAC training using Mixup

How is this graph plotted?

The blue line shows the loss value, and the red line shows the Top-1 accuracy. The horizontal axis shows the displacement in parameter space:

  θ(α) = θ* + α d

where α is a scalar value ([−0.5, 1.5] in the graph on the left), d is Gaussian noise of the same dimension as the parameters, and θ* is the solution obtained in training (at x-coordinate 0). A plotting sketch follows below.

Fig24: One-dimensional linear interpolation plot of the solution obtained by training with K-FAC

Experimental Results

  • 4. Proposal to improve generalization
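A minimal sketch of this visualization (a toy quadratic stands in for the trained network's loss; `theta_star`, `loss_fn`, and the output file name are illustrative assumptions):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(6)

# Stand-ins: theta_star is the trained solution, loss_fn the training loss
theta_star = rng.normal(size=50)
H = rng.normal(size=(50, 50))
H = H @ H.T / 50                                   # positive semi-definite curvature
loss_fn = lambda th: 0.5 * (th - theta_star) @ H @ (th - theta_star)

d = rng.normal(size=theta_star.shape)              # Gaussian direction, same dim as θ
alphas = np.linspace(-0.5, 1.5, 101)               # the α range used on the slide
losses = [loss_fn(theta_star + a * d) for a in alphas]   # L(θ* + α d)

plt.plot(alphas, losses)
plt.xlabel("α (displacement along d)")
plt.ylabel("loss")
plt.savefig("loss_1d.png")
```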
SLIDE 37

Experimental Results

  • 4. Proposal to improve generalization

Experiment 2: Visualization of the loss function in K-FAC training using Mixup. Linear interpolation of the input data in large mini-batch training can be confirmed to explicitly promote convergence to a flat minimum in the optimization of the loss function.

Fig25: One-dimensional linear interpolation plot of the solution obtained by training with K-FAC
SLIDE 38

By applying Mixup, the generalization performance obtained by SGD/K-FAC LB training is improved.

Experiment 3: SGD/K-FAC training with smoothed loss function (LB comparison with and without Mixup)

Experimental Results

  • 4. Proposal to improve generalization

Applying Mixup: 2.09% improvement (LB SGD), 2.72% improvement (LB K-FAC)

Fig26: Training on CIFAR-10 with LeNet-5 using SGD/K-FAC. SB denotes batch size 128, LB batch size 2K.

SLIDE 39

Experiment 3: SGD/K-FAC training with smoothed loss function (comparing SGD and K-FAC, both with Mixup)

Experimental Results

  • 4. Proposal to improve generalization

K-FAC converges faster and achieves better accuracy than SGD.

Fig27: Training on CIFAR-10 with LeNet-5 using SGD/K-FAC with smoothing. SB denotes batch size 128, LB batch size 2K.

SLIDE 40

Experiment 3: SGD/K-FAC training with smoothed loss function (comparing SGD and K-FAC, both with Mixup)

Experimental Results

  • 4. Proposal to improve generalization

At the same number of epochs, K-FAC training achieves better accuracy than SGD.

Fig28: Zoom: training on CIFAR-10 with LeNet-5 using SGD/K-FAC with smoothing (same epochs). SB denotes batch size 128, LB batch size 2K.

SLIDE 41

Experiment 3: SGD/K-FAC training with smoothed loss function (comparing SGD and K-FAC, both with Mixup)

Experimental Results

  • 4. Proposal to improve generalization

With Mixup, the accuracy degradation is 0.35% for K-FAC and 1.88% for SGD. Without Mixup, the accuracy degradation of K-FAC was 1.47% and that of SGD 0.03%.

Fig29: Zoom: training on CIFAR-10 with LeNet-5 using SGD/K-FAC with smoothing. SB denotes batch size 128, LB batch size 2K.

SLIDE 42

Experiment 3: K-FAC training with smoothed loss function (K-FAC comparison with and without Mixup)

Experimental Results

  • 4. Proposal to improve generalization

Fig30: Training on CIFAR-10 with LeNet-5 using K-FAC with smoothing. SB denotes batch size 128, LB batch size 2K.

Applying Mixup: 2.72% improvement (LB K-FAC), 1.60% improvement (SB K-FAC)

SLIDE 43

Experiment 3: K-FAC training with smoothed loss function (K-FAC comparison with and without Mixup)

Experimental Results

  • 4. Proposal to improve generalization

Training with Mixup, the accuracy degradation is 0.69%; training without Mixup, it is 1.47%. By applying Mixup, generalization performance is improved and the performance degradation due to LB is reduced.

Fig31: Zoom: training on CIFAR-10 with LeNet-5 using K-FAC with smoothing. SB denotes batch size 128, LB batch size 2K.

SLIDE 44

Agenda

Introduction / Motivation
  • Accuracy ↗ Model Size and Data Size ↗
  • Need to Accelerate

Background / Problem
  • Three Parallelisms of Distributed Deep Learning
  • The Large Mini-Batch Training Problem
  • Two Strategies

Second-Order Optimization Approach
  • Natural Gradient Descent
  • K-FAC (Approximation Method)
  • Experimental Methodology and Results

Proposal to Improve Generalization / Conclusion
  • Sharp Minima and Flat Minima
  • Mixup Data Augmentation
  • Smoothing the Loss Function
  • Experimental Methodology and Results
SLIDE 45

Conclusion

Our Work Position
  • Data-Parallel Distributed Deep Learning
  • Second-Order Optimization
  • Improving Generalization

Contribution
  • Pointed out the problem of generalization performance degradation with second-order optimization
  • Validated whether the generalization degradation problem can be mitigated by focusing on the smoothness of the loss function
  • Discovered the change in the shape of the loss function caused by Mixup
  • Succeeded in suppressing the degradation of generalization performance to less than half of that of conventional methods

Future work
  • Perform experiments with larger datasets and DNN models
  • Mathematical elucidation is required of the relationship between generalization degradation due to the decreased number of updates and that due to the reduced variance of the gradient

SLIDE 46

References

[Y. Huang et al., 2018] GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism, arXiv preprint arXiv:1811.06965
[K. He et al., 2015] Deep Residual Learning for Image Recognition, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770-778
[J. Bergstra et al., 2011] Algorithms for Hyper-Parameter Optimization, NIPS 2011
[Y. You et al., 2017] Large Batch Training of Convolutional Networks, arXiv preprint arXiv:1708.03888
[E. Hoffer et al., 2017] Train Longer, Generalize Better: Closing the Generalization Gap in Large Batch Training of Neural Networks, NIPS 2017
[S. Mandt et al., 2017] Stochastic Gradient Descent as Approximate Bayesian Inference, Journal of Machine Learning Research, 18, pp. 1-35
[S. Smith et al., 2018] A Bayesian Perspective on Generalization and Stochastic Gradient Descent, ICLR 2018
[S. Amari, 1998] Natural Gradient Works Efficiently in Learning, Neural Computation, vol. 10, no. 2, pp. 251-276
[J. Martens et al., 2015] Optimizing Neural Networks with Kronecker-factored Approximate Curvature, ICML 2015
[R. Grosse et al., 2015] Scaling up Natural Gradient by Sparsely Factorizing the Inverse Fisher Matrix, ICML 2015
[R. Grosse et al., 2016] A Kronecker-factored Approximate Fisher Matrix for Convolution Layers, ICML 2016
[N. Keskar et al., 2017] On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima, ICLR 2017
[J. Chen et al., 2018] Revisiting Distributed Synchronous SGD, ICLR 2018
[A. Krizhevsky et al., 2012] ImageNet Classification with Deep Convolutional Neural Networks, Advances in Neural Information Processing Systems 25, pp. 1097-1105
[J. Dean et al., 2012] Large Scale Distributed Deep Networks, International Conference on Neural Information Processing Systems, vol. 1, pp. 1223-1231
[S. Gupta et al., 2015] Deep Learning with Limited Numerical Precision, ICML 2015
[P. Goyal et al., 2017] Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour, arXiv preprint arXiv:1706.02677
[Rissanen, 1978] Modeling by Shortest Data Description, Automatica, 14(5), pp. 465-471
[W. Wen et al., 2018] SmoothOut: Smoothing Out Sharp Minima to Improve Generalization in Deep Learning, arXiv preprint arXiv:1805.07898
[R. Kleinberg et al., 2018] An Alternative View: When Does SGD Escape Local Minima?, ICML 2018
[H. Zhang et al., 2018] mixup: Beyond Empirical Risk Minimization, ICLR 2018
[H. Li et al., 2018] Visualizing the Loss Landscape of Neural Nets, NeurIPS 2018

SLIDE 47

Thanks!
 Q&A

SLIDE 48

Backup

SLIDE 49

Experiment 3: SGD training with smoothed loss function (comparison with and without Mixup)

Experimental Results

  • 4. Proposal to improve generalization

Fig25: Training on CIFAR-10 with LeNet-5 using SGD with smoothing. SB denotes batch size 128, LB batch size 2K.