Training ImageNet in 15 Minutes With ChainerMN: A Scalable Distributed Deep Learning Framework
Takuya Akiba, Shuji Suzuki, Keisuke Fukuda, and Kota Uenishi Preferred Networks, Inc.
Who are we?
Preferred Networks, Inc. (PFN): A Tokyo-based Deep Learning & IoT company
Example: "Interactively Picking Real-World Objects with Unconstrained Spoken Language Instructions" (arXiv:1710.06280)
[Chart] Training time of ResNet-50 (90 epochs) on ImageNet: Goyal et al. (Facebook) 60 min, Codreanu et al. 62 min, Cho et al. (IBM) 50 min, You et al. 31 min, Akiba et al. (this work) 15 min.
[Photo] NVIDIA CEO, at SC'17
The research cycle: design a new model → train → evaluate. Faster training lets you run this loop more often: design a new model quicker, train faster, and get a better (or equivalent) model.
Chainer, an open-source deep learning framework: https://chainer.org/
Define-and-Run (Caffe2, TensorFlow, etc.): in the Define phase, the model definition is compiled into a computational graph and gradient function; in the Run phase, training data is fed through that fixed graph.
Define-by-Run (Chainer, PyTorch, TensorFlow Eager Execution, etc.): the model definition and the training data together build the computational graph and gradient function on the fly, as the forward computation runs. A minimal sketch follows.
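As an illustration of the idea (a toy MLP, not the paper's ResNet-50), this is roughly what define-by-run looks like in Chainer:

```python
import chainer
import chainer.functions as F
import chainer.links as L

class MLP(chainer.Chain):
    def __init__(self):
        super(MLP, self).__init__()
        with self.init_scope():
            self.l1 = L.Linear(None, 100)  # input size inferred on first run
            self.l2 = L.Linear(100, 10)

    def __call__(self, x):
        # The computational graph is built as this Python code executes,
        # so ordinary control flow (if / for) can reshape the network
        # from one forward pass to the next.
        h = F.relu(self.l1(x))
        return self.l2(h)
```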
Distributed Training with ChainerMN
Data-parallel, synchronous design: in each iteration every worker runs Forward and Backward on its own minibatch, the gradients are averaged across all workers with an All-Reduce, and then every worker runs Optimize to apply the identical update. A minimal setup sketch follows.
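A sketch of this setup with ChainerMN, launched via MPI (e.g. `mpiexec -n 8 python train.py`). The 'pure_nccl' communicator assumes ChainerMN 1.2+ with NCCL available, and a toy linear model stands in for ResNet-50:

```python
import chainer
import chainer.links as L
import chainermn

comm = chainermn.create_communicator('pure_nccl')  # NCCL-based all-reduce
device = comm.intra_rank                           # one GPU per MPI process

model = L.Classifier(L.Linear(None, 1000))         # stand-in for ResNet-50
chainer.cuda.get_device_from_id(device).use()
model.to_gpu()

# The multi-node optimizer all-reduces the gradients across workers after
# backward, so every worker applies the identical update in lockstep.
optimizer = chainermn.create_multi_node_optimizer(
    chainer.optimizers.MomentumSGD(lr=0.1), comm)
optimizer.setup(model)
```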
Hardware: 8 GPUs per node × 128 nodes = 1,024 GPUs in total; 2 InfiniBand HCAs per node, tree-like topology.
⇒ We can use 1,024 GPUs. Naive scaling of Goyal et al.'s result: 1 hour × (256 / 1,024) = 15 min. 🤕 Sounds easy? ABSOLUTELY NOT! The technical challenges (and our solutions) are described in:
“Extremely Large Minibatch SGD: Training ResNet-50 on ImageNet in 15 Minutes” (arXiv:1711.04325)
From Keskar et al. “On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima”
“It has been observed in practice that when using a larger batch there is a significant degradation in the quality of the model, as measured by its ability to generalize”
→ With a large batch, gradients are "less stochastic", which makes it difficult to escape from sharp local minima.
[Figure: loss landscape, with a sharp local minimum and a flatter region labeled "better model"]
"Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour" (arXiv:1706.02677)
Additional techniques for 1,024 GPUs:
– RMSprop warm-up: SGD generalizes well but converges slowly early on, so we start the training with RMSprop, then gradually transition to SGD (a sketch of the transition follows this list).
– Batch normalization without moving averages.
[Figure: transition functions, SGD weight vs. epoch]
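The transition can be pictured as a weight w that moves from 0 (pure RMSprop) to 1 (pure SGD), with the blended update delta = (1 - w) * delta_rmsprop + w * delta_sgd. A toy sketch with hypothetical start/end epochs (the actual schedule is in arXiv:1711.04325):

```python
def sgd_weight(epoch, start=5, end=15):
    """Fraction of the SGD update to use at a given epoch.

    `start` and `end` are illustrative values, not the paper's.
    """
    if epoch < start:
        return 0.0   # pure RMSprop warm-up
    if epoch >= end:
        return 1.0   # pure SGD from here on
    return (epoch - start) / float(end - start)  # linear blend in between

# Blended update at each step (delta_rmsprop / delta_sgd are the two
# optimizers' proposed parameter changes):
#   w = sgd_weight(epoch)
#   delta = (1 - w) * delta_rmsprop + w * delta_sgd
```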
Recall the pipeline: Forward → Backward → All-Reduce → Optimize, on every worker, every iteration. At 1,024 GPUs, the All-Reduce of gradients becomes the bottleneck.
Improve the All-Reduce bottleneck
(As of the experiment: Open MPI 2.1.2, InfiniBand FDR)
Benchmark: 64 MB Allreduce (MPI_SUM), 2 processes, default configuration (no advanced tuning). A baseline sketch follows.
– "MPI": Allreduce on an array in host memory (ordinary MPI_Allreduce).
– "MPI-CUDA": Allreduce on an array in GPU device memory (a CUDA-aware MPI lets you pass a device-memory pointer to MPI routines).
Result: NCCL is 5.9× faster!
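For reference, the host-memory "MPI" baseline looks roughly like this with mpi4py (an assumption: the original benchmark harness is not shown; launch with `mpiexec -n 2`):

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
n = 64 * 1024 * 1024 // 4              # 64 MB worth of float32 elements
sendbuf = np.ones(n, dtype=np.float32)
recvbuf = np.empty_like(sendbuf)

t0 = MPI.Wtime()
comm.Allreduce(sendbuf, recvbuf, op=MPI.SUM)  # ordinary MPI_Allreduce
t1 = MPI.Wtime()
if comm.rank == 0:
    print('64 MB Allreduce took %.3f ms' % ((t1 - t0) * 1e3))
```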
Improve the All-Reduce bottleneck: FP16 communication
Compute gradients → convert FP32 to FP16 → Allreduce (with NCCL) → convert FP16 back to FP32 and update.
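A hedged sketch of the cast/allreduce/cast step using CuPy; `allreduce` is a hypothetical stand-in for the NCCL all-reduce call, not a real ChainerMN function name:

```python
import cupy as cp

def allreduce_grad_fp16(grad_fp32, allreduce):
    """Halve the bytes on the wire by communicating gradients in FP16."""
    grad_fp16 = grad_fp32.astype(cp.float16)  # convert FP32 -> FP16
    allreduce(grad_fp16)                      # sum across workers (NCCL)
    return grad_fp16.astype(cp.float32)       # convert back for the update
```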
The accuracy degradation is negligible!!
Tips for users of NCCL v2 with >1,000 GPUs (as of NCCL 2.0.5):
– Set ulimit -n unlimited, or you will see 'unhandled system error'.
– Set ulimit -s unlimited, or you will see SEGV.
– If problems persist, just reboot all nodes.
[Chart] Training time of ResNet-50 (90 epochs) on ImageNet (shorter is faster); full results in the table below:
Team             Hardware            Software     Batch size  Time    Accuracy
He et al.        P100 × 8            Caffe        256         29 hr   75.3 %
Goyal et al.     P100 × 256          Caffe2       8,192       1 hr    76.3 %
Codreanu et al.  KNL 7250 × 720      Intel Caffe  11,520      62 min  75.0 %
Cho et al.       P100 × 256          Torch        8,192       50 min  –
You et al.       Xeon 8160 × 1600    Intel Caffe  16,000      31 min  75.3 %
This work        P100 × 1024         Chainer      32,768      15 min  74.9 %
– Dataset: ImageNet-1k
– Accuracy: single-crop top-1 validation accuracy
– Training duration: 90 epochs (a common configuration for ResNet-50)
We achieved a total training time of 15 minutes while maintaining a comparable accuracy of 74.9%.
AWS CloudFormation support is coming soon!
[Chart] Computing time of ImageNet training with double buffering + FP16 communication. Double buffering overlaps the gradient All-Reduce with the next iteration's computation, hiding communication latency (configuration sketch below).
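A hedged configuration sketch: the `allreduce_grad_dtype` and `double_buffering` arguments appeared around ChainerMN 1.2, so treat the exact names as assumptions on other versions:

```python
import numpy as np
import chainer
import chainermn

# FP16 gradient all-reduce (supported by the pure NCCL communicator):
comm = chainermn.create_communicator(
    'pure_nccl', allreduce_grad_dtype=np.float16)

# Double buffering: keep two gradient buffers so the all-reduce of one
# iteration overlaps with the computation of the next.
optimizer = chainermn.create_multi_node_optimizer(
    chainer.optimizers.MomentumSGD(lr=0.1), comm,
    double_buffering=True)
```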
95% scalability up to 32 GPUs!
[Chart] ResNet-50 on ImageNet training, curves for model acc. 75% and model acc. 76%, by Chris Ying (Google Brain)
How to move towards larger, more complex models?
The design space for distributed training: synchronous vs. asynchronous, and fine-grained vs. coarse-grained.
Synchronous: every worker runs Forward and Backward, gradients are combined with an All-Reduce, and all workers Optimize in lockstep.
Asynchronous: workers push gradient updates (ΔX) to a parameter server and pull fresh parameters independently, without waiting for one another. A toy sketch of the parameter-server update follows.
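A toy sketch of the asynchronous parameter-server update rule; the class and methods are generic illustrations, not ChainerMN API:

```python
import numpy as np

class ParameterServer:
    def __init__(self, params):
        self.params = params              # shared parameters: name -> array

    def push(self, delta_x, lr=0.1):
        # Workers push gradient updates (the ΔX in the diagram) whenever
        # they finish a minibatch; no synchronization with other workers.
        for name, grad in delta_x.items():
            self.params[name] -= lr * grad

    def pull(self):
        # Workers fetch the freshest parameters before the next minibatch.
        return {name: p.copy() for name, p in self.params.items()}

server = ParameterServer({'w': np.zeros(3)})
server.push({'w': np.array([1.0, -2.0, 0.5])})
print(server.pull())  # {'w': array([-0.1 ,  0.2 , -0.05])}
```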
Example: Mixture-of-Experts [Shazeer+ (Google Brain), ICLR'17]
[Figure: coarse-grained vs. fine-grained decomposition of the model]
Across this design space (synchronous / asynchronous × fine-grained / coarse-grained), the basic components of ChainerMN are ready to use, and the remaining combinations are under active development.