SLIDE 1

Training ImageNet in 15 Minutes With ChainerMN: A Scalable Distributed Deep Learning Framework

Takuya Akiba, Shuji Suzuki, Keisuke Fukuda, and Kota Uenishi Preferred Networks, Inc.

SLIDE 2

Who are we?

Preferred Networks, Inc. (PFN): A Tokyo-based Deep Learning & IoT company

SLIDE 3

Research and engineering in PFN

  • Strong engineering partnership
  • Active research
    – Constantly publish papers in top-tier ML conferences
    – Including 3 papers in ICLR’18

and more!

SLIDE 4

“Interactively Picking Real-World Objects with Unconstrained Spoken Language Instructions”

arXiv:1710.06280

SLIDE 5

Distributed Deep Learning

SLIDE 6

Training time of ResNet-50 (90 epochs) on ImageNet [bar chart, time in minutes]:

  • Goyal et al. (Facebook): 60 min
  • Codreanu et al.: 62 min
  • Cho et al. (IBM): 50 min
  • You et al.: 31 min
  • Akiba et al. (this work): 15 min

SLIDE 7

Jen-Hsun Huang

NVIDIA CEO, at SC’17

SLIDE 8

What we want:

Shorter training time. It is always better. No questions? 🙂

SLIDE 9

Answer: Not Really. Even if training time is faster…

  • Model accuracy is degraded => 😦
  • Programming is hard => 😦

Increasing the training throughput is easy… But it does not necessarily make R&D faster

SLIDE 10

What we really want:

Shorter training time → faster R&D cycle

[Diagram] R&D cycle: design a new model → train → evaluate. With faster training: design a new model quicker, train faster, and get a better (or equivalent) model.

SLIDE 11

Background of the ImageNet challenge

SLIDE 12


https://chainer.org/

SLIDE 13

Chainer: A Flexible Deep Learning Framework

Define-and-Run (Caffe2, TensorFlow, etc.):
  Define: model definition → computational graph + gradient function
  Run: computational graph + gradient function + training data

Define-by-Run (Chainer, PyTorch, TensorFlow Eager Execution, etc.):
  Model definition + training data → computational graph + gradient function, constructed on the fly as the forward pass runs
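To make the define-by-run idea concrete, here is a minimal Chainer-style sketch (not taken from the slides; the model and shapes are illustrative): the computational graph is recorded while the forward code executes.

```python
import chainer
import chainer.functions as F
import chainer.links as L
import numpy as np

class MLP(chainer.Chain):
    def __init__(self):
        super().__init__()
        with self.init_scope():
            self.l1 = L.Linear(None, 100)  # input size inferred at first call
            self.l2 = L.Linear(100, 10)

    def forward(self, x):
        # The computational graph is built here, while this code runs
        # (define-by-run), so Python control flow can change the graph per batch.
        h = F.relu(self.l1(x))
        return self.l2(h)

model = MLP()
x = np.random.rand(8, 784).astype(np.float32)
labels = np.zeros(8, dtype=np.int32)
loss = F.softmax_cross_entropy(model.forward(x), labels)
loss.backward()  # gradients follow the graph recorded during the forward pass
```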

SLIDE 14

ChainerMN: Distributed Training with Chainer

  • Add-on package for Chainer
  • Enables multi-node distributed deep learning using NVIDIA NCCL2

Features

  • Scalable: Near-linear scaling with hundreds of GPUs
  • Flexible: Even GANs, dynamic NNs, and RL are applicable

[Diagram] Distributed training with ChainerMN: on each worker, Forward → Backward, then an All-Reduce of gradients across workers, then Optimize.
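As a rough sketch of how this is wired up in user code (the communicator choice, model, and dataset handling are illustrative assumptions, not details from the slides):

```python
import chainer
import chainermn

# One MPI process per GPU; the communicator handles the All-Reduce of gradients.
comm = chainermn.create_communicator('pure_nccl')  # NCCL-based all-reduce
device = comm.intra_rank                            # GPU id within the node
chainer.cuda.get_device_from_id(device).use()

model = MLP()  # e.g. the Chain sketched on the previous slide
model.to_gpu()

# Wrapping the optimizer makes update() all-reduce gradients before optimizing.
optimizer = chainermn.create_multi_node_optimizer(
    chainer.optimizers.MomentumSGD(lr=0.1), comm)
optimizer.setup(model)

# Each worker then trains on its own shard of the data, e.g. via
# chainermn.scatter_dataset(full_dataset, comm, shuffle=True).
```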

SLIDE 15

MN-1: an in-house supercomputer

  • NVIDIA Tesla P100 × 1024

8 GPUs per node, 128 nodes in total

  • Inter-connected by InfiniBand FDR

2 HCAs per node, tree-like topology

We have only about 120 employees, so this is a very large machine for us! Fun! (Do you think it’s crazy?)

SLIDE 16

OK, let’s tackle the ImageNet problem with our 1024 P100 GPUs!

SLIDE 17

Our goal: 15 min.

  • Training CNNs on ImageNet is very time consuming
  • Original ResNet-50 paper: 29 hours using 8 GPUs
  • Notable achievement by Goyal et al.: 1 hour using 256 GPUs

⇒ We can use 1024 GPUs: 1 hour × 256 / 1024 = 15 min. 🤕 Sounds easy? ABSOLUTELY NOT! Technical challenges:

  • 1. Large batch problem
  • 2. Performance scalability (while keeping flexibility)
  • 3. Troubles 😦

“Extremely Large Minibatch SGD: Training ResNet-50 on ImageNet in 15 Minutes” (arXiv:1711.04325)

SLIDE 18

Challenges in the “ImageNet-15min challenge”

  • 1. The “large batch” problem

– “Sharp minima” – Fewer training iterations

  • 2. Performance scalability
  • 3. Technical issues 😦

SLIDE 19

Challenges in the “ImageNet-15min challenge”

  • 1. The “large batch” problem

– “Sharp minima” – Fewer training iterations

  • 2. Performance scalability
  • 3. Technical issues 😦

SLIDE 20

Challenge 1: The “large batch” problem

From Keskar et al. “On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima”

“It has been observed in practice that when using a larger batch there is a significant degradation in the quality of the model, as measured by its ability to generalize”

  • 1. The gradients computed in each iteration are an average over a larger number of samples

→ gradients are “less stochastic”, which makes it difficult to escape from local minima

  • 2. Total number of iterations (=updates) is smaller

[Diagram: loss landscape, with labels “local minima” and “better model”]

SLIDE 21
  • Linear scaling rule:
  • “If minibatch-size is k times larger, increase learning rate by k times”
  • Gradual warmup scheme


“Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour”

arXiv:1706.02677
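A minimal sketch of these two rules together (the base learning rate, reference batch size, and warmup length are illustrative values in the spirit of Goyal et al., not numbers from this slide):

```python
def learning_rate(epoch, batch_size,
                  base_lr=0.1, base_batch_size=256, warmup_epochs=5):
    """Linear scaling rule with gradual warmup (values are illustrative)."""
    # Linear scaling rule: k times larger minibatch -> k times larger learning rate.
    target_lr = base_lr * batch_size / base_batch_size
    if epoch < warmup_epochs:
        # Gradual warmup: ramp linearly from base_lr up to the scaled target_lr.
        return base_lr + (target_lr - base_lr) * epoch / warmup_epochs
    return target_lr

print(learning_rate(epoch=0, batch_size=8192))  # 0.1 (start of warmup)
print(learning_rate(epoch=5, batch_size=8192))  # 3.2 (= 0.1 * 8192 / 256)
```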

SLIDE 22

Additional techniques for 1024 GPUs:

  • We needed to go further: 32*1024 = 32k batchsize!
  • RMSprop Warmup

    – SGD: generalizes well, but converges slower
    – We start the training with RMSprop, then gradually transition to SGD

  • Batch normalization without moving averages

[Plot: transition function, SGD weight vs. epoch]
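A very rough sketch of the idea only; the exact transition function and update rule are in the paper (arXiv:1711.04325), and the blending and hyperparameters below are placeholders:

```python
import numpy as np

def transition_weight(epoch, start=2, end=10):
    """Illustrative transition function: 0 = pure RMSprop, 1 = pure SGD."""
    return float(np.clip((epoch - start) / (end - start), 0.0, 1.0))

def blended_update(w, grad, ms, lr, epoch, eps=1e-8):
    """One parameter update that interpolates RMSprop and SGD (sketch only)."""
    ms[:] = 0.99 * ms + 0.01 * grad ** 2           # RMSprop running mean of squares
    rmsprop_step = grad / (np.sqrt(ms) + eps)
    sgd_step = grad
    alpha = transition_weight(epoch)                # grows from 0 to 1 over epochs
    w -= lr * ((1 - alpha) * rmsprop_step + alpha * sgd_step)
    return w
```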

SLIDE 23

Challenges in the “ImageNet-15min challenge”

  • 1. The “large batch” problem

– “Sharp minima” – Fewer training iterations

  • 2. Performance scalability
  • 3. Technical issues 😦

SLIDE 24

Challenge 2: Performance scalability

[Diagram] On each worker: Forward → Backward → All-Reduce of gradients → Optimize.

Allreduce operation is critical for scalability

SLIDE 25

How to overcome scalability challenge?

  • Use faster communication routines
  • Reduce communication data


Improve the All-reduce bottleneck

SLIDE 26

How to overcome scalability challenge?

  • Use faster communication routines
  • Reduce communication data


Improve the All-reduce bottleneck

SLIDE 27

Faster communication routines

  • ChainerMN is built on top of MPI

    – Just call MPI_Allreduce() and nothing else to do? (MPI should be well tuned… agreed?)
    – Bandwidth efficiency of MPI_Allreduce with GPUDirect: 10%

(as of the experiment, Open MPI 2.1.2, Infiniband FDR)

SLIDE 28

NCCL: NVIDIA Collective Communications Library

64MB Allreduce (MPI_SUM), 2 processes,

  • Open MPI 2.1.2

(default configuration: no advanced tuning)

  • Over Infiniband FDR(4x)

“MPI”: Allreduce of an array on host memory (ordinary MPI_Allreduce)
“MPI-CUDA”: Allreduce of an array on GPU device memory (you can pass a device memory pointer to MPI routines)

NCCL is 5.9× faster!
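For reference, a minimal mpi4py sketch of the two MPI variants being compared (assuming a CUDA-aware MPI build and an mpi4py version that accepts CuPy arrays; this is illustrative, not the actual benchmark code):

```python
# Run with e.g.: mpiexec -n 2 python allreduce_bench.py
from mpi4py import MPI
import numpy as np
import cupy as cp

comm = MPI.COMM_WORLD
n = 16 * 1024 * 1024  # 64 MB of float32

# "MPI": allreduce an array that lives in host memory.
host_buf = np.ones(n, dtype=np.float32)
host_out = np.empty_like(host_buf)
comm.Allreduce(host_buf, host_out, op=MPI.SUM)

# "MPI-CUDA": pass device memory directly (requires CUDA-aware MPI;
# recent mpi4py picks up CuPy arrays via __cuda_array_interface__).
dev_buf = cp.ones(n, dtype=cp.float32)
dev_out = cp.empty_like(dev_buf)
comm.Allreduce(dev_buf, dev_out, op=MPI.SUM)
```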

SLIDE 29

Further optimizations for NCCL:

  • Improve network performance

    – GPUDirect P2P & RDMA
    – Manual ring configuration

SLIDE 30

How to overcome scalability challenge?

  • Use faster communication routines
  • Reduce communication data


Improve the All-reduce bottleneck

SLIDE 31

Reduce communication data: use FP16

Compute gradients → convert FP32 to FP16 → Allreduce (with NCCL) → convert FP16 back to FP32 and update

The accuracy degradation is negligible!!
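A minimal sketch of that pipeline with CuPy (illustrative only; the real logic lives inside ChainerMN’s communicator, and `allreduce_fp16` below is a stand-in for the NCCL all-reduce call):

```python
import cupy as cp

def allreduce_grads_fp16(grads_fp32, allreduce_fp16):
    """Halve the all-reduce traffic by communicating gradients in FP16."""
    for g in grads_fp32:
        g16 = g.astype(cp.float16)       # FP32 -> FP16 before communication
        allreduce_fp16(g16)              # sum across workers (e.g. via NCCL)
        g[...] = g16.astype(cp.float32)  # FP16 -> FP32 before the optimizer update
```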

SLIDE 32

Challenges in the “ImageNet-15min challenge”

  • 1. The “large batch” problem

– “Sharp minima” – Fewer training iterations

  • 2. Performance scalability
  • 3. Technical issues 😦

SLIDE 33

Crash… Crash… Crash…

  • The more you buy, the more you crash
  • ≧ 192 GPUs: Crash → NCCL2: too many file descriptors
  • ≧ 784 GPUs: Crash → Bug in ChainerMN
  • ≧ 944 GPUs: Crash → NCCL2: stack overflow
  • Some GPUs were broken, as well


(As of NCCL 2.0.5)

SLIDE 34

Crash… Crash… Crash…

Tips for users of NCCL v2 with >1000 GPUs:

  • NCCL v2 opens a large number of file descriptors.
    – ulimit -n unlimited, or you will see ‘unhandled system error’
  • NCCL v2 uses a huge amount of stack.
    – ulimit -s unlimited, or you will see a SEGV
  • When it suddenly starts to claim ‘unhandled system error’, just reboot all nodes.


(As of NCCL 2.0.5)

SLIDE 35

Training time of ResNet-50 (90 epochs) on ImageNet [bar chart, time in minutes; shorter is faster]:

  • Goyal et al. (Facebook): 60 min
  • Codreanu et al.: 62 min
  • Cho et al. (IBM): 50 min
  • You et al.: 31 min
  • Akiba et al. (this work): 15 min

SLIDE 36

Training ResNet-50 on ImageNet in 15 mins

| Team | Hardware | Software | Batch size | Time | Accuracy |
| He et al. | P100 × 8 | Caffe | 256 | 29 hr | 75.3 % |
| Goyal et al. | P100 × 256 | Caffe2 | 8,192 | 1 hr | 76.3 % |
| Codreanu et al. | KNL 7250 × 720 | Intel Caffe | 11,520 | 62 min | 75.0 % |
| Cho et al. | P100 × 256 | Torch | 8,192 | 50 min | |
| You et al. | Xeon 8160 × 1600 | Intel Caffe | 16,000 | 31 min | 75.3 % |
| This work | P100 × 1024 | Chainer | 32,768 | 15 min | 74.9 % |

  • Dataset: ImageNet-1k
  • Accuracy: single-crop top-1 validation accuracy
  • Training duration: 90 epochs (common configuration for ResNet-50)

We achieved a total training time of 15 minutes while maintaining a comparable accuracy of 74.9%.

  • T. Akiba, et al. “Extremely Large Minibatch SGD: Training ResNet-50 on ImageNet in 15 Minutes” (modified)
SLIDE 37

Maybe you are thinking: We don’t have so many GPUs… Our GPU cluster does not have Infiniband… It’s not for us 🙂

SLIDE 38

ChainerMN is for you.

SLIDE 39

Want to try Chainer + ChainerMN?


Cloud formation support is coming soon!

SLIDE 40

Optimization technique for non-IB environment: Double buffering

  • Each update uses the gradients from the previous iteration (1-step stale gradients)
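A conceptual sketch of double buffering (the helper and communicator method names below are placeholders, not ChainerMN API): the all-reduce of the previous iteration’s gradients overlaps with the current forward/backward pass, so each update applies 1-step stale gradients.

```python
def train_with_double_buffering(model, optimizer, batches, comm):
    """Overlap gradient all-reduce with the next forward/backward pass.

    `comm.allreduce_async`, `wait`, and the helpers are illustrative names only.
    """
    pending = None  # handle for the all-reduce started in the previous iteration
    for batch in batches:
        grads = compute_forward_backward(model, batch)   # current gradients
        handle = comm.allreduce_async(grads)             # start non-blocking all-reduce
        if pending is not None:
            stale_grads = pending.wait()                 # previous iteration's gradients
            optimizer.update_with(model, stale_grads)    # 1-step stale update
        pending = handle
```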

SLIDE 41

Computing time of ImageNet training with Double Buffering + FP16 communication

2.1 times faster!

  • Local batchsize: 64
  • 32 processes
  • NCCL for Allreduce
SLIDE 43

95% scalability up to 32 GPUs!!

[Chart: ResNet-50 on ImageNet training; legend: model acc. 75%, model acc. 76%]

  • 25Gbps Ethernet
  • Double buffering
  • FP16 communication (NCCL)
  • V100 GPUs
  • Batchsize: 64/GPU
SLIDE 44

Next step?

“ImageNet is the new MNIST”

by Chris Ying (Google Brain)


How to move towards larger, more complex models?

SLIDE 45

Taxonomy of distributed deep learning

  • Data-parallelism: synchronous / asynchronous
  • Model-parallelism: fine-grained / coarse-grained

Main focus (currently): synchronous data-parallelism

SLIDE 46

Data parallel: sync vs. async

Synchronous: each worker runs Forward → Backward, gradients are combined with an All-Reduce, and every worker runs Optimize in lockstep.

Asynchronous: each worker sends its updates (ΔX) to a parameter server independently, without waiting for the other workers.

SLIDE 47

Model parallelism

Example: Mixture-of-Experts [Shazeer+(Google Brain), ICLR’17]

[Diagram: coarse-grained vs. fine-grained model parallelism]

SLIDE 48

ChainerMN’s focus, now and future

  • Data-parallelism: synchronous / asynchronous
  • Model-parallelism: fine-grained / coarse-grained

Main focus (currently): synchronous data-parallelism
Other areas: under active development, with basic components ready to use

SLIDE 49

Conclusion

  • We finished training of ResNet-50 on ImageNet in 15 min.
  • We achieved all of speed, accuracy, and productivity


We will continue tackling hard problems!

SLIDE 50

Thank you!
