 
Deep Learning in Cloud-Edge AI Systems
Wei Wen
COMPUTATIONAL EVOLUTIONARY INTELLIGENCE LAB
Electrical and Computer Engineering Department, Duke University
www.pittnuts.com
AI Systems Landing Into the Cloud: Facebook Big Basin Google TPU Cloud
AI Systems Landing Upon the Edge: iPhone X Face ID Autonomous Driving DJI Drone Challenges in landing???
The Rocket Analogy of Deep Learning AI An analogy by Andrew Ng Rocket: Big Neural Networks Engine: Big Computing Fuel: Big Data 3 BIGs
Deep Learning in the Cloud Tons of Cables!!! Communication Bottleneck!!! Google TPU Cloud Parallelism
Deep Learning on the Edge Real-time Model Size and Inference Speed Matter!!!
Research Highlights
• Distributed Training in the Cloud
  – TernGrad: Ternary Gradients to Reduce Communication in Distributed Deep Learning, NIPS 2017 (oral)
  – On-going work on large-batch training
• Efficient Inference on the Edge
  – Structurally Sparse DNNs (NIPS 2016 & ICLR 2018)
  – Lower-rank DNNs (ICCV 2017)
  – A Compact DNN: Approaching GoogLeNet-Level Accuracy of Classification and Domain Adaptation (CVPR 2017)
  – Direct Sparse Convolution (ICLR 2017)
TernGrad: Ternary Gradients to Reduce Communication in Distributed Deep Learning
NIPS 2017 (Oral)
Wei Wen (1), Cong Xu (2), Feng Yan (3), Chunpeng Wu (1), Yandan Wang (4), Yiran Chen (1), Hai (Helen) Li (1)
(1) Duke University, (2) Hewlett Packard Labs, (3) University of Nevada - Reno, (4) University of Pittsburgh
https://github.com/wenwei202/terngrad
Background – Stochastic Gradient Descent
Minimization target: $C(w) \triangleq \frac{1}{n} \sum_{i=1}^{n} Q(z_i, w)$
Batch gradient descent (computation expensive): $w_{t+1} = w_t - \eta_t \cdot \frac{1}{n} \sum_{i=1}^{n} g_t^{(i)}$, where $g_t^{(i)} = \nabla Q(z_i, w_t)$
Mini-batch stochastic gradient descent (SGD) (computation cheap): $w_{t+1} = w_t - \eta_t \cdot \frac{1}{B} \sum_{j=1}^{B} g_t^{(j)}$
B << n samples are randomly drawn from the training dataset.
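The mini-batch update above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the least-squares loss, the function name `minibatch_sgd`, and all parameter values are hypothetical choices, not taken from the slides.

```python
import numpy as np

def minibatch_sgd(X, y, batch_size=32, lr=0.1, steps=500, seed=0):
    """Minimize C(w) = (1/n) * sum_i 0.5 * (x_i . w - y_i)^2 with mini-batch SGD.

    Assumption: a least-squares loss stands in for the generic Q(z_i, w).
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        # Draw B << n samples at random from the training set.
        idx = rng.choice(n, size=batch_size, replace=False)
        residual = X[idx] @ w - y[idx]
        # Mini-batch gradient: (1/B) * sum_j g^(j).
        grad = X[idx].T @ residual / batch_size
        w -= lr * grad
    return w

# Hypothetical noise-free data so the minimizer is known exactly.
data_rng = np.random.default_rng(1)
X = data_rng.normal(size=(1000, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true
w_hat = minibatch_sgd(X, y)
```

Each step touches only `batch_size` rows of `X`, which is why the per-step cost is independent of n.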
Background - Distributed Deep Learning
Synchronized Data Parallelism for Stochastic Gradient Descent (SGD):
Parameter server(s): $g_t = \frac{1}{NB} \sum_{i=1}^{NB} g_t^{(i)}$, then $w_{t+1} \leftarrow w_t - \eta_t g_t$
1. Training data is split into N subsets
2. Each worker has a model replica (copy)
3. Each replica is trained on a data subset
4. Synchronization in parameter server(s)
Batch size = NB (B samples per worker)
Scalability:
1. Computing time decreases with N
2. Communication can be the bottleneck
3. This work: quantizing gradients to three (i.e., ternary) levels {-1, 0, 1} (< 2 bits)
[Figure: parameter server(s) send $w_t$ to Worker 1, Worker 2, …, Worker N; worker i trains on Data i and returns its gradient.]
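A minimal single-process sketch of one synchronized step follows. It simulates the N workers with a loop, and it assumes a least-squares loss and equal-sized shards; `worker_gradient` and all sizes are hypothetical names and values chosen for illustration.

```python
import numpy as np

def worker_gradient(w, X_shard, y_shard):
    """Average least-squares gradient over one worker's B samples (assumed loss)."""
    residual = X_shard @ w - y_shard
    return X_shard.T @ residual / len(y_shard)

rng = np.random.default_rng(0)
N, B, d = 4, 8, 3                      # workers, samples per worker, parameters
X = rng.normal(size=(N * B, d))
y = rng.normal(size=N * B)
w = rng.normal(size=d)                 # the replica every worker starts from

# Steps 1-3: split the batch into N shards and compute one gradient per worker.
shards = zip(np.split(X, N), np.split(y, N))
worker_grads = [worker_gradient(w, Xs, ys) for Xs, ys in shards]

# Step 4: the parameter server averages the worker gradients and updates w.
server_grad = np.mean(worker_grads, axis=0)
w_next = w - 0.1 * server_grad

# With equal shards this matches the gradient over the whole batch of size NB.
full_grad = worker_gradient(w, X, y)
```

In a real cluster, every entry of `worker_grads` crosses the network at each step, which is the traffic that gradient quantization targets.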
Communication Bottleneck
[Figure: computation, communication, and total time (s, log scale from 10^-2 to 10^1) versus # of workers (1 to 512) for AlexNet. Plot annotations: $t_{comp}$ (from the model's flops and n), $t_{comm}$ (from $2|W|$, $g(n)$, and b), with $g(n) = \log_2(n)$. Credit: Alexander Ulanov]
Background - Distributed Deep Learning
Weeks or months -> hours
Communication Bottleneck!
[Figure: computation, communication, and total time (s, log scale from 10^-2 to 10^1) versus # of workers (1 to 512).]
An Alternative Setting
Parameter server: $g_t = \sum_i g_t^{(i)}$
1. Only exchange gradients
2. Gradient quantization can reduce communication in both directions
[Figure: Worker 1, Worker 2, …, Worker N each send $g_t^{(i)}$ to the parameter server, receive $g_t$, and update their own replica: $w_{t+1} \leftarrow w_t - \eta_t g_t$.]
Gradient Quantization for Communication Reduction
Parameter server: $g_t = \sum_i g_t^{(i)}$
Only exchange quantized gradients: 32 bits -> < 2 bits per gradient element
[Figure: Worker 1, Worker 2, …, Worker N each send $g_t^{(i)}$ to the parameter server, receive $g_t$, and update their own replica: $w_{t+1} \leftarrow w_t - \eta_t g_t$.]
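Stochastic Gradients without Bias
Batch Gradient Descent: $C(w) \triangleq \frac{1}{n} \sum_{i=1}^{n} Q(z_i, w)$, $w_{t+1} = w_t - \eta_t \cdot \frac{1}{n} \sum_{i=1}^{n} g_t^{(i)}$
SGD: $w_{t+1} = w_t - \eta_t \cdot g_t^{(I)}$, where I is randomly drawn from [1, n]; $E\{g_t^{(I)}\} = \nabla C(w)$. No bias.
TernGrad: $w_{t+1} = w_t - \eta_t \cdot ternarize(g_t^{(I)})$; $E\{ternarize(g_t^{(I)})\} = \nabla C(w)$. No bias.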
TernGrad is Simple
$\tilde{g}_t = ternarize(g_t) = s_t \cdot sign(g_t) \circ b_t$
$s_t \triangleq \|g_t\|_\infty \triangleq \max(abs(g_t))$
$P(b_{tk} = 1 \mid g_t) = |g_{tk}| / s_t$
$P(b_{tk} = 0 \mid g_t) = 1 - |g_{tk}| / s_t$
Example:
$g_t^{(i)}$: [0.30, -1.20, …, 0.9]
$s_t$: 1.20
Signs: [1, -1, …, 1]
$P(b_{tk} = 1 \mid g_t)$: [0.3/1.2, 1.2/1.2, …, 0.9/1.2]
$b_t$: [0, 1, …, 1]
$\tilde{g}_t^{(i)}$: [0, -1, …, 1] * 1.20
$E_{z,b}\{\tilde{g}_t\} = E_{z,b}\{s_t \cdot sign(g_t) \circ b_t\} = E_z\{s_t \cdot sign(g_t) \circ E_b\{b_t \mid z_t\}\} = E_z\{g_t\} = \nabla_w C(w_t)$
No bias
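The ternarize rule on this slide can be sketched in NumPy as follows. This is an illustrative sketch rather than the authors' released code; the function name, the zero-gradient guard, and the sample count are assumptions.

```python
import numpy as np

def ternarize(g, rng):
    """Stochastically map g to values in {-s, 0, +s}, with s = max(abs(g))."""
    s = np.max(np.abs(g))
    if s == 0.0:
        # Assumption: an all-zero gradient is returned unchanged.
        return np.zeros_like(g)
    # b_k = 1 with probability |g_k| / s, else 0 (independent Bernoulli draws).
    b = rng.random(g.shape) < np.abs(g) / s
    return s * np.sign(g) * b

rng = np.random.default_rng(0)
g = np.array([0.30, -1.20, 0.90])      # the example values from the slide

# Averaging many independent draws should approach g, since E{ternarize(g)} = g.
samples = np.stack([ternarize(g, rng) for _ in range(20000)])
mean_estimate = samples.mean(axis=0)
```

Each ternarized vector needs only the scalar `s` plus one of three symbols per element, which is where the "< 2 bits" figure comes from.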
Convergence
Standard SGD almost surely converges under assumptions (Fisk 1965, Metivier 1981&1983, Bottou 1998)
Assumption 1: $C(w)$ has a single minimum $w^*$ and $\forall \epsilon > 0$, $\inf_{\|w - w^*\|^2 > \epsilon} (w - w^*)^T \nabla_w C(w) > 0$.
Assumption 2: Learning rate $\gamma_t$ decreases neither very fast nor very slow: $\sum_{t=0}^{+\infty} \gamma_t^2 < +\infty$ and $\sum_{t=0}^{+\infty} \gamma_t = +\infty$.
Assumption 3 (gradient bound):
  Standard SGD: $E\{\|g\|^2\} \le A + B \|w - w^*\|^2$. Standard SGD almost surely converges.
  TernGrad: $E\{\|g\|_\infty \cdot \|g\|_1\} \le A + B \|w - w^*\|^2$. TernGrad almost surely converges.
Stronger gradient bound in TernGrad: $E\{\|g\|^2\} \le E\{\|g\|_\infty \cdot \|g\|_1\} \le A + B \|w - w^*\|^2$
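A short derivation, using only the ternarize definition $\tilde{g} = s \cdot sign(g) \circ b$ with $s = \|g\|_\infty$ and $P(b_k = 1 \mid g) = |g_k| / s$, suggests why the TernGrad bound takes the $\|g\|_\infty \cdot \|g\|_1$ form and why it is the stronger condition. The index k over gradient elements is notation introduced here for illustration.

```latex
% The TernGrad condition implies the standard one:
\|g\|_2^2 = \sum_k |g_k|^2
          \le \max_k |g_k| \cdot \sum_k |g_k|
          = \|g\|_\infty \cdot \|g\|_1

% Second moment of the ternarized gradient, taken over the Bernoulli draws b:
\mathbb{E}_b\{\|\tilde{g}\|_2^2\}
  = s^2 \sum_k P(b_k = 1 \mid g)
  = s^2 \sum_k \frac{|g_k|}{s}
  = \|g\|_\infty \cdot \|g\|_1
```

The two quantities coincide when every nonzero element of g has the same magnitude, and the gap widens as magnitudes become more spread out, which motivates shrinking the dynamic range within each ternarized group.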
Closing Bound Gap
Two methods to push the gradient bound of TernGrad closer to the bound of standard SGD
Layer-wise ternarizing
Approach:
1. Split gradients into buckets
2. Do TernGrad bucket by bucket
When bucket size == 1, TernGrad reduces to floating-point SGD
[Figure: gradients are collected and ternarized separately for Layer 1, Layer 2, …, Layer n-1, Layer n.]
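The bucket-by-bucket approach can be sketched as below, assuming a flat 1-D gradient vector and fixed-size contiguous buckets. `ternarize_in_buckets` is a hypothetical helper name, and `ternarize` repeats the rule from the "TernGrad is Simple" slide so the block is self-contained.

```python
import numpy as np

def ternarize(g, rng):
    """Stochastic ternarization with scaler s = max(abs(g))."""
    s = np.max(np.abs(g))
    if s == 0.0:
        return np.zeros_like(g)
    b = rng.random(g.shape) < np.abs(g) / s
    return s * np.sign(g) * b

def ternarize_in_buckets(g, bucket_size, rng):
    """Ternarize a 1-D gradient bucket by bucket, one scaler per bucket."""
    out = np.empty_like(g)
    for start in range(0, g.size, bucket_size):
        stop = start + bucket_size
        out[start:stop] = ternarize(g[start:stop], rng)
    return out

rng = np.random.default_rng(0)
g = rng.normal(size=12)

coarse = ternarize_in_buckets(g, bucket_size=g.size, rng=rng)  # one global scaler
fine = ternarize_in_buckets(g, bucket_size=4, rng=rng)         # three scalers
exact = ternarize_in_buckets(g, bucket_size=1, rng=rng)        # equals g
```

With `bucket_size=1` every element is its own scaler and is kept with probability 1, so the output equals the floating-point gradient; smaller buckets trade extra scalers for a tighter bound.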
Closing Bound Gap
Two methods to push the gradient bound of TernGrad closer to the bound of standard SGD
Gradient clipping
$f(g_i) = \begin{cases} g_i & |g_i| \le c\sigma \\ sign(g_i) \cdot c\sigma & |g_i| > c\sigma \end{cases}$
c = 2.5 works well for all tested datasets, DNNs and optimizers.
Suppose Gaussian distribution:
1. Change length 1.0%-1.5%
2. Change direction 2°-3°
3. Small bias with variance reduced
[Figure: gradient histograms over iterations for conv and fc layers: (a) original, (b) clipped.]
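The clipping function f can be sketched with `np.clip`. The slide does not define σ, so this sketch assumes it is the standard deviation of the gradient elements; the synthetic Gaussian gradient and the helper name `clip_gradient` are also assumptions for illustration.

```python
import numpy as np

def clip_gradient(g, c=2.5):
    """Clip each element of g to [-c * sigma, c * sigma].

    Assumption: sigma is the standard deviation of the elements of g.
    """
    sigma = np.std(g)
    return np.clip(g, -c * sigma, c * sigma)

# Synthetic Gaussian gradient, matching the slide's distributional assumption.
rng = np.random.default_rng(0)
g = rng.normal(size=100_000)
clipped = clip_gradient(g)

# Relative change in length and change in direction caused by clipping.
length_change = 1.0 - np.linalg.norm(clipped) / np.linalg.norm(g)
cosine = g @ clipped / (np.linalg.norm(g) * np.linalg.norm(clipped))
angle_degrees = np.degrees(np.arccos(np.clip(cosine, -1.0, 1.0)))
```

On Gaussian input with c = 2.5, `length_change` and `angle_degrees` can be compared directly against the 1.0%-1.5% and 2°-3° ranges quoted on the slide. Clipping lowers the maximum magnitude, which is the scaler used by ternarization.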
Gradient Histograms
[Figure: gradient histograms over iterations for conv and fc layers: (a) original, (b) clipped, (c) ternary, (d) final.]
Average gradients are not ternary, but quantized with 2N+1 levels (N: worker number).
Communication reduction: $\frac{32}{\log_2(2N+1)} > 1$, unless $N \ge 2^{30}$
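The reduction ratio can be evaluated for a few worker counts with the short sketch below. It assumes a 32-bit floating-point baseline and an ideal encoding of the 2N+1 levels; both function names are hypothetical.

```python
import math

def bits_per_element(num_workers):
    """Bits needed to encode one of the 2N+1 levels of the averaged gradient."""
    return math.log2(2 * num_workers + 1)

def reduction_ratio(num_workers, float_bits=32):
    """Communication reduction relative to sending float_bits-bit gradients."""
    return float_bits / bits_per_element(num_workers)

# Ratios for a range of cluster sizes (N = 1 is a single ternary gradient).
ratios = {n: reduction_ratio(n) for n in (1, 2, 8, 64, 512)}
```

The N = 1 entry corresponds to the worker-to-server direction, where each element takes one of three values; the ratio shrinks only logarithmically as workers are added.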
Integration with Manifold Optimizers
(All experiments: all hyper-parameters are tuned for standard SGD and fixed in TernGrad)
LeNet (total mini-batch size 64): close accuracy; randomness in TernGrad results in small variance
[Figure: accuracy (98.00% to 100.00%) of baseline and TernGrad versus N workers (2 to 64): (a) momentum SGD, (b) vanilla SGD.]
Integration with Manifold Optimizers
CIFAR-10, mini-batch size 64 per worker
Adam: D. P. Kingma, 2014
Tuning hyper-parameters specifically for TernGrad may reduce the accuracy gap
TernGrad works with manifold optimizers: Vanilla SGD, Momentum, Adam
Scaling to Large-scale Deep Learning
AlexNet
TernGrad: Randomness & regularization
(1) decrease randomness in dropout or
(2) use smaller weight decay
No new hyper-parameters added
N. S. Keskar, et al., ICLR 2017
Convergence Curves
[Figure: baseline vs. terngrad over 0 to 150000 iterations: (a) top-1 accuracy vs iteration, (b) training loss vs iteration, (c) gradient sparsity of terngrad in fc6.]
AlexNet trained on 4 workers with mini-batch size 512
Converges within the same epochs under the same base learning rate
Scaling to Large-scale Deep Learning
GoogLeNet accuracy loss is <2% on average.
Tuning hyper-parameters specifically for TernGrad may reduce the accuracy gap
Performance Model
Training throughput on GPU cluster with Ethernet and PCI switch
[Figure: (a) training throughput (images/sec) versus # of GPUs (1 to 512) for AlexNet, GoogLeNet, and VggNet-A, each with FP32 and TernGrad.]
TernGrad gives higher speedup when
1. using more workers
2. using smaller communication bandwidth (Ethernet vs InfiniBand)
3. training DNNs with more fully-connected layers (VggNet vs GoogLeNet)