Deep Learning in Cloud-Edge AI Systems
Wei Wen
COMPUTATIONAL EVOLUTIONARY INTELLIGENCE LAB, Electrical and Computer Engineering Department, Duke University, www.pittnuts.com
AI Systems Landing Into the Cloud: Facebook Big Basin, Google TPU
AI Systems Landing on the Edge: iPhone X Face ID, DJI Drone, Autonomous Driving
Challenges in landing???
An analogy by Andrew Ng
The AI rocket:
– Rocket: Big Neural Networks
– Engine: Big Computing
– Fuel: Big Data
Cloud (e.g., Google TPU): Parallelism
Edge: Real-time
In the cloud:
– TernGrad: Ternary Gradients to Reduce Communication in Distributed Deep Learning, NIPS 2017 (oral)
– Ongoing work on large-batch training
On the edge:
– Structurally Sparse DNNs (NIPS 2016 & ICLR 2018)
– Lower-rank DNNs (ICCV 2017)
– A Compact DNN: Approaching GoogLeNet-Level Accuracy of Classification and Domain Adaptation (CVPR 2017)
– Direct Sparse Convolution (ICLR 2017)
NIPS 2017 (Oral) Wei Wen1, Cong Xu2, Feng Yan3, Chunpeng Wu1, Yandan Wang4, Yiran Chen1, Hai (Helen) Li1
Duke University1, Hewlett Packard Labs2, University of Nevada - Reno3, University of Pittsburgh4
https://github.com/wenwei202/terngrad
Minimization target: $C(\mathbf{w}) \triangleq \frac{1}{n}\sum_{i=1}^{n} Q(z_i, \mathbf{w})$
Batch gradient descent (computation expensive): $\mathbf{w}_{t+1} = \mathbf{w}_t - \eta_t \cdot \frac{1}{n}\sum_{i=1}^{n} \mathbf{g}_t^{(i)}$
Mini-batch stochastic gradient descent (SGD, computation cheap): $\mathbf{w}_{t+1} = \mathbf{w}_t - \eta_t \cdot \frac{1}{b}\sum_{i=1}^{b} \mathbf{g}_t^{(i)}$, where the $b \ll n$ samples are randomly drawn from the training dataset and $\mathbf{g}_t^{(i)} = \nabla Q(z_i, \mathbf{w}_t)$.
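To make the update rule concrete, here is a minimal NumPy sketch of mini-batch SGD on a toy least-squares problem; the dataset, learning rate, and batch size are illustrative assumptions, not values from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy dataset: n = 1000 samples from a noisy linear model (illustrative)
X = rng.normal(size=(1000, 5))
w_true = np.array([1.0, -2.0, 0.5, 3.0, -1.0])
y = X @ w_true + 0.1 * rng.normal(size=1000)

def grad(w, idx):
    """Average gradient of the squared loss Q(z_i, w) over the samples in idx."""
    Xb, yb = X[idx], y[idx]
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(idx)

w = np.zeros(5)
eta = 0.01
for t in range(2000):
    batch = rng.integers(0, len(X), size=32)  # b = 32 << n samples drawn at random
    w -= eta * grad(w, batch)                 # w_{t+1} = w_t - eta_t * g_t
print(np.round(w, 2))                         # close to w_true
```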
Synchronized Data Parallelism for Stochastic Gradient Descent (SGD):
– Parameter server(s): $\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t - \eta_t \mathbf{h}_t$
– Worker 1, Worker 2, …, Worker N: each computes a local gradient $\mathbf{h}_t^{(1)}, \mathbf{h}_t^{(2)}, \ldots, \mathbf{h}_t^{(N)}$ on its own shard (Data 1, Data 2, …, Data N) and pulls the current $\mathbf{w}_t$.
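A hedged sketch of one such synchronized step (simulated serially; grad_fn and the shard structure are illustrative, and real workers run in parallel on separate machines):

```python
import numpy as np

def parallel_sgd_step(w, shards, eta, grad_fn):
    """One synchronized data-parallel SGD step, simulated serially.

    Each of the N workers computes a local gradient h_i on its own data
    shard; the parameter server averages them into h_t, applies
    w_{t+1} = w_t - eta_t * h_t, and broadcasts the new weights.
    """
    grads = [grad_fn(w, shard) for shard in shards]  # worker i -> h_t^(i)
    h = np.mean(grads, axis=0)                       # server-side aggregation
    return w - eta * h                               # updated weights to broadcast
```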
Scalability:
Quantize gradients to three (i.e., ternary) levels $\{-1, 0, 1\}$, under 2 bits per element. Each worker computes $\mathbf{h}_t^{(i)}$ over $b$ samples, so $\mathbf{h}_t = \frac{1}{Nb}\sum_{i=1}^{N} \mathbf{h}_t^{(i)}$ and the effective batch size is $Nb$.
[Figure: computation, communication, and total time (s, log scale) vs. # of workers (1 to 512) for distributed AlexNet training; credit: Alexander Ulanov]
Analytical model: $t_{comp} = n_{flops} \times t_{flop}$ and $t_{comm} = 2 \times |W| \times g(n) \times b$, with aggregation factor $g(n) = \log_2(n)$.
Scaling out over more workers and more data: weeks or months -> hours.
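The analytical model above fits in a few lines of code; the constants below (AlexNet-like FLOPs, roughly 61M weights, Gbit-class Ethernet) are illustrative assumptions, not measurements from the figure:

```python
import math

def step_time(n, n_flops=3e9, t_flop=1e-12, n_weights=61e6,
              bytes_per_grad=4.0, bandwidth=1.25e8):
    """Per-iteration time: computation splits across n workers, while
    communication grows as 2 * |W| * g(n) with g(n) = log2(n)."""
    t_comp = n_flops * t_flop / n
    g = math.log2(n) if n > 1 else 1.0
    t_comm = 2 * n_weights * bytes_per_grad * g / bandwidth
    return t_comp, t_comm

for n in (1, 8, 64, 512):
    t_comp, t_comm = step_time(n)
    print(f"n={n:3d}  comp={t_comp:.4f}s  comm={t_comm:.2f}s")
```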
[Same figure, highlighting that communication time dominates total time as the number of workers grows]
Communication Bottleneck!
Communication flows in both directions through the parameter server:
– Each worker i sends its local gradient $\mathbf{h}_t^{(i)}$ to the parameter server.
– The parameter server aggregates $\mathbf{h}_t = \sum_i \mathbf{h}_t^{(i)}$ and sends $\mathbf{h}_t$ back to all workers.
– Every worker updates $\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t - \eta_t \mathbf{h}_t$ locally.
Batch Gradient Descent vs. SGD vs. TernGrad:
Batch gradient descent: $\mathbf{w}_{t+1} = \mathbf{w}_t - \eta_t \cdot \frac{1}{n}\sum_{i=1}^{n} \mathbf{g}_t^{(i)}$, minimizing $C(\mathbf{w}) \triangleq \frac{1}{n}\sum_{i=1}^{n} Q(z_i, \mathbf{w})$.
SGD: $\mathbf{w}_{t+1} = \mathbf{w}_t - \eta_t\, \mathbf{g}_t^{(j)}$, where $j$ is randomly drawn from $[1, n]$. No bias: $\mathbf{E}\{\mathbf{g}_t^{(j)}\} = \nabla C(\mathbf{w}_t)$.
TernGrad: $\mathbf{w}_{t+1} = \mathbf{w}_t - \eta_t\, \mathrm{ternarize}(\mathbf{g}_t^{(j)})$. No bias: $\mathbf{E}\{\mathrm{ternarize}(\mathbf{g}_t^{(j)})\} = \nabla C(\mathbf{w}_t)$.
A worked example of unbiased ternarizing:
$\mathbf{g}_t^{(i)}$: [0.30, -1.20, …, 0.90]
$s_t$: 1.20; signs: [1, -1, …, 1]
$P(b_{tk} = 1 \mid \mathbf{g}_t)$: [0.25, …, 0.75]
Sampled $\mathbf{b}_t$: [0, 1, …, 1]
Ternarized gradient: [0, -1, …, 1] × 1.20 (no bias)
$\tilde{\mathbf{g}}_t = \mathrm{ternarize}(\mathbf{g}_t) = s_t \cdot \mathrm{sign}(\mathbf{g}_t) \circ \mathbf{b}_t$, where $s_t \triangleq \|\mathbf{g}_t\|_\infty \triangleq \max(\mathrm{abs}(\mathbf{g}_t))$ and
$P(b_{tk} = 1 \mid \mathbf{g}_t) = |g_{tk}|/s_t$, $P(b_{tk} = 0 \mid \mathbf{g}_t) = 1 - |g_{tk}|/s_t$.
Unbiasedness: $\mathbf{E}_{z,b}\{\tilde{\mathbf{g}}_t\} = \mathbf{E}_{z,b}\{s_t \cdot \mathrm{sign}(\mathbf{g}_t) \circ \mathbf{b}_t\} = \mathbf{E}_z\{s_t \cdot \mathrm{sign}(\mathbf{g}_t) \circ \mathbf{E}_b\{\mathbf{b}_t \mid z_t\}\} = \mathbf{E}_z\{\mathbf{g}_t\} = \nabla_{\mathbf{w}} C(\mathbf{w}_t)$
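A minimal NumPy sketch of this ternarization rule (illustrative, not the released TensorFlow implementation):

```python
import numpy as np

def ternarize(g, rng=np.random.default_rng()):
    """Stochastically ternarize a gradient tensor to s_t * {-1, 0, +1}.

    s_t = max(abs(g)); each component keeps its sign with probability
    |g_k| / s_t (the Bernoulli mask b), otherwise it becomes 0.
    The estimator is unbiased: E[ternarize(g)] = g.
    """
    s = np.max(np.abs(g))                 # scaler s_t = ||g||_inf
    if s == 0:
        return np.zeros_like(g)           # an all-zero gradient stays zero
    b = (rng.random(g.shape) < np.abs(g) / s).astype(g.dtype)
    return s * np.sign(g) * b             # values in {-s_t, 0, +s_t}

# Unbiasedness check on the example above: the mean over many draws -> g
g = np.array([0.30, -1.20, 0.90])
print(np.mean([ternarize(g) for _ in range(20000)], axis=0))  # ~ [0.3, -1.2, 0.9]
```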
Assumption 1: $C(\mathbf{w})$ has a single minimum $\mathbf{w}^*$ and $\forall \epsilon > 0,\ \inf_{\|\mathbf{w}-\mathbf{w}^*\|_2 > \epsilon}\ (\mathbf{w}-\mathbf{w}^*)^T \nabla_{\mathbf{w}} C(\mathbf{w}) > 0$.
Assumption 2: The learning rate $\eta_t$ decreases neither very fast nor very slow: $\sum_{t=0}^{+\infty} \eta_t^2 < +\infty$ and $\sum_{t=0}^{+\infty} \eta_t = +\infty$.
Assumption 3 (gradient bound) for standard SGD: $\mathbf{E}\{\|\mathbf{g}\|_2^2\} \le A + B\,\|\mathbf{w}-\mathbf{w}^*\|_2^2$.
Under these assumptions, standard SGD converges almost surely (Fisk 1965, Métivier 1981 & 1983, Bottou 1998).
TernGrad converges almost surely under a stronger gradient bound: $\mathbf{E}\{\|\mathbf{g}\|_\infty \cdot \|\mathbf{g}\|_1\} \le A + B\,\|\mathbf{w}-\mathbf{w}^*\|_2^2$.
This bound is stronger because $\mathbf{E}\{\|\mathbf{g}\|_2^2\} \le \mathbf{E}\{\|\mathbf{g}\|_\infty \cdot \|\mathbf{g}\|_1\}$, so the TernGrad bound implies the standard SGD bound.
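For completeness, the inequality chaining the two bounds is a one-line Hölder-style argument:

$\|\mathbf{g}\|_2^2 = \sum_k g_k^2 \le \left(\max_k |g_k|\right)\sum_k |g_k| = \|\mathbf{g}\|_\infty\,\|\mathbf{g}\|_1$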
Two methods push the gradient bound of TernGrad closer to the bound of standard SGD.
Method 1: Layer-wise ternarizing. Collect and ternarize gradients layer by layer (Layer 1, Layer 2, …, Layer n), so each layer (or bucket) uses its own scaler $s_t$. When the bucket size is 1, TernGrad is exactly floating-point SGD.
Method 2: Gradient clipping.
[Figure: gradient distributions vs. iteration # for conv and fc layers: (a) original, (b) clipped]
$f(g_i) = \begin{cases} g_i, & |g_i| \le c\sigma \\ \mathrm{sign}(g_i) \cdot c\sigma, & |g_i| > c\sigma \end{cases}$
where $\sigma$ is the standard deviation of the gradients. Assuming a Gaussian distribution, $c = 2.5$ works well for all tested datasets, DNNs, and optimizers.
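A minimal sketch of the clipping rule, with sigma estimated from the gradient itself and the slide's c = 2.5 as the default:

```python
import numpy as np

def clip_gradient(g, c=2.5):
    """Clip each component at c standard deviations: |g_i| <= c*sigma passes
    through; larger magnitudes saturate to sign(g_i) * c * sigma."""
    sigma = np.std(g)
    return np.clip(g, -c * sigma, c * sigma)
```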
[Figure: gradient distributions vs. iteration # for conv and fc layers: (a) original, (b) clipped, (c) ternary, (d) final]
Averaged gradients are not ternary, but quantized with 2N+1 levels (N: number of workers). Communication reduction: $\frac{32}{\log_2(2N+1)} > 1$, unless $N \ge 2^{31}$. For example, with N = 8 workers the average has 17 levels, i.e., $\log_2 17 \approx 4.1$ bits per element, roughly a 7.8x reduction over 32-bit floats.
[Figure: accuracy (98% to 100%) vs. number of workers (2 to 64) for baseline and TernGrad under (a) momentum SGD and (b) vanilla SGD]
LeNet (total mini-batch size 64): TernGrad reaches accuracy close to the baseline, and the randomness in TernGrad results in only small variance.
(In all experiments, hyper-parameters are tuned for standard SGD and kept fixed for TernGrad.)
CIFAR-10, mini-batch size 64 per worker
Adam: D. P. Kingma, 2014
TernGrad works with a variety of optimizers: vanilla SGD, momentum SGD, and Adam. Tuning hyper-parameters specifically for TernGrad may reduce the accuracy gap.
TernGrad: randomness & regularization. The randomness of ternarizing acts as extra regularization, so one can (1) decrease the randomness in dropout or (2) use a smaller weight decay. No new hyper-parameters are added.
AlexNet trained on 4 workers with mini-batch size 512
[Figure: (a) top-1 accuracy vs. iteration, (b) training loss vs. iteration, and (c) gradient sparsity of TernGrad in fc6, for baseline and TernGrad over 150,000 iterations]
TernGrad converges within the same number of epochs under the same base learning rate.
GoogLeNet accuracy loss is <2% on average. Tuning hyper-parameters specifically for TernGrad may reduce the accuracy gap.
[Figure: training throughput (images/sec) vs. # of GPUs (1 to 512) on a GPU cluster with Ethernet and a PCI switch, comparing FP32 vs. TernGrad for AlexNet, GoogLeNet, and VggNet-A]
TernGrad gives higher speedup with:
– lower bandwidth (Ethernet vs. InfiniBand)
– more fully connected layers (VggNet vs. GoogLeNet)
Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, Hai Li, “Learning Structured Sparsity in Deep Neural Networks”, NIPS, 2016. [paper][code][poster]
Wei Wen, Yuxiong He, Samyam Rajbhandari, Minjia Zhang, Wenhan Wang, Fang Liu, Bin Hu, Yiran Chen, Hai Li, “Learning Intrinsic Sparse Structures within Long Short-Term Memory”, ICLR, 2018. [paper][code]
Non-structured pruning significantly reduces the storage size of DNNs [1] and achieves good speedup with customized hardware [2].
[1] S. Han et al., NIPS 2015. [2] S. Han et al., ISCA 2016.
[Figure: speedup (around 0.5x to 1.5x) and per-layer sparsity of non-structured sparse AlexNet (conv1 to conv5) via cuSPARSE on Quadro K600, Tesla K40c, and GTX Titan GPUs (Wei Wen et al., NIPS 2016)]
Non-structured sparsity, stored as Compressed Sparse Row (CSR), has poor data locality and yields only trivial speedup for sparse CNNs in GPUs.
Our approach (Wei Wen et al., NIPS 2016; Wei Wen et al., ICLR 2018):
– Train weights with group Lasso regularization
– One-shot pruning from scratch (with the same number of epochs)
– Prune whole dense structures rather than individual weights
[Figure: speedup vs. sparsity (50% to 90%) of sparse matrix multiplication W x X under non-structured vs. structured sparsity, tested on CPUs. Left: W 6000x3000, X 3000x10; right: W 2000x1000, X 1000x100]
At the same sparsity level, structured sparsity delivers far higher speedup than non-structured sparsity.
Weights/connections are removed group by group. A group can be a rectangular block, a row, a column, or even the whole matrix.
[Figure: sparse structures in a 4-D convolutional weight tensor $W^{(l)}$: filter-wise ($W^{(l)}_{n_l,:,:,:}$), channel-wise ($W^{(l)}_{:,c_l,:,:}$), shape-wise ($W^{(l)}_{:,c_l,m_l,k_l}$), and depth-wise (shortcut) sparsity]
In a nutshell, a group can be any form of weight block, depending on what sparse structure you want to learn. In CNNs, a group of weights can be a channel, a 3D filter, a 2D filter, a filter-shape fiber (i.e., a weight column), or even a whole layer (in ResNets).
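For instance, the group norms for three of these structures can be computed directly on the 4-D tensor; a NumPy sketch with an illustrative weight shape:

```python
import numpy as np

# Conv weights W^(l) with shape (filters N_l, channels C_l, height, width)
W = np.random.randn(64, 32, 3, 3)

# One group-Lasso norm per group, for three of the structures above:
filter_norms  = np.sqrt((W ** 2).sum(axis=(1, 2, 3)))  # 3D filters   W[n,:,:,:] -> (64,)
channel_norms = np.sqrt((W ** 2).sum(axis=(0, 2, 3)))  # channels     W[:,c,:,:] -> (32,)
shape_norms   = np.sqrt((W ** 2).sum(axis=0))          # shape fibers W[:,c,m,k] -> (32, 3, 3)
```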
$\arg\min_{\mathbf{w}} E(\mathbf{w}) = \arg\min_{\mathbf{w}} \left\{ E_D(\mathbf{w}) + \lambda_g \cdot R_g(\mathbf{w}) \right\}$
Step 1: Split the weights into G groups $\mathbf{w}^{(1)}, \ldots, \mathbf{w}^{(G)}$; e.g., $(w_1, w_2, w_3, w_4, w_5)$ -> group 1: $(w_1, w_2, w_3)$, group 2: $(w_4, w_5)$.
Step 2: Take the group Lasso ($\ell_2$ norm) of each group $\mathbf{w}^{(g)}$; e.g., $\sqrt{w_1^2 + w_2^2 + w_3^2}$ and $\sqrt{w_4^2 + w_5^2}$.
Step 3: Sum the group Lasso over all groups as the regularization $R_g(\mathbf{w}) = \sum_{g=1}^{G} \|\mathbf{w}^{(g)}\|_2$; e.g., $\sqrt{w_1^2 + w_2^2 + w_3^2} + \sqrt{w_4^2 + w_5^2}$.
Step 4: Optimize with SGD.
We refer to our method as Structured Sparsity Learning (SSL)
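A minimal NumPy sketch of the regularizer and its gradient, using the toy grouping from Step 1 (the eps guard and all names are illustrative):

```python
import numpy as np

def group_lasso(w, groups):
    """R_g(w) = sum over groups g of ||w^(g)||_2."""
    return sum(np.linalg.norm(w[g]) for g in groups)

def group_lasso_grad(w, groups, eps=1e-8):
    """d R_g / d w: each group contributes w^(g) / ||w^(g)||_2."""
    grad = np.zeros_like(w)
    for g in groups:
        grad[g] = w[g] / (np.linalg.norm(w[g]) + eps)
    return grad

# Toy example from Step 1: groups (w1, w2, w3) and (w4, w5)
w = np.array([0.5, -0.2, 0.1, 0.01, -0.02])
groups = [np.array([0, 1, 2]), np.array([3, 4])]
lam, eta = 0.01, 0.1
# One SGD step on the regularizer alone (the data term E_D is omitted here)
w = w - eta * lam * group_lasso_grad(w, groups)
```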
During training, each group of weights $\mathbf{w}_k^{(n)}$ receives the regular gradients that minimize error, plus additional gradients $\mathbf{w}_k^{(n)} / \|\mathbf{w}_k^{(n)}\|_2$ from the group Lasso term that learn sparsity.
Many groups are pushed to zero, but not all.
[Figure: learned sparse filters in conv1 and conv2; removing one 3D filter in conv1 also removes the corresponding channel in conv2]
[Figure: speedup (1x to 6x) of ℓ1-regularized non-structured pruning [6] vs. SSL on Quadro, Tesla, and Titan Black GPUs and on Xeon CPUs with 1/2/4/8 threads (T1 to T8), under no accuracy loss and under 2% loss]
[6] S. Han et al., NIPS 2015.
Learning network depth with depth-wise sparsity:
ResNet:      20 layers, 8.82% error | 32 layers, 7.51% error
SSL-ResNet:  14 layers, 8.54% error | 18 layers, 7.40% error
(ResNet-20/32: baselines with 20/32 layers; SSL-ResNet-#: ours with # layers after removing layers in ResNet-20)
He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Deep residual learning for image recognition." In CVPR. 2016.
! ! "#$ℎ ! ⊗ ⊗ ⊕
input updates
⊗
forget gates input gates
()*+ ,)*+ () ,)
.) /) 0) hidden states cell states
(hidden states) 1) inputs "#$ℎ
“Dimension Consistency”: hidden states, input updates, cells, gates, and outputs must all have the same dimension, so their components cannot be independently removed.
[Figure: weight matrices in an LSTM (Wxf, Wxi, Wxu, Wxo over inputs x and Whf, Whi, Whu, Who over hidden states h), plus the weights in the next layer(s); white strips mark ISS (Intrinsic Sparse Structures)]
Removing one ISS group reduces the hidden size by one.
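A sketch of how one ISS group spans the LSTM weight matrices (tiny illustrative sizes; the weights in the next layer that also read h_k are omitted here):

```python
import numpy as np

input_dim, hidden = 3, 4
# Combined LSTM weights: rows index [x_t; h_{t-1}], columns index the four
# gate blocks (f, i, u, o), each of width `hidden`
W = np.random.randn(input_dim + hidden, 4 * hidden)

def iss_mask(k):
    """Boolean mask of the ISS group for hidden unit k: the k-th column in
    every gate block, plus the whole recurrent row for h_k."""
    m = np.zeros(W.shape, dtype=bool)
    m[:, [g * hidden + k for g in range(4)]] = True  # unit k's gate columns
    m[input_dim + k, :] = True                       # recurrent row of h_k
    return m

# Group-Lasso norm of one ISS group; if it is driven to zero,
# hidden unit k can be removed and the hidden size shrinks by one
print(np.linalg.norm(W[iss_mask(2)]))
```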
[Figure: one ISS group in SSL highlighted across stacked LSTM 1 -> LSTM 2 -> Output, unrolled over time steps $(x_t, h_{t-1})$ and $(x_{t+1}, h_t)$, spanning Wxf, Wxi, Wxg, Wxo, Whf, Whi, Whg, Who; diagram adapted from http://colah.github.io/posts/2015-08-Understanding-LSTMs/]
Learned structurally sparse LSTMs
Zilly, Julian Georg, Rupesh Kumar Srivastava, Jan Koutník, and Jürgen Schmidhuber. “Recurrent highway networks.” arXiv:1607.03474 (2016).
Ours
What weights does a group have?
Question answering in a search engine: baseline vs. ours
GPU Kernels for Block-Sparse Weights, OpenAI 2017
Structured sparsity is more efficient for computation. Structured sparsity is more efficient for storage.
Deep Neural Networks in the cloud
– We propose TernGrad to quantize gradients to reduce communication volume.
Deep Neural Networks on the edge
– We propose Structurally Sparse Deep Neural Networks to simplify the “rocket” to lighten the load on the “engine”
– http://www.pittnuts.com/#Publications
Wei Wen, Duke University www.pittnuts.com