SLIDE 1

Deep Learning in Cloud-Edge AI Systems

Wei Wen

COMPUTATIONAL EVOLUTIONARY INTELLIGENCE LAB, Department of Electrical and Computer Engineering, Duke University, www.pittnuts.com

SLIDE 2

AI Systems Landing

Into the Cloud:

Google TPU Cloud, Facebook Big Basin

SLIDE 3

AI Systems Landing

Upon the Edge:

iPhone X Face ID, DJI drones, autonomous driving

What are the challenges in landing?

SLIDE 4

The Rocket Analogy of Deep Learning

An analogy by Andrew Ng

The 3 BIGs:

  • Rocket: big neural networks
  • Engine: big computing
  • Fuel: big data

Together, they launch AI.

SLIDE 5

Deep Learning in the Cloud

Google TPU Cloud

Parallelism across many machines -> tons of cables -> communication bottleneck!

SLIDE 6

Deep Learning on the Edge

Real-time inference: model size and inference speed matter!

SLIDE 7

Research Highlights

  • Distributed Training in the Cloud
    – TernGrad: Ternary Gradients to Reduce Communication in Distributed Deep Learning, NIPS 2017 (oral)
    – On-going work on large-batch training
  • Efficient Inference on the Edge
    – Structurally Sparse DNNs (NIPS 2016 & ICLR 2018)
    – Lower-rank DNNs (ICCV 2017)
    – A Compact DNN: Approaching GoogLeNet-Level Accuracy of Classification and Domain Adaptation (CVPR 2017)
    – Direct Sparse Convolution (ICLR 2017)

SLIDE 8

TernGrad: Ternary Gradients to Reduce Communication in Distributed Deep Learning

NIPS 2017 (Oral)

Wei Wen1, Cong Xu2, Feng Yan3, Chunpeng Wu1, Yandan Wang4, Yiran Chen1, Hai (Helen) Li1

Duke University1, Hewlett Packard Labs2, University of Nevada - Reno3, University of Pittsburgh4

https://github.com/wenwei202/terngrad

SLIDE 9

Background – Stochastic Gradient Descent

Minimization target: $C(\boldsymbol{w}) \triangleq \frac{1}{N} \sum_{i=1}^{N} L(\boldsymbol{z}_i, \boldsymbol{w})$

Batch gradient descent (computation expensive):
$\boldsymbol{w}_{t+1} = \boldsymbol{w}_t - \eta_t \cdot \frac{1}{N} \sum_{i=1}^{N} g_t^{(i)}$, where $g_t^{(i)} = \nabla L(\boldsymbol{z}_i, \boldsymbol{w}_t)$

Mini-batch stochastic gradient descent (SGD) (computation cheap):
$\boldsymbol{w}_{t+1} = \boldsymbol{w}_t - \eta_t \cdot \frac{1}{b} \sum_{j=1}^{b} g_t^{(j)}$, where the $b \ll N$ samples are randomly drawn from the training dataset

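To make the update rule concrete, here is a minimal NumPy sketch of mini-batch SGD; the least-squares toy problem, the function names, and the hyper-parameters are illustrative placeholders, not from the slides.

```python
import numpy as np

def minibatch_sgd(w, data, grad_fn, lr=0.1, batch_size=32, steps=100):
    """Plain mini-batch SGD: each step draws b << N samples and averages
    their per-sample gradients as an unbiased estimate of the full gradient."""
    n = len(data)
    for t in range(steps):
        idx = np.random.choice(n, size=batch_size, replace=False)
        g = np.mean([grad_fn(data[i], w) for i in idx], axis=0)
        w = w - lr * g
    return w

# Toy least-squares problem (placeholder data and loss).
rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 5)), rng.normal(size=1000)
data = list(zip(X, y))
grad_fn = lambda z, w: 2 * (z[0] @ w - z[1]) * z[0]   # d/dw of (x.w - y)^2
w = minibatch_sgd(np.zeros(5), data, grad_fn)
```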

SLIDE 10

Background - Distributed Deep Learning

[Diagram: parameter server(s) apply $\boldsymbol{w}_{t+1} \leftarrow \boldsymbol{w}_t - \eta_t \boldsymbol{g}_t$ using the gradients $\boldsymbol{g}_t^{(1)}, \boldsymbol{g}_t^{(2)}, \ldots, \boldsymbol{g}_t^{(N)}$ sent by Worker 1, Worker 2, ..., Worker N; each worker holds a copy of $\boldsymbol{w}_t$ and trains on its own shard (Data 1, Data 2, ..., Data N).]

Synchronized Data Parallelism for Stochastic Gradient Descent (SGD):

  • 1. Training data is split into N subsets
  • 2. Each worker holds a model replica (copy)
  • 3. Each replica is trained on one data subset
  • 4. Gradients are synchronized in the parameter server(s)

Scalability:

  • 1. Computing time decreases with N
  • 2. Communication can become the bottleneck
  • 3. This work: quantize gradients to three (i.e., ternary) levels {-1, 0, +1} (< 2 bits per element)

Effective update: $\boldsymbol{g}_t = \frac{1}{Nb} \sum_{j=1}^{Nb} g_t^{(j)}$, with $b$ samples per worker and a total batch size of $Nb$.

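A minimal sketch of one synchronized data-parallel step as described above; the worker loop runs sequentially here purely for illustration (real systems run workers concurrently and communicate over a network), and the function and parameter names are illustrative.

```python
import numpy as np

def sync_data_parallel_step(w, shards, grad_fn, lr=0.01, batch_size=32):
    """One synchronized data-parallel SGD step: each of the N workers computes a
    mini-batch gradient on its own data shard, the parameter server averages the
    N gradients and applies a single update to the shared model."""
    worker_grads = []
    for shard in shards:                            # conceptually runs in parallel
        idx = np.random.choice(len(shard), size=batch_size, replace=False)
        g = np.mean([grad_fn(shard[i], w) for i in idx], axis=0)
        worker_grads.append(g)                      # each worker sends its gradient
    g_avg = np.mean(worker_grads, axis=0)           # aggregation on the server
    return w - lr * g_avg                           # w_{t+1}, broadcast back to workers
```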

SLIDE 11

Communication Bottleneck

[Figure: per-iteration time (s, log scale) vs. number of workers (1 to 512) for AlexNet, broken into computation, communication, and total; communication dominates as the worker count grows. Credit: Alexander Ulanov]

A simple cost model per iteration:

  • Computation: $t_{comp} = n_{flops} \times t_{flop}$
  • Communication: $t_{comm} = 2 \cdot |W| \cdot g(n) \cdot b$, with $|W|$ the number of gradient/weight elements, $b$ bytes per element, and $g(n) = \log_2(n)$ for $n$ workers

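A back-of-the-envelope evaluation of this cost model; the AlexNet-like constants (parameter count, FLOPs per iteration, device throughput, bandwidth) are rough placeholders chosen only to illustrate why communication overtakes computation as workers are added.

```python
import math

def iteration_time(n_workers, n_params=60e6, flops_per_iter=4e11,
                   flops_per_sec=5e12, bytes_per_elem=4, bandwidth=1.25e9):
    """Toy cost model: computation shrinks with more workers (data parallelism),
    communication grows with g(n) = log2(n); every constant is a placeholder."""
    t_comp = flops_per_iter / (flops_per_sec * n_workers)
    g_n = max(math.log2(n_workers), 1.0)
    t_comm = 2 * n_params * g_n * bytes_per_elem / bandwidth
    return t_comp, t_comm

for n in [1, 4, 16, 64, 256]:
    t_comp, t_comm = iteration_time(n)
    print(f"{n:>3} workers: compute {t_comp:.4f}s, communicate {t_comm:.2f}s")
```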

SLIDE 12

Background - Distributed Deep Learning

Training on big data: weeks or months -> hours.

[Figure: the same per-iteration time vs. number of workers plot — computation, communication, and total.]

Communication Bottleneck!

SLIDE 13

An Alternative Setting

  • 1. Only exchange gradients
  • 2. Gradient quantization can reduce communication in both directions

[Diagram: the parameter server aggregates $\boldsymbol{g}_t = \sum_i \boldsymbol{g}_t^{(i)}$ from the workers' gradients $\boldsymbol{g}_t^{(1)}, \boldsymbol{g}_t^{(2)}, \ldots, \boldsymbol{g}_t^{(N)}$ and sends $\boldsymbol{g}_t$ back; each of Worker 1 ... Worker N applies $\boldsymbol{w}_{t+1} \leftarrow \boldsymbol{w}_t - \eta_t \boldsymbol{g}_t$ locally.]

SLIDE 14

Gradient Quantization for Communication Reduction

Only exchange quantized gradients.

[Diagram: as before, workers send $\boldsymbol{g}_t^{(1)}, \ldots, \boldsymbol{g}_t^{(N)}$ to the parameter server, which aggregates $\boldsymbol{g}_t = \sum_i \boldsymbol{g}_t^{(i)}$ and sends it back; each worker applies $\boldsymbol{w}_{t+1} \leftarrow \boldsymbol{w}_t - \eta_t \boldsymbol{g}_t$. Baseline: every exchanged gradient element (e.g., of a conv layer) is a 32-bit float.]

SLIDE 15

Gradient Quantization for Communication Reduction

Only exchange quantized gradients.

[Diagram: the same exchange, but every gradient element is encoded with < 2 bits instead of 32 bits.]

SLIDE 16

Stochastic Gradients without Bias

Batch gradient descent:
$C(\boldsymbol{w}) \triangleq \frac{1}{N} \sum_{i=1}^{N} L(\boldsymbol{z}_i, \boldsymbol{w})$, $\quad \boldsymbol{w}_{t+1} = \boldsymbol{w}_t - \eta_t \cdot \frac{1}{N} \sum_{i=1}^{N} g_t^{(i)}$

SGD (no bias):
$\boldsymbol{w}_{t+1} = \boldsymbol{w}_t - \eta_t \cdot g_t^{(j)}$, where $j$ is randomly drawn from $[1, N]$ and $\mathbb{E}\{g_t^{(j)}\} = \nabla C(\boldsymbol{w})$

TernGrad (no bias):
$\boldsymbol{w}_{t+1} = \boldsymbol{w}_t - \eta_t \cdot \mathrm{ternarize}(g_t^{(j)})$, with $\mathbb{E}\{\mathrm{ternarize}(g_t^{(j)})\} = \nabla C(\boldsymbol{w})$

SLIDE 17

TernGrad is Simple

Example:

$g_t$: [0.30, -1.20, …, 0.90]
$s_t = \|g_t\|_\infty$: 1.20
Signs: [1, -1, …, 1]
$P(b_{tk} = 1 \mid g_t) = |g_{tk}|/s_t$: [0.30/1.20, 1.20/1.20, …, 0.90/1.20]
Sampled $\boldsymbol{b}_t$: [0, 1, …, 1]
Ternarized gradient $\tilde{g}_t$: [0, -1, …, 1] × 1.20

Ternarization:
$\tilde{g}_t = \mathrm{ternarize}(g_t) = s_t \cdot \mathrm{sign}(g_t) \circ \boldsymbol{b}_t$, where $s_t \triangleq \|g_t\|_\infty \triangleq \max(\mathrm{abs}(g_t))$ and
$P(b_{tk} = 1 \mid g_t) = |g_{tk}|/s_t$, $\quad P(b_{tk} = 0 \mid g_t) = 1 - |g_{tk}|/s_t$.

No bias:
$\mathbb{E}_{z,b}\{\tilde{g}_t\} = \mathbb{E}_{z,b}\{s_t \cdot \mathrm{sign}(g_t) \circ \boldsymbol{b}_t\} = \mathbb{E}_z\{s_t \cdot \mathrm{sign}(g_t) \circ \mathbb{E}_b\{\boldsymbol{b}_t \mid \boldsymbol{z}_t\}\} = \mathbb{E}_z\{g_t\} = \nabla_{\boldsymbol{w}} C(\boldsymbol{w}_t)$

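A minimal NumPy sketch of the ternarize operator defined above (an illustration of the formula, not the authors' released TensorFlow implementation; see the GitHub link on the title slide for that).

```python
import numpy as np

def ternarize(g, rng=np.random.default_rng()):
    """TernGrad ternarization: g -> s * sign(g) * b with b_k ~ Bernoulli(|g_k| / s),
    so each element lands in {-s, 0, +s} and E[ternarize(g)] = g (no bias)."""
    s = np.max(np.abs(g))                       # scaler s_t = ||g_t||_inf
    if s == 0:
        return np.zeros_like(g)
    b = rng.random(g.shape) < np.abs(g) / s     # keep element k with probability |g_k|/s
    return s * np.sign(g) * b

g = np.array([0.30, -1.20, 0.90])
print(ternarize(g))   # e.g. [ 0.  -1.2  1.2]: only signs and one scaler are communicated
```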

SLIDE 18

TernGrad is Simple

SLIDE 19

Convergence

Standard SGD converges almost surely under standard assumptions (Fisk 1965, Metivier 1981 & 1983, Bottou 1998):

Assumption 1: $C(\boldsymbol{w})$ has a single minimum $\boldsymbol{w}^*$, and $\forall \epsilon > 0, \; \inf_{\|\boldsymbol{w}-\boldsymbol{w}^*\|^2 > \epsilon} (\boldsymbol{w}-\boldsymbol{w}^*)^T \nabla_{\boldsymbol{w}} C(\boldsymbol{w}) > 0$.

Assumption 2: the learning rate $\eta_t$ decreases neither too fast nor too slow: $\sum_{t=0}^{+\infty} \eta_t^2 < +\infty$ and $\sum_{t=0}^{+\infty} \eta_t = +\infty$.

Assumption 3 (gradient bound) for standard SGD: $\mathbb{E}\{\|g\|_2^2\} \le A + B\,\|\boldsymbol{w}-\boldsymbol{w}^*\|^2$.

Assumption 3 (gradient bound) for TernGrad is stronger: $\mathbb{E}\{\|g\|_\infty \cdot \|g\|_1\} \le A + B\,\|\boldsymbol{w}-\boldsymbol{w}^*\|^2$, which implies the standard bound since $\mathbb{E}\{\|g\|_2^2\} \le \mathbb{E}\{\|g\|_\infty \cdot \|g\|_1\}$.

Under these assumptions, TernGrad converges almost surely.

SLIDE 20

Closing Bound Gap

Two methods push the gradient bound of TernGrad closer to the bound of standard SGD.

Method 1: layer-wise ternarizing / bucketing.

[Diagram: as gradients are back-propagated through Layer n, Layer n-1, ..., Layer 2, Layer 1, they are collected and ternarized layer by layer.]

Approach:

  • 1. Split gradients into buckets
  • 2. Apply TernGrad bucket by bucket

When the bucket size is 1, TernGrad degenerates to standard floating-point SGD.

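A possible sketch of bucket-wise ternarization, reusing the ternarize helper from the previous sketch; the default bucket size of 512 is an arbitrary illustrative choice.

```python
import numpy as np

def ternarize_bucketwise(g, bucket_size=512, rng=np.random.default_rng()):
    """Split a flattened gradient into buckets and ternarize each bucket with its
    own scaler, which tightens the per-bucket gradient bound; bucket_size = 1
    recovers ordinary floating-point SGD."""
    flat = g.ravel().astype(float)
    out = np.empty_like(flat)
    for start in range(0, flat.size, bucket_size):
        chunk = flat[start:start + bucket_size]
        out[start:start + bucket_size] = ternarize(chunk, rng)
    return out.reshape(g.shape)
```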

SLIDE 21

Closing Bound Gap

Two methods to push the gradient bound of TernGrad closer to the bound of standard SGD

Method 2: gradient clipping.

[Figure: gradient histograms of conv and fc layers over iterations, (a) original vs. (b) clipped.]

$f(g_i) = \begin{cases} g_i & |g_i| \le c\sigma \\ \mathrm{sign}(g_i) \cdot c\sigma & |g_i| > c\sigma \end{cases}$

Assuming gradients follow a Gaussian distribution, $c = 2.5$ works well for all tested datasets, DNNs and optimizers; clipping:

  • 1. changes the gradient length by only 1.0%-1.5%
  • 2. changes the gradient direction by only 2°-3°
  • 3. introduces a small bias while reducing variance

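A minimal NumPy sketch of the clipping rule above, taking σ to be the standard deviation of the gradient tensor (a plausible reading of the slide rather than a detail it spells out).

```python
import numpy as np

def clip_gradient(g, c=2.5):
    """Clip each gradient element to [-c*sigma, +c*sigma], where sigma is the
    standard deviation of the gradient tensor; equivalent to the piecewise f above."""
    sigma = np.std(g)
    return np.clip(g, -c * sigma, c * sigma)
```
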
SLIDE 22

Gradient Histograms

[Figure: gradient histograms over iterations for conv and fc layers: (a) original, (b) clipped, (c) ternary, (d) final averaged.]

Averaged gradients are not ternary, but quantized with 2N+1 levels (N: number of workers).

Communication reduction: $\frac{32}{\log_2(2N+1)} > 1$, unless $N \ge 2^{31}$.

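A two-line check of the reduction ratio above for typical worker counts (purely illustrative arithmetic).

```python
import math

for n in [2, 4, 16, 64, 512]:
    ratio = 32 / math.log2(2 * n + 1)   # 32-bit floats vs. 2N+1 quantization levels
    print(f"N={n:>3}: ~{ratio:.1f}x less traffic")
```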

SLIDE 23

Integration with Manifold Optimizers

[Figure: accuracy (98%-100%) vs. number of workers N (2 to 64) for baseline vs. TernGrad, with (a) momentum SGD and (b) vanilla SGD.]

LeNet (total mini-batch size 64): accuracy stays close to the baseline, and the randomness in TernGrad results in only a small variance.

(All experiments: all hyper-parameters are tuned for standard SGD and kept fixed for TernGrad.)

SLIDE 24

Integration with Manifold Optimizers

CIFAR-10, mini-batch size 64 per worker

Adam: D. P. Kingma, 2014

TernGrad works with manifold optimizers: vanilla SGD, momentum SGD, and Adam. Tuning hyper-parameters specifically for TernGrad may further reduce the accuracy gap.

SLIDE 25

Scaling to Large-scale Deep Learning

To scale up, (1) decrease the randomness of dropout, or (2) use a smaller weight decay; no new hyper-parameters are added.

AlexNet

TernGrad: randomness & regularization

  • N. S. Keskar, et al., ICLR 2017

SLIDE 26

Convergence Curves

AlexNet trained on 4 workers with mini-batch size 512

[Figure: (a) top-1 accuracy vs. iteration, (b) training loss vs. iteration, and (c) gradient sparsity of TernGrad in fc6, for baseline vs. TernGrad over 150,000 iterations.]

TernGrad converges within the same number of epochs under the same base learning rate.

SLIDE 27

Scaling to Large-scale Deep Learning

GoogLeNet accuracy loss is <2% on average. Tuning hyper-parameters specifically for TernGrad may further reduce the accuracy gap.

SLIDE 28

Performance Model

[Figure: training throughput (images/sec) vs. number of GPUs (1 to 512) on a GPU cluster with Ethernet and a PCI switch, comparing FP32 and TernGrad for AlexNet, GoogLeNet, and VggNet-A.]

TernGrad gives higher speedup when:

  • 1. using more workers
  • 2. using smaller communication bandwidth (Ethernet vs. InfiniBand)
  • 3. training DNNs with more fully-connected layers (VggNet vs. GoogLeNet)

SLIDE 29

Structurally Sparse Deep Neural Networks

Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, Hai Li, "Learning Structured Sparsity in Deep Neural Networks", NIPS 2016. [paper][code][poster]

Wei Wen, Yuxiong He, Samyam Rajbhandari, Minjia Zhang, Wenhan Wang, Fang Liu, Bin Hu, Yiran Chen, Hai Li, "Learning Intrinsic Sparse Structures within Long Short-Term Memory", ICLR 2018. [paper][code]

SLIDE 30

Sparse Convolutional Neural Networks (SCNNs)

[1] S. Han et al., NIPS 2015. [2] S. Han et al., ISCA 2016.

Non-structured sparsity significantly reduces the storage size of DNNs [1] and gives good speedup on customized hardware [2].

[Figure: per-layer speedup vs. sparsity of sparse AlexNet (conv1-conv5) using cuSPARSE on Quadro K600, Tesla K40c, and GTX Titan; despite high sparsity, the speedup is trivial. W. Wen et al., NIPS 2016]

Non-structured sparsity -> poor data locality -> trivial speedup.

SLIDE 31

Inefficiency of SCNNs

[Figure: speedup vs. sparsity of sparse AlexNet (conv1-conv5) on GPUs (Quadro K600, Tesla K40c, GTX Titan); despite high sparsity, the speedup stays around or below 1.5x.]

  • Format: sparse matrices are stored in the Compressed Sparse Row (CSR) format
  • Acceleration library: cuSPARSE

Non-structured sparsity -> poor data locality -> trivial speedup.

  • W. Wen et al., NIPS 2016

SLIDE 32

Structured Sparsity in DNNs

Our approach (W. Wen et al., NIPS 2016; W. Wen et al., ICLR 2018; cf. S. Han et al., NIPS 2015):

  • Train weights with group Lasso regularization
  • One-shot pruning from scratch (with the same number of epochs) -> faster training
  • Prune dense structures -> higher practical computing speed

SLIDE 33

Efficiency of Structured Sparsity

[Figure: speedup vs. sparsity (50%-90%) of sparse matrix multiplication on CPUs, comparing non-structured and structured sparsity, for W: 2000 x 1000 with X: 1000 x 100 and W: 6000 x 3000 with X: 3000 x 10; structured sparsity achieves far higher speedups at the same sparsity level.]

  • W. Wen et al., ICLR 2018

SLIDE 34

Structurally Sparse Deep Neural Networks

  • What are Structurally Sparse Deep Neural Networks (SSDNNs)?

Weights/connections are removed group by group. A group can be a rectangular block, a row, a column, or even a whole matrix.

SLIDE 35

Structurally Sparse Deep Neural Networks

[Figure: sparse structures that SSL can learn in a convolutional layer: filter-wise $W^{(l)}_{n_l,:,:,:}$, channel-wise $W^{(l)}_{:,c_l,:,:}$, shape-wise $W^{(l)}_{:,c_l,m_l,k_l}$, and depth-wise (an entire layer $W^{(l)}$, e.g., one bypassed by a shortcut).]

In a nutshell, a group can be any form of weight block, depending on what sparse structure you want to learn. In CNNs, a group of weights can be a channel, a 3D filter, a 2D filter, a filter-shape fiber (i.e., a weight column), or even an entire layer (in ResNets).

SLIDE 36

Group Lasso regularization is all you need for SSDNNs

$\arg\min_{\boldsymbol{w}} \{ E(\boldsymbol{w}) \} = \arg\min_{\boldsymbol{w}} \{ E_D(\boldsymbol{w}) + \lambda_g \cdot R_g(\boldsymbol{w}) \}$

Step 1: Split the weights into G groups $\boldsymbol{w}^{(1)}, \ldots, \boldsymbol{w}^{(G)}$, e.g., $(w_1, w_2, w_3, w_4, w_5)$ -> group 1: $(w_1, w_2, w_3)$, group 2: $(w_4, w_5)$.

Step 2: Compute the group Lasso (i.e., the vector length, the $\ell_2$ norm) of each group $\boldsymbol{w}^{(g)}$: $\sqrt{w_1^2 + w_2^2 + w_3^2}$ and $\sqrt{w_4^2 + w_5^2}$.

Step 3: Sum the group Lasso over all groups as the regularization: $R_g(\boldsymbol{w}) = \sum_{g=1}^{G} \|\boldsymbol{w}^{(g)}\|_2 = \sqrt{w_1^2 + w_2^2 + w_3^2} + \sqrt{w_4^2 + w_5^2}$.

Step 4: Optimize with SGD.

We refer to our method as Structured Sparsity Learning (SSL).

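A minimal PyTorch sketch of the group-Lasso regularizer above, using filter-wise groups on a convolutional layer; this is an illustration of SSL's Steps 1-4, not the authors' released code, and the layer sizes and λ_g value are placeholders.

```python
import torch

def group_lasso(conv_weight, lambda_g=1e-4):
    """Group Lasso with filter-wise groups: conv_weight has shape
    (out_channels, in_channels, kH, kW); each output filter is one group and
    the penalty is lambda_g * sum_g ||w^(g)||_2."""
    per_filter_norm = conv_weight.flatten(start_dim=1).norm(p=2, dim=1)
    return lambda_g * per_filter_norm.sum()

# Usage: add the penalty to the data loss before back-propagation.
conv = torch.nn.Conv2d(16, 32, kernel_size=3)
x = torch.randn(8, 16, 28, 28)
data_loss = conv(x).pow(2).mean()                 # placeholder data loss E_D(w)
loss = data_loss + group_lasso(conv.weight)       # E_D(w) + lambda_g * R_g(w)
loss.backward()
```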

SLIDE 37

Structured Sparsity Learning (SSL)

  • Why can SSL learn to remove weights group by group?

The gradient-based explanation: for a group of weights $\boldsymbol{w}_k^{(n)}$, the update combines

  • the regular gradients that minimize the error, and
  • an additional gradient proportional to $\boldsymbol{w}_k^{(n)} / \|\boldsymbol{w}_k^{(n)}\|_2$ from the group Lasso term, which constantly pushes the whole group toward zero.

Many groups are pushed to zeros, but not all.

SLIDE 38

Easy to Implement
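The body of this slide (likely a code snippet) did not survive extraction; below is a hedged sketch of how the group-Lasso penalty could drop into an ordinary PyTorch training step, with all names and the λ_g value chosen for illustration.

```python
import torch

def ssl_train_step(model, batch, criterion, optimizer, lambda_g=1e-4):
    """One SSL training step: ordinary data loss plus a filter-wise group-Lasso
    penalty over every Conv2d layer, optimized end to end by SGD."""
    inputs, targets = batch
    loss = criterion(model(inputs), targets)
    for module in model.modules():
        if isinstance(module, torch.nn.Conv2d):
            # lambda_g * sum over filters of ||filter||_2
            loss = loss + lambda_g * module.weight.flatten(start_dim=1).norm(p=2, dim=1).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```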

SLIDE 39

Structurally Sparse LeNet – removing channels and filters

[Figure: learned structurally sparse conv1 and conv2 of LeNet; entire 3D filters and channels are removed.]

SLIDE 40

Structurally Sparse AlexNet – removing columns and rows

[6] S. Han, et al., NIPS 2015

[Figure: speedup (1x-6x) of SSL vs. l1-regularized (non-structured) sparsity [6] on Quadro, Tesla, and Titan Black GPUs and on Xeon CPUs with 8/4/2/1 threads, under two settings: 2% accuracy loss and no loss.]

SLIDE 41

Structurally Sparse ResNets – removing layers

# 
Model           # layers   Error
ResNet-20       20         8.82%
SSL-ResNet-14   14         8.54%
ResNet-32       32         7.51%
SSL-ResNet-18   18         7.40%

ResNet-20/32: baselines with 20/32 layers. SSL-ResNet-#: ours, with # layers remaining after SSL removes whole layers.

He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Deep residual learning for image recognition." In CVPR, 2016.

SLIDE 42

Hidden Structures in LSTMs

[Figure: an LSTM cell with inputs $x_t$, hidden states $h_{t-1}, h_t$, cell states $c_{t-1}, c_t$, forget gates, input gates, output gates, input updates (tanh), and outputs (the hidden states).]

"Dimension Consistency":

  • 1. Hidden structures (in blue): hidden states, input updates, cells, gates, outputs
  • 2. All hidden structures must have the same dimension
  • 3. Hidden structures cannot be independently removed

SLIDE 43

Learning Sparse Hidden Structures in LSTMs

[Figure: the weight matrices of an LSTM (Wxf, Wxi, Wxu, Wxo, Whf, Whi, Whu, Who), the inputs x and h, and the weights in the next layer(s) that consume h; white strips mark one Intrinsic Sparse Structure (ISS).]

White strips (ISS: Intrinsic Sparse Structures) = one group of weights in SSL.

Removing one group = reducing the hidden size by one.

SLIDE 44

Structurally Sparse LSTMs

[Figure: two stacked LSTMs unrolled over time steps t and t+1, each with weight matrices Wxf, Wxi, Wxg, Wxo, Whf, Whi, Whg, Who; one SSL group spans the corresponding slices of all of them. Diagram adapted from http://colah.github.io/posts/2015-08-Understanding-LSTMs/]

Removing one group == reducing the hidden size by one.

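A hedged PyTorch sketch of what one ISS group covers inside a single torch.nn.LSTM layer (weight_ih_l0 has shape (4·hidden, input) and weight_hh_l0 has shape (4·hidden, hidden), with gate blocks ordered i, f, g, o); for brevity it omits the matching weights of the next layer(s) that read h, which the slide includes in the group.

```python
import torch

def iss_group_norm(lstm, k):
    """l2 norm of the ISS group for hidden unit k of a one-layer torch.nn.LSTM:
    row k inside each of the four gate blocks of weight_ih and weight_hh, plus
    column k of weight_hh (every gate that reads h_k)."""
    H = lstm.hidden_size
    rows = [k + i * H for i in range(4)]          # k-th row of the i/f/g/o blocks
    parts = [lstm.weight_ih_l0[rows, :],          # input -> gates
             lstm.weight_hh_l0[rows, :],          # hidden -> gates
             lstm.weight_hh_l0[:, k]]             # recurrent weights reading h_k
    return torch.cat([p.reshape(-1) for p in parts]).norm(p=2)

lstm = torch.nn.LSTM(input_size=100, hidden_size=256)
print(iss_group_norm(lstm, k=3))   # a small norm -> unit 3 is a candidate for removal
```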

SLIDE 45

Structurally Sparse LSTMs

[Figure: learned structurally sparse LSTM 1, LSTM 2, and the output layer.]

SLIDE 46

Structurally Sparse Recurrent Highway Networks

Zilly, Julian Georg, Rupesh Kumar Srivastava, Jan Koutník, and Jürgen Schmidhuber. "Recurrent highway networks." arXiv:1607.03474 (2016).

Ours: what weights does a group have in a Recurrent Highway Network?

SLIDE 47

Question Answering with SSDNNs

[Figure: question answering in a search engine: baseline vs. ours.]

SLIDE 48

Efficiency of Structurally Sparse DNNs (after us)

  • H. Mao, S. Han, et al., CVPR Workshop 2017
  • S. Gray, A. Radford and D. P. Kingma, GPU Kernels for Block-Sparse Weights, OpenAI 2017

Structured sparsity is more efficient for both computation and storage.

SLIDE 49

Conclusion

  • AI systems are landing into the cloud and upon the edge
  • The communication bottleneck limits the scalability of distributed deep learning in the cloud
    – We propose TernGrad to quantize gradients and reduce communication volume
  • Lightweight computing engines restrict the deployment of cumbersome Deep Neural Networks on the edge
    – We propose Structurally Sparse Deep Neural Networks to simplify the "rocket" and lighten the load on the "engine"
  • More completed and on-going work on large-batch training and model acceleration
    – http://www.pittnuts.com/#Publications

SLIDE 50

THANKS! Q&A?

Wei Wen, Duke University www.pittnuts.com