Deep Learning in Cloud-Edge AI Systems
Wei Wen
COMPUTATIONAL EVOLUTIONARY INTELLIGENCE LAB, Electrical and Computer Engineering Department, Duke University, www.pittnuts.com
AI Systems Landing Into the Cloud: Facebook Big Basin, Google TPU
AI Systems Landing on the Edge: iPhone X Face ID, DJI Drone, Autonomous Driving
Challenges in landing???
An analogy by Andrew Ng
The AI rocket:
– Rocket: Big Neural Networks
– Engine: Big Computing
– Fuel: Big Data
Cloud (e.g., Google TPU): Parallelism
Edge: Real-time
In the cloud:
– TernGrad: Ternary Gradients to Reduce Communication in Distributed Deep Learning, NIPS 2017 (oral)
– Ongoing work on large-batch training
On the edge:
– Structurally Sparse DNNs (NIPS 2016 & ICLR 2018)
– Lower-rank DNNs (ICCV 2017)
– A Compact DNN: Approaching GoogLeNet-Level Accuracy of Classification and Domain Adaptation (CVPR 2017)
– Direct Sparse Convolution (ICLR 2017)
NIPS 2017 (Oral) Wei Wen1, Cong Xu2, Feng Yan3, Chunpeng Wu1, Yandan Wang4, Yiran Chen1, Hai (Helen) Li1
Duke University1, Hewlett Packard Labs2, University of Nevada - Reno3, University of Pittsburgh4
https://github.com/wenwei202/terngrad
Minimization target: $C(\mathbf{w}) \triangleq \frac{1}{n}\sum_{i=1}^{n} Q(z_i, \mathbf{w})$
Batch gradient descent (computation expensive): $\mathbf{w}_{t+1} = \mathbf{w}_t - \eta_t \cdot \frac{1}{n}\sum_{i=1}^{n} \mathbf{g}_t^{(i)}$
Mini-batch stochastic gradient descent (SGD, computation cheap): $\mathbf{w}_{t+1} = \mathbf{w}_t - \eta_t \cdot \frac{1}{b}\sum_{i=1}^{b} \mathbf{g}_t^{(i)}$, where the $b \ll n$ samples are randomly drawn from the training dataset and $\mathbf{g}_t^{(i)} = \nabla Q(z_i, \mathbf{w}_t)$.
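To make the update rule concrete, here is a minimal NumPy sketch of mini-batch SGD on a toy least-squares problem; the dataset, learning rate, and batch size are illustrative assumptions, not values from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy dataset: n = 1000 samples from a noisy linear model (illustrative)
X = rng.normal(size=(1000, 5))
w_true = np.array([1.0, -2.0, 0.5, 3.0, -1.0])
y = X @ w_true + 0.1 * rng.normal(size=1000)

def grad(w, idx):
    """Average gradient of the squared loss Q(z_i, w) over the samples in idx."""
    Xb, yb = X[idx], y[idx]
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(idx)

w = np.zeros(5)
eta = 0.01
for t in range(2000):
    batch = rng.integers(0, len(X), size=32)  # b = 32 << n samples drawn at random
    w -= eta * grad(w, batch)                 # w_{t+1} = w_t - eta_t * g_t
print(np.round(w, 2))                         # close to w_true
```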
Synchronized Data Parallelism for Stochastic Gradient Descent (SGD):
– Parameter server(s): $\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t - \eta_t \mathbf{h}_t$
– Worker 1, Worker 2, …, Worker N: each computes a local gradient $\mathbf{h}_t^{(1)}, \mathbf{h}_t^{(2)}, \ldots, \mathbf{h}_t^{(N)}$ on its own shard (Data 1, Data 2, …, Data N) and pulls the current $\mathbf{w}_t$.
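A hedged sketch of one such synchronized step (simulated serially; grad_fn and the shard structure are illustrative, and real workers run in parallel on separate machines):

```python
import numpy as np

def parallel_sgd_step(w, shards, eta, grad_fn):
    """One synchronized data-parallel SGD step, simulated serially.

    Each of the N workers computes a local gradient h_i on its own data
    shard; the parameter server averages them into h_t, applies
    w_{t+1} = w_t - eta_t * h_t, and broadcasts the new weights.
    """
    grads = [grad_fn(w, shard) for shard in shards]  # worker i -> h_t^(i)
    h = np.mean(grads, axis=0)                       # server-side aggregation
    return w - eta * h                               # updated weights to broadcast
```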
Scalability:
Quantize gradients to three (i.e., ternary) levels $\{-1, 0, 1\}$, under 2 bits per element. Each worker computes $\mathbf{h}_t^{(i)}$ over $b$ samples, so $\mathbf{h}_t = \frac{1}{Nb}\sum_{i=1}^{N} \mathbf{h}_t^{(i)}$ and the effective batch size is $Nb$.
[Figure: computation, communication, and total time (s, log scale) vs. # of workers (1 to 512) for distributed AlexNet training; credit: Alexander Ulanov]
Analytical model: $t_{comp} = n_{flops} \times t_{flop}$ and $t_{comm} = 2 \times |W| \times g(n) \times b$, with aggregation factor $g(n) = \log_2(n)$.
Scaling out over more workers and more data: weeks or months -> hours.
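The analytical model above fits in a few lines of code; the constants below (AlexNet-like FLOPs, roughly 61M weights, Gbit-class Ethernet) are illustrative assumptions, not measurements from the figure:

```python
import math

def step_time(n, n_flops=3e9, t_flop=1e-12, n_weights=61e6,
              bytes_per_grad=4.0, bandwidth=1.25e8):
    """Per-iteration time: computation splits across n workers, while
    communication grows as 2 * |W| * g(n) with g(n) = log2(n)."""
    t_comp = n_flops * t_flop / n
    g = math.log2(n) if n > 1 else 1.0
    t_comm = 2 * n_weights * bytes_per_grad * g / bandwidth
    return t_comp, t_comm

for n in (1, 8, 64, 512):
    t_comp, t_comm = step_time(n)
    print(f"n={n:3d}  comp={t_comp:.4f}s  comm={t_comm:.2f}s")
```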
[Same figure, highlighting that communication time dominates total time as the number of workers grows]
Communication Bottleneck!
Communication flows in both directions through the parameter server:
– Each worker i sends its local gradient $\mathbf{h}_t^{(i)}$ to the parameter server.
– The parameter server aggregates $\mathbf{h}_t = \sum_i \mathbf{h}_t^{(i)}$ and sends $\mathbf{h}_t$ back to all workers.
– Every worker updates $\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t - \eta_t \mathbf{h}_t$ locally.
Batch Gradient Descent vs. SGD vs. TernGrad:
Batch gradient descent: $\mathbf{w}_{t+1} = \mathbf{w}_t - \eta_t \cdot \frac{1}{n}\sum_{i=1}^{n} \mathbf{g}_t^{(i)}$, minimizing $C(\mathbf{w}) \triangleq \frac{1}{n}\sum_{i=1}^{n} Q(z_i, \mathbf{w})$.
SGD: $\mathbf{w}_{t+1} = \mathbf{w}_t - \eta_t\, \mathbf{g}_t^{(j)}$, where $j$ is randomly drawn from $[1, n]$. No bias: $\mathbf{E}\{\mathbf{g}_t^{(j)}\} = \nabla C(\mathbf{w}_t)$.
TernGrad: $\mathbf{w}_{t+1} = \mathbf{w}_t - \eta_t\, \mathrm{ternarize}(\mathbf{g}_t^{(j)})$. No bias: $\mathbf{E}\{\mathrm{ternarize}(\mathbf{g}_t^{(j)})\} = \nabla C(\mathbf{w}_t)$.
A worked example of unbiased ternarizing:
$\mathbf{g}_t^{(i)}$: [0.30, -1.20, …, 0.90]
$s_t$: 1.20; signs: [1, -1, …, 1]
$P(b_{tk} = 1 \mid \mathbf{g}_t)$: [0.25, …, 0.75]
Sampled $\mathbf{b}_t$: [0, 1, …, 1]
Ternarized gradient: [0, -1, …, 1] × 1.20 (no bias)
$\tilde{\mathbf{g}}_t = \mathrm{ternarize}(\mathbf{g}_t) = s_t \cdot \mathrm{sign}(\mathbf{g}_t) \circ \mathbf{b}_t$, where $s_t \triangleq \|\mathbf{g}_t\|_\infty \triangleq \max(\mathrm{abs}(\mathbf{g}_t))$ and
$P(b_{tk} = 1 \mid \mathbf{g}_t) = |g_{tk}|/s_t$, $P(b_{tk} = 0 \mid \mathbf{g}_t) = 1 - |g_{tk}|/s_t$.
Unbiasedness: $\mathbf{E}_{z,b}\{\tilde{\mathbf{g}}_t\} = \mathbf{E}_{z,b}\{s_t \cdot \mathrm{sign}(\mathbf{g}_t) \circ \mathbf{b}_t\} = \mathbf{E}_z\{s_t \cdot \mathrm{sign}(\mathbf{g}_t) \circ \mathbf{E}_b\{\mathbf{b}_t \mid z_t\}\} = \mathbf{E}_z\{\mathbf{g}_t\} = \nabla_{\mathbf{w}} C(\mathbf{w}_t)$
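A minimal NumPy sketch of this ternarization rule (illustrative, not the released TensorFlow implementation):

```python
import numpy as np

def ternarize(g, rng=np.random.default_rng()):
    """Stochastically ternarize a gradient tensor to s_t * {-1, 0, +1}.

    s_t = max(abs(g)); each component keeps its sign with probability
    |g_k| / s_t (the Bernoulli mask b), otherwise it becomes 0.
    The estimator is unbiased: E[ternarize(g)] = g.
    """
    s = np.max(np.abs(g))                 # scaler s_t = ||g||_inf
    if s == 0:
        return np.zeros_like(g)           # an all-zero gradient stays zero
    b = (rng.random(g.shape) < np.abs(g) / s).astype(g.dtype)
    return s * np.sign(g) * b             # values in {-s_t, 0, +s_t}

# Unbiasedness check on the example above: the mean over many draws -> g
g = np.array([0.30, -1.20, 0.90])
print(np.mean([ternarize(g) for _ in range(20000)], axis=0))  # ~ [0.3, -1.2, 0.9]
```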
Assumption 1: $C(\mathbf{w})$ has a single minimum $\mathbf{w}^*$ and $\forall \epsilon > 0,\ \inf_{\|\mathbf{w}-\mathbf{w}^*\|_2 > \epsilon}\ (\mathbf{w}-\mathbf{w}^*)^T \nabla_{\mathbf{w}} C(\mathbf{w}) > 0$.
Assumption 2: The learning rate $\eta_t$ decreases neither very fast nor very slow: $\sum_{t=0}^{+\infty} \eta_t^2 < +\infty$ and $\sum_{t=0}^{+\infty} \eta_t = +\infty$.
Assumption 3 (gradient bound) for standard SGD: $\mathbf{E}\{\|\mathbf{g}\|_2^2\} \le A + B\,\|\mathbf{w}-\mathbf{w}^*\|_2^2$.
Under these assumptions, standard SGD converges almost surely (Fisk 1965, Métivier 1981 & 1983, Bottou 1998).
TernGrad converges almost surely under a stronger gradient bound: $\mathbf{E}\{\|\mathbf{g}\|_\infty \cdot \|\mathbf{g}\|_1\} \le A + B\,\|\mathbf{w}-\mathbf{w}^*\|_2^2$.
This bound is stronger because $\mathbf{E}\{\|\mathbf{g}\|_2^2\} \le \mathbf{E}\{\|\mathbf{g}\|_\infty \cdot \|\mathbf{g}\|_1\}$, so the TernGrad bound implies the standard SGD bound.
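For completeness, the inequality chaining the two bounds is a one-line Hölder-style argument:

$\|\mathbf{g}\|_2^2 = \sum_k g_k^2 \le \left(\max_k |g_k|\right)\sum_k |g_k| = \|\mathbf{g}\|_\infty\,\|\mathbf{g}\|_1$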
Two methods push the gradient bound of TernGrad closer to the bound of standard SGD.
Method 1: Layer-wise ternarizing. Collect and ternarize gradients layer by layer (Layer 1, Layer 2, …, Layer n), so each layer (or bucket) uses its own scaler $s_t$. When the bucket size is 1, TernGrad is exactly floating-point SGD.
Method 2: Gradient clipping.
[Figure: gradient distributions vs. iteration # for conv and fc layers: (a) original, (b) clipped]
$f(g_i) = \begin{cases} g_i, & |g_i| \le c\sigma \\ \mathrm{sign}(g_i) \cdot c\sigma, & |g_i| > c\sigma \end{cases}$
where $\sigma$ is the standard deviation of the gradients. Assuming a Gaussian distribution, $c = 2.5$ works well for all tested datasets, DNNs, and optimizers.
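A minimal sketch of the clipping rule, with sigma estimated from the gradient itself and the slide's c = 2.5 as the default:

```python
import numpy as np

def clip_gradient(g, c=2.5):
    """Clip each component at c standard deviations: |g_i| <= c*sigma passes
    through; larger magnitudes saturate to sign(g_i) * c * sigma."""
    sigma = np.std(g)
    return np.clip(g, -c * sigma, c * sigma)
```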
[Figure: gradient distributions vs. iteration # for conv and fc layers: (a) original, (b) clipped, (c) ternary, (d) final]
Averaged gradients are not ternary, but quantized with 2N+1 levels (N: number of workers). Communication reduction: $\frac{32}{\log_2(2N+1)} > 1$, unless $N \ge 2^{31}$. For example, with N = 8 workers the average has 17 levels, i.e., $\log_2 17 \approx 4.1$ bits per element, roughly a 7.8x reduction over 32-bit floats.
[Figure: accuracy (98% to 100%) vs. number of workers (2 to 64) for baseline and TernGrad under (a) momentum SGD and (b) vanilla SGD]
LeNet (total mini-batch size 64): TernGrad reaches accuracy close to the baseline, and the randomness in TernGrad results in only small variance.
(In all experiments, hyper-parameters are tuned for standard SGD and kept fixed for TernGrad.)
CIFAR-10, mini-batch size 64 per worker
Adam: D. P. Kingma, 2014
TernGrad works with a variety of optimizers: vanilla SGD, momentum SGD, and Adam. Tuning hyper-parameters specifically for TernGrad may reduce the accuracy gap.
TernGrad: randomness & regularization. The randomness of ternarizing acts as extra regularization, so one can (1) decrease the randomness in dropout or (2) use a smaller weight decay. No new hyper-parameters are added.
AlexNet trained on 4 workers with mini-batch size 512
[Figure: (a) top-1 accuracy vs. iteration, (b) training loss vs. iteration, and (c) gradient sparsity of TernGrad in fc6, for baseline and TernGrad over 150,000 iterations]
TernGrad converges within the same number of epochs under the same base learning rate.
GoogLeNet accuracy loss is <2% on average. Tuning hyper-parameters specifically for TernGrad may reduce the accuracy gap.
[Figure: training throughput (images/sec) vs. # of GPUs (1 to 512) on a GPU cluster with Ethernet and a PCI switch, comparing FP32 vs. TernGrad for AlexNet, GoogLeNet, and VggNet-A]
TernGrad gives higher speedup with:
– lower bandwidth (Ethernet vs. InfiniBand)
– more fully connected layers (VggNet vs. GoogLeNet)
Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, Hai Li, “Learning Structured Sparsity in Deep Neural Networks”, NIPS, 2016. [paper][code][poster]
Wei Wen, Yuxiong He, Samyam Rajbhandari, Minjia Zhang, Wenhan Wang, Fang Liu, Bin Hu, Yiran Chen, Hai Li, “Learning Intrinsic Sparse Structures within Long Short-Term Memory”, ICLR, 2018. [paper][code]
Non-structured pruning significantly reduces the storage size of DNNs [1] and achieves good speedup with customized hardware [2].
[1] S. Han et al., NIPS 2015. [2] S. Han et al., ISCA 2016.
[Figure: speedup (around 0.5x to 1.5x) and per-layer sparsity of non-structured sparse AlexNet (conv1 to conv5) via cuSPARSE on Quadro K600, Tesla K40c, and GTX Titan GPUs (Wei Wen et al., NIPS 2016)]
Non-structured sparsity, stored as Compressed Sparse Row (CSR), has poor data locality and yields only trivial speedup for sparse CNNs in GPUs.
Our approach (Wei Wen et al., NIPS 2016; Wei Wen et al., ICLR 2018):
– Train weights with group Lasso regularization
– One-shot pruning from scratch (with the same number of epochs)
– Prune whole dense structures rather than individual weights
[Figure: speedup vs. sparsity (50% to 90%) of sparse matrix multiplication W x X under non-structured vs. structured sparsity, tested on CPUs. Left: W 6000x3000, X 3000x10; right: W 2000x1000, X 1000x100]
At the same sparsity level, structured sparsity delivers far higher speedup than non-structured sparsity.
Weights/connections are removed group by group. A group can be a rectangular block, a row, a column, or even the whole matrix.
[Figure: sparse structures in a 4-D convolutional weight tensor $W^{(l)}$: filter-wise ($W^{(l)}_{n_l,:,:,:}$), channel-wise ($W^{(l)}_{:,c_l,:,:}$), shape-wise ($W^{(l)}_{:,c_l,m_l,k_l}$), and depth-wise (shortcut) sparsity]
In a nutshell, a group can be any form of weight block, depending on what sparse structure you want to learn. In CNNs, a group of weights can be a channel, a 3D filter, a 2D filter, a filter-shape fiber (i.e., a weight column), or even a whole layer (in ResNets).
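For instance, the group norms for three of these structures can be computed directly on the 4-D tensor; a NumPy sketch with an illustrative weight shape:

```python
import numpy as np

# Conv weights W^(l) with shape (filters N_l, channels C_l, height, width)
W = np.random.randn(64, 32, 3, 3)

# One group-Lasso norm per group, for three of the structures above:
filter_norms  = np.sqrt((W ** 2).sum(axis=(1, 2, 3)))  # 3D filters   W[n,:,:,:] -> (64,)
channel_norms = np.sqrt((W ** 2).sum(axis=(0, 2, 3)))  # channels     W[:,c,:,:] -> (32,)
shape_norms   = np.sqrt((W ** 2).sum(axis=0))          # shape fibers W[:,c,m,k] -> (32, 3, 3)
```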
$\arg\min_{\mathbf{w}} E(\mathbf{w}) = \arg\min_{\mathbf{w}} \left\{ E_D(\mathbf{w}) + \lambda_g \cdot R_g(\mathbf{w}) \right\}$
Step 1: Split the weights into G groups $\mathbf{w}^{(1)}, \ldots, \mathbf{w}^{(G)}$; e.g., $(w_1, w_2, w_3, w_4, w_5)$ -> group 1: $(w_1, w_2, w_3)$, group 2: $(w_4, w_5)$.
Step 2: Take the group Lasso ($\ell_2$ norm) of each group $\mathbf{w}^{(g)}$; e.g., $\sqrt{w_1^2 + w_2^2 + w_3^2}$ and $\sqrt{w_4^2 + w_5^2}$.
Step 3: Sum the group Lasso over all groups as the regularization $R_g(\mathbf{w}) = \sum_{g=1}^{G} \|\mathbf{w}^{(g)}\|_2$; e.g., $\sqrt{w_1^2 + w_2^2 + w_3^2} + \sqrt{w_4^2 + w_5^2}$.
Step 4: Optimize with SGD.
We refer to our method as Structured Sparsity Learning (SSL)
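A minimal NumPy sketch of the regularizer and its gradient, using the toy grouping from Step 1 (the eps guard and all names are illustrative):

```python
import numpy as np

def group_lasso(w, groups):
    """R_g(w) = sum over groups g of ||w^(g)||_2."""
    return sum(np.linalg.norm(w[g]) for g in groups)

def group_lasso_grad(w, groups, eps=1e-8):
    """d R_g / d w: each group contributes w^(g) / ||w^(g)||_2."""
    grad = np.zeros_like(w)
    for g in groups:
        grad[g] = w[g] / (np.linalg.norm(w[g]) + eps)
    return grad

# Toy example from Step 1: groups (w1, w2, w3) and (w4, w5)
w = np.array([0.5, -0.2, 0.1, 0.01, -0.02])
groups = [np.array([0, 1, 2]), np.array([3, 4])]
lam, eta = 0.01, 0.1
# One SGD step on the regularizer alone (the data term E_D is omitted here)
w = w - eta * lam * group_lasso_grad(w, groups)
```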
During training, each group of weights $\mathbf{w}_k^{(n)}$ receives the regular gradients that minimize error, plus additional gradients $\mathbf{w}_k^{(n)} / \|\mathbf{w}_k^{(n)}\|_2$ from the group Lasso term that learn sparsity.
Many groups are pushed to zero, but not all.
[Figure: learned sparse filters in conv1 and conv2; removing one 3D filter in conv1 also removes the corresponding channel in conv2]
[Figure: speedup (1x to 6x) of ℓ1-regularized non-structured pruning [6] vs. SSL on Quadro, Tesla, and Titan Black GPUs and on Xeon CPUs with 1/2/4/8 threads (T1 to T8), under no accuracy loss and under 2% loss]
[6] S. Han et al., NIPS 2015.
Learning network depth with depth-wise sparsity:
ResNet:      20 layers, 8.82% error | 32 layers, 7.51% error
SSL-ResNet:  14 layers, 8.54% error | 18 layers, 7.40% error
(ResNet-20/32: baselines with 20/32 layers; SSL-ResNet-#: ours with # layers after removing layers in ResNet-20)
He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Deep residual learning for image recognition." In CVPR. 2016.
! ! "#$ℎ ! ⊗ ⊗ ⊕
input updates
⊗
forget gates input gates
()*+ ,)*+ () ,)
.) /) 0) hidden states cell states
(hidden states) 1) inputs "#$ℎ
“Dimension Consistency”: hidden states, input updates, cells, gates, and outputs must all have the same dimension, so their components cannot be independently removed.
[Figure: weight matrices in an LSTM (Wxf, Wxi, Wxu, Wxo over inputs x and Whf, Whi, Whu, Who over hidden states h), plus the weights in the next layer(s); white strips mark ISS (Intrinsic Sparse Structures)]
Removing one ISS group reduces the hidden size by one.
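A sketch of how one ISS group spans the LSTM weight matrices (tiny illustrative sizes; the weights in the next layer that also read h_k are omitted here):

```python
import numpy as np

input_dim, hidden = 3, 4
# Combined LSTM weights: rows index [x_t; h_{t-1}], columns index the four
# gate blocks (f, i, u, o), each of width `hidden`
W = np.random.randn(input_dim + hidden, 4 * hidden)

def iss_mask(k):
    """Boolean mask of the ISS group for hidden unit k: the k-th column in
    every gate block, plus the whole recurrent row for h_k."""
    m = np.zeros(W.shape, dtype=bool)
    m[:, [g * hidden + k for g in range(4)]] = True  # unit k's gate columns
    m[input_dim + k, :] = True                       # recurrent row of h_k
    return m

# Group-Lasso norm of one ISS group; if it is driven to zero,
# hidden unit k can be removed and the hidden size shrinks by one
print(np.linalg.norm(W[iss_mask(2)]))
```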
[Figure: one ISS group in SSL highlighted across stacked LSTM 1 -> LSTM 2 -> Output, unrolled over time steps $(x_t, h_{t-1})$ and $(x_{t+1}, h_t)$, spanning Wxf, Wxi, Wxg, Wxo, Whf, Whi, Whg, Who; diagram adapted from http://colah.github.io/posts/2015-08-Understanding-LSTMs/]
Learned structurally sparse LSTMs
Zilly, Julian Georg, Rupesh Kumar Srivastava, Jan Koutník, and Jürgen Schmidhuber. “Recurrent highway networks.” arXiv:1607.03474 (2016).
Ours
What weights does a group have?
Question answering in a search engine: baseline vs. ours
GPU Kernels for Block-Sparse Weights, OpenAI 2017
Structured sparsity is more efficient for computation. Structured sparsity is more efficient for storage.
Deep Neural Networks in the cloud
– We propose TernGrad to quantize gradients to reduce communication volume.
Deep Neural Networks on the edge
– We propose Structurally Sparse Deep Neural Networks to simplify the “rocket” to lighten the load on the “engine”
– http://www.pittnuts.com/#Publications
Wei Wen, Duke University www.pittnuts.com