Neurosurgeon: Collaborative Intelligence Between the Cloud and Mobile Edge
by Y. Kang, J. Hauswald, C. Gao, A. Rovinski, T. Mudge, J. Mars and L. Tang
Stefanos Laskaridis
sl829@cam.ac.uk
R244: Large-Scale Data Processing and Optimisation
Approach
Image taken from [1]
Mobile Platform
Server Platform
x 16 slots
Power consumption measured with a Watts Up? meter and via software.
AlexNet [3]
More power, but in shorter bursts; transmission has the dominant cost.
Images taken from [1]
Images taken from [2]
Fully connected: all neurons are connected with all the neurons of the previous layer. [2]
Convolution: convolves an image with one or more filters to produce a set of feature maps; depth is the number of filters, and stride is how far we slide the filter each step. [2]
Pooling: downsamples an image to simplify the representation; can be average, max, or L2. [2]
Activation: applies a non-linear function to its input (sigmoid, Rectified Linear Unit, Tanh).
Normalisation: normalises features across the feature map.
Softmax: probability distribution over possible classes.
Argmax: chooses the class with the highest probability.
Dropout: randomly ignores neurons as regularisation to prevent overfitting.
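To make the taxonomy concrete, here is a minimal sketch of a network touching each layer type. PyTorch is an assumption of this example (not the framework used by the paper's benchmarks), and the shapes are illustrative only:

```python
import torch
import torch.nn as nn

# Toy network exercising each layer type listed above; shapes are illustrative.
net = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1),  # convolution: 16 filters (depth), stride 1
    nn.ReLU(),                                             # activation: non-linear function
    nn.LocalResponseNorm(size=5),                          # normalisation across feature maps
    nn.MaxPool2d(kernel_size=2),                           # pooling: downsampling (max variant)
    nn.Flatten(),
    nn.Linear(16 * 16 * 16, 10),                           # fully connected: all-to-all with previous layer
    nn.Dropout(p=0.5),                                     # dropout: regularisation against overfitting
    nn.Softmax(dim=1),                                     # probability distribution over 10 classes
)

x = torch.randn(1, 3, 32, 32)   # a single 32x32 RGB input
probs = net(x)
pred = probs.argmax(dim=1)      # argmax: choose the class with the highest probability
```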
Inference-only (forward propagation)
Images taken from [1] and [3]
2x faster than cloud-only and 18% more energy-efficient.
Wireless networks can move the data but have high latency.
Images taken from [3]
Data size shrinks after the front-end layers (convolution and pooling), while most of the computation is concentrated in the back-end (fully-connected layers).
Insight: compute the early layers locally and then offload to cloud.
Task                      Abbreviation  Network   Input         Layers
Image Classification      IMC           AlexNet   Image         24
Image Classification      VGG           VGG       Image         46
Facial Recognition        FACE          DeepFace  Image         10
Digit Recognition         DIG           MNIST     Image         9
Speech Recognition        ASR           Kaldi     Speech        13
Part-of-speech Tagging    POS           SENNA     Word vectors  3
Named Entity Recognition  NER           SENNA     Word vectors  3
Word Chunking             CHK           SENNA     Word vectors  3
[Figures from [1]: for each benchmark, latency and energy at every candidate partition point, broken down into server processing, data communication, and mobile processing latency, and into data communication and mobile processing energy; alongside per-layer latency and output data size across the network's layers. Panels: (a) VGG, (b) FACE, (c) DIG, (d) ASR, (e) POS, (f) NER, (g) CHK.]
Neurosurgeon system overview (image taken from [1]):
Deployment phase: 1) generate per-layer-type prediction models (CONV, FC, POOL, ACT, …).
Runtime phase, per target application: 1) extract layer configurations, 2) predict layer performance, 3) evaluate partition points, 4) partitioned execution.
Layer                      Regression variables
Convolution                (filter_size / stride)^2 × (# filters)
Local, Pooling             Input and output feature map sizes
Fully connected            # input / output neurons
Softmax, Argmax            # input / output neurons
Activation, Normalisation  # neurons
Each prediction model is a linear or logarithmic regression; GFLOPS is used as the performance metric.
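As a sketch of how one such model could be fitted: the sample points below are hypothetical (the real profiling data comes from executing layers on the target hardware), and numpy's least-squares fit stands in for whatever fitting procedure [1] used.

```python
import numpy as np

# Hypothetical profiled samples for convolution layers:
# x = (filter_size / stride)^2 * (# filters), y = measured latency in ms.
x = np.array([9.0, 36.0, 144.0, 576.0, 2304.0])
y = np.array([0.8, 2.9, 11.5, 47.0, 190.0])

# Linear model: latency ~ a*x + b, fitted by least squares.
a, b = np.polyfit(x, y, deg=1)

# Logarithmic alternative: latency ~ a*log(x) + b, for layer types whose
# cost grows sub-linearly in the regression variable.
a_log, b_log = np.polyfit(np.log(x), y, deg=1)

def predict_conv_latency(filter_size, stride, n_filters):
    """Predict latency (ms) for an unseen convolution configuration."""
    v = (filter_size / stride) ** 2 * n_filters
    return a * v + b

print(predict_conv_latency(filter_size=3, stride=1, n_filters=64))
```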
Input:
  N: number of layers in the DNN
  {L_i | i = 1…N}: layers in the DNN
  {D_i | i = 1…N}: data size at each layer
  f(L_i), g(L_i): regression models predicting the latency and power of executing L_i
  K: current datacenter load level
  B: current wireless network uplink bandwidth
  P_U: wireless network uplink power consumption

procedure PARTITIONDECISION
  for each i in 1…N do
    TM_i ← f_mobile(L_i)      // latency of mobile execution
    TC_i ← f_cloud(L_i, K)    // latency of cloud execution
    PM_i ← g_mobile(L_i)      // power of mobile execution
    TU_i ← D_i / B            // transfer latency
  if OptTarget == latency then
    return argmin_{j = 1…N} ( Σ_{i=1}^{j} TM_i + Σ_{k=j+1}^{N} TC_k + TU_j )
  else if OptTarget == energy then
    return argmin_{j = 1…N} ( Σ_{i=1}^{j} TM_i × PM_i + TU_j × P_U )
Image taken from [1]
Legend: TM = device calculation, TC = cloud calculation, PM = power consumed for local calculations, P_U = uplink power consumption.
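A minimal Python transcription of PARTITIONDECISION. The profiler callables (f_mobile, f_cloud, g_mobile) are stand-ins for the paper's regression models; the indexing convention is that partitioning after layer j runs layers 1..j on the device and the rest in the cloud.

```python
def partition_decision(layers, data_sizes, f_mobile, f_cloud, g_mobile,
                       K, B, P_U, opt_target="latency"):
    """Pick the partition point j (1..N) minimising latency or mobile energy."""
    N = len(layers)
    TM = [f_mobile(L) for L in layers]      # latency of mobile execution per layer
    TC = [f_cloud(L, K) for L in layers]    # latency of cloud execution per layer
    PM = [g_mobile(L) for L in layers]      # power of mobile execution per layer
    TU = [D / B for D in data_sizes]        # transfer latency of each layer's output

    def latency(j):
        # layers 1..j on the device, upload layer j's output, layers j+1..N in the cloud
        return sum(TM[:j]) + TU[j - 1] + sum(TC[j:])

    def energy(j):
        # mobile energy: local compute energy plus radio energy for the upload
        return sum(t * p for t, p in zip(TM[:j], PM[:j])) + TU[j - 1] * P_U

    cost = latency if opt_target == "latency" else energy
    return min(range(1, N + 1), key=cost)
```

Calling it with opt_target="energy" reproduces the second argmin; note that, as in the pseudocode, cloud-side energy is ignored because only the mobile battery is being optimised.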
[Table from [1]: best partition points for latency, per benchmark (IMC, VGG, FACE, DIG, ASR, POS, NER, CHK), mobile processor (CPU, GPU), and wireless network (Wi-Fi, LTE, 3G); choices range from input (cloud-only) through intermediate layers such as pool5 and fc3 to argmax (mobile-only).]
[Figure from [1]: latency speedup of Neurosurgeon over the status quo, (a) mobile CPU, up to 20.4x; (b) mobile GPU, up to 40.7x; shown for Wi-Fi, LTE, 3G, and the average.]
Images taken from [1]
Neurosurgeon occasionally mispredicts the best partition point when candidate points have performance close to one another.
[Table from [1]: best partition points for energy, per benchmark, mobile processor (CPU, GPU), and wireless network (Wi-Fi, LTE, 3G); again ranging from input (cloud-only) to argmax (mobile-only), with intermediate points such as pool5 and fc3.]
[Figure from [1]: normalized mobile energy of Neurosurgeon vs. the status quo, (a) mobile CPU, (b) mobile GPU, across Wi-Fi, LTE, and 3G.]
Images taken from [1]
Even in suboptimal cases, Neurosurgeon consumes 24.2% less energy than the status quo.
[Figure from [1]: LTE bandwidth over time (1–5 Mbps) and the resulting end-to-end latency; Neurosurgeon switches between local, partitioned, and remote execution, while the status quo degrades.]
In real-world scenarios, network quality varies and cloud-only execution suffers the consequences. Neurosurgeon adapts dynamically: when the network is too bad to offload it computes locally, and it offloads again once offloading makes sense.
[Figure from [1]: end-to-end latency of AlexNet (mobile CPU only, transfers via Wi-Fi) against server load level (10%–90%); Neurosurgeon moves from remote to partitioned to local execution as load rises.]
The current server load is determined by pinging. Taking server load into consideration avoids latency spikes; this strategy also drops server load further, allowing more user queries to be served.
[Figure from [1]: normalized datacenter throughput over Wi-Fi, LTE, and 3G for the baseline (status quo) and for Neurosurgeon with 0%, 30%, 70%, and 100% mobile-GPU users.]
When the server is too loaded, Neurosurgeon computes locally; otherwise it offloads to the server.
[Table from [1]: comparison of MAUI [34], COMET [35], Odessa [36], CloneCloud [37], and Neurosurgeon on six properties: no need to transfer program state, data-centric compute partitioning, low/no runtime overhead, no application-specific profiling required, no programmer annotation needed, and server-load sensitivity. Neurosurgeon is the only system satisfying all six.]
Image taken from [1]
[Figure from [1]: latency speedup over MAUI per benchmark, (a) mobile CPU, (b) mobile GPU; Neurosurgeon achieves up to 32x over MAUI.]
                          MAUI                    Neurosurgeon
Partitioning              Control-based           Data-centric
Profiling                 Dynamic                 Static
Partitioning granularity  Per annotated function  Per layer
Optimises for             Power efficiency        Latency XOR power efficiency
Specificity               General                 DNN-specific
MAUI's scheduling decision for a layer depends on previous invocations.
MAUI: Making Smartphones Last Longer with Code Offload [4]
[Screenshot of the MAUI paper's title and abstract.]
Images taken from [1] and [4]
A related computing paradigm: the model is trained in the cloud, and online learning supplements it on the device [8].
!"#$%$%&' (#)#
*+ ,#)-.'/0#"%$%&
*12$32
(2 ($32
4%50"0%-0
*12$32
*6$32 7%/$%0'/0#"%$%&
8
4%50"0%-0 Image taken from [8]
from Cambridge [5])
setups.
(difficult to expand, does not learn how to assess)
Why not choose the optimisation target based on a user profile?
High-performance profile → optimise for latency; energy-efficiency profile → optimise for energy efficiency.
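A toy sketch of that idea; the profile names and the mapping are hypothetical:

```python
# Hypothetical user profiles mapped to the OptTarget that
# PARTITIONDECISION would receive.
PROFILE_TARGETS = {
    "high-performance": "latency",
    "energy-efficiency": "energy",
}

def opt_target_for(profile: str) -> str:
    # Fall back to latency when the profile is unknown.
    return PROFILE_TARGETS.get(profile, "latency")
```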
Recall the PARTITIONDECISION algorithm above: only the datacenter load K is modelled, via f_cloud(L_i, K); there is no analogous term for device load.
Smartphones support multitasking. Why not include a device load term K_mobile, analogous to the datacenter load K?
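One hypothetical way to fold device load into the model, mirroring f_cloud(L_i, K); the inflation formula is illustrative only:

```python
def f_mobile_loaded(f_mobile, layer, K_mobile):
    """Hypothetical load-aware mobile latency model.

    K_mobile in [0, 1) is the fraction of device compute consumed by
    other apps; predicted latency inflates as available compute shrinks.
    """
    return f_mobile(layer) / max(1e-6, 1.0 - K_mobile)
```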
Use model-compression algorithms to push more computation locally (such as DeepX [6]).
Exploit heterogeneous co-processors (e.g., the DSP) for further efficiency (such as DeepEar [7]).
Learn how to partition instead of relying on a static profiler?
Q&A
Stefanos Laskaridis
sl829@cam.ac.uk
References
[1] Kang, Y., Hauswald, J., Gao, C., Rovinski, A., Mudge, T., Mars, J., & Tang, L. (2017). Neurosurgeon: Collaborative Intelligence Between the Cloud and Mobile Edge. Proceedings of the 22nd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '17), 615–629. https://doi.org/10.1145/3037697.3037698
[2] CS231n: Convolutional Neural Networks for Visual Recognition. http://cs231n.github.io/convolutional-networks/
[3] Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet Classification with Deep Convolutional Neural Networks. Advances in Neural Information Processing Systems, 1–9.
[4] Cuervo, E., Balasubramanian, A., Cho, D., Wolman, A., Saroiu, S., Chandra, R., & Bahl, P. (2010). MAUI: Making Smartphones Last Longer with Code Offload. Proceedings of the 8th International Conference on Mobile Systems, Applications, and Services (MobiSys '10), 49. https://doi.org/10.1145/1814433.1814441
[5] New Generation of Personal Data Analytics. Retrieved from https://arxiv.org/pdf/1710.09027.pdf
[6] Lane, N. D., Bhattacharya, S., Georgiev, P., Forlivesi, C., Jiao, L., Qendro, L., & Kawsar, F. (2016). DeepX: A Software Accelerator for Low-Power Deep Learning Inference on Mobile Devices. 15th ACM/IEEE International Conference on Information Processing in Sensor Networks (IPSN 2016). https://doi.org/10.1109/IPSN.2016.7460664
[7] Lane, N. D., Georgiev, P., & Qendro, L. (2015). DeepEar: Robust Smartphone Audio Sensing in Unconstrained Acoustic Environments using Deep Learning. UbiComp '15, 283–294. https://doi.org/10.1145/2750858.2804262
[8] Retrieved from http://arxiv.org/abs/1703.00380