Neurosurgeon: Collaborative Intelligence Between the Cloud and Mobile Edge
by Y. Kang, J. Hauswald, C. Gao, A. Rovinski, T. Mudge, J. Mars and L. Tang
Presented by Stefanos Laskaridis (sl829@cam.ac.uk), R244: Large-Scale Data Processing and Optimisation


SLIDE 1

Neurosurgeon

Collaborative Intelligence Between the Cloud and Mobile Edge

by Y. Kang, J. Hauswald, C. Gao, A. Rovinski, T. Mudge, J. Mars and L. Tang

Stefanos Laskaridis
sl829@cam.ac.uk
R244: Large-Scale Data Processing and Optimisation

SLIDE 2

Summary

  • a. Status quo approach
  • b. Mobile-only approach
  • c. Neurosurgeon approach

Image taken from [1]

SLIDE 3

Status Quo

SLIDE 4

Status Quo

  • Deep Neural Networks power “intelligent” applications
  • Apple Siri, Google Now, Microsoft Cortana
  • DNN computation is mostly offloaded to powerful private or public clouds
  • Computer Vision
  • Natural Language Processing
  • Speech Recognition
  • Large volumes of transferred data cause latency and energy consumption.
  • However, SoC advancements urged the authors to revisit the problem.

SLIDE 5

The Mobile Edge

SLIDE 6

Experiment Setup

Mobile Platform

  • Tegra K1 SoC
  • 4+1 quad-core ARM Cortex-A15 CPU
  • 2GB DDR3L 933MHz
  • NVIDIA Kepler GPU with 192 CUDA cores

Server Platform

  • 4U Intel dual-CPU chassis, 8x PCIe 3.0 x16 slots
  • 2x Intel Xeon E5-2620, 2.1 GHz
  • 1TB HDD
  • 16x 16GB DDR3 1866MHz ECC
  • NVIDIA Tesla K40 M-class, 12GB, PCIe

Power Consumption

  • Watts Up? meter

Software

  • Deep Learning: Caffe
  • Mobile CPU: OpenBLAS
  • GPU: cuDNN
SLIDE 7

Testing the Mobile Edge

  • Experiment: running a 152KB image through AlexNet [3]
  • Measuring:
  • Communication latency: 3G, LTE, Wi-Fi
  • Computation latency: mobile CPU, mobile GPU, cloud GPU
  • End-to-end latency
  • Energy consumption

SLIDE 8

Testing the Mobile Edge

More power, but shorter bursts. Transmission has the dominating cost.

Images taken from [1]

SLIDE 9

Neurosurgeon:
 Partitioning between Cloud and Mobile

SLIDE 10

CNN

Figure (taken from [2]): convolution and pooling operations in a convolutional neural network.

SLIDE 11

DNN Layer types

  • Fully Connected Layer (fc): all neurons are connected with all the neurons of the previous layer. [2]
  • Convolutional & Local Layer (conv, local): convolves an image with one or more filters to produce a set of feature maps. Depth is the number of filters; stride is how far the filter slides each time. [2]
  • Pooling Layer (pool): downsamples an image to simplify the representation. Can be average, max, or L2. [2]
  • Activation Layer (sig, relu, htanh): applies a non-linear function to its input (sigmoid, Rectified Linear Unit, tanh).
  • Normalisation Layer (norm): normalises features across the feature map.
  • Softmax Layer (softmax): produces a probability distribution over the possible classes.
  • Argmax Layer (argmax): chooses the class with the highest probability.
  • Dropout Layer (dropout): randomly ignores neurons during training, as regularisation to prevent overfitting.

(A toy code sketch of these layer types follows.)
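As a concrete (if toy) illustration of these layer types, the sketch below wires them into a small AlexNet-flavoured stack using PyTorch modules. PyTorch is my choice for illustration only (the paper's experiments use Caffe), and the channel/neuron counts are abbreviated, not the real AlexNet configuration.

```python
# Toy sketch: the layer types above as PyTorch modules. Sizes are
# abbreviated for brevity and are NOT the real AlexNet configuration.
# (Locally connected "local" layers have no stock PyTorch module.)
import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=11, stride=4),  # conv: slide 16 filters over the image
    nn.ReLU(),                                   # activation (relu)
    nn.LocalResponseNorm(size=5),                # norm: normalise across feature maps
    nn.MaxPool2d(kernel_size=3, stride=2),       # pool: downsample (max variant)
    nn.Flatten(),
    nn.Linear(16 * 26 * 26, 128),                # fc: all-to-all connections
    nn.Dropout(p=0.5),                           # dropout: regularisation
    nn.Linear(128, 10),
    nn.Softmax(dim=1),                           # softmax: distribution over classes
)

x = torch.randn(1, 3, 224, 224)                  # one dummy RGB image
print(net(x).argmax(dim=1))                      # argmax: most likely class
```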

SLIDE 12

AlexNet

Inference-only (forward propagation)

2x over cloud-only; 18% more energy-efficient.

Images taken from [1] and [3]

SLIDE 13

AlexNet

  • Fully connected layers operate on little data but have high latency.
  • Convolutional layers produce a lot of data.
  • Pooling layers reduce a lot of data.

Images taken from [3]
SLIDE 14

Partitioning

  • Early layers hold most of the data (convolutions and pooling).
  • Later layers account for most of the latency (fully connected layers).
  • Key idea: compute locally up to the point where it makes sense, and then offload to the cloud (e.g. transfer pool5's small activations instead of the raw image).

SLIDE 15

More Applications

Task | Abbreviation | Network | Input | Layers
Image Classification | IMC | AlexNet | Image | 24
Image Classification | VGG | VGG | Image | 46
Facial Recognition | FACE | DeepFace | Image | 10
Digit Recognition | DIG | MNIST | Image | 9
Speech Recognition | ASR | Kaldi | Speech | 13
Part-of-Speech Tagging | POS | SENNA | Word vectors | 3
Named Entity Recognition | NER | SENNA | Word vectors | 3
Word Chunking | CHK | SENNA | Word vectors | 3

SLIDE 16

VGG

Figures (taken from [1]): latency (s) and energy (J) at each candidate partition point, split into server processing latency, data communication latency/energy, and mobile processing latency/energy; and per-layer latency (ms) vs. output data size (MB) across all VGG layers (input, conv1.1 through pool5, fc6-fc8, softmax, argmax).

SLIDE 17

FACE

Figures (taken from [1]): latency (s) and energy (J) at each candidate partition point (server processing, data communication, and mobile processing components); and per-layer latency (ms) vs. output data size (MB) across DeepFace's layers (input, conv1, pool2, conv3, local4-local6, fc7, fc8, softmax, argmax).

SLIDE 18

DIG

Figures (taken from [1]): latency (s) and energy (J) at each candidate partition point; and per-layer latency (ms) vs. output data size (MB) across the MNIST network's layers (input, conv1, pool1, conv2, pool2, fc1, relu1, fc2, softmax, argmax).

SLIDE 19

ASR

Figures (taken from [1]): latency (s) and energy (J) at each candidate partition point; and per-layer latency (ms) vs. output data size (MB) across Kaldi's layers (input, fc1-fc7 with sigmoid activations).

SLIDE 20

POS

Figures (taken from [1]): latency (s) and energy (J) at each candidate partition point; and per-layer latency (ms) vs. output data size (MB) across SENNA's layers (input, fc1, htanh, fc3).

SLIDE 21

NER

Figures (taken from [1]): latency (s) and energy (J) at each candidate partition point; and per-layer latency (ms) vs. output data size (MB) across SENNA's layers (input, fc1, htanh, fc3).

SLIDE 22

CHK

Figures (taken from [1]): latency (s) and energy (J) at each candidate partition point; and per-layer latency (ms) vs. output data size (MB) across SENNA's layers (input, fc1, htanh, fc3).

SLIDE 23

Neurosurgeon

SLIDE 24

Neurosurgeon

  • Partitions the DNN based on:
  • DNN topology
  • Computation latency
  • Output data size
  • Dynamic factors:
  • Wireless network conditions
  • Datacenter workload

SLIDE 25

Neurosurgeon

  • Profiles the device and the cloud server
  • to generate prediction models
  • One time, in advance
  • Results are stored on the device for decision-making
  • Two distinct goals:
  • Latency minimisation
  • Energy optimisation

(A toy sketch of the profiling step follows.)
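A rough, self-contained illustration of the one-time profiling step: fit a linear latency model from (feature, measured latency) pairs with ordinary least squares. The measurements below are invented; in the real system they would come from running layers of varying configurations on the target device.

```python
# Hypothetical sketch of the one-time profiling step: fit a linear
# latency model per layer type from benchmarked configurations.
# All measurements below are invented for illustration.

def fit_linear(xs, ys):
    """Ordinary least squares for y = w*x + b (single feature)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    w = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    return w, my - w * mx

# Feature for conv layers: (filter_size/stride)^2 * (# filters), as on
# Slide 27; latencies (ms) would come from profiling runs (e.g. Caffe).
features   = [100, 400, 900, 1600]
latency_ms = [1.2, 3.9, 8.1, 14.8]
w, b = fit_linear(features, latency_ms)
print(f"latency_ms ~= {w:.4f} * x + {b:.2f}")  # stored on-device for runtime use
```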

SLIDE 26

Neurosurgeon

Deployment phase:
1) Generate prediction models, one per layer type (CONV, FC, POOL, ACT, ...)

Runtime phase, for the target application:
1) Extract layer configurations
2) Predict layer performance
3) Evaluate partition points
4) Partitioned execution

Image taken from [1]

SLIDE 27

Regression model per DNN Layer

Layer | Regression Variables
Convolution | (filter_size / stride)^2 × (# filters)
Local, Pooling | Input and output feature map sizes
Fully Connected | # input / output neurons
Softmax, Argmax | # input / output neurons
Activation, Normalisation | # neurons

Linear or logarithmic regression models; GFLOPS is used as the performance metric. (A toy prediction sketch follows.)
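To make the table concrete, here is a minimal sketch of runtime latency prediction using those regression variables. The linear model form matches the slide ("linear or logarithmic"), but every coefficient below is a made-up placeholder, not a value from the paper.

```python
# Minimal sketch of per-layer-type latency prediction. Feature
# definitions follow the table above; the coefficients are made-up
# placeholders, NOT fitted values from the paper.

def conv_feature(filter_size, stride, n_filters):
    # Convolution: (filter_size / stride)^2 * (# filters)
    return (filter_size / stride) ** 2 * n_filters

def predict_latency(layer_type, features, coeffs):
    """Linear model: latency_ms = intercept + sum(w_i * x_i).
    (The paper also fits logarithmic models; only the linear
    case is shown here.)"""
    weights, intercept = coeffs[layer_type]
    return intercept + sum(w * x for w, x in zip(weights, features))

COEFFS = {  # (weights, intercept) per layer type -- placeholders
    "conv": ([1.5e-3], 0.8),          # x = conv_feature(...)
    "fc":   ([2.0e-6, 1.0e-6], 0.1),  # x = (n_in, n_out)
    "pool": ([3.0e-7, 3.0e-7], 0.05), # x = (in_map_size, out_map_size)
    "act":  ([1.0e-7], 0.02),         # x = (n_neurons,)
}

# Example: an AlexNet-style conv layer (11x11 filters, stride 4, 96 filters).
x = conv_feature(11, 4, 96)
print(predict_latency("conv", [x], COEFFS))
```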

SLIDE 28

Partitioning Algorithm

Input:
  N: number of layers in the DNN
  {L_i | i = 1..N}: layers in the DNN
  {D_i | i = 1..N}: data size at each layer
  f(L_i), g(L_i): regression models predicting the latency and power of executing L_i
  K: current datacenter load level
  B: current wireless network uplink bandwidth
  P_U: wireless network uplink power consumption

procedure PARTITIONDECISION
  for each i in 1..N do
    TM_i ← f_mobile(L_i)      // latency of mobile execution (device calculation)
    TC_i ← f_cloud(L_i, K)    // latency of cloud execution (cloud calculation)
    PM_i ← g_mobile(L_i)      // power consumed for local calculations
    TU_i ← D_i / B            // transfer latency
  if OptTarget == latency then
    return argmin_{j=1..N} ( Σ_{i=1..j} TM_i + Σ_{k=j+1..N} TC_k + TU_j )
  else if OptTarget == energy then
    return argmin_{j=1..N} ( Σ_{i=1..j} TM_i × PM_i + TU_j × P_U )   // P_U: uplink power consumption

Image taken from [1]

(A Python transcription of this procedure follows.)
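Below is a direct Python transcription of the procedure above. The structure follows the listing; the per-layer numbers and the bandwidth/power constants are toy values for illustration.

```python
# Python transcription of PARTITIONDECISION from [1]. The per-layer
# predictions (TM, TC, PM), data sizes D, and constants are toy values.

def partition_decision(TM, TC, PM, D, B, P_U, target="latency"):
    """Return the best partition index j: layers 1..j run on the
    mobile device, layers j+1..N run in the cloud."""
    N = len(TM)
    TU = [d / B for d in D]  # transfer latency after each layer
    def latency(j):
        return sum(TM[:j]) + sum(TC[j:]) + TU[j - 1]
    def energy(j):
        # Only mobile-side energy matters from the device's perspective.
        return sum(t * p for t, p in zip(TM[:j], PM[:j])) + TU[j - 1] * P_U
    cost = latency if target == "latency" else energy
    return min(range(1, N + 1), key=cost)

# Toy 5-layer network: latencies in ms, data sizes in KB, B in KB/ms.
TM = [5, 8, 6, 20, 15]       # predicted mobile latency per layer
TC = [1, 2, 1, 3, 2]         # predicted cloud latency per layer (given load K)
PM = [2.0] * 5               # predicted mobile power per layer (W)
D  = [600, 300, 100, 20, 4]  # output data size per layer
print(partition_decision(TM, TC, PM, D, B=100.0, P_U=1.2))  # -> 1
```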

SLIDE 29

Benchmark Results - Latency Optimisation

Table (taken from [1]): best partition point per benchmark (IMC, VGG, FACE, DIG, ASR, POS, NER, CHK) for each mobile processor (CPU, GPU) and wireless network (Wi-Fi, LTE, 3G). Most choices are "input" (cloud-only) or "argmax" (mobile-only), with a few intermediate points such as pool5 and fc3.

Charts (taken from [1]): latency speedup over the status quo per benchmark and network, with Neurosurgeon reaching up to 20.4x on the mobile CPU and up to 40.7x on the mobile GPU.

  • Mispredicts when the performance of candidate partition points is close to one another.
SLIDE 30

Benchmark Results - Power Optimisation

Table (taken from [1]): best partition point per benchmark (IMC, VGG, FACE, DIG, ASR, POS, NER, CHK) for each mobile processor (CPU, GPU) and wireless network (Wi-Fi, LTE, 3G), now optimising for energy; again mostly "input" or "argmax", with a few intermediate points such as pool5 and fc3.

Charts (taken from [1]): normalized mobile energy relative to the status quo per benchmark and network, for the mobile CPU and the mobile GPU.

Even in suboptimal cases, Neurosurgeon uses 24.2% less energy than the status quo.

SLIDE 31

Testing under Network Variation

Chart: end-to-end latency over time as the LTE bandwidth varies (1-5 Mbps), comparing the status quo against Neurosurgeon's partitioned local / partitioned remote execution. Chart annotations: "Bad network to offload computation" and "Offloading makes sense now".

In real-world scenarios, network quality may vary, and cloud-only execution suffers the consequences. Neurosurgeon dynamically adapts its offloading to mitigate the problem. (A toy demo follows.)
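A self-contained toy demo of that adaptation, reusing the cost model from the Slide 28 sketch: as the uplink bandwidth B drops, the chosen partition point moves from cloud-heavy toward fully local execution. All numbers are invented.

```python
# Toy demo: the latency-optimal partition point shifts as uplink
# bandwidth B varies. Same invented numbers as the Slide 28 sketch.

TM = [5, 8, 6, 20, 15]       # mobile latency per layer (ms)
TC = [1, 2, 1, 3, 2]         # cloud latency per layer (ms)
D  = [600, 300, 100, 20, 4]  # output data size per layer (KB)

def best_j(B):
    TU = [d / B for d in D]                       # transfer latency (ms)
    cost = lambda j: sum(TM[:j]) + sum(TC[j:]) + TU[j - 1]
    return min(range(1, len(TM) + 1), key=cost)

for B in (100.0, 10.0, 1.0):                      # uplink bandwidth (KB/ms)
    print(f"B={B:>5} KB/ms -> run layers 1..{best_j(B)} locally")
```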

SLIDE 32

Testing under Server Load Variation

Chart: end-to-end latency of AlexNet (mobile CPU only, transfers via Wi-Fi) across server load levels (10%-90%), comparing remote, partitioned, and local execution (status quo vs. Neurosurgeon). Chart annotations: "Offload to server." at low load and "Server is too loaded now. Compute locally." at high load.

The current server load is estimated by pinging the server. By taking it into account, Neurosurgeon avoids latency spikes; this strategy further reduces server load, allowing more user queries to be served. (A toy demo follows.)

Chart: normalized datacenter throughput (Wi-Fi, LTE, 3G) for the baseline (status quo) and Neurosurgeon with 0%, 30%, 70%, and 100% mobile-GPU users.
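The same kind of toy demo for server load. The congestion factor 1/(1 - load) below is invented for illustration; the paper instead feeds the measured load level K into the cloud-side regression models. At high load, the best partition point moves toward local execution.

```python
# Toy demo of load sensitivity. The congestion factor 1/(1 - load) is
# an invented stand-in; [1] conditions its cloud latency predictions
# on the datacenter load level K rather than using this formula.

TM  = [5, 8, 6, 20, 15]      # mobile latency per layer (ms)
TC0 = [1, 2, 1, 3, 2]        # cloud latency per layer at zero load (ms)
TU  = [6, 3, 1, 0.2, 0.04]   # transfer latency after each layer (ms)

def best_j(load):
    TC = [t / (1 - load) for t in TC0]            # toy load model
    cost = lambda j: sum(TM[:j]) + sum(TC[j:]) + TU[j - 1]
    return min(range(1, len(TM) + 1), key=cost)

for load in (0.1, 0.7, 0.9):
    print(f"load={load:.0%} -> run layers 1..{best_j(load)} locally")
```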

SLIDE 33

Results

  • End-to-end latency improvement:
  • average: 3.1x
  • up to: 40.7x
  • Energy consumption improvement:
  • average: 59.5%
  • up to: 94.7%
  • Datacenter throughput improvement:
  • average: 1.5x
  • up to: 6.7x


SLIDE 34

Relevant work

SLIDE 35

Popular computation offloading solutions

Table (taken from [1]; reference numbers [34]-[37] are from [1], not this deck's list): feature comparison of MAUI [34], COMET [35], Odessa [36], CloneCloud [37], and Neurosurgeon along six axes: no need to transfer program state; data-centric compute partitioning; low/no runtime overhead; no application-specific profiling required; no programmer annotation needed; server-load sensitivity. (The checkmarks did not survive extraction; notably, Neurosurgeon is the only data-centric system.)

SLIDE 36

Benchmark Result - MAUI

Charts (taken from [1]): latency speedup of MAUI vs. Neurosurgeon per benchmark (IMC, VGG, FACE, DIG, ASR, POS, NER, CHK) on the mobile CPU and the mobile GPU, with a 32x outlier for Neurosurgeon on the mobile GPU.

 | MAUI | Neurosurgeon
Partitioning | Control-based | Data-centric
Profiling | Dynamic | Static
Partitioning granularity | Per annotated function | Per layer
Optimises for | Power efficiency | Latency XOR power efficiency
Specificity | General | DNN-specific

MAUI's scheduling for a layer depends on previous invocations.

Embedded image (taken from [4]): the first page of "MAUI: Making Smartphones Last Longer with Code Offload" (Cuervo et al., MobiSys '10).

Images taken from [1] and [4]

SLIDE 37

Privacy Preserving Shared Models

  • Based on the edge computing paradigm.
  • A general model is trained in the cloud, and online learning supplements this model on the device [8].

Figure (taken from [8]; labels garbled in extraction): cloud-side batch learning over training data produces a shared model; on-device online learning personalises it, and inference runs locally.

SLIDE 38

Review & Critique

SLIDE 39

Review: The good parts

  • The name! :)
  • Brand-new paper, published at ASPLOS '17 (1 citation, from Cambridge [5])
  • Rational extension of the current model of execution, based on SoC developments.
  • All-around benchmarks, substantial speedups.
  • Inclusive of GPU computation and different network setups.

SLIDE 40

Critique

  • DNN-specific (in contrast with MAUI)
  • Profiling hardcodes the regression models for each layer type (difficult to extend; it does not learn how to assess new layers)
  • How would an RNN get split with Neurosurgeon?
  • The profiler assumes a single type of hardware server-side:
  • Different-sized containers based on load.
  • Different datacenter forwarding behind a load balancer.
  • Adoption: the NVIDIA Tegra K1 is a high-end SoC
  • Lower-end processors may shift offloading towards the cloud.

SLIDE 41

Critique

  • Distinct optimisation targets for latency and for energy efficiency.
  • Why not offer a Pareto-optimality curve and pick a point based on a user profile? (A toy sketch follows.)

Sketch: latency vs. energy efficiency, with a high-performance profile and an energy-efficiency profile selecting different points on the curve.
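A minimal sketch of the suggestion, under invented numbers: enumerate candidate partition points, keep the Pareto-optimal (latency, energy) pairs, and let a user-profile weight alpha pick among them.

```python
# Hypothetical sketch: expose a Pareto frontier over partition points
# instead of optimising latency XOR energy. All numbers are invented.

def pareto_frontier(points):
    """Keep (j, latency, energy) tuples not dominated by another point
    (i.e., no other point is at least as good on both axes and
    different)."""
    return sorted(
        (p for p in points
         if not any(q[1] <= p[1] and q[2] <= p[2] and q[1:] != p[1:]
                    for q in points)),
        key=lambda p: p[1])

def pick(frontier, alpha):
    """alpha=1.0: pure high-performance profile (latency only);
    alpha=0.0: pure energy-efficiency profile."""
    return min(frontier, key=lambda p: alpha * p[1] + (1 - alpha) * p[2])

# (partition point j, latency in ms, mobile energy in J) per candidate.
candidates = [(1, 19.0, 7.8), (2, 22.0, 4.1), (3, 25.0, 4.5),
              (4, 41.2, 3.5), (5, 54.0, 2.7)]
frontier = pareto_frontier(candidates)   # drops the dominated point (3, ...)
print(pick(frontier, alpha=1.0))  # high-performance profile -> (1, 19.0, 7.8)
print(pick(frontier, alpha=0.0))  # energy-efficiency profile -> (5, 54.0, 2.7)
```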

SLIDE 42

Critique

(The partitioning algorithm from Slide 28, repeated.)

Smartphones support multitasking. Why not include a mobile load level K_mobile, analogous to the datacenter load level K?

SLIDE 43

Suggestions

  • Work with model decomposition and compression algorithms to push more computation locally (such as DeepX [6]).
  • Other hardware (e.g. DSPs) could be taken into consideration for further efficiency (such as DeepEar [7]).
  • Could Reinforcement Learning help in learning how to partition, instead of a static profiler?
  • Offloading to devices in the local network (MAUI [4]).

SLIDE 44

Thank you

Q&A

Stefanos Laskaridis
sl829@cam.ac.uk

SLIDE 45

References

1. Kang, Y., Hauswald, J., Gao, C., Rovinski, A., Mudge, T., Mars, J., & Tang, L. (2017). Neurosurgeon: Collaborative Intelligence Between the Cloud and Mobile Edge. Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '17), 615-629. https://doi.org/10.1145/3037697.3037698
2. Stanford CS231n - Andrej Karpathy. http://cs231n.github.io/convolutional-networks/
3. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet Classification with Deep Convolutional Neural Networks. Advances in Neural Information Processing Systems, 1-9.
4. Cuervo, E., Balasubramanian, A., Cho, D., Wolman, A., Saroiu, S., Chandra, R., & Bahl, P. (2010). MAUI: Making Smartphones Last Longer with Code Offload. Proceedings of the 8th International Conference on Mobile Systems, Applications, and Services (MobiSys '10), 49-62. https://doi.org/10.1145/1814433.1814441
5. Zhao, J., Mortier, R., Crowcroft, J., & Wang, L. (n.d.). User-centric Composable Services: A New Generation of Personal Data Analytics. Retrieved from https://arxiv.org/pdf/1710.09027.pdf

SLIDE 46

References

6. Lane, N. D., Bhattacharya, S., Georgiev, P., Forlivesi, C., Jiao, L., Qendro, L., & Kawsar, F. (2016). DeepX: A Software Accelerator for Low-Power Deep Learning Inference on Mobile Devices. 2016 15th ACM/IEEE International Conference on Information Processing in Sensor Networks (IPSN 2016). https://doi.org/10.1109/IPSN.2016.7460664
7. Lane, N. D., Georgiev, P., & Qendro, L. (2015). DeepEar: Robust Smartphone Audio Sensing in Unconstrained Acoustic Environments using Deep Learning. UbiComp, 283-294. https://doi.org/10.1145/2750858.2804262
8. Servia-Rodriguez, S., Wang, L., Zhao, J. R., Mortier, R., & Haddadi, H. (2017). Personal Model Training under Privacy Constraints. Retrieved from http://arxiv.org/abs/1703.00380