Neurosurgeon: Collaborative Intelligence Between the Cloud and Mobile Edge
by Y. Kang, J. Hauswald, C. Gao, A. Rovinski, T. Mudge, J. Mars and L. Tang
Presented by Stefanos Laskaridis (sl829@cam.ac.uk), R244: Large-Scale Data Processing and Optimisation


SLIDE 1

Neurosurgeon

Collaborative Intelligence Between the Cloud and Mobile Edge

by Y. Kang, J. Hauswald, C. Gao, A. Rovinski, T. Mudge, J. Mars and L. Tang

Stefanos Laskaridis
sl829@cam.ac.uk
R244: Large-Scale Data Processing and Optimisation

SLIDE 2

Summary

  • a. Status quo approach
  • b. Mobile-only approach
  • c. Neurosurgeon approach

Image taken from [1]

SLIDE 3

Status Quo

SLIDE 4

Status Quo

  • Deep Neural Networks power “intelligent” applications
  • Apple Siri, Google Now, Microsoft Cortana
  • DNN computation is mostly offloaded to powerful private or public clouds
  • Computer Vision
  • Natural Language Processing
  • Speech Recognition
  • Large volumes of transferred data cause latency and energy consumption.
  • However, SoC advancements urged the authors to revisit the problem.

SLIDE 5

The Mobile Edge

SLIDE 6

Experiment Setup

Mobile Platform

  • Tegra K1 SoC
  • 4+1 quad-core ARM Cortex-A15 CPU
  • 2GB DDR3L 933MHz
  • NVIDIA Kepler GPU with 192 CUDA cores

Server Platform

  • 4U Intel dual-CPU chassis, 8x PCIe 3.0 x16 slots
  • 2x Intel Xeon E5-2620, 2.1 GHz
  • 1TB HDD
  • 16x 16GB DDR3 1866MHz ECC
  • NVIDIA Tesla K40 M-class, 12GB, PCIe

Power Consumption

  • Watts Up? meter

Software

  • Deep Learning: Caffe
  • Mobile CPU: OpenBLAS
  • GPU: cuDNN
SLIDE 7

Testing the Mobile Edge

  • Experiment: running a 152KB image through AlexNet [3]
  • Measuring:
  • Communication latency: 3G, LTE, Wi-Fi
  • Computation latency: mobile CPU, mobile GPU, cloud GPU
  • End-to-end latency
  • Energy consumption

SLIDE 8

Testing the Mobile Edge

More power, but shorter bursts. Transmission has the dominating cost.

Images taken from [1]

SLIDE 9

Neurosurgeon:
 Partitioning between Cloud and Mobile

SLIDE 10

CNN

Figure (taken from [2]): convolution and pooling operations in a convolutional neural network.

SLIDE 11

DNN Layer types

  • Fully Connected Layer (fc): all neurons are connected with all the neurons of the previous layer. [2]
  • Convolutional & Local Layer (conv, local): convolves an image with one or more filters to produce a set of feature maps. Depth is the number of filters; stride is how far the filter slides each time. [2]
  • Pooling Layer (pool): downsamples an image to simplify the representation. Can be average, max, or L2. [2]
  • Activation Layer (sig, relu, htanh): applies a non-linear function to its input (sigmoid, Rectified Linear Unit, tanh).
  • Normalisation Layer (norm): normalises features across the feature map.
  • Softmax Layer (softmax): produces a probability distribution over the possible classes.
  • Argmax Layer (argmax): chooses the class with the highest probability.
  • Dropout Layer (dropout): randomly ignores neurons during training, as regularisation to prevent overfitting.

(A toy code sketch of these layer types follows.)
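As a concrete (if toy) illustration of these layer types, the sketch below wires them into a small AlexNet-flavoured stack using PyTorch modules. PyTorch is my choice for illustration only (the paper's experiments use Caffe), and the channel/neuron counts are abbreviated, not the real AlexNet configuration.

```python
# Toy sketch: the layer types above as PyTorch modules. Sizes are
# abbreviated for brevity and are NOT the real AlexNet configuration.
# (Locally connected "local" layers have no stock PyTorch module.)
import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=11, stride=4),  # conv: slide 16 filters over the image
    nn.ReLU(),                                   # activation (relu)
    nn.LocalResponseNorm(size=5),                # norm: normalise across feature maps
    nn.MaxPool2d(kernel_size=3, stride=2),       # pool: downsample (max variant)
    nn.Flatten(),
    nn.Linear(16 * 26 * 26, 128),                # fc: all-to-all connections
    nn.Dropout(p=0.5),                           # dropout: regularisation
    nn.Linear(128, 10),
    nn.Softmax(dim=1),                           # softmax: distribution over classes
)

x = torch.randn(1, 3, 224, 224)                  # one dummy RGB image
print(net(x).argmax(dim=1))                      # argmax: most likely class
```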

SLIDE 12

AlexNet

Inference-only (forward propagation)

2x over cloud-only; 18% more energy-efficient.

Images taken from [1] and [3]

SLIDE 13

AlexNet

  • Fully connected layers operate on little data but have high latency.
  • Convolutional layers produce a lot of data.
  • Pooling layers reduce a lot of data.

Images taken from [3]
SLIDE 14

Partitioning

  • Early layers hold most of the data (convolutions and pooling).
  • Later layers account for most of the latency (fully connected layers).
  • Key idea: compute locally up to the point where it makes sense, and then offload to the cloud (e.g. transfer pool5's small activations instead of the raw image).

SLIDE 15

More Applications

Task | Abbreviation | Network | Input | Layers
Image Classification | IMC | AlexNet | Image | 24
Image Classification | VGG | VGG | Image | 46
Facial Recognition | FACE | DeepFace | Image | 10
Digit Recognition | DIG | MNIST | Image | 9
Speech Recognition | ASR | Kaldi | Speech | 13
Part-of-Speech Tagging | POS | SENNA | Word vectors | 3
Named Entity Recognition | NER | SENNA | Word vectors | 3
Word Chunking | CHK | SENNA | Word vectors | 3

SLIDE 16

VGG

Figures (taken from [1]): latency (s) and energy (J) at each candidate partition point, split into server processing latency, data communication latency/energy, and mobile processing latency/energy; and per-layer latency (ms) vs. output data size (MB) across all VGG layers (input, conv1.1 through pool5, fc6-fc8, softmax, argmax).

SLIDE 17

FACE

Figures (taken from [1]): latency (s) and energy (J) at each candidate partition point (server processing, data communication, and mobile processing components); and per-layer latency (ms) vs. output data size (MB) across DeepFace's layers (input, conv1, pool2, conv3, local4-local6, fc7, fc8, softmax, argmax).

SLIDE 18

DIG

Figures (taken from [1]): latency (s) and energy (J) at each candidate partition point; and per-layer latency (ms) vs. output data size (MB) across the MNIST network's layers (input, conv1, pool1, conv2, pool2, fc1, relu1, fc2, softmax, argmax).

SLIDE 19

ASR

Figures (taken from [1]): latency (s) and energy (J) at each candidate partition point; and per-layer latency (ms) vs. output data size (MB) across Kaldi's layers (input, fc1-fc7 with sigmoid activations).

SLIDE 20

POS

Figures (taken from [1]): latency (s) and energy (J) at each candidate partition point; and per-layer latency (ms) vs. output data size (MB) across SENNA's layers (input, fc1, htanh, fc3).

SLIDE 21

NER

Figures (taken from [1]): latency (s) and energy (J) at each candidate partition point; and per-layer latency (ms) vs. output data size (MB) across SENNA's layers (input, fc1, htanh, fc3).

SLIDE 22

CHK

Figures (taken from [1]): latency (s) and energy (J) at each candidate partition point; and per-layer latency (ms) vs. output data size (MB) across SENNA's layers (input, fc1, htanh, fc3).

SLIDE 23

Neurosurgeon

SLIDE 24

Neurosurgeon

  • Partitions the DNN based on:
  • DNN topology
  • Computation latency
  • Output data size
  • Dynamic factors:
  • Wireless network conditions
  • Datacenter workload

SLIDE 25

Neurosurgeon

  • Profiles the device and the cloud server
  • to generate prediction models
  • One time, in advance
  • Results are stored on the device for decision-making
  • Two distinct goals:
  • Latency minimisation
  • Energy optimisation

(A toy sketch of the profiling step follows.)
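A rough, self-contained illustration of the one-time profiling step: fit a linear latency model from (feature, measured latency) pairs with ordinary least squares. The measurements below are invented; in the real system they would come from running layers of varying configurations on the target device.

```python
# Hypothetical sketch of the one-time profiling step: fit a linear
# latency model per layer type from benchmarked configurations.
# All measurements below are invented for illustration.

def fit_linear(xs, ys):
    """Ordinary least squares for y = w*x + b (single feature)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    w = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    return w, my - w * mx

# Feature for conv layers: (filter_size/stride)^2 * (# filters), as on
# Slide 27; latencies (ms) would come from profiling runs (e.g. Caffe).
features   = [100, 400, 900, 1600]
latency_ms = [1.2, 3.9, 8.1, 14.8]
w, b = fit_linear(features, latency_ms)
print(f"latency_ms ~= {w:.4f} * x + {b:.2f}")  # stored on-device for runtime use
```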

SLIDE 26

Neurosurgeon

Deployment phase:
1) Generate prediction models, one per layer type (CONV, FC, POOL, ACT, ...)

Runtime phase, for the target application:
1) Extract layer configurations
2) Predict layer performance
3) Evaluate partition points
4) Partitioned execution

Image taken from [1]

SLIDE 27

Regression model per DNN Layer

Layer | Regression Variables
Convolution | (filter_size / stride)^2 × (# filters)
Local, Pooling | Input and output feature map sizes
Fully Connected | # input / output neurons
Softmax, Argmax | # input / output neurons
Activation, Normalisation | # neurons

Linear or logarithmic regression models; GFLOPS is used as the performance metric. (A toy prediction sketch follows.)
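To make the table concrete, here is a minimal sketch of runtime latency prediction using those regression variables. The linear model form matches the slide ("linear or logarithmic"), but every coefficient below is a made-up placeholder, not a value from the paper.

```python
# Minimal sketch of per-layer-type latency prediction. Feature
# definitions follow the table above; the coefficients are made-up
# placeholders, NOT fitted values from the paper.

def conv_feature(filter_size, stride, n_filters):
    # Convolution: (filter_size / stride)^2 * (# filters)
    return (filter_size / stride) ** 2 * n_filters

def predict_latency(layer_type, features, coeffs):
    """Linear model: latency_ms = intercept + sum(w_i * x_i).
    (The paper also fits logarithmic models; only the linear
    case is shown here.)"""
    weights, intercept = coeffs[layer_type]
    return intercept + sum(w * x for w, x in zip(weights, features))

COEFFS = {  # (weights, intercept) per layer type -- placeholders
    "conv": ([1.5e-3], 0.8),          # x = conv_feature(...)
    "fc":   ([2.0e-6, 1.0e-6], 0.1),  # x = (n_in, n_out)
    "pool": ([3.0e-7, 3.0e-7], 0.05), # x = (in_map_size, out_map_size)
    "act":  ([1.0e-7], 0.02),         # x = (n_neurons,)
}

# Example: an AlexNet-style conv layer (11x11 filters, stride 4, 96 filters).
x = conv_feature(11, 4, 96)
print(predict_latency("conv", [x], COEFFS))
```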

SLIDE 28

Partitioning Algorithm

Input:
  N: number of layers in the DNN
  {L_i | i = 1..N}: layers in the DNN
  {D_i | i = 1..N}: data size at each layer
  f(L_i), g(L_i): regression models predicting the latency and power of executing L_i
  K: current datacenter load level
  B: current wireless network uplink bandwidth
  P_U: wireless network uplink power consumption

procedure PARTITIONDECISION
  for each i in 1..N do
    TM_i ← f_mobile(L_i)      // latency of mobile execution (device calculation)
    TC_i ← f_cloud(L_i, K)    // latency of cloud execution (cloud calculation)
    PM_i ← g_mobile(L_i)      // power consumed for local calculations
    TU_i ← D_i / B            // transfer latency
  if OptTarget == latency then
    return argmin_{j=1..N} ( Σ_{i=1..j} TM_i + Σ_{k=j+1..N} TC_k + TU_j )
  else if OptTarget == energy then
    return argmin_{j=1..N} ( Σ_{i=1..j} TM_i × PM_i + TU_j × P_U )   // P_U: uplink power consumption

Image taken from [1]

(A Python transcription of this procedure follows.)
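Below is a direct Python transcription of the procedure above. The structure follows the listing; the per-layer numbers and the bandwidth/power constants are toy values for illustration.

```python
# Python transcription of PARTITIONDECISION from [1]. The per-layer
# predictions (TM, TC, PM), data sizes D, and constants are toy values.

def partition_decision(TM, TC, PM, D, B, P_U, target="latency"):
    """Return the best partition index j: layers 1..j run on the
    mobile device, layers j+1..N run in the cloud."""
    N = len(TM)
    TU = [d / B for d in D]  # transfer latency after each layer
    def latency(j):
        return sum(TM[:j]) + sum(TC[j:]) + TU[j - 1]
    def energy(j):
        # Only mobile-side energy matters from the device's perspective.
        return sum(t * p for t, p in zip(TM[:j], PM[:j])) + TU[j - 1] * P_U
    cost = latency if target == "latency" else energy
    return min(range(1, N + 1), key=cost)

# Toy 5-layer network: latencies in ms, data sizes in KB, B in KB/ms.
TM = [5, 8, 6, 20, 15]       # predicted mobile latency per layer
TC = [1, 2, 1, 3, 2]         # predicted cloud latency per layer (given load K)
PM = [2.0] * 5               # predicted mobile power per layer (W)
D  = [600, 300, 100, 20, 4]  # output data size per layer
print(partition_decision(TM, TC, PM, D, B=100.0, P_U=1.2))  # -> 1
```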

SLIDE 29

Benchmark Results - Latency Optimisation

Table (taken from [1]): best partition point per benchmark (IMC, VGG, FACE, DIG, ASR, POS, NER, CHK) for each mobile processor (CPU, GPU) and wireless network (Wi-Fi, LTE, 3G). Most choices are "input" (cloud-only) or "argmax" (mobile-only), with a few intermediate points such as pool5 and fc3.

Charts (taken from [1]): latency speedup over the status quo per benchmark and network, with Neurosurgeon reaching up to 20.4x on the mobile CPU and up to 40.7x on the mobile GPU.

  • Mispredicts when the performance of candidate partition points is close to one another.
SLIDE 30

Benchmark Results - Power Optimisation

Table (taken from [1]): best partition point per benchmark (IMC, VGG, FACE, DIG, ASR, POS, NER, CHK) for each mobile processor (CPU, GPU) and wireless network (Wi-Fi, LTE, 3G), now optimising for energy; again mostly "input" or "argmax", with a few intermediate points such as pool5 and fc3.

Charts (taken from [1]): normalized mobile energy relative to the status quo per benchmark and network, for the mobile CPU and the mobile GPU.

Even in suboptimal cases, Neurosurgeon uses 24.2% less energy than the status quo.

SLIDE 31

Testing under Network Variation

Chart: end-to-end latency over time as the LTE bandwidth varies (1-5 Mbps), comparing the status quo against Neurosurgeon's partitioned local / partitioned remote execution. Chart annotations: "Bad network to offload computation" and "Offloading makes sense now".

In real-world scenarios, network quality may vary, and cloud-only execution suffers the consequences. Neurosurgeon dynamically adapts its offloading to mitigate the problem. (A toy demo follows.)
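A self-contained toy demo of that adaptation, reusing the cost model from the Slide 28 sketch: as the uplink bandwidth B drops, the chosen partition point moves from cloud-heavy toward fully local execution. All numbers are invented.

```python
# Toy demo: the latency-optimal partition point shifts as uplink
# bandwidth B varies. Same invented numbers as the Slide 28 sketch.

TM = [5, 8, 6, 20, 15]       # mobile latency per layer (ms)
TC = [1, 2, 1, 3, 2]         # cloud latency per layer (ms)
D  = [600, 300, 100, 20, 4]  # output data size per layer (KB)

def best_j(B):
    TU = [d / B for d in D]                       # transfer latency (ms)
    cost = lambda j: sum(TM[:j]) + sum(TC[j:]) + TU[j - 1]
    return min(range(1, len(TM) + 1), key=cost)

for B in (100.0, 10.0, 1.0):                      # uplink bandwidth (KB/ms)
    print(f"B={B:>5} KB/ms -> run layers 1..{best_j(B)} locally")
```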

SLIDE 32

Testing under Server Load Variation

Chart: end-to-end latency of AlexNet (mobile CPU only, transfers via Wi-Fi) across server load levels (10%-90%), comparing remote, partitioned, and local execution (status quo vs. Neurosurgeon). Chart annotations: "Offload to server." at low load and "Server is too loaded now. Compute locally." at high load.

The current server load is estimated by pinging the server. By taking it into account, Neurosurgeon avoids latency spikes; this strategy further reduces server load, allowing more user queries to be served. (A toy demo follows.)

Chart: normalized datacenter throughput (Wi-Fi, LTE, 3G) for the baseline (status quo) and Neurosurgeon with 0%, 30%, 70%, and 100% mobile-GPU users.
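The same kind of toy demo for server load. The congestion factor 1/(1 - load) below is invented for illustration; the paper instead feeds the measured load level K into the cloud-side regression models. At high load, the best partition point moves toward local execution.

```python
# Toy demo of load sensitivity. The congestion factor 1/(1 - load) is
# an invented stand-in; [1] conditions its cloud latency predictions
# on the datacenter load level K rather than using this formula.

TM  = [5, 8, 6, 20, 15]      # mobile latency per layer (ms)
TC0 = [1, 2, 1, 3, 2]        # cloud latency per layer at zero load (ms)
TU  = [6, 3, 1, 0.2, 0.04]   # transfer latency after each layer (ms)

def best_j(load):
    TC = [t / (1 - load) for t in TC0]            # toy load model
    cost = lambda j: sum(TM[:j]) + sum(TC[j:]) + TU[j - 1]
    return min(range(1, len(TM) + 1), key=cost)

for load in (0.1, 0.7, 0.9):
    print(f"load={load:.0%} -> run layers 1..{best_j(load)} locally")
```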

SLIDE 33

Results

  • End-to-end latency improvement:
  • average: 3.1x
  • up to: 40.7x
  • Energy consumption improvement:
  • average: 59.5%
  • up to: 94.7%
  • Datacenter throughput improvement:
  • average: 1.5x
  • up to: 6.7x


SLIDE 34

Relevant work

SLIDE 35

Popular computation offloading solutions

Table (taken from [1]; reference numbers [34]-[37] are from [1], not this deck's list): feature comparison of MAUI [34], COMET [35], Odessa [36], CloneCloud [37], and Neurosurgeon along six axes: no need to transfer program state; data-centric compute partitioning; low/no runtime overhead; no application-specific profiling required; no programmer annotation needed; server-load sensitivity. (The checkmarks did not survive extraction; notably, Neurosurgeon is the only data-centric system.)

SLIDE 36

Benchmark Result - MAUI

Charts (taken from [1]): latency speedup of MAUI vs. Neurosurgeon per benchmark (IMC, VGG, FACE, DIG, ASR, POS, NER, CHK) on the mobile CPU and the mobile GPU, with a 32x outlier for Neurosurgeon on the mobile GPU.

 | MAUI | Neurosurgeon
Partitioning | Control-based | Data-centric
Profiling | Dynamic | Static
Partitioning granularity | Per annotated function | Per layer
Optimises for | Power efficiency | Latency XOR power efficiency
Specificity | General | DNN-specific

MAUI's scheduling for a layer depends on previous invocations.

Embedded image (taken from [4]): the first page of "MAUI: Making Smartphones Last Longer with Code Offload" (Cuervo et al., MobiSys '10).

Images taken from [1] and [4]

SLIDE 37

Privacy Preserving Shared Models

  • Based on the edge computing paradigm.
  • A general model is trained in the cloud, and online learning supplements this model on the device [8].

Figure (taken from [8]; labels garbled in extraction): cloud-side batch learning over training data produces a shared model; on-device online learning personalises it, and inference runs locally.

SLIDE 38

Review & Critique

SLIDE 39

Review: The good parts

  • The name! :)
  • Brand-new paper, published at ASPLOS '17 (1 citation, from Cambridge [5])
  • Rational extension of the current model of execution, based on SoC developments.
  • All-around benchmarks, substantial speedups.
  • Inclusive of GPU computation and different network setups.

SLIDE 40

Critique

  • DNN-specific (in contrast with MAUI)
  • Profiling hardcodes the regression models for each layer type (difficult to extend; it does not learn how to assess new layers)
  • How would an RNN get split with Neurosurgeon?
  • The profiler assumes a single type of hardware server-side:
  • Different-sized containers based on load.
  • Different datacenter forwarding behind a load balancer.
  • Adoption: the NVIDIA Tegra K1 is a high-end SoC
  • Lower-end processors may shift offloading towards the cloud.

SLIDE 41

Critique

  • Distinct optimisation targets for latency and for energy efficiency.
  • Why not offer a Pareto-optimality curve and pick a point based on a user profile? (A toy sketch follows.)

Sketch: latency vs. energy efficiency, with a high-performance profile and an energy-efficiency profile selecting different points on the curve.
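A minimal sketch of the suggestion, under invented numbers: enumerate candidate partition points, keep the Pareto-optimal (latency, energy) pairs, and let a user-profile weight alpha pick among them.

```python
# Hypothetical sketch: expose a Pareto frontier over partition points
# instead of optimising latency XOR energy. All numbers are invented.

def pareto_frontier(points):
    """Keep (j, latency, energy) tuples not dominated by another point
    (i.e., no other point is at least as good on both axes and
    different)."""
    return sorted(
        (p for p in points
         if not any(q[1] <= p[1] and q[2] <= p[2] and q[1:] != p[1:]
                    for q in points)),
        key=lambda p: p[1])

def pick(frontier, alpha):
    """alpha=1.0: pure high-performance profile (latency only);
    alpha=0.0: pure energy-efficiency profile."""
    return min(frontier, key=lambda p: alpha * p[1] + (1 - alpha) * p[2])

# (partition point j, latency in ms, mobile energy in J) per candidate.
candidates = [(1, 19.0, 7.8), (2, 22.0, 4.1), (3, 25.0, 4.5),
              (4, 41.2, 3.5), (5, 54.0, 2.7)]
frontier = pareto_frontier(candidates)   # drops the dominated point (3, ...)
print(pick(frontier, alpha=1.0))  # high-performance profile -> (1, 19.0, 7.8)
print(pick(frontier, alpha=0.0))  # energy-efficiency profile -> (5, 54.0, 2.7)
```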

SLIDE 42

Critique

(The partitioning algorithm from Slide 28, repeated.)

Smartphones support multitasking. Why not include a mobile load level K_mobile, analogous to the datacenter load level K?

SLIDE 43

Suggestions

  • Work with model decomposition and compression algorithms to push more computation locally (such as DeepX [6]).
  • Other hardware (e.g. DSPs) could be taken into consideration for further efficiency (such as DeepEar [7]).
  • Could Reinforcement Learning help in learning how to partition, instead of a static profiler?
  • Offloading to devices in the local network (MAUI [4]).

SLIDE 44

Thank you

Q&A

Stefanos Laskaridis
sl829@cam.ac.uk

SLIDE 45

References

1. Kang, Y., Hauswald, J., Gao, C., Rovinski, A., Mudge, T., Mars, J., & Tang, L. (2017). Neurosurgeon: Collaborative Intelligence Between the Cloud and Mobile Edge. Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '17), 615-629. https://doi.org/10.1145/3037697.3037698
2. Stanford CS231n - Andrej Karpathy. http://cs231n.github.io/convolutional-networks/
3. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet Classification with Deep Convolutional Neural Networks. Advances in Neural Information Processing Systems, 1-9.
4. Cuervo, E., Balasubramanian, A., Cho, D., Wolman, A., Saroiu, S., Chandra, R., & Bahl, P. (2010). MAUI: Making Smartphones Last Longer with Code Offload. Proceedings of the 8th International Conference on Mobile Systems, Applications, and Services (MobiSys '10), 49-62. https://doi.org/10.1145/1814433.1814441
5. Zhao, J., Mortier, R., Crowcroft, J., & Wang, L. (n.d.). User-centric Composable Services: A New Generation of Personal Data Analytics. Retrieved from https://arxiv.org/pdf/1710.09027.pdf

SLIDE 46

References

6. Lane, N. D., Bhattacharya, S., Georgiev, P., Forlivesi, C., Jiao, L., Qendro, L., & Kawsar, F. (2016). DeepX: A Software Accelerator for Low-Power Deep Learning Inference on Mobile Devices. 2016 15th ACM/IEEE International Conference on Information Processing in Sensor Networks (IPSN 2016). https://doi.org/10.1109/IPSN.2016.7460664
7. Lane, N. D., Georgiev, P., & Qendro, L. (2015). DeepEar: Robust Smartphone Audio Sensing in Unconstrained Acoustic Environments using Deep Learning. UbiComp, 283-294. https://doi.org/10.1145/2750858.2804262
8. Servia-Rodriguez, S., Wang, L., Zhao, J. R., Mortier, R., & Haddadi, H. (2017). Personal Model Training under Privacy Constraints. Retrieved from http://arxiv.org/abs/1703.00380