SLIDE 1

Adaptive Distributed Convolutional Neural Network Inference at the Network Edge with ADCNN

Sai Qian Zhang, Jieyu Lin, Qi Zhang
ICPP 2020, 17 August 2020

SLIDE 2

Executing DNN Inference Tasks for End Users

[Figure: Option 1 (edge only) vs. Option 2 (cloud only): end-user data (image, audio, video) is processed either on edge devices or at a cloud data center]

  • Using edge devices to handle end-user data leads to long processing times (limited computing capability), while using cloud servers to process end-user data incurs a large communication delay (large communication overhead).

SLIDE 3

Motivation

  • Edge devices

○ Resource-limited
○ Pervasive

  • Adaptive Distributed Convolutional Neural Network (ADCNN)

○ We propose a framework for agile execution of Convolutional Neural Network (CNN) inference tasks on edge clusters

  • Challenges

○ Reduce the inference latency while preserving accuracy
○ Device heterogeneity and performance fluctuation
○ Applicable to different CNN models

SLIDE 4

Agenda

  • Background
  • CNN partitioning strategies
  • ADCNN framework
  • Modifications to the CNN architecture
  • Evaluation
  • Conclusion
SLIDE 5

CNN Background -- Convolutional Layer

[Figure: K weight filters of size 3×3×3 slide over 224×224×3 input feature maps (ifmaps) to produce K output feature maps (ofmaps)]

  • The weight filters slide across the ifmaps; at each position, the dot product between the filter weights and the corresponding ifmap entries is computed to produce one output value.
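As an illustration only, here is a minimal NumPy sketch of this sliding-window dot product (stride 1, no padding; the function name conv2d and the array shapes are assumptions of this example, not the authors' code):

```python
import numpy as np

def conv2d(ifmaps, filters):
    """Naive convolution: ifmaps has shape (H, W, C), filters has shape
    (K, R, R, C). Returns ofmaps of shape (H - R + 1, W - R + 1, K)."""
    H, W, C = ifmaps.shape
    K, R, _, _ = filters.shape
    out_h, out_w = H - R + 1, W - R + 1
    ofmaps = np.zeros((out_h, out_w, K))
    for k in range(K):                       # one output channel per filter
        for i in range(out_h):
            for j in range(out_w):
                # dot product between filter k and the ifmap patch at (i, j)
                patch = ifmaps[i:i + R, j:j + R, :]
                ofmaps[i, j, k] = np.sum(patch * filters[k])
    return ofmaps

# Small example (the slide uses 224x224x3 ifmaps and 3x3x3 filters)
ifmaps = np.random.rand(16, 16, 3)
filters = np.random.rand(4, 3, 3, 3)   # K = 4 filters
ofmaps = conv2d(ifmaps, filters)       # shape (14, 14, 4)
```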

SLIDE 6

Background -- CNN Workload Characteristics

  • Earlier layers take much longer to process than the later layers.

[Figure: per-layer processing time for VGG16]

SLIDE 7

CNN Partitioning Strategies: CNN Channelwise Partitioning

  • In channelwise partitioning, each node needs to exchange its partially accumulated ofmaps with the other nodes to produce the final ofmaps, which may lead to a significant communication overhead (see the sketch below).

[Figure: channelwise partitioning of a convolution: the ifmap channels (C/2 each) and the filters 1..K (K/2 each) are split between the workloads of device 1 and device 2]
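A rough sketch of what channelwise partitioning computes, reusing the conv2d example above (the helper name and the two-device split are illustrative assumptions): each device convolves its share of the input channels, and the partial ofmaps must then be exchanged and summed.

```python
import numpy as np

def channelwise_partition(ifmaps, filters, num_devices=2):
    """Channelwise partition sketch: the C input channels (and the matching
    filter channels) are split across devices; each device produces a partial
    ofmap, and the partial ofmaps must be summed, which is the cross-node
    communication the slide refers to."""
    C = ifmaps.shape[-1]
    channel_splits = np.array_split(np.arange(C), num_devices)
    partial_ofmaps = []
    for channels in channel_splits:
        part_ifmaps = ifmaps[:, :, channels]            # this device's channels
        part_filters = filters[:, :, :, channels]
        partial_ofmaps.append(conv2d(part_ifmaps, part_filters))
    # Communication step: exchange and accumulate the partial ofmaps
    return np.sum(partial_ofmaps, axis=0)
```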

SLIDE 8

CNN Partitioning Strategies: Spatial Partitioning

  • In spatial partitioning, each tile needs to exchange its data halo with neighbouring tiles in order to compute the correct result (see the sketch below).

[Figure: (a) the ifmap split into four tiles A, B, C, D; (b) the data halo around each tile; (c) data halo transmission among tiles]
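A sketch of spatial partitioning with a data halo, again reusing the conv2d example (the 2×2 tile grid, a halo width of R // 2 for an R×R kernel, and stride 1 are assumptions of this illustration): the halo pixels copied from neighbouring tiles are the data that has to be transmitted.

```python
import numpy as np

def spatial_partition_with_halo(ifmaps, filters, grid=2):
    """Spatial partition sketch: the ifmap is split into grid x grid tiles and
    each tile is extended by a halo of width R // 2 taken from its neighbours,
    so that border outputs are computed correctly."""
    H, W, _ = ifmaps.shape
    R = filters.shape[1]
    halo = R // 2
    tile_h, tile_w = H // grid, W // grid
    tile_ofmaps = {}
    for ti in range(grid):
        for tj in range(grid):
            # tile bounds extended by the halo, clipped at the ifmap border
            r0 = max(ti * tile_h - halo, 0)
            r1 = min((ti + 1) * tile_h + halo, H)
            c0 = max(tj * tile_w - halo, 0)
            c1 = min((tj + 1) * tile_w + halo, W)
            tile = ifmaps[r0:r1, c0:c1, :]   # halo pixels come from neighbours
            tile_ofmaps[(ti, tj)] = conv2d(tile, filters)
    return tile_ofmaps   # stitching the per-tile ofmaps back together is omitted
```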

SLIDE 9

Fully Decomposable Spatial Partition (FDSP)

[Figure: normal spatial partition: the ifmap is split into tiles A, B, C, D, and border pixel values must be exchanged across tiles]

  • The cross-tile information transfer can be eliminated by padding the edge pixels with zeros.

[Figure: Fully Decomposable Spatial Partition (FDSP): each tile's borders are padded with zeros, so no cross-tile transfer is needed]
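Under the same illustrative assumptions as the earlier sketches, FDSP replaces the halo exchange with local zero padding, roughly:

```python
import numpy as np

def fdsp_partition(ifmaps, filters, grid=2):
    """FDSP sketch: each tile pads its own borders with zeros instead of
    fetching a halo from neighbouring tiles, so every tile is fully
    independent and can be processed on a separate edge device."""
    H, W, _ = ifmaps.shape
    pad = filters.shape[1] // 2
    tile_h, tile_w = H // grid, W // grid
    tile_ofmaps = {}
    for ti in range(grid):
        for tj in range(grid):
            tile = ifmaps[ti * tile_h:(ti + 1) * tile_h,
                          tj * tile_w:(tj + 1) * tile_w, :]
            # zero padding replaces the cross-tile halo transfer
            padded = np.pad(tile, ((pad, pad), (pad, pad), (0, 0)))
            tile_ofmaps[(ti, tj)] = conv2d(padded, filters)
    return tile_ofmaps
```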

SLIDE 10

ADCNN Framework

[Figure: ADCNN framework: progressive retraining (Step 1) converts the original CNN model into the output CNN model; at inference time (Step 2) the input is split into tiles, processed by the Conv nodes in the edge device cluster, and aggregated by the Central node into the prediction (e.g., "Dog")]

SLIDE 11

ADCNN Framework

  • The Conv nodes need to transmit their intermediate results to the Central node, which may still cause a significant communication overhead.

[Figure: input tiles are processed by the Conv nodes in the edge device cluster, and the intermediate results are transmitted to the Central node]

SLIDE 12

Modifications to the CNN Topology

  • We modify the CNN model to reduce this communication overhead (the compression steps are sketched below).
  • We adopt progressive retraining after adding these modifications to the CNN architecture.

[Figure: the output from the CONV nodes is compressed by applying a clipped ReLU to the activations, unrolling the neurons, quantizing the values, and run-length encoding (RLE) the result, e.g. [1,0,2,0,1,0,0,0,0,0,0,0,2,0,0,0] becomes [1,1,2,1,1,7,2,3]]
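A sketch of this compression pipeline applied to a Conv node's output; the clip threshold, the number of quantization levels, and the exact RLE format are assumptions of this example, not values from the paper.

```python
import numpy as np

def compress_conv_output(activations, clip=2.0, levels=4):
    """Sketch of the Conv-node output compression: clipped ReLU, uniform
    quantization to a few integer levels, then run-length encoding of the
    unrolled (flattened) neurons. Returns a list of (value, run_length) pairs."""
    clipped = np.clip(activations, 0.0, clip)                       # clipped ReLU
    quantized = np.round(clipped / clip * (levels - 1)).astype(int)  # quantization
    flat = quantized.ravel()                                         # unroll neurons
    rle = []
    run_value, run_length = int(flat[0]), 1
    for v in flat[1:]:
        if v == run_value:
            run_length += 1
        else:
            rle.append((run_value, run_length))
            run_value, run_length = int(v), 1
    rle.append((run_value, run_length))
    return rle
```

The many zeros produced by the ReLU make the run-length encoded output much smaller than the raw activations, which is what reduces the transmission cost to the Central node.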

SLIDE 13

ADCNN Architecture

  • ADCNN takes advantage of the fine-grained, fully independent tiles generated by FDSP and adapts their assignment to dynamic conditions, allowing it to achieve fine-grained load balancing across heterogeneous edge nodes (see the sketch below).

[Figure: ADCNN system: the Central node partitions the input into tiles, the CONV node cluster performs the layer computation, and a statistics collection module gathers per-node stats; each Conv node returns its intermediate result as a message containing the data d, image id i_id, tile id t_id, and node id n_id (e.g., d:[-0.9,...,1.1], i_id:6, t_id:1, n_id:1), which the Central node combines into the final prediction (e.g., "Dog")]
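A sketch of how a Central node could use the collected statistics to balance tile assignment across heterogeneous Conv nodes; apart from the i_id, t_id, and n_id fields shown on the slide, the function and field names are hypothetical, and this is not the authors' scheduler.

```python
def assign_tiles(tile_ids, throughput):
    """Sketch of throughput-proportional tile assignment: Conv nodes with
    higher recent throughput (tiles/second, from the statistics collection
    module) receive proportionally more tiles for the next input."""
    total = sum(throughput.values())
    nodes = sorted(throughput, key=throughput.get, reverse=True)
    shares = {n: max(1, round(len(tile_ids) * throughput[n] / total)) for n in nodes}
    assignment = {n: [] for n in nodes}
    remaining = iter(tile_ids)
    for node in nodes:
        for _ in range(shares[node]):
            try:
                assignment[node].append(next(remaining))
            except StopIteration:
                return assignment
    assignment[nodes[0]].extend(remaining)   # rounding leftovers go to the fastest node
    return assignment

# Example: image 6 is split into four tiles; node 1 is three times faster than node 2,
# so it receives tiles 1-3 and node 2 receives tile 4.
tiles = [{"i_id": 6, "t_id": t} for t in range(1, 5)]
print(assign_tiles(tiles, {"n_id:1": 3.0, "n_id:2": 1.0}))
```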

SLIDE 14

Accuracy Evaluation

  • We evaluate different CNN models from different applications.
  • Accuracy degradations are around 1% for 8 by 8 FDSP on the input sample.

[Figures: accuracy results for VGG16 and for a Fully Convolutional Network]

SLIDE 15

Inference Latency Comparison

  • We implement the ADCNN system with nine identical Raspberry Pi devices, which simulate the edge devices. Among these nine devices, eight are used as Conv nodes, and the remaining one is used as the Central node.

  • Baselines:

○ Single device scheme
○ Remote cloud scheme

  • ADCNN decreases the average processing latency by 6.68x and 4.42x compared with these two baselines, respectively.

SLIDE 16

ADCNN Performance in Dynamic Environment

  • We adjust the CPU processing speed on four of the Conv nodes (nodes 5, 6, 7 and 8) in the middle of processing 50 input images, and observe its impact on tile assignment and overall inference latency.

  • ADCNN handles such dynamic changes in node performance effectively.

[Figures: variation in inference latency; changes in tile assignment]

SLIDE 17

Conclusion

  • We introduce ADCNN, a distributed inference framework which jointly optimizes the CNN architecture and the computing system for better performance in dynamic network environments.

  • ADCNN applies FDSP to partition the compute-intensive convolutional layers into many small independent computational tasks which can be executed in parallel on separate edge devices.

  • The ADCNN system takes advantage of the fine-grained, fully independent tiles generated by FDSP and adapts their assignment to dynamic conditions, allowing it to achieve fine-grained load balancing across heterogeneous edge nodes.

  • Compared to existing distributed CNN inference approaches, ADCNN provides up to 2.8x lower latency while achieving competitive inference accuracy. Additionally, ADCNN can quickly adapt to variations in edge device performance.