SLIDE 1

Real-Time Image Recognition Using Collaborative IoT Devices

Jiashen Cao*, Matthew Woodward*, Michael S. Ryoo**, and Hyesoon Kim*

*Georgia Institute of Technology  **Indiana University; EgoVid Inc.

SLIDE 2

Prevalence of IoT Devices

ReQuEST workshop 2018

- Internet of Things (IoT) devices are everywhere: Smart Locks, Smart Sprinklers, Smart Plugs, Smart Baby Monitors, Smart Cookers, Smart Thermostats, Smart Mirrors, Smart Cleaners, and Smart Refrigerators.
  (https://www.pentasecurity.com/blog/10-smartest-iot-devices-2017/)
- Many of these devices generate or capture an abundance of real-time raw data, such as images.

SLIDE 3

How to Process IoT Data?

- Advances in deep neural networks (DNNs) provide high-accuracy solutions to previously impossible tasks:
  - Image recognition
  - Face recognition
  - Video (action) recognition
  - Voice recognition
- Performing these tasks in real time requires high computational power.

SLIDE 4

Where to Process (I)

- (Option A) Use the individual IoT device
  - Limited energy (e.g., battery powered)
  - Limited compute power
  - So, unable to meet timing constraints
- (Option B) Offload to the cloud
  - Such as the voice recognition services of Apple's Siri, Amazon's Echo, Microsoft's Cortana, and Google Home
  - Any problem?

SLIDE 5

Where to Process (II)

- (Option B) Cloud processing is promising, but:
  - Not scalable: more traffic, data, and storage; IoT devices outnumbered the world population in 2017
  - Privacy and security: voice recognition? Big Brother's spying devices in the novel 1984; multiple layers of network security, encryption, etc.
  - Quality of Service (QoS) and reliability: we have a tight timing constraint for real-time recognition

F. Biscotti et al., "The Impact of the Internet of Things on Data Centers," Gartner Research, vol. 18, 2014.

SLIDE 6

Where to Process (III)

- (Option C) What if we could harvest the aggregated computational power of local IoT devices?
  - At a given time, not all devices are fully utilized

SLIDE 7

Collaborative IoT Devices

- (Option C) We study such collaboration between IoT devices in our paper, Musical Chair.
  - Our performance metric: inferences per second (IPS)
  - We use the same models, so we have the same accuracy
- In this work, we showcase the application of Musical Chair to image recognition models on a farm of Raspberry Pis.

Hadidi et al., "Musical Chair: Efficient Real-Time Recognition Using Collaborative IoT Devices," arXiv preprint arXiv:1802.02138 (2018).

SLIDE 8

Outline

- Motivation
- Musical Chair
  - Data and Model Parallelism
- Hardware and Software Overview
- System Evaluations
- Conclusion

SLIDE 9

Musical Chair

- Musical Chair is a technique for distributing DNN computations over multiple IoT devices.
- Profiling phase: (i) profile the DNN layers on the target hardware to build behavior models; (ii) gather data on the environment and the DNN model, including communication latency and bandwidth and the number of devices (n).
- Task assignment phase: (iii) generate task assignments for {1…n} devices.

(Figure: per-layer latency profiling of CNN and fully connected layers; an example four-node assignment for a two-stream video model: Node A handles recording and optical flow, Node B the spatial CNN, Node C the temporal CNN, and Node D the maxpool and dense layers; larger systems further replicate the CNN streams and split fc_1, fc_2, and fc_3 across additional nodes.)

Hadidi et al., "Musical Chair: Efficient Real-Time Recognition Using Collaborative IoT Devices," arXiv preprint arXiv:1802.02138 (2018).
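As a rough illustration of the task-assignment idea, the following simplified sketch of our own (not the paper's actual algorithm) groups profiled per-layer latencies into consecutive pipeline stages, one per device, so that the slowest stage, which bounds inferences per second, stays small. The latency values and the function name are illustrative.

```python
# Simplified sketch of latency-aware task assignment (our illustration,
# not the paper's actual algorithm): split consecutive DNN layers into
# one stage per device, aiming to balance per-stage latency.

def assign_stages(layer_ms, n_devices):
    """Greedily group consecutive layers into n_devices stages."""
    target = sum(layer_ms) / n_devices   # ideal per-stage latency
    stages, current, budget = [], [], 0.0
    for t in layer_ms:
        # Close the current stage once adding the next layer would
        # overshoot the target (while stages remain to be opened).
        if current and budget + t > target and len(stages) < n_devices - 1:
            stages.append(current)
            current, budget = [], 0.0
        current.append(t)
        budget += t
    stages.append(current)
    return stages

profiled = [15, 16, 17, 18, 19, 20, 10, 20, 30, 45, 50, 55]  # illustrative ms
stages = assign_stages(profiled, 4)
bottleneck = max(sum(s) for s in stages)   # pipeline-limiting stage time
```

A real assignment would also account for the communication latency and bandwidth gathered in step (ii), which this sketch omits.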

SLIDE 10-12

Model & Data Parallelism

(Slides 10-12 build up one diagram: a pipeline of Task A, Task B, and Task C mapping inputs X1…X4 to outputs Y1…Y3, with Task B either copied across devices for data parallelism or split into parts for model parallelism.)

- Two forms of distribution:
  - Data parallelism: providing the next input to multiple devices in a network, each holding a copy of the task.
  - Model parallelism: splitting parts of a given layer, or group of layers, over multiple devices.
- Convolution layers: mostly data parallelism.
- Fully connected layers: either data or model parallelism, depending on the size of the layer, the input, and memory.
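The two forms can be sketched in a few lines of NumPy on a fully connected layer; the sizes and the two-device split below are illustrative, not the configurations used in the paper.

```python
# Minimal NumPy sketch of the two distribution forms on a fully
# connected layer y = x @ W (sizes and the two-device split are
# illustrative, not the paper's configurations).
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 8))        # one input activation vector
W = rng.standard_normal((8, 4))        # fc-layer weight matrix

# Model parallelism: split W column-wise across two devices; each
# computes part of the output, and a merge step concatenates them.
W_dev1, W_dev2 = W[:, :2], W[:, 2:]
y_merged = np.concatenate([x @ W_dev1, x @ W_dev2], axis=1)
assert np.allclose(y_merged, x @ W)    # merge reproduces the full layer

# Data parallelism: each device keeps a full copy of W and handles a
# different input from the stream (e.g., round-robin dispatch).
inputs = [rng.standard_normal((1, 8)) for _ in range(4)]
outputs = [xi @ W for xi in inputs]    # input i goes to device i % n
```

Model parallelism halves each device's weight memory at the cost of a merge step, which is why the deck applies it to the large fully connected layers.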

SLIDE 13

Hardware Overview

- Raspberry Pi 3:
  - Cheap and accessible platform
  - Connected via a Wi-Fi router
  - No GPU
- Nvidia Jetson TX2:
  - High-end embedded platform
  - Has a GPU
- Moreover, we measured whole-system power with a power analyzer.

SLIDE 14

Software Overview

- Dependencies:
  - Ubuntu 16.04
  - Keras 2.1
    - With the TensorFlow backend for Raspberry Pis
    - With the TensorFlow-GPU backend for the TX2
  - Apache Avro for procedure calls and data serialization
- Image recognition models:
  - AlexNet
  - VGG16
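To illustrate the procedure-call-plus-serialization pattern that Avro provides here, the following stand-in sketch uses plain JSON over an in-memory channel rather than the actual Avro API; all names and the dummy task are hypothetical.

```python
# Stand-in for the Avro-based procedure calls (our sketch: plain JSON
# over an in-memory channel; names and the dummy task are hypothetical).
import json

def send(channel, message):
    """Serialize a request and put it on the wire (here, a list)."""
    channel.append(json.dumps(message).encode("utf-8"))

def recv(channel):
    """Take the next request off the wire and deserialize it."""
    return json.loads(channel.pop(0).decode("utf-8"))

# A worker device maps task names to layer computations (dummy here).
handlers = {"fc_1": lambda xs: [v * 2 for v in xs]}

channel = []
send(channel, {"task": "fc_1", "payload": [1, 2, 3]})   # device A sends
request = recv(channel)                                 # device B receives
result = handlers[request["task"]](request["payload"])  # -> [2, 4, 6]
```

In the real system, Avro additionally enforces a schema on each message, so both devices agree on the shape of the tensors being exchanged.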

SLIDE 15

AlexNet

- Input size: 220x220x3
- Five convolution layers
- Three fully connected layers (fc_1: 4096, fc_2: 4096, fc_3: 1000)

(Figure: conv2D and maxpool stack of 55x55x48, 27x27x128, 13x13x192, 13x13x192, and 13x13x128 feature maps, followed by fc_1, fc_2, and fc_3.)

A. Krizhevsky et al., "ImageNet Classification with Deep Convolutional Neural Networks," in NIPS 2012.

SLIDE 16

AlexNet Distribution I

Five-device system (the AlexNet diagram from the previous slide, partitioned as follows):
- Tasks of A: input stream
- Tasks of B: CNN layers
- Task of C: fc_1 (2k), model parallelism
- Task of D: fc_1 (2k), model parallelism
- Tasks of E: merge, then fc_2 (4k) and fc_3 (1k)

SLIDE 17

AlexNet Distribution II

Six-device system:
- Tasks of A: input stream
- Tasks of B & C: CNN layers, data parallelism
- Task of D: fc_1 (2k), model parallelism
- Task of E: fc_1 (2k), model parallelism
- Tasks of F: merge, then fc_2 (4k) and fc_3 (1k)
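A back-of-the-envelope throughput model for such a pipeline: steady-state IPS is set by the slowest stage, and replicating a stage with data parallelism divides its effective per-input time by the replica count. All latency numbers below are hypothetical, not measurements from the paper.

```python
# Hypothetical stage latencies (ms) for a distributed AlexNet-style
# pipeline; none of these are measured values from the paper.
stage_ms = {"input": 50, "cnn": 400, "fc1_half": 80, "fc2_fc3": 120}
# Data parallelism: the CNN stage is replicated on two devices, halving
# its effective per-input time. The fc_1 halves (model parallelism) run
# concurrently, so they already count as a single 80 ms stage.
replicas = {"input": 1, "cnn": 2, "fc1_half": 1, "fc2_fc3": 1}

effective = {s: t / replicas[s] for s, t in stage_ms.items()}
bottleneck_ms = max(effective.values())   # slowest pipeline stage
ips = 1000.0 / bottleneck_ms              # inferences per second
```

This is why the six-device system adds the second CNN device rather than splitting the fully connected layers further: the convolution stage dominates per-input time.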

SLIDE 18

AlexNet Results

- Comparable IPS to the TX2 (-30%)
- Lower dynamic energy consumption

(Charts: Inferences per Second (IPS) and Energy per Inference in J, broken into dynamic, static, and total energy, for TX2 (GPU), TX2 (CPU), 5-device, and 6-device systems.)

SLIDE 19

VGG16

- Input size: 224x224x3
- 13 convolution layers in five blocks
- Three fully connected layers (fc_1: 4096, fc_2: 4096, fc_3: 1000)

(Figure: Block 1: 224x224x64; Block 2: 112x112x128; Block 3: 56x56x256; Block 4: 28x28x512; Block 5: 14x14x512. Each block is a stack of 3x3 conv2D layers followed by maxpool, and the blocks feed into fc_1, fc_2, and fc_3.)

K. Simonyan et al., "Very Deep Convolutional Networks for Large-Scale Image Recognition," in ICLR, 2015.

SLIDE 20

VGG16 Distribution I

Nine-device system:
- Tasks of A: input stream
- Tasks of B: Block 1
- Tasks of C, D, & E: Blocks 2, 3, and 4, data parallelism
- Tasks of F: Block 5
- Task of G: fc_1 (2k), model parallelism
- Task of H: fc_1 (2k), model parallelism
- Tasks of I: merge, then fc_2 (4k) and fc_3 (1k)

SLIDE 21

VGG16 Distribution II

11-device system:
- Tasks of A: input stream
- Tasks of B, C, D, E, F, G, & H: Blocks 1 through 5, data parallelism
- Task of J: fc_1 (2k), model parallelism
- Task of K: fc_1 (2k), model parallelism
- Task of L: merge, then fc_2 (4k) and fc_3 (1k)

SLIDE 22

VGG16 Results

- Comparable IPS to the TX2 (-15%)
- We achieve a 2.3x speedup by reassigning CNN blocks

(Charts: Inferences per Second (IPS) and Energy per Inference in J, broken into dynamic, static, and total energy, for TX2 (GPU), TX2 (CPU), 9-device, and 11-device systems.)

SLIDE 23

Conclusions

- We used a farm of Raspberry Pis for DNN processing
- We are able to process IoT data locally by distribution
- Our technique achieves acceptable real-time performance
- Future work:
  - Study the robustness of such systems
  - Apply our technique to more DNN models
  - Implement our approach on distributed robot systems