Real-Time Image Recognition Using Collaborative IoT Devices
Jiashen Cao, Matthew Woodward, Michael S. Ryoo, and Hyesoon Kim
Georgia Institute of Technology; Indiana University; EgoVid Inc.
Prevalence of IoT Devices
ReQuEST workshop 2018
} Internet of Things (IoT) devices are everywhere:
} Smart locks, smart sprinklers, smart plugs, smart baby monitors, smart cookers, smart thermostats, smart mirrors, smart cleaners, and smart refrigerators
https://www.pentasecurity.com/blog/10-smartest-iot-devices-2017/
} Many of these devices generate and capture an abundance of real-time raw data, such as images.
How to Process IoT Data?
} Advancements in deep neural networks (DNNs) provide high-accuracy solutions to previously impossible tasks:
} Image recognition
} Face recognition
} Video (action recognition)
} Voice recognition
} Performing these tasks in real time requires high computational power.
Where to Process (I)
} (Option A) Use the individual IoT device
} Limited energy (e.g., battery powered)
} Limited compute power
} So, unable to meet timing constraints
} (Option B) Offload to the cloud
} Such as the voice recognition services of Apple's Siri, Amazon's Echo, Microsoft's Cortana, and Google Home
} Any problem?
Where to Process (II)
} (Option B) Cloud processing is promising, but:
} Not scalable: more traffic, data, and storage; IoT devices outnumbered the world population in 2017
} Privacy and security: voice recognition? Big Brother's spying devices in the novel 1984; requires multiple layers of network security, encryption, etc.
} Quality of service (QoS) and reliability: we have a tight timing constraint for real-time recognition
F. Biscotti et al., "The Impact of the Internet of Things on Data Centers," Gartner Research, vol. 18, 2014.
Where to Process (III)
} (Option C) What if we could harvest the aggregated computational power of local IoT devices?
} At a given time, not all devices are fully utilized
Collaborative IoT Devices
} (Option C) We study such collaboration between IoT devices in our paper, Musical Chair.
} Our performance metric: inferences per second (IPS)
} We use the same models, so we have the same accuracy
} In this work, we showcase the application of Musical Chair to image recognition models on a farm of Raspberry Pis.
Hadidi et al., "Musical Chair: Efficient Real-Time Recognition Using Collaborative IoT Devices," arXiv preprint arXiv:1802.02138 (2018).
Outline
} Motivation
} Musical Chair
} Data and model parallelism
} Hardware and software overview
} System evaluations
} Conclusion
Musical Chair
} Musical Chair is a technique for distributing DNN computations over multiple IoT devices.
[Figure: the Musical Chair workflow. (i) Profiling phase: profile the DNN layers (conv and fc) on the hardware to build per-layer behavior models. (ii) Gather data on the environment and the DNN model: communication latency and bandwidth, and the number of devices (n). (iii) Task-assignment phase: generate task assignments for 1…n devices; e.g., in a four-node system, Node A records and computes optical flow, Node B runs the spatial CNN, Node C runs the temporal CNN, and Node D runs maxpool and the dense layers.]
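The task-assignment phase can be illustrated with a minimal sketch: given profiled per-layer latencies, choose contiguous layer groups for n devices so that the slowest pipeline stage is as fast as possible. The exhaustive search below is only an illustration of the idea, not the paper's actual algorithm, and the latencies are made-up values.

```python
from itertools import combinations

def best_split(layer_times, n_devices):
    """Assign contiguous DNN layers to devices, minimizing the time of
    the slowest stage (which bounds pipeline throughput)."""
    k = len(layer_times)
    best = None
    # try every placement of n_devices - 1 cut points between layers
    for cuts in combinations(range(1, k), n_devices - 1):
        bounds = (0,) + cuts + (k,)
        groups = [layer_times[a:b] for a, b in zip(bounds, bounds[1:])]
        bottleneck = max(sum(g) for g in groups)
        if best is None or bottleneck < best[0]:
            best = (bottleneck, groups)
    return best

# Made-up per-layer latencies (ms) for five conv and three fc layers:
times = [15, 16, 17, 18, 19, 20, 30, 10]
bottleneck, assignment = best_split(times, 3)
print(bottleneck, assignment)  # the slowest stage bounds IPS at 1000/bottleneck
```

In a real deployment the per-layer times come from the hardware-profiling phase, and communication latency would be added to each stage's cost.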
Model & Data Parallelism
} Two forms of distribution:
[Figure: an arbitrary task assignment (Tasks A, B, C) over a custom DNN model with inputs X1–X4 and outputs Y1–Y3. Data parallelism runs a copy of Task B on a second input; model parallelism splits Task B into Part 1 and Part 2, whose partial outputs are merged.]
} Data parallelism: providing the next input to multiple devices in the network.
} Model parallelism: splitting parts of a given layer, or a group of layers, over multiple devices.
} Convolution layers: mostly data parallelism
} Fully connected layers: either data or model parallelism, depending on the size of the layer, the input, and memory
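The two schemes can be sketched in a few lines of NumPy for a fully connected layer; the shapes and values here are illustrative, not the paper's layer sizes.

```python
import numpy as np

rng = np.random.default_rng(0)

# A fully connected "Task B": y = x @ W (bias omitted; sizes illustrative).
W = rng.standard_normal((8, 4))

def fc(x, weights):
    return x @ weights

# Data parallelism: each device holds a full copy of Task B and serves
# a different input, multiplying throughput.
x1 = rng.standard_normal(8)
x2 = rng.standard_normal(8)
out_dev1 = fc(x1, W)  # device 1 processes input 1
out_dev2 = fc(x2, W)  # device 2 processes input 2

# Model parallelism: one input, weight matrix split column-wise across
# two devices; concatenating the partial outputs reproduces the full layer.
out_part1 = fc(x1, W[:, :2])  # device 1 computes half the neurons
out_part2 = fc(x1, W[:, 2:])  # device 2 computes the other half
merged = np.concatenate([out_part1, out_part2])

assert np.allclose(merged, fc(x1, W))  # identical to the unsplit layer
```

Note the trade-off visible even in this toy: data parallelism stores the whole weight matrix on every device, while model parallelism stores only a slice per device, which is what makes it attractive for large fc layers on memory-limited hardware.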
Hardware Overview
} Raspberry Pi 3:
} Cheap and accessible platform
} Connected via a Wi-Fi router
} No GPU
} Nvidia Jetson TX2:
} High-end embedded platform
} Has a GPU
} Moreover, we measured whole-system power with a power analyzer.
Software Overview
} Dependencies:
} Ubuntu 16.04
} Keras 2.1
} With TensorFlow backend for the Raspberry Pis
} With TensorFlow-GPU backend for the TX2
} Apache Avro for remote procedure calls and data serialization
} Image recognition models:
} AlexNet
} VGG16
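To give a flavor of the serialization layer, a hypothetical Avro record schema for shipping an intermediate activation from one device to the next might look like the following. The field names and layout are illustrative assumptions, not the actual Musical Chair schema.

```python
import json

# Hypothetical Avro record schema for an intermediate DNN activation
# sent between devices; field names are assumptions for illustration.
activation_schema = {
    "namespace": "example.musicalchair",
    "type": "record",
    "name": "Activation",
    "fields": [
        {"name": "frame_id", "type": "long"},   # which input frame
        {"name": "layer", "type": "string"},    # producing layer, e.g. "fc_1"
        {"name": "shape", "type": {"type": "array", "items": "int"}},
        {"name": "data", "type": "bytes"},      # raw float32 buffer
    ],
}

# Avro schemas are plain JSON, so the declaration serializes directly:
schema_json = json.dumps(activation_schema)
```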
AlexNet
Input size: 220x220x3; five convolution layers; three fully connected layers.
A. Krizhevsky et al., "ImageNet Classification with Deep Convolutional Neural Networks," in NIPS 2012.
[Figure: AlexNet architecture — input (220x220x3) → conv2D (11x11, 48, 55x55) → maxpool → conv2D (5x5, 128, 27x27) → maxpool → conv2D (3x3, 192, 13x13) → conv2D (3x3, 192, 13x13) → conv2D (3x3, 128, 13x13) → maxpool → fc_1 (4092) → fc_2 (4092) → fc_3 (1000).]
AlexNet Distribution I
Five-device system:
[Figure: Node A handles the input stream; Node B runs the CNN layers; Nodes C and D each run an fc_1 half (2k, model parallelism); their outputs are merged; Node E runs fc_2 (4k) and fc_3 (1k).]
AlexNet Distribution II
Six-device system:
[Figure: Node A handles the input stream; Nodes B and C each run a copy of the CNN layers (data parallelism); Nodes D and E each run an fc_1 half (2k, model parallelism); their outputs are merged; Node F runs fc_2 (4k) and fc_3 (1k).]
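The benefit of the sixth device can be seen with a toy throughput model: a pipeline's IPS is bounded by its slowest stage, and replicating a stage (data parallelism) or splitting it in half (model parallelism) divides that stage's effective time. The stage times below are illustrative assumptions, not measured numbers from the paper.

```python
# Throughput of a device pipeline is bounded by the slowest stage;
# running a stage on k devices divides its effective time by k.
# Stage times (ms) are illustrative assumptions.

def pipeline_ips(stages):
    """stages: list of (time_ms, replicas). Returns inferences/second."""
    bottleneck_ms = max(t / k for t, k in stages)
    return 1000.0 / bottleneck_ms

# Five-device AlexNet split: the single CNN node dominates.
five_dev = [(40, 1),   # Node B: all CNN layers
            (15, 2),   # Nodes C & D: fc_1 halves (model parallelism)
            (10, 1)]   # Node E: fc_2 + fc_3

# Six-device split: the CNN stage is replicated on Nodes B & C.
six_dev = [(40, 2), (15, 2), (10, 1)]

print(pipeline_ips(five_dev))  # → 25.0 (CNN stage is the bottleneck)
print(pipeline_ips(six_dev))   # → 50.0 (bottleneck shifts off the CNN stage)
```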
AlexNet Results
Comparable IPS to the TX2 (within 30%), with lower dynamic energy consumption.
[Charts: energy per inference (J), split into dynamic and static energy, and total energy; and inferences per second (IPS), for TX2 (GPU), TX2 (CPU), 5-device, and 6-device systems.]
VGG16
Input size: 224x224x3; 13 convolution layers; three fully connected layers.
K. Simonyan et al., "Very Deep Convolutional Networks for Large-Scale Image Recognition," in ICLR, 2015.
[Figure: VGG16 architecture — Block 1: 2x conv2D (3x3, 64) at 224x224, maxpool; Block 2: 2x conv2D (3x3, 128) at 112x112, maxpool; Block 3: 3x conv2D (3x3, 256) at 56x56, maxpool; Block 4: 3x conv2D (3x3, 512) at 28x28, maxpool; Block 5: 3x conv2D (3x3, 512) at 14x14, maxpool; then fc_1 (4092), fc_2 (4092), fc_3 (1000).]
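A back-of-the-envelope parameter count (using the fc sizes shown on the slide and the standard 7x7x512 flattened conv output as the fc_1 input) suggests why the fully connected layers, rather than the convolution layers, are the ones split with model parallelism on a memory-limited Raspberry Pi:

```python
# Rough weight counts for VGG16's fully connected layers, taking the
# slide's sizes (fc_1 = fc_2 = 4092, fc_3 = 1000) at face value and the
# standard 7x7x512 flattened conv output as the fc_1 input. The total is
# a sizable fraction of the Raspberry Pi 3's 1 GB of RAM, which is why
# fc layers are the candidates for model parallelism.

flat = 7 * 7 * 512            # 25088 inputs to fc_1

fc1 = flat * 4092             # weights of fc_1
fc2 = 4092 * 4092             # weights of fc_2
fc3 = 4092 * 1000             # weights of fc_3
total_fc = fc1 + fc2 + fc3

mib = total_fc * 4 / 2**20    # float32 weights in MiB
print(f"fc parameters: {total_fc:,} (~{mib:.0f} MiB as float32)")
```

Splitting fc_1 into two 2k halves (as in the distributions that follow) roughly halves the per-device share of its dominant weight matrix.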
VGG16 Distribution I
Nine-device system:
[Figure: Node A handles the input stream; Node B runs Block 1; Nodes C, D, and E each run a copy of Blocks 2–4 (data parallelism); Node F runs Block 5; Nodes G and H each run an fc_1 half (2k, model parallelism); their outputs are merged; Node I runs fc_2 (4k) and fc_3 (1k).]
VGG16 Distribution II
11-device system:
[Figure: Node A handles the input stream; Nodes B–H each run a copy of Blocks 1–5 (data parallelism); Nodes J and K each run an fc_1 half (2k, model parallelism); their outputs are merged; Node L runs fc_2 (4k) and fc_3 (1k).]
VGG16 Results
Comparable IPS to the TX2 (within 15%). We achieve a 2.3x speedup by reassigning CNN blocks.
[Charts: energy per inference (J), split into dynamic and static energy, and total energy; and inferences per second (IPS), for TX2 (GPU), TX2 (CPU), 9-device, and 11-device systems.]
Conclusions
} We used a farm of Raspberry Pis for DNN processing
} We are able to process IoT data locally through distribution
} Our technique achieves acceptable real-time performance
} Future work:
} Study the robustness of such systems
} Apply our technique to more DNN models
} Implement our technique on distributed robot systems