DeepCache: Principled Cache for Mobile Deep Vision
Mengwei Xu1, Mengze Zhu1, Yunxin Liu2 Felix Xiaozhu Lin3, Xuanzhe Liu1
1Peking University, 2Microsoft Research, 3Purdue University
Background: Mobile Vision
Your mobile device sees:
Augmented Reality, Recognition & Detection, Games, Face Beauty
A CNN model is a graph of computation nodes (convolution, pooling, activation, etc.). The convolution operation slides a kernel over an input feature map to produce an output feature map (input feature map * kernel = output feature map).
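To make the convolution operation concrete, here is a minimal illustrative NumPy sketch (single channel, no padding; the function name and shapes are ours, not from the paper):

```python
import numpy as np

def conv2d(feature_map, kernel, stride=1):
    """Slide `kernel` over `feature_map`; each output element is the
    elementwise product-sum of the kernel with one input patch."""
    kh, kw = kernel.shape
    h, w = feature_map.shape
    out_h = (h - kh) // stride + 1
    out_w = (w - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = feature_map[i * stride:i * stride + kh,
                                j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

fmap = np.arange(16, dtype=float).reshape(4, 4)
out = conv2d(fmap, np.ones((3, 3)))  # 4x4 input, 3x3 kernel -> 2x2 output
```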
Algorithm-level Compression
Hardware-level Acceleration
Opportunity: consecutive video frames (previous frame vs. current frame) are often similar.
ith frame → input → Inference Engine → Class: elephant, Pos: (-1.5, 7.9)
(i+1)th frame → Similar to the ith frame?
YES: let's reuse the cached result: Class: elephant, Pos: (-1.5, 7.9)
NO: do the computation: Inference Engine → Class: elephant, Pos: (-1.1, 9.3)
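A toy sketch of this whole-frame reuse policy (the similarity metric, threshold, and stand-in CNN below are illustrative placeholders, not the paper's):

```python
import numpy as np

def infer_with_frame_cache(frame, prev_frame, prev_result, run_cnn, thresh=0.95):
    """If the new frame is similar enough to the previous frame,
    reuse the cached result; otherwise run the full CNN."""
    if prev_frame is not None:
        # Illustrative similarity: 1 - normalized mean absolute pixel difference.
        diff = np.mean(np.abs(frame.astype(float) - prev_frame.astype(float)))
        if 1.0 - diff / 255.0 >= thresh:
            return prev_result              # YES: let's reuse
    return run_cnn(frame)                   # NO: do the computation

calls = []
def run_cnn(frame):
    calls.append(frame)
    return ("elephant", (-1.5, 7.9))        # stand-in CNN output

f0 = np.zeros((8, 8))
r0 = infer_with_frame_cache(f0, None, None, run_cnn)     # first frame: computed
r1 = infer_with_frame_cache(f0.copy(), f0, r0, run_cnn)  # similar frame: reused
```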
But whole-frame reuse is too coarse: two images can be similar overall while the elephant's position differs between the previous frame and the current frame!
Our approach: reuse the CNN computations of similar regions.
Match reusable regions between the ith frame and the (i+1)th frame, cache their computations, and let a revised, cache-aware Inference Engine reuse them: Class: elephant, Pos: (-1.1, 9.3).
Layer-level cache/reuse and cache erosion:
A 5x5 reusable region matched between the previous frame and the current frame shrinks to a 3x3 reusable region after a 3x3 convolution: the border outputs also depend on non-reusable pixels, so they are eroded and re-computation is required for them in the current frame.
Erosion also shapes how regions should be matched:
(a) The "best" match picks, for each block (B1, B2), the previous-frame position with the highest matching score, but scattered matches fragment the cache.
(b) The "proper" match accepts a position with a merely high matching score that keeps B1 and B2 contiguous, so the matched regions also suffer less cache erosion.
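The shrinkage above follows directly from the receptive field; a small sketch for the stride-1 case (matching the slide's 5x5 → 3x3 example; the function is our illustration):

```python
def eroded_size(region_size, kernel_size):
    """After a k x k, stride-1 convolution, only outputs whose full
    receptive field lies inside the cached region remain reusable,
    so a border of width (k - 1) is eroded and must be re-computed."""
    return max(region_size - (kernel_size - 1), 0)

# 5x5 region after a 3x3 conv -> 3x3; small regions can erode away entirely.
sizes = [eroded_size(5, 3), eroded_size(3, 3), eroded_size(2, 3)]
```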
System overview:
Camera → raw image → pre-processing (resizing etc.) → input → deep learning engine w/ DeepCache → output → vision applications.
Inside the engine, an image match between the previous frame and the current frame yields reusable regions; a cache-aware CNN inference engine (conv → pool → conv → ... → fc) caches and reuses their computations.
The engine sits on top of the operating system, the processors (CPU, GPU, etc.), and storage (cache storage).
Region matching, borrowed from video codecs:
For a block at (x1, y1) in the current frame, find a matching block at some (x2, y2) in the previous frame.
Instead of searching the whole previous frame, start from (x1+Mx, y1+My), where (Mx, My) is the average movement of blocks in the previous frame, and search around that position.
A candidate is accepted as reusable if its similarity passes a threshold, e.g. PSNR 24 (reusable) or PSNR 21 (reusable).
To further cut matching cost, only a subset of blocks is matched directly: with 2-skip, 8/16 blocks are computed; with 3-skip, 6/16 blocks are computed.
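A brute-force sketch of this motion-guided block matching (DeepCache itself uses a more efficient video-codec-style search; the function names, search window, and threshold here are illustrative assumptions):

```python
import numpy as np

def psnr(a, b, peak=255.0):
    """Peak signal-to-noise ratio between two equally sized blocks."""
    mse = np.mean((a.astype(float) - b.astype(float)) ** 2)
    return float("inf") if mse == 0 else 10 * np.log10(peak * peak / mse)

def match_block(cur, prev, x1, y1, bs, mx, my, search=4, thresh=20.0):
    """Find where the current block at (x1, y1) came from in the previous
    frame, searching around (x1+mx, y1+my); return None if no candidate
    passes the PSNR threshold (the block must then be re-computed)."""
    block = cur[y1:y1 + bs, x1:x1 + bs]
    best, best_score = None, thresh
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            px, py = x1 + mx + dx, y1 + my + dy
            if 0 <= px <= prev.shape[1] - bs and 0 <= py <= prev.shape[0] - bs:
                score = psnr(block, prev[py:py + bs, px:px + bs])
                if score > best_score:
                    best, best_score = (px, py), score
    return best

prev = np.zeros((32, 32)); prev[10:18, 12:20] = 100.0  # object in previous frame
cur = np.zeros((32, 32)); cur[10:18, 14:22] = 100.0    # moved right by 2 pixels
match = match_block(cur, prev, x1=14, y1=10, bs=8, mx=0, my=0)
```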
Propagating reusable regions through the network (region boundaries are unchangeable during execution among CNN layers):
Convolution (Kernel=11x11, Stride=2, Padding=5): the region pair (100, 100, 100, 40) ⇒ (120, 120, 100, 40), given as (x, y, width, height) in the current and previous feature maps, propagates to (53, 53, 45, 15) ⇒ (63, 63, 45, 15). reusable
ReLU: element-wise, so the region (53, 53, 45, 15) ⇒ (63, 63, 45, 15) passes through unchanged. reusable
Pooling (Kernel=3x3, Stride=2, Padding=1): the region (53, 53, 45, 15) ⇒ (63, 63, 45, 15) propagates to (27, 27, 21, 7) ⇒ (32, 32, 21, 7). reusable
Why do the regions shrink? Because of cache erosion! But what affects cache erosion? The layer type: convolution and pooling cause partial erosion, fully-connected layers cause full erosion, and element-wise layers like ReLU cause no erosion. The propagation also extends to non-linear model graphs (e.g., propagation for ResNet).
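The slide's convolution numbers can be reproduced with a small coordinate-mapping sketch. Assuming regions are (x, y, width, height) and an output pixel is reusable only if its whole receptive field lies inside the input region:

```python
import math

def propagate_region(region, kernel, stride, padding):
    """Map a reusable region on a conv/pool layer's input to the output
    region whose receptive fields lie entirely inside it; everything
    outside is eroded and must be re-computed."""
    x, y, w, h = region

    def axis(start, size):
        lo = math.ceil((start + padding) / stride)                 # first safe output
        hi = (start + size - 1 + padding - kernel + 1) // stride   # last safe output
        return lo, max(hi - lo + 1, 0)

    ox, ow = axis(x, w)
    oy, oh = axis(y, h)
    return (ox, oy, ow, oh)

# Conv layer from the slide: Kernel=11x11, Stride=2, Padding=5.
out_cur = propagate_region((100, 100, 100, 40), kernel=11, stride=2, padding=5)
out_prev = propagate_region((120, 120, 100, 40), kernel=11, stride=2, padding=5)
```

This reproduces the slide's (53, 53, 45, 15) ⇒ (63, 63, 45, 15) result for the convolution layer.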
[Figure: Normalized latency of cache matching strategies across five models.]

             AlexNet  GoogLeNet  ResNet-50  YOLO   Dave-orig
DeepCache      1.00     1.00       1.00     1.00     1.00
MIL-50%        1.84    21.24      35.77    26.67     3.44
MIL-75%        2.25    28.63      43.67    34.31     3.57
MIL-100%       2.30    30.98      48.59    37.82     3.67

Matching inside CNN layers (MIL) costs up to ~48x more, which justifies matching only on the raw image and using propagation later.
[Figure: Normalized per-layer latency (%) breakdown of AlexNet.]

conv1 29.2, relu1 0.4, lrn1 5.7, pool1 0.2, conv2 21.0, relu2 0.0, lrn2 2.9, conv3 13.0, relu3 0.0, conv4 9.5, relu4 0.0, conv5 6.5, relu5 0.0, fc6 5.9, relu6 0.0, fc7 4.4, relu7 0.0, fc8 0.8, prob 0.4

Convolution is often the dominant layer type (> 80% of overall computation) [1].
[1] Huynh Nguyen Loc, Youngki Lee, and Rajesh Krishna Balan. 2017. DeepMon: Mobile GPU-based Deep Learning Framework for Continuous Vision Applications. In Proceedings of the 15th Annual International Conference on Mobile Systems, Applications, and Services (MobiSys’17)
[Figure: Normalized processing latency (lower is better).]

            REC_1  REC_2  REC_3  DET    DRV
no-cache    1.000  1.000  1.000  1.000  1.000
DeepMon     0.870  0.898  0.927  0.929  0.932
DeepCache   0.720  0.803  0.849  0.858  0.862
[Figure: Conv-layer latency (ms) at Conv_1, Conv_10, Conv_20, Conv_30, Conv_40, and Conv_50, comparing no-cache and DeepCache; the saving narrows at deeper layers.]
[Figure: Energy consumption (J).]

            DRV   REC_1  REC_2  REC_3  DET
no-cache    1.83   6.86  11.85  35.63  39.90
DeepMon     1.67   6.01  11.11  33.26  37.27
DeepCache   1.49   4.90   9.71  28.97  34.47
[Figure: Euclidean distance introduced by caching, for DeepMon and COMO, on AlexNet (scale up to 0.008) and YOLO (scale up to 0.006).]
[Figure: Accuracy drop (%).]

            REC_3-top1  REC_3-top3  REC_2-top1  REC_2-top3
DeepMon        2.43        1.14        2.98        0.91
DeepCache      2.29        1.43        2.62        1.06
[Figure: Memory usage (MB).]

            REC_1  REC_2  REC_3   DET   DRV   SqueezeNet  DeepFace  MobileNet
no-cache    252.2   81.4  342.8  407.7  18.5     36.2      135.3     100.9
DeepCache   254.7   94.0  386.6  422.1  21.7     47.0      144.5     121.7
[Figure: Per-video processing latency (ms) on ten videos, one panel per CNN model. T1: Basketball, T2: ApplyEyeMakeup, T3: CleanAndJerk, T4: Billiards, T5: BandMarching, T6: ApplyLipstick, T7: CliffDiving, T8: BrushingTeeth, T9: BlowDryHair, T10: BalanceBeam.]

            T1   T2   T3   T4   T5   T6   T7   T8   T9   T10
no-cache   917  866  895  901  915  901  886  891  893  889
DeepMon    881  792  771  681  881  813  782  844  815  783
DeepCache  748  636  716  504  891  788  748  731  764  665

            T1    T2    T3    T4    T5    T6    T7    T8    T9    T10
no-cache   2689  2806  2694  2704  2690  2615  2735  2749  2700  2688
DeepMon    2448  2651  2452  2252  2640  2334  2625  2422  2449  2492
DeepCache  2276  2197  2158  1658  2588  2004  2508  2115  2168  2281

            T1    T2    T3    T4    T5    T6    T7    T8    T9    T10
no-cache   2758  2780  2803  2803  2826  2731  2752  2832  2832  2825
DeepMon    2516  2502  2691  2491  2832  2583  2604  2665  2611  2475
DeepCache  2373  2298  2379  2159  2788  2455  2417  2386  2441  2284

Takeaway: DeepCache reduces processing latency, and energy, for different scenarios and applications.
[Figure: Sensitivity analysis. Left: varying the block size (4 to 16) trades off processing latency (~700-900 ms) against accuracy drop (~5-30%). Right: varying the matching threshold (5 to 25) trades off processing latency (~400-900 ms) against accuracy drop (~10-30%).]