SLIDE 1

DeepCache: Principled Cache for Mobile Deep Vision

Mengwei Xu1, Mengze Zhu1, Yunxin Liu2, Felix Xiaozhu Lin3, Xuanzhe Liu1

1Peking University, 2Microsoft Research, 3Purdue University

SLIDE 2

Background: Mobile Vision

  • Your mobile device sees what you see, and does what you cannot do
  • Core: computer vision algorithm

Examples: augmented reality, recognition & detection, gaming, face beautification

SLIDE 3

Background: CNN-based Vision

  • Convolutional Neural Network (CNN) is the state-of-the-art vision algorithm.
  • CNN is accurate, but also resource-hungry.

CNN model: a graph of computation nodes (convolution, pooling, activation, etc.). The convolution operation: input feature map * kernel = output feature map.
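The convolution operation described above can be written out directly. This is an illustrative NumPy sketch of a single-channel "valid" convolution (really cross-correlation, as CNNs use), not the inference engine's implementation:

```python
import numpy as np

def conv2d(fmap, kernel):
    """Slide the kernel over the input feature map and take dot products."""
    kh, kw = kernel.shape
    oh = fmap.shape[0] - kh + 1
    ow = fmap.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(fmap[i:i + kh, j:j + kw] * kernel)
    return out

fmap = np.arange(16, dtype=float).reshape(4, 4)   # toy 4x4 input feature map
kernel = np.ones((3, 3)) / 9.0                    # 3x3 averaging kernel
print(conv2d(fmap, kernel).shape)                 # (2, 2): a 4x4 map shrinks to 2x2
```

Note how the output map is smaller than the input: the same boundary effect later shows up as cache erosion.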

SLIDE 4

Background: Optimizing CNN Workloads

  • Our approach: leverage the temporal locality of mobile video streams
  • Similar but not identical
  • Object movement/appearance
  • Camera movement
  • Light variation
  • etc…

Algorithm-level Compression

  • quantization
  • pruning
  • factorization
  • distilling

Hardware-level Acceleration

  • CPU
  • GPU
  • DSP
  • AI-specific chips


SLIDE 5

Caching Mobile Vision – a naïve approach

  • Just cache/reuse the final result based on the input image

[Diagram: the i-th frame is fed as input to the Inference Engine, which outputs class "elephant", position (-1.5, 7.9)]

SLIDE 6

Caching Mobile Vision – a naïve approach

  • Just cache/reuse the final result based on the input image

[Diagram: is the (i+1)-th frame similar to the cached i-th frame?]
SLIDE 7

Caching Mobile Vision – a naïve approach

  • Just cache/reuse the final result based on the input image

[Diagram: frames similar? YES: reuse the cached result, class "elephant", position (-1.5, 7.9)]
SLIDE 8

Caching Mobile Vision – a naïve approach

  • Just cache/reuse the final result based on the input image

[Diagram: frames similar? NO: run the Inference Engine, which outputs class "elephant", position (-1.1, 9.3)]
SLIDE 9

Caching Mobile Vision – a naïve approach

  • Just cache/reuse the final result based on the input image
  • Why is this not enough?
  • Coarse-grained: the whole image is the comparison unit
  • Cannot handle position-sensitive tasks

The two images are similar (similar background, similar animals), but the elephant's position is different!
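The naïve whole-frame scheme above can be sketched in a few lines. The CNN stand-in, cache layout, and similarity tolerance are illustrative placeholders, not from the paper:

```python
import numpy as np

def frames_similar(a, b, tol=8.0):
    """Crude whole-image similarity: mean absolute pixel difference.
    The tolerance is an illustrative choice."""
    return float(np.mean(np.abs(a.astype(float) - b.astype(float)))) < tol

def naive_cached_infer(frame, run_cnn, cache):
    """If the whole current frame is similar to the previously processed one,
    return the cached result; otherwise run the CNN and refresh the cache."""
    if "frame" in cache and frames_similar(cache["frame"], frame):
        return cache["result"]          # reuse: possibly stale positions!
    cache["frame"], cache["result"] = frame, run_cnn(frame)
    return cache["result"]

calls = []
def toy_cnn(img):                       # stand-in for a real CNN
    calls.append(1)
    return {"class": "elephant", "pos": (-1.5, 7.9)}

cache = {}
frame0 = np.zeros((8, 8))
naive_cached_infer(frame0, toy_cnn, cache)
naive_cached_infer(frame0 + 1.0, toy_cnn, cache)   # similar frame: served from cache
print(len(calls))                                  # 1: the CNN ran only once
```

The returned position comes from the old frame, which is exactly the position-sensitivity problem this slide points out.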

SLIDE 10

Caching Mobile Vision – DeepCache

  • Treat the image as a collection of blocks, and cache/reuse them at a fine granularity.

KEY IDEA: reuse the CNN computations of similar regions. [Diagram: matching regions of the previous and current frames are cached and reused]

SLIDE 11

Caching Mobile Vision – DeepCache

  • Treat the image as a collection of blocks, and cache/reuse them at a fine granularity.

[Diagram: the i-th frame goes through the Inference Engine, yielding class "elephant", position (-1.5, 7.9); the (i+1)-th frame is matched against it for reusable regions, then runs through a revised Inference Engine with layer-level cache/reuse, yielding class "elephant", position (-1.1, 9.3)]

SLIDE 12

Challenges of DeepCache

  • Scene variation – the overall background may shift between frames
  • Moving cameras, autonomous driving, drones, etc.

[Figure: the current frame is roughly the previous frame shifted by an offset]

SLIDE 13

Challenges of DeepCache

  • Cache erosion – reusability tends to diminish at deeper layers

[Figure: a 5x5 reusable region in the input feature map shrinks to a 3x3 reusable region in the output feature map after a 3x3 convolution; the eroded border requires re-computation]
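The erosion arithmetic is simple: only outputs whose whole receptive field lies inside the reusable input region can be reused. A sketch, assuming a square region and a "valid" convolution:

```python
def reusable_after_conv(region, kernel, stride=1):
    """Side length of the reusable square region after a convolution: an
    output is reusable only if its entire kernel x kernel receptive field
    lies inside the reusable input region."""
    return max(0, (region - kernel) // stride + 1)

print(reusable_after_conv(5, 3))   # 3: a 5x5 reusable input region -> 3x3 output
print(reusable_after_conv(3, 3))   # 1
print(reusable_after_conv(2, 3))   # 0: fully eroded
```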

SLIDE 14

Challenges of DeepCache

  • Cache erosion – reusability tends to diminish at deeper layers

[Figure: (a) the "best" match, with the highest matching score, may scatter blocks B1 and B2 across the previous frame; (b) the "proper" match keeps B1 and B2 adjacent, with a still-high matching score]

  • 1. Merge smaller regions into bigger ones
  • 2. Good news: early layers contribute most of the computation cost and also suffer less cache erosion.

SLIDE 15

DeepCache Design: Overview

  • Design Principles
  • No cloud offloading
  • No efforts from developers
  • No modification to models

[Architecture: Camera → raw image → pre-processing (resizing etc.) → image match between previous and current frame → reusable regions → cache-aware CNN inference engine (conv, pool, fc layers; cache and reuse) → output to vision applications; cache storage resides in the operating system's storage; computation runs on processors (CPU, GPU, etc.)]

  • Two modules
  • Image matcher
  • Cache-aware inference engine
SLIDE 16

DeepCache Design: Image Matching

  • Principles: high similarity, low overhead, and merging into large regions.
  • Input: two raw images
  • Output: a set of matched rectangles
  • (x1, y1, w, h) in current frame -> (x2, y2, w, h) in previous frame
SLIDE 17

DeepCache Design: Image Matching

  • Step 1: divide the current frame into an NxN grid.
  • N is a configurable parameter (default: 10 x 10).


SLIDE 18

DeepCache Design: Image Matching

  • Step 2: for each divided block, find the best-matching block in the previous frame
  • Motion estimation: diamond search

[Figure: block (x1, y1) in the current frame matched to block (x2, y2) in the previous frame]
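Diamond search can be sketched as follows. This is a simplified textbook version (SAD cost, large-diamond steps then one small-diamond refinement), not the paper's implementation; the block size, iteration cap, and test image are illustrative:

```python
import numpy as np

def sad(a, b):
    """Sum of absolute differences between two equally sized blocks."""
    return float(np.abs(a.astype(float) - b.astype(float)).sum())

def diamond_search(cur, prev, x, y, bs=16, max_iter=20):
    """Find the block of `prev` best matching the bs x bs block of `cur`
    whose top-left corner is (x, y); returns the matched top-left corner."""
    block = cur[y:y + bs, x:x + bs]
    H, W = prev.shape[:2]

    def cost(px, py):
        if 0 <= px <= W - bs and 0 <= py <= H - bs:
            return sad(block, prev[py:py + bs, px:px + bs])
        return float("inf")             # candidate falls outside the frame

    ldsp = [(0, 0), (2, 0), (-2, 0), (0, 2), (0, -2),
            (1, 1), (1, -1), (-1, 1), (-1, -1)]   # large diamond pattern
    sdsp = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]  # small diamond pattern
    cx, cy = x, y
    for _ in range(max_iter):           # large-diamond steps until center wins
        dx, dy = min(ldsp, key=lambda d: cost(cx + d[0], cy + d[1]))
        if (dx, dy) == (0, 0):
            break
        cx, cy = cx + dx, cy + dy
    dx, dy = min(sdsp, key=lambda d: cost(cx + d[0], cy + d[1]))  # refine
    return cx + dx, cy + dy

yy, xx = np.mgrid[0:80, 0:80]
prev = np.sin(yy / 7.0) * 50 + np.cos(xx / 5.0) * 50   # smooth synthetic frame
cur = np.roll(np.roll(prev, -2, axis=0), -3, axis=1)   # scene shifted by (3, 2)
print(diamond_search(cur, prev, 20, 20))
```

Because each step only inspects a handful of candidates, the search touches far fewer blocks than an exhaustive window scan, which is why it suits mobile budgets.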

SLIDE 19

DeepCache Design: Image Matching

  • Step 3: calculate the average block movement (offset): (Mx, My).
  • Filter out the outliers

[Figure: per-block motion vectors from (x1, y1) in the current frame to matched positions (x2, y2) in the previous frame]
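Step 3 can be sketched as follows; the outlier rule here (drop vectors beyond 1.5 standard deviations from the mean) is an illustrative choice, not necessarily the paper's:

```python
import numpy as np

def average_offset(motion_vectors, tol=1.5):
    """Average block movement (Mx, My) with simple outlier filtering: drop
    vectors farther than `tol` standard deviations from the mean in either
    coordinate, then re-average the survivors."""
    v = np.asarray(motion_vectors, dtype=float)      # rows of (dx, dy)
    mean, std = v.mean(axis=0), v.std(axis=0) + 1e-9
    keep = (np.abs(v - mean) <= tol * std).all(axis=1)
    mx, my = v[keep].mean(axis=0)
    return float(mx), float(my)

moves = [(3, 1), (3, 1), (2, 1), (3, 2), (40, -25)]  # last vector is an outlier
print(average_offset(moves))                         # (2.75, 1.25)
```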

SLIDE 20

DeepCache Design: Image Matching

  • Step 4: calculate the similarity between block (x1, y1) in the current frame and the block (x1+Mx, y1+My), shifted by the average movement, in the previous frame
  • Metric: Peak Signal-to-Noise Ratio (PSNR)

[Figure: block (x1, y1) in the current frame compared with block (x1+Mx, y1+My) in the previous frame; PSNR 24 (reusable), PSNR 21 (reusable)]
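PSNR itself is standard; a minimal sketch for 8-bit image blocks:

```python
import numpy as np

def psnr(a, b, peak=255.0):
    """Peak Signal-to-Noise Ratio in dB: 10 * log10(peak^2 / MSE)."""
    mse = np.mean((a.astype(float) - b.astype(float)) ** 2)
    if mse == 0:
        return float("inf")             # identical blocks
    return 10.0 * np.log10(peak ** 2 / mse)

a = np.full((8, 8), 100.0)
b = a + 4.0                             # uniform error of 4 -> MSE = 16
print(round(psnr(a, b), 2))             # 10*log10(255^2/16) = 36.09
```

Higher PSNR means a closer match; a block is declared reusable when its PSNR clears a configurable threshold.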

SLIDE 21

DeepCache Design: Image Matching

  • Step 5: merge blocks into larger ones if possible

[Figure: adjacent reusable blocks around (x1, y1) in the current frame merged into one larger rectangle, mapped to (x1+Mx, y1+My) in the previous frame]
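Step 5 can be sketched as a greedy merge over the grid of reusable cells; the merging rule here (stack identical horizontal runs across consecutive rows) is an illustrative simplification, not the paper's exact algorithm:

```python
def merge_blocks(grid):
    """Merge adjacent reusable grid cells into larger rectangles.
    `grid[r][c]` is truthy if the cell is reusable; rectangles are returned
    as (row, col, height, width)."""
    rows, cols = len(grid), len(grid[0])
    runs = []                                   # horizontal runs: (row, col, width)
    for r in range(rows):
        c = 0
        while c < cols:
            if grid[r][c]:
                start = c
                while c < cols and grid[r][c]:
                    c += 1
                runs.append((r, start, c - start))
            else:
                c += 1
    rects, used = [], set()                     # stack equal runs of consecutive rows
    for i, (r, c0, w) in enumerate(runs):
        if i in used:
            continue
        h = 1
        while (r + h, c0, w) in runs:
            used.add(runs.index((r + h, c0, w)))
            h += 1
        rects.append((r, c0, h, w))
    return rects

grid = [[1, 1, 0],
        [1, 1, 0],
        [0, 0, 1]]
print(merge_blocks(grid))   # [(0, 0, 2, 2), (2, 2, 1, 1)]
```

Fewer, larger rectangles matter because big regions survive cache erosion at deeper layers far better than scattered small blocks.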
SLIDE 22

DeepCache Design: Image Matching

  • Optimization 1: skip block matching in Step 2 (k-skip)
  • Optimization 2: in Step 4, reuse the matching scores computed in Step 2
  • Not always applicable: depends on the average movement

[Figure: with 2-skip, 8/16 blocks are matched; with 3-skip, 6/16 blocks are matched]

SLIDE 23

DeepCache Design: Cache-aware CNN Inference

  • Propagation: the reusable regions produced by image matching do not stay unchanged as they pass through the CNN layers.

[Figure: a reusable region shrinks as it propagates through a Convolution layer (kernel 11x11, stride 2, padding 5) and a Pooling layer (kernel 3x3, stride 2, padding 1), while ReLU leaves it unchanged]

Because of cache erosion! But what affects cache erosion?
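The propagation rule can be sketched per dimension with standard convolution indexing: output index o reads padded inputs [o*s - p, o*s - p + k - 1], so o is reusable only if that whole window lies inside the reusable input interval. A sketch under those standard assumptions, not the paper's code:

```python
import math

def propagate_1d(a, b, kernel, stride, padding):
    """Map a reusable input interval [a, b] (inclusive) through one conv
    dimension; returns the reusable output interval, or None if the region
    is fully eroded."""
    lo = math.ceil((a + padding) / stride)
    hi = math.floor((b - kernel + 1 + padding) / stride)
    return (lo, hi) if lo <= hi else None

# 5-wide reusable region [0, 4], 3x3 conv, stride 1, no padding -> 3-wide [0, 2]
print(propagate_1d(0, 4, kernel=3, stride=1, padding=0))   # (0, 2)
print(propagate_1d(2, 9, kernel=3, stride=2, padding=1))   # (2, 4)
```

Applying this per layer answers the question above: kernel size, stride, and padding together determine how fast a region erodes.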

SLIDE 24

DeepCache Design: Cache-aware CNN Inference

  • Propagation: the reusable regions produced by image matching do not stay unchanged as they pass through the CNN layers.

[Figure: depending on its position and the layer's kernel and stride, a region may see no erosion, partial erosion, or full erosion]

SLIDE 25

DeepCache Design: Cache-aware CNN Inference

  • Why propagation? Why not match the input of each layer?
  • Low return: feature maps are high-dimensional data, difficult to interpret.
  • High cost: matching feature maps requires a lot of computation (40× compared to propagation for ResNet).

[Chart: normalized latency on AlexNet, GoogLeNet, ResNet-50, YOLO, and Dave-orig: DeepCache = 1.00, while matching inter-layer (MIL-50%, MIL-75%, MIL-100%) is 1.84× to 48.59× slower]

  • DeepCache: match input images once, then use propagation
  • MIL: matching inter-layer
SLIDE 26

DeepCache Design: Cache-aware CNN Inference

  • Cache/Reuse: reuse the computation results at the output of convolutions.
  • Mind the data locality during reuse!
  • Depends on the convolution implementation: im2col + GEMM, unrolled, etc.

[Chart: per-layer share of AlexNet latency (%): conv1 29.2, lrn1 5.7, conv2 21.0, lrn2 2.9, conv3 13.0, conv4 9.5, conv5 6.5, fc6 5.9, fc7 4.4, fc8 0.8; relu, pool, and prob layers are negligible]

Convolution is often the dominant layer (> 80% overall computations)

SLIDE 27

DeepCache Implementation

  • Image matching is implemented based on RenderScript
  • A programming framework on Android for intensive computations
  • GPU support, generic, highly data-parallel
  • Cache-awareness is built upon ncnn
  • Popular deep learning inference framework for mobile devices
  • High speed, lightweight, no dependency
SLIDE 28

Evaluation – Setup

  • Popular CNN models and datasets
  • Platform: Nexus 6, Android 6.0
  • Alternatives
  • ncnn without cache
  • The coarse-grained cache used in DeepMon [1]

[1] Huynh Nguyen Loc, Youngki Lee, and Rajesh Krishna Balan. 2017. DeepMon: Mobile GPU-based Deep Learning Framework for Continuous Vision Applications. In Proceedings of the 15th Annual International Conference on Mobile Systems, Applications, and Services (MobiSys’17)

SLIDE 29

Evaluation – Execution Speedup

  • DeepCache saves 15% ~ 28% of model execution time (about 2× the savings of DeepMon)
  • The speedup depends on the model architecture
  • Deeper layers, less savings

[Charts: normalized processing latency on REC_1, REC_2, REC_3, DET, DRV: no-cache = 1.000, DeepMon 0.870–0.932, DeepCache 0.720–0.862. Per-conv-layer latency (ms) of no-cache vs. DeepCache across Conv_1 to Conv_50: savings shrink at deeper layers]

SLIDE 30

Evaluation – Energy Saving

  • DeepCache saves around 20% of the energy consumed to process the same number of images.
  • The energy saved by using DeepCache to process 10 images (ResNet) equals 40 seconds of playing video on mobile devices.

Energy (J):

            DRV    REC_1  REC_2  REC_3  DET
no-cache    1.83   6.86   11.85  35.63  39.90
DeepMon     1.67   6.01   11.11  33.26  37.27
DeepCache   1.49   4.90    9.71  28.97  34.47

SLIDE 31

Evaluation – Accuracy

  • DeepCache is able to keep the accuracy
  • ~2% drop on average, similar to DeepMon

[Charts: Euclidean distance between cached and exact feature maps for AlexNet and YOLO; accuracy drop (%) on REC_2 and REC_3, top-1 and top-3, for DeepMon vs. DeepCache, all within 3%]

SLIDE 32

Evaluation – Memory Overhead

  • The overhead of the cache is acceptable
  • 2 MB ~ 44 MB, while today's mobile devices usually have more than 1 GB of memory
  • Note: we only cache the results of convolutional layers, and only for one frame

Memory Usage (MB):

            REC_1  REC_2  REC_3  DET    DRV   SqueezeNet  DeepFace  MobileNet
no-cache    252.2   81.4  342.8  407.7  18.5   36.2       135.3     100.9
DeepCache   254.7   94.0  386.6  422.1  21.7   47.0       144.5     121.7

SLIDE 33

Conclusion

  • DeepCache: cache design for mobile deep vision
  • Image matching on raw images
  • Cache/reuse in inference engine
  • Evaluation
  • ~20% execution speedup and energy savings
  • Little accuracy loss

Thank you for your attention!

SLIDE 34

Evaluation – Execution Speedup

  • DeepCache saves 15% ~ 28% of model execution time (more than DeepMon)
  • Performance depends on the scenario
[Charts: per-video processing latency (ms) on videos T1–T10 for four models, comparing no-cache, DeepMon, and DeepCache; DeepCache is fastest in most cases]

T1: Basketball T2: ApplyEyeMakeup T3: CleanAndJerk T4: Billiards T5: BandMarching T6: ApplyLipstick T7: CliffDiving T8: BrushingTeeth T9: BlowDryHair T10: BalanceBeam

SLIDE 35

Evaluation – Configuration

  • Some configurations can be tailored to make a better trade-off among accuracy, latency, and energy for different scenarios and applications.
  • Matching threshold, block size

[Charts: processing latency (ms) and accuracy drop (%) as functions of block size, and as functions of the matching threshold]