SLIDE 1

DeepCache: Principled Cache for Mobile Deep Vision

Mengwei Xu1, Mengze Zhu1, Yunxin Liu2, Felix Xiaozhu Lin3, Xuanzhe Liu1

1Peking University, 2Microsoft Research, 3Purdue University

SLIDE 2

Background: Mobile Vision

  • Your mobile device sees what you see, and does what you cannot do
  • Core: computer vision algorithm

Examples: augmented reality, recognition & detection, gaming, face beautification

SLIDE 3

Background: CNN-based Vision

  • Convolutional Neural Network (CNN) is the state-of-the-art vision algorithm.
  • CNN is accurate, but also resource-hungry.

CNN model: a graph of computation nodes (convolution, pooling, activation, etc.). The convolution operation: input feature map * kernel = output feature map.
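The convolution operation described above can be written out directly. This is an illustrative NumPy sketch of a single-channel "valid" convolution (really cross-correlation, as CNNs use), not the inference engine's implementation:

```python
import numpy as np

def conv2d(fmap, kernel):
    """Slide the kernel over the input feature map and take dot products."""
    kh, kw = kernel.shape
    oh = fmap.shape[0] - kh + 1
    ow = fmap.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(fmap[i:i + kh, j:j + kw] * kernel)
    return out

fmap = np.arange(16, dtype=float).reshape(4, 4)   # toy 4x4 input feature map
kernel = np.ones((3, 3)) / 9.0                    # 3x3 averaging kernel
print(conv2d(fmap, kernel).shape)                 # (2, 2): a 4x4 map shrinks to 2x2
```

Note how the output map is smaller than the input: the same boundary effect later shows up as cache erosion.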

SLIDE 4

Background: Optimizing CNN Workloads

  • Our approach: leverage the temporal locality of mobile video streams
  • Similar but not identical
  • Object movement/appearance
  • Camera movement
  • Light variation
  • etc…

Algorithm-level Compression

  • quantization
  • pruning
  • factorization
  • distilling

Hardware-level Acceleration

  • CPU
  • GPU
  • DSP
  • AI-specific chips


SLIDE 5

Caching Mobile Vision – a naïve approach

  • Just cache/reuse the final result based on the input image

[Diagram: the i-th frame is fed as input to the Inference Engine, which outputs class "elephant", position (-1.5, 7.9)]

SLIDE 6

Caching Mobile Vision – a naïve approach

  • Just cache/reuse the final result based on the input image

[Diagram: is the (i+1)-th frame similar to the cached i-th frame?]
SLIDE 7

Caching Mobile Vision – a naïve approach

  • Just cache/reuse the final result based on the input image

[Diagram: frames similar? YES: reuse the cached result, class "elephant", position (-1.5, 7.9)]
SLIDE 8

Caching Mobile Vision – a naïve approach

  • Just cache/reuse the final result based on the input image

[Diagram: frames similar? NO: run the Inference Engine, which outputs class "elephant", position (-1.1, 9.3)]
SLIDE 9

Caching Mobile Vision – a naïve approach

  • Just cache/reuse the final result based on the input image
  • Why is this not enough?
  • Coarse-grained: the whole image is the comparison unit
  • Cannot handle position-sensitive tasks

The two images are similar (similar background, similar animals), but the elephant's position is different!
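The naïve whole-frame scheme above can be sketched in a few lines. The CNN stand-in, cache layout, and similarity tolerance are illustrative placeholders, not from the paper:

```python
import numpy as np

def frames_similar(a, b, tol=8.0):
    """Crude whole-image similarity: mean absolute pixel difference.
    The tolerance is an illustrative choice."""
    return float(np.mean(np.abs(a.astype(float) - b.astype(float)))) < tol

def naive_cached_infer(frame, run_cnn, cache):
    """If the whole current frame is similar to the previously processed one,
    return the cached result; otherwise run the CNN and refresh the cache."""
    if "frame" in cache and frames_similar(cache["frame"], frame):
        return cache["result"]          # reuse: possibly stale positions!
    cache["frame"], cache["result"] = frame, run_cnn(frame)
    return cache["result"]

calls = []
def toy_cnn(img):                       # stand-in for a real CNN
    calls.append(1)
    return {"class": "elephant", "pos": (-1.5, 7.9)}

cache = {}
frame0 = np.zeros((8, 8))
naive_cached_infer(frame0, toy_cnn, cache)
naive_cached_infer(frame0 + 1.0, toy_cnn, cache)   # similar frame: served from cache
print(len(calls))                                  # 1: the CNN ran only once
```

The returned position comes from the old frame, which is exactly the position-sensitivity problem this slide points out.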

SLIDE 10

Caching Mobile Vision – DeepCache

  • Treat the image as a collection of blocks, and cache/reuse them at a fine granularity.

KEY IDEA: reuse the CNN computations of similar regions. [Diagram: matching regions of the previous and current frames are cached and reused]

SLIDE 11

Caching Mobile Vision – DeepCache

  • Treat the image as a collection of blocks, and cache/reuse them at a fine granularity.

[Diagram: the i-th frame goes through the Inference Engine, yielding class "elephant", position (-1.5, 7.9); the (i+1)-th frame is matched against it for reusable regions, then runs through a revised Inference Engine with layer-level cache/reuse, yielding class "elephant", position (-1.1, 9.3)]

SLIDE 12

Challenges of DeepCache

  • Scene variation – the overall background may shift between frames
  • Moving cameras, autonomous driving, drones, etc.

[Figure: the current frame is roughly the previous frame shifted by an offset]

SLIDE 13

Challenges of DeepCache

  • Cache erosion – reusability tends to diminish at deeper layers

[Figure: a 5x5 reusable region in the input feature map shrinks to a 3x3 reusable region in the output feature map after a 3x3 convolution; the eroded border requires re-computation]
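The erosion arithmetic is simple: only outputs whose whole receptive field lies inside the reusable input region can be reused. A sketch, assuming a square region and a "valid" convolution:

```python
def reusable_after_conv(region, kernel, stride=1):
    """Side length of the reusable square region after a convolution: an
    output is reusable only if its entire kernel x kernel receptive field
    lies inside the reusable input region."""
    return max(0, (region - kernel) // stride + 1)

print(reusable_after_conv(5, 3))   # 3: a 5x5 reusable input region -> 3x3 output
print(reusable_after_conv(3, 3))   # 1
print(reusable_after_conv(2, 3))   # 0: fully eroded
```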

SLIDE 14

Challenges of DeepCache

  • Cache erosion – reusability tends to diminish at deeper layers

[Figure: (a) the "best" match, with the highest matching score, may scatter blocks B1 and B2 across the previous frame; (b) the "proper" match keeps B1 and B2 adjacent, with a still-high matching score]

  • 1. Merge smaller regions into bigger ones
  • 2. Good news: early layers contribute most of the computation cost and also suffer less cache erosion.

SLIDE 15

DeepCache Design: Overview

  • Design Principles
  • No cloud offloading
  • No efforts from developers
  • No modification to models

[Architecture: Camera → raw image → pre-processing (resizing etc.) → image match between previous and current frame → reusable regions → cache-aware CNN inference engine (conv, pool, fc layers; cache and reuse) → output to vision applications; cache storage resides in the operating system's storage; computation runs on processors (CPU, GPU, etc.)]

  • Two modules
  • Image matcher
  • Cache-aware inference engine
SLIDE 16

DeepCache Design: Image Matching

  • Principles: high similarity, low overhead, and merging into large regions.
  • Input: two raw images
  • Output: a set of matched rectangles
  • (x1, y1, w, h) in current frame -> (x2, y2, w, h) in previous frame
SLIDE 17

DeepCache Design: Image Matching

  • Step 1: divide the current frame into an NxN grid.
  • N is a configurable parameter (default: 10 x 10).


SLIDE 18

DeepCache Design: Image Matching

  • Step 2: for each divided block, find the best-matching block in the previous frame
  • Motion estimation: diamond search

[Figure: block (x1, y1) in the current frame matched to block (x2, y2) in the previous frame]
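Diamond search can be sketched as follows. This is a simplified textbook version (SAD cost, large-diamond steps then one small-diamond refinement), not the paper's implementation; the block size, iteration cap, and test image are illustrative:

```python
import numpy as np

def sad(a, b):
    """Sum of absolute differences between two equally sized blocks."""
    return float(np.abs(a.astype(float) - b.astype(float)).sum())

def diamond_search(cur, prev, x, y, bs=16, max_iter=20):
    """Find the block of `prev` best matching the bs x bs block of `cur`
    whose top-left corner is (x, y); returns the matched top-left corner."""
    block = cur[y:y + bs, x:x + bs]
    H, W = prev.shape[:2]

    def cost(px, py):
        if 0 <= px <= W - bs and 0 <= py <= H - bs:
            return sad(block, prev[py:py + bs, px:px + bs])
        return float("inf")             # candidate falls outside the frame

    ldsp = [(0, 0), (2, 0), (-2, 0), (0, 2), (0, -2),
            (1, 1), (1, -1), (-1, 1), (-1, -1)]   # large diamond pattern
    sdsp = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]  # small diamond pattern
    cx, cy = x, y
    for _ in range(max_iter):           # large-diamond steps until center wins
        dx, dy = min(ldsp, key=lambda d: cost(cx + d[0], cy + d[1]))
        if (dx, dy) == (0, 0):
            break
        cx, cy = cx + dx, cy + dy
    dx, dy = min(sdsp, key=lambda d: cost(cx + d[0], cy + d[1]))  # refine
    return cx + dx, cy + dy

yy, xx = np.mgrid[0:80, 0:80]
prev = np.sin(yy / 7.0) * 50 + np.cos(xx / 5.0) * 50   # smooth synthetic frame
cur = np.roll(np.roll(prev, -2, axis=0), -3, axis=1)   # scene shifted by (3, 2)
print(diamond_search(cur, prev, 20, 20))
```

Because each step only inspects a handful of candidates, the search touches far fewer blocks than an exhaustive window scan, which is why it suits mobile budgets.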

SLIDE 19

DeepCache Design: Image Matching

  • Step 3: calculate the average block movement (offset): (Mx, My).
  • Filter out the outliers

[Figure: per-block motion vectors from (x1, y1) in the current frame to matched positions (x2, y2) in the previous frame]
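Step 3 can be sketched as follows; the outlier rule here (drop vectors beyond 1.5 standard deviations from the mean) is an illustrative choice, not necessarily the paper's:

```python
import numpy as np

def average_offset(motion_vectors, tol=1.5):
    """Average block movement (Mx, My) with simple outlier filtering: drop
    vectors farther than `tol` standard deviations from the mean in either
    coordinate, then re-average the survivors."""
    v = np.asarray(motion_vectors, dtype=float)      # rows of (dx, dy)
    mean, std = v.mean(axis=0), v.std(axis=0) + 1e-9
    keep = (np.abs(v - mean) <= tol * std).all(axis=1)
    mx, my = v[keep].mean(axis=0)
    return float(mx), float(my)

moves = [(3, 1), (3, 1), (2, 1), (3, 2), (40, -25)]  # last vector is an outlier
print(average_offset(moves))                         # (2.75, 1.25)
```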

SLIDE 20

DeepCache Design: Image Matching

  • Step 4: calculate the similarity between block (x1, y1) in the current frame and the block (x1+Mx, y1+My), shifted by the average movement, in the previous frame
  • Metric: Peak Signal-to-Noise Ratio (PSNR)

[Figure: block (x1, y1) in the current frame compared with block (x1+Mx, y1+My) in the previous frame; PSNR 24 (reusable), PSNR 21 (reusable)]
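PSNR itself is standard; a minimal sketch for 8-bit image blocks:

```python
import numpy as np

def psnr(a, b, peak=255.0):
    """Peak Signal-to-Noise Ratio in dB: 10 * log10(peak^2 / MSE)."""
    mse = np.mean((a.astype(float) - b.astype(float)) ** 2)
    if mse == 0:
        return float("inf")             # identical blocks
    return 10.0 * np.log10(peak ** 2 / mse)

a = np.full((8, 8), 100.0)
b = a + 4.0                             # uniform error of 4 -> MSE = 16
print(round(psnr(a, b), 2))             # 10*log10(255^2/16) = 36.09
```

Higher PSNR means a closer match; a block is declared reusable when its PSNR clears a configurable threshold.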

SLIDE 21

DeepCache Design: Image Matching

  • Step 5: merge blocks into larger ones if possible

[Figure: adjacent reusable blocks around (x1, y1) in the current frame merged into one larger rectangle, mapped to (x1+Mx, y1+My) in the previous frame]
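Step 5 can be sketched as a greedy merge over the grid of reusable cells; the merging rule here (stack identical horizontal runs across consecutive rows) is an illustrative simplification, not the paper's exact algorithm:

```python
def merge_blocks(grid):
    """Merge adjacent reusable grid cells into larger rectangles.
    `grid[r][c]` is truthy if the cell is reusable; rectangles are returned
    as (row, col, height, width)."""
    rows, cols = len(grid), len(grid[0])
    runs = []                                   # horizontal runs: (row, col, width)
    for r in range(rows):
        c = 0
        while c < cols:
            if grid[r][c]:
                start = c
                while c < cols and grid[r][c]:
                    c += 1
                runs.append((r, start, c - start))
            else:
                c += 1
    rects, used = [], set()                     # stack equal runs of consecutive rows
    for i, (r, c0, w) in enumerate(runs):
        if i in used:
            continue
        h = 1
        while (r + h, c0, w) in runs:
            used.add(runs.index((r + h, c0, w)))
            h += 1
        rects.append((r, c0, h, w))
    return rects

grid = [[1, 1, 0],
        [1, 1, 0],
        [0, 0, 1]]
print(merge_blocks(grid))   # [(0, 0, 2, 2), (2, 2, 1, 1)]
```

Fewer, larger rectangles matter because big regions survive cache erosion at deeper layers far better than scattered small blocks.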
SLIDE 22

DeepCache Design: Image Matching

  • Optimization 1: skip block matching in Step 2 (k-skip)
  • Optimization 2: in Step 4, reuse the matching scores computed in Step 2
  • Not always applicable: depends on the average movement

[Figure: with 2-skip, 8/16 blocks are matched; with 3-skip, 6/16 blocks are matched]

SLIDE 23

DeepCache Design: Cache-aware CNN Inference

  • Propagation: the reusable regions produced by image matching do not stay unchanged as they pass through the CNN layers.

[Figure: a reusable region shrinks as it propagates through a Convolution layer (kernel 11x11, stride 2, padding 5) and a Pooling layer (kernel 3x3, stride 2, padding 1), while ReLU leaves it unchanged]

Because of cache erosion! But what affects cache erosion?
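The propagation rule can be sketched per dimension with standard convolution indexing: output index o reads padded inputs [o*s - p, o*s - p + k - 1], so o is reusable only if that whole window lies inside the reusable input interval. A sketch under those standard assumptions, not the paper's code:

```python
import math

def propagate_1d(a, b, kernel, stride, padding):
    """Map a reusable input interval [a, b] (inclusive) through one conv
    dimension; returns the reusable output interval, or None if the region
    is fully eroded."""
    lo = math.ceil((a + padding) / stride)
    hi = math.floor((b - kernel + 1 + padding) / stride)
    return (lo, hi) if lo <= hi else None

# 5-wide reusable region [0, 4], 3x3 conv, stride 1, no padding -> 3-wide [0, 2]
print(propagate_1d(0, 4, kernel=3, stride=1, padding=0))   # (0, 2)
print(propagate_1d(2, 9, kernel=3, stride=2, padding=1))   # (2, 4)
```

Applying this per layer answers the question above: kernel size, stride, and padding together determine how fast a region erodes.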

SLIDE 24

DeepCache Design: Cache-aware CNN Inference

  • Propagation: the reusable regions produced by image matching do not stay unchanged as they pass through the CNN layers.

[Figure: depending on its position and the layer's kernel and stride, a region may see no erosion, partial erosion, or full erosion]

SLIDE 25

DeepCache Design: Cache-aware CNN Inference

  • Why propagation? Why not match the input of each layer?
  • Low return: feature maps are high-dimensional data, difficult to interpret.
  • High cost: matching feature maps requires a lot of computation (40× compared to propagation for ResNet).

[Chart: normalized latency on AlexNet, GoogLeNet, ResNet-50, YOLO, and Dave-orig: DeepCache = 1.00, while matching inter-layer (MIL-50%, MIL-75%, MIL-100%) is 1.84× to 48.59× slower]

  • DeepCache: match input images once, then use propagation
  • MIL: matching inter-layer
SLIDE 26

DeepCache Design: Cache-aware CNN Inference

  • Cache/Reuse: reuse the computation results at the output of convolutions.
  • Mind the data locality during reuse!
  • Depends on the convolution implementation: im2col + GEMM, unrolled, etc.

[Chart: per-layer share of AlexNet latency (%): conv1 29.2, lrn1 5.7, conv2 21.0, lrn2 2.9, conv3 13.0, conv4 9.5, conv5 6.5, fc6 5.9, fc7 4.4, fc8 0.8; relu, pool, and prob layers are negligible]

Convolution is often the dominant layer (> 80% overall computations)

SLIDE 27

DeepCache Implementation

  • Image matching is implemented based on RenderScript
  • A programming framework on Android for intensive computations
  • GPU support, generic, highly data-parallel
  • Cache-awareness is built upon ncnn
  • Popular deep learning inference framework for mobile devices
  • High speed, lightweight, no dependency
SLIDE 28

Evaluation – Setup

  • Popular CNN models and datasets
  • Platform: Nexus 6, Android 6.0
  • Alternatives
  • ncnn without cache
  • The coarse-grained cache used in DeepMon [1]

[1] Huynh Nguyen Loc, Youngki Lee, and Rajesh Krishna Balan. 2017. DeepMon: Mobile GPU-based Deep Learning Framework for Continuous Vision Applications. In Proceedings of the 15th Annual International Conference on Mobile Systems, Applications, and Services (MobiSys’17)

SLIDE 29

Evaluation – Execution Speedup

  • DeepCache saves 15% ~ 28% of model execution time (about 2× the savings of DeepMon)
  • The speedup depends on the model architecture
  • Deeper layers, less savings

[Charts: normalized processing latency on REC_1, REC_2, REC_3, DET, DRV: no-cache = 1.000, DeepMon 0.870–0.932, DeepCache 0.720–0.862. Per-conv-layer latency (ms) of no-cache vs. DeepCache across Conv_1 to Conv_50: savings shrink at deeper layers]

SLIDE 30

Evaluation – Energy Saving

  • DeepCache saves around 20% of the energy consumed to process the same number of images.
  • The energy saved by using DeepCache to process 10 images (ResNet) equals 40 seconds of playing video on mobile devices.

Energy (J):

            DRV    REC_1  REC_2  REC_3  DET
no-cache    1.83   6.86   11.85  35.63  39.90
DeepMon     1.67   6.01   11.11  33.26  37.27
DeepCache   1.49   4.90    9.71  28.97  34.47

SLIDE 31

Evaluation – Accuracy

  • DeepCache is able to keep the accuracy
  • ~2% drop on average, similar to DeepMon

[Charts: Euclidean distance between cached and exact feature maps for AlexNet and YOLO; accuracy drop (%) on REC_2 and REC_3, top-1 and top-3, for DeepMon vs. DeepCache, all within 3%]

SLIDE 32

Evaluation – Memory Overhead

  • The overhead of the cache is acceptable
  • 2 MB ~ 44 MB, while today's mobile devices usually have more than 1 GB of memory
  • Note: we only cache the results of convolutional layers, and only for one frame

Memory Usage (MB):

            REC_1  REC_2  REC_3  DET    DRV   SqueezeNet  DeepFace  MobileNet
no-cache    252.2   81.4  342.8  407.7  18.5   36.2       135.3     100.9
DeepCache   254.7   94.0  386.6  422.1  21.7   47.0       144.5     121.7

SLIDE 33

Conclusion

  • DeepCache: cache design for mobile deep vision
  • Image matching on raw images
  • Cache/reuse in inference engine
  • Evaluation
  • ~20% execution speedup and energy savings
  • Little accuracy loss

Thank you for your attention!

SLIDE 34

Evaluation – Execution Speedup

  • DeepCache saves 15% ~ 28% of model execution time (more than DeepMon)
  • Performance depends on the scenario
[Charts: per-video processing latency (ms) on videos T1–T10 for four models, comparing no-cache, DeepMon, and DeepCache; DeepCache is fastest in most cases]

T1: Basketball T2: ApplyEyeMakeup T3: CleanAndJerk T4: Billiards T5: BandMarching T6: ApplyLipstick T7: CliffDiving T8: BrushingTeeth T9: BlowDryHair T10: BalanceBeam

SLIDE 35

Evaluation – Configuration

  • Some configurations can be tailored to make a better trade-off among accuracy, latency, and energy for different scenarios and applications.
  • Matching threshold, block size

[Charts: processing latency (ms) and accuracy drop (%) as functions of block size, and as functions of the matching threshold]