SLIDE 1

Memory-Optimal Direct Convolutions for Maximizing Classification Accuracy in Embedded Devices

Albert Gural1, Boris Murmann1

1Stanford University

The 36th International Conference on Machine Learning Long Beach, California June 11, 2019

SLIDE 2

Introduction

  • Embedded devices are increasingly targets of machine learning for IoT
  • Microsoft EdgeML
  • Bonsai [1]: decision tree achieves 94.38% on MNIST-2 in 2KB
  • ProtoNN [2]: nearest neighbors achieves 93.25% on MNIST-2 in 2KB
  • FastGRNN [3]: RNN achieves 98.20% on MNIST in 6KB
  • Google TensorFlow Lite for MCUs [4]
  • Hard memory constraints make deep learning difficult
  • “Bonsai is not compared to deep convolutional neural networks as they have not yet been demonstrated to fit on such tiny IoT devices” [1]

  • But CNNs typically have SOTA performance for image classification tasks
  • Can we do better with CNNs?
  • Goal: MNIST classifier in 2KB
SLIDE 3

Introduction

  • Deep CNN implementation research typically focused on speed
  • FFT, Winograd, gemm
  • Minimal research prioritizing memory reduction
  • Memory-Efficient Convolution [5] improves memory use of gemm methods, but still has overhead
  • Zero-Memory Overhead [6] performs direct convolutions for zero overhead beyond input/output activation storage

SLIDE 4

Introduction

  • Deep CNN implementation research typically focused on throughput
  • FFT, Winograd, gemm
  • Minimal research prioritizing memory reduction
  • Memory-Efficient Convolution [5] improves memory use of gemm methods, but still has overhead
  • Zero-Memory Overhead [6] performs direct convolutions for zero overhead beyond input/output activation storage
  • Can do even better by replacing input activations with output activations while computing: Negative-Memory Overhead

[Figure: example network. 28 × 28 × 1 input → AvgPool 2x2 → Conv 3x3 → Conv 3x3 → Conv 3x3 → MaxPool 2x2 → Flatten (176) → Dense (10)]

SLIDE 5

Replace Method

[Figure: the replace method. As the kernel sweeps the input, each output pixel overwrites an input pixel that has gone stale, shown for the two cases f_out ≤ f_in and f_out > f_in (output feature count vs. input feature count). Legend: input pixel, output pixel, stale pixel, kernel. Axes: features, height, width.]
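The replace idea for the f_out ≤ f_in case can be sketched with a single-channel 1D valid convolution: once out[i] is computed, the input sample it trails behind is stale and can hold the output. This is an illustrative sketch only, not the paper's multi-channel implementation; the function name and size-3 kernel are assumptions for the example.

```python
def conv1d_replace(buf, n, w):
    """In-place 'valid' 1D convolution with a size-3 kernel, single channel.

    Illustrates the replace strategy: out[i] reads buf[i..i+2], so after
    computing it, buf[i] is stale and can store the output directly.
    Returns the number of outputs now stored at the front of buf.
    """
    for i in range(n - 2):
        out = w[0] * buf[i] + w[1] * buf[i + 1] + w[2] * buf[i + 2]
        buf[i] = out  # overwrite the stale input sample (after reading it)
    return n - 2
```

For example, `buf = [1, 2, 3, 4, 5]` with kernel `[1, 1, 1]` leaves `buf[:3] == [6, 9, 12]`, with no buffer beyond the input itself.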

SLIDE 6

Herringbone Method

[Figure: herringbone tile and the resulting order of convolutions over the output, with memory accounting at three stages: 25 cost / 20 free, 30 cost / 32 free, 55 cost / 60 free.]

SLIDE 7

Herringbone Method

In the paper, we demonstrate:

  • Optimality for lossless, per-layer, direct convolutions

[Figure: herringbone tile and order of convolutions, as on the previous slide]

SLIDE 8

Transpose Implementation

Transpose method: process a row, transpose, process a row, transpose, …

In-place transpose by cycle rotation (memory layout A ↔ memory layout B):

  • For each start index, check whether start is the smallest element in its cycle
  • If so, rotate the elements in the cycle
  • Successor: k = (j mod H) · W + ⌊j / H⌋

[Figure: flat buffer 1 2 3 4 5 6 7 8 9 10 11 ↔ 4 8 1 5 9 2 6 10 3 7 11]
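The cycle-rotation transpose above can be sketched in Python; this is a sketch of the technique under the successor formula on the slide, not the paper's MCU implementation. The buffer holds an H × W row-major matrix, and each permutation cycle is rotated only from its smallest index, so no visited-flags or scratch buffer are needed.

```python
def transpose_in_place(a, H, W):
    """Transpose an H x W row-major matrix held in the flat buffer `a`,
    in place, leaving the W x H transpose in row-major order.

    Position d of the result receives the value from source index
    (d mod H) * W + d // H. Each cycle of this permutation is rotated
    once, starting only from its smallest index.
    """
    N = H * W
    for start in range(1, N - 1):       # indices 0 and N-1 never move
        # Walk the cycle; skip it unless `start` is its smallest member.
        j = start
        while True:
            j = (j % H) * W + j // H
            if j <= start:
                break
        if j < start:
            continue
        # Rotate: pull each value from its source index into place.
        j = start
        tmp = a[start]
        while True:
            k = (j % H) * W + j // H    # source index feeding position j
            if k == start:
                a[j] = tmp
                break
            a[j] = a[k]
            j = k
```

Transposing twice with swapped dimensions restores the original layout, which is what the transpose method relies on when alternating row processing with layout flips.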

SLIDE 9

Convolution Strategy Comparison

[Chart: memory comparison of convolution strategies across layers Conv 1, Conv 2, Conv 3, and FC]

SLIDE 10

Applicability

SLIDE 11

Case Study

[Figure: Arduino receives the serialized CNN + input images over serial comm. and returns output classes. SRAM (2048 B) holds the NN workspace (1960 B) and the stack.]

SLIDE 12

Case Study

[Hex dump of the serialized CNN (network topology, weights, and biases) transferred to the device]

[Figure: Arduino receives the serialized CNN + input images over serial comm. and returns output classes. SRAM (2048 B) breakdown: NN workspace (1960 B) = NN serialization (1525 B) + NN activations (435 B), plus stack (88 B). Network: 28 × 28 × 1 input → AvgPool 2x2 → Conv 3x3 → Conv 3x3 → Conv 3x3 → MaxPool 2x2 → Flatten (176) → Dense (10).]
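The SRAM breakdown can be sanity-checked with a few lines of arithmetic; the numbers below are taken directly from the slide.

```python
# Memory budget from the case study (Arduino with 2048 B of SRAM).
SRAM_BYTES       = 2048
nn_serialization = 1525   # network topology + weights and biases
nn_activations   = 435    # intermediate activation storage
stack            = 88     # program stack

nn_workspace = nn_serialization + nn_activations
assert nn_workspace == 1960                  # matches the workspace figure
assert nn_workspace + stack <= SRAM_BYTES    # everything fits in SRAM
print("free:", SRAM_BYTES - nn_workspace - stack, "B")  # → free: 0 B
```

The budget is exact: serialization, activations, and stack together consume all 2048 B.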

SLIDE 13

Results

  • Fits in 2KB SRAM
  • Network Topology
  • Weights and Biases
  • Intermediate Activations
  • Achieves 99.15% Test Accuracy on MNIST

Comparison to MNIST-2 and MNIST-10 results from [1,2,3]

SLIDE 14

Summary

  • Applicability
  • Replace strategy applies to any CNN
  • Herringbone/Transpose strategies apply to many 2D classification CNNs
  • Use Scenario
  • Tiny MCUs with negligible caching
  • Maximize accuracy given memory constraint
  • Maximize free memory given fixed NN
  • Applications
  • Microrobotic vision
  • Touchpad input classification
  • Spectrogram classification of 1D signals
  • Voice, gesture recognition
  • Activity tracking
  • Biometric security
  • Other sensors

Spectrogram of “yes” keyword from [7]

SLIDE 15

References

1. Kumar, Ashish, Saurabh Goyal, and Manik Varma. "Resource-efficient machine learning in 2 KB RAM for the internet of things." Proceedings of the 34th International Conference on Machine Learning, Volume 70. JMLR.org, 2017.
2. Gupta, Chirag, et al. "ProtoNN: Compressed and accurate kNN for resource-scarce devices." Proceedings of the 34th International Conference on Machine Learning, Volume 70. JMLR.org, 2017.
3. Kusupati, Aditya, et al. "FastGRNN: A fast, accurate, stable and tiny kilobyte sized gated recurrent neural network." Advances in Neural Information Processing Systems. 2018.
4. TensorFlow Lite for Microcontrollers. URL: https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/experimental/micro
5. Cho, Minsik, and Daniel Brand. "MEC: Memory-efficient convolution for deep neural network." Proceedings of the 34th International Conference on Machine Learning, Volume 70. JMLR.org, 2017.
6. Zhang, Jiyuan, Franz Franchetti, and Tze Meng Low. "High performance zero-memory overhead direct convolutions." arXiv preprint arXiv:1809.10170 (2018).
7. Warden, Pete. "Speech commands: A dataset for limited-vocabulary speech recognition." arXiv preprint arXiv:1804.03209 (2018).

Code: https://github.com/agural/memory-optimal-direct-convolutions Poster: Pacific Ballroom #89