

  1. Memory-Optimal Direct Convolutions for Maximizing Classification Accuracy in Embedded Devices
Albert Gural¹, Boris Murmann¹ (¹Stanford University)
The 36th International Conference on Machine Learning, Long Beach, California, June 11, 2019

  2. Introduction
• Embedded devices are increasingly targets of machine learning for IoT
  • Microsoft EdgeML
    • Bonsai [1]: decision tree achieves 94.38% on MNIST-2 in 2KB
    • ProtoNN [2]: nearest neighbors achieves 93.25% on MNIST-2 in 2KB
    • FastGRNN [3]: RNN achieves 98.20% on MNIST in 6KB
  • Google TensorFlow Lite for MCUs [4]
• Hard memory constraints make deep learning difficult
  • “Bonsai is not compared to deep convolutional neural networks as they have not yet been demonstrated to fit on such tiny IoT devices” [1]
• But CNNs typically have SOTA performance for image classification tasks
• Can we do better with CNNs?
• Goal: MNIST classifier in 2KB

  3. Introduction
• Deep CNN implementation research typically focused on speed: FFT, Winograd, gemm
• Minimal research prioritizing memory reduction
  • Memory-Efficient Convolution [5] improves memory use of gemm methods, but still has overhead
  • Zero-Memory Overhead [6] performs direct convolutions for zero overhead beyond input/output activation storage

  4. Introduction
• Deep CNN implementation research typically focused on throughput: FFT, Winograd, gemm
• Minimal research prioritizing memory reduction
  • Memory-Efficient Convolution [5] improves memory use of gemm methods, but still has overhead
  • Zero-Memory Overhead [6] performs direct convolutions for zero overhead beyond input/output activation storage
• Negative-Memory Overhead: can do even better by replacing input activations while computing output activations
[Figure: example network. 28×28×1 input → AvgPool 2×2 → Conv 3×3 → Conv 3×3 → Conv 3×3 → MaxPool 2×2 → Flatten (176) → Dense (10).]
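For reference, a minimal Keras sketch of the depicted topology. The slide gives no per-layer channel counts, so the 8s below are assumptions; the final conv must have 11 channels if the 3×3 convolutions are unpadded, since 28×28 → 14×14 → 12×12 → 10×10 → 8×8 → 4×4 and 4·4·11 = 176. The real 2KB deployment quantizes weights heavily; this sketch only checks shapes.

```python
# Hypothetical reconstruction of the slide's topology (channel counts assumed,
# except the last conv, which 4*4*c = 176 forces to c = 11 for unpadded convs).
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.AveragePooling2D(2, input_shape=(28, 28, 1)),  # 14x14x1
    tf.keras.layers.Conv2D(8, 3, activation="relu"),    # 12x12x8  (8 assumed)
    tf.keras.layers.Conv2D(8, 3, activation="relu"),    # 10x10x8  (8 assumed)
    tf.keras.layers.Conv2D(11, 3, activation="relu"),   # 8x8x11   (11 implied)
    tf.keras.layers.MaxPooling2D(2),                    # 4x4x11
    tf.keras.layers.Flatten(),                          # 176
    tf.keras.layers.Dense(10, activation="softmax"),    # 10 classes
])
model.summary()
```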

  5. Replace Method
[Figure: direct convolution that replaces input pixels with output pixels as they go stale. Diagram labels: input pixel, output pixel, stale pixel, features, kernel, height, width. Two cases are shown: f_out ≤ f_in (each output pixel fits in place of an input pixel) and f_out > f_in.]
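A minimal sketch of the replace idea for the f_out ≤ f_in case, here with a single channel and an unpadded 3×3 kernel: in raster order, no output computed after (i, j) ever reads input (i, j) again, so the result can overwrite it in place. The paper's method generalizes to multi-channel layers and handles f_out > f_in with extra buffering; the function below is only an illustration.

```python
import numpy as np

def conv3x3_replace(buf, kernel):
    # Unpadded 3x3 convolution computed directly into the input buffer.
    # Output (i, j) reads inputs (i..i+2, j..j+2); every output computed later
    # in raster order reads inputs at (row, col) >= its own position, so
    # input (i, j) is stale once output (i, j) is done and may be replaced.
    H, W = buf.shape
    for i in range(H - 2):
        for j in range(W - 2):
            acc = 0.0
            for ki in range(3):
                for kj in range(3):
                    acc += kernel[ki, kj] * buf[i + ki, j + kj]
            buf[i, j] = acc  # overwrite the now-stale input pixel
    # The top-left (H-2) x (W-2) block of buf now holds the output.

buf = np.arange(25, dtype=np.float32).reshape(5, 5)
expected = np.array([[buf[i:i+3, j:j+3].sum() for j in range(3)] for i in range(3)])
conv3x3_replace(buf, np.ones((3, 3), dtype=np.float32))
assert np.allclose(buf[:3, :3], expected)
```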

  6. Herringbone Method
[Figure: order of convolutions over an 8×8 activation map using a herringbone tile. Memory snapshots during the traversal: 25 cost, 20 free; 30 cost, 32 free; 55 cost, 60 free.]
Herringbone tile (processing order):
 0  1  2  3  4  5  6  7
 8 15 16 17 18 19 20 21
 9 22 28 29 30 31 32 33
10 23 34 39 40 41 42 43
11 24 35 44 48 49 50 51
12 25 36 45 52 55 56 57
13 26 37 46 53 58 60 61
14 27 38 47 54 59 62 63

  7. Herringbone Method
In the paper, we demonstrate optimality for lossless, per-layer, direct convolutions. [Same herringbone figure and processing order as slide 6.]
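The traversal in the figure can be generated programmatically. Below is a short sketch (the helper name herringbone_order is mine, not from the paper) that alternates between the remainder of row d and the remainder of column d, reproducing the 8×8 numbering shown on slide 6.

```python
def herringbone_order(n):
    # Herringbone traversal of an n x n grid: for each diagonal index d,
    # sweep the rest of row d left-to-right, then the rest of column d
    # top-to-bottom, shrinking toward the bottom-right corner.
    order = {}
    step = 0
    for d in range(n):
        for j in range(d, n):        # remainder of row d
            order[(d, j)] = step
            step += 1
        for i in range(d + 1, n):    # remainder of column d
            order[(i, d)] = step
            step += 1
    return order

# Prints the 8x8 numbering shown on the slide.
order = herringbone_order(8)
for i in range(8):
    print(" ".join(f"{order[(i, j)]:2d}" for j in range(8)))
```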

  8. Transpose Implementation
Transpose method: process a row, transpose, process a row, transpose, …
• In-place transpose of an H×W row-major buffer by following permutation cycles. Successor of index i: j = (i mod H)·W + ⌊i/H⌋
• Example (3×4): mem layout A = [0 1 2 3 | 4 5 6 7 | 8 9 10 11] → mem layout B = [0 4 8 | 1 5 9 | 2 6 10 | 3 7 11]
• For each start index: check if start is greater than any other element in its cycle; if not (start is the cycle's minimum), rotate the elements in the cycle
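A runnable sketch of this cycle-following in-place transpose, assuming an H×W row-major buffer (the function name is mine; the paper interleaves such transposes with row-wise convolution passes):

```python
def transpose_inplace(buf, H, W):
    # In-place transpose of an H x W row-major list via cycle-following.
    # succ(i) is the flat index whose value belongs at position i after
    # the transpose: succ(i) = (i mod H) * W + i // H.
    n = H * W
    succ = lambda i: (i % H) * W + i // H

    for start in range(n):
        # Walk the cycle; handle it only from its smallest index so that
        # each cycle is rotated exactly once.
        i = succ(start)
        while i > start:
            i = succ(i)
        if i < start:
            continue
        # Rotate: pull each source value into place around the cycle.
        i, tmp = start, buf[start]
        while True:
            j = succ(i)
            if j == start:
                buf[i] = tmp
                break
            buf[i] = buf[j]
            i = j

# Matches the slide's example: 3 x 4 layout A becomes 4 x 3 layout B.
buf = list(range(12))                 # [0 1 2 3 | 4 5 6 7 | 8 9 10 11]
transpose_inplace(buf, 3, 4)
assert buf == [0, 4, 8, 1, 5, 9, 2, 6, 10, 3, 7, 11]
```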

  9. Convolution Strategy Comparison
[Figure: chart comparing the convolution strategies across the network's layers: Conv 1, Conv 2, Conv 3, FC.]

  10. Applicability
[Figure-only slide.]

  11. Case Study
[Figure: Arduino setup. The serialized CNN and input images are sent over serial comm. to the program in SRAM (2048B); output classes are returned. SRAM is divided into the NN workspace (1960B) and the stack.]

  12. Case Study
[Figure: hex dump of the serialized CNN laid out in the Arduino's SRAM (2048B). NN workspace (1960B) = NN serialization (1525B: network topology plus weights and biases) + NN activations (435B); the remaining 88B hold the stack. Architecture as before: 28×28×1 input → AvgPool 2×2 → Conv 3×3 → Conv 3×3 → Conv 3×3 → MaxPool 2×2 → Flatten (176) → Dense (10). The serialized CNN and input images arrive over serial comm., and output classes are returned.]
The budget is exact: 1525B + 435B = 1960B of workspace, and 1960B + 88B of stack = 2048B.

  13. Results
• Fits in 2KB SRAM: network topology, weights and biases, intermediate activations
• Achieves 99.15% test accuracy on MNIST
• Comparison to MNIST-2 and MNIST-10 results from [1, 2, 3]

  14. Summary
• Applicability
  • Replace strategy applies to any CNN
  • Herringbone/Transpose strategies apply to many 2D classification CNNs
• Use scenario
  • Tiny MCUs with negligible caching
  • Maximize accuracy given a memory constraint
  • Maximize free memory given a fixed NN
• Applications
  • Microrobotic vision
  • Touchpad input classification
  • Spectrogram classification of 1D signals (voice, gesture recognition, activity tracking)
  • Biometric security
  • Other sensors
[Figure: spectrogram of the “yes” keyword from [7].]

  15. References
1. Kumar, Ashish, Saurabh Goyal, and Manik Varma. "Resource-efficient machine learning in 2 KB RAM for the internet of things." Proceedings of the 34th International Conference on Machine Learning, Volume 70. JMLR.org, 2017.
2. Gupta, Chirag, et al. "ProtoNN: Compressed and accurate kNN for resource-scarce devices." Proceedings of the 34th International Conference on Machine Learning, Volume 70. JMLR.org, 2017.
3. Kusupati, Aditya, et al. "FastGRNN: A fast, accurate, stable and tiny kilobyte sized gated recurrent neural network." Advances in Neural Information Processing Systems, 2018.
4. TensorFlow Lite for Microcontrollers. URL: https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/experimental/micro
5. Cho, Minsik, and Daniel Brand. "MEC: Memory-efficient convolution for deep neural network." Proceedings of the 34th International Conference on Machine Learning, Volume 70. JMLR.org, 2017.
6. Zhang, Jiyuan, Franz Franchetti, and Tze Meng Low. "High performance zero-memory overhead direct convolutions." arXiv preprint arXiv:1809.10170, 2018.
7. Warden, Pete. "Speech commands: A dataset for limited-vocabulary speech recognition." arXiv preprint arXiv:1804.03209, 2018.
Code: https://github.com/agural/memory-optimal-direct-convolutions
Poster: Pacific Ballroom #89
