

  1. RANA: Towards Efficient Neural Acceleration with Refresh-Optimized Embedded DRAM. Fengbin Tu, Weiwei Wu, Shouyi Yin, Leibo Liu, Shaojun Wei, Institute of Microelectronics, Tsinghua University. The 45th International Symposium on Computer Architecture (ISCA 2018).

  2. Ubiquitous Deep Neural Networks (DNNs): Image Classification, Object Detection, Video Surveillance, Speech Recognition.

  3. DNN Requires Large On-Chip Buffer
  • A modern DNN's per-layer data storage can reach 0.3~6.27MB.
  • The number grows further if the network processes higher-resolution images or larger batch sizes.
  [1] Krizhevsky et al., "ImageNet Classification with Deep Convolutional Neural Networks", NIPS'12. [2] Simonyan et al., "Very Deep Convolutional Networks for Large-Scale Image Recognition", ICLR'15. [3] Szegedy et al., "Going Deeper with Convolutions", CVPR'15. [4] He et al., "Deep Residual Learning for Image Recognition", CVPR'16.
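  As a rough illustration of where such per-layer footprints come from, the sketch below sums input feature maps, output feature maps, and weights for a single convolutional layer. The layer shape and the 16-bit fixed-point precision are illustrative assumptions, not figures from the slides.

```python
# Rough per-layer on-chip storage estimate for one convolutional layer.
# Layer shape and 16-bit precision are illustrative assumptions.

def conv_layer_bytes(n_in, n_out, h, w, k, bytes_per_value=2):
    """Return (input, output, weight) storage in bytes for one conv layer."""
    input_bytes  = n_in * h * w * bytes_per_value          # input feature maps
    output_bytes = n_out * h * w * bytes_per_value         # output feature maps (same spatial size assumed)
    weight_bytes = n_out * n_in * k * k * bytes_per_value  # K x K kernels
    return input_bytes, output_bytes, weight_bytes

# Example: a VGG-like 3x3 conv layer, 128 -> 128 channels on 56x56 maps, 16-bit data.
i, o, w = conv_layer_bytes(n_in=128, n_out=128, h=56, w=56, k=3)
print(f"input {i/2**20:.2f} MB, output {o/2**20:.2f} MB, weights {w/2**20:.2f} MB, "
      f"total {(i + o + w)/2**20:.2f} MB")
```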

  4. SRAM-based DNN Accelerators
  • The small footprint limits the on-chip buffer size of conventional SRAM-based DNN accelerators.
  – Usually <500KB with an area cost of 3~20mm² (normalized).
  [Accelerator block diagrams] Thinker: 348KB, 19.4mm²; DianNao: 44KB, 3.0mm²; Eyeriss: 182KB, 12.3mm²; Envision: 77KB, 10.1mm².
  Thinker: Yin et al., "A High Energy Efficient Reconfigurable Hybrid Neural Network Processor for Deep Learning Applications", JSSC'18. DianNao: Chen et al., "DianNao: A Small-Footprint High-Throughput Accelerator for Ubiquitous Machine-Learning", ASPLOS'14. Eyeriss: Chen et al., "Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks", ISSCC'16. Envision: Moons et al., "ENVISION: A 0.26-to-10TOPS/W Subword-Parallel Dynamic-Voltage-Accuracy-Frequency-Scalable Convolutional Neural Network Processor in 28nm FDSOI", ISSCC'17.

  5. SRAM vs. eDRAM (Embedded DRAM)
  • eDRAM has higher density than SRAM.
  • Charge leaks over time and can cause retention failures, so refresh is required for data retention.

  6. Refresh is an Energy Bottleneck
  [Charts] eDRAM power breakdown [1] and system power breakdown [2] both show eDRAM refresh energy as a major overhead.
  [1] Chang et al., "Technology Comparison for Large Last-Level Caches (L3Cs): Low-Leakage SRAM, Low Write-Energy STT-RAM, and Refresh-Optimized eDRAM", HPCA'13. [2] Wilkerson et al., "Reducing Cache Power with Low-Cost, Multi-bit Error-Correcting Codes", ISCA'10.

  7. Opportunity to Remove eDRAM Refresh
  • Refresh Interval = Retention Time.
  Ghosh, "Modeling of Retention Time for High-Speed Embedded Dynamic Random Access Memories", TCAS-I'14.

  8. Opportunity to Remove eDRAM Refresh
  • Refresh is unnecessary if Data Lifetime < Retention Time.
  • Opportunity 1: Increase the tolerable retention time by training.
  • Opportunity 2: Reduce data lifetime by scheduling.
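  A minimal sketch of this condition, assuming per-bank bookkeeping of how long each buffered tensor must stay valid; the names and values are illustrative, not the paper's implementation.

```python
# Illustrative per-bank refresh decision: refresh can be skipped whenever the
# data stored in a bank is guaranteed to be consumed (or rewritten) before the
# eDRAM retention time expires. Names and values are assumptions for illustration.

def refresh_needed(data_lifetime_us: float, retention_time_us: float) -> bool:
    """True if the bank must be refreshed to keep its data valid."""
    return data_lifetime_us >= retention_time_us

banks = {
    "input_bank":  {"lifetime_us": 300.0},   # tile consumed quickly -> no refresh
    "weight_bank": {"lifetime_us": 1500.0},  # weights reused across many tiles -> refresh
}
retention_us = 734.0  # tolerable retention time after retention-aware training (slide 13)

for name, bank in banks.items():
    print(name, "refresh" if refresh_needed(bank["lifetime_us"], retention_us) else "no refresh")
```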

  9. RANA: Retention-Aware Neural Acceleration Framework
  [Framework diagram] Inputs: 1. DNN Accelerator, 2. Target DNN Model.
  Technique 1 – Retention-Aware Training Method (Training): accuracy constraint, eDRAM retention time distribution → Tolerable Retention Time.
  Technique 2 – Hybrid Computation Pattern (Scheduling): energy modeling, data lifetime analysis, buffer storage analysis → Layerwise Configurations.
  Technique 3 – Refresh-Optimized eDRAM Controller (Architecture): data mapping, memory controller modification → Optimized Energy Consumption.
  Techniques 1 and 2 form the compilation phase; Technique 3 works in the execution phase.
  • Strengthen DNN accelerators with refresh-optimized eDRAM:
  – Increase on-chip buffer size by replacing SRAM with eDRAM.
  – Reduce energy overhead by removing unnecessary eDRAM refresh.

  10. RANA: Retention-Aware Neural Acceleration Framework
  [Overview diagram] The three techniques work together: training raises the tolerable Retention Time ↑, scheduling lowers Data Lifetime ↓, and the architecture applies per-bank Refresh Control.
  • Scheduling flow: given the DNN accelerator's hardware constraints and the DNN model's layer descriptions, run the scheduling scheme layer by layer (switching to the next layer until the last layer), producing a computation pattern <OD/WD, Tm, Tn, Tr, Tc> and configurations for each layer.
  • Unified Buffer System: multiple eDRAM banks managed by an eDRAM controller with a reference clock, a programmable clock divider, a refresh issuer, and per-bank eDRAM refresh flags.

  11. Tech1: Retention-Aware Training Method
  • Retention time is diverse among different cells.
  – Retention failure rate: fraction of cells that fail under a given retention time.
  – The weakest cell appears at the 45µs point.
  [Figure] Typical eDRAM Retention Time Distribution (32KB).
  Kong et al., "Analysis of Retention Time Distribution of Embedded DRAM – A New Method to Characterize Across-Chip Threshold Voltage Variation", ITC'08.

  12. Tech1: Retention-Aware Training Method
  • Retrain the network to tolerate a higher failure rate and thus obtain a longer tolerable retention time.
  [Training flow] Target DNN Model → Fixed-Point Pretrain → fixed-point DNN model; given a failure rate (r), add random bit-level errors as layer masks and retrain with weight adjustment → Retention-Aware DNN Model.
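  A minimal sketch of the error-injection idea, assuming 16-bit fixed-point tensors stored in eDRAM and a retention failure rate r. The helper names, the NumPy-based flow, and the choice to model failures as random bit flips are illustrative assumptions, not the paper's exact training procedure.

```python
import numpy as np

# Illustrative bit-level error injection for retention-aware retraining.
# Assumption: data are 16-bit fixed-point values; a fraction r of bit cells
# fail under the chosen retention time and their stored bits are corrupted.

def inject_retention_errors(values_q16: np.ndarray, failure_rate: float,
                            rng: np.random.Generator) -> np.ndarray:
    """Flip each bit of the quantized tensor independently with probability failure_rate."""
    bits = values_q16.astype(np.uint16)
    for bit in range(16):
        failed = rng.random(bits.shape) < failure_rate  # failed cells at this bit position
        bits = np.where(failed, bits ^ np.uint16(1 << bit), bits)
    return bits.astype(np.int16)

rng = np.random.default_rng(0)
activations = rng.integers(-2**15, 2**15, size=(4, 8), dtype=np.int16)
corrupted = inject_retention_errors(activations, failure_rate=1e-4, rng=rng)
# During retraining, the forward pass would use `corrupted` in place of the
# clean activations so the network learns to tolerate retention failures.
```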

  13. Tech1: Retention-Aware Training Method
  • Failure rate of 10⁻⁵: no accuracy loss, tolerable retention time of 734µs.
  • Failure rate of 10⁻⁴: accuracy decreases.
  [Figure] Relative accuracy under different retention failure rates (marked points: 45µs, 734µs, 1030µs).

  14. Tech2: Hybrid Computation Pattern
  • A layer's computation pattern can be expressed as a tiled loop nest.
  • Data lifetime and buffer storage requirements depend on the loop ordering, especially the outermost-level loop (see the sketch below).
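  A minimal sketch of such a tiled convolution loop nest, written in Python for readability. The tiling variable names (Tm, Tn, Tr, Tc) follow the slides, while the loop body and the specific ordering shown are illustrative: placing the output-tile loops outermost keeps one output tile resident and repeatedly accumulated into (an output-dependent pattern), whereas rotating a different loop to the outermost level changes which data must stay alive longest in the buffer.

```python
# Illustrative tiled convolution loop nest (output-dependent ordering).
# M/N: output/input channels, R/C: output rows/cols, K: kernel size.
# Tm/Tn/Tr/Tc: tile sizes, as in the slides. The body is schematic.

def conv_layer_od(inp, wgt, out, M, N, R, C, K, Tm, Tn, Tr, Tc):
    for m0 in range(0, M, Tm):              # outermost: output-channel tiles
        for r0 in range(0, R, Tr):
            for c0 in range(0, C, Tc):
                # The output tile out[m0:m0+Tm][r0:r0+Tr][c0:c0+Tc] stays on chip
                # and is re-accumulated for every input-channel tile below, so its
                # cells are rewritten (recharged) long before retention expires.
                for n0 in range(0, N, Tn):   # inner: input-channel tiles
                    for m in range(m0, min(m0 + Tm, M)):
                        for n in range(n0, min(n0 + Tn, N)):
                            for r in range(r0, min(r0 + Tr, R)):
                                for c in range(c0, min(c0 + Tc, C)):
                                    for kr in range(K):
                                        for kc in range(K):
                                            out[m][r][c] += wgt[m][n][kr][kc] * inp[n][r + kr][c + kc]
    return out

# Tiny usage: 2 output channels, 2 input channels, 4x4 outputs, 3x3 kernel.
M, N, R, C, K = 2, 2, 4, 4, 3
inp = [[[1.0] * (C + K - 1) for _ in range(R + K - 1)] for _ in range(N)]
wgt = [[[[0.1] * K for _ in range(K)] for _ in range(N)] for _ in range(M)]
out = [[[0.0] * C for _ in range(R)] for _ in range(M)]
conv_layer_od(inp, wgt, out, M, N, R, C, K, Tm=2, Tn=1, Tr=2, Tc=2)
```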

  15. Tech2: Hybrid Computation Pattern
  • Outputs are dynamically updated by accumulation, which recharges the cells like a periodic refresh.
  • Different computation patterns have different data lifetime and buffer storage requirements.
  [Figure] Three patterns compared: Output Dependent (OD), Input Dependent (ID), Weight Dependent (WD).

  16. Tech2: Hybrid Computation Pattern
  • Scheduling scheme:
  – Input: DNN accelerator and network parameters (layer description, hardware constraints).
  – Optimization: minimize total system energy.
  – Output: layerwise configurations.
  [Scheduling flow] For each layer, run the scheduling scheme and switch to the next layer until the last layer, producing a computation pattern <OD/WD, Tm, Tn, Tr, Tc> and configurations for each layer.
  Optimization problem:
  min Energy
  s.t. Energy = Equation (14),
       Tn · Th · Tl ≤ input buffer size,
       Tm · Tr · Tc ≤ output buffer size,
       Tm · Tn · K² ≤ weight buffer size,
       1 ≤ Tm ≤ M, 1 ≤ Tn ≤ N, 1 ≤ Tr ≤ R, 1 ≤ Tc ≤ C.
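  A minimal sketch of how such a layerwise search could look, assuming a brute-force sweep over tile sizes and computation patterns with a placeholder energy model. The buffer capacities, candidate tile sizes, function names, and the simplified energy terms are illustrative assumptions; the paper's actual energy model (its Equation (14)) is not reproduced here.

```python
from itertools import product

# Illustrative layerwise scheduling search. Buffer capacities, candidate tile
# sizes, and the energy model below are placeholders, not the paper's values.

BUF_IN, BUF_OUT, BUF_W = 256 * 1024, 256 * 1024, 512 * 1024  # assumed capacities (bytes)
BYTES = 2  # assumed 16-bit fixed-point data

def ceil_div(a, b):
    return -(-a // b)

def energy(pattern, Tm, Tn, Tr, Tc, layer):
    """Placeholder for the paper's energy model (its Equation (14)):
    off-chip traffic plus a crude refresh penalty for non-OD patterns."""
    M, N, R, C, K = layer
    trips = ceil_div(M, Tm) * ceil_div(N, Tn) * ceil_div(R, Tr) * ceil_div(C, Tc)
    offchip = trips * (Tn * Tr * Tc + Tm * Tn * K * K) * BYTES
    refresh_penalty = 0 if pattern == "OD" else trips * Tm * Tr * Tc
    return offchip + refresh_penalty

def schedule_layer(layer, tiles=(1, 2, 4, 8, 16, 32)):
    """Return the lowest-energy <pattern, Tm, Tn, Tr, Tc> that fits the buffers."""
    M, N, R, C, K = layer
    best = None
    for pattern, Tm, Tn, Tr, Tc in product(("OD", "WD"), tiles, tiles, tiles, tiles):
        if Tm > M or Tn > N or Tr > R or Tc > C:
            continue
        # Buffer constraints; input tile size approximated as (Tr+K-1) x (Tc+K-1).
        if (Tn * (Tr + K - 1) * (Tc + K - 1) * BYTES > BUF_IN
                or Tm * Tr * Tc * BYTES > BUF_OUT
                or Tm * Tn * K * K * BYTES > BUF_W):
            continue
        e = energy(pattern, Tm, Tn, Tr, Tc, layer)
        if best is None or e < best[0]:
            best = (e, pattern, Tm, Tn, Tr, Tc)
    return best

# Example layer: M=N=128 channels, 56x56 outputs, 3x3 kernels.
print(schedule_layer((128, 128, 56, 56, 3)))
```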

  17. Tech3: Refresh-Optimized eDRAM Controller
  • eDRAM controller:
  – Programmable clock divider: sets the refresh interval from the reference clock.
  – Refresh issuer and refresh flags, one flag per eDRAM bank.
  – Configured with the outputs of Tech1 & Tech2.
  [Block diagram] Unified Buffer System: eDRAM banks managed by the eDRAM controller (reference clock → programmable clock divider → refresh issuer, gated by the per-bank eDRAM refresh flags).
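  A behavioral sketch of this control logic in Python (the real controller is hardware; this is only a cycle-level illustration under assumed names): a programmable divider derives the refresh period from the reference clock, and refresh commands are issued only to banks whose flag is set.

```python
# Behavioral sketch of a refresh-optimized eDRAM controller.
# Divider value, bank count, and flag settings are illustrative assumptions.

class RefreshController:
    def __init__(self, num_banks: int, divider: int):
        self.divider = divider            # programmable clock divider -> refresh interval
        self.flags = [True] * num_banks   # per-bank refresh flags (True = needs refresh)
        self.counter = 0

    def configure(self, divider: int, flags: list[bool]) -> None:
        """Programmed per layer from the tolerable retention time (Tech1)
        and the scheduling decisions (Tech2)."""
        self.divider, self.flags = divider, list(flags)
        self.counter = 0

    def tick(self) -> list[int]:
        """Advance one reference-clock cycle; return banks to refresh this cycle."""
        self.counter += 1
        if self.counter < self.divider:
            return []
        self.counter = 0
        return [b for b, need in enumerate(self.flags) if need]

ctrl = RefreshController(num_banks=4, divider=100)
# Example: after scheduling, only bank 3 holds data that outlives the retention time.
ctrl.configure(divider=146, flags=[False, False, False, True])
refresh_events = [ctrl.tick() for _ in range(300)]
print([i for i, banks in enumerate(refresh_events) if banks])  # cycles with refreshes
```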

  18. Evaluation Platform
  • RTL-level cycle-accurate simulation for performance estimation and memory access tracing.
  • System-level energy estimation based on synthesis, DESTINY, and CACTI.
  Platform configurations: DNN Accelerator – 256 MACs, 384KB SRAM, 200MHz, 5.682mm², 65nm; eDRAM – 1.454MB, retention time = 45µs, 65nm.
  Kong et al., "Analysis of Retention Time Distribution of Embedded DRAM – A New Method to Characterize Across-Chip Threshold Voltage Variation", ITC'08.

  19. Experimental Results
  • eDRAM refresh operations: 99.7% ↓
  • Off-chip memory access: 41.7% ↓
  • System energy consumption: 66.2% ↓

  20. Scalability to Other Architectures
  • DaDianNao: 4096 MACs, 36MB eDRAM, 606MHz.
  • eDRAM refresh operations: 99.9% ↓
  • System energy consumption: 69.4% ↓
  Chen et al., "DaDianNao: A Machine-Learning Supercomputer", MICRO'14.

  21. Takeaway
  RANA: Retention-Aware Neural Acceleration Framework (Training → Tolerable Retention Time; Scheduling → Layerwise Configurations; Architecture → Optimized Energy Consumption).
  • Training: Retention-aware training method.
  – Exploit the DNN's error resilience to improve the tolerable retention time.
  • Scheduling: Hybrid computation pattern.
  – Different computing orders and parallelism show different data lifetime and buffer storage requirements.
  • Architecture: Refresh-optimized eDRAM controller.
  – No need to refresh all the banks.
  – No need to always use the worst-case refresh interval.
  • Not limited to applying eDRAM to DNN acceleration.
  – Approximate computing: retention and error resilience.

  22. Thank you for your attention! Email: tfb13@mails.tsinghua.edu.cn
