DNNARA: A Deep Neural Network Accelerator using Residue Arithmetic and Integrated Photonics
Jiaxin Peng, Yousra Alkabani, Shuai Sun, Volker Sorger, and Tarek El-Ghazawi
49th International Conference on Parallel Processing (ICPP), August 2020
➢ Introduction
➢ Background
➢ Integrated Photonic Residue Arithmetic Computing Engine for Convolutional Neural Networks
➢ Performance Evaluation
➢ Conclusion
➢ Some NN applications require real-time analysis for inference
➢ Inference is computation-intensive, involving billions of multiply-accumulate (MAC) operations
➢ We propose DNNARA: a deep neural network accelerator using residue arithmetic based on integrated photonics
➢ All computations through the neural network are done in the residue number system (RNS), avoiding extra binary-to/from-RNS conversions at every step
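The idea of keeping every MAC in the residue domain and converting back to binary only once can be sketched in Python. The moduli (5, 7, 9) here are chosen purely for illustration and are not the paper's configuration:

```python
# Sketch: a chain of MAC operations kept entirely in the residue number
# system (RNS), with a single Chinese-remainder decode at the very end.
# Moduli are illustrative, not the paper's configuration.
from math import prod

MODULI = (5, 7, 9)            # pairwise coprime
M = prod(MODULI)              # dynamic range: results recovered modulo 315

def mac_rns(pairs):
    """Accumulate sum(a*b) per residue channel; no binary conversion inside."""
    acc = [0] * len(MODULI)
    for a, b in pairs:
        for i, m in enumerate(MODULI):
            acc[i] = (acc[i] + (a % m) * (b % m)) % m
    return tuple(acc)

def crt_decode(residues):
    """One conversion back to binary after the whole computation."""
    x = 0
    for r, m in zip(residues, MODULI):
        Mi = M // m
        x = (x + r * Mi * pow(Mi, -1, m)) % M  # pow(..., -1, m): modular inverse
    return x

pairs = [(3, 4), (2, 5), (6, 7)]        # sum of products = 12 + 10 + 42 = 64
result = crt_decode(mac_rns(pairs))     # decoded once, at the end
```

Because each residue channel is narrow and carry-free, the per-channel MACs are exactly the operations the photonic hardware parallelizes.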
Block Diagram of a DNNARA System
➢ DNNARA combines RNS with wavelength-division multiplexing (WDM)
➢ The WDM feature lets independent residue channels share the same hardware, saving area/hardware resources
➢ Convolutional Neural Network
➢ Residue Number System
➢ Widely applied in classification tasks
➢ Consists of several layers/functions
➢ Contains up to billions of multiply-accumulate (MAC) operations
➢ Each integer X is represented by its "residues," the remainders obtained by dividing it by each modulus Mi
➢ Negative number notation: similar to 2's complement
➢ Residue arithmetic: operations are carried out independently on the residues
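The three bullets above can be made concrete with a minimal sketch, assuming the pairwise-coprime moduli (3, 5, 7) purely for illustration:

```python
# Minimal residue number system (RNS) arithmetic sketch.
MODULI = (3, 5, 7)

def to_rns(x, moduli=MODULI):
    """Represent integer x by its residues modulo each modulus."""
    return tuple(x % m for m in moduli)

def rns_add(a, b, moduli=MODULI):
    """Addition is carried out independently on each residue channel."""
    return tuple((ra + rb) % m for ra, rb, m in zip(a, b, moduli))

def rns_mul(a, b, moduli=MODULI):
    """Multiplication likewise needs no carries between channels."""
    return tuple((ra * rb) % m for ra, rb, m in zip(a, b, moduli))

# 4 * 6 + 5 = 29; the same result appears channel-wise in RNS.
acc = rns_add(rns_mul(to_rns(4), to_rns(6)), to_rns(5))
assert acc == to_rns(29)  # (29 % 3, 29 % 5, 29 % 7) = (2, 4, 1)
```

The absence of carry propagation between channels is what makes each channel a small, independent, parallelizable unit.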
➢ Overview
➢ Residue Adders and Multipliers
➢ Residue Matrix-Vector Multiplication Unit
➢ Sigmoid Unit
➢ Max Pooling Unit
➢ Basic block: a 2×2 optical switch that either passes its inputs straight through ("bar" state) or propagates them across ("cross" state, (b))
➢ Residue adder [1]: residues are one-hot encoded and routed through a switching network; the switch states are loaded from a look-up table (LUT)
➢ An AS-Benes modulo-5 adder (e)
➢ A modulo-N residue multiplier implementation (f)
➢ WDM capable
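The one-hot residue-adder idea can be modeled in software: with a residue encoded one-hot across N positions (wavelengths/waveguides), adding a constant k modulo N is just a cyclic permutation of the positions, which is the kind of routing a Benes network realizes. A minimal sketch for the slide's modulo-5 case:

```python
# One-hot residue addition modulo N as a cyclic permutation.
N = 5  # modulo-5, as in the slide's AS-Benes adder example

def one_hot(r, n=N):
    """Residue r encoded as a one-hot vector of length n."""
    return tuple(1 if i == r else 0 for i in range(n))

def modular_shift(vec, k, n=N):
    """Permutation realizing (r + k) mod n on a one-hot input:
    the single '1' moves from position r to position (r + k) mod n."""
    return tuple(vec[(i - k) % n] for i in range(n))

# 3 + 4 = 7 ≡ 2 (mod 5): the '1' moves from position 3 to position 2.
assert modular_shift(one_hot(3), 4) == one_hot(2)
```

Since any fixed-operand modular addition is a permutation, a LUT of switch settings (one entry per operand value) suffices to configure the network.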
➢ Schematic of the designed R-MVM (b)
➢ Wavelength-division multiplexing (WDM) capable
➢ Lasers, MRRs, PDs, LUTs, registers, and photonic and electrical connections are needed
➢ A sel signal chooses either the partial sum or the bias
➢ Example: a 5×5 input feature and a 2×2 kernel
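A hedged software model of what the R-MVM computes: a small kernel slid over an input feature map, with every MAC performed per residue channel and the accumulator seeded with the bias (mirroring the sel line above). The moduli (5, 7, 9) are illustrative, not the paper's configuration:

```python
# Channel-wise residue-domain 2D convolution (valid padding), modeling
# the R-MVM unit's arithmetic. Moduli are assumed for illustration.
MODULI = (5, 7, 9)

def conv2d_rns(img, ker, bias=0):
    """Each output pixel is an RNS tuple; the bias seeds the accumulator."""
    kh, kw = len(ker), len(ker[0])
    out = []
    for i in range(len(img) - kh + 1):
        row = []
        for j in range(len(img[0]) - kw + 1):
            acc = tuple(bias % m for m in MODULI)   # 'sel' picks bias first
            for di in range(kh):
                for dj in range(kw):
                    acc = tuple(
                        (a + (img[i + di][j + dj] % m) * (ker[di][dj] % m)) % m
                        for a, m in zip(acc, MODULI))
            row.append(acc)
        out.append(row)
    return out

# A 3x3 input with a 2x2 diagonal kernel; output (0,0) is 1 + 5 = 6,
# represented by its residues (6 % 5, 6 % 7, 6 % 9) = (1, 6, 6).
feature = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernel = [[1, 0], [0, 1]]
```

In hardware the inner MAC loops collapse into parallel optical paths; the sequential loop here only models the arithmetic, not the timing.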
➢ Inputs are carried on different wavelengths
➢ Weights are applied by loading the states of the switches from the LUT
➢ The network produces the solutions for all the multiplications in parallel
➢ The selection line loads the states for the adders
➢ The result is decoded from the light at the output
➢ In the residue domain, it is hard to compute the sigmoid function directly
➢ Instead, it can be treated as a polynomial, since the sigmoid function can be approximated by a truncated Taylor series
➢ The terms involving x are pre-calculated and the connections are built accordingly
➢ Example: P(x) = ax^4 + bx^3 + cx^2 + dx + e in a modulo-5 system
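Evaluating such a polynomial in a single modulo-5 channel needs only residue adds and multiplies; Horner's rule makes that explicit. The coefficients below are arbitrary placeholders, not the paper's values:

```python
# Degree-4 polynomial evaluated in one residue channel via Horner's rule:
# P(x) = a*x^4 + b*x^3 + c*x^2 + d*x + e (mod m), using only the add and
# multiply operations the residue hardware provides.
def poly_mod(x, coeffs, m=5):
    """coeffs = (a, b, c, d, e), highest degree first."""
    acc = 0
    for c in coeffs:
        acc = (acc * x + c) % m
    return acc

# P(x) = 2x^4 + x^3 + 3x^2 + x + 4 at x = 3, modulo 5:
# (2*81 + 27 + 27 + 3 + 4) % 5 = 223 % 5 = 3
assert poly_mod(3, (2, 1, 3, 1, 4)) == 3
```

The same evaluation runs in every residue channel in parallel, one channel per modulus.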
➢ In RNS the sign of a number is only implicit, so sign detection is hard
➢ Instead, we convert the number from RNS to the mixed-radix number system (MRS) [2]
➢ In MRS, the coefficient associated with the even modulus 2 (a4) separates negative from non-negative numbers
➢ The conversion is serial but can be pipelined
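A hedged sketch of the serial RNS-to-MRS conversion (following Szabo and Tanaka [2]): digits are peeled off one residue channel at a time. The moduli (3, 5, 7) are illustrative and, unlike the slide's system, do not include the even modulus 2 used for the sign test:

```python
# Serial RNS -> mixed-radix (MRS) conversion: each step extracts one
# digit, subtracts it from the remaining channels, and divides by the
# consumed modulus via a modular inverse.
MODULI = (3, 5, 7)

def rns_to_mrs(residues, moduli=MODULI):
    """Return mixed-radix digits (a1, a2, a3) such that
    x = a1 + a2*m1 + a3*m1*m2."""
    digits = []
    res, mods = list(residues), list(moduli)
    while mods:
        a = res[0] % mods[0]
        digits.append(a)
        # subtract digit, then divide by mods[0] in each remaining channel
        res = [((r - a) * pow(mods[0], -1, m)) % m
               for r, m in zip(res[1:], mods[1:])]
        mods = mods[1:]
    return tuple(digits)

# x = 52: residues (52 % 3, 52 % 5, 52 % 7) = (1, 2, 3)
a1, a2, a3 = rns_to_mrs((1, 2, 3))
assert a1 + a2 * 3 + a3 * 3 * 5 == 52
```

Each step depends on the previous digit, which is why the conversion is inherently serial; independent numbers, however, can flow through the stages in a pipeline.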
➢ Electrical memory component
➢ Optical switch [4]
➢ Optical circuit
➢ Lasers/MRRs/PDs ([5]/[6]/[7], respectively)
➢ HyperTransport serial link
➢ System-level design
Configurations of Selected Benchmarks
➢ Swept parameters
➢ Computation capability: operations/(time × area × power)
➢ Real benchmarks
➢ More chips give faster execution, but the speedup did not scale proportionally
➢ More chips also consume more power
➢ Both effects are due to communication
➢ 19× faster than a GPU (NVIDIA Tesla V100) for VGG-4 with the same power budget
➢ Proposed DNNARA, a deep neural network accelerator that uses the residue number system
➢ DNNARA is a hybrid electro-optical design
➢ Proposed a system-level CNN accelerator chip with nano-photonics
➢ Built a system-level simulator for experimental estimation
➢ Reaches up to 12.6 GOPS/(second·mm²·watt)
➢ Achieved a 19× speedup over a state-of-the-art GPU (NVIDIA Tesla V100) for VGG-4 with the same power budget
➢ [1] Jiaxin Peng, Yousra Alkabani, Shuai Sun, Volker Sorger, and Tarek El-Ghazawi. 2019. Integrated Photonics Architectures for Residue Number System Computations. In IEEE International Conference on Rebooting Computing (ICRC 2019). 129–137.
➢ [2] Nicholas S Szabo and Richard I Tanaka. 1967. Residue arithmetic and its applications to computer technology. McGraw-Hill.
➢ [3] Naveen Muralimanohar, Rajeev Balasubramonian, and Norm Jouppi. 2007. Optimizing NUCA organizations and wiring alternatives for large caches with CACTI 6.0. In Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society, 3–14.
➢ [4] Shuai Sun, Vikram K Narayana, Ibrahim Sarpkaya, Joseph Crandall, Richard A Soref, Hamed Dalir, Tarek El-Ghazawi, and Volker J Sorger. 1–12.
➢ [5] Rupert F Oulton, Volker J Sorger, Thomas Zentgraf, Ren-Min Ma, Christopher Gladden, Lun Dai, Guy Bartal, and Xiang Zhang. 2009. Plasmon lasers at deep subwavelength scale. Nature 461, 7264 (2009), 629.
➢ [6] Erman Timurdogan, Cheryl M Sorace-Agaskar, Jie Sun, Ehsan Shah Hosseini, Aleksandr Biberman, and Michael R Watts. 2014. An ultralow power athermal silicon modulator. Nature Communications 5 (2014), 4008.
➢ [7] Yannick Salamin, Ping Ma, Benedikt Baeuerle, Alexandros Emboras, Yuriy Fedoryshyn, Wolfgang Heni, Bojun Cheng, Arne Josten, and Juerg Leuthold. 2018. 100 GHz plasmonic photodetector. ACS Photonics 5, 8 (2018), 3291–3297.
➢ [8] Yunji Chen, Tao Luo, Shaoli Liu, Shijin Zhang, Liqiang He, Jia Wang, Ling Li, Tianshi Chen, Zhiwei Xu, Ninghui Sun, et al. 2014. DaDianNao: A machine-learning supercomputer. In Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture.