
DNNARA: A Deep Neural Network Accelerator using Residue Arithmetic and Integrated Photonics Jiaxin Peng, Yousra Alkabani, Shuai Sun, Volker Sorger, and Tarek El-Ghazawi 49th International Conference on Parallel Processing ICPP August 2020


SLIDE 1

DNNARA: A Deep Neural Network Accelerator using Residue Arithmetic and Integrated Photonics

Jiaxin Peng, Yousra Alkabani, Shuai Sun, Volker Sorger, and Tarek El-Ghazawi

49th International Conference on Parallel Processing – ICPP August 2020

SLIDE 2

Outline

➢Introduction
➢Background
➢Integrated Photonic Residue Arithmetic Computing Engine for Convolutional Neural Network
➢Performance Evaluation
➢Conclusion

SLIDE 3

Introduction

SLIDE 4

Introduction

➢Some NN applications require real-time analysis for inference
➢Inference is computation-intensive: it involves billions of multiply-accumulate (MAC) operations
➢We propose DNNARA: a deep neural network accelerator using residue arithmetic based on integrated photonics
➢All computations throughout the neural network are done in the residue number system (RNS), avoiding extra conversions between binary and RNS


Block Diagram of a DNNARA System

SLIDE 5

Introduction

➢DNNARA: RNS with wavelength-division multiplexing (WDM)

  • Execute multiple MVMs at once thanks to WDM
  • Speed up MVMs thanks to the digit-independent nature of RNS
  • Residues are small
  • Increase system parallelism – save area/hardware resources

SLIDE 6

Background

➢ Convolutional Neural Network
➢ Residue Number System

SLIDE 7

Background – Convolutional Neural Network

➢Widely applied in classification

  • Image recognition

➢Including several layers/functions

  • Convolutional layers
  • Activation functions – add non-linearity
    • ReLU (Rectified Linear Unit)
    • Sigmoid function / Hyperbolic tangent function
  • Pooling layers – downsample the output
    • Max pooling
    • Average pooling
  • Fully-connected layers

➢Contains up to billions of multiply-accumulate (MAC) operations
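
To give a feel for where the MAC count comes from, here is a minimal sketch that counts MACs per layer. The layer shapes are hypothetical values chosen for illustration and are not taken from the talk; the helper names `conv_macs` and `fc_macs` are likewise assumptions.

```python
def conv_macs(out_h, out_w, out_channels, kernel_h, kernel_w, in_channels):
    """MACs in one convolutional layer: every output element needs
    one MAC per kernel weight per input channel."""
    return out_h * out_w * out_channels * kernel_h * kernel_w * in_channels


def fc_macs(in_features, out_features):
    """MACs in one fully-connected layer: one MAC per weight."""
    return in_features * out_features


# Hypothetical layer shapes, for illustration only.
conv_layer = conv_macs(out_h=224, out_w=224, out_channels=64,
                       kernel_h=3, kernel_w=3, in_channels=64)
fc_layer = fc_macs(in_features=4096, out_features=1000)

print(f"conv layer MACs: {conv_layer:,}")
print(f"fc layer MACs:   {fc_layer:,}")
```

With these assumed shapes a single 3x3 convolutional layer already needs roughly 1.85 billion MACs, which is why convolutional layers dominate the workload.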

SLIDE 8

Background - Residue Number System (RNS)

➢Each integer X is represented by its residues: the remainders obtained by dividing it by each modulus M_i

  • Example: Moduli are M1=2, M2=3, M3=5, M4=7
  • X = 20 is represented as X={0, 2, 0, 6}[2, 3, 5, 7]
  • Range of numbers that can be represented: 0 to (M−1), where M=M1*M2*M3*M4 (here M=210, so 0 to 209)
  • Moduli should be pairwise relatively prime

➢Negative Number Notation: Similar to 2’s complement

  • r_i = | M_i − |−X|_Mi |_Mi (where X is negative and |a|_m denotes a mod m)
  • Example: -20 = {|2−0|_2, |3−2|_3, |5−0|_5, |7−6|_7}[2, 3, 5, 7] = {0, 1, 0, 1}[2, 3, 5, 7]
  • Range of numbers that can be represented: [−(M−1)/2, (M−1)/2] if M is odd, or [−M/2, M/2−1] if M is even (here [−105, 104])

➢Residue Arithmetic: Operations carried out on residues

  • Example: X=20={0, 2, 0, 6}[2, 3, 5, 7] and Y=5={1, 2, 0, 5}[2, 3, 5, 7]
  • X+Y = {0+1, 2+2, 0+0, 6+5}[2, 3, 5, 7] = {1, 1, 0, 4}[2, 3, 5, 7] (that is, 25)
  • X*Y = {0*1, 2*2, 0*0, 6*5}[2, 3, 5, 7] = {0, 1, 0, 2}[2, 3, 5, 7] (that is, 100)
  • Residue arithmetic is carried out as modular additions and multiplications on the residues
  • Each residue is processed independently, so all residues can be computed in parallel
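
The examples on this slide can be reproduced with a few lines of Python. This is only a software sketch of the number system, not of the photonic hardware; the function names (`to_rns`, `rns_add`, `rns_mul`, `from_rns`) are chosen here for illustration, and the CRT-based `from_rns` is included only to check results.

```python
from math import prod

MODULI = (2, 3, 5, 7)  # pairwise relatively prime, as on the slide (M = 210)


def to_rns(x, moduli=MODULI):
    """Residues of x. Python's % is always non-negative, so a negative x
    lands directly on the complement form used for negative numbers."""
    return tuple(x % m for m in moduli)


def rns_add(a, b, moduli=MODULI):
    """Digit-wise modular addition; each residue is independent."""
    return tuple((ai + bi) % m for ai, bi, m in zip(a, b, moduli))


def rns_mul(a, b, moduli=MODULI):
    """Digit-wise modular multiplication; each residue is independent."""
    return tuple((ai * bi) % m for ai, bi, m in zip(a, b, moduli))


def from_rns(residues, moduli=MODULI, signed=False):
    """Chinese Remainder Theorem reconstruction, used here for checking."""
    total_range = prod(moduli)
    x = 0
    for r, m in zip(residues, moduli):
        partial = total_range // m
        x += r * partial * pow(partial, -1, m)  # modular inverse (Python 3.8+)
    x %= total_range
    if signed and x >= (total_range + 1) // 2:
        x -= total_range  # upper half of the range encodes negative numbers
    return x


x, y = to_rns(20), to_rns(5)
print(x, y)                                   # (0, 2, 0, 6) (1, 2, 0, 5)
print(rns_add(x, y), rns_mul(x, y))           # (1, 1, 0, 4) (0, 1, 0, 2)
print(to_rns(-20), from_rns(to_rns(-20), signed=True))  # (0, 1, 0, 1) -20
```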

SLIDE 9

Integrated Photonic Residue Arithmetic Computing Engine for Neural Network

➢ Overview
➢ Residue Adders and Multipliers
➢ Residue Matrix-Vector Multiplication Unit
➢ Sigmoid Unit
➢ Max Pooling Unit

SLIDE 10

Architecture Overview

  • R-MVM: Residue Matrix-Vector Multiplication
  • R-Multiplier: Residue Multiplier
  • R-Adder: Residue Adder
  • MRR: Micro-Ring Resonator
  • PD: Photo-Detector
  • LUT: Look-up Table
  • RNS2Bin: RNS to Binary
  • Bin2RNS: Binary to RNS
  • T: Tile

SLIDE 11

Integrated Photonic Residue Adder and Multiplier

➢Basic block

  • An electro-optical 2×2 switch
  • Light either passes straight through (“bar” state, (a)) or crosses over (“cross” state, (b))

➢Residue Adder [1] – one-hot encoding

  • Modular addition can be viewed as a mapping (injection) from input positions to output positions
  • Arbitrary Size Benes (AS-Benes) Network ((c): even size, (d): odd size)
  • Switch states are precomputed and stored in a look-up table (LUT)

➢An AS-Benes modulo-5 adder (e)

  • Example with |3+4|_5 = 2

➢A Modulo-N Residue Multiplier Implementation (f)
➢WDM capable
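
The idea of one-hot modular addition as a routing problem can be sketched in software. The code below is a functional model only: it assumes that one operand selects a precomputed routing (standing in for the LUT of switch states) and the other operand is the one-hot input port that carries light. It does not model the individual 2×2 switches of the AS-Benes network, and all names are illustrative.

```python
def one_hot(value, modulus):
    """Light is present on exactly one of `modulus` input ports."""
    return [1 if port == value else 0 for port in range(modulus)]


def build_adder_lut(modulus):
    """For each addend b, store the routing 'input port a -> output port
    (a + b) mod m'. Each routing is a cyclic shift, hence a permutation."""
    return {b: [(a + b) % modulus for a in range(modulus)]
            for b in range(modulus)}


def route(light, routing):
    """Send the light on every input port to its assigned output port."""
    out = [0] * len(light)
    for in_port, power in enumerate(light):
        out[routing[in_port]] += power
    return out


def modular_add(a, b, modulus, lut):
    """Add by routing: b picks the routing, a picks the lit input port,
    and the lit output port is the result."""
    detected = route(one_hot(a, modulus), lut[b])
    return detected.index(1)


lut5 = build_adder_lut(5)
print(modular_add(3, 4, 5, lut5))  # the slide's example |3+4|_5
```

Because each routing is a permutation, exactly one output port is lit for any input, which is what makes one-hot detection unambiguous.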

SLIDE 12

Residue MVM (R-MVM) Computing Block

➢Schematic of the designed R-MVM (b)
➢Wavelength-Division Multiplexing (WDM) capable
➢Requires lasers, MRRs, PDs, LUTs, and registers, as well as photonic and electrical connections
➢sel chooses either the partial sum or the bias
➢Example: 5x5 input feature and a 2x2 kernel
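
As a software illustration of the 5x5 input and 2x2 kernel example, the sketch below runs the same convolution once on plain integers and once per modulus, then checks that every residue channel agrees with the integer result. The moduli (2, 3, 5, 7), the feature values, the kernel, and the bias are assumptions chosen so that all results stay inside the representable range; they are not taken from the talk.

```python
MODULI = (2, 3, 5, 7)  # assumed moduli, borrowed from the background slide


def conv2d(x, k, bias=0, modulus=None):
    """'Valid' CNN-style convolution of square inputs, built from MACs.
    With a modulus, every MAC is reduced as one residue channel would do."""
    out_size = len(x) - len(k) + 1
    out = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            acc = bias  # first MAC adds the bias, later MACs add the partial sum
            for di in range(len(k)):
                for dj in range(len(k)):
                    acc += x[i + di][j + dj] * k[di][dj]
                    if modulus is not None:
                        acc %= modulus
            row.append(acc)
        out.append(row)
    return out


def residues(matrix, m):
    """Reduce every entry of a matrix modulo m."""
    return [[v % m for v in row] for row in matrix]


# Illustrative data only: a 5x5 feature map, a 2x2 kernel, and a bias.
feature = [[(i + j) % 4 for j in range(5)] for i in range(5)]
kernel = [[1, 2], [0, 1]]
bias = 1

reference = conv2d(feature, kernel, bias)
channels = {m: conv2d(residues(feature, m), residues(kernel, m), bias % m, m)
            for m in MODULI}

for m in MODULI:
    print(m, channels[m] == residues(reference, m))
```

Each channel only ever handles values smaller than its modulus, and the channels never exchange data, which is the property the R-MVM exploits.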

SLIDE 13

Pipeline of a MAC operation

  • Cycle 1:
    • Input features (x) are encoded as light with different wavelengths
    • Weights (w) are encoded as the selection lines, which load the switch states from the LUT
  • Cycle 2:
    • Set up the switch states accordingly
    • Inject light and detect light – multiply
    • MRRs & PDs act as filters to extract the results of all the multiplications

SLIDE 14

Pipeline of a MAC operation

  • Cycle 3:
    • Results from the last cycle (w*x) are decoded as the selection lines to load the switch states for the adders
    • According to sel, either the partial sum or the bias is encoded as light
  • Cycle 4:
    • Set up the adders
    • Inject light and detect light – add
  • Cycle 5: Write back to the register
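
The five cycles can be traced in a purely functional model for one residue channel. This sketch makes several assumptions that are not stated on the slides: the LUTs are modeled as plain tables of results rather than switch states, `sel = 0` is taken to mean "use the bias" and `sel = 1` "use the partial sum", and the stages run one after another instead of overlapping as they would in a real pipeline.

```python
def build_luts(m):
    """Stand-ins for the switch-state LUTs of one modulo-m channel."""
    mul_lut = [[(w * x) % m for x in range(m)] for w in range(m)]
    add_lut = [[(p + s) % m for s in range(m)] for p in range(m)]
    return mul_lut, add_lut


def mac(x, w, partial_sum, bias, sel, mul_lut, add_lut):
    # Cycle 1: x is encoded as light; w selects the multiplier states.
    mul_states = mul_lut[w]
    # Cycle 2: inject and detect light, which yields the product w*x.
    product = mul_states[x]
    # Cycle 3: the product selects the adder states; sel picks the operand
    # that is encoded as light (assumed: 0 = bias, 1 = partial sum).
    add_states = add_lut[product]
    operand = partial_sum if sel else bias
    # Cycle 4: inject and detect light, which yields the sum.
    total = add_states[operand]
    # Cycle 5: the caller writes the result back to the register.
    return total


def dot_product_mod(xs, ws, bias, m):
    """Accumulate bias + sum(w*x) modulo m using repeated MAC steps."""
    mul_lut, add_lut = build_luts(m)
    register = 0
    for step, (x, w) in enumerate(zip(xs, ws)):
        sel = 1 if step > 0 else 0  # first step adds the bias
        register = mac(x % m, w % m, register, bias % m, sel, mul_lut, add_lut)
    return register


print(dot_product_mod([1, 2, 3], [4, 0, 2], bias=3, m=7))
```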

SLIDE 15

Sigmoid Function Unit - Polynomial

➢In the residue domain, the sigmoid function is hard to compute directly
➢Instead, it can be approximated by a polynomial, since the sigmoid function can be expanded as a Taylor series
➢The terms that depend on x are pre-calculated, and the connections are built accordingly
➢Example: P(x) = ax^4 + bx^3 + cx^2 + dx + e in a modulo-5 system
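
Since a modulo-5 channel has only five possible inputs, the whole polynomial can be pre-calculated into a five-entry table. The sketch below does this with Horner's rule. The integer coefficients are hypothetical placeholders: real Taylor coefficients of the sigmoid are fractional, so this assumes they have already been scaled to integers by some fixed-point scheme that the slides do not describe.

```python
def poly_table(coeffs, modulus):
    """Pre-calculate P(x) mod `modulus` for every residue x.
    `coeffs` are ordered from the highest degree down: (a, b, c, d, e)."""
    table = []
    for x in range(modulus):
        acc = 0
        for c in coeffs:  # Horner's rule, reduced at every step
            acc = (acc * x + c) % modulus
        table.append(acc)
    return table


# Hypothetical integer coefficients a, b, c, d, e (illustration only).
COEFFS = (1, 2, 3, 4, 1)
table5 = poly_table(COEFFS, 5)

for x, value in enumerate(table5):
    print(f"P({x}) mod 5 = {value}")
```

Evaluating the activation then reduces to a table lookup per residue, which fits the same "precompute, then route" style as the adders and multipliers.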

SLIDE 16

Max pool Function Unit

➢Sign detection in RNS is implicit: the sign cannot be read directly from the residues
➢Instead, we convert the number from RNS to MRS (mixed-radix number system) [2]
➢In the MRS, the coefficient associated with the even modulus 2 (a4) separates negative numbers from non-negative ones
➢The conversion is serial but can be pipelined
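
A minimal sketch of mixed-radix sign detection follows. It assumes the moduli are ordered (7, 5, 3, 2) so that modulus 2 comes last and its digit a4 is the most significant one; with M = 210, a4 = 1 then means the value lies in the upper half of the range, which encodes negative numbers. The `rns_max` helper additionally assumes the difference of its two operands stays inside [−105, 104]. Names and ordering are illustrative, not taken from the talk.

```python
MODULI = (7, 5, 3, 2)  # assumed ordering: the even modulus 2 comes last


def to_rns(x, moduli=MODULI):
    return tuple(x % m for m in moduli)


def mixed_radix_digits(residues, moduli=MODULI):
    """Serial RNS-to-MRS conversion.
    Result (a1, a2, a3, a4) satisfies X = a1 + a2*7 + a3*7*5 + a4*7*5*3."""
    work = list(residues)
    digits = []
    for i, m_i in enumerate(moduli):
        a = work[i] % m_i
        digits.append(a)
        # Subtract the digit and divide by m_i in every remaining channel.
        for j in range(i + 1, len(moduli)):
            m_j = moduli[j]
            work[j] = ((work[j] - a) * pow(m_i, -1, m_j)) % m_j
    return digits


def is_negative(residues, moduli=MODULI):
    """The top digit (for modulus 2) is 1 exactly for the negative half."""
    return mixed_radix_digits(residues, moduli)[-1] == 1


def rns_max(a, b, moduli=MODULI):
    """Max of two RNS numbers via the sign of a - b (assumes no overflow)."""
    diff = tuple((ai - bi) % m for ai, bi, m in zip(a, b, moduli))
    return b if is_negative(diff, moduli) else a


print(mixed_radix_digits(to_rns(20)), is_negative(to_rns(20)))
print(mixed_radix_digits(to_rns(-20)), is_negative(to_rns(-20)))
```

Each digit depends on the previous one, which is why the conversion is serial; the inner loop over the remaining channels is the part that could be pipelined.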

SLIDE 17

Performance Evaluation

SLIDE 18

Experiment Setup

➢Electrical memory component

  • CACTI 7.0 [3]

➢Optical Switch [4]

  • Lumerical FDTD

➢Optical circuit

  • Lumerical Interconnect

➢Lasers/MRRs/PDs

  • Data from other work ([5], [6], and [7], respectively)

➢HyperTransport serial link

  • Data from [8]

➢System Level Design

  • Our own simulator


Configurations of Selected Benchmarks

SLIDE 19

Design Space Exploration

➢Swept Parameters

  • WDM size
  • # of tiles in a chip
  • # of MVMs in a tile

➢Computation capability

  • # of operations / (time × area × power)

SLIDE 20

Hardware Specification

SLIDE 21

Speed & Power Analysis

➢Evaluated on real benchmarks
➢Using more chips is faster, but the speedup does not scale proportionally
➢Using more chips also consumes more power
➢Due to communication overhead
➢19 times faster than a GPU (Nvidia Tesla V100) for VGG-4 with the same power budget

SLIDE 22

Conclusion

➢Proposed DNNARA, a deep neural network accelerator that uses the residue number system
➢DNNARA is a hybrid electro-optical design
➢Proposed a system-level CNN accelerator chip based on nanophotonics
➢Built a system-level simulator for experimental estimation
➢Could reach up to 12.6 GOPS/(second·mm²·watt)
➢Ran 19 times faster than a state-of-the-art GPU (Nvidia Tesla V100) for VGG-4 with the same power budget

SLIDE 23

References

➢ [1] Jiaxin Peng, Yousra Alkabani, Shuai Sun, Volker Sorger, and Tarek El-Ghazawi. 2019. Integrated Photonics Architectures for Residue Number System Computations. In IEEE International Conference on Rebooting Computing (ICRC 2019). 129–137.
➢ [2] Nicholas S Szabo and Richard I Tanaka. 1967. Residue arithmetic and its applications to computer technology. McGraw-Hill.
➢ [3] Naveen Muralimanohar, Rajeev Balasubramonian, and Norm Jouppi. 2007. Optimizing NUCA organizations and wiring alternatives for large caches with CACTI 6.0. In Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society, 3–14.
➢ [4] Shuai Sun, Vikram K Narayana, Ibrahim Sarpkaya, Joseph Crandall, Richard A Soref, Hamed Dalir, Tarek El-Ghazawi, and Volker J Sorger. 2017. Hybrid photonic-plasmonic nonblocking broadband 5×5 router for optical networks. IEEE Photonics Journal 10, 2 (2017), 1–12.
➢ [5] Rupert F Oulton, Volker J Sorger, Thomas Zentgraf, Ren-Min Ma, Christopher Gladden, Lun Dai, Guy Bartal, and Xiang Zhang. 2009. Plasmon lasers at deep subwavelength scale. Nature 461, 7264 (2009), 629.
➢ [6] Erman Timurdogan, Cheryl M Sorace-Agaskar, Jie Sun, Ehsan Shah Hosseini, Aleksandr Biberman, and Michael R Watts. 2014. An ultralow power athermal silicon modulator. Nature Communications 5 (2014), 4008.
➢ [7] Yannick Salamin, Ping Ma, Benedikt Baeuerle, Alexandros Emboras, Yuriy Fedoryshyn, Wolfgang Heni, Bojun Cheng, Arne Josten, and Juerg Leuthold. 2018. 100 GHz plasmonic photodetector. ACS Photonics 5, 8 (2018), 3291–3297.
➢ [8] Yunji Chen, Tao Luo, Shaoli Liu, Shijin Zhang, Liqiang He, Jia Wang, Ling Li, Tianshi Chen, Zhiwei Xu, Ninghui Sun, et al. 2014. DaDianNao: A machine-learning supercomputer. In Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society, 609–622.

SLIDE 24

Thank you!