
SLIDE 1

DNNARA: A Deep Neural Network Accelerator using Residue Arithmetic and Integrated Photonics

Jiaxin Peng, Yousra Alkabani, Shuai Sun, Volker Sorger, and Tarek El-Ghazawi

49th International Conference on Parallel Processing – ICPP August 2020

SLIDE 2

Outline

➢Introduction
➢Background
➢Integrated Photonic Residue Arithmetic Computing Engine for Convolutional Neural Network
➢Performance Evaluation
➢Conclusion


SLIDE 3

Introduction


SLIDE 4

Introduction

➢Some NN applications require real-time analysis for inference
➢Computation intensive: inference includes billions of multiply-accumulate (MAC) operations
➢We propose DNNARA: a deep neural network accelerator using residue arithmetic based on integrated photonics
➢All computations through the neural network are done in the residue number system (RNS) to avoid extra binary to/from RNS conversions


Block Diagram of a DNNARA System

SLIDE 5

Introduction

➢DNNARA: RNS with wavelength-division multiplexing (WDM)

Execute multiple MVMs due to WDM feature
Speed up MVMs due to digit-independent feature
Residues are small-sized – save area/hardware resources
Increase the system parallelism


SLIDE 6

Background

➢Convolutional Neural Network
➢Residue Number System


SLIDE 7

Background – Convolutional Neural Network

➢Widely applied in classification

Image recognition

➢Including several layers/functions

Convolutional layers
Activation functions – add non-linearity
ReLU (Rectified Linear Unit)
Sigmoid function / Hyperbolic tangent function
Pooling layers – downsample the output
Max pooling
Average pooling
Fully-connected layers

➢Contains up to billions of multiply-accumulate (MAC) operations


SLIDE 8

Background - Residue Number System (RNS)

➢Each Integer X is represented by its “residue,” or remainder obtained by dividing it by a modulus Mi

Example: Moduli are M1=2, M2=3, M3=5, M4=7
X = 20 is represented as X={0, 2, 0, 6}[2, 3, 5, 7]
Range of numbers that can be represented: 0 to (M − 1), where M = M1*M2*M3*M4 (here M = 210, so 0 to 209)
Moduli should be relatively prime

➢Negative Number Notation: Similar to 2’s complement

r = | m − |−X|_m |_m (where X is negative and |a|_m denotes a mod m)
Example: -20 = {|2-0|_2, |3-2|_3, |5-0|_5, |7-6|_7}[2, 3, 5, 7] = {0, 1, 0, 1}[2, 3, 5, 7]
Range of numbers that can be represented: [−(M−1)/2, (M−1)/2] if M is odd, or [−M/2, M/2−1] if M is even

➢Residue Arithmetic: Operations carried out on residues

Example: Addition and multiplication of X=20={0, 2, 0, 6}[2, 3, 5, 7] and Y=5={1, 2, 0, 5}[2, 3, 5, 7]
X+Y = {0+1, 2+2, 0+0, 6+5}[2, 3, 5, 7] → {1, 1, 0, 4}[2, 3, 5, 7]
X*Y = {0*1, 2*2, 0*0, 6*5}[2, 3, 5, 7] → {0, 1, 0, 2}[2, 3, 5, 7]
Residue arithmetic is carried out as modulo additions and multiplications on the residues
Residue arithmetic is carried out on each residue in parallel
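
The examples on this slide can be checked with a short Python sketch. It uses the slide's moduli (2, 3, 5, 7); the function names are made up for illustration, and the decoder uses the Chinese Remainder Theorem, which the slides do not spell out. It assumes Python 3.8+ for `math.prod` and the modular inverse `pow(a, -1, m)`.

```python
from math import prod

MODULI = (2, 3, 5, 7)      # moduli from the slide example
M = prod(MODULI)           # dynamic range: 2*3*5*7 = 210


def to_rns(x, moduli=MODULI):
    """Return the residues of x; Python's % already maps negatives into [0, m)."""
    return tuple(x % m for m in moduli)


def rns_add(a, b, moduli=MODULI):
    """Digit-wise modular addition; no carry crosses between residues."""
    return tuple((ai + bi) % m for ai, bi, m in zip(a, b, moduli))


def rns_mul(a, b, moduli=MODULI):
    """Digit-wise modular multiplication."""
    return tuple((ai * bi) % m for ai, bi, m in zip(a, b, moduli))


def from_rns(res, moduli=MODULI, signed=True):
    """Decode residues with the Chinese Remainder Theorem."""
    total = prod(moduli)
    x = 0
    for r, m in zip(res, moduli):
        partial = total // m
        x += r * partial * pow(partial, -1, m)
    x %= total
    # Signed convention: the upper half of [0, M) represents negative numbers.
    if signed and x >= (total + 1) // 2:
        x -= total
    return x


X, Y = to_rns(20), to_rns(5)
SUM_XY = rns_add(X, Y)     # residues of 25
PROD_XY = rns_mul(X, Y)    # residues of 100
NEG_X = to_rns(-20)        # residues of -20
```

Because each residue is processed on its own, the four digit positions could be computed in parallel, which is the property the slide highlights.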


SLIDE 9

Integrated Photonic Residue Arithmetic Computing Engine for Neural Network

➢Overview
➢Residue Adders and Multipliers
➢Residue Matrix-Vector Multiplication Unit
➢Sigmoid Unit
➢Max Pooling Unit


SLIDE 10

R-MVM: Residue Matrix-Vector Multiplication
R-Multiplier: Residue Multiplier
R-Adder: Residue Adder
MRR: Micro-Ring Resonator
PD: Photo-Detector
LUT: Look-up Table
RNS2Bin: RNS to Binary
Bin2RNS: Binary to RNS
T: tile

Overview Architecture


SLIDE 11

Integrated Photonic Residue Adder and Multiplier

➢Basic block

An electro-optical 2×2 switch
Light either propagates straight through (“bar” state, (a)) or crosses over (“cross” state, (b))

➢Residue Adder [1] – one-hot encoding

Could be considered as a mapping (injection)
Arbitrary Size Benes (AS-Benes) Network ((c): even number, (d): odd number)
Switch states are precomputed and stored in a look-up table (LUT)

➢An AS-Benes modulo-5 adder (e)

Example with |3+4|_5 = 2

➢A Modulo-N Residue Multiplier Implementation (f)
➢WDM capable
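
A minimal Python sketch of the one-hot adder idea follows. It models only the input-to-output port mapping that the switch network realizes, not the AS-Benes routing or the individual 2×2 switch states. The function names and the table layout are assumptions made for illustration.

```python
def one_hot(value, n):
    """One-hot encode a residue: light enters exactly one of n input ports."""
    return [1 if i == value else 0 for i in range(n)]


def build_adder_lut(n):
    """For each addend b, precompute the port permutation i -> (i + b) mod n.

    Hardware would store switch states in the LUT; this sketch stores the
    resulting port mapping instead (an assumption, not the real LUT format).
    """
    return {b: [(i + b) % n for i in range(n)] for b in range(n)}


def residue_add(a, b, n, lut):
    """Route the one-hot input for a through the mapping selected by b."""
    ports_in = one_hot(a, n)
    ports_out = [0] * n
    for i, lit in enumerate(ports_in):
        if lit:
            ports_out[lut[b][i]] = 1
    return ports_out.index(1)   # index of the lit output port is the sum


LUT5 = build_adder_lut(5)
RESULT = residue_add(3, 4, 5, LUT5)   # the slide's example |3+4|_5
```

Each LUT row is a permutation of the ports, which is why a non-blocking permutation network such as a Benes network is a natural fit for the adder.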


SLIDE 12

Residue MVM (R-MVM) Computing Block

➢Schematic of designed R-MVM (b)
➢Wavelength-Division Multiplexing (WDM) capable
➢Lasers, MRRs, PDs, LUTs, and registers, as well as photonic and electrical connections, are needed
➢sel chooses either the partial sum or the bias
➢Example: 5×5 input feature and a 2×2 kernel
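
To make the 5×5 / 2×2 example concrete, here is a hedged Python sketch of a stride-1 convolution carried out per modulus and then decoded. The input values, kernel, bias, and function names are hypothetical. The sketch assumes that sel picks the bias for the first accumulation and the partial sum afterwards, and it does not model wavelengths or the photonic datapath. It requires Python 3.8+.

```python
from math import prod

MODULI = (2, 3, 5, 7)   # example moduli from the RNS slide; M = 210
M = prod(MODULI)


def to_rns(x):
    return tuple(x % m for m in MODULI)


def from_rns(res):
    """CRT decode into the signed range [-M/2, M/2 - 1] (M is even here)."""
    x = sum(r * (M // m) * pow(M // m, -1, m) for r, m in zip(res, MODULI)) % M
    return x - M if x >= M // 2 else x


def conv2d_rns(feature, kernel, bias=0):
    """Stride-1 'valid' convolution with every MAC done modulo each m."""
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = len(feature) - kh + 1, len(feature[0]) - kw + 1
    f_rns = [[to_rns(v) for v in row] for row in feature]
    k_rns = [[to_rns(v) for v in row] for row in kernel]
    out = []
    for r in range(oh):
        out_row = []
        for c in range(ow):
            acc = list(to_rns(bias))          # assumed: sel = bias on the first MAC
            for i in range(kh):
                for j in range(kw):
                    x, w = f_rns[r + i][c + j], k_rns[i][j]
                    for d, m in enumerate(MODULI):
                        acc[d] = (acc[d] + x[d] * w[d]) % m   # sel = partial sum
            out_row.append(tuple(acc))
        out.append(out_row)
    return out


# Hypothetical data, kept small so every result fits in [-105, 104].
FEATURE = [[(r + c) % 4 for c in range(5)] for r in range(5)]
KERNEL = [[1, -2], [3, 1]]
OUT_RNS = conv2d_rns(FEATURE, KERNEL, bias=1)
OUT = [[from_rns(v) for v in row] for row in OUT_RNS]   # 4x4 integer result
```

No conversion back to binary happens inside the loop; decoding is done once at the end, in line with the goal of keeping the whole network in the residue domain.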


SLIDE 13

Pipeline of a MAC operation

Cycle 1:
Input features (x) are encoded as light with different wavelengths
Weights (w) drive the selection lines, loading the switch states from the LUT

Cycle 2:
Set up the switch states accordingly
Inject light and detect light – multiply
MRRs & PDs act as filters to extract the results of all the multiplications


SLIDE 14

Pipeline of a MAC operation

Cycle 3:
Results from the last cycle (w*x) drive the selection lines to load the switch states for the adders
According to sel, either the partial sum or the bias is encoded as light

Cycle 4:
Set up the adders
Inject light and detect light – add
Cycle 5: Write back to the register
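
The five cycles can be traced in software for a single modulus, as in the sketch below. The tables model only the input-to-output mapping of the multiplier and adder, not real switch states, and the string-valued `sel` argument and all function names are hypothetical.

```python
def build_luts(n):
    """Precompute mappings for a modulo-n multiplier and adder.

    mul_lut[w][x] = (x * w) mod n and add_lut[p][s] = (s + p) mod n.
    These stand in for the stored switch states (assumption of this sketch).
    """
    mul_lut = [[(x * w) % n for x in range(n)] for w in range(n)]
    add_lut = [[(s + p) % n for s in range(n)] for p in range(n)]
    return mul_lut, add_lut


def mac_pipeline(x, w, partial_sum, bias, sel, n, luts):
    """Walk one MAC through the five cycles listed on the slides."""
    mul_lut, add_lut = luts
    # Cycle 1: x is encoded as light; w selects the multiplier state.
    mul_state = mul_lut[w]
    # Cycle 2: light passes through the multiplier and is detected.
    product = mul_state[x]
    # Cycle 3: the product selects the adder state; sel picks the addend.
    add_state = add_lut[product]
    addend = bias if sel == "bias" else partial_sum
    # Cycle 4: light passes through the adder and is detected.
    total = add_state[addend]
    # Cycle 5: the result is written back to the register.
    return total


LUTS = build_luts(5)
WITH_PARTIAL = mac_pipeline(3, 4, 2, 1, "partial", 5, LUTS)   # |3*4 + 2|_5
WITH_BIAS = mac_pipeline(3, 4, 2, 1, "bias", 5, LUTS)         # |3*4 + 1|_5
```

In a pipelined design a new MAC could enter cycle 1 while earlier ones occupy later cycles; the sketch runs the stages sequentially for clarity only.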


SLIDE 15

Sigmoid Function Unit - Polynomial

➢In the residue domain, it is hard to calculate the sigmoid function directly
➢Instead, it can be treated as a polynomial, because the sigmoid function can be represented as a Taylor series
➢Need to pre-calculate the terms that include x, and build the connections accordingly
➢Example: P(x) = ax^4 + bx^3 + cx^2 + dx + e in a modulo-5 system
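
One way to picture the pre-calculation is a table of P(x) mod 5 for every possible residue x, as sketched below. The coefficients are hypothetical integers chosen for illustration; a real sigmoid approximation has fractional Taylor coefficients that would first need scaling to integers, which this sketch does not attempt.

```python
def poly_mod_lut(coeffs, n):
    """Tabulate P(x) mod n for every residue x in [0, n).

    coeffs lists integer coefficients from highest to lowest degree, e.g.
    (a, b, c, d, e) for a*x^4 + b*x^3 + c*x^2 + d*x + e. Horner's rule keeps
    every intermediate value reduced modulo n.
    """
    table = []
    for x in range(n):
        acc = 0
        for coef in coeffs:
            acc = (acc * x + coef) % n
        table.append(acc)
    return table


# Hypothetical coefficients (a, b, c, d, e); not taken from the slides.
COEFFS = (2, 0, 3, 1, 4)
P_MOD5 = poly_mod_lut(COEFFS, 5)   # P_MOD5[x] is |P(x)|_5
```

Since a modulo-5 input has only five possible values, the whole function collapses to a five-entry mapping, which is what makes a fixed wiring ("connection") feasible.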


SLIDE 16

Max pool Function Unit

➢The sign is implicit in RNS, so it cannot be read directly from the residues
➢Instead, we convert the number from RNS to MRS (mixed-radix number system) [2]
➢In the MRS, the coefficient of the even modulus 2 (a4) separates negative from non-negative numbers
➢It is serial but could be pipelined
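
The sketch below illustrates sign detection through mixed-radix conversion and uses it to pick the larger of two values. It assumes the moduli are ordered (3, 5, 7, 2) so that the even modulus yields the last coefficient a4; the slides do not state the ordering. Function names are made up, and Python 3.8+ is assumed for `pow(m, -1, n)`.

```python
MODULI = (3, 5, 7, 2)   # assumed ordering: the even modulus 2 comes last


def to_rns(x):
    return tuple(x % m for m in MODULI)


def rns_to_mrs(res):
    """Mixed-radix digits with X = a1 + a2*3 + a3*3*5 + a4*3*5*7.

    Digits come out one after another: each step subtracts the digit just
    found and divides by its modulus in all remaining residue channels.
    """
    work = list(res)
    digits = []
    for i, m_i in enumerate(MODULI):
        a = work[i]
        digits.append(a)
        for j in range(i + 1, len(MODULI)):
            m_j = MODULI[j]
            work[j] = ((work[j] - a) * pow(m_i, -1, m_j)) % m_j
    return digits


def is_negative(res):
    """a4 = 1 places X in the upper half of [0, 210), i.e. a negative value."""
    return rns_to_mrs(res)[-1] == 1


def rns_max(a, b):
    """Larger of two RNS values, decided by the sign of a - b.

    The difference itself must stay inside the signed range [-105, 104].
    """
    diff = tuple((x - y) % m for x, y, m in zip(a, b, MODULI))
    return b if is_negative(diff) else a


DIGITS_NEG20 = rns_to_mrs(to_rns(-20))
DIGITS_POS20 = rns_to_mrs(to_rns(20))
LARGER = rns_max(to_rns(7), to_rns(-3))
```

The inner loop depends on the digit found in the previous step, which is the serial behaviour the slide mentions; successive inputs could still overlap in a pipeline.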


SLIDE 17

Performance Evaluation


SLIDE 18

Experiment Setup

➢Electrical memory component

CACTI 7.0 [3]

➢Optical Switch [4]

Lumerical FDTD

➢Optical circuit

Lumerical Interconnect

➢Lasers/MRRs/PDs

Data from other work ([5], [6], and [7], respectively)

➢HyperTransport serial link

Data from [8]

➢System Level Design

Our own simulator


Configurations of Selected Benchmarks

SLIDE 19

Design Space Exploration

➢Swept Parameters

WDM size
# of tiles in a chip
# of MVMs in a tile

➢Computation capability

# of operations / (time*area*power)
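
Written as a formula, the metric on this slide reads as follows. The symbols N_ops, t, A, and P are labels introduced here for the number of operations, execution time, chip area, and power; they do not appear in the slides.

```latex
\text{computation capability} = \frac{N_{\text{ops}}}{t \cdot A \cdot P}
```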


SLIDE 20

Hardware Specification


SLIDE 21

Speed & Power Analysis

➢Evaluated on real benchmarks
➢More chips run faster, but the speedup does not scale proportionally
➢More chips consume more power
➢Due to communication
➢19 times faster than a GPU (Nvidia Tesla V100) for VGG-4 with the same power budget


SLIDE 22

Conclusion

➢Proposed DNNARA, a deep neural network accelerator that uses the residue number system
➢DNNARA is a hybrid electro-optical design
➢Proposed a system-level CNN accelerator chip with nanophotonics
➢Built a system-level simulator for experimental estimation
➢Could reach up to 12.6 GOPS/(second·mm^2·watt)
➢Ran 19 times faster than a state-of-the-art GPU (Nvidia Tesla V100) for VGG-4 with the same power budget


SLIDE 23

References

➢ [1] Jiaxin Peng, Yousra Alkabani, Shuai Sun, Volker Sorger, and Tarek El-Ghazawi. 2019. Integrated Photonics Architectures for Residue Number System Computations. In IEEE International Conference on Rebooting Computing (ICRC 2019). 129–137.
➢ [2] Nicholas S Szabo and Richard I Tanaka. 1967. Residue arithmetic and its applications to computer technology. McGraw-Hill.
➢ [3] Naveen Muralimanohar, Rajeev Balasubramonian, and Norm Jouppi. 2007. Optimizing NUCA organizations and wiring alternatives for large caches with CACTI 6.0. In Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society, 3–14.
➢ [4] Shuai Sun, Vikram K Narayana, Ibrahim Sarpkaya, Joseph Crandall, Richard A Soref, Hamed Dalir, Tarek El-Ghazawi, and Volker J Sorger. 2017. Hybrid photonic-plasmonic nonblocking broadband 5×5 router for optical networks. IEEE Photonics Journal 10, 2 (2017), 1–12.
➢ [5] Rupert F Oulton, Volker J Sorger, Thomas Zentgraf, Ren-Min Ma, Christopher Gladden, Lun Dai, Guy Bartal, and Xiang Zhang. 2009. Plasmon lasers at deep subwavelength scale. Nature 461, 7264 (2009), 629.
➢ [6] Erman Timurdogan, Cheryl M Sorace-Agaskar, Jie Sun, Ehsan Shah Hosseini, Aleksandr Biberman, and Michael R Watts. 2014. An ultralow power athermal silicon modulator. Nature Communications 5 (2014), 4008.
➢ [7] Yannick Salamin, Ping Ma, Benedikt Baeuerle, Alexandros Emboras, Yuriy Fedoryshyn, Wolfgang Heni, Bojun Cheng, Arne Josten, and Juerg Leuthold. 2018. 100 GHz plasmonic photodetector. ACS Photonics 5, 8 (2018), 3291–3297.
➢ [8] Yunji Chen, Tao Luo, Shaoli Liu, Shijin Zhang, Liqiang He, Jia Wang, Ling Li, Tianshi Chen, Zhiwei Xu, Ninghui Sun, et al. 2014. DaDianNao: A machine-learning supercomputer. In Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society, 609–622.


SLIDE 24


Thank you!