Efficient Implementation of Modular Division by Input Bit Splitting


Danila Gorodecky, National Academy of Science of Belarus, Minsk, Belarus, danila.gorodecky@gmail.com; Tiziano Villa, University of Verona, Verona, Italy, tiziano.villa@univr.it


slide-1
SLIDE 1

Efficient Implementation of Modular Division by Input Bit Splitting

Danila Gorodecky

National Academy of Science of Belarus, Minsk, Belarus danila.gorodecky@gmail.com

Tiziano Villa

University of Verona, Verona, Italy tiziano.villa@univr.it

slide-2
SLIDE 2

Motivation

– For historical and algorithmic reasons, the residue number system (RNS) is not a common arithmetic approach; it has not been studied thoroughly and is rarely implemented in hardware.
– RNS has a number of promising applications, such as parallel computing and cryptography.
– There is no "efficient" hardware approach to computing X mod P (the operation is not synthesizable in state-of-the-art EDA tools).
slide-3
SLIDE 3

Implementation of RNS

– Digital filtering with finite impulse response (FIR filtering);
– Cryptosystem of the US Federal Reserve System;
– Space flight control (Russia);
– Data transfer between satellites and Earth (Russia);
– Air defense systems (USA, Russia).

slide-4
SLIDE 4

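Common architecture of computation in RNS

[Block diagram: the inputs A1, A2, …, An enter converter (1), which turns positional numbers into the modular representation. The residues feed m parallel channels; channel i is a summator/multiplier (mod p_i) that produces S_i, for i = 1, …, m. Converter (2) turns the modular representation S1, S2, …, Sm back into the positional number S.]

Channel $i$ works on the residues $A_j \bmod p_i$ and computes a product or a sum modulo $p_i$:

$$R_i = A_1 \times A_2 \times \dots \times A_n \pmod{p_i}$$

$$S_i = A_1 + A_2 + \dots + A_n \pmod{p_i}$$

The channel results $S_1, \dots, S_m$ (or $R_1, \dots, R_m$) determine the result $S$ (or $R$) modulo $P$, where

$$P = \prod_{i=1}^{m} p_i$$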

slide-5
SLIDE 5

Example of computation in RNS

$$A = \{A_1, A_2, A_3\}, \qquad B = \{B_1, B_2, B_3\}$$

$$A_i = A \bmod p_i = A - \left\lfloor \frac{A}{p_i} \right\rfloor p_i, \qquad B_i = B \bmod p_i = B - \left\lfloor \frac{B}{p_i} \right\rfloor p_i$$

$$S = S_1 \cdot Y_1 + S_2 \cdot Y_2 + S_3 \cdot Y_3 - r \cdot P, \qquad r = 0, 1, 2, \dots$$

$$Y_i = \frac{P}{p_i} \cdot k_i, \qquad k_i = \overline{1, p_i}, \qquad Y_i \equiv 1 \pmod{p_i}, \qquad i = 1, 2, 3$$

$$A \le 314, \qquad B \le 314$$
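The example can be traced with a short Python sketch. The slide does not list the moduli, so the set (5, 7, 9) is an assumption, chosen only because its product P = 315 agrees with the bound 314 on A and B; the helper names `to_rns`, `weights` and `from_rns` are likewise hypothetical.

```python
from math import prod

# Assumed moduli (not given on the slide): pairwise coprime, P = 315.
MODULI = (5, 7, 9)
P = prod(MODULI)

def to_rns(x):
    """Forward conversion: residues x mod p_i."""
    return tuple(x % p for p in MODULI)

def weights():
    """Y_i = (P / p_i) * k_i, with k_i in 1..p_i chosen so that Y_i = 1 (mod p_i)."""
    ys = []
    for p in MODULI:
        q = P // p
        k = next(k for k in range(1, p + 1) if (q * k) % p == 1)
        ys.append(q * k)
    return ys

def from_rns(residues):
    """Reverse conversion: S = sum(S_i * Y_i) - r * P, written here as a reduction mod P."""
    return sum(s * y for s, y in zip(residues, weights())) % P

a, b = 123, 45
A, B = to_rns(a), to_rns(b)
# Each channel adds or multiplies independently modulo its own p_i.
S = tuple((x + y) % p for x, y, p in zip(A, B, MODULI))
R = tuple((x * y) % p for x, y, p in zip(A, B, MODULI))
print(from_rns(S), from_rns(R))
```

The sum is recovered exactly because a + b stays below P; the product is recovered only modulo P.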

slide-6
SLIDE 6

Main hardware design problems

  • Realization of X (mod P) for an arbitrary P (forward conversion): most EDA tools end with a synthesis error.
  • Transformation of the modular representation into positional numbers (reverse conversion): it needs either a long critical path or a large hardware cost.

slide-7
SLIDE 7

Approaches to X (mod P) design

[1] P.V.A. Mohan, "Residue Number System. Theory and applications", Springer International Publishing, 2016, 351 p.
[2] J.T. Butler and T. Sasao, "Fast hardware computation of x mod z", 25th IEEE International Parallel and Distributed Processing Symposium, Anchorage, AK, USA, May 16-17, 2011, p. 289-292.
[3] Mark A. Will and Ryan K. L. Ko, "Computing Mod Without Mod"

slide-8
SLIDE 8
Input bit splitting approach

  • splitting the input into small tuples (up to 12 bits);
  • Boolean minimization (Disjunctive Normal Form, Reed-Muller expansion, Binary Decision Tree, Majority Graph).

slide-9
SLIDE 9

Fourier transformation multiplication (FTM)

FTM splits binary vectors into m-bit tuples and multiplies the tuples by each other:

[Figure: A and B are each divided into k tuples covering bit positions 1 … m, m+1 … 2m, 2m+1 … 3m, …, (k-1)m+1 … km; the product R covers bit positions 1 … 2km.]

$$R = \sum_{i=1}^{k} \sum_{j=1}^{k} A_i \cdot B_j \cdot 2^{m(i+j-2)}$$

There are $k^2$ operands of $2m$ bits each in FTM. This leads to a logarithmic number of adder-tree levels (a bracketed $\log_2$ expression in $k$ and $m$).
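The tuple-wise product can be modelled in a few lines of Python. This is a behavioural sketch only, not the hardware structure; the function names `split_tuples` and `ftm_multiply` are invented for the illustration.

```python
def split_tuples(x, m, k):
    """Return the k tuples of m bits each, least significant tuple first."""
    mask = (1 << m) - 1
    return [(x >> (m * i)) & mask for i in range(k)]

def ftm_multiply(a, b, m, k):
    """Sum the k*k tuple products A_i * B_j, each weighted by 2**(m*(i+j-2))."""
    A = split_tuples(a, m, k)
    B = split_tuples(b, m, k)
    r = 0
    for i in range(1, k + 1):
        for j in range(1, k + 1):
            r += (A[i - 1] * B[j - 1]) << (m * (i + j - 2))
    return r

# Two 14-bit operands split into k = 4 tuples of m = 4 bits.
print(ftm_multiply(0x2ABC, 0x1F3D, 4, 4) == 0x2ABC * 0x1F3D)
```

In hardware each `A_i * B_j` would be a small independent multiplier, and the loop body corresponds to the adder tree that combines the shifted products.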

slide-10
SLIDE 10

Multiplication in Synopsys in 28 nm technology

Comparison of monolith and Synopsys multipliers (regular arithmetic), frequency in GHz:

Multiplier   Synopsys   Monolith
2x2          6.66       6.25
3x3          4.16       5
4x4          3.33       3.85
5x5          2.5        3.03
6x6          2.38       2.5
7x7          2.12       2.17
8x8          2.04       1.75

slide-11
SLIDE 11

FTM for both arithmetics

Example. $R = A \times B$, where $A$ and $B$ are 14-bit operands. Multiplication in regular arithmetic, where $R$ is a 28-bit vector:

$$A = (A_4, A_3, A_2, A_1), \quad A_1 = (a_4, a_3, a_2, a_1), \quad A_2 = (a_8, a_7, a_6, a_5), \quad A_3 = (a_{12}, a_{11}, a_{10}, a_9), \quad A_4 = (a_{14}, a_{13})$$

$$B = (B_4, B_3, B_2, B_1), \quad B_1 = (b_4, b_3, b_2, b_1), \quad B_2 = (b_8, b_7, b_6, b_5), \quad B_3 = (b_{12}, b_{11}, b_{10}, b_9), \quad B_4 = (b_{14}, b_{13})$$

$$\begin{aligned}
R ={} & A_1 B_1 + A_1 B_2 \cdot 2^{4} + A_1 B_3 \cdot 2^{8} + A_1 B_4 \cdot 2^{12} \\
 & + A_2 B_1 \cdot 2^{4} + A_2 B_2 \cdot 2^{8} + A_2 B_3 \cdot 2^{12} + A_2 B_4 \cdot 2^{16} \\
 & + A_3 B_1 \cdot 2^{8} + A_3 B_2 \cdot 2^{12} + A_3 B_3 \cdot 2^{16} + A_3 B_4 \cdot 2^{20} \\
 & + A_4 B_1 \cdot 2^{12} + A_4 B_2 \cdot 2^{16} + A_4 B_3 \cdot 2^{20} + A_4 B_4 \cdot 2^{24}
\end{aligned}$$

slide-12
SLIDE 12

Adder-tree levels reduction

FTM needs a logarithmic number of adder-tree levels (a bracketed $\log_2$ expression in $k$ and $m$), where $k$ is the number of $m$-bit subvectors. We propose techniques that improve the architecture of monolith-based multipliers and, as a consequence, minimize calculation time. The adder-tree optimization consists in concatenating separate results of monolith multiplication. For instance, for k = 4 and m = 4, the following operands can be joined into one vector by concatenation:

$$R = \dots + A_1 B_1 + A_1 B_3 \cdot 2^{8} + A_2 B_4 \cdot 2^{16} + A_4 B_4 \cdot 2^{24} + \dots$$

$(A_1 B_1)$: 8 bits
$(A_1 B_3,\ 00000000)$: 8 bits followed by 8 zero bits
$(A_2 B_4,\ 0000000000000000)$: 8 bits followed by 16 zero bits
$(A_4 B_4,\ 000000000000000000000000)$: 8 bits followed by 24 zero bits

The nonzero fields of these four vectors do not overlap, so they are joined by concatenation (&) instead of addition:

$$(A_4 B_4,\ A_2 B_4,\ A_1 B_3,\ A_1 B_1)$$
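A quick Python check of the concatenation idea, assuming 4-bit tuples (m = 4) so that every partial product fits into 8 bits. The tuple values and the helper name `concat` are made up for the illustration.

```python
def concat(fields):
    """Concatenate (value, width) fields, most significant field first."""
    out = 0
    for value, width in fields:
        assert 0 <= value < (1 << width)  # each value must fit its field
        out = (out << width) | value
    return out

# Arbitrary 4-bit tuples of A and B.
a1, a2, a4 = 0b1011, 0b0110, 0b1111
b1, b3, b4 = 0b1101, 0b1001, 0b0111

# The four products have shifts 0, 8, 16 and 24, so their 8-bit fields never overlap.
added = (a1 * b1) + ((a1 * b3) << 8) + ((a2 * b4) << 16) + ((a4 * b4) << 24)
joined = concat([(a4 * b4, 8), (a2 * b4, 8), (a1 * b3, 8), (a1 * b1, 8)])
print(joined == added)
```

Concatenation is pure wiring in hardware, so these four operands cost no adder at all; only the resulting joined vectors still need to be summed.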

slide-13
SLIDE 13

Efficient architecture

Example. $R = A \times B$, where $A$ and $B$ are 14-bit operands and $R$ is 28 bits.

$$A = (A_4, A_3, A_2, A_1), \qquad B = (B_4, B_3, B_2, B_1)$$

$$\begin{aligned}
R_1 &= A_1 \times B_1, & R_2 &= A_1 \times B_2, & R_3 &= A_1 \times B_3, & R_4 &= A_1 \times B_4, \\
R_5 &= A_2 \times B_1, & R_6 &= A_2 \times B_2, & R_7 &= A_2 \times B_3, & R_8 &= A_2 \times B_4, \\
R_9 &= A_3 \times B_1, & R_{10} &= A_3 \times B_2, & R_{11} &= A_3 \times B_3, & R_{12} &= A_3 \times B_4, \\
R_{13} &= A_4 \times B_1, & R_{14} &= A_4 \times B_2, & R_{15} &= A_4 \times B_3, & R_{16} &= A_4 \times B_4
\end{aligned}$$

$$\begin{aligned}
R ={} & (R_{16}, R_{11}, R_3, R_1) + (R_{12}, R_7, R_2, 0000) + (R_{15}, R_{10}, R_5, 0000) \\
 & + (R_8, R_6, 00000000) + (R_{14}, R_9, 00000000) + (R_4 + R_{13},\ 000000000000)
\end{aligned}$$
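The grouping can be checked against an ordinary multiplication with a short Python model. It is a sketch under the assumption that the tuples are 4, 4, 4 and 2 bits wide (14 bits in total) and that partial product R_n with n = 4(i-1)+j equals A_i × B_j; the function names are hypothetical.

```python
def tuples14(x):
    """Split a 14-bit value into four tuples of 4, 4, 4 and 2 bits, low bits first."""
    return [x & 0xF, (x >> 4) & 0xF, (x >> 8) & 0xF, (x >> 12) & 0x3]

def multiply14(a, b):
    A, B = tuples14(a), tuples14(b)
    # R[n] for n = 1..16, numbered as R_{4(i-1)+j} = A_i * B_j.
    R = {4 * i + j + 1: A[i] * B[j] for i in range(4) for j in range(4)}
    # Inside a group the fields do not overlap, so OR acts as concatenation.
    groups = [
        (R[16] << 24) | (R[11] << 16) | (R[3] << 8) | R[1],
        (R[12] << 20) | (R[7] << 12) | (R[2] << 4),
        (R[15] << 20) | (R[10] << 12) | (R[5] << 4),
        (R[8] << 16) | (R[6] << 8),
        (R[14] << 16) | (R[9] << 8),
        (R[4] + R[13]) << 12,
    ]
    # Sixteen partial products collapse into six operands for the adder tree.
    return sum(groups)

print(multiply14(12345, 7997) == 12345 * 7997)
```

Only the last group contains a real addition before the shift; the other five are formed by wiring alone.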

slide-14
SLIDE 14

Multiplication in Synopsys in 28 nm technology

Comparison of monolith based and Synopsys multipliers (regular arithmetic), frequency in GHz:

Multiplier   Synopsys   Monolith based
8x8          2.04       2.08
10x10        1.85       1.7
12x12        1.69       1.7
14x14        1.66       1.61
16x16        1.61       1.61
18x18        1.53       1.45
20x20        1.51       1.39
22x22        1.38       1.37
24x24        1.38       1.35
26x26        1.38       1.31
28x28        1.36       1.28
30x30        1.31       1.22
32x32        1.3        1.24

slide-15
SLIDE 15

Fourier transformation for X (mod P)

slide-16
SLIDE 16

Two-level Boolean functions minimization


slide-17
SLIDE 17

Boolean functions minimization

Espresso (Berkeley, USA) or ELS (Minsk, Belarus):

  • DNF minimization (two-level minimization);
  • Binary Decision Diagram (BDD) minimization (multi-level minimization);
  • minimization in the class of Reed-Muller expansions;
  • etc.
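A minimizer of this kind starts from a truth table. The slides do not show which functions are passed to Espresso or ELS, so the following Python sketch assumes that each small input tuple is treated as a Boolean function of its residue; it lists only the unminimized minterms per output bit and does not call any minimizer. The function name `residue_minterms` and its parameters are hypothetical.

```python
def residue_minterms(m, weight, p):
    """For an m-bit tuple x, list the minterms of every output bit of (x * weight) mod p.

    Assumption for illustration: 'weight' stands for the tuple's positional
    weight modulo p. Each returned list is the unminimized DNF of one output bit.
    """
    out_bits = max(1, (p - 1).bit_length())
    minterms = [[] for _ in range(out_bits)]
    for x in range(1 << m):
        r = (x * weight) % p
        for bit in range(out_bits):
            if (r >> bit) & 1:
                minterms[bit].append(format(x, f"0{m}b"))
    return minterms

# 3-bit tuple, weight 1, modulus 5: three output bits.
for bit, terms in enumerate(residue_minterms(3, 1, 5)):
    print(f"r[{bit}] = OR of minterms {terms}")
```

With tuples of up to 12 bits the table has at most 4096 rows per function, a size that two-level minimizers generally handle without difficulty.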
slide-18
SLIDE 18

Verilog for X[100:1] (mod 997)

module x_100_mod_997(
    input  [100:1] X,
    output [10:1]  R
);
    wire [22:1] R_temp_1;
    wire [15:1] R_temp_2;
    wire [11:1] R_temp_3;
    reg  [10:1] R_temp;

    // Stage 1: each 10-bit tuple is weighted by 2^(10*i) mod 997
    assign R_temp_1 = X[10:1]
                    + X[20:11]  * 5'b11011          // 27
                    + X[30:21]  * 10'b1011011001    // 729
                    + X[40:31]  * 10'b1011100100    // 740
                    + X[50:41]  * 6'b101000         // 40
                    + X[60:51]  * 7'b1010011        // 83
                    + X[70:61]  * 8'b11110111       // 247
                    + X[80:71]  * 10'b1010101111    // 687
                    + X[90:81]  * 10'b1001011011    // 603
                    + X[100:91] * 9'b101001001;     // 329

    // Stage 2: the 22-bit sum is split and weighted again
    assign R_temp_2 = R_temp_1[10:1]
                    + R_temp_1[20:11] * 5'b11011
                    + R_temp_1[22:21] * 10'b1011011001;

    // Stage 3: the 15-bit sum is reduced to 11 bits
    assign R_temp_3 = R_temp_2[10:1]
                    + R_temp_2[15:11] * 5'b11011;

    // Final correction: R_temp_3 is below 2*997, so one subtraction is enough
    always @* begin
        if (R_temp_3 >= 10'b1111100101)
            R_temp = R_temp_3 - 10'b1111100101;
        else
            R_temp = R_temp_3;
    end

    assign R = R_temp;
endmodule

R_temp_1 < 3566179 (22-bit number)
R_temp_2 < 30831 (15-bit number)
R_temp_3 < 1833 (11-bit number)
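The three reduction stages and the quoted bounds can be reproduced with a small Python model. It is a behavioural sketch for checking the arithmetic, not a replacement for the Verilog; the helper names `weights` and `fold` are invented here.

```python
P = 997
M = 10  # tuple width in bits

def weights(count):
    """Weights 2**(M*i) mod P for tuples i = 0 .. count-1."""
    return [pow(2, M * i, P) for i in range(count)]

def fold(x, count):
    """One reduction stage: sum of the M-bit tuples of x times their weights."""
    mask = (1 << M) - 1
    return sum(((x >> (M * i)) & mask) * w for i, w in enumerate(weights(count)))

def x_100_mod_997(x):
    r1 = fold(x, 10)   # corresponds to R_temp_1
    r2 = fold(r1, 3)   # corresponds to R_temp_2
    r3 = fold(r2, 2)   # corresponds to R_temp_3
    return r3 - P if r3 >= P else r3

print(weights(10))                 # the ten multiplier constants
print(fold((1 << 100) - 1, 10))    # largest possible first-stage sum
print(x_100_mod_997((1 << 100) - 1) == ((1 << 100) - 1) % P)
```

The weights printed by the model match the binary constants in the module, and the largest first-stage sum is 1023 times the sum of the weights, which is the source of the 22-bit bound.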

slide-19
SLIDE 19

Performance in Synopsys in 28 nm for X (mod P)

Frequency in MHz for each modulus P, proposed approach versus Synopsys.

Bit range [300:1]
mod P      19    53    113   241   461   977   2011  4051
approach   549   671   746   775   584   680   637   625
Synopsys   217   653   632   662   581   507   45    43

Bit range [400:1]
mod P      19    53    113   241   461   977   2011  4051
approach   507   637   709   724   574   636   625   617
Synopsys   168   609   625   621   543   495   28    26

Bit range [500:1]
mod P      19    53    113   241   461   977   2011  4051
approach   500   625   719   699   552   621   613   584
Synopsys   132   588   571   595   523   487   21    20

slide-20
SLIDE 20

Area in Synopsys in 28 nm technology for X (mod P)

Area in cells for each modulus P, proposed approach versus Synopsys.

Bit range [300:1]
mod P      19      53      113     241     461     977     2011    4051
approach   2854    3648    2772    2823    4526    5547    5735    5214
Synopsys   65686   58825   52218   50863   60610   60237   24794   26643

Bit range [400:1]
mod P      19      53      113     241     461     977     2011    4051
approach   4074    4078    3183    4179    5931    6509    7581    7168
Synopsys   107092  83799   85116   84895   90821   92106   25946   27655

Bit range [500:1]
mod P      19      53      113     241     461     977     2011    4051
approach   4958    5655    4077    4015    7073    7665    8931    7926
Synopsys   164278  138767  122929  137623  141493  119715  32154   35370

slide-21
SLIDE 21

Conclusions

The proposed hardware realization of X (mod P)

  • is up to 30 times faster than the Synopsys realization;
  • occupies up to 34 times less area than the Synopsys analogues.

A 500-bit input X is the largest for which Synopsys could synthesize a circuit. Synthesis of the modulus function for a 600-bit input X failed after nine days in Synopsys, whereas synthesizing circuits with the proposed technique takes only 20 minutes.

slide-22
SLIDE 22

Further research

– apply the "input splitting" approach to reverse conversion (from RNS to positional numbers);
– apply the "input splitting" approach to floating-point calculations in half, single, and double precision for RISC processors;
– create an EDA tool for the design of RNS units.