Efficient Implementation of Modular Division by Input Bit Splitting
Danila Gorodecky
National Academy of Science of Belarus, Minsk, Belarus danila.gorodecky@gmail.com
Tiziano Villa
University of Verona, Verona, Italy tiziano.villa@univr.it
Implementation of RNS
– Digital filtering with finite impulse response (FIR filtering);
– Cryptosystem of the US Federal Reserve System;
– Space flight control (Russia);
– Data transfer between space satellites and Earth (Russia);
– Air defense systems (USA, Russia).
Common architecture of computation in RNS
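[Figure: a converter of positional numbers to modular representation (1) passes the operands $A_1, A_2, \dots, A_n$ to $m$ parallel summator/multiplier channels (mod $p_1$), (mod $p_2$), ..., (mod $p_m$); the channel outputs $S_1, S_2, \dots, S_m$ enter a converter of modular representation to positional number (2), which outputs $S$.]
Converter (1): $A_j \pmod{p_i}$ for every operand and every modulus.
Channel $i$, multiplication: $A_1 \cdot A_2 \cdots A_n = R_i \pmod{p_i}$
Channel $i$, summation: $A_1 + A_2 + \dots + A_n = S_i \pmod{p_i}$
Converter (2): the channel results are combined into the positional result modulo $P$, where $P = \prod_{i=1}^{m} p_i$.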
Example of computation in RNS
$A = \{A_1, A_2, A_3\}$, $B = \{B_1, B_2, B_3\}$
$A_i = A \pmod{p_i} = A - \lfloor A / p_i \rfloor \, p_i$
$B_i = B \pmod{p_i} = B - \lfloor B / p_i \rfloor \, p_i$
$S = S_1 Y_1 + S_2 Y_2 + S_3 Y_3 - r P$, $r = 0, 1, 2, \dots$
$Y_i = \frac{P}{p_i} k_i$, $k_i = 1, \dots, p_i$, $Y_i \equiv 1 \pmod{p_i}$, $i = 1, 2, 3$
[Slide annotation: 314 for A and 314 for B.]
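The example can be traced in software. The sketch below is a minimal Python model, not the authors' implementation: the moduli (5, 7, 9), the operands 17 and 13, and the function names `to_rns`, `crt_weights` and `from_rns` are assumptions chosen for illustration. It converts two numbers to residues, adds and multiplies them channel by channel, and restores the positional results with the weights $Y_i$.

```python
from math import prod

MODULI = (5, 7, 9)  # assumed pairwise coprime moduli p_1, p_2, p_3 (not from the slides)

def to_rns(x, moduli=MODULI):
    """Forward conversion: A_i = A mod p_i for every modulus."""
    return [x % p for p in moduli]

def crt_weights(moduli=MODULI):
    """Return P and the weights Y_i = (P / p_i) * k_i with Y_i = 1 (mod p_i)."""
    P = prod(moduli)
    weights = []
    for p in moduli:
        base = P // p
        k = pow(base, -1, p)  # modular inverse of P / p_i modulo p_i (Python 3.8+)
        weights.append(base * k)
    return P, weights

def from_rns(residues, moduli=MODULI):
    """Reverse conversion: S = S_1*Y_1 + S_2*Y_2 + S_3*Y_3 - r*P."""
    P, Y = crt_weights(moduli)
    total = sum(s * y for s, y in zip(residues, Y))
    return total % P  # the reduction subtracts r*P for the suitable r

a, b = 17, 13  # hypothetical operands
ra, rb = to_rns(a), to_rns(b)
sum_res = [(x + y) % p for x, y, p in zip(ra, rb, MODULI)]   # independent channels
prod_res = [(x * y) % p for x, y, p in zip(ra, rb, MODULI)]
print(from_rns(sum_res), from_rns(prod_res))
```

The results are exact only while they stay below $P$, which is 315 for the assumed moduli.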
Main hardware design problems
Approaches of X (mod P) design
[1] P.V.A. Mohan, "Residue Number System. Theory and applications", Springer International Publishing, 2016, 351 p.
[2] J.T. Butler and T. Sasao, "Fast hardware computation of x mod z" // 25th IEEE International Parallel and Distributed Processing Symposium, Anchorage, AK, USA, May 16-17, 2011, p. 289-292.
[3] Mark A. Will and Ryan K. L. Ko, "Computing Mod Without Mod"
Input bit splitting approach
Fourier transformation multiplication (FTM)
[Figure: the operands A and B are each km bits wide and are split into k sub-vectors of m bits (bits 1 … m, m+1 … 2m, 2m+1 … 3m, …, (k-1)m+1 … km); their product R is 2km bits wide.]
$R = A \cdot B = \sum_{i=1}^{k} \sum_{j=1}^{k} A_i B_j \, 2^{m(i+j-2)}$
There are $k^2$ 2m-bit operands in FTM.
It leads to an adder tree whose number of levels is a $\log_2$ function of $k$ and $m$.
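The double sum can be checked with a short Python model. This is a sketch under assumptions: the function names `split` and `ftm_multiply` and the operand values are hypothetical, and the code models only the arithmetic, not the hardware adder tree.

```python
def split(x, k, m):
    """Split x into k sub-vectors of m bits, least significant first (A_1 is index 0)."""
    mask = (1 << m) - 1
    return [(x >> (m * i)) & mask for i in range(k)]

def ftm_multiply(a, b, k, m):
    """Return (product, number of partial products) for two k*m-bit operands."""
    A, B = split(a, k, m), split(b, k, m)
    partials = []
    for i in range(1, k + 1):
        for j in range(1, k + 1):
            # Each m x m product is 2m bits wide and is weighted by 2^(m(i+j-2)).
            partials.append((A[i - 1] * B[j - 1]) << (m * (i + j - 2)))
    return sum(partials), len(partials)

result, count = ftm_multiply(0xBEEF, 0x1234, k=4, m=4)
print(result == 0xBEEF * 0x1234, count)
```

For k = 4 the model produces 16 partial products, which is the $k^2$ operand count that the adder tree has to sum.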
Multiplication in Synopsys in 28 nm technology
Comparison of monolith and Synopsys multipliers (regular arithmetic). Frequency, GHz:
Operand size: 2x2, 3x3, 4x4, 5x5, 6x6, 7x7, 8x8
Synopsys: 6.66, 4.16, 3.33, 2.5, 2.38, 2.12, 2.04
Monolith: 6.25, 5, 3.85, 3.03, 2.5, 2.17, 1.75
Example. $R = A \cdot B$, where $A$ and $B$ are 14-bit operands.
$A = (A_4, A_3, A_2, A_1)$
$A_1 = (a_4, a_3, a_2, a_1)$
$A_2 = (a_8, a_7, a_6, a_5)$
$A_3 = (a_{12}, a_{11}, a_{10}, a_9)$
$A_4 = (a_{14}, a_{13})$
$B = (B_4, B_3, B_2, B_1)$
$B_1 = (b_4, b_3, b_2, b_1)$
$B_2 = (b_8, b_7, b_6, b_5)$
$B_3 = (b_{12}, b_{11}, b_{10}, b_9)$
$B_4 = (b_{14}, b_{13})$
Multiplication in regular arithmetic, where $R$ is a 28-bit vector:
$R = A_1 B_1 + A_1 B_2 \cdot 2^{4} + A_1 B_3 \cdot 2^{8} + A_1 B_4 \cdot 2^{12} + A_2 B_1 \cdot 2^{4} + A_2 B_2 \cdot 2^{8} + A_2 B_3 \cdot 2^{12} + A_2 B_4 \cdot 2^{16} + A_3 B_1 \cdot 2^{8} + A_3 B_2 \cdot 2^{12} + A_3 B_3 \cdot 2^{16} + A_3 B_4 \cdot 2^{20} + A_4 B_1 \cdot 2^{12} + A_4 B_2 \cdot 2^{16} + A_4 B_3 \cdot 2^{20} + A_4 B_4 \cdot 2^{24}$
Adder-tree levels reduction
FTM needs a number of adder-tree levels that is a $\log_2$ function of $k$ and $m$, where $k$ is the number of $m$-bit sub-vectors. We propose techniques to upgrade the architecture of monolith-based multipliers and, as a consequence, to increase the speed of calculation. The principle of the adder-tree optimization is to concatenate separate results of monolith multiplication instead of adding them. For instance, for $k = 4$ and $m = 4$, the terms
$A_1 B_1$, $A_1 B_3 \cdot 2^{8}$, $A_2 B_4 \cdot 2^{16}$, $A_4 B_4 \cdot 2^{24}$
correspond to the vectors
$(A_1 B_1)$: 8 bits
$(A_1 B_3, 00000000)$: 8 bits followed by 8 zero bits
$(A_2 B_4, 0000000000000000)$: 8 bits followed by 16 zero bits
$(A_4 B_4, 000000000000000000000000)$: 8 bits followed by 24 zero bits
These vectors occupy disjoint bit positions, so their sum is obtained by concatenation (&) without any adder:
$(A_4 B_4, A_2 B_4, A_1 B_3, A_1 B_1)$
Example. $R = A \cdot B$, where $A$ and $B$ are 14-bit operands and $R$ is 28 bits.
$A = (A_4, A_3, A_2, A_1)$, $B = (B_4, B_3, B_2, B_1)$
$A_1 B_1 = R_1$, $A_1 B_2 = R_2$, $A_1 B_3 = R_3$, $A_1 B_4 = R_4$
$A_2 B_1 = R_5$, $A_2 B_2 = R_6$, $A_2 B_3 = R_7$, $A_2 B_4 = R_8$
$A_3 B_1 = R_9$, $A_3 B_2 = R_{10}$, $A_3 B_3 = R_{11}$, $A_3 B_4 = R_{12}$
$A_4 B_1 = R_{13}$, $A_4 B_2 = R_{14}$, $A_4 B_3 = R_{15}$, $A_4 B_4 = R_{16}$
$R = (R_{16}, R_{11}, R_3, R_1) + (R_{12}, R_7, R_2, 0000) + (R_{15}, R_{10}, R_5, 0000) + (R_8, R_6, 00000000) + (R_{14}, R_9, 00000000) + (R_4, 000000000000) + (R_{13}, 000000000000)$
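The grouping of the sixteen partial products into concatenated operands can be verified with a small Python model. This is a sketch under assumptions: the helper names `fields`, `concat` and `multiply_14x14` are hypothetical. The field widths (4, 4, 4 and 2 bits) follow from the 14-bit operand split, and each partial product is given the width of its two factors added together.

```python
WIDTHS = [4, 4, 4, 2]  # widths of A_1..A_4 (and B_1..B_4) for a 14-bit operand

def fields(x, widths=WIDTHS):
    """Split x into sub-vectors, least significant first."""
    out = []
    for w in widths:
        out.append(x & ((1 << w) - 1))
        x >>= w
    return out

def concat(*parts):
    """Concatenate (value, width) pairs, most significant part first."""
    word = 0
    for value, width in parts:
        assert 0 <= value < (1 << width)
        word = (word << width) | value
    return word

def multiply_14x14(a, b):
    A, B = fields(a), fields(b)
    R, W = {}, {}
    for i in range(4):
        for j in range(4):
            n = 4 * i + j + 1                 # R_n = A_(i+1) * B_(j+1)
            R[n] = A[i] * B[j]
            W[n] = WIDTHS[i] + WIDTHS[j]      # product width in bits
    r = lambda n: (R[n], W[n])
    z = lambda bits: (0, bits)                # zero padding
    operands = [
        concat(r(16), r(11), r(3), r(1)),
        concat(r(12), r(7), r(2), z(4)),
        concat(r(15), r(10), r(5), z(4)),
        concat(r(8), r(6), z(8)),
        concat(r(14), r(9), z(8)),
        concat(r(4), z(12)),
        concat(r(13), z(12)),
    ]
    return sum(operands), len(operands)

print(multiply_14x14(12345, 9876))
```

In this model the adder tree sums seven concatenated operands instead of sixteen shifted partial products.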
Multiplication in Synopsys in 28 nm technology
Comparison of monolith based and Synopsys multipliers (regular arithmetic). Frequency, GHz:
Operand size: 8x8, 10x10, 12x12, 14x14, 16x16, 18x18, 20x20, 22x22, 24x24, 26x26, 28x28, 30x30, 32x32
Synopsys: 2.04, 1.85, 1.69, 1.66, 1.61, 1.53, 1.51, 1.38, 1.38, 1.38, 1.36, 1.31, 1.3
Monolith based: 2.08, 1.7, 1.7, 1.61, 1.61, 1.45, 1.39, 1.37, 1.35, 1.31, 1.28, 1.22, 1.24
Two-level Boolean functions minimization
module x_100_mod_997(
    input  [100:1] X,
    output [10:1]  R
);
    wire [22:1] R_temp_1;
    wire [15:1] R_temp_2;
    wire [11:1] R_temp_3;
    reg  [10:1] R_temp;

    // Stage 1: each 10-bit slice of X is weighted by 2^(10*i) mod 997.
    assign R_temp_1 = X [ 10 : 1 ]
                    + X [ 20 : 11 ] * 5'b11011
                    + X [ 30 : 21 ] * 10'b1011011001
                    + X [ 40 : 31 ] * 10'b1011100100
                    + X [ 50 : 41 ] * 6'b101000
                    + X [ 60 : 51 ] * 7'b1010011
                    + X [ 70 : 61 ] * 8'b11110111
                    + X [ 80 : 71 ] * 10'b1010101111
                    + X [ 90 : 81 ] * 10'b1001011011
                    + X [ 100 : 91 ] * 9'b101001001 ;

    // Stages 2 and 3: the same splitting applied to the intermediate sums.
    assign R_temp_2 = R_temp_1 [ 10 : 1 ]
                    + R_temp_1 [ 20 : 11 ] * 5'b11011
                    + R_temp_1 [ 22 : 21 ] * 10'b1011011001 ;
    assign R_temp_3 = R_temp_2 [ 10 : 1 ]
                    + R_temp_2 [ 15 : 11 ] * 5'b11011 ;

    // Final correction: one conditional subtraction of 997.
    always @* begin
        if (R_temp_3 >= 10'b1111100101)
            R_temp = R_temp_3 - 10'b1111100101;
        else
            R_temp = R_temp_3;
    end

    assign R = R_temp;
endmodule
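The behaviour of the module can be cross-checked against ordinary integer arithmetic. The Python sketch below is a software model under assumptions, not a replacement for the hardware description: the helper names `fold` and `x_100_mod_997` are hypothetical, and the weights are recomputed as $2^{10i} \bmod 997$ rather than copied from the module.

```python
P = 997        # modulus
CHUNK = 10     # slice width in bits, as in X[10:1], X[20:11], ...

def fold(x):
    """One splitting stage: sum of 10-bit slices weighted by 2^(10*i) mod P."""
    acc, i = 0, 0
    while x:
        acc += (x & ((1 << CHUNK) - 1)) * pow(2, CHUNK * i, P)
        x >>= CHUNK
        i += 1
    return acc

def x_100_mod_997(x):
    """Model of the three stages plus the final conditional subtraction."""
    assert 0 <= x < (1 << 100)
    r3 = fold(fold(fold(x)))              # R_temp_1 -> R_temp_2 -> R_temp_3
    return r3 - P if r3 >= P else r3

WEIGHTS = [pow(2, CHUNK * i, P) for i in range(10)]
print(WEIGHTS)
print(x_100_mod_997((1 << 100) - 1) == ((1 << 100) - 1) % P)
```

Each stage preserves the value modulo 997 because every slice weight is congruent to the power of two it replaces, so only the final range correction is needed.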
R_temp_1 < 3566179 (22-bit number)
R_temp_2 < 30831 (15-bit number)
R_temp_3 < 1833 (11-bit number)
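These widths follow from a worst-case estimate in which every full 10-bit slice equals 1023 and the top slice is limited by the previous bound. The Python sketch below reproduces that estimate; the function name `worst_case` is an assumption, and the estimate is a loose upper bound rather than an exact maximum.

```python
P = 997
CHUNK = 10
MAX_CHUNK = (1 << CHUNK) - 1   # 1023, largest value of a full 10-bit slice

def worst_case(bound):
    """Upper bound on the weighted slice sum for any input not exceeding bound."""
    n_fields = -(-bound.bit_length() // CHUNK)   # ceiling division
    total = 0
    for i in range(n_fields):
        top = bound >> (CHUNK * i)               # largest value slice i can take
        total += min(top, MAX_CHUNK) * pow(2, CHUNK * i, P)
    return total

b1 = worst_case((1 << 100) - 1)   # bound for R_temp_1
b2 = worst_case(b1)               # bound for R_temp_2
b3 = worst_case(b2)               # bound for R_temp_3
print(b1, b2, b3)
print(b1.bit_length(), b2.bit_length(), b3.bit_length())
```

Since the last bound is below 2 · 997, a single conditional subtraction brings the result into the range 0 to 996.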
Performance in Synopsys in 28 nm for X (mod P)
Frequency, MHz, for mod P = 19, 53, 113, 241, 461, 977, 2011, 4051:
Bit range [300:1]
Proposed approach: 549, 671, 746, 775, 584, 680, 637, 625
Synopsys: 217, 653, 632, 662, 581, 507, 45, 43
Bit range [400:1]
Proposed approach: 507, 637, 709, 724, 574, 636, 625, 617
Synopsys: 168, 609, 625, 621, 543, 495, 28, 26
Bit range [500:1]
Proposed approach: 500, 625, 719, 699, 552, 621, 613, 584
Synopsys: 132, 588, 571, 595, 523, 487, 21, 20
Area in Synopsys in 28 nm technology for X (mod P)
Area, cells, for mod P = 19, 53, 113, 241, 461, 977, 2011, 4051:
Bit range [300:1]
Proposed approach: 2854, 3648, 2772, 2823, 4526, 5547, 5735, 5214
Synopsys: 65686, 58825, 52218, 50863, 60610, 60237, 24794, 26643
Bit range [400:1]
Proposed approach: 4074, 4078, 3183, 4179, 5931, 6509, 7581, 7168
Synopsys: 107092, 83799, 85116, 84895, 90821, 92106, 25946, 27655
Bit range [500:1]
Proposed approach: 4958, 5655, 4077, 4015, 7073, 7665, 8931, 7926
Synopsys: 164278, 138767, 122929, 137623, 141493, 119715, 32154, 35370
The proposed hardware realization of X (mod P)
Synopsys analogues). An input X of 500 bits is the largest for which Synopsys could synthesize a circuit. Synthesis of the modulus function for a 600-bit input X failed after nine days in Synopsys, whereas synthesizing circuits with the proposed technique takes only 20 minutes.
– Apply the "input splitting" approach to reverse conversion (from RNS to positional numbers);
– Apply the "input splitting" approach to floating-point calculations in half, single, and double precision for RISC processors;
– Create an EDA tool for the design of RNS units.