
Efficient Implementation of Modular Division by Input Bit Splitting

Danila Gorodecky (National Academy of Science of Belarus, Minsk, Belarus; danila.gorodecky@gmail.com) and Tiziano Villa (University of Verona, Verona, Italy; tiziano.villa@univr.it)


  1. Efficient Implementation of Modular Division by Input Bit Splitting

  2. Motivation
     – for historical and algorithmic reasons, the residue number system (RNS) is not a common arithmetic approach; the topic has not been studied properly and is rarely implemented in hardware;
     – there are a number of promising applications, such as parallel computing and cryptography;
     – there is no "efficient" hardware approach to computing X mod P (it is not synthesizable in state-of-the-art EDA tools).

  3. Implementation of RNS
     – digital filtering with finite impulse response (FIR filtering);
     – cryptosystem of the Federal Reserve System (USA);
     – space flight control (Russia);
     – data transfer between space satellites and Earth (Russia);
     – air defense systems (USA, Russia).

  4. Common architecture of computation in RNS
     The operands $A_1, A_2, \dots, A_n$ enter (1) a converter from positional numbers to modular representation, which feeds $m$ parallel summator/multiplier channels (mod $p_1$), (mod $p_2$), ..., (mod $p_m$). The channel outputs $S_1, S_2, \dots, S_m$ enter (2) a converter from modular representation back to a positional number $S$.
     Each channel $i = 1, \dots, m$ computes
     $$A_1 + A_2 + \dots + A_n \equiv S_i \pmod{p_i} \quad \text{or} \quad A_1 \cdot A_2 \cdots A_n \equiv R_i \pmod{p_i},$$
     and the result $S$ or $R$ is obtained (mod $P$), where
     $$P = \prod_{i=1}^{m} p_i.$$

  5. Example of computation in RNS
     $$A = \{A_1, A_2, A_3\}, \quad A_i = A - \left\lfloor \frac{A}{p_i} \right\rfloor p_i = A \pmod{p_i}, \quad 0 \le A \le 314$$
     $$B = \{B_1, B_2, B_3\}, \quad B_i = B - \left\lfloor \frac{B}{p_i} \right\rfloor p_i = B \pmod{p_i}, \quad 0 \le B \le 314$$
     $$S = S_1 \cdot Y_1 + S_2 \cdot Y_2 + S_3 \cdot Y_3 - r \cdot P$$
     $$Y_i = \frac{P}{p_i} \, k_i, \quad Y_i \equiv 1 \pmod{p_i}$$
     where $r = 0, 1, 2, \dots$, $i = 1, 2, 3$, and $k_i \in \{1, \dots, p_i\}$.
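The reconstruction formula on this slide can be checked with a small software model. The sketch below is illustrative only: the moduli (5, 7, 9) are an assumption, chosen because they are pairwise coprime and their product 315 matches the operand range 0..314 shown on the slide; the deck itself does not list them. The function names are hypothetical.

```python
from math import prod

# Assumed moduli (not given in the deck): pairwise coprime, product 315.
MODULI = (5, 7, 9)
P = prod(MODULI)

def to_rns(x, moduli=MODULI):
    """Forward conversion: residues A_i = x mod p_i."""
    return tuple(x % p for p in moduli)

def crt_weights(moduli=MODULI):
    """Weights Y_i = (P / p_i) * k_i, with k_i chosen so that Y_i = 1 (mod p_i)."""
    big_p = prod(moduli)
    weights = []
    for p in moduli:
        m = big_p // p
        k = pow(m, -1, p)  # modular inverse (Python 3.8+)
        weights.append(m * k)
    return weights

def rns_add(a, b, moduli=MODULI):
    """Channel-wise addition: S_i = (A_i + B_i) mod p_i, no carries between channels."""
    return tuple((x + y) % p for x, y, p in zip(a, b, moduli))

def from_rns(residues, moduli=MODULI):
    """Reverse conversion: S = sum(S_i * Y_i) - r * P, with r chosen so 0 <= S < P."""
    total = sum(s * y for s, y in zip(residues, crt_weights(moduli)))
    return total % prod(moduli)

a, b = 123, 150
s = from_rns(rns_add(to_rns(a), to_rns(b)))
print(s)  # 273
```

The final `% P` plays the role of the `- r * P` term: it subtracts as many multiples of P as needed to land in the valid range.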

  6. Main hardware design problems
     • Realization of X (mod P) for an arbitrary P (forward conversion): it causes a synthesis error in most EDA tools.
     • Transformation of the modular representation into a positional number (reverse conversion): it results in either a long critical path or high hardware cost.

  7. Approaches to X (mod P) design
     [1] P.V.A. Mohan, "Residue Number System. Theory and applications", Springer International Publishing, 2016, 351 p.
     [2] J.T. Butler and T. Sasao, "Fast hardware computation of x mod z", 25th IEEE International Parallel and Distributed Processing Symposium, Anchorage, AK, USA, May 16-17, 2011, pp. 289-292.
     [3] Mark A. Will and Ryan K. L. Ko, "Computing Mod Without Mod".

  8. Input bit splitting approach
     • splitting the input into small tuples (up to 12 bits each);
     • Boolean minimization (Disjunctive Normal Form, Reed-Muller expansion, Binary Decision Tree, Majority Graph).
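One plausible reading of this slide, which the deck does not spell out, is that each small tuple contributes a residue that is a Boolean function of only a few input bits, so its truth table is small enough to hand to a minimizer. The sketch below builds such a table for one tuple; the function name, the 3-bit width, and the modulus 5 are hypothetical illustration values, not taken from the deck.

```python
def tuple_truth_table(width, weight, modulus):
    """Rows (input bits, output bits) of f(t) = (t * weight) mod modulus.

    `weight` stands for 2^(bit offset of the tuple) mod modulus.
    Each output column is one Boolean function of `width` input bits.
    """
    out_bits = max(1, (modulus - 1).bit_length())
    rows = []
    for t in range(1 << width):
        r = (t * weight) % modulus
        rows.append((format(t, f"0{width}b"), format(r, f"0{out_bits}b")))
    return rows

# Hypothetical small case: the 3-bit tuple at bit offset 3, modulus 5.
rows = tuple_truth_table(width=3, weight=pow(2, 3, 5), modulus=5)
for inputs, outputs in rows:
    print(inputs, outputs)
```

With tuples of up to 12 bits the table has at most 4096 rows per tuple, which keeps the minimization problem independent of the total input width.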

  9. Fourier transformation multiplication (FTM)
     FTM splits the binary vectors into $m$-bit tuples and multiplies the tuples with each other. The operands $A$ and $B$ each occupy bits $1 \dots km$, grouped as $k$ tuples $(1 \dots m), (m+1 \dots 2m), \dots, ((k-1)m+1 \dots km)$; the product $R$ occupies bits $1 \dots 2km$.
     There are $k^2$ operands of $2m$ bits each in FTM:
     $$R = \sum_{i=1}^{k} \sum_{j=1}^{k} A_i B_j \, 2^{m(i+j-2)}$$
     Summing them requires an adder tree whose number of levels grows logarithmically with the number of operands.

  10. Multiplication in Synopsys in 28 nm technology
      [Figure: comparison of monolith and Synopsys multipliers (regular arithmetic); frequency in GHz for operand sizes 2x2 through 8x8.]

  11. FTM for both arithmetic
      Example. $A \cdot B = R$, where $A$ and $B$ are 14-bit operands:
      $$A = (A_4, A_3, A_2, A_1), \quad B = (B_4, B_3, B_2, B_1)$$
      $$A_1 = (a_4, a_3, a_2, a_1), \quad A_2 = (a_8, a_7, a_6, a_5), \quad A_3 = (a_{12}, a_{11}, a_{10}, a_9), \quad A_4 = (a_{14}, a_{13})$$
      $$B_1 = (b_4, b_3, b_2, b_1), \quad B_2 = (b_8, b_7, b_6, b_5), \quad B_3 = (b_{12}, b_{11}, b_{10}, b_9), \quad B_4 = (b_{14}, b_{13})$$
      Multiplication in regular arithmetic, where $R$ is a 28-bit vector:
      $$\begin{aligned}
      R ={} & A_1 B_1 + A_1 B_2 \cdot 2^{4} + A_1 B_3 \cdot 2^{8} + A_1 B_4 \cdot 2^{12} \\
      & + A_2 B_1 \cdot 2^{4} + A_2 B_2 \cdot 2^{8} + A_2 B_3 \cdot 2^{12} + A_2 B_4 \cdot 2^{16} \\
      & + A_3 B_1 \cdot 2^{8} + A_3 B_2 \cdot 2^{12} + A_3 B_3 \cdot 2^{16} + A_3 B_4 \cdot 2^{20} \\
      & + A_4 B_1 \cdot 2^{12} + A_4 B_2 \cdot 2^{16} + A_4 B_3 \cdot 2^{20} + A_4 B_4 \cdot 2^{24}.
      \end{aligned}$$
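The expansion in this example can be checked numerically. The following is a minimal software sketch of the tuple-wise product for $m = 4$ and $k = 4$; the function names and the two test operands are hypothetical, and the code models only the arithmetic, not the hardware structure.

```python
def split_tuples(x, m, k):
    """Split x into k tuples of m bits, least significant first: A_1, ..., A_k."""
    mask = (1 << m) - 1
    return [(x >> (m * i)) & mask for i in range(k)]

def tuple_multiply(a, b, m=4, k=4):
    """R = sum over i, j of A_i * B_j * 2^(m*(i+j-2)) with 1-based i and j."""
    a_parts = split_tuples(a, m, k)
    b_parts = split_tuples(b, m, k)
    result = 0
    for i, a_i in enumerate(a_parts):       # enumerate is 0-based,
        for j, b_j in enumerate(b_parts):   # so the shift becomes m*(i+j)
            result += (a_i * b_j) << (m * (i + j))
    return result

a, b = 0x2ABC, 0x1F35  # two arbitrary 14-bit operands
print(tuple_multiply(a, b) == a * b)  # True
```

For 14-bit operands the top tuples $A_4$ and $B_4$ carry only two significant bits, which the mask-and-shift split handles without special casing.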

  12. Adder-tree levels reduction
      FTM needs a logarithmic number of levels in the adder tree, where $k$ is the number of $m$-bit subvectors. We propose techniques to improve the architecture of monolith-based multipliers and, as a consequence, to reduce calculation time.
      The principle of the adder-tree optimization is to concatenate separate results of monolith multiplication. For instance, for $k = 4$ and $m = 4$, the following operands can be joined into one vector:
      $$R = \dots + A_1 B_1 + A_1 B_3 \cdot 2^{8} + A_2 B_4 \cdot 2^{16} + A_4 B_4 \cdot 2^{24} + \dots$$
      Each product is 8 bits wide, so the shifted terms $(A_1 B_1)$, $(A_1 B_3, 00000000)$, $(A_2 B_4, 0000000000000000)$ and $(A_4 B_4, 000000000000000000000000)$ do not overlap, and their sum equals the concatenation
      $$(A_4 B_4, \; A_2 B_4, \; A_1 B_3, \; A_1 B_1).$$

  13. Efficient architecture
      Example. $A \cdot B = R$, where $A$ and $B$ are 14-bit operands and $R$ is 28 bits.
      $$A = (A_4, A_3, A_2, A_1), \quad B = (B_4, B_3, B_2, B_1)$$

      | | $B_1$ | $B_2$ | $B_3$ | $B_4$ |
      |---|---|---|---|---|
      | $A_1$ | $R_1$ | $R_2$ | $R_3$ | $R_4$ |
      | $A_2$ | $R_5$ | $R_6$ | $R_7$ | $R_8$ |
      | $A_3$ | $R_9$ | $R_{10}$ | $R_{11}$ | $R_{12}$ |
      | $A_4$ | $R_{13}$ | $R_{14}$ | $R_{15}$ | $R_{16}$ |

      Each entry is the product $A_i \cdot B_j$ of its row and column.
      $$\begin{aligned}
      R ={} & (R_{16}, R_{11}, R_3, R_1) + (R_{12}, R_7, R_2, 0000) \\
      & + (R_{15}, R_{10}, R_5, 0000) + (R_8, R_6, 00000000) \\
      & + (R_{14}, R_9, 00000000) + (R_4 + R_{13}, 000000000000)
      \end{aligned}$$
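The grouping on this slide can be verified in software. The sketch below is a hedged model, not the authors' hardware: it indexes the partial products as $R_{4(i-1)+j} = A_i \cdot B_j$, builds each group by bitwise OR (which equals concatenation as long as the fields do not overlap), and sums the resulting words. $R_4$ and $R_{13}$ share the same shift and cannot be concatenated with each other, so the sketch keeps them as two separate addends. The names `GROUPS`, `partial_products` and `concat_multiply` are hypothetical.

```python
def partial_products(a, b, m=4, k=4):
    """Return {n: (R_n, shift)} with R_n = A_i * B_j and n = k*(i-1) + j (1-based)."""
    mask = (1 << m) - 1
    parts_a = [(a >> (m * i)) & mask for i in range(k)]
    parts_b = [(b >> (m * j)) & mask for j in range(k)]
    table = {}
    for i in range(k):
        for j in range(k):
            table[k * i + j + 1] = (parts_a[i] * parts_b[j], m * (i + j))
    return table

# Groups of partial-product indices as they appear on the slide.
GROUPS = [(16, 11, 3, 1), (12, 7, 2), (15, 10, 5), (8, 6), (14, 9), (4,), (13,)]

def concat_multiply(a, b, m=4):
    """Multiply by concatenating non-overlapping products, then adding the words."""
    table = partial_products(a, b, m=m)
    addends = []
    for group in GROUPS:
        shifts = sorted(table[n][1] for n in group)
        # Concatenation is only valid if the 2m-bit fields do not overlap.
        assert all(hi - lo >= 2 * m for lo, hi in zip(shifts, shifts[1:]))
        word = 0
        for n in group:
            value, shift = table[n]
            word |= value << shift  # OR acts as concatenation: no carries occur
        addends.append(word)
    return sum(addends), len(addends)

a, b = 0x2ABC, 0x1F35  # two arbitrary 14-bit operands
result, count = concat_multiply(a, b)
print(result == a * b, count)  # True 7
```

In this model the adder tree sums 7 words instead of 16 partial products, which is where the reduction in tree depth comes from.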

  14. Multiplication in Synopsys in 28 nm technology
      [Figure: comparison of monolith-based and Synopsys multipliers (regular arithmetic); frequency in GHz for operand sizes 8x8 through 32x32.]

  15. Fourier transformation for X (mod P)

  16. Two-level Boolean functions minimization

  17. Boolean functions minimization
      Espresso (Berkeley, USA) or ELS (Minsk, Belarus):
      - DNF minimization (two-level minimization);
      - Binary Decision Diagram (BDD) minimization (multi-level minimization);
      - minimization in the class of Reed-Muller expansions;
      - etc.

  18. Verilog for [100:1] (mod 997)

      module x_100_mod_997(
          input  [100:1] X,
          output [10:1]  R
      );
          wire [22:1] R_temp_1;  // R_temp_1 < 3566179 (22-bit number)
          wire [15:1] R_temp_2;  // R_temp_2 < 30831 (15-bit number)
          wire [11:1] R_temp_3;  // R_temp_3 < 1833 (11-bit number)
          reg  [10:1] R_temp;

          // Stage 1: each 10-bit tuple of X is weighted by 2^(10*j) mod 997.
          assign R_temp_1 = X[10:1]
                          + X[20:11]  * 5'b11011
                          + X[30:21]  * 10'b1011011001
                          + X[40:31]  * 10'b1011100100
                          + X[50:41]  * 6'b101000
                          + X[60:51]  * 7'b1010011
                          + X[70:61]  * 8'b11110111
                          + X[80:71]  * 10'b1010101111
                          + X[90:81]  * 10'b1001011011
                          + X[100:91] * 9'b101001001;

          // Stage 2: fold the 22-bit sum with the same weights.
          assign R_temp_2 = R_temp_1[10:1]
                          + R_temp_1[20:11] * 5'b11011
                          + R_temp_1[22:21] * 10'b1011011001;

          // Stage 3: fold the 15-bit sum once more.
          assign R_temp_3 = R_temp_2[10:1]
                          + R_temp_2[15:11] * 5'b11011;

          // Final correction: R_temp_3 < 2 * 997, so one subtraction suffices.
          always @(*) begin
              if (R_temp_3 >= 10'b1111100101)          // 997
                  R_temp = R_temp_3 - 10'b1111100101;
              else
                  R_temp = R_temp_3;
          end

          assign R = R_temp;
      endmodule
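A software reference model is a convenient way to check the constants in this module. The sketch below is an illustration written for this purpose, not part of the deck: it recomputes the tuple weights as $2^{10j} \bmod 997$, applies the same three folding stages followed by one conditional subtraction, and compares the result with Python's `%` operator. The helper name `fold` is hypothetical.

```python
import random

P = 997
CHUNK = 10
MASK = (1 << CHUNK) - 1

# Weights of the ten 10-bit tuples: 2^(10*j) mod 997 for j = 0..9.
WEIGHTS = [pow(2, CHUNK * j, P) for j in range(10)]

def fold(x):
    """One reduction stage: sum of the 10-bit tuples of x times their weights."""
    total, j = 0, 0
    while x:
        total += (x & MASK) * pow(2, CHUNK * j, P)
        x >>= CHUNK
        j += 1
    return total

def x_100_mod_997(x):
    """Model of the module: three folds, then a single conditional subtraction."""
    assert 0 <= x < (1 << 100)
    r1 = fold(x)    # corresponds to R_temp_1
    r2 = fold(r1)   # corresponds to R_temp_2
    r3 = fold(r2)   # corresponds to R_temp_3
    return r3 - P if r3 >= P else r3

print(WEIGHTS)  # [1, 27, 729, 740, 40, 83, 247, 687, 603, 329]

random.seed(0)
for _ in range(1000):
    x = random.getrandbits(100)
    assert x_100_mod_997(x) == x % P
```

The printed weights correspond to the binary constants in the `assign R_temp_1` expression (for example `5'b11011` is 27 and `10'b1011011001` is 729).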

  19. Performance in Synopsys in 28 nm for X (mod P)
      [Figure: three panels for bit ranges [300:1], [400:1] and [500:1]; frequency in MHz at x-axis ticks 19, 53, 113, 241, 461, 977, 2011 and 4051; series: mod P approach and Synopsys.]
