Efficient Implementation of Modular Division by Input Bit Splitting

Danila Gorodecky, National Academy of Science of Belarus, Minsk, Belarus, danila.gorodecky@gmail.com
Tiziano Villa, University of Verona, Verona, Italy, tiziano.villa@univr.it

Motivation
– For historical and algorithmic reasons, the residue number system (RNS) is not a common arithmetic approach; it has not been studied thoroughly and is rarely implemented in hardware.
– RNS has a number of promising applications, such as parallel computing and cryptography.
– There is no "efficient" hardware approach to computing X mod P (the operation is not synthesizable in state-of-the-art EDA tools).

Implementations of RNS
– Digital filtering with finite impulse response (FIR filtering);
– Cryptosystem of the Federal Reserve System (USA);
– Space flight control (Russia);
– Data transfer between space satellites and Earth (Russia);
– Air defense systems (USA, Russia).

Common architecture of computation in RNS
(1) A converter of positional numbers to modular representation maps each input operand $A_1, A_2, \dots, A_n$ to its residues $A_j \bmod p_i$, $i = 1, \dots, m$.
Each of the $m$ channels has its own summator/multiplier (mod $p_i$) and computes
$$S_i = A_1 + A_2 + \dots + A_n \pmod{p_i} \quad \text{or} \quad R_i = A_1 \cdot A_2 \cdots A_n \pmod{p_i}.$$
(2) A converter of modular representation to positional number combines $S_1, S_2, \dots, S_m$ (or $R_1, R_2, \dots, R_m$) into the result $S$ (or $R$) modulo $P$, where
$$P = \prod_{i=1}^{m} p_i.$$
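The channel structure above can be modeled in a few lines of Python. This is a minimal sketch, not the authors' hardware: the moduli `(5, 7, 9)` and the function names `to_rns`, `rns_sum` and `rns_product` are assumptions chosen for illustration.

```python
from math import prod

# Assumed pairwise-coprime moduli p_1..p_m (illustrative choice, not from the slides).
MODULI = (5, 7, 9)
P = prod(MODULI)  # dynamic range P = p_1 * ... * p_m

def to_rns(x, moduli=MODULI):
    """Forward conversion: positional number -> tuple of residues x mod p_i."""
    return tuple(x % p for p in moduli)

def rns_sum(operands, moduli=MODULI):
    """S_i = (A_1 + ... + A_n) mod p_i, computed independently in each channel."""
    residues = [to_rns(a, moduli) for a in operands]
    return tuple(sum(r[i] for r in residues) % p for i, p in enumerate(moduli))

def rns_product(operands, moduli=MODULI):
    """R_i = (A_1 * ... * A_n) mod p_i, computed independently in each channel."""
    result = []
    for p in moduli:
        acc = 1
        for a in operands:
            acc = (acc * (a % p)) % p
        result.append(acc)
    return tuple(result)
```

Each channel only ever sees numbers smaller than its own modulus, which is why the channels can run in parallel without carries between them.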

Example of computation in RNS
$A = \{A_1, A_2, A_3\}$ and $B = \{B_1, B_2, B_3\}$, where
$$A_i = A \bmod p_i = A - \left\lfloor \frac{A}{p_i} \right\rfloor p_i, \qquad B_i = B \bmod p_i = B - \left\lfloor \frac{B}{p_i} \right\rfloor p_i, \qquad 0 \le A \le 314, \quad 0 \le B \le 314.$$
The positional result is restored as
$$S = S_1 Y_1 + S_2 Y_2 + S_3 Y_3 - r \cdot P,$$
$$Y_i = \frac{P}{p_i}\, k_i, \qquad Y_i \equiv 1 \pmod{p_i},$$
with $r = 0, 1, 2, \dots$; $i = 1, 2, 3$; $k_i = 1, \dots, p_i$.
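The reverse conversion can be checked numerically. The sketch below assumes the moduli (5, 7, 9), which give P = 315 and match the range 0..314 on the slide; the slide itself does not list the moduli, so treat them, and the helper names `crt_weights` and `from_rns`, as hypothetical.

```python
from math import prod

def crt_weights(moduli):
    """Return (P, [Y_1..Y_m]) with Y_i = (P / p_i) * k_i and Y_i = 1 (mod p_i)."""
    P = prod(moduli)
    weights = []
    for p in moduli:
        q = P // p
        # Search k_i in 1..p_i, as on the slide; it exists because the moduli are coprime.
        k = next(k for k in range(1, p + 1) if (q * k) % p == 1)
        weights.append(q * k)
    return P, weights

def from_rns(residues, moduli):
    """S = S_1*Y_1 + ... + S_m*Y_m - r*P; taking the sum modulo P picks the right r."""
    P, weights = crt_weights(moduli)
    return sum(s * y for s, y in zip(residues, weights)) % P

# Assumed example: A = 100, B = 57, added channel by channel and converted back.
moduli = (5, 7, 9)
channel_sums = tuple((100 % p + 57 % p) % p for p in moduli)
restored = from_rns(channel_sums, moduli)  # equals 100 + 57 because 157 < 315
```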

Main hardware design problems
• Realization of X (mod P) for an arbitrary P (forward conversion): most EDA tools report a synthesis error.
• Transformation of the modular representation into positional numbers (reverse conversion): it leads to either a long critical path or high hardware cost.

Approaches to X (mod P) design
[1] P.V.A. Mohan, "Residue Number System. Theory and applications", Springer International Publishing, 2016, 351 p.
[2] J.T. Butler and T. Sasao, "Fast hardware computation of x mod z", 25th IEEE International Parallel and Distributed Processing Symposium, Anchorage, AK, USA, May 16-17, 2011, pp. 289-292.
[3] Mark A. Will and Ryan K. L. Ko, "Computing Mod Without Mod".

Input bit splitting approach
• Splitting the input into small tuples (up to 12 bits each);
• Boolean minimization (Disjunctive Normal Form, Reed-Muller expansion, Binary Decision Tree, Majority Graph).
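One way to read the two bullets is that every tuple contributes a small function of at most 12 input bits, which is small enough to tabulate and then minimize. The sketch below assumes that reading: it tabulates `(v * 2**(m*j)) mod P` for each tuple position `j` and sums the looked-up residues. The names `tuple_tables` and `mod_by_splitting` are hypothetical, and the final loop stands in for whatever final reduction the hardware uses.

```python
def tuple_tables(P, total_bits, m):
    """tables[j][v] = (v * 2**(m*j)) mod P for every value v of the j-th m-bit tuple."""
    n_tuples = -(-total_bits // m)  # ceiling division
    return [[(v << (m * j)) % P for v in range(1 << m)] for j in range(n_tuples)]

def mod_by_splitting(x, P, m, tables):
    """Compute x mod P from per-tuple lookups; no division of the wide input is needed."""
    mask = (1 << m) - 1
    acc = 0
    for j, table in enumerate(tables):
        acc += table[(x >> (m * j)) & mask]
    # acc < len(tables) * P, so a few conditional subtractions finish the reduction.
    while acc >= P:
        acc -= P
    return acc

# Assumed example: a 100-bit input split into ten 10-bit tuples, P = 997.
TABLES_997 = tuple_tables(997, 100, 10)
example = mod_by_splitting((1 << 100) - 1, 997, 10, TABLES_997)
```

Each inner table is a truth table with `m` inputs, which is the kind of object a two-level or multi-level Boolean minimizer takes as input.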

Fourier transformation multiplication (FTM)
FTM splits the binary operand vectors into m-bit tuples and multiplies the tuples with each other.
A and B are km-bit vectors (bits 1 … km), each split into k tuples: bits 1 … m, m+1 … 2m, 2m+1 … 3m, …, (k-1)m+1 … km. The product R is a 2km-bit vector.
FTM contains $k^2$ operands of $2m$ bits:
$$R = \sum_{i=1}^{k} \sum_{j=1}^{k} A_i B_j \, 2^{m(i+j-2)}.$$
This leads to $\lceil \log_2 k^{m} \rceil$ levels in the adder tree.
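The double sum can be verified directly in software. The following is a small sketch of the formula only (it says nothing about the adder-tree structure); `split` and `ftm_multiply` are illustrative names.

```python
def split(x, m, k):
    """Return the k tuples [A_1, ..., A_k] of x, A_1 being the least significant m bits."""
    mask = (1 << m) - 1
    return [(x >> (m * i)) & mask for i in range(k)]

def ftm_multiply(a, b, m, k):
    """R = sum over i, j of A_i * B_j * 2**(m*(i+j-2)), with 1-based i and j."""
    A = split(a, m, k)
    B = split(b, m, k)
    total = 0
    for i in range(1, k + 1):
        for j in range(1, k + 1):
            # Each A_i * B_j is a product of two m-bit tuples, at most 2m bits wide.
            total += (A[i - 1] * B[j - 1]) << (m * (i + j - 2))
    return total
```

For example, `ftm_multiply(a, b, 4, 4)` reproduces `a * b` for any two operands below `2**16`, using sixteen 4x4-bit products.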

Multiplication in Synopsys in 28 nm technology
[Figure: bar chart comparing monolith and Synopsys multipliers (regular arithmetic). Horizontal axis: multiplier size, 2x2 to 8x8. Vertical axis: frequency, GHz. The plotted frequencies range from 1.75 to 6.66 GHz.]

FTM for both arithmetic
Example. $R = A \times B$, where A and B are 14-bit operands.
$$A = (A_4, A_3, A_2, A_1), \qquad B = (B_4, B_3, B_2, B_1)$$
$$A_1 = (a_4, a_3, a_2, a_1), \quad A_2 = (a_8, a_7, a_6, a_5), \quad A_3 = (a_{12}, a_{11}, a_{10}, a_9), \quad A_4 = (a_{14}, a_{13})$$
$$B_1 = (b_4, b_3, b_2, b_1), \quad B_2 = (b_8, b_7, b_6, b_5), \quad B_3 = (b_{12}, b_{11}, b_{10}, b_9), \quad B_4 = (b_{14}, b_{13})$$
Multiplication in regular arithmetic, where R is a 28-bit vector:
$$\begin{aligned}
R ={} & A_1 B_1 + A_1 B_2 \cdot 2^{4} + A_1 B_3 \cdot 2^{8} + A_1 B_4 \cdot 2^{12} \\
 & + A_2 B_1 \cdot 2^{4} + A_2 B_2 \cdot 2^{8} + A_2 B_3 \cdot 2^{12} + A_2 B_4 \cdot 2^{16} \\
 & + A_3 B_1 \cdot 2^{8} + A_3 B_2 \cdot 2^{12} + A_3 B_3 \cdot 2^{16} + A_3 B_4 \cdot 2^{20} \\
 & + A_4 B_1 \cdot 2^{12} + A_4 B_2 \cdot 2^{16} + A_4 B_3 \cdot 2^{20} + A_4 B_4 \cdot 2^{24}.
\end{aligned}$$

Adder-tree levels reduction
FTM needs $\lceil \log_2 m^{k} \rceil$ levels in the adder tree, where k is the number of m-bit subvectors.
We propose techniques that upgrade the architecture of monolith-based multipliers and, as a consequence, reduce the computation time.
The adder-tree optimization consists of concatenating separate results of monolith multiplication. For instance, for k = 4 and m = 4, the following terms can be joined into one vector:
$$R = \dots + A_1 B_1 + A_1 B_3 \cdot 2^{8} + A_2 B_4 \cdot 2^{16} + A_4 B_4 \cdot 2^{24} + \dots$$
Each product is 8 bits wide, so the shifted terms $A_1 B_1$, $(A_1 B_3, 00000000)$ with 8 zero bits, $(A_2 B_4, 0000000000000000)$ with 16 zero bits, and $(A_4 B_4, 000000000000000000000000)$ with 24 zero bits occupy disjoint bit positions. Their sum is therefore the concatenation
$$(A_4 B_4,\ A_2 B_4,\ A_1 B_3,\ A_1 B_1).$$

Efficient architecture
Example. $R = A \times B$, where A and B are 14-bit operands and R is 28 bits.
$$A = (A_4, A_3, A_2, A_1), \qquad B = (B_4, B_3, B_2, B_1)$$
Partial products:
$$\begin{array}{llll}
A_1 B_1 = R_1 & A_2 B_1 = R_5 & A_3 B_1 = R_9 & A_4 B_1 = R_{13} \\
A_1 B_2 = R_2 & A_2 B_2 = R_6 & A_3 B_2 = R_{10} & A_4 B_2 = R_{14} \\
A_1 B_3 = R_3 & A_2 B_3 = R_7 & A_3 B_3 = R_{11} & A_4 B_3 = R_{15} \\
A_1 B_4 = R_4 & A_2 B_4 = R_8 & A_3 B_4 = R_{12} & A_4 B_4 = R_{16}
\end{array}$$
The result is the sum of concatenated vectors:
$$\begin{aligned}
R ={} & (R_{16}, R_{11}, R_3, R_1) + (R_{12}, R_7, R_2, 0000) \\
 & + (R_{15}, R_{10}, R_5, 0000) + (R_8, R_6, 00000000) \\
 & + (R_{14}, R_9, 00000000) + (R_4 + R_{13}, 000000000000).
\end{aligned}$$
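The grouping for the 14-bit example can be checked in software. The sketch below is a minimal model under stated assumptions: the field widths (8 bits for a 4x4 product, 6 bits for products involving the 2-bit tuples A_4 or B_4, 4 bits for A_4 B_4) are inferred from the operand split, and `concat`, `tuples_14bit` and `multiply_14x14` are hypothetical names.

```python
def tuples_14bit(x):
    """Split a 14-bit value into [A_1, A_2, A_3, A_4]; A_4 holds only the top 2 bits."""
    return [x & 0xF, (x >> 4) & 0xF, (x >> 8) & 0xF, (x >> 12) & 0x3]

def concat(fields):
    """Concatenate (value, width) fields, most significant field first."""
    out = 0
    for value, width in fields:
        assert 0 <= value < (1 << width), "field does not fit its assumed width"
        out = (out << width) | value
    return out

def multiply_14x14(a, b):
    A = tuples_14bit(a)
    B = tuples_14bit(b)
    # R[n] with n = 4*(i-1) + j holds A_i * B_j, matching R_1..R_16 on the slide.
    R = {4 * i + j + 1: A[i] * B[j] for i in range(4) for j in range(4)}
    vectors = [
        concat([(R[16], 4), (R[11], 8), (R[3], 8), (R[1], 8)]),
        concat([(R[12], 6), (R[7], 8), (R[2], 8), (0, 4)]),
        concat([(R[15], 6), (R[10], 8), (R[5], 8), (0, 4)]),
        concat([(R[8], 6), (R[6], 8), (0, 8)]),
        concat([(R[14], 6), (R[9], 8), (0, 8)]),
        (R[4] + R[13]) << 12,  # both products carry weight 2**12, so they are added first
    ]
    return sum(vectors)
```

Sixteen partial products collapse into six vectors, so the final addition has fewer operands than the plain double sum.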

Multiplication in Synopsys in 28 nm technology
[Figure: bar chart comparing monolith-based and Synopsys multipliers (regular arithmetic). Horizontal axis: multiplier size, 8x8 to 32x32 in steps of 2 bits. Vertical axis: frequency, GHz. The plotted frequencies range from 1.22 to 2.08 GHz.]

Fourier transformation for X (mod P)

Two-level Boolean functions minimization

Boolean functions minimization
Espresso (Berkeley, USA) or ELS (Minsk, Belarus):
- DNF minimization (two-level minimization);
- Binary Decision Diagram (BDD) minimization (multi-level minimization);
- minimization in the class of Reed-Muller expansions;
- etc.
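As an illustration of what a two-level minimizer consumes, the sketch below writes the truth table of one tuple function, `f(v) = (v * 2**shift) mod P`, as text in the Berkeley PLA style (`.i`, `.o`, `.p`, one row per minterm, `.e`). The choice of function and the helper names are assumptions; the slides do not show the actual minimizer input.

```python
def residue_truth_table(P, m, shift):
    """Rows (input_bits, output_bits) for f(v) = (v * 2**shift) mod P over all m-bit v."""
    out_bits = max(1, (P - 1).bit_length())
    rows = []
    for v in range(1 << m):
        r = (v << shift) % P
        rows.append((format(v, f"0{m}b"), format(r, f"0{out_bits}b")))
    return rows

def to_pla(rows):
    """Render truth-table rows as PLA-style text, one fully specified minterm per row."""
    n_in = len(rows[0][0])
    n_out = len(rows[0][1])
    lines = [f".i {n_in}", f".o {n_out}", f".p {len(rows)}"]
    lines += [f"{inputs} {outputs}" for inputs, outputs in rows]
    lines.append(".e")
    return "\n".join(lines)

# Assumed toy case: a 3-bit tuple with weight 2**0 reduced modulo 5.
pla_text = to_pla(residue_truth_table(5, 3, 0))
```

For a 12-bit tuple the table has 4096 rows, which is still a modest input for a two-level minimizer.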

Verilog for [100:1] (mod 997)

    module x_100_mod_997(
        input  [100:1] X,
        output [10:1]  R
    );
        wire [22:1] R_temp_1;  // R_temp_1 < 3566179 (22-bit number)
        wire [15:1] R_temp_2;  // R_temp_2 < 30831 (15-bit number)
        wire [11:1] R_temp_3;  // R_temp_3 < 1833 (11-bit number)
        reg  [10:1] R_temp;

        // Stage 1: each 10-bit tuple is weighted by 2^(10*j) mod 997.
        assign R_temp_1 = X[10:1]
                        + X[20:11]  * 5'b11011
                        + X[30:21]  * 10'b1011011001
                        + X[40:31]  * 10'b1011100100
                        + X[50:41]  * 6'b101000
                        + X[60:51]  * 7'b1010011
                        + X[70:61]  * 8'b11110111
                        + X[80:71]  * 10'b1010101111
                        + X[90:81]  * 10'b1001011011
                        + X[100:91] * 9'b101001001;

        // Stage 2: the same splitting applied to the 22-bit intermediate sum.
        assign R_temp_2 = R_temp_1[10:1]
                        + R_temp_1[20:11] * 5'b11011
                        + R_temp_1[22:21] * 10'b1011011001;

        // Stage 3: the same splitting applied to the 15-bit intermediate sum.
        assign R_temp_3 = R_temp_2[10:1]
                        + R_temp_2[15:11] * 5'b11011;

        // Final correction: R_temp_3 < 2 * 997, so one conditional subtraction suffices.
        always @(*) begin
            if (R_temp_3 >= 10'b1111100101)
                R_temp = R_temp_3 - 10'b1111100101;
            else
                R_temp = R_temp_3;
        end

        assign R = R_temp;
    endmodule
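The listing can be cross-checked with a bit-accurate software model. The Python sketch below mirrors the three reduction stages and the final conditional subtraction; `field` is a hypothetical helper that imitates a Verilog part select with 1-based bit indices, and the weights are recomputed as `2**(10*j) mod 997` rather than copied from the binary literals.

```python
# Weights of the ten 10-bit tuples: 2**(10*j) mod 997 for j = 0..9.
WEIGHTS = [pow(2, 10 * j, 997) for j in range(10)]

def field(x, hi, lo):
    """Imitate the Verilog part select x[hi:lo] with 1-based bit indices."""
    return (x >> (lo - 1)) & ((1 << (hi - lo + 1)) - 1)

def x_100_mod_997(x):
    """Software model of the three-stage reduction for a 100-bit input."""
    assert 0 <= x < (1 << 100)
    # Stage 1: weighted sum of the ten input tuples (fits in 22 bits).
    t1 = sum(field(x, 10 * j + 10, 10 * j + 1) * WEIGHTS[j] for j in range(10))
    # Stage 2: split the 22-bit sum again (fits in 15 bits).
    t2 = field(t1, 10, 1) + field(t1, 20, 11) * WEIGHTS[1] + field(t1, 22, 21) * WEIGHTS[2]
    # Stage 3: split the 15-bit sum (fits in 11 bits and stays below 2 * 997).
    t3 = field(t2, 10, 1) + field(t2, 15, 11) * WEIGHTS[1]
    # Final correction: a single conditional subtraction.
    return t3 - 997 if t3 >= 997 else t3
```

The recomputed weights are 1, 27, 729, 740, 40, 83, 247, 687, 603 and 329, which agree with the binary constants in the Verilog module.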

Performance in Synopsys in 28 nm for X (mod P)
[Figure: three bar charts, one per input bit range ([300:1], [400:1], [500:1]), comparing the "mod P approach" with Synopsys. Horizontal axis: modulus P = 19, 53, 113, 241, 461, 977, 2011, 4051. Vertical axis: frequency, MHz.]
