Multiprecision Multiplication on ARMv8
ZHE LIU 1, KIMMO JÄRVINEN 2, WEIQIANG LIU 3, HWAJEONG SEO 4
1 APSIA, Interdisciplinary Centre for Security, Reliability and Trust (SnT), University of Luxembourg, Luxembourg
2 Department of Computer Science, University of Helsinki, Helsinki, Finland
3 College of Electronic and Information Engineering, Nanjing University of Aeronautics and Astronautics
4 Department of IT, Hansung University

Motivation
• Cryptography degrades the performance of smartphones
• In particular, public-key cryptography (PKC) imposes high overheads
• A fast PKC implementation is important to achieve high availability

Motivation
• Multi-precision arithmetic operations (for PKC)
• Compact big-number implementation is an open problem
• Few works focus on ARMv8
• GCM (CT-RSA'15) → Binary field multiplication (SCN'16) → Binary ECC (SG-CRC'17)
• This work improves the performance of multiplication on ARMv8!

Contribution
• Compact implementations of multi-precision multiplication
• Subtractive Karatsuba algorithm
• Evaluation of multiple-level Karatsuba
• Tested input sizes: 128, 256, 384 and 512 bits
• Dedicated squaring routine

Target Platform – ARMv8
• 95% of smartphones are based on the ARM architecture
• Modern smartphones support 64-bit ARMv8

Target Platform – ARMv8
• 32-bit mode (AArch32) & 64-bit mode (AArch64)
• 64-bit ARM & 128-bit NEON registers and instruction sets
• Crypto (AES and SHA) operations

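Multiplication on ARMv8
[Figure: 64-bit × 64-bit multiplication of a0 (register X0) and b0 (register X1). MUL and UMULH together produce the two 64-bit halves of the 128-bit product a0b0, which are held in registers X2 and X3.]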

Multiplication on ARMv8: SIMD (NEON) vs. SISD (A64)
[Figure: NEON side, vector registers V0 (a3, a2, a1, a0) and V1 (b3, b2, b1, b0) hold 32-bit lanes, and UMULL writes 64-bit products such as a1b0 and a0b0 to V2. A64 side, X0 (a0) and X1 (b0) hold 64-bit words, and MUL and UMULH write the two 64-bit halves of a0b0 to X2 and X3.]
For a 64-bit multiplication on ARMv8, NEON requires 4 UMULL routines, whereas A64 needs only 1 MUL and 1 UMULH. A64 is therefore more efficient than NEON for big-integer multiplication.
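The comparison above can be illustrated with a small reference model. This is only a sketch in Python, not the authors' assembly: `mul_lo` and `umulh` are hypothetical helper names that mimic what the A64 MUL and UMULH instructions return, and `mul64_via_32bit_products` assumes that the NEON route builds the 128-bit product from four 32 × 32-bit partial products, one per UMULL.

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1

def mul_lo(a, b):
    """Model of A64 MUL: low 64 bits of the unsigned 128-bit product."""
    return (a * b) & MASK64

def umulh(a, b):
    """Model of A64 UMULH: high 64 bits of the unsigned 128-bit product."""
    return (a * b) >> 64

def mul64_via_32bit_products(a, b):
    """Build the same 128-bit product from four 32 x 32 -> 64-bit products,
    which is the granularity a 32-bit-lane UMULL provides (assumption)."""
    a0, a1 = a & MASK32, a >> 32
    b0, b1 = b & MASK32, b >> 32
    p00 = a0 * b0          # weight 2^0
    p01 = a0 * b1          # weight 2^32
    p10 = a1 * b0          # weight 2^32
    p11 = a1 * b1          # weight 2^64
    return p00 + ((p01 + p10) << 32) + (p11 << 64)

if __name__ == "__main__":
    a, b = 0x0123456789ABCDEF, 0xFEDCBA9876543210
    product = (umulh(a, b) << 64) | mul_lo(a, b)
    print(hex(product), product == mul64_via_32bit_products(a, b))
```

The two-instruction path needs no recombination step, while the four-product path also has to align and add the partial products, which is consistent with the slide's conclusion.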

Multi-precision Multiplication
256~2048-bit multiplication on a 64-bit architecture
- divide the big integer (256~2048-bit) into small integers (64-bit)

Method             Operand-scanning   Product-scanning        Hybrid-scanning
Computation order  Row-wise           Column-wise             Mixture of row/column
Requirement        Many registers     Efficient MAC routine   General processor

- ARMv8 supports 31 × 64-bit registers → operand-scanning (previous works)

Multi-precision Multiplication Operand-scanning method
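As an illustration of the row-wise order, the following is a minimal Python sketch of operand scanning on 64-bit limbs. It is a generic textbook formulation, not the register-level routine from the slides; the helper names `to_limbs` and `from_limbs` are introduced here only for the example.

```python
MASK64 = (1 << 64) - 1

def to_limbs(x, n):
    """Split a non-negative integer into n little-endian 64-bit limbs."""
    return [(x >> (64 * i)) & MASK64 for i in range(n)]

def from_limbs(limbs):
    """Recombine little-endian 64-bit limbs into one integer."""
    return sum(limb << (64 * i) for i, limb in enumerate(limbs))

def mul_operand_scanning(a, b):
    """Row-wise multiplication: each limb of b is multiplied by all limbs
    of a, and the row is accumulated into the result before moving on."""
    n, m = len(a), len(b)
    r = [0] * (n + m)
    for i in range(m):                       # one row per limb of b
        carry = 0
        for j in range(n):
            # a[j]*b[i] + r[i+j] + carry always fits in 128 bits
            t = a[j] * b[i] + r[i + j] + carry
            r[i + j] = t & MASK64            # low word (what MUL provides)
            carry = t >> 64                  # high word (what UMULH provides)
        r[i + n] = carry                     # this word was not written yet
    return r

if __name__ == "__main__":
    x = (1 << 256) - 12345
    y = (1 << 255) + 6789
    z = from_limbs(mul_operand_scanning(to_limbs(x, 4), to_limbs(y, 4)))
    print(z == x * y)
```

In a register-rich setting the whole row accumulator can stay in registers, which is why the slide lists "many registers" as the requirement for this method.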

Multi-precision Squaring
A special case of multiplication where both operands are the same (i.e., A = B).
Certain partial products become the same and need to be performed only once (i.e., $A[0] \times B[1] + A[1] \times B[0]$ becomes $2 \times A[0] \times A[1]$ if $A = B$).
Two approaches:
- Doubling the operand (i.e., $2 \times A[0]$ → $2 \times A[0] \times A[1]$)
- Doubling the result (i.e., $A[0] \times A[1]$ → $2 \times A[0] \times A[1]$) → Sliding-block-doubling

Multi-precision Squaring Sliding-block-doubling method
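The "doubling the result" idea can be sketched as follows. This Python model computes every off-diagonal product once, doubles the accumulated sum with a one-bit shift, and then adds the diagonal squares. It is a plain illustration of that idea, not the block-wise scheduling of the sliding-block-doubling method itself, and all function names are hypothetical.

```python
MASK64 = (1 << 64) - 1

def to_limbs(x, n):
    """Split a non-negative integer into n little-endian 64-bit limbs."""
    return [(x >> (64 * i)) & MASK64 for i in range(n)]

def from_limbs(limbs):
    """Recombine little-endian 64-bit limbs into one integer."""
    return sum(limb << (64 * i) for i, limb in enumerate(limbs))

def sqr_doubling_result(a):
    """Square a limb array by doubling the sum of the off-diagonal products."""
    n = len(a)
    r = [0] * (2 * n)
    # Step 1: accumulate a[i]*a[j] for i < j, each product computed once.
    for i in range(n):
        carry = 0
        for j in range(i + 1, n):
            t = a[i] * a[j] + r[i + j] + carry
            r[i + j] = t & MASK64
            carry = t >> 64
        r[i + n] = carry
    # Step 2: double the whole intermediate result (shift left by one bit).
    top = 0
    for k in range(2 * n):
        v = (r[k] << 1) | top
        r[k] = v & MASK64
        top = v >> 64
    # Step 3: add the diagonal squares a[i]*a[i] at word position 2*i.
    carry = 0
    for i in range(n):
        sq = a[i] * a[i]
        t = r[2 * i] + (sq & MASK64) + carry
        r[2 * i] = t & MASK64
        t = r[2 * i + 1] + (sq >> 64) + (t >> 64)
        r[2 * i + 1] = t & MASK64
        carry = t >> 64
    return r

if __name__ == "__main__":
    x = (1 << 256) - 1
    print(from_limbs(sqr_doubling_result(to_limbs(x, 4))) == x * x)
```

For n limbs this needs n(n-1)/2 off-diagonal products plus n squares, instead of the n^2 products of a general multiplication.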

Karatsuba-Ofman Algorithm
Number of partial products: School-book $N^2$, Karatsuba-Ofman $N^{\log_2 3}$
The product $C = A \cdot B$ of two $n$-bit integers $A = A_L + A_H 2^{n/2}$ and $B = B_L + B_H 2^{n/2}$:
$C = A_H \cdot B_H 2^{n} + [(A_L + A_H) \cdot (B_L + B_H) - A_L \cdot B_L - A_H \cdot B_H] 2^{n/2} + A_L \cdot B_L$

Subtractive Karatsuba Algorithm
$C = A_H \cdot B_H 2^{n} + [(A_L + A_H) \cdot (B_L + B_H) - A_L \cdot B_L - A_H \cdot B_H] 2^{n/2} + A_L \cdot B_L$
$(A_L + A_H) \cdot (B_L + B_H) - A_L \cdot B_L - A_H \cdot B_H = A_L \cdot B_L + A_H \cdot B_H - (-1)^t \, |A_H - A_L| \cdot |B_H - B_L|$
where $t = 1$ if exactly one of the differences $A_H - A_L$ and $B_H - B_L$ is negative, and $t = 0$ otherwise.
Advantage:
- constant size of operands ($n/2$) → fast constant-time multiplication
Requirement:
- Absolute value in two's complement representation
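A short Python sketch of the subtractive variant may help to check the identity. It writes the low and high halves as `a_l`, `a_h`, `b_l`, `b_h`, tracks the sign of the middle product explicitly, and recurses until the operands fit in one 64-bit word. The function name, the recursion threshold, and the use of a Python branch for the sign are choices made for this illustration only; a constant-time implementation would handle the sign without branching.

```python
def karatsuba_subtractive(a, b, bits):
    """Multiply two non-negative integers of at most `bits` bits using
    three half-size products per level (subtractive Karatsuba)."""
    if bits <= 64:
        return a * b                      # base case: one word product
    h = bits // 2
    mask = (1 << h) - 1
    a_l, a_h = a & mask, a >> h
    b_l, b_h = b & mask, b >> h

    low = karatsuba_subtractive(a_l, b_l, h)
    high = karatsuba_subtractive(a_h, b_h, h)

    # Differences are reduced to absolute values so the middle product
    # has operands of the same size as the other two products.
    da, db = a_h - a_l, b_h - b_l
    negative = (da < 0) != (db < 0)       # sign of da * db
    mid_abs = karatsuba_subtractive(abs(da), abs(db), h)

    # a_l*b_h + a_h*b_l = low + high - da*db
    middle = low + high + mid_abs if negative else low + high - mid_abs
    return (high << (2 * h)) + (middle << h) + low

if __name__ == "__main__":
    x = (1 << 256) - 987654321
    y = (1 << 200) + 123456789
    print(karatsuba_subtractive(x, y, 256) == x * y)
```

Because `abs(da)` and `abs(db)` are always smaller than `2**h`, the middle product never needs the extra carry bit that the additive form `(a_l + a_h) * (b_l + b_h)` can produce.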

Multi-precision Multiplication on ARMv8 128-bit Karatsuba multiplication

Multi-precision Squaring on ARMv8
No need for absolute value handling: $(A_H - A_L) \cdot (A_H - A_L)$ → always non-negative value
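For squaring, the middle term is itself a square, so its sign never has to be tracked. The sketch below shows one Karatsuba level in Python under that observation; `karatsuba_square` is a hypothetical name and the half-size squares are computed with Python's built-in arithmetic for brevity.

```python
def karatsuba_square(a, bits):
    """One Karatsuba level for squaring a non-negative integer of at most
    `bits` bits: three half-size squares instead of four products."""
    h = bits // 2
    mask = (1 << h) - 1
    a_l, a_h = a & mask, a >> h

    low = a_l * a_l
    high = a_h * a_h
    d = a_h - a_l
    mid_sq = d * d                  # non-negative whatever the sign of d

    # 2*a_l*a_h = low + high - (a_h - a_l)^2
    middle = low + high - mid_sq
    return (high << (2 * h)) + (middle << h) + low

if __name__ == "__main__":
    x = (1 << 128) - 159
    print(karatsuba_square(x, 128) == x * x)
```

Compared with the multiplication case, the sign bookkeeping for the middle product disappears, which is the simplification the slide points out.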

Optimizations of instruction set
Generation of the carry register
- MOV X0, #0 → … → ADDS X1, X1, X2 → ADCS X3, X0, X0
Two's complement
- SBCS X2, X2, X2 → EOR X0, X0, X2 → AND X2, X2, #1 → ADD X0, X0, X2
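The effect of both instruction sequences can be modelled in a few lines of Python. This is a sketch under stated assumptions: the carry flag follows the ARM convention (after a subtraction, C = 1 means no borrow), and the SBCS sequence is assumed to follow a flag-setting subtraction that produced the difference in X0. The function names are hypothetical.

```python
MASK64 = (1 << 64) - 1

def add_capture_carry(x1, x2):
    """ADDS X1, X1, X2 followed by ADCS X3, X0, X0 with X0 = 0:
    X3 ends up holding the carry-out of the addition as 0 or 1."""
    total = x1 + x2
    carry_flag = total >> 64            # C flag set by ADDS
    x0 = 0                              # MOV X0, #0
    x3 = x0 + x0 + carry_flag           # ADCS X3, X0, X0
    return total & MASK64, x3

def abs_diff(x, y):
    """Branch-free |x - y| on 64-bit words, plus the sign bit.
    Assumes a preceding flag-setting subtraction (not shown on the slide)."""
    diff = (x - y) & MASK64             # two's complement difference
    carry_flag = 0 if x < y else 1      # ARM: C = 0 means a borrow occurred
    mask = (carry_flag - 1) & MASK64    # SBCS X2, X2, X2: 0 or all ones
    value = diff ^ mask                 # EOR X0, X0, X2: invert if negative
    sign = mask & 1                     # AND X2, X2, #1
    value = (value + sign) & MASK64     # ADD X0, X0, X2: finish negation
    return value, sign

if __name__ == "__main__":
    print(add_capture_carry(MASK64, 1))
    print(abs_diff(3, 10), abs_diff(10, 3))
```

When no borrow occurred the mask is zero and the value passes through unchanged; otherwise inverting all bits and adding 1 is the usual two's complement negation, so the same instructions run for both signs.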

Evaluation
IDE: Xcode
Target:
- 64-bit ARMv8-A architecture
- Apple A7 (APL0698) @1.3GHz
Programming language: assembly
Optimization level: -Ofast

Evaluation

Conclusion
Achievements
- Efficient implementations of multi-precision multiplication / squaring on ARMv8
Future works
- Cryptography implementations (ECC, RSA, SIDH)
