Multiprecision Multiplication on ARMv8 - ZHE LIU, KIMMO JÄRVINEN - PowerPoint PPT Presentation



slide-1
SLIDE 1

Multiprecision Multiplication on ARMv8

ZHE LIU1, KIMMO JÄRVINEN2, WEIQIANG LIU3, HWAJEONG SEO4

1 APSIA, Interdisciplinary Centre for Security, Reliability and Trust (SnT), University of Luxembourg, Luxembourg
2 Department of Computer Science, University of Helsinki, Helsinki, Finland
3 College of Electronic and Information Engineering, Nanjing University of Aeronautics and Astronautics
4 Department of IT, Hansung University

slide-2
SLIDE 2

Motivation

  • Cryptography degrades the performance of smartphones
  • In particular, public key cryptography imposes high overheads
  • Fast PKC implementation is important to achieve high availability
slide-3
SLIDE 3

Motivation

  • Multi-precision arithmetic operation (for PKC)
  • Compact big number implementation is an open problem
  • Few works focus on ARMv8
  • Existing ARMv8 works: GCM (CT-RSA’15), binary field multiplication (SCN’16), binary ECC (SG-CRC’17)
  • This work improves the performance of multiplication on ARMv8!
slide-4
SLIDE 4

Contribution

  • Compact implementations of multi-precision multiplication
  • Subtractive Karatsuba algorithm
  • Evaluation of multiple-level Karatsuba
  • Tested input sizes: 128, 256, 384, and 512 bits
  • Dedicated squaring routine
slide-5
SLIDE 5

Target Platform – ARMv8

  • 95% of smartphones are based on the ARM architecture
  • Modern smartphones support 64-bit ARMv8
slide-6
SLIDE 6

Target Platform – ARMv8

  • 32-bit mode (AArch32) & 64-bit mode (AArch64)
  • 64-bit ARM & 128-bit NEON registers and instruction sets
  • Crypto (AES and SHA) instructions
slide-7
SLIDE 7

Multiplication on ARMv8

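[Figure: 64-bit × 64-bit multiplication on A64 with operands a0 and b0 in X registers. MUL writes the lower 64 bits of a0 × b0 to one register; UMULH writes the upper 64 bits to another.]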

slide-8
SLIDE 8

Multiplication on ARMv8

[Figure: SIMD (NEON) vs. SISD (A64) multiplication. NEON: UMULL multiplies 32-bit lanes of the V registers into 64-bit products. A64: MUL and UMULH multiply 64-bit X registers into the lower and upper 64 bits of the product.]

For a 64-bit multiplication on ARMv8, NEON requires four UMULL operations, whereas A64 needs only one MUL and one UMULH. A64 is therefore more efficient than NEON for big-integer multiplication.
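The comparison above can be modeled in a few lines of Python. This is only an illustrative sketch, not the authors' code: the function names `mul_a64` and `mul_neon_style` are hypothetical, and Python integers stand in for the X and V registers.

```python
MASK64 = (1 << 64) - 1
MASK32 = (1 << 32) - 1

def mul_a64(a, b):
    """Model of the A64 route: one MUL (low half) and one UMULH (high half)."""
    product = a * b
    lo = product & MASK64           # what MUL would return
    hi = (product >> 64) & MASK64   # what UMULH would return
    return lo, hi

def mul_neon_style(a, b):
    """Model of the NEON route: four 32x32 -> 64-bit products (UMULL-like)."""
    a0, a1 = a & MASK32, a >> 32
    b0, b1 = b & MASK32, b >> 32
    p00 = a0 * b0                   # each of these four products
    p01 = a0 * b1                   # corresponds to one 32-bit lane
    p10 = a1 * b0                   # multiplication
    p11 = a1 * b1
    # The partial products still have to be aligned and accumulated.
    product = p00 + ((p01 + p10) << 32) + (p11 << 64)
    return product & MASK64, product >> 64
```

Both routes return the same (low, high) pair; the second one simply needs four multiplications plus extra alignment work to get there.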

slide-9
SLIDE 9

Multi-precision Multiplication

256~2048-bit multiplication on a 64-bit architecture

  • Divide the big integer (256 to 2048 bits) into small integers (64-bit words)
  • ARMv8 provides 31 64-bit registers, which suits operand-scanning (used in previous works)

Method            | Operand-scanning | Product-scanning      | Hybrid-scanning
Computation order | Row-wise         | Column-wise           | Mixture of row/column
Requirement       | Many registers   | Efficient MAC routine | General processor

slide-10
SLIDE 10

Multi-precision Multiplication

Operand-scanning method
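Since the slide's figure did not survive extraction, here is a minimal sketch of the row-wise computation order. It assumes little-endian lists of 64-bit words; the helper names (`to_words`, `from_words`, `operand_scanning_mul`) are made up for this illustration, and the authors' register-level assembly schedule is not reproduced.

```python
WORD_BITS = 64
MASK = (1 << WORD_BITS) - 1

def to_words(x, n):
    """Split an integer into n little-endian 64-bit words."""
    return [(x >> (WORD_BITS * i)) & MASK for i in range(n)]

def from_words(words):
    """Reassemble an integer from little-endian 64-bit words."""
    return sum(w << (WORD_BITS * i) for i, w in enumerate(words))

def operand_scanning_mul(a, b):
    """Row-wise (operand-scanning) product of two word lists."""
    result = [0] * (len(a) + len(b))
    for i, b_word in enumerate(b):           # one row per word of b
        carry = 0
        for j, a_word in enumerate(a):
            # a_word * b_word plus two words never exceeds 128 bits
            t = a_word * b_word + result[i + j] + carry
            result[i + j] = t & MASK         # low word of the running sum
            carry = t >> WORD_BITS           # high word carried to the next column
        result[i + len(a)] = carry           # top word of this row
    return result
```

For example, `from_words(operand_scanning_mul(to_words(x, 4), to_words(y, 4)))` yields the full 512-bit product of two 256-bit values `x` and `y`.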

slide-11
SLIDE 11

Multi-precision Squaring

A special case of multiplication where both operands are the same (i.e., A = B).

Certain partial products become the same and need to be performed only once (i.e., A[0] × B[1] + A[1] × B[0] becomes 2 × A[0] × A[1] if A = B).

Two approaches:

  • Doubling the operand (i.e., 2 × A[0] → 2 × A[0] × A[1])
  • Doubling the result (i.e., A[0] × A[1] → 2 × A[0] × A[1]) → Sliding-block-doubling
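The "doubling the result" idea can be sketched as follows. This is a plain illustration of computing each cross product once and doubling the accumulated sum; it is not the sliding-block-doubling schedule itself, whose details are not given in the transcript. The function name and the multiplication counter are assumptions added for the example.

```python
WORD_BITS = 64
MASK = (1 << WORD_BITS) - 1

def square_doubling_result(a):
    """Square a little-endian word list, computing each cross product once."""
    n = len(a)
    cross = 0    # sum of A[i] * A[j] for i < j
    mults = 0    # number of word multiplications performed
    for i in range(n):
        for j in range(i + 1, n):
            cross += (a[i] * a[j]) << (WORD_BITS * (i + j))
            mults += 1
    total = cross << 1                                   # doubling the result
    for i in range(n):
        total += (a[i] * a[i]) << (WORD_BITS * 2 * i)    # diagonal terms A[i]^2
        mults += 1
    words = [(total >> (WORD_BITS * k)) & MASK for k in range(2 * n)]
    return words, mults
```

For a 4-word (256-bit) operand this performs 10 word multiplications instead of the 16 a generic multiplication would need.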
slide-12
SLIDE 12

Multi-precision Squaring

Sliding-block-doubling method

slide-13
SLIDE 13

Karatsuba-Ofman Algorithm

The product $C = A \cdot B$ of two $n$-bit integers $A = A_L + A_H 2^{n/2}$ and $B = B_L + B_H 2^{n/2}$:

$C = A_H \cdot B_H 2^{n} + [(A_L + A_H) \cdot (B_L + B_H) - A_L \cdot B_L - A_H \cdot B_H] 2^{n/2} + A_L \cdot B_L$

Number of partial products:

School-book: $N^2$
Karatsuba-Ofman: $N^{\log_2 3}$
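To make the operation counts concrete, the sketch below counts word multiplications for both methods and checks one level of the Karatsuba formula. The function names are hypothetical, and the counting function assumes the number of words is a power of two and that the recursion runs down to single words.

```python
def schoolbook_word_mults(n_words):
    """Schoolbook multiplication: every word pair is multiplied."""
    return n_words * n_words

def karatsuba_word_mults(n_words):
    """Karatsuba recursing to single words (n_words a power of two)."""
    if n_words == 1:
        return 1
    return 3 * karatsuba_word_mults(n_words // 2)

def karatsuba_one_level(a, b, n):
    """One level of Karatsuba on two n-bit integers (n even)."""
    half = n // 2
    mask = (1 << half) - 1
    a_l, a_h = a & mask, a >> half
    b_l, b_h = b & mask, b >> half
    low = a_l * b_l                                   # A_L * B_L
    high = a_h * b_h                                  # A_H * B_H
    middle = (a_l + a_h) * (b_l + b_h) - low - high   # third multiplication
    return (high << n) + (middle << half) + low
```

With 8 words (512 bits), the counts are 64 for schoolbook and 27 for full-depth Karatsuba. Note that the sums `a_l + a_h` and `b_l + b_h` can be one bit wider than n/2.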

slide-14
SLIDE 14

Subtractive Karatsuba Algorithm

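$C = A_H \cdot B_H 2^{n} + [(A_L + A_H) \cdot (B_L + B_H) - A_L \cdot B_L - A_H \cdot B_H] 2^{n/2} + A_L \cdot B_L$

$(A_L + A_H) \cdot (B_L + B_H) - A_L \cdot B_L - A_H \cdot B_H = A_L \cdot B_L + A_H \cdot B_H - (A_H - A_L) \cdot (B_H - B_L)$

The product $(A_H - A_L) \cdot (B_H - B_L)$ is computed as $|A_H - A_L| \cdot |B_H - B_L|$ and negated when exactly one of the two differences is negative.

Advantage:

  • Operands have a constant size (n/2), which allows fast constant-time multiplication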

Requirement:

  • Absolute value in two’s complement representation
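One level of the subtractive variant might look like the sketch below. The function name is hypothetical, and the Python branches are used only for readability: a constant-time implementation would replace them with masking, which this example does not attempt to show.

```python
def subtractive_karatsuba(a, b, n):
    """One level of subtractive Karatsuba on two n-bit integers (n even)."""
    half = n // 2
    mask = (1 << half) - 1
    a_l, a_h = a & mask, a >> half
    b_l, b_h = b & mask, b >> half

    low = a_l * b_l     # A_L * B_L
    high = a_h * b_h    # A_H * B_H

    # The absolute differences always fit in n/2 bits; the signs are kept aside.
    a_neg = a_h < a_l
    b_neg = b_h < b_l
    a_diff = a_l - a_h if a_neg else a_h - a_l
    b_diff = b_l - b_h if b_neg else b_h - b_l
    cross = a_diff * b_diff    # |A_H - A_L| * |B_H - B_L|

    if a_neg != b_neg:         # exactly one difference negative
        middle = low + high + cross
    else:
        middle = low + high - cross
    return (high << n) + (middle << half) + low
```

Unlike the additive form, all three multiplications here take operands of exactly n/2 bits, which is the property the slide highlights.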
slide-15
SLIDE 15

Multi-precision Multiplication on ARMv8

128-bit Karatsuba multiplication

slide-16
SLIDE 16

Multi-precision Squaring on ARMv8

No need for absolute value handling: $(A_H - A_L) \cdot (A_H - A_L)$ is always non-negative.
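For squaring, the middle term simplifies because the difference is multiplied by itself. A small sketch under the same assumptions as before (hypothetical function name, Python integers in place of registers):

```python
def karatsuba_square(a, n):
    """One level of subtractive Karatsuba squaring on an n-bit integer (n even)."""
    half = n // 2
    mask = (1 << half) - 1
    a_l, a_h = a & mask, a >> half
    low = a_l * a_l      # A_L^2
    high = a_h * a_h     # A_H^2
    # Only the magnitude of the difference matters: its sign cancels
    # when the value is squared, so no sign has to be recorded or applied.
    diff = abs(a_h - a_l)
    middle = low + high - diff * diff    # equals 2 * A_L * A_H
    return (high << n) + (middle << half) + low
```

The three half-size operations are all squarings, so each can reuse the dedicated squaring routine.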

slide-17
SLIDE 17

Optimizations of instruction set

Generation of the carry register

  • MOV X0, #0 (X0 holds zero)
    …
    ADDS X1, X1, X2 (the addition sets the carry flag)
    ADCS X3, X0, X0 (X3 = 0 + 0 + carry)

Two’s complement

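  • SBCS X2, X2, X2 (X2 = 0 without a borrow, all ones with a borrow)
    EOR X0, X0, X2 (invert X0 on borrow)
    AND X2, X2, #1 (X2 = 1 on borrow)
    ADD X0, X0, X2 (add 1 to complete the negation)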
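The two instruction sequences can be emulated on 64-bit values as a sanity check. The sketch assumes that the SBCS follows a subtraction that left its borrow in the carry flag (on AArch64 the carry flag is 1 when no borrow occurred); the slide does not show that preceding instruction, and the Python function names are made up for the example.

```python
MASK64 = (1 << 64) - 1

def adds(x, y):
    """ADDS: 64-bit addition that also returns the carry flag."""
    total = x + y
    return total & MASK64, total >> 64

def carry_register(carry):
    """ADCS X3, X0, X0 with X0 = 0: X3 = 0 + 0 + carry."""
    return (0 + 0 + carry) & MASK64

def subs(x, y):
    """SUBS: 64-bit subtraction; carry flag is 1 when there is no borrow."""
    return (x - y) & MASK64, 1 if x >= y else 0

def conditional_negate(x0, carry):
    """Negate x0 (two's complement) only if the previous SUBS borrowed."""
    x2 = 0
    x2 = (x2 - x2 - (1 - carry)) & MASK64   # SBCS X2, X2, X2 -> 0 or all ones
    x0 = x0 ^ x2                            # EOR X0, X0, X2
    x2 = x2 & 1                             # AND X2, X2, #1
    return (x0 + x2) & MASK64               # ADD X0, X0, X2
```

Feeding the result of `subs(3, 10)` into `conditional_negate` gives 7, the absolute difference, without any branch on the sign.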
slide-18
SLIDE 18

Evaluation

IDE: Xcode

Target:

  • 64-bit ARMv8-A architecture
  • Apple A7 (APL0698) @1.3GHz

Programming language: assembly

Optimization level: -Ofast

slide-19
SLIDE 19

Evaluation

slide-20
SLIDE 20

Evaluation

slide-21
SLIDE 21

Conclusion

Achievements

  • Efficient implementations of multi-precision multiplication / squaring on ARMv8

Future work

  • Cryptography implementations (ECC, RSA, SIDH)