Hardware Architectures for HECC, by Gabriel GALLIN and Arnaud TISSERAND

Hardware Architectures for HECC. Gabriel GALLIN and Arnaud TISSERAND (CNRS, Lab-STICC, IRISA), HAH Project. CryptArchi, June 2017. Summary: Context & Motivations, HECC Operations, Efficient Multiplier, Architectures and Tools, Conclusion.


  1. Hardware Architectures for HECC Gabriel GALLIN and Arnaud TISSERAND CNRS – Lab-STICC – IRISA HAH Project CryptArchi June, 2017

  2. Summary
     1. Context & Motivations
     2. HECC Operations
     3. Efficient Multiplier
     4. Architectures and Tools for HECC
     5. Conclusion
     G. Gallin - A. Tisserand, Hardware Architectures for HECC, CryptArchi 2017

  4. Public-Key Cryptography (PKC)
     Provides cryptographic primitives such as digital signatures, key exchange and specific encryption schemes.
     First PKC standard: RSA
     - Keys of at least 2000 bits recommended today
     - Too costly for embedded applications
     Elliptic Curve Cryptography (ECC):
     - Better performance and lower cost than RSA
     - Allows more advanced schemes
     Hyper-Elliptic Curve Cryptography (HECC):
     - Evolution of ECC focusing on larger sets of curves
     - Expected to have a smaller cost than ECC

  5. Operations Hierarchy in (H)ECC
     [Figure: layer stack. Protocols; Scalar Multiplication $[k]P_b$ (software); Curve-Level Operations ADD(P, Q) and DBL(P) = P + P; GF(p) / GF(2^m) Operations $x \pm y$, $x \times y$ (hardware).]
     ADD and DBL are built using $\mathbb{F}_P$ operations.
     Modular arithmetic in $\mathbb{F}_P$:
     - Elements of 100 to 200 bits for HECC
     - Operations involve modular reduction
     - Choice for P:
       - Generic P: more flexible but slower
       - Specific P (e.g. pseudo-Mersenne): faster but less flexible
     Modular multiplication (M) and square (S):
     - Most common and costly operations
     - Efficient dedicated units
     Main metric: numbers of M and S in $\mathbb{F}_P$
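The trade-off between a generic and a specific P can be sketched in a few lines of Python. This is only an illustration: the modulus P = 2^127 - 1 is an assumed example (the slide does not name a prime), and the function names `reduce_mersenne`, `mul_generic` and `mul_specific` are hypothetical.

```python
# Illustrative sketch only. P = 2**127 - 1 is an ASSUMED example modulus,
# not a value taken from the slides.
N_BITS = 127
P = (1 << N_BITS) - 1


def reduce_mersenne(x: int) -> int:
    """Reduce 0 <= x < P**2 modulo P with shifts, masks and additions only.

    Because 2**127 == 1 (mod P), the high part of x can be folded onto
    the low part instead of being divided out.
    """
    x = (x & P) + (x >> N_BITS)  # first fold: result < 2**128
    x = (x & P) + (x >> N_BITS)  # second fold: result <= P
    return 0 if x == P else x


def mul_generic(a: int, b: int, p: int) -> int:
    """Generic P: works for any modulus, but needs a full reduction."""
    return (a * b) % p


def mul_specific(a: int, b: int) -> int:
    """Specific P: the reduction is a couple of cheap folds."""
    return reduce_mersenne(a * b)
```

The specific version only works for this one modulus, which is the loss of flexibility the slide refers to.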

  6. ECC, HECC, Kummer-HECC

     | Scheme | $\mathbb{F}_P$ element size | ADD | DBL | Source |
     |--------|-----------------------------|-----|-----|--------|
     | ECC | $\ell_{ECC}$ | 12M + 2S | 7M + 3S | [Bernstein and Lange] |
     | HECC | $\ell_{HECC} \approx \frac{1}{2}\ell_{ECC}$ | 40M + 4S | 38M + 6S | [Lange, 2005] |
     | Kummer | $\ell_{HECC}$ | 19M + 12S (single entry for ADD and DBL) | | [Renes et al., 2016] |

     ECC:
     - Size of $\mathbb{F}_P$ elements 2× larger
     - Simpler ADD and DBL operations
     HECC:
     - Smaller $\mathbb{F}_P$
     - More operations in $\mathbb{F}_P$ for ADD / DBL
     Kummer-HECC is more efficient than ECC [Renes et al., 2016]:
     - ARM Cortex M0: up to 75% clock cycle reduction for signatures
     - AVR ATmega: up to 32% clock cycle reduction for Diffie-Hellman
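To get a feel for how halving the element size offsets the larger operation counts, the table can be turned into a rough cost estimate. The sketch below rests on assumptions that are not in the slides: multiplication cost is taken to grow with the square of the element size, a square is counted like a multiplication by default, one step is taken to be one ADD plus one DBL, and the single Kummer entry is treated as the cost of one combined step. The names `COSTS` and `step_cost` are hypothetical.

```python
from fractions import Fraction

# (relative element size, list of (M count, S count)) per scheme.
# Operation counts come from the table; the relative sizes use
# l_HECC = l_ECC / 2, which the slide states only approximately.
COSTS = {
    "ECC": (Fraction(1), [(12, 2), (7, 3)]),
    "HECC": (Fraction(1, 2), [(40, 4), (38, 6)]),
    "Kummer": (Fraction(1, 2), [(19, 12)]),
}


def step_cost(name: str, s_over_m: Fraction = Fraction(1)) -> Fraction:
    """Cost of one ADD + DBL step, in units of one full-size ECC multiplication.

    ASSUMPTIONS (not from the slides): quadratic multiplication cost in the
    element size, and one square costing s_over_m multiplications.
    """
    rel_size, ops = COSTS[name]
    mults = sum(m + s * s_over_m for m, s in ops)
    return mults * rel_size ** 2
```

Under these assumptions one step costs 24 units for ECC, 22 for HECC and 7.75 for Kummer. The figures only indicate a trend and say nothing about real hardware.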

  8. Curve-Level Operations in Kummer
     - No ADD operation, but DBL is still available
     - Differential addition: xADD(±P, ±Q, ±(P − Q)) → ±(P + Q)
     - xADD and DBL can be combined: xDBLADD(±P, ±Q, ±(P − Q)) → (±[2]P, ±(P + Q))
     For details see [Renes et al., 2016], [Gaudry, 2007] and [Bos et al., 2016].

  9. xDBLADD $\mathbb{F}_P$ Operations
     [Figure: dataflow graph of the $\mathbb{F}_P$ operations inside xDBLADD. Inputs are marked var (variable) or cst (constant); internal nodes are labeled a, s, M and S; outputs are marked OUT.]

  10. Scalar Multiplication
      Montgomery ladder based crypto_scalarmult [Renes et al., 2016]:
        Require: m-bit scalar $k = \sum_{i=0}^{m-1} 2^i k_i$, point $P_b$, $cst \in \mathbb{F}_P^4$
        Ensure: $V_1 = [k]P_b$, $V_2 = [k+1]P_b$
        V_1 ← cst
        V_2 ← P_b
        for i = m − 1 downto 0 do
          (V_1, V_2) ← CSWAP(k_i, (V_1, V_2))
          (V_1, V_2) ← xDBLADD(V_1, V_2, P_b)
          (V_1, V_2) ← CSWAP(k_i, (V_1, V_2))
        end for
        return (V_1, V_2)
      CSWAP(k_i, (X, Y)) returns (X, Y) if k_i = 0, else (Y, X).
      - Constant time, uniform operations (independent of key bits)
      - Some parallelism between the internal $\mathbb{F}_P$ operations of xDBLADD
      - CSWAP: very simple but involves secret bits (to be protected)
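The control flow of the ladder can be exercised with a toy model. The sketch below is not a Kummer implementation: it assumes a stand-in group (integers modulo a hypothetical `N` under addition) so that xDBLADD reduces to a doubling and an addition, and the identity 0 plays the role of cst. Python integers are not constant time, so the masked `cswap` only illustrates the branch-free structure.

```python
from typing import Tuple

N = 1_000_003  # ASSUMED toy group: integers modulo N under addition


def cswap(bit: int, x: int, y: int) -> Tuple[int, int]:
    """Return (x, y) if bit == 0, else (y, x), without branching on bit."""
    mask = -bit  # 0 stays 0, 1 becomes -1 (all ones)
    diff = (x ^ y) & mask
    return x ^ diff, y ^ diff


def xdbladd(v1: int, v2: int, pb: int) -> Tuple[int, int]:
    """Toy stand-in returning ([2]V1, V1 + V2).

    On the Kummer surface the difference pb is needed by the formulas;
    in this toy group it is unused.
    """
    return (2 * v1) % N, (v1 + v2) % N


def scalarmult(k: int, m: int, pb: int) -> Tuple[int, int]:
    """Ladder over the m bits of k; returns ([k]pb, [k+1]pb)."""
    v1, v2 = 0, pb % N  # 0 is the identity, standing in for cst
    for i in range(m - 1, -1, -1):
        k_i = (k >> i) & 1
        v1, v2 = cswap(k_i, v1, v2)
        v1, v2 = xdbladd(v1, v2, pb)
        v1, v2 = cswap(k_i, v1, v2)
    return v1, v2
```

Every iteration runs the same three calls whatever the value of `k_i`, which is the uniformity property listed on the slide.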

  12. Montgomery Modular Multiplication (MMM)
      Objective: $A \times B \bmod P$
      - Proposed in [Montgomery, 1985]
      - Variants are the current state of the art for $\mathbb{F}_P$ multiplication (with generic P)
      Steps:
      - $R = A \times B$ ($n \times n \to 2n$ bits)
      - $q = (R \times (-P^{-1})) \bmod 2^n$ ($n \times n \to n$ bits)
      - $qP = q \times P$ ($n \times n \to 2n$ bits)
      - Final reduction step: discard the $n$ LSBs of $R + qP$ to obtain the result $S$
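The three products and the final shift translate directly into code. The following is a minimal sketch, assuming an odd modulus p below 2^n and operands already reduced modulo p; the function name `montgomery_multiply` is hypothetical, and the final conditional subtraction is the usual correction step rather than something shown on the slide.

```python
def montgomery_multiply(a: int, b: int, p: int, n: int) -> int:
    """Return a * b * 2**(-n) mod p, for odd p < 2**n and 0 <= a, b < p."""
    mask = (1 << n) - 1
    # -P^{-1} mod 2^n; a real design would precompute this constant.
    p_neg_inv = (-pow(p, -1, 1 << n)) & mask
    r = a * b                    # R = A x B, 2n bits
    q = (r * p_neg_inv) & mask   # q = R x (-P^{-1}) mod 2^n, n bits
    s = (r + q * p) >> n         # the n LSBs of R + qP are zero by construction
    return s - p if s >= p else s
```

The result carries an extra factor 2^(-n), so operands are normally kept in Montgomery form (multiplied by 2^n modulo p) throughout a computation.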

  13. Modular Multiplication: Dependencies Problem
      In practice, MMM is interleaved:
      - Operands are split into s words of w bits such that n = s × w
      - Iterations over partial products and reductions on words
      - Coarsely Integrated Operand Scanning (CIOS) from [Koç et al., 1996]
      Impact on hardware implementation:
      - Dependencies → latencies between internal iterations
      - Hardware pipeline in DSP slices cannot be filled efficiently
      Proposed solution: Hyper-Threaded Modular Multiplier (HTMM)
      - Based on the simple CIOS algorithm
      - Uses idle stages to compute other independent MMMs in parallel
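The dependency problem is easier to see in a word-serial version of MMM. The sketch below is a simplification in the spirit of CIOS, not the full algorithm: only operand A is scanned word by word, whereas CIOS proper also splits B and P into words. The defaults w = 34 and s = 4 are the parameter values quoted for HTMM in this deck; the function name `mmm_interleaved` is hypothetical.

```python
def mmm_interleaved(a: int, b: int, p: int, w: int = 34, s: int = 4) -> int:
    """Return a * b * 2**(-s*w) mod p, consuming a one w-bit word per iteration.

    Assumes p odd, p < 2**(s*w) and 0 <= a, b < p.
    """
    word_mask = (1 << w) - 1
    p_neg_inv = (-pow(p, -1, 1 << w)) & word_mask  # -P^{-1} mod 2^w
    acc = 0
    for i in range(s):
        a_i = (a >> (i * w)) & word_mask
        t = acc + a_i * b                                # needs acc from the previous iteration
        q_i = ((t & word_mask) * p_neg_inv) & word_mask  # needs the low word of t
        acc = (t + q_i * p) >> w                         # needs q_i
    return acc - p if acc >= p else acc
```

Each iteration cannot start before the previous one has produced `acc`, and within an iteration the three lines form a chain. That is the latency a deep hardware pipeline has to wait out when it processes a single multiplication.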

  14. HTMM Internal Architecture
      HTMM architecture: 3 hardware stages
      - Stages are fully pipelined (several clock cycles per stage)
      - 3 to 4 DSP slices in each stage
      [Figure: three-stage datapath fed by operands $A_i$ and $B$. Stage 1 computes $t = A_i B + S$; stage 2 derives $q_i$ from $t_0$; stage 3 updates $S$ from $t$ and $q_i$.]

  15. HTMM Internal Architecture (continued)
      [Figure: pipeline timing diagram. Operand pairs A(0), B(0) through A(5), B(5) are fed in over time. Each of stages 1 to 3 cycles through multiplications 0, 1, 2 four times before moving on to 3, 4, 5, with each stage shifted in time relative to the previous one. Results P(0), P(1), P(2) are output.]
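The interleaving idea can be modeled in software with one generator per multiplication and a round-robin scheduler. This is a behavioral sketch only, under stated assumptions: it reuses a simplified word-serial MMM (A scanned in w-bit words, w = 34 and s = 4), it does not model clock cycles or DSP pipeline depth, and the names `mmm_steps` and `hyper_threaded` are hypothetical.

```python
from typing import Iterator, List, Sequence, Tuple

W, S_WORDS = 34, 4  # word size and word count quoted for HTMM in this deck


def mmm_steps(a: int, b: int, p: int) -> Iterator[int]:
    """One word-serial MMM as a generator, yielding after each iteration."""
    mask = (1 << W) - 1
    p_neg_inv = (-pow(p, -1, 1 << W)) & mask
    acc = 0
    for i in range(S_WORDS):
        a_i = (a >> (i * W)) & mask
        t = acc + a_i * b
        q_i = ((t & mask) * p_neg_inv) & mask
        acc = (t + q_i * p) >> W
        yield acc


def hyper_threaded(jobs: Sequence[Tuple[int, int]], p: int) -> Tuple[List[int], List[int]]:
    """Advance independent multiplications in round-robin order.

    Returns the reduced results and the order in which jobs were issued.
    """
    threads = [mmm_steps(a, b, p) for a, b in jobs]
    last = [0] * len(threads)
    order: List[int] = []
    for _ in range(S_WORDS):
        for tid, thread in enumerate(threads):
            last[tid] = next(thread)  # job tid advances while the others wait
            order.append(tid)
    return [x - p if x >= p else x for x in last], order
```

With three jobs the issue order is 0, 1, 2 repeated four times, the same pattern as the stage rows in the timing diagram: while one multiplication waits for its own previous iteration, the slots are filled by the other two.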

  16. HTMM Implementations
      Xilinx FPGAs:
      - Virtex 4 XC4VLX100 (V4)
      - Virtex 5 XC5VLX110T (V5)
      - Spartan 6 XC6SLX75 (S6)
      Comparison with the fastest MMM implementation in the literature:
      - Design presented in [Ma et al., 2013]
      - Implemented on the same FPGAs for fair comparison
      2 versions of HTMM:
      - HTMM DRAM: operands stored in FPGA slices (LUTs)
      - HTMM BRAM: operands stored in FPGA BRAMs
      Parameters for HTMM:
      - P: 128 bits
      - w = 34 bits, s = 4
      - Operand size n = s × w = 136 bits

  17. HTMM Implementations Results
      Results for 3 independent multiplications:

      | Version | FPGA | DSP | BRAM 18K/9K | FF | LUT | Slices | Freq. (MHz) | Nb. cycles | Time (ns) |
      |---------|------|-----|-------------|----|-----|--------|-------------|------------|-----------|
      | [Ma et al., 2013] | V4 | 21 | 6/0 | 1311 | 1201 | 879 | 252 | 65 | 258 |
      | [Ma et al., 2013] | V5 | 21 | 6/0 | 1310 | 1027 | 406 | 296 | 65 | 220 |
      | [Ma et al., 2013] | S6 | 21 | 0/6 | 1280 | 1600 | 540 | 210 | 65 | 309 |
      | HTMM DRAM | V4 | 11 | 0/0 | 1638 | 1128 | 1346 | 330 | 79 | 239 |
      | HTMM DRAM | V5 | 11 | 0/0 | 1616 | 652 | 517 | 400 | 79 | 198 |
      | HTMM DRAM | S6 | 11 | 0/0 | 1631 | 1344 | 483 | 302 | 79 | 261 |
      | HTMM BRAM | V4 | 11 | 2/0 | 615 | 364 | 449 | 328 | 79 | 241 |
      | HTMM BRAM | V5 | 11 | 2/0 | 593 | 371 | 249 | 357 | 79 | 221 |
      | HTMM BRAM | S6 | 11 | 0/2 | 587 | 359 | 180 | 304 | 79 | 260 |

      S6 (HTMM BRAM against [Ma et al., 2013]): -47% DSPs, -66% BRAMs, -66% slices, -15% duration
      For a single M, HTMM is less efficient (69 cycles against 25)
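The reported times and percentages can be cross-checked from the other columns. The sketch below assumes that time is simply cycles divided by frequency and that the percentages on the slide are truncated rather than rounded; the helper names `time_ns` and `reduction_pct` are hypothetical. Recomputed times agree with the reported ones to within 1 ns, which is plausibly due to the frequencies being rounded to whole MHz.

```python
# (frequency in MHz, cycles for 3 multiplications, reported time in ns),
# copied from the results table.
ROWS = {
    ("Ma et al.", "V4"): (252, 65, 258),
    ("Ma et al.", "V5"): (296, 65, 220),
    ("Ma et al.", "S6"): (210, 65, 309),
    ("HTMM DRAM", "V4"): (330, 79, 239),
    ("HTMM DRAM", "V5"): (400, 79, 198),
    ("HTMM DRAM", "S6"): (302, 79, 261),
    ("HTMM BRAM", "V4"): (328, 79, 241),
    ("HTMM BRAM", "V5"): (357, 79, 221),
    ("HTMM BRAM", "S6"): (304, 79, 260),
}


def time_ns(freq_mhz: float, cycles: int) -> float:
    """Duration in nanoseconds of `cycles` clock cycles at `freq_mhz`."""
    return cycles * 1000 / freq_mhz


def reduction_pct(reference: float, new: float) -> float:
    """Relative saving of `new` against `reference`, in percent."""
    return 100 * (reference - new) / reference


# Spartan 6, HTMM BRAM against the reference design:
savings = {
    "DSPs": reduction_pct(21, 11),
    "BRAMs": reduction_pct(6, 2),
    "slices": reduction_pct(540, 180),
    "duration": reduction_pct(309, 260),
}
```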
