
Hyper-Threaded Multiplier for HECC
Gabriel GALLIN and Arnaud TISSERAND
CNRS – Lab-STICC – IRISA, HAH Project
Asilomar, Oct. 2017

Public-Key Cryptography (PKC)
◮ Provides cryptographic primitives such as digital signature, key exchange and specific encryption schemes
◮ First PKC standard: RSA
  - Keys of ≥ 2000 bits recommended today
  - Too costly for embedded applications
◮ Elliptic Curve Cryptography (ECC):
  - Better performance and lower cost than RSA
  - Allows more advanced schemes
◮ Hyper-Elliptic Curve Cryptography (HECC):
  - Evolution of ECC covering larger sets of curves
  - Expected to have a lower cost than ECC

ECC, HECC, Kummer-HECC

| Curve | $\mathbb{F}_P$ element size | ADD | DBL | Source |
|---|---|---|---|---|
| ECC | $\ell_{\mathrm{ECC}}$ | 12 M + 2 S | 7 M + 3 S | [1] |
| HECC | $\ell_{\mathrm{HECC}} \approx \frac{1}{2}\ell_{\mathrm{ECC}}$ | 40 M + 4 S | 38 M + 6 S | [5] |
| Kummer | $\ell_{\mathrm{HECC}}$ | 19 M + 12 S (ADD and DBL combined) | | [8] |

M: multiplication, S: squaring in the field $\mathbb{F}_P$.

◮ ECC:
  - $\mathbb{F}_P$ elements are 2 × larger
  - Simpler ADD and DBL operations
◮ HECC:
  - Smaller $\mathbb{F}_P$ elements
  - More $\mathbb{F}_P$ operations per ADD / DBL
◮ Kummer-HECC is more efficient than ECC [8]:
  - ARM Cortex M0: up to 75% reduction in clock cycles for signatures
  - AVR ATmega: up to 32% reduction in clock cycles for Diffie-Hellman

Curve-Level Operations in Kummer
◮ No ADD operation, but DBL is still available
◮ Differential addition: $\mathrm{xADD}(\pm P, \pm Q, \pm(P - Q)) \to \pm(P + Q)$
◮ xADD and DBL can be combined: $\mathrm{xDBLADD}(\pm P, \pm Q, \pm(P - Q)) \to (\pm[2]P, \pm(P + Q))$
For details, see [8], [3] and [2].

xDBLADD $\mathbb{F}_P$ Operations
[Figure: data-flow graph of the $\mathbb{F}_P$ operations in xDBLADD. Variable (var) and constant (cst) inputs feed additions (a), subtractions (s), multiplications (M) and squarings (S), arranged in two rounds of four parallel lanes that each produce an output (OUT); 19 M + 12 S in total.]

Scalar Multiplication
Montgomery-ladder-based crypto_scalarmult [8]:
Require: $m$-bit scalar $k = \sum_{i=0}^{m-1} 2^i k_i$, point $P_b$, $\mathrm{cst} \in \mathbb{F}_P^4$
Ensure: $V_1 = [k]P_b$, $V_2 = [k+1]P_b$
  $V_1 \gets \mathrm{cst}$
  $V_2 \gets P_b$
  for $i = m-1$ downto $0$ do
    $(V_1, V_2) \gets \mathrm{CSWAP}(k_i, (V_1, V_2))$
    $(V_1, V_2) \gets \mathrm{xDBLADD}(V_1, V_2, P_b)$
    $(V_1, V_2) \gets \mathrm{CSWAP}(k_i, (V_1, V_2))$
  end for
  return $(V_1, V_2)$
$\mathrm{CSWAP}(k_i, (X, Y))$ returns $(X, Y)$ if $k_i = 0$, else $(Y, X)$
◮ Constant time, uniform operations (independent of the key bits)
◮ Some parallelism between the internal $\mathbb{F}_P$ operations of xDBLADD
◮ CSWAP: very simple, but involves secret bits (must be protected)
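
The ladder above can be sketched in Python as follows. This is a minimal illustration, not the µKummer implementation: the hypothetical `toy_xdbladd` works in the additive group of integers modulo a small N instead of on the Kummer surface, so only the control structure is shown (two CSWAPs around each xDBLADD, with the invariant $V_2 - V_1 = P_b$). The masked `cswap` mirrors the branch-free idea, but Python integers give no constant-time guarantee.

```python
from typing import Callable, Tuple

Point = Tuple[int, ...]  # hypothetical: a point as a tuple of F_P coordinates
XDblAdd = Callable[[Point, Point, Point], Tuple[Point, Point]]


def cswap(bit: int, x: Point, y: Point) -> Tuple[Point, Point]:
    """Return (x, y) if bit == 0, else (y, x), using masking instead of a branch.

    Python integers are not constant time; this only mirrors the structure
    of a constant-time CSWAP.
    """
    diff = [bit * (a ^ b) for a, b in zip(x, y)]  # 0 if bit == 0, a ^ b if bit == 1
    return (tuple(a ^ d for a, d in zip(x, diff)),
            tuple(b ^ d for b, d in zip(y, diff)))


def ladder(k: int, m: int, pb: Point, cst: Point, xdbladd: XDblAdd) -> Tuple[Point, Point]:
    """Montgomery ladder: returns ([k]Pb, [k+1]Pb) for an m-bit scalar k."""
    v1, v2 = cst, pb
    for i in range(m - 1, -1, -1):
        ki = (k >> i) & 1
        v1, v2 = cswap(ki, v1, v2)
        v1, v2 = xdbladd(v1, v2, pb)  # (V1, V2) <- ([2]V1, V1 + V2)
        v1, v2 = cswap(ki, v1, v2)
    return v1, v2


# Toy stand-in group (NOT Kummer arithmetic): integers mod N under addition.
# "Doubling" is 2*x, "addition" is x + y; the difference argument is unused.
N = 1009


def toy_xdbladd(v1: Point, v2: Point, diff: Point) -> Tuple[Point, Point]:
    return ((2 * v1[0]) % N,), ((v1[0] + v2[0]) % N,)


print(ladder(13, 4, (7,), (0,), toy_xdbladd))  # ((91,), (98,))
```

Keeping `xdbladd` as a parameter separates the secret-dependent control (the two CSWAPs) from the field arithmetic, which is the part a hardware coprocessor would accelerate.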

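Montgomery Modular Multiplication (MMM)
◮ Objective: $A \times B \bmod P$ (in Montgomery form, the output is $S \equiv A \times B \times 2^{-n} \pmod P$)
  - $R = A \times B$ ($n \times n \to 2n$ bits)
  - $q = (R \times (-P^{-1})) \bmod 2^n$ ($n \times n \to n$ bits)
  - $qP = q \times P$ ($n \times n \to 2n$ bits)
  - $S = (R + qP) / 2^n$: the final reduction step discards the $n$ LSBs
◮ Proposed in [7]
◮ Variants are the current state of the art for $\mathbb{F}_P$ multiplication (with generic $P$)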
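
A direct, non-interleaved version of these steps, assuming $P$ odd, $P < 2^n$ and inputs already reduced modulo $P$. The helper names (`mmm`, `to_mont`, `from_mont`) and the example modulus $2^{127}-1$ are illustrative choices, not taken from the slides; the conditional subtraction at the end is the usual step that brings the result below $P$.

```python
def mmm(a: int, b: int, p: int, n: int) -> int:
    """Montgomery modular multiplication: returns a*b*2^(-n) mod p.

    Assumes p odd, p < 2^n and 0 <= a, b < p (Python 3.8+ for pow(x, -1, m)).
    """
    r_mod = 1 << n                            # Montgomery radix 2^n
    p_neg_inv = (-pow(p, -1, r_mod)) % r_mod  # -P^{-1} mod 2^n
    r = a * b                                 # R = A x B (n x n -> 2n bits)
    q = (r * p_neg_inv) % r_mod               # q = R(-P^{-1}) mod 2^n (n bits)
    s = (r + q * p) >> n                      # R + qP = 0 mod 2^n: discard n LSBs
    return s - p if s >= p else s             # s < 2P, one subtraction suffices


def to_mont(x: int, p: int, n: int) -> int:
    """Map x to Montgomery form x * 2^n mod p."""
    return (x << n) % p


def from_mont(x: int, p: int, n: int) -> int:
    """Leave Montgomery form: an MMM by 1 removes the 2^n factor."""
    return mmm(x, 1, p, n)


p, n = 2**127 - 1, 136        # example odd modulus; n = s * w = 4 * 34
a, b = 123456789, 987654321
prod = from_mont(mmm(to_mont(a, p, n), to_mont(b, p, n), p, n), p, n)
print(prod == (a * b) % p)    # True
```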

Modular Multiplication: Dependency Problem
◮ In practice, MMM is interleaved:
  - Operands are split into $s$ words of $w$ bits such that $n = s \times w$
  - Iterations over word-level partial products and reductions
  - Coarsely Integrated Operand Scanning (CIOS) from [4]
◮ Impact on hardware implementation:
  - Dependencies → latencies between internal iterations
  - The hardware pipeline in DSP slices cannot be filled efficiently
◮ Proposed solution: Hyper-Threaded Modular Multiplier (HTMM)
  - Based on the simple CIOS algorithm
  - Uses idle pipeline stages to compute other independent MMMs in parallel
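
A word-level CIOS sketch following the structure described in [4], with $s$ words of $w$ bits (e.g. $w = 34$, $s = 4$ as in the HTMM parameters). It is a software model only: it shows where the dependency sits (each $q_i$ needs the freshly updated $t_0$), not how HTMM maps the loops onto DSP slices. The function names are hypothetical.

```python
from typing import List


def to_words(x: int, w: int, s: int) -> List[int]:
    """Split x into s little-endian words of w bits."""
    return [(x >> (w * j)) & ((1 << w) - 1) for j in range(s)]


def cios(a: int, b: int, p: int, w: int, s: int) -> int:
    """CIOS Montgomery multiplication: a*b*2^(-s*w) mod p on w-bit words.

    Assumes p odd, p < 2^(s*w) and 0 <= a, b < p.
    """
    W = 1 << w
    mask = W - 1
    A, B, P = to_words(a, w, s), to_words(b, w, s), to_words(p, w, s)
    p0_inv = (-pow(P[0], -1, W)) % W  # -P^{-1} mod 2^w (word-level constant)
    t = [0] * (s + 2)
    for i in range(s):
        # Multiplication step: t += A * B[i]
        c = 0
        for j in range(s):
            x = t[j] + A[j] * B[i] + c
            t[j], c = x & mask, x >> w
        x = t[s] + c
        t[s], t[s + 1] = x & mask, x >> w
        # Reduction step: q depends on the t[0] just computed; this is the
        # dependency that leaves pipeline stages idle for a single MMM
        q = (t[0] * p0_inv) & mask
        c = (t[0] + q * P[0]) >> w  # low word is zero by construction
        for j in range(1, s):
            x = t[j] + q * P[j] + c
            t[j - 1], c = x & mask, x >> w  # result shifted down by one word
        x = t[s] + c
        t[s - 1] = x & mask
        t[s] = t[s + 1] + (x >> w)
    res = sum(t[j] << (w * j) for j in range(s + 1))
    return res - p if res >= p else res  # final conditional subtraction
```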

HTMM Internal Architecture
◮ HTMM architecture: 3 hardware stages
  - Stages are fully pipelined (several clock cycles per stage)
  - 3 to 4 DSP slices in each stage
[Figure: STAGE 1 computes $t = A_i B + S$, STAGE 2 derives $q_i$ from $t_0$, STAGE 3 updates $S$ from $t$ and $q_i$. Timeline: operands $A^{(0)}, B^{(0)}, \dots, A^{(5)}, B^{(5)}$ of independent multiplications enter STAGE 1 in round-robin order (0 1 2 0 1 2 0 1 2 0 1 2, then 3 4 5 ...); STAGE 2 and STAGE 3 follow one slot behind each other, and the results $P^{(0)}, P^{(1)}, P^{(2)}$ come out at the end.]
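
To see why interleaving helps, here is a deliberately simplified cycle model (an assumption, not the HTMM timing): one cycle per stage, one new iteration accepted per cycle, and each iteration of a multiplication waiting until its previous iteration has left the last stage. The function `pipeline_cycles` is hypothetical.

```python
def pipeline_cycles(num_threads: int, iters: int, depth: int = 3) -> int:
    """Cycles to finish `num_threads` independent multiplications, each made of
    `iters` dependent iterations, on a `depth`-stage pipeline (simplified model:
    one cycle per stage, at most one new iteration issued per cycle)."""
    next_ready = [0] * num_threads  # earliest cycle each thread may issue again
    left = [iters] * num_threads    # iterations still to issue per thread
    cycle = finish = 0
    while any(left):
        for t in range(num_threads):  # lowest-index ready thread issues
            if left[t] and next_ready[t] <= cycle:
                left[t] -= 1
                next_ready[t] = cycle + depth  # wait for this iteration's result
                finish = cycle + depth
                break                          # one issue per cycle
        cycle += 1
    return finish


sequential = 3 * pipeline_cycles(1, 4)  # three MMMs one after another
interleaved = pipeline_cycles(3, 4)     # three MMMs sharing the pipeline
print(sequential, interleaved)          # 36 14
```

In this model the issue order naturally becomes round-robin (0 1 2 0 1 2 ...), and a single multiplication keeps only one stage busy per cycle; real stage latencies and the measured cycle counts differ.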

HTMM Internal Architecture (details)
[Figure: DSP-slice datapath of the HTMM stages. 34-bit words are split into 17-bit halves ([16:0] and [33:17]); DSP slices are chained through PCIN/PCOUT cascades with 17-bit right wire shifts, combining $A_i$, $B_j$, $q_i$, $P_j$ and $P'_0$ with the accumulators $t_j$, $R_j$, $M_j$, and producing $S_j[33{:}0]$ as output.]
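
The 17-bit splitting in the detailed datapath relies on the identity below: a 34-bit word is handled as two 17-bit halves ([16:0] and [33:17]), and the partial products are recombined with 17-bit shifts. This sketch shows only the arithmetic; it does not model the DSP cascade (PCIN/PCOUT) or the exact slice assignment, and the function name is hypothetical.

```python
HALF = 17
MASK17 = (1 << HALF) - 1


def mul34_by_halves(a: int, b: int) -> int:
    """34 x 34 -> 68-bit product from four 17 x 17 partial products.

    Assumes 0 <= a, b < 2^34.
    """
    a0, a1 = a & MASK17, a >> HALF  # a[16:0], a[33:17]
    b0, b1 = b & MASK17, b >> HALF  # b[16:0], b[33:17]
    low = a0 * b0                   # weight 2^0
    mid = a0 * b1 + a1 * b0         # weight 2^17
    high = a1 * b1                  # weight 2^34
    return low + (mid << HALF) + (high << (2 * HALF))
```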

HTMM Implementations
◮ Xilinx FPGAs:
  - Virtex 4 XC4VLX100 (V4)
  - Virtex 5 XC5VLX110T (V5)
  - Spartan 6 XC6SLX75 (S6)
◮ Comparison with the fastest MMM implementation in the literature:
  - Design presented in [6]
  - Implemented on the same FPGAs for a fair comparison
◮ 2 versions of HTMM:
  - HTMM DRAM: operands stored in FPGA slices (LUTs)
  - HTMM BRAM: operands stored in FPGA BRAMs
◮ Parameters for HTMM:
  - $P$: 128 bits
  - $w = 34$ bits, $s = 4$
  - Operand size $n = s \times w = 136$ bits

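HTMM Implementation Results
Results for 3 independent multiplications:

| Unit | FPGA | DSP | BRAM 18K/9K | FF | LUT | Slices | Freq. (MHz) | Nb. cycles | Time (ns) |
|---|---|---|---|---|---|---|---|---|---|
| [6] | V4 | 21 | 6/0 | 1311 | 1201 | 879 | 252 | 65 | 258 |
| [6] | V5 | 21 | 6/0 | 1310 | 1027 | 406 | 296 | 65 | 220 |
| [6] | S6 | 21 | 0/6 | 1280 | 1600 | 540 | 210 | 65 | 309 |
| HTMM DRAM | V4 | 11 | 0/0 | 1638 | 1128 | 1346 | 330 | 79 | 239 |
| HTMM DRAM | V5 | 11 | 0/0 | 1616 | 652 | 517 | 400 | 79 | 198 |
| HTMM DRAM | S6 | 11 | 0/0 | 1631 | 1344 | 483 | 302 | 79 | 261 |
| HTMM BRAM | V4 | 11 | 2/0 | 615 | 364 | 449 | 328 | 79 | 241 |
| HTMM BRAM | V5 | 11 | 2/0 | 593 | 371 | 249 | 357 | 79 | 221 |
| HTMM BRAM | S6 | 11 | 0/2 | 587 | 359 | 180 | 304 | 79 | 260 |

HTMM BRAM vs [6] on S6: -47% DSPs, -66% BRAMs, -66% slices, -15% duration.
For a single multiplication, HTMM is less efficient (69 cycles against 25).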
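
The percentages in the S6 summary line can be recomputed from the S6 rows (HTMM BRAM against [6]); truncating toward zero reproduces the figures on the slide. The helper name `delta_pct` is hypothetical.

```python
import math


def delta_pct(new: float, ref: float) -> int:
    """Relative change of `new` w.r.t. `ref`, in percent, truncated toward zero."""
    return math.trunc(100 * (new - ref) / ref)


# Spartan 6 rows: [6] vs HTMM BRAM (values copied from the results table)
ref = {"DSP": 21, "BRAM": 6, "Slices": 540, "Time_ns": 309}
htmm = {"DSP": 11, "BRAM": 2, "Slices": 180, "Time_ns": 260}
print({k: delta_pct(htmm[k], ref[k]) for k in ref})
# {'DSP': -47, 'BRAM': -66, 'Slices': -66, 'Time_ns': -15}
```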

Typical Architecture Model
[Figure: a program memory and a global control unit drive a data memory and the arithmetic units (ADD/SUB, MULTIPLIER, CSWAP), each with an output register (OReg), connected through data MUX/DMUX and control (Ctrl) signals.]
Parameters specified at design time:
  - Width $w$ and number of words $s$ for internal communications ($s \times w = n$)
  - Types and number of units
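
The design-time parameters can be captured in a small configuration object. This is a hypothetical sketch: the class name `ArchParams`, the `fits` check (a modulus must fit in $s$ words of $w$ bits) and the unit counts are illustrative, not taken from the authors' tools.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class ArchParams:
    """Hypothetical container for the design-time parameters of the architecture."""
    w: int  # word width of internal communications (bits)
    s: int  # number of words per F_P element
    units: Dict[str, int] = field(default_factory=dict)  # unit type -> count

    @property
    def n(self) -> int:
        return self.s * self.w  # operand size n = s * w

    def fits(self, p_bits: int) -> bool:
        """True if a p_bits-bit modulus fits in s words of w bits."""
        return p_bits <= self.n


# Illustrative values: w and s match the HTMM parameters, unit counts are made up
params = ArchParams(w=34, s=4, units={"ADD/SUB": 1, "MULTIPLIER": 1, "CSWAP": 1})
print(params.n, params.fits(128))  # 136 True
```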

256b ECC vs 128b HECC (similar theoretical security)

| FPGA | Version | DSP | BRAM 18K | Slices | Freq. (MHz) | Nb. cycles | Time (ms) |
|---|---|---|---|---|---|---|---|
| V4 | ECC | 37 | 11 | 4655 | 250 | 109,297 | 0.44 |
| V4 | H1 | 11 | 7 | 1413 | 330 | 183,051 | 0.55 |
| V4 | H2 | 22 | 9 | 2356 | 330 | 115,211 | 0.35 |
| V5 | ECC | 37 | 10 | 1725 | 291 | 109,297 | 0.38 |
| V5 | H1 | 11 | 7 | 873 | 360 | 183,051 | 0.51 |
| V5 | H2 | 22 | 9 | 1542 | 360 | 115,211 | 0.32 |

Gain of H1 on V5: -70% DSPs, -30% BRAMs, -49% slices, +30% duration
Gain of H2 on V5: -40% DSPs, -10% BRAMs, -10% slices, -15% duration
ECC results from [6].
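
Each time value in the table follows from time = number of cycles / clock frequency. A minimal check on the V5 rows (the helper `time_ms` is hypothetical):

```python
def time_ms(cycles: int, freq_mhz: float) -> float:
    """Computation time in ms from a cycle count and a clock frequency in MHz."""
    return cycles / (freq_mhz * 1e3)  # cycles / (f in kHz) = milliseconds


for name, cycles, f in [("ECC", 109_297, 291), ("H1", 183_051, 360), ("H2", 115_211, 360)]:
    print(name, round(time_ms(cycles, f), 2))
# ECC 0.38
# H1 0.51
# H2 0.32
```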

Conclusions and Perspectives
◮ HTMM is more efficient than the state of the art for 3 independent MMMs:
  - leads to better area / computation time trade-offs
  - more hardwired resources are active at each clock cycle
◮ µKummer-based HECC is an efficient alternative to ECC:
  - More complex formulas, but more internal parallelism
  - Large exploration space for architectures and arithmetic
◮ Future work:
  - Study other HTMM versions
  - Study the impact of hyper-threaded schemes on energy consumption
  - Study the impact of hyper-threaded schemes on side-channel leakage

References
[1] D. J. Bernstein and T. Lange. Explicit-formulas database. http://hyperelliptic.org/EFD/.
[2] Joppe W. Bos, Craig Costello, Huseyin Hisil, and Kristin Lauter. Fast cryptography in genus 2. Journal of Cryptology, 29(1):28–60, January 2016.
[3] Pierrick Gaudry. Fast genus 2 arithmetic based on theta functions. Journal of Mathematical Cryptology, 1(3):243–265, 2007.
[4] Çetin K. Koç, Tolga Acar, and Burton S. Kaliski, Jr. Analyzing and comparing Montgomery multiplication algorithms. IEEE Micro, 16(3):26–33, June 1996.
[5] T. Lange. Formulae for Arithmetic on Genus 2 Hyperelliptic Curves. Applicable Algebra in Engineering, Communication and Computing, 15(5):295–328, February 2005.
[6] Yuan Ma, Zongbin Liu, Wuqiong Pan, and Jiwu Jing. A high-speed elliptic curve cryptographic processor for generic curves over GF(p). In Proc. 20th International Workshop on Selected Areas in Cryptography (SAC), volume 8282 of LNCS, pages 421–437. Springer, August 2013.
[7] Peter L. Montgomery. Modular multiplication without trial division. Mathematics of Computation, 44(170):519–521, April 1985.
[8] Joost Renes, Peter Schwabe, Benjamin Smith, and Lejla Batina. µKummer: Efficient hyperelliptic signatures and key exchange on microcontrollers. In Proc. Workshop on Cryptographic Hardware and Embedded Systems (CHES), volume 9813 of LNCS, pages 301–320. Springer, August 2016.
