 
FourQ on FPGA: New Hardware Speed Records for Elliptic Curve Cryptography over Large Prime Characteristic Fields
K. Järvinen (1), A. Miele (2), R. Azarderakhsh (3), and P. Longa (4)
(1) Aalto University, (2) Intel Corporation, (3) Rochester Institute of Technology, (4) Microsoft Research
Contact: kimmo.jarvinen@aalto.fi, plonga@microsoft.com
CHES 2016, Santa Barbara, CA, USA, August 17–19, 2016
Introduction
FourQ:
◮ FourQ is a high-performance elliptic curve with very good SW performance (2–3× faster than Curve25519)
◮ FourQ has been shown to offer the fastest scalar multiplications on a wide range of software platforms:
    ◮ On several 32-bit ARM microarchitectures (SAC 2016)
    ◮ On several 64-bit Intel/AMD processors, low and high-end (ASIACRYPT 2015)
◮ FourQ employs four-dimensional scalar decompositions, requires extensive precomputation, complex control, etc.
⇒ Not clear how well it suits HW implementation
FourQ on FPGA CHES 2016 2/17
Introduction
Contributions:
◮ The first FPGA-based implementations of FourQ
◮ FourQ offers 2–2.5× faster performance than Curve25519
◮ Speed-area tradeoff is the primary optimization goal
◮ Protected against timing and SPA attacks
◮ We present three implementations: single-core, multi-core, and Montgomery ladder variant
FourQ (Costello, Longa, ASIACRYPT’15)
$E/\mathbb{F}_{p^2} : -x^2 + y^2 = 1 + d x^2 y^2$
◮ Twisted Edwards curve with $\#E(\mathbb{F}_{p^2}) = 392 \cdot \xi$ where $\xi$ is a 246-bit prime
◮ Defined over $\mathbb{F}_{p^2}$ with the Mersenne prime $p = 2^{127} - 1$
◮ Complete addition formulas over extended twisted Edwards coordinates (Hisil et al. ASIACRYPT’08)
◮ Two efficiently-computable endomorphisms $\psi$ and $\phi$
◮ Four-dimensional decomposition for the 256-bit scalar $m$ with $(a_1, a_2, a_3, a_4)$ such that $a_i \in [0, 2^{64})$:
$[m]P = [a_1]P + [a_2]\psi(P) + [a_3]\phi(P) + [a_4]\psi(\phi(P))$
Scalar Multiplication
Input: Point $P$, integer $m \in [0, 2^{256})$
Output: $[m]P$
1  Decompose and recode $m$
2  Precompute lookup table $T$
3  $Q \leftarrow T[v_{64}]$
4  for $i = 63$ to $0$ do
5      $Q \leftarrow [2]Q$
6      $Q \leftarrow Q + m_i T[v_i]$
Scalar Multiplication: Scalar decompose and recode (step 1)
◮ Decompose to a multi-scalar $(a_1, a_2, a_3, a_4)$
◮ Sign-aligned so that $a_1[j] \in \{\pm 1\}$ and $a_i[j] \in \{0, a_1[j]\}$ for $2 \le i \le 4$
◮ Recode to signs $m_i \in \{-1, 1\}$ and values $v_i \in [0, 7]$ (point index)
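The sign-aligned recoding described on this slide can be sketched in Python as follows. This is a minimal illustration, not the authors' hardware or reference code: the function name `recode` is hypothetical, the sketch assumes $a_1$ is odd and all scalars are below $2^{64}$, and the order in which the digits of $a_2, a_3, a_4$ are packed into bits 0, 1, 2 of $v_i$ is an assumption.

```python
def recode(a1, a2, a3, a4, length=64):
    """Sign-aligned recoding sketch (hypothetical helper).

    Returns (signs, values), each with length+1 entries, index 0 being the
    least significant digit. Assumes a1 is odd and all scalars < 2**length.
    """
    if a1 % 2 == 0:
        raise ValueError("this sketch assumes a1 is odd")
    rest = [a2, a3, a4]
    signs, values = [], []
    for i in range(length):
        # Digit of a1 is +1 if bit i+1 of a1 is set, otherwise -1.
        s = 2 * ((a1 >> (i + 1)) & 1) - 1
        v = 0
        for k in range(3):
            bit = rest[k] & 1          # digit magnitude: 0 or 1
            v |= bit << k              # assumed packing order of the index
            # The digit is s*bit; subtract it and halve (division is exact).
            rest[k] = (rest[k] - s * bit) // 2
        signs.append(s)
        values.append(v)
    if any(r > 1 for r in rest):
        raise ValueError("scalars must be below 2**length")
    # The top digit of a1 is always +1; the others are the leftover bits.
    signs.append(1)
    values.append(rest[0] | (rest[1] << 1) | (rest[2] << 2))
    return signs, values
```

Each scalar can be recovered as the sum of `signs[i] * bit * 2**i`, which is what makes every loop iteration use exactly one signed table entry.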
Scalar Multiplication: Precomputation (step 2)
◮ Precompute 8 points: $T[u] = P + [u_0]\phi(P) + [u_1]\psi(P) + [u_2]\psi(\phi(P))$ for $u = (u_2, u_1, u_0) \in [0, 7]$
◮ Store them with 5 coordinates $(X+Y,\; Y-X,\; 2Z,\; 2dT,\; -2dT)$ ⇒
    $+T[u]$: $(X+Y,\; Y-X,\; 2Z,\; 2dT)$
    $-T[u]$: $(Y-X,\; X+Y,\; 2Z,\; -2dT)$
◮ 68M + 27S and several additions
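The five-coordinate storage means a signed table entry is obtained by selecting coordinates rather than by computing a negation. A small Python sketch of that selection, assuming for simplicity that coordinates are plain integers modulo $p$ (the real coordinates live in $\mathbb{F}_{p^2}$); `table_entry` and `select` are hypothetical names:

```python
P_MOD = 2**127 - 1  # the Mersenne prime used by FourQ

def table_entry(X, Y, Z, T, d):
    """Store a point as (X+Y, Y-X, 2Z, 2dT, -2dT) modulo p."""
    two_dt = (2 * d * T) % P_MOD
    return ((X + Y) % P_MOD, (Y - X) % P_MOD, (2 * Z) % P_MOD,
            two_dt, (-two_dt) % P_MOD)

def select(entry, sign):
    """Return the 4 coordinates of +T[u] (sign = 1) or -T[u] (sign = -1)."""
    x_plus_y, y_minus_x, two_z, two_dt, neg_two_dt = entry
    if sign == 1:
        return (x_plus_y, y_minus_x, two_z, two_dt)
    # Negating a twisted Edwards point maps X -> -X and T -> -T, which
    # swaps the first two coordinates and picks the stored -2dT.
    return (y_minus_x, x_plus_y, two_z, neg_two_dt)
```

A constant-time implementation would replace the `if` with a branch-free select; the branch is kept here only for readability.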
Scalar Multiplication: Main for-loop (steps 4 to 6)
◮ Fully regular and constant-time
◮ Only 64 double-and-adds
◮ Doubling: $(X, Y, Z, T_a, T_b) \leftarrow (X, Y, Z)$
◮ Addition: $(X, Y, Z, T_a, T_b) \leftarrow (X, Y, Z, T_a, T_b) \times (X+Y,\; Y-X,\; 2Z,\; 2dT)$
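The regular structure of the main loop can be illustrated with a toy model in which "points" are plain integers, doubling is multiplication by 2, and point addition is integer addition. This is only a sketch of the control flow under those stand-in assumptions, not curve arithmetic; `scalar_mul_loop` is a hypothetical name.

```python
def scalar_mul_loop(table, signs, values):
    """Toy model of the main loop: one doubling and one signed table
    addition per iteration, with no data-dependent branches."""
    q = table[values[64]]                      # Q <- T[v_64]
    for i in range(63, -1, -1):
        q = 2 * q                              # Q <- [2]Q
        q = q + signs[i] * table[values[i]]    # Q <- Q + m_i T[v_i]
    return q
```

In this model the result equals the sum over $i$ of $m_i \cdot T[v_i] \cdot 2^i$ with $m_{64} = 1$, which is the quantity the recoded digits are designed to represent.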
General Architecture
Scalar Decomposition and Recoding Unit
◮ Decomposes and recodes the scalar
◮ Mainly multiplications with constants
Field Arithmetic Unit (“the core”)
◮ Precomputation and the main for-loop
◮ Highly optimized for $\mathbb{F}_p$ with the Mersenne prime
Scalar Unit
◮ Decomposition is computed with a truncated multiplier (mainly multiplications with constants)
◮ The main component is a 17 × 264-bit row multiplier built by using 11 DSPs
◮ Recoding is bit manipulations and 64-bit additions
◮ Outputs $(m_0, v_0)$ first, scalar multiplication begins with $(m_{64}, v_{64})$ ⇒ Store in a LIFO buffer
[Figure: scalar unit datapath with inputs X (264 bits) and Y (195 bits), a 17 × 264-bit multiplier, a 281-bit adder, an FSM, and 64-bit outputs Z_H and Z_L]
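A row multiplier processes one operand in narrow slices and accumulates shifted partial products. The Python sketch below shows that idea with 17-bit rows; it computes the full product and does not model the truncation used in the actual unit. The function name `row_multiply` is hypothetical, and the 264-bit and 195-bit operand widths used in any example are taken from the figure labels as an assumption.

```python
def row_multiply(x, y, row_bits=17):
    """Multiply x by y one row (row_bits bits of y) at a time."""
    mask = (1 << row_bits) - 1
    acc = 0
    shift = 0
    while y:
        row = y & mask               # next 17-bit slice of y
        acc += (x * row) << shift    # one narrow-by-wide partial product
        y >>= row_bits
        shift += row_bits
    return acc
```

With a 195-bit second operand this takes 12 rows, since 12 × 17 = 204 ≥ 195.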
Field Arithmetic Unit
[Figure: block diagram with interface logic (commands, responses, 64-bit di and do ports), dual-port RAM, control, and datapath]
◮ Dual-port RAM: 256 × 127-bit RAM (128 $\mathbb{F}_{p^2}$ elements), 4 BRAMs
◮ Datapath: 127-bit, optimized for $p = 2^{127} - 1$
◮ Control: FSM + Program ROM (6 BRAMs)
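One reason a datapath can be specialized for $p = 2^{127} - 1$ is that $2^{127} \equiv 1 \pmod p$, so reduction needs only masks, shifts, and additions. The following Python sketch shows this property in software; it is not a model of the slide's hardware, and `reduce_mersenne` is a hypothetical name.

```python
P_MOD = (1 << 127) - 1  # Mersenne prime p = 2^127 - 1

def reduce_mersenne(x):
    """Reduce 0 <= x < 2^254 modulo p without division.

    Folding the high 127 bits onto the low 127 bits is valid because
    2^127 is congruent to 1 modulo p.
    """
    x = (x & P_MOD) + (x >> 127)   # result is at most 2^128 - 2
    x = (x & P_MOD) + (x >> 127)   # result is at most p
    return 0 if x == P_MOD else x  # p itself represents zero
```

The input bound covers any product of two reduced field elements, so two folds and one final correction are always enough.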
Field Arithmetic Unit: Datapath
[Figure: datapath with two highlighted paths. Multiplier path: pipelined 64 × 64-bit multiplier. Adder path: 127-bit adder/subtractor (+/−). Signals a, b, c, and r.]
Example: Multiplication in $\mathbb{F}_{p^2}$
3 multiplications, 2 additions and 3 subtractions in $\mathbb{F}_p$:
$a \times b = (a_0, a_1) \times (b_0, b_1) = (a_0 \cdot b_0 - a_1 \cdot b_1,\; (a_0 + a_1) \cdot (b_0 + b_1) - a_0 \cdot b_0 - a_1 \cdot b_1)$
[Figure: schedule of the $\mathbb{F}_{p^2}$ multiplication over the multiplier pipeline, adders, dual-port RAM, and input registers, shown in animation steps 1 to 9]
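The operation count of the $\mathbb{F}_{p^2}$ multiplication formula (3 multiplications, 2 additions, 3 subtractions in $\mathbb{F}_p$) can be checked with a short Python sketch. Elements are represented as pairs $(a_0, a_1)$ of integers modulo $p$; the function name `fp2_mul` is hypothetical, and the sketch ignores pipelining and scheduling entirely.

```python
P_MOD = 2**127 - 1  # Mersenne prime p = 2^127 - 1

def fp2_mul(a, b):
    """Multiply (a0, a1) by (b0, b1) using three F_p multiplications."""
    a0, a1 = a
    b0, b1 = b
    sa = (a0 + a1) % P_MOD          # addition 1
    sb = (b0 + b1) % P_MOD          # addition 2
    t0 = (a0 * b0) % P_MOD          # multiplication 1
    t1 = (a1 * b1) % P_MOD          # multiplication 2
    t2 = (sa * sb) % P_MOD          # multiplication 3
    c0 = (t0 - t1) % P_MOD          # subtraction 1
    c1 = (t2 - t0) % P_MOD          # subtraction 2
    c1 = (c1 - t1) % P_MOD          # subtraction 3
    return (c0, c1)
```

The second component equals $a_0 b_1 + a_1 b_0$, so the three-multiplication form gives the same result as the four-multiplication schoolbook product.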