FourQ on FPGA: New Hardware Speed Records for Elliptic Curve Cryptography over Large Prime Characteristic Fields
K. Järvinen (1), A. Miele (2), R. Azarderakhsh (3), and P. Longa (4)
(1) Aalto University, (2) Intel Corporation, (3) Rochester Institute of Technology, (4) Microsoft Research
Contact: kimmo.jarvinen@aalto.fi, plonga@microsoft.com
CHES 2016, Santa Barbara, CA, USA, August 17–19, 2016

Introduction
FourQ:
◮ FourQ is a high-performance elliptic curve with very good SW performance (2–3× faster than Curve25519)
◮ FourQ has been shown to offer the fastest scalar multiplications on a wide range of software platforms:
    ◮ On several 32-bit ARM microarchitectures (SAC 2016)
    ◮ On several 64-bit Intel/AMD processors, low and high-end (ASIACRYPT 2015)
◮ FourQ employs four-dimensional scalar decompositions, requires extensive precomputation, complex control, etc.
⇒ Not clear how well it is suited to HW implementation

Introduction
Contributions:
◮ The first FPGA-based implementations of FourQ
◮ FourQ offers 2–2.5× faster performance than Curve25519
◮ Speed-area tradeoff is the primary optimization goal
◮ Protected against timing and SPA attacks
◮ We present three implementations: single-core, multi-core, and Montgomery ladder variant

FourQ (Costello, Longa, ASIACRYPT'15)
$E/\mathbb{F}_{p^2} : -x^2 + y^2 = 1 + d x^2 y^2$
◮ Twisted Edwards curve with $\#E(\mathbb{F}_{p^2}) = 392 \cdot \xi$ where $\xi$ is a 246-bit prime
◮ Defined over $\mathbb{F}_{p^2}$ with the Mersenne prime $p = 2^{127} - 1$
◮ Complete addition formulas over extended twisted Edwards coordinates (Hisil et al. ASIACRYPT'08)
◮ Two efficiently-computable endomorphisms $\psi$ and $\phi$
◮ Four-dimensional decomposition for the 256-bit scalar $m$ with $(a_1, a_2, a_3, a_4)$ such that $a_i \in [0, 2^{64})$:
$[m]P = [a_1]P + [a_2]\psi(P) + [a_3]\phi(P) + [a_4]\psi(\phi(P))$

Scalar Multiplication
Input: Point $P$, integer $m \in [0, 2^{256})$
Output: $[m]P$
    Decompose and recode $m$
    Precompute lookup table $T$
    $Q \leftarrow T[v_{64}]$
    for $i = 63$ to $0$ do
        $Q \leftarrow [2]Q$
        $Q \leftarrow Q + m_i T[v_i]$

Scalar Multiplication: Scalar decompose and recode
◮ Decompose to a multi-scalar $(a_1, a_2, a_3, a_4)$
◮ Sign-aligned so that $a_1[j] \in \{\pm 1\}$ and $a_i[j] \in \{0, a_1[j]\}$ for $2 \le i \le 4$
◮ Recode to signs $m_i \in \{-1, 1\}$ and values $v_i \in [0, 7]$ (point index)
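The sign-aligned recoding can be illustrated with a small software sketch. The function names `recode` and `reconstruct`, the assumption that $a_1$ is already odd, and the bit order used to pack the three digits into $v_i$ are illustrative choices, not details taken from the slides.

```python
def recode(a1, a2, a3, a4, digits=65):
    """Sign-aligned recoding sketch; returns (signs, values), least significant digit first.

    Assumes a1 is odd and all four scalars fit in digits - 1 bits.
    """
    if a1 % 2 == 0:
        raise ValueError("a1 must be odd")
    if not all(0 <= a < (1 << (digits - 1)) for a in (a1, a2, a3, a4)):
        raise ValueError("scalars must fit in digits - 1 bits")
    rest = [a2, a3, a4]
    signs, values = [], []
    for i in range(digits - 1):
        # Every digit of a1 is nonzero: +1 if the next bit of a1 is set, else -1.
        s = 1 if (a1 >> (i + 1)) & 1 else -1
        v = 0
        for k in range(3):
            bit = rest[k] & 1              # digit magnitude for this scalar
            v |= bit << k                  # packing order is an assumption
            rest[k] = (rest[k] - bit * s) >> 1  # remove digit (0 or s), then halve
        signs.append(s)
        values.append(v)
    # The top digit is always positive; each remaining scalar is now 0 or 1.
    signs.append(1)
    values.append(rest[0] | (rest[1] << 1) | (rest[2] << 2))
    return signs, values


def reconstruct(signs, values):
    """Rebuild (a1, a2, a3, a4) from the recoded digits, for checking."""
    a1 = sum(s * (1 << i) for i, s in enumerate(signs))
    others = tuple(
        sum(s * ((v >> k) & 1) * (1 << i) for i, (s, v) in enumerate(zip(signs, values)))
        for k in range(3)
    )
    return (a1,) + others
```

Each position then carries one sign and one 3-bit index, which is what lets the main loop use a single table lookup per iteration.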

Scalar Multiplication: Precomputation
◮ Precompute 8 points: $T[u] = P + [u_0]\phi(P) + [u_1]\psi(P) + [u_2]\psi(\phi(P))$ for $u = (u_2, u_1, u_0) \in [0, 7]$
◮ Store them with 5 coordinates $(X+Y,\ Y-X,\ 2Z,\ 2dT,\ -2dT)$ ⇒
    $+T[u] : (X+Y,\ Y-X,\ 2Z,\ 2dT)$
    $-T[u] : (Y-X,\ X+Y,\ 2Z,\ -2dT)$
◮ 68M + 27S and several additions
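Why five stored coordinates are enough to serve both $+T[u]$ and $-T[u]$ can be checked with a short sketch. It relies on the fact that negating a twisted Edwards point in extended coordinates maps $(X, Y, Z, T)$ to $(-X, Y, Z, -T)$. The helper names `table_entry` and `select` are hypothetical, and for brevity the coordinates live in $\mathbb{F}_p$ rather than $\mathbb{F}_{p^2}$.

```python
P127 = (1 << 127) - 1  # base-field prime; real FourQ coordinates are in F_{p^2}


def table_entry(X, Y, Z, T, d, p=P127):
    """Five-coordinate table entry (X+Y, Y-X, 2Z, 2dT, -2dT)."""
    two_dT = 2 * d * T % p
    return ((X + Y) % p, (Y - X) % p, 2 * Z % p, two_dT, -two_dT % p)


def select(entry, sign):
    """Pick the four coordinates of +T[u] or -T[u] without any field arithmetic.

    Software sketch only: hardware would use a multiplexer, not a branch.
    """
    x_plus_y, y_minus_x, two_z, two_dt, neg_two_dt = entry
    if sign == 1:
        return (x_plus_y, y_minus_x, two_z, two_dt)
    # Negation swaps the first two coordinates and takes the stored -2dT.
    return (y_minus_x, x_plus_y, two_z, neg_two_dt)
```

Selecting the negative entry is a pure reordering of stored values, so the sign $m_i$ costs no extra field operations in the main loop.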

Scalar Multiplication: Main for-loop
◮ Fully regular and constant-time
◮ Only 64 double-and-adds
◮ Doubling: $(X, Y, Z, T_a, T_b) \leftarrow (X, Y, Z)$
◮ Addition: $(X, Y, Z, T_a, T_b) \leftarrow (X, Y, Z, T_a, T_b) \times (X+Y,\ Y-X,\ 2Z,\ 2dT)$
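The regular structure of the loop can be sketched on a toy group. Here the "points" are integers modulo a hypothetical `N` under addition, and the endomorphism images are simply passed in as group elements, so this shows only the control flow (one doubling and one signed table addition per digit), not curve arithmetic.

```python
N = (1 << 61) - 1  # toy group order, illustrative only


def build_table(P, phi_P, psi_P, psi_phi_P):
    """T[u] = P + [u0]phi(P) + [u1]psi(P) + [u2]psi(phi(P)) for u in [0, 7]."""
    table = []
    for u in range(8):
        u0, u1, u2 = u & 1, (u >> 1) & 1, (u >> 2) & 1
        table.append((P + u0 * phi_P + u1 * psi_P + u2 * psi_phi_P) % N)
    return table


def main_loop(table, signs, values):
    """Evaluate the recoded scalar; digits are least significant first.

    The top digit is assumed positive, so Q starts from a plain table lookup.
    """
    top = len(signs) - 1
    Q = table[values[top]]
    for i in range(top - 1, -1, -1):
        Q = (2 * Q) % N                             # doubling
        Q = (Q + signs[i] * table[values[i]]) % N   # signed table addition
    return Q
```

With 65 digits the loop body runs exactly 64 times regardless of the digit values, which is the property the slide calls fully regular.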

General Architecture
Scalar Decomposition and Recoding Unit
◮ Decomposes and recodes the scalar
◮ Mainly multiplications with constants
Field Arithmetic Unit (“the core”)
◮ Precomputation and the main for-loop
◮ Highly optimized for $\mathbb{F}_p$ with the Mersenne prime

Scalar Unit
◮ Decomposition is computed with a truncated multiplier (mainly multiplications with constants)
◮ The main component is a 17 × 264-bit row multiplier built by using 11 DSPs
◮ Recoding is bit manipulations and 64-bit additions
◮ Outputs $(m_0, v_0)$ first, scalar multiplication begins with $(m_{64}, v_{64})$ ⇒ Store in a LIFO buffer
[Figure: scalar unit block diagram with inputs X and Y, an FSM, the 17 × 264-bit multiplier with accumulator, and outputs Z_H and Z_L]
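A row multiplier of this kind can be modelled in software as a loop that consumes one operand 17 bits at a time and accumulates shifted partial rows. The sketch below rests on assumptions: the slides do not say which operand is sliced, the function name `row_multiply` is hypothetical, and the truncation step of the actual unit is not modelled.

```python
ROW_BITS = 17  # width of one operand slice per step


def row_multiply(x, y, x_bits):
    """Compute x * y by consuming x in 17-bit rows, least significant row first."""
    mask = (1 << ROW_BITS) - 1
    acc = 0
    for shift in range(0, x_bits, ROW_BITS):
        chunk = (x >> shift) & mask   # next 17-bit slice of x
        acc += (chunk * y) << shift   # one row: 17-bit by full-width product
    return acc
```

Each iteration corresponds to one 17-bit by full-width product, so the number of steps grows with the width of the sliced operand only.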

Field Arithmetic Unit
[Figure: block diagram with interface logic (commands, responses, 64-bit di and do ports), dual-port RAM, control, and datapath]
◮ 256 × 127-bit dual-port RAM (128 $\mathbb{F}_{p^2}$ elements), 4 BRAMs
◮ 127-bit datapath, optimized for $p = 2^{127} - 1$
◮ Control: FSM + program ROM (6 BRAMs)
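The reason a datapath can be specialised for $p = 2^{127} - 1$ is that $2^{127} \equiv 1 \pmod p$, so reduction needs only shifts and additions. A minimal software sketch of that folding follows; the name `reduce_mersenne` is illustrative, and nothing here describes how the hardware schedules the operation.

```python
P127 = (1 << 127) - 1


def reduce_mersenne(x):
    """Reduce 0 <= x < 2^254 modulo 2^127 - 1 by folding the high bits onto the low bits."""
    if not 0 <= x < (1 << 254):
        raise ValueError("input must be a non-negative value below 2^254")
    x = (x & P127) + (x >> 127)   # first fold: result is at most 2^128 - 2
    x = (x & P127) + (x >> 127)   # second fold: result is at most 2^127 - 1
    return 0 if x == P127 else x  # map the non-canonical value p to 0
```

The bound of $2^{254}$ covers any product of two reduced field elements, so two folds and one final comparison always suffice.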

Field Arithmetic Unit: Datapath
[Figure: 127-bit datapath with two paths. Multiplier path: pipelined 64 × 64-bit multiplier with accumulation. Adder path: 127-bit adder/subtractor]
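One plausible way to build a 127-bit field multiplication from a 64 × 64-bit multiplier is schoolbook decomposition into four partial products. The slides do not spell out the schedule, so the split into a 64-bit low half and a 63-bit high half, and the names `mul64` and `fp_mul`, are assumptions. The final `%` stands in for the shift-and-add Mersenne reduction.

```python
P127 = (1 << 127) - 1
MASK64 = (1 << 64) - 1


def mul64(x, y):
    """Model of a 64 x 64-bit multiplier; rejects wider operands."""
    assert 0 <= x <= MASK64 and 0 <= y <= MASK64
    return x * y


def fp_mul(a, b):
    """Multiply two elements of F_p, p = 2^127 - 1, using four 64-bit partial products."""
    a_lo, a_hi = a & MASK64, a >> 64   # 64-bit low half, 63-bit high half
    b_lo, b_hi = b & MASK64, b >> 64
    ll = mul64(a_lo, b_lo)
    lh = mul64(a_lo, b_hi)
    hl = mul64(a_hi, b_lo)
    hh = mul64(a_hi, b_hi)
    wide = ll + ((lh + hl) << 64) + (hh << 128)  # full 254-bit product
    return wide % P127  # stand-in for the hardware's folding reduction
```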

Example: Multiplication in $\mathbb{F}_{p^2}$
3 multiplications, 2 additions and 3 subtractions in $\mathbb{F}_p$:
$a \times b = (a_0, a_1) \times (b_0, b_1) = (a_0 \cdot b_0 - a_1 \cdot b_1,\ (a_0 + a_1) \cdot (b_0 + b_1) - a_0 \cdot b_0 - a_1 \cdot b_1)$
[Figure: cycle-by-cycle schedule of the operation across the dual-port RAM, input registers, multiplier pipeline, and adders]
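The formula above can be checked against the four-multiplication schoolbook product. This is a minimal sketch that represents an element of $\mathbb{F}_{p^2}$ as a pair $(a_0, a_1)$ meaning $a_0 + a_1 i$ with $i^2 = -1$; the function names are illustrative.

```python
P127 = (1 << 127) - 1


def fp2_mul(a, b, p=P127):
    """Multiply in F_{p^2} with 3 multiplications, 2 additions and 3 subtractions in F_p."""
    a0, a1 = a
    b0, b1 = b
    t0 = a0 * b0 % p                    # multiplication 1
    t1 = a1 * b1 % p                    # multiplication 2
    sa = (a0 + a1) % p                  # addition 1
    sb = (b0 + b1) % p                  # addition 2
    t2 = sa * sb % p                    # multiplication 3
    c0 = (t0 - t1) % p                  # subtraction 1
    c1 = ((t2 - t0) % p - t1) % p       # subtractions 2 and 3
    return (c0, c1)


def fp2_mul_schoolbook(a, b, p=P127):
    """Reference product with 4 multiplications, used only for comparison."""
    a0, a1 = a
    b0, b1 = b
    return ((a0 * b0 - a1 * b1) % p, (a0 * b1 + a1 * b0) % p)
```

The saving comes from the identity $(a_0 + a_1)(b_0 + b_1) - a_0 b_0 - a_1 b_1 = a_0 b_1 + a_1 b_0$, which trades one multiplication for cheaper additions and subtractions.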
